Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: Contract/Collapse Combination


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: Contract/Collapse Combination
Date   Tue, 22 May 2012 21:55:35 +0100

Thanks for these extra comments. I will focus on those I think I
understand. I did misread you on 10^15 or so. The best way to get a
unique identifier for cross-combinations of variables is something
like

bysort <varlist> : gen long id = _n == 1
replace id = sum(id)

which is canned as -egen-'s -group()- function.

The limits on size of variables' values can't bite here, as the number
of cross-combinations in the data can't exceed the number of
observations.
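
As a minimal sketch of the point above, the following builds the identifier both ways on a toy dataset. The variable names x1 and x2 and their values are hypothetical, and declaring the identifier as -long- is an added assumption, made so that counts stay exact beyond the 16,777,216 consecutive integers a -float- can hold.

```stata
clear
input double(x1 x2)
123456789 1
123456789 1
123456790 2
123456791 2
end

* Flag the first observation of each distinct (x1, x2) combination,
* then take a running sum so groups are numbered 1, 2, 3, ...
bysort x1 x2 : gen long id = _n == 1
replace id = sum(id)

* The same identifier from -egen-'s -group()- function
egen long id2 = group(x1 x2)

* The two identifiers agree, and neither can exceed the number of observations
assert id == id2
assert id <= _N
list
```

The identifier depends only on how many distinct combinations occur, not on how large the values of the underlying variables are.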

Nick

On Tue, May 22, 2012 at 9:15 PM, Lucas <[email protected]> wrote:
> I guess I too did not clarify everything because I hoped attention
> would focus on the problem I identified, not my reasons for why
> workarounds don't work.  But, to clarify:
>
> The 15 variable ID is not 10^15.  The confusion stems from reading my
> note as concerning a 15 DIGIT identifier.  However, I did not
> reference a 15-digit identifier, I referenced an identifier made out
> of the *collective* digits of 15 variables.  Some of the fifteen
> variables are continuous, which means they have lots of categories,
> which means they account for lots of digits.  Thus, expanding to make
> an id out of them *will* exceed the size of the largest number allowed
> in Stata.  I do not maintain this is a general reality; it is a
> reality in my data.
>
> Second, it is not a bug if a user asks their machine to do something
> with stata but the user has insufficient memory to do the task.  I've
> seen that issue discussed here, and I explicitly asked that stata add
> an ability to use the disk as RAM, and when I asked they declined.
> Their thinking was it would be too slow.  My thinking was slow beats
> impossible.  But, I am not a decision-maker at stata, so I respect
> their decision, their allocation of programming effort, and live with
> the consequences.  Why this seems troubling to someone, I am not sure.
>
> I saw your post about joint frequencies, and will try the solution.  I
> have not had a chance because I am running something else on stata at
> the moment.
>
> Thanks a bunch.
> Sam
>
> On Tue, May 22, 2012 at 10:07 AM, Nick Cox <[email protected]> wrote:
>> I am finding it very difficult to work out what you are seeking in this thread.
>>
>> First, it really wasn't clear to me from your post that you fully understood the precision problem. Your explanation for why the 15-digit identifier didn't work is below. Here it is again: "it will not work for 15 variables of various types, because the id# will exceed the largest value allowed in stata". But that is wrong, as 10^15 is certainly allowed in Stata. I didn't correct that explicitly, but I pointed to the deeper question of precision, which I guessed was at the root of what you were trying.
>>
>> Second, I understood you earlier as implying that -contract- can not produce reproducible results. Now you seem to imply that this can't be a bug. I'm lost here.
>>
>> BTW, I made a suggestion in an earlier post that you don't need StataCorp or anybody else to hit -contract-. You just need to apply -contract- to get joint frequencies, and then everything you want is implicit in that reduced dataset.
>>
>> Nick
>> [email protected]
>>
>> Lucas
>>
>> Nick,
>>
>> A composite 6-digit identifier is not a problem.  I indicated I did
>> not think it possible to make such an identifier for each cell of
>> 15-way crosstab.  So, we are not disagreeing.
>>
>> I don't think contract is buggy.  I think a simple (conceptually,
>> perhaps not computer "programmingly") extension of contract to allow
>> multiple (or at least 2) frequency counts seems a good idea if
>> possible, and consistent with the stata-proposed solution of
>> addressing slow estimation on big data with collapsing data and using
>> frequency counts.
>>
>> I won't alert stata--they are listening anyway, and they can easily
>> come back at me and say I should get more memory.  And, of course, I'd
>> agree.  But, still, we'd be left with a command seemingly within
>> whispering distance of providing a general solution to a common task,
>> but not going that final distance.
>>
>> Thanks, though.
>> Sam
>>
>> On Tue, May 22, 2012 at 9:37 AM, Nick Cox <[email protected]> wrote:
>>> The solution here of producing a composite identifier looks likely to fail. You are putting a very big number into a -float- variable and expect to retain every last bit of precision. See
>>>
>>> http://blog.stata.com/2012/04/02/the-penultimate-guide-to-precision/
>>>
>>> for why that is a bad idea.
>>>
>>> As for the rest, you seem to be claiming that -contract- is buggy. That is important if true, and you should send in a report containing incontrovertible evidence to Stata tech-support.
>>>
>>> Nick
>>> [email protected]
>>>
>>> Lucas
>>>
>>> Brendan,
>>>
>>> My original note indicated exactly the solution you propose, of doing
>>> it twice and merging.  But this is incredibly risky, because there is
>>> no way to assure every combination appears in both files.  Even the
>>> "zero" option apparently cannot assure this.  Believe me, I tried this
>>> with about 6 variables, and the file sizes do not equate across
>>> runs--not to mention that one has to be pretty certain everything is
>>> sorted exactly right.  I do not know *why* the problem occurred, but it
>>> did occur; perhaps the file is so big that problems
>>> emerge that do not exist for smaller datasets (e.g., sorted cases fall
>>> out of sorts, as it were).
>>>
>>> At any rate, my response was to make an id based on the 6 variables:
>>>
>>> gen id=(x1*100000)+(x2*10000)+. . .+(x6) ;
>>>
>>> This works for 6 dichotomous variables; it will not work for 15
>>> variables of various types, because the id# will exceed the largest
>>> value allowed in stata.
>>>
>>> THUS, it seems a more general solution is needed, that does not
>>> require a later merge.
>>>
>>> As for your collapse example, it is unclear, as you start with data
>>> that is already collapsed.  The problem is the data is not collapsed,
>>> and the aim is to get it into the collapsed form.
>>>
>>> On Tue, May 22, 2012 at 7:50 AM, Brendan Halpin <[email protected]> wrote:
>>>> On Tue, May 22 2012, Lucas wrote:
>>>>
>>>>> Is there a way to use the contract command and obtain frequencies for
>>>>> TWO variables rather than just ONE?  A corollary question would be, Is
>>>>> there a way to use the contract command and obtain the count of 1's on
>>>>> TWO separate dichotomous variables?
>>>>
>>>> That is what my example achieves, though using -collapse- instead of
>>>> -contract-.
>>>>
>>>> Another way of doing it would be to separate the data by entercol, and
>>>> -contract- or -collapse- it twice, once for entercol==1 and once for
>>>> entercol==0, and then merge the resulting files by the 15 crosstab
>>>> variables.
>>>
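
The request quoted above (a cell frequency plus the count of 1s on two dichotomous variables, for every combination of the crosstab variables) can be sketched in a single -collapse- call, with no later merge. This is only an illustration under assumptions: entercol is named in the thread, but gradcol, x1, x2, the helper variable one, and all data values are hypothetical.

```stata
clear
input byte(x1 x2 entercol gradcol)
0 0 1 0
0 0 1 1
0 1 0 0
0 1 1 1
1 1 0 0
end

* Helper variable so that each cell's frequency can be counted
gen byte one = 1

* One row per (x1, x2) cell: number of 1s on each dichotomy, plus cell size
collapse (sum) n_entercol = entercol n_gradcol = gradcol ///
         (count) freq = one, by(x1 x2)

* Three distinct cells occur in the toy data
assert _N == 3
list
```

Because both sums and the frequency come from the same pass over the data, every cell that appears has all three counts, and there is no second file whose cells need to match.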
