
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: upper limit on fweights? overflowing into missing values?


From   Richard Williams <[email protected]>
To   [email protected], [email protected]
Subject   Re: st: upper limit on fweights? overflowing into missing values?
Date   Thu, 01 Aug 2013 22:23:18 -0400

At 05:13 PM 8/1/2013, László Sándor wrote:
> Thanks, Richard.
>
> An "observation" was one security for one year. Total holdings of each
> would have been the weights. I wanted to calculate how one data
> source's prices were correlated with another. Lines/observations with
> larger holdings were more important for me in this regard, but as the
> larger holdings don't imply a more precise "average price" coming from
> a larger sample, these still don't strike me as a case for analytical
> weights. So I thought -fweights- would do the trick for -pwcorr-.
>
> Again, I don't find it helpful to think of each dollar in these assets
> being individual observations or not. To me this is a straw man.
>
> But yes, there are assets with more than 2 billion held in them, many.
>
> And I still think my confusion and ask for help was legitimate even if
> the calculation obviously works after a conversion into millions (and
> some rounding, which of course shows why "millions" were not really
> the unit/level of observation here). Getting missing values back
> confused me, as I thought there were some missing prices somewhere
> (sure there are) and maybe -casewise- was not doing its job or I
> missed something. I still don't follow if you think it is obvious that
> a user should have known that this will only work in millions, so
> Stata does not warn them.

Sorry, I didn't mean to imply that you should have known this. It took me several minutes to figure out what I think is going on. But in fairness to StataCorp, I think it is a highly unusual problem, so it doesn't surprise me they haven't done more about it or been more explicit in their documentation. And also in fairness to StataCorp, -help limits- does say that the number of observations is limited to around 2 billion. Given that fweights tell you the number of duplicated observations, it doesn't surprise me that the limit for the number of fweighted cases is the same. But I certainly agree that just adding a sentence to that effect in the help for fweights would be a lot easier than expecting people to piece things together on their own.
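
A minimal sketch of how one might check this before running a weighted command. It assumes, as guessed in this thread rather than documented, that the ceiling on fweighted cases is the 2,147,483,647 observation limit from -help limits-. The data are the two-row example posted earlier in the thread, and the scalar name total_cases is made up for illustration.

```stata
* Two-row example from earlier in this thread.
* The weights are read as double because 2147483621 is above the largest
* value a -long- can hold (2,147,483,620).
clear
input y x double fw
2 1 2147483621
1 2 2147483621
end

* The number of fweighted cases is the sum of the weights.
quietly summarize fw, meanonly
scalar total_cases = r(sum)
display "fweighted cases: " %15.0fc total_cases

* Assumption: the observation limit is also the ceiling for fweighted cases.
if total_cases > 2147483647 {
    display as error "sum of fweights exceeds 2,147,483,647; consider rescaling"
}
```

Here the two weights sum to 4,294,967,242, so the warning is displayed.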

> I hope this is settled, I am glad StataCorp will help future users
> with similarly underdeveloped intuition for implicit limits and thus
> the right scale for the unit of observation?

I have to admit, I still don't understand why you think fweights are appropriate, nor am I clear yet on what the fweights are. If total holdings = dollar value of the security, then I don't know why dollars would be viewed as duplicated observations. I would use fweights if I could in fact create a separate record for each fweighted case (but don't need to because some cases have the exact same values). But the fact that I don't understand it doesn't mean it is wrong; it may just mean that I don't understand it!

It sort of sounds to me like you want to use iweights. The help says "iweights, or importance weights, are weights that indicate the 'importance' of the observation in some vague sense." So, maybe you think that richer portfolios should carry more weight than smaller portfolios. But, rightly or wrongly, -pwcorr- won't let you use iweights. -regress- does, but it has the same 2 billion case limit. If you just want to get at relative importance, then dividing portfolio dollars by 1 million would seem to do the trick.
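
The rescaling could be sketched as follows, again using the two-row example posted earlier in the thread. The variable names holdings and fw_mill are hypothetical, and rounding to whole millions is an assumption made here because fweights must be integers.

```stata
* Holdings in dollars, read as double so the large values are stored exactly.
clear
input y x double holdings
2 1 2147483621
1 2 2147483621
end

* Rescale dollars to millions and round to integers for use as fweights.
generate long fw_mill = round(holdings / 1e6)

* The weighted cases now sum to 4,294 rather than roughly 4.3 billion.
pwcorr y x [fw = fw_mill]
```

One caveat with this approach: any observation with holdings under 500,000 would round to a weight of zero and so drop out of the calculation, so the unit of rescaling needs to suit the smallest holdings in the data.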

But, "right scale for the unit of observation" seems like an odd term to me, or else something that researchers have to figure out on their own. What is the unit of analysis? If I have California and Wyoming in my sample (the states with the largest and smallest populations in the USA) does my N = 2? Or do I weight each state by its population size? Or what? It depends on the nature of my data and the problem I am trying to address. I think in most cases I would say my N is 2 (hopefully I have some other states in there too!) but I would include things like population size as control variables in an analysis.

Anyway it is an interesting problem. Not one that I expect to encounter myself unless I switch to genetics or start analyzing the number of molecules in the Universe. But there are bound to be some people who need to deal with monster Ns, and it will be nice if Stata can accommodate them.

> Laszlo

On Thu, Aug 1, 2013 at 4:26 PM, Richard Williams
<[email protected]> wrote:
> I still don't understand what the fweights are supposed to represent, i.e.
> what is an observation in these data? If it is the dollar value of the
> portfolio, you could simply measure the value in millions of dollars rather
> than dollars. Or, if it is the number of shares of stock, it could be
> measured in 1000s of shares rather than shares. If you can be clear on what
> an observation is that might help.
>
> Like Nick, I could also see where aweights might be right. According to the
> docs, "Analytic aweights are typically appropriate when you are dealing with
> data containing averages. For instance, you have average income and average
> characteristics on a group of people.  The weighting variable contains the
> number of persons over which the average was calculated (or a number
> proportional to that amount)." aweights might be used for things like states
> of the United States. Or maybe even countries. The world population is in
> the 7 billion range, so if you had one record per country with things like,
> say, average income, the aweights could be the population size. Stata should
> be able to handle that fine even though it couldn't handle 7 billion
> records for every person in the world. If, say, you have 100 portfolios with
> a total of 100 billion shares of stock, and for each portfolio you have the
> average value of a share of stock, along with other characteristics of the
> portfolio (e.g. how managed) aweights would sound right to me.
>
> I agree that the documentation should be better and I am glad that Stata
> says it is going to work on it. But, this seems like a wildly esoteric
> problem to me. How many people have 4 billion cases? I don't think many do.
> I can see how this one has slipped through the cracks for the 25+ years
> Stata has been around.
>
> And in this case, I am not sure that you have 4 billion cases either. Again,
> if you can clarify what an observation is, that may help. If it is something
> like dollar value, that doesn't really strike me as being cases, but even if
> it is it seems easy enough to rescale into millions or thousands or whatever
> in order to make the problem manageable.
>
>
>
> At 01:35 PM 8/1/2013, László Sándor wrote:
>>
>> Thanks, Nick.
>>
>> Then maybe I have a terrible understanding of what aweights are. My
>> larger portfolios are not simply more precisely priced, they are,
>> well, larger. I think that enters a pwcorr calculation differently,
>> though maybe not.
>>
>> On semantics: I think an observation is anchored in the actual data in
>> Stata. But whether the weighting is sensible should not depend on
>> whether my dollar-by-dollar comparison uses larger numbers than an
>> investor-by-investor comparison. And I definitely disagree with the
>> notion that the current (undocumented) limits are fine because no one
>> would have this many "observations." Yes, no one would have this many
>> lines in Stata, but fweights are exactly there to talk about larger
>> populations than the aggregates in the data, and the dollar values can
>> easily get this large, even without "genetics." I would push back on
>> monetary amounts not being populations/observations so it is fine that
>> Stata silently overflows if it encounters them.
>>
>> So let's root for more documentation soon.
>>
>> On Tue, Jul 30, 2013 at 8:54 AM, Nick Cox <[email protected]> wrote:
>> > On the contrary, it seems to me that "what is an observation?" is more
>> > than semantic here: it is the nub of the issue!
>> >
>> > It's your problem but this sounds to me like a case for analytic
>> > weights. The use of frequency weights is also suspect unless the
>> > weights are integers (without artifice or rounding).
>> >
>> > As I've said or implied in earlier posts, this all should be a bit
>> > better documented.
>> > Nick
>> > [email protected]
>> >
>> >
>> > On 30 July 2013 13:34, László Sándor <[email protected]> wrote:
>> >> Thanks, Richard.
>> >>
>> >> Stata tech support got back to me and suggested something similar:
>> >> that some operations with fweights do overflow with such large
>> >> weights, others don't. I am not sure whether we shall call it
>> >> hard-coded as a restriction on some number somewhere or simply the C
>> >> implementation of -mf_quadcross- or something.
>> >>
>> >> I think I tried to describe my use case: I wanted to calculate stats
>> >> on portfolios, and it makes sense to weight by the size of them. As
>> >> pwcorr does not allow iweights, and pweights and aweights do something
>> >> completely different, I thought I'd use fweights. It blows up unless I
>> >> rescale the portfolios into thousands, millions or billions.
>> >>
>> >> Not a big deal, but Stata's (non-existent) error message, help and
>> >> documentation were not exactly helpful in resolving this. StataCorp
>> >> says they will address this.
>> >>
>> >> I think what an observation is is a semantic issue here, not very
>> >> helpful. Is an entire portfolio "one observation" or a single share in
>> >> each, or each dollar behind each? I am not sure this should matter
>> >> either for us or for Stata.
>> >>
>> >> Best,
>> >>
>> >> Laszlo
>> >>
>> >> On Mon, Jul 29, 2013 at 9:53 AM, Richard Williams
>> >> <[email protected]> wrote:
>> >>> Just to sum up my current thinking/guesses on this:
>> >>>
>> >>> * the maximum number of observations in Stata is 2,147,483,647
>> >>> * Nonetheless, fweighted data sets can have more observations than that
>> >>> * However, not all routines will work when the fweighted data has more
>> >>>   than 2,147,483,647 cases. You can do some simple descriptive things,
>> >>>   but you can't do more complicated things like regression or
>> >>>   correlations.
>> >>> * As to why that is, I am guessing that some routines have the
>> >>>   2,147,483,647 limit hardcoded in. Or, maybe there just isn't enough
>> >>>   precision to handle calculations when the N is larger than that.
>> >>> * Given that most people don't have more than 2,147,483,647 cases (and
>> >>>   even if they did, their computer memory couldn't handle them)
>> >>>   StataCorp probably hasn't spent a lot of time worrying about this.
>> >>> * Still, an added sentence or two in the fweights documentation or
>> >>>   elsewhere warning about limits might be a good idea.
>> >>>
>> >>> I am curious what the original author is doing that requires analyzing 4
>> >>> billion+ cases. Some sort of genetic research maybe? I've certainly never
>> >>> heard of any kind of Survey research having an N that large.
>> >>>
>> >>>
>> >>>
>> >>> At 06:53 PM 7/28/2013, Nick Cox wrote:
>> >>>>
>> >>>> This is interesting, but in principle I don't see that Stata's limit
>> >>>> on # of observations has any bearing on how big frequency weights can
>> >>>> be. I can imagine people wanting to use frequency weights to subvert
>> >>>> the limit on number of observations.
>> >>>>
>> >>>> A different point is that if there is a limit on how big weights can
>> >>>> be it should be documented e.g. at -help limits-.
>> >>>> Nick
>> >>>> [email protected]
>> >>>>
>> >>>>
>> >>>> On 29 July 2013 00:46, Richard Williams <[email protected]> wrote:
>> >>>> > According to -help limits-, the maximum number of observations is
>> >>>> > 2,147,483,647. Your weights give you more than 4 billion cases, well
>> >>>> > above that. Further, the help also says that this is a theoretical
>> >>>> > maximum; memory availability will certainly impose a smaller maximum.
>> >>>> >
>> >>>> > On my computer, I specified [fw = 1073741823] on the pwcorr command and
>> >>>> > it ran. Then I specified [fw = 1073741824] and it did not run. These
>> >>>> > numbers put you just below and just above the maximum number of cases
>> >>>> > that Stata allows.
>> >>>> >
>> >>>> > So in short, it appears that your fweighted cases can't exceed the 2
>> >>>> > billion+ that Stata allows, and memory restrictions may hold you to even
>> >>>> > less than that.
>> >>>> >
>> >>>> > Also, you probably need to specify that the fweight variable is type
>> >>>> > long, e.g.
>> >>>> >
>> >>>> > input y x long fw
>> >>>> >
>> >>>> > Sent from my iPad
>> >>>> >
>> >>>> > On Jul 27, 2013, at 12:36 PM, László Sándor <[email protected]>
>> >>>> > wrote:
>> >>>> >
>> >>>> >> Hi,
>> >>>> >> If you care, here is an example that silently produces missing values.
>> >>>> >> I notified Stata Support.
>> >>>> >>
>> >>>> >> input y x fw
>> >>>> >> 2 1 2147483621
>> >>>> >> 1 2 2147483621
>> >>>> >> end
>> >>>> >> de
>> >>>> >> pwcorr y x [fw=fw]
>> >>>> >> exit
>> >>>> >>
>> >>>> >> Thanks,
>> >>>> >>
>> >>>> >> Laszlo
>> >>>> >>
>> >>>> >> On Sun, Jul 21, 2013 at 5:08 PM, Nick Cox <[email protected]> wrote:
>> >>>> >>> I'd suggest documenting your problems with a reproducible example and
>> >>>> >>> sending it to Stata tech support.
>> >>>> >>>
>> >>>> >>>
>> >>>> >>> Nick
>> >>>> >>> [email protected]
>> >>>> >>>
>> >>>> >>>
>> >>>> >>> On 21 July 2013 21:55, László Sándor <[email protected]> wrote:
>> >>>> >>>> Hi,
>> >>>> >>>> in Stata/MP 12.1 I am getting missing values when using -pwcorr- with
>> >>>> >>>> -fweights-, though the feature works fine with other data or if I scale
>> >>>> >>>> my weights down. Is it possible to simply have too large fweights,
>> >>>> >>>> e.g. if they cannot be of type -long- anymore?
>> >>>> >>>>
>> >>>> >>>> If so, why doesn't Stata warn me about this?
>> >>>> >>>>
>> >>>> >>>> I vaguely remember some Statalist or Stata blog discussion of this,
>> >>>> >>>> but I could not even Google it up, and Stata still did not warn me?
>> >>>> >>>>
>> >>>> >>>> Actually, why didn't Stata complain that I did not have integer
>> >>>> >>>> fweights if obviously the variable wasn't of type byte, int or long?
>> >>>> >>>>
>> >>>> >>>> Thanks,
>> >>>> >>>>
>> >>>> >>>> Laszlo
>> >>>> >>>>
>> >>>> >>>> *
>> >>>> >>>> *   For searches and help try:
>> >>>> >>>> *   http://www.stata.com/help.cgi?search
>> >>>> >>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> >>>> >>>> *   http://www.ats.ucla.edu/stat/stata/
>> >>>
>> >>>
>> >>> -------------------------------------------
>> >>> Richard Williams, Notre Dame Dept of Sociology
>> >>> OFFICE: (574)631-6668, (574)631-6463
>> >>> HOME:   (574)289-5227
>> >>> EMAIL:  [email protected]
>> >>> WWW:    http://www.nd.edu/~rwilliam
>> >>>
>> >>>
>> >>>


-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
HOME:   (574)289-5227
EMAIL:  [email protected]
WWW:    http://www.nd.edu/~rwilliam




© Copyright 1996–2018 StataCorp LLC