
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: upper limit on fweights? overflowing into missing values?

From	Richard Williams <[email protected]>
To	[email protected], [email protected]
Subject	Re: st: upper limit on fweights? overflowing into missing values?
Date	Thu, 01 Aug 2013 22:23:18 -0400

At 05:13 PM 8/1/2013, László Sándor wrote:

Thanks, Richard.

An "observation" was one security for one year. Total holdings of each
would have been the weights. I wanted to calculate how one data
source's prices were correlated with another. Lines/observations with
larger holdings were more important for me in this regard, but as the
larger holdings don't imply a more precise "average price" coming from
a larger sample, these still don't strike me as a case for analytical
weights. So I thought -fweights- would do the trick for -pwcorr-.

Again, I don't find it helpful to think of each dollar in these assets
being individual observations or not. To me this is a straw man.

But yes, there are assets with more than 2 billion held in them, many.

And I still think my confusion and request for help were legitimate even if
the calculation obviously works after a conversion into millions (and
some rounding, which of course shows why "millions" were not really
the unit/level of observation here). Getting missing values back
confused me, as I thought there were some missing prices somewhere
(sure there are) and maybe -casewise- was not doing its job or I
missed something. I still don't follow if you think it is obvious that
a user should have known that this will only work in millions, so
Stata does not warn them.

Sorry, I didn't mean to imply that you should have known this. It took me several minutes to figure out what I think is going on. But in fairness to StataCorp, I think it is a highly unusual problem so it doesn't surprise me they haven't done more about it or been more explicit in their documentation. And also in fairness to StataCorp, it does say that the number of observations is limited to around 2 billion. Given that fweights tell you the number of duplicated observations, it doesn't surprise me that the limit for the number of fweighted cases is the same. But I certainly agree that just adding a sentence to that effect in the help for fweights would be a lot easier than expecting people to piece things together on their own.
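
To make the arithmetic behind that limit concrete, here is a small sketch. It assumes that the two tests reported earlier in this thread ([fw = 1073741823] ran, [fw = 1073741824] did not) were run on the two-row example dataset; that is an assumption, not something the thread confirms. Under it, the implied number of fweighted cases is twice the weight, and the two totals fall on either side of the 2,147,483,647 maximum given in -help limits-.

```stata
* Implied number of fweighted cases, ASSUMING a two-row dataset
* (the two-row assumption is illustrative, not confirmed in the thread)

* total cases for the weight that ran: 2,147,483,646
display %15.0fc 2 * 1073741823

* total cases for the weight that did not run: 2,147,483,648
display %15.0fc 2 * 1073741824

* the maximum number of observations from -help limits-: 2,147,483,647
display %15.0fc 2^31 - 1
```

If the assumption holds, what seems to matter is the sum of the fweights across rows, not the size of any single weight.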

I hope this is settled, I am glad StataCorp will help future users
with similarly underdeveloped intuition for implicit limits and thus
the right scale for the unit of observation?

I have to admit, I still don't understand why you think fweights are appropriate, nor am I clear yet on what the fweights are. If total holdings = dollar value of the security, then I don't know why dollars would be viewed as duplicated observations. I would use fweights if I could in fact create a separate record for each fweighted case (but don't need to because some cases have the exact same values). But the fact that I don't understand it doesn't mean it is wrong, it may just mean that I don't understand it!

It sort of sounds to me like you want to use iweights. The help says "iweights, or importance weights, are weights that indicate the "importance" of the observation in some vague sense." So, maybe you think that richer portfolios should carry more weight than smaller portfolios. But, rightly or wrongly, pwcorr won't let you use iweights. Regress does, but it has the 2 billion cases limit. If you just want to get at relative importance, then dividing portfolio dollars by 1 million would seem to do the trick.
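
A minimal sketch of that rescaling, meant to be run as a do-file. The variable names (holdings, fw_mill) and the data values are hypothetical; only the first two holdings figures are borrowed from the example posted earlier in the thread, and they are treated here as dollar amounts purely for illustration.

```stata
* Sketch only: variable names and data values are hypothetical
clear
input y x double holdings
2 1 2147483621
1 2 2147483621
3 4 3500000000
4 3  750000000
end

* Express holdings in millions of dollars and round to integers,
* because fweights must be integers
generate long fw_mill = round(holdings / 1e6)

* Check that the sum of the rescaled weights is far below 2,147,483,647
summarize fw_mill
display r(sum)

* Weighted correlation using the rescaled weights
pwcorr y x [fw = fw_mill]
```

Multiplying all weights by a constant does not change a weighted correlation coefficient, so apart from the rounding the estimate is the same as with dollar weights. The rounding is the catch: in this sketch any holding under half a million dollars would get a weight of zero, so whether millions is a sensible unit depends on the data.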

But, "right scale for the unit of observation" seems like an odd term to me, or else something that researchers have to figure out on their own. What is the unit of analysis? If I have California and Wyoming in my sample (the states with the largest and smallest populations in the USA) does my N = 2? Or do I weight each state by its population size? Or what? It depends on the nature of my data and the problem I am trying to address. I think in most cases I would say my N is 2 (hopefully I have some other states in there too!) but I would include things like population size as control variables in an analysis.

Anyway it is an interesting problem. Not one that I expect to encounter myself unless I switch to genetics or start analyzing the number of molecules in the Universe. But there are bound to be some people who need to deal with monster Ns, and it will be nice if Stata can accommodate them.

Laszlo

On Thu, Aug 1, 2013 at 4:26 PM, Richard Williams
<[email protected]> wrote:
> I still don't understand what the fweights are supposed to represent, i.e.
> what is an observation in these data? If it is the dollar value of the
> portfolio, you could simply measure the value in millions of dollars rather
> than dollars. Or, if it is the number of shares of stock, it could be
> measured in 1000s of shares rather than shares. If you can be clear on what
> an observation is that might help.
>
> Like Nick, I could also see where aweights might be right. According to the
> docs, "Analytic aweights are typically appropriate when you are dealing with
> data containing averages. For instance, you have average income and average
> characteristics on a group of people.  The weighting variable contains the
> number of persons over which the average was calculated (or a number
> proportional to that amount)." aweights might be used for things like states
> of the United States. Or maybe even countries. The world population is in
> the 7 billion range, so if you had one record per country with things like,
> say, average income, the aweights could be the population size. Stata should
> be able to handle that fine even though it couldn't handle 7 billion
> records for every person in the world. If, say, you have 100 portfolios with
> a total of 100 billion shares of stock, and for each portfolio you have the
> average value of a share of stock, along with other characteristics of the
> portfolio (e.g. how managed) aweights would sound right to me.
>
> I agree that the documentation should be better and I am glad that Stata
> says it is going to work on it. But, this seems like a wildly esoteric
> problem to me. How many people have 4 billion cases? I don't think many do.
> I can see how this one has slipped through the cracks for the 25+ years
> Stata has been around.
>
> And in this case, I am not sure that you have 4 billion cases either. Again,
> if you can clarify what an observation is, that may help. If it is something
> like dollar value, that doesn't really strike me as being cases, but even if
> it is it seems easy enough to rescale into millions or thousands or whatever
> in order to make the problem manageable.
>
>
>
> At 01:35 PM 8/1/2013, László Sándor wrote:
>>
>> Thanks, Nick.
>>
>> Then maybe I have a terrible understanding of what aweights are. My
>> larger portfolios are not simply more precisely priced, they are,
>> well, larger. I think that enters a pwcorr calculation differently,
>> though maybe not.
>>
>> On semantics: I think an observation is anchored in the actual data in
>> Stata. But whether the weighting is sensible should not depend on
>> whether my dollar-by-dollar comparison uses larger numbers than an
>> investor-by-investor comparison. And I definitely disagree with the
>> notion that the current (undocumented) limits are fine because no one
>> would have this many "observations." Yes, no one would have this many
>> lines in Stata, but fweights are exactly there to talk about larger
>> populations than the aggregates in the data, and the dollar values can
>> easily get this large, even without "genetics." I would push back on
>> monetary amounts not being populations/observations so it is fine that
>> Stata silently overflows if it encounters them.
>>
>> So let's root for more documentation soon.
>>
>> On Tue, Jul 30, 2013 at 8:54 AM, Nick Cox <[email protected]> wrote:
>> > On the contrary, it seems to me that "what is an observation?" is more
>> > than semantic here: it is the nub of the issue!
>> >
>> > It's your problem but this sounds to me like a case for analytic
>> > weights. The use of frequency weights is also suspect unless the
>> > weights are integers (without artifice or rounding).
>> >
>> > As I've said or implied in earlier posts, this all should be a bit
>> > better documented.
>> > Nick
>> > [email protected]
>> >
>> >
>> > On 30 July 2013 13:34, László Sándor <[email protected]> wrote:
>> >> Thanks, Richard.
>> >>
>> >> Stata tech support got back to me and suggested something similar:
>> >> that some operations with fweights do overflow with such large
>> >> weights, others don't. I am not sure whether we shall call it
>> >> hard-coded as a restriction on some number somewhere or simply the C
>> >> implementation of -mf_quadcross- or something.
>> >>
>> >> I think I tried to describe my use case: I wanted to calculate stats
>> >> on portfolios, and it makes sense to weight by the size of them. As
>> >> pwcorr does not allow iweights, and pweights and aweights do something
>> >> completely different, I thought I'd use fweights. It blows up unless I
>> >> rescale the portfolios into thousands, millions or billions.
>> >>
>> >> Not a big deal, but Stata's (non-existent) error message, help and
>> >> documentation were not exactly helpful in resolving this. StataCorp
>> >> says they will address this.
>> >>
>> >> I think what an observation is is a semantic issue here, not very
>> >> helpful. Is an entire portfolio "one observation" or a single share in
>> >> each, or each dollar behind each? I am not sure this should matter
>> >> either for us or for Stata.
>> >>
>> >> Best,
>> >>
>> >> Laszlo
>> >>
>> >> On Mon, Jul 29, 2013 at 9:53 AM, Richard Williams
>> >> <[email protected]> wrote:
>> >>> Just to sum up my current thinking/guesses on this:
>> >>>
>> >>> * the maximum number of observations in Stata is 2,147,483,647
>> >>> * Nonetheless, fweighted data sets can have more observations than that
>> >>> * However, not all routines will work when the fweighted data has more
>> >>>   than 2,147,483,647 cases. You can do some simple descriptive things,
>> >>>   but you can't do more complicated things like regression or
>> >>>   correlations.
>> >>> * As to why that is, I am guessing that some routines have the
>> >>>   2,147,483,647 limit hardcoded in. Or, maybe there just isn't enough
>> >>>   precision to handle calculations when the N is larger than that.
>> >>> * Given that most people don't have more than 2,147,483,647 cases (and
>> >>>   even if they did, their computer memory couldn't handle them)
>> >>>   StataCorp probably hasn't spent a lot of time worrying about this.
>> >>> * Still, an added sentence or two in the fweights documentation or
>> >>>   elsewhere warning about limits might be a good idea.
>> >>>
>> >>> I am curious what the original author is doing that requires analyzing 4
>> >>> billion+ cases. Some sort of genetic research maybe? I've certainly never
>> >>> heard of any kind of survey research having an N that large.
>> >>>
>> >>>
>> >>>
>> >>> At 06:53 PM 7/28/2013, Nick Cox wrote:
>> >>>>
>> >>>> This is interesting, but in principle I don't see that Stata's limit
>> >>>> on # of observations has any bearing on how big frequency weights can
>> >>>> be. I can imagine people wanting to use frequency weights to subvert
>> >>>> the limit on number of observations.
>> >>>>
>> >>>> A different point is that if there is a limit on how big weights can
>> >>>> be it should be documented e.g. at -help limits-.
>> >>>> Nick
>> >>>> [email protected]
>> >>>>
>> >>>>
>> >>>> On 29 July 2013 00:46, Richard Williams
>> >>>> <[email protected]>
>> >>>> wrote:
>> >>>> > According to -help limits-, the maximum number of observations is
>> >>>> > 2,147,483,647. Your weights give you more than 4 billion cases, well
>> >>>> > above that. Further, the help also says that this is a theoretical
>> >>>> > maximum; memory availability will certainly impose a smaller maximum.
>> >>>> >
>> >>>> > On my computer, I specified [fw = 1073741823] on the pwcorr command
>> >>>> > and it ran. Then I specified [fw = 1073741824] and it did not run.
>> >>>> > These numbers put you just below and just above the maximum number of
>> >>>> > cases that Stata allows.
>> >>>> >
>> >>>> > So in short, it appears that your fweighted cases can't exceed the 2
>> >>>> > billion+ that Stata allows, and memory restrictions may hold you to
>> >>>> > even less than that.
>> >>>> >
>> >>>> > Also, you probably need to specify that the fweight variable is type
>> >>>> > long, e.g.
>> >>>> >
>> >>>> > input y x long fw
>> >>>> >
>> >>>> > Sent from my iPad
>> >>>> >
>> >>>> > On Jul 27, 2013, at 12:36 PM, László Sándor <[email protected]>
>> >>>> > wrote:
>> >>>> >
>> >>>> >> Hi,
>> >>>> >> If you care, here is an example that silently produces missing
>> >>>> >> values.
>> >>>> >> I notified Stata Support.
>> >>>> >>
>> >>>> >> input y x fw
>> >>>> >> 2 1 2147483621
>> >>>> >> 1 2 2147483621
>> >>>> >> end
>> >>>> >> de
>> >>>> >> pwcorr y x [fw=fw]
>> >>>> >> exit
>> >>>> >>
>> >>>> >> Thanks,
>> >>>> >>
>> >>>> >> Laszlo
>> >>>> >>
>> >>>> >> On Sun, Jul 21, 2013 at 5:08 PM, Nick Cox <[email protected]>
>> >>>> >> wrote:
>> >>>> >>> I'd suggest documenting your problems with a reproducible example
>> >>>> >>> and
>> >>>> >>> sending Stata tech support.
>> >>>> >>>
>> >>>> >>>
>> >>>> >>> Nick
>> >>>> >>> [email protected]
>> >>>> >>>
>> >>>> >>>
>> >>>> >>> On 21 July 2013 21:55, László Sándor <[email protected]> wrote:
>> >>>> >>>> Hi,
>> >>>> >>>> in Stata/MP 12.1 I am getting missing values when using -pwcorr-
>> >>>> >>>> with -fweights- though the feature works fine with other data or
>> >>>> >>>> if I scale my weights down. Is it possible to simply have too
>> >>>> >>>> large fweights, e.g. if they cannot be of type -long- anymore?
>> >>>> >>>>
>> >>>> >>>> If so, why doesn't Stata warn me about this?
>> >>>> >>>>
>> >>>> >>>> I vaguely remember some Statalist or Stata blog discussion of
>> >>>> >>>> this, but I could not even Google it up, and Stata still did not
>> >>>> >>>> warn me?
>> >>>> >>>>
>> >>>> >>>> Actually, why didn't Stata complain that I did not have integer
>> >>>> >>>> fweights if obviously the variable wasn't of type byte, int or
>> >>>> >>>> long?
>> >>>> >>>>
>> >>>> >>>> Thanks,
>> >>>> >>>>
>> >>>> >>>> Laszlo
>> >>>> >>>>
>> >>>> >>>> *
>> >>>> >>>> *   For searches and help try:
>> >>>> >>>> *   http://www.stata.com/help.cgi?search
>> >>>> >>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> >>>> >>>> *   http://www.ats.ucla.edu/stat/stata/
>> >>>
>> >>>
>> >>> -------------------------------------------
>> >>> Richard Williams, Notre Dame Dept of Sociology
>> >>> OFFICE: (574)631-6668, (574)631-6463
>> >>> HOME:   (574)289-5227
>> >>> EMAIL:  [email protected]
>> >>> WWW:    http://www.nd.edu/~rwilliam

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
HOME:   (574)289-5227
EMAIL:  [email protected]
WWW:    http://www.nd.edu/~rwilliam


