



Re: st: comparing xtdes-like patterns for variables


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: comparing xtdes-like patterns for variables
Date   Thu, 1 Nov 2012 13:30:31 +0000

I've done a quick hack of a program to show where the missings lie.
Its effectiveness in showing structure seems likely to diminish with
dataset size.

Example:

sysuse nlsw88
missingplot



*! 1.0.0 NJC 1 November 2012
program missingplot
	version 8.2
	syntax [varlist] [if] [in] [ , all varnames * ]
	
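	quietly {
		// mark the sample; novarlist keeps observations with missing values
		marksample touse, novarlist
		count if `touse'
		if r(N) == 0 error 2000
	
		// y counts the variables plotted; obsno goes on the x axis
		local y = 0
		tempvar obsno
		gen long `obsno' = _n if `touse'
		label variable `obsno' "observations"
		local toomany = 0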

		foreach v of local varlist {
			// unless -all- is specified, skip variables with no missing values
			local include = 1
			if "`all'" == "" {
				count if `touse' & missing(`v')
				if r(N) == 0 local include = 0
			}

			if `include' {
				local ++y

				// plot at most 20 variables
				if `y' > 20 {
					local toomany = 1
					continue, break
				}

				// ynew is y where v is missing, and missing elsewhere
				tempvar ynew
				gen byte `ynew' = `y' if missing(`v')

				if "`varnames'" != "" {
					local which "`v'"
				}
				else {
					local which : var label `v'
					if `"`which'"' == "" local which "`v'"
				}

				// accumulate the axis labels and the list of variables to plot
				local call `call'  `y'  `"`which'"'
				local Y `Y' `ynew'
			}
		}
	}	

	if "`Y'" == "" {
		di as txt "nothing to plot!"
		exit 0
	}

	if `toomany' {
		di as txt "note: only first 20 variables plotted"
	}

	scatter `Y' `obsno' if `touse', ///
	yla(`call', ang(h) noticks) ytitle("")    ///
	legend(off) mcolor(blue ..) ms(dh ..) `options'
end
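
The options declared in the -syntax- line above can be exercised as follows. This is a minimal sketch, assuming the program has been saved as missingplot.ado somewhere on the adopath; the particular variables and the title text are chosen only for illustration.

```stata
sysuse nlsw88, clear

* default: only variables with at least one missing value,
* labelled on the y axis by their variable labels
missingplot

* label the y axis with variable names instead
missingplot, varnames

* -all- keeps variables with no missing values too;
* -if- and -in- restrict the observations shown
missingplot age grade union hours tenure in 1/500, all varnames

* any other options are passed through to -scatter-
missingplot, varnames title("Missing values in nlsw88")
```

Because of the limit coded in the program, a call that selects more than 20 variables plots only the first 20 and prints a note.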


On Thu, Nov 1, 2012 at 12:46 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> Sorry for previous premature send.
>
> If you had several variables you could try something like this
>
> local y = 0
> gen long obsno = _n
>
> qui foreach v of var <whatever> {
>             local ++y
>             gen y`y' = `y' if missing(`v')
>             local which : var label `v'
>             if "`which'" == "" local which "`v'"
>              local call `call'  `y'  "`which'"
>             local Y `Y' y`y'
> }
>
> scatter `Y' obsno, ms(dh ..) yla(`call', ang(h) noticks) legend(off)
>
>
>>
>> On Thu, Nov 1, 2012 at 1:10 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> You could create variables like
>>>
>>> gen yxmiss = missing(y) - missing(x)
>>> gen long obs = _n
>>>
>>> scatter yxmiss obs if missing(y, x)
>>>
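
The coding of yxmiss in the lines above follows from the arithmetic: among observations where at least one of the two variables is missing, it is 1 when only y is missing, -1 when only x is missing, and 0 when both are missing. A minimal sketch on the auto data, with rep78 standing in for y and a hypothetical copy of price, blanked out in a few observations, standing in for x:

```stata
sysuse auto, clear

* x is a made-up variable for illustration only
gen x = price
replace x = . in 1/5

gen yxmiss = missing(rep78) - missing(x)
gen long obs = _n

* tabulate and plot only where at least one of the two is missing
tab yxmiss if missing(rep78, x)
scatter yxmiss obs if missing(rep78, x), ///
	yla(-1 "only x missing" 0 "both missing" 1 "only rep78 missing", ang(h))
```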
>>> On Wed, Oct 31, 2012 at 7:39 PM, László Sándor <sandorl@gmail.com> wrote:
>>>> Thanks, Nick.
>>>>
>>>> The values definitely don't line up that neatly, but that's a worry
>>>> for another day.
>>>>
>>>> Basically my problem is, if I know I can expect differences between
>>>> the variables, is there a neat way to compare their missing patterns
>>>> (one always starting early, or one mistakenly having the years in
>>>> reverse order)?
>>>>
>>>> On Wed, Oct 31, 2012 at 3:15 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>>> If # different versions of the same data should be the same, there
>>>>> will be # duplicates of everything in a combined dataset.
>>>>>
>>>>> This applies to missings too.
>>>>>
>>>>> -duplicates- is therefore something that springs to mind. Panels are
>>>>> no problem, as panel identifiers are just other variables.
>>>>>
>>>>> Naturally, if the combined dataset is extremely large, this won't be
>>>>> very practical.
>>>>>
>>>>> Nick
>>>>>
>>>>> On Wed, Oct 31, 2012 at 7:02 PM, László Sándor <sandorl@gmail.com> wrote:
>>>>>
>>>>>> I have a panel-data cleaning problem that probably has some neat
>>>>>> solution, probably already out there. I am happy to try any solutions
>>>>>> for Stata 12.1 MP.
>>>>>>
>>>>>> Background: I had to try to look up supposedly the same data from
>>>>>> multiple sources. (Financial data for the same securities, but
>>>>>> different data sources were expected to cover different subsets of my
>>>>>> universe, or for different time periods.)
>>>>>>
>>>>>> But now I have a panel where I would like to cross-check different
>>>>>> versions of the same data, and most crucially, I would like to verify
>>>>>> that I got the years correctly for each version. (FYI: financial data
>>>>>> sources can be opaque about how they handle missing data if you ask
>>>>>> for "end-of-year prices for the last 15 calendar years", and whether
>>>>>> they give years in ascending or descending order). For this, I would
>>>>>> like to compare the periods in which I have non-missing values for a
>>>>>> family of variables, say, bloomberg_price and reuters_price.
>>>>>>
>>>>>> Presumably, if I got the start and the end years right, I could hope
>>>>>> to -compare- those (e.g. -compare *_price_first-), and hope that the
>>>>>> patterns will be clear.
>>>>>>
>>>>>> That said, I'm afraid some more nuanced analysis of missing value
>>>>>> patterns might be justified. What are good tools for that? (How can I
>>>>>> "xtdes by variable"? Or "misstable pattern in a panel"?)
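
For the question as posed, one possible starting point is the official -misstable- command together with per-panel first and last years of non-missing data. This is only a sketch, assuming a long-format panel with hypothetical identifier variables id and year; the *_first and *_last names follow the poster's own example.

```stata
* overall counts and joint patterns of missing values
misstable summarize bloomberg_price reuters_price
misstable patterns bloomberg_price reuters_price

* first and last year with a non-missing value, by security and source
foreach v of varlist bloomberg_price reuters_price {
	egen `v'_first = min(cond(missing(`v'), ., year)), by(id)
	egen `v'_last  = max(cond(missing(`v'), ., year)), by(id)
}

compare bloomberg_price_first reuters_price_first
compare bloomberg_price_last reuters_price_last
```

A source that returned its years in reverse order would tend to show up as systematic disagreement in both comparisons, although gaps in the middle of a series need a plot or -misstable patterns- by panel to detect.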


