Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Robert Picard <picard@netbox.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: using Stata to detect interviewer fraud |
Date | Sat, 1 May 2010 18:49:54 -0400 |
Here's a quick and simple way to do it. It does not treat missing
values specially (in Stata, two missing values compare as equal and so
count as a match), but that should be easy to adjust. If I look for
cars that are the same on more than 70% of the variables, I find that
the Dodge Diplomat is very similar to the Dodge Magnum.

Hope this helps,

Robert

*--------------------------- begin example -----------------------
version 11
clear all
sysuse auto
unab vlist: *

* form every pair of observations: id1 indexes one car, id2 the other
gen id1 = _n
tempfile f
qui save "`f'"
rename id1 id2
cross using "`f'"

* within each id1 group, put the self-pair (id1 == id2) first
gen diffid = id1 != id2
sort id1 diffid id2

* count the variables on which each car matches the first row of its group
gen nmatch = 0
foreach v in `vlist' {
    qui by id1: replace nmatch = nmatch + (`v'[1] == `v')
}

* nmatch[1] is the self-pair, which matches on every variable
by id1: gen similar = nmatch / nmatch[1] > .7
by id1: egen check = sum(similar)
list id1 id2 make-foreign if check>1 & similar, noobs sepby(id1)
*--------------------- end example --------------------------

On Fri, Apr 30, 2010 at 11:16 PM, Michelson, Ethan <emichels@indiana.edu> wrote:
> I'd be deeply grateful for help writing a more efficient, more
> parsimonious .do file to help detect interviewer fraud. After
> completing a survey of 2,500 households, I discovered that a few
> interviewers copied each others' questionnaires. I decided to write
> some code that calculates the proportion of all nonmissing
> questionnaire items that are identical across every other
> questionnaire. Although my .do file accomplishes this task, I strongly
> suspect I'm making Stata do tons of unnecessary work. It takes Stata
> about 12 hours to process 505 questionnaires (from a single survey
> site, since I can rule out the possibility that interviewers conspired
> across different survey sites).....
>
> In the following code, "id" is the unique questionnaire id. There are
> 505 questionnaires in this batch. The final command at the bottom asks
> Stata to list combinations of questionnaires with >80% identical
> content. I have no doubt there's a far more efficient way to do this.
> I'd really appreciate any advice anyone can offer.
>
> ********************
> sort id
> gen order=0
> gen add=-1
> replace order=1 if _n==1
> levels id, local(levels)
> foreach l of local levels {
>     gen same_`l'=0
>     gen all_`l'=0
> }
> forv n = 1(1)504 {
>     foreach l of local levels {
>         foreach var of varlist a1* a2* a3* b* d* c1 c12 c23 c34 c44 c55 c67 c77 c88 c100 c107 c116 c126 c136 c144 c155 c165 c176 c185 c195 {
>             quietly replace same_`l'=same_`l'+1 if `var'==`var'[_n+`n']&`var'~=.&id[_n+`n']==`l'
>             quietly replace all_`l'=all_`l'+1 if `var'~=.&`var'[_n+`n']~=.&id[_n+`n']==`l'
>             display "`l' `n'"
>         }
>     }
>     quietly replace order=add if order==1
>     quietly replace add=add-1
>     gsort -order id
>     quietly replace order=1 if _n==1
> }
> foreach l of local levels {
>     gen prop_`l'=same_`l'/all_`l'*100
> }
> foreach l of local levels {
>     list id prop_`l' same_`l' all_`l' if prop_`l'>80&prop_`l'<.
> }
>
> ******************
>
> Ethan Michelson
> Departments of Sociology and East Asian Languages & Cultures, Associate Professor
> Maurer School of Law, Associate Professor of Sociology and Law
> mail address:
> Department of Sociology
> Indiana University
> 744 Ballantine Hall
> 1020 E. Kirkwood Ave.
> Bloomington, IN 47405
> Phone: (812) 856-1521
> Fax: (812) 855-0781
> Email: emichels@indiana.edu
> URL: http://www.indiana.edu/~emsoc/
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
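
The reply above notes that its example does not handle missing values and that this "should be easy to adjust". One way that adjustment might look is sketched below. It is not taken from the thread: the variables nvalid and prop are names introduced here for illustration, the 70% cutoff is simply carried over from the reply's example, and the auto dataset stands in for the survey data. The sketch counts an item only when it is nonmissing in both records of a pair, which is the denominator described in the original question.

```stata
version 11
clear all
sysuse auto
unab vlist: *

* form all pairs of records, as in the reply's example
gen id1 = _n
tempfile f
qui save "`f'"
rename id1 id2
cross using "`f'"
gen diffid = id1 != id2
sort id1 diffid id2

* nvalid (hypothetical name): items nonmissing in both records of the pair
* nmatch: items that are nonmissing in both records and identical
gen nmatch = 0
gen nvalid = 0
foreach v in `vlist' {
    qui by id1: replace nvalid = nvalid + (!missing(`v'[1]) & !missing(`v'))
    qui by id1: replace nmatch = nmatch + (`v'[1] == `v' & !missing(`v'))
}

* prop (hypothetical name): share of jointly nonmissing items that match
gen prop = nmatch / nvalid

* skip self-pairs and pairs with no jointly nonmissing items
list id1 id2 nmatch nvalid prop if diffid & prop > .7 & !missing(prop), noobs sepby(id1)
```

For the survey data, the varlist from the original question would presumably replace the wildcard in unab, and the cutoff would be .8 rather than .7. With 505 questionnaires, cross creates 505 x 505 = 255,025 pair records, so memory use grows with the square of the batch size.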