Re: st: using Stata to detect interviewer fraud

Sat, 1 May 2010 12:00:59 -0400

I have no idea whether this would be any faster, but you could save each observation as a temporary file, then loop over the questionnaires, reading in each file and using -cf- to compare its responses with every other file, and save the percentage of differing items using -postfile-. Something like the code below.

It is actually quite inefficient, because the loop compares questionnaire 2 with 4 and then again 4 with 2, which is unnecessary, but I am sure some more elegant programming could take care of that. Also, I am not sure how the approach below handles the requirement to compare only non-missing data, but if missing values are all stored in the same way this might not be a problem.

clear
* made-up data
input quesid a b c d
1 1 2 3 0
2 1 2 3 0
3 4 4 4 0
4 1 4 5 0
5 2 4 3 0
end

* number of questionnaire items (all variables except quesid)
qui desc
local nvars=r(k)-1
levelsof quesid, local(levels)

* save temporary files, one per observation
foreach i of local levels {
    tempfile file`i'
    preserve
    keep if quesid==`i'
    save "`file`i''"
    restore
}

* set up postfile
tempname testx
tempfile diffs
postfile `testx' quesid1 quesid2 double(pctdiff) numdiffs nvars using "`diffs'"

* compare each questionnaire with every other one
foreach a of local levels {
    foreach b of local levels {
        if `a'~=`b' {
            use "`file`a''", clear
            * -cf- exits with an error when it finds differences, hence -capture-
            capture cf _all using "`file`b''"
            * subtract one because quesid will always differ
            local diffcount=r(Nsum)-1
            scalar pctdiffs=(`diffcount'/`nvars')*100
            post `testx' (`a') (`b') (pctdiffs) (`diffcount') (`nvars')
            di "differences in ques `a' and ques `b' = `diffcount'"
        }
    }
}
postclose `testx'
clear
use "`diffs'"
list

     +------------------------------------------------+
     | quesid1   quesid2   pctdiff   numdiffs   nvars |
     |------------------------------------------------|
  1. |       1         2         0          0       4 |
  2. |       1         3        75          3       4 |
  3. |       1         4        50          2       4 |
  4. |       1         5        50          2       4 |
  5. |       2         1         0          0       4 |
     |------------------------------------------------|
  6. |       2         3        75          3       4 |
  7. |       2         4        50          2       4 |
  8. |       2         5        50          2       4 |
  9. |       3         1        75          3       4 |
 10. |       3         2        75          3       4 |
     |------------------------------------------------|
 11. |       3         4        50          2       4 |
 12. |       3         5        50          2       4 |
 13. |       4         1        50          2       4 |
 14. |       4         2        50          2       4 |
 15. |       4         3        50          2       4 |
     |------------------------------------------------|
 16. |       4         5        50          2       4 |
 17. |       5         1        50          2       4 |
 18. |       5         2        50          2       4 |
 19. |       5         3        50          2       4 |
 20. |       5         4        50          2       4 |
     +------------------------------------------------+

On Fri, Apr 30, 2010 at 11:16 PM, Michelson, Ethan <emichels@indiana.edu> wrote:
> I'd be deeply grateful for help writing a more efficient, more parsimonious .do file to help detect interviewer fraud. After completing a survey of 2,500 households, I discovered that a few interviewers copied each others' questionnaires. I decided to write some code that calculates the proportion of all nonmissing questionnaire items that are identical across every other questionnaire. Although my .do file accomplishes this task, I strongly suspect I'm making Stata do tons of unnecessary work. It takes Stata about 12 hours to process 505 questionnaires (from a single survey site, since I can rule out the possibility that interviewers conspired across different survey sites).....
>
> In the following code, "id" is the unique questionnaire id. There are 505 questionnaires in this batch. The final command at the bottom asks Stata to list combinations of questionnaires with >80% identical content. I have no doubt there's a far more efficient way to do this. I'd really appreciate any advice anyone can offer.
>
> ********************
> sort id
> gen order=0
> gen add=-1
> replace order=1 if _n==1
> levels id, local(levels)
> foreach l of local levels {
>   gen same_`l'=0
>   gen all_`l'=0
> }
> forv n = 1(1)504 {
>   foreach l of local levels {
>     foreach var of varlist a1* a2* a3* b* d* c1 c12 c23 c34 c44 c55 c67 c77 c88 c100 c107 c116 c126 c136 c144 c155 c165 c176 c185 c195 {
>       quietly replace same_`l'=same_`l'+1 if `var'==`var'[_n+`n']&`var'~=.&id[_n+`n']==`l'
>       quietly replace all_`l'=all_`l'+1 if `var'~=.&`var'[_n+`n']~=.&id[_n+`n']==`l'
>       display "`l' `n'"
>     }
>   }
>   quietly replace order=add if order==1
>   quietly replace add=add-1
>   gsort -order id
>   quietly replace order=1 if _n==1
> }
> foreach l of local levels {
>   gen prop_`l'=same_`l'/all_`l'*100
> }
> foreach l of local levels {
>   list id prop_`l' same_`l' all_`l' if prop_`l'>80&prop_`l'<.
> }
>
> ******************
>
> Ethan Michelson
> Departments of Sociology and East Asian Languages & Cultures, Associate Professor
> Maurer School of Law, Associate Professor of Sociology and Law
> mail address:
> Department of Sociology
> Indiana University
> 744 Ballantine Hall
> 1020 E. Kirkwood Ave.
> Bloomington, IN 47405
> Phone: (812) 856-1521
> Fax: (812) 855-0781
> Email: emichels@indiana.edu
> URL: http://www.indiana.edu/~emsoc/
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
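
P.S. On the two caveats in the reply above (each pair is compared twice, and missing values are not excluded), here is a minimal sketch of one way around both, not a tested solution for the real survey. It assumes the data fit in memory with one row per questionnaire, loops over each unordered pair only once, and counts an item only when it is non-missing in both questionnaires. The variable names (quesid, a-d) and the names same, both and pctsame are made up for illustration, and the missing value in d for questionnaire 5 is added only to exercise the non-missing rule. I have not timed this against the original .do file.

```stata
clear
* made-up data: as in the reply, except d is set to missing for questionnaire 5
input quesid a b c d
1 1 2 3 0
2 1 2 3 0
3 4 4 4 0
4 1 4 5 0
5 2 4 3 .
end

* items to compare: every variable except the id
ds quesid, not
local items `r(varlist)'
local N = _N

* results file: one row per unordered pair of questionnaires
tempname ph
tempfile pairs
postfile `ph' quesid1 quesid2 same both double(pctsame) using "`pairs'"

forvalues i = 1/`=`N'-1' {
    forvalues j = `=`i'+1'/`N' {
        local same = 0
        local both = 0
        foreach v of local items {
            * count the item only if it is non-missing in both rows
            if !missing(`v'[`i'], `v'[`j']) {
                local both = `both' + 1
                if `v'[`i'] == `v'[`j'] {
                    local same = `same' + 1
                }
            }
        }
        * pctsame is missing when the pair shares no non-missing items
        post `ph' (quesid[`i']) (quesid[`j']) (`same') (`both') (100*`same'/`both')
    }
}
postclose `ph'

use "`pairs'", clear
list if pctsame > 80 & !missing(pctsame), noobs
```

With the made-up data, only the pair of questionnaires 1 and 2 should be listed, with 4 of 4 shared items identical. Because nothing is written to disk inside the loop and each pair is visited once, this should do roughly half the comparisons of the -cf- version, but the item loop still runs once per pair, so for 505 questionnaires and many items a Mata implementation might be worth considering.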

