Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

st: RE: using Stata to detect interviewer fraud

From	"Lachenbruch, Peter" <[email protected]>
To	"[email protected]" <[email protected]>
Subject	st: RE: using Stata to detect interviewer fraud
Date	Sun, 2 May 2010 09:22:36 -0700

More generally, you'd like to detect observations that are very similar, not just proportions of missing values. In a legal case a few years ago, some people were copying previous data to new observations so they didn't have to rerun a lab test. We used a sum-of-squares criterion and ranked the values, which seemed to work fairly well. Stas's idea of clustering seems a good one.
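
A minimal sketch of how a sum-of-squares criterion could be computed and ranked in Mata. This is an illustration only, not the code used in the case mentioned above: the variable patterns, the "id" variable, the division by the number of jointly observed items, and the choice to show the 20 closest pairs are all assumptions, and every item is assumed to be numeric.

```stata
* Sketch only: rank pairs of observations by mean squared difference.
* The item list and "id" are assumed names; all items must be numeric.
unab items : a1* a2* a3* b* d*
mata:
X   = st_data(., tokens(st_local("items")))  // n x k matrix of items
id  = st_data(., "id")
n   = rows(X)
res = J(n*(n-1)/2, 3, .)                     // columns: id_i, id_j, criterion
r   = 0
for (i = 1; i < n; i++) {
    for (j = i + 1; j <= n; j++) {
        d   = X[i, .] - X[j, .]              // missing where either item is missing
        cnt = sum(d :< .)                    // number of jointly observed items
        r++
        // sum() treats missing values as zero, so only observed items contribute
        res[r, .] = (id[i], id[j], sum(d:^2) / cnt)
    }
}
res = sort(res, 3)                           // smallest differences first
top = min((20, rows(res)))
res[|1,1 \ top,3|]                           // the most similar pairs
end
```

Pairs near the top of the list have the smallest average squared difference and would be the first candidates for a closer look. Pairs with no jointly observed items get a missing criterion and sort to the bottom.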

________________________________________
From: [email protected] [[email protected]] On Behalf Of Michelson, Ethan [[email protected]]
Sent: Friday, April 30, 2010 8:16 PM
To: [email protected]
Subject: st: using Stata to detect interviewer fraud

I'd be deeply grateful for help writing a more efficient, more parsimonious .do file to help detect interviewer fraud. After completing a survey of 2,500 households, I discovered that a few interviewers copied each other's questionnaires. I decided to write some code that calculates, for every pair of questionnaires, the proportion of nonmissing items that are identical. Although my .do file accomplishes this task, I strongly suspect I'm making Stata do tons of unnecessary work. It takes Stata about 12 hours to process 505 questionnaires (from a single survey site, since I can rule out the possibility that interviewers conspired across different survey sites).

In the following code, "id" is the unique questionnaire id. There are 505 questionnaires in this batch. The final command at the bottom asks Stata to list combinations of questionnaires with >80% identical content. I have no doubt there's a far more efficient way to do this. I'd really appreciate any advice anyone can offer.

********************
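* Sort by id and flag the first observation; order and add drive the rotation below
sort id
gen order=0
gen add=-1
replace order=1 if _n==1
* Create one pair of counters per questionnaire id
levelsof id, local(levels)
foreach l of local levels {
   gen same_`l'=0
   gen all_`l'=0
}
* On each pass, compare every observation with the one n rows below it
forv n = 1(1)504 {
   foreach l of local levels {
      foreach var of varlist a1* a2* a3* b* d* c1 c12 c23 c34 c44 c55 c67 c77 c88 c100 c107 c116 c126 c136 c144 c155 c165 c176 c185 c195 {
         * same: items identical in both questionnaires; all: items nonmissing in both
         quietly replace same_`l'=same_`l'+1 if `var'==`var'[_n+`n']&`var'~=.&id[_n+`n']==`l'
         quietly replace all_`l'=all_`l'+1 if `var'~=.&`var'[_n+`n']~=.&id[_n+`n']==`l'
         display "`l' `n'"
      }
   }
   * Move the first observation to the end before the next pass
   quietly replace order=add if order==1
   quietly replace add=add-1
   gsort -order id
   quietly replace order=1 if _n==1
}
* Percentage of jointly nonmissing items that are identical
foreach l of local levels {
   gen prop_`l'=same_`l'/all_`l'*100
}
* List combinations of questionnaires with more than 80% identical content
foreach l of local levels {
   list id prop_`l' same_`l' all_`l' if prop_`l'>80&prop_`l'<.
}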

******************
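
One possible way to do the same pairwise comparison in a single pass is to read the items into Mata and compare rows directly. The following is a minimal sketch, not a tested replacement: it assumes every item and "id" are numeric, reuses the variable list and the 80% threshold from the code above, and treats any missing value (including extended missing values) as missing.

```stata
* Sketch only: percent of jointly nonmissing items that are identical,
* for every pair of questionnaires. Assumes all items and id are numeric.
unab items : a1* a2* a3* b* d* c1 c12 c23 c34 c44 c55 c67 c77 c88 c100 c107 c116 c126 c136 c144 c155 c165 c176 c185 c195
mata:
X  = st_data(., tokens(st_local("items")))   // n x k matrix of items
id = st_data(., "id")
n  = rows(X)
for (i = 1; i < n; i++) {
    for (j = i + 1; j <= n; j++) {
        both  = (X[i, .] :< .) :& (X[j, .] :< .)     // nonmissing in both rows
        nboth = sum(both)
        nsame = sum(both :& (X[i, .] :== X[j, .]))   // identical and nonmissing
        if (nboth > 0) {
            prop = 100 * nsame / nboth
            if (prop > 80) {
                printf("%9.0g %9.0g %6.1f %5.0f %5.0f\n", id[i], id[j], prop, nsame, nboth)
            }
        }
    }
}
end
```

Each pair is visited once, so the output lists the two ids, the percent identical, the number of identical items, and the number of jointly nonmissing items. Because the comparison works on whole rows rather than one replace per variable per shift, no re-sorting of the data is needed.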

Ethan Michelson
Departments of Sociology and East Asian Languages & Cultures, Associate Professor
Maurer School of Law, Associate Professor of Sociology and Law
mail address:
Department of Sociology
Indiana University
744 Ballantine Hall
1020 E. Kirkwood Ave.
Bloomington, IN 47405
Phone: (812) 856-1521
Fax: (812) 855-0781
Email: [email protected]
URL: http://www.indiana.edu/~emsoc/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
