
# RE: st: using Stata to detect interviewer fraud

From: "Michelson, Ethan"
To: "statalist@hsphsun2.harvard.edu"
Subject: RE: st: using Stata to detect interviewer fraud
Date: Sat, 1 May 2010 23:33:09 -0400

```
Robert,

Thanks so much! This is brilliant, far more elegant and efficient than my clumsy code. As you suggested, I modified it slightly (1) to identify matches for nonmissing values only, and (2) to calculate the proportion of matches using the number of variables with BOTH values nonmissing as the denominator. Problem solved. Thank you!

Best,
Ethan

**************************
unab vlist: a1* a2* a3* b* d* c1 c12 c23 c34 c44 c55 c67 c77 c88 c100 c107 c116 c126 c136 c144 c155 c165 c176 c185 c195
sort id
tempfile f
qui save "`f'"

rename id id2
cross using "`f'"
gen diffid = id != id2
sort id diffid id2
gen nmatch = 0
gen total = 0
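foreach v in `vlist' {
    * `v'[1] is questionnaire id's own value: the self-pair (diffid == 0) sorts first within id
    * count a match only on a valid value (nonmissing, or one of the codes .a .b .c)
    qui by id: replace nmatch = nmatch + (`v'[1] == `v' & (`v'<.|`v'==.a|`v'==.b|`v'==.c))
    * denominator: items where BOTH questionnaires hold a valid value
    qui by id: replace total = total + ((`v'<.|`v'==.a|`v'==.b|`v'==.c)&(`v'[1]<.|`v'[1]==.a|`v'[1]==.b|`v'[1]==.c))
}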

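* percentage of comparable items that are identical
gen prop = nmatch / total * 100
* missing sorts above every number in Stata, so a pair with total == 0 (prop == .)
* would pass "prop > 80"; exclude it explicitly
gen similar = (prop > 80) & (prop < .)
by id: egen check = sum(similar)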

list v5 v8 id id2 prop nmatch total if check>1 & similar, noobs sepby(id)
***********************

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Robert Picard
Sent: Sunday, May 02, 2010 6:50 AM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: using Stata to detect interviewer fraud

Here's a quick and simple way to do it. It does not distinguish
missing values but that should be easy to adjust. If I look for cars
that are the same for 70% or more variables, I find that the Dodge
Diplomat is very similar to the Dodge Magnum.

Hope this helps,

Robert

*--------------------------- begin example -----------------------
version 11

clear all
sysuse auto
unab vlist: *
gen id1 = _n
tempfile f
qui save "`f'"

rename id1 id2
cross using "`f'"
gen diffid = id1 != id2
sort id1 diffid id2
gen nmatch = 0
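foreach v in `vlist' {
    * within each id1 group the first row is the self-pair (diffid == 0),
    * so `v'[1] holds id1's own value
    qui by id1: replace nmatch = nmatch + (`v'[1] == `v')
}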

by id1: gen similar = nmatch / nmatch[1] > .7
by id1: egen check = sum(similar)

list id1 id2 make-foreign if check>1 & similar, noobs sepby(id1)
*--------------------- end example --------------------------

On Fri, Apr 30, 2010 at 11:16 PM, Michelson, Ethan <emichels@indiana.edu> wrote:
> I'd be deeply grateful for help writing a more efficient, more parsimonious .do file to help detect interviewer fraud. After completing a survey of 2,500 households, I discovered that a few interviewers copied each other's questionnaires. I decided to write some code that calculates, for every pair of questionnaires, the proportion of nonmissing questionnaire items that are identical. Although my .do file accomplishes this task, I strongly suspect I'm making Stata do tons of unnecessary work. It takes Stata about 12 hours to process 505 questionnaires (from a single survey site, since I can rule out the possibility that interviewers conspired across different survey sites).
>
> In the following code, "id" is the unique questionnaire id. There are 505 questionnaires in this batch. The final command at the bottom asks Stata to list combinations of questionnaires with >80% identical content. I have no doubt there's a far more efficient way to do this. I'd really appreciate any advice anyone can offer.
>
> ********************
> sort id
> gen order=0
> replace order=1 if _n==1
> levels id, local(levels)
> foreach l of local levels {
>    gen same_`l'=0
>    gen all_`l'=0
> }
> forv n = 1(1)504 {
>    foreach l of local levels {
>       foreach var of varlist a1* a2* a3* b* d* c1 c12 c23 c34 c44 c55 c67 c77 c88 c100 c107 c116 c126 c136 c144 c155 c165 c176 c185 c195 {
>          quietly replace same_`l'=same_`l'+1 if `var'==`var'[_n+`n']&`var'~=.&id[_n+`n']==`l'
>          quietly replace all_`l'=all_`l'+1 if `var'~=.&`var'[_n+`n']~=.&id[_n+`n']==`l'
>          display "`l' `n'"
>      }
>    }
>    quietly replace order=add if order==1
>    gsort -order id
>    quietly replace order=1 if _n==1
> }
> foreach l of local levels {
>    gen prop_`l'=same_`l'/all_`l'*100
> }
> foreach l of local levels {
>    list id prop_`l' same_`l' all_`l' if prop_`l'>80&prop_`l'<.
> }
>
> ******************
>
> Ethan Michelson
> Departments of Sociology and East Asian Languages & Cultures, Associate Professor
> Maurer School of Law, Associate Professor of Sociology and Law
> Department of Sociology
> Indiana University
> 744 Ballantine Hall
> 1020 E. Kirkwood Ave.
> Bloomington, IN 47405
> Phone: (812) 856-1521
> Fax: (812) 855-0781
> Email: emichels@indiana.edu
> URL: http://www.indiana.edu/~emsoc/
>
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

```
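
The revised code at the top of the thread depends on two details: after `cross`, the first row within each `id` group is the questionnaire paired with itself, and Stata treats missing as larger than any number. The following is a minimal sketch on hypothetical toy data (three questionnaires, four items named `q1`-`q4`, none of which come from the survey) that makes both visible. It assumes plain `.` is the only missing code, so the extended codes `.a`, `.b`, `.c` from the thread are left out.

```stata
* Hypothetical toy data, for illustration only
clear
input id q1 q2 q3 q4
1 1 2 3 4
2 1 2 3 .
3 . . . 9
end

unab vlist: q*
sort id
tempfile f
qui save "`f'"

* Pair every questionnaire (id2) with every questionnaire (id)
rename id id2
cross using "`f'"
gen diffid = id != id2
sort id diffid id2          // self-pair comes first within each id

gen nmatch = 0
gen total  = 0
foreach v of local vlist {
    * identical and nonmissing
    qui by id: replace nmatch = nmatch + (`v'[1] == `v' & `v' < .)
    * both values nonmissing
    qui by id: replace total  = total  + (`v' < . & `v'[1] < .)
}

gen prop = 100 * nmatch / total

* Without "& prop < ." a pair with no comparable items would be flagged,
* because a missing prop satisfies "prop > 80"
gen byte similar = (prop > 80) & (prop < .)

list id id2 nmatch total prop similar, noobs sepby(id)
```

In this toy data, questionnaires 2 and 3 share no item on which both are nonmissing, so `total` is 0 and `prop` is missing for that pair; the `prop < .` condition is what keeps it from being listed as similar.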