
# st: Re: using Stata to detect interviewer fraud

From: Mike Lacy
To: statalist@hsphsun2.harvard.edu
Subject: st: Re: using Stata to detect interviewer fraud
Date: Sat, 01 May 2010 13:10:25 -0600

```

>Date: Fri, 30 Apr 2010 23:16:14 -0400
>From: "Michelson, Ethan" <emichels@indiana.edu>
>Subject: st: using Stata to detect interviewer fraud

```
>I'd be deeply grateful for help writing a more efficient, more parsimonious .do file to help detect interviewer fraud. After completing a survey of 2,500 households, I discovered that a few interviewers copied each others' questionnaires. I decided to write some code that calculates the proportion of all nonmissing questionnaire items that are identical across every other questionnaire. Although my .do file accomplishes this task, I strongly suspect I'm making Stata do tons of unnecessary work. It takes Stata about 12 hours to process 505 questionnaires (from a single survey site, since I can rule out the possibility that interviewers conspired across different survey sites).....
>
>In the following code, "id" is the unique questionnaire id. There are 505 questionnaires in this batch. The final command at the bottom asks Stata to list combinations of questionnaires with >80% identical content. I have no doubt there's a far more efficient way to do this. I'd really appreciate any advice anyone can offer.

... snip, snip

A generalized version of -matrix dissimilarity- would solve this, since it returns a matrix of matching coefficients between all pairs of respondents, but unfortunately it will only do this for binary variables. I recently needed a replacement of this kind and wrote what is doubtless a clumsy bit of Mata code. It will do Ethan's problem in 30 seconds or so on my old Wintel laptop. I'd welcome comments or improvements on the code below, both because this is part of what I need to do in another context and because I think a good program to accomplish this end would serve a larger purpose.
```
clear all
// Create some simulated questionnaire data to work on.
set obs 505
local nvars = 100 // number of variables
local ncat = 2    // number of response categories for each variable
forval i = 1/`nvars' {
    gen byte q`i' = 1 + trunc(runiform() * `ncat')
}
//
// Mata program that returns a Stata matrix (Respondent X Respondent) of the
// proportion of matches across a list of variables. This is essentially a replacement
// for -matrix dissim-, which can only do matching coefficients for
// binary variables
//
mata mata clear
mata:
void mat_match(string scalar varlist,   // list of variables across which to match
               string scalar stmatname) // name of Stata matrix for results
{
    st_view(X=., ., tokens(varlist)) // tokens() splits the string into a row vector
    nsubj = rows(X)
    nvar = cols(X)
    M = J(nsubj, nsubj, 0)
    for (j = 1; j <= nvar; j++) {
        for (ego = 1; ego <= nsubj; ego++) {
            for (alter = 1; alter <= nsubj; alter++) {
                if (X[ego,j] == X[alter,j]) {
                    M[ego,alter] = M[ego,alter] + 1
                }
            }
        }
    }
    M = M/nvar  // proportion
    st_matrix(stmatname, M)
}
end
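//
// A possible alternative, offered as an untested sketch rather than a
// replacement. mat_match_nomiss is a name made up here. It drops the two
// inner loops by comparing each column of X with its own transpose, which
// Mata's colon operators expand to an nsubj x nsubj matrix. It assumes
// numeric items and, following Ethan's description, counts only items on
// which both respondents are nonmissing, whereas mat_match above treats
// two identical missing values as a match.
mata:
void mat_match_nomiss(string scalar varlist, string scalar stmatname)
{
    real matrix X, M, N
    real colvector ok
    real scalar j

    st_view(X, ., tokens(varlist))
    M = J(rows(X), rows(X), 0)      // matching items per pair
    N = J(rows(X), rows(X), 0)      // jointly nonmissing items per pair
    for (j = 1; j <= cols(X); j++) {
        ok = X[., j] :< .           // 1 where item j is nonmissing
        N = N + (ok :& ok')
        M = M + ((X[., j] :== X[., j]') :& (ok :& ok'))
    }
    st_matrix(stmatname, M :/ N)    // missing if a pair shares no nonmissing items
}
end
// It would be called the same way as mat_match, with a variable list and
// the name of the Stata matrix to hold the result.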
//
//
// Illustrate use: Feed the list of variables created above to mat_match, return matrix
// of matching proportions in Stata matrix "M"
quiet unab varlist: q*
mata: mat_match("`varlist'", "M")
//
// Inspect the matching matrix to find excessive matches. This could
// be included in the Mata program, but I only need the matrix. Cases
// here are ID'd by case number, not by a true id number.
clear
svmat M
gen str HighMatch = ""
local toomuch = 0.8
foreach M of varlist M* {
    // skip the diagonal, since every case matches itself on all items
    quiet replace HighMatch = HighMatch + "`M'" + " " if (`M' > `toomuch') & ("`M'" != "M" + string(_n))
}
edit HighMatch

```

Regards,

Mike Lacy
Dept. of Sociology