st: Re: using Stata to detect interviewer fraud
From: Mike Lacy <[email protected]>
To: [email protected]
Date: Sat, 01 May 2010 13:10:25 -0600

>Date: Fri, 30 Apr 2010 23:16:14 -0400
>From: "Michelson, Ethan" <[email protected]>
>Subject: st: using Stata to detect interviewer fraud
>
>I'd be deeply grateful for help writing a more efficient, more
>parsimonious .do file to help detect interviewer fraud. After
>completing a survey of 2,500 households, I discovered that a few
>interviewers copied each others' questionnaires. I decided to write
>some code that calculates the proportion of all nonmissing
>questionnaire items that are identical across every other
>questionnaire. Although my .do file accomplishes this task, I
>strongly suspect I'm making Stata do tons of unnecessary work. It
>takes Stata about 12 hours to process 505 questionnaires (from a
>single survey site, since I can rule out the possibility that
>interviewers conspired across different survey sites).....
>
>In the following code, "id" is the unique questionnaire id. There
>are 505 questionnaires in this batch. The final command at the
>bottom asks Stata to list combinations of questionnaires with >80%
>identical content. I have no doubt there's a far more efficient way
>to do this. I'd really appreciate any advice anyone can offer.
... snip, snip
A generalized version of -matrix dissimilarity- would solve this,
since it returns a matrix of matching coefficients between all
pairs of respondents, but unfortunately it does this only for
binary variables. I recently needed a replacement of this kind and
wrote what is doubtless a clumsy bit of Mata code. It handles
Ethan's problem in 30 seconds or so on my old Wintel laptop. I'd
welcome comments or improvements on the code below, because this is
part of what I need to do in another context, and because I think a
good program to accomplish this end would serve a larger purpose.

clear all
// Create some simulated questionnaire data to work on.
set obs 505
local nvars = 100 // number of variables
local ncat = 2    // number of response categories for each variable
forval i = 1/`nvars' {
  gen byte q`i' = 1 + trunc(runiform() * `ncat')
}
//
// Mata program that returns a Stata matrix (Respondent X Respondent)
// of the proportion of matches across a list of variables. This is
// essentially a replacement for -matrix dissim-, which can only do
// matching coefficients for binary variables.
//
mata mata clear
mata:
void mat_match(string scalar varlist,    // list of variables across which to match
               string scalar stmatname)  // name of Stata matrix for results
{
   // tokens() splits the string into a row vector of variable names
   st_view(X=., ., tokens(varlist))
   nsubj = rows(X)
   nvar = cols(X)
   M = J(nsubj, nsubj, 0)
   for (j = 1; j <= nvar; j++ ) {
     for (ego = 1; ego <=nsubj; ego++) {
        for (alter = 1; alter <= nsubj; alter++) {
           if (X[ego,j] == X[alter,j]) {
              M[ego,alter] = M[ego,alter] + 1
           }
        }
     }
   }
   M = M/nvar  // proportion
   st_matrix(stmatname,M)
}
end
//
//
// Illustrate use: Feed the list of variables created above to mat_match,
// return matrix of matching proportions in Stata matrix "M"
quiet unab varlist: q*
mata: mat_match("`varlist'", "M")
//
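// Inspect the matching matrix to find excessive matches. This could
// be included in the Mata program, but I only need the matrix. Cases
// here are ID'd by case number, not by a true id number. Note that the
// diagonal of M is 1 (each case matches itself), so every row's
// HighMatch includes its own column.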
clear
svmat M
gen str HighMatch = ""
local toomuch = 0.8
foreach M of varlist M* {
  quiet replace HighMatch = HighMatch + "`M'" + " " if (`M' > `toomuch')
}
edit HighMatch
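
One caveat: mat_match() above treats two missing values as a match and
divides by the total number of variables, whereas Ethan asked for the
proportion of nonmissing items that are identical. A minimal sketch of
a variant follows, assuming missing answers are stored as Stata numeric
missing values. The name mat_match_nomiss() and the small test matrix
are hypothetical, not part of the original code. It takes a Mata matrix
directly, visits each pair once, and leaves the diagonal missing so a
case is never flagged against itself.

```stata
mata:
// Hypothetical variant of mat_match(): for each pair of respondents,
// the proportion of identical answers among the items that BOTH
// answered (nonmissing). X has one row per respondent.
real matrix mat_match_nomiss(real matrix X)
{
    real scalar    nsubj, ego, alter, nboth
    real rowvector both
    real matrix    M

    nsubj = rows(X)
    M = J(nsubj, nsubj, .)     // missing = diagonal, or no shared items
    for (ego = 1; ego <= nsubj; ego++) {
        for (alter = ego + 1; alter <= nsubj; alter++) {
            // 1 where both respondents gave a nonmissing answer
            both  = (X[ego, .] :< .) :& (X[alter, .] :< .)
            nboth = sum(both)
            if (nboth > 0) {
                M[ego, alter] = sum(both :& (X[ego, .] :== X[alter, .])) / nboth
                M[alter, ego] = M[ego, alter]   // the matrix is symmetric
            }
        }
    }
    return(M)
}

// Tiny made-up example: 3 respondents, 4 items, one missing answer
X = (1, 2, 1, .) \ (1, 2, 2, 2) \ (1, 2, 1, 1)
M = mat_match_nomiss(X)

// List each pair once. The "< ." test skips pairs with no shared items,
// because missing sorts above every number and would pass "> 0.8".
for (i = 1; i <= rows(M); i++) {
    for (j = i + 1; j <= cols(M); j++) {
        if (M[i, j] < . & M[i, j] > 0.8) {
            printf("cases %g and %g: %5.3f\n", i, j, M[i, j])
        }
    }
}
end
```

On the test matrix only cases 1 and 3 exceed the threshold, since they
agree on all three items that both answered. To try this on the data in
memory, something like -mata: X = st_data(., tokens("`varlist'"))- run
before the -clear- should supply the input matrix.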
Regards,
Mike Lacy
Dept. of Sociology
Colorado State University
Fort Collins CO 80521 