


st: Re: using Stata to detect interviewer fraud

From   Mike Lacy <>
Subject   st: Re: using Stata to detect interviewer fraud
Date   Sat, 01 May 2010 13:10:25 -0600

>Date: Fri, 30 Apr 2010 23:16:14 -0400
>From: "Michelson, Ethan" <
>Subject: st: using Stata to detect interviewer fraud

>I'd be deeply grateful for help writing a more efficient, more parsimonious .do file to help detect interviewer fraud. After completing a survey of 2,500 households, I discovered that a few interviewers copied each others' questionnaires. I decided to write some code that calculates the proportion of all nonmissing questionnaire items that are identical across every other questionnaire. Although my .do file accomplishes this task, I strongly suspect I'm making Stata do tons of unnecessary work. It takes Stata about 12 hours to process 505 questionnaires (from a single survey site, since I can rule out the possibility that interviewers conspired across different survey sites).....

>In the following code, "id" is the unique questionnaire id. There are 505 questionnaires in this batch. The final command at the bottom asks Stata to list combinations of questionnaires with >80% identical content. I have no doubt there's a far more efficient way to do this. I'd really appreciate any advice anyone can offer.
... snip, snip

A generalized version of -matrix dissimilarity- would solve this, since it would return a matrix of matching coefficients between all pairs of respondents, but unfortunately the existing command computes matching coefficients only for binary variables. I recently needed a replacement of this kind and wrote what is doubtless a clumsy bit of Mata code. It will do Ethan's problem in 30 seconds or so on my old Wintel laptop. I'd welcome comments or improvements on the code below, because this is part of what I need to do in another context, and because I think a good program to accomplish this end would serve a larger purpose.

clear all
// Create some simulated questionnaire data to work on.
set obs 505
local nvars = 100 // number of variables
local ncat = 2    // number of response categories for each variable
forval i = 1/`nvars' {
  gen byte q`i' = 1 + trunc(runiform() * `ncat')
}
// Mata program that returns a Stata matrix (Respondent X Respondent) of the proportion of
// matches across a list of variables. This is essentially a replacement
// for -matrix dissim-, which can only do matching coefficients for
// binary variables
mata mata clear
mata:
void mat_match
   (string scalar varlist,   // list of variables across which to match
    string scalar stmatname) // name of Stata matrix for results
{
   st_view(X=., ., tokens(varlist)) // tokens splits the string into a row vector
   nsubj = rows(X)
   nvar = cols(X)
   M = J(nsubj, nsubj, 0)
   // Count, for every pair of respondents, the variables on which they agree.
   for (j = 1; j <= nvar; j++) {
     for (ego = 1; ego <= nsubj; ego++) {
        for (alter = 1; alter <= nsubj; alter++) {
           if (X[ego,j] == X[alter,j]) {
              M[ego,alter] = M[ego,alter] + 1
           }
        }
     }
   }
   M = M/nvar  // proportion
   st_matrix(stmatname, M) // hand the result back to Stata under the requested name
}
end
// Illustrate use: Feed the list of variables created above to mat_match, return matrix of matching
// proportions in Stata matrix "M"
quiet unab varlist: q*
mata: mat_match("`varlist'", "M")
// Inspect the matching matrix to find excessive matches. This could
// be included in the Mata program, but I only need the matrix. Cases
// here are ID'd by case number, not by a true id number.
svmat M
gen str HighMatch = ""
local toomuch = 0.8
// Note: every case matches itself perfectly, so its own column is always listed.
foreach M of varlist M* {
  quiet replace HighMatch = HighMatch + "`M'" + " " if (`M' > `toomuch')
}
edit HighMatch
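
One possible variation, offered only as a sketch: Ethan asked for the proportion of nonmissing items that are identical, whereas the loop above divides by the total number of variables and treats two missing answers as a match. The version below compares one respondent's row against the whole data matrix at a time and divides by the number of items both respondents answered. The name mat_match_nm and the result matrix name "M2" are made up for illustration, and the code has not been timed against the loop version.

```stata
mata:
// Sketch only: mat_match_nm is a hypothetical name, not part of the code above.
// Stores in Stata matrix stmatname the share of jointly nonmissing items
// on which each pair of respondents gives the same answer.
void mat_match_nm(string scalar varlist, string scalar stmatname)
{
   real matrix X, M, ok, both, same
   real scalar ego

   st_view(X=., ., tokens(varlist))
   ok = (X :< .)                        // 1 where an item is nonmissing
   M  = J(rows(X), rows(X), .)
   for (ego = 1; ego <= rows(X); ego++) {
      both = ok :& ok[ego, .]           // item answered by ego and by alter
      same = (X :== X[ego, .]) :& both  // ... and answered identically
      // A pair with no jointly answered items gets a missing value.
      M[ego, .] = (rowsum(same) :/ rowsum(both))'
   }
   st_matrix(stmatname, M)
}
end

quiet unab varlist: q*
mata: mat_match_nm("`varlist'", "M2")
```

With no missing data, as in the simulated example, this should give the same proportions as mat_match; the two differ only when some items are unanswered.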


Mike Lacy
Dept. of Sociology
Colorado State University
Fort Collins CO 80521
