st: Re: using Stata to detect interviewer fraud

From	Mike Lacy <[email protected]>
To	[email protected]
Subject	st: Re: using Stata to detect interviewer fraud
Date	Sat, 01 May 2010 13:10:25 -0600



>Date: Fri, 30 Apr 2010 23:16:14 -0400
>From: "Michelson, Ethan" <[email protected]>
>Subject: st: using Stata to detect interviewer fraud

>I'd be deeply grateful for help writing a more efficient, more parsimonious .do file to help detect interviewer fraud. After completing a survey of 2,500 households, I discovered that a few interviewers copied each other's questionnaires. I decided to write some code that calculates the proportion of all nonmissing questionnaire items that are identical across every other questionnaire. Although my .do file accomplishes this task, I strongly suspect I'm making Stata do tons of unnecessary work. It takes Stata about 12 hours to process 505 questionnaires (from a single survey site, since I can rule out the possibility that interviewers conspired across different survey sites).....

>In the following code, "id" is the unique questionnaire id. There are 505 questionnaires in this batch. The final command at the bottom asks Stata to list combinations of questionnaires with >80% identical content. I have no doubt there's a far more efficient way to do this. I'd really appreciate any advice anyone can offer.

... snip, snip

A generalized version of -matrix dissimilarity- would solve this, since it will return a matrix of matching coefficients between all pairs of respondents, but unfortunately it will only do this for binary variables. I recently needed a replacement of this kind, and wrote what is doubtless a clumsy bit of Mata code. It will do Ethan's problem in 30 sec. or so on my old Wintel laptop. I'd welcome comments or improvements on the code below, because this is a part of what I need to do in another context, and because I think a good program to accomplish this end would serve a larger purpose.
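
To make the quantity concrete: the matching coefficient for a pair of respondents is the share of items on which their answers agree. A minimal Mata sketch, using two made-up answer vectors (a, b, and prop are illustrative names, not part of the survey data or of the program below):

```stata
mata:
a = (1, 2, 2, 1, 3)            // hypothetical answers of respondent A
b = (1, 2, 1, 1, 2)            // hypothetical answers of respondent B
prop = sum(a :== b) / cols(a)  // items that agree / number of items
prop                           // 3 of the 5 items agree, i.e. 0.6
end
```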


clear all
// Create some simulated questionnaire data to work on.
set obs 505
local nvars = 100 // number of variables
local ncat = 2    // number of response categories for each variable
forval i = 1/`nvars' {
  gen byte q`i' = 1 + trunc(runiform() * `ncat')
}
//

// Mata program that returns a Stata matrix (Respondent X Respondent) of the
// proportion of matches across a list of variables. This is essentially a
// replacement for -matrix dissim-, which can only do matching coefficients
// for binary variables.
//
mata mata clear
mata:
void mat_match ///
   (string scalar varlist,   // list of variables across which to match
    string scalar stmatname) // name of Stata matrix for results
//
{

   st_view(X=., ., tokens(varlist)) // tokens() splits the string into a row vector

   nsubj = rows(X)
   nvar = cols(X)
   M = J(nsubj, nsubj, 0)
   for (j = 1; j <= nvar; j++ ) {
     for (ego = 1; ego <= nsubj; ego++) {
        for (alter = 1; alter <= nsubj; alter++) {
           if (X[ego,j] == X[alter,j]) {
              M[ego,alter] = M[ego,alter] + 1
           }
        }
     }
   }
   M = M/nvar  // proportion
   st_matrix(stmatname,M)
}
end
//
//

// Illustrate use: Feed the list of variables created above to mat_match,
// return matrix of matching proportions in Stata matrix "M"
quiet unab varlist: q*
mata: mat_match("`varlist'", "M")
//
// Inspect the matching matrix to find excessive matches. This could
// be included in the Mata program, but I only need the matrix. Cases
// here are ID'd by case number, not by a true id number. Each case
// matches itself perfectly, so its own column always appears in its list.
clear
svmat M
gen str HighMatch = ""
local toomuch = 0.8
foreach M of varlist M* {
  quiet replace HighMatch = HighMatch + "`M'" + " " if (`M' > `toomuch')
}
edit HighMatch
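
One possible variant, offered as a sketch under stated assumptions rather than a drop-in replacement: the original question asked for the proportion of nonmissing items that are identical, whereas mat_match divides by the total number of variables and counts two system-missing answers as a match (in Mata, . == . is true). The function below, hypothetically named mat_match_nm, counts an item only when both respondents answered it, and fills only the upper triangle before mirroring it, since the matrix is symmetric. A second hypothetical helper, high_pairs, lists each suspicious pair once and skips the diagonal. Both names and the "both nonmissing" rule are assumptions for illustration.

```stata
mata:
// Hypothetical variant of mat_match: proportion of matching answers among
// the items that BOTH respondents answered (assumption, see text above).
void mat_match_nm(string scalar varlist, string scalar stmatname)
{
    real matrix X, M
    real scalar nsubj, ego, alter, k
    real rowvector both

    X = st_data(., tokens(varlist))
    nsubj = rows(X)
    M = I(nsubj)                     // diagonal: each case matches itself
    for (ego = 1; ego < nsubj; ego++) {
        for (alter = ego + 1; alter <= nsubj; alter++) {
            // 1 where neither answer is missing
            both = (X[ego, .] :< .) :& (X[alter, .] :< .)
            k = sum(both)
            // missing result if the pair has no item in common
            M[ego, alter] = (k ? sum(both :& (X[ego, .] :== X[alter, .])) / k : .)
            M[alter, ego] = M[ego, alter]
        }
    }
    st_matrix(stmatname, M)
}

// Hypothetical helper: rows of (case i, case j, proportion) for every pair
// with i < j whose proportion exceeds cutoff; missing entries are skipped.
real matrix high_pairs(real matrix M, real scalar cutoff)
{
    real matrix P
    real scalar i, j

    P = J(0, 3, .)
    for (i = 1; i < rows(M); i++) {
        for (j = i + 1; j <= cols(M); j++) {
            if (M[i, j] > cutoff & M[i, j] < .) P = P \ (i, j, M[i, j])
        }
    }
    return(P)
}
end
```

These would need to be run while the q* variables are still in memory (that is, before the -clear- above), for example with: mata: mat_match_nm("`varlist'", "M2") followed by mata: high_pairs(st_matrix("M2"), 0.8). Cases are again identified by observation number, not by a true id.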


Regards,

Mike Lacy
Dept. of Sociology
Colorado State University

Fort Collins CO 80521

