# st: Using Multiple Imputation in a very large dataset

From: Raquel Rangel de Meireles Guimarães
To: statalist@hsphsun2.harvard.edu
Subject: st: Using Multiple Imputation in a very large dataset
Date: Fri, 03 Jun 2011 12:36:31 -0300

```
Dear Stata users,

I am using Stata/MP (dual-core, 64-bit) on Windows 7. I have 4 GB of RAM,
but I have allocated 5 GB of memory to store my data.

I am interested in modeling the determinants of school performance. I
have data for 1,939,147 students. My dependent variable is reading
proficiency (fully observed), and I have the students' individual
characteristics (gender, race, and age, all fully observed) and the
scores for the socioeconomic constructs (socioeconomic level, student
motivation, parental involvement, cultural capital), which were obtained
via Item Response Theory.

I would like to impute values for the socioeconomic characteristics
conditional on the student's proficiency, gender, race, and age.

My data can be found at the following website:

I would like to impute these values because otherwise I will lose a lot
of students from my study when running regressions.

Here are descriptive statistics for my fully observed variables X:

                     Variable |      Obs       Mean  Std. Dev.   Min      Max
------------------------------+----------------------------------------------
                       cod_uf |  1939147   32.73305   9.495694    11       53
                       região |  1939147   2.898205   1.031377     1        5
                    qn1 (sex) |  1939147   1.499852   .5000001     1        2
                   qn2 (race) |  1939147   1.941047   .9668508     1        5
             qn4 (age groups) |  1939147   3.752611   1.195527     1        8
------------------------------+----------------------------------------------
leitura (reading proficiency) |  1939147   175.8849   41.19135     0   347.36

Here is the misstable summary of my missing values:

. misstable sum capitalcultural envolvimento motivacao nse
                                                              Obs<.
                                                 +------------------------------
                |                                | Unique
       Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
----------------+--------------------------------+------------------------------
capitalcultural |    42,986                1896161  |    371     -1.662       1.662
   envolvimento |    20,302                1918845  |     19     -1.178       1.178
      motivacao |    37,507                1901640  |     15     -1.014        .672
            nse |     6,092                1933055  |   >500      -2.02        2.02
--------------------------------------------------------------------------------

Here is my procedure to do multiple imputation:

mi set mlong
mi register imputed capitalcultural envolvimento motivacao nse
mi register regular leitura qn1 qn2 qn4
tab qn1, g(sexo)
tab qn2, g(raca)
tab região, g(regiao)
xi: mi impute reg capitalcultural = leitura sexo1 raca1 regiao1 qn4,

I got the following error message: insufficient disk space r(699)

Could anyone please help me? Is there another imputation technique I
could use? Hot-deck imputation would not be useful, since the variables
to be imputed are not categorical.

Kind regards,

Raquel

--
Raquel Rangel de Meireles Guimarães
Substitute Professor, Department of Demography, UFMG
PhD candidate in Demography