
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



RE: st: dynamic line execution in mata


From   Andrew Maurer <Andrew.Maurer@qrm.com>
To   Statalist Statalist <statalist@hsphsun2.harvard.edu>
Subject   RE: st: dynamic line execution in mata
Date   Wed, 12 Feb 2014 18:53:03 +0000

Thanks for the responses. Nick's solution to 1) worked and the solution to 2) was that I should have been using st_sstore() rather than st_store(). For 3), I haven't found a way to recover additional dataset metadata like characteristics.
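
For readers following points 1) and 2), here is a minimal sketch of the string/numeric split and of labeling a variable from Mata. It is not the code from the thread: the variable names make_copy and price_copy are made up for illustration, and it uses st_varlabel() as an alternative to building a quoted -label var- command with stata().

```stata
sysuse auto, clear
mata:
names  = st_sdata(., "make")     // string column vector
prices = st_data(., "price")     // real column vector

// string data must go through st_sstore(); st_store() accepts only reals
strtype = "str" + strofreal(max(strlen(names)))
st_sstore(., st_addvar(strtype, "make_copy"), names)
st_store(., st_addvar("long", "price_copy"), prices)

// set variable labels directly, with no nested quotes to manage
st_varlabel("make_copy", "Copy of make")
st_varlabel("price_copy", "Copy of price")
end
describe make_copy price_copy
```

Calling st_store() on the string vector instead is what produces the "nonreal found where real required" error described in question 2).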

I suspect the way I have written the code now is far from good practice. The code I previously posted required two steps: 1) saving the "if" observations to a Mata file, and 2) later recovering the data from the Mata file and saving it as a .dta file. To do everything in one step, I now write a temporary do-file mid-program for step 2 and then shell out to a new instance of Stata to run the Mata-to-.dta conversion.
 
The working code is below. I added support for variable and value labels. The shell-escape portion will only work on Windows, and for now I've hardcoded Stata's executable name, "StataMP-64.exe". Maybe there's a c() value (see -creturn list-) that returns the executable name, but I haven't found one. If anyone has advice on efficiency or coding style, I'd be very interested to hear it.
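
One possible way to make the Windows-only assumption explicit and avoid hardcoding the flavor is sketched below. The c(os), c(MP), c(SE) and c(bit) values are standard creturn results, but the mapping from them to an executable name is an assumption based only on the "StataMP-64.exe" pattern mentioned above; check it against the files actually present in the Stata directory.

```stata
* stop early on platforms the shell-escape step was not written for
if "`c(os)'" != "Windows" {
    display as error "the shell-escape step is written for Windows only"
    exit 198
}

* ASSUMPTION: executables are named Stata[MP|SE][-64].exe
local exe = cond(c(MP) == 1, "StataMP", cond(c(SE) == 1, "StataSE", "Stata"))
if c(bit) == 64 local exe "`exe'-64"
local exe "`exe'.exe"
display "guessed executable name: `exe'"
```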

Responding to Phil: I will use this enough to make programming it worthwhile (with the side benefit of more experience programming in Stata and Mata, since I'm still learning Mata). I could still look into your idea of saving an empty "shell" dataset so that the program retains all of the metadata that -save- does, although for my purposes I don't typically use dataset labels or characteristics.
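
On the characteristics part of question 3), one way to list and loop over them mid-program is the -: char- extended macro function, which returns the characteristic names when the brackets are left empty. This is a minimal sketch; the characteristics "source" and "units" are hypothetical and are created here only so there is something to list.

```stata
sysuse auto, clear
char _dta[source] "hypothetical example characteristic"
char price[units] "USD"

* names of all dataset-level characteristics, then their contents
local dtachars : char _dta[]
foreach c of local dtachars {
    display `"_dta[`c'] = `: char _dta[`c']'"'
}

* the same for characteristics attached to a single variable
local pricechars : char price[]
foreach c of local pricechars {
    display `"price[`c'] = `: char price[`c']'"'
}
```

The names and contents could be collected into Mata string vectors in the same loop that already gathers the variable and value labels in saveif.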

Thank you!
Andrew Maurer





********* Begin Code **********************
cap program drop saveif
program define saveif
	syntax varlist [if] [in] using/, [replace]
	
	**
	* Part 1 - send observations to mata and write to file with fputmatrix() 
	**
		
	* send varlist to mata
	putmata `varlist' `if' `in', view
	
	* create mata objects for:
	*	1) variable names 2) storage types 3) var labels 4) value labels
	*	5) the data itself
	forval i = 1/`: word count `varlist'' {
		if `i' == 1 {
			mata: varnames = "`: word `i' of `varlist''"
			mata: vartypes = "`: type `: word `i' of `varlist'''"
			mata: varlabels = "`: variable label `: word `i' of `varlist'''"
			mata: vallabels = "`: value label `: word `i' of `varlist'''"
			mata: varpointers = &`: word `i' of `varlist'' // pointers
		}
		else {
			mata: varnames = varnames,"`: word `i' of `varlist''"
			mata: vartypes = vartypes,"`: type `: word `i' of `varlist'''"
			mata: varlabels = varlabels,"`: variable label `: word `i' of `varlist'''"
			mata: vallabels = vallabels,"`: value label `: word `i' of `varlist'''"
			mata: varpointers = varpointers,&`: word `i' of `varlist'' // pointers
		}
		* save value labels to temporary files (error if they already exist)
		if "`: value label `: word `i' of `varlist'''" != "" label save `: value label `: word `i' of `varlist''' using `"`=c(tmpdir)'`: value label `: word `i' of `varlist'''"'
	}

	
	* check that `using' file does not already exist
	cap confirm new file "`using'"
	if _rc != 0 {
		if "`replace'" == "replace" rm "`using'"
		else {
			di as error "file `using' already exists"
			exit 602
		}
	}

	* save the created objects to a file
	tempfile matadata //tempfile location to pass to second instance of stata
	mata: matadata = st_local("matadata")
	mata: fh = fopen("`matadata'", "w")
	mata: fputmatrix(fh, varnames)
	mata: fputmatrix(fh, vartypes)
	mata: fputmatrix(fh, varlabels)
	mata: fputmatrix(fh, vallabels)
	mata: fputmatrix(fh, varpointers)
	
	mata: fclose(fh)
	
	**
	* Part 2 - shell escape, open up second instance of stata to read in the
	*	data and save it as a dta
	**
	
	tempfile convert_to_dta
	
	file open fid using `convert_to_dta', write
	
	file write fid ///
	`"capture mata mata drop recover_from_saveif()"' _n ///
	`"mata:"' _n ///
	`"void recover_from_saveif()"' _n ///
	`"{"' _n ///
	`""' _n ///
	`"      fileloc = "`matadata'""' _n ///
	`"      fh = fopen(fileloc, "r")"' _n ///
	`"      "' _n ///
	`"      varnames = fgetmatrix(fh) "' _n ///
	`"      vartypes = fgetmatrix(fh) "' _n ///
	`"      varlabels = fgetmatrix(fh) "' _n ///
	`"      vallabels = fgetmatrix(fh) "' _n ///
	`"      varpointers = fgetmatrix(fh) "' _n ///
	`"      //matadata = fgetmatrix(fh) "' _n ///
	`"      varcount = cols(varnames)"' _n ///
	`"      "' _n ///
	`"      fclose(fh)"' _n ///
	`"      "' _n ///
	`"      // foreach var of varnames, load var into stata with correct variable type"' _n ///
	`"      "' _n ///
	`"      for (i=1; i<=varcount;i++) {"' _n ///
	`"              thisvarname = varnames[1,i] // eg contains "date""' _n ///
	`"              thisvartype = vartypes[1,i] // eg contains "int""' _n ///
	`"              thisvarlabel = varlabels[1,i] // eg contains "Date""' _n ///
	`"              thisvallabel = vallabels[1,i] "' _n ///
	`"              thisvar = varpointers[1,i] // eg pointer to date vector"' _n ///
	`"              if (i == 1) st_addobs(rows(*thisvar))"' _n ///
	`"              if (!strmatch(thisvartype,"str*")) st_store(., st_addvar(thisvartype,thisvarname),*thisvar) "' _n ///
	`"              else st_sstore(., st_addvar(thisvartype,thisvarname),*thisvar) "' _n ///
	`"              stata("label var " + thisvarname + `" ""' + thisvarlabel + `"""')"' _n ///
	`"              if (vallabels[1,i] != "") {"' _n ///
	`"                      stata("local tmpdir = c(tmpdir)")"' _n ///
	`"                      stata(`"qui do ""' + st_local("tmpdir") + thisvallabel + `".do""')"' _n ///
	`"                      stata("label values " + thisvarname + " " + `"""' + thisvallabel + `"""')"' _n ///
	`"                      // delete the temporary label file created earlier"' _n ///
	`"                      stata(`"rm ""' + st_local("tmpdir") + thisvallabel + `".do""')"' _n ///
	`"              }"' _n ///
	`"      }"' _n ///
	`""' _n ///
	`"}"' _n ///
	`"end"' _n ///
	`""' _n ///
	`"mata: recover_from_saveif()"' _n ///
	`"cd "`c(pwd)'""' _n /// 
	`"save "`using'", replace"' _n ///
	`""' ///
	
	file close fid
	
	!"`c(sysdir_stata)'\StataMP-64.exe" /e do "`convert_to_dta'"
		
end	

* Sample execution using built-in dataset
sysuse auto.dta, clear
saveif * in 1/5 using test.dta, replace

********* End Code ************************





-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Phil Schumm
Sent: Tuesday, February 11, 2014 12:42 PM
To: Statalist Statalist
Subject: Re: st: dynamic line execution in mata

On Feb 11, 2014, at 12:07 PM, Andrew Maurer <Andrew.Maurer@qrm.com> wrote:
> That would be great if someone has done this before, but I haven't found any user-written programs that do this.


I don't believe I've seen any either.


> I have at least the barebones working using pointers (see updated code below with example execution using auto.dta). However, does anyone have advice on a few additional issues I'm having with mata:
> 
> 1) How can I label a Stata variable using Mata objects for the variable name and label? E.g., in recover_from_saveif() I have the variable name stored as a string in thisvarname and the label string stored in thisvarlabel. Does Mata have syntax such as the following for building up the line piece by piece? (I'm not sure how to deal with the unmatched quotation marks to be sent to Stata.)
> 
> execute( `"stata(`"label var " "' + thisvarname + `"""' + thisvarlabel `"")"' )
> 
> 2) Are there issues with using st_store() for pointers to string variables? The recover_from_saveif() program works for numeric variables, but not string variables. The issue is in the line st_store(., st_addvar(thisvartype,thisvarname),*thisvar), which returns "nonreal found where real required" only when *thisvar points to string data rather than numeric data.
> 
> 3) Is there a way to view the source code for commands like "save" that do not have a corresponding save.ado or save.mata file in the ado/base directory? Responding to the issue you raised, Nick, of not having value labels and dataset characteristics, is there a way to list and loop through them in stata/mata? Eg, "char dir" lists all characteristics associated with the dataset, but doesn't post to rclass results. How can I access them mid-program?


Your questions all have answers, though I don't have the time to answer them now (apologies).  However, rather than coding all of this yourself, you might try a different approach.  Stata permits you to have a dataset with no observations, but with all of the other meta-data intact (e.g., variable names, labels, value labels, notes, etc.).  Thus, you could move all of the actual data (i.e., the values of each variable) into Mata, but nothing else; when you've done that, then you could delete all of the observations on the Stata side, leaving just the "shell" of the dataset but with all the meta-data.  Finally, to generate your subset dataset, move the data from the selected observations only back into Stata, and just use -save-.

Obviously, there are some issues to consider here.  First, if you move all of the data at once, you'll have two copies of the data in memory, and with a really large dataset you won't want to do that.  One way to get around this would be to move one variable at a time and write it to disk; then, when you've translated all of the variables into vectors stored on disk, delete all the observations in Stata, and then read the vectors back into memory (in Mata).  This wouldn't make sense if you just wanted to save one subset, but if you were saving many different subsets at a time, it might make sense.  Think of it as preprocessing the data into two parts: (1) a Stata-format shell with no data but all of the meta-data, and (2) the data only, saved as individual vectors (or 2 matrices, one numeric and one string).  Once you've got this, then a simple command could use these to generate the various subsets.

Whether this approach would make sense would depend entirely on your particular needs, which you haven't described in detail.  It might make sense if you were generating tens or hundreds of different subsets at a time, but not if you were generating only one at a time.  However, the amount of code required would be comparatively much smaller than for your approach above.
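
A minimal sketch of the "shell" idea for a single subset is given below. It is an illustration under assumptions, not Phil's implementation: the selection (observations 1 to 5) and the choice of variables are arbitrary, and the data are copied into Mata rather than viewed, so the selected rows exist twice in memory for a moment.

```stata
sysuse auto, clear
keep make price mpg foreign

mata:
sel  = (1::5)                     // hypothetical selection: observations 1-5
nums = ("price", "mpg", "foreign")
X = st_data(sel, nums)            // numeric values of the selected rows
S = st_sdata(sel, "make")         // string values of the selected rows
end

* empty the dataset; variables, labels, value labels and notes remain
drop if 1
describe, short

mata:
st_addobs(rows(X))                // refill the shell with the selected rows
st_store(., nums, X)
st_sstore(., "make", S)
end

tempfile subset
save "`subset'"
list, clean
```

Because the metadata never leaves Stata, nothing has to be written out and reattached by hand; the cost is that the full dataset is no longer in memory afterwards.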


> Ps - just doing a rough test on some sample data to get a benchmark, savesome took me 7.90s on a 1gb dataset, while saveif took 0.44s.


This is not surprising, given that -savesome- presumably uses -preserve-/-restore-.  However, computer time is cheap (especially on your own laptop/desktop) relative to person-time.  I presume that you anticipate needing to do enough of this that it makes sense to spend time programming this, as opposed to just running it while you're doing something else?  Part of the reason I ask is that this is the type of feature that is best implemented within Stata itself -- implementations using -preserve-/-restore- or moving the data back and forth between Stata and Mata will always be a very poor substitute.  Thus, if it were me, unless I really needed this, I would probably wait for StataCorp to add it.
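
For comparison, the -preserve-/-restore- idiom referred to above looks roughly like this. It is a sketch of the general pattern only, not the actual source of -savesome-.

```stata
sysuse auto, clear
tempfile subset

preserve                  // snapshot the full dataset
keep in 1/5               // reduce to the wanted observations
save "`subset'"
restore                   // bring the full dataset back
```

The time cost comes from -preserve- and -restore- handling the whole dataset, which is why it grows with the size of the full data rather than with the size of the subset.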


-- Phil


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/




