Re: st: dynamic line execution in mata
From: Phil Schumm <firstname.lastname@example.org>
To: Statalist <email@example.com>
Date: Tue, 11 Feb 2014 12:42:28 -0600
On Feb 11, 2014, at 12:07 PM, Andrew Maurer <Andrew.Maurer@qrm.com> wrote:
> That would be great if someone has done this before, but I haven't found any user-written programs that do this.
I don't believe I've seen any either.
> I have at least the barebones working using pointers (see updated code below with example execution using auto.dta). However, does anyone have advice on a few additional issues I'm having with mata:
> 1) How can I label a Stata variable using mata objects for the variable name and label? Eg) In recover_from_saveif() I have a string variable name stored in thisvarname and the label string stored in thisvarlabel. Does mata have syntax available such as the following in order to build up the line piece by piece? (I'm not sure how to deal with the unmatched quotation marks to be sent to Stata.)
> execute( `"stata(`"label var " "' + thisvarname + `"""' + thisvarlabel `"")"' )
> 2) Are there issues with using st_store() for pointers to string variables? The recover_from_saveif() program works for numeric variables, but not string variables. The issue is in the line st_store(., st_addvar(thisvartype,thisvarname),*thisvar), which returns "nonreal found where real required" only for *thisvar which points to string data and not numeric data.
> 3) Is there a way to view the source code for commands like "save" that do not have a corresponding save.ado or save.mata file in the ado/base directory? Responding to the issue you raised, Nick, of not having value labels and dataset characteristics, is there a way to list and loop through them in stata/mata? Eg, "char dir" lists all characteristics associated with the dataset, but doesn't post to rclass results. How can I access them mid-program?
Your questions all have answers, though I don't have the time to answer them now (apologies). However, rather than coding all of this yourself, you might try a different approach. Stata permits you to have a dataset with no observations, but with all of the other meta-data intact (e.g., variable names, labels, value labels, notes, etc.). Thus, you could move all of the actual data (i.e., the values of each variable) into Mata, but nothing else; when you've done that, then you could delete all of the observations on the Stata side, leaving just the "shell" of the dataset but with all the meta-data. Finally, to generate your subset dataset, move the data from the selected observations only back into Stata, and just use -save-.
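A minimal sketch of the "shell" idea, assuming auto.dta and an illustrative selection (foreign cars only). The names X, S, numidx, stridx, keeprow, and the output file auto_foreign are hypothetical, chosen only for this example. Numeric and string variables are moved separately because st_data()/st_store() handle only reals, while st_sdata()/st_sstore() handle strings:

```stata
sysuse auto, clear

mata:
// Classify variables by storage type (illustrative names throughout)
k = st_nvar()
isstr = J(1, k, 0)
for (j = 1; j <= k; j++) isstr[j] = st_isstrvar(j)
numidx = select(1..k, !isstr)      // indices of numeric variables
stridx = select(1..k, isstr)       // indices of string variables

// Copy the values (and nothing else) into Mata
X = st_data(., numidx)
S = st_sdata(., stridx)

// Assumed selection rule for this sketch: keep foreign cars
keeprow = st_data(., "foreign") :== 1
end

// Drop every observation; variable names, labels, value labels,
// notes, and characteristics stay behind as the "shell"
drop if 1

mata:
// Move only the selected rows back into the shell
Xs = select(X, keeprow)
Ss = select(S, keeprow)
st_addobs(rows(Xs))
st_store(., numidx, Xs)
st_sstore(., stridx, Ss)
end

save auto_foreign, replace
```

Note that between the two Mata blocks both copies of the data are briefly in memory, which is the first issue discussed next.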
Obviously, there are some issues to consider here. First, if you move all of the data at once, you'll have two copies of the data in memory, and with a really large dataset you won't want to do that. One way to get around this would be to move one variable at a time and write it to disk; then, when you've translated all of the variables into vectors stored on disk, delete all the observations in Stata, and then read the vectors back into memory (in Mata). This wouldn't make sense if you just wanted to save one subset, but if you were saving many different subsets at a time, it might. Think of it as preprocessing the data into two parts: (1) a Stata-format shell with no data but all of the meta-data, and (2) the data only, saved as individual vectors (or as two matrices, one numeric and one string). Once you've got this, a simple command could use the two parts to generate the various subsets.
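The one-variable-at-a-time preprocessing could be sketched roughly as follows, using Mata's fopen(), fputmatrix(), and fgetmatrix(). The ".vec" file naming and writing into the current working directory are assumptions made for this example, not anything prescribed above:

```stata
sysuse auto, clear

mata:
// Write each variable to its own file, one at a time, so that only
// one extra column is held in Mata at any moment
k = st_nvar()
for (j = 1; j <= k; j++) {
    fn = st_varname(j) + ".vec"    // hypothetical naming scheme
    unlink(fn)                     // fopen(..., "w") requires that the file not exist
    fh = fopen(fn, "w")
    if (st_isstrvar(j)) {
        fputmatrix(fh, st_sdata(., j))
    }
    else {
        fputmatrix(fh, st_data(., j))
    }
    fclose(fh)
}
end

// Keep only the metadata shell in Stata
drop if 1

mata:
// Read a single vector back when a subset is needed
fh = fopen("price.vec", "r")
price = fgetmatrix(fh)
fclose(fh)
end
```

A command that generates subsets would then read back only the vectors it needs, select rows, and store them into the shell with st_addobs() followed by st_store() or st_sstore().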
Whether this approach makes sense depends entirely on your particular needs, which you haven't described in detail. It might make sense if you were generating tens or hundreds of different subsets at a time, but not if you were generating only one at a time. However, it would require much less code than your approach above.
> Ps - just doing a rough test on some sample data to get a benchmark, savesome took me 7.90s on a 1gb dataset, while saveif took 0.44s.
This is not surprising, given that -savesome- presumably uses -preserve-/-restore-. However, computer time is cheap (especially on your own laptop/desktop) relative to person-time. I presume that you anticipate needing to do enough of this that it makes sense to spend time programming this, as opposed to just running it while you're doing something else? Part of the reason I ask is that this is the type of feature that is best implemented within Stata itself -- implementations using -preserve-/-restore- or moving the data back and forth between Stata and Mata will always be a very poor substitute. Thus, if it were me, unless I really needed this, I would probably wait for StataCorp to add it.