Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Is there a better way to transpose a dataset and automatically label variables based on a string variable?


From   John Bensin <johnalexbensin@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Is there a better way to transpose a dataset and automatically label variables based on a string variable?
Date   Fri, 21 Jun 2013 08:40:26 -0500

It looks like my email client mangled my first post, and I think the
formatting is relevant for the data sections, so I'll try this again.

I have several datasets that fit this general format (variable names
are in the first row):


varlabels             category     diff_category     category3
"Some label"        0.79221      0.67874         0.70605
"Another label"     0.95949      0.75774         0.031833
"Additional label"  0.65574      0.74313         0.27692
"Fourth label"      0.035712     0.39223         0.046171
"Fifth label"       0.84913      0.65548         0.097132
"Final label"       0.93399      0.17119         0.82346


My goal is to use Stata 12 to get them into this format:

_var1      _var2      _var3      _var4      _var5      _var6        cat
0.792210   0.959490   0.655740   0.035712   0.849130   0.933990
"category"
0.678740   0.757740   0.743130   0.392230   0.655480   0.171190
"diff category"
0.706050   0.031833   0.276920   0.046171   0.097132   0.823460
"category3"

where _var1 is labelled "Some label", _var2 is labelled "Another
label", etc. The datasets are different and numerous enough that
labelling the variables individually isn't possible. The names of the
category variables aren't predictable either.

My current solution is to improvise with Mata, xpose, and destring/tostring


clear
input str50 varlabels category diff_category category3
"Some label"        -18.2059     0.67874     105869.3
"Another label"     0.95949      598.165     0.031833
"Additional label"  0.65574      0.74313     0.27692
"Fourth label"        194          0.39223     0.046171
"Fifth label"       0.84913      394.069     -5968387
"Final label"       0.93399      0.17119     0.82346
end

putmata varlabels, replace

mata
    varlabels = varlabels'
end

xpose, clear varname
tostring v*, force usedisplayformat replace
getmata (v*)=varlabels, update force

foreach v of varlist v* {
    local name = `v'[1]
    lab var `v' "`name'"
}

drop if _varname == "varlabels"
destring, replace



Although this works, I question if there is a better way because I see
a few problems with this:

1. The combination of tostring/destring may cause a loss of precision,
according to Stata's warnings.

2. The --getmata-- call forces the vector of labels into the first
observation, which seems clunky. Is it safe to always rely on this
working or do I risk overwriting data?

I also pursued a solution using the --sxpose-- command from the SSC:


clear
input str50 varlabels category diff_category category3
"Some label"        0.79221      0.67874     0.70605
"Another label"     0.95949      0.75774     0.031833
"Additional label"  0.65574      0.74313     0.27692
"Fourth label"        0.035712     0.39223     0.046171
"Fifth label"       0.84913      0.65548     0.097132
"Final label"       0.93399      0.17119     0.82346
end

sxpose, clear force

foreach v of varlist _var* {
    local name = `v'[1]
    lab var `v' "`name'"
}


This costs me the names of the category variables, which I need to
preserve, as well as some accuracy. It yields data that look like
this:

_var1             _var2               _var3               _var4
       _var5            _var6
Some label        Another label      Additional label     Fourth label
     Fifth label        Final label
0.792209983        0.959490001        0.655740023        0.035712
     0.849129975        0.933990002
0.678740025        0.757740021        0.743130028        0.392230004
     0.655480027        0.171189994
0.706049979        0.031833001        0.276919991        0.046170998
     0.097131997        0.823459983


Any suggestions? I'm happy to continue with my Mata/xpose solution if
it won't cost me data, but if there's a better solution I'll use it.

John Bensin

On Fri, Jun 21, 2013 at 8:34 AM, John Bensin <johnalexbensin@gmail.com> wrote:
> I have several datasets that fit this general format (variable names
> are in the first row):
>
>         varlabels             category     diff_category     category3
>         "Some label"        0.79221      0.67874         0.70605
>         "Another label"     0.95949      0.75774         0.031833
>         "Additional label"  0.65574      0.74313         0.27692
>         "Fourth label"      0.035712     0.39223         0.046171
>         "Fifth label"       0.84913      0.65548         0.097132
>         "Final label"       0.93399      0.17119         0.82346
>
> My goal is to use Stata 12 to get them into this format:
>
>            _var1      _var2      _var3      _var4      _var5
> _var6        cat
>         0.792210   0.959490   0.655740   0.035712   0.849130
> 0.933990        "category"
>         0.678740   0.757740   0.743130   0.392230   0.655480
> 0.171190        "diff category"
>         0.706050   0.031833   0.276920   0.046171   0.097132
> 0.823460        "category3"
>
> where _var1 is labelled "Some label", _var2 is labelled "Another
> label", etc. The datasets are
> different and numerous enough that labelling the variables
> individually isn't possible. The names
> of the category variables aren't predictable either.
>
> My current solution is to improvise with Mata, xpose, and destring/tostring
>
>
>         clear
>         input str50 varlabels category diff_category category3
>         "Some label"        -18.2059     0.67874     105869.3
>         "Another label"     0.95949      598.165     0.031833
>         "Additional label"  0.65574      0.74313     0.27692
>         "Fourth label"        194          0.39223     0.046171
>         "Fifth label"       0.84913      394.069     -5968387
>         "Final label"       0.93399      0.17119     0.82346
>         end
>
>         putmata varlabels, replace
>
>         mata
>             varlabels = varlabels'
>         end
>
>         xpose, clear varname
>         tostring v*, force usedisplayformat replace
>         getmata (v*)=varlabels, update force
>
>         foreach v of varlist v* {
>             local name = `v'[1]
>             lab var `v' "`name'"
>         }
>
>         drop if _varname == "varlabels"
>         destring, replace
>
>
> Although this works, I question if there is a better way because I see
> a few problems with this:
>
>         1. The combination of tostring/destring may cause a loss of
> precision, according to Stata's
>            warnings.
>
>         2. The --getmata-- call forces the vector of labels into the
> first observation, which seems clunky.
>            Is it safe to always rely on this working or do I risk
> overwriting data?
>
> I also pursued a solution using the --sxpose-- command from the SSC:
>
>
>         clear
>         input str50 varlabels category diff_category category3
>         "Some label"        0.79221      0.67874     0.70605
>         "Another label"     0.95949      0.75774     0.031833
>         "Additional label"  0.65574      0.74313     0.27692
>         "Fourth label"        0.035712     0.39223     0.046171
>         "Fifth label"       0.84913      0.65548     0.097132
>         "Final label"       0.93399      0.17119     0.82346
>         end
>
>         sxpose, clear force
>
>         foreach v of varlist _var* {
>             local name = `v'[1]
>             lab var `v' "`name'"
>         }
>
>
> This costs me the names of the category variables, which I need to
> preserve; it yields data that look like
> this:
>
>     _var1             _var2               _var3               _var4
>            _var5            _var6
>     Some label        Another label      Additional label     Fourth
> label      Fifth label        Final label
>     0.792209983        0.959490001        0.655740023        0.035712
>          0.849129975        0.933990002
>     0.678740025        0.757740021        0.743130028
> 0.392230004        0.655480027        0.171189994
>     0.706049979        0.031833001        0.276919991
> 0.046170998        0.097131997        0.823459983
>
>
> Any suggestions? I'm happy to continue with my Mata/xpose solution if
> it won't cost me data, but if there's a
> better solution I'll use it.
>
> John Bensin
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index