Statalist



RE: st: need help destringing a variable


From   n j cox <n.j.cox@durham.ac.uk>
To   statalist@hsphsun2.harvard.edu
Subject   RE: st: need help destringing a variable
Date   Sun, 16 Sep 2007 16:07:29 +0100

D[aniel] James McNeil wrote [edited]

-----------------------------------------------------
I just want to convert all the strings in one variable into integers.

e.g., the variable is -grade_school- (current grade in school) and the variable column has "2nd", "5th" etc., and I want to make it "2", "5" by issuing just one command.
-----------------------------------------------------

There were several suggestions, but further comment is possible.
Daniel didn't say how high current grade in school goes, or indeed
what education system he is referring to. As this is an international
list, it is unwise to assume that such local details are universally
understood. Let's guess that grade may go higher than 9, so second
digits are possible.

0. Generalities
===============

Wanting a single command is all very well, but it is wise to
check that the input is as assumed and the results are as desired.

For example, an ultra-careful solution would check that there
were no leading or trailing spaces in -grade_school-. Also, running
-tabulate- or -levelsof- on the result would be a good idea.
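
A minimal sketch of such checks, using a small made-up dataset (the values below are assumptions for illustration, not Daniel's data):

```stata
* Toy data, invented for illustration only
clear
input str6 grade_school
"2nd"
"5th"
" 10th"
"3rd"
end

* How many values carry leading or trailing spaces?
count if grade_school != trim(grade_school)

* Look at the distinct values that are actually present
tabulate grade_school, missing
levelsof grade_school
```

If the count is not zero, applying -trim()- before any conversion is the safe course.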

1. Simple string functions
==========================

Following Scott Hankins' suggestion, the first character, assumed from
Daniel's information to be a numeric character, is

substr(grade_school,1,1)

We would want to add the second character whenever it was also
numeric:

substr(grade_school,1,1) +
cond(inrange(real(substr(grade_school,2,1)),0,9), substr(grade_school,2,1), "")

and finally you could put -real()- round all that. That's
not elegant. If -grade_school- were always less than 10,

real(substr(grade_school,1,1))

would however be a nice direct solution. As hinted above,

real(substr(trim(grade_school),1,1))

would do no harm.
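
As a sketch only, the two-digit case can be built up in steps rather than in one expression. The toy data and the names -digits- and -n_grade- are assumptions for illustration; the test applied to the second character is whether it is one of the digits 0 to 9.

```stata
* Toy data, invented for illustration only
clear
input str6 grade_school
"2nd"
"5th"
"10th"
"12th"
end

* Start with the first character, assumed to be a digit
generate str2 digits = substr(trim(grade_school),1,1)

* Append the second character only when it is also a digit;
* real() returns missing for non-digits, so inrange() is then false
replace digits = digits + substr(trim(grade_school),2,1) ///
    if inrange(real(substr(trim(grade_school),2,1)),0,9)

* Convert the collected digits to a number
generate n_grade = real(digits)
list grade_school n_grade
```

The -///- continuation works in a do-file; interactively, type the -replace- on one line.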

2. Regular expressions
======================

Frank de Libero suggested

regexs(1) if regexm(ns,"^([0-9]+)")

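That is a great solution for those familiar with regular expressions.
Here -ns- stands for the string variable, and as -regexs()- returns a
string, -real()- is still needed to get a number.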
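
Spelled out as a complete command on toy data, that might look as follows. This is a sketch: the data and the name -n_grade- are assumptions, and -grade_school- takes the place of -ns-.

```stata
* Toy data, invented for illustration only
clear
input str6 grade_school
"2nd"
"5th"
"10th"
end

* "^([0-9]+)" matches one or more digits at the start of the string;
* regexs(1) returns the matched digits as a string, so wrap in real()
generate n_grade = real(regexs(1)) if regexm(grade_school, "^([0-9]+)")
list grade_school n_grade
```

Values that do not start with a digit are left missing, which makes them easy to find afterwards.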

3. -destring-
=============

Svend Juul and Thomas Steichen suggested -destring- with the -ignore()-
option. I would not try to ignore too much, because that might just hide
some problems.

destring grade_school, generate(n_grade) ignore("stndrh")

would catch "st", "nd", "rd" and "th".
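
A sketch of that on toy data, followed by a check of the result (the data are assumptions for illustration):

```stata
* Toy data, invented for illustration only
clear
input str6 grade_school
"1st"
"2nd"
"10th"
"3rd"
end

* Strip only the characters that make up "st", "nd", "rd" and "th"
destring grade_school, generate(n_grade) ignore("stndrh")

* Check that the new variable holds what was expected
tabulate n_grade, missing
```

If some value contained any other nonnumeric character, -destring- would decline to create -n_grade-, which is a useful warning in itself.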

4. -encode-
===========

An -encode- solution is not out of the question.

label def grade 1 "1st" 2 "2nd" 3 "3rd" 4 "4th" 5 "5th" 6 "6th" 7 "7th" <...>
encode grade_school, generate(n_grade) label(grade)

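This would be good at catching codes that are not as expected. By
default -encode- adds any value not in the label as a new code, which
then stands out in -tabulate-; with the -noextend- option it refuses
to run if it meets such a value.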
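
A sketch on toy data, with a deliberately short label that stops at grade 5 (both the data and the extent of the label are assumptions for illustration):

```stata
* Toy data, invented for illustration only
clear
input str6 grade_school
"2nd"
"5th"
"3rd"
end

* Define the mapping first, so that the codes equal the grades
label define grade 1 "1st" 2 "2nd" 3 "3rd" 4 "4th" 5 "5th"

* noextend makes encode stop with an error on any unlisted value
encode grade_school, generate(n_grade) label(grade) noextend

* Show the underlying numeric codes rather than the labels
list grade_school n_grade, nolabel
```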

5. -egen, sieve()-
==================

-egenmore- on SSC offers yet another solution through its -sieve()-
function.
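
A sketch of how that might go, assuming -egenmore- has been installed from SSC and that the -keep(numeric)- option of -sieve()- behaves as its help file describes. The toy data and the names -digits- and -n_grade- are assumptions for illustration.

```stata
* One-time install, if needed:
* ssc install egenmore

* Toy data, invented for illustration only
clear
input str6 grade_school
"2nd"
"5th"
"10th"
end

* Keep only the numeric characters of each value, as a string
egen digits = sieve(grade_school), keep(numeric)

* Convert the surviving digits to a number
generate n_grade = real(digits)
list grade_school digits n_grade
```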

Nick
n.j.cox@durham.ac.uk





