Statalist



Re: st: Re: Increase speed of -replace-


From   "Friedrich Huebler" <fhuebler@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Re: Increase speed of -replace-
Date   Thu, 8 May 2008 20:33:34 -0400

Thank you, Michael. With my old method it took 15 seconds to replace
all strings in one file. Your solution with -merge- does the same in 2
seconds.

Friedrich

On Thu, May 8, 2008 at 4:52 PM, Michael Blasnik
<michael.blasnik@verizon.net> wrote:
> ...
>
> Although I am perhaps the biggest proponent of using -in- rather than -if-
> when performing many operations on subsets of large datasets, I would
> actually recommend reducing the dataset and using -merge- for this problem
> (if I understand it correctly).
>
> Continuing with your example, if there are perhaps 200 different values of
> make in the 150,000 observation dataset, then I would
>
> keep make
> bysort make: keep if _n==1
> gen newmake=make
> replace newmake = "AMC" if strpos(make,"AMC")>0
> replace newmake = "Buick" if strpos(make,"Buick")>0
> replace newmake = "Cadillac" if strpos(make,"Cad.")>0
> replace newmake = "Chevrolet" if strpos(make,"Chev.")>0
> .....etc..
> sort make
> save make2newmake
>
> use master
> sort make
> merge make using make2newmake
>
>
> Now you have the new names in a dataset.  This approach will be much faster
> because you are only replacing one copy of each value of make -- just 200
> values rather than 150,000.  The speed improvement from this approach should
> more than make up for the need to merge it back in.
>
> If you do this repeatedly, you can just keep the make2newmake dataset and do
> the merge first.  Then you can check for new categories of make (count if
> _m==1) and classify them and add them to that lookup file.
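> [Editorial note: for readers outside Stata, the lookup-table-and-merge pattern described above can be sketched in Python with pandas. The data values and column names below are hypothetical stand-ins for the thread's 150,000-row dataset; this is an illustration of the technique, not part of the original post.]
>
> ```python
> import pandas as pd
>
> # A "master" dataset with one messy make string per row
> # (hypothetical values standing in for the large dataset).
> master = pd.DataFrame({
>     "make": ["AMC Concord", "Buick Century", "Cad. Deville",
>              "Chev. Impala", "AMC Pacer", "Dodge Colt"]
> })
>
> # Step 1: reduce to one row per distinct value of make
> # (the Stata `bysort make: keep if _n==1`).
> lookup = master[["make"]].drop_duplicates().copy()
>
> # Step 2: classify each distinct value once -- a few hundred
> # replacements instead of one per observation.
> lookup["newmake"] = lookup["make"]
> rules = [("AMC", "AMC"), ("Buick", "Buick"), ("Cad.", "Cadillac"),
>          ("Chev.", "Chevrolet"), ("Dodge", "Dodge")]
> for needle, label in rules:
>     hit = lookup["make"].str.contains(needle, regex=False)
>     lookup.loc[hit, "newmake"] = label
>
> # Step 3: merge the lookup table back into the full dataset
> # (the Stata `sort make` / `merge make using make2newmake`).
> result = master.merge(lookup, on="make", how="left")
>
> # Step 4 (optional): pandas' merge indicator parallels Stata's
> # _merge; "left_only" rows are new makes not yet in the lookup.
> checked = master.merge(lookup, on="make", how="left", indicator=True)
> new_makes = checked[checked["_merge"] == "left_only"]
> ```
>
> Here `new_makes` is empty because the lookup was built from the master data itself; with a saved lookup file reused across runs, non-empty `new_makes` flags categories that still need classifying, mirroring the `count if _m==1` check above.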
>
> M Blasnik
>
> ----- Original Message ----- From: "Friedrich Huebler" <fhuebler@gmail.com>
> To: <statalist@hsphsun2.harvard.edu>
> Sent: Thursday, May 08, 2008 3:10 PM
> Subject: st: Increase speed of -replace-
>
>
>> I am looking for a way to increase the speed of -replace-. I have a
>> long string variable consisting of several words that should be
>> reduced to a shorter string, depending on the text in each
>> observation. The problem can be reproduced with the auto data. Assume
>> that we want to replace the text in the variable "make" by a single
>> word. Assume further that the text we are looking for (e.g. "Chev.")
>> is not necessarily at the beginning of the string but that it can be
>> anywhere in the variable. My solution is shown below but it is slow
>> with more than 200 -replace- commands and about 150,000 observations.
>> Is there a faster solution?
>>
>> sysuse auto
>> replace make = "AMC" if strpos(make,"AMC")>0
>> replace make = "Buick" if strpos(make,"Buick")>0
>> replace make = "Cadillac" if strpos(make,"Cad.")>0
>> replace make = "Chevrolet" if strpos(make,"Chev.")>0
>> replace make = "Dodge" if strpos(make,"Dodge")>0


