Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down at the end of May, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Nick Cox <njcoxstata@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Loading long string variables (from SQL) into Stata |

Date |
Thu, 6 Dec 2012 09:35:56 +0000 |

That's good: you can imagine a strategy. In detail: there's no guarantee that words would be split in the middle. Also, there is no need to create new variables as (e.g.) strpos(substr(text1, length(text1) - 17, 17) + substr(text2, 1, 17), "heteroskedasticity") would work. But the Mata approach is better. By the way, in my ignorance I had to Google "NLP" as mentioned by Dimitriy. I came up with repeated hits on a species of baloney called "neurolinguistic programming" but I guess Dimitriy meant something like "natural language processing". Nick On Thu, Dec 6, 2012 at 9:19 AM, Jen Zhen <jenzhen99@gmail.com> wrote: > Dear Steve, Dimitriy and Nick, > > thanks so much for your responses. > > Firstly, Nick's question is of course a fair point. What I'd like to > do is to find substrings that are particular words (so at most say 20 > characters long). So I think that once I have loaded and searched say > text1-text30, each of length 244, I can then also generate new > variables -gen text1b = substr(text1,235,10)+substr(text2,1, 10)- and > so on and search these next to also capture words that were truncated > in the initial split. I hope that that makes more sense now. > > Dimitriy's suggestion on how to load only parts of the whole text > variable does seem to do the job even without me having to split the > string in SQL before, so it looks like I'm settled on this. > > Thanks so much and best regards, > JZ > > > > 30 split text variables, I can then also recombine them > > On Tue, Dec 4, 2012 at 12:42 AM, Dimitriy V. Masterov > <dvmaster@gmail.com> wrote: >> Nick, >> >> I often work with long strings that are user-entered text, like a >> message or a search query. >> >> Sometimes they merely provide context and help me make sense of the >> structured data that I am analyzing using conventional methods. >> >> Other times I use regular expressions to look for key words like >> refund across the split strings. This is caveman NLP that serves as an >> input to producing structured data. >>> On Dec 1, 2012 3:33 AM, "Nick Cox" <njcoxstata@gmail.com> wrote: >>>> >>>> I am usually puzzled by questions like this -- not because I doubt in >>>> any way that people have string fields (much) longer than 244 >>>> characters [which is the limit, not 240], but because I don't ever see >>>> any hint at what the intended or imagined strategy would be once the >>>> pieces have been cut and read separately into Stata. >>>> >>>> Suppose you would like to import as a string variable something that >>>> is say about 1000 characters long. Suppose that you succeed in >>>> importing it as -mystr1- ... -mystr5-. If these parts are logically >>>> separate, so that they make sense as individual fields, all well and >>>> good. But what is the strategy if they don't, meaning that the parts >>>> are just chopped to fit, regardless of meaning? >>>> >>>> For example, many basic problems centre on looking for a substring >>>> within a string. Now you need a strategy that looks for a substring >>>> that may have been split by a chop. >>>> >>>> There are ways of approaching these problems, most notably using Mata >>>> to reassemble parts into longer strings, but are people asking for how >>>> to import also planning the code they would need to write to handle >>>> them? Or even planning to learn enough Mata to work with these strings >>>> as they should be? >>>> >>>> (If your fields are 10,000 or 100,000 characters ..., the question >>>> clearly has even more force.) >>>> >>>> Nick >>>> >>>> On Fri, Nov 30, 2012 at 5:30 PM, Dimitriy V. Masterov >>>> <dvmaster@gmail.com> wrote: >>>> > Jen, >>>> > >>>> > You can split the long variable into substring using SQL. The function >>>> > may have a slightly different name depending what dialect of SQL you >>>> > have. For example: >>>> > >>>> > odbc load, exec("select substr(longstringvar,1,240) as s1, >>>> > substr(longstringvar,241,480) as s2 from dbname.tablename") clear >>>> > dsn("dsn"); >>>> > >>>> > If you have lots of these, I don't know of a way to automate the >>>> > process of deciding how many pieces there should be. I would do an >>>> > initial query to get the length of the varchar variables in SQL. Then >>>> > you can maybe automate something if the need arises. >>>> > >>>> > DVM >>>> > >>>> > On Fri, Nov 30, 2012 at 6:01 AM, Jen Zhen <jenzhen99@gmail.com> wrote: >>>> >> Coming back to this earlier post: >>>> >> >>>> >> Does anyone actually know how to split SQL text columns (string >>>> >> variables) after each 240 characters and save the resulting columns >>>> >> (variables) as a new table (dataset), using SQL browser (or maybe even >>>> >> from within Stata, but I guess that's probably not possible)? >>>> >> >>>> >> Thanks so much, >>>> >> JZ >>>> >> >>>> >> >>>> >> On Tue, Sep 4, 2012 at 6:25 PM, Nick Cox <njcoxstata@gmail.com> wrote: >>>> >>> It's my understanding that >>>> >>> >>>> >>> 1. -odbc-, just like any Stata import command, expects a one-to-one >>>> >>> mapping between what it imports and the Stata variables that result. >>>> >>> >>>> >>> 2. Thus, as you say, it seems that you need to manipulate your data >>>> >>> outside Stata. >>>> >>> >>>> >>> 3. The only alternative is to read directly into Mata. >>>> >>> >>>> >>> Note that parts of long strings that are -str244- would not be much >>>> >>> use inside Stata any way, except in so far as you could re-combine >>>> >>> them in Mata. >>>> >>> >>>> >>> Nick >>>> >>> >>>> >>> On Tue, Sep 4, 2012 at 5:15 PM, Jen Zhen <jenzhen99@gmail.com> wrote: >>>> >>>> The ideal thing would be if it was somehow possible to split the >>>> >>>> variables upon loading, >>>> >>>> or to load only the first x characters once and load only the second >>>> >>>> x >>>> >>>> characters the second time etc, >>>> >>>> but I guess if any splitting can only be done after loading then >>>> >>>> there >>>> >>>> seems to be no way around having to do the adjustment already in >>>> >>>> SQL... >>>> >>>> >>>> >>>> >>>> >>>> On Tue, Sep 4, 2012 at 4:53 PM, Nick Cox <njcoxstata@gmail.com> >>>> >>>> wrote: >>>> >>>>> -compress- can do nothing to help and not just for the reason you >>>> >>>>> mention, that the variables in Stata were truncated on input. >>>> >>>>> >>>> >>>>> -compress- will only compress variables from one storage type to a >>>> >>>>> more compact storage type when that is possible without loss of >>>> >>>>> information. But the longer strings could not be variables to start >>>> >>>>> with and in any case there would be loss of information. >>>> >>>>> >>>> >>>>> Nick >>>> >>>>> >>>> >>>>> On Tue, Sep 4, 2012 at 2:41 PM, tashi lama <ltashi32@hotmail.com> >>>> >>>>> wrote: >>>> >>>>>> I would also like to find the answer. There is _compress but I am >>>> >>>>>> not sure it will be any help since the columns are already truncated before >>>> >>>>>> it loads. _compress seems to be helpful once the columns are successfully >>>> >>>>>> loaded and you want to save the memory. >>>> >>>>> >>>> >>>>> By the way, if you are running stata in unix, did you have any >>>> >>>>> problem >>>> >>>>> loading sql columns with type varchar? I am having an issue while >>>> >>>>> loading columns with type varchar. >>>> >>>>> >>>> >>>>> From: jenzhen99@gmail.com >>>> >>>>> >>>> >>>>>>> I`m loading some variables from an SQL database into Stata using >>>> >>>>>>> the 2 >>>> >>>>>>> lines of code given below. >>>> >>>>>>> One of them is a very long string variable that gets cut off when >>>> >>>>>>> loaded into Stata. Is there a way to either increase the maximum >>>> >>>>>>> length of string variables allowed in Stata, or to automatically >>>> >>>>>>> split >>>> >>>>>>> each string at whatever is the maximum number of symbols and put >>>> >>>>>>> the >>>> >>>>>>> rest into further string variables? >>>> >>>>>>> Thanks so much and best regards, >>>> >>>>>>> JZ >>>> >>>>>>> >>>> >>>>>>> local sql "SELECT var1, var2, var3 FROM database WHERE var1>5" >>>> >>>>>>> odbc load, exec("`sql'") conn("`db'") * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/

**References**:**Re: st: Loading long string variables (from SQL) into Stata***From:*Nick Cox <njcoxstata@gmail.com>

**Re: st: Loading long string variables (from SQL) into Stata***From:*"Dimitriy V. Masterov" <dvmaster@gmail.com>

**Re: st: Loading long string variables (from SQL) into Stata***From:*Jen Zhen <jenzhen99@gmail.com>

- Prev by Date:
**Re: st: importance of independent variables** - Next by Date:
**Re: st: example about choice experiment datasheet** - Previous by thread:
**Re: st: Loading long string variables (from SQL) into Stata** - Next by thread:
**Re: st: use egen functions to get survey weighted means** - Index(es):