Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Jen Zhen <jenzhen99@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Loading long string variables (from SQL) into Stata |

Date |
Thu, 6 Dec 2012 10:19:52 +0100 |

Dear Steve, Dimitriy and Nick, thanks so much for your responses. Firstly, Nick's question is of course a fair point. What I'd like to do is to find substrings that are particular words (so at most say 20 characters long). So I think that once I have loaded and searched say text1-text30, each of length 244, I can then also generate new variables -gen text1b = substr(text1,235,10)+substr(text2,1, 10)- and so on and search these next to also capture words that were truncated in the initial split. I hope that that makes more sense now. Dimitriy's suggestion on how to load only parts of the whole text variable does seem to do the job even without me having to split the string in SQL before, so it looks like I'm settled on this. Thanks so much and best regards, JZ 30 split text variables, I can then also recombine them On Tue, Dec 4, 2012 at 12:42 AM, Dimitriy V. Masterov <dvmaster@gmail.com> wrote: > Nick, > > I often work with long strings that are user-entered text, like a > message or a search query. > > Sometimes they merely provide context and help me make sense of the > structured data that I am analyzing using conventional methods. > > Other times I use regular expressions to look for key words like > refund across the split strings. This is caveman NLP that serves as an > input to producing structured data. > > DVM > > On Sat, Dec 1, 2012 at 9:29 AM, Dimitriy V. Masterov <dvmaster@gmail.com> wrote: >> Nick, >> >> I often work with long strings that are user-entered text, like a message or >> a search query. >> >> Sometimes they merely provide context and help me make sense of the >> structured data that I am analyzing using conventional methods. >> >> Other times I use regular expressions to look for key words like refund >> across the split strings. This is caveman NLP that serves as an input to >> producing structured data. >> >> DVM >> >> On Dec 1, 2012 3:33 AM, "Nick Cox" <njcoxstata@gmail.com> wrote: >>> >>> I am usually puzzled by questions like this -- not because I doubt in >>> any way that people have string fields (much) longer than 244 >>> characters [which is the limit, not 240], but because I don't ever see >>> any hint at what the intended or imagined strategy would be once the >>> pieces have been cut and read separately into Stata. >>> >>> Suppose you would like to import as a string variable something that >>> is say about 1000 characters long. Suppose that you succeed in >>> importing it as -mystr1- ... -mystr5-. If these parts are logically >>> separate, so that they make sense as individual fields, all well and >>> good. But what is the strategy if they don't, meaning that the parts >>> are just chopped to fit, regardless of meaning? >>> >>> For example, many basic problems centre on looking for a substring >>> within a string. Now you need a strategy that looks for a substring >>> that may have been split by a chop. >>> >>> There are ways of approaching these problems, most notably using Mata >>> to reassemble parts into longer strings, but are people asking for how >>> to import also planning the code they would need to write to handle >>> them? Or even planning to learn enough Mata to work with these strings >>> as they should be? >>> >>> (If your fields are 10,000 or 100,000 characters ..., the question >>> clearly has even more force.) >>> >>> Nick >>> >>> On Fri, Nov 30, 2012 at 5:30 PM, Dimitriy V. Masterov >>> <dvmaster@gmail.com> wrote: >>> > Jen, >>> > >>> > You can split the long variable into substring using SQL. The function >>> > may have a slightly different name depending what dialect of SQL you >>> > have. For example: >>> > >>> > odbc load, exec("select substr(longstringvar,1,240) as s1, >>> > substr(longstringvar,241,480) as s2 from dbname.tablename") clear >>> > dsn("dsn"); >>> > >>> > If you have lots of these, I don't know of a way to automate the >>> > process of deciding how many pieces there should be. I would do an >>> > initial query to get the length of the varchar variables in SQL. Then >>> > you can maybe automate something if the need arises. >>> > >>> > DVM >>> > >>> > On Fri, Nov 30, 2012 at 6:01 AM, Jen Zhen <jenzhen99@gmail.com> wrote: >>> >> Coming back to this earlier post: >>> >> >>> >> Does anyone actually know how to split SQL text columns (string >>> >> variables) after each 240 characters and save the resulting columns >>> >> (variables) as a new table (dataset), using SQL browser (or maybe even >>> >> from within Stata, but I guess that's probably not possible)? >>> >> >>> >> Thanks so much, >>> >> JZ >>> >> >>> >> >>> >> On Tue, Sep 4, 2012 at 6:25 PM, Nick Cox <njcoxstata@gmail.com> wrote: >>> >>> It's my understanding that >>> >>> >>> >>> 1. -odbc-, just like any Stata import command, expects a one-to-one >>> >>> mapping between what it imports and the Stata variables that result. >>> >>> >>> >>> 2. Thus, as you say, it seems that you need to manipulate your data >>> >>> outside Stata. >>> >>> >>> >>> 3. The only alternative is to read directly into Mata. >>> >>> >>> >>> Note that parts of long strings that are -str244- would not be much >>> >>> use inside Stata any way, except in so far as you could re-combine >>> >>> them in Mata. >>> >>> >>> >>> Nick >>> >>> >>> >>> On Tue, Sep 4, 2012 at 5:15 PM, Jen Zhen <jenzhen99@gmail.com> wrote: >>> >>>> The ideal thing would be if it was somehow possible to split the >>> >>>> variables upon loading, >>> >>>> or to load only the first x characters once and load only the second >>> >>>> x >>> >>>> characters the second time etc, >>> >>>> but I guess if any splitting can only be done after loading then >>> >>>> there >>> >>>> seems to be no way around having to do the adjustment already in >>> >>>> SQL... >>> >>>> >>> >>>> >>> >>>> On Tue, Sep 4, 2012 at 4:53 PM, Nick Cox <njcoxstata@gmail.com> >>> >>>> wrote: >>> >>>>> -compress- can do nothing to help and not just for the reason you >>> >>>>> mention, that the variables in Stata were truncated on input. >>> >>>>> >>> >>>>> -compress- will only compress variables from one storage type to a >>> >>>>> more compact storage type when that is possible without loss of >>> >>>>> information. But the longer strings could not be variables to start >>> >>>>> with and in any case there would be loss of information. >>> >>>>> >>> >>>>> Nick >>> >>>>> >>> >>>>> On Tue, Sep 4, 2012 at 2:41 PM, tashi lama <ltashi32@hotmail.com> >>> >>>>> wrote: >>> >>>>>> I would also like to find the answer. There is _compress but I am >>> >>>>>> not sure it will be any help since the columns are already truncated before >>> >>>>>> it loads. _compress seems to be helpful once the columns are successfully >>> >>>>>> loaded and you want to save the memory. >>> >>>>> >>> >>>>> By the way, if you are running stata in unix, did you have any >>> >>>>> problem >>> >>>>> loading sql columns with type varchar? I am having an issue while >>> >>>>> loading columns with type varchar. >>> >>>>> >>> >>>>> From: jenzhen99@gmail.com >>> >>>>> >>> >>>>>>> I`m loading some variables from an SQL database into Stata using >>> >>>>>>> the 2 >>> >>>>>>> lines of code given below. >>> >>>>>>> One of them is a very long string variable that gets cut off when >>> >>>>>>> loaded into Stata. Is there a way to either increase the maximum >>> >>>>>>> length of string variables allowed in Stata, or to automatically >>> >>>>>>> split >>> >>>>>>> each string at whatever is the maximum number of symbols and put >>> >>>>>>> the >>> >>>>>>> rest into further string variables? >>> >>>>>>> Thanks so much and best regards, >>> >>>>>>> JZ >>> >>>>>>> >>> >>>>>>> local sql "SELECT var1, var2, var3 FROM database WHERE var1>5" >>> >>>>>>> odbc load, exec("`sql'") conn("`db'") >>> >>>>> >>> >>>>> * >>> >>>>> * For searches and help try: >>> >>>>> * http://www.stata.com/help.cgi?search >>> >>>>> * http://www.stata.com/support/statalist/faq >>> >>>>> * http://www.ats.ucla.edu/stat/stata/ >>> >>>> >>> >>>> * >>> >>>> * For searches and help try: >>> >>>> * http://www.stata.com/help.cgi?search >>> >>>> * http://www.stata.com/support/statalist/faq >>> >>>> * http://www.ats.ucla.edu/stat/stata/ >>> >>> >>> >>> * >>> >>> * For searches and help try: >>> >>> * http://www.stata.com/help.cgi?search >>> >>> * http://www.stata.com/support/statalist/faq >>> >>> * http://www.ats.ucla.edu/stat/stata/ >>> >> >>> >> * >>> >> * For searches and help try: >>> >> * http://www.stata.com/help.cgi?search >>> >> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> >> * http://www.ats.ucla.edu/stat/stata/ >>> > >>> > * >>> > * For searches and help try: >>> > * http://www.stata.com/help.cgi?search >>> > * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> > * http://www.ats.ucla.edu/stat/stata/ >>> >>> * >>> * For searches and help try: >>> * http://www.stata.com/help.cgi?search >>> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> * http://www.ats.ucla.edu/stat/stata/ > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/faqs/resources/statalist-faq/ > * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: Loading long string variables (from SQL) into Stata***From:*Nick Cox <njcoxstata@gmail.com>

**References**:**Re: st: Loading long string variables (from SQL) into Stata***From:*Nick Cox <njcoxstata@gmail.com>

**Re: st: Loading long string variables (from SQL) into Stata***From:*"Dimitriy V. Masterov" <dvmaster@gmail.com>

- Prev by Date:
**st: importance of independent variables** - Next by Date:
**Re: st: importance of independent variables** - Previous by thread:
**Re: st: Loading long string variables (from SQL) into Stata** - Next by thread:
**Re: st: Loading long string variables (from SQL) into Stata** - Index(es):