Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Loading long string variables (from SQL) into Stata


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: Loading long string variables (from SQL) into Stata
Date   Thu, 6 Dec 2012 09:35:56 +0000

That's good: you can imagine a strategy.

In detail: there's no guarantee that words would be split in the
middle. Also, there is no need to create new variables as (e.g.)

strpos(substr(text1, length(text1) - 17, 17) + substr(text2, 1, 17),
"heteroskedasticity")

would work. But the Mata approach is better.

By the way, in my ignorance I had to Google "NLP" as mentioned by
Dimitriy. I came up with repeated hits on
a species of baloney called "neurolinguistic programming" but I guess
Dimitriy meant something like "natural language processing".

Nick

On Thu, Dec 6, 2012 at 9:19 AM, Jen Zhen <[email protected]> wrote:
> Dear Steve, Dimitriy and Nick,
>
> thanks so much for your responses.
>
> Firstly, Nick's question is of course a fair point. What I'd like to
> do is to find substrings that are particular words (so at most say 20
> characters long). So I think that once I have loaded and searched say
> text1-text30, each of length 244, I can then also generate new
> variables -gen text1b = substr(text1,235,10)+substr(text2,1, 10)- and
> so on and search these next to also capture words that were truncated
> in the initial split. I hope that that makes more sense now.
>
> Dimitriy's suggestion on how to load only parts of the whole text
> variable does seem to do the job even without me having to split the
> string in SQL before, so it looks like I'm settled on this.
>
> Thanks so much and best regards,
> JZ
>
>
>
> 30 split text variables, I can then also recombine them
>
> On Tue, Dec 4, 2012 at 12:42 AM, Dimitriy V. Masterov
> <[email protected]> wrote:
>> Nick,
>>
>> I often work with long strings that are user-entered text, like a
>> message or a search query.
>>
>> Sometimes they merely provide context and help me make sense of the
>> structured data that I am analyzing using conventional methods.
>>
>> Other times I use regular expressions to look for key words like
>> refund across the split strings. This is caveman NLP that serves as an
>> input to producing structured data.

>>> On Dec 1, 2012 3:33 AM, "Nick Cox" <[email protected]> wrote:
>>>>
>>>> I am usually puzzled by questions like this -- not because I doubt in
>>>> any way that people have string fields (much) longer than 244
>>>> characters [which is the limit, not 240], but because I don't ever see
>>>> any hint at what the intended or imagined strategy would be once the
>>>> pieces have been cut and read separately into Stata.
>>>>
>>>> Suppose you would like to import as a string variable something that
>>>> is say about 1000 characters long. Suppose that you succeed in
>>>> importing it as -mystr1- ... -mystr5-. If  these parts are logically
>>>> separate, so that they make sense as individual fields, all well and
>>>> good. But what is the strategy if they don't, meaning that the parts
>>>> are just chopped to fit, regardless of meaning?
>>>>
>>>> For example,  many basic problems centre on looking for a substring
>>>> within a string. Now you need a strategy that looks for a substring
>>>> that may have been split by a chop.
>>>>
>>>> There are ways of approaching these problems, most notably using Mata
>>>> to reassemble parts into longer strings, but are people asking for how
>>>> to import also planning the code they would need to write to handle
>>>> them? Or even planning to learn enough Mata to work with these strings
>>>> as they should be?
>>>>
>>>> (If your fields are 10,000 or 100,000 characters ..., the question
>>>> clearly has even more force.)
>>>>
>>>> Nick
>>>>
>>>> On Fri, Nov 30, 2012 at 5:30 PM, Dimitriy V. Masterov
>>>> <[email protected]> wrote:
>>>> > Jen,
>>>> >
>>>> > You can split the long variable into substring using SQL. The function
>>>> > may have a slightly different name depending what dialect of SQL you
>>>> > have. For example:
>>>> >
>>>> > odbc load, exec("select substr(longstringvar,1,240) as s1,
>>>> > substr(longstringvar,241,480) as s2 from dbname.tablename") clear
>>>> > dsn("dsn");
>>>> >
>>>> > If you have lots of these, I don't know of a way to automate the
>>>> > process of deciding how many pieces there should be. I would do an
>>>> > initial query to get the length of the varchar variables in SQL. Then
>>>> > you can maybe automate something if the need arises.
>>>> >
>>>> > DVM
>>>> >
>>>> > On Fri, Nov 30, 2012 at 6:01 AM, Jen Zhen <[email protected]> wrote:
>>>> >> Coming back to this earlier post:
>>>> >>
>>>> >> Does anyone actually know how to split SQL text columns (string
>>>> >> variables) after each 240 characters and save the resulting columns
>>>> >> (variables) as a new table (dataset), using SQL browser (or maybe even
>>>> >> from within Stata, but I guess that's probably not possible)?
>>>> >>
>>>> >> Thanks so much,
>>>> >> JZ
>>>> >>
>>>> >>
>>>> >> On Tue, Sep 4, 2012 at 6:25 PM, Nick Cox <[email protected]> wrote:
>>>> >>> It's my understanding that
>>>> >>>
>>>> >>> 1. -odbc-, just like any Stata import command, expects a one-to-one
>>>> >>> mapping between what it imports and the Stata variables that result.
>>>> >>>
>>>> >>> 2. Thus, as you say, it seems that you need to manipulate your data
>>>> >>> outside Stata.
>>>> >>>
>>>> >>> 3. The only alternative is to read directly into Mata.
>>>> >>>
>>>> >>> Note that parts of long strings that are -str244- would not be much
>>>> >>> use inside Stata any way, except in so far as you could re-combine
>>>> >>> them in Mata.
>>>> >>>
>>>> >>> Nick
>>>> >>>
>>>> >>> On Tue, Sep 4, 2012 at 5:15 PM, Jen Zhen <[email protected]> wrote:
>>>> >>>> The ideal thing would be if it was somehow possible to split the
>>>> >>>> variables upon loading,
>>>> >>>> or to load only the first x characters once and load only the second
>>>> >>>> x
>>>> >>>> characters the second time etc,
>>>> >>>> but I guess if any splitting can only be done after loading then
>>>> >>>> there
>>>> >>>> seems to be no way around having to do the adjustment already in
>>>> >>>> SQL...
>>>> >>>>
>>>> >>>>
>>>> >>>> On Tue, Sep 4, 2012 at 4:53 PM, Nick Cox <[email protected]>
>>>> >>>> wrote:
>>>> >>>>> -compress- can do nothing to help and not just for the reason you
>>>> >>>>> mention, that the variables in Stata were truncated on input.
>>>> >>>>>
>>>> >>>>> -compress- will only compress variables from one storage type to a
>>>> >>>>> more compact storage type when that is possible without loss of
>>>> >>>>> information. But the longer strings could not be variables to start
>>>> >>>>> with and in any case there would be loss of information.
>>>> >>>>>
>>>> >>>>> Nick
>>>> >>>>>
>>>> >>>>> On Tue, Sep 4, 2012 at 2:41 PM, tashi lama <[email protected]>
>>>> >>>>> wrote:
>>>> >>>>>> I would also like to find the answer. There is _compress but I am
>>>> >>>>>> not sure it will be any help since the columns are already truncated before
>>>> >>>>>> it loads. _compress seems to be helpful once the columns are successfully
>>>> >>>>>> loaded and you want to save the memory.
>>>> >>>>>
>>>> >>>>> By the way, if you are running stata in unix, did you have any
>>>> >>>>> problem
>>>> >>>>> loading sql columns with type varchar? I am having an issue while
>>>> >>>>> loading columns with type varchar.
>>>> >>>>>
>>>> >>>>> From: [email protected]
>>>> >>>>>
>>>> >>>>>>> I`m loading some variables from an SQL database into Stata using
>>>> >>>>>>> the 2
>>>> >>>>>>> lines of code given below.
>>>> >>>>>>> One of them is a very long string variable that gets cut off when
>>>> >>>>>>> loaded into Stata. Is there a way to either increase the maximum
>>>> >>>>>>> length of string variables allowed in Stata, or to automatically
>>>> >>>>>>> split
>>>> >>>>>>> each string at whatever is the maximum number of symbols and put
>>>> >>>>>>> the
>>>> >>>>>>> rest into further string variables?
>>>> >>>>>>> Thanks so much and best regards,
>>>> >>>>>>> JZ
>>>> >>>>>>>
>>>> >>>>>>> local sql "SELECT var1, var2, var3 FROM database WHERE var1>5"
>>>> >>>>>>> odbc load, exec("`sql'") conn("`db'")
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index