



Re: st: Twitter Message Sub-string Extraction?


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Twitter Message Sub-string Extraction?
Date   Mon, 13 Jun 2011 08:15:51 +0100

A very minor refinement here is that

ds tweet?

foreach t in `r(varlist)' {

could just be

foreach t of var tweet? {

as -foreach- is perfectly capable of coping with the wildcard.
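
To illustrate the point, here is a minimal sketch on made-up data. The variables tweet2 and tweet3 and their contents are assumptions for the example, not part of the original dataset. Both loops should visit the same variables, but the second needs no prior -ds- call.

```stata
* Made-up data for illustration only
clear
input str20 tweet2 str20 tweet3
"#Ibiza!"  "@test"
"at"       "#summer2011"
end

* Two-step version: -ds- leaves the matching names in r(varlist)
ds tweet?
foreach t in `r(varlist)' {
    display "`t'"
}

* One-step version: -foreach ... of varlist- expands the wildcard itself
foreach t of varlist tweet? {
    display "`t'"
}
```

The -of varlist- form also checks that the names refer to existing variables before the loop body runs.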

-dropmiss- is from SJ.

Nick

On Sun, Jun 12, 2011 at 9:31 PM, Eric Booth <ebooth@ppri.tamu.edu> wrote:

> I'd use -split- to separate each Twitter message (esp. since Twitter messages are so short) and a combination of strpos() and subinstr() to find the elements you describe across the split tweets.
> I've provided an example below -- you'll need to install -dropmiss- and -sortrows- (from SSC) via -findit- before running my example.
>
> *****************!
> clear
> inp str240(tweet)
> "*@RndmUsername* I'm having a great time at #Ibiza! #summer2011 RT @SomeOtherPerson15 @test"
> "*@somethingelse* I'm asdfasdf at #something! #summer2011 RT @asff15 @afafdf"
> "*@anothername* I'm asdff t #test! #summer2011 @test15 @YetAnotherPerson"
> end
>
> split tweet, p(" ")
> ds tweet*
>
> **username
> rename tweet1 username
> replace username = subinstr(username, "*", "", .)
> l username
> **RT
> g rt = 1 if strpos(tweet, "RT")
> ta rt
>
> **topics & recipients:
> ds tweet?
> foreach v in topic recipient {
> // topics are flagged by "#", recipients by "@"
> loc sym = cond("`v'" == "topic", "#", "@")
> loc n = 1
> foreach t in `r(varlist)' {
>        g `v'`n' = `t' if strpos(`t', "`sym'")
>        loc ++n
>        } //end t loop
>        } //end v loop
> //get rid of empty vars:
> drop tweet?? tweet?
> order topic* recipient*
>
> sortrows topic* , replace missing
> sortrows recipient* , replace missing
> dropmiss, force
>
> **probably want to reshape at some pt:
> g id = _n
> order id
> reshape long topic recipient , i(id) j(tweet_num)
> compress
> order id username rt topic reci
>
> *****************!


On Jun 12, 2011, at 2:39 PM, Richard Fairbanks wrote:

>> I'm preparing a dataset of ~ 2,000 tweets (Twitter messages) for social
>> network analysis. I'm trying to track who tweeted to whom and the theme
>> (hashtag) of the message.
>>
>>  Observations of the single variable look like this.
>>
>>  *@RndmUsername* I'm having a great time at #Ibiza! #summer2011 RT
>> @SomeOtherPerson15 @YetAnotherPerson
>>
>> For those unfamiliar with Twitter:
>>
>> @[Name] - Username of the person sending the tweet. Must be 20 characters or
>> less, including letters and / or integers in any position.
>>
>> RT - "re-tweet" - Think of this like an email "Forward" option for tweets.
>> No help needed here, just making a dummy variable!
>>
>> #[Name] - "hashtag" - An arbitrary code in letters and integers specifying
>> the topic or adding commentary
>>
>> Subsequent @[Name]s - These are people to whom the message is specifically
>> directed.
>>
>> I know how to generate a new variable that contains the message sender
>> (always the first string after the "@" character) using regular expressions,
>> although there's probably a simpler way.
>>
>> How can I generate a new variable that contains #[Names] and @[Names] after
>> the first case of a username or hashtag? (That is, using the example, I'm
>> having trouble extracting #summer2011, @SomeOtherPerson15 and
>> @YetAnotherPerson.)


