



Re: st: Twitter Message Sub-string Extraction?


From   Eric Booth <ebooth@ppri.tamu.edu>
To   "<statalist@hsphsun2.harvard.edu>" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: Twitter Message Sub-string Extraction?
Date   Sun, 12 Jun 2011 20:31:04 +0000


Hi Richard:

I'd use -split- to break each tweet into its separate words (practical here because tweets are so short), then a combination of strpos() and subinstr() to pick out the elements you describe from the resulting variables.
I've provided an example below -- you'll need to install -dropmiss- and -sortrows- (from SSC) via -findit- before running it.

*****************!
clear
inp str240(tweet)
"*@RndmUsername* I'm having a great time at #Ibiza! #summer2011 RT @SomeOtherPerson15 @test"
"*@somethingelse* I'm asdfasdf at #something! #summer2011 RT @asff15 @afafdf"
"*@anothername* I'm asdff t #test! #summer2011 @test15 @YetAnotherPerson"
end

split tweet, p(" ")
ds tweet*

**username
rename tweet1 username
replace username = subinstr(username, "*", "", .)
l username
**RT
g rt = strpos(" " + tweet + " ", " RT ") > 0 // 1 if RT appears as a separate word, else 0
ta rt

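**topics & recipients:
**(tweet1 is now username, so tweet?* picks up every remaining word, incl. tweet10 and up)
unab words : tweet?*
foreach v in topic recipient {
	loc sym = cond("`v'" == "topic", "#", "@") // # marks topics, @ marks recipients
	loc n = 1
	foreach t of local words {
		g `v'`n' = `t' if strpos(`t', "`sym'") == 1
		loc ++n
	} //end t loop
} //end v loop
//get rid of the word variables:
drop `words'
order topic* recipient*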

sortrows topic* , replace missing
sortrows recipient* , replace missing
dropmiss, force

**probably want to reshape at some pt:
g id = _n
order id
reshape long topic recipient , i(id) j(tweet_num)
compress
order id username rt topic recipient

*****************!
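
If you'd rather not depend on -sortrows- and -dropmiss-, a variant is to reshape to one row per word first and then classify each word by its first character. This is only a rough sketch under a few assumptions: the variable names (tok, pos, sender, clean) are made up for illustration, the sender is always the first word, and a tag consists only of letters, digits and underscores (so the "!" in "#Ibiza!" is trimmed).

```stata
clear
input str240 tweet
"*@RndmUsername* I'm having a great time at #Ibiza! #summer2011 RT @SomeOtherPerson15 @test"
"*@anothername* I'm asdff t #test! #summer2011 @test15 @YetAnotherPerson"
end

gen id = _n
split tweet, parse(" ") generate(tok)
reshape long tok, i(id) j(pos)      // one row per word
drop if tok == ""

* classify each word by its first character
gen byte is_topic     = substr(tok, 1, 1) == "#"
gen byte is_recipient = substr(tok, 1, 1) == "@" & pos > 1
gen byte is_rt        = tok == "RT"

* carry the tweet-level facts onto every row of the tweet
bysort id (pos): gen sender = subinstr(tok[1], "*", "", .)
bysort id: egen byte rt = max(is_rt)

* keep only hashtags and recipients, trimming trailing punctuation
keep if is_topic | is_recipient
gen clean = regexs(0) if regexm(tok, "^[#@][A-Za-z0-9_]+")

list id sender rt clean is_topic is_recipient, sepby(id)
```

The result is already in long form (one row per hashtag or recipient within each tweet), which is probably close to what you'd want for building an edge list.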

- Eric
__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
ebooth@ppri.tamu.edu


On Jun 12, 2011, at 2:39 PM, Richard Fairbanks wrote:

> Dear Statalisters,
> 
> I'm preparing a dataset of ~ 2,000 tweets (Twitter messages) for social
> network analysis. I'm trying to track who tweeted to whom and the theme
> (hashtag) of the message.
> 
>  Observations of the single variable look like this.
> 
>  *@RndmUsername* I'm having a great time at #Ibiza! #summer2011 RT
> @SomeOtherPerson15 @YetAnotherPerson
> 
> For those unfamiliar with Twitter:
> 
> @[Name] - Username of the person sending the tweet. Must be 20 characters or
> less, including letters and / or integers in any position.
> 
> RT - "re-tweet" - Think of this like an email "Forward" option for tweets.
> No help needed here, just making a dummy variable!
> 
> #[Name] - "hashtag" - An arbitrary code in letters and integers specifying
> the topic or adding commentary
> 
> Subsequent @[Name]s - These are people to whom the message is specifically
> directed.
> 
> I know how to generate a new variable that contains the message sender
> (always the first string after the "@" character) using regular expressions,
> although there's probably a simpler way.
> 
> How can I generate a new variable that contains #[Names] and @[Names] after
> the first case of a username or hashtag? (That is, using the example, I'm
> having trouble extracting #summer2011, @SomeOtherPerson15 and
> @YetAnotherPerson.)
> 
> 
> 
> Thanks,
> Richard Fairbanks
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/



