
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: RE: string variables

From	Robert Picard <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: RE: string variables
Date	Fri, 20 Sep 2013 16:32:57 -0400

My favorite example of movie title translation:

"Travolti da un insolito destino nell'azzurro mare d'agosto" (original)
"Vers un destin insolite, sur les flots bleus de l'été" (français)

=>  "Swept Away"

Robert

On Fri, Sep 20, 2013 at 10:19 AM, Robert Picard <[email protected]> wrote:
> If the titles are the same and the ids are different, you can use
> -group_id- (from SSC) to merge the identifiers. Something like:
>
> egen newid = group(itunes_id)
> group_id newid, matchby(trackname)
>
> To match on the first few letters of the title
>
> gen trackname7 = substr(trackname,1,7)
> group_id newid, matchby(trackname7)
>
> But my impression is that movie titles are very often completely
> different in different languages, so you will have to look up each
> translation manually.
>
> Robert
>
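The idea of merging identifiers, so that rows sharing either an id or a title end up in one group, can be sketched with built-in commands only. This is a minimal illustration under stated assumptions, not the way -group_id- is implemented: the variable names movie and lowest are hypothetical, the three rows are the ones from the example quoted below, and titles are matched exactly.

```stata
clear
input double itunes_id str2 country str30 trackname
647263958 "it" "Certamente, Forse"
647263958 "bg" "Definitely, Maybe"
281009584 "cy" "Definitely, Maybe"
end

* start with one group per original id (movie is a hypothetical name)
egen long movie = group(itunes_id)

* spread the lowest group number across rows that share a title or an id,
* and repeat until a full pass changes nothing
local changed = 1
while `changed' {
    local changed = 0
    foreach key in trackname itunes_id {
        bysort `key' (movie): gen long lowest = movie[1]
        quietly count if lowest != movie
        if r(N) > 0 local changed = 1
        quietly replace movie = lowest
        drop lowest
    }
}

* all three rows now carry the same value of movie
list itunes_id country trackname movie
```

Because the matching is exact, titles that differ completely across languages still have to be linked by hand or through a shared id.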
>
> On Fri, Sep 20, 2013 at 9:51 AM, Estrella Gomez <[email protected]> wrote:
>> The problem is that I cannot split my dataset into two parts (the
>> original version of the movie and the rest) since the ids are mixed.
>> This is another example:
>>
>> itunes_id | country | artist | trackname
>> 647263958 | it | Adam Brooks | Certamente, Forse
>> 647263958 | bg | Adam Brooks | Definitely, Maybe
>> 281009584 | cy | Adam Brooks | Definitely, Maybe
>>
>> Here I have the same ids but different titles in the first two rows
>> and the same titles but different ids in the last two rows. This is
>> because sometimes a translated movie has the same title as the original.
>>
>> Thank you very much,
>> Estrella
>>
>> 2013/9/20 Joe Canner <[email protected]>:
>>> Estrella,
>>>
>>> I wouldn't assume that the first -n- characters of a movie title are always going to be the same in different languages.  That works for the example you provided, but there will probably be many exceptions.
>>>
>>> What you really need--and even this won't work in all cases--is "fuzzy" matching, akin to what is used, for example, by businesses to match the address you enter with a standard address in a database, or when trying to match patient information with a death index.
>>>
>>> There are two user-written programs for things like this (and there may be more): -reclink- and -vmatch-.  I haven't used them much so I can't say exactly how you would use them for your situation.  If you get stuck on how to manipulate your data to get it into the right structure, let us know.
>>>
>>> Of course, the best solution would be if there were an interface with Google Translate, as there is with Google Maps.  I did a quick search and couldn't find anything like this, which seems like it would be very useful in certain situations.  On the other hand, even if there were such a thing, you would end up with the opposite problem: some words would get translated that should not be (e.g., "Anchorman" in your example).
>>>
>>> Good luck!
>>>
>>> Joe Canner
>>> Johns Hopkins University School of Medicine
>>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of Estrella Gomez
>>> Sent: Friday, September 20, 2013 6:12 AM
>>> To: [email protected]
>>> Subject: st: string variables
>>>
>>> Dear statalisters
>>>
>>> I am working on a dataset related to movies. I would like to identify each movie with a unique id. However, there are many cases in which the title is translated and then the original identifier provided in the dataset is not the same, for instance:
>>>
>>> id | country | artist | trackname
>>> 2975 | at | Adam McKay | Anchorman - Die Legende von Ron Burgundy
>>> 2975 | de | Adam McKay | Anchorman - Die Legende von Ron Burgundy
>>> 6647 | it | Adam McKay | Anchorman: La leggenda di Ron Burgundy
>>> 6653 | be | Adam McKay | Anchorman: The Legend of Ron Burgundy
>>>
>>> How could I create a new id to uniquely identify the same movie (even if it's in different languages)? Maybe I could use the first 5 or 6 letters in the title, because usually this coincides in different languages; but still I don't know how to do it.
>>>
>>> Thanks a lot,
>>> Estrella Gomez
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/

