Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

RE: st: RE: string variables

From	Joe Canner <[email protected]>
To	"[email protected]" <[email protected]>
Subject	RE: st: RE: string variables
Date	Fri, 20 Sep 2013 14:17:20 +0000

I would suggest a solution for cases like this, but this example illustrates the problem of matching titles without a translator.  You cannot even do a "fuzzy" match in this case, because the Italian and English titles share almost no text.
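
To make the point about fuzzy matching concrete, here is a minimal sketch in Python rather than Stata. It uses the standard library's difflib as a stand-in for a dedicated record-linkage tool; the similarity() helper is a hypothetical name introduced only for this illustration. It compares the translated pair from the example quoted below with two of the Anchorman titles from the original question.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two titles, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Titles copied from the two examples in this thread.
translated = similarity("Certamente, Forse", "Definitely, Maybe")
shared_prefix = similarity(
    "Anchorman: La leggenda di Ron Burgundy",
    "Anchorman: The Legend of Ron Burgundy",
)

print(f"Certamente, Forse vs Definitely, Maybe: {translated:.2f}")
print(f"Anchorman (it) vs Anchorman (en):       {shared_prefix:.2f}")
```

The Anchorman pair scores far higher than the fully translated pair, so any threshold strict enough to be useful would link the former and miss the latter.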

If you could always count on having two or more records with the same ID, at least one of which has the original title, then maybe there could be a solution.  In your example below, you could either (1) force the first two observations to have the same title (ideally the original title) and then use the titles to link them to the third observation; or (2) force the last two observations to have the same ID and then use the IDs to link them to the first observation.
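
Both options amount to treating records as linked whenever they share an ID or share a title, and then giving every chain of linked records one new identifier. A rough sketch of that idea in Python (not Stata) follows, using a small union-find structure. The names find, union and new_ids are illustrative only, and pairing the title with the artist to reduce accidental links is an assumption that is not part of the suggestion above.

```python
# Rows from the example: (itunes_id, country, artist, trackname)
rows = [
    (647263958, "it", "Adam Brooks", "Certamente, Forse"),
    (647263958, "bg", "Adam Brooks", "Definitely, Maybe"),
    (281009584, "cy", "Adam Brooks", "Definitely, Maybe"),
]

# Union-find: parent[i] points towards the representative row of i's group.
parent = list(range(len(rows)))


def find(i):
    """Return the representative row of the group containing row i."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # shorten the path as we walk up
        i = parent[i]
    return i


def union(i, j):
    """Merge the groups containing rows i and j."""
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[max(ri, rj)] = min(ri, rj)


# Link each row to the first earlier row with the same ID or the same
# (artist, title) pair. Using the artist as well is an assumption.
first_seen = {}
for i, (itunes_id, _country, artist, title) in enumerate(rows):
    for key in (("id", itunes_id), ("title", artist, title.lower())):
        if key in first_seen:
            union(i, first_seen[key])
        else:
            first_seen[key] = i

# Number the groups 1, 2, ... in order of first appearance.
roots = sorted({find(i) for i in range(len(rows))})
movie_id = {root: n + 1 for n, root in enumerate(roots)}
new_ids = [movie_id[find(i)] for i in range(len(rows))]
print(new_ids)  # [1, 1, 1]
```

This only works when such a chain exists. A translated title that appears under its own ID with no shared title anywhere else would still end up in a separate group.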

However, the way you have described the problem, I don't have much confidence that there is a consistent pattern that can be used to link these records together in general.

P.S. Another thought that just occurred to me: you could use a spell checker to identify titles that are in a different language, which might help regularize this problem.  One way to do this in Stata might be to use internet search engines, which are good at finding alternative spellings and possibly even translations.  For example, when you put "Certamente, Forse" into Google, the first hits are for the movie, including the English title.  For more information, see the following article by Statalist stalwart Sergiy Radyakin: http://www.stata.com/meeting/canada09/ca09_radyakin.pdf


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Estrella Gomez
Sent: Friday, September 20, 2013 9:51 AM
To: [email protected]
Subject: Re: st: RE: string variables

The problem is that I cannot split my dataset into two parts (the original version of the movie and the rest) since the ids are mixed.
This is another example:

itunes_id | country | artist | trackname
647263958 | it | Adam Brooks | Certamente, Forse
647263958 | bg | Adam Brooks | Definitely, Maybe
281009584 | cy | Adam Brooks | Definitely, Maybe

Here I have the same ids but different titles in the first two rows, and the same titles but different ids in the last two rows. This is because sometimes a translated movie has the same title as the original.

Thank you very much,
Estrella

2013/9/20 Joe Canner <[email protected]>:
> Estrella,
>
> I wouldn't assume that the first -n- characters of a movie title are always going to be the same in different languages.  That works for the example you provided, but there will probably be many exceptions.
>
> What you really need--and even this won't work in all cases--is "fuzzy" matching, akin to what is used, for example, by businesses to match the address you enter with a standard address in a database, or when trying to match patient information with a death index.
>
> There are at least two user-written programs for things like this: -reclink- and -vmatch-.  I haven't used them much so I can't say exactly how you would use them for your situation.  If you get stuck on how to manipulate your data to get it into the right structure, let us know.
>
> Of course, the best solution would be an interface with Google Translate, as there is with Google Maps.  I did a quick search and couldn't find anything like this, even though it seems like it would be very useful in certain situations.  On the other hand, even if there were such a thing, you would end up with the opposite problem: some words would get translated that should not be (e.g., "Anchorman" in your example).
>
> Good luck!
>
> Joe Canner
> Johns Hopkins University School of Medicine
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Estrella Gomez
> Sent: Friday, September 20, 2013 6:12 AM
> To: [email protected]
> Subject: st: string variables
>
> Dear statalisters
>
> I am working on a dataset related to movies. I would like to identify each movie with a unique id. However, there are many cases in which the title is translated and the original identifier provided in the dataset is then not the same, for instance:
>
> id | country | artist | trackname
> 2975 | at | Adam McKay | Anchorman - Die Legende von Ron Burgundy
> 2975 | de | Adam McKay | Anchorman - Die Legende von Ron Burgundy
> 6647 | it | Adam McKay | Anchorman: La leggenda di Ron Burgundy
> 6653 | be | Adam McKay | Anchorman: The Legend of Ron Burgundy
>
> How could I create a new id to uniquely identify the same movie (even if it's in different languages)? Maybe I could use the first 5 or 6 letters in the title, because usually this coincides in different languages; but still I don't know how to do it.
>
> Thanks a lot,
> Estrella Gomez
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
