Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

st: RE: string variables

From	Joe Canner <[email protected]>
To	"[email protected]" <[email protected]>
Subject	st: RE: string variables
Date	Fri, 20 Sep 2013 12:53:15 +0000

Estrella,

I wouldn't assume that the first n characters of a movie title will always be the same across languages.  That works for the example you provided, but there will probably be many exceptions.
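For reference, here is a minimal sketch of what the prefix idea would look like, with the caveat above in mind. It uses the four rows from your example; the names prefix6 and movie_id are made up for illustration, and I am assuming that the director (artist) should also agree before two rows count as the same movie.

```stata
* Sketch only: group rows by director plus the first 6 characters of the title.
* id, country, artist, trackname come from the example; prefix6 and movie_id
* are hypothetical names introduced here.
clear
input int id str2 country str20 artist str60 trackname
2975 "at" "Adam McKay" "Anchorman - Die Legende von Ron Burgundy"
2975 "de" "Adam McKay" "Anchorman - Die Legende von Ron Burgundy"
6647 "it" "Adam McKay" "Anchorman: La leggenda di Ron Burgundy"
6653 "be" "Adam McKay" "Anchorman: The Legend of Ron Burgundy"
end

* Lower-case and trim first so that case and stray blanks do not split groups
gen str6 prefix6 = substr(lower(trim(trackname)), 1, 6)

* One id per distinct (artist, prefix6) combination
egen long movie_id = group(artist prefix6)

list id country movie_id trackname, sepby(movie_id)
```

In this example all four rows end up with the same movie_id.  It would fail for titles whose first word is translated (or starts with an article), and it would lump a sequel by the same director together with the original, so the results would need checking by hand.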

What you really need--and even this won't work in all cases--is "fuzzy" matching, akin to what is used, for example, by businesses to match the address you enter with a standard address in a database, or when trying to match patient information with a death index.

There are at least two user-written programs for things like this: -reclink- and -vmatch-.  I haven't used them much, so I can't say exactly how you would use them for your situation.  If you get stuck on how to manipulate your data to get it into the right structure, let us know.
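As a rough starting point only: going by the -reclink- help file, the command links the data in memory to a second file and requires the options idmaster(), idusing(), and gen().  The untested sketch below assumes you can build a reference file with one row per movie; the names canon_id, row_id, and score are made up for illustration, and the matching variables must have the same names in both files.

```stata
* Untested sketch; -reclink- is user-written (ssc install reclink).
* canon_id, row_id, and score are hypothetical names.

* Reference file: one row per movie, with the id you want to assign
clear
input int canon_id str20 artist str60 trackname
1 "Adam McKay" "Anchorman: The Legend of Ron Burgundy"
end
tempfile canon
save `canon'

* Master data: the rows from the example, plus a unique row identifier
clear
input int id str2 country str20 artist str60 trackname
2975 "at" "Adam McKay" "Anchorman - Die Legende von Ron Burgundy"
2975 "de" "Adam McKay" "Anchorman - Die Legende von Ron Burgundy"
6647 "it" "Adam McKay" "Anchorman: La leggenda di Ron Burgundy"
6653 "be" "Adam McKay" "Anchorman: The Legend of Ron Burgundy"
end
gen long row_id = _n

* Fuzzy link on director and title; score holds the match score
reclink artist trackname using `canon', idmaster(row_id) idusing(canon_id) gen(score)

list country trackname canon_id score
```

Rows that do not reach the minimum score (see the minscore() option) would be left unmatched, so the threshold and the unmatched rows would need reviewing.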

Of course, the best solution would be an interface with Google Translate, as there is with Google Maps.  I did a quick search and couldn't find anything like this, although it seems like it would be very useful in certain situations.  On the other hand, even if there were such a thing, you would end up with the opposite problem: some words would get translated that should not be (e.g., "Anchorman" in your example).

Good luck!

Joe Canner
Johns Hopkins University School of Medicine

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Estrella Gomez
Sent: Friday, September 20, 2013 6:12 AM
To: [email protected]
Subject: st: string variables

Dear statalisters

I am working on a dataset related to movies. I would like to identify each movie with a unique id. However, there are many cases in which the title is translated, so the original identifier provided in the dataset differs for the same movie. For instance:

id | country | artist | trackname
2975 | at | Adam McKay | Anchorman - Die Legende von Ron Burgundy
2975 | de | Adam McKay | Anchorman - Die Legende von Ron Burgundy
6647 | it | Adam McKay | Anchorman: La leggenda di Ron Burgundy
6653 | be | Adam McKay | Anchorman: The Legend of Ron Burgundy

How could I create a new id to uniquely identify the same movie (even if it's in different languages)? Maybe I could use the first 5 or 6 letters of the title, because these usually coincide across languages, but I still don't know how to do it.

Thanks a lot,
Estrella Gomez
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
