
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: exploratory data analysis for finding substitutes and complements


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: exploratory data analysis for finding substitutes and complements
Date   Fri, 30 Sep 2011 18:31:05 +0100

If this were my problem I would restructure to fewer variables and do
something like a correspondence analysis.

Ecologists often have data for lots of sites and lots of species, and
sometimes lots of times too. At first stab, their problem is similar:
you want to see which species occur together. With over a hundred
items some favourite methods such as scatter plot matrices and
correlation matrices pose as many problems as they solve. I think you
have to bite the multivariate bullet somehow.

I would collapse over week in the first instance.

In Stata terms, you would probably need to -reshape long- to get a
structure of item, store, week, expenditure.
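
For anyone who wants to see the target structure spelled out, here is a minimal sketch of that restructuring. It is in Python (standard library only) rather than Stata, purely as an illustration: the records are made up, the column names are the ones from the original post, and quantity stands in for expenditure because that is what the data hold. It reshapes first and collapses second; the order does not affect the totals.

```python
from collections import defaultdict

# Hypothetical wide records, one per store-week, using the column names
# from the original post; the numbers are made up for illustration.
wide = [
    {"store_id": 1, "week": 1, "hot_dogs": 10, "burgers": 20, "fries": 25},
    {"store_id": 1, "week": 2, "hot_dogs": 12, "burgers": 18, "fries": 27},
    {"store_id": 2, "week": 1, "hot_dogs": 5, "burgers": 30, "fries": 33},
]
ID_VARS = ("store_id", "week")


def to_long(rows):
    """One (item, store, week, quantity) record per item per store-week."""
    return [
        (item, row["store_id"], row["week"], qty)
        for row in rows
        for item, qty in row.items()
        if item not in ID_VARS
    ]


def collapse_over_week(long_rows):
    """Sum quantity over weeks, leaving one total per (item, store)."""
    totals = defaultdict(int)
    for item, store, _week, qty in long_rows:
        totals[(item, store)] += qty
    return dict(totals)


long_rows = to_long(wide)
item_by_store = collapse_over_week(long_rows)
```

The resulting item-by-store totals are the kind of two-way table that a correspondence analysis would take as input.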

2011/9/30 Cameron McIntosh <[email protected]>:
> Hi Dimitriy,
> This type of analysis might be a bit dicey without basket data (record per customer with a transaction date, along with items purchased), but I don't imagine ecological data is completely prohibitive, either -- this is discussed in the Nestorov and Jukić (2003) paper below. I don't know about Stata specifically...
> Hahsler, M., Buchta, C., Gruen, B., & Hornik, K. (September 19, 2011). Mining Association Rules and Frequent Itemsets: Package 'arules', Version 1.0-6. http://cran.r-project.org/web/packages/arules/arules.pdf http://cran.r-project.org/web/packages/arules/index.html http://cran.r-project.org/web/packages/arules/vignettes/arules.pdf
> Hahsler, M., Chelluboina, S., Hornik, K., & Buchta, C. (2011). The arules R-Package Ecosystem: Analyzing Interesting Patterns from Large Transaction Data Sets. Journal of Machine Learning Research, 12, 2021-2025. http://jmlr.csail.mit.edu/papers/volume12/hahsler11a/hahsler11a.pdf
> Zhang, S., & Wu, X. (2011). Fundamentals of association rules in data mining and knowledge discovery. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(2), 97-116. http://onlinelibrary.wiley.com/doi/10.1002/widm.10/pdf
> Ben Messaoud, R., Loudcher Rabaséda, S., Missaoui, R., & Boussaid, O. (2008). OLEMAR: an On-Line Environment for Mining Association Rules in Multidimensional Data. In D. Taniar (Ed.), Data Mining and Knowledge Discovery Technologies (pp. 1-35). IGI Global. http://eric.univ-lyon2.fr/~sabine/adwm_2007.pdf
> Khan, A., Baharudin, B., & Khan, K. (2011). Mining customer data for decision-making using new hybrid classification algorithm. Journal of Theoretical and Applied Information Technology, 27(1), 54-61. http://www.jatit.org/volumes/research-papers/Vol27No1/7Vol27No1.pdf
> Nestorov, S., & Jukić, N. (2003). Ad-Hoc Association-Rule Mining within the Data Warehouse. Proceedings of the 36th Annual Hawaii International Conference on System Sciences (HICSS'03) - Track 8 - Volume 8. Washington, DC, USA: IEEE Computer Society.
> Cam
>> Date: Fri, 30 Sep 2011 11:34:50 -0400
>> Subject: st: exploratory data analysis for finding substitutes and complements
>> From: [email protected]
>> To: [email protected]
>>
>> I have a panel data set with store-level sales data for 125 items at a
>> chain restaurant. My variables are quantity sold of that item in a
>> particular store and time. My data looks like this: store_id, week,
>> hot_dogs, burgers, fries, and drinks. For each item, I would like to
>> figure out which items are substitutes or complements. For example, I
>> would expect hamburgers and fries, and hot dogs and fries, to be
>> complements, but hot dogs and hamburgers to be substitutes. I would
>> like to group items into clusters to make some time-series graphs, but
>> plotting all 125 items on the same graph is messy.
>>
>> My first attempt at this involved calculating pairwise correlations
>> between items, and grabbing those where the correlation is above some
>> threshold X in absolute value. This works reasonably well, but I don't
>> want to do this by hand for all the items and my loop-over-items
>> approach is slow and inefficient.
>>
>> Is there a command that can accomplish this for me? Or is there a
>> better way of doing this using some sort of clustering algorithm?
>>
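
The pairwise-correlation screen described in the question can be done in one pass over all item pairs rather than by hand. A minimal sketch follows, again in Python (standard library only) and not Stata; the series are made up, and the function names `pearson` and `strong_pairs` and the 0.8 threshold are hypothetical choices for illustration. A large |r| only flags a candidate pair: quantities can move together simply because store traffic rises and falls, so the sign of r is a hint about complements or substitutes, not proof.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation undefined for a constant series
    return sxy / sqrt(sxx * syy)


def strong_pairs(series, threshold):
    """Return (item_a, item_b, r) for every pair with |r| >= threshold."""
    out = []
    for a, b in combinations(sorted(series), 2):
        r = pearson(series[a], series[b])
        if abs(r) >= threshold:  # NaN never passes this comparison
            out.append((a, b, r))
    return out


# Made-up weekly quantities for one store, for illustration only.
sales = {
    "burgers": [20, 18, 25, 30, 22],
    "fries": [22, 19, 27, 33, 24],
    "hot_dogs": [15, 17, 10, 6, 13],
}
pairs = strong_pairs(sales, threshold=0.8)
```

With 125 items there are 7,750 pairs, so the full screen is cheap; the harder part is interpreting the flagged pairs.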
