I am currently using k-means to cluster my data; however, I want each cluster to appear only once in each given year. I searched for answers for a whole night without success. Does anyone have ideas on how to approach this problem in R? Or is there a package I should look at? Thanks.
More background info:
I am trying to replicate the clustering of relationships using the reported gender, education level and birth year. I am doing this because the data come from a survey whose respondents are elderly, and they sometimes report inaccurate age or education information. My main challenge now is that I want each cluster label to appear only once in each survey year. For example, I do not want two rows labelled cluster 3 in survey year 2000. My data look like this:
survey year  relationship          gender  education level  birth year  k-means cluster
2000         41 (first daughter)   0       3                1997        1
2003         41 (first daughter)   0       3                1997        1
2000         42 (second daughter)  0       4                1999        2
2003         42 (second daughter)  0       4                1999        2
2000         42 (third daughter)   0       5                1999        2
2003         42 (third daughter)   0       5                2001        3
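To make the problem concrete, here is a minimal sketch that counts how often each cluster label occurs per survey year, using the six rows above typed in by hand. The names surveys and dups are placeholders chosen only for this illustration.

```r
# The six rows from the table above (only the two columns needed here)
surveys <- data.frame(
  survey_year = c(2000, 2003, 2000, 2003, 2000, 2003),
  cluster     = c(1, 1, 2, 2, 2, 3)
)

# Cross-tabulate year against cluster label and keep the cells with count > 1
counts <- table(surveys$survey_year, surveys$cluster)
dups <- as.data.frame(counts, stringsAsFactors = FALSE)
names(dups) <- c("survey_year", "cluster", "n")
dups <- dups[dups$n > 1, ]
print(dups)
```

With these rows, the only flagged combination is cluster 2 in survey year 2000, which is the kind of duplicate I want to avoid.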
Thanks in advance.
--Update--
A more detailed description of the task:
The data set is a panel survey that asks elderly respondents about their health status and their relationships (including sons, daughters and neighbors). Because these respondents are sometimes imprecise about their family members' demographic information, such as birth year or education level, we might have to delete a large part of the data whenever the reports do not match.
(For example, if A reported that his first son was 30 years old in 1997 but 29 years old in 1999, that record would be considered problematic.) My task is to keep as much data as possible when the imprecision is small.
I therefore first mutated columns to check the precision of each family member (e.g., birth year error %in% c(-1,2)). Next, I ran k-means on the family members flagged as imprecise. This way I keep much of the data. Although I did not solve the problem above, it occurs so rarely that I can almost ignore it or drop the affected observations.
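The precision check can be sketched roughly as follows. This is a simplified illustration, not my actual code: the data frame reports, its column names, the "first daughter" rows and the tolerance of one year are all assumptions made for the example. Only the "first son" ages come from the example above.

```r
# Hypothetical toy data: one row per family member per survey wave
reports <- data.frame(
  member      = c("first son", "first son", "first daughter", "first daughter"),
  survey_year = c(1997, 1999, 2000, 2003),
  age         = c(30, 29, 3, 6)
)

# Birth year implied by each report
reports$birth_year <- reports$survey_year - reports$age

# Deviation of each report from that member's median implied birth year
reports$birth_err <- reports$birth_year -
  ave(reports$birth_year, reports$member, FUN = median)

# Flag reports that deviate by more than the (assumed) tolerance
tolerance <- 1
reports$imprecise <- abs(reports$birth_err) > tolerance
print(reports)
```

Here the two "first son" reports imply birth years of 1967 and 1970, so both are flagged, while the consistent "first daughter" reports are not.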
I would like to predict the next 5 orders and the quantity of each of the 3 products in each order.
I am a beginner with R and time series. I have seen examples using ARIMA, but they forecast a single series rather than multiple products as in my example.
Should I use ARIMA?
What exactly should I do?
Sorry for my bad English. Thank you in advance.
dateordrer,product1,product2,product3
12/01/2012,2565,3254,635
25/01/2012,2270,3254,670
01/03/2012,2000,785,0
05/05/2012,300,3254,750
26/06/2012,3340,0,540
30/06/2012,0,3254,0
21/06/2012,3360,3356,830
01/07/2012,2470,3456,884
03/07/2012,3680,3554,944
05/07/2012,2817,3854,0
09/07/2012,4210,4254,32
09/08/2012,0,3254,1108
13/09/2012,4560,5210,952
25/09/2012,4452,4256,1143
31/09/2012,5090,5469,199
25/11/2012,5100,5569,0
10/12/2012,5440,5789,1323
11/12/2012,5528,5426,1350
Your question is very broad, so it can only be answered broadly. It also has more to do with forecasting theory than with R.
I will give you two pointers to get you started...
First, you have some pre-processing to do: what are your time intervals, and what is your basic time unit (week? month?)? You should aggregate the data according to that time unit. For this kind of operation you can use the dplyr, tidyr and lubridate packages. Here's an example of your data set after I rearranged it a bit:
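library(readr)      # read_csv()
library(dplyr)      # mutate(), group_by(), summarise(), right_join()
library(tidyr)      # replace_na()
library(lubridate)  # month(), year(), floor_date()

data.raw <- read_csv("data1.csv") %>%
  mutate(date.re = as.POSIXct(dateordrer, format = "%d/%m/%Y"))

# Month starts from first to last order; na.rm skips unparseable dates such as 31/09/2012
complete.dates <- floor_date(range(data.raw$date.re, na.rm = TRUE), "month")
dates.seq <- seq(complete.dates[1], complete.dates[2], by = "month")
series <- data.frame(sale.month = month(dates.seq), sale.year = year(dates.seq))

# Sum each product per month, then add the months without orders as zeros
data.post <- data.raw %>%
  mutate(sale.month = month(date.re), sale.year = year(date.re)) %>%
  select(product1:product3, sale.month, sale.year) %>%
  group_by(sale.month, sale.year) %>%
  summarise(across(product1:product3, sum), .groups = "drop") %>%
  right_join(series, by = c("sale.month", "sale.year")) %>%
  replace_na(list(product1 = 0, product2 = 0, product3 = 0)) %>%
  arrange(sale.year, sale.month)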
It would look like this:
sale.month sale.year product1 product2 product3
1 2012 4835 6508 1305
2 2012 0 0 0
3 2012 2000 785 0
4 2012 0 0 0
etc...
Note that you originally had no data for months 2 and 4, so they appear as 0s.
Pre-processing is not to be taken lightly. I used months as the basic unit, but that might not be appropriate or relevant to your goals. You might revisit this choice later and check whether a different aggregation gives better results.
Only after pre-processing can you turn to forecasting. If the three products are independent, they can be forecast independently (e.g. fit ARIMA, Holt-Winters or any other model three times, once per product). However, the fact that you have three products that might be correlated with each other points towards hierarchical time series (package hts). In that package, hts() builds the hierarchical series and forecast() then produces forecasts for all levels that are consistent with each other. This is useful when there is a statistical relationship between the products, for example when one product is usually purchased together with another (complementary products), or when being out of stock on one product shifts demand to a different one (substitute products).
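As a rough illustration of the independent approach only, here is a minimal sketch that fits one model per product with base R. It makes two simplifying assumptions that are not part of the advice above: each order is treated as one time step (no monthly aggregation), and the model is simple exponential smoothing via HoltWinters() with the trend and seasonal components switched off. The helper forecast_one is a hypothetical name used just for this example.

```r
# Quantities per order, copied from the data in the question
orders <- data.frame(
  product1 = c(2565, 2270, 2000, 300, 3340, 0, 3360, 2470, 3680,
               2817, 4210, 0, 4560, 4452, 5090, 5100, 5440, 5528),
  product2 = c(3254, 3254, 785, 3254, 0, 3254, 3356, 3456, 3554,
               3854, 4254, 3254, 5210, 4256, 5469, 5569, 5789, 5426),
  product3 = c(635, 670, 0, 750, 540, 0, 830, 884, 944,
               0, 32, 1108, 952, 1143, 199, 0, 1323, 1350)
)

# Hypothetical helper: fit simple exponential smoothing to one series
# and return point forecasts for the next h steps
forecast_one <- function(x, h = 5) {
  fit <- HoltWinters(ts(x), beta = FALSE, gamma = FALSE)
  as.numeric(predict(fit, n.ahead = h))
}

# One model per product; the result has one row per future order
forecasts <- sapply(orders, forecast_one)
print(round(forecasts))
```

Simple exponential smoothing gives a flat forecast, so this is only a starting point. A model with a trend term, or the monthly aggregation shown earlier, may well suit your data better.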
Since this answer is far from self-contained for such a broad topic, your best next step is to read the following online book:
Forecasting: Principles and Practice
by Hyndman and Athanasopoulos. I read it when I started with time series, and it is a very good book. For multiple time series specifically, you should read section:
9.4 Forecasting hierarchical or grouped time series
Make sure you also read chapter 7 of that book (before moving on to 9.4).
After creating a JSON data structure and iterating over it, I came up with the following data representation. Columns 1 and 2 are of string type; column 4 is the count for the level given in column 3.
Example:
A | B | Level1 | 10
A | C | Level2 | 1
As a verbal explanation, think of countries: in country A there are people who can speak country B's language at level 1 expertise, and there are 10 of them in total.
I was thinking of representing this on 3 axes: X = country1, Y = country2, and Z = the level. Is this sensible, and if so, how can I accomplish it? I have no prior experience with 3D graphics in R.
Here is what my actual data look like. They are in a CSV file.
Here is the data in a data frame:
country1 country2 langLevel frequency
1 gv ca level1 2
2 gv bg level1 1
3 zea li level1 1
4 zea li level3 1
I hope I have explained clearly enough what I want to accomplish. 3D seems to me the best way to represent this, but I could be wrong.
Data in CSV Format:
country1,country2,langLevel,frequency
gv,ca,level1,2
gv,bg,level1,1
zea,li,level1,1
zea,li,level3,1
zea,de,level1,26
zea,de,level3,5
zea,el,level1,1
zea,eo,level1,3
zea,en,level1,5
zea,en,level2,34
zea,en,level3,38
zea,en,level4,12
zea,es,level1,7
zea,la,level1,7
zea,zea,level1,5
zea,zea,level3,4
zea,stq,level1,1
zea,sk,level2,1
zea,nl,level4,4
zea,fr,level2,9
zea,fy,level2,1
cdo,cdo,level3,1
cdo,de,level1,23
cdo,de,level2,4
cdo,de,level3,4
cdo,eo,level1,1
cdo,eo,level2,1
cdo,eo,level3,3
cdo,en,level1,6
cdo,en,level2,31
cdo,en,level3,38
cdo,en,level4,17
cdo,es,level1,8
cdo,es,level2,6
cdo,es,level3,3
cdo,fr,level1,14
cdo,fr,level2,11
cdo,fr,level3,6
gd,als,level1,1
gd,af,level1,2
vls,de,level1,32
vls,de,level2,7
vls,de,level3,6
vls,de,level4,3
vls,eo,level1,2
vls,eo,level2,3
vls,eo,level3,3
vls,en,level1,7
vls,en,level2,38
vls,en,level3,53
vls,en,level4,16
vls,es,level1,15
vls,es,level2,4
vls,es,level3,1
vls,es,level4,1
vls,ru,level2,8
vls,ru,level3,1
vls,ja,level1,2
This is what I tried, but it's really hard to see anything clearly in this plot:
library("rgl")
plot3d(template_levels$country1, template_levels$country2, template_levels$frequency, col=template_levels$langLevel)
Here is the plot:
One possibility is to use bar plots with different colors. Here is a solution using the ggplot2 package, assuming that your data frame is named df.
The x axis shows the country2 values, and each country1 gets a separate facet. Each bar is colored according to langLevel. scales="free_y" in facet_grid() gives each facet its own y scale (because the values differ considerably between facets).
library(ggplot2)
ggplot(df, aes(country2, frequency, fill = langLevel)) +
  geom_bar(stat = "identity") +
  facet_grid(country1 ~ ., scales = "free_y")
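The faceted layout works because the data is essentially a three-way table of counts (country1 by country2 by langLevel). If you want to inspect that structure as numbers before plotting, a minimal sketch with base R's xtabs() could look like this. Only the first four rows of your data are typed in here, and the name tab is a placeholder.

```r
# First four rows of the data frame shown in the question
df <- data.frame(
  country1  = c("gv", "gv", "zea", "zea"),
  country2  = c("ca", "bg", "li", "li"),
  langLevel = c("level1", "level1", "level1", "level3"),
  frequency = c(2, 1, 1, 1)
)

# Three-way table of counts: one country2 x langLevel slice per country1,
# which mirrors one facet per country1 in the plot
tab <- xtabs(frequency ~ country2 + langLevel + country1, data = df)
print(tab)

# A single cell can be read by name
tab["li", "level3", "zea"]
```

Each slice of tab along country1 corresponds to one facet, and combinations that do not occur in the data show up as 0.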