I have a dataset like this
epoch epochIndex year month
1 335 1 1850 12
2 639 2 1851 10
3 670 3 1851 11
4 366 4 1851 1
5 517 5 1851 6
6 547 6 1851 7
7 578 7 1851 8
8 1005 8 1852 10
9 1036 9 1852 11
10 1066 10 1852 12
What I would like to do is set the year and month and get the corresponding row number, like:
MONTH <- 12
YEAR <- 1850
ROWNUMBER = 1
Many thanks
A simple which call would be enough, e.g.:
df <- read.table(textConnection("
epoch epochIndex year month
1 335 1 1850 12
2 639 2 1851 10
3 670 3 1851 11
4 366 4 1851 1
5 517 5 1851 6
6 547 6 1851 7
7 578 7 1851 8
8 1005 8 1852 10
9 1036 9 1852 11
10 1066 10 1852 12"), header=TRUE)
which(df$year == 1850 & df$month == 12)
# [1] 1
which(df$year == 1852 & df$month == 12)
# [1] 10
Sorry, I found the answer:
TIMEC <- which(df$year==YEAR & df$month==MONTH)
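One thing worth knowing about the which() approach: it returns integer(0) when no row matches and several indices when more than one row matches. A minimal sketch of a wrapper that makes the "no match" case explicit follows; the helper name row_for is hypothetical and not part of the original posts, and the data frame is a three-row copy of the question's data.

```r
# Toy copy of the first rows of the data from the question
df <- data.frame(
  epoch      = c(335, 639, 670),
  epochIndex = 1:3,
  year       = c(1850, 1851, 1851),
  month      = c(12, 10, 11)
)

# Hypothetical helper (an assumption, not from the original post):
# returns the matching row number(s), or NA when no row matches.
row_for <- function(data, year, month) {
  idx <- which(data$year == year & data$month == month)
  if (length(idx) == 0) NA_integer_ else idx
}

row_for(df, 1850, 12)  # a row that exists
row_for(df, 1900, 1)   # no such row
```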
I have this data that I want to plot as a time series.
Date Units.Sold
1 Jan-16 588
2 Feb-16 448
3 Mar-16 490
4 Apr-16 512
5 May-16 528
6 Jun-16 432
7 Jul-16 470
8 Aug-16 446
9 Sep-16 465
10 Oct-16 388
11 Nov-16 429
12 Dec-16 414
However, when I use ts(datasetName), I get this:
Time Series:
Start = 1
End = 12
Frequency = 1
Date Units.Sold
1 5 588
2 4 448
3 8 490
4 1 512
5 9 528
6 7 432
7 6 470
8 2 446
9 12 465
10 11 388
11 10 429
12 3 414
As you can see, the dates are in the wrong order. I want January to correspond with 1, February with 2, and so on. Can anybody help?
You need to convert your column named 'Date' to a Date-class object first. You can use as.Date for that, but you'll need to add a year first.
your_year <- 2018
df$Date <- as.Date(paste0(df$Date, '-', your_year), format = '%b-%d-%Y')
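The scrambled numbers in the question match the alphabetical order of the month labels (Apr = 1, Aug = 2, ...), which suggests the Date column is a factor and ts() simply used its integer codes. The sketch below is one possible alternative that avoids date parsing altogether. It rests on an assumption the original post does not confirm: that the "16" in "Jan-16" is a two-digit year (2016) rather than a day of the month. The names labels, mon, yr, ord and units_ts are illustrative.

```r
# Data from the question, with Date stored as a factor (as read.csv did by default in older R)
df <- data.frame(
  Date = c("Jan-16", "Feb-16", "Mar-16", "Apr-16", "May-16", "Jun-16",
           "Jul-16", "Aug-16", "Sep-16", "Oct-16", "Nov-16", "Dec-16"),
  Units.Sold = c(588, 448, 490, 512, 528, 432, 470, 446, 465, 388, 429, 414),
  stringsAsFactors = TRUE
)

labels <- as.character(df$Date)
# month.abb is R's built-in vector of English month abbreviations
mon <- match(substr(labels, 1, 3), month.abb)
# ASSUMPTION: the two digits are a year in the 2000s, not a day
yr <- 2000 + as.integer(substr(labels, 5, 6))

# Sort chronologically, then build a monthly series from the values only
ord <- order(yr, mon)
units_ts <- ts(df$Units.Sold[ord],
               start = c(yr[ord[1]], mon[ord[1]]),
               frequency = 12)
units_ts
```

Passing only the numeric column to ts(), together with start and frequency, keeps the factor out of the series, so January corresponds to the first observation and so on.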
I did an RFM analysis using the package "rfm". The results are in a tibble and I can't figure out how to export them to .csv. I tried the call below, but it exported a blank file.
> dim(bmdata4RFM)
[1] 1182580 3
> str(bmdata4RFM)
'data.frame': 1182580 obs. of 3 variables:
$ customer_ID: num 0 0 0 0 0 0 0 0 0 0 ...
$ sales_date : Factor w/ 366 levels "1/1/2018 0:00:00",..: 267 275 286 297 300 301 302 303 304 305 ...
$ sales : num 101541 110543 60932 75472 43588 ...
> head(bmdata4RFM,5)
customer_ID sales_date sales
1 0 6/30/2017 0:00:00 101540.70
2 0 7/1/2017 0:00:00 110543.35
3 0 7/2/2017 0:00:00 60932.20
4 0 7/3/2017 0:00:00 75471.93
5 0 7/4/2017 0:00:00 43587.70
> library(rfm)
> # convert date from factor to date format
> bmdata4RFM[,2] <- as.Date(as.character(bmdata4RFM[,2]), format = "%m/%d/%Y")
> rfm_result_v2
# A tibble: 535,868 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<dbl> <date> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
1 0 2018-06-30 12 366 42462470. 5 5 5 555
2 1 2018-06-30 12 20 2264. 5 5 5 555
3 2 2018-01-12 181 24 1689 3 5 5 355
4 3 2018-05-04 69 27 1984. 4 5 5 455
5 6 2017-12-07 217 12 922. 2 5 5 255
6 7 2018-01-15 178 19 1680. 3 5 5 355
7 9 2018-01-05 188 19 2106 2 5 5 255
8 20 2018-04-11 92 4 414. 4 5 5 455
9 26 2018-02-10 152 1 72 3 1 2 312
10 48 2017-12-20 204 1 90 2 1 3 213
11 68 2017-09-30 285 1 37 1 1 1 111
12 70 2017-12-17 207 1 18 2 1 1 211
13 104 2017-08-11 335 1 90 1 1 3 113
14 120 2017-07-27 350 1 19 1 1 1 111
15 134 2018-01-13 180 1 275 3 1 4 314
16 153 2018-06-24 18 10 1677 5 5 5 555
17 155 2018-05-28 45 1 315 5 1 4 514
18 171 2018-06-11 31 6 3485. 5 5 5 555
19 172 2018-05-24 49 1 93 5 1 3 513
20 174 2018-06-06 36 3 347. 5 4 5 545
# ... with 535,858 more rows
> write.csv(rfm_result_v2,"bmdataRFMFunction_output071218v2.csv")
The problem seems to be that the result of rfm_table_order is not just a tibble. Looking at this already-solved question, and using its data, you can see this:
> class(rfm_result)
[1] "rfm_table_order" "tibble" "data.frame"
So if, for example, you select its rfm element:
> rfm_result$rfm
# A tibble: 325 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<int> <date> <dbl> <dbl> <int> <int> <int> <int> <dbl>
1 1 2017-08-06 353 1 145 4 1 2 412
2 2 2016-10-15 648 1 268 2 1 3 213
3 5 2016-12-14 588 1 119 3 1 1 311
4 7 2017-04-27 454 1 290 3 1 3 313
5 8 2016-12-07 595 3 835 2 5 5 255
6 10 2017-07-31 359 1 192 4 1 2 412
7 11 2017-08-16 343 1 278 4 1 3 413
8 12 2017-10-14 284 2 294 5 4 3 543
9 15 2016-07-12 743 1 206 2 1 2 212
10 17 2017-05-22 429 2 405 4 4 4 444
# ... with 315 more rows
You can export it with this command:
write.table(rfm_result$rfm , file = "your_path\\df.csv")
OP asks for a CSV output.
Being very picky, write.table(rfm_result$rfm , file = "your_path\\df.csv") does not create a CSV: by default write.table separates fields with a single space (sep = " ").
If you want a CSV, add the sep="," parameter; you will likely also want to leave out the row names, so use row.names=FALSE as well.
write.table(rfm_result$rfm , file = "your_path\\df.csv", sep=",", row.names=FALSE)
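As a small, self-contained illustration of the same idea, the sketch below uses write.csv, which already sets the comma separator. The rfm_result object here is a stand-in list with an rfm element built by hand (an assumption made so the example runs without the rfm package), and the file goes to a temporary path.

```r
# Stand-in for the real result object: a list whose `rfm` element is the table.
# (Hypothetical data; the real object comes from the rfm package.)
rfm_result <- list(
  rfm = data.frame(
    customer_id  = c(1, 2, 5),
    recency_days = c(353, 648, 588),
    rfm_score    = c(412, 213, 311)
  )
)

# Export only the table element, without row names
out_file <- tempfile(fileext = ".csv")
write.csv(rfm_result$rfm, file = out_file, row.names = FALSE)

# Read it back to confirm the file is not blank
check <- read.csv(out_file)
check
```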
Starting from this SO question.
Example data.frame:
df = read.table(text = 'ID Day Count Count_group
18 1933 6 15
33 1933 6 15
37 1933 6 15
18 1933 6 15
16 1933 6 15
11 1933 6 15
111 1932 5 9
34 1932 5 9
60 1932 5 9
88 1932 5 9
18 1932 5 9
33 1931 3 4
13 1931 3 4
56 1931 3 4
23 1930 1 1
6 1800 6 12
37 1800 6 12
98 1800 6 12
52 1800 6 12
18 1800 6 12
76 1800 6 12
55 1799 4 6
6 1799 4 6
52 1799 4 6
133 1799 4 6
112 1798 2 2
677 1798 2 2
778 888 4 8
111 888 4 8
88 888 4 8
10 888 4 8
37 887 2 4
26 887 2 4
8 886 1 2
56 885 1 1
22 120 2 6
34 120 2 6
88 119 1 6
99 118 2 5
12 118 2 5
90 117 1 3
22 115 2 2
99 115 2 2', header = TRUE)
The Count column shows the total number of ID values for each Day, and the Count_group column shows the sum of the Count values for that Day and the four preceding days (Day - 1, Day - 2, Day - 3 and Day - 4).
e.g. 1933 = Count_group 15 because Count 6 (1933) + Count 5 (1932) + Count 3 (1931) + Count 1 (1930) + Count 0 (1929).
What I need to do is duplicate the observations so that each Count_group shows all the rows of its Day, Day - 1, Day - 2, Day - 3 and Day - 4.
e.g. Count_group = 15 is composed of the Count values of Days 1933, 1932, 1931, 1930 (and 1929, which is not present in the df). So all five days need to be included under Count_group = 15. The next one will be Count_group = 9, composed of 1932, 1931, 1930, 1929 and 1928; etc.
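To make the arithmetic concrete, here is a minimal sketch that recomputes Count_group from Count for the 1930 to 1933 block, assuming a window of 5 calendar days as described. The names day_counts and days are illustrative.

```r
# One row per Day with its Count (taken from the example data)
day_counts <- data.frame(
  Day   = c(1933, 1932, 1931, 1930),
  Count = c(6, 5, 3, 1)
)
days <- 5  # window length: Day, Day - 1, ..., Day - 4

# For each Day, sum the Count of all days falling inside its window
day_counts$Count_group <- sapply(day_counts$Day, function(d) {
  in_window <- day_counts$Day <= d & day_counts$Day > d - days
  sum(day_counts$Count[in_window])
})
day_counts
```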
Desired output:
ID Day Count Count_group
18 1933 6 15
33 1933 6 15
37 1933 6 15
18 1933 6 15
16 1933 6 15
11 1933 6 15
111 1932 5 15
34 1932 5 15
60 1932 5 15
88 1932 5 15
18 1932 5 15
33 1931 3 15
13 1931 3 15
56 1931 3 15
23 1930 1 15
111 1932 5 9
34 1932 5 9
60 1932 5 9
88 1932 5 9
18 1932 5 9
33 1931 3 9
13 1931 3 9
56 1931 3 9
23 1930 1 9
33 1931 3 4
13 1931 3 4
56 1931 3 4
23 1930 1 4
23 1930 1 1
6 1800 6 12
37 1800 6 12
98 1800 6 12
52 1800 6 12
18 1800 6 12
76 1800 6 12
55 1799 4 12
6 1799 4 12
52 1799 4 12
133 1799 4 12
112 1798 2 12
677 1798 2 12
55 1799 4 6
6 1799 4 6
52 1799 4 6
133 1799 4 6
112 1798 2 6
677 1798 2 6
112 1798 2 2
677 1798 2 2
778 888 4 8
111 888 4 8
88 888 4 8
10 888 4 8
37 887 2 8
26 887 2 8
8 886 1 8
56 885 1 8
37 887 2 4
26 887 2 4
8 886 1 4
56 885 1 4
8 886 1 2
56 885 1 2
56 885 1 1
22 120 2 6
34 120 2 6
88 119 1 6
99 118 2 6
12 118 2 6
90 117 1 6
88 119 1 6
99 118 2 6
12 118 2 6
90 117 1 6
22 115 2 6
99 115 2 6
99 118 2 5
12 118 2 5
90 117 1 5
22 115 2 5
99 115 2 5
90 117 1 3
22 115 2 3
99 115 2 3
22 115 2 2
99 115 2 2
(Note that the different groups of 5 days have been separated by blank lines to make them clearer.)
I have several different data.frames which are grouped by n days, so I would like to be able to adapt the code (with small changes) for each of them.
Thanks
A generalised version of my previous answer...
#first add grouping variables
days <- 5 #grouping no of days
df$smalldaygroup <- c(0,cumsum(sapply(2:nrow(df),function(i) df$Day[i]!=df$Day[i-1]))) #individual days
df$bigdaygroup <- c(0,cumsum(sapply(2:nrow(df),function(i) df$Day[i]<df$Day[i-1]-days+1))) #blocks of linked days
#duplicate days in each big group
df2 <- lapply(split(df,df$bigdaygroup),function(x) {
n <- max(x$Day)-min(x$Day)+1 #number of consecutive days in big group
dayvec <- (max(x$Day):min(x$Day)) #possible days in range
daylog <- dayvec[dayvec %in% x$Day] #actual days in range
pattern <- data.frame(base=rep(dayvec,each=days))
pattern$rep <- sapply(1:nrow(pattern),function(i) pattern$base[i]+1-sum(pattern$base[1:i]==pattern$base[i])) #indices to repeat
pattern$offset <- match(pattern$rep,daylog)-match(pattern$base,daylog) #offsets (used later)
pattern <- pattern[(pattern$base %in% x$Day) & (pattern$rep %in% x$Day),] #remove invalid elements
#store pattern in list as offsets needed in next loop
return(list(df=split(x,x$smalldaygroup)[match(pattern$rep,daylog)],pat=pattern))
})
#change the Count_group to previous value in added entries
df2 <- lapply(df2,function(L) lapply(1:length(L$df),function(i) {
x <- L$df[[i]]
offset <- L$pat$offset #pointer to day to copy Count_group from
x$Count_group <- L$df[[i-offset[i]]]$Count_group[1]
return(x)
}))
df2 <- do.call(rbind,unlist(df2,recursive=FALSE)) #bind back together
df2[,5:6] <- NULL #remove grouping variables
head(df2,30) #ignore rownames!
ID Day Count Count_group
01.1 18 1933 6 15
01.2 33 1933 6 15
01.3 37 1933 6 15
01.4 18 1933 6 15
01.5 16 1933 6 15
01.6 11 1933 6 15
02.7 111 1932 5 15
02.8 34 1932 5 15
02.9 60 1932 5 15
02.10 88 1932 5 15
02.11 18 1932 5 15
03.12 33 1931 3 15
03.13 13 1931 3 15
03.14 56 1931 3 15
04 23 1930 1 15
05.7 111 1932 5 9
05.8 34 1932 5 9
05.9 60 1932 5 9
05.10 88 1932 5 9
05.11 18 1932 5 9
06.12 33 1931 3 9
06.13 13 1931 3 9
06.14 56 1931 3 9
07 23 1930 1 9
08.12 33 1931 3 4
08.13 13 1931 3 4
08.14 56 1931 3 4
09 23 1930 1 4
010 23 1930 1 1
11.16 6 1800 6 12
I attach a rather mechanical method, but I believe it is a good starting point.
I have noticed that in your original table the entry
ID Day Count Count_group
18 1933 6 14
is duplicated; I have left it untouched for the sake of clarity.
Structure of the approach:
1. Read original data
2. Generate list of data frames, for each Day
3. Generate final data frame, collapsing the list in 2.
1. Read original data
We start with
df = read.table(text = 'ID Day Count Count_group
18 1933 6 14
33 1933 6 14
37 1933 6 14
18 1933 6 14
16 1933 6 14
11 1933 6 14
111 1932 5 9
34 1932 5 9
60 1932 5 9
88 1932 5 9
18 1932 5 9
33 1931 3 4
13 1931 3 4
56 1931 3 4
23 1930 1 1
6 1800 6 12
37 1800 6 12
98 1800 6 12
52 1800 6 12
18 1800 6 12
76 1800 6 12
55 1799 4 6
6 1799 4 6
52 1799 4 6
133 1799 4 6
112 1798 2 2
677 1798 2 2
778 888 4 7
111 888 4 7
88 888 4 7
10 888 4 7
37 887 2 4
26 887 2 4
8 886 1 2
56 885 1 1', header = TRUE)
# ordered vector of unique values for "Day"
ord_day <- unique(df$Day[order(df$Day)])
ord_day
[1] 885 886 887 888 1798 1799 1800 1930 1931 1932 1933
2. Generate list of data frames, for each Day
For each element in ord_day we introduce a data.frame as element of a list called df_new_aug.
Such data frames are defined through a for loop for all values in ord_day except ord_day[2] and ord_day[1] which are treated separately.
Idea behind the looping: for each unique ord_day[i] with i > 2 we check which of ord_day[i-1] and ord_day[i-2] (or both!) contribute (through the variable "Count") to the value "Count_group" at ord_day[i].
We therefore introduce if else statements in the loop.
Here we go
# Recursive generation of the list of data.frames (for days > 886)
#-----------------------------------------------------------------
df_new <- list()
df_new_aug <- list()
# we exclude cases i=1, 2: they are manually treated below
for ( i in 3: length(ord_day) ) {
# is "Count_Group" for ord_day[i] equal to the sum of "Count" at ord_day[i-1] and ord_day[i-2]?
if ( unique(df[df$Day == ord_day[i], "Count_group"]) == unique(df[df$Day == ord_day[i], "Count"]) +
unique(df[df$Day == ord_day[i-1], "Count"]) + unique(df[df$Day == ord_day[i-2], "Count"])
) {
# we create columns ID | Day | Count
df_new[[i]] <- data.frame(df[df$Day == ord_day[i] | df$Day == ord_day[i-1] | df$Day == ord_day[i-2],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[i]
df_new_aug[[i]] <- data.frame( df_new[[i]],
Count_group = rep(unique(df[df$Day == ord_day[i], "Count_group"]), nrow(df_new[[i]]) ) )
} else if (unique(df[df$Day == ord_day[i], "Count_group"]) == unique(df[df$Day == ord_day[i], "Count"]) +
unique(df[df$Day == ord_day[i-1], "Count"]) ) #only "Count" at i and i-1 contribute to "Count_group" at i
{
df_new[[i]] <- data.frame(df[df$Day == ord_day[i] | df$Day == ord_day[i-1],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[i]
df_new_aug[[i]] <- data.frame(df_new[[i]],
Count_group = rep(unique(df[df$Day == ord_day[i], "Count_group"]), nrow(df_new[[i]]) ) )
} else { # only "Count" at i contributes to "Count_group" at i
df_new[[i]] <- data.frame(df[df$Day == ord_day[i],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[i]
df_new_aug[[i]] <- data.frame(df_new[[i]],
Count_group = rep(unique(df[df$Day == ord_day[i], "Count_group"]), nrow(df_new[[i]]) ) )
}
# closing the for loop
}
# for ord_day[2] = "886" (both "Count" at i =2 and i = 1 contribute to "Count_group" at i=2)
#-------------------------------------------------------------------------------------
df_new[[2]] <- data.frame(df[df$Day == ord_day[2] | df$Day == ord_day[1],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[2]
df_new_aug[[2]] <- data.frame(df_new[[2]],
Count_group = rep(unique(df[df$Day == ord_day[2], "Count_group"]), nrow(df_new[[2]]) ) )
# for ord_day[1] = "885" (only "count" at i = 1 contributes to "Count_group" at i =1)
#------------------------------------------------------------------------------------
df_new[[1]] <- data.frame(df[df$Day == ord_day[1], c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[i]
df_new_aug[[1]] <- data.frame(df_new[[1]], Count_group = rep(unique(df[df$Day == ord_day[1], "Count_group"]), nrow(df_new[[1]]) ) )
# produced list
df_new_aug
3. Generate final data frame, collapsing the list in 2.
We collapse df_new_aug through an ugly loop, but other solutions (for example with Reduce() and merge()) are possible:
# merging the list (mechanically): final result
df_result <- df_new_aug[[1]]
# append the remaining list elements (10 of them for the 11 days above)
for (i in seq_len(length(df_new_aug) - 1)) {
df_result <- rbind(df_result, df_new_aug[[i+1]])
}
This yields df_result, which completes the analysis.
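The same result can probably be reached more compactly. The sketch below is an alternative, not the method of either answer: for every Day it takes all rows whose Day falls inside that Day's window, stamps them with that Day's Count_group, and binds the pieces with do.call(rbind, ...) instead of a loop. The function name expand_groups, the days argument and the toy data are assumptions made for illustration; the window is taken to be a number of calendar days, as in the question.

```r
# Hypothetical helper: duplicate rows so each Count_group lists its whole window
expand_groups <- function(df, days) {
  pieces <- lapply(unique(df$Day), function(d) {
    # all rows belonging to Day d or to the (days - 1) calendar days before it
    block <- df[df$Day <= d & df$Day > d - days, ]
    # label the whole block with the Count_group of Day d
    block$Count_group <- df$Count_group[df$Day == d][1]
    block
  })
  out <- do.call(rbind, pieces)  # bind all blocks in one call
  rownames(out) <- NULL
  out
}

# Small toy input (not the full data set from the question)
toy <- data.frame(
  ID          = c(18, 33, 111, 23),
  Day         = c(1933, 1933, 1932, 1930),
  Count       = c(2, 2, 1, 1),
  Count_group = c(4, 4, 2, 1)
)
result <- expand_groups(toy, days = 5)
result
```

Changing days adapts the sketch to data frames grouped by a different number of days.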
Let's assume I ran a random forest model and got the variable importance info as below:
set.seed(121)
ImpMeasure<-data.frame(mod.varImp$importance)
ImpMeasure$Vars<-row.names(ImpMeasure)
ImpMeasure.df<-ImpMeasure[order(-ImpMeasure$Overall),]
row.names(ImpMeasure.df)<-NULL
class(ImpMeasure.df)
ImpMeasure.df<-ImpMeasure.df[,c(2,1)] # so now we have the importance variable info in a data frame
ImpMeasure.df
Vars Overall
1 num_voted_users 100.000000
2 num_critic_for_reviews 58.961441
3 num_user_for_reviews 56.500707
4 movie_facebook_likes 50.680318
5 cast_total_facebook_likes 30.012205
6 gross 27.652559
7 actor_3_facebook_likes 24.094213
8 actor_2_facebook_likes 19.633290
9 imdb_score 16.063007
10 actor_1_facebook_likes 15.848972
11 duration 11.886036
12 budget 11.853066
13 title_year 7.804387
14 director_facebook_likes 7.318787
15 facenumber_in_poster 1.868376
16 aspect_ratio 0.000000
Now, if I decide that I want only the top 5 variables for further analysis, I do this:
library(dplyr)
top.var<-ImpMeasure.df[1:5,] %>% select(Vars)
top.var
Vars
1 num_voted_users
2 num_critic_for_reviews
3 num_user_for_reviews
4 movie_facebook_likes
5 cast_total_facebook_likes
How can I use this info to select only these variables from the original dataset (given below) without spelling out the actual variable names, using, say, the output of top.var? How do I use the dplyr select function for this?
My original dataset is like this:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes
1 723 178 0 855
2 302 169 563 1000
3 602 148 0 161
4 813 164 22000 23000
5 255 95 131 782
6 462 132 475 530
actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes
1 1000 760505847 886204 4834
2 40000 309404152 471220 48350
3 11000 200074175 275868 11700
4 27000 448130642 1144337 106759
5 131 228830 8 143
6 640 73058679 212204 1873
facenumber_in_poster num_user_for_reviews budget title_year
1 0 3054 237000000 2009
2 0 1238 300000000 2007
3 1 994 245000000 2015
4 0 2701 250000000 2012
5 0 97 26000000 2002
6 1 738 263700000 2012
actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes cluster
1 936 7.9 1.78 33000 2
2 5000 7.1 2.35 0 3
3 393 6.8 2.35 85000 2
4 23000 8.5 2.35 164000 3
5 12 7.1 1.85 0 1
6 632 6.6 2.35 24000 2
movies.imp<-moviesdf.cluster%>% select(one_of(top.vars),cluster)
head(movies.imp)
## num_voted_users num_user_for_reviews num_critic_for_reviews
## 1 886204 3054 723
## 2 471220 1238 302
## 3 275868 994 602
## 4 1144337 2701 813
## 5 8 127 37
## 6 212204 738 462
## movie_facebook_likes cast_total_facebook_likes cluster
## 1 33000 4834 1
## 2 0 48350 1
## 3 85000 11700 1
## 4 164000 106759 1
## 5 0 143 2
## 6 24000 1873 1
That's done!
Hadley provided the answer to that here:
select_(df, .dots = top.var)
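Note that select_() has since been deprecated in dplyr; in recent dplyr versions the usual form is select(all_of(top.var$Vars), cluster). For readers who want to avoid the dependency, the sketch below shows the same selection in base R. The two-row data frame is a cut-down, illustrative stand-in for the original dataset, and keep is a name introduced here.

```r
# Cut-down stand-in for the original dataset (illustrative values)
moviesdf.cluster <- data.frame(
  num_voted_users        = c(886204, 471220),
  duration               = c(178, 169),
  num_critic_for_reviews = c(723, 302),
  cluster                = c(2, 3)
)

# Top variables stored in a one-column data frame, as in the question
top.var <- data.frame(
  Vars = c("num_voted_users", "num_critic_for_reviews"),
  stringsAsFactors = FALSE
)

# Select by a character vector of names, plus the cluster column
keep <- c(top.var$Vars, "cluster")
movies.imp <- moviesdf.cluster[, keep, drop = FALSE]
movies.imp
```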
I have the following dataframe:
teamID X3M TR AS ST BK PTS FGP FTP
1 423 2884 1405 585 344 5797 0.4763141 0.7370821
2 467 2509 868 326 200 6159 0.4590164 0.7604167
3 769 1944 1446 614 168 6801 0.4248021 0.7825521
4 814 2457 1596 620 308 8058 0.4348856 0.8241445
5 356 2215 1153 403 243 4801 0.4427576 0.7478921
6 302 3360 1151 381 393 6271 0.4626974 0.6757176
7 384 2318 1070 431 269 5225 0.4345146 0.7460317
8 353 2529 1683 561 203 6150 0.4537273 0.7344740
9 598 2384 1635 497 162 6439 0.4512104 0.7998392
10 502 3191 1898 525 337 7107 0.4598565 0.7836970
I want to produce a dataframe like this:
teamID rank_X3M rank_TR rank_AS rank_ST rank_BK rank_PTS rank_FGP rank_FTP
1 5
2 6
3 9
4 10
5 3
6 1
7 4
8 2
9 8
10 7
I tried apply(-df[,c(2:9)], 1, rank, ties.method='min') and got this
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
X3M 4 4 5 4 4 6 4 4 5 6
TR 2 2 2 2 2 2 2 2 2 2
AS 3 3 3 3 3 3 3 3 3 3
ST 5 5 4 5 5 4 5 5 4 5
BK 6 6 6 6 6 5 6 6 6 4
PTS 1 1 1 1 1 1 1 1 1 1
FGP 8 8 8 8 8 8 8 8 8 8
FTP 7 7 7 7 7 7 7 7 7 7
Any suggestions about what to try next? Thanks!
Try sapply as below; you can rename the variables afterwards.
cl <- read.table(text="
teamID X3M TR AS ST BK PTS FGP FTP
1 423 2884 1405 585 344 5797 0.4763141 0.7370821
2 467 2509 868 326 200 6159 0.4590164 0.7604167
3 769 1944 1446 614 168 6801 0.4248021 0.7825521
4 814 2457 1596 620 308 8058 0.4348856 0.8241445
5 356 2215 1153 403 243 4801 0.4427576 0.7478921
6 302 3360 1151 381 393 6271 0.4626974 0.6757176
7 384 2318 1070 431 269 5225 0.4345146 0.7460317
8 353 2529 1683 561 203 6150 0.4537273 0.7344740
9 598 2384 1635 497 162 6439 0.4512104 0.7998392
10 502 3191 1898 525 337 7107 0.4598565 0.7836970", header=T)
new <- cbind(cl$teamID, sapply(cl[,c(2:9)], rank))
new
X3M TR AS ST BK PTS FGP FTP
[1,] 1 5 8 5 8 9 3 10 3
[2,] 2 6 6 1 1 3 5 7 6
[3,] 3 9 1 6 9 2 8 1 7
[4,] 4 10 5 7 10 7 10 3 10
[5,] 5 3 2 4 3 5 1 4 5
[6,] 6 1 10 3 2 10 6 9 1
[7,] 7 4 3 2 4 6 2 2 4
[8,] 8 2 7 9 7 4 4 6 2
[9,] 9 8 4 8 5 1 7 5 9
[10,] 10 7 9 10 6 8 9 8 8
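The attempt in the question used apply(..., 1, ...), which ranks the values within each row (MARGIN = 1), so every team got a similar pattern; ranking has to run down each column instead. A minimal sketch that also produces the rank_ column names asked for is shown below. It uses only two of the stat columns to stay short, and the names ranks and ranked are illustrative. Ranks are ascending (smallest value gets rank 1); negate the columns for descending ranks.

```r
# Two stat columns from the question's data, for brevity
cl <- data.frame(
  teamID = 1:10,
  X3M = c(423, 467, 769, 814, 356, 302, 384, 353, 598, 502),
  TR  = c(2884, 2509, 1944, 2457, 2215, 3360, 2318, 2529, 2384, 3191)
)

# Rank each column separately (lapply iterates over columns of a data frame)
ranks <- as.data.frame(lapply(cl[-1], rank, ties.method = "min"))

# Prefix the column names and put teamID back in front
names(ranks) <- paste0("rank_", names(cl)[-1])
ranked <- cbind(teamID = cl$teamID, ranks)
ranked
```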