Ranking multiple columns in R - r

cI have the following dataframe:
teamID X3M TR AS ST BK PTS FGP FTP
1 423 2884 1405 585 344 5797 0.4763141 0.7370821
2 467 2509 868 326 200 6159 0.4590164 0.7604167
3 769 1944 1446 614 168 6801 0.4248021 0.7825521
4 814 2457 1596 620 308 8058 0.4348856 0.8241445
5 356 2215 1153 403 243 4801 0.4427576 0.7478921
6 302 3360 1151 381 393 6271 0.4626974 0.6757176
7 384 2318 1070 431 269 5225 0.4345146 0.7460317
8 353 2529 1683 561 203 6150 0.4537273 0.7344740
9 598 2384 1635 497 162 6439 0.4512104 0.7998392
10 502 3191 1898 525 337 7107 0.4598565 0.7836970
I want to produce a dataframe like this:
teamID rank_X3M rank_TR rank_AS rank_ST rank_BK rank_PTS rank_FGP rank_FTP
1 5
2 6
3 9
4 10
5 3
6 1
7 4
8 2
9 8
10 7
I tried apply(-df[,c(2:9)], 1, rank, ties.method='min') and got this
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
X3M 4 4 5 4 4 6 4 4 5 6
TR 2 2 2 2 2 2 2 2 2 2
AS 3 3 3 3 3 3 3 3 3 3
ST 5 5 4 5 5 4 5 5 4 5
BK 6 6 6 6 6 5 6 6 6 4
PTS 1 1 1 1 1 1 1 1 1 1
FGP 8 8 8 8 8 8 8 8 8 8
FTP 7 7 7 7 7 7 7 7 7 7
Any suggestions about what to try next? Thanks!

Try sapply like below, you can change names of the variables later
cl <- read.table(text="
teamID X3M TR AS ST BK PTS FGP FTP
1 423 2884 1405 585 344 5797 0.4763141 0.7370821
2 467 2509 868 326 200 6159 0.4590164 0.7604167
3 769 1944 1446 614 168 6801 0.4248021 0.7825521
4 814 2457 1596 620 308 8058 0.4348856 0.8241445
5 356 2215 1153 403 243 4801 0.4427576 0.7478921
6 302 3360 1151 381 393 6271 0.4626974 0.6757176
7 384 2318 1070 431 269 5225 0.4345146 0.7460317
8 353 2529 1683 561 203 6150 0.4537273 0.7344740
9 598 2384 1635 497 162 6439 0.4512104 0.7998392
10 502 3191 1898 525 337 7107 0.4598565 0.7836970", header=T)
new <- cbind(cl$teamID, sapply(cl[,c(2:9)], rank))
new
X3M TR AS ST BK PTS FGP FTP
[1,] 1 5 8 5 8 9 3 10 3
[2,] 2 6 6 1 1 3 5 7 6
[3,] 3 9 1 6 9 2 8 1 7
[4,] 4 10 5 7 10 7 10 3 10
[5,] 5 3 2 4 3 5 1 4 5
[6,] 6 1 10 3 2 10 6 9 1
[7,] 7 4 3 2 4 6 2 2 4
[8,] 8 2 7 9 7 4 4 6 2
[9,] 9 8 4 8 5 1 7 5 9
[10,] 10 7 9 10 6 8 9 8 8

Related

How to export tibble to .csv

I did a rfm analysis using package "rfm". The results are in tibble and I can't seem to figure out how to export it to .csv. I tried argument below but it exported a blank file.
> dim(bmdata4RFM)
[1] 1182580 3
> str(bmdata4RFM)
'data.frame': 1182580 obs. of 3 variables:
$ customer_ID: num 0 0 0 0 0 0 0 0 0 0 ...
$ sales_date : Factor w/ 366 levels "1/1/2018 0:00:00",..: 267 275 286 297 300 301 302 303 304 305 ...
$ sales : num 101541 110543 60932 75472 43588 ...
> head(bmdata4RFM,5)
customer_ID sales_date sales
1 0 6/30/2017 0:00:00 101540.70
2 0 7/1/2017 0:00:00 110543.35
3 0 7/2/2017 0:00:00 60932.20
4 0 7/3/2017 0:00:00 75471.93
5 0 7/4/2017 0:00:00 43587.70
> library(rfm)
> # convert date from factor to date format
> bmdata4RFM[,2] <- as.Date(as.character(bmdata4RFM[,2]), format = "%m/%d/%Y")
> rfm_result_v2
# A tibble: 535,868 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<dbl> <date> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
1 0 2018-06-30 12 366 42462470. 5 5 5 555
2 1 2018-06-30 12 20 2264. 5 5 5 555
3 2 2018-01-12 181 24 1689 3 5 5 355
4 3 2018-05-04 69 27 1984. 4 5 5 455
5 6 2017-12-07 217 12 922. 2 5 5 255
6 7 2018-01-15 178 19 1680. 3 5 5 355
7 9 2018-01-05 188 19 2106 2 5 5 255
8 20 2018-04-11 92 4 414. 4 5 5 455
9 26 2018-02-10 152 1 72 3 1 2 312
10 48 2017-12-20 204 1 90 2 1 3 213
11 68 2017-09-30 285 1 37 1 1 1 111
12 70 2017-12-17 207 1 18 2 1 1 211
13 104 2017-08-11 335 1 90 1 1 3 113
14 120 2017-07-27 350 1 19 1 1 1 111
15 134 2018-01-13 180 1 275 3 1 4 314
16 153 2018-06-24 18 10 1677 5 5 5 555
17 155 2018-05-28 45 1 315 5 1 4 514
18 171 2018-06-11 31 6 3485. 5 5 5 555
19 172 2018-05-24 49 1 93 5 1 3 513
20 174 2018-06-06 36 3 347. 5 4 5 545
# ... with 535,858 more rows
> write.csv(rfm_result_v2,"bmdataRFMFunction_output071218v2.csv")
The problem seems to be that the result of the rfm_table_order is not only a tibble: looking at this question already solved, and using its data, you can know this:
> class(rfm_result)
[1] "rfm_table_order" "tibble" "data.frame"
So if for example choose this:
> rfm_result$rfm
# A tibble: 325 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<int> <date> <dbl> <dbl> <int> <int> <int> <int> <dbl>
1 1 2017-08-06 353 1 145 4 1 2 412
2 2 2016-10-15 648 1 268 2 1 3 213
3 5 2016-12-14 588 1 119 3 1 1 311
4 7 2017-04-27 454 1 290 3 1 3 313
5 8 2016-12-07 595 3 835 2 5 5 255
6 10 2017-07-31 359 1 192 4 1 2 412
7 11 2017-08-16 343 1 278 4 1 3 413
8 12 2017-10-14 284 2 294 5 4 3 543
9 15 2016-07-12 743 1 206 2 1 2 212
10 17 2017-05-22 429 2 405 4 4 4 444
# ... with 315 more rows
You can export it with this command:
write.table(rfm_result$rfm , file = "your_path\\df.csv")
OP asks for a CSV output.
Being very picky, write.table(rfm_result$rfm , file = "your_path\\df.csv") creates a TSV.
If you want a CSV add the sep="," parameter and also you'll likely want to not write out the row names so also use row.names=FALSE.
write.table(rfm_result$rfm , file = "your_path\\df.csv", sep=",", row.names=FALSE)

How to dynamically select columns

Lets assume i ran a random Forest model and i get the variable importance info as below:
set.seed(121)
ImpMeasure<-data.frame(mod.varImp$importance)
ImpMeasure$Vars<-row.names(ImpMeasure)
ImpMeasure.df<-ImpMeasure[order(-ImpMeasure$Overall),]
row.names(ImpMeasure.df)<-NULL
class(ImpMeasure.df)
ImpMeasure.df<-ImpMeasure.df[,c(2,1)] # so now we have the importance variable info in a data frame
ImpMeasure.df
Vars Overall
1 num_voted_users 100.000000
2 num_critic_for_reviews 58.961441
3 num_user_for_reviews 56.500707
4 movie_facebook_likes 50.680318
5 cast_total_facebook_likes 30.012205
6 gross 27.652559
7 actor_3_facebook_likes 24.094213
8 actor_2_facebook_likes 19.633290
9 imdb_score 16.063007
10 actor_1_facebook_likes 15.848972
11 duration 11.886036
12 budget 11.853066
13 title_year 7.804387
14 director_facebook_likes 7.318787
15 facenumber_in_poster 1.868376
16 aspect_ratio 0.000000
Now If i decide that i want only top 5 variables for further analysis then in do this:
library(dplyr)
top.var<-ImpMeasure.df[1:5,] %>% select(Vars)
top.var
Vars
1 num_voted_users
2 num_critic_for_reviews
3 num_user_for_reviews
4 movie_facebook_likes
5 cast_total_facebook_likes
How can use this info to select these var only from the original dataset (given below) without spelling out the actual variable names but using say the output of top.var....how to use dplyr select function for this..
My original dataset is like this:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes
1 723 178 0 855
2 302 169 563 1000
3 602 148 0 161
4 813 164 22000 23000
5 255 95 131 782
6 462 132 475 530
actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes
1 1000 760505847 886204 4834
2 40000 309404152 471220 48350
3 11000 200074175 275868 11700
4 27000 448130642 1144337 106759
5 131 228830 8 143
6 640 73058679 212204 1873
facenumber_in_poster num_user_for_reviews budget title_year
1 0 3054 237000000 2009
2 0 1238 300000000 2007
3 1 994 245000000 2015
4 0 2701 250000000 2012
5 0 97 26000000 2002
6 1 738 263700000 2012
actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes cluster
1 936 7.9 1.78 33000 2
2 5000 7.1 2.35 0 3
3 393 6.8 2.35 85000 2
4 23000 8.5 2.35 164000 3
5 12 7.1 1.85 0 1
6 632 6.6 2.35 24000 2
movies.imp<-moviesdf.cluster%>% select(one_of(top.vars),cluster)
head(movies.imp)
## num_voted_users num_user_for_reviews num_critic_for_reviews
## 1 886204 3054 723
## 2 471220 1238 302
## 3 275868 994 602
## 4 1144337 2701 813
## 5 8 127 37
## 6 212204 738 462
## movie_facebook_likes cast_total_facebook_likes cluster
## 1 33000 4834 1
## 2 0 48350 1
## 3 85000 11700 1
## 4 164000 106759 1
## 5 0 143 2
## 6 24000 1873 1
That done!
Hadley provided the answer to that here:
select_(df, .dots = top.var)

how to calculate Riemann Sums in R?

Can any one help how to find approximate area under the curve using Riemann Sums in R?
It seems we do not have any package in R which could help.
Sample data:
MNo1 X1 Y1 MNo2 X2 Y2
1 2981 -66287 1 595 -47797
1 2981 -66287 1 595 -47797
2 2973 -66087 2 541 -47597
2 2973 -66087 2 541 -47597
3 2963 -65887 3 485 -47397
3 2963 -65887 3 485 -47397
4 2952 -65687 4 430 -47197
4 2952 -65687 4 430 -47197
5 2942 -65486 5 375 -46998
5 2942 -65486 5 375 -46998
6 2935 -65286 6 322 -46798
6 2935 -65286 6 322 -46798
7 2932 -65086 7 270 -46598
7 2932 -65086 7 270 -46598
8 2936 -64886 8 222 -46398
8 2936 -64886 8 222 -46398
9 2948 -64685 9 176 -46198
9 2948 -64685 9 176 -46198
10 2968 -64485 10 135 -45999
10 2968 -64485 10 135 -45999
11 2998 -64284 11 97 -45799
11 2998 -64284 11 97 -45799
12 3035 -64084 12 65 -45599
12 3035 -64084 12 65 -45599
13 3077 -63883 13 37 -45399
13 3077 -63883 13 37 -45399
14 3122 -63683 14 14 -45199
14 3122 -63683 14 14 -45199
15 3168 -63482 15 -5 -44999
15 3168 -63482 15 -5 -44999
16 3212 -63282 16 -20 -44799
16 3212 -63282 16 -20 -44799
17 3250 -63081 17 -31 -44599
17 3250 -63081 17 -31 -44599
18 3280 -62881 18 -38 -44399
18 3280 -62881 18 -38 -44399
19 3301 -62680 19 -43 -44199
19 3301 -62680 19 -43 -44199
20 3313 -62480 20 -45 -43999
Check this demo :
> library(zoo)
> x <- 1:10
> y <- -x^2
> Result <- sum(diff(x[x]) * rollmean(y[x], 2))
> Result
[1] -334.5
After check this question, I found function trapz() from package pracma be more efficient:
> library(pracma)
> Result.2 <- trapz(x, y)
> Result.2
[1] -334.5

Shift up rows in R

This is a simple example of how my data looks like.
Suppose I got the following data
>x
Year a b c
1962 1 2 3
1963 4 5 6
. . . .
. . . .
2001 7 8 9
I need to form a time series of x with 7 column contains the following variables:
Year a lag(a) b lag(b) c lag(c)
What I did is the following:
> x<-ts(x) # converting x to a time series
> x<-cbind(x,x[,-1]) # adding the same variables to the time series without repeating the year column
> x
Year a b c a b c
1962 1 2 3 1 2 3
1963 4 5 6 4 5 6
. . . . . . .
. . . . . . .
2001 7 8 9 7 8 9
I need to shift the last three column up so they give the lags of a,b,c. then I will rearrange them.
Here's an approach using dplyr
df <- data.frame(
a=1:10,
b=21:30,
c=31:40)
library(dplyr)
df %>% mutate_each(funs(lead(.,1))) %>% cbind(df, .)
# a b c a b c
#1 1 21 31 2 22 32
#2 2 22 32 3 23 33
#3 3 23 33 4 24 34
#4 4 24 34 5 25 35
#5 5 25 35 6 26 36
#6 6 26 36 7 27 37
#7 7 27 37 8 28 38
#8 8 28 38 9 29 39
#9 9 29 39 10 30 40
#10 10 30 40 NA NA NA
You can change the names afterwards using colnames(df) <- c("a", "b", ...)
As #nrussel noted in his answer, what you described is a leading variable. If you want a lagging variable, you can change the lead in my answer to lag.
X <- data.frame(
a=1:100,
b=2*(1:100),
c=3*(1:100),
laga=1:100,
lagb=2*(1:100),
lagc=3*(1:100),
stringsAsFactors=FALSE)
##
Xts <- ts(X)
Xts[1:(nrow(Xts)-1),c(4,5,6)] <- Xts[2:nrow(Xts),c(4,5,6)]
Xts[nrow(Xts),c(4,5,6)] <- c(NA,NA,NA)
> head(Xts)
a b c laga lagb lagc
[1,] 1 2 3 2 4 6
[2,] 2 4 6 3 6 9
[3,] 3 6 9 4 8 12
[4,] 4 8 12 5 10 15
[5,] 5 10 15 6 12 18
[6,] 6 12 18 7 14 21
##
> tail(Xts)
a b c laga lagb lagc
[95,] 95 190 285 96 192 288
[96,] 96 192 288 97 194 291
[97,] 97 194 291 98 196 294
[98,] 98 196 294 99 198 297
[99,] 99 198 297 100 200 300
[100,] 100 200 300 NA NA NA
I'm not sure if by shift up you literally mean shift the rows up 1 place like above (because that would mean you are using lagging values not leading values), but here's the other direction ("true" lagged values):
X2 <- data.frame(
a=1:100,
b=2*(1:100),
c=3*(1:100),
laga=1:100,
lagb=2*(1:100),
lagc=3*(1:100),
stringsAsFactors=FALSE)
##
Xts2 <- ts(X2)
Xts2[2:nrow(Xts2),c(4,5,6)] <- Xts2[1:(nrow(Xts2)-1),c(4,5,6)]
Xts2[1,c(4,5,6)] <- c(NA,NA,NA)
##
> head(Xts2)
a b c laga lagb lagc
[1,] 1 2 3 NA NA NA
[2,] 2 4 6 1 2 3
[3,] 3 6 9 2 4 6
[4,] 4 8 12 3 6 9
[5,] 5 10 15 4 8 12
[6,] 6 12 18 5 10 15
##
> tail(Xts2)
a b c laga lagb lagc
[95,] 95 190 285 94 188 282
[96,] 96 192 288 95 190 285
[97,] 97 194 291 96 192 288
[98,] 98 196 294 97 194 291
[99,] 99 198 297 98 196 294
[100,] 100 200 300 99 198 297

Get row number data frame R

I have a dataset like this
epoch epochIndex year month
1 335 1 1850 12
2 639 2 1851 10
3 670 3 1851 11
4 366 4 1851 1
5 517 5 1851 6
6 547 6 1851 7
7 578 7 1851 8
8 1005 8 1852 10
9 1036 9 1852 11
10 1066 10 1852 12
What I would like to do is to set the Year and Month and get the correspondent row number, like
MONTH <- 12
YEAR <- 1850
ROWNUMBER = 1
Many thanks
A simple which call would be enough, e.g.:
df <- read.table(textConnection("
epoch epochIndex year month
1 335 1 1850 12
2 639 2 1851 10
3 670 3 1851 11
4 366 4 1851 1
5 517 5 1851 6
6 547 6 1851 7
7 578 7 1851 8
8 1005 8 1852 10
9 1036 9 1852 11
10 1066 10 1852 12"), header=TRUE)
which(df$year == 1850 & df$month == 12)
# [1] 1
which(df$year == 1852 & df$month == 12)
# [1] 10
Sorry I found the answer
TIMEC <- which(df$year==YEAR & df$month==MONTH)

Resources