Let's assume I ran a random forest model and obtained the variable importance information as below:
set.seed(121)
ImpMeasure<-data.frame(mod.varImp$importance)
ImpMeasure$Vars<-row.names(ImpMeasure)
ImpMeasure.df<-ImpMeasure[order(-ImpMeasure$Overall),]
row.names(ImpMeasure.df)<-NULL
class(ImpMeasure.df)
ImpMeasure.df<-ImpMeasure.df[,c(2,1)] # so now we have the importance variable info in a data frame
ImpMeasure.df
Vars Overall
1 num_voted_users 100.000000
2 num_critic_for_reviews 58.961441
3 num_user_for_reviews 56.500707
4 movie_facebook_likes 50.680318
5 cast_total_facebook_likes 30.012205
6 gross 27.652559
7 actor_3_facebook_likes 24.094213
8 actor_2_facebook_likes 19.633290
9 imdb_score 16.063007
10 actor_1_facebook_likes 15.848972
11 duration 11.886036
12 budget 11.853066
13 title_year 7.804387
14 director_facebook_likes 7.318787
15 facenumber_in_poster 1.868376
16 aspect_ratio 0.000000
Now, if I decide that I want only the top 5 variables for further analysis, I do this:
library(dplyr)
top.var<-ImpMeasure.df[1:5,] %>% select(Vars)
top.var
Vars
1 num_voted_users
2 num_critic_for_reviews
3 num_user_for_reviews
4 movie_facebook_likes
5 cast_total_facebook_likes
How can I use this information to select only these variables from the original dataset (given below) without spelling out the actual variable names, using, say, the output of top.var? How would I use the dplyr select function for this?
My original dataset is like this:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes
1 723 178 0 855
2 302 169 563 1000
3 602 148 0 161
4 813 164 22000 23000
5 255 95 131 782
6 462 132 475 530
actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes
1 1000 760505847 886204 4834
2 40000 309404152 471220 48350
3 11000 200074175 275868 11700
4 27000 448130642 1144337 106759
5 131 228830 8 143
6 640 73058679 212204 1873
facenumber_in_poster num_user_for_reviews budget title_year
1 0 3054 237000000 2009
2 0 1238 300000000 2007
3 1 994 245000000 2015
4 0 2701 250000000 2012
5 0 97 26000000 2002
6 1 738 263700000 2012
actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes cluster
1 936 7.9 1.78 33000 2
2 5000 7.1 2.35 0 3
3 393 6.8 2.35 85000 2
4 23000 8.5 2.35 164000 3
5 12 7.1 1.85 0 1
6 632 6.6 2.35 24000 2
top.vars <- top.var$Vars  # character vector holding the top 5 variable names
movies.imp <- moviesdf.cluster %>% select(one_of(top.vars), cluster)
head(movies.imp)
## num_voted_users num_user_for_reviews num_critic_for_reviews
## 1 886204 3054 723
## 2 471220 1238 302
## 3 275868 994 602
## 4 1144337 2701 813
## 5 8 127 37
## 6 212204 738 462
## movie_facebook_likes cast_total_facebook_likes cluster
## 1 33000 4834 1
## 2 0 48350 1
## 3 85000 11700 1
## 4 164000 106759 1
## 5 0 143 2
## 6 24000 1873 1
That's done!
Hadley provided the answer to that here:
select_(df, .dots = top.var)
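The underscore verbs such as select_() have since been deprecated in dplyr. A minimal sketch of the same idea with all_of(), assuming dplyr 1.0.0 or later; the two small data frames are hypothetical stand-ins for the objects in the question, with made-up importance values:

```r
library(dplyr)

# Hypothetical stand-ins for the question's objects (values are illustrative)
moviesdf.cluster <- data.frame(
  num_voted_users        = c(886204, 471220),
  duration               = c(178, 169),
  num_critic_for_reviews = c(723, 302),
  cluster                = c(2, 3)
)
ImpMeasure.df <- data.frame(
  Vars    = c("num_voted_users", "num_critic_for_reviews", "duration"),
  Overall = c(100, 58.96, 11.89),
  stringsAsFactors = FALSE
)

# Keep the names as a character vector, not as a one-column data frame
top.vars <- head(ImpMeasure.df$Vars, 2)

# all_of() selects the columns named in the vector and errors if one is missing
movies.imp <- moviesdf.cluster %>% select(all_of(top.vars), cluster)
names(movies.imp)
```

The key point is that the selection helper needs a character vector of names, so the Vars column is extracted rather than passed as a data frame.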
In my example, I want to use the following code:
# Classification dataset
library(dplyr)
nest <- c(1,3,4,7,12,13,21,25,26,28)
finder_max <- c(9,50,25,50,25,50,9,9,9,3)
max_TA <- c(7.4,29.4,17.0,33.1,16.2,34.4,4.3,3.52,7.47,1.4)
ds.class <- data.frame(nest,finder_max,max_TA)
ds.class$ClassType <- ifelse(ds.class$finder_max==3,"Class_1_3",
ifelse(ds.class$finder_max==9,"Class_3_9",
ifelse(ds.class$finder_max==25,"Class_9_25",
ifelse(ds.class$finder_max==50,"Class_25_50","Class_50_51"))))
ds.class
# nest finder_max max_TA ClassType
# 1 1 9 7.40 Class_3_9
# 2 3 50 29.40 Class_25_50
# 3 4 25 17.00 Class_9_25
# 4 7 50 33.10 Class_25_50
# 5 12 25 16.20 Class_9_25
# 6 13 50 34.40 Class_25_50
# 7 21 9 4.30 Class_3_9
# 8 25 9 3.52 Class_3_9
# 9 26 9 7.47 Class_3_9
# 10 28 3 1.40 Class_1_3
# Custom ordination vector
custom.vec <- c("Class_0_1","Class_1_3","Class_3_9",
"Class_9_25","Class_25_50","Class_50")
# Original dataset
my.ds <- read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/test_ants.csv")
my.ds$ClassType <- cut(my.ds$AT,breaks=c(-Inf,1,2.9,8.9,24.9,49.9,Inf),
right=FALSE,labels=c("Class_0_1","Class_1_3","Class_3_9",
"Class_9_25","Class_25_50","Class_50"))
str(my.ds)
# 'data.frame': 55 obs. of 4 variables:
# $ days : int 0 47 76 0 47 76 118 160 193 227 ...
# $ nest : int 2 2 2 3 3 3 3 3 3 3 ...
# $ AT : num 10.92 22.86 23.24 0.14 0.48 ...
# $ ClassType: Factor w/ 6 levels "Class_0_1","Class_1_3",..: 4 4 4 1 1 1 1 1 1 1 ...
I'd like to remove the rows in my.ds whose ClassType equals the ClassType found in ds.class for the same nest. I also need to remove the classes
that rank higher than that ClassType in my custom ordering (custom.vec). Example: if nest 3 has ClassType Class_25_50 in ds.class, I need to remove the rows with this ClassType and any higher classes ("Class_50"), if they exist, for nest 3 in my.ds.
My expected output for new.my.ds is:
new.my.ds
# days nest AT ClassType
# 1 0 2 10.9200 Class_9_25
# 2 47 2 22.8600 Class_9_25
# 3 76 2 23.2400 Class_9_25
# 4 0 3 0.1400 Class_0_1
# 5 47 3 0.4800 Class_0_1
# 6 76 3 0.8300 Class_0_1
# 7 118 3 0.8300 Class_0_1
# 8 160 3 0.9400 Class_0_1
# 9 193 3 0.9400 Class_0_1
# 10 227 3 0.9400 Class_0_1
# 11 262 3 0.9400 Class_0_1
# 12 306 3 0.9400 Class_0_1
# 13 355 3 11.9300 Class_9_25
# 14 396 3 12.8100 Class_9_25
# 16 0 4 1.0000 Class_1_3
# 17 76 4 1.5600 Class_1_3
# 18 160 4 2.8800 Class_1_3
# 19 193 4 2.8800 Class_1_3
# 20 227 4 2.8800 Class_1_3
# 21 262 4 2.8800 Class_1_3
# 22 306 4 2.8800 Class_1_3
# 24 0 7 11.7100 Class_9_25
# 25 47 7 24.7900 Class_9_25
#...
# 55 349 1067 0.9600 Class_0_1
Please, any help with it?
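One possible approach, sketched in base R under stated assumptions: each nest appears at most once in ds.class, nests missing from ds.class keep all their rows, and the small my.ds below is a made-up stand-in for the CSV (its values are illustrative only). The helper class_rank is a hypothetical name.

```r
custom.vec <- c("Class_0_1", "Class_1_3", "Class_3_9",
                "Class_9_25", "Class_25_50", "Class_50")

# Reduced, hypothetical versions of the two data sets
ds.class <- data.frame(nest = c(3, 4),
                       ClassType = c("Class_25_50", "Class_9_25"),
                       stringsAsFactors = FALSE)
my.ds <- data.frame(days = c(0, 47, 355, 400, 0, 306, 350),
                    nest = c(2, 2, 3, 3, 4, 4, 4),
                    AT   = c(10.92, 22.86, 11.93, 30.5, 1.0, 2.88, 9.5))
my.ds$ClassType <- cut(my.ds$AT, breaks = c(-Inf, 1, 2.9, 8.9, 24.9, 49.9, Inf),
                       right = FALSE, labels = custom.vec)

# Position of each class in the custom ordering (NA if not listed)
class_rank <- function(x) match(as.character(x), custom.vec)

# Threshold class per row, looked up by nest; NA when the nest is not in ds.class
threshold <- class_rank(ds.class$ClassType[match(my.ds$nest, ds.class$nest)])

# Keep rows with no threshold, or with a class strictly below the threshold
keep <- is.na(threshold) | class_rank(my.ds$ClassType) < threshold
new.my.ds <- my.ds[keep, ]
new.my.ds
```

Using match() rather than merge() keeps the original row order and row names, which matches the gaps in the row numbers of the expected output.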
I did an RFM analysis using the package "rfm". The results are in a tibble and I can't figure out how to export them to .csv. I tried the call below, but it exported a blank file.
> dim(bmdata4RFM)
[1] 1182580 3
> str(bmdata4RFM)
'data.frame': 1182580 obs. of 3 variables:
$ customer_ID: num 0 0 0 0 0 0 0 0 0 0 ...
$ sales_date : Factor w/ 366 levels "1/1/2018 0:00:00",..: 267 275 286 297 300 301 302 303 304 305 ...
$ sales : num 101541 110543 60932 75472 43588 ...
> head(bmdata4RFM,5)
customer_ID sales_date sales
1 0 6/30/2017 0:00:00 101540.70
2 0 7/1/2017 0:00:00 110543.35
3 0 7/2/2017 0:00:00 60932.20
4 0 7/3/2017 0:00:00 75471.93
5 0 7/4/2017 0:00:00 43587.70
> library(rfm)
> # convert date from factor to date format
> bmdata4RFM[,2] <- as.Date(as.character(bmdata4RFM[,2]), format = "%m/%d/%Y")
> rfm_result_v2
# A tibble: 535,868 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<dbl> <date> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
1 0 2018-06-30 12 366 42462470. 5 5 5 555
2 1 2018-06-30 12 20 2264. 5 5 5 555
3 2 2018-01-12 181 24 1689 3 5 5 355
4 3 2018-05-04 69 27 1984. 4 5 5 455
5 6 2017-12-07 217 12 922. 2 5 5 255
6 7 2018-01-15 178 19 1680. 3 5 5 355
7 9 2018-01-05 188 19 2106 2 5 5 255
8 20 2018-04-11 92 4 414. 4 5 5 455
9 26 2018-02-10 152 1 72 3 1 2 312
10 48 2017-12-20 204 1 90 2 1 3 213
11 68 2017-09-30 285 1 37 1 1 1 111
12 70 2017-12-17 207 1 18 2 1 1 211
13 104 2017-08-11 335 1 90 1 1 3 113
14 120 2017-07-27 350 1 19 1 1 1 111
15 134 2018-01-13 180 1 275 3 1 4 314
16 153 2018-06-24 18 10 1677 5 5 5 555
17 155 2018-05-28 45 1 315 5 1 4 514
18 171 2018-06-11 31 6 3485. 5 5 5 555
19 172 2018-05-24 49 1 93 5 1 3 513
20 174 2018-06-06 36 3 347. 5 4 5 545
# ... with 535,858 more rows
> write.csv(rfm_result_v2,"bmdataRFMFunction_output071218v2.csv")
The problem seems to be that the result of rfm_table_order is not just a tibble. Looking at this already-solved question and using its data, you can see this:
> class(rfm_result)
[1] "rfm_table_order" "tibble" "data.frame"
So if, for example, you select this element:
> rfm_result$rfm
# A tibble: 325 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<int> <date> <dbl> <dbl> <int> <int> <int> <int> <dbl>
1 1 2017-08-06 353 1 145 4 1 2 412
2 2 2016-10-15 648 1 268 2 1 3 213
3 5 2016-12-14 588 1 119 3 1 1 311
4 7 2017-04-27 454 1 290 3 1 3 313
5 8 2016-12-07 595 3 835 2 5 5 255
6 10 2017-07-31 359 1 192 4 1 2 412
7 11 2017-08-16 343 1 278 4 1 3 413
8 12 2017-10-14 284 2 294 5 4 3 543
9 15 2016-07-12 743 1 206 2 1 2 212
10 17 2017-05-22 429 2 405 4 4 4 444
# ... with 315 more rows
You can export it with this command:
write.table(rfm_result$rfm , file = "your_path\\df.csv")
OP asks for a CSV output.
Being very picky, write.table(rfm_result$rfm , file = "your_path\\df.csv") creates a space-delimited file, not a CSV, because the default separator is sep=" ".
If you want a CSV, add the sep="," parameter; you will likely also not want to write out the row names, so use row.names=FALSE as well.
write.table(rfm_result$rfm , file = "your_path\\df.csv", sep=",", row.names=FALSE)
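As a small sketch of the same export with write.csv(), which already uses a comma separator: the list below is a hypothetical stand-in for the rfm result object, and a temporary file replaces the real path.

```r
# Hypothetical stand-in for the object returned by the rfm package;
# only the $rfm element (the per-customer table) is mimicked here
rfm_result <- list(
  rfm = data.frame(customer_id = c(1, 2), rfm_score = c(412, 213))
)

out_file <- tempfile(fileext = ".csv")

# Export the table element, not the whole result object
write.csv(rfm_result$rfm, file = out_file, row.names = FALSE)

# Read it back to confirm the file is not blank
exported <- read.csv(out_file)
exported
```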
I have the data frame below and I want to find the row average across all columns whose header ends in R and across all columns whose header ends in G.
The output should then be four columns: Rfam, Classes, avg.rowR, avg.rowG
I was playing around with the rowMeans() function, but I am not sure how to specify the columns.
Rfam Classes 26G 26R 35G 35R 46G 46R 48G 48R 55G 55R
5_8S_rRNA rRNA 63 39 8 27 26 17 28 43 41 17
5S_rRNA rRNA 171 149 119 109 681 47 95 161 417 153
7SK 7SK 53 282 748 371 248 42 425 384 316 198
ACA64 Other 7 8 19 2 10 1 36 10 10 4
let-7 miRNA 121825 73207 25259 75080 54301 63510 30444 53800 78961 47533
lin-4 miRNA 10149 16263 5629 19680 11297 37866 3816 9677 11713 10068
Metazoa_SRP SRP 317 1629 1008 418 1205 407 1116 1225 1413 1075
mir-1 miRNA 3 4 1 2 0 26 1 1 0 4
mir-10 miRNA 912163 1411287 523793 1487160 517017 1466085 107597 551381 727720 788201
mir-101 miRNA 461 320 199 553 174 460 278 297 256 254
mir-103 miRNA 937 419 202 497 318 217 328 343 891 439
mir-1180 miRNA 110 32 4 17 53 47 6 29 35 22
mir-1226 miRNA 11 3 0 3 6 0 1 2 5 4
mir-1237 miRNA 3 2 1 1 0 1 0 2 1 1
mir-1249 miRNA 5 14 2 9 4 5 9 5 7 7
newcols <- sapply(c("R$", "G$"), function(x) rowMeans(df[grep(x, names(df))]))
setNames(cbind(df[1:2], newcols), c(names(df)[1:2], "avg.rowR", "avg.rowG"))
# Rfam Classes avg.rowR avg.rowG
# 1 5_8S_rRNA rRNA 28.6 33.2
# 2 5S_rRNA rRNA 123.8 296.6
# 3 7SK 7SK 255.4 358.0
# 4 ACA64 Other 5.0 16.4
# 5 let-7 miRNA 62626.0 62158.0
# 6 lin-4 miRNA 18710.8 8520.8
# 7 Metazoa_SRP SRP 950.8 1011.8
# 8 mir-1 miRNA 7.4 1.0
# 9 mir-10 miRNA 1140822.8 557658.0
# 10 mir-101 miRNA 376.8 273.6
# 11 mir-103 miRNA 383.0 535.2
# 12 mir-1180 miRNA 29.4 41.6
# 13 mir-1226 miRNA 2.4 4.6
# 14 mir-1237 miRNA 1.4 1.0
# 15 mir-1249 miRNA 8.0 5.4
One way to look for patterns in column names is to use the grep family of functions. The function call grep("R$", names(df)) returns the indices of all column names that end with R. When we use it with sapply, we can search for the R and G columns in one expression.
The core of the second line is cbind(df[1:2], newcols). That binds the first two columns of df to the two new columns of mean values. Wrapping it with setNames(.., c(names(df)[1:2], ....)) formats the column names to match your desired output.
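To make the intermediate steps visible, here is a reduced sketch using two rows and four count columns taken from the question's table. It assumes the data frame was built with check.names=FALSE so the headers keep their original form; the names r_cols and g_cols are illustrative.

```r
df <- data.frame(Rfam = c("5_8S_rRNA", "ACA64"),
                 Classes = c("rRNA", "Other"),
                 `26G` = c(63, 7), `26R` = c(39, 8),
                 `35G` = c(8, 19), `35R` = c(27, 2),
                 check.names = FALSE)

# Column positions whose names end in R or in G
r_cols <- grep("R$", names(df))   # positions of 26R and 35R
g_cols <- grep("G$", names(df))   # positions of 26G and 35G

# Row means over each group of columns
avg <- data.frame(df[1:2],
                  avg.rowR = rowMeans(df[r_cols]),
                  avg.rowG = rowMeans(df[g_cols]))
avg
```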
I have the following dataframe:
teamID X3M TR AS ST BK PTS FGP FTP
1 423 2884 1405 585 344 5797 0.4763141 0.7370821
2 467 2509 868 326 200 6159 0.4590164 0.7604167
3 769 1944 1446 614 168 6801 0.4248021 0.7825521
4 814 2457 1596 620 308 8058 0.4348856 0.8241445
5 356 2215 1153 403 243 4801 0.4427576 0.7478921
6 302 3360 1151 381 393 6271 0.4626974 0.6757176
7 384 2318 1070 431 269 5225 0.4345146 0.7460317
8 353 2529 1683 561 203 6150 0.4537273 0.7344740
9 598 2384 1635 497 162 6439 0.4512104 0.7998392
10 502 3191 1898 525 337 7107 0.4598565 0.7836970
I want to produce a dataframe like this:
teamID rank_X3M rank_TR rank_AS rank_ST rank_BK rank_PTS rank_FGP rank_FTP
1 5
2 6
3 9
4 10
5 3
6 1
7 4
8 2
9 8
10 7
I tried apply(-df[,c(2:9)], 1, rank, ties.method='min') and got this
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
X3M 4 4 5 4 4 6 4 4 5 6
TR 2 2 2 2 2 2 2 2 2 2
AS 3 3 3 3 3 3 3 3 3 3
ST 5 5 4 5 5 4 5 5 4 5
BK 6 6 6 6 6 5 6 6 6 4
PTS 1 1 1 1 1 1 1 1 1 1
FGP 8 8 8 8 8 8 8 8 8 8
FTP 7 7 7 7 7 7 7 7 7 7
Any suggestions about what to try next? Thanks!
Try sapply as below; you can change the names of the variables later.
cl <- read.table(text="
teamID X3M TR AS ST BK PTS FGP FTP
1 423 2884 1405 585 344 5797 0.4763141 0.7370821
2 467 2509 868 326 200 6159 0.4590164 0.7604167
3 769 1944 1446 614 168 6801 0.4248021 0.7825521
4 814 2457 1596 620 308 8058 0.4348856 0.8241445
5 356 2215 1153 403 243 4801 0.4427576 0.7478921
6 302 3360 1151 381 393 6271 0.4626974 0.6757176
7 384 2318 1070 431 269 5225 0.4345146 0.7460317
8 353 2529 1683 561 203 6150 0.4537273 0.7344740
9 598 2384 1635 497 162 6439 0.4512104 0.7998392
10 502 3191 1898 525 337 7107 0.4598565 0.7836970", header=T)
new <- cbind(cl$teamID, sapply(cl[,c(2:9)], rank))
new
X3M TR AS ST BK PTS FGP FTP
[1,] 1 5 8 5 8 9 3 10 3
[2,] 2 6 6 1 1 3 5 7 6
[3,] 3 9 1 6 9 2 8 1 7
[4,] 4 10 5 7 10 7 10 3 10
[5,] 5 3 2 4 3 5 1 4 5
[6,] 6 1 10 3 2 10 6 9 1
[7,] 7 4 3 2 4 6 2 2 4
[8,] 8 2 7 9 7 4 4 6 2
[9,] 9 8 4 8 5 1 7 5 9
[10,] 10 7 9 10 6 8 9 8 8
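The apply(..., 1, rank) attempt in the question ranks the values within each row (across the different statistics), whereas sapply ranks within each column. A minimal sketch of the renaming step mentioned above, using only the first three teams and two columns of the question's data; the rank_ prefix follows the desired output and the object names are illustrative:

```r
cl <- data.frame(teamID = 1:3,
                 X3M = c(423, 467, 769),
                 TR  = c(2884, 2509, 1944))

# Rank each statistic column on its own (ascending; ties share the lowest rank)
ranks <- sapply(cl[-1], rank, ties.method = "min")

# Prefix the column names and put teamID back in front
colnames(ranks) <- paste0("rank_", colnames(ranks))
ranked <- data.frame(teamID = cl$teamID, ranks)
ranked
```

If rank 1 should go to the largest value instead, negating the columns before ranking (for example sapply(-cl[-1], rank, ties.method = "min")) reverses the order.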
I have a dataset like this
epoch epochIndex year month
1 335 1 1850 12
2 639 2 1851 10
3 670 3 1851 11
4 366 4 1851 1
5 517 5 1851 6
6 547 6 1851 7
7 578 7 1851 8
8 1005 8 1852 10
9 1036 9 1852 11
10 1066 10 1852 12
What I would like to do is set the year and month and get the corresponding row number, like
MONTH <- 12
YEAR <- 1850
ROWNUMBER = 1
Many thanks
A simple which call would be enough, e.g.:
df <- read.table(textConnection("
epoch epochIndex year month
1 335 1 1850 12
2 639 2 1851 10
3 670 3 1851 11
4 366 4 1851 1
5 517 5 1851 6
6 547 6 1851 7
7 578 7 1851 8
8 1005 8 1852 10
9 1036 9 1852 11
10 1066 10 1852 12"), header=TRUE)
which(df$year == 1850 & df$month == 12)
# [1] 1
which(df$year == 1852 & df$month == 12)
# [1] 10
Sorry, I found the answer:
TIMEC <- which(df$year==YEAR & df$month==MONTH)