I did a rfm analysis using package "rfm". The results are in tibble and I can't seem to figure out how to export it to .csv. I tried argument below but it exported a blank file.
> dim(bmdata4RFM)
[1] 1182580 3
> str(bmdata4RFM)
'data.frame': 1182580 obs. of 3 variables:
$ customer_ID: num 0 0 0 0 0 0 0 0 0 0 ...
$ sales_date : Factor w/ 366 levels "1/1/2018 0:00:00",..: 267 275 286 297 300 301 302 303 304 305 ...
$ sales : num 101541 110543 60932 75472 43588 ...
> head(bmdata4RFM,5)
customer_ID sales_date sales
1 0 6/30/2017 0:00:00 101540.70
2 0 7/1/2017 0:00:00 110543.35
3 0 7/2/2017 0:00:00 60932.20
4 0 7/3/2017 0:00:00 75471.93
5 0 7/4/2017 0:00:00 43587.70
> library(rfm)
> # convert date from factor to date format
> bmdata4RFM[,2] <- as.Date(as.character(bmdata4RFM[,2]), format = "%m/%d/%Y")
> rfm_result_v2
# A tibble: 535,868 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<dbl> <date> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
1 0 2018-06-30 12 366 42462470. 5 5 5 555
2 1 2018-06-30 12 20 2264. 5 5 5 555
3 2 2018-01-12 181 24 1689 3 5 5 355
4 3 2018-05-04 69 27 1984. 4 5 5 455
5 6 2017-12-07 217 12 922. 2 5 5 255
6 7 2018-01-15 178 19 1680. 3 5 5 355
7 9 2018-01-05 188 19 2106 2 5 5 255
8 20 2018-04-11 92 4 414. 4 5 5 455
9 26 2018-02-10 152 1 72 3 1 2 312
10 48 2017-12-20 204 1 90 2 1 3 213
11 68 2017-09-30 285 1 37 1 1 1 111
12 70 2017-12-17 207 1 18 2 1 1 211
13 104 2017-08-11 335 1 90 1 1 3 113
14 120 2017-07-27 350 1 19 1 1 1 111
15 134 2018-01-13 180 1 275 3 1 4 314
16 153 2018-06-24 18 10 1677 5 5 5 555
17 155 2018-05-28 45 1 315 5 1 4 514
18 171 2018-06-11 31 6 3485. 5 5 5 555
19 172 2018-05-24 49 1 93 5 1 3 513
20 174 2018-06-06 36 3 347. 5 4 5 545
# ... with 535,858 more rows
> write.csv(rfm_result_v2,"bmdataRFMFunction_output071218v2.csv")
The problem seems to be that the result of the rfm_table_order is not only a tibble: looking at this question already solved, and using its data, you can know this:
> class(rfm_result)
[1] "rfm_table_order" "tibble" "data.frame"
So if for example choose this:
> rfm_result$rfm
# A tibble: 325 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<int> <date> <dbl> <dbl> <int> <int> <int> <int> <dbl>
1 1 2017-08-06 353 1 145 4 1 2 412
2 2 2016-10-15 648 1 268 2 1 3 213
3 5 2016-12-14 588 1 119 3 1 1 311
4 7 2017-04-27 454 1 290 3 1 3 313
5 8 2016-12-07 595 3 835 2 5 5 255
6 10 2017-07-31 359 1 192 4 1 2 412
7 11 2017-08-16 343 1 278 4 1 3 413
8 12 2017-10-14 284 2 294 5 4 3 543
9 15 2016-07-12 743 1 206 2 1 2 212
10 17 2017-05-22 429 2 405 4 4 4 444
# ... with 315 more rows
You can export it with this command:
write.table(rfm_result$rfm , file = "your_path\\df.csv")
OP asks for a CSV output.
Being very picky, write.table(rfm_result$rfm , file = "your_path\\df.csv") creates a TSV.
If you want a CSV add the sep="," parameter and also you'll likely want to not write out the row names so also use row.names=FALSE.
write.table(rfm_result$rfm , file = "your_path\\df.csv", sep=",", row.names=FALSE)
Related
I have 4 data frames that all look like this:
Product 2018
Number
Minimum
Maximum
1
56
1
5
2
42
12
16
3
6523
23
56
4
123
23
102
5
56
23
64
6
245623
56
87
7
546
25
540
8
54566
253
560
Product 2019
Number
Minimum
Maximum
1
56
32
53
2
642
423
620
3
56423
432
560
4
3
431
802
5
2
2
6
6
4523
43
68
7
555
23
54
8
55646
3
6
Product 2020
Number
Minimum
Maximum
1
23
2
5
2
342
4
16
3
223
3
5
4
13
4
12
5
2
4
7
6
223
7
8
7
5
34
50
8
46
3
6
Product 2021
Number
Minimum
Maximum
1
234
3
5
2
3242
4
16
3
2423
43
56
4
123
43
102
5
24
4
6
6
2423
4
18
7
565
234
540
8
5646
23
56
I want to join all the tables so I get a table that looks like this:
Products
Number 2021
Min-Max 2021
Number 2020
Min-Max 2020
Number 2019
Min-Max 2019
Number 2018
Min-Max 2018
1
234
3 to 5
23
2 to 5
...
...
...
...
2
3242
4 to 16
342
4 to 16
...
...
...
...
3
2423
43 to 56
223
3 to 5
...
...
...
...
4
123
43 to 102
13
4 to 12
...
...
...
...
5
24
4 to 6
2
4 to 7
...
...
...
...
6
2423
4 to 18
223
7 to 8
...
...
...
...
7
565
234 to 540
5
34 to 50
...
...
...
...
8
5646
23 to 56
46
3 to 6
...
...
...
...
The Product for all years are the same so I would like to have a data frame that contains the number for each year as a column and joins the column for minimum and maximum as one.
Any help is welcome!
How about something like this. You are trying to join several dataframes by a single column, which is relatively straight forward using full_join. The difficulty is that you are trying to extract information from the column names and combine several columns at the same time. I would map out everying you want to do and then reduce the list of dataframes at the end. Here is an example with two dataframes, but you could add as many as you want to the list at the begining.
library(tidyverse)
#test data
set.seed(23)
df1 <- tibble("Product 2018" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
set.seed(46)
df2 <- tibble("Product 2019" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
list(df1, df2) |>
map(\(x){
year <- str_extract(colnames(x)[1], "\\d+?$")
mutate(x, !!quo_name(paste0("Min-Max ", year)) := paste(Minimum, "to", Maximum))|>
rename(!!quo_name(paste0("Number ", year)) := Number)|>
rename_with(~gsub("\\s\\d+?$", "", .), 1) |>
select(-c(Minimum, Maximum))
}) |>
reduce(full_join, by = "Product")
#> # A tibble: 8 x 5
#> Product `Number 2018` `Min-Max 2018` `Number 2019` `Min-Max 2019`
#> <int> <int> <chr> <int> <chr>
#> 1 1 29 21 to 481 50 93 to 416
#> 2 2 28 17 to 314 78 7 to 313
#> 3 3 72 40 to 787 1 91 to 205
#> 4 4 43 36 to 557 47 55 to 542
#> 5 5 45 70 to 926 52 76 to 830
#> 6 6 34 96 to 645 70 20 to 922
#> 7 7 48 31 to 197 84 6 to 716
#> 8 8 17 86 to 951 99 75 to 768
This is a similar answer, but includes bind_rows to combine the data.frames, then pivot_wider to end in a wide format.
The first steps strip the year from the Product XXXX column name, as this carries relevant information on year for that data.frame. If that column is renamed as Product they are easily combined (with a separate column containing the Year). If this step can be taken earlier in the data collection or processing timeline, it is helpful.
library(tidyverse)
list(df1, df2, df3, df4) %>%
map(~.x %>%
mutate(Year = gsub("Product", "", names(.x)[1])) %>%
rename(Product = !!names(.[1]))) %>%
bind_rows() %>%
mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
pivot_wider(id_cols = Product, names_from = Year, values_from = c(Number, Min_Max), names_vary = "slowest")
Output
Product Number_2018 Min_Max_2018 Number_2019 Min_Max_2019 Number_2020 Min_Max_2020 Number_2021 Min_Max_2021
<int> <int> <chr> <int> <chr> <int> <chr> <int> <chr>
1 1 56 1 to 5 56 32 to 53 23 2 to 5 234 3 to 5
2 2 42 12 to 16 642 423 to 620 342 4 to 16 3242 4 to 16
3 3 6523 23 to 56 56423 432 to 560 223 3 to 5 2423 43 to 56
4 4 123 23 to 102 3 431 to 802 13 4 to 12 123 43 to 102
5 5 56 23 to 64 2 2 to 6 2 4 to 7 24 4 to 6
6 6 245623 56 to 87 4523 43 to 68 223 7 to 8 2423 4 to 18
7 7 546 25 to 540 555 23 to 54 5 34 to 50 565 234 to 540
8 8 54566 253 to 560 55646 3 to 6 46 3 to 6 5646 23 to 56
I have a dataset where I need to count any values that contain the string "E9" and what I have so far seems to... sort of work. Here's an example dataset I'm working with:
ColA ColB ColC ColD ColE ColF ColG ColH ColI ColJ ColK ColL
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 407 345 381 0 E9 E9 E9 12 E9 E9 E9 E9
2 328 301 314 0 6 6 7 10 7 8 7 6
3 295 261 267 0 7 8 8 8 8 7 7 6
4 163 199 298 0 2 6 3 6 7 6 6 3
5 599 499 576 0 E9 E9 E9 17 E9 E9 E9 E9
6 566 436 545 0 12 16 16 16 17 15 15 11
7 200 168 170 0 5 5 5 5 5 5 5 5
8 617 507 435 0 13 18 17 18 18 17 16 12
9 624 0 629 18 13 0 0 18 18 17 16 14
10 177 163 161 0 4 5 4 5 5 5 4 3
Now, if I run this code:
df2$Exclusions <- str_count(df1, "E9")
I get the following error:
Warning message:
In stri_count_regex(string, pattern, opts_regex = opts(pattern)) :
argument is not an atomic vector; coercing
However, it does give me the end result I want, which looks like this:
Device Exclusions
<chr> <int>
1 ColA 0
2 ColB 0
3 ColC 0
4 ColD 16
5 ColE 19
6 ColF 19
7 ColG 19
8 ColH 0
9 ColI 19
10 ColJ 19
From what I understand, str_count() is just mad that I'm using a dataframe rather than a vector. For some reason it works fine anyway each time I use it like this, but when I try to put it inside of a loop, it stops dead in its tracks. How can I achieve the same result but with a function that is meant to work with dataframes rather than vectors?
I have a data frame with thousands of rows looking like this:
time Unique_ID Unix_Time Event Version
<dbl> <dbl> <dbl> <lgl> <dbl>
1 1404 4961657804 1565546745 FALSE 6
2 2534 4453645779 1550934792 FALSE 5
3 2114 3602935494 1512593418 TRUE 3
4 2605 5343699852 1586419012 TRUE 6
5 1246 5095942046 1572689498 FALSE 6
6 2519 3206995213 1495881898 TRUE 3
7 1419 4958551504 1565434177 TRUE 6
8 2262 5441937631 1590754817 TRUE 6
9 1650 3024892331 1488210079 TRUE 2
10 1880 3163703804 1494173662 FALSE 2
I manipulate the data frame using the following command:
df <- df %>%
group_by(minute = findInterval(time, seq(min(0), max(9000), 60))) %>%
summarise(Number= n(),
Won = sum(Event))
Now my data frame looks like this:
minute Number Won
<int> <int> <int>
1 55 264 128
2 71 34 17
3 31 1427 728
4 80 9 5
5 24 1197 673
6 141 1 1
7 53 326 163
8 30 1572 802
9 77 14 9
10 97 1 1
I would want something like this though:
minute Number Won Version
<int> <int> <int> <int>
1 55 264 128 1
2 55 34 17 2
3 55 1427 728 3
4 80 9 5 1
5 24 1197 673 1
6 141 1 1 2
7 53 326 163 3
8 53 1572 802 4
9 77 14 9 2
10 97 1 1 6
Is it possible to keep the rows with different Versions seperated while grouping time?
I think you can group by 2 columns: minute and Version
df <- df %>%
group_by(minute = findInterval(time, seq(min(0), max(9000), 60)), Version)
Lets assume i ran a random Forest model and i get the variable importance info as below:
set.seed(121)
ImpMeasure<-data.frame(mod.varImp$importance)
ImpMeasure$Vars<-row.names(ImpMeasure)
ImpMeasure.df<-ImpMeasure[order(-ImpMeasure$Overall),]
row.names(ImpMeasure.df)<-NULL
class(ImpMeasure.df)
ImpMeasure.df<-ImpMeasure.df[,c(2,1)] # so now we have the importance variable info in a data frame
ImpMeasure.df
Vars Overall
1 num_voted_users 100.000000
2 num_critic_for_reviews 58.961441
3 num_user_for_reviews 56.500707
4 movie_facebook_likes 50.680318
5 cast_total_facebook_likes 30.012205
6 gross 27.652559
7 actor_3_facebook_likes 24.094213
8 actor_2_facebook_likes 19.633290
9 imdb_score 16.063007
10 actor_1_facebook_likes 15.848972
11 duration 11.886036
12 budget 11.853066
13 title_year 7.804387
14 director_facebook_likes 7.318787
15 facenumber_in_poster 1.868376
16 aspect_ratio 0.000000
Now If i decide that i want only top 5 variables for further analysis then in do this:
library(dplyr)
top.var<-ImpMeasure.df[1:5,] %>% select(Vars)
top.var
Vars
1 num_voted_users
2 num_critic_for_reviews
3 num_user_for_reviews
4 movie_facebook_likes
5 cast_total_facebook_likes
How can use this info to select these var only from the original dataset (given below) without spelling out the actual variable names but using say the output of top.var....how to use dplyr select function for this..
My original dataset is like this:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes
1 723 178 0 855
2 302 169 563 1000
3 602 148 0 161
4 813 164 22000 23000
5 255 95 131 782
6 462 132 475 530
actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes
1 1000 760505847 886204 4834
2 40000 309404152 471220 48350
3 11000 200074175 275868 11700
4 27000 448130642 1144337 106759
5 131 228830 8 143
6 640 73058679 212204 1873
facenumber_in_poster num_user_for_reviews budget title_year
1 0 3054 237000000 2009
2 0 1238 300000000 2007
3 1 994 245000000 2015
4 0 2701 250000000 2012
5 0 97 26000000 2002
6 1 738 263700000 2012
actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes cluster
1 936 7.9 1.78 33000 2
2 5000 7.1 2.35 0 3
3 393 6.8 2.35 85000 2
4 23000 8.5 2.35 164000 3
5 12 7.1 1.85 0 1
6 632 6.6 2.35 24000 2
movies.imp<-moviesdf.cluster%>% select(one_of(top.vars),cluster)
head(movies.imp)
## num_voted_users num_user_for_reviews num_critic_for_reviews
## 1 886204 3054 723
## 2 471220 1238 302
## 3 275868 994 602
## 4 1144337 2701 813
## 5 8 127 37
## 6 212204 738 462
## movie_facebook_likes cast_total_facebook_likes cluster
## 1 33000 4834 1
## 2 0 48350 1
## 3 85000 11700 1
## 4 164000 106759 1
## 5 0 143 2
## 6 24000 1873 1
That done!
Hadley provided the answer to that here:
select_(df, .dots = top.var)
cI have the following dataframe:
teamID X3M TR AS ST BK PTS FGP FTP
1 423 2884 1405 585 344 5797 0.4763141 0.7370821
2 467 2509 868 326 200 6159 0.4590164 0.7604167
3 769 1944 1446 614 168 6801 0.4248021 0.7825521
4 814 2457 1596 620 308 8058 0.4348856 0.8241445
5 356 2215 1153 403 243 4801 0.4427576 0.7478921
6 302 3360 1151 381 393 6271 0.4626974 0.6757176
7 384 2318 1070 431 269 5225 0.4345146 0.7460317
8 353 2529 1683 561 203 6150 0.4537273 0.7344740
9 598 2384 1635 497 162 6439 0.4512104 0.7998392
10 502 3191 1898 525 337 7107 0.4598565 0.7836970
I want to produce a dataframe like this:
teamID rank_X3M rank_TR rank_AS rank_ST rank_BK rank_PTS rank_FGP rank_FTP
1 5
2 6
3 9
4 10
5 3
6 1
7 4
8 2
9 8
10 7
I tried apply(-df[,c(2:9)], 1, rank, ties.method='min') and got this
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
X3M 4 4 5 4 4 6 4 4 5 6
TR 2 2 2 2 2 2 2 2 2 2
AS 3 3 3 3 3 3 3 3 3 3
ST 5 5 4 5 5 4 5 5 4 5
BK 6 6 6 6 6 5 6 6 6 4
PTS 1 1 1 1 1 1 1 1 1 1
FGP 8 8 8 8 8 8 8 8 8 8
FTP 7 7 7 7 7 7 7 7 7 7
Any suggestions about what to try next? Thanks!
Try sapply like below, you can change names of the variables later
cl <- read.table(text="
teamID X3M TR AS ST BK PTS FGP FTP
1 423 2884 1405 585 344 5797 0.4763141 0.7370821
2 467 2509 868 326 200 6159 0.4590164 0.7604167
3 769 1944 1446 614 168 6801 0.4248021 0.7825521
4 814 2457 1596 620 308 8058 0.4348856 0.8241445
5 356 2215 1153 403 243 4801 0.4427576 0.7478921
6 302 3360 1151 381 393 6271 0.4626974 0.6757176
7 384 2318 1070 431 269 5225 0.4345146 0.7460317
8 353 2529 1683 561 203 6150 0.4537273 0.7344740
9 598 2384 1635 497 162 6439 0.4512104 0.7998392
10 502 3191 1898 525 337 7107 0.4598565 0.7836970", header=T)
new <- cbind(cl$teamID, sapply(cl[,c(2:9)], rank))
new
X3M TR AS ST BK PTS FGP FTP
[1,] 1 5 8 5 8 9 3 10 3
[2,] 2 6 6 1 1 3 5 7 6
[3,] 3 9 1 6 9 2 8 1 7
[4,] 4 10 5 7 10 7 10 3 10
[5,] 5 3 2 4 3 5 1 4 5
[6,] 6 1 10 3 2 10 6 9 1
[7,] 7 4 3 2 4 6 2 2 4
[8,] 8 2 7 9 7 4 4 6 2
[9,] 9 8 4 8 5 1 7 5 9
[10,] 10 7 9 10 6 8 9 8 8