I have the following data frame, which is the result of a cluster analysis with ten 7-point Likert attitude scales for specific product benefits (see the 'variable' column). Here, n is the number of persons stating a specific value for each benefit, cum is the total number of persons in each cluster, and n2 is the relative share of answers to all answers per cluster (n2 = n/cum*100, i.e. a percentage).
Now I want to create a new column that sums up the top-n percentage shares (n2) for each benefit, where the top values are indicated in the 'value' column: e.g. a new column "Top-3-Box" with a value of 46.5 for rows 1-7/Benefit.1, which is the sum of n2 for the rows with the top-3 values 7, 6 and 5. It would be great if there were a solution that is directly applicable in dplyr.
Please see the dataframe below:
cluster variable value n cum n2
<int> <chr> <dbl> <int> <int> <dbl>
1 1 Benefit.1 1 11 86 12.8
2 1 Benefit.1 2 11 86 12.8
3 1 Benefit.1 3 6 86 7
4 1 Benefit.1 4 18 86 20.9
5 1 Benefit.1 5 16 86 18.6
6 1 Benefit.1 6 14 86 16.3
7 1 Benefit.1 7 10 86 11.6
8 1 Benefit.10 1 10 86 11.6
9 1 Benefit.10 2 13 86 15.1
10 1 Benefit.10 3 8 86 9.3
# ... with 40 more rows
I highly appreciate your support!
We can do a grouped sum of 'n2' by subsetting the values corresponding to the top-3 'value' entries (7, 6 and 5)
library(dplyr)
df1 %>%
  group_by(cluster, variable) %>%
  mutate(top3box = sum(n2[value %in% 5:7]))
If 'value' is already ordered within each 'cluster'/'variable' group, we can simply subset the last three elements of 'n2'
df1 %>%
  group_by(cluster, variable) %>%
  mutate(top3box = sum(tail(n2, 3)))
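For a single Top-3-Box row per benefit instead of a repeated column, summarise() collapses each group; a minimal sketch of the same idea:

df1 %>%
  group_by(cluster, variable) %>%
  summarise(top3box = sum(n2[value %in% 5:7]), .groups = "drop")
# cluster 1 / Benefit.1: 18.6 + 16.3 + 11.6 = 46.5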
I have a large dataset ~1M rows with, among others, a column that has a score for each customer record. The score is between 0 and 100.
What I'm trying to do is efficiently map the score to a rating using a rating table. Each customer receives a rating between 1 and 15 based on the customer's score.
# Generate Example Customer Data
library(dplyr)  # for tibble() and the pipe

set.seed(1)
n_customers <- 10

customer_df <-
  tibble(id = 1:n_customers,
         score = sample(50:80, n_customers, replace = TRUE))
# Rating Map
rating_map <- tibble(
max = c(
47.0,
53.0,
57.0,
60.5,
63.0,
65.5,
67.3,
69.7,
71.7,
74.0,
76.3,
79.0,
82.5,
85.5,
100.00
),
rating = c(15:1)
)
The best code that I've come up with to map the rating table onto the customer score data is as follows.
customer_df <- customer_df %>%
  mutate(rating = map(.x = score,   # map() is from purrr
                      .f = ~ max(select(filter(rating_map, .x < max), rating)))) %>%
  unnest(rating)                    # unnest() is from tidyr
The problem I'm having is that while it works, it is extremely inefficient. If you set n_customers to 100k in the above code, you can get a sense of how long it takes to run.
customer_df
# A tibble: 10 x 3
id score rating
<int> <int> <int>
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
I need to speed up the code because it's currently taking over an hour to run. I've identified the inefficiency as my use of the purrr::map() function. So my question is: how can I replicate the above results without using map()?
Thanks!
customer_df$rating <- length(rating_map$max) -
  cut(customer_df$score, breaks = rating_map$max, labels = FALSE, right = FALSE)
This produces the same output and is much faster: it takes 1/20th of a second on 1M rows, which works out to a >72,000x speedup.
It seems like this is a good use case for the base R cut function, which assigns values to a set of intervals you provide.
cut divides the range of x into intervals and codes the values in x
according to which interval they fall. The leftmost interval
corresponds to level one, the next leftmost to level two and so on.
In this case you want the lowest rating for the highest score, hence the subtraction of the cut term from the length of the breaks.
EDIT -- added right = FALSE because you want the intervals to be closed on the left and open on the right. Now matches your output exactly; previously had different results when the value matched a break.
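As a quick illustration of the indexing (labels = FALSE makes cut return the interval number directly):

# The 15 breaks define 14 left-closed intervals; 50 falls in [47, 53)
# and 74 in [74, 76.3), so the ratings are 15 - 1 = 14 and 15 - 10 = 5.
cut(c(50, 74), breaks = rating_map$max, labels = FALSE, right = FALSE)
#[1]  1 10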
We could do a non-equi join. Since rating_map is sorted by max in ascending order, mult = "first" returns the first matching row per customer, i.e. the smallest max that exceeds the score, which is exactly the right rating:
library(data.table)
setDT(rating_map)[customer_df, on = .(max > score), mult = "first"]
-output
max rating id
<int> <int> <int>
1: 74 5 1
2: 53 13 2
3: 56 13 3
4: 50 14 4
5: 51 14 5
6: 78 4 6
7: 72 6 7
8: 60 12 8
9: 63 10 9
10: 67 9 10
Or another option in base R is with findInterval
customer_df$rating <- nrow(rating_map) -
  findInterval(customer_df$score, rating_map$max)
-output
> customer_df
id score rating
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
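Both vectorised replacements scale easily. A rough sanity check on 1M rows (timings vary by machine, so treat this as a sketch):

set.seed(1)
big <- data.frame(id = 1:1e6,
                  score = sample(50:80, 1e6, replace = TRUE))
# findInterval() counts how many breaks lie at or below each score,
# so subtracting from nrow(rating_map) recovers the rating.
system.time(
  big$rating <- nrow(rating_map) - findInterval(big$score, rating_map$max)
)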
Use the data below to make the cumsum_a column look like the should column.
Data to start with:
> demo
th seq group
1 20.1 1 10
2 24.1 2 10
3 26.1 3 10
4 1.1 1 20
5 2.1 2 20
6 4.1 3 20
The "should" column below is the goal.
demo <- data.frame(th = c(20.1, 24.1, 26.1, 1.1, 2.1, 4.1),
                   seq = c(1:3, 1:3),
                   group = c(rep(10, 3), rep(20, 3)))
library(magrittr)
library(dplyr)
demo %>%
  group_by(group) %>%
  mutate(cumsum_a = cumsum(group^seq * (th / cummax(th)))) %>%
  ungroup() %>%
  mutate(
cumsum_m=c( #As an example only, this manually does exactly what cumsum_a is doing (which is wrong)
10^1*20.1/20.1, #good
10^1*20.1/20.1 + 10^2*24.1/24.1, #different denominators, bad
10^1*20.1/20.1 + 10^2*24.1/24.1 + 10^3*26.1/26.1, #different denominators, bad
20^1*1.1/1.1, #good
20^1*1.1/1.1 + 20^2*2.1/2.1, #different denominators, bad
20^1*1.1/1.1 + 20^2*2.1/2.1 + 20^3*4.1/4.1 #different denominators, bad
),
should=c( #this is exactly the kind of calculation I want
10^1*20.1/20.1, #good
10^1*20.1/24.1 + 10^2*24.1/24.1, #good
10^1*20.1/26.1 + 10^2*24.1/26.1 + 10^3*26.1/26.1, #good
20^1*1.1/1.1, #good
20^1*1.1/2.1 + 20^2*2.1/2.1, #good
20^1*1.1/4.1 + 20^2*2.1/4.1 + 20^3*4.1/4.1 #good
)
)
Most simply put, the denominators need to be the same within each row: 24.1 and 24.1 instead of 20.1 and 24.1 on the second row of cumsum_m, and likewise in the underlying calculations for cumsum_a. In other words, for row k of each group the target is should[k] = (sum over j <= k of group[j]^seq[j] * th[j]) / cummax(th)[k].
Here are the new columns, where should is what cumsum_a or cumsum_m should be.
th seq group cumsum_a cumsum_m should
<dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 20.1 1 10 10 10 10
2 24.1 2 10 110 110 108.
3 26.1 3 10 1110 1110 1100.
4 1.1 1 20 20 20 20
5 2.1 2 20 420 420 410.
6 4.1 3 20 8420 8420 8210.
You can use the following solution:
purrr::accumulate takes a two-argument function: the first argument, represented by .x or ..1, is the value accumulated over the previous iterations, and .y represents the current value of our vector (2:n()). So the first accumulated value is the first element of group, since I supplied it as the .init argument.
Since you would like to change the denominator of the previous calculations, I multiply the accumulated result .x by the ratio of the previous value of cmax to the current value of cmax.
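Written out, the running value s within each group satisfies the recurrence below (a restatement of the code that follows, with cmax = cummax(th)):

s[1] = group[1]
s[k] = s[k-1] * cmax[k-1]/cmax[k] + group[k]^seq[k] * th[k]/cmax[k]

(s[1] = group[1] works as the .init value here because seq starts at 1 and th[1] = cmax[1], so the first term reduces to group[1].)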
I think the rest is pretty clear, but if you have any more questions about it just let me know.
library(dplyr)
library(purrr)
demo %>%
  group_by(group) %>%
  mutate(cmax = cummax(th),
         should = accumulate(2:n(), .init = group[1],
                             ~ (.x * cmax[.y - 1] / cmax[.y]) +
                               (group[.y]^seq[.y]) * (th[.y] / cmax[.y])))
# A tibble: 6 x 5
# Groups: group [2]
th seq group cmax should
<dbl> <int> <dbl> <dbl> <dbl>
1 20.1 1 10 20.1 10
2 24.1 2 10 24.1 108.
3 26.1 3 10 26.1 1100.
4 1.1 1 20 1.1 20
5 2.1 2 20 2.1 410.
6 4.1 3 20 4.1 8210.
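Since every term on row k shares the denominator cummax(th)[k], the same column also falls out of a plain cumulative sum rescaled by the running maximum; a sketch that reproduces should (e.g. row 2 of group 10: (10^1*20.1 + 10^2*24.1) / 24.1 = 108.34):

library(dplyr)
demo %>%
  group_by(group) %>%
  # cumsum() accumulates the raw terms; dividing by cummax(th)
  # rescales every earlier term to the current row's denominator.
  mutate(should = cumsum(group^seq * th) / cummax(th)) %>%
  ungroup()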
I want to develop a rule to extract certain rows from a matrix. I set up the example as follows:
mat1 = data.frame(matrix(nrow = 508, ncol = 5))
mat1[1:20, 1] = rep(1, 20)
mat1[1:20, 2:5] = rnorm(20 * 4, 0, 1)

mat2 = data.frame(matrix(nrow = 508, ncol = 5))
seq1 <- seq(1, 3, 1)
mat2[1:27, 1] = rep(seq1, 9)
mat2[1:27, 2:5] = rnorm(27 * 4, 0, 1)

mat3 = data.frame(matrix(nrow = 508, ncol = 5))
mat3[1:32, 1] = rep(seq(1, 4, 1), 8)
mat3[1:32, 2:5] = rnorm(32 * 4, 0, 1)

colnames(mat1) = colnames(mat2) = colnames(mat3) =
  c("Cohort Number", "Alpha(t-1)", "date1", "date2", "date3")
mat.list <- list(mat1, mat2, mat3)
Example matrix, mat1[1:20, ]:
Cohort Number Alpha(t-1) date1 date2 date3
1 1 -1.76745451 -1.3227308 2.7099501 -0.13797329
2 1 -0.72651808 -0.8714317 1.3200554 0.76964663
3 1 -0.50325892 0.0742336 -0.6460628 0.30148135
4 1 0.79592650 0.1353875 -0.5694022 -0.59019913
5 1 1.94064961 0.2255595 0.3156252 -0.90996475
6 1 0.27134932 0.3966957 -1.9198976 0.23998928
7 1 -1.13272507 -0.8603225 -1.2042036 0.06609958
8 1 -2.12392748 1.0905405 -0.3788234 0.92850110
9 1 0.22038996 0.4500683 -1.4617004 0.58498275
10 1 0.26348734 -0.8340913 1.2631368 -1.48490518
11 1 0.26931077 -0.5230622 -0.6615288 1.45668453
12 1 -2.03067695 -0.6432484 0.4801026 0.01808834
13 1 1.25915656 -0.1116544 -0.3004298 -1.04072722
14 1 -2.27894271 -2.1058424 -0.3351053 -1.04132045
15 1 0.47742052 2.1564274 -0.4733351 -0.53152019
16 1 -1.57680089 -0.1340645 -0.3134633 0.53223567
17 1 0.25245813 -0.8243152 0.5998211 -1.01892301
18 1 0.18391447 -1.3500645 1.6059798 1.43359399
19 1 -0.09602031 1.4921338 -0.6455687 0.66385823
20 1 -0.13613759 2.2474816 0.7311762 -2.46849071
mat2[1:27,]
Cohort Number Alpha(t-1) date1 date2 date3
1 1 -0.76033920 1.317636591 -0.09684526 -0.08796725
2 2 0.05123185 -0.731591674 -0.37247406 0.04470346
3 3 -0.78460201 0.890336570 1.26737475 -0.39062992
4 1 -0.14111920 1.255008475 -0.32799815 -0.77277716
5 2 -0.46044451 1.175157970 0.82187906 0.54326905
6 3 -0.46804365 0.704203273 -2.04539007 -1.74782065
7 1 0.42009824 0.488807461 3.21093186 -0.13745029
8 2 1.27083389 -1.316989452 0.43565921 0.07870330
9 3 -0.16581119 1.872955624 -0.22399155 -0.79334562
10 1 -1.33436656 0.589311311 -1.03871415 -1.06221057
11 2 1.56584985 0.020699064 0.45691456 0.15858065
12 3 1.07756426 -0.045200151 0.05124461 -1.86633279
13 1 -1.01264994 -0.229406681 1.24954420 0.88846407
14 2 -0.09950713 -0.515798138 1.62560454 -0.20191909
15 3 -0.28319479 0.450854419 1.42963386 -1.11964154
16 1 0.51771608 -1.407248379 0.62626313 0.97775246
17 2 -0.43951262 -0.368739441 0.66564013 -0.79980882
18 3 -0.15865277 -0.231475146 0.37582330 0.93685867
19 1 -0.57758129 0.235550070 0.42480442 -0.14379249
20 2 -0.81726414 -1.207593079 -0.30000514 0.68967230
21 3 -0.72926703 -0.458849409 1.51162785 1.40921409
22 1 -0.32220454 0.334996561 1.26073381 -2.03405958
23 2 -0.51450039 -0.305634241 1.51021957 0.39775430
24 3 1.15476297 -1.040126709 -0.36192432 -0.37346894
25 1 -0.88053587 -0.006829769 -0.89855797 -0.39840858
26 2 -0.64435448 0.209561006 -0.13986834 -0.61308957
27 3 1.22492942 0.812693992 -1.32371617 -1.21852365
and
> mat3[1:32,]
Cohort Number Alpha(t-1) date1 date2 date3
1 1 -0.7657871 -0.35390862 -0.23539987 -1.8365309
2 2 -0.6631690 1.36450837 0.78403072 -0.8344993
3 3 -1.0134022 -0.28380021 0.72149463 -0.7890273
4 4 2.6419455 0.26998803 2.03606725 0.8099134
5 1 -0.1383910 0.90845134 1.09273919 0.4651443
6 2 -0.7549340 -0.23185551 2.21119705 -0.1386960
7 3 0.7296121 -1.09145187 -1.18092505 0.1510642
8 4 -0.5583415 0.71988405 0.09454476 -0.8661514
9 1 -0.2420894 -0.03215026 -2.51249946 1.1659027
10 2 -0.6434337 -0.13910557 -1.10373674 1.2377968
11 3 -0.6297123 2.09797419 0.87128407 -0.1351845
12 4 0.6674166 0.48707847 0.36373509 1.0680623
13 1 0.6254708 -0.61311671 0.82542494 1.7320687
14 2 -2.4704173 0.98460064 -1.10416042 2.9627952
15 3 -0.2544887 0.63177246 -0.39138717 1.6942072
16 4 -0.9807623 1.11882794 -0.47669974 1.2383798
17 1 -0.6900549 1.68086482 -0.01405476 -1.3099288
18 2 1.4510505 -0.04752782 1.49735258 0.2963673
19 3 -1.1355194 -1.76263532 -1.49318214 1.3524114
20 4 0.7168833 -0.76833639 0.60752304 -1.0647885
21 1 2.0004745 2.13931057 -1.35036048 -0.7694501
22 2 2.0985591 0.01569677 0.33975952 -1.4979973
23 3 0.1703261 -1.47625208 -1.13228671 0.5686501
24 4 0.2632233 -0.55672667 0.33428217 0.5341078
25 1 -0.2741324 -1.61301237 0.78861248 0.4982554
26 2 -0.8793897 -1.07266362 -0.78158128 0.9127354
27 3 0.3920579 -0.59869834 -0.76775259 1.8137107
28 4 -1.4088488 -0.54954542 0.32421016 0.7284813
29 1 -1.2421837 0.50599077 1.62464999 0.6801672
30 2 -2.8980422 0.42197236 0.45243582 1.4939070
31 3 0.3965108 -1.35877353 1.52230797 -1.6552039
32 4 0.8112229 0.51970084 0.30830797 -2.0563928
What I want to do:
For every matrix in mat.list I want to extract 6 rows of data, according to certain criteria, and place these rows as a data.frame in a list labelled Output1. I want to store all remaining rows as a data.frame in Output2.
The process:
1) Group data by cohort number.
2a. If there is 1 group (Cohort Number can only be 1), move to column 2 and extract the 6 rows with the highest value for "Alpha(t-1)". Store these rows as a data.frame in a list named "Output1" and all remaining rows as a data.frame in a list named "Output2".
2b. If there are 2 groups (Cohort Number can be 1 or 2), move to column 2 and extract the 3 rows with the largest "Alpha(t-1)" for Cohort Number == 1 and the 3 rows with the largest "Alpha(t-1)" for Cohort Number == 2. Place the 6 extracted rows as a data.frame in "Output1" and all remaining rows as a data.frame in "Output2".
2c. If there are 3 groups (Cohort Number can be 1, 2 or 3), move to column 2 and extract the 2 rows with the largest "Alpha(t-1)" for each of Cohort Number == 1, 2 and 3. As before, store the 6 extracted rows in "Output1" and the remainder in "Output2".
2d. If there are 4 groups (Cohort Number can be 1, 2, 3 or 4), move to column 2 and extract the 2 rows with the largest "Alpha(t-1)" for Cohort Number == 1, the 2 rows with the largest "Alpha(t-1)" for Cohort Number == 2, the single row with the largest "Alpha(t-1)" for Cohort Number == 3 and the single row with the largest "Alpha(t-1)" for Cohort Number == 4. Store the 6 key rows as a data.frame in "Output1" and all remaining rows as a data.frame in "Output2".
Desired Output:
Output1 <- list()
Output2 <- list()
Output1[[1]] = mat1 %>% group_by(`Cohort Number`) %>% top_n(6, `Alpha(t-1)`)
Output1[[2]] = mat2 %>% group_by(`Cohort Number`) %>% top_n(2, `Alpha(t-1)`)
> Output1[[1]]
# A tibble: 6 x 5
# Groups: Cohort Number [1]
`Cohort Number` `Alpha(t-1)` date1 date2 date3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.796 0.135 -0.569 -0.590
2 1 1.94 0.226 0.316 -0.910
3 1 0.271 0.397 -1.92 0.240
4 1 0.269 -0.523 -0.662 1.46
5 1 1.26 -0.112 -0.300 -1.04
6 1 0.477 2.16 -0.473 -0.532
> Output1[[2]]
# A tibble: 6 x 5
# Groups: Cohort Number [3]
`Cohort Number` `Alpha(t-1)` date1 date2 date3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.420 0.489 3.21 -0.137
2 2 1.27 -1.32 0.436 0.0787
3 2 1.57 0.0207 0.457 0.159
4 1 0.518 -1.41 0.626 0.978
5 3 1.15 -1.04 -0.362 -0.373
6 3 1.22 0.813 -1.32 -1.22
Overall I need a function to do this because I have over 1000 matrices in my actual application and can't do this manually.
We can count the number of distinct values in Cohort Number and, based on that, select the value of n in top_n. When there are more than 3 distinct values, we create a vector of per-group counts to pass to top_n for each Cohort Number.
library(tidyverse)
output1 <- map(mat.list, function(x) {
  dist <- n_distinct(x$`Cohort Number`, na.rm = TRUE)
  if (dist <= 3)
    x %>%
      group_by(`Cohort Number`) %>%
      top_n(6 / dist, `Alpha(t-1)`)
  else
    map2_df(list(2, 2, 1, 1),
            x %>% na.omit %>% group_split(`Cohort Number`),
            ~ .y %>% top_n(.x, `Alpha(t-1)`))
})
and for output2, we use map2 with anti_join
output2 <- map2(mat.list, output1, anti_join)
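Note that anti_join joins on all common columns by default; spelling the keys out is safer should the column sets ever diverge, e.g.:

output2 <- map2(mat.list, output1,
                ~ anti_join(.x, .y, by = colnames(.x)))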
Confirming the output
map_dbl(output1, nrow)
#[1] 6 6 6
map_dbl(output2, nrow)
#[1] 502 502 502
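The hard-coded list(2, 2, 1, 1) also generalises. To spread 6 picks as evenly as possible across k cohorts, a small helper like the hypothetical picks_per_group below would work (it yields 6, c(3, 3), c(2, 2, 2) and c(2, 2, 1, 1) for k = 1 to 4):

# Hypothetical helper: rows to take per cohort so the total is always 6,
# distributed as evenly as possible, with earlier cohorts taking the excess.
picks_per_group <- function(k, total = 6) {
  rep(total %/% k, k) + (seq_len(k) <= total %% k)
}
picks_per_group(4)
#[1] 2 2 1 1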
I apologize if this is a basic or duplicate question, but I am a beginner R user.
I am attempting to match every row in Dataframe A by Sex and Age to the two corresponding columns in Dataframe B. I know there will be a match for sure, so I want to extract values from the matching rows of two different columns in Dataframe B and store them in Dataframe C.
Dataframe A:

ID Sex Age Weight
1    1  24     36
2    1  34     56
3    2  87     12
4    2  21     08
5    1  64     33
...

Dataframe B:

Row Sex Age   X1   X2
1     1  24 18.2 12.3
2     2  87 15.4 16.5
3     1  64 16.3 11.2
4     2  21 15.6 14.7
5     1  34 17.7 18.9
...
Dataframe C
ID Age Sex Weight Y1 Y2
1 1 24 36 18.2 12.3
2 1 34 56 17.7 18.9
3 2 87 12 15.4 16.5
4 2 21 08 15.6 14.7
5 1 64 33 16.3 11.2
There are 9000 IDs in my dataframe. I've looked at similar questions like this one
Fill column values by matching values in each row in two dataframe
But I don't think I am applying this code correctly. Will a for loop be useful here?
for (i in 1:nrow(dfA)) {
  dfC$Y1[i] <- dfB[match(paste(dfA$Sex[i], dfA$Age[i]),
                         paste(dfB$Sex, dfB$Age)), "X1"]
  dfC$Y2[i] <- dfB[match(paste(dfA$Sex[i], dfA$Age[i]),
                         paste(dfB$Sex, dfB$Age)), "X2"]
}
I know the merge function was also suggested, but these two variables are not actually named the same way in my data set.
Thanks!
Try this: the Reduce function in R is made for such operations.
set.seed(1)
list.of.data.frames <- list(
  data.frame(id = 1:10, sex = 1:10, age = 1:10, weight = 1:10),
  data.frame(row = 5:14, sex = 11:20, age = 11:20, x1 = 1:10, x2 = 1:10),
  data.frame(id = 8:17, sex = 11:20, age = 11:20, weight = 21:30, y1 = 1:10, y2 = 1:10)
)
merged.data.frame <- Reduce(function(...) merge(..., all = TRUE), list.of.data.frames)
tail(merged.data.frame)
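On the concern above about differently named key variables: merge does not require matching names, because its by.x/by.y arguments map the keys explicitly. A minimal sketch with hypothetical column names:

# Suppose dfA has Sex/Age while dfB calls them Gender/Years (hypothetical names).
# all.x = TRUE keeps every row of dfA even when dfB has no match.
dfC <- merge(dfA, dfB,
             by.x = c("Sex", "Age"),
             by.y = c("Gender", "Years"),
             all.x = TRUE)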
I have a dataset that has several hundred variables with hundreds of observations. Each observation has a unique identifier, and is associated with one of approximately 50 groups. It looks like so (the variables I'm not concerned about have been ignored below):
ID Group Score
1 10 400
2 11 473
3 12 293
4 13 382
5 14 283
6 11 348
7 11 645
8 13 423
9 10 434
10 10 124
etc.
I would like to calculate an adjusted mean for each observation that needs to use the N-count for each Group, the sum of Scores for that Group, as well as the means for the Scores of each group. (So, in the example above, the N-count for Group 11 is three, the sum is 1466, and the mean is 488.67, and I would use these numbers only on IDs 2, 6, and 7).
I've been fiddling with plyr, and am able to extract the n-counts and means as follows (accounting for missing Scores and Group values):
new_data <- ddply(main_data, "Group", summarize,
                  N = sum(!is.na(Score)), mean = mean(Score, na.rm = TRUE))
I'm stuck, though, on how to get the sum of the scores for a particular group, and then how to calculate the adjusted means either within the main_data set or a new dataset. Any help would be appreciated.
Here is the plyr way.
ddply(main_data, .(Group), summarize, N = sum(!is.na(Score)), mean = mean(Score, na.rm = TRUE), total = sum(Score))
Group N mean total
1 10 3 319.3333 958
2 11 3 488.6667 1466
3 12 1 293.0000 293
4 13 2 402.5000 805
5 14 1 283.0000 283
Check out the dplyr package.
main_data %>% group_by(Group) %>% summarize(n = n(), mean = mean(Score, na.rm=TRUE), total = sum(Score))
Source: local data frame [5 x 4]
Group n mean total
1 10 3 319.3333 958
2 11 3 488.6667 1466
3 12 1 293.0000 293
4 13 2 402.5000 805
5 14 1 283.0000 283
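Since the question ultimately wants a per-observation value, group_by() plus mutate() attaches the group statistics to every row without collapsing them; a sketch (the exact adjustment formula isn't specified above, so only the building blocks are added):

library(dplyr)
main_data %>%
  group_by(Group) %>%
  mutate(N = sum(!is.na(Score)),
         total = sum(Score, na.rm = TRUE),
         group_mean = mean(Score, na.rm = TRUE)) %>%
  ungroup()
# IDs 2, 6 and 7 (Group 11) all receive N = 3, total = 1466 and
# group_mean = 488.67, ready for whatever adjustment comes next.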