I have a dataset and I am trying to find a solution for it using dplyr. My goal is to sum the columns value and percentage within each group, but only for rows where value is smaller than 10, store the result under a new item name, "cheap_stuff", and remove the original rows with the low values.
My data looks like this:
df <- data.frame(group=c(rep("A",4), rep("B",4), rep("C",4), rep("D",4)),
value=c(1, 23, 15, 5, 3, 45, 7, 21, 4, 8, 26, 30, 3, 9, 37, 68),
percentage=c(2.27, 52.27, 34.09, 11.36 ,3.95 ,59.21 ,9.21 ,27.63 ,5.88 ,11.76 ,38.24 ,44.12 ,2.56 ,7.69, 31.62, 58.12),
item=c("cheap1","expensive1" ,"expensive2", "cheap2",
"cheap1", "expensive1","cheap2","expensive2",
"cheap1","cheap2","expensive1","expensive2",
"cheap1","cheap2","expensive1","expensive2"))
print(df)
group value percentage item
1 A 1 2.27 cheap1
2 A 23 52.27 expensive1
3 A 15 34.09 expensive2
4 A 5 11.36 cheap2
5 B 3 3.95 cheap1
6 B 45 59.21 expensive1
7 B 7 9.21 cheap2
8 B 21 27.63 expensive2
9 C 4 5.88 cheap1
10 C 8 11.76 cheap2
11 C 26 38.24 expensive1
12 C 30 44.12 expensive2
13 D 3 2.56 cheap1
14 D 9 7.69 cheap2
15 D 37 31.62 expensive1
16 D 68 58.12 expensive2
My desired output looks like this:
group value percentage item
1 A 6 13.64 cheap_stuff
2 A 23 52.27 expensive1
3 A 15 34.09 expensive2
4 B 10 13.16 cheap_stuff
5 B 45 59.21 expensive1
6 B 21 27.63 expensive2
7 C 12 17.65 cheap_stuff
8 C 26 38.24 expensive1
9 C 30 44.12 expensive2
10 D 12 10.26 cheap_stuff
11 D 37 31.62 expensive1
12 D 68 58.12 expensive2
This post points in the right direction:
Summarize with mathematical conditions in dplyr
But there, all values are summed and a new column is created.
I have tried something like this:
library(dplyr)
df%>%
group_by(group) %>%
mutate(item= replace(item, which(value <10),"cheap_stuff")) %>%
mutate(value = sum(value[value < 10]))
But that fails: I cannot remove the rows that I want, and it overwrites the value in the rows with expensive items.
# A tibble: 16 × 4
# Groups: group [4]
group value percentage item
<chr> <dbl> <dbl> <chr>
1 A 6 2.27 cheap_stuff
2 A 6 52.3 expensive1
3 A 6 34.1 expensive2
4 A 6 11.4 cheap_stuff
5 B 10 3.95 cheap_stuff
6 B 10 59.2 expensive1
7 B 10 9.21 cheap_stuff
8 B 10 27.6 expensive2
9 C 12 5.88 cheap_stuff
10 C 12 11.8 cheap_stuff
11 C 12 38.2 expensive1
12 C 12 44.1 expensive2
13 D 12 2.56 cheap_stuff
14 D 12 7.69 cheap_stuff
15 D 12 31.6 expensive1
16 D 12 58.1 expensive2
Using value<10 instead of grepl:
df %>%
  group_by(group,
           item = case_when(value < 10 ~ "cheap_stuff",
                            TRUE ~ item)) %>%
  summarise(value = sum(value),
            percentage = sum(percentage)) %>%
  ungroup()
group item value percentage
<chr> <chr> <dbl> <dbl>
1 A cheap_stuff 6 13.6
2 A expensive1 23 52.3
3 A expensive2 15 34.1
4 B cheap_stuff 10 13.2
5 B expensive1 45 59.2
6 B expensive2 21 27.6
7 C cheap_stuff 12 17.6
8 C expensive1 26 38.2
9 C expensive2 30 44.1
10 D cheap_stuff 12 10.2
11 D expensive1 37 31.6
12 D expensive2 68 58.1
Original answer:
df %>%
  group_by(group,
           item = case_when(grepl("cheap", item, fixed = TRUE) ~ "cheap_stuff",
                            TRUE ~ item)) %>%
  summarise(value = sum(value),
            percentage = sum(percentage))
group item value percentage
<chr> <chr> <dbl> <dbl>
1 A cheap_stuff 6 13.6
2 A expensive1 23 52.3
3 A expensive2 15 34.1
4 B cheap_stuff 10 13.2
5 B expensive1 45 59.2
6 B expensive2 21 27.6
7 C cheap_stuff 12 17.6
8 C expensive1 26 38.2
9 C expensive2 30 44.1
10 D cheap_stuff 12 10.2
11 D expensive1 37 31.6
12 D expensive2 68 58.1
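As an alternative sketch (not taken from the answers above), the same result can be built in two explicit steps: sum the cheap rows per group, then append them to the rows that are kept. The data are the first two groups of the question's df; the names cheap and result are my own.

```r
library(dplyr)

# First two groups of the question's data
df <- data.frame(
  group = rep(c("A", "B"), each = 4),
  value = c(1, 23, 15, 5, 3, 45, 7, 21),
  percentage = c(2.27, 52.27, 34.09, 11.36, 3.95, 59.21, 9.21, 27.63),
  item = c("cheap1", "expensive1", "expensive2", "cheap2",
           "cheap1", "expensive1", "cheap2", "expensive2")
)

# Step 1: one summary row per group for the rows with value < 10
cheap <- df %>%
  filter(value < 10) %>%
  group_by(group) %>%
  summarise(value = sum(value),
            percentage = sum(percentage),
            .groups = "drop") %>%
  mutate(item = "cheap_stuff")

# Step 2: keep the remaining rows and append the summary rows
result <- df %>%
  filter(value >= 10) %>%
  bind_rows(cheap) %>%
  arrange(group, item)

print(result)
```

This is more verbose than grouping on a recoded item, but it keeps the "remove the low rows" step visible as its own filter.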
Trying to create a simple function that summarizes a variable of choice via a dplyr chain. Here's my attempt:
get_mutated_df <- function(data, outcome){
{{data}} %>% group_by(speed) %>%
summarize(dist_mean = mean({{outcome}}, na.rm = T)) %>%
print()
}
data(cars)
get_mutated_df(cars, "dist")
However, this returns a tibble of NAs:
# A tibble: 19 × 2
speed dist_mean
<dbl> <dbl>
1 4 NA
2 7 NA
3 8 NA
4 9 NA
What's the appropriate way of doing this?
Drop the {{ }} around data and pass dist without quotes; then it works.
get_mutated_df <- function(data, outcome){
data %>% group_by(speed) %>%
summarize(dist_mean = mean({{outcome}}, na.rm = T)) %>%
print()
}
get_mutated_df(cars, dist)
speed dist_mean
<dbl> <dbl>
1 4 6
2 7 13
3 8 16
4 9 10
5 10 26
6 11 22.5
7 12 21.5
8 13 35
9 14 50.5
10 15 33.3
11 16 36
12 17 40.7
13 18 64.5
14 19 50
15 20 50.4
16 22 66
17 23 54
18 24 93.8
19 25 85
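If you would rather keep passing the column name as a string, as in the original call, one option is the .data pronoun. This is a minimal sketch, not part of the answer above; the function name get_mean_by_speed is my own.

```r
library(dplyr)

# outcome is a character string naming the column to average
get_mean_by_speed <- function(data, outcome) {
  data %>%
    group_by(speed) %>%
    summarize(dist_mean = mean(.data[[outcome]], na.rm = TRUE))
}

res <- get_mean_by_speed(cars, "dist")
print(res)
```

Here {{ }} is not needed at all, because the string is used to look the column up rather than being captured as an expression.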
Code for the pre_ case, where the function summarises the column named pre_ plus the outcome name:
carss <- cars
carss$pre_dist <- cars$dist
get_mutated_df_2 <- function(data, outcome){
outcome <- deparse(substitute(outcome))
outcome <- paste0("pre_", outcome)
outcome <- as.symbol(outcome)
data %>% group_by(speed) %>%
summarize(dist_mean := mean({{outcome}}, na.rm = T)) %>%
print()
}
get_mutated_df_2(carss, dist)
speed dist_mean
<dbl> <dbl>
1 4 6
2 7 13
3 8 16
4 9 10
5 10 26
6 11 22.5
7 12 21.5
8 13 35
9 14 50.5
10 15 33.3
11 16 36
12 17 40.7
13 18 64.5
14 19 50
15 20 50.4
16 22 66
17 23 54
18 24 93.8
19 25 85
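A variant of the prefix idea, sketched under the assumption that building the column name as a string is acceptable: rlang::as_name() turns the captured argument into text, and .data[[ ]] looks the column up. The function name get_pre_mean and the helper variable col are hypothetical.

```r
library(dplyr)

carss <- cars
carss$pre_dist <- cars$dist

get_pre_mean <- function(data, outcome) {
  # turn the bare column name into a string, then add the prefix
  col <- paste0("pre_", rlang::as_name(rlang::enquo(outcome)))
  data %>%
    group_by(speed) %>%
    summarize(dist_mean = mean(.data[[col]], na.rm = TRUE))
}

res_pre <- get_pre_mean(carss, dist)
print(res_pre)
```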
I'm aggregating data with variable bin sizes (see previous question here: R: aggregate every n rows with variable n depending on sum(n) of second column). In addition to calculating sums and means over groups of variable ranges, I need to pull out single-value covariates at the midpoint of each group range. When I try to do this on the fly, I only get a value for the first group and NAs for the remaining.
df.summary <- as.data.frame(df %>%
  mutate(rn = row_number()) %>%
  group_by(grp = (cumsum(d) - 1) %/% 100 + 1) %>%
  summarise(x = mean(x, na.rm = TRUE),
            d = sum(d, na.rm = TRUE),
            i.start = first(rn),
            i.end = last(rn),
            y = nth(y, round(first(rn) + (last(rn) - first(rn)) / 2 - 1))))
head(df.summary)
grp x d i.start i.end y
1 1 0.07458317 88.99342 1 4 19.78992
2 2 0.07594546 97.62130 5 8 NA
3 3 0.05353308 104.69683 9 12 NA
4 4 0.06498291 106.23468 13 16 NA
5 5 0.08601759 98.24939 17 20 NA
6 6 0.06262427 84.43745 21 23 NA
sample data:
structure(list(x = c(0.10000112377193, 0.110742170350877, 0.0300274304561404,
0.0575619395964912, 0.109060465438596, 0.0595491225614035, 0.0539270264912281,
0.0812452063859649, 0.0341699389122807, 0.0391744879122807, 0.0411787485614035,
0.0996091644385965, 0.0970479474912281, 0.0595715843684211, 0.0483489989122807,
0.0549631194561404, 0.0705080555964912, 0.080437472631579, 0.105883664631579,
0.0872411613684211, 0.103236660631579, 0.0381296894912281, 0.0465064491578947,
0.0936565184561403, 0.0410095752631579, 0.0311180032105263, 0.0257758157894737,
0.0354721928947368, 0.0584999394736842, 0.0241286060175439, 0.112053376666667,
0.0769823868596491, 0.0558137530526316, 0.0374491000701754, 0.0419279142631579,
0.0260257506842105, 0.0544360374561404, 0.107411071842105, 0.103873468,
0.0419322114035088, 0.0483912961052632, 0.0328373653157895, 0.0866868717719298,
0.063990467245614, 0.0799280314035088, 0.123490407070175, 0.145676836280702,
0.0292878782807018, 0.0432093036666667, 0.0203547443684211),
d = c(22.2483512600033, 22.2483529247042, 22.2483545865809,
22.2483562542823, 22.24835791863, 25.1243105415557, 25.1243148759953,
25.1243192107884, 25.1243235416981, 25.1243278750792, 27.2240858553058,
27.2240943134697, 27.2241027638674, 27.224111222031, 27.2241196741942,
24.5623431981188, 24.5623453409221, 24.5623474809012, 24.562349626705,
24.5623517696847, 28.1458125837154, 28.1458157376341, 28.1458188889053,
28.1458220452951, 28.1458251983314, 27.8293318542146, 27.8293366652115,
27.8293414829159, 27.829346292148, 27.8293511094993, 27.5271773325046,
27.5271834011289, 27.5271894694002, 27.5271955369655, 27.5272016048837,
28.0376097925214, 28.0376146410729, 28.0376194959786, 28.0376243427651,
28.0376291969647, 26.8766095768196, 26.8766122563318, 26.8766149309023,
26.8766176123562, 26.8766202925746, 27.8736950101666, 27.8736960528853,
27.8736971017815, 27.8736981446767, 27.8736991932199), y = c(19.79001,
19.789922, 19.789834, 19.789746, 19.789658, 19.78957, 19.789468,
19.789366, 19.789264, 19.789162, 19.78906, 19.78896, 19.78886,
19.78876, 19.78866, 19.78856, 19.788458, 19.788356, 19.788254,
19.788152, 19.78805, 19.787948, 19.787846, 19.787744, 19.787642,
19.78754, 19.787442, 19.787344, 19.787246, 19.787148, 19.78705,
19.786956, 19.786862, 19.786768, 19.786674, 19.78658, 19.786486,
19.786392, 19.786298, 19.786204, 19.78611, 19.786016, 19.785922,
19.785828, 19.785734, 19.78564, 19.785544, 19.785448, 19.785352,
19.785256)), row.names = c(NA, 50L), class = "data.frame")
Let's add the variables z and n in the summarise part. They are defined as below.
df %>%
  mutate(rn = row_number()) %>%
  group_by(grp = (cumsum(d) - 1) %/% 100 + 1) %>%
  summarise(x = mean(x, na.rm = TRUE),
            d = sum(d, na.rm = TRUE),
            i.start = first(rn),
            i.end = last(rn),
            z = round(first(rn) + (last(rn) - first(rn)) / 2 - 1),
            n = n())
grp x d i.start i.end z n
<dbl> <dbl> <dbl> <int> <int> <dbl> <int>
1 1 0.0746 89.0 1 4 2 4
2 2 0.0759 97.6 5 8 6 4
3 3 0.0535 105. 9 12 10 4
4 4 0.0650 106. 13 16 14 4
5 5 0.0860 98.2 17 20 18 4
6 6 0.0626 84.4 21 23 21 3
7 7 0.0479 112. 24 27 24 4
8 8 0.0394 83.5 28 30 28 3
9 9 0.0706 110. 31 34 32 4
10 10 0.0575 112. 35 38 36 4
11 11 0.0647 83.0 39 41 39 3
12 12 0.0659 108. 42 45 42 4
13 13 0.0854 111. 46 49 46 4
14 14 0.0204 27.9 50 50 49 1
In the data frame above, n is the number of rows in each group defined by grp. Because the data are grouped by grp, nth(y, z) picks the z-th value within each group, not the z-th row of the whole data frame.
For the 5th group, for example, there are only 4 values, but you ask for the 18th value of y, so the result is NA.
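To make this concrete, here is a small sketch using the four y values of the 5th group (rows 17 to 20 of the sample data). The variable names y_group, out_of_range and in_group are my own.

```r
library(dplyr)

# y values and global row numbers of the 5th group (rows 17-20)
y_group <- c(19.788458, 19.788356, 19.788254, 19.788152)
rn <- 17:20

# Asking for the 18th element of a length-4 vector gives NA
out_of_range <- nth(y_group, 18)

# Converting the global row number 18 to a within-group position works
in_group <- nth(y_group, 18 - first(rn) + 1)

print(out_of_range)
print(in_group)
```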
I think the simplest fix is to use n().
df %>%
  mutate(rn = row_number()) %>%
  group_by(grp = (cumsum(d) - 1) %/% 100 + 1) %>%
  summarise(x = mean(x, na.rm = TRUE),
            d = sum(d, na.rm = TRUE),
            i.start = first(rn),
            i.end = last(rn),
            y = nth(y, round(n() / 2)))
grp x d i.start i.end y
<dbl> <dbl> <dbl> <int> <int> <dbl>
1 1 0.0746 89.0 1 4 19.8
2 2 0.0759 97.6 5 8 19.8
3 3 0.0535 105. 9 12 19.8
4 4 0.0650 106. 13 16 19.8
5 5 0.0860 98.2 17 20 19.8
6 6 0.0626 84.4 21 23 19.8
7 7 0.0479 112. 24 27 19.8
8 8 0.0394 83.5 28 30 19.8
9 9 0.0706 110. 31 34 19.8
10 10 0.0575 112. 35 38 19.8
11 11 0.0647 83.0 39 41 19.8
12 12 0.0659 108. 42 45 19.8
13 13 0.0854 111. 46 49 19.8
14 14 0.0204 27.9 50 50 NA
This picks the round(n()/2)-th y of each group, i.e. a value near the middle of the group. The last group has a single row, and round(1/2) is 0 in R, so its y is NA. Note that you can also try floor(n()/2) + 1.
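df %>%
  mutate(rn = row_number()) %>%
  group_by(grp = (cumsum(d) - 1) %/% 100 + 1) %>%
  summarise(x = mean(x, na.rm = TRUE),
            d = sum(d, na.rm = TRUE),
            i.start = first(rn),
            i.end = last(rn),
            # rn is the global row number, so subtract i.start and add 1
            # to get the position within the group
            y = nth(y, floor(median(rn)) - i.start + 1))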
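The midpoint lookup can be checked on a toy data frame that includes single-row groups. This is only a sketch: the data frame toy is made up, and ceiling(n() / 2) is my own choice of midpoint rule, picked because it never evaluates to 0.

```r
library(dplyr)

# Made-up data: cumulative d crosses 100, 200 and 300, giving four groups
# (rows 1-3, rows 4-6, row 7, row 8)
toy <- data.frame(d = c(30, 30, 30, 30, 40, 40, 40, 90),
                  y = c(10, 20, 30, 40, 50, 60, 70, 80))

res <- toy %>%
  group_by(grp = (cumsum(d) - 1) %/% 100 + 1) %>%
  summarise(n = n(),
            y_mid = nth(y, ceiling(n() / 2)))  # within-group position

print(res)
```

For groups with an even number of rows this takes the lower of the two middle rows; use floor(n() / 2) + 1 if you prefer the upper one.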
I guess something similar has been asked before, but I could only find answers for Python and SQL. So please let me know in the comments if this has already been asked for R!
Data
Let's say we have a dataframe like this:
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
# In case you do not get the same dataframe, see the comment by @Ian Campbell - thanks!
position value
1 1 27
2 2 37
3 3 57
4 4 89
5 5 20
6 6 86
7 7 97
8 8 62
9 9 58
10 10 6
11 11 19
12 12 16
13 13 61
14 14 34
15 15 67
16 16 43
17 17 88
18 18 83
19 19 32
20 20 63
Goal
I'm interested in calculating the average value over n positions and subtracting from it the average value of the next n positions; let's say n=5 for now.
What I tried
I currently use the method below, but on a bigger dataframe it takes a huge amount of time, so I wonder whether there is a faster method.
calc <- function( pos ) {
this.five <- df %>% slice(pos:(pos+4))
next.five <- df %>% slice((pos+5):(pos+9))
differ = mean(this.five$value)- mean(next.five$value)
data.frame(dif= differ)
}
df %>%
group_by(position) %>%
do(calc(.$position))
That produces the following table:
position dif
<int> <dbl>
1 1 -15.8
2 2 9.40
3 3 37.6
4 4 38.8
5 5 37.4
6 6 22.4
7 7 4.20
8 8 -26.4
9 9 -31
10 10 -35.4
11 11 -22.4
12 12 -22.3
13 13 -0.733
14 14 15.5
15 15 -0.400
16 16 NaN
17 17 NaN
18 18 NaN
19 19 NaN
20 20 NaN
I suspect a data.table approach may be faster.
library(data.table)
setDT(df)
df[,c("roll.position","rollmean") := lapply(.SD,frollmean,n=5,fill=NA, align = "left")]
df[, result := rollmean[.I] - rollmean[.I + 5]]
df[,.(position,value,rollmean,result)]
# position value rollmean result
# 1: 1 27 46.0 -15.8
# 2: 2 37 57.8 9.4
# 3: 3 57 69.8 37.6
# 4: 4 89 70.8 38.8
# 5: 5 20 64.6 37.4
# 6: 6 86 61.8 22.4
# 7: 7 97 48.4 4.2
# 8: 8 62 32.2 -26.4
# 9: 9 58 32.0 -31.0
#10: 10 6 27.2 -35.4
#11: 11 19 39.4 -22.4
#12: 12 16 44.2 NA
#13: 13 61 58.6 NA
#14: 14 34 63.0 NA
#15: 15 67 62.6 NA
#16: 16 43 61.8 NA
#17: 17 88 NA NA
#18: 18 83 NA NA
#19: 19 32 NA NA
#20: 20 63 NA NA
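A slightly shorter sketch of the same idea, assuming data.table is available: compute the left-aligned rolling mean on value only, then subtract the rolling mean five rows ahead with shift(). The values are typed in from the table in the question so the example does not depend on the RNG version; the name dt is my own.

```r
library(data.table)

dt <- data.table(
  position = 1:20,
  value = c(27, 37, 57, 89, 20, 86, 97, 62, 58, 6,
            19, 16, 61, 34, 67, 43, 88, 83, 32, 63)
)

# Mean of rows i..i+4 (left-aligned window of 5)
dt[, rollmean := frollmean(value, n = 5, align = "left")]

# Subtract the mean of rows i+5..i+9, i.e. the rolling mean 5 rows ahead
dt[, result := rollmean - shift(rollmean, n = 5, type = "lead")]

print(dt)
```

Rows without two complete windows of five come out as NA, rather than the partial means the do() version in the question returns.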
Data
RNGkind(sample.kind = "Rounding")
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
RNGkind(sample.kind = "default")
So I have data regarding Id number and time
Id number Time(hr)
1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4
I want this output
Time Id number
10 5
20 10
30 16
40 22
So I want the time in 10-hour intervals, with the ID that corresponds to each of those times. I tried data <- data2[seq(0, nrow(data2), by=5), ], but that steps through the ID numbers at fixed intervals instead of stepping through Time in 10-hour intervals, which is not what I want. So far I'm getting this output:
Id.number Time..s.
10 19.3
20 36.9
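You can use the %% (modulo) operator. Note that this keeps only rows whose Time is an exact multiple of 10, and none of the rows in the sample data is.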
data[data$Time %% 10 == 0, ]
I use cut() and cumsum(table()) but I don't quite get the answer you are expecting. How exactly are you calculating this?
# first load the data
v.txt <- '1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4'
# parse the text into a two-column data frame (a bit awkward)
v <- as.data.frame(matrix(as.numeric(unlist(strsplit(strsplit(v.txt, '\n')[[1]], ' +'))), byrow=TRUE, ncol=2))
names(v) <- c('Id', 'Time') # the matrix columns come out as V1 and V2
tens <- seq(from=0, by=10, to=50) # bin edges covering the observed Time range
v$cut <- cut(v$Time, tens, labels=tens[-1])
v2 <- as.data.frame(cumsum(table(v$cut))) # cumulative row count = last Id in each bin
names(v2) <- 'Id'
v2$Time <- rownames(v2)
rownames(v2) <- 1:nrow(v2)
v2 <- v2[,c(2,1)]
rm(v, v.txt, tens) # not needed anymore
v2 # the answer... but doesn't quite match your expected answer...
Time Id
1 10 5
2 20 10
3 30 15
4 40 21
5 50 25
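One reading that does reproduce the expected output is "the Id whose Time is closest to each 10-hour mark". That is only my guess at the intended rule, and the names dat, marks, nearest and out are my own; a base R sketch:

```r
times <- c(5, 6.1, 7.2, 8.3, 9.6, 10.9, 13, 15.1, 17.2, 19.3,
           21.4, 23.5, 25.6, 27.1, 28.6, 30.1, 31.8, 33.5, 35.2, 36.9,
           38.6, 40.3, 42, 43.7, 45.4)
dat <- data.frame(Id = seq_along(times), Time = times)

# 10-hour marks that fall inside the observed Time range
marks <- seq(10, 40, by = 10)

# For each mark, the row whose Time is nearest to it
nearest <- sapply(marks, function(m) which.min(abs(dat$Time - m)))

out <- data.frame(Time = marks, Id = dat$Id[nearest])
print(out)
```

With the sample data this picks Ids 5, 10, 16 and 22. If the rule is really "last Id at or before each mark", the cut() approach above is the one to use instead.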