Purrr Multiply index data frame with dataframe - r

Thank you all for reading this problem.
What I would like to do is multiply my test data by my index file while matching columns:
Dp_water in the test data by Dp_water in the index, and so on, iterating over all index variables (kcal, fat, prot, carbs).
My test data holds, for 10 individuals, the consumption of 4 food groups in grams.
For each individual I would like to calculate the kcal, fat, prot and carbs intake.
For each individual I would like to create new variables:
Dp_water_kcal, Dp_coffee_kcal, Dp_soup_kcal, Dp_soda_kcal
Dp_water_fat, Dp_coffee_fat, Dp_soup_fat, Dp_soda_fat
etc.
library(tidyverse)
Sample data
Index file
index <- data.frame(Variable = c("Dp_water", "Dp_coffee", "Dp_soup", "Dp_soda"),
                    kcal = c(0, 10, 20, 40),
                    fat = c(0, 5, 10, 15),
                    prot = c(2, 4, 6, 8),
                    carbs = c(3, 6, 9, 12))
index <- index %>%
  pivot_longer(c(kcal, fat, prot, carbs)) %>%
  pivot_wider(names_from = Variable, values_from = value)
> index
# A tibble: 4 x 5
name Dp_water Dp_coffee Dp_soup Dp_soda
<chr> <dbl> <dbl> <dbl> <dbl>
1 kcal 0 10 20 40
2 fat 0 5 10 15
3 prot 2 4 6 8
4 carbs 3 6 9 12
Below is the subject data: consumption of 4 food groups per individual.
test_data <- data.frame(Dp_water = 11:20,
                        Dp_coffee = 31:40,
                        Dp_soup = 21:30,
                        Dp_soda = 41:50,
                        id = 1:10)
Dp_water Dp_coffee Dp_soup Dp_soda id
1 11 31 21 41 1
2 12 32 22 42 2
3 13 33 23 43 3
4 14 34 24 44 4
5 15 35 25 45 5
6 16 36 26 46 6
7 17 37 27 47 7
8 18 38 28 48 8
9 19 39 29 49 9
10 20 40 30 50 10
If I do the following, it works, but I would like to do this for all variables and not only kcal. I would also like to keep the id column.
test_data %>%
select(-id) %>%
map2_dfr(., test_data[match(names(.), names(test_data))], ~.x/100 * .y) %>%
set_names(paste0(names(.), "_kcal"))
# A tibble: 10 x 4
Dp_water_kcal Dp_coffee_kcal Dp_soup_kcal Dp_soda_kcal
<dbl> <dbl> <dbl> <dbl>
1 1.21 9.61 4.41 16.8
2 1.44 10.2 4.84 17.6
3 1.69 10.9 5.29 18.5
4 1.96 11.6 5.76 19.4
5 2.25 12.2 6.25 20.2
6 2.56 13.0 6.76 21.2
7 2.89 13.7 7.29 22.1
8 3.24 14.4 7.84 23.0
9 3.61 15.2 8.41 24.0
10 4 16 9 25
Thank you all for any help!
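One possible approach, offered as a minimal sketch and not as the definitive answer: reshape the consumption data to long format, join it to the index by food group, multiply, and reshape back to wide. It assumes the index is still in its original long form (one row per food group, before the pivot shown above) and that the index values are per 100 g, which is what the `/100` in the attempt suggests. The name `result` is only illustrative.

```r
library(dplyr)
library(tidyr)

# Index in its original form: one row per food group (assumed per 100 g)
index <- data.frame(Variable = c("Dp_water", "Dp_coffee", "Dp_soup", "Dp_soda"),
                    kcal = c(0, 10, 20, 40),
                    fat = c(0, 5, 10, 15),
                    prot = c(2, 4, 6, 8),
                    carbs = c(3, 6, 9, 12))

test_data <- data.frame(Dp_water = 11:20, Dp_coffee = 31:40,
                        Dp_soup = 21:30, Dp_soda = 41:50, id = 1:10)

result <- test_data %>%
  # one row per individual and food group
  pivot_longer(-id, names_to = "Variable", values_to = "grams") %>%
  # attach the nutrient values of the matching food group
  left_join(index, by = "Variable") %>%
  # grams consumed / 100 * nutrient value per 100 g (assumption)
  mutate(across(c(kcal, fat, prot, carbs), ~ grams / 100 * .x)) %>%
  select(-grams) %>%
  # back to one row per individual, columns named e.g. Dp_water_kcal
  pivot_wider(names_from = Variable,
              values_from = c(kcal, fat, prot, carbs),
              names_glue = "{Variable}_{.value}")
```

The id column is kept because it is the only column not named in `names_from` or `values_from`, so `pivot_wider()` uses it to identify rows.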

Related

Summarize in a column using a condition and return a new row with the summed value

I have a dataset and I am trying to find a solution for it using dplyr. My goal is to summarize the values in the columns value and percentage, but only for the value smaller than 10 and add this to a new item name called: "cheap_stuff", while removing the rows with the low values.
My data looks like this:
df <- data.frame(group=c(rep("A",4), rep("B",4), rep("C",4), rep("D",4)),
value=c(1, 23, 15, 5, 3, 45, 7, 21, 4, 8, 26, 30, 3, 9, 37, 68),
percentage=c(2.27, 52.27, 34.09, 11.36 ,3.95 ,59.21 ,9.21 ,27.63 ,5.88 ,11.76 ,38.24 ,44.12 ,2.56 ,7.69, 31.62, 58.12),
item=c("cheap1","expensive1" ,"expensive2", "cheap2",
"cheap1", "expensive1","cheap2","expensive2",
"cheap1","cheap2","expensive1","expensive2",
"cheap1","cheap2","expensive1","expensive2"))
view(df)
group value percentage item
1 A 1 2.27 cheap1
2 A 23 52.27 expensive1
3 A 15 34.09 expensive2
4 A 5 11.36 cheap2
5 B 3 3.95 cheap1
6 B 45 59.21 expensive1
7 B 7 9.21 cheap2
8 B 21 27.63 expensive2
9 C 4 5.88 cheap1
10 C 8 11.76 cheap2
11 C 26 38.24 expensive1
12 C 30 44.12 expensive2
13 D 3 2.56 cheap1
14 D 9 7.69 cheap2
15 D 37 31.62 expensive1
16 D 68 58.12 expensive2
My desired output looks like this:
group value percentage item
1 A 6 13.64 cheap_stuff
2 A 23 52.27 expensive1
3 A 15 34.09 expensive2
4 B 10 13.16 cheap_stuff
5 B 45 59.21 expensive1
6 B 21 27.63 expensive2
7 C 12 17.65 cheap_stuff
8 C 26 38.24 expensive1
9 C 30 44.12 expensive2
10 D 12 10.26 cheap_stuff
11 D 37 31.62 expensive1
12 D 68 58.12 expensive2
This post goes in the right direction:
Summarize with mathematical conditions in dplyr
But there, all values are summed and a new column is created.
I have tried something like this:
library(dplyr)
df%>%
group_by(group) %>%
mutate(item= replace(item, which(value <10),"cheap_stuff")) %>%
mutate(value = sum(value[value < 10]))
But that fails: I cannot remove the rows that I want to drop, and it overwrites the values in the expensive rows.
# A tibble: 16 × 4
# Groups: group [4]
group value percentage item
<chr> <dbl> <dbl> <chr>
1 A 6 2.27 cheap_stuff
2 A 6 52.3 expensive1
3 A 6 34.1 expensive2
4 A 6 11.4 cheap_stuff
5 B 10 3.95 cheap_stuff
6 B 10 59.2 expensive1
7 B 10 9.21 cheap_stuff
8 B 10 27.6 expensive2
9 C 12 5.88 cheap_stuff
10 C 12 11.8 cheap_stuff
11 C 12 38.2 expensive1
12 C 12 44.1 expensive2
13 D 12 2.56 cheap_stuff
14 D 12 7.69 cheap_stuff
15 D 12 31.6 expensive1
16 D 12 58.1 expensive2
Using value < 10 instead of grepl:
df %>%
  group_by(group, item = case_when(value < 10 ~ "cheap_stuff",
                                   TRUE ~ item)) %>%
  summarise(value = sum(value),
            percentage = sum(percentage)) %>%
  ungroup()
group item value percentage
<chr> <chr> <dbl> <dbl>
1 A cheap_stuff 6 13.6
2 A expensive1 23 52.3
3 A expensive2 15 34.1
4 B cheap_stuff 10 13.2
5 B expensive1 45 59.2
6 B expensive2 21 27.6
7 C cheap_stuff 12 17.6
8 C expensive1 26 38.2
9 C expensive2 30 44.1
10 D cheap_stuff 12 10.2
11 D expensive1 37 31.6
12 D expensive2 68 58.1
Original answer:
df %>%
  group_by(group, item = case_when(grepl("cheap", item, fixed = TRUE) ~ "cheap_stuff",
                                   TRUE ~ item)) %>%
  summarise(value = sum(value),
            percentage = sum(percentage))
group item value percentage
<chr> <chr> <dbl> <dbl>
1 A cheap_stuff 6 13.6
2 A expensive1 23 52.3
3 A expensive2 15 34.1
4 B cheap_stuff 10 13.2
5 B expensive1 45 59.2
6 B expensive2 21 27.6
7 C cheap_stuff 12 17.6
8 C expensive1 26 38.2
9 C expensive2 30 44.1
10 D cheap_stuff 12 10.2
11 D expensive1 37 31.6
12 D expensive2 68 58.1

Using mean in dplyr chain with curly braces always returns NA

Trying to create a simple function that summarizes a variable of choice via a dplyr chain. Here's my attempt:
get_mutated_df <- function(data, outcome){
{{data}} %>% group_by(speed) %>%
summarize(dist_mean = mean({{outcome}}, na.rm = T)) %>%
print()
}
data(cars)
get_mutated_df(cars, "dist")
However, this returns a tibble of NAs:
# A tibble: 19 × 2
speed dist_mean
<dbl> <dbl>
1 4 NA
2 7 NA
3 8 NA
4 9 NA
What's the appropriate way of doing this?
Do not wrap data in {{ }}, and pass dist without quotes; then it works.
get_mutated_df <- function(data, outcome){
data %>% group_by(speed) %>%
summarize(dist_mean = mean({{outcome}}, na.rm = T)) %>%
print()
}
get_mutated_df(cars, dist)
speed dist_mean
<dbl> <dbl>
1 4 6
2 7 13
3 8 16
4 9 10
5 10 26
6 11 22.5
7 12 21.5
8 13 35
9 14 50.5
10 15 33.3
11 16 36
12 17 40.7
13 18 64.5
14 19 50
15 20 50.4
16 22 66
17 23 54
18 24 93.8
19 25 85
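If the column name has to arrive as a string, as in the original call `get_mutated_df(cars, "dist")`, one option is dplyr's `.data` pronoun. With a string argument, `{{outcome}}` evaluates to the string itself, and `mean()` of a character value is NA, which would explain the NAs above. The sketch below is one way to handle that; the function name `mean_by_speed` is only illustrative.

```r
library(dplyr)

# outcome is expected to be a column name given as a string, e.g. "dist"
mean_by_speed <- function(data, outcome) {
  data %>%
    group_by(speed) %>%
    # .data[[...]] looks the column up by its name inside the data frame
    summarize(dist_mean = mean(.data[[outcome]], na.rm = TRUE))
}

res <- mean_by_speed(cars, "dist")
```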
Code for the pre_ case, where the column to summarise is the outcome name prefixed with pre_:
carss <- cars
carss$pre_dist <- cars$dist
get_mutated_df_2 <- function(data, outcome){
outcome <- deparse(substitute(outcome))
outcome <- paste0("pre_", outcome)
outcome <- as.symbol(outcome)
data %>% group_by(speed) %>%
summarize(dist_mean := mean({{outcome}}, na.rm = T)) %>%
print()
}
get_mutated_df_2(carss, dist)
speed dist_mean
<dbl> <dbl>
1 4 6
2 7 13
3 8 16
4 9 10
5 10 26
6 11 22.5
7 12 21.5
8 13 35
9 14 50.5
10 15 33.3
11 16 36
12 17 40.7
13 18 64.5
14 19 50
15 20 50.4
16 22 66
17 23 54
18 24 93.8
19 25 85
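An alternative sketch for the prefixed column, assuming the same unquoted call style: capture the argument's name with `rlang::enquo()` and `rlang::as_name()`, build the prefixed name as a string, and look it up through `.data`. The names `cars2` and `mean_pre_by_speed` are hypothetical, and `pre_dist` is set to twice `dist` here only so that the two columns can be told apart.

```r
library(dplyr)

cars2 <- cars
cars2$pre_dist <- cars$dist * 2  # illustrative values, not from the original post

mean_pre_by_speed <- function(data, outcome) {
  # turn the unquoted argument (e.g. dist) into the string "pre_dist"
  col <- paste0("pre_", rlang::as_name(rlang::enquo(outcome)))
  data %>%
    group_by(speed) %>%
    summarize(dist_mean = mean(.data[[col]], na.rm = TRUE))
}

res_pre <- mean_pre_by_speed(cars2, dist)
```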

R:dplyr summarise data by group with nth() call with variable n calculated during aggregation

I'm aggregating data with variable bin sizes (see previous question here: R: aggregate every n rows with variable n depending on sum(n) of second column). In addition to calculating sums and means over groups of variable ranges, I need to pull out single-value covariates at the midpoint of each group range. When I try to do this on the fly, I only get a value for the first group and NAs for the remaining.
df.summary <- as.data.frame(df %>%
  mutate(rn = row_number()) %>%
  group_by(grp = (cumsum(d) - 1) %/% 100 + 1) %>%
  summarise(x = mean(x, na.rm = TRUE), d = sum(d, na.rm = TRUE),
            i.start = first(rn), i.end = last(rn),
            y = nth(y, round(first(rn) + (last(rn) - first(rn)) / 2 - 1))))
head(df.summary)
grp x d i.start i.end y
1 1 0.07458317 88.99342 1 4 19.78992
2 2 0.07594546 97.62130 5 8 NA
3 3 0.05353308 104.69683 9 12 NA
4 4 0.06498291 106.23468 13 16 NA
5 5 0.08601759 98.24939 17 20 NA
6 6 0.06262427 84.43745 21 23 NA
sample data:
structure(list(x = c(0.10000112377193, 0.110742170350877, 0.0300274304561404,
0.0575619395964912, 0.109060465438596, 0.0595491225614035, 0.0539270264912281,
0.0812452063859649, 0.0341699389122807, 0.0391744879122807, 0.0411787485614035,
0.0996091644385965, 0.0970479474912281, 0.0595715843684211, 0.0483489989122807,
0.0549631194561404, 0.0705080555964912, 0.080437472631579, 0.105883664631579,
0.0872411613684211, 0.103236660631579, 0.0381296894912281, 0.0465064491578947,
0.0936565184561403, 0.0410095752631579, 0.0311180032105263, 0.0257758157894737,
0.0354721928947368, 0.0584999394736842, 0.0241286060175439, 0.112053376666667,
0.0769823868596491, 0.0558137530526316, 0.0374491000701754, 0.0419279142631579,
0.0260257506842105, 0.0544360374561404, 0.107411071842105, 0.103873468,
0.0419322114035088, 0.0483912961052632, 0.0328373653157895, 0.0866868717719298,
0.063990467245614, 0.0799280314035088, 0.123490407070175, 0.145676836280702,
0.0292878782807018, 0.0432093036666667, 0.0203547443684211),
d = c(22.2483512600033, 22.2483529247042, 22.2483545865809,
22.2483562542823, 22.24835791863, 25.1243105415557, 25.1243148759953,
25.1243192107884, 25.1243235416981, 25.1243278750792, 27.2240858553058,
27.2240943134697, 27.2241027638674, 27.224111222031, 27.2241196741942,
24.5623431981188, 24.5623453409221, 24.5623474809012, 24.562349626705,
24.5623517696847, 28.1458125837154, 28.1458157376341, 28.1458188889053,
28.1458220452951, 28.1458251983314, 27.8293318542146, 27.8293366652115,
27.8293414829159, 27.829346292148, 27.8293511094993, 27.5271773325046,
27.5271834011289, 27.5271894694002, 27.5271955369655, 27.5272016048837,
28.0376097925214, 28.0376146410729, 28.0376194959786, 28.0376243427651,
28.0376291969647, 26.8766095768196, 26.8766122563318, 26.8766149309023,
26.8766176123562, 26.8766202925746, 27.8736950101666, 27.8736960528853,
27.8736971017815, 27.8736981446767, 27.8736991932199), y = c(19.79001,
19.789922, 19.789834, 19.789746, 19.789658, 19.78957, 19.789468,
19.789366, 19.789264, 19.789162, 19.78906, 19.78896, 19.78886,
19.78876, 19.78866, 19.78856, 19.788458, 19.788356, 19.788254,
19.788152, 19.78805, 19.787948, 19.787846, 19.787744, 19.787642,
19.78754, 19.787442, 19.787344, 19.787246, 19.787148, 19.78705,
19.786956, 19.786862, 19.786768, 19.786674, 19.78658, 19.786486,
19.786392, 19.786298, 19.786204, 19.78611, 19.786016, 19.785922,
19.785828, 19.785734, 19.78564, 19.785544, 19.785448, 19.785352,
19.785256)), row.names = c(NA, 50L), class = "data.frame")
Let's add the variables z and n in the summarise part. They are defined as below.
df %>%
  mutate(rn = row_number()) %>%
  group_by(grp = (cumsum(d) - 1) %/% 100 + 1) %>%
  summarise(x = mean(x, na.rm = TRUE),
            d = sum(d, na.rm = TRUE),
            i.start = first(rn),
            i.end = last(rn),
            z = round(first(rn) + (last(rn) - first(rn)) / 2 - 1),
            n = n())
grp x d i.start i.end z n
<dbl> <dbl> <dbl> <int> <int> <dbl> <int>
1 1 0.0746 89.0 1 4 2 4
2 2 0.0759 97.6 5 8 6 4
3 3 0.0535 105. 9 12 10 4
4 4 0.0650 106. 13 16 14 4
5 5 0.0860 98.2 17 20 18 4
6 6 0.0626 84.4 21 23 21 3
7 7 0.0479 112. 24 27 24 4
8 8 0.0394 83.5 28 30 28 3
9 9 0.0706 110. 31 34 32 4
10 10 0.0575 112. 35 38 36 4
11 11 0.0647 83.0 39 41 39 3
12 12 0.0659 108. 42 45 42 4
13 13 0.0854 111. 46 49 46 4
14 14 0.0204 27.9 50 50 49 1
In the data frame above, n is the number of rows in each group defined by grp. Because you call group_by(grp), nth(y, z) picks the z-th value within each group, not within the whole data frame.
For the 5th group, for example, there are only 4 values, but you ask for the 18th value of y, so it returns NA.
I think the simplest fix is to use n().
df %>%
  mutate(rn = row_number()) %>%
  group_by(grp = (cumsum(d) - 1) %/% 100 + 1) %>%
  summarise(x = mean(x, na.rm = TRUE),
            d = sum(d, na.rm = TRUE),
            i.start = first(rn),
            i.end = last(rn),
            y = nth(y, round(n() / 2)))
grp x d i.start i.end y
<dbl> <dbl> <dbl> <int> <int> <dbl>
1 1 0.0746 89.0 1 4 19.8
2 2 0.0759 97.6 5 8 19.8
3 3 0.0535 105. 9 12 19.8
4 4 0.0650 106. 13 16 19.8
5 5 0.0860 98.2 17 20 19.8
6 6 0.0626 84.4 21 23 19.8
7 7 0.0479 112. 24 27 19.8
8 8 0.0394 83.5 28 30 19.8
9 9 0.0706 110. 31 34 19.8
10 10 0.0575 112. 35 38 19.8
11 11 0.0647 83.0 39 41 19.8
12 12 0.0659 108. 42 45 19.8
13 13 0.0854 111. 46 49 19.8
14 14 0.0204 27.9 50 50 NA
This picks the round(n()/2)-th y, that is, a y near the middle of each group. The last group has a single row, so round(1/2) is 0 and nth() returns NA. Note that you can also try floor(n()/2) + 1, which avoids that.
df %>%
  mutate(rn = row_number()) %>%
  group_by(grp = (cumsum(d) - 1) %/% 100 + 1) %>%
  summarise(x = mean(x, na.rm = TRUE),
            d = sum(d, na.rm = TRUE),
            i.start = first(rn),
            i.end = last(rn),
            # position within the group: + 1 so that a one-row group gives 1, not 0
            y = nth(y, floor(median(rn)) - i.start + 1))
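To make the group-relative indexing concrete, here is a small sketch on made-up data (the data frame `toy` and the column names `global_idx` and `group_idx` are hypothetical). It contrasts an index computed from global row numbers with one computed from the group size; `ceiling(n() / 2)` is used here as one possible midpoint rule that also works for one-row groups.

```r
library(dplyr)

# Three groups of sizes 4, 3 and 1 (illustrative data only)
toy <- data.frame(grp = c(1, 1, 1, 1, 2, 2, 2, 3),
                  y   = c(10, 11, 12, 13, 20, 21, 22, 30))

mid <- toy %>%
  mutate(rn = row_number()) %>%
  group_by(grp) %>%
  summarise(
    # index built from global row numbers: beyond the group size after group 1
    global_idx = nth(y, round(first(rn) + (last(rn) - first(rn)) / 2)),
    # index built from the group size: always within the group
    group_idx = nth(y, ceiling(n() / 2))
  )
```

In this sketch `global_idx` is NA for the second and third groups, because the requested position exceeds the number of rows in those groups, while `group_idx` returns a value for every group.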

Using "first" in mutate

My dataframe looks something like the first four columns of the following:
ID Obs Seconds Mean Ratio
<chr> <dbl> <dbl> <dbl> <dbl>
1 1815522 1 1 NA 1/10.6
2 1815522 2 26 NA 26/10.6
3 1815522 3 4.68 10.6 4.68/10.6
4 1815522 4 0 10.2 0/10.6
5 1815522 5 1.5 2.06 1.5/10.6
6 1815522 6 2.22 1.24 2.22/10.6
7 1815676 1 12 NA 12/9.67
8 1815676 2 6 NA 6/9.67
9 1815676 3 11 9.67 11/9.67
10 1815676 4 1 6 1/9.67
11 1815676 5 30 14 30/9.67
12 1815676 6 29 20 29/9.67
13 1815676 7 23 27.3 23/9.67
14 1815676 8 51 34.3 51/9.67
I am trying to add a fifth column "Ratio", containing the ratio of each row's value for Seconds, and the ID-group's first not-NA value of Mean. How do I do that?
I've tried several things:
temp %>%
group_by(ID) %>%
mutate(Ratio = case_when(all(is.na(Mean)) ~ NA_real_,
!all(is.na(Mean)) ~ Seconds/(first(Mean[!is.na(Mean)]))))
This gives me the following error:
Error in mutate_impl(.data, dots) :
Column `Ratio` must be length 2 (the group size) or one, not 0
I also tried
temp %>%
group_by(ID) %>%
mutate(Ratio = ifelse(!all(is.na(Mean)), Seconds/(first(Mean[!is.na(Mean)])), NA_real_))
But in this case, it will create a column that looks like this:
Ratio
<dbl>
1 0.0947
2 0.0947
3 0.0947
4 0.0947
5 0.0947
6 0.0947
7 1.24
8 1.24
9 1.24
10 1.24
11 1.24
12 1.24
13 1.24
14 1.24
I really don't know what else to try. Please help! :)
One idea is to use fill() with .direction = 'up': since you are interested in the first non-NA value, fill the leading NAs and simply divide by the first value. There is no need for case_when to handle all-NA groups, since those give NA by default, i.e.
library(tidyverse)
df %>%
group_by(ID) %>%
fill(Mean, .direction = 'up') %>%
mutate(ratio = Seconds / first(Mean))
which gives,
# A tibble: 14 x 5
# Groups: ID [2]
ID Obs Seconds Mean ratio
<int> <int> <dbl> <dbl> <dbl>
1 1815522 1 1 10.6 0.0943
2 1815522 2 26 10.6 2.45
3 1815522 3 4.68 10.6 0.442
4 1815522 4 0 10.2 0
5 1815522 5 1.5 2.06 0.142
6 1815522 6 2.22 1.24 0.209
7 1815676 1 12 9.67 1.24
8 1815676 2 6 9.67 0.620
9 1815676 3 11 9.67 1.14
10 1815676 4 1 6 0.103
11 1815676 5 30 14 3.10
12 1815676 6 29 20 3.00
13 1815676 7 23 27.3 2.38
14 1815676 8 51 34.3 5.27
Try this:
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(
isNA = mean(is.na(Mean)),
Ratio = if_else(isNA == 1, NA_real_, Seconds / first(Mean[!is.na(Mean)]))
)
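For reference, a self-contained sketch that rebuilds the sample data from the question and computes the ratio without `fill()` or `case_when()`. It relies on base R indexing: `Mean[!is.na(Mean)][1]` is the first non-NA value, or NA if the group has none. This also hints at why the `ifelse()` attempt produced one repeated value per group: `ifelse()` returns a result as long as its test, and the test there has length 1. The name `res` is illustrative.

```r
library(dplyr)

# Sample data rebuilt from the question (first four columns)
temp <- data.frame(
  ID = c(rep("1815522", 6), rep("1815676", 8)),
  Obs = c(1:6, 1:8),
  Seconds = c(1, 26, 4.68, 0, 1.5, 2.22, 12, 6, 11, 1, 30, 29, 23, 51),
  Mean = c(NA, NA, 10.6, 10.2, 2.06, 1.24, NA, NA, 9.67, 6, 14, 20, 27.3, 34.3)
)

res <- temp %>%
  group_by(ID) %>%
  # divide every Seconds value by the group's first non-NA Mean
  mutate(Ratio = Seconds / Mean[!is.na(Mean)][1]) %>%
  ungroup()
```

Unlike the `fill()` approach, this leaves the Mean column unchanged.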

R program - getting particular values depending on another column

So I have data regarding Id number and time
Id number Time(hr)
1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4
I want this output
Time Id number
10 5
20 10
30 16
40 22
So I want the time to be in 10-hour intervals and get the ID that corresponds to each of those hours. I tried data <- data2[seq(0, nrow(data2), by=5), ], but instead of the Time being in 10-hour intervals, the ID number is at intervals of 10, which is not the output I want. So far I'm getting this output:
Id.number Time..s.
10 19.3
20 36.9
You can use the %% (modulo) operator. Note that this keeps only rows whose Time is an exact multiple of 10:
data[data$Time %% 10 == 0, ]
I use cut() and cumsum(table()) but I don't quite get the answer you are expecting. How exactly are you calculating this?
# first load the data
v.txt <- '1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4'
# load in the data... awkwardly...
v <- as.data.frame(matrix(as.numeric(unlist(strsplit(strsplit(v.txt, '\n')[[1]], ' +'))), byrow=TRUE, ncol=2))
names(v) <- c('Id', 'Time') # name the columns so that v$Time exists below
tens <- seq(from=0, by=10, to=100)
v$cut <- cut(v$Time, tens, labels=tens[-1])
v2 <- as.data.frame(cumsum(table(v$cut)))
names(v2) <- 'Time'
v2$Id <- rownames(v2)
rownames(v2) <- 1:nrow(v2)
v2 <- v2[,c(2,1)]
rm(v, v.txt, tens) # not needed anymore
v2 # the answer... but doesn't quite match your expected answer...
Id Time
1 10 5
2 20 10
3 30 15
4 40 21
5 50 25
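One reading that does reproduce the expected output in the question is "for each 10-hour mark, take the Id whose Time is closest to it". This is an assumption about what was intended, not something the question states. A minimal base R sketch under that assumption (the names `targets`, `nearest` and `out` are illustrative):

```r
# Data from the question
data2 <- data.frame(
  Id = 1:25,
  Time = c(5, 6.1, 7.2, 8.3, 9.6, 10.9, 13, 15.1, 17.2, 19.3, 21.4, 23.5,
           25.6, 27.1, 28.6, 30.1, 31.8, 33.5, 35.2, 36.9, 38.6, 40.3, 42,
           43.7, 45.4)
)

# 10-hour marks of interest
targets <- seq(10, 40, by = 10)

# for each mark, the row whose Time is nearest to it
nearest <- sapply(targets, function(t) which.min(abs(data2$Time - t)))

out <- data.frame(Time = targets, Id = data2$Id[nearest])
```

With this rule the Ids come out as 5, 10, 16 and 22, matching the desired table, whereas the cut() approach counts rows up to each mark and therefore gives 15 and 21 for the 30 and 40 hour marks.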
