Summation of variables by Groups in R

Summation of variables by Groups in R - r

I have a data frame, and I'd like to create a new column that gives the sum of a numeric variable grouped by factors. So something like this:
BEFORE:
data1 <- data.frame(month = c(1, 1, 2, 2, 3, 3),
sex = c("m", "f", "m", "f", "m", "f"),
value = c(10, 20, 30, 40, 50, 60))
AFTER:
data2 <- data.frame(month = c(1, 1, 2, 2, 3, 3),
sex = c("m", "f", "m", "f", "m", "f"),
value = c(10, 20, 30, 40, 50, 60),
sum = c(30, 30, 70, 70, 110, 110))
In Stata you can do this with the egen command quite easily. I've tried the aggregate function, and the ddply function but they create entirely new data frames, and I just want to add a column to the existing one.

You are looking for ave
> data2 <- transform(data1, sum=ave(value, month, FUN=sum))
month sex value sum
1 1 m 10 30
2 1 f 20 30
3 2 m 30 70
4 2 f 40 70
5 3 m 50 110
6 3 f 60 110
data1$sum <- ave(data1$value, data1$month, FUN=sum) is useful if you don't want to use transform
Also data.table is helpful
library(data.table)
DT <- data.table(data1)
DT[, sum:=sum(value), by=month]
UPDATE
We can also use a tidyverse approach which is simple, yet elegant:
> library(tidyverse)
> data1 %>%
group_by(month) %>%
mutate(sum=sum(value))
# A tibble: 6 x 4
# Groups: month [3]
month sex value sum
<dbl> <fct> <dbl> <dbl>
1 1 m 10 30
2 1 f 20 30
3 2 m 30 70
4 2 f 40 70
5 3 m 50 110
6 3 f 60 110

Related

Compute the difference between two columns by pair in R

I have the following data:
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- cbind.data.frame(names, scores)
I want to "extend" this data frame to make name pairs for every possible combination of names without repetition like so:
names_1 <- c("a", "a", "a", "b", "b", "c")
names_2 <- c("b", "c", "d", "c", "d", "d")
scores_1 <- c(95, 95, 95, 55, 55, 100)
scores_2 <- c(55, 100, 60, 100, 60, 60)
df_extended <- cbind.data.frame(names_1, names_2, scores_1, scores_2)
In the extended data, scores_1 are the scores for the corresponding name in names_1, and scores_2 are for names_2.
The following bit of code makes the appropriate name pairs. But I do not know how to get the scores in the right place after that.
t(combn(df$names,2))
The final goal is to get the row-wise difference between scores_1 and scores_2.
df_extended$score_diff <- abs(df_extended$scores_1 - df_extended$scores_2)

df_ext <- data.frame(t(combn(df$names, 2,\(x)c(x, df$scores[df$names %in%x]))))
df_ext <- setNames(type.convert(df_ext, as.is =TRUE), c('name_1','name_2', 'type_1', 'type_2'))
df_ext
name_1 name_2 type_1 type_2
1 a b 95 55
2 a c 95 100
3 a d 95 60
4 b c 55 100
5 b d 55 60
6 c d 100 60

names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- cbind.data.frame(names, scores)
library(tidyverse)
map(df, ~combn(x = .x, m = 2)%>% t %>% as_tibble) %>%
imap_dfc(~set_names(x = .x, nm = paste(.y, seq(ncol(.x)), sep = "_"))) %>%
mutate(score_diff = scores_1 - scores_2)
#> # A tibble: 6 × 5
#> names_1 names_2 scores_1 scores_2 score_diff
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 a b 95 55 40
#> 2 a c 95 100 -5
#> 3 a d 95 60 35
#> 4 b c 55 100 -45
#> 5 b d 55 60 -5
#> 6 c d 100 60 40
Created on 2022-06-06 by the reprex package (v2.0.1)

First, we can create a new data frame with the unique combinations of names. Then, we can merge on the scores to match the names for both names_1 and names_2 to get the final data.frame.
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- cbind.data.frame(names, scores)
new_df <- data.frame(t(combn(df$names,2)))
names(new_df)[1] <- "names_1"; names(new_df)[2] <- "names_2"
new_df <- merge(new_df, df, by.x = 'names_1', by.y = 'names')
new_df <- merge(new_df, df, by.x = 'names_2', by.y = 'names')
names(new_df)[3] <- "scores_1"; names(new_df)[4] <- "scores_2"
> new_df
names_2 names_1 scores_1 scores_2
1 b a 95 55
2 c a 95 100
3 c b 55 100
4 d a 95 60
5 d b 55 60
6 d c 100 60

How to add a row, which is a sum of some values in a column based on specific values in other column?

I am trying to add two rows to the data frame.
Regarding the first row, its value in MODEL column should be X, total_value should be the sum of total value of rows, with the MODEL being A and C and total_frequency should be the sum of total_frequency of rows, with the MODEL being A and C.
In the second row, the value in MODEL column should be Z, total_value should be the sum of total_value of rows, with the MODEL being D, Fand E, and total_frequency should be the sum of total_frequency of rows, with the MODEL being D,Fand E.
I am stuck, as I do not know how to select specific values of MODEL and then sum these two other columns.
Here is my data
data.frame(MODEL=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"), total_value= c(62, 54, 78, 38, 16, 75, 39, 13, 58, 37),
total_frequency = c(78, 83, 24, 13, 22, 52, 16, 16, 20, 72))

You can try with dplyr, calculating the "new rows", then put together with the data df:
library(dplyr)
first <- df %>%
# select the models you need
filter(MODEL %in% c("A","C")) %>%
# call them x
mutate(MODEL = 'X') %>%
# grouping
group_by(MODEL) %>%
# calculate the sums
summarise_all(sum)
# same with the second
second <- df %>%
filter(MODEL %in% c("D","F","E")) %>%
mutate(MODEL = 'Z') %>%
group_by(MODEL) %>% summarise_all(sum)
# put together
rbind(df, first, second)
# A tibble: 12 x 3
MODEL total_value total_frequency
1 A 62 78
2 B 54 83
3 C 78 24
4 D 38 13
5 E 16 22
6 F 75 52
7 G 39 16
8 H 13 16
9 I 58 20
10 J 37 72
11 X 140 102
12 Z 129 87

The following code is a straightforward solution to the problem.
i1 <- df1$MODEL %in% c("A", "C")
total_value <- sum(df1$total_value[i1])
total_frequency <- sum(df1$total_frequency[i1])
df1 <- rbind(df1, data.frame(MODEL = "X", total_value, total_frequency))
i2 <- df1$MODEL %in% c("D", "E", "F")
total_value <- sum(df1$total_value[i2])
total_frequency <- sum(df1$total_frequency[i2])
df1 <- rbind(df1, data.frame(MODEL = "Z", total_value, total_frequency))
df1
# MODEL total_value total_frequency
#1 A 62 78
#2 B 54 83
#3 C 78 24
#4 D 38 13
#5 E 16 22
#6 F 75 52
#7 G 39 16
#8 H 13 16
#9 I 58 20
#10 J 37 72
#11 X 140 102
#12 Z 129 87
It is also possible to write a function to avoid repeating the same code.
fun <- function(X, M, vals){
i1 <- X$MODEL %in% vals
total_value <- sum(X$total_value[i1])
total_frequency <- sum(X$total_frequency[i1])
rbind(X, data.frame(MODEL = M, total_value, total_frequency))
}
df1 <- fun(df1, M = "X", vals = c("A", "C"))
df1 <- fun(df1, M = "Z", vals = c("D", "E", "F"))

Multiply columns from different data frames if dates match

I have the following two data frames:
df1 <- data.frame(Category = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
Date = c(2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003),
Beta1 = c(1, 3, 4, 4, 5, 3, 5, 3, 1),
Beta2 = c(2, 4, 6, 1, 1, 2, 5, 4, 2))
df2 <- data.frame(Date = c(2001, 2002, 2003),
Column1 = c(10, 20, 30),
Column2 = c(40, 50, 60))
Say I assign category A to Column1 and and category C to Column2. I want to multiply the row value from Column1 with the row betas from category A, if the dates match. Similarly, I want to multiply the row value from Column2 with the row betas from category C, if the dates match.
The match between a category and a column is of my own choosing. Assigning this myself won’t be a problem I think because I have relatively few columns.
Preferably, I want the output to look like this:
results <- data.frame(Date = c(2001, 2002, 2003),
Column1_categoryA_beta1 = c(10, 60, 120),
Column1_categoryA_beta2 = c(20, 80, 180),
Column2_categoryC_beta1 = c(200, 150, 60),
Column2_categoryC_beta2 = c(200, 200, 120))
Any help in how I best can approach this problem is very much appreciated!

With some data wrangling using tidyr and dplyr this can be achieved like so:
df1 <- data.frame(Category = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
Date = c(2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003),
Beta1 = c(1, 3, 4, 4, 5, 3, 5, 3, 1),
Beta2 = c(2, 4, 6, 1, 1, 2, 5, 4, 2))
df2 <- data.frame(Date = c(2001, 2002, 2003),
Column1 = c(10, 20, 30),
Column2 = c(40, 50, 60))
library(dplyr)
library(tidyr)
df2_long <- df2 %>%
pivot_longer(-Date, names_to = "Column", values_to = "Value") %>%
mutate(Category = ifelse(Column == "Column1", "A", "C"))
df2_long %>%
left_join(df1) %>%
mutate(Beta1 = Value * Beta1,
Beta2 = Value * Beta2) %>%
select(Date, Category, Column, Beta1, Beta2) %>%
pivot_wider(id_cols = Date, names_from = c("Column", "Category"), values_from = c("Beta1", "Beta2"))
#> Joining, by = c("Date", "Category")
#> Warning: Column `Category` joining character vector and factor, coercing into
#> character vector
#> # A tibble: 3 x 5
#> Date Beta1_Column1_A Beta1_Column2_C Beta2_Column1_A Beta2_Column2_C
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2001 10 200 20 200
#> 2 2002 60 150 80 200
#> 3 2003 120 60 180 120
Created on 2020-04-14 by the reprex package (v0.3.0)

One way to get there while keeping the Category variable in the final data frame is the following:
df3 <- left_join(df1, df2, by="Date")
df4 <- df3 %>%
group_by(Date, Category) %>%
mutate(Col1Bet1 = Column1 * Beta1, Col1Bet2 = Column1 * Beta2, Col2Bet1 = Column2 * Beta1, Col2Bet2 = Column2 * Beta2)
which gives the following:
# A tibble: 9 x 10
# Groups: Date, Category [9]
Category Date Beta1 Beta2 Column1 Column2 Col1Bet1 Col1Bet2 Col2Bet1 Col2Bet2
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 2001 1 2 10 40 10 20 40 80
2 A 2002 3 4 20 50 60 80 150 200
3 A 2003 4 6 30 60 120 180 240 360
4 B 2001 4 1 10 40 40 10 160 40
5 B 2002 5 1 20 50 100 20 250 50
6 B 2003 3 2 30 60 90 60 180 120
7 C 2001 5 5 10 40 50 50 200 200
8 C 2002 3 4 20 50 60 80 150 200
9 C 2003 1 2 30 60 30 60 60 120

This could be a start. The result data.table has all information you want just in another format.
df3 <- merge(df1, df2)
df3$b1 <- ifelse(df3$Category=="A", df3$Beta1*df3$Column1, ifelse(df3$Category=="C", df3$Beta1*df3$Column2, NA))
df3$b2 <- ifelse(df3$Category=="A", df3$Beta2*df3$Column1, ifelse(df3$Category=="C", df3$Beta2*df3$Column2, NA))
# Date Category Beta1 Beta2 Column1 Column2 b1 b2
# 1 2001 A 1 2 10 40 10 20
# 2 2001 C 5 5 10 40 200 200
# 3 2001 B 4 1 10 40 NA NA
# 4 2002 A 3 4 20 50 60 80
# 5 2002 B 5 1 20 50 NA NA
# 6 2002 C 3 4 20 50 150 200
# 7 2003 B 3 2 30 60 NA NA
# 8 2003 A 4 6 30 60 120 180
# 9 2003 C 1 2 30 60 60 120

Is there a way to create new columns in R based on manipulations from multiple data frames?

Does anyone know if it is possible to use a variable in one dataframe (in my case the "deploy" dataframe) to create a variable in another dataframe?
For example, I have two dataframes:
df1:
deploy <- data.frame(ID = c("20180101_HH1_1_1", "20180101_HH1_1_2", "20180101_HH1_1_3"),
Site_Depth = c(42, 93, 40), Num_Depth_Bins_Required = c(5, 100, 4),
Percent_Column_in_each_bin = c(20, 10, 25))
df2:
sp.c <- data.frame(species = c("RR", "GS", "GT", "BR", "RS", "BA", "GS", "RS", "SH", "RR"),
ct = c(25, 66, 1, 12, 30, 6, 1, 22, 500, 6),
percent_dist_from_surf = c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42))
I want to create new columns in df2 that assigns each species and count to a bin based on the Percent_Column_in_each_bin for each ID. For example, in 20180101_HH1_1_3 there would be 4 bins that each make up 25% of the column and all species that are within 0-25% of the column (in df2) would be in bin 1 and species within 25-50% of the column would be in depth bin 2, and so on. What I'm imagining this looking like is:
i.want.this <- data.frame(species = c("RR", "GS", "GT", "BR", "RS", "BA", "GS", "RS", "SH", "RR"),
ct = c(25, 66, 1, 12, 30, 6, 1, 22, 500, 6),
percent_dist_from_surf = c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42),
'20180101_HH1_1_1_Bin' = c(1, 1, 2, 4, 4, 5, 1, 4, 1, 3),
'20180101_HH1_1_2_Bin' = c(2, 2, 4, 7, 8, 10, 1, 7, 1, 5),
'20180101_HH1_1_3_Bin' = c(1, 1, 2, 3, 3, 4, 1, 3, 1, 2))
I am pretty new to R and I'm not sure how to make this happen. I need to do this for over 100 IDs (all with different depths, number of depth bins, and percent of the column in each bin) so I was hoping that I don't need to do them all by hand. I have tried mutate in dplyr but I can't get it to pull from two different dataframes. I have also tried ifelse statements, but I would need to run the ifelse statement for each ID individually.
I don't know if what I am trying to do is possible but I appreciate the feedback. Thank you in advance!
Edit: my end goal is to find the max count (max ct) for each species within each bin for each ID. What I've been doing to find this (using the bins generated with suggestions from #Ben) is using dplyr to slice and find the max ID like this:
20180101_HH1_1_1 <- sp.c %>%
group_by(20180101_HH1_1_1, species) %>%
arrange(desc(ct)) %>%
slice(1) %>%
group_by(20180101_HH1_1_1) %>%
mutate(Count_Total_Per_Bin = sum(ct)) %>%
group_by(species, add=TRUE) %>%
mutate(species_percent_of_total_in_bin =
paste0((100*ct/Count_Total_Per_Bin) %>%
mutate(ID= "20180101_HH1_1_1 ") %>%
ungroup()
but I have to do this for over 100 IDs. My desired output would be something like:
end.goal <- data.frame(ID = c(rep("20180101_HH1_1_1", 8)),
species = c("RR", "GS", "SH", "GT", "RR", "BR", "RS", "BA"),
bin = c(1, 1, 1, 2, 3, 4, 4, 5),
Max_count_of_each_species_in_each_bin = c(11, 66, 500, 1, 6, 12, 30, 6),
percent_dist_from_surf = c(11, 15, 5, 33, 42, 68, 71, 100),
percent_each_species_max_in_each_bin = c((11/577)*100, (66/577)*100, (500/577)*100, 100, 100, (12/42)*100, (30/42)*100, 100))
I was thinking that by answering the original question I could get to this but I see now that there's still a lot you have to do to get this for each ID.

Here is another approach, which does not require a loop.
Using sapply you can cut to determine bins for each percent_dist_from_surf value in your deploy dataframe.
res <- sapply(deploy$Percent_Column_in_each_bin, function(x) {
cut(sp.c$percent_dist_from_surf, seq(0, 100, by = x), include.lowest = TRUE, labels = 1:(100/x))
})
colnames(res) <- deploy$ID
cbind(sp.c, res)
Or using purrr:
library(purrr)
cbind(sp.c, imap(setNames(deploy$Percent_Column_in_each_bin, deploy$ID),
~ cut(sp.c$percent_dist_from_surf, seq(0, 100, by = .x), include.lowest = TRUE, labels = 1:(100/.x))
))
Output
species ct percent_dist_from_surf 20180101_HH1_1_1 20180101_HH1_1_2 20180101_HH1_1_3
1 RR 25 11 1 2 1
2 GS 66 15 1 2 1
3 GT 1 33 2 4 2
4 BR 12 68 4 7 3
5 RS 30 71 4 8 3
6 BA 6 100 5 10 4
7 GS 1 2 1 1 1
8 RS 22 65 4 7 3
9 SH 500 5 1 1 1
10 RR 6 42 3 5 2
Edit:
To determine the maximum ct value for each species, site, and bin, put the result of above into a dataframe called res and do the following.
First would put into long form with pivot_longer. Then you can group_by species, site, and bin, and determine the maximum ct for this combination.
library(tidyverse)
res %>%
pivot_longer(cols = starts_with("2018"), names_to = "site", values_to = "bin") %>%
group_by(species, site, bin) %>%
summarise(max_ct = max(ct)) %>%
arrange(site, bin)
Output
# A tibble: 26 x 4
# Groups: species, site [21]
species site bin max_ct
<fct> <chr> <fct> <dbl>
1 GS 20180101_HH1_1_1 1 66
2 RR 20180101_HH1_1_1 1 25
3 SH 20180101_HH1_1_1 1 500
4 GT 20180101_HH1_1_1 2 1
5 RR 20180101_HH1_1_1 3 6
6 BR 20180101_HH1_1_1 4 12
7 RS 20180101_HH1_1_1 4 30
8 BA 20180101_HH1_1_1 5 6
9 GS 20180101_HH1_1_2 1 1
10 SH 20180101_HH1_1_2 1 500
11 GS 20180101_HH1_1_2 2 66
12 RR 20180101_HH1_1_2 2 25
13 GT 20180101_HH1_1_2 4 1
14 RR 20180101_HH1_1_2 5 6
15 BR 20180101_HH1_1_2 7 12
16 RS 20180101_HH1_1_2 7 22
17 RS 20180101_HH1_1_2 8 30
18 BA 20180101_HH1_1_2 10 6
19 GS 20180101_HH1_1_3 1 66
20 RR 20180101_HH1_1_3 1 25
21 SH 20180101_HH1_1_3 1 500
22 GT 20180101_HH1_1_3 2 1
23 RR 20180101_HH1_1_3 2 6
24 BR 20180101_HH1_1_3 3 12
25 RS 20180101_HH1_1_3 3 30
26 BA 20180101_HH1_1_3 4 6

It is helpful to distinguish between the contents of your two dataframes.
df2 appears to contain measurements from some sites
df1 appears to contain parameters by which you want to process/summarise the measurements in df2
Given these different purposes of the two dataframes, your best approach is probably to loop over all the rows of df1 each time adding a column to df2. Something like the following:
max_dist = max(df2$percent_dist_from_surf)
for(ii in 1:nrow(df1)){
# extract parameters
this_ID = df1[[ii,"ID"]]
this_depth = df1[[ii,"Site_Depth"]]
this_bins = df1[[ii,"Num_Depth_Bins_Required"]]
this_percent = df1[[ii,"Percent_Column_in_each_bin"]]
# add column to df2
df2 = df2 %>%
mutate(!!sym(this_ID) := insert_your_calculation_here)
}
The !!sym(this_ID) := part of the code is to allow dynamic naming of your output columns.
And as best I can determine the formula you want for insert_your_calculation_here is ceil(percent_dist_from_surf / max_dist * this_bins)

Find rows that are different by ID for specific columns

Suppose that we have the following data frame:
ID <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5)
age <- c(25, 25, 25, 22, 22, 56, 56, 56, 80, 33, 33, 90, 90, 90)
gender <- c("m", "m", "m", "f", "f", "m", "m", "m", "m", "m", "m", "f", "f", "m")
company <- c("c1", "c2", "c2", "c3", "c3", "c1", "c1", "c1", "c1", "c5", "c5", "c3", "c4", "c5")
income <- c(1000, 1000, 1000, 500, 1700, 200, 200, 250, 500, 700, 700, 300, 350, 300)
df <- data.frame(ID, age, gender, company, income)
I need to find the row that have different values by ID for age, gender, and income. I don't care about the company whether they are same or different.
So after processing, here is the output:
BONUS,
Can we create another data frame include the list of variables that are different by id. For example:

An option would be to group by 'ID', check whether the number of distinct elements in 'age', 'gender', 'income' is equal to 1 and then negate (!)
library(dplyr)
out <- df %>%
group_by(ID) %>%
filter(!(n_distinct(age) == 1 &
n_distinct(gender) == 1 &
n_distinct(income) == 1))
out
# A tibble: 9 x 5
# Groups: ID [3]
# ID age gender company income
# <dbl> <dbl> <fct> <fct> <dbl>
#1 2 22 f c3 500
#2 2 22 f c3 1700
#3 3 56 m c1 200
#4 3 56 m c1 200
#5 3 56 m c1 250
#6 3 80 m c1 500
#7 5 90 f c3 300
#8 5 90 f c4 350
#9 5 90 m c5 300
If there are many variable, another option i filter_at
df %>%
group_by(ID) %>%
filter_at(vars(age, gender, income), any_vars(!(n_distinct(.) == 1)))
From the above, we can get the ssecond output with
library(tidyr)
out %>%
select(-company) %>%
gather(key, val, - ID) %>%
group_by(key, add = TRUE) %>%
filter(n_distinct(val) > 1) %>%
group_by(ID) %>%
summarise(Different = toString(unique(key)))
# A tibble: 3 x 2
# ID Different
# <dbl> <chr>
#1 2 income
#2 3 age, income
#3 5 gender, income

In base R, we can split c("age", "gender", "income") column based on ID find out ID's which have more than 1 unique row and subset them.
df[df$ID %in% unique(df$ID)[sapply(split(df[c("age", "gender", "income")], df$ID),
function(x) nrow(unique(x)) > 1)], ]
# ID age gender company income
#4 2 22 f c3 500
#5 2 22 f c3 1700
#6 3 56 m c1 200
#7 3 56 m c1 200
#8 3 56 m c1 250
#9 3 80 m c1 500
#12 5 90 f c3 300
#13 5 90 f c4 350
#14 5 90 m c5 300

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Summation of variables by Groups in R - r

Related

Compute the difference between two columns by pair in R

How to add a row, which is a sum of some values in a column based on specific values in other column?

Multiply columns from different data frames if dates match

Is there a way to create new columns in R based on manipulations from multiple data frames?

Find rows that are different by ID for specific columns

Categories

Resources