Summarize one variable/column over all possible values of other variables/columns - r

I need to summarize one variable/column of a long table after aggregating (group_by()) by another variable/column, and I need the summarized value broken down by every value of the other variables/columns.
Here is test data:
library(tidyverse)
set.seed(123)
Site <- str_c("S", 1:5)
Species <- str_c("Sps", 1:6)
print(Species_tbl <- bind_cols(Species = Species,
                               Exotic = rbinom(length(Species), 1, .3),
                               Migrant = rbinom(length(Species), 2, .3)))
Data_tbl <- expand.grid(Site = Site,
                        Species = Species) %>%
  left_join(Species_tbl)
Data_tbl$Presence <- rbinom(nrow(Data_tbl), 1, .5)
And here is my best effort:
print(Data_tbl %>%
        group_by(Site) %>%
        summarise(N_sp = sum(Presence),
                  N_sp_Exo = sum(Presence[Exotic == 1]),
                  N_sp_Nat = sum(Presence[Exotic == 0]),
                  N_sp_M0 = sum(Presence[Migrant == 0]),
                  N_sp_M1 = sum(Presence[Migrant == 1]),
                  N_sp_M2 = sum(Presence[Migrant == 2])))

You can get the data in long format for your columns of interest, c(Exotic, Migrant), take the sum of Presence for each unique column name and value, and then merge that with the per-Site sum.
library(dplyr)
library(tidyr)
data1 <- Data_tbl %>%
  group_by(Site) %>%
  summarise(N_sp = sum(Presence))

data2 <- Data_tbl %>%
  pivot_longer(cols = c(Exotic, Migrant)) %>%
  group_by(Site, name, value) %>%
  summarise(result = sum(Presence), .groups = "drop") %>%
  pivot_wider(names_from = c(name, value), values_from = result)

inner_join(data1, data2, by = 'Site')
# Site N_sp Exotic_0 Exotic_1 Migrant_0 Migrant_1 Migrant_2
# <fct> <int> <int> <int> <int> <int> <int>
#1 S1 4 2 2 1 2 1
#2 S2 3 2 1 0 2 1
#3 S3 2 1 1 0 2 0
#4 S4 4 2 2 1 3 0
#5 S5 4 1 3 1 2 1
The answer has been divided into two steps for readability. If you would like to do this in a single chain without creating temporary variables, that can be done as well (see the sketch below).
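For reference, one way the two steps could be combined into a single chain (a sketch; the nested pipeline reproduces data2 from above):

Data_tbl %>%
  group_by(Site) %>%
  summarise(N_sp = sum(Presence)) %>%
  inner_join(Data_tbl %>%
               pivot_longer(cols = c(Exotic, Migrant)) %>%
               group_by(Site, name, value) %>%
               summarise(result = sum(Presence), .groups = "drop") %>%
               pivot_wider(names_from = c(name, value), values_from = result),
             by = 'Site')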

Related

Extract all row.names in a data.frame that match a value in another data.frame

I have a data.frame with 150 columns. For each column, I want to extract the minimum and maximum values (values may repeat across rows) and the row names of each minimum/maximum value. I have extracted the min and max values into another data.frame, but I don't know how to match them to their rows.
I have found functions that come very close to this, e.g. for minimum values:
head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
sapply(cars, which.min)
speed dist
1 1
Here, it only gives the first index for minimum speed.
And I've tried with loops like:
for (i in (colnames(cars))){
  print(min(cars[[i]]))
}
[1] 4
[1] 2
But that just gives me the minimum values, not whether they are repeated or the row name of each repeated value.
I want something like:
min.value  column  rowname  freq.times
        4   speed      1,2           2
        2    dist        1           1
Thanks, and sorry if I have spelling mistakes; I'm not a native speaker.
One option is to use the tidyverse. I was a little unclear on whether you want min and max in the same dataframe, so I included both. First, I create an index column with row numbers. Then, I pivot to long format to determine which values are minimum and maximum (using case_when). Then, I drop the rows that are neither min nor max (i.e., NA in category). Then, I use summarise to turn the row names into a single character string and get the frequency of a given minimum or maximum value.
library(tidyverse)
cars %>%
  mutate(rowname = row_number()) %>%
  pivot_longer(-rowname, names_to = "column", values_to = "value") %>%
  group_by(column) %>%
  mutate(category = case_when(value == min(value) ~ "min",
                              value == max(value) ~ "max")) %>%
  drop_na(category) %>%
  group_by(column, value, category) %>%
  summarise(rowname = toString(rowname), freq.times = n()) %>%
  select(2:3, 1, 4, 5)
Output
# A tibble: 4 × 5
# Groups: column, value [4]
value category column rowname freq.times
<dbl> <chr> <chr> <chr> <int>
1 2 min dist 1 1
2 120 max dist 49 1
3 4 min speed 1, 2 2
4 25 max speed 50 1
However, if you want to produce the dataframes separately, then you could adjust to something like this. Here, I don't use category and instead use filter to drop all rows that are not the minimum for a group/column. Then, we can summarise as we did above. You can do the same thing for max as well (see the sketch after the output below).
cars %>%
  mutate(rowname = row_number()) %>%
  pivot_longer(-rowname, names_to = "column", values_to = "min.value") %>%
  group_by(column) %>%
  filter(min.value == min(min.value)) %>%
  group_by(column, min.value) %>%
  summarise(rowname = toString(rowname), freq.times = n()) %>%
  select(2, 1, 3, 4)
Output
# A tibble: 2 × 4
# Groups: column [2]
min.value column rowname freq.times
<dbl> <chr> <chr> <int>
1 2 dist 1 1
2 4 speed 1, 2 2
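And, as noted above, the max version follows the same pattern (a sketch mirroring the min pipeline, with the tidyverse already loaded):

cars %>%
  mutate(rowname = row_number()) %>%
  pivot_longer(-rowname, names_to = "column", values_to = "max.value") %>%
  group_by(column) %>%
  filter(max.value == max(max.value)) %>%
  group_by(column, max.value) %>%
  summarise(rowname = toString(rowname), freq.times = n()) %>%
  select(2, 1, 3, 4)
# max.value column rowname freq.times
#       120 dist   49               1
#        25 speed  50               1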
Here is another tidyverse approach:
which.min(.) gives the first index, whereas which(. == min(.)) gives all indices for which the condition is true. Analogously, to get the frequency we can use length(which(. == min(.))). We summarise across all columns to get min.value, rowname and freq.times; the part after that pivots to bring the column names into position.
library(tidyverse)
cars %>%
  summarise(across(dplyr::everything(), list(min.value = min,
                                             rowname = ~ list(which(. == min(.))),
                                             freq.times = ~ length(which(. == min(.)))))) %>%
  pivot_longer(
    cols = contains("_"),
    names_to = "key",
    values_to = "val",
    values_transform = list(val = as.character)
  ) %>%
  separate(key, c("column", "name"), sep = "_") %>%
  pivot_wider(
    names_from = name,
    values_from = val
  ) %>%
  mutate(rowname = str_replace(rowname, '\\:', '\\,'))
column min.value rowname freq.times
<chr> <chr> <chr> <chr>
1 speed 4 1,2 2
2 dist 2 1 1
min.value <- sapply(cars, min)
columns <- names(min.value)
row.values <- sapply(columns, \(x) which(cars[[x]] == min.value[which(names(min.value) == x)]))
freq.times <- sapply(row.values, length)
row.values <- sapply(row.values, \(x) paste(x, collapse = ","))
names(min.value) <- names(row.values) <- names(freq.times) <- NULL
data.frame(min.value = min.value,
           columns = columns,
           row.values = row.values,
           freq.times = freq.times)
min.value columns row.values freq.times
1 4 speed 1,2 2
2 2 dist 1 1
Here it is wrapped in a function, so that you can use it with whatever data frame and function you need:
create_table <- function(df, FUN) {
  values <- sapply(df, FUN)
  columns <- names(values)
  row.values <- sapply(columns, \(x) which(df[[x]] == values[which(names(values) == x)]))
  freq.times <- sapply(row.values, length)
  row.values <- sapply(row.values, \(x) paste(x, collapse = ","))
  names(values) <- names(row.values) <- names(freq.times) <- NULL
  data.frame(values = values,
             columns = columns,
             row.values = row.values,
             freq.times = freq.times)
}
create_table(cars, min)
values columns row.values freq.times
1 4 speed 1,2 2
2 2 dist 1 1
create_table(cars, max)
values columns row.values freq.times
1 25 speed 50 1
2 120 dist 49 1
You can use which to obtain the positions. sapply should work. Since you need multiple summary statistics for each column, you just have to wrap them up in a list. Something like this:
as.data.frame(sapply(cars, \(x) {
  extrema <- range(x)
  min.row <- which(x == extrema[[1L]])
  max.row <- which(x == extrema[[2L]])
  list(
    min.value = extrema[[1L]], max.value = extrema[[2L]],
    min.row = min.row, max.row = max.row,
    freq.min = length(min.row), freq.max = length(max.row)
  )
}))
Output
speed dist
min.value 4 2
max.value 25 120
min.row 1, 2 1
max.row 50 49
freq.min 2 1
freq.max 1 1

Summarizing a collection of data frames - improving upon a clumsy solution

I have a collection of data frames, df_i, representing the ith visit of a set of patients to a hospital. I'd like to summarize each of the data frames to determine the number of men, women and total patients at the ith visit. While I can solve this, my solution is clumsy. Is there a simpler way to get the final dataframe that I want? Example follows:
df_1 <- data.frame(
  ID = c(rep("A", 4), rep("B", 3), rep("C", 2), "D"),
  Dates = seq.Date(from = as.Date("2020-01-01"), to = as.Date("2020-01-10"), by = "day"),
  Sex = c(rep("Male", 4), rep("Male", 3), rep("Female", 2), "Female"),
  Weight = seq(100, 190, 10),
  Visit = rep(1, 10)
)
df_2 <- data.frame(
  ID = c(rep("A", 4), rep("B", 3), rep("C", 2)),
  Dates = seq.Date(from = as.Date("2020-02-01"), to = as.Date("2020-02-09"), by = "day"),
  Sex = c(rep("Male", 4), rep("Male", 3), rep("Female", 2)),
  Weight = seq(100, 180, 10),
  Visit = rep(2, 9)
)
df_3 <- data.frame(
  ID = c(rep("A", 4), rep("B", 3)),
  Dates = seq.Date(from = as.Date("2020-03-01"), to = as.Date("2020-03-07"), by = "day"),
  Sex = rep("Male", 7),
  Weight = seq(140, 200, 10),
  Visit = rep(3, 7)
)
I'm looking to generate the following result:
> df_sum
Visit Patients Men Women
1 1 4 2 2
2 2 3 2 1
3 3 2 2 0
I can do this in a very clumsy way. First, create a temporary data frame that summarizes the information in df_1:
df_tmp <- df_1 %>%
  group_by(ID) %>%
  filter(Dates == min(Dates)) %>%
  summarize(n = n(), Men = sum(Sex == "Male"), Women = sum(Sex == "Female"))
> df_tmp
# A tibble: 4 x 4
ID n Men Women
<chr> <int> <int> <int>
1 A 1 1 0
2 B 1 1 0
3 C 1 0 1
4 D 1 0 1
Next, sum each of the columns in df_tmp to create the first row of the summary data frame.
r1 <- c(sum(df_tmp$n), sum(df_tmp$Men), sum(df_tmp$Women))
Repeat for the second and third data frames. Finally, rbind the rows together to create the summary data frame. While this works, it is extremely clumsy, and it doesn't generalize to the case where I have a variable number of visits. Would someone kindly point me to a more elegant solution to my problem?
Many thanks in advance
Thomas Philips
You could also combine them into one tibble with bind_rows:
library(tidyverse)
bind_rows(df_1, df_2, df_3, .id = "day") %>%
  group_by(day, ID) %>%
  slice_min(Dates) %>%
  group_by(day) %>%
  summarize(n = n(), Men = sum(Sex == "Male"), Women = sum(Sex == "Female"))
Result
# A tibble: 3 x 4
day n Men Women
* <chr> <int> <int> <int>
1 1 4 2 2
2 2 3 2 1
3 3 2 2 0
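If the exact column names of the desired df_sum are wanted, the id column and the counts can be named directly in the same pipeline (a sketch of the same approach):

df_sum <- bind_rows(df_1, df_2, df_3, .id = "Visit") %>%
  group_by(Visit, ID) %>%
  slice_min(Dates) %>%
  group_by(Visit) %>%
  summarize(Patients = n(), Men = sum(Sex == "Male"), Women = sum(Sex == "Female"))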
Put the data frames in a list and iterate over them with map so that you don't have to repeat the code for each dataframe. Using janitor::adorn_totals you can add a totals row to the output and then get the data in wide format.
library(tidyverse)
list_df <- list(df_1, df_2, df_3)
map_df(list_df, ~ .x %>%
         group_by(ID) %>%
         filter(Dates == min(Dates)) %>%
         ungroup %>%
         count(Sex) %>%
         janitor::adorn_totals(name = 'Patients'), .id = 'Visit') %>%
  pivot_wider(names_from = Sex, values_from = n, values_fill = 0)
# Visit Female Male Patients
# <chr> <int> <int> <int>
#1 1 2 2 4
#2 2 1 2 3
#3 3 0 2 2
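Since the question mentions a variable number of visits, the list can also be built programmatically instead of typing out each data frame (a sketch, assuming the data frames are named df_1, df_2, ... in the global environment):

# Collect every data frame whose name matches df_<number> into a list.
# Note: ls() sorts names lexicographically, so df_10 would sort before df_2.
list_df <- mget(ls(pattern = "^df_[0-9]+$"))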

Adding a Proportion Column with Dplyr

Let's say I had the following data frame, which was then reshaped to give counts of a, b and c based on whether they are classified as Z = 0 or 1:
X <- (1:10)
Y <- c('a','b','a','c','b','b','a','a','c','c')
Z <- c(0,1,1,1,0,1,0,1,1,1)
test_df <- data.frame(X,Y,Z)
(The code below was provided by a Stack Exchange member, thank you!)
res <- test_df %>%
  group_by(Y, Z) %>%
  summarise(N = n()) %>%
  pivot_wider(names_from = Z, values_from = N,
              values_fill = 0)
How might I add a column on the right which would indicate the proportion of each letter for which Z = 1, out of all appearances of that letter? It would seem that a basic summary statement should work, but I can't figure out how...
My expected output would be something like:
  Z=0 Z=1 PropZ=1
a   2   2     .5
b   1   2    .66
c   0   3      1
Perhaps this helps
library(dplyr)
library(tidyr)
test_df %>%
  group_by(Y, Z) %>%
  summarise(N = n(), .groups = 'drop') %>%
  left_join(test_df %>%
              group_by(Y) %>%
              summarise(Prop = mean(Z == 1), .groups = 'drop')) %>%
  pivot_wider(names_from = Z, values_from = N, values_fill = 0)
Output
# A tibble: 3 x 4
# Y Prop `0` `1`
# <chr> <dbl> <int> <int>
#1 a 0.5 2 2
#2 b 0.667 1 2
#3 c 1 0 3
test_df %>%
  group_by(Y) %>%
  summarise(z0 = sum(Z == 0), z1 = sum(Z == 1), PropZ = z1 / n())
I am not sure what your expected output is, but below are some options.
u <- xtabs(q ~ Y + Z, cbind(test_df, q = 1))
> u
Z
Y 0 1
a 2 2
b 1 2
c 0 3
or
> prop.table(u)
Z
Y 0 1
a 0.2 0.2
b 0.1 0.2
c 0.0 0.3
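For the per-letter proportion of Z = 1 (as in the expected output), the proportions can also be taken row-wise over u (an addition building on the option above):

prop.table(u, margin = 1)
# row-wise proportions: a -> 0.5 0.5, b -> 0.333 0.667, c -> 0 1
# the "1" column is then the PropZ=1 value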
To calculate proportions of 1 for each letter you can use rowSums.
transform(res, prop_1 = `1`/rowSums(res[-1]))
In dplyr:
library(dplyr)
res %>%
  ungroup %>%
  mutate(prop_1 = `1`/rowSums(.[-1]))
# Y `0` `1` prop_1
# <chr> <int> <int> <dbl>
#1 a 2 2 0.5
#2 b 1 2 0.667
#3 c 0 3 1

Rename a dataframe Column with text from within the column itself

Given a (simplified) dataframe with format
df <- data.frame(a = c(1,2,3,4),
                 b = c(4,3,2,1),
                 temp1 = c("-","-","-","foo: 3"),
                 temp2 = c("-","bar: 10","-","bar: 4")
)
a b  temp1   temp2
1 4      -       -
2 3      - bar: 10
3 2      -       -
4 1 foo: 3  bar: 4
I need to rename all temp columns with the names contained within the column itself. My end goal is to end up with this:
a b foo bar
1 4   -   -
2 3   -  10
3 2   -   -
4 1   3   4
The df column names and the data contained within them will be unknown; however, the columns that need changing will contain "temp", and the delimiter will always be a ":".
As such, I can easily remove the name from within the columns using dplyr like this:
df <- df %>%
  mutate_at(vars(contains("temp")), ~(substr(., str_locate(., ":") + 1, str_length(.))))
But first I need to rename the columns based on some function that scans the column and returns the value(s) within it, i.e.
rename_at(vars(contains("temp")), ~(...some function.....))
As per the example given, there's no guarantee that specific rows will have data, so I can't simply grab the value from row 1.
Any ideas welcome.
Thanks in advance
One possibility involving dplyr and tidyr could be:
df %>%
  pivot_longer(names_to = "variables", values_to = "values", -c(a:b)) %>%
  mutate(values = replace(values, values == "-", NA_character_)) %>%
  separate(values, into = c("variables2", "values"), sep = ": ") %>%
  group_by(variables) %>%
  fill(variables2, .direction = "downup") %>%
  ungroup() %>%
  select(-variables) %>%
  pivot_wider(names_from = "variables2", values_from = "values")
a b foo bar
<dbl> <dbl> <chr> <chr>
1 1 4 <NA> <NA>
2 2 3 <NA> 10
3 3 2 <NA> <NA>
4 4 1 3 4
If you want to further replace the NAs with -:
df %>%
  pivot_longer(names_to = "variables", values_to = "values", -c(a:b)) %>%
  mutate(values = replace(values, values == "-", NA_character_)) %>%
  separate(values, into = c("variables2", "values"), sep = ": ") %>%
  group_by(variables) %>%
  fill(variables2, .direction = "downup") %>%
  ungroup() %>%
  select(-variables) %>%
  pivot_wider(names_from = "variables2", values_from = "values") %>%
  mutate_at(vars(-a, -b), ~ replace_na(., "-"))
a b foo bar
<dbl> <dbl> <chr> <chr>
1 1 4 - -
2 2 3 - 10
3 3 2 - -
4 4 1 3 4
This will do the job:
colnames(df)[which(grepl("temp", colnames(df)))] <- unique(unlist(
  sapply(df[, grepl("temp", colnames(df))],
         function(x) gsub("[:].*", "", grep("\\w+", x, value = TRUE)))
))
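If helpful, that rename can be followed by stripping the "label:" prefixes out of the cell values, much like the mutate_at step in the question (a sketch using across(); the where() predicate simply targets any column whose cells contain a ":"):

library(dplyr)

df <- df %>%
  # remove everything up to and including the first ":" (plus any whitespace);
  # cells without a ":" (e.g. "-") are left unchanged
  mutate(across(where(function(x) any(grepl(":", x))),
                function(x) sub("^[^:]*:\\s*", "", x)))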

R Filter to remove rows within map function

I am simulating events from the following data table using the map function and filtering out zero-value events.
However, I would like to filter within the map function, thereby reducing the size of the event table that gets created.
The following simulates events based on the Poisson distribution for a given mean (it includes freq = 0 rows, but to manage memory I don't want these):
library(tidyverse)
set.seed(1); n <- 10
data <- tibble(locid = seq(5), exp = 2)
event <- data %>%
  mutate(freq = map(exp, ~ rpois(n, .x))) %>%
  mutate(freq = map(freq, ~ data.frame(freq = .x, sim = seq_along(.x)))) %>%
  unnest()
I can then filter with event %>% filter(freq != 0). How can I slot this into the map function please? This will make the memory footprint a lot more manageable for my code. Thank you!
An option would be discard
library(tidyverse)
data %>%
  mutate(freq = map(exp, ~ rpois(n, .x) %>%
                      discard(. == 0) %>%
                      tibble(freq = ., sim = seq_along(.)))) %>%
  unnest
If 'sim' should be based on the original sequence, then create a tibble of the rpois output and the sequence of its elements, and do the filter within map:
data %>%
  mutate(freq = map(exp, ~ rpois(n, .x) %>%
                      tibble(freq = ., sim = seq_along(.)) %>%
                      filter(freq != 0))) %>%
  unnest
Or using mutate in between
data %>%
  mutate(freq = map(exp, ~ tibble(freq = rpois(n, .x)) %>%
                      mutate(sim = row_number()) %>%
                      filter(freq != 0))) %>%
  unnest
Here is one idea. There is no need to create a data.frame: create list-columns for freq and sim, and then unnest them.
library(tidyverse)
set.seed(1); n <- 10
data <- tibble(locid = seq(5), exp = 2)
event <- data %>%
  mutate(freq = map(exp, ~ rpois(n, .x)),
         sim = map(freq, ~ which(.x > 0)),
         freq = map(freq, ~ .x[.x > 0])) %>%
  unnest()
event
# # A tibble: 45 x 4
# locid exp freq sim
# <int> <dbl> <int> <int>
# 1 1 2 1 1
# 2 1 2 1 2
# 3 1 2 2 3
# 4 1 2 4 4
# 5 1 2 1 5
# 6 1 2 4 6
# 7 1 2 4 7
# 8 1 2 2 8
# 9 1 2 2 9
# 10 2 2 1 1
# # ... with 35 more rows
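A small version note (an assumption about the tidyr release in use): from tidyr 1.0.0 onward, unnest() warns unless the list-columns are named explicitly, so the last step above could be written with the columns spelled out:

event <- data %>%   # data and n as defined in the question above
  mutate(freq = map(exp, ~ rpois(n, .x)),
         sim = map(freq, ~ which(.x > 0)),
         freq = map(freq, ~ .x[.x > 0])) %>%
  unnest(c(freq, sim))  # name the list-columns explicitly (tidyr >= 1.0.0)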
