Consider the following data frame:
set.seed(123)
dat <- data.frame(Region = rep(c("a","b"), each=100),
State =rep(c("NY","MA","FL","GA"), each = 50),
Loc = rep(letters[1:20], each = 5),
ID = 1:200,
count1 = sample(4, 200, replace=T),
count2 = sample(4, 200, replace=T))
Individual observations are denoted with a unique ID. There are three grouping variables for the individual observations: Region, State, and Loc. Let's say that I know the following conditions to be true:
- When count1 equals 1 then count2 should equal 2
- When count1 equals 2 then count2 should equal 4
- When count1 equals 3 then count2 should equal 1
- When count1 equals 4 then count2 should equal 3
I want to answer the following types of questions:
1. How many observations that belong to each grouping variable (Region, State, Loc) are in each level of count1 and count2?
2. Which IDs are in which level of count1 and count2 (and what grouping variables do these IDs belong to)?
3. How often do the conditions outlined above hold true, and how often do they not hold true?
4. For which grouping variables and IDs do these conditions hold true, and for which do they not hold true?
5. When the conditions do not hold true, what is actually observed (e.g., when count1 equals 1 then count2 should equal 2; so when count1 equals 1 but count2 does not equal 2, then what does count2 equal instead)?
How can I specify these conditions and produce tidy summary-like tables to answer these questions?
You can think of the levels of count1 and count2 as being associated with certain characteristics, and I want to understand the relationship of those levels with each other and with the grouping variables. If anyone has any graphical visualization ideas for these types of questions, that would be very helpful as well!
Here's one way to go for questions 1 & 2 although this feels a little involved. I am using tidyr pivot_wider to create columns for each unique value of count1 and count2. The function length in values_fn counts the number of elements in the vectors created by pivot_wider for relevant combinations. As we need answers for count1 and count2 separately I run pivot_wider twice.
Results are then combined with bind_cols and superfluous columns are removed.
All of this can probably be improved on with a bit more thought.
library(dplyr)
library(tidyr)
library(tibble)
set.seed(123)
data <- data.frame(Region = rep(c("a","b"), each=100),
State =rep(c("NY","MA","FL","GA"), each = 50),
Loc = rep(letters[1:20], each = 5),
ID = 1:200,
count1 = sample(4, 200, replace=T),
count2 = sample(4, 200, replace=T))
# 1. How many observations that belong to each grouping variable (Region, State, Loc) are in each level of count1 and count2
level_count1 <- data %>%
select(-count2) %>%
pivot_wider(id_cols = c(Region, State, Loc),
values_from = count1,
values_fn = list(count1 = length),
names_from = count1,
names_prefix = "count1_")
level_count2 <- data %>%
select(-count1) %>%
pivot_wider(id_cols = c(Region, State, Loc),
values_from = count2,
values_fn = list(count2 = length),
names_from = count2,
names_prefix = "count2_")
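# drop the duplicated grouping columns before binding, so the result does not depend on how bind_cols renames duplicates
level_count <- bind_cols(level_count1, level_count2 %>% select(-Region, -State, -Loc))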
# 2. Which IDs are in which level of count1 and count2(and what grouping variables do these IDs belong to)
ID_count1 <- data %>%
select(-count2) %>%
pivot_wider(id_cols = ID,
values_from = count1,
values_fn = list(count1 = length),
names_from = count1,
names_prefix = "count1_") %>%
left_join(data %>% select(Region, State, Loc, ID), by = "ID")
ID_count2 <- data %>%
select(-count1) %>%
pivot_wider(id_cols = ID,
values_from = count2,
values_fn = list(count2 = length),
names_from = count2,
names_prefix = "count2_") %>%
left_join(data %>% select(Region, State, Loc, ID), by = "ID")
# drop the duplicated ID and grouping columns before binding
ID_count <- bind_cols(ID_count1, ID_count2 %>% select(-Region, -State, -Loc, -ID))
Results are like this:
> head(level_count)
# A tibble: 6 x 11
Region State Loc count1_3 count1_1 count1_2 count1_4 count2_4 count2_2 count2_1 count2_3
<fct> <fct> <fct> <int> <int> <int> <int> <int> <int> <int> <int>
1 a NY a 3 2 NA NA 2 1 1 1
2 a NY b 2 1 2 NA NA 2 1 2
3 a NY c NA NA 3 2 NA 2 2 1
4 a NY d 1 3 NA 1 1 3 NA 1
5 a NY e 2 3 NA NA 2 3 NA NA
6 a NY f 1 2 1 1 NA NA 2 3
The value of 3 in the first row of column count1_3 means that the combination of Region == "a", State == "NY" and Loc == "a" occurs 3 times for the value 3 for count1. Likewise, the value 2 in the second row indicates that the value 3 occurs twice in count1 for the combination Region == "a", State == "NY" and Loc == "b".
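NA values indicate that the value in question does not occur for the given combination of categorical columns; for example, the NA in count1_2 in the first row means that count1 is never 2 for Region == "a", State == "NY" and Loc == "a". And so on. Is this useful for you?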
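If zeros are easier to read than NA, one possible alternative is to count first and pivot afterwards. This is only a sketch under the assumption that the data frame from the question is available as dat; count() and the values_fill argument of pivot_wider() are not used in the answer above.

```r
library(dplyr)
library(tidyr)

# Data frame from the question
set.seed(123)
dat <- data.frame(Region = rep(c("a", "b"), each = 100),
                  State  = rep(c("NY", "MA", "FL", "GA"), each = 50),
                  Loc    = rep(letters[1:20], each = 5),
                  ID     = 1:200,
                  count1 = sample(4, 200, replace = TRUE),
                  count2 = sample(4, 200, replace = TRUE))

# One row per Region/State/Loc combination, one column per level of count1.
# Combinations that never occur are filled with 0 instead of NA.
level_count1 <- dat %>%
  count(Region, State, Loc, count1) %>%
  pivot_wider(names_from = count1, values_from = n,
              names_prefix = "count1_", values_fill = 0L)

head(level_count1)
```

The same two steps with count2 in place of count1 would give the second half of the table.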
The approach for ID is similar.
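Questions 3 to 5 are not covered above. A minimal sketch, assuming the four conditions can be written as a small lookup table (the names rules, expected, holds and checked are made up for this illustration), could look like this:

```r
library(dplyr)

# Data frame from the question
set.seed(123)
dat <- data.frame(Region = rep(c("a", "b"), each = 100),
                  State  = rep(c("NY", "MA", "FL", "GA"), each = 50),
                  Loc    = rep(letters[1:20], each = 5),
                  ID     = 1:200,
                  count1 = sample(4, 200, replace = TRUE),
                  count2 = sample(4, 200, replace = TRUE))

# The four conditions as a lookup table: expected count2 for each count1
rules <- data.frame(count1 = 1:4, expected = c(2L, 4L, 1L, 3L))

# Attach the expected value to every observation and flag whether it holds
checked <- dat %>%
  left_join(rules, by = "count1") %>%
  mutate(holds = count2 == expected)

# Question 3: how often do the conditions hold overall?
overall <- checked %>% count(holds)

# Question 4: how often do they hold within each grouping combination?
# (the individual IDs are the rows of `checked` itself)
by_group <- checked %>%
  group_by(Region, State, Loc) %>%
  summarise(n_hold = sum(holds), n_fail = sum(!holds), .groups = "drop")

# Question 5: when a condition fails, which count2 is observed instead?
mismatches <- checked %>%
  filter(!holds) %>%
  count(count1, expected, count2, name = "n_obs")

overall
by_group
mismatches
```

Keeping the conditions in a table rather than in nested if/else statements means a changed rule only requires editing one row of rules.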
I need to summarize one variable/column of a long table after aggregating (group_by()) by another variable/column. I need the summarized value for every value of the other variables/columns.
Here is test data:
library(tidyverse)
set.seed(123)
Site <- str_c("S", 1:5)
Species <- str_c("Sps", 1:6)
print(Species_tbl <- bind_cols(Species = Species,
Exotic = rbinom(length(Species), 1, .3),
Migrant = rbinom(length(Species), 2, .3)))
Data_tbl <- expand.grid(Site = Site,
Species = Species) %>%
left_join(Species_tbl)
Data_tbl$Presence <- rbinom(nrow(Data_tbl), 1, .5)
And here is my best effort:
print(Data_tbl %>%
group_by(Site) %>%
summarise(N_sp = sum(Presence),
N_sp_Exo = sum(Presence[Exotic == 1]),
N_sp_Nat = sum(Presence[Exotic == 0]),
N_sp_M0 = sum(Presence[Migrant == 0]),
N_sp_M1 = sum(Presence[Migrant == 1]),
N_sp_M2 = sum(Presence[Migrant == 2])))
You can get the data in long format for your columns of interest, c(Exotic, Migrant), and take the sum of the Presence column for each unique column name and its values. This can be merged with the sum for each Site.
library(dplyr)
library(tidyr)
data1 <- Data_tbl %>%
group_by(Site) %>%
summarise(N_sp = sum(Presence))
data2 <- Data_tbl %>%
pivot_longer(cols = c(Exotic, Migrant)) %>%
group_by(Site, name, value) %>%
summarise(result = sum(Presence), .groups = "drop") %>%
pivot_wider(names_from = c(name, value), values_from = result)
inner_join(data1, data2, by = 'Site')
# Site N_sp Exotic_0 Exotic_1 Migrant_0 Migrant_1 Migrant_2
# <fct> <int> <int> <int> <int> <int> <int>
#1 S1 4 2 2 1 2 1
#2 S2 3 2 1 0 2 1
#3 S3 2 1 1 0 2 0
#4 S4 4 2 2 1 3 0
#5 S5 4 1 3 1 2 1
The answer has been divided in two steps for ease of readability. If you would like to do this in a single chain without creating temporary variables that can be done as well.
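One way the single chain could look is sketched below. It is not the answerer's own version: the per-site total is computed with mutate() before pivoting and carried along as a grouping column, and the small Data_tbl is a hand-made stand-in for the simulated data.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for Data_tbl: 2 sites x 3 species
Data_tbl <- data.frame(
  Site     = rep(c("S1", "S2"), each = 3),
  Species  = rep(c("Sps1", "Sps2", "Sps3"), times = 2),
  Exotic   = rep(c(0, 1, 0), times = 2),
  Migrant  = rep(c(0, 1, 2), times = 2),
  Presence = c(1, 1, 0, 0, 1, 1)
)

res <- Data_tbl %>%
  group_by(Site) %>%
  mutate(N_sp = sum(Presence)) %>%            # total per site, kept on every row
  ungroup() %>%
  pivot_longer(cols = c(Exotic, Migrant)) %>%
  group_by(Site, N_sp, name, value) %>%       # N_sp is constant within Site
  summarise(result = sum(Presence), .groups = "drop") %>%
  pivot_wider(names_from = c(name, value), values_from = result,
              values_fill = 0)

res
```

Whether this reads better than the two-step version is a matter of taste; the two-step version keeps the total and the breakdown visibly separate.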
I have a data.frame with 150 column names. For each column, I want to extract the maximum and minimum values (the rows repeat) and the row names of each maximum value. I have extracted the min and max values in another data.frame but don't know how to match them.
I have found functions that are very close for this, like for minimum values:
head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
sapply(cars,which.min)
speed dist
1 1
Here, it only gives the first index for minimum speed.
And I've tried with loops like:
for (i in (colnames(cars))){
print(min(cars[[i]]))
}
[1] 4
[1] 2
But that just gives me the minimum values, and not if they are repeated and the rowname of each repeated value.
I want something like:
min.value column rowname freq.times
4 speed 1,2 2
2 dist 1 1
Thanks and sorry if I have orthography mistakes. No native speaker
One option is to use tidyverse. I was a little unclear if you want min and max in the same dataframe, so I included both. First, I create an index column with row numbers. Then, I pivot to long format to determine which values are minimum and maximum (using case_when). Then, I drop the rows that are not min or max (i.e., NA in category). Then, I use summarise to turn the row names into a single character string and get the frequency of a given minimum or maximum value.
library(tidyverse)
cars %>%
mutate(rowname = row_number()) %>%
pivot_longer(-rowname, names_to = "column", values_to = "value") %>%
group_by(column) %>%
mutate(category = case_when((value == min(value)) == TRUE ~ "min",
(value == max(value)) == TRUE ~ "max")) %>%
drop_na(category) %>%
group_by(column, value, category) %>%
summarise(rowname = toString(rowname), freq.times = n()) %>%
select(2:3, 1, 4, 5)
Output
# A tibble: 4 × 5
# Groups: column, value [4]
value category column rowname freq.times
<dbl> <chr> <chr> <chr> <int>
1 2 min dist 1 1
2 120 max dist 49 1
3 4 min speed 1, 2 2
4 25 max speed 50 1
However, if you want to produce the data frames separately, you could adjust the code to something like this. Here, I don't use category and instead use filter to drop all rows that are not the minimum for a group/column. Then, we can summarise as we did above. You can do the same thing for max as well.
cars %>%
mutate(rowname = row_number()) %>%
pivot_longer(-rowname, names_to = "column", values_to = "min.value") %>%
group_by(column) %>%
filter(min.value == min(min.value)) %>%
group_by(column, min.value) %>%
summarise(rowname = toString(rowname), freq.times = n()) %>%
select(2, 1, 3, 4)
Output
# A tibble: 2 × 4
# Groups: column [2]
min.value column rowname freq.times
<dbl> <chr> <chr> <int>
1 2 dist 1 1
2 4 speed 1, 2 2
Here is another tidyverse approach:
which.min(.) gives only the first index, whereas which(. == min(.)) gives every index for which the condition is true.
Analogously, to get the frequency we can use length(which(. == min(.))).
summarise() across all columns computes min.value, rowname and freq.times.
The pivoting steps afterwards bring the column name into position.
library(tidyverse)
cars %>%
summarise(across(dplyr::everything(), list(min.value = min,
rowname = ~list(which(. == min(.))),
freq.times = ~length(which(.==min(.)))))) %>%
pivot_longer(
cols = contains("_"),
names_to = "key",
values_to = "val",
values_transform = list(val = as.character)
) %>%
separate(key, c("column", "name"), sep="_") %>%
pivot_wider(
names_from = name,
values_from = val
) %>%
mutate(rowname = str_replace(rowname, '\\:', '\\,'))
column min.value rowname freq.times
<chr> <chr> <chr> <chr>
1 speed 4 1,2 2
2 dist 2 1 1
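The difference between the two lookups can be checked directly on cars$speed, whose two smallest values are tied in rows 1 and 2 (a small illustration using only base R and the built-in cars data):

```r
x <- cars$speed

first_min <- which.min(x)        # position of the first minimum only
all_min   <- which(x == min(x))  # positions of every tied minimum
n_min     <- length(all_min)     # how often the minimum occurs

first_min  # 1
all_min    # 1 2
n_min      # 2
```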
min.value <- sapply(cars, min)
columns <- names(min.value)
row.values <- sapply(columns, \(x) which(cars[[x]] == min.value[which(names(min.value) == x)]))
freq.times <- sapply(row.values, length)
row.values <- sapply(row.values, \(x) paste(x, collapse = ","))
names(min.value) <- names(row.values) <- names(freq.times) <- NULL
data.frame(min.value = min.value,
columns = columns,
row.values = row.values,
freq.times = freq.times)
min.value columns row.values freq.times
1 4 speed 1,2 2
2 2 dist 1 1
Here it is wrapped in a function, so that you can use it with whatever data frame and function you need:
create_table <- function(df, FUN) {
values <- sapply(df, FUN)
columns <- names(values)
row.values <- sapply(columns, \(x) which(df[[x]] == values[which(names(values) == x)]))
freq.times <- sapply(row.values, length)
row.values <- sapply(row.values, \(x) paste(x, collapse = ","))
names(values) <- names(row.values) <- names(freq.times) <- NULL
data.frame(values = values,
columns = columns,
row.values = row.values,
freq.times = freq.times)
}
create_table(cars, min)
values columns row.values freq.times
1 4 speed 1,2 2
2 2 dist 1 1
create_table(cars, max)
values columns row.values freq.times
1 25 speed 50 1
2 120 dist 49 1
You can use which to obtain the positions. sapply should work. Since you need multiple summary statistics for each column, you just have to wrap them up in a list. Something like this:
as.data.frame(sapply(cars, \(x) {
extrema <- range(x)
min.row <- which(x == extrema[[1L]])
max.row <- which(x == extrema[[2L]])
list(
min.value = extrema[[1L]], max.value = extrema[[2L]],
min.row = min.row, max.row = max.row,
freq.min = length(min.row), freq.max = length(max.row)
)
}))
Output
speed dist
min.value 4 2
max.value 25 120
min.row 1, 2 1
max.row 50 49
freq.min 2 1
freq.max 1 1
I have a collection of data frames, df_i, representing the ith visit of a set of patients to a hospital. I'd like to summarize each of the data frames to determine the number of men, women and total patients at the ith visit. While I can solve this, my solution is clumsy. Is there a simpler way to get the final dataframe that I want? Example follows:
df_1 <- data.frame(
ID = c(rep("A",4), rep("B",3), rep("C",2), "D"),
Dates = seq.Date(from = as.Date("2020-01-01"), to = as.Date("2020-01-10"), by = "day"),
Sex = c(rep("Male",4), rep("Male",3), rep("Female",2), "Female"),
Weight = seq(100, 190, 10),
Visit = rep(1, 10)
)
df_2 <- data.frame(
ID = c(rep("A",4), rep("B",3), rep("C",2)),
Dates = seq.Date(from = as.Date("2020-02-01"), to = as.Date("2020-02-9"), by = "day"),
Sex = c(rep("Male",4), rep("Male",3), rep("Female",2)),
Weight = seq(100, 180, 10),
Visit = rep(2, 9)
)
df_3 <- data.frame(
ID = c(rep("A",4), rep("B",3)),
Dates = seq.Date(from = as.Date("2020-03-01"), to = as.Date("2020-03-07"), by = "day"),
Sex = rep("Male",7),
Weight = seq(140, 200, 10),
Visit = rep(3, 7)
)
I'm looking to generate the following result:
> df_sum
Visit Patients Men Women
1 1 4 2 2
2 2 3 2 1
3 3 2 2 0
I can do this in a very clumsy way: First create a temporary data frame that summarizes the information in df_1
df_tmp <- df_1 %>%
group_by(ID) %>%
filter(Dates == min(Dates)) %>%
summarize(n = n(), Men = sum(Sex == "Male"), Women = sum(Sex == "Female"))
> df_tmp
# A tibble: 4 x 4
ID n Men Women
<chr> <int> <int> <int>
1 A 1 1 0
2 B 1 1 0
3 C 1 0 1
4 D 1 0 1
Next, sum each of the columns in df_tmp to create the first row for the summary column.
r1 <- c(sum(df_tmp$n), sum(df_tmp$Men), sum(df_tmp$Women))
Repeat for the second and third data frames. Finally, rbind the rows together to create the summary data frame. While this works, it is extremely clumsy, and doesn't generalize to the case when I have a variable number of visits. Would someone kindly point me to a more elegant solution to my problem?
Many thanks in advance
Thomas Philips
Could also make into a tibble with bind_rows:
library(tidyverse)
bind_rows(df_1, df_2, df_3, .id = "day") %>%
group_by(day, ID) %>%
slice_min(Dates) %>%
group_by(day) %>%
summarize(n = n(), Men = sum(Sex == "Male"), Women = sum(Sex == "Female"))
Result
# A tibble: 3 x 4
day n Men Women
* <chr> <int> <int> <int>
1 1 4 2 2
2 2 3 2 1
3 3 2 2 0
Put the data in a list and iterate over them through map so that you don't have to repeat the code for each dataframe. Using janitor::adorn_totals you can add a new row in the output with the total and get the data in wide format.
library(tidyverse)
list_df <- list(df_1, df_2, df_3)
map_df(list_df, ~.x %>%
group_by(ID) %>%
filter(Dates == min(Dates)) %>%
ungroup %>%
count(Sex) %>%
janitor::adorn_totals(name = 'Patients'), .id = 'Visit') %>%
pivot_wider(names_from = Sex, values_from = n, values_fill = 0)
# Visit Female Male Patients
# <chr> <int> <int> <int>
#1 1 2 2 4
#2 2 1 2 3
#3 3 0 2 2
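For a variable number of visits, a further sketch is to keep the visit data frames in a list and reduce each patient to one row with distinct(). This assumes that Sex is constant for a given ID within a visit; the toy data frames below are hypothetical, cut-down versions of df_1 to df_3.

```r
library(dplyr)

# Hypothetical toy visits: several rows per patient, as in the question
df_1 <- data.frame(ID  = c("A", "A", "B", "C", "D"),
                   Sex = c("Male", "Male", "Male", "Female", "Female"))
df_2 <- data.frame(ID  = c("A", "B", "B", "C"),
                   Sex = c("Male", "Male", "Male", "Female"))
df_3 <- data.frame(ID  = c("A", "B"),
                   Sex = c("Male", "Male"))

# Any number of visits can go into this list
visits <- list(df_1, df_2, df_3)

df_sum <- bind_rows(visits, .id = "Visit") %>%
  distinct(Visit, ID, Sex) %>%              # one row per patient per visit
  group_by(Visit) %>%
  summarise(Patients = n(),
            Men      = sum(Sex == "Male"),
            Women    = sum(Sex == "Female"))

df_sum
```

Because the list can be built programmatically, nothing in the pipeline needs to change when a fourth or fifth visit is added.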
I'm working on a dataset where every participant (ID) was evaluated 1, 2 or 3 times. It's a longitudinal study. Unfortunately, when the first analyst coded the dataset, she/he did not record which evaluation each row belongs to.
Because all participants have age information (in months), it's easy to identify which evaluation was the first, which was the second, and so on: at the first evaluation the participant was younger than at the second, and so on.
I used tidyverse tools to deal with that and everything is working. However, I am sure there are many other, much more elegant solutions, and I came to this forum to ask for them. Could someone give me some thoughts on how to make this code shorter and clearer?
This is a fake data to reproduce the code:
ds <- data.frame(id = seq(1:6),
months = round(rnorm(18, mean=12, sd=2),0),
x1 = sample(0:2),
x2 = sample(0:2),
x3 = sample(0:2),
x4 = sample(0:2))
#add how many times each child was assessed
ds <- ds %>% group_by(id) %>% mutate(how_many = n())
#Add position
ds %>% group_by(id) %>%
mutate(first = min(months),
max = max(months),
med = median(months)) -> ds
#add labels to the first and third evaluations (the second stays NA for now)
ds %>%
mutate(group = case_when((how_many == 3) & (months %in% first) ~ "First evaluation",
(how_many == 3) & (months %in% max) ~ "Third evaluation",
TRUE ~ NA_character_)) -> ds
#add label to the second evaluation for all rows that are still unlabelled
ds %>% mutate(group = if_else(is.na(group), "Second Evaluation", group)) -> ds
This is my original code:
temp <- dataset %>% select(idind, arm, infant_sex,infant_age_months)
#add how many times each child was assessed
temp <- temp %>% group_by(idind) %>% mutate(how_many = n())
#Add position
temp %>% group_by(idind) %>%
mutate(first = min(infant_age_months),
max = max(infant_age_months),
med = median(infant_age_months)) -> temp
#add label to the first evaluation
temp %>%
mutate(group = case_when(how_many == 1 ~ "First evaluation")) -> temp
#add label to the second evaluation (and keep all previous results)
temp %>%
mutate(group = case_when((how_many == 2) & (infant_age_months %in% first) ~ "First evaluation",
(how_many == 2) & (infant_age_months %in% max) ~ "Second evaluation",
TRUE ~ group)) -> temp
#add label to the third evaluation (the second will be missing)
temp %>%
mutate(group = case_when((how_many == 3) & (infant_age_months %in% first) ~ "First evaluation",
(how_many == 3) & (infant_age_months %in% max) ~ "Third evaluation",
TRUE ~ group)) -> temp
#add label to the second evaluation for all children evaluated two times
temp %>% mutate_at(vars(group), funs(if_else(is.na(.),"Second Evaluation",.))) -> temp
Please keep in mind that I used the search box before asking, and I imagine other people run into the same question when programming.
Thanks much
There you go. I used rank() to give the order of the treatments.
ds <- data.frame(id = seq(1:6),
months = round(rnorm(18, mean=12, sd=2),0),
x1 = sample(0:2),
x2 = sample(0:2),
x3 = sample(0:2),
x4 = sample(0:2))
ds2 = ds %>% group_by(id) %>% mutate(rank = rank(months,ties.method="first"))
labels = c("First", "Second","Third")
ds2$labels = labels[ds2$rank]
Or just arrange by age and use 1:n() instead of n(), which creates a sequence:
ds <- ds %>% group_by(id) %>% arrange(months) %>% mutate(how_many = 1:n())
ds %>% arrange(id, months)
# A tibble: 18 x 7
# Groups: id [6]
id months x1 x2 x3 x4 how_many
<int> <dbl> <int> <int> <int> <int> <int>
1 1 10 1 2 0 1 1
2 1 11 1 2 0 1 2
3 1 12 1 2 0 1 3
4 2 11 0 1 2 2 1
5 2 14 0 1 2 2 2
6 2 14 0 1 2 2 3
You can then use factor to attach a label, if you wish.
ds$label <- factor(ds$how_many, levels = 1:3, labels = c("First", "Second","Third"))
head(ds)
# A tibble: 18 x 8
# Groups: id [6]
id months x1 x2 x3 x4 how_many label
<int> <dbl> <int> <int> <int> <int> <int> <fct>
1 1 10 1 2 0 1 1 First
2 1 11 1 2 0 1 2 Second
3 1 12 1 2 0 1 3 Third
4 2 11 0 1 2 2 1 First
5 2 14 0 1 2 2 2 Second
6 2 14 0 1 2 2 3 Third
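Both ideas can also be combined in a single mutate(). The sketch below uses dplyr's row_number(), which ranks within each group and breaks ties by order of appearance; the toy ds and the column names evaluation and label are assumptions made for this illustration.

```r
library(dplyr)

# Hypothetical toy data: child 1 was seen three times, child 2 twice
ds <- data.frame(id     = c(1, 1, 1, 2, 2),
                 months = c(12, 10, 11, 14, 14))

labels <- c("First", "Second", "Third")

ds_labelled <- ds %>%
  group_by(id) %>%
  mutate(evaluation = row_number(months),   # 1 = youngest age within the child
         label      = labels[evaluation]) %>%
  ungroup()

ds_labelled
```

The original row order is kept, so no arrange() is needed before or after labelling.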
I am struggling with a collapse of my data.
Basically my data consists of multiple indicators with multiple observations for each year. I want to convert this to one observation for each indicator for each country.
I have a rank indicator which specifies the order in which the observations have to be chosen.
Basically the observation with the lowest rank (thus 1 rather than 2) has to be chosen, as long as the value for that rank is not NA.
An additional question: The years in my dataset vary over time, thus is there a way to make the code dynamic in the sense that it applies the code to all column names from 1990 to 2025 when they exist?
df <- data.frame(country.code = c(1,1,1,1,1,1,1,1,1,1,1,1),
id = as.factor(c("GDP", "GDP", "GDP", "GDP", "CA", "CA", "CA", "GR", "GR", "GR", "GR", "GR")),
`1999` = c(NA,NA,NA, 1000,NA,NA, 100,NA,NA, NA,NA,22),
`2000` = c(NA,NA,1, 2,NA,1, 2,NA,1000, 12,13,2),
`2001` = c(3,100,1, 3,100,20, 1,1,44, 65,NA,NA),
rank = c(1, 2 , 3 , 4 , 1, 2, 3, 1, 3, 2, 4, 5))
The result should be the following dataset:
result <- data.frame(country.code = c(1, 1, 1),
id = as.factor(c("GDP", "CA", "GR")),
`1999`= c(1000, 100, 22),
`2000`= c(1, 1, 12),
`2001`= c(3, 100, 1))
I attempted the following solution (but this does not work given the NAs in the data, and I would have to specify each column):
test <- df %>% group_by(country.code, id) %>%
summarise(test1999 = `1999`[which.min(rank)])
I don't see how I can tell R to omit the cases of the column 1999 that are NA.
We can subset using the minimum rank among the non-NA values of a column, e.g. x[rank==min(rank[!is.na(x)])].
An additional question: The years in my dataset vary over time,....
Using summarise_at, vars and matches can be used to select any column name with 4 digits i.e. 1990-2025 using a regular expression [0-9]{4} (which means search for a digit "0-9" repeated exactly 4 times) and apply the above procedure to them using funs
library(dplyr)
df %>% group_by(country.code,id) %>%
summarise(`1999` = `1999`[rank==ifelse(all(is.na(`1999`)),1, min(rank[!is.na(`1999`)]))])
df %>% group_by(country.code,id) %>%
summarise_at(vars(matches("[0-9]{4}")),funs(.[rank==ifelse(all(is.na(.)), 1, min(rank[!is.na(.)]))]))
# A tibble: 3 x 5
# Groups: country.code [?]
country.code id `1999` `2000` `2001`
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 CA 100 1 100
2 1 GDP 1000 1 3
3 1 GR 22 12 1
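With more recent dplyr versions the same idea can be written with across() instead of summarise_at() and funs(). The following is only a sketch: first_non_na is a hypothetical helper introduced here, and the data frame is built with check.names = FALSE so that the year columns keep their numeric names.

```r
library(dplyr)

# Data from the question; check.names = FALSE keeps the names 1999, 2000, 2001
df <- data.frame(country.code = rep(1, 12),
                 id = as.factor(c("GDP", "GDP", "GDP", "GDP", "CA", "CA", "CA",
                                  "GR", "GR", "GR", "GR", "GR")),
                 `1999` = c(NA, NA, NA, 1000, NA, NA, 100, NA, NA, NA, NA, 22),
                 `2000` = c(NA, NA, 1, 2, NA, 1, 2, NA, 1000, 12, 13, 2),
                 `2001` = c(3, 100, 1, 3, 100, 20, 1, 1, 44, 65, NA, NA),
                 rank = c(1, 2, 3, 4, 1, 2, 3, 1, 3, 2, 4, 5),
                 check.names = FALSE)

# Hypothetical helper: value with the lowest rank among the non-NA values
# (returns NA if every value is NA)
first_non_na <- function(x, rank) {
  x <- x[order(rank)]
  x[!is.na(x)][1]
}

res <- df %>%
  group_by(country.code, id) %>%
  summarise(across(matches("^[0-9]{4}$"), ~ first_non_na(.x, rank)),
            .groups = "drop")

res
```

The regular expression is anchored with ^ and $ so that only column names consisting of exactly four digits are selected.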
Here is one option that uses tidyr::fill to replace the NAs by the first non-NA value after we arranged the data by id and rank. It might not be the most efficient approach because we first gather and then spread the data again.
library(tidyverse)
df %>%
arrange(id, rank) %>%
gather(key, value, X1999:X2001) %>%
tidyr::fill(value, .direction = "up") %>%
spread(key, value) %>%
group_by(id) %>%
slice(1) %>%
ungroup()
# A tibble: 3 x 6
# country.code id rank X1999 X2000 X2001
# <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
#1 1 CA 1 100 1 100
#2 1 GDP 1 1000 1 3
#3 1 GR 1 22 12 1
NOTE: the column names are probably not 1999, 2000 etc. as in your data, but that is easily adaptable.
You can change the data frame to long form, remove the NAs, select the values corresponding to the minimum rank, and spread back to wide form.
library(dplyr)
library(tidyr)
test <- df %>%
gather("Year", "Value", X1999:X2001) %>%
filter(!is.na(Value))%>%
group_by(country.code, id, Year) %>%
arrange(rank)%>%
summarise(first(Value)) %>%
spread(Year, `first(Value)`)
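gather() and spread() have been superseded in tidyr, so the same long-then-wide approach might be sketched with pivot_longer() and pivot_wider(). This assumes the year columns keep their numeric names (hence check.names = FALSE) and uses slice_min() to pick the lowest remaining rank.

```r
library(dplyr)
library(tidyr)

# Data from the question; check.names = FALSE keeps the names 1999, 2000, 2001
df <- data.frame(country.code = rep(1, 12),
                 id = as.factor(c("GDP", "GDP", "GDP", "GDP", "CA", "CA", "CA",
                                  "GR", "GR", "GR", "GR", "GR")),
                 `1999` = c(NA, NA, NA, 1000, NA, NA, 100, NA, NA, NA, NA, 22),
                 `2000` = c(NA, NA, 1, 2, NA, 1, 2, NA, 1000, 12, 13, 2),
                 `2001` = c(3, 100, 1, 3, 100, 20, 1, 1, 44, 65, NA, NA),
                 rank = c(1, 2, 3, 4, 1, 2, 3, 1, 3, 2, 4, 5),
                 check.names = FALSE)

res <- df %>%
  pivot_longer(matches("^[0-9]{4}$"), names_to = "Year", values_to = "Value",
               values_drop_na = TRUE) %>%              # NAs are dropped here
  group_by(country.code, id, Year) %>%
  slice_min(rank, n = 1, with_ties = FALSE) %>%        # lowest remaining rank
  ungroup() %>%
  select(-rank) %>%
  pivot_wider(names_from = Year, values_from = Value)

res
```

If an indicator has no non-NA value at all for a year, that cell simply comes back as NA after pivot_wider().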