Missing combinations when collapsing a data.frame [because 0 occurrences] - r

Let's suppose we have a big data.frame named df with four variables:
Gender: which can be M or F (2 possible answers)
Hair: which can be "black", "brown", "blond", "red", "other" (5 possible values)
Sport: which can be "yes" or "no" (2 different values)
Value: always 1 in order to count the number of events
When I use the collap function from the collapse package, I run the following code:
collap (df, ~ Gender + Hair + Sport, FUN = sum, cols ="Value")
What I expect is a data.frame with 20 different rows (one per combination); however, if a combination has no occurrences, its row does not appear.
Do you know how I can get all the possible combinations, with a 0 where there are no events with the required values?

You can complete unused factor levels like this, which yields a row for females even though every row in the data is male:
library(tidyverse)
library(collapse)
#> collapse 1.7.6, see ?`collapse-package` or ?`collapse-documentation`
#>
#> Attaching package: 'collapse'
#> The following object is masked from 'package:stats':
#>
#> D
data <- tribble(
~Gender, ~Hair, ~Value,
"M", "black", 1
)
data %>%
mutate(Gender = Gender %>% factor(levels = c("M", "F"))) %>%
complete(Gender, fill = list(Value = 0)) %>%
collap(~ Gender + Hair, FUN = sum, cols = "Value")
#> # A tibble: 2 × 3
#> Gender Hair Value
#> <fct> <chr> <dbl>
#> 1 M black 1
#> 2 F <NA> 0
Created on 2022-05-03 by the reprex package (v2.0.0)

This is the answer to my question, based on the response by @danloo:
df %>%
complete(Gender, Hair, Sport) %>%
collap( ~Gender + Hair + Sport, FUN = sum, cols = "Value")
Running that, I get a data.frame with 20 different rows, where NA is placed for those combinations with no events.
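If zeros are wanted instead of NA, as the question originally asked, one possible sketch is to pass fill = list(Value = 0) to complete(), as in the first answer. The toy data below is hypothetical, and the columns are converted to factors with all levels declared, on the assumption that some values (for example a hair colour) may never occur in the data at all:

```r
library(tidyr)
library(collapse)

# Hypothetical toy data: only 2 of the 20 possible combinations occur
df <- data.frame(
  Gender = c("M", "M", "F"),
  Hair   = c("black", "black", "red"),
  Sport  = c("yes", "yes", "no"),
  Value  = 1
)

# Declare every possible level so complete() can also create
# combinations whose values never appear in the data
df$Gender <- factor(df$Gender, levels = c("M", "F"))
df$Hair   <- factor(df$Hair, levels = c("black", "brown", "blond", "red", "other"))
df$Sport  <- factor(df$Sport, levels = c("yes", "no"))

counts <- df |>
  complete(Gender, Hair, Sport, fill = list(Value = 0)) |>  # missing rows get Value = 0
  collap(~ Gender + Hair + Sport, FUN = sum, cols = "Value")

nrow(counts)  # one row per combination
```

Without the factor conversion, complete() can only combine values that are actually present in each column, so the result may have fewer than 20 rows.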

Related

Creating a grouped boxplot with different numbers of rows for each grouped column?

I have data that I would like to compare in a grouped boxplot, meaning comparing the before/after response to each treatment. The issue is my trial number for each type of treatment is different so I cannot create a dataframe (I am getting an error in the dataframe)
QXpre <- c(3,4,2,1,4,5,4,2,8)
QXpost <- c(0,4,0,0,0,7,0,1,6)
lidopre <-c(5,3,4,5,6)
lidopost <- c(0,0,0,1,2)
vehipre <- c(3,3,5,3,4,3,4)
vehipost <- c(4,3,3,12,6,4,10)
DF1D <- data.frame(QXpre, QXpost, lidopre, lidopost, vehipre, vehipost)
To clarify, I would like: within each group to compare the pre and post values, but have each group show up on the same plot so I can compare statistics across groups.
Thank you!
Instead of putting all vectors in one dataframe create a list of data frames per treatment. Afterwards reshape each one to long or tidy format using e.g. tidyr::pivot_longer and bind them by rows for which I use purrr::imap_dfr for convenience:
library(tidyverse)
dat <- list(
QX = data.frame(QXpre, QXpost),
lido = data.frame(lidopre, lidopost),
vehi = data.frame(vehipre, vehipost)
) |>
purrr::imap_dfr(~ tidyr::pivot_longer(.x, everything(), names_prefix = .y), .id = "treatment")
head(dat)
#> # A tibble: 6 × 3
#> treatment name value
#> <chr> <chr> <dbl>
#> 1 QX pre 3
#> 2 QX post 0
#> 3 QX pre 4
#> 4 QX post 4
#> 5 QX pre 2
#> 6 QX post 0
dat$name <- factor(dat$name, levels = c("pre", "post"))
ggplot(dat, aes(treatment, value, fill = name)) +
geom_boxplot()
Just to offer another solution: you can create a named list of all your vectors and then use stack() to create a data.frame in long format. Afterwards you can use strsplit() to create two variables for your groups and timepoints. The rest is the same as in stefan's answer.
library(ggplot2)
vector.list = list(
QXpre = c(3,4,2,1,4,5,4,2,8),
QXpost = c(0,4,0,0,0,7,0,1,6),
lidopre =c(5,3,4,5,6),
lidopost = c(0,0,0,1,2),
vehipre = c(3,3,5,3,4,3,4),
vehipost = c(4,3,3,12,6,4,10)
)
df <- stack(vector.list) # creates a data.frame in long format
df[, c("group", "time")] <- do.call(rbind, strsplit(as.character(df$ind), "(?<=.)(?=pre|post)", perl = TRUE)) # splits the names into two variables
df$time <- factor(df$time, levels = c("pre", "post")) # set the order of pre and post
ggplot(df, aes(group, values, fill = time)) +
geom_boxplot()
Created on 2023-02-16 by the reprex package (v2.0.1)

Assigning new field values based on ifelse logic with lag/lead function in R

Have seen several posts on this, but can't seem to get it to work for my specific use case.
I'm trying to assign a new field value based on ifelse logic. My input dataset looks like:
If the value for X is missing, I am trying to replace it with the previous value of X, only when the value of unique_id is the same as the previous value of unique_id. I would like the output dataset to look like this:
The code I've written (I'm a total beginner) doesn't throw an error, but the data doesn't change:
within(data3, data3$Output <- ifelse(data3$unique_id == lag(data3$unique_id) & is.na(data3$Output), data3$Output == lag(data3$Output), data3$Output == data3$Output))
I do change missing data values ("-") in the input dataset to official NA missing values in a previous step... hopefully allowing me to use the is.na function.
data.table option where you replace the NA with the non-NA value per group:
df <- data.frame(unique_id = c("m", "m"),
X = c(73500, NA),
MoM = c("4%", "0%"))
library(data.table)
setDT(df)
df[, X := X[!is.na(X)][1L], by = unique_id]
df
#> unique_id X MoM
#> 1: m 73500 4%
#> 2: m 73500 0%
Created on 2022-07-09 by the reprex package (v2.0.1)
In addition to the provided solutions, you could use one of these:
fill()
suggested by @jared_marot in the comments
library(dplyr)
library(tidyr)
df %>%
fill(X)
first()
library(dplyr)
df %>%
group_by(unique_id) %>%
mutate(X = first(X))
lag()
library(dplyr)
df %>%
group_by(unique_id) %>%
mutate(X = lag(X, default = X[1]))
base R
df[2,2] <- df[1,2]
You could group the IDs, then use fill to copy down the values replacing NAs by group. See the reproducible example below.
(If you have NAs which could appear before or after the value, then you could add .direction = "downup" to the fill() call.)
library(tidyverse)
# Sample data
df <- tribble(
~unique_id, ~x, ~mom,
"m", 73500, 4,
"m", NA, 0,
"z", 4000, 5,
"z", NA, 0,
)
df2 <- df |>
group_by(unique_id) |>
fill(x, .direction = "downup") |>
ungroup()
#> # A tibble: 4 × 3
#> unique_id x mom
#> <chr> <dbl> <dbl>
#> 1 m 73500 4
#> 2 m 73500 0
#> 3 z 4000 5
#> 4 z 4000 0
Created on 2022-07-09 by the reprex package (v2.0.1)
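For reference, the ifelse/lag attempt from the question can also be made to work. Two things stand out in the original line: the result of within() is never assigned back, and the branches use == (a comparison) where a value is expected. A minimal sketch, assuming dplyr is available and using hypothetical data with the column names from the question:

```r
library(dplyr)

# Hypothetical data; Output is the column with missing values
data3 <- data.frame(
  unique_id = c("m", "m", "z", "z"),
  Output    = c(73500, NA, 4000, NA)
)

data3 <- data3 %>%
  mutate(
    Output = ifelse(
      is.na(Output) & unique_id == lag(unique_id),  # missing, and same id as the row above
      lag(Output),                                  # take the previous value
      Output                                        # otherwise keep the value as is
    )
  )

data3
```

This only looks one row back, so a run of several consecutive NAs within an id is filled only in its first row; the grouped fill() shown above handles that case.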

How to merge duplicate rows in R

I am new to R and very stuck on a problem which I've tried to solve in various ways.
I have data I want to plot to a graph that shows twitter engagements per day.
To do this, I need to merge all the 'created at' rows, so there is only one date per row, and each date has the 'total engagements' assigned to it.
This is the data:
So far, I've tried to do this, but can't seem to get the grouping to work.
I mutated the data to get a new 'total engage' column:
lgbthm_data_2 <- lgbthm_data %>%
mutate(
total_engage = favorite_count + retweet_count
)
Then I've tried to merge the dates:
only_one_date <- lgbthm_data_2 %>%
group_by(created_at) %>%
summarise_all(na.omit)
But no idea!
Any help would be great
Thanks
You are looking for:
library(dplyr)
only_one_date <- lgbthm_data_2 %>%
group_by(created_at) %>%
summarise(n = n())
And there is even a shorthand for this in dplyr:
only_one_date <- lgbthm_data_2 %>%
count(created_at)
group_by + summarise can be used for many things that involve summarising all values in a group to one value, for example the mean, max and min of a column. Here I think you simply want to know how many rows each group has, i.e., how many tweets were created in one day. The special function n() tells you exactly that.
From experience with Twitter, I also know that the column created_at is usually a time, not a date format. In this case, it makes sense to use count(day = as.Date(created_at)) to convert it to a date first.
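As a sketch of that date conversion, the same as.Date() call can be used inside group_by() so that the tweet count and the engagement total are computed per day in one step. The data below is hypothetical, and the names tweets, n_tweets and per_day are made up for illustration:

```r
library(dplyr)

# Hypothetical tweets; created_at is a date-time rather than a date
tweets <- data.frame(
  created_at = as.POSIXct(
    c("2022-02-01 09:15:00", "2022-02-01 18:40:00", "2022-02-02 07:05:00"),
    tz = "UTC"
  ),
  favorite_count = c(0, 1, 2),
  retweet_count  = c(2, 3, 1)
)

per_day <- tweets %>%
  group_by(day = as.Date(created_at)) %>%   # drop the time of day
  summarise(
    n_tweets     = n(),                                   # tweets per day
    total_engage = sum(favorite_count + retweet_count)    # engagements per day
  )

per_day
```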
library(tidyverse)
data <- tribble(
~created_at, ~favorite_count, ~retweet_count,
"2022-02-01", 0, 2,
"2022-02-01", 1, 3,
"2022-02-02", 2, NA
)
summary_data <-
data %>%
type_convert() %>%
group_by(created_at) %>%
summarise(total_engage = sum(favorite_count, retweet_count, na.rm = TRUE))
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> created_at = col_date(format = "")
#> )
summary_data
#> # A tibble: 2 × 2
#> created_at total_engage
#> <date> <dbl>
#> 1 2022-02-01 6
#> 2 2022-02-02 2
qplot(created_at, total_engage, geom = "col", data = summary_data)
Created on 2022-04-04 by the reprex package (v2.0.0)

Adding values from lookup-table based on condition to data frame in R

I've got a data frame containing data of participants who rated images (column image_index):
Now I want to add a new column with gender specific values of the rated image from a another dataframe.
Look-up table of image data:
Final data frame:
How can I accomplish this task?
Sample data:
library(tidyverse)
participants_data <- data.frame(
ID = c(1,2,3,4),
gender = c('f','m','d','f'),
image_index = c(19,2,2,19)
)
lookup_data <- data.frame(
index = c(2,19),
male = c(100,110),
female = c(150,125),
diverse = c(130, 90)
)
complete_dataset <- data.frame(
ID = c(1,2,3,4),
gender = c('f','m','d','f'),
image_index = c(19,2,2,19),
external_value = c(125,100,130,125)
)
You need to make a few manipulations on your data to join them together.
Pivot lookup_data longer with tidyr::pivot_longer() so the gender info is in a column to help merge on.
Use dplyr::rename() to make sure the column names are the same between the two tables.
Transform the gender column so it is just 1 letter to match the other table. Here I use stringr::str_sub(x, 1,1) which just takes the first character of a string.
Then I use left_join() to merge. Because the joining column names are already the same I don't need to specify.
Finally I just reorder and sort the data to match your expected output.
library(tidyverse)
participants_data <- data.frame(
ID = c(1,2,3,4),
gender = c('f','m','d','f'),
image_index = c(19,2,2,19)
)
lookup_data <- data.frame(
index = c(2,19),
male = c(100,110),
female = c(150,125),
diverse = c(130, 90)
)
lookup_data %>%
pivot_longer(-index, names_to = "gender", values_to = "external_value") %>%
rename(image_index = index) %>%
mutate(gender = str_sub(gender, 1, 1)) %>%
left_join(., participants_data) %>%
drop_na(ID) %>%
select(ID, gender, image_index, external_value) %>%
arrange(ID)
#> Joining, by = c("image_index", "gender")
#> # A tibble: 4 x 4
#> ID gender image_index external_value
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 f 19 125
#> 2 2 m 2 100
#> 3 3 d 2 130
#> 4 4 f 19 125
Created on 2022-02-18 by the reprex package (v2.0.1)

Collapse data frame, by group, using lists of variables for weighted average AND sum

I want to collapse the following data frame, using both summation and weighted averages, according to groups.
I have the following data frame
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse = data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
I want to collapse my data according to the groups identified by group_id. However, in my data, I have variables in absolute levels (var_1, var_2) and in percentage terms (var_percent_1, var_percent_2).
I create two lists for each type of variable (my real data is much bigger, making this necessary). I also have a weighting variable (weighting).
to_be_weighted =df_to_collapse[, 4:5]
to_be_summed = df_to_collapse[,2:3]
to_be_weighted_2=colnames(to_be_weighted)
to_be_summed_2=colnames(to_be_summed)
And my goal is to simultaneously collapse my data using either sum or weighted average, according to the type of variable (i.e. if it's in percentage terms, I use weighted average).
Here is my best attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_summed_2,to_be_weighted_2), .funs=c(sum, mean))
But, as you can see, it is not a weighted average.
I have tried many different ways of using the weighted.mean function, but have had no luck. Here is an example of one such attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_weighted_2,to_be_summed_2), .funs=c(weighted.mean(to_be_weighted_2, weighting), sum))
And the corresponding error:
Error in weighted.mean.default(to_be_weighted_2, weighting) :
'x' and 'w' must have the same length
Here's a way to do it by reshaping into long data, adding a dummy variable called type for whether it's a percentage (optional, but handy), applying a function in summarise based on whether it's a percentage, then spreading back to wide shape. If you can change column names, you could come up with a more elegant way of doing the type column, but that's really more for convenience.
The trick for me was the type[1] == "percent"; I had to use [1] because everything in each group has the same type, but otherwise == operates over every value in the vector and gives multiple logical values, when you really just need 1.
library(tidyverse)
set.seed(1234)
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse <- data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
df_to_collapse %>%
gather(key = var, value = value, -group_id, -weighting) %>%
mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
group_by(group_id, var) %>%
summarise(sum_or_avg = ifelse(type[1] == "percent", weighted.mean(value, weighting), sum(value))) %>%
ungroup() %>%
spread(key = var, value = sum_or_avg)
#> # A tibble: 3 x 5
#> group_id var_1 var_2 var_percent_1 var_percent_2
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 26 31 0.269 0.483
#> 2 2 32 21 0.854 0.261
#> 3 3 29 49 0.461 0.262
Created on 2018-05-04 by the reprex package (v0.2.0).
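The error in the question arises because weighted.mean(to_be_weighted_2, weighting) is evaluated immediately, on the character vector of two column names and the full weighting vector of length 10, rather than per group. As an alternative sketch that avoids reshaping, and assuming a dplyr version that provides across() (1.0.0 or later), the two lists of column names can be used directly. The small data set here is hypothetical and deterministic so the result is easy to check by hand:

```r
library(dplyr)

# Hypothetical data with one level column and one percentage column
df_to_collapse <- data.frame(
  group_id      = c(1, 1, 2),
  var_1         = c(1, 2, 3),
  var_percent_1 = c(0.2, 0.8, 0.5),
  weighting     = c(1, 3, 2)
)

to_be_summed_2   <- "var_1"
to_be_weighted_2 <- "var_percent_1"

collapsed <- df_to_collapse %>%
  group_by(group_id) %>%
  summarise(
    # plain sum for the level variables
    across(all_of(to_be_summed_2), sum),
    # weighted mean for the percentage variables, using the weighting column
    across(all_of(to_be_weighted_2), ~ weighted.mean(.x, weighting))
  )

collapsed
```

For group 1 the weighted mean is (0.2 * 1 + 0.8 * 3) / 4 = 0.65, which is what the second across() call should return.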
