Alternative to aggregate function that doesn't collapse df - r

I have person-level data and want to create a new variable that has the number of kids in a family. I have created a dummy variable for kids (1 if age<18, 0 otherwise). I'm currently using the aggregate function, where HH_ID is a household identifier.
No_kids <- aggregate(child ~ HH_ID, data = df, sum)
This code works but the data frame collapses whereas I want to assign the number of kids to each observation for that household. Is there an alternative to the aggregate function that doesn't collapse the data set?

Another option is dplyr. First, some example data:
library(dplyr)
player_df = data.frame(team = c('ARI', 'BAL', 'BAL', 'CLE', 'CLE'),
                       player = c('A', 'B', 'C', 'D', 'F'),
                       '1' = floor(runif(5, min=1, max=2)*10),
                       '2' = floor(runif(5, min=1, max=2)*10))
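Then use group_by and mutate from dplyr to add a per-group value to every row (for the question's data, group by HH_ID and use sum(child) instead of n()):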
player_df %>% group_by(team) %>% mutate(count = n())
Source: local data frame [5 x 5]
Groups: team [3]
team player X1 X2 count
<fctr> <fctr> <dbl> <dbl> <int>
1 ARI A 12 12 1
2 BAL B 10 12 2
3 BAL C 14 12 2
4 CLE D 10 14 2
5 CLE F 18 17 2

Alternatively, you could do a merge after aggregation (so in base R):
ag <- aggregate(child ~ HH_ID, data = df, sum)
# Rename the summed column before merging so it does not clash with df$child
names(ag)[2] <- "No_kids"
merge(df, ag, by = "HH_ID")
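Base R also has ave(), which is not mentioned in the answers here but fits this case: it applies a function within groups and returns a vector as long as its input, so nothing collapses. A minimal sketch with a made-up data frame (the age column and all values are assumptions for illustration):

```r
# Hypothetical person-level data: two households with mixed ages
df <- data.frame(
  HH_ID = c(1, 1, 1, 2, 2),
  age   = c(40, 12, 7, 35, 20)
)
df$child <- as.integer(df$age < 18)

# ave() computes sum(child) within each HH_ID and repeats the result
# on every row of that household, so the row count is unchanged
df$No_kids <- ave(df$child, df$HH_ID, FUN = sum)
df
```

Household 1 gets 2 on all three of its rows and household 2 gets 0 on both of its rows.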

Using the dplyr package:
library(dplyr)
# Create sample data
set.seed(3252)
df <- data.frame(
HH_ID = sample(1:10, 50, replace = TRUE),
child = sample(0:1, 50, replace = TRUE)
)
# Count number of children
df %>%
group_by(HH_ID) %>%
mutate(child_count = sum(child)) %>%
ungroup()

Related

Replacing NA values with mode from multiple imputation in R

I ran 5 imputations on a data set with missing values. For my purposes, I want to replace missing values with the mode from the 5 imputations. Let's say I have the following data sets, where df is my original data, ID is a grouping variable to identify each case, and imp is my imputed data:
df <- data.frame(ID = c(1,2,3,4,5),
var1 = c(1,NA,3,6,NA),
var2 = c(NA,1,2,6,6),
var3 = c(NA,2,NA,4,3))
imp <- data.frame(ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5),
var1 = c(1,2,3,3,2,5,4,5,6,6,7,2,3,2,5,6,5,6,6,6,3,1,2,3,2),
var2 = c(4,3,2,3,2,4,6,5,4,4,7,2,4,2,3,6,5,6,4,5,3,3,4,3,2),
var3 = c(7,6,5,6,6,2,3,2,4,2,5,4,5,3,5,1,2,1,3,2,1,2,1,1,1))
I have a method that works, but it involves a ton of manual coding as I have ~200 variables total (I'm doing this on 3 different data sets with different variables). My code looks like this for one variable:
library(dplyr)
mode <- function(codes){
which.max(tabulate(codes))
}
var1 <- imp %>% group_by(ID) %>% summarise(var1 = mode(var1))
df3 <- df %>%
left_join(var1, by = "ID") %>%
mutate(var1 = coalesce(var1.x, var1.y)) %>%
select(-var1.x, -var1.y)
Thus, the original value in df is replaced with the mode only if the value was NA.
It is taking forever to keep manually coding this for every variable. I'm hoping there is an easier way of calculating the mode from the imputed data set for each variable by ID and then replacing the NAs with that mode in the original data. I thought maybe I could put the variable names in a vector and somehow iterate through them with one code where i changes to each variable name, but I didn't know where to go with that idea.
x <- colnames(df)
# Attempting to iterate through variables names using i
i = as.factor(x[[2]])
This is where I am stuck. Any help is much appreciated!
Here is one option using tidyverse. Essentially, we can pivot both dataframes long, then join together and coalesce in one step rather than column by column. Mode function taken from here.
library(tidyverse)
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
imp_long <- imp %>%
group_by(ID) %>%
summarise(across(everything(), Mode)) %>%
pivot_longer(-ID)
df %>%
pivot_longer(-ID) %>%
left_join(imp_long, by = c("ID", "name")) %>%
mutate(value = coalesce(value.x, value.y)) %>%
select(-c(value.x, value.y)) %>%
pivot_wider(names_from = "name", values_from = "value")
Output
# A tibble: 5 × 4
ID var1 var2 var3
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 6
2 2 5 1 2
3 3 3 2 5
4 4 6 6 4
5 5 3 6 3
You can use -
library(dplyr)
mode_data <- imp %>%
group_by(ID) %>%
summarise(across(starts_with('var'), Mode))
df %>%
left_join(mode_data, by = 'ID') %>%
transmute(ID,
across(matches('\\.x$'),
function(x) coalesce(x, .[[sub('x$', 'y', cur_column())]]),
.names = '{sub(".x$", "", .col)}'))
# ID var1 var2 var3
#1 1 1 3 6
#2 2 5 1 2
#3 3 3 2 5
#4 4 6 6 4
#5 5 3 6 3
mode_data has the Mode value for each of the var columns, per ID.
Join df and mode_data by ID.
After the join, every variable appears as a pair named name.x (from df) and name.y (from mode_data), so for each name.x column we replace the trailing x with y to look up its partner: .[[sub('x$', 'y', cur_column())]]
Use coalesce to select the non-NA value in each pair.
Rename the result by removing .x from the column name ({sub(".x$", "", .col)}), so var1.x becomes var1.
where Mode function is taken from here
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
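The two mode helpers in this thread break ties differently, which matters with only 5 imputations per ID. A small sketch comparing them on the imputed var1 values for ID 5; mode_q is a hypothetical name for the question's helper, renamed here so it does not mask base::mode:

```r
# The question's helper: tabulate() counts positive integers by value,
# so a tie goes to the smallest value (and it only works for positive integers)
mode_q <- function(codes) {
  which.max(tabulate(codes))
}

# The helper used in the answers: counts by position in unique(x),
# so a tie goes to the value that appears first, and any type works
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

x <- c(3, 1, 2, 3, 2)   # 2 and 3 both occur twice
mode_q(x)               # 2
Mode(x)                 # 3
Mode(c("b", "a", "a"))  # "a"
```

This is why the answers report 3 for var1 of ID 5, whereas the question's original helper would have filled in 2.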
library(dplyr, warn.conflicts = FALSE)
imp %>%
group_by(ID) %>%
summarise(across(everything(), Mode)) %>%
bind_rows(df) %>%
group_by(ID) %>%
summarise(across(everything(), ~ coalesce(last(.x), first(.x))))
#> # A tibble: 5 × 4
#> ID var1 var2 var3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 3 6
#> 2 2 5 1 2
#> 3 3 3 2 5
#> 4 4 6 6 4
#> 5 5 3 6 3
Created on 2022-01-03 by the reprex package (v2.0.1)
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
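A further possibility, not taken from the answers above: dplyr 1.0.0 and later provide rows_patch(), which overwrites only the NA cells of one data frame with values from another, matched by key. A sketch on the question's data, assuming every ID in the mode table also exists in df and that the column names match:

```r
library(dplyr)

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 var1 = c(1, NA, 3, 6, NA),
                 var2 = c(NA, 1, 2, 6, 6),
                 var3 = c(NA, 2, NA, 4, 3))
imp <- data.frame(ID = rep(1:5, each = 5),
                  var1 = c(1,2,3,3,2,5,4,5,6,6,7,2,3,2,5,6,5,6,6,6,3,1,2,3,2),
                  var2 = c(4,3,2,3,2,4,6,5,4,4,7,2,4,2,3,6,5,6,4,5,3,3,4,3,2),
                  var3 = c(7,6,5,6,6,2,3,2,4,2,5,4,5,3,5,1,2,1,3,2,1,2,1,1,1))
imp$ID <- as.numeric(imp$ID)  # same key type as df$ID

# One row per ID holding the mode of every imputed variable
mode_data <- imp %>%
  group_by(ID) %>%
  summarise(across(everything(), Mode))

# Fill only the NA cells of df; observed values are left untouched
filled <- rows_patch(df, mode_data, by = "ID")
filled
```

Because the patching is keyed on ID and applies to every shared column, there is no per-variable code to write, which is the point for ~200 variables.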

Reshape dataframe so that matching family members have their own column

I have a dataframe...
df <- tibble(
id = 1:5,
family = c("a","a","b","b","c"),
twin = c(1,2,1,2,1),
datacol1 = 11:15,
datacol2 = 21:25
)
For every twin pair (members of the same family) I need to have a second 'datacol' with the other twins' data. This should only happen for matching twins, so the 5th row (from family "c") should have duplicate columns that are empty.
Ideally, by the end the data would look like the following...
df <- tibble(
id = 1:5,
family = c("a","a","b","b","c"),
twin = c(1,2,1,2,1),
datacol1 = 11:15,
datacol1.b = c(12,11,14,13,NA),
datacol2 = 21:25,
datacol2.b = c(22,21,24,23,NA)
)
I have added an image to help illustrate what I am trying to get to.
I would like to be able to do this for all columns or for selected columns and preferably using tidyverse.
We can also use mutate_at
library(dplyr)
df %>%
group_by(family) %>%
mutate_at(vars(starts_with('datacol')), list(`2` =
~if(n() == 1) NA_integer_ else rev(.)))
# A tibble: 5 x 7
# Groups: family [3]
# id family twin datacol1 datacol2 datacol1_2 datacol2_2
# <int> <chr> <dbl> <int> <int> <int> <int>
#1 1 a 1 11 21 12 22
#2 2 a 2 12 22 11 21
#3 3 b 1 13 23 14 24
#4 4 b 2 14 24 13 23
#5 5 c 1 15 25 NA NA
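A variant that takes the column names from a character vector:
library(dplyr)
cols = c("datacol1", "datacol2")
df %>%
  group_by(family) %>%
  # all_of() makes it explicit that cols is an external character vector
  mutate_at(vars(all_of(cols)), function(x){
    if (n() == 2){
      rev(x)
    } else {
      NA
    }
  }) %>%
  ungroup() %>%
  select(all_of(cols)) %>%
  # funs() is deprecated; a formula lambda does the same renaming
  rename_all(~ paste0(., ".b")) %>%
  cbind(df, .)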
Base R
cols = c("datacol1", "datacol2")
do.call(rbind, lapply(split(df, df$family), function(x){
cbind(x, setNames(lapply(x[cols], function(y) {
if (length(y) == 2) {
rev(y)
} else {
NA
}}),
paste0(cols, ".b")))
}))
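The same idea can be written with across(), which supersedes mutate_at in dplyr 1.0.0 and later. This is a sketch rather than one of the original answers; it assumes, as the answers above do, that a family has at most two members, so reversing the group swaps the twins:

```r
library(dplyr)

df <- tibble(
  id = 1:5,
  family = c("a", "a", "b", "b", "c"),
  twin = c(1, 2, 1, 2, 1),
  datacol1 = 11:15,
  datacol2 = 21:25
)

out <- df %>%
  group_by(family) %>%
  mutate(across(
    starts_with("datacol"),
    # swap the two twins' values; unmatched rows get NA
    ~ if (n() == 2) rev(.x) else NA_integer_,
    .names = "{.col}.b"   # datacol1 -> datacol1.b, as in the question
  )) %>%
  ungroup()
out
```

The .names argument produces the .b suffix directly, so no separate rename step is needed. The new columns are appended after the existing ones rather than interleaved.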

Create a new variable that is the average of one variable conditional on two other variables (and maintain all other variables in the data set)

Here is a (shortened) sample from a data set I am working on. The sample represents data from an experiment with 2 sessions (session_number), in each session participants completed 5 trials (trial_number) of a hand grip exercise (so, 10 in total; 2 * 5 = 10). Each of the 5 trials has 3 observations of hand grip strength (percent_of_maximum). I want to get the average (below, I call it mean_by_trial) of these 3 observations for each of the 10 trials.
Finally, and this is what I am stuck on, I want to output a data set that is 20 rows long (one row for each unique trial, there are 2 participants and 10 trials for each participant; 2 * 10 = 20), AND retains all other variables. All the other variables (in the example there are: placebo, support, personality, and perceived_difficulty) will be the same for each unique Participant, trial_number, or session_number (see sample data set below).
I have tried this using ddply, which is pretty much what I want, but the new data set does not contain the other variables in the data set (new_dat only contains trial_number, session_number, Participant and the new mean_by_trial variable). How can I maintain the other variables?
#create sample data frame
dat <- data.frame(
Participant = rep(1:2, each = 30),
placebo = c(replicate(15, "placebo"), replicate(15, "control"), replicate(15, "control"), replicate(15, "placebo")),
support = rep(sort(rep(c("support", "control"), 3)), 10),
personality = c(replicate(30, "nice"), replicate(30, "naughty")),
session_number = c(rep(1:2, each = 15), rep(1:2, each = 15)),
trial_number = c(rep(1:5, each = 3), rep(1:5, each = 3), rep(1:5, each = 3), rep(1:5, each = 3)),
percent_of_maximum = runif(60, min = 0, max = 100),
perceived_difficulty = runif(60, min = 50, max = 100)
)
#this is what I have tried so far
library(plyr)
new_dat <- ddply(dat, .(trial_number, session_number, Participant), summarise, mean_by_trial = mean(percent_of_maximum), .drop = FALSE)
I want new_dat to contain all the variables in dat, plus the mean_by_trial variable. Thank you!
We can use mutate instead of summarise to create a column in the dataset and then do slice
library(dplyr)
out <- ddply(dat, .(trial_number, session_number, Participant),
plyr::mutate, mean_by_trial = mean(percent_of_maximum), .drop = FALSE)
out %>%
group_by(trial_number, session_number, Participant) %>%
slice(1)
If we use dplyr, then this can all be inside a chain
newdat <- dat %>%
group_by(trial_number, session_number, Participant) %>%
mutate(mean_by_trial = mean(percent_of_maximum)) %>%
slice(1)
head(newdat)
# A tibble: 6 x 9
# Groups: trial_number, session_number, Participant [6]
# Participant placebo support personality session_number trial_number percent_of_maximum perceived_difficulty mean_by_trial
# <int> <fct> <fct> <fct> <int> <int> <dbl> <dbl> <dbl>
#1 1 placebo control nice 1 1 71.5 95.5 73.9
#2 2 control control naughty 1 1 38.9 63.8 67.7
#3 1 control support nice 2 1 97.1 54.2 68.4
#4 2 placebo support naughty 2 1 62.9 86.2 40.4
#5 1 placebo support nice 1 2 49.0 95.8 65.7
#6 2 control support naughty 1 2 80.9 74.6 68.3
Here’s a tidyverse answer. First you want to group_by the variables of interest. Then calculate the desired mean in a new column using mutate.
As the value in the new mean column will be repeated across the rows of each group, use the distinct function to retain unique rows. In other words, select a single row for each combination of Participant, session_number, and trial_number.
This is the answer (https://stackoverflow.com/a/39092166/9941764)
provided in: R - dplyr Summarize and Retain Other Columns
new_dat <- dat %>%
group_by(Participant, session_number, trial_number) %>%
mutate(mean = mean(percent_of_maximum)) %>%
distinct(mean, .keep_all = TRUE)
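Both approaches (slice(1) and distinct) keep the first row of each Participant, session and trial combination, so any column that varies within a trial, such as percent_of_maximum, shows only that first row's value. A quick sketch to check the row count, using a trimmed-down version of the question's data (the seed and the reduced column set are assumptions for illustration):

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only to make the sketch reproducible
dat <- data.frame(
  Participant = rep(1:2, each = 30),
  session_number = rep(rep(1:2, each = 15), times = 2),
  trial_number = rep(rep(1:5, each = 3), times = 4),
  percent_of_maximum = runif(60, min = 0, max = 100)
)

# Add the per-trial mean to every row without collapsing
with_mean <- dat %>%
  group_by(Participant, session_number, trial_number) %>%
  mutate(mean_by_trial = mean(percent_of_maximum))

# Two ways of keeping one row per trial
via_slice <- with_mean %>% slice(1) %>% ungroup()
via_distinct <- with_mean %>%
  ungroup() %>%
  distinct(Participant, session_number, trial_number, .keep_all = TRUE)

nrow(via_slice)     # 20
nrow(via_distinct)  # 20
```

Naming the key columns in distinct() avoids relying on two trials never sharing exactly the same mean.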

How can I match two sets of factor levels in a new data frame?

I have a large data frame and I want to export a new data frame that contains summary statistics of the first based on the id column.
library(tidyverse)
set.seed(123)
id = rep(c(letters[1:5]), 2)
species = c("dog","dog","cat","cat","bird","bird","cat","cat","bee","bee")
study = rep("UK",10)
freq = rpois(10, lambda=12)
df1 <- data.frame(id,species, freq,study)
df1$id<-sort(df1$id)
df1
df2 <- df1 %>% group_by(id) %>%
summarise(meanFreq= mean(freq),minFreq=min(freq))
df2
I want to keep the species name in the new data frame with the summary statistics. But if I merge by id I get redundant rows. I should only have one row per id but with the species name appended.
df3<-merge(df2,df1,by = "id")
This is what it should look like but my real data is messier than this neat set up here:
df4 = df3[seq(1, nrow(df3), 2), ]
df4
From the summarised output ('df2') we can join with the distinct rows of the selected columns of original data
library(dplyr)
df2 %>%
left_join(df1 %>%
distinct(id, species, study), by = 'id')
# A tibble: 5 x 5
# id meanFreq minFreq species study
# <fct> <dbl> <dbl> <fct> <fct>
#1 a 10.5 10 dog UK
#2 b 14.5 12 cat UK
#3 c 14.5 12 bird UK
#4 d 10 7 cat UK
#5 e 11 6 bee UK
Or use the same logic with the base R
merge(df2,unique(df1[c(1:2, 4)]),by = "id", all.x = TRUE)
Time for mutate followed by distinct:
df1 %>% group_by(id) %>%
mutate(meanFreq = mean(freq), minFreq = min(freq)) %>%
distinct(id, .keep_all = TRUE)
Now there are two possibilities: either id and species are essentially the same in your df (one is just a label for the other), or the same id can have several species.
If the latter is the case, you will need to replace the last line with distinct(id, species, .keep_all = TRUE).
This would get you:
# A tibble: 5 x 6
# Groups: id [5]
id species freq study meanFreq minFreq
<fct> <fct> <int> <fct> <dbl> <dbl>
1 a dog 10 UK 10.5 10
2 b cat 17 UK 14.5 12
3 c bird 12 UK 14.5 12
4 d cat 13 UK 10 7
5 e bee 6 UK 11 6
If your only goal is to keep the species & they are indeed the same as id, you could also just include it in the group_by:
df1 %>% group_by(id, species) %>%
summarise(meanFreq = mean(freq), minFreq = min(freq))
This would then remove study and freq. If you need to keep them, you can again replace summarise with mutate and then call distinct with the .keep_all = TRUE argument.
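If the species labels are not perfectly consistent within an id (the question mentions the real data is messier), grouping by id and species would split one id into several rows. One way to guarantee a single row per id is to pick one label inside summarise(), for example the first. A sketch with made-up values; letting the first label win is an assumption, not something the question specifies:

```r
library(dplyr)

# Hypothetical messy data: id "b" has two spellings of the same species
df1 <- data.frame(
  id      = c("a", "a", "b", "b"),
  species = c("dog", "dog", "cat", "Cat"),
  freq    = c(10, 11, 12, 17),
  study   = "UK"
)

df2 <- df1 %>%
  group_by(id) %>%
  summarise(
    species  = first(species),  # keep the first label seen for this id
    study    = first(study),
    meanFreq = mean(freq),
    minFreq  = min(freq)
  )
df2
```

This always returns one row per id, at the cost of silently dropping any other labels, so it is worth checking n_distinct(species) per id first.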

Looping and creating New columns

Let's say I have a few columns in my data frame, that come from a bunch of similar factors:
For example: A1_Factor1, A1_Factor2, A1_Factor3, B1_Factor1, B1_Factor2, C1_Factor1, etc.
What I want is to create additional columns using this data. So:
A1_Mean - This should be the average of columns starting with A1
B1_Mean - This should be the average of columns starting with B1
A1_Min - This should be the minimum value of columns starting with A1
B1_Min - This should be the minimum value of columns starting with B1
A1_SD - This should be the Standard Deviation of columns starting with A1
B1_SD - This should be the Standard Deviation of columns starting with B1
How can this be done in R so that the code first extracts the columns sharing the same prefix, performs the required calculation on them, and then creates new columns named with that prefix?
Thanks for your help in advance! :)
You can do this using the tidyverse package
Input:
library(tidyverse)
set.seed(123)
df <- tibble(A1_abc = sample(1:10, 5),
A1_cde = sample(10:15, 5),
B1_abc = sample(1:10, 5),
B1_cde = sample(15:20, 5))
df
# A tibble: 5 x 4
A1_abc A1_cde B1_abc B1_cde
<int> <int> <int> <int>
1 3 10 10 20
2 8 12 5 16
3 4 13 6 15
4 7 11 9 18
5 6 15 1 19
Method:
df %>%
gather(key, value) %>%
separate(key, c("gp", "rand"), sep = "_") %>%
select(-rand) %>%
group_by(gp) %>%
mutate(id = 1:n()) %>%
spread(gp, value) %>%
summarise_at(vars(2:3), funs(Min = min(.),
Max = max(.),
Mean = mean(.),
SD = sd(.)))
Output:
# A tibble: 1 x 8
A1_Min B1_Min A1_Max B1_Max A1_Mean B1_Mean A1_SD B1_SD
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3. 1. 15. 20. 8.90 11.9 3.96 6.61
If you want to add more functions, just add them to the funs() call inside summarise_at().
I created a small example and this is what I have,
df <- data.frame("A1_factor1" = rnorm(5), "A1_factor2" = rnorm(5),
"B1_factor1" = rnorm(5), "B1_factor2" = rnorm(5))
col.names <- names(df)
group <- unique(substr(col.names, 1, 2))
for (i in seq_along(group)){
group.df <- df[, substr(names(df), 1, 2) == group[i]]
df[, ncol(df)+1] <- apply(group.df, 1, mean)
df[, ncol(df)+1] <- apply(group.df, 1, min)
df[, ncol(df)+1] <- apply(group.df, 1, sd)
df[, ncol(df)+1] <- apply(group.df, 1, max)
names(df)[(ncol(df)-3):ncol(df)] <- paste(group[i], c("Mean", "Min", "SD", "Max"), sep = "_")
}
df
I hope this helps!
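A variant of the loop above that reads the prefix from everything before the first underscore instead of the first two characters, so prefixes of any length work. This is a sketch with made-up values, and it assumes every column name has the form prefix_something:

```r
# Hypothetical input with two prefix groups
df <- data.frame(
  A1_Factor1 = c(1, 2),
  A1_Factor2 = c(3, 6),
  B1_Factor1 = c(10, 20),
  B1_Factor2 = c(30, 20)
)

# Prefix = text before the first underscore, e.g. "A1"
prefix <- sub("_.*$", "", names(df))

# split.default() splits a data frame by columns: one block per prefix
blocks <- split.default(df, prefix)

# Row-wise statistics for each block, named <prefix>_<stat>
stats <- lapply(names(blocks), function(p) {
  m <- as.matrix(blocks[[p]])
  out <- data.frame(
    Mean = rowMeans(m),
    Min  = apply(m, 1, min),
    SD   = apply(m, 1, sd)
  )
  setNames(out, paste(p, names(out), sep = "_"))
})

result <- cbind(df, do.call(cbind, stats))
names(result)
```

Splitting the columns once before computing anything means the newly added columns never get mixed into a later group's calculation.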

Resources