Using tidyverse to reshape a data.frame and its column names - r

I have a data.frame of some experiment with several factors and measured values for each sample. For example:
factors <- c("age","sex")
The data.frame looks like this:
library(dplyr)
set.seed(1)
df <- do.call(rbind,lapply(1:10,function(i) expand.grid(age=c("Y","O"),sex=c("F","M")) %>% dplyr::mutate(val=rnorm(4))))
I want to create a data.frame with a single row and one column per combination of factors (i.e. nrow(expand.grid(age=c("Y","O"),sex=c("F","M"))) columns in this example), where each value is the mean of df$val for the corresponding combination of factors.
To get the mean df$val for each combination of factors I do:
grouped.mean.val.df <- df %>% dplyr::group_by_(.dots=factors) %>% dplyr::summarise(mean.val=mean(val))
And the resulting data.frame I'd like to obtain is:
res.df <- data.frame(Y.F=grouped.mean.val.df$mean.val[1],
Y.M=grouped.mean.val.df$mean.val[2],
O.F=grouped.mean.val.df$mean.val[3],
O.M=grouped.mean.val.df$mean.val[4])
Is there a tidyverse way to get that?

We can do a unite and then a spread: unite 'age' and 'sex' into a single column, convert that column to a factor with levels in order of appearance (so the column order matches the expected output), and spread to 'wide' format
library(tidyverse)
grouped.mean.val.df %>%
unite(agesex, age, sex, sep=".") %>%
mutate(agesex = factor(agesex, levels = unique(agesex))) %>%
spread(agesex, mean.val)
# A tibble: 1 x 4
# Y.F Y.M O.F O.M
# <dbl> <dbl> <dbl> <dbl>
#1 0.0695 0.411 -0.118 0.00577
Also, instead of group_by_, we can use group_by_at, which takes strings as variables
df %>%
group_by_at(factors) %>%
summarise(mean.val = mean(val)) %>%
unite(agesex, age, sex, sep=".") %>%
mutate(agesex = factor(agesex, levels = unique(agesex))) %>%
spread(agesex, mean.val)
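In current tidyr releases spread is superseded by pivot_wider, and dplyr's scoped verbs such as group_by_at are superseded by across. A minimal sketch of the same reshaping in that style, assuming dplyr >= 1.0.0 and tidyr >= 1.0.0 are installed (names_sep = "." is chosen here only to reproduce the Y.F style of column names):

```r
library(dplyr)
library(tidyr)

factors <- c("age", "sex")

# Same example data as in the question
set.seed(1)
df <- do.call(rbind, lapply(1:10, function(i) {
  expand.grid(age = c("Y", "O"), sex = c("F", "M")) %>%
    mutate(val = rnorm(4))
}))

res.df <- df %>%
  # group by the columns named in the character vector
  group_by(across(all_of(factors))) %>%
  summarise(mean.val = mean(val), .groups = "drop") %>%
  # column names are built by joining the factor values with names_sep
  pivot_wider(names_from = all_of(factors),
              values_from = mean.val,
              names_sep = ".")
```

Because pivot_wider creates the new columns in order of first appearance, no intermediate conversion to a factor should be needed here.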

Related

Create multiple data frames

I want to create multiple data frames from data like the following:
ID Time Ethnicity LDL HDL ....
1 1 black
2 2 white
3 1 black
4 2 White
Each data frame should hold the mean values of one of the columns LDL, HDL, ... for the 4 rows displayed above. I used the following code, but the problem is that all the data frames are identical, i.e. DF[[1]] is the same as DF[[2]], ... DF[[15]]. I would appreciate it if you could help me find a solution.
dv=c(names(data[,4:15]))
library(ggplot2)
require(plyr)
for (i in 1:12) {
DF[[i]] = ddply(data, c("Time", "Ethnicity"), summarize,
Mean = mean(data[[paste(dv[i])]], na.rm = T))
}
plyr is retired; you could use dplyr instead. When you call mean(data[[paste(dv[i])]]), you are subsetting the entire column of data and not respecting the groups. Hence, you get the same mean for every group in DF[[1]], DF[[2]], etc.
library(dplyr)
output_df <- data %>%
group_by(Time, Ethnicity) %>%
summarise_at(4:15, mean, na.rm = TRUE) %>%
ungroup
If you want a list of data frames, you could use group_split:
DF <- output_df %>% group_split(Time, Ethnicity)
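As a hedged sketch of the same idea with column names instead of positions, the example below uses across(all_of(dv)) on a small made-up data set. The toy values and the two-column dv vector are assumptions for illustration only; the asker's real data has twelve such columns.

```r
library(dplyr)

# Toy stand-in for the asker's data (values are invented for illustration)
data <- data.frame(
  ID        = 1:4,
  Time      = c(1, 2, 1, 2),
  Ethnicity = c("black", "white", "black", "white"),
  LDL       = c(100, 120, 110, 130),
  HDL       = c(50, 60, 70, 80)
)
dv <- c("LDL", "HDL")  # names of the dependent variables

# One row per Time/Ethnicity group, one mean column per variable in dv
output_df <- data %>%
  group_by(Time, Ethnicity) %>%
  summarise(across(all_of(dv), ~ mean(.x, na.rm = TRUE)), .groups = "drop")

# One small data frame per dependent variable, as the original loop intended
DF <- lapply(dv, function(v) output_df[, c("Time", "Ethnicity", v)])
names(DF) <- dv
```

Selecting by name avoids any ambiguity about how column positions are counted once the data is grouped.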

Getting two different means in R using same numbers

I'm trying to calculate the mean of some grouped data, but I'm running into an issue where base::mean() returns a different value than base::rowMeans(), or than when I replicate the mean in Excel.
Here's the code with a simplified data frame looking at just a small piece of the data:
df <- data.frame("ID" = 1101372,
"Q1" = 5.996667,
"Q2" = 6.005556,
"Q3" = 5.763333)
avg1 <- df %>%
summarise(new_avg = mean(Q1,
Q2,
Q3)) # Returns a value of 5.996667
avg2 <- rowMeans(df[,2:4]) # Returns a value of 5.921852
The value in avg2 is what I get when I use AVERAGE in Excel, but I can't figure out why mean() is not generating the same number.
Any thoughts?
Here, mean is taking only the first argument, i.e. Q1, as 'x', because the usage shown in ?mean is
mean(x, trim = 0, na.rm = FALSE, ...)
i.e. the second and third arguments are not data. In the OP's code, x is matched to Q1, trim to Q2 and na.rm to Q3. Because trim is at least 0.5, mean returns the median of x, which for a single value is simply Q1. The ... at the end also means that extra arguments can be supplied without any error, which leads to confusion like this if we don't check the usage.
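The argument matching can be checked in plain base R. The sketch below uses the three values from the question as standalone variables (the names q1, q2, q3, wrong and right are introduced here for illustration only):

```r
q1 <- 5.996667
q2 <- 6.005556
q3 <- 5.763333

# Positional matching: x = q1, trim = q2, na.rm = q3
wrong <- mean(q1, q2, q3)

# Combine the values into one vector so that all of them are passed as x
right <- mean(c(q1, q2, q3))

print(wrong)  # same value as q1
print(right)  # the average of the three values
```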
We can specify the data as ., subset the columns of interest and use that in rowMeans
df %>%
summarise(new_avg = rowMeans(.[-1]))
This would be more efficient. But if we want to use mean itself, combine the values with c() and do it rowwise
df %>%
rowwise() %>%
summarise(new_avg = mean(c(Q1, Q2, Q3)))
# A tibble: 1 x 1
# new_avg
# <dbl>
#1 5.92
Or convert to 'long' format and then do the group_by 'ID' and get the mean
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -ID) %>%
group_by(ID) %>% # can skip this step if there is only a single row
summarise(new_avg = mean(value))
# A tibble: 1 x 2
# ID new_avg
# <dbl> <dbl>
#1 1101372 5.92

Compare column values against another column

I have the following data:
set.seed(1)
data <- data.frame(
id = 1:500, ht_1 = rnorm(500,10:20), ht_2 = rnorm(500,15:25),
ht_3 = rnorm(500,20:30), ht_4 = rnorm(500,25:35),
ht_5 = rnorm(500,20:40)
)
I would like to identify the values in columns ht_1:ht_4 that are greater than the values in column ht_5 (number of observations and means).
For each of these columns, I would then like to replace any values that are greater than ht_5 with ht_5.
You can use the mutate_at function like this:
library(tidyverse)
data %>% as_tibble %>%
mutate_at(vars(paste0("ht_", 1:4)), ~if_else(.x > ht_5, ht_5, .x))
In this case you can also use pmin instead of if_else which should be faster.
data %>% as_tibble %>%
mutate_at(vars(paste0("ht_", 1:4)), ~pmin(.x, ht_5))
To see how many values are greater than ht_5 you can use the summarise_at function:
data %>% as_tibble %>%
summarize_at(vars(paste0("ht_", 1:4)), ~ length(.x[.x > ht_5]))
# A tibble: 1 x 4
ht_1 ht_2 ht_3 ht_4
<int> <int> <int> <int>
1 6 39 131 258
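For reference, a minimal sketch of the same two steps written with across, which supersedes the scoped _at verbs in dplyr >= 1.0.0 (assumed to be installed here). The object names capped and n_above are introduced for illustration only.

```r
library(dplyr)

# Same example data as in the question
set.seed(1)
data <- data.frame(
  id = 1:500, ht_1 = rnorm(500, 10:20), ht_2 = rnorm(500, 15:25),
  ht_3 = rnorm(500, 20:30), ht_4 = rnorm(500, 25:35),
  ht_5 = rnorm(500, 20:40)
)

# Replace every value above ht_5 with ht_5 (element-wise minimum)
capped <- data %>%
  mutate(across(ht_1:ht_4, ~ pmin(.x, ht_5)))

# Count, per column, how many values exceed ht_5
n_above <- data %>%
  summarise(across(ht_1:ht_4, ~ sum(.x > ht_5)))
```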

R run T-test/anova for each row with 2 groups with 3 samples

My dataset looks something like this:
df <- data.frame(compound = c("alanine ", "arginine", "asparagine", "aspartate"))
df <- matrix(rnorm(12*4), ncol = 12)
colnames(df) <- c("AC-1", "AC-2", "AC-3", "AM-1", "AM-2", "AM-3", "SC-1", "SC-2", "SC-3", "SM-1", "SM-2", "SM-3")
df <- data.frame(compound = c("alanine ", "arginine", "asparagine", "aspartate"), df)
df
compound AC.1 AC.2 AC.3 AM.1 AM.2 AM.3 SC.1 SC.2 SC.3 SM.1
1 alanine 1.18362683 -2.03779314 -0.7217692 -1.7569264 -0.8381042 0.06866567 0.2327702 -1.1558879 1.2077454 0.437707310
2 arginine -0.19610110 0.05361113 0.6478384 -0.1768597 0.5905398 -0.67945600 -0.2221109 1.4032349 0.2387620 0.598236199
3 asparagine 0.02540509 0.47880021 -0.1395198 0.8394257 1.9046667 0.31175358 -0.5626059 0.3596091 -1.0963363 -1.004673116
4 aspartate -1.36397906 0.91380826 2.0630076 -0.6817453 -0.2713498 -2.01074098 1.4619707 -0.7257269 0.2851122 -0.007027878
I want to perform a t-test for each row (compound) on the columns [2:4] as one, and [5:7] as one, and store all the p-values. Basically see if there is a difference between the AC group and AM group for each compound.
I am aware there is another topic with this however I couldn't find a viable solution for my problem.
PS. my real dataset has about 35000 rows (maybe it needs a different solution than only 4 rows)
After selecting the columns of interest, use pmap to apply t.test to each row, passing the first 3 and the next 3 observations as the two samples, and bind the extracted p-value as another column of the original data
library(tidyverse)
df %>%
select(AC.1:AM.3) %>%
pmap_dbl(~ c(...) %>%
{t.test(.[1:3], .[4:6])$p.value}) %>%
bind_cols(df, pval_AC_AM = .)
Or, after selecting the columns, gather to 'long' format, separate the group name from the replicate number, spread so that each group becomes a column, apply t.test in summarise and join with the original data
df %>%
select(compound, AC.1:AM.3) %>%
gather(key, val, -compound) %>%
separate(key, into = c('key1', 'key2')) %>%
spread(key1, val) %>%
group_by(compound) %>%
summarise(pval_AC_AM = t.test(AC, AM)$p.value) %>%
right_join(df)
Update
If there are rows where the observations are all the same value, t.test throws an error. One option is to run t.test anyway and return NA for those cases. This can be done with possibly from purrr
posttest <- possibly(function(x, y) t.test(x, y)$p.value, otherwise = NA)
df %>%
select(AC.1:AM.3) %>%
pmap_dbl(~ c(...) %>%
{posttest(.[1:3], .[4:6])}) %>%
bind_cols(df, pval_AC_AM = .)
posttest(rep(3,5), rep(1, 5))
#[1] NA
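The same row-wise test can also be sketched in base R, with tryCatch playing the role of possibly. The helper name safe_p and the simulated six-column data are assumptions made for this illustration, not part of the original answer.

```r
# Simulated stand-in for the AC and AM columns of the question's data
set.seed(42)
m <- matrix(rnorm(4 * 6), nrow = 4,
            dimnames = list(NULL, c("AC.1", "AC.2", "AC.3",
                                    "AM.1", "AM.2", "AM.3")))
df <- data.frame(
  compound = c("alanine", "arginine", "asparagine", "aspartate"),
  m
)

# Hypothetical helper: p-value of a two-sample t-test, NA if t.test fails
safe_p <- function(x, y) {
  tryCatch(t.test(x, y)$p.value, error = function(e) NA_real_)
}

# Apply the test to every row: columns 1:3 are AC, columns 4:6 are AM
df$pval_AC_AM <- apply(df[, 2:7], 1, function(r) safe_p(r[1:3], r[4:6]))
```

With around 35000 rows a per-row call to t.test may be slow, so a vectorised implementation is likely the better fit for data of that size.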
If you can use an external library:
library(matrixTests)
row_t_welch(df[,2:4], df[,5:7])$pvalue
[1] 0.67667626 0.39501003 0.26678161 0.01237438

unexpected row when going from long to wide format with dplyr and tidyr

I've got a data frame (dfdat) with two categorical variables, location and employmentstatus.
I'd like to generate a data frame with the proportions of employment status for each location.
mydf_wide (achieved outcome) is almost what I'm looking for. The problem's that employmentstatus is a variable with two levels, yet there're three rows in mydf_wide. I don't understand why that is, because I'd have expected something similar to mytable (expected outcome).
Any help would be much appreciated.
Starting point (df):
dfdat <- data.frame(location=c("GA","GA","MA","OH","RI","GA","AZ","MA","OH","RI"),employmentstatus=c(1,2,1,2,1,1,1,2,1,1))
Expected outcome (table):
mytable <- table(dfdat$employmentstatus,dfdat$location)
mytable <- round(100*(prop.table(mytable, 2)),1)
Achieved outcome (df):
library(dplyr)
mydf <- dfdat %>%
group_by(location,employmentstatus) %>%
summarise (n = n()) %>%
mutate(freq = round((n / sum(n)*100),1))
library(tidyr)
mydf_wide <- spread(mydf, location, freq)
mydf_wide <- as.data.frame(mydf_wide)
Instead of grouping and then creating 'n' with summarise, the count function can be used. Its result is not grouped, so we need a second group_by with 'location' to get the sum per location
dfdat %>%
count(location, employmentstatus) %>%
group_by(location) %>%
mutate(n = round(100*n/sum(n), 2)) %>%
spread(location, n, fill = 0)
# A tibble: 2 x 6
# employmentstatus AZ GA MA OH RI
#* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 100 66.67 50 50 100
#2 2 0 33.33 50 50 0
If we are using the OP's code, then remove the 'n' column and then do the spread
dfdat %>%
group_by(location,employmentstatus) %>%
summarise (n = n()) %>%
mutate(freq = round((n / sum(n)*100),1)) %>%
select(-n) %>%
spread(location, freq, fill =0)
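or update the 'n' column with the output of round and then spread. The extra 'n' column is what produced the unexpected row: spread treats every column other than the key and value as an identifier, so it keeps one row for each distinct combination of employmentstatus and n that exists in the dataset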
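A minimal sketch of the same fix with pivot_wider, which supersedes spread in tidyr >= 1.0.0 (assumed to be installed here). The result name wide is introduced for illustration only.

```r
library(dplyr)
library(tidyr)

dfdat <- data.frame(
  location = c("GA", "GA", "MA", "OH", "RI", "GA", "AZ", "MA", "OH", "RI"),
  employmentstatus = c(1, 2, 1, 2, 1, 1, 1, 2, 1, 1)
)

wide <- dfdat %>%
  count(location, employmentstatus) %>%
  group_by(location) %>%
  mutate(freq = round(100 * n / sum(n), 1)) %>%
  ungroup() %>%
  # drop n so that employmentstatus is the only identifier column
  select(-n) %>%
  pivot_wider(names_from = location, values_from = freq, values_fill = 0)
```

With n removed before reshaping, the result should have one row per employment status, matching the layout of mytable.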

Resources