Convert plyr::ddply to dplyr - r

I have a dataframe like this:
tmp <- read.table(header = T, text = "gene_id gene_symbol ensembl_id keep val1 val2 val3
x a Multiple Yes 1 2 3
x1 a Multiple No 2 3 4
x2 a Multiple No 1 4 3
y b Multiple Yes 22 20 12
y1 b Multiple No 98 7 97
y2 b Multiple No 8 76 6")
I am trying to group by the gene_symbol variable and calculate the correlation between the row with keep == "Yes" and every other row (keep == "No"), then return the average correlation along with the gene_symbol and gene_id. This is the function:
# function to calculate avg. correlation
calc.mean.corr <- function(x){
gene.id <- x[which(x$keep == "Yes"),"gene_id"]
x1 <- x %>%
filter(keep == "Yes") %>%
select(-c(gene_id, gene_symbol, ensembl_id, keep)) %>%
as.numeric()
x2 <- x %>%
filter(keep == "No") %>%
select(-c(gene_id, gene_symbol, ensembl_id, keep))
# correlation of kept id with discarded ids
cor <- mean(apply(x2, 1, FUN = function(y) cor(x1, y)))
cor <- round(cor, digits = 2)
df <- data.frame(avg.cor = cor, gene_id = gene.id)
return(df)
}
# call using ddply
for.corr <- plyr::ddply(tmp, .variables = "gene_symbol", .fun = function(x) calc.mean.corr(x))
The final output looks like this:
> for.corr
gene_symbol avg.cor gene_id
1 a 0.83 x
2 b 0.02 y
I am using plyr::ddply for this but want to use dplyr instead. However, I am not sure how to convert it to dplyr format. Any help would be much appreciated.

If we don't want to change the function, one option is to do a group_split and apply the function to each piece:
library(dplyr)
library(purrr)
tmp %>%
group_split(gene_symbol) %>%
map_dfr(calc.mean.corr)
To include the gene_symbol in the output, split on it and pass .id:
tmp %>%
split(.$gene_symbol) %>%
map_dfr(~ calc.mean.corr(.), .id = 'gene_symbol')
# gene_symbol avg.cor gene_id
#1 a 0.83 x
#2 b 0.02 y
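For a version that stays inside a single dplyr pipeline, a minimal sketch with group_modify() might look like the following. It assumes dplyr 0.8.1 or later, hard-codes the value columns val1 to val3, assumes exactly one keep == "Yes" row per gene_symbol, and recomputes the correlation inline instead of calling calc.mean.corr. Treat it as one possible translation, not the canonical one.

```r
library(dplyr)

tmp <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "gene_id gene_symbol ensembl_id keep val1 val2 val3
x a Multiple Yes 1 2 3
x1 a Multiple No 2 3 4
x2 a Multiple No 1 4 3
y b Multiple Yes 22 20 12
y1 b Multiple No 98 7 97
y2 b Multiple No 8 76 6")

for.corr <- tmp %>%
  group_by(gene_symbol) %>%
  # g holds the rows of one group (without gene_symbol); key holds the group label
  group_modify(function(g, key) {
    vals <- as.matrix(g[, c("val1", "val2", "val3")])
    kept <- vals[g$keep == "Yes", ]                   # the single kept row, as a vector
    dropped <- vals[g$keep == "No", , drop = FALSE]   # the discarded rows, as a matrix
    data.frame(
      # correlate each discarded row with the kept row, then average
      avg.cor = round(mean(apply(dropped, 1, cor, y = kept)), 2),
      gene_id = g$gene_id[g$keep == "Yes"],
      stringsAsFactors = FALSE
    )
  }) %>%
  ungroup()

for.corr
```

group_modify() adds the grouping column back automatically, so the result has gene_symbol, avg.cor and gene_id, like the ddply output.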

Related

In R, iterate over a data.frame by row and get access to the column values

I'm trying to iterate over the rows of a data frame and get access to the values in the columns of each row. Perhaps I need a paradigm shift. I've attempted a vectorization approach. My ultimate objective is to use specific column values in each row to filter another data frame.
Any help would be appreciated.
df <- data.frame(a = 1:3, b = letters[24:26], c = 7:9)
f <- function(row) {
var1 <- row$a
var2 <- row$b
var3 <- row$c
}
pmap(df, f)
Is there a way to do this in purrr?
Using pmap, we can do
library(purrr)
pmap(df, ~ f(list(...)))
#[[1]]
#[1] 7
#[[2]]
#[1] 8
#[[3]]
#[1] 9
Or use rowwise with cur_data (soft-deprecated as of dplyr 1.1.0 in favour of pick)
library(dplyr)
df %>%
rowwise() %>%
transmute(new = f(cur_data()))
-output
# A tibble: 3 x 1
# Rowwise:
# new
# <int>
#1 7
#2 8
#3 9
library(tidyverse)
df <- data.frame(a = 1:3, b = letters[24:26], c = 7:9)
f <- function(row) {
var1 <- row$a
var2 <- row$b
var3 <- row$c
}
df %>%
split(rownames(.)) %>%
map(
~f(.x)
)
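Since the stated goal is to use each row's values to filter another data frame, here is a small sketch of that step with pmap. The second data frame `other` and its contents are made up for illustration; only `df` comes from the question.

```r
library(purrr)

df <- data.frame(a = 1:3, b = letters[24:26], c = 7:9,
                 stringsAsFactors = FALSE)

# Hypothetical lookup table, invented for this example
other <- data.frame(a = c(1, 1, 2, 4),
                    b = c("x", "q", "y", "z"),
                    value = c(10, 20, 30, 40),
                    stringsAsFactors = FALSE)

# pmap passes each row's columns as the named arguments a, b and c
matches <- pmap(df, function(a, b, c) {
  other[other$a == a & other$b == b, ]
})

matches
```

Each list element holds the rows of `other` that match one row of `df`; a row without a match gives a zero-row data frame.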

product of multiple selected columns in a data frame in R

I have a data frame with a subset of variables that starts with 'AA_' (e.g., AA_1, AA_2, ... AA_100) along with other variables X, Y, Z.
If I would like to get the product of all 'AA_' variables, what would be the most efficient way in R to achieve this?
I am thinking something like
mydata = mydata %>%
mutate(AA_product = reduce(starts_with('AA_'), `*`))
but it does not quite work.
Here, starts_with only works inside a selecting function, so we need to select the columns from the data first
library(dplyr)
library(purrr)
mydata %>%
mutate(AA_product = reduce(select(., starts_with( 'AA_')), `*`))
-output
# X Y Z AA_1 AA_2 AA_3 AA_product
#1 1 2 3 1 2 3 6
#2 2 3 4 2 3 4 24
#3 3 4 5 3 4 5 60
Another less efficient approach is rowwise with c_across
mydata %>%
rowwise() %>%
mutate(AA_prod = prod(c_across(starts_with('AA_')))) %>%
ungroup()
data
mydata <- data.frame(X = 1:3, Y = 2:4, Z = 3:5,
AA_1 = 1:3, AA_2 = 2:4, AA_3 = 3:5)
If you want the row-wise product of the "AA_" columns, you can do this in base R with Reduce:
cols <- grep('^AA_', names(mydata)) # anchor the pattern so only names starting with AA_ match
mydata$AA_product <- Reduce(`*`, mydata[cols])
and with apply:
mydata$AA_product <- apply(mydata[cols], 1, prod)

Separate rows by matching two columns in similar pattern

I have data like this:
df1 <- data.frame(A = c("P,Q","X,Y"), B = c("P1,Q1",""), C = c("P2,Q2","X2,Y2"))
I am looking for output like this:
output <- data.frame(A = c("P","Q","X","Y"), B = c("P1","Q1","",""), C = c("P2","Q2","X2","Y2"))
I tried using separate_rows as shown below, but it does not match up the comma-separated strings across columns.
separate_rows(df1, A, sep=",") %>%
separate_rows(B) %>%
separate_rows(C)
I like the splitstackshape package for such operations:
library(splitstackshape)
cSplit(df1, splitCols = names(df1), sep = ',', direction = 'long')
# A B C
#1: P P1 P2
#2: Q Q1 Q2
You simply have to do:
library(tidyr)
separate_rows(df1, A, B, C, convert = TRUE)
Output :
A B C
1 P P1 P2
2 Q Q1 Q2
Edit, if you have NA or empty strings:
data:
df1 <- data.frame(A = c("P,Q","X,Y"), B = c("P1,Q1",""), C =
c("P2,Q2","X2,Y2"))
Code:
df1 <- data.frame(lapply(df1, as.character), stringsAsFactors=FALSE)
df1[df1 == ""] <- "0,0"
df1 <- separate_rows(df1, A, B, C, convert = TRUE)
df1[df1 == "0"] <- ""
Output :
A B C
1 P P1 P2
2 Q Q1 Q2
3 X X2
4 Y Y2
An option using base R with strsplit
data.frame(lapply(df1, function(x) strsplit(as.character(x), ",")[[1]]))
# A B C
#1 P P1 P2
#2 Q Q1 Q2
Or with scan
data.frame(lapply(df1, function(x)
scan(text = as.character(x), what = "", sep=",", quiet = TRUE)))
As suggested by Gainz's answer, separate_rows(df1, A, B, C, convert = T) works really well.
However, if you do have blank cells in the dataframe then it does become harder to use, since it will give you an error about all the columns not having the same number of rows.
I suggest using a column that you know will have no blank values. Let's assume it is column A.
I would first convert the dataframe to a tibble and all factor columns to character columns. Then I would replace each blank cell with a string containing the correct number of commas. After that, separate_rows() should work correctly.
Then the code will look as follows:
library(tidyverse) # provides as_tibble, mutate_if, str_count, map_chr and separate_rows
df1_tibble <- df1 %>%
as_tibble() %>%
mutate_if(is.factor, as.character)
df1_clean <- df1_tibble %>%
mutate(count = str_count(A, ",") + 1) %>%
mutate(temp_str = map_chr(count, ~ rep("", .x) %>% paste0(collapse = ","))) %>%
mutate_at(vars(B, C), funs(ifelse(str_length(.) == 0, temp_str, .))) %>%
select(A, B, C)
df1_clean
#> # A tibble: 2 x 3
#> A B C
#> <chr> <chr> <chr>
#> 1 P,Q P1,Q1 P2,Q2
#> 2 X,Y , X2,Y2
df1_clean %>% separate_rows(A, B, C)
#> # A tibble: 4 x 3
#> A B C
#> <chr> <chr> <chr>
#> 1 P P1 P2
#> 2 Q Q1 Q2
#> 3 X "" X2
#> 4 Y "" Y2
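The same padding idea can be sketched in base R without any packages. This is only an illustration under the same assumption as above (column A never contains blanks, so it decides how many rows each original row expands to); the helper name `split_pad` is made up.

```r
df1 <- data.frame(A = c("P,Q", "X,Y"), B = c("P1,Q1", ""), C = c("P2,Q2", "X2,Y2"),
                  stringsAsFactors = FALSE)

# Number of pieces each original row should expand to, taken from column A
n <- lengths(strsplit(df1$A, ","))

# Hypothetical helper: split one column on commas and pad short pieces with ""
split_pad <- function(col, n) {
  parts <- strsplit(col, ",")   # a blank cell becomes character(0)
  unlist(Map(function(p, k) c(p, rep("", k - length(p))), parts, n))
}

out <- data.frame(lapply(df1, split_pad, n = n), stringsAsFactors = FALSE)
out
```

The result has four rows, with empty strings in B for the rows that came from the blank cell.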

Group data by factor level, then transform to data frame with colname being levels?

Here is a problem that I can't solve:
Data:
df <- data.frame(f1=c("a", "a", "b", "b", "c", "c", "c"),
v1=c(10, 11, 4, 5, 0, 1, 2))
data.frame (f1 is a factor):
f1 v1
a 10
a 11
b 4
b 5
c 0
c 1
c 2
# What I want is, for example, to fetch the levels that have exactly 2 elements and turn them into a data.frame:
a b
10 4
11 5
Thanks in advance!
I might be missing something simple here, but the approach below using dplyr and tidyr works.
library(dplyr)
library(tidyr) # for spread
nlevels = 2
df1 <- df %>%
add_count(f1) %>%
filter(n == nlevels) %>%
select(-n) %>%
mutate(rn = row_number()) %>%
spread(f1, v1) %>%
select(-rn)
This gives
# a b
# <int> <int>
#1 10 NA
#2 11 NA
#3 NA 4
#4 NA 5
Now, if you want to remove NA's we can do
do.call("cbind.data.frame", lapply(df1, function(x) x[!is.na(x)]))
# a b
#1 10 4
#2 11 5
Because we kept only the groups with exactly nlevels observations, every column in the final data frame has the same number of rows.
split might be useful here to split df$v1 into parts corresponding to df$f1. Since you are always extracting equal length chunks, it can then simply be combined back to a data.frame:
spl <- split(df$v1, df$f1)
data.frame(spl[lengths(spl)==2])
# a b
#1 10 4
#2 11 5
Or do it all in one call by combining this with Filter:
data.frame(Filter(function(x) length(x)==2, split(df$v1, df$f1)))
# a b
#1 10 4
#2 11 5
Here is a solution using unstack:
unstack(
droplevels(df[ave(df$v1, df$f1, FUN = function(x) length(x) == 2)==1,]),
v1 ~ f1)
# a b
# 1 10 4
# 2 11 5
A variant, similar to @thelatemail's solution:
data.frame(Filter(function(x) length(x) == 2, unstack(df,v1 ~ f1)))
My tidyverse solution would be:
library(tidyverse)
df %>%
group_by(f1) %>%
filter(n() == 2) %>%
mutate(i = row_number()) %>%
spread(f1, v1) %>%
select(-i)
# # A tibble: 2 x 2
# a b
# * <dbl> <dbl>
# 1 10 4
# 2 11 5
or, mixing approaches:
as_tibble(keep(unstack(df,v1 ~ f1), ~length(.x) == 2))
Using all base functions (but you should use tidyverse)
# Assumption: x is a copy of the question's df (x is not defined anywhere else)
x <- df
# Add count of instances
x$len <- ave(x$v1, x$f1, FUN = length)
# Filter, drop the count
x <- x[x$len==2, c('f1','v1')]
# Hacky pivot
result <- data.frame(
lapply(unique(x$f1), FUN = function(y) x$v1[x$f1==y])
)
colnames(result) <- unique(x$f1)
> result
a b
1 10 4
2 11 5
I would code it like this; maybe it helps:
library(reshape2)
library(dplyr)
aa = data.frame(v1=c('a','a','b','b','c','c','c'),f1=c(10,11,4,5,0,1,2))
cc = aa %>% group_by(v1) %>% summarise(id = length(v1)) # count rows per level
dd = merge(aa,cc) # attach the count to every row
ee = dd[dd$id==2,] # keep the levels that have exactly 2 rows
ee$id = rep(c(1,2),nrow(ee)/2) # reset index like (1,2,1,2)
dcast(ee, id~v1,value.var = 'f1')
all done!
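The answers above use spread, which tidyr has since superseded. A sketch of the same idea with pivot_wider (assuming tidyr 1.0.0 or later) could look like this; the logic is unchanged from the tidyverse answer, only the reshaping call differs.

```r
library(dplyr)
library(tidyr)

df <- data.frame(f1 = c("a", "a", "b", "b", "c", "c", "c"),
                 v1 = c(10, 11, 4, 5, 0, 1, 2),
                 stringsAsFactors = FALSE)

wide <- df %>%
  group_by(f1) %>%
  filter(n() == 2) %>%            # keep levels with exactly 2 rows
  mutate(i = row_number()) %>%    # position within each level, used to line rows up
  ungroup() %>%
  pivot_wider(names_from = f1, values_from = v1) %>%
  select(-i)

wide
```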

Grouping Over All Possible Combinations of Several Variables With dplyr

Given a situation such as the following
library(dplyr)
myData <- tbl_df(data.frame( var1 = rnorm(100),
var2 = letters[1:3] %>%
sample(100, replace = TRUE) %>%
factor(),
var3 = LETTERS[1:3] %>%
sample(100, replace = TRUE) %>%
factor(),
var4 = month.abb[1:3] %>%
sample(100, replace = TRUE) %>%
factor()))
I would like to group myData and eventually compute summary data for every possible combination of var2, var3, and var4.
I can create a list with all possible combinations of variables as character values with
groupNames <- names(myData)[2:4]
myGroups <- Map(combn,
list(groupNames),
seq_along(groupNames),
simplify = FALSE) %>%
unlist(recursive = FALSE)
My plan was to make separate data sets for each variable combination with a for() loop, something like
### This Does Not Work
for (i in 1:length(myGroups)){
assign( myGroups[i]%>%
unlist() %>%
paste0(collapse = "")%>%
paste0("Data"),
myData %>%
group_by_(lapply(myGroups[[i]], as.symbol)) %>%
summarise( n = length(var1),
avgVar2 = var2 %>%
mean()))
}
Admittedly I am not very good with lists, and looking up this issue was a bit challenging since dplyr updates have altered how grouping works a bit.
If there is a better way to do this than separate data sets I would love to know.
I've gotten a loop similar to above working when I am only grouping by a single variable.
Any and all help is greatly appreciated! Thank you!
This seems convoluted, and there's probably a way to simplify or fancy it up with a do, but it works. Using your myData and myGroups,
results = lapply(myGroups, FUN = function(x) {
do.call(what = group_by_, args = c(list(myData), x)) %>%
summarise( n = length(var1),
avgVar1 = mean(var1))
}
)
> results[[1]]
Source: local data frame [3 x 3]
var2 n avgVar1
1 a 31 0.38929738
2 b 31 -0.07451717
3 c 38 -0.22522129
> results[[4]]
Source: local data frame [9 x 4]
Groups: var2
var2 var3 n avgVar1
1 a A 11 -0.1159160
2 a B 11 0.5663312
3 a C 9 0.7904056
4 b A 7 0.0856384
5 b B 13 0.1309756
6 b C 11 -0.4192895
7 c A 15 -0.2783099
8 c B 10 -0.1110877
9 c C 13 -0.2517602
> results[[7]]
# I won't paste them here, but it has all 27 rows, grouped by var2, var3 and var4.
I changed your summarise call to average var1 since var2 isn't numeric.
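group_by_ has since been deprecated. On dplyr 1.0.0 or later, roughly the same loop can be sketched with across(all_of()) instead. The data below is regenerated with a fixed seed so the example is self-contained, which means the numbers will not match the output printed above.

```r
library(dplyr)

set.seed(1)
myData <- data.frame(var1 = rnorm(100),
                     var2 = sample(letters[1:3], 100, replace = TRUE),
                     var3 = sample(LETTERS[1:3], 100, replace = TRUE),
                     var4 = sample(month.abb[1:3], 100, replace = TRUE),
                     stringsAsFactors = FALSE)

groupNames <- c("var2", "var3", "var4")

# All combinations of 1, 2 and 3 grouping variables, as a flat list of character vectors
myGroups <- unlist(
  lapply(seq_along(groupNames), function(k) combn(groupNames, k, simplify = FALSE)),
  recursive = FALSE
)

# One summary table per combination
results <- lapply(myGroups, function(g) {
  myData %>%
    group_by(across(all_of(g))) %>%
    summarise(n = n(), avgVar1 = mean(var1), .groups = "drop")
})

results[[4]]  # grouped by var2 and var3
```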
I have created a function based on the answer of @Gregor and the comments that followed:
library(dplyr) # for tbl_df
library(magrittr)
myData <- tbl_df(data.frame( var1 = rnorm(100),
var2 = letters[1:3] %>%
sample(100, replace = TRUE) %>%
factor(),
var3 = LETTERS[1:3] %>%
sample(100, replace = TRUE) %>%
factor(),
var4 = month.abb[1:3] %>%
sample(100, replace = TRUE) %>%
factor()))
Function combSummarise
combSummarise <- function(data, variables=..., summarise=...){
# Get all different combinations of selected variables (credit to @Michael)
myGroups <- lapply(seq_along(variables), function(x) {
combn(c(variables), x, simplify = FALSE)}) %>%
unlist(recursive = FALSE)
# Group by selected variables (credit to @konvas)
df <- eval(parse(text=paste("lapply(myGroups, function(x){
dplyr::group_by_(data, .dots=x) %>%
dplyr::summarize_( \"", paste(summarise, collapse="\",\""),"\")})"))) %>%
do.call(plyr::rbind.fill,.)
groupNames <- c(myGroups[[length(myGroups)]])
newNames <- names(df)[!(names(df) %in% groupNames)]
df <- cbind(df[, groupNames], df[, newNames])
names(df) <- c(groupNames, newNames)
df
}
Call of combSummarise
combSummarise (myData, var=c("var2", "var3", "var4"),
summarise=c("length(var1)", "mean(var1)", "max(var1)"))
or
combSummarise (myData, var=c("var2", "var4"),
summarise=c("length(var1)", "mean(var1)", "max(var1)"))
or
combSummarise (myData, var=c("var2", "var4"),
summarise=c("length(var1)"))
etc
Inspired by the answers by Gregor and dimitris_ps, I wrote a dplyr style function that runs summarise for all combinations of group variables.
summarise_combo <- function(data, ...) {
groupVars <- group_vars(data) %>% map(as.name)
groupCombos <- map( 0:length(groupVars), ~combn(groupVars, ., simplify=FALSE) ) %>%
unlist(recursive = FALSE)
results <- groupCombos %>%
map(function(x) {data %>% group_by(!!! x) %>% summarise(...)} ) %>%
bind_rows()
results %>% select(!!! groupVars, everything())
}
Example
library(tidyverse)
mtcars %>% group_by(cyl, vs) %>% summarise_combo(cyl_n = n(), mean(mpg))
Using unite to create a new column is the simplest way:
library(tidyverse)
df = tibble(
a = c(1,1,2,2,1,1,2,2),
b = c(3,4,3,4,3,4,3,4),
val = c(1,2,3,4,5,6,7,8)
)
print(df)#output1
df_2 = unite(df, 'combined_header', a, b, sep='_', remove=FALSE) #remove=F doesn't remove existing columns
print(df_2)#output2
df_2 %>% group_by(combined_header) %>%
summarize(avg_val=mean(val)) %>% print()#output3
# avg for 1_3 = mean(c(1, 5)) = 3; avg for 1_4 = mean(c(2, 6)) = 4
RESULTS
Output:
output1
a b val
<dbl> <dbl> <dbl>
1 1 3 1
2 1 4 2
3 2 3 3
4 2 4 4
5 1 3 5
6 1 4 6
7 2 3 7
8 2 4 8
output2
combined_header a b val
<chr> <dbl> <dbl> <dbl>
1 1_3 1 3 1
2 1_4 1 4 2
3 2_3 2 3 3
4 2_4 2 4 4
5 1_3 1 3 5
6 1_4 1 4 6
7 2_3 2 3 7
8 2_4 2 4 8
output3
combined_header avg_val
<chr> <dbl>
1 1_3 3
2 1_4 4
3 2_3 5
4 2_4 6
