I have a list of lists of dataframes:
library(dplyr)
library(magrittr)
a <- list(first = data.frame(x=runif(1), y=runif(1)),
second = data.frame(x=runif(5), y=runif(5)))
b <- list(first = data.frame(x=runif(1), y=runif(1)),
second = data.frame(x=runif(5), y=runif(5)))
a <- a %>% set_names(1:length(a))
b <- b %>% set_names(1:length(b))
c <- list(a, b)
c <- c %>% set_names(1:length(c))
I want to assign the two levels of list names as new columns to the dataframe, and then bind them into one dataframe. The desired output is something like:
x y name1 name2
.23 .43 1 1
.23 .43 1 2
.23 .43 2 1
.23 .43 2 2
The values of x and y are not the point here. I am struggling with this because lapply does not expose the names of the list elements to the function it calls.
Thanks.
Maybe this helps:
library(reshape2)
library(tidyr)
library(dplyr)
res <- melt(c) %>%
group_by(variable) %>%
mutate(indx=row_number()) %>%
spread(variable, value) %>%
ungroup() %>%
select(-indx)
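As an alternative sketch, the list names can also be captured with the .id argument of dplyr's bind_rows, applied once per nesting level. The fixed numbers below are hypothetical stand-ins for the runif() values in the question, and the name `nested` is used only for illustration (it avoids masking base::c).

```r
library(dplyr)

# Hypothetical fixed data standing in for the runif() values in the question
a <- list(`1` = data.frame(x = 0.1, y = 0.2),
          `2` = data.frame(x = c(0.3, 0.4), y = c(0.5, 0.6)))
b <- list(`1` = data.frame(x = 0.7, y = 0.8),
          `2` = data.frame(x = c(0.9, 1.0), y = c(1.1, 1.2)))
nested <- list(`1` = a, `2` = b)

# The inner bind_rows() stores the inner list names in name2;
# the outer bind_rows() stores the outer list names in name1
res <- nested %>%
  lapply(bind_rows, .id = "name2") %>%
  bind_rows(.id = "name1")

res
```

Both name columns come back as character vectors, so convert them with as.integer() if numeric labels are needed.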
this is my first post here :)
So I encountered some weird behavior today: When using the dplyr mutate function together with the paste function, the outcome is the same for every row.
Here is an example:
vec1 <- c(2, 5)
vec2 <- c(4, 6)
test_df <- data.frame(vec1, vec2)
test_df %>% mutate(new_col = paste(vec1:vec2, collapse = ","))
with the output
vec1 vec2 new_col
1 2 4 2,3,4
2 5 6 2,3,4
but that's not what I wanted or expected.
Here is what I wanted, achieved with a loop:
df <- test_df %>% mutate(new_col = 1)
for(i in 1:nrow(test_df)){
df$new_col[i] <- paste(df$vec1[i]:df$vec2[i], collapse = ",")
}
With the output:
vec1 vec2 new_col
1 2 4 2,3,4
2 5 6 5,6
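What's going on and how can I achieve the same with mutate and paste?
mutate evaluates the expression once on the whole columns, and the `:` operator only uses the first element of each side, so every row gets the sequence 2:4. We can instead loop over the paired elements of vec1 and vec2 with map2, and paste (str_c) each sequence into a single string: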
library(dplyr)
library(purrr)
library(stringr)
test_df %>%
mutate(new_col = map2_chr(vec1, vec2, ~ str_c(.x:.y, collapse = ",")))
-output
vec1 vec2 new_col
1 2 4 2,3,4
2 5 6 5,6
Or with rowwise
test_df %>%
rowwise %>%
mutate(new_col = str_c(vec1:vec2, collapse = ",")) %>%
ungroup
# A tibble: 2 × 3
vec1 vec2 new_col
<dbl> <dbl> <chr>
1 2 4 2,3,4
2 5 6 5,6
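For comparison, here is a minimal base R sketch of the same per-row idea using mapply, which needs no extra packages. The argument names `from` and `to` are only illustrative.

```r
test_df <- data.frame(vec1 = c(2, 5), vec2 = c(4, 6))

# `:` applied to whole columns would only use the first element of each side,
# so build one sequence per row by walking vec1 and vec2 in parallel
test_df$new_col <- mapply(
  function(from, to) paste(from:to, collapse = ","),
  test_df$vec1, test_df$vec2
)

test_df
```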
Suppose we have a data frame (df) like this:
a b
1 2
2 4
3 6
If we want to compute the element-wise ratio of vectors a and b and assign it to variable c, we'd do this:
c <- df$a / df$b
However, I was wondering how the same thing could be done using the dplyr package? I.e. are there any ways that this can be achieved using functions from dplyr?
Maybe you can try the code below:
df %>%
mutate(c = do.call("/", .))
or
df %>%
mutate(c = Reduce("/", .))
or
df %>%
mutate(c = a/b)
An option with invoke (note that invoke is deprecated as of purrr 1.0.0):
library(dplyr)
library(purrr)
df %>%
mutate(c = invoke('/', .))
-output
# a b c
#1 1 2 0.5
#2 2 4 0.5
#3 3 6 0.5
data
df <- data.frame(a = c(1,2,3), b= c(2,4,6))
You can use the mutate function from the dplyr package:
df <- data.frame(a = c(1,2,3), b= c(2,4,6))
library(dplyr)
df <- df %>%
dplyr::mutate(c = a/b)
Console output:
a b c
1 1 2 0.5
2 2 4 0.5
3 3 6 0.5
I have a data frame with a subset of variables that starts with 'AA_' (e.g., AA_1, AA_2, ... AA_100) along with other variables X, Y, Z.
If I would like to get the product of all 'AA_' variables, what would be the most efficient way in R to achieve this?
I am thinking something like
mydata = mydata %>%
mutate(AA_product = reduce(starts_with('AA_'), `*`))
but it does not quite work
starts_with only works inside a selecting function, so here we need to select the columns from the data first:
library(dplyr)
library(purrr)
mydata %>%
mutate(AA_product = reduce(select(., starts_with( 'AA_')), `*`))
-output
# X Y Z AA_1 AA_2 AA_3 AA_product
#1 1 2 3 1 2 3 6
#2 2 3 4 2 3 4 24
#3 3 4 5 3 4 5 60
Another, less efficient, approach is rowwise with c_across:
mydata %>%
rowwise() %>%
mutate(AA_product = prod(c_across(starts_with('AA_')))) %>%
ungroup()
data
mydata <- data.frame(X = 1:3, Y = 2:4, Z = 3:5,
AA_1 = 1:3, AA_2 = 2:4, AA_3 = 3:5)
If you want the row-wise product of the "AA_" columns, you can do this in base R with Reduce:
cols <- grep('^AA_', names(mydata))
mydata$AA_product <- Reduce(`*`, mydata[cols])
or with apply:
mydata$AA_product <- apply(mydata[cols], 1, prod)
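As a quick sanity check, the sketch below runs both base R approaches on the small sample data from this thread and compares them. The pattern is anchored with `^` here as an assumption, so that only names that start with "AA_" match; `by_reduce` and `by_apply` are illustrative names.

```r
mydata <- data.frame(X = 1:3, Y = 2:4, Z = 3:5,
                     AA_1 = 1:3, AA_2 = 2:4, AA_3 = 3:5)

# Positions of the columns whose names start with "AA_"
cols <- grep("^AA_", names(mydata))

# Reduce multiplies the columns element-wise, one column at a time
by_reduce <- Reduce(`*`, mydata[cols])

# apply converts to a matrix and calls prod() once per row
by_apply <- apply(mydata[cols], 1, prod)

# Compare with all.equal(): Reduce keeps integers, prod() returns doubles
all.equal(by_reduce, unname(by_apply))
```

Reduce works on whole columns, so it avoids a function call per row, which is why it tends to scale better than apply on long data frames.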
I have a dataframe like this:
tmp <- read.table(header = T, text = "gene_id gene_symbol ensembl_id keep val1 val2 val3
x a Multiple Yes 1 2 3
x1 a Multiple No 2 3 4
x2 a Multiple No 1 4 3
y b Multiple Yes 22 20 12
y1 b Multiple No 98 7 97
y2 b Multiple No 8 76 6")
I am trying to group by the gene_symbol variable, calculate the correlation between the row with keep == "Yes" and every other row (keep == "No"), and return the average correlation along with the gene_symbol and gene_id. This is the function:
# function to calculate avg. correlation
calc.mean.corr <- function(x){
gene.id <- x[which(x$keep == "Yes"),"gene_id"]
x1 <- x %>%
filter(keep == "Yes") %>%
select(-c(gene_id, gene_symbol, ensembl_id, keep)) %>%
as.numeric()
x2 <- x %>%
filter(keep == "No") %>%
select(-c(gene_id, gene_symbol, ensembl_id, keep))
# correlation of kept id with discarded ids
cor <- mean(apply(x2, 1, FUN = function(y) cor(x1, y)))
cor <- round(cor, digits = 2)
df <- data.frame(avg.cor = cor, gene_id = gene.id)
return(df)
}
# call using ddply
for.corr <- plyr::ddply(tmp, .variables = "gene_symbol", .fun = function(x) calc.mean.corr(x))
The final output looks like this:
> for.corr
gene_symbol avg.cor gene_id
1 a 0.83 x
2 b 0.02 y
I am using plyr::ddply for this but want to use dplyr instead. However, I am not sure how to convert it to dplyr format. Any help would be much appreciated.
If we don't want to change the function, one option is to do a group_split and apply the function to each group:
library(dplyr)
library(purrr)
tmp %>%
group_split(gene_symbol) %>%
map_dfr(calc.mean.corr)
To include the gene_symbol, split on it and pass the list names through .id:
tmp %>%
split(.$gene_symbol) %>%
map_dfr(~ calc.mean.corr(.), .id = 'gene_symbol')
# gene_symbol avg.cor gene_id
#1 a 0.83 x
#2 b 0.02 y
I'll illustrate my question with an example.
Sample data:
df <- data.frame(ID = c(1, 1, 2, 2, 3, 5), A = c("foo", "bar", "foo", "foo", "bar", "bar"), B = c(1, 5, 7, 23, 54, 202))
df
ID A B
1 1 foo 1
2 1 bar 5
3 2 foo 7
4 2 foo 23
5 3 bar 54
6 5 bar 202
What I want to do is to summarize, by ID, the sum of B and the sum of B when A is "foo". I can do this in a couple steps like:
require(magrittr)
require(dplyr)
df1 <- df %>%
group_by(ID) %>%
summarize(sumB = sum(B))
df2 <- df %>%
filter(A == "foo") %>%
group_by(ID) %>%
summarize(sumBfoo = sum(B))
left_join(df1, df2)
ID sumB sumBfoo
1 1 6 1
2 2 30 30
3 3 54 NA
4 5 202 NA
However, I'm looking for a more elegant/faster way, as I'm dealing with 10 GB+ of out-of-memory data in SQLite.
require(sqldf)
my_db <- src_sqlite("my_db.sqlite3", create = T)
df_sqlite <- copy_to(my_db, df)
I thought of using mutate to define a new Bfoo column:
df_sqlite %>%
mutate(Bfoo = ifelse(A=="foo", B, 0))
Unfortunately, this doesn't work on the database end of things.
Error in sqliteExecStatement(conn, statement, ...) :
RS-DBI driver: (error in statement: no such function: IFELSE)
You can do both sums in a single dplyr statement:
df1 <- df %>%
group_by(ID) %>%
summarize(sumB = sum(B),
sumBfoo = sum(B[A=="foo"]))
And here is a data.table version:
library(data.table)
dt = setDT(df)
dt1 = dt[ , .(sumB = sum(B),
sumBfoo = sum(B[A=="foo"])),
by = ID]
dt1
ID sumB sumBfoo
1: 1 6 1
2: 2 30 30
3: 3 54 0
4: 5 202 0
Writing up @hadley's comment as an answer:
df_sqlite %>%
group_by(ID) %>%
mutate(Bfoo = if(A=="foo") B else 0) %>%
summarize(sumB = sum(B),
sumBfoo = sum(Bfoo)) %>%
collect
If you want to count rows instead of summing values, the code changes only slightly: use n() for the group size and sum() over the logical condition for the conditional count.
df1 <- df %>%
group_by(ID) %>%
summarize(countB = n(),
countBfoo = sum(A=="foo"))
df1
Source: local data frame [4 x 3]
ID countB countBfoo
1 1 2 1
2 2 2 2
3 3 1 0
4 5 1 0
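The conditional count works because a logical vector is treated as 0/1 when summed, so sum(A == "foo") counts the TRUE values. A minimal base R sketch of the same idea, using tapply in place of group_by/summarize (the variable names are illustrative only):

```r
df <- data.frame(ID = c(1, 1, 2, 2, 3, 5),
                 A = c("foo", "bar", "foo", "foo", "bar", "bar"),
                 B = c(1, 5, 7, 23, 54, 202))

# TRUE where the row is a "foo" row
is_foo <- df$A == "foo"

# Summing a logical vector counts its TRUE values per ID
count_foo <- tapply(is_foo, df$ID, sum)

# Multiplying by the logical zeroes out non-"foo" rows before summing
sum_b_foo <- tapply(df$B * is_foo, df$ID, sum)

count_foo
sum_b_foo
```

Groups without any "foo" rows come out as 0 here rather than NA, which matches the sum(B[A=="foo"]) answers above but differs from the left_join approach in the question.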
If you wanted to count the rows instead of summing them, can you pass a variable to the function, like this?
df1 <- df %>%
group_by(ID) %>%
summarize(RowCountB = n(),
RowCountBfoo = n(A=="foo"))
I get an error both with n() and nrow().