I have a data.frame which maps an id column to a group column, and the id column is not unique because the same id can map to multiple groups:
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10,300,replace = T)), group = c(rep("A",100), rep("B",100), rep("C",100)), stringsAsFactors = F)
I'd like to convert this data.frame into a list where each element is the ids in each group.
This seems a bit slow for the size of data I'm working with:
library(dplyr)
df.list <- lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id)
So I was thinking about this:
df.list <- df %>%
dplyr::group_by(group) %>%
dplyr::group_split()
Assuming it is faster than my first option, any idea how to get it to return the same output as in the first option rather than a list of data.frames?
In base R, split does this directly. It should be faster than filtering with == once per unique group, because the data is partitioned in a single pass:
with(df, split(id, group))
Or, with the tidyverse, we can pull the column after group_split. group_split returns a list of tibbles, so it is likely to be slower than the split-only method above. We can recover some performance by dropping the group column (.keep = FALSE; older dplyr releases spell this argument keep) and then pulling the 'id' column from each list element to get a list of vectors
library(dplyr)
library(purrr)
df %>%
group_split(group, .keep = FALSE) %>%
map(~ .x %>%
pull(id))
Or use {} with pipe
df %>%
{split(.$id, .$group)}
Or wrap with with
df %>%
with(., split(id, group))
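One difference worth noting: split returns a named list, whereas the lapply version in the question returns an unnamed one. The following is a minimal base R check, assuming the df from the question and using plain subsetting as a stand-in for dplyr::filter. It is a sketch for comparing the two outputs, not a benchmark.

```r
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10, 300, replace = TRUE)),
                 group = rep(c("A", "B", "C"), each = 100),
                 stringsAsFactors = FALSE)

# Original approach, with base subsetting standing in for dplyr::filter
by_filter <- lapply(unique(df$group), function(g) df$id[df$group == g])

# split-based approach: a named list, one character vector per group
by_split <- with(df, split(id, group))

# Dropping the names makes the two results directly comparable
same_contents <- identical(unname(by_split), by_filter)
same_contents
```

Note that split orders the groups by sorted level while unique keeps the order of first appearance. The two agree for this data, but that is a property of the example rather than a guarantee.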
I have two data frames similar to this:
df<-data.frame("A1"=c(1,2,3), "A2"=c(3,4,5), "A3"=c(6,7,8), "B1"=c(3,4,5))
ref_df<-data.frame("Name"=c("A1","A2","A3","B1"),code=c("Blue" ,"Blue","Green","Green"))
I would like to sum the values in the columns of df based on the code in the ref_df. I would like to store the results in a new data frame with column names matching the code in the ref_df
That is, I would like a new data frame with Blue and Green as columns, where the values are the sums A1+A2 and A3+B1 respectively, like this one:
result<-data.frame("Blue"=c(4,6,8), "Green"=c(9,11,13))
There are lots of posts on summing columns based on conditions, but after a morning of research I cannot find anything that solves my exact problem.
We can split the columns in df based on values in ref_df$code and then take row-wise sum.
sapply(split.default(df, ref_df$code), rowSums)
# Blue Green
#[1,] 4 9
#[2,] 6 11
#[3,] 8 13
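If the rows of ref_df do not follow the same order as the column names of df, reorder them first. match(names(df), ref_df$Name) gives, for each column of df, the row of ref_df that describes it.
ref_df <- ref_df[match(names(df), ref_df$Name),]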
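A minimal sketch of the reordering step, assuming a ref_df whose rows have been deliberately shuffled (the shuffled order and the name ref_sorted are illustrative, not from the question). The lookup is written as match(names(df), ref_df$Name), so that row i of the reordered table describes column i of df before the columns are split.

```r
df <- data.frame(A1 = c(1, 2, 3), A2 = c(3, 4, 5), A3 = c(6, 7, 8), B1 = c(3, 4, 5))

# Same mapping as in the question, but with the rows in a different order
ref_df <- data.frame(Name = c("B1", "A3", "A1", "A2"),
                     code = c("Green", "Green", "Blue", "Blue"),
                     stringsAsFactors = FALSE)

# Align the rows of ref_df with the columns of df
ref_sorted <- ref_df[match(names(df), ref_df$Name), ]

# Split the columns by code and sum each group row-wise
res <- sapply(split.default(df, ref_sorted$code), rowSums)

# sapply returns a matrix; convert it if a data frame is needed
result <- as.data.frame(res)
result
```

Skipping the reordering here would pair A1 with "Green", so the group sums would be computed silently over the wrong columns.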
We can use tidyverse
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn, names_to = 'Name') %>%
left_join(ref_df, by = 'Name') %>%
group_by(code, rn) %>%
summarise(Sum = sum(value)) %>%
pivot_wider(names_from = code, values_from = Sum) %>% select(-rn)
My dataset looks something like this:
df <- data.frame(compound = c("alanine ", "arginine", "asparagine", "aspartate"))
df <- matrix(rnorm(12*4), ncol = 12)
colnames(df) <- c("AC-1", "AC-2", "AC-3", "AM-1", "AM-2", "AM-3", "SC-1", "SC-2", "SC-3", "SM-1", "SM-2", "SM-3")
df <- data.frame(compound = c("alanine ", "arginine", "asparagine", "aspartate"), df)
df
compound AC.1 AC.2 AC.3 AM.1 AM.2 AM.3 SC.1 SC.2 SC.3 SM.1
1 alanine 1.18362683 -2.03779314 -0.7217692 -1.7569264 -0.8381042 0.06866567 0.2327702 -1.1558879 1.2077454 0.437707310
2 arginine -0.19610110 0.05361113 0.6478384 -0.1768597 0.5905398 -0.67945600 -0.2221109 1.4032349 0.2387620 0.598236199
3 asparagine 0.02540509 0.47880021 -0.1395198 0.8394257 1.9046667 0.31175358 -0.5626059 0.3596091 -1.0963363 -1.004673116
4 aspartate -1.36397906 0.91380826 2.0630076 -0.6817453 -0.2713498 -2.01074098 1.4619707 -0.7257269 0.2851122 -0.007027878
I want to perform a t-test for each row (compound) on the columns [2:4] as one, and [5:7] as one, and store all the p-values. Basically see if there is a difference between the AC group and AM group for each compound.
I am aware there is another topic on this, but I couldn't find a viable solution for my problem there.
PS. my real dataset has about 35000 rows (maybe it needs a different solution than only 4 rows)
After selecting the columns of interest, use pmap to apply the t.test on each row by selecting the first 3 and next 3 observations as input to t.test and bind the extracted 'p value' as another column in the original data
library(tidyverse)
df %>%
select(AC.1:AM.3) %>%
pmap_dbl(~ c(...) %>%
{t.test(.[1:3], .[4:6])$p.value}) %>%
bind_cols(df, pval_AC_AM = .)
Or after selecting the columns, do a gather to convert to 'long' format, spread, apply the t.test in summarise and join with the original data
df %>%
select(compound, AC.1:AM.3) %>%
gather(key, val, -compound) %>%
separate(key, into = c('key1', 'key2')) %>%
spread(key1, val) %>%
group_by(compound) %>%
summarise(pval_AC_AM = t.test(AC, AM)$p.value) %>%
right_join(df)
Update
If there are rows where the values are constant, t.test throws an error. One option is to return NA for those cases instead, which can be done with purrr's possibly
posttest <- possibly(function(x, y) t.test(x, y)$p.value, otherwise = NA)
df %>%
select(AC.1:AM.3) %>%
pmap_dbl(~ c(...) %>%
{posttest(.[1:3], .[4:6])}) %>%
bind_cols(df, pval_AC_AM = .)
posttest(rep(3,5), rep(1, 5))
#[1] NA
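If purrr is not available, roughly the same behaviour can be sketched in base R with tryCatch. The helper name safe_p is hypothetical, and this is an approximation of what possibly does, not its implementation.

```r
# Return the p-value, or NA if t.test stops with an error
safe_p <- function(x, y) {
  tryCatch(t.test(x, y)$p.value, error = function(e) NA_real_)
}

# Constant data: t.test would stop, so NA is returned instead
p_constant <- safe_p(rep(3, 5), rep(1, 5))

# Ordinary data: the usual Welch p-value comes back
p_regular <- safe_p(c(1, 2, 3), c(4, 5, 7))

c(p_constant, p_regular)
```

Returning NA_real_ rather than a bare NA keeps the result numeric, which matters when the helper is used inside pmap_dbl or vapply.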
If you can use an external library:
library(matrixTests)
row_t_welch(df[,2:4], df[,5:7])$pvalue
[1] 0.67667626 0.39501003 0.26678161 0.01237438
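For a data set with tens of thousands of rows, the Welch test can also be computed for all rows at once in base R, avoiding one t.test call per row. The function below is a sketch under the assumption that there are no missing values. The name row_welch_p is hypothetical, and the example matrix is simulated rather than taken from the question.

```r
# Two-sided Welch t-test p-values, one per row of x versus the same row of y
row_welch_p <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  nx <- ncol(x)
  ny <- ncol(y)
  mx <- rowMeans(x)
  my <- rowMeans(y)
  # Row-wise sample variances (the row means recycle down the columns)
  vx <- rowSums((x - mx)^2) / (nx - 1)
  vy <- rowSums((y - my)^2) / (ny - 1)
  se2 <- vx / nx + vy / ny
  tstat <- (mx - my) / sqrt(se2)
  # Welch-Satterthwaite degrees of freedom
  dof <- se2^2 / ((vx / nx)^2 / (nx - 1) + (vy / ny)^2 / (ny - 1))
  2 * pt(-abs(tstat), dof)
}

# Compare against t.test on simulated data
set.seed(42)
m <- matrix(rnorm(6 * 5), ncol = 6)
p_fast <- row_welch_p(m[, 1:3], m[, 4:6])
p_ref <- sapply(seq_len(nrow(m)), function(i) t.test(m[i, 1:3], m[i, 4:6])$p.value)
all.equal(p_fast, p_ref)
```

With the data frame from the question the call would be row_welch_p(df[, 2:4], df[, 5:7]). Rows that are constant in both groups give NaN here instead of an error.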
library(tidyverse)
I have two dataframes (see sample code at bottom) called Df1 and Df2. I want to find phone numbers in Df1 (from all the columns) that are not in any of the phone number columns in Df2.
First, I restructure Df1 so that there is only one Id per row.
Df1<-Df1 %>%
gather(key, value, -Id) %>%
filter(!is.na(value)) %>%
select(-key) %>%
group_by(Id) %>%
filter(!duplicated(value)) %>%
mutate(Phone=paste0("Phone_",1:n())) %>%
spread(Phone, value)
Next, I rename Df2 and then use a join to find only Ids in Df1 that are in Df2.
Df2<-Df2%>%set_names(c("Id","Ph1","Ph2"))
DfJoin<-left_join(Df2,Df1,by="Id")
This is where I'm stuck. I want to find all the numbers in Df1 (Phone_1, Phone_2, and Phone_3) that are not in Df2 (Ph1 and Ph2). Below are some ideas for code. I tried many variations of this idea but could not find a way to achieve what I want. The final product should just be a table with the phone number(s) in any Df1 column that are not in any Df2 column, together with the associated Id. I'm also wondering whether another join or set operation would achieve this more efficiently.
DfJoin<-DfJoin%>%mutate(New=if_else(! DfJoin[2:3] %in% DfJoin[4:6]),1,0)
DfJoin<-DfJoin%>%filter(! DfJoin[2:3] %in% DfJoin[2:4])
Sample Data:
Dataframe 1:
Id<-c(199,148,148,145,177,165,144,121,188,188,188,111)
Ph1<-c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
Ph2<-c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
Df1<-data.frame(Id,Ph1,Ph2)
Dataframe 2:
Id2<-c(199,148,142,145,177,165,144,121,182,109,188,111)
Phone1<-c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554400,8764443344,6453348736)
Phone2<-c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
Df2<-data.frame(Id2,Phone1,Phone2)
One way to think about this problem:
You have a set of phone numbers in df1 for each ID number.
You have a set of phone numbers in df2 for each ID number.
You want to find, within each ID, the set difference between df1 and df2.
You can do this by mapping the base R function setdiff() onto your joined dataframe. To do this, you need to convert your data frames into list-column format, where all the phone numbers for each ID are present as a list in a "cell" of the dataframe. This is easily done by combining group_by(), summarize() and list().
# create example data
Id <- c(199,148,148,145,177,165,144,121,188,188,188,111)
ph1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
ph2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df1 <- data.frame(Id, ph1, ph2)
Id2 <- c(199,148,142,145,177,165,144,121,182,109,188,111)
phone1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554400,8764443344,6453348736)
phone2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df2 <- data.frame(Id=Id2, phone1, phone2)
# convert the data to list-column format
df1.listcol <- df1 %>%
gather(col, phone, -Id) %>%
na.omit() %>%
group_by(Id) %>%
summarize(phone_list1 = list(phone))
df2.listcol <- df2 %>%
gather(col, phone, -Id) %>%
na.omit() %>%
group_by(Id) %>%
summarize(phone_list2 = list(phone))
Take a look at these dataframes to make sure you understand how we've reformatted them. Obviously, we could save a few lines of code by making this conversion process into a function, and then calling the function on each of df1 and df2, but I didn't do that here.
# join the two listcol dfs by Id, then map setdiff on the two columns
result <-
df1.listcol %>%
left_join(df2.listcol, by='Id') %>%
mutate(only_list_1 = map2(phone_list1, phone_list2, ~setdiff(.x, .y))) %>%
select(Id, only_list_1) %>%
unnest(cols = only_list_1)
result
The result is
Id only_list_1
148 6541132112
188 7890986543
188 6785554444
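On the question of whether a join-like operation could do this more directly: one option is to stack both data frames into long format (one row per Id and phone number) and keep the Df1 pairs that never occur in Df2, which is the same idea as an anti-join on Id and phone. The base R sketch below assumes the example data above. The helper to_long and the names long1, long2 and only_df1 are hypothetical.

```r
Id <- c(199,148,148,145,177,165,144,121,188,188,188,111)
ph1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,
         6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
ph2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df1 <- data.frame(Id, ph1, ph2)

Id2 <- c(199,148,142,145,177,165,144,121,182,109,188,111)
phone1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,
            6451123425,6783450101,7890986543,6785554400,8764443344,6453348736)
phone2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df2 <- data.frame(Id = Id2, phone1, phone2)

# Stack every non-Id column under one 'phone' column, then drop NAs and duplicates
to_long <- function(d) {
  phone_cols <- setdiff(names(d), "Id")
  long <- data.frame(Id = rep(d$Id, times = length(phone_cols)),
                     phone = unlist(d[phone_cols], use.names = FALSE))
  unique(na.omit(long))
}

long1 <- to_long(df1)
long2 <- to_long(df2)

# Keep the (Id, phone) pairs from df1 that do not appear in df2
in_df2 <- paste(long1$Id, long1$phone) %in% paste(long2$Id, long2$phone)
only_df1 <- long1[!in_df2, ]
only_df1
```

Because the comparison is on the pair of Id and phone, a number listed under a different Id in df2 (such as 6541132112 under Id 142) still counts as missing for Id 148.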
Have you tried anti_join(a, b, by = "x1")?
This returns all rows in a that have no matching x1 value in b
DfJoin <- anti_join(Df1, Df2, by = "Id")
tidyr_dplyr cheatsheet
Use the above cheatsheet for data manipulation in tidyverse
I have a data.frame where I assign each column.name a vector of variables:
dat1 <- data.frame(a=1:5,b=1:5,c=1:5)
I want to create a new data.frame but instead of assigning each column individually, I want to assign them all at once. For example, if I wanted to rename them all:
dat.new <- data.frame(paste(names(dat1),'1',sep='') = dat1)
This obviously doesn't work. Is there a way to make it work?
I understand I can just rename using names(), but the scenario where this actually seems useful is if combining multiple data sets that share the same col.names (and in which I don't want to simply rbind):
dat1 <- data.frame(a=1:5,b=1:5,c=1:5)
dat2 <- data.frame(a=6:10,b=6:10,c=6:10)
dat.new <- data.frame(paste(names(dat1),'1',sep='') = dat1, paste(names(dat1),'2',sep='') = dat2)
library(dplyr)
library(tidyr)
library(magrittr)
Ok, here's the first part:
dat.new <-
dat1 %>%
setNames(names(.) %>%
paste0("1") )
Here's the second part. The reshaping is a bit complex but more flexible, especially if the data sets already have row ids or differ in their number of rows:
list(dat1, dat2) %>%
bind_rows(.id = "number") %>%
group_by(number) %>%
mutate(id = 1:n()) %>%
gather(variable, value, -number, -id) %>%
unite(new_variable, variable, number) %>%
spread(new_variable, value)
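When both data sets have the same number of rows in the same order, a shorter base R route may be enough: rename each one with setNames and bind them side by side. This is a sketch under that assumption, not a replacement for the more general reshaping above.

```r
dat1 <- data.frame(a = 1:5, b = 1:5, c = 1:5)
dat2 <- data.frame(a = 6:10, b = 6:10, c = 6:10)

# setNames returns a renamed copy, so it can be used inline
dat.new <- cbind(setNames(dat1, paste0(names(dat1), "1")),
                 setNames(dat2, paste0(names(dat2), "2")))

names(dat.new)
```

Unlike the gather/spread version, this pairs rows purely by position, so it does not handle data sets of different lengths or with differing row ids.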