I have a data frame of 58207 x 6, produced from different combinations of values. Using tidyverse, I grouped by the first column and used do() to give each unique value of column 1 its own data frame built from columns 3 to 6. However, I cannot figure out how to do the same for column 2, with the difference that I only need the unique values stored in a list, not the repeats.
Here is the head of the data frame.
# A tibble: 58,207 x 6
id pfam go_id name nmspace linkage_type
<chr> <fct> <fct> <fct> <fct> <fct>
1 O00273_~ PF020~ GO:000~ cytoplasm cellular_compo~ IEA
2 O00273_~ PF020~ GO:000~ cytosol cellular_compo~ IDA
3 O00273_~ PF020~ GO:000~ plasma membrane cellular_compo~ IDA
4 O00273_~ PF020~ GO:000~ nuclear chromatin cellular_compo~ IDA
5 O00273_~ PF020~ GO:000~ apoptotic process biological_pro~ IEA
6 O00273_~ PF020~ GO:000~ protein binding molecular_func~ IPI
Any suggestions on how to get the levels() value of the second column for each group_by(id) and store them in a list corresponding to the id would be appreciated.
I am new to this, so if you have any suggestions on how to handle data like this, please let me know. Basically, I'm hoping to do comparisons between different IDs afterwards.
Does this work ok for you?
# dummy data, using data.table package, converting from tibble
library(data.table)
library(tibble)
library(gtools)
df <- tibble(id = rep(c("id1", "id2", "id3"), each=3),
X1 = c("a", "f", "b",
"b", "a", "e",
"a", "f", "f"))
dt <- as.data.table(df)
dt[]
# retaining data structure
out1 <- dt[, .(unique.X1 = unique(X1)), by = id]
out1[]
# as a list
out2 <- dt[, .(unique.X1 = list(unique(X1))), by = id]
out2[]
# back to original format
out2.df <- as_tibble(out2) # as.tibble() is deprecated in favour of as_tibble()
out2.df
# EDIT: getting unique combinations
ids <- unique(df$id)
lookup <- as.data.table(gtools::combinations(length(ids), 2))
lookup[, V1 := ids[lookup$V1]][, V2 := ids[lookup$V2]]
setnames(lookup, c("V1", "V2"), c("ID1", "ID2"))
lookup[, index := .I]
setkey(dt, id)
joined <- lookup[, .(intersect = list(intersect(dt[J(ID1), X1], dt[J(ID2), X1]))), by=index]
out <- merge(joined, lookup, by="index")
out[, index := NULL]
out[]
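For readers who would rather not add a dependency, the same idea (one vector of unique values per id, then pairwise comparison) can be sketched in base R. This is only a minimal sketch on the dummy data above, not the answer's method; `unique_by_id` and `shared` are names made up for the illustration, and a factor column such as pfam would need as.character() first.

```r
# Dummy data mirroring the example above (illustrative values only)
df <- data.frame(
  id = rep(c("id1", "id2", "id3"), each = 3),
  X1 = c("a", "f", "b",
         "b", "a", "e",
         "a", "f", "f"),
  stringsAsFactors = FALSE
)

# split() cuts X1 into one vector per id; unique() drops the repeats,
# giving a named list with one element per id
unique_by_id <- lapply(split(df$X1, df$id), unique)

# Comparing two ids is then a set operation on two list elements
shared <- intersect(unique_by_id$id1, unique_by_id$id2)
```

Because the result is a plain named list, any two ids can be compared with intersect(), setdiff() or union() without rebuilding it.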
I have a vector containing "potential" column names:
col_vector <- c("A", "B", "C")
I also have a data frame, e.g.
library(tidyverse)
df <- tibble(A = 1:2,
B = 1:2)
My goal now is to create all columns mentioned in col_vector that don't yet exist in df.
For the above example, my code below works:
df %>%
mutate(!!sym(setdiff(col_vector, colnames(.))) := NA)
# A tibble: 2 x 3
A B C
<int> <int> <lgl>
1 1 1 NA
2 2 2 NA
The problem is that this code fails as soon as a) more than one column from col_vector is missing or b) no column from col_vector is missing. I thought about some sort of if_else, but I don't know how to make the column creation conditional in such a way, preferably in a tidyverse way. I know I can just create a loop going through all the missing columns, but I'm wondering if there is a more direct approach.
Example data where code above fails:
df2 <- tibble(A = 1:2)
df3 <- tibble(A = 1:2,
B = 1:2,
C = 1:2)
This should work.
df[,setdiff(col_vector, colnames(df))] <- NA
Solution
This base operation might be simpler than a full-fledged dplyr workflow:
library(tidyverse) # For tibble(); setdiff() is also available in base R.
# ...
# Code to generate 'df'.
# ...
# Find the subset of missing names, and create them as columns filled with 'NA'.
df[, setdiff(col_vector, names(df))] <- NA
# View results
df
Results
Given your sample col_vector and df here
col_vector <- c("A", "B", "C")
df <- tibble(A = 1:2, B = 1:2)
this solution should yield the following results:
# A tibble: 2 x 3
A B C
<int> <int> <lgl>
1 1 1 NA
2 2 2 NA
Advantages
An advantage of my solution, over the alternative linked above by @geoff, is that you need not hand-code the set of column names as symbols and strings within the dplyr workflow.
df %>% mutate(
#####################################
A = ifelse("A" %in% names(.), A, NA),
B = ifelse("B" %in% names(.), B, NA),
C = ifelse("C" %in% names(.), C, NA)
# ...
# etc.
#####################################
)
My solution is by contrast more dynamic
##############################
df[, setdiff(col_vector, names(df))] <- NA
##############################
if you ever decide to change (or even dynamically calculate!) your variable names midstream, since it determines the setdiff() at runtime.
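To check the edge cases raised in the question (several columns missing, or none missing), the assignment can be wrapped in a small helper. This is a hedged sketch: `add_missing_cols` is a hypothetical name, not part of any package, and it uses plain data frames and an explicit loop so that the "nothing missing" case is trivially a no-op.

```r
col_vector <- c("A", "B", "C")

# Hypothetical helper: add every column named in `wanted` that `data` lacks,
# filled with NA; columns that already exist are left untouched
add_missing_cols <- function(data, wanted) {
  for (nm in setdiff(wanted, names(data))) {
    data[[nm]] <- NA
  }
  data
}

# One column missing, two columns missing, no column missing
res_one  <- add_missing_cols(data.frame(A = 1:2, B = 1:2), col_vector)
res_two  <- add_missing_cols(data.frame(A = 1:2), col_vector)
res_none <- add_missing_cols(data.frame(A = 1:2, B = 1:2, C = 1:2), col_vector)
```

The loop body never runs when setdiff() returns an empty vector, so the helper returns its input unchanged in that case.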
Note
Incredibly, @AustinGraves posted their answer at precisely the same time (2021-10-25 21:03:05Z) as I posted mine, so both answers qualify as original solutions.
The shape of my data is fairly simple:
set.seed(1337)
id <- c(1:4)
values <- runif(n = 4, min = 0, max = 1)
df <- data.frame(id, values)
df
id values
1 1 0.57632155
2 2 0.56474213
3 3 0.07399023
4 4 0.45386562
What isn't simple: I have a list of character-value arrays that match up to each row, where each list item can be empty, or it can contain up to 5 separate tags, like...
tags <- list(
c("A"),
NA,
c("A", "B", "C"),
c("B", "C")
)
I will be asked various questions using the tags as classifiers, for instance, "what is the average value of all rows with a B tag?" Or "how many rows contain both tag A and tag C?"
What way would you choose to store the tags so that I can do this? My real-life data file is quite large, which makes experimenting with unlist or other commands difficult.
Here are a couple of options to get the expected output. Create 'tags' as a list column in the dataset and unnest it (as already suggested in the comments), then summarise: count the rows where the tag is 'A' or 'C' by summing a logical vector and, similarly, take the mean of 'values' where 'tag' is 'B'.
library(tidyverse)
df %>%
mutate(tag = tags) %>%
unnest(tag) %>%
summarise(nAC = sum(tag %in% c("A", "C")),
meanB = mean(values[tag == "B"], na.rm = TRUE))
That is not very hard. You just need to assign your list to your df as a new column named tags and then unnest it. I have listed solutions for the questions you mentioned.
library(tidyr)
library(dplyr)
df$tags=list(
c("A"),
NA,
c("A", "B", "C"),
c("B", "C")
)
Newdf=df%>%tidyr::unnest(tags)
Q1.
Newdf%>%group_by(tags)%>%summarise(Mean=mean(values))%>%filter(tags=='B')
tags Mean
<chr> <dbl>
1 B 0.263927925960161
Q2.
Newdf%>%group_by(id)%>%dplyr::summarise(Count=any(tags=='A')&any(tags=='C'))
# A tibble: 4 x 2
id Count
<int> <lgl>
1 1 FALSE
2 2 NA
3 3 TRUE
4 4 FALSE
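Another way to store the tags, sketched here in base R as an assumption rather than as either answerer's approach, is a logical indicator matrix with one column per tag. Each question then becomes a column lookup. The values are the ones printed in the question; `has_tag`, `mean_B` and `n_A_and_C` are names invented for the example.

```r
# Values copied from the question's printed data frame
df <- data.frame(id = 1:4,
                 values = c(0.57632155, 0.56474213, 0.07399023, 0.45386562))
tags <- list(
  c("A"),
  NA,
  c("A", "B", "C"),
  c("B", "C")
)

# All distinct tags, ignoring the NA placeholder for untagged rows
all_tags <- sort(unique(na.omit(unlist(tags))))

# One logical column per tag: TRUE where that row carries the tag
has_tag <- sapply(all_tags, function(t) {
  vapply(tags, function(x) t %in% x, logical(1))
})

# "What is the average value of all rows with a B tag?"
mean_B <- mean(df$values[has_tag[, "B"]])

# "How many rows contain both tag A and tag C?"
n_A_and_C <- sum(has_tag[, "A"] & has_tag[, "C"])
```

The matrix keeps one row per original observation, so counts are not inflated the way they can be after unnesting.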
I am using two different data frames. I would like to complete one using the information that is contained in the other. The first data frame contains a list of observations of individual young animals whose birthdate and natal territory are known. The second data frame contains observations of adult animals that were present in given territories within given time intervals.
Here is a reproducible example:
#First dataframe:
ID_young <- c(rep(c("a", "b", "c"), each=3), "d") # All individuals observed three times except "d", observed once
Territory_young <- c(rep(c("x", "y", "z"), each=3), "x") # All individuals are from different territories, except "a" and "d" who are from the same territory, namely "x".
Birthdate <- c(rep(c("2014-01-29", "2014-12-17", "2013-11-19"), each=3), "2012-12-04")
Birthdate <- as.Date(Birthdate)
# Second dataframe:
ID_adult <- c("e", "f", "g", "h", "i", "j", "e","f")
Territory_adult <- c("x", "x", "y", "z", "z", "z", "z", "w")
First_date <- as.Date(c("2014-01-01", "2014-01-15", "2013-12-14", "2013-05-17", "2013-05-09", "2012-09-01", "2013-06-18", "2011-04-17"))
Last_date <- as.Date(c("2014-02-28", "2014-04-17", "2014-11-02", "2014-01-13", "2015-01-03", "2013-04-17", "2013-12-25", "2014-11-11"))
# Data frames complete:
df1 <- data.frame(ID_young, Territory_young, Birthdate)
df2 <- data.frame(ID_adult, Territory_adult, First_date, Last_date)
My goal is to create a new column in df1 that consists of the number of adult animals present in the young animal's territory at the time of its birth.
In other words,
For each line of df1:
find the corresponding territory in df2
count the number of lines in df2 where the interval between df2$First_date and df2$Last_date includes df1$Birthdate
fill in that number in the new column of df1
For example, for the first three lines of df1 (corresponding to the young animal "a"), that count would be 2, because adults "e" and "f" were present in territory "x" when young "a" was born (2014-01-29).
Could someone help me formulate the right combination of conditional statements that would allow me to do that? I am trying for and if statements at the moment but have nothing worth showing.
Thanks!
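# apply() passes each row of df1 as a character vector, so the territory and
# the birthdate arrive as strings
nb.adults = apply(df1, 1, function(row, df2) {
  terr = as.character(row[2]) # natal territory of the young animal
  bd = row[3] # its birthdate
  # count adults in that territory whose interval strictly contains the birthdate
  nb.adults = length(which(df2$First_date < bd & bd < df2$Last_date &
    df2$Territory_adult==terr))
  return(nb.adults)
}, df2)
df1 = cbind(df1, nb.adults)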
The recent versions of data.table support non-equi joins which can be used for this purpose:
library(data.table) # CRAN version 1.10.4 used
# coerce to data.table
DT1 <- data.table(df1)
DT2 <- data.table(df2)
# right non-equi join to find any adults present in territory during birth
DT2[unique(DT1),
on = c("Territory_adult==Territory_young",
"First_date<=Birthdate",
"Last_date>=Birthdate")][
# count adults for each young
, .(Count_adult = sum(!is.na(ID_adult))), by = ID_young][
# join counts into each matching row of first data.table
DT1, on = "ID_young"]
ID_young Count_adult Territory_young Birthdate
1: a 2 x 2014-01-29
2: a 2 x 2014-01-29
3: a 2 x 2014-01-29
4: b 0 y 2014-12-17
5: b 0 y 2014-12-17
6: b 0 y 2014-12-17
7: c 3 z 2013-11-19
8: c 3 z 2013-11-19
9: c 3 z 2013-11-19
10: d 0 x 2012-12-04
Note that df1 and DT1, respectively, contain duplicate rows. This is why unique() is needed in the non-equi join with the adults, and why a final join is needed to make sure the adult count appears on each row.
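For comparison, the row-by-row logic described in the question (match the territory, then test whether the birthdate falls in the interval) can be sketched in base R without a join. This is an illustrative sketch only: the column name `nb_adults` is made up, the bounds are taken as inclusive to match the join conditions above, and it is likely to be much slower than the non-equi join on large data.

```r
# Data from the question, with strings kept as character
df1 <- data.frame(
  ID_young = c(rep(c("a", "b", "c"), each = 3), "d"),
  Territory_young = c(rep(c("x", "y", "z"), each = 3), "x"),
  Birthdate = as.Date(c(rep(c("2014-01-29", "2014-12-17", "2013-11-19"), each = 3),
                        "2012-12-04")),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID_adult = c("e", "f", "g", "h", "i", "j", "e", "f"),
  Territory_adult = c("x", "x", "y", "z", "z", "z", "z", "w"),
  First_date = as.Date(c("2014-01-01", "2014-01-15", "2013-12-14", "2013-05-17",
                         "2013-05-09", "2012-09-01", "2013-06-18", "2011-04-17")),
  Last_date = as.Date(c("2014-02-28", "2014-04-17", "2014-11-02", "2014-01-13",
                        "2015-01-03", "2013-04-17", "2013-12-25", "2014-11-11")),
  stringsAsFactors = FALSE
)

# For each young animal, count the adult records in the same territory whose
# [First_date, Last_date] interval includes the birthdate
df1$nb_adults <- vapply(seq_len(nrow(df1)), function(i) {
  sum(df2$Territory_adult == df1$Territory_young[i] &
        df2$First_date <= df1$Birthdate[i] &
        df2$Last_date >= df1$Birthdate[i])
}, integer(1))
```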
Consider:
df1 <- data.frame(
row.names = c('Obs1','Obs2','Obs3','Obs4'),
V1 = c(1,2,1,0),
V2 = c(0,0,1,0),
V3 = c(1,1,0,3))
df2 <- data.frame(
Group = c("A", "A", "B"),
Obs = c("Obs1", "Obs2", "Obs3"))
I want to match the observations that make up each Group of df2 to each variable in df1 and return a dataframe that describes if the observation is present or not - ultimately to be able to classify which Groups the variables of df1 should be included in. All observations that make up a Group must have a value > 0 in df1 for the variable in df1 to be considered part of the Group.
output
Group V1 V2 V3
1 A 1 0 1
2 B 1 1 0
Here's a quick way:
library(dplyr)
df1$Obs <- rownames(df1) # rownames are a pain, let's have a real column
# complains because of a factor in df1, but no biggie:
output <- inner_join(df1, df2)
output %>%
group_by(Group) %>%
summarize_at(
vars(starts_with('V')),
function (x) as.numeric(any(x>0))
)
This gives the output you required.
Here is an option using data.table to join the two datasets on 'Obs/rn', grouped by 'Group', check if any of the values are greater than 0 in the columns
library(data.table)
setDT(df1, keep.rownames=TRUE)[df2, on = .(rn = Obs)
][, lapply(.SD, function(x) +any(x > 0)) , Group, .SDcols = V1:V3]
# Group V1 V2 V3
#1: A 1 0 1
#2: B 1 1 0
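Both answers test with any(), whereas the question states that all observations in a Group must be > 0 for a variable to count. On the sample data the two tests give the same output, but they can differ in general. A minimal base R sketch using all() instead, with `merged` and `out_all` as names assumed for the illustration:

```r
df1 <- data.frame(
  row.names = c("Obs1", "Obs2", "Obs3", "Obs4"),
  V1 = c(1, 2, 1, 0),
  V2 = c(0, 0, 1, 0),
  V3 = c(1, 1, 0, 3))
df2 <- data.frame(
  Group = c("A", "A", "B"),
  Obs = c("Obs1", "Obs2", "Obs3"),
  stringsAsFactors = FALSE)

# Look up each observation's row in df1 by row name and attach its Group
merged <- data.frame(Group = df2$Group, df1[df2$Obs, ], row.names = NULL)

# 1 if every observation in the group is > 0 for that variable, else 0
out_all <- aggregate(. ~ Group, data = merged,
                     FUN = function(x) as.numeric(all(x > 0)))
```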
I'm relatively new to R programming and am trying to figure out how to use custom functions to evaluate new columns of a data frame using dplyr or data.table in a memory-efficient manner. Can someone please help?
Here is a brief summary of my problem
Data frames 1 and 2 have the same type and number of columns
df1 <- data.frame(col1 = c("A", "B", "C"), col2 = c(10,20,30))
df2 <- data.frame(col1 = c("DA", "EE", "FB", "C"), col2 = c(10,20,30,40))
These data frames have millions of records.
Now I want to add a new column to one of the data frames (say df1) by using the values in df2.
library(dplyr)
calculateCol3 <- function(word) {
  df2 %>%
    filter(grepl(paste0(word, "$"), col1)) %>%
    summarize(col3 = sum(col2)) %>%
    pull(col3) # return the sum as a plain value
}
df1 %>% group_by(col1) %>% mutate(col3 = calculateCol3(col1))
This method works but it is painfully slow and I guess this is because of copying the data sets too many times. Can someone suggest a better way of doing the same? The expected result is:
col1 col2 col3
A 10 10
B 20 30
C 30 40
I also tried converting the data frames to data.table as follows
dt1 <- data.table(df1)
dt2 <- data.table(df2)
dt1[, col3 := calculateCol3(col1), by = 1:nrow(dt1)]
Everything seems to be slow. I'm sure there is a better way to achieve this. Can someone help?
Thanks
If you want an efficient solution, I would suggest that you avoid regex and don't do by-row operations. If all your function is doing is joining by the last letter, you could just get that letter without using regex and then do a binary join using data.table (for efficiency).
library(data.table)
setDT(df2)[, EndWith := substring(col1, nchar(as.character(col1)))]
setDT(df1)[df2, col3 := i.col2, on = .(col1 = EndWith)]
df1
# col1 col2 col3
# 1: A 10 10
# 2: B 20 30
# 3: C 30 40
Now, looking at your function, it seems you are also trying to sum the values in df2$col2 per join. No problem, you can run functions while doing a binary join in data.table too. Let's say this is your df2 (just to illustrate the case where you have more than a single value per last letter):
df2 <- data.frame(col1 = c("DA", "FA", "EE", "FB", "C", "fC"), col2 = c(10,20,10,30,40,30))
df2
# col1 col2
# 1 DA 10
# 2 FA 20
# 3 EE 10
# 4 FB 30
# 5 C 40
# 6 fC 30
The first step is the same
setDT(df2)[, EndWith := substring(col1, nchar(as.character(col1)))]
The second step will again involve a binary join, just the other way around, while adding by = .EACHI and specifying your desired function
setDT(df2)[df1, .(col2 = i.col2, col3 = sum(col2)), on = .(EndWith = col1), by = .EACHI]
# EndWith col2 col3
# 1: A 10 30
# 2: B 20 30
# 3: C 30 70
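The same "sum per last letter, then look up" logic can also be sketched in base R, which may help when checking the data.table result. This is a rough sketch under the same assumption as above, namely that the keys in df1$col1 are single characters; `last_char` and `sums` are names chosen for the illustration.

```r
df1 <- data.frame(col1 = c("A", "B", "C"), col2 = c(10, 20, 30),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("DA", "FA", "EE", "FB", "C", "fC"),
                  col2 = c(10, 20, 10, 30, 40, 30),
                  stringsAsFactors = FALSE)

# Last character of each df2 key, extracted without regex
last_char <- substring(df2$col1, nchar(df2$col1))

# Sum col2 once per last character, then look the totals up by name
sums <- tapply(df2$col2, last_char, sum)
df1$col3 <- as.numeric(sums[df1$col1])
```

The aggregation touches df2 only once, instead of once per row of df1 as in the original function.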
Using the fuzzyjoin package, I think you can make this work. E.g.:
#install.packages("fuzzyjoin")
library(fuzzyjoin)
df1$col1regex <- paste0(df1$col1,"$")
regex_join(df2, df1, by=c(col1="col1regex"), mode="right")
# col1.x col2.x col1.y col2.y col1regex
#1 DA 10 A 10 A$
#2 FB 30 B 20 B$
#3 C 40 C 30 C$