I am trying to aggregate multiple columns given certain conditions using R (data.table?)...
I have one data frame df1 where columns 12:262 contain species abundances (one species per column) for each sample (one sample per row):
sample species1 species2
sample1 1 21
sample2 47 36
sample3 8 32
In another data frame df2, I have the phylum, genus, etc. for each species (rows).
species phylum genus
species1 X A
species2 Y B
I would like to aggregate all columns from df1 whose species belong to the same phylum (defined in df2)...
Does that make sense?
thank you!
The first thing to do is to reshape df1. If you convert the data from a 'wide' format to a 'long' format you will have multiple rows for each sample. You can then merge this with your second data set by the species variable. From here, you haven't given enough detail on exactly how you want to aggregate the data, but I provided two simple examples. You should be able to easily adjust that aggregation code to include whatever you need.
library(tidyr)
library(dplyr)
df1 <- data.frame(
sample = c("sample1", "sample2", "sample3"),
species1 = c(1, 47, 8),
species2 = c(21, 36, 32))
df2 <- data.frame(
species = c("species1", "species2"),
phylum = c("X", "Y"),
genus = c("A", "B")
)
df1_long <- tidyr::pivot_longer(df1, starts_with("species"),
names_to = "species", values_to = "abundance")
df3 <- dplyr::left_join(df1_long, df2, by = "species")
df3 %>%
group_by(phylum) %>%
summarize(total_abundance = sum(abundance),
avg_abundance = mean(abundance))
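Since your real abundance data sits in columns 12:262 rather than in columns literally named species1, species2, ..., the reshaping step can select those columns by position instead of by name. A sketch, assuming the column layout you describe:
# select the abundance columns by position on the full data frame
df1_long <- tidyr::pivot_longer(df1, cols = 12:262,
                                names_to = "species", values_to = "abundance")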
A data.table version
library(data.table)
dt1 <- data.table(
sample = c("sample1", "sample2", "sample3"),
species1 = c(1, 47, 8),
species2 = c(21, 36, 32))
dt2 <- data.table(
species = c("species1", "species2"),
phylum = c("X", "Y"),
genus = c("A", "B")
)
# long format
dt1_long <-
melt(
dt1,
id.vars = 'sample',
variable.name = "species",
value.name = "abundence"
)
# join, then aggregate by phylum
dt1_long[dt2, on = "species"][, .(total_abundance = sum(abundance),
                                  avg_abundance = mean(abundance)), by = phylum]
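With the toy data above, the aggregation should return something like:
#    phylum total_abundance avg_abundance
# 1:      X              56      18.66667
# 2:      Y              89      29.66667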
I have data like this:
df1<- structure(list(test = c("SNTM1", "STTTT2", "STOLA", "STOMQ",
"STR2", "SUPTY1", "TBNHSG", "TEYAH", "TMEIL1", "TMEIL2", "TMEIL3",
"TNIL", "TREUK", "TTRK", "TRRFK", "UBA52", "YIPF1")), class = "data.frame", row.names = c(NA,
-17L))
df2<-structure(list(test = c("SNTLK", "STTTFSG", "STOIU", "STOMQ",
"STR25", "SUPYHGS", "TBHYDG", "TEHDYG", "TMEIL1", "YIPF1")), class = "data.frame", row.names = c(NA,
-10L))
and
df3<- structure(list(test = c("SNTLKM", "STTTFSGTT", "GFD", "STOMQ",
"TRS", "BRsts", "TMHS", "RSEST", "TRSF", "YIPF1")), class = "data.frame", row.names = c(NA,
-10L))
I want to know how many strings are common across all these 3 data frames.
If it were only two data frames, I could do it with match or a join function, but I want to know how many are shared between df1, df2, and df3, or any combination of them.
example (if only identical strings count for duplicates):
library(dplyr)
df1 <- data.frame(test = c("A", "B", "C", "C"))
df2 <- data.frame(test = c("B", "C", "D"))
df3 <- data.frame(test = c("C", "D", "E"))
bind_rows(df1, df2, df3, .id = "origin") %>%
group_by(origin) %>%
distinct(test) %>% ## remove within-dataframe duplicates
group_by(test) %>%
summarise(replicates = n()) %>%
filter(replicates > 1)
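With these small data frames, the result should be something like:
# A tibble: 3 x 2
#   test  replicates
#   <chr>      <int>
# 1 B              2
# 2 C              3
# 3 D              2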
Here is an update in case only identical strings are wanted:
library(dplyr)
bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>%
filter(duplicated(test) | duplicated(test, fromLast=TRUE))
id test
1 df1 STOMQ
2 df1 TMEIL1
3 df1 YIPF1
4 df2 STOMQ
5 df2 TMEIL1
6 df2 YIPF1
7 df3 STOMQ
8 df3 YIPF1
First answer:
Here is a suggestion:
First, bring all data frames together into one data frame with an identifier column and arrange by the string. Now you can check visually:
library(dplyr)
x <- bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>%
arrange(test)
To automate the process you have to use some kind of string distance; there are several different measures out there and I can't tell which one is better or more appropriate. One example is the Jaccard index: https://en.wikipedia.org/wiki/Jaccard_index
Here we use the Jaro-Winkler distance (learned here: How to group similar strings together in a database in R).
In the group column you can find the similar strings.
You can define what "similar" means by changing the threshold on the "jw" distance: try changing it from 0.4 to 0.1 and you will see that the groups change:
library(tidyverse)
library(stringdist)
map_dfr(x$test, ~ {
i <- which(stringdist(., x$test, "jw") < 0.40)
tibble(index = i, title = x$test[i])
}, .id = "group") %>%
distinct(index, .keep_all = T) %>%
mutate(group = as.integer(group)) %>%
bind_cols(df_id = x$id)
group index title df_id
<int> <int> <chr> <chr>
1 1 1 BRsts df3
2 2 2 GFD df3
3 3 3 RSEST df3
4 3 31 TRS df2
5 3 32 TRSF df3
6 4 4 SNTLK df1
7 4 5 SNTLKM df2
8 4 6 SNTM1 df1
9 4 8 STOLA df1
10 4 12 STR2 df2
# ... with 27 more rows
I have two data frames. The data frames are different lengths, but have the same IDs in their ID columns. I would like to create a column in df called Classification based on the Classification in df2. I would like the Classification column in the df to match up with the appropriate ID listed in df2. Is there a good way to do this?
#Example data set
library(lubridate)
date <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2011"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
ID2 <- rep(seq(1,5), 1)
Classification <- c("A", "B", "C", "D", "E")
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
df2 <- data.frame(ID2, Classification)
A dplyr solution using left_join().
library(dplyr)
left_join(df, df2, by = c("ID" = "ID2"))
Are you looking for a merge between df and df2? Assuming Classification is a column in df2:
df2 <- merge(df2, df, by.x = "ID2", by.y = "ID", all.x = TRUE)
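For what it's worth, the same lookup can also be written in base R with match(); a small sketch using the example data above:
# look up each ID in df2 and pull the matching Classification value
df$Classification <- df2$Classification[match(df$ID, df2$ID2)]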
I searched various join questions and none seemed to quite answer this. I have two dataframes which each have an ID column and several information columns.
df1 <- data.frame(id = c(1:100), color = c(rep("blue", 25), rep("red", 25),
rep(NA, 25)), phase = c(rep("liquid", 50), rep("gas", 50)),
rand.col = rnorm(100))
df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))
As you can see, df1 is missing some info that is present in df2, while df2 is only a subset of all the ids, but they both have some similar columns. Is there a way to fill the missing values in df1 based on matching ID's from DF2?
I found a similar question that recommended using merge, but when I tried it, it dropped all the ids that were not present in both data frames. Plus it required manually dropping duplicate columns, and in my real dataset there will be a large number of these, which makes that cumbersome. Even ignoring that, though,
both of the recommended solutions:
df1 <- setNames(merge(df1, df2)[-2], names(df1))
and
df1[is.na(df1$color), "color"] <- df2[match(df1$id, df2$id), "color"][which(is.na(df1$color))]
did not work for me, throwing various errors.
An alternate solution I have thought of is using rbind and then dropping incomplete cases. The problem is that in my real dataset, while there are shared columns, there are also non-shared columns, so I would have to create intermediate objects of just the shared columns, rbind them, drop incomplete cases, and then join with the original object to regain the dropped columns. This seems unnecessarily roundabout.
In this example it would look like
df2 = rbind(df1[,colnames(df2)], df2)
df2 = df2[complete.cases(df2),]
df2 = merge(df1[,c("id", "rand.col")], df2, by = "id")
and, in case there are any fully duplicated rows between the two dataframes, I would need to add
df2 = unique(df2)
This solution will work, but it is cumbersome and as the number of columns that are being matched on increase, it gets even worse. Is there a better solution?
-edit- fixed a problem in my example data pointed out by Sathish
-edit2- Expanded example data
df1 = data.frame(id = c(1:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
These data frames represent the case where there are many columns with incomplete data and a second data frame that holds all of the missing data. Ideally, we would not need to separately list each column with wq2 := i.wq2, etc.
If you want to join only by the id column, you can remove phase from the on clause of the code below.
Also your data in the question has discrepancies, which are corrected in the data posted in this answer.
library('data.table')
setDT(df1) # make data table by reference
setDT(df2) # make data table by reference
df1[i = df2, color := i.color, on = .(id, phase)] # join df1 with df2 by id and phase values, and fill the color values in df1 with the matching color values from df2
tail(df1)
# id color phase rand.col
# 1: 95 green gas 1.5868335
# 2: 96 green gas 0.5584864
# 3: 97 green gas -1.2765922
# 4: 98 green gas -0.5732654
# 5: 99 green gas -1.2246126
# 6: 100 green gas -0.4734006
one-liner:
setDT(df1)[df2, color := i.color, on = .(id, phase)]
Data:
set.seed(1L)
df1 <- data.frame(id = c(1:100), color = c(rep("blue", 25), rep("red", 25),
rep(NA, 50)), phase = c(rep("liquid", 50), rep("gas", 50)),
rand.col = rnorm(100))
df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))
EDIT: based on new data posted in the question
Data:
set.seed(1L)
df1 = data.frame(id = c(1:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
set.seed(2423L)
df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
Code:
library('data.table')
setDT(df1)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.1836433 -0.6120264 0.04211587 -0.01855983
setDT(df2)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687
df1[df2, `:=` ( wq2 = i.wq2,
wq3 = i.wq3,
wq4 = i.wq4,
wq5 = i.wq5), on = .(id)]
setDT(df1)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687
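If there are many wq columns, one way to avoid writing out each wq2 = i.wq2 pair is to build the assignment programmatically. A sketch of that idea, assuming every non-id column of df2 should overwrite the matching column of df1:
# columns to copy from df2 (everything except the join key)
cols <- setdiff(names(df2), "id")
# assign all of them at once inside the join; the i. prefix refers to df2's columns
df1[df2, (cols) := mget(paste0("i.", cols)), on = .(id)]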
After joining two dataframes (df1 and df2) I would like to flag in column check which ids are in df1 but not df2 (example below).
df1 <-
data.frame(id = c(1, 2, 3))
df2 <-
data.frame(id = c(1, 2))
df3 <-
left_join(df1, df2)
Required result
id check
1 1 Y
2 2 Y
3 3 N
I can achieve the result using a temp column (example below)
df1 <-
data.frame(id = c(1, 2, 3))
df2 <-
data.frame(id = c(1, 2), temp = "Y")
df3 <-
left_join(df1, df2) %>%
mutate(check = ifelse(is.na(temp), "N", "Y")) %>%
select(-temp)
but I was hoping for a solution where a temp column isn't required. I've tried some different approaches (for example, the one below), but haven't been able to find a better solution.
df3 <-
left_join(x = df1, y = df2) %>%
mutate(check = ifelse(is.na(y.id), "N", "Y"))
but this errors with...
Joining, by = "id"
Error: object 'y.id' not found
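For what it's worth, a minimal sketch that produces the check column without any temporary column, simply testing membership of id in df2 (one possible approach, not necessarily the only one):
library(dplyr)

df3 <- df1 %>%
  mutate(check = ifelse(id %in% df2$id, "Y", "N"))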
I have data like this:
library(tibble)

df <- tibble(
  ID = rep(1:2, 4),
  Group = c("A", "B", "A", "B", "A", "B", "A", "B"),
  Parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
  Value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))
I want to calculate the ratio between "Height" and "Waist" and between "Waist" and "Hip".
I have the following solution. But my solution requires using spread() and delivers only the calculation for "Waist-to-hip".
df <- rbind(df,
spread(df, Parameter, Value)
%>% transmute(ID = ID,
Group = Group,
Parameter = "Ratio.Height-to-Hip",
Value = Height / Hip,
Parameter = "Ratio.Waist-to-Hip",
Value = Waist / Hip))
Is it possible to stay in tidy data format and avoid switching to the long-format? Why is the calculation for "Height-to-hip" missing?
Here is one possible solution:
# Calculate ratios "Height" vs "Waist" and "Waist" vs "Hip"
# 1. Load packages
library(tidyr)
library(dplyr)
# 2. Data set
df <- tibble(
id = rep(1:2, 4),
group = c("A", "B", "A", "B","A", "B", "A", "B"),
parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))
# 3. Filter and transform data set
df <- df %>%
filter(parameter %in% c("Height", "Waist", "Hip")) %>%
spread(parameter, value)
# 4. Convert column names to lower case
colnames(df) <- tolower(colnames(df))
# 5. Calculate ratios
df <- df %>%
mutate(
ratio_height_vs_waist = round(height / waist, 2),
ratio_waist_vs_hip = round(waist / hip, 2))
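With the example data, the resulting wide data frame should look roughly like this:
#   id group height hip waist ratio_height_vs_waist ratio_waist_vs_hip
# 1  1 A        180  60    90                  2                 1.5
# 2  2 B        170  65   102                  1.67              1.57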
The main problem is that the data are not in a tidy format.
Two key features of the tidy format are (Wickham, 2013):
Each variable forms a column;
Each observation forms a row.
In its original format, your data violates these two rules. For example, the Parameter column contains four variables (Blood, Height, Waist, and Hip). The knock-on effect of grouping several variables within Parameter is that each observation has to be repeated across several rows. In general, repeated rows of an identifier (ID in this case) in the absence of repeated measures is a sign that two or more variables have been grouped under a single column.
Anyway, here's my attempt to clean the data (I have used mutate and not transmute for illustrative purposes).
# Load packages
library(dplyr)
library(tidyr)
library(magrittr) # For the %<>% function, which I love
# Make data frame, df
df <- tibble(
ID = rep(1:2, 4),
Group = c("A", "B", "A", "B","A", "B", "A", "B"),
Parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
Value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))
# Wrangle df
df %<>%
# ID and Group appear to be repeated, so use them to group_by
group_by(ID, Group) %>%
# Spread the Value column by the Parameter column
spread(key = Parameter,
value = Value) %>%
# Ungroup, just because it's a good habit
ungroup() %>%
# Generate new columns.
mutate(Ratio_height_to_hip = Height / Hip,
Ratio_waist_to_hip = Waist / Hip)
# Print df
df
#> # A tibble: 2 x 8
#> ID Group Blood Height Hip Waist Ratio_height_to_hip
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 A 6.3 180 60 90 3.000000
#> 2 2 B 6.0 170 65 102 2.615385
#> # ... with 1 more variables: Ratio_waist_to_hip <dbl>
df <- df %>%
spread(Parameter, Value) %>%
mutate("Ratio.Height-to-Hip" = Height / Hip) %>%
mutate("Ratio.Waist-to-Hip" = Hip / Waist) %>%
gather("Parameter", "Value", -c("ID", "Group"))
Your data is not in tidy format ;) If you want your data in tidy format, remove the last step.