Sorry if this is a super basic question, but I've run into an issue while working on my R project. I have two data frames: a master list of genes with their expression levels in various patients, and one that is only a single column wide. The single-column one is a list of genes that fall under a specific subcategory, all of which are in the master list. I am trying to create a data frame that has my specific subset of genes AND their expression across the different patients, which is contained in the master list. I tried using the merge() function, but only an empty data frame was created.
The code goes something like: new_dataframe <- merge(master_list, specific_gene_list, by = "gene"). I thought this would look at my master list, find all the genes in the specific list, and then take only those genes and add the columns for patient expression. However, the result has all of the columns of the master list but no rows filled in. Any help is greatly appreciated.
A visual example:
Master data frame
x: 1
y: 3
z : 4
w: 6
Specific data frame:
x
y
Desired data frame:
x: 1
y: 3
We can use regex_inner_join from fuzzyjoin (dplyr is loaded for %>% and transmute)
library(fuzzyjoin)
library(dplyr)
df3 <- regex_inner_join(df1, df2, by = 'gene') %>%
transmute(gene = gene.x)
df3
# gene
#1 x: 1
#2 y: 3
data
df1 <- structure(list(gene = c("x: 1", "y: 3", "z: 4", "w: 6")),
class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(gene = c("x", "y")), class = "data.frame", row.names = c(NA,
-2L))
You could also split the column by the colon and add a new column to merge the dataframes.
mergecol <- c("x: 1",
"y: 3",
"z: 4",
"w: 6")
df <- cbind(mergecol, as.data.frame(do.call(rbind, strsplit(mergecol, ':'))))
df2 <- data.frame(V1 = c('x', 'y'))
mergedf <- merge(df, df2, by="V1")
result <- c('x: 1', 'y: 3')
assertthat::are_equal(result, mergedf$mergecol)
#[1] TRUE
You can separate the columns in master_list using separate, join with specific_gene_list and again combine the columns with unite.
library(dplyr)
library(tidyr)
master_list %>%
separate(gene, c('gene', 'value'), sep = ':\\s*') %>%
inner_join(specific_gene_list, by = 'gene') %>%
unite(gene, gene, value, sep = " : ")
# gene
#1 x : 1
#2 y : 3
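The same idea can be sketched in base R without extra packages. This assumes, as the answers above do, that the master list stores entries like "x: 1" in its gene column, which would explain why an exact-match merge() on "gene" finds nothing. The helper name gene_key is only illustrative.

```r
# Data shaped like the question's example (the "gene: value" layout is an assumption)
master_list <- data.frame(gene = c("x: 1", "y: 3", "z: 4", "w: 6"))
specific_gene_list <- data.frame(gene = c("x", "y"))

# Strip everything from the colon onward to recover the bare gene name
gene_key <- sub("\\s*:.*$", "", master_list$gene)

# Keep only the rows whose bare gene name appears in the specific list
subset_df <- master_list[gene_key %in% specific_gene_list$gene, , drop = FALSE]
subset_df
```

If the real master list already has a clean gene column, it is worth checking for stray whitespace or different capitalisation in the keys before merging, since merge() only matches exact values.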
Related
I have a data frame which is configured roughly like this:
df <- cbind(c('hello', 'yes', 'example'),c(7,8,5),c(0,0,0))
words    frequency  count
hello    7          0
yes      8          0
example  5          0
What I'm trying to do is add values to the third column from a different data frame, which is similar but looks like this:
df2 <- cbind(c('example','hello') ,c(5,6))
words    frequency
example  5
hello    6
My goal is to find matching values for the first column in both data frames (they have the same column name) and add matching values from the second data frame to the third column of the first data frame.
The result should look like this:
df <- cbind(c('hello', 'yes', 'example'),c(7,8,5),c(6,0,5))
words    frequency  count
hello    7          6
yes      8          0
example  5          5
What I've tried so far is:
df <- merge(df,df2, by = "words", all.x=TRUE)
However, it doesn't work.
I could use some help understanding how it could be done. Any help will be welcome.
This is an "update join". My favorite way to do it is in dplyr:
library(dplyr)
df %>% rows_update(rename(df2, count = frequency), by = "words")
In base R you could do the same thing like this:
names(df2)[2] = "count2"
df = merge(df, df2, by = "words", all.x=TRUE)
df$count = ifelse(is.na(df$count2), df$count, df$count2)
df$count2 = NULL
Here is an option with data.table:
library(data.table)
setDT(df)[setDT(df2), on = "words", count := i.frequency]
Output
words frequency count
<char> <num> <num>
1: hello 7 6
2: yes 8 0
3: example 5 5
Or using match in base R:
df$count[match(df2$words, df$words)] <- df2$frequency
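One caveat with the match one-liner: if df2 contains a word that is not in df, the index has an NA and the assignment fails. A minimal sketch that matches in the other direction and skips unmatched rows is below; the extra word "other" is a made-up value added only to show that case.

```r
df <- data.frame(words = c("hello", "yes", "example"),
                 frequency = c(7, 8, 5),
                 count = c(0, 0, 0))
# "other" is a hypothetical word that does not occur in df
df2 <- data.frame(words = c("example", "hello", "other"),
                  frequency = c(5, 6, 9))

# For each row of df, find the position of its word in df2 (NA if absent)
idx <- match(df$words, df2$words)
hit <- !is.na(idx)

# Overwrite count only where a match exists; other rows keep their old value
df$count[hit] <- df2$frequency[idx[hit]]
df
```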
Or another option with tidyverse using left_join and pmax:
library(tidyverse)
left_join(df, df2 %>% rename(count.y = frequency), by = "words") %>%
mutate(count = pmax(count.y, count, na.rm = TRUE)) %>%
select(-count.y)
Data
df <- structure(list(words = c("hello", "yes", "example"), frequency = c(7,
8, 5), count = c(0, 0, 0)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(words = c("example", "hello"), frequency = c(5, 6)), class = "data.frame", row.names = c(NA,
-2L))
I'm trying to rbind multiple loaded datasets (all of them have the same number of columns, named "num", "source" and "target"). In my case, I have ten data frames, whose names are "test1", "test2", "test3" and so on...
I thought that the solution below (creating an empty data frame and looping through the others) would solve my problem, but I guess that I'm missing something in the second argument of the rbind function. I don't know if using paste0("test", i) to increment the variable (changing the name of the data frame) is correct... I'm afraid that I'm just trying to rbind a data frame with a string object (and getting an error), is that right?
test = as.data.frame(matrix(ncol = 3, nrow = 0)) %>%
setNames(c("num", "source", "target"))
i=1
while (i < 11) {
test = rbind(test, paste0("test", i))
i = i + 1
}
We can use replicate with simplify = FALSE so that it returns a list
out <- setNames(replicate(10, test, simplify = FALSE),
paste0("test", seq_len(10)))
If the datasets are already created in the global env, get them into a list with mget and rbind them within do.call
out <- do.call(rbind, mget(paste0("test", 1:10)))
We could bind test1:test10 using the common pattern in the name:
library(dplyr)
result <- mget(ls(pattern="^test\\d+")) %>%
bind_rows()
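To see why the loop in the question fails, it may help to compare what paste0 returns with what get returns. The sketch below uses three small made-up data frames (test1 to test3) in place of the ten real ones.

```r
# Hypothetical stand-ins for the loaded datasets
test1 <- data.frame(num = 1:2, source = c("a", "b"), target = c("b", "c"))
test2 <- data.frame(num = 3:4, source = c("c", "d"), target = c("d", "e"))
test3 <- data.frame(num = 5:6, source = c("e", "f"), target = c("f", "g"))

name <- paste0("test", 1)
class(name)       # a character string, so rbind(test, name) binds text, not data
class(get(name))  # get() looks up the object that has that name

# mget() does the lookup for several names at once and returns a named list
combined <- do.call(rbind, mget(paste0("test", 1:3)))
rownames(combined) <- NULL  # drop the "test1.1"-style row names
combined
```

So the suspicion in the question is right: the second argument of rbind was a string, and wrapping it in get() (or collecting everything with mget) is what turns the name into the data frame.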
If I understood correctly, this might help you
Libraries
library(dplyr)
Example data
list_of_df <-
list(
df1 = data.frame(a = "1"),
df2 = data.frame(a = "2"),
df3 = data.frame(a = "1"),
df4 = data.frame(a = "2")
)
Code
bind_rows(list_of_df,.id = "dataset")
Result
dataset a
1 df1 1
2 df2 2
3 df3 1
4 df4 2
I would appreciate a solution for the following problem: I have the following example data frame:
df1 = data_frame(Tom = c(1,2,3,4), Tina = c(5,6,7,8), Todd = c(9,10,11,12), Brit = c(1,2,3,4))
I have a second data frame with information about Tom, Tina etc.
df2 = data_frame(ID = c("Tom","Todd","Tina","Brit"), value = c(1,3,2,1))
Now I would like to subset columns from data frame df1 if the "value" in df2 fulfils a particular condition, e.g. df2$value == 1 | df2$value == 2
The resulting table should look like:
desired_result_look_like = data_frame(Tom = c(1,2,3,4), Tina = c(5,6,7,8), Brit = c(1,2,3,4))
Thanks for your help.
Because you're using row values in one data frame to select the columns in another data frame, the solution isn't particularly clean, but if you wanted to stick with this approach, you could create a third data frame that filters the second data frame based on your conditions, then select the column names in the first data frame that correspond with values in the filtered data frame. The code would look something like this:
library(dplyr)
df2_filtered <- df2 %>% filter(value == 1 | value == 2)
desired_result <- df1[ , colnames(df1) %in% df2_filtered$ID]
(This is operating under the assumption that in your posted "desired result", you meant to include Tina instead of Todd)
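The same filter-then-select idea can also be written without dplyr. This is a minimal base R sketch that uses data.frame in place of data_frame and an illustrative name keep_ids for the intermediate vector.

```r
df1 <- data.frame(Tom = c(1, 2, 3, 4), Tina = c(5, 6, 7, 8),
                  Todd = c(9, 10, 11, 12), Brit = c(1, 2, 3, 4))
df2 <- data.frame(ID = c("Tom", "Todd", "Tina", "Brit"),
                  value = c(1, 3, 2, 1))

# IDs whose value meets the condition (value is 1 or 2)
keep_ids <- df2$ID[df2$value %in% c(1, 2)]

# Keep the columns of df1 named after those IDs, in df1's column order
result <- df1[, names(df1) %in% keep_ids, drop = FALSE]
result
```

The drop = FALSE keeps the result a data frame even when only one column survives the filter.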
I have this text dataframe with all columns being character vectors.
Gene.ID barcodes value
A2M TCGA-BA-5149-01A-01D-1512-08 Missense_Mutation
ABCC10 TCGA-BA-5559-01A-01D-1512-08 Missense_Mutation
ABCC11 TCGA-BA-5557-01A-01D-1512-08 Silent
ABCC8 TCGA-BA-5555-01A-01D-1512-08 Missense_Mutation
ABHD5 TCGA-BA-5149-01A-01D-1512-08 Missense_Mutation
ACCN1 TCGA-BA-5149-01A-01D-1512-08 Missense_Mutation
How do I build a dataframe from this using reshape/reshape2 such that I get a dataframe of the format Gene.ID~barcodes, with the values being the text in the value column for each and "NA" or "WT" as a filler?
The default aggregation function keeps defaulting to length, which I want to avoid if possible.
I think this will work for your problem. First, I'm generating some data similar to yours. I'm making gene.id and barcode a factor for simplicity and this should be the same as your data.
geneNames <- c(paste("gene", 1:10, sep = ""))
data <- data.frame(gene = as.factor(c(1:10, 1:4, 6:10)),
express = sample(c("Silent", "Missense_Mutation"), 19, TRUE),
barcode = as.factor(c(rep(1, 10), rep(2, 9))))
The vector geneNames holds the gene names (e.g., A2M). To get NA values where a barcode has no entry for a given gene, you need to merge the data so that you have number_of_genes by number_of_barcodes rows.
geneID <- unique(data$gene)
data2 <- data.frame(barcode = rep(unique(data$barcode), each = length(geneID)),
gene = geneID)
data3 <- merge(data, data2, by = c("barcode", "gene"), all.y = TRUE)
Now melting and casting the data,
library(reshape)
mdata3 <- melt(data3, id.vars = c("barcode", "gene"))
cdata <- cast(mdata3, barcode ~ variable + gene, identity)
names(cdata) <- c("barcode", geneNames)
You should then have a data frame with number_of_barcodes rows and with (number_of_unique_genes + 1) columns. Each column should contain the expression information for that particular gene in that particular sample barcode.
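If the aggregation default is the main obstacle, one way around it is to skip casting and fill a character matrix directly. The sketch below is a swapped-in base R technique, not the reshape approach above; it uses three rows from the question and assumes each Gene.ID/barcode pair occurs at most once (with duplicates, the last value wins).

```r
mut <- data.frame(
  Gene.ID  = c("A2M", "ABCC11", "ABHD5"),
  barcodes = c("TCGA-BA-5149-01A-01D-1512-08",
               "TCGA-BA-5557-01A-01D-1512-08",
               "TCGA-BA-5149-01A-01D-1512-08"),
  value    = c("Missense_Mutation", "Silent", "Missense_Mutation"),
  stringsAsFactors = FALSE
)

genes <- unique(mut$Gene.ID)
codes <- unique(mut$barcodes)

# Start with every cell set to the filler, then overwrite the observed pairs
wide <- matrix("WT", nrow = length(genes), ncol = length(codes),
               dimnames = list(genes, codes))
# A two-column character matrix indexes cells by (row name, column name)
wide[cbind(mut$Gene.ID, mut$barcodes)] <- mut$value
wide
```

Wrapping the result in as.data.frame(wide) gives a data frame with genes as row names and one column per barcode.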
I have five data.frames with gene expression data for different sets of samples. Each data.frame has a different number of rows, so the row.names (genes) only partly overlap.
Now I want
a) to filter the five data.frames to contain only genes that are present in all data.frames and
b) to combine the gene expression data for those genes to one data.frame.
All I could find so far was merge, but that can only merge two data.frames, so I'd have to use it multiple times. Is there an easier way?
Merging is not very efficient if you want to exclude row names which are not present in every data frame. Here's a different proposal.
First, three example data frames:
df1 <- data.frame(a = 1:5, b = 1:5,
row.names = letters[1:5]) # letters a to e
df2 <- data.frame(a = 1:5, b = 1:5,
row.names = letters[3:7]) # letters c to g
df3 <- data.frame(a = 1:5, b = 1:5,
row.names = letters[c(1,2,3,5,7)]) # letters a, b, c, e, and g
# row names being present in all data frames: c and e
Put the data frames into a list:
dfList <- list(df1, df2, df3)
Find common row names:
idx <- Reduce(intersect, lapply(dfList, rownames))
Extract data:
df1[idx, ]
a b
c 3 3
e 5 5
PS. If you want to keep the corresponding rows from all data frames, you could replace the last step, df1[idx, ], with the following command:
do.call(rbind, lapply(dfList, "[", idx, ))
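For part b) of the question, where each data frame holds different samples, binding columns side by side may be closer to what is wanted than stacking rows. A minimal sketch, assuming made-up sample columns s1 to s5 with distinct names across the data frames:

```r
# Hypothetical expression tables: genes as row names, samples as columns
expr1 <- data.frame(s1 = 1:5, s2 = 6:10, row.names = letters[1:5])
expr2 <- data.frame(s3 = 11:15, s4 = 16:20, row.names = letters[3:7])
expr3 <- data.frame(s5 = 21:25, row.names = letters[c(1, 2, 3, 5, 7)])
exprList <- list(expr1, expr2, expr3)

# Genes present in every data frame
common <- Reduce(intersect, lapply(exprList, rownames))

# Subset each data frame to those genes (same order) and bind the columns
combined <- do.call(cbind, lapply(exprList, function(d) d[common, , drop = FALSE]))
combined
```

Indexing every data frame with the same common vector is what guarantees the rows line up before cbind.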
Check out the uppermost answer in this SO post. Just list your data frames and apply the following line of code:
Reduce(function(...) merge(..., by = "x"), list.of.dataframes)
You just have to adjust the by argument to specify by which common column the data frames should be merged.
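As a small illustration of that line, here is a sketch with three made-up data frames that share a "gene" column (so by = "gene" replaces by = "x"). Because merge() defaults to an inner join, only genes present in every data frame survive.

```r
# Hypothetical inputs: one key column plus one sample column each
list.of.dataframes <- list(
  data.frame(gene = c("g1", "g2", "g3"), s1 = c(1, 2, 3)),
  data.frame(gene = c("g2", "g3", "g4"), s2 = c(4, 5, 6)),
  data.frame(gene = c("g3", "g2", "g5"), s3 = c(7, 8, 9))
)

# Merge the first two, then merge that result with the third, and so on
merged <- Reduce(function(x, y) merge(x, y, by = "gene"), list.of.dataframes)
merged
```

If the genes are stored as row names rather than a column, by = "row.names" can be used for a single merge, but the result then carries a Row.names column that needs to be turned back into row names before the next step of the Reduce.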