I have this text dataframe with all columns being character vectors.
Gene.ID barcodes value
A2M TCGA-BA-5149-01A-01D-1512-08 Missense_Mutation
ABCC10 TCGA-BA-5559-01A-01D-1512-08 Missense_Mutation
ABCC11 TCGA-BA-5557-01A-01D-1512-08 Silent
ABCC8 TCGA-BA-5555-01A-01D-1512-08 Missense_Mutation
ABHD5 TCGA-BA-5149-01A-01D-1512-08 Missense_Mutation
ACCN1 TCGA-BA-5149-01A-01D-1512-08 Missense_Mutation
How do I build a dataframe from this using reshape/reshape2 so that I get the layout Gene.ID ~ barcodes, with each cell holding the text from the value column and "NA" or "WT" as a filler?
The default aggregation function keeps defaulting to length, which I want to avoid if possible.
I think this will work for your problem. First, I'm generating some data similar to yours. I'm making gene.id and barcode a factor for simplicity and this should be the same as your data.
geneNames <- c(paste("gene", 1:10, sep = ""))
data <- data.frame(gene = as.factor(c(1:10, 1:4, 6:10)),
express = sample(c("Silent", "Missense_Mutation"), 19, TRUE),
barcode = as.factor(c(rep(1, 10), rep(2, 9))))
The vector geneNames holds the gene names (e.g., A2M). To get NA values where a barcode has no entry for a given gene, you need to merge the data so that you have number_of_genes × number_of_barcodes rows.
geneID <- unique(data$gene)
data2 <- data.frame(barcode = rep(unique(data$barcode), each = length(geneID)),
gene = geneID)
data3 <- merge(data, data2, by = c("barcode", "gene"), all.y = TRUE)
Now melting and casting the data,
library(reshape)
mdata3 <- melt(data3, id.vars = c("barcode", "gene"))
cdata <- cast(mdata3, barcode ~ variable + gene, identity)
names(cdata) <- c("barcode", geneNames)
You should then have a data frame with number_of_barcodes rows and with (number_of_unique_genes + 1) columns. Each column should contain the expression information for that particular gene in that particular sample barcode.
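For the Gene.ID/barcodes/value layout from the question, a base R sketch that sidesteps the length default could look like this. It swaps in tapply() for reshape's cast, and collapsing duplicates with ";" is an assumption about what should happen if a gene/barcode pair occurs more than once; the object names mut and wide are made up for the illustration.

```r
# Example rows copied from the question
mut <- data.frame(
  Gene.ID  = c("A2M", "ABCC10", "ABCC11", "ABCC8", "ABHD5", "ACCN1"),
  barcodes = c("TCGA-BA-5149-01A-01D-1512-08", "TCGA-BA-5559-01A-01D-1512-08",
               "TCGA-BA-5557-01A-01D-1512-08", "TCGA-BA-5555-01A-01D-1512-08",
               "TCGA-BA-5149-01A-01D-1512-08", "TCGA-BA-5149-01A-01D-1512-08"),
  value    = c("Missense_Mutation", "Missense_Mutation", "Silent",
               "Missense_Mutation", "Missense_Mutation", "Missense_Mutation"),
  stringsAsFactors = FALSE
)

# One cell per gene/barcode pair; empty cells come back as NA.
# Assumption: duplicates (none here) are pasted together with ";".
wide <- tapply(mut$value, list(mut$Gene.ID, mut$barcodes), paste, collapse = ";")

# Replace the NA filler with "WT"
wide[is.na(wide)] <- "WT"

# Turn the character matrix into a data frame with Gene.ID as a column
wide <- data.frame(Gene.ID = rownames(wide), wide,
                   check.names = FALSE, row.names = NULL,
                   stringsAsFactors = FALSE)
```

Skipping the `wide[is.na(wide)] <- "WT"` line keeps NA as the filler instead.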
I have a question regarding combining columns based on two conditions.
I have two datasets from an experiment where participants had to type in a code, answer about their gender and eyetracking data was documented. The experiment happened twice (first: random1, second: random2).
eye <- c(1000,230,250,400)
gender <- c(1,2,1,2)
code <- c("ABC","DEF","GHI","JKL")
random1 <- data.frame(code,gender,eye)
eye2 <- c(100,250,230,450)
gender2 <- c(1,1,2,2)
code2 <- c("ABC","DEF","JKL","XYZ")
random2 <- data.frame(code2,gender2,eye2)
Now I want to combine the two dataframes. For all rows where code and gender match, the rows should be combined (so columns added). Code and gender variables of those two rows should become one each (gender3 and code3) and the eyetracking data should be split up into eye_first for random1 and eye_second for random2.
All rows without a perfect match on both code and gender should go into a separate dataset.
#this is what the combined data looks like
gender3 <- c(1,2)
eye_first <- c(1000,400)
eye_second <- c(100, 230)
code3 <- c("ABC", "JKL")
random3 <- data.frame(code3,gender3,eye_first,eye_second)
#this is what the data without match should look like
gender4 <- c(2,1,2)
eye4 <- c(230,250,450)
code4 <- c("DEF","GHI","XYZ")
random4 <- data.frame(code4,gender4,eye4)
I would greatly appreciate your help! Thanks in advance.
Use the same column names for your two data frames and use merge:
random1 <- data.frame(code = code, gender = gender, eye = eye)
random2 <- data.frame(code = code2, gender = gender2, eye = eye2)
df <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"))
For your second request, you can use anti_join from dplyr
df2 <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"), all = TRUE) # all = TRUE : keep rows with ids that are only in one of the 2 data.frame
library(dplyr)
anti_join(df2, df, by = c("code", "gender"))
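If the unmatched rows are wanted in the plain code/gender/eye layout, a base R sketch along these lines may help. The key() helper and the source column are made up for this illustration, and the "|" separator assumes that character never occurs in a code.

```r
random1 <- data.frame(code = c("ABC", "DEF", "GHI", "JKL"),
                      gender = c(1, 2, 1, 2),
                      eye = c(1000, 230, 250, 400))
random2 <- data.frame(code = c("ABC", "DEF", "JKL", "XYZ"),
                      gender = c(1, 1, 2, 2),
                      eye = c(100, 250, 230, 450))

# Rows whose code AND gender agree in both data sets
matched <- merge(random1, random2, by = c("code", "gender"),
                 suffixes = c("_first", "_second"))

# Hypothetical helper: one string key per row from code and gender
key <- function(d) paste(d$code, d$gender, sep = "|")

# Stack both inputs, remembering where each row came from
stacked <- rbind(data.frame(random1, source = "random1"),
                 data.frame(random2, source = "random2"))

# Keep only the rows whose key did not find a partner
unmatched <- stacked[!key(stacked) %in% key(matched), ]
```

With the example data this keeps four rows, because DEF occurs with a different gender in each data set and so is unmatched on both sides.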
Here is what I have done, but this solution gives me a new data frame that contains only the numeric columns, whereas I want to keep my original data frame.
data_without_na <- select_if(new_data,is.numeric)
data_without_na[] <- lapply(
data_without_na,
function(data_without_na) {
data_without_na[is.na(data_without_na)] <- median(data_without_na, na.rm = TRUE)
data_without_na
})
This is my code, but I would prefer to perform the same operation on my original data frame. The idea is to get the indices of the numeric columns with ind <- which(sapply(new_data, is.numeric)) and use those column numbers to operate on the original data frame, but this gives me an error.
Simulate a dataframe:
library(dplyr) # provides %>% and mutate_if()
d <- data.frame("char1" = sample(letters, 100, replace = TRUE),
"char2" = sample(letters, 100, replace = TRUE),
"numeric1" = sample(c(NA, seq(1, 50, 2.5)), 100, replace = TRUE),
"numeric2" = sample(c(NA, seq(1, 50, 2.5)), 100, replace = TRUE))
# Assign the result back (d <- d %>% ...) to keep the imputed values
d %>%
mutate_if(is.numeric, ~ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))
We take this data frame and mutate all columns which are numeric. "~" defines an anonymous function and ".x" stands for the variable/column.
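The index-based idea from the question can also be made to work in base R by assigning back into only the numeric columns. This is a minimal sketch on a small made-up data frame; the column names are illustrative.

```r
# Small illustrative data frame with one character and two numeric columns
d <- data.frame(char1 = c("a", "b", "c", "d"),
                num1 = c(1, NA, 3, 8),
                num2 = c(NA, 2, 4, 6),
                stringsAsFactors = FALSE)

# Positions of the numeric columns
ind <- which(sapply(d, is.numeric))

# Replace NAs column by column and write the result back into d itself
d[ind] <- lapply(d[ind], function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
})
```

Because only `d[ind]` is overwritten, the non-numeric columns stay in the original data frame untouched.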
I have the following table mytable.tsv:
ABCI15.1 IM3
ABCK16.1 IMNCY
ABCK16.1 IM5
ABCI15.1 IM200/IM605
ABCM13.1 IM4
ABCN06.1 IM1182
ABCN20.1 IM21
ABCN06.1 IMNCY
ABCP20.1 IM4
ABCM13.1 IM630
And I would like to make an UpsetR plot of both this table and the transposed one.
So my first plot (forming intersects out of the groups in the second column by summing up the first column) would be:
df = read.table(file="mytable.tsv", header=F)
df2 = acast(df, V1~V2, value.var="V2")
df3 = setDT(as.data.frame(df2), keep.rownames = TRUE)[]
upset(df3)
and my transposed one:
df4 = t(df2)
df4 = setDT(as.data.frame(df4), keep.rownames = TRUE)[]
upset(df4)
However, I'm getting in both cases the following error:
Error in start_col:end_col : argument of length 0
Why is that? And how do I resolve it?
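One likely cause, stated as an assumption rather than a confirmed diagnosis: upset() looks for columns that contain only 0 and 1, and acast with value.var = "V2" yields a character matrix, so no set columns are found. A base R sketch that builds a 0/1 membership table from the same rows (the name membership is made up here) would be:

```r
# Rows copied from mytable.tsv in the question
df <- data.frame(
  V1 = c("ABCI15.1", "ABCK16.1", "ABCK16.1", "ABCI15.1", "ABCM13.1",
         "ABCN06.1", "ABCN20.1", "ABCN06.1", "ABCP20.1", "ABCM13.1"),
  V2 = c("IM3", "IMNCY", "IM5", "IM200/IM605", "IM4",
         "IM1182", "IM21", "IMNCY", "IM4", "IM630"),
  stringsAsFactors = FALSE
)

# Count table of V1 (rows) by V2 (columns), converted to a data frame
membership <- as.data.frame.matrix(table(df$V1, df$V2))

# Reduce counts to 0/1 integer indicators
membership[] <- lapply(membership, function(x) as.integer(x > 0))

# Transposed version: V2 values as rows, V1 values as 0/1 columns
membership_t <- as.data.frame(t(as.matrix(membership)))
```

Either data frame could then be passed to upset() in place of df3 or df4; that call is not run here.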
I have N columns that start with the String "Factor". I want to create an additional column in the dataframe that finds the row product of those columns.
Example data (My actual data set N = 50):
df <- data.frame(Company = c("A","B","C","D","E"),
Factor1 = c(1,2,3,4,5),
Factor2 = c(5,4,3,2,1),
FactorN = c(2,4,6,8,10))
Expected result
df2 <- data.frame(Company = c("A","B","C","D","E"),
Factor1 = c(1,2,3,4,5),
Factor2 = c(5,4,3,2,1),
FactorN = c(2,4,6,8,10),
Factor_Product = c(10,32,54,64,50))
I've tried rowProds from the matrixStats package, but that requires a matrix format.
Then convert the selected columns to a matrix, keeping only the columns that start with "Factor":
matrixStats::rowProds(as.matrix(df[grep("^Factor", names(df))]))
#[1] 10 32 54 64 50
You can also use apply row-wise
apply(df[grep("Factor", names(df))], 1, prod)
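To store the result as the new column the question asks for, a sketch without extra packages could multiply the selected columns element-wise. Using Reduce() here is a swapped-in alternative to rowProds and apply, not something either answer proposed.

```r
df <- data.frame(Company = c("A", "B", "C", "D", "E"),
                 Factor1 = c(1, 2, 3, 4, 5),
                 Factor2 = c(5, 4, 3, 2, 1),
                 FactorN = c(2, 4, 6, 8, 10))

# Logical index of the columns whose names start with "Factor"
factor_cols <- startsWith(names(df), "Factor")

# Multiply those columns together element-wise, one product per row
df$Factor_Product <- Reduce(`*`, df[factor_cols])
```

The column index is computed before Factor_Product is added, so the new column is not picked up by the "Factor" prefix.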
So I have three data sets that I need to merge. These contain school data and read/math scores for grades 4 and 5. One of them is a long-form data set with a lot of missingness in some variables (yes, I do need the data in long form), and the other two contain, in wide form, the complete data that is missing from it. All of these data frames contain a column with a unique ID number for each individual in the database.
Here is a full reproducible example that generates a small example of the types of data.frames I am working with... The three data frames that I need to use are the following: school_lf, school4 and school5. school_lf has the long form data with NAs and school4 and school5 are the dfs I need to use to populate the NA's in this long form data (by id and grade)
library(reshape2) # assumed source of melt() with variable.name/value.name
set.seed(890)
school <- NULL
school$id <-sample(102938:999999, 100)
school$selected <-sample(0:1, 100, replace = T)
school$math4 <- sample(400:500, 100)
school$math5 <- sample(400:500, 100)
school$read4 <- sample(400:500, 100)
school$read5 <- sample(400:500, 100)
school <- as.data.frame(school)
# Delete observations at random from the school df
indm4 <- which(school$math4 %in% sample(school$math4, 25))
school$math4[indm4] <- NA
indm5 <- which(school$math5 %in% sample(school$math5, 50))
school$math5[indm5] <- NA
indr4 <- which(school$read4 %in% sample(school$read4, 70))
school$read4[indr4] <- NA
indr5 <- which(school$read5 %in% sample(school$read5, 81))
school$read5[indr5] <- NA
# Separate Read and Math
read <- as.data.frame(subset(school, select = -c(math4, math5)))
math <- as.data.frame(subset(school, select = -c(read4, read5)))
# Now turn this into long form data...
clr <- melt(read, id.vars = c("id", "selected"), variable.name = "variable", value.name = "readscore")
clm <- melt(math, id.vars = c("id", "selected"), value.name = "mathscore")
# Clean up the grades for each of these...
clr$grade <- ifelse(clr$variable == "read4", 4,
ifelse(clr$variable == "read5", 5, NA))
clm$grade <- ifelse(clm$variable == "math4", 4,
ifelse(clm$variable == "math5", 5, NA))
# Put all these in one df
school_lf <-cbind(clm, clr$readscore)
school_lf$readscore <- school_lf$`clr$readscore` # renames
school_lf$`clr$readscore` <- NULL # deletes
school_lf$variable <- NULL # deletes
###############
# Generate the 2 data frames with IDs that have the full data
set.seed(890)
school4 <- NULL
school4$id <-sample(102938:999999, 100)
school4$selected <-sample(0:1, 100, replace = T)
school4$math4 <- sample(400:500, 100)
school4$read4 <- sample(400:500, 100)
school4$grade <- 4
school4 <- as.data.frame(school4)
set.seed(890)
school5 <- NULL
school5$id <-sample(102938:999999, 100)
school5$selected <-sample(0:1, 100, replace = T)
school5$math5 <- sample(400:500, 100)
school5$read5 <- sample(400:500, 100)
school5$grade <- 5
school5 <- as.data.frame(school5)
I need to merge the wide-form data into the long-form data to replace the NAs with the actual values. I have tried the code below, but it introduces several columns instead of merging the read scores and the math scores where there's NA's. I simply need one column with the read scores and one with the math scores, instead of six separate columns (read.x, read.y, math.x, math.y, mathscore and readscore).
sch <- merge(school_lf, school4, by = c("id", "grade", "selected"), all = T)
sch <- merge(sch, school5, by = c("id", "grade", "selected"), all = T)
Any help is highly appreciated! I've been trying to solve this for hours now and haven't made any progress (so figured I'd ask here)
You can use the coalesce function from dplyr. If a value in the first vector is NA, it will see if the value at the same position in the second vector is not NA and select it. If again NA, it goes to the third.
library(dplyr)
sch %>% mutate(mathscore = coalesce(mathscore, math4, math5)) %>%
mutate(readscore = coalesce(readscore, read4, read5)) %>%
select(id:readscore)
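The fall-through behaviour described above can be seen on three short vectors. The helper below is a base R stand-in written for this illustration (coalesce_base is a hypothetical name, not dplyr's implementation), so the example runs without dplyr.

```r
# Hypothetical stand-in: take the first non-NA value at each position
coalesce_base <- function(...) {
  Reduce(function(x, y) ifelse(is.na(x), y, x), list(...))
}

first  <- c(1, NA, NA)
second <- c(9, 2, NA)
third  <- c(7, 8, 3)

# Position 1 comes from first, position 2 from second, position 3 from third
filled <- coalesce_base(first, second, third)
```

All vectors must have the same length and line up position by position, which is why the rows have to match before coalescing.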
EDIT: I just tried to do this approach on my actual data and it does not work because the replacement data also has some NAs and, as a result, the dfs I try to do coalesce with have differing number of rows... Back to square one.
I was able to figure this out with the following code (albeit not the most elegant or straightforward), and @Edwin's response helped point me in the right direction. Any suggestions on how to make this code more elegant and efficient are more than welcome!
# Idea: put both in long form and stack on top of one another... then merge like that!
sch4r <- as.data.frame(subset(school4, select = -c(math4))) # grade 4, reading only
sch4m <- as.data.frame(subset(school4, select = -c(read4))) # grade 4, math only
sch5r <- as.data.frame(subset(school5, select = -c(math5))) # grade 5, reading only
sch5m <- as.data.frame(subset(school5, select = -c(read5))) # grade 5, math only
# Put these in LF
sch4r_lf <- melt(sch4r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch4m_lf <- melt(sch4m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
sch5r_lf <- melt(sch5r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch5m_lf <- melt(sch5m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
# Combine in one DF
sch_full_4 <-cbind(sch4r_lf, sch4m_lf$mathscore)
sch_full_4$mathscore <- sch_full_4$`sch4m_lf$mathscore`
sch_full_4$`sch4m_lf$mathscore` <- NULL # deletes
sch_full_4$variable <- NULL
sch_full_5 <- cbind(sch5r_lf, sch5m_lf$mathscore)
sch_full_5$mathscore <- sch_full_5$`sch5m_lf$mathscore`
sch_full_5$`sch5m_lf$mathscore` <- NULL
sch_full_5$variable <- NULL
# Stack together
sch_full <- rbind(sch_full_4,sch_full_5)
sch_full$selected <- NULL # delete this column...
# MERGE together
final_school_math <- mutate(school_lf, mathscore = coalesce(school_lf$mathscore, sch_full$mathscore))
final_school_read <- mutate(school_lf, readscore = coalesce(school_lf$readscore, sch_full$readscore))
final_df <- cbind(final_school_math, final_school_read$readscore)
final_df$readscore <- final_df$`final_school_read$readscore`
final_df$`final_school_read$readscore` <- NULL
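The coalesce calls above pair rows purely by position, so they rely on school_lf and sch_full being in the same order. A key-based variant in base R, sketched on tiny made-up data, joins on id and grade before filling; the names lookup, math_full and read_full are invented for this illustration.

```r
# Long-form data with gaps (made-up values)
school_lf <- data.frame(id = c(1, 2, 1, 2),
                        grade = c(4, 4, 5, 5),
                        mathscore = c(410, NA, NA, 450),
                        readscore = c(NA, 420, 430, NA))

# Wide-form data holding the complete scores per grade (made-up values)
school4 <- data.frame(id = c(2, 1), math4 = c(440, 410), read4 = c(420, 400))
school5 <- data.frame(id = c(1, 2), math5 = c(460, 450), read5 = c(430, 470))

# Stack both grades into one lookup table keyed by id and grade
lookup <- rbind(
  data.frame(id = school4$id, grade = 4,
             math_full = school4$math4, read_full = school4$read4),
  data.frame(id = school5$id, grade = 5,
             math_full = school5$math5, read_full = school5$read5)
)

# Join on the keys, then fill each NA from the matching lookup value
filled <- merge(school_lf, lookup, by = c("id", "grade"), all.x = TRUE)
filled$mathscore <- ifelse(is.na(filled$mathscore), filled$math_full, filled$mathscore)
filled$readscore <- ifelse(is.na(filled$readscore), filled$read_full, filled$readscore)

# Drop the helper columns and restore a grade/id ordering
filled <- filled[order(filled$grade, filled$id),
                 c("id", "grade", "mathscore", "readscore")]
```

Because the join is on id and grade, the row order of the wide data frames no longer matters, and ids missing from the lookup simply keep their NA.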