UpSetR with transposed data frame in R - r

I have the following table mytable.tsv:
ABCI15.1 IM3
ABCK16.1 IMNCY
ABCK16.1 IM5
ABCI15.1 IM200/IM605
ABCM13.1 IM4
ABCN06.1 IM1182
ABCN20.1 IM21
ABCN06.1 IMNCY
ABCP20.1 IM4
ABCM13.1 IM630
I would like to make an UpSetR plot of both this table and the transposed one.
So my first plot (forming intersections out of the groups in the second column by summing up the first column) would be:
library(reshape2)   # acast
library(data.table) # setDT
library(UpSetR)     # upset
df = read.table(file="mytable.tsv", header=F)
df2 = acast(df, V1~V2, value.var="V2")
df3 = setDT(as.data.frame(df2), keep.rownames = TRUE)[]
upset(df3)
and my transposed one:
df4 = t(df2)
df4 = setDT(as.data.frame(df4), keep.rownames = TRUE)[]
upset(df4)
However, in both cases I get the following error:
Error in start_col:end_col : argument of length 0
Why is that? And how do I resolve it?
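One likely cause, offered as a hedged reading rather than a confirmed diagnosis: upset() treats columns that contain only 0 and 1 as sets, while acast with value.var="V2" produces a character matrix with NA in the empty cells, so no set columns are found. Below is a minimal sketch that builds a 0/1 membership table with base R's table() instead. The names membership and membership_t are made up for illustration, the data is typed in rather than read from mytable.tsv, and the upset() calls are left as comments because they assume UpSetR is installed.

```r
# Data from the question, typed in so the sketch is self-contained
df <- data.frame(
  V1 = c("ABCI15.1", "ABCK16.1", "ABCK16.1", "ABCI15.1", "ABCM13.1",
         "ABCN06.1", "ABCN20.1", "ABCN06.1", "ABCP20.1", "ABCM13.1"),
  V2 = c("IM3", "IMNCY", "IM5", "IM200/IM605", "IM4",
         "IM1182", "IM21", "IMNCY", "IM4", "IM630"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: one row per V1 value, one column per V2 value, counts in cells
membership <- as.data.frame.matrix(table(df$V1, df$V2))
# Collapse counts to presence/absence (0/1)
membership[membership > 0] <- 1L

# Transposed version: one row per V2 value, one column per V1 value
membership_t <- as.data.frame(t(membership))

# Assumed usage, requires the UpSetR package:
# UpSetR::upset(membership,   nsets = ncol(membership))
# UpSetR::upset(membership_t, nsets = ncol(membership_t))
```

In the first table the sets are the V2 groups, and in the transposed table the sets are the V1 identifiers.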

Related

Select specific columns, where the column names are in another df in r

I couldn't find a solution on Stack Overflow, so here's my issue:
I have a df with 342 columns.
I want to make a new df with only specific columns.
The list of columns to keep is in another df, listed in 3 columns titled X, Y, Z, one for each of 3 new data frames.
Here's my code right now:
library(dplyr) # provides %>% and select
# Read the data:
data <- data.table::fread("data_30_9.csv")
# Import variable names #
variable.names.full = openxlsx::read.xlsx("variables2.xlsx")
Y.variable.names = na.omit(variable.names.full[1])
X.variable.names = na.omit(variable.names.full[2])
Z.variable.names = na.omit(variable.names.full[3])
# Make new DF with only specific columns:
X.Data = data %>% select(as.character(X.variable.names)) # This works, as X has only 1 variable
Y.Data = data %>% select(as.character(Y.variable.names)) # This gives an error:
# Error: Can't subset columns that don't exist.
Help?
The data is available here:
https://github.com/amirnakar/TammyA/blob/main/data_30_9.csv
https://github.com/amirnakar/TammyA/blob/main/Variables2.xlsx
The problem is that Y.variable.names is a data.frame, which you cannot use to select columns from another data.frame.
You can check by typing class(Y.variable.names).
So the solution to your problem is to extract its first column as a vector:
Y.Data = data %>% select(Y.variable.names[,1])
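A small sketch of why the X case happened to work while the Y case failed. It uses a made-up stand-in for variables2.xlsx, and the column contents "a", "b", "c" are hypothetical. Calling as.character() on a data frame deparses each column into a single string, which is only a valid column name when the column holds exactly one value.

```r
# Hypothetical stand-in for the spreadsheet of variable names
variable.names.full <- data.frame(
  Y = c("a", "b"),
  X = c("c", NA),
  stringsAsFactors = FALSE
)

# Single-bracket indexing keeps the data.frame class
Y.df <- na.omit(variable.names.full[1])
X.df <- na.omit(variable.names.full[2])

# Double-bracket indexing extracts the column as a plain character vector
Y.vec <- as.character(na.omit(variable.names.full[[1]]))

class(Y.df)          # a data.frame, not a vector of names
as.character(Y.df)   # one deparsed string, not two column names
as.character(X.df)   # a single name, which is why the X case worked
Y.vec                # the two names as a character vector
```

The vector form (Y.vec here) is what select() or bracket indexing needs.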
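Alternatively, use lapply on variable.names.full to select the columns from data for all three lists at once. Because fread returns a data.table, convert it to a plain data frame first so that indexing by a character vector selects columns:
list_data <- lapply(variable.names.full, function(x)
as.data.frame(data)[, as.character(na.omit(x)), drop = FALSE])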

How do you replace an entire column in one dataframe with another column in another dataframe?

I have two dataframes. I want to replace the ids in dataframe1 with generic ids. In dataframe2 I have mapped the ids from dataframe1 with the generic ids.
Do I have to merge the two dataframes and after it is merged do I delete the column I don't want?
Thanks.
With dplyr
library(dplyr)
left_join(df1, df2, by = 'ids')
We can use merge and then delete the ids.
dataframe1 <- data.frame(ids = 1001:1010, variable = runif(min=100,max = 500,n=10))
dataframe2 <- data.frame(ids = 1001:1010, generics = 1:10)
result <- merge(dataframe1,dataframe2,by="ids")[,-1]
Alternatively we can use match and replace by assignment.
dataframe1$ids <- dataframe2$generics[match(dataframe1$ids,dataframe2$ids)]
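To illustrate the match approach on a case where the ids are unsorted and repeated, here is a minimal sketch with made-up values. Unlike merge, which by default sorts the result by the key column, match keeps the rows of dataframe1 in their original order.

```r
# Made-up example: ids in dataframe1 are unsorted and 1001 appears twice
dataframe1 <- data.frame(ids = c(1003, 1001, 1002, 1001),
                         variable = c(10, 20, 30, 40))
dataframe2 <- data.frame(ids = c(1001, 1002, 1003),
                         generics = c(1, 2, 3))

# match() returns, for each id in dataframe1, its row position in dataframe2;
# indexing generics by those positions yields the replacement column
dataframe1$ids <- dataframe2$generics[match(dataframe1$ids, dataframe2$ids)]
dataframe1
```

Any id in dataframe1 that has no entry in dataframe2 would become NA, so it may be worth checking anyNA(dataframe1$ids) afterwards.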
Replacing a column is straightforward in R. You didn't provide much code, so here is a generic example that I hope will help:
# Create 4 random columns (vectors) of data and combine them into data frames:
a <- rnorm(n=100,mean = 0,sd=1)
b <- rnorm(n=100,mean = 0,sd=1)
c <- rnorm(n=100,mean = 0,sd=1)
d <- rnorm(n=100,mean = 0,sd=1)
df_ab <- as.data.frame(cbind(a,b))
df_cd <- as.data.frame(cbind(c,d))
#if you want column d in df_cd to equal column a in df_ab simply use the assignment operator
df_cd$d <- df_ab$a
#you can also use the subsetting with square brackets:
df_cd[,"d"] <- df_ab[,"a"]

Repeated values when joining data frames in R

When I merge data frames, I use this code:
library(readxl)
df1 <- read_excel("C:/Users/PC/Desktop/precipitaciones_4Q.xlsx")
df2 <- read_excel("C:/Users/PC/Desktop/libro_copia_1.xlsx")
df1 = data.frame(df1)
df2 = data.frame(df2)
df1$codigo = toupper(df1$codigo)
df2$codigo = toupper(df2$codigo)
dat = merge.data.frame(df1,df2,by= "codigo", all.y = TRUE,sort = TRUE)
The data contains rainfall by county; df1 has fewer counties than df2. I want to add the rainfall data for the counties in df1 to df2.
The problem is that when the county data is merged into df2, repeated counties appear.
Instead of "id", you must specify the names of the join columns in the first and second tables.
You can use the data.table package and the code below:
library(data.table)
dat <- merge(df1, df2, by.x = "Columna1", by.y = "prov", all.y = TRUE)
Alternatively, you can use the funion function (which expects data.table inputs):
dat <- funion(df1, df2)
or the rbind function followed by unique:
dat <- rbind(df1, df2)
dat <- unique(dat)
Note: for funion and rbind, the column names and the number of columns of the two data frames need to be the same.
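One common cause of repeated rows after a merge, offered here as an assumption since the original tables are not shown, is that the key appears more than once in one of the inputs. A minimal sketch with made-up data (the columns lluvia and provincia are hypothetical):

```r
# Made-up data: the key "B" appears twice in df1
df1 <- data.frame(codigo = c("A", "B", "B"),
                  lluvia = c(10, 20, 20))
df2 <- data.frame(codigo = c("A", "B", "C"),
                  provincia = c("P1", "P2", "P3"))

# Each matching pair of rows produces one output row, so "B" is repeated
dat_dup <- merge(df1, df2, by = "codigo", all.y = TRUE)

# Dropping exact duplicate rows from df1 first gives one row per county
dat <- merge(unique(df1), df2, by = "codigo", all.y = TRUE)

# Quick diagnostic: which keys occur more than once in df1?
repeated_keys <- unique(df1$codigo[duplicated(df1$codigo)])
```

If the repeated keys carry different values rather than exact duplicates, unique() will not remove them, and the rows need to be aggregated or filtered before merging.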

Unknown result of select command

I have multiple .csv files (mydata_1, mydata_2, ...) with the same number of columns and the same column names (but different numbers of rows, if that helps). After reading them into my environment they have the class data.frame. I put them all in a list and now want to select specific columns by name from all of them, so that each variable keeps its name but contains only the chosen columns.
mydata_1 = matrix(c(1:21), nrow=3, ncol=7,byrow = TRUE)
mydata_2 = matrix(c(1:21), nrow=3, ncol=7,byrow = TRUE)
colnames(mydata_1) = c(paste0("X","1":"7"))
colnames(mydata_2) = c(paste0("X","1":"7"))
df1 = as.data.frame(mydata_1)
df2 = as.data.frame(mydata_2)
all_data = c(df1, df2)
class(all_data)
class(df1)
for (i in all_data){
i = select(i,"X3":"X5")
}
My for loop should output the data frames df1 and df2 with just three columns (instead of the original seven), but when I run the code an error message from the select command appears.
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "c('integer', 'numeric')"
How can I get a working output of my new dfs?
The first issue here is that you are trying to create a list using c(df1, df2), whereas you have to use list(df1, df2).
Data
library(dplyr)
library(purrr)
mydata_1 = matrix(c(1:21), nrow=3, ncol=7,byrow = TRUE)
mydata_2 = matrix(c(1:21), nrow=3, ncol=7,byrow = TRUE)
colnames(mydata_1) = c(paste0("X","1":"7"))
colnames(mydata_2) = c(paste0("X","1":"7"))
df1 = as.data.frame(mydata_1)
df2 = as.data.frame(mydata_2)
all_data = list(df1 = df1, df2 = df2)
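The difference between c() and list() can be checked directly. The following sketch uses only base R, and the names flat and nested are made up for illustration: c() on data frames concatenates their columns into one flat list, which is why the loop in the question ended up passing a bare integer vector to select.

```r
# Two small data frames with columns X1..X7, as in the question
df1 <- as.data.frame(matrix(1:21, nrow = 3, ncol = 7, byrow = TRUE))
names(df1) <- paste0("X", 1:7)
df2 <- df1

flat   <- c(df1, df2)                  # flat list of 14 column vectors
nested <- list(df1 = df1, df2 = df2)   # list of 2 data frames

length(flat)        # number of columns across both data frames
class(flat[[1]])    # an integer vector, not a data frame
length(nested)      # number of data frames
class(nested[[1]])  # a data frame
```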
The second problem is within your loop. With this approach you have to create an empty list before running the loop and then store the result of each iteration in it.
all_data2 <- list()
for(i in seq_along(all_data)) {
  all_data2[[i]] <- all_data[[i]] %>% select(X3, X4, X5)
}
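The reason the original loop had no effect can be shown with a short base R sketch. The data and the names cols_after_copy_loop and cols_after_index_loop are made up for illustration: in for (d in all_data), the variable d is a copy of each element, so assigning to it never touches the list.

```r
all_data <- list(
  df1 = data.frame(X1 = 1:3, X3 = 4:6, X4 = 7:9),
  df2 = data.frame(X1 = 1:2, X3 = 3:4, X4 = 5:6)
)

# Assigning to the loop variable only changes a temporary copy
for (d in all_data) {
  d <- d[, c("X3", "X4"), drop = FALSE]
}
cols_after_copy_loop <- sapply(all_data, ncol)   # list elements are unchanged

# Writing back by index updates the list itself
for (i in seq_along(all_data)) {
  all_data[[i]] <- all_data[[i]][, c("X3", "X4"), drop = FALSE]
}
cols_after_index_loop <- sapply(all_data, ncol)  # now only the selected columns
```

Indexing with seq_along also keeps the element names (df1, df2) intact, because existing elements are overwritten in place.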
Alternatively, try map from purrr, which is part of the tidyverse and leads to cleaner code with the same result.
# Here `.x` is replaced by each element of the list all_data
# in each iteration, ending with a list of two data frames
all_data2 = map(all_data, ~.x %>%
select(X3, X4, X5))
Consider base R's subset with the select argument for contiguous column selection, wrapped in an lapply call. Unlike a for loop, lapply does not require the bookkeeping of reassigning each element back into a list:
all_data <- list(df1 = df1, df2 = df2)
all_data_sub <- lapply(all_data, function(df) subset(df, select=X3:X5))

Problems with casting a dataframe with text columns

I have this text dataframe with all columns being character vectors.
Gene.ID barcodes value
A2M TCGA-BA-5149-01A-01D-1512-08 Missense_Mutation
ABCC10 TCGA-BA-5559-01A-01D-1512-08 Missense_Mutation
ABCC11 TCGA-BA-5557-01A-01D-1512-08 Silent
ABCC8 TCGA-BA-5555-01A-01D-1512-08 Missense_Mutation
ABHD5 TCGA-BA-5149-01A-01D-1512-08 Missense_Mutation
ACCN1 TCGA-BA-5149-01A-01D-1512-08 Missense_Mutation
How do I build a data frame from this using reshape/reshape2 such that I get a data frame of the format Gene.ID~barcodes, with the values being the text in the value column and "NA" or "WT" as a filler?
The default aggregation function keeps defaulting to length, which I want to avoid if possible.
I think this will work for your problem. First, I'm generating some data similar to yours. I'm making gene.id and barcode a factor for simplicity and this should be the same as your data.
geneNames <- c(paste("gene", 1:10, sep = ""))
data <- data.frame(gene = as.factor(c(1:10, 1:4, 6:10)),
express = sample(c("Silent", "Missense_Mutation"), 19, TRUE),
barcode = as.factor(c(rep(1, 10), rep(2, 9))))
The vector geneNames holds the gene names (e.g., A2M). In order to get NA values for samples that lack an entry for a given gene, you need to merge the data so that you have number_of_genes by number_of_barcodes rows.
geneID <- unique(data$gene)
data2 <- data.frame(barcode = rep(unique(data$barcode), each = length(geneID)),
gene = geneID)
data3 <- merge(data, data2, by = c("barcode", "gene"), all.y = TRUE)
Now melt and cast the data:
library(reshape)
mdata3 <- melt(data3, id.vars = c("barcode", "gene"))
cdata <- cast(mdata3, barcode ~ variable + gene, identity)
names(cdata) <- c("barcode", geneNames)
You should then have a data frame with number_of_barcodes rows and with (number_of_unique_genes + 1) columns. Each column should contain the expression information for that particular gene in that particular sample barcode.
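As an alternative to the reshape approach above, and not what the answer itself proposes, the wide table can be sketched in base R by pre-filling a character matrix with "WT" and assigning the text values by row and column name. This assumes each gene/barcode pair occurs at most once (a later duplicate would silently overwrite an earlier one). The names mut and wide are made up for illustration.

```r
# The six rows shown in the question
mut <- data.frame(
  Gene.ID  = c("A2M", "ABCC10", "ABCC11", "ABCC8", "ABHD5", "ACCN1"),
  barcodes = c("TCGA-BA-5149-01A-01D-1512-08", "TCGA-BA-5559-01A-01D-1512-08",
               "TCGA-BA-5557-01A-01D-1512-08", "TCGA-BA-5555-01A-01D-1512-08",
               "TCGA-BA-5149-01A-01D-1512-08", "TCGA-BA-5149-01A-01D-1512-08"),
  value    = c("Missense_Mutation", "Missense_Mutation", "Silent",
               "Missense_Mutation", "Missense_Mutation", "Missense_Mutation"),
  stringsAsFactors = FALSE
)

genes   <- unique(mut$Gene.ID)
samples <- unique(mut$barcodes)

# Gene x barcode matrix pre-filled with the filler value
wide <- matrix("WT", nrow = length(genes), ncol = length(samples),
               dimnames = list(genes, samples))

# A two-column character matrix indexes cells by (row name, column name)
wide[cbind(mut$Gene.ID, mut$barcodes)] <- mut$value

wide_df <- as.data.frame(wide, stringsAsFactors = FALSE)
```

Because no aggregation function is involved, the text values are kept as they are rather than being replaced by counts.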

Resources