Subsetting data in R for repeated values - r

I have two data frames: df1 has 300,000 rows and df2 has 100,000 rows. Some values in df1 are repeated (this can be seen from the dimensions of the data), and I have to do a graphical analysis on the data. df2 contains metadata for the values in the rows of df1.
dput(df1[1:5, ])
c("ENSG00000272905.1", "ENSG00000269148.1", "ENSG00000272905.1",
"ENSG00000204581.2", "ENSG00000158486.12")
dput(df2[1:5, ])
structure(list(ensembl_gene_id = c("ENSG00000004838", "ENSG00000005206",
"ENSG00000007174", "ENSG00000009724", "ENSG00000009844"), hgnc_symbol = c("ZMYND10",
"SPPL2B", "DNAH9", "MASP2", "VTA1"), gene_biotype = c("protein_coding",
"protein_coding", "protein_coding", "protein_coding", "protein_coding"
)), row.names = c(NA, 5L), class = "data.frame")
I want to match each row in df1 and store its metadata (given in df2) in the corresponding columns. My expected results are:
dput(df3[1:5, ])
c("ENSG00000000419.11 ENSG00000000419 DPM1 protein_cod",
"ENSG00000000419.11 ENSG00000000419 DPM1 protein_cod",
"ENSG00000000460.15 ENSG00000000460 C1orf112 protein_cod",
"ENSG00000000460.15 ENSG00000000460 C1orf112 protein_cod",
"ENSG00000000460.15 ENSG00000000460 C1orf112 protein_cod"
)
I tried the match function, but it returned NA because the values in column 1 of df1 contain a decimal suffix (e.g. ".1"). I also tried the %in% operator, but that returned "Error: incorrect dimension".
What should the script look like so that I can subset my data without omitting repeated values?

merge() automatically joins the data frames by their common variable names, but you would most likely want to specify df3 <- merge(df1, df2, by = "ensembl_gene_id") to make sure that you are matching on only the fields you want.
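As a minimal base R sketch of that merge: the IDs in the question's df1 carry a version suffix that df2 lacks, so the suffix has to be removed before the keys can match. The column name gene_id_version and the df1 rows below (including their ".1" suffixes) are assumptions made for illustration; only the df2 values are taken from the question.

```r
# Toy data: the df2 rows come from the question, the df1 IDs are made up.
df1 <- data.frame(
  gene_id_version = c("ENSG00000004838.1", "ENSG00000005206.1", "ENSG00000004838.1"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ensembl_gene_id = c("ENSG00000004838", "ENSG00000005206"),
  hgnc_symbol = c("ZMYND10", "SPPL2B"),
  gene_biotype = c("protein_coding", "protein_coding"),
  stringsAsFactors = FALSE
)

# Remove the trailing ".<digits>" so the key has the same form as in df2.
df1$ensembl_gene_id <- sub("\\.[0-9]+$", "", df1$gene_id_version)

# merge() returns a row for every df1 row that has a match in df2,
# so repeated IDs are kept (the result is sorted by the key).
df3 <- merge(df1, df2, by = "ensembl_gene_id")

# match() is an order-preserving alternative: look up each df1 ID in df2.
idx <- match(df1$ensembl_gene_id, df2$ensembl_gene_id)
df3_ordered <- data.frame(df1, df2[idx, c("hgnc_symbol", "gene_biotype")],
                          row.names = NULL)
```

The match() variant keeps the original row order of df1 and fills unmatched IDs with NA, which may be preferable when the rows feed directly into a plot.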

I'm always a fan of the dplyr package (part of the tidyverse).
You will likely need something like this (unique() drops duplicates):
df3 <- inner_join(unique(df1), df2, by = "ensembl_gene_id")
Alternatively, you could just filter for the desired rows:
df3 <- df2 %>% filter(ensembl_gene_id %in% pull(df1, ensembl_gene_id))
Edit: I just reread the question; ignore unique(). The second method will also drop the duplicates.
You just want df3 <- inner_join(df1, df2, by = "ensembl_gene_id")

Try the following code -
library(dplyr)
# Strip the version suffix (everything from the first dot) before joining
result <- df1 %>%
  mutate(ensembl_gene_id = sub('\\..*', '', ensembl_gene_id)) %>%
  inner_join(df2, by = 'ensembl_gene_id')
result

Related

Return the row indices of df1 when those row values occur in df2 in R

I'm coding in R. I have a big data frame (df1) and a little data frame (df2). df2 is a subset of df1, but in a random order. I need to know the row indices of df1 which occur in df2. All of the specific cell values have lots of duplicates. Tapirus terrestris shows up more than once, as does each ModType value. I tried experimenting with which() and grepl() but couldn't get my code to work.
df1 <- data.frame(
SpeciesName = c('Tapirus terrestris', 'Panthera onca', 'Leopardus tigrinus' , 'Leopardus tigrinus'),
ModType = c('ANN', 'GAM', 'GAM','RF'),
Variable_scale = c('aspect_s2_sd', 'CHELSAbio1019_s3_sd','CHELSAbio1015_s4_sd','CHELSAbio1015_s4_sd'))
df2 <- data.frame(
SpeciesName = c('Tapirus terrestris', 'Leopardus tigrinus'),
ModType = c('ANN', 'RF'),
Variable_scale = c('aspect_s2_sd', 'CHELSAbio1015_s4_sd'))
Should output an array: 1,4 because df1 rows 1 and 4 occur in df2.
You can create an index column in df1 and merge the datasets.
df1$index <- 1:nrow(df1)
df3 <- merge(df1, df2)
df3$index
#[1] 4 1
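You can use match. Note that this matches on SpeciesName only, so it returns the first row of df1 with that species rather than the exact row.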
df1[match(df2$SpeciesName, df1$SpeciesName), ]
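Because the rows in this example are only distinguishable when all three columns are taken together, a row-wise comparison may be closer to what the question asks for. The following is a sketch, assuming both data frames have the same columns in the same order; the names key1, key2 and idx are illustrative.

```r
df1 <- data.frame(
  SpeciesName = c("Tapirus terrestris", "Panthera onca", "Leopardus tigrinus", "Leopardus tigrinus"),
  ModType = c("ANN", "GAM", "GAM", "RF"),
  Variable_scale = c("aspect_s2_sd", "CHELSAbio1019_s3_sd", "CHELSAbio1015_s4_sd", "CHELSAbio1015_s4_sd"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  SpeciesName = c("Tapirus terrestris", "Leopardus tigrinus"),
  ModType = c("ANN", "RF"),
  Variable_scale = c("aspect_s2_sd", "CHELSAbio1015_s4_sd"),
  stringsAsFactors = FALSE
)

# Collapse every row into one string key. "\r" is an arbitrary separator
# chosen because it is unlikely to occur inside the data.
key1 <- do.call(paste, c(df1, sep = "\r"))
key2 <- do.call(paste, c(df2, sep = "\r"))

# Row indices of df1 whose full row also occurs in df2.
idx <- which(key1 %in% key2)
idx
# [1] 1 4
```

Using match(key2, key1) instead of which() would return the indices in the order of df2 rather than in ascending order.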
Another option is tidyverse
library(dplyr)
df1 %>%
mutate(index = row_number()) %>%
inner_join(df2)

Matching based on conditional data

I have two Dfs with multiple rows and columns. I want to see if Df1$Name matches Df2$Name. If it matches, I want to take the matched value and create a new variable in Df3. If it doesn't match, I want to paste the value from Df1. The issue is that Df1 has 270 observations and Df2 has 277.
See example:
Df1
Name
Natalie
Desmond,James
Kylie
Df2
Name
<Na>
Desmond,James
<Na>
Df3
Merged_name
Natalie
Desmond,James
Kylie
I've tried:
Df3$Merged_name <- ifelse(Df1$Name %in% Df2$Name & !is.na(Df2$Name),
                          Df1$Name,
                          Df2$Name)
I get an error saying that the longer object length is not a multiple of the shorter object length, which I'm assuming is due to the differing number of observations. Do I have to separate rows that have more than one name in them (i.e. separate_rows())? If so, how do I merge them back together?
You can use the cbind.fill function from the rowr package, which binds columns that have different numbers of rows, and then apply the conditional logic you described:
library(dplyr)
library(rowr)
Df1 <-
data.frame(
Name = c("Natalie", "Desmond,James", "Kylie"),
stringsAsFactors = FALSE
)
Df2 <-
data.frame(
Name = c(NA_character_, "Desmond,James", NA_character_, "Test"),
stringsAsFactors = FALSE
)
# Binding data by column and renaming similar column names
cbind.fill(Df1 %>% rename(Name1 = Name), Df2 %>% rename(Name2 = Name), fill = NA) %>%
mutate(Name = coalesce(Name2, Name1)) %>% # Conditional logic given
select(Name)
# Name
# Natalie
# Desmond,James
# Kylie
# Name1
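If rowr is not available, the same padding idea can be sketched in base R. This is an alternative to the cbind.fill call, not a reproduction of it: it assumes the two Name columns are meant to be aligned by row position, and the helper names n, name1 and name2 are illustrative.

```r
Df1 <- data.frame(
  Name = c("Natalie", "Desmond,James", "Kylie"),
  stringsAsFactors = FALSE
)
Df2 <- data.frame(
  Name = c(NA_character_, "Desmond,James", NA_character_, "Test"),
  stringsAsFactors = FALSE
)

# Indexing a vector past its end yields NA, which pads the shorter column.
n <- max(nrow(Df1), nrow(Df2))
name1 <- Df1$Name[seq_len(n)]
name2 <- Df2$Name[seq_len(n)]

# Take the Df2 value where it is present, otherwise fall back to Df1.
Df3 <- data.frame(
  Merged_name = ifelse(is.na(name2), name1, name2),
  stringsAsFactors = FALSE
)
Df3
```

Both vectors now have the same length, so ifelse() no longer triggers the recycling problem described in the question.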

Merging two columns of words having small and capital letters in r

I have two data frames and I want to merge them using two columns that are like below:
a <- data.frame(A = c("Ali", "Should Be", "Calif"))
b <- data.frame(B = c("ALI", "CALIF", "SHOULD BE"))
Could you please let me know if it is possible to do it in r?
One way would be to decapitalize your character values using tolower from base R and then do a merge:
library(dplyr) # for mutate()
# df1 and df2 stand for the question's a and b
df1 <- a %>%
  mutate(A = tolower(A))
df2 <- b %>%
  mutate(B = tolower(B))
df3 <- merge(df1, df2, by.x = "A", by.y = "B")
df3
A
1 ali
2 calif
3 should be
Is this what you needed?
Edit: The dplyr bit is of course not necessary. If everything is to be done in base R, df1$A=tolower(df1$A) and df2$B=tolower(df2$B) - as suggested in the comments - work just as well.
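If the original capitalization should survive the merge, one option is to merge on a lower-cased helper column instead of overwriting A and B. This is a sketch under that assumption; the column name key is illustrative and not part of the question.

```r
a <- data.frame(A = c("Ali", "Should Be", "Calif"), stringsAsFactors = FALSE)
b <- data.frame(B = c("ALI", "CALIF", "SHOULD BE"), stringsAsFactors = FALSE)

# Build a case-insensitive key in each data frame and leave A and B untouched.
a$key <- tolower(a$A)
b$key <- tolower(b$B)

# Merge on the helper key; the result keeps both original spellings.
merged <- merge(a, b, by = "key")
merged
```

The key column can be dropped afterwards with merged$key <- NULL if it is not needed.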

Recoding a large number of variables using another data frame in R

I'd like to use a data frame (Df2) to recode the variables of another data frame (Df1), so that the end result is a data frame that contains text like local/international rather than 1s/2s (Df3). Missingness is present in the Df1 data frame, and I'd like to make sure it's represented as NA.
This is a minimal working example, the actual data set contains more than a hundred variables (all of which are of the character class) with between one and fifteen levels. Any help would be much appreciated.
Starting point (dfs)
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))
Desired outcome (df)
Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))
Thoughts, not really code, so far: (If there's a match between a row of the df2 NameOfVariable and a df1 variable name, as well as a match between a row of df2 VariableLevel and a df1 observation, then paste the corresponding row of df2 VariableDef into df1. Wondering if you can use if statements for it.)
if (Df2["NameOfVariable"]==names(Df1))
{
if (Df2["VariableLevel"]==Df1[ ])
{
Df1[ ] <- paste0("VariableDef")
}
}
Here is one method in base R using match and Map. Map applies a function to corresponding list elements. Here, there are two lists: Df1 (a list of columns) and a list composed of the second and third columns of Df2, split by column 1. The second list is reordered to match the order of the names in Df1.
The applied function matches the elements of a column of Df1 against the VariableLevel values in the corresponding element of the second list and uses the result as an index to return the matching VariableDef values. Map returns a list, which is converted to a data.frame with the function of the same name.
data.frame(Map(function(x, y) y[[2]][match(x, y[[1]])],
Df1,
split(Df2[2:3], Df2[1])[names(Df1)]))
this returns
buyer_Q1 seller_Q2 price_Q1_2
1 local internat 50-100K
2 internat local 100-200K
3 local NA 200+K
4 local internat 100-200K
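To see what Map works on, it may help to run the two steps by hand for a single column. The sketch below uses the question's data with stringsAsFactors = FALSE and splits on Df2$NameOfVariable directly; the names lookup and pos are illustrative.

```r
Df1 <- data.frame("buyer_Q1" = c(1, 2, 1, 1),
                  "seller_Q2" = c(2, 1, 3, 2),
                  "price_Q1_2" = c(2, 5, 7, 5))
Df2 <- data.frame(
  "NameOfVariable" = c("buyer_Q1", "buyer_Q1", "seller_Q2", "seller_Q2",
                       "seller_Q2", "price_Q1_2", "price_Q1_2", "price_Q1_2"),
  "VariableLevel" = c(1, 2, 1, 2, 3, 2, 5, 7),
  "VariableDef" = c("local", "internat", "local", "internat", "NA",
                    "50-100K", "100-200K", "200+K"),
  stringsAsFactors = FALSE
)

# One small lookup table per variable, reordered to follow the columns of Df1.
lookup <- split(Df2[c("VariableLevel", "VariableDef")], Df2$NameOfVariable)[names(Df1)]

# Position of each observed code within the lookup table for buyer_Q1.
pos <- match(Df1$buyer_Q1, lookup$buyer_Q1$VariableLevel)
pos
# [1] 1 2 1 1

# Using those positions as an index returns the labels.
lookup$buyer_Q1$VariableDef[pos]
# [1] "local"    "internat" "local"    "local"
```

Map simply repeats the last two steps for every column of Df1 paired with its lookup table.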
Solution using a loop and factors. Be careful: the results look equivalent, but they are not. The function fun returns a data frame with factor columns. If needed, you can convert them to character.
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))
Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))
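fun <- function(df, mdf) {
  for (varn in names(df)) {
    # Rows of the metadata that describe the current variable
    dat <- mdf[mdf$NameOfVariable == varn & !is.na(mdf$VariableDef),]
    # Recode: the codes become factor levels, the definitions become labels
    df[[varn]] <- factor(df[[varn]], dat$VariableLevel, dat$VariableDef)
  }
  return(df)
}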
fun(Df1, Df2)
Df3
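The conversion to character mentioned above could look like the following sketch. It repeats fun and the data so that it runs on its own; the name recoded is illustrative, and turning the literal string "NA" into a real missing value is an assumption based on the question's wish to have missingness represented as NA.

```r
Df1 <- data.frame("buyer_Q1" = c(1, 2, 1, 1),
                  "seller_Q2" = c(2, 1, 3, 2),
                  "price_Q1_2" = c(2, 5, 7, 5))
Df2 <- data.frame(
  "NameOfVariable" = c("buyer_Q1", "buyer_Q1", "seller_Q2", "seller_Q2",
                       "seller_Q2", "price_Q1_2", "price_Q1_2", "price_Q1_2"),
  "VariableLevel" = c(1, 2, 1, 2, 3, 2, 5, 7),
  "VariableDef" = c("local", "internat", "local", "internat", "NA",
                    "50-100K", "100-200K", "200+K"),
  stringsAsFactors = FALSE
)

fun <- function(df, mdf) {
  for (varn in names(df)) {
    dat <- mdf[mdf$NameOfVariable == varn & !is.na(mdf$VariableDef), ]
    df[[varn]] <- factor(df[[varn]], dat$VariableLevel, dat$VariableDef)
  }
  df
}

recoded <- fun(Df1, Df2)

# Convert every factor column to character while keeping the data frame shape.
recoded[] <- lapply(recoded, as.character)

# Assumption: the label "NA" stands for a missing value, so make it a real NA.
recoded[recoded == "NA"] <- NA
recoded
```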
A solution using dplyr and tidyr. The code will work even if it emits warning messages, which appear because the columns are factors. If you don't want to see any warnings, set stringsAsFactors = FALSE when creating the data frames, as in the example data I provide below.
library(dplyr)
library(tidyr)
Df3 <- Df1 %>%
mutate(ID = 1:n()) %>%
gather(NameOfVariable, VariableLevel, -ID) %>%
left_join(Df2, by = c("NameOfVariable", "VariableLevel")) %>%
select(-VariableLevel) %>%
spread(NameOfVariable, VariableDef) %>%
select(-ID)
Df3
buyer_Q1 price_Q1_2 seller_Q2
1 local 50-100K internat
2 internat 100-200K local
3 local 200+K NA
4 local 100-200K internat
DATA
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),
"seller_Q2"=c(2,1,3,2),
"price_Q1_2"=c(2,5,7,5),
stringsAsFactors = FALSE)
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),
"VariableLevel"=c(1,2,1,2,3,2,5,7),
"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"),
stringsAsFactors = FALSE)

Problems with casting a dataframe with text columns

I have this text dataframe with all columns being character vectors.
Gene.ID barcodes value
A2M TCGA-BA-5149-01A-01D-1512-08 Missense_Mutation
ABCC10 TCGA-BA-5559-01A-01D-1512-08 Missense_Mutation
ABCC11 TCGA-BA-5557-01A-01D-1512-08 Silent
ABCC8 TCGA-BA-5555-01A-01D-1512-08 Missense_Mutation
ABHD5 TCGA-BA-5149-01A-01D-1512-08 Missense_Mutation
ACCN1 TCGA-BA-5149-01A-01D-1512-08 Missense_Mutation
How do I build a dataframe from this using reshape/reshape2 such that I get a dataframe of the format Gene.ID~barcodes, with the values being the text in the value column for each pair and "NA" or "WT" as a filler?
The default aggregation function keeps defaulting to length, which I want to avoid if possible.
I think this will work for your problem. First, I'm generating some data similar to yours. I'm making gene and barcode factors for simplicity; this should behave the same as your data.
geneNames <- c(paste("gene", 1:10, sep = ""))
data <- data.frame(gene = as.factor(c(1:10, 1:4, 6:10)),
express = sample(c("Silent", "Missense_Mutation"), 19, TRUE),
barcode = as.factor(c(rep(1, 10), rep(2, 9))))
geneNames is a vector of the gene names (e.g., A2M). In order to get NA values for the barcodes that lack an entry for a given gene, you need to merge the data so that you have number_of_genes by number_of_barcodes rows.
geneID <- unique(data$gene)
data2 <- data.frame(barcode = rep(unique(data$barcode), each = length(geneID)),
gene = geneID)
data3 <- merge(data, data2, by = c("barcode", "gene"), all.y = TRUE)
Now melting and casting the data,
library(reshape)
mdata3 <- melt(data3, id.vars = c("barcode", "gene"))
cdata <- cast(mdata3, barcode ~ variable + gene, identity)
names(cdata) <- c("barcode", geneNames)
You should then have a data frame with number_of_barcodes rows and with (number_of_unique_genes + 1) columns. Each column should contain the expression information for that particular gene in that particular sample barcode.
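If the reshape package is not a requirement, the same wide layout can be sketched in base R by filling a character matrix. This is a different technique from the cast() call above: it assumes each Gene.ID/barcode pair occurs at most once, uses the six rows shown in the question, and the names mut, genes, samples and wide are illustrative.

```r
mut <- data.frame(
  Gene.ID = c("A2M", "ABCC10", "ABCC11", "ABCC8", "ABHD5", "ACCN1"),
  barcodes = c("TCGA-BA-5149-01A-01D-1512-08", "TCGA-BA-5559-01A-01D-1512-08",
               "TCGA-BA-5557-01A-01D-1512-08", "TCGA-BA-5555-01A-01D-1512-08",
               "TCGA-BA-5149-01A-01D-1512-08", "TCGA-BA-5149-01A-01D-1512-08"),
  value = c("Missense_Mutation", "Missense_Mutation", "Silent",
            "Missense_Mutation", "Missense_Mutation", "Missense_Mutation"),
  stringsAsFactors = FALSE
)

genes <- unique(mut$Gene.ID)
samples <- unique(mut$barcodes)

# Start with a genes x barcodes matrix filled with the placeholder "WT".
wide <- matrix("WT", nrow = length(genes), ncol = length(samples),
               dimnames = list(genes, samples))

# A two-column character matrix indexes cells by (row name, column name).
wide[cbind(mut$Gene.ID, mut$barcodes)] <- mut$value

# Optional: convert to a data frame with Gene.ID as a regular column.
wide_df <- data.frame(Gene.ID = rownames(wide), wide,
                      check.names = FALSE, row.names = NULL)
```

Because no aggregation function is involved, nothing falls back to length; duplicate gene/barcode pairs would simply overwrite each other, so they should be checked for beforehand.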
