matching large vector of string against large vector of patterns - r

I have a very large dataframe with a column containing postal codes:
data <- data.frame(data = rnorm(n = 4),
code = c("1001", "1130", "2001", "9010"),
stringsAsFactors = F)
I also have a second, fairly large dataframe that maps postal code patterns to a zone.
mapping <- data.frame(code = c("10*", "20*"),
zone = c("zone1", "zone2"),
stringsAsFactors = F)
I would like to join those two tables to add the zone column to the data dataframe but the volume of the data is too large to do a "rowwise" grepl. What is the most efficient way of doing this?

data.table is an efficient choice for large objects. To do a join, you need a common column in both objects, so I use substr to keep only the first two digits of the code column in the data object. Note that I also removed the "*" from mapping (see DATA below), as that character is not present in data.
library(data.table)
setDT(data)
setDT(mapping)
data[, code := substr(code, start = 1, stop = 2)]
mapping[data, on="code"]
code zone data
1: 10 zone1 -1.0481912
2: 11 <NA> 1.1339476
3: 20 zone2 -0.8072921
4: 90 <NA> 1.5883562
DATA
data <- data.frame(data = rnorm(n = 4),
code = c("1001", "1130", "2001", "9010"),
stringsAsFactors = F)
mapping <- data.frame(code = c("10", "20"),
zone = c("zone1", "zone2"),
stringsAsFactors = F)
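If you want to keep the original code column intact, here is a minimal sketch that stores the prefix in a separate helper column and adds zone with an update join. The column name prefix is my own choice, the numeric values are fixed instead of random so the result is reproducible, and the sketch assumes every pattern is a two-character prefix followed by "*".

```r
library(data.table)

data <- data.table(data = c(0.1, 0.2, 0.3, 0.4),
                   code = c("1001", "1130", "2001", "9010"))
mapping <- data.table(code = c("10*", "20*"),
                      zone = c("zone1", "zone2"))

# Strip the literal "*" so the mapping holds plain prefixes
mapping[, code := sub("*", "", code, fixed = TRUE)]

# Hypothetical helper column: the first two characters of each postal code
data[, prefix := substr(code, 1, 2)]

# Update join: add zone by reference where prefix equals mapping$code
data[mapping, on = .(prefix = code), zone := i.zone]

# Drop the helper column; code is untouched
data[, prefix := NULL]
```

Rows without a matching prefix simply get NA in zone.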

I am not sure what specific method you are using when you say "rowwise" but here is what I would do in the dplyr world.
library(dplyr) # needed for the %>% pipe
mapping <- dplyr::rename(mapping, codeString = code) # rename for joining.
data <- data %>%
dplyr::mutate( codeString = paste0(substr(code, 1, 2), "*")) %>%
dplyr::left_join(mapping, by= "codeString")
You should be able to join like this and avoid any rowwise operation, since the pattern you're looking for is easy to create.
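If the patterns do not all share the same prefix length, one possible approach is to repeat a vectorised lookup once per distinct prefix length. This is only a sketch under my own assumptions: the question shows two-digit prefixes only, the "200*" pattern is made up for illustration, and longer prefixes are allowed to override shorter ones.

```r
data <- data.frame(code = c("1001", "1130", "2001", "9010"),
                   stringsAsFactors = FALSE)
mapping <- data.frame(code = c("10*", "200*"),   # "200*" is a made-up pattern
                      zone = c("zone1", "zone2"),
                      stringsAsFactors = FALSE)

# Plain prefixes without the trailing "*"
mapping$prefix <- sub("*", "", mapping$code, fixed = TRUE)

data$zone <- NA_character_
# One vectorised match() per prefix length, shortest first,
# so a longer (more specific) prefix overwrites a shorter one
for (n in sort(unique(nchar(mapping$prefix)))) {
  idx <- match(substr(data$code, 1, n), mapping$prefix)
  hit <- !is.na(idx)
  data$zone[hit] <- mapping$zone[idx[hit]]
}
```

The loop runs over the number of distinct prefix lengths, not over rows, so it should stay cheap as long as there are only a handful of lengths.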

Related

Filter rows in dataset for distinct words in r

Goal: To filter rows in a dataset so that only distinct words remain. At the moment, I have used inner_join to retain the rows present in both datasets, which has left duplicate rows in this dataset.
Attempt 1: I have tried to use distinct to retain only those rows which are unique, but this has not worked. I may be using it incorrectly.
This is my code so far; output attached in png format:
# join warriner emotion lemmas by `word` column in collocations data frame to see how many word matches there are
warriner2 <- dplyr::inner_join(warriner, coll, by = "word") # join data; retain only rows in both sets (works both ways)
warriner2 <- distinct(warriner2)
warriner2
coll2 <- dplyr::semi_join(coll, warriner, by = "word") # join all rows in a that have a match in b
# There are 8166 lemma matches (including double-ups)
# There are XXX unique lemma matches
You can try:
library(dplyr)
warriner2 <- inner_join(warriner, coll, by = "word") %>%
distinct(word, .keep_all = TRUE)
To further clarify Ronak's answer, here is an example with some mock data. Note that you can just use distinct() at the end of the pipe to keep distinct rows if that's what you want. Your error might very well have occurred because you performed two operations and assigned the result to the same name both times (warriner2).
library(dplyr)
# Here's a couple sample tibbles
name <- c("cat", "dog", "parakeet")
df1 <- tibble(
x = sample(5, 99, rep = TRUE),
y = sample(5, 99, rep = TRUE),
name = rep(name, times = 33))
df2 <- tibble(
x = sample(5, 99, rep = TRUE),
y = sample(5, 99, rep = TRUE),
name = rep(name, times = 33))
# It's much less confusing if you do this in one pipe
p <- df1 %>%
inner_join(df2, by = "name") %>%
distinct()
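To make the difference between the two calls concrete, here is a small sketch with made-up data (the words and scores are not from the question). distinct() with no arguments compares whole rows, while distinct(word, .keep_all = TRUE) keeps the first row for each word.

```r
library(dplyr)

# Made-up data: "happy" appears twice with different scores
df <- tibble(word  = c("happy", "happy", "sad"),
             score = c(1, 2, 3))

# Whole-row comparison: every row differs somewhere, so nothing is dropped
all_cols <- distinct(df)

# Compare on word only and keep the other columns of the first match
by_word <- distinct(df, word, .keep_all = TRUE)
```

So if the joined rows differ in any other column, plain distinct() will not remove the repeated words.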

How to create a new column based on partial string of another column

I have a data frame with a vector of thousands of project codes, each representing a different type of research. Here's an example:
Data <- data.frame(Assignment = c("C-209", "B-543", "G-01", "LOG"))
The first letter of the assignment code denotes the type of research. C = Cartography, B = Biology, G = Geology, and LOG = Logistics.
I would like to create a new column that looks at the first letter of the Assignment column and uses it to denote the type of research it is.
I've tried something similar to this thread, but I know I'm missing something:
R - Creating New Column Based off of a Partial String
Data <- data.frame(Assignment = c("C-209", "B-543", "G-01", "LOG"))
Types <- data.frame(Type = c("Cartography", "Biology", "Geology","Logistic"),
stringsAsFactors = FALSE)
Data %>%
mutate(Type = str_match(Assignment, Types$Type)[1,])
You can add a new column Code in your Types data.frame and then join it with original table. You will need to create a Code column in your Data data.frame too.
library(dplyr)
library(stringr)
Data <- data.frame(Assignment = c("C-209", "B-543", "G-01", "LOG"))
Types <- data.frame(Type = c("Cartography", "Biology", "Geology","Logistic"),
Code = c("C","B","G","L"), # Create new column here
stringsAsFactors = FALSE)
Data <- Data %>% mutate(Code = substr(Assignment,1L,1L)) # extract first character
Data <- left_join(Data, Types, by = "Code") %>% select(Assignment, Type) # combine
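For a small, fixed set of codes, a named vector can stand in for the lookup table and the join. This is just an alternative sketch in base R; the name type_lookup is my own, and "Logistics" follows the spelling in the question text.

```r
Data <- data.frame(Assignment = c("C-209", "B-543", "G-01", "LOG"),
                   stringsAsFactors = FALSE)

# Hypothetical lookup: names are the first letters, values are the research types
type_lookup <- c(C = "Cartography", B = "Biology", G = "Geology", L = "Logistics")

# Index the lookup by the first character of each assignment code
Data$Type <- unname(type_lookup[substr(Data$Assignment, 1, 1)])
```

Codes whose first letter is not in the lookup come back as NA.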

Merging Long-Form Data that has NAs with Wide-Form Complete Data To Override NAs

So I have three data sets that I need to merge. These contain school data and read/math scores for grades 4 and 5. One of them is a long-form data set that has a lot of missingness in some variables (yes, I do need the data in long form) and the other two have the complete data in wide form. All of these data frames contain a column that has a unique ID number for each individual in the database.
Here is a full reproducible example that generates a small example of the types of data.frames I am working with... The three data frames that I need to use are the following: school_lf, school4 and school5. school_lf has the long form data with NAs and school4 and school5 are the dfs I need to use to populate the NA's in this long form data (by id and grade)
library(reshape2) # provides melt()
set.seed(890)
school <- NULL
school$id <-sample(102938:999999, 100)
school$selected <-sample(0:1, 100, replace = T)
school$math4 <- sample(400:500, 100)
school$math5 <- sample(400:500, 100)
school$read4 <- sample(400:500, 100)
school$read5 <- sample(400:500, 100)
school <- as.data.frame(school)
# Delete observations at random from the school df
indm4 <- which(school$math4 %in% sample(school$math4, 25))
school$math4[indm4] <- NA
indm5 <- which(school$math5 %in% sample(school$math5, 50))
school$math5[indm5] <- NA
indr4 <- which(school$read4 %in% sample(school$read4, 70))
school$read4[indr4] <- NA
indr5 <- which(school$read5 %in% sample(school$read5, 81))
school$read5[indr5] <- NA
# Separate Read and Math
read <- as.data.frame(subset(school, select = -c(math4, math5)))
math <- as.data.frame(subset(school, select = -c(read4, read5)))
# Now turn this into long form data...
clr <- melt(read, id.vars = c("id", "selected"), variable.name = "variable", value.name = "readscore")
clm <- melt(math, id.vars = c("id", "selected"), value.name = "mathscore")
# Clean up the grades for each of these...
clr$grade <- ifelse(clr$variable == "read4", 4,
ifelse(clr$variable == "read5", 5, NA))
clm$grade <- ifelse(clm$variable == "math4", 4,
ifelse(clm$variable == "math5", 5, NA))
# Put all these in one df
school_lf <-cbind(clm, clr$readscore)
school_lf$readscore <- school_lf$`clr$readscore` # renames
school_lf$`clr$readscore` <- NULL # deletes
school_lf$variable <- NULL # deletes
###############
# Generate the 2 data frames with IDs that have the full data
set.seed(890)
school4 <- NULL
school4$id <-sample(102938:999999, 100)
school4$selected <-sample(0:1, 100, replace = T)
school4$math4 <- sample(400:500, 100)
school4$read4 <- sample(400:500, 100)
school4$grade <- 4
school4 <- as.data.frame(school4)
set.seed(890)
school5 <- NULL
school5$id <-sample(102938:999999, 100)
school5$selected <-sample(0:1, 100, replace = T)
school5$math5 <- sample(400:500, 100)
school5$read5 <- sample(400:500, 100)
school5$grade <- 5
school5 <- as.data.frame(school5)
I need to merge the wide-form data into the long-form data to replace the NAs with the actual values. I have tried the code below, but it introduces several columns instead of merging the read scores and the math scores where there's NA's. I simply need one column with the read scores and one with the math scores, instead of six separate columns (read.x, read.y, math.x, math.y, mathscore and readscore).
sch <- merge(school_lf, school4, by = c("id", "grade", "selected"), all = T)
sch <- merge(sch, school5, by = c("id", "grade", "selected"), all = T)
Any help is highly appreciated! I've been trying to solve this for hours now and haven't made any progress (so figured I'd ask here)
You can use the coalesce function from dplyr. If a value in the first vector is NA, it will see if the value at the same position in the second vector is not NA and select it. If again NA, it goes to the third.
library(dplyr)
sch %>% mutate(mathscore = coalesce(mathscore, math4, math5)) %>%
mutate(readscore = coalesce(readscore, read4, read5)) %>%
select(id:readscore)
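As a tiny illustration of how coalesce picks values position by position, here is a sketch with made-up vectors (the names and numbers are mine, not from the data above):

```r
library(dplyr)

# Made-up scores: the long-form column plus two columns holding the full data
long_score <- c(410, NA, NA, 455)
grade4     <- c(NA, 432, NA, NA)
grade5     <- c(NA, NA, 478, 499)

# For each position, take the first non-NA value, scanning left to right
filled <- coalesce(long_score, grade4, grade5)
```

Note that all vectors must have the same length (or length 1), so this only works once the columns sit side by side in the same data frame.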
EDIT: I just tried to do this approach on my actual data and it does not work because the replacement data also has some NAs and, as a result, the dfs I try to do coalesce with have differing number of rows... Back to square one.
I was able to figure this out with the following code (albeit it's not the most elegant or straightforward), and @Edwin's response helped point me in the right direction. Any suggestions on how to make this code more elegant and efficient are more than welcome!
# Idea: put both in long form and stack on top of one another... then merge like that!
# school4 and school5 from the example hold math4/read4 and math5/read5
sch4r <- as.data.frame(subset(school4, select = -c(math4))) # reading only
sch4m <- as.data.frame(subset(school4, select = -c(read4))) # math only
sch5r <- as.data.frame(subset(school5, select = -c(math5)))
sch5m <- as.data.frame(subset(school5, select = -c(read5)))
# Put these in LF
sch4r_lf <- melt(sch4r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch4m_lf <- melt(sch4m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
sch5r_lf <- melt(sch5r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch5m_lf <- melt(sch5m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
# Combine in one DF
sch_full_4 <-cbind(sch4r_lf, sch4m_lf$mathscore)
sch_full_4$mathscore <- sch_full_4$`sch4m_lf$mathscore`
sch_full_4$`sch4m_lf$mathscore` <- NULL # deletes
sch_full_4$variable <- NULL
sch_full_5 <- cbind(sch5r_lf, sch5m_lf$mathscore)
sch_full_5$mathscore <- sch_full_5$`sch5m_lf$mathscore`
sch_full_5$`sch5m_lf$mathscore` <- NULL
sch_full_5$variable <- NULL
# Stack together
sch_full <- rbind(sch_full_4,sch_full_5)
sch_full$selected <- NULL # delete this column...
# MERGE together
final_school_math <- mutate(school_lf, mathscore = coalesce(school_lf$mathscore, sch_full$mathscore))
final_school_read <- mutate(school_lf, readscore = coalesce(school_lf$readscore, sch_full$readscore))
final_df <- cbind(final_school_math, final_school_read$readscore)
final_df$readscore <- final_df$`final_school_read$readscore`
final_df$`final_school_read$readscore` <- NULL
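One possibly more compact route is to stack the two wide frames under common column names, join them to the long data by id and grade, and coalesce. The sketch below uses a tiny made-up data set; the names math_full and read_full are my own. It matches rows by key and does not rely on the rows being in the same order.

```r
library(dplyr)

# Tiny made-up stand-ins for school_lf, school4 and school5
school_lf <- data.frame(id = c(1, 2, 1, 2), grade = c(4, 4, 5, 5),
                        mathscore = c(410, NA, NA, 455),
                        readscore = c(NA, 420, 430, NA))
school4 <- data.frame(id = c(1, 2), grade = 4, math4 = c(410, 415), read4 = c(405, 420))
school5 <- data.frame(id = c(1, 2), grade = 5, math5 = c(450, 455), read5 = c(430, 435))

# Stack the complete data under common (hypothetical) column names
full <- bind_rows(
  rename(school4, math_full = math4, read_full = read4),
  rename(school5, math_full = math5, read_full = read5)
)

# Join by key, fill the NAs, then drop the helper columns
filled <- school_lf %>%
  left_join(full, by = c("id", "grade")) %>%
  mutate(mathscore = coalesce(mathscore, math_full),
         readscore = coalesce(readscore, read_full)) %>%
  select(-math_full, -read_full)
```

If the complete data itself has NAs, those positions stay NA, but the number of rows in the result still equals that of school_lf.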

long to wide data restructure in data.table

I have a data.table which is in long format and I want to re-structure it to wide format. Something similar to case to vars in SPSS. I have the data which is created in long format using melt
library(data.table)
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each=17),
tc = rep(c('C','D'), 17),
one = rnorm(34,1,1),
two = rnorm(34,2,1),
three = rnorm(34,3,1),
four = rnorm(34,4,1),
five = rnorm(34,5,2),
six = rnorm(34,6,2),
seven = rnorm(34,7,2),
eight = rnorm(34,28,3))
DT1 <- melt(DT, id.vars = c("town","tc"),measure=3:10)
DT1[, `:=` (mn = mean(value,na.rm = TRUE), sdev = sd(value,na.rm = TRUE), uplimit = mean(value,na.rm = TRUE)+1.96*sd(value,na.rm = TRUE), lowlimit=mean(value,na.rm = TRUE)-1.96*sd(value,na.rm = TRUE)), by = .(town,tc,variable)][, outlier := +(value < mn - 1.96*sdev | value > mn + 1.96*sdev)]
So originally the data had 34 records and 8 key variables, "one" to "eight". Using melt we get data like that generated by the code above, where the column "value" holds the values from the original data. On this data we do some computation and create the other variables "mn", "sdev", "uplimit", "lowlimit" and "outlier". Now this data needs to be merged with the original data, which has only 34 records. So I want to restructure it so that the result has 34 records and eight variables each for "mn", "sdev", "uplimit", "lowlimit" and "outlier". How can I achieve this? I was trying dcast, but I am not clear on how to use ~ and + in the formula; this is not clearly explained in ?dcast and other notes. Do you have something to share that explains this with an example? Can the requirement be achieved using dcast?
After multiple tries, I have finally found the answer to my question. The key point is that the original data needs a unique ID column, which in this case was not present. So I added sequential unique numbers at the end, in the variable "unique".
library(data.table)
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each=17),
tc = rep(c('C','D'), 17),
one = rnorm(34,1,1),
two = rnorm(34,2,1),
three = rnorm(34,3,1),
four = rnorm(34,4,1),
five = rnorm(34,5,2),
six = rnorm(34,6,2),
seven = rnorm(34,7,2),
eight = rnorm(34,28,3),
unique = 1:34)
This gave me the data with a column of unique IDs. Then, when converting the data to long format, I used this unique variable in melt as well.
DT1 <- melt(DT, id.vars = c("town","tc","unique"),measure=3:10)
Then created the new variables that were needed
DT1[, `:=` (mn = mean(value,na.rm = TRUE), sdev = sd(value,na.rm = TRUE), uplimit = mean(value,na.rm = TRUE)+1.96*sd(value,na.rm = TRUE), lowlimit=mean(value,na.rm = TRUE)-1.96*sd(value,na.rm = TRUE)), by = .(town,tc,variable)][, outlier := +(value < mn - 1.96*sdev | value > mn + 1.96*sdev)]
Then, when converting this long data to wide format so that I have the same number of records as in the original data, I used this unique variable in dcast
DT2 <- dcast(DT1,town+tc+unique~variable,value.var = c("mn","sdev","outlier","uplimit","lowlimit"))
This gave me the desired output with same number of records (34) as in the original data and created the desired number of columns and each cell having same data as that in the long format.
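On the ~ and + part of the question, here is a small sketch with made-up data. As I understand the formula, the variables to the left of ~ (joined with +) identify one output row, and the variable to the right of ~ is spread into columns, once per entry in value.var.

```r
library(data.table)

# Made-up long data: two rows (town + id), two measured variables each
long <- data.table(town     = rep(c("A", "B"), each = 2),
                   id       = rep(1:2, each = 2),
                   variable = rep(c("one", "two"), times = 2),
                   mn       = c(1.0, 2.0, 1.5, 2.5),
                   sdev     = c(0.1, 0.2, 0.3, 0.4))

# town + id stay as row identifiers; variable becomes part of the column names
wide <- dcast(long, town + id ~ variable, value.var = c("mn", "sdev"))
```

With more than one value.var, the new columns are named by pasting the value name and the level together (for example mn_one and sdev_two). If the left-hand side does not identify rows uniquely, dcast falls back to aggregating, which is why the unique ID matters.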
I hope you all find this useful :)!!

Problems with casting a dataframe with text columns

I have this text dataframe with all columns being character vectors.
Gene.ID barcodes value
A2M TCGA-BA-5149-01A-01D-1512-08 Missense_Mutation
ABCC10 TCGA-BA-5559-01A-01D-1512-08 Missense_Mutation
ABCC11 TCGA-BA-5557-01A-01D-1512-08 Silent
ABCC8 TCGA-BA-5555-01A-01D-1512-08 Missense_Mutation
ABHD5 TCGA-BA-5149-01A-01D-1512-08 Missense_Mutation
ACCN1 TCGA-BA-5149-01A-01D-1512-08 Missense_Mutation
How do I build a dataframe from this using reshape/reshape2 such that I get a dataframe of the format Gene.ID~barcodes, with the values being the text in the value column and "NA" or "WT" as a filler?
The default aggregation function keeps defaulting to length, which I want to avoid if possible.
I think this will work for your problem. First, I'm generating some data similar to yours. I'm making gene.id and barcode a factor for simplicity and this should be the same as your data.
geneNames <- c(paste("gene", 1:10, sep = ""))
data <- data.frame(gene = as.factor(c(1:10, 1:4, 6:10)),
express = sample(c("Silent", "Missense_Mutation"), 19, TRUE),
barcode = as.factor(c(rep(1, 10), rep(2, 9))))
I made geneNames a vector of the gene names (e.g., A2M). In order to get NA values where a given gene has no expression entry for a barcode, you need to merge the data such that you have number_of_genes by number_of_barcodes rows.
geneID <- unique(data$gene)
data2 <- data.frame(barcode = rep(unique(data$barcode), each = length(geneID)),
gene = geneID)
data3 <- merge(data, data2, by = c("barcode", "gene"), all.y = TRUE)
Now melting and casting the data,
library(reshape)
mdata3 <- melt(data3, id.vars = c("barcode", "gene"))
cdata <- cast(mdata3, barcode ~ variable + gene, identity)
names(cdata) <- c("barcode", geneNames)
You should then have a data frame with number_of_barcodes rows and with (number_of_unique_genes + 1) columns. Each column should contain the expression information for that particular gene in that particular sample barcode.
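As an alternative sketch, and not the reshape approach used above, dcast from data.table can cast text directly: a fun.aggregate that pastes the values together avoids the fallback to length, and fill supplies the "WT" filler. The shortened barcodes "s1"/"s2" and the duplicated A2M entry are made up for illustration.

```r
library(data.table)

# Made-up mutation table; A2M has two entries for sample s1
mut <- data.table(
  Gene.ID  = c("A2M", "ABCC10", "ABCC11", "A2M"),
  barcodes = c("s1", "s2", "s1", "s1"),
  value    = c("Missense_Mutation", "Missense_Mutation", "Silent", "Silent")
)

# Collapse duplicates into one string per cell; empty cells become "WT"
wide <- dcast(mut, Gene.ID ~ barcodes, value.var = "value",
              fun.aggregate = function(x) paste(unique(x), collapse = ";"),
              fill = "WT")
```

If every Gene.ID/barcode pair occurs only once, the fun.aggregate argument can be dropped and no aggregation takes place.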
