How to subset data with advanced string matching - r

I have the following data frame from which I would like to extract rows based on matching strings.
> GEMA_EO5
gene_symbol fold_EO p_value RefSeq_ID BH_p_value
KNG1 3.433049 8.56e-28 NM_000893,NM_001102416 1.234245e-24
REXO4 3.245317 1.78e-27 NM_020385 2.281367e-24
VPS29 3.827665 2.22e-25 NM_057180,NM_016226 2.560770e-22
CYP51A1 3.363149 5.95e-25 NM_000786,NM_001146152 6.239386e-22
TNPO2 4.707600 1.60e-23 NM_001136195,NM_001136196,NM_013433 1.538000e-20
NSDHL 2.703922 6.74e-23 NM_001129765,NM_015922 5.980454e-20
DPYSL2 5.097382 1.29e-22 NM_001386 1.062868e-19
So I would like to extract, e.g., two rows based on matching strings in $RefSeq_ID, which works fine with the following:
> list<-c("NM_001386", "NM_020385")
> GEMA_EO6<-subset(GEMA_EO5, GEMA_EO5$RefSeq_ID %in% list, drop = TRUE)
> GEMA_EO6
gene_symbol fold_EO p_value RefSeq_ID BH_p_value
REXO4 3.245317 1.78e-27 NM_020385 2.281367e-24
DPYSL2 5.097382 1.29e-22 NM_001386 1.062868e-19
But some of the rows have several RefSeq_IDs separated by commas, so I am looking for a general way of telling whether $RefSeq_ID contains a certain string pattern and then subsetting that row.

To do partial matching, you'll need to use regular expressions (see ?grepl). Here's a solution to your particular problem:
##Notice that the first element appears in
##a row containing commas
l = c( "NM_013433", "NM_001386", "NM_020385")
To test one sequence at a time, we just select a particular seq id:
R> subset(GEMA_EO5, grepl(l[1], GEMA_EO5$RefSeq_ID))
gene_symbol fold_EO p_value RefSeq_ID BH_p_value
5 TNPO2 4.708 1.6e-23 NM_001136195,NM_001136196,NM_013433 1.538e-20
To test for multiple genes, we use the | operator:
R> paste(l, collapse="|")
[1] "NM_013433|NM_001386|NM_020385"
R> grepl(paste(l, collapse="|"),GEMA_EO5$RefSeq_ID)
[1] FALSE TRUE FALSE FALSE TRUE FALSE TRUE
So
subset(GEMA_EO5, grepl(paste(l, collapse="|"),GEMA_EO5$RefSeq_ID))
should give you what you want.
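One caveat: grepl() does substring matching, so a short ID could in principle also hit inside a longer one. A rough exact-match sketch (reusing GEMA_EO5 and l from above) splits the comma-separated field and tests membership instead:
## sketch: exact matching of each comma-separated ID
hit <- sapply(strsplit(as.character(GEMA_EO5$RefSeq_ID), ",", fixed = TRUE),
              function(ids) any(ids %in% l))
subset(GEMA_EO5, hit)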

A different approach is to recognize the duplicate entries in RefSeq_ID as an attempt to represent two database tables in a single data frame. So if the original table is csv, then normalize the data into two tables:
Anno <- cbind(key = seq_len(nrow(csv)), csv[,names(csv) != "RefSeq_ID"])
key0 <- strsplit(csv$RefSeq_ID, ",")
RefSeq <- data.frame(key = rep(seq_along(key0), sapply(key0, length)),
                     ID = unlist(key0))
and recognize that the query is a subset (select) on the RefSeq table, followed by a merge (join) with Anno
l <- c( "NM_013433", "NM_001386", "NM_020385")
merge(Anno, subset(RefSeq, ID %in% l))[, -1]
leading to
> merge(Anno, subset(RefSeq, ID %in% l))[, -1]
gene_symbol fold_EO p_value BH_p_value ID
1 REXO4 3.245317 1.78e-27 2.281367e-24 NM_020385
2 TNPO2 4.707600 1.60e-23 1.538000e-20 NM_013433
3 DPYSL2 5.097382 1.29e-22 1.062868e-19 NM_001386
Perhaps the goal is to merge with a 'Master' table, then
Master <- cbind(key = seq_len(nrow(csv)), csv)
merge(Master, subset(RefSeq, ID %in% l))[,-1]
or similar.
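As a side note, the same normalization can be sketched in one step with tidyr's separate_rows(), assuming GEMA_EO5 is the data frame shown in the question:
## sketch only: expand the comma-separated IDs to one row each, then filter
library(dplyr)
library(tidyr)
l <- c("NM_013433", "NM_001386", "NM_020385")
GEMA_EO5 %>%
  separate_rows(RefSeq_ID, sep = ",") %>%
  filter(RefSeq_ID %in% l)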

Related

How do I grab the word in a character column after two consecutive word matches in R?

I have a data frame 'key_words' with vectors of pairs of words
key_words <- data.frame(c1 = c('word1', 'word2'), c2 = c('word3', 'word4'), c3 = c('word5', 'word6'))
I would like to search for these pairs of key words in a character column 'text' in another data frame 'x', where each row can be a few sentences long. I want to grab the word following two consecutive matches of a column in the key_words data frame and insert that value into a table at the same index where the match was found. For example, if 'word1' and 'word2' are found one after the other in text[1], then I want to grab the word that comes after them in text[1] and insert it into table[1].
I have tried splitting each row in 'text' into a list, separating by a single space so that each word has its own index in each row. I have the following idea which seems very inefficient and I'm running into problems where the character value temp_list[k] is of length 0.
library(dplyr)  # for mutate()

x <- x %>% mutate(text = strsplit(text, " "))
for (i in 1:ncol(key_words)) {
  word1 <- key_words[1, i]  # first word of the pair in column i
  word2 <- key_words[2, i]  # second word of the pair in column i
  for (j in 1:length(x$text)) {
    temp_list <- as.list(unlist(x$text[[j]]))
    for (k in 1:length(temp_list)) {
      if (word1 == temp_list[k]) {
        if (word2 == temp_list[k + 1]) {
          table$word_found[j] <- temp_list[k + 2]
        }
      }
    }
  }
}
Is there a better way to do this or can I search through the text column for 'word1 word2' and grab the next word which can be any length? I'm new to R and coding in general, but I know I should be avoiding nested loops like this. Any help would be appreciated, thanks!!
I would suggest that you create a small function like this one, which returns the word following the occurrence of the pair 'w1 w2':
get_word_after_pair <- function(text, w1, w2) {
  stringr::str_extract(text, paste0("(?<=\\b", w1, "\\s", w2, "\\b\\s)\\w*(?=\\b)"))
}
and then you can do this
data.frame(
  lapply(key_words, function(x) get_word_after_pair(texttable$text, x[1], x[2]))
)
Input (key_words is a list of word pairs, texttable is a data frame with a column text):
key_words <- list( pair1 = c('has','important'), pair2 = c('sentence','has'), pair3 = c('third','sentence'))
texttable <- data.frame(text = c("this sentence has important words that we must find",
                                 "this second sentence has important words to find",
                                 "this is the third sentence and it also has important words within"))
Output:
pair1 pair2 pair3
1 words important <NA>
2 words important <NA>
3 words <NA> and
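If you prefer base R, a rough equivalent of the helper (using a capture group instead of a lookbehind) could look like this:
## sketch: capture the word that follows the literal pair "w1 w2"
get_word_after_pair_base <- function(text, w1, w2) {
  m <- regmatches(text, regexec(paste0("\\b", w1, "\\s", w2, "\\s(\\w+)"), text))
  vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_, character(1))
}
get_word_after_pair_base(texttable$text, "sentence", "has")
# [1] "important" "important" NA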

Vectorized use of the substring function for row selection of a data frame with different lengths

My data frame has a column named Code of type character which goes like b, b1, b110-b139, b110, b1100, b1101, ... (1602 entries).
I am trying to select all the entries that match the strings in a vector and all the ones that start with the same string.
So let's say I have the vector
Selection=c("b114","d2")
then I want all codes like b114, b1140, b1141, b1142, ... as well as d2, d200, d2000, d2001, d2002, d2003, etc.
What does work in principle is to create a new data frame like this:
bTable <- TreeMapTable[substr(TreeMapTable$Code,1,4)=="b114"|substr(TreeMapTable$Code,1,2)=="d2",]
which gives me all the data I want, but requires me to manually type the condition for each entry, and I just want to give the script a vector with the strings.
I tried to do it like this:
SelectionL <- nchar(Selection)
Beispieltable <- TreeMapTable[substr(TreeMapTable$Code, 1, SelectionL) == Selection, ]
But this somehow gives me only half of the required entries, and I confess I don't really know what it is doing. I know I could use a for loop, but from everything I have read so far, loops should be avoided and the problem should be solvable with vectors.
sample data
df <- data.frame(Code = c("b114", "b115", "b11456", "d2", "d12", "d200", "db114"),
                 stringsAsFactors = FALSE)
Selection=c("b114","d2")
answer
library( dplyr )
#create a regex pattern to filter on
pattern <- paste0( "^", Selection, collapse = "|" )
# filter out all rows where 'Code' does not start with the entries from 'Selection'
df %>% filter( grepl( pattern, Code, perl = TRUE ) )
# Code
# 1 b114
# 2 b11456
# 3 d2
# 4 d200
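Without dplyr, a rough base-R equivalent tests each prefix with startsWith() and keeps rows that match any of them:
keep <- Reduce(`|`, lapply(Selection, function(p) startsWith(df$Code, p)))
df[keep, , drop = FALSE]
#     Code
# 1   b114
# 3 b11456
# 4     d2
# 6   d200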

Sort strings based on number in part of string

I have a huge data set that I cannot upload here.
I have two types of columns; their names start with T.H.L or T.H.L.varies..... Both types are numbered in the format So####, e.g., T.H.L.So1_P1_A2 through T.H.L.So10000_P1_A2.
For each T.H.L column there is a column named T.H.L.varies.... with the same ending.
I want to order the columns by the number after So, with the T.H.L column first and then the corresponding T.H.L.varies.... column for each So number.
What I tried was to do
library(gtools)
mySorted<- df2[,mixedorder(colnames(df2))]
This is close: it sorts them correctly by number, but puts all the T.H.L columns first and then all the T.H.L.varies columns, instead of alternating them.
I have posted the column names to GitHub.
Okay, let's call the names of your data frame (the names you want to reorder) x:
x = names(df2)
# first remove the ones without numbers
# because we want to use the numbers for ordering
no_numbers = c("T.H.L", "T.H.L.varies....")
x = x[! x %in% no_numbers]
# now extract the numbers so we can order them
library(stringr)
x_num = as.numeric(str_extract(string = x, pattern = "(?<=So)[0-9]+"))
# calculate the order first by number, then alphabetically to break ties
ord = order(x_num, x)
# verify it is working
head(c(no_numbers, x[ord]), 10)
# [1] "T.H.L" "T.H.L.varies...." "T.H.L.So1_P1_A1"
# [4] "T.H.L.varies.....So1_P1_A1" "T.H.L.So2_P1_A2" "T.H.L.varies.....So2_P1_A2"
# [7] "T.H.L.So3_P1_A3" "T.H.L.varies.....So3_P1_A3" "T.H.L.So4_P1_A4"
# [10] "T.H.L.varies.....So4_P1_A4"
# finally, reorder your data frame columns
df2 = df2[, c(no_numbers, x[ord])]
And you should be done.
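A tiny self-contained sketch (with made-up column names) illustrating the interleaving that order(x_num, x) produces:
library(stringr)
x <- c("T.H.L.So2_P1_A2", "T.H.L.varies.....So1_P1_A1",
       "T.H.L.So1_P1_A1", "T.H.L.varies.....So2_P1_A2")
x_num <- as.numeric(str_extract(x, "(?<=So)[0-9]+"))
x[order(x_num, x)]
# [1] "T.H.L.So1_P1_A1"            "T.H.L.varies.....So1_P1_A1"
# [3] "T.H.L.So2_P1_A2"            "T.H.L.varies.....So2_P1_A2"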

How to convert the types of multiple columns in R

I have a data frame which is retrieved from a CSV file. I need to get the types of certain columns and apply these types to another data frame's corresponding columns.
For example, after certain steps, the data.frame from the csv is called Table1.
header <- names(Table1)
"Acct" "Tran"
class(Table1$Acct)
"character"
class(Table1$Tran)
"character"
Then I need to convert Table2's corresponding "Acct" and "Tran" columns to character.
I tried
class(Table1[header])
[1] "data.frame"
class(Table1$header)
[1] "NULL"
How do I apply the column types of Table1 to Table2? Do I have to use a for loop to do the transfer?
Thanks
****UPDATES****
Since the data types of Table1 are not complex, I created a function to manually convert column types. as.numeric(as.character(After)) is important: if After is a factor, as.numeric(After) will change the values.
Typeconvertion <- function(Before, After) {
  classtype <- class(Before)
  if (classtype == "factor") {
    After <- as.character(After)
  } else if (classtype == "integer") {
    After <- as.numeric(as.character(After))
  } else if (classtype == "numeric") {
    After <- as.numeric(as.character(After))
  } else if (classtype == "character") {
    After <- as.character(After)
  } else {
    After <- After
  }
  After  # return the converted column
}
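If you want to avoid an explicit for loop, a rough sketch that applies the helper above to every shared column with Map() would be:
## sketch: convert each shared Table2 column to the class of the matching Table1 column
shared <- intersect(names(Table1), names(Table2))
Table2[shared] <- Map(Typeconvertion, Table1[shared], Table2[shared])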
Consider a more flexible function.
matchColClasses <- function(df1, df2) {
  # Purpose: protect joins from column type mismatches - a problem with multi-column empty df
  # Input:   df1 - master for class assignments, df2 - for col reclass and return
  # Output:  df2 with shared columns classed to match df1
  # Usage:   df2 <- matchColClasses(df1, df2)
  sharedColNames <- names(df1)[names(df1) %in% names(df2)]
  sharedColTypes <- sapply(df1[, sharedColNames], class)
  for (n in sharedColNames) {
    class(df2[, n]) <- sharedColTypes[n]
  }
  return(df2)
}
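A short usage sketch with made-up toy tables, showing the shared columns of Table2 being coerced to Table1's classes:
## toy example only - Acct and Tran are character in Table1 but numeric in Table2
Table1 <- data.frame(Acct = c("001", "002"), Tran = c("A", "B"),
                     stringsAsFactors = FALSE)
Table2 <- data.frame(Acct = c(3, 4), Tran = c(1, 2), Amount = c(10.5, 20.1))
Table2 <- matchColClasses(Table1, Table2)
sapply(Table2, class)
## Acct and Tran should now be "character"; Amount stays "numeric"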

Creating a palindrome function in R

Hi, I have a data frame df and wish to find out whether there are any palindromes in one name column.
I have test data which has 12 records in it. I know 2 of the column records for name are palindromes.
The code below uses lapply to return a list of TRUE/FALSE values.
How do I return the names that are palindromes along with the TRUE values, and how would I find out which is the most frequently occurring palindrome name?
is_palindrome <- function(x) {
  charsplit <- strsplit(x, "")[[1]]
  revchar <- rev(charsplit)
  all(charsplit == revchar)
}
dfnamelc = tolower(as.character(df$Name))
listtest = as.list(dfnamelc)
lapply(listtest,is_palindrome)
example df
Linda,F,100
Mary,F,150
Patrick,M,200
Barbara,F,300
Susan,F,100
Norman,M,40
Deborah,F,500
Sandra,F,23
Conor,M,80
anna,F,40
Otto,M,30
anna,M,40
It will probably be more convenient to use sapply() to return the results as a vector, and incorporate the results back into the data frame.
df <- transform(df,
                is_pal = sapply(tolower(Name), is_palindrome))
df$Name[df$is_pal] ## which names are palindromes?
paltab <- table(df$Name[df$is_pal]) ## count palindromic names
names(paltab)[which.max(paltab)] ## "anna"
I'm not sure what your third column signifies, so I'm ignoring it.
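For completeness, a quick end-to-end sketch with the example data above (the unnamed second and third columns are given the placeholder names Sex and Count):
df <- data.frame(Name  = c("Linda", "Mary", "Patrick", "Barbara", "Susan", "Norman",
                           "Deborah", "Sandra", "Conor", "anna", "Otto", "anna"),
                 Sex   = c("F", "F", "M", "F", "F", "M", "F", "F", "M", "F", "M", "M"),
                 Count = c(100, 150, 200, 300, 100, 40, 500, 23, 80, 40, 30, 40),
                 stringsAsFactors = FALSE)
low    <- tolower(df$Name)
is_pal <- sapply(low, is_palindrome)
df$Name[is_pal]                       # "anna" "Otto" "anna"
names(which.max(table(low[is_pal])))  # "anna" - the most frequent palindromic name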
