I have a vector like go_id and a data.frame like data.
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
protein_id <- c("Q96IF1","P26371","Q8NHG8","P60372","O75526","Q01130")
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]", "[GO:0005829]; [GO:0008544]","[GO:0000209]; [GO:0005737]; [GO:0005765]","NA","[GO:0000398]; [GO:0003729]","[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")
data <- as.data.frame(cbind(protein_id,bio_process))
How can I keep the rows of data whose bio_process cell contains at least one of the go_id elements? Note that a GO code cannot be repeated within the same bio_process cell.
To be more precise, I would like to get only the first, third, and sixth rows of the data.frame.
I have tried a for loop using 'grepl' function, like this:
go_id <- gsub("GO:","", go_id, fixed = TRUE)
for (i in 1:6) {
new_data <- data[grepl("\\[GO:go_id[i]\\]",data$Gene.ontology..biological.process.)]
}
I know this cannot work, because I cannot insert a variable's value into a regular expression this way.
Any ideas on this?
Thank you
We can use Reduce with grepl
data$ind <- Reduce(`|`, lapply(go_id, function(pat)
grepl(pat, data$bio_process, fixed = TRUE)))
data
# protein_id bio_process ind
#1 Q96IF1 [GO:0000086]; [GO:0000122]; [GO:0000932] TRUE
#2 P26371 [GO:0005829]; [GO:0008544] FALSE
#3 Q8NHG8 [GO:0000209]; [GO:0005737]; [GO:0005765] TRUE
#4 P60372 NA FALSE
#5 O75526 [GO:0000398]; [GO:0003729] FALSE
#6 Q01130 [GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714] TRUE
You should use fixed = TRUE in grepl() :
vect <- rep(FALSE, nrow(data))
for(id in go_id){
vect <- vect | grepl(id, data$bio_process, fixed = TRUE)
}
data[vect,]
You can subset using str_extract to define the pattern on those substrings that are distinctive:
library(stringr)
data[grepl(paste(str_extract(go_id, "\\d{4}]"), collapse="|"), data$bio_process),]
protein_id bio_process
1 Q96IF1 [GO:0000086]; [GO:0000122]; [GO:0000932]
3 Q8NHG8 [GO:0000209]; [GO:0005737]; [GO:0005765]
6 Q01130 [GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]
EDIT:
The most straightforward solution is subsetting with grepl and paste0 to add the escape backslashes for the metacharacter [:
data[grepl(paste0("\\", go_id, collapse="|"), data$bio_process),]
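As an alternative to pattern matching, here is a minimal sketch that splits each bio_process cell into its individual GO terms and tests membership with %in%. It assumes the terms are always separated by "; " exactly as in the example data; the names terms, keep and result are only illustrative.

```r
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
protein_id <- c("Q96IF1", "P26371", "Q8NHG8", "P60372", "O75526", "Q01130")
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]",
                 "[GO:0005829]; [GO:0008544]",
                 "[GO:0000209]; [GO:0005737]; [GO:0005765]",
                 "NA",
                 "[GO:0000398]; [GO:0003729]",
                 "[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")
data <- data.frame(protein_id, bio_process, stringsAsFactors = FALSE)

# Split every cell into a character vector of GO terms
# (assumption: "; " is the only separator used)
terms <- strsplit(data$bio_process, "; ", fixed = TRUE)

# TRUE for rows where at least one term is one of the wanted go_id values
keep <- vapply(terms, function(x) any(x %in% go_id), logical(1))

result <- data[keep, ]
```

Because the comparison is done on whole terms, no escaping of [ or ] is needed.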
I have a large data frame where I want to extract rows based on a column value. My problem is that grep will take all instances of the value (e.g. will take "11" if I wanted to grep "1"). How do I get exact matches? Example below simply illustrates my issue. I only want to grep the "metm1" row but it is grepping all rows even though they are not exact matches.
## make data
df1 <- data.frame(matrix(, nrow=4, ncol=2))
colnames(df1) <- c("met", "dt1")
df1$met <- c("metm11", "metm1", "metm1", "metm12")
df1$dt1 <- c("0.666", "0.777", "0.99", "0.01")
# make list for grep
mets <- "metm1"
# grep
new_df <- as.data.frame(df1[grep(paste(mets, collapse = "|"), df1$met), ])
You may place ^ and $ anchors around the search term to force an exact match:
regex <- paste0("^(?:", paste(mets, collapse = "|"), ")$")
new_df <- as.data.frame(df1[grep(regex, df1$met), ])
For reference, the regex pattern being used here is:
^(?:metm1)$
^(?:metm1|metm2|metm3)$ <-- for multiple terms
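To see what the anchors change, the following sketch compares the unanchored and the anchored pattern on the example values. It uses perl = TRUE only to be explicit about the regex flavour; the variable names unanchored, anchored and literal are illustrative.

```r
met <- c("metm11", "metm1", "metm1", "metm12")
mets <- "metm1"

# Unanchored: "metm1" is a substring of every value, so everything matches
unanchored <- grepl(paste(mets, collapse = "|"), met)
unanchored
# [1] TRUE TRUE TRUE TRUE

# Anchored: the whole string has to equal one of the alternatives
regex <- paste0("^(?:", paste(mets, collapse = "|"), ")$")
anchored <- grepl(regex, met, perl = TRUE)
anchored
# [1] FALSE  TRUE  TRUE FALSE

# With fixed = TRUE the pattern is taken as a literal string,
# so the anchors lose their meaning and nothing matches
literal <- grepl(regex, met, fixed = TRUE)
literal
# [1] FALSE FALSE FALSE FALSE
```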
You can use simply == to make exact match.
df1[df1$met == mets,]
# met dt1
#2 metm1 0.777
#3 metm1 0.99
In case mets has more than one element, use %in%, as already pointed out in the comments by @MrFlick.
df1[df1$met %in% mets,]
# met dt1
#2 metm1 0.777
#3 metm1 0.99
Another solution is by using boundary anchors \\b:
df1[grep(paste0("\\b(", paste0(mets, collapse = "|"),")\\b"), df1$met), ]
met dt1
2 metm1 0.777
3 metm1 0.99
Using dplyr you'd filter with grepl, which returns TRUE and FALSE whereas grep returns indices of matches:
library(dplyr)
df1 %>%
filter(grepl(paste0("\\b(", paste0(mets, collapse = "|"),")\\b"), met))
I haven't been able to find an answer to this, but I am guessing this is because I am not phrasing my question properly.
I want to combine two strings containing several comma-separated values into one string, alternating the inputs from each original string.
x <- '1,2'
y <- 'R,L'
# fictitious function
z <- combineSomehow(x,y)
z = '1R, 2L'
EDIT : Adding dataframe to better describe my issue. I would like to be able to accomplish the above, but within a mutate ideally.
df <- data.frame(
x = c('1','2','1,1','2','1'),
y = c('R','L','R,L','L','R'),
desired_result = c('1R','2L','1R,1L','2L','1R')
)
df:
x y desired_result
1 1 R 1R
2 2 L 2L
3 1,1 R,L 1R,1L
4 2 L 2L
5 1 R 1R
Final Edit/Answer: Based on @akrun's comment/response below and after removing the error originally in df, this ended up being the tidyverse answer:
mutate(desired_result = map2(.x=strsplit(x,','),.y=strsplit(y,','),
~ str_c(.x,.y, collapse=',')))
It can be done with strsplit and paste
combineSomehow <- function(x, y) {
do.call(paste0, c(strsplit(c(x,y),","), collapse=", "))
}
combineSomehow(x,y)
#[1] "1R, 2L"
Without modifying the function, we can Vectorize it to apply on multiple elements
df$desired_result2 <- Vectorize(combineSomehow)(df$x, df$y)
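For comparison, a base R sketch of the same idea that splits both columns once and pairs the pieces with mapply. It assumes x and y are character vectors whose elements contain the same number of comma-separated values; combined is an illustrative name.

```r
x <- c("1", "2", "1,1", "2", "1")
y <- c("R", "L", "R,L", "L", "R")

# Split both vectors on commas, then paste the pieces together pairwise
combined <- mapply(
  function(a, b) paste0(a, b, collapse = ","),
  strsplit(x, ",", fixed = TRUE),
  strsplit(y, ",", fixed = TRUE),
  USE.NAMES = FALSE
)

combined
# [1] "1R"    "2L"    "1R,1L" "2L"    "1R"
```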
I have a nested loop during which I want to add dataframes into a list, as in:
listname <- list()
for(xx in 1:50) {
listname[[xx]] <- list()
for(yy in 1:25) {
my_df <- data.frame(aa = c(1,2,".."), bb = c(1,2,".."))
listname[[xx]][yy] <- my_df
}
}
However, I get the warning message:
"number of items to replace is not a multiple of replacement length"
... and as a result, only my_df$aa is in listname[[xx]][yy], but not my_df$bb.
What would be a more appropriate way to accomplish a nested list of dataframes with this approach?
Thank you!
EDIT: Found the solution! It turns out that the only mistake was the lack of a further bracket. It should have been listname[[xx]][[yy]] <- my_df. Thanks to Ben in the comments; it works now.
You will want to use double brackets [[ for yy in storing your data frames as follows:
listname[[xx]][[yy]] <- my_df
The use of [[ is important when working with lists, since subsetting a list with [ (single bracket alone) always returns a smaller list.
To see the difference, take a look at listname[[1]][1]:
[[1]]
aa bb
1 1 1
2 2 2
3 .. ..
This returns a list.
Then we can compare with listname[[1]][[1]]:
aa bb
1 1 1
2 2 2
3 .. ..
Which is a data.frame.
In this case, when you want to replace a single element of a list with a data.frame, you will want to use [[.
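The difference can also be checked programmatically. This is a small sketch with a made-up two-column data frame; inner and single are illustrative names.

```r
my_df <- data.frame(aa = c(1, 2, 3), bb = c(4, 5, 6))

inner <- list()
inner[[1]] <- my_df   # double brackets store the whole data frame

class(inner[1])       # "list": single bracket returns a sub-list
class(inner[[1]])     # "data.frame": double bracket returns the element

# Single-bracket assignment only works as intended if the
# replacement is itself wrapped in a list
single <- list()
single[1] <- list(my_df)
```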
You can try with replicate to get such nested list of dataframes.
my_df <- data.frame(aa = c(1,2,".."), bb = c(1,2,".."))
listname <- replicate(50, replicate(25,
my_df, simplify = FALSE), simplify = FALSE)
I have extracted data from Twitter and it appears as the column List (what you will get after running the code). I want the output to look like the Broken column.
data <- data.frame(matrix(, nrow=4, ncol=2))
colnames(data)[1:2] <- c("List", "Broken")
data$List[1] <- 1
data$List[2] <- list(c("1", "SmythsToysUK"))
data$List[3] <- list(c("1", "FortniteGame", "CityCtrMirdif", "itpliveme"))
data$List[4] <- 1
data$Broken[1:4]<- c("SmythsToysUK","FortniteGame","CityCtrMirdif","itpliveme")
We can remove all the numbers from List column.
temp <- unlist(data$List)
data$Broken <- temp[is.na(as.numeric(temp))]
data
# List Broken
#1 1 SmythsToysUK
#2 1, SmythsToysUK FortniteGame
#3 1, FortniteGame, CityCtrMirdif, itpliveme CityCtrMirdif
#4 1 itpliveme
We can use grep with unlist. After unlisting the list, select only the elements that have letters
data$Broken <- grep("[A-Za-z]", unlist(data$List), value = TRUE)
data$Broken
#[1] "SmythsToysUK" "FortniteGame" "CityCtrMirdif" "itpliveme"
Or another option is to remove the first element which seems to be index and then unlist
unlist(sapply(data$List, `[`, -1))
NOTE: Neither of these options produces any warnings.
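A similar warning-free sketch drops the purely numeric entries with a regular expression instead of selecting the ones that contain letters. It uses a plain list in place of the data frame column, and assumes the unwanted entries consist of digits only; broken is an illustrative name.

```r
List <- list(1,
             c("1", "SmythsToysUK"),
             c("1", "FortniteGame", "CityCtrMirdif", "itpliveme"),
             1)

# unlist() coerces everything to character because of the mixed types
temp <- unlist(List)

# Keep only the entries that are not made up entirely of digits
broken <- temp[!grepl("^[0-9]+$", temp)]

broken
# [1] "SmythsToysUK"  "FortniteGame"  "CityCtrMirdif" "itpliveme"
```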
In R, I have a vector NAME:
[1] "ALKR50SV" "AMKR71SV" "AOKR71SV" "AZKR52SV" "BFKR70SV" "BJKR61SV" "BUKR6HSV"
"CDKR61SV" "CFKR31SV"
I want to use each of them as the name of a new data frame,
e.g. a data frame named ALKR50SV, another named AMKR71SV, and so on.
A for loop like:
NAME[i] <- data1
causes problems.
What should I do? Thank you.
As @joran and @neilfws said, it is best to work with a list of data.frames.
For example, consider the following list of three data.frames:
lst <- lapply(1:3, function(x) as.data.frame(matrix(sample(20), ncol = 4)))
You can name the list elements
names(lst) <- c("ALKR50SV", "AMKR71SV", "AOKR71SV")
and operate on the list elements using lapply, e.g.
lapply(lst, dim)
#$ALKR50SV
#[1] 5 4
#
#$AMKR71SV
#[1] 5 4
#
#$AOKR71SV
#[1] 5 4
You can use assign:
nms <- c('one', 'two', 'three')
for (i in seq_along(nms)) {
# create a variable called nms[i] in the current environment
assign(nms[i], i)
}
one # 1
two # 2
three # 3
But as others have commented, it is most likely better to put your dataframes into a named list.
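A minimal sketch of the named-list approach applied to the names from the question. The data frames are placeholders with invented columns id and value, and dfs is an illustrative name; in practice each element would be the real data for that name.

```r
NAME <- c("ALKR50SV", "AMKR71SV", "AOKR71SV")

# Build one placeholder data frame per name
dfs <- lapply(seq_along(NAME), function(i) {
  data.frame(id = i, value = i * 10)
})
names(dfs) <- NAME

# Access a single data frame by name
dfs[["AMKR71SV"]]
dfs$AOKR71SV

# Apply the same operation to all of them
row_counts <- vapply(dfs, nrow, integer(1))
```

Keeping the data frames in one list means the names stay ordinary data, so looping over them needs neither assign nor get.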