R function to input character vector

I currently have 10 vectors that look like the following:
string1 <- c("house", "home", "cabin")
string2 <-c("hotel", "hostel", "motel")
and so on for 10 strings.
R newbie learning functions. I have the following code that I want to run across these 10 strings and turn into a function. The code takes one of these strings, searches a$Contents for matches, and creates a new indicator variable:
a$string.i <- (1:nrow(a) %in% c(sapply(string1, grep, a$Contents, fixed = TRUE))) +0
As I am new to R, I'm stumped on how to turn this into a function. Do I need to first define the number of strings, then set 'string1' in the above code to x? How do I set the name of the variable = to the name of the string?
Some sample data:
a <- read.table(text='Contents other
1 "a house a home" "111"
2 "cabin in the woods" "121"', header=TRUE)

If you need a function, maybe you can try:
fun1 <- function(namePrefix, dat){ #assuming that the datasets have a common prefix i.e. `string`
pat <- paste0("^", namePrefix, "\\d")
nm1 <- ls(pattern=pat, envir=.GlobalEnv)
lst <- mget(nm1, envir=.GlobalEnv)
lst2 <- lapply(lst, function(x)
(1:nrow(dat) %in% c(sapply(x, grep, dat$Contents, fixed=TRUE)))+0) #your code
dat[names(lst2)] <- lst2
dat
}
fun1("string", a)
# Contents other string1 string2
#1 a house a home 111 1 0
#2 cabin in the woods 121 1 0
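As a variation on the function above, the pattern vectors can be passed in as a named list instead of being looked up in the global environment by prefix. This is only a sketch: flag_matches, its column argument, and the patterns list are hypothetical names that do not appear in the original answer.

```r
# Hypothetical helper (not from the original answer): adds one 0/1 column
# per named pattern vector, flagging rows whose text contains any pattern.
flag_matches <- function(dat, patterns, column = "Contents") {
  text <- as.character(dat[[column]])
  for (nm in names(patterns)) {
    # One logical vector per pattern, combined with OR; the all-FALSE
    # start value keeps the result well-defined for an empty pattern vector.
    hits <- Reduce(`|`,
                   lapply(patterns[[nm]], grepl, x = text, fixed = TRUE),
                   rep(FALSE, nrow(dat)))
    dat[[nm]] <- as.integer(hits)
  }
  dat
}

a <- data.frame(Contents = c("a house a home", "cabin in the woods"),
                other = c("111", "121"), stringsAsFactors = FALSE)
patterns <- list(string1 = c("house", "home", "cabin"),
                 string2 = c("hotel", "hostel", "motel"))
res <- flag_matches(a, patterns)
```

Because the list names become the column names, this also covers the part of the question about naming the new variable after the string.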

Related

How to use grepl function multiple times, in R

I have a vector like go_id and a data.frame like data.
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
protein_id <- c("Q96IF1","P26371","Q8NHG8","P60372","O75526","Q01130")
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]", "[GO:0005829]; [GO:0008544]","[GO:0000209]; [GO:0005737]; [GO:0005765]","NA","[GO:0000398]; [GO:0003729]","[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")
data <- as.data.frame(cbind(protein_id,bio_process))
How can I keep the rows of the data for which bio_process cell contains at least one of the go_ids elements? I note that the GO code can not be repeated in the same bio_process cell.
To be more precise, I would like to receive only the first, the third and the sixth row of the data.frame.
I have tried a for loop using 'grepl' function, like this:
go_id <- gsub("GO:","", go_id, fixed = TRUE)
for (i in 1:6) {
new_data <- data[grepl("\\[GO:go_id[i]\\]",data$Gene.ontology..biological.process.)]
}
I know this cannot work, because I cannot insert a variable's value into a regular expression this way.
Any ideas on this?
Thank you
We can use Reduce with grepl
data$ind <- Reduce(`|`, lapply(go_id, function(pat)
grepl(pat, data$bio_process, fixed = TRUE)))
data
# protein_id bio_process ind
#1 Q96IF1 [GO:0000086]; [GO:0000122]; [GO:0000932] TRUE
#2 P26371 [GO:0005829]; [GO:0008544] FALSE
#3 Q8NHG8 [GO:0000209]; [GO:0005737]; [GO:0005765] TRUE
#4 P60372 NA FALSE
#5 O75526 [GO:0000398]; [GO:0003729] FALSE
#6 Q01130 [GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714] TRUE
You should use fixed = TRUE in grepl() :
vect <- rep(FALSE, nrow(data))
for(id in go_id){
vect <- vect | grepl(id, data$bio_process, fixed = T)
}
data[vect,]
You can subset using str_extract to define the pattern on those substrings that are distinctive:
library(stringr)
data[grepl(paste(str_extract(go_id, "\\d{4}]"), collapse="|"), data$bio_process),]
protein_id bio_process
1 Q96IF1 [GO:0000086]; [GO:0000122]; [GO:0000932]
3 Q8NHG8 [GO:0000209]; [GO:0005737]; [GO:0005765]
6 Q01130 [GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]
EDIT:
The most straightforward solution is subsetting with grepl and paste0 to add the escape slashes for the metacharacter [:
data[grepl(paste0("\\", go_id, collapse="|"), data$bio_process),]
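The answers above all reduce to "test every go_id against every row, then keep rows with at least one hit". A minimal sketch of that idea with vapply and fixed = TRUE follows; it assumes go_id still holds the original bracketed values (before the gsub call in the question), and the names hits and keep are introduced here for illustration only.

```r
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
protein_id <- c("Q96IF1", "P26371", "Q8NHG8", "P60372", "O75526", "Q01130")
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]",
                 "[GO:0005829]; [GO:0008544]",
                 "[GO:0000209]; [GO:0005737]; [GO:0005765]",
                 "NA",
                 "[GO:0000398]; [GO:0003729]",
                 "[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")
data <- data.frame(protein_id, bio_process, stringsAsFactors = FALSE)

# One column per go_id, one row per data row; fixed = TRUE treats the
# brackets literally, so no regex escaping is needed.
hits <- vapply(go_id, grepl, logical(nrow(data)),
               x = data$bio_process, fixed = TRUE)

# Keep rows where at least one go_id matched.
keep <- unname(rowSums(hits) > 0)
new_data <- data[keep, ]
```

The hits matrix is also handy for checking which go_id was responsible for each retained row.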

R - looking up strings and exclude based on other string

I could not find an answer on how to count words in a data frame and exclude a row if another word is found.
I have the data frame below:
words <- c("INSTANCE find", "LA LA LA", "instance during",
"instance", "instance", "instance", "find instance")
df <- data.frame(words)
df$words_count <- grepl("instance", df$words, ignore.case = T)
It flags all rows containing "instance". I have been trying to exclude any row where the word "find" is present as well.
I could add another grepl to look up "find" and exclude rows based on that, but I am trying to limit the number of lines in my code.
I'm sure there's a solution using a single regular expression, but you could do
df$words_count <- Reduce(`-`, lapply(c('instance', 'find'), grepl, df$words, ignore.case = TRUE)) > 0
or
df$words_count <- Reduce(`&`, lapply(c('instance', '^((?!find).)*$'), grepl, df$words, perl = T, ignore.case = T))
This might be easier to read
library(tidyverse)
df$words_count <- c('instance', '^((?!find).)*$') %>%
lapply(grepl, df$words, perl = T, ignore.case = T) %>%
reduce(`&`)
If all you need is the number of times "instance" appears in a string, negating all in that string if "find" is found anywhere:
df$counts <- sapply(gregexpr("\\binstance\\b", words, ignore.case=TRUE), function(a) length(a[a>0])) *
!grepl("\\bfind\\b", words, ignore.case=TRUE)
df
# words counts
# 1 INSTANCE find 0
# 2 LA LA LA 0
# 3 instance during 1
# 4 instance 1
# 5 instance 1
# 6 instance 1
# 7 find instance 0
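One of the answers above suspects that a single regular expression can do the job. A possible sketch uses a negative lookahead anchored at the start of the string, which requires perl = TRUE; the column name keep is an assumption made for this example, and the pattern matches "find" and "instance" as substrings rather than whole words.

```r
words <- c("INSTANCE find", "LA LA LA", "instance during",
           "instance", "instance", "instance", "find instance")
df <- data.frame(words, stringsAsFactors = FALSE)

# ^(?!.*find) rejects any string that contains "find" anywhere;
# .*instance then requires "instance" somewhere in the string.
df$keep <- grepl("^(?!.*find).*instance", df$words,
                 perl = TRUE, ignore.case = TRUE)
```

Adding \\b around each word would restrict the match to whole words, as in the gregexpr answer.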

rbind all data frames with common names based on list using lapply

I have several data frames named as such:
orange_ABC
orange_BCD
apple_ABC
apple_BCD
grape_ABC
grape_BCD
I need to rbind those that have the first part of their name in common (orange, apple, grape), and name the new data frames as such. I'm accessing the names from a list of data frames names(fruitlist) (from which I made the aforementioned data frames) and have tried using lapply with function(x) with no luck. I'm somewhat new to R, so think I'm making a simple mistake when it comes to dynamically naming the new data frame...
lapply(names(fruitlist),
function(x){
frame_nm <- toString((names(fruitlist[x])))
frame_nm <- do.call(rbind, mget(ls(pattern=paste0((names(splitlist[x])),"*"))))
})
I've tried the standalone line on one type of "fruit" and it seems to work:
test_DF <- do.call(rbind, mget(ls(pattern="apple*")))
EDIT: I realize I forgot to mention that the example list of 6 data frames was created dynamically, so I can't simply generate a list of them. However, I do have a list of the "fruits", and all possible endings of the data frame names are known ("_ABC" and "_BCD").
As suspected, the proposed way of assigning values to objects does not work. Moreover, care has to be taken when using ls() and mget() for listing and accessing named objects within a function, because they do not automatically ascend to parent environments and only "see" variables in the local scope unless told otherwise. This applies to R version 3.4; older versions may behave differently.
Creating named objects.
In order to create new objects in the global environment, use assign() (already suggested in Luke C's answer):
> assign("foo", "some text")
> foo
[1] "some text"
Placing code inside a function induces a local scope. Explicitly specifying the global environment allows setting global variables:
> set_foo <- function (x) { assign("foo", x, envir=globalenv()) }
> set_foo("other text")
> foo
[1] "other text"
Note that omitting the envir argument would leave the global environment unaffected.
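The difference between the two forms of assign() can be checked directly. The sketch below uses a throwaway environment, demo_env, as a stand-in for the global environment so that nothing in the workspace is overwritten; demo_env, set_local, set_in_env and foo_demo are all hypothetical names.

```r
# demo_env stands in for the global environment (assumption for this sketch).
demo_env <- new.env()

set_local <- function(x) {
  assign("foo_demo", x)                  # no envir: bound in the function's own frame
  exists("foo_demo", inherits = FALSE)   # TRUE inside the function
}

set_in_env <- function(x, env) {
  assign("foo_demo", x, envir = env)     # explicit target environment
  invisible(NULL)
}

created_locally <- set_local("some text")
# The local binding vanished with the function call.
visible_outside <- exists("foo_demo", inherits = FALSE)

set_in_env("other text", demo_env)
stored <- get("foo_demo", envir = demo_env)
```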
Use of ls()/mget() within a local function.
By default, ls() only lists names from the current (local) environment of that function, which only sees the argument x in the example code given in the question. As above, a quick fix is to specify the global environment explicitly by adding the argument envir=globalenv(). The same applies to mget().
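A small sketch of that scoping behaviour, again with a hypothetical demo_env in place of the global environment and made-up function names:

```r
# demo_env plays the role of the global environment in this sketch.
demo_env <- new.env()
assign("apple_ABC", data.frame(a = 1:2), envir = demo_env)
assign("apple_BCD", data.frame(a = 3:4), envir = demo_env)

# Without envir, ls() lists only the function's own variables.
list_local <- function(x) ls()

# With envir, ls() and mget() look where the data frames actually live.
list_in_env <- function(prefix, env) {
  ls(pattern = paste0("^", prefix, "_"), envir = env)
}
fetch_in_env <- function(prefix, env) {
  mget(list_in_env(prefix, env), envir = env)
}

local_names <- list_local(1)
env_names <- list_in_env("apple", demo_env)
apples <- do.call(rbind, fetch_in_env("apple", demo_env))
```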
Since no MWE was provided, I am taking the liberty of adapting the "fake data" example code provided in Luke C's answer.
# Populate environment
namelist <- paste(fruit = rep(c("orange", "apple", "grape"), 2),
nums = rep(c("_ABC", "_BCD"), each = 3), sep = "")
for(x in namelist)
assign(x, data.frame(a = 1:4, b = 11:14))
# The following re-generates the list of fruits used above
grouplist <- unique(unlist(lapply(strsplit(namelist, "_"), function (x) { x[[1]] })))
# Group and rbind by prefix, suppressing output
invisible(lapply(grouplist,
function(x) {
grouped <- do.call(rbind,
mget(ls(pattern=paste0("^", x, "_"), envir=globalenv()),
envir=globalenv()))
assign(x, grouped, envir=globalenv())
}))
If your fruitlist is a named list of data frames, maybe this will suit.
First, get the like names into their own list:
fruit.groups <- split(names(fruitlist),
sapply(strsplit(names(fruitlist), split = "_"), "[[", 1))
> fruit.groups
$apple
[1] "apple_ABC" "apple_BCD"
$grape
[1] "grape_ABC" "grape_BCD"
$orange
[1] "orange_ABC" "orange_BCD"
Then, use lapply to rbind by group:
fdf <- lapply(fruit.groups, function(x){
out <- do.call(rbind, fruitlist[x])
out$from <- gsub("(\\..*)", "", rownames(out))
rownames(out) <- NULL
return(out)
})
> fdf$apple
a b from
1 1 11 apple_ABC
2 2 12 apple_ABC
3 3 13 apple_ABC
4 4 14 apple_ABC
5 1 11 apple_BCD
6 2 12 apple_BCD
7 3 13 apple_BCD
8 4 14 apple_BCD
Fake data:
namelist <- paste(fruit = rep(c("orange", "apple", "grape"), 2),
nums = rep(c("_ABC", "_BCD"), each = 3), sep = "")
fruitlist <- lapply(setNames(namelist, namelist), function(x){
data.frame(a = 1:4, b = 11:14)
})
EDIT:
From the edits to your question above:
If you have the fruits and suffixes, use expand.grid to get all possible combinations (assuming that all combinations will refer to the dynamically generated data frames).
fruits <- c("orange", "apple", "grape")
suffixes <- c("_ABC", "_BCD")
fullnames <- apply(expand.grid(fruits, suffixes), 1, paste, collapse = "")
Using that list of names, use mget to generate a list of the present dataframes.
new_fruit_df_list <- mget(fullnames)
Then, the code from above should work, modified here to reflect the name changes:
fruit.groups <- split(names(new_fruit_df_list),
sapply(strsplit(names(new_fruit_df_list), split = "_"), "[[", 1))
fdf <- lapply(fruit.groups, function(x){
out <- do.call(rbind, new_fruit_df_list[x])
out$from <- gsub("(\\..*)", "", rownames(out))
rownames(out) <- NULL
return(out)
})
Have a look at the head of each, with the added column (remove if you don't want it) showing the name of that row's original data frame.
> lapply(fdf, head, 2)
$apple
a b from
1 1 11 apple_ABC
2 2 12 apple_ABC
$grape
a b from
1 1 11 grape_ABC
2 2 12 grape_ABC
$orange
a b from
1 1 11 orange_ABC
2 2 12 orange_ABC
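If the data frames already live in a named list, the grouping and binding steps above can be condensed by splitting the list itself on the name prefix. This is a sketch on the same fake data; groups and combined are names introduced here, and it skips the extra "from" column.

```r
namelist <- paste(rep(c("orange", "apple", "grape"), 2),
                  rep(c("_ABC", "_BCD"), each = 3), sep = "")

# Named list of identical toy data frames, one per name.
fruitlist <- lapply(setNames(namelist, namelist),
                    function(nm) data.frame(a = 1:4, b = 11:14))

# Prefix before the first underscore, e.g. "apple" for "apple_ABC".
groups <- sub("_.*$", "", names(fruitlist))

# split() groups the list elements by prefix; rbind stacks each group.
combined <- lapply(split(fruitlist, groups),
                   function(g) do.call(rbind, g))
```

The result is a list with one combined data frame per fruit, which avoids creating or looking up objects by name in the global environment.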
Give this a try:
file_groups <- ls()[grep(".*_.*", ls())]
file_groups <- unique(gsub("(.*)_.*", "\\1", file_groups))
df_list <- lapply(file_groups,
function(x){ do.call(rbind, mget(ls(pattern = paste0("^", x, "_"), envir = globalenv()), envir = globalenv()))})

How to split strings and numbers in R?

I have character vector of the following form (this is just a sample):
R1Ng(10)
test(0)
n.Ex1T(34)
where as can be seen above, the first part is always some combination of alphanumeric and punctuation marks, then there are parentheses with a number inside. I want to create a numeric vector which will store the values inside the parentheses, and each number should have name attribute, and the name attribute should be the string before the number. So, for example, I want to store 10, 0, 34, inside a numeric vector and their name attributes should be, R1Ng, test, n.Ex1T, respectively.
I can always do something like this to get the numbers and create a numeric vector:
counts <- regmatches(data, gregexpr("[[:digit:]]+", data))
as.numeric(unlist(counts))
But how can I extract the first string part and store it as the name attribute of that numeric vector?
How about this:
x <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")
data.frame(Name = gsub( "\\(.*", "", x),
Count = as.numeric(gsub(".*?\\((.*?)\\).*", "\\1", x)))
# Name Count
# 1 R1Ng 10
# 2 test 0
# 3 n.Ex1T 34
Or alternatively as a vector
setNames(as.numeric(gsub(".*?\\((.*?)\\).*", "\\1", x)),
gsub( "\\(.*", "", x ))
# R1Ng test n.Ex1T
# 10 0 34
Here is another variation using the same expression and capturing parentheses:
temp <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")
data.frame(Name=gsub("^(.*)\\((\\d+)\\)$", "\\1", temp),
count=as.numeric(gsub("^(.*)\\((\\d+)\\)$", "\\2", temp)))
We can use str_extract_all
library(stringr)
lst <- str_extract_all(x, "[^()]+")
Or with strsplit from base R
lst <- strsplit(x, "[()]")
If we need to store as a named vector
sapply(lst, function(x) setNames(as.numeric(x[2]), x[1]))
# R1Ng test n.Ex1T
# 10 0 34
data
x <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")
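For comparison, here is a base R sketch that captures both parts in one pass with regexec and regmatches. It assumes every element ends in a parenthesised run of digits, as in the sample; the names m and counts are illustrative.

```r
x <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")

# Group 1: everything before the final "(digits)"; group 2: the digits.
m <- regmatches(x, regexec("^(.*)\\(([0-9]+)\\)$", x))

# Each element of m is c(full match, group 1, group 2).
counts <- setNames(as.numeric(vapply(m, `[`, character(1), 3)),
                   vapply(m, `[`, character(1), 2))
```

An element that does not fit the pattern yields an empty match, so vapply stops with an error instead of silently producing wrong values.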

splitting filename text by underscores using R

In R I'd like to take a collection of file names in the format below and return the number to the right of the second underscore (this will always be a number) and the text string to the right of the third underscore (this will be combinations of letters and numbers).
I have file names in this format:
HELP_PLEASE_4_ME
I want to extract the number 4 and the text ME
I'd then like to create a new field within my data frame where these two types of data can be stored. Any suggestions?
Here is an option using regexec and regmatches to pull out the patterns:
matches <- regmatches(df$a, regexec("^.*?_.*?_([0-9]+)_([[:alnum:]]+)$", df$a))
df[c("match.1", "match.2")] <- t(sapply(matches, `[`, -1)) # first result for each match is full regular expression so need to drop that.
Produces:
a match.1 match.2
1 HELP_PLEASE_4_ME 4 ME
2 SOS_WOW_3_Y34OU 3 Y34OU
This will break if any rows don't have the expected structure, but I think that is what you want to happen (i.e. be alerted that your data is not what you think it is). strsplit based approaches will require additional checking to ensure that your data is what you think it is.
And the data:
df <- data.frame(a=c("HELP_PLEASE_4_ME", "SOS_WOW_3_Y34OU"), stringsAsFactors=F)
The obligatory stringr version of @BrodieG's quite spiffy answer:
library(stringr)
df[c("match.1", "match.2")] <-
t(sapply(str_match_all(df$a, "^.*?_.*?_([0-9]+)_([[:alnum:]]+)$"), "[", 2:3))
Put here for context only. You should accept BrodieG's answer.
Since you already know that you want the text that comes after the second and third underscore, you could use strsplit and take the third and fourth result.
> x <- "HELP_PLEASE_4_ME"
> spl <- unlist(strsplit(x, "_"))[3:4]
> data.frame(string = x, under2 = spl[1], under3 = spl[2])
## string under2 under3
## 1 HELP_PLEASE_4_ME 4 ME
Then for longer vectors, you could do something like the last two lines here.
## set up some data
> word1 <- c("HELLO", "GOODBYE", "HI", "BYE")
> word2 <- c("ONE", "TWO", "THREE", "FOUR")
> nums <- 20:23
> word3 <- c("ME", "YOU", "THEM", "US")
> XX <-paste0(word1, "_", word2, "_", nums, "_", word3)
> XX
## [1] "HELLO_ONE_20_ME" "GOODBYE_TWO_21_YOU"
## [3] "HI_THREE_22_THEM" "BYE_FOUR_23_US"
## ------------------------------------------------
## process it
> spl <- do.call(rbind, strsplit(XX, "_"))[, 3:4]
> data.frame(cbind(XX, spl))
## XX V2 V3
## 1 HELLO_ONE_20_ME 20 ME
## 2 GOODBYE_TWO_21_YOU 21 YOU
## 3 HI_THREE_22_THEM 22 THEM
## 4 BYE_FOUR_23_US 23 US
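As noted earlier in this thread, strsplit-based approaches need extra checks to confirm the data has the expected shape. A minimal sketch of such checks follows; the variable names and the choice to stop on malformed input are assumptions, not part of the original answers.

```r
files <- c("HELP_PLEASE_4_ME", "SOS_WOW_3_Y34OU")

parts <- strsplit(files, "_", fixed = TRUE)

# Every file name must have exactly four underscore-separated fields.
stopifnot(all(lengths(parts) == 4))

number <- vapply(parts, `[`, character(1), 3)
text <- vapply(parts, `[`, character(1), 4)

# The third field must consist of digits only.
stopifnot(all(grepl("^[0-9]+$", number)))

result <- data.frame(file = files,
                     number = as.integer(number),
                     text = text,
                     stringsAsFactors = FALSE)
```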
