Flexibly creating logical statements from Vectors with text elements - r

I'm probably going about this in a silly way, but bear with me. I'm using selectizeInput in shiny so users can select multiple categories. Those selections are used to subset a dataframe. After trying match, pmatch and %in% to look for patterns in text, I decided to use grep. It's able to find the proper row when a project has multiple categories. However, the pattern parameter can't have a length greater than 1. The workaround? Add | (the or operator) between elements in the pattern. I need help coming up with a process for inserting | between elements of a vector so I can use grep to subset dataframes when length(input$Category)>1.
Example
df <- data.frame(title = 1:5, category = c("ab", "bcd", "efg","ab,bcd","efg"))
selected category
cate <- c("bcd")
df[grep(cate,x = df$category),]
Works great!
But, if someone were to select more than one category, grep only uses the first element in the pattern:
cate <- c("bcd","efg")
df[grep(cate,x = df$category),]
failure
UNLESS we add an | between the categories selected
cate <- c("bcd|efg")
df[grep(cate,x = df$category),]
success
I cannot figure out how to programmatically add the | between elements of cate without making a giant mess with if statements.
if(length(cate)== 1){
df[grep(cate,x = df$category),]
} else {
if(length(cate) == 2){
cate2 <- paste(cate[[1]],"|",cate[[2]], sep = "")
df[grep(cate2,x = df$category),]
} else {...
There must be some way to generate the pattern:
paste(cate[[1]],"|",cate[[2]],...,"|",cate[[n]], sep = "")

I'm not 100% sure on what you're doing and thus can't point you to the "better" solution, but for your purposes I'm pretty sure something along the lines of this will do:
cate <- character()
cate[1] <- c("ab")
cate[2] <- c("efg")
cate[3] <- c("ab")
do.call(what = paste, c(list(cate), collapse = "|"))
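As a minimal sketch of the same idea applied to the question's data: paste accepts collapse directly, so the do.call wrapper is optional, and the collapsed pattern works for one selection or many. The names pattern and result are illustrative, and the sketch assumes the category labels contain no regex metacharacters (they would need escaping otherwise).

```r
# Example data from the question
df <- data.frame(title = 1:5,
                 category = c("ab", "bcd", "efg", "ab,bcd", "efg"),
                 stringsAsFactors = FALSE)

# Stand-in for input$Category; may have length 1 or more
cate <- c("bcd", "efg")

# Join all selections with the regex "or" operator
pattern <- paste(cate, collapse = "|")

# grepl returns one TRUE/FALSE per row, which subsets directly
result <- df[grepl(pattern, df$category), ]
result
```

With a single selection, paste(cate, collapse = "|") simply returns that element unchanged, so no if/else on length(cate) is needed.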

Related

In r, use string just as if I had typed it in

I am dealing with one aspect of R that really confuses me. What I have built is a line of code invoking str_remove, saved as a string. If I were to copy-paste that string into where I want to use this line of code, it works perfectly as intended.
However, I cannot get R to interpret this code correctly. I have tried using e.g. parse, but the escape characters intended for the str_remove regular expression throw up errors.
Is there not a simple way to just treat a string as if it was a line of typed code?
Here is my reproducible example:
Make toy data:
maf_list_context <- list(as.data.frame(cbind(c("ATTATCGAATT", "ATTATTTTAAA"), c("this one", "not that one"))),
as.data.frame(cbind(c("ATTACGTAATT", "ATTATTTTAAA"), c("this one too", "not that one either"))) )
maf_list_context <- lapply(maf_list_context, function(x)
{colnames(x) <- c("CONTEXT", "want_it")
return(x)
})
The idea is that context will be an argument to a function and that it can be flexible, so the user can supply any number of contexts of interest separated by commas. These will be stringr regular expressions designed to look for particular contexts in DNA within a string of 11 bases. Here for example we can use two contexts of interest. The code that follows combines these to make an expression for use later in selecting the appropriate rows from the dataframes in the list.
context <- "\\w{5}CG\\w{4}, \\w{4}CG\\w{5}"
contextvec <- unlist(str_split(context, pattern = ", "))
contextexpression <- c()
for(i in 1:length(contextvec)){
contextexpression <- paste0(contextexpression, "str_detect(x$CONTEXT, pattern = '", contextvec[i], "') |")
}
contextexpression <- str_remove(contextexpression, pattern = " \\|$")
'contextexpression' is now:
[1] "str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}')"
If I were to paste this expression directly into lapply, it works precisely as I would want it.
> lapply(maf_list_context, function(x){
+
+ x[str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}'), ]
+
+ })
[[1]]
CONTEXT want_it
1 ATTATCGAATT this one
[[2]]
CONTEXT want_it
1 ATTACGTAATT this one too
But of course if I use the string there, it does not.
> lapply(maf_list_context, function(x){
+
+ x[contextexpression, ]
+
+ })
[[1]]
CONTEXT want_it
NA <NA> <NA>
[[2]]
CONTEXT want_it
NA <NA> <NA>
I have tried many different functions but none of them make this work. Is there a way of having R interpret this string as if I had typed it in directly?
The whole reprex:
if (!require("stringr")) {
install.packages("stringr", dependencies = TRUE)
library("stringr")
}
maf_list_context <- list(as.data.frame(cbind(c("ATTATCGAATT", "ATTATTTTAAA"), c("this one", "not that one"))),
as.data.frame(cbind(c("ATTACGTAATT", "ATTATTTTAAA"), c("this one too", "not that one either"))) )
maf_list_context <- lapply(maf_list_context, function(x){
colnames(x) <- c("CONTEXT", "want_it")
return(x)
})
context <- "\\w{5}CG\\w{4}, \\w{4}CG\\w{5}"
contextvec <- unlist(str_split(context, pattern = ", "))
contextexpression <- c()
for(i in 1:length(contextvec)){
contextexpression <- paste0(contextexpression, "str_detect(x$CONTEXT, pattern = '", contextvec[i], "') |")
}
contextexpression <- str_remove(contextexpression, pattern = " \\|$")
maf_list_select <- lapply(maf_list_context, function(x){
x[contextexpression, ]
})
I'm not sure I completely follow what you want your input to be and how to apply it, but your problem seems to be with what you're passing to the subset operator, i.e. x[<codehere>]
The subset operator expects a logical vector here. When you "paste the expression" you are actually pasting an expression that gets evaluated to a logical vector, hence it properly subsets. When you pass the variable contextexpression, you are actually passing a string, which R treats as a row name; no row has that name, so you get a row of NAs. As R sees it:
x[ "str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}')", ]
Instead of (notice the syntax highlighting difference):
x[ str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}'), ]
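To see the difference in isolation, here is a small sketch on a made-up one-column data frame. The data, the row name "no_such_row" and the variable names are illustrative, and base grepl stands in for str_detect so the snippet needs no packages.

```r
x <- data.frame(CONTEXT = c("ATTATCGAATT", "ATTATTTTAAA"),
                stringsAsFactors = FALSE)

# A character index is looked up among the row names; nothing matches,
# so R returns a single row filled with NA
by_name <- x["no_such_row", , drop = FALSE]

# A logical index (one value per row) keeps the rows that are TRUE
by_logical <- x[grepl("CG", x$CONTEXT), , drop = FALSE]

by_name
by_logical
```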
You want to apply each context to each member of the list to get a logical vector and then subset. If the contexts pair up one-to-one with the list items:
purrr::map2(maf_list_context, contextvec, ~.x[str_detect(.x$CONTEXT, .y), ])
If you want to compare every item in contextvec to every item in maf_list_context, then it's a little more complicated but doable.
purrr::map2(
maf_list_context,
purrr::map(
maf_list_context,
function(data){
purrr::reduce(contextvec,
function(prev, cond) str_detect(data$CONTEXT, cond) | prev,
.init = logical(nrow(data)) # start from one FALSE per row
)
}
),
~.x[.y, ] # keep the rows flagged TRUE
)
There's probably a more efficient way to short circuit the matching against the items in maf_list_context, but the general approach applies. The str_detect call handles the comparison of a single condition against a single maf_list_context item. The reduce call combines the results of all the comparisons of contextvec against a single item in maf_list_context into one logical vector (one value per row). The inner map iterates through maf_list_context. The outer map2 iterates through maf_list_context and the list of logical vectors created by the inner map to subset for matches.
If maf_list_context has n items and contextvec has m items:
reduce makes m calls to str_detect, resulting in 1 logical vector
map makes n calls to reduce, resulting in n logical vectors
map2 makes n iterations to subset maf_list_context
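The same "any context matches" logic can also be sketched in base R. This is an alternative under stated assumptions, not the answer's purrr code: the helper name select_rows is hypothetical, grepl replaces str_detect, and the toy data is rebuilt with data.frame for brevity.

```r
# Toy data equivalent to the question's
maf_list_context <- list(
  data.frame(CONTEXT = c("ATTATCGAATT", "ATTATTTTAAA"),
             want_it = c("this one", "not that one"),
             stringsAsFactors = FALSE),
  data.frame(CONTEXT = c("ATTACGTAATT", "ATTATTTTAAA"),
             want_it = c("this one too", "not that one either"),
             stringsAsFactors = FALSE)
)
contextvec <- c("\\w{5}CG\\w{4}", "\\w{4}CG\\w{5}")

# Hypothetical helper: one logical vector per pattern, OR-ed together
select_rows <- function(data, patterns) {
  hits <- lapply(patterns, function(p) grepl(p, data$CONTEXT))
  data[Reduce(`|`, hits), , drop = FALSE]
}

maf_list_select <- lapply(maf_list_context, select_rows, patterns = contextvec)
maf_list_select
```

Because the patterns stay in a character vector, no code has to be assembled as a string and evaluated.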

How to use loop to make a stack in R?

I'm a newbie to this world.
I am currently working with R code to analyze some sequencing data and I'm stuck.
Here's a description of the problem.
What I'd like to do is select the first word of $V3 from the pat1_01_exonic data (115 rows) and write it to a file. (I used the strsplit function for this.)
So far I have tried code 1 below, and it worked for the first line.
The problem is that I can't repeat this by hand 115 times,
so it seems a loop is necessary.
I'm not really confident writing a loop myself, and as I expected, code 2 didn't work.
To stack the results I thought about using append, rbind or stack.
Can anyone give me some advice about how to fix this problem?
Big thanks in advance
#code1
pat1_01_exonic$V3 <-as.character(pat1_01_exonic$V3)
pat1 <- data.frame(head(strsplit(pat1_01_exonic$V3, ":")[[1]],1))
#code2
for (i in 1: nrow(pat1_01_exonic)) {
pat1_output <- vector()
sub[i] <- data.frame(head(strsplit(pat1_01_exonic$V3, ":")[[i]],1))
pat1_0utput <- append(sub[i])
i <- i+1
}
Often you can avoid a for loop in R. If I have understood you correctly, here you can use sub to get the first string before ":"
pat1_01_exonic$new_col <- sub(":.*", "", pat1_01_exonic$V3)
pat1_01_exonic
# V3 new_col
#1 abc:def:avd abc
#2 afd:adef afd
#3 emg:rvf:temp emg
data
pat1_01_exonic <- data.frame(V3 = c("abc:def:avd", "afd:adef", "emg:rvf:temp"),
stringsAsFactors = FALSE)
The code below is an example that creates a new variable "V3_First_Word" holding the first word of the original string.
library(dplyr)
library(stringr)
Want <- pat1_01_exonic %>%
mutate(V3_First_Word = word(V3, 1, 1, sep = ":")) # creates the new variable; sep = ":" because the words are colon-separated
In base R, we can use read.table
pat1_01_exonic$new_col <- read.table(text = pat1_01_exonic$V3, sep=":",
header = FALSE, fill = TRUE, stringsAsFactors = FALSE)[,1]
pat1_01_exonic$new_col
#[1] "abc" "afd" "emg"
Or strsplit and select the first element
sapply(strsplit(pat1_01_exonic$V3, ":"), `[`, 1)
data
pat1_01_exonic <- data.frame(V3 = c("abc:def:avd", "afd:adef", "emg:rvf:temp"),
stringsAsFactors = FALSE)
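For comparison, if a loop is still preferred, a minimal sketch of a working version is below. It preallocates the result vector once outside the loop and fills it by index, which is the part the original attempt was missing. The name first_word is illustrative, and the three-row data frame stands in for the real 115-row data.

```r
pat1_01_exonic <- data.frame(V3 = c("abc:def:avd", "afd:adef", "emg:rvf:temp"),
                             stringsAsFactors = FALSE)

# Split every string once, outside the loop
parts <- strsplit(pat1_01_exonic$V3, ":", fixed = TRUE)

# Preallocate the output, then fill one slot per row
first_word <- character(length(parts))
for (i in seq_along(parts)) {
  first_word[i] <- parts[[i]][1]
}

pat1 <- data.frame(first_word = first_word, stringsAsFactors = FALSE)
pat1
```

There is no need to increment i by hand; for already advances it on each iteration.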

Reverse lookup for loop in R

I have a set of numbers / strings that make up other numbers / strings. I need to create a function that gives me a list of all the numbers / strings needed to create a given number / string.
Consider the following dataset
ingredients <- c('N/A', 'cat', 'bird')
product <- c('cat', 'bird', 'dog')
data <- data.frame(ingredients, product)
head(data)
If I input function(dog), I would like a list that returns bird and then cat. The function knows to stop when ingredients = N/A (there's nothing more to look up).
It seems like some sort of for loop that appends is the right approach.
needed <- list()
for (product in list){
needed[[product]]<-df
}
df <- dplyr::bind_rows(product)
I amended your initial code to make N/A simply equal to NA so I could use the is.na function in my R code. Now the sample data is
ingredients <- c(NA, 'cat', 'bird')
product <- c('cat', 'bird', 'dog')
data <- data.frame(ingredients, product)
Code is below:
ReverseLookup <- function (input) {
ans <- list()
while (input %in% data$product) {
if (!is.na(as.character(data[which(data$product == input),]$ingredients))) {
ans <- append(ans, as.character(data[which(data$product == input),]$ingredients))
input <- as.character(data[which(data$product == input),]$ingredients)
}
else {
break
}
}
print(ans)
}
I create an empty list and then create a while loop that just checks if the input exists in the product column. If so, it then checks to see if the corresponding ingredient to the product input is a non-NA value. If that's the case, the ingredient will be appended to ans and will become the new input. I also added a break statement to get out of the while loop when you reach an NA.
I did a quick test on the case where there is no NA in your dataframe and it appears to be working fine. Maybe someone else here can figure out a more concise way to write this, but it should work for you.
You can likely find a way to use a tree of some type to work through nodes. But, using a recursive function in base R, I have come up with this.
I have also changed the 'N/A' to NA to make life easier. Also, I have added in stringsAsFactors = F to the data frame.
ingredients <- c(NA, 'cat', 'bird')
product <- c('cat', 'bird', 'dog')
data <- data.frame(ingredients, product, stringsAsFactors = F)
reverse_lookup <- function(data, x, last_result = NULL) {
if (! is.null(last_result)) {
x <- data[data$product == last_result[length(last_result)], "ingredients"]
}
if (! is.na(x)) {
last_result <- reverse_lookup(data, x, c(last_result, x))
}
last_result
}
This returns the input as well, which you can always drop off as the first element of the vector.
> reverse_lookup(data, "dog")
[1] "dog" "bird" "cat"
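A third variant, sketched here under the assumption that every product has exactly one ingredient and that the chain contains no cycles, stores the pairs in a named vector and walks it with a while loop. The names lookup and trace_ingredients are hypothetical.

```r
ingredients <- c(NA, "cat", "bird")
product <- c("cat", "bird", "dog")

# Named vector: names are products, values are their ingredient
lookup <- setNames(ingredients, product)

trace_ingredients <- function(lookup, start) {
  chain <- character()
  current <- unname(lookup[start])
  # Stop at NA: either the chain ended or the product is unknown
  while (!is.na(current)) {
    chain <- c(chain, current)
    current <- unname(lookup[current])
  }
  chain
}

trace_ingredients(lookup, "dog")
```

Unlike the recursive version, this one does not include the input itself in the result.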

Efficient way to remove all proper names from corpus

Working in R, I'm trying to find an efficient way to search through a file of texts and remove or replace all instances of proper names (e.g., Thomas). I assume there is something available to do this but have been unable to locate it.
So, in this example the words "Susan" and "Bob" would be removed. This is a simplified example; in reality I would want this to apply to hundreds of documents and therefore a fairly large list of names.
texts <- as.data.frame (rbind (
'This text stuff if quite interesting',
'Where are all the names said Susan',
'Bob wondered what happened to all the proper nouns'
))
names(texts) [1] <- "text"
Here's one approach based upon a data set of firstnames:
install.packages("gender")
library(gender)
install_genderdata_package()
sets <- data(package = "genderdata")$results[,"Item"]
data(list = sets, package = "genderdata")
stopwords <- unique(kantrowitz$name)
texts <- as.data.frame (rbind (
'This text stuff if quite interesting',
'Where are all the names said Susan',
'Bob wondered what happened to all the proper nouns'
))
removeWords <- function(txt, words, n = 30000L) {
l <- cumsum(nchar(words)+c(0, rep(1, length(words)-1)))
groups <- cut(l, breaks = seq(1,ceiling(tail(l, 1)/n)*n+1, by = n))
regexes <- sapply(split(words, groups), function(words) sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), collapse = "|")))
for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE, ignore.case = TRUE)
return(txt)
}
removeWords(texts[,1], stopwords)
# [1] "This text stuff if quite interesting"
# [2] "Where are all the names said "
# [3] " wondered what happened to all the proper nouns"
It may need some tuning for your specific data set.
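The core of the word-list approach can be sketched without any package, assuming you already have a vector of names. The three-name list below is purely illustrative (a real list would come from a names data set), and the names name_list, pattern and cleaned are hypothetical.

```r
texts <- c("This text stuff if quite interesting",
           "Where are all the names said Susan",
           "Bob wondered what happened to all the proper nouns")

# Illustrative name list; substitute a real one
name_list <- c("Susan", "Bob", "Thomas")

# One alternation wrapped in word boundaries so "Bob" does not hit "Bobbin"
pattern <- paste0("\\b(", paste(name_list, collapse = "|"), ")\\b")

cleaned <- gsub(pattern, "", texts, perl = TRUE)

# Tidy up the whitespace left behind by removed words
cleaned <- trimws(gsub("\\s+", " ", cleaned))
cleaned
```

For a very large name list a single alternation can become too long for the regex engine, which is why the answer's removeWords splits the words into groups first.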
Another approach could be based upon part-of-speech tagging.

R new variable assignment

I made a loop that assigns the result of a function to a newly created variable. After that, that variable is used to create another.
This second step fails to produce the expected result.
library(stringr)
for (i in 1:length(Ids)){
nam <- paste("data", Ids[i], sep = "_")
assign(nam, GetReportData(query, token,paginate_query = F))
newvar=paste(nam,"contentid",sep="$")
originStr=paste(nam,"pagePath",sep="$")
assign(newvar,str_extract(originStr,"&id=[0-9]+"))
}
Don't create a bunch of variables; store related values in named lists to make it easier to retrieve them. You didn't supply any input to test with, but I'm guessing this does the same thing.
library(stringr)
mydata <- lapply(seq_along(Ids), function(i) {
dd <- GetReportData(query, token, paginate_query = FALSE)
dd$contentid <- str_extract(dd$pagePath, "&id=[0-9]+")
dd
})
This will return a list of data.frames. You can access them with mydata[[1]], mydata[[2]], etc rather than data_1, data_2, etc
If you absolutely insist on creating a bunch of variables, just make sure to do all your transformations on an actual object, and then save that object when you are done. You cannot use assign to set part of an object through a name containing $ or [, as described in the help page: "assign does not dispatch assignment methods, so it cannot be used to set elements of vectors, names, attributes, etc." For example
for(i in seq_along(Ids)) {
dd <- GetReportData(query, token, paginate_query = FALSE)
dd$contentid <- str_extract(dd$pagePath, "&id=[0-9]+")
assign(paste("data", i, sep = "_"), dd)
}
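Since GetReportData, query and token are not available here, the sketch below swaps in a hypothetical stand-in (fake_report) and made-up Ids to show the named-list pattern end to end. The helper extract_id is also hypothetical; it uses base regmatches in place of str_extract so the snippet runs without stringr.

```r
# Hypothetical stand-in for GetReportData(); returns one row per call
fake_report <- function(id) {
  data.frame(pagePath = paste0("/article?lang=en&id=", id),
             stringsAsFactors = FALSE)
}

# Base R equivalent of str_extract: NA where nothing matches
extract_id <- function(x) {
  m <- regexpr("&id=[0-9]+", x)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

Ids <- c(101, 202)  # made-up ids

reports <- lapply(Ids, function(id) {
  dd <- fake_report(id)
  dd$contentid <- extract_id(dd$pagePath)
  dd
})

# Name the list so elements can be fetched like the old data_<id> variables
names(reports) <- paste("data", Ids, sep = "_")
reports[["data_101"]]
```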
