argument "second_str" is missing with no default - r

I have written a function that finds the intersection between two strings. I want to use this function with apply to find the intersections for every row of a given data frame. I am using the code below.
Function:-
common <- function(first_str,second_str)
{
a <- unlist(strsplit(first_str," "))
b <- unlist(strsplit(second_str," "))
com <- intersect(a,b)
return((length(com)/length(union(a,b)))*100)
}
Data frame:-
str1 <- c("One Two Three","X Y Z")
str2 <- c("One Two Four", "X Y A")
df <- data.frame(str1, str2)
When I use apply, I get the error: argument "second_str" is missing with no default
apply(df, 1, common)
Could you please help me out with the solution?

apply() will only pass a single vector to the function you provide. With MARGIN = 1 it calls your function once per row, passing a single vector that contains all the values for the current row. It does not split those values into separate arguments to your function.
You could instead re-write your function to
common2 <- function(x) {
first_str <- x[1]
second_str <- x[2]
a <- unlist(strsplit(first_str," "))
b <- unlist(strsplit(second_str," "))
com <- intersect(a,b)
return((length(com)/length(union(a,b)))*100)
}
That doesn't scale well to multiple parameters, though. You could also use Map or mapply to iterate over multiple vectors at a time.
With your original function you can do
with(df, Map(common, str1, str2))
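As a rough sketch of the two options described in this answer, the snippet below keeps the original two-argument common and calls it either through a small wrapper inside apply or through mapply. The names via_apply and via_mapply are made up for illustration, and stringsAsFactors = FALSE is an assumption added so that strsplit receives character vectors on older R versions.

```r
# Original two-argument function from the question
common <- function(first_str, second_str) {
  a <- unlist(strsplit(first_str, " "))
  b <- unlist(strsplit(second_str, " "))
  length(intersect(a, b)) / length(union(a, b)) * 100
}

df <- data.frame(
  str1 = c("One Two Three", "X Y Z"),
  str2 = c("One Two Four", "X Y A"),
  stringsAsFactors = FALSE  # assumption: keep the columns as character
)

# apply() hands over each row as ONE character vector, so unpack it in a wrapper
via_apply <- apply(df, 1, function(row) common(row[["str1"]], row[["str2"]]))

# mapply() walks the two columns in parallel and passes two separate arguments
via_mapply <- mapply(common, df$str1, df$str2, USE.NAMES = FALSE)
```

Both calls should give one percentage per row; mapply simplifies to a numeric vector, whereas Map would return a list.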

Related

In r, use string just as if I had typed it in

I am dealing with one aspect of r that really confuses me. What I have built is a line of code invoking str_remove saved as a string. If I was to copy-paste that string into where I want to use this line of code, it works perfectly as intended.
However I cannot get r to interpret this code correctly. I have tried using e.g. parse, but the escape characters intended for str_remove regular expression throw up errors.
Is there not a simple way to just treat a string as if it was a line of typed code?
Here is my reproducible example:
Make toy data:
maf_list_context <- list(as.data.frame(cbind(c("ATTATCGAATT", "ATTATTTTAAA"), c("this one", "not that one"))),
as.data.frame(cbind(c("ATTACGTAATT", "ATTATTTTAAA"), c("this one too", "not that one either"))) )
maf_list_context <- lapply(maf_list_context, function(x)
{colnames(x) <- c("CONTEXT", "want_it")
return(x)
})
The idea is that context will be an argument to a function and that it can be flexible, so the user can supply any number of contexts of interest separated by commas. These will be stringr regular expressions designed to look for particular contexts in DNA within a string of 11 bases. Here for example we can use two contexts of interest. The code that follows combines these to make an expression for use later in selecting the appropriate rows from the dataframes in the list.
context <- "\\w{5}CG\\w{4}, \\w{4}CG\\w{5}"
contextvec <- unlist(str_split(context, pattern = ", "))
contextexpression <- c()
for(i in 1:length(contextvec)){
contextexpression <- paste0(contextexpression, "str_detect(x$CONTEXT, pattern = '", contextvec[i], "') |")
}
contextexpression <- str_remove(contextexpression, pattern = " \\|$")
'contextexpression' is now:
[1] "str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}')"
If I paste this expression directly into lapply, it works precisely as I want.
> lapply(maf_list_context, function(x){
+
+ x[str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}'), ]
+
+ })
[[1]]
CONTEXT want_it
1 ATTATCGAATT this one
[[2]]
CONTEXT want_it
1 ATTACGTAATT this one too
But of course if I use the string there, it does not.
> lapply(maf_list_context, function(x){
+
+ x[contextexpression, ]
+
+ })
[[1]]
CONTEXT want_it
NA <NA> <NA>
[[2]]
CONTEXT want_it
NA <NA> <NA>
I have tried many different functions, but none of them make this work. Is there a way of having R interpret this string as if I had typed it in directly?
The whole reprex:
if (!require("stringr")) {
install.packages("stringr", dependencies = TRUE)
library("stringr")
}
maf_list_context <- list(as.data.frame(cbind(c("ATTATCGAATT", "ATTATTTTAAA"), c("this one", "not that one"))),
as.data.frame(cbind(c("ATTACGTAATT", "ATTATTTTAAA"), c("this one too", "not that one either"))) )
maf_list_context <- lapply(maf_list_context, function(x){
colnames(x) <- c("CONTEXT", "want_it")
return(x)
})
context <- "\\w{5}CG\\w{4}, \\w{4}CG\\w{5}"
contextvec <- unlist(str_split(context, pattern = ", "))
contextexpression <- c()
for(i in 1:length(contextvec)){
contextexpression <- paste0(contextexpression, "str_detect(x$CONTEXT, pattern = '", contextvec[i], "') |")
}
contextexpression <- str_remove(contextexpression, pattern = " \\|$")
maf_list_select <- lapply(maf_list_context, function(x){
x[contextexpression, ]
})
I'm not sure I completely follow what you want your input to be and how to apply it, but your problem seems to be with what you're passing to the subset operator, i.e. x[<codehere>]
The subset operator expects a logical vector. When you "paste the expression" you are actually pasting an expression that gets evaluated to a logical vector, hence it properly subsets. When you pass the variable contextexpression, you are actually passing a string. As R sees it:
x[ "str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}')", ]
Instead of (notice the syntax highlighting difference):
x[ str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}'), ]
You want to apply each context to each member of the list to get a logical vector, and then subset.
purrr::map2(maf_list_context, contextvec, ~.x[str_detect(.x$CONTEXT, .y), ])
If you want to compare every item in contextvec to every item in maf_list_context, then it's a little more complicated but doable.
purrr::map2(
maf_list_context,
purrr::map(
maf_list_context,
function(data){
purrr::reduce(contextvec,
function(prev, cond) str_detect(data$CONTEXT, cond) | prev,
.init = logical(nrow(data)) # start with one FALSE per row of data
)
}
),
~.x[.y, ] # the comma makes this a row subset rather than a column subset
)
There's probably a more efficient way to short-circuit the matching against the items in maf_list_context, but the general approach applies. The str_detect call compares a single condition against the CONTEXT column of a single maf_list_context item. The reduce call ORs together the results for every element of contextvec, giving one logical vector (one value per row) for that item. The inner map iterates through maf_list_context. The outer map2 pairs each data frame in maf_list_context with the logical vector created for it by the inner map and subsets the matching rows.
If maf_list_context has n items and contextvec has m items:
reduce makes m calls to str_detect, resulting in 1 logical vector
map makes n calls to reduce, resulting in n logical vectors
map2 makes n iterations to subset maf_list_context
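One way to sidestep building code as a string, sketched here under some assumptions: since every context is a regular expression tested against the same column, the patterns can be joined with "|" into a single pattern, and the result is then an ordinary logical vector. The sketch uses base R (strsplit, grepl with perl = TRUE) instead of stringr so it runs without extra packages; the name combined is made up for illustration.

```r
# Toy data from the question, built directly with named columns
maf_list_context <- list(
  data.frame(CONTEXT = c("ATTATCGAATT", "ATTATTTTAAA"),
             want_it = c("this one", "not that one"),
             stringsAsFactors = FALSE),
  data.frame(CONTEXT = c("ATTACGTAATT", "ATTATTTTAAA"),
             want_it = c("this one too", "not that one either"),
             stringsAsFactors = FALSE)
)

context <- "\\w{5}CG\\w{4}, \\w{4}CG\\w{5}"
contextvec <- unlist(strsplit(context, ", ", fixed = TRUE))

# Wrap each pattern in a group and join with "|" (regex alternation)
combined <- paste0("(", contextvec, ")", collapse = "|")

# grepl() returns the logical vector that the row subset needs
maf_list_select <- lapply(maf_list_context, function(x) {
  x[grepl(combined, x$CONTEXT, perl = TRUE), , drop = FALSE]
})
```

This keeps the user-supplied contexts as data rather than code, so the backslash escapes never have to survive a round trip through parse.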

How to convert several characters to vectors in R?

I am struggling with converting several characters to vectors and making them as a list in R.
The converting rule is as follows:
Assign a number to each character. ex. A=1, B=2, C=3,...
Make a vector when the length of characters is ">=2". ex. AB = c(1,2), ABC = c(1,2,3)
Make lists containing several vectors.
For example, suppose that there is ex object with three components. For each component, I want to make it to list objects list1, list2, and list3.
ex = c("(A,B,C,D)", "(AB,BC,CD)","(AB,C)")
# 3 lists to be returned from ex object
list1 = "list(1,2,3,4)" # from (A,B,C,D)
list2 = "list(c(1,2), c(2,3), c(3,4))" # from (AB,BC,CD)
list3 = "list(c(1,2), c(3))" # from (AB,C)
Please let me know a good R function to solve the example above.
* The minor change is reflected.
lookUpTable = as.numeric(1:4) #map numbers to their respective strings
names(lookUpTable) = LETTERS[1:4]
step1<- #get rid of parentheses and split by ",".
strsplit(gsub("[()]", "", ex), ",")
result<- #split again to make things like "AB" into "A", "B", also convert the strings to numbers acc. to lookUpTable
lapply(step1, function(x){ lapply(strsplit(x, ""), function(u) unname(lookUpTable[u])) })
# assign to the global environment.
invisible(
lapply(seq_along(result), function(x) {assign(paste0("list", x), result[[x]], envir = globalenv()); NULL})
)
# get it as strings:
invisible(
lapply(seq_along(result), function(x) {assign(paste0("list_string", x), capture.output(dput(result[[x]])), envir = globalenv()); NULL})
)
data:
ex = c("(A,B,C,D)", "(AB,BC,CD)","(AB,C)")
tips and tricks:
I make use of regular expressions in gsub (and strsplit). Learn regex!
I made a lookUpTable that maps the individual strings to numbers. Make sure your lookUpTable is set up analogously.
Have a look at the apply family of functions, in this case ?lapply.
Lastly, I assign the results to the global environment. I don't recommend this step, but it's what you requested.
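If separate global variables are not strictly required, a named list is usually easier to work with. The following is only a sketch of that alternative: to_vectors and lookup are hypothetical names, and the lookup covers all 26 capital letters on the assumption that the A=1, B=2, ... rule continues that way.

```r
ex <- c("(A,B,C,D)", "(AB,BC,CD)", "(AB,C)")

# Assumed mapping: A = 1, B = 2, ..., Z = 26
lookup <- setNames(seq_along(LETTERS), LETTERS)

to_vectors <- function(s) {
  # drop the parentheses, then split on commas
  tokens <- strsplit(gsub("[()]", "", s), ",", fixed = TRUE)[[1]]
  # split each token into single letters and translate them to numbers
  lapply(strsplit(tokens, ""), function(ch) unname(lookup[ch]))
}

# One named list holding list1, list2, list3 instead of three globals
result <- setNames(lapply(ex, to_vectors), paste0("list", seq_along(ex)))
```

Individual results are then available as result$list1, result$list2, and so on.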

Difference between two columns with separated variables by ; in R

I am a beginner in R, and while working through some exercises I got stuck on one of them. My data.frame is as follows:
LanguageWorkedNow     LanguageNextYear
Java; PHP             Java; C++; SQL
C;C++;JavaScript;     JavaScript; C; SQL
And I need to know the variables which are in LanguageNextYear and are not in LanguageWorkedNow, to set a list with the different ones.
Sorry if the question is duplicated, I'm quite new here and tried to find it, but with no success.
base R
Idea: mapply setdiff on strsplitted NextYear and WorkedNow, and then paste it using collapse=";":
df$New <- with(df, {
a <- mapply(setdiff, strsplit(NextYear, ";"), strsplit(WorkedNow, ";"), SIMPLIFY = FALSE)
sapply(a, paste, collapse=";")
})
# SIMPLIFY = FALSE is needed in a general case, it doesn't
# affect the output in the example case
# Or if you use Map instead of mapply, that is the default, so
# it could also be...
df$New <- with(df,
sapply(Map(setdiff, strsplit(NextYear, ";"), strsplit(WorkedNow, ";")),
paste, collapse=";"))
data
df <- read.table(text = "WorkedNow NextYear
Java;PHP Java;C++;SQL
C;C++;JavaScript JavaScript;C;SQL
", header=TRUE, stringsAsFactors=FALSE)
Here's a solution using purrr package:
df = read.table(text = "
LanguageWorkedNow LanguageNextYear
Java;PHP Java;C++;SQL
C;C++;JavaScript JavaScript;C;SQL
", header=T, stringsAsFactors=F)
library(purrr)
df$New = map2_chr(df$LanguageWorkedNow,
df$LanguageNextYear,
~{x1 = unlist(strsplit(.x, split=";"))
x2 = unlist(strsplit(.y, split=";"))
paste0(x2[!x2%in%x1], collapse = ";")})
df
#   LanguageWorkedNow  LanguageNextYear      New
# 1          Java;PHP      Java;C++;SQL  C++;SQL
# 2  C;C++;JavaScript  JavaScript;C;SQL      SQL
For each row you get your columns and you create vectors of values (separated by ;). Then you check which values of NextYear vector don't exist in WorkedNow vector and you create a string based on / combining those values.
The map function family will help you apply your logic / function to each row. In our case we use map2_chr as we have two inputs (your two columns) and we expect a string / character output.
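The data in the question has spaces after some semicolons and a trailing semicolon, which both answers remove by hand in their sample data. A minimal base R sketch that tolerates those, assuming the original column names and a made-up helper called split_langs:

```r
df <- data.frame(
  LanguageWorkedNow = c("Java; PHP", "C;C++;JavaScript;"),
  LanguageNextYear  = c("Java; C++; SQL", "JavaScript; C; SQL"),
  stringsAsFactors = FALSE
)

# Split on ";", trim surrounding whitespace, and drop empty pieces
split_langs <- function(x) {
  parts <- strsplit(x, ";", fixed = TRUE)
  lapply(parts, function(p) {
    p <- trimws(p)
    p[nzchar(p)]
  })
}

# Row by row: languages wanted next year that are not used now
new_langs <- Map(setdiff,
                 split_langs(df$LanguageNextYear),
                 split_langs(df$LanguageWorkedNow))

df$New <- vapply(new_langs, paste, character(1), collapse = ";")
```

Using vapply with character(1) makes the result a plain character vector even when a row has no new languages (it then holds an empty string).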

Assigning new strings with conditional match

I have an issue with conditionally replacing strings with new ones.
Below is a short version of my real problem. It works so far, but I need a better solution since there are many rows in the real data.
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
Basically, I want to replace parts of strings with replace_strings: the first item in strings is replaced with the first item in replace_strings, and so on.
replace_strings <- c("A1","A2","A3","A4")
So the final string should look like
final_string <- c("ca_A1","cb_A2","cc_A3","cd_A4")
I wrote a simple function, assign_new:
assign_new <- function(x){
ifelse(grepl("A33",x),gsub("A33","A1",x),
ifelse(grepl("A32",x),gsub("A32","A2",x),
ifelse(grepl("A31",x),gsub("A31","A3",x),
ifelse(grepl("A30",x),gsub("A30","A4",x),x))))
}
assign_new(strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
OK, it seems we have a solution. But if I have A1000 down to A1 and want to replace them with A1 up to A1000, I would need 1000 nested ifelse statements. How can we tackle that?
If your vectors are ordered to be matched, then you can use:
> paste0(gsub("(.*_)(.*)","\\1", strings ), replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
You can use regmatches. First obtain all the characters that follow _ using regexpr, then replace them as shown below
`regmatches<-`(strings,regexpr("(?<=_).*",strings,perl = T),value=replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
Not the fastest, but very tractable and easy to maintain:
for (i in 1:length(strings)) {
strings[i] <- gsub("\\d+$", i, strings[i])
}
"\\d+$" just matches any number at the end of the string.
EDIT: Per @Onyambu's comment, removing map2_chr as paste is a vectorized function.
foo <- function(x, y){
x <- unlist(lapply(strsplit(x, "_"), '[', 1))
paste(x, y, sep = "_")
}
foo(strings, replace_strings)
with x being strings and y being replace_strings. You first split the strings object at the _ character, and paste with the respective replace_strings object.
EDIT:
For objects where there is no positional relationship you could create a reference table (dataframe, list, etc.) and match your values.
reference_tbl <- data.frame(strings, replace_strings)
foo <- function(x){
y <- reference_tbl$replace_strings[match(x, reference_tbl$strings)]
x <- unlist(lapply(strsplit(x, "_"), '[', 1))
paste(x, y, sep = "_")
}
foo(strings)
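A variant of the reference-table idea, sketched with a named character vector keyed on the old suffix, so the order of strings does not matter. The names lookup, prefix, old_suffix and new_suffix are made up for illustration, and the sketch assumes every string has the form prefix_suffix with a single underscore separating the two parts.

```r
strings <- c("ca_A33", "cb_A32", "cc_A31", "cd_A30")

# Named vector: names are the old suffixes, values are the replacements
lookup <- c(A33 = "A1", A32 = "A2", A31 = "A3", A30 = "A4")

prefix     <- sub("_.*$", "", strings)     # text before the first "_"
old_suffix <- sub("^[^_]*_", "", strings)  # text after the first "_"

new_suffix <- unname(lookup[old_suffix])
# Keep the old suffix wherever the lookup has no entry
missing <- is.na(new_suffix)
new_suffix[missing] <- old_suffix[missing]

result <- paste(prefix, new_suffix, sep = "_")
```

For a large mapping the lookup can be generated instead of typed, for example with setNames(paste0("A", 1:1000), paste0("A", 1000:1)) if that is the mapping you need.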
Using the dplyr package:
library(dplyr)
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
replace_strings <- c("A1","A2","A3","A4")
df <- data.frame(strings, replace_strings)
df <- mutate(rowwise(df),
strings = gsub("_.*",
paste0("_", replace_strings),
strings)
)
df <- select(df, strings)
Output:
# A tibble: 4 x 1
strings
<chr>
1 ca_A1
2 cb_A2
3 cc_A3
4 cd_A4
yet another way:
mapply(function(x,y) gsub("(\\w\\w_).*",paste0("\\1",y),x),strings,replace_strings,USE.NAMES=FALSE)
# [1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"

subset in parallel using a list of dataframes and a list of vectors

This works:
onion$yearone$id %in% mask$yearone
This doesn't:
onion[1][1] %in% mask[1]
onion[1]['id'] %in% mask[1]
Why? Short of an obvious way to vectorize in parallel over the columns in DF and in memberids (so that, within each year, I only get rows whose ids are present in both DF and memberids), I'm using a for loop, but I'm not having any luck finding the right way to express the index... Help?
Example data:
yearone <- data.frame(id=c("b","b","c","a","a"),v=rnorm(5))
onion <- list()
onion[[1]] <- yearone
names(onion) <- 'yearone'
mask <- list()
mask[[1]] <- c('a','c')
names(mask) <- 'yearone'
The '$' operator is not the same as the '[' operator. If 'yearone' and 'ids' are in fact the first items in those lists, you should see that this gives the same results as the first call:
DF[[1]][[1]] %in% memberids[[1]]
Why we should think that accessing yearpathall should give the same results is entirely unclear at this point, but using the "[[" operator will possibly give an atomic vector, whereas using "[" on a list will certainly not. The "[" operator always returns a result of the same class as its first argument, so in this case it would be a list rather than a vector, for both 'DF' and 'memberids'. The %in% operator is just an infix version of match and needs an atomic vector as both of its arguments.
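To make the difference concrete, here is a small sketch using the onion and mask names from the question's example data. The v column uses fixed values instead of rnorm so the result is reproducible, and in_mask and kept are made-up names.

```r
onion <- list(yearone = data.frame(id = c("b", "b", "c", "a", "a"),
                                   v  = 1:5,
                                   stringsAsFactors = FALSE))
mask <- list(yearone = c("a", "c"))

# "[" keeps the container: a list of length one
class(onion[1])    # "list"
# "[[" extracts the element itself: the data frame
class(onion[[1]])  # "data.frame"

# Extract atomic vectors on both sides before using %in%
in_mask <- onion[[1]][["id"]] %in% mask[[1]]

# Use the logical vector to keep only the matching rows
kept <- onion[[1]][in_mask, ]
```

The same pattern works with names instead of positions, for example onion[["yearone"]][["id"]] %in% mask[["yearone"]].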
Here is an approach using Map
# some data
onion <- replicate(5,data.frame(id = sample(letters[1:3], 5,T), v = 1:5),
simplify = F)
mask <- replicate(5, sample(letters[1:3],2), simplify = F)
names(onion) <- names(mask) <- paste0('year', seq_along(onion))
A function that will do the matching
get_matches <- function(data, id, mask){
rows <- data[[id]] %in% mask
data[rows,]
}
Map(get_matches , data = onion, mask = mask, MoreArgs = list(id = 'id'))
This seems to be the answer I was seeking:
merge(mask[1],onion[[1]], by.x = names(mask[1]), by.y = names(onion[[1]][1]))
And applied to parallel lists of dataframes:
result <- list()
for (i in 1:(length(names(onion)))) {
result[[i]] <- merge(mask[i],onion[[i]], by.x = names(mask[i]), by.y = names(onion[[i]][1]))
}
