How to add an attribute to grep() output in R

What I'm trying to do is the following:
Use the grep() function to search for a pattern (a list of numbers, which I called "toMatch") in a data frame ("News"). I want it to search for those numbers in the news and return the matches in the form "number", "corresponding news". Unfortunately, so far I have only been able to get a list of the corresponding news items as a result. Any idea how I can add an attribute with the corresponding number from the match to the output (in effect creating key-value pairs as the output)?
Here is a short example of my code:
News <- c("AT000000STR2 is schwierig", "AT", "ATI", "AT000000STR1")
toMatch <- c("AT000000STR1", "AT000000STR2", "DE000000STR1", "DE000000STR2")
matches <- unique(grep(paste(toMatch, collapse = "|"), News, value = TRUE))
matches
And here is the result:
> matches
[1] "AT000000STR2 is schwierig" "AT000000STR1"
What I would like to have is a list or, better yet, an Excel file, looking like this:
AT000000STR2 "AT000000STR2 is schwierig"
AT000000STR1 "AT000000STR1"
Help is much appreciated.

Something like this might be of help:
# name toMatch with its own values
names(toMatch) <- toMatch
# create a list with the format you request
myl <- lapply(toMatch, function(x) {
  grep(x, News, value = TRUE)
})
# or, more compactly, as @BenBolker says in the comments below
# myl <- lapply(toMatch, grep, x = News, value = TRUE)
# remove the unmatched
myl[lapply(myl, length) > 0]
Output:
$AT000000STR1
[1] "AT000000STR1"
$AT000000STR2
[1] "AT000000STR2 is schwierig"

Your current approach returns the unique matches, but then you have no way of linking them to the relevant 'toMatch'.
This might be a start for you: using lapply we create a list of matches for all elements of toMatch, and then bind those together with toMatch.
matched <- lapply(toMatch, function(x) grep(x, News, value = TRUE))
# turn unfound matches into missings. You can remove these, but I don't like
# generating implicit missings
matched[sapply(matched, length) == 0] <- NA
res <- cbind(toMatch, matched)
res
     toMatch        matched
[1,] "AT000000STR1" "AT000000STR1"
[2,] "AT000000STR2" "AT000000STR2 is schwierig"
[3,] "DE000000STR1" NA
[4,] "DE000000STR2" NA
Writing to CSV is then trivial:
write.csv(res,"yourfile.csv")
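One hedged side note: res here is a matrix with list entries, so if you prefer a plain data frame before writing (assuming each pattern matches at most one news item), something like this also works:
# flatten to a plain data frame, keeping only the first match per pattern
res_df <- data.frame(toMatch = toMatch,
                     matched = sapply(matched, `[`, 1),
                     stringsAsFactors = FALSE)
write.csv(res_df, "yourfile.csv", row.names = FALSE)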

Related

How to search a vector of words for words containing two specific letters

So I've got a vector of 5-letter words and I want to create a function that extracts the words that contain ALL of the letters in the pattern.
For example, if my vector is ("aback", "abase", "abate", "agate", "allay") and I'm looking for words that contain BOTH "a" and "b", I want the function to return ("aback", "abase", "abate"). I don't care what position or how many times these letters occur in the words, only that the words contain both of them.
I've tried to do this by creating a function that is meant to combine grepl calls with &. But the problem is that grepl doesn't accept a vector as the pattern. My plan was for this function to achieve grepl("a", word_vec) & grepl("b", word_vec). I also need this to be scalable, so that I can, for example, search for all words containing "a" AND "b" AND "c".
grepl_cat <- function(str, words_vec) {
  pat <- str_split(str, "")
  first_let = TRUE
  for (i in 1:length(pat)) {
    if (first_let) {
      result <- sapply(pat[i], grepl, x = word_vec)
      first_let <- FALSE
    }
    print(pat[i])
    result <- result & sapply(pat[i], grepl, x = word_vec)
  }
  return(result)
}
word_vec[grepl_cat("abc", word_vec)]
The function I've written above definitely isn't doing what it's intended to do.
I'm wondering if there is an easier way to do this with regex patterns, or whether there is a way to feed each letter of the string into grepl individually rather than as a vector.
A possible solution in base R:
s <- c("aback", "abase", "abate", "agate", "allay")
subset(s, grepl("(a)(b)", s))
#> [1] "aback" "abase" "abate"
Another possible solution, based on tidyverse:
library(tidyverse)
s <- c("aback", "abase", "abate", "agate", "allay")
s %>%
  data.frame(s = .) %>%
  filter(str_detect(s, "(a)(b)")) %>%
  pull(s)
#> [1] "aback" "abase" "abate"
For "a", "b" and "c", a regex solution would be:
^.*a.*b.*c.*$
You may add more letters as needed.
Alternative regex approach:
^(?=.*a)(?=.*b)(?=.*c).*$
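For the scalability the question asks about, here is a small base R sketch (not from the answers above) that avoids building a regex at all: test each letter separately with grepl() and combine the logical vectors with Reduce().
word_vec <- c("aback", "abase", "abate", "agate", "allay")
# keep the words that contain every letter in letters_wanted
contains_all <- function(letters_wanted, words) {
  hits <- lapply(letters_wanted, grepl, x = words, fixed = TRUE)
  words[Reduce(`&`, hits)]
}
contains_all(c("a", "b"), word_vec)
#> [1] "aback" "abase" "abate"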

Converting list of Characters to Named num in R

I want to create a dataframe with 3 columns.
# First column
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
              "ABC_E1", "ABC_E2", "ABC_E3",
              "ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
These names in column 1 are a bunch of named results of the cor.test function. The second column should consist of the correlation coefficients I get by writing ABC_D1$estimate, ABC_D2$estimate, and so on.
My problem is that I don't want to add the $estimate manually to every single name in the first column. I tried this:
df1$C2 = paste0(df1$C1, '$estimate')
But this doesn't work; it only gives me this back:
"ABC_D1$estimate", "ABC_D2$estimate", "ABC_D3$estimate",
"ABC_E1$estimate", "ABC_E2$estimate", "ABC_E3$estimate",
"ABC_F1$estimate", "ABC_F2$estimate", "ABC_F3$estimate"
class(df1$C2)
[1] "character"
How can I get the numeric result of ABC_D1$estimate into my data frame? How can I convert these characters into a Named num? The third column should consist of the results of $p.value.
As pointed out by @DSGym, there are several problems here, including that it is not very convenient to work with a list of character names; it would be better to have a list of objects instead.
Anyway, I think you can get where you want using:
estimates <- lapply(name_list, function(dat) {
  dat_l <- get(dat)
  dat_l[["estimate"]]
})
cbind(name_list, estimates)
This is not really advisable but given those premises...
OK, I think now I know what you need.
eval(parse(text = paste0("ABC_D1", '$estimate')))
You connect the two strings and use the functions parse and eval to get your results.
This is how to do it for your whole data.frame:
library(purrr)  # map_dbl() comes from purrr

name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
              "ABC_E1", "ABC_E2", "ABC_E3",
              "ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
df1$C2 <- map_dbl(paste0(df1$C1, '$estimate'), function(x) eval(parse(text = x)))
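If eval(parse()) feels uncomfortable, a hedged base R alternative, assuming the cor.test results really do exist as objects with the names given in name_list, is to fetch them with mget() and extract both fields:
# fetch the cor.test objects by name, then pull out the numeric parts
results <- mget(name_list)
df1$C2 <- sapply(results, function(r) r$estimate)  # correlation coefficients
df1$C3 <- sapply(results, function(r) r$p.value)   # p-values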

grep() and sub() and regular expression

I'd like to change the variable names in my data.frame from e.g. "pmm_StartTimev4_E2_C19_1" to "pmm_StartTimev4_E2_C19". So if the name ends with an underscore followed by any number it gets removed.
But I'd like for this to happen only if the variable name has the word "Start" in it.
I've got a muddled up bit of code that doesn't work. Any help would be appreciated!
# Current data frame:
dfbefore <- data.frame(a=c("pmm_StartTimev4_E2_C19_1","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19_2","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Desired data frame:
dfafter <- data.frame(a=c("pmm_StartTimev4_E2_C19","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Current code:
sub((.*{1,}[0-9]*).*","",grep("Start",names(df),value = TRUE)
How about something like this, using gsub():
stripcol <- function(x) {
  gsub("(.*Start.*)_\\d+$", "\\1", as.character(x))
}
dfnew <- dfbefore
dfnew[] <- lapply(dfbefore, stripcol)
We use the regular expression to look for "Start" and then grab everything except the underscore and number at the end. We use lapply to apply the function to all columns.
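As a quick check (my addition, mirroring the comparison used further below), the result can be compared against the desired frame:
dfafter[] <- lapply(dfafter, as.character)
all.equal(dfnew, dfafter, check.attributes = FALSE)
#> [1] TRUE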
doit <- function(x){
  x <- as.character(x)
  if (grepl("Start", x)) {          # only touch values containing "Start"
    x <- gsub("_([0-9])", "", x)    # drop an underscore followed by a digit
  }
  return(x)
}
apply(dfbefore, c(1, 2), doit)
     a                          b
[1,] "pmm_StartTimev4_E2_C19"   "pmm_StartTo_v4_E2_C19"
[2,] "pmm_StartTimev4_E2_E2_C1" "complete_E1_C12_1"
[3,] "delivery_C1_C12"          "pmm_StartTo_v4_E2_C19"
We can use sub to capture a group in which the 'Start' substring is present, followed by an underscore and one or more digits; in the replacement, we use the backreference to the captured group. As there are multiple columns, we loop over them with lapply, apply the sub, and assign the output back to the original data.
out <- dfbefore
out[] <- lapply(dfbefore, sub,
                pattern = "^(.*_Start.*)_\\d+$", replacement = "\\1")
out
dfafter[] <- lapply(dfafter, as.character)
all.equal(out, dfafter, check.attributes = FALSE)
#[1] TRUE

how to separate names using regular expression?

I have a name vector like the following:
vname<-c("T.Lovullo (73-58)","K.Gibson (63-96) and A.Trammell (1-2)","T.La Russa (81-81)","C.Dressen (16-10), B.Swift (32-25) and F.Skaff (40-39)")
Watch out for "T.La Russa", who has a space in his name.
I want to use str_match to separate the names. The difficulty here is that some strings contain two names while others contain only one, as in the example I gave.
I have written my code, but it does not work:
str_match_all(ss,"(D[.]D+.+)s(\\(d+-d+\\))(s(and)s(D[.]D+.+)s(\\(d+-d+\\)))?")
Perhaps this helps:
res <- unlist(strsplit(vname, "(?<=\\))(\\sand\\b\\s)*", perl = TRUE))
res
#[1] "T.Lovullo (73-58)" "K.Gibson (63-96)" "A.Trammell (1-2)" "T.La Russa (81-81)"
To get the names only (if that is what is expected):
sub("\\s*\\(.*", "", res)
#[1] "T.Lovullo" "K.Gibson" "A.Trammell" "T.La Russa"

How to remove common parts of strings in a character vector in R?

Assume a character vector like the following
file1_p1_analysed_samples.txt
file1_p1_raw_samples.txt
f2_file2_p1_analysed_samples.txt
f3_file3_p1_raw_samples.txt
Desired output:
file1_p1_analysed
file1_p1_raw
file2_p1_analysed
file3_p1_raw
I would like to compare the elements and remove as much as possible from the start and end of each string, while keeping the results unique.
The above is just an example. The parts to be removed are not common to all elements, so I need a general solution that does not depend on the strings in the example above.
So far I have been able to chop off parts that are common to all elements, provided the separator and the resulting split parts are of the same length. Here is the function:
mf <- function(x, sep){
  xsplit = strsplit(x, split = sep)
  xdfm <- as.data.frame(do.call(rbind, xsplit))
  res <- list()
  for (i in 1:ncol(xdfm)){
    if (!all(xdfm[, i] == xdfm[1, i])){
      res[[length(res) + 1]] <- as.character(xdfm[, i])
    }
  }
  res <- as.data.frame(do.call(rbind, res))
  res <- apply(res, 2, function(x) paste(x, collapse = "_"))
  return(res)
}
Applying the above function:
1.
a = c("a_samples.txt", "b_samples.txt")
mf(a, "_")
 V1  V2
"a" "b"
2.
b = c("apple.fruit.txt", "orange.fruit.txt")
mf(b, sep = "\\.")
     V1       V2
"apple" "orange"
If the resulting split parts are not of the same length, this doesn't work.
What about
files <- c("file1_p1_analysed_samples.txt", "file1_p1_raw_samples.txt", "f2_file2_p1_analysed_samples.txt", "f3_file3_p1_raw_samples.txt")
new_files <- gsub('_samples\\.txt', '', files)
new_files
... which yields
[1] "file1_p1_analysed" "file1_p1_raw" "f2_file2_p1_analysed" "f3_file3_p1_raw"
This removes the _samples.txt part from your strings.
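If the leading f2_/f3_ prefixes should go as well (as in the desired output), one hedged extension is to remove both pieces in a single gsub with an alternation:
# strip an optional leading f<digits>_ prefix and the _samples.txt suffix
gsub('^f[0-9]+_|_samples\\.txt$', '', files)
#> [1] "file1_p1_analysed" "file1_p1_raw" "file2_p1_analysed" "file3_p1_raw"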
Why not:
strings <- c("file1_p1_analysed_samples.txt",
"file1_p1_raw_samples.txt",
"f2_file2_p1_analysed_samples.txt",
"f3_file3_p1_raw_samples.txt")
sapply(strings, function(x) {
  pattern <- ".*(file[0-9].*)_samples\\.txt"
  gsub(x, pattern = pattern, replacement = "\\1")
})
Things that match between ( and ) can be called back as a group in the replacement via backreferencing. You can do this with \\1, and you can even specify multiple groups.
Seeing your comment on Jan's answer: why not define your static bits, paste them together into a pattern, and always surround the parts you want to keep with parentheses? Then you can always call \\i in the replacement of gsub, as sketched below.
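A minimal sketch of that idea (the pieces below are assumptions about the naming scheme, not taken from the question):
# assumed static bits of the file names; adjust to the real naming scheme
keep <- "file[0-9]+_p[0-9]+_[a-z]+"   # the part to retain, wrapped in () below
drop <- "_samples\\.txt"              # the part to discard
pattern <- paste0(".*(", keep, ")", drop, "$")
gsub(pattern, "\\1", strings)
#> [1] "file1_p1_analysed" "file1_p1_raw" "file2_p1_analysed" "file3_p1_raw"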
