Using grep with multiple entries in R to find matching strings

If I have a vector of strings:
dd <- c("sflxgrbfg_sprd_2011","sflxgrbfg_sprd2_2011","sflxgrbfg_sprd_2012")
and want to find the entries with '2011' in the string I can use
ifiles <- dd[grep("2011",dd)]
How do I search for entries with a combination of strings included, without using a loop?
For example, I would like to find the entries with both '2011' and 'sprd' in the string, which in this case will only return
sflxgrbfg_sprd_2011
How can this be done? I could define a variable
toMatch <- c('2011','sprd')
and then loop through the entries but I was hoping there was a better solution?
Note: to make this useful for different strings, is it also possible to determine which entries contain these strings regardless of the order in which they appear? For example, 'sflxlgrbfg_2011_sprd'.

If you want to find more than one pattern, try indexing with a logical vector (from grepl) rather than with the numeric positions returned by grep. That way you can create an "and" condition, where only the strings that contain both patterns are extracted.
ifiles <- dd[grepl("2011",dd) & grepl("sprd_",dd)]

Try
grep('2011_sprd|sprd_2011', dd, value=TRUE)
#[1] "sflxgrbfg_sprd_2011" "sflxlgrbfg_2011_sprd"
Or using an example with more patterns
grep('(?<=sprd_).*(?=2011)|(?<=2011_).*(?=sprd)', dd1,
value=TRUE, perl=TRUE)
#[1] "sflxgrbfg_sprd_2011" "sflxlgrbfg_2011_sprd"
#[3] "sfxl_2011_14334_sprd" "sprd_124334xsff_2011_1423"
data
dd <- c("sflxgrbfg_sprd_2011","sflxgrbfg_sprd2_2011","sflxgrbfg_sprd_2012",
"sflxlgrbfg_2011_sprd")
dd1 <- c(dd, "sfxl_2011_14334_sprd", "sprd_124334xsff_2011_1423")
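If the patterns may appear in any order, one option is to build a single regular expression out of lookaheads, so that each pattern is checked independently from the start of the string. This is only a sketch: the `sprd(_|$)` pattern is an assumption about how the field is delimited (it is there to exclude `sprd2`), and `toMatch`, `pattern` and `res` are illustrative names.

```r
dd1 <- c("sflxgrbfg_sprd_2011", "sflxgrbfg_sprd2_2011", "sflxgrbfg_sprd_2012",
         "sflxlgrbfg_2011_sprd", "sfxl_2011_14334_sprd",
         "sprd_124334xsff_2011_1423")

# Patterns that must all be present, in any order (assumed for illustration)
toMatch <- c("2011", "sprd(_|$)")

# Builds "^(?=.*2011)(?=.*sprd(_|$))": one lookahead per pattern
pattern <- paste0("^", paste0("(?=.*", toMatch, ")", collapse = ""))

# Lookaheads require perl = TRUE
res <- grep(pattern, dd1, value = TRUE, perl = TRUE)
res
#> [1] "sflxgrbfg_sprd_2011"       "sflxlgrbfg_2011_sprd"
#> [3] "sfxl_2011_14334_sprd"      "sprd_124334xsff_2011_1423"
```

Adding another required pattern then only means adding an element to `toMatch`.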

If you want a scalable solution, you can use lapply, Reduce and intersect to:
For each expression in toMatch, find the indices of all matches in dd.
Keep only those indices that are found for all expressions in toMatch.
dd <- c("sflxgrbfg_sprd_2011","sflxgrbfg_sprd2_2011","sflxgrbfg_sprd_2012")
dd <- c(dd, "sflxgrbfh_sprd_2011")
toMatch <- c('bfg', '2011','sprd')
dd[Reduce(intersect, lapply(toMatch, grep, dd))]
#> [1] "sflxgrbfg_sprd_2011" "sflxgrbfg_sprd2_2011"
Created on 2018-03-07 by the reprex package (v0.2.0).

Related

How often does the string repeat in the list of vectors? (in R)

I have a list of vectors with the following structure
[953] "c(\"15768\", \"11999\")"
[954] "c(\"18012\", \"4761\", \"1792\", \"18085\", \"18002\", \"18018\", \"8818\", \"8696\")"
[955] "c(\"735\", \"6073\", \"18007\", \"18046\", \"18087\")"
As you can see, each number is a string. These strings repeat across different vectors. What I need is to find out how often each string occurs across all of the data I have.
I have tried to table, but it doesn't work the way I need.
If it is a vector of single strings, then an option is str_extract_all to extract all the numeric part, unlist and get the table
library(stringr)
tbl <- sort(table(as.numeric(unlist(str_extract_all(vec1,
"\\d+")))), decreasing = TRUE)
Or using base R
sort(table(unlist(regmatches(vec1, gregexpr("\\d+", vec1)))), decreasing = TRUE)
data
vec1 <- c("c(\"15768\", \"11999\")", "c(\"18012\", \"4761\", \"1792\", \"18085\", \"18002\", \"18018\", \"8818\", \"8696\")",
"c(\"735\", \"6073\", \"18007\", \"18046\", \"18087\")")

What is the best way in R to identify the first character in a string?

I am trying to find a way to loop through some data in R that contains both numbers and characters and where the first character is found return all values after. For example:
column
000HU89
87YU899
902JUK8
result
HU89
YU899
JUK8
I have tried stringr::str_detect / grepl, but the value of the first character is by nature unknown, so I am having difficulty pulling it out.
We could use str_extract
stringr::str_extract(x, "[A-Z].*")
#[1] "HU89" "YU899" "JUK8"
data
x <- c("000HU89", "87YU899", "902JUK8")
Ronak's answer is simple.
Though I would also like to provide another method:
column <- c("000HU89", "87YU899", "902JUK8")
# Find the position of the first non-digit character in each element
loc <- regexpr("[^[:digit:]]", column)
# Extract everything from that character to the end of the string
substring(column, loc)
We can use sub from base R to match one or more digits (\\d+) at the start (^) of the string and replace with blank ("")
sub("^\\d+", "", x)
#[1] "HU89" "YU899" "JUK8"
data
x <- c("000HU89", "87YU899", "902JUK8")
In base R we can do
x <- c("000HU89", "87YU899", "902JUK8")
regmatches(x, regexpr("\\D.+", x))
# [1] "HU89" "YU899" "JUK8"
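The base R approaches above behave differently when an element contains no non-digit character at all. A small sketch of that difference, using a made-up vector `x2` (not from the question):

```r
# Second element is hypothetical: digits only, so there is nothing to extract
x2 <- c("000HU89", "12345")

# sub() keeps one result per input element; the all-digit string becomes ""
via_sub <- sub("^\\d+", "", x2)

# regmatches() with regexpr() drops elements that have no match,
# so the result can be shorter than the input
via_regmatches <- regmatches(x2, regexpr("\\D.+", x2))

length(via_sub)         # 2
length(via_regmatches)  # 1
```

If the result has to line up with the original column, for example when assigning back into a data frame, the `sub` version is the safer choice of the two.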

R: How to remove a string containing a specific character pattern?

I'm trying to remove strings that contain a specific character pattern. My data looks something like this:
places <- c("copenhagen", "copenhagens", "Berlin", "Hamburg")
I would like to remove all elements that contain "copenhagen", i.e. "copenhagen" and "copenhagens".
But I was only able to come up with the following code:
library(stringr)
replacement.vector <- c("copenhagen", "copenhagens")
for(i in 1:length(replacement.vector)){
places = lapply(places, FUN=function(x)
gsub(paste0("\\b",replacement.vector[i],"\\b"), "", x))
}
I'm looking for a function that enables me to remove all elements that contain "copenhagen" without having to specify whether or not the element also includes other letters.
Best,
Dose
Based on the OP's code, it seems like we need to subset the 'places'. In that case, it may be better to use grep with invert= TRUE argument
grep("copenhagen", places, invert=TRUE, value = TRUE)
#[1] "Berlin" "Hamburg"
or use grepl and negate (!)
places[!grepl("copenhagen", places)]
#[1] "Berlin" "Hamburg"

Processing files in a particular order in R

I have several datafiles, which I need to process in a particular order. The pattern of the names of the files is, e.g. "Ad_10170_75_79.txt".
Currently they are sorted according to the first numbers (which differ in length), see below:
f <- as.matrix (list.files())
f
[1] "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_1049_25_79.txt" "Ad_10531_77_79.txt"
But I need them to be sorted by the middle number, like this:
> f
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
As I just need the middle number of the filename, I thought the easiest way is, to get rid of the rest of the name and renaming all files. For this I tried using strsplit (plyr).
f2 <- strsplit (f,"_79.txt")
But I'm sure there is a way to sort the files directly, without renaming all files. I tried using sort and to describe the name with regex but without success. This has been a problem for many days, and I spent several hours searching and trying, to solve this presumably easy task. Any help is very much appreciated.
old example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_25_79.txt", "Ad_10531_77_79.txt")
Thank you for your answers. I think I have to modify my example, because the solution should work for all possible middle numbers, regardless of how many digits they have.
new example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_9_79.txt", "Ad_10531_77_79.txt")
Here's a regex approach.
f[order(as.numeric(gsub('Ad_\\d+_(\\d+)_\\d+\\.txt', '\\1', f)))]
# [1] "Ad_1049_9_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
Try this:
f[order(as.numeric(unlist(lapply(strsplit(f, "_"), "[[", 3))))]
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
First we split by _, then select the third element of every list element, find the order and subset f based on that order.
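The same steps, written out one at a time on the question's newer example data (where one middle number has a single digit). This is a sketch; `parts` and `middle` are illustrative names, and `vapply` is swapped in for `unlist(lapply(...))` only to make the expected type explicit.

```r
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
       "Ad_1049_9_79.txt", "Ad_10531_77_79.txt")

# 1. Split each file name on the underscore
parts <- strsplit(f, "_", fixed = TRUE)

# 2. Take the third piece of each name and convert it to a number
middle <- as.numeric(vapply(parts, `[[`, character(1), 3))

# 3. Reorder the file names by that number
f[order(middle)]
#> [1] "Ad_1049_9_79.txt"   "Ad_10170_75_79.txt" "Ad_10345_76_79.txt"
#> [4] "Ad_10531_77_79.txt"
```

Converting to numeric before ordering is what makes 9 sort before 75; ordering the pieces as character strings would place "9" last.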
I would create a small dataframe containing filenames and their respective extracted indices:
f<- c("Ad_10170_75_79.txt","Ad_10345_76_79.txt","Ad_1049_25_79.txt","Ad_10531_77_79.txt")
f2 <- strsplit (f,"_79.txt")
mydb <- as.data.frame(cbind(f,substr(f2,start=nchar(f2)-1,nchar(f2))))
names(mydb) <- c("filename","index")
library(plyr)
arrange(mydb,index)
Take the first column of this as your filename vector.
ADDENDUM:
If a numeric index is required, simply convert character to numeric:
mydb$index <- as.numeric(mydb$index)

R - Using grep and gsub to return more than one match in the same (character) vector element

Imagine we want to find all of the FOOs and subsequent numbers in the string below and return them as a vector (apologies for unreadability, I wanted to make the point there is no regular pattern before and after the FOOs):
xx <- "xasdrFOO1921ddjadFOO1234dakaFOO12345ndlslsFOO1643xasdf"
We can use this to find one of them (taken from 1)
gsub(".*(FOO[0-9]+).*", "\\1", xx)
[1] "FOO1643"
However, I want to return all of them, as a vector.
I've thought of a complicated way to do it using strplit() and gregexpr() - but I feel there is a better (and easier) way.
You may be interested in regmatches:
> regmatches(xx, gregexpr("FOO[0-9]+", xx))[[1]]
[1] "FOO1921" "FOO1234" "FOO12345" "FOO1643"
xx <- "xasdrFOO1921ddjadFOO1234dakaFOO12345ndlslsFOO1643xasdf"
library(stringr)
str_extract_all(xx, "(FOO[0-9]+)")[[1]]
#[1] "FOO1921" "FOO1234" "FOO12345" "FOO1643"
This can take vectors of strings as well, in which case the matches for each string are returned as separate list elements.
Slightly shorter version.
library(gsubfn)
strapplyc(xx,"FOO[0-9]*")[[1]]
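For a vector with more than one string, the base R `regmatches`/`gregexpr` combination returns a list with one element per input string. A short sketch, using a made-up vector `xs`:

```r
# Hypothetical input: two strings with matches, one without
xs <- c("aFOO1bFOO22", "no match here", "FOO333")

# One list element per input string; strings without a match give character(0)
matches <- regmatches(xs, gregexpr("FOO[0-9]+", xs))

lengths(matches)
#> [1] 2 0 1

# Flatten into a single character vector if the grouping is not needed
all_matches <- unlist(matches)
all_matches
#> [1] "FOO1"   "FOO22"  "FOO333"
```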
