I am trying to take all strings that start with a certain few letters and replace them with a different string.
If I have:
y <- "pet"
x <-c("Cat","Cats","Catss")
z=cbind(y,x)
Which gives:
y x
pet Cat
pet Cats
pet Catss
How can I get
y x
pet Cat
pet Cat
pet Cat
gsub(pattern = "^Cat.*", replacement = "Cat", x)
For more complex patterns, take a look at regular expressions.
sub("(Cat).*", "\\1", x)
[1] "Cat" "Cat" "Cat"
This solution uses a backreference: the replacement argument of sub (the second argument) recalls only the capturing group from the pattern argument (the first), here (Cat), but not the endings, thus effectively removing them.
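To make the difference between the anchored and unanchored patterns concrete, here is a small sketch. The extra element "WildCats" is a made-up value added only for illustration; it is not part of the original data.

```r
# "WildCats" is a hypothetical extra value, added to show what the ^ anchor changes
x <- c("Cat", "Cats", "Catss", "WildCats")

# Anchored: only strings that START with "Cat" are rewritten
anchored <- sub("^(Cat).*", "\\1", x)

# Unanchored: "Cat" is matched anywhere, so the tail of "WildCats" is trimmed too
unanchored <- sub("(Cat).*", "\\1", x)

anchored
# [1] "Cat"      "Cat"      "Cat"      "WildCats"
unanchored
# [1] "Cat"     "Cat"     "Cat"     "WildCat"
```

If the goal is strictly "strings that start with these letters", the anchored form is probably the safer choice.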
Related
I have 4 files:
MCD18A1.A2001001.h15v05.061.2020097222704.hdf
MCD18A1.A2001001.h16v05.061.2020097221515.hdf
MCD18A1.A2001002.h15v05.061.2020079205554.hdf
MCD18A1.A2001002.h16v05.061.2020079205717.hdf
And I want to group them by name (date: A2001001 and A2001002) inside a list, something like this:
[[MCD18A1.A2001001.h15v05.061.2020097222704.hdf, MCD18A1.A2001001.h16v05.061.2020097221515.hdf], [MCD18A1.A2001002.h15v05.061.2020079205554.hdf, MCD18A1.A2001002.h16v05.061.2020079205717.hdf]]
I did this using Python, but I don't know how to do with R:
# Separate files by date
MODIS_files_bydate = [list(i) for _, i in itertools.groupby(MODIS_files, lambda x: x.split('.')[1])]
Is this what you are looking for?
g <- sub("^[^\\.]*\\.([^\\.]+)\\..*$", "\\1", s)
split(s, g)
#$A2001001
#[1] "MCD18A1.A2001001.h15v05.061.2020097222704.hdf"
#[2] "MCD18A1.A2001001.h16v05.061.2020097221515.hdf"
#
#$A2001002
#[1] "MCD18A1.A2001002.h15v05.061.2020079205554.hdf"
#[2] "MCD18A1.A2001002.h16v05.061.2020079205717.hdf"
regex explained
The regex is divided in three parts.
^[^\\.]*\\.
^ first circumflex marks the beginning of the string;
^[^\\.] at the beginning, a class negating a dot (the second ^). Outside a character class the dot is a meta-character and, therefore, must be escaped, \\.; inside the brackets the escape is not strictly required;
the sequence with no dots at the beginning repeated zero or more times (*);
the previous sequence ends with a dot, \\..
([^\\.]+) is a capture group.
[^\\.] the class with no dots, like above;
[^\\.]+ repeated at least one time (+).
\\..*$
\\. starting with one dot
\\..*$ any character repeated zero or more times until the end ($).
What sub does is replace the whole match with the contents of the capture group (what is between the parentheses), referenced as \\1. This discards everything else.
Data
s <- "
MCD18A1.A2001001.h15v05.061.2020097222704.hdf
MCD18A1.A2001001.h16v05.061.2020097221515.hdf
MCD18A1.A2001002.h15v05.061.2020079205554.hdf
MCD18A1.A2001002.h16v05.061.2020079205717.hdf"
s <- scan(text = s, what = character())
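If the regex feels heavy, a closer analogue of the Python x.split('.')[1] might be to split each name on literal dots and take the second piece. This is only a sketch; the names date_key and groups are my own, and it assumes every file name has at least two dot-separated fields.

```r
s <- c("MCD18A1.A2001001.h15v05.061.2020097222704.hdf",
       "MCD18A1.A2001001.h16v05.061.2020097221515.hdf",
       "MCD18A1.A2001002.h15v05.061.2020079205554.hdf",
       "MCD18A1.A2001002.h16v05.061.2020079205717.hdf")

# fixed = TRUE treats "." as a literal dot, not as a regex wildcard
parts <- strsplit(s, ".", fixed = TRUE)

# The second field of each file name is the date key (assumed to exist)
date_key <- vapply(parts, function(p) p[2], character(1))

# split() groups the file names by that key into a named list
groups <- split(s, date_key)
names(groups)
# [1] "A2001001" "A2001002"
```

Unlike itertools.groupby, split() does not need the input to be sorted by the key first.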
How would you like the outcome organized?
This is a solution:
files <- c("MCD18A1.A2001001.h15v05.061.2020097222704.hdf",
"MCD18A1.A2001001.h16v05.061.2020097221515.hdf",
"MCD18A1.A2001002.h15v05.061.2020079205554.hdf",
"MCD18A1.A2001002.h16v05.061.2020079205717.hdf")
unique_date <- unique(sub("^[^\\.]*\\.([^\\.]+)\\..*$", "\\1", files))
# (credit to Rui Barradas for the nice regular expression)
grouped_files <- lapply(unique_date, function(x){files[grepl(x, files)]})
names(grouped_files) <- unique_date
> grouped_files
# $A2001001
# [1] "MCD18A1.A2001001.h15v05.061.2020097222704.hdf" "MCD18A1.A2001001.h16v05.061.2020097221515.hdf"
# $A2001002
# [1] "MCD18A1.A2001002.h15v05.061.2020079205554.hdf" "MCD18A1.A2001002.h16v05.061.2020079205717.hdf"
I read about regex and came across word boundaries. I found a question about the difference between \b and \B. Using the code from that question does not give the expected output. Here:
grep("\\bcat\\b", "The cat scattered his food all over the room.", value= TRUE)
# I expect "cat" but it returns the whole string.
grep("\\B-\\B", "Please enter the nine-digit id as it appears on your color - coded pass-key.", value= TRUE)
# I expect "-" but it returns the whole string.
I use the code as described in the question but with two backslashes as suggested here. Using one backslash does not work either. What am I doing wrong?
You can use regexpr and regmatches to extract the match. grep only reports which elements contain a match (with value = TRUE it returns those whole elements). You can also use sub.
x <- "The cat scattered his food all over the room."
regmatches(x, regexpr("\\bcat\\b", x))
#[1] "cat"
sub(".*(\\bcat\\b).*", "\\1", x)
#[1] "cat"
x <- "Please enter the nine-digit id as it appears on your color - coded pass-key."
regmatches(x, regexpr("\\B-\\B", x))
#[1] "-"
sub(".*(\\B-\\B).*", "\\1", x)
#[1] "-"
For more than 1 match use gregexpr:
x <- "1abc2"
regmatches(x, gregexpr("[0-9]", x))
#[[1]]
#[1] "1" "2"
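To see side by side what each function gives back, here is a short sketch on a two-element vector. The second sentence is invented for the example, so that one element has no match.

```r
# The second element is a made-up sentence with no "cat" in it
x <- c("The cat scattered his food.", "No felines here.")

hits_index   <- grep("\\bcat\\b", x)                # positions of matching elements
hits_logical <- grepl("\\bcat\\b", x)               # one TRUE/FALSE per element
hits_value   <- grep("\\bcat\\b", x, value = TRUE)  # the whole matching elements
matched_text <- regmatches(x, regexpr("\\bcat\\b", x))  # only the matched substring

hits_index
# [1] 1
hits_logical
# [1]  TRUE FALSE
matched_text
# [1] "cat"
```

Note that regmatches with regexpr silently drops elements that have no match, so its result can be shorter than the input.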
grep returns the whole string because it only checks whether the match is present in the string. If you want to extract cat, you need to use other functions such as str_extract from the package stringr:
library(stringr)
str_extract("The cat scattered his food all over the room.", "\\bcat\\b")
[1] "cat"
The difference between \\b and \\B is that \\b marks word boundaries whereas \\B is its negation. That is, \\bcat\\b matches only if cat stands as a whole word (bounded by non-word characters such as spaces or punctuation, or by the start or end of the string), whereas \\Bcat\\B matches only if cat is inside a word. For example:
str_extract_all("He forgot his education and scattered his food all over the room.", "\\Bcat\\B")
[[1]]
[1] "cat" "cat"
These two matches are from education and scattered.
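As a quick check of how \\b and \\B behave around punctuation as well as spaces, here is a base R sketch. The test strings are made up, and perl = TRUE is my own choice so that the PCRE engine is used.

```r
# Hypothetical test strings: a bare word, punctuation around it, and two embeddings
x <- c("cat", "(cat)", "cat,dog", "scattered", "education")

# \\b...\\b: "cat" must be delimited by non-word characters or the string ends
whole_word <- grepl("\\bcat\\b", x, perl = TRUE)

# \\B...\\B: "cat" must have word characters on both sides
inside_word <- grepl("\\Bcat\\B", x, perl = TRUE)

whole_word
# [1]  TRUE  TRUE  TRUE FALSE FALSE
inside_word
# [1] FALSE FALSE FALSE  TRUE  TRUE
```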
I would like to count how many words in a given string start with a given letter.
For example, in the phrase: "that pattern is great but pigs likes milk"
if I want to find the number of words starting with "g", there is only 1, "great", but right now I get 2, "great" and "pigs".
This is the code I use:
x <- "that pattern is great but pogintless"
sapply(regmatches(x, gregexpr("g", x)), length)
We need either a space or a word boundary before the letter so that it does not match characters other than the start of a word. In addition, it may be better to use ignore.case = TRUE, as some words may begin with an uppercase letter.
lengths(regmatches(x, gregexpr("\\bg", x, ignore.case = TRUE)))
The above can be wrapped as a function
fLength <- function(str1, pat){
lengths(regmatches(str1, gregexpr(paste0("\\b", pat), str1, ignore.case = TRUE)))
}
fLength(x, "g")
#[1] 1
You can also do it with the stringr library:
library(stringr)
# str_count works on the string directly, so there is no need to split it first
str_count(x, "\\bg")
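If you would rather avoid regex word boundaries, one possible alternative is to split the text into words and test each word with startsWith. This is just a sketch; count_words_starting is a hypothetical helper name, and it assumes words are runs of alphabetic characters.

```r
# Hypothetical helper: count the words in `text` that begin with `letter`
count_words_starting <- function(text, letter) {
  # Split on anything that is not a letter; lower-case so the test ignores case
  words <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
  sum(startsWith(words, tolower(letter)))
}

phrase <- "that pattern is great but pigs likes milk"
count_words_starting(phrase, "g")
# [1] 1
count_words_starting(phrase, "p")
# [1] 2
```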
I have a string variable in a large data set that I want to cleanse based on a set list of strings, e.g. pattern <- c("dog","cat"), but my list will be about 400 elements long.
vector_to_clean == a
black Dog
white dOG
doggie
black CAT
thatdamcat
Then I want to apply a function to yield
new
dog
dog
dog
cat
cat
I have tried str_extract, grep, grepl, etc., but with those I can only match one pattern string at a time. I think what I want is to use lapply with one of these text cleansing functions. Unfortunately, I'm stuck. Below is my latest attempt. Thank you for your help!
new <- vector()
lapply(pattern, function(x){
where<- grep(x,a,value = FALSE, ignore.case = TRUE)
new[where]<-x
})
We paste the 'pattern' vector together to create a single string, use that to extract the words from 'vec1' after we change it to lower case (tolower(vec1)).
library(stringr)
str_extract(tolower(vec1), paste(pattern, collapse='|'))
#[1] "dog" "dog" "dog" "cat" "cat"
data
pattern <- c("dog","cat")
vec1 <- c('black Dog', 'white dOG', 'doggie','black CAT', 'thatdamcat')
Another way using base R is:
#data
vec <- c('black Dog', 'white dOG', 'doggie','black CAT','thatdamcat')
#regexpr finds the locations of cat and dog ignoring the cases
a <- regexpr( 'dog|cat', vec, ignore.case=TRUE )
#regmatches returns the above locations from vec (here we use tolower in order
#to convert to lowercase)
regmatches(tolower(vec), a)
[1] "dog" "dog" "dog" "cat" "cat"
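One thing to watch in a large data set: regmatches drops elements that match none of the patterns, so the result can end up shorter than the input. Below is a sketch that keeps the result aligned with the input by filling NA for non-matches. The "goldfish" entry is invented to show the no-match case.

```r
pattern <- c("dog", "cat")
# "goldfish" is a made-up value that matches neither pattern
vec <- c("black Dog", "white dOG", "goldfish", "black CAT")

# Join all patterns into one alternation: "dog|cat"
regex <- paste(pattern, collapse = "|")
m <- regexpr(regex, vec, ignore.case = TRUE)

# Start with NA everywhere, then fill only the positions that matched
cleaned <- rep(NA_character_, length(vec))
cleaned[m > 0] <- tolower(regmatches(vec, m))
cleaned
# [1] "dog" "dog" NA    "cat"
```

This assumes the patterns are plain words; if any of the 400 entries contain regex metacharacters, they would need escaping first.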
I have a dataframe comprising two columns of words. For each row I'd like to identify any letters that occur only in the word in the second column, e.g.
carpet carpelt #return 'l'
bag flag #return 'f' & 'l'
dog dig #return 'i'
I'd like to use R to do this automatically as I have 6126 rows.
As an R newbie, the best I've got so far is this, which gives me the unique letters across both words (and is obviously very clumsy):
x<-(strsplit("carpet", ""))
y<-(strsplit("carpelt", ""))
z<-list(l1=x, l2=y)
unique(unlist(z))
Any help would be much appreciated.
The function you’re searching for is setdiff:
chars_for = function (str)
strsplit(str, '')[[1]]
result = setdiff(chars_for(word2), chars_for(word1))
(Note the inverted order of the arguments in setdiff.)
To apply it to the whole data.frame, called x:
apply(x, 1, function (words) setdiff(chars_for(words[2]), chars_for(words[1])))
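As a sketch of how this could look on an actual data frame, here is a version using Map over the two columns. The column names first and second and the new column extra are assumptions of mine, not names from the question.

```r
chars_for <- function(str) strsplit(str, "")[[1]]

# Hypothetical column names; the real data frame may differ
words <- data.frame(first  = c("carpet", "bag", "dog"),
                    second = c("carpelt", "flag", "dig"),
                    stringsAsFactors = FALSE)

# For each row: letters of the second word that are absent from the first
extra <- Map(function(a, b) setdiff(chars_for(b), chars_for(a)),
             words$first, words$second)

# Collapse each result into one string so it fits in a column
words$extra <- vapply(extra, paste, character(1), collapse = "")
unname(words$extra)
# [1] "l"  "fl" "i"
```

Map walks the two columns in parallel, which avoids the row-to-character-vector conversion that apply performs.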
Use regex :) Paste your word with brackets [] and then use replace function for regex. This regex finds any letter from those in brackets and replaces it with empty string (you can say that it "removes" these letters).
require(stringi)
x <- c("carpet","bag","dog")
y <- c("carplet", "flag", "smog")
pattern <- stri_paste("[",x,"]")
pattern
## [1] "[carpet]" "[bag]" "[dog]"
stri_replace_all_regex(y, pattern, "")
## [1] "l" "fl" "sm"
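The same character-class idea can be sketched in base R with gsub, here using the words from the question. It assumes the words contain only plain letters; characters such as ], ^ or - would change the meaning of the bracket expression.

```r
x <- c("carpet", "bag", "dog")
y <- c("carpelt", "flag", "dig")

# One character class per word in x, e.g. "[carpet]"
pattern <- paste0("[", x, "]")

# gsub is not vectorised over the pattern, so pair each pattern with its word
only_in_y <- unname(mapply(function(p, s) gsub(p, "", s), pattern, y))
only_in_y
# [1] "l"  "fl" "i"
```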
x <- c("carpet","bag","dog")
y <- c("carpelt", "flag", "dig")
Following (somewhat) with what you were going for with strsplit, you could do
> sx <- strsplit(x, "")
> sy <- strsplit(y, "")
> lapply(seq_along(sx), function(i) sy[[i]][ !sy[[i]] %in% sx[[i]] ])
#[[1]]
#[1] "l"
#
#[[2]]
#[1] "f" "l"
#
#[[3]]
#[1] "i"
This uses %in% to logically match the characters in y with the characters in x. I negate the matching with ! to determine those characters that are in y but not in x.
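One difference from setdiff worth knowing: the %in% approach keeps repeated letters, while setdiff returns each letter once. Here is a small sketch with a made-up pair of words ("bag" and "fluff") chosen to show this.

```r
# Hypothetical word pair with a repeated letter in the second word
sx <- strsplit("bag", "")[[1]]
sy <- strsplit("fluff", "")[[1]]

kept_dups <- sy[!sy %in% sx]   # keeps every occurrence
deduped   <- setdiff(sy, sx)   # each letter only once

kept_dups
# [1] "f" "l" "u" "f" "f"
deduped
# [1] "f" "l" "u"
```

Which one is preferable depends on whether repeated letters should be reported once or every time they occur.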