I am pre-processing a data frame with 100,000+ blog URLs, many of which contain content from the blog header. The grep function lets me drop many of those URLs because they pertain to archives, feeds, images, attachments or a variety of other reasons. One of them is that they contain “atom”.
For example,
string <- "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/"
row <- "one"
df <- data.frame(row, string)
df$string <- as.character(df$string)
df[-grep("atom", string), ]
My problem is that the pattern “atom” might appear in a blog header, which is important content, and I do not want to drop those URLs.
How can I concentrate the grep on only the final 20 characters (or some number that greatly reduces the risk that I will grep out content that contains the pattern rather than the ending elements)? The question "Regular Expressions _# at end of string" uses $ at the end but is not using R; besides, I don't know how to extend the $ back 20 characters.
Assume that the pattern does not always have forward slashes on either or both ends, e.g., /atom/.
The function substr can isolate the end portion of the strings, but I don’t know how to grep only within that portion. The pseudo-code below draws on the %in% function to try to illustrate what I would like to do.
substr(df$string, nchar(df$string)-20, nchar(df$string)) # extracts the end of each string: from position nchar - 20 through the last character (21 characters)
But what is the next step?
string[-grep(pattern = "atom" %in% (substr(string, nchar(string)-20, nchar(string))), x = string)]
Thank you for your guidance.
lastpart <- substr(df$string, nchar(df$string)-20, nchar(df$string))
if (length(grep("atom", lastpart)) > 0) {
  # atom was in there
} else {
  # atom was not in there
}
You could also do it without lastpart:
if (length(grep("atom", substr(df$string, nchar(df$string)-20, nchar(df$string)))) > 0) {
  # atom was in there
} else {
  # atom was not in there
}
but that becomes harder to read (it may perform slightly better, though).
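To drop rows rather than test the whole column at once, the same substr() idea can be applied row by row with grepl(). The following is a minimal sketch, not part of the answer above: the second URL, the window size n_tail, and the variable names are assumptions made for illustration.

```r
# Two example URLs: the second contains "atom" only in the post title
urls <- c(
  "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/",
  "http://www.example.com/2014/05/update-on-atomic-energy-legislation/"
)
df <- data.frame(row = c("one", "two"), string = urls, stringsAsFactors = FALSE)

n_tail <- 20  # assumed window size; adjust to taste

# Last n_tail characters of every URL (pmax guards against short strings)
tail_part <- substr(df$string,
                    pmax(nchar(df$string) - n_tail + 1, 1),
                    nchar(df$string))

# One TRUE/FALSE per row: does the tail contain "atom"?
has_atom_in_tail <- grepl("atom", tail_part, fixed = TRUE)

# Keep only the rows whose tail does NOT contain the pattern
kept <- df[!has_atom_in_tail, ]

# Regex-only alternative: "atom" followed by at most n_tail - 4 characters
# before the end of the string
has_atom_regex <- grepl(paste0("atom.{0,", n_tail - 4, "}$"), df$string)
```

Indexing with a logical vector from grepl() also avoids a pitfall of df[-grep(...), ]: when nothing matches, grep() returns an empty vector and the negative index selects zero rows.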
You could try using a URL component depth approach (i.e. only return df rows which contain the word "atom" after 5 slashes):
find_first_match <- function(string, pattern) {
  components <- unlist(strsplit(x = string, split = "/", fixed = TRUE), use.names = FALSE)
  matches <- grepl(pattern = pattern, x = components)
  if (any(matches)) {
    # which.max() on a logical vector returns the index of the first TRUE
    first.match <- which.max(matches)
  } else {
    first.match <- NA
  }
  return(first.match)
}
Which can be used as follows:
# Add index for first component match of "atom" in url
df$first.match <- sapply(df$string, find_first_match, pattern = "atom", USE.NAMES = FALSE)
# Return rows which have the word "atom" only after the first 5 components
df[which(df$first.match >= 6), ]
# row string first.match
# 1 one http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/ 6
This gives you control over which URLs to return based on the depth at which "atom" first appears.
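To see what "component depth" means for the example URL, it can help to look at the pieces that strsplit() produces. This is a small illustrative sketch; the helper name first_component_match is hypothetical and not taken from the answer.

```r
url <- "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/"

# Split the URL on "/"; the empty second element comes from the "//" after "http:"
components <- strsplit(url, "/", fixed = TRUE)[[1]]

# Positions of every component that contains "atom"
atom_positions <- which(grepl("atom", components, fixed = TRUE))

# Hypothetical helper: index of the first matching component, NA if none
first_component_match <- function(url, pattern) {
  parts <- strsplit(url, "/", fixed = TRUE)[[1]]
  hits <- which(grepl(pattern, parts, fixed = TRUE))
  if (length(hits) > 0) hits[1] else NA_integer_
}

first_component_match(url, "atom")  # 6
first_component_match(url, "rss")   # NA
```

Here component 6 is the post title ("update-on-atomic-energy-legislation") and component 8 is the "atom" feed segment, so the cutoff has to be chosen with the site's URL layout in mind.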
I chose the second answer because it is easier for me to understand and because with the first one it is not possible to predict how many forward slashes to include in the “component depth”.
Read from the innermost function outward, the second answer says:
Define your substring, the final 20 characters of your string, with the substr() function;
then find whether the pattern “atom” is in that substring with the grep() function;
then check whether “atom” was found at least once in the substring, i.e., whether the result of grep() has length greater than zero, in which case that row will be omitted;
finally, if the pattern is not matched, i.e., no “atom” is found in the final 20 characters, leave the row alone; all of this is done with the if…else construct.
I am trying to get to grips with the world of regular expressions in R.
I was wondering whether there was any simple way of combining the functionality of "grep" and "gsub"?
Specifically, I want to append some additional information to anything which matches a specific pattern.
For a generic example, let's say I have a character vector:
char_vec <- c("A","A B","123?")
Then let's say I want to append the following string to every letter within any element of char_vec:
append <- "_APPEND"
Such that the result would be:
[1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Clearly a gsub can replace the letters with append, but this does not keep the original expression (while grep would return the letters but not append!).
Thanks in advance for any / all help!
It seems you are not familiar with backreferences that you may use in the replacement patterns in (g)sub. Once you wrap a part of the pattern with a capturing group, you can later put this value back into the result of the replacement.
So, a mere gsub solution is possible:
char_vec <- c("A","A B","123?")
append <- "_APPEND"
gsub("([[:alpha:]])", paste0("\\1", append), char_vec)
## => [1] "A_APPEND" "A_APPEND B_APPEND" "123?"
See this R demo.
Here, ([[:alpha:]]) matches and captures into Group 1 any letter and \1 in the replacement reinserts this value into the result.
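The effect of the capturing group is easiest to see side by side. The sketch below reuses char_vec from the question; the third call, which appends once per run of letters, is an extra illustration and not something the question asked for.

```r
char_vec <- c("A", "A B", "123?")

# Without a capturing group the matched letter is simply replaced (and lost)
no_backref <- gsub("[[:alpha:]]", "_APPEND", char_vec)
# "_APPEND" "_APPEND _APPEND" "123?"

# With group 1 captured, "\\1" puts the matched letter back before the suffix
with_backref <- gsub("([[:alpha:]])", "\\1_APPEND", char_vec)
# "A_APPEND" "A_APPEND B_APPEND" "123?"

# Adding "+" captures whole runs of letters, so the suffix is added once per word
word_level <- gsub("([[:alpha:]]+)", "\\1_APPEND", c("AB CD", "123?"))
# "AB_APPEND CD_APPEND" "123?"
```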
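Definitely not as slick as @Wiktor Stribiżew's answer, but here is another method I developed.
char_vars <- c('a', 'b', 'a b', '123')
grep('[A-Za-z]', char_vars)
gregexpr('[A-Za-z]', char_vars)
# list holding the letters found in each element
matches <- regmatches(char_vars, gregexpr('[A-Za-z]', char_vars))
for (i in seq_along(matches)) {
  for (found in matches[[i]]) {
    # sub() replaces only the first occurrence of the letter, so repeated letters,
    # or letters that also occur in an already added "_append", are not handled correctly
    char_vars[i] <- sub(pattern = found,
                        replacement = paste(found, "_append", sep = ""),
                        x = char_vars[i])
  }
}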
I have a question about how to write a loop in R that checks whether a certain expression occurs in a string. I want to check whether the expression “i-sty” occurs in my variable for each i in 1:200 and, if it does, return the corresponding i.
For example, if we have “4-sty” the loop should give me 4, and if there is no “i-sty” in the variable it should give me . for that observation.
I used
for (i in 1:200){
datafram$height <- ifelse(grepl("i-sty", dataframe$Description), i, ".")
}
But it did not work: I only get dots. A picture of the string variable is attached.
"i-sty" is just a string with the letter i in it. To use a regex pattern with your variable i, you need to paste together a string, e.g., grepl(paste0(i, "-sty"), ...). I'd also recommend using NA rather than "." for the "else" result - that way the resulting height variable can be numeric.
for (i in 1:200){
  dataframe$height <- ifelse(grepl(paste0(i, "-sty"), dataframe$Description), i, NA)
}
The above works syntactically, but not logically. You also have a problem that you are overwriting height each time through the loop - when i is 2, you erase the results from when i is 1, when i is 3, you erase the results from when i is 2... I think a better approach would be to extract the match, which is easy using stringr (but also possible in base). As a benefit, with the right pattern we can skip the loop entirely:
library(stringr)
dataframe$height = str_match(string = dataframe$Description, pattern = "([0-9]+)-sty")[, 2]
# column 2 holds the digits captured by the parentheses; might want to wrap in `as.numeric`
You use both datafram and dataframe. I've assumed dataframe is correct.
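The answer mentions that the extraction is also possible in base R. One way this might look is sketched below with regexpr() and regmatches(); the description vector is made-up example data, since the real column is only shown in a picture.

```r
# Made-up example values standing in for dataframe$Description
description <- c("4-sty building", "no storeys listed", "12-sty tower")

# Locate digits that are immediately followed by "-sty" (lookahead needs perl = TRUE);
# regexpr() returns -1 for elements without a match
hit <- regexpr("[0-9]+(?=-sty)", description, perl = TRUE)

# Start with NA everywhere, then fill in the rows that matched
height <- rep(NA_real_, length(description))
height[hit > 0] <- as.numeric(regmatches(description, hit))

height  # 4 NA 12
```

regmatches() only returns the matched elements, which is why the result is assigned into the positions where hit > 0.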
For fellow DnD fans: I recently found the Ring of the Grammarian, so I am trying to write a quick script that generates a list of sensible words by swapping letters in an input string. For example, I want to input "mage hand" and have the program return a list or data frame which reads:
cage hand
...yada yada ...
mage band
mage land
...yada yada ...
mage bang
So far, I've only gotten as far as this:
dictionary<-data.frame(DICTIONARY)
spell.suggester<-function(x){
  for (i in 1:nchar(x)) {
    for (k in 1:length(letters)) {
      res1<-gsub(pattern = x[i] ,replace = letters[k], x)
      res2<-grep("\\bres1\\b",dictionary[,1],value = F)
      if_else(res2>1,print(grep("\\bres1\\b",dictionary[,1],value = T)),"nonsense")
      return()
    }
  }
}
spell.suggester(x = "mage hand")
but I end up with an error message which reads
character(0)
NULL
I haven't found any answers on stack using R. Could someone please help me with some suggestions and guidance?
Your major problem here is that you're trying to index each letter of a string, and R doesn't like letting you do that - it treats a string as a whole value, so attempting to index the letters fails.
To fix that, you can use strsplit to turn a string into a vector of individual characters that you can index as normal.
Your second issue is that the dictionary search seems a bit over-complicated; you can use %in% to check if a value is present in a vector.
The code below shows a minimal example of how to do this; it only works with single words, and relies on you having a decent dictionary to check valid words against.
# minimal example of valid word list
dictionary <- c("vane", "sane", "pane", "cane",
                "bone", "bans", "bate", "bale")
spell.suggester <- function(spell) {
  # Split spell into a vector of single characters
  spell_letters <- strsplit(spell, "")[[1]]
  # Once for each letter in spell
  for (i in seq_along(spell_letters)) {
    # Once for each letter in letters
    for (k in seq_along(letters)) {
      # If the letter isn't a space
      if (spell_letters[i] != " ") {
        # Create a new word by changing only the letter at position i
        # (gsub would replace every occurrence of that letter)
        word <- paste(replace(spell_letters, i, letters[k]), collapse = "")
        # If the word is in the list of valid words
        if (word %in% dictionary) {
          # print the possibility
          print(word)
        }
      }
    }
  }
}
spell.suggester(spell = "bane")
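For multi-word spells such as "mage hand", one possible extension is to swap a single letter at a time and then check every word of the resulting phrase against the dictionary. This is only a sketch under assumptions: the function name suggest_spells and the tiny dictionary are invented for illustration, and a real word list would be needed in practice.

```r
# Hypothetical helper: returns (rather than prints) all one-letter variants
# of `spell` in which every space-separated word is in `dictionary`
suggest_spells <- function(spell, dictionary) {
  chars <- strsplit(spell, "")[[1]]
  out <- character(0)
  for (i in seq_along(chars)) {
    if (chars[i] == " ") next             # leave spaces untouched
    for (l in setdiff(letters, chars[i])) {
      # Replace only the character at position i
      candidate <- paste(replace(chars, i, l), collapse = "")
      words <- strsplit(candidate, " ", fixed = TRUE)[[1]]
      if (all(words %in% dictionary)) out <- c(out, candidate)
    }
  }
  unique(out)
}

# Tiny made-up dictionary, just enough for the example
dictionary <- c("mage", "cage", "hand", "band", "land", "hang")
suggestions <- suggest_spells("mage hand", dictionary)
suggestions
# "cage hand" "mage band" "mage land" "mage hang"
```

Returning a character vector instead of printing makes it easy to put the result in a data frame afterwards, e.g. data.frame(spell = suggestions).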
I want to assign names to list elements. To do so, I execute the following code:
file_names <- gsub("\\..*", "", doc_csv_names)
print(file_names)
"201409" "201412" "201504" "201507" "201510" "201511" "201604" "201707"
names(docs_data) <- file_names
In this case the name of the list element appears wrapped in backticks (``).
docs_data$`201409`
However, in this case the name of the list element appears in the following way:
names(docs_data) <- paste("name", 1:8, sep = "")
docs_data$name1
How can I convert the gsub() result to get the latter naming pattern, without the backticks?
gsub() and paste() seem to produce objects of the same class(). What is the difference?
Both gsub and paste return character objects. They are different because they are completely different functions, which you seem to know based on their usage (gsub replaces instances of your pattern with a desired output in a string of characters, while paste just... pastes).
As for why you get the quotations, that has nothing to do with gsub and everything to do with the fact that you are naming variables/columns with numbers. Indeed, try
names(docs_data) <- paste(1:8)
and you'll realize you have the same problem when invoking the naming pattern. It basically has to do with the fact that R doesn't want to be confused about whether a number is really a number or a variable, because that would be chaos (how can 1 refer to a variable and also the number 1?). So what it does in such cases is treat the name as the character "1" rather than the number 1, and a character can be used as a name. For example, note that
> 1 <- 3
Error in 1 <- 3 : invalid (do_set) left-hand side to assignment
> "1" <- 3 #no problem!
So R is basically correcting that for you! This is not a problem when you name something using characters. Finally, an easy fix: just add a character in front of the numbers of your naming pattern, and you'll be able to invoke them without the quotations. For example:
file_names <- paste("file_",gsub("\\..*", "", doc_csv_names),sep="")
Should do the trick (or just change the "file_" into whatever you want, as long as it's not empty, because then you just have numbers left and the same problem)!
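If the numeric names need to stay as they are, the elements can still be reached without backticks through [[ ]], and base R's make.names() shows what a syntactic version of such names looks like. The list contents below are placeholders, since the real docs_data is not shown.

```r
# Placeholder list standing in for docs_data
docs_data <- list("first file", "second file")
names(docs_data) <- c("201409", "201412")

# Numeric-looking names work fine with [[ ]] and a quoted string
by_string <- docs_data[["201409"]]

# With $ the name must be wrapped in backticks because it is not syntactic
by_backtick <- docs_data$`201409`

# make.names() turns the names into syntactic ones by prefixing "X"
syntactic <- make.names(names(docs_data))
names(docs_data) <- syntactic
by_dollar <- docs_data$X201409
```

Prefixing with paste(), as in the answer above, gives the same effect with a prefix of your own choosing.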