R - extracting multiple patterns from a string using gregexpr

I am working with a dataset where I have a column describing different products. In the product description is also the weight of the product, which is what I'd like to extract. My problem is that some products come in dual-packs, meaning that the description starts with '2x', while the actual weight is at the end of the description. For example:
x = '2x pet food brand 12kg'
What I'd like to do is to shorten this to just 2x12kg.
I'm not great at using regexp in R and was hoping that someone here could help me.
I have tried doing this using gregexpr in the following way:
m <- gregexpr("(^[0-9]+x [0-9]+kg)", x)
Unfortunately this only gives me '12kg', not including the '2x'.
I would appreciate any help at all with this.
EDIT ----
After sorting out my initial problem, I found that there were a few instances in the data with a different format, which I'd also like to extract:
x = 'Pet food brand 15x85g'
# Should be:
x = '15x85g'
I have tried to play around with OR statements in gsub, like:
m <- gsub('^([0-9]+x)?[^0-9]*([0-9.]+kg)|([0-9]+x)?[^0-9]*([0-9.]+g)', '\\1\\2', x)
#And
m <- gsub('^([0-9]+x)?[^0-9]*([0-9.]+(kg|g))', '\\1\\2', x)
While this still extracts the kilos, it only removes the instances with grams and leaves the rest of the string, like:
x = 'Pet food brand '
Or running gsub a second time using:
m <- gsub('([0-9]+x[0-9]+g)', '\\1', x)
The latter option does not extract the product weights at all, and just leaves the string intact.
Sorry for not noticing that the strings were formatted differently earlier. Again, any help would be appreciated.

You could use this regular expression
m = gregexpr("([0-9]+x|[0-9.]+kg)", string, ignore.case = T)
result = regmatches(string, m)
r = paste0(unlist(result),collapse = "")
For string = "2x pet food brand 12kg" you get "2x12kg"
This also works if kilograms have decimals:
For string = "23x pet food 23.5Kg" you get "23x23.5Kg"

(edited to correct a mistake pointed out by @R. Schifini)
You can use regex like this:
x <- '2x pet food brand 12kg'
gsub('^([0-9]+x)?[^0-9]*([0-9]+kg)', '\\1\\2', x)
## "2x12kg"
This would get you the weight even if there is no "2x" in the beginning of the string:
x <- 'pet food brand 12kg'
gsub('^([0-9]+x)?[^0-9]*([0-9]+kg)', '\\1\\2', x)
## "12kg"

Related

R How to Search for String Pattern and Extract Custom character lengths from that location?

I am looking to extract a pattern and then a custom number of characters to the left or right of that pattern. I believe this is possible with Regex but unsure how to proceed. Below is an example of the data and the output I am looking for:
library(data.table)
#my data set
df = data.table(
  event = c(1,2,3),
  notes = c("watch this movie from 4-7pm",
            "watch this musical from 5-9pm",
            "eat breakfast at this place from 7-9am")
)
#how do I point R to a string section and then pull characters around it?
#example:
grepl('pm|am',df$notes) # I can see an index that these keywords exist but how can I tell R
#locate that word and then maybe pull N digits to the left, or n digits to right like substr()
#output would be
#'4-7pm', '5-9pm', '7-9am'
#right now I can extract the pattern:
library(stringr)
str_extract(df$notes, "pm")
#but I also want to then pull things to the left or right of it.
Maybe in your case, just the below should work:
sapply(df$notes, function(x) {
  grep("am|pm", unlist(strsplit(x, " ")), value = T)
}, USE.NAMES = FALSE)
[1] "4-7pm" "5-9pm" "7-9am"
However, this can still fail because of edge cases.
You can also try regex to extract all words ending with am or pm.
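For instance, with stringr (a hedged sketch, assuming the time ranges never contain spaces):
library(stringr)
str_extract(df$notes, "\\S+(am|pm)")
## [1] "4-7pm" "5-9pm" "7-9am"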
Look at stringr to locate the matched characters and build the radius:
stringr::str_locate(df$notes, "am|pm")
     start end
[1,]    26  27
[2,]    28  29
[3,]    37  38
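From there you can widen the window yourself. A hedged sketch (the loc name is just illustrative), assuming the range always spans exactly 3 characters to the left of the am/pm match:
loc <- stringr::str_locate(df$notes, "am|pm")
stringr::str_sub(df$notes, loc[, "start"] - 3, loc[, "end"])
## [1] "4-7pm" "5-9pm" "7-9am"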
Using stringr you could do something like this. With the matrix of locations you could tinker with moving around the radius for whatever you are looking for:
library(stringr)
# Extacting locations
locations <- str_locate(df$notes, "\\d+\\-\\d+pm|\\d+\\-\\d+am")
# Using substring to pull the info you want
str_sub(df$notes, locations)
[1] "12-7pm" "5-9pm" "7-9am"
Data (I swapped out 4 for 12):
df = data.table(
  event = c(1,2,3),
  notes = c("watch this movie from 12-7pm",
            "watch this musical from 5-9pm",
            "eat breakfast at this place from 7-9am")
)

JSON applied over a dataframe in R

I used the code below on one website and it returned a perfect result, looking for the key word 'Emaar' pasted at the end of the query:
library(httr)
library(jsonlite)
query<-"https://www.googleapis.com/customsearch/v1?key=AIzaSyA0KdZHRkAjmoxKL14eEXp2vnI4Yg_po38&cx=006431301429107149113:as7yqcm2qc8&q=Emaar"
result11 <- content(GET(query))
print(result11)
result11_JSON <- toJSON(result11)
result11_JSON <- fromJSON(result11_JSON)
result11_df <- as.data.frame(result11_JSON)
Now I want to apply the same function over a data.frame containing key words, so I made the testing .csv file below:
Company Name
[1] ADES International Holding Ltd
[2] Emirates REIT (CEIC) Limited
[3] POLARCUS LIMITED
I called it Testing Website Extraction.csv.
code used:
test_companies <- read.csv("... \\Testing Website Extraction.csv")
#removing space and adding "+" sign, then pasting query before it (query already has my unique google key and search engine ID)
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)
function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
The result here is a list of length 3 (one per search term), with sublists within each term containing: url (list[2]), queries (list[2]), ... items (list[10]). These are the same for each search term (same lengths separately). My issue here is applying the remainder of the code:
#when i run:
result_JSON <- toJSON(result)
result_JSON <- as.list(fromJSON(result_JSON))
I get a list of 6 lists that have sublists, and putting it into a tidy dataframe where the results are listed under each other (not separately) is proving to be difficult.
Also note that I tried taking each of the 3 separate lists from the "result" list by itself, but that is a lot of manual labor if I have a longer list of keywords.
The expected end result should include 30 observations of 37 variables (for each search term, 10 observations of 37 variables, all underneath each other).
Things I have tried unsuccessfully:
These work to flatten the list:
#do.call(c , result)
#all.equal(listofvectors, res, check.attributes = FALSE)
#unlist(result, recursive = FALSE)
# for (i in 1:length(result)) {listofvectors <- c(listofvectors, result[[i]])}
#rbind()
#rbind.fill()
Even after flattening, I don't know how to organize them into a tidy final output for a non-R user to interact with.
Any help here would be greatly appreciated,
I am here in case anything is not clear about my question,
Always happy to learn more about R so please bear with me as I am just starting to catch up.
All the best and thanks in advance!
Basically what I did was extract only the columns I need from the dataframe list; below is the final code:
library(httr)
library(jsonlite)
library(tidyr)
library(stringr)
library(purrr)
library(plyr)
test_companies <- read.csv("c:\\users\\... Companies Without Websites List.csv")
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)
function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
function_toJSONall <- function(all) {toJSON(all)}
a <- lapply(result, function_toJSONall)
function_fromJSONall <- function(all) {fromJSON(all)}
b <- lapply(a, function_fromJSONall)
function_dataframe <- function(all) {as.data.frame(all)}
c <- lapply(b, function_dataframe)
function_column <- function(all) {all[ ,15:30]}
result_final <- lapply(c, function_column)
results_df <- rbind.fill(result_final)
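For what it's worth, the chain of single-purpose lapply() wrappers can be collapsed into one helper. A hedged sketch (the flatten_results name is hypothetical, and it assumes each response converts to a data frame the same way as above):
flatten_results <- function(responses) {
  dfs <- lapply(responses, function(r) as.data.frame(fromJSON(toJSON(r))))
  rbind.fill(dfs) # pads missing columns with NA, stacking results underneath each other
}
# results_df <- flatten_results(result)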

How do I convert a string to a number in R if the string contains a letter?

I am currently helping a friend with his research and am gathering information about different natural disasters that occurred from 2004-2016. The data can be found using this link:
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/
When you import it into R it gives helpful information; however, my friend, and now I, are only interested in State, Year, Month, Event, Type, County, direct and indirect deaths and injuries, and property damage. So first I am extracting the columns I need, and later in the code I will combine them back together. The data is currently stored as strings, and I need the Property Damage column to be numeric, since it is a cash value. For example, I have a data entry in that column that looks like "8.6k" and I need it as 8600, and all the "NA" entries need to be replaced with a 0.
I have this so far but it gives me back a string of "NA"s. Can anyone think of a better way of doing this?
State<- W2004$STATE
Year<-W2004$YEAR
Month<-W2004$MONTH_NAME
Event<-W2004$EVENT_TYPE
Type<-W2004$CZ_TYPE
County<-W2004$CZ_NAME
Direct_Death<-W2004$DEATHS_DIRECT
Indirect_Death<-W2004$DEATHS_INDIRECT
Direct_Injury<-W2004$INJURIES_DIRECT
Indirect_Injury<-W2004$INJURIES_INDIRECT
W2004$DAMAGE_PROPERTY<-as.numeric(W2004$DAMAGE_PROPERTY)
Damage_Property<-W2004$DAMAGE_PROPERTY
l <- cbind( all the columns up there)
print(l)
We can try using a case when expression here, to map each type of unit to a bona fide number. Going with the two examples you actually showed us:
library(dplyr)
x <- c("1.00M", "8.6k")
result <- case_when(
  grepl("\\d+k$", x) ~ as.numeric(sub("\\D+$", "", x)) * 1000,
  grepl("\\d+M$", x) ~ as.numeric(sub("\\D+$", "", x)) * 1000000,
  TRUE ~ as.numeric(sub("\\D+$", "", x))
)
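To also meet the question's requirement that "NA" entries become 0, one hedged follow-up line should suffice, since anything unparseable (including the literal string "NA") comes out of the case_when() above as NA:
result[is.na(result)] <- 0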
You can extract the letter and use switch() which is easily maintainable, if you want to add additional symbols it is very easy.
First, the setup:
options(scipen = 999) # to prevent R from printing scientific numbers
library(stringr) # to extract letters
This is the sample vector:
numbers_with_letters <- c("1.00M", "8.6k", 50, NA)
Use lapply() to loop through the vector, extract the letter, replace it with a number, remove the letter, convert to numeric, and multiply:
lapply(numbers_with_letters, function(x) {
  letter <- str_extract(x, "[A-Za-z]")
  letter_to_num <- switch(letter,
                          k = 1000,
                          M = 1000000,
                          1) # 1 is the default option if no letter found
  numbers_with_letters <- as.numeric(gsub("[A-Za-z]", "", x))
  # remove all NAs and replace with 0
  numbers_with_letters[is.na(numbers_with_letters)] <- 0
  return(numbers_with_letters * letter_to_num)
})
This returns:
[[1]]
[1] 1000000
[[2]]
[1] 8600
[[3]]
[1] 50
[[4]]
[1] 0
Maybe I'm oversimplifying here, but . . .
library(tidyverse)
data <- tibble(property_damage = c("8.6k", "NA"))
data %>%
  mutate(
    as_number = if_else(
      property_damage != "NA",
      str_extract(property_damage, "\\d+\\.*\\d*"),
      "0"
    ),
    as_number = as.numeric(as_number)
  )
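A hedged extension of the same pipeline (assuming only "k" and "M" suffixes ever occur in the column; the number and mult column names are just illustrative) scales the extracted number by a multiplier looked up from the trailing letter, and zeroes out anything unparseable:
data %>%
  mutate(
    number = as.numeric(str_extract(property_damage, "\\d+\\.*\\d*")),
    mult = case_when(
      str_detect(property_damage, "k$") ~ 1000,
      str_detect(property_damage, "M$") ~ 1000000,
      TRUE ~ 1
    ),
    as_number = coalesce(number * mult, 0) # NA becomes 0, as requested
  )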

NLP - identifying and replacing words (synonyms) in R

I have a problem with some code in R.
I have a data-set (questions) with 4 columns and over 600k observations, of which one column is named 'V3'.
This column has questions like 'what is the day?'.
I have a second data-set (voc) with 2 columns, one named 'word' and the other named 'synonyms'. If a word from the 'synonyms' column of my second data-set (voc) appears in my first data-set (questions), I want to replace it with the corresponding word from the 'word' column.
questions = cbind(V3=c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
                      V3
1 what is the day today?
2     Tom has brown eyes
voc = cbind(word=c("weather", "a","blue"),synonyms=c("day", "the", "brown"))
voc <- data.frame(voc)
     word synonyms
1 weather      day
2       a      the
3    blue    brown
Desired output:
                      V3                       V5
1 what is the day today? what is a weather today?
2     Tom has brown eyes        Tom has blue eyes
I wrote some simple code, but it doesn't work:
for (k in 1:nrow(question))
{
for (i in 1:nrow(voc))
{
question$V5<- gsub(do.call(rbind,strsplit(question$V3[k]," "))[which (do.call(rbind,strsplit(question$V3[k]," "))== voc[i,2])], voc[i,1], question$V3)
}
}
Maybe someone will try to help me? :)
I wrote a second version, but it doesn't work either:
for( i in 1:nrow(questions))
{
for( j in 1:nrow(voc))
{
if (grepl(voc[j,k],do.call(rbind,strsplit(questions[i,]," "))) == TRUE)
{
new=matrix(gsub(do.call(rbind,strsplit(questions[i,]," "))[which(do.call(rbind,strsplit(questions[i,]," "))== voc[j,2])], voc[j,1], questions[i,]))
questions[i,]=new
}
}
questions = cbind(questions,c(new))
}
First, it is important that you use the stringsAsFactors = FALSE option, either at the program level, or during your data import. This is because R defaults to making strings into factors unless you otherwise specify. Factors are useful in modeling, but you want to do analysis of the text itself, and so you should be sure that your text is not coerced to factors.
The way I approached this was to write a function that would "explode" each string into a vector, and then uses match to replace the words. The vector gets reassembled into a string again.
I'm not sure how performant this will be given your 600K records. You might look into some of the R packages that handle strings, like stringr or stringi, since they will probably have functions that do some of this. match tends to be okay on speed, but %in% can be a real beast depending on the length of the string and other factors.
# Start with options to make sure strings are represented correctly
# The rest is your original code (mildly tidied to my own standard)
options(stringsAsFactors = FALSE)
questions <- cbind(V3 = c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
voc <- cbind(word = c("weather","a","blue"),
             synonyms = c("day","the","brown"))
voc <- data.frame(voc)
# This function takes:
# - an input string
# - a vector of words to replace
# - a vector of the words to use as replacements
# It returns a list of the original input and the changed version
uFunc_FindAndReplace <- function(input_string, words_to_repl, repl_words) {
  # Start by breaking the input string into a vector
  # Note that we use [[1]] to get the first list element of strsplit output
  # Obviously this relies on breaking sentences by spacing
  orig_words <- strsplit(x = input_string, split = " ")[[1]]
  # If we find at least one of the words to replace in the original words, proceed
  if (sum(orig_words %in% words_to_repl) > 0) {
    # The left side selects the elements of orig_words that match words to be replaced
    # The right side uses match to find the numeric index of those matches within the words_to_repl vector
    # This numeric vector is used to select the values from repl_words
    # These then replace the values in orig_words
    orig_words[orig_words %in% words_to_repl] <-
      repl_words[match(x = orig_words, table = words_to_repl, nomatch = 0)]
    # We rebuild the sentence again, and return a list with original and new version
    new_sent <- paste(orig_words, collapse = " ")
    return(list(original = input_string, new = new_sent))
  } else {
    # Otherwise we return the original version since no changes are needed
    return(list(original = input_string, new = input_string))
  }
}
# Using do.call and rbind.data.frame, we can collapse the output of a lapply()
do.call(what = rbind.data.frame,
        args = lapply(X = questions$V3,
                      FUN = uFunc_FindAndReplace,
                      words_to_repl = voc$synonyms,
                      repl_words = voc$word))
                original                      new
1 What is the day today? What is a weather today?
2     Tom has brown eyes        Tom has blue eyes
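A hedged alternative worth trying at the 600k-row scale (assuming single-word synonyms, space-separated text, and that word boundaries are what you want): stringr::str_replace_all() accepts a named vector of pattern -> replacement pairs, so the whole loop collapses into one vectorised call.
library(stringr)
# Build "\\bsynonym\\b" -> word pairs; \\b stops "day" matching inside "today"
repl <- setNames(voc$word, paste0("\\b", voc$synonyms, "\\b"))
questions$V5 <- str_replace_all(questions$V3, repl)
questions
##                       V3                       V5
## 1 What is the day today? What is a weather today?
## 2     Tom has brown eyes        Tom has blue eyes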

Text Mining in a string using R

I recently started using R and am a newbie at data analysis.
Is it possible in R to find the number of repetitions of a search string within a single main string?
Example:
Main string: 'abcdefghikllabcdefgllabcd'
and search string: 'lla'
Desired output: 'abcdefghik lla bcdefg lla bcd'
[I tried using the grep() function in R, but it does not work in the desired way: it only gives the number of repetitions of the search string across multiple main strings.]
Thank you in advance.
This works too using regex capture groups:
gsub("(lla)"," \\1 ","abcdefghikllabcdefgllabcd")
Try the gsub() method like this:
main_string <- 'abcdefghikllabcdefgllabcd'
search_string <- 'lla'
output_string <- gsub(search_string, paste(' ', search_string, ' ', sep = ''), main_string)
Your question says that you might want to just COUNT the number of occurrences of the search string in the main string. If that is the case, try this one-liner:
string = "abcdefghikllabcdefgllabcd"
search = 'lla'
( nchar(string) - nchar( gsub(search, "", string)) ) / nchar(search)
#returns 2
string2 = "llaabcdefghikllabcdefgllabcdlla"
( nchar(string2) - nchar( gsub(search, "", string2)) ) / nchar(search)
#returns 4
NOTE: Unit-test your solution for matches at the beginning and end of the string (i.e. make sure it works on 'llaabcdefghikllabcdefgllabcdlla'). I have seen several solutions elsewhere that rely on strsplit() to split on 'lla', but these solutions skip the final 'lla' at the end of the word.
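If stringr is an option, here is a hedged equivalent of the counting arithmetic above, using fixed() so the search string is treated literally rather than as a regex:
library(stringr)
str_count("llaabcdefghikllabcdefgllabcdlla", fixed("lla"))
## returns 4, counting the matches at both ends of the string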
