Hoping to get some guidance: I'm only an occasional analyst and couldn't work out how to handle an expression preceded by a numeric value.
My data is below; I'd like to convert values like "4D" and "5D" into "4 Door" and "5 Door".
a <- c("4D Sedan", "5D Wagon")
b <- c("4 Door Sedan", "5 Door Wagon")
dt <- cbind(a,b)
thanks.
We can use gsub() here, searching for the pattern:
\\b(\\d+)D\\b
and replacing it with:
\\1 Door
Code:
a <- c("4D Sedan", "5D Wagon", "AB4D car 5D")
gsub("\\b(\\d+)D\\b", "\\1 Door", a)
[1] "4 Door Sedan"    "5 Door Wagon"    "AB4D car 5 Door"
Note in the above example that the 4D in AB4D car 5D does not get replaced, nor would we want this to happen. By using word boundaries in \\b(\\d+)D\\b we can avoid unwanted replacements from happening.
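For contrast, here is the same substitution with the word boundaries dropped (a quick illustration, not part of the original answer): the 4D inside AB4D now gets rewritten too.

```r
a <- c("4D Sedan", "5D Wagon", "AB4D car 5D")
# Without \\b, any digits followed by D are replaced, even mid-word
gsub("(\\d+)D", "\\1 Door", a)
# [1] "4 Door Sedan"        "5 Door Wagon"        "AB4 Door car 5 Door"
```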
Related
I have tried to resolve this problem all day but without any improvement.
I am trying to replace the following abbreviations with the corresponding desired words in my dataset:
Abbreviations: USA, H2O, Type 3, T3, bp
Desired words: United States of America, Water, Type 3 Disease, Type 3 Disease, blood pressure
The input data is for example
[1] I have type 3, its considered the highest severe stage of the disease.
[2] Drinking more H2O will make your skin glow.
[3] Do I have T2 or T3? Please someone help.
[4] We don't have this on the USA but I've heard that will be available in the next 3 years.
[5] Having a high bp means that I will have to look after my diet?
The desired output is
[1] i have type 3 disease, its considered the highest severe stage
of the disease.
[2] drinking more water will make your skin glow.
[3] do I have type 3 disease? please someone help.
[4] we don't have this in the united states of america but i've heard that will be available in the next 3 years.
[5] having a high blood pressure means that I will have to look after my diet?
I have tried the following code but without success:
data <- read.csv("C:xxxxxxx", header = TRUE)
lowercase <- tolower(data$MESSAGE)
dict <- list("\\busa\\b" = "united states of america",
             "\\bh2o\\b" = "water",
             "\\btype 3\\b|\\bt3\\" = "type 3 disease",
             "\\bbp\\b" = "blood pressure")
for (i in 1:length(dict1)) {
  lowercasea <- gsub(paste0("\\b", names(dict)[i], "\\b"),
                     dict[[i]], lowercase)
}
I know that I am definitely doing something wrong. Could anyone guide me on this? Thank you in advance.
If you need to replace only whole words (e.g. bp in Some bp. and not in bpcatalogue), you will have to build a regular expression out of the abbreviations using word boundaries and, since you have multiword abbreviations, also sort them by length in descending order (otherwise a shorter alternative such as T3 could be tried before Type 3).
An example code:
abbreviations <- c("USA", "H2O", "Type 3", "T3", "bp")
desired_words <- c("United States of America", "Water", "Type 3 Disease", "Type 3 Disease", "blood pressure")
df <- data.frame(abbreviations, desired_words, stringsAsFactors = FALSE)
x <- 'Abbreviations: USA, H2O, Type 3, T3, bp'
sort.by.length.desc <- function(v) v[order(-nchar(v))]
library(stringr)
str_replace_all(x,
paste0("\\b(",paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b"),
function(z) df$desired_words[df$abbreviations==z][[1]][1]
)
The paste0("\\b(", paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b") call creates a regex like \b(Type 3|USA|H2O|T3|bp)\b, which matches Type 3, USA, etc. as whole words only, since \b is a word boundary. For each match found, stringr::str_replace_all substitutes the corresponding desired_word.
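The same replacements can also be done in base R with a corrected version of the asker's loop. This is a sketch under a few assumptions: the dictionary names already carry their own \b markers (so no extra paste0 wrapping is needed), one consistent dict name is used throughout, and the result accumulates in a single output variable instead of being overwritten each pass.

```r
x <- tolower("Drinking more H2O will make your skin glow.")

dict <- c("\\busa\\b"             = "united states of america",
          "\\bh2o\\b"             = "water",
          "\\btype 3\\b|\\bt3\\b" = "type 3 disease",
          "\\bbp\\b"              = "blood pressure")

out <- x
for (i in seq_along(dict)) {
  # each name is already a complete regex; reuse 'out' so fixes accumulate
  out <- gsub(names(dict)[i], dict[[i]], out)
}
out
# [1] "drinking more water will make your skin glow."
```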
I have the following data frames
df1 <- data.frame(
Description=c("How are you- doing?", "will do it tomorrow otherwise: next week", "I will work hard to complete it for nextr week1 or tomorrow", "I am HAPPY with this situation now","Utilising this approach can helpα'x-ray", "We need to use interseting <U+0452> books to solve the issue", "Not sure if we could do it appropriately.", "The schools and Universities are closed in f -blook for a week", "Things are hectic here and we are busy"))
and I want to get the following table:
d <- data.frame(
Description=c("Utilising this approach can helpa'x-ray", "How are you- doing", " We need to use interseting <U+0452> books to solve the issue ", " will do it tomorrow otherwise: next week ", " Things are hectic here and we are busy ", "I will work hard to complete it for nextr week1 or tomorrow ", "The schools and Universities are closed in f -blook for a week", " I am HAPPY with this situation now "," I will work hard to complete it for nextr week1 or tomorrow"))
f2<- read.table(text="B12 B6 B9
No Yes Yes
12 6 9
No No Yes
No No Yes
No No Yes
Yes No Yes
11 No Yes
12 11 P
No No Yes
", header=TRUE)
df3<-cbind(d,f2)
As you can see in the Description column, there are stray spaces, colons, and so on; the 1 after week is a subscript that I was unable to fix. I want to match df1 with df3 based on "Description". Can we do this in R for this case?
We can use string-distance joins from the fuzzyjoin package to match the data based on 'Description', and na.omit to remove the NA rows from the final data frame.
na.omit(fuzzyjoin::stringdist_left_join(df1, df3, by = 'Description'))
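A small self-contained illustration of the same idea, using tiny stand-ins for the asker's frames (the Value column is hypothetical, added only so the join has something to carry across):

```r
library(fuzzyjoin)

df1 <- data.frame(Description = c("How are you- doing?",
                                  "Things are hectic here and we are busy"),
                  stringsAsFactors = FALSE)
df3 <- data.frame(Description = c("How are you- doing",
                                  " Things are hectic here and we are busy "),
                  Value = c(1, 2),
                  stringsAsFactors = FALSE)

# max_dist controls how different the strings may be (default is 2);
# here a missing "?" and stray spaces are within that tolerance
fuzzyjoin::stringdist_left_join(df1, df3, by = "Description", max_dist = 2)
```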
I am using read.csv on a datapath. It returns a dataframe. I want to be able to get a single value in this dataframe, but instead I get a list of values displaying the levels.
I have tried several ways to access the value I want. In the next part, I will show you what I tried and the results I got.
Here is my simple dataframe:
"OGM","Nutrient","data3"
"tomato","iron",0.03
"domestic cat","iron",0.02
"zebrafish","zing",0.02
"giraffe","nitrate", 0.09
"common cougar","manganese",0.05
"fawn","nitrogen",0.04
"daim","bromure",0.08
"wild cat","iron",0.05
"domestic cat","calcium",0.02
"muren","calcium",0.07
"jaguar","iron",0.02
"green turtle","sodium",0.01
"dave grohl","metal",0.09
"zebra","nitrates",0.12
"tortoise","sodium",0.16
"dinosaur","calcium",0.08
"apex mellifera","sodium",0.15
Here is how I load the data:
#use read.csv on the datapath contained in file
fileData <- read.csv(file[4][[1]])
print(fileData[1][1])
What I want is to access a single value: for example, "tomato" or "nitrate". The result I want is exactly this:
>[1] tomato
Here is what I tried and the result I got:
print(fileData[1][1])
returns
> OGM
>1 tomato
>2 domestic cat
>3 zebrafish
>4 giraffe...
print(fileData$OGM[1])
returns
> [1] tomato
Levels: apex mellifera common cougar daim...
print(fileData[1][[1]])
returns
> [1] tomato domestic cat zebrafish giraffe common cougar [...]
[15] tortoise dinosaur apex mellifera
Levels: apex mellifera common cougar daim...
print(fileData$OGM[[1]])
returns
Levels: apex mellifera common cougar daim...
All apologies for the stupid question, but I'm a bit lost. All help is appreciated. If you want me to edit my post to be more clear, tell me. Thank you.
Some suggestions
Try readr::read_csv rather than read.csv to read in your data. This will get around the stringsAsFactors problem. Or use the approach suggested by Stewart Macdonald.
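As a minimal sketch of both routes (the inline CSV text is just a stand-in for the asker's file):

```r
csv_text <- '"OGM","Nutrient","data3"
"tomato","iron",0.03
"domestic cat","iron",0.02'

# Base R: keep strings as plain characters (also the default since R 4.0)
fileData <- read.csv(text = csv_text, stringsAsFactors = FALSE)
fileData$OGM[1]
# [1] "tomato"

# If the column was already read as a factor, as.character() recovers the label
fileData2 <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as.character(fileData2$OGM[1])
# [1] "tomato"
```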
Once you have the data in, you can manipulate it as follows
# Make a sample dataframe
library(tidyverse)
df <- tribble(~OGM, ~Nutrient, ~data3,
"tomato","iron",0.03,
"domestic cat","iron",0.02,
"zebrafish","zing",0.02,
"giraffe","nitrate", 0.09,
"common cougar","manganese",0.05,
"fawn","nitrogen",0.04,
"daim","bromure",0.08,
"wild cat","iron",0.05,
"domestic cat","calcium",0.02,
"muren","calcium",0.07,
"jaguar","iron",0.02,
"green turtle","sodium",0.01,
"dave grohl","metal",0.09,
"zebra","nitrates",0.12,
"tortoise","sodium",0.16,
"dinosaur","calcium",0.08,
"apex mellifera","sodium",0.15)
df %>%
select(OGM) %>% # select the OGM column
filter(OGM == 'tomato') %>%
pull # convert to a vector
[1] "tomato"
I want to substitute all the strings that have words that repeat themselves one after another with words that have a single occurrence.
My strings go something like that:
text_strings <- c("We have to extract these numbers 12, 47, 48", "The integers numbers are also interestings: 189 2036 314",
"','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456", "We like to to offer you 7890$ per month in order to complete this task... we are joking", "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits.", "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life.", "you can also extract exotic stuff like a456 gb67 and 45678911ghth", "Writing 1 example is not funny, please consider that 66% is validation+testing", "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]", "Who loves arrays more than me?", "{366,78,90,5}Yes, there are only 4 numbers inside", "Integers are fine but sometimes you like 99 cents after the 99 dollars", "100€ are better than 99€", "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]", "Ok ok 1 2 3 4 5 and the last one is 6", "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando")
I tried:
gsub("\b(?=\\w*(\\w)\1)\\w+", "\\w", text_strings, perl = TRUE)
But nothing happened (the output remained the same).
How can I remove the repeating words such as in
text_strings[9]
#[1] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
Thank you!
You can use gsub and a regular expression.
gsub("\\b(\\w+)\\W+\\1", "\\1", text_strings, ignore.case=TRUE, perl=TRUE)
[1] "We have to extract these numbers 12, 47, 48"
[2] "The integers numbers are also interestings: 189 2036 314"
[3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"
[4] "We like to offer you 7890$ per month in order to complete this task... we are joking"
[5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."
[6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
[7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth"
[8] "Writing 1 example is not funny, please consider that 66% is validation+testing"
[9] "You are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
[10] "Who loves arrays more than me?"
[11] "{366,78,90,5}Yes, there are only 4 numbers inside"
[12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"
[13] "100€ are better than 99€"
[14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"
[15] "Ok 1 2 3 4 5 and the last one is 6"
[16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"
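One caveat worth noting, going beyond the asker's sample data: the pattern above collapses one duplicated pair per match, so a word repeated three or more times in a row ("You you you") is only partially cleaned in a single pass. Wrapping the repeated part in a quantified group, and closing with \b so the backreference cannot match a mere prefix, handles any run at once:

```r
x <- "You you you are a genius"
# (\\W+\\1)+ absorbs every consecutive repeat; the trailing \\b stops
# "to" from swallowing the start of "tomato"
gsub("\\b(\\w+)(\\W+\\1)+\\b", "\\1", x, ignore.case = TRUE, perl = TRUE)
# [1] "You are a genius"
```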
I have a regular expression that is able to match my data, using grepl, but I can't figure out how to extract the sub-expressions inside it to new columns.
This is returning the test string as foo, without any of the sub-expressions:
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+)\\s+(\\d*\\:?\\d+\\.\\d+)"
test <- "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00"
m <- regexpr(entryPattern, test)
foo <- regmatches(test, m)
In my real use case, I'm acting on lots of strings similar to test. I'm able to find the correctly formatted ones, so I think the pattern is correct.
rows$isMatch <- grepl(entryPattern, rows$text)
What I'm hoping to do is add the sub-expressions as new columns in the rows dataframe (i.e. rows$rank, rows$name, rows$country, etc.). Thanks in advance for any advice.
It seems that regmatches won't do what I want. Instead, I need the stringr package, as suggested by @kent-johnson.
library(stringr)
test <- "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00"
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+?)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+?)\\s+(\\d*\\:?\\d+\\.\\d+)"
str_match(test, entryPattern)[1,2:8]
Which outputs:
[1] "101"
[2] "POULET Laure"
[3] "FRA"
[4] "1992"
[5] "25-29"
[6] "E. M. S. Bron Natation"
[7] "26.00"
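To turn those captures into the new data-frame columns the question asks for, str_match can be applied to the whole text column at once, since it returns one row per input string with the capture groups in columns 2 onward. A sketch: rank, name, and country are the names mentioned in the question; year, age, club, and time are illustrative guesses, and this assumes every row actually matches the pattern (non-matching rows yield NA).

```r
library(stringr)

entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+?)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+?)\\s+(\\d*\\:?\\d+\\.\\d+)"

rows <- data.frame(
  text = "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00",
  stringsAsFactors = FALSE)

m <- str_match(rows$text, entryPattern)  # matrix: full match, then 7 groups
rows$rank    <- m[, 2]
rows$name    <- m[, 3]
rows$country <- m[, 4]
rows$year    <- m[, 5]
rows$age     <- m[, 6]
rows$club    <- m[, 7]
rows$time    <- m[, 8]

rows[, c("rank", "name", "country", "time")]
```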