String processing in R (find and replace)

This is an example of my data.frame
no string
1 abc&URL_drf
2 abcdef&URL_efg
I need to replace the pattern *&URL (everything up to and including &URL) with "". So I need this result:
no string
1 _drf
2 _efg
In Excel, I can easily get this result by using '*&URL' in the 'find and replace' function. However, I cannot find a similarly effective method in R.
My approach in R is below: first, I split the string using strsplit(df$string, "&URL") and then selected the second element. I don't think that is an effective way. A sketch of what I mean:
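# a sketch of my split-then-select approach; `[` picks the 2nd piece
parts <- strsplit(df$string, "&URL")
df$string <- sapply(parts, `[`, 2)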
Is there a more effective method?

# data
df <- read.table(text="no string
1 abc&URL_drf
2 abcdef&URL_efg", header=T, as.is=T)
# `gsub` substitutes the unwanted string with nothing, hence the "".
# The pattern describing the unwanted string is written as a regular
# expression.
df$string <- gsub("[a-z]+(&URL)", "", df$string)
# you get
no string
1 1 _drf
2 2 _efg

I suggest you use the grep function.
grep takes your regex as the first argument and the input vector as the second. If you pass value=TRUE, grep returns a vector with copies of the actual elements of the input vector that are (partially) matched.
So in your case:
grep("[a-z]+(&URL)", df$string, perl=TRUE, value=TRUE)
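On the example data both elements match, so this returns:
# [1] "abc&URL_drf"    "abcdef&URL_efg"
Note that grep only finds matching elements; to actually strip the prefix you still need sub() or gsub() as shown above.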

Another approach:
df <- transform(df, string = sub(".*&URL", "", string))
# no string
# 1 1 _drf
# 2 2 _efg

Related

How to count occurrence(s) of each character in a string and encode output as [character] [number] [character] [number] format?

I want to count the occurrence(s) of each character in the following string and output the result in [character] [number] [character] [number] format, lowercase and sorted by character.
"Hello World !"
d 1 e 1 h 1 l 3 o 2 r 1 w 1 ! 1
Solution in base R. First we split the string using strsplit() with the empty string "" as the split, and unlist() and sort() the result at the same time. For good measure, we define a vector of characters we do not want in the output; it seems you don't count the space " " as a character, so it has to be filtered out. Next, a couple of nested paste0() calls: use lapply() on the unique characters to count each occurrence, paste each count together with its character (and an equals sign), and finally collapse the resulting strings with commas.
Here is the code:
example_string = "Hello World!"
unwanted_chars = c(" ")
# lowercase first, since the question asks for lowercase output
split_string = sort(unlist(strsplit(tolower(example_string), "")))
# drop unwanted characters (note the negation)
split_string = split_string[!(split_string %in% unwanted_chars)]
paste0(paste0(unique(split_string), " = ",
unlist(lapply(unique(split_string),
function(x) sum(x == unlist(split_string))))),
collapse = ", ")
The long line looks a bit ugly; if you want to understand it, unwrap it layer by layer.
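For instance, a sketch of the same call unwrapped into named steps (the intermediate names are mine):
uniq <- unique(split_string)
counts <- sapply(uniq, function(x) sum(split_string == x))
pairs <- paste0(uniq, " = ", counts)
paste0(pairs, collapse = ", ")
# [1] "! = 1, d = 1, e = 1, h = 1, l = 3, o = 2, r = 1, w = 1"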
@Calbers is right that strsplit is the way to go, but it can be achieved using a one-liner:
examples = c("Hello World!", "Lorem ipsum")
lapply(strsplit(examples, ""), table)
i.e. split each element of examples using the empty split "" and then lapply the table function over the result. With a single example it is:
table(strsplit(examples, "")[[1]])
i.e. we take the first element from the list outputted by strsplit.
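If you want the exact [character] [number] format from the question, here is a minimal sketch building on table() (note that "!" sorts before the letters, unlike the sample output):
s <- tolower(gsub(" ", "", "Hello World !"))
tab <- table(strsplit(s, "")[[1]])
paste(names(tab), tab, collapse = " ")
# [1] "! 1 d 1 e 1 h 1 l 3 o 2 r 1 w 1"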

Using regular expression in string replacement

I have a broken csv file that I am attempting to read into R and repair using a regular expression.
The reason it is broken is that some fields contain a comma but are not wrapped in double quotes, so I have to use a regular expression to find those fields and wrap them in double quotes.
Here is an example of the data source:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00
So you can see that in the third row, the Price field contains a comma but it is not wrapped in double quotes. This breaks the read.table function.
My approach is to use readLines and str_replace_all to wrap the comma-containing prices in double quotes, but I am not good at regular expressions and am stuck.
vector <- readLines(file)
vector_temp <- str_replace_all(vector, ",\\$[0-9]+,\\d{3}\\.\\d{2}", ",\"\\$[0-9]+,\\d{3}\\.\\d{2}\"")
I want the output to be:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,"$1,250.00"
With this format, I can read into R.
Appreciate any help!
lines <- readLines(textConnection(object="DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00"))
library(stringi)
library(tidyverse)
stri_split_regex(lines, ",", n=3, simplify=TRUE) %>%
as_data_frame() %>%
docxtractr::assign_colnames(1)
## DataField1 DataField2 Price
## 1 ID1 Value1
## 2 ID2 Value2 $500.00
## 3 ID3 Value3 $1,250.00
From there you can readr::write_csv() or write.csv() the result back out, for example:
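# assuming the repaired tibble above was saved as df_fixed
readr::write_csv(df_fixed, "fixed.csv")
# or, in base R
write.csv(df_fixed, "fixed.csv", row.names = FALSE)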
The extra facilities in the stringi or stringr packages do not seem needed here; gsub is perfectly suited for this. You just need to understand capture groups, written with paired parentheses (round brackets to Brits), and the double-backslash convention (\\1, \\2, ...) for referring to capture-group matches in the replacement argument:
txt <- "DataField1,DataField2,Price, extra
ID1,Value1, ,
ID2,Value2,$500.00,
ID3,Value3,$1,250.00, o"
vector <- gsub("([$][0-9]{1,3}([,]([0-9]{3})){0,10}([.][0-9]{0,2}))", "\"\\1\"", readLines(textConnection(txt)))
> read.csv(text=vector)
DataField1 DataField2 Price extra
1 ID1 Value1
2 ID2 Value2 $500.00
3 ID3 Value3 $1,250.00 o
This puts quotes around a specific pattern: a dollar sign, one to three digits, possibly repeated groups of a comma plus three digits, and an optional period with up to two digits. There may be earlier SO questions about formatting values as "currency".
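Since the question reached for str_replace_all, the same capture-group idea works there too; a sketch, assuming the lines vector from readLines() above and that prices always look like $1,250.00:
library(stringr)
fixed <- str_replace_all(lines, "(\\$\\d{1,3}(,\\d{3})*\\.\\d{2})", "\"\\1\"")
read.csv(text = fixed)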
Here are some solutions:
1) read.pattern This uses read.pattern in the gsubfn package to read in a file (assumed to be called sc.csv) such that the capture groups, i.e. the parenthesized portions, of the pattern are the fields. This will read in the file and process it all in one step so it is not necessary to use readLines first.
^(.*?), that begins the pattern will match everything from the start until the first comma. Then (.*?), will match to the next comma and finally (.*)$ will match everything else to the end. Normally * is greedy, i.e. it matches as much as it can, but the question mark after it makes it ungreedy. We needed to specify perl=TRUE so that it uses perl regular expressions since by default gsubfn uses tcl regular expressions based on Henry Spencer's regex parser which does not support *? . If you would rather have character columns instead of factor then add the as.is=TRUE argument to read.pattern.
The final line of code removes the $ and , characters from the Price column and converts it to numeric. (Omit this line if you actually want it formatted.)
library(gsubfn)
DF <- read.pattern("sc.csv", pattern = "^(.*?),(.*?),(.*)$", perl = TRUE, header = TRUE)
DF$Price <- as.numeric(gsub("[$,]", "", DF$Price)) ##
giving:
> DF
DataField1 DataField2 Price
1 ID1 Value1 NA
2 ID2 Value2 500
3 ID3 Value3 1250
2) sub This uses a very simple regular expression (just a single-character match) and no packages. Using vector as defined in the question, this replaces the first two commas with semicolons; then the text can be read in using sep = ";".
read.table(text = sub(",", ";", sub(",", ";", vector)), header = TRUE, sep = ";")
Add the line marked ## in (1) if you want numeric prices.

Concatenate a string that contains backquote characters in R

I have a string that contains back quotes, which mess up the concatenate function; c() doesn't like the back ticks:
a <- c(`table`, `chair`, `desk`)
Error: object 'chair' not found
So I can create the variable:
bad.string <- "`table`, `chair`, `desk`"
a <- gsub("`", "", bad.string)
That gives a string "table, chair, desk".
It then should be like:
good.object <- c("table", "chair", "couch", "lamp", "stool")
I don't know why the backquotes cause the concatenate function to break, but how can I fix the string so that it doesn't have the illegal characters?
Try:
good.string <- trimws(unlist(strsplit(gsub("`", "", bad.string), ",")))
Here gsub() is used to remove the backticks, strsplit converts a single string into a list of strings, where the comma in the original string denotes the separation, unlist() converts the list of strings into a vector of strings and trimws() deletes trailing or leading whitespaces.
From the documentation on quotes, back ticks are reserved for non-standard variable names such as
`the dog` <- 1:5
`the dog`
# [1] 1 2 3 4 5
So when you try to use concatenate, R is doing nothing wrong. It looks at all the variables in c() and tries to find them; since they don't exist, you get the error.
If this is a vector you wrote, just replace all of the backticks with single or double quotes.
If this is somehow being generated in R, bring the entire thing out as a string, then use gsub() and eval(parse())
eval(parse(text = gsub('\`',"\'","c(`table`, `chair`, `desk`)")))
[1] "table" "chair" "desk"
EDIT: For the new example of bad.string
You have to go through and replace all of the back ticks with double quotes; then you can read it with read.csv(). This is a little janky, though, as it gives back a row vector, so we transpose to get a column vector.
bad_string <- "`table`, `chair`, `desk`"
okay_string <- gsub('\`','\"', bad_string)
okay_string
# [1] "\"table\", \"chair\", \"desk\""
t(read.csv(text = okay_string,header=FALSE, strip.white = TRUE))
# [,1]
# V1 "table"
# V2 "chair"
# V3 "desk"

Finding the number of r's (both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
In the above code, we need to find the number of r's (both R and r) in rquote.
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need an intermediate step. (You probably could avoid it; I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this.)
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
EDIT: Just saw before the first u. In that case, we can get the position of the first 'u' from either which or match.
Then use grepl on 'chars' (the strsplit output from the OP's code) up to that position (ind) to get a logical index of the 'r'/'R' matches with ignore.case=TRUE, and sum it.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use two substitutions on the original string ('rquote'). The first (sub) removes the characters from the first u to the end of the string (u.*$), and the second (gsub) matches all characters except R and r ([^Rr]) and replaces them with ''. We can then use nchar to count the characters remaining.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or if we want to count the 'r' in the entire string, gregexpr to get the position of matching characters from the original string ('rquote') and get the length
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6

R: Find pattern and get the values in between

I am using readLines() to extract the HTML code of a site. Almost every line of the code contains a pattern of the form <td>VALUE1<td>VALUE2<td>. I would like to extract the values in between the <td> tags. I tried some combinations such as:
output <- gsub(pattern='(.*<td>)(.*)(<td>.*)(.*)(.*<td>)',replacement='\\2',x='<td>VALUE1<td>VALUE2<td>')
but the output gives back only one value. Any idea how to do that?
string <- "<td>VALUE1<td>VALUE2<td>"
regmatches(string , gregexpr("(?<=<td>)\\w+(?=<td>)" , string , perl = T) )
# use the gregexpr function to get the match indices and the lengths
indices <- gregexpr("(?<=<td>)\\w+(?=<td>)", string, perl = T)
indices
# [[1]]
# [1]  5 15
# attr(,"match.length")
# [1] 6 6
# attr(,"useBytes")
# ...
# i.e. there are two matches: the first starts at index 5 and the
# second starts at index 15, and each match has length 6
# then pass this match data to the regmatches function to substring
# your string at these indices
regmatches(string, indices)
Did you take a look at the "XML" package, which can extract tables from HTML? You would probably need to provide more context from the entire page you are parsing so we could see whether it is appropriate.
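For completeness, a minimal sketch with the XML package, assuming the fragment sits inside a real <table> (the parser copes with the unclosed <td> tags):
library(XML)
html <- "<table><tr><td>VALUE1<td>VALUE2<td></tr></table>"
readHTMLTable(htmlParse(html, asText = TRUE))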
