Drop char code from start of string in a whole column - r

I'm trying to drop a euro character code from the start of every value in a column. The column was ingested as character by readr, but I need to convert it to integers.
data$price[1:3]
[1] "\u0080343,000.00" "\u0080185,000.00" "\u0080438,500.00"
so need to get rid of \u0080 from the start (and , and . but we'll deal with those later)
tried:
data$price <- sub("\u0080", "", data$price)
-- no change(!!!)
data$price <- substr(data$price, 7, 100)
-- invalid multibyte string, element 1 (???)
I'd like to get to:
343000, 185000, 438500
But not sure how to get there. Any wisdom would be much appreciated!

You can tell R to use the exact text rather than regular expressions by using the fixed = TRUE option.
price <- c("\u0080343,000.00", "\u0080185,000.00", "\u0080438,500.00")
sub("\u0080", "", price, fixed = TRUE)
[1] "343,000.00" "185,000.00" "438,500.00"
To remove the comma and convert to an integer, you can use gsub.
as.integer(gsub(",", "", sub("\u0080", "", price, fixed = TRUE)))
[1] 343000 185000 438500

You can do this:
gsub("[^ -~]+", "", price)
"343,000.00" "185,000.00" "438,500.00"
Explanation:
The Euro sign is a non-ASCII character. So to get rid of it in the values in price we define a character class of printable ASCII characters (space through tilde) in [ -~]; by negating the class through the caret ^ we match everything outside that range (such as €). This pattern is matched in gsub and replaced by "", i.e., nothing.
To convert to integer, proceed as in @Adam's answer. To convert to numeric, you can do this:
as.numeric(gsub(",", "", gsub("[^ -~]+", "", price)))
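If the goal is simply the integer values, the two steps above (drop the currency symbol, drop the commas) can be collapsed into one substitution that keeps only digits and the decimal point. This is a minimal sketch rather than something taken from the answers: clean_price is a hypothetical helper name, and it assumes every value uses . as the decimal mark and that truncating the decimals is acceptable.

```r
# Sample data copied from the answer above.
price <- c("\u0080343,000.00", "\u0080185,000.00", "\u0080438,500.00")

# Hypothetical helper: remove every character that is not a digit or a
# decimal point, then convert to numeric and truncate to integer.
clean_price <- function(x) {
  as.integer(as.numeric(gsub("[^0-9.]", "", x)))
}

clean_price(price)
# [1] 343000 185000 438500
```

Because the pattern lists what to keep instead of what to remove, it does not depend on how the currency symbol happens to be encoded.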

Related

Regex expression to remove left over ascii hex code

I am analysing some tweets and I have written a basic emoji-to-text dictionary. I use the following to convert emojis to R-encoded unicode:
df$text <- iconv(df$text, from = "latin1", to = "ascii", sub = "byte")
After that I swap the unicode to a text string that describes the emoji, for example <c2><ae> becomes 'copyright'
The problem is that I have a lot of emojis that aren't in the dictionary and I need to remove the strings that represent them. I can remove the <> symbols with "[[:punct:]]", "", but I need to get rid of the alphanumeric characters inside the <>'s too.
I was thinking something like
gsub("^<", "")
but I'm honestly stumped on how to find the < > symbols and remove anything found between them, or how to make a regex that finds < and then removes it and the next 3 characters.
Appreciate any help
example
text <- ("have a <ed><a0><bd><ed><b8><80> day")
gsub("[[:punct:]]", "", text)
gives "have a eda0bdedb880 day"
but I want "have a day"
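We can use a regex to match a < followed by one or more characters that are not a space ([^ ]+) and ending in >, plus any trailing whitespace (\\s*), and replace the match with blank (""). Note the perl = TRUE: under R's default regex engine \\< and \\> are word-boundary anchors, whereas with PCRE they are read as literal < and >.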
gsub("\\<[^ ]+\\>\\s*", "", text, perl = TRUE)
#[1] "have a day"
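The pattern above removes any run of non-space characters between a < and a >, which could also swallow ordinary text such as "<b>word</b>". A stricter variant, sketched here as an assumption about what the leftover codes look like (two lowercase hex digits per byte, as produced by iconv(..., sub = "byte")), matches only those byte escapes:

```r
text <- "have a <ed><a0><bd><ed><b8><80> day"

# Match optional leading whitespace followed by one or more <hh> byte
# escapes, where h is a lowercase hex digit, and delete the whole run.
gsub("\\s*(<[0-9a-f]{2}>)+", "", text)
# [1] "have a day"

# Text that merely contains a "<" is left alone.
gsub("\\s*(<[0-9a-f]{2}>)+", "", "I <3 R <c2><ae>")
# [1] "I <3 R"
```

Removing the whitespace before the run, instead of after it, avoids leaving a double space in the middle of the sentence.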

How to substring a char vector using patterns in R?

I have this kind of char vector:
"MODIS.evi.2013116.yL2.BOKU.tif"
The number in the middle of the vector is going to change, and the evi word will sometimes change to ndvi.
I want to use substr (or maybe another function) to take the part of the string after the second dot (.), i.e., just 2013116.yL2.BOKU.tif, whether the string is MODIS.evi.2013116.yL2.BOKU.tif or MODIS.ndvi.2013116.yL2.BOKU.tif.
We can use sub to match two instances of one or more characters that are not a . followed by a ., anchored at the start (^) of the string, and replace the match with blank ("")
sub("^([^.]+\\.){2}", "", str1)
#[1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"
If the part to keep always starts with a number, then the above can be simplified to match one or more non-digit characters at the start (^) of the string and replace them with blank
sub("^\\D+", "", str1)
#[1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"
data
str1 <- c("MODIS.evi.2013116.yL2.BOKU.tif", "MODIS.ndvi.2013116.yL2.BOKU.tif")
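The {2} in the first pattern is what fixes the number of dot-separated fields that get dropped, so the same idea can be parameterised. A minimal sketch, where drop_fields is a hypothetical helper name and not part of the answer:

```r
str1 <- c("MODIS.evi.2013116.yL2.BOKU.tif", "MODIS.ndvi.2013116.yL2.BOKU.tif")

# Hypothetical helper: drop the first n dot-separated fields of each string
# by building the pattern ^([^.]+\.){n} and deleting whatever it matches.
drop_fields <- function(x, n) {
  pattern <- sprintf("^([^.]+\\.){%d}", as.integer(n))
  sub(pattern, "", x)
}

drop_fields(str1, 2)
# [1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"

drop_fields(str1, 3)
# [1] "yL2.BOKU.tif" "yL2.BOKU.tif"
```

Unlike the \\D+ shortcut, this does not rely on the part to keep starting with a digit.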
This deletes all leading non-digit characters in s:
sub("^\\D*", "", s)
If s is as in the Note at the end then the result of running the above is:
[1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"
Note:
s <- c("MODIS.evi.2013116.yL2.BOKU.tif", "MODIS.ndvi.2013116.yL2.BOKU.tif")
l <- c("MODIS.evi.2013116.yL2.BOKU.tif","MODIS.ndvi.2013116.yL2.BOKU.tif")
sapply(l, function(x) strsplit(x, "vi.", fixed = TRUE)[[1]][2])

concatenate a string that contains backquote characters R

I have a string that contains back quotes, which mess up the concatenate function. If you try to concatenate with back ticks, the concatenate function doesn't like this:
a <- c(`table`, `chair`, `desk`)
Error: object 'chair' not found
So I can create the variable:
bad.string <- "`table`, `chair`, `desk`"
a <- gsub("`", "", bad.string)
That gives a string "table, chair, desk".
It then should be like:
good.object <- c("table", "chair", "couch", "lamp", "stool")
I don't know why the backquotes cause the concatenate function to break, but how can I replace the string to not have the illegal characters?
Try:
good.string <- trimws(unlist(strsplit(gsub("`", "", bad.string), ",")))
Here gsub() removes the backticks, strsplit() splits the single string at each comma and returns a list holding the pieces, unlist() converts that list into a character vector, and trimws() deletes leading and trailing whitespace.
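To make the one-liner easier to follow, here is the same pipeline unrolled with the intermediate values shown. The names no_ticks and pieces are only illustrative; fixed = TRUE is added because neither the backtick nor the comma needs regex handling.

```r
bad.string <- "`table`, `chair`, `desk`"

# Step 1: remove the backticks.
no_ticks <- gsub("`", "", bad.string, fixed = TRUE)
no_ticks
# [1] "table, chair, desk"

# Step 2: split at the commas; strsplit() returns a list with one element
# per input string, so the pieces still carry their leading spaces.
pieces <- strsplit(no_ticks, ",", fixed = TRUE)

# Step 3: flatten the list and trim the whitespace.
good.string <- trimws(unlist(pieces))
good.string
# [1] "table" "chair" "desk"
```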
From the documentation on quotes, back ticks are reserved for non-standard variable names such as
`the dog` <- 1:5
`the dog`
# [1] 1 2 3 4 5
So when you try to use concatenate, R is doing nothing wrong. It looks at all the variables in c() and tries to find them, which causes the error.
If this is a vector you wrote, just replace all of the backticks with single or double quotes.
If this is somehow being generated in R, bring the entire thing out as a string, then use gsub() and eval(parse())
eval(parse(text = gsub('\`',"\'","c(`table`, `chair`, `desk`)")))
[1] "table" "chair" "desk"
EDIT: For the new example of bad.string
You have to replace all of the back ticks with double quotes; then you can read the string with read.csv(). This is a little janky, though, as it gives back a one-row data frame, so we transpose it to get a single column
bad.string <- "`table`, `chair`, `desk`"
okay_string <- gsub('\`','\"',bad.string)
okay_string
# [1] "\"table\", \"chair\", \"desk\""
t(read.csv(text = okay_string,header=FALSE, strip.white = TRUE))
# [,1]
# V1 "table"
# V2 "chair"
# V3 "desk"

Removing last consecutive digits

I have a column of alphanumeric data and I have to remove the trailing run of consecutive digits. The run could be of any length.
Input:
dlxcp01
dlcs8012
fg2fdes1
Desired Output:
dlxcp
dlcs
fg2fdes
As I have a large dataset, an efficient solution would be best.
Use the sub function.
sub("[0-9]+$", "", x)
or
sub("[[:digit:]]+$", "", x)
Use the gsub() function:
text <- c('dlxcp01', 'dlcs8012', 'fg2fdes1')
gsub('[0-9]*$', "", text)
[1] "dlxcp" "dlcs" "fg2fdes"
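Since the question is about a column in a large dataset, it may help to see the substitution applied to a data frame. sub() is vectorised, so no loop is needed. The data frame df and its column names are made up for illustration.

```r
# Hypothetical data frame standing in for the real dataset.
df <- data.frame(code = c("dlxcp01", "dlcs8012", "fg2fdes1"),
                 stringsAsFactors = FALSE)

# Strip the trailing digits from the whole column in one call. The $ anchor
# means digits elsewhere in the string (the 2 in "fg2fdes1") are kept.
df$code_clean <- sub("[0-9]+$", "", df$code)

df$code_clean
# [1] "dlxcp"   "dlcs"    "fg2fdes"
```

Because the pattern is anchored at the end of the string it can match at most once per value, so sub() and gsub() give the same result here.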

Gsub transforming numbers

I've run into this problem.
I scraped some data from the web and, for instance, I obtain this
"3.444.654" (As character)
If I use gsub("3.444.654", ".", "") in order to get 3444654...
R gives me
[1] ""
What could I do to get the integer?
> gsub(".", "", "3.444.654", fixed = TRUE)
[1] "3444654"
Check the documentation for gsub for the argument order: it is gsub(pattern, replacement, x), so the string to be modified goes last. To then turn the string into a number, use as.numeric, as.integer etc.
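A short sketch of why the original call returned "" and why fixed = TRUE matters; the variable name x is only for illustration.

```r
x <- "3.444.654"

# Arguments in the wrong order: the pattern is "3.444.654", the replacement
# is ".", and the string being searched is "", so the result is "".
gsub("3.444.654", ".", "")
# [1] ""

# Correct order, but without fixed = TRUE the "." is a regex wildcard that
# matches every character, so everything is deleted.
gsub(".", "", x)
# [1] ""

# Either treat the pattern as literal text or escape the dot.
as.integer(gsub(".", "", x, fixed = TRUE))
# [1] 3444654
as.integer(gsub("\\.", "", x))
# [1] 3444654
```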

Resources