Gsub transforming numbers - r

I ran into this problem.
I scrape some data from the web and, for instance, I obtain this:
"3.444.654" (As character)
If I use gsub("3.444.654", ".", "") in order to get 3444654...
R gives me
[1] ""
What could I do to get the integer?

> gsub(".", "", "3.444.654", fixed = TRUE)
[1] "3444654"
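The argument order is gsub(pattern, replacement, x), so the string to modify goes last; see ?gsub. Because "." is a regex metacharacter that matches any character, fixed = TRUE is needed to treat it as a literal dot. To then turn the string into a number, use as.numeric or as.integer.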
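As a rough sketch of both ways to match the dot literally (the variable names are only illustrative, they are not from the question):

```r
# Scraped value with dots as thousands separators (from the question)
raw <- "3.444.654"

# Option 1: treat "." as a literal string, not a regex
clean_fixed <- gsub(".", "", raw, fixed = TRUE)

# Option 2: keep regex matching but escape the dot
clean_regex <- gsub("\\.", "", raw)

# Convert the cleaned string to an integer
n <- as.integer(clean_fixed)
print(n)
```

Either option should give the same string here; fixed = TRUE is usually the simpler choice when the pattern contains no real regex.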

Related

R: How to split string into pieces

I'm trying to split tons of strings as below:
x = "�\001�\001�\001�\001�\001\002CN�\001\bShandong�\001\004Zibo�\002$ABCDEFGHIJK�\002\aIMG_HAS�\002�\002�\002�\002�\002�\002�\002�\002\02413165537405763268743�\002\001�\002�\002�\002�\003�\003�\003����\005�\003�\003�\003�\003"
into four pieces
'CN', 'Shandong', 'Zibo', 'ABCDEFGHIJK'
I've tried
stringr::str_split(x, '\\00.')
which returns the original x unchanged.
Also,
trimws(gsub("�\\00?", "", x, perl = T))
which only removes the unknown character �.
Could someone help me with this? Thanks for doing so.
You can try with str_extract_all :
stringr::str_extract_all(x, '[A-Za-z_]+')[[1]]
[1] "CN" "Shandong" "Zibo" "ABCDEFGHIJK" "IMG_HAS"
With base R :
regmatches(x, gregexpr('[A-Za-z_]+', x))[[1]]
Here we extract every run of upper- or lower-case letters and underscores. Everything else is ignored, so characters like �\\00? do not appear in the final output.
We can use strsplit from base R
setdiff(strsplit(x, "[^A-Za-z]+")[[1]], "")
#[1] "CN" "Shandong" "Zibo" "ABCDEFGHIJK" "IMG" "HAS"
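If only the first four pieces are wanted, one possible sketch is to extract the words and keep the leading four. The string below is a simplified, made-up stand-in for the one in the question (plain control characters as separators), and it assumes the wanted fields always come first:

```r
# Simplified, made-up stand-in for the string in the question:
# control characters separate the fields of interest
x <- "\001\002CN\001\bShandong\001\004Zibo\002$ABCDEFGHIJK\002\aIMG_HAS"

# Extract runs of letters/underscores ...
words <- regmatches(x, gregexpr("[A-Za-z_]+", x))[[1]]

# ... then keep only the first four (assumption: these are the wanted fields)
pieces <- head(words, 4)
print(pieces)
```

Keeping the underscore in the character class is what holds IMG_HAS together as one match, so it can be dropped as a single trailing element.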

R: How can I Split and Trim Data Successfully

I've successfully split the data and removed the "," with the following code:
s = MSA_data$area_title
str_split(s, pattern = ",")
Result
[1] "Albany" " GA"
I need to trim this data to remove the white space, but doing so puts back the comma that was initially removed:
"Albany, GA"
How can I successfully split and trim the data so that the result is:
[1] "Albany" "GA"
Thank you
An alternative is to use the trimws function to trim the whitespace at the beginning and end of each string.
Result <- trimws(Result)
We just need to split on a comma followed by zero or more whitespace characters (,\\s*), which does the split and the trim in a single step. Note that the argument of strsplit is called split, not pattern:
strsplit(MSA_data$area_title, split = ",\\s*")
If we are using stringr, then make use of str_trim:
library(stringr)
str_trim(str_split("Albany, GA", ",")[[1]])
#[1] "Albany" "GA"
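A small base R sketch comparing the one-step split with the split-then-trim version. The input vector is illustrative (the second value is made up), standing in for MSA_data$area_title:

```r
# Illustrative input; the second value is made up for the example
area_title <- c("Albany, GA", "Abilene,TX")

# Split on a comma plus any following whitespace (split and trim in one step)
parts <- strsplit(area_title, split = ",\\s*")

# Equivalent two-step version: split on the literal comma, then trim each piece
parts_trimmed <- lapply(strsplit(area_title, split = ",", fixed = TRUE), trimws)

print(parts)
```

Both return a list with one character vector per input string, so a single element is reached with [[1]], [[2]], and so on.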

Drop char code from start of string in a whole column

Trying to drop a euro character code from the start of a column. Column was ingested as character by readr, but I need to convert to integers
data$price[1:3]
[1] "\u0080343,000.00" "\u0080185,000.00" "\u0080438,500.00"
so need to get rid of \u0080 from the start (and , and . but we'll deal with those later)
tried:
data$price <- sub("\u0080", "", data$price)
-- no change(!!!)
data$price <- substr(data$price, 7, 100)
-- invalid multibyte string, element 1 (???)
I'd like to get to:
343000, 185000, 438500
But not sure how to get there. Any wisdom would be much appreciated!
You can tell R to use the exact text rather than regular expressions by using the fixed = TRUE option.
price <- c("\u0080343,000.00", "\u0080185,000.00", "\u0080438,500.00")
sub("\u0080", "", price, fixed = TRUE)
[1] "343,000.00" "185,000.00" "438,500.00"
To remove the comma and convert to an integer, you can use gsub.
as.integer(gsub(",", "", sub("\u0080", "", price, fixed = TRUE)))
[1] 343000 185000 438500
You can do this:
gsub("[^ -~]+", "", price)
"343,000.00" "185,000.00" "438,500.00"
Explanation:
The Euro sign is a non-ASCII character. To get rid of it in the values in price, we define a character class of printable ASCII characters, [ -~]; negating the class with the caret ^ makes it match everything outside that range (such as €). gsub matches this pattern and replaces it with "", i.e., nothing.
To convert to integer, proceed as in @Adam's answer. To convert to numeric, you can do this:
as.numeric(gsub(",", "", gsub("[^ -~]+", "", price)))
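Another way to look at it, as a sketch only: instead of naming what to remove, keep only digits and the decimal point. parse_price is a hypothetical helper name, not something from the answers above:

```r
# Hypothetical helper: keep only digits and the decimal point, then convert
parse_price <- function(x) {
  cleaned <- gsub("[^0-9.]", "", x)  # drops the currency code and the commas
  as.integer(as.numeric(cleaned))    # truncates any fractional part
}

price <- c("\u0080343,000.00", "\u0080185,000.00", "\u0080438,500.00")
print(parse_price(price))
```

This assumes the comma is a thousands separator and the dot a decimal point; for locales that swap them, the character class would need to change.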

str_replace (package stringr) cannot replace brackets in r?

I have a string, say
fruit <- "()goodapple"
I want to remove the brackets in the string. I decide to use stringr package because it usually can handle this kind of issues. I use :
str_replace(fruit,"()","")
But nothing is replaced, and the following is returned:
[1] "()good"
If I only want to replace the right half bracket, it works:
str_replace(fruit,")","")
[1] "(good"
However, the left half bracket does not work:
str_replace(fruit,"(","")
and the following error is shown:
Error in sub("(", "", "()good", fixed = FALSE, ignore.case = FALSE, perl = FALSE) :
invalid regular expression '(', reason 'Missing ')''
Does anyone have an idea why this happens? How can I remove the "()" from the string, then?
Parentheses are grouping metacharacters in a regular expression, so they have to be escaped:
str_replace(fruit,"\\(\\)","")
# [1] "goodapple"
You may also want to consider exploring the "stringi" package, which has a similar approach to "stringr" but has more flexible functions. For instance, there is stri_replace_all_fixed, which would be useful here since your search string is a fixed pattern, not a regex pattern:
library(stringi)
stri_replace_all_fixed(fruit, "()", "")
# [1] "goodapple"
Of course, basic gsub handles this just fine too:
gsub("()", "", fruit, fixed=TRUE)
# [1] "goodapple"
The accepted answer works for your exact problem, but not for the more general problem:
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")
str_replace(my_fruits,"\\(\\)","")
## "goodapple" "(bad)apple" "(funnyapple"
This is because the regex exactly matches a "(" followed by a ")".
Assuming you care only about bracket pairs, this is a stronger solution:
str_replace(my_fruits, "\\([^()]{0,}\\)", "")
## "goodapple" "apple" "(funnyapple"
Building off of MJH's answer, this removes all ( or ):
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")
str_replace_all(my_fruits, "[()]", "")
[1] "goodapple" "badapple" "funnyapple"
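To compare the three behaviours side by side, here is a sketch in base R (gsub is swapped in for the stringr functions so that no package is needed; the variable names are illustrative):

```r
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")

# 1. Literal "()" only: fixed = TRUE disables regex interpretation
literal_only <- gsub("()", "", my_fruits, fixed = TRUE)

# 2. A balanced pair and whatever is inside it
pairs_removed <- gsub("\\([^()]*\\)", "", my_fruits)

# 3. Every single bracket, paired or not (no escaping needed inside [ ])
all_brackets <- gsub("[()]", "", my_fruits)

print(literal_only)
print(pairs_removed)
print(all_brackets)
```

Which variant is appropriate depends on whether the text inside the brackets should be kept.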

R character conversion

From an import, I have a date being read in as a factor:
user$registrationDate[1]
[1] "2004-07-23 14:19:32"
15551 Levels: " "1" "2004-07-23 14:19:32" "2004-07-25 03:29:18" "2004-07-25 08:35:20" ... i10yo."
I convert it apparently successfully into a character vector
as.character(user$registrationDate[1])
[1] "\"2004-07-23 14:19:32\""
Whatever I try to strip off the leading and trailing quote, I still end up with a trailing quote (or something like it)
sub('"', "", as.character(user$registrationDate[10]), fixed=TRUE)
[1] "2004-09-12 22:39:21\""
I tried many variations of sub and keep getting the same result. Tips?
From ?sub: "sub replaces only the first occurrence of a pattern whereas gsub replaces all occurrences". So use gsub instead.
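A minimal sketch of the difference, using the first value from the question as a plain character string (the variable names are illustrative):

```r
# Value with a literal quote at both ends (as in the question)
quoted <- "\"2004-07-23 14:19:32\""

# sub() stops after the first match, so the trailing quote survives
first_only <- sub('"', "", quoted, fixed = TRUE)

# gsub() replaces every match, so both quotes are removed
both <- gsub('"', "", quoted, fixed = TRUE)

print(first_only)
print(both)
```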
