R character conversion

From an import, I have a date being read in as a factor:
user$registrationDate[1]
[1] "2004-07-23 14:19:32"
15551 Levels: " "1" "2004-07-23 14:19:32" "2004-07-25 03:29:18" "2004-07-25 08:35:20" ... i10yo."
I convert it, apparently successfully, into a character vector:
as.character(user$registrationDate[1])
[1] "\"2004-07-23 14:19:32\""
Whatever I try in order to strip off the leading and trailing quotes, I still end up with a trailing quote (or something like it):
sub('"', "", as.character(user$registrationDate[10]), fixed=TRUE)
[1] "2004-09-12 22:39:21\""
I tried many variations of sub and keep getting the same result. Tips?

From ?sub: "sub replaces only the first occurrence of a pattern whereas gsub replaces all occurrences". Your call removes the leading quote and leaves the trailing one untouched, so use gsub instead.
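A minimal sketch of the difference, assuming the stored value really contains literal quote characters (as the as.character() output in the question suggests). Here x is a made-up stand-in for one element of user$registrationDate, and the UTC time zone is an assumption:

```r
# Stand-in for as.character(user$registrationDate[1]); the quotes are part of the data
x <- "\"2004-07-23 14:19:32\""

first_only <- sub('"', "", x, fixed = TRUE)   # strips only the first quote
cleaned    <- gsub('"', "", x, fixed = TRUE)  # strips every quote

first_only
# [1] "2004-07-23 14:19:32\""
cleaned
# [1] "2004-07-23 14:19:32"

# With the quotes gone, the string can be parsed as a date-time (time zone assumed)
parsed <- as.POSIXct(cleaned, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

fixed = TRUE is not strictly needed for a quote character, but it makes clear that no regex interpretation is intended.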

Related

How to take only that part of a string which occurs before a pattern of 2 dots?

I used a regular expression that kept only the text before the second occurrence of a dot. This is the code:
colnames(final1)[i] <- gsub("^([^.]*.[^.]*)..*$", "\\1", colnames(final)[i])
But now I realize I want the text before the first occurrence of two consecutive dots.
I tried
gsub(",.*$", "", colnames(final)[i]) (changed the , to ..)
gsub("...*$", "", colnames(final)[i])
But neither attempt worked.
An example to try it on:
KC1.Comdty...PX_LAST...USD......Comdty........
converted to
KC1.Comdty.
or
"LIT.US.Equity...PX_LAST...USD......Comdty........"
to
"LIT.US.Equity."
Can anyone suggest anything?
Thanks
We could use sub to match two or more dots followed by any other characters and replace the match with an empty string:
sub("\\.{2,}.*", "", str1)
#[1] "KC1.Comdty" "LIT.US.Equity"
The . is a metacharacter that matches any character, so we need to escape it (\\.) to match a literal dot.
data
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
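To illustrate why the unescaped attempt from the question fails, here is a small sketch that compares it with two ways of matching a literal dot. The variable names unescaped, escaped and char_class are only for illustration:

```r
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......",
          "LIT.US.Equity...PX_LAST...USD......Comdty........")

# Unescaped: every . matches any character, so the pattern swallows the whole string
unescaped <- gsub("...*$", "", str1)
# "" ""

# Escaped with backslashes, or wrapped in a character class: only literal dots match
escaped    <- sub("\\.{2,}.*", "", str1)
char_class <- sub("[.]{2,}.*", "", str1)
# both give "KC1.Comdty" "LIT.US.Equity"
```

The character class form [.] avoids the double backslash, which some people find easier to read.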
Another solution with strsplit:
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
sapply(strsplit(str1, "\\.{2}\\w"), "[", 1)
# [1] "KC1.Comdty." "LIT.US.Equity."
To also include the dot at the end with @akrun's answer, one can do:
sub("\\.{2}\\w.*", "", str1)
# [1] "KC1.Comdty." "LIT.US.Equity."
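Another way to keep one trailing dot, sketched here as an alternative rather than as part of either answer, is to replace the run of dots and everything after it with a single dot. The third input string below is a made-up edge case with no word character after the dots:

```r
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......",
          "LIT.US.Equity...PX_LAST...USD......Comdty........")

# Replace "two or more dots plus the rest" with one dot
with_dot <- sub("\\.{2,}.*", ".", str1)
# "KC1.Comdty." "LIT.US.Equity."

# Hypothetical edge case: the dots run to the end of the string
edge <- "ABC.DEF...."
needs_word_char <- sub("\\.{2}\\w.*", "", edge)   # no match, string unchanged
any_dot_run     <- sub("\\.{2,}.*", ".", edge)    # "ABC.DEF."
```

The two patterns agree on the example data and differ only when nothing but dots follows the first run.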

R retrieving strings with sub: Why this does not work?

I would like to extract parts of strings. The string is:
> (x <- 'ab/cd efgh "xyz xyz"')
[1] "ab/cd efgh \"xyz xyz\""
Now, I would like first to extract the first part:
> # get "ab/cd efgh"
> sub(" \"[/A-Za-z ]+\"","",x)
[1] "ab/cd efgh"
But I don't succeed in extracting the second part:
> # get "xyz xyz"
> sub("(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "ab/cd efgh \"xyz xyz\""
What is wrong with this code?
Thanks for help.
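Your last snippet does not work because you reinsert the whole match back into the result: (\"[A-Za-z ]+\")$ matches and captures a ", one or more letters and spaces, and a closing " into Group 1, and \1 in the replacement puts it back. The text before the match is never part of the match, so it stays in the result unchanged.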
You may actually get the last part inside quotes by removing all chars other than " at the start of the string:
x <- 'ab/cd efgh "xyz xyz"'
sub('^[^"]+', "", x)
sub replaces only the first match. Here the pattern matches the start of the string (^) followed by one or more characters other than " (the negated character class [^"]+), which leaves just the quoted part.
To get this to work with sub, you have to match the whole string. The help file says:
For sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding).
So to get this to work with your regex, prepend the sometimes risky catch-all ".*":
sub(".*(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "\"xyz xyz\""
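If the goal is extraction rather than substitution, base R's regexpr and regmatches are an option. The sketch below is an alternative to the answers above, not a correction of them; the names m, quoted and inner are only for illustration:

```r
x <- 'ab/cd efgh "xyz xyz"'

# Locate the quoted part at the end of the string and pull it out
m <- regexpr('"[A-Za-z ]+"$', x)
quoted <- regmatches(x, m)
# quoted still includes the surrounding quote characters

# To drop the quotes, move them outside the capture group
inner <- sub('.*"([A-Za-z ]+)"$', "\\1", x)
# inner is xyz xyz without quotes
```

Using single quotes for the R string avoids having to escape every " in the pattern.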

Gsub transforming numbers

I have run into this problem.
I scraped some data from the web and, for instance, I get
"3.444.654" (as character).
If I use gsub("3.444.654", ".", "") in order to get 3444654,
R gives me
[1] ""
What can I do to get the integer?
> gsub(".", "", "3.444.654", fixed = TRUE)
[1] "3444654"
The argument order is gsub(pattern, replacement, x); see ?gsub. fixed = TRUE makes the dot match literally instead of acting as a regex metacharacter. To then turn the string into a number, use as.numeric, as.integer etc.
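Putting both steps together, a short sketch (the variable names are only for illustration) that also shows what happens when the dot is left as a regex metacharacter:

```r
x <- "3.444.654"

# Without fixed = TRUE the dot matches every character, so everything is removed
all_gone <- gsub(".", "", x)
# ""

# Literal dot, either via fixed = TRUE or by escaping it
digits  <- gsub(".", "", x, fixed = TRUE)
escaped <- gsub("\\.", "", x)

n <- as.integer(digits)
n
# [1] 3444654
```

For values too large for R's 32-bit integers, as.numeric would be the safer conversion.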

Remove hyphen at the end of string in R

I have a column of a dataframe in R like this:
names <- data.frame(name=c("ABC", "ABC-D", "ABCD-"))
I would like to remove the hyphen at the end of the strings while maintaining the hyphen in the middle of them. I've tried a few expressions like:
names$name <- gsub("+-\\w", "", names$name)
# the desired output is "ABC", "ABC-D", and "ABCD", respectively
While several combinations remove the hyphens entirely, I'm not sure how to specify the string boundary and the hyphen together.
Thanks!
Try:
gsub("\\-$", "", names$name)
# [1] "ABC" "ABC-D" "ABCD"
$ anchors the match at the end of the string, so only a trailing hyphen is removed.
The escape is optional: outside a character class, - has no special meaning in a regex, so this works too:
gsub("-$", "", names$name)
#[1] "ABC" "ABC-D" "ABCD"
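A small sketch of the same idea for strings that may end in more than one hyphen. The value "AB--" is a made-up extra case that is not in the question's data:

```r
name <- c("ABC", "ABC-D", "ABCD-", "AB--")  # "AB--" is a hypothetical extra case

# -$ removes exactly one hyphen at the end of the string
one_hyphen <- sub("-$", "", name)
# "ABC" "ABC-D" "ABCD" "AB-"

# -+$ removes a whole run of trailing hyphens
all_trailing <- sub("-+$", "", name)
# "ABC" "ABC-D" "ABCD" "AB"
```

Because $ can match only once per string, sub and gsub behave the same here.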

Replace text that appears at the end of a string

Consider "artikelnr". I want to replace "nr" by "nummer", but when I consider "inrichting", I do NOT want to replace "nr". So I just want to replace "nr" by "nummer" if it's at the end of a word.
regex is your friend, here:
sub('nr$', 'nummer', 'artikelnr')
# [1] "artikelnummer"
The $ indicates "end of string", so nr will only be replaced with nummer when it appears at the end of the string.
sub can operate on an entire vector, e.g. for a character vector x, do:
sub('nr$', 'nummer', x)
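For example, with a made-up vector x, and, as an assumption about what "end of a word" might mean in longer text, a word-boundary variant for strings that contain several words:

```r
# Hypothetical input vector; "huisnr" is not from the question
x <- c("artikelnr", "inrichting", "huisnr")
end_of_string <- sub("nr$", "nummer", x)
# "artikelnummer" "inrichting" "huisnummer"

# If each element holds several words, \\b (word boundary) can stand in for $
sentence <- "artikelnr en huisnr bij de inrichting"
end_of_word <- gsub("nr\\b", "nummer", sentence, perl = TRUE)
# "artikelnummer en huisnummer bij de inrichting"
```

gsub is used in the second call because a sentence may contain more than one word ending in "nr".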
If you don't mind using the stringr package, str_replace is also handy :
library(stringr)
str_replace("artikelnr", "nr$", "nummer")
