gsub not working for dataframe's variable R - r

I donot understand what I am doing wrong.
I have a dataframe and one of the variables looks like this.
ss <- c("F00020 " , "F13975 " , "F13976 " , "F15334 " , "F12490 " , "F09787 " , "F14675 " ,
"F12129 " , "F04641 " , "F04680 " , "F04715 " , "F04753 " , "F08868 " , "F14031 " ,
"F14033 " , "F12585 " , "F14663 ")
I want to omit the extra blank spaces.
gsub("[[:space:]]","",ss)
The above code works but if I directly call the variable from the dataframe it's not working.
gsub("[[:space:]]","",df$Variable)
I also checked the type of the vector/variables, both are same as a character vector.
So what is happening here?

I cannot reproduce your error:
ss <- c("F00020 " , "F13975 " , "F13976 " , "F15334 " , "F12490 " , "F09787 " , "F14675 " ,
"F12129 " , "F04641 " , "F04680 " , "F04715 " , "F04753 " , "F08868 " , "F14031 " ,
"F14033 " , "F12585 " , "F14663 ")
gsub("[[:space:]]","",ss)
[1] "F00020" "F13975" "F13976" "F15334" "F12490" "F09787" "F14675" "F12129" "F04641" "F04680" "F04715"
[12] "F04753" "F08868" "F14031" "F14033" "F12585" "F14663"
df <- data.frame(Variable = ss)
gsub("[[:space:]]","",df$Variable)
[1] "F00020" "F13975" "F13976" "F15334" "F12490" "F09787" "F14675" "F12129" "F04641" "F04680" "F04715"
[12] "F04753" "F08868" "F14031" "F14033" "F12585" "F14663"

An easy solution for your use case is with trimws:
trimws(ss)
[1] "F00020" "F13975" "F13976" "F15334" "F12490" "F09787" "F14675" "F12129" "F04641" "F04680" "F04715"
[12] "F04753" "F08868" "F14031" "F14033" "F12585" "F14663"
Yes, as noted by others, your solution does work too, just as this, shorter, one does:
sub("\\s", "", ss) # no `gsub` needed **iff** there's always just one whitespace per string (in whatever position)

Related

R case insensitive capturing group

This regex :
str_extract_all("This is a Test , ' ' " , "[a-z]+")
returns :
[1] "his" "is" "a" "est"
How to modify so this is case insensitive ?
`[1] "This" "is" "a" "Test"`
should instead be returned
Should /i remove case sensitive ?
Trying str_extract_all("This is a Test , ' ' " , "[a-z]+/i")
returns
[[1]]
character(0)
There is a special notation for stringr functions:
regex(pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE,
dotall = FALSE, ...)
You may use
> str_extract_all("This is a Test , ' ' " , regex("[a-z]+", ignore_case=TRUE))
[[1]]
[1] "This" "is" "a" "Test"
Alternatively, use an inline i modifier (?i):
str_extract_all("This is a Test , ' ' " , "(?i)[a-z]+")
You could try including the capital letters in the set you're searching for.
str_extract_all("This is a Test , ' ' " , "[A-Za-z]+")
If you only want the first letter to be capitalized you could try the code below. It lets the first letter be case insensitive and then have only lowercase afterward.
str_extract_all("This is a Test , ' ' " , "[A-Za-z][a-z]*")

R - Converting a text file with white spaces / tabs and newline to list

Curious if you might offer advice on the following.
I have data in a text file in this form:
"var1"
" var1a"
" var1a_descrp1"
" thing"
" var1b"
" var1b_descrp2"
" thing"
" var1b_descrp3"
" thing1"
" thing2"
" var1b_descrp4"
"poobarvar"
" var2a"
" var2a_descrp1"
" var2b"
" var2b_descrp1"
" thing"
" var2b_descrp1"
" thing1"
" thing2"
" thing3"
White spaces go a max depth of 12 spaces, or "three levels" deep.
And I'd love to cleanly parse this into a list structure of something like the following structure:
$var1
$var1$var1a
$var1$var1a$var1a_descrp1
$var1$var1a$var1a_descrp1[[1]]
[1] "thing"
$var1$var2a
$var1$var2a$var2a_descrp2
$var1$var2a$var2a_descrp2[[1]]
[1] "thing"
$var1$var2a$var2a_descrp3
$var1$var2a$var2a_descrp3[[1]]
[1] "thing1"
$var1$var2a$var2a_descrp3[[2]]
[1] "thing2"
$poobarvar
$poobarvar$var2a
list()
$poobarvar$var2b
$poobarvar$var2b$var2b_descrp1
$poobarvar$var2b$var2b_descrp1[[1]]
[1] "thing1"
$poobarvar$var2b$var2b_descrp1[[2]]
[1] "thing2"
$poobarvar$var2b$var2b_descrp1[[3]]
[1] "thing3"
I have a pretty convoluted set of while loops and if-else statements I'd love to clean up.

How to print double quotes (") in R

I want to print to the screen double quotes (") in R, but it is not working. Typical regex escape characters are not working:
> print('"')
[1] "\""
> print('\"')
[1] "\""
> print('/"')
[1] "/\""
> print('`"')
[1] "`\""
> print('"xml"')
[1] "\"xml\""
> print('\"xml\"')
[1] "\"xml\""
> print('\\"xml\\"')
[1] "\\\"xml\\\""
I want it to return:
" "xml" "
which I will then use downstream.
Any ideas?
Use cat:
cat("\" \"xml\" \"")
OR
cat('" "','xml','" "')
Output:
" "xml" "
Alternative using noqoute:
noquote(" \" \"xml\" \" ")
Output :
" "xml" "
Another option using dQoute:
dQuote(" xml ")
Output :
"“ xml ”"
With the help of the print parameter quote:
print("\" \"xml\" \"", quote = FALSE)
> [1] " "xml" "
or
cat('"')

Replacing + by 5 in R

I have a dataset called Price which is supposed to be numeric but is generated as a string because all 5 is replaced by +.
It looks like this:
"99000" "98300" "98300" "98290" "98310" " 9831+ " "98310" " 9830+ " " 9830+ " " 9830+ " " 9829+ " " 9828+ " " 9827+ " "98270"
I used the gsub function in R to try and replace + by 5. The code I wrote is:
finalPrice<-gsub("+",5,Price)
However, the output is just a bunch of numbers which doesn't make sense for what I intended:
"59595050505,5 59585350505,5 59585350505,5 59585259505,5 59585351505,5 5 5 595853515+5 5,5 59585351505,5 5 5 595853505+5 5,5 5 5 595853505+5
How can I fix this?
The + sign should be escaped. Try this:
finalPrice<-gsub("\\+",5, Price)
Besides using double-escapes to force a literal-x to be matched by the pattern argument, you can also use either the fixed=TRUE parameter or use a character-class defined by the "[.]"-operation. See the ?regex page for more details:
> gsub("+", "5", txt, fixed=TRUE)
[1] "99000" "98300" "98300" "98290" "98310"
[6] " 98315 " "98310" " 98305 " " 98305 " " 98305 "
[11] " 98295 " " 98285 " " 98275 " "98270"
> gsub("[+]", "5", txt)
[1] "99000" "98300" "98300" "98290" "98310"
[6] " 98315 " "98310" " 98305 " " 98305 " " 98305 "
[11] " 98295 " " 98285 " " 98275 " "98270"
When writing regex, + means match the preceeding group one or more times. As the preceeding character is in your regex before the + is empty, gsub matches every empty string in the target.
The result is that 5 is inserted into each of these positions.
To avoid this, escape the +, which needs to be done with double backslash in R:
finalPrice<-gsub("\\+",5,Price)

extracting first value from a list

I would like to extract the first value from this list:
[[1]]
[1] " \" 0.0337302" " -0.000248016" " -0.000496032" " -0.000744048"
[5] " -0.000992063" " -0.00124008" " -0.0014881" " -0.00173611"
[9] " -0.00198413" " -0.00223214" " -0.00248016" " -0.00272817"
[13] " -0.00297619" " -0.00322421" " -0.00347222" " -0.00372024"
[17] " -0.00396825" " -0.00421627" " -0.00446429" " -0.0047123"
[21] " -0.00496032" " -0.00520833" " -0.00545635" " -0.00570437"
the name of this test is M, I have tested this M[1] and M[[1]] but I don't get the correct answer.
How can I do that?
You need to subset the list, and then the vector in the list:
M[[1]][1]
In other words, M is a list of 1 element, a character vector of length 24.
You may want to use unlist M to convert it to just a vector.
M <- unlist(M)
Then you can just use M[1].
To remove the \" you can use sub:
sub("\"","",M[1])
[1] " 0.0337302"
The first element in the list you've shown is the entire vector shown by
[1] " \" 0.0337302" " -0.000248016" " -0.000496032" " -0.000744048"
[5] " -0.000992063" " -0.00124008" " -0.0014881" " -0.00173611"
[9] " -0.00198413" " -0.00223214" " -0.00248016" " -0.00272817"
[13] " -0.00297619" " -0.00322421" " -0.00347222" " -0.00372024"
[17] " -0.00396825" " -0.00421627" " -0.00446429" " -0.0047123"
[21] " -0.00496032" " -0.00520833" " -0.00545635" " -0.00570437"
you get that vector by doing M[[1]]
To further get the first element of this vector just recognize that M[[1]] is the vector you want the first element of so use normal subsetting to get that: M[[1]][1]
> M[[1]][1]
[1] " \" 0.0337302"

Resources