R - how to drop chars from string depending on their values? - r

I have a CSV file where numeric values are stored in a way like this:
+000000000000000000000001101.7100
The number above is 1101.71. This string is always the same length, so the number of zeroes before the actual number depends on the number's length.
How can I drop the + and all 0s before the actual number so I can then convert it to numeric easily?

If it is of fixed width, then substring will be a faster option
as.numeric(substring(str1, nchar(str1)-8))
#[1] 1101.71
But if we don't know how many 0s there will be at the beginning, another option is sub, where we match a + at the start (^) of the string followed by zero or more 0s (0*) and replace the match with an empty string ("")
as.numeric(sub("^\\+0*", "", str1))
#[1] 1101.71
Note that we escape the + because it is a regex metacharacter meaning "one or more of the preceding element".
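As a minimal sketch of the sub approach applied to several values at once: only the first sample value comes from the question, the other two are invented for illustration, and the helper name strip_padding is hypothetical.

```r
# Sample values in the fixed-width format from the question
# (only the first one appears in the question; the others are invented).
str1 <- c("+000000000000000000000001101.7100",
          "+000000000000000000000000000.5000",
          "+000000000000000000000000042.0000")

# Hypothetical helper: remove a leading "+" and any zeros that follow it.
strip_padding <- function(x) sub("^\\+0*", "", x)

stripped <- strip_padding(str1)   # still character
vals <- as.numeric(stripped)      # now numeric

print(stripped)
print(vals)
```

A value below 1 is reduced to ".5000", which as.numeric still parses as 0.5, so no special case is needed for it.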

I may miss an important point, but my best try would be like this:
1) read the values as a character
2) use substr to get rid of the first character, namely the plus sign
3) convert the column with as.numeric; this way we safely lose any leading zeroes (as.integer would also drop the fractional part)
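The three steps above could look roughly like this. It is only a sketch: the inline CSV text and the column names id and amount are made up, every value is assumed to start with a plus sign, and as.numeric is used for the conversion so that the fractional part is kept.

```r
# Made-up CSV content standing in for the real file.
csv_text <- paste("id,amount",
                  "1,+000000000000000000000001101.7100",
                  "2,+000000000000000000000000020.5000",
                  sep = "\n")

# 1) Read every column as character so nothing is converted prematurely.
df <- read.csv(text = csv_text, colClasses = "character")

# 2) Drop the first character (assumed to be the plus sign).
no_sign <- substr(df$amount, 2, nchar(df$amount))

# 3) Convert to numeric; leading zeros disappear in the conversion.
df$amount <- as.numeric(no_sign)

print(df)
```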

Related

Replace character if it is 1 of a set of characters r

I have a single column of words that I am trying to clean. Some of the words have characters in them that I would like replaced with a space.
I know how to replace a single character in a string:
df2 <- data.frame(gsub("-"," ",data$string_column))
This example replaces the '-' character with a space.
How do I apply this procedure to an array of characters? I have tried the following:
df2 <- data.frame(gsub(c("-","&")," ",data$string_column))
This code runs, but it will only perform the operation of the first character, and not the second.
Any ideas on how to define a list of characters to be replaced by a space?
Thank you
You need
data$string_column <- gsub("[-&]", " ", data$string_column)
This way, all - and & chars in the string_column of the data dataframe will get replaced with a space char.
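A small self-contained check of the character-class approach, using an invented data frame. The chartr call is an alternative I am adding for comparison, not part of the answer above; it maps each character of its first argument to the character at the same position in the second.

```r
# Invented sample data; the column name follows the question.
data <- data.frame(string_column = c("rock-and-roll", "R&D", "plain"),
                   stringsAsFactors = FALSE)

# "[-&]" is a character class: it matches either "-" or "&".
# The hyphen comes first inside the brackets, so it is taken literally.
cleaned <- gsub("[-&]", " ", data$string_column)

# Alternative without regular expressions: translate "-" and "&" to spaces.
same <- chartr("-&", "  ", data$string_column)

print(cleaned)
```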

Regex - Best way to match all values between two two digit numbers?

Let's say I want a Regex expression that will only match numbers between 18 and 31. What is the right way to do this?
I have a set of strings that look like this:
"quiz.18.player.total_score"
"quiz.19.player.total_score"
"quiz.20.player.total_score"
"quiz.21.player.total_score"
I am trying to match only the strings that contain the numbers 18-31, and am currently trying something like this
(quiz.)[1-3]{1}[1-9]{1}.player.total_score
This obviously won't work because it will actually match all numbers between 11-39. What is the right way to do this?
Regex: 1[89]|2\d|3[01]
For matching add additional text and escape the dots:
quiz\.(?:1[89]|2\d|3[01])\.player\.total_score
Details:
(?:) non-capturing group
[] match a single character present in the list
| or
\d matches a digit (equal to [0-9])
\. dot
. matches any character
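In R the pattern above could be applied with grepl, as sketched below. The sample vector s is constructed for illustration, the ^ and $ anchors are my addition to force a whole-string match, and perl = TRUE is set so that the (?:) group is handled by the PCRE engine.

```r
# Constructed sample: quiz numbers 15 to 33.
s <- paste0("quiz.", 15:33, ".player.total_score")

# Backslashes are doubled because the pattern is an R string literal.
pattern <- "^quiz\\.(?:1[89]|2\\d|3[01])\\.player\\.total_score$"

matched <- s[grepl(pattern, s, perl = TRUE)]
print(matched)
```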
1) If s is the character vector, read the fields into a data frame, pick off the second field and check whether it is in the desired range. Use the resulting logical vector to get those elements from s. This uses no regular expressions and only base R.
digits <- read.table(text = s, sep = ".")$V2
s[digits %in% 18:31]
2) Another approach based on the pattern "\\D" matching any non-digit is to remove all such characters and then check if what is left is in the desired range:
digits <- gsub("\\D", "", s)
s[digits %in% 18:31]
2a) In R 3.6.0 and later we could alternatively use the whitespace argument of trimws like this:
digits <- trimws(s, whitespace = "\\D")
s[digits %in% 18:31]
3) Another alternative is to simply construct the boundary strings and compare s to them. This will work only if all the number parts in s are exactly the same number of digits (which for the sample shown in the question is the case).
ok <- s >= "quiz.18.player.total_score" & s <= "quiz.31.player.total_score"
s[ok]
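The snippets in 1) and 2) assume a character vector s that is never defined. A minimal sketch with a constructed s shows that both select the same elements:

```r
# Constructed sample: quiz numbers 15 to 33.
s <- paste0("quiz.", 15:33, ".player.total_score")

# 1) Split on "." and take the second field (read in as integer).
digits1 <- read.table(text = s, sep = ".")$V2
res1 <- s[digits1 %in% 18:31]

# 2) Remove every non-digit character; what remains is the number as text.
digits2 <- gsub("\\D", "", s)
res2 <- s[digits2 %in% 18:31]

print(res1)
```

Approach 2) relies on the quiz number being the only digits in each string.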
This is done using character ranges and alternations. For your range
3[10]|[2][0-9]|1[8-9]

remove all delimiters at beginning and end of string

After I collapse my rows and separate using a semicolon, I'd like to delete the semicolons at the front and back of my string. Multiple semicolons represent blanks in a cell. For example an observation may look as follows after the collapse:
;TX;PA;CA;;;;;;;
I'd like the cell to look like this:
TX;PA;CA
Here is my collapse code:
new_df <- group_by(old_df, unique_id) %>% summarize_each(funs(paste(., collapse = ';')))
If I try to gsub for semicolon it removes all of them. If I remove the end character it just removes one of the semicolons. Any ideas on how to remove all at the beginning and end, but leaving the ones in between the observations? Thanks.
use the regular expression ^;+|;+$
x <- ";TX;PA;CA;;;;;;;"
gsub("^;+|;+$", "", x)
The ^ indicates the start of the string, the + means "one or more" of the preceding character, and $ indicates the end of the string. The | means "OR". So, combined, the pattern matches one or more ; at the start of the string OR one or more ; at the end of the string, and gsub replaces those matches with an empty string.
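Because gsub is vectorised, the same call works on a whole column. A short sketch with invented values, including edge cases with no semicolons and with nothing but semicolons:

```r
# Invented values; only the first one comes from the question.
x <- c(";TX;PA;CA;;;;;;;", "NY;;", ";;;", "FL")

# Strip runs of ";" at the start or at the end; inner ones are kept.
trimmed <- gsub("^;+|;+$", "", x)
print(trimmed)
```

An element that consists only of semicolons becomes an empty string rather than NA.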
The stringi package allows you to specify patterns which you wish to preserve and trim everything else. If you only have letters there (though you could specify other pattern too), you could simply do
stringi::stri_trim_both(";TX;PA;CA;;;;;;;", "\\p{L}")
## [1] "TX;PA;CA"

Make all elements of a character vector the same length

Consider a character vector
test <- c('ab12','cd3','ef','gh03')
I need all elements of test to contain 4 characters (nchar(test[i])==4). If the actual length of the element is less than 4, the remaining places should be filled with zeroes. So, the result should look like this
> 'ab12','cd30','ef00','gh03'
My question is similar to this one. Yet, I need to work with a character vector.
We can use base R functions to pad 0s at the end of each string so that all elements have the same number of characters. Calling format with width set to the max of nchar (number of characters) of the vector gives output padded with trailing spaces (as format left-justifies character strings by default). Then we can replace each space with '0' using gsub. The pattern in the gsub is a single whitespace character (\\s) and the replacement is 0.
gsub("\\s", "0", format(test, width=max(nchar(test))))
#[1] "ab12" "cd30" "ef00" "gh03"
Or, if we are using a package solution, str_pad does this more easily, as it also has an argument to specify the pad character.
library(stringr)
str_pad(test, max(nchar(test)), side="right", pad="0")
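One caveat with the gsub approach is that it would also turn whitespace inside an element into 0. A possible base R alternative that only appends characters is sketched below; the helper name pad_right_zero is hypothetical, and strrep requires R 3.3.0 or later.

```r
test <- c('ab12', 'cd3', 'ef', 'gh03')

# Hypothetical helper: append as many "0" as are missing to reach `width`.
# pmax(..., 0) guards against elements that are already longer than `width`.
pad_right_zero <- function(x, width = max(nchar(x))) {
  missing <- pmax(width - nchar(x), 0)
  paste0(x, strrep("0", missing))
}

padded <- pad_right_zero(test)
print(padded)
```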
