I have a single column of words that I am trying to clean. Some of the words have characters in them that I would like replaced with a space.
I know how to replace a single character in a string:
df2 <- data.frame(gsub("-"," ",data$string_column))
This example replaces the '-' character with a space.
How do I apply this procedure to an array of characters? I have tried the following:
df2 <- data.frame(gsub(c("-","&")," ",data$string_column))
This code runs, but it only performs the replacement for the first character, not the second.
Any ideas on how to define a list of characters to be replaced by a space?
Thank you
You need
data$string_column <- gsub("[-&]", " ", data$string_column)
This way, all - and & chars in the string_column of the data dataframe will get replaced with a space char.
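As a minimal sketch of the difference: gsub takes a single pattern, so when it is given a vector it uses only the first element (with a warning), whereas a bracket expression matches any one of the listed characters. The sample strings below are made up for illustration and stand in for data$string_column.

```r
# Hypothetical sample data standing in for data$string_column
x <- c("rock-and-roll", "R&D", "plain")

# A bracket expression matches any single character listed inside it.
# The hyphen is placed first so it is read literally, not as a range.
cleaned <- gsub("[-&]", " ", x)
print(cleaned)

# Equivalent form using alternation instead of a bracket expression
cleaned_alt <- gsub("-|&", " ", x)
print(identical(cleaned, cleaned_alt))
```

To replace more characters, add them inside the brackets, keeping the hyphen first or last.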
I have a large dataset with two sorts of labels. The first is of the form 'numeric_alphanumeric_alpha' and another which is 'alphanumeric_alpha'. I need to strip the numeric prefix from the first label so that it matches the second label. I know how to remove numbers from alphanumeric data (as below) but this would remove numbers that I need.
gsub('[0-9]+', '', x)
Below is an example of the two kinds of labels I encounter:
c('12345_F24R2_ABC', 'r87R2_DEFG')
Below is the desired output
c('F24R2_ABC', 'r87R2_DEFG')
A simple regex can do it. ^ anchors the match at the start of the string, \\d matches any digit, and + means the preceding item appears one or more times.
gsub("^\\d+_", "", c('12345_F24R2_ABC', 'r87R2_DEFG'), perl = TRUE)
[1] "F24R2_ABC" "r87R2_DEFG"
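Your code, slightly modified:
^[0-9]* ..... matches zero or more digits at the start of the string (use + instead of * to require at least one digit)
\\_ ..... matches the underscore (the escape is optional, a plain _ works too)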
gsub('^[0-9]*\\_', '', x)
[1] "F24R2_ABC" "r87R2_DEFG"
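A short sketch of how the two quantifiers differ. The third input string is a hypothetical edge case, not from the question, added only to show where * and + diverge; sub is used instead of gsub because an anchored pattern can match at most once per string.

```r
# First two labels are from the question; the third is a hypothetical edge case
x <- c("12345_F24R2_ABC", "r87R2_DEFG", "_leading_underscore")

# '+' requires at least one digit before the underscore
with_plus <- sub("^[0-9]+_", "", x)
print(with_plus)

# '*' also allows zero digits, so a bare leading underscore is stripped too
with_star <- sub("^[0-9]*_", "", x)
print(with_star)
```

For the labels shown in the question both patterns give the same result; they differ only when a label starts with an underscore.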
I have just started learning R and am trying to clean the Amount_USD column of a table with string manipulation, using the code below. I cannot work out why no changes were made. Please help.
Code:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,10) == "\\\xc2\\\xa0",
str_sub(csv_file$Amount_USD,12,-1),csv_file2$Amount_USD)
Result:
\\xc2\\xa010,000,000
\\xc2\\xa016,200,000
\\xc2\\xa019,350,000
Expected Result:
10,000,000
16,200,000
19,350,000
You could use the following code, but maybe there is a more compact way:
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")
gsub("(\\\\x[[:alpha:]]\\d\\\\x[[:alpha:]]0)([0-9,]*)", "\\2", vec)
[1] "10,000,000" "16,200,000" "19,350,000"
A compact way to extract the numbers is by using str_extract and negative lookahead:
library(stringr)
str_extract(vec, "(?!0)[\\d,]+$")
[1] "10,000,000" "16,200,000" "19,350,000"
How this works:
(?!0): a negative lookahead that stops the match from starting at a 0, so the final 0 of the prefix is skipped (this assumes the amount itself never starts with 0)
[\\d,]+$: a character class allowing only digits and commas to occur one or more times right up to the string end $
Alternatively:
str_sub(vec, start = 9)
There were a few minor issues with your code.
The main one is the two unneeded backslashes in the string you compare against. This also leads to a counting error in your first str_sub(), where you should be taking the first 8 characters, not 10. Finally, the substring you keep should start at the character right after the text you want to match (i.e. position 9, not 12). The following should work:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,8) == "\\xc2\\xa0", str_sub(csv_file$Amount_USD,9,-1),csv_file2$Amount_USD)
However, I would have done this with a more compact gsub than provided above. As long as the text at the start to remove is always going to be "\\xc2\\xa0", you can simply replace it with nothing. Note that for gsub you will need to escape all the backslashes, and hence you end up with:
csv_file2$Amount_USD <- gsub("\\\\xc2\\\\xa0", replacement = "", csv_file2$Amount_USD)
Personally, especially if you plan to do any sort of mathematics with this column, I would go the additional step and remove the commas, and then coerce the column to be numeric:
csv_file2$Amount_USD <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", replacement = "", csv_file2$Amount_USD))
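As a worked check of the character counts and the replacement, here is a small sketch on the sample vector from the earlier answer. It assumes, as the answers above do, that the column holds the literal text backslash-x-c-2 and so on rather than an actual non-breaking space character; fixed = TRUE is a swapped-in alternative that avoids the regex escaping altogether.

```r
# Each "\\" in R source is ONE literal backslash in the string
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")

# The prefix is 8 characters long, followed by the 10-character amount
prefix_length <- nchar("\\xc2\\xa0")
print(prefix_length)
print(nchar(vec[1]))

# fixed = TRUE treats the pattern as a plain string, so no regex escaping is needed
no_prefix <- sub("\\xc2\\xa0", "", vec, fixed = TRUE)
print(no_prefix)

# Drop the thousands separators and convert to numeric
amounts <- as.numeric(gsub(",", "", no_prefix, fixed = TRUE))
print(amounts)
```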
Let's say I want a regular expression that will only match numbers between 18 and 31. What is the right way to do this?
I have a set of strings that look like this:
"quiz.18.player.total_score"
"quiz.19.player.total_score"
"quiz.20.player.total_score"
"quiz.21.player.total_score"
I am trying to match only the strings that contain the numbers 18-31, and am currently trying something like this
(quiz.)[1-3]{1}[1-9]{1}.player.total_score
This obviously won't work because it will actually match all numbers between 11-39. What is the right way to do this?
Regex: 1[89]|2\d|3[01]
For matching add additional text and escape the dots:
quiz\.(?:1[89]|2\d|3[01])\.player\.total_score
Details:
(?:) non-capturing group
[] match a single character present in the list
| or
\d matches a digit (equal to [0-9])
\. dot
. matches any character
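A minimal sketch of applying this pattern in R. Inside an R string every backslash has to be doubled, and perl = TRUE is used here so that the non-capturing group and \d are handled by PCRE. The ^ and $ anchors and the sample vector s are additions for illustration, not part of the pattern above.

```r
# Hypothetical sample input covering both sides of the 18-31 range
s <- c("quiz.17.player.total_score",
       "quiz.18.player.total_score",
       "quiz.25.player.total_score",
       "quiz.31.player.total_score",
       "quiz.32.player.total_score")

# 1[89] covers 18-19, 2\d covers 20-29, 3[01] covers 30-31
pattern <- "^quiz\\.(?:1[89]|2\\d|3[01])\\.player\\.total_score$"

matched <- s[grepl(pattern, s, perl = TRUE)]
print(matched)
```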
1) If s is the character vector, read it into a data frame split on ".", pick off the second field, and check whether it is in the desired range; then use the result to subset s. This uses no regular expressions and only base R.
digits <- read.table(text = s, sep = ".")$V2
s[digits %in% 18:31]
2) Another approach based on the pattern "\\D" matching any non-digit is to remove all such characters and then check if what is left is in the desired range:
digits <- gsub("\\D", "", s)
s[digits %in% 18:31]
2a) In R 3.6.0 and later we could alternatively use the whitespace argument of trimws like this:
digits <- trimws(s, whitespace = "\\D")
s[digits %in% 18:31]
3) Another alternative is to simply construct the boundary strings and compare s to them. This will work only if all the number parts in s are exactly the same number of digits (which for the sample shown in the question is the case).
ok <- s >= "quiz.18.player.total_score" & s <= "quiz.31.player.total_score"
s[ok]
This is done using character ranges and alternation. For your range:
3[10]|[2][0-9]|1[8-9]
I have a CSV file where numeric values are stored in a way like this:
+000000000000000000000001101.7100
The number above is 1101.71. This string is always the same length, so the number of zeroes before the actual number depends on the number's length.
How can I drop the + and all 0s before the actual number so I can then convert it to numeric easily?
If the number itself has a fixed width (here, the last nine characters), then substring will be a faster option:
as.numeric(substring(str1, nchar(str1)-8))
#[1] 1101.71
But if we don't know how many 0s there will be at the beginning, another option is sub: match a + at the start (^) of the string followed by zero or more 0s (0*), and replace the match with an empty string ("").
as.numeric(sub("^\\+0*", "", str1))
#[1] 1101.71
Note that we escape the + because it is a regex metacharacter meaning "one or more of the preceding item".
I may be missing an important point, but my best attempt would be:
1) read the values as character
2) use substr to get rid of the first character, namely the plus sign
3) convert the column with as.numeric, which safely drops the leading zeroes (as.integer would also truncate the decimal part)
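The steps above can be sketched as follows, using the single value from the question in place of a whole column (the variable names are illustrative). The last line is an extra observation rather than part of the steps: as far as I can tell, as.numeric already accepts a leading sign and leading zeroes, so the conversion may work without any stripping at all.

```r
# Step 1: the value as read from the CSV, kept as character
str1 <- "+000000000000000000000001101.7100"

# Step 2: drop the first character (the plus sign)
without_sign <- substr(str1, 2, nchar(str1))
print(without_sign)

# Step 3: convert to numeric; leading zeroes do not affect the value
value <- as.numeric(without_sign)
print(value)

# Direct conversion of the untouched string, for comparison
value_direct <- as.numeric(str1)
print(value_direct)
```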