I have a vector of strings, all containing a common symbol, let's say "*". I need to delete the * and all characters after it in every element of the vector. For example:
From abcd*123 I need to get abcd. The number of characters before and after the * varies.
Thanks for the help.
out <- gsub("\\*.*", "", yourVector)
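A minimal sketch of how the pattern behaves, using made-up sample values (the contents of yourVector here are an assumption, not data from the question). Because `.*` consumes everything after the first `*`, there is at most one match per string, so sub and gsub should give the same result:

```r
# Hypothetical sample data, not from the original question
yourVector <- c("abcd*123", "x*y*z", "nostar")

# "\\*" is a literal asterisk; ".*" matches the rest of the string
out <- sub("\\*.*", "", yourVector)
print(out)

# Elements without a "*" are returned unchanged
```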
I have a single column of words that I am trying to clean. Some of the words have characters in them that I would like replaced with a space.
I know how to replace a single character in a string:
df2 <- data.frame(gsub("-"," ",data$string_column))
This example replaces the '-' character with a space.
How do I apply this procedure to an array of characters? I have tried the following:
df2 <- data.frame(gsub(c("-","&")," ",data$string_column))
This code runs, but it only replaces the first character and not the second.
Any ideas on how to define a list of characters to be replaced by a space?
Thank you
You need
data$string_column <- gsub("[-&]", " ", data$string_column)
This way, all - and & chars in the string_column of the data dataframe will get replaced with a space char.
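If the characters to replace are held in a vector, one possible approach is to build the bracket expression from it. This is only a sketch: the sample rows and the chars and pattern names are illustrative assumptions, and it only works as written for characters that have no special meaning inside brackets.

```r
# Hypothetical sample data; the column name follows the question
data <- data.frame(
  string_column = c("rock-and-roll", "R&D", "plain"),
  stringsAsFactors = FALSE
)

# Characters to replace; "-" is kept first so it is read literally
# inside the bracket expression rather than as a range operator
chars <- c("-", "&")
pattern <- paste0("[", paste(chars, collapse = ""), "]")  # "[-&]"

data$string_column <- gsub(pattern, " ", data$string_column)
print(data$string_column)
```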
I have a vector of 8-character file names of the format
"/relative/path/to/folder/a(bc|de|fg)...[xy]1.sav"
where the brackets hold one of two or three known alternatives, and the '...' stands for three unknown characters. I want to find all file names that share the same unknown sequence XXX and sort them into a list of character vectors.
I am not sure how to proceed. I am thinking of extracting the letters in the fourth to sixth positions (...) into a vector and then using grep to get all the files with the matching string.
E.g.
# Pseudo-code. Not functioning code, but sort of the thing I want to do
> char.extr <- str_extract(file.vector, !"a(bc|de|fg)...[xy]1.sav")
> char.extr
"JKL", "MNO" ,"PQR" ...
# Use grep and lapply to put matched strings into list
> path.list <- lapply(char.extr, grep, file.vector)
> path.list
1. "/relative/path/to/folder/abcJKLx1.sav"
"/relative/path/to/folder/adeJKLy1.sav"
2. "/relative/path/to/folder/afgMNOx1.sav"
"/relative/path/to/folder/abcMNOy1.sav"
Since we know the name structure, I'd imagine extracting the 3-letter substring and then using split to get the individual groups is what you're looking for.
split(file.vector, substr(basename(file.vector), 4, 6))
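As a rough illustration, here is the same idea applied to the four example paths from the question. The variable name key is introduced only for this sketch:

```r
# Example paths taken from the question's desired output
file.vector <- c(
  "/relative/path/to/folder/abcJKLx1.sav",
  "/relative/path/to/folder/adeJKLy1.sav",
  "/relative/path/to/folder/afgMNOx1.sav",
  "/relative/path/to/folder/abcMNOy1.sav"
)

# basename() drops the directory part; characters 4 to 6 of the
# file name are the unknown three-letter sequence
key <- substr(basename(file.vector), 4, 6)

# split() returns a named list with one character vector per key
path.list <- split(file.vector, key)
print(path.list)
```

The list elements are named after the extracted sequence, so path.list$JKL holds the two JKL files.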
Let's say I want a Regex expression that will only match numbers between 18 and 31. What is the right way to do this?
I have a set of strings that look like this:
"quiz.18.player.total_score"
"quiz.19.player.total_score"
"quiz.20.player.total_score"
"quiz.21.player.total_score"
I am trying to match only the strings that contain the numbers 18-31, and am currently trying something like this
(quiz.)[1-3]{1}[1-9]{1}.player.total_score
This obviously won't work because it will actually match all numbers between 11-39. What is the right way to do this?
Regex: 1[89]|2\d|3[01]
To match the full strings, add the surrounding text and escape the dots:
quiz\.(?:1[89]|2\d|3[01])\.player\.total_score
Details:
(?:) non-capturing group
[] match a single character present in the list
| or
\d matches a digit (equal to [0-9])
\. dot
. matches any character
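A sketch of how this pattern might be used from R. Inside an R string literal each backslash has to be doubled, and perl = TRUE is passed here so that `(?:...)` and `\d` are handled by the PCRE engine. The sample vector s and the `^`/`$` anchors are additions for illustration, not part of the answer above.

```r
# Hypothetical sample strings, including values outside 18-31
s <- c(
  "quiz.17.player.total_score",
  "quiz.18.player.total_score",
  "quiz.25.player.total_score",
  "quiz.31.player.total_score",
  "quiz.32.player.total_score"
)

# Same regex as above with doubled backslashes; ^ and $ are added
# (an assumption) so the whole string has to match
pattern <- "^quiz\\.(?:1[89]|2\\d|3[01])\\.player\\.total_score$"

matched <- s[grepl(pattern, s, perl = TRUE)]
print(matched)
```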
1) If s is the character vector, read the fields into a data frame, pick off the second field, and check whether it is in the desired range. Then use the resulting logical vector to subset s. This uses no regular expressions and only base R.
digits <- read.table(text = s, sep = ".")$V2
s[digits %in% 18:31]
2) Another approach based on the pattern "\\D" matching any non-digit is to remove all such characters and then check if what is left is in the desired range:
digits <- gsub("\\D", "", s)
s[digits %in% 18:31]
2a) In R 3.6.0 and later we could alternatively use the whitespace argument of trimws like this:
digits <- trimws(s, whitespace = "\\D")
s[digits %in% 18:31]
3) Another alternative is to simply construct the boundary strings and compare s to them. This will work only if all the number parts in s are exactly the same number of digits (which for the sample shown in the question is the case).
ok <- s >= "quiz.18.player.total_score" & s <= "quiz.31.player.total_score"
s[ok]
This is done using character ranges and alternations. For your range
3[10]|[2][0-9]|1[8-9]
I have a CSV file where numeric values are stored in a way like this:
+000000000000000000000001101.7100
The number above is 1101.71. This string is always the same length, so the number of zeroes before the actual number depends on the number's length.
How can I drop the + and all 0s before the actual number so I can then convert it to numeric easily?
If the number itself is of fixed width (here, the last nine characters), then substring will be a faster option
as.numeric(substring(str1, nchar(str1)-8))
#[1] 1101.71
but if we don't know how many 0s there will be at the beginning, then another option is sub, where we match a + at the start (^) of the string followed by zero or more 0s (0*) and replace the match with an empty string ("")
as.numeric(sub("^\\+0*", "", str1))
#[1] 1101.71
Note that we escape the + because it is a metacharacter meaning one or more of the preceding element.
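A small sketch of the sub approach on values whose numeric part varies in length. Only the first string comes from the question; the other two are invented for illustration:

```r
# First value is from the question; the others are hypothetical
str1 <- c(
  "+000000000000000000000001101.7100",
  "+000000000000000000000000012.5000",
  "+000000000000000000000000000.2500"
)

# Strip the leading "+" and every 0 that follows it
stripped <- sub("^\\+0*", "", str1)

# The third value becomes ".2500", which as.numeric() still parses
num <- as.numeric(stripped)
print(num)
```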
I may be missing an important point, but my best try would be like this:
1) read the values as character
2) use substr to get rid of the first character, namely the plus sign
3) convert the column with as.numeric (as.integer would drop the decimal part); this way we safely lose any leading zeroes
Consider a character vector
test <- c('ab12','cd3','ef','gh03')
I need all elements of test to contain 4 characters (nchar(test[i])==4). If the actual length of the element is less than 4, the remaining places should be filled with zeroes. So, the result should look like this
> 'ab12','cd30','ef00','gh03'
My question is similar to this one. Yet, I need to work with a character vector.
We can use base R functions to pad 0 at the end of each string so that all elements have the same number of characters. Calling format with width set to the max of nchar (number of characters) of the vector gives an output with trailing spaces (as format left-justifies character vectors by default). Then, we can replace each space with '0' using gsub. The pattern in the gsub matches a single whitespace character (\\s) and the replacement is 0.
gsub("\\s", "0", format(test, width=max(nchar(test))))
#[1] "ab12" "cd30" "ef00" "gh03"
Or, if we are using a package solution, str_pad does this more easily, as it also has an argument to specify the pad character.
library(stringr)
str_pad(test, max(nchar(test)), side="right", pad="0")
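Another base R possibility, shown here only as a sketch, is to build the padding directly with strrep (available in R 3.3.0 and later) and paste it onto each element. Unlike the gsub approach, this does not touch any whitespace already inside the strings. The names width and padded are introduced for this example.

```r
test <- c("ab12", "cd3", "ef", "gh03")

# Target width: the longest element in the vector
width <- max(nchar(test))

# strrep() is vectorised over the repeat count, so each element
# gets exactly as many zeroes as it is short of the target width
padded <- paste0(test, strrep("0", width - nchar(test)))
print(padded)
```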