I want to write a function that takes a character vector (whose strings include numbers) as input and left-pads the numbers in it with zeroes. For example, this could be an input vector:
x<- c("abc124.kk", "77kk-tt", "r5mm")
x
[1] "abc124.kk" "77kk-tt" "r5mm"
Each string of the input vector contains only one number, but the numbers are in different positions (some at the end, some in the middle, ...).
I want the output to look like this:
"abc124.kk" "077kk-tt" "r005mm"
That is, add as many leading zeros to the number in each string as needed so that it has as many digits as the longest number.
But I want a function that does this for any input vector of strings, not only for my example (the x vector).
I have already started by extracting the numbers and letters and padding the numbers the way I want them, but how can I put them back together in the right position?
my_function<- function(x){
letters<- str_extract_all(x,"[a-z]+")
numbers<- str_extract_all(x, "[0-9]+")
digit_width<-max(nchar(numbers))
numbers_correct<- str_pad(numbers, width=digit_width, pad="0")
}
And what if I have a vector that includes some strings without numbers? How can I exclude them and get them back unchanged?
For example, if the input were
y<- c("12ab", "cd", "ef345")
the numbers variable looks like this:
[[1]]
[1] "12"
[[2]]
character(0)
In this case I would want the output to look like this:
"012ab" "cd" "ef345"
One option is to use gsubfn to capture the digits, convert them to numeric, and pass them to sprintf for formatting (the width of 3 is hard-coded in the format string here):
library(gsubfn)
gsubfn("([0-9]+)", ~ sprintf("%03d", as.numeric(x)), x)
#[1] "abc124.kk" "077kk-tt" "r005mm"
x <- c("12ab", "cd", "ef345")
s = gsub("\\D", "", x)
n = nchar(s)
max_n = max(n)
sapply(seq_along(x), function(i){
if (n[i] < max_n) {
zeroes = paste(rep(0, max_n - n[i]), collapse = "")
gsub("\\d+", paste0(zeroes, s[i]), x[i])
} else {
x[i]
}
})
#[1] "012ab" "cd" "ef345"
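Both answers can also be folded into one base-R function that works out the padding width from the data. This is only a sketch under the assumption stated in the question (at most one run of digits per string). The name pad_numbers is made up for illustration.

```r
# Hypothetical helper (not from the answers above): left-pad the first run of
# digits in each string to the width of the longest run found in the vector.
pad_numbers <- function(x) {
  m <- regexpr("[0-9]+", x)            # position of the first digit run, -1 if none
  digits <- regmatches(x, m)           # matched runs only; strings without digits are skipped
  if (length(digits) == 0) return(x)   # nothing to pad
  width <- max(nchar(digits))
  padded <- paste0(strrep("0", width - nchar(digits)), digits)
  regmatches(x, m) <- padded           # write the padded runs back in place
  x
}

pad_numbers(c("abc124.kk", "77kk-tt", "r5mm"))
pad_numbers(c("12ab", "cd", "ef345"))
```

Strings without digits, such as "cd", come back unchanged because regmatches<- only touches the elements that matched.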
I have a vector
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
In this vector, I want to do two things:
Remove any numbers from an element that contains both numbers and letters and then
If a group of letters is followed by another group of letters, merge them into one.
So the above vector will look like this:
'1.2','asdgkd','232','4343','zyzfva','3213','1232','dasd'
I thought I would first find the alphanumeric elements and remove the numbers from them using gsub.
I tried this
gsub('[0-9]+', '', myVec[grepl("[A-Za-z]+$", myVec, perl = T)])
"asd" "gkd" ".zyz" "fva" "dasd"
i.e. it retains the . which I don't want.
This seems to return what you are after
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
clean <- function (x) {
is_char <- grepl("[[:alpha:]]", x)
has_number <- grepl("\\d", x)
mixed <- is_char & has_number
x[mixed] <- gsub("[\\d\\.]+","", x[mixed], perl=T)
grp <- cumsum(!is_char | (is_char & !c(FALSE, head(is_char, -1))))
unname(tapply(x, grp, paste, collapse=""))
}
clean(myVec)
# [1] "1.2" "asdgkd" "232" "4343" "zyzfva" "3213" "1232" "dasd"
Here we look for elements that mix numbers and letters and remove the numbers (and dots) from them. Then we define groups for collapsing: an element made of letters that comes right after another element made of letters is put in the same group as its predecessor. Finally, we collapse all the values in the same group.
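To make the grouping step easier to follow, here is a small sketch that recomputes only the grp vector for myVec. The intermediate names prev_is_char and starts_group are introduced here for illustration and are not part of the answer's code.

```r
# is_char as produced by grepl("[[:alpha:]]", myVec) for the example vector
is_char <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)

# Was the previous element made of letters? (the first element has no predecessor)
prev_is_char <- c(FALSE, head(is_char, -1))

# A new group starts at every non-letter element and at every letter element
# that does not directly follow another letter element
starts_group <- !is_char | (is_char & !prev_is_char)

# Running count of group starts gives the group id of each element
grp <- cumsum(starts_group)
grp
# [1] 1 2 2 3 4 5 5 6 7 8
```

Elements 2 and 3 ("asd", "gkd") share group 2, and elements 6 and 7 ("zyz", "fva") share group 5, which is why tapply pastes them together.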
Here's my regex-only solution:
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
# find all elements containing letters
lettrs = grepl("[A-Za-z]", myVec)
# remove all non-letter characters
myVec[lettrs] = gsub("[^A-Za-z]" ,"", myVec[lettrs])
# paste all elements together, remove delimiter where delimiter is surrounded by letters and split string to new vector
unlist(strsplit(gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", paste(myVec, collapse="|"), perl=TRUE), split="\\|"))
I ran into this issue with some numeric columns in R. Negative values in some columns are written in brackets, so the column is converted to a factor.
How can I remove the brackets in R and make the value negative? E.g. "(265)" to -265.
How can I use the gsub function in R to do this? If any other method is available, please suggest it.
Here is an alternative. The regex matches values that start and end with a round bracket and contain one or more digits in between, and returns the middle group (the digits) with a minus sign in front. The whole lot is then cast to numeric:
as.numeric(gsub("^\\(([0-9]+)\\)$","-\\1",x))
Just in case there is something else going on with numbers:
convert.brackets <- function(x){
if(grepl("\\(.*\\)", x)){
paste0("-", gsub("\\(|\\)", "", x))
} else {
x
}
}
x <- c("123", "(456)", "789")
sapply(x, convert.brackets, USE.NAMES = F)
[1] "123" "-456" "789"
Otherwise, if every value is wrapped in brackets, simply:
paste0("-", gsub("\\(|\\)", "", x))
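Since the question mentions that the column ended up as a factor, here is a vectorised sketch that combines the ideas above: convert to character first, turn a surrounding pair of brackets into a leading minus sign, then cast to numeric. The function name brackets_to_negative is hypothetical, and the sketch assumes the bracketed content is otherwise a plain number.

```r
# Hypothetical helper: "(456)" -> -456, "123" -> 123; also accepts factors
brackets_to_negative <- function(x) {
  x <- as.character(x)                      # factor levels -> strings
  x <- sub("^\\((.*)\\)$", "-\\1", x)       # only values fully wrapped in ( ) change
  as.numeric(x)
}

brackets_to_negative(factor(c("123", "(456)", "789", "(1.5)")))
# [1]  123.0 -456.0  789.0   -1.5
```

Values that are not numbers after the brackets are removed become NA (with a warning) in the final as.numeric step.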
How to split a string into elements of fixed length in R is a commonly asked question, to which typical answers rely either on substring(x) or on strsplit(x, split="") followed by paste(y, collapse = "").
For instance, one would split the string "azertyuiop" into "aze", "rty", "uio", "p" by specifying a fixed length of 3 characters.
I'm looking for the fastest way possible.
After some testing with long strings (> 1000 chars), I have found that substring() is way too slow. The strategy is hence to split the string into individual characters, and then paste them back into groups of the desired length, by applying some cleverness.
Here is the fastest function I could come up with. The idea is to split the string into individual chars, then have a separator interspersed in the character vector at the right positions, collapse the characters (and separators) back into a string, then split the string again, but this time specifying the separator.
splitInParts <- function(string, size) { #can process a vector of strings. "size" is the length of desired substrings
chars <- strsplit(string,"",T)
lengths <- nchar(string)
nFullGroups <- floor(lengths/size) #the number of complete substrings of the desired size
#here we prepare a vector of separators (commas), which we will replace by the characters, except at the positions that will have to separate substring groups of length "size". Assumes that the string doesn't contain any commas.
seps <- Map(rep, ",", lengths + nFullGroups) #so the seps vector is longer than the chars vector, because there are separators (as many as there are groups)
indices <- Map(seq, 1, lengths + nFullGroups) #the positions at which separators will be replaced by the characters
indices <- lapply(indices, function(x) which(x %% (size+1) != 0)) #those exclude the positions at which we want to retain the separators (I haven't found a better way to generate such a vector of indices)
temp <- function(x,y,z) { #a function describing the replacement, because we call it in the Map() call below
x[y] <- z
x
}
res <- Map(temp, seps, indices, chars) #so now we have a vector of chars with separators interspersed
res <- sapply(res, paste, collapse="", USE.NAMES=F) #collapses the characters and separators
res <- strsplit(res, ",", T) #and at last, we can split the strings into elements of the desired length
}
This looks quite tedious, but I have tried to simply put the chars vector into a matrix with the adequate number of rows, then collapse the matrix columns with apply(mat, 2, paste, collapse=""). This is MUCH slower. And splitting the character vector with split() into a list of vectors of the right length, so as to collapse elements, is even slower.
So if you can find something faster, let me know. If not, well my function may be of some use. :)
Was fun reading the updates, so I benchmarked:
> nchar(mystring)
[1] 260000
My idea was nearly the same as @akrun's (str_extract_all uses the same function under the hood, IIRC):
library(stringr)
tensiSplit <- function(string,size) {
str_extract_all(string, paste0('.{1,',size,'}'))
}
And the results on my machine:
> microbenchmark(splitInParts(mystring,3),akrunSplit(mystring,3),splitInParts2(mystring,3),tensiSplit(mystring,3),gsubSplit(mystring,3),times=3)
Unit: milliseconds
expr min lq mean median uq max neval
splitInParts(mystring, 3) 64.80683 64.83033 64.92800 64.85384 64.98858 65.12332 3
akrunSplit(mystring, 3) 4309.19807 4315.29134 4330.40417 4321.38461 4341.00722 4360.62983 3
splitInParts2(mystring, 3) 21.73150 21.73829 21.90200 21.74507 21.98725 22.22942 3
tensiSplit(mystring, 3) 21.80367 21.85201 21.93754 21.90035 22.00447 22.10859 3
gsubSplit(mystring, 3) 53.90416 54.28191 54.55416 54.65966 54.87915 55.09865 3
We can split by specifying a regex lookbehind that matches the position preceded by 'n' characters. For example, if we are splitting by 3 characters, we match the position/boundary preceded by 3 characters ((?<=.{3})).
splitInParts <- function(string, size){
pat <- paste0('(?<=.{',size,'})')
strsplit(string, pat, perl=TRUE)
}
splitInParts(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
splitInParts(str1, 4)
#[[1]]
#[1] "azer" "tyui" "op"
splitInParts(str1, 5)
#[[1]]
#[1] "azert" "yuiop"
Or another approach is using stri_extract_all from library(stringi).
library(stringi)
splitInParts2 <- function(string, size){
pat <- paste0('.{1,', size, '}')
stri_extract_all_regex(string, pat)
}
splitInParts2(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
stri_extract_all_regex(str1, '.{1,3}')
data
str1 <- "azertyuiop"
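For readers without stringi, the same greedy '.{1,n}' pattern can also be used in base R through gregexpr and regmatches. This is a sketch only: the name split_fixed is made up here, and I have not benchmarked it against the functions above.

```r
# Hypothetical base-R counterpart of splitInParts2(): extract successive
# chunks of up to `size` characters from each string
split_fixed <- function(string, size) {
  pat <- paste0(".{1,", size, "}")          # greedy: take `size` chars, fewer at the end
  regmatches(string, gregexpr(pat, string))
}

split_fixed("azertyuiop", 3)
# [[1]]
# [1] "aze" "rty" "uio" "p"
```

Like the other versions, it returns a list with one character vector per input string.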
Alright, there was a faster solution published here (d'oh!)
Simply
strsplit(gsub(paste0("([[:alnum:]]{", size, "})"), "\\1 ", string), " ", fixed=TRUE)
Here a space is used as the separator, so the string must not contain spaces itself.
(I didn't think of [[:alnum:]]{}.)
How can I mark my own question as a duplicate? :(
I have a dataframe comprising two columns of words. For each row I'd like to identify any letters that occur only in the word in the second column, e.g.
carpet carpelt #return 'l'
bag flag #return 'f' & 'l'
dog dig #return 'i'
I'd like to use R to do this automatically as I have 6126 rows.
As an R newbie, the best I've got so far is this, which gives me the unique letters across both words (and is obviously very clumsy):
x<-(strsplit("carpet", ""))
y<-(strsplit("carpelt", ""))
z<-list(l1=x, l2=y)
unique(unlist(z))
Any help would be much appreciated.
The function you’re searching for is setdiff:
chars_for = function (str)
strsplit(str, '')[[1]]
result = setdiff(chars_for(word2), chars_for(word1))
(Note the inverted order of the arguments in setdiff.)
To apply it to the whole data.frame, called x:
apply(x, 1, function (words) setdiff(chars_for(words[2]), chars_for(words[1])))
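As a self-contained illustration, here is the same setdiff idea run on the three example rows. The data frame and its column names (word1, word2) are assumptions made for this sketch, and Map is used in place of apply so that the two columns are walked in parallel.

```r
# Split one word into its individual characters
chars_for <- function(str) strsplit(str, "")[[1]]

# Hypothetical example data mirroring the rows in the question
words <- data.frame(
  word1 = c("carpet", "bag", "dog"),
  word2 = c("carpelt", "flag", "dig"),
  stringsAsFactors = FALSE
)

# For each row: letters of word2 that do not occur in word1
extra <- Map(function(a, b) setdiff(chars_for(b), chars_for(a)),
             words$word1, words$word2)
unname(extra)
# [[1]]
# [1] "l"
#
# [[2]]
# [1] "f" "l"
#
# [[3]]
# [1] "i"
```

Note that setdiff works on sets, so a letter that merely occurs more often in the second word (for example the second "o" in "good" versus "dog") is not reported.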
Use regex :) Paste your word with brackets [] and then use replace function for regex. This regex finds any letter from those in brackets and replaces it with empty string (you can say that it "removes" these letters).
require(stringi)
x <- c("carpet","bag","dog")
y <- c("carplet", "flag", "smog")
pattern <- stri_paste("[",x,"]")
pattern
## [1] "[carpet]" "[bag]" "[dog]"
stri_replace_all_regex(y, pattern, "")
## [1] "l" "fl" "sm"
x <- c("carpet","bag","dog")
y <- c("carpelt", "flag", "dig")
Following (somewhat) with what you were going for with strsplit, you could do
> sx <- strsplit(x, "")
> sy <- strsplit(y, "")
> lapply(seq_along(sx), function(i) sy[[i]][ !sy[[i]] %in% sx[[i]] ])
#[[1]]
#[1] "l"
#
#[[2]]
#[1] "f" "l"
#
#[[3]]
#[1] "i"
This uses %in% to logically match the characters in y with the characters in x. I negate the matching with ! to find those characters that are in y but not in x.
I have strings that look like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!
You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"
You can do this with sub too.
> sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
I used sub instead of gsub to ensure I only got the first match. .? matches any single character, optionally (a plain . would fail to match the case without the leading P). The () define a group that I reference in the replacement as '\\1'. If there were multiple sets of (), I could reference the others too, e.g. with '\\2'. Inside the group (and you had that part of the syntax correct) I want only digits, and exactly 4 of them. The final piece, .*, matches zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself, so you wound up with the same output.
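A related base-R sketch, in case extracting the match directly reads more naturally than deleting everything around it: regexpr locates the first run of exactly four digits and regmatches pulls it out. It assumes every string contains such a run.

```r
x <- c("P2134.asfsafasfs", "P0983.safdasfhdskjaf", "8723.safhakjlfds")

# regexpr() finds the first match per string; regmatches() extracts it
first_four <- regmatches(x, regexpr("[0-9]{4}", x))
first_four
# [1] "2134" "0983" "8723"
```

Unlike sub, which returns a non-matching string unchanged, regmatches drops strings without a match, so the result can be shorter than x.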
This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
strsplit(x, ""),
lapply(gregexpr("\\d", x), "[", 1:4))
Breaking the above line down into pieces:
# this will get you a list of matches of digits, and their location in each x
matches <- gregexpr("\\d", x)
# this gets you each individual digit
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)
Another group-capturing approach that doesn't assume 4 digits.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"
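One thing to watch with the pattern above: its third group needs at least one non-digit after the number, so a string that ends in digits (say "abc123") does not match and comes back unchanged. A variant sketch that also copes with that case is below. The function name first_number is hypothetical.

```r
# Hypothetical variant: keep the first run of digits, wherever it sits
first_number <- function(x) {
  # skip leading non-digits, capture the digit run, discard the rest (if any)
  sub("^[^0-9]*([0-9]+).*$", "\\1", x)
}

first_number(c("P2134.asfsafasfs", "abc123", "8723.safhakjlfds"))
# [1] "2134" "123"  "8723"
```

Strings that contain no digits at all are still returned as they are, because the pattern does not match them.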