Extract from string on varying conditions - r

I am trying to extract characters only and numbers only from a string. Because the positions of these vary, I can't use syntax which relies on the position of the values.
For example, say I have the following column x where values are repeated, but with different numbers:
x <- c("dummy.DR57", "dummy.hour41", "dummy.MAV43", "dummy.SB1")
I want to create two columns:
1: A column with just the characters after the "." but before the numbers:
name <- c("DR", "hour", "MAV", "SB")
2: A column with just the numbers:
number <- c("57", "41", "43", "1")
I've mostly been trying substr and str_sub - but I'm not getting the results I need.
Any help is much appreciated!

x <- c("dummy.DR57", "dummy.hour41", "dummy.MAV43", "dummy.SB1")
(number <- gsub('[[:alpha:]]|[.]', '', x))
# [1] "57" "41" "43" "1"
(name <- gsub("[^.]*[.]|[[:digit:]]", "", x))
# [1] "DR" "hour" "MAV" "SB"

> gsub(x, pattern = '[0-9]|dummy\\.', replacement = '')
[1] "DR" "hour" "MAV" "SB"
> gsub(x, pattern = '[a-zA-Z]|\\.', replacement = '')
[1] "57" "41" "43" "1"

You may try this:
gsub(pattern = "(^.*\\.)([[:alpha:]]+)([[:digit:]]+)",
     replacement = "\\2",
     x = x)
# [1] "DR" "hour" "MAV" "SB"
gsub(pattern = "(^.*\\.)([[:alpha:]]+)([[:digit:]]+)",
     replacement = "\\3",
     x = x)
# [1] "57" "41" "43" "1"
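Both pieces can also be captured in a single pass with base R's regexec and regmatches. This is only a minimal sketch, assuming every element ends in letters followed by digits right after a ".", as in the example; the names m and parts are illustrative and not from the original answers.

```r
x <- c("dummy.DR57", "dummy.hour41", "dummy.MAV43", "dummy.SB1")

# One regex with two capture groups: the letters after a ".", then the trailing digits
m <- regmatches(x, regexec("\\.([[:alpha:]]+)([[:digit:]]+)$", x))

# Each element of m is c(full match, group 1, group 2)
parts <- data.frame(
  name   = vapply(m, `[`, character(1), 2),
  number = vapply(m, `[`, character(1), 3),
  stringsAsFactors = FALSE
)
parts
```

Elements that do not fit the assumed shape come back as NA in both columns, which makes them easy to spot.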

Related

How to pass the length of each element in an R vector to the substr function?

I have the following vector.
v <- c('X100kmph','X95kmph', 'X90kmph', 'X85kmph', 'X80kmph',
'X75kmph','X70kmph','X65kmph','X60kmph','X55kmph','X50kmph',
'X45kmph','X40kmph','X35kmph','X30kmph','X25kmph','X20kmph',
'X15kmph','X10kmph')
I want to extract the digits representing speed. They all start at the 2nd position, but end at different places, so I need (length of element i) - 4 as the ending position.
The following doesn't work as length(v) returns the length of the vector and not of each element.
vnum <- substr(v, 2, length(v)-4)
Tried lengths() as well, but doesn't work.
How can I supply the length of each element to substr?
Context:
v actually represents a character column (called Speed) in a tibble which I'm trying to mutate into the corresponding numeric column.
mytibble <- mytibble %>%
mutate(Speed = as.numeric(substr(Speed, 2, length(Speed) - 4)))
Using nchar() instead of length() as suggested by tmfmnk does the trick!
vnum <- substr(v, 2, nchar(v)-4)
If you just want to extract the digits, then here is another option
vnum <- gsub("\\D","",v)
such that
> vnum
[1] "100" "95" "90" "85" "80" "75" "70" "65" "60" "55"
[11] "50" "45" "40" "35" "30" "25" "20" "15" "10"
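To make the difference between length() and nchar() concrete, here is a small sketch on a shortened version of the vector. It assumes every element has the form "X<digits>kmph"; the name speed is illustrative.

```r
v <- c("X100kmph", "X95kmph", "X10kmph")

length(v)   # number of elements in the vector: 3
nchar(v)    # number of characters in each element: 8 7 7

# substr() is vectorised over start and stop, so a per-element stop works
speed <- as.numeric(substr(v, 2, nchar(v) - 4))
speed
```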

Split a column based on a space between numbers [duplicate]

I have the vector
length
# [1] 15,34, 12,24, 225,
# Levels: 12,24, 15,34, 225,
and I want to separate them by the comma to eventually make a list of these values
Tried:
strsplit(length, ",")
but keep getting the error message
Error in strsplit(length, ",") : non-character argument
Your "length" object is a factor. As the error message indicates, strsplit expects a character vector as input.
Try:
strsplit(as.character(length), ",")
Demo
x <- factor(c("1,2", "3,4", "5,6"))
strsplit(x, ",")
# Error in strsplit(x, ",") : non-character argument
strsplit(as.character(x), ",")
# [[1]]
# [1] "1" "2"
#
# [[2]]
# [1] "3" "4"
#
# [[3]]
# [1] "5" "6"
You could also use (x from @Ananda Mahto's post):
library(stringr)
str_split(x, ",")
# [[1]]
# [1] "1" "2"
#
# [[2]]
# [1] "3" "4"
#
# [[3]]
# [1] "5" "6"
Or
str_extract_all(x, "[0-9]+")
Or
library(stringi)
stri_extract_all_regex(x, "[0-9]+")
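If the end goal is a list of numbers rather than strings, the pieces can be converted after splitting. A minimal sketch, assuming the factor levels contain only comma-separated numbers; the name vals is illustrative.

```r
x <- factor(c("1,2", "3,4", "5,6"))

# is.factor() confirms why strsplit() rejects the input
is.factor(x)

# Convert to character first, split on the literal comma, then convert each piece
vals <- lapply(strsplit(as.character(x), ",", fixed = TRUE), as.numeric)
vals
```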

Getting all characters ahead of first appearance of special character in R

I want to get all characters that are ahead of the first "." if there is one. Otherwise, I want to get back the same character ("8" -> "8").
Example:
v<-c("7.7.4","8","12.6","11.5.2.1")
I want to get something like this:
[1] "7" "8" "12" "11"
My idea was to split each element at "." and then only take the first split. I found no solution that worked...
You can use sub
sub("\\..*", "", v)
#[1] "7" "8" "12" "11"
or a few stringi options:
library(stringi)
stri_replace_first_regex(v, "\\..*", "")
#[1] "7" "8" "12" "11"
# extract vs. replace
stri_extract_first_regex(v, "[^\\.]+")
#[1] "7" "8" "12" "11"
If you want to use a splitting approach, these will work:
unlist(strsplit(v, "\\..*"))
#[1] "7" "8" "12" "11"
# stringi option
unlist(stri_split_regex(v, "\\..*", omit_empty=TRUE))
#[1] "7" "8" "12" "11"
unlist(stri_split_fixed(v, ".", n=1, tokens_only=TRUE))
unlist(stri_split_regex(v, "[^\\w]", n=1, tokens_only=TRUE))
Other sub variations that use a capture group to target the leading characters specifically:
sub("(\\w+).+", "\\1", v) # \w matches [[:alnum:]_] (i.e. alphanumerics and underscores)
sub("([[:alnum:]]+).+", "\\1", v) # exclude underscores
# variations on a theme
sub("(\\w+)\\..*", "\\1", v)
sub("(\\d+)\\..*", "\\1", v) # narrower: \d for digits specifically
sub("(.+?)\\..*", "\\1", v, perl = TRUE) # broader: "." matches any single character; the lazy +? stops at the first "." (a greedy .+ would stop at the last one)
# stringi variation just for fun:
stri_extract_first_regex(v, "\\w+")
scan() would actually work well for this. Since we want everything before the first ., we can use that as a comment character and scan() will remove everything after and including that character, for each element in v.
scan(text = v, comment.char = ".")
# [1] 7 8 12 11
The above returns a numeric vector, which might be where you are headed. If you need to stick with characters, add the what argument to denote we want a character vector returned.
scan(text = v, comment.char = ".", what = "")
# [1] "7" "8" "12" "11"
Data:
v <- c("7.7.4", "8", "12.6", "11.5.2.1")
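The idea from the question (split each element at "." and keep only the first piece) also works in base R. A minimal sketch using the same data; the name first is illustrative.

```r
v <- c("7.7.4", "8", "12.6", "11.5.2.1")

# fixed = TRUE treats "." as a literal dot rather than a regex wildcard
pieces <- strsplit(v, ".", fixed = TRUE)

# Take the first piece of every element; vapply guarantees a character vector
first <- vapply(pieces, `[`, character(1), 1)
first
```

Elements without a "." are returned unchanged, because splitting them yields a single piece.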

R - Select Elements from list that meet the criteria

I had a tough time selecting the elements of a list that satisfy a predicate function, so I am documenting the problem here along with a solution.
check.digits <- function(x){ grepl('^(\\d+)$' , x) }
x = "741 abc pqr street 71 15 41 510741"
lx = strsplit(x, split = " ", fixed = TRUE)
lapply(lx, check.digits)
This does not work -
lx[[1]][c(lapply(lx, check.digits))]
Use -
lx[[1]][sapply(lx, check.digits)]
thanks!!!
Given what you're after, perhaps you should just use gregexpr + regmatches:
regmatches(x, gregexpr("\\d+", x))
# [[1]]
# [1] "741" "71" "15" "41" "510741"
Or, from "qdapRegex", use rm_number:
library(qdapRegex)
rm_number(x, extract = TRUE)
# [[1]]
# [1] "741" "71" "15" "41" "510741"
Or, from "stringi", use stri_extract_all_regex:
library(stringi)
stri_extract_all_regex(x, "\\d+")
# [[1]]
# [1] "741" "71" "15" "41" "510741"
Add an [[1]] at the end if you're just dealing with a single string and are just interested in the single vector.
Use
lx[[1]][sapply(lx, check.digits)]
[1] "741" "71" "15" "41" "510741"
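Since grepl is already vectorised, the apply step can be skipped by working on the character vector directly. The lapply version fails because it returns a list, and a list cannot be used as a subscript. A minimal sketch, with tokens and digits_only as illustrative names:

```r
check.digits <- function(x) grepl("^\\d+$", x)

x <- "741 abc pqr street 71 15 41 510741"

# Pull the character vector out of the list returned by strsplit()
tokens <- strsplit(x, split = " ", fixed = TRUE)[[1]]

# check.digits() returns one TRUE/FALSE per token, usable as a logical index
digits_only <- tokens[check.digits(tokens)]
digits_only
```

Filter(check.digits, tokens) should give the same result if a functional style is preferred.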

If statements in for loop not qualifying vector in R

I wrote a function in R to attach zeros such that any number between 1 and 100 comes out as 001 (1), 010 (10), and 100 (100) but I can't figure out why the if statements aren't qualifying like I would like them to.
id <- 1:11
Attach_zero <- function(id){
  i<-1
  for(i in id){
    if(id[i] < 10){
      id[i] <- paste("00",id[i], sep = "")
    }
    if((id[i] < 100)&&(id[i]>=10)){
      id[i] <- paste("0",id[i], sep = "")
    }
    print(id[i])
  }
}
The output is "001", "2", "3",... "010", "11"
I have no idea why the for loop is skipping middle integers.
The problem here is that you're assigning a character string (e.g. "001") to a numeric vector. When you do this, the entire id vector is converted to character (elements of a vector must be of one type).
So, after comparing 1 to 10 and assigning "001" to id[1], the next element of id is "2" (i.e. the character string "2"). When a comparison includes a character element (e.g. "2" < 10), the numeric operand is coerced to character and alphabetic sorting rules apply. Under these rules both "100" and "10" come before "2", so neither of your if conditions is met. This is the case for every element except 10, which sorts before "100" alphabetically, so your second if condition is met. When you get to 11, neither condition is met once again, since the string "11" comes after the string "100".
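The coercion is easy to check at the console. A small sketch with a shortened id vector (the three-element vector is only for illustration):

```r
id <- 1:3
id[1] <- "001"   # assigning a string coerces the whole vector
class(id)        # "character"
id[2]            # now the string "2", not the number 2

# The number is converted to a string, so these are string comparisons
"2" < 10         # "2" vs "10"
"10" < 100       # "10" vs "100"
"11" < 100       # "11" vs "100"
```

The three comparisons give FALSE, TRUE and FALSE, which matches the pattern in the question's output where only 10 gets padded.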
While there are a couple of ways to fix your function, this functionality exists in R (as mentioned in the comments), both with sprintf and formatC.
sprintf('%03d', 1:11)
formatC(1:11, flag = "0", width = 3)
# [1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010" "011"
For another vectorised approach, you could use nested ifelse statements:
ifelse(id < 10, paste0('00', id), ifelse(id < 100, paste0('0', id), id))
Try this:
id <- 1:11
Attach_zero <- function(id){
  id1 <- id
  for (i in seq_along(id)) {
    if (id[i] < 10) {
      id1[i] <- paste("00", id[i], sep = "")
    }
    if (id[i] < 100 && id[i] >= 10) {
      id1[i] <- paste("0", id[i], sep = "")
    }
  }
  print(id1)
}
If you try your function with id = c(1:3, 6:11):
Attach_zero(id)
##[1] "001"
##[1] "2"
##[1] "3"
##[1] "8"
##[1] "9"
##[1] "010"
##[1] "11"
##Error in if (id[i] < 10) { : missing value where TRUE/FALSE needed
What happens here is that some elements are skipped because of the values i takes. The i <- 1 does nothing, because for (i in id) immediately overwrites it, and that loop gives i the successive values of id rather than its indices. So with id <- c(1:3, 6:11), i takes the values 1, 2, 3, 6, ..., 11: positions 4 and 5 are never visited, and position 10 does not exist, so id[10] is NA and the if fails with the error shown.
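The difference between looping over values and looping over positions can be seen without the loop at all. A minimal sketch, with by_value and by_index as illustrative names:

```r
id <- c(1:3, 6:11)

# What id[i] sees when i runs over the values of id, as in for (i in id)
by_value <- id[id]
by_value

# What id[i] sees when i runs over the positions, as in for (i in seq_along(id))
by_index <- id[seq_along(id)]
by_index
```

by_value skips the 4th and 5th elements and ends in two NAs, while by_index returns every element exactly once.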
Just correcting your function to include all the elements of the id:
Attach_zero <- function(id){
  for(i in 1:length(id)){
    if(id[i] < 10){
      id[i] <- paste("00",id[i], sep = "")
    }
    if((id[i] < 100)&&(id[i]>=10)){
      id[i] <- paste("0",id[i], sep = "")
    }
    print(id[i])
  }
}
Attach_zero(id)
##[1] "001"
##[1] "2"
##[1] "3"
##[1] "6"
##[1] "7"
##[1] "8"
##[1] "9"
##[1] "010"
##[1] "11"
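Note that 6 and 7 now appear in this output, since every element is visited. The padding is still wrong for most elements because id is still converted to character inside the loop.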
And using sprintf as jbaums says, including it in a function:
Attach_zero <- function(id){
  return(sprintf('%03d', id)) # You can replace return with print if you want
}
Attach_zero(id)
## [1] "001" "002" "003" "006" "007" "008" "009" "010" "011"
