R: Extract first two digits from nested list elements

For the following vector, I would like to keep only the first two digits of each integer:
a <- c('1234 2345 345 234', '323 55432 443', '43 23345 321')
I've attempted to do this by converting the vector into a nested list using strsplit and then applying substr to the list:
a <- strsplit(a, ' ')
a <- substr(a, start = 1, stop = 2)
However, this seems to just extract the beginning of the concatenated command:
a
[1] "c(" "c(" "c("
Ideally, I would be able to coerce the vector into the following form:
[[1]]
[1] "12" "23" "34" "23"
[[2]]
[1] "32" "55" "44"
[[3]]
[1] "43" "23" "32"

How about
lapply(strsplit(a, " "), substr, 1, 2)
This explicitly does an lapply over the results of the strsplit, so substr() is applied to each character vector in turn. Your attempt fails because substr() tries to coerce your list to a character vector first (it doesn't expect a list as its first parameter). You can see what it's looking at if you do
as.character(strsplit(a, ' '))
# [1] "c(\"1234\", \"2345\", \"345\", \"234\")" "c(\"323\", \"55432\", \"443\")"
# [3] "c(\"43\", \"23345\", \"321\")"
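To see the difference concretely, here is a minimal sketch using the vector from the question. The names parts and first_two are only illustrative.

```r
a <- c('1234 2345 345 234', '323 55432 443', '43 23345 321')

# strsplit() returns a list with one character vector per input string
parts <- strsplit(a, " ")

# substr() is vectorised over a character vector, so apply it to each list element
first_two <- lapply(parts, substr, start = 1, stop = 2)

first_two[[1]]
# [1] "12" "23" "34" "23"
```

Passing start and stop by name makes it clearer which arguments lapply() forwards to substr().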

We can also extract the first two digits after each word boundary with stringr:
library(stringr)
str_extract_all(a, "\\b\\d{2}")
#[[1]]
#[1] "12" "23" "34" "23"
#[[2]]
#[1] "32" "55" "44"
#[[3]]
#[1] "43" "23" "32"
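If adding a package is not an option, roughly the same result can be had in base R with gregexpr() and regmatches(). This is a sketch rather than a drop-in replacement: it assumes perl = TRUE for the \b word boundary, and like the stringr version it silently skips any number that has only one digit.

```r
a <- c('1234 2345 345 234', '323 55432 443', '43 23345 321')

# Locate every run of two digits that starts at a word boundary
m <- gregexpr("\\b\\d{2}", a, perl = TRUE)

# Pull the matched substrings out, one character vector per input string
first_two <- regmatches(a, m)

first_two[[2]]
# [1] "32" "55" "44"
```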

Related

How could I split a nested list using str_split without the use of for loop?

Now I have a nested list containing something like this
[[1]]
[1] "53" "682, 684" "677" "683"
[[2]]
[1] "40, 43" "10" "44, 47"
and I want to split each element by ", " so that every vector becomes a flat vector of numbers:
[[1]]
[1] "53" "682" "684" "677" "683"
[[2]]
[1] "40" "43" "10" "44" "47"
My current idea is to use apply() with my own recursive function, but how do I detect how many vectors there are in each row?
I'm required not to use a for loop, so what should I do?
Something like this?
I'm not sure, since the format of the sample data in the question is unknown.
#create sample data
v1 <- c("53", "682, 684", "677", "683")
v2 <- c("40, 43", "10", "44, 47")
l <- list( v1, v2 )
#code
lapply( l, function(x) trimws( unlist ( strsplit( x, ",") ) ) )
#output
# [[1]]
# [1] "53" "682" "684" "677" "683"
#
# [[2]]
# [1] "40" "43" "10" "44" "47"
#
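A small variation on the answer above, offered as a sketch: splitting on a comma followed by optional whitespace removes the need for trimws(). The final as.numeric() step is an assumption about what the asker ultimately wants, and split_l and nums are illustrative names.

```r
l <- list(c("53", "682, 684", "677", "683"),
          c("40, 43", "10", "44, 47"))

# Split each string on "," plus any following whitespace, then flatten per list element
split_l <- lapply(l, function(x) unlist(strsplit(x, ",\\s*")))

# Optionally convert each character vector to numbers
nums <- lapply(split_l, as.numeric)

split_l[[2]]
# [1] "40" "43" "10" "44" "47"
```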

How to pass the length of each element in an R vector to the substr function?

I have the following vector.
v <- c('X100kmph','X95kmph', 'X90kmph', 'X85kmph', 'X80kmph',
'X75kmph','X70kmph','X65kmph','X60kmph','X55kmph','X50kmph',
'X45kmph','X40kmph','X35kmph','X30kmph','X25kmph','X20kmph',
'X15kmph','X10kmph')
I want to extract the digits representing speed. They all start at the 2nd position, but end at different places, so I need (length of element i) - 4 as the ending position.
The following doesn't work as length(v) returns the length of the vector and not of each element.
vnum <- substr(v, 2, length(v)-4)
I tried lengths() as well, but that doesn't work either.
How can I supply the length of each element to substr?
Context:
v actually represents a character column (called Speed) in a tibble which I'm trying to mutate into the corresponding numeric column.
mytibble <- mytibble %>%
mutate(Speed = as.numeric(substr(Speed, 2, length(Speed) - 4)))
Using nchar() instead of length() as suggested by tmfmnk does the trick!
vnum <- substr(v, 2, nchar(v)-4)
If you just want to extract the digits, then here is another option
vnum <- gsub("\\D","",v)
such that
> vnum
[1] "100" "95" "90" "85" "80" "75" "70" "65" "60" "55"
[11] "50" "45" "40" "35" "30" "25" "20" "15" "10"
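The distinction between length() and nchar() can be checked on a shortened version of the vector. The following is a minimal sketch; speed and speed2 are illustrative names, and the sub() pattern assumes every element has exactly the form X<digits>kmph.

```r
v <- c("X100kmph", "X95kmph", "X10kmph")

length(v)  # number of elements in the vector: 3
nchar(v)   # number of characters in each element: 8 7 7

# nchar() is vectorised, so each element gets its own stop position
speed <- as.numeric(substr(v, 2, nchar(v) - 4))

# Alternative: capture the digits between the fixed prefix and suffix
speed2 <- as.numeric(sub("^X(\\d+)kmph$", "\\1", v))

speed
# [1] 100  95  10
```

Because nchar() returns one value per element, the same substr() expression also works on a whole column inside mutate().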

Remove elements from a list in R based on a condition

I have a list l in R as shown below. I want to remove elements where the only alphanumeric character is 0. How can I do that?
# Create list
l <- list(c('108', '50', '0]'), c('109','58','0','0]'), c('18','0'))
l
[[1]]
[1] "108" "50" "0]"
[[2]]
[1] "109" "58" "0" "0]"
[[3]]
[1] "18" "0"
# What I want:
l
[[1]]
[1] "108" "50"
[[2]]
[1] "109" "58"
[[3]]
[1] "18"
We can use grepl to match elements that are exactly 0 or that contain a ], and negate (!) to drop them from each list element
lapply(l, function(x) x[!grepl("^0$|\\]", x)])
#[[1]]
#[1] "108" "50"
#[[2]]
#[1] "109" "58"
#[[3]]
#[1] "18"
Or convert to numeric and remove the NA elements along with 0 (suppressWarnings hides the coercion warning for values such as "0]")
lapply(l, function(x) x[!is.na(suppressWarnings(as.numeric(x))) & x != "0"])
Or use setdiff
lapply(l, setdiff, c("0", "0]"))
I believe this is a general purpose way.
l2 <- lapply(l, function(s) {
s <- gsub('[^[:digit:]]', '', s)
s[nchar(sub('([^0]*)0([^0]*)', '\\1\\2', s)) != 0]
})
l2
#[[1]]
#[1] "108" "50"
#
#[[2]]
#[1] "109" "58"
#
#[[3]]
#[1] "18"
An even more general solution, that removes potential elements like "&% 00]" (where the only alphanumeric characters are 0)
lapply(l, function(x) x[grep('^[0[:punct:][:blank:]]*$', x, invert = TRUE)])
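The approaches above differ on edge cases, which a small made-up vector can illustrate. This sketch compares setdiff() with the character-class pattern; the test vector x is hypothetical and was chosen to include a duplicate and a "noisy" zero.

```r
x <- c("108", "50", "50", "0", "0]", "&% 00]")

# setdiff() works on sets: it drops the duplicate "50" and only removes exact matches
by_setdiff <- setdiff(x, c("0", "0]"))
by_setdiff
# [1] "108"    "50"     "&% 00]"

# The pattern flags strings made up solely of zeros, punctuation and blanks
keep <- !grepl("^[0[:punct:][:blank:]]*$", x)
by_regex <- x[keep]
by_regex
# [1] "108" "50"  "50"
```

To apply either version to the whole list, wrap it in lapply() as in the answers above.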

Regular expressions, extract specific parts of pattern

I haven't worked with regular expressions for quite some time, so I'm not sure if what I want to do can be done "directly" or if I have to work around.
My expressions look like the following two:
crb_gdp_g_100000_16_16_ftv_all.txt
crt_r_g_25000_20_40_flin_g_2.txt
Only the parts replaced by an asterisk are "varying"; the other stuff is constant (or irrelevant, as in the case of the last part, after "f*_"):
cr*_*_g_*_*_*_f*_
Is there a straightforward way to get only the values of the asterisk parts? E.g. in the case of "r" or "gdp" I have to include the underscores in the pattern, otherwise I match the r at the beginning of the expression. Including the underscores gives "_r_" or "_gdp_", but I only want "r" or "gdp".
Or in short: I know a lot about my expressions but I only want to extract the varying parts. (How) Can I do that?
You can use sub with captures and then strsplit to get a list of the separated elements:
str <- c("crb_gdp_g_100000_16_16_ftv_all.txt", "crt_r_g_25000_20_40_flin_g_2.txt")
strsplit(sub("cr([[:alnum:]]+)_([[:alnum:]]+)_g_([[:alnum:]]+)_([[:alnum:]]+)_([[:alnum:]]+)_f([[:alnum:]]+)_.+", "\\1.\\2.\\3.\\4.\\5.\\6", str), "\\.")
#[[1]]
#[1] "b" "gdp" "100000" "16" "16" "tv"
#[[2]]
#[1] "t" "r" "25000" "20" "40" "lin"
Note: I replaced \\w with [[:alnum:]] to avoid inclusion of the underscore.
We can also use regmatches and regexec to extract these values like this:
regmatches(str, regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str))
[[1]]
[1] "crb_gdp_g_100000_16_16_ftv_all.txt" "b"
[3] "gdp" "100000"
[5] "16" "16"
[7] "tv"
[[2]]
[1] "crt_r_g_25000_20_40_flin_g_2.txt" "t" "r"
[4] "25000" "20" "40"
[7] "lin"
Note that the first element in each vector is the full string, so to drop that, we can use lapply and "["
lapply(regmatches(str,
regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str)),
"[", -1)
[[1]]
[1] "b" "gdp" "100000" "16" "16" "tv"
[[2]]
[1] "t" "r" "25000" "20" "40" "lin"
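When there are many file names, it may be more convenient to collect the captured groups in a matrix with one row per string. The sketch below builds on the regexec approach; the column names part1 to part6 are hypothetical placeholders, and it assumes every string matches the pattern (a non-matching string yields an empty vector and would break the rbind step).

```r
str <- c("crb_gdp_g_100000_16_16_ftv_all.txt", "crt_r_g_25000_20_40_flin_g_2.txt")
pattern <- "^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$"

# One character vector per string: the full match followed by the six captures
m <- regmatches(str, regexec(pattern, str))

# Drop the full match and stack the captures row by row
fields <- do.call(rbind, lapply(m, `[`, -1))
colnames(fields) <- paste0("part", 1:6)  # hypothetical names

fields[, "part2"]
# [1] "gdp" "r"
```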

R - Select Elements from list that meet the criteria

I had a tough time selecting the elements of a list that satisfy a predicate function, so I'm documenting the problem along with a solution.
check.digits <- function(x){ grepl('^(\\d+)$' , x) }
x = "741 abc pqr street 71 15 41 510741"
lx = strsplit(x, split = " ", fixed = TRUE)
lapply(lx, check.digits)
This does not work -
lx[[1]][c(lapply(lx, check.digits))]
Use -
lx[[1]][sapply(lx, check.digits)]
thanks!!!
Given what you're after, perhaps you should just use gregexpr + regmatches:
regmatches(x, gregexpr("\\d+", x))
# [[1]]
# [1] "741" "71" "15" "41" "510741"
Or, from "qdapRegex", use rm_number:
library(qdapRegex)
rm_number(x, extract = TRUE)
# [[1]]
# [1] "741" "71" "15" "41" "510741"
Or, from "stringi", use stri_extract_all_regex:
library(stringi)
stri_extract_all_regex(x, "\\d+")
# [[1]]
# [1] "741" "71" "15" "41" "510741"
Add a [[1]] at the end if you're dealing with a single string and only want the single vector.
Use
lx[[1]][sapply(lx, check.digits)]
[1] "741" "71" "15" "41" "510741"
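The lapply() attempt fails because lapply() always returns a list, and a list cannot be used as a subscript, whereas sapply() simplifies the result into a plain logical structure. Since grepl() is already vectorised, the wrapper can also be skipped entirely. A minimal sketch, with words and the two result names as illustrative choices:

```r
check.digits <- function(x) grepl("^\\d+$", x)

x <- "741 abc pqr street 71 15 41 510741"

# Take the character vector out of the list that strsplit() returns
words <- strsplit(x, split = " ", fixed = TRUE)[[1]]

# grepl() returns one TRUE/FALSE per word, which indexes the vector directly
digits_only <- words[check.digits(words)]

# Filter() expresses the same selection with the predicate as its first argument
digits_filter <- Filter(check.digits, words)

digits_only
# [1] "741"    "71"     "15"     "41"     "510741"
```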
