I am searching for "1." in a matrix in R using the grep function: grep("1\\.", VADeaths, value = TRUE). However, 41.0 does not show up in the result, even though it is one of the values in one of the columns. Why is that?
If we convert it to character, this becomes more apparent:
as.character(VADeaths)
#[1] "11.7" "18.1" "26.9" "41" "66" "8.7" "11.7" "20.3" "30.9" "54.3"
#[11] "15.4" "24.3" "37" "54.6" "71.1" "8.4" "13.6" "19.3" "35.1" "50"
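The value 41.0 is a whole number, so it is converted to "41" and there is no "." left for the pattern to match.
If we need to get those elements as well, we can add an alternative that matches strings ending in 1 that contain no decimal point: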
grep("1\\.|^[^.]*1$", VADeaths, value = TRUE)
#[1] "11.7" "41" "11.7" "71.1"
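Another way to look at this, sketched here under the assumption that one decimal place is the precision you care about, is to format the numbers yourself before searching so that trailing zeros survive the conversion. The names vals and hits are illustrative only.

```r
# VADeaths ships with base R (datasets package).
# as.character() drops trailing zeros, so 41.0 becomes "41".
# sprintf("%.1f", ...) keeps exactly one decimal place instead.
vals <- sprintf("%.1f", as.vector(VADeaths))

# The simple pattern "a 1 followed by a literal dot" now also finds 41.0.
hits <- grep("1\\.", vals, value = TRUE)
print(hits)
```

The trade-off is that the matches come back in the formatted form ("41.0"), not in the form as.character() would produce.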
I want to extract all the numbers from a string. I was thinking I could use str_extract_all or something similar from the tidyverse, but I am not sure how to do it, because what I get back is not correct.
This is the string:
string1 <- "12, 47, 48 The integers numbers are also interesting: 189 2036 314 \',\' is a separator, so please extract these numbers 125,789,1450 and also these 564,90456. 7890$ per month "
We can use str_extract_all to extract multiple instances of one or more digits (\\d+). The output will be a list of length 1, so we extract the list element with [[.
library(stringr)
str_extract_all(string1, "\\d+")[[1]]
Output:
[1] "12" "47" "48" "189" "2036" "314" "125" "789" "1450" "564" "90456" "7890"
For a base R option, we can use regmatches along with gregexpr:
regmatches(string1, gregexpr("\\d+", string1))[[1]]
[1] "12" "47" "48" "189" "2036" "314" "125" "789" "1450" "564" "90456" "7890"
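Both approaches return character strings. If actual numbers are needed, a conversion step can follow. This is a minimal base R sketch on a shortened, made-up input (the variable names txt, digits and nums are illustrative):

```r
# Shortened example input; the original string works the same way.
txt <- "12, 47 and 125,789 then 7890$ per month"

# gregexpr() finds every run of digits, regmatches() pulls them out.
# The result is a list with one element per input string, hence [[1]].
digits <- regmatches(txt, gregexpr("[0-9]+", txt))[[1]]

# Convert the character matches to numeric values.
nums <- as.numeric(digits)
print(nums)
```

Note that this treats "125,789" as two separate numbers, which is what the question asks for since the comma is a separator.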
I want to shorten a year such as '1984' to '84' in my dataset. I just want to remove the first two digits ('19' or '20') so that only the last two digits remain.
I've tried the following:
gsub('19+', '', year)
gsub('20+', '', year)
These calls also delete the years 1919 and 2020 completely, which is not what I want.
What code can I try while using gsub?
Using 19+ will match a 1 followed by one or more 9s. Using 20+ will match a 2 followed by one or more zeros. As gsub replaces all matches in a string, both patterns together remove 1919 and 2020 entirely, and they will also match, for example, 19999919 or 200.
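The over-matching is easy to reproduce on a few sample values. The vector years below is made up for illustration:

```r
years <- c("1984", "1919", "2020", "2005")

# "19+" means: a 1 followed by one or more 9s, anywhere in the string.
# gsub() removes every occurrence, so "1919" loses both halves.
res19 <- gsub("19+", "", years)
print(res19)

# "20+" means: a 2 followed by one or more 0s, anywhere in the string.
# In "2005" it swallows "200", leaving only "5".
res20 <- gsub("20+", "", years)
print(res20)
```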
You might use a pattern to match either 19 or 20 and capture the last 2 digits in a capture group.
In the replacement, refer to the first capture group with \\1, and put word boundaries (\\b) around the pattern so that the digits cannot be part of a longer string.
gsub('\\b(?:19|20)(\\d\\d)\\b', '\\1', "1984")
Output
[1] "84"
A broader match could be matching 2 digits at the start instead of 19 or 20.
gsub('\\b\\d{2}(\\d{2})\\b', '\\1', "1984")
Using ^ for beginning of string.
gsub("^19|^20", "", year)
# [1] "19" "28" "37" "46" "55" "64" "73" "82" "91" "00" "09" "18"
Alternatively using substring.
substring(year, 3)
# [1] "19" "28" "37" "46" "55" "64" "73" "82" "91" "00" "09" "18"
Data:
year <- seq(1919, 2021, 9)
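If the years are stored as numbers, as in the data above, the same result can also be reached without any string pattern. This is a different technique from the regex answers (plain modulo arithmetic), shown only as a sketch; the name short is illustrative.

```r
year <- seq(1919, 2021, 9)

# year %% 100 keeps the last two digits as a number (2000 -> 0, 2009 -> 9).
# sprintf("%02d", ...) pads with a leading zero so 0 becomes "00".
short <- sprintf("%02d", as.integer(year %% 100))
print(short)
```

This only applies to four-digit numeric years; for character data the gsub or substring versions are the more direct route.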
I have the following vector.
v <- c('X100kmph','X95kmph', 'X90kmph', 'X85kmph', 'X80kmph',
'X75kmph','X70kmph','X65kmph','X60kmph','X55kmph','X50kmph',
'X45kmph','X40kmph','X35kmph','X30kmph','X25kmph','X20kmph',
'X15kmph','X10kmph')
I want to extract the digits representing speed. They all start at the 2nd position, but end at different places, so I need (length of element i) - 4 as the ending position.
The following doesn't work as length(v) returns the length of the vector and not of each element.
vnum <- substr(v, 2, length(v)-4)
Tried lengths() as well, but doesn't work.
How can I supply the length of each element to substr?
Context:
v actually represents a character column (called Speed) in a tibble which I'm trying to mutate into the corresponding numeric column.
mytibble <- mytibble %>%
mutate(Speed = as.numeric(substr(Speed, 2, length(Speed) - 4)))
Using nchar() instead of length() as suggested by tmfmnk does the trick!
vnum <- substr(v, 2, nchar(v)-4)
If you just want to extract the digits, then here is another option
vnum <- gsub("\\D","",v)
such that
> vnum
[1] "100" "95" "90" "85" "80" "75" "70" "65" "60" "55"
[11] "50" "45" "40" "35" "30" "25" "20" "15" "10"
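Since the goal in the question is a numeric column, either result can be wrapped in as.numeric(). The sketch below uses a shortened vector and a plain data frame as a stand-in for the tibble (df is a hypothetical name); inside mutate() the same expression would apply to the Speed column.

```r
v <- c("X100kmph", "X95kmph", "X10kmph")

# nchar() is vectorised: it returns one length per element,
# whereas length() returns the number of elements in the vector.
speed_substr <- as.numeric(substr(v, 2, nchar(v) - 4))

# Alternative: drop every non-digit character, then convert.
speed_gsub <- as.numeric(gsub("\\D", "", v))

# Applied to a column of a data frame.
df <- data.frame(Speed = v, stringsAsFactors = FALSE)
df$Speed <- as.numeric(substr(df$Speed, 2, nchar(df$Speed) - 4))
print(df)
```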
For the following vector, I would like to keep only the first two digits of each integer:
a <- c('1234 2345 345 234', '323 55432 443', '43 23345 321')
I've attempted to do this by converting the vector into a nested list using strsplit and then applying substr to the list:
a <- strsplit(a, ' ')
a <- substr(a, start = 1, stop = 2)
However, this seems to just extract the beginning of each element's c(...) expression:
a
[1] "c(" "c(" "c("
Ideally, I would be able to coerce the vector into the following form:
[[1]]
[1] "12" "23" "34" "23"
[[2]]
[1] "32" "55" "44"
[[3]]
[1] "43" "23" "32"
How about
lapply(strsplit(a, " "), substr, 1, 2)
This explicitly does an lapply over the results of the strsplit. It is needed because substr() tries to coerce your list to a character vector first (it doesn't expect a list as its first parameter). You can see what it's looking at if you do
as.character(strsplit(a, ' '))
# [1] "c(\"1234\", \"2345\", \"345\", \"234\")" "c(\"323\", \"55432\", \"443\")"
# [3] "c(\"43\", \"23345\", \"321\")"
We can also extract the first two digits from a word boundary
library(stringr)
str_extract_all(a, "\\b\\d{2}")
#[[1]]
#[1] "12" "23" "34" "23"
#[[2]]
#[1] "32" "55" "44"
#[[3]]
#[1] "43" "23" "32"
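If stringr is not available, the same word-boundary pattern can be used in base R. This is a sketch that assumes perl = TRUE is acceptable (it is used here so that \\b is handled by the PCRE engine); the name first_two is illustrative.

```r
a <- c("1234 2345 345 234", "323 55432 443", "43 23345 321")

# gregexpr() locates every "word boundary followed by two digits";
# regmatches() returns the matched text as a list, one element per string.
first_two <- regmatches(a, gregexpr("\\b\\d{2}", a, perl = TRUE))
print(first_two)
```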
I haven't worked with regular expressions for quite some time, so I'm not sure if what I want to do can be done "directly" or if I have to work around.
My expressions look like the following two:
crb_gdp_g_100000_16_16_ftv_all.txt
crt_r_g_25000_20_40_flin_g_2.txt
Only the parts replaced by an asterisk vary; everything else is constant (or irrelevant, as in the case of the last part, after "f*_"):
cr*_*_g_*_*_*_f*_
Is there a straightforward way to get only the values of the asterisk parts? For example, in the case of "r" or "gdp" I have to include the underscores in the pattern, otherwise I match the r at the beginning of the expression. Including the underscores gives "_r_" or "_gdp_", but I only want "r" or "gdp".
Or in short: I know a lot about my expressions but I only want to extract the varying parts. (How) Can I do that?
You can use sub with captures and then strsplit to get a list of the separated elements:
str <- c("crb_gdp_g_100000_16_16_ftv_all.txt", "crt_r_g_25000_20_40_flin_g_2.txt")
strsplit(sub("cr([[:alnum:]]+)_([[:alnum:]]+)_g_([[:alnum:]]+)_([[:alnum:]]+)_([[:alnum:]]+)_f([[:alnum:]]+)_.+", "\\1.\\2.\\3.\\4.\\5.\\6", str), "\\.")
#[[1]]
#[1] "b" "gdp" "100000" "16" "16" "tv"
#[[2]]
#[1] "t" "r" "25000" "20" "40" "lin"
Note: I replaced \\w with [[:alnum:]] to avoid inclusion of the underscore.
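The reason \\w is a problem here is that it also matches the underscore, so a greedy \\w+ can run across several fields. The sketch below illustrates this on the second file name with a shortened two-group pattern; perl = TRUE is an assumption made so that the greedy backtracking behaviour is the familiar PCRE one, and the variable names are illustrative.

```r
f <- "crt_r_g_25000_20_40_flin_g_2.txt"

# \\w matches letters, digits AND "_", so the first group greedily
# runs up to the last place where "_<something>_g_" can still follow.
with_w <- sub("^cr(\\w+)_(\\w+)_g_.*", "\\1|\\2", f, perl = TRUE)
print(with_w)

# [^_] cannot cross an underscore, so each group is exactly one field.
no_underscore <- sub("^cr([^_]+)_([^_]+)_g_.*", "\\1|\\2", f, perl = TRUE)
print(no_underscore)
```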
We can also use regmatches and regexec to extract these values like this:
regmatches(str, regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str))
[[1]]
[1] "crb_gdp_g_100000_16_16_ftv_all.txt" "b"
[3] "gdp" "100000"
[5] "16" "16"
[7] "tv"
[[2]]
[1] "crt_r_g_25000_20_40_flin_g_2.txt" "t" "r"
[4] "25000" "20" "40"
[7] "lin"
Note that the first element in each vector is the full string, so to drop that, we can use lapply and "["
lapply(regmatches(str,
regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str)),
"[", -1)
[[1]]
[1] "b" "gdp" "100000" "16" "16" "tv"
[[2]]
[1] "t" "r" "25000" "20" "40" "lin"
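Because every file name yields the same number of parts, the list can be bound into a matrix for easier downstream use. This is a minimal sketch; the column names part1 to part6 are placeholders, not names from the question.

```r
files <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
           "crt_r_g_25000_20_40_flin_g_2.txt")
pattern <- "^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$"

# regexec() returns the full match plus one entry per capture group;
# "[" with -1 drops the full match and keeps the six groups.
parts <- lapply(regmatches(files, regexec(pattern, files)), "[", -1)

# One row per file, one column per varying part.
parts_mat <- do.call(rbind, parts)
colnames(parts_mat) <- paste0("part", 1:6)
print(parts_mat)
```

All entries are still character strings; numeric fields would need as.numeric() on the relevant columns.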