I want to reformat a year like '1984' as '84' in my dataset: remove the leading two digits ('19' or '20') so only the last two digits remain.
I've tried the following:
gsub('19+', '', year)
gsub('20+', '', year)
These calls delete the years 1919 and 2020 entirely, which is not what I want.
What code can I try while using gsub?
Using 19+ will match a 1 followed by one or more 9s. Using 20+ will match a 2 followed by one or more zeros. As gsub replaces all matches in a string, it will match twice in both 1919 and 2020, as it will in, for example, 19999919 or 200.
You might use a pattern to match either 19 or 20 and capture the last 2 digits in a capture group.
In the replacement use the first capture group using \\1, and use word boundaries \b around the pattern to prevent the digits being part of a larger string.
gsub('\\b(?:19|20)(\\d\\d)\\b', '\\1', "1984", perl = TRUE)
Output
[1] "84"
A broader match could be matching 2 digits at the start instead of 19 or 20.
gsub('\\b\\d{2}(\\d{2})\\b', '\\1', "1984")
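As a quick check, here is the broader pattern applied to a small hypothetical vector that includes the problem cases from the question:

```r
year <- c("1984", "2005", "1919", "2020")

# Keep only the last two digits of any standalone 4-digit number
gsub('\\b\\d{2}(\\d{2})\\b', '\\1', year)
# [1] "84" "05" "19" "20"
```

Note that 1919 and 2020 are now handled correctly, since only the first two digits are consumed outside the capture group.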
Using ^ for beginning of string.
gsub("^19|^20", "", year)
# [1] "19" "28" "37" "46" "55" "64" "73" "82" "91" "00" "09" "18"
Alternatively using substring.
substring(year, 3)
# [1] "19" "28" "37" "46" "55" "64" "73" "82" "91" "00" "09" "18"
Data:
year <- seq(1919, 2021, 9)
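An equivalent form groups the alternation so a single anchor applies to both branches; a sketch over the same data:

```r
year <- seq(1919, 2021, 9)

# ^(19|20) anchors once for both alternatives; sub() suffices since
# the century prefix occurs at most once per string
sub("^(19|20)", "", year)
# [1] "19" "28" "37" "46" "55" "64" "73" "82" "91" "00" "09" "18"
```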
I am using the function filter() (from the dplyr library) with this dataset. It contains a numeric variable called "depth_m", which I converted to a character class with sapply (see code below) without any problems.
Now the variable is a character. However, when I filter the dataset on "depth_m" either as == "20" (as a character) or as == 20 (as a number), I obtain the same result. Shouldn't I get an error when filtering by number (== 20)?
Here is my code:
data <- read.table("env.txt", sep = "\t", header = TRUE)
class(data$depth_m)
Output:
[1] "integer"
# Variable transformation
data$depth_m <- sapply(data$depth_m, as.character)
class(data$depth_m)
Output:
[1] "character"
To check the data:
data$depth_m
Output:
[1] "1000" "500" "20" "1" "1000" "500" "20" "1" "1000" "320" "1" "20" "1"
[14] "20" "1" "120" "20" "20" "365" "20" "1" "375" "20" "1" "1000" "500"
[27] "20" "1" "200" "20" "1" "1000" "500" "25" "1" "1000" "500" "25" "1"
[40] "20" "300" "20" "1000" "20"
Here I'm filtering. I expected to get a subset here, because "20" is a character value and it exists in the dataset.
y <- filter(data, depth_m == "20") %>%
select(env_sample, depth_m)
head(y)
Output:
env_sample depth_m
1 Jan_B16_0020 20
2 Jan_B08_0020 20
3 Mar_M03_0020 20
4 Mar_M04_0020 20
5 Mar_M05_0020 20
6 Mar_M06_0020 20
Here I'm filtering again. This time I did not expect to get a subset, because 20 is a number and it isn't correct: it doesn't exist in the (now character) dataset.
y1 <- filter(data, depth_m == 20) %>%
select(env_sample, depth_m)
head(y1)
Output:
env_sample depth_m
1 Jan_B16_0020 20
2 Jan_B08_0020 20
3 Mar_M03_0020 20
4 Mar_M04_0020 20
5 Mar_M05_0020 20
6 Mar_M06_0020 20
Any comment will be helpful. Thank you.
In R, the expression 20 == "20" is valid, though some (from other programming languages) might consider that a little "sloppy". When that is evaluated, it up-classes the 20 to "20" for the comparison. This silent casting can be good (useful and flexible), but it can also cause unintended, undesired, and/or surprising results. (The fact that it's silent is what I dislike about it, but convenience is convenience.)
If you want to be perfectly clear about your comparison, you can test for class as well. In your example, you show 20 which is numeric and not technically integer (which would be 20L), but you can shape the precision of the conditional to your own tastes:
filter(data, is.numeric(depth_m) & depth_m == 20)
This will still up-class the 20 to "20", but because the first portion is.numeric(.) fails, the combination of the two will fail as well. Realize that the specificity of that test is absolute: if the column is indeed character, then you will always get zero rows, which may not be what you want. If instead you want to remove non-20 rows only if they are 20 and numeric, then perhaps
filter(data, !is.numeric(depth_m) | depth_m == 20)
This goes down the dizzying logic of "if it is not numeric, then it obviously cannot truly be 20, so keep it ... but if it is numeric, make sure it is definitely 20". Of course, we run into the premise here that there is no way that one portion of the column can be numeric while another cannot, so ... perhaps that's over-indulging the specificity of filtering.
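The coercion rules above can be seen directly at the console; a minimal illustration:

```r
20 == "20"           # TRUE: the numeric is coerced to character first
identical(20, "20")  # FALSE: identical() never coerces
"100" < "20"         # TRUE: character comparison is lexicographic
```

The last line is the kind of surprise silent coercion can produce: once both sides are character, "100" sorts before "20" because "1" < "2".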
I have the following vector.
v <- c('X100kmph','X95kmph', 'X90kmph', 'X85kmph', 'X80kmph',
'X75kmph','X70kmph','X65kmph','X60kmph','X55kmph','X50kmph',
'X45kmph','X40kmph','X35kmph','X30kmph','X25kmph','X20kmph',
'X15kmph','X10kmph')
I want to extract the digits representing speed. They all start at the 2nd position, but end at different places, so I need (length of element i) - 4 as the ending position.
The following doesn't work as length(v) returns the length of the vector and not of each element.
vnum <- substr(v, 2, length(v)-4)
I tried lengths() as well, but that doesn't work either.
How can I supply the length of each element to substr?
Context:
v actually represents a character column (called Speed) in a tibble which I'm trying to mutate into the corresponding numeric column.
mytibble <- mytibble %>%
mutate(Speed = as.numeric(substr(Speed, 2, length(Speed) - 4)))
Using nchar() instead of length() as suggested by tmfmnk does the trick!
vnum <- substr(v, 2, nchar(v)-4)
If you just want to extract the digits, then here is another option
vnum <- gsub("\\D","",v)
such that
> vnum
[1] "100" "95" "90" "85" "80" "75" "70" "65" "60" "55"
[11] "50" "45" "40" "35" "30" "25" "20" "15" "10"
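Either way, since the goal is a numeric Speed column, the extracted digits still need as.numeric(); a short sketch using the gsub() approach on a few of the sample values:

```r
v <- c('X100kmph', 'X95kmph', 'X10kmph')

# Strip every non-digit character, then convert to numeric
as.numeric(gsub("\\D", "", v))
# [1] 100  95  10
```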
I am searching for "1." in a matrix in R using the grep function: grep("1\\.", VADeaths, value = TRUE). However, it is not showing 41.0 in the result. Why is that? 41.0 is one of the values in one of the columns.
If we convert it to character, this becomes more apparent
as.character(VADeaths)
#[1] "11.7" "18.1" "26.9" "41" "66" "8.7" "11.7" "20.3" "30.9" "54.3"
#[11] "15.4" "24.3" "37" "54.6" "71.1" "8.4" "13.6" "19.3" "35.1" "50"
For 41, the value is a whole number, so its character representation contains no .
If we need to get those elements as well
grep("1\\.|^[^.]*1$", VADeaths, value = TRUE)
#[1] "11.7" "41" "11.7" "71.1"
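The underlying coercion is easy to verify: matching functions convert a numeric input with as.character(), which drops the trailing .0 of a whole number before the pattern is ever applied.

```r
as.character(41.0)     # "41": the trailing ".0" disappears
grepl("1\\.", "41")    # FALSE: no literal dot left to match
grepl("1\\.", "41.0")  # TRUE
```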
For the following vector, I would like to keep only the first two digits of each integer:
a <- c('1234 2345 345 234', '323 55432 443', '43 23345 321')
I've attempted to do this by converting the vector into a nested list using strsplit and then applying substr to the list:
a <- strsplit(a, ' ')
a <- substr(a, start = 1, stop = 2)
However, this just extracts the beginning of each deparsed vector:
a
[1] "c(" "c(" "c("
Ideally, I would be able to coerce the vector into the following form:
[[1]]
[1] "12" "23" "34" "23"
[[2]]
[1] "32" "55" "44"
[[3]]
[1] "43" "23" "32"
How about
lapply(strsplit(a, " "), substr, 1, 2)
This explicitly does an lapply over the results of the strsplit. The original attempt fails because substr() coerces the list to a character vector first (it doesn't expect a list as its first argument). You can see what it is actually operating on if you do
as.character(strsplit(a, ' '))
# [1] "c(\"1234\", \"2345\", \"345\", \"234\")" "c(\"323\", \"55432\", \"443\")"
# [3] "c(\"43\", \"23345\", \"321\")"
We can also extract the first two digits from a word boundary
library(stringr)
str_extract_all(a, "\\b\\d{2}")
#[[1]]
#[1] "12" "23" "34" "23"
#[[2]]
#[1] "32" "55" "44"
#[[3]]
#[1] "43" "23" "32"
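A base R equivalent of str_extract_all(), using gregexpr() with regmatches(); a sketch over the same vector:

```r
a <- c('1234 2345 345 234', '323 55432 443', '43 23345 321')

# gregexpr() locates all matches per string; regmatches() extracts them as a list
regmatches(a, gregexpr("\\b\\d{2}", a))
```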
I haven't worked with regular expressions for quite some time, so I'm not sure if what I want to do can be done "directly" or if I have to work around.
My expressions look like the following two:
crb_gdp_g_100000_16_16_ftv_all.txt
crt_r_g_25000_20_40_flin_g_2.txt
Only the parts replaced by an asterisk are varying; the other parts are constant (or irrelevant, as is the case for the last part, after "f*_"):
cr*_*_g_*_*_*_f*_
Is there a straightforward way to get only the values of the asterisk parts? E.g. in the case of "r" or "gdp" I have to include the underscores, otherwise I also match the r at the beginning of the expression. But including the underscores gives "_r_" or "_gdp_", and I only want "r" or "gdp".
Or in short: I know a lot about my expressions but I only want to extract the varying parts. (How) Can I do that?
You can use sub with captures and then strsplit to get a list of the separated elements:
str <- c("crb_gdp_g_100000_16_16_ftv_all.txt", "crt_r_g_25000_20_40_flin_g_2.txt")
strsplit(sub("cr([[:alnum:]]+)_([[:alnum:]]+)_g_([[:alnum:]]+)_([[:alnum:]]+)_([[:alnum:]]+)_f([[:alnum:]]+)_.+", "\\1.\\2.\\3.\\4.\\5.\\6", str), "\\.")
#[[1]]
#[1] "b" "gdp" "100000" "16" "16" "tv"
#[[2]]
#[1] "t" "r" "25000" "20" "40" "lin"
Note: I replaced \\w with [[:alnum:]] to avoid inclusion of the underscore.
We can also use regmatches and regexec to extract these values like this:
regmatches(str, regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str))
[[1]]
[1] "crb_gdp_g_100000_16_16_ftv_all.txt" "b"
[3] "gdp" "100000"
[5] "16" "16"
[7] "tv"
[[2]]
[1] "crt_r_g_25000_20_40_flin_g_2.txt" "t" "r"
[4] "25000" "20" "40"
[7] "lin"
Note that the first element in each vector is the full string, so to drop that, we can use lapply and "["
lapply(regmatches(str,
regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str)),
"[", -1)
[[1]]
[1] "b" "gdp" "100000" "16" "16" "tv"
[[2]]
[1] "t" "r" "25000" "20" "40" "lin"