R - gsub() for to remove dates from data set - r

I am using the gsub() function to remove the unwanted text from the data. I just want to have the age in the brackets, not the dates of birth. However, this is in a large data set with differing birth days.
Example of the data:
Test1$Age
Sep 10, 1990(27)
Mar 26, 1987(30
Feb 24, 1997(20)

You can do this using str_extract() from the stringr package:
s <- "Sep 10, 1990(27)"
# get the age in parentheses
stringr::str_extract(s, "\\([0-9]+\\)")
# just the age, with parentheses removed
stringr::str_extract(s, "(?<=\\()[0-9]+")
And the output is:
> s <- "Sep 10, 1990(27)"
>
> # get the age in parentheses
> stringr::str_extract(s, "\\([0-9]+\\)")
[1] "(27)"
>
> # just the age, with parentheses removed
> stringr::str_extract(s, "(?<=\\()[0-9]+")
[1] "27"
The first regular expression matches paired parentheses containing one or more digits. The second regular expression uses positive lookbehind to match one or more digits following an opening parenthesis.
If your data is in a data.frame df with the column named age, then you could do the following:
df$age <- stringr::str_extract(df$age, "\\([0-9]+\\)")
Or, in tidyverse notation:
df <- df %>% mutate(age = stringr::str_extract(age, "\\([0-9]+\\)"))

There seems to be two problems:
the date prior to the left parenthesis is not wanted
the right parenthesis is sometimes missing and it needs to be inserted
1) sub These can be addressed with sub. Match
any number of characters .* followed by
a literal left parenthesis [(] followed by
digits in a capture group (\\d+) followed by
an optional right parenthesis [)]?
and then replace that with a left parenthesis, the match to the capture group \\1 and a right parenthesis.
No packages are used.
pat <- ".*[(](\\d+)[)]?"
transform(test, Age = sub(pat, "(\\1)", Age))
If, instead, you wanted the age as a numeric field then:
transform(test, Age = as.numeric(sub(pat, "\\1", Age)))
2) substring/sub Another possibility is to take the 13th character onwards which gives everything from the left parenthesis to the end of the string and insert a ) if missing. )?$ matches a right parenthesis at the end of the string or just the end of the string if none. That is replaced with a right parenthesis. Again, no packages are used.
transform(test, Age = sub(")?$", ")", substring(Age, 13))
A variation of this if we wanted a numeric Age instead would be to take everything from the 14th character and remove the final ) if present.
transform(test, Age = as.numeric(sub(")", "", substring(Age, 14))))
3) read.table Use read.table to read the Age field with sep = "(" and comment.char = ")" and pick off the second column read. This will give the numeric age and we can use sprintf to surround that with parentheses. If Age were character (as opposed to factor) then as.character(Age) could optionally be written as just Age.
Again, no packages are used. This one does not use regular expressions.
transform(test, Age =
sprintf("(%s)", read.table(text = as.character(Age), sep = "(", comment.char = ")")$V2)
Note: The input in reproducible form is:
test <- data.frame(Age = c("Sep 10, 1990(27)", "Mar 26, 1987(30", "Feb 24, 1997(20)"))

Related

Replace matched patterns in a string based on condition

I have a text string containing digits, letters and spaces. Some of its substrings are month abbreviations. I want to perform a condition-based pattern replacement, namely to enclose a month abbreviation in whitespaces if and only if a given condition is fulfilled. As an example, let the condition be as follows: "preceeded by a digit and succeeded by a letter".
I tried stringr package but I fail to combine the functions str_replace_all() and str_locate_all():
# Input:
txt = "START1SEP2 1DECX JANEND"
# Desired output:
# "START1SEP2 1 DEC X JANEND"
# (A) What I could do without checking the condition:
library(stringr)
patt_month = paste("(", paste(toupper(month.abb), collapse = "|"), ")", sep='')
str_replace_all(string = txt, pattern = patt_month, replacement = " \\1 ")
# "START1 SEP 2 1 DEC X JAN END"
# (B) But I actually only need replacements inside the condition-based bounds:
str_locate_all(string = txt, pattern = paste("[0-9]", patt_month, "[A-Z]", sep=''))[[1]]
# start end
# [1,] 12 16
# To combine (A) and (B), I'm currently using an ugly for() loop not shown here and want to get rid of it
You are looking for lookarounds:
(?<=\d)DEC(?=[A-Z])
See a demo on regex101.com.
Lookarounds make sure a certain position is matched without consuming any characters. They are available in front of sth. (called lookbehind) or to make sure anything that follows is of a certain type (called lookahead). You have positive and negative ones on both sides, thus you have four types (pos./neg. lookbehind/-ahead).
A short memo:
(?=...) is a pos. lookahead
(?!...) is a neg. lookahead
(?<=...) is a pos. lookbehind
(?<!...) is a neg. lookbehind
A Base R version
patt_month <- capture.output(cat(toupper(month.abb),"|"))#concatenate all month.abb with OR
pat <- paste0("(\\s\\d)(", patt_month, ")([A-Z]\\s)")#make it a three group thing
gsub(pattern = pat, replacement = "\\1 \\2 \\3", txt, perl =TRUE)#same result as above
Also works for txt2 <- "START1SEP2 1JANY JANEND" out of the box.
[1] "START1SEP2 1 JAN Y JANEND"

How would I remove the text before the initial period, the initial period itself and text after final period in a string?

I need to remove the text before the leading period (as well as the leading period) and the text following the last period from a string.
Given this string for example:
"ABCD.EF.GH.IJKL.MN"
I'd like to get the output:
[1] "IJKL"
I have tried the following:
split_string <- sub("^.*?\\.","", string)
split_string <- sub("^\\.+|\\.[^.]*$", "", string)
I believe I have it working for the period and text after for that string output I want. However, the first line needs to be executed multiple times to remove the text before that period in question e.g. '.I'.
One option in base R is to capture as a group ((...)) the word followed by the dot (\\.) and the word (\\w+) till the end ($) of the string. In the replacement, use the backreference (\\1) of the captured word
sub(".*\\.(\\w+)\\.\\w+$", "\\1", str1)
#[1] "IJKL"
Here, we match characters (.*) till the . (\\. - escaped to get the literal value because . is a metacharacter that will match any character if not escaped), followed by the word captured ((\\w+)), followed by a dot and another word at the end ($)of the string. The replacement part is mentioned above
Or another option is regmatches/regexpr from base R
regmatches(str1, regexpr("\\w+(?=\\.\\w+$)", str1, perl = TRUE))
#[1] "IJKL"
Or another option is word from stringr
library(stringr)
word(str1, -2, sep="[.]")
#[1] "IJKL"
data
str1 <- "ABCD.EF.GH.IJKL.MN"
Here is a janky dplyr version in case the other values are of importance and you want to select them later on, just include them in the "select".
df<- data.frame(x=c("ABCD.EF.GH.IJKL.MN"))
df2<-df %>%
separate(x, into=c("var1", "var2","var3","var4","var5")) %>%
select("var4")
Split into groups at period and take the second one from last.
sapply(strsplit(str1, "\\."), function(x) x[length(x) - 1])
#[1] "IJKL"
Get indices of the periods and use substr to extract the relevant portion
sapply(str1, function(x){
ind = gregexpr("\\.", x)[[1]]
substr(x, ind[length(ind) - 1] + 1, ind[length(ind)] - 1)
}, USE.NAMES = FALSE)
#[1] "IJKL"
These alternatives all use no packages or regular expressions.
1) basename/dirname Assuming the test input s shown in the Note at the end convert the dots to slashes and then use dirname and basename.
basename(dirname(chartr(".", "/", s)))
## [1] "IJKL" "IJKL"
2) strsplit Using strsplit split the strings at dot creating a list of character vectors, one vector per input string, and then for each such vector take the last 2 elements using tail and the first of those using indexing.
sapply(strsplit(s, ".", fixed = TRUE), function(x) tail(x, 2)[1])
## [1] "IJKL" "IJKL"
3) read.table It is not clear from the question what the general case is but if all the components of s have the same number of dot separated fields then we can use read.table to create a data.frame with one row per input string and one column per dot-separated component. Then take the column just before the last.
dd <- read.table(text = s, sep = ".", as.is = TRUE)
dd[[ncol(dd)-1]]
## [1] "IJKL" "IJKL"
4) substr Again, the general case is not clear but if the string of interest is always at character positions 12-15 then a simple solution is:
substr(s, 12, 15)
## [1] "IJKL" "IJKL"
Note
s <- c("ABCD.EF.GH.IJKL.MN", "ABCD.EF.GH.IJKL.MN")

Find and extract year within sentence for each cell in R

I have a large dataframe of 22641 obs. and 12 variables.
The first column "year" includes extracted values from satellite images in the format below.
1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc
From this format, I only want to keep the date which in this case is 19870517 and format it as date (so two different things). Usually, I use the regex to extract the words that I want, but here the date is different for each cell and I have no idea how to replace the above text with only the date. Maybe the way to do this is to search by position within the sentence but I do not know how.
Any ideas?
Thanks.
It's not clear what the "date is different in each cell" means but if it means that the value of the date is different and it is always the 7th field then either of (1) or (2) will work. If it either means that it consists of 8 consecutive digits anywhere in the text or 8 consecutive digits surrounded by _ anywhere in the text then see (3).
1) Assuming the input DF shown in reproducible form in the Note at the end use read.table to read year, pick out the 7th field and then convert it to Date class. No packages are used.
transform(read.table(text = DF$year, sep = "_")[7],
year = as.Date(as.character(V7), "%Y%m%d"), V7 = NULL)
## year
## 1 1987-05-17
2) Another alternative is separate in tidyr. 0.8.2 or later is needed.
library(dplyr)
library(tidyr)
DF %>%
separate(year, c(rep(NA, 6), "year"), extra = "drop") %>%
mutate(year = as.Date(as.character(year), "%Y%m%d"))
## year
## 1 1987-05-17
3) This assumes that the date is the only sequence of 8 digits in the year field use this or if we know it is surrounded by _ delimiters then the regular expression "_(\\d{8})_" can be used instead.
library(gsubfn)
transform(DF,
year = do.call("c", strapply(DF$year, "\\d{8}", ~ as.Date(x, "%Y%m%d"))))
## year
## 1 1987-05-17
Note
DF <- data.frame(year = "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc",
stringsAsFactors = FALSE)
Not sure if this will generalize to your whole data but maybe:
gsub(
'(^(?:.*?[^0-9])?)(\\d{8})((?:[^0-9].*)?$)',
'\\2',
'1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc',
perl = TRUE
)
## [1] "19870517"
This uses group capturing and throws away anything but bounded 8 digit strings.
You can use sub to extract the data string and as.Date to convert it into R's date format:
as.Date(sub(".+?([0-9]+)_[^_]+$", "\\1", txt), "%Y%m%d")
# [1] "1987-05-17"
where txt <- "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc"

Transform suffix into prefix in column name

I would like to move the suffix of a column name to its beginning so that it becomes its prefix. I have many columns with changing names (except the suffix), so manually renaming is not an option.
Example:
set.seed(1)
dat <- data.frame(ID = 1:5,
speed.x.alpha = runif(5),
power.x.alpha = rpois(5, 1),
force.x.alpha = rexp(5),
speed.y.beta = runif(5),
power.y.beta = rpois(5, 1),
force.y.beta = rexp(5))
In the end end the dataframe should have the following column names:
ID, alpha.speed.x, alpha.power.x, alpha.force.x, beta.speed.x, beta.power.x, force.power.x.
I strongly assume I need a gsub/sub expression which allows me to select the characters after the last dot, which I would then paste to the colnames, and eventually remove from the end. So far without success though...
A couple of gsubs and paste0 will do the trick:
gsub("y$", "x", gsub("(^.*)\\.(.*a$)", paste0("\\2", ".", "\\1"), names(dat)))
[1] "ID" "alpha.speed.x" "alpha.power.x" "alpha.force.x" "beta.speed.x"
[6] "beta.power.x" "beta.force.x"
The () in the regular expression capture the characters that match the subexpression. "\." is used to match the literal "." and the "$" anchors the expression to the end of the string. The second argument pastes together the captured sub-expressions. This result is fed to a second gsub which replaces the ending "y" with an "x" if one is found.
to rename the variables, use
names(dat) <- gsub("y$", "x", gsub("(^.*)\\.(.*a$)", paste0("\\2", ".", "\\1"), names(dat)))
Here is one option with sub. We match one or more characters that are not a . ([^.]+) from the start (^) of the string, capture it as group ((...)- inside the braces), followed by a dot (\\. - note that . is a metacharacter which signifies for any character. So, it needs to be escaped (\\) to read it as the literal character or place it inside square brackets), followed by another set of characters that are not a dot (inside the second capture group) followed by another dot and the rest of the characters until the end of the string. In the replacement, we change the order of backreferences of capture groups to get the expected output.
names(dat) <- sub("^([^.]+)\\.([^.]+)\\.(.*)", "\\3.\\1.\\2", names(dat))
names(dat)
#[1] "ID" "alpha.speed.x" "alpha.power.x" "alpha.force.x"
#[5] "beta.speed.y" "beta.power.y" "beta.force.y"

Replace element in vector based on first letter of character string

Consider the vectors below:
ID <- c("A1","B1","C1","A12","B2","C2","Av1")
names <- c("ALPHA","BRAVO","CHARLIE","AVOCADO")
I want to replace the first character of each element in vector ID with vector names based on the first letter of vector names. I also want to add a _0 before each number between 0:9.
Note that the elements Av1 and AVOCADO throw things off a bit, especially with the lowercase v in Av1.
The result should look like this:
res <- c("ALPHA_01","BRAVO_01","CHARLIE_01","ALPHA_12","BRAVO_02","CHARLIE_02", "AVOCADO_01")
I know it should be done with regex but I've been trying for 2 days now and haven't got anywhere.
We can use gsubfn.
library(gsubfn)
#remove the number part from 'ID' (using `sub`) and get the unique elements
nm1 <- unique(sub("\\d+", "", ID))
#using gsubfn, replace the non-numeric elements with the matching
#key/value pair in the replacement
#finally format to add the "_" with sub
sub("(\\d+)$", "_0\\1", gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_02"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
The (\\d+) indicates one or more numeric elements, and (\\D+) is one or more non-numeric elements. We are wrapping it within the brackets to capture as a group and replace it with the backreference (\\1 - as it is the first backreference for the captured group).
Update
If the condition would be to append 0 only to those 'ID's that have numbers less than 10, then we can do this with a second gsubfn and sprintf
gsubfn("(\\d+)", ~sprintf("_%02d", as.numeric(x)),
gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_12"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
Doing this via base R, we can search for second character being V (as in AVOCADO) and substring 2 characters if that's true or 1 character if not. This will capture both AVOCADO and ALPHA. We then match those substrings with the letters extracted from ID (also convert toupper to capture Av with AV). Finally paste _0 along with the number found in each ID
paste0(names[match(toupper(sub('\\d+', '', ID)),
ifelse(substr(names, 2, 2) == 'V', substr(names, 1, 2),
substr(names, 1, 1)))],'_0', sub('\\D+', '', ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_02" "BRAVO_02" "CHARLIE_02" "AVOCADO_01"

Resources