Extracting string before a fixed character position - r

It's a fairly simple question. I have tried multiple combinations, but I am not getting the result I want.
I have a column whose values are statements separated by "-". For rows dated from April onward, I want to extract the text before the fourth instance of "-".
I am using this code, which trims the part before the 4th "-" and returns whatever is left after it.
data$newCol1 <- NA
data$newCol1 <- ifelse(data$date >= as.Date("2019-04-01"), sub(".?-.?-.?-.?-", "", data$Email), ifelse(data$date <= as.Date("2019-03-31"),data$Email,data$newCol1))
However, I want to extract the portion before the 4th "-". For example, if this is my string, "19Q1-XYZ-JA-All-OutR-random-key-March", I want only 19Q1-XYZ-JA-All instead of OutR-random-key-March, which is what I currently get.
This is my dataset
Email date
18Q4-ABC-SEA-CO-TM 1/8/2019
19Q1-DEF-ABJPODTSST 1/16/2019
19Q1-ABC-CMJ 2/8/2019
19Q1-APC-CORP 4/9/2019
19Q1-XYZ-ALP-SEA-MOO ABc_1 5/13/2019
19Q1-WXY-All-SF- Coral 01_24 1/27/2019
19Q1-XYZ-All-SF-Tokyo SF Event 03_14 FINAL Send 3/14/2019
19Q1-XYZ-CN-All-cra-foo world-2901 1/30/2019
19Q1-XYZ-CN-All-get-foo world-2901 1/31/2019
19Q1-XYZ-CN-All-opc-foo world-2901 7/31/2019
19Q1-XYX-FI-AC-DEC-kites 1/21/2019
19Q1-XYZ-JA-All-OutR-random-key-March 7/19/2019
19Q1-XYZ-JA-All-OutR-random-key-March 6/19/2019
19Q1-XYZ-JA-SF-OutR-RFC_ABS-key-March 3/29/2019
19Q1-XYZ-unavailable-random-key-balaji 4/20/2019

An option is to match three sets of characters that are not a "-", each followed by a "-", and then the next set of characters that are not a "-" ([^-]+), capture all of that as a group, and replace the whole string with the backreference (\\1) to that captured group.
data$date <- as.Date(data$date, "%m/%d/%Y")
data$newCol1 <- NA
data$newCol1 <- ifelse(data$date >= as.Date("2019-04-01"),
                       sub("^(([^-]+-){3}[^-]+)-.*", "\\1", data$Email),
                       ifelse(data$date <= as.Date("2019-03-31"), data$Email, data$newCol1))
data
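As a quick check of the pattern above, here is a minimal sketch that applies it to three of the sample strings. The vector name x is only for illustration. Strings with fewer than four "-" do not match the pattern, so sub() returns them unchanged.

```r
# Three sample strings: two with at least four "-", one with only two
x <- c("19Q1-XYZ-JA-All-OutR-random-key-March",
       "19Q1-WXY-All-SF- Coral 01_24",
       "19Q1-ABC-CMJ")

# Keep everything before the 4th "-"; non-matching strings pass through as-is
trimmed <- sub("^(([^-]+-){3}[^-]+)-.*", "\\1", x)
print(trimmed)
```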
data <- structure(list(Email = c("18Q4-ABC-SEA-CO-TM", "19Q1-DEF-ABJPODTSST",
"19Q1-ABC-CMJ", "19Q1-APC-CORP", "19Q1-XYZ-ALP-SEA-MOO ABc_1",
"19Q1-WXY-All-SF- Coral 01_24", "19Q1-XYZ-All-SF-Tokyo SF Event 03_14 FINAL Send",
"19Q1-XYZ-CN-All-cra-foo world-2901", "19Q1-XYZ-CN-All-get-foo world-2901",
"19Q1-XYZ-CN-All-opc-foo world-2901", "19Q1-XYX-FI-AC-DEC-kites",
"19Q1-XYZ-JA-All-OutR-random-key-March", "19Q1-XYZ-JA-All-OutR-random-key-March",
"19Q1-XYZ-JA-SF-OutR-RFC_ABS-key-March", "19Q1-XYZ-unavailable-random-key-balaji"
), date = c("1/8/2019", "1/16/2019", "2/8/2019", "4/9/2019",
"5/13/2019", "1/27/2019", "3/14/2019", "1/30/2019", "1/31/2019",
"7/31/2019", "1/21/2019", "7/19/2019", "6/19/2019", "3/29/2019",
"4/20/2019")), class = "data.frame", row.names = c(NA, -15L))

An easy solution is to use the gregexpr function (see ?gregexpr) to get the positions of all "-" and then extract the string up to the fourth position:
I use the data created by @akrun
result <- sapply(data$Email, function(x) substr(x, 1, gregexpr("-", x)[[1]][4] - 1))
result
This returns NA for any string that contains fewer than four "-", because there is no fourth position to index. You can add an if condition to handle those strings.
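One way to add that condition is sketched below. The helper name before_nth and its arguments are hypothetical, and returning the string unchanged when it has fewer than n separators is an assumption about the desired behaviour.

```r
# Hypothetical helper: text before the n-th occurrence of `sep`.
# Strings with fewer than n separators are returned unchanged (an assumption).
before_nth <- function(x, n = 4, sep = "-") {
  pos <- gregexpr(sep, x, fixed = TRUE)
  vapply(seq_along(x), function(i) {
    p <- pos[[i]]
    if (p[1] == -1 || length(p) < n) {
      x[i]                        # not enough separators: keep as-is
    } else {
      substr(x[i], 1, p[n] - 1)   # cut just before the n-th separator
    }
  }, character(1))
}

res <- before_nth(c("19Q1-XYZ-JA-All-OutR-random-key-March",
                    "19Q1-ABC-CMJ",
                    "18Q4-ABC-SEA-CO-TM"))
print(res)
```

Using vapply with character(1) keeps the result an unnamed character vector, whereas sapply on a character vector would name each element after its input.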

Related

Decoding GS1 string using R

In a dataframe, one column includes a GS1 code scanned from barcodes. A GS1 code is a string including different types of information. Application Identifiers (AI) indicate what type of information the next part of the string is.
Here is an example of a GS1 string: (01)8714729797579(17)210601(10)23919374
the AI is indicated between brackets. In this case (01) means 'GTIN', (17) means 'Expiration Date' and (10) means 'LOT'.
What I like to do in R is create three different columns from the single column, using the AI as the new column names.
I tried using 'separate', but the brackets aren't removed. Why aren't the brackets removed?
df <- data.frame(id =c(1, 2, 3), CODECONTENT = c("(01)871(17)21(10)2391", "(01)579(17)26(10)9374", "(01)979(17)20(10)9193"))
df <- df %>% separate(CODECONTENT, c("GTIN", "Expiration_Date"), "(17)", extra = "merge") %>%
separate(Expiration_Date, c("Expiration Date", "LOT"), "(10)", extra = "merge")
The above returns the following:
  id     GTIN Expiration Date   LOT
1  1 (01)871(            )21( )2391
2  2 (01)579(            )26( )9374
3  3 (01)979(            )20( )9193
I am not sure why the brackets are still there. Besides removing the bracket would there be a smarter way to also remove the first AI (01) in the same code?
Because the parenthesis symbols are special characters, you need to tell the regex to treat them literally. One option is to surround them in square brackets.
df %>%
  separate(col = CODECONTENT,
           sep = "[(]17[)]",
           into = c("gtin", "expiration_date")) %>%
  separate(expiration_date,
           sep = "[(]10[)]",
           into = c("expiration_date", "lot"),
           extra = "merge")
id gtin expiration_date lot
1 1 (01)871 21 2391
2 2 (01)579 26 9374
3 3 (01)979 20 9193
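The question also asks about dropping the leading (01). A minimal base R sketch follows, assuming every code contains exactly the three AIs (01), (17) and (10) in that order; the names pattern, parts, fields and decoded are only for illustration. Here the parentheses are escaped with \\( and \\), which is equivalent to the [(] and [)] form used above.

```r
df <- data.frame(id = c(1, 2, 3),
                 CODECONTENT = c("(01)871(17)21(10)2391",
                                 "(01)579(17)26(10)9374",
                                 "(01)979(17)20(10)9193"),
                 stringsAsFactors = FALSE)

# One capture group per field; the literal AIs are matched but not captured
pattern <- "^\\(01\\)(.*)\\(17\\)(.*)\\(10\\)(.*)$"
parts <- regmatches(df$CODECONTENT, regexec(pattern, df$CODECONTENT))

# Drop the full match (first element) and keep the three captured fields
fields <- do.call(rbind, lapply(parts, function(p) p[-1]))

decoded <- data.frame(id = df$id,
                      gtin = fields[, 1],
                      expiration_date = fields[, 2],
                      lot = fields[, 3],
                      stringsAsFactors = FALSE)
print(decoded)
```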

How can I change a date style in R?

A column in my dataset includes dates such as Month Name and Year. I want to change the month's name to number.
My dataset looks like this (but is not limited to only 3 rows):
I want to change the ldr_start column to this:
ldr_start
3/92
7/93
8/93
Thank you.
It's not really a "date" in either case. The zoo package does define a 'yearmon' class whose use in this case has been illustrated by one of that package's authors, @G.Grothendieck.
Here we can just use strsplit and process the month-character, match against the R ?Constants, month.abb, and then rejoin:
dat <- scan(text="Mar-92,Feb-93,Jul-94,Sep-95", what = "", sep=",")
#Read 4 items
datspl <- strsplit(dat, split="-")
sapply( datspl, function(mnyr){ paste( match(mnyr[1], month.abb), mnyr[2], sep="/")})
#[1] "3/92" "2/93" "7/94" "9/95"
Using the input defined reproducibly in the Note at the end we have some alternatives.
1) yearmon Using the test data in the Note at the end, convert the input column to yearmon class and then format it in the desired fashion. See ?strptime for information on the percent codes.
library(zoo)
transform(DF, start2 = sub("^0", "", format(as.yearmon(start, "%b-%y"), "%m/%y")))
## start start2
## 1 Mar-92 3/92
1a) If it is ok to have a leading zero on one digit months then we can omit the sub and just write:
transform(DF, start2 = format(as.yearmon(start, "%b-%y"), "%m/%y"))
## start start2
## 1 Mar-92 03/92
2) Base R Using only base R we can append 1 to start to make it a valid date (valid dates require year, month and day and not just month and year) and then proceed in a similar way as in (1) or (1a).
transform(DF,
start2 = sub("^0", "", format(as.Date(paste(start, 1), "%b-%y %d"), "%m/%y")))
## start start2
## 1 Mar-92 3/92
Note
DF <- data.frame(start = "Mar-92") # test data frame
We could also use stringr's str_replace_all (note that this keeps the "-" separator rather than changing it to "/"):
library(stringr)
data <- c("Mar-92", "Jul-93", "Aug-93")
str_replace_all(data, setNames(as.character(1:12), month.abb))
Output:
[1] "3-92" "7-93" "8-93"
Update 14/aug: Why this works
As pointed out by @IRTFM this functionality might be unexpected at first glance. However, it can be found in the documentation for the replacement-argument:
To perform multiple replacements in each element of string, pass a named vector (c(pattern1 = replacement1)) to str_replace_all. Alternatively, pass a function to replacement: it will be called once for each match and its return value will be used to replace the match.
The functionality is also evident in the code. If we pass a named vector, the names get assigned to the 'pattern' argument and the values get assigned to the 'replacement' argument, exactly as expected:
if (!is.null(names(pattern))) {
vec <- FALSE
replacement <- unname(pattern)
pattern[] <- names(pattern)
}
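For the slash-separated form requested in the question ("3/92"), a vectorised base R sketch is shown below. It assumes every value has the form Mon-yy with an English three-letter month abbreviation; the names x, parts and out are only for illustration.

```r
x <- c("Mar-92", "Jul-93", "Aug-93")

# Split each value into month and year, then stack into a two-column matrix
parts <- do.call(rbind, strsplit(x, "-", fixed = TRUE))

# Look up the month number in month.abb and rejoin with "/"
out <- paste(match(parts[, 1], month.abb), parts[, 2], sep = "/")
print(out)
```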

Renaming column but capturing number

I would like to rename columns that have the following pattern:
x1_test_thing
x2_test_thing
into:
test_thing_1
test_thing_2
Essentially moving the number to the end while removing the string (x) before it.
If a solution using dplyr and using rename_at() could be suggested that would be great.
If there is a better way to do it i'd definitely love to see it.
Thanks!
Use the dplyr::rename_at function to rename the columns:
The first argument is your data frame.
The second argument selects the columns that match your requirements.
The third argument is the function applied to the column names; any further arguments for that function follow it, separated by commas.
For example, gsub is a string-processing function. Called directly, it would be gsub(x=c("x1_test_thing","x2_test_thing"),pattern = "^.(.)_(test_thing)",replacement = "\\2_\\1"), but inside dplyr::rename_at you pass gsub,pattern = "^.(.)_(test_thing)",replacement = "\\2_\\1" instead, and the column names are supplied as x for you.
pattern = "^.(.)_(test_thing)" uses two pairs of parentheses to capture the second character of the column name, such as "1", and the characters after the underscore, "test_thing". Because (.) captures a single character, this pattern assumes one-digit numbers.
replacement = "\\2_\\1" joins the second captured group ("test_thing"), an underscore "_", and the first captured group (such as "1") to build the desired output, which then replaces the column name.
library(dplyr)
# using test data for example
test <- data.frame(x1_test_thing=c(0),x2_test_thing=c(0))
rename_at(test, vars(contains("test_thing")),gsub,pattern = "^.(.)_(test_thing)",replacement = "\\2_\\1")
We can use readr::parse_number to extract the number from the string.
library(dplyr)
df <- data.frame(x1_test_thing= 1:5, x2_test_thing= 5:1)
df %>%
rename_with(~paste0('test_thing_', readr::parse_number(.)))
# test_thing_1 test_thing_2
#1 1 5
#2 2 4
#3 3 3
#4 4 2
#5 5 1
To rename only those columns that have 'test_thing' in them -
df %>%
rename_with(~paste0('test_thing_', readr::parse_number(.)),
contains('test_thing'))
In base R,
names(df) <- sub('x(\\d+)_.*', 'test_thing_\\1', names(df))
df
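A slightly more general base R sketch captures both the number and the rest of the name, so it does not hard-code test_thing and handles multi-digit numbers. The column names here, including x12_test_thing and other, are made up for illustration.

```r
old_names <- c("x1_test_thing", "x12_test_thing", "other")

# \\1 is the number after the leading "x", \\2 is everything after the underscore.
# Names that do not match the pattern are left unchanged.
new_names <- sub("^x(\\d+)_(.*)$", "\\2_\\1", old_names)
print(new_names)
```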

conditional str_replace based on matching regex within mutate?

For any entries of the column "district" that match regex("[:alpha:]{2}AL"), I would like to replace the "AL" with "01".
For example:
df <- tibble(district = c("NY14", "MT01", "MTAL", "PA10", "KS02", "NDAL", "ND01", "AL02", "AL01"))
I tried:
df %>% mutate(district=replace(district,
str_detect(district, regex("[:alpha:]{2}AL")),
str_replace(district,"AL","01")))
and
df %>% mutate(district=replace(district,
str_detect(district, regex("[:alpha:]{2}AL")),
paste(str_sub(district, start = 1, end = 2),"01",sep = ""))
but there is a vectorization problem.
Is this ok?
str_replace_all(string=df$district,
pattern="(\\w{2})AL",
replacement="\\101")
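I replaced [:alpha:] with \\w, which matches a word character (letters, digits and underscore): https://www.regular-expressions.info/shorthand.html
In the replacement, \\1 refers to the first captured group, (\\w{2}), so the first two characters are kept and 01 is appended.
Alternatively, you can change replace to ifelse. replace() expects only the values for the selected positions, whereas ifelse() picks element-wise from two full-length vectors: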
ifelse( str_detect(df$district, regex("[:alpha:]{2}AL")),
str_replace(df$district,"AL","01"),df$district)
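The same idea can be sketched in base R with an anchored pattern, so that only a four-character code of two letters followed by "AL" is changed. Anchoring with ^ and $ is an assumption about the data (district codes are exactly four characters); the vector names are only for illustration.

```r
district <- c("NY14", "MT01", "MTAL", "PA10", "KS02", "NDAL", "ND01", "AL02", "AL01")

# "\\1" is the two-letter prefix; base R backreferences are single-digit,
# so "\\101" means group 1 followed by the literal "01"
fixed_district <- sub("^([[:alpha:]]{2})AL$", "\\101", district)
print(fixed_district)
```

Because the pattern requires "AL" in positions three and four, codes that start with "AL" such as "AL02" are left alone.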

odd behavior when substituting parts of a string within a for loop

I'm trying to replace a series of numbers in a character string with information that comes from a dataframe.
My string comes from a text file that I imported using the readr package as follows: read_file("Human.txt")
I've checked the class, it is character. The string contains the following information (I've named it treeString):
"(1,2,((((3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
My dataframe (labels.csv) was originally in factor format, but I changed the format of the second column to character using the following command: labels[,2] = as.character(labels[,2]). It looks like this
v1 v2
1 1 name1
2 2 name2
3 3 name3
My goal is to substitute every number in the string with the corresponding name (i.e. V2) in the dataframe. This should result in the following:
"(name1,name2,((((name3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
Here is the code I am using to accomplish this:
for(i in 1:nrow(labels)){
gsub(as.character(i), labels[i,2], treeString)
}
The weird thing is that if I run the gsub() command on its own (with specified numbers - eg. 2) it does the substitution, however, when I run it in a loop it does not substitute the numbers.
As pointed out by Kumar Manglam in the comments, you forgot to assign the result of gsub() back to treeString.
There is something else you should be aware of: with the regular expression as specified in your question, patterns like "(241)" will also be replaced, giving "(name24name1)". To avoid this behaviour, you should check whether the numbers you want to replace are preceded by a comma or opening parenthesis and followed by a comma or closing parenthesis:
# Option1
for(i in 1:nrow(labelnames)){
reg_pattern <- paste0("(?<=[(,])(", i, ")(?=[),])")
treeString <- gsub(reg_pattern, labelnames$v2[i], treeString, perl=TRUE)
}
Another, nicer, option is to drop the for-loop and do it all at once:
# Option2
# Note: the character class [1-n] only works for up to 9 labels, and the
# replacement assumes every label is literally "name" followed by its number
reg_pattern <- paste0("(?<=[(,])([1-", nrow(labelnames), "])(?=[),])")
treeString <- gsub(reg_pattern, "name\\1", treeString, perl=TRUE)
# Result
treeString
# "(name1,name2,((((name3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
Data
treeString <- "(1,2,((((3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
labelnames <- structure(list(v1 = 1:3, v2 = c("name1", "name2", "name3")), .Names = c("v1", "v2"), class = "data.frame", row.names = c(NA, -3L))
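For labels that are not simply "name" plus the number, or when there are more than nine of them, a loop-free sketch is to find every delimited number once and look each one up in the data frame. The shortened tree_string, the labels data frame and the label "name12" are made up for illustration; numbers with no entry in the lookup are kept as they are.

```r
tree_string <- "(1,2,((((3),884),(519,(12,13)))"
labels <- data.frame(v1 = c(1, 2, 3, 12),
                     v2 = c("name1", "name2", "name3", "name12"),
                     stringsAsFactors = FALSE)

# Find every run of digits delimited by "(" or "," and ")" or ","
m <- gregexpr("(?<=[(,])[0-9]+(?=[),])", tree_string, perl = TRUE)
found <- regmatches(tree_string, m)[[1]]

# Look each number up; keep the original text when there is no label for it
idx <- match(found, as.character(labels$v1))
replacement <- ifelse(is.na(idx), found, labels$v2[idx])

# Write the replacements back into the string in one step
regmatches(tree_string, m) <- list(replacement)
print(tree_string)
```

Because each number is matched as a whole token, "12" is looked up as 12 rather than being rewritten digit by digit.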

Resources