Decoding GS1 string using R

In a dataframe, one column contains a GS1 code scanned from barcodes. A GS1 code is a string that packs together different types of information. Application Identifiers (AIs) indicate what type of information the next part of the string holds.
Here is an example of a GS1 string: (01)8714729797579(17)210601(10)23919374
The AI is indicated between brackets. In this case (01) means 'GTIN', (17) means 'Expiration Date' and (10) means 'LOT'.
What I'd like to do in R is create three different columns from the single column, using the AIs as the new column names.
I tried using 'separate', but the brackets aren't removed. Why aren't the brackets removed?
library(dplyr)
library(tidyr)

df <- data.frame(id = c(1, 2, 3),
                 CODECONTENT = c("(01)871(17)21(10)2391", "(01)579(17)26(10)9374", "(01)979(17)20(10)9193"))
df <- df %>%
  separate(CODECONTENT, c("GTIN", "Expiration_Date"), "(17)", extra = "merge") %>%
  separate(Expiration_Date, c("Expiration Date", "LOT"), "(10)", extra = "merge")
The above returns the following:
  id     GTIN Expiration Date   LOT
1  1 (01)871(            )21( )2391
2  2 (01)579(            )26( )9374
3  3 (01)979(            )20( )9193
I am not sure why the brackets are still there. Besides removing the brackets, would there be a smarter way to also remove the first AI (01) in the same code?

Because the parenthesis symbols are special characters in regular expressions, you need to tell the regex to treat them literally. One option is to wrap each one in square brackets (a character class).
df %>%
  separate(col = CODECONTENT,
           sep = "[(]17[)]",
           into = c("gtin", "expiration_date")) %>%
  separate(expiration_date,
           sep = "[(]10[)]",
           into = c("expiration_date", "lot"),
           extra = "merge")
id gtin expiration_date lot
1 1 (01)871 21 2391
2 2 (01)579 26 9374
3 3 (01)979 20 9193
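To also drop the leading (01) AI in one step, a sketch using tidyr::extract on the original df (my addition, not part of the answer above) captures all three values at once:
df %>%
  extract(CODECONTENT,
          into = c("GTIN", "Expiration_Date", "LOT"),
          regex = "\\(01\\)([^(]+)\\(17\\)([^(]+)\\(10\\)(.+)")
#   id GTIN Expiration_Date  LOT
# 1  1  871              21 2391
# 2  2  579              26 9374
# 3  3  979              20 9193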

Related

Renaming column but capturing number

I would like to rename columns that have the following pattern:
x1_test_thing
x2_test_thing
into:
test_thing_1
test_thing_2
Essentially moving the number to the end while removing the "x" before it.
If a solution using dplyr and rename_at() could be suggested, that would be great.
If there is a better way to do it, I'd definitely love to see it.
Thanks!
Use the dplyr::rename_at function to rename columns:
the first parameter is your dataframe;
the second parameter selects the columns matching your requirements;
the third parameter is the function that processes the column names, with any arguments for that function passed after a comma.
For example, gsub is a string-processing function. On its own, the usage would be gsub(x = c("x1_test_thing", "x2_test_thing"), pattern = "^.(.)_(test_thing)", replacement = "\\2_\\1"), but inside dplyr::rename_at it becomes gsub, pattern = "^.(.)_(test_thing)", replacement = "\\2_\\1".
pattern = "^.(.)_(test_thing)" uses the first pair of parentheses to capture the second character of the name, such as "1", and the second pair to capture the characters after the underscore through the end of the string, such as "test_thing".
replacement = "\\2_\\1" concatenates the string captured by the second pair of parentheses ("test_thing"), an underscore "_", and the string captured by the first pair ("1"), and each column name is replaced with the processed string.
library(dplyr)
# using test data for example
test <- data.frame(x1_test_thing = c(0), x2_test_thing = c(0))
rename_at(test, vars(contains("test_thing")), gsub,
          pattern = "^.(.)_(test_thing)", replacement = "\\2_\\1")
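On the test data this should return a data frame with the renamed columns (output sketched by me, not shown in the original answer):
#   test_thing_1 test_thing_2
# 1            0            0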
We can use readr::parse_number to extract the number from the string.
library(dplyr)
df <- data.frame(x1_test_thing= 1:5, x2_test_thing= 5:1)
df %>%
  rename_with(~ paste0('test_thing_', readr::parse_number(.)))
# test_thing_1 test_thing_2
#1 1 5
#2 2 4
#3 3 3
#4 4 2
#5 5 1
To rename only those column that have 'test_thing' in them -
df %>%
  rename_with(~ paste0('test_thing_', readr::parse_number(.)),
              contains('test_thing'))
In base R,
names(df) <- sub('x(\\d+)_.*', 'test_thing_\\1', names(df))
df

Parsing JSON string into columns using R - JSON string with inconsistent ordering of properties

I have a CSV with a JSON string with inconsistent ordering of the fields. So it looks like this:
Row 1: '{"name":"John", "age":30, "car":null}'
Row 2: '{"name":"Chuck", "car":black, "age":25}'
Row 3: '{"car":blue, "age":54, "name":"David"}'
I'm hoping to use R to parse this out into columns with the appropriate data. So I'd like to create a 'name' column, 'age' column, and 'car' column and have them populate with the appropriate data. Is there any way to do this using jsonlite, or would I need to figure out a way to essentially query the JSON string for the property name (car, name, age) and populate that column with the subsequent value?
You can use the jsonlite library; however, in order to parse the data you must make some "adjustments" to your string. Let's say that you have the df as follows:
my_df <- data.frame(column_1 = c(
  '{"name":"John", "age":30, "car":null}',
  '{"name":"Chuck", "car":"black", "age":25}',
  '{"car":"blue", "age":54, "name":"David"}'
))
You must have valid JSON in order to parse the data properly. In this case the target is a JSON array, so the data must be wrapped in [ and ] and each element must be separated by a ,. Be careful with the strings: each value must be quoted as "<string>" (you didn't add the quotes around the blue and black values in your example).
With that in mind we can now make some code:
# Base R
# Add "commas" to sep each element
new_json_str <- paste(my_df$column_1, collapse = ",")
# Add "brackets" to the string
new_json_str <- paste("[", new_json_str, "]")
# Parse the JSON string with jsonlite
jsonlite::fromJSON(new_json_str)
# With dplyr library
my_df %>%
  pull(column_1) %>%        # Get column as "vector"
  paste(collapse = ",") %>% # Add "commas"
  paste("[", ., "]") %>%    # Add "brackets" (`.` is the current value, the comma-separated string)
  jsonlite::fromJSON()      # Parse JSON into a df
# Result
# name age car
# 1 John 30 <NA>
# 2 Chuck 25 black
# 3 David 54 blue
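As a side note (my addition, not part of the original answer): when every row holds exactly one JSON object, jsonlite can also treat the column as line-delimited JSON (NDJSON), which avoids the bracket-and-comma adjustments:
# parse one JSON object per line into a data frame
jsonlite::stream_in(textConnection(as.character(my_df$column_1)))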
Alternatively, the RcppSimdJson package can be used. Depending on the format of the data file we can
either convert the data row by row using fparse()
or read and convert the file in one go using fload()
Converting the data row by row
If the data file json_data_1.csv has the format
{"name":"John", "age":30, "car":null}
{"name":"Chuck", "car":"black", "age":25}
{"car":"blue", "age":54, "name":"David"}
(note that blue and black have been enclosed in double quotes to obey JSON syntax rules)
the JSON data need to be converted row by row, e.g.
library(magrittr) # piping used to improve readability
readLines("json_data_1.csv") %>%
lapply(RcppSimdJson::fparse) %>%
data.table::rbindlist(fill = TRUE)
name age car
1: John 30 <NA>
2: Chuck 25 black
3: David 54 blue
Reading and converting the file in one go
If the data file json_data_2.csv has the format
[
{"name":"John", "age":30, "car":null},
{"name":"Chuck", "car":"black", "age":25},
{"car":"blue", "age":54, "name":"David"}
]
(note the square brackets and the commas which indicate an array in JSON syntax)
the file can be read and converted by one line of code:
RcppSimdJson::fload("json_data_2.csv")
name age car
1 John 30 <NA>
2 Chuck 25 black
3 David 54 blue

Using R separate_rows doesn't work with a "|"

I have a CSV file which has a column containing a variable-length list of items separated by a |.
I use the code below:
violations <- inspections %>% head(100) %>%
  select(`Inspection ID`, Violations) %>%
  separate_rows(Violations, sep = "|")
but this only creates a new row for each character in the field (including spaces).
What am I missing here on how to separate this column?
It's hard to help without a better description of your data and an example of what the correct output would look like. That said, I think part of your confusion is due to the documentation in separate_rows. A similar function, separate, documents its sep argument as:
If character, sep is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.
but the documentation for the sep argument in separate_rows doesn't say the same thing, though I think it has the same behavior. In regular expressions, | has special meaning, so it must be escaped as \\|.
library(tidyr)  # for separate_rows()

df <- tibble::tibble(
  Inspection_ID = c(1, 2, 3),
  Violations = c("A", "A|B", "A|B|C"))
separate_rows(df, Violations, sep = "\\|")
Yields
# A tibble: 6 x 2
Inspection_ID Violations
<dbl> <chr>
1 1 A
2 2 A
3 2 B
4 3 A
5 3 B
6 3 C
Not sure what your data looks like, but you may want to replace sep = "|" with sep = "\\|". Good luck!
Using sep = "\\|" (note the doubled backslash inside an R string) with the separate_rows function allowed me to separate pipe-delimited values.
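Equivalently (my addition, echoing the character-class trick from the GS1 answer above), wrapping the pipe in square brackets makes it literal without backslash escapes:
separate_rows(df, Violations, sep = "[|]")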

Extracting string before a fixed character position

It's a fairly simple question; I tried multiple combinations, however I am not getting to what I want to achieve.
I have a column which has statements separated by "-". I want to extract the words before the fourth instance of "-" for rows from
the month of April onward.
I am using this code, which trims the part up to and including the 4th "-" and returns anything left after that.
data$newCol1 <- NA
data$newCol1 <- ifelse(data$date >= as.Date("2019-04-01"), sub(".*?-.*?-.*?-.*?-", "", data$Email), ifelse(data$date <= as.Date("2019-03-31"), data$Email, data$newCol1))
However, I want to extract the portion before the 4th "-". For example, for the string "19Q1-XYZ-JA-All-OutR-random-key-March" I want only 19Q1-XYZ-JA-All instead of OutR-random-key-March, which is what I get currently.
This is my dataset
Email date
18Q4-ABC-SEA-CO-TM 1/8/2019
19Q1-DEF-ABJPODTSST 1/16/2019
19Q1-ABC-CMJ 2/8/2019
19Q1-APC-CORP 4/9/2019
19Q1-XYZ-ALP-SEA-MOO ABc_1 5/13/2019
19Q1-WXY-All-SF- Coral 01_24 1/27/2019
19Q1-XYZ-All-SF-Tokyo SF Event 03_14 FINAL Send 3/14/2019
19Q1-XYZ-CN-All-cra-foo world-2901 1/30/2019
19Q1-XYZ-CN-All-get-foo world-2901 1/31/2019
19Q1-XYZ-CN-All-opc-foo world-2901 7/31/2019
19Q1-XYX-FI-AC-DEC-kites 1/21/2019
19Q1-XYZ-JA-All-OutR-random-key-March 7/19/2019
19Q1-XYZ-JA-All-OutR-random-key-March 6/19/2019
19Q1-XYZ-JA-SF-OutR-RFC_ABS-key-March 3/29/2019
19Q1-XYZ-unavailable-random-key-balaji 4/20/2019
An option is to match three sets of characters that are not a -, each followed by a - (([^-]+-){3}), plus the next set of non-dash characters ([^-]+), capture all of that as a group, and replace the whole string with the backreference (\\1) of that captured group.
data$date <- as.Date(data$date, "%m/%d/%Y")
data$newCol1 <- NA
data$newCol1 <- ifelse(data$date >= as.Date("2019-04-01"),
                       sub("^(([^-]+-){3}[^-]+)-.*", "\\1", data$Email),
                       ifelse(data$date <= as.Date("2019-03-31"), data$Email, data$newCol1))
data
data <- structure(list(Email = c("18Q4-ABC-SEA-CO-TM", "19Q1-DEF-ABJPODTSST",
"19Q1-ABC-CMJ", "19Q1-APC-CORP", "19Q1-XYZ-ALP-SEA-MOO ABc_1",
"19Q1-WXY-All-SF- Coral 01_24", "19Q1-XYZ-All-SF-Tokyo SF Event 03_14 FINAL Send",
"19Q1-XYZ-CN-All-cra-foo world-2901", "19Q1-XYZ-CN-All-get-foo world-2901",
"19Q1-XYZ-CN-All-opc-foo world-2901", "19Q1-XYX-FI-AC-DEC-kites",
"19Q1-XYZ-JA-All-OutR-random-key-March", "19Q1-XYZ-JA-All-OutR-random-key-March",
"19Q1-XYZ-JA-SF-OutR-RFC_ABS-key-March", "19Q1-XYZ-unavailable-random-key-balaji"
), date = c("1/8/2019", "1/16/2019", "2/8/2019", "4/9/2019",
"5/13/2019", "1/27/2019", "3/14/2019", "1/30/2019", "1/31/2019",
"7/31/2019", "1/21/2019", "7/19/2019", "6/19/2019", "3/29/2019",
"4/20/2019")), class = "data.frame", row.names = c(NA, -15L))
An easy solution is to use the gregexpr function (see ?gregexpr) to get the positions of all "-" characters and then extract the string up to the relevant position.
I use the data created by @akrun:
result <- sapply(data$Email, function(x)substr(x, 1, gregexpr("-",x)[[1]][4]-1))
result
This will generate an NA value for any string that has fewer than four "-" characters; you can modify the code with an if condition to handle them, as in the sketch below.
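A minimal sketch of that if condition (my addition), keeping the full string whenever there is no fourth "-":
result <- sapply(data$Email, function(x) {
  pos <- gregexpr("-", x)[[1]]  # positions of every "-" in the string
  if (length(pos) >= 4) substr(x, 1, pos[4] - 1) else x
})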

replacing repeated strings using regex in R

I have a string as follows:
text <- "http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
I want to eliminate all duplicated addresses, so my expected result is:
expected <- "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
I tried (^[\w|.|:|\/]*),\1+ on regex101.com and it removes the first repetition of the string (but fails at the second). However, if I port it to R's gsub it doesn't work as expected:
gsub("(^[\\w|.|:|\\/]*),\\1+", "\\1", text)
I've tried with perl = FALSE and TRUE to no avail.
What am I doing wrong?
If they are sequential, you just need to modify your regex slightly.
Take out your BOS anchor ^.
Add a cluster group around the comma and backreference, then quantify it (?:,\1)+.
And lose the pipe symbols |: inside a character class, | is just a literal.
([\w.:/]+)(?:,\1)+
https://regex101.com/r/FDzop9/1
( [\w.:/]+ )  # (1), the address
(?:           # Cluster group
  , \1        # Comma followed by what was found in group 1
)+            # End cluster, repeat 1 to many times
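Applied in R (my sketch, not shown in the original answer; perl = TRUE to be safe with the PCRE-style syntax):
gsub("([\\w.:/]+)(?:,\\1)+", "\\1", text, perl = TRUE)
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"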
Note: if you use split and unique then combine, you remove all duplicates rather than only consecutive ones; unique() keeps the items in order of first appearance.
An alternative approach is to split the string on the comma, then unique the results, then re-combine for your single text
paste0(unique(strsplit(text, ",")[[1]]), collapse = ",")
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
text <- c("http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png",
"http://q.co/imag/qrs.png,http://q.co/imag/qrs.png")
df <- data.frame(no = 1:2, text)
You can use functions from tidyverse if your strings are in a dataframe:
library(tidyverse)
separate_rows(df, text, sep = ",") %>%
  distinct() %>%
  group_by(no) %>%
  mutate(text = paste(text, collapse = ",")) %>%
  slice(1)
The output is:
# no text
# <int> <chr>
# 1 1 http://x.co/imag/xyz.png,http://x.co/imag/jpg.png
# 2 2 http://q.co/imag/qrs.png
