Using R separate_rows doesn't work with a "|"

I have a CSV file with a column containing a variable-length list of items separated by a |.
I use the code below:
violations <- inspections %>%
  head(100) %>%
  select(`Inspection ID`, Violations) %>%
  separate_rows(Violations, sep = "|")
but this only creates a new row for each character in the field (including spaces).
What am I missing here on how to separate this column?

It's hard to help without a better description of your data and an example of what the correct output would look like. That said, I think part of your confusion is due to the documentation in separate_rows. A similar function, separate, documents its sep argument as:
If character, sep is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.
but the documentation for the sep argument in separate_rows doesn't say the same thing, though I think it has the same behavior. In regular expressions, | has a special meaning, so it must be escaped as "\\|".
library(tibble)
library(tidyr)

df <- tibble(
  Inspection_ID = c(1, 2, 3),
  Violations = c("A", "A|B", "A|B|C"))
separate_rows(df, Violations, sep = "\\|")
Yields
# A tibble: 6 x 2
  Inspection_ID Violations
          <dbl> <chr>
1             1 A
2             2 A
3             2 B
4             3 A
5             3 B
6             3 C

Not sure what your data looks like, but you may want to replace sep = "|" with sep = "\\|". Good luck!
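For reference, here is a sketch of the pipeline from the question with the escaped separator (inspections and the column names are assumed from the question, not data I have seen):
violations <- inspections %>%
  head(100) %>%
  select(`Inspection ID`, Violations) %>%
  separate_rows(Violations, sep = "\\|")  # escape | so it is treated literally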

Using sep = "\\|" with the separate_rows function allowed me to separate pipe-delimited values.

Related

Decoding GS1 string using R

In a dataframe, one column contains GS1 codes scanned from barcodes. A GS1 code is a string comprising different types of information. Application Identifiers (AI) indicate what type of information the next part of the string is.
Here is an example of a GS1 string: (01)8714729797579(17)210601(10)23919374
The AI is indicated between brackets. In this case (01) means 'GTIN', (17) means 'Expiration Date' and (10) means 'LOT'.
What I would like to do in R is create three different columns from the single column, using the AIs as the new column names.
I tried using 'separate', but the brackets aren't removed. Why aren't the brackets removed?
df <- data.frame(id = c(1, 2, 3),
                 CODECONTENT = c("(01)871(17)21(10)2391", "(01)579(17)26(10)9374", "(01)979(17)20(10)9193"))
df <- df %>%
  separate(CODECONTENT, c("GTIN", "Expiration_Date"), "(17)", extra = "merge") %>%
  separate(Expiration_Date, c("Expiration Date", "LOT"), "(10)", extra = "merge")
The above returns the following:
  id     GTIN Expiration Date   LOT
1  1 (01)871(            )21( )2391
2  2 (01)579(            )26( )9374
3  3 (01)979(            )20( )9193
I am not sure why the brackets are still there. Besides removing the brackets, would there be a smarter way to also remove the first AI (01) in the same code?
Because the parenthesis symbols are special characters, you need to tell the regex to treat them literally. One option is to surround them in square brackets.
df %>%
  separate(col = CODECONTENT,
           sep = "[(]17[)]",
           into = c("gtin", "expiration_date")) %>%
  separate(expiration_date,
           sep = "[(]10[)]",
           into = c("expiration_date", "lot"),
           extra = "merge")
  id    gtin expiration_date  lot
1  1 (01)871              21 2771
2  2 (01)579              26 9374
3  3 (01)979              20 9193
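As for also dropping the leading (01), one option (a sketch, not part of the original answer; it starts from the original df) is to capture all three parts in a single call to tidyr::extract, which keeps only what the capture groups match:
df %>%
  extract(CODECONTENT,
          into = c("gtin", "expiration_date", "lot"),
          regex = "\\(01\\)(.*)\\(17\\)(.*)\\(10\\)(.*)")
#   id gtin expiration_date  lot
# 1  1  871              21 2391
# 2  2  579              26 9374
# 3  3  979              20 9193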

Decimal read in does not change

I am trying to read in a .csv file with a column of values such as 2.771 and 3.783, where the period acts as a thousands separator.
These values are meant to represent thousands of hours, not two or three hours and so on.
When I try to change the read-in options through
read.csv(file, sep = ";", dec = ".")
nothing changes. It doesn't matter whether I set dec = "." or dec = ",", the values are always read as small decimals rather than thousands.
You can use the following code:
library(readr)
df <- read_csv('data.csv', locale = locale(grouping_mark = "."))
df
Output:
# A tibble: 4 × 1
  `X-ray`
    <dbl>
1    2771
2    3783
3    1267
4    7798
As you can see, the values are now thousands.
An elegant way (in my opinion) is to create a new class, which you then use when reading the data in.
This way you stay flexible when your data is (really) messed up and the decimal/thousands separator is not consistent across all (numeric) columns.
# Define a new class of numbers
setClass("newNumbers")
# Define substitution of dots to nothing
setAs("character", "newNumbers", function(from) as.numeric(gsub("\\.", "", from)))
# Now read
str(data.table::fread( "test \n 1.235 \n 1.265", colClasses = "newNumbers"))
# Classes ‘data.table’ and 'data.frame': 2 obs. of 1 variable:
# $ test: num 1235 1265
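A small related note (my assumption, not from the answer above): the same class definition should also work with base R's read.csv(), since read.table() converts unrecognized colClasses via methods::as().
# assuming the setClass()/setAs() definitions above; the file name and the
# single numeric column are hypothetical
df <- read.csv("data.csv", sep = ";", colClasses = "newNumbers")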
The solution proposed by Quinten will work; however, it's worth adding that the function designed to parse numbers containing a grouping mark is col_number().
with(asNamespace("readr"),
     read_delim(
       I("X-ray hours\n---\n2.771\n3.778\n3,21\n"),
       delim = ";",
       col_names = c("x_ray_hours"),
       col_types = cols(x_ray_hours = col_number()),
       na = c("---"),
       skip = 1
     ))
There is no need to define a specific locale to handle this one case. A locale setting would also apply to the whole dataset, whereas the intention here is to handle only that specific column. From the docs:
?readr::parse_number
This drops any non-numeric characters before or after the first number.
Also, if the columns use ; as a separator, read_delim is more appropriate.
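For completeness, here is a minimal sketch (values taken from the output above) of applying the same idea to a single column that has already been read in as character:
library(readr)
x <- c("2.771", "3.783", "1.267", "7.798")
parse_number(x, locale = locale(grouping_mark = "."))
# [1] 2771 3783 1267 7798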

Renaming column but capturing number

I would like to rename columns that have the following pattern:
x1_test_thing
x2_test_thing
into:
test_thing_1
test_thing_2
Essentially moving the number to the end while removing the string (x) before it.
If a solution using dplyr and rename_at() could be suggested, that would be great.
If there is a better way to do it, I'd definitely love to see it.
Thanks!
Use the dplyr::rename_at function to rename columns:
The first parameter is your dataframe.
The second parameter selects the columns matching your requirements.
The third parameter is the function used to process the column names; any additional arguments for that function are placed after it, separated by commas.
For example, gsub is a function for processing strings. On its own you would call it as gsub(x = c("x1_test_thing", "x2_test_thing"), pattern = "^.(.)_(test_thing)", replacement = "\\2_\\1"), but inside dplyr::rename_at you pass it as gsub, pattern = "^.(.)_(test_thing)", replacement = "\\2_\\1".
pattern = "^.(.)_(test_thing)" uses the first pair of parentheses to capture the second character of the name (such as "1") and the second pair to capture the characters after the underscore (such as "test_thing").
replacement = "\\2_\\1" concatenates the string captured by the second pair of parentheses ("test_thing"), an underscore "_", and the string captured by the first pair ("1"); the column names are then replaced with the processed strings.
library(dplyr)
# using test data for example
test <- data.frame(x1_test_thing = c(0), x2_test_thing = c(0))
rename_at(test, vars(contains("test_thing")), gsub, pattern = "^.(.)_(test_thing)", replacement = "\\2_\\1")
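For reference (not shown in the original answer), with the test data above this should return the same data frame with renamed columns:
#   test_thing_1 test_thing_2
# 1            0            0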
We can use readr::parse_number to extract the number from the string.
library(dplyr)
df <- data.frame(x1_test_thing= 1:5, x2_test_thing= 5:1)
df %>%
rename_with(~paste0('test_thing_', readr::parse_number(.)))
#  test_thing_1 test_thing_2
#1            1            5
#2            2            4
#3            3            3
#4            4            2
#5            5            1
To rename only those columns that have 'test_thing' in them -
df %>%
rename_with(~paste0('test_thing_', readr::parse_number(.)),
contains('test_thing'))
In base R,
names(df) <- sub('x(\\d+)_.*', 'test_thing_\\1', names(df))
df

use dplyr to combine columns of data.frame when column names are not known

Given a tibble:
library(tibble)
myTibble <- tibble(a = letters[1:3], b = c(T, F, T), c = 1:3)
I can use transmute to paste the columns, separated by '.':
> library(dplyr)
> transmute(myTibble, concat = paste(a, b, c, sep = "."))
# A tibble: 3 x 1
  concat
  <chr>
1 a.TRUE.1
2 b.FALSE.2
3 c.TRUE.3
If I want to use the above transmute statement in a function that receives a tibble, I won't know the names of the tibble or the number of columns ahead of time. What dplyr syntax would allow me to paste all columns in a tibble separated by a '.'?
Please note, I can do this with something like:
> apply(myTibble, 1, paste, collapse = ".")
[1] "a.TRUE.1" "b.FALSE.2" "c.TRUE.3"
but I am trying to understand dplyr better. So, yes, this is a specific problem I am trying to solve, but I am also stumped as to why I can't solve it with dplyr, which means there is something key about dplyr column selection I don't yet understand, and I'd like to learn, so that is why I'm asking specifically about a dplyr solution.
With a little trial and error:
colNames_as_symbols <- syms(names(myTibble))
transmute(myTibble, concat = paste(!!!colNames_as_symbols, sep = '.'))
Here was the hint that put me on to the solution... From the documentation for !!!:
The big-bang operator !!! forces-splice a list of objects. The
elements of the list are spliced in place, meaning that they each
become one single argument.
vars <- syms(c("height", "mass"))
Force-splicing is equivalent to supplying the elements separately:
starwars %>% select(!!!vars)
starwars %>% select(height, mass)
In fact, the entire documentation entitled "Force parts of an expression" is fascinating reading. It can be accessed by issuing ?qq_show
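To connect this back to the original question, here is a sketch (not from the answer itself) of wrapping the same splice in a function that receives any tibble, since the column names are looked up at call time:
library(dplyr)

concat_cols <- function(tbl, sep = ".") {
  # build symbols from whatever columns the tibble happens to have
  transmute(tbl, concat = paste(!!!syms(names(tbl)), sep = sep))
}
concat_cols(myTibble)
# A tibble: 3 x 1
#   concat
#   <chr>
# 1 a.TRUE.1
# 2 b.FALSE.2
# 3 c.TRUE.3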

replacing repeated strings using regex in R

I have a string as follows:
text <- "http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
I want to eliminate all duplicated addresses, so my expected result is:
expected <- "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
I tried (^[\w|.|:|\/]*),\1+ in regex101.com and it works for removing the first repetition of the string (but fails at the second). However, if I port it to R's gsub it doesn't work as expected:
gsub("(^[\\w|.|:|\\/]*),\\1+", "\\1", text)
I've tried with perl = FALSE and TRUE to no avail.
What am I doing wrong?
If they are sequential, you just need to modify your regex slightly:
Take out your BOS anchor ^.
Add a cluster group around the comma and backreference, then quantify it: (?:,\1)+.
And lose the pipe symbols |; inside a character class they are just literals.
([\w.:/]+)(?:,\1)+
https://regex101.com/r/FDzop9/1
( [\w.:/]+ )  # (1), the address
(?:           # Cluster
  , \1        # Comma followed by what was found in group 1
)+            # Cluster end, 1 to many times
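Applied in R (a sketch; the pattern only needs its backslashes doubled inside the string, and perl = TRUE selects the PCRE engine, which supports \w inside a character class):
gsub("([\\w.:/]+)(?:,\\1)+", "\\1", text, perl = TRUE)
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"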
Note - if you use split and unique and then recombine, you may lose the original ordering of the items.
An alternative approach is to split the string on the comma, take the unique results, and re-combine them into a single string:
paste0(unique(strsplit(text, ",")[[1]]), collapse = ",")
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
text <- c("http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png",
"http://q.co/imag/qrs.png,http://q.co/imag/qrs.png")
df <- data.frame(no = 1:2, text)
You can use functions from tidyverse if your strings are in a dataframe:
library(tidyverse)
separate_rows(df, text, sep = ",") %>%
  distinct() %>%
  group_by(no) %>%
  mutate(text = paste(text, collapse = ",")) %>%
  slice(1)
The output is:
#      no text
#   <int> <chr>
# 1     1 http://x.co/imag/xyz.png,http://x.co/imag/jpg.png
# 2     2 http://q.co/imag/qrs.png
