tidyr::separate() producing unexpected results

I am providing a data frame to tidyr::separate() and getting unexpected results. I have a minimal working example below where I show how I am using it, what I expect it to produce, and what it is actually producing. Why is this not working?
# Create toy data frame
dat <- data.frame(text = c("time_suffer|suffer_employ|suffer_sick"),
stringsAsFactors = FALSE)
# Separate variable into 3 columns a,b,c using | as a delimiter
dat %>% tidyr::separate(., col = "text", into = c("a","b","c"), sep = "|")
# What I'm expecting
data.frame(a = "time_suffer", b = "suffer_employ", c = "suffer_sick")
# What I'm actually getting:
data.frame(a = NA, b = "t", c = "1")
I am also getting the warning "Warning message: Expected 3 pieces. Additional pieces discarded in 1 rows [1]."

According to the documentation, the sep argument to separate() is interpreted as a regular expression if it is a character (extremely useful if you have complicated separators). This does mean, however, that you need to escape characters that have special meaning in regular expressions if you want to match them literally. Use "\\|" as your separator:
library(tidyverse)
dat <- data.frame(text = c("time_suffer|suffer_employ|suffer_sick"),
stringsAsFactors = FALSE)
dat %>%
tidyr::separate(., col = "text", into = c("a","b","c"), sep = "\\|")
#> a b c
#> 1 time_suffer suffer_employ suffer_sick
Created on 2019-04-02 by the reprex package (v0.2.1)
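If you'd rather avoid regex escaping altogether, a minimal base R alternative: strsplit() treats the separator literally when fixed = TRUE.
# Split on a literal "|" with no regex escaping
strsplit("time_suffer|suffer_employ|suffer_sick", "|", fixed = TRUE)[[1]]
#> [1] "time_suffer"   "suffer_employ" "suffer_sick"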

Is there an R function that reads text files with \n as a (column) delimiter?

The Problem
I'm trying to come up with a neat/fast way to read files delimited by newline (\n) characters into more than one column.
Essentially, multiple rows in the input file should become a single row in the output. However, most file-reading functions sensibly interpret the newline character as signifying a new row, so they end up producing a data frame with a single column. Here's an example:
The input files look like this:
Header Info
2021-01-01
text
...
#
2021-01-02
text
...
#
...
Where the ... represents potentially multiple rows in the input file, and the # signifies what should really be the end of a row in the output data frame. So upon reading this file, it should become a data frame like this (ignoring the header):
X1          X2    ...  Xn
2021-01-01  text  ...  ...
2021-01-02  text  ...  ...
...         ...   ...  ...
My attempt
I've tried base, data.table, readr and vroom, and they all produce one of two outputs: either a data frame with a single column, or a vector. I want to avoid a for loop, so my current solution uses base::readLines() to read the file as a character vector, then manually adds some "proper" column separators (e.g. ;), and then joins and splits again.
# Save the example data to use as input
writeLines(c("Header Info", "2021-01-01", "text", "#", "2021-01-02", "text", "#"), "input.txt")
input <- readLines("input.txt")
input <- paste(input[2:length(input)], collapse = ";") # Skip the header
input <- gsub(";#;*", replacement = "\n", x = input)
input <- strsplit(unlist(strsplit(input, "\n")), ";")
input <- do.call(rbind.data.frame, input)
# Clean up the example input
unlink("input.txt")
My code above works and gives the desired result, but surely there's a better way??
Edit: This is internal in a function, so part (perhaps the larger part) of the intention of any simplification is to improve the speed.
Thanks in advance!
1) Read in the data, locate the # signs (giving the logical variable at), and then create a grouping variable g that has a distinct value for each desired output row. Finally, use tapply() with paste() to rework the data into lines that read.table() can read. (If there are commas in the data, use some other separating character.)
L <- readLines("input.txt")[-1]
at <- grepl("#", L)
g <- cumsum(at)
read.table(text = tapply(L[!at], g[!at], paste, collapse = ","),
sep = ",")
giving this data frame:
V1 V2
1 2021-01-01 text
2 2021-01-02 text
2) Another approach is to rework the data into DCF form by removing the # signs and prefacing the other lines with their column name and a colon. Then use read.dcf(). cnames is a character vector of the column names you want to use.
cnames <- c("Date", "Text")
L <- readLines("input.txt")[-1]
LL <- sub("#", "", paste0(c(paste0(cnames, ": "), ""), L))
DF <- as.data.frame(read.dcf(textConnection(LL)))
DF[] <- lapply(DF, type.convert, as.is = TRUE)
DF
giving this data frame:
Date Text
1 2021-01-01 text
2 2021-01-02 text
3) This approach simply reshapes the data into a matrix and then converts it to a data frame. Note that (1) converts numeric columns to numeric whereas this one just leaves them as character.
L <- readLines("input.txt")[-1]
k <- grep("#", L)[1]
as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
## V1 V2
## 1 2021-01-01 text
## 2 2021-01-02 text
Benchmark
The question did not originally mention speed as a consideration, but it was raised later in an edit. Based on the data in the benchmark below, (1) runs over twice as fast as the code in the question and (3) runs nearly 25x faster.
library(microbenchmark)
writeLines(c("Header Info",
rep(c("2021-01-01", "text", "#", "2021-01-02", "text", "#"), 10000)),
"input.txt")
microbenchmark(times = 10,
ques = {
input <- readLines("input.txt")
input <- paste(input[2:length(input)], collapse = ";") # Skip the header
input <- gsub(";#;*", replacement = "\n", x = input)
input <- strsplit(unlist(strsplit(input, "\n")), ";")
input <- do.call(rbind.data.frame, input)
},
ans1 = {
L <- readLines("input.txt")[-1]
at <- grepl("#", L)
g <- cumsum(at)
read.table(text = tapply(L[!at], g[!at], paste, collapse = ","), sep = ",")
},
ans3 = {
L <- readLines("input.txt")[-1]
k <- grep("#", L)[1]
as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
})
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## ques 1146.62 1179.65 1188.74 1194.78 1200.11 1219.01 10 c
## ans1 518.95 522.75 548.33 532.59 561.55 647.14 10 b
## ans3 50.47 51.19 51.68 51.69 52.25 52.52 10 a
You can get round some of the string manipulation with something along the lines of:
input <- readLines("input.txt")[-1] #Read in and remove header
ncol <- which(input=="#")[1]-1 #Number of columns of data
data.frame(matrix(input[input != "#"], ncol = ncol, byrow=TRUE)) #Convert to dataframe
# X1 X2
#1 2021-01-01 text
#2 2021-01-02 text
At this point, you might consider going the full mile and using a proper grammar to parse it. I don't know how big or complex the situation really is, but using pegr it might look something like this:
input <-
"Header Info
2021-01-01
text
multiple lines
of
text
#
2021-01-02
text
more
lines of text
#
"
library(pegr)
peg <- new.parser(commonRules,action=TRUE) +
c("HEADER <- 'Header Info' EOL" , "{}" ) + # Rule to match literal 'Header Info' and a \n, then discard
c("TYPE <- 'text' EOL" , "{-}" ) + # Rule to match literal 'text', store paste and store as $TYPE
c("DATE <- (!EOL .)* EOL" , "{-}" ) + # Rule to match any character leading up to a new line. Could improve to look for a date format
c("EOS <- '#' EOL" , "{}" ) + # Rule to match end of section, then discard
c("BODY <- (!EOS .)*" , "{-}" ) + # Rule to match body of text, including newlines
c("SECTION <- DATE TYPE BODY EOS" ) + # Combining rules to match each section
c("DOCUMENT <- HEADER SECTION*" ) # Combining more rules to match the endire document
res <- peg[["DOCUMENT"]](input)
library(magrittr)    # for %>%
library(data.table)  # for setnames()
final <- matrix( value(res), ncol = 3, byrow = TRUE ) %>%
as.data.frame %>%
setnames( names(value(res))[1:3] )
final
Produces:
DATE TYPE BODY
1 2021-01-01 text multiple lines\nof\ntext\n
2 2021-01-02 text more\nlines of text\n
It might feel clunky if you don't know the syntax, but once you do, it's a fire-and-forget solution. It'll run according to spec until the spec no longer holds. You don't have to worry about fragile pretreatment, and it is easy to adapt to changing formats in the future.
There's also the tidyverse way:
library(tidyr)
library(readr)
library(stringr)
max_columns <- 5
d <- {
readr::read_file("file.txt") %>%
stringr::str_remove("^Header Info\n") %>%
tibble::enframe(name = NULL) %>%
separate_rows(value, sep = "#\n") %>%
separate("value", into = paste0("X", 1:max_columns) , sep = "\n")
}
Using your example input in a file called file.txt, d looks like:
# A tibble: 3 x 5
X1 X2 X3 X4 X5
<chr> <chr> <chr> <chr> <chr>
1 2021-01-01 text ... "" NA
2 2021-01-02 text ... "" NA
3 ... NA NA NA NA
Warning message:
Expected 5 pieces. Missing pieces filled with `NA` in 3 rows [1, 2, 3].
Note that the warning is simply to make sure you know you're getting NAs; this is inevitable if the number of rows between # markers varies.
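If hardcoding max_columns is a concern, here is a minimal sketch (assuming the same file.txt as above) that derives it from the data before the separate() call:
library(readr)
library(stringr)
txt  <- str_remove(read_file("file.txt"), "^Header Info\n")
recs <- str_split(txt, "#\n")[[1]]
max_columns <- max(str_count(recs, "\n")) + 1  # pieces per record = newlines + 1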
I am using data similar to that provided by Sirius for demonstration. You can also do something like this to get a variable number of columns in the resulting data frame:
example <- "Header Info
2021-01-01
text
multiple lines
of
text
#
2021-01-02
text
more
lines of text
#"
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
separate_rows(dummy, sep = '\\n') %>%
filter(row_number() !=1) %>%
group_by(rowid = rev(cumsum(rev(dummy == '#')))) %>%
filter(dummy != '#') %>%
mutate(name = paste0('X', row_number())) %>%
pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)
#> # A tibble: 2 x 6
#> # Groups: rowid [2]
#> rowid X1 X2 X3 X4 X5
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 2 2021-01-01 text multiple lines of text
#> 2 1 2021-01-02 text more lines of text <NA>
Created on 2021-05-30 by the reprex package (v2.0.0)

How to use select_helpers() [starts_with()] when using readr::read_csv()

I have a rather wide dataset to read in with over 1000 missing values at the top, but all the variable names follow the same pattern. Is there a way to use starts_with() to force certain variables to be parsed correctly?
MWE:
library(tidyverse)
library(readr)
mwe.csv <- data.frame(id = c("a", "b"), # not where I actually get the data from
amount1 = c(NA, 20),
currency1 = c(NA, "USD")
)
readr::write_csv(mwe.csv, "mwe.csv") # write the file so the example actually runs
mwe <- readr::read_csv("mwe.csv", guess_max = 1) # guess_max = 1 for example purposes
I'd like to be able do
mwe<- read_csv("mwe.csv", guess.max = 1
col_types = cols(starts_with("amount") = "d",
starts_with("currency") = "c"))
)
> mwe
# A tibble: 2 x 3
id amount currency
<chr> <dbl> <chr>
1 a NA NA
2 b 20 USD
But I get the error "unexpected '=' in: read_csv". Any thoughts? I cannot hard-code the columns because their number will change regularly, but the pattern (amountN) will be constant. There will also be other columns that are not id or amount/currency. I would prefer not to increase guess_max, for speed reasons.
The answer is to cheat!
mwe <- read_csv("mwe.csv", n_max = 0) # only need the col_names
cnames <- attr(mwe, "spec") # grab the column spec (which carries the names)
ctype <- rep("?", ncol(mwe)) # create the col_parser abbr -- all guesses
currency <- grepl("currency", names(cnames$cols)) # which ones are currency?
# or use base::startsWith(names(cnames$cols), "currency")
ctype[currency] <- "c" # do not guess on currency ones, use character
# repeat lines 4 & 5 as needed
mwe <- read_csv("mwe.csv", col_types = paste(ctype, collapse = ""))
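For example, repeating those lines for the amount columns might look like this (a sketch following the same pattern; "d" is readr's abbreviation for a double column):
amount <- grepl("amount", names(cnames$cols)) # which ones are amounts?
ctype[amount] <- "d" # parse the amount columns as doubles rather than guessing
mwe <- read_csv("mwe.csv", col_types = paste(ctype, collapse = ""))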

Flag specific pattern using stringr

I am working with a data set where I need to flag all codes that start with "C13.xxx". There are other tree codes in the column, separated as follows: "C13.xxx|B12.xxx", and all tree codes have a period in them. But the data set has other values that are causing my stringr call to flag characters that are not tree codes. Example:
library(tidyverse)
# test data
test <- tribble(
~id, ~treecode, ~contains_c13_xxx,
#--|--|----
1, "B12.123|C13.234.432|A11.123", "yes",
2, "C12.123|C13039|", "no"
)
# what I tried
test %>% mutate(contains_C13_error = ifelse(str_detect(treecode, "C13."), 1, 0))
# code above is flagging both id's as containing C13.xxx
In id 2, there is a value that begins with "C13", but it is not a tree code (all tree codes have a period). The contains_c13_xxx variable is what I would like the code to produce. In str_detect(), I specified the period, so I'm not sure what is going wrong here.
The tricky part is that there are multiple tree codes in the same column, separated by "|", which makes them difficult to flag. We can bring each tree code into its own row and then check for the code we need, using separate_rows() from tidyr.
library(dplyr)
test %>%
tidyr::separate_rows(treecode, sep = "\\|") %>%
group_by(id) %>%
summarise(contains_C13_error = any(startsWith(treecode, "C13.")),
treecode = paste(treecode, collapse = "|"))
# A tibble: 2 x 3
# id contains_C13_error treecode
# <dbl> <lgl> <chr>
#1 1 TRUE B12.123|C13.234.432|A11.123
#2 2 FALSE C12.123|C13039|
This assumes that there could be codes of the pattern "C13" without a dot. If the tree code always has "C13" followed by a dot, then simply escaping the dot in your regex would work:
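For instance, a minimal sketch on the same test data (contains_C13 is a throwaway column name):
library(dplyr)
library(stringr)
test %>% mutate(contains_C13 = str_detect(treecode, "C13\\."))
# id 1 is TRUE ("C13.234.432" contains "C13.") and id 2 is FALSE ("C13039" does not)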
Base R solution:
# Split on the | delim:
split_treecode <- strsplit(df$treecode, "[|]")
# Roll out the ids the number of times of each relevant treecode:
rolled_out_df <- data.frame(id = rep(df$id, sapply(split_treecode, length)), tc = unlist(split_treecode))
# Test whether the string contains the literal "C13.":
rolled_out_df$contains_c13_xxx <- grepl("C13.", rolled_out_df$tc, fixed = TRUE)
# Does the id have an element containing "C13" ?
rolled_out_df$contains_c13_xxx <- ifelse(ave(rolled_out_df$contains_c13_xxx,
rolled_out_df$id,
FUN = function(x){as.logical(sum(x))}), "yes", "no")
# Build back the original df:
df <- merge(df[,c("id", "treecode")], unique(rolled_out_df[,c("id", "contains_c13_xxx")]), by = "id")
Data:
df <-
structure(
list(
id = c(1, 2),
treecode = c("B12.123|C13.234.432|A11.123",
"C12.123|C13039|"),
contains_c13_xxx = c("yes", "no")
),
row.names = c(NA,-2L),
class = "data.frame"
)

colMeans not working in R

The data set I have as file Dummy.txt is as follows
A|B|C|D
1|2|1.9|5
2.5|5|53|3
4|48|49|0.4
8|94|495|B6
(please note a text character in 5th row, 4th column)
I would like to obtain the mean of each column (i.e. column A, B, C and D).
The code I am using is as follows:
mydata_1 <- read.delim("Dummy.txt", skipNul = TRUE, sep = "|", header = FALSE, row.names = NULL)
mydata_1 <- as.numeric(as.character(mydata_1))
colMeans(mydata_1, na.rm = TRUE,)
However, this doesn't seem to be working. Any suggestions please?
You need to set header = TRUE to have the A|B|C|D row used for column names; otherwise those names are included as values, and every column is parsed as a string column.
Then, passing stringsAsFactors = FALSE prevents column D from being turned into a factor, and the value "B6" will be turned into an NA when converted to a numeric type.
mydata_1 <- read.delim("Dummy.txt", skipNul = TRUE, sep = "|", header = TRUE,
row.names = NULL, stringsAsFactors = FALSE)
mydata_1[] <- lapply(mydata_1, as.numeric)
#> Warning message:
#> In lapply(mydata_1, as.numeric) : NAs introduced by coercion
colMeans(mydata_1, na.rm = TRUE)
#> A B C D
#> 3.875 37.250 149.725 2.800
The syntax mydata_1[] <- ... makes mydata_1 keep its data frame structure even though a list is being returned on the right-hand side.
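A quick illustration of the difference, using a throwaway data frame:
df <- data.frame(a = 1:2, b = 3:4)
out <- lapply(df, as.numeric) # plain assignment target: the result is a bare list
class(out)
#> [1] "list"
df[] <- lapply(df, as.numeric) # df[] replaces the contents but keeps the class
class(df)
#> [1] "data.frame"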
The problem here is that as.numeric(as.character(mydata_1)) returns [1] NA NA NA NA: a data frame is a list of columns, so as.character() deparses each column into a single string (such as "c(1, 2.5, 4, 8)"), none of which can be parsed as a number.
My suggestion would be to first go through all columns and coerce the types using sapply(), and then calculate the means of the columns:
library(magrittr)
mydata_1 %>%
sapply(., function(col) as.numeric(as.character(col))) %>%
colMeans(na.rm = TRUE)
This will return:
A B C D
3.875 37.250 149.725 2.800
Note: I am using magrittr to make use of the pipe (%>%) operator to chain the operations so you can check the output of every step.

Using dplyr functions on variables named "."

Sometimes when generating a data frame from a list, the variable is named "." by default. How can I refer to this variable within dplyr functions, if only to change its name to something more appropriate?
# Code that produces my data frame with "." as column name
library(tidyverse)
d <- data.frame(`.` = 1, row.names = "a")
# Now my code fails because `.` is a poor column name for dplyr functions:
d %>% select(model = rownames(.), outlier = `.`)
This isn't actually a problem with the column named "."; it's a problem with referencing the row names in select(). See:
d <- data.frame(test = 1, row.names = "a")
d %>% select(model = rownames(.), outlier = test)
still returns Error: Strings must match column names. Unknown columns: a
Just use:
d <- data.frame(`.` = 1, row.names = "a")
d %>% select(outlier = '.')
This will rename the column to outlier.
Given
d <- data.frame(`.` = 1, row.names = "a")
Base R Solution
colnames(d) <- 'newname'
Dplyr Solution
d %>% rename(newname = '.')
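If you also want the row names as a column, as the model = rownames(.) attempt in the question suggests, here is a sketch using tibble::rownames_to_column() (assuming that is the intent):
library(tibble)
library(dplyr)
d <- data.frame(`.` = 1, row.names = "a")
d %>%
rownames_to_column("model") %>% # row names become a regular column
rename(outlier = ".")
#> model outlier
#> 1 a 1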
