How to use select helpers (starts_with()) with readr::read_csv() - r

I have a rather wide dataset to read in with over 1000 missing values at the top, but all the variable names follow the same pattern. Is there a way to use starts_with() to force certain variables to be parsed correctly?
MWE:
library(tidyverse)
library(readr)
mwe.csv <- data.frame(id = c("a", "b"), # not where I actually get the data from
                      amount1 = c(NA, 20),
                      currency1 = c(NA, "USD"))
readr::write_csv(mwe.csv, "mwe.csv") # write the file out so read_csv() below has something to read
mwe <- readr::read_csv("mwe.csv", guess_max = 1) # guess_max set low for example purposes
I'd like to be able to do
mwe <- read_csv("mwe.csv", guess_max = 1,
                col_types = cols(starts_with("amount") = "d",
                                 starts_with("currency") = "c"))
> mwe
# A tibble: 2 x 3
  id    amount1 currency1
  <chr>   <dbl> <chr>
1 a          NA NA
2 b          20 USD
But I get the error "unexpected '=' in: read_csv". Any thoughts? I cannot hard-code it because the number of columns will change regularly, but the pattern (amountN) will be constant. There will also be other columns that are not id or amount/currency. I would prefer not to increase the guess_max option, for speed purposes.

The answer is to cheat!
mwe <- read_csv("mwe.csv", n_max = 0)   # only need the column names
cnames <- attr(mwe, "spec")             # grab the column spec (the names live in cnames$cols)
ctype <- rep("?", ncol(mwe))            # create the col_parser abbreviations -- all guesses
currency <- grepl("currency", names(cnames$cols)) # which ones are currency?
# or use base::startsWith(names(cnames$cols), "currency")
ctype[currency] <- "c"                  # do not guess on the currency ones; use character
# repeat the two lines above for each additional pattern as needed
mwe <- read_csv("mwe.csv", col_types = paste(ctype, collapse = ""))

Conditional creation (mutate) of new columns

I have a vector containing "potential" column names:
col_vector <- c("A", "B", "C")
I also have a data frame, e.g.
library(tidyverse)
df <- tibble(A = 1:2,
             B = 1:2)
My goal now is to create all columns mentioned in col_vector that don't yet exist in df.
For the above example, my code below works:
df %>%
  mutate(!!sym(setdiff(col_vector, colnames(.))) := NA)
# A tibble: 2 x 3
      A     B C
  <int> <int> <lgl>
1     1     1 NA
2     2     2 NA
Problem is that this code fails as soon as a) more than one column from col_vector is missing or b) no column from col_vector is missing. I thought about some sort of if_else, but don't know how to make the column creation conditional in such a way - preferably in a tidyverse way. I know I can just create a loop going through all the missing columns, but I'm wondering if there is a more direct approach.
Example data where code above fails:
df2 <- tibble(A = 1:2)
df3 <- tibble(A = 1:2,
              B = 1:2,
              C = 1:2)
This should work.
df[,setdiff(col_vector, colnames(df))] <- NA
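Applied to the failing examples from the question, the same line should handle both edge cases (a quick sketch; the no-op case relies on setdiff() returning an empty vector):
col_vector <- c("A", "B", "C")
df2[, setdiff(col_vector, colnames(df2))] <- NA # adds both B and C
df3[, setdiff(col_vector, colnames(df3))] <- NA # nothing missing, so a no-op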
Solution
This base operation might be simpler than a full-fledged dplyr workflow:
library(tidyverse) # Loaded for tibble(); setdiff() itself is available in base R.
# ...
# Code to generate 'df'.
# ...
# Find the subset of missing names, and create them as columns filled with 'NA'.
df[, setdiff(col_vector, names(df))] <- NA
# View results
df
Results
Given your sample col_vector and df here
col_vector <- c("A", "B", "C")
df <- tibble(A = 1:2, B = 1:2)
this solution should yield the following results:
# A tibble: 2 x 3
      A     B C
  <int> <int> <lgl>
1     1     1 NA
2     2     2 NA
Advantages
An advantage of my solution, over the alternative linked above by @geoff, is that you need not hand-code the set of column names, as symbols and strings, within the dplyr workflow.
df %>% mutate(
  #####################################
  A = ifelse("A" %in% names(.), A, NA),
  B = ifelse("B" %in% names(.), B, NA),
  C = ifelse("C" %in% names(.), C, NA)
  # ...
  # etc.
  #####################################
)
My solution, by contrast, is more dynamic
##############################
df[, setdiff(col_vector, names(df))] <- NA
##############################
if you ever decide to change (or even dynamically calculate!) your variable names midstream, since it determines the setdiff() at runtime.
Note
Incredibly, @AustinGraves posted their answer at precisely the same time (2021-10-25 21:03:05Z) as I posted mine, so both answers qualify as original solutions.

Using a numlist loop when renaming variables

I'm trying to rename two types of variables in R using tidyverse/dplyr. The first type, "var_a_year", I want to rename as "sample_year". The second type, "var_b_7", I want to rename as "index_year".
The second variable, "var_b", starts at the number 7 for the first year, 2004, and increases by 2 for each year, so for 2005 the second type of variable is called "var_b_9", as shown below.
I would like to use a loop so I can make this faster instead of writing a line for each year.
Many thanks in advance!
df <- df %>%
  rename(
    sample_2004 = var_a_2004, index_2004 = var_b_7,
    sample_2005 = var_a_2005, index_2005 = var_b_9,
    sample_2006 = var_a_2006, index_2006 = var_b_11,
    sample_2007 = var_a_2007, index_2007 = var_b_13,
    ...
    sample_2020 = var_a_2020, index_2020 = var_b_39)
There's no need to use a loop. rename_with will do the trick:
df <- tibble(var_a_2004 = NA, var_b_7 = NA, var_a_2005 = NA, var_b_9 = NA)
renameA <- function(x) {
  return(paste0("sample_", stringr::str_sub(x, -4)))
}
df %>% rename_with(renameA, starts_with("var_a"))
Gives
# A tibble: 1 x 4
  sample_2004 var_b_7 sample_2005 var_b_9
  <lgl>       <lgl>   <lgl>       <lgl>
1 NA          NA      NA          NA
I'll leave you to work out how to code the corresponding function for your var_b_XXXX columns.
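For reference, here is one possible var_b counterpart, based on the pattern stated in the question (var_b_7 is 2004 and the suffix grows by 2 per year, so var_b_N maps to year 2004 + (N - 7) / 2); renameB is a name made up for this sketch:
renameB <- function(x) {
  n <- as.numeric(stringr::str_extract(x, "\\d+$")) # trailing number N
  paste0("index_", 2004 + (n - 7) / 2)              # invert N = 2 * year - 4001
}
df %>%
  rename_with(renameA, starts_with("var_a")) %>%
  rename_with(renameB, starts_with("var_b"))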
In addition to the answer of Limey:
# sample data
df <- structure(list(var_a_2004 = NA, var_b_7 = NA, var_a_2005 = NA,
                     var_b_9 = NA), row.names = c(NA, -1L), class = "data.frame")
# load the data.table package
library(data.table)
# convert df to a data.table
dt <- as.data.table(df)
# change var_a_ in the column names to sample_
colnames(dt) <- gsub("var_a_", "sample_", colnames(dt))
# use a loop to replace var_b_ with index_
for (i in 2004:2005) {
  year <- i
  nr <- 2 * i - 4001 # the var_b suffix for year i (7 for 2004, 9 for 2005, ...)
  setnames(dt, old = paste0("var_b_", nr), new = paste0("index_", year))
}
This loop works for the years 2004:2005 to match the sample data. You can change it to 2004:2020 for your dataset.

Is there an R function that reads text files with \n as a (column) delimiter?

The Problem
I'm trying to come up with a neat/fast way to read files delimited by newline (\n) characters into more than one column.
Essentially, multiple rows in the input file should become a single row in the output; however, most file-reading functions sensibly interpret the newline character as signifying a new row, and so they end up producing a data frame with a single column. Here's an example:
The input files look like this:
Header Info
2021-01-01
text
...
#
2021-01-02
text
...
#
...
Where the ... represents potentially multiple rows in the input file, and the # signifies what should really be the end of a row in the output data frame. So upon reading this file, it should become a data frame like this (ignoring the header):
X1           X2    ...  Xn
2021-01-01   text  ...  ...
2021-01-02   text  ...  ...
...          ...   ...  ...
My attempt
I've tried base, data.table, readr and vroom, and they all produce one of two outputs: either a data frame with a single column, or a vector. I want to avoid a for loop, so my current solution is to use base::readLines() to read it as a character vector, then manually add some "proper" column separators (e.g. ;), and then join and split again.
# Save the example data to use as input
writeLines(c("Header Info", "2021-01-01", "text", "#", "2021-01-02", "text", "#"), "input.txt")
input <- readLines("input.txt")
input <- paste(input[2:length(input)], collapse = ";") # Skip the header
input <- gsub(";#;*", replacement = "\n", x = input)
input <- strsplit(unlist(strsplit(input, "\n")), ";")
input <- do.call(rbind.data.frame, input)
# Clean up the example input
unlink("input.txt")
My code above works and gives the desired result, but surely there's a better way??
Edit: This is internal in a function, so part (perhaps the larger part) of the intention of any simplification is to improve the speed.
Thanks in advance!
1) Read in the data, locate the # signs giving the logical variable at, and then create a grouping variable g which has distinct values for each desired line. Finally, use tapply with paste to rework the data into lines that read.table can read, and read it. (If there are commas in the data then use some other separating character.)
L <- readLines("input.txt")[-1]
at <- grepl("#", L)
g <- cumsum(at)
read.table(text = tapply(L[!at], g[!at], paste, collapse = ","), sep = ",")
giving this data frame:
V1 V2
1 2021-01-01 text
2 2021-01-02 text
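If you want meaningful column names here as well, read.table() takes a col.names argument; for example, using the cnames vector defined in (2) below:
cnames <- c("Date", "Text")
read.table(text = tapply(L[!at], g[!at], paste, collapse = ","),
           sep = ",", col.names = cnames)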
2) Another approach is to rework the data into dcf form by removing the # sign and prefacing other lines with their column name and a colon. Then use read.dcf. cnames is a character vector of column names that you want to use.
cnames <- c("Date", "Text")
L <- readLines("input.txt")[-1]
LL <- sub("#", "", paste0(c(paste0(cnames, ": "), ""), L))
DF <- as.data.frame(read.dcf(textConnection(LL)))
DF[] <- lapply(DF, type.convert, as.is = TRUE)
DF
giving this data frame:
Date Text
1 2021-01-01 text
2 2021-01-02 text
3) This approach simply reshapes the data into a matrix and then converts it to a data frame. Note that (1) converts numeric columns to numeric whereas this one just leaves them as character.
L <- readLines("input.txt")[-1]
k <- grep("#", L)[1]
as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
## V1 V2
## 1 2021-01-01 text
## 2 2021-01-02 text
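If you want numeric conversion here too, the type.convert() step from (2) can be bolted on afterwards (a small sketch, not part of the original answer):
DF <- as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
DF[] <- lapply(DF, type.convert, as.is = TRUE) # convert columns that look numeric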
Benchmark
The question did not mention speed as a consideration but in a comment it was later mentioned. Based on the data in the benchmark below (1) runs over twice as fast as the code in the question and (3) runs nearly 25x faster.
library(microbenchmark)
writeLines(c("Header Info",
             rep(c("2021-01-01", "text", "#", "2021-01-02", "text", "#"), 10000)),
           "input.txt")
microbenchmark(times = 10,
  ques = {
    input <- readLines("input.txt")
    input <- paste(input[2:length(input)], collapse = ";") # Skip the header
    input <- gsub(";#;*", replacement = "\n", x = input)
    input <- strsplit(unlist(strsplit(input, "\n")), ";")
    input <- do.call(rbind.data.frame, input)
  },
  ans1 = {
    L <- readLines("input.txt")[-1]
    at <- grepl("#", L)
    g <- cumsum(at)
    read.table(text = tapply(L[!at], g[!at], paste, collapse = ","), sep = ",")
  },
  ans3 = {
    L <- readLines("input.txt")[-1]
    k <- grep("#", L)[1]
    as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
  })
## Unit: milliseconds
##  expr     min      lq    mean  median      uq     max neval cld
##  ques 1146.62 1179.65 1188.74 1194.78 1200.11 1219.01    10   c
##  ans1  518.95  522.75  548.33  532.59  561.55  647.14    10  b
##  ans3   50.47   51.19   51.68   51.69   52.25   52.52    10 a
You can get round some of the string manipulation with something along the lines of:
input <- readLines("input.txt")[-1] #Read in and remove header
ncol <- which(input=="#")[1]-1 #Number of columns of data
data.frame(matrix(input[input != "#"], ncol = ncol, byrow=TRUE)) #Convert to dataframe
# X1 X2
#1 2021-01-01 text
#2 2021-01-02 text
At this point, you might consider going the extra mile and using a proper grammar to parse it. I don't know how big or complex the situation really is, but using pegr it might look something like this:
input <-
"Header Info
2021-01-01
text
multiple lines
of
text
#
2021-01-02
text
more
lines of text
#
"
library(pegr)
peg <- new.parser(commonRules, action = TRUE) +
  c("HEADER <- 'Header Info' EOL", "{}")  + # Rule to match the literal 'Header Info' and a \n, then discard
  c("TYPE <- 'text' EOL", "{-}")          + # Rule to match the literal 'text', paste and store as TYPE
  c("DATE <- (!EOL .)* EOL", "{-}")       + # Rule to match any characters up to a newline. Could improve to look for a date format
  c("EOS <- '#' EOL", "{}")               + # Rule to match the end of a section, then discard
  c("BODY <- (!EOS .)*", "{-}")           + # Rule to match a body of text, including newlines
  c("SECTION <- DATE TYPE BODY EOS")      + # Combining rules to match each section
  c("DOCUMENT <- HEADER SECTION*")          # Combining more rules to match the entire document
res <- peg[["DOCUMENT"]](input)
library(data.table) # for setnames()
library(magrittr)   # for %>%
final <- matrix(value(res), ncol = 3, byrow = TRUE) %>%
  as.data.frame %>%
  setnames(names(value(res))[1:3])
final
Produces:
DATE TYPE BODY
1 2021-01-01 text multiple lines\nof\ntext\n
2 2021-01-02 text more\nlines of text\n
It might feel clunky if you don't know the syntax, but once you do, it's a fire-and-forget solution. It'll run according to spec until the spec doesn't hold. You don't have to worry about fragile pretreatment, and it is easy to adapt to changing formats in the future.
There's also the tidyverse way:
library(tidyr)
library(readr)
library(stringr)
max_columns <- 5
d <- readr::read_file("file.txt") %>%
  stringr::str_remove("^Header Info\n") %>%
  tibble::enframe(name = NULL) %>%
  separate_rows(value, sep = "#\n") %>%
  separate("value", into = paste0("X", 1:max_columns), sep = "\n")
Using your example input in a file called file.txt, d looks like:
# A tibble: 3 x 5
X1 X2 X3 X4 X5
<chr> <chr> <chr> <chr> <chr>
1 2021-01-01 text ... "" NA
2 2021-01-02 text ... "" NA
3 ... NA NA NA NA
Warning message:
Expected 5 pieces. Missing pieces filled with `NA` in 3 rows [1, 2, 3].
Note that the warning is simply to make sure you know you're getting NAs; this is inevitable if the number of rows between # markers varies.
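If hard-coding max_columns bothers you, it can be derived from the file itself; a sketch assuming, as in the example, that every section ends with a line containing only #:
lines <- readLines("file.txt")[-1]       # drop the header
ends <- which(lines == "#")              # positions of the section markers
max_columns <- max(diff(c(0, ends)) - 1) # longest section, excluding the "#" line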
I am using data similar to that provided by Sirius for demonstration. You can also do something like this to get a variable number of columns in the resulting data frame:
example <- "Header Info
2021-01-01
text
multiple lines
of
text
#
2021-01-02
text
more
lines of text
#"
library(tidyverse)
example %>%
  as.data.frame() %>%
  setNames('dummy') %>%
  separate_rows(dummy, sep = '\\n') %>%
  filter(row_number() != 1) %>%
  group_by(rowid = rev(cumsum(rev(dummy == '#')))) %>%
  filter(dummy != '#') %>%
  mutate(name = paste0('X', row_number())) %>%
  pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)
#> # A tibble: 2 x 6
#> # Groups:   rowid [2]
#>   rowid X1         X2    X3             X4            X5
#>   <int> <chr>      <chr> <chr>          <chr>         <chr>
#> 1     2 2021-01-01 text  multiple lines of            text
#> 2     1 2021-01-02 text  more           lines of text <NA>
Created on 2021-05-30 by the reprex package (v2.0.0)

flag specific pattern using stringr

I am working with a data set where I need to flag all codes that start with "C13.xxx". There are other tree codes in the column, and the tree codes are separated as follows: "C13.xxx|B12.xxx" - and all tree codes have a period in them. But the data set has other values that are causing my stringr function to flag characters that are not tree codes. Example:
library(tidyverse)
# test data
test <- tribble(
  ~id, ~treecode,                     ~contains_c13_xxx,
  #---|------------------------------|-----------------
  1,   "B12.123|C13.234.432|A11.123", "yes",
  2,   "C12.123|C13039|",             "no"
)
# what I tried
test %>% mutate(contains_C13_error = ifelse(str_detect(treecode, "C13."), 1, 0))
# code above is flagging both id's as containing C13.xxx
In id 2, there is a value that begins with C13, but it is not a tree code (all tree codes have a period). The contains_c13_xxx variable is what I would like the code to produce. In the str_detect function I specified the period, so I'm not sure what is going wrong here.
The tricky part is that there are multiple tree codes in the same column with a separator, which makes them difficult to flag. We can bring each tree code into a separate row and then check for the code that we need, using separate_rows from tidyr.
library(dplyr)
test %>%
  tidyr::separate_rows(treecode, sep = "\\|") %>%
  group_by(id) %>%
  summarise(contains_C13_error = any(startsWith(treecode, "C13.")),
            treecode = paste(treecode, collapse = "|"))
# A tibble: 2 x 3
#     id contains_C13_error treecode
#  <dbl> <lgl>              <chr>
#1     1 TRUE               B12.123|C13.234.432|A11.123
#2     2 FALSE              C12.123|C13039|
This assumes that there could be codes of the pattern "C13" without a dot. If the tree code always has "C13" followed by a dot, then simply escaping the dot in your regex would work.
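For completeness, that escaped version of the original attempt would be (the unescaped dot matches any character, which is why "C13039" was flagged):
test %>%
  mutate(contains_C13_error = ifelse(str_detect(treecode, "C13\\."), 1, 0))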
Base R solution:
# Split on the | delim:
split_treecode <- strsplit(df$treecode, "[|]")
# Roll out the ids the number of times of each relevant treecode:
rolled_out_df <- data.frame(id = rep(df$id, sapply(split_treecode, length)), tc = unlist(split_treecode))
# Test whether or not string contains "C13"
rolled_out_df$contains_c13_xxx <- grepl("C13.", rolled_out_df$tc, fixed = T)
# Does the id have an element containing "C13" ?
rolled_out_df$contains_c13_xxx <- ifelse(ave(rolled_out_df$contains_c13_xxx,
rolled_out_df$id,
FUN = function(x){as.logical(sum(x))}), "yes", "no")
# Build back orignal df:
df <- merge(df[,c("id", "treecode")], unique(rolled_out_df[,c("id", "contains_c13_xxx")]), by = "id")
Data:
df <-
structure(
list(
id = c(1, 2),
treecode = c("B12.123|C13.234.432|A11.123",
"C12.123|C13039|"),
contains_c13_xxx = c("yes", "no")
),
row.names = c(NA,-2L),
class = "data.frame"
)

Extra column using tidyr's `unite_` vs `unite`

In the following example, why is there an additional column in the unite_() output vs the unite() output?
library(tidyr)
x1 <- data.frame(Sample = c("A", "B"), "1" = c("-", "y"),
                 "2" = c("-", "z"), "3" = c("x", "a"), check.names = F)
# Sample 1 2 3
# 1 A - - x
# 2 B y z a
Here we see the desired output:
unite(x1, mix, 2:ncol(x1), sep=",")
# Sample mix
# 1 A -,-,x
# 2 B y,z,a
Why is there an additional column here (the 1 column)? The default is to remove the columns used by unite_().
unite_(x1, "mix", 2:ncol(x1), sep=",")
# Sample 1 mix
# 1 A - -,-,x
# 2 B y y,z,a
Note: tidyr version 0.5.1
The signatures differ slightly between the two functions:
#unite(data, col, ..., sep = "_", remove = TRUE)
#unite_(data, col, from, sep = "_", remove = TRUE)
From the unite_ help page, the from option is defined as: "Names of existing columns as character vector."
Using column names, as opposed to column numbers, provides the desired results:
unite_(x1, "mix", names(x1[,2:ncol(x1)]), sep=",")
# Sample mix
#1 A -,-,x
#2 B y,z,a
I tried with "Unite" but it did not work. However, It worked very well with "paste" function.
df$new_col <- paste(df$col1,df$col2,sep="-")
or if you have more columns to join,
df$new_col <- paste(df$col1,df$col2,df$col3,....,sep="-")
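When the columns to join are many or only known at runtime, do.call() avoids typing each one out (a base R sketch; cols here is a hypothetical character vector of the column names to combine):
cols <- c("col1", "col2", "col3") # whichever columns you want to join
df$new_col <- do.call(paste, c(df[cols], sep = "-"))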
