I am trying to find a more efficient way to import a list of data files with a somewhat awkward structure. The files are generated by a software program, and the output looks like it was intended to be printed and viewed rather than exported and used. Each file contains a list of "Compounds" and some associated data. Following a line reading "Compound X: XXXX", there are several lines of tab-delimited data. Within each file the number of rows for each compound remains constant, but the number of rows may differ between files.
Here is some example data:
#Generate two data files to be imported
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t25.2",
"\n2\tA4567\tQC\t26.8\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t51.1",
"\n2\tA4567\tQC\t48.6\n",
file = "test1.txt")
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t25.2",
"\n2\tC4567\tQC\t26.8",
"\n3\tC8910\tQC\t25.4\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t51.1",
"\n2\tC4567\tQC\t48.6",
"\n3\tC8910\tQC\t45.6\n",
file = "test2.txt")
What I want in the end is a list of data frames, one for each "Compound", containing all rows of data associated with each compound. To get there, I have a fairly convoluted approach of smashed together functions which give me what I want but in a very unruly fashion.
library(tidyverse)
## Step 1: ID list of data files
data.files <- list.files(path = ".",
pattern = "\\.txt$",
full.names = TRUE)
## Step 2: Read in the data files
data.list.raw <- lapply(data.files, read_lines, skip = 4)
## Step 3: Identify the "compounds" in the data file output
Hdr.dat <- lapply(data.list.raw, function(x) grepl("Compound", x)) # Scan the file and find the different compounds within it (this can be applied to any Waters output)
grp.dat <- Map(function(x, y) {x[y][cumsum(y)]}, data.list.raw, Hdr.dat)
## Step 4: Unpack the tab delimited parts of the export file, then generate a list of dataframes within a list of imported files
Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
raw.dat <- Map(function(x,y) {Map(Read, split(x, y))}, data.list.raw, grp.dat)
## Step 5: Curate the list of compounds - remove "Compound X: "
cmpd.list <- lapply(raw.dat, function(x) trimws(substring(names(x), 13)))
## Step 6: Rename the headers for the dataframes, remove the blank rows and recentre
NameCols <- function(z) lapply(names(z), function(i){
x <- z[[ i ]]
colnames(x) <- x[2,]
x[c(-1,-2),]
})
data.list <- Map(function(x,y){setNames(NameCols(x), y)}, raw.dat, cmpd.list)
## Step 7: rbind the data based on the compound
cmpd_names <- unique(unlist(sapply(data.list, names)))
result <- list()
for (n in cmpd_names) {
result[[n]] <- map(data.list, n)
}
list.merged <- map(result, dplyr::bind_rows)
list.merged <- lapply(list.merged, function(x) x %>% filter(Name != ""))
The challenge here is script efficiency as far as time (I can import hundreds or thousands of data files with hundreds of lines of data, which can take quite a while) as well as general "cleanliness", which is why I included tidyverse as a tag here. I also want this to be highly generalizable, as the "Compounds" may change over time. If someone can come up with a clean and efficient way to do all of this I would be forever in your debt.
See one approach below. The whole pipeline might be intimidating at first glance. You can insert a head (or tail) call after each step (%>%) to display the current stage of data transformation. There's a bit of cleanup with regular expressions going on in the gsubs: modify as desired.
intermediate_result <-
data.frame(file_name = c('test1.txt','test2.txt')) %>%
rowwise %>%
## read file content into a raw string:
mutate(raw = read_file(file_name)) %>%
## separate raw file contents into rows
## using newline and carriage return as row delimiters:
separate_rows(raw, sep = '[\\n\\r]') %>%
## provide a compound column for later grouping
## by extracting the 'Compound' string from column raw
## or setting the compound column to NA otherwise:
mutate(compound = ifelse(grepl('^Compound',raw),
gsub('.*(Compound .*):.*','\\1', raw),
NA)
) %>%
## remove rows with empty raw text:
filter(raw != '') %>%
## filling missing compound values (NAs) with last non-NA compound string:
fill(compound, .direction = 'down') %>%
## keep only rows with tab-separated raw string
## indicating tabular data
filter(grepl('\\t',raw)) %>%
## insert a column header 'Index' because
## original format has four data columns but only three header cols:
mutate(raw = gsub(' *\\tName','Index\tName',raw))
The steps above result in a dataframe with a column 'raw' that holds the cleaned-up data as strings suited for conversion into tabular data (tab-delimited, one line per row).
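The labelling step in the pipeline (give every line the most recent 'Compound' string above it) can also be sketched in base R. This is only an illustration on made-up toy lines, and it assumes the first line is a compound header; `labelled` is a name chosen here, not something from the pipeline above.

```r
# Toy lines mimicking the report layout (values made up for illustration).
lines <- c("Compound 1: One", "\tName\tID\tResult", "1\tA1234\tQC\t25.2",
           "", "Compound 2: Two", "\tName\tID\tResult", "1\tA1234\tQC\t51.1")

is_header <- grepl("^Compound", lines)

# cumsum() over the header flags numbers each line by the most recent header,
# which plays the role of fill(compound, .direction = 'down').
compound <- sub(":.*$", "", lines[is_header])[cumsum(is_header)]

# Keep only the tab-delimited lines, as the pipeline does.
keep <- grepl("\t", lines)
labelled <- data.frame(compound = compound[keep], raw = lines[keep],
                       stringsAsFactors = FALSE)
print(labelled)
```

If a file starts with preamble lines before the first header, `cumsum(is_header)` is 0 for those lines and the indexing would need an explicit NA slot in front.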
From there on, we can either keep the future single tables inside the parent table as a so-called list column (Variant A) or split column 'raw' and map over the pieces (Variant B, credits to @Dorton).
Variant A produces a column of dataframes inside the dataframe:
intermediate_result %>%
group_by(compound) %>%
## the nifty piece: you can store dataframes inside a dataframe:
mutate(
tables = list(read.table(text = raw, header = TRUE, sep = '\t' ))
)
Variant B produces a list of dataframes named with the corresponding compound:
intermediate_result %>%
split(f = as.factor(.$compound)) %>%
lapply(function(x) x %>%
separate(raw,
into = unlist(
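str_split(x$raw[1], pattern = "\t")),
sep = "\t" ## split on tabs only; the default separator would also split Result at the decimal point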
)
)
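Since speed across many files was part of the question, here is a base R sketch of the same idea that parses each file in one pass and splits by compound at the end. It is a rough alternative, not a benchmarked replacement. `read_compound_file` is a hypothetical helper name, and the sketch assumes that data rows start with an integer row number followed by a tab and that every file has the four columns shown in the example data.

```r
# Hypothetical helper: parse one report file into a long data frame.
read_compound_file <- function(path) {
  lines <- readLines(path, warn = FALSE)
  is_header <- grepl("^\\s*Compound", lines)
  # Compound name is the text after the colon; the leading NA covers preamble lines.
  compound <- c(NA_character_,
                trimws(sub("^[^:]*:", "", lines[is_header])))[cumsum(is_header) + 1]
  # Assumption: data rows begin with a row number followed by a tab.
  is_data <- grepl("^\\s*[0-9]+\t", lines)
  dat <- read.delim(text = lines[is_data], header = FALSE,
                    col.names = c("Index", "Name", "ID", "Result"),
                    strip.white = TRUE, stringsAsFactors = FALSE)
  dat$compound <- compound[is_data]
  dat$file <- basename(path)
  dat
}

# Demo on a temporary file laid out like the question's test files.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("Quantify Compound Summary Report", "",
             "Printed Mon March 28 14:54:39 2022", "",
             "Compound 1: One", "\tName\tID\tResult",
             "1\tA1234\tQC\t25.2", "2\tA4567\tQC\t26.8", "",
             "Compound 2: Two", "\tName\tID\tResult",
             "1\tA1234\tQC\t51.1", "2\tA4567\tQC\t48.6"), demo_path)

all_rows <- do.call(rbind, lapply(demo_path, read_compound_file))
by_compound <- split(all_rows, all_rows$compound)
str(by_compound)
```

With real data, `demo_path` would be replaced by the vector of file paths; the `file` column keeps track of where each row came from.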
I have a data file that follows roughly this format:
HEADER:001,v1,v2,v3...,v10
v1,v2,v3,STATUS,v5...v6
.
.
.
HEADER:006,v1,v2,v3...v10
HEADER:012,v1,v2,v3...v10
v1,v2,v3,STATUS,v5...v6
v1,v2,v3,STATUS,v5...v6
.
.
.
etc
where each block or chunk of data leads off with a comma separated line that includes the header and a unique (not necessarily sequential) number, and then there may be 0 or more lines that are identified by the STATUS keyword in the body of the chunk.
I am reading this block in using readLines and then splitting it into header lines and status lines to be read in as CSV separately, since they have a different number of variables:
datablocks <- readLines(filename, skipNul = T)
headers <- datablocks[grepl("HEADER", datablocks, useBytes = T)]
headers <- read.csv(text=headers, header= F, stringsAsFactors = F)
statuses <- datablocks[grepl("STATUS", datablocks, useBytes = T)]
statuses <- read.csv(text=statuses, header= F, stringsAsFactors = F)
Eventually, I would like to inner join this data, so that the variables from the header are included in each status line:
all <- headers %>% inner_join(statuses, by = c("ID" = "ID"))
But I need a way to add the unique ID of the header to each status line below it, until the next header. The only way I can think of doing this is with a for loop that runs over the initial full text datablock:
header_id <- NA
for(i in seq_along(datablocks)) {
is_header_line <- str_extract(datablocks[i], "HEADER:([^,]*)")
if(!is.na(is_header_line)) {
header_id <- is_header_line
}
datablocks[i] <- paste(datablocks[i], header_id, sep=",")
}
This works fine, but it's ugly, and not very... R-ish. I can't think of a way to vectorize this operation, since it needs to keep an external variable.
Am I missing something obvious here?
Edit
If the input looks literally like this
HEADER:001,a0,b0,c0,d0
e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0
HEADER:006,a1,b1,c1,d1
HEADER:012,a2,b2,c2,d2
e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1
e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2
The output should look like this:
e0,f0,g0,h0,i0,j0,k0,l0,m0,a0,b0,c0,d0,001
e1,f1,g1,h1,i1,j1,k1,l1,m1,a2,b2,c2,d2,012
e2,f2,g2,h2,i2,j2,k2,l2,m2,a2,b2,c2,d2,012
So there needs to be a column propagated from the parent (HEADER) to the children (STATUS) to inner join on.
EDIT:
Thanks for the clarification. The specific input and output make it dramatically easier to avoid misunderstandings.
Here I use tidyr::separate to separate out the header label from the "a0,b0,c0,d0" part, and tidyr::fill to propagate header info down into the following status rows.
library(tidyverse)
read_table(col_names = "text",
"HEADER:001,a0,b0,c0,d0
e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0
HEADER:006,a1,b1,c1,d1
HEADER:012,a2,b2,c2,d2
e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1
e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2") %>%
mutate(status_row = str_detect(text, "STATUS"),
header_row = str_detect(text, "HEADER"),
header = if_else(header_row, str_remove(text, "HEADER:"), NA_character_)) %>%
separate(header, c("header", "stub"), sep = ",", extra = "merge") %>%
fill(header, stub) %>%
filter(status_row) %>%
mutate(output = paste(str_remove(text, "STATUS,"), stub, header, sep = ",")) %>%
select(output)
Result
# A tibble: 3 x 1
output
<chr>
1 e0,f0,g0,h0,i0,j0,k0,l0,m0,a0,b0,c0,d0,001
2 e1,f1,g1,h1,i1,j1,k1,l1,m1,a2,b2,c2,d2,012
3 e2,f2,g2,h2,i2,j2,k2,l2,m2,a2,b2,c2,d2,012
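For comparison, the same propagation can be sketched in base R without a loop: `cumsum()` over the header flags gives each line the index of the most recent header above it. This is one possible vectorized variant, assuming every STATUS line is preceded by at least one HEADER line; the variable names are chosen here for illustration.

```r
datablocks <- c("HEADER:001,a0,b0,c0,d0",
                "e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0",
                "HEADER:006,a1,b1,c1,d1",
                "HEADER:012,a2,b2,c2,d2",
                "e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1",
                "e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2")

is_header <- grepl("^HEADER:", datablocks)
is_status <- grepl("STATUS", datablocks, fixed = TRUE)

# Split each header line into its ID and the remaining fields.
headers     <- datablocks[is_header]
header_id   <- sub("^HEADER:([^,]*),.*$", "\\1", headers)
header_stub <- sub("^HEADER:[^,]*,", "", headers)

# For each status line, the index of the most recent header above it.
block <- cumsum(is_header)[is_status]

output <- paste(sub("STATUS,", "", datablocks[is_status], fixed = TRUE),
                header_stub[block], header_id[block], sep = ",")
print(output)
```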
I have materials lists from vendors, and I would like to expand the description into other columns so I can use the filter function in Excel to more easily find products based on their description. Here's an example of the description I receive from a vendor:
2 SS 150LB 304L SLIP ON FLANGE
I would like to take this description and have R identify certain bits of text, and based on that text, add data to another column. For instance: if the string "SS" is in this cell, then put the word "STAINLESS" in a Materials column. If the string "BLK" is found in this cell, then put the word "BLACK" in the Materials column. If the string "FLANGE" is found in this cell, then put the word "FLANGE" in another column called Part_Type.
Here is one simple approach which looks for certain character sequences to use as a trigger to add strings to other columns.
library(tidyverse)
df <- tibble(x = c(('2 SS 150LB 304L SLIP ON FLANGE'),
('3 BLK ON FLANGE')))
# add new columns filled with NA
df <- df %>%
add_column(Materials = NA_character_) %>%
add_column(Part_Type = NA_character_)
df %>%
mutate(Materials = if_else(str_detect(x, 'SS'), 'STAINLESS', Materials)) %>%
mutate(Materials = if_else(str_detect(x, 'BLK'), 'BLACK', Materials)) %>%
mutate(Part_Type = if_else(str_detect(x, 'FLANGE'),'FLANGE', Part_Type))
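One caveat with plain substring matching is that a short code such as 'SS' also matches inside longer words. A small base R sketch of the difference, using word boundaries (`\\b`); the second description is made up purely to show the effect:

```r
x <- c("2 SS 150LB 304L SLIP ON FLANGE",  # from the question
       "4 BRASS ELBOW")                    # made-up: contains "SS" inside a word

# Plain substring match: also fires on "BRASS".
plain <- grepl("SS", x, fixed = TRUE)

# Word-boundary match: only fires when "SS" stands alone.
bounded <- grepl("\\bSS\\b", x, perl = TRUE)

print(data.frame(x, plain, bounded))
```

The same pattern string should work inside `str_detect()`, since stringr regular expressions also support `\\b`.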
Can an item be both 'stainless steel' and 'black'? i.e. do we want to add multiple strings to one column? In that case, it would be necessary to append rather than overwrite. Here's one approach to that problem.
my_nrow = 2
df <- tibble(x = c(('2 SS 150LB 304L SLIP ON FLANGE'),
('3 BLK SS ON FLANGE')),
Materials = vector('character', my_nrow),
Part_type = vector('character', my_nrow))
df
df %>%
mutate(Materials = if_else(str_detect(x, 'SS'), str_c(Materials,'STAINLESS '), Materials)) %>%
mutate(Materials = if_else(str_detect(x, 'BLK'), str_c(Materials,'BLACK '), Materials)) %>%
mutate(Part_type = if_else(str_detect(x, 'FLANGE'), str_c(Part_type,'FLANGE', sep = ' '), Part_type))
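If the list of keywords grows, one `mutate` per keyword gets unwieldy. A possible alternative is to keep the keyword-to-label pairs in named vectors and apply them with a small helper. The sketch below uses base R only; `tag_keywords`, `material_map`, and `part_type_map` are hypothetical names, and it assumes the keywords contain no regex metacharacters.

```r
descriptions <- c("2 SS 150LB 304L SLIP ON FLANGE", "3 BLK SS ON FLANGE")

# Keyword -> label lookups; extend these as new vendor codes appear.
material_map  <- c(SS = "STAINLESS", BLK = "BLACK")
part_type_map <- c(FLANGE = "FLANGE")

# Hypothetical helper: for each description, collect the labels of all
# keywords that occur as whole words, joined by a single space.
tag_keywords <- function(x, map) {
  vapply(x, function(desc) {
    hit <- vapply(names(map),
                  function(k) grepl(paste0("\\b", k, "\\b"), desc, perl = TRUE),
                  logical(1))
    paste(map[hit], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

df <- data.frame(x = descriptions,
                 Materials = tag_keywords(descriptions, material_map),
                 Part_Type = tag_keywords(descriptions, part_type_map),
                 stringsAsFactors = FALSE)
print(df)
```

Labels are emitted in the order of the lookup vector, and a description with no matching keyword gets an empty string.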