Clever way to avoid for loop in R - r

I have a data file that follows roughly this format:
HEADER:001,v1,v2,v3...,v10
v1,v2,v3,STATUS,v5...v6
.
.
.
HEADER:006,v1,v2,v3...v10
HEADER:012,v1,v2,v3...v10
v1,v2,v3,STATUS,v5...v6
v1,v2,v3,STATUS,v5...v6
.
.
.
etc
where each block or chunk of data leads off with a comma separated line that includes the header and a unique (not necessarily sequential) number, and then there may be 0 or more lines that are identified by the STATUS keyword in the body of the chunk.
I am reading this block in using readLines and then splitting it into header lines and status lines to be read in as CSV separately, since they have a different number of variables:
datablocks <- readLines(filename, skipNul = T)
headers <- datablocks[grepl("HEADER", datablocks, useBytes = T)]
headers <- read.csv(text=headers, header= F, stringsAsFactors = F)
statuses <- datablocks[grepl("STATUS", datablocks, useBytes = T)]
statuses <- read.csv(text=statuses, header= F, stringsAsFactors = F)
Eventually, I would like to inner join this data, so that the variables from the header are included in each status line:
all <- headers %>% inner_join(statuses, by = c("ID" = "ID"))
But I need a way to add the unique ID of the header to each status line below it, until the next header. The only way I can think of doing this is with a for loop that runs over the initial full text datablock:
header_id <- NA
for(i in seq(1:length(datablocks))) {
is_header_line <- str_extract(datablocks[i], "HEADER:([^,]*)")
if(!is.na(is_header_line)) {
header_id <- is_header_line
}
datablocks[i] <- paste(datablocks[i], header_id, sep=",")
}
This works fine, but it's ugly, and not very... R-ish. I can't think of a way to vectorize this operation, since it needs to keep an external variable.
Am I missing something obvious here?
Edit
If the input looks literally like this
HEADER:001,a0,b0,c0,d0
e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0
HEADER:006,a1,b1,c1,d1
HEADER:012,a2,b2,c2,d2
e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1
e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2
The output should look like this:
e0,f0,g0,h0,i0,j0,k0,l0,m0,a0,b0,c0,d0,001
e1,f1,g1,h1,i1,j1,k1,l1,m1,a2,b2,c2,d2,012
e2,f2,g2,h2,i2,j2,k2,l2,m2,a2,b2,c2,d2,012
So there needs to be a column propagated from the parent (HEADER) to the children (STATUS) to inner join on.

EDIT:
Thanks for the clarification. The specific input and output makes it dramatically easier to avoid misunderstandings.
Here I use tidyr::separate to separate out the header label from the "a0,b0,c0,d0" part, and tidyr::fill to propagate header info down into the following status rows.
library(tidyverse)
read_table(col_names = "text",
"HEADER:001,a0,b0,c0,d0
e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0
HEADER:006,a1,b1,c1,d1
HEADER:012,a2,b2,c2,d2
e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1
e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2") %>%
mutate(status_row = str_detect(text, "STATUS"),
header_row = str_detect(text, "HEADER"),
header = if_else(header_row, str_remove(text, "HEADER:"), NA_character_)) %>%
separate(header, c("header", "stub"), sep = ",", extra = "merge") %>%
fill(header, stub) %>%
filter(status_row) %>%
mutate(output = paste(str_remove(text, "STATUS,"), stub, header, sep = ",")) %>%
select(output)
Result
# A tibble: 3 x 1
output
<chr>
1 e0,f0,g0,h0,i0,j0,k0,l0,m0,a0,b0,c0,d0,001
2 e1,f1,g1,h1,i1,j1,k1,l1,m1,a2,b2,c2,d2,012
3 e2,f2,g2,h2,i2,j2,k2,l2,m2,a2,b2,c2,d2,012

Related

How can I change the cell format from ":" to lat and long (degree, minute, second) type

The data has 34773 rows and 28 column. One of these columns contains the latitude and the other longitude information for each row.
But these coordinates are supposed to be separated in degrees minutes seconds (° ' ''), but instead they are separated by colons. How can I change this?
Hi enesson,
Be careful because you haven't read the names of the variables as its first line, that is, the header parameter. So, you will have to adapt the second line of this code to your data.
As you can see here, I used the separate function according to this post. It may be useful for you!
library(tidyr)
df <- read.csv("your_data.csv", header = TRUE, sep = ";")
df1 <- separate(data = df, col = KUZEY, into = c("lat_degrees", "lat_minutes", "lat_seconds"), sep = ":")
df2 <- separate(data = df1, col = DOGU, into = c("lon_degrees", "lon_minutes", "lon_seconds"), sep = ":")
I hope this helps you 😃

Read list of files with inconsistent delimiter/fixed width

I am trying to find a more efficient way to import a list of data files with a kind of awkward structure. The files are generated by a software program that looks like it was intended to be printed and viewed rather than exported and used. The file contains a list of "Compounds" and then some associated data. Following a line reading "Compound X: XXXX", there are a lines of tab delimited data. Within each file the number of rows for each compound remains constant, but the number of rows may change with different files.
Here is some example data:
#Generate two data files to be imported
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t25.2",
"\n2\tA4567\tQC\t26.8\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t51.1",
"\n2\tA4567\tQC\t48.6\n",
file = "test1.txt")
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t25.2",
"\n2\tC4567\tQC\t26.8",
"\n3\tC8910\tQC\t25.4\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t51.1",
"\n2\tC4567\tQC\t48.6",
"\n3\tC8910\tQC\t45.6\n",
file = "test2.txt")
What I want in the end is a list of data frames, one for each "Compound", containing all rows of data associated with each compound. To get there, I have a fairly convoluted approach of smashed together functions which give me what I want but in a very unruly fashion.
library(tidyverse)
## Step 1: ID list of data files
data.files <- list.files(path = ".",
pattern = ".txt",
full.names = TRUE)
## Step 2: Read in the data files
data.list.raw <- lapply(data.files, read_lines, skip = 4)
## Step 3: Identify the "compounds" in the data file output
Hdr.dat <- lapply(data.list.raw, function(x) grepl("Compound", x)) # Scan the file and find the different compounds within it (this can be applied to any Waters output)
grp.dat <- Map(function(x, y) {x[y][cumsum(y)]}, data.list.raw, Hdr.dat)
## Step 4: Unpack the tab delimited parts of the export file, then generate a list of dataframes within a list of imported files
Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
raw.dat <- Map(function(x,y) {Map(Read, split(x, y))}, data.list.raw, grp.dat)
## Step 5: Curate the list of compounds - remove "Compound X: "
cmpd.list <- lapply(raw.dat, function(x) trimws(substring(names(x), 13)))
## Step 6: Rename the headers for the dataframes, remove the blank rows and recentre
NameCols <- function(z) lapply(names(z), function(i){
x <- z[[ i ]]
colnames(x) <- x[2,]
x[c(-1,-2),]
})
data.list <- Map(function(x,y){setNames(NameCols(x), y)}, raw.dat, cmpd.list)
## Step 7: rbind the data based on the compound
cmpd_names <- unique(unlist(sapply(data.list, names)))
result <- list()
j <- for (n in cmpd_names) {
result[[n]] <- map(data.list, n)
}
list.merged <- map(result, dplyr::bind_rows)
list.merged <- lapply(list.merged, function(x) x %>% filter(Name != ""))
The challenge here is script efficiency as far as time (I can import hundreds or thousands of data files with hundreds of lines of data, which can take quite a while) as well as general "cleanliness", which is why I included tidyverse as a tag here. I also want this to be highly generalizable, as the "Compounds" may change over time. If someone can come up with a clean and efficient way to do all of this I would be forever in your debt.
See one approach below. The whole pipeline might be intimidating at first glance. You can insert a head (or tail) call after each step (%>%) to display the current stage of data transformation. There's a bit of cleanup with regular expressions going on in the gsubs: modify as desired.
intermediate_result <-
data.frame(file_name = c('test1.txt','test2.txt')) %>%
rowwise %>%
## read file content into a raw string:
mutate(raw = read_file(file_name)) %>%
## separate raw file contents into rows
## using newline and carriage return as row delimiters:
separate_rows(raw, sep = '[\\n\\r]') %>%
## provide a compound column for later grouping
## by extracting the 'Compound' string from column raw
## or setting the compound column to NA otherwise:
mutate(compound = ifelse(grepl('^Compound',raw),
gsub('.*(Compound .*):.*','\\1', raw),
NA)
) %>%
## remove rows with empty raw text:
filter(raw != '') %>%
## filling missing compound values (NAs) with last non-NA compound string:
fill(compound, .direction = 'down') %>%
## keep only rows with tab-separated raw string
## indicating tabular data
filter(grepl('\\t',raw)) %>%
## insert a column header 'Index' because
## original format has four data columns but only three header cols:
mutate(raw = gsub(' *\\tName','Index\tName',raw))
Above steps result in a dataframe with a column 'raw' containing the cleaned-up data as string suited for conversion into tabular data (tab-delimited, linefeeds).
From there on, we can either proceed by keeping and householding the future single tables inside the parent table as a so-called list column (Variant A) or proceed with splitting column 'raw' and mapping it (Variant B, credits to #Dorton).
Variant A produces a column of dataframes inside the dataframe:
intermediate_result %>%
group_by(compound) %>%
## the nifty piece: you can store dataframes inside a dataframe:
mutate(
tables = list(read.table(text = raw, header = TRUE, sep = '\t' ))
)
Variant B produces a list of dataframes named with the corresponding compound:
intermediate_result %>%
split(f = as.factor(.$compound)) %>%
lapply(function(x) x %>%
separate(raw,
into = unlist(
str_split(x$raw[1], pattern = "\t"))
)
)

How to merge tables and format appripriately?

So I have the following in cityzone.txt:
"earth/city/somerset/forest/somerset-test.txt#53497",
"earth/city/nottingham/forest/nighthill.txt#53498",
"earth/city/bury/town/bishop-zone1.mp3#53695",
And the following in areasize.txt:
planet\mars\red\crater.txt;56,
pluto\distant\dwarfmoon.txt;181,
mars\hot\red\redmoon.txt;43,
earth\city\somerset\forest\somerset-test.txt;205,
earth\city\bury\town\bishop-zone1.mp3;499,
So what I need is for a new table to be created and written to an output file.
What should happen is - for each row in cityzone.txt, the title for that row should be looked up in areasize.txt. If the title exists, the areasize number from areasize.txt should be appended to the cityzone row like this:
"title#id#areasize",
With quotes and comma accordingly.
So for cityzones.txt above, the output should be thus:
"earth/city/somerset/forest/somerset-test.txt#53497#205",
"earth/city/bury/town/bishop-zone1.mp3#53695#499",
And then it should be output to a file with quote sand comma as shown.
So only 2 of the 3 cityzone.txt rows are included in the results because only 2 of the 3 rows exist in areasize.txt.
My starter code for this is really a continuation from this question:
How do I merge partial data and format it in R?
So I will add the code for this to the code in that question.
Thank you.
You can do :
library(dplyr)
library(tidyr)
#Read the text files and keep only 1st column
cityzone <- read.table('cityzone.txt')[1]
areasize <- read.table('areasize.txt', sep = ';')
#Separate columns on # and join
#Clean areasize dataframe
cityzone %>% separate(V1, c('V1', 'V2'), sep = '#') %>%
inner_join(areasize %>%
mutate(V1 = gsub('\\\\', '/', V1),
V2 = sub(',$', '', V2)),
by = 'V1') -> result
#Combine output in required format and write
cat(sprintf('"%s#%s#%s",', result$V1, result$V2.x, result$V2.y),
file = 'output.lua', sep = '\n')

Reading googledrive contents from R

I'm aiming to get a list of all files in a Google Drive folder, as well at the associated metadata for those files. When I use drive_ls, it returns 3 columns {name, id, drive_resource}. drive_resource is a structured like this: list(kind = "drive#file", id = "abc",...). However, some of the list is not qualified by quotations, and commas are also occassionally used when not a separator.
Any ideas how I might properly unlist this? I can't find anywhere in the package that can handle this.
Using the package 'googledrive', I can get a list of all the files
a <- drive_ls(path = "abc", recursive = TRUE)
The below attempt gets close, but fails to get thee column names and also splits some values at the wrong place based on a comma being contained in the string.
a$drive_resource <- vapply(a$drive_resource, paste, collapse = ",", character(1L))
abcd <- a%>% separate(drive_resource, sep = ",", into = c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30") )
You can try the following approach. It's an example with only four elements of the list (selected names are specified in the function). The function maps each list contained in each row to a tibble, so you can unnest it
require(googledrive)
require(dplyr)
f <- function(l){
l[c("version","webContentLink","viewedByMeTime","mimeType")] %>% as_tibble()
}
dr_content <- drive_ls(path = "<path>", recursive = TRUE)
dr_content <- dr_content %>% mutate(drive_resource = purrr::map(drive_resource, f))
dr_content <- dr_content %>% tidyr::unnest(drive_resource)

Optimize calls to mutate and summarise?

I have this R script:
rm(list = ls())
library(tidyr)
suppressWarnings(library(dplyr))
outFile = "zFinal.lua"
cat("\014\n")
cat(file = outFile, sep = "")
filea <- read.csv("csva.csv", strip.white = TRUE)
fileb <- read.csv("csvb.csv", strip.white = TRUE, sep = ";", header=FALSE)
df <-
merge(filea, fileb, by.x = c(3), by.y = c(1)) %>%
subset(select = c(1, 3, 6, 2)) %>%
arrange(ColA, ColB, V2) %>%
group_by(ColA) %>%
mutate(V2 = paste0('"', V2, "#", ColB, '"')) %>%
summarise(ID = paste(V2, collapse = ", ", sep=";")) %>%
mutate(ID = paste0('["', ColA, '"] = {', ID, '},')) %>%
mutate(ID = paste0('\t\t', ID))
df <- df[c("ID")]
cat("\n\tmyTable = {\n", file = outFile, append = TRUE, sep = "\n")
write.table(df, append = TRUE, file = outFile, sep = ",", quote = FALSE, row.names = FALSE, col.names = FALSE)
cat("\n\t}", file = outFile, append = TRUE, sep = "\n")
# Done
cat("\nDONE.", sep = "\n")
As you can see, this script opens csva.csv and csvb.csv.
This is csva.csv:
ID,ColA,ColB,ColC,ColD
2,3,100,1,1
3,7,300,1,1
5,7,200,1,1
11,22,900,1,1
14,27,500,1,1
16,30,400,1,1
20,36,900,1,1
23,39,800,1,1
24,42,700,1,1
29,49,800,1,1
45,3,200,1,1
And this is csvb.csv:
100;file1
200;file2
300;file3
400;file4
This is the output file that my script and the csv files produce:
myTable = {
["3"] = {"file1#100", "file2#200"},
["7"] = {"file2#200", "file3#300"},
["30"] = {"file4#400"},
}
This output file is exactly what I want. It's perfect.
This is what the script does. I'm not sure I can explain this very well so if I don't do a good job at that, please skip this section.
For each line in csva.csv, if ColC (csva) contains a number that is contained in Column 1 (csvb), then the output file should contain a line like this:
["3"] = {"file1#100", "file2#200"},
So, in the above example, the first line in ColA (csva) contains number 3 and colB for that line is 100. In csvb, column 1 contains 100 and column 2 contains file1#100.
Because csva contains another number 3 in ColA (the last line), this is also processed and output to the same line.
Ok so my script runs very well indeed and produces perfect output. The problem is it takes too long to run. csva and csvb in my question here are only a few lines long so the output is instant.
However, the data I have to work with in the real world - csva is over 300,000 lines and csvb is over 900,000 lines. So the script takes a long, long time to run (too long to make it feasible). It does work beautifully but it takes far too long to run.
From commenting out lines gradually, it seems that the slowdown is with mutate and summarise. Without those lines, the script runs in about 30 seconds. But with mutate and summarise, it takes hours.
I'm not too advanced with R so how can I make my script run faster possibly by improving my syntax or providing faster alternatives to mutate and summarise?
Here is a more compact version of your code in base R that should offer something of a performance boost.
(Edited to match the data provided by wibeasley.)
ds_a$file_name <- ds_b$file_name[match(ds_a$ColB, ds_b$ColB)]
ds_a <- ds_a[!is.na(ds_a$file_name), -4]
ds_a <- ds_a[order(ds_a$ColB),]
ds_a$file_name <- paste0('"', ds_a$file_name, "#", ds_a$ColB, '"')
res <- tapply(ds_a$file_name, ds_a$ColA, FUN = paste, collapse = ", ", sep=";")
res <- paste0("\t\t[\"", names(res), "\"] = {", res, "},", collapse = "\n")
cat("\n\tmyTable = {", res, "\t}", sep = "\n\n")
Outputting:
myTable = {
["3"] = {"file1#100", "file2#200"},
["7"] = {"file2#200", "file3#300"},
["30"] = {"file4#400"},
}
Here's a dplyr approach that closely follows yours. The real differences are that rows and columns are dropped from the object as soon as possible so there's less baggage to move around.
I'm making some guesses what will actually help with the large datasets. Please report back what the before & after durations are. I like how you said which calls were taking the longest; reporting the new bottles would help too.
If this isn't fast enough, the next easiest move is probably move to sqldf (which uses SQLite under the cover) or data.table. Both require learning a different syntax (unless you already know sql), but could be worth your time in the long run.
# Pretend this info is being read from a file
str_a <-
"ID,ColA,ColB,ColC,ColD
2,3,100,1,1
3,7,300,1,1
5,7,200,1,1
11,22,900,1,1
14,27,500,1,1
16,30,400,1,1
20,36,900,1,1
23,39,800,1,1
24,42,700,1,1
29,49,800,1,1
45,3,200,1,1"
str_b <-
"100;file1
200;file2
300;file3
400;file4"
# Declare the desired columns and their data types.
# Include only the columns needed. Use the smaller 'integer' data type where possible.
col_types_a <- readr::cols_only(
`ID` = readr::col_integer(),
`ColA` = readr::col_integer(),
`ColB` = readr::col_integer(),
`ColC` = readr::col_integer()
# `ColD` = readr::col_integer() # Exclude columns never used
)
col_types_b <- readr::cols_only(
`ColB` = readr::col_integer(),
`file_name` = readr::col_character()
)
# Read the file into a tibble
ds_a <- readr::read_csv(str_a, col_types = col_types_a)
ds_b <- readr::read_delim(str_b, delim = ";", col_names = c("ColB", "file_name"), col_types = col_types_b)
ds_a %>%
dplyr::select( # Quickly drop as many columns as possible; avoid reading if possible
ID,
ColB,
ColA
) %>%
dplyr::left_join(ds_b, by = "ColB") %>% # Join the two datasets
tidyr::drop_na(file_name) %>% # Dump the records you'll never use
dplyr::mutate( # Create the hybrid column
entry = paste0('"', file_name, "#", ColB, '"')
) %>%
dplyr::select( # Dump the unneeded columns
-ID,
-file_name
) %>%
dplyr::group_by(ColA) %>% # Create a bunch of subdatasets
dplyr::arrange(ColB, entry) %>% # Sorting inside the group usually is faster?
dplyr::summarise(
entry = paste(entry, collapse = ", ", sep = ";")
) %>%
dplyr::ungroup() %>% # Stack all the subsets on top of each other
dplyr::mutate( # Mush the two columns
entry = paste0('\t\t["', ColA, '"] = {', entry, '},')
) %>%
dplyr::pull(entry) %>% # Isolate the desired vector
paste(collapse = "\n") %>% # Combine all the elements into one.
cat()
result:
["3"] = {"file1#100", "file2#200"},
["7"] = {"file2#200", "file3#300"},
["30"] = {"file4#400"},
you could try to load your table as a data.table instead. usually data.tables are faster in their operations than data.frames
library(data.table)
filea <- fread("csva.csv")
just check that it is still a data.table before you come to the mutate function (just print it, you will see the obvious difference to the data.frame).
Here's another solution that leverages data.table's performance while still staying within your dplyr knowledge. I'm not sure there's much room to improve within only 10 seconds, but theoretically this could help larger datasets where the cost to create the indexes is amortized over a longer stretch of execution.
The dtplyr package is translating the dplyr verbs (that are familiar to you) to data.table syntax under the hood. That's leveraging the keys, which should improve the performance, especially with joining and grouping.
The dtplyr::lazy_dt feature might help optimize the dplyr-to-data.table translation.
Finally, vroom replaces readr, mostly out of curiosity. But it's independent from the other changes, and it sounds like that's never been a bottleneck
col_types_a <- vroom::cols_only(
`ID` = vroom::col_integer(),
`ColA` = vroom::col_integer(),
`ColB` = vroom::col_integer(),
`ColC` = vroom::col_integer()
# `ColD` = vroom::col_integer() # Leave out this column b/c it's never used
)
col_types_b <- vroom::cols_only(
`ColB` = vroom::col_integer(),
`file_name` = vroom::col_character()
)
ds_a <- vroom::vroom(str_a, col_types = col_types_a)
ds_b <- vroom::vroom(str_b, delim = ";", col_names = c("ColB", "file_name"), col_types = col_types_b)
# ds_a <- data.table::setDT(ds_a, key = c("ColB", "ColA"))
# ds_b <- data.table::setDT(ds_b, key = "ColB")
ds_a <- dtplyr::lazy_dt(ds_a, key_by = c("ColB", "ColA")) # New line 1
ds_b <- dtplyr::lazy_dt(ds_b, key_by = "ColB") # New line 2
ds_a %>%
dplyr::select( # Quickly drop as many columns as possible; avoid reading if possible
ID,
ColB,
ColA
) %>%
dplyr::inner_join(ds_b, by = "ColB") %>% # New line 3 (replaces left join)
# tidyr::drop_na(file_name) %>% # Remove this line
# dplyr::filter(!is.na(file_name)) %>% # Alternative w/ left join
dplyr::mutate(
entry = paste0('"', file_name, "#", ColB, '"')
) %>%
dplyr::select( # Dump the uneeded columns
-ID,
-file_name
) %>%
dplyr::group_by(ColA) %>%
dplyr::arrange(ColB, entry) %>% # Sort inside the group usually helps
dplyr::summarise(
entry = paste(entry, collapse = ", ", sep=";")
) %>%
dplyr::ungroup() %>%
dplyr::mutate(
entry = paste0('\t\t["', ColA, '"] = {', entry, '},')
) %>%
dplyr::pull(entry) %>% # Isolate the desired vector
paste(collapse = "\n") %>%
cat()

Resources