How to merge tables and format appropriately? - r

So I have the following in cityzone.txt:
"earth/city/somerset/forest/somerset-test.txt#53497",
"earth/city/nottingham/forest/nighthill.txt#53498",
"earth/city/bury/town/bishop-zone1.mp3#53695",
And the following in areasize.txt:
planet\mars\red\crater.txt;56,
pluto\distant\dwarfmoon.txt;181,
mars\hot\red\redmoon.txt;43,
earth\city\somerset\forest\somerset-test.txt;205,
earth\city\bury\town\bishop-zone1.mp3;499,
So what I need is for a new table to be created and written to an output file.
What should happen is - for each row in cityzone.txt, the title for that row should be looked up in areasize.txt. If the title exists, the areasize number from areasize.txt should be appended to the cityzone row like this:
"title#id#areasize",
With quotes and comma accordingly.
So for cityzone.txt above, the output should be thus:
"earth/city/somerset/forest/somerset-test.txt#53497#205",
"earth/city/bury/town/bishop-zone1.mp3#53695#499",
And then it should be output to a file with quotes and comma as shown.
So only 2 of the 3 cityzone.txt rows are included in the results because only 2 of the 3 rows exist in areasize.txt.
My starter code for this is really a continuation from this question:
How do I merge partial data and format it in R?
So I will add the code for this to the code in that question.
Thank you.

You can do:
library(dplyr)
library(tidyr)
#Read the text files and keep only 1st column
cityzone <- read.table('cityzone.txt')[1]
areasize <- read.table('areasize.txt', sep = ';')
#Separate columns on # and join
#Clean areasize dataframe
cityzone %>% separate(V1, c('V1', 'V2'), sep = '#') %>%
inner_join(areasize %>%
mutate(V1 = gsub('\\\\', '/', V1),
V2 = sub(',$', '', V2)),
by = 'V1') -> result
#Combine output in required format and write
cat(sprintf('"%s#%s#%s",', result$V1, result$V2.x, result$V2.y),
file = 'output.lua', sep = '\n')
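If you would rather not load dplyr and tidyr, the same lookup can be sketched in base R with match(). This is only a sketch: the rows are copied from the question into character vectors so it runs on its own (in practice you would get them from readLines()), and the names cz_title, as_title and merged are made up for illustration.

```r
# Sample rows copied from the question; real code would use readLines().
cityzone <- c(
  '"earth/city/somerset/forest/somerset-test.txt#53497",',
  '"earth/city/nottingham/forest/nighthill.txt#53498",',
  '"earth/city/bury/town/bishop-zone1.mp3#53695",'
)
areasize <- c(
  'planet\\mars\\red\\crater.txt;56,',
  'pluto\\distant\\dwarfmoon.txt;181,',
  'mars\\hot\\red\\redmoon.txt;43,',
  'earth\\city\\somerset\\forest\\somerset-test.txt;205,',
  'earth\\city\\bury\\town\\bishop-zone1.mp3;499,'
)

# Strip the leading quote and the trailing quote-comma, then split on '#'.
cz <- gsub('^"|",$', '', cityzone)
cz_title <- sub('#.*$', '', cz)
cz_id <- sub('^.*#', '', cz)

# Turn backslashes into forward slashes, drop the trailing comma, split on ';'.
as_clean <- sub(',$', '', gsub('\\', '/', areasize, fixed = TRUE))
as_title <- sub(';.*$', '', as_clean)
as_size <- sub('^.*;', '', as_clean)

# Look up each cityzone title in areasize; NA means "no match, drop the row".
idx <- match(cz_title, as_title)
keep <- !is.na(idx)
merged <- sprintf('"%s#%s#%s",', cz_title[keep], cz_id[keep], as_size[idx[keep]])
print(merged)
```

writeLines(merged, 'output.lua') would then write the rows to the output file. Note that match() only uses the first matching title, so this assumes titles are unique in areasize.txt.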

Related

Read list of files with inconsistent delimiter/fixed width

I am trying to find a more efficient way to import a list of data files with a kind of awkward structure. The files are generated by a software program whose output looks like it was intended to be printed and viewed rather than exported and used. Each file contains a list of "Compounds" and then some associated data. Following a line reading "Compound X: XXXX", there are lines of tab-delimited data. Within each file the number of rows for each compound remains constant, but the number of rows may differ between files.
Here is some example data:
#Generate two data files to be imported
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t25.2",
"\n2\tA4567\tQC\t26.8\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t51.1",
"\n2\tA4567\tQC\t48.6\n",
file = "test1.txt")
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t25.2",
"\n2\tC4567\tQC\t26.8",
"\n3\tC8910\tQC\t25.4\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t51.1",
"\n2\tC4567\tQC\t48.6",
"\n3\tC8910\tQC\t45.6\n",
file = "test2.txt")
What I want in the end is a list of data frames, one for each "Compound", containing all rows of data associated with each compound. To get there, I have a fairly convoluted approach of smashed together functions which give me what I want but in a very unruly fashion.
library(tidyverse)
## Step 1: ID list of data files
data.files <- list.files(path = ".",
pattern = "\\.txt$",
full.names = TRUE)
## Step 2: Read in the data files
data.list.raw <- lapply(data.files, read_lines, skip = 4)
## Step 3: Identify the "compounds" in the data file output
Hdr.dat <- lapply(data.list.raw, function(x) grepl("Compound", x)) # Scan the file and find the different compounds within it (this can be applied to any Waters output)
grp.dat <- Map(function(x, y) {x[y][cumsum(y)]}, data.list.raw, Hdr.dat)
## Step 4: Unpack the tab delimited parts of the export file, then generate a list of dataframes within a list of imported files
Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
raw.dat <- Map(function(x,y) {Map(Read, split(x, y))}, data.list.raw, grp.dat)
## Step 5: Curate the list of compounds - remove "Compound X: "
cmpd.list <- lapply(raw.dat, function(x) trimws(substring(names(x), 13)))
## Step 6: Rename the headers for the dataframes, remove the blank rows and recentre
NameCols <- function(z) lapply(names(z), function(i){
x <- z[[ i ]]
colnames(x) <- x[2,]
x[c(-1,-2),]
})
data.list <- Map(function(x,y){setNames(NameCols(x), y)}, raw.dat, cmpd.list)
## Step 7: rbind the data based on the compound
cmpd_names <- unique(unlist(sapply(data.list, names)))
result <- list()
for (n in cmpd_names) {
result[[n]] <- map(data.list, n)
}
list.merged <- map(result, dplyr::bind_rows)
list.merged <- lapply(list.merged, function(x) x %>% filter(Name != ""))
The challenge here is script efficiency as far as time (I can import hundreds or thousands of data files with hundreds of lines of data, which can take quite a while) as well as general "cleanliness", which is why I included tidyverse as a tag here. I also want this to be highly generalizable, as the "Compounds" may change over time. If someone can come up with a clean and efficient way to do all of this I would be forever in your debt.
See one approach below. The whole pipeline might be intimidating at first glance. You can insert a head (or tail) call after each step (%>%) to display the current stage of data transformation. There's a bit of cleanup with regular expressions going on in the gsubs: modify as desired.
intermediate_result <-
data.frame(file_name = c('test1.txt','test2.txt')) %>%
rowwise %>%
## read file content into a raw string:
mutate(raw = read_file(file_name)) %>%
## separate raw file contents into rows
## using newline and carriage return as row delimiters:
separate_rows(raw, sep = '[\\n\\r]') %>%
## provide a compound column for later grouping
## by extracting the 'Compound' string from column raw
## or setting the compound column to NA otherwise:
mutate(compound = ifelse(grepl('^Compound',raw),
gsub('.*(Compound .*):.*','\\1', raw),
NA)
) %>%
## remove rows with empty raw text:
filter(raw != '') %>%
## filling missing compound values (NAs) with last non-NA compound string:
fill(compound, .direction = 'down') %>%
## keep only rows with tab-separated raw string
## indicating tabular data
filter(grepl('\\t',raw)) %>%
## insert a column header 'Index' because
## original format has four data columns but only three header cols:
mutate(raw = gsub(' *\\tName','Index\tName',raw))
The steps above result in a data frame with a column 'raw' containing the cleaned-up data as strings suitable for conversion into tabular data (tab-delimited, one row per line).
From here, we can either keep the individual tables inside the parent table as a so-called list column (Variant A) or split on the compound and separate column 'raw' within each piece (Variant B, credits to #Dorton).
Variant A produces a column of dataframes inside the dataframe:
intermediate_result %>%
group_by(compound) %>%
## the nifty piece: you can store dataframes inside a dataframe:
mutate(
tables = list(read.table(text = raw, header = TRUE, sep = '\t' ))
)
Variant B produces a list of dataframes named with the corresponding compound:
intermediate_result %>%
split(f = as.factor(.$compound)) %>%
lapply(function(x) x %>%
separate(raw,
into = unlist(
str_split(x$raw[1], pattern = "\t"))
)
)
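For comparison, the core grouping step can also be sketched in base R: mark the "Compound" lines, use cumsum() to give every following line the number of its compound, and hand each group to read.table(). This is a sketch under assumptions, not a drop-in replacement: the lines vector below is a tidied copy of test1.txt with no stray padding spaces, and the names is_hdr, block and tables are invented for the example.

```r
# A tidied copy of test1.txt, one element per line (assumed clean input).
lines <- c(
  "Quantify Compound Summary Report", "",
  "Printed Mon March 28 14:54:39 2022", "",
  "Compound 1: One",
  "\tName\tID\tResult",
  "1\tA1234\tQC\t25.2",
  "2\tA4567\tQC\t26.8", "",
  "Compound 2: Two",
  "\tName\tID\tResult",
  "1\tA1234\tQC\t51.1",
  "2\tA4567\tQC\t48.6", ""
)

# TRUE on every "Compound X: ..." line.
is_hdr <- grepl("^Compound", lines)

# Running count of compound lines: each line gets the number of its compound.
block <- cumsum(is_hdr)

# Keep only the tab-delimited lines that belong to some compound.
keep <- block > 0 & !is_hdr & grepl("\t", lines)

# One data frame per compound; the empty first header cell becomes column "X".
tables <- lapply(split(lines[keep], block[keep]), function(x) {
  read.table(text = x, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
})

# Name the list after the text following "Compound N: ".
names(tables) <- sub("^Compound [0-9]+: *", "", lines[is_hdr])
print(tables)
```

To cover many files, the same steps could be wrapped in a function, applied with lapply(), and the per-compound tables combined with rbind() or dplyr::bind_rows().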

Rename columns of dataframe by adding a prefix -in R

I want to rename all columns of my dataframe (except id and t) by adding the prefix "le_".
Firstly I turn the data frame from wide to long format and after specifying the columns ( named 1 - 27) I want to rename them as le_1 - le_27. Any suggestions on how to do this?
I tried with rename but I got stuck.
df_long_le <- df_wide_le %>%
pivot_longer(cols = starts_with("le_"), names_to = c( "t", ".value"),
names_pattern = "le_(.*)_(.*)") %>%
rename(df_long_le[3:29] = "le_*[1-27]")
Thank you!
This is what the dataframe looks like: [image]
To change all of the columns:
colnames(df_long_le) <- paste("le", colnames(df_long_le), sep = '_')
To change all but 1 and 2:
newcolnames <- paste("le", colnames(df_long_le)[-c(1,2)], sep = '_')
colnames(df_long_le) <- c(colnames(df_long_le)[1:2], newcolnames)
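Selecting the columns by name rather than by position can be a little more robust if the column order ever changes. Here is a minimal sketch on a made-up data frame; df and its values are hypothetical stand-ins for df_long_le.

```r
# Hypothetical stand-in for df_long_le: id, t, and columns named "1", "2".
df <- data.frame(id = 1:2, t = c(1, 2), `1` = c(0.1, 0.2), `2` = c(0.3, 0.4),
                 check.names = FALSE)

# TRUE for the columns that must keep their names.
keep <- names(df) %in% c("id", "t")

# Prefix every other column name with "le_".
names(df)[!keep] <- paste0("le_", names(df)[!keep])
print(names(df))
```

With dplyr 1.0 or later, rename_with(df, ~ paste0("le_", .x), -c(id, t)) should express the same idea inside a pipe.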

How to combine two specific cells using a comma separation in R

I have the following data set
I'd like cell [2,6] to retain its current content with the addition of the cell below it, separated by a ",". I have found paste() functions for concatenating columns into new columns but can't find an answer for specific cell combinations. Any help appreciated.
For example, I'd like the highlighted cell in the above to read John, Mary
Here is one way to paste each value together with the value in the row below it.
# Loading required libraries
library(dplyr)
# Creating sample data
example <- data.frame(Type = c("Director", NA_character_),
Name = c("John", "Mary"))
example %>%
# Get value of the below row
mutate(new_col = lead(Name, 1),
# If Type is director then paste names else null
new_col = ifelse(Type == "Director", paste(Name, new_col, sep = ", "), NA_character_))
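If only one specific cell needs changing, plain indexing may be enough. The sketch below uses a made-up three-row data frame (the question's table is not shown, so the column names and row positions here are assumptions) and overwrites one cell with itself plus the cell beneath it.

```r
# Hypothetical data; in the question the target is row 2, column 6.
df <- data.frame(Type = c("Producer", "Director", NA),
                 Name = c("Ann", "John", "Mary"),
                 stringsAsFactors = FALSE)

# Overwrite one cell with its own value plus the value from the row below.
df[2, "Name"] <- paste(df[2, "Name"], df[3, "Name"], sep = ", ")
print(df)
```

For the question's layout the same line would read df[2, 6] <- paste(df[2, 6], df[3, 6], sep = ", "), provided column 6 is a character column rather than a factor.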

Clever way to avoid for loop in R

I have a data file that follows roughly this format:
HEADER:001,v1,v2,v3...,v10
v1,v2,v3,STATUS,v5...v6
.
.
.
HEADER:006,v1,v2,v3...v10
HEADER:012,v1,v2,v3...v10
v1,v2,v3,STATUS,v5...v6
v1,v2,v3,STATUS,v5...v6
.
.
.
etc
where each block or chunk of data leads off with a comma separated line that includes the header and a unique (not necessarily sequential) number, and then there may be 0 or more lines that are identified by the STATUS keyword in the body of the chunk.
I am reading this block in using readLines and then splitting it into header lines and status lines to be read in as CSV separately, since they have a different number of variables:
datablocks <- readLines(filename, skipNul = T)
headers <- datablocks[grepl("HEADER", datablocks, useBytes = T)]
headers <- read.csv(text=headers, header= F, stringsAsFactors = F)
statuses <- datablocks[grepl("STATUS", datablocks, useBytes = T)]
statuses <- read.csv(text=statuses, header= F, stringsAsFactors = F)
Eventually, I would like to inner join this data, so that the variables from the header are included in each status line:
all <- headers %>% inner_join(statuses, by = c("ID" = "ID"))
But I need a way to add the unique ID of the header to each status line below it, until the next header. The only way I can think of doing this is with a for loop that runs over the initial full text datablock:
header_id <- NA
for(i in seq(1:length(datablocks))) {
is_header_line <- str_extract(datablocks[i], "HEADER:([^,]*)")
if(!is.na(is_header_line)) {
header_id <- is_header_line
}
datablocks[i] <- paste(datablocks[i], header_id, sep=",")
}
This works fine, but it's ugly, and not very... R-ish. I can't think of a way to vectorize this operation, since it needs to keep an external variable.
Am I missing something obvious here?
Edit
If the input looks literally like this
HEADER:001,a0,b0,c0,d0
e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0
HEADER:006,a1,b1,c1,d1
HEADER:012,a2,b2,c2,d2
e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1
e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2
The output should look like this:
e0,f0,g0,h0,i0,j0,k0,l0,m0,a0,b0,c0,d0,001
e1,f1,g1,h1,i1,j1,k1,l1,m1,a2,b2,c2,d2,012
e2,f2,g2,h2,i2,j2,k2,l2,m2,a2,b2,c2,d2,012
So there needs to be a column propagated from the parent (HEADER) to the children (STATUS) to inner join on.
EDIT:
Thanks for the clarification. The specific input and output makes it dramatically easier to avoid misunderstandings.
Here I use tidyr::separate to separate out the header label from the "a0,b0,c0,d0" part, and tidyr::fill to propagate header info down into the following status rows.
library(tidyverse)
read_table(col_names = "text",
"HEADER:001,a0,b0,c0,d0
e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0
HEADER:006,a1,b1,c1,d1
HEADER:012,a2,b2,c2,d2
e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1
e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2") %>%
mutate(status_row = str_detect(text, "STATUS"),
header_row = str_detect(text, "HEADER"),
header = if_else(header_row, str_remove(text, "HEADER:"), NA_character_)) %>%
separate(header, c("header", "stub"), sep = ",", extra = "merge") %>%
fill(header, stub) %>%
filter(status_row) %>%
mutate(output = paste(str_remove(text, "STATUS,"), stub, header, sep = ",")) %>%
select(output)
Result
# A tibble: 3 x 1
output
<chr>
1 e0,f0,g0,h0,i0,j0,k0,l0,m0,a0,b0,c0,d0,001
2 e1,f1,g1,h1,i1,j1,k1,l1,m1,a2,b2,c2,d2,012
3 e2,f2,g2,h2,i2,j2,k2,l2,m2,a2,b2,c2,d2,012
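The same "carry the last header forward" idea can also be sketched in base R with cumsum(), which replaces the external variable from the for loop. This assumes the data starts with a HEADER line; the names owner, header_rest and output are made up for the example.

```r
# The literal input from the question, one element per line.
datablocks <- c(
  "HEADER:001,a0,b0,c0,d0",
  "e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0",
  "HEADER:006,a1,b1,c1,d1",
  "HEADER:012,a2,b2,c2,d2",
  "e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1",
  "e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2"
)

is_header <- grepl("^HEADER:", datablocks)
header_lines <- datablocks[is_header]

# For every line, the index of the most recent header seen so far.
owner <- cumsum(is_header)

# Split each header into its ID and the remaining fields.
header_id <- sub("^HEADER:([^,]*),.*$", "\\1", header_lines)
header_rest <- sub("^HEADER:[^,]*,", "", header_lines)

# Status lines with the STATUS field removed.
is_status <- grepl("STATUS", datablocks) & !is_header
status_body <- sub("STATUS,", "", datablocks[is_status], fixed = TRUE)

# Append the owning header's fields and ID to each status line.
output <- paste(status_body,
                header_rest[owner[is_status]],
                header_id[owner[is_status]],
                sep = ",")
print(output)
```

Because owner is just an integer vector, it could equally be added as an ID column to both the header and status tables before the inner join.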

How can I have R identify text in one column and then use it to create data in another column?

I have materials lists from vendors, and I would like to expand the description into other columns so I can use the filter function in Excel to more easily find products based on their description. Here's an example of the description I receive from a vendor:
2 SS 150LB 304L SLIP ON FLANGE
I would like to take this description and have R identify certain bits of text, and based on that text, add data to another column. For instance: if the string "SS" is in this cell, then put the word "STAINLESS" in a Materials column. If the string "BLK" is found in this cell, then put the word "BLACK" in the Materials column. If the string "FLANGE" is found in this cell, then put the word "FLANGE" in another column called Part_Type.
Here is one simple approach which looks for certain character sequences to use as a trigger to add strings to other columns.
library(tidyverse)
df <- tibble(x = c(('2 SS 150LB 304L SLIP ON FLANGE'),
('3 BLK ON FLANGE')))
# add new columns filled with NA
df <- df %>%
add_column(Materials = NA_character_) %>%
add_column(Part_Type = NA_character_)
df %>%
mutate(Materials = if_else(str_detect(x, 'SS'), 'STAINLESS', Materials)) %>%
mutate(Materials = if_else(str_detect(x, 'BLK'), 'BLACK', Materials)) %>%
mutate(Part_Type = if_else(str_detect(x, 'FLANGE'),'FLANGE', Part_Type))
Can an item be both 'stainless steel' and 'black'? i.e. do we want to add multiple strings to one column? In that case, it would be necessary to append rather than overwrite. Here's one approach to that problem.
my_nrow = 2
df <- tibble(x = c(('2 SS 150LB 304L SLIP ON FLANGE'),
('3 BLK SS ON FLANGE')),
Materials = vector('character', my_nrow),
Part_type = vector('character', my_nrow))
df
df %>%
mutate(Materials = if_else(str_detect(x, 'SS'), str_c(Materials,'STAINLESS '), Materials)) %>%
mutate(Materials = if_else(str_detect(x, 'BLK'), str_c(Materials,'BLACK '), Materials)) %>%
mutate(Part_type = if_else(str_detect(x, 'FLANGE'), str_c(Part_type,'FLANGE', sep = ' '), Part_type))
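As the list of keywords grows, one mutate() per keyword gets long. A possible alternative is to keep the keywords in a named lookup vector and loop over it. The base R sketch below is only an illustration: tag_keywords, the lookup vectors and the third description are made up, and it matches whole words only (so 'SS' would not fire on a word like 'GLASS'), which is an assumption about what is wanted.

```r
# Hypothetical descriptions; the first two come from the examples above.
desc <- c("2 SS 150LB 304L SLIP ON FLANGE", "3 BLK SS ON FLANGE", "1 BLK PIPE")

# Keyword -> label lookups (extend as needed).
materials <- c(SS = "STAINLESS", BLK = "BLACK")
part_types <- c(FLANGE = "FLANGE")

# Append the label of every keyword found as a whole word; NA if none match.
tag_keywords <- function(x, lookup) {
  out <- character(length(x))
  for (k in names(lookup)) {
    hit <- grepl(paste0("\\b", k, "\\b"), x)
    if (any(hit)) {
      out[hit] <- trimws(paste(out[hit], lookup[[k]]))
    }
  }
  out[out == ""] <- NA_character_
  out
}

df <- data.frame(x = desc,
                 Materials = tag_keywords(desc, materials),
                 Part_Type = tag_keywords(desc, part_types),
                 stringsAsFactors = FALSE)
print(df)
```

Labels are appended in the order the keywords appear in the lookup vector, not in the order they appear in the description.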
