How to define column specification for similarly named column with readr? - r

I have a database with 250 columns and want to read only 50 of them instead of loading all of them and then dropping columns with dplyr::select. I suppose I can do that with a column specification, but I don't want to type the specification manually for all those columns.
The 50 columns I want to keep have a common prefix, say 'blop', so I managed to manually change the column specification object I got from readr::spec_csv and then used it to read my data file:
library(readr)

short_colspec <- spec_csv('myfile.csv')
short_colspec$cols <- lapply(names(short_colspec$cols), function(name) {
  if (substr(name, 1, 4) == 'blop') {
    return(col_character())
  } else {
    return(col_skip())
  }
})
short_data <- read_csv('myfile.csv', col_types = short_colspec)
Is there a more robust way to build such a column specification with readr (or any other package) than what I did?

Using data.table's fread you can choose the columns you want to skip (drop) or keep (select):
library(data.table)
library(magrittr)

# read the first line of the file to decide which columns to keep
# adjust the strsplit character (';') to the delimiter of your csv
keep_col <- readLines("myfile.csv", n = 1L) %>%
  strsplit(";") %>%
  unlist() %>%
  grep("blop", .)

# read the file, keeping only the desired columns
fread("myfile.csv", select = keep_col)

Related

Read list of files with inconsistent delimiter/fixed width

I am trying to find a more efficient way to import a list of data files with a somewhat awkward structure. The files are generated by a software program that looks like it was intended to be printed and viewed rather than exported and used. The file contains a list of "Compounds" and then some associated data. Following a line reading "Compound X: XXXX", there are lines of tab-delimited data. Within each file the number of rows for each compound remains constant, but the number of rows may change between files.
Here is some example data:
# Generate two data files to be imported
cat("Quantify Compound Summary Report\n",
    "\nPrinted Mon March 28 14:54:39 2022\n",
    "\nCompound 1: One\n",
    "\tName\tID\tResult",
    "\n1\tA1234\tQC\t25.2",
    "\n2\tA4567\tQC\t26.8\n",
    "\nCompound 2: Two\n",
    "\tName\tID\tResult",
    "\n1\tA1234\tQC\t51.1",
    "\n2\tA4567\tQC\t48.6\n",
    file = "test1.txt")

cat("Quantify Compound Summary Report\n",
    "\nPrinted Mon March 28 14:54:39 2022\n",
    "\nCompound 1: One\n",
    "\tName\tID\tResult",
    "\n1\tC1234\tQC\t25.2",
    "\n2\tC4567\tQC\t26.8",
    "\n3\tC8910\tQC\t25.4\n",
    "\nCompound 2: Two\n",
    "\tName\tID\tResult",
    "\n1\tC1234\tQC\t51.1",
    "\n2\tC4567\tQC\t48.6",
    "\n3\tC8910\tQC\t45.6\n",
    file = "test2.txt")
What I want in the end is a list of data frames, one for each "Compound", containing all rows of data associated with each compound. To get there, I have a fairly convoluted approach of smashed together functions which give me what I want but in a very unruly fashion.
library(tidyverse)

## Step 1: ID list of data files
data.files <- list.files(path = ".",
                         pattern = ".txt",
                         full.names = TRUE)

## Step 2: Read in the data files
data.list.raw <- lapply(data.files, read_lines, skip = 4)

## Step 3: Identify the "compounds" in the data file output
# Scan each file and find the different compounds within it (this can be applied to any Waters output)
Hdr.dat <- lapply(data.list.raw, function(x) grepl("Compound", x))
grp.dat <- Map(function(x, y) {x[y][cumsum(y)]}, data.list.raw, Hdr.dat)

## Step 4: Unpack the tab-delimited parts of the export file, then generate a list of dataframes within a list of imported files
Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
raw.dat <- Map(function(x, y) {Map(Read, split(x, y))}, data.list.raw, grp.dat)

## Step 5: Curate the list of compounds - remove "Compound X: "
cmpd.list <- lapply(raw.dat, function(x) trimws(substring(names(x), 13)))

## Step 6: Rename the headers for the dataframes, remove the blank rows and recentre
NameCols <- function(z) lapply(names(z), function(i) {
  x <- z[[i]]
  colnames(x) <- x[2, ]
  x[c(-1, -2), ]
})
data.list <- Map(function(x, y) {setNames(NameCols(x), y)}, raw.dat, cmpd.list)

## Step 7: rbind the data based on the compound
cmpd_names <- unique(unlist(sapply(data.list, names)))
result <- list()
for (n in cmpd_names) {
  result[[n]] <- map(data.list, n)
}
list.merged <- map(result, dplyr::bind_rows)
list.merged <- lapply(list.merged, function(x) x %>% filter(Name != ""))
The challenge here is script efficiency in terms of time (I can be importing hundreds or thousands of data files with hundreds of lines of data, which can take quite a while) as well as general "cleanliness", which is why I included tidyverse as a tag here. I also want this to be highly generalizable, as the "Compounds" may change over time. If someone can come up with a clean and efficient way to do all of this, I would be forever in your debt.
See one approach below. The whole pipeline might be intimidating at first glance. You can insert a head (or tail) call after each step (%>%) to display the current stage of data transformation. There's a bit of cleanup with regular expressions going on in the gsubs: modify as desired.
intermediate_result <-
  data.frame(file_name = c('test1.txt', 'test2.txt')) %>%
  rowwise %>%
  ## read file content into a raw string:
  mutate(raw = read_file(file_name)) %>%
  ## separate raw file contents into rows
  ## using newline and carriage return as row delimiters:
  separate_rows(raw, sep = '[\\n\\r]') %>%
  ## provide a compound column for later grouping
  ## by extracting the 'Compound' string from column raw
  ## or setting the compound column to NA otherwise:
  mutate(compound = ifelse(grepl('^Compound', raw),
                           gsub('.*(Compound .*):.*', '\\1', raw),
                           NA)
  ) %>%
  ## remove rows with empty raw text:
  filter(raw != '') %>%
  ## fill missing compound values (NAs) with the last non-NA compound string:
  fill(compound, .direction = 'down') %>%
  ## keep only rows with a tab-separated raw string,
  ## indicating tabular data:
  filter(grepl('\\t', raw)) %>%
  ## insert a column header 'Index' because the
  ## original format has four data columns but only three header cols:
  mutate(raw = gsub(' *\\tName', 'Index\tName', raw))
The steps above result in a data frame with a column 'raw' containing the cleaned-up data as strings suited for conversion into tabular data (tab-delimited lines).
From there on, we can either keep the future per-compound tables inside the parent table as a so-called list column (Variant A) or split column 'raw' and map over it (Variant B, credit to @Dorton).
Variant A produces a column of dataframes inside the dataframe:
intermediate_result %>%
  group_by(compound) %>%
  ## the nifty piece: you can store dataframes inside a dataframe:
  mutate(
    tables = list(read.table(text = raw, header = TRUE, sep = '\t'))
  )
Variant B produces a list of dataframes named with the corresponding compound:
intermediate_result %>%
  split(f = as.factor(.$compound)) %>%
  lapply(function(x) x %>%
           separate(raw,
                    into = unlist(
                      str_split(x$raw[1], pattern = "\t")
                    ))
  )
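One follow-up note: whichever variant you pick, the per-compound tables may still carry each file's repeated header line as a data row, plus the bookkeeping columns file_name and compound. A small, hedged clean-up sketch for Variant B, assuming its result was assigned to result_list:
# assumes the Variant B pipeline above was assigned to result_list
result_list <- lapply(result_list, function(df)
  df %>%
    filter(Name != 'Name') %>%           # drop the repeated header rows
    select(-file_name, -compound))       # drop the bookkeeping columns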

Ordering columns of data in R

I have a CSV file with 141 rows and several columns. I want the data ordered in ascending order by the first two columns, i.e. 'label' and 'index'. Here is my code:
final_data <- read.csv("./features.csv",
                       header = FALSE,
                       col.names = c('label', 'index', 'nr_pix', 'rows_with_1', 'cols_with_1',
                                     'rows_with_3p', 'cols_with_3p', 'aspect_ratio',
                                     'neigh_1', 'no_neigh_above', 'no_neigh_below',
                                     'no_neigh_left', 'no_neigh_right', 'no_neigh_horiz',
                                     'no_neigh_vert', 'connected_areas', 'eyes', 'custom'))

sorted_data_by_label <- final_data[order(label),]
sorted_data_by_index <- sorted_data_by_label[order(index),]

write.table(sorted_data_by_index, file = "./features.csv",
            append = FALSE, sep = ',',
            row.names = FALSE)
I chose to read from a CSV and use write.table because I need to overwrite the CSV with the column names included.
Even with the , after order(label) and order(index), the sorted data should still include all the other rows and columns, right?
After running this code, I only get the first row out of 141 rows. Is there a way to fix this problem?
As @akrun has briefly mentioned, what you need to do is change
sorted_data_by_label <- final_data[order(label),]
to
sorted_data_by_label <- final_data[order(final_data$label),]
and to change
sorted_data_by_index <- sorted_data_by_label[order(index),]
to
sorted_data_by_index <- sorted_data_by_label[order(sorted_data_by_label$index),]
This is because when you write label (or index) on its own, R looks for an object with that name in the global environment, not inside the data frame.
If you mean a column of final_data, you need to reference it explicitly, e.g. final_data$index.
Other options
You can use with:
sorted_data_by_label <- with(final_data, final_data[order(label),])
sorted_data_by_index <- with(sorted_data_by_label, sorted_data_by_label[order(index),])
In dplyr you can use
sorted_data_by_label <- final_data %>% arrange(label)
sorted_data_by_index <- sorted_data_by_label %>% arrange(index)
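If the intention is simply ascending order by label with index as a tiebreaker, both keys can also be passed in a single call; a short sketch using the same data:
library(dplyr)

# dplyr: sort by label, breaking ties by index
sorted_data <- final_data %>% arrange(label, index)

# base R equivalent
sorted_data <- final_data[order(final_data$label, final_data$index), ]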

Filtering process not fetching full data? Using dplyr filter and grep

I have this log file that has about 1200 characters (max) on a line. What I want to do is read this first and then extract certain portions of the file into new columns. I want to extract rows that contain the text “[DF_API: input string]”.
When I read it and then filter based on the rows I am interested in, it almost seems like I am losing data. I tried this using dplyr's filter and using standard grep, with the same result.
I am not sure why this is the case. I would appreciate your help. The code and the data are at the following link.
Satish
Code is given below
library(dplyr)
library(stringr)
setwd("C:/Users/satis/Documents/VF/df_issue_dec01")
sec1 <- read.delim(file="secondary1_aa_small.log")
head(sec1)
names(sec1) <- c("V1")
sec1_test <- filter(sec1,str_detect(V1,"DF_API: input string")==TRUE)
head(sec1_test)
sec1_test2 = sec1[grep("DF_API: input string",sec1$V1, perl = TRUE),]
head(sec1_test2)
write.csv(sec1_test, file = "test_out.txt", row.names = F, quote = F)
write.csv(sec1_test2, file = "test2_out.txt", row.names = F, quote = F)
Data (and code) is given at the link below. Sorry, I should have used dput.
https://spaces.hightail.com/space/arJlYkgIev
Try the code below, which gives you a data frame of filtered lines from your file based on a matching condition. Reading the file with readLines treats every line as plain text, so it avoids read.delim's header and quote handling, which can silently merge lines in a log file.
#to read your file
sec1 <- readLines("secondary1_aa_small.log")
#framing a dataframe by extracting required lines from above file
new_sec1 <- data.frame(grep("DF_API: input string", sec1, value = T))
names(new_sec1) <- c("V1")
Edit: Simple way to split the above column into multiple columns
#extracting substring in between < & >
new_sec1$V1 <- gsub(".*[<\t]([^>]+)[>].*", "\\1", new_sec1$V1)
#replacing comma(,) with a white space
new_sec1$V1 <- gsub("[,]+", " ", new_sec1$V1)
#splitting into separate columns
new_sec1 <- strsplit(new_sec1$V1, " ")
new_sec1 <- lapply(new_sec1, function(x) x[x != ""] )
new_sec1 <- do.call(rbind, new_sec1)
new_sec1 <- data.frame(new_sec1)
Change the column names as needed for your analysis.
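Alternatively, the strsplit/rbind steps can be handed off to read.table, which splits on runs of whitespace by default. A minimal sketch, assuming the two gsub clean-up lines above have been run but not the strsplit lines:
new_sec1 <- read.table(text = new_sec1$V1, header = FALSE,
                       strip.white = TRUE, stringsAsFactors = FALSE)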

how to convert text files into dataframe in R?

I am trying to export data points from MongoDB. Unfortunately I was unable to connect it directly to RStudio, so from the query output I created a text file and attempted to read it into R as a text file.
"cityid", "count"
"102","2"
"55","31"
"119","7"
"206","1"
"18","2"
"15","1"
"32","3"
"14","1"
"54","2"
"23","85"
"158","3"
"266","1"
"9","1"
"34","1"
"159","1"
"31","1"
"22","2"
"209","2"
"121","4"
"73","12"
"350","2"
"311","2"
"377","2"
"230","7"
"290","1"
"49","2"
"379","2"
"75","1"
"59","6"
"165","3"
"19","8"
"13","40"
"126","13"
"243","12"
"325","1"
"17","1"
"null","235"
"144","2"
"334","1"
"40","12"
"7","34"
"181","40"
"349","4"
So basically the format is as above, and I would like to convert this into a data frame that I can use as a reference for calculations with other datasets.
This is what I tried to do to make it a data frame:
L <- readLines(file.choose())
# split each line on the comma; note this keeps the surrounding quotes
parts <- strsplit(L, ",")
library("plyr")
df <- ldply(parts)
colnames(df) <- c("city_id", "count")
str(df)
df$city_id <- suppressWarnings(as.numeric(as.character(df$city_id)))
At the last line, I tried to convert the character values to numeric, but they were coerced to NA.
Does anyone have a better suggestion for getting them into a numeric table?
Or is there actually a better way to bring the MongoDB data into R without copying and pasting it as text files? I managed to connect to MongoDB using RMongo, but the syntax was way too complicated for me to understand. The query I used is:
db.getCollection('logging_app_location_view_logs').aggregate([
  {"$group": {"_id": "$city_id", "total": {"$sum": 1}}}
]).forEach(function(l){
  print('"' + l._id + '","' + l.total + '"');
});
Thanks in advance for your help!
You don't need to specify column names again when you have already passed header = TRUE to read.table. The colClasses argument takes care of each column's class, and na.strings = 'null' turns the literal "null" entries into NA.
df <- read.table(file.choose(), header = TRUE, sep = ",",
                 colClasses = c('character', 'character'),
                 na.strings = 'null')

# convert character columns to numeric format
char_cols <- which(sapply(df, class) == 'character')  # identify character columns
df[char_cols] <- lapply(df[char_cols], as.numeric)    # convert them to numeric
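On the second part of the question: a common way to skip the copy-and-paste step entirely is the mongolite package, which runs the aggregation and returns a data frame directly. A rough sketch only; the db and url values below are placeholders for your own deployment:
library(mongolite)

# connection details are placeholders; adjust db/url to your setup
con <- mongo(collection = "logging_app_location_view_logs",
             db = "mydb", url = "mongodb://localhost")

# the same aggregation as the shell query, returned as a data frame
df <- con$aggregate('[{"$group": {"_id": "$city_id", "total": {"$sum": 1}}}]')
names(df) <- c("city_id", "count")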

Grab string from table and append as column in R

I have the following .csv file:
https://drive.google.com/open?id=0Bydt25g6hdY-RDJ4WG41VFpyX1k
I would like to take the date and agent name (pasting its constituent parts together) and append them as columns to the right of the table, until a different name and date is found, then do the same for the remaining name and date items, to get the following result:
The only thing I have been able to do with the dplyr package is the following:
library(dplyr)
library(stringr)
report <- read.csv(file ="test15.csv", head=TRUE, sep=",")
date_pattern <- "(\\d+/\\d+/\\d+)"
date <- str_extract(report[,2], date_pattern)
report <- mutate(report, date = date)
Which gives me the following result:
The difficulty I am finding is probably the conditionals needed to make the script grab the appropriate string and append it as a column at the end of the table.
This might be crude, but I think it illustrates several things: a) setting stringsAsFactors=F; b) "pre-allocating" the columns in the data frame; and c) using the column name instead of column number to set the value.
report <- read.csv('test15.csv', header=T, stringsAsFactors=F)

# first, allocate the two additional columns (with NAs)
report$date  <- rep(NA, nrow(report))
report$agent <- rep(NA, nrow(report))

# step through the rows
for (i in 1:nrow(report)) {
  # grab the current name and date if the row starts with "Agent:"
  if (report[i, 1] == 'Agent:') {
    currDate <- report[i + 1, 2]
    currName <- paste(report[i, 2:5], collapse = ' ')
  # otherwise append the name/date
  } else {
    report[i, 'date']  <- currDate
    report[i, 'agent'] <- currName
  }
}
write.csv(report, 'test15a.csv')
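The same carry-forward idea can also be written without an explicit loop by marking the 'Agent:' rows and letting tidyr::fill() propagate the values downward. A hedged sketch; the column names X1 to X5 are hypothetical stand-ins for the first five columns of the file and will need adjusting to the real header:
library(dplyr)
library(tidyr)

report <- read.csv('test15.csv', header = TRUE, stringsAsFactors = FALSE)

report <- report %>%
  mutate(
    # the agent name sits on the "Agent:" row itself ...
    agent = ifelse(X1 == 'Agent:', paste(X2, X3, X4, X5), NA),
    # ... while the date sits on the row directly below it
    date  = ifelse(lag(X1, default = '') == 'Agent:', X2, NA)
  ) %>%
  fill(agent, date, .direction = 'down')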
