Controlling column name case - using fread from .txt to data.table - r

I have many years of data to read from .txt (tab delimited) files into data.frame or data.table format to work with in R. For each year, the quarterly files need to be appended. My searching turned up some nice code to find all quarterly files and, using fread and bind_rows, create one annual file. #Maiasaura
One oddity I've found: using fread instead of read.table leads to different classes for some vectors. The pat_age column should be character ("00", "01", "02"). read.table handles this as expected, but fread reads it as integer, so I've added colClasses to control the pat_age class.
Unfortunately, the column names vary across the quarterly files - sometimes upper case, sometimes lower case (PAT_AGE vs. pat_age). Is there any way to control that as I read in the .txt files? colClasses with tolower didn't work for me.
tabtest <- list.files(pattern = ".*PUDF.*base.*tab.*", full.names = TRUE) %>%
  lapply(fread, header = TRUE, colClasses = c(pat_age = "character")) %>%
  dplyr::bind_rows()
I expect messy data - and may need to adjust other column names and classes as I move from year to year.
NOTE: Am I correct that if I can't change case within the lapply statement, I'd need to change it in the .txt files themselves? The colClasses argument requires "pat_age" to be lower case across all files.
NOTE: Came across this question:
fread (data.table) select columns, throw error if column not found
Could it be modified to read and modify the header - and then read the entire .txt file with corrected headers?
Latest attempt - I think it might work okay. Lots of effort/syntax just to change the case of column names!
read_cols <- function(x) {
  titles <- fread(x, nrows = 0, header = TRUE, stringsAsFactors = FALSE)
  var.names <- tolower(colnames(titles))
  rest <- fread(x, skip = 1)
  names(rest) <- var.names
  return(rest)
}
tabtest2 <- list.files(pattern = ".*PUDF.*base.*tab.*", full.names = TRUE) %>%
  lapply(read_cols) %>%
  dplyr::bind_rows()
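For reference, a more compact sketch I'm also considering (untested, and it assumes pat_age is present in every file): look up the position of pat_age from the header first so colClasses can be applied by column number, then lower-case all the names by reference with setnames.
library(data.table)
library(dplyr)
read_cols2 <- function(x) {
  # Read just the header row so pat_age can be located regardless of its case
  hdr <- names(fread(x, nrows = 0, header = TRUE))
  age_col <- which(tolower(hdr) == "pat_age")
  dt <- fread(x, header = TRUE,
              colClasses = list(character = age_col)) # force character by column number
  setnames(dt, tolower(names(dt)))                    # lower-case all names by reference
  dt
}
tabtest3 <- list.files(pattern = ".*PUDF.*base.*tab.*", full.names = TRUE) %>%
  lapply(read_cols2) %>%
  dplyr::bind_rows()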
Thank you.

Related

Why is read.csv getting wrong classes?

I have to read a big .csv file and read.csv is taking a while. I read that I should use read.csv to read a few rows, get the column classes, and then read the whole file. I tried to do that:
read.csv(full_path_astro_data,
         header = TRUE,
         sep = ",",
         comment.char = "",
         nrows = 100,
         stringsAsFactors = FALSE) %>%
  sapply(class) -> col.classes
df_astro_data <- read.csv(full_path_astro_data,
                          header = TRUE,
                          sep = ",",
                          colClasses = col.classes,
                          comment.char = "",
                          nrows = 47000,
                          stringsAsFactors = FALSE)
But then I got an error message:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '0.0776562500000022'
It looks like a column that contains numeric (double?) data was incorrectly classified as integer. This could be because some numeric columns have many zeros at the beginning. So I tried to increase the number of rows in the first read.csv command, but that did not work. One solution I found was to do
col.classes %>%
sapply(function(x) ifelse(x=="integer", "numeric", x)) -> col.classes
With this the file is read much faster than without specifying column classes. Still, it would be best if all columns were classified correctly.
Any insights?
Thanks
I suspect you are correct that in your row sample some columns contain only integers, but outside your row sample they contain non-integers. This is a common problem with large files. You need to either increase your row sample size or explicitly specify column type for certain columns where you see this happening.
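For example (a sketch; flux is a hypothetical column name standing in for whichever column is misclassified), you could sample more rows and then override just the known problem columns before the full read:
# Sample a larger chunk of rows to get better class guesses
sample_rows <- read.csv(full_path_astro_data, header = TRUE, nrows = 5000,
                        stringsAsFactors = FALSE)
col.classes <- sapply(sample_rows, class)
# Explicitly force columns known to be misclassified (hypothetical column name)
col.classes["flux"] <- "numeric"
df_astro_data <- read.csv(full_path_astro_data, header = TRUE,
                          colClasses = col.classes, stringsAsFactors = FALSE)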
It should be noted that readr's read_csv does this row sampling automatically. From the docs: "all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself." You can do that like this:
library(readr)
read_csv(YourPathName,
         col_types = cols(YourProblemColumn1 = col_double(),
                          YourProblemColumn2 = col_double()))

How to bind rows in R such that instead of type conversion error binding defaults to filling value to NA?

I am currently tasked with merging multiple xlsx files into one master R (.rds) data file. Since these files are filled in manually, there are a lot of type conversion errors when using approaches such as dplyr::bind_rows, for example
Column ``XYZ`` can't be converted from numeric to character
I very much need the binding to be "smart", i.e. to happen according to the corresponding column names of the data frames being merged. But when a conversion issue is encountered, instead of getting an error I would like the problematic cell contents to be treated as NA - perhaps with just a warning.
Is there a convenient way/function for doing this in R?
I have used bind_rows from the dplyr package.
My current import procedure
files <- list.files("data",pattern = "xlsx", full.names = TRUE)
tmp <- read_excel(files[1], sheet = "data", trim_ws = TRUE)
names(tmp) <- make.names(str_squish(names(tmp)))
for (i in 2:length(files)) {
print(i)
tmp2 <- read_excel(files[i], sheet = "data",trim_ws = TRUE)
names(tmp2) <- make.names(str_squish(names(tmp2)))
tmp<-bind_rows(tmp,tmp2)
}
It has been pointed out that using a loop here is not efficient, but since the files are messy - many manual mistakes - and relatively small in number, I focused on being able to track the binding process sequentially.
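A possible workaround (a sketch, untested): read every column as text first, bind, and only convert types afterwards, so mismatched cells end up as NA with parsing warnings rather than errors.
library(readxl)
library(stringr)
library(dplyr)
library(purrr)
library(readr)
files <- list.files("data", pattern = "xlsx", full.names = TRUE)
merged <- files %>%
  map(~ read_excel(.x, sheet = "data", trim_ws = TRUE, col_types = "text")) %>% # everything as text
  map(~ setNames(.x, make.names(str_squish(names(.x))))) %>%
  bind_rows() %>%   # no type clashes: every column is character
  type_convert()    # re-guess column types; unparseable cells become NA with a warning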

How to merge many databases in R?

I have a huge database from a telescope at the institute where I am currently working. The telescope saves each day to its own file; it records values for each of the 8 channels it measures every 10 seconds, and every day starts at 00:00 and finishes at 23:59, unless there was a connection error, in which case there are 2 or more files for a single day.
Also, the database has measurement mistakes, missing data, repeated values, etc.
File extensions are .sn1 for days saved in one single file and .sn1, .sn2, .sn3, ... for days saved in multiple files. All the files have the same number of rows and variables. Besides that, there are 2 formats of database: one has a sort of header that uses the first 5 lines of the file, the other one doesn't.
Every month has its own folder containing its days, and these folders are saved under the year they belong to, so for 10 years I'm talking about more than 3000 files. To be honest, I had never worked with .sn1 files before.
I have code to merge 2 or a handful of files into 1, but this time I have thousands of files (which is way more than what I've used before and also the reason why I can't provide a simple example), and I would like to write a program that merges all of the files into 1 huge database, so I can get a better sample from it.
I have an Excel extension that will list all the file locations in a specific folder; can I use a list like this to put all the files together?
Suggestions were too long for a comment, so I'm posting them as an answer here.
It appears that you are able to read the files into R (at least one at a time) so I'm not getting into that.
Multiple Locations: If you have a list of all the locations, you can search in those locations to give you just the files you need. You mentioned an excel file (let's call it paths.csv - has only one column with the directory locations):
library(data.table)
all_directories <- fread("paths.csv", col.names = "paths")
# Focussing on only .sn1 files to begin with
file_names <- dir(path = all_directories$paths[1], pattern = ".sn1")
# Getting the full path for each file
file_names <- paste(all_directories$paths[1], file_names, sep = "/")
Reading all the files: I created a space-delimited dummy file and gave it the extension ".sn1" - I was able to read it properly with data.table::fread(). If you're able to open the files using notepad or something similar, it should work for you too. Need more information on how the files with different headers can be distinguished from one another - do they follow a naming convention, or have different extensions (appears to be the case). Focusing on the files with 5 rows of headers/other info for now.
read_func <- function(fname){
  dat <- fread(fname, sep = " ", skip = 5)
  dat$file_name <- fname # Add file name as a variable - to use for sorting the big dataset
  return(dat) # Return the data, not the file name
}
# Get all files into a list
data_list <- lapply(file_names, read_func)
# Merge list to get one big dataset
dat <- rbindlist(data_list, use.names = TRUE, fill = TRUE)
Doing all of the above will give you a dataset for all the files that have the extension ".sn1" in the first directory from your list of directories (paths.csv). You can enclose all of this in a function and use lapply over all the different directories to get a list wherein each element is a dataset of all such files.
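A minimal sketch of that wrapping, reusing all_directories and read_func from above:
# Wrap the per-directory steps in one function...
read_directory <- function(dir_path) {
  file_names <- dir(path = dir_path, pattern = ".sn1")
  file_names <- paste(dir_path, file_names, sep = "/")
  rbindlist(lapply(file_names, read_func), use.names = TRUE, fill = TRUE)
}
# ...and lapply over every directory listed in paths.csv
dir_data <- lapply(all_directories$paths, read_directory)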
To include files with ".sn2", ".sn3" ... extensions you can modify the call as below:
ptrns <- paste(sapply(1:5, function(z) paste(".sn", z, sep = "")), collapse = "|")
# ".sn1|.sn2|.sn3|.sn4|.sn5"
dir(all_directories$paths[1], pattern = ptrns)
Here's the simplified version that should work for all file extensions in all directories right away - might take some time if the files are too large etc. You may want to consider doing this in chunks instead.
# Assuming only one column with no header. sep is set to ";" since by default fread may treat spaces
# as separators. You can use any other symbol that is unlikely to be present in the location names.
# We need the output to be a vector so we can use `lapply` without any unwanted behaviour
paths_vec <- as.character(fread("paths.csv", sep = ";", select = 1, header = FALSE)$V1)
# Get all file names (incl. location)
file_names <- unlist(lapply(paths_vec, function(z){
  ptrns <- paste(sapply(1:5, function(q) paste(".sn", q, sep = "")), collapse = "|")
  inter <- dir(z, pattern = ptrns)
  return(paste(z, inter, sep = "/"))
}))
# Get all data in a single data.table using read_func previously defined
dat <- rbindlist(lapply(file_names, read_func), use.names = TRUE, fill = TRUE)
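If memory is a concern, a rough sketch of the chunked variant (the chunk size of 200 files is arbitrary):
# Process ~200 files at a time so a failure or memory spike only affects one chunk
chunks <- split(file_names, ceiling(seq_along(file_names) / 200))
dat <- rbindlist(
  lapply(chunks, function(fns) rbindlist(lapply(fns, read_func), use.names = TRUE, fill = TRUE)),
  use.names = TRUE, fill = TRUE
)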

Creating a new file with both a subset of data and file names from a group of .csv files

My issue is likely with how I'm exporting the data from the for loop, but I'm not sure how to fix it.
I've got over 200 files in a folder, all structured in the same way, from which I'd like to pull the maximum number from a single column. I've made a for loop to do this based on code from here: http://www.r-bloggers.com/looping-through-files/
What I have running so far looks like this:
fileNames <- Sys.glob("*.csv")
for (i in 1:length(fileNames)) {
  data <- read.csv(fileNames[i])
  VelM <- max(data[, 8], na.rm = TRUE)
  write.table(VelM, "Summary", append = TRUE, sep = ",",
              row.names = FALSE, col.names = FALSE)
}
This works, but I need to figure out a way to have a second column in my summary file that contains the original file name the data in that row came from for reference.
I tried making both a matrix and a data frame instead of going straight to the table writing, but in both cases I wasn't able to append the data and ended up with values from only the last file.
Any ideas would be greatly appreciated!
Here's what I would recommend to improve your current method, also going with fread() because it's very fast and has the select argument. Notice I have moved the write.table() call outside the for() loop. This allows a cleaner way of adding the new column of file names alongside the max column, and eliminates the need to append to the file on every iteration.
library(data.table)

fileNames <- Sys.glob("*.csv")
VelM <- numeric(length(fileNames))

for (i in seq_along(fileNames)) {
  VelM[i] <- max(fread(fileNames[i], select = 8)[[1L]], na.rm = TRUE)
}

write.table(data.frame(VelM, fileNames), "Summary", sep = ",",
            row.names = FALSE, col.names = FALSE)
If you want to quickly read files, you should consider using data.table::fread or readr::read_csv instead of base read.csv.
For example:
fileNames <- list.files(path = your_path, pattern = '\\.csv', full.names = TRUE) # instead of Sys.glob
library('data.table')
# Name the list with the file names so rbindlist()'s idcol uses them as ids
dt <- rbindlist(setNames(lapply(fileNames, fread, select = 8), fileNames), idcol = TRUE)
dt_max <- dt[, .(max_val = max(your_var)), by = .id]
write.table(dt_max, 'yourfile.csv', sep = ',', row.names = FALSE, col.names = FALSE)
Explanation: data.table::fread reads in only the select = 8th column from each file (via lapply over fileNames, which returns a list of one-column data.tables). Then data.table::rbindlist combines this list of data.tables into a single data.table, and idcol = TRUE adds an .id column recording which list element each row came from. From ?rbindlist, note that
If input is a named list, ids are generated using them
so naming the list with fileNames (via setNames, as above) is an easy way of passing the file names through for grouping.
The rest is data.table syntax. It wasn't clear from your question if there is a header row and whether you know the heading in advance. If so, you can either keep header=TRUE and use the header name for your_var, or you can do skip=1, header=FALSE, col.names = 'your_var'.
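For the no-header route, a minimal sketch (renaming the selected column afterwards rather than via col.names, and reusing fileNames from above):
# Skip the header line in each file, read column 8 without names, then rename it
dt <- rbindlist(
  setNames(lapply(fileNames, fread, select = 8, skip = 1, header = FALSE), fileNames),
  idcol = TRUE
)
setnames(dt, 2, "your_var") # the data column is column 2; .id is column 1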

How to delete specific rows from multiple columns

I am importing some columns from multiple csv files into R. I want to delete all the data after row 1472.
temp = list.files(pattern="*.csv") #Importing csv files
Normalyears<-c(temp[1],temp[2],temp[3],temp[5],temp[6],temp[7],temp[9],temp[10],temp[11],temp[13],temp[14],temp[15],temp[17],temp[18],temp[19],temp[21],temp[22],temp[23])
leapyears<-c(temp[4],temp[8],temp[12],temp[16],temp[20]) # separating csv files based on leap years and normal years
Importing only the second column of each csv file.
myfiles_Normalyears = lapply(Normalyears, read.delim,colClasses=c('NULL','numeric'),sep =",")
myfiles_leapyears = lapply(leapyears, read.delim,colClasses=c('NULL','numeric'),sep =",")
new.data.leapyears <- NULL
for(i in 1:length(myfiles_leapyears)) {
  in.data <- read.table(myfiles_leapyears[i], skip = c(1472:4399), sep = ",")
  new.data.leapyears <- rbind(new.data.leapyears, in.data)
}
The loop is supposed to delete all the rows from 1472 to 4399.
Error: Error in read.table(myfiles_leapyears[i], skip = c(1472:4399), sep = ",") :
'file' must be a character string or connection
There is an nrows parameter to read.table, so why not try
read.table(myfiles_leapyears[i], nrows = 1471,sep=",")
Your myfiles_leapyears is a list. When subsetting a list, you need double brackets to access a single element, otherwise you just get a sublist of length 1.
So replace
myfiles_leapyears[i]
with
myfiles_leapyears[[i]]
that will at least take care of invalid subscript type 'list' errors. I'd second Josh W. that the nrows argument seems smarter than the skip argument.
Alternatively, if you define using sapply ("s" for simplify) instead of lapply ("l" for list), you'll probably be fine using [i]:
myfiles_leapyears = sapply(leapyears, read.delim, colClasses=c('NULL','numeric'), sep =",")
It is fine. I just turned the data from a list into a dataframe.
df <- as.data.frame(myfiles_leapyears, byrow = TRUE)
leap_df <- head(df, -2928)
