CSV with multiple datasets/different-number-of-columns - r

Similar to How can you read a CSV file in R with different number of columns, I have some complex CSV-files. Mine are from SAP BusinessObjects and hold challenges different to those of the quoted question. I want to automate the capture of an arbitrary number of datasets held in one CSV file. There are many CSV-files, but let's start with one of them.
Given: One CSV file containing several flat tables.
Wanted: Several dataframes or other structure holding all data (S4?)
The method so far:
get line numbers of header data by counting number of columns
get headers by reading every line index held in vector described above
read data by calculating skip and nrows between data sets in index described by header lines as above.
give the read data column names from read header
I need help getting on the right track to avoid loops and to make the code more readable/compact when reading headers and datasets.
These CSVs are formatted as normal CSVs, except that they contain a more or less arbitrary number of subtables. For each dataset I export, the structure is different. In the current example I will suppose there are five tables included in the CSV.
To give you an idea, here is some fictitious sample data with line numbers. Separators and quotes have been stripped:
1: n, Name, Species, Description, Classification
2: 90, Mickey, Mouse, Big ears, rat
3: 45, Minnie, Mouse, Big bow, rat
...
16835: Code, Species
16836: RT, rat
...
22673: n, Code, Country
22674: 1, RT, Murica
...
33211: Activity, Code, Descriptor
33212: running, RU, senseless activity
...
34749: Last update
34750: 2017/05/09 02:09:14
There are a number of ways going about reading each data set. What I have come up with so far:
filepath <- file.path(Sys.getenv("USERPROFILE"), "SAMPLE.CSV")
# Make a vector with column number per line
fieldVector <- utils::count.fields(filepath, sep = ",", quote = "\"")
# Make a vector with unique number of fields in file
nFields <- base::unique(fieldVector)
# Make a vector with indices for position of new dataset
iHeaders <- base::match(nFields, fieldVector)
With this, I can do things like:
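# Header of dataset 4: skip everything before the header line and read exactly one line
header <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4] - 1, nrows = 1)
# Data of dataset 4: skip up to and including the header line, read up to the next header
data <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4], nrows = iHeaders[5]-iHeaders[4]-1)
names(data) <- as.character(unlist(header))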
As mentioned in the intro of this post, I have made a couple of functions which make it easier to get the headers for each dataset:
Headers <- GetHeaders(filepath, iHeaders)
colnames(data) <- Headers[[4]]
I have two functions now - one is GetHeader, which captures one line from the file with utils::read.csv2 while ensuring safe headernames (no æøå % etc).
The other returns a list of string vectors holding all headers:
GetHeaders <- function(filepath, linenums) {
# init an empty list of length(linenums)
l.headers <- vector(mode = "list", length = length(linenums))
for(i in seq_along(linenums)) {
# read.csv2(filepath, skip = linenums[i]-1, nrows = 1)
l.headers[[i]] <- GetHeader(filepath, linenums[i])
}
l.headers
}
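As a minimal sketch of the loop-free form asked about above, the same list can be built with lapply(). GetHeader() is not shown in the question, so the version below is a hypothetical stand-in that reads one line and sanitises the names with make.names(); the real function may clean names differently.

```r
# Hypothetical stand-in for the asker's GetHeader(): read one line of the file
# and turn its fields into syntactically safe column names.
GetHeader <- function(filepath, linenum) {
  fields <- utils::read.csv(filepath, header = FALSE, sep = ",", quote = "\"",
                            skip = linenum - 1, nrows = 1,
                            colClasses = "character")
  make.names(trimws(unlist(fields, use.names = FALSE)), unique = TRUE)
}

# Loop-free equivalent of GetHeaders(): one list element per header line.
GetHeaders <- function(filepath, linenums) {
  lapply(linenums, function(i) GetHeader(filepath, i))
}

# Small demonstration file with two subtables (headers on lines 1 and 3).
demo <- tempfile(fileext = ".csv")
writeLines(c("n,Name,Species", "90,Mickey,Mouse", "Code,Species", "RT,rat"), demo)
Headers <- GetHeaders(demo, c(1, 3))
```

lapply() already returns a list of the right length, so the pre-allocation and the index variable are no longer needed.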
What I struggle with is how to read in all possible datasets in one go. Specifically the last set is a bit hard to wrap my head around if I should write a common function, where I only know the line number of header, and not the number of lines in the following data.
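One possible way to read every subtable in one pass, including the last one, is sketched below. It assumes that a new subtable starts wherever the field count changes from one line to the next, and that the file has no blank lines and no line breaks inside quoted fields. The last table then simply ends at the last line of the file. ReadSubtables is a hypothetical name, and two consecutive subtables with the same number of columns (such as those starting at lines 22673 and 33211 in the sample) would be merged by this heuristic and would need another cue.

```r
# Sketch: split a file into subtables wherever the number of fields changes.
ReadSubtables <- function(filepath, sep = ",", quote = "\"") {
  lines <- readLines(filepath)
  nfields <- utils::count.fields(filepath, sep = sep, quote = quote,
                                 blank.lines.skip = FALSE, comment.char = "")
  # Guard for the stated assumptions: one field count per physical line.
  stopifnot(length(nfields) == length(lines), !anyNA(nfields))

  runs <- rle(nfields)                 # consecutive lines with equal field count
  ends <- cumsum(runs$lengths)         # last line of each run
  starts <- ends - runs$lengths + 1    # first line (the header) of each run

  # Parse each run separately; its first line becomes the column names.
  Map(function(from, to) {
    utils::read.csv(text = lines[from:to], header = TRUE, sep = sep,
                    quote = quote, strip.white = TRUE)
  }, starts, ends)
}

# Demonstration on a shortened version of the sample data.
demo <- tempfile(fileext = ".csv")
writeLines(c(
  "n,Name,Species,Description,Classification",
  "90,Mickey,Mouse,Big ears,rat",
  "45,Minnie,Mouse,Big bow,rat",
  "Code,Species",
  "RT,rat",
  "n,Code,Country",
  "1,RT,Murica",
  "Last update",
  "2017/05/09 02:09:14"
), demo)
tables <- ReadSubtables(demo)
```

The result is a plain list of data frames, which can be named afterwards and passed around as one object per file.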
Also, what is the best data structure for such a structure as described? The data in the subtables are all relevant to each other (can be used to normalize parts of the data). I understand that I must do manual work for each read CSV, but as I have to read TONS of these files, some common functions to structure them in a predictable manner at each pass would be excellent.
Before you answer, please keep in mind that, no, using a different export format is not an option.
Thank you so much for any pointers. I am a beginner in R and haven't completely wrapped my head around all possible solutions in this particular domain.

Related

Set read_csv() to a fixed number of columns?

TLDR: How do I set RStudio to import a CSV as a tibble exactly as Microsoft Excel does (RStudio for Mac: version 1.3.959, Excel for Mac: version 16.33, if that helps)? If this is not possible, or it should already behave the same, how do I set it to read in a CSV file with no more than 8 columns and fill in blank values in rows so I can tidy it?
Long version:
I have a dozen CSV files (collected from archival animal tags) that are messy (inconsistent width, multiple blocks of data in one file) and need to be read in. For workflow reasons, I would like to take the raw data and bring it straight into R. The data has a consistent structure between files: a metadata block, a summary by day that is 6 columns wide, and 2 blocks of constant logging that are 2 columns wide. If you were to count the blank cells in each section, it would be:
Section        Width  Length
Metadata       8      37
Summary Block  7      N days
Block 1        2      N*72
Block 2        2      N*72
The last three blocks of data can be thousands of entries long. I am unable to get this data to load into R as anything other than a single 1x500,000+ dataframe. Using tag1 = read_csv('file', skip = 37) to just start with the data I want crashes R. It works with read.csv(), but that removes the metadata block that I would like to keep.
Attempting to read the file into Excel shows the correct format (width, length, etc.) but will not load all of the data; it cuts off a good chunk of the last block. Reading in the data in a tabular format with something like readxl::read_excel() presents the same issue.
Ultimately, I would like to either import the data as a nested tibble with these different sections or, better yet, automate this process so it can read in an entire folder's worth of CSV files, automatically assign them to variables, and split them into sections. However, for now I just want to get this data into a workable format intact, and I would appreciate any help you can give me.
Get the number of lines in the file, n, and from that derive N. Then read the blocks one by one. Use the same connection so each read starts off from where the prior one ended.
n <- length(count.fields("myfile", sep = ""))
N = (n - 37) / (1 + 2 * 72)
con <- file("myfile", open = "r")
meta <- readLines(con, 37)
summary_block <- read.csv(con, header = FALSE, nrows = N)
# Each logging block holds 72 rows per day, matching the formula for N above
block1 <- read.csv(con, header = FALSE, nrows = N * 72)
block2 <- read.csv(con, header = FALSE, nrows = N * 72)
close(con)
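To see the shared-connection idea work end to end, here is a scaled-down sketch. The demo file, its contents and the sizes (2 metadata lines and 3 logging rows per day instead of 37 and 72) are made up for illustration; only the reading pattern mirrors the answer above.

```r
# Scaled-down stand-in for a tag file (hypothetical sizes and values).
n_meta  <- 2   # metadata lines (37 in the real files)
per_day <- 3   # logging rows per day in each block (72 in the real files)

demo <- tempfile(fileext = ".csv")
writeLines(c(
  "tag,A1", "deployed,2020-01-01",   # metadata block
  "1,10,20", "2,11,21",              # summary block: N = 2 days
  paste(1:6, 101:106, sep = ","),    # block 1: N * per_day rows
  paste(1:6, 201:206, sep = ",")     # block 2: N * per_day rows
), demo)

n <- length(readLines(demo))               # total number of lines
N <- (n - n_meta) / (1 + 2 * per_day)      # number of days

con <- file(demo, open = "r")
meta          <- readLines(con, n_meta)
summary_block <- read.csv(con, header = FALSE, nrows = N)
block1        <- read.csv(con, header = FALSE, nrows = N * per_day)
block2        <- read.csv(con, header = FALSE, nrows = N * per_day)
close(con)

# Keep the sections together as one nested object per file.
tag <- list(meta = meta, summary = summary_block,
            block1 = block1, block2 = block2)
```

Because the connection stays open, each read.csv() call picks up at the line where the previous read stopped, so no skip values have to be computed.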

Issues using sapply function in R

I have multiple files (from different days) that all contain the same information that I want to be able to combine and analyse both together and separately. For example, I would like to be able to take averages of one of the columns for each day - and write the result to a new file - as well as the average for all files (or a weeks worth of files) and write that result to the same output file. I've been trying several methods but none quite work.
I think the code below combines the files fine, but the data (separated with a " ") are being joined together with a \t, so when I try to read the $name column it doesn't exist. How do I fix this so the headers and data remain separated and can be read individually?
data<- sapply(datafiles, function(x)read.table(file=paste0(x),
fill = TRUE, header= TRUE, sep=" ",
stringsAsFactors = TRUE, quote = ""))
#separating STARLINK (SL) satellites from entire list
SL<- data[grepl("^SLINK", data$name), ]
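One likely cause, offered as an assumption since the files are not shown: sapply() tries to simplify its result, so a set of data frames ends up as a matrix of list cells rather than one data frame, and data$name then does not exist. A minimal sketch of an alternative is to keep one data frame per file with lapply() and stack them with rbind(). The file contents and the alt column below are invented for the demonstration.

```r
# Two small space-separated demo files standing in for the daily files.
f1 <- tempfile()
f2 <- tempfile()
writeLines(c("name alt", "SLINK-1 550", "ISS 408"), f1)
writeLines(c("name alt", "SLINK-2 551", "HST 540"), f2)
datafiles <- c(f1, f2)

# lapply() always returns a list, here one data frame per file.
per_file <- lapply(datafiles, read.table, header = TRUE, sep = " ",
                   fill = TRUE, quote = "", stringsAsFactors = FALSE)
names(per_file) <- basename(datafiles)

# Stack the daily data frames into one (columns must match across files).
combined <- do.call(rbind, per_file)

# Separate the SLINK rows from the entire list.
SL <- combined[grepl("^SLINK", combined$name), ]

# Per-day and overall averages of one column.
daily_mean   <- sapply(per_file, function(d) mean(d$alt))
overall_mean <- mean(combined$alt)
```

Keeping per_file around allows per-day summaries, while combined serves the analysis across all days.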

Importing data in R from Excel with information contained in header

As title says, I am trying to import data from Excel to R, where part of the information is contained in the header.
In a very simplified way, the Excel file I have looks like this:
GROUP;1234
MONTH;"Jan"
PERSON;SEX;AGE;INCOME
John;m;26;20000
Michael;m;24;40000
Phillip;m;25;15000
Laura;f;27;72000
Total;;;147000
After reading in to R, it should be a "clean" dataset that looks like this.
GROUP;MONTH;PERSON;SEX;AGE;INCOME
1234;Jan;John;m;26;20000
1234;Jan;Michael;m;24;40000
1234;Jan;Phillip;m;25;15000
1234;Jan;Laura;f;27;72000
I have several files that look like this. The number of persons however varies in each file. The last line contains a summary that should be skipped. There might be empty lines between the list and summary line.
Any help is highly appreciated. Thank you very much.
Excel files can be read using readxl::read_excel()
One of its parameters is skip, which sets the number of rows to skip before reading.
For your data, you need to skip the first two lines that contain GROUP and MONTH.
You will get the data in following format.
PERSON;SEX;AGE;INCOME
John;m;26;20000
Michael;m;24;40000
Phillip;m;25;15000
Laura;f;27;72000
After this, you can manually add the columns GROUP and MONTH
Thank you very much for your help. The hint from @Aurèle provided the missing puzzle piece. The solution I have now come up with is as follows:
group <- read_excel("TEST1.xlsx", col_names = c("C1","GROUP") ,n_max = 1)
group <- group[,2]
month <- read_excel("TEST1.xlsx", col_names = c("C1","MONTH") ,skip = 1, n_max = 1)
month <- month[,2]
data <- read_excel("TEST1.xlsx", col_names = c("NAME","SEX","AGE","INCOME") , skip = 3)
data <- data[!is.na(data$AGE),]
data <- cbind(data,group,month)
data
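For comparison, here is a base R sketch of the same reshaping. It assumes the sheet has been exported as semicolon-separated text laid out exactly as in the question (two key;value lines, a header row, the persons, optional blank lines, then the Total row). The lines are held in a character vector here, so no file or Excel package is needed.

```r
# The sample from the question as lines of text (with one blank line added
# before the summary row to show that it is tolerated).
raw <- c('GROUP;1234', 'MONTH;"Jan"', 'PERSON;SEX;AGE;INCOME',
         'John;m;26;20000', 'Michael;m;24;40000', 'Phillip;m;25;15000',
         'Laura;f;27;72000', '', 'Total;;;147000')

# First two lines are key;value pairs.
meta <- read.table(text = raw[1:2], sep = ";", quote = "\"",
                   col.names = c("key", "value"), colClasses = "character")

# Everything after that is the person table with its own header row;
# blank lines are skipped by default.
people <- read.table(text = raw[-(1:2)], sep = ";", header = TRUE,
                     quote = "\"", stringsAsFactors = FALSE)

# The summary row has no AGE, so dropping missing ages removes it.
people <- people[!is.na(people$AGE), ]

# Prepend the two header values; they are recycled to every row.
clean <- cbind(GROUP = meta$value[meta$key == "GROUP"],
               MONTH = meta$value[meta$key == "MONTH"],
               people)
```

Wrapping these steps in a function of the file path would let the same code run over all files, however many persons each one contains.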

How to merge many databases in R?

I have this huge database from a telescope at the institute where I currently am working, this telescope saves every single day in a file, it takes values for each of the 8 channels it measures every 10 seconds, and every day starts at 00:00 and finishes at 23:59, unless there was a connection error, in which case there are 2 or more files for one single day.
Also, the database has measurement mistakes, missing data, repeated values, etc.
File extensions are .sn1 for days saved in a single file, and .sn1, .sn2, .sn3, ... for days saved in multiple files. All the files have the same number of rows and variables. Besides that, there are 2 formats of databases: one has a sort of header that uses the first 5 lines of the file, the other one doesn't have it.
Every month has its own folder containing its days, and these folders are saved in the year they belong to, so for 10 years I'm talking about more than 3000 files. To be honest, I had never worked with .sn1 files before.
I have code to merge 2 or a handful of files into 1, but this time I have thousands of files (which is way more than what I've used before, and also the reason why I can't provide a simple example), and I would like to write a program that merges all of the files into 1 huge database, so I can get a better sample from it.
I have an Excel extension that would list all the file locations in a specific folder, can I use a list like this to put all the files together?
Suggestions were too long for a comment, so I'm posting them as an answer here.
It appears that you are able to read the files into R (at least one at a time) so I'm not getting into that.
Multiple Locations: If you have a list of all the locations, you can search in those locations to give you just the files you need. You mentioned an excel file (let's call it paths.csv - has only one column with the directory locations):
library(data.table)
all_directories <- fread("paths.csv", col.names = "paths")
# Focussing on only .sn1 files to begin with
file_names <- dir(path = all_directories$paths[1], pattern = ".sn1")
# Getting the full path for each file
file_names <- paste(all_directories$paths[1], file_names, sep = "/")
Reading all the files: I created a space-delimited dummy file and gave it the extension ".sn1" - I was able to read it properly with data.table::fread(). If you're able to open the files using notepad or something similar, it should work for you too. Need more information on how the files with different headers can be distinguished from one another - do they follow a naming convention, or have different extensions (appears to be the case). Focusing on the files with 5 rows of headers/other info for now.
read_func <- function(fname){
dat <- fread(fname, sep = " ", skip = 5)
dat$file_name <- fname # Add file name as a variable - to use for sorting the big dataset
dat # Return the data; without this line the function would return only fname
}
# Get all files into a list
data_list <- lapply(file_names, read_func)
# Merge list to get one big dataset
dat <- rbindlist(data_list, use.names = T, fill = T)
Doing all of the above will give you a dataset for all the files that have the extension ".sn1" in the first directory from your list of directories (paths.csv). You can enclose all of this in a function and use lapply over all the different directories to get a list wherein each element is a dataset of all such files.
To include files with ".sn2", ".sn3" ... extensions you can modify the call as below:
ptrns <- paste(sapply(1:5, function(z) paste(".sn",z,sep = "")), collapse = "|")
# ".sn1|.sn2|.sn3|.sn4|.sn5"
dir(all_directories$paths[1], pattern = ptrns)
Here's the simplified version that should work for all file extensions in all directories right away - might take some time if the files are too large etc. You may want to consider doing this in chunks instead.
# Assuming only one column with no header. sep is set to ";" since by default fread may treat spaces
# as separators. You can use any other symbol that is unlikely to be present in the location names
# We need the output to be a vector so we can use `lapply` without any unwanted behaviour
paths_vec <- as.character(fread("paths.csv", sep = ";", select = 1, header = F)$V1)
# Get all file names (incl. location)
file_names <- unlist(lapply(paths_vec, function(z){
ptrns <- paste(sapply(1:5, function(q) paste(".sn",q,sep = "")), collapse = "|")
inter <- dir(z, pattern = ptrns)
return(paste(z,inter, sep = "/"))
}))
# Get all data in a single data.table using read_func previously defined
dat <- rbindlist(lapply(file_names, read_func), use.names = T, fill = T)
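If the year/month folders all sit under one root directory, a base R alternative that needs no list of paths is sketched below. The folder layout, file names and contents are invented for the demonstration, and the reader assumes the header-less file format (the 5-line header variant would need skip = 5).

```r
# Build a small demonstration tree: <root>/<year>/<month>/<day>.snN
root <- file.path(tempdir(), "telescope_demo")
dir.create(file.path(root, "2016", "01"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "2016", "02"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("1 2 3", "4 5 6"), file.path(root, "2016", "01", "day01.sn1"))
writeLines("7 8 9",             file.path(root, "2016", "01", "day01.sn2"))
writeLines("1 1 1",             file.path(root, "2016", "02", "day01.sn1"))
writeLines("ignore me",         file.path(root, "2016", "02", "notes.txt"))

# Walk the whole tree and keep only files ending in .sn followed by digits.
files <- list.files(root, pattern = "\\.sn[0-9]+$",
                    recursive = TRUE, full.names = TRUE)

# Read one file and remember where each row came from.
read_one <- function(fname) {
  dat <- read.table(fname, header = FALSE, sep = " ")
  dat$file_name <- fname
  dat
}

# Stack all files into one data frame.
dat <- do.call(rbind, lapply(files, read_one))
```

The escaped dot and the trailing $ in the pattern keep names such as "xsn1.bak" from matching, which a bare ".sn1" pattern would accept.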

Vectorise an imported variable in R

I have imported a CSV file to R but now I would like to extract a variable into a vector and analyse it separately. Could you please tell me how I could do that?
I know that the summary() function gives a rough idea but I would like to learn more.
I apologise if this is a trivial question but I have watched a number of tutorial videos and have not seen that anywhere.
Read data into data frame using read.csv. Get names of data frame. They should be the names of the CSV columns unless you've done something wrong. Use dollar-notation to get vectors by name. Try reading some tutorials instead of watching videos, then you can try stuff out.
d = read.csv("foo.csv")
names(d)
v = d$whatever # for example
hist(v) # for example
This is totally trivial stuff.
I assume you have used the read.csv() or the read.table() function to import your data into R. (You can get help directly in R with ?, e.g. ?read.csv.)
So normally, you have a data.frame. And if you check the documentation the data.frame is described as a "[...]tightly coupled collections of variables which share many of the properties of matrices and of lists[...]"
So basically you can already handle your data as vector.
A quick search on SO turned up these two posts, among others:
Converting a dataframe to a vector (by rows) and
Extract Column from data.frame as a Vector
And I am sure there are more relevant ones. Try some good tutorials on R (videos are not so instructive in this case).
There is a ton of good ones on the Internet, e.g:
* http://www.introductoryr.co.uk/R_Resources_for_Beginners.html (which lists some)
or
* http://tryr.codeschool.com/
Anyways, one way to deal with your csv would be:
#import the data to R as a data.frame
mydata = read.csv(file="SomeFile.csv", header = TRUE, sep = ",",
quote = "\"",dec = ".", fill = TRUE, comment.char = "")
#extract a column to a vector
firstColumn = mydata$col1 # extract the column named "col1" of mydata to a vector
#This previous line is equivalent to:
firstColumn = mydata[,"col1"]
#extract a row to a vector
firstline = mydata[1,] #extract the first row of mydata (the result is still a one-row data.frame)
Edit: In some cases[1], you might need to coerce the data into a vector by applying functions such as as.numeric or as.character:
firstline=as.numeric(mydata[1,])#extract the first row of mydata to a vector
#Note: the entire row *has to be* numeric or compatible with that class
[1] e.g. it happened to me when I wanted to extract a row of a data.frame inside a nested function
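A small self-contained sketch of the extraction forms discussed above, using a made-up data frame in place of the imported CSV:

```r
# Made-up data standing in for the imported CSV.
d <- data.frame(col1 = c(1.5, 2.5, 4), col2 = c(10, 20, 30))

# Three equivalent ways to get one column as a plain vector.
v1 <- d$col1
v2 <- d[["col1"]]
v3 <- d[, "col1"]

# A row is different: single-bracket indexing keeps it a data.frame ...
row_df <- d[1, ]
# ... so it has to be flattened explicitly to get a (named) vector.
row_vec <- unlist(d[1, ])

# The column vector can now be analysed on its own.
col_mean <- mean(v1)
col_sd   <- sd(v1)
```

Flattening a row only gives a sensible vector when all columns share a type, which is the caveat noted in the edit above.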

Resources