Reading in a .csv with multiple data frames / Different number of columns - r

I have a .CSV file that contains multiple data frames. It looks like this:
# A;Date;Price;Volume;Country
# B;Company;Available;StartDate;EndDate;Published;Modified
# C;ID;Timestamp;Capacity
# D;Rownumbers
#
A;2016-01-01 00:00:00;75.18;2500;DK
A;2016-01-01 00:00:00;55.25;8500;DE
A;2016-01-01 00:00:00;125.00;6500;UK
A;2016-01-01 01:00:00;65.28;2400;DK
# A; etc....
B;PRETZELS;TRUE;2016-01-01;2016-01-02;YES;2016-01-03
B;FAKES;FALSE;2016-01-01;2016-01-02;NO;2016-01-03
# B; etc....
C;11;2016-01-01 23:00:00;25
C;16;2016-01-01 22:00:00;15
# C; etc....
D;1175
So, the first part of the file describes the data that follows. From it you can see that, depending on the record type (A through D in this case), rows have different numbers of columns.
I've tried:
df <- read.table(file = "x.csv", sep = ";", fill = TRUE)
But fill can't take care of a varying number of columns, for example when the number of columns increases later in the file.
Ideally, I would either create a number of data frames based on the row type (A, B, C and D in this case), or just have one data frame with as many columns as the widest record, padded with NA values that I could then filter out into individual data frames later. I.e. just read everything in, with the number of columns specified up front.

df <- read.delim(file.choose(), header = FALSE, sep = ";", fill = TRUE) # choose x.csv from your PC
file.choose() opens up a dialog box for selecting the input file. Hope this helped.
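If fill alone isn't enough, here is a minimal sketch of the two ideas from the question, assuming the file is named "x.csv" and the leading "#" lines can be treated as comments: count the maximum number of fields first, pass explicit col.names so read.table doesn't guess the width from the first few rows, then split on the record-type column.
nCols <- max(count.fields("x.csv", sep = ";", comment.char = "#"))
df <- read.table("x.csv", sep = ";", fill = TRUE, comment.char = "#",
                 col.names = paste0("V", seq_len(nCols)),
                 stringsAsFactors = FALSE)
# One data frame per record type; trailing filler columns can be dropped per table
dfs <- split(df, df$V1)
# dfs$A, dfs$B, dfs$C, dfs$D
Supplying col.names sidesteps the fact that read.table only inspects the first few lines of the file to decide how many columns there are.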

Related

CSV with multiple datasets/different-number-of-columns

Similar to How can you read a CSV file in R with different number of columns, I have some complex CSV files. Mine come from SAP BusinessObjects and present challenges different from those of the quoted question. I want to automate the capture of an arbitrary number of datasets held in one CSV file. There are many CSV files, but let's start with one of them.
Given: One CSV file containing several flat tables.
Wanted: Several dataframes or other structure holding all data (S4?)
The method so far:
get line numbers of the header rows by counting the number of columns per line
get the headers by reading each line index held in the vector described above
read each dataset by calculating skip and nrows from the gap between consecutive header indices
give the read data the column names from the corresponding header
I need help getting me on the right track to avoid loops/making the code more readable/compact when reading headers and datasets.
These CSVs are formatted as normal CSVs, except that they contain a more or less arbitrary number of subtables. For each dataset I export, the structure is different. In the current example I will suppose there are five tables included in the CSV.
To give you an idea, here is some fictitious sample data with line numbers. Separator and quotes have been stripped:
1: n, Name, Species, Description, Classification
2: 90, Mickey, Mouse, Big ears, rat
3: 45, Minnie, Mouse, Big bow, rat
...
16835: Code, Species
16836: RT, rat
...
22673: n, Code, Country
22674: 1, RT, Murica
...
33211: Activity, Code, Descriptor
33212: running, RU, senseless activity
...
34749: Last update
34750: 2017/05/09 02:09:14
There are a number of ways going about reading each data set. What I have come up with so far:
filepath <- file.path(paste0(Sys.getenv("USERPROFILE"), "\\SAMPLE.CSV"))
# Make a vector with column number per line
fieldVector <- utils::count.fields(filepath, sep = ",", quote = "\"")
# Make a vector with unique number of fields in file
nFields <- base::unique(fieldVector)
# Make a vector with indices for position of new dataset
iHeaders <- base::match(nFields, fieldVector)
With this, I can do things like:
header <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4] - 1, nrows = 1)
data <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4], nrows = iHeaders[5] - iHeaders[4] - 1)
names(data) <- unlist(header, use.names = FALSE)
As mentioned in the intro of this post, I have made a couple of functions which make it easier to get the headers for each dataset:
Headers <- GetHeaders(filepath, iHeaders)
colnames(data) <- Headers[[4]]
I have two functions now. One is GetHeader, which captures one line from the file with utils::read.csv2 while ensuring safe header names (no æøå, %, etc.).
The other returns a list of string vectors holding all the headers:
GetHeaders <- function(filepath, linenums) {
  # init an empty list of length(linenums)
  l.headers <- vector(mode = "list", length = length(linenums))
  for (i in seq_along(linenums)) {
    # read.csv2(filepath, skip = linenums[i] - 1, nrows = 1)
    l.headers[[i]] <- GetHeader(filepath, linenums[i])
  }
  l.headers
}
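For reference, a minimal sketch of what GetHeader looks like (the real one also sanitizes characters like æøå and %; make.names stands in for that cleanup here):
GetHeader <- function(filepath, linenum) {
  # read exactly one line and return syntactically safe column names
  row <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"",
                          skip = linenum - 1, nrows = 1,
                          stringsAsFactors = FALSE)
  make.names(unlist(row, use.names = FALSE))
}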
What I struggle with is how to read in all the datasets in one go. The last set in particular is hard to wrap my head around: if I write a common function, I know only the line number of its header, not the number of lines of data that follow.
Also, what is the best data structure for data like this? The subtables are all related to each other (they can be used to normalize parts of the data). I understand that I must do manual work for each kind of CSV, but as I have to read TONS of these files, some common functions to structure them in a predictable manner at each pass would be excellent.
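To make the goal concrete, here is the rough shape I imagine, only a sketch: it assumes iHeaders is in increasing order and uses the total line count from fieldVector to close off the last set.
# One pass over all header positions; the file length bounds the last dataset
bounds <- c(iHeaders, length(fieldVector) + 1)
datasets <- lapply(seq_along(iHeaders), function(i) {
  d <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"",
                        skip = bounds[i], nrows = bounds[i + 1] - bounds[i] - 1)
  names(d) <- Headers[[i]]
  d
})
Whether a plain named list of data frames like this is preferable to an S4 class is part of what I am asking.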
Before you answer, please keep in mind that, no, using a different export format is not an option.
Thank you so much for any pointers. I am a beginner in R and haven't completely wrapped my head around all possible solutions in this particular domain.

Import multiple .txt files into R and skip to actual data rows

I have 537 .txt files which I need to import into either a list or separate data frames in R. I do not want to append any data as it is crucial to keep everything separate.
I've renamed each file, so the file names are all uniform. In each file, there is a header section with a lot of miscellaneous information; this header section is 12 to 16 rows depending on the file. The data is all tab delimited. The number of columns varies between 5 and 9, and the columns are not always in the same order, so it is important that I can import the column names with the data (column names are uniform across files). The format of each file is as follows:
Header
Header
Header
Header...up to 16 rows
((number of spaces between header and column names varies))
Date(\t)Time(\t)dataCol1(\t)dataCol2(\t)dataCol3(\t)dataCol4
((no empty row between column names and units))
mm/dd/yyyy(\t)hh:mm:ss(\t)units(\t)units(\t)units(\t)units
((1 empty row between units and data))
01/31/2016(\t)14:32:02(\t)14.9(\t)25.3(\t)15.8(\t)25.6
((data repeats for up to 4000 rows))
To recap what I need:
Import all of the files into individual data frames or a list of data frames.
Skip past the header information to the row with "Date" (and possibly delete the two rows that follow it, the units and the empty row), leaving me with a row of column names and the data below it.
Here's a crude copy of the code I have been working on. The idea is, after importing all of the files into R, to determine the max value of 1-2 columns in each file, and then export a single file with 1 row per input file and 2 columns containing the 2 max values from each file.
##list files and create list for data.frames
path <- list.files("Path", pattern = NULL, all.files = FALSE, full.names = TRUE)
files <- list()
##Null list for final data to be extracted to
results <- NULL
##add names to results list (using file name minus extension)
results$name <- substr(basename(path), 1, nchar(basename(path)) - 4)
##loop to read in data files and calculate max
for(i in 1:length(path)){
  ##read files
  files[[i]] <- read.delim(path[[i]], header = FALSE, sep = "\t", skip = 18)
  ##will have to add code:
  ##"if columnx exists do this; if columny exists do this"
  ##convert 2 columns for calculation to numeric
  x.x <- as.numeric(as.character(files[[i]]$columnx))
  x.y <- as.numeric(as.character(files[[i]]$columny))
  ##will have to add code:
  ##"if column x exists, do this....if not, NA"
  ##get max value for 2 specific columns
  results$max.x[i] <- max(x.x)
  results$max.y[i] <- max(x.y)
}
##add results to data frame
max <- data.frame(results)
##export to .csv
write.csv(max, file = "PATH")
I know that right now my code just skips past everything into the data (the max doesn't come until much later in the file, so skipping 1 or 2 extra lines won't hurt me), and it assumes the columns are in the same order in each file. This is horrible practice and gives me bad results on about 5% of my data points, but I want to do this correctly. My main concern is to get the data into R in a usable format; then I can add the other calculations and conversions. I am new to R, and after 2 days of searching, I have not found the help I need already posted to any forum.
Assuming that the structure of the header follows a Line \n Line \n Data pattern, we can use grep to find the line number of the line containing "mm/dd/yyyy".
As such:
system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T)
# ran.txt is an arbitrary text file I created, we will substitute
# 'ran.txt' with path[[i]] later on.
#[1] "6:mm/dd/yyyy\thh:mm:ss\tunits\tunits\tunits\tunits"
From this we can then strsplit the output on the ":" to get the number before it, and use that as the value for skip.
as.numeric(strsplit(system("grep -nr 'mm/dd/yyyy' ran.txt", intern = TRUE), ":")[[1]][1])
# [[1]][1] selects the first element of the strsplit output, since
# the hh:mm:ss in the matched line also gets split on ":".
# [1] 6
As there is an empty row between the matched row and the actual data, we can add 1 to this and then begin reading the data.
Thusly:
##list files and create list for data.frames
path <- list.files("Path", pattern = NULL, all.files = FALSE, full.names = TRUE)
files <- list()
##Null list for final data to be extracted to
results <- NULL
##add names to results list (using file name minus extension)
results$name <- substr(basename(path), 1, nchar(basename(path)) - 4)
##loop to read in data files and calculate max
for(i in 1:length(path)){
  ##read files
  # Calculate the number of rows to skip.
  # Using Dave2e's suggestion:
  header <- readLines(path[[i]], n = 20)
  skip <- grep("^mm/dd/yy", header)
  # Add one due to the empty line
  skip <- skip + 1
  files[[i]] <- read.delim(path[[i]],
                           header = FALSE,
                           sep = "\t",
                           skip = skip)
  ##will have to add code:
  ##"if columnx exists do this; if columny exists do this"
  ##convert 2 columns for calculation to numeric
  x.x <- as.numeric(as.character(files[[i]]$columnx))
  x.y <- as.numeric(as.character(files[[i]]$columny))
  ##will have to add code:
  ##"if column x exists, do this....if not, NA"
  ##get max value for 2 specific columns
  results$max.x[i] <- max(x.x)
  results$max.y[i] <- max(x.y)
}
##add results to data frame
max <- data.frame(results)
##export to .csv
write.csv(max, file = "PATH")
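Since the column order varies between files and the names matter, one possible adjustment inside the loop (a sketch, assuming the column-name row always starts with "Date") is to keep that row as the header and drop the units row instead:
# Read the name row as the header, then discard the units row that follows
hdr <- grep("^Date", readLines(path[[i]], n = 20))
files[[i]] <- read.delim(path[[i]], header = TRUE, sep = "\t", skip = hdr - 1)
files[[i]] <- files[[i]][-1, ]   # the first data row holds the units; remove it
Because the units row is read as data, the columns come in as character and still need as.numeric afterwards.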
I think that about covers everything.
Thought I would add this here in case it helps someone else with a similar issue. @TJGorrie's solution helped solve my slightly different challenge. I have several .rad files that I need to read in, tag, and merge. The .rad files have headers that start at random rows, so I needed a way to find the row containing the header. I didn't need to do any additional calculations except create a tag column. Hope this helps someone in the future, and thanks @TJGorrie for the awesome answer!
##list files and create list for data.frames
path <- list.files(pattern = "*.rad")
files <- list()
##loop to read in data files
for(i in 1:length(path)){
  # Using Dave2e's suggestion:
  header <- readLines(path[[i]], n = 20)
  skip <- grep("Sample", header)
  # Subtract one row to keep the row with "Sample" in it as the header
  skip <- skip - 1
  files[[i]] <- read.table(path[[i]],
                           header = TRUE,
                           fill = TRUE,
                           skip = skip,
                           stringsAsFactors = FALSE)
  # Name the newly created file objects the same name as the original file.
  names(files)[i] <- gsub(".rad", "", path[i])
  files[[i]] <- na.omit(as.data.frame(files[[i]]))
  # Create a new column holding the file name, to act as a tag
  # when the dfs get merged through rbind
  files[[i]]$Tag <- names(files)[i]
}
# Bind all the dfs in the list into a single df
df <- do.call("rbind", c(files, make.row.names = FALSE))
##export to .csv
write.csv(df, file = "PATH.csv", row.names = FALSE)

Select multiple rows in multiple DFs with loop in R

I have read multiple questionnaire files into DFs in R. Now I want to create new DFs based on them, but with only specific rows in them, by looping over all of them. The loop appears to work fine. However, the selection of the rows does not seem to work. When I try selecting with simple square brackets, I get the error "incorrect number of dimensions". I tried it with subset(), but I don't seem to be able to set the subset correctly.
Here is what I have so far:
for (i in 1:length(subjectlist)) {
  p[i] <- paste("path", subjectlist[i], sep = "")
  files <- list.files(path = p, full.names = T, include.dirs = T)
  assign(paste("subject_", i, sep = ""), read.csv(paste("path", subjectlist[i], ".csv", sep = ""), header = T, stringsAsFactors = T, row.names = NULL))
  assign(paste("subject_", i, "_t", sep = ""), sapply(paste("subject_", i, sep = ""), [c((3:22),(44:63),(93:112),(140:159),(180:199),(227:246)),]))
}
Here's some code that tries to abstract away the details and do what it seems like you're trying to do. If you just want to read in a bunch of files and then select certain rows, you can avoid the assign calls and just use sapply to read all the data frames into a list. Let me know if this helps:
# Get the names of files we want to read in
files = list.files([arguments])
df.list = sapply(files, function(file) {
  # Read in a csv file from the files vector
  df = read.csv(file, header = TRUE, stringsAsFactors = FALSE)
  # Add a column telling us the name of the csv file that the data came from
  df$SourceFile = file
  # Select only the rows we want
  df = df[c(3:22, 44:63, 93:112, 140:159, 180:199, 227:246), ]
  df
}, simplify = FALSE)
If you now want to combine all the data frames into a single data frame, you can do the following (the SourceFile column tells you which file each row originally came from):
# Combine all the files into a single data frame
allDFs = do.call(rbind, df.list)

How to convert rows

I have uploaded a data set called "Obtained Dataset". It usually has 16 rows of numeric and character variables; some other files of a similar nature have fewer than 16. Each of these rows names one variable, and the data itself starts from the 17th row onwards in this specific file.
Obtained dataset & Required Dataset
For the data, the 1st column is the x-axis, the 2nd column is the y-axis, and the 3rd column is depth (these are standard for all the files in the database); the 4th column is GR 1 LIN, the 5th column is CAL 1 LIN, and so forth, as given in the first 16 rows of the data.
Now I want R code which can convert it into the format shown in the required data set. Also, if a different data set has, say, fewer than 16 lines of names, say GR 1 LIN and RHOB 1 LIN are missing, then I want it to still create those columns, with NA entries for rows 1:nrow.
Currently I have managed to export this file to Excel, manually clean the data, rename the columns correspondingly, save it as csv and then read.csv("filename"), etc., but it is simply not possible to do this for 400 files.
Any advice how to proceed will be of great help.
I have noticed that you have probably posted this question before, in a different format. This is a public forum, and people are happy to help. However, it's your job to make life easier for others, and you are requested to put in some effort. Here is some advice on that.
Having said that, here is some code I have written to help you out.
Step0: Creating your first data set:
sink("test.txt") # This will `sink` all the output to the file "test.txt"
# Lets start with some dummy data
cat("1\n")
cat("DOO\n")
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
# Now a 10 x 16 dummy data matrix:
cat(paste(apply(matrix(sample(160),10),1,paste,collapse = "\t"),collapse = "\n"))
cat("\n")
sink() # This will stop `sink`ing.
I have created some dummy data in the first 6 lines, followed by a 10 x 16 data matrix.
Note: In principle you should have provided something like this, or a copy of your dataset. This would help other people help you.
Step1: Now we need to read the file, and we want to skip the first 6 rows with undesired info:
(temp <- read.table(file="test.txt", sep ="\t", skip = 6))
Step2: Data clean up:
We need a vector with the names of the 16 columns in our data:
namesVec <- letters[1:16]
Now we assign these names to our data.frame:
names(temp) <- namesVec
temp
Looks good!
Step3: Save the data:
write.table(temp,file="test-clean.txt",row.names = FALSE,sep = "\t",quote = FALSE)
Check if the solution is working. If it is working, then move to the next step; otherwise make the necessary changes.
Step4: Automating:
First we need to create a list of all the 400 files.
The easiest way (also for explaining) is to copy the 400 files into one directory, and then set that as the working directory (using setwd).
Now first we'll create a vector with all file names:
fileNameList <- dir()
Once this is done, we'll need a function to repeat steps 1 through 3:
convertFiles <- function(fileName) {
  temp <- read.table(file = fileName, sep = "\t", skip = 6)
  names(temp) <- namesVec
  write.table(temp, file = paste("clean", fileName, sep = "-"), row.names = FALSE, sep = "\t", quote = FALSE)
}
Now we simply need to apply this function on all the files we have:
sapply(fileNameList,convertFiles)
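The question also asks for NA columns when a file is missing some of the 16 name lines. A small sketch of one way to handle that (it assumes namesVec holds the full expected set of names and that the columns a given file does have are already correctly named):
# Pad absent variables with NA and restore the standard column order
padMissing <- function(temp, namesVec) {
  missing <- setdiff(namesVec, names(temp))
  temp[missing] <- NA      # one NA column per absent variable
  temp[namesVec]           # reorder to the standard layout
}
Called just before write.table inside convertFiles, this keeps every output file at the full 16 columns.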
Hope this helps!

R: Cross reference a data frame with information from a text file

I have a data frame that has one column called activity.num. Each row (there are about 10,000 rows) contains a value between 1 and 8.
In a text file called activity.txt I have a description of each activity. The format of the file is:
1. Read1
2 Write2
...
8 Activity
My goal is to read this file and append a new column to the data frame called activity.desc with the proper description.
I managed to read in the file:
# returns a list of the activity number and description
activityList <- function() {
  con <- file("./activity.txt", open = "rt")
  data <- readLines(con)
  close(con)
  # split each line on the space
  strsplit(data, " ")
}
The resulting output is a list, one element per line, where each element is a vector whose first element is the number and whose second is the description.
I would be grateful if you could:
Comment on whether my approach is correct and efficient
Help me with the generation of activity.desc.
Thanks.
I managed to find a solution (I'm not sure whether it can be improved upon):
# activityList() function is defined above
activityref <- activityList()
# Add a new column with the description. ydata is the original data frame.
ydata[,"activity.desc"] <- sapply(ydata[,"activity.num"], function(x) activityref[[x]][2])
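If the activities are numbered 1 through 8 in file order, as the question suggests, a vectorized lookup is a possible alternative (a sketch, reusing activityref from above):
# Build the description lookup once, then index it by activity number
desc <- vapply(activityref, function(x) x[2], character(1))
ydata$activity.desc <- desc[ydata$activity.num]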
Hope this helps.
