R - Importing a Weird, Messy CSV File

I have searched through Stack Overflow and the web and found solutions to similar problems, but nothing that addresses this one. Maybe I am just not thinking about it in the correct "R" terms, so here goes... please help.
I have a few odd CSV files which I have to process every day.
Here is a mock-up of the data as it comes in:
This is worthless and I want to get rid of it,,,,,,,,
This is worthless and I want to get rid of it,,,,,,,,
This line may or may not be here,,,,,,,,
This line may or may not be here,,,,,,,,
This line may or may not be here,,,,,,,,
Header1,Header2,Header3,Header4,Header5,Header6,Header12,Header13,
20345604,10.21.1151.12.0,Daisy,Petal,Stem,Data,Data,Data,
20345627,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,
20345600,10.21.1151.12.0,Samson,Petal,Stem,Data,Data,Data,
20345623,10.21.1151.12.0,Cloud,Petal,Stem,Data,Data,Data,
Header1,Header2,Header3,Header4,Header5,Header6,Header12,Header13,
20345704,10.21.1151.12.0,Simmons,Petal,Stem,Data,Data,Data,
20345677,10.21.1151.12.0,Butle,Petal,Stem,Data,Data,Data,
20347600,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,
20745623,10.21.1151.12.0,Unicorn,Petal,Stem,Data,Data,Data,
Notes on the raw files:
They are all standard CSVs.
The number of columns may vary from file to file or day to day, but the headers should always start with the same initial column name (in this example, "Header1").
Each file will have between 2 and 10 leading lines which are worthless and which I don't need.
The actual headers will appear within the first 10 rows.
All of the data after the first header row is part of Group1, and I want to add a new column "Group" with that as the data.
Eventually (5,000 to 100,000 rows later), another copy of the same header row will appear. All of the data after this second header row is part of Group2, and I want the new Group column to say so for those rows (i.e. put "Group2" in that column).
In the end, I would like to end up with this (given the initial data above):
Header1,Header2,Header3,Header4,Header5,Header6,Header12,Header13,NEWFIELD
20345604,10.21.1151.12.0,Daisy,Petal,Stem,Data,Data,Data,Group1
20345627,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,Group1
20345600,10.21.1151.12.0,Samson,Petal,Stem,Data,Data,Data,Group1
20345623,10.21.1151.12.0,Cloud,Petal,Stem,Data,Data,Data,Group1
20345704,10.21.1151.12.0,Simmons,Petal,Stem,Data,Data,Data,Group2
20345677,10.21.1151.12.0,Butle,Petal,Stem,Data,Data,Data,Group2
20347600,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,Group2
20745623,10.21.1151.12.0,Unicorn,Petal,Stem,Data,Data,Data,Group2
I have tried to treat the data as a connection stream with a series of if/else statements to identify the headers and groups, add the new column, and so on, but I am having trouble putting it back into a usable form with proper headers.
Group <- "Start"
processFile <- function(datafilepath) {
  con <- file(datafilepath, "r")
  while (TRUE) {
    line <- readLines(con, n = 1)
    if (length(line) == 0) {
      print("EOF")
      break
    }
    if (grepl("Header1", line) & Group == "Start") {
      # NOTE: result has not been initialized at this point
      colnames(result) <- data.frame(paste(line, ",", "Group"))
      print("Initial headers found, switching to Group1")
      Group <- "Group1"
    } else if (grepl("Systems.Name", line) & Group == "Group1") {
      print("Switching to Group2")
      Group <- "Group2"
    } else if (Group == "Start") {
      print("At Start")
    }
    if (Group != "Start") {
      indresult <- paste(line, ",", Group)
      result <- rbind(result, indresult)
    }
  }
  close(con)  # close the connection before returning; after return() it would be unreachable
  return(result)
}
This code fails to load the headers correctly, and I have not found a method for loading the headers directly and then the data after them. I am fairly certain the column additions would work if the rest did, but I can't verify that the result will be seen as a complete data frame until I get past this point.
Main questions: Is this the correct way to go about this and, if so, how do I get the data into a data frame so that I can use it?
Thanks,
Solution I am currently using:
The earlier solution with fread was the closest, but I had a hard time wrapping my brain around it, and the := assignment operator isn't recognized on my setup.
So here is what I eventually used:
library(data.table)

# This removes all rows before the appearance of "Header1"
Data <- fread(paste(Folder, File, sep = ""), skip = "Header1")
Group <- "Group1"
# Add an extra column to the data frame, to be filled in below
Data$Group <- ""
# Loop through each row and assign the Group. (I had tried using simply
# "Data" instead of 1:nrow(Data), but in that case R iterated over the
# columns of Data rather than its rows.)
for (dataline in 1:nrow(Data)) {
  if (Data[dataline, ]$"Header1" == "Header1" & Group == "Group1") {
    # Reached the second row of headers, indicating the group change
    Group <- "Group2"
    next
  }
  # Assign the group
  Data[dataline, ]$Group <- Group
}
# Remove the duplicate header rows
Data <- Data[!(Data$Header1 == "Header1"), ]
It is slow (about 4-5 minutes for 50,000 rows), but at least it is automatic and gets what I need. If there is a way of speeding it up, please feel free to add it. Thanks!
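For anyone hitting the same bottleneck, a vectorized version along these lines should avoid the row-by-row loop entirely (a sketch, assuming the same fread call as above; it also sidesteps the := operator):
library(data.table)

Data <- fread(paste(Folder, File, sep = ""), skip = "Header1")
# Each repeated header row marks the start of a new group; cumsum()
# over those marker rows is 0 before the second header and 1 from it
# onwards, so adding 1 yields the group number directly.
Data$Group <- paste0("Group", cumsum(Data$Header1 == "Header1") + 1)
# Drop the embedded header rows themselves
Data <- Data[Data$Header1 != "Header1", ]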

Something like this:
x = 'This is worthless and I want to get rid of it,,,,,,,,
This is worthless and I want to get rid of it,,,,,,,,
This line may or may not be here,,,,,,,,
This line may or may not be here,,,,,,,,
This line may or may not be here,,,,,,,,
Header1,Header2,Header3,Header4,Header5,Header6,Header12,Header13,
20345604,10.21.1151.12.0,Daisy,Petal,Stem,Data,Data,Data,
20345627,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,
20345600,10.21.1151.12.0,Samson,Petal,Stem,Data,Data,Data,
20345623,10.21.1151.12.0,Cloud,Petal,Stem,Data,Data,Data,
Header1,Header2,Header3,Header4,Header5,Header6,Header12,Header13,
20345704,10.21.1151.12.0,Simmons,Petal,Stem,Data,Data,Data,
20345677,10.21.1151.12.0,Butle,Petal,Stem,Data,Data,Data,
20347600,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,
20745623,10.21.1151.12.0,Unicorn,Petal,Stem,Data,Data,Data,'
require(data.table)
require(zoo) # for na.locf
o = fread(x, skip = 5, sep = ',')
# count how many headers
nh = nrow(o[grepl('Header1', V1) & grepl('Header2', V2)])
# add header id
o[grepl('Header1', V1) & grepl('Header2', V2), group := 1:nh]
# fill down header
o[, group := na.locf(group, na.rm = FALSE)]
# remove rows containing 'Header*'
o = o[!grepl('Header1', V1) & !grepl('Header2', V2) ]
o
V1 V2 V3 V4 V5 V6 V7 V8 V9 group
1: 20345604 10.21.1151.12.0 Daisy Petal Stem Data Data Data NA 1
2: 20345627 10.21.1151.12.0 Rose Petal Stem Data Data Data NA 1
3: 20345600 10.21.1151.12.0 Samson Petal Stem Data Data Data NA 1
4: 20345623 10.21.1151.12.0 Cloud Petal Stem Data Data Data NA 1
5: 20345704 10.21.1151.12.0 Simmons Petal Stem Data Data Data NA 2
6: 20345677 10.21.1151.12.0 Butle Petal Stem Data Data Data NA 2
7: 20347600 10.21.1151.12.0 Rose Petal Stem Data Data Data NA 2
8: 20745623 10.21.1151.12.0 Unicorn Petal Stem Data Data Data NA 2
x should be the path to your csv file.
Also, check out data.table::fread for more arguments that might be useful here.
You could further use setnames() to change the column names, and perhaps convert columns from character to numeric where the original dataset calls for it.
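For example (a sketch only; the names given to setnames() here are illustrative, not taken from the original file):
# Rename two of the default V* columns, then convert the id column
setnames(o, old = c("V1", "V3"), new = c("id", "name"))
o[, id := as.numeric(id)]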

Related

Import multiple .txt files into R and skip to actual data rows

I have 537 .txt files which I need to import into either a list or separate data frames in R. I do not want to append any data, as it is crucial to keep everything separate.
I've renamed each file so the file names are all uniform. Each file begins with a header section of miscellaneous information that runs 12-16 rows depending on the file. The data itself is tab delimited, with between 5 and 7 columns in most files (the number can vary between 5 and 9), and the columns are not always in the same order, so it is important that I can import the column names with the data (the column names are uniform across files). The format of each file is as follows:
Header
Header
Header
Header...up to 16 rows
((number of spaces between header and column names varies))
Date(\t)Time(\t)dataCol1(\t)dataCol2(\t)dataCol3(\t)dataCol4
((no empty row between column names and units))
mm/dd/yyyy(\t)hh:mm:ss(\t)units(\t)units(\t)units(\t)units
((1 empty row between units and data))
01/31/2016(\t)14:32:02(\t)14.9(\t)25.3(\t)15.8(\t)25.6
((data repeats for up to 4000 rows))
To recap what I need:
Import all of the files into individual data frames or a list of data frames.
Skip past the header information to the row with "Date" (and possibly delete the two following rows with the units and the empty row), leaving me with a row of column names and the data following.
Here's a crude copy of the code I have been working on. The idea is, after importing all of the files into R, to determine the max value of 1-2 columns in each file, then export a single file with one row per input file containing the 2 max values from that file.
## List files and create a list for the data frames
path <- list.files("Path", pattern = NULL, all.files = FALSE, full.names = TRUE)
files <- list()
## Null list for the final data to be extracted to
results <- NULL
## Add names to the results list (file name minus extension)
results$name <- substr(basename(path), 1, nchar(basename(path)) - 4)
## Loop to read in the data files and calculate the maxima
for (i in 1:length(path)) {
  ## Read files
  files[[i]] <- read.delim(path[[i]], header = FALSE, sep = "\t", skip = 18)
  ## Will have to add code:
  ## "if columnx exists do this; if columny exists do this"
  ## Convert the 2 columns needed for the calculation to numeric
  x.x <- as.numeric(as.character(files[[i]]$columnx))
  x.y <- as.numeric(as.character(files[[i]]$columny))
  ## Will have to add code:
  ## "if column x exists, do this....if not, NA"
  ## Get the max value for the 2 specific columns
  results$max.x[i] <- max(x.x)
  results$max.y[i] <- max(x.y)
}
## Add the results to a data frame
max <- data.frame(results)
## Export to .csv
write.csv(max, file = "PATH")
I know that right now my code just skips straight into the data (the max doesn't occur until much later in each file, so skipping 1 or 2 extra lines won't hurt me), and it assumes the columns are in the same order in each file. This is horrible practice and gives me bad results on about 5% of my data points, but I want to do this correctly. My main concern is to get the data into R in a usable format; then I can add the other calculations and conversions. I am new to R, and after 2 days of searching I have not found the help I need already posted to any forum.
Assuming that the structure of the header follows Line \n Line \n Data, we can use grep to find the line number where "mm/dd/yyyy" appears.
As such:
system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T)
# ran.txt is an arbitrary text file I created, we will substitute
# 'ran.txt' with path[[i]] later on.
#[1] "6:mm/dd/yyyy\thh:mm:ss\tunits\tunits\tunits\tunits"
From this we can strsplit the output on the ":" and use the number before it as the value for skip.
as.numeric(strsplit(system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T),":")[[1]][1])
# [[1]][1] will specify the first element of the output of strsplit as
# in the output the hh:mm:ss also is split.
# [1] 6
As there is an empty row between the matched row and the actual data, we can add 1 to this and then begin reading the data.
Thus:
## List files and create a list for the data frames
path <- list.files("Path", pattern = NULL, all.files = FALSE, full.names = TRUE)
files <- list()
## Null list for the final data to be extracted to
results <- NULL
## Add names to the results list (file name minus extension)
results$name <- substr(basename(path), 1, nchar(basename(path)) - 4)
## Loop to read in the data files and calculate the maxima
for (i in 1:length(path)) {
  ## Read files
  # Calculate the number of rows to skip.
  # Using Dave2e's suggestion:
  header <- readLines(path[[i]], n = 20)
  skip <- grep("^mm/dd/yy", header)
  # Add one because of the empty line
  skip <- skip + 1
  files[[i]] <- read.delim(path[[i]],
                           header = FALSE,
                           sep = "\t",
                           skip = skip)
  ## Will have to add code:
  ## "if columnx exists do this; if columny exists do this"
  ## Convert the 2 columns needed for the calculation to numeric
  x.x <- as.numeric(as.character(files[[i]]$columnx))
  x.y <- as.numeric(as.character(files[[i]]$columny))
  ## Will have to add code:
  ## "if column x exists, do this....if not, NA"
  ## Get the max value for the 2 specific columns
  results$max.x[i] <- max(x.x)
  results$max.y[i] <- max(x.y)
}
## Add the results to a data frame
max <- data.frame(results)
## Export to .csv
write.csv(max, file = "PATH")
I think that about covers everything.
Thought I would add this here in case it helps someone else with a similar issue. @TJGorrie's solution helped solve my slightly different challenge. I have several .rad files that I need to read in, tag, and merge. The .rad files have headers that start at random rows, so I needed a way to find the row containing the header. I didn't need any additional calculations except to create a tag column. Hope this helps someone in the future, and thanks @TJGorrie for the awesome answer!
## List files and create a list for the data frames
path <- list.files(pattern = "*.rad")
files <- list()
## Loop to read in the data files
for (i in 1:length(path)) {
  # Using Dave2e's suggestion:
  header <- readLines(path[[i]], n = 20)
  skip <- grep("Sample", header)
  # Subtract one row so the row containing "Sample" is kept as the header
  skip <- skip - 1
  files[[i]] <- read.table(path[[i]],
                           header = TRUE,
                           fill = TRUE,
                           skip = skip,
                           stringsAsFactors = FALSE)
  # Name each newly created object after its original file
  names(files)[i] <- gsub(".rad", "", path[i])
  files[[i]] <- na.omit(as.data.frame(files[[i]]))
  # Create a new column holding the file name, to act as a tag
  # when the data frames get merged through rbind
  files[[i]]$Tag <- names(files)[i]
}
# Bind all the data frames in the list into a single data frame
df <- do.call("rbind", c(files, make.row.names = FALSE))
##export to .csv
write.csv(df,file="PATH.csv", row.names = FALSE)
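As a side note, if data.table is available, rbindlist() is a faster drop-in for the do.call("rbind", ...) step (a sketch):
library(data.table)
# fill = TRUE tolerates files whose columns differ, padding missing ones
# with NA; the result is a data.table, which write.csv handles fine
df <- rbindlist(files, use.names = TRUE, fill = TRUE)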

How to convert rows

I have uploaded a data set called "Obtained Dataset". It usually has 16 rows of numeric and character variables; some other files of a similar nature have fewer than 16. Each such row names a variable, and those variables are the headers of the data, which starts from the 17th row onwards in this specific file.
Obtained dataset & Required Dataset
In the data, the 1st column is the x-axis, the 2nd column is the y-axis and the 3rd column is depth (these are standard for all the files in the database); the 4th column is GR 1 LIN, the 5th column is CAL 1 LIN, and so forth, as given in the first 16 rows of the file.
Now I want R code which can convert it into the format shown in the required data set. Also, if a different data set has fewer than 16 lines of names, say GR 1 LIN and RHOB 1 LIN are missing, then I want it to still create those columns with NA entries for 1:nrow.
Currently I have managed to export the file to Excel, manually clean the data, rename the columns correspondingly, save it as csv and then read.csv("filename"), etc., but it is simply not possible to do this for 400 files.
Any advice on how to proceed will be of great help.
I notice that you have probably posted this question before, in a different format. This is a public forum, and people are happy to help; however, it's your job to simplify life for others, and you are requested to put in some effort. Here is some advice on that.
Having said that, here is some code I have written to help you out.
Step0: Creating your first data set:
sink("test.txt") # This will `sink` all the output to the file "test.txt"
# Lets start with some dummy data
cat("1\n")
cat("DOO\n")
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
# Now a 10 x 16 dummy data matrix:
cat(paste(apply(matrix(sample(160),10),1,paste,collapse = "\t"),collapse = "\n"))
cat("\n")
sink() # This will stop `sink`ing.
I have created some dummy data in the first 6 lines, followed by a 10 x 16 data matrix.
Note: In principle you should have provided something like this, or a copy of your dataset. This would help other people help you.
Step1: Now we need to read the file, skipping the first 6 rows of undesired info:
(temp <- read.table(file="test.txt", sep ="\t", skip = 6))
Step2: Data clean up:
We need a vector with names of the 16 columns in our data:
namesVec <- letters[1:16]
Now we assign these names to our data.frame:
names(temp) <- namesVec
temp
Looks good!
Step3: Save the data:
write.table(temp,file="test-clean.txt",row.names = FALSE,sep = "\t",quote = FALSE)
Check if the solution is working. If it is working, then move to the next step; otherwise make the necessary changes.
Step4: Automating:
First we need to create a list of all the 400 files.
The easiest way (to explain also) is copy the 400 files in a directory, and then set that as working directory (using setwd).
Now first we'll create a vector with all file names:
fileNameList <- dir()
Once this is done, we'll need a function to repeat steps 1 through 3:
convertFiles <- function(fileName) {
  temp <- read.table(file = fileName, sep = "\t", skip = 6)
  names(temp) <- namesVec
  # Write each cleaned file out under its own name, e.g. "clean-test.txt"
  write.table(temp, file = paste("clean", fileName, sep = "-"),
              row.names = FALSE, sep = "\t", quote = FALSE)
}
Now we simply need to apply this function to all the files we have:
sapply(fileNameList,convertFiles)
Hope this helps!
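One caveat worth adding: dir() with no arguments lists everything in the working directory, including any already-cleaned output files, so a second run would pick those up too. Restricting the pattern avoids that (assuming the raw files share a .txt extension):
fileNameList <- dir(pattern = "\\.txt$")  # only the raw data files
sapply(fileNameList, convertFiles)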

Scatter plot for each column of a DF: generates a random number of files

I'm currently trying to write a function that creates a scatter plot of each column of a dataframe against another column, and puts each plot into a png file.
The big issue is that running the script at moment A and at moment B with the same dataframe does not create the same number of files: I may end up with 50 files or 80... It is random and I can't explain it...
No error is printed in the console...
I would like all of the columns to be plotted, not just a random subset of them.
Does anyone know what could be the reason for this problem? And how to solve it?
Here's the code:
data_analysis_num <- function(variete, centre, dataframe, column_name)
{
  column <- dataframe[, column_name]   # use the argument rather than a global
  name_folder <- paste('Corr_centre_', centre, '_variete_', variete, sep = '')
  dir.create(name_folder)              # portable equivalent of system('mkdir ...')
  count <- 0
  for (i in 1:length(dataframe))
  {
    count <- count + 1
    name <- count                      # was the undefined variable 'compt'
    png(file = paste('/home/R/Fusion/', name_folder, '/', name, '.png', sep = ''))
    plot(dataframe[, i] ~ column)
    dev.off()
  }
}
data_analysis_num('all', Centre_4, DF, 'Column_name')
I thought it could be due to my data, so I tried replacing plot(dataframe[, i] ~ column) with plot(column ~ column) (column being a trusted column), and it still won't work as expected...
Thanks for your help!

How to pass an R function argument to subset a column

First, I am new here; this is my first post, so my apologies in advance if I am not doing everything correctly. I did take the time to search around first but couldn't find what I am looking for.
Second, I am pretty sure I am breaking a rule in that this question relates to a coursera.org R programming course I am taking (this was part of an assignment), but the due date has lapsed and I have failed for now. I will repeat the subject next month and try again, but I am now in damage control, trying to find out what went wrong.
Basically, my code is below.
What I am trying to do is read in data from a series of files. These files are four columns wide, with the titles Date, nitrate, sulfate and id, and contain varying numbers of rows of data.
The function I am trying to write should take as arguments the directory of the files, the pollutant (either nitrate or sulfate), and the set of numbered files (e.g. files 1 and 2, files 1 through 4, etc.). The function should return the average value of the selected pollutant across the selected files.
I would call the function with a call like this:
pollutantmean("datafolder", "nitrate", 1:3)
and the return should just be a number, which in this case is the average of nitrate across data files 1 through 3.
OK, I hope I have provided enough information. Other stuff that may be useful is:
Operating system :Ubuntu
Language: R
Error message received:
Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
As I say, the data files are a series of files located in a folder and are four columns wide and vary as to the number of rows.
My function code is as follows:
pollutantmean <- function(directory, pollutant, id = 1:5) { # content of the function
  # Create a list of files (a character vector, I think)
  files_list <- dir(directory, full.names = TRUE)
  # Now create an empty data frame
  dat <- data.frame()
  # Next step is to execute a loop to read all the selected data files into the data frame
  for (i in 1:5) {
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  # Subset the rows matching the selected monitor numbers
  dat_subset <- dat[dat[, "ID"] == id, ]
  # Identify the median of the pollutant and ignore the NA values
  median(dat_subset$pollutant, na.rm = TRUE)
}
OK, that is it. Through trial and error I am pretty sure the final line of code, median(dat_subset$pollutant, na.rm = TRUE), is the problem. I pass the function a pollutant argument which should be either sulfate or nitrate, but the dat_subset$pollutant bit of code is what is not working; somehow the passed pollutant argument does not make it into the function body. The dat_subset$pollutant bit should ideally be equivalent to either dat_subset$nitrate or dat_subset$sulfate, depending on the argument fed to the function.
You cannot subset with the $ operator if you pass the column name in an object, as in your example (where it is stored in pollutant). So try to subset using [ ]; in your case that would be:
median(dat_subset[,pollutant], na.rm = TRUE)
or
median(dat_subset[[pollutant]], na.rm = TRUE)
Does that work?
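For completeness, here is a sketch of the whole function with that fix applied, plus two related adjustments (looping over id rather than 1:5, and using %in% so a vector of ids subsets correctly); it uses mean() since the goal is the average, and it assumes dir() returns the files in numeric order:
pollutantmean <- function(directory, pollutant, id = 1:5) {
  files_list <- dir(directory, full.names = TRUE)
  dat <- data.frame()
  # Read only the requested files
  for (i in id) {
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  # %in% matches each row's ID against the whole id vector
  dat_subset <- dat[dat[, "ID"] %in% id, ]
  mean(dat_subset[[pollutant]], na.rm = TRUE)
}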

R: changing column names for improved documentation

I have two csv files: one containing measurements at several points, and one containing descriptions of the individual points. There are about 100 different points and tens of thousands of measurements, but for simplicity let's assume there are only two points and two measurements.
data.csv:
point1,point2,date
25,80,11.06.2013
26,70,10.06.2013
description.csv:
point,name,description
point1,tempA,Temperature in room A
point2,humidA,Humidity in room A
Now I read both of the csvs into data frames. Then I change the column names in the data frame to make it more readable.
options(stringsAsFactors=F)
DataSource <- read.csv("data.csv")
DataDescription <- read.csv("description.csv")
for (name.source in names(DataSource))
{
  count <- 1
  for (name.target in DataDescription$point)
  {
    if (name.source == name.target)
    {
      names(DataSource)[names(DataSource) == name.source] <- DataDescription[count, 'name']
    }
    count <- count + 1
  }
}
So, my questions now are: Is there a way to do this without the loops? And would you change the names for readability as I did or not? If not, why?
The trick with replacements is sometimes to match the indexing on both sides of the assignment:
names(DataSource)[match(DataDescription$point, names(DataSource))] <-
DataDescription$name[match(DataDescription$point, names(DataSource))]
#> DataSource
tempA humidA date
1 25 80 11.06.2013
2 26 70 10.06.2013
Earlier effort:
names(DataSource)[match(DataDescription$point, names(DataSource))] <-
gsub(" ", "_", DataDescription$description)[
match(DataDescription$point, names(DataSource))]
#> DataSource
Temperature_in_room_A Humidity_in_room_A date
1 25 80 11.06.2013
2 26 70 10.06.2013
Notice that I did not put non-syntactic names on that dataframe; to do so would have been a disservice. Anando Mahto's comment is well considered. I would not want to do this unless it were at the very end of data processing, or a side excursion on the way to a plotting effort. In that case I might not substitute the underscores. In the case where you wanted plot labels, there might be a further need to insert "\n" to fold the text within space constraints.
OK, I ordered the columns in the first file and the rows in the second one to work around the problem of keeping the points in the same order. Now the description only needs to have the same points as the data source. Here is my final code:
# set options to get strings right
options(stringsAsFactors=F)
# read in original data
DataOriginal <- read.csv("data.csv", sep = ";")
DataDescriptionOriginal <- read.csv("description.csv", sep = ";")
# sort the data
DataOrdered <- DataOriginal[,order(names(DataOriginal))]
DataDescriptionOrdered <- DataDescriptionOriginal[order(DataDescriptionOriginal$points),]
# copy data into final dataframe and replace names
Data <- DataOrdered
names(Data)[match(DataDescriptionOrdered$points, names(Data))] <-
  gsub(" ", "_", DataDescriptionOrdered$description)[match(DataDescriptionOrdered$points, names(Data))]
Thanks a lot to everyone who contributed to finding a good solution for me!
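For reference, a loop-free alternative that does not depend on ordering is a named lookup vector mapping points to names (a sketch using the column names from the example files):
# Build a point -> name lookup from the description table
lookup <- setNames(DataDescription$name, DataDescription$point)
# Replace only the column names that appear in the lookup
hit <- names(DataSource) %in% names(lookup)
names(DataSource)[hit] <- lookup[names(DataSource)[hit]]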
