I'm very new to R but do program. I'm probably just getting fed up with my own progress at this stage, so here's my issue;
Lots of .csv files, large (6MB) with spectrum data that I need to do analysis afterwards. I'm trying to read in the data - two columns of Frequency and Voltage (V as dB values), 500,000 data points per file. I would like to "merge" the data from the 2nd column in a new data set for every 10 files.
Eg: 10 files, ten Frequency (all the same for each so can be ignored for the moment) and ten Voltage. Take the data from the Voltage in the 2nd column and merge it into a data set. If I have 10 files = I end up with one data set, 100 files = 10 data sets. Hopefully in the end each data set will have 11 columns | Frequency | V1 | V2 | ... | V10 |. It would be nice to do an Index-Match on each file but I'm not sure my PC will be able for it until I upgrade resources.
This might seem quiet convoluted, all suggestions welcome, memory seems to be an issue when trying to sort through 1200 .csv files or even just reading 100 of them. Thanks for your time!
I haven't tested this since I obviously don't have your data, but something like the code below should work. Basically, you create a vector of all the file names and then read, combine, and write 10 of them at a time.
library(reshape2)
library(dplyr)
# Get the names of all the csv files
files = list.files(pattern="csv$")
# Read, combine, and save ten files at a time in each iteration of the loop
for (i in (unique(1:length(files)) - 1) %/% 10)) {
# Read ten files at a time into a list
dat = lapply(files[(1:length(files) - 1) %/% 10 == i], function(f) {
d=read.csv(f, header=TRUE, stringsAsFactors=FALSE)
# Add file name as a column
d$file = gsub("(.*)\\.csv$", "\\1", f)
return(d)
})
# Combine the ten files into a single data frame
dat = bind_rows(dat)
# Reshape from long to wide format
dat = dcast(Frequency ~ file, value.var="Voltage")
# Write to csv
write.csv(dat, paste("Files_", i,".csv"), row.names=FALSE)
}
On the other hand, if you want to just combine them all into a single file in long format, which will make analysis easier (if you have enough memory of course):
# Read all files into a list
dat = lapply(files, function(f) {
d = read.csv(f, header=TRUE, stringsAsFactors=FALSE)
# Add file name as a column
d$file = gsub("(.*)\\.csv$", "\\1", f)
return(d)
})
# Combine into a single data frame
dat = bind_rows(dat)
# Save to csv
write.csv(dat, "All_files_combined.csv", row.names=FALSE)
Related
I have large XML files that I want to turn into dataframes for further processing within R and other programs. This is all being done in macOS.
Each monthly XML is around 1gb large, has 150k records and 191 different variables. In the end I might not need the full 191 variables but I'd like to keep them and decide later.
The XML files can be accessed here (scroll to the bottom for the monthly zips, when uncompressed one should look at "dming" XMLs)
I've made some progress but processing for larger files takes too long (see below)
The XML looks like this:
<ROOT>
<ROWSET_DUASDIA>
<ROW_DUASDIA NUM="1">
<variable1>value</variable1>
...
<variable191>value</variable191>
</ROW_DUASDIA>
...
<ROW_DUASDIA NUM="150236">
<variable1>value</variable1>
...
<variable191>value</variable191>
</ROW_DUASDIA>
</ROWSET_DUASDIA>
</ROOT>
I hope that's clear enough. This is my first time working with an XML.
I've looked at many answers here and in fact managed to get the data into a dataframe using a smaller sample (using a daily XML instead of the monthly ones) and xml2. Here's what I did
library(xml2)
raw <- read_xml(filename)
# Find all records
dua <- xml_find_all(raw,"//ROW_DUASDIA")
# Create empty dataframe
dualen <- length(dua)
varlen <- length(xml_children(dua[[1]]))
df <- data.frame(matrix(NA,nrow=dualen,ncol=varlen))
# For loop to enter the data for each record in each row
for (j in 1:dualen) {
df[j, ] <- xml_text(xml_children(dua[[j]]),trim=TRUE)
}
# Name columns
colnames(df) <- c(names(as_list(dua[[1]])))
I imagine that's fairly rudimentary but I'm also pretty new to R.
Anyway, this works fine with daily data (4-5k records), but it's probably too inefficient for 150k records, and in fact I waited a couple hours and it hadn't finished. Granted, I would only need to run this code once a month but I would like to improve it nonetheless.
I tried to turn the elements for all records into a list using the as_list function within xml2 so I could continue with plyr, but this also took too long.
Thanks in advance.
While there is no guarantee of better performance on larger XML files, the ("old school") XML package maintains a compact data frame handler, xmlToDataFrame, for flat XML files like yours. Any missing nodes available in other siblings result in NA for corresponding fields.
library(XML)
doc <- xmlParse("/path/to/file.xml")
df <- xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA"))
You can even conceivably download the daily zips, unzip need XML, and parse it into data frame should the large monthly XMLs pose memory challenges. As example, below extracts December 2018 daily data into a list of data frames to be row binded at end. Process even adds a DDate field. Method is wrapped in a tryCatch due to missing days in sequence or other URL or zip issues.
dec_urls <- paste0(1201:1231)
temp_zip <- "/path/to/temp.zip"
xml_folder <- "/path/to/xml/folder"
xml_process <- function(dt) {
tryCatch({
# DOWNLOAD ZIP TO URL
url <- paste0("ftp://ftp.aduanas.gub.uy/DUA%20Diarios%20XML/2018/dd2018", dt,".zip")
file <- paste0(xml_folder, "/dding2018", dt, ".xml")
download.file(url, temp_zip)
unzip(temp_zip, files=paste0("dding2018", dt, ".xml"), exdir=xml_folder)
unlink(temp_zip) # DESTROY TEMP ZIP
# PARSE XML TO DATA FRAME
doc <- xmlParse(file)
df <- transform(xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA")),
DDate = as.Date(paste("2018", dt), format="%Y%m%d", origin="1970-01-01"))
unlink(file) # DESTROY TEMP XML
# RETURN XML DF
return(df)
}, error = function(e) NA)
}
# BUILD LIST OF DATA FRAMES
dec_df_list <- lapply(dec_urls, xml_process)
# FILTER OUT "NAs" CAUGHT IN tryCatch
dec_df_list <- Filter(NROW, dec_df_list)
# ROW BIND TO FINAL SINGLE DATA FRAME
dec_final_df <- do.call(rbind, dec_df_list)
Here is a solution that processes the entire document at once as opposed to reading each of the 150,000 records in the loop. This should provide a significant performance boost.
This version can also handle cases where the number of variables per record is different.
library(xml2)
doc<-read_xml('<ROOT>
<ROWSET_DUASDIA>
<ROW_DUASDIA NUM="1">
<variable1>value1</variable1>
<variable191>value2</variable191>
</ROW_DUASDIA>
<ROW_DUASDIA NUM="150236">
<variable1>value3</variable1>
<variable2>value_new</variable2>
<variable191>value4</variable191>
</ROW_DUASDIA>
</ROWSET_DUASDIA>
</ROOT>')
#find all of the nodes/records
nodes<-xml_find_all(doc, ".//ROW_DUASDIA")
#find the record NUM and the number of variables under each record
nodenum<-xml_attr(nodes, "NUM")
nodeslength<-xml_length(nodes)
#find the variable names and values
nodenames<-xml_name(xml_children(nodes))
nodevalues<-trimws(xml_text(xml_children(nodes)))
#create dataframe
df<-data.frame(NUM=rep(nodenum, times=nodeslength),
variable=nodenames, values=nodevalues, stringsAsFactors = FALSE)
#dataframe is in a long format.
#Use the function cast, or spread from the tidyr to convert wide format
# NUM variable values
# 1 1 variable1 value1
# 2 1 variable191 value2
# 3 150236 variable1 value3
# 4 150236 variable2 value_new
# 5 150236 variable191 value4
#Convert to wide format
library(tidyr)
spread(df, variable, values)
Apologies if this may seem simple, but I can't find a workable answer anywhere on the site.
My data is in the form of a csv with the filename being a name and number. Not quite as simple as having file with a generic word and increasing number...
I've achieved exactly what i want to do with just one file, but the issue is there are a couple of hundred to do, so changing the name each time is quite tedious.
Posting my original single-batch code here in the hopes someone may be able to ease the growing tension of failed searches.
# set workspace
getwd()
setwd(".../Desktop/R Workspace")
# bring in original file, skipping first four rows
Person_7<- read.csv("PersonRound7.csv", header=TRUE, skip=4)
# cut matrix down to 4 columns
Person7<- Person_7[,c(1,2,9,17)]
# give columns names
colnames(Person7) <- c("Time","Spare", "Distance","InPeriod")
# find the empty rows, create new subset. Take 3 rows away for empty lines.
nullrow <- (which(Person7$Spare == "Velocity"))-3
Person7 <- Person7[(1:nullrow), ]
#keep 3 needed columns from matrix
Person7<- Person7[,c(1,3,4)]
colnames(Person7) <- c("Time","Distance","InPeriod")
#convert distance and time columns to factors
options(digits=9)
Person7$Distance <- as.numeric(as.character(Person7$Distance))
Person7$Time <- as.numeric(as.character(Person7$Time))
#Create the differences column for distance
Person7$Diff <- c(0, diff(Person7$Distance))
...whole heap of other stuff...
#export Minutes to an external file
write.csv(Person7_maxs, ".../Desktop/GPS Minutes/Person7.csv")
So the three part issue is as follows:
I can create a list or vector to read through the file names, but not a dataframe for each, each time (if that's even a good way to do it).
The variable names throughout the code will need to change instead of just being "Person1" "Person2", they'll be more like "Johnny1" "Lou23".
Need to export each resulting dataframe to it's own csv file with the original name.
Taking any and all suggestions on board - s.t.ruggling with this one.
Cheers!
Consider using one list of the ~200 dataframes. No need for separate named objects flooding global environment (though list2env still shown below). Hence, use lapply() to iterate through all csv files of working directory, then simply name each element of list to basename of file:
setwd(".../Desktop/R Workspace")
files <- list.files(path=getwd(), pattern=".csv")
# CREATE DATA FRAME LIST
dfList <- lapply(files, function(f) {
df <- read.csv(f, header=TRUE, skip=4)
df <- setNames(df[c(1,2,9,17)], c("Time","Spare","Distance","InPeriod"))
# ...same code referencing temp variable, df
write.csv(df_max, paste0(".../Desktop/GPS Minutes/", f))
return(df)
})
# NAME EACH ELEMENT TO CORRESPONDING FILE'S BASENAME
dfList <- setNames(dfList, gsub(".csv", "", files))
# REFERENCE A DATAFRAME WITH LIST INDEXING
str(dfList$PersonRound7) # PRINT STRUCTURE
View(dfList$PersonRound7) # VIEW DATA FRAME
dfList$PersonRound7$Time # OUTPUT ONE COLUMN
# OUTPUT ALL DFS TO SEPARATE OBJECTS (THOUGH NOT NEEDED)
list2env(dfList, envir = .GlobalEnv)
I have 537 .txt files which I need to import into either a list or separate data frames in R. I do not want to append any data as it is crucial to keep everything separate.
I've renamed each file, so the file names are all uniform. In each file, there is a header section with a lot of miscellaneous information. This header section is 12-16 rows depending on the file. For the data, I have between 5 and 7 columns. The data is all tab delimited. The number of columns varies between 5 and 9 columns, and the columns are not always in the same order, so it is important that I can import the column names with the data (column names are uniform across files). The format of the file is as follows:
Header
Header
Header
Header...up to 16 rows
((number of spaces between header and column names varies))
Date(\t)Time(\t)dataCol1(\t)dataCol2(\t)dataCol3(\t)dataCol4
((no empty row between column names and units))
mm/dd/yyyy(\t)hh:mm:ss(\t)units(\t)units(\t)units(\t)units
((1 empty row between units and data))
01/31/2016(\t)14:32:02(\t)14.9(\t)25.3(\t)15.8(\t)25.6
((data repeats for up to 4000 rows))
To recap what I need:
Import all of the files into individual data frames or a lists of data frames.
Skip past the header information to the row with "Date" (and possibly delete the two rows following with units and the empty row) leaving me with a row of column names and the data following.
Here's a crude copy of what I have been working on for code. The idea is, after importing all of the files into R, determine the max value for 1-2 columns in each file. Then, export a single file which will have 1 row for each file with 2 columns containing the 2 max values from each file.
##list files and create list for data.frames
path <- list.files("Path",pattern = NULL, all.files=FALSE,full.names=TRUE)
files <- list()
##Null list for final data to be extracted to
results <- NULL
##add names to results list (using file name - extension
results$name <- substr(basename(path),1,nchar(basename(Path))-4)
##loop to read in data files and calculate max
for(i in 1:length(path){
##read files
files[[i]] <- read.delim(path[[i]],header = FALSE, sep = "\t", skip = 18
##will have to add code:
##"if columnx exists do this; if columny exists do this"
##convert 2 columns for calculation to numeric
x.x <- as.numeric(as.character(files$columnx))
x.y <- as.numeric(as.character(files$columny))
##will have to add code:
##"if column x exists, do this....if not, "NA"
##get max value for 2 specific columns
results$max.x <- max(files$columnx)
results$max.y <- max(files$columny)
}
##add results to data frame
max <- data.frame(results)
##export to .csv
write.csv(max,file="PATH")
I know right now, my code just skips past everything into the data ( max doesn't come until much later in file, so skipping 1 or 2 lines won't hurt me), and it assumes the columns are in the same order in each file. This is horrible practice and gives me some bad results on about 5% of my data points, but I want to do this correctly. My main concern is to get the data into R in a usable format. Then, I can add the other calculations and conversions. I am new to R, and after 2 days of searching, I have not found the help I need already posted to any forum.
Assuming that the structure of the header follows a Line \n Line \n Data we can use a grep to find the line number where "mm/dd/yyyy"
As such:
system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T)
# ran.txt is an arbitrary text file I created, we will substitute
# 'ran.txt' with path[[i]] later on.
#[1] "6:mm/dd/yyyy\thh:mm:ss\tunits\tunits\tunits\tunits"
From this we can then strsplit the output into the number before the : and use that argument as the necessary value for skip.
as.numeric(strsplit(system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T),":")[[1]][1])
# [[1]][1] will specify the first element of the output of strsplit as
# in the output the hh:mm:ss also is split.
# [1] 6
As there is an empty row between our called row and the actual data we can add 1 to this and then begin reading the data.
Thusly:
##list files and create list for data.frames
path <- list.files("Path",pattern = NULL, all.files=FALSE,full.names=TRUE)
files <- list()
##Null list for final data to be extracted to
results <- NULL
##add names to results list (using file name - extension
results$name <- substr(basename(path),1,nchar(basename(Path))-4)
##loop to read in data files and calculate max
for(i in 1:length(path)){
##read files
# Calculate the number of rows to skip.
# Using Dave2e's suggestion:
header <-readLines("path[[i]]", n=20)
skip <- grep("^mm/dd/yy", header)
#Add one due to missing line
skip <- skip + 1
files[[i]] <- read.delim(path[[i]],
header = FALSE,
sep = "\t",
skip = skip)
##will have to add code:
##"if columnx exists do this; if columny exists do this"
##convert 2 columns for calculation to numeric
x.x <- as.numeric(as.character(files$columnx))
x.y <- as.numeric(as.character(files$columny))
##will have to add code:
##"if column x exists, do this....if not, "NA"
##get max value for 2 specific columns
results$max.x <- max(files$columnx)
results$max.y <- max(files$columny)
}
##add results to data frame
max <- data.frame(results)
##export to .csv
write.csv(max,file="PATH")
I think that about covers everything.
Thought I would add this here in case it helps someone else with a similar issue. #TJGorrie's solution helped solve my slightly different challenge. I have several .rad files that I need to read in, tag, and merge. The .rad files have headers that start at random rows so I needed a way to find the row with the header. I didn't need to do any additional calculations except create a tag column. Hope this helps someone in the future but thanks #TJGorrie for the awesome answer!
##list files and create list for data.frames
path <- list.files(pattern="*.rad")
files <- list()
##loop to read in data files
for(i in 1:length(path)){
# Using Dave2e's suggestion:
header <-readLines(path[[i]], n=20)
skip <- grep("Sample", header)
#Subtract one row to keep the row with "Sample" in it as the header
skip <- skip - 1
files[[i]] <- read.table(path[[i]],
header = TRUE,
fill = TRUE,
skip = skip,
stringsAsFactors = FALSE)
# Name the newly created file objects the same name as the original file.
names(files)[i] = gsub(".rad", "", (path[i]))
files[[i]] = na.omit(as.data.frame(files[[i]]))
# Create new column that includes the file name to act as a tag
# when the dfs get merged through rbind
files[[i]]$Tag = names(files)[i]
# bind all the dfs listed in the file into a single df
df = do.call("rbind",
c(files, make.row.names = FALSE))
}
##export to .csv
write.csv(df,file="PATH.csv", row.names = FALSE)
Hi so I have a data in the following format
101,20130826T155649
------------------------------------------------------------------------
3,1,round-0,10552,180,yellow
12002,1,round-1,19502,150,yellow
22452,1,round-2,28957,130,yellow,30457,160,brake,31457,170,red
38657,1,round-3,46662,160,yellow,47912,185,red
and I have been reading them and cleaning/formating them by this code
b <- read.table("sid-101-20130826T155649.csv", sep = ',', fill=TRUE, col.names=paste("V", 1:18,sep="") )
b$id<- b[1,1]
b<-b[-1,]
b<-b[-1,]
b$yellow<-B$V6
and so on
There are about 300 files like this, and ideally they will all compiled without the first two lines, since the first line is just id and I made a separate column to identity these data. Does anyone know how to read these table quickly and clean and format the way I want then compile them into a large file and export them?
You can use lapply to read all the files, clean and format them, and store the resulting data frames in a list. Then use do.call to combine all of the data frames into single large data frame.
# Get vector of files names to read
files.to.load = list.files(pattern="csv$")
# Read the files
df.list = lapply(files.to.load, function(file) {
df = read.table(file, sep = ',', fill=TRUE, col.names=paste("V", 1:18,sep=""))
... # Cleaning and formatting code goes here
df$file.name = file # In case you need to know which file each row came from
return(df)
})
# Combine into a single data frame
df.combined = do.call(rbind, df.list)
I have uploaded a data set which is called as "Obtained Dataset", it usually has 16 rows of numeric and character variables, some other files of similar nature have less than 16 characters, each variable is the header of the data which starts from the 17th row and onwards "in this specific file".
Obtained dataset & Required Dataset
For the data that starts 1st column is the x-axis, 2nd column is y-axis and 3rd column is depth (which are standard for all the files in the database) 4th column is GR 1 LIN, 5th column is CAL 1 LIN so and soforth as given in the first 16 rows of the data.
Now i want an R code which can convert it into the format shown in the required data set, also if a different data set has say less than 16 lines of names say GR 1 LIN and RHOB 1 LIN are missing then i want it to still create a column with NA entries till 1:nrow.
Currently i have managed to export this file to excel and manually clean the data and rename the columns correspondingly and then save it as csv and then read.csv("filename") etc but it is simply not possible to do this for 400 files.
Any advice how to proceed will be of great help.
I have noticed that you have probably posted this question again, and in a different format. This is a public forum, and people are happy to help. However, it's your job to simplify life of others, and you are requested to put in some effort. Here is some advice on that.
Having said that, here is some code I have written to help you out.
Step0: Creating your first data set:
sink("test.txt") # This will `sink` all the output to the file "test.txt"
# Lets start with some dummy data
cat("1\n")
cat("DOO\n")
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
# Now a 10 x 16 dummy data matrix:
cat(paste(apply(matrix(sample(160),10),1,paste,collapse = "\t"),collapse = "\n"))
cat("\n")
sink() # This will stop `sink`ing.
I have created some dummy data in first 6 lines, and followed by a 10 x 16 data matrix.
Note: In principle you should have provided something like this, or a copy of your dataset. This would help other people help you.
Step1: Now we need to read the file, and we want to skip the first 6 rows with undesired info:
(temp <- read.table(file="test.txt", sep ="\t", skip = 6))
Step2: Data clean up:
We need a vector with names of the 16 columns in our data:
namesVec <- letters[1:16]
Now we assign these names to our data.frame:
names(temp) <- namesVec
temp
Looks good!
Step3: Save the data:
write.table(temp,file="test-clean.txt",row.names = FALSE,sep = "\t",quote = FALSE)
Check if the solution is working. If it is working, than move to next step, otherwise make necessary changes.
Step4: Automating:
First we need to create a list of all the 400 files.
The easiest way (to explain also) is copy the 400 files in a directory, and then set that as working directory (using setwd).
Now first we'll create a vector with all file names:
fileNameList <- dir()
Once this is done, we'll need to function to repeat step 1 through 3:
convertFiles <- function(fileName) {
temp <- read.table(file=fileName, sep ="\t", skip = 6)
names(temp) <- namesVec
write.table(temp,file=paste("clean","test.txt",sep="-"),row.names = FALSE,sep = "\t",quote = FALSE)
}
Now we simply need to apply this function on all the files we have:
sapply(fileNameList,convertFiles)
Hope this helps!