I have uploaded a data set called "Obtained Dataset". In this specific file, the first 16 rows name the numeric and character variables, one per row, and each of these names is the header of a column in the data, which starts from the 17th row onwards. Other files of a similar nature have fewer than 16 such name rows.
[Screenshots: Obtained dataset & Required dataset]
In the data block, the 1st column is the x-axis, the 2nd column is the y-axis, and the 3rd column is depth (these three are standard for all the files in the database); the 4th column is GR 1 LIN, the 5th column is CAL 1 LIN, and so forth, as named in the first 16 rows of the file.
Now I want R code which can convert it into the format shown in the required data set. Also, if a different data set has fewer than 16 lines of names, say GR 1 LIN and RHOB 1 LIN are missing, I still want those columns created, with NA entries for rows 1:nrow.
Currently I manage by exporting each file to Excel, manually cleaning the data and renaming the columns, saving it as csv and then reading it back with read.csv("filename"), but this is simply not possible for 400 files.
Any advice on how to proceed will be of great help.
I have noticed that you have probably posted this question before, in a different format. This is a public forum, and people are happy to help; however, it's your job to make life easier for those helping you, so you are requested to put in some effort.
Having said that, here is some code I have written to help you out.
Step 0: Creating your first data set:
sink("test.txt") # This will `sink` all the output to the file "test.txt"
# Let's start with some dummy data
cat("1\n")
cat("DOO\n")
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
# Now a 10 x 16 dummy data matrix:
cat(paste(apply(matrix(sample(160),10),1,paste,collapse = "\t"),collapse = "\n"))
cat("\n")
sink() # This will stop `sink`ing.
I have created some dummy header data in the first 6 lines, followed by a 10 x 16 data matrix.
Note: In principle you should have provided something like this, or a copy of your dataset, to help other people help you.
Step 1: Now we need to read the file, skipping the first 6 rows of unwanted info:
(temp <- read.table(file="test.txt", sep ="\t", skip = 6))
Step 2: Data cleanup:
We need a vector with the names of the 16 columns in our data:
namesVec <- letters[1:16]
Now we assign these names to our data.frame:
names(temp) <- namesVec
temp
Looks good!
Step 3: Save the data:
write.table(temp,file="test-clean.txt",row.names = FALSE,sep = "\t",quote = FALSE)
Check that the solution works. If it does, move to the next step; otherwise make the necessary changes.
Step 4: Automating:
First we need to create a list of all 400 files.
The easiest way (and the easiest to explain) is to copy the 400 files into a single directory, and then set that as the working directory (using setwd).
Now we'll create a vector with all the file names:
fileNameList <- dir()
Once this is done, we'll need a function to repeat steps 1 through 3:
convertFiles <- function(fileName) {
  temp <- read.table(file = fileName, sep = "\t", skip = 6)
  names(temp) <- namesVec
  # Use the input file's name so each file gets its own cleaned output
  write.table(temp, file = paste("clean", fileName, sep = "-"),
              row.names = FALSE, sep = "\t", quote = FALSE)
}
Now we simply need to apply this function on all the files we have:
sapply(fileNameList,convertFiles)
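Bonus: your question also asks for NA columns when a file has fewer than 16 name rows. Here is a minimal sketch of that idea; it is untested, and it assumes you have a vector fileNames with the names actually present in a given file (parsed from its header rows):
padColumns <- function(temp, fileNames, namesVec) {
  names(temp) <- fileNames      # names actually present in this file
  missing <- setdiff(namesVec, fileNames)
  temp[missing] <- NA           # an NA-filled column for each absent name
  temp[namesVec]                # reorder to the full expected layout
}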
Hope this helps!
I have been working on code that extracts four rows from every .CSV in a folder, compiles the rows into a new table, and pivots wider so there is only one entry per original .csv.
Everything works great until I try to write the resulting table to a .csv.
The table looks great in the viewer and as a tibble; if I could just get this format to export to .csv I'd be thrilled.
However, when I export it using write.csv, each column appears to be exported as a single cell which is then copied for each row. I figure the problem has something to do with exporting after the pivot_wider step.
Here is my code. Everything runs and I am not getting any errors; it just doesn't export the way I need it to.
library(tidyverse) # loads dplyr, tidyr and readr
#Create a list of files
files <- list.files(pattern = "Correlations\\.csv$")
# Create an empty object to store all combined rows
combined_df <- data.frame(material = character(),
                          code = double(),
                          measurement = double(),
                          name = character(),
                          spectra = character())
#Create empty object to temporarily collect the data produced on every run of the loop
data_it <- data.frame(matrix(ncol = 5, nrow = 4))
colnames(data_it) <- c("material", "code", "measurement", "name", "spectra")
#Run a loop
for (j in 1:length(files)){
data <- read_csv(files[j], skip = 2, col_names =c("material", "code", "measurement", "name", "spectra"))
data_it$material <- data[3:6, 1]
data_it$code <- data[3:6, 2]
data_it$measurement<- data[3:6, 3]
data_it$name <- paste(files[j])
data_it$spectra <- c("spectra 1", "spectra 2", "spectra 3", "spectra 4")
combined_df <- rbind(combined_df, data_it)
}
#pivot table so that there is one row per file
combined_pivot<-pivot_wider(combined_df,
names_from = spectra,
values_from = c(material, code, measurement),
names_vary = "slowest")
#export to .csv
write.csv(combined_pivot, file ="combined_results.csv")
Here is a sample of one of the .csv files I am trying to extract from. I have hundreds of these, with 100 rows each, and I just need the top few rows from each one:
material                         code  measurement
animal furs/natural polyamides   13    0.6777
animal furs/natural polyamides   13    0.6065
cellulose/plant fibres           14    0.5725
animal furs/natural polyamides   13    0.5698
animal furs/natural polyamides   13    0.5171
animal furs/natural polyamides   13    0.5128
animal furs/natural polyamides   13    0.4932
animal furs/natural polyamides   13    0.4904
Thanks for reading and for any suggestions. I am new to R and out of ideas.
Without the data I cannot test and fix your code directly, but I can see the fault, so take this as suggestions rather than answers.
You are generating lists (of 4 things) inside each "cell" of combined_df.
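The immediate fix for that part (a sketch, untested, based on the read_csv call in your question) is to extract plain vectors with [[ instead of taking tibble slices:
# data[3:6, 1] returns a 4-row tibble, which ends up as a list in each cell;
# data[[1]][3:6] returns a plain vector instead:
data_it$material    <- data[[1]][3:6]
data_it$code        <- data[[2]][3:6]
data_it$measurement <- data[[3]][3:6]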
Alternatively, I think you could pivot_wider inside your for loop, before you rbind, so that each file's 4 rows are reshaped before you start adding them to a "master" table.
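A rough sketch of that (untested; it assumes combined_df is initialised as NULL so the first wide row defines the columns, and that data_it holds plain vectors as above):
# Widen each file's 4 rows into a single row before appending
wide_it <- pivot_wider(data_it,
                       names_from = spectra,
                       values_from = c(material, code, measurement),
                       names_vary = "slowest")
combined_df <- rbind(combined_df, wide_it)   # one row per file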
The other way to do it (using for loops) is to have a nested loop inside your original loop that handles the 3:6 part, writing to a new line each time:
for (i in 3:6){
  # (i - 2) acts as the counter that points to a new line each time
  data_it$material[i - 2] <- data[[1]][i]
}
That is just one column as an example; extrapolate from there. The first way seems easier.
Good luck
EDIT again:
I mean, you really need a triple-nested loop:
one to loop over the files;
the next to advance your final table one row at a time as you grab the data;
and the innermost to grab the data from lines 3:6 (this one also holds the nested counter that increments the second loop).
I was able to get it to export by converting to a data.table before running the pivot:
library(data.table)  # needed for as.data.table()
combined_df2 <- as.data.table(combined_df)
Apologies if this seems simple, but I can't find a workable answer anywhere on the site.
My data is in csv files whose names are a name plus a number, so not quite as simple as a file with a generic word and an increasing number.
I've achieved exactly what I want to do with just one file, but the issue is there are a couple of hundred to do, so changing the name each time is quite tedious.
I'm posting my original single-file code here in the hope that someone may be able to ease the growing tension of failed searches.
# set workspace
getwd()
setwd(".../Desktop/R Workspace")
# bring in original file, skipping first four rows
Person_7<- read.csv("PersonRound7.csv", header=TRUE, skip=4)
# cut matrix down to 4 columns
Person7<- Person_7[,c(1,2,9,17)]
# give columns names
colnames(Person7) <- c("Time","Spare", "Distance","InPeriod")
# find the empty rows, create new subset. Take 3 rows away for empty lines.
nullrow <- (which(Person7$Spare == "Velocity"))-3
Person7 <- Person7[(1:nullrow), ]
#keep 3 needed columns from matrix
Person7<- Person7[,c(1,3,4)]
colnames(Person7) <- c("Time","Distance","InPeriod")
#convert distance and time columns to numeric
options(digits=9)
Person7$Distance <- as.numeric(as.character(Person7$Distance))
Person7$Time <- as.numeric(as.character(Person7$Time))
#Create the differences column for distance
Person7$Diff <- c(0, diff(Person7$Distance))
...whole heap of other stuff...
#export Minutes to an external file
write.csv(Person7_maxs, ".../Desktop/GPS Minutes/Person7.csv")
So the three-part issue is as follows:
1. I can create a list or vector to read through the file names, but not a dataframe for each file each time (if that's even a good way to do it).
2. The variable names throughout the code will need to change: instead of just being "Person1", "Person2", they'll be more like "Johnny1", "Lou23".
3. I need to export each resulting dataframe to its own csv file with the original name.
Taking any and all suggestions on board; struggling with this one.
Cheers!
Consider using one list of the ~200 dataframes. There is no need for separate named objects flooding the global environment (though list2env is still shown below). Hence, use lapply() to iterate through all the csv files in the working directory, then simply name each element of the list after the corresponding file's basename:
setwd(".../Desktop/R Workspace")
files <- list.files(path=getwd(), pattern=".csv")
# CREATE DATA FRAME LIST
dfList <- lapply(files, function(f) {
df <- read.csv(f, header=TRUE, skip=4)
df <- setNames(df[c(1,2,9,17)], c("Time","Spare","Distance","InPeriod"))
# ...same code referencing temp variable, df
write.csv(df_max, paste0(".../Desktop/GPS Minutes/", f))
return(df)
})
# NAME EACH ELEMENT TO CORRESPONDING FILE'S BASENAME
dfList <- setNames(dfList, gsub("\\.csv$", "", files))
# REFERENCE A DATAFRAME WITH LIST INDEXING
str(dfList$PersonRound7) # PRINT STRUCTURE
View(dfList$PersonRound7) # VIEW DATA FRAME
dfList$PersonRound7$Time # OUTPUT ONE COLUMN
# OUTPUT ALL DFS TO SEPARATE OBJECTS (THOUGH NOT NEEDED)
list2env(dfList, envir = .GlobalEnv)
I have 537 .txt files which I need to import into either a list or separate data frames in R. I do not want to append any data, as it is crucial to keep everything separate.
I've renamed each file, so the file names are all uniform. In each file there is a header section of miscellaneous information, 12-16 rows long depending on the file. The data is tab-delimited, with between 5 and 9 columns, and the columns are not always in the same order, so it is important that I can import the column names with the data (column names are uniform across files). The format of each file is as follows:
Header
Header
Header
Header...up to 16 rows
((number of spaces between header and column names varies))
Date(\t)Time(\t)dataCol1(\t)dataCol2(\t)dataCol3(\t)dataCol4
((no empty row between column names and units))
mm/dd/yyyy(\t)hh:mm:ss(\t)units(\t)units(\t)units(\t)units
((1 empty row between units and data))
01/31/2016(\t)14:32:02(\t)14.9(\t)25.3(\t)15.8(\t)25.6
((data repeats for up to 4000 rows))
To recap what I need:
Import all of the files into individual data frames or a list of data frames.
Skip past the header information to the row with "Date" (and possibly delete the two rows that follow it, with the units and the empty row), leaving me with a row of column names and the data following.
Here's a crude copy of what I have been working on. The idea is: after importing all of the files into R, determine the max value for 1-2 columns in each file, then export a single file which has one row per input file, with 2 columns containing the 2 max values from that file.
##list files and create list for data.frames
path <- list.files("Path",pattern = NULL, all.files=FALSE,full.names=TRUE)
files <- list()
##Null list for final data to be extracted to
results <- NULL
##add names to results list (using file name minus extension)
results$name <- substr(basename(path), 1, nchar(basename(path)) - 4)
##loop to read in data files and calculate max
for(i in 1:length(path)){
##read files
files[[i]] <- read.delim(path[[i]], header = FALSE, sep = "\t", skip = 18)
##will have to add code:
##"if columnx exists do this; if columny exists do this"
##convert 2 columns for calculation to numeric
x.x <- as.numeric(as.character(files[[i]]$columnx))
x.y <- as.numeric(as.character(files[[i]]$columny))
##will have to add code:
##"if column x exists, do this....if not, "NA"
##get max value for 2 specific columns
results$max.x[i] <- max(x.x)
results$max.y[i] <- max(x.y)
}
##add results to data frame
max <- data.frame(results)
##export to .csv
write.csv(max,file="PATH")
I know that right now my code just skips straight into the data (the max doesn't occur until much later in each file, so skipping a line or two extra won't hurt me), and it assumes the columns are in the same order in each file. This is horrible practice and gives me bad results on about 5% of my data points, but I want to do this correctly. My main concern is getting the data into R in a usable format; then I can add the other calculations and conversions. I am new to R, and after 2 days of searching I have not found the help I need already posted to any forum.
Assuming that the structure of the header follows a Line \n Line \n Data pattern, we can use grep to find the line number where "mm/dd/yyyy" occurs.
As such:
system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T)
# ran.txt is an arbitrary text file I created, we will substitute
# 'ran.txt' with path[[i]] later on.
#[1] "6:mm/dd/yyyy\thh:mm:ss\tunits\tunits\tunits\tunits"
From this we can strsplit the output on ":" and use the number before the first ":" as the value for skip.
as.numeric(strsplit(system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T),":")[[1]][1])
# [[1]][1] will specify the first element of the output of strsplit as
# in the output the hh:mm:ss also is split.
# [1] 6
As there is an empty row between the matched row and the actual data, we add 1 to this number and then begin reading the data.
Thusly:
##list files and create list for data.frames
path <- list.files("Path",pattern = NULL, all.files=FALSE,full.names=TRUE)
files <- list()
##Null list for final data to be extracted to
results <- NULL
##add names to results list (using file name minus extension)
results$name <- substr(basename(path), 1, nchar(basename(path)) - 4)
##loop to read in data files and calculate max
for(i in 1:length(path)){
##read files
# Calculate the number of rows to skip.
# Using Dave2e's suggestion:
header <- readLines(path[[i]], n = 20)
skip <- grep("^mm/dd/yy", header)
#Add one due to missing line
skip <- skip + 1
files[[i]] <- read.delim(path[[i]],
header = FALSE,
sep = "\t",
skip = skip)
##will have to add code:
##"if columnx exists do this; if columny exists do this"
##convert 2 columns for calculation to numeric
x.x <- as.numeric(as.character(files[[i]]$columnx))
x.y <- as.numeric(as.character(files[[i]]$columny))
##will have to add code:
##"if column x exists, do this....if not, "NA" (see the sketch after this code)
##get max value for 2 specific columns
results$max.x[i] <- max(x.x)
results$max.y[i] <- max(x.y)
}
##add results to data frame
max <- data.frame(results)
##export to .csv
write.csv(max,file="PATH")
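For the "if columnx exists" placeholders, here is one hedged sketch (columnx/columny are the question's placeholder names; it also assumes you have assigned real column names to files[[i]], since the read.delim above uses header = FALSE):
# Max of a column after numeric conversion, or NA if the column is absent
safeMax <- function(df, col) {
  if (col %in% names(df)) {
    max(as.numeric(as.character(df[[col]])), na.rm = TRUE)
  } else {
    NA
  }
}
# Inside the loop this would replace the x.x/x.y block:
# results$max.x[i] <- safeMax(files[[i]], "columnx")
# results$max.y[i] <- safeMax(files[[i]], "columny")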
I think that about covers everything.
Thought I would add this here in case it helps someone else with a similar issue. @TJGorrie's solution helped solve my slightly different challenge. I have several .rad files that I need to read in, tag, and merge. The .rad files have headers that start at random rows, so I needed a way to find the row with the header. I didn't need to do any additional calculations except create a tag column. Hope this helps someone in the future, and thanks @TJGorrie for the awesome answer!
##list files and create list for data.frames
path <- list.files(pattern="*.rad")
files <- list()
##loop to read in data files
for(i in 1:length(path)){
# Using Dave2e's suggestion:
header <-readLines(path[[i]], n=20)
skip <- grep("Sample", header)
#Subtract one row to keep the row with "Sample" in it as the header
skip <- skip - 1
files[[i]] <- read.table(path[[i]],
header = TRUE,
fill = TRUE,
skip = skip,
stringsAsFactors = FALSE)
# Name the newly created file objects the same name as the original file.
names(files)[i] = gsub("\\.rad$", "", path[i])
files[[i]] = na.omit(as.data.frame(files[[i]]))
# Create new column that includes the file name to act as a tag
# when the dfs get merged through rbind
files[[i]]$Tag = names(files)[i]
}
# After the loop, bind all the dfs in the list into a single df
df = do.call("rbind",
             c(files, make.row.names = FALSE))
##export to .csv
write.csv(df,file="PATH.csv", row.names = FALSE)
I have a data frame that has one column called activity.num. Each row (there are about 10,000 rows) contains a value between 1 and 8.
In a text file called activity.txt I have a description of the activity. The format of the file is:
1. Read1
2 Write2
...
8 Activity
My goal is to read this file and append a new column to the data frame called activity.desc with the proper description.
I managed to read in the file:
# returns a list of the activity number and description
activityList <- function() {
  con <- file("./activity.txt", open = "rt")
  data <- readLines(con)
  close(con)
  # split each line on the space; this is the function's return value
  strsplit(data, " ")
}
The resulting output is a list with each line containing a vector with the first element being the number and the second being the description.
I would be grateful if you could:
1. Comment on whether my approach is correct and efficient
2. Help me with the generation of activity.desc
Thanks.
I managed to find a solution (I'm not sure whether it can be improved upon):
# activityList() function is defined above
activityref <- activityList()
# Add a new column with the description. ydata is the original data frame.
ydata[,"activity.desc"] <- sapply(ydata[,"activity.num"], function(x) activityref[[x]][2])
Hope this helps.
I have used R for various things over the past year, but due to the number of packages and functions available, I am sadly still a beginner. I believe R would allow me to do what I want with minimal code, but I am struggling.
What I want to do:
I have roughly a hundred different Excel files containing data on students. Each Excel file represents a different school but contains the same variables. I need to:
Import the data into R from Excel
Add a variable to each file containing the filename
Merge all of the data (adding observations/rows; no need to match on variables)
I will need to do this for multiple sets of data, so I am trying to make this as simple and easy to replicate as possible.
What the Data Look Like:
Row 1 Title
Row 2 StudentID Var1 Var2 Var3 Var4 Var5
Row 3 11234 1 9/8/2011 343 159-167 32
Row 4 11235 2 9/16/2011 112 152-160 12
Row 5 11236 1 9/8/2011 325 164-171 44
Row 1 is meaningless and Row 2 contains the variable names. The files have different numbers of rows.
What I have so far:
At first I simply tried to import the data from Excel. Using the xlsx package, this works nicely:
dat <- read.xlsx2("FILENAME.xlsx", sheetIndex=1,
sheetName=NULL, startRow=2,
endRow=NULL, as.data.frame=TRUE,
header=TRUE)
Next, I focused on figuring out how to merge the files (I also thought this is where I should add the filename variable to the data files). This is where I got stuck.
setwd("FILE_PATH_TO_EXCEL_DIRECTORY")
filenames <- list.files(pattern=".xls")
do.call("rbind", lapply(filenames, read.xlsx2, sheetIndex=1, colIndex=6, header=TRUE, startrow=2, FILENAMEVAR=filenames));
I set my directory, make a list of all the Excel file names in the folder, and then try to merge them in one statement, using a variable for the filenames.
When I do this I get the following error:
Error in data.frame(res, ...) :
arguments imply differing number of rows: 616, 1, 5
I know there is a problem with my application of lapply: startrow is not being recognized as an option, and FILENAMEVAR is trying to merge the list of 5 sample filenames as opposed to adding a column containing each filename.
What next?
If anyone can refer me to a useful resource or function, critique what I have so far, or point me in a new direction, it would be GREATLY appreciated!
I'll post my comment as an answer (with bdemerast picking up on the typo). The solution is untested, as xlsx will not run happily on my machine.
You need to pass a single FILENAMEVAR to read.xlsx2 (and note the argument is startRow, not startrow).
lapply(filenames, function(x) read.xlsx2(file=x, sheetIndex=1, colIndex=6, header=TRUE, startRow=2, FILENAMEVAR=x))
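From there, to finish the original goal, a sketch (also untested, for the same reason): FILENAMEVAR is not a documented read.xlsx2 argument, so it is safer to add the filename as an ordinary column and then bind everything. I have dropped colIndex so all columns are read; keep it if you really want only column 6.
dfs <- lapply(filenames, function(x) {
  df <- read.xlsx2(file=x, sheetIndex=1, header=TRUE, startRow=2)
  df$filename <- x   # tag each row with its source file
  df
})
combined <- do.call(rbind, dfs)   # stack all schools into one data frame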