Add filename as column header when combining multiple files - r

I am combining 1 column of data from multiple/many text files into a single CSV file. This part is fine with the code I have. However, I would like to have after importing the filename (e.g., "roth_Aluminusa_E1.0.DPT") as the column header for the data column taken from that file. I know, similar questions have been asked but I can't work it out. Thanks for any help :-)
Code I am using which works for combining the files:
files3 <- list.files()
WAVELENGTH <- read.table(files3[1], header=FALSE, sep=",")[ ,1]
TEST9 <- do.call(cbind,lapply(files3,function(fn)read.table(fn, header=FALSE, sep = ",")[ , 2]))
TEST10 <- cbind(WAVELENGTH, TEST9)

You can do the following to add column names to TEST10. This assumes the column name you want for the first column is files3[1]
colnames(TEST10) <- c(files3[1], files3)
In case you want to keep the name of the first column as is, then we add the desired column names before binding WAVELENGTH with TEST9.
colnames(TEST9) <- files3
TEST10 <- cbind(WAVELENGTH, TEST9)
Then you can write to a csv as usual, keeping the column names as headers in the resulting file.
write.csv(TEST10, file = "TEST10.csv", row.names = FALSE)

Related

How to import a CSV with a last empty column into R?

I wrote an R script to make some scientometric analyses of Journal Citation Report data (JCR), which I have been using and updating in the past years.
Today, Clarivate has just introduced some changes in its database and now the exported CSV file contains one last empty column, which spoils my script. Because of this last empty column, read.csv automatically assumes that the first column contains the row names.
As before, there is also one first useless row, which is automatically removed in my script with skip = 1.
One simple solution to this "empty column situation" would be to manually remove this last column in Excel, and then proceed with my script as usual.
However, is there a way to add this removal to my script using base R?
The beginning of my script is:
jcreco = read.csv("data/jcr ecology 2020.csv",
na = "n/a", skip = 1, header = T)
The original CSV file downloaded from JCR is available in my Dropbox.
Could you please help me? Thank you!
The real problem is that empty column doesn't have a header. If they had only had the extra comma at the end of the header line this probably wouldn't be as messy. But you can also do a bit of column shuffling with fill=TRUE. For example
dd <- read.table("~/../Downloads/jcr ecology 2020.csv", sep=",",
skip=2, fill=T, header=T, row.names=NULL)
names(dd)[-ncol(dd)] <- names(dd)[-1]
dd <- dd[,-ncol(dd)]
This reads in the data but puts the rows names in the data.frame and fills the last column with NA. Then you shift all the column names over to the left and drop the last column.
Here is a way.
Read the data as text lines;
Discard the first line;
Remove the end comma with sub;
Create a text connection;
And read in the data from the connection.
The variable fl holds the file, on my disk I had to set the directory.
fl <- "jcr_ecology_2020.csv"
txt <- readLines(fl)
txt <- txt[-1]
txt <- sub(",$", "", txt)
con <- textConnection(txt)
df1 <- read.csv(con)
close(con)
head(df1)

Load header for data.frame from file

I have two files. One file (csv) contains data, and second contains header for data (in one column). I need to unite both files and get data.frame with data from first file and header from second file. How it can be done?
Reduced sample. Data file:
10;21;36
7;56;543
7;7;7
7890;1;1
Header file:
height
weight
light
I need data.frame as from csv file:
height;weight;light
10;21;36
7;56;543
7;7;7
7890;1;1
You could use the col.names argument in read.table() to read the header file as the column names in the same call used to read the data file.
read.table(datafile, sep = ";", col.names = scan(headerfile, what = ""))
As #chinsoon12 shows in the comments, readLines() could also be used in place of scan().
We can read both the datasets with header=FALSE and change the column names with the first column of second dataset.
df1 <- read.csv("firstfile.csv", sep=";", header=FALSE)
df2 <- read.csv("secondfile.csv", header=FALSE)
colnames(df1) <- as.character(df2[,1])

Import multiple .txt files into R and skip to actual data rows

I have 537 .txt files which I need to import into either a list or separate data frames in R. I do not want to append any data as it is crucial to keep everything separate.
I've renamed each file, so the file names are all uniform. In each file, there is a header section with a lot of miscellaneous information. This header section is 12-16 rows depending on the file. For the data, I have between 5 and 7 columns. The data is all tab delimited. The number of columns varies between 5 and 9 columns, and the columns are not always in the same order, so it is important that I can import the column names with the data (column names are uniform across files). The format of the file is as follows:
Header
Header
Header
Header...up to 16 rows
((number of spaces between header and column names varies))
Date(\t)Time(\t)dataCol1(\t)dataCol2(\t)dataCol3(\t)dataCol4
((no empty row between column names and units))
mm/dd/yyyy(\t)hh:mm:ss(\t)units(\t)units(\t)units(\t)units
((1 empty row between units and data))
01/31/2016(\t)14:32:02(\t)14.9(\t)25.3(\t)15.8(\t)25.6
((data repeats for up to 4000 rows))
To recap what I need:
Import all of the files into individual data frames or a lists of data frames.
Skip past the header information to the row with "Date" (and possibly delete the two rows following with units and the empty row) leaving me with a row of column names and the data following.
Here's a crude copy of what I have been working on for code. The idea is, after importing all of the files into R, determine the max value for 1-2 columns in each file. Then, export a single file which will have 1 row for each file with 2 columns containing the 2 max values from each file.
##list files and create list for data.frames
path <- list.files("Path",pattern = NULL, all.files=FALSE,full.names=TRUE)
files <- list()
##Null list for final data to be extracted to
results <- NULL
##add names to results list (using file name - extension
results$name <- substr(basename(path),1,nchar(basename(Path))-4)
##loop to read in data files and calculate max
for(i in 1:length(path){
##read files
files[[i]] <- read.delim(path[[i]],header = FALSE, sep = "\t", skip = 18
##will have to add code:
##"if columnx exists do this; if columny exists do this"
##convert 2 columns for calculation to numeric
x.x <- as.numeric(as.character(files$columnx))
x.y <- as.numeric(as.character(files$columny))
##will have to add code:
##"if column x exists, do this....if not, "NA"
##get max value for 2 specific columns
results$max.x <- max(files$columnx)
results$max.y <- max(files$columny)
}
##add results to data frame
max <- data.frame(results)
##export to .csv
write.csv(max,file="PATH")
I know right now, my code just skips past everything into the data ( max doesn't come until much later in file, so skipping 1 or 2 lines won't hurt me), and it assumes the columns are in the same order in each file. This is horrible practice and gives me some bad results on about 5% of my data points, but I want to do this correctly. My main concern is to get the data into R in a usable format. Then, I can add the other calculations and conversions. I am new to R, and after 2 days of searching, I have not found the help I need already posted to any forum.
Assuming that the structure of the header follows a Line \n Line \n Data we can use a grep to find the line number where "mm/dd/yyyy"
As such:
system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T)
# ran.txt is an arbitrary text file I created, we will substitute
# 'ran.txt' with path[[i]] later on.
#[1] "6:mm/dd/yyyy\thh:mm:ss\tunits\tunits\tunits\tunits"
From this we can then strsplit the output into the number before the : and use that argument as the necessary value for skip.
as.numeric(strsplit(system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T),":")[[1]][1])
# [[1]][1] will specify the first element of the output of strsplit as
# in the output the hh:mm:ss also is split.
# [1] 6
As there is an empty row between our called row and the actual data we can add 1 to this and then begin reading the data.
Thusly:
##list files and create list for data.frames
path <- list.files("Path",pattern = NULL, all.files=FALSE,full.names=TRUE)
files <- list()
##Null list for final data to be extracted to
results <- NULL
##add names to results list (using file name - extension
results$name <- substr(basename(path),1,nchar(basename(Path))-4)
##loop to read in data files and calculate max
for(i in 1:length(path)){
##read files
# Calculate the number of rows to skip.
# Using Dave2e's suggestion:
header <-readLines("path[[i]]", n=20)
skip <- grep("^mm/dd/yy", header)
#Add one due to missing line
skip <- skip + 1
files[[i]] <- read.delim(path[[i]],
header = FALSE,
sep = "\t",
skip = skip)
##will have to add code:
##"if columnx exists do this; if columny exists do this"
##convert 2 columns for calculation to numeric
x.x <- as.numeric(as.character(files$columnx))
x.y <- as.numeric(as.character(files$columny))
##will have to add code:
##"if column x exists, do this....if not, "NA"
##get max value for 2 specific columns
results$max.x <- max(files$columnx)
results$max.y <- max(files$columny)
}
##add results to data frame
max <- data.frame(results)
##export to .csv
write.csv(max,file="PATH")
I think that about covers everything.
Thought I would add this here in case it helps someone else with a similar issue. #TJGorrie's solution helped solve my slightly different challenge. I have several .rad files that I need to read in, tag, and merge. The .rad files have headers that start at random rows so I needed a way to find the row with the header. I didn't need to do any additional calculations except create a tag column. Hope this helps someone in the future but thanks #TJGorrie for the awesome answer!
##list files and create list for data.frames
path <- list.files(pattern="*.rad")
files <- list()
##loop to read in data files
for(i in 1:length(path)){
# Using Dave2e's suggestion:
header <-readLines(path[[i]], n=20)
skip <- grep("Sample", header)
#Subtract one row to keep the row with "Sample" in it as the header
skip <- skip - 1
files[[i]] <- read.table(path[[i]],
header = TRUE,
fill = TRUE,
skip = skip,
stringsAsFactors = FALSE)
# Name the newly created file objects the same name as the original file.
names(files)[i] = gsub(".rad", "", (path[i]))
files[[i]] = na.omit(as.data.frame(files[[i]]))
# Create new column that includes the file name to act as a tag
# when the dfs get merged through rbind
files[[i]]$Tag = names(files)[i]
# bind all the dfs listed in the file into a single df
df = do.call("rbind",
c(files, make.row.names = FALSE))
}
##export to .csv
write.csv(df,file="PATH.csv", row.names = FALSE)

Extracting specific column of different files and put them together in one big file in R

I have 100 files, of which I want to extract the 4th column (total_volume) containing 100.000 rows and put it together in 1 big file which then contains 100 columns with each 100.000 rows. I was trying something with the following script:
setwd("/run/media/mydirectory")
library(data.table)
fileNames <- Sys.glob("*.txt.csv")
#read file in fileNames
for (fileName in fileNames) {
dataDf <- read.delim(fileName, header = FALSE)
# remove columns with only example values
dataDf <- dataDf[, -(7:14)]
# convert data frame to data table
dataDt <- data.table(dataDf)
# set column names
setnames(dataDt, c("mcs", "cell_type", "cell_number", "total_volume"))
#new file with only total volume
total_volume <- dataDt$total_volume
#export file
write.table(dataDt$total_volume, file = "total_volume20.csv")
But what I get then is that all columns get superimposed with as result a .csv file with the 4th column of only the last file. I would like the columns to be next to eachother instead of being superimposed. How could I do that?
Thanks in advance!
P.S. Obviously the overwriting thing happens because I used a loop. However, I am not sure how else to combine everything, so suggestions are very welcome!
You haven't given us a reproducible example, so I can't test this properly, but this should give you a table with one column for total volume from each of the files you get from the call to Sys.glob(). The idea is to make a function that does what you want for one file; use lapply() to make a list with the results of that function for each file in your target environment; then cbind the columns in that list into one big table.
setwd("/run/media/mydirectory")
library(data.table)
fileNames <- Sys.glob("*.txt.csv")
# For the function, I'm reproducing your code. You could do in fewer lines and without
# data.table if you like, but maybe there's a reason you chose this approach.
extractor <- function(fileName) {
require(data.table)
dataDf <- read.delim(fileName, header = FALSE)
dataDf <- dataDf[, -(7:14)]
dataDt <- data.table(dataDf)
setnames(dataDt, c("mcs", "cell_type", "cell_number", "total_volume"))
total_volume <- dataDt$total_volume
return(total_volume)
}
total.list <- lapply(fileNames, extractor)
total.table <- Reduce(cbind, total.list)
write.table(total.table, file = "total_volume20.csv")
Or do that last bit in one line if you like:
write.table(Reduce(cbind, lapply(Sys.glob("*.txt.csv"), extractor)), file="total_volume20.csv")

lapply r to one column of a csv file

I have a folder with several hundred csv files. I want to use lappply to calculate the mean of one column within each csv file and save that value into a new csv file that would have two columns: Column 1 would be the name of the original file. Column 2 would be the mean value for the chosen field from the original file. Here's what I have so far:
setwd("C:/~~~~")
list.files()
filenames <- list.files()
read_csv <- lapply(filenames, read.csv, header = TRUE)
dataset <- lapply(filenames[1], mean)
write.csv(dataset, file = "Expected_Value.csv")
Which gives the error message:
Warning message: In mean.default("2pt.csv"[[1L]], ...) : argument is not numeric or logical: returning NA
So I think I have 2(at least) problems that I cannot figure out.
First, why doesn't r recognize that column 1 is numeric? I double, triple checked the csv files and I'm sure this column is numeric.
Second, how do I get the output file to return two columns the way I described above? I haven't gotten far with the second part yet.
I wanted to get the first part to work first. Any help is appreciated.
I didn't use lapply but have done something similar. Hope this helps!
i= 1:2 ##modify as per need
##create empty dataframe
df <- NULL
##list directory from where all files are to be read
directory <- ("C:/mydir/")
##read all file names from directory
x <- as.character(list.files(directory,,pattern='csv'))
xpath <- paste(directory, x, sep="")
##For loop to read each file and save metric and file name
for(i in i)
{
file <- read.csv(xpath[i], header=T, sep=",")
first_col <- file[,1]
d<-NULL
d$mean <- mean(first_col)
d$filename=x[i]
df <- rbind(df,d)
}
###write all output to csv
write.csv(df, file = "C:/mydir/final.csv")
CSV file looks like below
mean filename
1999.000661 hist_03082015.csv
1999.035121 hist_03092015.csv
Thanks for the two answers. After much review, it turns out that there was a much easier way to accomplish my goal. The csv files that I had were originally in one file. I split them into multiple files by location. At the time, I thought this was necessary to calculate mean on each type. Clearly, that was a mistake. I went to the original file and used aggregate. Code:
setwd("C:/~~")
allshots <- read.csv("All_Shots.csv", header=TRUE)
EV <- aggregate(allshots$points, list(Location = allshots$Loc), mean)
write.csv(EV, file= "EV_location.csv")
This was a simple solution. Thanks again or the answers. I'll need to get better at lapply for future projects so they were not a waste of time.

Resources