New variable from list.file - r

I am creating an object by listing all .csv files in a directory, reading them in according to some specifications, and merging them.
Before merging them, I want to take the first two letters of each file name and add a new column to each table that holds those two letters.
I got this far:
temp = list.files(pattern="*.csv")
myfiles = lapply(temp, function(x) read.csv(x,
header=TRUE,
#sep=";",
stringsAsFactors=F,
encoding = "UTF-8",
na.strings = c("NA",""),
colClasses=c("code"="character")))
myfiles.final = do.call(rbind, myfiles)
When I try to create the new variable, though, the replacement I generate has double the rows of the data:
temp.2 <- lapply(temp, function(x) substr(x, start = 1, stop = 2))
myfiles.2 = lapply(myfiles,
function(x){
a <- temp.2[seq_along(myfiles)]
x$identifier <- rep(a,nrow(x))
return(x)
})
In the folder the files are named, for example, AA029893.csv, BB024593.csv, and so on. For the first table I just want a new column called "identifier" that has "AA" for all entries, for the second "BB", and so on.
Thanks a lot

lapply is good for iterating along 1 list (e.g., myfiles data frames). To add a column to each data frame, you want to iterate in parallel over two lists, the list of data frames and the list of names. Map does this (for an arbitrary number of lists):
myfiles.2 = Map(function(dd, nn) {dd$identifier = nn; return(dd)},
dd = myfiles, nn = temp.2)
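To make the parallel iteration concrete, here is a minimal sketch on toy data. The two small data frames are made-up stand-ins for the files read from disk, so nothing here depends on your actual CSVs.

```r
# Toy stand-ins (assumption): two file names and the data frames read from them
temp <- c("AA029893.csv", "BB024593.csv")
myfiles <- list(
  data.frame(code = c("1", "2")),
  data.frame(code = c("3", "4", "5"))
)

# First two letters of each file name
temp.2 <- substr(temp, start = 1, stop = 2)

# Map walks over both inputs in parallel: data frame i gets identifier i
myfiles.2 <- Map(function(dd, nn) { dd$identifier <- nn; dd }, myfiles, temp.2)

# Stack the results; each row carries the prefix of the file it came from
myfiles.final <- do.call(rbind, myfiles.2)
```

Because a single string is assigned to the new column, R recycles it over all rows of that data frame, so no explicit rep() is needed.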
An easier alternative is to add the column post-hoc:
myfiles.final = do.call(rbind, myfiles)
myfiles.final$identifier = rep(
substr(temp, start = 1, stop = 2),
times = sapply(myfiles, nrow) # one repeat count per file: its number of rows
)
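A small sketch of the post-hoc approach on made-up data may help show how the repeat counts line up with the rows. It assumes rep() is given one count per file through times, taken from nrow(); the toy data frames are stand-ins for the real files.

```r
# Toy stand-ins (assumption): two file names and the data frames read from them
temp <- c("AA029893.csv", "BB024593.csv")
myfiles <- list(
  data.frame(code = c("1", "2")),
  data.frame(code = c("3", "4", "5"))
)

# Stack first, then label the rows
myfiles.final <- do.call(rbind, myfiles)

# Number of rows contributed by each file (2 and 3 here)
row_counts <- sapply(myfiles, nrow)

# Repeat each prefix once per row of its file
myfiles.final$identifier <- rep(substr(temp, 1, 2), times = row_counts)
```

This relies on rbind keeping the files in list order, which it does.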
The easiest alternative is to use data.table::rbindlist or dplyr::bind_rows, either of which can add an ID column based on the names of your list (via the idcol and .id arguments, respectively). Depending on the size of your data, they might be quite a bit faster as well.
names(myfiles) = substr(temp, start = 1, stop = 2)
# dplyr version
myfiles.2 = dplyr::bind_rows(myfiles, .id = "identifier")
# data.table version
myfiles.2 = data.table::rbindlist(myfiles, idcol = "identifier")

Related

Loop through files in R and select rows by string

I have a large number of CSV files. I need to extract relevant data from each file, and compile all of the relevant data into a new file.
I have been copying/pasting the code below and changing relevant details (e.g., file name) to repeat the same process for many CSV files. After that, I use cbind()/write.xlsx() to combine all of the relevant data and write it to an excel file. I need a more efficient method to accomplish this task.
How can I:
incorporate a loop that imports a large number of CSV files (to replace #1 below)
select relevant rows based on a string instead of entering specific row numbers (to replace #2 below)
combine all of the relevant data into a single data frame with each file's data in one column
library(tidyr)
# 1 - import raw data files
file1 <- read.csv ("1.csv", header = FALSE, sep = "\n")
# 2 - select relevant rows
file1 <- as.data.frame(file1[c(41:155),])
colnames(file1) <- c("file1")
#separate components of each line from raw csv file / isolate data
temp1 <- separate(file1, file1, into = c("Text", "IntNum", "Data"), sep = "\\s")
temp1 <- temp1$Data
temp1 <- as.data.frame(temp1)
If the number of relevant rows in each file is the same, you could do it like this. Option 1 shows a solution using a loop, option 2 shows a solution using sapply.
In a first step I generate three csv-files to make the code reproducible. The start row in each file is defined by "start", the end row by "end". I then get a list with the names of these files with dir().
#make csv-files, target vector always same length (3)
set.seed(1)
for (i in 1:3) {
df <- data.frame(x = c(rep(0, sample(1:10,1)), "start",
paste0("dat", i),
"end",rep(0, sample(1:10, 1))))
write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
#get list of file names
allFiles <- dir(pattern = glob2rx("*.csv"))
Option 1 - loop
For the loop you could first initialize a result data frame ("outDF") with the number of columns set to the number of csv-files and the number of rows set to the length of the target vector in each file ("start" to "end"). You can then loop over the files and fill the data frame. The start and end rows can be indexed using which().
#initialise result data frame
outDF <- data.frame(matrix(nrow = 3, ncol = length(allFiles),
dimnames = list(NULL, allFiles)))
#loop over csv files
for (iFile in allFiles) {
idat <- read.csv(iFile, stringsAsFactors = FALSE) #read csv
outDF[, iFile] <- idat[which(idat$x == "start"):which(idat$x == "end"),]
}
Option 2 - sapply
Instead of a loop you could use sapply with a custom function to extract the relevant rows in each file. This returns a matrix which you could then transform into a dataframe.
out <- sapply(allFiles, FUN = function(x) {
idat <- read.csv(x, stringsAsFactors = FALSE)
return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})
outDF <- as.data.frame(out)
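The row selection that both options rely on can be tried on its own, without any files. This is only a sketch on a hand-made vector; it assumes that each marker appears exactly once, as in the generated example files.

```r
# Hand-made stand-in (assumption) for one file's single column
idat <- data.frame(
  x = c("0", "0", "start", "dat1", "end", "0", "0"),
  stringsAsFactors = FALSE
)

# which() turns the string match into row numbers; ":" builds the range between them
rows <- which(idat$x == "start"):which(idat$x == "end")

# With a single column, idat[rows, ] drops to a plain character vector
selected <- idat[rows, ]

# Drop the two marker rows if only the data in between is wanted
inner <- selected[-c(1, length(selected))]
```

If a marker could occur more than once in a file, which() would return several positions and the range would need extra handling.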
If the number of rows between "start" and "end" differs between files, the above options won't work. In this case you could generate a data frame by first using lapply() (similar to option 2) to generate a result list (with different lengths of the list elements) and then padding shorter lists with NAs before transforming the result into a dataframe again.
#make csv-files with target vector of different lengths (3:12)
set.seed(1)
for (i in 1:3) {
df <- data.frame(x = c(rep(0, sample(1:10,1)), "start",
rep(paste0("dat", i), sample(1:10,1)),
"end",rep(0, sample(1:10, 1))))
write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
#lapply
out <- lapply(allFiles, FUN = function(x) {
idat = read.csv(x, stringsAsFactors = FALSE)
return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})
out <- lapply(out, `length<-`, max(lengths(out)))
outDF <- as.data.frame(do.call(cbind, out)) #cbind returns a matrix, so convert it
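The padding step is the least obvious part, so here is a minimal sketch of it in isolation. The two vectors are made up and stand in for the per-file results.

```r
# Made-up per-file results of different lengths (assumption)
out <- list(
  c("start", "dat1", "end"),
  c("start", "dat2", "dat2", "end")
)

# `length<-` called as a function extends a vector and fills the new slots with NA
padded <- lapply(out, `length<-`, max(lengths(out)))

# Equal lengths now, so the vectors can be bound as columns
outDF <- as.data.frame(do.call(cbind, padded))
```

The shorter vector ends up with a trailing NA, and the resulting data frame has one column per file.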

Trying to rename multiple .csv files using data contained within the file

A machine I use spits out .csv files named by the time. But I need them named after the plate they were read from, which is contained within the file.
I created a list of files:
files <- list.files(path="", pattern="*.csv")
I then tried using a for-loop to first create a data frame from each file containing the 1st row only, then to create a variable from the relevant piece of data, (the desired name), and then renaming the files.
for(x in files)
{
y <- read.csv(x, nrow = 1, header = FALSE, stringsAsFactors = TRUE)
z <- y[2, 2]
file.rename(x, z)
}
It didn't work. After 7 hours of trying (new to R) I am here. Please give simple advice, I have basically zero R experience.
I believe the following for loop does what the question asks for, provided the new file name is the value in the second column of the first line.
If it is in a different column, change both nmax and the index in x[2] to that column number.
fls <- list.files(pattern = '\\.csv')
for(f in fls){
x <- scan(file = f, what = character(), nmax = 2, nlines = 1, sep = ',')
g <- paste0(x[2], '.csv')
file.rename(f, g)
}
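To try the idea safely before touching real data, here is a sketch that runs entirely in a temporary folder. The file name and its contents are invented for the demo; only the scan() and file.rename() calls mirror the loop above.

```r
# Work in a fresh temporary folder so no real files are renamed
demo_dir <- tempfile("rename_demo")
dir.create(demo_dir)

# Invented example (assumption): a file named by time, plate name in the 2nd field of line 1
old_path <- file.path(demo_dir, "12-30-05.csv")
writeLines(c("Plate,PlateA7,extra", "1,2,3"), old_path)

# Read only the first two comma-separated fields of the first line
first_line <- scan(old_path, what = character(), nmax = 2, nlines = 1,
                   sep = ",", quiet = TRUE)

# Build the new name in the same folder and rename
new_path <- file.path(demo_dir, paste0(first_line[2], ".csv"))
renamed <- file.rename(old_path, new_path)
```

file.rename() returns TRUE or FALSE for each file, so checking its result is a quick way to see whether the rename worked.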

Modify columns in a dataframe by using function

I'm trying to modify the columns of my data frame and their positions. I finally found a way to do that, but I want to wrap the whole process in a function, apply it to all data sets in the directory, and overwrite the original data.
kw <- as.data.frame(matrix(1:11400, ncol = 19)) # sample data: 600 rows, columns V1..V19
kw <- kw[, !(colnames(kw) %in% c("V18","V19"))] # remove the last two columns
kw$V18 <- 0; kw$V19 <- 0 # add new columns filled with zeros
kw$V1 <- kw$V1 * 1000 # scale the first column
kw <- kw[ ,c(1,18:19,2:17)] # reorder the columns
Let's say I have data sets like these in the directory:
kw<-read.table("5LSTT-test10.avgm", header = FALSE,fill=FALSE) # example which shows how I read single data
`5LSTT-test10.avgm`
.
.
.
.
`5LSTT-test10.avgm`
How can I apply this column modification to each data set separately and either overwrite it or write a new file?
Edit: output of readLines("5LSTT-test10.avgm", n = 1)
You can see 19 columns; I think this data set has 600 rows.
[1] " 9.0000E-02 0.0000E+00 2.3075E-03 -6.4467E-03 9.9866E-01 9.8648E-02 4.5981E-02 9.8004E-01 1.2359E-01 6.1175E-02 9.7701E-01 8.6662E-02 3.0034E-02 9.7884E-01 7.0891E-02 8.2247E-03 9.8564E-01 -8.7967E-11 4.3105E-02"
With "data.table" you would be able to do something like:
setcolorder(
fread(yourfile)[, c("V1", "V18", "V19") := list(V1 * 1000, 0, 0)], c(1, 18:19, 2:17))
Thus, if you really needed a function, you can do something like:
myFun <- function(infile) {
require(data.table)
write.table(
setcolorder(
fread(infile)[
, c("V1", "V18", "V19") := list(V1 * 1000, 0, 0)],
c(1, 18:19, 2:17)),
file = gsub("(.*)(\\..*)", "\\1_new\\2", infile),
row.names = FALSE)
}
You can then use myFun within lapply over a vector of the files you want to read and process.
In other words:
lapply(myListOfFilePaths, myFun)
By default, this function writes a new file (rather than overwriting the original), appending "_new" to the file name before the extension.
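The file-name rewriting can be checked on its own. This sketch applies the same gsub() pattern to the example file name from the question.

```r
# Same pattern as in myFun: group 1 is everything up to the last dot,
# group 2 is the extension (the dot and whatever follows it)
infile <- "5LSTT-test10.avgm"
outfile <- gsub("(.*)(\\..*)", "\\1_new\\2", infile)
```

To overwrite the original instead, you would pass infile itself as the file argument of write.table, at the cost of losing the raw data.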
This could be another way.
Read all the files and store them in a list like this:
# list all the files in the directory
files.new = list.files(directory.path, recursive = TRUE, pattern = "\\.avgm$")
# read all the files into a list (whitespace-separated, no header row, as in the question)
file.contents = lapply(file.path(directory.path, files.new), read.table, header = FALSE)
# name the list by file so the names can be reused when writing
names(file.contents) = files.new
Next you can apply the modifications to each data set in the list, something like this:
outlist = lapply(file.contents, function(x){
  # modifications
  kw <- x[, !(colnames(x) %in% c("V18","V19"))]
  kw$V18 <- 0
  kw$V19 <- 0
  kw$V1 <- kw$V1 * 1000
  kw[ ,c(1,18:19,2:17)] # reorder the columns and return the result
})
and write the modified data into new files using the function below:
# function to write files from a named list object
write.files = function(modified.list, path){
  # keep only the elements that have more than one column
  outlist = modified.list[sapply(modified.list, function(x) length(x) > 1)]
  invisible(sapply(names(outlist), function(x)
    write.table(outlist[[x]], file = file.path(path, x),
                sep = "\t", row.names = FALSE)))
}
Writing the data to files
write.files(outlist, "/directory/path")
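For anyone who prefers to stay in base R, here is a rough sketch of the whole read, modify, write cycle for a single file, tested on a tiny generated file in a temporary folder. The helper name process_file and the demo file are hypothetical; the column operations follow the question.

```r
# Hypothetical helper: read one file, modify its columns, write "<name>_new.<ext>"
process_file <- function(infile) {
  kw <- read.table(infile, header = FALSE)   # columns come in as V1..V19
  kw$V1 <- kw$V1 * 1000                      # scale the first column
  kw$V18 <- 0                                # replace the last two columns with zeros
  kw$V19 <- 0
  kw <- kw[, c(1, 18:19, 2:17)]              # move V18 and V19 next to V1
  outfile <- sub("(\\.[^.]*)$", "_new\\1", infile)
  write.table(kw, outfile, row.names = FALSE, col.names = FALSE)
  invisible(outfile)
}

# Demo on a generated 2-row, 19-column file (made-up data)
demo_dir <- tempfile("avgm_demo")
dir.create(demo_dir)
demo_file <- file.path(demo_dir, "demo.avgm")
write.table(matrix(1:38, ncol = 19), demo_file,
            row.names = FALSE, col.names = FALSE)

res <- read.table(process_file(demo_file), header = FALSE)
```

With a vector of real paths, lapply(paths, process_file) would apply the same steps to every file.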

R: Dynamically create a variable name

I'm looking to create multiple data frames using a for loop and then stitch them together with merge().
I'm able to create my data frames using assign(paste(), blah). But then, in the same for loop, I need to delete the first column of each of these data frames.
Here's the relevant bits of my code:
for (j in 1:3)
{
#This is to create each data frame
#This works
assign(paste(platform, j, "df", sep = "_"), read.csv(file = paste(masterfilename, extension, sep = "."), header = FALSE, skip = 1, nrows = 100))
#This is to delete first column
#This does not work
assign(paste(platform, j, "df$V1", sep = "_"), NULL)
}
In the first situation I'm assigning my variables to a data frame, so they inherit that type. But in the second situation, I'm assigning it to NULL.
Does anyone have any suggestions on how I can work this out? Also, is there a more elegant solution than assign(), which seems to bog down my code? Thanks,
n.i.
assign can be used to build variable names, but "name$V1" isn't a variable name. The $ is an operator in R, so you're trying to build a function call, and you can't do that with assign. In fact, in this case it's best to avoid assign completely. You don't need to create a bunch of different variables. If your data.frames are related, just keep them in a list.
mydfs <- lapply(1:3, function(j) {
df<- read.csv(file = paste(masterfilename, extension, sep = "."),
header = FALSE, skip = 1, nrows = 100)
df$V1<-NULL
df
})
Now you can access them with mydfs[[1]], mydfs[[2]], etc. And you can run functions over all data sets with any of the *apply family of functions.
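Here is a self-contained sketch of the list approach that runs without any files. The inline CSV text and the name prefix "platform" are placeholders for the real file and the platform variable from the question.

```r
# Placeholder CSV content (assumption): one line to skip, then two data rows
csv_text <- "skip me\n1,a,10\n2,b,20\n"

mydfs <- lapply(1:3, function(j) {
  df <- read.csv(text = csv_text, header = FALSE, skip = 1)
  df$V1 <- NULL   # drop the first column
  df
})

# Naming the list elements gives the same labels the assign() version would create
names(mydfs) <- paste("platform", 1:3, "df", sep = "_")

# Access by position or by name
second <- mydfs[["platform_2_df"]]
```

Keeping the names on the list means nothing is lost compared with separate variables, and the list can be handed to Reduce() or do.call() when it is time to combine the data frames.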
As @joran pointed out in his comment, the proper way of doing this would be using a list. But if you want to stick to assign you can replace your second statement with
assign(paste(platform, j, "df", sep = "_"),
get(paste(platform, j, "df", sep = "_"))[
2:length(get(paste(platform, j, "df", sep = "_")))])
If you wanted to use a list instead, your code to read the data frames would look like
dfs <- replicate(3,
read.csv(file = paste(masterfilename, extension, sep = "."),
header = FALSE, skip = 1, nrows = 100), simplify = FALSE)
Note you can use replicate because your call to read.csv does not depend on j in the loop. Then you can remove the first column of each
dfs <- lapply(dfs, function(d) d[-1])
Or, combining everything in one command
dfs <- replicate(3,
read.csv(file = paste(masterfilename, extension, sep = "."),
header = FALSE, skip = 1, nrows = 100)[-1], simplify = FALSE)

Reading multiple csv of same format in a data frame

I need to run the same set of code for multiple CSV files, and I want to do it with something like a macro. Below is the code that I am executing, but the results are not coming out properly: it reads the data in 2-D format, while I need it in 3-D format.
lf = list.files(path = "D:/THD/data", pattern = ".csv",
full.names = TRUE, recursive = TRUE, include.dirs = TRUE)
ds<-lapply(lf,read.table)
I don't know if this is going to be useful, but one way I do it is:
##Step 1 read files
mycsv = dir(pattern=".csv")
n <- length(mycsv)
mylist <- vector("list", n)
for(i in 1:n) mylist[[i]] <- read.csv(mycsv[i],header = T)
Then I usually just use lapply to change things, for example:
## Change column names
mylist <- lapply(mylist, function(x) {names(x) <- c("type","date","v1","v2","v3","v4","v5","v6","v7","v8","v9","v10","v11","v12","v13","v14","v15","v16","v17","v18","v19","v20","v21","v22","v23","v24","total") ; return(x)})
## change the type column to weekday/weekend
mylist <- lapply(mylist, function(x) {
f = c("we", "we", "wd", "wd", "wd", "wd", "wd")
x$type = rep(f,52, length.out = 365)
return(x)
})
and so on.
Then, after all the changes, I save the results with the following code. (It is also sometimes useful to split the original file name and store parts of it with each file, so that I can track the individual files later.)
## for example, some of my files had a pattern in the file name such as "201_E424220_N563500.csv", so I split it and store the parts as new columns like this:
mylist <-lapply(1:length(mylist), function(i) {
mylist.i <- mylist[[i]]
s = strsplit(tools::file_path_sans_ext(mycsv[i]), "_" , fixed = TRUE)[[1]] # drop ".csv" before splitting
d = cbind(mylist.i[, c("type", "date")], ID = s[1], Easting = s[2], Northing = s[3], mylist.i[, 3:ncol(mylist.i)])
return(d)
})
for(i in 1:n)
write.csv(file = paste("file", i, ".csv", sep = ""), mylist[[i]], row.names = F)
I hope this will help. When you get some time, please read about the plyr package, as I am sure it will be very useful for you; it has lots of data analysis options. plyr has apply-style functions such as:
## l_ply split list, apply function and discard result
## ldply split list, apply function and return result in data frame
## laply split list, apply function and return result in an array
For example, you can use ldply to read all your csv files and return a data frame, something like:
library(plyr)
data = ldply(list.files(pattern = ".csv"), function(fname) {
j = read.csv(fname, header = T)
return(j)
})
So here data will be your data frame containing the data from all your csv files.
Thanks, Ayan
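If installing plyr is not an option, a base R sketch of the same "read everything and stack it" idea could look like this. It swaps ldply for lapply plus do.call(rbind, ...), and the three small files are generated in a temporary folder purely for the demo.

```r
# Generate three small CSV files in a temporary folder (made-up data)
demo_dir <- tempfile("csv_demo")
dir.create(demo_dir)
for (i in 1:3) {
  write.csv(data.frame(id = i, value = c(i, i * 10)),
            file.path(demo_dir, paste0("part", i, ".csv")),
            row.names = FALSE)
}

# List the files with full paths, anchoring the pattern to the extension
csv_paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into a list, then stack the list into one data frame
combined <- do.call(rbind, lapply(csv_paths, read.csv))
```

This only works when all files share the same columns, which is also what ldply expects here.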

Resources