R function to predict which csv files will not be modified

I am trying to identify which types of csv files will not be modified in the future.
There are 540 csv files in one folder, and only 518 of them get modified. Basically, I wrote code to read and prepare these files so that a Java application, run from a terminal on Linux, can modify them.
This is what terminal shows:
data_3_5.csv
Error in mapmatching or profiling!
No edge matches found for path. Too short? Sequence size 2
directory <- "/path/folder"
directory_jar <- "/path/path.jar"
setwd(directory)
file_names <-list.files(directory)
predict(file_names, model, filename="", fun=predict, ext=NULL,
const=NULL, index=1, na.rm=TRUE)
I think it fails only for files that are short. Could I just apply code that calculates the length of the columns in all csv files and reports which ones are shorter than n?

Welcome, and good job posting some code. You're pretty close, but the predict function is used for modelling. Try this instead:
directory <- "/path/folder"
directory_jar <- "/path/path.jar"
setwd(directory)
## let's add a little bit of protection to ensure we are only getting csvs
file_names <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
## ^ ok so the above gives us all the filenames, but we haven't read them in yet...
## so let's create a function that reads the files in and counts how many columns in each.
library(tidyverse)
## if the above fails, run install.packages("tidyverse")
## let's create a function that will open the csv file and read the number of columns for each.
openerFun <- function(x){ ## here x is the input, or the path
openedFile <- read.csv(x, stringsAsFactors = FALSE) ## open the file
numCols <- ncol(openedFile) ## Count columns
tibble(name = x, numCols = numCols) ## output the file with the # columns
}
## and now let's call it with map; map_dfr is better because it returns one nice data frame!
map_dfr(file_names, openerFun)
Once you have that, you can use it to compare against which files failed... hopefully that will help!
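Since the terminal message points at very short inputs ("Sequence size 2"), it may be more useful to count rows than columns and flag the files that fall below a threshold. Below is a minimal base-R sketch of that idea. The helper name `flag_short_files` and the threshold `min_rows` are assumptions made for illustration, and the example generates its own small files in a temporary folder rather than using the real data.

```r
# Hypothetical helper: count the data rows in every csv file of a folder
# and flag the files with fewer than `min_rows` rows.
flag_short_files <- function(directory, min_rows = 3) {
  paths <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  n_rows <- vapply(paths, function(p) nrow(read.csv(p)), integer(1))
  data.frame(name = basename(paths),
             numRows = unname(n_rows),
             tooShort = unname(n_rows) < min_rows)
}

# Small self-contained demonstration in a temporary folder
short_demo_dir <- file.path(tempdir(), "short_file_demo")
dir.create(short_demo_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:2, y = 3:4),
          file.path(short_demo_dir, "data_short.csv"), row.names = FALSE)
write.csv(data.frame(x = 1:10, y = 11:20),
          file.path(short_demo_dir, "data_long.csv"), row.names = FALSE)

short_report <- flag_short_files(short_demo_dir)
print(short_report)
```

The rows with `tooShort == TRUE` can then be compared against the file names the terminal reported as failing, to check whether length really is what separates the two groups.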

Related

Filter CSV files for specific value before importing

I have a folder with thousands of comma delimited CSV files, totaling dozens of GB. Each file contains many records, which I'd like to separate and process separately based on the value in the first field (for example, aa, bb, cc, etc.).
Currently, I'm importing all the files into a dataframe and then subsetting in R into smaller, individual dataframes. The problem is that this is very memory intensive - I'd like to filter the first column during the import process, not once all the data is in memory.
This is my current code:
setwd("E:/Data/")
files <- list.files(path = "E:/Data/",pattern = "*.csv")
temp <- lapply(files, fread, sep=",", fill=TRUE, integer64="numeric",header=FALSE)
DF <- rbindlist(temp)
DFaa <- subset(DF, V1 =="aa")
If possible, I'd like to move the "subset" process into lapply.
Thanks
1) read.csv.sql This will read a file directly into a temporarily set up SQLite database (which it does for you) and then only read the aa records into R. The rest of the file will not be read into R at any time. The table will then be deleted from the database.
File is a character string that contains the file name (or pathname if not in the current directory). Other arguments may be needed depending on the format of the data.
library(sqldf)
read.csv.sql(File, "select * from file where V1 == 'aa'", dbname = tempfile())
2) grep/findstr Another possibility is to use grep (Linux) or findstr (Windows) to extract the lines with aa. That should get you the desired lines plus possibly a few others, and at that point you have a much smaller input, so it can be subset in R without memory problems. For example,
fread(cmd = "findstr aa File")[V1 == 'aa'] # Windows
fread(cmd = "grep aa File")[V1 == 'aa'] # Linux
sed or gawk could also be used and are included with Linux and in Rtools on Windows.
3) csvfix The free csvfix utility is available on all platforms that R supports and can be used to select field values -- there also exist numerous other similar utilities such as csvkit, csvtk, miller and xsv.
The line below says to return only lines for which the first comma separated field equals aa. This line may need to be modified slightly depending on the cmd line or shell processor used.
fread(cmd = "csvfix find -if $1==aa File") # Windows
fread(cmd = "csvfix find -if '$1'==aa File") # Linux bash
setwd("E:/Data/")
files <- list.files(path = "E:/Data/", pattern = "\\.csv$")
temp <- lapply(files, function(x) subset(fread(x, sep=",", fill=TRUE, integer64="numeric",header=FALSE), V1=="aa"))
DF <- rbindlist(temp)
Untested, but this will probably work - replace your function call with an anonymous function.
This could help but you have to expand the function:
#Function
myload <- function(x)
{
y <- fread(x, sep=",", fill=TRUE, integer64="numeric",header=FALSE)
y <- subset(y, V1 =="aa")
return(y)
}
#Apply
temp <- lapply(files, myload)
If you don't want to muck with SQL, consider using the skip and nrows arguments in a loop. Slower, but that way you read in a block of lines, filter them, then read in the next block of lines (to the same temp variable so as not to take extra memory), etc.
Inside your lapply call, either a second lapply or equivalently
for (jj in 0:N) { ## N + 1 blocks of 1000 lines; choose N so the last block still starts inside the file
foo <- fread(filename, skip = jj*1000, nrows = 1000, sep=",", fill=TRUE, integer64="numeric",header=FALSE) ## skip takes a single number of lines, nrows the block size
mydata[[jj + 1]] <- do_something_to_filter(foo) ## R lists are indexed from 1
}
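The block-wise idea can also be sketched with a base-R file connection instead of fread's skip argument. An open connection remembers its position between reads, so no line bookkeeping is needed and the loop stops by itself at the end of the file. This is only an illustration: the helper name `filter_in_chunks` is an assumption, and the demonstration file is generated on the fly.

```r
# Hypothetical helper: read `path` in blocks of `chunk_size` lines and keep
# only the rows whose first field equals `key`.
filter_in_chunks <- function(path, key, chunk_size = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break          # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, stringsAsFactors = FALSE)
    # only the matching rows of this block stay in memory
    kept[[length(kept) + 1L]] <- chunk[chunk$V1 == key, , drop = FALSE]
  }
  do.call(rbind, kept)
}

# Demonstration on a generated file with 2500 records
chunk_demo_file <- tempfile(fileext = ".csv")
keys <- rep(c("aa", "bb", "cc"), length.out = 2500)
write.table(data.frame(keys, seq_along(keys)), chunk_demo_file, sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

aa_rows <- filter_in_chunks(chunk_demo_file, "aa")
print(nrow(aa_rows))
```

For many files, the helper could be passed to lapply over the file list and the results combined afterwards. It will be slower than fread, so it mainly makes sense when memory rather than speed is the limit.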

Loop through subfolders and extract data from CSV files

I am trying to loop through all the subfolders of my wd, list their names, open 'data.csv' in each of them and extract the second and last value from that csv file.
The df would look like this :
Name_folder_1 2nd value Last value
Name_folder_2 2nd value Last value
Name_folder_3 2nd value Last value
For now, I managed to list the subfolders and each of the files (thanks to this thread: read multiple text files from multiple folders), but I struggle to implement (what I'm guessing should be) a nested loop to read and extract data from the csv files.
parent.folder <- "C:/Users/Desktop/test"
setwd(parent.folder)
sub.folders1 <- list.dirs(parent.folder, recursive = FALSE)
r.scripts <- file.path(sub.folders1)
files.v <- list()
for (j in seq_along(r.scripts)) {
files.v[j] <- dir(r.scripts[j],"data$")
}
Any hints would be greatly appreciated !
EDIT :
I'm trying the solution detailed below but there must be something I'm missing as it runs smoothly but does not produce anything. It might be something very silly, I'm new to R and the learning curve is making me dizzy :p
lapply(files, function(f) {
dat <- fread(f) # faster
dat2 <- c(basename(dirname(f)), head(dat$time, 1), tail(dat$time, 1))
write.csv(dat2, file = "test.csv")
})
Not easy to reproduce but here is my suggestion:
library(data.table)
files <- list.files("PARENTDIR", full.names = TRUE, recursive = TRUE, pattern = "\\.csv$")
lapply(files, function(f) {
dat <- fread(f) # faster
# Do whatever, get the subfolder name for example
basename(dirname(f))
})
You can simply look recursively for all CSV files in your parent directory and still get their corresponding parent folder.
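The attempt in the question's EDIT runs without error but calls write.csv inside lapply, so every iteration overwrites the same test.csv and the values returned by lapply are never collected. A sketch of one way to gather the folder name, second value and last value into a single data frame is given below. The helper name `summarise_subfolders`, the choice of the first column, and the generated demo folders are assumptions for illustration.

```r
# Hypothetical helper: for every data.csv below `parent`, return the folder
# name plus the second and last value of one column.
summarise_subfolders <- function(parent, column = 1) {
  files <- list.files(parent, pattern = "^data\\.csv$",
                      recursive = TRUE, full.names = TRUE)
  rows <- lapply(files, function(f) {
    values <- read.csv(f)[[column]]
    data.frame(folder = basename(dirname(f)),
               second = values[2],
               last = values[length(values)])
  })
  do.call(rbind, rows)   # one row per subfolder
}

# Demonstration with two generated subfolders
demo_parent <- file.path(tempdir(), "subfolder_demo")
for (name in c("run_a", "run_b")) {
  dir.create(file.path(demo_parent, name), recursive = TRUE, showWarnings = FALSE)
  start <- if (name == "run_a") 1 else 101
  write.csv(data.frame(time = seq(start, length.out = 5)),
            file.path(demo_parent, name, "data.csv"), row.names = FALSE)
}

folder_summary <- summarise_subfolders(demo_parent)
print(folder_summary)
# write the combined table once, after the loop over files has finished
write.csv(folder_summary, file.path(demo_parent, "summary.csv"), row.names = FALSE)
```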

lapply r to one column of a csv file

I have a folder with several hundred csv files. I want to use lapply to calculate the mean of one column within each csv file and save that value into a new csv file that would have two columns: Column 1 would be the name of the original file. Column 2 would be the mean value for the chosen field from the original file. Here's what I have so far:
setwd("C:/~~~~")
list.files()
filenames <- list.files()
read_csv <- lapply(filenames, read.csv, header = TRUE)
dataset <- lapply(filenames[1], mean)
write.csv(dataset, file = "Expected_Value.csv")
Which gives the error message:
Warning message: In mean.default("2pt.csv"[[1L]], ...) : argument is not numeric or logical: returning NA
So I think I have 2(at least) problems that I cannot figure out.
First, why doesn't r recognize that column 1 is numeric? I double, triple checked the csv files and I'm sure this column is numeric.
Second, how do I get the output file to return two columns the way I described above? I haven't gotten far with the second part yet.
I wanted to get the first part to work first. Any help is appreciated.
I didn't use lapply but have done something similar. Hope this helps!
## list the directory from which all files are to be read
directory <- "C:/mydir/"
## read all csv file names from the directory
x <- list.files(directory, pattern = "\\.csv$")
xpath <- paste0(directory, x)
## indices of the files to process; use e.g. 1:2 to handle only the first two
idx <- seq_along(x)
## create an empty data frame to collect the results
df <- NULL
## for loop to read each file and save the metric and the file name
for (i in idx)
{
file <- read.csv(xpath[i], header = TRUE, sep = ",")
first_col <- file[, 1]
## one row per file: the mean of the first column and the file name
d <- data.frame(mean = mean(first_col), filename = x[i])
df <- rbind(df, d)
}
## write all output to csv
write.csv(df, file = "C:/mydir/final.csv")
CSV file looks like below
mean filename
1999.000661 hist_03082015.csv
1999.035121 hist_03092015.csv
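The warning in the question arises because lapply(filenames[1], mean) applies mean to the file name (a character string) rather than to the data that was read. A minimal sketch of an apply-style version that reads each file and returns the two requested columns is shown below. The helper name `column_means` and the generated demo files are assumptions for illustration.

```r
# Hypothetical helper: mean of one column for every csv file in a folder.
column_means <- function(directory, column = 1) {
  paths <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  means <- vapply(paths,
                  function(p) mean(read.csv(p)[[column]], na.rm = TRUE),
                  numeric(1))
  data.frame(filename = basename(paths), mean = unname(means))
}

# Demonstration with two generated files
mean_demo_dir <- file.path(tempdir(), "mean_demo")
dir.create(mean_demo_dir, showWarnings = FALSE)
write.csv(data.frame(points = c(1, 2, 3)),
          file.path(mean_demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(points = c(10, 20)),
          file.path(mean_demo_dir, "b.csv"), row.names = FALSE)

mean_table <- column_means(mean_demo_dir)
print(mean_table)
# To save the result as in the question:
# write.csv(mean_table, "Expected_Value.csv", row.names = FALSE)
```

vapply is used instead of lapply here so that the result is already a numeric vector and a non-numeric column fails loudly instead of silently returning NA.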
Thanks for the two answers. After much review, it turns out that there was a much easier way to accomplish my goal. The csv files that I had were originally in one file. I split them into multiple files by location. At the time, I thought this was necessary to calculate mean on each type. Clearly, that was a mistake. I went to the original file and used aggregate. Code:
setwd("C:/~~")
allshots <- read.csv("All_Shots.csv", header=TRUE)
EV <- aggregate(allshots$points, list(Location = allshots$Loc), mean)
write.csv(EV, file= "EV_location.csv")
This was a simple solution. Thanks again for the answers. I'll need to get better at lapply for future projects, so they were not a waste of time.

Applying an R script to multiple files

I have an R script that reads a certain type of file (nexus files of phylogenetic trees), whose name ends in *.trees.txt. It then applies a number of functions from an R package called bGMYC, available here and creates 3 pdf files. I would like to know what I should do to make the script loop through the files for each of 14 species.
The input files are in a separate folder for each species, but I can put them all in one folder if that facilitates the task. Ideally, I would like to output the pdf files to a folder for each species, different from the one containing the input file.
Here's the script
# Call Tree file
trees <- read.nexus("L_boscai_1411_test2.trees.txt")
# To use with different species, substitute "L_boscai_1411_test2.trees.txt" by the path to each species tree
#Store the number of tips of the tree
ntips <- length(trees$tip.label[[1]])
#Apply bgmyc.single
results.single <- bgmyc.singlephy(trees[[1]], mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create the 1st pdf
pdf('results_single_boscai.pdf')
plot(results.single)
dev.off()
#Sample 50 trees
n <- sample(1:length(trees), 50)
trees.sample <- trees[n]
#Apply bgmyc.multiphylo
results.multi <- bgmyc.multiphylo(trees.sample, mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create 2nd pdf
pdf('results_boscai.pdf') # Substitute 'results_boscai.pdf' by "*speciesname.pdf"
plot(results.multi)
dev.off()
#Apply bgmyc.spec and spec.probmat
results.spec <- bgmyc.spec(results.multi)
results.probmat <- spec.probmat(results.multi)
#Create 3rd pdf
pdf('trees_boscai.pdf') # Substitute 'trees_boscai.pdf' by "trees_speciesname.pdf"
for (i in 1:50) plot(results.probmat, trees.sample[[i]])
dev.off()
I've read several posts with a similar question, but they almost always involve .csv files, refer to multiple files in a single folder, have a simpler script or do not need to output files to separate folders, so I couldn't find a solution to my specific problem.
Should I use a for loop or could I create a function out of this script and use lapply or another sort of apply? Could you provide me with sample code for your proposed solution or point me to a tutorial or another reference?
Thanks for your help.
It really depends on the way you want to run it.
If you are using linux / command line job submission, it might be best to look at
How can I read command line parameters from an R script?
If you are using GUI (Rstudio...) you might not be familiar with this, so I would solve the problem
as a function or a loop.
First, get all your file names.
files = list.files(path = "your/folder")
# Now you have list of your file name as files. Just call each name one at a time
# and use for loop or apply (anything of your choice)
And since you would need to name pdf files, you can use your file name or index (e.g. loop counter) and append it to the desired file name (e.g. paste0("single_boscai", i)).
In your case,
files = list.files(path = "your/folder", full.names = TRUE) # full paths, so read.nexus finds the files from any working directory
# Use pattern = "" if you want to do string matching, and extract
# only matching files from the source folder.
genPDF = function(input) {
# Read the file
trees <- read.nexus(input)
# Store the index (numeric)
index = which(files == input)
#Store the number of tips of the tree
ntips <- length(trees$tip.label[[1]])
#Apply bgmyc.single
results.single <- bgmyc.singlephy(trees[[1]], mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create the 1st pdf
outname = paste('results_single_boscai', index, '.pdf', sep = "")
pdf(outname)
plot(results.single)
dev.off()
#Sample 50 trees
n <- sample(1:length(trees), 50)
trees.sample <- trees[n]
#Apply bgmyc.multiphylo
results.multi <- bgmyc.multiphylo(trees.sample, mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create 2nd pdf
outname = paste('results_boscai', index, '.pdf', sep = "")
pdf(outname) # Substitute 'results_boscai.pdf' by "*speciesname.pdf"
plot(results.multi)
dev.off()
#Apply bgmyc.spec and spec.probmat
results.spec <- bgmyc.spec(results.multi)
results.probmat <- spec.probmat(results.multi)
#Create 3rd pdf
outname = paste('trees_boscai', index, '.pdf', sep = "")
pdf(outname) # Substitute 'trees_boscai.pdf' by "trees_speciesname.pdf"
for (i in 1:50) plot(results.probmat, trees.sample[[i]])
dev.off()
}
for (i in 1:length(files)) {
genPDF(files[i])
}
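The question also asks for the pdf files to go into a separate folder per species, which an index-based name does not provide. One possible way to build such output paths is sketched below. The helper name `pdf_paths`, the idea of taking the species label from the input file name, and the output root folder are all assumptions for illustration; no bGMYC function is called here.

```r
# Hypothetical helper: derive a species label from the input file name and
# build the three pdf paths inside a per-species output folder.
pdf_paths <- function(input, out_root) {
  species <- sub("\\.trees\\.txt$", "", basename(input))
  out_dir <- file.path(out_root, species)
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  c(single = file.path(out_dir, paste0("results_single_", species, ".pdf")),
    multi  = file.path(out_dir, paste0("results_", species, ".pdf")),
    trees  = file.path(out_dir, paste0("trees_", species, ".pdf")))
}

# Demonstration: only the paths are built, nothing is plotted
bgmyc_out_root <- file.path(tempdir(), "bgmyc_output")
species_paths <- pdf_paths("input/L_boscai/L_boscai_1411_test2.trees.txt",
                           bgmyc_out_root)
print(basename(species_paths))
```

Inside genPDF, calls such as pdf(species_paths["single"]) could then take the place of the index-based outname values.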

Assigning Directory as a Variable in R

I need to create a function called PollutantMean with the following arguments: directory, pollutant, and id (default 1:332).
I have most of the code written but I can't figure out how to assign my directory as a variable. My current working directory is C:/Users/User/Documents. I tried writing the variable as:
directory <- "C:/Users/User/specdata" and that didn't work.
Next I tried the following:
directory <- list.files("specdata", full.names=TRUE) and that didn't work either.
Any ideas on how to change this?
If you are trying to assign your current working directory to the variable "directory", why not take the simple route and add:
directory <- getwd()
This assigns the path of the working directory (a character string) to the variable "directory"; it does not read the files the directory contains.
I've already worked with directories as variables. I usually declare them like this:
directory<-"C://Users//User//specdata//"
To take back your example.
Then, if I want to read a specific file in this directory, I will just go like :
read.table(paste(directory,"myfile.txt",sep=""),...)
It's the same process to write in a file
write.table(res,file=paste(directory,"myfile.txt",sep=""),...)
Is this helping ?
EDIT : you can then use read.csv and it will work fine
I think you are confused by the assignment operation in R. The following line
directory <- "C:/Users/User/specdata"
assigns a string to a new object that just happened to be called directory. It has the same effect on your working environment as
elephant <- "C:/Users/User/specdata"
To change where R reads its files, use the function setwd (short for set working directory):
setwd("C:/Users/User/specdata")
You can also specify full path names to functions that read in data (like read.table). For your specific problem,
# creates a character vector with the full paths of all files ending with `csv` (i.e. all csv files)
all.specdata.files <- list.files(path = "C:/Users/User/specdata", pattern = "csv$", full.names = TRUE)
# creates a list resulting from the application of `read.csv` to
# each of these files (which may be slow!!)
all.specdata.list <- lapply(all.specdata.files, read.csv)
Then we use dplyr::bind_rows (which replaces the deprecated rbind_all) to row-bind them into one data frame.
library(dplyr)
all.specdata <- bind_rows(all.specdata.list)
Then use colMeans to determine the grand means. Not sure how to do this without seeing the data.
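To show how a directory passed as an argument is typically used inside such a function, here is a hedged sketch of PollutantMean. It assumes, without confirmation from the question, that the files are named by zero-padded id (for example "001.csv") and that each file has a column named after the pollutant. The column name "sulfate" and the generated demo files are hypothetical.

```r
# Sketch of a function that receives the folder as an argument.
# Assumption: files are named by zero-padded id, e.g. "001.csv",
# and contain a column whose name equals `pollutant`.
PollutantMean <- function(directory, pollutant, id = 1:332) {
  paths <- file.path(directory, sprintf("%03d.csv", id))
  values <- unlist(lapply(paths, function(p) read.csv(p)[[pollutant]]))
  mean(values, na.rm = TRUE)   # grand mean over all selected files
}

# Demonstration with two generated files
spec_demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(spec_demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA, 3)),
          file.path(spec_demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(5, 7)),
          file.path(spec_demo_dir, "002.csv"), row.names = FALSE)

demo_mean <- PollutantMean(spec_demo_dir, "sulfate", id = 1:2)
print(demo_mean)
```

Because the paths are built with file.path from the `directory` argument, the function works regardless of the current working directory, so no setwd call is needed.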
Assuming that the columns in each of the 300+ csv files are the same, that is, column j contains the same type of data in all files, then the following example should be of use:
# let's use a temp directory for storing the files
tmpdr <- tempdir()
# Let's create a large matrix of values and then split it into many different
# files
original_data <- data.frame(matrix(rnorm(10000L), nrow = 1000L))
# write each row to a file
for(i in seq(1, nrow(original_data), by = 1)) {
write.csv(original_data[i, ],
file = paste0(tmpdr, "/", formatC(i, format = "d", width = 4, flag = 0), ".csv"),
row.names = FALSE)
}
# get a character vector with the full path of each of the files
files <- list.files(path = tmpdr, pattern = "\\.csv$", full.names = TRUE)
# read each file into a list
read_data <- lapply(files, read.csv)
# bind the read_data into one data.frame,
read_data <- do.call(rbind, read_data)
# check that our two data.frames are the same.
all.equal(read_data, original_data)
# [1] TRUE

Resources