lapply in R to one column of a csv file

I have a folder with several hundred csv files. I want to use lapply to calculate the mean of one column within each csv file and save those values into a new csv file with two columns: column 1 would be the name of the original file, and column 2 would be the mean value of the chosen field from that file. Here's what I have so far:
setwd("C:/~~~~")
list.files()
filenames <- list.files()
read_csv <- lapply(filenames, read.csv, header = TRUE)
dataset <- lapply(filenames[1], mean)
write.csv(dataset, file = "Expected_Value.csv")
Which gives the error message:
Warning message: In mean.default("2pt.csv"[[1L]], ...) : argument is not numeric or logical: returning NA
So I think I have (at least) two problems that I cannot figure out.
First, why doesn't R recognize that column 1 is numeric? I double- and triple-checked the csv files and I'm sure this column is numeric.
Second, how do I get the output file to return two columns the way I described above? I haven't gotten far with the second part yet.
I wanted to get the first part to work first. Any help is appreciated.

I didn't use lapply but have done something similar. Hope this helps!
## create an empty data frame to collect results
df <- NULL
## directory from which all files are to be read
directory <- "C:/mydir/"
## read all csv file names from the directory
x <- list.files(directory, pattern = "\\.csv$")
## subset x here (e.g. x[1:2]) if you only need some of the files
xpath <- paste0(directory, x)
## loop to read each file and save the metric and the file name
for (i in seq_along(xpath)) {
  file <- read.csv(xpath[i], header = TRUE)
  first_col <- file[, 1]
  d <- data.frame(mean = mean(first_col), filename = x[i])
  df <- rbind(df, d)
}
## write all output to csv
write.csv(df, file = "C:/mydir/final.csv", row.names = FALSE)
The resulting CSV file looks like this:
mean filename
1999.000661 hist_03082015.csv
1999.035121 hist_03092015.csv

Thanks for the two answers. After much review, it turns out that there was a much easier way to accomplish my goal. The csv files I had originally came from a single file, which I had split into multiple files by location. At the time, I thought this was necessary to calculate the mean for each type. Clearly, that was a mistake. I went back to the original file and used aggregate. Code:
setwd("C:/~~")
allshots <- read.csv("All_Shots.csv", header=TRUE)
EV <- aggregate(allshots$points, list(Location = allshots$Loc), mean)
write.csv(EV, file= "EV_location.csv")
This was a simple solution. Thanks again for the answers. I'll need to get better at lapply for future projects, so they were not a waste of time.
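For anyone who does need the per-file approach, here is a minimal lapply-based sketch of the original goal (the column index and output file name are assumptions):
## read each csv, take the mean of its first column, and pair it with the file name
filenames <- list.files(pattern = "\\.csv$")
means <- unlist(lapply(filenames, function(f) {
  dat <- read.csv(f, header = TRUE)
  mean(dat[[1]], na.rm = TRUE)  ## column 1 assumed numeric
}))
write.csv(data.frame(filename = filenames, mean = means),
          "Expected_Value.csv", row.names = FALSE)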

Related

R Function to predict which csv file would not be modified

I am trying to identify which types of csv files would not be modified in the future.
There are 540 csv files in one folder, and only 518 are modified. Basically, I wrote code to read and prepare these files to be modified by a Java application, and they are modified by running it from a terminal on Linux.
This is what terminal shows:
data_3_5.csv
Error in mapmatching or profiling!
No edge matches found for path. Too short? Sequence size 2
directory <- "/path/folder"
directory_jar <- "/path/path.jar"
setwd(directory)
file_names <-list.files(directory)
predict(file_names, model, filename="", fun=predict, ext=NULL,
const=NULL, index=1, na.rm=TRUE)
I think it fails only for those files that are too short? Maybe I could just apply code that calculates the length of all columns in all csv files and flags those smaller than some n?
Welcome, and good job posting some code. You're pretty close; the predict function is used in modelling, though. Try this:
directory <- "/path/folder"
directory_jar <- "/path/path.jar"
setwd(directory)
## a little protection to ensure we are only getting csvs
file_names <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
## ^ ok so the above gives us all the filenames, but we haven't read them in yet...
## so let's create a function that reads the files in and counts how many columns are in each.
library(tidyverse)
## if the above fails, run install.packages("tidyverse")
## a function that opens a csv file and reports its number of columns
openerFun <- function(x){ ## here x is the input, or the path
  openedFile <- read.csv(x, stringsAsFactors = FALSE) ## open the file
  numCols <- ncol(openedFile) ## count the columns
  tibble(name = x, numCols = numCols) ## output the file name with the # of columns
}
## now call it with map -- or rather map_dfr, which is better because it returns a nice data frame
map_dfr(file_names, openerFun)
Once you have that, you can use it to compare against which files failed... hopefully that will help!
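For example, a minimal follow-up sketch to flag files that look too thin (the two-column threshold is an assumption):
col_counts <- map_dfr(file_names, openerFun)
## hypothetical check: files with fewer than 2 columns are suspects
suspects <- dplyr::filter(col_counts, numCols < 2)
suspects$name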

How to call several variables in a for loop in R?

I have several .csv files of data stored in a directory, and I need to import all of them into R.
Each .csv has two columns when imported into R. However, the 1001st row needs to be stored as a separate variable for each of the .csv files (it corresponds to an expected value that was stored there during the simulation; I want it to be outside of the main data).
So far I have the following code to import my .csv files as matrices.
#Load all .csv files in the directory into a list
dataFiles <- list.files(pattern="*.csv")
for (i in dataFiles) {
  #read each of the csv files
  name <- gsub("-", ".", i)
  name <- gsub(".csv", "", name)
  i <- paste(".\\", i, sep="")
  assign(name, read.csv(i, header=T))
}
This produces several matrices with the naming convention "sim_data_L_mu" where L and mu are parameters from the simulation. How can I remove the 1001st row (which has a number in the first column, and the second column is null) from each matrix and store it as a variable named "sim_data_L_mu_EV"? The main problem I have is that I do not know how to call all of the newly created matrices in my for loop.
Couldn't post long code in comments so am writing here:
# Use a dialog to select the folder
# Full names are required to access files that are not in the current working directory
file_list <- list.files(path = choose.dir(), pattern = "*.csv", full.names = TRUE)
big_list <- lapply(file_list, function(z) {
  df <- read.csv(z)
  scalar <- df[1001, 1]  ## the 1001st data row holds the expected value
  df <- df[-1001, ]      ## drop it from the main data
  list(data = df, scalar = scalar)
})
To access the scalar value from the third file, you can use
big_list[[3]]$scalar
The elements in big_list follow the order of file_list so you always know which file the data comes from.
If you use data.table::fread() instead of read.csv, you can play around with assigning column names, selecting which rows/columns to read, etc. It is also considerably faster for large data files.
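A minimal fread sketch under the same 1001-row assumption:
library(data.table)
## read only the first 1000 data rows (line 1 of the file is the header)
dt <- fread(file_list[1], nrows = 1000)
## skip the header plus 1000 data rows to read just the 1001st data row
ev <- fread(file_list[1], skip = 1001, header = FALSE, nrows = 1)[[1]]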
Hope this helps!

Replace values within dataframe with filename while importing using read.table (R)

I am trying to clean up some data in R. I have a bunch of .txt files: each .txt file is named with an ID (e.g. ABC001), and there is a column (let's call it ID_Column) in the .txt file that contains the same ID. Each file has five rows or fewer (some files have missing data). However, some of the files have incorrect or missing IDs (e.g. ABC01). Here's an image of what each file looks like:
https://i.stack.imgur.com/lyXfV.png
What I am trying to do here is to import everything AND replace the ID_Column with the filename (which I know to all be correct).
Is there any way to do this easily? I think this can probably be done with a for loop but I would like to know if there is any other way. Right now I have this:
all_files <- list.files(pattern=".txt")
data <- do.call(rbind, lapply(all_files, read.table, header=TRUE))
So, basically, I want to know if it is possible to use lapply (or any other function) to replace data$ID_Column with the filenames in all_files. I am having trouble as each filename is only represented once in all_files, while each ID_Column in data is represented 5 times (but not always, due to missing data). I think the solution is to create a function and call it within lapply, but I am having trouble with that.
Thanks in advance!
I would just make a function that uses read.table and adds the file's name as a column.
all_files <- list.files(pattern="\\.txt$")
data <- do.call(rbind, lapply(all_files, function(x){
  a <- read.table(x, header=TRUE)
  ## overwrite the (possibly wrong) ID with the file name;
  ## if the ID should not include ".txt", use sub("\\.txt$", "", x) instead
  a$ID_Column <- x
  return(a)
}))

Get summary / plot for a column out of a folder (~3000 csv files) in R

I'm a student from Germany. I want to create a summary (0.25 & 0.75 quantiles, mean, min, max) and different plots for particular columns (e.g. Inflow or Low).
The issue is that there is not just one .csv file; there are about 3200 files in that folder, with different names (ISIN numbers of portfolios, all starting with DE000LS9xxx).
After looking through different platforms and this forum, I tried several approaches. My last attempt was to rename every file 001.csv, 002.csv, etc. and use an answer from this forum:
directory <- setwd("~/Desktop/Uni/paper/testdata/")
Inflowmean <- function(directory, Inflow, id = 1:3) {
filenames <- sprintf("%03d.csv", id)
filenames <- paste(directory, filenames, sep=";", dec=",")
ldf <- lapply(filenames, read.csv)
df=ldply(ldf)
summary(df[, Inflow], na.rm = TRUE)
}
I really hope that you can help me, since I'm new and have just started to learn R commands in RStudio. It seems I'm not able to handle it; I have also tried different tutorials and the help function in the program...
Thank you so much!
It is rather unclear what your question actually is, but there are a number of problems with your code:
directory <- setwd("~/Desktop/Uni/paper/testdata/"): See ?setwd - it returns the current directory before changing the working directory, not ~/Desktop/Uni/paper/testdata/. You probably want
directory <- "~/Desktop/Uni/paper/testdata/"
setwd(directory)
filenames <- paste(directory, filenames, sep=";", dec=",") -- this will create filenames like "~/Desktop/Uni/paper/testdata/;001.csv;,". You probably want the separator to be / or .Platform$file.sep. I don't know why you have dec="," there, but paste will just paste it onto the end. Try pasting a few things together to see what gives you file names that make sense for your data.
Your ldply syntax is wrong: you probably want
ldply(ldf, function (x) summary(x[, Inflow], na.rm=T))
See ?ldply for more information. Also, to use ldply, you need library(plyr) somewhere. If you just want base R, you could try
do.call(rbind, lapply(ldf, function (x) summary(x[, Inflow], na.rm=T)))
Where the lapply applies your function (summary(x[, Inflow], na.rm=T)) to each of your dataframes, and do.call(rbind, ...) just joins all the summaries together into a single dataframe.
Drawing on "Using R to list all files with a specified extension" and "Opening all files in a folder, and applying a function":
filenames <- list.files("~/Desktop/Uni/paper/testdata", pattern="*.csv", full.names=TRUE)
ldf <- lapply(filenames, read.csv)
res <- lapply(ldf, summary)
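To get exactly the statistics asked for (0.25/0.75 quantiles, mean, min, max), a minimal sketch assuming each file has an Inflow column:
stats <- do.call(rbind, lapply(ldf, function(df) {
  v <- df[["Inflow"]]
  data.frame(q25  = unname(quantile(v, 0.25, na.rm = TRUE)),
             mean = mean(v, na.rm = TRUE),
             q75  = unname(quantile(v, 0.75, na.rm = TRUE)),
             min  = min(v, na.rm = TRUE),
             max  = max(v, na.rm = TRUE))
}))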

Building a mean across several csv files

I have an assignment on Coursera and I am stuck - I do not necessarily need or want a complete answer (as this would be cheating) but a hint in the right direction would be highly appreciated.
I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that will take three arguments: the location of the files, the name of the column you want to calculate the mean (inside the data frames) and the files you want to use in the calculation (id).
I have tried to keep it as simple as possible:
pm <- function(directory, pollutant, id = 1:332) {
  setwd("C:/Users/cw/Documents")
  setwd(directory)
  files <<- list.files()
First of all, set the wd and get a list of all files.
  x <- id[1]
  x
Get the starting point of the user-specified ID.
Problem:
  for (i in x:length(id)) {
    df <- rep(NA, length(id))
    df[i] <- lapply(files[i], read.csv, header=T)
    result <- do.call(rbind, df)
    return(df)
  }
}
So this is where I am hitting a wall: I would need to take the user-specified input from above (e.g. 10:25) and put the content from files "010.csv" through "025.csv" into a dataframe to actually come up with the mean of one specific column.
So my idea was to run a for-loop along the length of id (e.g. 16 for 10:25) starting with the starting point of the specified id. Within this loop I would then need to take the appropriate values of files as the input for read.csv and put the content of the .csv files in a dataframe.
I can get single .csv files and put them into a dataframe, but not several.
Does anybody have a hint on how I could proceed?
Based on your example, e.g. 16 files for 10:25, i.e. 010.csv, 011.csv, 012.csv, etc.
Under the assumption that your naming convention follows the order of the files in the directory, you could try:
csvFiles <- list.files(pattern="\\.csv")[10:25] #here [10:25] ... in production, use your function parameter here
df_list <- lapply(X=csvFiles, read.csv, header=TRUE)
names(df_list) <- csvFiles #OPTIONAL: name the list elements (and hence the row names later) after the csv files
df <- do.call("rbind", df_list)
mean(df[ ,"columnName"])
These code snippets should be easy to adapt and incorporate into your routine.
You can aggregate your csv files into one big table like this:
bigtable <- NULL
for (i in 100:250) {
  infile <- paste0("C:/Users/cw/Documents/", sprintf("%03d", i), ".csv")
  newtable <- read.csv(infile)
  newtable <- cbind(newtable, rep(i, dim(newtable)[1])) # so you can identify tables after they are aggregated
  bigtable <- rbind(bigtable, newtable)
}
(you will have to replace 100:250 with the user-specified input).
Then, calculating what you want shouldn't be very hard.
The sprintf("%03d", i) formatting takes care of files 001 to 099, which a plain paste of i would miss because of the leading zeros.
Why do you have lapply inside a for loop? Just do lapply(files[files %in% sprintf("%03d.csv", id)], read.csv, header=TRUE).
They should also teach you to never use <<-.
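Putting those hints together, a minimal sketch of the whole function (argument names follow the question; treat it as a starting point, not the graded answer):
pm <- function(directory, pollutant, id = 1:332) {
  files <- file.path(directory, sprintf("%03d.csv", id))
  dat <- do.call(rbind, lapply(files, read.csv, header = TRUE))
  mean(dat[[pollutant]], na.rm = TRUE)
}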
