I am new in R programming language. As a function below, I return a data frame but output always shows the "length" name instead of indexes. Can somebody advice, please.
The indicates appear if it is more than 2.
My expected result is to show 1, 2, 3
complete <- function(directory, id = 1:322){
#set working directory
setwd(directory)
#list all csv files in the working dir and save to listScvFile variable
listCsvFile <- list.files(pattern = ".csv$")
#create original DataSet
originalData <- lapply(listCsvFile[id],read.csv)
#create working Dataset based on the pollutan argument
#and save to a vector
workingDataSetVector <- c(length = length(id))
for (i in 1:length(id)) {
workingDataSet <- originalData[[i]][,"sulfate"]
badWorkingDataSet <- is.na(workingDataSet)
goodWorkingDataSet <- workingDataSet[!badWorkingDataSet]
workingDataSetVector[i] = length(goodWorkingDataSet)
}
return(data.frame(id = id, nobs = workingDataSetVector))
}
example image
Try
workingDataSetVector <- c()
Instead of
workingDataSetVector <- c(length = length(id))
Related
I'm trying to write a function called complete that takes a file directory (which has csv files titled 1-332) and the title of the file as a number to print out the number of rows without NA in the sulfate or nitrate columns. I am trying to use mutate to add a column titled nobs which returns 1 if neither column is na and then takes the sum of nobs for my answer, but I get an error message that the object nob is not found. How can I fix this? The specific file directory in question is downloaded within this block of code.
library(tidyverse)
if(!file.exists("rprog-data-specdata.zip")) {
temp <- tempfile()
download.file("https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip",temp)
unzip(temp)
unlink(temp)
}
complete <- function(directory, id = 1:332){
#create a list of files
files_full <- list.files(directory, full.names = TRUE)
#create an empty data frame
dat <- data.frame()
for(i in id){
dat <- rbind(dat, read.csv(files_full[i]))
}
mutate(dat, nob = ifelse(!is.na(dat$sulfate) & !is.na(dat$nitrate), 1, 0))
x <- summarise(dat, sum = sum(nob))
return(x)
}
When one runs the following code nobs should be 117, but I get an error message instead
complete("specdata", 1)
Error: object 'nob' not found"
I think the function below should get what you need. Rather than a loop, I prefer map (or apply) in this setting. It's difficult to say where your code went wrong without the error message or an example I can run on my machine, however.
Happy Coding,
Daniel
library(tidyverse)
complete <- function(directory, id = 1:332){
#create a list of files
files_full <- list.files(directory, full.names = TRUE)
# cycle over each file to get the number of nonmissing rows
purrr::map_int(
files_full,
~ read.csv(.x) %>% # read in datafile
dplyr::select(sulfate, nitrate) %>% # select two columns of interest
tidyr::drop_na %>% # drop missing observations
nrow() # get the number of rows with no missing data
) %>%
sum() # sum the total number of rows not missing among all files
}
As mentioned, avoid building objects in a loop. Instead, consider building a list of data frames from each csv then call rbind once. In fact, even consider base R (i.e., tinyverse) for all your needs:
complete <- function(directory, id = 1:332){
# create a list of files
files_full <- list.files(directory, full.names = TRUE)
# create a list of data frames
df_list <- lapply(files_full[id], read.csv)
# build a single data frame with nob column
dat <- transform(do.call(rbind, df_list),
nob = ifelse(!is.na(sulfate) & !is.na(nitrate), 1, 0)
)
return(sum(dat$nob))
}
I am new to R and not sure why I have to rename data frame column names at the end of the program though I have defined data frame with column names at the beginning of the program. The use of the data frame is, I got two columns where I have to save sequence under ID column and some sort of number in NOBS column.
complete <- function(directory, id = 1:332) {
collectCounts = data.frame(id=numeric(), nobs=numeric())
for(i in id) {
fileName = sprintf("%03d",i)
fileLocation = paste(directory, "/", fileName,".csv", sep="")
fileData = read.csv(fileLocation, header=TRUE)
completeCount = sum(!is.na(fileData[,2]), na.rm=TRUE)
collectCounts <- rbind(collectCounts, c(id=i, completeCount))
#print(completeCount)
}
colnames(collectCounts)[1] <- "id"
colnames(collectCounts)[2] <- "nobs"
print(collectCounts)
}
Its not quite clear what your specific problem is, as you did not provide a complete and verifiable example. But I can give a few pointers on improving the code, nonetheless.
1) It is not recommended to 'grow' a data.frame within a loop. This is extremely inefficient in R, as it copies the entire structure each time. Better is to assign the whole data.frame at the outset, then fill in the rows in the loop.
2) R has a handy functionpaste0 that does not require you to specify sep = "".
3) There's no need to specify na.rm = TRUE in your sum, because is.na will never return NA's
Putting this together:
complete = function(directory, id = 1:332) {
collectCounts = data.frame(id=id, nobs=numeric(length(id)))
for(i in 1:length(id)) {
fileName = sprintf("%03d", id[i])
fileLocation = paste0(directory, "/", fileName,".csv")
fileData = read.csv(fileLocation, header=TRUE)
completeCount = sum(!is.na(fileData[, 2]))
collectCounts[i, 'nobs'] <- completeCount
}
}
Always hard to answer questions without example data.
You could start with
collectCounts = data.frame(id, nobs=NA)
And in your loop, do:
collectCounts[i, 2] <- completeCount
Here is another way to do this:
complete <- function(directory, id = 1:332) {
nobs <- sapply(id, function(i) {
fileName = paste0(sprintf("%03d",i), ".csv")
fileLocation = file.path(directory, fileName)
fileData = read.csv(fileLocation, header=TRUE)
sum(!is.na(fileData[,2]), na.rm=TRUE)
}
)
data.frame(id=id, nobs=nobs)
}
This question already has answers here:
What's wrong with my function to load multiple .csv files into single dataframe in R using rbind?
(6 answers)
Closed 5 years ago.
I am quite new to R and I need some help. I have multiple csv files labeled from 001 to 332. I would like to combine all of them into one data.frame. This is what I have done so far:
filesCSV <- function(id = 1:332){
fileNames <- paste(id) ## I want fileNames to be a vector with the names of all the csv files that I want to join together
for(i in id){
if(i < 10){
fileNames[i] <- paste("00",fileNames[i],".csv",sep = "")
}
if(i < 100 && i > 9){
fileNames[i] <- paste("0", fileNames[i],".csv", sep = "")
}
else if (i > 99){
fileNames[i] <- paste(fileNames[i], ".csv", sep = "")
}
}
theData <- read.csv(fileNames[1], header = TRUE) ## here I was trying to create the data.frame with the first csv file
for(a in 2:332){
theData <- rbind(theData,read.csv(fileNames[a])) ## here I wanted to use a for loop to cycle through the names of files in the fileNames vector, and open them all and add them to the 'theData' data.frame
}
theData
}
Any help would be appreciated, Thanks!
Hmm it looks roughly like your function should already be working. What is the issue?
Anyways here would be a more idiomatic R way to achieve what you want to do that reduces the whole function to three lines of code:
Construct the filenames:
infiles <- sprintf("%03d.csv", 1:300)
the %03d means: insert an integer value d padded to length 3 zeroes (0). Refer to the help of ?sprintf() for details.
Read the files:
res <- lapply(infiles, read.csv, header = TRUE)
lapply maps the function read.csv with the argument header = TRUE to each element of the vector "infiles" and returns a list (in this case a list of data.frames)
Bind the data frames together:
do.call(rbind, res)
This is the same as entering rbind(df1, df2, df3, df4, ..., dfn) where df1...dfn are the elments of the list res
You were very close; just needed ideas to append 0s to files and cater for cases when the final data should just read the csv or be an rbind
filesCSV <- function(id = 1:332){
library(stringr)
# Append 0 ids in front
fileNames <- paste(str_pad(id, 3, pad = "0"),".csv", sep = "")
# Initially the data is NULL
the_data <- NULL
for(i in seq_along(id)
{
# Read the data in dat object
dat <- read.csv(fileNames[i], header = TRUE)
if(is.null(the_data) # For the first pass when dat is NULL
{
the_data <- dat
}else{ # For all other passes
theData <- rbind(theData, dat)
}
}
return(the_data)
}
I try to solve a problem from a question I have previously posted looping inside list in r
Is there a way to get the name of a dataframe that is on a list of dataframes?
I have listed a serie of dataframes and to each dataframe I want to apply myfunction. But I do not know how to get the name of each dataframe in order to use it on nameofprocesseddf of myfunction.
Here is the way I get the list of my dataframes and the code I got until now. Any suggestion how I can make this work?
library(missForest)
library(dplyr)
myfunction <- function (originaldf, proceseddf, nonproceseddf, nameofprocesseddf=character){
NRMSE <- nrmse(proceseddf, nonproceseddf, originaldf)
comment(nameofprocesseddf) <- nameofprocesseddf
results <- as.data.frame(list(comment(nameofprocesseddf), NRMSE))
names(results) <- c("Dataset", "NRMSE")
return(results)
}
a <- data.frame(value = rnorm(100), cat = c(rep(1,50), rep(2,50)))
da1 <- data.frame(value = rnorm(100,4), cat2 = c(rep(2,50), rep(3,50)))
dataframes <- dir(pattern = ".txt")
list_dataframes <- llply(dataframes, read.table, header = T, dec=".", sep=",")
n <- length(dataframes)
# Here is where I do not know how to get the name of the `i` dataframe
for (i in 1:n){
modified_list <- llply(list_dataframes, myfunction, originaldf = a, nonproceseddf = da1, proceseddf = list_dataframes[i], nameof processeddf= names(list_dataframes[i]))
write.table(file = sprintf("myfile/%s_NRMSE20%02d.txt", dataframes[i]), modified_list[[i]], row.names = F, sep=",")
}
as a matter of fact, the name of a data frame is not an attribute of the data frame. It's just an expression used to call the object. Hence the name of the data frame is indeed 'list_dataframes[i]'.
Since I assume you want to name your data frame as the text file is named without the extension, I propose you use something like (it require the library stringr) :
nameofprocesseddf = substr(dataframes[i],start = 1,stop = str_length(dataframes[i])-4)
After having searched for help in different threads on this topic, I still have not become wiser. Therefore: Here comes another question on looping through multiple data files...
OK. I have multiple CSV files in one folder containing 5 columns of data. The filenames are as follows:
Moist yyyymmdd hh_mm_ss.csv
I would like to create a script that reads processes the CSV-files one by one doing the following steps:
1) load file
2) check number of rows and exclude file if less than 3 registrations
3) calculate mean value of all measurements (=rows) for column 2
4) calculate mean value of all measurements (=rows) for column 4
5) output the filename timestamp, mean column 2 and mean column 4 to a data frame,
I have written the following function
moist.each.mean <- function() {
library("tcltk")
directory <- tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame(timestamp=character(), humidity=numeric(), temp=numeric())
for(i in 1:length(filelist)){
file.in[[i]] <- read.csv(filelist[i], header=F)
if (nrow(file.in[[i]]<3)){
print("discard")
} else {
newrow <- c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1))
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
}
but i keep getting an error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(V1 = c(10519949L, :
replacement has 18 rows, data has 17
Any ideas?
Thx, kruemelprinz
I'd also suggest to use (l)apply... Here's my take:
getMeans <- function(fpath,runfct,
target_cols = c(2),
sep=",",
dec=".",
header = T,
min_obs_threshold = 3){
f <- list.files(fpath)
fcsv <- f[grepl("\.csv",f)]
fcsv <- paste0(fpath,fcsv)
csv_list <- lapply(fcsv,read.table,sep = sep,
dec = dec, header = header)
csv_rows <- sapply(csv_list,nrow)
rel_csv_list <- csv_list[!(csv_rows < min_obs_threshold)]
lapply(rel_csv_list,function(x) colMeans(x[,target_cols]))
}
Also with that kind of error message, the debugger might be very helpful.
Just run debug(moist.each.mean) and execute the function stepwise.
Here's a slightly different approach. Use lapply to read each csv file, exclude it if necessary, otherwise create a summary. This gives you a list where each element is a data frame summary. Then use rbind to create the final summary data frame.
Without a sample of your data, I can't be sure the code below exactly matches your problem, but hopefully it will be enough to get you where you want to go.
# Get vector of filenames to read
filelist=list.files(path=directory, pattern="csv")
# Read all the csv files into a list and create summaries
df.list = lapply(filelist, function(f) {
file.in = read.csv(f, header=TRUE, stringsAsFactors=FALSE)
# Set to empty data frame if file has less than 3 rows of data
if (nrow(file.in) < 3) {
print(paste("Discard", f))
# Otherwise, capture file timestamp and summarise data frame
} else {
data.frame(timestamp=substr(f, 7, 22),
humidity=round(mean(file.in$V2),1),
temp=round(mean(file.in$V4),1))
}
})
# Bind list into final summary data frame (excluding the list elements
# that don't contain a data frame because they didn't have enough rows
# to be included in the summary)
result = do.call(rbind, df.list[sapply(df.list, is.data.frame)])
One issue with your original code is that you create a vector of summary results rather than a data frame of results:
c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1)) is a vector with three elements. What you actually want is a data frame with three columns:
data.frame(timestamp=filetitles[[i]],
humidity=round(mean(file.in[[i]]$V2),1),
temp=round(mean(file.in[[i]]$V4),1))
Thanks for the suggestions using lapply. This is definitely of value as it saves a whole lot of code as well! Meanwhile, I managed to fix my original code as well:
library("tcltk")
# directory: path to csv files
directory <-
tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <-
regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame()
for (i in 1:length(filelist)) {
file.in <- read.csv(filelist[i], header = F, skipNul = T)
if (nrow(file.in) < 3) {
print("discard")
} else {
newrow <-
matrix(
c(filetitles[[i]], round(mean(file.in$V2, na.rm=T),1), round(mean(file.in$V4, na.rm=T),1)), nrow = 1, ncol =
3, byrow = T
)
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
Only I did not get it to work as a function because then I would only have one row in mdf containing the last file data. Somehow it did not add rows but overwrite row 1 with each iteration. But using it without a function wrapper worked fine...