R Data Frames column names rename - r

I am new to R and not sure why I have to rename the data frame's columns at the end of the function, even though I defined the data frame with column names at the beginning. The data frame has two columns: I store the sequence number in the id column and a count in the nobs column.
complete <- function(directory, id = 1:332) {
collectCounts = data.frame(id=numeric(), nobs=numeric())
for(i in id) {
fileName = sprintf("%03d",i)
fileLocation = paste(directory, "/", fileName,".csv", sep="")
fileData = read.csv(fileLocation, header=TRUE)
completeCount = sum(!is.na(fileData[,2]), na.rm=TRUE)
collectCounts <- rbind(collectCounts, c(id=i, completeCount))
#print(completeCount)
}
colnames(collectCounts)[1] <- "id"
colnames(collectCounts)[2] <- "nobs"
print(collectCounts)
}

It's not quite clear what your specific problem is, as you did not provide a complete and verifiable example. But I can give a few pointers on improving the code nonetheless.
1) It is not recommended to 'grow' a data.frame within a loop. This is extremely inefficient in R, as it copies the entire structure each time. It is better to allocate the whole data.frame at the outset, then fill in the rows in the loop.
2) R has a handy function, paste0, that does not require you to specify sep = "".
3) There's no need to specify na.rm = TRUE in your sum, because is.na never returns NA.
Putting this together:
complete = function(directory, id = 1:332) {
  # Allocate the full result up front: one row per requested id
  collectCounts = data.frame(id=id, nobs=numeric(length(id)))
  for(i in seq_along(id)) {
    fileName = sprintf("%03d", id[i])
    fileLocation = paste0(directory, "/", fileName,".csv")
    fileData = read.csv(fileLocation, header=TRUE)
    completeCount = sum(!is.na(fileData[, 2]))
    collectCounts[i, 'nobs'] <- completeCount
  }
  # Return the filled data.frame (a for loop on its own returns NULL)
  collectCounts
}
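Points 2 and 3 above are easy to check at the console. This is a small sketch with made-up values (the directory name "specdata" and the vector x are only illustrations):

```r
# paste0() is paste() with sep = "" built in
fileName <- sprintf("%03d", 7)
a <- paste("specdata", "/", fileName, ".csv", sep = "")
b <- paste0("specdata", "/", fileName, ".csv")
identical(a, b)

# is.na() returns only TRUE or FALSE, so the sum needs no na.rm
x <- c(1, NA, 3)
anyNA(is.na(x))
sum(!is.na(x))
```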

Always hard to answer questions without example data.
You could start with
collectCounts = data.frame(id, nobs=NA)
And in your loop, do:
collectCounts[i, 2] <- completeCount
Here is another way to do this:
complete <- function(directory, id = 1:332) {
nobs <- sapply(id, function(i) {
fileName = paste0(sprintf("%03d",i), ".csv")
fileLocation = file.path(directory, fileName)
fileData = read.csv(fileLocation, header=TRUE)
sum(!is.na(fileData[,2]), na.rm=TRUE)
}
)
data.frame(id=id, nobs=nobs)
}

Related

The "length" name always shown in the output - R programming

I am new to the R programming language. The function below returns a data frame, but the output always shows the name "length" instead of the row indexes. Can somebody advise, please?
The indexes do appear if there are more than 2 rows.
My expected result is to show 1, 2, 3.
complete <- function(directory, id = 1:322){
#set working directory
setwd(directory)
#list all csv files in the working dir and save to listCsvFile variable
listCsvFile <- list.files(pattern = ".csv$")
#create original DataSet
originalData <- lapply(listCsvFile[id],read.csv)
#create working Dataset based on the pollutant argument
#and save to a vector
workingDataSetVector <- c(length = length(id))
for (i in 1:length(id)) {
workingDataSet <- originalData[[i]][,"sulfate"]
badWorkingDataSet <- is.na(workingDataSet)
goodWorkingDataSet <- workingDataSet[!badWorkingDataSet]
workingDataSetVector[i] = length(goodWorkingDataSet)
}
return(data.frame(id = id, nobs = workingDataSetVector))
}
Try
workingDataSetVector <- c()
Instead of
workingDataSetVector <- c(length = length(id))
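The reason, as far as I can tell, is that c(length = length(id)) does not allocate a vector of that length. It creates a one-element vector whose element is named "length", and that name can then carry through as a row name of the data frame. A minimal sketch of the difference, with numeric() shown as one possible way to pre-allocate (the variable names are only for illustration):

```r
id <- 1:3

# A single element named "length", not a vector of length 3
named <- c(length = length(id))
names(named)
length(named)

# Pre-allocated, unnamed vector of the right size
counts <- numeric(length(id))
names(counts)
length(counts)
```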

Breaking the golden rules on FOR loops with R

Firstly, apologies as this may seem a bit long-winded, but I hope to give as much information on this problem as I can...
I have written a script that loops through a set of files defined in a csv file. Each file within this csv listing is an XML file, each one is for a particular event in an application, and all files within this list are of the same event type. However, each file can contain different data. For instance, one could hold an attribute with no child nodes beneath, while others contain nodes.
My script works perfectly fine, but when it gets to about XML file 5000, it has slowed down considerably.
The problem is that my code creates a blank dataframe initially, and then grows it as new columns are detected.
I understand that this is a big NO NO when it comes to writing R FOR loops, but I am unsure how to get around this problem, given that my smallest file listing is 69000, which makes going through each one in turn and counting the nodes a task in itself.
Are there any ideas on how to get around this?
pseudo code or actual R code to do this would be great. So would ideas/opinions, as I am unsure on the best approach to this task.
Here is my current code.
library(XML)
library(xml2)
library(plyr)
library(tidyverse)
library(reshape2)
library(foreign)
library(rio)
# Get file data to be used
#
setwd('c:/temp/xml')
headerNames <- c('GUID','EventId','AppId','RequestFile', 'AE_Type', 'AE_Drive')
GetNames <- rowid_to_column(read.csv(file= 'c:/temp/xml/R_EventIdA.csv', fileEncoding="UTF-8-BOM", header = FALSE, col.names = headerNames),'ID')
inputfiles <- as.character(GetNames[,5]) # Gets list of files
# Create empty dataframes
#
df <- data.frame()
transposed.df1 <- data.frame()
allxmldata <- data.frame()
findchildren<-function(nodes, df) {
numchild <- sapply(nodes, function(x){length(xml_children(x))})
xml.value <- xml_text(nodes[numchild==0])
xml.name <- xml_name(nodes[numchild==0])
xml.path <- sapply(nodes[numchild==0], function(x) {gsub(', ','_', toString(rev(xml_name(xml_parents(x)))))})
fieldname <- paste(xml.path,xml.name,sep = '_')
contents <- sapply(xml.value, function(f){is.na(f)<-which(f == '');f})
if (length(fieldname) > 0) {
fieldname <- paste(fieldname,xml.value, sep = '_')
dftemp <- data.frame(fieldname, contents)
df <- rbind(df, dftemp)
print(dim(df))
}
if (sum(numchild)>0){
findchildren(xml_children(nodes[numchild>0]), df) }
else{ return(df)
}
}
findchildren2<-function(nodes, df){
numchild<-sapply(nodes, function(embeddedinputfile){length(xml_children(embeddedinputfile))})
xmlvalue<-xml_text(nodes[numchild==0])
xmlname<-xml_name(nodes[numchild==0])
xmlpath<-sapply(nodes[numchild==0], function(embeddedinputfile) {gsub(', ','_', toString(rev(xml_name(xml_parents(embeddedinputfile)))))})
fieldname<-paste(xmlpath,xmlname,sep = '_')
contents<-sapply(xmlvalue, function(f){is.na(f)<-which(f == '');f})
if (length(fieldname) > 0) {
dftemp<-data.frame(fieldname, contents)
df<-rbind(df, dftemp)
print(dim(df))
}
if (sum(numchild)>0){
findchildren2(xml_children(nodes[numchild>0]), df) }
else{ return(df)
}
}
# Loop all files
#
for (x in inputfiles) {
df1 <- findchildren(xml_children(read_xml(x)),df)
## original xml dataframe
if (length(df1) > 0) {
xml.df1 <- data.frame(spread(df1, key = fieldname, value = contents), fix.empty.names = TRUE)
}
##
xml.df1 %>%
pluck('Response_RawData') -> rawxml
if (length(rawxml)>0) {
df.rawxml <- data.frame(rawxml)
export(df.rawxml,'embedded.xml')
embeddedinputfile <-as.character('embedded.xml')
rm(df1)
df1 <- findchildren2(xml_children(read_xml(embeddedinputfile)),df)
if (length(df1) > 0) {
xml.df2 <- spread(df1, key = fieldname, value = contents)
}
allxmldata <- rbind.fill(allxmldata,cbind(xml.df1,xml.df2))
} else {
allxmldata <- rbind.fill(allxmldata,cbind(xml.df1))
}
}
if(nrow(allxmldata)==nrow(GetNames)) {
alleventdata<-cbind(GetNames,allxmldata)
}
dbConn2 <- odbcDriverConnect('driver={SQL Server};server=PC-XYZ;database=Events;trusted_connection=true')
sqlSave(dbConn2, alleventdata, tablename = 'AE_EventA', append = TRUE )
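One common way around growing a data frame inside the loop is to build one small data frame per file, keep them in a list, and bind them once at the end. The sketch below is only an illustration in base R: parse_one is a hypothetical stand-in for the XML parsing, and the padding step does by hand what plyr::rbind.fill or dplyr::bind_rows would do when given the whole list.

```r
# Hypothetical stand-in for parsing one XML file: returns a one-row
# data frame whose columns can differ from file to file.
parse_one <- function(i) {
  if (i %% 2 == 0) data.frame(a = i, b = i * 2) else data.frame(a = i)
}

# Build every piece first; no accumulated data frame is copied in the loop.
pieces <- lapply(1:4, parse_one)

# Pad each piece with NA for the columns it lacks, then bind once.
all_cols <- unique(unlist(lapply(pieces, names)))
padded <- lapply(pieces, function(d) {
  for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
  d[all_cols]
})
result <- do.call(rbind, padded)
```

Whether this is fast enough for 69000 files would need measuring, but it avoids copying the accumulated data frame on every iteration.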

Using a for loop to read certain files using read.csv() but it's only returning "'file' must be a character string or connection"

Here is the data I am working with. https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip
I'm trying to create a function called pollutantmean that will load selected files, aggregate (rbind) the columns, and return a mean of a certain column. I have figured out everything except how to run the loop so I can turn the multiple files into one big data frame.
for (id in 1:5) {
files_full <- Sys.glob("*.csv")
fileQ <- files_full[[id]]
empty_tbl <- rbind(empty_tbl, read.csv(fileQ, header = TRUE))
}
This for loop works by itself, but when I try to use it in my bigger function
pollutantmean <- function(directory = "specdata", pollutant, id = 1:332) {
empty_tbl <- data.frame()
for (id in 1:332) {
files_full <- Sys.glob("*.csv")
fileQ <- files_full[[i]]
empty_tbl <- rbind(empty_tbl, read.csv(fileQ, header = TRUE))
}
goodata <- na.omit(empty_tbl)
if(pollutant == "sulfate") {
mean(goodata[,2])
} else {
mean(goodata[,3])
}
}
I get the error:
"Error in read.table(file = file, header = header, sep = sep, quote = quote, :
'file' must be a character string or connection".
I am at a complete loss over how to fix this and have tried many, many different ways. I'm sure I'm messing something up with the naming of the file but I try the for loop by itself and it works fine...
Consider using lapply() to read the CSV files, building their paths from the function's directory argument. The code below assumes specdata is a subfolder of the current working directory:
pollutantmean <- function(directory = "specdata", pollutant) {
files_full <- Sys.glob(paste0(directory,"/*.csv"))[1:332] # FIRST 332 CSVs IN DIRECTORY
dfList <- lapply(files_full, read.csv, header=TRUE)
df <- do.call(rbind, dfList)
gooddata <- na.omit(df)
if (pollutant == "sulfate") mean(gooddata[,2]) else mean(gooddata[,3]) # last expression is the return value
}
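The read-then-bind pattern can be tried on a couple of throwaway files. This is a sketch with made-up data written to a temporary directory; the column layout (Date, sulfate, nitrate, ID) is an assumption that mirrors how the question uses columns 2 and 3.

```r
# Write two small CSV files into a fresh temporary directory (made-up data)
dir <- tempfile("specdata_demo")
dir.create(dir)
write.csv(data.frame(Date = c("d1", "d2"), sulfate = c(1, NA),
                     nitrate = c(2, 3), ID = 1),
          file.path(dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = "d3", sulfate = 5, nitrate = 7, ID = 2),
          file.path(dir, "002.csv"), row.names = FALSE)

# Read every file, bind once, drop incomplete rows, then take the mean
files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, read.csv))
good <- na.omit(combined)
mean(good$sulfate)
```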

Overwriting result with for loop in R

I have a number of csv files and my goal is to find the number of complete cases for a file or set of files given by the id argument. My function should return a data frame with column id specifying the file and column nobs giving the number of complete cases for this id. However, my function overwrites the previous value of nobs in each loop iteration and the resulting data frame gives me only its last value. Do you have any idea how to get the value of nobs for each value of id?
myfunction<-function(id=1:20) {
files<-list.files(pattern="*.csv")
myfiles = do.call(rbind, lapply(files, function(x) read.csv(x,stringsAsFactors = FALSE)))
for (i in id) {
good<-complete.cases(myfiles)
newframe<-myfiles[good,]
cases<-newframe[newframe$ID %in% i,]
nobs<-nrow(cases)
}
clean<-data.frame(id,nobs)
clean
}
Thanks.
We can do all inside lapply(), something like below (not tested):
myfunction <- function(id = 1:20) {
files <- list.files(pattern = "*.csv")[id]
do.call(rbind,
lapply(files, function(x){
df <- read.csv(x,stringsAsFactors = FALSE)
df <- df[complete.cases(df), ]
data.frame(ID=x,nobs=nrow(df))
}
)
)
}
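If you would rather keep the for loop, the overwriting goes away once nobs is a vector with one slot per id instead of a single value. A minimal sketch on a made-up data frame standing in for the combined files:

```r
# Made-up stand-in for the combined file data
myfiles <- data.frame(ID = c(1, 1, 2, 2, 3),
                      value = c(10, NA, 5, 6, NA))
id <- 1:3

# Filter the complete cases once, outside the loop
good <- myfiles[complete.cases(myfiles), ]

# One slot per id, filled by position
nobs <- integer(length(id))
for (k in seq_along(id)) {
  nobs[k] <- sum(good$ID == id[k])
}
clean <- data.frame(id, nobs)
clean
```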

Output formatting in R

I am new to R and trying to do some correlation analysis on multiple sets of data. I am able to do the analysis, but I am trying to figure out how I can output the results of my data. I'd like to have output like the following:
NAME,COR1,COR2
....,....,....
....,....,....
If I could write such a file to output, then I can post process it as needed. My processing script looks like this:
run_analysis <- function(logfile, name)
{
preds <- read.table(logfile, header=T, sep=",")
# do something with the data: create some_col, another_col, etc.
result1 <- cor(some_col, another_col)
result2 <- cor(some_col2, another_col2)
# somehow output name,result1,result2 to a CSV file
}
args <- commandArgs(trailingOnly = TRUE)
date <- args[1]
basepath <- args[2]
logbase <- paste(basepath, date, sep="/")
logfile_pattern <- paste( "*", date, "csv", sep=".")
logfiles <- list.files(path=logbase, pattern=logfile_pattern)
for (f in logfiles) {
name = unlist(strsplit(f,"\\."))[1]
logfile = paste(logbase, f, sep="/")
run_analysis(logfile, name)
}
Is there an easy way to create a blank data frame and then add data to it, row by row?
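Have you looked at the functions in R for writing data to files? For instance, write.csv, or write.table if you want to append to a file row by row (write.csv ignores the append argument and warns). Perhaps something like this:
rs <- data.frame(name = name, COR1 = result1, COR2 = result2)
write.table(rs, "path/to/file", sep = ",", append = TRUE, col.names = FALSE, row.names = FALSE)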
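Note that write.csv ignores append (with a warning), so appending one row per call needs write.table. A small sketch of the idea, writing to a temporary file with made-up names and correlation values:

```r
out <- tempfile(fileext = ".csv")

# Write the header once
writeLines("NAME,COR1,COR2", out)

# Append one row per analysis (the values here are placeholders)
for (name in c("a", "b")) {
  rs <- data.frame(name = name, COR1 = 0.5, COR2 = -0.25)
  write.table(rs, out, sep = ",", append = TRUE,
              col.names = FALSE, row.names = FALSE, quote = FALSE)
}

result <- read.csv(out)
result
```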
I like using the foreach library for this sort of thing:
library(foreach)
run_analysis <- function(logfile, name) {
preds <- read.table(logfile, header=T, sep=",")
# do something with the data: create some_col, another_col, etc.
result1 <- cor(some_col, another_col)
result2 <- cor(some_col2, another_col2)
# Return one row of results.
data.frame(name=name, cor1=result1, cor2=result2)
}
args <- commandArgs(trailingOnly = TRUE)
date <- args[1]
basepath <- args[2]
logbase <- paste(basepath, date, sep="/")
logfile_pattern <- paste( "*", date, "csv", sep=".")
logfiles <- list.files(path=logbase, pattern=logfile_pattern)
## Collect results from run_analysis into a table, by rows.
dat <- foreach (f=logfiles, .combine="rbind") %do% {
name = unlist(strsplit(f,"\\."))[1]
logfile = paste(logbase, f, sep="/")
run_analysis(logfile, name)
}
## Write output.
write.csv(dat, "output.dat", quote=FALSE)
What this does is to generate one row of output on each call to run_analysis, binding them into a single table called dat (the .combine="rbind" part of the call to foreach causes row binding). Then you can just use write.csv to get the output you want.
