I am new to R and trying to do some correlation analysis on multiple sets of data. I am able to do the analysis, but I am trying to figure out how I can output the results of my data. I'd like to have output like the following:
NAME,COR1,COR2
....,....,....
....,....,....
If I could write such a file to output, then I can post process it as needed. My processing script looks like this:
run_analysis <- function(logfile, name)
{
preds <- read.table(logfile, header=T, sep=",")
# do something with the data: create some_col, another_col, etc.
result1 <- cor(some_col, another_col)
result2 <- cor(some_col2, another_col2)
# somehow output name,result1,result2 to a CSV file
}
args <- commandArgs(trailingOnly = TRUE)
date <- args[1]
basepath <- args[2]
logbase <- paste(basepath, date, sep="/")
logfile_pattern <- paste( "*", date, "csv", sep=".")
logfiles <- list.files(path=logbase, pattern=logfile_pattern)
for (f in logfiles) {
name = unlist(strsplit(f,"\\."))[1]
logfile = paste(logbase, f, sep="/")
run_analysis(logfile, name)
}
Is there an easy way to create a blank data frame and then add data to it, row by row?
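Have you looked at the functions in R for writing data to files? For instance, write.csv, or write.table when you need to append (write.csv ignores append and only warns). Perhaps something like this:
rs <- data.frame(name = name, COR1 = result1, COR2 = result2)
# write the header only when the file does not exist yet, then append rows
write.table(rs, "path/to/file", sep = ",", row.names = FALSE, col.names = !file.exists("path/to/file"), append = TRUE)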
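For the row-by-row approach, here is a minimal sketch of appending one result row per call. The helper name append_result and the temporary output path are made up for illustration; the only real point is that write.table honours append, whereas write.csv does not.

```r
# Hypothetical helper: append one result row to a CSV file,
# writing the header line only when the file does not exist yet.
append_result <- function(path, name, cor1, cor2) {
  new_file <- !file.exists(path)
  row <- data.frame(NAME = name, COR1 = cor1, COR2 = cor2)
  write.table(row, path, sep = ",", quote = FALSE, row.names = FALSE,
              col.names = new_file, append = !new_file)
}

out <- tempfile(fileext = ".csv")  # stand-in for the real output path
append_result(out, "runA", 0.91, 0.42)
append_result(out, "runB", 0.15, 0.77)
lines <- readLines(out)
print(lines)
```

In the original script you would call the helper at the end of run_analysis, once per log file.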
I like using the foreach library for this sort of thing:
library(foreach)
run_analysis <- function(logfile, name) {
preds <- read.table(logfile, header=T, sep=",")
# do something with the data: create some_col, another_col, etc.
result1 <- cor(some_col, another_col)
result2 <- cor(some_col2, another_col2)
# Return one row of results.
data.frame(name=name, cor1=result1, cor2=result2)
}
args <- commandArgs(trailingOnly = TRUE)
date <- args[1]
basepath <- args[2]
logbase <- paste(basepath, date, sep="/")
logfile_pattern <- paste( "*", date, "csv", sep=".")
logfiles <- list.files(path=logbase, pattern=logfile_pattern)
## Collect results from run_analysis into a table, by rows.
dat <- foreach (f=logfiles, .combine="rbind") %do% {
name = unlist(strsplit(f,"\\."))[1]
logfile = paste(logbase, f, sep="/")
run_analysis(logfile, name)
}
## Write output.
write.csv(dat, "output.dat", quote=FALSE, row.names=FALSE)
What this does is to generate one row of output on each call to run_analysis, binding them into a single table called dat (the .combine="rbind" part of the call to foreach causes row binding). Then you can just use write.csv to get the output you want.
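If you would rather not depend on foreach, the same "one row per call, bind at the end" idea can be sketched in base R with lapply and do.call(rbind, ...). The function analyse_one and the toy inputs below are placeholders, not the real run_analysis.

```r
# Hypothetical stand-in for run_analysis(): returns one data-frame row per input.
analyse_one <- function(name, x, y) {
  data.frame(name = name, cor1 = cor(x, y), cor2 = cor(x, rev(y)))
}

# Toy inputs in place of the log files.
inputs <- list(
  a = list(x = 1:5, y = c(2, 4, 6, 8, 10)),
  b = list(x = 1:5, y = c(5, 3, 4, 1, 2))
)

# One row per input, collected in a list ...
rows <- lapply(names(inputs), function(n) {
  analyse_one(n, inputs[[n]]$x, inputs[[n]]$y)
})

# ... then bound into a single table, like .combine = "rbind" does.
dat <- do.call(rbind, rows)
print(dat)
```

Building the list first and binding once also avoids growing a data frame inside a loop, which gets slow with many files.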
I have files with name ABC_1, ABC_2 and so on. I am using the following code to read all the files from a folder.
s.name <- "ABC"
mylist <- list.files(folder.path, pattern = glob2rx(paste(s.name,"_?.csv",sep="")), full.names=TRUE)
for (k in mylist){
mydata <- read.csv(k,sep=",")
attach(mydata)
sub_mydata <- data.frame(x=mydata$VarA,y=mydata$VarB)
}
There is another csv file, XYZ, that has a specific value for each of my s.name values. It looks like
XZY <- data.frame(s.name = c("ABC_1","ABC_2","ABC_3"),val = c(2,6,8))
What I want inside the for loop is, for the matching s.name, to take the val and use it in sub_mydata as z = mydata$VarC * val of ABC_?.
You can try the following with lapply -
s.name <- "ABC"
mylist <- list.files(folder.path, pattern = glob2rx(paste0(s.name,"_?.csv")), full.names=TRUE)
lapply(mylist, function(x) {
data <- read.csv(x)
filename <- tools::file_path_sans_ext(basename(x))
transform(data, VarC = VarC * XZY$val[match(filename, XZY$s.name)])
}) -> result
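To see what the match() lookup does on its own, here is a small sketch with an in-memory data frame instead of a file. The path and the VarC values are invented for the example; XZY is the table from the question.

```r
# Lookup table from the question.
XZY <- data.frame(s.name = c("ABC_1", "ABC_2", "ABC_3"), val = c(2, 6, 8))

# Hypothetical file path; only the base name without extension matters.
path <- "some/folder/ABC_2.csv"
filename <- tools::file_path_sans_ext(basename(path))

# match() returns the row position of filename in XZY$s.name (NA if absent).
val <- XZY$val[match(filename, XZY$s.name)]

# Toy data in place of read.csv(path); add z = VarC * val as the question asks.
mydata <- data.frame(VarC = c(1, 2, 3))
mydata$z <- mydata$VarC * val
print(mydata)
```

If a file has no entry in XZY, val is NA and the whole z column becomes NA, so it may be worth checking for that before multiplying.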
I want to export a couple of data frames to an excel file using the function write.xlsx() from openxlsx. So, for example the following:
library(openxlsx)
x <- c(1,2,3)
for (i in x) {
name <- paste("sheet", i, sep = "")
assign(name, data.frame(1:4, 2:3))
path <- paste("/some_directory/",name,".xlsx" , sep = "")
write.xlsx(name, file = path)
}
This does create three different data frames with the values 1 to 4 and 2 to 3, those have the right names, it also creates three different excel files with the right names, but the excel files only contain the name instead of the values from the dataframe. Does anyone know how to change that?
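write.xlsx(name, ...) writes the character string stored in name, not the data frame that has that name. You need to keep your data.frame in a variable and pass that: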
library(glue)
library(openxlsx)
x <- c(1,2,3)
for (i in x) {
name <- paste("sheet", i, sep = "")
df <- data.frame(1:4, 2:3) # This step is missing in your example
assign(name, df)
path <- glue("/some_directory/{name}.xlsx", name = name)
write.xlsx(df, file = path)
}
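An alternative sketch that avoids assign() altogether: keep the data frames in a named list and index it by name. The column names a and b and the temporary output directory are assumptions for the example, and the openxlsx call is guarded so the snippet still runs where the package is not installed.

```r
x <- c(1, 2, 3)

# One data frame per sheet, stored in a named list instead of separate variables.
sheets <- lapply(x, function(i) data.frame(a = 1:4, b = 2:3))
names(sheets) <- paste0("sheet", x)

out_dir <- tempdir()  # stand-in for /some_directory/

for (nm in names(sheets)) {
  path <- file.path(out_dir, paste0(nm, ".xlsx"))
  # Pass the data frame itself (sheets[[nm]]), not its name (nm).
  if (requireNamespace("openxlsx", quietly = TRUE)) {
    openxlsx::write.xlsx(sheets[[nm]], file = path)
  }
}
```

If you do need to go from a name to an object created with assign(), get(name) retrieves it, but the list version is usually easier to follow.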
I am trying to write a function to combine csvs into one dataframe based on the name of the csv, and then name the dataframe after the pattern that is passed to the function. Everything works except returning the dataframe. I can't figure out how to do that using the function input as the name of the dataframe.
I tried what is in this post, but I think the issue is that I only know the name of the dataframe based on the function input: Return a data frame from function
#### Create files in your current working directory ####
dir <- getwd()
subDir <- 'temp'
dir.create(subDir)
setwd(file.path(dir, subDir))
dir.create('Run1')
dir.create('Run2')
employeeID <- c('123','456','789')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employeeID <- c('123','456','789')
first <- c('John','Jane','Tom')
last <- c('Doe','Smith','Franks')
data <- data.frame(employeeID,salary,startdate)
name <- data.frame(employeeID,first,last)
write.csv(data, file = "Run1/data.csv",row.names=FALSE, na="")
write.csv(name, file = "Run1/name.csv",row.names=FALSE, na="")
employeeID <- c('465','798','132')
salary <- c(100000, 500000, 300000)
startdate <- as.Date(c('2000-11-1','2001-3-25','2003-3-14'))
employeeID <- c('465','798','132')
first <- c('Jay','Susan','Tina')
last <- c('Jones','Smith','Thompson')
data <- data.frame(employeeID,salary,startdate)
name <- data.frame(employeeID,first,last)
write.csv(data, file = "Run2/data.csv",row.names=FALSE, na="")
write.csv(name, file = "Run2/name.csv",row.names=FALSE, na="")
#### function ####
files_to_df <- function(pattern){
# pattern <- "data"
filenames <- list.files(recursive = TRUE, pattern = pattern)
df_list <- lapply(filenames, read.csv, header = TRUE)
# Name each dataframe with the run and filename
names(df_list) <- str_sub(list, 1, 4)
# Create combined dataframe
df <- df_list %>%
bind_rows(.id = 'run')
# Assign dataframe to the name of the pattern
assign(pattern, df)
# Return the dataframe
return(data.frame(pattern))
#list2env(pattern,.GlobalEnv)
}
#### Run function ####
files_to_df(c("data"))
I made two changes to your code:
1.) str_sub(list, 1, 4) -> str_sub(filenames, 1, 4)
list is a function and doesn't contain the file names.
2.) return(data.frame(pattern)) -> return(df)
This returns the data.frame and not a string.
library(dplyr)    # bind_rows() and %>%
library(stringr)  # str_sub()
files_to_df <- function(pattern){
# pattern <- "data"
filenames <- list.files(recursive = TRUE, pattern = pattern)
df_list <- lapply(filenames, read.csv, header = TRUE)
# Name each dataframe with the run and filename
names(df_list) <- str_sub(filenames, 1, 4)
# Create combined dataframe
df <- df_list %>%
bind_rows(.id = 'run')
# Assign dataframe to the name of the pattern
assign(pattern, df)
# Return the dataframe
return(df)
#list2env(pattern,.GlobalEnv)
}
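One more point that may help: assign(pattern, df) inside the function only creates a variable in the function's own environment, so it disappears when the function returns. A minimal sketch of that behaviour, with a made-up function make_table and a made-up name "demo_table":

```r
# Hypothetical function mirroring the structure of files_to_df().
make_table <- function(pattern) {
  df <- data.frame(run = c("Run1", "Run2"), n = c(3, 3))
  assign(pattern, df)  # creates a variable local to this call only
  df                   # the return value is what the caller actually gets
}

result <- make_table("demo_table")

# The local assignment did not leak out of the function.
before <- exists("demo_table")

# If a variable named after the pattern is really wanted, create it at the call site.
assign("demo_table", result)
after <- exists("demo_table")
```

In most cases a plain data <- files_to_df("data") at the call site is all that is needed.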
I have a number of csv files and my goal is to find the number of complete cases for a file or set of files given by id argument. My function should return a data frame with column id specifying the file and column obs giving the number of complete cases for this id. However, my function overwrites the previous value of nobs in each loop and the resulting data frame gives me only its last value. Do you have any idea how to get the value of nobs for each value of id?
myfunction<-function(id=1:20) {
files<-list.files(pattern="*.csv")
myfiles = do.call(rbind, lapply(files, function(x) read.csv(x,stringsAsFactors = FALSE)))
for (i in id) {
good<-complete.cases(myfiles)
newframe<-myfiles[good,]
cases<-newframe[newframe$ID %in% i,]
nobs<-nrow(cases)
}
clean<-data.frame(id,nobs)
clean
}
Thanks.
We can do all inside lapply(), something like below (not tested):
myfunction <- function(id = 1:20) {
files <- list.files(pattern = "*.csv")[id]
do.call(rbind,
lapply(files, function(x){
df <- read.csv(x,stringsAsFactors = FALSE)
df <- df[complete.cases(df), ]
data.frame(ID=x,nobs=nrow(df))
}
)
)
}
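Since the code above is untested, here is a small self-contained sketch of the same idea that can be run as is. It writes two toy files to a temporary directory; the function name count_complete, the file names and the column layout (an ID column plus two measurements) are assumptions.

```r
# Toy data in a temporary directory, standing in for the real csv files.
demo_dir <- file.path(tempdir(), "complete_cases_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ID = 1, a = c(1, NA, 3), b = c(4, 5, 6)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(ID = 2, a = c(1, 2, 3), b = c(NA, NA, 6)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

# Hypothetical function: one output row per selected file.
count_complete <- function(directory, id) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)[id]
  rows <- lapply(files, function(f) {
    df <- read.csv(f, stringsAsFactors = FALSE)
    # sum() over the logical vector counts the complete rows.
    data.frame(id = df$ID[1], nobs = sum(complete.cases(df)))
  })
  do.call(rbind, rows)
}

res <- count_complete(demo_dir, 1:2)
print(res)
```

The original loop overwrote nobs because it stored a single value; returning one small data frame per file and binding them keeps every value.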
I need to run the same set of code for multiple CSV files, and I want to do it with something like a macro. Below is the code that I am executing, but the results are not coming out properly: the data is read in a 2-d format, while I need it in a 3-d format.
lf = list.files(path = "D:/THD/data", pattern = ".csv",
full.names = TRUE, recursive = TRUE, include.dirs = TRUE)
ds<-lapply(lf,read.table)
I don't know if this is going to be useful, but one of the ways I do it is:
##Step 1 read files
mycsv = dir(pattern=".csv")
n <- length(mycsv)
mylist <- vector("list", n)
for(i in 1:n) mylist[[i]] <- read.csv(mycsv[i],header = T)
Then I usually just use the apply functions to change things, for example:
## Change column names
mylist <- lapply(mylist, function(x) {names(x) <- c("type","date","v1","v2","v3","v4","v5","v6","v7","v8","v9","v10","v11","v12","v13","v14","v15","v16","v17","v18","v19","v20","v21","v22","v23","v24","total") ; return(x)})
## changing type column for weekday/weekend
mylist <- lapply(mylist, function(x) {
f = c("we", "we", "wd", "wd", "wd", "wd", "wd")
x$type = rep(f,52, length.out = 365)
return(x)
})
and so on.
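As a quick check of the weekday/weekend step, this sketch shows what rep() with length.out produces for one year. It assumes, like the vector above, a 365-day year whose first two days fall on a weekend.

```r
# Weekly pattern: two weekend days followed by five weekdays.
week_pattern <- c("we", "we", "wd", "wd", "wd", "wd", "wd")

# Recycle the pattern until exactly 365 values have been produced.
type <- rep(week_pattern, length.out = 365)

# Count how many days fall into each category.
counts <- table(type)
print(counts)
```

When length.out is given, a times argument is not needed, so rep(f, length.out = 365) is enough.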
Then I save the files with the following code after all the changes I made (it is also sometimes useful to split the original file name and save each file under a name built from part of it, so that I can track the individual files later)
## for example, some of my files had a pattern in the file name such as "201_E424220_N563500.csv", so I split this to save with a new name like this:
mylist <-lapply(1:length(mylist), function(i) {
mylist.i <- mylist[[i]]
s = strsplit(mycsv[i], "_" , fixed = TRUE)[[1]]
d = cbind(mylist.i[, c("type", "date")], ID = s[1], Easting = s[2], Northing = s[3], mylist.i[, 3:ncol(mylist.i)])
return(d)
})
for(i in 1:n)
write.csv(file = paste("file", i, ".csv", sep = ""), mylist[[i]], row.names = F)
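One detail worth checking when splitting file names this way: the last piece still carries the ".csv" extension unless it is removed first. A small sketch using the example file name from above, with tools::file_path_sans_ext to drop the extension:

```r
fname <- "201_E424220_N563500.csv"

# Splitting the raw name leaves ".csv" attached to the last piece.
raw_parts <- strsplit(fname, "_", fixed = TRUE)[[1]]

# Dropping the extension first gives clean ID / Easting / Northing values.
parts <- strsplit(tools::file_path_sans_ext(fname), "_", fixed = TRUE)[[1]]

meta <- data.frame(ID = parts[1], Easting = parts[2], Northing = parts[3])
print(meta)
```

The meta columns can then be bound onto each data frame with cbind() as in the code above.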
I hope this helps. When you get some time, please read about the plyr package, as I am sure it will be very useful for you; it has lots of data analysis options. plyr has apply functions such as:
## l_ply split list, apply function and discard result
## ldply split list, apply function and return result in data frame
## laply split list, apply function and return result in an array
For example, you can use ldply to read all your csv files and return a data frame, something like:
data = ldply(list.files(pattern = ".csv"), function(fname) {
j = read.csv(fname, header = T)
return(j)
})
So here data will be your data frame with the data from all your csv files (j is just the data frame read from each individual file).
Thanks, Ayan