I have files named ABC_1, ABC_2 and so on. I am using the following code to read all the files from a folder:

s.name <- "ABC"
mylist <- list.files(folder.path, pattern = glob2rx(paste(s.name, "_?.csv", sep = "")), full.names = TRUE)

for (k in mylist){
  mydata <- read.csv(k, sep = ",")
  attach(mydata)
  sub_mydata <- data.frame(x = mydata$VarA, y = mydata$VarB)
}
There is another csv file, XZY, that has a specific value for each of my s.name files. It looks like this:

XZY <- data.frame(s.name = c("ABC_1", "ABC_2", "ABC_3"), val = c(2, 6, 8))

What I want inside the for loop is, for the matching s.name, to take val and use it in sub_mydata to calculate z = mydata$VarC * val of ABC_?.
You can try the following with lapply:

s.name <- "ABC"
mylist <- list.files(folder.path, pattern = glob2rx(paste0(s.name, "_?.csv")), full.names = TRUE)

result <- lapply(mylist, function(x) {
  data <- read.csv(x)
  filename <- tools::file_path_sans_ext(basename(x))
  transform(data, VarC = VarC * XZY$val[match(filename, XZY$s.name)])
})
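If you prefer to keep the for loop from your question, a minimal sketch of the same lookup inside the loop (assuming the same VarA/VarB/VarC columns and the XZY table above) could look like this:

for (k in mylist) {
  mydata <- read.csv(k)
  # file name without folder and ".csv", e.g. "ABC_1"
  filename <- tools::file_path_sans_ext(basename(k))
  # look up the value for this file in the XZY table
  val <- XZY$val[match(filename, XZY$s.name)]
  sub_mydata <- data.frame(x = mydata$VarA,
                           y = mydata$VarB,
                           z = mydata$VarC * val)
}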
I have a bunch of csv files that I'm trying to read into R all at once, with each data frame from a csv becoming an element of a list. The loops largely work, but they keep overwriting the list elements. So, for example, if I loop over the first 2 files, both data frames in list[[1]] and list[[2]] will contain the data frame for the second file.
#function to open one group of files named with "cores"
open_csv_core <- function(year, orgtype){
  file <- paste(year, "/coreco.core", year, orgtype, ".csv", sep = "")
  df <- read.csv(file)
  names(df) <- tolower(names(df))
  df <- df[df$ntee1 %in% c("C","D"),]
  df <- df[!(df$nteecc %in% c("D20","D40", "D50", "D60", "D61")),]
  return(df)
}

#function to open one group of files named with "nccs"
open_csv_nccs <- function(year, orgtype){
  file2 <- paste(year, "/nccs.core", year, orgtype, ".csv", sep = "")
  df2 <- read.csv(file2)
  names(df2) <- tolower(names(df2))
  df2 <- df2[df2$ntee1 %in% c("C","D"),]
  df2 <- df2[!(df2$nteecc %in% c("D20","D40", "D50", "D60", "D61")),]
  return(df2)
}
#############################################################################
yrpc <- list()
yrpf <- list()
yrco <- list()
fname <- vector()
file_yrs <- as.character(c(1989:2019))
for(i in 1:length(file_yrs)){
  fname <- list.files(path = file_yrs[i], pattern = NULL)
  #accessing files in a folder and assigning to the proper function to open them based on how the file is named
  for(j in 1:length(fname)){
    if(grepl("pc.csv", fname[j])==T) {
      if(grepl("nccs", fname[j])==T){
        a <- open_csv_nccs(file_yrs[j], "pc")
        yrpc[[paste0(file_yrs[i], "pc")]] <- a
      } else {
        b <- open_csv_core(file_yrs[j], "pc")
        yrpc[[paste0(file_yrs[i], "pc")]] <- b
      }
    } else if (grepl("pf.csv", fname[j])==T){
      if(grepl("nccs", fname[j])==T){
        c <- open_csv_nccs(file_yrs[j], "pf")
        yrpf[[paste0(file_yrs[i], "pf")]] <- c
      } else {
        d <- open_csv_core(file_yrs[j], "pf")
        yrpf[[paste0(file_yrs[i], "pf")]] <- d
      }
    } else {
      if(grepl("nccs", fname[j])==T){
        e <- open_csv_nccs(file_yrs[j], "co")
        yrco[[paste0(file_yrs[i], "co")]] <- e
      } else {
        f <- open_csv_core(file_yrs[j], "co")
        yrco[[paste0(file_yrs[i], "co")]] <- f
      }
    }
  }
}
Actually, both of your csv reading functions do exactly the same thing; only the paths they build are different.
If you list your files with full paths instead of just the file names, you don't need to reconstruct the paths yourself. list.files() does this when you set full.names = TRUE.
The second point is that, for a given year and type, there never seems to be a "nccs.core" file in addition to a "coreco.core" file, so the two cases are mutually exclusive. That means no logic is needed to distinguish them, which simplifies the code.
The third point is that you just want to separate the data frames by file type ("pc", "pf", "co") and year. Instead of creating three separate lists, I would create one results list that contains an inner list for each type.
I would solve it like this:
years <- as.character(1989:2019)
path_to_type <- function(path) gsub(".*(pc|pf|co)\\.csv", "\\1", path)

res <- list("pc" = list(),
            "pf" = list(),
            "co" = list())

lapply(years, function(year) {
  files <- list.files(path = year, pattern = "\\.csv$", full.names = TRUE)
  dfs <- lapply(files, function(path) {
    print(path) # just to signal that the path is getting processed
    df <- read.csv(path)
    file_type <- path_to_type(path)
    names(df) <- tolower(names(df))
    df <- df[df$ntee1 %in% c("C", "D"), ]
    df <- df[!(df$nteecc %in% c("D20", "D40", "D50", "D60", "D61")), ]
    res[[file_type]][[year]] <<- df  # <<- so the assignment reaches the res defined above
  })
})
Now you can pull a data frame out of the results list by file type and year, e.g.:

res[["co"]][["1995"]]
res[["pf"]][["2018"]]

And so on.
Actually, the results of the lapply() calls themselves are not interesting in this case; what matters is the content of res (the results list).
It seems that in your for(j in 1:length(fname)){ ... } you are creating one of the variables a, b, c, d, e or f, and because you keep reusing those variable names, they get overwritten on each iteration.
The "correct" way to do this is to use lapply in place of the for loop. Pass the list of files, and the required function (i.e. open_csv_core, etc.) to lapply, and the return value that you get back is a list of the results.
I want to export a couple of data frames to Excel files using the function write.xlsx() from openxlsx. So, for example, the following:
library(openxlsx)
x <- c(1,2,3)
for (i in x) {
  name <- paste("sheet", i, sep = "")
  assign(name, data.frame(1:4, 2:3))
  path <- paste("/some_directory/", name, ".xlsx", sep = "")
  write.xlsx(name, file = path)
}
This does create three different data frames with the values 1 to 4 and 2 to 3, and those have the right names; it also creates three Excel files with the right names. But the Excel files only contain the name string instead of the values from the data frame. Does anyone know how to fix that?
You need to keep your data frame in a variable:
library(glue)
library(openxlsx)

x <- c(1, 2, 3)
for (i in x) {
  name <- paste("sheet", i, sep = "")
  df <- data.frame(1:4, 2:3) # This step is missing in your example
  assign(name, df)
  path <- glue("/some_directory/{name}.xlsx", name = name)
  write.xlsx(df, file = path)
}
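If you would rather have all three data frames in a single workbook (one sheet each), write.xlsx() also accepts a named list of data frames. A small sketch, reusing the same /some_directory/ path (the file name all_sheets.xlsx is just an example):

library(openxlsx)

# build the same three data frames as a named list
sheets <- lapply(1:3, function(i) data.frame(1:4, 2:3))
names(sheets) <- paste0("sheet", 1:3)

# one workbook, one sheet per list element
write.xlsx(sheets, file = "/some_directory/all_sheets.xlsx")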
I am trying to write a function that combines CSVs into one data frame based on the name of the CSV, and then names the resulting data frame after the pattern passed into the function. Everything works except returning the data frame; I can't figure out how to do that using the function input as the name of the data frame.
I tried what is in this post, but I think the issue is that I only know the name of the data frame from the function input: Return a data frame from function
#### Create files in your current working directory ####
dir <- getwd()
subDir <- 'temp'
dir.create(subDir)
setwd(file.path(dir, subDir))
dir.create('Run1')
dir.create('Run2')
employeeID <- c('123','456','789')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employeeID <- c('123','456','789')
first <- c('John','Jane','Tom')
last <- c('Doe','Smith','Franks')
data <- data.frame(employeeID,salary,startdate)
name <- data.frame(employeeID,first,last)
write.csv(data, file = "Run1/data.csv",row.names=FALSE, na="")
write.csv(name, file = "Run1/name.csv",row.names=FALSE, na="")
employeeID <- c('465','798','132')
salary <- c(100000, 500000, 300000)
startdate <- as.Date(c('2000-11-1','2001-3-25','2003-3-14'))
employeeID <- c('465','798','132')
first <- c('Jay','Susan','Tina')
last <- c('Jones','Smith','Thompson')
data <- data.frame(employeeID,salary,startdate)
name <- data.frame(employeeID,first,last)
write.csv(data, file = "Run2/data.csv",row.names=FALSE, na="")
write.csv(name, file = "Run2/name.csv",row.names=FALSE, na="")
#### function ####
files_to_df <- function(pattern){
  # pattern <- "data"
  filenames <- list.files(recursive = TRUE, pattern = pattern)
  df_list <- lapply(filenames, read.csv, header = TRUE)
  # Name each dataframe with the run and filename
  names(df_list) <- str_sub(list, 1, 4)
  # Create combined dataframe
  df <- df_list %>%
    bind_rows(.id = 'run')
  # Assign dataframe to the name of the pattern
  assign(pattern, df)
  # Return the dataframe
  return(data.frame(pattern))
  #list2env(pattern,.GlobalEnv)
}

#### Run function ####
files_to_df(c("data"))
I made two changes to your code:
1.) str_sub(list, 1, 4) -> str_sub(filenames, 1, 4)
list is a function and doesn't contain your file names.
2.) return(data.frame(pattern)) -> return(df)
so that the function returns the data frame and not a string.
library(dplyr)   # for %>% and bind_rows()
library(stringr) # for str_sub()

files_to_df <- function(pattern){
  # pattern <- "data"
  filenames <- list.files(recursive = TRUE, pattern = pattern)
  df_list <- lapply(filenames, read.csv, header = TRUE)
  # Name each dataframe with the run and filename
  names(df_list) <- str_sub(filenames, 1, 4)
  # Create combined dataframe
  df <- df_list %>%
    bind_rows(.id = 'run')
  # Assign dataframe to the name of the pattern
  assign(pattern, df)
  # Return the dataframe
  return(df)
  #list2env(pattern,.GlobalEnv)
}
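With this version the combined data frame is returned directly, so you can just capture the return value. A small usage sketch against the Run1/Run2 folders created above (data_combined is an illustrative name):

data_combined <- files_to_df("data")
head(data_combined)  # columns: run ("Run1"/"Run2"), employeeID, salary, startdate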
I need to run the same set of code for multiple CSV files, the same way I would with a macro. Below is the code that I am executing, but the results are not coming out properly: it is reading the data in a 2-d format while I need it in a 3-d format.
lf <- list.files(path = "D:/THD/data", pattern = ".csv",
                 full.names = TRUE, recursive = TRUE, include.dirs = TRUE)
ds <- lapply(lf, read.table)
I don't know if this is going to be useful, but one way I do it is:
##Step 1 read files
mycsv = dir(pattern=".csv")
n <- length(mycsv)
mylist <- vector("list", n)
for(i in 1:n) mylist[[i]] <- read.csv(mycsv[i],header = T)
Then I usually just use an apply function to change things, for example,
## Change column names
mylist <- lapply(mylist, function(x) {names(x) <- c("type","date","v1","v2","v3","v4","v5","v6","v7","v8","v9","v10","v11","v12","v13","v14","v15","v16","v17","v18","v19","v20","v21","v22","v23","v24","total") ; return(x)})
## Change the type column for weekday/weekend
mylist <- lapply(mylist, function(x) {
  f = c("we", "we", "wd", "wd", "wd", "wd", "wd")
  x$type = rep(f, 52, length.out = 365)
  return(x)
})
and so on.
Then, after all the changes, I save with the following code (it is also sometimes useful to split the original file name and save each file under a new name built from part of it, so that I can track the individual files later).
## For example, some of my files had a name pattern such as "201_E424220_N563500.csv", so I split this and save with a new name like this:
mylist <- lapply(1:length(mylist), function(i) {
  mylist.i <- mylist[[i]]
  s = strsplit(mycsv[i], "_", fixed = TRUE)[[1]]
  d = cbind(mylist.i[, c("type", "date")], ID = s[1], Easting = s[2], Northing = s[3], mylist.i[, 3:ncol(mylist.i)])
  return(d)
})

for(i in 1:n)
  write.csv(file = paste("file", i, ".csv", sep = ""), mylist[i], row.names = F)
I hope this will help. When you get some time, please read about the plyr package; I am sure it will be very useful for you. It is a very handy package with lots of data analysis options. plyr has apply functions such as:
## l_ply: split a list, apply a function and discard the results
## ldply: split a list, apply a function and return the results in a data frame
## laply: split a list, apply a function and return the results in an array
For example, you can use ldply to read all your CSVs and return a data frame, something like:
library(plyr)

data <- ldply(list.files(pattern = ".csv"), function(fname) {
  j <- read.csv(fname, header = TRUE)
  return(j)
})
So here data will be a single data frame containing the data from all your CSV files.
Thanks, Ayan
I am new to R and am trying to do some correlation analysis on multiple sets of data. I am able to do the analysis, but I am trying to figure out how I can output the results. I'd like to have output like the following:
NAME,COR1,COR2
....,....,....
....,....,....
If I could write such a file to output, then I can post process it as needed. My processing script looks like this:
run_analysis <- function(logfile, name)
{
  preds <- read.table(logfile, header = TRUE, sep = ",")
  # do something with the data: create some_col, another_col, etc.
  result1 <- cor(some_col, another_col)
  result2 <- cor(some_col2, another_col2)
  # somehow output name, result1, result2 to a CSV file
}
args <- commandArgs(trailingOnly = TRUE)
date <- args[1]
basepath <- args[2]
logbase <- paste(basepath, date, sep="/")
logfile_pattern <- paste( "*", date, "csv", sep=".")
logfiles <- list.files(path=logbase, pattern=logfile_pattern)
for (f in logfiles) {
  name = unlist(strsplit(f, "\\."))[1]
  logfile = paste(logbase, f, sep = "/")
  run_analysis(logfile, name)
}
Is there an easy way to create a blank data frame and then add data to it, row by row?
Have you looked at the functions in R for writing data to files, for instance write.table? (Note that write.csv ignores append = TRUE, so use write.table with sep = "," when appending.) Perhaps something like this:

rs <- data.frame(name = name, COR1 = result1, COR2 = result2)
write.table(rs, "path/to/file", sep = ",", append = TRUE,
            col.names = FALSE, row.names = FALSE)
I like using the foreach library for this sort of thing:
library(foreach)
run_analysis <- function(logfile, name) {
  preds <- read.table(logfile, header = TRUE, sep = ",")
  # do something with the data: create some_col, another_col, etc.
  result1 <- cor(some_col, another_col)
  result2 <- cor(some_col2, another_col2)
  # Return one row of results.
  data.frame(name = name, cor1 = result1, cor2 = result2)
}
args <- commandArgs(trailingOnly = TRUE)
date <- args[1]
basepath <- args[2]
logbase <- paste(basepath, date, sep="/")
logfile_pattern <- paste( "*", date, "csv", sep=".")
logfiles <- list.files(path=logbase, pattern=logfile_pattern)
## Collect results from run_analysis into a table, by rows.
dat <- foreach (f = logfiles, .combine = "rbind") %do% {
  name = unlist(strsplit(f, "\\."))[1]
  logfile = paste(logbase, f, sep = "/")
  run_analysis(logfile, name)
}
## Write output.
write.csv(dat, "output.dat", quote=FALSE)
What this does is generate one row of output on each call to run_analysis and bind those rows into a single table called dat (the .combine="rbind" part of the call to foreach causes the row binding). Then you can just use write.csv to get the output you want.
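If you prefer not to add the foreach dependency, the same row binding can be sketched in base R with lapply() and do.call(rbind, ...), under the same assumptions about run_analysis, logbase and logfiles as above:

# one row of results per log file, collected in a list
rows <- lapply(logfiles, function(f) {
  name <- unlist(strsplit(f, "\\."))[1]
  logfile <- paste(logbase, f, sep = "/")
  run_analysis(logfile, name)
})

# bind the rows into one data frame and write it out
dat <- do.call(rbind, rows)
write.csv(dat, "output.dat", quote = FALSE)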