Import multiple xlsx files in R

I have several xlsx files in a directory with the same structure (i.e. column A,B,C); every file is the data of one day.
I need to import all the data in R and find the differences between one day and the next one.
files <- list.files(pattern = ".xlsx")
for (i in seq_along(files)) {
assign(paste("Day", i, sep = "."), read.xlsx(files[i]))
}
I can't figure out how to use the imported data.
For example
Day.1 <- data.frame(Day.1)
Day.1$A <- as.character(Day.1$A)
Day.2 <- data.frame(Day.2)
Day.2$A <- as.character(Day.2$A)
anti_join (Day.1, Day.2)
This code works fine but how should it be with a variable?
Day.[i] <- data.frame(Day.[i])
Day.[i]$A <- as.character(Day.[i]$A)
Day.[i+1] <- data.frame(Day.[i+1])
Day.[i+1]$A <- as.character(Day.[i+1]$A)
anti_join (Day.[i], Day.[i+1])
I tried to import all the files in a single data frame but I have a similar problem about how to use the new data
file.list <- list.files(pattern='*.xlsx')
days.list <- lapply(file.list, read_excel)
days <- rbindlist(days.list, idcol = "id")
days <- data.frame(days)
days$B <- as.character(days$B)
But I don't know how to do something like:
day1 <- filter(days, id==1)
day2 <- filter(days, id==2)
diff1 <- anti_join (day1, day2, by=c("B", "C"))
using a counter variable (i)
day(i) <- filter(days, id==(i))
day(i+1) <- filter(days, id==(i+1))
diff1 <- anti_join (day1, day2, by=c("B", "C"))

Consider using base R's Map (wrapper to mapply) between a dataframe list of (days) and (days + 1), respectively the left and right sides of dplyr::anti_join. Of course the very last day will not have a forward day comparison.
library(xlsx)
library(dplyr)
file.list <- list.files(pattern = "\\.xlsx$") # pattern IS A REGEX, NOT A GLOB
df.list <- lapply(file.list, function(f){
read.xlsx(f, 1, stringsAsFactors = FALSE)
})
left_days <- df.list[-length(df.list)] # SUBSET OUT LAST DAY
right_days <- df.list[-1] # SUBSET OUT FIRST DAY
# WITHOUT ARGS
anti_join_list <- Map(anti_join, left_days, right_days)
# WITH ARGS
anti_join_list <- Map(function(x,y) anti_join(x, y, by=c("B", "C")), left_days, right_days)
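The same pairing idea can be tried on in-memory data, without any Excel files. This is only a sketch: the three small data frames are made up, and `anti_join_base` is a hypothetical base-R stand-in for `dplyr::anti_join`, not the dplyr function itself.

```r
# Hypothetical daily tables with the same columns (A, B, C)
day1 <- data.frame(A = c("x", "y", "z"), B = c("b1", "b2", "b3"), C = c(1, 2, 3),
                   stringsAsFactors = FALSE)
day2 <- data.frame(A = c("x", "y"), B = c("b1", "b2"), C = c(1, 2),
                   stringsAsFactors = FALSE)
day3 <- data.frame(A = c("x", "w"), B = c("b1", "b4"), C = c(1, 4),
                   stringsAsFactors = FALSE)
days <- list(Day.1 = day1, Day.2 = day2, Day.3 = day3)

# Base-R stand-in for dplyr::anti_join(): rows of x with no match in y on `by`
anti_join_base <- function(x, y, by) {
  key_x <- do.call(paste, c(x[by], sep = "\r"))
  key_y <- do.call(paste, c(y[by], sep = "\r"))
  x[!key_x %in% key_y, , drop = FALSE]
}

# Pair each day with the following one: (1, 2), (2, 3), ...
left_days  <- head(days, -1)
right_days <- tail(days, -1)
diffs <- Map(function(x, y) anti_join_base(x, y, by = c("B", "C")),
             left_days, right_days)

diffs[["Day.1"]]  # rows present on day 1 but missing on day 2
```

If the days are already stacked in one data frame with an id column, `split(days, days$id)` gives back a list of this shape, so the same `Map` call applies.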

Related

automation to merge data frames adding a line to keep note of the origin

I am a newbie with R. I have 6 different data frames (U, V, W, X, Y, Z), coming from different CSV files. Each of them has the same columns (Surname, Name, Winter, Spring, Summer), and I would like to create a new data frame containing the 5 columns plus a sixth column that indicates which of the letters (U, V, ...) the original data comes from. I have tried the following code:
U <- read.csv(file = "U", header = T)
V <- read.csv(file = "V", header = T)
W <- read.csv(file = "W", header = T)
X <- read.csv(file = "X", header = T)
Y <- read.csv(file = "Y", header = T)
Z <- read.csv(file = "Z", header = T)
U['class'] <- rep("U")
V['class'] <- rep("V")
W['class'] <- rep("W")
X['class'] <- rep("X")
Y['class'] <- rep("Y")
Z['class'] <- rep("Z")
students <- rbind(U, V, W, X, Y, Z)
I would really need to use a loop, so that I can in future go from A to Z. I would like to do something like this, which is totally nonsense.
for(class.name in list(U, V, W, X, Y, Z)){
class.name['class'] <- rep('class')
}
Is there a reasonable way to do it?
Thank you
Edited
To clarify my question, the idea is that I have 6 different stations collecting raw data and giving me 6 different data frames. I want to merge them together, maintaining the information of from which station the raw data comes from.
Possible incomplete solution
Following @MrFlick's advice, I have managed to put everything in one list as follows
classes <- c('U', 'V', 'W', 'X', 'Y', 'Z')
my.files <- paste(classes,".csv",sep="")
year.eight <- lapply(my.files, read.csv, header = T)
names(year.eight) <- classes
However, the final outcome should be one single data frame with a further column to indicate which class are the students in. Can someone help me with this, please?
Let me try to share an example
Suppose we have 3 files A.csv, B.csv and C.csv in a folder called "data" within our working directory. Suppose they contain a single column with a numeric value. Then this code does what you want.
library(readr)
files <- paste0("data/", list.files("data"))
df_list <- list()
for (i in seq_along(files)) {
tmp <- read_csv(files[[i]])
tmp["class"] <- sub("\\..*", "", basename(files[[i]])) # ".csv$" also works in this case
df_list[[i]] <- tmp
}
output <- dplyr::bind_rows(df_list)
output
## # A tibble: 3 x 2
##       x class
##   <dbl> <chr>
## 1     1 A
## 2     1 B
## 3     1 C
Edited following Tensibai's excellent suggestion.
To do this more easily with a list of data.frames, it might look something like this
classes <- c('U', 'V', 'W', 'X', 'Y', 'Z')
my.files <- paste(classes,".csv",sep="")
year.eight <- mapply(function(path, code) {
data <- read.csv(path, header = T)
data$class <- code
data
}, my.files, classes, SIMPLIFY = FALSE) # keep a list of data frames, do not simplify to a matrix
combined <- do.call("rbind", year.eight)
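One thing worth checking with the mapply approach: by default mapply tries to simplify its result, and data frames with the same number of columns get collapsed into a matrix of columns rather than kept as a list. The sketch below uses two made-up tables in place of the CSV files, and `add_class` is a hypothetical helper, to show the difference between the default and `Map` (which is mapply with `SIMPLIFY = FALSE`).

```r
# Two hypothetical tables standing in for the result of read.csv()
tables  <- list(data.frame(x = 1:2, y = c("a", "b")),
                data.frame(x = 3:4, y = c("c", "d")))
classes <- c("U", "V")

add_class <- function(data, code) {
  data$class <- code
  data
}

# Default simplification collapses the equal-length results into a matrix
simplified <- mapply(add_class, tables, classes)
is.matrix(simplified)

# Map() keeps one data frame per input, which is what rbind needs
kept <- Map(add_class, tables, classes)
combined <- do.call(rbind, kept)
combined
```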
Or using dplyr
classes <- c('U', 'V', 'W', 'X', 'Y', 'Z')
my.files <- paste(classes,".csv",sep="")
year.eight <- lapply(my.files, read.csv, header = T)
names(year.eight) <- classes
combined <- dplyr::bind_rows(year.eight, .id="class")
If you save all the files of interest in one directory, you can list them with list.files() and then loop over them with map_df from the purrr package. I think this does the trick:
#Load package
library(purrr)
#Define the directory where files are saved
path <- "your_file_path/" #e.g. my Mac desktop "~/Desktop/"
#Create vector of file names
files <- list.files(path)
#Use map_df function from purrr to loop over and return a data frame with extra label variable
map_df(files, function(x){
#save as df
df <- read.csv(paste0(path, x)) # path already ends with "/"
#use gsub to remove ".csv" from file name
df['class'] <- gsub("\\.csv", "", x)
df
})

Rbinding large list of dataframes after I did some data cleaning on the list

My problem is that I can't merge a large list of dataframes before doing some data cleaning, but my data cleaning seems to be missing from the list.
I have 43 xlsx-files, which I've put in a list.
Here's my code for that part:
file.list <- list.files(recursive=T,pattern='*.xlsx')
dat = lapply(file.list, function(i){
x = read.xlsx(i, sheet=1, startRow=2, colNames = T,
skipEmptyCols = T, skipEmptyRows = T)
# Create column with file name
x$file = i
# Return data
x
})
I then did some datacleaning. Some of the dataframes had some empty columns that weren't skipped in the loading and some columns I just didn't need.
Example of how I removed one column (X1) from all dataframes in the list:
dat <- lapply(dat, function(x) { x["X1"] <- NULL; x })
I also applied column names:
colnames <- c("ID", "UDLIGNNR","BILAGNR", "AKT", "BA",
"IART", "HTRANS", "DTRANS", "BELOB", "REGD",
"BOGFD", "AFVBOGFD", "VALORD", "UDLIGND",
"UÅ", "AFSTEMNGL", "NRBASIS", "SPECIFIK1",
"SPECIFIK2", "SPECIFIK3", "PERIODE","FILE")
dat <- lapply(dat, setNames, colnames)
My problem is, when I open the list or look at the elements in the list, my data cleaning is missing.
And I can't bind the dataframes before the data cleaning, since they don't all have the same structure.
What am I doing wrong here?
EDIT: Sample data
# Sample data
a <- c("a","b","c")
b <- c(1,2,3)
X1 <- c("", "","")
c <- c("a","b","c")
X2 <- c(1,2,3)
X1 <- c("", "","")
df1 <- data.frame(a,b,c,X1)
df2 <- data.frame(a,b,c,X1,X2)
# Putting in list
dat <- list(df1,df2)
# Removing unwanted columns
dat <- lapply(dat, function(x) { x["X1"] <- NULL; x })
dat <- lapply(dat, function(x) { x["X2"] <- NULL; x })
# Setting column names
colnames <- c("Alpha", "Beta", "Gamma")
dat <- lapply(dat, setNames, colnames)
# Merging dataframes
df <- do.call(rbind,dat)
So I've just found that with my sample data this goes smoothly.
I had to reopen the list in View mode to see the changes I made. That doesn't change the fact that when writing to csv and reopening, all the data cleaning is missing (I haven't tried this with my sample data).
I am wondering if it's because I've changed the merge?
# My merge when I wrote this question:
df <- do.call("rbindlist", dat)
# My merge now:
df <- do.call(rbind,dat)
When I use my real data it doesn't go as smoothly, so I guess the sample data is bad. I don't know what I'm doing wrong, so I can't give better sample data.
The message I get when merging with rbind:
Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match
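The error message says that, at the point of binding, the frames in the list do not all have the same number of columns. A possible way to track this down is sketched below, under assumptions: the list `dat` is made-up sample data (one frame carries an extra empty column with a different name), and `wanted` is a hypothetical set of columns to keep. Selecting the wanted columns by name, instead of deleting unwanted ones one by one, leaves every frame with the same shape.

```r
# Made-up list where the second frame carries an extra, differently named empty column
dat <- list(
  data.frame(a = c("a", "b"), b = 1:2, X1 = c("", "")),
  data.frame(a = c("c", "d"), b = 3:4, X1 = c("", ""), X7 = c(NA, NA))
)

# Step 1: see which frames differ before binding
sapply(dat, ncol)
lapply(dat, names)

# Step 2: keep only the wanted columns, by name, in a fixed order
wanted <- c("a", "b")
dat_clean <- lapply(dat, function(x) x[, wanted, drop = FALSE])

# Step 3: every frame now has the same columns, so rbind() works
df <- do.call(rbind, dat_clean)
df
```

With real data, a file that lacks one of the wanted columns makes step 2 fail with "undefined columns selected", which points at the offending file. Note also that `data.table::rbindlist` takes the list itself, as in `rbindlist(dat, fill = TRUE)`, rather than being called through `do.call`.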

Reading Excel file into R with operator in tab name

I am reading a file with different tabs into R. However, the tab names were changed and now contain operators, which R doesn't seem to like. For instance (and this is where the error occurs) "Storico_G1" became "Storico_G+1".
I post the code below, but the error occurs early on. I am basically looking for a workaround, or a way to change the tab names before I create the data.frames.
NB: I left the code as it was before the tab name changed from "Storico_G1" to "Storico_G+1", as I think it's easier to grasp this way.
Can anybody guide me in the right direction? Many thanks in advance!
library(ggplot2)
library(lubridate)
library(openxlsx)
library(reshape2)
library(dplyr)
library(scales)
Storico_G <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2018/DatiOperativi_2018-IT.xlsx",sheet = "Storico_G", startRow = 1, colNames = TRUE)
Storico_G1 <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2018/DatiOperativi_2018-IT.xlsx", startRow = 1, colNames = TRUE)
# Selecting Column C,E,R from Storico_G and stored in variable Storico_G_df
# Selecting Column A,P from Storico_G+1 and stored in variable Storico_G1_df
Storico_G_df <- data.frame(Storico_G$pubblicazione,Storico_G$IMMESSO, Storico_G$`RICONSEGNATO.(1)`, Storico_G$BILANCIAMENTO.RESIDUALE )
Storico_G1_df <- data.frame(Storico_G1$pubblicazione, Storico_G1$`SBILANCIAMENTO.ATTESO.DEL.SISTEMA.(SAS)`)
# Converting pubblicazione to date-time format
Storico_G_df$pubblicazione <- ymd_h(Storico_G_df$Storico_G.pubblicazione)
Storico_G1_df$pubblicazione <- ymd_h(Storico_G1_df$Storico_G1.pubblicazione)
# Selecting only the rows with the 4PM value in the Storico_G+1 excel sheet tab
Storico_G1_df <- subset(Storico_G1_df, hour(Storico_G1_df$pubblicazione) == 16)
rownames(Storico_G1_df) <- 1:nrow(Storico_G1_df)
# Averaging hourly values to 1 daily data point in G excel sheet tab
Storico_G_df$Storico_G.pubblicazione <- strptime(Storico_G_df$Storico_G.pubblicazione, "%Y_%m_%d_%H")
storico_G_df_agg <- aggregate(Storico_G_df, by=list(day=format(Storico_G_df$Storico_G.pubblicazione, "%F")), FUN=mean, na.rm=TRUE)[,-2]
#cbind.fill function
cbind.fill <- function(...){
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function (x)
rbind(x, matrix(, n-nrow(x), ncol(x)))))
}
#cbind with both frames
G_G1_df= data.frame(cbind.fill(storico_G_df_agg,Storico_G1_df))
#keep required columns
keep=c("day", "Storico_G.IMMESSO","Storico_G..RICONSEGNATO..1..","Storico_G1..SBILANCIAMENTO.ATTESO.DEL.SISTEMA..SAS..")
#update dataframe to kept variables
G_G1_df=G_G1_df[,keep,drop=FALSE]
#Rename crazy variable names
G_G1_df <- data.frame(G_G1_df) %>%
select(day, Storico_G.IMMESSO, Storico_G..RICONSEGNATO..1.., Storico_G1..SBILANCIAMENTO.ATTESO.DEL.SISTEMA..SAS..)
names(G_G1_df) <- c("day", "Immesso","Riconsegnato", "SAS")
#Melt time series
G_G1_df=melt(G_G1_df,id.vars = "day")
#Create group variable
G_G1_df$group<- ifelse(G_G1_df$variable == "SAS", "SAS", "Immesso/Consegnato")
#plot
ggplot(G_G1_df, aes(as.Date(day),as.numeric(value),col=variable))+geom_point()+geom_line()+facet_wrap(~group,ncol=1,scales="free_y")+labs(x="Month", y="Values") +scale_x_date(labels=date_format("%m-%Y"))+geom_abline(intercept=c(-2,0,2),slope=0,data=subset(G_G1_df,group=="SAS"),lwd=0.5,lty=2)
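One possible workaround is to look the tab up by its exact name and then read it by position, so the name is never passed to the reader. The sketch below assumes that the "+" in the tab name is really what causes the failure, which the question does not confirm. `read_sheet_by_name` is a hypothetical helper; `getSheetNames` and the numeric `sheet` argument of `read.xlsx` come from openxlsx.

```r
# Hypothetical helper: match the tab name as plain text, then pass its
# position (not its name) on to openxlsx::read.xlsx().
read_sheet_by_name <- function(path, sheet_name, ...) {
  sheets <- openxlsx::getSheetNames(path)  # character vector of tab names
  idx <- match(sheet_name, sheets)          # exact string match, "+" is not parsed
  if (is.na(idx)) {
    stop("Sheet not found: ", sheet_name,
         ". Available sheets: ", paste(sheets, collapse = ", "))
  }
  openxlsx::read.xlsx(path, sheet = idx, ...)
}

# Usage (not run here): download the workbook once to a local file, then read both tabs.
# `url` stands for the workbook address used in the question.
# local_file <- tempfile(fileext = ".xlsx")
# download.file(url, local_file, mode = "wb")
# Storico_G  <- read_sheet_by_name(local_file, "Storico_G",   startRow = 1, colNames = TRUE)
# Storico_G1 <- read_sheet_by_name(local_file, "Storico_G+1", startRow = 1, colNames = TRUE)
```

Downloading once also avoids fetching the same workbook twice, as the two read.xlsx calls on the URL do. The column names derived from the data frame name (for example Storico_G1.pubblicazione) stay the same as long as the R object is still called Storico_G1.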

How to read and use the dataframes with the different names in a loop?

I'm struggling with the following issue: I have many data frames with different names (For instance, Beverage, Construction, Electronic etc., dim. 540x1000). I need to clean each of them, calculate and save as zoo object and R data file. Cleaning is the same for all of them - deleting the empty columns and the columns with some specific names.
For example:
Beverages <- Beverages[,colSums(is.na(Beverages))<nrow(Beverages)] #removing empty columns
Beverages_OK <- Beverages %>% select (-starts_with ("X.ERROR")) # dropping X.ERROR column
Beverages_OK[, 1] <- NULL #dropping the first column
Beverages_OK <- cbind(data[1], Beverages_OK) # adding a date column
Beverages_zoo <- read.zoo(Beverages_OK, header = FALSE, format = "%Y-%m-%d")
save (Beverages_OK, file = "StatisticsInRFormat/Beverages.RData")
I tried to use the 'lapply' function like this:
list <- ls() # the list of all the dataframes
lapply(list, function(X) {
temp <- X
temp <- temp [,colSums(is.na(temp))< nrow(temp)] #removing empty columns
temp <- temp %>% select (-starts_with ("X.ERROR")) # dropping X.ERROR column
temp[, 1] <- NULL
temp <- cbind(data[1], temp)
X_zoo <- read.zoo(X, header = FALSE, format = "%Y-%m-%d") # I don't know how to have the same name as X has.
save (X, file = "StatisticsInRFormat/X.RData")
})
but it doesn't work. Is there any way to do such a job? Is there any R package that facilitates it?
Thanks a lot.
If you are sure that you have only the needed data frames in the environment, this should get you started:
df1 <- mtcars
df2 <- mtcars
df3 <- mtcars
list <- ls()
lapply(list, function(x) {
tmp <- get(x)
})
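To take the get() idea one step further, here is a minimal sketch that fetches the data frames by name, cleans each one, and writes one file per frame named after the original object. The two small data frames and the helper `clean_one` are made up for illustration. The sketch uses `saveRDS` in place of `save`, writes to a temporary directory, and leaves out the zoo conversion from the question.

```r
# Hypothetical stand-ins for the real data frames
Beverages    <- data.frame(date = c("2020-01-01", "2020-01-02"),
                           price = c(1.5, 1.7),
                           empty = c(NA, NA),
                           X.ERROR.1 = c("e", "e"))
Construction <- data.frame(date = c("2020-01-01", "2020-01-02"),
                           price = c(10, 11),
                           X.ERROR = c("e", "e"))

# Name the data frames explicitly instead of relying on everything in ls()
df_names <- c("Beverages", "Construction")
frames   <- mget(df_names)  # named list of data frames

clean_one <- function(x) {
  x <- x[, colSums(is.na(x)) < nrow(x), drop = FALSE]  # drop all-NA columns
  x[, !startsWith(names(x), "X.ERROR"), drop = FALSE]   # drop X.ERROR* columns
}
cleaned <- lapply(frames, clean_one)

# Save each cleaned frame to its own file, named after the original object
out_dir <- file.path(tempdir(), "StatisticsInRFormat")
dir.create(out_dir, showWarnings = FALSE)
for (nm in names(cleaned)) {
  saveRDS(cleaned[[nm]], file.path(out_dir, paste0(nm, ".rds")))
}
list.files(out_dir)
```

Because lapply keeps the names of the list, `cleaned$Beverages` is the cleaned version of Beverages, and `readRDS` reads a saved file back under any name you choose.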

Loop over a subset, source a file and save results in a dataframe

Similar questions have been asked already, but none solved my specific problem. I have an .R file ("Mycalculus.R") containing many basic calculations that I need to apply to subsets of a dataframe: one subset for each year, where the levels of "year" are factors (yearA, yearB, yearC), not numeric values. The file generates a new dataframe that I need to save in an Rda file. Here is what I expect the code to look like with a for loop (this one obviously does not work):
id <- identif(unlist(df$year))
for (i in 1:length(id)){
data <- subset(df, year == id[i])
source ("Mycalculus.R", echo=TRUE)
save(content_df1,file="myresults.Rda")
}
Here is an extract of the main data.frame df:
obs  year   income  gender  ageclass  weight
1    yearA  1000    F       1         10
2    yearA  1200    M       2         25
3    yearB  1400    M       2         5
4    yearB  1350    M       1         11
Here is what the sourced file "Mycalculus.R" does: it applies numerous basic calculations to columns of the dataframe called "data", and creates two new dataframes, df1 and then df2 based on df1. Here is an extract:
data <- data %>%
group_by(gender) %>%
mutate(Income_gender = weighted.mean(income, weight))
data <- data %>%
group_by(ageclass) %>%
mutate(Income_ageclass = weighted.mean(income, weight))
library(GiniWegNeg)
gini=c(Gini_RSV(data$Income_gender, weight), Gini_RSV(data$Income_ageclass,weight))
df1=data.frame(gini)
colnames(df1) <- c("Income_gender","Income_ageclass")
rownames(df1) <- c("content_df1")
df2=(1/5)*df1$Income_gender+df1$Income_ageclass
colnames(df2) <- c("myresult")
rownames(df2) <- c("content_df2")
So that in the end, I get two dataframes like this:
Income_Gender Income_Ageclass
content_df1 .... ....
And for df2:
myresult
content_df2 ....
But I need to save df1 and df2 as an Rda file where the row names content_df1 and content_df2 are given per subset, something like this:
Income_Gender Income_Ageclass
content_df1_yearA .... ....
content_df1_yearB .... ....
content_df1_yearC .... ....
and
myresult
content_df2_yearA ....
content_df2_yearB ....
content_df2_yearC ....
Currently, my program does not use any loop and is doing the job but messily. Basically the code is more than 2500 lines of code. (please don't throw tomatoes at me).
Anyone could help me with this specific request?
Thank you in advance.
Consider incorporating all in one script with a defined function of needed arguments, called by lapply(). Lapply then returns a list of dataframes that you can rowbind into one final df.
library(dplyr)
library(GiniWegNeg)
runIncomeCalc <- function(data, y){
data <- data %>%
group_by(gender) %>%
mutate(Income_gender = weighted.mean(income, weight))
data <- data %>%
group_by(ageclass) %>%
mutate(Income_ageclass = weighted.mean(income, weight))
gini <- c(Gini_RSV(data$Income_gender, data$weight), Gini_RSV(data$Income_ageclass, data$weight)) # weight is a column of data
df1 <- data.frame(gini)
colnames(df1) <- c("Income_gender","Income_ageclass")
rownames(df1) <- c(paste0("content_df1_", y))
return(df1)
}
runResultsCalc <- function(df, y){
# Wrap the result in a data frame so it can carry column and row names
df2 <- data.frame(myresult = (1/5) * df$Income_gender + df$Income_ageclass)
rownames(df2) <- paste0("content_df2_", y)
return(df2)
}
dfIncList <- lapply(unique(df$year), function(i) {
yeardata <- subset(df, year == i)
runIncomeCalc(yeardata, i)
})
dfResList <- lapply(unique(df$year), function(i) {
yeardata <- subset(df, year == i)
df <- runIncomeCalc(yeardata, i)
runResultsCalc(df, i)
})
df1 <- do.call(rbind, dfIncList)
df2 <- do.call(rbind, dfResList)
Now, if you need to source across scripts, create the same two functions, runIncomeCalc and runResultsCalc, in Mycalculus.R and then call each of them in the other script:
library(dplyr)
library(GiniWegNeg)
if(!exists("runIncomeCalc", mode="function")) source("Mycalculus.R")
dfIncList <- lapply(unique(df$year), function(i) {
yeardata <- subset(df, year == i)
runIncomeCalc(yeardata, i)
})
dfResList <- lapply(unique(df$year), function(i) {
yeardata <- subset(df, year == i)
df <- runIncomeCalc(yeardata, i)
runResultsCalc(df, i)
})
df1 <- do.call(rbind, dfIncList)
df2 <- do.call(rbind, dfResList)
If you functional-ize your steps you can create a workflow like the following:
calcFunc <- function(df) {
## Do something to the df, then return it
df
}
processFunc <- function(fname) {
## Read in your table
x <- read.table(fname)
## Do the calculation
x <- calcFunc(x)
## Make a new file name (remember to change the file extension)
new_fname <- sub("something", "else", fname)
## Write the .RData file
save(x, file = new_fname)
}
### Your workflow
## Generate a vector of files
my_files <- list.files()
## Do the work
res <- lapply(my_files, processFunc)
Alternatively, don't save the files. Omit the save call in processFunc and return the data.frame instead, so that lapply gives you a list of data.frame objects. Then use either data.table::rbindlist(res) or do.call(rbind, res) to make one large data.frame object.
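The return-a-list-then-bind idea can be sketched with the sample rows from the question. Here `calc_year` is a hypothetical stand-in for the real per-year calculation (a weighted mean of income, not the Gini step), and the result is written to a temporary directory.

```r
# Sample rows from the question
df <- data.frame(
  obs      = 1:4,
  year     = c("yearA", "yearA", "yearB", "yearB"),
  income   = c(1000, 1200, 1400, 1350),
  gender   = c("F", "M", "M", "M"),
  ageclass = c(1, 2, 2, 1),
  weight   = c(10, 25, 5, 11)
)

# Stand-in for the real per-year calculation: one-row data frame per subset
calc_year <- function(data) {
  data.frame(mean_income = weighted.mean(data$income, data$weight))
}

# One subset per year, one result per subset
by_year <- split(df, df$year)
res     <- lapply(by_year, calc_year)

# Row-bind and label each row with the year it came from
df1 <- do.call(rbind, res)
rownames(df1) <- paste0("content_df1_", names(res))
df1

# Save once, after the loop, so earlier years are not overwritten
save(df1, file = file.path(tempdir(), "myresults.Rda"))
```

Saving once after binding avoids the problem in the original for loop, where each iteration would overwrite myresults.Rda with the latest year only.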
