Here is an example of my code:
library(Rcpp)
library(readxl)
sheets <- readxl::excel_sheets("~/data.xlsx")
sheet_names <- sheets[grepl("String", sheets, ignore.case = TRUE)]
for (i in 1:length(sheet_names)){
dataset[i] <- read_excel("data.xlsx", sheet = sheet_names[i])
}
What I would like this to do is to return a dataset named "dataset1" where i=1 and "dataset2" where i=2 and so on. Alternatively, I would like to use the name of the sheet itself i.e. sheet_names[i] but when attempting to use that it overwrites the strings in the variable.
I would be grateful for any suggestions on this.
Consider storing the data in a list instead of creating multiple dataframes in the global environment. Separate objects are difficult to manage, and they pollute the global environment.
library(readxl)
sheets <- excel_sheets("~/data.xlsx")
sheet_names <- grep("String", sheets, ignore.case = TRUE, value = TRUE)
list_data <- lapply(sheet_names, function(x) read_excel("~/data.xlsx", sheet = x))
list_data is a list of dataframes. If you need to access individual dataframes you can use list_data[[1]] to get the 1st dataframe, list_data[[2]] to get the 2nd one and so on.
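Since the question also asks about using the sheet names, here is a minimal sketch of naming the list elements after their sheets. The toy data frames stand in for what read_excel() would return, and the sheet names "String_A" and "String_B" are made up for illustration.

```r
# Toy stand-ins for the data frames that read_excel() would return
# (the sheet names here are hypothetical)
sheet_names <- c("String_A", "String_B")
list_data <- lapply(seq_along(sheet_names), function(i) data.frame(id = i))

# Name each list element after its sheet
names(list_data) <- sheet_names

# Access by sheet name or by position
list_data[["String_A"]]
list_data[[2]]

# Only if separate objects are really needed: copy the elements into an
# environment (a fresh one here; .GlobalEnv would also work)
env <- new.env()
list2env(list_data, envir = env)
ls(env)
```

Keeping the named list is usually easier to work with than separate objects, because the same lapply() call can then be applied to every sheet.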
Related
I tried using mapply to create a list of data frames of elements from sheets in an Excel file.
To be precise, every column of the data table I want to create as one element of the list is a column of a separate sheet in the Excel file. There are 7 files; they have differing numbers of sheets though each sheet has the same dimensions. Each element of my final list, which I call RAINS, should refer to one of 7 files.
### Excel files
weather_files <- list()
weather_files <- list.files(pattern = "[M-m][0-9]{4}\\.xlsx")
year = c(1:7)
dateseq = c(5:26)
rainsheet <- list()
RAIN <- list()
RAINS <- list()
## List of vectors of sheet numbers for each file
y <- list()
for (i in seq_along(year)) {
y[[i]] <- c(5:length(excel_sheets(weather_files[i])))
}
### Function 'raindate' which calls read.xlsx
raindate <- function(j,i) {
rainsheet <- read.xlsx(weather_files[i],sheet=j,startRow=2,colNames=TRUE,rowNames=FALSE,detectDates=FALSE,rows=c(4:108),cols=c(2),check.names=FALSE)
}
### Create data frame using cbind
for (i in seq_along(year)) {
RAIN <- read.xlsx(weather_files[1],sheet=5,startRow=2,colNames=TRUE,rowNames=FALSE,detectDates=FALSE,rows=c(4:108),cols=c(1),check.names=FALSE)
RAINS[[i]] <- cbind(RAIN,mapply(raindate,y[[i]],year))
}
The problem I have is that mapply increments over pairs of elements of the vectors 'y' and 'year'. This gives me data frames where each successive column increments both the Excel file and the sheet, completely mixing up the data. What I need is to iterate over all values of y within one year, then increment the year.
Is there a method in R to replace mapply in the above code to achieve this?
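One possible approach, sketched here under assumptions, is to hold the file index fixed and iterate over that file's sheets with sapply or lapply. The function raindate_demo below is a hypothetical stand-in for raindate: it returns a label instead of reading a sheet, so the pairing behaviour is visible without any Excel files.

```r
# Toy stand-in for raindate(): returns a label instead of reading a sheet
raindate_demo <- function(j, i) paste0("file", i, "_sheet", j)

# Hypothetical sheet indices per file (file 1 has sheets 5-7, file 2 has 5-6)
y <- list(5:7, 5:6)

# mapply() pairs elements: (sheet 5, file 1), (sheet 6, file 2)
paired <- mapply(raindate_demo, 5:6, 1:2)

# Holding the file index i fixed visits every sheet of that one file
per_file <- lapply(seq_along(y), function(i) {
  sapply(y[[i]], raindate_demo, i = i)
})
per_file
```

With the real raindate, each call returns a one-column data frame, so something like do.call(cbind, lapply(y[[i]], raindate, i = i)) would probably be the closer equivalent.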
I have more than one hundred Excel files that I need to clean, and all of them have the same data structure. The code below is what I use to clean a single Excel file. The file names all follow a pattern like 'abcdefg.xlsx'.
library('readxl')
df <- read_excel('abc.xlsx', sheet = 'EQuote')
# get the project name
project_name <- df[1,2]
project_name <- gsub(".*:","",project_name)
project_name <- gsub(".* ","",project_name)
# select the needed columns
df <- df[,c(3,4,5,8,16,17,18,19)]
# rename columns
colnames(df)[colnames(df) == 'X__2'] <- 'Product_Models'
colnames(df)[colnames(df) == 'X__3'] <- 'Qty'
colnames(df)[colnames(df) == 'X__4'] <- 'List_Price'
colnames(df)[colnames(df) == 'X__7'] <- 'Net_Price'
colnames(df)[colnames(df) == 'X__15'] <- 'Product_Code'
colnames(df)[colnames(df) == 'X__16'] <- 'Product_Series'
colnames(df)[colnames(df) == 'X__17'] <- 'Product_Group'
colnames(df)[colnames(df) == 'X__18'] <- 'Cat'
# add a new column named 'project_name' and set its value
df$project_name <- project_name
# extract rows between two specific characters
begin <- which(df$Product_Models == 'SKU')
end <- which(df$Product_Models == 'Sub Total:')
## set the loop
in_between <- function(df, start, end){
return(df[start:end,])
}
dividers = which(df$Product_Models %in% 'SKU' == TRUE)
df <- lapply(1:(length(dividers)-1), function(x) in_between(df, start = dividers[x], end = dividers[x+1]))
df <-do.call(rbind, df)
# remove the rows
df <- df[!(df$Product_Models %in% c("SKU","Sub Total:")), ]
# remove rows with NA
df <- df[complete.cases(df),]
# remove part of string after '.'
NeededString <- df$Product_Models
NeededString <- gsub("\\..*", "", NeededString)
df$Product_Models <- NeededString
Then I get a well-structured data frame.
Can you help me write code that cleans all of the Excel files in one go, so that I do not need to run this code a hundred times, and then aggregates all of the files into one big csv file?
You can use lapply (base R) or map (purrr package) to read and process all of the files with a single set of commands. lapply and map iterate over a vector or list (in this case a list or vector of file names), applying the same code to each element of the vector or list.
For example, in the code below, which uses map (map_df actually, which returns a single data frame, rather than a list of separate data frames), file_names is a vector of file names (or file paths + names, if the files aren't in the working directory). ...all processing steps... is all of the code in your question to process df into the form you desire:
library(tidyverse) # Loads several tidyverse packages, including purrr and dplyr
library(readxl)
single_data_frame = map_df(file_names, function(file) {
df = read_excel(file, sheet="EQuote")
# ... all processing steps ...
df
})
Now you have a single large data frame, generated from all of your Excel files. You can now save it as a csv file with, for example, write_csv(single_data_frame, "One_large_data_frame.csv").
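For comparison, the base R route mentioned above (lapply) could look roughly like the sketch below. The function clean_one and the file names are hypothetical: in practice clean_one would call read_excel() and run the processing steps from the question, whereas here it builds a toy data frame so the example runs without any files.

```r
# Hypothetical cleaning function; in practice it would call read_excel()
# and run the processing steps from the question
clean_one <- function(file) {
  data.frame(source_file = file, Qty = 1:2)
}

# Hypothetical file names; e.g. list.files(pattern = "\\.xlsx$") in practice
file_names <- c("a.xlsx", "b.xlsx")

# lapply() returns one data frame per file; do.call(rbind, ...) stacks them
cleaned <- lapply(file_names, clean_one)
single_data_frame <- do.call(rbind, cleaned)
```

The result can then be written out with write.csv(single_data_frame, "One_large_data_frame.csv", row.names = FALSE).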
There are probably other things you can do to simplify your code. For example, to rename the columns of df, you can use the recode function (from dplyr). We demonstrate this below by first changing the names of the built-in mtcars data frame to be similar to the names in your data. Then we use recode to change a few of the names:
# Rename mtcars data frame
set.seed(2)
names(mtcars) = paste0("X__", sample(1:11))
# Look at data frame
head(mtcars)
# Recode three of the column names
names(mtcars) = recode(names(mtcars),
X__1="New.1",
X__5="New.5",
X__9="New.9")
Or, if the order of the names is always the same, you can do (using your data structure):
names(df) = c('Product_Models','Qty','List_Price','Net_Price','Product_Code','Product_Series','Product_Group','Cat')
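If dplyr is not available, a similar rename can be sketched in base R with a named lookup vector. The toy data frame and the two mappings below are only illustrative; any column not listed in the lookup keeps its name.

```r
# Toy data frame with placeholder names like those in the question
df <- data.frame(X__2 = "A1", X__3 = 2, X__4 = 9.5)

# Named lookup vector: old name -> new name
lookup <- c(X__2 = "Product_Models", X__3 = "Qty")

# Replace only the names found in the lookup; others are left alone
hits <- names(df) %in% names(lookup)
names(df)[hits] <- unname(lookup[names(df)[hits]])
names(df)
```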
Alternatively, if your Excel files have column names, you can use the skip argument of read_excel to skip to the header row before reading in the data. That way, you'll get the correct column names directly from the Excel file. Since it looks like you also need to get the project name from the first few rows, you can read just those rows first with a separate call to read_excel and use the range argument, and/or the n_max argument to get only the relevant rows or cells for the project name.
I have multiple excel files and they have unique sheet names (date of file creation in my case). I read them in bulk and need to assign the sheet name to each file in new column "id". I know how to make numeric id, or id = file name, but cannot find a way to get sheet name as id.
library(readxl)
library(data.table)
file.list <- list.files("C:/Users/.../Studies/",pattern='*.xlsx')
df.list <- lapply(file.list, read_excel)
#id = numeric
df <- rbindlist(df.list, idcol = "id")
#Or by file name:
attr(df.list, "names") <- file.list
df2 = rbindlist(df.list,idcol="id")
#How to get sheet names?
If you happen to be working with only the first sheets of your files, then the following should help you grab the first sheets' names as the id for your dataframes:
attr(df.list, "names") <- sapply(file.list, function(x) excel_sheets(x)[1])
However, if you are considering importing the data from all the available sheets you will need to do a bit more work, starting with how you create your list of dataframes:
df.list <- lapply(file.list,function(x) {
sheets <- excel_sheets(x)
dfs <- lapply(sheets, function(y) {
read_excel(x, sheet = y)
})
names(dfs) <- sheets
dfs
})
This should create a list of lists, which should contain all the available data in your files. The lists inside the main list are appropriately named after the sheet names. So, you will not need to change any attributes afterwards. But to bind the dataframes together, you need to do:
rbindlist(lapply(df.list, rbindlist, idcol = "id"))
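To see what the nested binding produces, here is a small sketch on a toy list of lists shaped like df.list. The sheet names (dates) and the single column x are made up for illustration, and no Excel files are read.

```r
library(data.table)

# Toy list of lists mimicking df.list: one inner list per file,
# one named data frame per sheet (sheet names are hypothetical)
df.list <- list(
  list(`2020-01-01` = data.frame(x = 1:2)),
  list(`2020-02-01` = data.frame(x = 3L), `2020-03-01` = data.frame(x = 4L))
)

# The inner rbindlist() stacks the sheets of one file and records the sheet
# name in "id"; the outer call stacks the files
combined <- rbindlist(lapply(df.list, rbindlist, idcol = "id"))
combined$id
```

Each row of combined carries the name of the sheet it came from in the id column.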
I hope this proves useful.
I have dataframes in which one column needs to be modified, correctly handling NAs, characters and digits. The dataframes have similar names, and the column of interest has the same name in each.
I wrote a for loop that changes every row of the column of interest correctly. However, I had to create an intermediate object "df" in order to do that.
Is that necessary, or can the original dataframes be modified directly?
sheet1 <- read.table(text="
data
15448
something_else
15334
14477", header=TRUE, stringsAsFactors=FALSE)
sheet2 <- read.table(text="
data
16448
NA
16477", header=TRUE, stringsAsFactors=FALSE)
sheets<-ls()[grep("sheet",ls())]
for(i in 1:length(sheets) ) {
df<-NULL
df<-eval(parse(text = paste0("sheet",i) ))
for (y in 1:length(df$data) ){
if(!is.na(as.integer(df$data[y])))
{
df[["data"]][y]<-as.character(as.Date(as.integer(df$data[y]), origin = "1899-12-30"))
}
}
assign(eval(as.character(paste0("sheet",i))),df)
}
As @d.b. mentions, consider iterating over a list of dataframes, especially if they are similarly structured: you can run the same operations with the apply family of functions, and you avoid managing many objects in the global environment. Also, consider using the vectorized ifelse to update the column.
And if you ever really need separate dataframe objects, use list2env to convert each list element to a separate object. The code below wraps the as.* functions in suppressWarnings, since returning NA for non-numeric values is the intended behavior.
sheetList <- mget(ls(pattern = "sheet[0-9]"))
sheetList <- lapply(sheetList, function(df) {
df$data <- ifelse(is.na(suppressWarnings(as.integer(df$data))), df$data,
as.character(suppressWarnings(as.Date(as.integer(df$data),
origin = "1899-12-30"))))
return(df)
})
list2env(sheetList, envir=.GlobalEnv)
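The same ifelse logic can be tried on a plain vector to see what it does to each kind of value. This is only an illustration with made-up input: one Excel serial date, one text entry, and one NA.

```r
# Mixed column as it might arrive from Excel: serial dates and free text
x <- c("15448", "something_else", NA)

# Non-numeric entries become NA; the warning is expected, so silence it
serial <- suppressWarnings(as.integer(x))

# Vectorized update: convert where a serial number was parsed, else keep x
converted <- ifelse(is.na(serial), x,
                    as.character(as.Date(serial, origin = "1899-12-30")))
converted
```

The serial number becomes a date string, while the text entry and the NA pass through unchanged.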
I have a number of dataframes (imported from CSV) that have the same structure. I would like to loop through all these dataframes and keep only two of these columns.
The loop below does not seem to work, any ideas why? Would ideally like to do this using a loop as I am trying to get better at using these.
frames <- ls()
for (frame in frames){
frame <- subset(frame, select = c("Col_A","Col_B"))
}
Cheers in advance for any advice.
For anyone interested, I used Richard Scriven's idea of reading in the dataframes as one object, with a function added that records which file each row was imported from. This allowed me to then use the plyr package to manipulate the data:
library(plyr)
dataframes <- list.files(path = TEESMDIR, full.names = TRUE)
## Define a function to add the filename to the dataframe
read_csv_filename <- function(filename){
ret <- read.csv(filename)
ret$Source <- filename #EDIT
ret
}
list_dataframes <- ldply(dataframes, read_csv_filename)
## ldply() already returns a single combined data frame, so subset it directly
selection <- subset(list_dataframes, select = c(var1, var3))
The basic problem is that ls() returns a character vector of the names of the objects in your environment, not the objects themselves. To get and replace an object using a character variable containing its name, you can use the get() and assign() functions. You could rewrite your loop as
frames <- ls()
for (frame in frames){
assign(frame, subset(get(frame), select = c("Col_A","Col_B")))
}
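An alternative that avoids get() and assign() is to collect the data frames into a named list with mget() and subset each element with lapply(). The sketch below uses a dedicated environment and made-up data frames (df1, df2) so that it does not touch other objects in the workspace.

```r
# Toy data frames in a dedicated environment, standing in for the
# imported CSVs (names and columns are hypothetical)
env <- new.env()
env$df1 <- data.frame(Col_A = 1, Col_B = 2, Col_C = 3)
env$df2 <- data.frame(Col_A = 4, Col_B = 5, Col_C = 6)

# Collect the objects into a named list, then keep two columns of each
frame_list <- mget(ls(env), envir = env)
frame_list <- lapply(frame_list, function(d) d[, c("Col_A", "Col_B")])
```

Working in the global environment instead, ls() would also pick up objects that are not data frames, so a pattern argument to ls() is probably worth adding there.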