Error based on blank rows when importing xlsx into R

I'm importing and appending hundreds of Excel spreadsheets into R using map_dfr in combination with a user-defined function:
Function to import specific columns in each worksheet:
library(openxlsx)
library(purrr)

fctn <- function(path) {
  map_dfc(.x = c(1, 2, 3, 7, 10, 11, 12, 13),
          ~ read.xlsx(path, sheet = 1, startRow = 7,
                      colNames = FALSE, cols = .x))
}
Code to pull all the files in the "path" and append them, where file.list is the list of paths and files to import:
all.files <- map_dfr(file.list, ~ fctn(path=.x))
My problem is that some of these sheets have missing values in some of the columns but not others, and map_dfc fails on the resulting length mismatch. I encounter this error, for instance:
"Error: can't recycle '..1' (size 8) to match '..2' (size 6)", which happens because column 2 is missing information in two cells, so read.xlsx returns it with only 6 values while column 1 has 8.
Is there any way to make R accept missing values in cells?
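A possible fix (a sketch, not from the original thread): openxlsx::read.xlsx accepts a numeric vector for its cols argument, so all eight columns can be read in a single call. Rows then stay aligned and blank cells come back as NA instead of producing columns of different lengths:

library(openxlsx)
library(purrr)

# Sketch: read all target columns in one call so rows stay aligned and
# blank cells become NA, then stack the files with map_dfr as before.
fctn <- function(path) {
  read.xlsx(path, sheet = 1, startRow = 7, colNames = FALSE,
            cols = c(1, 2, 3, 7, 10, 11, 12, 13),
            skipEmptyRows = FALSE)   # keep fully blank rows too
}

all.files <- map_dfr(file.list, fctn)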

Related

R: Read specific columns of a .dta file and convert variable names to lower case without reading the whole file

I have a folder with multiple .dta files and I'm using the read_dta() function of the haven library to bind them. The problem is that some of the files have their column names in lower case and others have them in upper case.
I was wondering if there is a way to read only the specific columns, converting their names to lower case in every case, without reading the whole file and then selecting the columns, since the files are really large and that would take forever.
I was hoping the .name_repair = argument of the read_dta() function could do this, but I really don't know how.
I'm trying something like this:
library(haven)

# Set working directory:
setwd("T:/")

# List of .dta file names to bind (the list.files() call is assumed here;
# only the grepl() filter was shown):
list_names <- list.files()
list_names <- list_names[grepl("_sdem.dta", list_names)]

# Variable names to select from those files:
vars_select <- c("r_def", "c_res", "ur", "con", "n_hog", "v_sel", "n_pro_viv",
                 "fac", "n_ren", "upm", "eda", "clase1", "clase2", "clase3",
                 "ent", "sex", "e_con", "niv_ins", "eda7c", "tpg_p8a",
                 "emp_ppal", "tue_ppal", "sub_o")

# Read and bind ONLY the selected variables from the list of files:
dataset <- data.frame()
for (i in 1:length(list_names)) {
  temp_data <- read_dta(list_names[i], col_select = vars_select)
  dataset <- rbind(dataset, temp_data)
}
The problem is that when a file has its variable names in upper case, those names don't match the vars_select list, and the following error appears:
Error: Can't subset columns that don't exist.
x Columns `r_def`, `c_res`, `n_hog`, `v_sel`, `n_pro_viv`, etc. don't exist.
I tried to correct this with the .name_repair = argument of the read_dta() function, using the tolower() function.
I tried something like this on a specific file whose variable names are in upper case:
example_data <- read_dta("T:/2017_2_sdem.dta", col_select = vars_select, .name_repair = tolower(names()))
But the same error appears:
Error: Can't subset columns that don't exist.
x Columns `r_def`, `c_res`, `n_hog`, `v_sel`, `n_pro_viv`, etc. don't exist.
Thanks so much for your help!
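A possible workaround (a sketch, not from the original thread): .name_repair runs after col_select, so it can't rescue the selection. Instead, read just the header of each file with n_max = 0, match the wanted names case-insensitively, and select each file's actual spellings:

library(haven)
library(purrr)

# Sketch: peek at each file's real column names without reading any rows,
# match them against vars_select ignoring case, then read only those
# columns and lower-case the names afterwards.
read_sdem <- function(path, vars) {
  actual <- names(read_dta(path, n_max = 0))   # header only, no rows read
  wanted <- actual[tolower(actual) %in% vars]  # case-insensitive match
  dat <- read_dta(path, col_select = all_of(wanted))
  names(dat) <- tolower(names(dat))
  dat
}

dataset <- map_dfr(list_names, read_sdem, vars = vars_select)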

Reading .xlsx file names from cells and outputting specific cell as mutated column

I have an Excel sheet with the input data for an experiment I ran, and I want to get that input data alongside the results of the experiment.
Each row in the Excel sheet contains all the input data for one unique test in the experiment. In each of these rows, I'd like to display additional cells that show some of the results from the experiment's output files. Each test has its own unique output file, and the name of each of these files (e.g. "Output1.xlsx") is contained in a column alongside the input data. So each row in the input file contains all the input data for a test as well as the file name of that test's output file.
I'd like to run code that reads the file names from the "Filename" column in the input file, finds those files in the working directory, accesses a value from a specific cell in each of them, and creates a mutated column containing those values.
So far my code looks like this:
library(tidyverse)
library(readxl)

# Import testing inputs from testing matrix sheet
testing_inputs <- read_xlsx("Testing Setup Info.xlsx",
                            sheet = 2,
                            range = NULL,
                            col_names = TRUE,
                            col_types = NULL,
                            na = "",
                            trim_ws = TRUE,
                            skip = 0,
                            progress = readxl_progress(),
                            .name_repair = "unique")

# Isolate results files from testing inputs sheet
testing_results <- testing_inputs %>%
  select("Filename")

# Create df with columns for test inputs and results. Inputs come from
# testing_inputs; results come from individual cells read from the files
# listed in the testing_results df.
analysis_df <- testing_inputs %>%
  select("Test ID", "Test Order", "Box Size (in.)",
         "Coil Rows (#)", "HWST (degF)", "Damper Position",
         "Insulation Level", "HW Flow # HWST (GPM)", "heating SA flow (cfm)",
         "H.T. - from Price (MBH)", "Filename") %>%
  mutate(FS_Cap = lapply(testing_results,
                         read_xlsx(path = ".", sheet = 2, range = "Q27",
                                   col_names = FALSE, col_types = NULL)))
The error message that results from this:
"Error in mutate_cols():
! Problem with mutate() column FS_Cap.
i FS_Cap = lapply(...).
x zip file 'C:\Users\pwend\Box\BSG\Projects\HVAC\Integrated\PIR-19-013 Hot Water\Laboratory testing\Test Results\Combined' cannot be opened
Caused by error in utils::unzip():
! zip file 'C:\Users\pwend\Box\BSG\Projects\HVAC\Integrated\PIR-19-013 Hot Water\Laboratory testing\Test Results\Combined' cannot be opened"
There is no zip file in the working directory, so I'm not sure what to do about this.
For reference, every file this script uses is in the working directory, but so are files I don't need, so I can't simply use the list.files function. Using list.files would also make it difficult to match the values returned from the files to the rows they correspond to in the inputs file.
Does anyone know how I could achieve the output I'm looking for?
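One reading of the error, plus a sketch (not from the original thread): .xlsx files are zip archives, and in the mutate() call read_xlsx(path = ".", ...) is evaluated immediately rather than applied per file, so readxl tries to unzip the working directory itself. Mapping a function over the Filename column instead might look like this:

library(tidyverse)
library(readxl)

# Sketch: read cell Q27 of sheet 2 from each file named in the Filename
# column. Assumes Q27 holds a number; an empty result becomes NA.
read_result <- function(file) {
  cell <- read_xlsx(file, sheet = 2, range = "Q27", col_names = FALSE)
  if (nrow(cell) == 0 || ncol(cell) == 0) return(NA_real_)  # empty cell
  as.numeric(cell[[1]][1])
}

analysis_df <- testing_inputs %>%
  select("Test ID", "Filename") %>%   # plus whichever other input columns are needed
  mutate(FS_Cap = map_dbl(Filename, read_result))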

How to bind rows in R so that, instead of a type conversion error, binding defaults to filling the value with NA?

I am currently tasked with merging multiple xlsx files into one master R (.rds) data file. Since these files are filled in manually, there are a lot of type conversion errors when using approaches such as dplyr::bind_rows, for example:
Column `XYZ` can't be converted from numeric to character
I very much need the binding to stay "smart", i.e. to happen according to the corresponding column names of the data frames being merged, but when a conversion issue comes up I would like the problematic cell contents treated as NA rather than getting an error (a warning would be fine).
Is there a convenient way/function for doing this in R?
I have used bind_rows from the dplyr package.
My current import procedure:

library(readxl)
library(stringr)
library(dplyr)

files <- list.files("data", pattern = "xlsx", full.names = TRUE)
tmp <- read_excel(files[1], sheet = "data", trim_ws = TRUE)
names(tmp) <- make.names(str_squish(names(tmp)))
for (i in 2:length(files)) {
  print(i)
  tmp2 <- read_excel(files[i], sheet = "data", trim_ws = TRUE)
  names(tmp2) <- make.names(str_squish(names(tmp2)))
  tmp <- bind_rows(tmp, tmp2)
}
It has been pointed out that using a loop here is not efficient, but since the files are messy (many manual mistakes) and relatively small in number, I focused on being able to track the binding process sequentially.
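One convenient pattern (a sketch, not from the thread): read every column as text so bind_rows can never hit a type conflict, then re-type the combined data afterwards. readr::type_convert() re-guesses clean columns; a column that still holds stray text stays character, and an explicit as.numeric() turns its unparseable cells into NA with a coercion warning rather than an error:

library(readxl)
library(stringr)
library(purrr)
library(dplyr)
library(readr)

# Sketch: force every column to character on import, bind, then re-type once.
files <- list.files("data", pattern = "xlsx", full.names = TRUE)

read_one <- function(f) {
  x <- read_excel(f, sheet = "data", trim_ws = TRUE, col_types = "text")
  names(x) <- make.names(str_squish(names(x)))
  x
}

combined <- map_dfr(files, read_one) %>%  # everything is text: no conflicts
  type_convert()                          # clean columns get proper types back

# For a column that should be numeric but still contains stray text
# ("XYZ" is the hypothetical name from the error message), coerce it
# explicitly; bad cells become NA with a warning, not an error.
combined <- combined %>% mutate(XYZ = as.numeric(XYZ))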

Read specific non-adjacent columns from Excel file [duplicate]

So I have a bunch of Excel files I want to loop through, reading specific, discontinuous columns into a data frame. Using the readxl package works for the basic stuff like this:
library(readxl)
library(plyr)

wb <- list.files(pattern = "*.xls")
dflist <- list()
for (i in wb) {
  dflist[[i]] <- data.frame(read_excel(i, sheet = "SheetName", skip = 3, col_names = TRUE))
}

# now put them into a data frame
data <- ldply(dflist, data.frame, .id = NULL)
This works (barely), but the problem is that my Excel files have about 114 columns and I only want specific ones. Also, I do not want to let R guess the col_types, because it messes some of them up (e.g. for a string column, if the first value starts with a number, it tries to interpret the whole column as numeric and crashes). So my question is: how do I specify specific, discontinuous columns to read? The range argument uses the cellranger package, which does not allow reading discontinuous columns. Any alternative?
.xlsx >>> you can use library openxlsx
The read.xlsx function from library openxlsx has an optional parameter cols that takes a vector of numeric indices specifying which columns to read.
It seems to read all columns as characters if at least one column contains characters.
openxlsx::read.xlsx("test.xlsx", cols = c(2,3,6))
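Folded into the question's loop, this might look like the following sketch (assuming the same files and sheet name as above):

library(openxlsx)

# Sketch: read only columns 2, 3 and 6 from each workbook and stack them.
wb <- list.files(pattern = "\\.xlsx$")
dflist <- lapply(wb, function(f) read.xlsx(f, sheet = "SheetName", cols = c(2, 3, 6)))
data <- do.call(rbind, dflist)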
.xls >>> you can use library XLConnect
The potential problem is that library XLConnect requires library rJava, which might be tricky to install on some systems. If you can get it running, the keep and drop parameters of readWorksheet() accept both column names and indices, and the colTypes parameter deals with column types. This works for me:
options(java.home = "C:\\Program Files\\Java\\jdk1.8.0_74\\")  # path to JDK
library(rJava)
library(XLConnect)

workbook <- loadWorkbook("test.xls")
readWorksheet(workbook, sheet = "Sheet0", keep = c(1, 2, 5))
Edit:
The readxl package works well for both .xls and .xlsx if you want to read a range (rectangle) from your Excel file, e.g.:
readxl::read_xls("test.xls", range = "B3:D8")
readxl::read_xls("test.xls", sheet = "Sheet1", range = cell_cols("B:E"))
readxl::read_xlsx("test.xlsx", sheet = 2, range = cell_cols(2:5))
