read_excel all columns text [duplicate]

This question already has answers here:
Specifying Column Types when Importing xlsx Data to R with Package readxl
(6 answers)
Closed 2 years ago.
I have an Excel file in which all columns are of type "text". When read_excel is called, however, some of the columns are guessed to be "dbl" instead. I know I can use col_types to specify the columns, but that requires knowing how many columns the file has.
Is there a way I can detect the number of columns? Or, alternatively, specify that the columns are all "text"? Something like
read_excel("file.xlsx", col_types = "text")
which, quite reasonably, gives an error that I haven't specified the type for all the columns.
Currently, I can solve this by reading in the file twice:
read_excel_one_type <- function(filename, col_type = "text") {
  temp <- read_excel(path = filename)
  ncol.temp <- ncol(temp)
  read_excel(path = filename, col_types = rep(col_type, ncol.temp))
}
but a method that doesn't require reading the file twice would be better.
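A lighter variant of the same idea (a sketch, assuming a readxl version that supports the n_max argument) reads only the header row to learn the column count, so the data itself is parsed just once:

library(readxl)

read_excel_all_text <- function(filename) {
  # n_max = 0 returns a zero-row tibble with column names, which is enough to count columns
  header <- read_excel(path = filename, n_max = 0)
  read_excel(path = filename, col_types = rep("text", ncol(header)))
}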

This answer seems to be helpful: https://stackoverflow.com/a/34015430/5220858. I have found that the Excel file needs to be formatted correctly from the start for R to automatically detect the correct data type (i.e. numeric, date, text). The linked post, though, is more relevant to your question: the poster shows code similar to what you have provided, except that only one line of data is read to determine the number of columns, and then the rest is read into R based on that first line.

library("xlsx")
file<-"myfile.xlsx"
sheetIndex<-1
mydf<-read.xlsx(file, sheetIndex, sheetName=NULL, rowIndex=NULL,
startRow=NULL, endRow=NULL, colIndex=NULL,
as.data.frame=TRUE, header=TRUE, colClasses=NA,
keepFormulas=FALSE, encoding="unknown")
works for me

Related

R / openxlsx / Finding the first non-empty cell in Excel file

I'm trying to write data to an existing Excel file from R, while preserving the formatting. I'm able to do so following the answer to this question (Write from R into template in excel while preserving formatting), except that my file includes empty columns at the beginning, and so I cannot just begin to write data at cell A1.
As a solution I was hoping to be able to find the first non-empty cell, then start writing from there. If I run read.xlsx(file="myfile.xlsx") using the openxlsx package, the empty columns and rows are automatically removed, and only the data is left, so this doesn't work for me.
So I thought I would first load the worksheet using wb <- loadWorkbook("file.xlsx") so I have access to getStyles(wb) (which works). However, the subsequent command getTables returns character(0), and wb$tables returns NULL. I can't figure out why this is. Am I right that these values would tell me the first non-empty cell?
I've tried manually removing the empty columns and rows preceding the data, straight in the Excel file, but that doesn't change things. Am I on the right path here or is there a different solution?
As suggested by Stéphane Laurent, the package tidyxl offers the perfect solution here.
For instance, I can now search the Excel file for a character value, like my variable names of interest ("Item", "Score", and "Mean", which correspond to the names() of the data.frame I want to write to my Excel file):
require(tidyxl)
colnames <- c("Item", "Score", "Mean")
excelfile <- "FormattedSheet.xlsx"
x <- xlsx_cells(excelfile)

# Find all cells with character values: return their address (i.e., Cell) and character (i.e., Value)
chars <- x[x$data_type == "character", c("address", "character")]

starting.positions <- unlist(
  chars[which(chars$character %in% colnames), "address"]
)
# returns: c(C6, D6, E6)
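As a possible follow-up, returning to the original goal of writing into the template while preserving formatting: the address found above can be converted to numeric indices and passed to openxlsx's writeData. This is only a sketch; mydata is a placeholder for the data frame to be written, and the parsing assumes single-letter column addresses.

library(openxlsx)

addr <- starting.positions[1]                          # e.g. "C6"
start_col <- match(gsub("[0-9]", "", addr), LETTERS)   # "C" -> 3 (single-letter columns only)
start_row <- as.numeric(gsub("[A-Z]", "", addr))       # 6

wb <- loadWorkbook(excelfile)
writeData(wb, sheet = 1, x = mydata,
          startCol = start_col, startRow = start_row, colNames = TRUE)
saveWorkbook(wb, excelfile, overwrite = TRUE)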

Issue when importing float as string from Excel. Adding precision incorrectly

I am using openxlsx's read.xlsx to import a data frame with a mixed-type column. The desired result is to import all values as strings, exactly as they are represented in Excel. However, some decimals come through as very long floats.
Sample data is simply an Excel file with a column containing the following rows:
abc123,
556.1,
556.12,
556.123,
556.1234,
556.12345
require(openxlsx)
df <- read.xlsx("testnumbers.xlsx")
Using the above R code to read the file results in df containing these string values:
abc123,
556.1,
556.12,
556.12300000000005,
556.12339999999995,
556.12345000000005
The Excel file provided in production has the column formatted as "General". If I format the column as Text, there is no change unless I explicitly double-click each cell in Excel and hit enter. In that case, the number is correctly displayed as a string. Unfortunately, clicking each cell isn't an option in the production environment. Any solution, Excel, R, or otherwise is appreciated.
Edit:
I've read through this question and believe I understand the math behind what's going on. At this point, I suppose I'm looking for a workaround. How can I get a float from Excel to an R dataframe as text without changing the representation?
Why Are Floating Point Numbers Inaccurate?
I was able to get the correct formats into a data frame using pandas in python.
import pandas as pd
test = pd.read_excel('testnumbers.xlsx', dtype = str)
This will suffice as a workaround, but I'd like to see a solution built in R.
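A rough R-side alternative (a sketch, not built-in openxlsx behaviour; it assumes the true values never need more precision than Excel displays) is to read the column as-is and then reformat the numeric-looking entries, which trims the floating-point noise:

library(openxlsx)

df <- read.xlsx("testnumbers.xlsx")

clean_to_text <- function(x) {
  # Non-numeric entries like "abc123" become NA here and are passed through untouched
  num <- suppressWarnings(as.numeric(x))
  ifelse(is.na(num), x, format(num, digits = 15, trim = TRUE))
}

df[[1]] <- clean_to_text(df[[1]])
# df[[1]] should now be c("abc123", "556.1", "556.12", "556.123", "556.1234", "556.12345")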
Here is a workaround in R using openxlsx that I used to solve a similar issue. I think it will solve your question, or at least allow you to format as text in the excel files programmatically.
I use it to reformat specific cells in a large number of files (I'm converting from general to "scientific" in my case, as an example of how you might alter this for another format).
This uses functions from the openxlsx package that you reference in the OP.
First, load the xlsx file in as a workbook (stored in memory, which preserves all the xlsx formatting/etc; slightly different than the method shown in the question, which pulls in only the data):
testnumbers <- loadWorkbook(here::here("test_data/testnumbers.xlsx"))
Then create a "style" that converts the numbers to "text" and apply it to the virtual worksheet (in memory):
numbersAsText <- createStyle(numFmt = "TEXT")
addStyle(testnumbers, sheet = "Sheet1", style = numbersAsText, cols = 1, rows = 1:10)
Finally, save it back to the original file:
saveWorkbook(testnumbers,
             file = here::here("test_data/testnumbers_formatted.xlsx"),
             overwrite = TRUE)
When you open the Excel file, the numbers will be stored as "text".

R - Read specific columns from XLSX

This seems like a silly question, but I really could not find a solution! I need to read only specific columns from an Excel file. The file has multiple sheets with different numbers of columns, but the columns I need to read will be there. I can do this for CSV files, but not for Excel! This is my present code, which reads the first 14 columns (but the columns I need might not always be in the first 14). I can't just read them all, as rbind will throw an error citing a row mismatch (different numbers of rows in the sheets).
EDIT: I solved this by omitting the col_types parameter; it worked because the sheets with different column numbers only had column headers. Still, this is in no way a robust solution, so I hope someone can do a better job than me.
INV <- lapply(sheets, function(X) read_excel("./Inventory.xlsx", sheet = X,
                                             col_types = c(rep("text", 14))))
names(INV) <- sheets
INV <- do.call("rbind", INV)
I am trying to do something like this:
INV <- lapply(FILES[grepl("Inventory", FILES)],
              function(n) read_csv(file = paste0(n),
                                   col_types = cols_only(DIVISION = "c",
                                                         DEPARTMENT = "i",
                                                         ITEM_ID = "c",
                                                         DESCRIPTION = "c",
                                                         UNIT_QTY = "i",
                                                         COMP_UNIT_QTY = "i",
                                                         REGION = "c",
                                                         LOCATION_TYPE = "c",
                                                         ZONE = "c",
                                                         LOCATION_ID = "c",
                                                         ATS_IND = "c",
                                                         CONTAINER_ID = "c",
                                                         STATUS = "c",
                                                         TROUBLE_CODES = "c")))
But I need this for an Excel file. I tried using read.xlsx from openxlsx and read_excel from readxl, but neither supported doing this. There must be some other way. Don't worry about column types; I am fine with all of them as characters.
I would very much appreciate if this can be done using readxl or openxlsx.
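readxl and openxlsx have no direct equivalent of readr's cols_only(), but one workable sketch (column names assumed to match the sheet headers exactly) is to read each sheet in full, keep only the wanted columns, and coerce everything to character before binding:

library(readxl)

wanted <- c("DIVISION", "DEPARTMENT", "ITEM_ID", "DESCRIPTION",
            "UNIT_QTY", "COMP_UNIT_QTY", "REGION", "LOCATION_TYPE",
            "ZONE", "LOCATION_ID", "ATS_IND", "CONTAINER_ID",
            "STATUS", "TROUBLE_CODES")

read_wanted <- function(path, sheet, wanted) {
  full <- read_excel(path, sheet = sheet)               # read everything, guessed types
  full <- full[, intersect(wanted, names(full)), drop = FALSE]
  full[] <- lapply(full, as.character)                  # all character, so rbind() never complains
  full
}

INV <- lapply(sheets, function(s) read_wanted("./Inventory.xlsx", s, wanted))
names(INV) <- sheets
INV <- do.call("rbind", INV)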

Is there a way to prevent Excel from automatically forcing my character string to a date from within R? [duplicate]

This question already has answers here:
characters converted to dates when using write.csv
(3 answers)
Closed 5 years ago.
Within R, I have a character string ID formatted like XX-XX where XX is any integer number between 01 and 99.
However, when the numbers that make up the character string resemble a date, Excel automatically forces this conversion.
I am writing to a .csv file directly from within R using write.csv().
Unfortunately, I am not able to change the ID format convention, and I also require this to be controlled from within R, as it is a small part of a very large automated process whose users do not necessarily have any understanding of the mechanics. Furthermore, configuring Excel for every person who uses this system's software is not desirable, but I will consider it as a last resort.
Is this possible? I am open to using a different writing option like the xlsx package if it can provide a solution.
MWE Provided:
# Create object with digits that will provoke the problem.
ID <- data.frame(x = '03-15')
# Write object to a csv file within the working directory.
write.csv(ID, file = 'problemFile.csv')
# Now open the .csv file in excel and view the result.
I recommend the openxlsx package. This worked for me:
ID <- data.frame(x = '03-15')
wb <- createWorkbook()
addWorksheet(wb, "Sheet 1")
writeData(wb, "Sheet 1", x = ID)
saveWorkbook(wb, "test.xlsx", overwrite = TRUE)
openXL("test.xlsx")
Check this:
characters converted to dates when using write.csv
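If the output has to stay a .csv, the trick discussed in the linked question can also be driven from R: wrap each ID in Excel's ="..." text-formula syntax before writing. This is only a sketch; it changes the raw cell contents, so any non-Excel consumer of the file would see the wrapper too.

ID <- data.frame(x = '03-15')
# Wrapping the value as ="03-15" makes Excel treat it as literal text, not a date
ID$x <- paste0('="', ID$x, '"')
write.csv(ID, file = 'problemFile.csv', row.names = FALSE)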

CSV file import in R [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 7 years ago.
I am trying to import a CSV file into R to do fraud analysis with linear/logistic regression. What should have been pretty easy is turning complicated... This data set contains 26 variables and more than 2 million rows. I used this command line to import the CSV file:
data <- read.csv('C:/Users/amartinezsistac/OneDrive/PROYECTO/decla_cata_filtrados.csv',header=TRUE,sep=";")
Nevertheless, R imported 2.3 million rows in only 1 variable. I attach an image of the View(data) output obtained after this step for more information. I have tried switching from sep=";" to sep="," using:
datos <- read.csv('C:/Users/amartinezsistac/OneDrive/PROYECTO/decla_cata_filtrados.csv',header=TRUE,sep=",")
But got this error message:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
I have tried changing read.csv to read.csv2 (2.3 million rows and 1 variable as the result) and using the fill=TRUE option (same result), but the import is still not correct. I attach another image of how the original CSV looks when opened in Excel.
I appreciate in advance any suggestion or help to fix it.
Break the problem down into steps that you can check. Initially I'd try something like:
file <- 'C:/Users/amartinezsistac/OneDrive/PROYECTO/decla_cata_filtrados.csv'
read.csv(file, header = FALSE, skip = 1, sep = ",", nrows = 1)
If this produces a data.frame with 1 row and 26 columns, you're in business; if not, check through the arguments of read.csv again and see whether any of them need changing for your file.
Now progress to
read.csv(file, header = TRUE, skip = 0, sep = ",", nrows = 1)
This should give you the same one-line data.frame, but with the column names correct. If not, check that the CSV file has the right number of columns in its first row, or carry on skipping the header and assign column names after you've read the data in.
Now increase nrows, initially to 10, then perhaps by a factor of 10 each time, until you read in the whole file or hit a problem. Then use a binary search to find the exact line that causes the problem: set nrows to be halfway between a value you know works and one that doesn't, until you find the exact problem line.
Refer to the CSV in Excel to see what is particular about that line: does it have a strange character, unmatched quotes, fewer entries...? This will influence how you fix the problem.
Repeat until your whole file reads in!
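If the file is large, base R's count.fields() can complement the binary search by reporting the number of fields on every line in a single pass (a sketch, assuming a comma separator and standard double quoting):

file <- 'C:/Users/amartinezsistac/OneDrive/PROYECTO/decla_cata_filtrados.csv'
fields <- count.fields(file, sep = ",", quote = "\"")
# Lines whose field count differs from the first line's are the likely culprits
which(fields != fields[1])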
From the Excel screenshot, the first line of data in your file has 31 columns; the second has 29...
My guess is that your CSV file uses a comma as both the column separator and the decimal separator. You will have to re-export the file to CSV with different decimal and column separators.
