readxl or tidyxl: Prevent date coercion when reading from Excel xlsx

Is there any way to prevent coercion of dates when reading data from Excel? I'm using either the readxl package or the tidyxl package. The tidyxl package is terrific, but it automatically parses date-formatted cells into its date column.
Also, I was intrigued by this sentence from the help page for the xlsx_cells() function: "xlsx_cells() attempts to infer the correct data type of each cell, returning its value in the appropriate column (error, logical, numeric, date, character). In case this cleverness is unhelpful, the unparsed value and type information is available in the 'content' and 'type' columns." It's this unparsed value that I'm looking for.
Alternatively, I'm looking for something similar to TextReader, except for XLSX files.
Any suggestions?
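For reference, a minimal sketch of both escape hatches, assuming a hypothetical file "book.xlsx" (the tidyxl column names follow the help text quoted above, so check them against your installed version):
library(readxl)
# a single col_type is recycled across all columns, so nothing is parsed as a date
raw <- read_excel("book.xlsx", col_types = "text")

library(tidyxl)
cells <- xlsx_cells("book.xlsx")
# per the quoted help page, the unparsed value and its type information
# live in the 'content' and 'type' columns
unparsed <- cells[, c("address", "content", "type")]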

Related

Is there a way to read in a large document as a data.frame in R?

I'm trying to use ggplot2 on a large data set stored in a csv file, which I used to read with Excel.
I don't know how to convert this data into a data.frame. In particular, I have a date column in the following format: "2020/04/12:12:00". How can I get R to understand this format?
If it's a csv, you can use:
the fread function from data.table. This will be the fastest way to read your csv.
read_csv or read_csv2 (for ;-delimited documents) from the readr package.
If it's a .xls (or .xlsx) document, have a look at the readxl package.
All these functions import your data as data.frames (with additional classes, like data.table for fread or tibble for read_csv).
Edit
Given your comment, it looks like your file is not an Excel file but a csv. If you want to convert a column to a date type, assuming your data.table is called df:
df[, dates := as.POSIXct(get(colnames(df)[1]), format = "%Y/%m/%d:%H:%M")]
Note that you don't need to use cbind or even reassign the data.table, because the := operator modifies it by reference.
As the message is telling you, you don't need the extra precision of POSIXlt.
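Putting the pieces together, a minimal sketch (the file name, and the assumption that the date strings sit in the first column, are hypothetical):
library(data.table)
# fread returns a data.table, so := can add the parsed column by reference
df <- fread("large_file.csv")
df[, dates := as.POSIXct(get(colnames(df)[1]), format = "%Y/%m/%d:%H:%M")]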
Going by the question alone, I would suggest the openxlsx package; it has helped me reduce the time significantly when reading large datasets. Three points you may find helpful, based on your question and the comments:
The read command stays the same as in the xlsx package, but I would suggest you use openxlsx::read.xlsx(file_path).
The arguments are again the same, but in place of sheetIndex it is sheet, and it takes only numbers.
If the existing columns are converted to character, then a simple as.Date would work.
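A minimal sketch of those points, assuming a hypothetical file "data.xlsx" with a character date column named date:
library(openxlsx)
# sheet takes a number here, where the xlsx package used sheetIndex
df <- openxlsx::read.xlsx("data.xlsx", sheet = 1)
# a column that arrived as character can be converted explicitly
df$date <- as.Date(df$date, format = "%Y-%m-%d")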

read excel file which has = in a column in r

I have an Excel sheet which has formulas in one column, like C=(A-32)/1.8. If I read it using the function read_excel, it shows an "unexpected symbol" error for that column. Need help in reading this.
I think you need to force each column's type with the col_types argument of the read_excel() function in the readxl package. You can specify the type "text", which reads the cells as they are.
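A minimal sketch, assuming a hypothetical file "formulas.xlsx"; a single col_type is recycled across all columns, so every cell comes in verbatim:
library(readxl)
# "text" disables type guessing, so formula-like strings such as
# "C=(A-32)/1.8" survive intact
df <- read_excel("formulas.xlsx", col_types = "text")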

R read_excel readxl sometimes incorrectly converts numbers to dates

When I use read_excel to import data from Excel to R, some numeric columns are automatically converted to dates.
# e.g.
5600 to 1915-05-01
Is there a way to turn off this feature, other than using the "col_types" argument in read_excel?
The readxl package, like readr for raw data files, has a type guesser to determine how to read columns in an Excel spreadsheet. As noted in the package vignette, the guessing process is not perfect, especially as it relates to date formats because they are stored as a special type of number.
As stated in the package documentation (as well as the comments on the OP), the way to avoid inaccurate guesses from the column type guesser is to explicitly specify the column types with the col_types argument on read_excel().
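For example, a minimal sketch assuming a hypothetical two-column "data.xlsx" whose first column holds plain numbers such as 5600:
library(readxl)
# forcing "numeric" stops the guesser from reading serial numbers as dates
df <- read_excel("data.xlsx", col_types = c("numeric", "text"))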

The R package XLSX is converting entire column to string or boolean when one cell is not numeric

I am using a Shiny interface under R to read in a CSV file and load it into one sheet of an Excel xlsm file. The file then allows user input and performs calculations based on VBA macros.
The R xlsx package is working well for preserving the VBA and formatting in the original Excel sheet. However, some of the data is being converted to a different data type than intended. For example, a cell containing the string "F" causes the column containing it to be converted to type boolean, and a mis-entered number in one cell causes the entire column to be converted to string.
Can this behavior be controlled so that, for example, cells with valid numbers are not converted to string type? Is there a work-around? Or can someone just help me to understand what is happening in the guts of the package to cause this effect so I can try to find a way around it?
Here are the calls in question:
library(xlsx)

# excelType() points to an Excel xlsm template
data <- read.csv("results.csv")
excelForm <- loadWorkbook(excelType())
sheets <- getSheets(excelForm)
# write the data frame into the first sheet without touching headers or styles
addDataFrame(data, sheets[[1]], col.names = FALSE, row.names = FALSE,
             startRow = 2, colStyle = NULL)
saveWorkbook(excelForm, "results.xlsm")
Thanks!
I hope this is the correct protocol for explaining the outcome that worked for me. I hope it will be of help to others if they end up doing something similar, though the solution is not very elegant!
I tried r2evans's suggestion of forcing column types, but I could not get that to work in this case. Using readxl gave the same problem, and also broke my VBA. Given lebelionz's comment suggesting that this is an R thing and not a package thing, I followed his advice to deal with it after the fact. (I do not see how to credit a comment rather than an answer, but for the record this was very helpful, as were the others.)
I therefore altered the program producing the CSV that was being loaded through R. I appended "::" to each cell produced, so that R saw all cells as strings, regardless of the original content. Thus "F" was stored as "::F" and was not altered by R.
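For illustration, a sketch of the same prefixing done in R itself (the original was done upstream, in the program producing the CSV):
# prefix every cell with "::" so each value is unambiguously a string and
# the xlsx package cannot re-guess its type; df is the imported data frame
df[] <- lapply(df, function(col) paste0("::", as.character(col)))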
I added an autorun macro to the Excel sheet thus created, so that when opened it automatically performs a global search-and-replace to remove the "::" prefix from the whole of the data. This forces Excel to choose a data type for each cell after it is restored, so the types are detected cell by cell, in the correct format for my purposes.
It feels kludgy, but it works and is relatively transparent to the user. One hazard is that if the user data intentionally contained the string "::" it would be lost (I am confident this cannot arise in my particular application, but if someone would like to suggest a better prefix I would be interested). I still hope for an eventual solution rather than a work-around.
And here I thought it was only the movie industry that had to "fix it in post"!

read data into R

The World Health Organization dataset is available here: http://www.filedropper.com/who
When the data is read using fread (from the data.table package) or read_csv (from the readr package), some variables are wrapped within the letter r and are shown as character type. Like so:
"\r31.1\r".
I checked the dataset in Notepad and indeed it looks weird, as these values are wrapped in quotes (' '). However, they are numeric, and when the regular read.csv is used there is no such problem.
What's the reason behind this? How to fix?
The '\r' is a special character: the carriage return, which Windows uses (together with '\n') as the line delimiter in text files.
When using read_delim (the more general form of read_csv), setting the argument escape_backslash = TRUE might do the trick.
Check this for further reading.
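Alternatively, a minimal sketch of cleaning the stray characters after import, assuming a hypothetical affected column x:
library(readr)
who <- read_csv("who.csv")
# strip the embedded carriage returns, then convert back to numeric
who$x <- as.numeric(gsub("\r", "", who$x))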
