Dates from Excel to R, platform dependency

I'm importing .xls files using gdata and converting date columns with as.Date. According to the manual for as.Date, the date origin is platform dependent, so I am determining which origin to use accordingly:
.origin <- ifelse(Sys.info()[['sysname']] == "Windows", "1899-12-30", "1904-01-01")
as.Date(myData$Date, origin=.origin)
However, I'm wondering if I should be considering the platform where the file is being read or the platform where it was written?
For what it's worth, I am currently testing the code on a Linux box with no Excel, and the correct dates are produced by using origin="1904-01-01".
Quoting ?as.Date:
## date given as number of days since 1900-01-01 (a date in 1989)
as.Date(32768, origin = "1900-01-01")
## Excel is said to use 1900-01-01 as day 1 (Windows default) or
## 1904-01-01 as day 0 (Mac default), but this is complicated by Excel
## treating 1900 as a leap year.
## So for dates (post-1901) from Windows Excel
as.Date(35981, origin = "1899-12-30") # 1998-07-05
## and Mac Excel
as.Date(34519, origin = "1904-01-01") # 1998-07-05
## (these values come from http://support.microsoft.com/kb/214330)

You could try out the (extremely) new exell package: https://github.com/hadley/exell. It loads Excel dates into POSIXct, correctly choosing the origin based on whether the file was written by Windows or Mac Excel.

Yes, you should be considering where the file was written. Excel on Windows appears capable of distinguishing Mac-written dates from Windows-written dates, but you are getting evidence that these are Mac-originated .xls files.
The safest method would be to work within the version of Excel in which the data was entered, use the Format menu to bring up a dialog box, choose Date with a custom format of yyyy-mm-dd, and then save as a CSV file. You would then be able to import into R with a colClasses vector that has "Date" in the proper column position. But it sounds as though that option is not available to you.
I suppose this doesn't apply to you on a Linux box, so it is just a Mac whine: the gdata package gives deprecation warnings and then fails to install the XLSX support files on R 3.0.0 with the ordinary Perl 5.8 installation in /opt/local/bin/perl, despite gdata::findPerl being able to find it successfully.
At this point I think the question should be redirected to asking whether you could coax the gdata functions into inspecting the properties of the files. After looking at the codebase for xls reading, I rather doubt it, since I do not see any mention of inspecting for different xls versions.
Near the end of a blank .xls file created with a Mac version of Excel, viewed in a text editor, I see:
Worksheets˛ˇˇˇˇˇ ¿F$Microsoft Excel 97 - 2004 Worksheet˛ˇˇˇ8FIBExcel.Sheet.8˛ˇ
‡ÖüÚ˘Oh´ë+'≥Ÿ0îHPhħ
∞ºƒ'David WinsemiusDavid WinsemiusMicrosoft Macintosh Excel#ê˚á!Ë+Œ#ê'å-Ë+ŒG»˛ˇˇˇPICT¿Kġ
The other difference was that the Windows version, inspected the same way, had "Excel 2003 Worksheet" as the type of worksheet, whereas it was "Excel 97 - 2004" for the Mac version.
So maybe you can coerce R into bypassing all the errors that get triggered when reading or grepping these files while scanning for "Macintosh". Maybe Linux R is more resistant to that sort of thing? On my Mac I got:
Error: invalid multibyte string at '<ff>'
I also got a bunch of warnings from grep that suggested I might not be able to "see" into some of the strings:
Warning message:
In grep("Macintosh", lin) : input string 1 is invalid in this locale
You might be able to hijack some more robust code from the Perl script xls2csv.pl, which is part of the gdata package.
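Along those lines, here is a minimal sketch (my own, not part of gdata) that scans the raw bytes of an .xls file for the "Macintosh" marker without ever converting them to character, which sidesteps the locale errors above. The marker heuristic and the function name are assumptions, not a documented API:
guess_xls_origin <- function(path) {
  # read the whole file as raw bytes so no locale conversion is attempted
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  # grepRaw() searches raw vectors directly; "Macintosh" is a heuristic marker
  if (length(grepRaw("Macintosh", bytes, fixed = TRUE)) > 0) {
    "1904-01-01"  # Mac Excel epoch (day 0)
  } else {
    "1899-12-30"  # Windows Excel epoch
  }
}
# as.Date(myData$Date, origin = guess_xls_origin("myfile.xls"))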


WP All Import date issue

I am importing an Excel file that has a date field in the following format: dd/mm/YYYY
It imports the data OK; however, it does something strange with the dates: if the day is higher than 12, it assigns the current date instead.
I found this on the log:
WARNING: unrecognized date format ``, assigning current date
For example, the date 08/04/2020 is imported OK because 08 <= 12; however, the date 23/02/2020 is not imported because 23 is not <= 12, and so it takes the current date.
Any ideas about what is happening?
This issue can get confusing with Excel, since typically, if the system is set for U.S. locales, Excel converts the standard European day/month/year date format to some version of month/day/year. In other words, the forward slash usually indicates American format, which is month first, as in April 9, 2022 or 04/09/22. You can force day/month/year, and apparently that was done for your file or system, but that's not an expected or standard format.
Without examining the All Import code in detail, I'd say the behavior you describe implies it is assuming the standard usage. I'd therefore recommend that you try converting your date column before attempting imports; how to do that efficiently is probably more an Excel question than a WordPress question, unless your version of All Import lets you perform a custom conversion on the field.
A trick that sometimes works on this type of problem is copy-pasting Excel data into a Google Sheet, then downloading as CSV, without ever opening the file in your system before re-uploading. Google Sheets is pretty good about such conversions in general, but no guarantees: Your source file might defeat it in this case.
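If you can preprocess the exported CSV outside of Excel, a minimal R sketch of that conversion (the file and column names here are hypothetical):
dat <- read.csv("export.csv", stringsAsFactors = FALSE)
# parse the dd/mm/yyyy strings explicitly, then write them back out
# in unambiguous ISO yyyy-mm-dd form before running the import
dat$date <- format(as.Date(dat$date, format = "%d/%m/%Y"), "%Y-%m-%d")
write.csv(dat, "export_fixed.csv", row.names = FALSE)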
I had the same issue. Change the dates in your CSV file to mm/dd/yyyy, and when it imports it will automatically convert to dd/mm/yyyy if that is what your WP website is set up to use.
Only day numbers of 12 or lower will work the way you are importing, because there are only 12 months in a year and it is reading your dd as mm.

BlueSky Statistics - Character Encoding Problem

I am loading a data set whose characters were encoded in ISO 8859-9 ("Latin 5") under Windows 10 (Microsoft has assigned code page 28599, a.k.a. Windows-28599, to ISO-8859-9 in Windows).
The data set is originally in Excel.
Whenever I run an analysis, or any operation with a variable name containing a character specific to this code page (ISO 8859-9), I get an error like:
Error: undefined columns selected
BSkyFreqResults <- BSkyFrequency(vars = c("MesleÄŸi"), data = Turnudep_raw_data_5)
Error: object 'BSkyFreqResults' not found
BSkyFormat(BSkyFreqResults)
The characters ÄŸ within "MesleÄŸi" are originally one character in Turkish: ğ (g with a breve).
Those variable names that contain only letters from US code page work normally in BlueSky operations.
If I try Save As in Excel with the web option UTF-8 to convert the data to UTF-8, this does not work either. If I export it to a CSV file, it does not work as-is or saved as UTF-8.
How can I load this data into BlueSky so that it works?
This same data set works in Rstudio:
> Sys.getlocale('LC_CTYPE')
[1] "Turkish_Turkey.1254"
And also in SPSS:
Language is set to Unicode
Picture of Language settings in SPSS
It also works in Jamovi
I also get an error when I start BlueSky, that may be relevant to this problem:
Python-CFFI error
From cffi callback <function _consolewrite_ex at 0x000002A36B441F78>:
Traceback (most recent call last):
File "rpy2\rinterface_lib\callbacks.py", line 132, in _consolewrite_ex
File "rpy2\rinterface_lib\conversion.py", line 133, in _cchar_to_str_with_maxlen
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 15: invalid start byte
Since then I re-downloaded and re-installed BlueSky, but I still get this Python-CFFI error every time I start the software.
I want to work with BlueSky and will appreciate any help in resolving this problem.
Thanks in advance
Here is a link for reproducing the problem.
The zip file contains a data source of 2 cases both in Excel and BlueSky format, a BlueSky Markdown file to show how the error is produced and an RMarkdown file for redundancy (probably useless).
UPDATE: The Python error (Python-CFFI error) appears to be related to the Region settings in Windows.
If the region is USA (Turnudep_reprex_Windows_Region_USA-Settings.jpg) , the python error does NOT appear.
If the region is Turkey (Turnudep_reprex_Windows_Region_Turkey-Settings.jpg) the python error DOES appear.
Unfortunately, setting the region and language to USA does eliminate the python error message but not the other problem. Still all the operations with the Turkish variable names end up with an error.
This may be a problem only the BlueSky developers may solve ...
Any help or suggestion will be greatly appreciated.
UPDATE FOR VERSION 10.2: The Python error (Python-CFFI error) is eliminated in this version. All other problems persist. I also notice that I cannot change the variable names that have characters outside the US code page. Meaning, if a variable name is something like "HastaNo", I can do analysis with that variable and change its name in the editor. If the variable name is something like "Mesleği", I cannot do analysis with that variable AND I CANNOT CHANGE THAT NAME in the editor to "Meslegi" or anything else so that it is usable in analysis.
UPDATE FOR VERSION 10.2.1 (R package version 8.70): No change from version 10.2. Variable names that contain a character outside of ASCII cause an error AND cannot be changed in BlueSky Statistics.
For version 10, according to user manual chapter 15.1.3, you can adjust the encoding setting.
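Until the encoding handling is fixed upstream, one possible workaround is to transliterate the variable names to ASCII in R before handing the data to BlueSky. A minimal sketch, assuming the readxl package is available and that BlueSky can open the resulting CSV (the //TRANSLIT behaviour of iconv varies by platform):
library(readxl)
dat <- read_excel("Turnudep_raw_data_5.xlsx")
# iconv with //TRANSLIT maps characters such as "ğ" to "g" where supported
names(dat) <- iconv(names(dat), from = "UTF-8", to = "ASCII//TRANSLIT")
write.csv(dat, "Turnudep_clean.csv", row.names = FALSE)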

Cannot get R to compare seemingly equal strings

I recently received an Excel file of unknown origin and used Excel to output a subset into a text file, "example0.txt" (see the Dropbox link below). In this dataset, two strings that appear by all means equal are not treated as equal by R.
a<-scan("example0.txt", what="raw")
a
[1] "ÖSTVÅG" "FALKVÅG" "ÖSTVÅG"
# cell a[1] and a[3] appear similar ("ÖSTVÅG"). However,
a[1]==a[3] # I was expecting TRUE but I get FALSE
nchar(a[1]) == nchar(a[3]) # I was expecting TRUE (n=6) but I get FALSE
# and similarly,
nchar(a[2]) == 8 # TRUE, but I was expecting n=7
see files in dropbox:
https://www.dropbox.com/sh/hz6vsj5kj1u9ag6/AACtTMh-x4IIB10kMCUE5rvBa?dl=0
This behaviour really complicates matching strings on a larger dataset I am working on, let alone newer datasets that will be coming in.
Initially I suspected it would have to do with Encodings but I have tried reading the data with different encodings and performing conversions in R but I still get the same results.
I managed to solve the issue by opening the file in Notepad, re-typing a[3], and re-saving the file. I was surprised to see that it now works (you can see the result in example0correct.txt), even though the new file appears identical to the original one.
Can anyone explain to me what is happening, and how I can detect and correct these occurrences, taking example0 as a starting point?
Note: I don't know if it matters, but I am using R 3.2.2 on a Windows 7 machine with Office 2013 installed.
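The classic cause of this symptom is Unicode normalization: one "Å" is stored as a single composed code point while the other is "A" plus a combining ring, so the byte sequences (and the nchar counts) differ even though the glyphs render identically. A minimal sketch of detecting and fixing this with the stringi package:
library(stringi)
x <- "\u00C5"   # "Å" as one composed code point (NFC)
y <- "A\u030A"  # "A" followed by a combining ring above (NFD)
x == y                                  # FALSE: different byte sequences
nchar(y)                                # 2, though it displays as one glyph
stri_trans_nfc(x) == stri_trans_nfc(y)  # TRUE after normalizing both to NFC
# applied to the data above:
# a <- stri_trans_nfc(a); a[1] == a[3]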

read an MSWord file into R

Is it possible to read an MSWord 2010 file into R? I have Windows 7 and a Dell PC.
I am using the line:
my.data <- readLines('c:/users/mark w miller/simple R programs/test_for_r.docx')
to try to read an MSWord file containing the following text:
A 20 1000 AA
B 30 1001 BB
C 10 1500 CC
I get a warning message that says:
Warning message:
In readLines("c:/users/mark w miller/simple R programs/test_for_r.docx") :
incomplete final line found on 'c:/users/mark w miller/simple R programs/test_for_r.docx'
and my.data appears to be gibberish:
# [1] "PK\003\004\024" "¤l" "ÈFÃË‹Átí"
I know that with this simple example I could easily convert the MSWord file to a different format. However, my actual data files consist of complex tables that were typed decades ago and then scanned into PDF documents later. The age of the original paper documents, and perhaps imperfections in the paper, typing and/or scanning process, have resulted in some letters and numbers not being very clear. So far, converting the PDF files to MSWord has been the most successful way to translate the tables correctly. Converting the MSWord files to Excel or rich text, etc., has not been very successful. Even after conversion to MSWord the resulting files are very complex and contain numerous errors. I thought that if I could read the MSWord files into R, that might be the most efficient way to edit and correct them.
I am aware of the tm package, which I gather can read MSWord files into R, but I am a little concerned about using it because it seems to require installing third-party software.
Thank you for any suggestions.
First, readLines() is not the correct solution, since a Word file is not a text (that is, plain ASCII text) file.
The Word-related function in the tm package is called readDOC(), but both it and the required third-party tool (Antiword) are for older Word files (up to Word 2003) and will not work with newer .docx files.
The best I can suggest is that you try readPDF(), also found in the tm package. Note: it requires that the tool pdftotext is installed on your system. Easy for Linux; no idea about Windows. Alternatively, find a Windows tool which converts PDF to plain ASCII text files (not Word files); they should open and display correctly in Notepad on Windows. Then try readLines() again. However, given that your PDF files are old and come from a scanner, conversion to text might be difficult.
Finally: I realise that you did not make the original decision in this instance, but for anybody else - Word and PDF are not appropriate formats for storing data that you want to parse.
In case it helps anyone else: it appears there's a new package dedicated specifically to reading text data, including Word files (also the new .docx format): https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html
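A minimal sketch of using readtext on the file above (assuming the package is installed; readtext() returns a data frame with a text column):
library(readtext)
doc <- readtext("c:/users/mark w miller/simple R programs/test_for_r.docx")
# the document body arrives as a single string in doc$text
cat(doc$text)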
I have not figured out how to read the MSWord file into R, but I have gotten the contents into a format that R can read.
1. I converted a PDF to MSWord with Acrobat X Pro.
2. The original tables had solid vertical lines separating columns. It turns out these vertical lines were disrupting the format of the data when I converted an MSWord file to a text file, but I was able to delete the lines from the MSWord file before creating the text file.
3. I converted the MSWord file to a text file after deleting the vertical lines in Step 2.
The resulting text files still require extensive editing, but at least the data are largely present in a format R can read, and I will not have to re-enter all the data in the PDFs by hand, saving many hours of work.
You can do this very easily with RDCOMClient. That said, some characters will not read in correctly.
require(RDCOMClient)
# Create the connection
wordApp <- COMCreate("Word.Application")
# Let's set visible to true so you can see it run
wordApp[["Visible"]] <- TRUE
# Define the file we want to open
wordFileName <- "c:/path/to/word/doc.docx"
# Open the file
doc <- wordApp[["Documents"]]$Open(wordFileName)
# Print the text
print(doc$range()$text())

Importing Excel files into R, xlsx or xls

Can someone please help me with the best way to import an Excel 2007 (.xlsx) file into R? I have tried several methods and none seems to work. I have upgraded to R 2.13.1, on Windows XP, with xlsx 0.3.0, and I don't know why the error keeps coming up. I tried:
AB<-read.xlsx("C:/AB_DNA_Tag_Numbers.xlsx","DNA_Tag_Numbers")
OR
AB<-read.xlsx("C:/AB_DNA_Tag_Numbers.xlsx",1)
but I get the error:
Error in .jnew("java/io/FileInputStream", file) :
java.io.FileNotFoundException: C:\AB_DNA_Tag_Numbers.xlsx (The system cannot find the file specified)
Thank you.
For a solution that is free of fiddly external dependencies*, there is now readxl:
The readxl package makes it easy to get data out of Excel and into R.
Compared to many of the existing packages (e.g. gdata, xlsx,
xlsReadWrite) readxl has no external dependencies so it's easy to
install and use on all operating systems. It is designed to work with
tabular data stored in a single sheet.
Readxl supports both the legacy .xls format and the modern xml-based
.xlsx format. .xls support is made possible with the libxls C library,
which abstracts away many of the complexities of the underlying binary
format. To parse .xlsx, we use the RapidXML C++ library.
It can be installed like so:
install.packages("readxl") # CRAN version
or
devtools::install_github("hadley/readxl") # development version
Usage
library(readxl)
# read_excel reads both xls and xlsx files
read_excel("my-old-spreadsheet.xls")
read_excel("my-new-spreadsheet.xlsx")
# Specify sheet with a number or name
read_excel("my-spreadsheet.xls", sheet = "data")
read_excel("my-spreadsheet.xls", sheet = 2)
# If NAs are represented by something other than blank cells,
# set the na argument
read_excel("my-spreadsheet.xls", na = "NA")
* not strictly true: it requires the Rcpp package, which in turn requires Rtools (on Windows) or Xcode (on OS X), which are dependencies external to R. But they don't require any fiddling with paths, etc., so that's an advantage over Java and Perl dependencies.
Update: there is now the rexcel package. It promises to bring Excel formatting, functions and many other kinds of information from the Excel file into R.
You may also want to try the XLConnect package. I've had better luck with it than xlsx (plus it can read .xls files too).
library(XLConnect)
theData <- readWorksheet(loadWorkbook("C:/AB_DNA_Tag_Numbers.xlsx"),sheet=1)
Also, if you are having trouble with your file not being found, try selecting it with file.choose().
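For example, an interactive variant of the call above:
library(XLConnect)
# file.choose() opens a file picker, which avoids path typos entirely
theData <- readWorksheet(loadWorkbook(file.choose()), sheet = 1)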
I would definitely try the read.xls function in the gdata package, which is considerably more mature than the xlsx package. It may require Perl ...
Update
As the answer below is now somewhat outdated, I'd just draw attention to the readxl package. If the Excel sheet is well formatted/laid out, then I would now use readxl to read from the workbook. If sheets are poorly formatted/laid out, then I would still export to CSV and then handle the problems in R, either via read.csv() or plain old readLines().
Original
My preferred way is to save individual Excel sheets as comma separated value (CSV) files. On Windows, these files are associated with Excel, so you don't lose the double-click-open-in-Excel "feature".
CSV files can be read into R using read.csv(), or, if you are in a location or using a computer set up with European settings (where , is used as the decimal separator), using read.csv2().
These functions have sensible defaults that make reading appropriately formatted files simple. Just keep any labels for samples or variables in the first row or column.
Added benefits of storing files as CSV are that, as the files are plain text, they can be passed around very easily and you can be confident they will open anywhere; one doesn't need Excel to look at or edit the data.
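As an illustration, reusing the question's file name (the CSV version is hypothetical, produced by Save As in Excel):
# after Excel's File > Save As > CSV on the sheet of interest:
AB <- read.csv("AB_DNA_Tag_Numbers.csv", header = TRUE)
str(AB)  # check that the columns came through with sensible types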
Example 2012:
library("xlsx")
FirstTable <- read.xlsx("MyExcelFile.xlsx", 1 , stringsAsFactors=F)
SecondTable <- read.xlsx("MyExcelFile.xlsx", 2 , stringsAsFactors=F)
I would try the xlsx package, for it is easy to handle and seems mature enough.
It worked fine for me and did not need any extras like Perl or whatever.
Example 2015:
library("readxl")
FirstTable <- read_excel("MyExcelFile.xlsx", 1)
SecondTable <- read_excel("MyExcelFile.xlsx", 2)
Nowadays I use readxl and have had good experiences with it:
no extra stuff needed
good performance
This new package looks nice: http://cran.r-project.org/web/packages/openxlsx/openxlsx.pdf
It doesn't require rJava and uses Rcpp for speed.
If you are running into the same problem and R is giving you an error such as "could not find function .jnew", just install the rJava library. Or if you have it already, just run library(rJava). That should solve the problem.
Also, it should be clear to everybody that CSV and txt files are easier to work with, but life is not easy and sometimes you just have to open an xlsx.
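In other words (assuming the xlsx package itself is already installed):
install.packages("rJava")  # only needed once
library(rJava)             # load it before (or along with) xlsx
library(xlsx)
AB <- read.xlsx("C:/AB_DNA_Tag_Numbers.xlsx", 1)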
For me the openxlsx package worked in the easiest way:
install.packages("openxlsx")
library(openxlsx)
rawData <- read.xlsx("your.xlsx")
I recently discovered Schaun Wheeler's function for importing Excel files into R, after realising that the xlsx package hadn't been updated for R 3.1.0.
https://gist.github.com/schaunwheeler/5825002
The file name needs to have the ".xlsx" extension and the file can't be open when you run the function.
This function is really useful for accessing other people's work. The main advantages over using the read.csv function come when:
importing multiple Excel files
importing large files
files are updated regularly
Using the read.csv function requires manually opening and saving each Excel document, which is time consuming and very boring. Using Schaun's function to automate the workflow is therefore a massive help.
Big props to Schaun for this solution.
What's your operating system? What version of R are you running: 32-bit or 64-bit? What version of Java do you have installed?
I had a similar error when I first started using the read.xlsx() function, and discovered that my issue (which may or may not be related to yours; at a minimum, this response should be viewed as "try this, too") was related to the incompatibility of the xlsx package with 64-bit Java. I'm fairly certain that the xlsx package requires 32-bit Java.
Use 32-bit R and make sure that 32-bit Java is installed. This may address your issue.
You have checked that R is actually able to find the file, e.g. file.exists("C:/AB_DNA_Tag_Numbers.xlsx") ? – Ben Bolker Aug 14 '11 at 23:05
The above comment should've solved your problem:
require("xlsx")
read.xlsx("filepath/filename.xlsx",1)
should work fine after that.
I have tried very hard with all the answers above. However, they did not actually help, because I use a Mac. The rio library has an import function which can import basically any type of data file into R, even files using languages other than English!
Try the code below:
library(rio)
AB <- import("C:/AB_DNA_Tag_Numbers.xlsx")
AB <- AB[, 1]
Hope this helps.
For more detailed reference: https://cran.r-project.org/web/packages/rio/vignettes/rio.html
You may be able to keep multiple tabs and more formatting information if you export to an OpenDocument Spreadsheet file (ods) or an older Excel format and import it with the ODS reader or the Excel reader you mentioned above.
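For the ODS route, a brief sketch with the readODS package (my choice of reader; the answer above doesn't name one):
library(readODS)
AB <- read_ods("AB_DNA_Tag_Numbers.ods", sheet = 1)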
As stated by many here, I am writing the same thing but with an additional point!
First, we need to make sure that R has these two packages installed:
"readxl"
"XLConnect"
To install and then load them, you can use:
install.packages(c("readxl", "XLConnect"))
library(XLConnect)
search()
search() will display the list of packages currently attached in your R session.
Now another catch: even though you might have these two packages, you may still encounter problems while reading an "xlsx" file, and the error could be something like "error: more columns than column names".
To solve this issue you can simply re-save your Excel sheet ("xlsx") as
"CSV (Comma delimited)"
and your life will be super easy....
Have fun!!
The installation of the xlsx package requires rJava and xlsxjars. Indirectly, they require the specific (32- or 64-bit) Java runtime environment on the system.
Pro of read.xlsx: the same package provides both read.xlsx and write.xlsx.
Con: very low speed.
As suggested, the easy way is to save in .csv format from Excel.
Simple benchmark on a 5800 x 15 dataset (median times):
read.xlsx: >10000 ms
read_xlsx: 70 ms
read.csv: 15 ms
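For reference, a sketch of how such timings could be reproduced (the microbenchmark package and the file names are my assumptions; the answer doesn't say how it measured):
library(microbenchmark)
microbenchmark(
  xlsx   = xlsx::read.xlsx("data.xlsx", 1),
  readxl = readxl::read_xlsx("data.xlsx"),
  csv    = read.csv("data.csv"),
  times  = 11  # number of runs; the summary reports the median
)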
