I'm trying to take biological data saved in a .csv format and convert it into a specific xml format set by Darwin Core standards (an extension of Dublin Core). The data are set up in rows of observation records with headers in the first row. I need to repackage the data with Darwin Core standard XML tags using a basic XML tree/schema. The purpose is to standardize the data and make it readily available to load into any kind of database program.
I am a biologist, so I'm fairly new to computer programming and code. I would like to write something in R or Excel that can do this repackaging step automatically so I don't have to manually re-enter thousands of records.
I have tried using the developer tools in Excel 365 to save the .csv as an .xml file, but it seems I would first have to develop the XML tree or schema in a text editor. Also, the XML add-ons that I would use appear to be no longer available. I have downloaded the free text editor "Brackets" (build 1.14) to write some simple XML. I also have RStudio version 1.1.419 with the XML package installed, to potentially write a script with R version 3.4.3. I've read up on all the Darwin Core terms and basic XML syntax and rules, but I don't really know where to start.
This is an example of the data in simple .csv format:
PhysicalObject,ANSP,PH,123,"Cryptantha gypsophila Reveal & C.R. Broome",12,urn:lsid:tim.lsid.tdwg.org:collections:1
PhysicalObject,ANSP,PH,124,"Buxbaumia piperi",2,urn:lsid:tim.lsid.tdwg.org:collections:1
This is what the records should look like as an end product:
[<?xml version="1.0"?>
xsi:schemaLocation="http://rs.tdwg.org/dwc/xsd/simpledarwincore/ http://rs.tdwg.org/dwc/xsd/tdwg_dwc_simple.xsd"
<dwc:scientificName>Cryptantha gypsophila Reveal & C.R. Broome</dwc:scientificName>
<dwc:scientificName>Buxbaumia piperi</dwc:scientificName>
This can be done in a number of ways. Here, I go for the stringi solution because it's easy to read what the inputs are.
The code below imports the data, writes the first part of the XML, then writes SimpleDarwinRecords for each line and finally the last part of the file. unlink is there to clean up before anything is appended to the file. If indentation matters (apparently it doesn't), you may need to tweak the template a bit.
This could also be done using a Jinja2 template and Python.
xy <- read.table(text = 'type,institutionCode,collectionCode,catalogNumber,scientificName,individualCount,datasetID
PhysicalObject,ANSP,PH,123,"Cryptantha gypsophila Reveal & C.R. Broome",12,urn:lsid:tim.lsid.tdwg.org:collections:1
PhysicalObject,ANSP,PH,124,"Buxbaumia piperi",2,urn:lsid:tim.lsid.tdwg.org:collections:1', header = TRUE,
sep = ",")
outfile <- file(description = "output.txt", open = "at")
writeLines('[<?xml version="1.0"?>
xsi:schemaLocation="http://rs.tdwg.org/dwc/xsd/simpledarwincore/ http://rs.tdwg.org/dwc/xsd/tdwg_dwc_simple.xsd"
xmlns:dwr="http://rs.tdwg.org/dwc/xsd/simpledarwincore/">', con = outfile)
</dwr:SimpleDarwinRecord>'), con = outfile)
con = outfile)
This is the result:
[<?xml version="1.0"?>
xsi:schemaLocation="http://rs.tdwg.org/dwc/xsd/simpledarwincore/ http://rs.tdwg.org/dwc/xsd/tdwg_dwc_simple.xsd"
<dwc:scientificName>Cryptantha gypsophila Reveal & C.R. Broome</dwc:scientificName>
<dwc:scientificName>Buxbaumia piperi</dwc:scientificName>
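One caveat with writing records from a string template: characters such as & in a scientific name are copied as-is, and a bare & is not well-formed XML. Below is a minimal base R sketch of the record-writing step with escaping added. The helper names xml_escape and make_record are made up for illustration, and putting the type column under the dcterms: prefix (with every other column under dwc:) is an assumption based on the Simple Darwin Core guidelines, so check it against the terms you actually use. The record set header and footer are left out here.

```r
# Escape the characters that are special inside XML text.
# "&" is handled first so the "&" introduced by "&lt;" and "&gt;" is not escaped again.
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

xy <- read.csv(text = 'type,institutionCode,collectionCode,catalogNumber,scientificName,individualCount,datasetID
PhysicalObject,ANSP,PH,123,"Cryptantha gypsophila Reveal & C.R. Broome",12,urn:lsid:tim.lsid.tdwg.org:collections:1
PhysicalObject,ANSP,PH,124,"Buxbaumia piperi",2,urn:lsid:tim.lsid.tdwg.org:collections:1',
               stringsAsFactors = FALSE)

# Build one <dwr:SimpleDarwinRecord> block from a one-row data frame
make_record <- function(row) {
  # ASSUMPTION: "type" is a Dublin Core term, all other columns are Darwin Core terms
  prefix <- ifelse(names(row) == "type", "dcterms", "dwc")
  values <- xml_escape(as.character(unlist(row)))
  fields <- sprintf("  <%s:%s>%s</%s:%s>", prefix, names(row), values, prefix, names(row))
  paste(c("<dwr:SimpleDarwinRecord>", fields, "</dwr:SimpleDarwinRecord>"),
        collapse = "\n")
}

records <- vapply(seq_len(nrow(xy)), function(i) make_record(xy[i, ]), character(1))
cat(records, sep = "\n")
```

Because the element names are taken from the column headers, this only works when the CSV headers are already valid Darwin Core term names.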
In my job I have to run analytics on data that an external organisation shares through a web portal to which I have been granted access. Various reports are available there, which I can view and download in many formats. Two of these formats are very useful: MS Excel and 'XML file with report data'. The Excel file is normally heavily formatted (with sub-totals, merged cells, etc.) to suit Excel users, so converting it to a data frame or table is a big hassle. I therefore prefer to download the XML file, parse it, save it as CSV, and then carry out my analysis in R.
However, whenever I try to parse the XML file directly in R (to avoid the intermediate conversion to CSV), I never succeed. So far I have tried the XML and xml2 packages in R, but to no avail.
Recently I tried this code.
res <- xmlParse("Skil.xml")
> res <- xmlParse("Skil.xml")
xmlns: URI RptSancDig_VoucherCompilationSheet is not absolute
rootnode <- xmlRoot(res)
rootsize <- xmlSize(rootnode)
> rootsize
[1] 2
xmldataframe <- xmlToDataFrame("Skil.xml")
> xmldataframe <- xmlToDataFrame("Skil.xml")
xmlns: URI RptSancDig_VoucherCompilationSheet is not absolute
> xmldataframe
Textbox24 Textbox63 DDOName_Collection
1 <NA> <NA> <NA>
Just to mention: the file size of Skil.xml is about 12.1 MB, and Excel parses it successfully.
I have also tried the read_xml() function of xml2, but to no avail.
I would have happily shared a sample file to try, but I am unable to do so. Moreover, I am also unable to generate a sample file in that kind of xml format.
Can someone help?
I have multiple .xls files (~100MB) from which I would like to load multiple sheets (from each) into R as data frames. I have tried various functions, such as xlsx::read.xlsx2 and XLConnect::readWorksheetFromFile, both of which run for a very long time (>15 mins) without ever finishing, so I have to force-quit RStudio to keep working.
I also tried gdata::read.xls, which does finish, but it takes more than 3 minutes per sheet and, unlike XLConnect::loadWorkbook, it cannot extract multiple sheets at once (which would be very helpful to speed up my pipeline).
The time it takes these functions to execute (and I am not even sure the first two would ever finish if I let them go longer) is way too long for my pipeline, where I need to work with many files at once. Is there a way to get these to go/finish faster?
In several places, I have seen a recommendation to use the function readxl::read_xls, which seems to be widely recommended for this task and should be faster per sheet. This one, however, gives me an error:
> # Minimal reproducible example:
> setwd("/Users/USER/Desktop")
> library(readxl)
> data <- read_xls(path="test_file.xls")
filepath: /Users/USER/Desktop/test_file.xls
libxls error: Unable to open file
I also did some elementary testing to make sure the file exists and is in the correct format:
> # Testing existence & format of the file
> file.exists("test_file.xls")
[1] TRUE
> format_from_ext("test_file.xls")
[1] "xls"
> format_from_signature("test_file.xls")
[1] "xls"
The test_file.xls used above is available here.
Any advice would be appreciated in terms of making the first functions run faster or the read_xls run at all - thank you!
It seems that some users are able to open the file above using the readxl::read_xls function, while others are not, both on Mac and Windows, using the most up to date versions of R, Rstudio, and readxl. The issue has been posted on the readxl GitHub and has not been resolved yet.
I downloaded your dataset and read each excel sheet in this way (for example, for sheets "Overall" and "Area"):
dt_overall <- as.data.table(read_excel("test_file.xls", sheet = "Overall"))
area_sheet <- as.data.table(read_excel("test_file.xls", sheet = "Area"))
Finally, I get dt like this (for example, only part of the dataset for the "Area" sheet):
Equally, you can use the read_xls function instead of read_excel.
I checked: it also works correctly and is even a little faster, since read_excel is a wrapper around the read_xls and read_xlsx functions from the readxl package.
Also, you can use the excel_sheets function from the readxl package to list the sheet names of your Excel file and then read every sheet.
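Reading every sheet could look roughly like the sketch below. The function name read_all_sheets is made up; excel_sheets(), read_excel() and readxl_example() are readxl functions, and the example workbook is one that ships with the package rather than the file from the question.

```r
library(readxl)

# Read every sheet of one workbook into a named list of tibbles
read_all_sheets <- function(path) {
  sheets <- excel_sheets(path)   # character vector of sheet names
  out <- lapply(sheets, function(s) read_excel(path, sheet = s))
  names(out) <- sheets
  out
}

# readxl_example() returns the path of a small workbook shipped with readxl
all_sheets <- read_all_sheets(readxl_example("datasets.xls"))
names(all_sheets)
```

Each list element can then be wrapped in as.data.table() as in the answer above.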
Benchmarking is done with microbenchmark package for the following packages/functions: gdata::read.xls, XLConnect::readWorksheetFromFile and readxl::read_excel.
Note that XLConnect is a Java-based solution, so it requires a lot of RAM.
I found that I was unable to open the file with read_xls immediately after downloading it, but if I opened the file in Excel, saved it, and closed it again, then read_xls was able to open it without issue.
My suggested workaround for handling hundreds of files is to build a little C# command-line utility that opens, saves, and closes an Excel file. The source code is below; the utility can be compiled with Visual Studio Community edition.
using System.IO;
using Excel = Microsoft.Office.Interop.Excel;
namespace resaver
{
    class Program
    {
        static void Main(string[] args)
        {
            string srcFile = Path.GetFullPath(args[0]);
            Excel.Application excelApplication = new Excel.Application();
            excelApplication.Application.DisplayAlerts = false;
            Excel.Workbook srcworkBook = excelApplication.Workbooks.Open(srcFile);
            srcworkBook.Save();        // re-save in place
            srcworkBook.Close();
            excelApplication.Quit();   // release the Excel process
        }
    }
}
Once compiled, the utility can be called from R using e.g. system2().
I will propose a different workflow. If you happen to have LibreOffice installed, you can convert your Excel files to csv programmatically. I am on Linux, so I do it in bash, but I'm sure it is possible on macOS too.
So open a terminal and navigate to the folder with your excel files and run in terminal:
for i in *.xls
do soffice --headless --convert-to csv "$i"
done
Now in R you can use data.table::fread to read your files with a loop:
Scenario 1: the structure of files is different
If the structure of files is different, then you wouldn't want to rbind them together. You could run in R:
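library(data.table)
# full.names = TRUE keeps the directory, so fread() can find each file
files <- dir("path/to/files", pattern = "\\.csv$", full.names = TRUE)
all_files <- list()
for (i in seq_along(files)) {
  # strip the directory and the .csv extension to get the list name
  fileName <- gsub("(^.*/)(.*)(\\.csv$)", "\\2", files[i])
  all_files[[fileName]] <- fread(files[i])
}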
If you want to extract your named elements within the list into the global environment, so that they can be converted into objects, you can use list2env:
list2env(all_files, envir = .GlobalEnv)
Please be aware of two things: First, in the gsub call, the direction of the slash. And second, list2env may overwrite objects in your Global Environment if they have the same name as the named elements within the list.
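To sidestep the overwrite risk, one option (a suggestion of mine, not something the workflow above requires) is to check for name clashes first, or to unpack the list into its own environment instead of the global one. A minimal base R sketch with a made-up stand-in list:

```r
# Stand-in for the list of tables built above (names and contents are made up)
all_files <- list(sales = data.frame(x = 1:3), costs = data.frame(y = 4:6))

# Names that already exist in the global environment and would be overwritten
clashes <- intersect(names(all_files), ls(envir = globalenv()))
if (length(clashes) > 0) {
  warning("list2env() would overwrite: ", paste(clashes, collapse = ", "))
}

# Unpack into a separate environment so nothing in the global one is touched
sandbox <- list2env(all_files, envir = new.env())
ls(sandbox)
get("sales", envir = sandbox)
```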
Scenario 2: the structure of files is the same
In that case it's likely you want to rbind them all together. You could run in R:
files <- dir("path/to/files", pattern = "\\.csv$", full.names = TRUE)
# read every file, then bind once; fill = TRUE pads columns missing from some files
joined <- rbindlist(lapply(files, fread), fill = TRUE)
On my system, I had to use path.expand.
R> file = "~/blah.xls"
R> read_xls(file)
filepath: ~/Dropbox/signal/aud/rba/balsheet/data/a03.xls
libxls error: Unable to open file
R> read_xls(path.expand(file)) # fixed
Re-saving your file solves the problem easily.
I ran into this problem before too, and found the answer in your discussion.
I used read_excel() to open those files.
I was seeing a similar error and wanted to share a short-term solution.
download.file("https://mjwebster.github.io/DataJ/spreadsheets/MLBpayrolls.xls", "MLBPayrolls.xls")
MLBpayrolls <- read_excel("MLBPayrolls.xls", sheet = "MLB Payrolls", na = "n/a")
Yields (on some systems in my classroom but not others):
Error: filepath: MLBPayrolls.xls libxls error: Unable to open file
The temporary solution was to paste the URL of the xls file into Firefox and download it via the browser. Once this was done we could run the read_excel line without error.
This was happening today on Windows 10, with R 3.6.2 and R Studio 1.2.5033.
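One possible explanation, which this thread does not confirm: on Windows, download.file() writes in text mode by default for most file extensions, which can corrupt binary files such as .xls, whereas a browser download is always binary. Passing mode = "wb" asks for a binary transfer. A sketch with the same URL (the wrapper name fetch_xls is made up):

```r
# Download a binary file without newline translation (this matters on Windows)
fetch_xls <- function(url, destfile) {
  status <- download.file(url, destfile, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed for ", url)
  invisible(destfile)
}

# Usage (needs network access):
# fetch_xls("https://mjwebster.github.io/DataJ/spreadsheets/MLBpayrolls.xls", "MLBPayrolls.xls")
# MLBpayrolls <- readxl::read_excel("MLBPayrolls.xls", sheet = "MLB Payrolls", na = "n/a")
```

If the file opens after a binary download, the text-mode default was the likely culprit on the affected machines.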
If you downloaded the .xls file from the internet, even MS Excel first shows a prompt asking you to confirm that you trust the source. I am guessing this is why R (read_xls) can't open it either: the file is considered unsafe. Save it as an .xlsx file and then use read_xlsx() or read_excel().
Even though this is not a code-based solution, I just changed the file type: instead of xls, I saved it as csv or xlsx, and then opened it as usual.
It worked for me because, when I opened my xls file, this message popped up: "The file format and extension of 'file.xls' don't match. The file could be corrupted or unsafe..."
I have a mixed filetype collection of MS Word documents. Some files are *.doc and some are *.docx. I'm learning to use tm and I've (more or less*) successfully created a corpus composed of the *.doc files using this:
ex_eng <- Corpus(DirSource('~/R/expertise/corpus/english'),
This command does not handle *.docx files. I assume that I need a different reader. From this article, I understand that I could write my own (given a good understanding of the .docx format which I do not currently have).
The readDOC reader uses antiword to parse *.doc files. Is there a similar application that will parse *.docx files?
Or better still, is there already a standard way of creating a corpus of *.docx files using tm?
* more or less, because although the files go in and are readable, I get this warning for every document: In readLines(y, encoding = x$Encoding) : incomplete final line found on 'path/to/a/file.doc'
.docx files are zipped XML files. If you execute this:
> uzfil <- unzip(file.choose())
And then pick a .docx file in your directory, you get:
> str(uzfil)
chr [1:13] "./[Content_Types].xml" "./_rels/.rels" "./word/_rels/document.xml.rels" ...
> uzfil
[1] "./[Content_Types].xml" "./_rels/.rels" "./word/_rels/document.xml.rels"
[4] "./word/document.xml" "./word/theme/theme1.xml" "./docProps/thumbnail.jpeg"
[7] "./word/settings.xml" "./word/webSettings.xml" "./word/styles.xml"
[10] "./docProps/core.xml" "./word/numbering.xml" "./word/fontTable.xml"
[13] "./docProps/app.xml"
This will also silently unpack all of those files to your working directory. The "./word/document.xml" file has the words you are looking for, so you can probably read them with one of the XML tools in package XML. I'm guessing you would do something along the lines of :
xtext <- xmlTreeParse(uzfil[4], useInternalNodes = TRUE)
Actually you will probably need to save this to a temp-directory and add that path to the file name, "./word/document.xml".
You may want to use the further steps provided by @GaborGrothendieck in this answer: How to extract xml data from a CrossRef using R?
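A minimal sketch of the temp-directory approach described above. It uses the xml2 package instead of the XML package named in this answer (a swap on my part), the function name read_docx_text is made up, and it assumes the main text lives in word/document.xml, as in the listing above.

```r
library(xml2)

# Return the paragraphs of a .docx file as a character vector
read_docx_text <- function(path) {
  exdir <- tempfile("docx_")   # scratch folder, so the working directory stays clean
  on.exit(unlink(exdir, recursive = TRUE))
  unzip(path, files = "word/document.xml", exdir = exdir)
  doc <- read_xml(file.path(exdir, "word", "document.xml"))
  # each <w:p> element is one paragraph; xml_text() concatenates its text runs
  xml_text(xml_find_all(doc, "//w:p"))
}

# Usage ("report.docx" is a placeholder file name):
# paragraphs <- read_docx_text("report.docx")
```

The resulting character vector could then be handed to tm, for example through Corpus(VectorSource(...)).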
I ended up using docx2txt to convert the .docx files to text. Then I created a corpus from them like this:
ex_eng <- Corpus(DirSource('~/R/expertise/corpus/english'),
I figure I could probably hack the readDOC reader so that it would use docx2txt or antiword as needed, but this works.
I am using QGIS. I would like to show the value of each raster cell as a label.
My idea (I don't know of any plugin or QGIS functionality that makes this easier) is to export the raster with gdal2xyz.py into a coordinates-value format and then save it as a vector (GML or shapefile). For this second task, I tried to use
gdal_polygonize.py rainfXYZ.txt rainf.shp
Creating output rainf.shp of format GML.
0...10...20...30...40...50...60...70...80...90...100 - done.
Unfortunately, I am unable to load the created file (even if I change the extension to .gml).
The ogr2ogr tool doesn't even recognize this format.
yes - sorry I forgot to add such information.
In general, after preparing the CSV file (using gdal2xyz.py with the -csv option),
I need to add one line at the beginning of it:
"Longitude,Latitude,Value" (without the quotes)
Then I need to create a VRT file which contains
<OGRVRTDataSource>
  <OGRVRTLayer name="Shapefile_name">
    <SrcDataSource>Shapefile_name.csv</SrcDataSource>
    <GeometryType>wkbPoint</GeometryType>
    <GeometryField encoding="PointFromColumns" x="Longitude" y="Latitude"/>
  </OGRVRTLayer>
</OGRVRTDataSource>
Run the command "ogr2ogr -select Value Shapefile_name.shp Shapefile_name.vrt". I got the file evap_OBC.shp and two other associated files.
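The two manual preparation steps above (adding the header line and writing the VRT) could be scripted. Here is a base R sketch; the function name prepare_vrt is made up, and it assumes the gdal2xyz.py output is a headerless, comma-separated file with three columns.

```r
# Prepend the header line to the CSV and write a matching OGR VRT next to it.
# The VRT layer name must equal the CSV file name without its extension.
prepare_vrt <- function(csv_path) {
  header <- "Longitude,Latitude,Value"
  layer <- tools::file_path_sans_ext(basename(csv_path))
  lines <- readLines(csv_path)
  # only add the header once, so the function can be re-run safely
  if (!identical(lines[1], header)) writeLines(c(header, lines), csv_path)
  vrt <- c(
    "<OGRVRTDataSource>",
    sprintf("  <OGRVRTLayer name=\"%s\">", layer),
    sprintf("    <SrcDataSource>%s</SrcDataSource>", basename(csv_path)),
    "    <GeometryType>wkbPoint</GeometryType>",
    "    <GeometryField encoding=\"PointFromColumns\" x=\"Longitude\" y=\"Latitude\"/>",
    "  </OGRVRTLayer>",
    "</OGRVRTDataSource>")
  vrt_path <- file.path(dirname(csv_path), paste0(layer, ".vrt"))
  writeLines(vrt, vrt_path)
  vrt_path
}

# Usage ("rainf.csv" is a placeholder); run ogr2ogr from the folder holding both files:
# prepare_vrt("rainf.csv")
```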
For the sake of archive completeness, this question has also been asked on the GDAL mailing list, in the thread "save raster as point-vector file". It seems Chaitanya provided a solution for it.
I am trying to learn R and want to bring in an SPSS file, which I can open in SPSS.
I have tried using read.spss from foreign and spss.get from Hmisc. Both error messages are the same.
Here is my code:
## install.packages("Hmisc")
## change the working directory
setwd('C:/Documents and Settings/BTIBERT/Desktop/')
## load in the file
## ?read.spss
asq <- read.spss('ASQ2010.sav', to.data.frame=T)
And the resulting error:
Error in read.spss("ASQ2010.sav", to.data.frame = T) :
  error reading system-file header
In addition: Warning message:
In read.spss("ASQ2010.sav", to.data.frame = T) :
  ASQ2010.sav: position 0: character `\000' (
Also, I tried saving out the SPSS file as a SPSS 7 .sav file (was previously using SPSS 18).
Warning messages:
1: In read.spss("ASQ2010_test.sav", to.data.frame = T) :
  ASQ2010_test.sav: Unrecognized record type 7, subtype 14 encountered in system file
2: In read.spss("ASQ2010_test.sav", to.data.frame = T) :
  ASQ2010_test.sav: Unrecognized record type 7, subtype 18 encountered in system file
I had a similar issue and solved it following a hint in read.spss help.
Using package memisc instead, you can import a portable SPSS file like this:
data <- as.data.set(spss.portable.file("filename.por"))
Similarly, for .sav files:
data <- as.data.set(spss.system.file('filename.sav'))
although in this case I seem to miss some string values, while the portable import works seamlessly. The help page for spss.portable.file claims:
The importer mechanism is more flexible and extensible than read.spss and read.dta of package "foreign", as most of the parsing of the file headers is done in R. They are also adapted to load efficiently large data sets. Most importantly, importer objects support the labels, missing.values, and descriptions, provided by this package.
read.spss seems to be a little outdated, so I used a package called memisc.
To get this to work, do this:
data <- as.data.set(spss.system.file('yourfile.sav'))
You may also try this:
library(haven)
setwd("C:/Users/rest of your path")
data <- read_sav("data.sav")
and if you want to read all files from one folder:
temp <- list.files(pattern = "\\.sav$")
# simplify = FALSE keeps one data frame per file in a named list
read.all <- sapply(temp, read_sav, simplify = FALSE)
I know this post is old, but I also had problems loading a Qualtrics SPSS file into R. R's read.spss code came from PSPP a long time ago, and hasn't been updated in a while. (And Hmisc's code uses read.spss(), too, so no luck there.)
The good news is that PSPP 0.6.1 should read the files fine, as long as you specify a "String Width" of "Short - 255 (SPSS 12.0 and earlier)" on the "Download Data" page in Qualtrics. Read it into PSPP, save a new copy, and you should be in business. Awkward, but free.
You can read an SPSS file in R using the solutions above or the one you are currently using. Just make sure the command is given a file that it can read properly. I had the same error, and the problem was that SPSS could not access that file. Make sure the file path is correct, the file is accessible, and it is in the correct format.
asq <- read.spss('ASQ2010.sav', to.data.frame=TRUE)
As far as the warning message is concerned, it does not affect the data. Record type 7 is used by newer SPSS versions to store features in a way that still lets older SPSS software read the file. I have used this numerous times and no data were lost.
You can also read about this at http://r.789695.n4.nabble.com/read-spss-warning-message-Unrecognized-record-type-7-subtype-18-encountered-in-system-file-td3000775.html#a3007945
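The checks suggested above (path, accessibility, format) can be sketched in base R. The function name check_sav is made up, and the signature test rests on my understanding that SPSS system files begin with the bytes $FL2 (or $FL3 for the compressed .zsav variant); treat that as an assumption and verify it against a file you know is good.

```r
# Report the first problem found with a supposed SPSS system (.sav) file
check_sav <- function(path) {
  if (!file.exists(path)) return("file not found")
  if (file.access(path, mode = 4) != 0) return("file is not readable")
  sig <- readBin(path, what = "raw", n = 4)
  # Compare raw bytes: rawToChar() would fail if the file starts with nul bytes
  known <- list(charToRaw("$FL2"), charToRaw("$FL3"))
  if (any(vapply(known, identical, logical(1), sig))) {
    "signature looks like an SPSS system file"
  } else {
    "unexpected signature: probably not a .sav file"
  }
}

# Usage with the file name from the question:
# check_sav("ASQ2010.sav")
```

A file that starts with nul bytes, as the "position 0: character `\000'" warning in the question suggests, would fall into the last branch.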
It looks like the R read.spss implementation is incomplete or broken. R2.10.1 does better than R2.8.1, however. It appears that R gets upset about custom attributes in a sav file even with 2.10.1 (The latest I have). R also may not understand the character encoding field in the file, and in particular it probably does not work with SPSS Unicode files.
You might try opening the file in SPSS, deleting any custom attributes, and resaving the file.
You can see whether there are custom attributes with the SPSS command
display attributes.
If so, delete them (see VARIABLE ATTRIBUTE and DATAFILE ATTRIBUTE commands), and try again.
Jon Peck
If you have access to SPSS, save the file as .csv, then import it with read.csv or read.table. I can't recall any problem with importing .sav files; so far it has worked like a charm with both read.spss and spss.get. I reckon that spss.get will not give different results, since it depends on foreign::read.spss.
Can you provide some info on SPSS/R/Hmisc/foreign version?
Another solution not mentioned here is to read SPSS data in R via ODBC. You need:
IBM SPSS Statistics Data File Driver. Standalone driver is enough.
Import SPSS data using RODBC package in R.
See the example here. However, I have to admit that there could be problems with very big data files.
For me it works well using memisc!
Daten.Februar <-as.data.set(spss.system.file("NPS_Februar_15_Daten.sav"))
I agree with @SDahm that the haven package would be the way to go. I myself have struggled a bit with string values when starting to use it, so I thought I'd share my approach on that here, too.
The "semantics" vignette has some useful information on this topic.
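library(haven)
library(purrr)
# Some interesting information in here
vignette("semantics", package = "haven")
# Get data from spss file
df <- read_sav(path_to_file)
# get column names (variable labels), keeping the variable name when there is no label
var_labels <- map_chr(names(df), function(n) {
  lab <- attr(df[[n]], 'label')
  if (is.null(lab)) n else lab})
# get value labels
df <- map_df(.x = df, .f = function(x) {
  if (is.labelled(x)) as_factor(x)
  else x})
colnames(df) <- var_labels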
There is no such problem with the packages you are using. The only requirement for reading an SPSS file is to save it in the PORTABLE format. That is, SPSS files normally have the *.sav extension; you need to convert yours to a portable file, which uses the *.por extension.
There is more info in http://www.statmethods.net/input/importingdata.html
In my case this warning came together with the appearance of a new variable before the first column of my data (with values -100, 2, 2, 2, ...), a shift in the correspondence between labels and values, and the deletion of the last variable. A solution that worked was to create (in SPSS) a new dummy variable in the last column of the file, fill it with random values, and execute the following code:
(filename is the path to the .sav file; in my case the original SPSS file had 62 columns, thus 63 with the additional dummy variable)
library(memisc)
data <- as.data.set(spss.system.file(filename))
copyofdata = data
# shift every variable name one position to the right
for(i in 2:63){
  names(data)[i] <- names(copyofdata)[i-1]
}
data[[1]] <- NULL
newcopyofdata = data
# shift the value labels the same way
for(i in 2:62){
  labels(data[[i]]) <- labels(newcopyofdata[[i-1]])
}
labels(data[[1]]) <- NULL
Hope the above code will help someone else.
Turn your UNICODE in SPSS off
Open SPSS without any data open and run the code below in your syntax editor
Open the data set and resave it to remove the Unicode
read.spss('yourdata.sav', to.data.frame=T) works correctly then
I just came across an SPSS file that I couldn't open using haven, foreign, or memisc, but readspss::read.por did the trick for me:
unzip("IMSgeneral92.zip", exdir = "IMSgeneral92")
# rio, haven, foreign, memisc pkgs don't work on this file! But readspss does:
if(!require(readspss)) remotes::install_git("https://github.com/JanMarvin/readspss.git")
ims92 <- readspss::read.por("IMSgeneral92/IMS_Nov7 92.por", convert.factors = FALSE)
Nice! Thanks, @JanMarvin!
I've found the program, stat-transfer, useful for importing spss and stata files into R.
It resolves the issue you mention by converting spss to R dataset. Also very useful for subsetting super large datasets into smaller portions consumable by R. Not free, but a very useful tool for working with datasets from different programs -- especially if you don't have access to them.
Memisc package also has an spss function worth trying.