Trying to read 20 GB of data: read.csv.sql produces errors

I have a 20GB dataset in csv format and I am trying to trim it down with a read.csv.sql command.
I am successfully able to load the first 10,000 observations with the following command:
testframe = read.csv(file.choose(),nrows = 10000)
The column names can be seen in the following picture:
I then tried to build my trimmed down dataset with the following command, and get an error:
reduced = read.csv.sql(file.choose(),
                       sql = 'select * from file where "country" = "Poland" OR
                              country = "Germany" OR country = "France" OR country = "Spain"',
                       header = TRUE,
                       eol = "\n")
The error is:
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) :
  RS_sqlite_import: C:\Users\feded\Desktop\AWS\biodiversity-data\occurence.csv line 262 expected 37 columns of data but found 38
Why is it that I can load the first 10,000 observations with ease and problems arise with the second command? I hope you have all the information needed to be able to provide some help on this issue.

Note that with the latest versions of all the packages involved, read.csv.sql is working again.
RSQLite made breaking changes in its interface to SQLite, which meant that read.csv.sql, and any other software that used the old interface to read files into SQLite from R, no longer worked. (Other aspects of sqldf still work.)
findstr/grep
If the only reason you are doing this is to cut the file down to the 4 countries indicated, perhaps you could just preprocess the csv file like this on Windows, assuming that abc.csv is your csv file, that it is in the current directory, and that XYZ is a string that appears in the header line.
DF <- read.csv(pipe('findstr "XYZ France Germany Poland Spain" abc.csv'))
On other platforms use grep:
DF <- read.csv(pipe('grep -E "XYZ|France|Germany|Poland|Spain" abc.csv'))
The above could retrieve some extra rows if those words also appear in fields other than the intended one, but if that is a concern you can use subset or filter in R, once the data has been read in, to narrow it down to just the desired rows.
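For instance, a minimal sketch of that final narrowing step, assuming the column is named country as in the question:
# keep only the four countries of interest (column name assumed from the question)
DF <- subset(DF, country %in% c("France", "Germany", "Poland", "Spain"))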
Other utilities
There are also numerous command line utilities that can be used as an alternative to findstr and grep such as sed, awk/gawk (mentioned in the comments) and utilities specifically geared to csv files such as csvfix (C++), miller (go), csvkit (python), csvtk (go) and xsv (rust).
xsv
Taking xsv as an example, binaries can be downloaded here and then we can write the following, assuming xsv is in the current directory or on the path. This instructs xsv to extract the rows for which the indicated regular expression matches the country column.
cmd <- 'xsv search -s country "France|Germany|Poland|Spain" abc.csv'
DF <- read.csv(pipe(cmd))
SQLite command line tool
You can use the SQLite command line program to read the file into an SQLite database, which it will create for you. Google for "download sqlite", download the SQLite command line tools for your platform and unpack them. Then from the command line (not from R) run something like this to create the abc.db SQLite database from abc.csv.
sqlite3 --csv abc.db ".import abc.csv abc"
Then assuming that the database is in current directory run this in R:
library(sqldf)
sqldf("select count(*) from abc", dbname = "abc.db")
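Once the data is in the database, the country filter from the question can be pushed into SQL so that only the matching rows come back to R; a sketch, assuming the column is named country:
reduced <- sqldf("select * from abc
                  where country in ('France', 'Germany', 'Poland', 'Spain')",
                 dbname = "abc.db")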
I am not sure that SQLite is a good choice for such a large file, but you can try it.
H2
Another possibility if you have sufficient memory to hold the database (possibly after using findstr/grep/xsv or other utility on the command line rather than R) is to then use the H2 database backend to sqldf from R.
If sqldf sees that the RH2 package containing the H2 driver is loaded it will use that instead of SQLite. (It would also be possible to use MySQL or PostgreSQL backends but these are more involved to install so we won't cover them although these are much more likely to be able to handle the large size you have.)
Note that the RH2 driver requires that the rJava R package be installed, and rJava requires Java itself, although Java is very easy to install. The H2 database itself is included in the RH2 driver package, so it does not have to be installed separately. Also, the first time in a session that you access Java code with rJava it will have to load Java, which takes some time, but thereafter it will be faster within that session.
library(RH2)
library(sqldf)
abc3 <- sqldf("select * from csvread('abc.csv') limit 3") |>
  type.convert(as.is = TRUE)
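The same country filter can also be expressed against the H2 csvread backend; a hedged sketch, again assuming the column is named country (csvread reads every field as character, hence the type.convert):
abc_eu <- sqldf("select * from csvread('abc.csv')
                 where country in ('France', 'Germany', 'Poland', 'Spain')") |>
  type.convert(as.is = TRUE)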

Related

Does read.csv.sql put the result data frame in memory or in the database?

I'm using sqldf package to import CSV files into R and then produce statistics based on the data inserted into the created dataframes. We have a shared lab environment with a lot of users which means that we all share the available RAM on the same server. Although there is a high capacity of RAM available, given the number of users who are often simultaneously connected, the administrator of the environment recommends using some database (PostgreSQL, SQlite, etc.) to import our files into it, instead of importing everything in memory.
I was checking the documentation of the sqldf package and the read.csv.sql function caught my attention. Here is what we can read in the documentation:
Reads the indicated file into an sql database creating the database if it does not already exist. Then it applies the sql statement returning the result as a data frame. If the database did not exist prior to this statement it is removed.
However, what I don't understand is whether the returned data frame will be in memory (therefore in the RAM of the server) or, like the imported CSV file, in the specified database. If it is in memory, it doesn't meet my requirement, which is reducing the use of the shared RAM as much as possible given the huge size of my CSV files (tens of gigabytes, often more than 100,000,000 lines in each file).
Curious to see how this works, I wrote the following program:
df_data <- suppressWarnings(read.csv.sql(
  file = "X:/logs/data.csv",
  sql = "
    select
      nullif(timestamp, '') as timestamp_value,
      nullif(user_account, '') as user_account,
      nullif(country_code, '') as country_code,
      nullif(prefix_value, '') as prefix_value,
      nullif(user_query, '') as user_query,
      nullif(returned_code, '') as returned_code,
      nullif(execution_time, '') as execution_time,
      nullif(output_format, '') as output_format
    from
      file
  ",
  header = FALSE,
  sep = "|",
  eol = "\n",
  `field.types` = list(
    timestamp_value = c("TEXT"),
    user_account = c("TEXT"),
    country_code = c("TEXT"),
    prefix_value = c("TEXT"),
    user_query = c("TEXT"),
    returned_code = c("TEXT"),
    execution_time = c("REAL"),
    output_format = c("TEXT")
  ),
  dbname = "X:/logs/sqlite_tmp.db",
  drv = "SQLite"
))
I ran the above program to import a big CSV file (almost 150,000,000 rows). It took around 30 minutes. During the execution, as specified via the dbname parameter in the source code, I saw that an SQLite database file was created at X:/logs/sqlite_tmp.db. As the rows in the file were being imported, this file grew bigger and bigger, which indicated that all rows were indeed being inserted into the database file on disk and not into the server's RAM. At the end of the import the database file had reached 30 GB. As stated in the documentation, at the end of the import process this database was removed automatically. Yet after the created SQLite database had been removed automatically, I was still able to work with the result data frame (that is, df_data in the above code).
What I understand is that the returned data frame was in the RAM of the server, otherwise I wouldn't have been able to refer to it after the created SQLite database had been removed. Please correct me if I'm wrong, but if that is the case, I think I misunderstood the purpose of this R package. My aim was to put everything, even the result data frame, in a database, and use the RAM only for calculations. Is there any way to keep everything in the database until the end of the program?
The purpose of sqldf is to process data frames using SQL. If you want to create a database and read a file into it, you can use dbWriteTable from RSQLite directly; however, if you want to use sqldf anyway, then first create an empty database, mydb, then read the file into it, and finally check that the table is there. Ignore the read.csv.sql warning. If you add the verbose = TRUE argument to read.csv.sql it will show the RSQLite statements it is using.
Also you may wish to read https://avi.im/blag/2021/fast-sqlite-inserts/ and https://www.pdq.com/blog/improving-bulk-insert-speed-in-sqlite-a-comparison-of-transactions/
library(sqldf)

sqldf("attach 'mydb' as new")

read.csv.sql("myfile.csv", sql =
  "create table mytab as select * from file", dbname = "mydb")
## data frame with 0 columns and 0 rows
## Warning message:
## In result_fetch(res@ptr, n = n) :
##   SQL statements must be issued with dbExecute() or
##   dbSendStatement() instead of dbGetQuery() or dbSendQuery().

sqldf("select * from sqlite_master", dbname = "mydb")
##   type name tbl_name rootpage
##   .. info on table that was created in mydb ...

sqldf("select count(*) from mytab", dbname = "mydb")
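From here, further queries can also be pushed down to the database so that only small summaries come back into R; a hedged sketch, using a column name from the question's schema as an example:
sqldf("select country_code, count(*) as n
       from mytab
       group by country_code", dbname = "mydb")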

Delete layer from GeoPackage

I am trying to delete a vector layer from a GeoPackage file using the sf package. By "delete" I mean permanently remove, NOT overwrite or update. I am aware of the delete_layer option, but as I understand it, this only deletes a layer immediately before replacing it with a layer of the same name.
Unfortunately I have written a layer with a name using non-standard encoding to a GeoPackage, which effectively makes the entire gpkg file unreadable in QGIS. Hence, I am trying to find a way to remove it via R.
A geopackage is also an SQLite database, so you can use RSQLite database functions to remove tables.
Set up a test:
> d1 = st_as_sf(data.frame(x=runif(10),y=runif(10),z=1:10), coords=c("x","y"))
> d2 = st_as_sf(data.frame(x=runif(10),y=runif(10),z=1:10), coords=c("x","y"))
> d3 = st_as_sf(data.frame(x=runif(10),y=runif(10),z=1:10), coords=c("x","y"))
Write those to a GPKG:
> st_write(d1,"deletes.gpkg","d1")
Writing layer `d1' to data source `deletes.gpkg' using driver `GPKG'
features: 10
fields: 1
geometry type: Point
> st_write(d2,"deletes.gpkg","d2",quiet=TRUE)
> st_write(d3,"deletes.gpkg","d3",quiet=TRUE)
Now to delete, use the RSQLite package (from CRAN), create a database connection:
library(RSQLite)
db = SQLite()
con = dbConnect(db,"./deletes.gpkg")
and remove the table:
dbRemoveTable(con, "d2")
There's one tiny problem - this removes the table but does not remove the metadata that GPKG uses to mark this table as a spatial layer. Hence you get warnings like this with GDAL tools:
$ ogrinfo -so -al deletes.gpkg
ERROR 1: Table or view 'd2' does not exist
Warning 1: unable to read table definition for 'd2'
QGIS happily read the remaining two layers correctly, though. I think this can be worked around in R by loading the Spatialite extension module alongside SQLite, or by manually removing the rows in the metadata tables gpkg_geometry_columns and maybe gpkg_ogr_contents (see the sketch below), but nothing seems to break hard with those things not updated.
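A minimal sketch of that manual clean-up, assuming the standard GeoPackage metadata tables (gpkg_contents, gpkg_geometry_columns) plus GDAL's gpkg_ogr_contents are present:
# remove the leftover metadata rows for the deleted layer
dbExecute(con, "DELETE FROM gpkg_contents WHERE table_name = 'd2'")
dbExecute(con, "DELETE FROM gpkg_geometry_columns WHERE table_name = 'd2'")
dbExecute(con, "DELETE FROM gpkg_ogr_contents WHERE table_name = 'd2'")
dbDisconnect(con)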

Loading data with RSQLite which has quoted values

I am trying to load a large-ish csv file into an SQLite database using the RSQLite package (I have also tried the sqldf package). The file contains all UK postcodes and a variety of lookup values for them.
I wanted to avoid loading it into R and just directly load it into the database. Whilst this is not strictly necessary for this task, I want to do so in order to have the technique ready for larger files which won't fit in memory should I have to handle them in the future.
Unfortunately the csv is provided with the values in double quotes and the dbWriteTable function doesn't seem able to strip them or ignore them in any form. Here is the download location of the file: http://ons.maps.arcgis.com/home/item.html?id=3548d835cff740de83b527429fe23ee0
Here is my code:
# Load library
library("RSQLite")
# Create a temporary directory
tmpdir <- tempdir()
# Set the file name
file <- "data\\ONSPD_MAY_2017_UK.zip"
# Unzip the ONS Postcode Data file
unzip(file, exdir = tmpdir)
# Create a path pointing at the unzipped csv file
ONSPD_path <- paste0(tmpdir, "\\ONSPD_MAY_2017_UK.csv")
# Create an SQLite database connection
db_connection <- dbConnect(SQLite(), dbname = "ons_lkp_db")
# Now load the data into our SQLite database
dbWriteTable(conn = db_connection,
             name = "ONS_PD",
             value = ONSPD_path,
             row.names = FALSE,
             header = TRUE,
             overwrite = TRUE
)
# Check the data upload
dbListTables(db_connection)
dbGetQuery(db_connection, "SELECT pcd, pcd2, pcds from ONS_PD LIMIT 20")
Having hit this issue, I found a reference tutorial (https://www.r-bloggers.com/r-and-sqlite-part-1/) which recommended using the sqldf package, but unfortunately when I try to use the relevant function in sqldf (read.csv.sql) I get the same issue with double quotes.
This feels like a fairly common issue when importing csv files into an SQL system; most import tools are able to handle double quotes, so I'm surprised to be hitting a problem with this (unless I've missed an obvious help file on the issue somewhere along the way).
EDIT 1
Here is some example data from my csv file in the form of a dput output of the SQL table:
structure(list(pcd = c("\"AB1 0AA\"", "\"AB1 0AB\"", "\"AB1 0AD\"",
"\"AB1 0AE\"", "\"AB1 0AF\""), pcd2 = c("\"AB1 0AA\"", "\"AB1 0AB\"",
"\"AB1 0AD\"", "\"AB1 0AE\"", "\"AB1 0AF\""), pcds = c("\"AB1 0AA\"",
"\"AB1 0AB\"", "\"AB1 0AD\"", "\"AB1 0AE\"", "\"AB1 0AF\"")), .Names = c("pcd",
"pcd2", "pcds"), class = "data.frame", row.names = c(NA, -5L))
EDIT 2
Here is my attempt using the filter argument in sqldf's read.csv.sql function (note that Windows users will need Rtools installed for this). Unfortunately this still doesn't seem to remove the quotes from my data, although it does mysteriously remove all the spaces.
library("sqldf")
sqldf("attach 'ons_lkp_db' as new")
db_connection <- dbConnect(SQLite(), dbname="ons_lkp_db")
read.csv.sql(ONSPD_path,
             sql = "CREATE TABLE ONS_PD AS SELECT * FROM file",
             dbname = "ons_lkp_db",
             filter = 'tr.exe -d ^"'
)
dbGetQuery(db_connection,"SELECT pcd, pcd2, pcds from ONS_PD LIMIT 5")
Also, thanks for the close vote from whoever felt this wasn't a programming question in the scope of Stack Overflow(?!).
The CSV importer in the RSQLite package is derived from the sqlite3 shell, which itself doesn't seem to offer support for quoted values when importing CSV files (How to import load a .sql or .csv file into SQLite?, doc). You could use readr::read_delim_chunked():
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "ons_lkp_db")
# append each chunk to the ONS_PD table, creating it on the first chunk
callback <- function(data, pos) {
  name <- "ONS_PD"
  exists <- dbExistsTable(con, name)
  dbWriteTable(con, name, data, append = exists)
}
readr::read_delim_chunked(ONSPD_path, callback, ...)
Substitute ... with any extra arguments you need for your CSV file.
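For example, a hedged invocation (the delimiter and chunk size here are illustrative, not taken from the original answer):
readr::read_delim_chunked(ONSPD_path, callback, delim = ",", chunk_size = 100000)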
Use read.csv.sql from the sqldf package with the filter argument and provide any utility which strips out double quotes or which translates them to spaces.
The question does not provide a fully reproducible minimal example but I have provided one below. If you are using read.csv.sql in order to pick out a subset of rows or columns then just add the appropriate sql argument to do so.
First set up the test input data and then try any of the one-line solutions shown below. Assuming Windows, ensure that the tr utility (found in R's Rtools distribution) or the third party csvfix utility (found here and for Linux also see this) or the trquote2space.vbs vbscript utility (see Note at end) is on your path:
library(sqldf)
cat('a,b\n"1","2"\n', file = "tmp.csv")
# 1 - corrected from FAQ
read.csv.sql("tmp.csv", filter = "tr.exe -d '^\"'")
# 2 - similar but does not require Windows cmd quoting
read.csv.sql("tmp.csv", filter = "tr -d \\42")
# 3 - using csvfix utility (which must be installed first)
read.csv.sql("tmp.csv", filter = "csvfix echo -smq")
# 4 - using trquote2space.vbs utility as per Note at end
read.csv.sql("tmp.csv", filter = "cscript /nologo trquote2space.vbs")
any of which give:
a b
1 1 2
You could also use any other language or utility that is appropriate. For example, your Powershell suggestion could be used although I suspect that dedicated utilities such as tr and csvfix would run faster.
The first solution above is corrected from the FAQ. (It did work at the time the FAQ was written many years back, but testing it now on Windows 10 it seems to require the indicated change; possibly the markdown did not survive intact in the move from Google Code, where it was originally located, to GitHub, which uses a slightly different markdown flavor.)
For Linux, tr is available natively although quoting differs from Windows and can even depend on the shell. csvfix is available on Linux too but would have to be installed. The csvfix example shown above would work identically on Windows and Linux. vbscript is obviously specific to Windows.
Note: sqldf comes with a mini-tr utility written in vbscript. If you change the relevant lines to:
Dim sSearch : sSearch = chr(34)
Dim sReplace : sReplace = " "
and change the name to trquote2space.vbs then you will have a Windows specific utility to change double quotes to spaces.
Honestly I could not find anything to solve this problem.
The sqldf documentation says:
"so, one limitation with .csv files is that quotes are not regarded as special within files so a comma within a data field such as "Smith, James" would be regarded as a field delimiter and the quotes would be entered as part of the data which probably is not what is intended"
So, it looks like there is no solution as far as I know.
One possible suboptimal approach (other than the obvious find-and-replace in a text editor) is to use SQL commands like this:
dbExecute(db_connection, "UPDATE ONS_PD SET pcd = REPLACE(pcd, '\"', '')")
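The same idea can be extended to each affected column; a sketch using the column names shown in the question (pcd, pcd2, pcds):
# strip the embedded double quotes from each quoted column
for (col in c("pcd", "pcd2", "pcds")) {
  dbExecute(db_connection, sprintf("UPDATE ONS_PD SET %s = REPLACE(%s, '\"', '')", col, col))
}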

integrating hadoop, revo-scaleR and hive

I have a requirement to fetch data from Hive tables into a csv file and use it in RevoScaleR.
Currently we pull the data from Hive, manually put it into a file, and use it on the Unix file system for ad hoc analysis; however, the requirement is to redirect the results directly to an hdfs location and use RevoScaleR from there.
How do I do that? What sort of connection do I need to establish for this?
If I understand your question correctly, you could use a RevoScaleR ODBC connection to import the Hive table and do further analysis from there.
Here is example of using Hortonworks provided ODBC driver:
OdbcConnString <- "DSN=Sample Hortonworks Hive DSN"

odbcDS <- RxOdbcData(sqlQuery = "SELECT * FROM airline",
                     connectionString = OdbcConnString,
                     stringsAsFactors = TRUE,
                     useFastRead = TRUE,
                     rowsPerRead = 150000)

xdfFile <- "airlineHWS.xdf"
if (file.exists(xdfFile)) file.remove(xdfFile)

Flights <- rxImport(odbcDS, outFile = xdfFile, overwrite = TRUE)
rxGetInfo(data = "airlineHWS.xdf", getVarInfo = TRUE, numRows = 10)
Chenwei's approach is OK, but there is just one problem: the data is temporarily stored in memory as a data frame in the odbcDS object. If we have a huge table in Hive, this will not work.
I would suggest keeping everything on disk by using external tables in Hive and then using the backend data directly in Revolution R.
Something along these lines:
Create an external table from the existing Hive table in a text file (csv, tab, etc.) format.
CREATE EXTERNAL TABLE ext_table
LIKE your_original_table_name
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/your/hdfs/location';
Here we are creating external table which is stored as csv file in hdfs.
Next, copy the original table to the external table using the insert overwrite command.
insert overwrite table ext_table select * from your_original_table_name
If we want to check the backend data on hdfs, type:
hadoop fs -ls /your/hdfs/location/
We can see the part files stored at that location. Go ahead and cat them to be doubly sure.
Now we can use the RxTextData function to read the data from the step above:
hive_data <- RxTextData(file = '/your/hdfs/location/', delimiter = ',')
Now you can create an xdf file using hive_data as the input data, to be more efficient for further processing; above all, the data never touches R's memory (a sketch is below).
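A minimal sketch of that step, assuming an HDFS-backed file system in RevoScaleR; the output path and the RxHdfsFileSystem()/RxXdfData()/rxImport() combination are illustrative rather than taken from the original answer:
hdfsFS <- RxHdfsFileSystem()
hive_data <- RxTextData(file = "/your/hdfs/location/", delimiter = ",", fileSystem = hdfsFS)
hive_xdf <- RxXdfData("/your/hdfs/xdf_output", fileSystem = hdfsFS)
# stream the delimited HDFS data into an XDF file without pulling it into R's memory
rxImport(inData = hive_data, outFile = hive_xdf, overwrite = TRUE)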

Read SPSS file into R

I am trying to learn R and want to bring in an SPSS file, which I can open in SPSS.
I have tried using read.spss from foreign and spss.get from Hmisc. Both error messages are the same.
Here is my code:
## install.packages("Hmisc")
library(foreign)
## change the working directory
getwd()
setwd('C:/Documents and Settings/BTIBERT/Desktop/')
## load in the file
## ?read.spss
asq <- read.spss('ASQ2010.sav', to.data.frame=T)
And the resulting error:
Error in read.spss("ASQ2010.sav", to.data.frame = T) : error reading system-file header
In addition: Warning message:
In read.spss("ASQ2010.sav", to.data.frame = T) : ASQ2010.sav: position 0: character `\000' (
Also, I tried saving out the SPSS file as a SPSS 7 .sav file (was previously using SPSS 18).
Warning messages:
1: In read.spss("ASQ2010_test.sav", to.data.frame = T) :
   ASQ2010_test.sav: Unrecognized record type 7, subtype 14 encountered in system file
2: In read.spss("ASQ2010_test.sav", to.data.frame = T) :
   ASQ2010_test.sav: Unrecognized record type 7, subtype 18 encountered in system file
I had a similar issue and solved it following a hint in read.spss help.
Using package memisc instead, you can import a portable SPSS file like this:
data <- as.data.set(spss.portable.file("filename.por"))
Similarly, for .sav files:
data <- as.data.set(spss.system.file('filename.sav'))
although in this case I seem to miss some string values, while the portable import works seamlessly. The help page for spss.portable.file claims:
The importer mechanism is more flexible and extensible than read.spss and read.dta of package "foreign", as most of the parsing of the file headers is done in R. They are also adapted to load efficiently large data sets. Most importantly, importer objects support the labels, missing.values, and descriptions, provided by this package.
read.spss seems to be a little outdated, so I used a package called memisc.
To get this to work, do this:
install.packages("memisc")
library(memisc)
data <- as.data.set(spss.system.file('yourfile.sav'))
You may also try this:
setwd("C:/Users/rest of your path")
library(haven)
data <- read_sav("data.sav")
and if you want to read all files from one folder:
temp <- list.files(pattern = "*.sav")
read.all <- sapply(temp, read_sav)
I know this post is old, but I also had problems loading a Qualtrics SPSS file into R. R's read.spss code came from PSPP a long time ago, and hasn't been updated in a while. (And Hmisc's code uses read.spss(), too, so no luck there.)
The good news is that PSPP 0.6.1 should read the files fine, as long as you specify a "String Width" of "Short - 255 (SPSS 12.0 and earlier)" on the "Download Data" page in Qualtrics. Read it into PSPP, save a new copy, and you should be in business. Awkward, but free.
You can read an SPSS file from R using the above solutions or the one you are currently using. Just make sure that the command is fed a file it can read properly. I had the same error, and the problem was that SPSS could not access that file. You should make sure the file path is correct, the file is accessible and it is in the correct format.
library(foreign)
asq <- read.spss('ASQ2010.sav', to.data.frame=TRUE)
As far as the warning message is concerned, it does not affect the data. Record type 7 is used to store features of newer SPSS software so that older SPSS software can still read the data; it does not affect the data itself. I have used this numerous times and no data has been lost.
You can also read about this at http://r.789695.n4.nabble.com/read-spss-warning-message-Unrecognized-record-type-7-subtype-18-encountered-in-system-file-td3000775.html#a3007945
It looks like the R read.spss implementation is incomplete or broken. R 2.10.1 does better than R 2.8.1, however. It appears that R gets upset about custom attributes in a sav file even with 2.10.1 (the latest I have). R also may not understand the character encoding field in the file, and in particular it probably does not work with SPSS Unicode files.
You might try opening the file in SPSS, deleting any custom attributes, and resaving the file.
You can see whether there are custom attributes with the SPSS command
display attributes.
If so, delete them (see VARIABLE ATTRIBUTE and DATAFILE ATTRIBUTE commands), and try again.
HTH,
Jon Peck
If you have access to SPSS, save the file as .csv and then import it with read.csv or read.table. I can't recall any problem with .sav file importing; so far it has worked like a charm with both read.spss and spss.get. I reckon that spss.get will not give different results, since it depends on foreign::read.spss.
Can you provide some info on your SPSS/R/Hmisc/foreign versions?
Another solution not mentioned here is to read SPSS data in R via ODBC. You need:
IBM SPSS Statistics Data File Driver. Standalone driver is enough.
Import SPSS data using RODBC package in R.
See the example here. However, I have to admit that there could be problems with very big data files.
For me it works well using memisc!
install.packages("memisc")
library(memisc)
Daten.Februar <- as.data.set(spss.system.file("NPS_Februar_15_Daten.sav"))
names(Daten.Februar)
I agree with @SDahm that the haven package would be the way to go. I myself struggled a bit with string values when I started using it, so I thought I'd share my approach on that here, too.
The "semantics" vignette has some useful information on this topic.
library(tidyverse)
library(haven)
# Some interesting information in here
vignette('semantics')
# Get data from spss file
df <- read_sav(path_to_file)
# get value labels
df <- map_df(.x = df, .f = function(x) {
if (class(x) == 'labelled') as_factor(x)
else x})
# get column names
colnames(df) <- map(.x = spss_file, .f = function(x) {attr(x, 'label')})
There is no such problem with the packages you are using. The only requirement for reading an SPSS file this way is to put it into PORTABLE format: SPSS files normally have the *.sav extension, and you need to transform your SPSS file into a portable document that uses the *.por extension.
There is more info at http://www.statmethods.net/input/importingdata.html
In my case this warning was combined with the appearance of a new variable before the first column of my data with values -100, 2, 2, 2, ..., a shift in the correspondence between labels and values, and the deletion of the last variable. A solution that worked was (using SPSS) to create a new dummy variable in the last column of the file, fill it with random values and execute the following code:
(filename is the path to the sav file; in my case the original SPSS file had 62 columns, thus 63 with the additional dummy variable)
library(memisc)
data <- as.data.set(spss.system.file(filename))
copyofdata <- data
for (i in 2:63) {
  names(data)[i] <- names(copyofdata)[i - 1]
}
data[[1]] <- NULL
newcopyofdata <- data
for (i in 2:62) {
  labels(data[[i]]) <- labels(newcopyofdata[[i - 1]])
}
labels(data[[1]]) <- NULL
Hope the above code will help someone else.
Turn Unicode off in SPSS:
Open SPSS without any data open and run the code below in your syntax editor:
SET UNICODE OFF.
Open the data set and resave it to remove the Unicode encoding.
read.spss('yourdata.sav', to.data.frame=T) then works correctly.
I just came across an SPSS file that I couldn't open using haven, foreign, or memisc, but readspss::read.por did the trick for me:
download.file("http://www.tcd.ie/Political_Science/elections/IMSgeneral92.zip",
              "IMSgeneral92.zip")
unzip("IMSgeneral92.zip", exdir = "IMSgeneral92")
# rio, haven, foreign, memisc pkgs don't work on this file! But readspss does:
if (!require(readspss)) remotes::install_git("https://github.com/JanMarvin/readspss.git")
ims92 <- readspss::read.por("IMSgeneral92/IMS_Nov7 92.por", convert.factors = FALSE)
Nice! Thanks, @JanMarvin!
1)
I've found the program Stat/Transfer useful for importing SPSS and Stata files into R.
It resolves the issue you mention by converting SPSS files to R datasets. It is also very useful for subsetting super-large datasets into smaller portions consumable by R. It is not free, but it is a very useful tool for working with datasets from different programs, especially if you don't have access to those programs.
2)
The memisc package also has an SPSS import function worth trying.
