At my company, we are thinking of gradually phasing out SPSS in favor of R. During the transition, though, we'll still have data coming in the SPSS data file format (.sav).
I'm having issues importing this SPSS data file into R. When I import an SPSS file into R, I want to retain both the values and the value labels for the variables. The read.spss() function from the foreign package gives me the option to retain either the values OR the value labels of a variable, but not both.
AFAIK, R does allow factor variables to have values (levels) and value labels (level labels). I was just wondering if it's possible to somehow modify the read.spss() function to incorporate this.
Alternatively, I came across the spss.system.file() function from the memisc package, which supposedly allows this, but it asks for a separate syntax file (codes.file), which is not always available to me.
Here's a sample data file.
I'd appreciate any help resolving this issue.
Thanks.
I do not know how to read SPSS metadata into R; I usually read .csv files and add the metadata back, or write a small one-off Perl script to do the job. What I wanted to mention is that a recently published R package, Rz, may assist you with bringing SPSS data into R. I have had a quick look at it and it seems useful.
There is a way to read SPSS data files into R via an ODBC driver.
1) There is an IBM SPSS Statistics Data File Driver. I could not find the download link; I got it from my SPSS provider. The Standalone Driver is all you need. You do not need SPSS to install or use the driver.
2) Create a DSN for the SPSS data driver.
3) Using the RODBC package you can read any SPSS data file into R. The value labels for each variable are available as separate tables, which you can then use in R however you wish.
Here is a working example on Windows (I do not have SPSS on my computer now) that reads your example data file into R. I have not tested it on Linux, but it probably works there too, since an SPSS data driver is also available for Linux.
require(RODBC)
# Create connection
# Change the DSN name and CP_CONNECT_STRING according to your setting
con <- odbcDriverConnect("DSN=spss_ehsis;SDSN=SAVDB;HST=C:\\Program Files\\IBM\\SPSS\\StatisticsDataFileDriver\\20\\Standalone\\cfg\\oadm.ini;PRT=StatisticsSAVDriverStandalone;CP_CONNECT_STRING=C:\\temp\\data_expt.sav")
# List of tables
Tables <- sqlTables(con)
Tables
# List of table names to extract
table.names <- Tables$TABLE_NAME[Tables$TABLE_SCHEM != "SYSTEM"]
# Function to query a table by name
sqlQuery.tab.name <- function(table) {
  sqlQuery(con, paste0("SELECT * FROM [", table, "]"))
}
# Retrieve all tables
Data <- lapply(table.names, sqlQuery.tab.name)
# See the data
lapply(Data, head)
# Close connection
close(con)
For example, we can see that value labels are defined for two variables:
[[5]]
VAR00002 VAR00002_label
1 1 Male
2 2 Female
[[6]]
VAR00003 VAR00003_label
1 2 Student
2 3 Employed
3 4 Unemployed
Additional information
Here is a function that reads SPSS data once a connection to the SPSS data file has been made. It lets you specify the list of variables to select. If value.labels = TRUE, the selected variables that have value labels in the SPSS data file are converted to R factors with the labels attached.
I have to say I am not satisfied with the performance of this solution. It works well for small data files, but the RAM limit is reached quite often for large SPSS data files (even when only a subset of variables is selected).
get.spss <- function(channel, variables = NULL, value.labels = FALSE) {
  VarNames <- sqlQuery(channel = channel,
                       query = "SELECT VarName FROM [Variables]",
                       as.is = TRUE)$VarName
  if (is.null(variables)) {
    variables <- VarNames
  } else {
    if (any(!variables %in% VarNames)) stop("Wrong variable names")
  }
  if (value.labels) {
    ValueLabelTableName <- sqlQuery(channel = channel,
                                    query = "SELECT VarName FROM [Variables]
                                             WHERE ValueLabelTableName is not null",
                                    as.is = TRUE)$VarName
    ValueLabelTableName <- intersect(variables, ValueLabelTableName)
  }
  variables <- paste(variables, collapse = ", ")
  data <- sqlQuery(channel = channel,
                   query = paste("SELECT", variables, "FROM [Cases]"),
                   as.is = TRUE)
  if (value.labels) {
    for (var in ValueLabelTableName) {
      VL <- sqlQuery(channel = channel,
                     query = paste0("SELECT * FROM [VLVAR", var, "]"),
                     as.is = TRUE)
      data[, var] <- factor(data[, var], levels = VL[, 1], labels = VL[, 2])
    }
  }
  return(data)
}
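For example, with the connection opened above and the two labelled variables from the sample file (VAR00002, VAR00003):
# Read two variables, converting their value labels to factors
df <- get.spss(con, variables = c("VAR00002", "VAR00003"), value.labels = TRUE)
str(df)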
My work is going through the same transition.
read.spss() returns the variable labels as an attribute of the object you create with it. So in the example below I have a data frame called rvm, created by read.spss() with to.data.frame = TRUE. It has 3,500 variables with short names (a1, a2, etc.) but long labels for each variable in SPSS. I can access the variable labels with
cbind(attributes(rvm)$variable.labels)
which returns a list of all 3,500 variables' full labels, ending with
…
x23 "Other Expenditure Uncapped Daily Expenditure In Region"
x24 "Accommodation Expenditure In Region"
x25 "Food/Meals/Drink Expenditure In Region"
x26 "Local Transport Expenditure In Region"
x27 "Sightseeing/Attractions Expenditure In Region"
x28 "Event/Conference Expenditure In Region"
x29 "Gambling/Casino Expenditure In Region"
x30 "Gifts/Souvenirs Expenditure In Region"
x31 "Other Shopping Expenditure In Region"
x0 "Accommodation Daily Expenditure In Region"
What to do with these is another matter, but at least I have them, and if I want I can put them in some other object for safekeeping, searching with grep, etc.
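For example, a small sketch of that safekeeping and searching idea:
# Keep the labels in a named character vector for later lookup
var.labels <- attributes(rvm)$variable.labels

# Find all variables whose label mentions expenditure
grep("Expenditure", var.labels, value = TRUE)

# Look up a single variable's label by its short name
var.labels["x24"]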
Since you have SPSS available, I recommend installing the "Essentials for R" plugin (free of charge, but you need to register; also see the installation instructions), which allows you to run R within SPSS. The plugin includes an R package with functions that transfer the active SPSS dataset to R (and back), including labeled factor levels, dates, and German umlauts, details that are otherwise notoriously difficult to get right. In my experience, it is more reliable than R's own foreign package.
Once you have everything set up, open the data in SPSS, and run something like the following code in the syntax window:
begin program r.
myDf <- spssdata.GetDataFromSPSS(missingValueToNA=TRUE,
factorMode="labels",
rDate="POSIXct")
save(myDf, file="d:/path/to/your/myDf.Rdata")
end program.
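Back in a regular R session, you can then load the saved data frame:
# Load the data frame that was saved from within SPSS
load("d:/path/to/your/myDf.Rdata")
str(myDf)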
Essentials for R plugin link (apparently breaks markdown link syntax):
https://www.ibm.com/developerworks/mydeveloperworks/wikis/home/wiki/We70df3195ec8_4f95_9773_42e448fa9029/page/Downloads%20for%20IBM®%20SPSS®%20Statistics?lang=en
Nowadays, the package haven provides the functionality to achieve what you want (and much more).
The function read_sav() can import *.sav and *.zsav files and returns a tibble. The value labels are automatically stored in the labels attribute of the corresponding variables within that tibble (and the variable labels in the label attribute). The class labelled preserves the original semantics and allows us to associate arbitrary labels with numeric or character vectors. If needed, we can use the function as_factor() to coerce labelled objects, i.e. objects of class labelled, and even all labelled vectors within a data.frame or tibble (at once), to factors.
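A minimal sketch (the file name is the sample file from the question; VAR00002 is the hypothetical gender variable shown earlier in this thread):
library(haven)

# Import the .sav file; labelled variables keep both values and labels
df <- read_sav("data_expt.sav")

# Inspect the value labels of one variable
attr(df$VAR00002, "labels")

# Convert all labelled vectors in the tibble to factors at once
df_factors <- as_factor(df)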
Related
I'm trying to use DESeq2's PCAPlot function in a meta-analysis of data.
Most of the files I have received are raw counts pre-normalization. I'm then running DESeq2 to normalize them, then running PCAPlot.
One of the files I received does not have raw counts or even the FASTQ files, just the data that has already been normalized by DESeq2.
How could I go about importing this data (non-integers) as a DESeqDataSet object after it has already been normalized?
Consensus in vignettes and other comments seems to be that objects can only be constructed from matrices of integers.
I was mostly concerned with getting the format the same between plots. Ultimately, I just used a workaround to get the plots looking the same via ggfortify.
If anyone is curious, I just ended up doing this. Note: the "names" file is organized like the meta file for colData when building a DESeq object with DESeqDataSetFromMatrix, but I changed the name of the design column from "conditions" to "group" so it would match the output of PCAplot. They should look identical.
library(ggfortify)
# Read the normalized counts; the first column holds the row names
data <- read.csv("COUNTS.csv", sep = ",", header = TRUE, row.names = 1)
names <- read.csv("NAMES.csv")
# PCA on the transposed matrix, so samples become rows
PCA <- prcomp(t(data))
autoplot(PCA, data = names, colour = "group", size = 3)
I would like to create some R functions and combine them into a small R package for a university course. Those functions will take input values (tree diameter at breast height) and estimate an output parameter (biomass) based on a function with parameters published in an article. Those parameters depend on the tree species, and the functions are of this form (just a simplified example):
fun <- function(diameter, species){
parameter <- dataframe$parameter[which(dataframe$species == species)]
mass <- parameter*diameter
return(mass)
}
Hence, the function will have to look up those parameters in a table (named dataframe in this example).
Now I wondered what the best practice would be to implement this. I have the data as Excel tables, and I can load them as data.frames in R. I could use dput() and paste the output into my functions, so they contain the data needed to look up the values. However, this approach is probably not the most efficient one, and it would make the functions pretty much unreadable.
Creating the data.frame as a global variable in the user's environment is probably also not best practice?
So, I wondered, how data sets should be included in R functions. Unfortunately, I was too stupid to google a solution (searching for terms such as "data.frame" or "function" obviously leads to a lot of stuff not related to my question; I have no useful keywords to start with). I hope this question is not a duplicate (the suggestions show very different questions).
The question is whether the data are meant to be user-visible as a data set. If so, the right place for the data frame is a file in the /data subdirectory of the package; this is the standard method. However, if the data are just parameters, one can define a local environment in the package and place the data in an .R script, together with the function, in the /R folder of the package.
You may have a look at the source code of package marelac, a package for aquatic sciences that uses both techniques.
Finally, one can store it in R/sysdata.rda as shown in the link provided by @Gabor Grothendieck.
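A minimal sketch of the sysdata approach, reusing the simplified example from the question (species names and parameter values are just placeholders):
# Run once from the package source directory; this stores the lookup table
# as internal package data, loaded with the namespace but not exported
dataframe <- data.frame(species = c("beech", "oak"), parameter = c(0.5, 0.7))
save(dataframe, file = "R/sysdata.rda")
Functions in the /R folder can then reference dataframe directly, without it appearing in the user's environment.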
Briefly the method that was used in marelac: First, there is a file aaa.R that defines a new environment as follows:
.marelac <- new.env()
The file name aaa.R ensures that this file is loaded first; the leading dot makes it a "hidden" variable.
Then a function can make use of it:
## -----------------------------------------------------------------------------
## Seawater Composition
## function taken from package marelac, license: GPL >= 2
## -----------------------------------------------------------------------------
.marelac$sw_comp <- c(Na = 0.3065958, Mg = 0.0365055, Ca = 0.0117186,
K = 0.0113495, Sr = 0.0002260, Cl = 0.5503396, SO4 = 0.0771319,
HCO3 = 0.0029805, Br = 0.0019134, CO3 = 0.0004078, BOH4 = 0.0002259,
F = 0.0000369, OH = 0.0000038, BOH3 = 0.0005527, CO2 = 0.0000121)
sw_comp <- function(species = c("Na", "Mg", "Ca", "K", "Sr", "Cl", "SO4", "HCO3",
"Br", "CO3", "BOH4", "F", "OH", "BOH3", "CO2")) {
species <- match.arg(species, several.ok = TRUE)
.marelac$sw_comp[species]
}
Here sw_comp is the standard composition of seawater. This is the smallest (almost trivial) function using this technique; there are some others, e.g. gas_solubility or diffcoeff.
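For example, requesting a subset of species returns their mass fractions (values from the table above):
sw_comp(c("Na", "Cl"))
#        Na        Cl
# 0.3065958 0.5503396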
I have an SPSS file which contains variables and value labels. I saw the foreign package with the read.spss function:
data <- read.spss("2017.sav", to.data.frame = TRUE, use.value.labels = TRUE)
If I use use.value.labels = TRUE, all string variables are changed to factors, and I don't want that, because not all of them are factors.
I found one solution, but I don't know if it is the best way to do it:
1) First, read the SPSS file with the sentence above.
2) Then select the variables that are not factors and change them back to strings with:
cols <- c("x", "ab")
data[cols] <- lapply(data[cols], as.character)
If I don't use use.value.labels = TRUE, I won't have the value labels, and then I cannot export the file correctly.
You can also use the memisc package:
sav <- spss.system.file("file.sav")
df <- as.data.set(sav)
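To check which variable and value labels came through, memisc can also print a codebook (a quick sketch; the file name is the placeholder from above):
library(memisc)
sav <- spss.system.file("file.sav")
# Lists variable labels, value labels, and missing-value definitions
codebook(sav)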
My company regularly deals with SAV files, and we extract the metadata separately. With the foreign package, you can get the metadata out in a few different ways (after you have loaded the file):
data.label.table <- attr(sav, "label.table")
missings <- attr(sav, "missings")
The other bits require various lapply and sapply functions to get them out. The script I have is quite long, so I will not share it here. If you read the data in with read.spss(sav, to.data.frame = TRUE) you can get:
VariableLabels <- unname(attr(sav, "variable.labels"))
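As a small sketch of how a label.table entry can be used (VAR00002 is the hypothetical coded variable from the sample file earlier in this thread; this assumes each label-table entry is a named vector with the labels as names and the codes as values):
library(foreign)
sav <- read.spss("data_expt.sav", to.data.frame = TRUE, use.value.labels = FALSE)
label.table <- attr(sav, "label.table")

# Convert one coded variable to a factor using its stored label table
lt <- label.table[["VAR00002"]]
sav$VAR00002 <- factor(sav$VAR00002, levels = unname(lt), labels = names(lt))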
I don't know why, but I can't install the foreign package.
Here is what I did instead to import a dataset from SPSS to R (through Excel):
1. Open your data in SPSS.
2. Export the dataset from SPSS to Excel, but make sure to choose the "Save value labels where defined instead of data values" option at the very bottom.
3. Open R.
4. Import the dataset from Excel (see the sketch below).
Now you have a dataset in R with value labels.
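For the Excel import step, one option (a sketch, assuming the readxl package and a hypothetical file name) is:
library(readxl)
# Read the Excel file exported from SPSS; cells now contain the labels
df <- read_excel("exported_from_spss.xlsx")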
Use the haven package:
library(haven)
data <- read_sav("2017.sav")
The labels are shown in the RStudio viewer.
I have successfully added information to shapefiles before (see my post on http://rusergroup.swansea.ac.uk/Healthmap.ashx?HL=map ).
However, I just tried to do it again with a slightly different shapefile (new local health boards for Wales), and the code fails at spCbind with a "row names not identical" error:
o <- match(wales.lonlat$NEW_LABEL, wds$HB_CD)
wds.xtra <- wds[o,]
wales.ncchd <- spCbind(wales.lonlat, wds.xtra)
My rows did have different names before and that didn't cause any problems. I relabeled the column in wds.xtra to match "NEW_LABEL" and that doesn't help.
The labels and order of labels do match exactly between wales.lonlat and wds.xtra.
(I'm using Revolution R 5.0, which is built on R 2.13.2)
I use match to merge data into the sp object's data slot based on row names (or any other common ID). This avoids needing maptools for the spCbind function.
# Based on row names
sdata@data <- data.frame(sdata@data,
                         new.df[match(rownames(sdata@data), rownames(new.df)), ])

# Based on a common ID
sdata@data <- data.frame(sdata@data,
                         new.df[match(sdata@data$ID, new.df$ID), ])

# where sdata is your sp object and new.df is the data.frame you want to merge into it
I had the same error and resolved it by deleting all the other data that was not actually meant to be added. I suppose it confused spCbind, because the matching tried to match all row elements, not only the one given. In my example, I used
xtra2 <- data.frame(xtra$ID_3, xtra$COMPANY)
to extract the relevant fields, and then fed them to spCbind:
gadm <- spCbind(gadm, xtra2)
I want to use ChemoSpec with a mass spectrum of about 60,000 data points.
I already have the data in one txt file as a matrix (X + 90 samples = 91 columns; 60,000 rows).
How can I adapt this file as spectra data without again exporting each single sample to a csv file (which is quite slow in R given the size of my data)?
The typical (and only?) way to import data into ChemoSpec is via the getManyCsv() function, which, as the question indicates, requires one CSV file per sample.
Creating 90 CSV files from the 91-column, 60,000-row file described may be somewhat slow and tedious in R, but could be done with a standalone application, whether an existing utility or some ad-hoc script.
An R-only solution would be to create a new method, say getOneBigCsv(), adapted from getManyCsv(). After all, the logic of getManyCsv() is relatively straightforward.
Don't expect such a solution to be sizzling fast, but it should in any case compare with the time it takes to run getManyCsv(), while avoiding having to create and manage the many files, hence overall be faster and certainly less messy.
Sorry I missed your question 2 days ago. I'm the author of ChemoSpec - always feel free to write directly to me in addition to posting somewhere.
The solution is straightforward. You already have your data in a matrix (after reading it in with read.csv("file.txt")), so you can use it to manually create a Spectra object. In the R console, type ?Spectra to see the structure of a Spectra object, which is a list with specific entries. You will need to put your X column (which I assume is mass) into the freq slot. The rest of the data matrix goes into the data slot. Then manually create the other needed entries (making sure the data types are correct). Finally, assign the Spectra class to your completed list with something like class(my.spectra) <- "Spectra" and you should be good to go. I can give you more details on or off list if you describe your data a bit more fully. Perhaps you have already solved the problem?
By the way, ChemoSpec is totally untested with MS data, but I'd love to find out how it works for you. There may be some changes that would be helpful so I hope you'll send me feedback.
Good Luck, and let me know how else I can help.
Many years have passed and I am not sure if anybody is still interested in this topic. But I had the same problem and made a little workaround to convert my data to class 'Spectra' by extracting the information from the data itself:
# Assumption:
# Data is stored as a numeric data.frame with column names representing samples
# and row names holding the domain axis
dataframe2Spectra <- function(Spectrum_df,
                              freq = as.numeric(rownames(Spectrum_df)),
                              data = as.matrix(t(Spectrum_df)),
                              names = paste("YourFileDescription", 1:dim(Spectrum_df)[2]),
                              groups = rep(factor("Factor"), dim(Spectrum_df)[2]),
                              colors = rainbow(dim(Spectrum_df)[2]),
                              sym = 1:dim(Spectrum_df)[2],
                              alt.sym = letters[1:dim(Spectrum_df)[2]],
                              unit = c("a.u.", "Domain"),
                              desc = "Some signal. Describe it with 'desc'") {
  features <- c("freq", "data", "names", "groups", "colors", "sym", "alt.sym", "unit", "desc")
  Spectrum_chem <- vector("list", length(features))
  names(Spectrum_chem) <- features
  Spectrum_chem$freq <- freq
  Spectrum_chem$data <- data
  Spectrum_chem$names <- names
  Spectrum_chem$groups <- groups
  Spectrum_chem$colors <- colors
  Spectrum_chem$sym <- sym
  Spectrum_chem$alt.sym <- alt.sym
  Spectrum_chem$unit <- unit
  Spectrum_chem$desc <- desc

  # important step
  class(Spectrum_chem) <- "Spectra"

  # some warnings
  if (length(freq) != dim(data)[2]) print("Dimension of data is NOT #samples x length of freq")
  if (length(names) > dim(data)[1]) print("Too many names")
  if (length(names) < dim(data)[1]) print("Too few names")
  if (length(groups) > dim(data)[1]) print("Too many groups")
  if (length(groups) < dim(data)[1]) print("Too few groups")
  if (length(colors) > dim(data)[1]) print("Too many colors")
  if (length(colors) < dim(data)[1]) print("Too few colors")
  if (!is.matrix(data)) print("'data' is not a matrix or it's not numeric")

  return(Spectrum_chem)
}
Spectrum_chem <- dataframe2Spectra(Spectrum)
chkSpectra(Spectrum_chem)