R haven: missing labels and label names when reading SPSS file

I'm using the haven package for R to read an SPSS file with user_na=TRUE. The file has many string variables with value labels. In R, only the first of the string variables (SizeofH1) has the correct value labels assigned to it as an attribute.
Unfortunately, I cannot even provide a snippet of this data to make this fully reproducible, but here is a screenshot of what I can see in PSPP
and what str() in R returns...
$ SizeofH1:Class 'labelled' atomic [1:280109] 3 3 3 3 ...
..- attr(*, "label")= chr "Size of Household ab 2002"
..- attr(*, "format.spss")= chr "A30"
..- attr(*, "labels")= Named chr [1:9] "1" "2" "3" "4" ...
..- attr(*, "names")= chr [1:9] "4 Persons" "2 Persons" "1 Person 50 years plus" "3 Persons" ...
$ PROMOTIO: atomic 40 1 40 40 ...
..- attr(*, "label")= chr "PROMOTION"
..- attr(*, "format.spss")= chr "A30"
$ inFMCGfr: atomic 1 1 1 1 ...
..- attr(*, "label")= chr "in FMCG from2011"
..- attr(*, "format.spss")= chr "A30"
$ TRADESEG: atomic 1 1 1 1 ...
..- attr(*, "label")= chr "TRADE SEGMENT"
..- attr(*, "format.spss")= chr "A30"
$ ORGANISA: atomic 111 111 111 111 ...
..- attr(*, "label")= chr "ORGANISATION"
..- attr(*, "format.spss")= chr "A30"
$ NAME : atomic 9 9 9 9 ...
..- attr(*, "label")= chr "NAME"
..- attr(*, "format.spss")= chr "A30"
I hope someone can point me to any possible reason that causes this behavior.

The "semantics" vignette has some useful information on this topic.
library(haven)
vignette('semantics')
There are a couple of options to get value labels. I think a good one is the example demonstrated below, using the map function from the purrr package (but could be done with lapply instead, too)
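library(purrr)
# Get data from the SPSS file (keep the raw object so its attributes stay available)
spss_file <- read_sav(path_to_file)
# Get value labels: convert every labelled column to a factor
df <- map_df(.x = spss_file, .f = function(x) {
if (is.labelled(x)) as_factor(x)
else x})
# Get column names from the variable labels (assumes every column has a "label" attribute)
colnames(df) <- map_chr(.x = spss_file, .f = function(x) {attr(x, 'label')})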
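The lapply alternative mentioned above could look roughly like the following. This is only a sketch in base R, not haven's own implementation: it uses a toy column that carries a "labels" attribute shaped like the one in the str() output of the question, and `labels_to_factor` is a hypothetical helper name introduced here for illustration.

```r
# Toy stand-in for one imported column: codes plus a "labels" attribute
# (a named vector whose names are the label texts and whose values are the codes).
size <- c("3", "1", "2", "3")
attr(size, "labels") <- c("4 Persons" = "1",
                          "2 Persons" = "2",
                          "1 Person 50 years plus" = "3")
toy <- list(id = 1:4, SizeofH1 = size)

# Hypothetical helper: turn a vector with a "labels" attribute into a factor,
# and return anything without that attribute unchanged.
labels_to_factor <- function(x) {
  labs <- attr(x, "labels")
  if (is.null(labs)) return(x)
  factor(as.vector(x), levels = unname(labs), labels = names(labs))
}

toy_labelled <- as.data.frame(lapply(toy, labels_to_factor))
print(toy_labelled)
```

Columns without a "labels" attribute (such as PROMOTIO in the question) pass through untouched, so this does not recover labels that were never imported.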

The best approach is to save your SPSS file as CSV and then read it into R. I've faced this before, and some strings didn't read correctly. Generally, SPSS is not very smart when it comes to string variables, which could contribute to the problem.

Related

R package RSiena, goodness of fit for multiple networks at the same time

I'm relatively new to R and I'm running network and behavior coevolution models using the R Package RSiena.
My data set consists of around 100 networks and for each of these networks, I run one RSiena model.
ans.1 <- siena07(myalgorithm, data=mydata.1, effects=myeff.1, batch=TRUE)
...
ans.100 <- siena07(myalgorithm, data=mydata.100, effects=myeff.100, batch=TRUE)
Now I want to test the goodness of fit for each of the multiple network models. I actually know how to check the goodness of fit for a single model.
gof <- sienaGOF(ans.1, verbose=TRUE, varName="Friend", IndegreeDistribution)
plot(gof)
But I don't know how to combine the GOF results of all 100 models to get an overall impression. How can I get a table with the model number and the p-values. Or can I plot the results for all models within one plot? Or is there a better way?
So far I tried to put the GOF results in a list:
goftest <-list()
goftest[[1]] <- sienaGOF(ans.1, verbose=TRUE, varName="Friend", IndegreeDistribution)
...
goftest[[100]] <- sienaGOF(ans.100, verbose=TRUE, varName="Friend", IndegreeDistribution)
plot(goftest)
goftest[[1]] #Output:
"Siena Goodness of Fit ( IndegreeDistribution ), all periods
=====
Monte Carlo Mahalanobis distance test p-value: 0.941
-----
One tailed test used (i.e. estimated probability of greater distance than observation).
-----
Calculated joint MHD = ( 14.4 ) for current model."
str(goftest[[1]])#Output:
"List of 1
$ Joint:List of 8
..$ p : num 0.941
..$ SimulatedTestStat: Named num [1:2000] 9.97 16.02 6.83 10.14 8.65 ...
.. ..- attr(*, "names")= chr [1:2000] "1" "2" "3" "4" ...
..$ ObservedTestStat : num 2.09
..$ TwoTailed : logi FALSE
..$ Simulations : int [1:2000, 1:9] 21 22 22 21 19 26 30 23 25 26 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:2000] "1" "2" "3" "4" ...
.. .. ..$ : NULL
..$ Observations : int [1, 1:9] 26 48 63 73 76 78 78 78 78
..$ InvCovSimStats : num [1:9, 1:9] 13.2509 4.9587 1.2948 0.231 0.0895 ...
..$ Rank : int 9
.. ..- attr(*, "method")= chr "tolNorm2"
.. ..- attr(*, "useGrad")= logi FALSE
.. ..- attr(*, "tol")= num 2e-15
..- attr(*, "class")= chr "sienaGofTest"
..- attr(*, "sienaFitName")= chr "sienaFitObject"
..- attr(*, "auxiliaryStatisticName")= chr "IndegreeDistribution"
..- attr(*, "key")= chr [1:9] "0" "1" "2" "3" ...
- attr(*, "class")= chr "sienaGOF"
- attr(*, "scoreTest")= logi FALSE
- attr(*, "originalMahalanobisDistances")= num [1:3] 2.15 3.51 8.74
- attr(*, "oneStepMahalanobisDistances")=List of 3
..$ : Named num(0)
.. ..- attr(*, "names")= chr(0)
..$ : Named num(0)
.. ..- attr(*, "names")= chr(0)
..$ : Named num(0)
.. ..- attr(*, "names")= chr(0)
- attr(*, "joinedOneStepMahalanobisDistances")= Named num(0)
..- attr(*, "names")= chr(0)
- attr(*, "oneStepMahalanobisDistances_old")=List of 3
..$ : Named num(0)
.. ..- attr(*, "names")= chr(0)
..$ : Named num(0)
.. ..- attr(*, "names")= chr(0)
..$ : Named num(0)
.. ..- attr(*, "names")= chr(0)
- attr(*, "joinedOneStepMahalanobisDistances_old")= Named num(0)
..- attr(*, "names")= chr(0)
- attr(*, "oneStepSpecs")= num[1:20, 0 ]
- attr(*, "auxiliaryStatisticName")= chr "IndegreeDistribution"
- attr(*, "simTime")= 'proc_time' Named num [1:5] 39.61 0.28 40.21 NA NA
..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
- attr(*, "twoTailed")= logi FALSE
- attr(*, "joined")= logi TRUE"
But I don't know how to extract the p-Values and get a table, which just contains the network number and the associated p-value.
Furthermore, the plot command just produces error messages and no output so far.
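Going by the str() output above, the p-value of each test sits at `$Joint$p`, so a table could be assembled along these lines. This is a sketch that uses mock lists with the same nesting instead of real sienaGOF objects; the second and third p-values are made-up placeholders.

```r
# Mock stand-ins that mimic only the nesting shown in the str() output
# (real sienaGOF objects carry many more components and attributes).
goftest <- list(
  list(Joint = list(p = 0.941)),
  list(Joint = list(p = 0.120)),
  list(Joint = list(p = 0.503))
)

# Pull one number per model; vapply fails loudly if an element is malformed.
p_values <- vapply(goftest, function(g) g$Joint$p, numeric(1))

gof_table <- data.frame(model = seq_along(goftest), p = p_values)
print(gof_table)
```

As for plotting, plot() dispatches on the class of its argument, so it is likely that a plain list of results has to be plotted element by element, for example plot(goftest[[1]]).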

Refer to variable by part of the variable name

It seems that, in R, I can refer to a variable with part of a variable name. But I am confused about why I can do that.
Use the following code as an example:
library(car)
scatterplot(housing ~ total)
house.lm <- lm(housing ~ total)
summary(house.lm)
str(summary(house.lm))
summary(house.lm)$coefficients[2,2]
summary(house.lm)$coe[2,2]
When I print the structure of summary(house.lm), I got the following output:
> str(summary(house.lm))
List of 11
$ call : language lm(formula = housing ~ total)
$ terms :Classes 'terms', 'formula' language housing ~ total
.. ..- attr(*, "variables")= language list(housing, total)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "housing" "total"
.. .. .. ..$ : chr "total"
.. ..- attr(*, "term.labels")= chr "total"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(housing, total)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "housing" "total"
$ residuals : Named num [1:162] -8.96 -11.43 3.08 8.45 2.2 ...
..- attr(*, "names")= chr [1:162] "1" "2" "3" "4" ...
$ coefficients : num [1:2, 1:4] 28.4523 0.0488 10.2117 0.0103 2.7862 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "(Intercept)" "total"
.. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
$ aliased : Named logi [1:2] FALSE FALSE
..- attr(*, "names")= chr [1:2] "(Intercept)" "total"
$ sigma : num 53.8
$ df : int [1:3] 2 160 2
$ r.squared : num 0.123
$ adj.r.squared: num 0.118
$ fstatistic : Named num [1:3] 22.5 1 160
..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
$ cov.unscaled : num [1:2, 1:2] 3.61e-02 -3.31e-05 -3.31e-05 3.67e-08
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "(Intercept)" "total"
.. ..$ : chr [1:2] "(Intercept)" "total"
- attr(*, "class")= chr "summary.lm"
However, it seems that I can refer to the variable coefficients with all of the following commands:
summary(house.lm)$coe[2,2]
summary(house.lm)$coef[2,2]
summary(house.lm)$coeff[2,2]
summary(house.lm)$coeffi[2,2]
summary(house.lm)$coeffic[2,2]
summary(house.lm)$coeffici[2,2]
summary(house.lm)$coefficie[2,2]
summary(house.lm)$coefficien[2,2]
summary(house.lm)$coefficient[2,2]
summary(house.lm)$coefficients[2,2]
They all give the same results: 0.01029709
Therefore, I was wondering when I can refer to a variable with only part of its name in R?
You can do it when the partial name matches exactly one element, i.e. when it is unambiguous. For example
df <- data.frame(abcd = c(1,2,3), xyz = c(4,5,6), abc = c(5,6,7))
> df$xy
[1] 4 5 6
> df$ab
NULL
> df$x
[1] 4 5 6
df$xy and even df$x return the right data, but df$ab returns NULL because it is ambiguous: it could refer to either df$abc or df$abcd. It is similar to typing df$xy in RStudio and pressing Ctrl + Space: completion finds the right variable name, so a partial name is enough.
http://adv-r.had.co.nz/Functions.html#lexical-scoping
When calling a function you can specify arguments by position, by
complete name, or by partial name. Arguments are matched first by
exact name (perfect matching), then by prefix matching, and finally by
position.
When you are coding quickly to analyse some data, using partial names is not a problem, but I tend to agree that it's not good practice when writing code. In a package you can't do that: R CMD check will find every occurrence.
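The difference between the extraction operators, and the argument matching described in the quote, can be checked on a small example. This is a sketch with made-up objects: `res` and `scale_by` are illustrative names, not part of the question.

```r
res <- list(coefficients = c(1.5, 2.5), call = "lm")

# `$` does partial matching on list names when the prefix is unique.
by_dollar <- res$coef

# An ambiguous prefix ("c" fits both "coefficients" and "call") gives NULL.
ambiguous <- res$c

# `[[` matches exactly by default, so the same prefix finds nothing ...
by_brackets <- res[["coef"]]

# ... unless partial matching is requested explicitly.
by_brackets_partial <- res[["coef", exact = FALSE]]

# Function arguments are also matched by prefix, after exact matching.
scale_by <- function(value, multiplier = 2) value * multiplier
scaled <- scale_by(3, mult = 10)
```

Setting options(warnPartialMatchDollar = TRUE) makes R warn whenever `$` falls back on a partial match, which helps to find such spots in existing scripts.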

R-package(baseline) application to sample dataset

I am trying to use the R baseline package on a sample dataset that I have, in order to test and evaluate my current baseline algorithm.
I wanted to apply the fillpeaks algorithm as a trend line to compare.
bc.fillPeaks <- baseline(milk$spectra[1, , drop=FALSE], lambda=6,
hwi=50, it=10, int=2000, method="fillPeaks")
plot(bc.fillPeaks)
But my problem is that the sample data that I have does not fit the matrix structure which is used in the example. When I look at the data.frame used for the example I don't understand it
'data.frame': 45 obs. of 2 variables
$ cow : num 0 0.25 0.375 0.875 0.5 0.75 0.5 0.125 0 0.125 ...
$ spectra: num [1:45, 1:21451] 1029 371 606 368 554 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "4999.94078628963" "5001.55954267662" "5003.17856106153" "5004.79784144435" ...
- attr(*, "terms")=Classes 'terms', 'formula' length 3 cow ~ spectra
.. ..- attr(*, "variables")= language list(cow, spectra)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "cow" "spectra"
.. .. .. ..$ : chr "spectra"
.. ..- attr(*, "term.labels")= chr "spectra"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(cow, spectra)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "nmatrix.21451"
.. .. ..- attr(*, "names")= chr [1:2] "cow" "spectra"
My question is therefore whether any of you have experience with the baseline package and the dataset (milk) used, and ideas on how I can convert my data set, which is structured as: Date, Visits, Old_baseline_visits
to fit and test the baseline algorithm from the R package.
I have used baseline, and found it slightly confusing at first, particularly the example data. As it says in the help file, baseline expects a matrix with the spectra in rows. Even if you only have one "spectrum", it needs to be in the form of a single row matrix. Try this:
foo <- data.frame(Date=seq.Date(as.Date("1957-01-01"), by = "day",
length.out = ncol(milk$spectra)),
Visits=milk$spectra[1,],
Old_baseline_visits=milk$spectra[1,], row.names = NULL)
foo.t <- t(foo$Visits) # Visits in a single row matrix
bc.fillPeaks <- baseline(foo.t, lambda=6,
hwi=50, it=10, int=2000, method='fillPeaks')
plot(bc.fillPeaks)
If you want the baseline and corrected spectra back in your original data frame, try this:
foo$New_baseline <- c(getBaseline(bc.fillPeaks))
foo$New_corrected <- c(getCorrected(bc.fillPeaks))
plot(foo$Date, foo$New_corrected, "l")
Alternatively, if you don't need the baseline object, you can use baseline.fillPeaks(), which returns a list.
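The reshaping steps used above can be checked without the package. The sketch below uses a made-up `visits` vector and only shows what t() and c() do to the shape of the data.

```r
# Made-up daily counts standing in for the Visits column.
visits <- c(120, 135, 128, 160, 142)

# t() on a plain vector gives a matrix with one row: one "spectrum" per row.
visits_row <- t(visits)
print(dim(visits_row))

# c() drops the dim attribute again, which is why the results can be
# assigned straight back into data frame columns.
visits_back <- c(visits_row)
```

matrix(visits, nrow = 1) produces a matrix of the same shape and may read more explicitly than t().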

How do I extract and print the names of values in R?

I have data that looks on the surface as follows:
The data is structured as follows:
str(minstest)
List of 948
$ : Named int [1:5] 4 6 11 16 75
..- attr(*, "names")= chr [1:5] "Mass " "Bite " "Burn " "Cyst " ...
$ : Named int 37
..- attr(*, "names")= chr "Impaired skin integrity "
$ : Named int 2
..- attr(*, "names")= chr "Abrasion "
$ : Named int 33
..- attr(*, "names")= chr "Infection of nail "
$ : Named int 34
..- attr(*, "names")= chr "Lesion "
What I want to do is replace the numbers in the V1 vector with the names. Further, I would like to separate each name into distinct columns.
I tried multiple permutations of a recommendation that I found at:
How to print names of values in R
Unfortunately, no option I have tried thus far has worked.
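One possible approach, sketched on a small mock list with the same structure as the str() output above: take names() of each element, trim the trailing spaces, and pad the shorter entries with NA so that every name lands in its own column. The column names `name1`, `name2`, ... are an assumption made for this example.

```r
# Mock data with the same shape as minstest: a list of named integer vectors.
minstest <- list(
  c("Mass " = 4L, "Bite " = 6L, "Burn " = 11L),
  c("Impaired skin integrity " = 37L),
  c("Abrasion " = 2L)
)

# Keep only the names and strip the trailing whitespace.
nm <- lapply(minstest, function(x) trimws(names(x)))

# Pad every entry to the same length so the rows can be bound together.
width <- max(lengths(nm))
padded <- lapply(nm, function(x) c(x, rep(NA_character_, width - length(x))))

out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(out) <- paste0("name", seq_len(width))
print(out)
```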

Correct indexing of named vectors

This is a very quick and simple question that I am sure has been asked repeatedly on here. However, after 45 minutes of searching I cannot find the answer. Please link to whichever is the relevant question, or delete it if needed.
I have the following:
>str(slope)
List of 55
$ : Named num [1:2] -0.00044 0.00311
..- attr(*, "names")= chr [1:2] "(Intercept)" "y"
$ : Named num [1:2] 1.374 -0.0276
..- attr(*, "names")= chr [1:2] "(Intercept)" "y"
$ : Named num [1:2] 3.704 -0.102
..- attr(*, "names")= chr [1:2] "(Intercept)" "y"
$ : Named num [1:2] 9.275 -0.294
..- attr(*, "names")= chr [1:2] "(Intercept)" "y"
$ : Named num [1:2] 15.76 -0.46
..- attr(*, "names")= chr [1:2] "(Intercept)" "y"
$ : Named num [1:2] 16.27 -0.443
..- attr(*, "names")= chr [1:2] "(Intercept)" "y"
$ : Named num [1:2] 25.973 -0.717
How do I access all values of "y", e.g. if I want to plot them? I can access individual "y" values using:
> slope[[ c(1, 2) ]]
[1] 0.003111922
but not all of them at once.
sapply(slope, `[`, 2)
Also try
foo <- do.call(rbind, slope)
foo[,2]
Try using sapply and [[:
sapply(slope, '[[', "y")
or maybe
sapply(slope, '[[', 2)
If it doesn't work, then provide a reproducible example and some data.
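The suggestions above can be tried on a small mock of `slope`. This sketch uses three elements copied from the str() output in the question; vapply is swapped in for sapply only because it also checks that each result is a single number.

```r
# Mock list with the same structure as slope (first three elements only).
slope <- list(
  c("(Intercept)" = -0.00044, y = 0.00311),
  c("(Intercept)" = 1.374,    y = -0.0276),
  c("(Intercept)" = 3.704,    y = -0.102)
)

# Extract the element named "y" from every vector.
y_by_name <- vapply(slope, `[[`, numeric(1), "y")

# Same values by position; `[` keeps the name "y" on each element.
y_by_pos <- sapply(slope, `[`, 2)

# Or bind everything into a matrix and take the "y" column.
coef_mat <- do.call(rbind, slope)
y_from_mat <- coef_mat[, "y"]

print(y_by_name)
```

Extracting by name is a little safer than by position if the order of the coefficients could ever differ between elements.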
