What is attr(*, "value.labels") when reading SPSS into R? - r

I have an SPSS file, but not SPSS. So I want to open it in R.
If I open it using:
library(foreign)
dat <- read.spss("file.sav", to.data.frame=TRUE)
I get the warning
re-encoding from CP1252
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
If I understand correctly, the encoding notification is not a problem (I'm in a UTF-8 locale), but what does the warning about levels mean?
If I open the file using:
dat <- read.spss("file.sav", to.data.frame=TRUE, use.value.labels = FALSE)
the warning disappears, but I'm not sure if what I do is correct.
Also, calling str(dat) gives me output like:
pt_art : atomic 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "value.labels")= Named chr "2" "1"
.. ..- attr(*, "names")= chr "IPT" "VT"
What does attr(*, "value.labels") mean? I know that "pt_art" means "type of psychotherapy", that "IPT" and "VT" are the two therapy types, and that "2" and "1" are the numeric codes representing those types. So these correspond to what R calls levels and labels, but how do I correctly transfer them into R?

The warning occurs when you try to define a factor whose levels (or labels) contain duplicate values.
(x <- sample(letters[1:4], 10, replace = TRUE))
## [1] "b" "c" "d" "d" "b" "c" "d" "c" "c" "c"
factor(x, levels = x)
## [1] b c d d b c d c c c
## Levels: b c d d b c d c c c
## Warning message:
## In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
## duplicated levels will not be allowed in factors anymore
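One way to avoid the duplicated levels in the example above is to pass only the distinct values as levels. This is a minimal sketch using the same sample values, hard-coded so the result does not depend on the random draw:

```r
# Same values as in the example above, hard-coded instead of sampled
x <- c("b", "c", "d", "d", "b", "c", "d", "c", "c", "c")

# unique() keeps each value once, in order of first appearance,
# so the resulting factor has no duplicated levels
f <- factor(x, levels = unique(x))

levels(f)
```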
SPSS usually uses value labels to denote categorical variables (which should become factors in R). However, note this section from the ?read.spss help page:
Occasionally in SPSS, value labels will be added to some values of
a continuous variable (e.g. to distinguish different types of
missing data), and you will not want these variables converted to
factors. By setting 'max.value.labels' you can specify that
variables with a large number of distinct values are not converted
to factors even if they have value labels. In addition, variables
will not be converted to factors if there are non-missing values
that have no value label. The value labels are then returned in
the '"value.labels"' attribute of the variable.
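If you read the file with use.value.labels = FALSE, the "value.labels" attribute can be turned into a factor by hand. The following is only a sketch: pt_art here is a small made-up vector standing in for the real column, and it assumes every code in the data has exactly one label and that no two codes share a label.

```r
# Hypothetical stand-in for a column returned by
# read.spss(..., use.value.labels = FALSE)
pt_art <- c(1, 1, 2, 1, 2)
attr(pt_art, "value.labels") <- c(IPT = "2", VT = "1")

labs <- attr(pt_art, "value.labels")

# The attribute stores the codes as values and the labels as names
pt_art_f <- factor(pt_art,
                   levels = as.numeric(labs),
                   labels = names(labs))

pt_art_f
```

If two codes shared the same label, this call would run into the duplicated-levels problem described above, so it is worth checking anyDuplicated(names(labs)) first.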

Related

Removing a row does not change output of length() and levels()

Using the code below, I import a dataset, explore it, and remove a row.
After removing the row the output of my length and levels command is unchanged. Why?
MT <- read_csv("Q:/PhD/PhD courses/Data Doc and Man/day3-day4/bromraw.txt",
col_names = FALSE)
names(MT) <- c("id","pnr","age","sex", "runtime")
MT$sex <- as.factor(MT$sex)
length(levels(MT$sex))
levels(MT$sex)
This is the output:
[1] 3
[1] "33529" "K" "M"
Something is wrong. I investigate the row where sex has the value 33529
filter(MT, sex == 33529)
After examining the row I decide to drop it, and recheck the sex variable again.
MT <- subset(MT, sex !=33529)
length(levels(MT$sex))
levels(MT$sex)
[1] 3
[1] "33529" "K" "M"
The row is not there when I browse the data, but the output of the length and levels command is the same as before. What am I doing wrong?
I feel the question deserves a better explanation than just a piece of code.
Factor levels can exist independent of the data, e.g.
x <- factor(character(0), levels = LETTERS[1:3])
creates a vector of length 0 which has 3 factor levels
x
factor(0)
Levels: A B C
The length of the vector length(x) is zero but x has 3 levels
levels(x)
[1] "A" "B" "C"
(and length(levels(x)) is 3, accordingly).
The benefit is that we can add data later on which is checked if it is compatible with the defined factor levels:
x[1:4] <- LETTERS[1:4]
Warning message:
In `[<-.factor`(`*tmp*`, 1:4, value = c("A", "B", "C", "D")) :
invalid factor level, NA generated
x
[1] A B C <NA>
Levels: A B C
Now, the vector consists of 4 elements (length(x)) but there are still only 3 factor levels. Note that "D" has not become an additional factor level automatically but was replaced by NA instead.
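If "D" is meant to be a valid value, the level has to be added before the assignment. A minimal sketch, starting from a fresh three-element factor rather than the x above:

```r
# Factor with the three originally defined levels
x2 <- factor(c("A", "B", "C"), levels = LETTERS[1:3])

# Extend the set of allowed levels first ...
levels(x2) <- c(levels(x2), "D")

# ... then the new value is accepted instead of becoming NA
x2[4] <- "D"

x2
```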
If elements of the vector are removed, e.g.
y <- x[-c(1L, 4L)]
y
[1] B C
Levels: A B C
the factor levels remain unchanged while length(y) is 2 now.
However, if you want to remove unused factor levels you can do so by explicitly using the droplevels() function as pointed out by akrun:
y <- droplevels(y)
y
[1] B C
Levels: B C
Now, factor level "A" has been dropped as it is unused.
While the levels() function shows the factor levels which are defined it does not tell which of the boxes (credit to Acccumulation for the picture) are filled or not. The unique() function returns a vector of distinct values while the table() function counts the number of occurrences:
set.seed(1L)
z <- sample(LETTERS[1:8], 10, replace = TRUE)
z
[1] "C" "B" "E" "H" "A" "B" "D" "A" "D" "C"
unique(z)
[1] "C" "B" "E" "H" "A" "D"
table(z)
z
A B C D E H
2 2 2 2 1 1
This could be a case of unused levels. We can resolve it by dropping them:
MT <- droplevels(subset(MT, sex != 33529))
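The effect can be reproduced without the original file. In this sketch MT_demo is a small made-up data frame that mimics the stray "33529" entry in the sex column:

```r
# Hypothetical miniature version of the data, including the bad row
MT_demo <- data.frame(id  = 1:4,
                      sex = factor(c("K", "M", "33529", "K")))

# Removing the row leaves the level definition untouched
kept_raw <- subset(MT_demo, sex != "33529")
levels(kept_raw$sex)

# droplevels() removes levels that no longer occur in the data
kept_clean <- droplevels(kept_raw)
levels(kept_clean$sex)
```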

Coerce variables in data frame to appropriate format

I'm working with a data frame that consists of multiple different data types (numerics, characters, timestamps), but unfortunately all of them are received as characters. Hence I need to coerce them into their "appropriate" format dynamically and as efficiently as possible.
Consider the following example:
df <- data.frame("val1" = c("1","2","3","4"), "val2" = c("A", "B", "C", "D"), stringsAsFactors = FALSE)
I obviously want val1 to be numeric and val2 to remain as a character. Therefore, my result should look like this:
'data.frame': 4 obs. of 2 variables:
$ val1: num 1 2 3 4
$ val2: chr "A" "B" "C" "D"
Right now I'm accomplishing this by checking whether the coercion would result in NA and only coercing if it doesn't:
res <- as.data.frame(lapply(df, function(x){
  x <- sapply(x, function(y) {
    if (is.na(as.numeric(y))) {
      return(y)
    } else {
      y <- as.numeric(y)
      return(y)
    }
  })
  return(x)
}), stringsAsFactors = FALSE)
However, this doesn't strike me as the correct solution because of multiple issues:
I suspect that there is a faster way of accomplishing this
For some reason I receive the warning In FUN(X[[i]], ...) : NAs introduced by coercion, although this isn't the case (see result)
This seems inappropriate when handling other data types, i.e. dates
Is there a general, heuristic approach to this, or another, more sustainable solution? Thanks
The recent file readers like data.table::fread or the readr package do a pretty decent job in identifying and converting columns to the appropriate type.
So my first reaction was to suggest writing the data to a file and reading it in again, e.g.,
library(data.table)
fwrite(df, "dummy.csv")
df_new <- fread("dummy.csv")
str(df_new)
Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
$ val1: int 1 2 3 4
$ val2: chr "A" "B" "C" "D"
- attr(*, ".internal.selfref")=<externalptr>
or without actually writing to disk:
df_new <- fread(paste(capture.output(fwrite(df, "")), collapse = "\n"))
However, d.b's suggestions are much smarter but need some polishing to avoid coercion to factor:
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
'data.frame': 4 obs. of 2 variables:
$ val1: int 1 2 3 4
$ val2: chr "A" "B" "C" "D"
or
df[] <- lapply(df, readr::parse_guess)
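Base R's type.convert() does not attempt to parse dates, so a column of date strings stays character. One possible extension is sketched below. The helper convert_column and the val3 column are made up for illustration, and the sketch assumes all dates use a single known format (here "%Y-%m-%d"):

```r
# Hypothetical helper: try type.convert() first, then a fixed date format
convert_column <- function(x, date_format = "%Y-%m-%d") {
  out <- type.convert(x, as.is = TRUE)
  if (is.character(out)) {
    parsed <- as.Date(out, format = date_format)
    # Accept the date version only if every non-missing entry parsed
    if (all(!is.na(parsed) | is.na(out))) out <- parsed
  }
  out
}

raw <- data.frame(val1 = c("1", "2", "3", "4"),
                  val2 = c("A", "B", "C", "D"),
                  val3 = c("2020-01-01", "2020-02-15", "2020-03-31", "2020-04-30"),
                  stringsAsFactors = FALSE)

converted <- raw
converted[] <- lapply(raw, convert_column)
sapply(converted, class)
```

Because the conversion is vectorised over whole columns, it also avoids the element-wise sapply() of the original attempt.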
You should check out the dataPreparation package. Its findAndTransformNumerics function will do exactly what you want.
require(dataPreparation)
data("messy_adult")
sapply(messy_adult[, .(num1, num2, mail)], class)
num1 num2 mail
"character" "character" "factor"
messy_adult is an ugly data set to illustrate functions from this package. Here num1 and num2 are strings :/
messy_adult <- findAndTransformNumerics(messy_adult)
[1] "findAndTransformNumerics: It took me 0.18s to identify 3 numerics column(s), i will set them as numerics"
[1] "setColAsNumeric: I will set some columns as numeric"
[1] "setColAsNumeric: I am doing the columnnum1"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "setColAsNumeric: I will set some columns as numeric"
[1] "setColAsNumeric: I am doing the columnnum2"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "setColAsNumeric: I am doing the columnnum3"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "findAndTransformNumerics: It took me 0.09s to transform 3 column(s) to a numeric format."
Here we performed the search and it logged what it found.
And now:
sapply(messy_adult[, .(num1, num2, mail)], class)
num1 num2 mail
"numeric" "numeric" "factor"
Hope it helps!
Disclaimer: I'm the author of this package.

R, about numeric and character type of data

Let x=1:3, y=c("1","2","3")
If I type ls.str(), R shows that:
x : int [1:3] 1 2 3
y : chr [1:3] "1" "2" "3"
x is a numeric and y is a character vector.
Thus, when I typed x == y, I expected the result to be FALSE FALSE FALSE, but surprisingly R shows TRUE TRUE TRUE. Why does this happen? Isn't character type of data and numeric type of data different?
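The behaviour follows from the coercion rule documented in ?Comparison: when two atomic vectors of different types are compared, one is coerced to the type of the other, and character takes precedence over numeric. So x is converted to c("1", "2", "3") before the comparison. A minimal sketch:

```r
x <- 1:3
y <- c("1", "2", "3")

# x is coerced to character before comparing, so the elements match
same_after_coercion <- x == y

# identical() does not coerce and therefore sees two different objects
same_object <- identical(x, y)

# The comparison is on strings: "1" is not the same string as "1.0"
string_mismatch <- 1 == "1.0"

same_after_coercion
same_object
string_mismatch
```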

converting a chr into num in R

I have some data that is currently in character form, and I need to put it into numeric form so that I can get the mean. I'm new to R so any help will be much appreciated. My initial thought was that the missing data is causing it to not be read as num, but could it be because the numbers are "3" instead of 3?
Here's what I have:
X
chr [1:1964] "3", "4", "4", "1", NA
I've tried different methods of converting X from chr to num:
X <- na.omit(Y, Z, as.numeric)
mean(X)
# [1] NA
# Warning message:
# In mean.default(X) :
# argument is not numeric or logical: returning NA
X <- c(Y, Z, na.rm=TRUE)
mean(X, na.rm=TRUE)
# [1] NA
# Warning message:
# In mean.default(X, na.rm = TRUE) :
# argument is not numeric or logical: returning NA
X <- c(Y, Z, na.rm=TRUE)
str(X)
# Named chr [1:1965] "3" "4" "4" "1" "5" "7" NA "6" NA "5" ...
# - attr(*, "names")= chr [1:1965] "" "" "" "" ...
As always, an example of your actual data is helpful. I think I can answer anyway, though. If your data are character data, then converting to numeric like this will work most of the time:
X2 <- as.numeric(X)
If you have missing values, are they showing up as NA? Or did you write something else there to indicate missingness such as "missing"? If you've got something other than NA in your original data, then when you do the as.numeric(X) conversion, R will convert those values to NA and give you a warning message.
To take the mean of a numeric object that has missing values, use:
mean(X2, na.rm=TRUE)
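As a small illustration of the points above, the sketch below uses a made-up character vector that contains both a real NA and the string "missing"; suppressWarnings() is only there to silence the expected coercion warning:

```r
# Hypothetical data: one real NA and one text marker for missingness
X_demo <- c("3", "4", "4", "1", NA, "missing")

# "missing" cannot be converted and becomes NA (normally with a warning)
X_num <- suppressWarnings(as.numeric(X_demo))

# na.rm = TRUE drops both NAs before averaging
m <- mean(X_num, na.rm = TRUE)
m
```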
This should work:
mean(as.numeric(X), na.rm=TRUE)
Doing the as.numeric() will introduce an NA for values like "X" and many of the summary functions have a na.rm parameter to ignore NA values in the vector.
But of course taking the mean of a list of chromosomes is a pretty weird operation.

R: numeric vector becoming non-numeric after cbind of dates

I have a numeric vector (future_prices) in my case. I use a date vector from another vector (here: pred_commodity_prices$futuredays) to create numbers for the months. After that I use cbind to bind the months to the numeric vector. However, what happened is that the numeric vector became non-numeric. Do you know what the reason for this is? When I use as.numeric(future_prices) I get strange values. What could be an alternative? Thanks
head(future_prices)
pred_peak_month_3a pred_peak_quarter_3a
1 68.33907 62.37888
2 68.08553 62.32658
is.numeric(future_prices)
[1] TRUE
month = format(as.POSIXlt.date(pred_commodity_prices$futuredays), "%m")
future_prices <- cbind (future_prices, month)
head(future_prices)
pred_peak_month_3a pred_peak_quarter_3a month
1 "68.3390747063745" "62.3788824938719" "01"
is.numeric(future_prices)
[1] FALSE
The reason is that cbind returns a matrix, and a matrix can only hold one data type. You could use a data.frame instead:
n <- 1:10
b <- LETTERS[1:10]
m <- cbind(n,b)
str(m)
chr [1:10, 1:2] "1" "2" "3" "4" "5" "6" "7" "8" "9" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "n" "b"
d <- data.frame(n,b)
str(d)
'data.frame': 10 obs. of 2 variables:
$ n: int 1 2 3 4 5 6 7 8 9 10
$ b: Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
See ?format. The format function returns:
An object of similar structure to ‘x’ containing character
representations of the elements of the first argument ‘x’ in a
common format, and in the current locale's encoding.
From ?cbind, cbind returns:
... a matrix combining the ‘...’ arguments
column-wise or row-wise. (Exception: if there are no inputs or
all the inputs are ‘NULL’, the value is ‘NULL’.)
and all elements of a matrix must be of the same class, so everything is coerced to character.
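Applied to the question, one option is to convert the month to a number and combine everything in a data frame, so the prices stay numeric. This is a sketch with made-up dates: futuredays stands in for pred_commodity_prices$futuredays, which is not shown in the question.

```r
# Numeric matrix with the two rows shown in the question
future_prices <- cbind(pred_peak_month_3a   = c(68.33907, 68.08553),
                       pred_peak_quarter_3a = c(62.37888, 62.32658))

# Hypothetical dates standing in for pred_commodity_prices$futuredays
futuredays <- as.Date(c("2015-01-01", "2015-02-01"))

# format() returns character, so convert the month back to integer
month <- as.integer(format(futuredays, "%m"))

# A data frame keeps each column's own type
prices_df <- data.frame(future_prices, month = month)
str(prices_df)
```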
F.Y.I.
When a column is a factor, calling as.numeric on it directly returns the internal integer codes rather than the original values. The proper way is (with df being your data frame):
df[, 2] <- as.numeric(as.character(df[, 2]))
Find more details: Converting values to numeric, stack overflow
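The difference is easy to see on a small made-up factor:

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "10", "30"))

# Direct conversion returns the internal level codes
codes <- as.numeric(f)

# Going through character recovers the original values
values <- as.numeric(as.character(f))

codes
values
```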
