I'm a C developer trying to learn R, and there are a few things I can't wrap my head around.
I've tried doing something as simple as listing the elements from an int list stored in a data frame.
For this example, I'm using the mpg data set from the ggplot2 package.
data(mpg, package="ggplot2")
Doing ls() on the mpg data frame lists the elements stored in it.
> ls(mpg)
[1] "class" "cty" "cyl" "displ" "drv" "fl" "hwy" "manufacturer"
[9] "model" "trans" "year"
Accessing a column can be done by giving its name as a string to the data frame.
> mpg["hwy"]
# A tibble: 234 x 1
hwy
<int>
1 29
2 29
3 31
4 30
5 26
6 26
7 27
8 26
9 25
10 28
# ... with 224 more rows
But using ls() on the column doesn't return the list of int stored in it.
> ls(mpg["hwy"])
[1] "hwy"
I'm really hitting a wall there. I'm trying to understand why it doesn't work the way I expect, but I can't find any information. That probably means that what I thought I had understood about R is wrong.
Can anyone please give me any directions about that?
Best regards.
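A minimal sketch of the difference, assuming a small made-up data frame named toy in place of mpg (the values are illustrative, not taken from ggplot2). It suggests that ls() treats the data frame as a container of named objects and reports their names, while [[ ]] or $ returns the integer vector itself.

```r
# Hypothetical stand-in for mpg, with two integer columns
toy <- data.frame(hwy = c(29L, 29L, 31L), cty = c(18L, 21L, 20L))

ls(toy)      # names of the columns, sorted alphabetically
names(toy)   # names of the columns, in their stored order

one_col <- toy["hwy"]       # single brackets: still a data frame, with one column
hwy_values <- toy[["hwy"]]  # double brackets: the integer vector itself

print(class(one_col))
print(hwy_values)
print(identical(hwy_values, toy$hwy))  # $ gives the same vector as [[ ]]
```

Under that reading, ls(mpg["hwy"]) prints "hwy" because the one-column data frame contains exactly one named object.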
Related
I'm reading a .sav file using haven:
library(haven)
data <- read_spss("file.sav", user_na = FALSE)
Then trying to display one of the variables in a table:
table(data$region)
Which returns:
  1   2   3   4   5   6   7   8   9  10  11  12
 85 208  43 171  30  40  95 310 133  29  77  36
This is technically correct. However, in SPSS the numerical values in the top row have labels associated with them (region names in this case). If I just run data$region, it shows me the numbers and their associated labels at the end of the output, but is there a way to make those string labels appear in the first table row instead of their numerical counterparts?
Thank you in advance for your help!
The way to do this is to cast the variable as a factor, using the "labels" attribute of the vector as the factor levels. The sjlabelled package includes a function that does this in one step:
data$region <- sjlabelled::as_label(data$region)
While the table command will still work on the resulting data, the layout may be a little messy. The forcats package has a function that pretty-prints frequency tables for factors:
data$region %>% forcats::fct_count()
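As a rough base-R sketch of the same idea (not how sjlabelled implements it): assuming the vector carries a named "labels" attribute that maps label text to numeric codes, as haven-imported variables do, the factor can be built by hand. The region names and values below are made up.

```r
# Hypothetical labelled vector: codes plus a "labels" attribute
region <- c(1, 2, 2, 3)
attr(region, "labels") <- c(North = 1, South = 2, West = 3)

lab <- attr(region, "labels")

# Codes become the factor levels, label text becomes what is displayed
region_f <- factor(region, levels = unname(lab), labels = names(lab))

tab <- table(region_f)
print(tab)
```

The table header then shows the label text instead of the numeric codes.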
Suppose a data frame df has a column speed. What is the difference between accessing the column like so:
df["speed"]
or like so:
df$speed
The following calculates the mean value correctly:
lapply(df["speed"], mean)
But this prints all values under the column speed:
lapply(df$speed, mean)
There are two elements to the question in the OP. The first element was addressed in the comments: df["speed"] is an object of type data.frame() whereas df$speed is a numeric vector. We can see this via the str() function.
We'll illustrate this with Ezekiel's 1930 analysis of speed and stopping distance, the cars data set from the datasets package.
> library(datasets)
> data(cars)
>
> str(cars["speed"])
'data.frame': 50 obs. of 1 variable:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
> str(cars$speed)
num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
>
The second element, which was not addressed in the comments, is that lapply() behaves differently when passed a vector versus a list().
With a vector, lapply() processes each element in the vector independently, producing unexpected results for a function such as mean().
> unlist(lapply(cars$speed,mean))
[1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
[26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
What happened?
Since each element of cars$speed is processed by mean() independently, lapply() returns a list of 50 means of 1 number each: the original elements in the cars$speed vector.
Processing a list with lapply()
With a list, each element of the list is processed independently. We can calculate how many items will be processed by lapply() with the length() function.
> length(cars["speed"])
[1] 1
>
Since a data frame is also a list(), and cars["speed"] contains exactly one element (the numeric vector speed), the length() function returns the value 1. Therefore, when processed by lapply(), a single mean is calculated, not one per row of the speed column.
> lapply(cars["speed"],mean)
$speed
[1] 15.4
>
If we pass the entire cars data frame as the input object for lapply(), we obtain one mean per column in the data frame, since both variables in the data frame are numeric.
> lapply(cars,mean)
$speed
[1] 15.4
$dist
[1] 42.98
>
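For comparison, a short sketch of the three calls side by side, assuming only the built-in cars data set. When a single mean is all that is needed, calling mean() on the vector directly avoids lapply() altogether, and sapply() is one option for collapsing the per-column list into a named vector.

```r
# cars ships with the datasets package, which R attaches by default
speed <- cars$speed

overall <- mean(speed)              # one mean over all 50 values
per_element <- lapply(speed, mean)  # 50 "means", one per value
per_column <- sapply(cars, mean)    # named numeric vector, one mean per column

print(overall)
print(length(per_element))
print(per_column)
```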
A theoretical perspective
The differing behaviors of lapply() are explained by the fact that R is an object oriented language. In fact, John Chambers, creator of the S language on which R is based, once said:
In R, two slogans are helpful.
-- Everything that exists is an object, and
-- Everything that happens is a function call.
John Chambers, quoted in Advanced R, p. 79.
The fact that lapply() works differently on a data frame than a vector is an illustration of the object oriented feature of polymorphism where the same behavior is implemented in different ways for different types of objects.
While this looks like a beginner's question, I think it's worth answering, since many beginners could have a similar question and a guide to the corresponding documentation is helpful IMHO.
No up-votes please - I am just collecting the comment fragments from the question that contribute to the answer - feel free to edit this answer...
A data.frame is a list of vectors with the same length (number of elements). Please read the help in the R console (by typing ?data.frame)
The $ operator is implemented by returning one column as vector (?"$.data.frame")
lapply applies a function to each element of a list (see ?lapply). If the first param X is an atomic vector (integer, double...) with multiple elements, each element of the vector is converted ("coerced") into one separate list element (same as as.list(1:26))
Examples:
x <- data.frame(a = LETTERS, b = 1:26, stringsAsFactors = FALSE)
b.vector <- x$b
b.data.frame <- x["b"]
class(b.vector) # integer
class(b.data.frame) # data.frame
lapply(b.vector, mean)
# returns a result list with 26 list elements, the same as `lapply(1:26, mean)`
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2
# ... up to list element 26
lapply(b.data.frame, mean)
# a data frame is a list of columns, so mean() is applied once per column;
# the result is a list with a single element: the mean of column b
# $b
# [1] 13.5
So IMHO your original question can be reduced to: Why is lapply behaving differently if the first parameter is an atomic vector instead of a list?
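One way to see the answer is to apply as.list() by hand, which is the coercion the lapply documentation describes. A small sketch, reusing the x data frame from the examples (the helper variable names are only for illustration):

```r
x <- data.frame(a = LETTERS, b = 1:26, stringsAsFactors = FALSE)

vec_as_list <- as.list(x$b)    # 26 elements, each a length-1 integer
df_as_list <- as.list(x["b"])  # 1 element named "b", holding all 26 integers

print(length(vec_as_list))
print(length(df_as_list))

# lapply() gives the same result on the original object and on its as.list() form
same_vec <- identical(lapply(x$b, mean), lapply(vec_as_list, mean))
same_df <- identical(lapply(x["b"], mean), lapply(df_as_list, mean))
print(c(same_vec, same_df))
```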
I am trying to select a column from a dataframe using a variable as a column name, with the problem that the column name is escaped. I have a couple of workarounds for doing it, which involve changing my code a bit too much, and anyway I've been looking around and I am curious whether anybody knows the solution for this kind of weird case.
My dataset is actually a list of time series (which I construct after some operations), this would be a toy example.
df <- list(`01/19/17`=seq(1,10), `01/20/17`=seq(2,11))
> df
$`01/19/17`
[1] 1 2 3 4 5 6 7 8 9 10
$`01/20/17`
[1] 2 3 4 5 6 7 8 9 10 11
I don't put the escapes ` in the column names because I want to, but because they come as dates from the process I follow to construct the dataset.
If I know the column name I can access like this,
df$`01/19/17`
If I want to use a variable, looking around e.g. here I see I could rewrite it to something like this,
`$`(df, `01/19/17`)
But I cannot assign a variable like this,
> name1 <- `01/19/17`
Error: object '01/19/17' not found
and if I assign it this other way I get a NULL,
> name1 <- "01/19/17"
> `$`(df, name1)
NULL
As I say there are workarounds like e.g. changing all the column names in the list of series, but I just would like to know. Thank you so much.
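You can access the element with double brackets [[ ]] rather than with $. Unlike $, which takes its right-hand side literally as a name, [[ ]] evaluates its argument, so the key can be a string stored in a variable: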
df <- list(`01/19/17`=seq(1,10), `01/20/17`=seq(2,11))
name1 <- "01/19/17"
df[[name1]]
# [1] 1 2 3 4 5 6 7 8 9 10
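To make the contrast explicit, here is a small sketch using the same toy list. The variable names by_dollar, by_bracket and first_values are only for illustration.

```r
df <- list(`01/19/17` = seq(1, 10), `01/20/17` = seq(2, 11))
name1 <- "01/19/17"

# $ does not evaluate name1; it looks for an element literally called "name1"
by_dollar <- `$`(df, name1)

# [[ ]] evaluates name1 first and then looks up "01/19/17"
by_bracket <- df[[name1]]

# The same pattern works in a loop over all the date names
first_values <- sapply(names(df), function(nm) df[[nm]][1])

print(is.null(by_dollar))
print(by_bracket)
print(first_values)
```

The backticks are only needed when the name is typed directly in code; a character string holding the same name needs no escaping.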
Could anyone please help me with this?
I am trying to select a subset of features from the dataset. Before, I was able to do it with this:
My data had 50 features, and here I reduce them to 24.
trainQ1 <- df_2015_Q1[,1:24]
#trainQ1 12312312 obs. of 24 variables
But now I use the same code
age <- trainQ1[,1:4]
#and returns
#age int [1:4] 1 2 3 4
What is going on here?
Here is what I have:
tmp[1,]
percentages percentages.1 percentages.2 percentages.3 percentages.4 percentages.5 percentages.6 percentages.7 percentages.8 percentages.9
0.0329489291598023 0.0391268533772652 0.0292421746293245 0.0354200988467875 0.0284184514003295 0.035831960461285 0.0308896210873147 0.0345963756177924 0.0366556836902801 0.0403624382207578
I try converting this to numeric, since the class is factor, but I get:
as.numeric(as.character(tmp[1,]))
[1] 35 36 35 36 31 32 31 34 36 34
Where did these integers come from?
Your problem is that indexing by rows of a data frame gives surprising results.
Reconstruct your object:
tmp <- read.csv(text=
"0.0329489291598023,0.0391268533772652,0.0292421746293245,0.0354200988467875,0.0284184514003295,0.035831960461285,0.0308896210873147,0.0345963756177924,0.0366556836902801,0.0403624382207578",
header=FALSE,colClasses=rep("factor",10))
Inspect:
str(tmp[1,])
## 'data.frame': 1 obs. of 10 variables:
## $ V1 : Factor w/ 1 level "0.0329489291598023": 1
## $ V2 : Factor w/ 1 level "0.0391268533772652": 1
## ... etc.
Converting via as.character() totally messes things up:
str(as.character(tmp[1,]))
## chr [1:10] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
On the other hand, this (converting to a matrix first) works fine:
as.numeric(as.matrix(tmp)[1,])
## [1] 0.03294893 0.03912685 0.02924217 0.03542010 0.02841845 0.03583196
## [7] 0.03088962 0.03459638 0.03665568 0.04036244
That said, I have to admit that I do not understand the particular magic that makes as.character() applied to a data frame drop the information about factor levels and convert everything first to the underlying numerical codes, and then to character -- I don't know where precisely you would go to read about this. (The bottom line is "don't extract rows of data frames if you can help it; convert them to matrices first if necessary.")
As an alternative to converting to matrix, you can just transpose the dataframe row to a column:
as.numeric(as.character(t(tmp[1,])))
## [1] 0.03294893 0.03912685 0.02924217 0.03542010 0.02841845 0.03583196
## [7] 0.03088962 0.03459638 0.03665568 0.04036244
I think the integers seen by the OP
[1] 35 36 35 36 31 32 31 34 36 34
are the integer codes behind the factor levels, not the values themselves. The OP's data frame had multiple rows (36 or more), and these are the codes for the first row.
ETA I see that t() converts a data frame to a matrix, so my solution is the same as Ben's.
Perhaps the reason as.character() doesn't work with a dataframe row is that the levels of the different columns may differ, so there isn't a common set of levels(). In these circumstances as.matrix() will convert to character, so it solves the problem.
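A small sketch of the code-versus-label distinction, assuming a made-up two-row data frame of factor columns (the values are illustrative, not the OP's data). It also shows one per-column alternative to the matrix route: converting each column's single value through as.character() before as.numeric().

```r
# A factor stores integer codes that index into its sorted levels
f <- factor(c("0.5", "0.25"))
codes <- as.numeric(f)                 # the codes, not the numbers
values <- as.numeric(as.character(f))  # the numbers the labels spell out

print(codes)
print(values)

# Hypothetical data frame with factor columns, like the OP's tmp
tmp2 <- data.frame(V1 = factor(c("0.03", "0.5")),
                   V2 = factor(c("0.04", "0.01")))

# Convert the first row column by column, so each factor uses its own levels
row1 <- vapply(tmp2[1, ], function(col) as.numeric(as.character(col)), numeric(1))
print(row1)
```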