I can't figure this out.
library(dplyr)
dat <- data.frame(a = 1:5,b = rep(TRUE,5))
# this doesn't work
dat %>% all(.$b) # tricky
# this doesn't work
dat %>% all(b) #
# this does
dat %>% .$b %>% all
I find it confusing that all(.$b) doesn't work. That doesn't seem intuitive to me at all.
Well, the %>% operator is borrowed from the magrittr package which defines the following rules:
By default the left-hand side (LHS) will be piped in as the first argument of the function appearing on the right-hand side (RHS).
When the LHS is needed at a position other than the first, one can use the dot,'.', as placeholder.
You can see that the whole data frame is still being passed in as the first parameter with this example
f<-function(...) str(list(...))
dat %>% f(.$b)
# $ :'data.frame': 5 obs. of 2 variables:
# ..$ a: int [1:5] 1 2 3 4 5
# ..$ b: logi [1:5] TRUE TRUE TRUE TRUE TRUE
# $ : logi [1:5] TRUE TRUE TRUE TRUE TRUE
So the function is receiving two arguments: the data.frame and the vector. This happens because the dot appears only inside a nested expression (.$b) and not as an argument of all() itself, so magrittr still passes the LHS along as the first argument.
It just so happens that the magrittr package has a different operator for use in cases like this. You can use %$%.
library(magrittr)
dat %$% all(b)
# [1] TRUE
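For reference, here is a minimal sketch of two other ways to get the intended result, assuming magrittr is installed. Wrapping the right-hand side in braces tells magrittr not to insert the LHS as the first argument, and with() is the base R route that needs no pipe at all. The names res_braces and res_with are only illustrative.

```r
library(magrittr)

dat <- data.frame(a = 1:5, b = rep(TRUE, 5))

# Braces suppress the first-argument rule, so all() only sees the column b.
res_braces <- dat %>% { all(.$b) }

# Base R alternative: evaluate the expression with dat's columns in scope.
res_with <- with(dat, all(b))

print(res_braces)
print(res_with)
```

Both calls hand all() a single logical vector, which is what the original dat %>% all(.$b) was meant to do.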
I am new to R and have very simple code. I am trying to create a bar chart with 2 variables and 6 observations; however, the data appears to be plotting incorrectly. The combined value for MAYBE is 5.9, the combined value for NO is 5.3, and the combined value for YES is 5.3. Categories MAYBE and NO appear to be showing correctly; however, YES appears to be showing 3.2 and not 5.3. Can you please review and advise what might be wrong with my code?
library(tidyverse)
xaxis_data <- c("YES","NO","MAYBE")
yaxis_data <- c(2.1,1.6,3.4,3.2,3.7,2.5)
data_to_plot <- data.frame(cbind(xaxis_data,yaxis_data),stringsAsFactors = FALSE)
ggplot(data=data_to_plot) +
geom_bar(mapping=aes(x = xaxis_data,y=yaxis_data,fill = xaxis_data),stat="identity")
The issue is that cbind returns a matrix, and a matrix can hold only a single type. Because xaxis_data is character, the whole matrix, including yaxis_data, is coerced to character. Instead, we can construct the data with data.frame alone.
data_to_plot <- data.frame(xaxis_data,yaxis_data,stringsAsFactors = FALSE)
str(data_to_plot)
#'data.frame': 6 obs. of 2 variables:
#$ xaxis_data: chr "YES" "NO" "MAYBE" "YES" ...
#$ yaxis_data: num 2.1 1.6 3.4 3.2 3.7 2.5
If we use the cbind with data.frame
str(data.frame(cbind(xaxis_data,yaxis_data),stringsAsFactors = FALSE))
#'data.frame': 6 obs. of 2 variables:
#$ xaxis_data: chr "YES" "NO" "MAYBE" "YES" ...
#$ yaxis_data: chr "2.1" "1.6" "3.4" "3.2" ... ### character class
Using the OP's code
library(ggplot2)
ggplot(data=data_to_plot) +
geom_bar(mapping=aes(x = xaxis_data,y=yaxis_data,
fill = xaxis_data), stat="identity")
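As a quick check that needs no plotting, the sketch below uses only base R to show the coercion and to confirm the per-category totals from the question. The names as_matrix, fixed and totals are illustrative, not part of the OP's code.

```r
xaxis_data <- c("YES", "NO", "MAYBE")
yaxis_data <- c(2.1, 1.6, 3.4, 3.2, 3.7, 2.5)

# cbind() builds a matrix, so the numbers are coerced to character.
as_matrix <- cbind(xaxis_data, yaxis_data)
print(mode(as_matrix))

# data.frame() keeps each column's own type.
fixed <- data.frame(xaxis_data, yaxis_data, stringsAsFactors = FALSE)
print(class(fixed$yaxis_data))

# Sum the values per category; these are the bar heights the plot should show.
totals <- tapply(fixed$yaxis_data, fixed$xaxis_data, sum)
print(totals)
```

With numeric y values, stat = "identity" stacks the two observations per category, so each bar reaches the totals computed here.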
Caveat: novice. I have several data.tables with millions of rows each, variables are mostly dates and factors. I was using rbindlist() to combine them because. Yesterday, after breaking up the tables into smaller pieces vertically (instead of the current horizontal splicing), I was trying to understand rbind better (especially with fill = TRUE) and also tried bind_rows() and then tried to verify the results but identical() returned FALSE.
library(data.table)
library(dplyr)
DT1 <- data.table(a=1, b=2)
DT2 <- data.table(a=4, b=3)
DT_bindrows <- bind_rows(DT1,DT2)
DT_rbind <- rbind(DT1,DT2)
identical(DT_bindrows,DT_rbind)
# [1] FALSE
Visually inspecting the results from bind_rows() and rbind() says they are indeed identical. I read this and this (from where I adapted the example). My question: (a) what am I missing, and (b) if the number, names, and order of my columns is the same, should I be concerned that identical() = FALSE?
identical() also compares attributes, and these are not the same for the two objects. With all.equal, there is an option not to check the attributes (check.attributes)
all.equal(DT_bindrows, DT_rbind, check.attributes = FALSE)
#[1] TRUE
If we check the str of both the datasets, it becomes clear
str(DT_bindrows)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# $ a: num 1 4
# $ b: num 2 3
str(DT_rbind)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# $ a: num 1 4
# $ b: num 2 3
# - attr(*, ".internal.selfref")=<externalptr> # reference attribute
After setting that attribute to NULL, identical() returns TRUE
attr(DT_rbind, ".internal.selfref") <- NULL
identical(DT_bindrows, DT_rbind)
#[1] TRUE
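The same effect can be reproduced without data.table. This is a small base R sketch in which the attribute name "note" is made up purely for illustration; it stands in for .internal.selfref.

```r
df1 <- data.frame(a = c(1, 4), b = c(2, 3))
df2 <- df1

# Attach an extra attribute (hypothetical name) to one copy only.
attr(df2, "note") <- "extra attribute"

same_strict <- identical(df1, df2)                                 # attributes are compared
same_values <- all.equal(df1, df2, check.attributes = FALSE)       # attributes are ignored

# Dropping the extra attribute makes the objects identical again.
attr(df2, "note") <- NULL
same_after <- identical(df1, df2)

print(c(same_strict, isTRUE(same_values), same_after))
```

So if the column names, order and values agree, a FALSE from identical() that comes only from such an attribute says nothing about the data itself.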
I want to set an attribute ("full.name") of certain variables in a data frame by subsetting the dataframe and iterating over a character vector. I tried two solutions but neither works (varsToPrint is a character vector containing the variables, questionLabels is a character vector containing the labels of questions):
Sample data:
jtiPrint <- data.frame(question1 = seq(5), question2 = seq(5), question3=seq(5))
questionLabels <- c("question1Label", "question2Label")
varsToPrint <- c("question1", "question2")
Solution 1:
attrApply <- function(var, label) {
`<-`(attr(var, "full.name"), label)
}
mapply(attrApply, jtiPrint[varsToPrint], questionLabels)
Solution 2:
i <- 1
for (var in jtiPrint[varsToPrint]) {
attr(var, "full.name") <- questionLabels[i]
i <- i + 1
}
Desired output (for e.g. variable 1):
attr(jtiPrint$question1, "full.name")
[1] "question1Label"
The problem with solution 2 seems to be that R sets the attribute on a new object containing only one variable (the indexed variable). However, I don't understand why solution 1 does not work. Any ideas how to fix either of these two approaches?
Solution 1 :
The function to call is 'attr<-', not '<-'(attr...). You also need to set SIMPLIFY=FALSE (otherwise a matrix is returned instead of a list) and then call as.data.frame:
attrApply <- function(var, label) {
`attr<-`(var, "full.name", label)
}
df <- as.data.frame(mapply(attrApply,jtiPrint[varsToPrint],questionLabels,SIMPLIFY = FALSE))
> str(df)
'data.frame': 5 obs. of 2 variables:
$ question1: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question1Label"
$ question2: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question2Label"
Solution 2 :
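You need to set the attribute on the column inside the data.frame; your loop sets it on copies of the columns. Indexing by name rather than by position also keeps this working when the variables are not the first columns:
for (i in seq_along(varsToPrint)) {
attr(jtiPrint[[varsToPrint[i]]], "full.name") <- questionLabels[i]
}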
> str(jtiPrint)
'data.frame': 5 obs. of 3 variables:
$ question1: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question1Label"
$ question2: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question2Label"
$ question3: int 1 2 3 4 5
Anyway, note that the two approaches lead to a different result. In fact the mapply solution returns a subset of the previous data.frame (so no column 3) while the second approach modifies the existing jtiPrint data.frame.
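If you want the mapply-style iteration but still want to modify jtiPrint in place and keep all of its columns, one possible sketch is to assign the result of Map() back into the selected columns. This is only an illustration under the sample data from the question, not the only way to do it.

```r
jtiPrint <- data.frame(question1 = seq(5), question2 = seq(5), question3 = seq(5))
questionLabels <- c("question1Label", "question2Label")
varsToPrint <- c("question1", "question2")

# Map() returns a list of modified columns; assigning it back replaces
# only the selected columns and leaves question3 untouched.
jtiPrint[varsToPrint] <- Map(function(col, label) {
  attr(col, "full.name") <- label
  col
}, jtiPrint[varsToPrint], questionLabels)

print(attr(jtiPrint$question1, "full.name"))
print(names(jtiPrint))
```

The key point is the same as in the loop version: the modified column has to be written back into the data frame, because the function only ever sees a copy.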
Say I have a list below
> str(lll)
List of 2
$ :List of 3
..$ Name : chr "Sghokbt"
..$ Title: NULL
..$ Value: int 7
$ :List of 3
..$ Name : chr "Sgnglio"
..$ Title: chr "Mr"
..$ Value: num 5
How can I convert this list to a data frame as below?
> df
Name Title Value
1 Sghokbt <NA> 7
2 Sgnglio Mr 5
as.data.frame doesn't work, I suspect due to the NULL in the first list element. EDIT: I have also tried do.call(rbind, list) as suggested in another question, but the result is a matrix of lists, not a data frame.
To reproduce the list:
list(structure(list(Name = "Sghokbt", Title = NULL, Value = 7L), .Names = c("Name",
"Title", "Value")), structure(list(Name = "Sgnglio", Title = "Mr",
Value = 5), .Names = c("Name", "Title", "Value")))
I think I've found a solution myself.
My approach is to first convert all the sub-lists into dataframes, so I have a list of dataframes instead of list of lists. These dataframes will just drop the NULL variables.
ldf <- lapply(lll, function(x) {
nonnull <- sapply(x, typeof)!="NULL" # TRUE for non-NULL elements, so NULLs are omitted
do.call(data.frame, c(x[nonnull], stringsAsFactors=FALSE))
})
The resultant list of dataframes:
> str(ldf)
List of 2
$ :'data.frame': 1 obs. of 2 variables:
..$ Name : chr "Sghokbt"
..$ Value: int 7
$ :'data.frame': 1 obs. of 3 variables:
..$ Name : chr "Sgnglio"
..$ Title: chr "Mr"
..$ Value: num 5
From here I get a little help from plyr.
require(plyr)
df <- ldply(ldf)
The result has the columns out of order, but I'm happy enough with it.
> str(df)
'data.frame': 2 obs. of 3 variables:
$ Name : chr "Sghokbt" "Sgnglio"
$ Value: num 7 5
$ Title: chr NA "Mr"
I won't accept this as an answer yet for now in case there is a better solution.
Tidyverse solution
Here's a solution with the tidyverse which might be more readable or at least more intuitive to read for those who are familiar with dplyr and purrr.
library(dplyr)
library(purrr)
library(tibble)
lll %>%
# apply to the whole list, and then convert into a tibble
map_df(~
# convert every list element to a char vector
as.character(.x) %>%
# convert the char vector to a tibble row
as_tibble_row(.name_repair = "unique")) %>%
# convert all "NULL" entries to NA
na_if("NULL") %>%
# set tibble names assuming all list entries contain the same names
set_names(lll[[1]] %>% names())
There are several tricks to note:
map_df cannot merge bare character vectors into a dataframe. Therefore, you convert each of them into a dataframe row with as_tibble_row(). Theoretically, you could work with named vectors instead, but as.character() drops the names attribute.
For as_tibble_row(), you need to specify a .name_repair argument so that the unnamed vectors get column names and map_df can merge the tibble rows.
I'm truly grateful for the dplyr::na_if() function, and you should be too!
lll[[1]] %>% names() is just one way to get the names of the first list entry, and it assumes the other list entries are named the same and in the same order. You should check that beforehand.
Details:
When you use na_if(), you replace this code by Ricky (which is totally fine but harder to remember):
ldf <- lapply(lll, function(x) {
nonnull <- sapply(x, typeof)!="NULL" # TRUE for non-NULL elements, so NULLs are omitted
do.call(data.frame, c(x[nonnull], stringsAsFactors=FALSE))
})
data.frame(do.call(rbind, lll))
Name Title Value
1 Sghokbt NULL 7
2 Sgnglio Mr 5
do.call is useful in that it accepts a list of arguments. Here it executes rbind, which combines the sub-lists observation by observation, and data.frame structures the output as needed. The weakness is that the columns of the new object are themselves lists (note the NULL still present in Title), so it will be difficult to perform calculations on the elements. Below is another option, which is also potentially problematic.
By removing the NULL value first:
null.remove <- function(lst) {
lapply(lst, function(x) {x <- paste(x, ""); x})
}
newlist <- lapply(lll, null.remove)
asvec <- unlist(newlist)
col.length <- length(newlist[[1]])
data.frame(rbind(asvec[1:col.length],
asvec[(col.length+1):length(asvec)]))
Name Title Value
1 Sghokbt 7
2 Sgnglio Mr 5
'data.frame': 2 obs. of 3 variables:
$ Name : Factor w/ 2 levels "Sghokbt ","Sgnglio ": 1 2
$ Title: Factor w/ 2 levels " ","Mr ": 1 2
$ Value: Factor w/ 2 levels "5 ","7 ": 2 1
This method coerces a value onto the NULL elements in the list by pasting a space onto the existing object. Next unlist allows the list elements to be treated as a named vector. col.length takes note of how many variables there are for use in the new data frame. The last function call creates the data frame by using the col.length value to split the vector.
This is still an intermediate result. Before regular data frame operations can be done, the extra space will have to be trimmed off of the factors. The digits must also be coerced to the class numeric.
I can continue the process when I have another chance to update.
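For comparison, here is a base R sketch of the same idea that avoids the string round trip: replace each NULL with NA inside every sub-list, turn each sub-list into a one-row data frame, and row-bind them. It assumes every sub-list has the same names in the same order, as in the example; the intermediate name rows is illustrative.

```r
lll <- list(
  list(Name = "Sghokbt", Title = NULL, Value = 7L),
  list(Name = "Sgnglio", Title = "Mr", Value = 5)
)

rows <- lapply(lll, function(x) {
  # Replace NULL entries with NA so that no column is dropped.
  x[vapply(x, is.null, logical(1))] <- list(NA)
  as.data.frame(x, stringsAsFactors = FALSE)
})

# Stack the one-row data frames; column order follows the first element.
df <- do.call(rbind, rows)
str(df)
```

Because the values never pass through as.character(), Value stays numeric and no trimming or re-conversion is needed afterwards.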
A very unexpected behavior of the otherwise useful data.frame in R arises from it storing character columns as factors. This causes many problems if it is not taken into account. For example, consider the following code:
foo=data.frame(name=c("c","a"),value=1:2)
# name value
# 1 c 1
# 2 a 2
bar=matrix(1:6,nrow=3)
rownames(bar)=c("a","b","c")
# [,1] [,2]
# a 1 4
# b 2 5
# c 3 6
What would you expect from running bar[foo$name,]? It should return the rows of bar that are named in foo$name, that is, rows 'c' and 'a'. But the result is different:
bar[foo$name,]
# [,1] [,2]
# b 2 5
# a 1 4
The reason is this: foo$name is not a character vector but a factor, and indexing with a factor uses its underlying integer codes.
foo$name
# [1] c a
# Levels: a c
To have the expected behavior, I manually convert it to character vector:
foo$name = as.character(foo$name)
bar[foo$name,]
# [,1] [,2]
# c 3 6
# a 1 4
But the problem is that it is easy to forget this step and end up with hidden bugs in our code. Is there a better solution?
This is a feature and R is working as documented. This can be dealt with generally in a few ways:
use the argument stringsAsFactors = FALSE in the call to data.frame(). See ?data.frame
if you detest this behaviour so, set the option globally via
options(stringsAsFactors = FALSE)
(as noted by @JoshuaUlrich in comments) a third option is to wrap character variables in I(....). This alters the class of the object being assigned to the data frame component to include "AsIs". In general this shouldn't be a problem as the object inherits (in this case) the class "character" so should work as before.
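A short sketch of how these options behave on the example from the question. stringsAsFactors = TRUE is passed explicitly in the first call so that the factor behaviour is reproduced regardless of the session default (an assumption made for this illustration); the names foo_factor, foo_chr and foo_asis are made up.

```r
bar <- matrix(1:6, nrow = 3, dimnames = list(c("a", "b", "c"), NULL))

# Factor column: indexing uses the integer codes (levels a = 1, c = 2).
foo_factor <- data.frame(name = c("c", "a"), value = 1:2, stringsAsFactors = TRUE)
rows_factor <- rownames(bar[foo_factor$name, ])

# Option 1: keep strings as character when building the data frame.
foo_chr <- data.frame(name = c("c", "a"), value = 1:2, stringsAsFactors = FALSE)
rows_chr <- rownames(bar[foo_chr$name, ])

# Option 3: protect the column with I(); it is stored as-is, not as a factor.
foo_asis <- data.frame(name = I(c("c", "a")), value = 1:2)

print(rows_factor)
print(rows_chr)
print(class(foo_asis$name))
```

The factor version selects rows by position, while the character version selects them by name, which is the difference the question ran into.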
You can check what the default for stringsAsFactors is on the currently running R process via:
> default.stringsAsFactors()
[1] TRUE
The issue is slightly wider than data.frame() in scope as this also affects read.table(). In that function, as well as the two options above, you can also tell R what all the classes of the variables are via argument colClasses and R will respect that, e.g.
> tmp <- read.table(text = '"Var1","Var2"
+ "A","B"
+ "C","C"
+ "B","D"', header = TRUE, colClasses = rep("character", 2), sep = ",")
> str(tmp)
'data.frame': 3 obs. of 2 variables:
$ Var1: chr "A" "C" "B"
$ Var2: chr "B" "C" "D"
In the example data below, author and title are automatically converted to factor (unless you add the argument stringsAsFactors = FALSE when you are creating the data). What if we forgot to change the default setting and don't want to set the options globally?
Some code I found somewhere (most likely SO) uses sapply() to identify factors and convert them to strings.
dat = data.frame(title = c("title1", "title2", "title3"),
author = c("author1", "author2", "author3"),
customerID = c(1, 2, 1))
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : Factor w/ 3 levels "title1","title2",..: 1 2 3
# $ author : Factor w/ 3 levels "author1","author2",..: 1 2 3
# $ customerID: num 1 2 1
dat[sapply(dat, is.factor)] = lapply(dat[sapply(dat, is.factor)],
as.character)
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : chr "title1" "title2" "title3"
# $ author : chr "author1" "author2" "author3"
# $ customerID: num 1 2 1
I assume this would be faster than re-reading in the dataset with the stringsAsFactors = FALSE argument, but have never tested.
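If this conversion is needed in more than one place, the same idiom can be wrapped in a small helper. The sketch below is one way to do that; factors_to_character is a hypothetical name, and stringsAsFactors = TRUE is set explicitly so the example data really contains factors.

```r
# Hypothetical helper: convert every factor column to character,
# leaving all other columns as they are.
factors_to_character <- function(df) {
  is_fct <- vapply(df, is.factor, logical(1))
  df[is_fct] <- lapply(df[is_fct], as.character)
  df
}

dat <- data.frame(title = c("title1", "title2", "title3"),
                  author = c("author1", "author2", "author3"),
                  customerID = c(1, 2, 1),
                  stringsAsFactors = TRUE)

dat <- factors_to_character(dat)
str(dat)
```

vapply() is used instead of sapply() only so that the result is guaranteed to be a logical vector; the conversion itself is the same as in the snippet above.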