dplyr gives me different answers depending on how I select columns - r

I may be having trouble understanding some of the basics of dplyr, but it appears that R behaves very differently depending on whether you subset columns as one column data frames or as traditional vectors. Here is an example:
mtcarsdf<-tbl_df(mtcars)
example<-function(x,y) {
df<-tbl_df(data.frame(x,y))
df %>% group_by(x) %>% summarise(total=sum(y))
}
#subsetting to cyl this way gives integer vector
example(mtcars$gear,mtcarsdf$cyl)
# 3 112
# 4 56
# 5 30
#subsetting this way gives a one column data table
example(mtcars$gear,mtcarsdf[,"cyl"])
# 3 198
# 4 198
# 5 198
all(mtcarsdf$cyl==mtcarsdf[,"cyl"])
# TRUE
Since my inputs are technically equal the fact that I am getting different outputs tells me I am misunderstanding how the two objects behave. Could someone please enlighten me on how to improve the example function so that it can handle different objects more robustly?
Thanks

First, the items that you are comparing with == are not really the same. This could be identified using all.equal instead of ==:
all.equal(mtcarsdf$cyl, mtcarsdf[, "cyl"])
## [1] "Modes: numeric, list"
## [2] "Lengths: 32, 1"
## [3] "names for current but not for target"
## [4] "Attributes: < target is NULL, current is list >"
## [5] "target is numeric, current is tbl_df"
With that in mind, you should be able to get the behavior you want by using [[ to extract the column instead of [.
mtcarsdf <- tbl_df(mtcars)
example<-function(x,y) {
df<-tbl_df(data.frame(x,y))
df %>% group_by(x) %>% summarise(total=sum(y))
}
example(mtcars$gear, mtcarsdf[["cyl"]])
However, a safer approach might be to integrate the renaming of the columns as part of your function, like this:
example2 <- function(x, y) {
df <- tbl_df(setNames(data.frame(x, y), c("x", "y")))
df %>% group_by(x) %>% summarise(total = sum(y))
}
Then, any of the following should give you the same results.
example2(mtcars$gear, mtcarsdf$cyl)
example2(mtcars$gear, mtcarsdf[["cyl"]])
example2(mtcars$gear, mtcarsdf[, "cyl"])
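The value 198 in the question can be traced to column naming. The following is a minimal base R sketch, not part of the original answer; it assumes that a one-column data frame created with drop = FALSE stands in for the tibble subset mtcarsdf[, "cyl"], and the names y_vec and y_df are made up for illustration. When data.frame() receives a data frame as an argument, it keeps that data frame's own column name, so no column called y exists and sum(y) inside summarise() falls back to the whole function argument.

```r
x <- mtcars$gear
y_vec <- mtcars[["cyl"]]                # plain numeric vector
y_df  <- mtcars[, "cyl", drop = FALSE]  # one-column data frame (assumed stand-in for the tibble subset)

# A vector argument is named after the expression that was passed in
names_vec <- names(data.frame(x, y_vec))

# A data frame argument keeps its own column name, so there is no "y_df" column
names_df <- names(data.frame(x, y_df))

# Summing the whole one-column data frame gives the grand total seen in every group
grand_total <- sum(y_df)

print(names_vec)
print(names_df)
print(grand_total)
```

This is why renaming the columns with setNames() inside the function makes all three calls agree.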

Related

R: Comparing list with grouped values in dataframe; questions about data types

I'm trying to make booleans for the dataframe testdf where when grouped by id, the boolean indicates whether everything in the vector values exists in that id's values.
I believe that ultimately this is a question about the different vector/list data types in R, which I still don't understand.
The vector values comes from the second column in lookup. R says it's a vector but not a list (should I do as.list(lookup$values) instead)?
testdf <- data.frame(id = c('a','a','a','b','b'),
value = c(1,2,3,1,2))
lookup <- data.frame(col1 = c('x','y','z'),
values = c(1,2,3))
values <- lookup$values
These produce:
> testdf
id value
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
> lookup
col1 values
1 x 1
2 y 2
3 z 3
> values
[1] 1 2 3
And my intended result would look like this. values_bool is TRUE for an id where all elements in values exist in value_list. Note that I formatted the lists in a generic way.
testdf
id value_list values_bool
1 a [1,2,3] TRUE
2 b [1,2] FALSE
This page includes some more information on the creation and use of list columns in R, although it doesn't explain the difference in data types generated within each list.
For example, a cell in the list column created by nest() looks like this:
asia <tibble [59 × 1]>
and a cell in the list column created by summarize() and list() looks like this:
asia <chr [59]>
I tried to create list columns using two methods from that page in my code:
version1 <- testdf %>%
# Make column listing values for each id
group_by(id) %>%
summarize(value_list = list(value)) %>%
ungroup()
version2 <- testdf %>%
# Nest values into list
nest(value_list = value)
If you run this you can see they produce different list types.
> version1
# A tibble: 2 × 2
id value_list
<chr> <list>
1 a <dbl [3]>
2 b <dbl [2]>
> version2
# A tibble: 2 × 2
id value_list
<chr> <list>
1 a <tibble [3 × 1]>
2 b <tibble [2 × 1]>
And when I try to add this line at the end of each, the output gives all False even though one should be True.
# Mark whether all values exist in value_list for each id
mutate(values_bool = values %in% value_list)
So what's going on with the dataframe list data types when grouping value from testdf and selecting values from lookup?
Thanks so much for your help.
@akrun answered this in the comments, but I love this question, which is really a deep inquiry into R data structures and functional programming, so I'm going to answer it in a longer form here and arrive at the same answer.
First the minimal reproducible problem setup (I simplify version1 and version2 to v1 and v2):
library(tidyverse)
testdf <- data.frame(id = c('a','a','a','b','b'),
value = c(1,2,3,1,2))
lookup <- data.frame(col1 = c('x','y','z'),
values = c(1,2,3))
values <- lookup$values
v1 <- testdf %>%
# Make column listing values for each id
group_by(id) %>%
summarize(value_list = list(value)) %>%
ungroup()
v2 <- testdf %>%
# Nest values into list
nest(value_list = value)
In v1 and v2, we create a value_list column that is a list column. The list contents differ (you can tell by the reported dimensions in the OP).
Each element of v1$value_list is an atomic vector
Each element of v2$value_list is a data.frame
This is because list(), when passed an atomic vector, stores that atomic vector as an atomic vector, but nest(), when passed an atomic vector, stores it as a data.frame. Why? Because nest() comes from the functional programming paradigm, where we use map() functions (a type-safe, next-generation lapply()) to operate on data.frame objects and, importantly, return data.frame objects.
Okay, now let's start walking through the solution. It helps to understand how R represents the data types in v1 and v2.
We begin with v1. Calling list() on an atomic vector wraps that vector, unchanged, in a list, so each element of the list column is a 1D atomic vector. We can pull one out and examine it:
Note that throughout the post I include only code. Copy, paste, and run it interactively to see the output.
v1 # a dataframe
v1$value_list # a column, which is itself a list (see below)
class(v1$value_list) # verify this is a list
v1$value_list[[1]] # b/c it's a list, use [[ notation to pull element 1
class(v1$value_list[[1]]) # it's an atomic vector - "numeric" = integer or double
typeof(v1$value_list[[1]]) # typeof() to show it's double
typeof(1L) # as opposed to a strict integer type
length(v1$value_list[[1]]) # atomic vectors have property length
Now we examine v2:
v2 # a dataframe
v2$value_list # a column-which is itself a list, see below
class(v2$value_list) # verify this is a list
v2$value_list[[1]] # b/c it's a list, use [[ notation to pull element 1
class(v2$value_list[[1]]) # it's a data.frame, NOT an atomic vector!
typeof(v2$value_list[[1]]) # typeof() to show a dataframe is just a special list
dim(v2$value_list[[1]]) # dataframes have property dim
length(v2$value_list[[1]]) # dataframes also have property length, which is ncol()
With that background on object types, let's move to the problem posed in the OP.
And when I try to add this line at the end of each, the output gives all False even though one should be True.
# Mark whether all values exist in value_list for each id
mutate(values_bool = values %in% value_list)
Essentially, we want to test if all values occur in v1$value_list or v2$value_list. The expected result is TRUE, and then FALSE, for grouping variables a and b respectively. First, we do this the long way, and then we build to a functional programming solution.
# exhaustive approach for list 1 with atomic vectors
values %in% v1$value_list[[1]] # all values present
all(values %in% v1$value_list[[1]]) # returns TRUE as expected
# exhaustive approach for list 2 with atomic vectors
values %in% v1$value_list[[2]] # one value missing
all(values %in% v1$value_list[[2]]) # returns FALSE as expected
# now the functional programming solution without indexing. Notes on syntax:
# - Pass a list (or a vector) as the first argument.
# - Use ~ to indicate the start of the function to apply across the list
# - Use .x as a placeholder within the function for the ith list element. Think: concise for-loop
map(v1$value_list, ~all(values %in% .x)) # returns list(TRUE, FALSE)
# Two solutions: because we want to output a vector we can unlist(), or
# use map_lgl() which only returns a boolean and fails if the result is NOT bool
unlist(map(v1$value_list, ~all(values %in% .x))) # returns c(TRUE, FALSE)
map_lgl(v1$value_list, ~all(values %in% .x)) # returns c(TRUE, FALSE)
# now the functional programming solution in a pipe (put it all together):
v1 %>%
mutate(values_bool = map_lgl(value_list, ~all(values %in% .x)))
# we can do the same with v2, but we need to convert the dataframe to a vector
# so that it works with all() -- this is a bit tedious and verbose, because we
# used nest() which is designed to work with nested dataframes, but we ultimately
# want to use all() which takes vectors, so there was no need to nest() in the first
# place. For examples of how to use the power of nest():
# https://r4ds.had.co.nz/many-models.html#creating-list-columns
v2 %>%
mutate(values_bool = map_lgl(value_list, ~all(values %in% .x$value)))
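For readers without the tidyverse loaded, the same idea can be sketched in base R. This is an illustrative alternative rather than the answer's own method: split() stands in for the list column and vapply() for map_lgl(). It also shows why testing values against the list itself returns only FALSE.

```r
testdf <- data.frame(id = c("a", "a", "a", "b", "b"),
                     value = c(1, 2, 3, 1, 2))
values <- c(1, 2, 3)

# split() returns a named list with one atomic vector per id
value_list <- split(testdf$value, testdf$id)

# %in% against the list matches each value against whole list elements,
# not against the numbers inside them, so nothing matches
naive <- values %in% value_list

# Testing one list element at a time gives one logical per id
values_bool <- vapply(value_list, function(v) all(values %in% v), logical(1))

print(naive)
print(values_bool)
```

Like map_lgl(), vapply() with logical(1) fails if any element returns something other than a single logical.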
Additional Resources:
essential reading on vectors

How to create new column (using dplyr's mutate) based on conditions applied on the entire piped dataframe

I am looking of a way to create a new column (using dplyr's mutate) based on certain "conditions".
library(tidyverse)
qq <- 5
df <- data.frame(rn = 1:qq,
a = rnorm(qq,0,1),
b = rnorm(qq,10,5))
myf <- function(dataframe,value){
result <- dataframe %>%
filter(rn<=value) %>%
nrow
return(result)
}
The above example is a rather simplified version for which I am trying to filter the piped dataframe (df) and obtain a new column (foo) whose values will depict how many rows there are with rn less than or equal to the current rn (each row's rn - coming from the piped df ). Below you can see the output I am getting vs the one I expect to obtain :
df %>%
mutate(
foo_i_am_getting = myf(.,rn),
foo_expected = 1:qq)
rn a b foo_i_am_getting foo_expected
1 1 -0.5403937 -4.945643 5 1
2 2 0.7169147 2.516924 5 2
3 3 -0.2610024 -7.003944 5 3
4 4 -0.9991419 -1.663043 5 4
5 5 1.4002610 15.501411 5 5
The actual calculation I am trying to perform is more cumbersome, however, if I solve the above simplified version, I believe I can handle the rest of the manipulation/calculations inside the custom function.
BONUS QUESTION : Currently the name of the column I want to apply the filter on (i.e. rn) is hardcoded in the custom function (filter(rn<=value)). It would be great if this was an argument of the custom function, to be passed 'tidyverse' style - i.e. without quotation marks - e.g. myf <- function(dataframe,rn,value)
Disclaimer : I 've done my best to describe the problem at hand, however, if there are still unclear spots please let me know so I can elaborate further.
Thanks in advance for your support!
You need to do it step by step, because right now you are passing the whole vector to filter instead of one value at a time:
df %>%
mutate(
foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
foo_expected = 1:qq)
Now we pass 1 to filter as the threshold for the rn column (and the function returns the number of rows), then 2, and so on.
Function could be:
myf <- function(vec_filter, dataframe, vec_rn) {
map_dbl(vec_filter, ~ nrow(filter(dataframe, {{vec_rn}} <= .x)))
}
df %>%
mutate(
foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
foo_expected = 1:qq,
foo_function = myf(rn, ., rn))
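The underlying issue can also be seen without dplyr. The sketch below is a base R illustration only (the variable names are made up): comparing the column with itself is vectorised and TRUE on every row, which is why the original function always returned 5, whereas looping over the values one at a time gives the running count.

```r
df <- data.frame(rn = 1:5)

# Vectorised comparison: rn <= rn is TRUE on every row, so all rows are counted
all_rows <- sum(df$rn <= df$rn)

# One comparison per value of rn: count the rows at or below each threshold
foo <- vapply(df$rn, function(v) sum(df$rn <= v), integer(1))

print(all_rows)
print(foo)
```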

A problem during real data analysis with purrr

I analyzed a real data set.
Data set: https://github.com/ThinkR-open/datasets/blob/master/README.md
tweets <- readRDS("#RStudioConf.RDS")
rstudioconf <- as.list(NULL)
for (i in 1:nrow(tweets)) {
rstudioconf[[i]] <- tweets[i,]
}
I want to answer a question about the data set: how many tweets contain a link to a GitHub-related URL?
Below is my code:
# Extract the "urls_url" elements, and flatten() the result
urls_clean <- map(rstudioconf, "urls_url") %>%
flatten()
# Remove NA from list
compact_urls <- urls_clean %>%
map(discard,is.na) %>%
compact()
# Create a mapper that detects the pattern "github"
has_github <- as_mapper(~ str_detect(.x, "github"))
# Look for the "github" pattern, and sum the result
map_lgl(compact_urls, has_github) %>% sum()
The last line of code
map_lgl(compact_urls, has_github) %>% sum()
gives me an error:
Error: Result 10 must be a single logical, not a logical vector of length 2
I am really confused. map_lgl(compact_urls, has_github) should give a logical vector of TRUE and FALSE values; this vector is then piped into sum(), which adds up the TRUE values and returns a number. I never expected it to give me an error. Could anyone help? Thank you!
map_lgl returns the error because some of the list elements have different length. It is indicated in ?map
map_lgl(), map_int(), map_dbl() and map_chr() return an atomic vector of the indicated type (or die trying).
out <- map(compact_urls, has_github)
table(lengths(out))
# 1 2 3 6
#1117 22 4 1
We can unlist the output from map and get the sum
sum(unlist(out))
It can be reproduced using a simple example
map_lgl(list(FALSE, TRUE), I) #each list element of length 1
#[1] FALSE TRUE
map_lgl(list(FALSE, c(TRUE, TRUE)), I) # one element of length 2
Error: Result 2 must be a single logical, not a vector of class AsIs
and of length 2
If the objective is to return only a single TRUE/FALSE per element, wrap the function with any
has_github <- as_mapper(~ any(str_detect(.x, "github")))
Now, try with map_lgl
map_lgl(compact_urls, has_github) %>%
sum()
#[1] 347
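The same length-one requirement exists in base R's vapply(), which makes for a package-free sketch of the problem and the fix. The urls list below is hypothetical toy data, not the tweet data set, and grepl() is used in place of str_detect().

```r
# Hypothetical toy data: the second element holds two URLs
urls <- list("github.com", c("t.co", "github.com/x"), "example.org")

# One logical per URL, so the second result has length 2
per_url <- lapply(urls, function(u) grepl("github", u))
result_lengths <- lengths(per_url)

# vapply() with logical(1) rejects the length-2 result, much like map_lgl()
strict <- tryCatch(
  vapply(urls, function(u) grepl("github", u), logical(1)),
  error = function(e) "error"
)

# Collapsing each element with any() yields exactly one logical per element
has_github <- vapply(urls, function(u) any(grepl("github", u)), logical(1))
n_github <- sum(has_github)

print(result_lengths)
print(strict)
print(n_github)
```

Note that sum(unlist(out)) counts matching URLs, while the any() version counts elements (tweets) with at least one match, so the two totals can differ.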

Looping through rows in an R data frame?

I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
for(i in 1:nrow(x))
{
if(grep(".DERIVED", x[i,]) >= 1)
{
x <- x[-i,]
}
}
for(i in 1:ncol(x))
{
if(is.numeric(x[,i]) != TRUE)
{
x <- x[,-i]
}
}
return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states the "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>% # keep rows where no column contains ".DERIVED"
select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
dat %>%
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>% # keep rows where no column contains ".DERIVED"
select_if(is.numeric)
}
You should probably give it a better name, though.
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep doesn't find the term ".DERIVED", it returns integer(0), so your inequality doesn't return TRUE or FALSE, but rather logical(0). The error is telling you that the if statement cannot evaluate a zero-length condition.
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") >= 1) {print("no dice")} # Error: argument is of length zero
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
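A short base R sketch of what happens at each step may help; the variable names are illustrative only. It shows the zero-length result that breaks if(), the length() check suggested above, and grepl() with any() as another way to get a single TRUE or FALSE.

```r
hit  <- grep(".DERIVED", "1234.DERIVEDabcdefg")  # index of the matching element
miss <- grep(".DERIVED", "1234abcdefg")          # no match: a zero-length integer vector

# Comparing a zero-length vector gives a zero-length logical, which if() cannot use
cmp <- miss >= 1

# Both of these always return exactly one TRUE or FALSE
found_by_length <- length(miss) != 0
found_by_grepl  <- any(grepl(".DERIVED", "1234abcdefg"))

print(hit)
print(length(cmp))
print(found_by_length)
print(found_by_grepl)
```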
There's something else you haven't noticed yet, which is that you're removing rows/columns in a loop. Say you remove the 5th column: the next loop iteration (when i = 6) will be handling what was the 7th column! (This will end in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected.)
I prefer using dplyr, but if you need to use base R functions there are ways to do this without if statements.
Notice that you should consider using the regex version of "\\.DERIVED" and not ".DERIVED" which would mean "any character followed by DERIVED".
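To make the difference concrete, here is a small sketch with made-up strings. An unescaped dot matches any character, the escaped form matches only a literal dot, and fixed = TRUE is another base R way to treat the whole pattern literally.

```r
# Unescaped: "." matches any character, so "xDERIVED" counts as a match
loose <- grepl(".DERIVED", "xDERIVED")

# Escaped: only a literal dot followed by DERIVED matches
strict_no  <- grepl("\\.DERIVED", "xDERIVED")
strict_yes <- grepl("\\.DERIVED", "data.DERIVED")

# fixed = TRUE disables regex interpretation altogether
fixed_no  <- grepl(".DERIVED", "xDERIVED", fixed = TRUE)
fixed_yes <- grepl(".DERIVED", "data.DERIVED", fixed = TRUE)

print(c(loose, strict_no, strict_yes, fixed_no, fixed_yes))
```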
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: The following code assumes that the column class is numeric because the
# example code provided assumed that the column class was numeric. This will not
# detect a column of character values that contain only numbers.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
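The examples above check only column a. Since the question asks to drop rows with ".DERIVED" in any cell, one possible base R sketch of the whole task as a function follows. The function name drop_derived and the sample data are hypothetical, and apply() converts every cell to character before matching, which is an assumption that may not suit very large data frames.

```r
# Hypothetical helper: drop rows containing ".DERIVED" in any cell,
# then keep only the numeric columns
drop_derived <- function(x) {
  # apply() over rows sees each row as a character vector
  has_derived <- apply(x, 1, function(row) any(grepl("\\.DERIVED", row)))
  x <- x[!has_derived, , drop = FALSE]
  # drop = FALSE keeps a data.frame even if only one numeric column remains
  x[, vapply(x, is.numeric, logical(1)), drop = FALSE]
}

# Made-up data in the same shape as the example above
raw <- data.frame(a = c("data", "data.DERIVED", "data", "data", "data.DERIVED"),
                  b = c(1, 2, 3, 4, 5),
                  c = c("A", "B", "C", "D", "E"),
                  d = c(2, 5, 6, 8, 9),
                  stringsAsFactors = FALSE)

cleaned <- drop_derived(raw)
print(cleaned)
```

Computing the row and column masks first and subsetting once avoids the index-shifting problem that comes from deleting inside a loop.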

Remove an entire column from a data.frame in R

Does anyone know how to remove an entire column from a data.frame in R? For example if I am given this data.frame:
> head(data)
chr genome region
1 chr1 hg19_refGene CDS
2 chr1 hg19_refGene exon
3 chr1 hg19_refGene CDS
4 chr1 hg19_refGene exon
5 chr1 hg19_refGene CDS
6 chr1 hg19_refGene exon
and I want to remove the 2nd column.
You can set it to NULL.
> Data$genome <- NULL
> head(Data)
chr region
1 chr1 CDS
2 chr1 exon
3 chr1 CDS
4 chr1 exon
5 chr1 CDS
6 chr1 exon
As pointed out in the comments, here are some other possibilities:
Data[2] <- NULL # Wojciech Sobala
Data[[2]] <- NULL # same as above
Data <- Data[,-2] # Ian Fellows
Data <- Data[-2] # same as above
You can remove multiple columns via:
Data[1:2] <- list(NULL) # Marek
Data[1:2] <- NULL # does not work!
Be careful with matrix-subsetting though, as you can end up with a vector:
Data <- Data[,-(2:3)] # vector
Data <- Data[,-(2:3),drop=FALSE] # still a data.frame
To remove one or more columns by name, when the column names are known (as opposed to being determined at run-time), I like the subset() syntax. E.g. for the data-frame
Data <- data.frame(a=1:3, d=2:4, c=3:5, b=4:6)
to remove just the a column you could do
Data <- subset( Data, select = -a )
and to remove the b and d columns you could do
Data <- subset( Data, select = -c(d, b ) )
You can remove all columns between d and b with:
Data <- subset( Data, select = -c( d : b ) )
As I said above, this syntax works only when the column names are known. It won't work when say the column names are determined programmatically (i.e. assigned to a variable). I'll reproduce this Warning from the ?subset documentation:
Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like '[', and in particular the non-standard evaluation
of argument 'subset' can have unanticipated consequences.
(For completeness) If you want to remove columns by name, you can do this:
cols.dont.want <- "genome"
cols.dont.want <- c("genome", "region") # if you want to remove multiple columns
data <- data[, ! names(data) %in% cols.dont.want, drop = F]
Including drop = F ensures that the result will still be a data.frame even if only one column remains.
The posted answers are very good when working with data.frames. However, these tasks can be pretty inefficient from a memory perspective. With large data, removing a column can take an unusually long amount of time and/or fail due to out of memory errors. Package data.table helps address this problem with the := operator:
library(data.table)
> dt <- data.table(a = 1, b = 1, c = 1)
> dt[,a:=NULL]
b c
[1,] 1 1
I should put together a bigger example to show the differences. I'll update this answer at some point with that.
There are several options for removing one or more columns with dplyr::select() and some helper functions. The helper functions can be useful because some do not require naming all the specific columns to be dropped. Note that to drop columns using select() you need to use a leading - to negate the column names.
Using the dplyr::starwars sample data for some variety in column names:
library(dplyr)
starwars %>%
select(-height) %>% # a specific column name
select(-one_of('mass', 'films')) %>% # any columns named in one_of()
select(-(name:hair_color)) %>% # the range of columns from 'name' to 'hair_color'
select(-contains('color')) %>% # any column name that contains 'color'
select(-starts_with('bi')) %>% # any column name that starts with 'bi'
select(-ends_with('er')) %>% # any column name that ends with 'er'
select(-matches('^v.+s$')) %>% # any column name matching the regex pattern
select_if(~!is.list(.)) %>% # not by column name but by data type
head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
You can also drop by column number:
starwars %>%
select(-2, -(4:10)) # column 2 and columns 4 through 10
With this you can remove the column and store variable into another variable.
df = subset(data, select = -c(genome) )
Using dplyr, the following works:
data <- select(data, -genome)
as per documentation found here https://www.marsja.se/how-to-remove-a-column-in-r-using-dplyr-by-name-and-index/#:~:text=select(starwars%2C%20%2Dheight)
I just thought I'd add one in that wasn't mentioned yet. It's simple but also interesting because in all my perusing of the internet I did not see it, even though the highly related %in% appears in many places.
df <- df[ , -which(names(df) == 'removeCol')]
Also, I didn't see anyone post grep alternatives. These can be very handy for removing multiple columns that match a pattern.
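As a sketch of the pattern-based idea, the following uses grepl() on the column names. The data frame and the tmp_ prefix are made up for illustration. It also shows a pitfall of the -which() form: when nothing matches, which() returns a zero-length vector and the subset drops every column, whereas a logical mask keeps them all.

```r
# Hypothetical data frame with two temporary columns to remove
df <- data.frame(id = 1:3, tmp_a = 4:6, tmp_b = 7:9, value = 10:12)

# Drop every column whose name starts with "tmp_"
df_no_tmp <- df[, !grepl("^tmp_", names(df)), drop = FALSE]

# Pitfall: no column has this name, so -which(...) selects nothing at all
emptied <- df[, -which(names(df) == "no_such_column")]

# A logical mask leaves the data frame untouched when nothing matches
untouched <- df[, names(df) != "no_such_column", drop = FALSE]

print(names(df_no_tmp))
print(ncol(emptied))
print(ncol(untouched))
```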

Resources