I analyzed a real data set.
Data set: https://github.com/ThinkR-open/datasets/blob/master/README.md
tweets <- readRDS("#RStudioConf.RDS")
rstudioconf <- as.list(NULL)
for (i in 1:nrow(tweets)) {
rstudioconf[[i]] <- tweets[i,]
}
I want to answer a question from the data set: how many tweets contain a link to a GitHub-related URL?
Below is my code:
# Extract the "urls_url" elements, and flatten() the result
urls_clean <- map(rstudioconf, "urls_url") %>%
flatten()
# Remove NA from list
compact_urls <- urls_clean %>%
map(discard,is.na) %>%
compact()
# Create a mapper that detects the pattern "github"
has_github <- as_mapper(~ str_detect(.x, "github"))
# Look for the "github" pattern, and sum the result
map_lgl(compact_urls, has_github) %>% sum()
The last line of code
map_lgl(compact_urls, has_github) %>% sum()
gives me an error:
Error: Result 10 must be a single logical, not a logical vector of length 2
I am really confused. The code map_lgl(compact_urls, has_github) should give a logical vector of TRUE and FALSE values; that vector is then piped into sum(), which adds up the TRUE values and returns a number. I never expected it to give me an error. Could anyone help? Thank you!
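map_lgl() throws the error because has_github returns more than one value for some list elements: those elements hold several URLs, and str_detect() returns one logical per URL. This behavior is documented in ?map: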
map_lgl(), map_int(), map_dbl() and map_chr() return an atomic vector of the indicated type (or die trying).
out <- map(compact_urls, has_github)
table(lengths(out))
# 1 2 3 6
#1117 22 4 1
We can unlist the output from map and get the sum
sum(unlist(out))
It can be reproduced using a simple example
map_lgl(list(FALSE, TRUE), I) #each list element of length 1
#[1] FALSE TRUE
map_lgl(list(FALSE, c(TRUE, TRUE)), I) # one element of length 2
Error: Result 2 must be a single logical, not a vector of class AsIs and of length 2
If the objective is to return a single TRUE/FALSE per list element, wrap the function in any():
has_github <- as_mapper(~ any(str_detect(.x, "github")))
Now, try with map_lgl
map_lgl(compact_urls, has_github) %>%
sum()
#[1] 347
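The two approaches above answer slightly different questions: sum(unlist(out)) counts every matching URL, while the any() version counts tweets with at least one match. A minimal sketch of the difference, assuming a made-up list toy_urls that stands in for compact_urls (one character vector of URLs per tweet):

```r
library(purrr)
library(stringr)

# Hypothetical stand-in for compact_urls: one character vector of URLs per tweet
toy_urls <- list(
  "github.com/a",
  c("github.com/b", "github.com/c"),
  "example.com",
  c("example.org", "github.com/d")
)

# Total number of GitHub URLs across all tweets (a tweet can count twice)
n_links <- sum(unlist(map(toy_urls, ~ str_detect(.x, "github"))))

# Number of tweets with at least one GitHub URL (each tweet counts once)
n_tweets <- sum(map_lgl(toy_urls, ~ any(str_detect(.x, "github"))))

n_links   # 4
n_tweets  # 3
```

Since the question asks how many tweets contain a GitHub link, the any() version is probably the one that matches it.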
Related
I'm trying to make booleans for the dataframe testdf where when grouped by id, the boolean indicates whether everything in the vector values exists in that id's values.
I believe that ultimately this is a question about the different vector/list data types in R, which I still don't understand.
The vector values comes from the second column in lookup. R says it's a vector but not a list (should I do as.list(lookup$values) instead)?
testdf <- data.frame(id = c('a','a','a','b','b'),
value = c(1,2,3,1,2))
lookup <- data.frame(col1 = c('x','y','z'),
values = c(1,2,3))
values <- lookup$values
These produce:
> testdf
id value
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
> lookup
col1 values
1 x 1
2 y 2
3 z 3
> values
[1] 1 2 3
And my intended result would look like this. values_bool is TRUE for an id where all elements in values exist in value_list. Note that I formatted the lists in a generic way.
testdf
id value_list values_bool
1 a [1,2,3] TRUE
2 b [1,2] FALSE
This page includes some more information on the creation and use of list columns in R, although it doesn't explain the difference in data types generated within each list.
For example, a cell in the list column created by nest() looks like this:
asia <tibble [59 × 1]>
and a cell in the list column created by summarize() and list() looks like this:
asia <chr [59]>
I tried to create list columns using two methods from that page in my code:
version1 <- testdf %>%
# Make column listing values for each id
group_by(id) %>%
summarize(value_list = list(value)) %>%
ungroup()
version2 <- testdf %>%
# Nest values into list
nest(value_list = value)
If you run this you can see they produce different list types.
> version1
# A tibble: 2 × 2
id value_list
<chr> <list>
1 a <dbl [3]>
2 b <dbl [2]>
> version2
# A tibble: 2 × 2
id value_list
<chr> <list>
1 a <tibble [3 × 1]>
2 b <tibble [2 × 1]>
And when I try to add this line at the end of each, the output gives all False even though one should be True.
# Mark whether all values exist in value_list for each id
mutate(values_bool = values %in% value_list)
So what's going on with the dataframe list data types when grouping value from testdf and selecting values from lookup?
Thanks so much for your help.
@akrun answered this in the comments, but I love this question, which is really a deep inquiry into R data structures and functional programming, so I'm going to answer it in a longer form here and arrive at the same answer.
First the minimal reproducible problem setup (I simplify version1 and version2 to v1 and v2):
library(tidyverse)
testdf <- data.frame(id = c('a','a','a','b','b'),
value = c(1,2,3,1,2))
lookup <- data.frame(col1 = c('x','y','z'),
values = c(1,2,3))
values <- lookup$values
v1 <- testdf %>%
# Make column listing values for each id
group_by(id) %>%
summarize(value_list = list(value)) %>%
ungroup()
v2 <- testdf %>%
# Nest values into list
nest(value_list = value)
In v1 and v2, we create a value_list column that is a nested list. The list contents differ (you can tell by the reported dimensions in the OP).
v1$value_list is a nested atomic vector
v2$value_list is a nested data.frame
This is because list() -- when passed an atomic vector -- stores that atomic vector as an atomic vector, but nest() when passed an atomic vector, stores this as a data.frame. Why? Because nest() comes from the functional programming paradigm where we use map() (type-safe next gen lapply()) functions to operate on data.frame objects and importantly, return data.frame objects.
Okay, now let's start walking through the solution. It helps to understand how R represents the data types in v1 and v2.
We begin with v1. Calling list() on an atomic vector stores that vector unchanged as a list element, so each element of value_list is a 1D atomic vector. We can pull it out and examine it:
Note that throughout the post I only include code. Copy, paste, and run it interactively to see the output.
v1 # a dataframe
v1$value_list # a column, which is itself a list (see below)
class(v1$value_list) # verify this is a list
v1$value_list[[1]] # b/c it's a list, use [[ notation to pull element 1
class(v1$value_list[[1]]) # it's an atomic vector - "numeric" = integer or double
typeof(v1$value_list[[1]]) # typeof() to show it's double
typeof(1L) # as opposed to a strict integer type
length(v1$value_list[[1]]) # atomic vectors have property length
Now we examine v2:
v2 # a dataframe
v2$value_list # a column-which is itself a list, see below
class(v2$value_list) # verify this is a list
v2$value_list[[1]] # b/c it's a list, use [[ notation to pull element 1
class(v2$value_list[[1]]) # it's a data.frame, NOT an atomic vector!
typeof(v2$value_list[[1]]) # typeof() to show a dataframe is just a special list
dim(v2$value_list[[1]]) # dataframes have property dim
length(v2$value_list[[1]]) # dataframes also have property length, which is ncol()
With that background on object types, let's move to the problem posed in the OP.
And when I try to add this line at the end of each, the output gives all False even though one should be True.
# Mark whether all values exist in value_list for each id
mutate(values_bool = values %in% value_list)
Essentially, we want to test if all values occur in v1$value_list or v2$value_list. The expected result is TRUE, and then FALSE, for grouping variables a and b respectively. First, we do this the long way, and then we build to a functional programming solution.
# exhaustive approach for list 1 with atomic vectors
values %in% v1$value_list[[1]] # all values present
all(values %in% v1$value_list[[1]]) # returns TRUE as expected
# exhaustive approach for list 2 with atomic vectors
values %in% v1$value_list[[2]] # one value missing
all(values %in% v1$value_list[[2]]) # returns FALSE as expected
# now the functional programming solution without indexing. Notes on syntax:
# - Pass a list (or a vector) as the first argument.
# - Use ~ to indicate the start of the function to apply across the list
# - Use .x as a placeholder within the function for the ith list element. Think: concise for-loop
map(v1$value_list, ~all(values %in% .x)) # returns list(TRUE, FALSE)
# Two solutions: because we want to output a vector we can unlist(), or
# use map_lgl() which only returns a boolean and fails if the result is NOT bool
unlist(map(v1$value_list, ~all(values %in% .x))) # returns c(TRUE, FALSE)
map_lgl(v1$value_list, ~all(values %in% .x)) # returns c(TRUE, FALSE)
# now the functional programming solution in a pipe (put it all together):
v1 %>%
mutate(values_bool = map_lgl(value_list, ~all(values %in% .x)))
# we can do the same with v2, but we need to convert the dataframe to a vector
# so that it works with all() -- this is a bit tedious and verbose, because we
# used nest() which is designed to work with nested dataframes, but we ultimately
# want to use all() which takes vectors, so there was no need to nest() in the first
# place. For examples of how to use the power of nest():
# https://r4ds.had.co.nz/many-models.html#creating-list-columns
v2 %>%
mutate(values_bool = map_lgl(value_list, ~all(values %in% .x$value)))
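To see why the original values %in% value_list attempt could not work, it may help to run %in% against a list directly. This is a small base R sketch; value_list below is a hand-built stand-in for v1$value_list, not the actual column:

```r
values <- c(1, 2, 3)
value_list <- list(c(1, 2, 3), c(1, 2))  # same shape as v1$value_list

# %in% compares each number against the list elements taken as whole items,
# so no single number matches an entire vector
values %in% value_list

# Testing one list element at a time gives one answer per id
per_id <- vapply(value_list, function(x) all(values %in% x), logical(1))
per_id
```

The first call yields three FALSE values (one per entry of values, not one per id), which is why the test has to be applied element by element with map_lgl() or vapply().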
Additional Resources:
essential reading on vectors
Let's say I have a very simple data set. I have 2 columns, Parent Name, Children
> d = data.frame(Parents = c("Mark", "Adam"))
> d$Children = list(c("Kid1", "Kid2"), c("Kid3", "Kid4"))
> d
Parents Children
1 Mark Kid1, Kid2
2 Adam Kid3, Kid4
What I want to be able to do is search by Kid and get the parent name (and the index of that parent's name but this part is easy I presume). So "Kid1" would return "Mark". I can't figure how to do this.
I've tried using the following
which(d$Children = "Kid3")
But it didn't work, presumably because the datatype is actually list.
Is there a way to get around this? Is using a dataframe here a bad idea? Is there an alternate data structure I should use here? I think in Python I might have tried using a dictionary, but I'm not sure how to tackle this problem in R.
For filtering an element, use lapply with %in%
as.character(d$Parents)[unlist(lapply(d$Children, `%in%`, x = 'Kid3'))]
#[1] Adam
Or with Map
as.character(d$Parents)[unlist(Map(`%in%`, "Kid3", d$Children))]
#[1] Adam
The columns in the input are factor class. So, it can be converted to character class while extracting
Or another option is stack with subset
subset(stack(setNames(d$Children, d$Parents)), values == "Kid3")$ind
Or with dplyr/purrr
library(purrr)
library(dplyr)
d %>%
filter(map_lgl(Children, `%in%`, x = "Kid3")) %>%
pull(Parents)
#[1] Adam
Or
deframe(d) %>%
keep(~ "Kid3" %in% .x) %>%
names
#[1] "Adam"
Here's a way with sapply from base R. sapply(d$Children, ...) applies the anonymous function function(x) "Kid3" %in% x to every element of d$Children. This function checks whether "Kid3" is present in each element and returns one logical value per row. This logical vector is then used to extract the corresponding parent. For more examples, see ?sapply.
d$Parents[sapply(d$Children, function(x) "Kid3" %in% x)]
[1] Adam
Levels: Adam Mark
With dplyr and tidyr -
d %>% unnest(Children) %>% filter(Children == "Kid3")
Parents Children
1 Adam Kid3
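Since the question mentions a Python dictionary, a named character vector is one base R structure that behaves similarly for this lookup. A minimal sketch, assuming the same toy data with character (not factor) columns; the name kid_to_parent is made up for illustration:

```r
d <- data.frame(Parents = c("Mark", "Adam"), stringsAsFactors = FALSE)
d$Children <- list(c("Kid1", "Kid2"), c("Kid3", "Kid4"))

# Repeat each parent once per child, then name each entry by the child
kid_to_parent <- setNames(rep(d$Parents, lengths(d$Children)),
                          unlist(d$Children))

parent <- kid_to_parent[["Kid3"]]    # look up by child name
parent_row <- match(parent, d$Parents)  # row index of that parent in d

parent      # "Adam"
parent_row  # 2
```

This builds the lookup once, so repeated searches do not have to scan the list column each time. It assumes each child name appears only once.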
I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
for(i in 1:nrow(x))
{
if(grep(".DERIVED", x[i,]) >= 1)
{
x <- x[-i,]
}
}
for(i in 1:ncol(x))
{
if(is.numeric(x[,i]) != TRUE)
{
x <- x[,-i]
}
}
return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states the "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
dat %>%
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
select_if(is.numeric)
}
You should probably give it a better name, though.
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep() doesn't find the term ".DERIVED", it returns integer(0), a vector of length zero. Comparing that with 1 returns logical(0) rather than TRUE or FALSE, and the error is telling you that if() cannot evaluate a zero-length condition.
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") >= 1) {print("no dice")} # Error: argument is of length zero
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
There's something else you haven't noticed yet: you're removing rows and columns inside the loops that iterate over them. Say you remove the 5th column; the next loop iteration (when i = 6) will then be handling what was originally the 7th column, so the original 6th column is never checked. (This will end in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected.)
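One way to sidestep both the zero-length condition and the shifting indices is to work out which rows and columns to keep up front and subset once. The sketch below is a hypothetical rewrite rather than the original function: the name raw_data and the toy data frame are made up, and fixed = TRUE is an assumption so that the dot in ".DERIVED" is matched literally.

```r
# Hypothetical loop-free version of RawData()
raw_data <- function(x) {
  # One logical vector per column, combined with OR:
  # TRUE for each row in which any cell contains the literal text ".DERIVED"
  has_derived <- Reduce(`|`, lapply(x, function(col) {
    grepl(".DERIVED", col, fixed = TRUE)
  }))
  # TRUE for each numeric column
  is_num <- vapply(x, is.numeric, logical(1))
  # Subset once; drop = FALSE keeps a data frame even if one column remains
  x[!has_derived, is_num, drop = FALSE]
}

toy <- data.frame(a = c("data", "data.DERIVED", "data"),
                  b = c(1, 2, 3),
                  c = c("A", "B", "C.DERIVED"),
                  stringsAsFactors = FALSE)
cleaned <- raw_data(toy)
cleaned  # only row 1 and column b remain
```

grepl() always returns one TRUE/FALSE per element, so there is no zero-length result to trip up a condition.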
I prefer using dplyr, but if you need to use base R functions there are ways to do this without if statements.
Note that you should use the escaped regex "\\.DERIVED" rather than ".DERIVED": an unescaped dot matches any character, so ".DERIVED" means "any character followed by DERIVED".
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: The following code assumes that the column class is numeric because the
# example code provided assumed that the column class was numeric. It will not
# detect a character column whose values happen to consist only of digits.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
For a dataframe df, I need to count the unique values in some_col. I tried the following:
length(unique(df["some_col"]))
but this is not giving the expected results. However length(unique(some_vector)) works on a vector and gives the expected results.
Some preceding steps while the df is created
df <- read.csv(file, header=T)
typeof(df) #=> "list"
typeof(unique(df["some_col"])) #=> "list"
length(unique(df["some_col"])) #=> 1
Try with [[ instead of [. [ returns a list (a data.frame in fact), [[ returns a vector.
df <- data.frame( some_col = c(1,2,3,4),
another_col = c(4,5,6,7) )
length(unique(df[["some_col"]]))
#[1] 4
class( df[["some_col"]] )
[1] "numeric"
class( df["some_col"] )
[1] "data.frame"
You're getting a value of 1 because the list is of length 1 (1 column), even though that 1 element contains several values.
You need to use
length(unique(unlist(df[c("some_col")])))
When you select a column with df[c("some_col")] or df["some_col"], it is returned as a list (a one-column data frame). unlist() converts it to a vector that is easy to work with. When you select a column with df$some_col, it is returned as a vector.
I think you might just be missing a ,
Try
length(unique(df[,"some_col"]))
In response to comment :
df <- data.frame(cbind(A=c(1:10),B=rep(c("A","B"),5)))
df["B"]
Output :
B
1 A
2 B
3 A
4 B
5 A
6 B
7 A
8 B
9 A
10 B
and
length(unique(df["B"]))
Output:
[1] 1
Which is the same incorrect/undesirable output as the OP posted
HOWEVER With a comma ,
df[,"B"]
Output :
[1] A B A B A B A B A B
Levels: A B
and
length(unique(df[,"B"]))
now gives the output the OP wanted, which in this example is 2:
[1] 2
The reason is that df["some_col"] returns a one-column data.frame, and length() on a data.frame counts its columns, which is 1 here, while df[,"some_col"] returns a vector, and length() on a vector returns the number of elements in it. So you see, a comma (,) makes all the difference.
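A short base R sketch of how length() treats the two kinds of object. The toy data frame toy_df is made up for illustration and deliberately contains a repeated value:

```r
toy_df <- data.frame(some_col = c(1, 2, 2, 4),
                     another_col = c(4, 5, 6, 7))

n_cols        <- length(toy_df)                        # columns in the data frame
n_one_col     <- length(toy_df["some_col"])            # still a data frame: 1 column
n_unique_vec  <- length(unique(toy_df[["some_col"]]))  # unique values in the vector
n_unique_rows <- nrow(unique(toy_df["some_col"]))      # unique rows of the 1-column data frame

c(n_cols, n_one_col, n_unique_vec, n_unique_rows)  # 2 1 3 3
```

So when the object is a data frame, nrow() is the call that counts distinct values; length() only works as intended on a vector.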
using tidyverse
df %>%
select("some_col") %>%
n_distinct()
The data.table package contains the convenient shorthand uniqueN. From the documentation
uniqueN is equivalent to length(unique(x)) when x is an atomic vector, and nrow(unique(x)) when x is a data.frame or data.table. The number of unique rows are computed directly without materialising the intermediate unique data.table and is therefore faster and memory efficient.
You can use it with a data frame:
df <- data.frame(some_col = c(1,2,3,4),
another_col = c(4,5,6,7) )
data.table::uniqueN(df[['some_col']])
[1] 4
or if you already have a data.table
dt <- setDT(df)
dt[,uniqueN(some_col)]
[1] 4
Here is another option:
df %>%
distinct(column_name) %>%
count()
or this without the pipe (distinct() and count() still come from dplyr):
count(distinct(df, column_name))
Benchmarks available online suggest that distinct() is fast.
I may be having trouble understanding some of the basics of dplyr, but it appears that R behaves very differently depending on whether you subset columns as one column data frames or as traditional vectors. Here is an example:
mtcarsdf<-tbl_df(mtcars)
example<-function(x,y) {
df<-tbl_df(data.frame(x,y))
df %>% group_by(x) %>% summarise(total=sum(y))
}
#subsetting to cyl this way gives integer vector
example(mtcars$gear,mtcarsdf$cyl)
# 3 112
# 4 56
# 5 30
#subsetting this way gives a one column data table
example(mtcars$gear,mtcarsdf[,"cyl"])
# 3 198
# 4 198
# 5 198
all(mtcarsdf$cyl==mtcarsdf[,"cyl"])
# TRUE
Since my inputs are technically equal the fact that I am getting different outputs tells me I am misunderstanding how the two objects behave. Could someone please enlighten me on how to improve the example function so that it can handle different objects more robustly?
Thanks
First, the items that you are comparing with == are not really the same. This could be identified using all.equal instead of ==:
all.equal(mtcarsdf$cyl, mtcarsdf[, "cyl"])
## [1] "Modes: numeric, list"
## [2] "Lengths: 32, 1"
## [3] "names for current but not for target"
## [4] "Attributes: < target is NULL, current is list >"
## [5] "target is numeric, current is tbl_df"
With that in mind, you should be able to get the behavior you want by using [[ to extract the column instead of [.
mtcarsdf <- tbl_df(mtcars)
example<-function(x,y) {
df<-tbl_df(data.frame(x,y))
df %>% group_by(x) %>% summarise(total=sum(y))
}
example(mtcars$gear, mtcarsdf[["cyl"]])
However, a safer approach might be to integrate the renaming of the columns as part of your function, like this:
example2 <- function(x, y) {
df <- tbl_df(setNames(data.frame(x, y), c("x", "y")))
df %>% group_by(x) %>% summarise(total = sum(y))
}
Then, any of the following should give you the same results.
example2(mtcars$gear, mtcarsdf$cyl)
example2(mtcars$gear, mtcarsdf[["cyl"]])
example2(mtcars$gear, mtcarsdf[, "cyl"])
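A plausible explanation for the repeated 198 in the question: when y is a one-column data frame, data.frame(x, y) keeps that column's own name (cyl) instead of naming it y. Inside summarise(), sum(y) then finds no column called y and falls back to the function argument y, which is the whole column. The sketch below uses plain data frames rather than tbl_df, and the helper col_names is hypothetical:

```r
# Hypothetical helper: report the column names that data.frame() produces
col_names <- function(x, y) names(data.frame(x, y))

from_vector <- col_names(mtcars$gear, mtcars$cyl)     # vector input
from_frame  <- col_names(mtcars$gear, mtcars["cyl"])  # one-column data frame input

from_vector  # "x" "y"
from_frame   # "x" "cyl"

# Total over the whole cyl column, matching the value repeated for every group
sum(mtcars$cyl)  # 198
```

This is why renaming the columns with setNames() inside example2 makes the function behave the same for both kinds of input.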