Adding objects together in R (like ggplot layers) - r

I'm doing OOP in R and was wondering how to make the + operator work on custom objects. The most common example I've found is ggplot2, where geoms are added together.
I read through the ggplot2 source code and found this
https://github.com/hadley/ggplot2/blob/master/R/plot-construction.r
It looks like "%+%" is being used, but it's not clear how that eventually translates into the plain + operator.

You just need to define a method for the generic function +. (At the link in your question, that method is "+.gg", designed to be dispatched by arguments of class "gg".) For example:
## Example data of a couple different classes
dd <- mtcars[1, 1:4]
mm <- as.matrix(dd)
## Define method to be dispatched when one of its arguments has class data.frame
`+.data.frame` <- function(x,y) rbind(x,y)
## Any of the following three calls will dispatch the method
dd + dd
# mpg cyl disp hp
# Mazda RX4 21 6 160 110
# Mazda RX41 21 6 160 110
dd + mm
# mpg cyl disp hp
# Mazda RX4 21 6 160 110
# Mazda RX41 21 6 160 110
mm + dd
# mpg cyl disp hp
# Mazda RX4 21 6 160 110
# Mazda RX41 21 6 160 110
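To connect this back to custom classes, here is a minimal sketch of the same idea for a user-defined S3 class. The class name "recipe", the constructor new_recipe and the helper steps_of are hypothetical names chosen for illustration; they are not part of ggplot2 or any package, and the method only handles the binary form of +.

```r
# Hypothetical constructor: a "recipe" holds an ordered vector of step names.
new_recipe <- function(steps = character()) {
  structure(list(steps = steps), class = "recipe")
}

# Hypothetical helper: pull the steps out of a recipe, or coerce anything
# else (e.g. a plain string) to character.
steps_of <- function(x) {
  if (inherits(x, "recipe")) x$steps else as.character(x)
}

# S3 method for `+`: dispatched when either operand has class "recipe".
`+.recipe` <- function(e1, e2) {
  new_recipe(c(steps_of(e1), steps_of(e2)))
}

r <- new_recipe("load") + new_recipe("clean") + "plot"
r$steps
class(r)
```

Because the method returns another "recipe", calls can be chained the same way ggplot layers are.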

Related

Adding tidyselect helper functions to a vector [duplicate]

This question already has answers here:
dplyr/rlang: parse_expr with multiple expressions
I often create a "vector" of the variables I use most often while I'm coding. Usually, if I just pass the vector object to select, it works perfectly. Is there any way I can put the helper functions in a string?
For example I could do
library(dplyr)
x = c('matches("cyl")')
mtcars %>%
select_(x)
but this is not preferable because 1) select_ is deprecated and 2) it's not scalable (i.e., x = c('hp', 'matches("cyl")') will not grab both of the relevant columns).
Is there any way I could use tidyselect helper functions as part of a vector?
Note: if I do something like:
x = c(matches("cyl"))
#> Error: `matches()` must be used within a *selecting* function.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-selection-context.html>.
I get an error, so I'll definitely need to enquo it somehow.
You are trying to turn a string into code which might not be the best approach. However, you can use parse_exprs with !!!.
library(dplyr)
library(rlang)
x = c('matches("cyl")')
mtcars %>% select(!!!parse_exprs(x))
# cyl
#Mazda RX4 6
#Mazda RX4 Wag 6
#Datsun 710 4
#Hornet 4 Drive 6
#Hornet Sportabout 8
#...
x = c('matches("cyl")', 'hp')
mtcars %>% select(!!!parse_exprs(x))
# cyl hp
#Mazda RX4 6 110
#Mazda RX4 Wag 6 110
#Datsun 710 4 93
#Hornet 4 Drive 6 110
#Hornet Sportabout 8 175
#....
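If you control how the vector is created, one way to avoid parsing strings is to store the selections as unevaluated expressions in the first place. A minimal sketch, assuming rlang's exprs() is acceptable in your workflow (the variable names sel and res are just for illustration):

```r
library(dplyr)
library(rlang)

# Capture the selections without evaluating them, so matches() is never
# called outside of a selecting function.
sel <- exprs(hp, matches("cyl"))

# Splice the stored expressions into select() with !!!
res <- mtcars %>% select(!!!sel)
names(res)
```

This keeps the "reusable vector of selections" idea, but the elements are code rather than strings, so nothing needs to be parsed.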

Cannot use a variable named with numbers in R

I have some dataframes named as:
1_patient
2_patient
3_patient
Now I am not able to access their variables. For example, I am not able to obtain:
2_patient$age
If I press tab when writing the name, it automatically gets quoted, but I am still unable to use it.
Do you know how I can solve this?
Naming an object with a leading digit is not recommended, but we can wrap the name in backticks to extract a value from the object
`1_patient`$age
If there is more than one object, we can use mget to return the objects in a list and then extract the 'age' column by looping over the list with lapply
mget(ls(pattern = "^\\d+_mtcars$"))
#$`1_mtcars`
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
#Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
lapply(mget(ls(pattern = "^\\d+_patient$")), `[[`, 'age')
Using a small reproducible example
data(mtcars)
`1_mtcars` <- head(mtcars, 2)
1_mtcars$mpg
Error: unexpected input in "1_"
`1_mtcars`$mpg
#[1] 21 21
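An alternative that sidesteps the quoting issue is to keep the data frames in a named list instead of as separate objects. This is only a sketch: the two small data frames and their age values are made up for illustration and stand in for the real patient data.

```r
# Hypothetical stand-ins for the patient data frames in the question.
patients <- list(
  `1_patient` = data.frame(age = c(34, 51)),
  `2_patient` = data.frame(age = c(27, 45))
)

# List names are plain strings, so a leading digit needs no special quoting.
patients[["2_patient"]]$age

# Extract the same column from every element at once.
ages <- lapply(patients, `[[`, "age")

# make.names() shows a syntactically valid version of each name.
valid <- make.names(names(patients))
valid
```

If the objects already exist in the workspace, the mget() call above produces exactly this kind of named list.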

Compare item in one row against all other rows and loop through all rows using data.table - R

I'm combining similar names using stringdist(), and have it working using lapply, but it's taking 11 hours to run through 500k rows and I'd like to see if a data.table solution would work faster. Here's an example and my attempted solution so far, built from several related posts, but I'm not quite pulling it off:
library(stringdist)
library(data.table)
data("mtcars")
mtcars$cartype <- rownames(mtcars)
mtcars$id <- seq_len(nrow(mtcars))
I'm currently using lapply() to cycle through the strings in the cartype column and bring together those rows whose cartype names are closer than a specified value (.08).
output <- lapply(1:length(mtcars$cartype), function(x) mtcars[which(stringdist(mtcars$cartype[x], mtcars$cartype, method ="jw", p=0.08)<.08), ])
> output[1:3]
[[1]]
mpg cyl disp hp drat wt qsec vs am gear carb cartype id
Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 Mazda RX4 1
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 Mazda RX4 Wag 2
[[2]]
mpg cyl disp hp drat wt qsec vs am gear carb cartype id
Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 Mazda RX4 1
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 Mazda RX4 Wag 2
[[3]]
mpg cyl disp hp drat wt qsec vs am gear carb cartype id
Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 Datsun 710 3
Data Table Attempt:
mtcarsdt <- as.data.table(mtcars)
myfun <- function(x) mtcars[which(stringdist(mtcars$cartype[x], mtcars$cartype, method ="jw", p=0.08)<.08), ]
An intermediate step: This code pulls similar names based on the row's value that I manually plug into myfun(), but it repeats that value for all the rows.
res <- mtcarsdt[,.(vlist = list(myfun(1))),by=id]
res$vlist[[1]] #correctly combines the 2 mazda names
res$vlist[[6]] #but it's repeated down the line
I'm now trying to cycle through all the rows using set(). I'm close, but although the code appears to be correctly matching the text from the 12th column (cartype) it's returning the values from the first column, mpg:
for (i in 1:32) set(mtcarsdt,i ,12L, myfun(i))
> mtcarsdt
mpg cyl disp hp drat wt qsec vs am gear carb cartype id
1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 c(21, 21) 1
2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 c(21, 21) 2
3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 22.8 3
Now, this is pretty hacky, but I found that if I create a copy of the cartype column and place it in the first column it pretty much works, but there must be a cleaner way to do this. Also, it would be nice to keep the output in a list form like the lapply() output above as I have other post-processing steps set up for that format.
mtcars$cartypeorig <- mtcars$cartype
mtcars <- mtcars[,c(14,1:13)]
mtcarsdt <- as.data.table(mtcars)
for (i in 1:32) set(mtcarsdt,i ,13L, myfun(i))
> mtcarsdt[1:14,cartype]
[1] "c(\"Mazda RX4\", \"Mazda RX4 Wag\")"
[2] "c(\"Mazda RX4\", \"Mazda RX4 Wag\")"
[3] "Datsun 710"
[4] "Hornet 4 Drive"
[5] "Hornet Sportabout"
[6] "Valiant"
[7] "Duster 360"
[8] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\")"
[9] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\", \"Merc 280C\")"
[10] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\", \"Merc 280C\")"
[11] "c(\"Merc 230\", \"Merc 280\", \"Merc 280C\")"
[12] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"
[13] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"
[14] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"
Have you tried using the matrix version of stringdist?
res = stringdistmatrix(mtcars$cartype, mtcars$cartype, method = 'jw', p = 0.08)
out = as.data.table(which(res < 0.08, arr.ind = TRUE))[, .(list(mtcars[row,])), by = col]$V1
identical(out, output)
#[1] TRUE
Now, you probably can't just run the above for a 500k X 500k matrix, but you can split it into smaller pieces (pick size appropriate for your data/memory sizes):
size = 4 # dividing into pieces of size 4x4
# I picked a divisible number, a little more work will be needed
# if you have a residue (nrow(mtcars) = 32)
setDT(mtcars)
grid = CJ(seq_len(nrow(mtcars)/size), seq_len(nrow(mtcars)/size))
indices = grid[, {
  res = stringdistmatrix(mtcars[seq((V1-1)*size+1, (V1-1)*size + size), cartype],
                         mtcars[seq((V2-1)*size+1, (V2-1)*size + size), cartype],
                         method = 'jw', p = 0.08)
  out = as.data.table(which(res < 0.08, arr.ind = TRUE))
  if (nrow(out) > 0)
    out[, .(row = (V1-1)*size+row, col = (V2-1)*size +col)]
}, by = .(V1, V2)]
identical(indices[, .(list(mtcars[row])), by = col]$V1, lapply(output, setDT))
#[1] TRUE
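For the case where the number of rows is not a multiple of the block size, one possible way to handle the residue is to build the blocks with split(), which simply leaves the last block shorter. The function name close_pairs and its arguments are hypothetical, and this is a plain-loop sketch of the blocking idea rather than a tuned data.table solution:

```r
library(stringdist)

# Hypothetical helper: find all (row, col) index pairs whose Jaro-Winkler
# distance is below `cutoff`, comparing one block pair at a time so that
# at most a size x size distance matrix is held in memory.
close_pairs <- function(x, size, cutoff = 0.08) {
  # split() keeps a shorter final block when length(x) is not a multiple of size
  chunks <- split(seq_along(x), ceiling(seq_along(x) / size))
  out <- list()
  for (i in chunks) {
    for (j in chunks) {
      d <- stringdistmatrix(x[i], x[j], method = "jw", p = 0.08)
      hit <- which(d < cutoff, arr.ind = TRUE)
      if (nrow(hit) > 0) {
        # translate block-local positions back to positions in x
        out[[length(out) + 1]] <- data.frame(row = i[hit[, "row"]],
                                             col = j[hit[, "col"]])
      }
    }
  }
  do.call(rbind, out)
}

# 32 car names in blocks of 5 leaves a final block of 2
pairs <- close_pairs(rownames(mtcars), size = 5)
head(pairs)
```

The resulting row/col pairs play the same role as the indices table above, so the same grouping step can be applied to them.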

Key ordering vs. ordering of original columns with gather()

Does key ordering depend on whether I first list the columns to gather vs. those not to gather?
This is my data.frame:
library(tidyr)
wide_df <- data.frame(c("a", "b"), c("oh", "ah"), c("bla", "ble"), stringsAsFactors = FALSE)
colnames(wide_df) <- c("first", "second", "third")
wide_df
first second third
1 a oh bla
2 b ah ble
First I gather all columns in a specific order, and my ordering is respected in the key listing as second, first, although the columns are actually ordered as first, second:
long_01_df <- gather(wide_df, my_key, my_value, second, first, third)
long_01_df
my_key my_value
1 second oh
2 second ah
3 first a
4 first b
5 third bla
6 third ble
Then I decide to exclude one column from gathering:
long_02_df <- gather(wide_df, my_key, my_value, second, first, -third)
long_02_df
third my_key my_value
1 bla second oh
2 ble second ah
3 bla first a
4 ble first b
The keys are again ordered as second, first. Then I code it like this, believing I am doing the exact same thing:
long_03_df <- gather(wide_df, my_key, my_value, -third, second, first)
long_03_df
And I get the keys ordered according to the real column order in the original data.frame:
third my_key my_value
1 bla first a
2 ble first b
3 bla second oh
4 ble second ah
This behavior does not change even when I call the function with factor_key = TRUE. What am I missing?
Summary
The reason for this is that you cannot mix negative and positive indices. (You also should not: it simply makes no sense.) If you do that, gather() will ignore some of the indices.
Detailed answer
Standard indexing does not allow you to mix positive and negative indices either:
x <- 1:10
x[c(4, -2)]
## Error in x[c(4, -2)] : only 0's may be mixed with negative subscripts
It makes sense that this is the case: Indexing with 4 tells R to only keep the fourth element. There is no need to tell it explicitly to throw away the second element in addition.
According to the documentation of gather(), selecting columns works the same way as in dplyr's select(). So let's play with that. I'll work with a subset of mtcars:
library(dplyr)
mtcars <- mtcars[1:2, 1:5]
mtcars
## mpg cyl disp hp drat
## Mazda RX4 21.0 6 160 110 3.90
## Mazda RX4 Wag 21.0 6 160 110 3.90
You can use positive and negative indexing with select():
select(mtcars, mpg, cyl)
## mpg cyl
## Mazda RX4 21 6
## Mazda RX4 Wag 21 6
select(mtcars, -mpg, -cyl)
## disp hp drat
## Mazda RX4 160 110 3.9
## Mazda RX4 Wag 160 110 3.9
Also for select(), mixing positive and negative indices makes no sense. But instead of throwing an error, select() seems to ignore all indices that have a different sign than the first one:
select(mtcars, mpg, -hp, cyl)
## mpg cyl
## Mazda RX4 21 6
## Mazda RX4 Wag 21 6
select(mtcars, -mpg, hp, -cyl)
## disp hp drat
## Mazda RX4 160 110 3.9
## Mazda RX4 Wag 160 110 3.9
As you can see, the results are exactly the same as before.
For your examples with gather(), you use these two lines:
long_02_df <- gather(wide_df, my_key, my_value, second, first, -third)
long_03_df <- gather(wide_df, my_key, my_value, -third, second, first)
According to what I've shown above, these lines are identical to:
long_02_df <- gather(wide_df, my_key, my_value, second, first)
long_03_df <- gather(wide_df, my_key, my_value, -third)
Note that there is nothing in the second line that would indicate your preferred ordering of the keys. It only says that third should be omitted.
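One way to check this equivalence on your own installation is to compare the results directly. This is a sketch that assumes tidyr is installed and rebuilds the wide_df from the question; the names same_02 and same_03 are introduced only for this check.

```r
library(tidyr)

# Same data as in the question
wide_df <- data.frame(first = c("a", "b"),
                      second = c("oh", "ah"),
                      third = c("bla", "ble"),
                      stringsAsFactors = FALSE)

long_02_df <- gather(wide_df, my_key, my_value, second, first, -third)
long_03_df <- gather(wide_df, my_key, my_value, -third, second, first)

# Compare each mixed call with its simplified counterpart
same_02 <- identical(long_02_df, gather(wide_df, my_key, my_value, second, first))
same_03 <- identical(long_03_df, gather(wide_df, my_key, my_value, -third))
c(same_02, same_03)
```

If you want a specific key order, listing only the columns to gather, in that order, is the most direct way to express it.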

Is there a way to use the ggplot aes callout without inputting the column name, but by just inputting the column number?

EXAMPLE DATASET:
mtcars
mpg cyl disp hp drat wt ...
Mazda RX4 21.0 6 160 110 3.90 2.62 ...
Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 ...
Datsun 710 22.8 4 108 93 3.85 2.32 ...
............
Recommended ggplot way:
ggplot(mtcars,aes(x=mpg)) + geom_histogram()
The way I want to do it:
ggplot(mtcars,aes(x=[,1])) + geom_histogram()
or
ggplot(mtcars,aes(x=[[1]])) + geom_histogram()
Why can't ggplot let me call out my variable by its column number? I need to refer to it by column number, not name. Why is ggplot so strict here? Is there any workaround for this?
The problem you're facing is that aes() evaluates its arguments within the data.frame that you pass to ggplot(). A column name held as a string can't be evaluated the same way.
Fortunately, there is a solution: use the aes_string function, as follows:
library(ggplot2)
my_data <- mtcars
names(my_data)
ggplot(my_data, aes_string(x=names(my_data)[1]))+
geom_histogram()
This works because names(my_data)[1] returns a string, which is exactly what aes_string accepts.
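Note that aes_string() has been soft-deprecated since ggplot2 3.0.0. A sketch of the same idea using the .data pronoun inside aes(), assuming a reasonably recent ggplot2 (the variable col_name and the bins value are introduced here for illustration):

```r
library(ggplot2)

my_data <- mtcars

# Look the column up by position, then refer to it by name inside aes()
col_name <- names(my_data)[1]

p <- ggplot(my_data, aes(x = .data[[col_name]])) +
  geom_histogram(bins = 10) +
  xlab(col_name)  # otherwise the axis label shows the .data[[...]] expression
p
```

The column is still chosen by its number; only the final lookup inside aes() goes through the name.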
