why does group_by() affect out-of-scope data frames?

why does group_by() affect out-of-scope data frames? - r

If I pass a grouped data frame to a function, then change the name of the grouped variable, the grouping of the original data frame gets changed to the new name. When the function returns (I am not returning the altered data frame), the names of the original data frame are unchanged but the grouping is changed to the non-existent name.
# test scoping of group_by() which appears to change groups
library(dplyr)
muck_up_group<-function(mydf){
mydf<-mydf %>% rename(UhOh=Species)
}
dont_muck_up_group<-function(mydf){
mydf<-mydf %>% ungroup()
mydf<-mydf %>% rename(UhOh=Species)
}
data("iris")
iris<-as_tibble(iris) %>% group_by(Species)
iris
# A tibble: 150 x 5
# Groups: Species [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 5.1 3.5 1.4 0.2 setosa
muck_up_group(iris) # original grouping changed to column name that doesn't exist
iris
# A tibble: 150 x 5
# Groups: UhOh [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 5.1 3.5 1.4 0.2 setosa
#restore original state
iris<-as_tibble(iris) %>% group_by(Species)
dont_muck_up_group(iris) # original grouping preserved
iris
# A tibble: 150 x 5
# Groups: Species [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 5.1 3.5 1.4 0.2 setosa
I can understand why it might be bad practice to change the name of a grouping variable but it is permissible. This seems to be an example of an attribute of a variable being passed by reference when the content is being passed by value (as we understand R does normally).
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] graphics grDevices utils datasets stats methods base
other attached packages:
[1] lubridate_1.6.0 bindrcpp_0.2 mFilter_0.1-3
[4] ggrepel_0.6.5 reshape2_1.4.2 scales_0.4.1
[7] purrr_0.2.3 readr_1.1.1 tidyr_0.7.0
[10] tibble_1.3.4 tidyverse_1.1.1 knitr_1.17
[13] Rblpapi_0.3.6 stringr_1.2.0 rvest_0.3.2
[16] xml2_1.1.1 devtools_1.13.3 dplyr_0.7.2
[19] plyr_1.8.4 ggplot2_2.2.1 PerformanceAnalytics_1.4.3541
[22] xts_0.10-0 zoo_1.8-0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.12 lattice_0.20-35 assertthat_0.2.0 rprojroot_1.2 digest_0.6.12
[6] psych_1.7.5 R6_2.2.2 cellranger_1.1.0 backports_1.1.0 evaluate_0.10.1
[11] httr_1.3.1 highr_0.6 rlang_0.1.2 curl_2.8.1 lazyeval_0.2.0
[16] readxl_1.0.0 TTR_0.23-2 tidyquant_0.5.3 rmarkdown_1.6 labeling_0.3
[21] foreign_0.8-67 munsell_0.4.3 broom_0.4.2 compiler_3.4.0 modelr_0.1.1
[26] pkgconfig_2.0.1 base64enc_0.1-3 mnormt_1.5-5 htmltools_0.3.6 tidyselect_0.1.1
[31] withr_2.0.0 Quandl_2.8.0 grid_3.4.0 nlme_3.1-131 jsonlite_1.5
[36] gtable_0.2.0 magrittr_1.5 quantmod_0.4-10 stringi_1.1.5 RColorBrewer_1.1-2
[41] tools_3.4.0 forcats_0.2.0 glue_1.1.1 hms_0.3 rsconnect_0.8.5
[46] parallel_3.4.0 yaml_2.1.14 colorspace_1.3-2 memoise_1.1.0 bindr_0.1
[51] haven_1.1.0
>
Bug? Thanks.

See #aosmith’s comment above. Dplyr closed issue.

Related

R+ggplot2: adding log tick marks to a histogram

Please have a look at the reprex at the end of the post.
I generate some lognormally distributed values and then I bin the distribution using a non-uniform bin (the grid is evenly spaced if I take its logarithm).
The point is not the maths, but the fact that, using annotation_logticks
( see
https://ggplot2.tidyverse.org/reference/annotation_logticks.html
) I cannot add the ticks to the plot.
Does anybody understand what goes wrong?
Thanks a lot!
library(tidyverse)
library(scales)
#>
#> Attaching package: 'scales'
#> The following object is masked from 'package:purrr':
#>
#> discard
#> The following object is masked from 'package:readr':
#>
#> col_factor
## auxiliary functions
scale_x_log10nice <- function(name=NULL,omag=seq(-20,20),...) {
breaks10 <- 10^omag
scale_x_log10(name,breaks=breaks10,
labels=scales::trans_format("log10", scales::math_format(10^.x)),...)
}
log_binning <- function(x_min,x_max,n_bin){
x_max <- x_max
m <- n_bin-1
r <- (x_max/x_min)^(1/m)
my_seq <- seq(0,m,by=1)
grid <- x_min*r^my_seq
}
##################################################à
set.seed(1234)
n_bins <- 10
df <- tibble(x=rlnorm(10e4, sdlog=2))
my_breaks2 <- log_binning(min(df$x),
max(df$x), n_bins)
gpl <- ggplot(df, aes(x=x )) +
theme_bw()+
geom_histogram(## binwidth=10e3,
colour="black", fill="blue"## , boundary=0
, breaks=my_breaks2
)+
scale_x_log10nice("x values")
gpl
gpl2 <- gpl+
annotation_logticks(sides="b", outside=T)
## where are the logticks?
gpl2
sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 11 (bullseye)
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] scales_1.2.1 forcats_0.5.2 stringr_1.5.0 dplyr_1.0.99.9000
#> [5] purrr_1.0.0 readr_2.1.3 tidyr_1.2.1 tibble_3.1.8
#> [9] ggplot2_3.4.0 tidyverse_1.3.2
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.0 xfun_0.36 haven_2.5.1
#> [4] gargle_1.2.1 colorspace_2.0-3 vctrs_0.5.1
#> [7] generics_0.1.3 htmltools_0.5.4 yaml_2.3.6
#> [10] utf8_1.2.2 rlang_1.0.6 pillar_1.8.1
#> [13] glue_1.6.2 withr_2.5.0 DBI_1.1.3
#> [16] dbplyr_2.2.1 readxl_1.4.1 modelr_0.1.10
#> [19] lifecycle_1.0.3 munsell_0.5.0 gtable_0.3.1
#> [22] cellranger_1.1.0 rvest_1.0.3 evaluate_0.19
#> [25] labeling_0.4.2 knitr_1.41 tzdb_0.3.0
#> [28] fastmap_1.1.0 fansi_1.0.3 highr_0.10
#> [31] broom_1.0.2 backports_1.4.1 googlesheets4_1.0.1
#> [34] jsonlite_1.8.4 farver_2.1.1 fs_1.5.2
#> [37] hms_1.1.2 digest_0.6.31 stringi_1.7.8
#> [40] grid_4.2.2 cli_3.6.0 tools_4.2.2
#> [43] magrittr_2.0.3 crayon_1.5.2 pkgconfig_2.0.3
#> [46] ellipsis_0.3.2 xml2_1.3.3 reprex_2.0.2
#> [49] googledrive_2.0.0 lubridate_1.9.0 timechange_0.1.1
#> [52] assertthat_0.2.1 rmarkdown_2.19 httr_1.4.4
#> [55] R6_2.5.1 compiler_4.2.2
Created on 2023-01-17 with reprex v2.0.2

If you want to use outside = TRUE in annotation_logticks, you also need to turn clipping off.
From the docs for ?annotation_logticks
outside      logical that controls whether to move the log ticks outside of the plot area. Default is off (FALSE). You will also need to use coord_cartesian(clip = "off")
gpl +
annotation_logticks(sides="b", outside = TRUE) +
coord_cartesian(clip = "off")

Mutate across Rounding to two decimal places in R [duplicate]

I am working with data that has significant digits (i.e. digits after the "."). These digits appear when viewing my data both as a variable in base R, and also when the data is stored in a dataframe. However, they do not appear when I view the data in a tibble.
I need to view these significant digits for my work. Is there a way to make them appear when using tibbles?
Here is a reproducible example:
x has 5 significant digits, and 3 are displayed when using base R:
x = 1234.56789
x
[1] 1234.568
Within a data.frame, 3 significant digits are also displayed:
df = data.frame(x=x)
df
x
1 1234.568
Within a tibble, though, 0 significant digits are displayed:
library(tibble)
df = tibble(x=x)
df
# A tibble: 1 x 1
x
<dbl>
1 1235.
Again, I am looking for a way to make more than 0 significant digits appear whening viewing my data in a tibble.
Here is the result of my sessionInfo():
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] tibble_1.4.2 readr_1.1.1 choroplethr_3.6.2
[4] acs_2.1.3 XML_3.98-1.12 stringr_1.3.1
loaded via a namespace (and not attached):
[1] httr_1.3.1 maps_3.3.0 splines_3.5.1
[4] Formula_1.2-3 assertthat_0.2.0 sp_1.3-1
[7] latticeExtra_0.6-28 yaml_2.2.0 pillar_1.3.0
[10] backports_1.1.2 lattice_0.20-35 glue_1.3.0
[13] uuid_0.1-2 digest_0.6.15 RColorBrewer_1.1-2
[16] checkmate_1.8.5 colorspace_1.3-2 htmltools_0.3.6
[19] Matrix_1.2-14 plyr_1.8.4 pkgconfig_2.0.1
[22] WDI_2.5 purrr_0.2.5 scales_0.5.0
[25] jpeg_0.1-8 tigris_0.7 ggmap_2.6.1
[28] htmlTable_1.12 ggplot2_3.0.0 nnet_7.3-12
[31] lazyeval_0.2.1 cli_1.0.0 proto_1.0.0
[34] survival_2.42-6 RJSONIO_1.3-0 magrittr_1.5
[37] crayon_1.3.4 maptools_0.9-2 fansi_0.2.3
[40] foreign_0.8-71 class_7.3-14 tools_3.5.1
[43] data.table_1.11.4 hms_0.4.2 geosphere_1.5-7
[46] RgoogleMaps_1.4.2 munsell_0.5.0 cluster_2.0.7-1
[49] bindrcpp_0.2.2 compiler_3.5.1 e1071_1.7-0
[52] rlang_0.2.1 classInt_0.2-3 units_0.6-0
[55] grid_3.5.1 rstudioapi_0.7 rjson_0.2.20
[58] rappdirs_0.3.1 htmlwidgets_1.2 base64enc_0.1-3
[61] gtable_0.2.0 curl_3.2 DBI_1.0.0
[64] reshape2_1.4.3 R6_2.2.2 gridExtra_2.3
[67] knitr_1.20 dplyr_0.7.6 rgdal_1.3-3
[70] utf8_1.1.4 bindr_0.1.1 Hmisc_4.1-1
[73] stringi_1.2.4 Rcpp_0.12.18 mapproj_1.2.6
[76] sf_0.6-3 rpart_4.1-13 acepack_1.4.1
[79] png_0.1-7 spData_0.2.9.0 tidyselect_0.2.4

you can set the option pillar.sigfig
options(pillar.sigfig = 1)
as_tibble(iris)
# # A tibble: 150 x 5
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fct>
# 1 5. 4. 1. 0.2 setosa
# 2 5. 3 1. 0.2 setosa
# 3 5. 3. 1. 0.2 setosa
# 4 5. 3. 2. 0.2 setosa
# 5 5 4. 1. 0.2 setosa
# 6 5. 4. 2. 0.4 setosa
# 7 5. 3. 1. 0.3 setosa
# 8 5 3. 2. 0.2 setosa
# 9 4. 3. 1. 0.2 setosa
# 10 5. 3. 2. 0.1 setosa
options(pillar.sigfig = 7)
tb = tibble(x=x)
tb
# # A tibble: 1 x 1
# x
# <dbl>
# 1 1234.568
See also:
?`tibble-options`
or online:
https://www.rdocumentation.org/packages/tibble/versions/1.4.2/topics/tibble-options

How to make a column exist in r?

I have a very large dataset where I am looking to take a column of identifiers (CP) first edit how the identifiers look to match another file, and then search if there are ```CP`` matches between the files.
I do the editing of the CP first with:
fullGWAS <- fread('file.csv',sep=",")
colnames(fullGWAS)[1] <- "CP"
fullGWAS2<-gsub("_.*","",fullGWAS$CP)
fullGWAS2 <-data.frame(fullGWAS2)
colnames(fullGWAS2)[1] <- "CP"
fullGWAS3 <- select(fullGWAS, c(2:15))
gwasdf <- cbind(fullGWAS2, fullGWAS3)
As an example gwasdf looks like:
> head(gwasdf)
CP chr bpos a1 a2 freq BETAsbp Psbp BETAdbp Pdbp BETApp Ppp minP
1 1:2556125 1 2556125 t c 0.3255 -0.0262 0.41300 -0.0113 0.5388 -0.0157 0.4690 0.41300
2 1:2556548 1 2556548 t c 0.3261 -0.0274 0.39270 -0.0121 0.5096 -0.0160 0.4615 0.39270
3 1:2556709 1 2556709 a g 0.3257 -0.0263 0.41210 -0.0116 0.5266 -0.0155 0.4749 0.41210
4 12:11366987 12 11366987 t c 0.9443 0.0355 0.61460 0.0019 0.9631 0.0185 0.7007 0.61460
5 17:21949792 17 21949792 a c 0.4570 -0.0384 0.20690 -0.0043 0.8065 -0.0212 0.3050 0.20690
6 17:21955349 17 21955349 t g 0.5253 0.0505 0.09562 0.0103 0.5574 0.0248 0.2303 0.09562
minTRAIT BETAmean
1 SBP -0.01875
2 SBP -0.01975
3 SBP -0.01895
4 SBP 0.01870
5 SBP -0.02135
6 SBP 0.03040
I can see CP is here yet when I try to check this I get:
exists("gwasdf$CP")
[1] FALSE
class(gwasdf)
[1] "data.frame"
nrow(gwasdf)
[1] 7083535
Why is this false and how can I make it be true?
I am trying to ultimately check whether the CP identifiers are present in another file with follow-up code using:
CPmatches <- df2[CP %in% gwasdf$CP] #df2 is another file I just read in
mismatchextract <- subset(gwasdf, !(CP %in% df2$CP))
For extra info I use RStudio with:
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] matrixStats_0.57.0 sqldf_0.4-11 RSQLite_2.2.1 gsubfn_0.7
[5] proto_1.0.0 data.table_1.13.2 forcats_0.5.0 stringr_1.4.0
[9] dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
[13] tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] tidyselect_1.1.0 haven_2.3.1 tcltk_4.0.2 colorspace_1.4-1 vctrs_0.3.4
[6] generics_0.1.0 chron_2.3-56 blob_1.2.1 rlang_0.4.8 pillar_1.4.7
[11] glue_1.4.1 withr_2.3.0 DBI_1.1.0 bit64_4.0.5 dbplyr_2.0.0
[16] modelr_0.1.8 readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0
[21] cellranger_1.1.0 rvest_0.3.6 memoise_1.1.0 fansi_0.4.1 broom_0.7.2
[26] Rcpp_1.0.5 scales_1.1.1 backports_1.1.10 jsonlite_1.7.1 fs_1.5.0
[31] bit_4.0.4 hms_0.5.3 digest_0.6.27 stringi_1.5.3 grid_4.0.2
[36] cli_2.2.0 tools_4.0.2 magrittr_2.0.1 crayon_1.3.4 pkgconfig_2.0.3
[41] ellipsis_0.3.1 xml2_1.3.2 reprex_0.3.0 lubridate_1.7.9 assertthat_0.2.1
[46] httr_1.4.2 rstudioapi_0.13 R6_2.5.0 compiler_4.0.2

Something like this using dplyr and the %in% operator? Assuming there are two separate datasets and a goal of subsetting based on whether an element in one dataset belongs to a separate dataset.
qwasdf_1 <- data.frame(
CP1 = c("1:2556125", "1:2556548", "99:12345678")
)
qwasdf_2 <- data.frame(
CP2 = c("1:2556125", "1:2556548", "1:2556709")
)
library(dplyr)
qwasdf_1 %>%
filter(CP1 %in% qwasdf_2$CP2)
#> CP1
#> 1 1:2556125
#> 2 1:2556548
Created on 2020-11-23 by the reprex package (v0.3.0)

How to unlist column of strings to count matches?

I am looking to count any matching strings there are between 2 datasets. This is with one dataset having one column of genes and another column of genes those genes interact with.
For example:
#dataset1
Gene Interactors
ACE BRCA2, NOS2, SEPT9
HER2 AGT, TGRF
YUO SEPT9, NOS2, TET2
I have a second dataset also with genes and interacting genes similar to this. For example:
#dataset2
Gene Interactors
RTY ADFD, NOS3, SEPT9
TERT ADAM2, GERP
GHJ TET2, NOS2
I am looking to be able to count how many Interactors in dataset1 there are that have matching Interactors in dataset 2.
Example output:
Gene Interactors Secondary_interaction_count
ACE BRCA2, NOS2, SEPT9 2 #SEPT9 and NOS2 are in the 2nd dataset under interacting genes
HER2 AGT, TGRF 0
YUO SEPT9, ADAM2, TET2 3 #all 3 are in dataset 2
Currently I have 2 versions to try and get this. One which only gives true or false that I don't know how to change into counting:
temp <- unlist(strsplit(df2$interactors, ', '))
df1$secondary_count <- sapply(strsplit(df1$interactors, ', '),
function(x) any(x %in% temp))
And another which I think isn't spliting the string but I'm not sure how to modify it:
df1 %>%
mutate(secondary_count = str_count(interactors, str_c(df2$interactors, collapse = '|')))
Is there a way to modify either of these 2 coding attempts to get a count? Or should I try another way?
Input data:
#df1:
structure(list(Gene = c("ACE", "HER2", "YUO"), Interactors = c("BRCA2, NOS2, SEPT9",
"AGT, TGRF", "SEPT9, NOS2, TET2")), row.names = c(NA, -3L), class = c("data.table",
"data.frame"))
#df2:
structure(list(Gene = c("RTY", "TERT", "GHJ"), Interactors = c("ADFD, NOS3, SEPT9",
"ADAM2, GERP", "TET2, NOS2")), row.names = c(NA, -3L), class = c("data.table",
"data.frame"))
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sqldf_0.4-11 RSQLite_2.2.1 gsubfn_0.7 proto_1.0.0
[5] forcats_0.5.0 stringr_1.4.0 purrr_0.3.4 readr_1.4.0
[9] tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0
[13] plyr_1.8.6 dplyr_1.0.2 data.table_1.13.2
loaded via a namespace (and not attached):
[1] gtools_3.8.2 tidyselect_1.1.0 haven_2.3.1 tcltk_4.0.2
[5] colorspace_1.4-1 vctrs_0.3.4 generics_0.0.2 chron_2.3-56
[9] blob_1.2.1 rlang_0.4.8 pillar_1.4.6 glue_1.4.1
[13] withr_2.3.0 DBI_1.1.0 bit64_4.0.5 dbplyr_1.4.4
[17] modelr_0.1.8 readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0
[21] gtable_0.3.0 cellranger_1.1.0 rvest_0.3.6 memoise_1.1.0
[25] fansi_0.4.1 broom_0.7.2 Rcpp_1.0.5 scales_1.1.1
[29] backports_1.1.10 jsonlite_1.7.1 fs_1.5.0 bit_4.0.4
[33] hms_0.5.3 digest_0.6.27 stringi_1.5.3 grid_4.0.2
[37] cli_2.1.0 tools_4.0.2 magrittr_1.5 crayon_1.3.4
[41] pkgconfig_2.0.3 ellipsis_0.3.1 xml2_1.3.2 reprex_0.3.0
[45] lubridate_1.7.9 assertthat_0.2.1 httr_1.4.2 rstudioapi_0.11
[49] R6_2.4.1 compiler_4.0.2

Try this
library(tidyr)
library(dplyr)
sep_rows <- . %>% separate_rows(Interactors, sep = ", ")
df1 %>%
sep_rows() %>%
mutate(
found = !is.na(match(Interactors, sep_rows(df2)$Interactors))
) %>%
group_by(Gene) %>%
summarise(
Interactors = toString(Interactors),
Secondary_interaction_count = sum(found)
)
Output
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 3
Gene Interactors Secondary_interaction_count
<chr> <chr> <int>
1 ACE BRCA2, NOS2, SEPT9 2
2 HER2 AGT, TGRF 0
3 YUO SEPT9, NOS2, TET2 3

Another try:
> df1 %>% separate_rows(Interactors) %>% rowwise() %>%
+ mutate(secondary_interactions = str_extract_all(Interactors, paste0(df2 %>% separate_rows(Interactors) %>% pull(Interactors), collapse = '|'))) %>%
+ unnest(secondary_interactions, keep_empty = T) %>% group_by(Gene) %>%
+ mutate(Interactors = toString(Interactors), secondary_interactions_cnt = case_when(is.na(secondary_interactions) ~ 0, TRUE ~ 1)) %>%
+ mutate(secondary_interactions = sum(secondary_interactions_cnt)) %>% select(-4)%>% distinct()
# A tibble: 3 x 3
# Groups: Gene [3]
Gene Interactors secondary_interactions
<chr> <chr> <dbl>
1 ACE BRCA2, NOS2, SEPT9 2
2 HER2 AGT, TGRF 0
3 YUO SEPT9, NOS2, TET2 3
>

Nothing rendering when using stat_density_2d(geom = "polygon")

For some reason stat_density_2d() does not seem to be working right for me when specifying geom = "polygon" and I am absolutely stumped. Here is my code...
library(sf)
library(tidyverse)
library(RANN2)
library(hexbin)
library(mapproj)
options(stringsAsFactors = FALSE)
raleigh_police <- rgdal::readOGR("https://opendata.arcgis.com/datasets/24c0b37fa9bb4e16ba8bcaa7e806c615_0.geojson", "OGRGeoJSON")
raleigh_police_sf <- raleigh_police %>%
st_as_sf()
raleigh_police_sf %>%
filter(crime_description == "Burglary/Residential") %>%
st_coordinates() %>%
as_tibble() %>%
ggplot() +
stat_density_2d(aes(X, Y, fill = stat(level)), geom = "polygon")
Here is my sessionInfo()...
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.1
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 mapproj_1.2.6 maps_3.3.0 hexbin_1.27.2 RANN2_0.1 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.8
[9] purrr_0.2.5 readr_1.1.1 tidyr_0.8.2 tibble_1.4.2 ggplot2_3.1.0 tidyverse_1.2.1 sf_0.7-1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 lubridate_1.7.4 lattice_0.20-38 class_7.3-14 utf8_1.1.4 assertthat_0.2.0 rprojroot_1.3-2
[8] digest_0.6.18 R6_2.3.0 cellranger_1.1.0 plyr_1.8.4 backports_1.1.2 evaluate_0.12 e1071_1.7-0
[15] httr_1.3.1 blogdown_0.9 pillar_1.3.0 rlang_0.3.0.1 lazyeval_0.2.1 readxl_1.1.0 rstudioapi_0.8
[22] rmarkdown_1.10 labeling_0.3 rgdal_1.3-6 munsell_0.5.0 broom_0.5.0 compiler_3.5.1 modelr_0.1.2
[29] xfun_0.4 pkgconfig_2.0.2 htmltools_0.3.6 tidyselect_0.2.5 bookdown_0.7 codetools_0.2-15 fansi_0.4.0
[36] crayon_1.3.4 withr_2.1.2 MASS_7.3-51.1 grid_3.5.1 nlme_3.1-137 spData_0.2.9.4 jsonlite_1.5
[43] gtable_0.2.0 DBI_1.0.0 magrittr_1.5 units_0.6-1 scales_1.0.0 cli_1.0.1 stringi_1.2.4
[50] sp_1.3-1 xml2_1.2.0 tools_3.5.1 glue_1.3.0 hms_0.4.2 yaml_2.2.0 colorspace_1.3-2
[57] classInt_0.2-3 rvest_0.3.2 knitr_1.20 bindr_0.1.1 haven_1.1.2
I just get a blank plot with nothing on it. Completely stumped. What am I doing wrong here?
Update (2018-11-18)
It turns out that the primary issue here was options(stringsAsFactors = FALSE). If you comment that out and run the original code, everything actually works fine. I found this GitHub Issue which was the reason I tried that. Much more efficient code solutions are provided in the answers to this question and they also made sure not to use options(stringsAsFactors = FALSE).

Aside from downloading the file before reading and changing readOGR to read_sf, it works as-is for me save warnings for a couple NA points caused by empty geometries:
library(tidyverse)
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3
path <- "~/Downloads/raleigh.geojson"
download.file(
"https://opendata.arcgis.com/datasets/24c0b37fa9bb4e16ba8bcaa7e806c615_0.geojson",
path,
method = "curl"
)
raleigh_police <- sf::read_sf(path, "OGRGeoJSON")
raleigh_police %>%
filter(crime_description == "Burglary/Residential") %>%
st_coordinates() %>%
as_tibble() %>%
ggplot() +
stat_density_2d(aes(X, Y, fill = stat(level)), geom = "polygon")
#> Warning: Removed 5 rows containing non-finite values (stat_density2d).
The empty rows:
raleigh_police %>%
filter(crime_description == "Burglary/Residential",
st_is_empty(.))
#> Simple feature collection with 5 features and 21 fields (with 5 geometries empty)
#> geometry type: POINT
#> dimension: XY
#> bbox: xmin: NA ymin: NA xmax: NA ymax: NA
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
#> # A tibble: 5 x 22
#> OBJECTID GlobalID case_number crime_category crime_code crime_descripti…
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 205318 8057315… P14076062 BURGLARY/RESI… 30B Burglary/Reside…
#> 2 417488 70afb27… P15027702 BURGLARY/RESI… 30B Burglary/Reside…
#> 3 424718 bdf69fa… P18029113 BURGLARY/RESI… 30B Burglary/Reside…
#> 4 436550 711c05b… P18044139 BURGLARY/RESI… 30B Burglary/Reside…
#> 5 442091 9d7a008… P18051764 BURGLARY/RESI… 30B Burglary/Reside…
#> # … with 16 more variables: crime_type <chr>, reported_block_address <chr>,
#> # city_of_incident <chr>, city <chr>, district <chr>, reported_date <dttm>,
#> # reported_year <int>, reported_month <int>, reported_day <int>,
#> # reported_hour <int>, reported_dayofwk <chr>, latitude <dbl>,
#> # longitude <dbl>, agency <chr>, updated_date <dttm>, geometry <POINT [°]>
sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS 10.14.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] sf_0.7-1 forcats_0.3.0 stringr_1.3.1
#> [4] dplyr_0.7.99.9000 purrr_0.2.5 readr_1.2.0
#> [7] tidyr_0.8.2 tibble_1.4.99.9005 ggplot2_3.1.0
#> [10] tidyverse_1.2.1
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_0.2.5 haven_1.1.2 lattice_0.20-35
#> [4] colorspace_1.3-2 htmltools_0.3.6 yaml_2.2.0
#> [7] rlang_0.3.0.1 e1071_1.7-0 pillar_1.3.0.9001
#> [10] glue_1.3.0 withr_2.1.2 DBI_1.0.0
#> [13] modelr_0.1.2 readxl_1.1.0 plyr_1.8.4
#> [16] munsell_0.5.0 gtable_0.2.0 cellranger_1.1.0
#> [19] rvest_0.3.2 evaluate_0.12 labeling_0.3
#> [22] knitr_1.20 class_7.3-14 broom_0.5.0
#> [25] Rcpp_0.12.19.3 scales_1.0.0 backports_1.1.2
#> [28] classInt_0.2-3 jsonlite_1.5 hms_0.4.2.9001
#> [31] digest_0.6.18 stringi_1.2.4 grid_3.5.1
#> [34] rprojroot_1.3-2 cli_1.0.1 tools_3.5.1
#> [37] magrittr_1.5 lazyeval_0.2.1 crayon_1.3.4
#> [40] pkgconfig_2.0.2 MASS_7.3-51 xml2_1.2.0
#> [43] spData_0.2.9.4 lubridate_1.7.4 assertthat_0.2.0
#> [46] rmarkdown_1.10 httr_1.3.1 R6_2.3.0
#> [49] units_0.6-1 nlme_3.1-137 compiler_3.5.1

As noted, this is just point-level data and the CSV they provide can be a fine substitute:
library(tidyverse)
rp_csv_url <- "https://opendata.arcgis.com/datasets/24c0b37fa9bb4e16ba8bcaa7e806c615_0.csv"
httr::GET(
url = rp_csv_url,
httr::write_disk(basename(rp_csv_url)), # won't overwrite if it exists unless explicitly told to so you get caching for free
httr::progress() # I suspect this is a big file so it's nice to see a progress bar
)
raleigh_police <- read_csv(basename(rp_csv_url))
mutate(
raleigh_police,
longitude = as.numeric(longitude), # they come in wonky, still
latitude = as.numeric(latitude) # they come in wonky, still
) -> raleigh_police
raleigh_police %>%
filter(crime_description == "Burglary/Residential") %>%
ggplot() +
stat_density_2d(
aes(longitude, latitude, fill = stat(level)),
color = "#2b2b2b", size=0.125, geom = "polygon"
) +
viridis::scale_fill_viridis(direction=-1, option="magma") +
hrbrthemes::theme_ipsum_rc()
If you'd like to turn level into something more meaningful:
h <- c(MASS::bandwidth.nrd(rp_br$longitude),
MASS::bandwidth.nrd(rp_br$latitude))
dens <- MASS::kde2d(
rp_br$longitude, rp_br$latitude, h = h, n = 100
)
breaks <- pretty(range(dens$z), 10)
zdf <- data.frame(expand.grid(x = dens$x, y = dens$y), z = as.vector(dens$z))
z <- tapply(zdf$z, zdf[c("x", "y")], identity)
cl <- grDevices::contourLines(
x = sort(unique(dens$x)), y = sort(unique(dens$y)), z = dens$z,
levels = breaks
)
sp::SpatialPolygons(
lapply(1:length(cl), function(idx) {
sp::Polygons(
srl = list(sp::Polygon(
matrix(c(cl[[idx]]$x, cl[[idx]]$y), nrow=length(cl[[idx]]$x), byrow=FALSE)
)),
ID = idx
)
})
) -> cont
sp::coordinates(rp_br) <- ~longitude+latitude
then:
data_frame(
ct = sapply(sp::over(cont, sp::geometry(rp_br), returnList = TRUE), length),
id = 1:length(ct),
lvl = sapply(cl, function(x) x$level)
) %>%
count(lvl, wt=ct) %>%
mutate(
pct = n/nrow(rp_br),
pct_lab = sprintf("%s of the points fall within this level", scales::percent(pct))
)
## # A tibble: 10 x 4
## lvl n pct pct_lab
## <dbl> <int> <dbl> <chr>
## 1 10. 7302 0.927 92.7% of the points fall within this level
## 2 20. 6243 0.792 79.2% of the points fall within this level
## 3 30. 4786 0.607 60.7% of the points fall within this level
## 4 40. 3204 0.407 40.7% of the points fall within this level
## 5 50. 1945 0.247 24.7% of the points fall within this level
## 6 60. 1277 0.162 16.2% of the points fall within this level
## 7 70. 793 0.101 10.1% of the points fall within this level
## 8 80. 474 0.0601 6.0% of the points fall within this level
## 9 90. 279 0.0354 3.5% of the points fall within this level
## 10 100. 44 0.00558 0.6% of the points fall within this level

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

why does group_by() affect out-of-scope data frames? - r

See #aosmith’s comment above. Dplyr closed issue.

Related

R+ggplot2: adding log tick marks to a histogram

Mutate across Rounding to two decimal places in R [duplicate]

How to make a column exist in r?

How to unlist column of strings to count matches?

Nothing rendering when using stat_density_2d(geom = "polygon")

Categories

Resources