I am using Rstudio on a Windows machine and try to do some string matching with Chinese characters. I am not familiar with encoding settings on windows so I check some tutorials and make sure that the result of Sys.getlocale() should be the right one.
When executing str_detect() within dataframe, the matching fails. But it works on a vector level. Also, df_edu_village %>% filter(str_detect(village, "糖")) show a different result with df_edu_village %>% filter(str_detect(village, curl::curl_escape("糖") %>% curl::curl_unescape())).
Below I try to reproduce the result but reprex() isn't working, and the problem may be due to copy-paste so I knit the result into HTML with Rmd myself. Thanks for any help.
# devtools::install_github("ntupsc/pscdata")
library(pscdata)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
df_edu_village <- pscdata::edu_village_original %>% as_tibble() %>% distinct(village)
df_edu_village %>% filter(str_detect(village, "糖"))
## Error: Problem with `filter()` input `..1`.
## i Input `..1` is `str_detect(village, "糖")`.
## x Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`聶}`)
df_edu_village %>% filter(str_detect(village, curl::curl_escape("糖") %>% curl::curl_unescape()))
## # A tibble: 3 x 1
## village
## <chr>
## 1 糖<U+5ECD>里
## 2 糖友里
## 3 大糖里
df_edu_village %>% filter(str_detect(village, "里"))
## # A tibble: 0 x 1
## # ... with 1 variable: village <chr>
str_detect(df_edu_village$village[1], "里")
## [1] FALSE
grepl(df_edu_village$village[1], "里")
## [1] FALSE
str_detect("留侯里", "里")
## [1] TRUE
Sys.getlocale()
## [1] "LC_COLLATE=Chinese (Traditional)_Taiwan.950;LC_CTYPE=Chinese (Traditional)_Taiwan.950;LC_MONETARY=Chinese (Traditional)_Taiwan.950;LC_NUMERIC=C;LC_TIME=Chinese (Traditional)_Taiwan.950"
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Chinese (Traditional)_Taiwan.950
## [2] LC_CTYPE=Chinese (Traditional)_Taiwan.950
## [3] LC_MONETARY=Chinese (Traditional)_Taiwan.950
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Traditional)_Taiwan.950
## system code page: 1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
## [5] readr_2.0.1 tidyr_1.1.3 tibble_3.1.4 ggplot2_3.3.5
## [9] tidyverse_1.3.1 pscdata_0.1.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.1 xfun_0.26 haven_2.4.3 colorspace_2.0-2
## [5] vctrs_0.3.8 generics_0.1.0 htmltools_0.5.2 yaml_2.2.1
## [9] utf8_1.2.2 rlang_0.4.11 jquerylib_0.1.4 pillar_1.6.2
## [13] withr_2.4.2 glue_1.4.2 DBI_1.1.1 dbplyr_2.1.1
## [17] modelr_0.1.8 readxl_1.3.1 lifecycle_1.0.0 munsell_0.5.0
## [21] gtable_0.3.0 cellranger_1.1.0 rvest_1.0.1 evaluate_0.14
## [25] knitr_1.34 tzdb_0.1.2 fastmap_1.1.0 curl_4.3.2
## [29] fansi_0.5.0 broom_0.7.9 Rcpp_1.0.7 backports_1.2.1
## [33] scales_1.1.1 jsonlite_1.7.2 fs_1.5.0 hms_1.1.0
## [37] digest_0.6.27 stringi_1.7.4 grid_4.1.1 cli_3.0.1
## [41] tools_4.1.1 magrittr_2.0.1 crayon_1.4.1 pkgconfig_2.0.3
## [45] ellipsis_0.3.2 xml2_1.3.2 reprex_2.0.1 lubridate_1.7.10
## [49] rstudioapi_0.13 assertthat_0.2.1 rmarkdown_2.11 httr_1.4.2
## [53] R6_2.5.1 compiler_4.1.1
I have a very large dataset where I am looking to take a column of identifiers (CP) first edit how the identifiers look to match another file, and then search if there are ```CP`` matches between the files.
I do the editing of the CP first with:
fullGWAS <- fread('file.csv',sep=",")
colnames(fullGWAS)[1] <- "CP"
fullGWAS2<-gsub("_.*","",fullGWAS$CP)
fullGWAS2 <-data.frame(fullGWAS2)
colnames(fullGWAS2)[1] <- "CP"
fullGWAS3 <- select(fullGWAS, c(2:15))
gwasdf <- cbind(fullGWAS2, fullGWAS3)
As an example gwasdf looks like:
> head(gwasdf)
CP chr bpos a1 a2 freq BETAsbp Psbp BETAdbp Pdbp BETApp Ppp minP
1 1:2556125 1 2556125 t c 0.3255 -0.0262 0.41300 -0.0113 0.5388 -0.0157 0.4690 0.41300
2 1:2556548 1 2556548 t c 0.3261 -0.0274 0.39270 -0.0121 0.5096 -0.0160 0.4615 0.39270
3 1:2556709 1 2556709 a g 0.3257 -0.0263 0.41210 -0.0116 0.5266 -0.0155 0.4749 0.41210
4 12:11366987 12 11366987 t c 0.9443 0.0355 0.61460 0.0019 0.9631 0.0185 0.7007 0.61460
5 17:21949792 17 21949792 a c 0.4570 -0.0384 0.20690 -0.0043 0.8065 -0.0212 0.3050 0.20690
6 17:21955349 17 21955349 t g 0.5253 0.0505 0.09562 0.0103 0.5574 0.0248 0.2303 0.09562
minTRAIT BETAmean
1 SBP -0.01875
2 SBP -0.01975
3 SBP -0.01895
4 SBP 0.01870
5 SBP -0.02135
6 SBP 0.03040
I can see CP is here yet when I try to check this I get:
exists("gwasdf$CP")
[1] FALSE
class(gwasdf)
[1] "data.frame"
nrow(gwasdf)
[1] 7083535
Why is this false and how can I make it be true?
I am trying to ultimately check whether the CP identifiers are present in another file with follow-up code using:
CPmatches <- df2[CP %in% gwasdf$CP] #df2 is another file I just read in
mismatchextract <- subset(gwasdf, !(CP %in% df2$CP))
For extra info I use RStudio with:
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] matrixStats_0.57.0 sqldf_0.4-11 RSQLite_2.2.1 gsubfn_0.7
[5] proto_1.0.0 data.table_1.13.2 forcats_0.5.0 stringr_1.4.0
[9] dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
[13] tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] tidyselect_1.1.0 haven_2.3.1 tcltk_4.0.2 colorspace_1.4-1 vctrs_0.3.4
[6] generics_0.1.0 chron_2.3-56 blob_1.2.1 rlang_0.4.8 pillar_1.4.7
[11] glue_1.4.1 withr_2.3.0 DBI_1.1.0 bit64_4.0.5 dbplyr_2.0.0
[16] modelr_0.1.8 readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0
[21] cellranger_1.1.0 rvest_0.3.6 memoise_1.1.0 fansi_0.4.1 broom_0.7.2
[26] Rcpp_1.0.5 scales_1.1.1 backports_1.1.10 jsonlite_1.7.1 fs_1.5.0
[31] bit_4.0.4 hms_0.5.3 digest_0.6.27 stringi_1.5.3 grid_4.0.2
[36] cli_2.2.0 tools_4.0.2 magrittr_2.0.1 crayon_1.3.4 pkgconfig_2.0.3
[41] ellipsis_0.3.1 xml2_1.3.2 reprex_0.3.0 lubridate_1.7.9 assertthat_0.2.1
[46] httr_1.4.2 rstudioapi_0.13 R6_2.5.0 compiler_4.0.2
Something like this using dplyr and the %in% operator? Assuming there are two separate datasets and a goal of subsetting based on whether an element in one dataset belongs to a separate dataset.
qwasdf_1 <- data.frame(
CP1 = c("1:2556125", "1:2556548", "99:12345678")
)
qwasdf_2 <- data.frame(
CP2 = c("1:2556125", "1:2556548", "1:2556709")
)
library(dplyr)
qwasdf_1 %>%
filter(CP1 %in% qwasdf_2$CP2)
#> CP1
#> 1 1:2556125
#> 2 1:2556548
Created on 2020-11-23 by the reprex package (v0.3.0)
In my process I need to perform many dplyr::inner_joins. Thought I might define a custom pipe operator for it as explained here:
library(tidyverse)
library(rlang)
df1 <- tibble(a = 1:10, b = 11:20)
df2 <- tibble(a = 1:10, c = 21:30)
`%J>%` <- function(lhs, rhs){
inner_join(lhs, rhs)
}
df1 %J>% df2
This works as expected and I get:
Joining, by = "a"
# A tibble: 10 x 3
a b c
<int> <int> <int>
1 1 11 21
2 2 12 22
3 3 13 23
4 4 14 24
5 5 15 25
6 6 16 26
7 7 17 27
8 8 18 28
9 9 19 29
10 10 20 30
But then also a warning:
Warning message:
`chr_along()` is soft-deprecated as of rlang 0.2.0.
This warning is displayed once per session.
Plot thickens if I don't include library(rlang) at all (in a new session), in which case I get no warnings:
library(tidyverse)
df1 <- tibble(a = 1:10, b = 11:20)
df2 <- tibble(a = 1:10, c = 21:30)
`%J>%` <- function(lhs, rhs){
inner_join(lhs, rhs)
}
df1 %J>% df2
Obviously I don't have to include library(rlang) at all in this example, but if I did - this is one weird warning. Where is it coming from and how to avoid it if I did wanted to include library(rlang)?
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
[1] LC_COLLATE=English_Israel.1252 LC_CTYPE=English_Israel.1252 LC_MONETARY=English_Israel.1252 LC_NUMERIC=C LC_TIME=English_Israel.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rlang_0.3.0.1 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.6 purrr_0.2.5 readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_3.1.0 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.19 cellranger_1.1.0 pillar_1.3.0 compiler_3.5.1 plyr_1.8.4 bindr_0.1.1 tools_3.5.1 packrat_0.4.9-3 jsonlite_1.5 lubridate_1.7.4 nlme_3.1-137
[12] gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.2 cli_1.0.1 rstudioapi_0.8 haven_1.1.2 bindrcpp_0.2.2 withr_2.1.2 xml2_1.2.0 httr_1.3.1 hms_0.4.2
[23] grid_3.5.1 tidyselect_0.2.4 glue_1.3.0 R6_2.2.2 fansi_0.3.0 readxl_1.1.0 modelr_0.1.2 magrittr_1.5 backports_1.1.2 scales_1.0.0 rvest_0.3.2
[34] assertthat_0.2.0 colorspace_1.3-2 utf8_1.1.4 stringi_1.2.4 lazyeval_0.2.1 munsell_0.5.0 broom_0.5.0 crayon_1.3.4
From your description, I would say that If you load rlang as part of the tidyverse, (i.e. just load tidyverse), then R will use the verse's rlang which is automatically updated whithin the verse. If you load tidyverse first and then rlang, then R will use the last seen one, which is the one you loaded manually. Thus, If you did not update rlang manually then It will give the warning.
The problem should go away If you manually update rlang.
For some reason stat_density_2d() does not seem to be working right for me when specifying geom = "polygon" and I am absolutely stumped. Here is my code...
library(sf)
library(tidyverse)
library(RANN2)
library(hexbin)
library(mapproj)
options(stringsAsFactors = FALSE)
raleigh_police <- rgdal::readOGR("https://opendata.arcgis.com/datasets/24c0b37fa9bb4e16ba8bcaa7e806c615_0.geojson", "OGRGeoJSON")
raleigh_police_sf <- raleigh_police %>%
st_as_sf()
raleigh_police_sf %>%
filter(crime_description == "Burglary/Residential") %>%
st_coordinates() %>%
as_tibble() %>%
ggplot() +
stat_density_2d(aes(X, Y, fill = stat(level)), geom = "polygon")
Here is my sessionInfo()...
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.1
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 mapproj_1.2.6 maps_3.3.0 hexbin_1.27.2 RANN2_0.1 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.8
[9] purrr_0.2.5 readr_1.1.1 tidyr_0.8.2 tibble_1.4.2 ggplot2_3.1.0 tidyverse_1.2.1 sf_0.7-1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 lubridate_1.7.4 lattice_0.20-38 class_7.3-14 utf8_1.1.4 assertthat_0.2.0 rprojroot_1.3-2
[8] digest_0.6.18 R6_2.3.0 cellranger_1.1.0 plyr_1.8.4 backports_1.1.2 evaluate_0.12 e1071_1.7-0
[15] httr_1.3.1 blogdown_0.9 pillar_1.3.0 rlang_0.3.0.1 lazyeval_0.2.1 readxl_1.1.0 rstudioapi_0.8
[22] rmarkdown_1.10 labeling_0.3 rgdal_1.3-6 munsell_0.5.0 broom_0.5.0 compiler_3.5.1 modelr_0.1.2
[29] xfun_0.4 pkgconfig_2.0.2 htmltools_0.3.6 tidyselect_0.2.5 bookdown_0.7 codetools_0.2-15 fansi_0.4.0
[36] crayon_1.3.4 withr_2.1.2 MASS_7.3-51.1 grid_3.5.1 nlme_3.1-137 spData_0.2.9.4 jsonlite_1.5
[43] gtable_0.2.0 DBI_1.0.0 magrittr_1.5 units_0.6-1 scales_1.0.0 cli_1.0.1 stringi_1.2.4
[50] sp_1.3-1 xml2_1.2.0 tools_3.5.1 glue_1.3.0 hms_0.4.2 yaml_2.2.0 colorspace_1.3-2
[57] classInt_0.2-3 rvest_0.3.2 knitr_1.20 bindr_0.1.1 haven_1.1.2
I just get a blank plot with nothing on it. Completely stumped. What am I doing wrong here?
Update (2018-11-18)
It turns out that the primary issue here was options(stringsAsFactors = FALSE). If you comment that out and run the original code, everything actually works fine. I found this GitHub Issue which was the reason I tried that. Much more efficient code solutions are provided in the answers to this question and they also made sure not to use options(stringsAsFactors = FALSE).
Aside from downloading the file before reading and changing readOGR to read_sf, it works as-is for me save warnings for a couple NA points caused by empty geometries:
library(tidyverse)
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3
path <- "~/Downloads/raleigh.geojson"
download.file(
"https://opendata.arcgis.com/datasets/24c0b37fa9bb4e16ba8bcaa7e806c615_0.geojson",
path,
method = "curl"
)
raleigh_police <- sf::read_sf(path, "OGRGeoJSON")
raleigh_police %>%
filter(crime_description == "Burglary/Residential") %>%
st_coordinates() %>%
as_tibble() %>%
ggplot() +
stat_density_2d(aes(X, Y, fill = stat(level)), geom = "polygon")
#> Warning: Removed 5 rows containing non-finite values (stat_density2d).
The empty rows:
raleigh_police %>%
filter(crime_description == "Burglary/Residential",
st_is_empty(.))
#> Simple feature collection with 5 features and 21 fields (with 5 geometries empty)
#> geometry type: POINT
#> dimension: XY
#> bbox: xmin: NA ymin: NA xmax: NA ymax: NA
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
#> # A tibble: 5 x 22
#> OBJECTID GlobalID case_number crime_category crime_code crime_descripti…
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 205318 8057315… P14076062 BURGLARY/RESI… 30B Burglary/Reside…
#> 2 417488 70afb27… P15027702 BURGLARY/RESI… 30B Burglary/Reside…
#> 3 424718 bdf69fa… P18029113 BURGLARY/RESI… 30B Burglary/Reside…
#> 4 436550 711c05b… P18044139 BURGLARY/RESI… 30B Burglary/Reside…
#> 5 442091 9d7a008… P18051764 BURGLARY/RESI… 30B Burglary/Reside…
#> # … with 16 more variables: crime_type <chr>, reported_block_address <chr>,
#> # city_of_incident <chr>, city <chr>, district <chr>, reported_date <dttm>,
#> # reported_year <int>, reported_month <int>, reported_day <int>,
#> # reported_hour <int>, reported_dayofwk <chr>, latitude <dbl>,
#> # longitude <dbl>, agency <chr>, updated_date <dttm>, geometry <POINT [°]>
sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS 10.14.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] sf_0.7-1 forcats_0.3.0 stringr_1.3.1
#> [4] dplyr_0.7.99.9000 purrr_0.2.5 readr_1.2.0
#> [7] tidyr_0.8.2 tibble_1.4.99.9005 ggplot2_3.1.0
#> [10] tidyverse_1.2.1
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_0.2.5 haven_1.1.2 lattice_0.20-35
#> [4] colorspace_1.3-2 htmltools_0.3.6 yaml_2.2.0
#> [7] rlang_0.3.0.1 e1071_1.7-0 pillar_1.3.0.9001
#> [10] glue_1.3.0 withr_2.1.2 DBI_1.0.0
#> [13] modelr_0.1.2 readxl_1.1.0 plyr_1.8.4
#> [16] munsell_0.5.0 gtable_0.2.0 cellranger_1.1.0
#> [19] rvest_0.3.2 evaluate_0.12 labeling_0.3
#> [22] knitr_1.20 class_7.3-14 broom_0.5.0
#> [25] Rcpp_0.12.19.3 scales_1.0.0 backports_1.1.2
#> [28] classInt_0.2-3 jsonlite_1.5 hms_0.4.2.9001
#> [31] digest_0.6.18 stringi_1.2.4 grid_3.5.1
#> [34] rprojroot_1.3-2 cli_1.0.1 tools_3.5.1
#> [37] magrittr_1.5 lazyeval_0.2.1 crayon_1.3.4
#> [40] pkgconfig_2.0.2 MASS_7.3-51 xml2_1.2.0
#> [43] spData_0.2.9.4 lubridate_1.7.4 assertthat_0.2.0
#> [46] rmarkdown_1.10 httr_1.3.1 R6_2.3.0
#> [49] units_0.6-1 nlme_3.1-137 compiler_3.5.1
As noted, this is just point-level data and the CSV they provide can be a fine substitute:
library(tidyverse)
rp_csv_url <- "https://opendata.arcgis.com/datasets/24c0b37fa9bb4e16ba8bcaa7e806c615_0.csv"
httr::GET(
url = rp_csv_url,
httr::write_disk(basename(rp_csv_url)), # won't overwrite if it exists unless explicitly told to so you get caching for free
httr::progress() # I suspect this is a big file so it's nice to see a progress bar
)
raleigh_police <- read_csv(basename(rp_csv_url))
mutate(
raleigh_police,
longitude = as.numeric(longitude), # they come in wonky, still
latitude = as.numeric(latitude) # they come in wonky, still
) -> raleigh_police
raleigh_police %>%
filter(crime_description == "Burglary/Residential") %>%
ggplot() +
stat_density_2d(
aes(longitude, latitude, fill = stat(level)),
color = "#2b2b2b", size=0.125, geom = "polygon"
) +
viridis::scale_fill_viridis(direction=-1, option="magma") +
hrbrthemes::theme_ipsum_rc()
If you'd like to turn level into something more meaningful:
h <- c(MASS::bandwidth.nrd(rp_br$longitude),
MASS::bandwidth.nrd(rp_br$latitude))
dens <- MASS::kde2d(
rp_br$longitude, rp_br$latitude, h = h, n = 100
)
breaks <- pretty(range(dens$z), 10)
zdf <- data.frame(expand.grid(x = dens$x, y = dens$y), z = as.vector(dens$z))
z <- tapply(zdf$z, zdf[c("x", "y")], identity)
cl <- grDevices::contourLines(
x = sort(unique(dens$x)), y = sort(unique(dens$y)), z = dens$z,
levels = breaks
)
sp::SpatialPolygons(
lapply(1:length(cl), function(idx) {
sp::Polygons(
srl = list(sp::Polygon(
matrix(c(cl[[idx]]$x, cl[[idx]]$y), nrow=length(cl[[idx]]$x), byrow=FALSE)
)),
ID = idx
)
})
) -> cont
sp::coordinates(rp_br) <- ~longitude+latitude
then:
data_frame(
ct = sapply(sp::over(cont, sp::geometry(rp_br), returnList = TRUE), length),
id = 1:length(ct),
lvl = sapply(cl, function(x) x$level)
) %>%
count(lvl, wt=ct) %>%
mutate(
pct = n/nrow(rp_br),
pct_lab = sprintf("%s of the points fall within this level", scales::percent(pct))
)
## # A tibble: 10 x 4
## lvl n pct pct_lab
## <dbl> <int> <dbl> <chr>
## 1 10. 7302 0.927 92.7% of the points fall within this level
## 2 20. 6243 0.792 79.2% of the points fall within this level
## 3 30. 4786 0.607 60.7% of the points fall within this level
## 4 40. 3204 0.407 40.7% of the points fall within this level
## 5 50. 1945 0.247 24.7% of the points fall within this level
## 6 60. 1277 0.162 16.2% of the points fall within this level
## 7 70. 793 0.101 10.1% of the points fall within this level
## 8 80. 474 0.0601 6.0% of the points fall within this level
## 9 90. 279 0.0354 3.5% of the points fall within this level
## 10 100. 44 0.00558 0.6% of the points fall within this level
If I pass a grouped data frame to a function, then change the name of the grouped variable, the grouping of the original data frame gets changed to the new name. When the function returns (I am not returning the altered data frame), the names of the original data frame are unchanged but the grouping is changed to the non-existent name.
# test scoping of group_by() which appears to change groups
library(dplyr)
muck_up_group<-function(mydf){
mydf<-mydf %>% rename(UhOh=Species)
}
dont_muck_up_group<-function(mydf){
mydf<-mydf %>% ungroup()
mydf<-mydf %>% rename(UhOh=Species)
}
data("iris")
iris<-as_tibble(iris) %>% group_by(Species)
iris
# A tibble: 150 x 5
# Groups: Species [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 5.1 3.5 1.4 0.2 setosa
muck_up_group(iris) # original grouping changed to column name that doesn't exist
iris
# A tibble: 150 x 5
# Groups: UhOh [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 5.1 3.5 1.4 0.2 setosa
#restore original state
iris<-as_tibble(iris) %>% group_by(Species)
dont_muck_up_group(iris) # original grouping preserved
iris
# A tibble: 150 x 5
# Groups: Species [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 5.1 3.5 1.4 0.2 setosa
I can understand why it might be bad practice to change the name of a grouping variable but it is permissible. This seems to be an example of an attribute of a variable being passed by reference when the content is being passed by value (as we understand R does normally).
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] graphics grDevices utils datasets stats methods base
other attached packages:
[1] lubridate_1.6.0 bindrcpp_0.2 mFilter_0.1-3
[4] ggrepel_0.6.5 reshape2_1.4.2 scales_0.4.1
[7] purrr_0.2.3 readr_1.1.1 tidyr_0.7.0
[10] tibble_1.3.4 tidyverse_1.1.1 knitr_1.17
[13] Rblpapi_0.3.6 stringr_1.2.0 rvest_0.3.2
[16] xml2_1.1.1 devtools_1.13.3 dplyr_0.7.2
[19] plyr_1.8.4 ggplot2_2.2.1 PerformanceAnalytics_1.4.3541
[22] xts_0.10-0 zoo_1.8-0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.12 lattice_0.20-35 assertthat_0.2.0 rprojroot_1.2 digest_0.6.12
[6] psych_1.7.5 R6_2.2.2 cellranger_1.1.0 backports_1.1.0 evaluate_0.10.1
[11] httr_1.3.1 highr_0.6 rlang_0.1.2 curl_2.8.1 lazyeval_0.2.0
[16] readxl_1.0.0 TTR_0.23-2 tidyquant_0.5.3 rmarkdown_1.6 labeling_0.3
[21] foreign_0.8-67 munsell_0.4.3 broom_0.4.2 compiler_3.4.0 modelr_0.1.1
[26] pkgconfig_2.0.1 base64enc_0.1-3 mnormt_1.5-5 htmltools_0.3.6 tidyselect_0.1.1
[31] withr_2.0.0 Quandl_2.8.0 grid_3.4.0 nlme_3.1-131 jsonlite_1.5
[36] gtable_0.2.0 magrittr_1.5 quantmod_0.4-10 stringi_1.1.5 RColorBrewer_1.1-2
[41] tools_3.4.0 forcats_0.2.0 glue_1.1.1 hms_0.3 rsconnect_0.8.5
[46] parallel_3.4.0 yaml_2.1.14 colorspace_1.3-2 memoise_1.1.0 bindr_0.1
[51] haven_1.1.0
>
Bug? Thanks.
See #aosmith’s comment above. Dplyr closed issue.