How to rename mlr3 task feature values within pipeline


I have an mlr3 task:
df <- data.frame(v1 = c("a", "b", "a"),
v2 = c(1, 2, 2),
data = c(3.15, 4.11, 3.56))
library(mlr3)
task <- TaskRegr$new("bmsp", df, target = "data")
How can I rename the values "a" of the feature "v1" to "c" within a pipeline?
The code:
library(mlr3)
library(mlr3pipelines)
df <- data.frame(v1 = c("a", "b", "a"),
v2 = c(1, 2, 2),
data = c(3.15, 4.11, 3.56))
task <- TaskRegr$new("bmsp", df, target = "data")
pop <- po("colapply",
applicator = function(x) ifelse(x == "a", "c", x))
pop$param_set$values$affect_columns = selector_name("v1")
pop$train(list(task))[[1]]$data()
This gives the following output (see column v1, row 2):
data v1 v2
1 3.15 c 1
2 4.11 2 2
3 3.56 c 2
But I need this output:
data v1 v2
1 3.15 c 1
2 4.11 b 2
3 3.56 c 2

This is quite straightforward to do using PipeOpColApply.
We need to define a function (the applicator) that takes a column as input and performs the requested operation on it.
library(mlr3)
library(mlr3pipelines)
pop <- po("colapply",
applicator = function(x) ifelse(x == "a", "c", x))
We also need to define on which columns the function will operate:
pop$param_set$values$affect_columns = selector_name("v1")
pop$train(list(task))[[1]]$data()
#output
data v1 v2
1: 3.15 c 1
2: 4.11 b 2
3: 3.56 c 2
This is very similar to the example in the function help.
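The "2" in the question's output is what ifelse() produces when v1 is a factor rather than a character vector: the factor's integer codes leak into the result. This was probably the case for the asker, since data.frame() converted strings to factors by default in R versions before 4.0.0. Below is a minimal base R sketch of a function that handles both types. recode_value is a hypothetical helper name, not part of mlr3pipelines.

```r
# Hypothetical helper (not part of mlr3pipelines): replace one value in a
# character or factor vector without exposing the factor's integer codes.
recode_value <- function(x, from = "a", to = "c") {
  was_factor <- is.factor(x)
  x <- as.character(x)        # work on the labels, not on the integer codes
  x[x %in% from] <- to        # %in% never yields NA, so missing values are kept
  if (was_factor) factor(x) else x
}

v1_factor <- factor(c("a", "b", "a"))
v1_char   <- c("a", "b", "a")

# The question's approach mixes a string with a factor and shows the symptom
# from the question's output (row 2 is no longer "b").
leaky <- ifelse(v1_factor == "a", "c", v1_factor)

recoded_factor <- recode_value(v1_factor)  # factor with values c, b, c
recoded_char   <- recode_value(v1_char)    # character vector "c" "b" "c"
```

Assuming the applicator simply receives each selected column as a vector, a function like this could be passed as applicator in place of the ifelse() call.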
data:
df <- data.frame(v1 = c("a", "b", "a"),
v2 = c(1, 2, 2),
data = c(3.15, 4.11, 3.56))
task <- TaskRegr$new("bmsp", df, target = "data")
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] mlr3pipelines_0.3.0-9000 mlr3_0.7.0 Biostrings_2.56.0 XVector_0.28.0 IRanges_2.22.2 S4Vectors_0.26.1 BiocGenerics_0.34.0
loaded via a namespace (and not attached):
[1] Biobase_2.48.0 httr_1.4.2 bit64_4.0.5 splines_4.0.2 foreach_1.5.0 prodlim_2019.11.13 assertthat_0.2.1 lgr_0.3.4 askpass_1.1
[10] BiocFileCache_1.12.1 blob_1.2.1 mlr3misc_0.5.0 progress_1.2.2 ipred_0.9-9 backports_1.1.10 pillar_1.4.6 RSQLite_2.2.1 lattice_0.20-41
[19] glue_1.4.2 uuid_0.1-4 pROC_1.16.2 digest_0.6.25 checkmate_2.0.0 colorspace_1.4-1 recipes_0.1.13 Matrix_1.2-18 plyr_1.8.6
[28] timeDate_3043.102 XML_3.99-0.5 pkgconfig_2.0.3 biomaRt_2.44.1 caret_6.0-86 zlibbioc_1.34.0 purrr_0.3.4 scales_1.1.1 gower_0.2.2
[37] lava_1.6.8 tibble_3.0.3 openssl_1.4.3 generics_0.0.2 ggplot2_3.3.2 ellipsis_0.3.1 withr_2.3.0 nnet_7.3-14 paradox_0.4.0-9000
[46] survival_3.1-12 magrittr_1.5 crayon_1.3.4 memoise_1.1.0 nlme_3.1-148 MASS_7.3-51.6 class_7.3-17 tools_4.0.2 data.table_1.13.0
[55] prettyunits_1.1.1 hms_0.5.3 lifecycle_0.2.0 stringr_1.4.0 munsell_0.5.0 glmnet_4.0-2 AnnotationDbi_1.50.3 compiler_4.0.2 tinytex_0.26
[64] rlang_0.4.7 grid_4.0.2 iterators_1.0.12 rstudioapi_0.11 rappdirs_0.3.1 gtable_0.3.0 ModelMetrics_1.2.2.2 codetools_0.2-16 DBI_1.1.0
[73] curl_4.3 reshape2_1.4.4 R6_2.4.1 lubridate_1.7.9 dplyr_1.0.2 bit_4.0.4 biomartr_0.9.2 shape_1.4.5 stringi_1.5.3
[82] Rcpp_1.0.5 vctrs_0.3.4 rpart_4.1-15 dbplyr_1.4.4 tidyselect_1.1.0 xfun_0.18

Related

R+ggplot2: adding log tick marks to a histogram

Please have a look at the reprex at the end of the post.
I generate some lognormally distributed values and then I bin the distribution using non-uniform bins (the grid is evenly spaced if I take its logarithm).
The point is not the maths, but the fact that, using annotation_logticks (see https://ggplot2.tidyverse.org/reference/annotation_logticks.html), I cannot add the ticks to the plot.
Does anybody understand what goes wrong?
Thanks a lot!
library(tidyverse)
library(scales)
#>
#> Attaching package: 'scales'
#> The following object is masked from 'package:purrr':
#>
#> discard
#> The following object is masked from 'package:readr':
#>
#> col_factor
## auxiliary functions
scale_x_log10nice <- function(name=NULL,omag=seq(-20,20),...) {
breaks10 <- 10^omag
scale_x_log10(name,breaks=breaks10,
labels=scales::trans_format("log10", scales::math_format(10^.x)),...)
}
log_binning <- function(x_min,x_max,n_bin){
x_max <- x_max
m <- n_bin-1
r <- (x_max/x_min)^(1/m)
my_seq <- seq(0,m,by=1)
grid <- x_min*r^my_seq
}
##################################################
set.seed(1234)
n_bins <- 10
df <- tibble(x=rlnorm(10e4, sdlog=2))
my_breaks2 <- log_binning(min(df$x),
max(df$x), n_bins)
gpl <- ggplot(df, aes(x=x )) +
theme_bw()+
geom_histogram(## binwidth=10e3,
colour="black", fill="blue"## , boundary=0
, breaks=my_breaks2
)+
scale_x_log10nice("x values")
gpl
gpl2 <- gpl+
annotation_logticks(sides="b", outside=T)
## where are the logticks?
gpl2
sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 11 (bullseye)
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] scales_1.2.1 forcats_0.5.2 stringr_1.5.0 dplyr_1.0.99.9000
#> [5] purrr_1.0.0 readr_2.1.3 tidyr_1.2.1 tibble_3.1.8
#> [9] ggplot2_3.4.0 tidyverse_1.3.2
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.0 xfun_0.36 haven_2.5.1
#> [4] gargle_1.2.1 colorspace_2.0-3 vctrs_0.5.1
#> [7] generics_0.1.3 htmltools_0.5.4 yaml_2.3.6
#> [10] utf8_1.2.2 rlang_1.0.6 pillar_1.8.1
#> [13] glue_1.6.2 withr_2.5.0 DBI_1.1.3
#> [16] dbplyr_2.2.1 readxl_1.4.1 modelr_0.1.10
#> [19] lifecycle_1.0.3 munsell_0.5.0 gtable_0.3.1
#> [22] cellranger_1.1.0 rvest_1.0.3 evaluate_0.19
#> [25] labeling_0.4.2 knitr_1.41 tzdb_0.3.0
#> [28] fastmap_1.1.0 fansi_1.0.3 highr_0.10
#> [31] broom_1.0.2 backports_1.4.1 googlesheets4_1.0.1
#> [34] jsonlite_1.8.4 farver_2.1.1 fs_1.5.2
#> [37] hms_1.1.2 digest_0.6.31 stringi_1.7.8
#> [40] grid_4.2.2 cli_3.6.0 tools_4.2.2
#> [43] magrittr_2.0.3 crayon_1.5.2 pkgconfig_2.0.3
#> [46] ellipsis_0.3.2 xml2_1.3.3 reprex_2.0.2
#> [49] googledrive_2.0.0 lubridate_1.9.0 timechange_0.1.1
#> [52] assertthat_0.2.1 rmarkdown_2.19 httr_1.4.4
#> [55] R6_2.5.1 compiler_4.2.2
Created on 2023-01-17 with reprex v2.0.2

If you want to use outside = TRUE in annotation_logticks, you also need to turn clipping off; otherwise the ticks are drawn outside the panel area, where they are clipped.
From the docs for ?annotation_logticks:
outside      logical that controls whether to move the log ticks outside of the plot area. Default is off (FALSE). You will also need to use coord_cartesian(clip = "off")
gpl +
annotation_logticks(sides="b", outside = TRUE) +
coord_cartesian(clip = "off")

app$vspace error in building phylogenetic tree in R

I am working with phylogenetic trees. I import the tree file with ggtree::read.tree and read the accompanying information with readxl::read_xlsx, and I want to visualize both in the tree. When I try to add the color and shape information from the xlsx file with ggtree::geom_tippoint (I also tried assigning it to variables first, but that did not work either), I get the error "Error in app$vspace(new_style$margin-top %||% 0) : attempt to apply non-function".
sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=Turkish_Turkey.1254 LC_CTYPE=Turkish_Turkey.1254
#> [3] LC_MONETARY=Turkish_Turkey.1254 LC_NUMERIC=C
#> [5] LC_TIME=Turkish_Turkey.1254
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] rstudioapi_0.13 knitr_1.36 magrittr_2.0.1 R.cache_0.15.0
#> [5] rlang_1.0.1 fastmap_1.1.0 fansi_0.5.0 stringr_1.4.0
#> [9] styler_1.6.2 highr_0.9 tools_4.1.1 xfun_0.26
#> [13] R.oo_1.24.0 utf8_1.2.2 cli_3.2.0 withr_2.4.3
#> [17] htmltools_0.5.2 ellipsis_0.3.2 yaml_2.2.1 digest_0.6.28
#> [21] tibble_3.1.5 lifecycle_1.0.1 crayon_1.5.0 purrr_0.3.4
#> [25] R.utils_2.11.0 vctrs_0.3.8 fs_1.5.0 glue_1.4.2
#> [29] evaluate_0.14 rmarkdown_2.11 reprex_2.0.1 stringi_1.7.5
#> [33] compiler_4.1.1 pillar_1.7.0 R.methodsS3_1.8.1 backports_1.4.1
#> [37] pkgconfig_2.0.3
The contents of the nwk file are as follows.
(((((((A:4,B:4):6,C:5):8,D:6):3,E:21):10,((F:4,G:12):14,H:8):13):13,((I:5,J:2):30,(K:11,L:11):2):17):4,M:56);
The xlsx file contents are as follows (the rb cell is empty for D, G, L and J).
label  con          host  rb   color    shape
A      Japan        Sol   Tsw  #ee4444  15
B      Japan        Sol   Sw5  #ee4444  15
C      South Korea  Sol   Tsw  #ee4444  15
D      South Korea  Cap        #A1CD42  16
E      China        Sol   Tsw  #ee4444  15
F      Italy        Cap   Tsw  #A1CD42  15
G      USA          Cap        #A1CD42  16
H      USA          Per   Sw5  #86d4ea  15
K      Italy        Sol   Sw5  #ee4444  15
L      Italy        Cap        #A1CD42  16
M      Turkey       Per   Tsw  #86d4ea  15
J      Turkey       Sol        #ee4444  16
I      Turkey       Cap   Sw5  #A1CD42  15
d1<- read.tree(file = "D:/Download/tree_newick.nwk")
d1a<-data.frame(read_xlsx(path="D:/Download/tree_newichk_info.xlsx", sheet = "Sheet1"))
d2<-ggtree(d1, layout = "circular")+xlim(-5, NA) %<+% d1a
d3<-d2+geom_text(aes(label=node), hjust=.3)+
geom_tiplab(aes(,color=d1a$con , label=label,size=10))+
geom_tippoint(aes(shape=ifelse(rb==c("Tsw","Sw5"),15, ifelse (rb!=c("Tsw","Sw5"), 16,17))), color= ifelse(d1a$host == "Cap",'#A1CD42', ifelse (d1a$host== "Sol", '#ee4444','#86d4ea')))
d3
shape_f<-ifelse(d1a$rb==c("Tsw","Sw5"),15, ifelse (d1a$rb!=c("Tsw","Sw5"), 16,17))
color_f=ifelse(d1a$host == "Cap",'#A1CD42', ifelse (d1a$host== "Sol", '#ee4444','#86d4ea'))
d4<-d2+geom_text(aes(label=node), hjust=.3)+geom_tiplab(aes(label=label))+geom_tippoint(aes(shape=shape_f,color=color_f))
d4
shape_d<-d1a$shape
color_d<-d1a$color
d5<-d2+ geom_text(aes(label=node), hjust=.3)+geom_tiplab(aes(label=label))+geom_tippoint(aes(shape=shape_d,color=color_d))
d5
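Separately from the app$vspace error, which this note does not diagnose, the shape expressions above compare rb with a two-element vector using ==. That recycles the vector and compares position by position instead of testing membership. The base R sketch below shows the difference on a stand-in for the rb column. It assumes the empty xlsx cells are read as NA, and the 15/16 mapping is inferred from the shape column of the table.

```r
# Stand-in for the `rb` column in the table's row order; NA marks empty cells
# (an assumption about how the spreadsheet is read).
rb <- c("Tsw", "Sw5", "Tsw", NA, "Tsw", "Tsw", NA, "Sw5", "Sw5", NA, "Tsw", NA, "Sw5")

# `==` recycles c("Tsw", "Sw5") along rb and compares element by element.
# It also warns here because 13 is not a multiple of 2.
by_position <- suppressWarnings(rb == c("Tsw", "Sw5"))

# `%in%` tests set membership for every element and never returns NA.
by_membership <- rb %in% c("Tsw", "Sw5")

# Shape 15 where rb is one of the two values, 16 otherwise.
shape_f <- ifelse(by_membership, 15, 16)
```

With membership testing, shape_f reproduces the shape column of the spreadsheet, whereas by_position is FALSE for some rows whose rb is a valid value and NA wherever rb is empty.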
sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=Turkish_Turkey.1254 LC_CTYPE=Turkish_Turkey.1254 LC_MONETARY=Turkish_Turkey.1254 LC_NUMERIC=C
[5] LC_TIME=Turkish_Turkey.1254
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reprex_2.0.1 shiny_1.7.1 forcats_0.5.1 stringr_1.4.0 purrr_0.3.4 readr_2.0.2
[7] tidyr_1.1.4 tibble_3.1.5 tidyverse_1.3.1 readxl_1.3.1 ggnewscale_0.4.6 ggtreeExtra_1.2.3
[13] ggtree_3.0.4 treeio_1.16.2 tidytree_0.3.8 ggplot2_3.3.5 dplyr_1.0.7 ape_5.6-1
[19] treedataverse_0.0.1 BiocManager_1.30.16
loaded via a namespace (and not attached):
[1] nlme_3.1-152 fs_1.5.0 lubridate_1.8.0 httr_1.4.2 R.cache_0.15.0 tools_4.1.1 backports_1.4.1
[8] bslib_0.3.1 utf8_1.2.2 R6_2.5.1 DBI_1.1.1 lazyeval_0.2.2 colorspace_2.0-3 withr_2.4.3
[15] processx_3.5.2 tidyselect_1.1.2 compiler_4.1.1 cli_3.2.0 rvest_1.0.2 xml2_1.3.2 labeling_0.4.2
[22] sass_0.4.0 scales_1.1.1 callr_3.7.0 digest_0.6.28 yulab.utils_0.0.4 R.utils_2.11.0 rmarkdown_2.11
[29] pkgconfig_2.0.3 htmltools_0.5.2 styler_1.6.2 highr_0.9 dbplyr_2.1.1 fastmap_1.1.0 rlang_1.0.1
[36] rstudioapi_0.13 gridGraphics_0.5-1 jquerylib_0.1.4 farver_2.1.0 generics_0.1.2 jsonlite_1.7.2 R.oo_1.24.0
[43] magrittr_2.0.1 ggplotify_0.1.0 patchwork_1.1.1 Rcpp_1.0.8 munsell_0.5.0 fansi_0.5.0 clipr_0.7.1
[50] R.methodsS3_1.8.1 lifecycle_1.0.1 stringi_1.7.5 yaml_2.2.1 grid_4.1.1 parallel_4.1.1 promises_1.2.0.1
[57] crayon_1.5.0 miniUI_0.1.1.1 lattice_0.20-44 haven_2.4.3 hms_1.1.1 ps_1.6.0 knitr_1.36
[64] pillar_1.7.0 glue_1.4.2 evaluate_0.14 ggfun_0.0.5 modelr_0.1.8 vctrs_0.3.8 tzdb_0.1.2
[71] httpuv_1.6.3 cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1 cachem_1.0.6 xfun_0.26 mime_0.12
[78] xtable_1.8-4 broom_0.7.9 later_1.3.0 aplot_0.1.3 ellipsis_0.3.2

How to make a column exist in r?

I have a very large dataset with a column of identifiers (CP). I want to first edit how the identifiers look so that they match another file, and then check whether there are CP matches between the files.
I do the editing of the CP first with:
fullGWAS <- fread('file.csv',sep=",")
colnames(fullGWAS)[1] <- "CP"
fullGWAS2<-gsub("_.*","",fullGWAS$CP)
fullGWAS2 <-data.frame(fullGWAS2)
colnames(fullGWAS2)[1] <- "CP"
fullGWAS3 <- select(fullGWAS, c(2:15))
gwasdf <- cbind(fullGWAS2, fullGWAS3)
As an example gwasdf looks like:
> head(gwasdf)
CP chr bpos a1 a2 freq BETAsbp Psbp BETAdbp Pdbp BETApp Ppp minP
1 1:2556125 1 2556125 t c 0.3255 -0.0262 0.41300 -0.0113 0.5388 -0.0157 0.4690 0.41300
2 1:2556548 1 2556548 t c 0.3261 -0.0274 0.39270 -0.0121 0.5096 -0.0160 0.4615 0.39270
3 1:2556709 1 2556709 a g 0.3257 -0.0263 0.41210 -0.0116 0.5266 -0.0155 0.4749 0.41210
4 12:11366987 12 11366987 t c 0.9443 0.0355 0.61460 0.0019 0.9631 0.0185 0.7007 0.61460
5 17:21949792 17 21949792 a c 0.4570 -0.0384 0.20690 -0.0043 0.8065 -0.0212 0.3050 0.20690
6 17:21955349 17 21955349 t g 0.5253 0.0505 0.09562 0.0103 0.5574 0.0248 0.2303 0.09562
minTRAIT BETAmean
1 SBP -0.01875
2 SBP -0.01975
3 SBP -0.01895
4 SBP 0.01870
5 SBP -0.02135
6 SBP 0.03040
I can see that CP is there, yet when I try to check this I get:
exists("gwasdf$CP")
[1] FALSE
class(gwasdf)
[1] "data.frame"
nrow(gwasdf)
[1] 7083535
Why is this false and how can I make it be true?
I am trying to ultimately check whether the CP identifiers are present in another file with follow-up code using:
CPmatches <- df2[CP %in% gwasdf$CP] #df2 is another file I just read in
mismatchextract <- subset(gwasdf, !(CP %in% df2$CP))
For extra info I use RStudio with:
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] matrixStats_0.57.0 sqldf_0.4-11 RSQLite_2.2.1 gsubfn_0.7
[5] proto_1.0.0 data.table_1.13.2 forcats_0.5.0 stringr_1.4.0
[9] dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
[13] tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] tidyselect_1.1.0 haven_2.3.1 tcltk_4.0.2 colorspace_1.4-1 vctrs_0.3.4
[6] generics_0.1.0 chron_2.3-56 blob_1.2.1 rlang_0.4.8 pillar_1.4.7
[11] glue_1.4.1 withr_2.3.0 DBI_1.1.0 bit64_4.0.5 dbplyr_2.0.0
[16] modelr_0.1.8 readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0
[21] cellranger_1.1.0 rvest_0.3.6 memoise_1.1.0 fansi_0.4.1 broom_0.7.2
[26] Rcpp_1.0.5 scales_1.1.1 backports_1.1.10 jsonlite_1.7.1 fs_1.5.0
[31] bit_4.0.4 hms_0.5.3 digest_0.6.27 stringi_1.5.3 grid_4.0.2
[36] cli_2.2.0 tools_4.0.2 magrittr_2.0.1 crayon_1.3.4 pkgconfig_2.0.3
[41] ellipsis_0.3.1 xml2_1.3.2 reprex_0.3.0 lubridate_1.7.9 assertthat_0.2.1
[46] httr_1.4.2 rstudioapi_0.13 R6_2.5.0 compiler_4.0.2

Something like this, using dplyr and the %in% operator? This assumes there are two separate datasets and that the goal is to subset one of them based on whether its elements also appear in the other.
qwasdf_1 <- data.frame(
CP1 = c("1:2556125", "1:2556548", "99:12345678")
)
qwasdf_2 <- data.frame(
CP2 = c("1:2556125", "1:2556548", "1:2556709")
)
library(dplyr)
qwasdf_1 %>%
filter(CP1 %in% qwasdf_2$CP2)
#> CP1
#> 1 1:2556125
#> 2 1:2556548
Created on 2020-11-23 by the reprex package (v0.3.0)
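On the literal question of why exists("gwasdf$CP") is FALSE: exists() looks for an object whose name is the whole string, and no object is called "gwasdf$CP". The column is there; it is just not an object in its own right. A small base R sketch follows, in which the raw identifiers with a "_t_c" suffix are made up for illustration.

```r
# Toy stand-in for the GWAS table; the "_t_c" suffixes are hypothetical.
gwasdf <- data.frame(
  CP  = c("1:2556125_t_c", "1:2556548_t_c", "12:11366987_t_c"),
  chr = c(1, 1, 12),
  stringsAsFactors = FALSE
)

# Same pattern as in the question: drop everything from the first underscore.
gwasdf$CP <- gsub("_.*", "", gwasdf$CP)

object_named_like_that <- exists("gwasdf$CP")   # FALSE: no object has this name
data_frame_exists      <- exists("gwasdf")      # TRUE: the data frame is an object
column_exists          <- "CP" %in% names(gwasdf)  # TRUE: the check for a column
```

Assigning the cleaned values back to gwasdf$CP also avoids building a second data frame and binding it back with cbind().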

How to unlist column of strings to count matches?

I want to count the matching strings between two datasets. Each dataset has one column of genes and another column listing the genes that those genes interact with.
For example:
#dataset1
Gene Interactors
ACE BRCA2, NOS2, SEPT9
HER2 AGT, TGRF
YUO SEPT9, NOS2, TET2
I have a second dataset also with genes and interacting genes similar to this. For example:
#dataset2
Gene Interactors
RTY ADFD, NOS3, SEPT9
TERT ADAM2, GERP
GHJ TET2, NOS2
For each gene in dataset1, I want to count how many of its Interactors also appear among the Interactors in dataset2.
Example output:
Gene Interactors Secondary_interaction_count
ACE BRCA2, NOS2, SEPT9 2 #SEPT9 and NOS2 are in the 2nd dataset under interacting genes
HER2 AGT, TGRF 0
YUO SEPT9, NOS2, TET2 3 #all 3 are in dataset 2
Currently I have two attempts at this. The first only gives TRUE or FALSE, and I don't know how to turn it into a count:
temp <- unlist(strsplit(df2$Interactors, ', '))
df1$secondary_count <- sapply(strsplit(df1$Interactors, ', '),
function(x) any(x %in% temp))
The second, I think, isn't splitting the string, but I'm not sure how to modify it:
df1 %>%
mutate(secondary_count = str_count(Interactors, str_c(df2$Interactors, collapse = '|')))
Is there a way to modify either of these two attempts to get a count? Or should I try another way?
Input data:
#df1:
structure(list(Gene = c("ACE", "HER2", "YUO"), Interactors = c("BRCA2, NOS2, SEPT9",
"AGT, TGRF", "SEPT9, NOS2, TET2")), row.names = c(NA, -3L), class = c("data.table",
"data.frame"))
#df2:
structure(list(Gene = c("RTY", "TERT", "GHJ"), Interactors = c("ADFD, NOS3, SEPT9",
"ADAM2, GERP", "TET2, NOS2")), row.names = c(NA, -3L), class = c("data.table",
"data.frame"))
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sqldf_0.4-11 RSQLite_2.2.1 gsubfn_0.7 proto_1.0.0
[5] forcats_0.5.0 stringr_1.4.0 purrr_0.3.4 readr_1.4.0
[9] tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0
[13] plyr_1.8.6 dplyr_1.0.2 data.table_1.13.2
loaded via a namespace (and not attached):
[1] gtools_3.8.2 tidyselect_1.1.0 haven_2.3.1 tcltk_4.0.2
[5] colorspace_1.4-1 vctrs_0.3.4 generics_0.0.2 chron_2.3-56
[9] blob_1.2.1 rlang_0.4.8 pillar_1.4.6 glue_1.4.1
[13] withr_2.3.0 DBI_1.1.0 bit64_4.0.5 dbplyr_1.4.4
[17] modelr_0.1.8 readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0
[21] gtable_0.3.0 cellranger_1.1.0 rvest_0.3.6 memoise_1.1.0
[25] fansi_0.4.1 broom_0.7.2 Rcpp_1.0.5 scales_1.1.1
[29] backports_1.1.10 jsonlite_1.7.1 fs_1.5.0 bit_4.0.4
[33] hms_0.5.3 digest_0.6.27 stringi_1.5.3 grid_4.0.2
[37] cli_2.1.0 tools_4.0.2 magrittr_1.5 crayon_1.3.4
[41] pkgconfig_2.0.3 ellipsis_0.3.1 xml2_1.3.2 reprex_0.3.0
[45] lubridate_1.7.9 assertthat_0.2.1 httr_1.4.2 rstudioapi_0.11
[49] R6_2.4.1 compiler_4.0.2

Try this
library(tidyr)
library(dplyr)
sep_rows <- . %>% separate_rows(Interactors, sep = ", ")
df1 %>%
sep_rows() %>%
mutate(
found = !is.na(match(Interactors, sep_rows(df2)$Interactors))
) %>%
group_by(Gene) %>%
summarise(
Interactors = toString(Interactors),
Secondary_interaction_count = sum(found)
)
Output
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 3
Gene Interactors Secondary_interaction_count
<chr> <chr> <int>
1 ACE BRCA2, NOS2, SEPT9 2
2 HER2 AGT, TGRF 0
3 YUO SEPT9, NOS2, TET2 3
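For comparison, the asker's first attempt can also be turned into a count in base R by swapping any() for sum(), since summing a logical vector counts its TRUE values. This is a sketch of that variation, not the method of the answer above. The data frames are rebuilt from the question's input data as plain data frames.

```r
# Input data from the question, rebuilt as plain data frames.
df1 <- data.frame(
  Gene = c("ACE", "HER2", "YUO"),
  Interactors = c("BRCA2, NOS2, SEPT9", "AGT, TGRF", "SEPT9, NOS2, TET2"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Gene = c("RTY", "TERT", "GHJ"),
  Interactors = c("ADFD, NOS3, SEPT9", "ADAM2, GERP", "TET2, NOS2"),
  stringsAsFactors = FALSE
)

# All interactors that occur anywhere in df2, one string per gene symbol.
temp <- unlist(strsplit(df2$Interactors, ", "))

# Per row of df1: split the string and count how many pieces occur in temp.
df1$Secondary_interaction_count <- sapply(
  strsplit(df1$Interactors, ", "),
  function(x) sum(x %in% temp)
)

df1$Secondary_interaction_count  # 2 0 3
```

Because the strings are split before matching, this compares whole gene symbols and avoids the partial matches a regular expression built with collapse = "|" could produce.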

Another try:
library(dplyr)
library(tidyr)
library(stringr)
df1 %>% separate_rows(Interactors) %>% rowwise() %>%
  mutate(secondary_interactions = str_extract_all(Interactors, paste0(df2 %>% separate_rows(Interactors) %>% pull(Interactors), collapse = '|'))) %>%
  unnest(secondary_interactions, keep_empty = TRUE) %>% group_by(Gene) %>%
  mutate(Interactors = toString(Interactors), secondary_interactions_cnt = case_when(is.na(secondary_interactions) ~ 0, TRUE ~ 1)) %>%
  mutate(secondary_interactions = sum(secondary_interactions_cnt)) %>% select(-4) %>% distinct()
# A tibble: 3 x 3
# Groups: Gene [3]
Gene Interactors secondary_interactions
<chr> <chr> <dbl>
1 ACE BRCA2, NOS2, SEPT9 2
2 HER2 AGT, TGRF 0
3 YUO SEPT9, NOS2, TET2 3

group_by and summarise previously (10 minutes ago) worked on my data frame but now doesn't

I have a dataframe I am manipulating that I ran group_by and summarise on a few minutes ago. After a forced restart of my computer (due to company IT) my group_by function no longer works. I have had this error sporadically for the last month or so.
Here's my code:
covid_per10k_hosp <- datasetv5_pat %>%
ungroup() %>%
mutate(death2=case_when(death=="deceased"~1, TRUE~0)) %>%
group_by(PROV_ID) %>%
summarize(n_deaths=sum(death2))
example data:
PAT_ID PROV_ID death
1 A deceased
2 A alive
3 B deceased
4 B deceased
Expected Output:
PROV_ID n_deaths
A 1
B 2
Actual Output:
PROV_ID n_deaths
A 1
A 1
B 2
B 2
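As a cross-check that does not depend on dplyr at all, the expected table can be reproduced from the example data in base R. This is only a sketch for comparison and does not diagnose why the pipeline above returns one row per patient. The toy data frame and the use of aggregate() are assumptions for illustration, not the asker's code.

```r
# Example data from the question.
datasetv5_pat <- data.frame(
  PAT_ID  = 1:4,
  PROV_ID = c("A", "A", "B", "B"),
  death   = c("deceased", "alive", "deceased", "deceased"),
  stringsAsFactors = FALSE
)

# 1 for deceased patients, 0 otherwise (same idea as the case_when() call).
datasetv5_pat$death2 <- as.integer(datasetv5_pat$death == "deceased")

# One row per provider, with the number of deaths summed within each provider.
covid_per10k_hosp <- aggregate(death2 ~ PROV_ID, data = datasetv5_pat, FUN = sum)
names(covid_per10k_hosp)[2] <- "n_deaths"

covid_per10k_hosp
```

If this gives one row per PROV_ID while the dplyr version does not, the difference lies in how group_by() and summarize() are being evaluated in that session rather than in the data.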
Edit, in response to comments asking for additional information: here is the output from sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] finalfit_1.0.2 ggsci_2.9 icd_4.0.9 ggpubr_0.4.0
[5] readr_1.3.1 vroom_1.3.1 knitr_1.29 tableone_0.12.0
[9] dplyr_1.0.2 summarytools_0.9.6 expss_0.10.6 Hmisc_4.4-1
[13] ggplot2_3.3.2 Formula_1.2-3 survival_3.2-3 lattice_0.20-38
loaded via a namespace (and not attached):
[1] tidyr_1.1.1 bit64_4.0.5 splines_3.6.0 carData_3.0-4
[5] assertthat_0.2.1 latticeExtra_0.6-29 pander_0.6.3 cellranger_1.1.0
[9] pillar_1.4.6 backports_1.1.7 glue_1.4.2 digest_0.6.25
[13] RColorBrewer_1.1-2 pryr_0.1.4 ggsignif_0.6.0 checkmate_2.0.0
[17] colorspace_1.4-1 htmltools_0.5.0 Matrix_1.2-17 survey_4.0
[21] plyr_1.8.6 pkgconfig_2.0.3 broom_0.7.0 haven_2.3.1
[25] magick_2.4.0 purrr_0.3.4 scales_1.1.1 jpeg_0.1-8.1
[29] openxlsx_4.1.5 rio_0.5.16 htmlTable_2.0.1 tibble_3.0.3
[33] generics_0.0.2 car_3.0-9 ellipsis_0.3.1 withr_2.2.0
[37] nnet_7.3-12 cli_2.0.2 magrittr_1.5 crayon_1.3.4
[41] readxl_1.3.1 mice_3.11.0 fansi_0.4.1 rstatix_0.6.0
[45] forcats_0.5.0 foreign_0.8-71 rapportools_1.0 tools_3.6.0
[49] data.table_1.13.0 hms_0.5.3 mitools_2.4 lifecycle_0.2.0
[53] matrixStats_0.56.0 stringr_1.4.0 munsell_0.5.0 cluster_2.0.8
[57] zip_2.1.0 packrat_0.5.0 compiler_3.6.0 rlang_0.4.7
[61] grid_3.6.0 rstudioapi_0.11 htmlwidgets_1.5.1 tcltk_3.6.0
[65] base64enc_0.1-3 boot_1.3-22 gtable_0.3.0 codetools_0.2-16
[69] abind_1.4-5 DBI_1.1.0 curl_4.3 R6_2.4.1
[73] gridExtra_2.3 lubridate_1.7.9 utf8_1.1.4 bit_4.0.4
[77] stringi_1.4.6 Rcpp_1.0.5 vctrs_0.3.2 rpart_4.1-15
[81] png_0.1-7 tidyselect_1.1.0 xfun_0.16
