R round_date to nearest quarter in lubridate v1.8.0

I have a tibble with dates in one column that I would like to round to the nearest quarter. Using lubridate::round_date(), each date seems to just round down to the nearest quarter. I want some dates to round down and some to round up, depending on which quarter start is nearer.
library(tidyverse)
library(lubridate)
my_tibble <- tibble(my_dates = seq(ymd('2022-01-01'), ymd('2022-03-31'), by = 'days'))
my_tibble <- my_tibble %>%
  mutate(qtr_date = round_date(my_dates, unit = "quarter"))
The early dates should round down, which they do:
> head(my_tibble)
# A tibble: 6 × 2
my_dates qtr_date
<date> <dttm>
1 2022-01-01 2022-01-01 00:00:00
2 2022-01-02 2022-01-01 00:00:00
3 2022-01-03 2022-01-01 00:00:00
4 2022-01-04 2022-01-01 00:00:00
5 2022-01-05 2022-01-01 00:00:00
6 2022-01-06 2022-01-01 00:00:00
But the later dates also round down:
> tail(my_tibble)
# A tibble: 6 × 2
my_dates qtr_date
<date> <dttm>
1 2022-03-26 2022-01-01 00:00:00
2 2022-03-27 2022-01-01 00:00:00
3 2022-03-28 2022-01-01 00:00:00
4 2022-03-29 2022-01-01 00:00:00
5 2022-03-30 2022-01-01 00:00:00
6 2022-03-31 2022-01-01 00:00:00
I was expecting the dates after the midpoint (2022-02-15) to round up to the start of the second quarter.
If I wanted the dates to always round up or always round down I would have used ceiling_date() or floor_date().
Is there some way to modify round_date() so that it actually rounds to the nearest quarter, up or down?
Here is the output from sessionInfo():
> sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.8.0 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4
[6] readr_2.1.2 tidyr_1.2.0 tibble_3.1.6 ggplot2_3.3.6 tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] cellranger_1.1.0 pillar_1.7.0 compiler_4.2.0 dbplyr_2.1.1 tools_4.2.0
[6] jsonlite_1.8.0 lifecycle_1.0.1 gtable_0.3.0 pkgconfig_2.0.3 rlang_1.0.3
[11] reprex_2.0.1 DBI_1.1.2 cli_3.3.0 rstudioapi_0.13 haven_2.5.0
[16] xml2_1.3.3 withr_2.5.0 httr_1.4.3 fs_1.5.2 generics_0.1.2
[21] vctrs_0.4.1 hms_1.1.1 grid_4.2.0 tidyselect_1.1.2 glue_1.6.2
[26] R6_2.5.1 fansi_1.0.3 readxl_1.4.0 tzdb_0.3.0 modelr_0.1.8
[31] magrittr_2.0.3 backports_1.4.1 scales_1.2.0 ellipsis_0.3.2 rvest_1.0.2
[36] assertthat_0.2.1 colorspace_2.0-3 utf8_1.2.2 stringi_1.7.6 munsell_0.5.0
[41] broom_0.8.0 crayon_1.5.1

I could replicate this issue in lubridate v1.8.0. If you look at the source for round_date() you will see that this function has been completely refactored. The work is now done by the line:
timechange::time_round(x, unit = unit, week_start = as_week_start(week_start))
round_date() previously called floor_date() and ceiling_date() and picked whichever was nearer. We can see that this change was made in the commit from November 4th, 2022 (line 174).
This does not entirely explain why your code did not work, but knowing round_date() is now calculated differently, I updated lubridate to the latest CRAN version (1.9.0) with:
install.packages("lubridate")
It is also possible to install a specific version of a package as described here.
Updating to 1.9.0 also installed timechange as a dependency (v.0.1.1), which fixed the problem.
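If you cannot upgrade right away, a rough workaround is to replicate the "pick whichever boundary is nearer" behaviour described above with floor_date() and ceiling_date(). This is only a sketch; tie-breaking at the exact midpoint may differ from lubridate's own implementation.
library(dplyr)
library(lubridate)
my_tibble %>%
  mutate(
    lower    = floor_date(my_dates, unit = "quarter"),
    upper    = ceiling_date(my_dates, unit = "quarter"),   ## the next quarter boundary
    qtr_date = if_else(my_dates - lower < upper - my_dates, lower, upper)
  ) %>%
  select(-lower, -upper)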


R+ggplot2: adding log tick marks to a histogram

Please have a look at the reprex at the end of the post.
I generate some lognormally distributed values and then bin the distribution with a non-uniform grid (the bins are evenly spaced on a log scale).
The point is not the maths, but the fact that, using annotation_logticks (see https://ggplot2.tidyverse.org/reference/annotation_logticks.html), I cannot add the ticks to the plot.
Does anybody understand what goes wrong?
Thanks a lot!
library(tidyverse)
library(scales)
#>
#> Attaching package: 'scales'
#> The following object is masked from 'package:purrr':
#>
#> discard
#> The following object is masked from 'package:readr':
#>
#> col_factor
## auxiliary functions
scale_x_log10nice <- function(name = NULL, omag = seq(-20, 20), ...) {
  breaks10 <- 10^omag
  scale_x_log10(name, breaks = breaks10,
                labels = scales::trans_format("log10", scales::math_format(10^.x)), ...)
}
log_binning <- function(x_min, x_max, n_bin) {
  m <- n_bin - 1
  r <- (x_max / x_min)^(1 / m)
  my_seq <- seq(0, m, by = 1)
  grid <- x_min * r^my_seq
  grid
}
##################################################
set.seed(1234)
n_bins <- 10
df <- tibble(x=rlnorm(10e4, sdlog=2))
my_breaks2 <- log_binning(min(df$x), max(df$x), n_bins)
gpl <- ggplot(df, aes(x = x)) +
  theme_bw() +
  geom_histogram(## binwidth=10e3, boundary=0
                 colour = "black", fill = "blue",
                 breaks = my_breaks2) +
  scale_x_log10nice("x values")
gpl
gpl2 <- gpl +
  annotation_logticks(sides = "b", outside = TRUE)
## where are the logticks?
gpl2
sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 11 (bullseye)
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] scales_1.2.1 forcats_0.5.2 stringr_1.5.0 dplyr_1.0.99.9000
#> [5] purrr_1.0.0 readr_2.1.3 tidyr_1.2.1 tibble_3.1.8
#> [9] ggplot2_3.4.0 tidyverse_1.3.2
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.0 xfun_0.36 haven_2.5.1
#> [4] gargle_1.2.1 colorspace_2.0-3 vctrs_0.5.1
#> [7] generics_0.1.3 htmltools_0.5.4 yaml_2.3.6
#> [10] utf8_1.2.2 rlang_1.0.6 pillar_1.8.1
#> [13] glue_1.6.2 withr_2.5.0 DBI_1.1.3
#> [16] dbplyr_2.2.1 readxl_1.4.1 modelr_0.1.10
#> [19] lifecycle_1.0.3 munsell_0.5.0 gtable_0.3.1
#> [22] cellranger_1.1.0 rvest_1.0.3 evaluate_0.19
#> [25] labeling_0.4.2 knitr_1.41 tzdb_0.3.0
#> [28] fastmap_1.1.0 fansi_1.0.3 highr_0.10
#> [31] broom_1.0.2 backports_1.4.1 googlesheets4_1.0.1
#> [34] jsonlite_1.8.4 farver_2.1.1 fs_1.5.2
#> [37] hms_1.1.2 digest_0.6.31 stringi_1.7.8
#> [40] grid_4.2.2 cli_3.6.0 tools_4.2.2
#> [43] magrittr_2.0.3 crayon_1.5.2 pkgconfig_2.0.3
#> [46] ellipsis_0.3.2 xml2_1.3.3 reprex_2.0.2
#> [49] googledrive_2.0.0 lubridate_1.9.0 timechange_0.1.1
#> [52] assertthat_0.2.1 rmarkdown_2.19 httr_1.4.4
#> [55] R6_2.5.1 compiler_4.2.2
Created on 2023-01-17 with reprex v2.0.2
If you want to use outside = TRUE in annotation_logticks, you also need to turn clipping off.
From the docs for ?annotation_logticks:
outside: logical that controls whether to move the log ticks outside of the plot area. Default is off (FALSE). You will also need to use coord_cartesian(clip = "off").
gpl +
  annotation_logticks(sides = "b", outside = TRUE) +
  coord_cartesian(clip = "off")
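Depending on the theme, the outside ticks can collide with the axis labels; adding a little extra margin is a purely cosmetic tweak you may or may not need (just a suggestion, not something the docs require):
gpl +
  annotation_logticks(sides = "b", outside = TRUE) +
  coord_cartesian(clip = "off") +
  theme(axis.text.x = element_text(margin = margin(t = 8)))   ## push the labels down a bit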

R + Arrow 10: convert blank to numeric NA

Please have a look at the reprex at the end of the post.
I need to read a column as a string, perform several manipulations, and then convert it to a numerical column.
The blanks ("") in the string column give me a headache because arrow does not convert them to numerical missing values (NA).
Does anybody know how to achieve that?
Many thanks
library(tidyverse)
library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
df <- tibble(x = rep(c("4000 -", "6000 -", "", "8000 - "), 10),
             y = seq(1, 10, length = 40))
write_csv(df, "test_string.csv")
data <- open_dataset("test_string.csv",
                     format = "csv",
                     skip = 1,
                     schema = schema(x = string(), y = double()))
data2 <- data |>
  mutate(x = sub(" -.*", "", x)) |>
  mutate(x2 = as.numeric(x)) |>
  collect() ## how to convert the blank to a numeric NA ?
#> Error in `collect()`:
#> ! Invalid: Failed to parse string: '' as a scalar of type double
#> Backtrace:
#> ▆
#> 1. ├─dplyr::collect(mutate(mutate(data, x = sub(" -.*", "", x)), x2 = as.numeric(x)))
#> 2. └─arrow:::collect.arrow_dplyr_query(mutate(mutate(data, x = sub(" -.*", "", x)), x2 = as.numeric(x)))
#> 3. └─base::tryCatch(...)
#> 4. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 5. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 6. └─value[[3L]](cond)
#> 7. └─arrow:::augment_io_error_msg(e, call, schema = x$.data$schema)
#> 8. └─rlang::abort(msg, call = call)
sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 11 (bullseye)
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] arrow_10.0.0 forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10
#> [5] purrr_0.3.5 readr_2.1.3 tidyr_1.2.1 tibble_3.1.8
#> [9] ggplot2_3.4.0 tidyverse_1.3.2
#>
#> loaded via a namespace (and not attached):
#> [1] lubridate_1.9.0 assertthat_0.2.1 digest_0.6.30
#> [4] utf8_1.2.2 R6_2.5.1 cellranger_1.1.0
#> [7] backports_1.4.1 reprex_2.0.2 evaluate_0.17
#> [10] httr_1.4.4 highr_0.9 pillar_1.8.1
#> [13] rlang_1.0.6 googlesheets4_1.0.1 readxl_1.4.1
#> [16] R.utils_2.12.1 R.oo_1.25.0 rmarkdown_2.17
#> [19] styler_1.8.0 googledrive_2.0.0 bit_4.0.4
#> [22] munsell_0.5.0 broom_1.0.1 compiler_4.2.2
#> [25] modelr_0.1.9 xfun_0.34 pkgconfig_2.0.3
#> [28] htmltools_0.5.3 tidyselect_1.2.0 fansi_1.0.3
#> [31] crayon_1.5.2 tzdb_0.3.0 dbplyr_2.2.1
#> [34] withr_2.5.0 R.methodsS3_1.8.2 grid_4.2.2
#> [37] jsonlite_1.8.3 gtable_0.3.1 lifecycle_1.0.3
#> [40] DBI_1.1.3 magrittr_2.0.3 scales_1.2.1
#> [43] vroom_1.6.0 cli_3.4.1 stringi_1.7.8
#> [46] fs_1.5.2 xml2_1.3.3 ellipsis_0.3.2
#> [49] generics_0.1.3 vctrs_0.5.0 tools_4.2.2
#> [52] bit64_4.0.5 R.cache_0.16.0 glue_1.6.2
#> [55] hms_1.1.2 parallel_4.2.2 fastmap_1.1.0
#> [58] yaml_2.3.6 timechange_0.1.1 colorspace_2.0-3
#> [61] gargle_1.2.1 rvest_1.0.3 knitr_1.40
#> [64] haven_2.5.1
Created on 2022-11-07 with reprex v2.0.2
ifelse() works here as long as the replacement value has the right class (NA_character_ rather than a bare NA); if_else() enforces that check, so either one works.
data |>
  mutate(x = sub(" -.*", "", x)) |>
  mutate(
    x = ifelse(x == "", NA_character_, x), # also if_else works
    x2 = as.numeric(x)
  ) |>
  collect()
# # A tibble: 40 x 3
# x y x2
# <chr> <dbl> <dbl>
# 1 4000 1 4000
# 2 6000 1.23 6000
# 3 NA 1.46 NA
# 4 8000 1.69 8000
# 5 4000 1.92 4000
# 6 6000 2.15 6000
# 7 NA 2.38 NA
# 8 8000 2.62 8000
# 9 4000 2.85 4000
# 10 6000 3.08 6000
# # ... with 30 more rows
Try using read_csv() instead of open_dataset():
library(readr)
data <- read_csv("test_string.csv")
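If the file fits in memory (an assumption; open_dataset() is usually chosen for larger-than-memory data), the rest of the pipeline then needs no special handling of the blanks, because read_csv() treats "" as NA by default. A sketch:
library(dplyr)
read_csv("test_string.csv",
         col_types = cols(x = col_character(), y = col_double())) |>
  mutate(x  = sub(" -.*", "", x),   ## the blanks were already read in as NA
         x2 = as.numeric(x))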

Could not find function "fpkmToTpm_matrix" [duplicate]

This question was closed as a duplicate of Error: could not find function ... in R.
I installed an R package named "GeoTcgaData", loaded it with library(), and wanted to use the function fpkmToTpm_matrix() to convert my data, but this message came out:
Error in fpkmToTpm_matrix(J_Rseq) :
could not find function "fpkmToTpm_matrix"
Wondering where the problem is.
(The version of the package is 1.1.0)
You may need to install some of the dependencies separately via Bioconductor. If you install the packages using:
install.packages("BiocManager")
BiocManager::install(pkgs = c('DESeq2', 'impute', 'edgeR', 'cqn', 'topconfects', 'ChAMP',
                              'clusterProfiler', 'org.Hs.eg.db', 'minfi',
                              'IlluminaHumanMethylation450kanno.ilmn12.hg19',
                              'dearseq', 'NOISeq'))
You can then (hopefully) install and run GeoTcgaData as expected:
install.packages("GeoTcgaData", type = "source")
library(GeoTcgaData)
#> =============================================================
#> Hello, friend! welcome to use GeoTcgaData!
#> -------------------------------------------------------------
#> Version:1.1.0
#> =============================================================
lung_squ_count2 <- matrix(c(0.11,0.22,0.43,0.14,0.875,0.66,0.77,0.18,0.29),ncol=3)
rownames(lung_squ_count2) <- c("DISC1","TCOF1","SPPL3")
colnames(lung_squ_count2) <- c("sample1","sample2","sample3")
lung_squ_count2
#> sample1 sample2 sample3
#> DISC1 0.11 0.140 0.77
#> TCOF1 0.22 0.875 0.18
#> SPPL3 0.43 0.660 0.29
result <- fpkmToTpm_matrix(lung_squ_count2)
result
#> sample1 sample2 sample3
#> DISC1 144736.8 83582.09 620967.7
#> TCOF1 289473.7 522388.06 145161.3
#> SPPL3 565789.5 394029.85 233871.0
sessionInfo()
#> R version 4.1.3 (2022-03-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur/Monterey 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] GeoTcgaData_1.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.9 plyr_1.8.7 pillar_1.8.0 compiler_4.1.3
#> [5] highr_0.9 R.methodsS3_1.8.2 R.utils_2.12.0 tools_4.1.3
#> [9] digest_0.6.29 evaluate_0.15 lifecycle_1.0.1 tibble_3.1.8
#> [13] R.cache_0.16.0 pkgconfig_2.0.3 rlang_1.0.4 reprex_2.0.1
#> [17] DBI_1.1.3 cli_3.3.0 rstudioapi_0.13 yaml_2.3.5
#> [21] xfun_0.31 fastmap_1.1.0 withr_2.5.0 styler_1.7.0
#> [25] stringr_1.4.0 dplyr_1.0.9 knitr_1.39 generics_0.1.3
#> [29] fs_1.5.2 vctrs_0.4.1 tidyselect_1.1.2 glue_1.6.2
#> [33] R6_2.5.1 fansi_1.0.3 rmarkdown_2.14 purrr_0.3.4
#> [37] magrittr_2.0.3 htmltools_0.5.3 splines_4.1.3 assertthat_0.2.1
#> [41] utf8_1.2.2 nor1mix_1.3-0 stringi_1.7.8 cqn_1.38.0
#> [45] R.oo_1.25.0
Created on 2022-08-08 by the reprex package (v2.0.1)
If installing the dependencies separately doesn't solve the issue, another potential alternative is to define the function yourself. From the source code:
fpkmToTpm <- function(fpkm) {
  exp(log(fpkm) - log(sum(fpkm)) + log(1e6))
}
fpkmToTpm_matrix <- function(fpkm_matrix) {
  fpkm_matrix_new <- apply(fpkm_matrix, 2, fpkmToTpm)
}
fpkmToTpm_matrix(J_Rseq)
Does this work?
Also, please edit your question to include the output of the command sessionInfo()
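As an aside, the formula above is just "scale each column so it sums to one million" (exp(log(fpkm) - log(sum(fpkm)) + log(1e6)) is algebraically fpkm / sum(fpkm) * 1e6), so the hand-rolled version can be sanity-checked against the toy matrix from earlier:
manual <- apply(lung_squ_count2, 2, function(fpkm) fpkm / sum(fpkm) * 1e6)
all.equal(manual, fpkmToTpm_matrix(lung_squ_count2))  ## should be TRUE (up to floating point)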

str_detect() error when detecting Chinese characters on Windows

I am using RStudio on a Windows machine and am trying to do some string matching with Chinese characters. I am not familiar with encoding settings on Windows, so I checked some tutorials to make sure the result of Sys.getlocale() is the right one.
When executing str_detect() inside a data frame the matching fails, but it works at the vector level. Also, df_edu_village %>% filter(str_detect(village, "糖")) shows a different result from df_edu_village %>% filter(str_detect(village, curl::curl_escape("糖") %>% curl::curl_unescape())).
Below I try to reproduce the result, but reprex() isn't working (possibly because of copy-paste issues), so I knitted the result into HTML with Rmd myself. Thanks for any help.
# devtools::install_github("ntupsc/pscdata")
library(pscdata)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
df_edu_village <- pscdata::edu_village_original %>% as_tibble() %>% distinct(village)
df_edu_village %>% filter(str_detect(village, "糖"))
## Error: Problem with `filter()` input `..1`.
## i Input `..1` is `str_detect(village, "糖")`.
## x Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`聶}`)
df_edu_village %>% filter(str_detect(village, curl::curl_escape("糖") %>% curl::curl_unescape()))
## # A tibble: 3 x 1
## village
## <chr>
## 1 糖<U+5ECD>里
## 2 糖友里
## 3 大糖里
df_edu_village %>% filter(str_detect(village, "里"))
## # A tibble: 0 x 1
## # ... with 1 variable: village <chr>
str_detect(df_edu_village$village[1], "里")
## [1] FALSE
grepl(df_edu_village$village[1], "里")
## [1] FALSE
str_detect("留侯里", "里")
## [1] TRUE
Sys.getlocale()
## [1] "LC_COLLATE=Chinese (Traditional)_Taiwan.950;LC_CTYPE=Chinese (Traditional)_Taiwan.950;LC_MONETARY=Chinese (Traditional)_Taiwan.950;LC_NUMERIC=C;LC_TIME=Chinese (Traditional)_Taiwan.950"
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Chinese (Traditional)_Taiwan.950
## [2] LC_CTYPE=Chinese (Traditional)_Taiwan.950
## [3] LC_MONETARY=Chinese (Traditional)_Taiwan.950
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Traditional)_Taiwan.950
## system code page: 1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
## [5] readr_2.0.1 tidyr_1.1.3 tibble_3.1.4 ggplot2_3.3.5
## [9] tidyverse_1.3.1 pscdata_0.1.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.1 xfun_0.26 haven_2.4.3 colorspace_2.0-2
## [5] vctrs_0.3.8 generics_0.1.0 htmltools_0.5.2 yaml_2.2.1
## [9] utf8_1.2.2 rlang_0.4.11 jquerylib_0.1.4 pillar_1.6.2
## [13] withr_2.4.2 glue_1.4.2 DBI_1.1.1 dbplyr_2.1.1
## [17] modelr_0.1.8 readxl_1.3.1 lifecycle_1.0.0 munsell_0.5.0
## [21] gtable_0.3.0 cellranger_1.1.0 rvest_1.0.1 evaluate_0.14
## [25] knitr_1.34 tzdb_0.1.2 fastmap_1.1.0 curl_4.3.2
## [29] fansi_0.5.0 broom_0.7.9 Rcpp_1.0.7 backports_1.2.1
## [33] scales_1.1.1 jsonlite_1.7.2 fs_1.5.0 hms_1.1.0
## [37] digest_0.6.27 stringi_1.7.4 grid_4.1.1 cli_3.0.1
## [41] tools_4.1.1 magrittr_2.0.1 crayon_1.4.1 pkgconfig_2.0.3
## [45] ellipsis_0.3.2 xml2_1.3.2 reprex_2.0.1 lubridate_1.7.10
## [49] rstudioapi_0.13 assertthat_0.2.1 rmarkdown_2.11 httr_1.4.2
## [53] R6_2.5.1 compiler_4.1.1
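No solution here yet, but one diagnostic worth running (an assumption based on the mismatch between the Big5 locale and "system code page: 1252" shown above, not a confirmed fix) is to check what encoding the strings in the data frame actually carry:
Encoding(df_edu_village$village[1:5])    ## declared encoding: "unknown", "UTF-8", "latin1", ...
validUTF8(df_edu_village$village[1:5])   ## are the underlying bytes valid UTF-8?
## If the declared encoding looks wrong, converting before matching may be worth a try,
## e.g. with enc2utf8() or iconv(..., to = "UTF-8").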

How to make a column exist in R?

I have a very large dataset where I want to take a column of identifiers (CP), first edit how the identifiers look so that they match another file, and then check whether there are CP matches between the files.
I do the editing of the CP first with:
library(data.table)
library(tidyverse)
fullGWAS <- fread('file.csv', sep = ",")
colnames(fullGWAS)[1] <- "CP"
fullGWAS2 <- gsub("_.*", "", fullGWAS$CP)   ## drop everything after the underscore
fullGWAS2 <- data.frame(fullGWAS2)
colnames(fullGWAS2)[1] <- "CP"
fullGWAS3 <- select(fullGWAS, c(2:15))
gwasdf <- cbind(fullGWAS2, fullGWAS3)
As an example gwasdf looks like:
> head(gwasdf)
CP chr bpos a1 a2 freq BETAsbp Psbp BETAdbp Pdbp BETApp Ppp minP
1 1:2556125 1 2556125 t c 0.3255 -0.0262 0.41300 -0.0113 0.5388 -0.0157 0.4690 0.41300
2 1:2556548 1 2556548 t c 0.3261 -0.0274 0.39270 -0.0121 0.5096 -0.0160 0.4615 0.39270
3 1:2556709 1 2556709 a g 0.3257 -0.0263 0.41210 -0.0116 0.5266 -0.0155 0.4749 0.41210
4 12:11366987 12 11366987 t c 0.9443 0.0355 0.61460 0.0019 0.9631 0.0185 0.7007 0.61460
5 17:21949792 17 21949792 a c 0.4570 -0.0384 0.20690 -0.0043 0.8065 -0.0212 0.3050 0.20690
6 17:21955349 17 21955349 t g 0.5253 0.0505 0.09562 0.0103 0.5574 0.0248 0.2303 0.09562
minTRAIT BETAmean
1 SBP -0.01875
2 SBP -0.01975
3 SBP -0.01895
4 SBP 0.01870
5 SBP -0.02135
6 SBP 0.03040
I can see CP is here yet when I try to check this I get:
exists("gwasdf$CP")
[1] FALSE
class(gwasdf)
[1] "data.frame"
nrow(gwasdf)
[1] 7083535
Why is this false and how can I make it be true?
I am trying to ultimately check whether the CP identifiers are present in another file with follow-up code using:
CPmatches <- df2[CP %in% gwasdf$CP] #df2 is another file I just read in
mismatchextract <- subset(gwasdf, !(CP %in% df2$CP))
For extra info I use RStudio with:
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] matrixStats_0.57.0 sqldf_0.4-11 RSQLite_2.2.1 gsubfn_0.7
[5] proto_1.0.0 data.table_1.13.2 forcats_0.5.0 stringr_1.4.0
[9] dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
[13] tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] tidyselect_1.1.0 haven_2.3.1 tcltk_4.0.2 colorspace_1.4-1 vctrs_0.3.4
[6] generics_0.1.0 chron_2.3-56 blob_1.2.1 rlang_0.4.8 pillar_1.4.7
[11] glue_1.4.1 withr_2.3.0 DBI_1.1.0 bit64_4.0.5 dbplyr_2.0.0
[16] modelr_0.1.8 readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0
[21] cellranger_1.1.0 rvest_0.3.6 memoise_1.1.0 fansi_0.4.1 broom_0.7.2
[26] Rcpp_1.0.5 scales_1.1.1 backports_1.1.10 jsonlite_1.7.1 fs_1.5.0
[31] bit_4.0.4 hms_0.5.3 digest_0.6.27 stringi_1.5.3 grid_4.0.2
[36] cli_2.2.0 tools_4.0.2 magrittr_2.0.1 crayon_1.3.4 pkgconfig_2.0.3
[41] ellipsis_0.3.1 xml2_1.3.2 reprex_0.3.0 lubridate_1.7.9 assertthat_0.2.1
[46] httr_1.4.2 rstudioapi_0.13 R6_2.5.0 compiler_4.0.2
Something like this, using dplyr and the %in% operator? This assumes there are two separate data frames and that the goal is to subset one based on whether its identifiers appear in the other.
qwasdf_1 <- data.frame(
  CP1 = c("1:2556125", "1:2556548", "99:12345678")
)
qwasdf_2 <- data.frame(
  CP2 = c("1:2556125", "1:2556548", "1:2556709")
)
library(dplyr)
qwasdf_1 %>%
  filter(CP1 %in% qwasdf_2$CP2)
#> CP1
#> 1 1:2556125
#> 2 1:2556548
Created on 2020-11-23 by the reprex package (v0.3.0)
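A side note on the exists() check in the question: exists() looks up an object by name in an environment, so exists("gwasdf$CP") asks whether there is an object literally called "gwasdf$CP", which there is not. To test for a column, check the data frame's names instead:
exists("gwasdf")          ## TRUE  - the data frame itself is an object
exists("gwasdf$CP")       ## FALSE - no object has this literal name
"CP" %in% names(gwasdf)   ## TRUE  - this is the column check you want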
