I'm working with some data that has strings like these:
1) C: 0.664 (3327)T: 0.336 (1681)
2) C|C: 0.462 (1158)C|T: 0.404 (1011)T|T: 0.134 (335)
I'm interested in extracting just the letters and the numbers within the parentheses to get data frames like these:
1)
L1 N1 L2 N2
C 3327 T 1681
2)
L1 N1 L2 N2 L3 N3
CC 1158 CT 1011 TT 335
Is there any function/package or efficient way to do this in R?
We could also use stri_extract_all_regex from library(stringi) after removing the | with gsub. The pattern has two alternatives: one or more characters that are not ) ([^)]+) followed by a colon (the lookahead (?=:)), or one or more characters that are not ) preceded by an opening parenthesis (the lookbehind (?<=\\()).
library(stringi)
stri_extract_all_regex(gsub('\\|', '', x), '[^)]+(?=:)|(?<=\\()[^)]+')
#[[1]]
#[1] "C" "3327" "T" "1681"
#[[2]]
#[1] "CC" "1158" "CT" "1011" "TT" "335"
We could also use two gsub calls and then convert the output to a data.frame. With this method the letter columns are read as character and the count columns as integer.
res <- read.table(text=gsub('\\:[^(]+|[()]', ' ',
gsub('[|]', '', x)),
sep='', header=FALSE, stringsAsFactors=FALSE, na.strings='', fill=TRUE)
# V1 V2 V3 V4 V5 V6
#1 C 3327 T 1681 <NA> NA
#2 CC 1158 CT 1011 TT 335
str(res)
#'data.frame': 2 obs. of 6 variables:
# $ V1: chr "C" "CC"
# $ V2: int 3327 1158
# $ V3: chr "T" "CT"
# $ V4: int 1681 1011
# $ V5: chr NA "TT"
# $ V6: int NA 335
NOTE: We can change the column names using ?colnames
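As a minimal sketch of that renaming step, assuming the L1/N1/L2/N2 naming scheme from the question and that x holds the two example strings, the names can be generated from the number of columns:

```r
# Example input from the question
x <- c("C: 0.664 (3327)T: 0.336 (1681)",
       "C|C: 0.462 (1158)C|T: 0.404 (1011)T|T: 0.134 (335)")

# Same two-gsub approach as above: drop "|", then blank out ": 0.664 " and the parentheses
res <- read.table(text = gsub(":[^(]+|[()]", " ", gsub("[|]", "", x)),
                  header = FALSE, stringsAsFactors = FALSE,
                  na.strings = "", fill = TRUE)

# Build L1, N1, L2, N2, ... for however many letter/count pairs were read
colnames(res) <- paste0(c("L", "N"), rep(seq_len(ncol(res) / 2), each = 2))
res
```

Rows with fewer pairs keep NA in the trailing columns, as in the output above.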
Example
x = c(
"C: 0.664 (3327)T: 0.336 (1681)",
"C|C: 0.462 (1158)C|T: 0.404 (1011)T|T: 0.134 (335)"
)
Select parts
s = strsplit(x, "\\)|(:.*?\\()")
# [[1]]
# [1] "C" "3327" "T" "1681"
#
# [[2]]
# [1] "C|C" "1158" "C|T" "1011" "T|T" "335"
The regex matches one of two things: \\) or :.*?\\(. In the second alternative:
. matches any character
* repeats that match any number of times
? makes the quantifier "non-greedy", so the match stops at the first \\(, even though ( is itself a character that . can match.
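To see why the ? matters, here is a small comparison on the first example string. It only uses base strsplit; the variable names are just for illustration.

```r
x1 <- "C: 0.664 (3327)T: 0.336 (1681)"

# Non-greedy: ":.*?\\(" stops at the first "(" after each colon
lazy <- strsplit(x1, "\\)|(:.*?\\()")[[1]]
lazy
# [1] "C"    "3327" "T"    "1681"

# Greedy: ":.*\\(" runs from the first colon to the LAST "(",
# swallowing "3327)T" along the way
greedy <- strsplit(x1, "\\)|(:.*\\()")[[1]]
greedy
# [1] "C"    "1681"
```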
From there, it's pretty straightforward to perform your remaining formatting tasks:
Map(function(r, n)
setNames( gsub("\\|", "", r), paste0(c("L","N"), rep(seq(n), each=2)) ),
s,
lengths(s)/2
)
# [[1]]
# L1 N1 L2 N2
# "C" "3327" "T" "1681"
#
# [[2]]
# L1 N1 L2 N2 L3 N3
# "CC" "1158" "CT" "1011" "TT" "335"
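If you want actual one-row data frames rather than named character vectors, something along these lines should work. It is only a sketch: to_df is a helper name made up here, and it assumes every element alternates letter, count, letter, count.

```r
x <- c("C: 0.664 (3327)T: 0.336 (1681)",
       "C|C: 0.462 (1158)C|T: 0.404 (1011)T|T: 0.134 (335)")
s <- strsplit(x, "\\)|(:.*?\\()")

# to_df is a hypothetical helper: one split result in, one-row data frame out
to_df <- function(r) {
  r <- gsub("\\|", "", r)                                   # "C|C" -> "CC"
  nms <- paste0(c("L", "N"), rep(seq_len(length(r) / 2), each = 2))
  out <- as.data.frame(setNames(as.list(r), nms), stringsAsFactors = FALSE)
  counts <- seq(2, ncol(out), by = 2)                       # the N columns
  out[counts] <- lapply(out[counts], as.integer)            # counts become integers
  out
}

dfs <- lapply(s, to_df)
dfs[[2]]
```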
Something incredibly weird is happening in my file. Some values disappear when I convert the data frame to a matrix, even though the number of rows stays the same. Other values are still in the matrix, so I don't understand what is going on.
# Data : meps >> https://github.com/JMcrocs/MEPVote/raw/master/meps.rds
> str(meps)
'data.frame': 784 obs. of 2338 variables:
$ mepid: num 197701 197533 197521 187917 124986 ...
$ EPG : chr "GUE.NGL" "GUE.NGL" "GUE.NGL" "GUE.NGL" ...
> mepsMatrix <- as.matrix(meps)
> str(mepsMatrix)
chr [1:784, 1:2338] "197701" "197533" "197521" "187917" "124986" "197529" "197468" " 96706" " 88715" "197416" ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:784] "197701" "197533" "197521" "187917" ...
..$ : chr [1:2338] "mepid" "EPG" "1" "2" ...
> nrow(meps)
[1] 784
> nrow(mepsMatrix)
[1] 784
> 28229 %in% meps[,'mepid']
[1] TRUE
> 28229 %in% mepsMatrix[,'mepid']
[1] FALSE
The weirdest part is that I can find it with the RStudio viewer.
Can someone help me, please? I would be grateful!
Look at this minimal example:
df <- data.frame(a = c(2, 20), b = c("a", "b"))
m <- as.matrix(df)
2 %in% df[, "a"]
#> TRUE
2 %in% m[, "a"]
#> FALSE
" 2" %in% m[, "a"]
#> TRUE
2 %in% trimws(m[, "a"])
#> TRUE
m
#> a b
#> [1,] " 2" "a"
#> [2,] "20" "b"
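Apparently as.matrix pads the numbers to a common width when it converts them to character: because the data frame has a non-numeric column, the result is a character matrix, and the numeric columns are converted with format(), which right-justifies them. That is why 28229 is stored as a padded string in your matrix and no longer matches the number.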
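A quick way to check this, and one possible workaround, using the same toy data as above. Converting the column back with as.numeric is my suggestion here, not the only option (trimws works too):

```r
df <- data.frame(a = c(2, 20), b = c("a", "b"))
m <- as.matrix(df)

# format() right-justifies numbers to a common width, which is the padding seen in m
format(df$a)
# [1] " 2" "20"

# Comparing against the padded strings fails ...
2 %in% m[, "a"]
# [1] FALSE

# ... but as.numeric() ignores the leading whitespace
2 %in% as.numeric(m[, "a"])
# [1] TRUE
```

If you only need the numeric columns, selecting them before calling as.matrix avoids the character conversion altogether.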
I am trying to figure out how to loop through a named vector of regression coefficients. I want to loop through the vector and detect whether or not a coefficient name contains the string 'country'. If it does, I want to append the corresponding value to an empty vector. I already solved this using dplyr tools, but I also want to do it using a for loop.
This is what my data looks like:
str(co2_per_cap_model$coefficients)
Named num [1:164] -0.0511 0.3289 1.2352 3.0743 0.8654 ...
- attr(*, "names")= chr [1:164] "(Intercept)" "time" "countryAlbania" "countryAlgeria" ...
This is the loop I've been tinkering with. Any advice? Thank you in advance.
storage <- c()
for(coeff in co2_per_cap_model$coefficients){
if(str_detect(names(co2_per_cap_model$coefficients), 'country')){
storage <- c(coeff, storage)
}
}
We need to create some reproducible data. Then just use grep:
set.seed(42)
coef <- 1:25
names(coef) <- sample(LETTERS[1:5], 25, replace=TRUE)
str(coef)
# Named int [1:25] 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, "names")= chr [1:25] "A" "E" "A" "A" ...
idx <- grep("A", names(coef))
coef[idx]
# A A A A A A A
# 1 3 4 9 11 17 18
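If you do want the for loop, the main fix is to loop over positions rather than values, so that each name is tested one at a time. A sketch with made-up coefficients (the numbers below are illustrative, not from your model):

```r
# Made-up stand-in for co2_per_cap_model$coefficients
coefs <- c("(Intercept)" = -0.05, time = 0.33,
           countryAlbania = 1.24, countryAlgeria = 3.07)

storage <- c()
for (i in seq_along(coefs)) {
  # test ONE name per iteration; the original loop tested all names at once
  if (grepl("country", names(coefs)[i], fixed = TRUE)) {
    storage <- c(storage, coefs[i])   # append, keeping the name
  }
}
storage
```

The result is the same as coefs[grepl("country", names(coefs))], which is the vectorised form of the grep approach above.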
From a range of numbers from 001 to 999, I would like to be able to formulate a function where from 001 to 199, the combinations of numbers will be listed in up to 6 different ways. Example 192 as 192, 129, 291, 219, 912, 921. The listing should obviously begin with 001 which will show as: 001, 010, 100.
I'm not sure what format you want the results in.
As commented, these are permutations: combinat::permn is probably the most convenient way to achieve this.
Format a number with zero-padding ("%03d"), split into characters (strsplit(.,"")):
f0 <- function(x) strsplit(sprintf("%03d",x),"")[[1]]
Create all permutations, squash them back into strings (paste/collapse), and select the unique values (e.g. 000 has only one unique value)
f1 <- function(x) unique(sapply(combinat::permn(f0(x)),paste,collapse=""))
Apply to each of the integers
result <- lapply(0:999,f1)
head(result)
[[1]]
[1] "000"
[[2]]
[1] "001" "010" "100"
[[3]]
[1] "002" "020" "200"
[[4]]
[1] "003" "030" "300"
[[5]]
[1] "004" "040" "400"
[[6]]
[1] "005" "050" "500"
Later values do indeed have up to six entries.
You could make vectors of indices with tidyr::crossing or expand.grid:
library(tidyverse)
indices <- crossing(x = 1:3, y = 1:3, z = 1:3) %>%
filter(x != y, x != z, y != z) %>%
pmap(~unname(c(...)))
indices %>% str
#> List of 6
#> $ : int [1:3] 1 2 3
#> $ : int [1:3] 1 3 2
#> $ : int [1:3] 2 1 3
#> $ : int [1:3] 2 3 1
#> $ : int [1:3] 3 1 2
#> $ : int [1:3] 3 2 1
...which you can then use to subset each input vector as you iterate across them:
perms <- pmap(crossing(x = 0:9, y = 0:9, z = 0:9), function(...){
map_chr(indices, function(x) paste(c(...)[x], collapse = "")) %>%
unique()
})
perms[500:510] %>% str(vec.len = 6)
#> List of 11
#> $ : chr [1:3] "499" "949" "994"
#> $ : chr [1:3] "500" "050" "005"
#> $ : chr [1:6] "501" "510" "051" "015" "150" "105"
#> $ : chr [1:6] "502" "520" "052" "025" "250" "205"
#> $ : chr [1:6] "503" "530" "053" "035" "350" "305"
#> $ : chr [1:6] "504" "540" "054" "045" "450" "405"
#> $ : chr [1:3] "505" "550" "055"
#> $ : chr [1:6] "506" "560" "056" "065" "650" "605"
#> $ : chr [1:6] "507" "570" "057" "075" "750" "705"
#> $ : chr [1:6] "508" "580" "058" "085" "850" "805"
#> $ : chr [1:6] "509" "590" "059" "095" "950" "905"
This ultimately is still a lot of iteration, so while it works fast enough for 6000 iterations, a vectorized approach would scale better.
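One way such a vectorized approach could look in base R: instead of permuting each number, give every number from 000 to 999 a key made of its sorted digits and group by that key. This is a sketch of the idea, not a benchmarked solution; the names n, d, key and groups are mine.

```r
n <- 0:999

# Split each number into its three digits, one column per digit
d <- cbind(n %/% 100, (n %/% 10) %% 10, n %% 10)

# Sort the digits of every row without a loop: smallest, middle, largest
lo  <- pmin(d[, 1], d[, 2], d[, 3])
hi  <- pmax(d[, 1], d[, 2], d[, 3])
mid <- rowSums(d) - lo - hi
key <- sprintf("%d%d%d", lo, mid, hi)

# Numbers sharing a key are permutations of each other
groups <- split(sprintf("%03d", n), key)

groups[["129"]]
# [1] "129" "192" "219" "291" "912" "921"
groups[["001"]]
# [1] "001" "010" "100"
```

Each group already contains only distinct values, so no unique() call is needed.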
Here is a solution that gives the desired output with no duplication and no additional calls to clean up duplicate results. We take advantage of std::next_permutation from the algorithm library in C++, which takes a sequence as input and generates permutations in lexicographical order until it wraps back around to the first permutation. This means we only generate 3 permutations for 001, 1 permutation for 999, and 6 permutations for 123.
We start by generating all combinations of as.character(0:9) of length 3 with repetition by utilizing gtools::combinations.
## install.packages("gtools")
myCombs <- gtools::combinations(10, 3, as.character(0:9), repeats.allowed = TRUE)
nrow(myCombs)
[1] 220
Here is an Rcpp version that exposes std::next_permutation to R:
## install.packages("Rcpp")
Rcpp::cppFunction(
"CharacterVector permuteDigits(CharacterVector v) {
std::string myStr;
std::vector<std::string> result;
for (std::size_t i = 0; i < v.size(); ++i)
myStr += v[i];
do {
result.push_back(myStr);
} while(std::next_permutation(myStr.begin(), myStr.end()));
return wrap(result);
}"
)
And finally, we bring it all together with lapply:
permutedCombs <- lapply(1:nrow(myCombs), function(x) {
permuteDigits(myCombs[x, ])
})
Here is some sample output:
permutedCombs[1:5]
[[1]]
[1] "000"
[[2]]
[1] "001" "010" "100"
[[3]]
[1] "002" "020" "200"
[[4]]
[1] "003" "030" "300"
[[5]]
[1] "004" "040" "400"
permutedCombs[151:155]
[[1]]
[1] "356" "365" "536" "563" "635" "653"
[[2]]
[1] "357" "375" "537" "573" "735" "753"
[[3]]
[1] "358" "385" "538" "583" "835" "853"
[[4]]
[1] "359" "395" "539" "593" "935" "953"
[[5]]
[1] "366" "636" "663"
And here is proof that we have all 1000 results with no duplications:
sum(lengths(permutedCombs))
[1] 1000
identical(sort(as.integer(do.call(c, permutedCombs))), 0:999)
[1] TRUE
names(score)
[1] "(Intercept)" "aado2_calc(20,180]" "aado2_calc(360,460]"
[4] "aado2_calc(460,629]" "albumin[1,1.8]" "albumin(1.8,2.2]"
[7] "albumin(2.2,2.8]" "aniongap(15,18]" "aniongap(18,20]"
[10] "aniongap(20,22]" "aniongap(22,25]" "aniongap(25,49]"
I want to extract the two numbers inside the brackets (numbers outside the brackets are not needed); the opening bracket can be either "(" or "[". The first number should be assigned to an object "low" and the second to "high".
You can use the parse_number function from the readr package for ease of use. For more power you'd want to use something like the base regular expression functions in R, or a package like stringi.
Just like @jake-kaupp said, use stringi :) As you can see below, the stringi solution is shorter, easier to understand and much faster, by a factor of about 30 in this benchmark.
Short answer:
arr <- stri_extract_all_regex(x, "(?<=[\\[\\(,])[0-9.]+(?=[\\]\\),])", simplify = NA)
data.frame(low = as.numeric(arr[,1]), high = as.numeric(arr[,2]))
Long answer:
require(stringi)
require(microbenchmark)
grepFun <- function(x){
mat <- regmatches(x,
gregexpr("(?<=[\\[\\(,])[0-9.]+(?=[\\]\\),])", x, perl = TRUE))
newnames <- lapply(mat, function(m) {
if (! length(m)) return(list(low = NA, high = NA))
setNames(as.list(as.numeric(m)), nm = c("low", "high"))
})
do.call(rbind.data.frame, newnames)
}
striFun <- function(x){
arr <- stri_extract_all_regex(x, "(?<=[\\[\\(,])[0-9.]+(?=[\\]\\),])", simplify = NA)
data.frame(low = as.numeric(arr[,1]), high = as.numeric(arr[,2]))
}
# both functions work the same
grepFun(scorenames)
low high
1 NA NA
2 20.0 180.0
3 360.0 460.0
4 460.0 629.0
...
12 25.0 49.0
striFun(scorenames)
low high
1 NA NA
2 20.0 180.0
3 360.0 460.0
4 460.0 629.0
...
12 25.0 49.0
# generating more complicated vector
n <- 10000
x <- stri_paste(stri_rand_strings(n, length = 1:10), sample(c("(","["),n,TRUE),
sample(1000,n,TRUE), ",", sample(1000,n,TRUE), sample(c(")","]"), n, TRUE))
head(x) # check first elements
[1] "O[68,434]" "Ql[783,151)" "Zk0(773,60)" "ETfV(446,518]" "Xixbr(576,855)" "G6QnHu(92,955)"
#short test using new data
grepFun(x[1:6])
low high
1 68 434
2 783 151
3 773 60
4 446 518
5 576 855
6 92 955
striFun(x[1:6])
low high
1 68 434
2 783 151
3 773 60
4 446 518
5 576 855
6 92 955
#and some benchmark to prove performance
microbenchmark(grepFun(x), striFun(x))
Unit: milliseconds
expr min lq mean median uq max neval
grepFun(x) 330.27733 366.09306 416.56330 406.08914 465.29829 568.15250 100
striFun(x) 11.57449 11.97825 13.38157 12.46927 13.67699 25.97455 100
scorenames <- c(
"(Intercept)" ,"aado2_calc(20,180]" ,"aado2_calc(360,460]"
,"aado2_calc(460,629]" ,"albumin[1,1.8]" ,"albumin(1.8,2.2]"
,"albumin(2.2,2.8]" ,"aniongap(15,18]" ,"aniongap(18,20]"
,"aniongap(20,22]" ,"aniongap(22,25]" ,"aniongap(25,49]"
)
The first step might be to extract everything within the "parens"-delimiters (to include (), [], and the comma ,).
mat <- regmatches(scorenames,
gregexpr("(?<=[\\[\\(,])[0-9.]+(?=[\\]\\),])", scorenames, perl = TRUE))
str(mat)
# List of 12
# $ : chr(0)
# $ : chr [1:2] "20" "180"
# $ : chr [1:2] "360" "460"
# $ : chr [1:2] "460" "629"
# $ : chr [1:2] "1" "1.8"
# $ : chr [1:2] "1.8" "2.2"
# $ : chr [1:2] "2.2" "2.8"
# $ : chr [1:2] "15" "18"
# $ : chr [1:2] "18" "20"
# $ : chr [1:2] "20" "22"
# $ : chr [1:2] "22" "25"
# $ : chr [1:2] "25" "49"
From here, we can see that (1) the first one is problematic (no surprise, you need to figure out what you want here), and (2) the rest look about right.
Here's one rough way to process this list. This is very trusting and naïve ... you should probably add checks to ensure the list is of length 2, that everything converts correctly (perhaps in a tryCatch), etc.
newnames <- lapply(mat, function(m) {
if (! length(m)) return(list(low = NA, high = NA))
setNames(as.list(as.numeric(m)), nm = c("low", "high"))
})
str(newnames)
# List of 12
# $ :List of 2
# ..$ low : logi NA
# ..$ high: logi NA
# $ :List of 2
# ..$ low : num 20
# ..$ high: num 180
# $ :List of 2
# ..$ low : num 360
# ..$ high: num 460
# ...snip...
You can turn this into a data.frame with:
head(do.call(rbind.data.frame, newnames))
# low high
# 1 NA NA
# 2 20.0 180.0
# 3 360.0 460.0
# 4 460.0 629.0
# 5 1.0 1.8
# 6 1.8 2.2
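A slightly more defensive version of the same idea, as a sketch: parse_bounds is a helper name invented here, and returning NA for anything that is not exactly two numbers is just one possible policy.

```r
# parse_bounds is a hypothetical helper: one regmatches() result in, one row out
parse_bounds <- function(m) {
  # anything other than exactly two matches is treated as "no interval"
  if (length(m) != 2) return(data.frame(low = NA_real_, high = NA_real_))
  vals <- suppressWarnings(as.numeric(m))   # unparseable text becomes NA
  data.frame(low = vals[1], high = vals[2])
}

# A few of the names from the question
nms <- c("(Intercept)", "aado2_calc(20,180]", "albumin[1,1.8]")
mat <- regmatches(nms,
  gregexpr("(?<=[\\[\\(,])[0-9.]+(?=[\\]\\),])", nms, perl = TRUE))

bounds <- do.call(rbind, lapply(mat, parse_bounds))
bounds
```

Because every element returns a data frame with the same two numeric columns, the rbind step cannot end up with mismatched column types.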
I have a problem with data conversion in R.
I have two data objects stored in variables named lung.X and lung.y. Their descriptions are below.
> str(lung.X)
chr [1:86, 1:7129] " 170.0" " 104.0" " 53.7" " 119.0" " 105.5" " 130.0" ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:86] "V3" "V4" "V5" "V6" ...
..$ : chr [1:7129] "A28102_at" "AB000114_at" "AB000115_at" "AB000220_at" ...
and
> str(lung.y)
num [1:86] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
lung.X is a matrix (86 rows, 7129 columns) and lung.y is a numeric vector (86 entries).
Does anyone know how to convert the data above into the format below?
> str(lung.X)
num [1:86, 1:7129] 170 104 53.7 119 105.5 130...
I thought I should do like this
lung.X <- as.numeric(lung.X)
but I got this instead
> str(lung.X)
num [1:613094] 170 104 53.7 119 105.5 130...
The reason for doing this is that I need lung.X to be numeric only.
Thank you.
You could change the mode of your matrix to numeric:
## example data
m <- matrix(as.character(1:10), nrow=2,
dimnames = list(c("R1", "R2"), LETTERS[1:5]))
m
# A B C D E
# R1 "1" "3" "5" "7" "9"
# R2 "2" "4" "6" "8" "10"
str(m)
# chr [1:2, 1:5] "1" "2" "3" "4" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:2] "R1" "R2"
# ..$ : chr [1:5] "A" "B" "C" "D" ...
# NULL
mode(m) <- "numeric"
str(m)
# num [1:2, 1:5] 1 2 3 4 5 6 7 8 9 10
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:2] "R1" "R2"
# ..$ : chr [1:5] "A" "B" "C" "D" ...
# NULL
m
# A B C D E
# R1 1 3 5 7 9
# R2 2 4 6 8 10
Give this a try: m <- matrix(as.numeric(lung.X), nrow = 86, ncol = 7129)
If you need it in dataframe/list format, df <- data.frame(m)
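Note that rebuilding the matrix with matrix(as.numeric(...)) drops the row and column names. If you want to keep them, changing the storage mode in place is an option. A small sketch with a made-up 2 x 2 stand-in for lung.X (the padded strings mimic the ones in the question):

```r
# Made-up miniature of lung.X: padded character values with dimnames
lung_like <- matrix(c(" 170.0", " 104.0", "  53.7", " 119.0"), nrow = 2,
                    dimnames = list(c("V3", "V4"),
                                    c("A28102_at", "AB000114_at")))

# Convert in place; dim and dimnames are attributes and are left untouched
storage.mode(lung_like) <- "double"

str(lung_like)
lung_like["V3", "AB000114_at"]
# [1] 53.7
```

The leading spaces are not a problem, since the conversion to numeric ignores surrounding whitespace.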