Maintaining order of extracted patterns from strings in R

I'm trying to extract a pattern from a string, but am having difficulty maintaining the order. Consider:
library(stringr)
string <- "A A A A C A B A"
extract <- c("B","C")
str_extract_all(string,extract)
[[1]]
[1] "B"
[[2]]
[1] "C"
The output of this is a list. Is it possible to return a vector that maintains the original ordering, i.e. that "C" precedes "B" as it does in the string? I've tried many options with gsub with no luck. Thanks.

Try using the following regexp:
str_extract_all(string,"[BC]")
## [[1]]
## [1] "C" "B"
or more generally:
str_extract_all(string, paste(extract, collapse = "|"))

string <- "A A A A C A B A B"
extract <- c("B","C")
inds = unlist(sapply(extract, function(p){
as.numeric(gregexpr(p, string)[[1]])
}))
sort(inds[inds > 0])
# C B1 B2
# 9 13 17
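As a rough base R sketch of the same idea, the matched text and its positions can be recovered together with gregexpr() and regmatches(). It assumes the entries of extract contain no regex metacharacters; the names pattern, m, matched and positions are only illustrative.

```r
string <- "A A A A C A B A B"
extract <- c("B", "C")

# Build one alternation pattern, "B|C", so matches come back in string order
pattern <- paste(extract, collapse = "|")
m <- gregexpr(pattern, string)

matched <- regmatches(string, m)[[1]]   # matched text, in order of appearance
positions <- as.integer(m[[1]])         # start position of each match

matched
positions
```

Because a single pattern is scanned left to right, no sorting step is needed afterwards.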

Related

Convert dataframe into list in R

I would like to convert a data frame into a list.
See input in Table 1.
See the output in Table 2.
When you open a list in R from the environment.
Name - the following names clus1, clus2...
Type - should contain values from column V1
Value - list of length 3
Table 1
V1 V2 V3
clus1 10 a d
clus2 20 b e
clus3 5 c f
Table 2
$`clus1`
[1] "a" "d"
$`clus2`
[1] "b" "e"
$`clus3`
[1] "c" "f"
t1 = read.table(text = " V1 V2 V3
clus1 10 a d
clus2 20 b e
clus3 5 c ''", header = T)
result = split(t1[, 2:3], f = row.names(t1))
result = lapply(result, function(x) {
x = as.character(unname(unlist(x)))
x[x != '']})
result
# $clus1
# [1] "a" "d"
#
# $clus2
# [1] "b" "e"
#
# $clus3
# [1] "c"
In this particular case, we can go a bit more directly if we convert to matrix first:
r2 = split(as.matrix(t1[, 2:3]), f = row.names(t1))
r2 = lapply(r2, function(x) x[x != ''])
# same result
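A minimal self-contained sketch of the matrix approach, assuming the data frame is built directly with data.frame() rather than read.table(), and that every column except V1 should go into the list. The name value_cols is hypothetical.

```r
t1 <- data.frame(V1 = c(10, 20, 5),
                 V2 = c("a", "b", "c"),
                 V3 = c("d", "e", ""),
                 row.names = c("clus1", "clus2", "clus3"),
                 stringsAsFactors = FALSE)

# Drop V1 by position so no other column has to be named explicitly
value_cols <- t1[, -1, drop = FALSE]

# split() recycles the row names over the matrix, grouping values by row
rows <- split(as.matrix(value_cols), row.names(t1))

# Remove the empty strings from each group
result <- lapply(rows, function(x) x[x != ""])
result
```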
You might think of this as a reshaping task in order to scale it for multiple columns, i.e. create a column of values rather than tracking throughout that you're working with columns V2 and V3. That way, you can do it in one pass with some basic tidyverse functions. This also lets you easily filter the data before making the list, based on removing blanks or any other condition, again without specifying columns.
library(dplyr)
# thanks @Gregor for filling in the data
tibble::rownames_to_column(t1, var = "clust") %>%
select(-V1) %>%
tidyr::gather(key, value, -clust) %>%
filter(value != "") %>%
split(.$clust) %>%
purrr::map("value")
#> $clus1
#> [1] "a" "d"
#>
#> $clus2
#> [1] "b" "e"
#>
#> $clus3
#> [1] "c"

How to break a vector into subvectors in R

I have a vector like:
A B C A B A B D D E
and I'd like to break it into as many vectors as the number of "A" I have, like:
A B C
A B
A B D D E
Is there a way to accomplish this task?
You can use split and cumsum:
split(x, cumsum(x == "A"))
What you get in return is a list of vectors. A list seems most useful to me here since it allows vectors of different sizes in each element (unlike a data.frame for instance).
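A short worked sketch, assuming the vector from the question is stored as a character vector named x:

```r
x <- c("A", "B", "C", "A", "B", "A", "B", "D", "D", "E")

# cumsum(x == "A") increases by one at every "A", giving a group id per element
groups <- cumsum(x == "A")

# split() collects the elements that share a group id
parts <- split(x, groups)
parts
```

If the vector did not start with "A", the leading elements would land in a group labelled "0".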
This is not as elegant as the split approach, but strsplit also works (assuming every element of vec is a single character):
strsplit(paste0("A", strsplit(paste0(vec, collapse = ""), "A")[[1]][-1]),"")
# [[1]]
# [1] "A" "B" "C"
# [[2]]
# [1] "A" "B"
# [[3]]
# [1] "A" "B" "D" "D" "E"

How to get a named list element in R if the appearance of the element is conditional?

I want to include a list element c in a list L in R and name it C.
The example is as follows:
a=c(1,2,3)
b=c("a","b","c")
c=rnorm(3)
L<-list(A=a,
B=b,
C=c)
print(L)
## $A
## [1] 1 2 3
##
## $B
## [1] "a" "b" "c"
##
## $C
## [1] -2.2398424 0.9561929 -0.6172520
Now I want to introduce a condition on C, so it is only included in
the list if C.bool==T:
C.bool<-T
L<-list(A=a,
B=b,
if(C.bool) C=c)
print(L)
## $A
## [1] 1 2 3
##
## $B
## [1] "a" "b" "c"
##
## [[3]]
## [1] -2.2398424 0.9561929 -0.6172520
Now, however, the list element of c is not being named as specified in
the list statement. What's the trick here?
Edit: The intention is to include the element in the list only if the condition is met (no NULL should be included otherwise). Can this be done within the core definition of the list?
I don't know why you want to do it "without adding C outside the core definition of the list?" but if you're content with two lists in a single c then:
L <- c(list(A=a, B=b), if(C.bool) list(C=c))
If you really want one list but don't mind subsetting after creation then
L <- list(A=a, B=b, C=if(C.bool) c)[c(TRUE, TRUE, C.bool)]
(pace David Arenburg, isTRUE() omitted for brevity)
You can try this if you want to keep the names:
L2 <-list(A=a,
B=b,
C = if (TRUE) c)
You can of course replace TRUE with the statement containing C.bool. Note that when the condition is FALSE, the list still contains an element C whose value is NULL.
You could place the if statement outside the core definition of the list, like this:
L <- list(A = a, B= b)
if (isTRUE(C.bool)) L$C <- c
#> L
#$A
#[1] 1 2 3
#
#$B
#[1] "a" "b" "c"
#
#$C
#[1] -0.7631459 0.7353929 -0.2085646
(Edit with isTRUE() owing to the comment by @DavidArenburg)
As a combination of the previous answers by @MamounBenghezal, @user20637
and the comment made by @DavidArenburg, I would suggest this generalized
version that does not depend on the length of the list:
L <- Filter(Negate(is.null),
x = list(A = a, B = b, C = if (isTRUE(C.bool)) c, D = "foo"))
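A small sketch of how the Filter() version behaves for both values of the condition. The wrapper build_list() and the fixed vector c_vals (standing in for rnorm(3)) are hypothetical names used only for illustration.

```r
a <- c(1, 2, 3)
b <- c("a", "b", "c")
c_vals <- c(0.1, 0.2, 0.3)  # fixed stand-in for rnorm(3)

# if () without else yields NULL when the condition is FALSE;
# Filter(Negate(is.null), ...) then drops that element, names intact
build_list <- function(include_c) {
  Filter(Negate(is.null),
         list(A = a, B = b, C = if (isTRUE(include_c)) c_vals))
}

with_c <- build_list(TRUE)
without_c <- build_list(FALSE)
names(with_c)
names(without_c)
```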

Shuffling a vector - all possible outcomes of sample()?

I have a vector with five items.
my_vec <- c("a","b","a","c","d")
If I want to re-arrange those values into a new vector (shuffle), I could use sample():
shuffled_vec <- sample(my_vec)
Easy - but the sample() function only gives me one possible shuffle. What if I want to know all possible shuffling combinations? The various "combn" functions don't seem to help, and expand.grid() gives me every possible combination with replacement, when I need it without replacement. What's the most efficient way to do this?
Note that my vector contains the value "a" twice, so every shuffled vector returned should also contain "a" twice.
I think permn from the combinat package does what you want
library(combinat)
permn(my_vec)
A smaller example
> x
[1] "a" "a" "b"
> permn(x)
[[1]]
[1] "a" "a" "b"
[[2]]
[1] "a" "b" "a"
[[3]]
[1] "b" "a" "a"
[[4]]
[1] "b" "a" "a"
[[5]]
[1] "a" "b" "a"
[[6]]
[1] "a" "a" "b"
If the duplicates are a problem you could do something similar to this to get rid of duplicates
strsplit(unique(sapply(permn(my_vec), paste, collapse = ",")), ",")
Or probably a better approach to removing duplicates...
dat <- do.call(rbind, permn(my_vec))
dat[!duplicated(dat), ]  # keep only the first occurrence of each row
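For readers without the combinat package, here is a rough base R sketch of the same idea: enumerate every ordering, then keep the distinct rows. The recursive helper all_perms() is hypothetical and only practical for short vectors.

```r
# Hypothetical helper: every ordering of v, one per row of a matrix
all_perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v), function(i) {
    # Put element i first, followed by every ordering of the rest
    cbind(v[i], all_perms(v[-i]))
  }))
}

my_vec <- c("a", "b", "a", "c", "d")
perms <- all_perms(my_vec)

# duplicated() on a matrix compares whole rows
distinct <- perms[!duplicated(perms), , drop = FALSE]

nrow(perms)     # 5! = 120 orderings, counting the two "a"s separately
nrow(distinct)  # 5!/2! = 60 distinct orderings
```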
Noting that your data is effectively 5 levels from 1-5, encoded as "a", "b", "a", "c", and "d", I went looking for ways to get the permutations of the numbers 1-5 and then remap those to the levels you use.
Let's start with the input data:
my_vec <- c("a","b","a","c","d") # the character
my_vec_ind <- seq(1,length(my_vec),1) # their identifier
To get the permutations, I applied the function given at Generating all distinct permutations of a list in R:
permutations <- function(n){
if(n==1){
return(matrix(1))
} else {
sp <- permutations(n-1)
p <- nrow(sp)
A <- matrix(nrow=n*p,ncol=n)
for(i in 1:n){
A[(i-1)*p+1:p,] <- cbind(i,sp+(sp>=i))
}
return(A)
}
}
First, create a data.frame with the permutations:
tmp <- data.frame(permutations(length(my_vec)))
You now have a data frame tmp of 120 rows, where each row is a unique permutation of the numbers, 1-5:
>tmp
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 1 2 3 5 4
3 1 2 4 3 5
...
119 5 4 3 1 2
120 5 4 3 2 1
Now you need to remap them to the strings you had. You can remap them using a variation on the theme of gsub(), proposed here: R: replace characters using gsub, how to create a function?
gsub2 <- function(pattern, replacement, x, ...) {
for(i in 1:length(pattern))
x <- gsub(pattern[i], replacement[i], x, ...)
x
}
gsub() won't work because you have more than one value in the replacement array.
You also need a function you can call using lapply() to use the gsub2() function on every element of your tmp data.frame.
remap <- function(x,
old,
new){
return(gsub2(pattern = old,
replacement = new,
fixed = TRUE,
x = as.character(x)))
}
Almost there. We do the mapping like this:
shuffled_vec <- as.data.frame(lapply(tmp,
remap,
old = as.character(my_vec_ind),
new = my_vec))
which can be simplified to...
shuffled_vec <- as.data.frame(lapply(data.frame(permutations(length(my_vec))),
remap,
old = as.character(my_vec_ind),
new = my_vec))
.. should you feel the need.
That gives you your required answer:
> shuffled_vec
X1 X2 X3 X4 X5
1 a b a c d
2 a b a d c
3 a b c a d
...
119 d c a a b
120 d c a b a
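The remapping step can also be sketched with plain indexing instead of repeated gsub() calls, since each number in a permutation is simply a position in my_vec. The two-row matrix idx below is a hypothetical stand-in for the full table of 120 permutations.

```r
my_vec <- c("a", "b", "a", "c", "d")

# Two example permutations of the positions 1..5, one per row
idx <- rbind(c(1, 2, 3, 4, 5),
             c(5, 4, 3, 2, 1))

# Indexing by a matrix returns values in column-major order,
# so refilling a matrix with the same number of rows restores the layout
shuffled <- matrix(my_vec[idx], nrow = nrow(idx))
shuffled
```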
Looking at a previous question (R: generate all permutations of vector without duplicated elements), I can see that the gtools package has a function for this. I couldn't however get this to work directly on your vector as such:
library(gtools)
permutations(n = 5, r = 5, v = my_vec)
#Error in permutations(n = 5, r = 5, v = my_vec) :
# too few different elements
You can adapt it however like so:
apply(permutations(n = 5, r = 5), 1, function(x) my_vec[x])
# [,1] [,2] [,3] [,4]
#[1,] "a" "a" "a" "a" ...
#[2,] "b" "b" "b" "b" ...
#[3,] "a" "a" "c" "c" ...
#[4,] "c" "d" "a" "d" ...
#[5,] "d" "c" "d" "a" ...

How to analyze the data whose different rows have different number of elements using R?

The data format is as following, the first column is the id:
1, b, c
2, a, d, e, f
3, u, i, c
4, k, m
5, o
However, I have not managed to analyze this data. Do you have a good idea of how to read it into R? More generally, my question is: how can data whose rows have different numbers of elements be analyzed in R?
It seems you are trying to read a file whose rows have unequal lengths. The R structure suited to that is a list.
It is possible to do this by combining read.table with sep="\n" and then to apply strsplit on each row of data.
Here is an example:
dat <- "
1 A B
2 C D E
3 F G H I J
4 K L
5 M"
The code to read and convert to a list:
x <- read.table(textConnection(dat), sep="\n")
apply(x, 1, function(i)strsplit(i, "\\s")[[1]])
The results:
[[1]]
[1] "1" "A" "B"
[[2]]
[1] "2" "C" "D" "E"
[[3]]
[1] "3" "F" "G" "H" "I" "J"
[[4]]
[1] "4" "K" "L"
[[5]]
[1] "5" "M"
You can now use any list manipulation technique to work with your data.
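As a minimal sketch of such list handling, assuming the comma-separated layout from the question with the id in the first field (the names raw, fields, ids and rows are illustrative):

```r
raw <- c("1, b, c", "2, a, d, e, f", "3, u, i, c", "4, k, m", "5, o")

# Split each line on a comma followed by optional whitespace
fields <- strsplit(raw, ",\\s*")

# Use the first field as the id and keep the remaining fields as the values
ids <- vapply(fields, `[`, character(1), 1)
rows <- setNames(lapply(fields, `[`, -1), ids)

lengths(rows)   # number of elements per id
rows[["2"]]     # values for id 2
```

Keeping the ids as list names lets each row be looked up directly, whatever its length.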
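Another option is to read the file with readLines and split each line with strsplit:
text <- readLines("./xx.txt", encoding = 'UTF-8', n = -1L)
# strsplit() takes `split`, not `sep`; the result is a list with one vector per line
# (wrap it in unlist() only if a single flat vector is wanted)
txt <- strsplit(text, split = " ")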

Resources