Count repetitions of a set of characters

How can I count repetitions of a set of characters in a vector? Imagine the following vector consisting of "A" and "B":
x <- c("A", "A", "A", "B", "B", "A", "A", "B", "A")
In this example, the first set would be the sequence of "A" and "B" from index 1 to 5, the second set is the sequence of "A" and "B" from index 6 to 8, and then the third set is the last single "A":
x <- c("A", "A", "A", "B", "B", # set 1
"A", "A", "B", # set 2
"A") # set 3
How can I set a counter for each set? I need a vector like this:
c(1, 1, 1, 1, 1, 2, 2, 2, 3)
thanks

Use rle:
x <- c("A", "A", "A", "B", "B", "A", "A", "B", "A")
tmp <- rle(x)
#Run Length Encoding
# lengths: int [1:5] 3 2 2 1 1
# values : chr [1:5] "A" "B" "A" "B" "A"
Now change the values:
tmp$values <- ave(rep(1L, length(tmp$values)), tmp$values, FUN = cumsum)
and invert the run-length encoding:
y <- inverse.rle(tmp)
#[1] 1 1 1 1 1 2 2 2 3
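The three steps above can be wrapped in a small function. This is only a sketch: the name count_sets_rle is made up for illustration, and it assumes a "set" is numbered by how many times its value has started a run so far.

```r
# Hypothetical helper (name is not from the original answer): number each
# run by the occurrence count of its value, then expand back to full length.
count_sets_rle <- function(v) {
  runs <- rle(v)
  # For every run, count how often its value has appeared among the runs so far
  runs$values <- ave(rep(1L, length(runs$values)), runs$values, FUN = cumsum)
  inverse.rle(runs)
}

x <- c("A", "A", "A", "B", "B", "A", "A", "B", "A")
count_sets_rle(x)
# [1] 1 1 1 1 1 2 2 2 3
```

Because the counter is attached to the runs rather than to specific letters, the same function should also work for vectors with more than two distinct values.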

Alternative 1.
cumsum(c(TRUE, diff(match(x, c("A", "B"))) == -1))
# [1] 1 1 1 1 1 2 2 2 3
Step by step:
match(x, c("A", "B"))
# [1] 1 1 1 2 2 1 1 2 1
diff(match(x, c("A", "B")))
# [1] 0 0 1 0 -1 0 1 -1
diff(match(x, c("A", "B"))) == -1
# [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
c(TRUE, diff(match(x, c("A", "B"))) == -1)
# [1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
Alternative 2.
Using data.table::rleid:
library(data.table)
cumsum(c(TRUE, diff(rleid(x) %% 2) == 1))
# [1] 1 1 1 1 1 2 2 2 3
Step by step:
rleid(x)
# [1] 1 1 1 2 2 3 3 4 5
rleid(x) %% 2
# [1] 1 1 1 0 0 1 1 0 1
diff(rleid(x) %% 2)
# [1] 0 0 -1 0 1 0 -1 1
diff(rleid(x) %% 2) == 1
# [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
c(TRUE, diff(rleid(x) %% 2) == 1)
# [1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
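The match/diff idea and the rle idea agree on the example, but they are not equivalent in general. A quick check with a made-up vector z that starts with "B" suggests they define a "set" differently, so which one is right depends on what you want:

```r
z <- c("B", "B", "A", "B", "A")  # made-up input that starts with "B"

# match/diff approach: a new set starts at every "B" -> "A" transition
by_match <- cumsum(c(TRUE, diff(match(z, c("A", "B"))) == -1))
by_match
# [1] 1 1 2 2 3

# rle/ave approach: each run is numbered by how often its value has occurred
runs <- rle(z)
runs$values <- ave(rep(1L, length(runs$values)), runs$values, FUN = cumsum)
by_rle <- inverse.rle(runs)
by_rle
# [1] 1 1 1 2 2
```

If a set must always begin with "A", the match/diff version is probably the closer fit; if a set is simply "the n-th run of each letter", the rle version is.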

We can use only base R methods
x1 <- split(x, cumsum(c(TRUE, x[-1]!= x[-length(x)])))
x2 <- sapply(x1, `[`, 1)
as.numeric(rep(ave(x2, x2, FUN = seq_along), lengths(x1)))
#[1] 1 1 1 1 1 2 2 2 3
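The cumsum(...) expression in the first line is effectively a base R run id. As a rough sketch, assuming there are exactly two distinct values and the vector starts with the first value of a set, the run id alone is enough to get the counter (run_id and set_id are illustrative names):

```r
x <- c("A", "A", "A", "B", "B", "A", "A", "B", "A")

# A new run starts wherever an element differs from its predecessor
run_id <- cumsum(c(TRUE, x[-1] != x[-length(x)]))
run_id
# [1] 1 1 1 2 2 3 3 4 5

# With two alternating values, every pair of consecutive runs forms one set
set_id <- ceiling(run_id / 2)
set_id
# [1] 1 1 1 1 1 2 2 2 3
```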

Related

Match vectors in sequence

I have 2 vectors.
x=c("a", "b", "c", "d", "a", "b", "c")
y=structure(c(1, 2, 3, 4, 5, 6, 7, 8), .Names = c("a", "e", "b",
"c", "d", "a", "b", "c"))
I would like to match a to a, b to b in sequence accordingly, so that x[2] matches y[3] rather than y[7]; and x[5] matches y[6] rather than y[1], so on and so forth.
lapply(x, function(z) grep(z, names(y), fixed=T))
gives:
[[1]]
[1] 1 6
[[2]]
[1] 3 7
[[3]]
[1] 4 8
[[4]]
[1] 5
[[5]]
[1] 1 6
[[6]]
[1] 3 7
[[7]]
[1] 4 8
which matches all instances. How do I get this sequence:
1 3 4 5 6 7 8
So that elements in x can be mapped to the corresponding values in y accordingly?

You are actually looking for pmatch
pmatch(x,names(y))
[1] 1 3 4 5 6 7 8
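To see why pmatch fits here, it may help to compare it with match on the vectors from the question. As far as I understand the defaults (duplicates.ok = FALSE), pmatch removes a table entry from the search once it has been matched, whereas match always returns the first hit. Note that pmatch also does partial matching, so this sketch assumes exact names are what you want:

```r
x <- c("a", "b", "c", "d", "a", "b", "c")
y <- c(a = 1, e = 2, b = 3, c = 4, d = 5, a = 6, b = 7, c = 8)

# match() always returns the first position of each name
match(x, names(y))
# [1] 1 3 4 5 1 3 4

# pmatch() uses each table entry at most once by default
idx <- pmatch(x, names(y))
idx
# [1] 1 3 4 5 6 7 8

# Map the elements of x to the corresponding values in y
unname(y[idx])
# [1] 1 3 4 5 6 7 8
```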

You can change the names attributes according to the number of times each element appeared and then subset y:
x2 <- paste0(x, ave(x, x, FUN=seq_along))
#[1] "a1" "b1" "c1" "d1" "a2" "b2" "c2"
names(y) <- paste0(names(y), ave(names(y), names(y), FUN=seq_along))
y[x2]
#a1 b1 c1 d1 a2 b2 c2
# 1 3 4 5 6 7 8
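The same occurrence-index idea can be used without renaming y, by building the keys on the fly and passing them to match. A minimal sketch, where occ_key is a hypothetical helper name and the "_" separator is an arbitrary choice:

```r
x <- c("a", "b", "c", "d", "a", "b", "c")
y <- c(a = 1, e = 2, b = 3, c = 4, d = 5, a = 6, b = 7, c = 8)

# Hypothetical helper: append the occurrence number of each element,
# e.g. the second "a" becomes "a_2"
occ_key <- function(v) paste0(v, "_", ave(seq_along(v), v, FUN = seq_along))

occ_key(x)
# [1] "a_1" "b_1" "c_1" "d_1" "a_2" "b_2" "c_2"

# The n-th "a" in x is matched to the n-th "a" in names(y)
idx <- match(occ_key(x), occ_key(names(y)))
idx
# [1] 1 3 4 5 6 7 8
```

Unlike the sequential approaches, this pairs the n-th occurrence in x with the n-th occurrence in names(y) regardless of order, which may or may not be what you need.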

Another option using Reduce
Reduce(function(v, k) y[-seq_len(v)][k],
x=x[-1L],
init=y[x[1L]],
accumulate=TRUE)

Well, I did it with a for-loop
#Initialise the vector with length same as x.
answer <- numeric(length(x))
for (i in seq_along(x)) {
#match the ith element of x against the names of y.
answer[i] <- match(x[i], names(y))
#Blank out the name at the matched position (not position i) so that the
#next lookup of the same name returns the next occurrence.
names(y)[answer[i]] <- ""
}
answer
#[1] 1 3 4 5 6 7 8

Another possibility:
l <- lapply(x, grep, x = names(y), fixed = TRUE)
i <- as.integer(ave(x, x, FUN = seq_along))
mapply(`[`, l, i)
which gives:
[1] 1 3 4 5 6 7 8

Similar to Ronak's solution, but it does not modify y:
yFoo<-names(y)
sapply(x,function(u){res<-match(u,yFoo);yFoo[res]<<-"foo";return(res)})
Result
#a b c d a b c
#1 3 4 5 6 7 8

Replace characters in a column with numbers R

I have a matrix with last column contains characters:
A
B
B
A
...
I would like to replace A with 1 and B with 2 in R. The expected result should be:
1
2
2
1
...

If you are 100% confident only "A" and "B" appear
sample_data = c("A", "B", "B", "A")
sample_data
# [1] "A" "B" "B" "A"
as.numeric(gsub("A", 1, gsub("B", 2, sample_data)))
# [1] 1 2 2 1

Using factor or a simple lookup table would be much more flexible:
sample_data = c("A", "B", "B", "A")
Recommended:
as.numeric(factor(sample_data))
# [1] 1 2 2 1
Possible alternative:
as.numeric(c("A" = "1", "B" = "2")[sample_data])
# [1] 1 2 2 1
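One difference between the two options is worth spelling out: as.numeric(factor(...)) numbers the values by their sorted level order, while a lookup table uses whatever codes you give it. A small sketch with a made-up extra value "C" shows how unexpected values could be handled (the variable names are illustrative):

```r
sample_data <- c("A", "B", "B", "A", "C")  # "C" is an unexpected value

# Lookup table: names are the old values, elements are the new codes
lookup <- c(A = 1, B = 2)
from_lookup <- unname(lookup[sample_data])
from_lookup
# [1]  1  2  2  1 NA

# Factor with explicit levels: anything outside the levels becomes NA
from_factor <- as.integer(factor(sample_data, levels = c("A", "B")))
from_factor
# [1]  1  2  2  1 NA
```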

Replacing Values in R - Error Received

So I have a data frame (called gen) filled with nucleotide information: each value is either A, C, G, or T. I am looking to replace A with 1, C with 2, G with 3, and T with 4. When I use the function gen[gen==A] = 1, I get the error:
Error in `[<-.data.frame`(`*tmp*`, gen == A, value = 1) :
  object 'A' not found
I even tried using gen <- replace(gen, gen == A, 1), but it gives me the same error. Does anyone know how to fix this error? If not, is there a package that I can install in R with a program that will convert A, C, G, and T to numeric values?
Thanks

You need to wrap A in quotes or else R looks for a variable named A.
If the columns are character vectors:
R> gen = data.frame(x = sample(c("A", "C", "G", "T"), 10, replace = TRUE), y = sample(c("A", "C", "G", "T"), 10, replace= TRUE), stringsAsFactors = FALSE)
R> gen[gen == "A"] = 1
R> gen
x y
1 1 1
2 C C
3 G T
4 T T
5 G G
6 G G
7 1 1
8 C C
9 T 1
10 1 1
Alternatively, one way to recode all values at once:
R> library(car)
R> sapply(gen, recode, recodes = "'A'=1; 'C'=2; 'G'=3; 'T'=4")
x y
[1,] 1 1
[2,] 2 2
[3,] 3 4
[4,] 4 4
[5,] 3 3
[6,] 3 3
[7,] 1 1
[8,] 2 2
[9,] 4 1
[10,] 1 1
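If the columns are factors (the default for character columns before R 4.0.0; in newer versions you have to ask for it with stringsAsFactors = TRUE):
R> gen = data.frame(x = sample(c("A", "C", "G", "T"), 10, replace = TRUE), y = sample(c("A", "C", "G", "T"), 10, replace= TRUE), stringsAsFactors = TRUE)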
R> sapply(gen, as.numeric)
x y
[1,] 1 1
[2,] 2 4
[3,] 1 2
[4,] 4 1
[5,] 2 2
[6,] 1 4
[7,] 4 3
[8,] 3 3
[9,] 2 4
[10,] 4 2
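If you would rather not depend on car or on factor level order, a named lookup vector in base R is another option. This is a sketch with a small made-up data frame; codes is an illustrative name, and as.character() is there so the same line works for character and factor columns:

```r
# Explicit mapping from nucleotide to number
codes <- c(A = 1, C = 2, G = 3, T = 4)

# Small made-up example in place of the real data
gen <- data.frame(x = c("A", "C", "G", "T"),
                  y = c("T", "T", "A", "G"),
                  stringsAsFactors = FALSE)

# Recode every column; gen[] keeps the data frame structure
gen[] <- lapply(gen, function(col) unname(codes[as.character(col)]))
gen
#   x y
# 1 1 4
# 2 2 4
# 3 3 1
# 4 4 3
```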

R Equality while ignoring NAs

Is there an equivalent of == but with the result that x != NA if x is not NA?
The following does what I want, but it's clunky:
mapply(identical, vec1, vec2)

Just replace "==" with %in%.
Example:
> df <- data.frame(col1= c("a", "b", NA), col2= 1:3)
> df
col1 col2
1 a 1
2 b 2
3 <NA> 3
> df[df$col1=="a", ]
col1 col2
1 a 1
NA <NA> NA
> df[df$col1%in%"a", ]
col1 col2
1 a 1
> "x"==NA
[1] NA
> "x"%in%NA
[1] FALSE

1 == NA returns a logical NA rather than TRUE or FALSE. If you want to call NA FALSE, you could add a second conditional:
set.seed(1)
x <- 1:10
x[4] <- NA
y <- sample(1:10, 10)
x <= y
# [1] TRUE TRUE TRUE NA FALSE TRUE TRUE FALSE TRUE FALSE
x <= y & !is.na(x)
# [1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
You could also use a second processing step to convert all the NA values from your equality test to FALSE.
foo <- x <= y
foo[is.na(foo)] <- FALSE
foo
# [1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
Also, for what it's worth, NA == NA returns NA, as does NA != NA.
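The second processing step can be packaged as a helper. A minimal sketch, where eq_or_false is a made-up name; note that it treats NA compared with NA as FALSE, which differs from the mapply(identical, ...) version in the question:

```r
# Hypothetical helper: like ==, but any comparison involving NA gives FALSE
eq_or_false <- function(a, b) {
  res <- a == b
  res[is.na(res)] <- FALSE
  res
}

a <- c(1, NA, 3, NA)
b <- c(1, 2, 4, NA)

eq_or_false(a, b)
# [1]  TRUE FALSE FALSE FALSE

# Equivalent one-liner: NA is not %in% TRUE
(a == b) %in% TRUE
# [1]  TRUE FALSE FALSE FALSE

# For comparison: identical() considers the two NAs in position 4 equal
mapply(identical, a, b)
# [1]  TRUE FALSE FALSE  TRUE
```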

The == operator is often used in combination with filtering data.frames.
In that situation, dplyr::filter will retain only rows where your condition evaluates to TRUE, unlike [. That effectively implements == but where 1 == NA evaluates as FALSE.
Example:
> df <- data.frame(col1= c("a", "b", NA), col2= 1:3)
> df
col1 col2
1 a 1
2 b 2
3 <NA> 3
> dplyr::filter(df, col1=="a")
col1 col2
1 a 1

Why not use base R:
df <- data.frame(col1 = c("a", "b", NA), col2 = 1:3, col3 = 11:13)
df
subset(x = df, subset = col1=="a", select = c(col1, col2))
# col1 col2
# 1 a 1
or with vectors:
df <- c("a", "b", NA)
subset(x = df, subset = df == "a")

Data frame of tables from a list

Suppose I have a list with observations:
foo <- list(c("C", "E", "A", "F"), c("B", "D", "B", "A", "C"), c("B",
"C", "C", "F", "A", "F"), c("D", "A", "A", "D", "D", "F", "B"
))
> foo
[[1]]
[1] "C" "E" "A" "F"
[[2]]
[1] "B" "D" "B" "A" "C"
[[3]]
[1] "B" "C" "C" "F" "A" "F"
[[4]]
[1] "D" "A" "A" "D" "D" "F" "B"
And a vector with each unique element:
vec <- LETTERS[1:6]
> vec
[1] "A" "B" "C" "D" "E" "F"
I want to obtain a data frame with the counts of each element of vec in each element of foo. I can do this with plyr in a very ugly unvectorized way:
> ldply(foo,function(x)sapply(vec,function(y)sum(y==x)))
A B C D E F
1 1 0 1 0 1 1
2 1 2 1 1 0 0
3 1 1 2 0 0 2
4 2 1 0 3 0 1
But that's obviously slow. How can this be done faster? I know of table() but haven't really figured out how to use it due to 0-counts in some of the elements of foo.

One solution (off the top of my head):
# convert foo to a list of factors
lfoo <- lapply(foo, factor, levels=LETTERS[1:6])
# apply table() to each list element
t(sapply(lfoo, table))
A B C D E F
[1,] 1 0 1 0 1 1
[2,] 1 2 1 1 0 0
[3,] 1 1 2 0 0 2
[4,] 2 1 0 3 0 1
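If table() on factors turns out to be a bottleneck, one possible variant is tabulate() on the positions returned by match(). This is only a sketch (I have not benchmarked it), and it assumes every element of foo appears in vec:

```r
foo <- list(c("C", "E", "A", "F"),
            c("B", "D", "B", "A", "C"),
            c("B", "C", "C", "F", "A", "F"),
            c("D", "A", "A", "D", "D", "F", "B"))
vec <- LETTERS[1:6]

# For each list element, count how often each position of vec occurs;
# nbins guarantees a zero-filled result of fixed length
counts <- t(vapply(foo,
                   function(v) tabulate(match(v, vec), nbins = length(vec)),
                   integer(length(vec))))
colnames(counts) <- vec
as.data.frame(counts)
#   A B C D E F
# 1 1 0 1 0 1 1
# 2 1 2 1 1 0 0
# 3 1 1 2 0 0 2
# 4 2 1 0 3 0 1
```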

or with reshape:
library(reshape)
cast(melt(foo), L1 ~ value, length)[-1]
