Using rle to eliminate first and last sequences - r

I am trying to solve a problem in R using rle() (or another relevant function) but am not sure where to start. The problem is as follows: foo, bar, baz, and qux can each be in one of three positions: A, B, or C.
Their first position will always be A, and their last position will always be C, but their positions in between are random.
My objective is to eliminate the first A or first sequence of A's, and the last C or the last sequence of C's. For example:
> foo
position
1 A
2 A
3 A
4 B
5 B
6 A
7 B
8 A
9 C
10 C
> output(foo)
position
4 B
5 B
6 A
7 B
8 A
> bar
position
1 A
2 B
3 A
4 B
5 A
6 C
7 C
8 C
9 C
10 C
> output(bar)
position
2 B
3 A
4 B
5 A
> baz
position
1 A
2 A
3 A
4 A
5 A
6 C
7 C
8 C
9 C
10 C
> output(baz)
NULL
> qux
position
1 A
2 C
3 A
4 C
5 A
6 C
> output(qux)
position
2 C
3 A
4 C
5 A
Basic rle() will tell me about the sequences and their lengths but it will not preserve row indices. How should one go about solving this problem?
> rle(foo$position)
Run Length Encoding
lengths: int [1:6] 3 2 1 1 1 2
values : chr [1:6] "A" "B" "A" "B" "A" "C"

I would write a function using cumsum that finds how many consecutive values at the start equal first_position and how many consecutive values at the end equal last_position, and then removes them.
get_reduced_data <- function(dat, first_position, last_position) {
    dat[cumsum(dat != first_position) != 0 &
        rev(cumsum(rev(dat) != last_position) != 0)]
}
get_reduced_data(foo, first_position, last_position)
#[1] "B" "B" "A" "B" "A"
get_reduced_data(bar, first_position, last_position)
#[1] "B" "A" "B" "A"
get_reduced_data(baz, first_position, last_position)
#character(0)
get_reduced_data(qux, first_position, last_position)
#[1] "C" "A" "C" "A"
data
foo <- c("A", "A", "A", "B", "B", "A", "B", "A", "C", "C")
bar <- c("A", "B","A", "B", "A", "C", "C", "C", "C", "C")
baz <- c(rep("A", 5), rep("C", 5))
qux <- c("A", "C", "A", "C", "A", "C")
first_position <- "A"
last_position <- "C"
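To keep the row indices the question asks about, the same cumsum idea can be applied to a data frame instead of a bare vector. The following is only a sketch under that assumption; trim_runs and foo_df are hypothetical names, not part of the answer above.

```r
# Hypothetical helper (not from the original answer): applies the cumsum
# masks to a data frame with a `position` column so row names survive.
trim_runs <- function(df, first = "A", last = "C") {
    pos <- as.character(df$position)
    # FALSE for the leading run of `first`, TRUE from the first other value on
    keep_head <- cumsum(pos != first) != 0
    # FALSE for the trailing run of `last`, TRUE up to the last other value
    keep_tail <- rev(cumsum(rev(pos) != last) != 0)
    df[keep_head & keep_tail, , drop = FALSE]
}

foo_df <- data.frame(position = c("A", "A", "A", "B", "B", "A", "B", "A", "C", "C"),
                     stringsAsFactors = FALSE)
baz_df <- data.frame(position = c(rep("A", 5), rep("C", 5)),
                     stringsAsFactors = FALSE)

res_foo <- trim_runs(foo_df)   # rows 4 to 8 remain, with their original row names
res_baz <- trim_runs(baz_df)   # zero rows remain
```

Note that an all-A-then-all-C input gives a zero-row data frame here rather than NULL.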

Here is one option with rle. The idea is to take the first and last run values, set each to NA if it equals 'A' or 'C' respectively, expand the runs back with inverse.rle, and turn the result into a logical vector for subsetting:
i1 <- !is.na(inverse.rle(within.list(rle(foo$position),
values[c(1, length(values))][values[c(1, length(values))] == c("A", "C")] <- NA)))
foo[i1, , drop = FALSE]
# position
#4 B
#5 B
#6 A
#7 B
#8 A
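The one-liner above can be unpacked into separate steps, which may be easier to follow. This is a sketch of the same idea rather than the answerer's code; keep_run and foo_df are hypothetical names.

```r
foo_df <- data.frame(position = c("A", "A", "A", "B", "B", "A", "B", "A", "C", "C"),
                     stringsAsFactors = FALSE)

r <- rle(foo_df$position)          # runs: A(3) B(2) A(1) B(1) A(1) C(2)
n <- length(r$values)

keep_run <- rep(TRUE, n)           # one flag per run
if (n > 0 && r$values[1] == "A") keep_run[1] <- FALSE   # drop a leading run of A
if (n > 0 && r$values[n] == "C") keep_run[n] <- FALSE   # drop a trailing run of C

r$values <- keep_run               # replace the run values by the flags
mask <- inverse.rle(r)             # expand back to one flag per row

res <- foo_df[mask, , drop = FALSE]
```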

A data.table approach could be (with df holding the data, e.g. foo),
library(data.table)
setDT(df)[, grp := rleid(position)][
!(grp == 1 & position == 'A' | grp == max(grp) & position == 'C'), ][
, grp := NULL][]
which gives,
position
1: B
2: B
3: A
4: B
5: A
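For readers without data.table, the run ids that rleid() produces can be reproduced in base R from rle(). The sketch below assumes a plain character vector named position and applies the same filter condition; it is an illustration, not part of the answer above.

```r
position <- c("A", "A", "A", "B", "B", "A", "B", "A", "C", "C")

r <- rle(position)
# One id per run, repeated for the length of that run
grp <- rep(seq_along(r$lengths), r$lengths)

# Same condition as in the data.table filter: first run if it is A,
# last run if it is C
drop <- (grp == 1 & position == "A") | (grp == max(grp) & position == "C")
kept <- position[!drop]
```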

Another possible solution without rle by creating an index and subsetting rows to between first occurrence of non-A and last occurrence of non-C:
library(data.table)
output <- function(DT) {
    DT[, rn := .I][, {
        mn <- min(which(position != "A"))
        mx <- max(which(position != "C"))
        if (mn > mx) return(NULL)
        .SD[mn:mx]
    }]
}
output(setDT(foo))
# position rn
#1: B 4
#2: B 5
#3: A 6
#4: B 7
#5: A 8
output(setDT(baz))
#NULL
data:
foo <- fread("position
A
A
A
B
B
A
B
A
C
C")
baz <- fread("position
A
A
A
A
A
C
C
C
C
C")

The problem seems to be two-fold: trimming the 'first' and 'last' elements, and identifying what constitutes 'first' and 'last'. I like your rle() approach, because it maps many possibilities into a common structure. So the task is to write a function to mask the first and last elements of a vector of any length
mask_end = function(x) {
    n = length(x)
    mask = !logical(n)
    mask[c(min(1, n), max(0, n))] = FALSE # allow for 0-length x
    mask
}
This is very easy to test comprehensively
> mask_end(integer(0))
logical(0)
> mask_end(integer(1))
[1] FALSE
> mask_end(integer(2))
[1] FALSE FALSE
> mask_end(integer(3))
[1] FALSE TRUE FALSE
> mask_end(integer(4))
[1] FALSE TRUE TRUE FALSE
The solution (returning the mask; easy to modify to return the actual values, x[inverse.rle(r)]) is then
mask_end_runs = function(x) {
    r = rle(x)
    r$values = mask_end(r$values)
    inverse.rle(r)
}
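A short usage sketch, assuming the data are data frames with a character position column as in the question (foo_df, qux_df and baz_df are hypothetical names). Note that mask_end_runs drops the first and last runs whatever their values are, which matches the question only because the data always start with A and end with C.

```r
mask_end <- function(x) {
    n <- length(x)
    mask <- !logical(n)
    mask[c(min(1, n), max(0, n))] <- FALSE   # allow for 0-length x
    mask
}
mask_end_runs <- function(x) {
    r <- rle(x)
    r$values <- mask_end(r$values)
    inverse.rle(r)
}

foo_df <- data.frame(position = c("A", "A", "A", "B", "B", "A", "B", "A", "C", "C"),
                     stringsAsFactors = FALSE)
qux_df <- data.frame(position = c("A", "C", "A", "C", "A", "C"),
                     stringsAsFactors = FALSE)
baz_df <- data.frame(position = c(rep("A", 5), rep("C", 5)),
                     stringsAsFactors = FALSE)

# Subsetting rows with the mask keeps the original row names
res_foo <- foo_df[mask_end_runs(foo_df$position), , drop = FALSE]
res_qux <- qux_df[mask_end_runs(qux_df$position), , drop = FALSE]
res_baz <- baz_df[mask_end_runs(baz_df$position), , drop = FALSE]
```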

Related

Sorting data frame by character string

I have a data frame and need to sort its columns by a character string.
I tried it like this:
# character string
a <- c("B", "E", "A", "D", "C")
# data frame
data <- data.frame(A = c(0, 0, 1), B = c(1, 1, 1), C = c(1, 0, 1), D = c(0, 0, 1), E = c(0, 1, 1))
data
# A B C D E
# 1 0 1 1 0 0
# 2 0 1 0 0 1
# 3 1 1 1 1 1
# sorting
data.sorted <- data[, order(a)]
# order of characters in data
colnames(data.sorted)
# [1] "C" "A" "E" "D" "B"
However, the order of columns in the sorted data frame is not the same as the characters in the original character string.
Is there a way to sort the columns so that they follow the order of the character string?
The function order(a) returns the position in the vector a that each ranked value lies in. So, since "A" (ranked first) lies in the third position of a, order(a)[1] is equal to 3. Similarly, "C" (ranked third) lies in the fifth position of a, so order(a)[3] equals 5. Indexing with data[, order(a)] therefore picks columns 3, 1, 5, 4, 2 of data, which is not the order given in a.
Luckily, your solution is actually even simpler, thanks to the way R works with brackets. If you ask to see just the column named "B" you'll get:
> data[, "B", drop=FALSE]
B
1 1
2 1
3 1
Or if you want two specific columns
> data[, c("B", "E")]
B E
1 1 0
2 1 1
3 1 1
And finally, more generally, if you have a whole vector by which you want to order your columns, then you can do that, too:
> data.sorted <- data[, a]
> data.sorted
B E A D C
1 1 0 0 0 1
2 1 1 0 0 0
3 1 1 1 1 1
> all(colnames(data.sorted)==a)
[1] TRUE
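To see the difference concretely, it can help to print the index vectors side by side. The sketch below uses the question's data; the match() call is an alternative that returns the column positions explicitly and is offered here as an illustration, not as part of the answer above.

```r
a <- c("B", "E", "A", "D", "C")
data <- data.frame(A = c(0, 0, 1), B = c(1, 1, 1), C = c(1, 0, 1),
                   D = c(0, 0, 1), E = c(0, 1, 1))

# Where each ranked value sits in `a`: "A" is 3rd, "B" is 1st, ...
ord <- order(a)

# Where each element of `a` sits among the column names
pos <- match(a, colnames(data))

by_name  <- data[, a]      # index by name
by_match <- data[, pos]    # index by position, same result
```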
string[] str = { "H", "G", "F", "D", "S","A" };
Array.Sort(str);
for (int i = 0; i < str.Length; i++)
{
Console.WriteLine(str[i]);
}
Console.ReadLine();

Extract n rows after string in R

I would like to extract the next 'n' rows after I find a string in R.
For example, let's say I have the following data frame:
df<-as.data.frame(rep(c("a","b","c","d","e","f"),10))
I would like to extract every row that includes "b", as well as the next two rows (in this example, I would like to extract rows with "b", or "c", or "d")
BUT, please, I don't want to specify "c" and "d", I just want the next two rows after "b" as well (in my real data the next two rows are not consistent).
I've tried many things, but no success.. Thanks in advance! Nick
You can find the indices of rows with b and then use those and the next two of each, something like this:
df <- data.frame(col1=rep(c("a","b","c","d","e","f"),3), col2=letters[1:18], stringsAsFactors = FALSE)
df
col1 col2
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 a g
8 b h
9 c i
10 d j
11 e k
12 f l
13 a m
14 b n
15 c o
16 d p
17 e q
18 f r
bs <- which(df$col1=="b")
df[sort(bs + rep(0:2, each = length(bs))), ] # 0:2 selects each match (b) plus the 2 rows after it
col1 col2
2 b b
3 c c
4 d d
8 b h
9 c i
10 d j
14 b n
15 c o
16 d p
I added a second column to illustrate the dataframe better, otherwise a vector would be returned.
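The indexing above assumes the windows never overlap and never run past the last row. A sketch of a more defensive variant is below; rows_after is a hypothetical helper name, and dropping duplicates and out-of-range indices is a design choice for this illustration, not something the answer above prescribes.

```r
# Hypothetical helper: rows at `hits` plus the `n` rows after each hit
rows_after <- function(df, hits, n) {
    # Every hit plus offsets 0..n, flattened into one index vector
    idx <- as.vector(outer(hits, 0:n, "+"))
    idx <- sort(unique(idx))          # overlapping windows yield each row once
    idx <- idx[idx <= nrow(df)]       # drop indices beyond the last row
    df[idx, , drop = FALSE]
}

df2 <- data.frame(col1 = c("a", "b", "b", "c", "d", "b"),
                  stringsAsFactors = FALSE)
res <- rows_after(df2, which(df2$col1 == "b"), 2)
```

Using drop = FALSE keeps the result a data frame even when it has a single column.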
My "SOfun" package has a function called getMyRows which does what you ask for, with the exception of returning a list instead of a data.frame.
I had left the result as a list to make it easier to handle some edge cases, like where the requests for rows would overlap. For example, in the following sample data, there are two consecutive "b" values. There's also a "b" value in the final row.
df <- data.frame(col1 = c("a", "b", "b",
rep(c("a", "b", "c", "d", "e", "f"), 3), "b"),
col2 = letters[1:22])
library(SOfun)
getMyRows(df, which(df$col1 == "b"), 0:2, TRUE)
# [[1]]
# col1 col2
# 2 b b
# 3 b c
# 4 a d
#
# [[2]]
# col1 col2
# 3 b c
# 4 a d
# 5 b e
#
# [[3]]
# col1 col2
# 5 b e
# 6 c f
# 7 d g
#
# [[4]]
# col1 col2
# 11 b k
# 12 c l
# 13 d m
#
# [[5]]
# col1 col2
# 17 b q
# 18 c r
# 19 d s
#
# [[6]]
# col1 col2
# 22 b v
The usage is essentially:
Specify the data.frame.
Specify the index positions to use as the base. Here, we want all rows where "col1" equals "b" to be our base index position.
Specify the range of rows you are interested in. -1:3, for example, would give you one row before to three rows after the base.
TRUE means that you are specifying the starting points by their numeric indices.

R: report the appearance sequence number of a value

So I have a list like below
1. a
2. a
3. b
4. b
5. c
6. c
7. a
8. a
Is it possible to get a list that numbers each element by its order of appearance among equal values in R:
1. 1
2. 2
3. 1
4. 2
5. 1
6. 2
7. 3
8. 4
You can use ave. Assuming your vector is called "x", try:
ave(x, x, FUN = seq_along)
# [1] "1" "2" "1" "2" "1" "2" "3" "4"
It's total overkill, but getanID from my "splitstackshape" package also does this:
library(splitstackshape)
getanID(as.data.table(x), "x")
# x .id
# 1: a 1
# 2: a 2
# 3: b 1
# 4: b 2
# 5: c 1
# 6: c 2
# 7: a 3
# 8: a 4
Sample data
x <- c("a", "a", "b", "b", "c", "c", "a", "a")
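Because x is a character vector, ave(x, x, FUN = seq_along) returns the counters as strings. If integers are wanted, one option (a small variation, not from the answer above) is to pass an integer vector as the first argument and keep x only as the grouping variable.

```r
x <- c("a", "a", "b", "b", "c", "c", "a", "a")

# First argument supplies the storage type; x supplies the groups
counter <- ave(seq_along(x), x, FUN = seq_along)
```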

How do I combine two columns with offset data?

My dataset contains two columns with data that are offset - something like:
col1<-c("a", "b", "c", "d", "ND", "ND", "ND", "ND")
col2<-c("ND", "ND", "ND", "ND", "e", "f", "g", "h")
dataset<-data.frame(cbind(col1, col2))
I would like to combine those two offset columns into a single column that contains the letters a through h and nothing else.
Something like the following is what I'm thinking, but rbind is not the right command:
dataset$combine<-rbind(dataset$col1[1:4], dataset$col2[5:8])
What about:
sel2 <- col2!="ND"
col1[sel2] <- col2[sel2]
> col1
[1] "a" "b" "c" "d" "e" "f" "g" "h"
Use sapply and an anonymous function:
dataset[sapply(dataset, function(x) x != "ND")]
# [1] "a" "b" "c" "d" "e" "f" "g" "h"
dataset$combine <- dataset[sapply(dataset, function(x) x != "ND")]
dataset
# col1 col2 combine
# 1 a ND a
# 2 b ND b
# 3 c ND c
# 4 d ND d
# 5 ND e e
# 6 ND f f
# 7 ND g g
# 8 ND h h
Use grep to find the matching elements and select them:
c(col1[grep("^[a-h]$",col1)],col2[grep("^[a-h]$",col2)])
Yet another way, using mapply and gsub:
within(dataset, combine <- mapply(gsub, pattern='ND', replacement=col2, x=col1))
# col1 col2 combine
# 1 a ND a
# 2 b ND b
# 3 c ND c
# 4 d ND d
# 5 ND e e
# 6 ND f f
# 7 ND g g
# 8 ND h h
Per your comment to @Andrie's answer, this will also preserve NA rows.
Another point of view:
transform(dataset,
combine=dataset[apply(dataset, 2, function(x) x %in% letters[1:8])])
col1 col2 combine
1 a ND a
2 b ND b
3 c ND c
4 d ND d
5 ND e e
6 ND f f
7 ND g g
8 ND h h
dataset$combine <- dataset[apply(dataset,2, function(x) nchar(x)==1)] #Also works
Sometimes the problem is to think simple enough... ;-)
dataset$combine<-c(dataset$col1[1:4], dataset$col2[5:8])
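If the split point between the two columns is not known in advance, a vectorised ifelse() is another option. This is a sketch that assumes "ND" is the only placeholder and that the columns are character (hence the explicit stringsAsFactors = FALSE); it is not one of the answers above.

```r
col1 <- c("a", "b", "c", "d", "ND", "ND", "ND", "ND")
col2 <- c("ND", "ND", "ND", "ND", "e", "f", "g", "h")
dataset <- data.frame(col1, col2, stringsAsFactors = FALSE)

# Take col2 wherever col1 holds the placeholder, otherwise keep col1
dataset$combine <- ifelse(dataset$col1 == "ND", dataset$col2, dataset$col1)
```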

Create new column based on 4 values in another column

I want to create a new column based on 4 values in another column.
if col1=1 then col2= G;
if col1=2 then col2=H;
if col1=3 then col2=J;
if col1=4 then col2=K.
How do I do this in R?
I need help with this. I have tried if/else and ifelse, but neither seems to work. Thanks
You could use nested ifelse:
col2 <- ifelse(col1 == 1, "G",
        ifelse(col1 == 2, "H",
        ifelse(col1 == 3, "J",
        ifelse(col1 == 4, "K",
        NA)))) # all other values map to NA
In this simple case it's overkill, but for more complicated ones...
You have a special case of looking up values where the indices are the integers 1:4. This means you can use vector indexing to solve your problem in one easy step.
First, create some sample data:
set.seed(1)
dat <- data.frame(col1 = sample(1:4, 10, replace = TRUE))
Next, define the lookup values, and use [ subsetting to find the desired results:
values <- c("G", "H", "J", "K")
dat$col2 <- values[dat$col1]
The results:
dat
col1 col2
1 2 H
2 2 H
3 3 J
4 4 K
5 1 G
6 4 K
7 4 K
8 3 J
9 3 J
10 1 G
More generally, you can use [ subsetting combined with match to solve this kind of problem:
index <- c(1, 2, 3, 4)
values <- c("G", "H", "J", "K")
dat$col2 <- values[match(dat$col1, index)]
dat
col1 col2
1 2 H
2 2 H
3 3 J
4 4 K
5 1 G
6 4 K
7 4 K
8 3 J
9 3 J
10 1 G
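A closely related idiom is a named lookup vector, which also works when the keys are not consecutive integers. The sketch below is an illustration under that assumption; lookup and the sample col1 values are made up for the example.

```r
col1 <- c(2, 2, 3, 4, 1, 5)   # 5 has no entry in the lookup

# Names are the keys, elements are the values
lookup <- c("1" = "G", "2" = "H", "3" = "J", "4" = "K")

# Index by name; keys without an entry come back as NA
col2 <- unname(lookup[as.character(col1)])
```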
There are a number of ways of doing this, but here's one.
set.seed(357)
mydf <- data.frame(col1 = sample(1:4, 10, replace = TRUE))
mydf$col2 <- rep(NA, nrow(mydf))
mydf[mydf$col1 == 1, ][, "col2"] <- "A"
mydf[mydf$col1 == 2, ][, "col2"] <- "B"
mydf[mydf$col1 == 3, ][, "col2"] <- "C"
mydf[mydf$col1 == 4, ][, "col2"] <- "D"
col1 col2
1 1 A
2 1 A
3 2 B
4 1 A
5 3 C
6 2 B
7 4 D
8 3 C
9 4 D
10 4 D
Here's one using car's recode.
library(car)
mydf$col3 <- recode(mydf$col1, "1 = 'A'; 2 = 'B'; 3 = 'C'; 4 = 'D'") # car::recode takes the rules as one string
One more from this question:
mydf$col4 <- c("A", "B", "C", "D")[mydf$col1]
You could have a look at ?symnum.
In your case, something like:
col2<-symnum(col1, seq(0.5, 4.5, by=1), symbols=c("G", "H", "J", "K"))
should get you close.
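If a plain character column is preferred, cut() with the same break points is one alternative. This is a swapped-in technique shown as a sketch, not the symnum approach itself; the sample col1 values are assumptions.

```r
col1 <- c(2, 2, 3, 4, 1)

# Each interval (0.5, 1.5], (1.5, 2.5], ... gets one label
col2 <- as.character(cut(col1,
                         breaks = seq(0.5, 4.5, by = 1),
                         labels = c("G", "H", "J", "K")))
```

Values outside the break range become NA.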
