Get first and last value from groups using rle - r

I want to get first and last value for groups using grouping similar to what rle() function does.
For example I have this data frame:
> df
df time
1 1 A
2 1 B
3 1 C
4 1 D
5 2 E
6 2 F
7 2 G
8 1 H
9 1 I
10 1 J
11 3 K
12 3 L
13 3 M
14 2 N
15 2 O
16 2 P
I want to get something like this:
> want
df first last
1 1 A D
2 2 E G
3 1 H J
4 3 K M
5 2 N P
How you can see, I want to group my values in a way rle() function does. I want to group elements only when this same value is next to each other. group_by groups elements in the different way.
> rle(df$df)
Run Length Encoding
lengths: int [1:5] 4 3 3 3 3
values : num [1:5] 1 2 1 3 2
Is there a solution for my problem? Any advice will be appreciated.

There is a function rleid from data.table that does that job, i.e.
library(data.table)
setDT(dt)[, .(df = head(df, 1),
first = head(time, 1),
last = tail(time, 1)),
by = (grp = rleid(df))][, grp := NULL][]
Which gives,
df first last
1: 1 A D
2: 2 E G
3: 1 H J
4: 3 K M
5: 2 N P
Adding a dplyr approach, as #RonakShah mentions
library(dplyr)
df %>%
group_by(grp = cumsum(c(0, diff(df)) != 0)) %>%
summarise(df = first(df),
first = first(time),
last = last(time)) %>%
select(-grp)
Giving,
# A tibble: 5 x 3
df first last
<int> <chr> <chr>
1 1 A D
2 2 E G
3 1 H J
4 3 K M
5 2 N P

Here is an option using base R with rle. Once we do the rle on the first column, replicate the sequence of values with lengths, use that to create logical index with duplicated, then subset the values of the original dataset based on the index
rl <- rle(df[,1])
i1 <- rep(seq_along(rl$values), rl$lengths)
i2 <- !duplicated(i1)
i3 <- !duplicated(i1, fromLast = TRUE)
wanted <- data.frame(df = df[i2,1], first = df[i2,2], last = df[i3,2])
wanted
# df first last
#1 1 A D
#2 2 E G
#3 1 H J
#4 3 K M
#5 2 N P

Related

Which group meet the criterion a < b < c depending on condition

My title might not be very informative but this is an example which exposes my problem :
I have this dataframe :
df=data.frame(cond1=c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3),
group=c("F","V","M","F","V","M","F","V","M","F","V","M","F","V","M","F","V","M"),
gene=c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B"),
value=c(1,2,3,4,5,6,7,8,9,1,3,2,4,3,2,2,3,4))
df
cond1 group gene value
1 1 F A 1
2 1 V A 2
3 1 M A 3
4 2 F A 4
5 2 V A 5
6 2 M A 6
7 3 F A 7
8 3 V A 8
9 3 M A 9
10 1 F B 1
11 1 V B 3
12 1 M B 2
13 2 F B 4
14 2 V B 3
15 2 M B 2
16 3 F B 2
17 3 V B 3
18 3 M B 4
What I would like to obtain is for each gene, the sum of how many different cond1 have their value corresponding with F group smaller than their value corresponding with V their value corresponding with M.
In the 3 first lines, we are in gene A for the cond1. value correspoding to group F=1, V=2, M=3. So F<V<M for the A gene for the cond1=1 group.
My expected output for the gene A is 3 as all cond1 groups meet F<V<M for value.
My expected output for the gene B is 1 as only cond1=3 group meet F<V<M for value.
My desired output would be ideally a dataframe with gene and the sum of cond1 than meet my criterion :
gene count
1 A 3
2 B 1
I would be very grateful if you could provide me any tips on how should I proceed
Check if all the data is in increasing order and count how many such values exist for each gene.
library(dplyr)
df %>%
#If the data is not ordered, order it using arrange
#arrange(gene, cond1, match(group, c('F', 'V', 'M'))) %>%
group_by(gene, cond1) %>%
summarise(cond = all(diff(value) > 0)) %>%
summarise(count = sum(cond))
# gene count
# <chr> <int>
#1 A 3
#2 B 1
Using data.table
library(data.table)
setDT(df)[, .(cond = all(diff(value) > 0)), .(gene, cond1)][, .(count = sum(cond)), gene]
gene count
1: A 3
2: B 1

Adding an index column representing a repetition of a dataframe in R

I have a dataframe in R that I'd like to repeat several times, and I want to add in a new variable to index those repetitions. The best I've come up with is using mutate + rbind over and over, and I feel like there has to be an efficient dataframe method I could be using here.
Here's an example: df <- data.frame(x = 1:3, y = letters[1:3]) gives us the dataframe
x
y
1
a
2
b
3
c
I'd like to repeat that say 3 times, with an index that looks like this:
x
y
index
1
a
1
2
b
1
3
c
1
1
a
2
2
b
2
3
c
2
1
a
3
2
b
3
3
c
3
Using the rep function, I can get the first two columns, but not the index column. The best I've come up with so far (using dplyr) is:
df2 <-
df %>%
mutate(index = 1) %>%
rbind(df %>% mutate(index = 2)) %>%
rbind(df %>% mutate(index = 3))
This obviously doesn't work if I need to repeat my dataframe more than a handful of times. It feels like the kind of thing that should be easy to do using dataframe methods, but I haven't been able to find anything.
Grateful for any tips!
You can use this code for as many data frames as you would like. You just have to set the n argument:
replicate function takes 2 main arguments. We first specify the number of time we would like to reproduce our data set by n. Then we specify our data set as expr argument. The result would be a list whose elements are instances of our data set
After that we pass it along to imap function from purrr package to define the unique id for each of our data set. .x represents each element of our list (here a data frame) and .y is the position of that element which amounts to the number of instances we created. So for example we assign value 1 to the first id column of the first data set as .y is equal to 1 for that and so on.
library(dplyr)
library(purrr)
replicate(3, df, simplify = FALSE) %>%
imap_dfr(~ .x %>%
mutate(id = .y))
x y id
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
In base R you can use the following code:
do.call(rbind,
mapply(function(x, z) {
x$id <- z
x
}, replicate(3, df, simplify = FALSE), 1:3, SIMPLIFY = FALSE))
x y id
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
You can use rerun to repeat the dataframe n times and add an index column using bind_rows -
library(dplyr)
library(purrr)
n <- 3
df <- data.frame(x = 1:3, y = letters[1:3])
bind_rows(rerun(n, df), .id = 'index')
# index x y
#1 1 1 a
#2 1 2 b
#3 1 3 c
#4 2 1 a
#5 2 2 b
#6 2 3 c
#7 3 1 a
#8 3 2 b
#9 3 3 c
In base R, we can repeat the row index 3 times.
transform(df[rep(1:nrow(df), n), ], index = rep(1:n, each = nrow(df)))
One more way
n <- 3
map_dfr(seq_len(n), ~ df %>% mutate(index = .x))
x y index
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3

Compare values in a grouped data frame with corresponding value in a vector

Let's say I got a data.frame like the following:
u <- as.numeric(rep(rep(1:5,3)))
w <- as.factor(c(rep("a",5), rep("b",5), rep("c",5)))
q <- data.frame(w,u)
q
w u
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 b 1
7 b 2
8 b 3
9 b 4
10 b 5
11 c 1
12 c 2
13 c 3
14 c 4
15 c 5
and the vector:
v <- c(2,3,1)
Now I want to find the first row in the respective group [i] where the value [i] from vector "v" is bigger than the value in column "u".
The result should look like this:
1 a 3
2 b 4
3 c 2
I tried:
fun <- function (m) {
first(which(m[,2]>v))
}
ddply(q, .(w), summarise, fun(q))
and got as a result:
w fun(q)
1 a 3
2 b 3
3 c 3
Thus it seems like, ddply is only taking the first value from the vector "v".
Does anyone know how to solve this?
We can join the vector by creating a data.frame with 'w' as the unique values from 'w' column of 'q', then do a group_by 'w' and get the first row index where u is greater than the corresponding 'vector' column value
library(dplyr)
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
summarise(n = which(u > new)[1])
# // or use findInterval
#summarise(n = findInterval(new[1], u)+1)
-output
# A tibble: 3 x 2
# w n
#* <fct> <int>
#1 a 3
#2 b 4
#3 c 2
or use Map after splitting the data by 'w' column
Map(function(x, y) which(x$u > y)[1], split(q,q$w), v)
#$a
#[1] 3
#$b
#[1] 4
#$c
#[1] 2
OP mentioned that comparison starts from the beginning and it is not correct because we have a group_by operation. If we create a column of sequence, it resets at each group
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
mutate(rn = row_number())
Joining, by = "w"
# A tibble: 15 x 4
# Groups: w [3]
w u new rn
<fct> <dbl> <dbl> <int>
1 a 1 2 1
2 a 2 2 2
3 a 3 2 3
4 a 4 2 4
5 a 5 2 5
6 b 1 3 1
7 b 2 3 2
8 b 3 3 3
9 b 4 3 4
10 b 5 3 5
11 c 1 1 1
12 c 2 1 2
13 c 3 1 3
14 c 4 1 4
15 c 5 1 5
Using data.table: for each 'w' (by = w), subset 'v' with the group index .GRP. Compare the value with 'u' (v[.GRP] < u). Get the index for the first TRUE (which.max):
library(data.table)
setDT(q)[ , which.max(v[.GRP] < u), by = w]
# w V1
# 1: a 3
# 2: b 4
# 3: c 2

Picking up only specific columns based on conditions on multiple columns in R [duplicate]

This question already has answers here:
How to select the rows with maximum values in each group with dplyr? [duplicate]
(6 answers)
Closed 6 years ago.
I have a data frame, say
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
y = c(1,1,1,1,1,2,3,1,1,2,3,4),
z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
it looks like this
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 1 e
6 3 2 f
7 3 3 g
8 6 1 h
9 8 1 i
10 8 2 j
11 8 3 k
12 8 4 l
I would like pick unique elements from column x, based on column y such that y should be maximum (in this case say for row number 5 to 7 are 3'3, I would like to pick the x = 3 corresponding to y = 3 (maximum value) similarly for x = 8 I d like to pick y = 4 row )
the output should look like this
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 3 g
6 6 1 h
7 8 4 l
I have a solution for that, which I am posting in the solution, but if there is there any better method to achieve this, My solution only works in this specific case (picking the largest) what is the general case solution for this?
One solution using dplyr
library(dplyr)
df %>%
group_by(x) %>%
slice(max(y))
# x y z
# (dbl) (dbl) (chr)
#1 1 1 a
#2 2 1 b
#3 3 3 g
#4 5 1 c
#5 6 1 d
#6 8 4 l
The base R alternative is using aggregate
aggregate(y~x, df, max)
You can achieve the same result using a dplyr chain and dplyr's group_by function. Once you use a group_by function the rest of the functions in the chain are applied within group as opposed to the whole data.frame. So here I filter to where the only rows left are the max(y) per the grouping value of x. This can be extended to be used for the min of y or a particular value.
I think its generally good practice to ungroup the data at the end of a chain using group_by to avoid any unexpected behavior.
library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
y = c(1,1,1,1,1,2,3,1,1,2,3,4),
z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df %>%
group_by(x) %>%
filter(y==max(y)) %>%
ungroup()
To make it more general... say instead you wanted the mean of y for a given x as opposed to the max. You could then use the summarise function instead of the filter as shown below.
df %>%
group_by(x) %>%
summarise(y=mean(y)) %>%
ungroup()
Using data.table we can use df[order(z), .I[which.max(y)], by = x] to get the rownumbers of interest, eg:
library(data.table)
setDT(df)
df[df[order(z), .I[which.max(y)], by = x][, V1]]
x y z
1: 1 1 a
2: 2 1 b
3: 5 1 c
4: 6 1 d
5: 3 3 g
6: 8 4 l
Here is my solution using dplyr package
library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
y = c(1,1,1,1,1,2,3,1,1,2,3,4),
z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df <- arrange(df,desc(y))
df_out <- df[!duplicated(df$x),]
df_out
Printing df_out
x y z
1 8 4 l
2 3 3 g
6 1 1 a
7 2 1 b
8 5 1 c
9 6 1 d
Assuming the data frame is ordered by df[order(df$x, df$y),] as it is in the example, you can use base R functions, split, lapply, and do.call/rbind to extract your desired rows using the "split / apply / combine" methodology.
do.call(rbind, lapply(split(df, df$x), function(i) i[nrow(i),]))
x y z
1 1 1 a
2 2 1 b
3 3 3 g
5 5 1 c
6 6 1 h
8 8 4 l
split breaks up the data.frame into a list based on x. This list is fed to lapply which selects the last row of each data.frame, and returns these one row data.frames as a list. This list is then rbinded into a single data frame using do.call.

how to group my string vector in two data frame?

Hi: I have a simple question but it confuse me a lot. Below are my codes:
a <- data.frame(url = c("1","2","3","4","5"),
id = c("a","b","c","d","e")
)
b <- data.frame(url = c("1","1","2","2","2","3","3","3","3","4","4","5","5"),
price = c(10,10,20,20,20,30,30,30,30,40,40,50,50),
recipt=c("n","n","n","n","n","n","n","n","n","y","y","n","n")
)
I want my newdata , which merge b$recipt into a and becomes:
>newdata
url id recipt
1 a n
2 b n
3 c n
4 d y
5 e n
please give me some hint, thanks
You could try this:
a$recipt <- sapply(1:nrow(a),function(x) b$recipt[b$url==a$url[x]][1])
#> a
# url id recipt
#1 1 a n
#2 2 b n
#3 3 c n
#4 4 d y
#5 5 e n
Here it is assumed that the recipt entries are the same for any given value of url in b. If this is not the case, things become more complicated.
If you want to keep a unchanged and generate a new frame newdata with the new column, then the above code can be slightly modified in a rather trivial way:
newdata <- a
newdata$recipt <- sapply(1:nrow(a),function(x) b$recipt[b$url==a$url[x]][1])
You could use match
transform(a, recipt= b$recipt[match(url, b$url)])
# url id recipt
#1 1 a n
#2 2 b n
#3 3 c n
#4 4 d y
#5 5 e n
Or using the devel version of data.table. Instructions to install the devel version are here
library(data.table)#v1.9.5+
setDT(a)[unique(b[c(1,3)], by='url'), on='url']
# url id recipt
#1: 1 a n
#2: 2 b n
#3: 3 c n
#4: 4 d y
#5: 5 e n
So I think what you want is to merge a onto b as there are multiple prices with the same url. Thus b is your base data frame and you want to append an id value to it. Some of the id values would be repeated.
One easy way is to do this with dplyr.
library(dplyr)
a <- data.frame(url = c("1","2","3","4","5"),
id = c("a","b","c","d","e")
)
b <- data.frame(url = c("1","1","2","2","2","3","3","3","3","4","4","5","5"),
price = c(10,10,20,20,20,30,30,30,30,40,40,50,50),
recipt=c("n","n","n","n","n","n","n","n","n","y","y","n","n")
)
left_join(b, a, by = "url")
url price recipt id
1 1 10 n a
2 1 10 n a
3 2 20 n b
4 2 20 n b
5 2 20 n b
6 3 30 n c
7 3 30 n c
8 3 30 n c
9 3 30 n c
10 4 40 y d
11 4 40 y d
12 5 50 n e
13 5 50 n e

Resources