How to get data for rows that follow a certain pattern in R

I have a data frame that looks something like this:
 x   y  z
23   1  1
23   4  2
23  56  1
23  59  2
15  89  1
15  12  1
15  15  2
17  18  1
17  21  2
78  11  1
78  38  1
78  41  2
This data has a certain pattern in columns y and z. I want to get all the data where, for a given value of x, column z contains a row-wise pair of 1 followed by 2. Simply put, we need to remove all rows that have a 1 in column z when that 1 is not followed by a 2 in the next row.
The final output should look like this:
 x   y  z
23   1  1
23   4  2
23  56  1
23  59  2
15  12  1
15  15  2
17  18  1
17  21  2
78  38  1
78  41  2

You can do this:
library(dplyr)
df %>%
  group_by(x) %>%
  filter((z == 1 & lead(z) == 2) | (z == 2 & lag(z) == 1))
# A tibble: 10 × 3
# Groups: x [4]
x y z
<int> <int> <int>
1 23 1 1
2 23 4 2
3 23 56 1
4 23 59 2
5 15 12 1
6 15 15 2
7 17 18 1
8 17 21 2
9 78 38 1
10 78 41 2
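Note how the boundary rows fall out: lead() returns NA at the end of each group and lag() returns NA at its start, and filter() drops rows whose condition evaluates to NA. To make that explicit, both functions accept a default argument; a minimal sketch of the same filter:
df %>%
  group_by(x) %>%
  filter((z == 1 & lead(z, default = 0) == 2) |
           (z == 2 & lag(z, default = 0) == 1))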

library(tidyverse)
df <- data.frame(x = c(23, 23, 23, 23, 15, 15, 15, 17, 17, 78, 78, 78),
                 y = c(1, 4, 56, 59, 89, 12, 15, 18, 21, 11, 38, 41),
                 z = c(1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 2))
df %>%
  filter(!(z == 1 & lead(z) != 2))
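Note that this version is not grouped by x; it works on this data because the 1/2 pairs never straddle two values of x, and because filter() also drops the NA that lead() produces on the last row. A grouped sketch, assuming the pairs must stay within each x as in the answer above:
df %>%
  group_by(x) %>%
  filter(!(z == 1 & (is.na(lead(z)) | lead(z) != 2))) %>%
  ungroup()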

dplyr creating new column based on some condition

I have the following df:
df <- data.frame(geo_num = c(11, 12, 22, 41, 42, 43, 77, 71),
                 cust_id = c("A", "A", "B", "C", "C", "C", "D", "D"),
                 sales = c(2, 3, 2, 1, 2, 4, 6, 3))
> df
geo_num cust_id sales
1 11 A 2
2 12 A 3
3 22 B 2
4 41 C 1
5 42 C 2
6 43 C 4
7 77 D 6
8 71 D 3
I need to create a new column 'geo_num_new' that, for every group in 'cust_id', holds the first value of 'geo_num', as shown below:
> df_new
geo_num cust_id sales geo_num_new
1 11 A 2 11
2 12 A 3 11
3 22 B 2 22
4 41 C 1 41
5 42 C 2 41
6 43 C 4 41
7 77 D 6 77
8 71 D 3 77
thanks.
We could use first() after grouping by 'cust_id'; the single value will be recycled across the entire group.
library(dplyr)
df <- df %>%
  group_by(cust_id) %>%
  mutate(geo_num_new = first(geo_num)) %>%
  ungroup()
-output
df
# A tibble: 8 x 4
geo_num cust_id sales geo_num_new
<dbl> <chr> <dbl> <dbl>
1 11 A 2 11
2 12 A 3 11
3 22 B 2 22
4 41 C 1 41
5 42 C 2 41
6 43 C 4 41
7 77 D 6 77
8 71 D 3 77
Or use data.table
library(data.table)
setDT(df)[, geo_num_new := first(geo_num), by = cust_id]
Or with base R
df$geo_num_new <- with(df, ave(geo_num, cust_id, FUN = function(x) x[1]))
Or an option with collapse
library(collapse)
tfm(df, geo_num_new = ffirst(geo_num, g = cust_id, TRA = "replace"))
geo_num cust_id sales geo_num_new
1 11 A 2 11
2 12 A 3 11
3 22 B 2 22
4 41 C 1 41
5 42 C 2 41
6 43 C 4 41
7 77 D 6 77
8 71 D 3 77

Finding cumulative second max per group in R

I have a dataset where I would like to create a new variable that is the cumulative second largest value of another variable, and I would like to perform this function per group.
Let's say I create the following example data frame:
(df1 <- data.frame(patient = rep(1:5, each = 8),
                   visit = rep(1:2, each = 4, 5),
                   trial = rep(1:4, 10),
                   var1 = sample(1:50, 20, replace = TRUE)))
This is pretend data that represents 5 patients who each had 2 study visits, and each visit had 4 trials with a measurement taken (var1).
> head(df1,n=20)
patient visit trial var1
1 1 1 1 25
2 1 1 2 23
3 1 1 3 48
4 1 1 4 37
5 1 2 1 41
6 1 2 2 45
7 1 2 3 8
8 1 2 4 9
9 2 1 1 26
10 2 1 2 14
11 2 1 3 41
12 2 1 4 35
13 2 2 1 37
14 2 2 2 30
15 2 2 3 14
16 2 2 4 28
17 3 1 1 34
18 3 1 2 19
19 3 1 3 28
20 3 1 4 10
I would like to create a new variable, cum2ndmax, that is the cumulative 2nd largest value of var1 and I would like to group this variable by patient # and visit #.
I figured out how to calculate the cumulative 2nd max number like so:
df1$cum2ndmax <- sapply(seq_along(df1$var1),
                        function(x) { sort(df1$var1[seq(x)], decreasing = TRUE)[2] })
df1
However, this calculates the cumulative 2nd max across the whole dataset, not for each group. I have attempted to calculate this variable using grouped data like so after installing and loading package dplyr:
library(dplyr)
df2 <- df1 %>%
  group_by(patient, visit) %>%
  mutate(cum2ndmax = sapply(seq_along(df1$var1),
                            function(x) { sort(df1$var1[seq(x)], decreasing = TRUE)[2] }))
But I get an error: Error: Problem with mutate() input cum2ndmax. x Input cum2ndmax can't be recycled to size 4.
Ideally, my result would look something like this:
patient visit trial var1 cum2ndmax
1 1 1 25 NA
1 1 2 23 23
1 1 3 48 25
1 1 4 37 37
1 2 1 41 NA
1 2 2 45 41
1 2 3 8 41
1 2 4 9 41
2 1 1 26 NA
2 1 2 14 14
2 1 3 41 26
2 1 4 35 35
… … … … …
Any help in getting this to work in R would be much appreciated! Thank you!
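The error arises because df1$var1 inside mutate() still refers to the whole ungrouped column, so sapply() returns one long vector that cannot be recycled into a 4-row group. Dropping the df1$ prefix lets the function see only the current group's values; a minimal sketch of that fix:
library(dplyr)
df1 %>%
  group_by(patient, visit) %>%
  mutate(cum2ndmax = sapply(seq_along(var1),
                            function(i) sort(var1[seq(i)], decreasing = TRUE)[2]))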
One dplyr and purrr option could be:
library(dplyr)
library(purrr)
df1 %>%
  group_by(patient, visit) %>%
  mutate(cum_second_max = map_dbl(.x = seq_along(var1),
                                  ~ ifelse(.x == 1, NA, var1[dense_rank(-var1[1:.x]) == 2])))
patient visit trial var1 cum_second_max
<int> <int> <int> <int> <dbl>
1 1 1 1 25 NA
2 1 1 2 23 23
3 1 1 3 48 25
4 1 1 4 37 37
5 1 2 1 41 NA
6 1 2 2 45 41
7 1 2 3 8 41
8 1 2 4 9 41
9 2 1 1 26 NA
10 2 1 2 14 14
11 2 1 3 41 26
12 2 1 4 35 35
13 2 2 1 37 NA
14 2 2 2 30 30
15 2 2 3 14 30
16 2 2 4 28 30
17 3 1 1 34 NA
18 3 1 2 19 19
19 3 1 3 28 28
20 3 1 4 10 28
Here is an Rcpp solution.
cum_second_max is a modification of cummax that also keeps track of the second maximum; it computes the result in a single pass instead of re-sorting each prefix.
library(tidyverse)
Rcpp::cppFunction("
NumericVector cum_second_max(NumericVector x) {
  double max_value = R_NegInf, max_value2 = NA_REAL;
  NumericVector result(x.length());
  for (int i = 0; i < x.length(); ++i) {
    if (x[i] > max_value) {
      max_value2 = max_value;
      max_value = x[i];
    } else if (x[i] < max_value && x[i] > max_value2) {
      max_value2 = x[i];
    }
    result[i] = isinf(max_value2) ? NA_REAL : max_value2;
  }
  return result;
}
")
df1 %>%
  group_by(patient, visit) %>%
  mutate(c2max = cum_second_max(var1))
#> # A tibble: 20 x 5
#> # Groups: patient, visit [5]
#> patient visit trial var1 c2max
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 25 NA
#> 2 1 1 2 23 23
#> 3 1 1 3 48 25
#> 4 1 1 4 37 37
#> 5 1 2 1 41 NA
#> 6 1 2 2 45 41
#> 7 1 2 3 8 41
#> 8 1 2 4 9 41
#> 9 2 1 1 26 NA
#> 10 2 1 2 14 14
#> 11 2 1 3 41 26
#> 12 2 1 4 35 35
#> 13 2 2 1 37 NA
#> 14 2 2 2 30 30
#> 15 2 2 3 14 30
#> 16 2 2 4 28 30
#> 17 3 1 1 34 NA
#> 18 3 1 2 19 19
#> 19 3 1 3 28 28
#> 20 3 1 4 10 28
Thanks so much everyone! I really appreciate it and could not have solved this without your help. In the end I used the approach suggested by tmfmnk, since I was already using dplyr. Interestingly, tmfmnk's code sometimes gave me a column that just repeated the first row's number; this is likely because dense_rank() assigns the same rank to tied values, so var1[dense_rank(-var1[1:.x]) == 2] can return zero or several matches, whereas order(-var1[1:.x])[2] always picks exactly one index. With that small tweak from dense_rank to order, I got exactly what I wanted:
df1 %>%
  group_by(patient, visit) %>%
  mutate(cum_second_max = map_dbl(.x = seq_along(var1),
                                  ~ ifelse(.x == 1, NA, var1[order(-var1[1:.x])[2]])))

Select rows by column value based on range of values in another column in R

I have a dataframe similar to this:
x <- data.frame("A" = c(11:24),
                "B" = c(25, 25, 25, 25, 25, 37, 37, 16, 16, 16, 16, 16, 42, 42),
                "C" = c(1:3, 1:2, 1:2, 1:3, 1:2, 1:2))
A B C
11 25 1
12 25 2
13 25 3
14 25 1
15 25 2
16 37 1
17 37 2
18 16 1
19 16 2
20 16 3
21 16 1
22 16 2
23 42 1
24 42 2
I want to keep only the rows where each value of B has all of the values 1-3 present in C. So my result would look like:
A B C
11 25 1
12 25 2
13 25 3
14 25 1
15 25 2
18 16 1
19 16 2
20 16 3
21 16 1
22 16 2
I can't seem to get the right keywords in my search for answers.
We can use all() after grouping by 'B':
library(dplyr)
x %>%
  group_by(B) %>%
  filter(all(1:3 %in% C))
# A tibble: 10 x 3
# Groups: B [2]
# A B C
# <int> <dbl> <int>
# 1 11 25 1
# 2 12 25 2
# 3 13 25 3
# 4 14 25 1
# 5 15 25 2
# 6 18 16 1
# 7 19 16 2
# 8 20 16 3
# 9 21 16 1
#10 22 16 2
Another option is to use data.table to count the unique C's for each B, and then filter your data to contain only B's that have 3 distinct C's:
library(data.table)
setDT(x)
x[B %in% x[, length(unique(C)), by = B][V1 == 3, B]]
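For completeness, a base R sketch of the same filter: ave() evaluates all(1:3 %in% C) within each group of B and recycles the result (coerced to 1/0) across the group's rows, so it can index the rows directly.
keep <- ave(x$C, x$B, FUN = function(v) all(1:3 %in% v))
x[keep == 1, ]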

Add a column numbering groups from 1 to the number of unique values in an existing column

Here is my example df:
df = read.table(text = 'colA
22
22
22
45
45
11
11
87
90
110
32
32', header = TRUE)
I just need to add a new column based on colA, with values running from 1 up to the number of unique values in colA.
Expected output:
colA newCol
22 1
22 1
22 1
45 2
45 2
11 3
11 3
87 4
90 5
110 6
32 7
32 7
Here is what I tried without success:
library(dplyr)
new_df = df %>%
  group_by(colA) %>%
  mutate(newCol = seq(1, length(unique(df$colA)), by = 1))
Thanks
newcol <- c(1, 1 + cumsum(diff(df$colA) != 0))
[1] 1 1 1 2 2 3 3 4 5 6 7 7
This increments the index whenever consecutive values differ, so it assumes equal values of colA are always adjacent.
The dplyr package has a function to get the index of each group:
df$newcol = group_indices(df,colA)
This returns:
colA newcol
1 22 2
2 22 2
3 22 2
4 45 4
5 45 4
6 11 1
7 11 1
8 87 5
9 90 6
10 110 7
11 32 3
12 32 3
Though the index is not ordered according to the order of appearance.
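In recent dplyr versions (1.0 and later) group_indices() is superseded; a minimal sketch of the equivalent with cur_group_id(), which produces the same sorted-group numbering:
library(dplyr)
df %>%
  group_by(colA) %>%
  mutate(newcol = cur_group_id()) %>%
  ungroup()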
You can also do it using factor:
df$newcol <- as.numeric(factor(df$colA, levels = unique(df$colA)))
Another option: You can capitalize on the fact that factors are associated with underlying integers. First create a new factor variable with the same levels as the column, then transform it to numeric.
newCol <- factor(df$colA, levels = unique(df$colA))
df$newCol <- as.numeric(newCol)
df
colA newCol
1 22 1
2 22 1
3 22 1
4 45 2
5 45 2
6 11 3
7 11 3
8 87 4
9 90 5
10 110 6
11 32 7
12 32 7

Sum by group but keep the same value for each row in R

I have a data frame and want to create new variables that hold the sum of x and of y for each ID and Group. If I sum normally, the dimensions of the data are reduced; in my case I need to keep every row and repeat the sum on each one.
ID <- c(rep(1, 3), rep(3, 5), rep(4, 4))
Group <- c(1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2)
x <- c(1:12)
y <- c(12:23)
df <- data.frame(ID, Group, x, y)
ID Group x y
1 1 1 1 12
2 1 1 2 13
3 1 2 3 14
4 3 1 4 15
5 3 1 5 16
6 3 1 6 17
7 3 2 7 18
8 3 2 8 19
9 4 1 9 20
10 4 1 10 21
11 4 1 11 22
12 4 2 12 23
The desired output has two more variables, "sumx" and "sumy", computed by group (ID, Group):
ID Group x y sumx sumy
1 1 1 1 12 3 25
2 1 1 2 13 3 25
3 1 2 3 14 3 14
4 3 1 4 15 15 48
5 3 1 5 16 15 48
6 3 1 6 17 15 48
7 3 2 7 18 15 37
8 3 2 8 19 15 37
9 4 1 9 20 30 63
10 4 1 10 21 30 63
11 4 1 11 22 30 63
12 4 2 12 23 12 23
Any idea?
As short as:
df$sumx <- with(df, ave(x, ID, Group, FUN = sum))
df$sumy <- with(df, ave(y, ID, Group, FUN = sum))
ave() returns a vector the same length as its input, so each group sum is repeated on every row.
We can use dplyr's across() (the older mutate_each() is deprecated); the .names argument keeps x and y and adds the sums alongside them:
library(dplyr)
df %>%
  group_by(ID, Group) %>%
  mutate(across(c(x, y), sum, .names = "sum{.col}"))
If there are only two columns to sum, then
df %>%
  group_by(ID, Group) %>%
  mutate(sumx = sum(x), sumy = sum(y))
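A data.table sketch of the same grouped sums (assuming the data.table package is available; := adds both columns by reference):
library(data.table)
setDT(df)[, c("sumx", "sumy") := .(sum(x), sum(y)), by = .(ID, Group)]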
You can use the code below for a single column, substituting your own data and column names; add one expression per additional column inside mutate(). Here cumsum() gives a running total within each group; use sum() instead for a repeated group total:
library(dplyr)
data13 <- data12 %>%
  group_by(Category) %>%
  mutate(cum_Cat_GMR = cumsum(GrossMarginRs))
