R: subset/group data frame with a max value?

Given a data frame like this:
  gid set  a  b
1   1   1  1  9
2   1   2 -2 -3
3   1   3  5  6
4   2   2 -4 -7
5   2   6  5 10
6   2   9  2  0
How can I subset/group the data frame to get one row per unique gid with the max set value, plus a 1/0 for whether its a value is greater than its b value?
So here, it'd be:
1,3,0
2,9,1
Kind of a stupid simple thing in SQL but I'd like to have a bit better control over my R, so...

Piece of cake with dplyr:
dat <- read.table(text="gid set a b
1 1 1 9
1 2 -2 -3
1 3 5 6
2 2 -4 -7
2 6 5 10
2 9 2 0", header=TRUE)
library(dplyr)
dat %>%
  group_by(gid) %>%
  filter(row_number() == which.max(set)) %>%
  mutate(greater = a > b) %>%
  select(gid, set, greater)
## Source: local data frame [2 x 3]
## Groups: gid
##
## gid set greater
## 1 1 3 FALSE
## 2 2 9 TRUE
If you really need 1's and 0's and the dplyr groups cause any angst:
dat %>%
  group_by(gid) %>%
  filter(row_number() == which.max(set)) %>%
  mutate(greater = ifelse(a > b, 1, 0)) %>%
  select(gid, set, greater) %>%
  ungroup
## Source: local data frame [2 x 3]
##
## gid set greater
## 1 1 3 0
## 2 2 9 1
You could do the same thing without pipes:
ungroup(
  select(
    mutate(
      filter(group_by(dat, gid), row_number() == which.max(set)),
      greater = ifelse(a > b, 1, 0)),
    gid, set, greater))
but…but… why?! :-)

Here's a data.table possibility, assuming your original data is called df.
library(data.table)
setDT(df)[, .(set = max(set), b = as.integer(a > b)[set == max(set)]), gid]
# gid set b
# 1: 1 3 0
# 2: 2 9 1
Note that to account for multiple max(set) rows, I used set == max(set) as the subset, so this returns one row per tie for the max within a group.
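For instance, a quick sketch with hypothetical data where gid 2 has a tied max set (not part of the original question):
library(data.table)
df2 <- data.table(gid = c(1, 1, 2, 2), set = c(3, 1, 9, 9),
                  a = c(5, -2, 2, 8), b = c(6, -3, 0, 4))
df2[, .(set = max(set), b = as.integer(a > b)[set == max(set)]), gid]
#    gid set b
# 1:   1   3 0
# 2:   2   9 1
# 3:   2   9 1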
And courtesy of @thelatemail, another data.table option:
setDT(df)[, list(set = max(set), ab = (a > b)[which.max(set)] + 0), by = gid]
# gid set ab
# 1: 1 3 0
# 2: 2 9 1

In base R, you can use ave
indx <- with(df, ave(set, gid, FUN=max)==set)
# in cases of ties
# indx <- with(df, !!ave(set, gid, FUN = function(x)
#   which.max(x) == seq_along(x)))
transform(df[indx,], greater=(a>b)+0)[,c(1:2,5)]
# gid set greater
# 3 1 3 0
# 6 2 9 1

Nested if and max if statement in R error: unused argument

Still quite new to R, I would like to identify the biggest value of a variable within a group when another variable is positive/negative. Specifically, if l_diff < 0, I would like to find the biggest value of t in each group when pos_n < 0. If l_diff > 0, I would like to find the biggest value of t in each group when pos_p > 0. Example data:
l_diff <- c(-1,-1,-1,-1,-1,-1,1,1,1,1,1)
pos_n <- c(2,2,1,-4,-2,-2,2,1,-5,4,8)
pos_p <- c(3,4,-5,6,-7,2,-3,3,2,1,4)
t <- c(5,7,3,1,6,2,7,5,3,2,1)
group <- c(1,1,1,1,1,1,2,2,2,2,2)
db <- data.frame(cbind(l_diff,pos_n, pos_p, t, group))
Desired output:
cmax<- c(6,6,6,6,6,6,5,5,5,5,5)
I tried the following:
db <- db %>%
  group_by((group)) %>%
  mutate(ifelse(l_diff<0, t1 = max(t[pos_n<0], ifelse(l_diff>0, t1 = max(t[pos_p >0])))))
But I get the following error:
Error: Problem with `mutate()` input `..1`.
x unused argument (t1 = max(t[pos_n < 0], ifelse(l_diff > 0, t1 = max(t[pos_p > 0]))))
i Input `..1` is `ifelse(...)`.
i The error occurred in group 1: (group) = 1.
Any idea what may be wrong or any other suggestions?
With ifelse, we need to place the assignment outside the call, and all three arguments of the usage must be present:
ifelse(test, yes, no)
Here, the no argument was missing from the nested second ifelse. This is not an issue with case_when, where any unmatched case defaults to NA (i.e., TRUE ~ NA).
library(dplyr)
db %>%
  group_by(group) %>%
  mutate(t1 = ifelse(l_diff < 0, max(t[pos_n < 0]),
                     ifelse(l_diff > 0, max(t[pos_p > 0]), NA))) %>%
  ungroup
Output:
# A tibble: 11 x 6
# l_diff pos_n pos_p t group t1
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 -1 2 3 5 1 6
# 2 -1 2 4 7 1 6
# 3 -1 1 -5 3 1 6
# 4 -1 -4 6 1 1 6
# 5 -1 -2 -7 6 1 6
# 6 -1 -2 2 2 1 6
# 7 1 2 -3 7 2 5
# 8 1 1 3 5 2 5
# 9 1 -5 2 3 2 5
#10 1 4 1 2 2 5
#11 1 8 4 1 2 5
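And since case_when was mentioned: a sketch of the same mutate with case_when, where unmatched rows fall through to NA automatically:
db %>%
  group_by(group) %>%
  mutate(t1 = case_when(l_diff < 0 ~ max(t[pos_n < 0]),
                        l_diff > 0 ~ max(t[pos_p > 0]))) %>%
  ungroup()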

How to subset a data frame up until a point in R

I want to subset a data frame and take all observations for each id until the first observation that doesn't meet my condition. Something like this:
goodDaysAfterTreatMent <- subset(Patientdays, treatmentDate < date & goodThings > badThings)
Except that this returns all observations that meet the condition. I want something that stops at the first observation that doesn't meet the condition, moves on to the next id, returns all observations for that id that meet the condition, and so on.
The only way I can see is to use a lot of loops, but loops are usually not a good thing in R.
Hope you guys have an idea
Assume that your condition is to return rows where v < 5 :
# example dataset
df = data.frame(id = c(1,1,1,1,2,2,2,2,3,3,3),
v = c(2,4,3,5,4,5,6,7,5,4,1))
df
# id v
# 1 1 2
# 2 1 4
# 3 1 3
# 4 1 5
# 5 2 4
# 6 2 5
# 7 2 6
# 8 2 7
# 9 3 5
# 10 3 4
# 11 3 1
library(tidyverse)
df %>%
  group_by(id) %>%                                 # for each id
  mutate(flag = cumsum(ifelse(v < 5, 1, NA))) %>%  # NA at the first v >= 5; cumsum keeps it NA from there on
  filter(!is.na(flag)) %>%                         # keep only rows with non-NA flags
  ungroup() %>%                                    # forget the grouping
  select(-flag)                                    # remove flag column
# # A tibble: 4 x 2
# id v
# <dbl> <dbl>
# 1 1 2
# 2 1 4
# 3 1 3
# 4 2 4
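For what it's worth, newer dplyr versions also ship a cumall() helper that expresses the same idea more directly (a sketch, same result):
library(dplyr)
df %>%
  group_by(id) %>%
  filter(cumall(v < 5)) %>%  # keep rows only while the condition has held so far
  ungroup()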
Easy way: find the first FALSE with min(which(condition == FALSE)):
Patientdays <- cbind.data.frame(treatmentDate = c(1:5, 4, 6:10),
                                date = c(2:5, 3, 6:10, 10),
                                goodThings = c(1:11),
                                badThings = c(0:10))
attach(Patientdays)  # just for ease of use (optional)
condition <- treatmentDate < date & goodThings > badThings
Patientdays[1:(min(which(condition == FALSE)) - 1), ]
Edit: Adding result.
treatmentDate date goodThings badThings
1 1 2 1 0
2 2 3 2 1
3 3 4 3 2
4 4 5 4 3
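Note this stops at the first failing row of the whole data frame; to do it per id (assuming Patientdays also has an id column, which the question implies but the example above lacks), the same idea can be wrapped in a per-group split. A base R sketch:
# hypothetical: keep rows before the first FALSE within each id
do.call(rbind, lapply(split(Patientdays, Patientdays$id), function(d) {
  cond <- d$treatmentDate < d$date & d$goodThings > d$badThings
  d[cumsum(!cond) == 0, ]  # cumsum(!cond) stays 0 until the first failure
}))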

How to select unique point

I am a novice R programmer. I have the following series of points.
df <- data.frame(x = c(1 , 2, 3, 4), y = c(6 , 3, 7, 5))
df <- df %>% mutate(k = 1)
df <- df %>% full_join(df, by = 'k')
df <- subset(df, select = c('x.x', 'y.x', 'x.y', 'y.y'))
df
Is there a way to select only "unique" points? (The order of the points does not matter.)
EDIT:
x.x y.x x.y y.y
1 6 2 3
2 3 3 7
.
.
.
(I changed the 2 to 7 to clarify the problem)
With data.table (and working from the OP's initial df):
library(data.table)
setDT(df)
df[, r := .I ]
df[df, on=.(r > r), nomatch=0]
   x y r i.x i.y
1: 2 3 1   1   6
2: 3 7 1   1   6
3: 4 5 1   1   6
4: 3 7 2   2   3
5: 4 5 2   2   3
6: 4 5 3   3   7
This is a "non-equi join" on row numbers. In x[i, on=.(r > r)] the left-hand r refers to the row in x and the right-hand one to a row of i. The columns named like i.* are taken from i.
Data.table joins, which are of the form x[i], use i to look up rows of x. The nomatch=0 option drops rows of i that find no matches.
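A minimal sketch of that lookup direction, with hypothetical two-column tables:
library(data.table)
x <- data.table(k = c("a", "b"), vx = 1:2)
i <- data.table(k = c("b", "c"), vi = 3:4)
x[i, on = "k"]               # i looks up rows of x; "c" finds no match, so vx is NA
x[i, on = "k", nomatch = 0]  # unmatched rows of i are dropped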
In the tidyverse, you can save a bit of work by doing the self-join with tidyr::crossing. If you add row indices pre-join, reducing is a simple filter call:
library(tidyverse)
df %>% mutate(i = row_number()) %>% # add row index column
crossing(., .) %>% # Cartesian self-join
filter(i < i1) %>% # reduce to lower indices
select(-i, -i1) # remove extraneous columns
## x y x1 y1
## 1 1 6 2 3
## 2 1 6 3 7
## 3 1 6 4 5
## 4 2 3 3 7
## 5 2 3 4 5
## 6 3 7 4 5
or in all base R,
df$m <- 1
df$i <- seq(nrow(df))
df <- merge(df, df, by = 'm')
df[df$i.x < df$i.y, c(-1, -4, -7)]
## x.x y.x x.y y.y
## 2 1 6 2 3
## 3 1 6 3 7
## 4 1 6 4 5
## 7 2 3 3 7
## 8 2 3 4 5
## 12 3 7 4 5
You can use the duplicated.matrix() function from base R to find the rows which are not duplicates, which in fact means that they are unique. When you call duplicated() you have to specify that you only want to use the first two columns; with this call you check which rows are unique. In a second step you select those rows from your data frame, with all columns.
unique_lines = !duplicated.matrix(df[,c(1,2)])
df[unique_lines,]
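Another base R route, not from the answers above but just a sketch, is to enumerate each unordered pair of rows once with combn(), assuming the original four-point df (the OP's code overwrites df with the join, so re-create it first):
idx <- t(combn(nrow(df), 2))                  # all unordered row-index pairs
out <- cbind(df[idx[, 1], ], df[idx[, 2], ])  # pair the points up
names(out) <- c("x.x", "y.x", "x.y", "y.y")
out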

count the length of Number Sequences

Sample data containing some arithmetic sequences c(4,5,6) and c(10,11).
df <- data.frame(x = c(2, 4, 5, 6, 8, 10, 11))
What I want is a new column that counts the length of each sequence, such as
> df
x cnt
1 2 1
2 4 1
3 5 2
4 6 3
5 8 1
6 10 1
7 11 2
It would be simple to first assign df$cnt[1] = 1, then for the second row and beyond just increment the count, or reset to 1, depending on whether the consecutive numbers in df$x meet certain criteria (here x[i] - x[i-1] == 1). I am just not sure a loop is the way to go in R; also I need to deal with groups.
I can create new column to check if it is in a sequence. From there, I probably can use rle to calculate the run length and generate the cnt column (not sure how to do it with the NA).
> df %>% mutate(check=(x-lag(x)==1))
x check
1 2 NA
2 4 FALSE
3 5 TRUE
4 6 TRUE
5 8 FALSE
6 10 FALSE
7 11 TRUE
Is this the way to go? Please suggest solutions with dplyr or data.table.
dplyr. Set the default value and it will work:
df %>%
  mutate(check = x - lag(x, default = x[1L]) != 1) %>%
  group_by(g = cumsum(check)) %>%
  mutate(cnt = row_number()) %>%
  ungroup %>%
  select(-g, -check)
x cnt
<dbl> <int>
1 2 1
2 4 1
3 5 2
4 6 3
5 8 1
6 10 1
7 11 2
data.table. Along the same lines and more concisely:
library(data.table)
setDT(df)
df[, cnt := 1:.N, by=cumsum(x != shift(x, fill=x[1L]) + 1L)]
x cnt
1: 2 1
2: 4 1
3: 5 2
4: 6 3
5: 8 1
6: 10 1
7: 11 2
shift is data.table's analogue to lag.
Alternatively, from v1.9.7 of the package on, you can use rowid instead:
df[, cnt := rowid(cumsum(x != shift(x, fill=x[1L]) + 1L))]
Another option using base R
unlist(sapply(rle(cumsum(ifelse(diff(c(df$x[1],df$x))!=1,1,0)))$lengths,seq_len))
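Broken down step by step, the same logic reads:
steps  <- diff(c(df$x[1], df$x))    # 0 2 1 1 2 2 1
breaks <- ifelse(steps != 1, 1, 0)  # 1 wherever a new run starts
run_id <- cumsum(breaks)            # 1 2 2 2 3 4 4
lens   <- rle(run_id)$lengths       # 1 3 1 2
unlist(sapply(lens, seq_len))       # 1 1 2 3 1 1 2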

Replacing the last value within groups with different values

My question is similar to this post, but the difference is that instead of replacing the last value within each group/id with 0, a different value is used for each group/id.
Here is an example (I borrowed it from the above link):
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 9999
6 2 0
7 2 9
8 2 500
9 3 0
10 3 1
In the above link, the last value within each group/id was replaced by a zero, using something like:
df %>%
  group_by(id) %>%
  mutate(Time = c(Time[-n()], 0))
And the output was
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 0
6 2 0
7 2 9
8 2 0
9 3 0
10 3 0
In my case, I would like the last value within each group/id to be replaced by a different value. Originally, the last value within each group/id was 9999, 500, and 1. Now I would like: 9999 is replaced by 5, 500 is replaced by 12, and 1 is replaced by 92. The desired output is:
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 5
6 2 0
7 2 9
8 2 12
9 3 0
10 3 92
I tried this one:
df %>%
  group_by(id) %>%
  mutate(Time = replace(Time, n(), c(5, 12, 92)))
but it did not work.
This can be solved using an almost identical solution to the one I posted in the linked question; e.g., just replace 0L with the desired values:
library(data.table)
indx <- setDT(df)[, .I[.N], by = id]$V1
df[indx, Time := c(5L, 12L, 92L)]
df
# id Time
# 1: 1 3
# 2: 1 10
# 3: 1 1
# 4: 1 0
# 5: 1 5
# 6: 2 0
# 7: 2 9
# 8: 2 12
# 9: 3 0
# 10: 3 92
So to add some explanations:
.I is identical to row_number() or 1:n() in dplyr for ungrouped data, e.g., 1:nrow(df) in base R
.N is like n() in dplyr, e.g., the size of a certain group (or the whole data set). So basically when I run .I[.N] by group, I'm retrieving the global index of the last row of each group
The next step is just to use this index as a row index within df while assigning the desired values to Time by reference using the := operator.
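To illustrate with the example data, the intermediate index vector picks out rows 5, 8, and 10 (the group sizes are 5, 3, and 2):
setDT(df)[, .I[.N], by = id]
#    id V1
# 1:  1  5
# 2:  2  8
# 3:  3 10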
Edit
Per the OP's request, here's a possible dplyr solution. Your original solution doesn't work because you are operating per group, and thus were trying to pass all three values to each group.
The only way I can think of is to first calculate the group sizes, then ungroup and mutate using the cumulative sum of these sizes as row locations, something along these lines:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(indx = n()) %>%
  ungroup() %>%
  mutate(Time = replace(Time, cumsum(unique(indx)), c(5, 12, 92))) %>%
  select(-indx)
# Source: local data frame [10 x 2]
#
# id Time
# 1 1 3
# 2 1 10
# 3 1 1
# 4 1 0
# 5 1 5
# 6 2 0
# 7 2 9
# 8 2 12
# 9 3 0
# 10 3 92
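With more recent dplyr (>= 1.0), cur_group_id() makes a per-group replacement more direct. A sketch, where new_vals is a hypothetical lookup vector ordered by group:
library(dplyr)
new_vals <- c(5, 12, 92)  # one replacement value per group, in group order
df %>%
  group_by(id) %>%
  mutate(Time = replace(Time, n(), new_vals[cur_group_id()])) %>%
  ungroup()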
Another way using data.table would be to create another data.table which contains the values to be replaced with for a given id, and then join and update by reference (simultaneously).
require(data.table) # v1.9.5+ (for 'on = ' feature)
replace = data.table(id = 1:3, val = c(5L, 12L, 92L)) # from @David
setDT(df)[replace, Time := val, on = "id", mult = "last"]
# id Time
# 1: 1 3
# 2: 1 10
# 3: 1 1
# 4: 1 0
# 5: 1 5
# 6: 2 0
# 7: 2 9
# 8: 2 12
# 9: 3 0
# 10: 3 92
In data.table, joins are considered an extension of subsets. It's natural to think of doing whatever operation we do on subsets on joins as well; both operations act on some subset of rows.
For each replace$id, we find the last matching row (mult = "last") in df$id, and update that row with the corresponding val.
Installation instructions for v1.9.5 here. Hope this helps.
