In a dataframe, find the index of the next smaller value for each element of a column

Question:
In a dataframe, I want to create a new column holding, for each row, the index of the next smaller value of an existing column.
For example, the data look like this; they are already sorted by item and day.
  item day val
1    1   2   3
2    1   4   2
3    1   5   1
4    2   1   1
5    2   3   2
6    2   5   3
First I would like to use group_by(item) in dplyr to select the sub-dataframe of each item.
Then for row 1, I look down the rows and find that row 2 has a smaller val. This is what I want, so I record the day corresponding to that row. Similarly for row 2.
Note that rows 3 and 6 are the last rows of their sub-dataframes, so there is no next smaller value. For rows 4 and 5, there is no smaller val anywhere below them.
The dataframe with the new column should look like this.
  item day val next.smaller.day
1    1   2   3                4
2    1   4   2                5
3    1   5   1               -1
4    2   1   1               -1
5    2   3   2               -1
6    2   5   3               -1
I wonder if there is any way of implementing this with dplyr, or with any R code other than a for loop.
I found a thread asking about the algorithm for this problem: Given an array, find out the next smaller element for each element.
It is relevant, and the proposed algorithm beats mine in terms of time complexity, but I still find it hard to implement in my scenario.
Thank you!
Update:
Here is another example to re-illustrate what I'm looking for.
  item day val next.smaller.day
1    1   2   2                5
2    1   4   3                5
3    1   5   1               -1
4    2   1   3                3
5    2   3   1               -1
6    2   5   2               -1

You can group your data by item, compute the difference between consecutive rows with the diff function, and check whether it is smaller than zero, which yields a logical vector you can use to pick up the next day. Since you are picking up the next day, you need the lead function to shift the day column forward so that it lines up with the rows where you want to place the values.
Side note: since the diff function creates a vector one element shorter than the original, and the last row of each group is always left out, we pad the diff result with a FALSE.
library(dplyr)
df %>%
  group_by(item) %>%
  mutate(smaller = c(diff(val) < 0, FALSE),
         next.smaller.day = ifelse(smaller, lead(day), -1)) %>%
  select(-smaller)
# Source: local data frame [6 x 4]
# Groups: item [2]
#    item   day   val next.smaller.day
#   <int> <int> <int>            <dbl>
# 1     1     2     3                4
# 2     1     4     2                5
# 3     1     5     1               -1
# 4     2     1     1               -1
# 5     2     3     2               -1
# 6     2     5     3               -1
Update: the diff approach above only compares each row with the row immediately after it, so it misses cases where the next smaller value lies several rows ahead, as in your second example. A recursive search over the remaining rows handles that:
find.next.smaller <- function(ini = 1, vec) {
  if (length(vec) == 1) NA
  else c(ini + min(which(vec[1] > vec[-1])),
         find.next.smaller(ini + 1, vec[-1]))
}
# The recursive function walks the vector element by element and returns the
# index of the next smaller value. When there is none, min() over an empty set
# yields Inf (with a warning); indexing day with Inf produces NA, which the
# replace() call below turns into -1.

df %>%
  group_by(item) %>%
  mutate(next.smaller.day = day[find.next.smaller(1, val)],
         next.smaller.day = replace(next.smaller.day, is.na(next.smaller.day), -1))
# Source: local data frame [6 x 4]
# Groups: item [2]
#
#    item   day   val next.smaller.day
#   <int> <int> <dbl>            <dbl>
# 1     1     2     2                5
# 2     1     4     3                5
# 3     1     5     1               -1
# 4     2     1     3                3
# 5     2     3     1               -1
# 6     2     5     2               -1
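For longer groups, the O(n) stack-based algorithm from the linked thread can also be written in R without recursion. Here is a minimal sketch (the helper name next_smaller_idx is mine, not from the thread); coalesce() turns the NA for "no smaller value" into -1:

library(dplyr)

# Scan left to right, keeping a stack of indices still waiting for a smaller
# value; when vec[i] is smaller than the value on top of the stack, i is the
# answer for that stacked index.
next_smaller_idx <- function(vec) {
  res <- rep(NA_integer_, length(vec))
  stack <- integer(0)
  for (i in seq_along(vec)) {
    while (length(stack) > 0 && vec[stack[length(stack)]] > vec[i]) {
      res[stack[length(stack)]] <- i
      stack <- stack[-length(stack)]
    }
    stack <- c(stack, i)
  }
  res
}

df %>%
  group_by(item) %>%
  mutate(next.smaller.day = coalesce(day[next_smaller_idx(val)], -1L))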

Related

rolling function with variable width R

I need to summarize some data using a rolling window of varying width and shift. In particular, I need to apply a function (e.g. sum) over values recorded at different intervals.
Here an example of a data frame:
library(tibble)
df <- tibble(days  = c(0, 1, 2, 3, 1),
             value = c(5, 7, 3, 4, 2))
df
# A tibble: 5 x 2
   days value
  <dbl> <dbl>
1     0     5
2     1     7
3     2     3
4     3     4
5     1     2
The columns indicate:
days: how many days elapsed since the previous observation. The first value is 0 because there is no previous observation.
value: the value I need to aggregate.
Now, let's assume that I need to sum the field value every 4 days shifting 1 day at the time.
I need something along these lines:
days value roll_sum rows_to_sum
   0     5       15 1,2,3
   1     7       10 2,3
   2     3        3 3
   3     4        6 4,5
   1     2       NA NA
The column rows_to_sum has been added for clarity.
Here are more details:
The first value (15) is the sum of the first 3 rows, because their elapsed days add up to 0+1+2 = 3, which does not exceed the reference value 4, while adding the next row (days = 3) would bring the total day count to 6, which is more than 4.
The second value (10) is the sum of rows 2 and 3. Excluding the first row (since we shift by one day), we sum only rows 2 and 3, because including row 4 would bring the total day count to 1+2+3 = 6, which is more than 4.
...
How can I achieve this?
Thank you
Here is one way:
library(dplyr)
library(purrr)
df %>%
  mutate(roll_sum = map_dbl(row_number(), ~ {
    # indices of the rows, starting at .x, whose cumulative days stay within 4
    i <- which(cumsum(days[.x:n()]) <= 4)
    if (length(i) == 0) NA_real_ else sum(value[.x:(.x + max(i) - 1)])
  }))
#    days value roll_sum
#   <dbl> <dbl>    <dbl>
# 1     0     5       15
# 2     1     7       10
# 3     2     3        3
# 4     3     4        6
# 5     1     2        2
Note that the last row comes out as 2 rather than the NA shown in the question: on its own it spans only 1 day, which still fits within the 4-day window.
Performing this calculation in base R:
sapply(seq(nrow(df)), function(x) {
  i <- which(cumsum(df$days[x:nrow(df)]) <= 4)
  if (length(i) == 0) NA else sum(df$value[x:(x + max(i) - 1)])
})
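If the window width needs to vary, the same cumsum logic can be wrapped in a small helper; roll_sum_days is a made-up name for this sketch, with the width as a parameter:

# Sum `value` over the run of rows starting at each position whose cumulative
# `days` stays within `width`; NA when even the starting row does not fit.
roll_sum_days <- function(days, value, width = 4) {
  sapply(seq_along(days), function(x) {
    i <- which(cumsum(days[x:length(days)]) <= width)
    if (length(i) == 0) NA else sum(value[x:(x + max(i) - 1)])
  })
}

df$roll_sum <- roll_sum_days(df$days, df$value, width = 4)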

R, dplyr: Is there a way to add order of groups when there are multiple rows per group without creating a new data frame? [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 2 years ago.
I have data from an experiment with multiple rows per item (each row has the reading time for one word of a sentence of n words) and multiple items per subject. Items can span varying numbers of rows. Items were presented in a random order, and their order in the data as initially read in reflects the sequence in which the subject saw them. What I'd like to do is add a column that contains the order in which the subject saw each item (i.e., 1 for the first item, 2 for the second, etc.).
Here's an example of some input data that has the relevant properties:
d <- data.frame(Subject = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                Item    = c(2, 2, 2, 1, 1, 1, 1, 2, 2, 2))
Subject Item
      1    2
      1    2
      1    2
      1    1
      1    1
      2    1
      2    1
      2    2
      2    2
      2    2
And here's the output I want:
Subject Item order
      1    2     1
      1    2     1
      1    2     1
      1    1     2
      1    1     2
      2    1     1
      2    1     1
      2    2     2
      2    2     2
      2    2     2
I know I can do this by setting up a temp data frame that filters d to unique combinations of Subject and Item, adding order to that as something like 1:n() or row_number(), and then using a join function to put it back together with the main data frame. What I'd like to know is whether there's a way to do this without having to create a new data frame just to store the order: can this be done inside dplyr's mutate somehow, if I group by Subject and Item, for instance?
Here's one way:
d %>%
  group_by(Subject) %>%
  mutate(order = match(Item, unique(Item))) %>%
  ungroup()
# # A tibble: 10 x 3
#    Subject  Item order
#      <dbl> <dbl> <int>
#  1       1     2     1
#  2       1     2     1
#  3       1     2     1
#  4       1     1     2
#  5       1     1     2
#  6       2     1     1
#  7       2     1     1
#  8       2     2     2
#  9       2     2     2
# 10       2     2     2
Here is a base R option
transform(d,
          order = ave(Item, Subject, FUN = function(x) as.integer(factor(x, levels = unique(x))))
)
or
transform(d,
          order = ave(Item, Subject, FUN = function(x) match(x, unique(x)))
)
both giving
   Subject Item order
1        1    2     1
2        1    2     1
3        1    2     1
4        1    1     2
5        1    1     2
6        2    1     1
7        2    1     1
8        2    2     2
9        2    2     2
10       2    2     2
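If each item's rows are always contiguous within a subject, as in this example, data.table's rleid() gives the same result (note that, unlike match(Item, unique(Item)), it would assign a new number if the same item reappeared later in the sequence):

library(data.table)

# rleid() assigns a run-length id: order increments whenever Item changes
# within a Subject.
setDT(d)[, order := rleid(Item), by = Subject]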

how to subset a data frame up until a point R

I want to subset a data frame and take all observations for each id until the first observation that doesn't meet my condition. Something like this:
goodDaysAfterTreatMent <- subset(Patientdays, treatmentDate < date & goodThings > badThings)
Except that this returns all observations that meet the condition. I want something that stops at the first observation that doesn't meet the condition, moves on to the next id, returns all observations for that id that meet the condition, and so on.
The only way I can see is to use a lot of loops, and loops are usually not a good thing.
Hope you guys have an idea.
Assume that your condition is to return rows where v < 5:
# example dataset
df = data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
                v  = c(2, 4, 3, 5, 4, 5, 6, 7, 5, 4, 1))
df
#    id v
# 1   1 2
# 2   1 4
# 3   1 3
# 4   1 5
# 5   2 4
# 6   2 5
# 7   2 6
# 8   2 7
# 9   3 5
# 10  3 4
# 11  3 1
library(tidyverse)
df %>%
  group_by(id) %>%                                # for each id
  mutate(flag = cumsum(ifelse(v < 5, 1, NA))) %>% # NA from the first row failing v < 5 onward
  filter(!is.na(flag)) %>%                        # keep only the rows before the first failure
  ungroup() %>%                                   # forget the grouping
  select(-flag)                                   # remove the helper column
# # A tibble: 4 x 2
#      id     v
#   <dbl> <dbl>
# 1     1     2
# 2     1     4
# 3     1     3
# 4     2     4
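A shorter dplyr variant of the same idea uses cumall(), which stays TRUE only up to the first row where the condition fails within each group; a minimal sketch:

library(dplyr)

df %>%
  group_by(id) %>%
  filter(cumall(v < 5)) %>%  # TRUE until the first v >= 5, FALSE from then on
  ungroup()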
Easy way:
Find the first FALSE with min(which(condition == FALSE)):
Patientdays <- cbind.data.frame(treatmentDate = c(1:5, 4, 6:10),
                                date          = c(2:5, 3, 6:10, 10),
                                goodThings    = c(1:11),
                                badThings     = c(0:10))
attach(Patientdays)  # just for ease of use (optional)
condition <- treatmentDate < date & goodThings > badThings
Patientdays[1:(min(which(condition == FALSE)) - 1), ]
Edit: Adding result.
  treatmentDate date goodThings badThings
1             1    2          1         0
2             2    3          2         1
3             3    4          3         2
4             4    5          4         3

Computing Change from Baseline in R

I have a dataset in R which contains observations over time. For each subject I have up to 4 rows, with an ID variable, a Time variable, and a variable called X, which is numerical (but could also be categorical for the sake of the question). I wish to compute the change from baseline for each row, by ID. Until now I did this in SAS; this was my SAS code:
data want;
retain baseline;
set have;
if (first.ID) then baseline = .;
if (first.ID) then baseline = X;
else baseline = baseline;
by ID;
Change = X-baseline;
run;
My question is: How do I do this in R ?
Thank you in advance.
Dataset example (in SAS, as I don't know how to create it in R):
data have;
input ID Time X;
datalines;
1 1 5
1 2 6
1 3 8
1 4 9
2 1 2
2 2 2
2 3 7
2 4 0
3 1 1
3 2 4
3 3 5
;
run;
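For reference, the same example data can be created in R with a plain data.frame (a direct translation of the datalines above):

have <- data.frame(ID   = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
                   Time = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3),
                   X    = c(5, 6, 8, 9, 2, 2, 7, 0, 1, 4, 5))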
Generate some example data:
dta <- data.frame(id = rep(1:3, each = 4), time = rep(1:4, 3), x = rnorm(12))
# (x is drawn with rnorm, so your values will differ from the output below)
# > dta
#    id time            x
# 1   1    1 -0.232313499
# 2   1    2  1.116983376
# 3   1    3 -0.682125947
# 4   1    4 -0.398029820
# 5   2    1  0.440525082
# 6   2    2  0.952058966
# 7   2    3  0.690180586
# 8   2    4 -0.995872696
# 9   3    1  0.009735667
# 10  3    2  0.556254340
# 11  3    3 -0.064571775
# 12  3    4 -1.003582676
I use the dplyr package for this. It is not installed by default, so you'll have to install it first if it isn't already.
The steps are: group the data by id (so that the following operations are done per group), sort to make sure the data are ordered by time (so that the first record is the baseline), then compute a new column as the difference between x and the first value of x. The result is stored in a new data frame, but it can of course also be assigned back to dta.
library(dplyr)
dta_new <- dta %>%
  group_by(id) %>%
  arrange(id, time) %>%
  mutate(change = x - first(x))
# > dta_new
# Source: local data frame [12 x 4]
# Groups: id [3]
#
#       id  time            x      change
#    <int> <int>        <dbl>       <dbl>
# 1      1     1 -0.232313499  0.00000000
# 2      1     2  1.116983376  1.34929688
# 3      1     3 -0.682125947 -0.44981245
# 4      1     4 -0.398029820 -0.16571632
# 5      2     1  0.440525082  0.00000000
# 6      2     2  0.952058966  0.51153388
# 7      2     3  0.690180586  0.24965550
# 8      2     4 -0.995872696 -1.43639778
# 9      3     1  0.009735667  0.00000000
# 10     3     2  0.556254340  0.54651867
# 11     3     3 -0.064571775 -0.07430744
# 12     3     4 -1.003582676 -1.01331834
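The same change-from-baseline computation can be sketched in base R with ave(), assuming the data are already sorted by id and time:

# ave() returns, for every row, the first x within its id group;
# subtracting it gives the change from baseline.
dta$change <- dta$x - ave(dta$x, dta$id, FUN = function(v) v[1])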

Conditionally dropping duplicates from a data.frame

I am trying to figure out how to subset my dataset according to repeated values of the variable s, also taking into account the id associated with each row.
Suppose my dataset is:
dat <- read.table(text = "
id s
1 2
1 2
1 1
1 3
1 3
1 3
2 3
2 3
3 2
3 2",
header=TRUE)
What I would like to do is, for each id, to keep only the first row for which s = 3. The result with dat would be:
id s
1 2
1 2
1 1
1 3
2 3
3 2
3 2
I have tried to use both duplicated() and which() so I could call subset() afterwards, but I am not getting anywhere. The main problem is that it is not sufficient to isolate the first row of each s = 3 "block", because in some cases (as here between id = 1 and id = 2) the 3's run on from one id into the next. Which strategy would you adopt?
Like this:
subset(dat, s != 3 | s == 3 & !duplicated(dat))
#    id s
# 1   1 2
# 2   1 2
# 3   1 1
# 4   1 3
# 7   2 3
# 9   3 2
# 10  3 2
Note that subset can be dangerous to work with (see Why is `[` better than `subset`?), so the longer but safer version would be:
dat[dat$s != 3 | dat$s == 3 & !duplicated(dat), ]
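A dplyr alternative that avoids duplicated() entirely: within each id, keep a row if s != 3 or if it is the first s == 3 row of that id. A minimal sketch:

library(dplyr)

dat %>%
  group_by(id) %>%
  filter(s != 3 | cumsum(s == 3) == 1) %>%  # cumsum == 1 marks the first s == 3 row per id
  ungroup()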
