Create counter with multiple variables that restart within each subgroup - r

I have a dataframe with two columns (ident and value). I would like to create a counter that restart every time ident value change and also when value within each ident change. Here is an example to make it clear.
# ident value counter
#--------------------
# 1 0 1
# 1 0 2
# 1 1 1
# 1 1 2
# 1 1 3
# 1 0 1
# 1 1 1
# 1 1 2
# 2 1 1
# 2 0 1
# 2 0 2
# 2 0 3
I've tried the plyr package
ddply(mydf, .(ident, value), transform, .id = seq_along(ident))
Same result with the data.frame package.

A data.table alternative with the use of the rleid/rowid functions. With rleid you create a run length id for consecutive values, which can be used as a group. 1:.N or rowid can be used to create the counter. The code:
library(data.table)
# option 1:
setDT(d)[, counter := 1:.N, by = .(ident,rleid(value))]
# option 2:
setDT(d)[, counter := rowid(ident, rleid(value))]
which both give:
> d
ident value counter
1: 1 0 1
2: 1 0 2
3: 1 1 1
4: 1 1 2
5: 1 1 3
6: 1 0 1
7: 1 1 1
8: 1 1 2
9: 2 1 1
10: 2 0 1
11: 2 0 2
12: 2 0 3
With dplyr it is a bit less straightforward:
library(dplyr)
d %>%
group_by(ident, val.gr = cumsum(value != lag(value, default = first(value)))) %>%
mutate(counter = row_number()) %>%
ungroup() %>%
select(-val.gr)
As an alternative to the cumsum-function you could also use rleid from data.table.
Used data:
d <- structure(list(ident = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
value = c(0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L)),
.Names = c("ident", "value"), class = "data.frame", row.names = c(NA, -12L))

We can paste the two values together and use length attribute of rle to get the length of consecutive numbers. We then use sequence to generate the counter.
df$counter <- sequence(rle(paste0(df$dent, df$value))$lengths)
df
# dent value counter
#1 1 0 1
#2 1 0 2
#3 1 1 1
#4 1 1 2
#5 1 1 3
#6 1 0 1
#7 1 1 1
#8 1 1 2
#9 2 1 1
#10 2 0 1
#11 2 0 2
#12 2 0 3

Related

comparing a value with next value in a column

I have a data set of the following form:
Interval | Count | criteria
0 0 0
0 1 0
0 2 0
0 3 0
1 4 1
1 5 2
1 6 3
1 7 4
2 8 1
2 9 2
3 10 3
I need to compare the values in the Interval. I first need to create a new variable to store the values. If the value in the Interval is the same as the next value, then the new variable should have blanks. If the Interval value is not the same as the next value, then it should return criteria/Count. Output should be like this:
Interval | Count | criteria | N
0 0 0
0 1 0
0 2 0
0 3 0 0
1 4 1
1 5 2
1 6 3
1 7 4 0.5714
2 8 1
2 9 2 0.2222
3 10 3
Here is my code:
fid$N<-''
for (i in 1:length(fid$Interval))
{
if (fid$Interval[i] != fid$Interval[i+1])
fid$N<-fid$criteria/fid$Count
else
fid$N<-''
}
and this is the error I am getting.
Error in if (fid$Interval[i] != fid$Interval[i + 1]) fid$N <- fid$criteria/fid$Count else fid$N <- "" :
missing value where TRUE/FALSE needed
To add that, there is no missing value in the data set.
I would appreciate if anyone could help.
You don't necessarily need a loop for this since most of the R functions are vectorized. Here is a way to do this in base R, dplyr and data.table without using a loop.
#Base R
transform(df, N = ifelse(Interval != c(tail(Interval, -1), NA), criteria/Count, NA))
#dplyr
library(dplyr)
df %>% mutate(N = if_else(Interval != lead(Interval), criteria/Count, NA_real_))
#data.table
library(data.table)
setDT(df)[, N:= fifelse(Interval != shift(Interval, type = 'lead'), criteria/Count, NA_real_)]
All of which return :
# Interval Count criteria N
#1 0 0 0 NA
#2 0 1 0 NA
#3 0 2 0 NA
#4 0 3 0 0.0000000
#5 1 4 1 NA
#6 1 5 2 NA
#7 1 6 3 NA
#8 1 7 4 0.5714286
#9 2 8 1 NA
#10 2 9 2 0.2222222
#11 3 10 3 NA
I return NA instead of blank value because if we return blank value the entire column becomes of type character and the numbers are no longer useful. In the answer you can replace NA with '' to get a blank value.
data
df <- structure(list(Interval = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 2L,
2L, 3L), Count = 0:10, criteria = c(0L, 0L, 0L, 0L, 1L, 2L, 3L,
4L, 1L, 2L, 3L)), class = "data.frame", row.names = c(NA, -11L))

R: subset dataframe for all rows after a condition is met

So I'm having a dataset of the following form:
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
I would like to subset the dataframe and create a new dataframe, containing only the rows after Var1 first reached its group-maximum (including the row this happens) up to the row where Var2 becomes 1 for the first time (also including this row). So what I'd like to have should look like this:
ID Var1 Var2
1 12 0
1 11 1
2 8 0
2 7 0
2 6 1
The original dataset contains a number of NAs and the function should simply ignore those. Also if Var2 never reaches "1" for a group is should just add all rows to the new dataframe (of course only the ones after Var1 reaches its group maximum).
However I cannot wrap my hand around the programming. Does anyone know help?
A dplyr solution with cumsum based filter will do what the question asks for.
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(cumsum(Var1 == max(Var1)) == 1, cumsum(Var2) <= 1)
## A tibble: 5 x 3
## Groups: ID [2]
# ID Var1 Var2
# <int> <int> <int>
#1 1 12 0
#2 1 11 1
#3 2 8 0
#4 2 7 0
#5 2 6 1
Edit
Here is a solution that tries to answer to the OP's comment and question edit.
df1 %>%
group_by(ID) %>%
mutate_at(vars(starts_with('Var')), ~replace_na(., 0L)) %>%
filter(cumsum(Var1 == max(Var1)) == 1, cumsum(Var2) <= 1)
Data
df1 <- read.table(text = "
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
", header = TRUE)
Using data.table with .I
library(data.table)
setDT(df1)[df1[, .I[cumsum(Var1 == max(Var1)) & cumsum(Var2) <= 1], by="ID"]$V1]
# ID Var1 Var2
#1: 1 12 0
#2: 1 11 1
#3: 2 8 0
#4: 2 7 0
#5: 2 6 1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
Var1 = c(2L, 8L, 12L, 11L, 10L, 5L, 8L, 7L, 6L, 5L), Var2 = c(0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L)), class = "data.frame",
row.names = c(NA,
-10L))
Here is data.table translation of Rui Barradas' working solution:
library(data.table)
dat <- fread(text = "
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
", header = TRUE)
dat[, .SD[cumsum(Var1 == max(Var1)) & cumsum(Var2) <= 1], by="ID"]

Subsetting and repetition of rows in a dataframe using R

Suppose we have the following data with column names "id", "time" and "x":
df<-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(20L, 6L, 7L, 11L, 13L, 2L, 6L),
x = c(1L, 1L, 0L, 1L, 1L, 1L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id has multiple observations for time and x. I want to extract the last observation for each id and form a new dataframe which repeats these observations according to the number of observations per each id in the original data. I am able to extract the last observations for each id using the following codes
library(dplyr)
df<-df%>%
group_by(id) %>%
filter( ((x)==0 & row_number()==n())| ((x)==1 & row_number()==n()))
What is left unresolved is the repetition aspect. The expected output would look like
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(7L, 7L, 7L, 13L, 13L, 6L, 6L),
x = c(0L, 0L, 0L, 1L, 1L, 0L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Thanks for your help in advance.
We can use ave to find the max row number for each ID and subset it from the data frame.
df[ave(1:nrow(df), df$id, FUN = max), ]
# id time x
#3 1 7 0
#3.1 1 7 0
#3.2 1 7 0
#5 2 13 1
#5.1 2 13 1
#7 3 6 0
#7.1 3 6 0
You can do this by using last() to grab the last row within each id.
df %>%
group_by(id) %>%
mutate(time = last(time),
x = last(x))
Because last(x) returns a single value, it gets expanded out to fill all the rows in the mutate() call.
This can also be applied to an arbitrary number of variables using mutate_at:
df %>%
group_by(id) %>%
mutate_at(vars(-id), ~ last(.))
slice will be your friend in the tidyverse I reckon:
df %>%
group_by(id) %>%
slice(rep(n(),n()))
## A tibble: 7 x 3
## Groups: id [3]
# id time x
# <int> <int> <int>
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
In data.table, you could also use the mult= argument of a join:
library(data.table)
setDT(df)
df[df[,.(id)], on="id", mult="last"]
# id time x
#1: 1 7 0
#2: 1 7 0
#3: 1 7 0
#4: 2 13 1
#5: 2 13 1
#6: 3 6 0
#7: 3 6 0
And in base R, a merge will get you there too:
merge(df["id"], df[!duplicated(df$id, fromLast=TRUE),])
# id time x
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
Using data.table you can try
library(data.table)
setDT(df)[,.(time=rep(time[.N],.N), x=rep(x[.N],.N)), by=id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0
Following #thelatemai, to avoid name the columns you can also try
df[, .SD[rep(.N,.N)], by=id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0

In R, how can I elegantly compute the medians for multiple columns, and then count the number of cells in each row that exceed the median?

Suppose I have the following data frame:
Base Coupled Derived Decl
1 0 0 1
1 7 0 1
1 1 0 1
2 3 12 1
1 0 4 1
Here is the dput output:
temp <- structure(list(Base = c(1L, 1L, 1L, 2L, 1L), Coupled = c(0L,7L, 1L, 3L, 0L), Derived = c(0L, 0L, 0L, 12L, 4L), Decl = c(1L, 1L, 1L, 1L, 1L)), .Names = c("Base", "Coupled", "Derived", "Decl"), row.names = c(NA, 5L), class = "data.frame")
I want to compute the median for each column. Then, for each row, I want to count the number of cell values greater than the median for their respective columns and append this as a column called AboveMedians.
In the example, the medians would be c(1,1,0,1). The resulting table I want would be
Base Coupled Derived Decl AboveMedians
1 0 0 1 0
1 7 0 1 1
1 1 0 1 0
2 3 12 1 3
1 0 4 1 1
What is the elegant R way to do this? I have something involving a for-loop and sapply, but this doesn't seem optimal.
Thanks.
We can use rowMedians from matrixStats after converting the data.frame to matrix.
library(matrixStats)
Medians <- colMedians(as.matrix(temp))
Medians
#[1] 1 1 0 1
Then, replicate the 'Medians' to make the dimensions equal to that of 'temp', do the comparison and get the rowSums on the logical matrix.
temp$AboveMedians <- rowSums(temp >Medians[col(temp)])
temp$AboveMedians
#[1] 0 1 0 3 1
Or a base R only option is
apply(temp, 2, median)
# Base Coupled Derived Decl
# 1 1 0 1
rowSums(sweep(temp, 2, apply(temp, 2, median), FUN = ">"))
Another alternative:
library(dplyr)
library(purrr)
temp %>%
by_row(function(x) {
sum(x > summarise_each(., funs(median))) },
.to = "AboveMedian",
.collate = "cols"
)
Which gives:
#Source: local data frame [5 x 5]
#
# Base Coupled Derived Decl AboveMedian
# <int> <int> <int> <int> <int>
#1 1 0 0 1 0
#2 1 7 0 1 1
#3 1 1 0 1 0
#4 2 3 12 1 3
#5 1 0 4 1 1

How do I create occasion variable (time) for each ID?

I would like to create variable "Time" which basically indicates the number of times variable ID showed up within each day minus 1. In other words, the count is lagged by 1 and the first time ID showed up in a day should be left blank. Second time the same ID shows up on a given day should be 1.
Basically, I want to create the "Time" variable in the example below.
ID Day Time Value
1 1 0
1 1 1 0
1 1 2 0
1 2 0
1 2 1 0
1 2 2 0
1 2 3 1
2 1 0
2 1 1 0
2 1 2 0
Below is the code I am working on. Have not been successful with it.
data$time<-data.frame(data$ID,count=ave(data$ID==data$ID, data$Day, FUN=cumsum))
We can do this with data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'ID', 'Day', we get the lag of sequence of rows (shift(seq_len(.N))) and assign (:=) it as "Time" column.
library(data.table)
setDT(df1)[, Time := shift(seq_len(.N)), .(ID, Day)]
df1
# ID Day Value Time
# 1: 1 1 0 NA
# 2: 1 1 0 1
# 3: 1 1 0 2
# 4: 1 2 0 NA
# 5: 1 2 0 1
# 6: 1 2 0 2
# 7: 1 2 1 3
# 8: 2 1 0 NA
# 9: 2 1 0 1
#10: 2 1 0 2
Or with base R
with(df1, ave(Day, Day, ID, FUN= function(x)
ifelse(seq_along(x)!=1, seq_along(x)-1, NA)))
#[1] NA 1 2 NA 1 2 3 NA 1 2
Or without the ifelse
with(df1, ave(Day, Day, ID, FUN= function(x)
NA^(seq_along(x)==1)*(seq_along(x)-1)))
#[1] NA 1 2 NA 1 2 3 NA 1 2
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
Day = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), Value = c(0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L)), .Names = c("ID", "Day",
"Value"), row.names = c(NA, -10L), class = "data.frame")

Resources