Select row meeting condition and all subsequent rows by group - r

Let's assume I have a data frame consisting of a categorical variable and a numerical one.
df <- data.frame(group=c(1,1,1,1,1,2,2,2,2,2),days=floor(runif(10, min=0, max=101)))
df
group days
1 1 54
2 1 61
3 1 31
4 1 52
5 1 21
6 2 22
7 2 18
8 2 50
9 2 46
10 2 35
I would like to select the row corresponding to the maximum number of days within each group, as well as all subsequent rows of that group. For the example above, my subset df2 should look as follows:
df2
group days
2 1 61
3 1 31
4 1 52
5 1 21
8 2 50
9 2 46
10 2 35
Please note that the groups could have different lengths.

For a base R solution, aggregate days by group using a function that keeps the elements with index greater than or equal to that of the maximum, and then reshape the result into a long data.frame:
df0 <- aggregate(days ~ group, df, function(x) x[seq_along(x) >= which.max(x)])
data.frame(group = rep(df0$group, lengths(df0$days)),
           days = unlist(df0$days, use.names = FALSE))
leading to
group days
1 1 61
2 1 31
3 1 52
4 1 21
5 2 50
6 2 46
7 2 35
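A shorter base R alternative (my addition, not part of the original answer): build a logical row index with ave() and skip the reshape step entirely:
# ave() applies the function within each group; the logical result
# comes back coerced to 0/1, so convert it back with as.logical()
keep <- as.logical(ave(df$days, df$group,
                       FUN = function(x) seq_along(x) >= which.max(x)))
df[keep, ]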

You can use which.max to find the index of the maximum of days, and then use slice from dplyr to select that row and all rows after it, where n() gives the number of rows in each group:
library(dplyr)
df %>% group_by(group) %>% slice(which.max(days):n())
#Source: local data frame [7 x 2]
#Groups: group [2]
# group days
# <int> <int>
#1 1 61
#2 1 31
#3 1 52
#4 1 21
#5 2 50
#6 2 46
#7 2 35
The data.table syntax is similar; .N is the analogue of n() in dplyr and gives the number of rows in each group:
library(data.table)
setDT(df)[, .SD[which.max(days):.N], group]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35

We can use a faster option with data.table, where we find the row indices (.I) in a grouped query and then subset the rows in a single step, avoiding the overhead of constructing .SD for each group.
library(data.table)
setDT(df)[df[ , .I[which.max(days):.N], by = group]$V1]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35
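To check the speed difference on your own data, a minimal benchmark sketch (my addition; assumes the microbenchmark package is installed, and the data size here is made up for illustration):
library(data.table)
library(microbenchmark)

# larger made-up data so the timing difference is visible
dt <- data.table(group = rep(1:100, each = 1000),
                 days = sample(100, 1e5, replace = TRUE))

microbenchmark(
  sd_way = dt[, .SD[which.max(days):.N], group],
  i_way = dt[dt[, .I[which.max(days):.N], by = group]$V1],
  times = 10
)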

Related

Using intervals in a column to populate values for another column

I have a dataframe:
dataframe <- data.frame(Condition = rep(c(1,2,3), each = 5, times = 2),
Time = sort(sample(1:60, 30)))
Condition Time
1 1 1
2 1 3
3 1 4
4 1 7
5 1 9
6 2 11
7 2 12
8 2 14
9 2 16
10 2 18
11 3 19
12 3 24
13 3 25
14 3 28
15 3 30
16 1 31
17 1 34
18 1 35
19 1 38
20 1 39
21 2 40
22 2 42
23 2 44
24 2 47
25 2 48
26 3 49
27 3 54
28 3 55
29 3 57
30 3 59
I want to divide the total length of Time (i.e., max(Time) - min(Time)) per Condition by a constant 'x' (e.g., 3). Then I want to use that quotient to add a new variable Trial such that my dataframe looks like this:
Condition Time Trial
1 1 1 A
2 1 3 A
3 1 4 B
4 1 7 C
5 1 9 C
6 2 11 A
7 2 12 A
8 2 14 B
9 2 16 C
10 2 18 C
... and so on
As you can see, for Condition 1, Trial is populated with unique identifying values (e.g., A, B, C) every 2.67 seconds = 8 (total time) / 3. For Condition 2, Trial is populated every 2.33 seconds = 7 (total time) /3.
I am not getting what I want with my current code:
dataframe %>%
group_by(Condition) %>%
mutate(Trial = LETTERS[cut(Time, 3, labels = F)])
# Groups: Condition [3]
Condition Time Trial
<dbl> <int> <chr>
1 1 1 A
2 1 3 A
3 1 4 A
4 1 7 A
5 1 9 A
6 2 11 A
7 2 12 A
8 2 14 A
9 2 16 A
10 2 18 A
# ... with 20 more rows
Thanks!
We want three equal-width bins per group, each of width diff(range(Time))/3; that is exactly what cut() produces when given 3 as the number of breaks. Then use the integer index (labels = FALSE) to pick the corresponding letter from the LETTERS built-in R constant:
library(dplyr)
dataframe %>%
  group_by(Condition) %>%
  mutate(Trial = LETTERS[cut(Time, breaks = 3, labels = FALSE)])
Note, however, that each Condition value occurs here in two separate runs, so grouping by Condition alone pools both runs (which is why your attempt returned all A's). Base the grouping on adjacent values instead: use rleid from data.table on the Condition column to create a run-based grouping and apply the same code as above (the helper grp column can be dropped afterwards with ungroup() and select(-grp)):
library(data.table)
dataframe %>%
  group_by(grp = rleid(Condition)) %>%
  mutate(Trial = LETTERS[cut(Time, breaks = 3, labels = FALSE)])
Here's a one-liner using my santoku package. The rleid grouping is the same as in @akrun's solution; %<>% is the magrittr assignment pipe:
library(magrittr)
library(santoku)
dataframe %<>%
  group_by(grp = data.table::rleid(Condition)) %>%
  mutate(Trial = chop_evenly(Time, intervals = 3, labels = lbl_seq("A")))

I want to create a loop with a large dataframe in R

Problem
I want to create a loop from the data in df1; it's important that the data is taken one id value at a time.
I'm unsure how this can be done in R.
#original dataset
id=c(1,1,1,2,2,2,3,3,3)
dob=c("11-08","12-04","04-03","10-04","03-07","06-02","12-09","01-01","03-08")
count=c(1,6,3,2,5,6,8,6,4)
outcome=rep(1:0,length.out=9)
df1=data.frame(id,dob,count,outcome)
# the following changes need to be completed separately for each id value
df2<-df1[df1$id==1,]
df2<-df2[,-4]
addition<-df2$count+45
df2<-cbind(df2,addition)
df3<-df1[df1$id==2,]
df3<-df3[,-4]
addition<-df3$count+45
df3<-cbind(df3,addition)
df4<-df1[df1$id==3,]
df4<-df4[,-4]
addition<-df4$count+45
df4<-cbind(df4,addition)
df5<-rbind(df2,df3,df4)
Expected Output
df5<-rbind(df2,df3,df4)
id dob count addition
1 1 11-08 1 46
2 1 12-04 6 51
3 1 04-03 3 48
4 2 10-04 2 47
5 2 03-07 5 50
6 2 06-02 6 51
7 3 12-09 8 53
8 3 01-01 6 51
9 3 03-08 4 49
In the present context (which could be a simplified example) we don't even need a loop, as we can add the number to 'count' directly:
df1$addition <- df1$count + 45
However, if it is a more complicated operation that needs to treat each 'id' separately, do a group_by operation:
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(addition = count + 45)
# A tibble: 9 x 5
# Groups: id [3]
# id dob count outcome addition
# <dbl> <fct> <dbl> <int> <dbl>
#1 1 11-08 1 1 46
#2 1 12-04 6 0 51
#3 1 04-03 3 1 48
#4 2 10-04 2 0 47
#5 2 03-07 5 1 50
#6 2 06-02 6 0 51
#7 3 12-09 8 1 53
#8 3 01-01 6 0 51
#9 3 03-08 4 1 49
Also, data.table syntax would be
library(data.table)
setDT(df1)[, addition := count + 45, by = id]
or simply
setDT(df1)[, addition := count + 45]
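If you do want the literal per-id loop from the question, a minimal base R sketch using split() and lapply() (my addition; it mirrors the manual df2/df3/df4 steps, assuming df1 is still the original data.frame):
# transform each id's rows separately, then recombine
pieces <- lapply(split(df1, df1$id), function(d) {
  d <- d[, -4]                 # drop the 'outcome' column
  d$addition <- d$count + 45   # the per-id operation
  d
})
df5 <- do.call(rbind, pieces)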

Check time series incongruencies

Let's say that we have the following data frame:
x <- data.frame(ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
                Visit = c(1,2,3,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
                Age = c(14,28,42,14,46,64,71,85,14,28,51,84,66,22,38,32,40,42))
The first column is the subject ID, the second the visit number, and the third the age at each of these consecutive visits.
What would be the easiest way of finding visits where the age is wrong given the age at the previous visit? (E.g., in row 13 subject C is 66 years old, when in the previous visit he was already 84; in row 16, subject D is 32 years old, when in the previous visit he was already 38.)
What would be the way of highlighting these potential errors and removing rows 13 and 16?
I have tried aggregating by ID and looking at the differences between ages across visits, but it seems hard since the error could occur at any visit.
How about this in base R?
df <- do.call(rbind.data.frame, lapply(split(x, x$ID), function(w)
  w[c(1, which(diff(w[order(w$Visit), "Age"]) > 0) + 1), ]))
df
# ID Visit Age
#A.1 A 1 14
#A.2 A 2 28
#A.3 A 3 42
#B.4 B 1 14
#B.5 B 2 46
#B.6 B 3 64
#B.7 B 4 71
#B.8 B 5 85
#C.9 C 1 14
#C.10 C 2 28
#C.11 C 3 51
#C.12 C 4 84
#D.14 D 1 22
#D.15 D 2 38
#D.17 D 4 40
#D.18 D 5 42
Explanation: we split the data frame on column ID, order every subset by Visit, compute the differences between successive Age values, and keep the first row plus those rows where the difference is > 0 (i.e. Age is increasing); rbinding the pieces gives the final data frame.
You could do it by filtering out the rows where diff(Age) is negative for each ID.
Using the dplyr package:
library(dplyr)
x %>% group_by(ID) %>% filter(c(0,diff(Age))>=0)
# A tibble: 16 x 3
# Groups: ID [4]
ID Visit Age
<chr> <dbl> <dbl>
1 A 1 14
2 A 2 28
3 A 3 42
4 B 1 14
5 B 2 46
6 B 3 64
7 B 4 71
8 B 5 85
9 C 1 14
10 C 2 28
11 C 3 51
12 C 4 84
13 D 1 22
14 D 2 38
15 D 4 40
16 D 5 42
The aggregate() approach is pretty concise.
Removing bad rows
good <- do.call(c, aggregate(Age ~ ID, x, function(z) c(z[1], diff(z)) > 0)$Age)
x[good,]
# ID Visit Age
# 1 A 1 14
# 2 A 2 28
# 3 A 3 42
# 4 B 1 14
# 5 B 2 46
# 6 B 3 64
# 7 B 4 71
# 8 B 5 85
# 9 C 1 14
# 10 C 2 28
# 11 C 3 51
# 12 C 4 84
# 14 D 1 22
# 15 D 2 38
# 17 D 4 40
# 18 D 5 42
This will only highlight which groups have an inconsistency:
aggregate(Age ~ ID, x, function(z) all(diff(z) > 0))
# ID Age
# 1 A TRUE
# 2 B TRUE
# 3 C FALSE
# 4 D FALSE
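To highlight the suspect rows themselves before removing them (the second part of the question), one base R sketch; the 'suspect' helper column is my own addition, and it assumes rows are already ordered by Visit within each ID:
# flag rows whose Age is lower than at the previous visit (0/1 per row)
x$suspect <- ave(x$Age, x$ID, FUN = function(z) c(FALSE, diff(z) < 0))
x[x$suspect == 1, ]                              # rows 13 and 16
x <- x[x$suspect == 0, c("ID", "Visit", "Age")]  # then drop them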

Row-wise expansion of data.frame [duplicate]

This question already has answers here:
De-aggregate / reverse-summarise / expand a dataset in R [duplicate]
I basically want to do the opposite of ddply(df, columns.to.preserve, numcolwise(FUNCTION)).
Suppose I have
d <- data.frame(
count=c(2,1,3),
summed.value=c(50,20,30),
averaged.value=c(35,80,20)
)
count summed.value averaged.value
1 2 50 35
2 1 20 80
3 3 30 20
I want to do a row expansion of this data.frame based on the count column while specifying what kind of operation I want to apply to the other columns.
Here is the kind of result I'm looking for:
> d2
count summed.value averaged.value
1 1 25 35
2 1 25 35
3 1 20 80
4 1 10 20
5 1 10 20
6 1 10 20
Are there built-in functions within dplyr or other packages that do this kind of operation?
Edit: This is different from the De-aggregate / reverse-summarise / expand a dataset in R question because I want to go further and actually apply different functions to columns within the table I wish to expand. There are also more useful answers on this post.
Using dplyr and tidyr, you can do a rowwise transformation for summed.value, which produces a list for each cell, and then unnest the column to get what you need:
library(dplyr); library(tidyr)
d %>% rowwise() %>% summarise(summed.value = list(rep(summed.value/count, count)),
averaged.value = averaged.value, count = 1) %>% unnest()
# Source: local data frame [6 x 3]
# averaged.value count summed.value
# <dbl> <dbl> <dbl>
# 1 35 1 25
# 2 35 1 25
# 3 80 1 20
# 4 20 1 10
# 5 20 1 10
# 6 20 1 10
Another way is to use data.table, where you can specify the row number as group variable, and the data table will automatically expand it:
library(data.table)
setDT(d)
d[, .(summed.value = rep(summed.value/count, count), averaged.value, count = 1),
  .(1:nrow(d))][, nrow := NULL][]
# summed.value averaged.value count
#1: 25 35 1
#2: 25 35 1
#3: 20 80 1
#4: 10 20 1
#5: 10 20 1
#6: 10 20 1
There is a function untable in the reshape package for getting the inverse of a table. Then divide the variables that need dividing by count via mutate_at (or mutate_each); mutate_at was introduced in dplyr 0.5.0.
First the untable:
library(reshape)
untable(d, num = d$count)
count summed.value averaged.value
1 2 50 35
1.1 2 50 35
2 1 20 80
3 3 30 20
3.1 3 30 20
3.2 3 30 20
Then the mutate_at for dividing summed.value and count by count:
library(dplyr)
untable(d, num = d$count) %>%
mutate_at(vars(summed.value, count), funs(./count))
count summed.value averaged.value
1 1 25 35
2 1 25 35
3 1 20 80
4 1 10 20
5 1 10 20
6 1 10 20
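As an aside, funs() and mutate_at() have since been superseded in dplyr; with dplyr >= 1.0 the same step can be sketched with across() (my addition, same result, still using untable() from above):
untable(d, num = d$count) %>%
  mutate(across(c(summed.value, count), ~ .x / count))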
Here's a simple and fully vectorized base R approach. Note that transform() evaluates its arguments against the original columns, so summed.value is divided by the original count even though count is set to 1 in the same call:
transform(d[rep(1:nrow(d), d$count), ],
count = 1,
summed.value = summed.value/count)
# count summed.value averaged.value
# 1 1 25 35
# 1.1 1 25 35
# 2 1 20 80
# 3 1 10 20
# 3.1 1 10 20
# 3.2 1 10 20
Or similarly, using data.table
library(data.table)
res <- setDT(d)[rep(1:.N, count)][, `:=`(count = 1, summed.value = summed.value / count)]
res
# count summed.value averaged.value
# 1: 1 25 35
# 2: 1 25 35
# 3: 1 20 80
# 4: 1 10 20
# 5: 1 10 20
# 6: 1 10 20
A base R solution: replicate each row by the value of the count column, then divide the count and summed.value columns by count.
mytext <- 'count,summed.value,averaged.value
2,50,35
1,20,80
3,30,20'
mydf <- read.table(text=mytext,header=T,sep = ",")
mydf <- do.call(rbind, apply(mydf, 1, function(x) {
  tempdf <- t(replicate(x[1], x, simplify = TRUE))  # x[1] copies of the row
  tempdf[, 1] <- tempdf[, 1] / x[1]                 # count / count = 1
  tempdf[, 2] <- tempdf[, 2] / x[1]                 # summed.value / count
  return(data.frame(tempdf))
}))
count summed.value averaged.value
1 25 35
1 25 35
1 20 80
1 10 20
1 10 20
1 10 20
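For completeness, newer tidyr (>= 0.8) offers uncount(), which replicates rows according to a weights column; a minimal sketch of the same expansion starting from the original d (my addition):
library(dplyr)
library(tidyr)
d %>%
  mutate(summed.value = summed.value / count) %>%  # split the sum evenly first
  uncount(count) %>%                               # expand rows; drops 'count'
  mutate(count = 1)                                # add the count column back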

Remove duplicate observations based on set of rules

I am trying to remove duplicate observations from a data set based on my variable, id. However, I want the removal of observations to be based on the following rules. The variables below are id, the sex of the household head (1 = male, 2 = female) and the age of the household head. The rules are as follows. If a household has both male and female household heads, remove the female household head observation. If a household has either two male or two female heads, remove the observation with the younger household head. An example data set is below.
id = c(1,2,2,3,4,5,5,6,7,8,8,9,10)
sex = c(1,1,2,1,2,2,2,1,1,1,1,2,1)
age = c(32,34,54,23,32,56,67,45,51,43,35,80,45)
data = data.frame(id, sex, age)
You can do this by first ordering the data.frame so that the desired entry for each id comes first, and then removing the rows with duplicated ids.
d <- with(data, data[order(id, sex, -age),])
# id sex age
# 1 1 1 32
# 2 2 1 34
# 3 2 2 54
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 6 5 2 56
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 11 8 1 35
# 12 9 2 80
# 13 10 1 45
d[!duplicated(d$id), ]
# id sex age
# 1 1 1 32
# 2 2 1 34
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 12 9 2 80
# 13 10 1 45
With data.table, this is easy with "compound queries". Set the "key" to "id,sex" when creating the data.table so the rows are ordered male-first within each id (required in case any female rows would otherwise come before male rows for a given id); taking max(age) by key and then dropping duplicated ids keeps the male head when both sexes are present, and the older head otherwise.
> library(data.table)
> DT <- data.table(data, key = "id,sex")
> DT[, max(age), by = key(DT)][!duplicated(id)]
id sex V1
1: 1 1 32
2: 2 1 34
3: 3 1 23
4: 4 2 32
5: 5 2 67
6: 6 1 45
7: 7 1 51
8: 8 1 43
9: 9 2 80
10: 10 1 45
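The same ordering logic can be sketched in dplyr as well (my addition, equivalent to the base R order()/duplicated() approach above):
library(dplyr)
data %>%
  arrange(id, sex, desc(age)) %>%  # male (sex = 1) first, then oldest first
  distinct(id, .keep_all = TRUE)   # keep the first row for each id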
