compare the first and last observation in each group - r

I have a dataset like this:
df <- data.frame(group = c(rep(1,3),rep(2,2), rep(3,3),rep(4,3),rep(5, 2)), score = c(30, 10, 22, 44, 50, 5, 20, 1,35, 2, 60, 14,5))
group score
1 1 30
2 1 10
3 1 22
4 2 44
5 2 50
6 3 5
7 3 20
8 3 1
9 4 35
10 4 2
11 4 60
12 5 14
13 5 5
I wish to compare the first and last score in each group; if the last score is smaller than the first, output the group number. The expected output should look like:
group 1 3 5
Does anyone have an idea how to achieve this?

Here's a data.table approach:
library(data.table)
setDT(df)[, score[1] > score[.N], by = group][V1 == TRUE]
## group V1
## 1: 1 TRUE
## 2: 3 TRUE
## 3: 5 TRUE
Or
setDT(df)[, group[score[1] > score[.N]], by = group]
## group V1
## 1: 1 1
## 2: 3 3
## 3: 5 5
Or
setDT(df)[, .BY[score[1] > score[.N]], by = group]
As per @beginneR's comment, if you don't like the V1 column you could do
df2 <- as.data.table(df)[, .BY[score[1] > score[.N]], by = group][, V1 := NULL]
df2
## group
## 1: 1
## 2: 3
## 3: 5

This should do the job:
# First split the data frame by group
# This returns a list
df.split <- split(df, factor(df$group))
# Now use sapply on the list to check first and last of each group
# We return the group or NA using ifelse
res <- sapply(df.split,
              function(x) ifelse(x$score[1] > x$score[nrow(x)], x$group[1], NA))
# Finally, filter away the NAs
res <- res[!is.na(res)]

This answer assumes that every group has at least 2 observations:
newdf <- merge(rbind(df[diff(df$group) == 1, ], df[dim(df)[1], ]),
               df[!duplicated(df$group), ],
               by = "group")
newdf[which(newdf$score.x < newdf$score.y), 'group']
#[1] 1 3 5
df[diff(df$group) == 1 ,] identifies the last observation of each group, except for the last group, which is why I rbind the last entry (i.e. df[dim(df)[1], ]). Then, the first observation of each group is given by df[!duplicated(df$group), ]. We merge these on the group column, then identify which ones meet the criteria.
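To make the explanation concrete, here is what the two building blocks return for the example data:
df[diff(df$group) == 1, ]    # last row of each group except the final group
#    group score
# 3      1    22
# 5      2    50
# 8      3     1
# 11     4    60
df[!duplicated(df$group), ]  # first row of each group
#    group score
# 1      1    30
# 4      2    44
# 6      3     5
# 9      4    35
# 12     5    14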
Another option for the merge step:
merge(df[which(!duplicated(df$group)) + (rle(df$group)$lengths - 1), ],
      df[!duplicated(df$group), ],
      by = "group")

One more base R option:
with(df, unique(df$group[as.logical(ave(score, group, FUN = function(x) head(x,1) > tail(x, 1)))]))
#[1] 1 3 5
Or using dplyr:
library(dplyr)
group_by(df, group) %>%
  filter(first(score) > last(score)) %>%
  do(head(., 1)) %>%
  select(group)
# group
#1 1
#2 3
#3 5
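On newer dplyr versions (1.0 or later, where do() is superseded), a summarise-based sketch of the same idea could look like this:
library(dplyr)
df %>%
  group_by(group) %>%
  summarise(drop = first(score) > last(score), .groups = "drop") %>%
  filter(drop) %>%
  pull(group)
# [1] 1 3 5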

I'm a plyr package fan...
library(plyr)
df1 <- ddply(df, .(group), summarise, shown = score[length(group)] < score[1])
subset(df1,shown)
  group shown
1     1  TRUE
3     3  TRUE
5     5  TRUE

Related

Compare Lists in datatable

I have a data table (data) which looks like the following.
rn peoplecount
1 0,2,0,1
2 1,1,0,0
3 0,1,0,5
4 5,3,0,2
5 2,2,0,1
6 1,2,0,3
7 0,1,0,0
8 0,2,0,8
9 8,2,0,0
10 0,1,0,0
My goal is to find all records whose 1st element does not match the 4th element of the previous row. In this example, the 7th row meets the criterion. How can I get a list of all such records?
My attempt so far.
data[, previous_peoplecount:= c(NA, peoplecount[shift(seq_along(peoplecount), fill = 0)])]
This gives a new table as follows:
rn peoplecount previous_peoplecount
1 0,2,0,1 NA
2 1,1,0,0 0,2,0,1
3 0,1,0,5 1,1,0,0
4 5,3,0,2 0,1,0,5
5 0,2,0,1 5,3,0,2
6 1,2,0,3 0,2,0,1
7 0,1,0,0 1,2,0,3
8 0,2,0,8 0,1,0,0
9 8,2,0,0 0,2,0,8
10 0,1,0,0 8,2,0,0
Now I have to fetch all records where the 1st element of peoplecount is not equal to the 4th element of previous_peoplecount. I am stuck at this part. Any suggestions?
Edit: peoplecount is a list of numeric vectors.
You can try something along the lines of keeping only the first value of each row and only the last value of the previous row, then comparing, i.e.
library(data.table)
setDT(dt)[, first_pos := sub(',.*', '', peoplecount)][,
  last_pos_shifted := sub('.*,', '', shift(peoplecount))][
  first_pos != last_pos_shifted, ]
which gives,
rn peoplecount first_pos last_pos_shifted
1: 7 0,1,0,0 0 3
I would convert to long format and then select the elements of interest:
dt <- data.table(rn = 1:3, x = lapply(1:3, function(x) x:(x+3)))
dt$x[[2]] <- c(4, 1, 1, 1)
dt
# rn x
# 1: 1 1,2,3,4
# 2: 2 4,1,1,1
# 3: 3 3,4,5,6
# convert to long format
dt2 <- dt[, .(rn = rep(rn, each = 4), x = unlist(x))]
dt2[, id:= 1:4]
dtSelected <- dt2[x == shift(x) & id == 4]
dtSelected
# rn x id
# 1: 2 1 4
dt[dtSelected$rn]
# rn x
# 1: 2 4,1,1,1
I was not satisfied with the answers and came up with my own solution as follows:
h <- sapply(data$peoplecount, function(x) x[1])  # 1st element of each row
t <- sapply(data$peoplecount, function(x) x[4])  # 4th element of each row
indices <- which(head(t, -1) != tail(h, -1))     # i such that row i+1's 1st element differs from row i's 4th
Thanks to @Sotos and @minem for pushing me in the right direction.
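For completeness, a small data.table sketch of the same comparison, assuming peoplecount is a list column as described in the edit (the helper columns first_elem and prev_last are made up here):
library(data.table)
data[, first_elem := sapply(peoplecount, `[`, 1)]        # 1st element of each row
data[, prev_last := shift(sapply(peoplecount, `[`, 4))]  # 4th element of the previous row
data[first_elem != prev_last]                            # rows whose 1st element differs from the previous 4th (row 7 here)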

calculate median from data.table columns in R

I am trying to calculate a median value across a number of columns; however, my data is a bit funky. It looks like the following example.
library(data.table)
dt <- data.table("ID" = c(1,2,3,4),"none" = c(0,5,5,3),
"ten" = c(3,2,5,4),"twenty" = c(0,2,3,1))
ID none ten twenty
1: 1 0 3 0
2: 2 5 2 2
3: 3 5 5 3
4: 4 3 4 1
In the table, each column represents the number of occurrences of that value (none = 0, ten = 10, twenty = 20). I want to calculate the median value for each ID.
For example for ID = 1
median(c(10, 10, 10))
is the calculation I am wanting to create.
for ID = 2
median(c(0, 0, 0, 0, 0, 10, 10, 20, 20))
I have tried using rep() and lapply() with very limited success, and I am after some clear guidance on how this might be achieved. I understand that with rep() I would have to hard-code the value to be repeated (e.g. rep(0, 2) or rep(10, 2)), and that is what I expect. I am just struggling to create a list or vector with the repetitions from each column.
Here's another data.table way (assuming unique ID):
dt[, median(rep(c(0, 10, 20), c(none, ten, twenty))), by=ID]
# ID V1
# 1: 1 10
# 2: 2 0
# 3: 3 10
# 4: 4 10
This is just an attempt to get @eddi's answer without reshaping (which I tend to use as a last resort).
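To see what the rep() call expands to before the median is taken, consider ID = 2 (none = 5, ten = 2, twenty = 2):
rep(c(0, 10, 20), c(5, 2, 2))
# [1]  0  0  0  0  0 10 10 20 20
median(rep(c(0, 10, 20), c(5, 2, 2)))
# [1] 0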
You need a dictionary to translate column names to corresponding numbers, and then it's fairly straightforward:
dict = data.table(name = c('none', 'ten', 'twenty'), number = c(0, 10, 20))
melt(dt, id.var = 'ID')[
dict, on = c(variable = 'name')][, median(rep(number, value)), by = ID]
# ID V1
#1: 1 10
#2: 2 0
#3: 3 10
#4: 4 10
Here's a way that avoids by-row operations and reshaping:
dt[, m := {
  cSD = Reduce(`+`, .SD, accumulate = TRUE)
  k = floor(cSD[[length(.SD)]] / 2)
  m = integer(.N)
  for (i in seq_along(cSD)) {
    left = m == 0L
    if (!any(left)) break
    m[left] = i * (cSD[[i]][left] >= k[left])
  }
  names(.SD)[m]
}, .SDcols = none:twenty]
which gives
ID none ten twenty m
1: 1 0 3 0 ten
2: 2 5 2 2 none
3: 3 5 5 3 ten
4: 4 3 4 1 ten
For the loop, I'm borrowing @alexis_laz's style, e.g. https://stackoverflow.com/a/30513197/
I've skipped translation of the column names, but that's pretty straightforward. You could use c(0,10,20) instead of names(.SD) at the end.
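As an illustration of the running column sums the loop works with, Reduce with accumulate = TRUE over the three count columns of this data gives:
Reduce(`+`, list(c(0, 5, 5, 3), c(3, 2, 5, 4), c(0, 2, 3, 1)), accumulate = TRUE)
# [[1]] 0 5 5 3       (counts up to value 0)
# [[2]] 3 7 10 7      (counts up to value 10)
# [[3]] 3 9 13 8      (counts up to value 20, i.e. row totals)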
Here is a rowwise dplyr way:
dt %>% rowwise %>%
do(med = median(c(rep(0, .$none), rep(10, .$ten), rep(20, .$twenty)))) %>%
as.data.frame
med
1 10
2 0
3 10
4 10
Inspired by #Arun's answer, this is also working:
dt %>% group_by(ID) %>%
summarise(med = median(rep(c(0, 10, 20), c(none, ten, twenty))))
Source: local data table [4 x 2]
ID med
(dbl) (dbl)
1 1 10
2 2 0
3 3 10
4 4 10

Sum of hybrid data frames depending on multiple conditions in R

This is a more complex follow-up to my previous question. The answer there was to use a matrix, but that doesn't work with data frames having values of different modes.
I want to combine data frames of different sizes, with character and integer columns, and calculate their sum depending on multiple conditions.
Conditions
sums are only calculated for those rows that have a matching "Name"-value
sums are calculated for matching column names only
if a cell in df4 is not 0 and not NA, the sum should be df3 + df4
else the sum should be df1 + df2 + df3
Example
> df1 <- data.frame(Name=c("Joe","Ann","Lee","Dan"), "1"=c(0,1,5,2), "2"=c(3,1,0,0), "3"=c(2,0,2,2), "4"=c(2,1,3,4))
> df1
Name X1 X2 X3 X4
1 Joe 0 3 2 2
2 Ann 1 1 0 1
3 Lee 5 0 2 3
4 Dan 2 0 2 4
> df2 <- data.frame(Name=c("Joe","Ann","Ken"), "1"=c(3,4,1), "2"=c(2,3,0), "3"=c(2,4,3))
> df2
Name X1 X2 X3
1 Joe 3 2 2
2 Ann 4 3 4
3 Ken 1 0 3
> df3 <- data.frame(Name=c("Lee","Ben"), "1"=c(1,3), "2"=c(3,4), "3"=c(4,3))
> df3
Name X1 X2 X3
1 Lee 1 3 4
2 Ben 3 4 3
The condition depends on this frame:
> df4 <- data.frame(Name=c("Lee","Ann","Dan"), "1"=c(6,0,NA), "2"=c(0,0,4), "3"=c(0,NA,0))
> df4
Name X1 X2 X3
1 Lee 6 0 0
2 Ann 0 0 NA
3 Dan NA 4 0
With the above examples, this is the expected result (* values depend on df4):
> dfsum
Name X1 X2 X3 X4
1 Joe 3 5 4 2
2 Ann 5 4 4 1
3 Lee 7* 3 6 3
4 Dan 2 4* 2 4
5 Ken 1 0 3 NA
6 Ben 3 4 3 NA
Possible steps?
First expand df1, df2, df3, df4 to 5 columns and 6 rows, fill missing data with NA.
Then for each data frame:
sort rows by "Name"
separate "Name" column from "X1"..."X4"
transform "X1"..."X4" columns to matrix
calculate sums of the matrices like in the answer to my other question but with the additional condition 1
transform result matrix to data frame
cbind the "Name" column with the result data frame
How can this be done in R?
Solution
@Ricardo Saporta's solution works with small changes:
Add , padValue=NA before the closing parenthesis in the four addCols() calls.
As answered here, replace the definitions of sumD3D4 and dtsum with:
plus <- function(x) {
  if (all(is.na(x))) {
    c(x[0], NA)
  } else {
    sum(x, na.rm = TRUE)
  }
}
sumD3D4 <- setkey(rbind(dt3, dt4)[,lapply(.SD, plus), by = Name], "Name")
dtsum <- setkey(rbind(dt1, dt2, dt3)[, lapply(.SD, plus), by=Name], "Name")
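A quick sanity check of the NA handling in plus():
plus(c(1, NA, 2))  # 3  -- NAs are dropped before summing
plus(c(NA, NA))    # NA -- an all-NA group stays NA instead of becoming 0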
If you use data.table instead of data.frame, you could use its by=xxxx feature to add by Name.
The code below should give you your expected results.
Please note that I am padding the data.tables with extra empty columns; however, condTrue is computed before the padding.
library(data.table)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
dt3 <- data.table(df3)
dt4 <- data.table(df4)
# make sure all dt's have the same columns
#-----------------------------------------#
# identify which dt4 satisfy the condition
condTrue <- as.data.table(which(!(is.na(dt4) | dt4==0), arr.ind=TRUE))
# ignore column "Name" from dt4
condTrue <- condTrue[col>1]
# convert from (row, col) index to ("Name", columnName)
condTrue <- data.table(Name=dt4[condTrue$row, Name], colm=names(dt4)[condTrue$col], key="Name")
# First make a list of all the unique column names
allColumnNames <- unique(c(names(dt1), names(dt2), names(dt3), names(dt4)))
# add columns as necessary, using addCols (defined below)
addCols(dt1, allColumnNames)
addCols(dt2, allColumnNames)
addCols(dt3, allColumnNames)
addCols(dt4, allColumnNames)
sumD3D4 <- setkey(rbind(dt3, dt4)[, lapply(.SD, sum), by=Name], "Name")
dtsum <- setkey(rbind(dt1, dt2, dt3)[, lapply(.SD, sum), by=Name], "Name")
for (Nam in condTrue$Name) {
  colsRepl <- condTrue[.(Nam)]$colm
  valsRepl <- unlist(sumD3D4[.(Nam), c(colsRepl), with = FALSE])
  dtsum[.(Nam), c(colsRepl) := as.list(valsRepl)]
}
dtsum
# Name 1 2 3 4
# 1: Ann 5 4 4 1
# 2: Ben 3 4 3 0
# 3: Dan 2 4 2 4
# 4: Joe 3 5 4 2
# 5: Ken 1 0 3 0
# 6: Lee 7 3 6 3
addCols <- function(x, cols, padValue = 0) {
  # adds to x any columns that are in cols but not in x
  # Returns TRUE if columns were added
  #         FALSE if no columns added
  colsMissing <- setdiff(cols, names(x))
  # grab the actual DT name that was passed to function
  dtName <- as.character(match.call()[2])
  if (length(colsMissing)) {
    get(dtName, envir = parent.frame(1))[, c(colsMissing) := padValue]
    return(TRUE)
  }
  return(FALSE)
}
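A hypothetical quick check of addCols() behaviour (the tmp table and the X2 column here are made up for illustration):
tmp <- data.table(Name = "Joe", X1 = 1)
addCols(tmp, c("Name", "X1", "X2"))  # TRUE: X2 was missing and gets added by reference
tmp
#    Name X1 X2
# 1:  Joe  1  0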

Reduce dataset based on value

I have a dataset
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
For every id the values are sorted in ascending order.
I want to reduce dtf to include only the first row per id where the value exceeds a specified limit. Only one row per id, and it should be the one where the value first exceeds the limit.
For this example and a limit of 5, dtf should reduce to:
A 6
B 6
Is there a nice way to do this?
Thanks a lot
It could be done with aggregate:
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
limit <- 5
aggregate(value ~ id, dtf, function(x) x[x > limit][1])
The result:
id value
1 A 6
2 B 6
Update: A solution for multiple columns:
An example data frame, dtf2:
dtf2 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
value=c(2,4,6,8,4,6,8,10),
col3 = letters[1:8],
col4 = 1:8)
A solution including ave:
with(dtf2, dtf2[ave(value, id, FUN = function(x) cumsum(x > limit)) == 1, ])
The result:
id value col3 col4
3 A 6 c 3
6 B 6 f 6
Here is a "nice" option using data.table:
library(data.table)
DT <- data.table(dtf, key = "id")
DT[value > 5, head(.SD, 1), by = key(DT)]
# id value
# 1: A 6
# 2: B 6
And, in the spirit of sharing, an option using sqldf which might be nice depending on whether you feel more comfortable with SQL.
sqldf("select id, min(value) as value from dtf where value > 5 group by id")
# id value
# 1 A 6
# 2 B 6
Update: Unordered source data, and a data.frame with multiple columns
Based on your comments to some of the answers, it seems like there might be a chance that your "value" column might not be ordered like it is in your example, and that there are other columns present in your data.frame.
Here are two alternatives for those scenarios, one with data.table, which I find easiest to read and is most likely the fastest, and one with a typical "split-apply-combine" approach that is commonly needed for such tasks.
First, some sample data:
dtf2 <- data.frame(id = c("A","A","A","A","B","B","B","B"),
value = c(6,4,2,8,4,10,8,6),
col3 = letters[1:8],
col4 = 1:8)
dtf2 # Notice that the value column is not ordered
# id value col3 col4
# 1 A 6 a 1
# 2 A 4 b 2
# 3 A 2 c 3
# 4 A 8 d 4
# 5 B 4 e 5
# 6 B 10 f 6
# 7 B 8 g 7
# 8 B 6 h 8
Second, the data.table approach:
library(data.table)
DT <- data.table(dtf2)
DT # Verify that the data are not ordered
# id value col3 col4
# 1: A 6 a 1
# 2: A 4 b 2
# 3: A 2 c 3
# 4: A 8 d 4
# 5: B 4 e 5
# 6: B 10 f 6
# 7: B 8 g 7
# 8: B 6 h 8
DT[order(value)][value > 5, head(.SD, 1), by = "id"]
# id value col3 col4
# 1: A 6 a 1
# 2: B 6 h 8
Third, base R's common "split-apply-combine" approach:
do.call(rbind,
        lapply(split(dtf2, dtf2$id),
               function(x) x[x$value > 5, ][which.min(x$value[x$value > 5]), ]))
# id value col3 col4
# A A 6 a 1
# B B 6 h 8
Another approach with aggregate:
> aggregate(value~id, dtf[dtf[,'value'] > 5,], min)
id value
1 A 6
2 B 6
This does depend on the values being sorted within each id, so that the minimum is also the first value to exceed the limit.
Might as well: an alternative with plyr and head:
library(plyr)
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
limit <- 5
result <- ddply(dtf, "id", function(x) head(x[x$value > limit ,],1) )
> result
id value
1 A 6
2 B 6
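For comparison, a dplyr sketch of the same task; filter() keeps the original row order, so slice(1) picks the first row per id whose value exceeds the limit:
library(dplyr)
limit <- 5
dtf %>%
  group_by(id) %>%
  filter(value > limit) %>%
  slice(1) %>%
  ungroup()
# gives one row per id: A 6 and B 6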
This depends on your data.frame being sorted:
threshold <- 5
foo <- dtf[dtf$value >= threshold, ]
# keep the first row overall plus the first row after each change of id
foo[c(1, which(diff(as.numeric(as.factor(foo$id))) > 0) + 1), ]

subset rows with (1) ALL and (2) ANY columns larger than a specific value

I have a data frame with an id column and some (potentially many) columns with values, here 'v1', 'v2':
df <- data.frame(id = c(1:5), v1 = c(0,15,9,12,7), v2 = c(9,32,6,17,11))
# id v1 v2
# 1 1 0 9
# 2 2 15 32
# 3 3 9 6
# 4 4 12 17
# 5 5 7 11
How can I extract rows where ALL values are larger than a certain value, say 10, which should return:
# id v1 v2
# 2 2 15 32
# 4 4 12 17
How can I extract rows where ANY (at least one) value is larger than 10:
# id v1 v2
# 2 2 15 32
# 4 4 12 17
# 5 5 7 11
See the functions all() and any() for the first and second parts of your question, respectively. The apply() function can be used to run functions over rows or columns (MARGIN = 1 is rows, MARGIN = 2 is columns, etc.). Note I use apply() on df[, -1] to ignore the id variable when doing the comparisons.
Part 1:
> df <- data.frame(id=c(1:5), v1=c(0,15,9,12,7), v2=c(9,32,6,17,11))
> df[apply(df[, -1], MARGIN = 1, function(x) all(x > 10)), ]
id v1 v2
2 2 15 32
4 4 12 17
Part 2:
> df[apply(df[, -1], MARGIN = 1, function(x) any(x > 10)), ]
id v1 v2
2 2 15 32
4 4 12 17
5 5 7 11
To see what is going on, x > 10 returns a logical vector for each row (via apply()), indicating whether each element is greater than 10. all() returns TRUE if all elements of the input vector are TRUE and FALSE otherwise. any() returns TRUE if any of the elements in the input is TRUE and FALSE if all are FALSE.
I then use the logical vector resulting from the apply() call
> apply(df[, -1], MARGIN = 1, function(x) all(x > 10))
[1] FALSE TRUE FALSE TRUE FALSE
> apply(df[, -1], MARGIN = 1, function(x) any(x > 10))
[1] FALSE TRUE FALSE TRUE TRUE
to subset df (as shown above).
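For reference, here is the same logic applied to the values of a single row:
x <- c(15, 32)  # row 2, ignoring id
all(x > 10)     # TRUE  -> row kept in Part 1
any(x > 10)     # TRUE  -> row kept in Part 2
y <- c(7, 11)   # row 5
all(y > 10)     # FALSE
any(y > 10)     # TRUE  -> row kept only in Part 2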
This can be done using apply with MARGIN = 1, which will apply a function to each row. The function to check a given row would be
function(row) {all(row > 10)}
The id column should be excluded from the check (otherwise id values of 10 or less would disqualify every row), so the way to extract the rows themselves is
df[apply(df[, -1], 1, function(row) {all(row > 10)}), ]
One option is looping row-by-row (e.g. with apply) and using any or all, as proposed in the other two answers. However, this can be inefficient for large data frames.
A vectorized approach would be to use rowSums to determine the number of values in each row matching your criterion, and filter based on that.
(1) Filtering to rows where ALL values are larger than 10 is the same as filtering to rows where the number of values less than or equal to 10 is 0:
df[rowSums(df[,-1] <= 10) == 0,]
# id v1 v2
# 2 2 15 32
# 4 4 12 17
(2) Similarly, rowSums can easily be used to find the rows where ANY (at least one) value is larger than 10:
df[rowSums(df[,-1] > 10) > 0,]
# id v1 v2
# 2 2 15 32
# 4 4 12 17
# 5 5 7 11
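The intermediate counts behind these two filters look like this:
rowSums(df[, -1] <= 10)  # number of values per row that are <= 10
# [1] 2 0 2 0 1
rowSums(df[, -1] > 10)   # number of values per row that are > 10
# [1] 0 2 0 2 1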
The speedup is clear with a larger input:
set.seed(144)
df <- matrix(sample(c(1, 10, 20), 3e6, replace=TRUE), ncol=3)
system.time(df[apply(df[, -1], MARGIN = 1, function(x) all(x > 10)), ])
# user system elapsed
# 1.754 0.156 2.102
system.time(df[rowSums(df[,-1] <= 10) == 0,])
# user system elapsed
# 0.04 0.01 0.05
The dplyr equivalent is as follows
library(dplyr)
#ANY
df %>% rowwise() %>%
filter(any(across(starts_with("v"), ~ sum((. > 10)))))
# A tibble: 3 x 3
# Rowwise:
id v1 v2
<int> <dbl> <dbl>
1 2 15 32
2 4 12 17
3 5 7 11
#ALL
df %>% rowwise() %>%
filter(all(across(starts_with("v"), ~ sum((. > 10)))))
# A tibble: 2 x 3
# Rowwise:
id v1 v2
<int> <dbl> <dbl>
1 2 15 32
2 4 12 17
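On dplyr 1.0.4 or later, the same filters can also be written without rowwise(), using if_all()/if_any() (a sketch):
library(dplyr)
df %>% filter(if_all(starts_with("v"), ~ . > 10))  # ALL values > 10
df %>% filter(if_any(starts_with("v"), ~ . > 10))  # ANY value > 10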
