Compare Lists in datatable


I have a data.table (data) that looks like this:
rn peoplecount
1 0,2,0,1
2 1,1,0,0
3 0,1,0,5
4 5,3,0,2
5 2,2,0,1
6 1,2,0,3
7 0,1,0,0
8 0,2,0,8
9 8,2,0,0
10 0,1,0,0
My goal is to find all records where the 1st element of the current row does not match the 4th element of the previous row. In this example, the 7th row meets the criterion. How can I get a list of all such records?
My attempt so far:
data[, previous_peoplecount:= c(NA, peoplecount[shift(seq_along(peoplecount), fill = 0)])]
This gives a new table as follows:
rn peoplecount previous_peoplecount
1 0,2,0,1 NA
2 1,1,0,0 0,2,0,1
3 0,1,0,5 1,1,0,0
4 5,3,0,2 0,1,0,5
5 2,2,0,1 5,3,0,2
6 1,2,0,3 2,2,0,1
7 0,1,0,0 1,2,0,3
8 0,2,0,8 0,1,0,0
9 8,2,0,0 0,2,0,8
10 0,1,0,0 8,2,0,0
Now I have to fetch all records where the 1st element of peoplecount is not equal to the 4th element of previous_peoplecount. I am stuck at this part. Any suggestions?
Edit: peoplecount is a list column of numeric vectors.

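Assuming peoplecount is stored as a comma-separated string, you can try removing everything but the first value from the current row and everything but the last value from the previous row, then comparing the two, i.e.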
library(data.table)
setDT(dt)[, first_pos := sub(',.*', '', peoplecount)][,
last_pos_shifted := sub('.*,', '', shift(peoplecount))][
first_pos != last_pos_shifted,]
which gives,
rn peoplecount first_pos last_pos_shifted
1: 7 0,1,0,0 0 3

I would convert to long format and then select the elements of interest:
dt <- data.table(rn = 1:3, x = lapply(1:3, function(x) x:(x+3)))
dt$x[[2]] <- c(4, 1, 1, 1)
dt
# rn x
# 1: 1 1,2,3,4
# 2: 2 4,1,1,1
# 3: 3 3,4,5,6
# convert to long format
dt2 <- dt[, .(rn = rep(rn, each = 4), x = unlist(x))]
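dt2[, id := rowid(rn)] # position of each element within its row (1 to 4); recent data.table versions refuse to recycle 1:4 in :=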
dtSelected <- dt2[x == shift(x) & id == 4]
dtSelected
# rn x id
# 1: 2 1 4
dt[dtSelected$rn]
# rn x
# 1: 2 4,1,1,1

I was not satisfied with the answers and came up with my own solution:
h <- sapply(data$peoplecount, function(x) x[1]) # first element of each row
t <- sapply(data$peoplecount, function(x) x[4]) # fourth element of each row
indices <- which(head(t, -1) != tail(h, -1)) # index i compares row i with row i + 1, so the mismatching row is indices + 1
Thanks to @Sotos and @minem for pushing me in the right direction.
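For reference, here is a minimal sketch of the same comparison done inside data.table. It assumes peoplecount is a list column of length-4 numeric vectors, as stated in the question's edit; the column names first_el and prev_fourth are made up for illustration.

```r
library(data.table)

# Rebuild the question's table with peoplecount as a list column of numeric vectors
data <- data.table(
  rn = 1:10,
  peoplecount = list(
    c(0, 2, 0, 1), c(1, 1, 0, 0), c(0, 1, 0, 5), c(5, 3, 0, 2), c(2, 2, 0, 1),
    c(1, 2, 0, 3), c(0, 1, 0, 0), c(0, 2, 0, 8), c(8, 2, 0, 0), c(0, 1, 0, 0)
  )
)

# First element of each row, and fourth element of the previous row
data[, first_el := sapply(peoplecount, `[`, 1)]
data[, prev_fourth := shift(sapply(peoplecount, `[`, 4))]

# Row 1 has no previous row, so prev_fourth is NA there and the filter drops it
mismatch <- data[first_el != prev_fourth, rn]
mismatch
```

Because the row numbers come straight from rn, there is no off-by-one adjustment to remember.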

Related

Last observation of the previous group

I would like to know, if I have data that I can group by a variable, how can I get the last observation of the previous group?
I have the following data:
dt <- data.table(a=c(1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,5,5,5,5,5), b=sample.int(21))
I would like to create a new data.table that has the group ID and the difference between the last observation of each group and the last observation of the previous group. So from the above I'd get:
a c
1: 1 NA
2: 2 9
3: 3 1
4: 4 -8
5: 5 5
Thanks!

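We group by 'a', take the last element of 'b' in each group, then lag 'c' with shift (this yields the previous group's last value, which still has to be subtracted to get the difference):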
dt[, .(c = last(b)), a][, c:= shift(c)][]
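A small sketch of the full difference, on made-up deterministic data rather than the sampled values from the question. The sign convention (current group's last value minus the previous group's) and the helper column last_b are assumptions; swap the operands if you want it the other way round.

```r
library(data.table)

# Made-up data: three groups whose last values are 7, 10 and 3
dt <- data.table(a = c(1, 1, 2, 2, 2, 3), b = c(5, 7, 1, 4, 10, 3))

# Last b per group, then subtract the previous group's last b
res <- dt[, .(last_b = last(b)), by = a][, c := last_b - shift(last_b)]
res
```

The first group has no predecessor, so its c is NA.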

Here is a way:
dt[, c := b * (1:.N == .N), by = a] ## get last row within the group
dt <- dt[b == c] ## filter data.table to get rows of interest
dt[, c := shift(c, type = "lag") - c][] ## getting difference using shift with lag argument
# a b c
#1: 1 11 NA
#2: 2 10 NA
#3: 3 18 9
#4: 4 19 -7
#5: 5 12 -8
Data:
set.seed(1)
dt <- data.table(a=c(1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,5,5,5,5,5), b=sample.int(21))

Subset in i by variable name in data.table [duplicate]

Suppose I have a data.table with column names that are specified in a variable. For example I might have used dcast as:
groups <- sample(LETTERS, 2) # i.e. I don't know the values
dt1 <- data.table(ID = rep(1:2, each = 2), group = groups, value = 3:6)
(dt2 <- dcast(dt1, ID~group, value.var = "value"))
# ID D Q
# 1: 1 3 4
# 2: 2 5 6
Now I want to subset based on values in the last two columns, e.g. do something like:
dt2[groups[1] == 3 & groups[2] == 4]
# Empty data.table (0 rows) of 3 cols: ID,D,Q
Is there an easy way?
I found I can do this with keys:
setkeyv(dt2, groups)
dt2[.(3, 4)]
# ID D Q
# 1: 1 3 4
But how do I do something more elaborate, such as the following?
dt2[groups[1] > 3 & groups[2] < 7]

You can use get, which (from ?get) will "search by name for an object":
dt2[get(groups[1]) > 2 & get(groups[2]) == 4]
# ID A J
#1: 1 3 4

We can use eval with as.name, which should be faster than get:
dt2[eval(as.name(groups[1])) > 2 & eval(as.name(groups[2])) == 4]
# ID L U
#1: 1 4 3
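To make the two approaches easier to compare, here is a sketch with the column names fixed instead of sampled, so the result is reproducible. The names by_get and by_eval are only for illustration, and the filter is the "more elaborate" one from the question.

```r
library(data.table)

groups <- c("D", "Q") # fixed here; in the question these are sampled
dt2 <- data.table(ID = 1:2, D = c(3L, 5L), Q = c(4L, 6L))

# get() looks each column up by the name stored in groups
by_get <- dt2[get(groups[1]) > 3 & get(groups[2]) < 7]

# eval(as.name()) builds the column symbol and evaluates it in the same scope
by_eval <- dt2[eval(as.name(groups[1])) > 3 & eval(as.name(groups[2])) < 7]

by_get
by_eval
```

Both calls should keep only the row with ID 2, since that is the only row where D exceeds 3.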

compare the first and last observation in each group

I have a dataset like this:
df <- data.frame(group = c(rep(1,3),rep(2,2), rep(3,3),rep(4,3),rep(5, 2)), score = c(30, 10, 22, 44, 50, 5, 20, 1,35, 2, 60, 14,5))
group score
1 1 30
2 1 10
3 1 22
4 2 44
5 2 50
6 3 5
7 3 20
8 3 1
9 4 35
10 4 2
11 4 60
12 5 14
13 5 5
I want to compare the first and last score in each group; if the last score is smaller than the first, output the group number. The expected output is:
group 1 3 5
Does anyone have an idea how to do this?

Here's a data.table approach:
library(data.table)
setDT(df)[, score[1] > score[.N], by = group][V1 == TRUE]
## group V1
## 1: 1 TRUE
## 2: 3 TRUE
## 3: 5 TRUE
Or
setDT(df)[, group[score[1] > score[.N]], by = group]
## group V1
## 1: 1 1
## 2: 3 3
## 3: 5 5
Or
setDT(df)[, .BY[score[1] > score[.N]], by = group]
As per @beginneR's comment, if you don't like V1 you could do
df2 <- as.data.table(df)[, .BY[score[1] > score[.N]], by = group][, V1 := NULL]
df2
## group
## 1: 1
## 2: 3
## 3: 5

This should do the job:
# First split the data frame by group
# This returns a list
df.split <- split(df, factor(df$group))
# Now use sapply on the list to check first and last of each group
# We return the group or NA using ifelse
res <- sapply(df.split,
function(x){ifelse(x$score[1] > x$score[nrow(x)], x$group[1], NA)})
# Finally, filter away the NAs
res <- res[!is.na(res)]

This answer assumes that every group has at least 2 observations:
newdf <- merge(rbind(df[diff(df$group) == 1 ,] , df[dim(df)[1], ]),
df[!duplicated(df$group), ],
by="group")
newdf[which(newdf$score.x < newdf$score.y), 'group']
#[1] 1 3 5
df[diff(df$group) == 1 ,] identifies the last observation of each group, except for the last group, which is why I rbind the last entry (i.e. df[dim(df)[1], ]). Then, the first observation of each group is given by df[!duplicated(df$group), ]. We merge these on the group column, then identify which ones meet the criteria.
Another option for the merge step:
merge(df[which(!duplicated(df$group))+(rle(df$group)$lengths-1),],
df[!duplicated(df$group), ],
by="group")

One more base R option:
with(df, unique(df$group[as.logical(ave(score, group, FUN = function(x) head(x,1) > tail(x, 1)))]))
#[1] 1 3 5
Or using dplyr:
library(dplyr)
group_by(df, group) %>% filter(first(score) > last(score)) %>% do(head(.,1)) %>%
select(group)
# group
#1 1
#2 3
#3 5

I'm a plyr package fan:
library(plyr)
df1<-ddply(df,.(group),summarise,shown=score[length(group)]<score[1])
subset(df1,shown)
group shown
1 TRUE
3 TRUE
5 TRUE

How to Enter data for only conditioned rows on data table

I need to put a number on the first (or a random) item in each group.
I do the following:
item<-sample(c("a","b", "c"), 30,replace=T)
week<-rep(c("1","2","3"),10)
volume<-c(1:30)
DT<-data.table(item, week,volume)
setkeyv(DT, c("item", "week"))
sampleDT <- DT[,.SD[1], by= list(item,week)]
item week volume newCol
1: a 1 1 5
2: a 2 14 5
3: a 3 6 5
4: b 1 13 5
5: b 2 2 5
6: b 3 9 5
7: c 1 7 5
8: c 2 5 5
9: c 3 3 5
DT[DT[,.SD[1], by= list(item,week)], newCol:=5]
sampleDT comes out correct, but the last line puts 5 in every row instead of only the selected ones.
What am I doing wrong?

I think you want to do this instead:
DT[DT[, .I[1], by = list(item, week)]$V1, newCol := 5]
Your version doesn't work because the join that you have results in the full data.table.
There is also a pending feature request (FR) to make the syntax simpler:
# won't work now, but maybe in the future
DT[, newCol[1] := 5, by = list(item, week)]

The problem with your command is that it is finding rows in the original data.table that have combinations of the keys [item, week] that you found in sampleDT. Since sampleDT includes all combinations of [item, week], you get the whole data.table back.
A simpler solution (I think) would be using !duplicated() to retrieve the first instance of each [item, week] combination:
DT[!duplicated(DT, by = c("item", "week")), newCol := 5] # name the by argument; the second positional argument is incomparables
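A minimal sketch of the row-index idea on a small hand-written table, so the effect is easy to check by eye. The data and the name first_rows are made up for illustration.

```r
library(data.table)

DT <- data.table(
  item   = c("a", "a", "a", "b", "b"),
  week   = c("1", "1", "2", "1", "1"),
  volume = 1:5
)

# .I holds row numbers in the original table; .I[1] is the first row of each group
first_rows <- DT[, .I[1], by = .(item, week)]$V1

# Assign only on those rows; every other row gets NA in newCol
DT[first_rows, newCol := 5]
DT
```

Subsetting by row number avoids the join on (item, week), which is what matched every row in the original attempt.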

Reduce dataset based on value

I have a dataset
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
For every id the values are sorted in ascending order.
I want to reduce dtf to only the first row for each id whose value exceeds a specified limit, so one row per id: the one where the value first exceeds the limit.
For this example and a limit of 5, dtf should reduce to:
A 6
B 6
Is there a nice way to do this?
Thanks a lot

It could be done with aggregate:
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
limit <- 5
aggregate(value ~ id, dtf, function(x) x[x > limit][1])
The result:
id value
1 A 6
2 B 6
Update: A solution for multiple columns:
An example data frame, dtf2:
dtf2 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
value=c(2,4,6,8,4,6,8,10),
col3 = letters[1:8],
col4 = 1:8)
A solution including ave:
with(dtf2, dtf2[ave(value, id, FUN = function(x) cumsum(x > limit)) == 1, ])
The result:
id value col3 col4
3 A 6 c 3
6 B 6 f 6
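To see why the ave approach picks exactly one row per id, it can help to look at the intermediate running count. This is a sketch on the question's sorted data; the names running and first_over are made up for illustration.

```r
dtf <- data.frame(id = c("A", "A", "A", "A", "B", "B", "B", "B"),
                  value = c(2, 4, 6, 8, 4, 6, 8, 10))
limit <- 5

# Running count, within each id, of values seen so far that exceed the limit
running <- ave(dtf$value, dtf$id, FUN = function(x) cumsum(x > limit))
running

# With ascending values the count equals 1 only on the first row above the limit
first_over <- dtf[running == 1, ]
first_over
```

If the values were not sorted, a row below the limit that follows the first hit would also have a count of 1, so this relies on the ordering stated in the question.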

Here is a "nice" option using data.table:
library(data.table)
DT <- data.table(dtf, key = "id")
DT[value > 5, head(.SD, 1), by = key(DT)]
# id value
# 1: A 6
# 2: B 6
And, in the spirit of sharing, an option using sqldf which might be nice depending on whether you feel more comfortable with SQL.
sqldf("select id, min(value) as value from dtf where value > 5 group by id")
# id value
# 1 A 6
# 2 B 6
Update: Unordered source data, and a data.frame with multiple columns
Based on your comments to some of the answers, it seems like there might be a chance that your "value" column might not be ordered like it is in your example, and that there are other columns present in your data.frame.
Here are two alternatives for those scenarios, one with data.table, which I find easiest to read and is most likely the fastest, and one with a typical "split-apply-combine" approach that is commonly needed for such tasks.
First, some sample data:
dtf2 <- data.frame(id = c("A","A","A","A","B","B","B","B"),
value = c(6,4,2,8,4,10,8,6),
col3 = letters[1:8],
col4 = 1:8)
dtf2 # Notice that the value column is not ordered
# id value col3 col4
# 1 A 6 a 1
# 2 A 4 b 2
# 3 A 2 c 3
# 4 A 8 d 4
# 5 B 4 e 5
# 6 B 10 f 6
# 7 B 8 g 7
# 8 B 6 h 8
Second, the data.table approach:
library(data.table)
DT <- data.table(dtf2)
DT # Verify that the data are not ordered
# id value col3 col4
# 1: A 6 a 1
# 2: A 4 b 2
# 3: A 2 c 3
# 4: A 8 d 4
# 5: B 4 e 5
# 6: B 10 f 6
# 7: B 8 g 7
# 8: B 6 h 8
DT[order(value)][value > 5, head(.SD, 1), by = "id"]
# id value col3 col4
# 1: A 6 a 1
# 2: B 6 h 8
Third, base R's common "split-apply-combine" approach:
do.call(rbind,
lapply(split(dtf2, dtf2$id),
function(x) x[x$value > 5, ][which.min(x$value[x$value > 5]), ]))
# id value col3 col4
# A A 6 a 1
# B B 6 h 8

Another approach with aggregate:
aggregate(value~id, dtf[dtf[,'value'] > 5,], min)
id value
1 A 6
2 B 6
This relies on the values being sorted within each id, because min returns the smallest value above the limit rather than the first one encountered.

Might as well add an alternative with plyr and head:
library(plyr)
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
limit <- 5
result <- ddply(dtf, "id", function(x) head(x[x$value > limit ,],1) )
> result
id value
1 A 6
2 B 6

This depends on your data.frame being sorted:
threshold <- 5
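foo <- dtf[dtf$value > threshold, ] # "exceeds" means strictly greater than the threshold
foo[c(1, which(diff(as.numeric(as.factor(foo$id))) > 0) + 1), ] # + 1 moves from the last row of one id to the first row of the next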
