I have a data.frame with exactly one value measured for each subject at multiple timepoints. It simplifies to this:
> set.seed(42)
> x = data.frame(subject=rep(c('a', 'b', 'c'), 3), time=rep(c(1,2,3), each=3), value=rnorm(3*3, 0, 1))
> x
  subject time       value
1       a    1  1.37095845
2       b    1 -0.56469817
3       c    1  0.36312841
4       a    2  0.63286260
5       b    2  0.40426832
6       c    2 -0.10612452
7       a    3  1.51152200
8       b    3 -0.09465904
9       c    3  2.01842371
I want to calculate the change in value for each timepoint and for each subject. For this simple example, my current solution is this:
> x$diff[x$time==1] = x$value[x$time==2] - x$value[x$time==1]
> x$diff[x$time==2] = x$value[x$time==3] - x$value[x$time==2]
> x
  subject time       value       diff
1       a    1  1.37095845 -0.7380958
2       b    1 -0.56469817  0.9689665
3       c    1  0.36312841 -0.4692529
4       a    2  0.63286260  0.8786594
5       b    2  0.40426832 -0.4989274
6       c    2 -0.10612452  2.1245482
7       a    3  1.51152200         NA
8       b    3 -0.09465904         NA
9       c    3  2.01842371         NA
... and then remove the last rows. However, in my actual data set there are many more levels of time, and I need to do this for several columns instead of just value. The code gets very ugly. Is there a neat way to do this? A solution which does not assume that rows are ordered within subjects according to time would be nice.
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(x)); then, grouped by 'subject', take the difference between the next value (shift(value, type='lead')) and the current value, and assign (:=) the result to a new 'Diff' column. Because we order by 'time' inside the call (order(time)), the rows do not need to be pre-sorted within subjects.
library(data.table)  # v1.9.6+
setDT(x)[order(time), Diff := shift(value, type = 'lead') - value,
         by = subject]
#   subject time       value       Diff
#1:       a    1  1.37095845 -0.7380958
#2:       b    1 -0.56469817  0.9689665
#3:       c    1  0.36312841 -0.4692529
#4:       a    2  0.63286260  0.8786594
#5:       b    2  0.40426832 -0.4989274
#6:       c    2 -0.10612452  2.1245482
#7:       a    3  1.51152200         NA
#8:       b    3 -0.09465904         NA
#9:       c    3  2.01842371         NA
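The question also asks about several columns. Here is a minimal sketch of how the same idea extends to multiple columns via .SDcols (cols holds one column here, but you would list all of your real measurement columns):
library(data.table)
cols <- c("value")  # e.g. c("value", "weight", "score") in the real data
setDT(x)[order(time),
         paste0(cols, "_diff") := lapply(.SD, function(v) shift(v, type = "lead") - v),
         by = subject, .SDcols = cols]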
You can use dplyr for this:
library(dplyr)
x %>%
  arrange(time, subject) %>%
  group_by(subject) %>%
  mutate(diff = c(diff(value), NA))
# Source: local data frame [9 x 4]
# Groups: subject [3]
#
#   subject  time       value       diff
#    (fctr) (dbl)       (dbl)      (dbl)
# 1       a     1  1.37095845 -0.7380958
# 2       b     1 -0.56469817  0.9689665
# 3       c     1  0.36312841 -0.4692529
# 4       a     2  0.63286260  0.8786594
# 5       b     2  0.40426832 -0.4989274
# 6       c     2 -0.10612452  2.1245482
# 7       a     3  1.51152200         NA
# 8       b     3 -0.09465904         NA
# 9       c     3  2.01842371         NA
If you want to get rid of the NAs, add %>% na.omit.
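For several columns in dplyr, a sketch using across() (requires dplyr 1.0+; cols is a placeholder for your real column names):
library(dplyr)
cols <- c("value")  # list all measurement columns here
x %>%
  arrange(time) %>%
  group_by(subject) %>%
  mutate(across(all_of(cols), ~ lead(.x) - .x, .names = "{.col}_diff")) %>%
  ungroup()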
You could try ave. ave applies a function to each subset of a vector, split by one or more grouping factors; for more details see ?ave. E.g.:
x$diff <- ave(x$value, x$subject, FUN=function(x)c(diff(x), NA))
x
#   subject time       value       diff
# 1       a    1  1.37095845 -0.7380958
# 2       b    1 -0.56469817  0.9689665
# 3       c    1  0.36312841 -0.4692529
# 4       a    2  0.63286260  0.8786594
# 5       b    2  0.40426832 -0.4989274
# 6       c    2 -0.10612452  2.1245482
# 7       a    3  1.51152200         NA
# 8       b    3 -0.09465904         NA
# 9       c    3  2.01842371         NA
Note that diff operates on the values in row order, so this assumes the rows are sorted by time within each subject.
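If your rows might not already be sorted, a minimal sketch that orders them first:
x <- x[order(x$subject, x$time), ]  # make time increasing within each subject
x$diff <- ave(x$value, x$subject, FUN = function(v) c(diff(v), NA))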
EDIT: Update with set.seed(42).
Related
I need some help with grouping data by runs of consecutive equal values.
If I have this data.table
dt <- data.table::data.table( a = c(1,1,1,2,2,2,2,1,1,2), b = seq(1:10), c = seq(1:10)+1 )
     a  b  c
 1:  1  1  2
 2:  1  2  3
 3:  1  3  4
 4:  2  4  5
 5:  2  5  6
 6:  2  6  7
 7:  2  7  8
 8:  1  8  9
 9:  1  9 10
10:  2 10 11
I need a group for every run of consecutive equal values in column a. For each group, I need the first (also the minimum possible) value of column b and the last (also the maximum possible) value of column c.
Like this:
   a  b  c
1: 1  1  4
2: 2  4  8
3: 1  8 10
4: 2 10 11
Thank you very much for your help. I cannot get it solved on my own.
Probably we can try
> dt[, .(a = a[1], b = b[1], c = c[.N]), rleid(a)][, -1]
   a  b  c
1: 1  1  4
2: 2  4  8
3: 1  8 10
4: 2 10 11
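For clarity, rleid() assigns one id per run of equal adjacent values; on this data it produces:
library(data.table)
rleid(dt$a)
# [1] 1 1 1 2 2 2 2 3 3 4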
An option with dplyr
library(dplyr)
dt %>%
  group_by(grp = cumsum(c(TRUE, diff(a) != 0))) %>%
  summarise(across(a:b, first), c = last(c)) %>%
  select(-grp)
Output:
# A tibble: 4 × 3
      a     b     c
  <dbl> <int> <dbl>
1     1     1     4
2     2     4     8
3     1     8    10
4     2    10    11
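For comparison, a base R sketch of the same run grouping, built with rle() (no extra packages assumed):
r <- rle(dt$a)                               # runs of equal adjacent values in 'a'
grp <- rep(seq_along(r$lengths), r$lengths)  # one id per run
data.frame(a = r$values,
           b = tapply(dt$b, grp, function(v) v[1]),            # first b of each run
           c = tapply(dt$c, grp, function(v) v[length(v)]))    # last c of each run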
I have a data.table with a large number of features. I would like to remove the rows that have NAs, but only checking certain features.
Currently I am using the following to handle this:
data.joined.sample <- data.joined.sample %>%
filter(!is.na(lat)) %>%
filter(!is.na(long)) %>%
filter(!is.na(temp)) %>%
filter(!is.na(year)) %>%
filter(!is.na(month)) %>%
filter(!is.na(day)) %>%
filter(!is.na(hour)) %>%
.......
Is there a more concise way to achieve this?
str(data.joined.sample)
Classes ‘data.table’ and 'data.frame': 336776 obs. of 50 variables:
We can select those columns, build a logical vector of complete rows with complete.cases, and use it to drop the rows that contain NAs in those columns:
data.joined.sample[complete.cases(data.joined.sample[colsofinterest]),]
where
colsofinterest <- c("lat", "long", "temp", "year", "month", "day", "hour")
Update
Based on the OP's comments, if it is a data.table, then subset the colsofinterest and use complete.cases
data.joined.sample[complete.cases(data.joined.sample[, colsofinterest, with = FALSE])]
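Alternatively, staying in the pipe style of the question, tidyr::drop_na takes the column names directly (a sketch):
library(tidyr)
data.joined.sample <- data.joined.sample %>%
  drop_na(lat, long, temp, year, month, day, hour)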
data.table objects, if that is in fact what you're working with, have a somewhat different syntax for the "[" function. Look through this console session:
> DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
> DT[x=="a" & y==1]
   x y v
1: a 1 4
> is.na(DT[x=="a" & y==1]$v) <- TRUE  # make one item NA
> DT[x=="a" & y==1]
   x y  v
1: a 1 NA
> DT
   x y  v
1: b 1  1
2: b 3  2
3: b 6  3
4: a 1 NA
5: a 3  5
6: a 6  6
7: c 1  7
8: c 3  8
9: c 6  9
> DT[complete.cases(DT)]  # note: no comma needed inside [ for a data.table
   x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 3 5
5: a 6 6
6: c 1 7
7: c 3 8
8: c 6 9
> DT  # but that didn't remove the NA row; it only printed a subset
   x y  v
1: b 1  1
2: b 3  2
3: b 6  3
4: a 1 NA
5: a 3  5
6: a 6  6
7: c 1  7
8: c 3  8
9: c 6  9
> DT <- DT[complete.cases(DT)]  # assign the result back to make the removal permanent
> DT
   x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 3 5
5: a 6 6
6: c 1 7
7: c 3 8
8: c 6 9
Probably not the true "data.table way".
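For the record, data.table also ships a na.omit method with a cols argument that checks only the named columns (a sketch):
library(data.table)
DT <- na.omit(DT, cols = "v")  # drop rows where 'v' is NA; NAs in other columns would be kept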
I will create a simple example of some dummy data:
case <- c('a','a','a','b','b','c','c','c','c','d','d','e','e')
object <- c(1,1,2,1,1,1,1,2,3,1,1,1,2)
df1 <- data.frame(case, object)
Now, for each unique combination of case and object values, I want to create a corresponding unique numerical value (an identifier):
df1$UNIQ_ID <- ........
The end result should take the following values, c(1,1,2,3,3,4,4,5,6,7,7,8,9), consistent with what the following return:
unique(df1$object[df1$case=='a'])
unique(df1$object[df1$case=='b'])
I have thought of using dplyr and group_by(case).
We can use .GRP from data.table after grouping by 'case' and 'object' on a data.table object (setDT(df1)).
library(data.table)
setDT(df1)[, UNIQ_ID := .GRP, .(case, object)]
df1
#    case object UNIQ_ID
# 1:    a      1       1
# 2:    a      1       1
# 3:    a      2       2
# 4:    b      1       3
# 5:    b      1       3
# 6:    c      1       4
# 7:    c      1       4
# 8:    c      2       5
# 9:    c      3       6
#10:    d      1       7
#11:    d      1       7
#12:    e      1       8
#13:    e      2       9
A base R option would be
grp <- interaction(df1)
as.numeric(factor(grp, levels= unique(grp)))
#[1] 1 1 2 3 3 4 4 5 6 7 7 8 9
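Since the question mentions dplyr, here is a sketch with cur_group_id() (dplyr 1.0+). Note it numbers groups in sorted key order, which happens to coincide with order of appearance for this data:
library(dplyr)
df1 %>%
  group_by(case, object) %>%
  mutate(UNIQ_ID = cur_group_id()) %>%
  ungroup()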
Is there a way to drop groups that have fewer than N rows, say N = 5, from a data.table?
Data:
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,3,6), v=1:9,
id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove rows belonging to any id group with fewer than 5 rows. The variable "id" is the grouping variable. In DT, we need to determine which groups have fewer than 5 members (groups "1" and "4") and then remove those rows, leaving:
    x y v id
 1: a 3 5  2
 2: a 6 6  2
 3: b 1 7  2
 4: b 3 8  2
 5: b 6 9  2
 6: b 1 1  3
 7: b 3 2  3
 8: b 6 3  3
 9: c 1 4  3
10: c 3 5  3
11: c 6 6  3
Here's an approach....
Get the size of each group, and flag the groups to keep:
nFactors <- tapply(DT$id, DT$id, length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data table answer
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
#     x y v id
#  1: a 3 5  2
#  2: a 6 6  2
#  3: b 1 7  2
#  4: b 3 8  2
#  5: b 6 9  2
#  6: b 1 1  3
#  7: b 3 2  3
#  8: b 6 3  3
#  9: c 1 4  3
# 10: c 3 5  3
# 11: c 6 6  3
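An equivalent data.table idiom returns each group's subset only when it is large enough; note that the grouping column moves to the front of the result (a sketch):
DT[, if (.N >= 5L) .SD, by = id]  # returns id, x, y, v for the qualifying groups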
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
#    x y v id
# 5  a 3 5  2
# 6  a 6 6  2
# 7  b 1 7  2
# 8  b 3 8  2
# 9  b 6 9  2
# 10 b 1 1  3
# 11 b 3 2  3
# 12 b 6 3  3
# 13 c 1 4  3
# 14 c 3 5  3
# 15 c 6 6  3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
group_by(id) %>%
filter(n() >= 5)
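If you also want to inspect the group sizes, a sketch with dplyr's add_count() (the column name n_id is arbitrary):
library(dplyr)
data.frame(DT) %>%
  add_count(id, name = "n_id") %>%  # append each row's group size as a column
  filter(n_id >= 5)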
I want to calculate the ratio between consecutive values within groups. It is easy for differences using diff:
mdata <- data.frame(group = c("A","A","A","B","B","C","C"), x = c(2,3,5,6,3,7,6))
mdata$diff <- unlist(by(mdata$x, mdata$group, function(x){c(NA, diff(x))}))
mdata
  group x diff
1     A 2   NA
2     A 3    1
3     A 5    2
4     B 6   NA
5     B 3   -3
6     C 7   NA
7     C 6   -1
Is there an equivalent function to calculate ratios? Desired output would be:
  group x     ratio
1     A 2        NA
2     A 3 1.5000000
3     A 5 1.6666667
4     B 6        NA
5     B 3 0.5000000
6     C 7        NA
7     C 6 0.8571429
Try dplyr:
install.packages("dplyr")  # note the quotes
require(dplyr)
mdata <- data.frame(group = c("A","A","A","B","B","C","C"), x = c(2,3,5,6,3,7,6))
mdata <- group_by(mdata, group)
mutate(mdata, ratio = x / lag(x))
# Source: local data frame [7 x 3]
# Groups: group
#   group x     ratio
# 1     A 2        NA
# 2     A 3 1.5000000
# 3     A 5 1.6666667
# 4     B 6        NA
# 5     B 3 0.5000000
# 6     C 7        NA
# 7     C 6 0.8571429
Your diff would simplify to:
mutate(mdata, diff = x - lag(x))
# Source: local data frame [7 x 3]
# Groups: group
#   group x diff
# 1     A 2   NA
# 2     A 3    1
# 3     A 5    2
# 4     B 6   NA
# 5     B 3   -3
# 6     C 7   NA
# 7     C 6   -1
Same idea, using data.table:
library(data.table)
dt = as.data.table(mdata)
dt[, ratio := x / shift(x), by = group]  # shift() gives the previous value within each group; stats::lag() would not work here
dt
#   group x     ratio
#1:     A 2        NA
#2:     A 3 1.5000000
#3:     A 5 1.6666667
#4:     B 6        NA
#5:     B 3 0.5000000
#6:     C 7        NA
#7:     C 6 0.8571429
Another option with ave:
transform(mdata,
          ratio = ave(x, group, FUN = function(y) c(NA, tail(y, -1) / head(y, -1))))
Using by:
do.call(rbind, by(mdata, mdata$group, function(dat) {
  dat$ratio <- dat$x / c(NA, head(dat$x, -1))
  dat
}))
#     group x     ratio
# A.1     A 2        NA
# A.2     A 3 1.5000000
# A.3     A 5 1.6666667
# B.4     B 6        NA
# B.5     B 3 0.5000000
# C.6     C 7        NA
# C.7     C 6 0.8571429