How to remove subjects who have missing measurements in time series data? - r

I have data like the following:
ID Year Measurement
1 2009 5.6
1 2010 6.2
1 2011 4.5
2 2008 6.4
2 2009 5.2
3 2008 3.5
3 2010 5.6
4 2009 5.9
4 2010 2.2
4 2011 4.1
4 2012 5.5
Where subjects are measured over a few years with different starting and ending years. Subjects are also measured a different number of times. I want to remove subjects that are not measured every single year between their start and end measurement years. So, in the above data I would want subject 3 removed since they missed a measurement in 2009.
I thought about doing a for loop where I get the maximum and minimum value of the variable Year for each unique ID. I then take the difference between the maximum and minimum for each subject and add 1. I then count the number of times each unique ID appears in the data and check whether the two numbers are equal. This ought to work, but I feel like there has got to be a quicker and more efficient way to do this.

This will be easiest with the data.table package:
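library(data.table)
# Key on ID, then Year, so rows are sorted by year within each subject
dt = data.table(df, key = c("ID", "Year"))
# Flag subjects with a gap of more than one year between consecutive measurements
dt[, Remove := any(diff(Year) > 1), by = ID]
dt = dt[(!Remove)]
dt[, Remove := NULL]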
ID Year Measurement
1: 1 2009 5.6
2: 1 2010 6.2
3: 1 2011 4.5
4: 2 2008 6.4
5: 2 2009 5.2
6: 4 2009 5.9
7: 4 2010 2.2
8: 4 2011 4.1
9: 4 2012 5.5

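Here's an alternative in base R. It checks every consecutive pair of years per subject (not only the first two) and uses %in% so that more than one subject can be dropped:
> ind <- aggregate(Year ~ ID, data = df, FUN = function(x) any(diff(sort(x)) > 1))
> df[!df$ID %in% ind$ID[ind$Year], ]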
ID Year Measurement
1 1 2009 5.6
2 1 2010 6.2
3 1 2011 4.5
4 2 2008 6.4
5 2 2009 5.2
8 4 2009 5.9
9 4 2010 2.2
10 4 2011 4.1
11 4 2012 5.5

You may try ave. My anonymous function is basically the pseudo code suggested in the question.
df[as.logical(ave(df$Year, df$ID, FUN = function(x) length(x) > max(x) - min(x))), ]
# ID Year Measurement
# 1 1 2009 5.6
# 2 1 2010 6.2
# 3 1 2011 4.5
# 4 2 2008 6.4
# 5 2 2009 5.2
# 8 4 2009 5.9
# 9 4 2010 2.2
# 10 4 2011 4.1
# 11 4 2012 5.5
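The ave check above assumes each subject has at most one row per year. A minimal sketch that relaxes this by counting distinct years is shown below; the data frame is rebuilt from the question's sample, and the names complete and kept are made up for illustration.

```r
# Sample data from the question
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
  Year = c(2009, 2010, 2011, 2008, 2009, 2008, 2010, 2009, 2010, 2011, 2012),
  Measurement = c(5.6, 6.2, 4.5, 6.4, 5.2, 3.5, 5.6, 5.9, 2.2, 4.1, 5.5)
)

# For each subject, compare the number of distinct years with the span
# from first to last year; ave() stores the logical result as 0/1
complete <- ave(df$Year, df$ID, FUN = function(x) {
  length(unique(x)) == max(x) - min(x) + 1
}) == 1

# Keep only subjects measured in every year of their range
kept <- df[complete, ]
```

Because the test uses max and min rather than diff, the rows do not need to be sorted by year first.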

Related

Repeat measures: how to use initial measurements to estimate subsequent measurement based off time differences

I have a dataframe with repeat recordings of individuals in the year they were found.
>long<-data.frame(identity,year,age)
> long
identity year age
1 z 2000 10.0
2 z 2001 7.5
3 z 2001 7.5
4 y 2000 10.0
5 x 2003 9.0
6 x 2004 11.0
7 w 2003 9.0
8 v 2001 7.5
9 v 2002 11.0
10 v 2004 11.0
Age was estimated based off the year they were captured
yr.est<-data.frame(yr,est.age)
> yr.est
yr est.age
1 2000 10.0
2 2001 7.5
3 2002 11.0
4 2003 9.0
5 2004 11.0
When an individual is seen again after the first time, how do I give them an estimated age equal to the initial estimated age plus the difference between years? (E.g. individual v was estimated to be 7.5 in 2001, so their age in 2004 should be 10.5, not 11.)
My actual dataset is 15000 long so I am unable to do it manually
TIA
Edit.
Expected output posted as a comment by the OP.
long
identity year age
1 z 2000 10.0
2 z 2001 11.0
3 z 2001 11.0
4 y 2000 10.0
5 x 2003 9.0
6 x 2004 10.0
7 w 2003 9.0
8 v 2001 7.5
9 v 2002 8.5
10 v 2004 10.5
This code computes est.age by adding to the first age the difference between the current year and the first year, by group of identity.
library(tidyverse)
long %>%
group_by(identity) %>%
mutate(est.age = first(age) + (year - first(year))) %>%
select(identity, year, est.age)
## A tibble: 10 x 3
## Groups: identity [5]
# identity year est.age
# <fct> <int> <dbl>
# 1 z 2000 10
# 2 z 2001 11
# 3 z 2001 11
# 4 y 2000 10
# 5 x 2003 9
# 6 x 2004 10
# 7 w 2003 9
# 8 v 2001 7.5
# 9 v 2002 8.5
#10 v 2004 10.5
Data.
long <- read.table(text = "
identity year age
1 z 2000 10.0
2 z 2001 7.5
3 z 2001 7.5
4 y 2000 10.0
5 x 2003 9.0
6 x 2004 11.0
7 w 2003 9.0
8 v 2001 7.5
9 v 2002 11.0
10 v 2004 11.0
", header = TRUE)
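For readers who prefer not to load the tidyverse, the same idea can probably be expressed in base R with ave. This is only a sketch: it assumes the rows of each individual are already ordered by year, and first_age and first_year are helper names invented here.

```r
# Data from the question
long <- data.frame(
  identity = c("z", "z", "z", "y", "x", "x", "w", "v", "v", "v"),
  year = c(2000, 2001, 2001, 2000, 2003, 2004, 2003, 2001, 2002, 2004),
  age = c(10, 7.5, 7.5, 10, 9, 11, 9, 7.5, 11, 11)
)

# First recorded age and first year of each individual, repeated on every row
first_age <- ave(long$age, long$identity, FUN = function(a) a[1])
first_year <- ave(long$year, long$identity, FUN = function(y) y[1])

# Estimated age = initial age + years elapsed since the first sighting
long$est.age <- first_age + (long$year - first_year)
```

If the rows are not sorted, ordering the data frame by identity and year beforehand should make the first element of each group the earliest sighting.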

How to create matrix for heatmap by group in R?

I have example data as follows. I want to create a heatmap that shows the average sleep duration in hours (SLP) by 3 recruiting sites (site) and 5 recruiting years (year).
SLP site year sex
8.6 1 2008 1
7.2 1 2005 2
6.4 2 2006 2
9.5 3 2007 2
6.1 2 2009 2
5.1 3 2008 1
2.1 2 2006 2
3.6 1 2001 1
8.6 1 2008 1
7.2 1 2005 2
6.4 2 2006 2
9.5 3 2007 2
6.1 2 2009 2
5.1 3 2008 1
2.1 2 2006 2
3.6 1 2001 1
In the heatmap I want to make, the x axis and y axis are year and site, respectively, and each cell shows the mean sleep duration.
I don't know how to build the matrix for the heatmap from the data frame.
How do I do this?
There are many options, and it also depends on how you want to create the heatmap. If you want to use ggplot2 then you do not need to modify the data.frame. For example, this should work:
txt <- "SLP site year sex
8.6 1 2008 1
7.2 1 2005 2
6.4 2 2006 2
9.5 3 2007 2
6.1 2 2009 2
5.1 3 2008 1
2.1 2 2006 2
3.6 1 2001 1
8.6 1 2008 1
7.2 1 2005 2
6.4 2 2006 2
9.5 3 2007 2
6.1 2 2009 2
5.1 3 2008 1
2.1 2 2006 2
3.6 1 2001 1"
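library(ggplot2)
d <- read.table(text = txt, header = TRUE)
d$year <- factor(d$year) # make year a factor.
# Note: geom_tile() draws one tile per row, so aggregate to means first if a site/year pair repeats
ggplot(d, aes(x = year, y = site, fill = SLP)) + geom_tile()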
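If a matrix of means is what is needed, as the question asks, one possible sketch uses tapply from base R. The data frame below holds the eight distinct rows of the sample (the sample lists them twice, which does not change the means), and the name m is chosen here for illustration.

```r
# The eight distinct rows of the question's sample data
d <- data.frame(
  SLP = c(8.6, 7.2, 6.4, 9.5, 6.1, 5.1, 2.1, 3.6),
  site = c(1, 1, 2, 3, 2, 3, 2, 1),
  year = c(2008, 2005, 2006, 2007, 2009, 2008, 2006, 2001)
)

# Mean SLP for every site/year combination: rows are sites, columns are years.
# Combinations that never occur in the data are NA.
m <- tapply(d$SLP, list(site = d$site, year = d$year), mean)
m
```

The resulting matrix can then be handed to a matrix-based plotting function such as image() or heatmap(m, Rowv = NA, Colv = NA); how the NA cells are drawn depends on the function used.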

Split rows but maintain labels

I would like to literally split some of the values in a dataframe, but would like to maintain some of the labels while allowing for one new label for the new splits. For example:
day year depth mass
1 2008 10 13
2 2008 10 15
1 2008 20 14
2 2008 20 12
1 2009 10 14
2 2009 10 16
1 2009 20 12
2 2009 20 18
Now divide each mass by 2 to get:
day year depth mass
1 2008 10a 6.5
1 2008 10b 6.5
2 2008 10a 7.5
2 2008 10b 7.5
1 2008 20a 7
1 2008 20b 7
2 2008 20a 6
2 2008 20b 6
1 2009 10a 7
1 2009 10b 7
2 2009 10a 8
2 2009 10b 8
1 2009 20a 6
1 2009 20b 6
2 2009 20a 9
2 2009 20b 9
There are new values, but they have the corresponding day and year data.
To make things more complicated, I will be running a slightly different function on each depth. For example, I will divide depth == 10 by 2, but depth == 20 by three. But I can probably figure that out if the basic question here can be answered.
A somewhat long data.table line, but I think this will achieve what you need:
library(data.table)
df$id <- rownames(df)
df1 <- setDT(df)[rep(1:nrow(df),times = 2),.SD,by=id][,`:=`(mass=mass/2,depth=paste(depth,c("a","b"),sep=""))]
Output:
df1
id day year depth mass
1 1 2008 10a 6.5
1 1 2008 10b 6.5
2 2 2008 10a 7.5
2 2 2008 10b 7.5
3 1 2008 20a 7.0
3 1 2008 20b 7.0
4 2 2008 20a 6.0
4 2 2008 20b 6.0
5 1 2009 10a 7.0
5 1 2009 10b 7.0
6 2 2009 10a 8.0
6 2 2009 10b 8.0
7 1 2009 20a 6.0
7 1 2009 20b 6.0
8 2 2009 20a 9.0
8 2 2009 20b 9.0
Using dplyr, you can do it this way:
library(dplyr)
df %>% group_by(day, year, depth) %>% bind_rows(., .) %>% mutate(mass = mass/2) %>% arrange(day, year, depth, mass)
Note, I have not done the a/b appending to the depth, but I think you can probably do it based on this same idea.
Output is as follows:
Source: local data frame [16 x 4]
day year depth mass
(dbl) (dbl) (dbl) (dbl)
1 1 2008 10 6.5
2 1 2008 10 6.5
3 1 2008 20 7.0
4 1 2008 20 7.0
5 1 2009 10 7.0
6 1 2009 10 7.0
7 1 2009 20 6.0
8 1 2009 20 6.0
9 2 2008 10 7.5
10 2 2008 10 7.5
11 2 2008 20 6.0
12 2 2008 20 6.0
13 2 2009 10 8.0
14 2 2009 10 8.0
15 2 2009 20 9.0
16 2 2009 20 9.0
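The question also mentions splitting each depth differently (depth 10 by two, depth 20 by three). A rough base R sketch of that variant follows. It assumes that the number of new rows equals the divisor, which the question does not state explicitly, and the lookup vector divisor is hypothetical; only the 2008 rows are used to keep the example short.

```r
# 2008 rows from the question
df <- data.frame(
  day = c(1, 2, 1, 2),
  year = 2008,
  depth = c(10, 10, 20, 20),
  mass = c(13, 15, 14, 12)
)

# Hypothetical lookup: how many pieces each depth is split into
divisor <- c("10" = 2, "20" = 3)
k <- unname(divisor[as.character(df$depth)])

# Repeat each row k times, then share the mass equally among the copies
out <- df[rep(seq_len(nrow(df)), times = k), ]
out$mass <- out$mass / rep(k, times = k)

# Label the copies a, b, c, ... within each original row
out$depth <- paste0(out$depth, letters[sequence(k)])
rownames(out) <- NULL
out
```

The day and year columns are carried along unchanged, so every new row keeps the labels of the row it came from.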

Taking Average and Median by Month and then Ordering by Date and Factor in R

Let's suppose I have the following data:
set.seed(123)
Dates <- c("2013-10-07","2013-10-14","2013-11-21","2013-11-28" , "2013-12-04" , "2013-12-11","2013-01-18","2013-01-18")
Dates.New <- c(Dates,Dates)
Values <- sample(seq(1:10),16,replace = TRUE)
Factor <- c(rep("Group 1",8),rep("Group 2",8))
df <- data.frame(Dates.New,Values,Factor)
df[sample(1:nrow(df)),]
This returns
Dates.New Values Factor
4 2013-11-28 9 Group 1
1 2013-10-07 3 Group 1
5 2013-12-04 10 Group 1
13 2013-12-04 7 Group 2
11 2013-11-21 10 Group 2
8 2013-01-18 9 Group 1
7 2013-01-18 6 Group 1
9 2013-10-07 6 Group 2
6 2013-12-11 1 Group 1
14 2013-12-11 6 Group 2
16 2013-01-18 9 Group 2
3 2013-11-21 5 Group 1
2 2013-10-14 8 Group 1
15 2013-01-18 2 Group 2
12 2013-11-28 5 Group 2
10 2013-10-14 5 Group 2
What I am trying to do here is find the monthly average and median for both of my factors, then order each group by month in a new data frame. So the new data frame would have the median and average for months 10, 11, 12, 1 for Group 1 bundled together, and the next 4 rows would have the median and average for months 10, 11, 12, 1 for Group 2 bundled together as well. I am open to packages. Thanks!
Here is a data.table solution. The question seems to be looking for both mean and median. See if this suits your need.
library(zoo); library(data.table)
setDT(df)[, list(Mean = mean(Values),
Median = median(Values)),
by = list(Factor, as.yearmon(Dates.New))][order(Factor, as.yearmon)]
# Factor as.yearmon Mean Median
# 1: Group 1 Jan 2013 7.5 7.5
# 2: Group 1 Oct 2013 5.5 5.5
# 3: Group 1 Nov 2013 7.0 7.0
# 4: Group 1 Dec 2013 5.5 5.5
# 5: Group 2 Jan 2013 5.5 5.5
# 6: Group 2 Oct 2013 5.5 5.5
# 7: Group 2 Nov 2013 7.5 7.5
# 8: Group 2 Dec 2013 6.5 6.5
Like this?
df$Dates.New <- as.Date(df$Dates.New)
library(zoo) # for as.yearmon(...)
result <- aggregate(Values~as.yearmon(Dates.New)+Factor,df,mean)
names(result)[1] <- "Year.Mon"
result
# Year.Mon Factor Values
# 1 Jan 2013 Group 1 7.5
# 2 Oct 2013 Group 1 5.5
# 3 Nov 2013 Group 1 7.0
# 4 Dec 2013 Group 1 5.5
# 5 Jan 2013 Group 2 5.5
# 6 Oct 2013 Group 2 5.5
# 7 Nov 2013 Group 2 7.5
# 8 Dec 2013 Group 2 6.5
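Since the question asks for both the mean and the median, a base R sketch that returns both without zoo might look like the following. It uses a small made-up data frame rather than the sampled data above, and groups by a "YYYY-MM" string built with format; the column name Month is an assumption introduced here.

```r
# Small illustrative data set (values are made up)
df <- data.frame(
  Dates.New = as.Date(c("2013-10-07", "2013-10-14", "2013-11-21", "2013-10-07")),
  Values = c(2, 4, 5, 7),
  Factor = c("Group 1", "Group 1", "Group 1", "Group 2")
)

# Year-month label used for grouping
df$Month <- format(df$Dates.New, "%Y-%m")

# aggregate() can return several statistics per group as a matrix column
res <- aggregate(Values ~ Month + Factor, data = df,
                 FUN = function(v) c(Mean = mean(v), Median = median(v)))

# Flatten the matrix column into Values.Mean and Values.Median
res <- do.call(data.frame, res)

# Order by group, then by month within group
res <- res[order(res$Factor, res$Month), ]
res
```

The "YYYY-MM" strings sort chronologically, so ordering on them keeps the months of each group in calendar order.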

repeat rows in a dataset based on a column, but increment the rows [duplicate]

I have a dataset which has project name, start year and contract term. I need to develop this dataset into time series. For example, one row in my dataset is: Project A, start year 2003 and contract term 5. I would like to repeat each row based on contract term. My dataset looks like this:
Project Name Start Year Contract Term
A 2003 5
B 2013 3
C 2000 2
My desired result should look like this:
Project Name Start Year Contract Term
A 2003 5
A 2004 5
A 2005 5
A 2006 5
A 2007 5
B 2013 3
B 2014 3
B 2015 3
C 2000 2
C 2001 2
I have tried:
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
But this only repeats each project by the number in contract term. I cannot make it increment the years.
Thanks in advance!
Here it is in two steps:
Step 1, you know:
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
rpsData
# Project.Name Start.Year Contract.Term
# 1 A 2003 5
# 1.1 A 2003 5
# 1.2 A 2003 5
# 1.3 A 2003 5
# 1.4 A 2003 5
# 2 B 2013 3
# 2.1 B 2013 3
# 2.2 B 2013 3
# 3 C 2000 2
# 3.1 C 2000 2
Step 2 makes use of sequence and basic addition:
sequence(rpsInput$Contract.Term) ## This will be helpful...
# [1] 1 2 3 4 5 1 2 3 1 2
rpsData$Start.Year <- rpsData$Start.Year + sequence(rpsInput$Contract.Term)
rpsData
# Project.Name Start.Year Contract.Term
# 1 A 2004 5
# 1.1 A 2005 5
# 1.2 A 2006 5
# 1.3 A 2007 5
# 1.4 A 2008 5
# 2 B 2014 3
# 2.1 B 2015 3
# 2.2 B 2016 3
# 3 C 2001 2
# 3.1 C 2002 2
Just to piggy back on Ananda's answer, change
sequence(rpsInput$Contract.Term)
to
(sequence(rpsInput$Contract.Term)-1)
to get the output you desire.
ProjectName<-c("A","B","C")
Start.Year<-c(2003,2013,2000)
Contract.Term<-c(5,3,2)
rpsInput<-data.frame(ProjectName,Start.Year,Contract.Term)
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
rpsData$Start.Year <- rpsData$Start.Year + (sequence(rpsInput$Contract.Term)-1)
rpsData
# ProjectName Start.Year Contract.Term
#1 A 2003 5
#1.1 A 2004 5
#1.2 A 2005 5
#1.3 A 2006 5
#1.4 A 2007 5
#2 B 2013 3
#2.1 B 2014 3
#2.2 B 2015 3
#3 C 2000 2
#3.1 C 2001 2
