Split rows but maintain labels in R

I would like to split some of the values in a data frame, maintaining some of the labels while allowing for one new label that distinguishes the new splits. For example:
day year depth mass
1 2008 10 13
2 2008 10 15
1 2008 20 14
2 2008 20 12
1 2009 10 14
2 2009 10 16
1 2009 20 12
2 2009 20 18
Now divide each mass by 2 to get:
day year depth mass
1 2008 10a 6.5
1 2008 10b 6.5
2 2008 10a 7.5
2 2008 10b 7.5
1 2008 20a 7
1 2008 20b 7
2 2008 20a 6
2 2008 20b 6
1 2009 10a 7
1 2009 10b 7
2 2009 10a 8
2 2009 10b 8
1 2009 20a 6
1 2009 20b 6
2 2009 20a 9
2 2009 20b 9
The new rows have new depth labels and halved mass values, but they keep the corresponding day and year data.
To make things more complicated, I will be running a slightly different function on each depth. For example, I will divide depth == 10 by 2, but depth == 20 by 3. But I can probably figure that out if the basic question here can be answered.

A somewhat long data.table line, but I think this will achieve what you need:
library(data.table)
df$id <- rownames(df)
df1 <- setDT(df)[rep(1:nrow(df), times = 2), .SD, by = id][
  , `:=`(mass = mass / 2, depth = paste0(depth, c("a", "b")))]
Output:
df1
id day year depth mass
1 1 2008 10a 6.5
1 1 2008 10b 6.5
2 2 2008 10a 7.5
2 2 2008 10b 7.5
3 1 2008 20a 7.0
3 1 2008 20b 7.0
4 2 2008 20a 6.0
4 2 2008 20b 6.0
5 1 2009 10a 7.0
5 1 2009 10b 7.0
6 2 2009 10a 8.0
6 2 2009 10b 8.0
7 1 2009 20a 6.0
7 1 2009 20b 6.0
8 2 2009 20a 9.0
8 2 2009 20b 9.0
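The question also asks to divide depth == 10 by 2 but depth == 20 by 3. Continuing from the same setup (df already has the id column), here is a sketch that swaps the fixed division for a conditional one; df2 is just an illustrative name:
library(data.table)
# duplicate each row within id, divide mass by a depth-dependent divisor,
# and append the a/b suffix to the two copies of each original row
df2 <- setDT(df)[rep(1:nrow(df), times = 2), .SD, by = id][
  , `:=`(mass = mass / ifelse(depth == 10, 2, 3),
         depth = paste0(depth, c("a", "b")))]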

Using dplyr, you can do it this way:
library(dplyr)
df %>%
  group_by(day, year, depth) %>%
  bind_rows(., .) %>%
  mutate(mass = mass / 2) %>%
  arrange(day, year, depth, mass)
Note, I have not appended the a/b suffix to depth, but you can do it along the same lines; a sketch is included after the output below.
Output is as follows:
Source: local data frame [16 x 4]
day year depth mass
(dbl) (dbl) (dbl) (dbl)
1 1 2008 10 6.5
2 1 2008 10 6.5
3 1 2008 20 7.0
4 1 2008 20 7.0
5 1 2009 10 7.0
6 1 2009 10 7.0
7 1 2009 20 6.0
8 1 2009 20 6.0
9 2 2008 10 7.5
10 2 2008 10 7.5
11 2 2008 20 6.0
12 2 2008 20 6.0
13 2 2009 10 8.0
14 2 2009 10 8.0
15 2 2009 20 9.0
16 2 2009 20 9.0
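As a sketch (not part of the original answer) of how the a/b suffix and the depth-specific divisor could be added with the same dplyr idea; the suffix helper column is only illustrative:
library(dplyr)
df %>%
  bind_rows(., .) %>%                           # duplicate every row
  group_by(day, year, depth) %>%
  mutate(suffix = c("a", "b")[row_number()],    # label the two copies of each row
         mass = mass / if_else(depth == 10, 2, 3),
         depth = paste0(depth, suffix)) %>%
  ungroup() %>%
  select(-suffix) %>%
  arrange(day, year, depth)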

Related

How to calculate the number of months from the initial date for each individual

This is a representation of my dataset:
ID<-c(rep(1,10),rep(2,8))
year<-c(2007,2007,2007,2008,2008,2009,2010,2009,2010,2011,
2008,2008,2009,2010,2009,2010,2011,2011)
month<-c(2,7,12,4,11,6,11,1,9,4,3,6,7,4,9,11,2,8)
mydata<-data.frame(ID,year,month)
I want to calculate for each individual the number of months from the initial date. I am using two variables: year and month.
First, I order by year and month:
mydata2 <- mydata %>% group_by(ID, year) %>% arrange(year, month, .by_group = TRUE)
Then I create the variable date, taking the day to be 01:
mydata2$date <- paste("01", mydata2$month, mydata2$year, sep = "-")
Then I use lubridate to convert this variable to Date format:
mydata2$date <- dmy(mydata2$date)
After this, I don't know how to get to a dataset like the one below (preferably using dplyr code):
ID year month date dif_from_init
1 1 2007 2 01-2-2007 0
2 1 2007 7 01-7-2007 5
3 1 2007 12 01-12-2007 10
4 1 2008 4 01-4-2008 14
5 1 2008 11 01-11-2008 21
6 1 2009 1 01-1-2009 23
7 1 2009 6 01-6-2009 28
8 1 2010 9 01-9-2010 43
9 1 2010 11 01-11-2010 45
10 1 2011 4 01-4-2011 50
11 2 2008 3 01-3-2008 0
12 2 2008 6 01-6-2008 3
13 2 2009 7 01-7-2009 16
14 2 2009 9 01-9-2009 18
15 2 2010 4 01-4-2010 25
16 2 2010 11 01-11-2010 32
17 2 2011 2 01-2-2011 35
18 2 2011 8 01-8-2011 41
One way could be:
mydata %>%
  group_by(ID) %>%
  mutate(date = as.Date(sprintf('%d-%d-01', year, month)),
         diff = as.numeric(round((date - date[1]) / 365 * 12)))
# A tibble: 18 x 5
# Groups: ID [2]
ID year month date diff
<dbl> <dbl> <dbl> <date> <dbl>
1 1 2007 2 2007-02-01 0
2 1 2007 7 2007-07-01 5
3 1 2007 12 2007-12-01 10
4 1 2008 4 2008-04-01 14
5 1 2008 11 2008-11-01 21
6 1 2009 6 2009-06-01 28
7 1 2010 11 2010-11-01 45
8 1 2009 1 2009-01-01 23
9 1 2010 9 2010-09-01 43
10 1 2011 4 2011-04-01 50
11 2 2008 3 2008-03-01 0
12 2 2008 6 2008-06-01 3
13 2 2009 7 2009-07-01 16
14 2 2010 4 2010-04-01 25
15 2 2009 9 2009-09-01 18
16 2 2010 11 2010-11-01 32
17 2 2011 2 2011-02-01 35
18 2 2011 8 2011-08-01 41
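Since lubridate is already loaded for dmy(), here is a sketch of an alternative that counts calendar months exactly instead of approximating with days/365*12; make_date() simply builds the first-of-month date:
library(dplyr)
library(lubridate)
mydata %>%
  group_by(ID) %>%
  arrange(year, month, .by_group = TRUE) %>%
  mutate(date = make_date(year, month, 1),
         # exact number of calendar months since each ID's first record
         dif_from_init = (year - first(year)) * 12 + (month - first(month))) %>%
  ungroup()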

Repeat measures: how to use initial measurements to estimate subsequent measurements based on time differences

I have a dataframe with repeat recordings of individuals in the year they were found.
> long <- data.frame(identity, year, age)
> long
identity year age
1 z 2000 10.0
2 z 2001 7.5
3 z 2001 7.5
4 y 2000 10.0
5 x 2003 9.0
6 x 2004 11.0
7 w 2003 9.0
8 v 2001 7.5
9 v 2002 11.0
10 v 2004 11.0
Age was estimated based on the year they were captured:
yr.est<-data.frame(yr,est.age)
> yr.est
yr est.age
1 2000 10.0
2 2001 7.5
3 2002 11.0
4 2003 9.0
5 2004 11.0
When an individual is seen after the first time, how do I give them an estimated age equal to the initial estimated age plus the difference in years? (e.g. individual v was estimated to be 7.5 in 2001, so their age in 2004 should be 10.5, not 11.)
My actual dataset is 15,000 rows long, so I am unable to do this manually.
TIA
Edit.
Expected output posted as a comment by the OP.
long
identity year age
1 z 2000 10.0
2 z 2001 11.0
3 z 2001 11.0
4 y 2000 10.0
5 x 2003 9.0
6 x 2004 10.0
7 w 2003 9.0
8 v 2001 7.5
9 v 2002 8.5
10 v 2004 10.5
This code computes est.age, within each identity group, by adding the difference between the current year and the first year to the first age:
library(tidyverse)
long %>%
  group_by(identity) %>%
  mutate(est.age = first(age) + (year - first(year))) %>%
  select(identity, year, est.age)
## A tibble: 10 x 3
## Groups: identity [5]
# identity year est.age
# <fct> <int> <dbl>
# 1 z 2000 10
# 2 z 2001 11
# 3 z 2001 11
# 4 y 2000 10
# 5 x 2003 9
# 6 x 2004 10
# 7 w 2003 9
# 8 v 2001 7.5
# 9 v 2002 8.5
#10 v 2004 10.5
Data.
long <- read.table(text = "
identity year age
1 z 2000 10.0
2 z 2001 7.5
3 z 2001 7.5
4 y 2000 10.0
5 x 2003 9.0
6 x 2004 11.0
7 w 2003 9.0
8 v 2001 7.5
9 v 2002 11.0
10 v 2004 11.0
", header = TRUE)

How to create matrix for heatmap by group in R?

I have example data as follows. I want to create a heatmap where the average sleep duration in hours (SLP) is shown by 3 recruiting sites (site) and 5 recruiting years (year).
SLP site year sex
8.6 1 2008 1
7.2 1 2005 2
6.4 2 2006 2
9.5 3 2007 2
6.1 2 2009 2
5.1 3 2008 1
2.1 2 2006 2
3.6 1 2001 1
8.6 1 2008 1
7.2 1 2005 2
6.4 2 2006 2
9.5 3 2007 2
6.1 2 2009 2
5.1 3 2008 1
2.1 2 2006 2
3.6 1 2001 1
In the heatmap I want to make, the x axis and y axis are year and site, respectively, and each cell contains the mean sleep duration.
I don't know how to build the matrix from the data frame for making the heatmap.
How do I do this?
There are many options, and it also depends on how you want to create the heatmap. If you want to use ggplot2 then you do not need to modify the data.frame. For example, this should work:
txt <- "SLP site year sex
8.6 1 2008 1
7.2 1 2005 2
6.4 2 2006 2
9.5 3 2007 2
6.1 2 2009 2
5.1 3 2008 1
2.1 2 2006 2
3.6 1 2001 1
8.6 1 2008 1
7.2 1 2005 2
6.4 2 2006 2
9.5 3 2007 2
6.1 2 2009 2
5.1 3 2008 1
2.1 2 2006 2
3.6 1 2001 1"
d <- read.table(text = txt, header = TRUE)
d$year <- factor(d$year) # make year a factor.
library(ggplot2)
ggplot(d, aes(x = site, y = year, fill = SLP)) + geom_tile()
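Since the goal is the mean SLP per site/year cell (and a matrix was asked about), here is a sketch assuming dplyr is available: aggregate first, then plot, or build a plain site-by-year matrix for base heatmap()/image(); d_mean and mean_SLP are illustrative names:
library(dplyr)
library(ggplot2)
# mean SLP per site/year combination, then one tile per combination
d_mean <- d %>%
  group_by(site, year) %>%
  summarise(mean_SLP = mean(SLP), .groups = "drop")
ggplot(d_mean, aes(x = factor(year), y = factor(site), fill = mean_SLP)) +
  geom_tile()
# a site-by-year matrix of mean SLP (NA where there are no observations)
m <- tapply(d$SLP, list(d$site, d$year), mean)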

Increase efficiency of dplyr summarising

I am trying to sort and build a new table from a large data set (>60k rows; the data frame NDw); a sample is shown here:
Season ENo HNo Month Day Year Group
638447 2011 A903851 1881023 10 6 2011 Ducks
589219 2010 C409324 3648019 10 8 2010 Ducks
137451 2006 M576033 2506116 10 13 2006 Ducks
883040 2013 P886755 43313010 10 17 2013 Ducks
851378 2013 C700399 36413199 11 5 2013 Geese
552791 2010 M902312 2508141 11 16 2010 Ducks
152368 2006 M599973 2496101 10 3 2006 Ducks
395393 2008 C412049 3646096 10 28 2008 Ducks
857709 2013 C671619 36413012 9 15 2013 Ducks
67354 2005 C349762 3643011 10 22 2005 Geese
67126 2005 C427496 3643037 11 25 2005 Geese
62260 2005 C349776 3643023 10 7 2005 Ducks
847364 2013 C570491 36411001 10 5 2013 Ducks
447414 2009 A686943 1808206 11 3 2009 Geese
474743 2009 M813353 2509214 10 24 2009 Ducks
439477 2009 A746048 1639142 10 26 2009 Ducks
781218 2012 P792862 4142177 11 27 2012 Geese
806946 2013 M052893 20712036 11 5 2013 Ducks
174932 2006 C450351 3645098 12 5 2006 Geese
828816 2013 M054683 25012010 9 30 2013 Ducks
I want to group by Season and HNo and compute a number of new variables: how many distinct Groups each Season/HNo combination appears in, a total row count, a row count for each Group, and a row count for each Group in each month. The result would look like this, but with all months:
Season HNo groupN total.envelopes ducks geese Octducks
1 2005 1253041 1 2 2 0 2
2 2005 1254026 1 5 5 0 5
3 2005 1254063 2 26 23 3 0
4 2005 1254115 2 14 10 4 10
5 2005 1274023 2 39 28 11 28
I have code that works, but it runs slowly and I feel like there should be a better way to write this block. Maybe I'm wrong and it's not a big issue; I just want to learn how to make my code more efficient. Here is what I use to get the above output:
NDw1 = NDw %>%
  group_by(Season, HNo) %>%
  summarise(groupN = n_distinct(Group),
            total.envelopes = n(),
            ducks = length(ENo[Group %in% 'Ducks']),
            geese = length(ENo[Group %in% 'Geese']),
            Octducks = length(ENo[Group == 'Ducks' & Month == 10]))
The full code has lines for Aug-Jan ducks and geese. I tried to use count rather than length, but it didn't work with a factor variable such as ENo. Any thoughts would be appreciated. Thanks for your time and help.
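As a hedged suggestion rather than a benchmarked answer: each length(ENo[...]) builds an intermediate vector, while sum() on the logical condition counts the same rows directly (assuming no NAs in Group or Month), so a sketch of the same summarise could be:
NDw1 = NDw %>%
  group_by(Season, HNo) %>%
  summarise(groupN = n_distinct(Group),
            total.envelopes = n(),
            ducks = sum(Group == 'Ducks'),        # count rows instead of subsetting ENo
            geese = sum(Group == 'Geese'),
            Octducks = sum(Group == 'Ducks' & Month == 10))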

How to remove subjects who have missing measurements in time series data?

I have data like the following:
ID Year Measurement
1 2009 5.6
1 2010 6.2
1 2011 4.5
2 2008 6.4
2 2009 5.2
3 2008 3.5
3 2010 5.6
4 2009 5.9
4 2010 2.2
4 2011 4.1
4 2012 5.5
Subjects are measured over a few years, with different starting and ending years, and a different number of times. I want to remove subjects that are not measured in every single year between their first and last measurement years. So, in the above data I would want subject 3 removed, since they missed a measurement in 2009.
I thought about doing a for loop where I get the maximum and minimum value of the variable Year for each unique ID, take the difference between the maximum and minimum and add 1, then count the number of times each unique ID appears in the data and check whether the two are equal. This ought to work, but I feel like there has got to be a quicker and more efficient way to do this.
This will be easiest with the data.table package:
library(data.table)
dt = data.table(df, key = c("ID", "Year"))
dt[, Remove := any(diff(Year) > 1), by = ID]
dt = dt[(!Remove)]
dt$Remove = NULL
ID Year Measurement
1: 1 2009 5.6
2: 1 2010 6.2
3: 1 2011 4.5
4: 2 2008 6.4
5: 2 2009 5.2
6: 4 2009 5.9
7: 4 2010 2.2
8: 4 2011 4.1
9: 4 2012 5.5
Here's an alternative that flags an ID whenever any gap between consecutive years exceeds one:
> ind <- aggregate(Year ~ ID, FUN = function(x) any(diff(x) > 1), data = df)
> df[!df$ID %in% ind$ID[ind$Year], ]
ID Year Measurement
1 1 2009 5.6
2 1 2010 6.2
3 1 2011 4.5
4 2 2008 6.4
5 2 2009 5.2
8 4 2009 5.9
9 4 2010 2.2
10 4 2011 4.1
11 4 2012 5.5
You may try ave. My anonymous function is basically the pseudocode suggested in the question.
df[as.logical(ave(df$Year, df$ID, FUN = function(x) length(x) > max(x) - min(x))), ]
# ID Year Measurement
# 1 1 2009 5.6
# 2 1 2010 6.2
# 3 1 2011 4.5
# 4 2 2008 6.4
# 5 2 2009 5.2
# 8 4 2009 5.9
# 9 4 2010 2.2
# 10 4 2011 4.1
# 11 4 2012 5.5
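The pseudocode in the question also translates directly into a dplyr filter; as a sketch, and like the ave answer assuming at most one measurement per ID per year:
library(dplyr)
df %>%
  group_by(ID) %>%
  # keep an ID only if its row count equals the span of years plus one
  filter(n() == max(Year) - min(Year) + 1) %>%
  ungroup()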
