How to create a matrix for a heatmap by group in R?

I have example data as follows. I want to create a heatmap that shows the average sleep duration in hours (SLP) for each of 3 recruiting sites (site) and 5 recruiting years (year).
SLP site year sex
8.6 1 2008 1
7.2 1 2005 2
6.4 2 2006 2
9.5 3 2007 2
6.1 2 2009 2
5.1 3 2008 1
2.1 2 2006 2
3.6 1 2001 1
8.6 1 2008 1
7.2 1 2005 2
6.4 2 2006 2
9.5 3 2007 2
6.1 2 2009 2
5.1 3 2008 1
2.1 2 2006 2
3.6 1 2001 1
In the heatmap I want to make, the x axis and y axis are year and site, respectively, and each cell contains the mean sleep duration.
I don't know how to make the matrix from data frame for making heatmap.
How do I do this?

There are many options, and the choice depends on how you want to draw the heatmap. With ggplot2 you do not need to reshape the data frame into a matrix; the long format can be used directly. For example, this should work:
txt <- "SLP site year sex
8.6 1 2008 1
7.2 1 2005 2
6.4 2 2006 2
9.5 3 2007 2
6.1 2 2009 2
5.1 3 2008 1
2.1 2 2006 2
3.6 1 2001 1
8.6 1 2008 1
7.2 1 2005 2
6.4 2 2006 2
9.5 3 2007 2
6.1 2 2009 2
5.1 3 2008 1
2.1 2 2006 2
3.6 1 2001 1"
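library(ggplot2)
d <- read.table(text = txt, header = TRUE)
# Average SLP per site/year so that each tile shows a single mean value.
m <- aggregate(SLP ~ site + year, data = d, FUN = mean)
# Treat year and site as discrete: year on the x axis, site on the y axis.
ggplot(m, aes(x = factor(year), y = factor(site), fill = SLP)) + geom_tile() + labs(x = "year", y = "site")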
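If an actual matrix is wanted (for example for base R's image() or heatmap()), one possible sketch is to average SLP per site/year cell with tapply(). The object name m is only illustrative, and site/year combinations without any observation come out as NA.

```r
# Example data from the question (first eight rows; the remaining rows are
# exact repeats, so the cell means are the same).
d <- read.table(text = "SLP site year sex
8.6 1 2008 1
7.2 1 2005 2
6.4 2 2006 2
9.5 3 2007 2
6.1 2 2009 2
5.1 3 2008 1
2.1 2 2006 2
3.6 1 2001 1", header = TRUE)

# Rows = site, columns = year, cells = mean SLP (NA where there is no data).
m <- tapply(d$SLP, list(site = d$site, year = d$year), mean)
m

# Base R plot of the matrix: x axis = year, y axis = site.
# image() expects z with one row per x value, hence the transpose.
image(x = seq_len(ncol(m)), y = seq_len(nrow(m)), z = t(m),
      axes = FALSE, xlab = "year", ylab = "site")
axis(1, at = seq_len(ncol(m)), labels = colnames(m))
axis(2, at = seq_len(nrow(m)), labels = rownames(m))
```

The same matrix could also be passed to heatmap(), although its default clustering and scaling would probably need to be switched off and the NA cells handled first.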

Related

Repeat measures: how to use initial measurements to estimate subsequent measurement based off time differences

I have a dataframe with repeat recordings of individuals in the year they were found.
>long<-data.frame(identity,year,age)
> long
identity year age
1 z 2000 10.0
2 z 2001 7.5
3 z 2001 7.5
4 y 2000 10.0
5 x 2003 9.0
6 x 2004 11.0
7 w 2003 9.0
8 v 2001 7.5
9 v 2002 11.0
10 v 2004 11.0
Age was estimated based on the year they were captured:
yr.est<-data.frame(yr,est.age)
> yr.est
yr est.age
1 2000 10.0
2 2001 7.5
3 2002 11.0
4 2003 9.0
5 2004 11.0
When an individual is seen again after the first time, how do I give them an estimated age equal to the initial estimated age plus the difference between the years? (E.g. individual v was estimated to be 7.5 in 2001, so their age in 2004 should be 10.5, not 11.)
My actual dataset is 15000 long so I am unable to do it manually
TIA
Edit.
Expected output posted as a comment by the OP.
long
identity year age
1 z 2000 10.0
2 z 2001 11.0
3 z 2001 11.0
4 y 2000 10.0
5 x 2003 9.0
6 x 2004 10.0
7 w 2003 9.0
8 v 2001 7.5
9 v 2002 8.5
10 v 2004 10.5
This code computes est.age by adding to the first age the difference between the current year and the first year, by group of identity.
library(tidyverse)
long %>%
group_by(identity) %>%
mutate(est.age = first(age) + (year - first(year))) %>%
select(identity, year, est.age)
## A tibble: 10 x 3
## Groups: identity [5]
# identity year est.age
# <fct> <int> <dbl>
# 1 z 2000 10
# 2 z 2001 11
# 3 z 2001 11
# 4 y 2000 10
# 5 x 2003 9
# 6 x 2004 10
# 7 w 2003 9
# 8 v 2001 7.5
# 9 v 2002 8.5
#10 v 2004 10.5
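For readers who prefer to avoid the tidyverse, the same idea can be sketched in base R with ave(). This assumes, as in the example, that the rows of each individual are already ordered by year; the helper names first_year and first_age are illustrative only.

```r
long <- data.frame(
  identity = c("z", "z", "z", "y", "x", "x", "w", "v", "v", "v"),
  year     = c(2000, 2001, 2001, 2000, 2003, 2004, 2003, 2001, 2002, 2004),
  age      = c(10, 7.5, 7.5, 10, 9, 11, 9, 7.5, 11, 11)
)

# First recorded year and age of each individual, repeated on every row
# of that individual (assumes rows are ordered by year within identity).
first_year <- ave(long$year, long$identity, FUN = function(x) x[1])
first_age  <- ave(long$age,  long$identity, FUN = function(x) x[1])

# Initial estimated age plus the number of years elapsed since first capture.
long$est.age <- first_age + (long$year - first_year)
long
```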
Data.
long <- read.table(text = "
identity year age
1 z 2000 10.0
2 z 2001 7.5
3 z 2001 7.5
4 y 2000 10.0
5 x 2003 9.0
6 x 2004 11.0
7 w 2003 9.0
8 v 2001 7.5
9 v 2002 11.0
10 v 2004 11.0
", header = TRUE)

Split rows but maintain labels

I would like to literally split some of the values in a dataframe, but would like to maintain some of the labels while allowing for one new label for the new splits. For example:
day year depth mass
1 2008 10 13
2 2008 10 15
1 2008 20 14
2 2008 20 12
1 2009 10 14
2 2009 10 16
1 2009 20 12
2 2009 20 18
Now divide each mass by 2 to get:
day year depth mass
1 2008 10a 6.5
1 2008 10b 6.5
2 2008 10a 7.5
2 2008 10b 7.5
1 2008 20a 7
1 2008 20b 7
2 2008 20a 6
2 2008 20b 6
1 2009 10a 7
1 2009 10b 7
2 2009 10a 8
2 2009 10b 8
1 2009 20a 6
1 2009 20b 6
2 2009 20a 9
2 2009 20b 9
There are new values, but they have the corresponding day and year data.
To make things more complicated, I will be running a slightly different function on each depth. For example, I will divide depth == 10 by 2, but depth == 20 by three. But I can probably figure that out if the basic question here can be answered.
A somewhat long data.table line, but I think this will achieve what you need:
library(data.table)
df$id <- rownames(df)
df1 <- setDT(df)[rep(1:nrow(df),times = 2),.SD,by=id][,`:=`(mass=mass/2,depth=paste(depth,c("a","b"),sep=""))]
Output:
df1
id day year depth mass
1 1 2008 10a 6.5
1 1 2008 10b 6.5
2 2 2008 10a 7.5
2 2 2008 10b 7.5
3 1 2008 20a 7.0
3 1 2008 20b 7.0
4 2 2008 20a 6.0
4 2 2008 20b 6.0
5 1 2009 10a 7.0
5 1 2009 10b 7.0
6 2 2009 10a 8.0
6 2 2009 10b 8.0
7 1 2009 20a 6.0
7 1 2009 20b 6.0
8 2 2009 20a 9.0
8 2 2009 20b 9.0
Using dplyr, you can do it this way:
library(dplyr)
df %>% group_by(day, year, depth) %>% bind_rows(., .) %>% mutate(mass = mass/2) %>% arrange(day, year, depth, mass)
Note, I have not done the a/b appending to the depth, but I think you can probably do it based on this same idea.
Output is as follows:
Source: local data frame [16 x 4]
day year depth mass
(dbl) (dbl) (dbl) (dbl)
1 1 2008 10 6.5
2 1 2008 10 6.5
3 1 2008 20 7.0
4 1 2008 20 7.0
5 1 2009 10 7.0
6 1 2009 10 7.0
7 1 2009 20 6.0
8 1 2009 20 6.0
9 2 2008 10 7.5
10 2 2008 10 7.5
11 2 2008 20 6.0
12 2 2008 20 6.0
13 2 2009 10 8.0
14 2 2009 10 8.0
15 2 2009 20 9.0
16 2 2009 20 9.0
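The a/b labelling and the depth-specific divisor mentioned in the question could be combined along the following lines in base R. This is only a sketch: the lookup vector n_parts is hypothetical, and it assumes that "divide by three" means splitting a row into three equal parts labelled a, b and c.

```r
df <- data.frame(
  day   = c(1, 2, 1, 2, 1, 2, 1, 2),
  year  = c(2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009),
  depth = c(10, 10, 20, 20, 10, 10, 20, 20),
  mass  = c(13, 15, 14, 12, 14, 16, 12, 18)
)

# Hypothetical lookup: number of parts each depth is split into.
n_parts <- c("10" = 2, "20" = 3)
k <- unname(n_parts[as.character(df$depth)])

# Repeat each row k times, share its mass equally among the copies,
# and label the copies a, b, c, ... within each original row.
out <- df[rep(seq_len(nrow(df)), times = k), ]
out$mass  <- out$mass / rep(k, times = k)
out$depth <- paste0(out$depth, letters[sequence(k)])
rownames(out) <- NULL
out
```

Because each row's mass is shared among its copies, the total mass in out stays equal to the total in df, which is a quick sanity check on the split.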

R How do I add a dataframe column whose values are derived from other column values in a DIFFERENT row?

The answer to this question might be simple but I can't seem to get around it.
I have a dataset with: years, treatments, treatment levels and a value (yield). Treatments include mineral (fertiliser), manure and compost. I would like to add a column with a reference value. This reference should be the value (yield) of given year and level of the mineral treatment. For example:
DF1<-data.frame(treatment = c("mineral","mineral", "manure","manure","compost","compost","mineral","mineral", "manure","manure", "compost","compost"),
year = c("1990","1990","1990","1990","1990","1990", "1991","1991","1991", "1991","1991","1991"),
level = c("1","2","1","2","1","2","1","2","1","2","1","2"),
value = c("1","2","1.1","2.2","1.3","2.5","3","4","3.2","4.4","3.5","4.8"))
DF1
treatment year level value
mineral 1990 1 1
mineral 1990 2 2
manure 1990 1 1.1
manure 1990 2 2.2
compost 1990 1 1.3
compost 1990 2 2.5
mineral 1991 1 3
mineral 1991 2 4
manure 1991 1 3.2
manure 1991 2 4.4
compost 1991 1 3.5
compost 1991 2 4.8
Mineral should be the reference. So I would like to add a column called ref which, for all treatments (manure, compost and mineral) in year 1990, contains 1 if the level is 1 and 2 if the level is 2. For the year 1991, the reference value for all treatments should be 3 if the level is 1 and 4 if the level is 2.
If anybody could give me advice on this, I would be very grateful.
You could try
res <- do.call(rbind,
lapply(split(DF1, list(DF1$year, DF1$level), drop=TRUE),
function(x){x$ref <- x$value[x$treatment=='mineral']
x}))
indx <- as.numeric(gsub(".*\\.", "", row.names(res)))
res1 <- res[order(indx),]
row.names(res1) <- NULL
res1
Or using data.table
library(data.table)
DT <- as.data.table(DF1)
DT1 <- DT[treatment=='mineral', list(ref=value), by=list(year, level)]
DT[,indx:=1:.N]
setkey(DT, year, level)
DT[J(DT1)][order(indx),][,indx:=NULL][]
# treatment year level value ref
#1: mineral 1990 1 1 1
#2: mineral 1990 2 2 2
#3: manure 1990 1 1.1 1
#4: manure 1990 2 2.2 2
#5: compost 1990 1 1.3 1
#6: compost 1990 2 2.5 2
#7: mineral 1991 1 3 3
#8: mineral 1991 2 4 4
#9: manure 1991 1 3.2 3
#10: manure 1991 2 4.4 4
#11: compost 1991 1 3.5 3
#12: compost 1991 2 4.8 4
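Another way to look at the problem is as a join: extract the mineral rows as a small lookup table and merge it back on year and level. The sketch below uses base R's merge(); the helper column row is an assumption added here only to restore the original row order, because merge() sorts its result.

```r
DF1 <- data.frame(
  treatment = rep(rep(c("mineral", "manure", "compost"), each = 2), times = 2),
  year      = rep(c("1990", "1991"), each = 6),
  level     = rep(c("1", "2"), times = 6),
  value     = c("1", "2", "1.1", "2.2", "1.3", "2.5",
                "3", "4", "3.2", "4.4", "3.5", "4.8")
)

# Lookup table: the mineral value for every year/level combination.
ref <- DF1[DF1$treatment == "mineral", c("year", "level", "value")]
names(ref)[3] <- "ref"

# merge() reorders rows, so remember the original position and restore it.
DF1$row <- seq_len(nrow(DF1))
res <- merge(DF1, ref, by = c("year", "level"))
res <- res[order(res$row), c("treatment", "year", "level", "value", "ref")]
rownames(res) <- NULL
res
```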

repeat rows in a dataset based on a column, but increment the rows [duplicate]

I have a dataset which has project name, start year and contract term. I need to develop this dataset into time series. For example, one row in my dataset is: Project A, start year 2003 and contract term 5. I would like to repeat each row based on contract term. My dataset looks like this:
Project Name Start Year Contract Term
A 2003 5
B 2013 3
C 2000 2
My desired result should look like this:
Project Name Start Year Contract Term
A 2003 5
A 2004 5
A 2005 5
A 2006 5
A 2007 5
B 2013 3
B 2014 3
B 2015 3
C 2000 2
C 2001 2
I have tried:
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
But this only repeats each project by the number in contract term. I cannot get it to increment the years.
Thanks in advance!
Here it is in two steps:
Step 1, you know:
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
rpsData
# Project.Name Start.Year Contract.Term
# 1 A 2003 5
# 1.1 A 2003 5
# 1.2 A 2003 5
# 1.3 A 2003 5
# 1.4 A 2003 5
# 2 B 2013 3
# 2.1 B 2013 3
# 2.2 B 2013 3
# 3 C 2000 2
# 3.1 C 2000 2
Step 2 makes use of sequence and basic addition:
sequence(rpsInput$Contract.Term) ## This will be helpful...
# [1] 1 2 3 4 5 1 2 3 1 2
rpsData$Start.Year <- rpsData$Start.Year + sequence(rpsInput$Contract.Term)
rpsData
# Project.Name Start.Year Contract.Term
# 1 A 2004 5
# 1.1 A 2005 5
# 1.2 A 2006 5
# 1.3 A 2007 5
# 1.4 A 2008 5
# 2 B 2014 3
# 2.1 B 2015 3
# 2.2 B 2016 3
# 3 C 2001 2
# 3.1 C 2002 2
Just to piggy back on Ananda's answer, change
sequence(rpsInput$Contract.Term)
to
(sequence(rpsInput$Contract.Term)-1)
to get the output you desire.
ProjectName<-c("A","B","C")
Start.Year<-c(2003,2013,2000)
Contract.Term<-c(5,3,2)
rpsInput<-data.frame(ProjectName,Start.Year,Contract.Term)
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
rpsData$Start.Year <- rpsData$Start.Year + (sequence(rpsInput$Contract.Term)-1)
rpsData
# ProjectName Start.Year Contract.Term
#1 A 2003 5
#1.1 A 2004 5
#1.2 A 2005 5
#1.3 A 2006 5
#1.4 A 2007 5
#2 B 2013 3
#2.1 B 2014 3
#2.2 B 2015 3
#3 C 2000 2
#3.1 C 2001 2
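If the original Start.Year should be kept alongside the running year, a variation on the same idea is to build the years per project with seq() and store them in a separate column. This is a sketch only; the column name Year and the helper idx are not part of the original question.

```r
rpsInput <- data.frame(
  Project.Name  = c("A", "B", "C"),
  Start.Year    = c(2003, 2013, 2000),
  Contract.Term = c(5, 3, 2)
)

# Repeat each row Contract.Term times.
idx <- rep(seq_len(nrow(rpsInput)), times = rpsInput$Contract.Term)
rpsData <- rpsInput[idx, ]

# For each project, count up from its start year for Contract.Term years.
rpsData$Year <- unlist(Map(function(start, n) seq(start, length.out = n),
                           rpsInput$Start.Year, rpsInput$Contract.Term))
rownames(rpsData) <- NULL
rpsData
```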

How to remove subjects who have missing measurements in time series data?

I have data like the following:
ID Year Measurement
1 2009 5.6
1 2010 6.2
1 2011 4.5
2 2008 6.4
2 2009 5.2
3 2008 3.5
3 2010 5.6
4 2009 5.9
4 2010 2.2
4 2011 4.1
4 2012 5.5
Where subjects are measured over a few years with different starting and ending years. Subjects are also measured a different number of times. I want to remove subjects that are not measured every single year between their start and end measurement years. So, in the above data I would want subject 3 removed since they missed a measurement in 2009.
I thought about writing a for loop that gets the maximum and minimum value of Year for each unique ID, takes the difference between them and adds 1, and then compares the result with the number of times that ID appears in the data. This ought to work, but I feel like there has got to be a quicker and more efficient way to do this.
This will be easiest with the data.table package:
library(data.table)
dt = data.table(df, key=c("ID", "Year")) # sorted by ID, then Year
dt[,Remove:=any(diff(Year) > 1),by=ID] # TRUE if a subject skips a year
dt = dt[(!Remove)]
dt$Remove = NULL
dt
ID Year Measurement
1: 1 2009 5.6
2: 1 2010 6.2
3: 1 2011 4.5
4: 2 2008 6.4
5: 2 2009 5.2
6: 4 2009 5.9
7: 4 2010 2.2
8: 4 2011 4.1
9: 4 2012 5.5
Here's an alternative
# Flag IDs with a gap of more than one year between any two consecutive measurements.
ind <- aggregate(Year~ID, FUN=function(x) any(diff(sort(x)) > 1), data=df)
df[!df$ID %in% ind$ID[ind$Year], ]
ID Year Measurement
1 1 2009 5.6
2 1 2010 6.2
3 1 2011 4.5
4 2 2008 6.4
5 2 2009 5.2
8 4 2009 5.9
9 4 2010 2.2
10 4 2011 4.1
11 4 2012 5.5
You may try ave. My anonymous function is basically the pseudo code suggested in the question.
df[as.logical(ave(df$Year, df$ID, FUN = function(x) length(x) > max(x) - min(x))), ]
# ID Year Measurement
# 1 1 2009 5.6
# 2 1 2010 6.2
# 3 1 2011 4.5
# 4 2 2008 6.4
# 5 2 2009 5.2
# 8 4 2009 5.9
# 9 4 2010 2.2
# 10 4 2011 4.1
# 11 4 2012 5.5
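The check described in the question (span of years versus number of measurements) can also be written so that it tolerates unsorted rows and repeated years, by counting distinct years. The following is a sketch with tapply(); the names complete, keep and kept are illustrative.

```r
df <- data.frame(
  ID          = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
  Year        = c(2009, 2010, 2011, 2008, 2009, 2008, 2010,
                  2009, 2010, 2011, 2012),
  Measurement = c(5.6, 6.2, 4.5, 6.4, 5.2, 3.5, 5.6, 5.9, 2.2, 4.1, 5.5)
)

# TRUE for subjects whose distinct years cover every year from first to last.
complete <- tapply(df$Year, df$ID,
                   function(y) length(unique(y)) == diff(range(y)) + 1)
complete

# Look up each row's subject in the per-ID result and keep complete subjects.
keep <- as.vector(complete[as.character(df$ID)])
kept <- df[keep, ]
kept
```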
