Reduce the range of time in sequence analysis with R

I have sequences that unfold over a very long period of time. I tried 8 different algorithms to classify my sequences (OM, CHI2, ...). Time goes from 1 to 123, and I have 110 individuals and 8 events.
My results are not as expected. First, they are very difficult to read. Second, one category contains too many representative sequences (group 3). Third, the number of sequences per group is really unbalanced.
This may come from the fact that my time variable has a range of 123. I searched for articles dealing with a time range that is too long. I read in Sabherwal and Robey (1993) and in Shi and Prescott (2011) that you can standardize "each sequence by dividing the number of transformations required by the length of the longer sequence". How can I do that in R?
Please find underneath a description of my data:
library(TraMineRextras)
head(seq.tse.data)
seq.tse.data <- structure(list(
ID = c(1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L),
Year = c(2008L, 2010L, 2012L, 2007L, 2009L, 2010L, 2012L,
2013L, 1996L, 1997L, 1999L, 2003L, 2006L, 2008L,
2012L, 2007L, 2007L, 2008L, 2003L, 2007L, 2007L,
2009L, 2009L, 2011L, 2014L, 2016L, 2006L, 2009L,
2011L, 2013L, 2013L, 2015L, 2015L, 2016L),
Event = c(5L, 4L, 5L, 3L, 1L, 5L, 5L, 5L, 3L,3L,3L,3L,3L,5L, 1L, 5L,
5L,5L,4L,5L, 5L, 5L, 5L, 5L, 5L,5L,5L,5L, 4L, 4L, 1L, 4L, 1L,5L)),
class = "data.frame", row.names = c(NA, -34L)
)
seq.sts <- TSE_to_STS(seq.tse.data,
                      id = 1, timestamp = 2, event = 3,
                      stm = NULL, tmin = 1935, tmax = 2018,
                      firstState = "None")
seq.SPS <- seqformat(seq.sts, 1:84, from = "STS", to = "SPS")
seq.obj <- seqdef(seq.SPS)
> head(seq.tse.data)
ID Year Event
1 1 2008 5
2 2 2010 4
3 2 2012 5
4 3 2007 3
5 3 2009 1
6 3 2010 5
> head(seq.obj)
Sequence
[1] (None,74)-(5,10)-1
[2] (None,76)-(4,2)-(5.4,6)-2
[3] (None,73)-(3,2)-(3.1,1)-(5.3.1,8)-3
[4] (None,62)-(3,12)-(5.3,4)-(5.3.1,6)-3
[5] (None,73)-(5,11)-1
[6] (None,69)-(4,4)-(5.4,11)-2
> head(alphabet(seq.obj),10)
[1] "(1,1)" "(1,10)" "(1,11)" "(1,12)" "(1,14)" "(1,19)" "(1,2)" "(1,21)" "(1,25)" "(1,3)"
...
[145] "(5.4.3.1,5)" "(5.4.3.1,6)" "(5.4.3.1,7)" "(5.4.3.1,8)" "(5.4.3.1.2,9)" "(None,1)" "(None,11)" "(None,20)"
[153] "(None,26)" "(None,30)" "(None,38)" "(None,41)" "(None,42)" "(None,44)" "(None,45)" "(None,49)"
[161] "(None,51)" "(None,53)" "(None,55)" "(None,57)" "(None,58)" "(None,59)" "(None,60)" "(None,61)"
[169] "(None,62)" "(None,64)" "(None,65)" "(None,66)" "(None,67)" "(None,68)" "(None,69)" "(None,7)"
[177] "(None,70)" "(None,71)" "(None,72)" "(None,73)" "(None,74)" "(None,75)" "(None,76)" "(None,77)"
[185] "(None,78)" "(None,79)"
Thanks in advance,
Antonin

I guess that your question is about normalizing the dissimilarities between sequences. For example, Sabherwal and Robey (1993, p. 557) refer to the distance standardization proposed by Abbott & Hrycak (1990) and do not consider the standardization of a sequence at all. In any case, I cannot figure out what the standardization of a single sequence could be.
The seqdist function of TraMineR has a norm argument that can be used to normalize some of the distance measures proposed. Here is an excerpt from the seqdist help page:
Distances can optionally be normalized by means of the norm argument.
If set to "auto", Elzinga's normalization (similarity divided by
geometrical mean of the two sequence lengths) is applied to "LCS",
"LCP" and "RLCP" distances, while Abbott's normalization (distance
divided by length of the longer sequence) is used for "OM", "HAM" and
"DHD". Elzinga's method can be forced with "gmean" and Abbott's rule
with "maxlength". With "maxdist" the distance is normalized by its
maximal possible value. For more details, see Gabadinho et al. (2009,
2011). Finally, "YujianBo" is the normalization proposed by Yujian and
Bo (2007) that preserves the triangle inequality.
Let me warn you that while normalization makes distances between two short sequences (say of length 10) more comparable with distances between two long sequences (say of length 100), it does not solve the issue of comparing sequences of different lengths.
You will find a detailed discussion of the normalization of distances and similarities in sequence analysis in Elzinga & Studer (2016).
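For illustration, here is a minimal sketch of how normalized OM distances could be computed on the sequence object defined in your question. The constant substitution costs and the indel cost are assumptions chosen only to make the example self-contained, not a recommendation:
library(TraMineR)
# OM distances normalized by the length of the longer sequence
# (Abbott's rule, norm = "maxlength"), matching the quoted standardization.
dist.om <- seqdist(seq.obj, method = "OM",
                   indel = 1, sm = "CONSTANT",   # assumed costs for this sketch
                   norm = "maxlength")
The resulting distance matrix can then be passed to your clustering routines in the same way as unnormalized distances.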

Related

How to take an Average of + or - SD

I have data where the dependent variable (Pop_Index) [1] is recorded against the independent variables Year and Species [2]. The mean and SD are computed from [1].
(a) This is the result for the SD:
Year Species Pop_Index
1 1994 Corn Bunting 2.082483
5 1998 Corn Bunting 2.048155
10 2004 Corn Bunting 2.061617
15 2009 Corn Bunting 2.497792
20 1994 Goldfinch 1.961236
25 1999 Goldfinch 1.995600
30 2005 Goldfinch 2.101403
35 2010 Goldfinch 2.138496
40 1995 Grey Partridge 2.162136
(b) And the result of mean:
Year Species Pop_Index
1 1994 Corn Bunting 2.821668
5 1998 Corn Bunting 2.916975
10 2004 Corn Bunting 2.662797
15 2009 Corn Bunting 4.171538
20 1994 Goldfinch 3.226108
25 1999 Goldfinch 2.452807
30 2005 Goldfinch 2.954816
35 2010 Goldfinch 3.386772
40 1995 Grey Partridge 2.207708
(c) This is the code for the SD:
structure(list(Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L,
2005L, 2010L, 1995L), Species = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L), .Label = c("Corn Bunting", "Goldfinch", "Grey Partridge"
), class = "factor"), Pop_Index = c(2.0824833420524, 2.04815530904537,
2.06161673349657, 2.49779159320587, 1.96123572400404, 1.99559986715288,
2.10140285528351, 2.13849611018009, 2.1621364896722)), row.names = c(1L,
5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L), class = "data.frame")
(d) This is the code for mean:
structure(list(Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L,
2005L, 2010L, 1995L), Species = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L), .Label = c("Corn Bunting", "Goldfinch", "Grey Partridge"
), class = "factor"), Pop_Index = c(2.82166841455814, 2.91697463618566,
2.66279663056763, 4.17153795031277, 3.22610845074252, 2.45280743991572,
2.95481600904799, 3.38677188055508, 2.20770835158744)), row.names = c(1L,
5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L), class = "data.frame")
(e) And this is the code used to take the mean of mean Pop_Index over the years:
df2 <- aggregate(Pop_Index ~ Year, df1, mean)
(f) And this is the result:
Year Pop_Index
1 1994 3.023888
2 1995 2.207708
3 1998 2.916975
4 1999 2.452807
5 2004 2.662797
6 2005 2.954816
7 2009 4.171538
8 2010 3.386772
Now it wouldn't make sense to take the average of the SDs by applying the same procedure as before with mean() or sd().
I have looked online and found someone in a similar predicament with this data:
Month: January
Week 1 Mean: 67.3 Std. Dev: 0.8
Week 2 Mean: 80.5 Std. Dev: 0.6
Week 3 Mean: 82.4 Std. Dev: 0.8
And the response:
"With equal samples size, which is what you have, the standard deviation you are looking for is:
Sqrt [ (.64 + .36 + .64) / 3 ] = 0.739369"
How would I do this in R, or is there another way of doing it? I want to plot error bars on the dataset in (f), and it would be absurd to plot the SDs from (a) against it because the vector lengths differ.
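As a minimal sketch of how the quoted rule could be applied per year (my own illustration, assuming the SD table from (a) is stored as df_sd and that sample sizes are equal, as in the quoted response):
# Pooled SD per year: square the per-species SDs, average them, take the square root.
sd_year <- aggregate(Pop_Index ~ Year, df_sd,
                     FUN = function(s) sqrt(mean(s^2)))
The result has one row per Year, so it lines up with the yearly means in (f) when drawing error bars.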
Sample from original data.frame with a few columns and many rows not included:
structure(list(GRIDREF = structure(c(1L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L), .Label = c("SP8816", "SP9212", "SP9322",
"SP9326", "SP9440", "SP9513", "SP9632", "SP9939", "TF7133", "TF9437"
), class = "factor"), Lat = c(51.83568688, 51.83568688, 51.79908899,
51.88880822, 51.92476157, 52.05042795, 51.80757645, 51.97818159,
52.04057068, 52.86730817, 52.89542895), Long = c(-0.724233561,
-0.724233561, -0.667258035, -0.650074995, -0.648996758, -0.630626734,
-0.62349292, -0.603710436, -0.558026241, 0.538966197, 0.882597783
), Year = c(2006L, 2007L, 1999L, 2004L, 1995L, 2009L, 2011L,
2007L, 2011L, 1996L, 2007L), Species = structure(c(4L, 7L, 5L,
10L, 4L, 6L, 8L, 3L, 2L, 9L, 1L), .Label = c("Blue Tit", "Buzzard",
"Canada Goose", "Collared Dove", "Greenfinch", "Jackdaw", "Linnet",
"Meadow Pipit", "Robin", "Willow Warbler"), class = "factor"),
Pop_Index = c(0L, 0L, 2L, 0L, 1L, 0L, 1L, 4L, 0L, 0L, 8L)), row.names = c(1L,
100L, 1000L, 2000L, 3000L, 4000L, 5000L, 6000L, 10000L, 20213L,
30213L), class = "data.frame")
A look into this data.frame:
GRIDREF Lat Long Year Species Pop_Index TempJanuary
1 SP8816 51.83569 -0.7242336 2006 Collared Dove 0 2.128387
100 SP8816 51.83569 -0.7242336 2007 Linnet 0 4.233226
1000 SP9212 51.79909 -0.6672580 1999 Greenfinch 2 5.270968
2000 SP9322 51.88881 -0.6500750 2004 Willow Warbler 0 4.826452
3000 SP9326 51.92476 -0.6489968 1995 Collared Dove 1 4.390322
4000 SP9440 52.05043 -0.6306267 2009 Jackdaw 0 2.934516
5000 SP9513 51.80758 -0.6234929 2011 Meadow Pipit 1 3.841290
6000 SP9632 51.97818 -0.6037104 2007 Canada Goose 4 7.082580
10000 SP9939 52.04057 -0.5580262 2011 Buzzard 0 3.981290
20213 TF7133 52.86731 0.5389662 1996 Robin 0 3.532903
30213 TF9437 52.89543 0.8825978 2007 Blue Tit 8 7.028710

Calculate Difference between dates by group in R

I'm using a logistic exposure model to calculate hatching success for bird nests. My data set is quite extensive: I have ~2,000 nests, each with a unique ID ("ClutchID"). I need to calculate the number of days a given nest was exposed ("Exposure"), or more simply, the difference between the 1st and last day. I used the following code:
HS_Hatch$Exposure <- NA
for (i in 2:nrow(HS_Hatch)) {
  HS_Hatch$Exposure[i] <- HS_Hatch$DateVisit[i] - HS_Hatch$DateVisit[i - 1]
}
where HS_Hatch is my dataset and DateVisit is the actual date. The only problem is R is calculating an exposure value for the 1st date (which doesn't make sense).
What I really need is to calculate the difference between the 1st and last date for a given clutch. I've also looked into the following:
Exposure=ddply(HS_Hatch, "ClutchID", summarize,
orderfrequency = as.numeric(diff.Date(DateVisit)))
df %>%
mutate(Exposure = as.Date(HS_Hatch$DateVisit, "%Y-%m-%d")) %>%
group_by(ClutchID) %>%
arrange(Exposure) %>%
mutate(lag=lag(DateVisit), difference=DateVisit-lag)
I'm still learning R so any help would be greatly appreciated.
Edit:
Below is a sample of the data I'm using
HS_Hatch <- structure(list(ClutchID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L
), DateVisit = c("3/15/2012", "3/18/2012", "3/20/2012", "4/1/2012",
"4/3/2012", "3/18/2012", "3/20/2012", "3/22/2012", "4/3/2012",
"4/4/2012", "3/22/2012", "4/3/2012", "4/4/2012", "3/18/2012",
"3/20/2012", "3/22/2012", "4/2/2012", "4/3/2012", "4/4/2012",
"3/20/2012", "3/22/2012", "3/25/2012", "3/27/2012", "4/4/2012",
"4/5/2012"), Year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2012L,
2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L,
2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L,
2012L), Survive = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -25L), .Names = c("ClutchID",
"DateVisit", "Year", "Survive"), spec = structure(list(cols = structure(list(
ClutchID = structure(list(), class = c("collector_integer",
"collector")), DateVisit = structure(list(), class = c("collector_character",
"collector")), Year = structure(list(), class = c("collector_integer",
"collector")), Survive = structure(list(), class = c("collector_integer",
"collector"))), .Names = c("ClutchID", "DateVisit", "Year",
"Survive")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
Collecting some of the comments...
Load dplyr
We need only the dplyr package for this problem. If we load other packages, e.g. plyr, it can cause conflicts if both packages have functions with the same name. Let's load only dplyr.
library(dplyr)
In the future, you may wish to load tidyverse instead -- it includes dplyr and other related packages, for graphics, etc.
Converting dates
Let's convert the DateVisit variable from character strings to something R can interpret as a date. Once we do this, it allows R to calculate differences in days by subtracting two dates from each other.
HS_Hatch <- HS_Hatch %>%
mutate(date_visit = as.Date(DateVisit, "%m/%d/%Y"))
The date format %m/%d/%Y is different from your original code. This date format needs to match how dates look in your data. DateVisit has dates as month/day/year, so we use %m/%d/%Y.
Also, you don't need to specify the dataset for DateVisit inside mutate, as in HS_Hatch$DateVisit, because it's already looking in HS_Hatch. The code HS_Hatch %>% ... says 'use HS_Hatch for the following steps'.
Calculating exposures
To calculate exposure, we need to find the first date, last date, and then the difference between the two, for each set of rows by ClutchID. We use summarize, which collapses the data to one row per ClutchID.
exposure <- HS_Hatch %>%
group_by(ClutchID) %>%
summarize(first_visit = min(date_visit),
last_visit = max(date_visit),
exposure = last_visit - first_visit)
first_visit = min(date_visit) will find the minimum date_visit for each ClutchID separately, since we are using group_by(ClutchID).
exposure = last_visit - first_visit takes the newly-calculated first_visit and last_visit and finds the difference in days.
This creates the following result:
ClutchID first_visit last_visit exposure
<int> <date> <date> <dbl>
1 1 2012-03-15 2012-04-03 19
2 2 2012-03-18 2012-04-04 17
3 3 2012-03-22 2012-04-04 13
4 4 2012-03-18 2012-04-04 17
5 5 2012-03-20 2012-04-05 16
If you want to keep all the original rows, you can use mutate in place of summarize.
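For instance, a quick sketch of that variant (my addition, using the same columns as above; the name exposure_rows is mine):
# Keep every visit row and attach the clutch-level exposure as a new column.
exposure_rows <- HS_Hatch %>%
  group_by(ClutchID) %>%
  mutate(exposure = max(date_visit) - min(date_visit)) %>%
  ungroup()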
Here is a similar solution if you are looking for a difftime result in days from a date vector, without NA values produced in the new column, and if you need to group by several conditions/groups.
Make sure that your date vector has been converted to the proper format, as explained previously.
dat2 <- dat %>%
select(group1, group2, date) %>%
arrange(group1, group2, date) %>%
group_by(group1, group2) %>%
mutate(diff_date = c(0,diff(date)))

Sorting groups by months and years

I have to make groups by months and years and sort data in chronological order. I am using following data and code:
mydf = structure(list(vnum1 = c(0.213462416929903, 0.988030047419118,
-1.18652469981587, -0.869178623205718, 0.912875335795115, -1.98798388768447,
-0.304573289627417, 0.559868758619623, -0.663557878516269, -0.558487562052716,
0.437910610434683, 0.294626820421212, 1.22382550331396, 1.33307181022467,
-0.111632843418732, 0.012593612409791, 0.202491597986104, -0.0926340952847484,
0.838878748813974, 0.397235027161488, -0.24188970321148, 0.941276507145062,
0.209022985751647, 1.12583170538807, 1.32872138538229, 0.490518883526501,
-1.5848233402832, 0.21465692222817, 0.32862179851896, 1.25692197516853,
-0.101168594652985, 0.151940891762939, -1.56082855559097, 0.81784767965823,
0.400190430382005, -1.53216256468244, -1.28940381159733, -0.795006205948021,
1.06739871977495, 0.529556847460609, 0.39886466332703, 0.392956914201864,
-1.87574207207718, 0.394469467803633, 1.78815629799651, 1.64468036754424,
-1.5042078341332, 0.963769152123962, -0.22245472921696, 0.0439610905616637
), vmonthnum = c(12L, 7L, 3L, 9L, 3L, 9L, 9L, 5L, 7L, 12L, 5L,
8L, 12L, 6L, 3L, 1L, 3L, 8L, 7L, 3L, 6L, 8L, 7L, 4L, 4L, 8L,
10L, 1L, 11L, 9L, 7L, 6L, 10L, 8L, 9L, 8L, 3L, 9L, 1L, 6L, 12L,
6L, 2L, 2L, 7L, 1L, 6L, 8L, 3L, 12L), vmonth = c("Dec", "Jul",
"Mar", "Sep", "Mar", "Sep", "Sep", "May", "Jul", "Dec", "May",
"Aug", "Dec", "Jun", "Mar", "Jan", "Mar", "Aug", "Jul", "Mar",
"Jun", "Aug", "Jul", "Apr", "Apr", "Aug", "Oct", "Jan", "Nov",
"Sep", "Jul", "Jun", "Oct", "Aug", "Sep", "Aug", "Mar", "Sep",
"Jan", "Jun", "Dec", "Jun", "Feb", "Feb", "Jul", "Jan", "Jun",
"Aug", "Mar", "Dec"), vyear = c(2013L, 2014L, 2014L, 2010L, 2011L,
2012L, 2012L, 2011L, 2014L, 2011L, 2011L, 2010L, 2011L, 2014L,
2010L, 2009L, 2010L, 2012L, 2010L, 2009L, 2010L, 2011L, 2013L,
2013L, 2011L, 2013L, 2012L, 2011L, 2010L, 2010L, 2011L, 2014L,
2010L, 2014L, 2013L, 2009L, 2012L, 2011L, 2014L, 2013L, 2013L,
2009L, 2009L, 2010L, 2014L, 2011L, 2014L, 2010L, 2012L, 2014L
)), .Names = c("vnum1", "vmonthnum", "vmonth", "vyear"), row.names = c(NA,
-50L), class = "data.frame")
> head(mydf)
vnum1 vmonthnum vmonth vyear
1 0.2134624 12 Dec 2013
2 0.9880300 7 Jul 2014
3 -1.1865247 3 Mar 2014
4 -0.8691786 9 Sep 2010
5 0.9128753 3 Mar 2011
6 -1.9879839 9 Sep 2012
outdf = with(mydf, aggregate(vnum1~paste(vmonthnum,vyear,sep="_"),FUN=mean))
names(outdf)=c("grp", "grp_mean")
head(outdf)
grp grp_mean
1 10_2010 -1.56082856
2 10_2012 -1.58482334
3 11_2010 0.32862180
4 1_2009 0.01259361
5 1_2011 0.92966864
6 1_2014 1.06739872
> outdf
grp grp_mean
1 10_2010 -1.560828556
2 10_2012 -1.584823340
3 11_2010 0.328621799
4 1_2009 0.012593612
5 1_2011 0.929668645
6 1_2014 1.067398720
7 12_2011 0.332668971
8 12_2013 0.306163540
9 12_2014 0.043961091
10 2_2009 -1.875742072
11 2_2010 0.394469468
12 3_2009 0.397235027
13 3_2010 0.045429377
14 3_2011 0.912875336
15 3_2012 -0.755929270
16 3_2014 -1.186524700
17 4_2011 1.328721385
18 4_2013 1.125831705
19 5_2011 0.498889685
20 6_2009 0.392956914
21 6_2010 -0.241889703
22 6_2013 0.529556847
23 6_2014 -0.006398377
24 7_2010 0.838878749
25 7_2011 -0.101168595
26 7_2013 0.209022986
27 7_2014 0.704209489
28 8_2009 -1.532162565
29 8_2010 0.629197986
30 8_2011 0.941276507
31 8_2012 -0.092634095
32 8_2013 0.490518884
33 8_2014 0.817847680
34 9_2010 0.193871676
35 9_2011 -0.795006206
36 9_2012 -1.146278589
37 9_2013 0.400190430
How can I sort outdf on the grp column so that it comes out in chronological order? I can use 'Jan', 'Feb', etc. (vmonth) for this. This is needed for plotting the means (on the y-axis) against time on the x-axis. I looked at the solutions on this page, but there the exact date was available: Sorting a data frame based on month-year time format
Thanks for your help.
You could also use as.yearmon from the zoo package:
library(zoo)
mydf$grp <- with(mydf, as.yearmon(paste(vmonth, vyear)))
outdf <- with(mydf, aggregate(vnum1 ~ grp, FUN = mean))
head(outdf)
# grp vnum1
# 1 Jan 2009 0.01259361
# 2 Feb 2009 -1.87574207
# 3 Mar 2009 0.39723503
# 4 Jun 2009 0.39295691
# 5 Aug 2009 -1.53216256
# 6 Feb 2010 0.39446947
If you want to scale the solution, use data.table:
data.table is an evolved version of data.frame, and works much faster for grouping.
library(data.table) # Load package
mydt <- data.table(mydf) # Convert to data.table
setkey(mydt,vyear,vmonthnum) # Set the key, order is important
mydt[,mean(vnum1), by=key(mydt)] # Do the computation
Hope this helps.
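If you prefer to skip the explicit setkey step, a shorter equivalent is the following sketch (assuming a reasonably recent data.table version; the column name grp_mean is mine):
# keyby groups by year and month and also sorts the result chronologically
mydt[, .(grp_mean = mean(vnum1)), keyby = .(vyear, vmonthnum)]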
A sortable year_month key could be obtained from these data using:
> sprintf("%4d_%02d",mydf$vyear,mydf$vmonthnum)
[1] "2013_12" "2014_07" "2014_03" "2010_09" "2011_03" "2012_09" "2012_09"
[8] "2011_05" "2014_07" "2011_12" "2011_05" "2010_08" "2011_12" "2014_06"
[15] "2010_03" "2009_01" "2010_03" "2012_08" "2010_07" "2009_03" "2010_06"
[22] "2011_08" "2013_07" "2013_04" "2011_04" "2013_08" "2012_10" "2011_01"
[29] "2010_11" "2010_09" "2011_07" "2014_06" "2010_10" "2014_08" "2013_09"
[36] "2009_08" "2012_03" "2011_09" "2014_01" "2013_06" "2013_12" "2009_06"
[43] "2009_02" "2010_02" "2014_07" "2011_01" "2014_06" "2010_08" "2012_03"
[50] "2014_12"
... or:
> yrVal <- mydf$vyear + mydf$vmonthnum/12
> yrVal
[1] 2014.000 2014.583 2014.250 2010.750 2011.250 2012.750 2012.750 2011.417
[9] 2014.583 2012.000 2011.417 2010.667 2012.000 2014.500 2010.250 2009.083
[17] 2010.250 2012.667 2010.583 2009.250 2010.500 2011.667 2013.583 2013.333
[25] 2011.333 2013.667 2012.833 2011.083 2010.917 2010.750 2011.583 2014.500
[33] 2010.833 2014.667 2013.750 2009.667 2012.250 2011.750 2014.083 2013.500
[41] 2014.000 2009.500 2009.167 2010.167 2014.583 2011.083 2014.500 2010.667
[49] 2012.250 2015.000
The second version has the advantage that it can be used as the value on the x axis of a graph without even sorting the data first.
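To tie this back to the aggregation in the question, here is a minimal sketch (my addition; the names ym_key and outdf2 are mine) of using the first, zero-padded key so the group means come out in chronological order:
mydf$ym_key <- sprintf("%4d_%02d", mydf$vyear, mydf$vmonthnum)
outdf2 <- aggregate(vnum1 ~ ym_key, data = mydf, FUN = mean)
outdf2 <- outdf2[order(outdf2$ym_key), ]  # zero-padded months sort correctly as text
head(outdf2)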

Finding a specific correlation between two data frames, with one being offset by nine months

This is the next step from a question I asked earlier. I've got two data frames: one focused on birth data, and one focused on winter weather events. The aim of my project is to discover whether there is a simple correlation between extreme winter weather events (i.e. winter storms) and a spike in births nine months later (due to people getting stuck indoors during the storms).
There are several areas with which I'm struggling:
I need to factor out less extreme winter weather events from combined.weather.birth$EVENT_TYPE. The factors currently included are "Frost/Freeze", "Hail", "Heavy Snow", "Ice Storm", "Winter Storm", "Winter Weather", and "Blizzard". Of these, I wish to exclude frost/freeze and hail.
I'm having difficulty running the ccf() function on this data. As described above, I want to discover and analyze potential correlations between these data sets. I'm only comparing data in Massachusetts from the years 2007-2011.
Here is what I've tried so far:
correlation1 <- ccf(birth.data$DATE, combined.weather.birth$DATE, lag.max = NULL,
type="correlation", plot=TRUE)
correlation2 <- ccf(birth.data$DATE, combined.weather.birth$DATE + combined.weather$EVENT_TYPE,
lag.max = NULL, type="correlation", plot=TRUE)
I need to offset this data by nine months, to account for pregnancy after the winter weather events.
Here is some information on the data I'm working with:
str(combined.weather.birth) <-
'data.frame': 966 obs. of 8 variables:
$ EVENT_ID : int 9620 9619 9623 5391 13835 13845 13844 13847 13846 13836 ...
$ STATE : Factor w/ 1 level "MASSACHUSETTS": 1 1 1 1 1 1 1 1 1 1 ...
$ YEAR : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
$ MONTH_NAME : Factor w/ 12 levels "April","August",..: 5 5 5 5 4 4 4 4 4 4 ...
$ EVENT_TYPE : Factor w/ 7 levels "Frost/Freeze",..: 6 6 4 4 5 5 5 5 5 5 ...
$ INJURIES_DIRECT: int 0 0 0 1 0 0 0 0 0 0 ...
$ DEATHS_DIRECT : int 0 0 0 0 0 0 0 0 0 0 ...
$ DATE : POSIXct, format: "2007-01-01" "2007-01-01" "2007-01-01" "2007-01-01" ...
str(birth.data) <-
'data.frame': 60 obs. of 4 variables:
$ YEAR : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
$ MONTH : Factor w/ 12 levels "April","August",..: 5 4 8 1 9 7 6 2 12 11 ...
$ BIRTH_TOTAL: num 6250 5833 6570 6227 6858 ...
$ DATE : POSIXct, format: "2007-01-01" "2007-02-01" "2007-03-01" "2007-04-01" ..
EDIT: I should add that I'm not married to using ccf() here. If there is a better function for finding the specified correlation, I am open to learning about it. I've read a bit about cor(), but that doesn't seem appropriate here since it's designed to work only with matrices.
EDIT 2: adding dput() data.
dput(birth.data)
structure(list(YEAR = c(2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2008L, 2008L, 2008L,
2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L,
2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2009L,
2009L, 2009L, 2009L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L,
2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2011L, 2011L, 2011L,
2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L
), MONTH = structure(c(5L, 4L, 8L, 1L, 9L, 7L, 6L, 2L, 12L, 11L,
10L, 3L, 5L, 4L, 8L, 1L, 9L, 7L, 6L, 2L, 12L, 11L, 10L, 3L, 5L,
4L, 8L, 1L, 9L, 7L, 6L, 2L, 12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L,
9L, 7L, 6L, 2L, 12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L, 9L, 7L, 6L,
2L, 12L, 11L, 10L, 3L), .Label = c("April", "August", "December",
"February", "January", "July", "June", "March", "May", "November",
"October", "September"), class = "factor"), BIRTH_TOTAL = c(6250,
5833, 6570, 6227, 6858, 6735, 6933, 7291, 6385, 6466, 6198, 6221,
6341, 6051, 6444, 6396, 6781, 6583, 6820, 6803, 6531, 6510, 5627,
6135, 5976, 5515, 6208, 6261, 6520, 6509, 6834, 6616, 6489, 6318,
5730, 6040, 5667, 5459, 6162, 6212, 6221, 6194, 6469, 6380, 6342,
5981, 5853, 5925, 5979, 5414, 6070, 6085, 6242, 6438, 6506, 6459,
6260, 6158, 5754, 5801), DATE = structure(c(1167609600, 1170288000,
1172707200, 1175385600, 1177977600, 1180656000, 1183248000, 1185926400,
1188604800, 1191196800, 1193875200, 1196467200, 1199145600, 1201824000,
1204329600, 1207008000, 1209600000, 1212278400, 1214870400, 1217548800,
1220227200, 1222819200, 1225497600, 1228089600, 1230768000, 1233446400,
1235865600, 1238544000, 1241136000, 1243814400, 1246406400, 1249084800,
1251763200, 1254355200, 1257033600, 1259625600, 1262304000, 1264982400,
1267401600, 1270080000, 1272672000, 1275350400, 1277942400, 1280620800,
1283299200, 1285891200, 1288569600, 1291161600, 1293840000, 1296518400,
1298937600, 1301616000, 1304208000, 1306886400, 1309478400, 1312156800,
1314835200, 1317427200, 1320105600, 1322697600), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), .Names = c("YEAR", "MONTH", "BIRTH_TOTAL", "DATE"
), row.names = c(NA, -60L), class = "data.frame")
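For what it's worth, here is a rough sketch of the nine-month offset described above (my own illustration; the monthly event counts, the event-type filter, and the plain Pearson correlation via cor() are assumptions, not a definitive analysis):
library(dplyr)
# Count the severe winter events per calendar month (excluding Frost/Freeze and
# Hail), then pair each month's birth total with the event count from nine
# months earlier and correlate the two monthly series.
severe <- combined.weather.birth %>%
  filter(!EVENT_TYPE %in% c("Frost/Freeze", "Hail")) %>%
  mutate(month_idx = as.integer(format(DATE, "%Y")) * 12L +
                     as.integer(format(DATE, "%m"))) %>%
  count(month_idx, name = "n_events")
births_lagged <- birth.data %>%
  mutate(month_idx = as.integer(format(DATE, "%Y")) * 12L +
                     as.integer(format(DATE, "%m")),
         storm_idx = month_idx - 9L) %>%   # storms nine months before the births
  left_join(severe, by = c("storm_idx" = "month_idx")) %>%
  mutate(n_events = coalesce(n_events, 0L))
cor(births_lagged$n_events, births_lagged$BIRTH_TOTAL)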

Aggregate rows by shared values in a variable

I have a somewhat dumb R question. If I have a matrix (or dataframe, whichever is easier to work with) like:
Year Match
2008 1808
2008 137088
2008 1
2008 56846
2007 2704
2007 169876
2007 75750
2006 2639
2006 193990
2006 2
And I want to sum the match counts for each year (so, e.g., the 2008 row would be 2008 195743). How would I go about doing this? I've got a few solutions in my head, but they are all needlessly complicated, and R tends to have a much easier solution tucked away somewhere.
You can generate the same matrix above with:
structure(c(2008L, 2008L, 2008L, 2008L, 2007L, 2007L, 2007L,
2006L, 2006L, 2006L, 1808L, 137088L, 1L, 56846L, 2704L, 169876L,
75750L, 2639L, 193990L, 2L), .Dim = c(10L, 2L), .Dimnames = list(
NULL, c("Year", "Match")))
Thanks for any help you can offer.
aggregate(x = df$Match, by = list(df$Year), FUN = sum), assuming df is your data frame above.
You may also want to use the ddply function from the plyr package.
# install plyr package
install.packages('plyr')
library(plyr)
# creating your data.frame
foo <- as.data.frame(structure(c(2008L, 2008L, 2008L, 2008L, 2007L, 2007L, 2007L,
2006L, 2006L, 2006L, 1808L, 137088L, 1L, 56846L, 2704L, 169876L,
75750L, 2639L, 193990L, 2L), .Dim = c(10L, 2L), .Dimnames = list(
NULL, c("Year", "Match"))))
# here's what you're looking for
ddply(foo,.(Year),numcolwise(sum))
Year Match
1 2006 196631
2 2007 248330
3 2008 195743
By the way, the total for 2008 should be 195743 (1808 + 137088 + 1 + 56846) instead of 138897; you forgot to add the 56846.
As explained above, you can use aggregate to do this, but in a much simpler way:
aggregate(. ~ Year, df, sum)
# Year Match
#1 2006 196631
#2 2007 248330
#3 2008 195743
You can also use dplyr to solve this as follows:
library(dplyr)
df %>% group_by(Year) %>% summarise(Match = sum(Match))
# Year Match
# (int) (int)
#1 2008 195743
#2 2007 248330
#3 2006 196631
