Aggregate rows by shared values in a variable

Aggregate rows by shared values in a variable - r

I have a somewhat dumb R question. If I have a matrix (or dataframe, whichever is easier to work with) like:
Year Match
2008 1808
2008 137088
2008 1
2008 56846
2007 2704
2007 169876
2007 75750
2006 2639
2006 193990
2006 2
And I wanted to sum each of the match counts for the years (so, e.g. the 2008 row would be 2008 195743, how would I go about doing this? I've got a few solutions in my head but they are all needlessly complicated and R tends to have some much easier solution tucked away somewhere.
You can generate the same matrix above with:
structure(c(2008L, 2008L, 2008L, 2008L, 2007L, 2007L, 2007L,
2006L, 2006L, 2006L, 1808L, 137088L, 1L, 56846L, 2704L, 169876L,
75750L, 2639L, 193990L, 2L), .Dim = c(10L, 2L), .Dimnames = list(
NULL, c("Year", "Match")))
Thanks for any help you can offer.

aggregate(x = df$Match, by = list(df$Year), FUN = sum), assuming df is your data frame above.

You may also want to use 'ddply' function from 'plyr' package.
# install plyr package
install.packages('plyr')
library(plyr)
# creating your data.frame
foo <- as.data.frame(structure(c(2008L, 2008L, 2008L, 2008L, 2007L, 2007L, 2007L,
2006L, 2006L, 2006L, 1808L, 137088L, 1L, 56846L, 2704L, 169876L,
75750L, 2639L, 193990L, 2L), .Dim = c(10L, 2L), .Dimnames = list(
NULL, c("Year", "Match"))))
# here's what you're looking for
ddply(foo,.(Year),numcolwise(sum))
Year Match
1 2006 196631
2 2007 248330
3 2008 195743
By the way, the total sum for 2008 should be 195743 (1808+137088+1+56846) instead of 138897 you forgot add 56846 up.

As it is explained above, you can use aggregate to do it as follows. but in a much simpler way
aggregate(. ~ Year, df, sum)
# Year Match
#1 2006 196631
#2 2007 248330
#3 2008 195743
You can also use the Dplyr to solve this as follows
library(dplyr)
df %>% group_by(Year) %>% summarise(Match = sum(Match))
# Year Match
# (int) (int)
#1 2008 195743
#2 2007 248330
#3 2006 196631

Related

r collapse by year by ID

I have a dataset with multiple rows per ID like this
ID From To State
1 2004 2005 MD
1 2005 2005 MD
1 2005 2012 DC
1 2012 2015 DC
1 2015 2020 DC
1 2012 2013 MD
1 2013 2016 MD
1 2016 2019 MD
1 2019 2020 MD
2 2003 2004 OR
2 2004 2008 OR
2 2008 2013 AZ
2 2013 2015 AZ
My goal is to collapse the multiple From and To columns to create a smooth timeline like
ID From To State
1 2004 2005 MD
1 2005 2020 DC
1 2012 2020 MD
2 2003 2008 OR
2 2008 2015 AZ
Not sure how to accomplish this. An help is much appreciated. Thanks.

Group by 'ID', 'State' and the run-length-id of 'State', get the first of 'From' and last of 'To'
library(dplyr)
library(data.table)
df1 %>%
group_by(ID, State, grp = rleid(State)) %>%
summarise(From = first(From), To = last(To), .groups = 'drop') %>%
select(-grp)
-output
# A tibble: 5 × 4
ID State From To
<int> <chr> <int> <int>
1 1 DC 2005 2020
2 1 MD 2004 2005
3 1 MD 2012 2020
4 2 AZ 2008 2015
5 2 OR 2003 2008
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), From = c(2004L, 2005L, 2005L, 2012L, 2015L, 2012L,
2013L, 2016L, 2019L, 2003L, 2004L, 2008L, 2013L), To = c(2005L,
2005L, 2012L, 2015L, 2020L, 2013L, 2016L, 2019L, 2020L, 2004L,
2008L, 2013L, 2015L), State = c("MD", "MD", "DC", "DC", "DC",
"MD", "MD", "MD", "MD", "OR", "OR", "AZ", "AZ")),
class = "data.frame", row.names = c(NA,
-13L))

Why geom_line is not displaying correctly?

I am running analysis in Bike Sharing (kaggle) dataset. Heres is a sample:
Head
yr mnth Ano cnt
<int> <int> <chr> <int>
1 0 1 2011 985
2 0 1 2011 801
3 0 1 2011 1349
4 0 1 2011 1562
5 0 1 2011 1600
Tail
yr mnth Ano cnt
<int> <int> <chr> <int>
1 1 12 2012 2114
2 1 12 2012 3095
3 1 12 2012 1341
4 1 12 2012 1796
5 1 12 2012 2729
Where "cnt" means the number of bikes for each day. Every line is a day from 01/01/2011 to 12/12/2012
My goal was to analyse the cnt for each month from both 2011 and 2012; However, I keep getting this weird output:
my code:
k<- bike_new %>%
ggplot(aes(x=mnth,y=cnt))+ geom_line();k
What am I doing wrong here?

As mentioned by the sage advice from #AllanCameron add the group element as a factor, and as you have two years, you would need a color. Here the code using simulated data:
library(ggplot2)
library(dplyr)
#Code
bike_new %>%
ggplot(aes(x=factor(mnth),y=cnt,group=factor(Ano),color=factor(Ano)))+
geom_line()+
xlab('month')+
labs(color='Ano')
Output:
Some data used:
#Data
bike_new <- structure(list(yr = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 0L), mnth = c(1, 1, 1, 1, 1, 12, 12, 12, 12,
12, 2, 2, 2, 2, 2), Ano = c(2011L, 2011L, 2011L, 2011L, 2011L,
2012L, 2012L, 2012L, 2012L, 2012L, 2011L, 2011L, 2011L, 2011L,
2011L), cnt = c(985, 801, 1349, 1562, 1600, 2114, 3095, 1341,
1796, 2729, 1085, 901, 1449, 1662, 1700)), row.names = c(NA,
-15L), class = "data.frame")
If you want to see only one line per year, a strategy could be that explained by #Phil using other variable as day. Or you can aggregate values in next form:
#Code 2
bike_new %>%
group_by(Ano,mnth) %>%
summarise(cnt=sum(cnt,na.rm=T)) %>%
ggplot(aes(x=factor(mnth),y=cnt,group=factor(Ano),color=factor(Ano)))+
geom_line()+
geom_point()+
xlab('month')+
labs(color='Ano')
Output:
As you are analyzing number of bikes.

Reduce the range of time in sequence analysis with R

I have a sequence that happens over a very long period of time. I tried 8 different algorithms to classify my sequences (OM, CHi2,...). Time goes from 1 to 123. I have 110 individual and 8 events.
My results are not as expected. First, it's very difficult to read. Second, a category contains too many representatives sequence (group3). Third, the number of sequence per group is really unbalanced.
It may comes from the fact that my time variable has a range of 123. I searched for articles that had an issue with a too long time range. I read in Sabherwal and Robey (1993) and in Shi and Prescott (2011) that you can standardize "each sequence by dividing the number of transformations required by the length of the longer sequence". How can I do that in R?
Please find underneath a description of my data:
library(TraMineRextras)
head(seq.tse.data)
seq.tse.data <- structure(list(
ID = c(1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L),
Year = c(2008L, 2010L, 2012L, 2007L, 2009L, 2010L, 2012L,
2013L, 1996L, 1997L, 1999L, 2003L, 2006L, 2008L,
2012L, 2007L, 2007L, 2008L, 2003L, 2007L, 2007L,
2009L, 2009L, 2011L, 2014L, 2016L, 2006L, 2009L,
2011L, 2013L, 2013L, 2015L, 2015L, 2016L),
Event = c(5L, 4L, 5L, 3L, 1L, 5L, 5L, 5L, 3L,3L,3L,3L,3L,5L, 1L, 5L,
5L,5L,4L,5L, 5L, 5L, 5L, 5L, 5L,5L,5L,5L, 4L, 4L, 1L, 4L, 1L,5L)),
class = "data.frame", row.names = c(NA, -34L)
)
seq.sts <- TSE_to_STS(seq.tse.data,
id = 1, timestamp = 2, event = 3,
stm =NULL, tmin = 1935, tmax = 2018,
firstState = "None")
seq.SPS <- seqformat(seq.sts, 1:84, from = "STS", to = "SPS")
seq.obj <- seqdef(seq.SPS)
> head(seq.tse.data)
ID Year Event
1 1 2008 5
2 2 2010 4
3 2 2012 5
4 3 2007 3
5 3 2009 1
6 3 2010 5
> head(seq.obj)
Sequence
[1] (None,74)-(5,10)-1
[2] (None,76)-(4,2)-(5.4,6)-2
[3] (None,73)-(3,2)-(3.1,1)-(5.3.1,8)-3
[4] (None,62)-(3,12)-(5.3,4)-(5.3.1,6)-3
[5] (None,73)-(5,11)-1
[6] (None,69)-(4,4)-(5.4,11)-2
> head(alphabet(seq.obj),10)
[1] "(1,1)" "(1,10)" "(1,11)" "(1,12)" "(1,14)" "(1,19)" "(1,2)" "(1,21)" "(1,25)" "(1,3)"
...
[145] "(5.4.3.1,5)" "(5.4.3.1,6)" "(5.4.3.1,7)" "(5.4.3.1,8)" "(5.4.3.1.2,9)" "(None,1)" "(None,11)" "(None,20)"
[153] "(None,26)" "(None,30)" "(None,38)" "(None,41)" "(None,42)" "(None,44)" "(None,45)" "(None,49)"
[161] "(None,51)" "(None,53)" "(None,55)" "(None,57)" "(None,58)" "(None,59)" "(None,60)" "(None,61)"
[169] "(None,62)" "(None,64)" "(None,65)" "(None,66)" "(None,67)" "(None,68)" "(None,69)" "(None,7)"
[177] "(None,70)" "(None,71)" "(None,72)" "(None,73)" "(None,74)" "(None,75)" "(None,76)" "(None,77)"
[185] "(None,78)" "(None,79)"
Thanks in advance,
Antonin

I guess that your question is about normalizing the dissimilarities between sequences. E.g., Sabherwal and Robey (1993, p 557) refer to the distance standardization proposed by Abbott & Hyrcac (1990) and do not consider at all the standardization of a sequence. Anyway, I cannot figure out what the standardization of a sequence could be.
The seqdist function of TraMineR has a norm argument that can be used to normalize some of the distance measures proposed. Here is an excerpt from the seqdist help page:
Distances can optionally be normalized by means of the norm argument.
If set to "auto", Elzinga's normalization (similarity divided by
geometrical mean of the two sequence lengths) is applied to "LCS",
"LCP" and "RLCP" distances, while Abbott's normalization (distance
divided by length of the longer sequence) is used for "OM", "HAM" and
"DHD". Elzinga's method can be forced with "gmean" and Abbott's rule
with "maxlength". With "maxdist" the distance is normalized by its
maximal possible value. For more details, see Gabadinho et al. (2009,
2011). Finally, "YujianBo" is the normalization proposed by Yujian and
Bo (2007) that preserves the triangle inequality.
Let me warn you that while normalization makes distances between two short sequences (say of length 10) more comparable with distances between two long sequences (say of length 100), it does not solve the issue of comparing sequences of different lengths.
You find a detailed discussion on the normalization of distance and similarity in sequence analysis in Elzinga & Studer (2016).

ploting a graph using ggplot plotting system

I have the following data frame
year type Measure
1 1989 NP 2107
2 2002 NP 109
3 2003 NP 159
4 2008 NP 137
5 1989 NR 522
6 2002 NR 240
7 2003 NR 248
8 2008 NR 55
9 1989 OR 346
10 2002 OR 134
11 2003 OR 130
12 2008 OR 88
13 1989 P 296
14 2002 P 569
15 2003 P 1202
16 2008 P 34
I want to plot Measure Vs Year plot separately for each type using the ggplot2 system. Can someone help me in getting the plot. I want a single plot with Measure Vs Year subplots for each type
The output of packageDescription("ggplot2") :
packageDescription("ggplot2")
Package: ggplot2
Type: Package
Title: An Implementation of the Grammar of Graphics
Version: 1.0.1
Authors#R: c( person("Hadley", "Wickham", role = c("aut", "cre"), email
= "h.wickham#gmail.com"), person("Winston", "Chang", role =
"aut", email = "winston#stdout.org") )
Description: An implementation of the grammar of graphics in R. It
combines the advantages of both base and lattice graphics:
conditioning and shared axes are handled automatically, and you
can still build up a plot step by step from multiple data
sources. It also implements a sophisticated multidimensional
conditioning system and a consistent interface to map data to
aesthetic attributes. See http://ggplot2.org for more
information, documentation and examples.
Depends: R (>= 2.14), stats, methods
Imports: plyr (>= 1.7.1), digest, grid, gtable (>= 0.1.1), reshape2,
scales (>= 0.2.3), proto, MASS
Suggests: quantreg, Hmisc, mapproj, maps, hexbin, maptools, multcomp,
nlme, testthat, knitr, mgcv
VignetteBuilder: knitr
Enhances: sp
License: GPL-2
URL: http://ggplot2.org, https://github.com/hadley/ggplot2
BugReports: https://github.com/hadley/ggplot2/issues
LazyData: true
Collate: 'aaa-.r' 'aaa-constants.r' 'aes-calculated.r' .....
Packaged: 2015-03-16 20:29:42 UTC; winston
Author: Hadley Wickham [aut, cre], Winston Chang [aut]
Maintainer: Hadley Wickham <h.wickham#gmail.com>
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2015-03-17 17:49:38
Built: R 3.2.1; ; 2015-07-19 04:13:46 UTC; unix
-- File: /home/R/i686-pc-linux-gnu-library/3.2/ggplot2/Meta/package.rds
output of dput(head(main_data))
dput(head(main_data))
structure(list(Measure = c(6.532,
78.88, 0.92, 10.376, 10.859, 83.025), type = c("P", "P",
"P", "P", "P", "P"), year = c(1989L, 1989L, 1989L,
1989L, 1989L, 1989L)), .Names = c("Measure", "type", "year"), row.names = c("114288", "114296",
"114300", "114308", "114325", "114329"), class = "data.frame")

Something like this?
df <- structure(list(year = c(1989L, 2002L, 2003L, 2008L, 1989L, 2002L,
2003L, 2008L, 1989L, 2002L, 2003L, 2008L, 1989L, 2002L, 2003L,
2008L), type = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c(" NP ", " NR ",
" OR ", " P "), class = "factor"), Measure = c(2107L,
109L, 159L, 137L, 522L, 240L, 248L, 55L, 346L, 134L, 130L, 88L,
296L, 569L, 1202L, 34L)), .Names = c("year", "type", "Measure"
), class = "data.frame", row.names = c(NA, -16L))
ggplot(df, aes(x=year, y=Measure)) +
geom_bar(stat='identity') +
facet_grid(. ~ type)

Finding a specific correlation between two data frames, with one being offset by nine months

This is the next step from a question I asked earlier. I've got two data frames: one focused on birth data, and one focused on winter weather events. The aim of my project is to discover whether there exists a simple correlation between extreme winter weather events (i.e. winter storms) and a spike in births nine months later (due to people getting stuck in doors during the storms).
There are several areas with which I'm struggling:
I need to factor out less extreme winter weather events from combined.weather.birth$EVENT_TYPE. The factors currently included are "Frost/Freeze", "Hail", "Heavy Snow", "Ice Storm", "Winter Storm", "Winter Weather", and "Blizzard". Of these, I wish to exclude frost/freeze and hail.
I'm having difficulty running the cff() function on this data. As described above, I want to discover and analyze potential correlations between these data sets. I'm only comparing data in Massachusetts from years 2007-2011.
Here is what I've tried so far:
correlation1 <- ccf(birth.data$DATE, combined.weather.birth$DATE, lag.max = NULL,
type="correlation", plot=TRUE)
correlation2 <- ccf(birth.data$DATE, combined.weather.birth$DATE + combined.weather$EVENT_TYPE,
lag.max = NULL, type="correlation", plot=TRUE)
I need to offset this data by nine months, to account for pregnancy after the winter weather events.
Here is some information on the data I'm working with:
str(combined.weather.birth) <-
'data.frame': 966 obs. of 8 variables:
$ EVENT_ID : int 9620 9619 9623 5391 13835 13845 13844 13847 13846 13836 ...
$ STATE : Factor w/ 1 level "MASSACHUSETTS": 1 1 1 1 1 1 1 1 1 1 ...
$ YEAR : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
$ MONTH_NAME : Factor w/ 12 levels "April","August",..: 5 5 5 5 4 4 4 4 4 4 ...
$ EVENT_TYPE : Factor w/ 7 levels "Frost/Freeze",..: 6 6 4 4 5 5 5 5 5 5 ...
$ INJURIES_DIRECT: int 0 0 0 1 0 0 0 0 0 0 ...
$ DEATHS_DIRECT : int 0 0 0 0 0 0 0 0 0 0 ...
$ DATE : POSIXct, format: "2007-01-01" "2007-01-01" "2007-01-01" "2007-01-01" ...
str(birth.data) <-
'data.frame': 60 obs. of 4 variables:
$ YEAR : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
$ MONTH : Factor w/ 12 levels "April","August",..: 5 4 8 1 9 7 6 2 12 11 ...
$ BIRTH_TOTAL: num 6250 5833 6570 6227 6858 ...
$ DATE : POSIXct, format: "2007-01-01" "2007-02-01" "2007-03-01" "2007-04-01" ..
EDIT: I should add that I'm not married to using cff() here. If there is a better function for finding the specified correlation, I am open to learning about it. I've read a bit about cor(), but that doesn't seem appropriate here since it's designed to only work with matrices.
EDIT 2: adding dput() data.
dput(birth.data)
structure(list(YEAR = c(2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2008L, 2008L, 2008L,
2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L,
2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2009L,
2009L, 2009L, 2009L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L,
2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2011L, 2011L, 2011L,
2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L
), MONTH = structure(c(5L, 4L, 8L, 1L, 9L, 7L, 6L, 2L, 12L, 11L,
10L, 3L, 5L, 4L, 8L, 1L, 9L, 7L, 6L, 2L, 12L, 11L, 10L, 3L, 5L,
4L, 8L, 1L, 9L, 7L, 6L, 2L, 12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L,
9L, 7L, 6L, 2L, 12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L, 9L, 7L, 6L,
2L, 12L, 11L, 10L, 3L), .Label = c("April", "August", "December",
"February", "January", "July", "June", "March", "May", "November",
"October", "September"), class = "factor"), BIRTH_TOTAL = c(6250,
5833, 6570, 6227, 6858, 6735, 6933, 7291, 6385, 6466, 6198, 6221,
6341, 6051, 6444, 6396, 6781, 6583, 6820, 6803, 6531, 6510, 5627,
6135, 5976, 5515, 6208, 6261, 6520, 6509, 6834, 6616, 6489, 6318,
5730, 6040, 5667, 5459, 6162, 6212, 6221, 6194, 6469, 6380, 6342,
5981, 5853, 5925, 5979, 5414, 6070, 6085, 6242, 6438, 6506, 6459,
6260, 6158, 5754, 5801), DATE = structure(c(1167609600, 1170288000,
1172707200, 1175385600, 1177977600, 1180656000, 1183248000, 1185926400,
1188604800, 1191196800, 1193875200, 1196467200, 1199145600, 1201824000,
1204329600, 1207008000, 1209600000, 1212278400, 1214870400, 1217548800,
1220227200, 1222819200, 1225497600, 1228089600, 1230768000, 1233446400,
1235865600, 1238544000, 1241136000, 1243814400, 1246406400, 1249084800,
1251763200, 1254355200, 1257033600, 1259625600, 1262304000, 1264982400,
1267401600, 1270080000, 1272672000, 1275350400, 1277942400, 1280620800,
1283299200, 1285891200, 1288569600, 1291161600, 1293840000, 1296518400,
1298937600, 1301616000, 1304208000, 1306886400, 1309478400, 1312156800,
1314835200, 1317427200, 1320105600, 1322697600), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), .Names = c("YEAR", "MONTH", "BIRTH_TOTAL", "DATE"
), row.names = c(NA, -60L), class = "data.frame")

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex