Summing cells of some rows and columns - r

I have a large data frame where some rows have repeated values in some of their columns. I want to keep the repeated values and sum those which are different. Below there is a sample of my data:
data<-data.frame(season=c(2008,2009,2010,2011,2011,2012,2000,2001),
lic=c(132228,140610,149215,158559,158559,944907,37667,45724),
client=c(174,174,174,174,174,174,175,175),
qtty=c(31,31,31,31,31,31,36,26),
held=c(60,65,58,68,68,70,29,23),
catch=c(7904,6761,9236,9323.2,801,NA,2330,3594.5),
potlift=c(2715,2218,3000,3887,750,NA,2314,3472))
season lic client qtty held catch potlift
2008 132228 174 31 60 7904 2715
2009 140610 174 31 65 6761 2218
2010 149215 174 31 58 9236 3000
2011 158559 174 31 68 9323.2 3887
2011 158559 174 31 68 801 750
2012 944907 174 31 70 NA NA
2000 37667 175 36 29 2330 2314
2001 45724 175 26 23 3594.5 3472
Note that season 2011 is repeated: every variable from lic to held has the same value in both rows, and only catch and potlift differ. I need to keep the repeated values and sum catch and potlift, so my new data frame should look like the example below:
season lic client qtty held catch potlift
2008 132228 174 31 60 7904 2715
2009 140610 174 31 65 6761 2218
2010 149215 174 31 58 9236 3000
2011 158559 174 31 68 10124.2 4637
2012 944907 174 31 70 NA NA
2000 37667 175 36 29 2330 2314
2001 45724 175 26 23 3594.5 3472
I have attempted to do so using aggregate, but this function sums everything. Any help will be appreciated.

data$catch <- with(data, ave(catch,list(lic,client,qtty,held),FUN=sum))
data$potlift <- with(data, ave(potlift,list(lic,client,qtty,held),FUN=sum))
unique(data)
season lic client qtty held catch potlift
1 2008 132228 174 31 60 7904.0 2715
2 2009 140610 174 31 65 6761.0 2218
3 2010 149215 174 31 58 9236.0 3000
4 2011 158559 174 31 68 10124.2 4637
6 2012 944907 174 31 70 NA NA
7 2000 37667 175 36 29 2330.0 2314
8 2001 45724 175 26 23 3594.5 3472

aggregate seems to work fine for me, but I'm not sure what you were trying:
> aggregate(cbind(catch, potlift) ~ ., data, sum, na.action = "na.pass")
season lic client qtty held catch potlift
1 2001 45724 175 26 23 3594.5 3472
2 2000 37667 175 36 29 2330.0 2314
3 2010 149215 174 31 58 9236.0 3000
4 2008 132228 174 31 60 7904.0 2715
5 2009 140610 174 31 65 6761.0 2218
6 2011 158559 174 31 68 10124.2 4637
7 2012 944907 174 31 70 NA NA
Here, use cbind to identify the columns that you want to aggregate (sum). You can then list the grouping columns explicitly on the right-hand side of the formula, or just use . to indicate "use all other columns not mentioned in the cbind call".
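For comparison, here is a minimal sketch of the same call with the grouping columns spelled out instead of using the . shorthand. It reuses the sample data from the question; the name res and the final ordering step are only illustrative.

```r
# Sample data from the question
data <- data.frame(
  season  = c(2008, 2009, 2010, 2011, 2011, 2012, 2000, 2001),
  lic     = c(132228, 140610, 149215, 158559, 158559, 944907, 37667, 45724),
  client  = c(174, 174, 174, 174, 174, 174, 175, 175),
  qtty    = c(31, 31, 31, 31, 31, 31, 36, 26),
  held    = c(60, 65, 58, 68, 68, 70, 29, 23),
  catch   = c(7904, 6761, 9236, 9323.2, 801, NA, 2330, 3594.5),
  potlift = c(2715, 2218, 3000, 3887, 750, NA, 2314, 3472)
)

# Left-hand side: columns to sum. Right-hand side: columns that define a group.
# na.pass keeps the 2012 row, whose catch and potlift are NA.
res <- aggregate(cbind(catch, potlift) ~ season + lic + client + qtty + held,
                 data = data, FUN = sum, na.action = na.pass)

# Optional: put the rows back in a readable order
res <- res[order(res$client, res$season), ]
res
```

Listing the grouping columns by name is more verbose, but it keeps working if further numeric columns are later added to the data frame.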

Related

Plot a data frame using ggplot

I have a dataframe, with the following data:
data1$YEAR data1$WEEK data1$TOTAL.PATIENTS
1 2009 1 579428
9 2009 2 565631
17 2009 3 582932
25 2009 4 611176
33 2009 5 638613
41 2009 6 648304
49 2009 7 624583
57 2009 8 659573
65 2009 9 623389
73 2009 10 637672
81 2009 11 605503
89 2009 12 608342
97 2009 13 586651
105 2009 14 564460
113 2009 15 558837
121 2009 16 577836
129 2009 17 624734
137 2009 18 598189
145 2009 19 550300
153 2009 20 544432
161 2009 21 531526
169 2009 22 538177
177 2009 23 493761
185 2009 24 521701
193 2009 25 512268
201 2009 26 475877
209 2009 27 480680
217 2009 28 502466
225 2009 29 503971
233 2009 30 485804
241 2009 31 496666
249 2009 32 506019
257 2009 33 544827
265 2009 34 588916
273 2009 35 573972
281 2009 36 571201
289 2009 37 638302
296 2009 38 608464
303 2009 39 606458
311 2009 40 855346
319 2009 41 853912
327 2009 42 906536
335 2009 43 898860
343 2009 44 899425
351 2009 45 864348
359 2009 46 853552
367 2009 47 654101
375 2009 48 814550
383 2009 49 781811
391 2009 50 728401
399 2009 51 536961
407 2009 52 583299
2 2010 1 721138
...
The second column is the year (2009 to 2015) and the third column is the week of the year.
I would like to plot this data frame so that the x-axis shows the weeks of each year separately. How can I do that?
Does this work, or do you need to re-label the x-axis to show the year only (in the following plot the x-axis is in year-weeks)?
head(df)
Year Week TOTAL.PATIENTS
1 2009 11 605503
2 2009 12 608342
3 2009 13 586651
4 2009 14 564460
5 2009 15 558837
6 2009 16 577836
df$Year_Week <- paste(df$Year, sprintf('%02d', df$Week), sep='-')
df$Year <- as.factor(df$Year)
library(ggplot2)
library(scales)
ggplot(df, aes(Year_Week,TOTAL.PATIENTS,col=Year, group=Year)) +
geom_line(lwd=2) + scale_y_continuous(labels = comma) +
xlab('Year-Week') +
theme(axis.text.x = element_text(angle=90, vjust = 0.5))
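The zero padding in sprintf('%02d', ...) matters because a character x-axis is ordered alphabetically. A small sketch with made-up rows (the data frame below is hypothetical, not the asker's data) shows the difference:

```r
# Three illustrative rows: week 2 and week 10 of 2009, week 1 of 2010
df <- data.frame(Year = c(2009, 2009, 2010), Week = c(2, 10, 1))

padded   <- paste(df$Year, sprintf("%02d", df$Week), sep = "-")
unpadded <- paste(df$Year, df$Week, sep = "-")

sort(padded)    # weeks stay in calendar order within each year
sort(unpadded)  # "2009-10" sorts before "2009-2"
```

With padded keys, alphabetical order and chronological order coincide, so no explicit factor levels are needed for the axis.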

dividing all values in one column by values in a separate data frame (colnames match)

I need to divide every value in a given column of the first data frame by the value for the corresponding column name in the second data frame. For example, I need to divide every value in the 0 column of demand_copy by 25.5, every value in the 1 column by 13.0, etc., and get an output that has the same structure as the first data frame.
How would one do this in R?
> head(demand_copy)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
1 25 9 14 3 10 10 28 175 406 230 155 151 202 167 179 185 275 298 280 185 110 84 93 51
2 36 17 9 3 2 7 32 88 110 131 89 125 149 165 161 147 178 309 339 201 115 78 67 39
3 10 3 5 10 0 11 15 58 129 110 49 62 62 100 70 73 72 86 116 61 49 37 26 22
4 24 15 10 5 3 4 39 53 108 98 80 118 116 110 135 158 157 196 176 132 118 94 91 102
5 40 45 15 9 16 37 75 205 497 527 362 287 316 353 359 309 365 653 598 468 328 242 168 102
6 0 0 1 2 0 0 11 56 26 12 21 6 27 15 18 5 14 19 25 6 4 0 1 0
> medians
medians
0 25.5
1 13.0
2 8.0
3 4.0
4 4.0
5 10.0
6 38.5
7 106.5
8 205.5
9 164.0
10 111.5
11 130.5
12 160.0
13 164.5
14 170.0
15 183.0
16 202.0
17 282.0
18 256.5
19 178.0
20 109.0
21 80.0
22 60.5
23 41.0
You could use
t(t(demand_copy) / medians[, 1])
or
sweep(demand_copy, 2, medians[, 1], "/")
Notice that the first approach returns a matrix, whereas the second one returns a data frame.
I also suggest storing medians as a vector rather than as a data frame with a single column. Then you could use medians instead of medians[, 1] in the two lines above.
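As a rough illustration of the vector form, here is a sketch on a two-column excerpt of the data above. The names demand and scaled are made up for the example, and indexing medians by column name is an extra safeguard that assumes the vector is named:

```r
# First three rows of columns "0" and "1" from demand_copy
demand <- data.frame(`0` = c(25, 36, 10), `1` = c(9, 17, 3),
                     check.names = FALSE)

# Medians stored as a named vector instead of a one-column data frame
medians <- c(`0` = 25.5, `1` = 13.0)

# Divide each column by its median; picking medians by column name
# guards against the two objects being in a different order
scaled <- sweep(demand, 2, medians[colnames(demand)], "/")
scaled
```

The result is still a data frame with the same row and column names as the input.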

R: Using decompose() in data.table

Hi, I have time series data as follows:
Code Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1001 2009 183 175 151 173 169 169 158 132 91 91 146 114
1001 2010 76 103 130 103 78 72 64 96 89 91 61 62
1001 2011 73 50 99 90 74 112 113 111 112 112 97 137
1001 2012 105 140 160 129 162 161 150 167 151 161 114 120
1001 2013 140 137 153 128 137 137 135 148 116 134 121 95
1001 2014 135 145 144 110 109 130 110 58 100 109 67 66
1015 2009 21 19 19 21 17 29 56 35 46 33 42 45
1015 2010 46 29 55 62 49 48 44 37 39 46 33 39
1015 2011 59 36 52 41 36 38 42 43 37 37 37 35
1015 2012 46 53 55 41 69 41 38 42 37 50 46 48
1015 2013 64 43 58 43 50 39 29 48 45 26 51 55
1015 2014 40 54 64 58 76 59 69 66 57 60 58 55
1031 2009 2408 2370 2799 3460 3263 3102 2769 2749 3018 3283 3343 3193
1031 2010 3130 3069 3776 3348 3341 4129 3920 4131 4152 4044 4241 3522
1031 2011 3454 3768 5217 4242 4624 5105 4712 6064 5546 6049 5957 4670
1031 2012 4959 3554 2163 1274 1452 1248 1303 1278 916 906 522 324
1031 2013 537 442 417 389 469 423 328 246 291 387 201 122
1031 2014 249 203 42 30 29 36 39 16 36 23 11 19
I am trying to find the decomposition of the timeseries by Code column as different Codes have different trends and seasonality during various months.
I tried using data.table but it gives me an error. The following is the code that I am using:
sa_data_ssn_cnt_ts <- data.table(sa_data_ssn_cnt_ts)
sa_data_ssn_cnt_si <- sa_data_ssn_cnt_ts[,list(SI = decompose(sa_data_ssn_cnt_ts, type = "multiplicative", filter = NULL)), by = sa_data_ssn_cnt_ts$site_id]
Error that I get:
Error in decompose(sa_data_ssn_cnt_ts, type = "multiplicative", filter = NULL) : time series has no or less than 2 periods
What is it that I am messing up here?
Is there any other way that I can get the decompositions by Code column?
Thanks a lot for the help.

Time series, change monthly data to quarterly

I have some monthly data like:
1/1/90,620
2/1/90,591
3/1/90,574
4/1/90,542
5/1/90,534
6/1/90,545
#...etc
If I use ts() function, it's easy to make the data into time series structure like:
Jan Feb Mar ... Nov Dec
1990 620 591 574 ... 493 464
1991 100 200 300 ...........
Is there any way to change it into a quarterly layout like this:
1st 2nd 3rd 4th
1990-Q1 620 591 574 464
1990-Q2 100 200 300 400
1990-Q3 ...
1990-Q4 ...
1991-Q1 ...
I tried to change
ts(mydata,start=c(1990,1),frequency=12)
to
ts(mydata,start=c(as.yearqrt("1990-1",1)),frequency=4)
but it does not seem to work.
Could anyone help me? Thank you very much.
monthly <- ts(mydata, start = c(1990, 1), frequency = 12)
quarterly <- aggregate(monthly, nfrequency = 4)
I don't agree with Hyndman on this one. Which is rare as Hyndman can usually do no wrong. However, I can show you his solution doesn't give the OP what he wants.
test<-c(1:100)
test_ts <- ts(test, start=c(2000,1), frequency=12)
test_ts
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2000 1 2 3 4 5 6 7 8 9 10 11 12
2001 13 14 15 16 17 18 19 20 21 22 23 24
2002 25 26 27 28 29 30 31 32 33 34 35 36
2003 37 38 39 40 41 42 43 44 45 46 47 48
2004 49 50 51 52 53 54 55 56 57 58 59 60
2005 61 62 63 64 65 66 67 68 69 70 71 72
2006 73 74 75 76 77 78 79 80 81 82 83 84
2007 85 86 87 88 89 90 91 92 93 94 95 96
2008 97 98 99 100
test_agg <- aggregate(test_ts, nfrequency=4)
test_agg
Qtr1 Qtr2 Qtr3 Qtr4
2000 6 15 24 33
2001 42 51 60 69
2002 78 87 96 105
2003 114 123 132 141
2004 150 159 168 177
2005 186 195 204 213
2006 222 231 240 249
2007 258 267 276 285
2008 294
Well, wait: that first quarter isn't the average of the 3 months, it's the sum (1+2+3 = 6, but you want it to show the mean, 2). So you will need to modify that a tad.
test_agg <- aggregate(test_ts, nfrequency=4)/3
# divisor is (old freq)/(new freq) = 12/4 = 3
Qtr1 Qtr2 Qtr3 Qtr4
2000 2 5 8 11
2001 14 17 20 23
2002 26 29 32 35
2003 38 41 44 47
2004 50 53 56 59
2005 62 65 68 71
2006 74 77 80 83
2007 86 89 92 95
2008 98
Which now shows you the mean of the monthly data written as quarterly.
The divisor is the trick here. If you had weekly (freq=52) and wanted quarterly (freq=4) you'd divide by 52/4=13.
If you want the mean instead of the sum, just add "mean":
quarterly <- aggregate(monthly, nfrequency = 4, FUN = mean)
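The two suggestions above (dividing the summed series by 3, or passing mean as the aggregation function) should agree for complete quarters. A quick check on a toy series, where the values 1:24 are purely illustrative:

```r
# Two years of monthly data, starting January 2000
monthly <- ts(1:24, start = c(2000, 1), frequency = 12)

# Quarterly sums, then divided by 12/4 = 3 months per quarter
q_sum <- aggregate(monthly, nfrequency = 4, FUN = sum)

# Quarterly means computed directly
q_mean <- aggregate(monthly, nfrequency = 4, FUN = mean)

all.equal(as.numeric(q_sum) / 3, as.numeric(q_mean))
q_mean
```

Passing FUN = mean avoids having to work out the divisor by hand when the frequencies change.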

Generating Stacked bar plots

I have a dataframe with 3 columns
$x -- at http://pastebin.com/SGrRUJcA
$y -- at http://pastebin.com/fhn7A1rj
$z -- at http://pastebin.com/VmVvdHEE
that I wish to use to generate a stacked barplot. All of these columns hold integer data. The stacked barplot should have the levels along the x-axis and the data for each level along the y-axis. The stacks should then correspond to each of $x, $y and $z.
UPDATE: I now have the following:
counted <- data.frame(table(myDf$x),variable='x')
counted <- rbind(counted,data.frame(table(myDf$y),variable='y'))
counted <- rbind(counted,data.frame(table(myDf$z),variable='z'))
counted <- counted[counted$Var1!=0,] # to get rid of 0th level??
stackedBp <- ggplot(counted,aes(x=Var1,y=Freq,fill=variable))
stackedBp <- stackedBp+geom_bar(stat='identity')+scale_x_discrete('Levels')+scale_y_continuous('Frequency')
stackedBp
This generates a stacked plot, but two issues remain:
1. The x-axis labeling is not correct. For some reason, it goes: 46, 47, 53, 54, 38, 40.... How can I order it naturally?
2. I also wish to remove the 0th label.
I've tried using +scale_x_discrete(breaks = 0:50, labels = 1:50) but this doesn't work.
NB. axis labeling issue: Dataframe column appears incorrectly sorted
Not completely sure what you want to see, but ?barplot says the first argument, height, must be a vector or matrix. So to fix your initial error:
myDf <- data.frame(x=sample(1:10,100,replace=TRUE),y=sample(11:20,100,replace=TRUE),z=1:10)
barplot(as.matrix(myDf))
If you provide a reproducible example and a more specific description of your desired output you can get a better answer.
Or if I were to guess wildly (and use ggplot)...
library(ggplot2)
myDf <- data.frame(x=sample(1:10,100,replace=TRUE),y=sample(11:20,100,replace=TRUE),z=1:10)
# count each column separately, then stack the counts in long format
myDf.counted <- data.frame(table(myDf$x),variable='x')
myDf.counted <- rbind(myDf.counted,data.frame(table(myDf$y),variable='y'))
myDf.counted <- rbind(myDf.counted,data.frame(table(myDf$z),variable='z'))
ggplot(myDf.counted,aes(x=Var1,y=Freq,fill=variable))+geom_bar(stat='identity')
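On the axis ordering issue raised in the question: rbind appends new factor levels in the order they first appear, so levels that occur only in y or z end up after all the levels of x. One possible fix, sketched here on simulated data (the seed and value ranges are arbitrary assumptions), is to rebuild Var1 with numerically sorted levels before plotting:

```r
set.seed(1)
myDf <- data.frame(x = sample(1:10, 100, replace = TRUE),
                   y = sample(5:15, 100, replace = TRUE))

counted <- rbind(data.frame(table(myDf$x), variable = "x"),
                 data.frame(table(myDf$y), variable = "y"))

# Var1 is a factor whose levels follow first appearance across the rbind calls
levels(counted$Var1)

# Rebuild the factor with levels sorted as numbers rather than as text
lv <- sort(unique(as.integer(as.character(counted$Var1))))
counted$Var1 <- factor(counted$Var1, levels = lv)
levels(counted$Var1)

# The earlier ggplot call can then be used unchanged, e.g.
# ggplot(counted, aes(x = Var1, y = Freq, fill = variable)) +
#   geom_bar(stat = "identity")
```

Because ggplot draws a discrete axis in factor-level order, sorting the levels is usually enough to get the bars into natural order.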
I'm surprised that didn't blow up in your face. Cross-classifying the joint occurrence of three different vectors each of length 35204 would often consume many gigabytes of RAM (and would possibly create lots of useless 0's as you found). Maybe you wanted to examine instead the results of sapply(myDf, table)? This then creates three separate tables of counts.
It's a rather irregular result and would need further work to get it into a matrix form but you might want to consider using densityplot to display the comparative distributions which I think is your goal.
$x
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
126 711 1059 2079 3070 2716 2745 3329 2916 2671 2349 2457 2055 1303 892 692
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
559 799 482 299 289 236 156 145 100 95 121 133 60 34 37 13
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
15 12 56 10 4 7 2 14 13 28 30 20 16 62 74 58
49 50
40 15
$y
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
3069 32 1422 1376 1780 1556 1937 1844 1967 1699 1910 1924 1047 894 975 865
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
635 1002 710 908 979 848 678 908 696 491 417 412 499 411 421 217
32 33 34 35 36 37 39 42 46 47 53 54
265 182 121 47 38 11 2 2 1 1 1 4
$z
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
31 202 368 655 825 1246 900 1136 1098 1570 1613 1144 1107 1037 1239 1372
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
1306 1085 843 867 813 1057 1213 1020 1210 939 725 644 617 602 739 584
32 33 34 35 36 37 38 39 40 41 42 43
650 733 756 681 684 657 544 416 220 48 7 1
The density plot is really simple to create in lattice:
library(lattice)
densityplot(~ x + y + z, data = myDf)
