Say I have this data frame, df:
Day value
1 2012-06-10 552
2 2012-06-10 4850
3 2012-06-11 4642
4 2012-06-11 4132
5 2012-06-11 4190
6 2012-06-12 4186
7 2012-06-13 1139
8 2012-06-13 490
9 2012-06-13 5156
10 2012-06-13 4430
11 2012-06-13 4447
12 2012-06-14 4256
13 2012-06-14 3856
14 2012-06-14 1163
15 2012-06-17 564
16 2012-06-17 4866
17 2012-06-17 4421
18 2012-06-19 4206
19 2012-06-20 4272
20 2012-06-20 3993
21 2012-06-20 1211
22 2012-07-21 698
23 2012-07-21 5770
24 2012-07-21 5103
25 2012-07-21 775
26 2012-07-21 5140
27 2012-07-22 4868
I would like to create a data.frame, dfvar, that contains the daily variance, something like:
Day Variance
1 2012-06-10 9236402
2 2012-06-11 X
3 2012-06-12 4186
4 2012-06-13 1139
5 2012-06-14 4256
6 2012-06-17 564
7 2012-06-19 4206
8 2012-06-20 4272
9 2012-07-21 698
10 2012-07-22 4868
For example, I computed the first entry by hand:
dfvar$Variance[1] = var(c(552, 4850))
I tried
dfvar <- aggregate(df, by = list(Day), FUN = var)
but this isn't the output I expected. I want the variance of the values within each day only, without the other days.
Any ideas?
Is this what you want?
library(dplyr)
df %>% group_by(Day) %>% dplyr::summarise(Variance = var(value)) # returns NA if a group has only one value
Day Variance
<fctr> <dbl>
1 2012-06-10 9236402.00
2 2012-06-11 77961.33
3 2012-06-12 NA
4 2012-06-13 4615704.30
5 2012-06-14 2829816.33
6 2012-06-17 5596946.33
7 2012-06-19 NA
8 2012-06-20 2864514.33
9 2012-07-21 6422224.70
10 2012-07-22 NA
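If you would rather stay in base R, here is a minimal sketch of the same grouping with aggregate(), which is what the question attempted. It assumes the columns are named Day and value as in the question, and it rebuilds only the first six rows so that it runs on its own:

```r
# First six rows of the question's data, rebuilt here so the sketch is self-contained
df <- data.frame(
  Day   = c("2012-06-10", "2012-06-10", "2012-06-11",
            "2012-06-11", "2012-06-11", "2012-06-12"),
  value = c(552, 4850, 4642, 4132, 4190, 4186)
)

# The formula interface applies var() to `value` within each Day
dfvar <- aggregate(value ~ Day, data = df, FUN = var)
names(dfvar)[2] <- "Variance"
dfvar
```

As with the dplyr version, a day with a single observation gets NA, because the sample variance is undefined for one value.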
Can anyone help me find the approximate area under a curve using Riemann sums in R?
There does not seem to be a package in R that helps with this.
Sample data:
MNo1 X1 Y1 MNo2 X2 Y2
1 2981 -66287 1 595 -47797
1 2981 -66287 1 595 -47797
2 2973 -66087 2 541 -47597
2 2973 -66087 2 541 -47597
3 2963 -65887 3 485 -47397
3 2963 -65887 3 485 -47397
4 2952 -65687 4 430 -47197
4 2952 -65687 4 430 -47197
5 2942 -65486 5 375 -46998
5 2942 -65486 5 375 -46998
6 2935 -65286 6 322 -46798
6 2935 -65286 6 322 -46798
7 2932 -65086 7 270 -46598
7 2932 -65086 7 270 -46598
8 2936 -64886 8 222 -46398
8 2936 -64886 8 222 -46398
9 2948 -64685 9 176 -46198
9 2948 -64685 9 176 -46198
10 2968 -64485 10 135 -45999
10 2968 -64485 10 135 -45999
11 2998 -64284 11 97 -45799
11 2998 -64284 11 97 -45799
12 3035 -64084 12 65 -45599
12 3035 -64084 12 65 -45599
13 3077 -63883 13 37 -45399
13 3077 -63883 13 37 -45399
14 3122 -63683 14 14 -45199
14 3122 -63683 14 14 -45199
15 3168 -63482 15 -5 -44999
15 3168 -63482 15 -5 -44999
16 3212 -63282 16 -20 -44799
16 3212 -63282 16 -20 -44799
17 3250 -63081 17 -31 -44599
17 3250 -63081 17 -31 -44599
18 3280 -62881 18 -38 -44399
18 3280 -62881 18 -38 -44399
19 3301 -62680 19 -43 -44199
19 3301 -62680 19 -43 -44199
20 3313 -62480 20 -45 -43999
Check this demo:
> library(zoo)
> x <- 1:10
> y <- -x^2
> Result <- sum(diff(x) * rollmean(y, 2))
> Result
[1] -334.5
After checking this question, I found that the function trapz() from the pracma package is more efficient:
> library(pracma)
> Result.2 <- trapz(x, y)
> Result.2
[1] -334.5
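Both results above are trapezoid-rule values. If you specifically want left and right Riemann sums, a minimal base R sketch on the same demo data could look like this (the variable names are only illustrative):

```r
x <- 1:10
y <- -x^2
dx <- diff(x)                       # widths of the 9 subintervals

left_sum  <- sum(dx * head(y, -1))  # height taken at each left endpoint
right_sum <- sum(dx * tail(y, -1))  # height taken at each right endpoint
trapezoid <- (left_sum + right_sum) / 2

c(left = left_sum, right = right_sum, trapezoid = trapezoid)
```

The trapezoid value is the average of the left and right sums, so it reproduces the -334.5 shown above.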
I have a weekly dataset of prices of a product. This product has many varieties, each with its own price. I am interested in calculating a weighted price depending on the sales volume of each.
I tried to do this with a loop, but it does not work.
Can someone help me?
Here is a minimal example of my dataset:
nrow week variety price volume
1 10 Semiduro 911 15550
2 10 Semiduro 809 13400
3 10 Semiduro 611 15200
4 10 Semiduro 517 17250
5 10 Semiduro 389 4550
6 10 Semiduro 300 1500
7 10 Paisana(o) 1100 19200
8 10 Paisana(o) 726 22900
9 10 Paisana(o) 452 10450
10 11 Semiduro 1362 13250
11 11 Semiduro 1163 7100
12 11 Semiduro 1032 15580
13 11 Semiduro 768 9700
14 11 Semiduro 703 3670
15 11 Semiduro 550 1450
16 11 Paisana(o) 1825 20200
17 11 Paisana(o) 1402 30650
18 11 Paisana(o) 838 9750
19 12 Semiduro 1050 11350
20 12 Semiduro 878 9200
We could use dplyr:
library(dplyr)
df1 %>%
  group_by(week, variety) %>%
  summarise(wprice = weighted.mean(price, volume))
# week variety wprice
# <int> <chr> <dbl>
#1 10 Paisana(o) 808.1598
#2 10 Semiduro 673.5663
#3 11 Paisana(o) 1452.2574
#4 11 Semiduro 1048.4625
#5 12 Semiduro 972.9976
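To see what weighted.mean() does within one group, here is a small sketch that computes the volume-weighted price by hand for week 10, variety Paisana(o), using the three rows from the question:

```r
# Week 10, variety Paisana(o), copied from the question's data
price  <- c(1100, 726, 452)
volume <- c(19200, 22900, 10450)

# Weighted mean = sum of price * weight divided by sum of weights
manual  <- sum(price * volume) / sum(volume)
builtin <- weighted.mean(price, volume)

c(manual = manual, builtin = builtin)
```

Both values agree with the 808.1598 reported for that group above.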
I'm trying to calculate incidence (with Poisson regression) for a rare type of cancer. My dataset is quite large, consisting of 25,000 observations; I have included only the first 20 rows.
The nrcase variable identifies each individual. As you can see, an individual can have several observations, depending on how many times they have visited the clinic. The variable visit numbers each individual's observations in order, and maxvisit is that individual's total number of observations.
Start is the date on which the individual was first observed in the dataset, and done is the last observed date for each year the patient is in the dataset. I haven't included the censoring variable in this subset (if the patient hasn't suffered an event or quits the study for some reason, the censoring date is 2011-12-31).
Survival is the number of days that a patient has lived since the inclusion date (start).
Event indicates whether the patient suffered an event (no patient in the subset I have provided has).
This is what the dataset looks like:
first <- read.table(header = TRUE, text ="nrcase visit maxvisit done start survival event
7 1 6 31/12/06 04/09/06 118 0
7 2 6 31/12/07 04/09/06 483 0
7 3 6 31/12/08 04/09/06 849 0
7 4 6 31/12/09 04/09/06 1214 0
7 5 6 31/12/10 04/09/06 1579 0
7 6 6 31/12/11 04/09/06 1944 0
20 1 9 31/12/03 24/10/03 68 0
20 2 9 31/12/04 24/10/03 434 0
20 3 9 31/12/05 24/10/03 799 0
20 4 9 31/12/06 24/10/03 1164 0
20 5 9 31/12/07 24/10/03 1529 0
20 6 9 31/12/08 24/10/03 1895 0
20 7 9 31/12/09 24/10/03 2260 0
20 8 9 31/12/10 24/10/03 2625 0
20 9 9 31/12/11 24/10/03 2990 0
87 1 6 31/12/06 17/01/06 348 0
87 2 6 31/12/07 17/01/06 713 0
87 3 6 31/12/08 17/01/06 1079 0
87 4 6 31/12/09 17/01/06 1444 0
87 5 6 31/12/10 17/01/06 1809 0")
This is how I want the dataset to look:
make <- read.table(header=TRUE, text="nrcase visit maxvisit done start survival event startstop
7 1 6 31/12/06 04/09/06 118 0 118
7 2 6 31/12/07 04/09/06 483 0 365
7 3 6 31/12/08 04/09/06 849 0 365
7 4 6 31/12/09 04/09/06 1214 0 365
7 5 6 31/12/10 04/09/06 1579 0 365
7 6 6 31/12/11 04/09/06 1944 0 365
20 1 9 31/12/03 24/10/03 68 0 68
20 2 9 31/12/04 24/10/03 434 0 365
20 3 9 31/12/05 24/10/03 799 0 365
20 4 9 31/12/06 24/10/03 1164 0 365
20 5 9 31/12/07 24/10/03 1529 0 365
20 6 9 31/12/08 24/10/03 1895 0 365
20 7 9 31/12/09 24/10/03 2260 0 365
20 8 9 31/12/10 24/10/03 2625 0 365
20 9 9 31/12/11 24/10/03 2990 0 233
87 1 6 31/12/06 17/01/06 348 0 348
87 2 6 31/12/07 17/01/06 713 0 365
87 3 6 31/12/08 17/01/06 1079 0 365
87 4 6 31/12/09 17/01/06 1444 0 365
87 5 6 31/12/10 17/01/06 1809 0 105")
As you can see, I want to create a new variable called startstop that holds the number of days the patient contributes in each year, one value per observation row.
Startstop will later serve as my offset variable in the glm (Poisson) model.
I appreciate all the help I can get!
I hope this does what you need. I've used lubridate and dplyr because they make things easier, but the same results could be achieved in base R.
There's no need to retain year_done or first_jan_done; they can be removed with %>% select(-year_done, -first_jan_done), but I left them in to make the process clearer.
library(dplyr)
library(lubridate)
make <- first %>%
  mutate(start = dmy(start), done = dmy(done),
         year_done = year(done), first_jan_done = dmy(paste0("01/01/", year_done)),
         days_in_year = as.numeric(done - first_jan_done) + 1
  ) %>% # Deal with observations where the patient entered the study partway into the year
  mutate(days_in_year = ifelse(start > first_jan_done, as.numeric(done - start),
                               days_in_year))
make
nrcase visit maxvisit done start survival event year_done first_jan_done days_in_year
1 7 1 6 2006-12-31 2006-09-04 118 0 2006 2006-01-01 118
2 7 2 6 2007-12-31 2006-09-04 483 0 2007 2007-01-01 365
3 7 3 6 2008-12-31 2006-09-04 849 0 2008 2008-01-01 366
4 7 4 6 2009-12-31 2006-09-04 1214 0 2009 2009-01-01 365
5 7 5 6 2010-12-31 2006-09-04 1579 0 2010 2010-01-01 365
6 7 6 6 2011-12-31 2006-09-04 1944 0 2011 2011-01-01 365
7 20 1 9 2003-12-31 2003-10-24 68 0 2003 2003-01-01 68
8 20 2 9 2004-12-31 2003-10-24 434 0 2004 2004-01-01 366
9 20 3 9 2005-12-31 2003-10-24 799 0 2005 2005-01-01 365
10 20 4 9 2006-12-31 2003-10-24 1164 0 2006 2006-01-01 365
11 20 5 9 2007-12-31 2003-10-24 1529 0 2007 2007-01-01 365
12 20 6 9 2008-12-31 2003-10-24 1895 0 2008 2008-01-01 366
13 20 7 9 2009-12-31 2003-10-24 2260 0 2009 2009-01-01 365
14 20 8 9 2010-12-31 2003-10-24 2625 0 2010 2010-01-01 365
15 20 9 9 2011-12-31 2003-10-24 2990 0 2011 2011-01-01 365
16 87 1 6 2006-12-31 2006-01-17 348 0 2006 2006-01-01 348
17 87 2 6 2007-12-31 2006-01-17 713 0 2007 2007-01-01 365
18 87 3 6 2008-12-31 2006-01-17 1079 0 2008 2008-01-01 366
19 87 4 6 2009-12-31 2006-01-17 1444 0 2009 2009-01-01 365
20 87 5 6 2010-12-31 2006-01-17 1809 0 2010 2010-01-01 365
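Since the answer mentions that base R would also work, here is a minimal sketch of the same logic without dplyr or lubridate. It assumes dates in dd/mm/yy form as in the question and uses only the first three rows for nrcase 7:

```r
# First three rows for nrcase 7, rebuilt so the sketch runs on its own
first <- data.frame(
  nrcase = c(7, 7, 7),
  done   = c("31/12/06", "31/12/07", "31/12/08"),
  start  = c("04/09/06", "04/09/06", "04/09/06")
)

done  <- as.Date(first$done,  format = "%d/%m/%y")
start <- as.Date(first$start, format = "%d/%m/%y")

# 1 January of the year in which each observation ends
first_jan_done <- as.Date(format(done, "%Y-01-01"))

# Entry year: days from start to done; later years: whole calendar year
first$days_in_year <- ifelse(start > first_jan_done,
                             as.numeric(done - start),
                             as.numeric(done - first_jan_done) + 1)
first
```

Note that leap years come out as 366 (2008 here), whereas the desired output in the question shows 365 for every full year.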
My data looks like this:
deptime .count
1 4.5 6285
2 14.5 5901
3 24.5 6002
4 34.5 5401
5 44.5 5080
6 54.5 4567
7 104.5 3162
8 114.5 2784
9 124.5 1950
10 134.5 1800
11 144.5 1630
12 154.5 1076
13 204.5 738
14 214.5 556
15 224.5 544
16 234.5 650
17 244.5 392
18 254.5 309
19 304.5 356
20 314.5 364
My ggplot code:
ggplot(pplot, aes(x=deptime, y=.count)) + geom_bar(stat="identity",fill='#FF9966',width = 5) + labs(x="time", y="count")
output figure
There is a gap at every 100 on the x axis. Does anyone know how to fix it?
Thank you.
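One possible reading, which is an assumption and not something stated in the post, is that deptime encodes clock time as hours followed by minutes (so 104.5 means 1 h 4.5 min). Under that assumption the jump from 54.5 to 104.5 is only ten minutes, and converting to minutes before plotting would close the gap. A sketch on the first eight rows, with the new column name minutes being hypothetical:

```r
# Assumption: deptime is clock time written as HMM.m (104.5 means 1 h 4.5 min)
pplot <- data.frame(
  deptime = c(4.5, 14.5, 24.5, 34.5, 44.5, 54.5, 104.5, 114.5),
  .count  = c(6285, 5901, 6002, 5401, 5080, 4567, 3162, 2784)
)

# Convert to minutes since midnight so the axis is continuous
pplot$minutes <- (pplot$deptime %/% 100) * 60 + pplot$deptime %% 100

# Draw the plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(pplot, aes(x = minutes, y = .count)) +
    geom_bar(stat = "identity", fill = "#FF9966", width = 5) +
    labs(x = "time (minutes)", y = "count")
  print(p)
}
```

If deptime means something else, this conversion does not apply.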
I maintain my journal electronically and I'm trying to get an idea of how consistent I've been with my journal writing over the last few months. I have the following data file, which shows how many journal entries (Entry Count) and words (Word Count) I recorded over the preceding 30-day period.
Date Entry Count Word Count
2010-08-25 22 4205
2010-08-26 21 4012
2010-08-27 20 3865
2010-08-28 20 4062
2010-08-29 19 3938
2010-08-30 18 3759
2010-08-31 17 3564
2010-09-01 17 3564
2010-09-02 16 3444
2010-09-03 17 3647
2010-09-04 17 3617
2010-09-05 16 3390
2010-09-06 15 3251
2010-09-07 15 3186
2010-09-08 15 3186
2010-09-09 16 3414
2010-09-10 15 3228
2010-09-11 14 3006
2010-09-12 13 2769
2010-09-13 13 2781
2010-09-14 12 2637
2010-09-15 13 2774
2010-09-16 13 2808
2010-09-17 12 2732
2010-09-18 12 2664
2010-09-19 13 2931
2010-09-20 13 2751
2010-09-21 13 2710
2010-09-22 14 2950
2010-09-23 14 2834
2010-09-24 14 2834
2010-09-25 14 2834
2010-09-26 14 2834
2010-09-27 14 2834
2010-09-28 14 2543
2010-09-29 14 2543
2010-09-30 15 2884
2010-10-01 16 3105
2010-10-02 16 3105
2010-10-03 16 3105
2010-10-04 15 2902
2010-10-05 14 2805
2010-10-06 14 2805
2010-10-07 14 2805
2010-10-08 14 2812
2010-10-09 15 2895
2010-10-10 14 2667
2010-10-11 15 2876
2010-10-12 16 2938
2010-10-13 17 3112
2010-10-14 16 2894
2010-10-15 16 2894
2010-10-16 16 2923
2010-10-17 15 2722
2010-10-18 15 2722
2010-10-19 14 2544
2010-10-20 13 2277
2010-10-21 13 2329
2010-10-22 12 2132
2010-10-23 11 1892
2010-10-24 10 1764
2010-10-25 10 1764
2010-10-26 10 1764
2010-10-27 10 1764
2010-10-28 10 1764
2010-10-29 9 1670
2010-10-30 10 1969
2010-10-31 10 1709
2010-11-01 10 1624
2010-11-02 11 1677
2010-11-03 11 1677
2010-11-04 11 1677
2010-11-05 11 1677
2010-11-06 12 1786
2010-11-07 12 1786
2010-11-08 11 1529
2010-11-09 10 1446
2010-11-10 11 1682
2010-11-11 11 1540
2010-11-12 11 1673
2010-11-13 11 1765
2010-11-14 12 1924
2010-11-15 13 2276
2010-11-16 12 2110
2010-11-17 13 2524
2010-11-18 14 2615
2010-11-19 14 2615
2010-11-20 15 2706
2010-11-21 14 2549
2010-11-22 15 2647
2010-11-23 16 2874
2010-11-24 16 2874
2010-11-25 16 2874
2010-11-26 17 3249
2010-11-27 18 3421
2010-11-28 18 3421
2010-11-29 19 3647
I'm trying to plot this data with R to get a graphical representation of my journal-writing consistency. I load it into R with the following command.
d <- read.table("journal.txt", header=T, sep="\t")
I can then graph the data with the following command.
plot(seq(from=1, to=length(d$Entry.Count), by=1), d$Entry.Count, type="o", ylim=c(0, max(d$Entry.Count)))
However, in this plot the X axis is just a number, not a date. I tried adjusting the command to show dates on the X axis like this.
plot(d$Date, d$Entry.Count, type="o", ylim=c(0, max(d$Entry.Count)))
However, not only does the plot look strange, but the labels on the X axis are not very helpful. What is the best way to plot this data so that I can clearly associate dates with points on the plotted curve?
Based on your code the dates are just characters.
Try converting them to Dates:
plot(as.Date(d$Date), d$Entry.Count)
This is simple in your case because "%Y-%m-%d" is the default format for as.Date. See ?strptime for more general format options.
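A minimal sketch of the conversion on three rows of the journal data, with one extra line showing how a non-default format would be parsed (the dd/mm/yyyy string is a made-up example, not from the question):

```r
# Three rows of the journal data, rebuilt so the sketch is self-contained
d <- data.frame(
  Date        = c("2010-08-25", "2010-08-26", "2010-08-27"),
  Entry.Count = c(22, 21, 20)
)

# ISO-style strings parse with the default format
d$Date <- as.Date(d$Date)

# Hypothetical non-default input: pass the format explicitly
other <- as.Date("25/08/2010", format = "%d/%m/%Y")

plot(d$Date, d$Entry.Count, type = "o",
     ylim = c(0, max(d$Entry.Count)),
     xlab = "Date", ylab = "Entry count")
```

Once the column has class Date, plot() picks a date axis on its own.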
You could use zoo. ?plot.zoo has several examples of how to create custom axis labels.
library(zoo)
z <- zoo(d[, -1], as.Date(d[, 1]))
plot(z)
# Example of custom axis labels
plot(z$Entry.Count, screen = 1, col = 1:2, xaxt = "n")
ix <- seq(1, length(time(z)), 3)
axis(1, at = time(z)[ix], labels = format(time(z)[ix],"%b-%d"), cex.axis = 0.7)
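The same labelling idea also works without zoo. A rough base R sketch, assuming the dates are already of class Date and using the first twelve entry counts from the question:

```r
# First twelve days of the journal data
dates  <- seq(as.Date("2010-08-25"), by = "day", length.out = 12)
counts <- c(22, 21, 20, 20, 19, 18, 17, 17, 16, 17, 17, 16)

# Suppress the default x axis, then label every third date
plot(dates, counts, type = "o", xaxt = "n",
     xlab = "Date", ylab = "Entry count")
ix   <- seq(1, length(dates), by = 3)
labs <- format(dates[ix], "%b-%d")
axis(1, at = dates[ix], labels = labs, cex.axis = 0.7)
```

Changing the by argument in seq() controls how dense the labels are.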