Count/frequency of observations by month - r

I have a dataset with 34 variables and about 25,000 observations. Each observation refers to a specific incident. It's formatted something like this:
no id date ....
1 363 006 2005-11-05
2 939 012 2007-01-23
3 541 090 2009-06-14
I'm wondering if someone could walk me through how to get the total number of incidents for each month over the 5 years contained in this dataset.

It seems like you just want to count the number of rows for each month. If so, you can just use table():
> #make junk data
> data <- data.frame(no=rnorm(100),id=rnorm(100),date=seq(Sys.Date()-99,Sys.Date(),by="day"))
> table(format(data$date,"%b-%Y"))
Aug-2013 Jul-2013 Nov-2013 Oct-2013 Sep-2013
31 7 1 31 30
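As a minimal sketch of the same idea on data shaped like the question's (the incidents data frame and its values are made up for illustration), a "%Y-%m" key keeps the months in chronological order, whereas "%b-%Y" labels sort alphabetically:

```r
# Hypothetical sample data; only the column name `date` comes from the question.
incidents <- data.frame(
  id   = 1:5,
  date = as.Date(c("2005-11-05", "2005-11-20", "2007-01-23",
                   "2009-06-14", "2009-06-30"))
)

# Count rows per year-month; "YYYY-MM" keys sort chronologically.
monthly <- table(format(incidents$date, "%Y-%m"))

# Optional: turn the table into a plain data frame with readable names.
monthly_df <- as.data.frame(monthly, stringsAsFactors = FALSE)
names(monthly_df) <- c("month", "n")
```

Months with no incidents do not appear in the result at all, which may or may not be what is wanted.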

tapply( dfrm$no, sub(".+-(.+)-.+", "\\1", dfrm$date), sum, na.rm=TRUE)
For month and year in MM-YYYY format use this as your category code:
.... , sub("(.+)-(.+)-.+", "\\2-\\1", data$date) , ....
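The tapply() call above sums the no column within each group. If the goal is a count of rows per month, length can stand in for sum. A small sketch, assuming a data frame named dfrm with invented values:

```r
# Hypothetical data; `dfrm`, `no` and `date` follow the names used above.
dfrm <- data.frame(
  no   = 1:4,
  date = as.Date(c("2005-11-05", "2005-11-20", "2007-01-23", "2009-06-14"))
)

# Build a "MM-YYYY" key from the "YYYY-MM-DD" text of each date.
key <- sub("(.+)-(.+)-.+", "\\2-\\1", as.character(dfrm$date))

# length() counts the rows in each group instead of summing their values.
counts <- tapply(dfrm$no, key, length)
```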

Related

Normalize aggregation results according to number of days per month

I have this table:
Month  nbr_of_days  aggregated_sum
1      25           120
2      28           70
3      30           130
4      31           125
My goal here is to normalize the aggregated sum to an assumed value of 30 (nbr_of_days) per month.
So for the first row, for example, the normalized aggregated_sum would be: 30*120/25=144
How to do this in R?
library(dplyr)
df <- df %>% mutate(normalized_aggregated_sum = 30 * aggregated_sum / nbr_of_days)
Note: while asking the question, I realized how it can be answered
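For reference, the same calculation can be sketched in base R on the table from the question. The name target_days is introduced here only for illustration:

```r
# The table from the question, rebuilt as a data frame.
df <- data.frame(
  Month          = 1:4,
  nbr_of_days    = c(25, 28, 30, 31),
  aggregated_sum = c(120, 70, 130, 125)
)

# Scale each sum to an assumed month length of 30 days.
target_days <- 30
df$normalized_aggregated_sum <- target_days * df$aggregated_sum / df$nbr_of_days
```

The first row gives 30 * 120 / 25 = 144, matching the worked example in the question.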

How to use Sys.Date() To Extract Current Year? [duplicate]

I have manually separated my dataset (discrete_8) into 2 separate datasets (data & data2). data contains the data from the current year (2021), whereas data2 contains data from previous years. Of course, this is based on the current year (2021), but I want to automate the line of code so that when 2022 comes, I will not have to edit the script to change 2021 to 2022. Should I use Sys.Date() to get the current year? How would I go about incorporating Sys.Date() to partition the dataset?
Here is my code so far, where I partition the dataset:
data <- discrete_8 %>% filter(PS_DATE >= as.POSIXct("2021-01-01"))#current year
data2 <- discrete_8 %>% filter(PS_DATE < as.POSIXct("2021-01-01"))#past years
Here is what discrete_8 looks like:
X PS_DATE PS_NAME Control.Parameters.Cell.Return.Flow.Rate Control.Parameters.Harvest.Flow.Rate Control.Parameters.Microsparger.Total.Gas.Flow.Rate
1 0 2014-02-06 123 NA NA 1
2 1 2014-02-07 124 NA NA 1
3 2 2014-02-08 125 NA NA 1
4 3 2014-02-09 126 1.5 NA 1
5 4 2014-02-10 127 1.5 NA 1
6 5 2014-02-11 128 1.5 NA 1
There is a somewhat tedious bug still present in that trunc(Sys.Date(), "year") does not give you Jan 01 of the current year -- it does in R-devel.
But you can build yourself a helper such as this:
> firstDay <- function() { d <- Sys.Date(); d - as.POSIXlt(d)$yday }
> firstDay()
[1] "2021-01-01"
and you can use that to compare. (Also, in the code you posted, as.Date() is simpler as you ignore hours/minutes/seconds here.)
One option is the lubridate::floor_date() function:
lubridate::floor_date(Sys.Date(), unit = "years")
[1] "2021-01-01"
I use substr(Sys.Date(),1,4) to get the current year. In your code you can replace as.POSIXct("2021-01-01") with
as.POSIXct(paste0(substr(Sys.Date(),1,4),"-01-01"))
This will give January 1st of the current year in your datetime format.
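Putting the pieces together, a base-R sketch of the partition might look like the following. It assumes PS_DATE is stored as a Date (the question compares against POSIXct, so a conversion may be needed first), and the small discrete_8 built here is a stand-in with invented rows:

```r
# January 1st of whatever the current year is when the script runs.
year_start <- as.Date(format(Sys.Date(), "%Y-01-01"))

# Hypothetical stand-in for discrete_8; only the column names follow the question.
discrete_8 <- data.frame(
  PS_DATE = c(year_start - 1, year_start, Sys.Date()),
  PS_NAME = c(123, 124, 125)
)

# Current year versus all earlier years.
data  <- discrete_8[discrete_8$PS_DATE >= year_start, ]
data2 <- discrete_8[discrete_8$PS_DATE <  year_start, ]
```

Because year_start is derived from Sys.Date(), nothing in the script has to change when the year rolls over.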

Count number of days since a specific date [duplicate]

I have a dataframe with a column Date in which the observations range from 1974-10-01 to 2014-09-30. I would like to create a new column ("Day") in the dataframe which specifies the number of days since the first day of the period (i.e. 1974-10-01).
I already have the code, and it worked perfectly for a very similar dataframe, but I do not know why it does not work with this second dataframe.
1) The code is the following:
library(lubridate)
ref_date <- dmy("01-10-1974")
df$Day <- as.numeric(difftime(df$Date, ref_date))
2) The first rows of my dataframe are:
Code Area Date Height
1 2001 551.4 1975-04-01 120.209
2 2001 551.4 1976-01-06 158.699
3 2001 551.4 1977-01-21 128.289
4 2001 551.4 1978-02-23 198.254
5 2001 551.4 1979-07-31 131.811
[....]
3) What I obtain with my code (1) is the following:
Code Area Date Day Height
1 2001 551.4 1975-04-01 15724800 120.209
2 2001 551.4 1976-01-06 39916800 158.699
3 2001 551.4 1977-01-21 72835200 128.289
4 2001 551.4 1978-02-23 107222400 198.254
5 2001 551.4 1979-07-31 152409600 131.811
[....]
I spent more than 2 hours trying to work out why, without any clue.
Any suggestion?
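Another option is to state the units explicitly, with the reference date as the second argument so that the result is positive:
difftime(df$Date, ref_date, units = "days")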
Are you looking for something like the example below?
> df <- data.frame(Date = c("1975-04-01"))
> df
        Date
1 1975-04-01
> df$new_col <- as.Date(as.character(df$Date), format="%Y-%m-%d") - as.Date("1974-10-01", format="%Y-%m-%d")
> df
        Date  new_col
1 1975-04-01 182 days
Your code seems to work as long as the Date is a character column.
library(lubridate)
ref_date <- dmy("01-10-1974")
df<- data.frame(Code=2001, Area=551.4, Date=c("1975-04-01","1976-01-06","1977-01-21","1978-02-23","1979-07-31"), Height=c(120.209, 158.699, 128.289, 198.254, 131.811))
df$Day <- as.numeric(difftime(df$Date, ref_date))
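The values in the question's output look like seconds: 15724800 / 86400 = 182, which matches the 182 days shown in the answer above. A sketch that sidesteps the automatic unit choice by asking for days explicitly, using the first three dates from the question:

```r
# First three dates from the question, stored as text.
df <- data.frame(Date = c("1975-04-01", "1976-01-06", "1977-01-21"))
ref_date <- as.Date("1974-10-01")

# Convert to Date and request the difference in days, not seconds.
df$Day <- as.numeric(difftime(as.Date(df$Date), ref_date, units = "days"))
```

Subtracting two Date objects directly, as.numeric(as.Date(df$Date) - ref_date), should give the same result.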

R: How to arrange a daily time series of rows and columns to a single column?

I have a dataset, a daily time series, that I want to arrange into a single column. This is my data:
Date Day 1 Day 2 Day 3 Day 4 Day 5 Day 6 .... Day 31
01/01/1964 0 0 0 0 0 0 3
01/02/1964 NA NA NA NA NA NA ...
01/03/1964 195 445 329 121 61,6 44 ...
01/04/1964 17,2 14,9 17,1 102 54,3 9,33 ...
I want this:
Day1 0
Day2 0
.
.
.
Day31 3
I am having problems because of leap years, which have 366 days. I am trying this, but with no success. Thanks in advance.
EDIT:
I finally got it, but if anyone knows an easier way, using some package or function, I'd be grateful. Otherwise I'll create my own function.
EDIT 2:
Now I have a problem when the data does not start in the first month of a year.
rm(list = ls())
cat("\014")
setwd("C:/")
require(XLConnect)
# Load Streamflow Gauging Station
wb <- loadWorkbook("rainfall.xls")
Data<- readWorksheet(wb, sheet = "rainfall",header = FALSE,region = "B02:AF517")
R<- Data; ##1964 - 2006
sum(R[is.na(R)==FALSE])
# Number of days in each month
Ny<- c(31,28,31,30,31,30,31,31,30,31,30,31); # Normal Year
Ly<- c(31,29,31,30,31,30,31,31,30,31,30,31); # Leap/bissextile Year
S1<- c(1,0,0,0) # Leap year, normal year...
S2<- c(0,1,0,0) # Normal year, leap year...
S3<- c(0,0,1,0) #...
S4<- c(0,0,0,1) #...
Iab<- rep(S1,times=ceiling((nrow(R)/12)/4)); # Index of years
Iab<- Iab[1:(nrow(R)/12)];
Rnew<- matrix(numeric(0), 0,0);
# Arrange the data into a single column
for(i in 1:(nrow(R)/12)){
  for(j in 1:12){
    if(Iab[i]==0){
      Rnew<-c(Rnew, t(R[12*(i-1)+j,1:Ny[j]]))
    }else{
      Rnew<-c(Rnew, t(R[12*(i-1)+j,1:Ly[j]]))
    }
  }
}
sum(R[is.na(R)==FALSE])==sum(Rnew[is.na(Rnew)==FALSE]) # Check that the rearrangement preserved the total
sum(R[is.na(R)==FALSE])
sum(Rnew[is.na(Rnew)==FALSE])
I have a similar problem. However, in a way it is even worse, since I have discharge data (Brazilian ANA station) with several interruptions of several months and years. Vazao01 stands for the discharge on the first day of the month, Vazao02 for the second, and the data frame goes up to Vazao31 (which is obviously NA for months with fewer days, but can also be NA for existing days without a record). The data looks like this and is the data.frame "ANAday":
Date Vazao01 Vazao02 Vazao03...
20 01.05.1989 3463.00 3476.500 3463.000
21 01.06.1989 1867.70 1835.900 1809.400
22 01.07.1989 809.90 798.200 774.800
23 01.08.1989 344.60 308.700 297.900
24 01.11.1989 376.50 388.100 391.000
25 01.12.1989 279.00 289.800 319.500
26 01.01.1990 1715.00 1649.000 1573.200
27 01.02.1990 1035.20 1005.800 972.200
28 01.03.1990 2905.60 2962.100 NA
29 01.06.1990 NA NA NA
30 01.07.1990 297.90 284.400 271.200
31 01.08.1990 228.00 223.200 218.400
32 01.08.1999 NA NA 144.000
33 01.09.1999 20.74 18.620 16.500
34 01.10.1999 119.85 111.450 95.385
35 01.11.1999 11.20 23.705 48.370
36 01.12.1999 160.10 179.000 187.400
37 01.01.2000 843.00 865.300 914.500
38 01.02.2000 1331.30 1368.900 1387.800
39 01.04.2000 1823.60 1808.000 1789.800
40 01.05.2000 1579.00 1524.100 1445.700
I made a list of the months with data
ANAm=as.Date(ANAday[,1], format="%d.%m.%Y")
format(ANAm, format="%Y-%m")
Then I used the "monthDays" function of the Hmisc package to list the number of days in each month
require(Hmisc)
nodm=monthDays(ANAm)
Nodm=cbind.data.frame(ANAm,nodm)
I prepared a data.frame for the data I want to have with 3 columns for "YEAR MONTH", "DAY" and "DISCHARGE"
ANATS=array(NA,c(1,3))
colnames(ANATS)=c("mY","d","Q")
And used a simple "for" loop to extract the data into one column according to the number of days in each month
ANAd=ANAday[,-1]##assumption: ANAd holds only the Vazao01..Vazao31 columns of ANAday
for(i in 1:nrow(Nodm)){
  selectANA=as.vector(ANAd[i,1:(Nodm[i,2]) ])
  selectANA=as.vector(t(selectANA))##to generate a simple vector
  dayANA=c(1:(Nodm[i,2]))
  monthANA=rep(format(as.Date(Nodm[i,1]),format="%Y-%m"),times=as.numeric(Nodm[i,2]))
  ANAts=cbind(monthANA,dayANA,selectANA)
  ANATS=rbind(ANATS,ANAts)
}
ANATS can then be converted into a time series:
combine.date=as.character(paste(ANATS[,1],ANATS[,2],sep="-"))
DATE=as.Date(combine.date, format="%Y-%m-%d")
rownames(ANATS)=as.character(DATE)
ANATS=ANATS[-1,]
require(xts)
ANAXTS=as.xts(ANATS)
Maybe I'm having trouble understanding exactly what you're looking for, but are you trying to transpose the data?
t(data)
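A possibly simpler route than tracking leap years by hand is to let R's date arithmetic decide which day columns are real. The sketch below assumes the first column holds the first day of each month as a Date and the remaining 31 columns hold days 1 to 31. The wide data frame and its values are invented for illustration:

```r
# Hypothetical wide table: one row per month, 31 day-columns.
wide <- data.frame(
  Date = as.Date(c("1964-01-01", "1964-02-01")),
  matrix(1:62, nrow = 2, byrow = TRUE)
)

vals  <- as.matrix(wide[, -1])
day   <- as.vector(col(vals))              # day of month for every cell
start <- wide$Date[as.vector(row(vals))]   # month start for every cell
date  <- start + (day - 1)                 # candidate calendar date

# A cell is real only if its date stays in the row's month (drops e.g. Feb 30).
keep <- format(date, "%Y-%m") == format(start, "%Y-%m")

long <- data.frame(date = date[keep], value = as.vector(vals)[keep])
long <- long[order(long$date), ]
```

Since each row carries its own month, this should also cope with series that start mid-year or have missing months, and leap years need no special casing.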

How to create dataframe subset of the one patient observation with the lowest score on a variable

Hello I have a dataset with multiple patients, each with multiple observations.
I want to select the earliest observation for each patient.
Example:
Patient ID Tender Swollen pt_visit
101 1 10 6
101 6 12 12
101 4 3 18
102 9 5 18
102 3 6 24
103 5 2 12
103 2 1 18
103 8 0 24
The pt_visit variable is the number of months the patient had been in the study at the time of the observation. What I need is the first observation for each patient ID, that is, the row with the lowest pt_visit value.
My desired results:
Patient ID Tender Swollen pt_visit
101 1 10 6
102 9 5 18
103 5 2 12
Assuming your data frame is called df, use the ddply function in the plyr package:
require(plyr)
firstObs <- ddply(df, "PatientID", function(x) x[x$pt_visit == min(x$pt_visit), ])
I would use the data.table package:
library(data.table)
Data <- data.table(Data)
setkey(Data, Patient_ID, pt_visit)
Data[,.SD[1], by=Patient_ID]
Assuming that the Patient ID column is actually named Patient_ID, here are a few approaches. DF is assumed to be the name of the input data frame:
sqldf
library(sqldf)
sqldf("select Patient_ID, Tender, Swollen, min(pt_visit) pt_visit
from DF
group by Patient_ID")
or
sqldf("select *, min(pt_visit) pt_visit from DF group by Patient_ID")[-ncol(DF)]
Note: The above two alternatives use an extension to SQL only found in SQLite so be sure you are using the SQLite backend. (SQLite is the default backend for sqldf unless RH2, RPostgreSQL or RMySQL is loaded.)
subset/ave
subset(DF, ave(pt_visit, Patient_ID, FUN = rank) == 1)
Note: This makes use of the fact that there are no duplicate pt_visit values within the same Patient_ID. If there were we would need to specify the ties= argument to rank.
I almost think there should be a subset parameter named "by" that would do the same as it does in data.table. This is a base solution:
do.call(rbind, lapply( split(dfr, dfr$PatientID),
function(x) x[which.min(x$pt_visit),] ) )
PatientID Tender Swollen pt_visit
101 101 1 10 6
102 102 9 5 18
103 103 5 2 12
I guess you can see why @hadley built 'plyr'.
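One more base-R variant, offered as a sketch: sort by patient and visit, then keep the first row seen for each patient. The column name PatientID is an assumption (the question's header shows "Patient ID" with a space), and the data frame is the question's example typed in by hand:

```r
# The example data from the question; PatientID is an assumed column name.
df <- data.frame(
  PatientID = c(101, 101, 101, 102, 102, 103, 103, 103),
  Tender    = c(1, 6, 4, 9, 3, 5, 2, 8),
  Swollen   = c(10, 12, 3, 5, 6, 2, 1, 0),
  pt_visit  = c(6, 12, 18, 18, 24, 12, 18, 24)
)

# Sort so each patient's earliest visit comes first, then drop later rows.
sorted    <- df[order(df$PatientID, df$pt_visit), ]
first_obs <- sorted[!duplicated(sorted$PatientID), ]
```

If a patient has two rows with the same lowest pt_visit, this keeps only the first of them.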
