How to determine whether a date falls in the dry or rainy season in a temporal analysis using R? - r

I have temporal temperature data, and I would like to determine whether each date belongs to the dry or the rainy season.
In my country the dry season runs from May to October, and the rainy season runs from November to April.
Is it possible to create a column with this information using dplyr or another package?
My data frame:
sample_station <-c('A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B','B','C','C','C','C','C','C','C','C','C','C','A','B','C','A','B','C')
Date_dmy <-c('01/01/2000','08/08/2000','16/03/2001','22/09/2001','01/06/2002','05/01/2002','26/01/2002','16/02/2002','09/03/2002','30/03/2002','20/04/2002','04/01/2000','11/08/2000','19/03/2001','25/09/2001','04/06/2002','08/01/2002','29/01/2002','19/02/2002','12/03/2002','13/09/2001','08/01/2000','15/08/2000','23/03/2001','29/09/2001','08/06/2002','12/01/2002','02/02/2002','23/02/2002','16/03/2002','06/04/2002','01/02/2000','01/02/2000','01/02/2000','02/11/2001','02/11/2001','02/11/2001')
Temperature <-c(17,20,24,19,17,19,23,26,19,19,21,15,23,18,22,22,23,18,19,26,21,22,23,27,19,19,21,23,24,25,26,29,30,21,25,24,23)
df<-data.frame(sample_station, Date_dmy, Temperature)

One option is to extract the month after converting to Date class, then use case_when to return 'dry' or 'rainy' based on the values of the 'Month' column:
library(dplyr)
library(lubridate)
df <- df %>%
  mutate(Month = month(dmy(Date_dmy)),
         categ = case_when(Month %in% 5:10 ~ 'dry', TRUE ~ 'rainy'))
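For readers who prefer to avoid extra packages, the same classification can be sketched in base R. This is a minimal illustration on three dates taken from the question; the variable names `dates`, `month_num` and `season` are made up for the example.

```r
# Parse day/month/year strings into Date objects
dates <- as.Date(c("01/01/2000", "08/08/2000", "02/11/2001"), format = "%d/%m/%Y")

# Extract the month as an integer (1-12)
month_num <- as.integer(format(dates, "%m"))

# May (5) to October (10) is the dry season; every other month is rainy
season <- ifelse(month_num %in% 5:10, "dry", "rainy")
season
# [1] "rainy" "dry"   "rainy"
```

The month boundaries live in a single place (`5:10`), so a different regional definition only requires changing that vector.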

Similar to akrun's solution but with ifelse:
library(dplyr)
library(lubridate)
df <- df %>%
  mutate(Month = month(dmy(Date_dmy)),
         categ = ifelse(Month %in% 5:10, 'dry', 'rainy'))
Output:
sample_station Date_dmy Temperature Month categ
1 A 01/01/2000 17 1 rainy
2 A 08/08/2000 20 8 dry
3 A 16/03/2001 24 3 rainy
4 A 22/09/2001 19 9 dry
5 A 01/06/2002 17 6 dry
6 A 05/01/2002 19 1 rainy
7 A 26/01/2002 23 1 rainy
8 A 16/02/2002 26 2 rainy
9 A 09/03/2002 19 3 rainy
10 A 30/03/2002 19 3 rainy
11 A 20/04/2002 21 4 rainy
12 B 04/01/2000 15 1 rainy
13 B 11/08/2000 23 8 dry
14 B 19/03/2001 18 3 rainy
15 B 25/09/2001 22 9 dry
16 B 04/06/2002 22 6 dry
17 B 08/01/2002 23 1 rainy
18 B 29/01/2002 18 1 rainy
19 B 19/02/2002 19 2 rainy
20 B 12/03/2002 26 3 rainy
21 B 13/09/2001 21 9 dry
22 C 08/01/2000 22 1 rainy
23 C 15/08/2000 23 8 dry
24 C 23/03/2001 27 3 rainy
25 C 29/09/2001 19 9 dry
26 C 08/06/2002 19 6 dry
27 C 12/01/2002 21 1 rainy
28 C 02/02/2002 23 2 rainy
29 C 23/02/2002 24 2 rainy
30 C 16/03/2002 25 3 rainy
31 C 06/04/2002 26 4 rainy
32 A 01/02/2000 29 2 rainy
33 B 01/02/2000 30 2 rainy
34 C 01/02/2000 21 2 rainy
35 A 02/11/2001 25 11 rainy
36 B 02/11/2001 24 11 rainy
37 C 02/11/2001 23 11 rainy

Related

Count number of instances above a varying threshold

I have the 0.95 percentile threshold for temperature for each country. In the example below a week is 4 days. I want to count in a new vector/single-column-dataframe how many days each individual country's temperature is over that country's threshold on a weekly basis.
The country 95% percentile temperatures are:
q95 <- c(26,21,22,20,23)
DailyTempCountry <- data.frame(Date = c("W1D1","W1D2","W1D3","W1D4","W2D1","W2D2","W2D3","W2D4",
"W1D1","W1D2","W1D3","W1D4","W2D1","W2D2","W2D3","W2D4",
"W1D1","W1D2","W1D3","W1D4","W2D1","W2D2","W2D3","W2D4",
"W1D1","W1D2","W1D3","W1D4","W2D1","W2D2","W2D3","W2D4",
"W1D1","W1D2","W1D3","W1D4","W2D1","W2D2","W2D3","W2D4"),
Country = c("AL","AL", "AL", "AL","AL","AL", "AL", "AL",
"BE","BE", "BE", "BE", "BE","BE", "BE", "BE",
"CA","CA", "CA", "CA","CA","CA", "CA", "CA",
"DE","DE", "DE", "DE","DE","DE", "DE", "DE",
"UK","UK", "UK", "UK","UK","UK", "UK", "UK"),
DailyTemp = c(27,25,20,22,20,20,27,27,
24,22,23,18,17,19,20,16,
23,23,23,23,27,26,20,26,
19,18,17,19,16,15,19,18,
20,24,24,20,19,25,19,25))
DailyTempCountry
Date Country DailyTemp
1 W1D1 AL 27
2 W1D2 AL 25
3 W1D3 AL 20
4 W1D4 AL 22
5 W2D1 AL 20
6 W2D2 AL 20
7 W2D3 AL 27
8 W2D4 AL 27
9 W1D1 BE 24
10 W1D2 BE 22
11 W1D3 BE 23
12 W1D4 BE 18
13 W2D1 BE 17
14 W2D2 BE 19
15 W2D3 BE 20
16 W2D4 BE 16
17 W1D1 CA 23
18 W1D2 CA 23
19 W1D3 CA 23
20 W1D4 CA 23
21 W2D1 CA 27
22 W2D2 CA 26
23 W2D3 CA 20
24 W2D4 CA 26
25 W1D1 DE 19
26 W1D2 DE 18
27 W1D3 DE 17
28 W1D4 DE 19
29 W2D1 DE 16
30 W2D2 DE 15
31 W2D3 DE 19
32 W2D4 DE 18
33 W1D1 UK 20
34 W1D2 UK 24
35 W1D3 UK 24
36 W1D4 UK 20
37 W2D1 UK 19
38 W2D2 UK 25
39 W2D3 UK 19
40 W2D4 UK 25
What I want is a vector/column that counts the number of days in that week above the country's threshold like this:
DaysInWeekAboveQ95 <- c(1,2,3,0,4,3,0,0,2,2)
df_right <- data.frame(Week = c("W1","W2","W1","W2","W1","W2","W1","W2","W1","W2"),
DaysInWeekAboveQ95 = c(1,2,3,0,4,3,0,0,2,2))
Week DaysInWeekAboveQ95
1 W1 1
2 W2 2
3 W1 3
4 W2 0
5 W1 4
6 W2 3
7 W1 0
8 W2 0
9 W1 2
10 W2 2
The q95% vector was
q95 <- c(26,21,22,20,23)
So in the first week AL has 1 instance above its threshold value of 26. UK has 2 instances above 23 (UK's threshold) in the second week. And so on for every country and every week.
I handled a similar problem where the threshold did not vary by country but was a constant 30 degrees (I divide by 7 because there are seven days in a week):
DaysAbove30perWeek <- as.data.frame(tapply(testdlong$value > 30,
ceiling(seq(nrow(testdlong))/7),sum))
Maybe a solution is to loop over countries? However, I can't figure out how to incorporate the specific loop. Other solutions are welcome.
In the revised scenario you also need to compute a new column for the week:
q95 <- c(26,21,22,20,23)
c_q95 <- data.frame(Country = unique(DailyTempCountry$Country),
threshold = q95)
library(dplyr)
DailyTempCountry %>% left_join(c_q95, by = 'Country') %>%
group_by(Country, Week = substr(Date, 1, 2)) %>%
summarise(days = sum(DailyTemp > threshold), .groups = 'drop')
# A tibble: 10 x 3
Country Week days
<chr> <chr> <int>
1 AL W1 1
2 AL W2 2
3 BE W1 3
4 BE W2 0
5 CA W1 4
6 CA W2 3
7 DE W1 0
8 DE W2 0
9 UK W1 2
10 UK W2 2
Created on 2021-05-05 by the reprex package (v2.0.0)
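The join is not the only way to attach a per-country threshold. As a rough base R sketch, the thresholds can be stored in a named vector and looked up by country code. The example below uses only the AL and BE rows from the question; the names `d`, `above` and `res` are illustrative.

```r
# Thresholds keyed by country code (subset of the question's q95)
q95 <- c(AL = 26, BE = 21)

days <- c("W1D1", "W1D2", "W1D3", "W1D4", "W2D1", "W2D2", "W2D3", "W2D4")
d <- data.frame(
  Date      = rep(days, 2),
  Country   = rep(c("AL", "BE"), each = 8),
  DailyTemp = c(27, 25, 20, 22, 20, 20, 27, 27,
                24, 22, 23, 18, 17, 19, 20, 16)
)

# The week is the first two characters of the label, e.g. "W1"
d$Week <- substr(d$Date, 1, 2)

# Look up each row's threshold by name, then flag days above it
d$above <- d$DailyTemp > q95[as.character(d$Country)]

# Count flagged days per country and week
res <- aggregate(above ~ Country + Week, data = d, FUN = sum)
res <- res[order(res$Country, res$Week), ]
res$above
# [1] 1 2 3 0
```

The `as.character()` call guards against `Country` being stored as a factor, in which case indexing would use integer codes rather than names.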
The OP has said that the date variable is in a different format from the one given in the sample data:
time <- as.character(20000101:20000130)
> time
[1] "20000101" "20000102" "20000103" "20000104" "20000105" "20000106" "20000107" "20000108" "20000109" "20000110"
[11] "20000111" "20000112" "20000113" "20000114" "20000115" "20000116" "20000117" "20000118" "20000119" "20000120"
[21] "20000121" "20000122" "20000123" "20000124" "20000125" "20000126" "20000127" "20000128" "20000129" "20000130"
library(lubridate)
time <- ymd(time)
# Either ISO week
isoweek(time)
# or week
week(time)
> isoweek(time)
[1] 52 52 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4
> # or week
> week(time)
[1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5
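The two results differ because the numbering schemes differ: ISO weeks start on Monday and may belong to the previous year, while week() counts complete seven-day blocks since January 1. A small base R sketch of both ideas follows, assuming the platform's format() supports the "%V" conversion (it does in current R releases); `iso_week` and `simple_week` are illustrative names.

```r
d <- as.Date(c("2000-01-01", "2000-01-03", "2000-01-08", "2000-01-10"))

# ISO 8601 week number: 1 January 2000 is a Saturday, so it still
# belongs to week 52 of 1999
iso_week <- as.integer(format(d, "%V"))
iso_week
# [1] 52  1  1  2

# Seven-day blocks counted from 1 January: days 1-7 are week 1, 8-14 week 2, ...
simple_week <- (as.integer(format(d, "%j")) - 1L) %/% 7L + 1L
simple_week
# [1] 1 1 2 2
```

Which one is appropriate depends on whether weeks must align with calendar weeks (ISO) or simply partition the year into equal blocks.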

How can I plot a histogram of crime type vs HOURS in R

I have a big dataset with different variables, and I want to make a histogram of type of crime against HOURS. How can I do that in R?
DATE TIME PLACE ZONE TYPE.OF.CRIME WEEK
1 2011/01/01 23:00 KIEPIES CLUB <NA> ARMED ROBBERY 1
2 2011/01/03 10:00 AUSSPANNPLATZ Zone 14 ARMED ROBBERY 1
3 2011/01/07 14:00 UNAM BUSHES Zone 16 ARMED ROBBERY 1
4 2011/01/08 21:34 TOTAL SERV. STATION, KHOMASDAL Zone 9 ARMED ROBBERY 1
5 2011/01/15 <NA> WOODPALM STR 625 Zone 11 ARMED ROBBERY 2
6 2011/01/03 14:03 C KANDOVAZU STR Zone 5 ASSAULT GBH 1
HOUR day month year HOURS
1 23 1 1 2011 23
2 10 3 1 2011 10
3 14 7 1 2011 14
4 21 8 1 2011 21
5 <NA> 15 1 2011 <NA>
6 14 3 1 2011 14
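library(ggplot2)
# geom_histogram() bins a single variable, so put HOURS on x and colour the bars by crime type
ggplot(df, aes(x = as.numeric(as.character(HOURS)), fill = TYPE.OF.CRIME)) +
  geom_histogram(binwidth = 1)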
Something like this should work.
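Before plotting, it can help to look at the counts a histogram would display. The sketch below rebuilds the six rows shown in the question (the data frame name `crimes` is made up) and cross-tabulates crime type against hour with base R's table(), which drops the missing hour by default.

```r
crimes <- data.frame(
  TYPE.OF.CRIME = c(rep("ARMED ROBBERY", 5), "ASSAULT GBH"),
  HOURS         = c(23, 10, 14, 21, NA, 14)
)

# Rows are crime types, columns are hours; each cell is a count of incidents
counts <- table(crimes$TYPE.OF.CRIME, crimes$HOURS)
counts
```

Each cell of `counts` corresponds to the height of one bar segment in a histogram of hours split by crime type.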

A running sum for daily data that resets when month turns

I have a 2 column table (tibble), made up of a date object and a numeric variable. There is maximum one entry per day but not every day has an entry (ie date is a natural primary key). I am attempting to do a running sum of the numeric column along with dates but with the running sum resetting when the month turns (the data is sorted by ascending date). I have replicated what I want to get as a result below.
Date score monthly.running.sum
10/2/2019 7 7
10/9/2019 6 13
10/16/2019 12 25
10/23/2019 2 27
10/30/2019 13 40
11/6/2019 2 2
11/13/2019 4 6
11/20/2019 15 21
11/27/2019 16 37
12/4/2019 4 4
12/11/2019 24 28
12/18/2019 28 56
12/25/2019 8 64
1/1/2020 1 1
1/8/2020 15 16
1/15/2020 9 25
1/22/2020 8 33
It looks like the package "runner" might be suited to this, but I don't really understand how to instruct it. I know I could use a join operation plus a group_by in dplyr, but the data set is very large and doing so would be wildly inefficient. I could also iterate through the list manually with a loop, but that seems inelegant. The last option I can think of is selecting a unique vector of yearmon objects, cutting the original list into many shorter lists, and running a plain cumsum on each, but that also feels suboptimal.
I am sure this is not the first time someone has had to do this, and given how many tools there are in the tidyverse, I think I just need help finding the right one. I am looking for a tool rather than using one of the methods described above (which would take less time than writing this post) because this code needs to be very readable by an audience that is less comfortable with code.
We can also use data.table
library(data.table)
setDT(df)[, Date := as.IDate(Date, "%m/%d/%Y")
][, monthly.running.sum := cumsum(score),format(Date, "%Y-%m")][]
# Date score monthly.running.sum
# 1: 2019-10-02 7 7
# 2: 2019-10-09 6 13
# 3: 2019-10-16 12 25
# 4: 2019-10-23 2 27
# 5: 2019-10-30 13 40
# 6: 2019-11-06 2 2
# 7: 2019-11-13 4 6
# 8: 2019-11-20 15 21
# 9: 2019-11-27 16 37
#10: 2019-12-04 4 4
#11: 2019-12-11 24 28
#12: 2019-12-18 28 56
#13: 2019-12-25 8 64
#14: 2020-01-01 1 1
#15: 2020-01-08 15 16
#16: 2020-01-15 9 25
#17: 2020-01-22 8 33
data
df <- structure(list(Date = c("10/2/2019", "10/9/2019", "10/16/2019",
"10/23/2019", "10/30/2019", "11/6/2019", "11/13/2019", "11/20/2019",
"11/27/2019", "12/4/2019", "12/11/2019", "12/18/2019", "12/25/2019",
"1/1/2020", "1/8/2020", "1/15/2020", "1/22/2020"), score = c(7L,
6L, 12L, 2L, 13L, 2L, 4L, 15L, 16L, 4L, 24L, 28L, 8L, 1L, 15L,
9L, 8L)), row.names = c(NA, -17L), class = "data.frame")
Using lubridate, you can extract the month and year values from the date, group_by those values, and then perform the cumulative sum as follows:
library(lubridate)
library(dplyr)
df %>% mutate(Month = month(mdy(Date)),
Year = year(mdy(Date))) %>%
group_by(Month, Year) %>%
mutate(SUM = cumsum(score))
# A tibble: 17 x 6
# Groups: Month, Year [4]
Date score monthly.running.sum Month Year SUM
<chr> <int> <int> <int> <int> <int>
1 10/2/2019 7 7 10 2019 7
2 10/9/2019 6 13 10 2019 13
3 10/16/2019 12 25 10 2019 25
4 10/23/2019 2 27 10 2019 27
5 10/30/2019 13 40 10 2019 40
6 11/6/2019 2 2 11 2019 2
7 11/13/2019 4 6 11 2019 6
8 11/20/2019 15 21 11 2019 21
9 11/27/2019 16 37 11 2019 37
10 12/4/2019 4 4 12 2019 4
11 12/11/2019 24 28 12 2019 28
12 12/18/2019 28 56 12 2019 56
13 12/25/2019 8 64 12 2019 64
14 1/1/2020 1 1 1 2020 1
15 1/8/2020 15 16 1 2020 16
16 1/15/2020 9 25 1 2020 25
17 1/22/2020 8 33 1 2020 33
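Grouping by both month and year matters as soon as the data spans more than twelve months. The following base R sketch, on made-up dates and scores, shows what would go wrong if only the month were used as the grouping key.

```r
# Hypothetical data: two January 2019 entries and one January 2020 entry
dates <- as.Date(c("2019-01-10", "2019-01-20", "2020-01-05"))
score <- c(1, 2, 4)

# Month only: January 2020 is wrongly added to the January 2019 total
by_month_only <- ave(score, format(dates, "%m"), FUN = cumsum)
by_month_only
# [1] 1 3 7

# Year and month: the running sum restarts in January 2020
by_year_month <- ave(score, format(dates, "%Y-%m"), FUN = cumsum)
by_year_month
# [1] 1 3 4
```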
An alternative is to use the floor_date function to convert each date to the first day of its month and then calculate the cumulative sum:
library(lubridate)
library(dplyr)
df %>% mutate(Floor = floor_date(mdy(Date), unit = "month")) %>%
group_by(Floor) %>%
mutate(SUM = cumsum(score))
# A tibble: 17 x 5
# Groups: Floor [4]
Date score monthly.running.sum Floor SUM
<chr> <int> <int> <date> <int>
1 10/2/2019 7 7 2019-10-01 7
2 10/9/2019 6 13 2019-10-01 13
3 10/16/2019 12 25 2019-10-01 25
4 10/23/2019 2 27 2019-10-01 27
5 10/30/2019 13 40 2019-10-01 40
6 11/6/2019 2 2 2019-11-01 2
7 11/13/2019 4 6 2019-11-01 6
8 11/20/2019 15 21 2019-11-01 21
9 11/27/2019 16 37 2019-11-01 37
10 12/4/2019 4 4 2019-12-01 4
11 12/11/2019 24 28 2019-12-01 28
12 12/18/2019 28 56 2019-12-01 56
13 12/25/2019 8 64 2019-12-01 64
14 1/1/2020 1 1 2020-01-01 1
15 1/8/2020 15 16 2020-01-01 16
16 1/15/2020 9 25 2020-01-01 25
17 1/22/2020 8 33 2020-01-01 33
A base R alternative:
df$Date <- as.Date(df$Date, "%m/%d/%Y")
df$monthly.running.sum <- with(df, ave(score, format(Date, "%Y-%m"),FUN = cumsum))
df
# Date score monthly.running.sum
#1 2019-10-02 7 7
#2 2019-10-09 6 13
#3 2019-10-16 12 25
#4 2019-10-23 2 27
#5 2019-10-30 13 40
#6 2019-11-06 2 2
#7 2019-11-13 4 6
#8 2019-11-20 15 21
#9 2019-11-27 16 37
#10 2019-12-04 4 4
#11 2019-12-11 24 28
#12 2019-12-18 28 56
#13 2019-12-25 8 64
#14 2020-01-01 1 1
#15 2020-01-08 15 16
#16 2020-01-15 9 25
#17 2020-01-22 8 33
The yearmon class represents year/month objects so just convert the dates to yearmon and accumulate by them using this one-liner:
library(zoo)
transform(df, run.sum = ave(score, as.yearmon(Date, "%m/%d/%Y"), FUN = cumsum))
giving:
Date score run.sum
1 10/2/2019 7 7
2 10/9/2019 6 13
3 10/16/2019 12 25
4 10/23/2019 2 27
5 10/30/2019 13 40
6 11/6/2019 2 2
7 11/13/2019 4 6
8 11/20/2019 15 21
9 11/27/2019 16 37
10 12/4/2019 4 4
11 12/11/2019 24 28
12 12/18/2019 28 56
13 12/25/2019 8 64
14 1/1/2020 1 1
15 1/8/2020 15 16
16 1/15/2020 9 25
17 1/22/2020 8 33

How to lump sum the number of days in data spanning several years?

I have data similar to this. I would like to lump sum the days (I'm not sure whether "lump sum" is the right term) and create a new column "date" that holds the running day count across the 3 years of data, in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote this code, but the result is wrong and the code is too long. It doesn't count February correctly, since February has only 28 days. Is there a shorter way?
cday <- function(data,syear=2011,smonth=1,sday=1){
year <- data[1]
month <- data[2]
day <- data[3]
cmonth <- c(0,31,28,31,30,31,30,31,31,30,31,30,31)
date <- (year-syear)*365+sum(cmonth[1:month])+day
for(yr in c(syear:year)){
if(yr==year){
if(yr%%4==0&&month>2){date<-date+1}
}else{
if(yr%%4==0){date<-date+1}
}
}
return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated, look for tools to do this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
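To see why the Date class removes the need for a hand-written month-length table, here is a short base R check of how date subtraction treats February in a leap year and in a common year. The variable names are illustrative.

```r
# Length of February, measured as the gap between the 1st of consecutive months
feb_2012 <- as.integer(as.Date("2012-03-01") - as.Date("2012-02-01"))
feb_2013 <- as.integer(as.Date("2013-03-01") - as.Date("2013-02-01"))
c(feb_2012, feb_2013)
# [1] 29 28

# Day of the year for 31 December in a leap year
last_day_2012 <- as.integer(format(as.Date("2012-12-31"), "%j"))
last_day_2012
# [1] 366
```

Leap years are handled by the date arithmetic itself, so no `yr %% 4` style correction is required.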

Replicating table in R with change in one column

I have this table in R :
Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1
I want to replicate this same table 4 times. All values should stay the same except the Month column, which needs to be incremented by 1 each time. The final table should look like this:
Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1
John 8 2017 8 16
Carol 90 2017 8 30
Bug 9 2017 8 1
John 8 2017 9 16
Carol 90 2017 9 30
Bug 9 2017 9 1
John 8 2017 10 16
Carol 90 2017 10 30
Bug 9 2017 10 1
John 8 2017 11 16
Carol 90 2017 11 30
Bug 9 2017 11 1
Please point out how to do this efficiently in R. Many thanks!
If this is your dataframe:
df = read.table(text = "Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1", header = TRUE)
Then this is your dataframe repeating:
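df2 = df[rep(rownames(df), 4), ]
And this is it again, but with the months incremented (each = nrow(df) applies one offset to every row of a copy):
df2$Month = df2$Month + rep(0:3, each = nrow(df))
In the more general case:
m = 4 # <-- number of copies desired (use 5 to reproduce the 15-row table in the question)
df2 = df[rep(rownames(df), m), ]
df2$Month = df2$Month + rep(0:(m - 1), each = nrow(df))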
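As a self-contained check of the replication idea, the sketch below rebuilds the three-row table and produces the 15-row result requested in the question (the original plus four incremented copies). It assumes the copies are stacked block by block, so the month offset has to be repeated with `each = nrow(df)`; `m` is the total number of blocks.

```r
df <- data.frame(
  Name  = c("John", "Carol", "Bug"),
  ID    = c(8, 90, 9),
  Year  = 2017,
  Month = 7,
  Date  = c(16, 30, 1)
)

m <- 5  # total number of blocks: the original table plus 4 copies

# Stack m copies of the table: rows 1,2,3,1,2,3,...
df2 <- df[rep(seq_len(nrow(df)), times = m), ]

# Offsets 0,0,0,1,1,1,...: every row within a block gets the same increment
df2$Month <- df2$Month + rep(0:(m - 1), each = nrow(df))
rownames(df2) <- NULL

df2$Month
# [1]  7  7  7  8  8  8  9  9  9 10 10 10 11 11 11
```

This sketch does not wrap months past 12 into the next year; a start month late in the year would need extra handling for Year.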
