How to subset consecutive rows if they meet a condition in R

I am using R to analyze a number of time series (1951-2013) containing daily values of Max and Min temperatures. The data has the following structure:
YEAR MONTH DAY MAX MIN
1985 1 1 22.8 9.4
1985 1 2 28.6 11.7
1985 1 3 24.7 12.2
1985 1 4 17.2 8.0
1985 1 5 17.9 7.6
1985 1 6 17.7 8.1
I need to find the frequency of heat waves based on this definition: a period of three or more consecutive days with a daily maximum and minimum temperature exceeding the 90th percentile of the maximum and minimum temperatures for all days in the studied period.
Basically, I want to subset those consecutive days (three or more) when the Max and Min temp exceed a threshold value. The output would be something like this:
YEAR MONTH DAY MAX MIN
1989 7 18 45.0 23.5
1989 7 19 44.2 26.1
1989 7 20 44.7 24.4
1989 7 21 44.6 29.5
1989 7 24 44.4 31.6
1989 7 25 44.2 26.7
1989 7 26 44.5 25.0
1989 7 28 44.8 26.0
1989 7 29 44.8 24.6
1989 8 19 45.0 24.3
1989 8 20 44.8 26.0
1989 8 21 44.4 24.0
1989 8 22 45.2 25.0
I have tried the following to subset my full dataset to just the days that exceed the 90th percentile temperature:
HW <- subset(Mydata, Mydata$MAX >= quantile(Mydata$MAX, .9) &
                     Mydata$MIN >= quantile(Mydata$MIN, .9))
However, I got stuck on how to subset only the consecutive days that meet the condition.

An approach with data.table which is slightly different from @jlhoward's approach (using the same data):
library(data.table)
setDT(df)
df[, hotday := +(MAX >= 44.5 & MIN >= 24.5)
   ][, hw.length := with(rle(hotday), rep(lengths, lengths))
   ][hotday == 0, hw.length := 0]
This produces a data.table with a heat-wave length variable (hw.length) instead of a TRUE/FALSE variable for a specific heat wave length:
> df
YEAR MONTH DAY MAX MIN hotday hw.length
1: 1989 7 18 45.0 23.5 0 0
2: 1989 7 19 44.2 26.1 0 0
3: 1989 7 20 44.7 24.4 0 0
4: 1989 7 21 44.6 29.5 1 1
5: 1989 7 22 44.4 31.6 0 0
6: 1989 7 23 44.2 26.7 0 0
7: 1989 7 24 44.5 25.0 1 3
8: 1989 7 25 44.8 26.0 1 3
9: 1989 7 26 44.8 24.6 1 3
10: 1989 7 27 45.0 24.3 0 0
11: 1989 7 28 44.8 26.0 1 1
12: 1989 7 29 44.4 24.0 0 0
13: 1989 7 30 45.2 25.0 1 1
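If you only want the rows that belong to a heat wave of three or more days, you can then filter on hw.length:
df[hw.length >= 3]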

I may be missing something here but I don't see the point of subsetting beforehand. If you have data for every day, in chronological order, you can use run length encoding (see the docs on the rle(...) function).
In this example we create an artificial data set and define "heat wave" as MAX >= 44.5 and MIN >= 24.5. Then:
# example data set
df <- data.frame(YEAR = 1989, MONTH = 7, DAY = 18:30,
                 MAX = c(45, 44.2, 44.7, 44.6, 44.4, 44.2, 44.5, 44.8, 44.8, 45, 44.8, 44.4, 45.2),
                 MIN = c(23.5, 26.1, 24.4, 29.5, 31.6, 26.7, 25, 26, 24.6, 24.3, 26, 24, 25))
r <- with(with(df, rle(MAX >= 44.5 & MIN >= 24.5)), rep(lengths, lengths))
df$heat.wave <- with(df, MAX >= 44.5 & MIN >= 24.5) & (r > 2)
df
# YEAR MONTH DAY MAX MIN heat.wave
# 1 1989 7 18 45.0 23.5 FALSE
# 2 1989 7 19 44.2 26.1 FALSE
# 3 1989 7 20 44.7 24.4 FALSE
# 4 1989 7 21 44.6 29.5 FALSE
# 5 1989 7 22 44.4 31.6 FALSE
# 6 1989 7 23 44.2 26.7 FALSE
# 7 1989 7 24 44.5 25.0 TRUE
# 8 1989 7 25 44.8 26.0 TRUE
# 9 1989 7 26 44.8 24.6 TRUE
# 10 1989 7 27 45.0 24.3 FALSE
# 11 1989 7 28 44.8 26.0 FALSE
# 12 1989 7 29 44.4 24.0 FALSE
# 13 1989 7 30 45.2 25.0 FALSE
This creates a column, heat.wave, which is TRUE if there was a heat wave on that day. If you need to extract only the heat-wave days, use
df[df$heat.wave,]
# YEAR MONTH DAY MAX MIN heat.wave
# 7 1989 7 24 44.5 25.0 TRUE
# 8 1989 7 25 44.8 26.0 TRUE
# 9 1989 7 26 44.8 24.6 TRUE
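In practice you would presumably replace the fixed cutoffs with the 90th-percentile thresholds from the question; the same logic with thresholds computed from the data would look like this:
hot <- with(df, MAX >= quantile(MAX, .9) & MIN >= quantile(MIN, .9))
r <- with(rle(hot), rep(lengths, lengths))
df$heat.wave <- hot & (r > 2)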

Your question really boils down to finding groupings of 3+ consecutive days in your subsetted dataset and removing all the remaining rows.
Let's consider an example where we would want to keep some rows and remove others:
dat <- data.frame(year = 1989, month=c(6, 7, 7, 7, 7, 7, 8, 8, 8, 10, 10), day=c(12, 11, 12, 13, 14, 21, 5, 6, 7, 12, 13))
dat
# year month day
# 1 1989 6 12
# 2 1989 7 11
# 3 1989 7 12
# 4 1989 7 13
# 5 1989 7 14
# 6 1989 7 21
# 7 1989 8 5
# 8 1989 8 6
# 9 1989 8 7
# 10 1989 10 12
# 11 1989 10 13
I've excluded the temperature data, because I'm assuming we've already subsetted to just the days that exceed the 90th percentile using the code from your question.
In this dataset there is a four-day heat wave in July and a three-day heat wave in August. The first step is to convert the data to Date objects and compute the number of days between consecutive observations (I assume the data is already ordered by day here):
dates <- as.Date(paste(dat$year, dat$month, dat$day, sep="-"))
(dd <- as.numeric(difftime(tail(dates, -1), head(dates, -1), units="days")))
# [1] 29 1 1 1 7 15 1 1 66 1
We're close, because now we can see the time periods where there were multiple date gaps of 1 day -- these are the ones we want to grab. We can use the rle function to analyze runs of the number 1, keeping only the runs of length 2 or more:
(valid.gap <- with(rle(dd == 1), rep(values & lengths >= 2, lengths)))
# [1] FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
Finally, we can subset the dataset to just the days that were on either side of a 1-day date gap that is part of a heat wave:
dat[c(FALSE, valid.gap) | c(valid.gap, FALSE),]
# year month day
# 2 1989 7 11
# 3 1989 7 12
# 4 1989 7 13
# 5 1989 7 14
# 7 1989 8 5
# 8 1989 8 6
# 9 1989 8 7
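For what it's worth, the same result can be had in one pass by grouping on the cumulative gap count and keeping groups of 3+ rows; a base R sketch using the same dat and dates as above:
grp <- cumsum(c(TRUE, diff(dates) != 1))  # new group at every gap larger than 1 day
dat[ave(grp, grp, FUN = length) >= 3, ]   # keep groups spanning 3+ consecutive days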

A simple approach, not fully vectorized:
# play data
year <- c("1960")
month <- c(rep(1,30), rep(2,30), rep(3,30))
day <- rep(1:30,3)
maxT <- round(runif(90, 20, 22),1)
minT <- round(runif(90, 10, 12),1)
df <- data.frame(year, month, day, maxT, minT)
# target and tricky data...
df[1:3, 4] <- 30
df[1:4, 5] <- 14
df[10:13, 4] <- 30
df[10:11, 5] <- 14
# limits
df$maxTope <- df$maxT - quantile(df$maxT,0.9)
df$minTope <- df$minT - quantile(df$minT,0.9)
# define heat day
df$heat <- ifelse(df$maxTope > 0 & df$minTope > 0, 1, 0)
# count consecutive heat days
df$count <- 0
df$count[1] <- ifelse(df$heat[1] == 1, 1, 0)
for (i in 2:nrow(df)) {
  df$count[i] <- ifelse(df$heat[i] == 1, df$count[i - 1] + 1, 0)
}
# select days where the running count reaches 3 or more ($count shows the wave length so far)
df[which(df$count >= 3),]
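Note that df$count >= 3 returns only the third and later days of each wave. If you want every day of each qualifying wave, a small sketch that expands each index i with count >= 3 back to the start of its run (at i - count + 1):
idx <- which(df$count >= 3)
wave <- sort(unique(unlist(lapply(idx, function(i) (i - df$count[i] + 1):i))))
df[wave, ]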

Here's a quick little solution:
is_High_Temp <- Mydata$MAX >= quantile(Mydata$MAX, .9) &
                Mydata$MIN >= quantile(Mydata$MIN, .9)
start_of_a_series <- c(TRUE, is_High_Temp[-1] != is_High_Temp[-length(is_High_Temp)]) # this is the tricky part
series_number <- cumsum(start_of_a_series)
series_length <- ave(series_number, series_number, FUN = length)
is_heat_wave <- series_length >= 3 & is_High_Temp
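The matching rows can then be pulled out with:
Mydata[is_heat_wave, ]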

A solution with dplyr, also using rle():
library(dplyr)
cond <- expr(MAX >= 44.5 & MIN >= 24.5)
df %>%
  mutate(heatwave = rep(rle(!!cond)$values & rle(!!cond)$lengths >= 3,
                        rle(!!cond)$lengths)) %>%
  filter(heatwave)
#> YEAR MONTH DAY MAX MIN heatwave
#> 1 1989 7 24 44.5 25.0 TRUE
#> 2 1989 7 25 44.8 26.0 TRUE
#> 3 1989 7 26 44.8 24.6 TRUE
Created on 2020-05-16 by the reprex package (v0.3.0)
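As a small refinement, the rle() can be computed once instead of three times; this sketch should give the same result:
df %>%
  mutate(heatwave = with(rle(!!cond), rep(values & lengths >= 3, lengths))) %>%
  filter(heatwave)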
data
#devtools::install_github("alistaire47/read.so")
df <- read.so::read.so("YEAR MONTH DAY MAX MIN
1989 7 18 45.0 23.5
1989 7 19 44.2 26.1
1989 7 20 44.7 24.4
1989 7 21 44.6 29.5
1989 7 24 44.4 31.6
1989 7 25 44.2 26.7
1989 7 26 44.5 25.0
1989 7 28 44.8 26.0
1989 7 29 44.8 24.6
1989 8 19 45.0 24.3
1989 8 20 44.8 26.0
1989 8 21 44.4 24.0
1989 8 22 45.2 25.0")

Related

Loop to sum weekly rolling average

I am new to coding. I have a data set of daily stream flow averages over 20 years. Following is an example:
DATE FLOW
1 10/1/2001 88.2
2 10/2/2001 77.6
3 10/3/2001 68.4
4 10/4/2001 61.5
5 10/5/2001 55.3
6 10/6/2001 52.5
7 10/7/2001 49.7
8 10/8/2001 46.7
9 10/9/2001 43.3
10 10/10/2001 41.3
11 10/11/2001 39.3
12 10/12/2001 37.7
13 10/13/2001 35.8
14 10/14/2001 34.1
15 10/15/2001 39.8
I need to create a loop summing the previous 6 days as well as the current day (a rolling weekly average) and store it in an array for the designated water year. I have already created an aggregate function to separate the daily means into their designated water years.
# Separating dates into specific water years
wtr_yr <- function(dates, start_month = 9) {
  # Convert dates into POSIXlt
  POSIDATE <- as.POSIXlt(dates)
  # Year offset
  offset <- ifelse(POSIDATE$mon >= start_month - 1, 1, 0)
  # Water year
  POSIDATE$year + 1900 + offset
}
# Aggregating by water year to take the mean
mean.FLOW <- aggregate(data_set$FLOW, list(adj.year = wtr_yr(data_set$DATE)), mean)
It seems that it can be done much more easily.
But first I need to prepare a bit more data.
library(tidyverse)
library(lubridate)
df = tibble(
DATE = seq(mdy("1/1/2010"), mdy("12/31/2022"), 1),
FLOW = rnorm(length(DATE), 40, 10)
)
output
# A tibble: 4,748 x 2
DATE FLOW
<date> <dbl>
1 2010-01-01 34.4
2 2010-01-02 37.7
3 2010-01-03 55.6
4 2010-01-04 40.7
5 2010-01-05 41.3
6 2010-01-06 57.2
7 2010-01-07 44.6
8 2010-01-08 27.3
9 2010-01-09 33.1
10 2010-01-10 35.5
# ... with 4,738 more rows
Now let's do the aggregation by year and week number
df %>%
group_by(year(DATE), week(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 689 x 3
# Groups: year(DATE) [13]
`year(DATE)` `week(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 44.5
2 2010 2 39.6
3 2010 3 38.5
4 2010 4 35.3
5 2010 5 44.1
6 2010 6 39.4
7 2010 7 41.3
8 2010 8 43.9
9 2010 9 38.5
10 2010 10 42.4
# ... with 679 more rows
Note that for the week function, the first week starts on January 1st. If you want to number the weeks according to the ISO 8601 standard, use the isoweek function. Alternatively, you can use epiweek, which follows the US CDC convention.
df %>%
group_by(year(DATE), isoweek(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 681 x 3
# Groups: year(DATE) [13]
`year(DATE)` `isoweek(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 40.0
2 2010 2 45.5
3 2010 3 33.2
4 2010 4 38.9
5 2010 5 45.0
6 2010 6 40.7
7 2010 7 38.5
8 2010 8 42.5
9 2010 9 37.1
10 2010 10 42.4
# ... with 671 more rows
If you want to better understand how these three functions differ, run the code below:
df %>%
mutate(
w1 = week(DATE),
w2 = isoweek(DATE),
w3 = epiweek(DATE)
)
output
# A tibble: 4,748 x 5
DATE FLOW w1 w2 w3
<date> <dbl> <dbl> <dbl> <dbl>
1 2010-01-01 34.4 1 53 52
2 2010-01-02 37.7 1 53 52
3 2010-01-03 55.6 1 53 1
4 2010-01-04 40.7 1 1 1
5 2010-01-05 41.3 1 1 1
6 2010-01-06 57.2 1 1 1
7 2010-01-07 44.6 1 1 1
8 2010-01-08 27.3 2 1 1
9 2010-01-09 33.1 2 1 1
10 2010-01-10 35.5 2 1 2
# ... with 4,738 more rows
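Note that the grouped means above are calendar-week averages. If you want the literal rolling weekly average from the question (the current day plus the previous 6), a minimal sketch on the same df using base stats::filter:
df %>%
  mutate(
    # sides = 1 averages the current value and the 6 preceding ones;
    # the first 6 rows have no full window and come out as NA
    roll7 = as.numeric(stats::filter(FLOW, rep(1 / 7, 7), sides = 1))
  )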

Filter a dataframe by keeping row dates of three days in a row preferably with dplyr

I would like to filter a dataframe based on its date column. I would like to keep the rows where there are at least 3 consecutive days. I would like to do this as efficiently and quickly as possible, so if someone has a vectorized approach it would be good.
I tried to inspire myself from the following link, but it didn't really go well, as it is a different problem:
How to filter rows based on difference in dates between rows in R?
I tried to do it with a for loop and managed to flag the dates that are not consecutive, but it didn't give me the desired result, because it keeps all dates that are in a row even if there are fewer than 3 in a row.
tf is my dataframe
for (i in 2:(nrow(tf) - 1)) {
  if (tf$Date[i] != tf$Date[i + 1] %m+% days(-1)) {
    if (tf$Date[i] != tf$Date[i - 1] %m+% days(1)) {
      tf$Date[i] = as.Date(0)
    }
  }
}
The first 22 rows of my dataframe look something like this:
Date RR.x RR.y Y
1 1984-10-20 1 10.8 1984
2 1984-11-04 1 12.5 1984
3 1984-11-05 1 7.0 1984
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
7 1984-11-13 1 5.9 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
11 1986-11-17 1 14.1 1986
12 2003-10-17 1 7.8 2003
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
16 2003-11-15 1 26.4 2003
17 2003-11-20 1 10.0 2003
18 2011-10-29 1 10.0 2011
19 2011-11-04 1 11.4 2011
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
The result should be:
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
One possibility could be:
df %>%
  mutate(Date = as.Date(Date, format = "%Y-%m-%d"),
         diff = c(0, diff(Date))) %>%
  group_by(grp = cumsum(diff > 1 & lead(diff, default = last(diff)) == 1)) %>%
  filter(if_else(diff > 1 & lead(diff, default = last(diff)) == 1, 1, diff) == 1) %>%
  filter(n() >= 3) %>%
  ungroup() %>%
  select(-diff, -grp)
Date RR.x RR.y Y
<date> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984
2 1984-11-10 1 24.4 1984
3 1984-11-11 1 19 1984
4 1986-10-15 1 10.3 1986
5 1986-10-16 1 18.1 1986
6 1986-10-17 1 11.3 1986
7 2003-10-25 1 7.6 2003
8 2003-10-26 1 5 2003
9 2003-10-27 1 6.6 2003
10 2011-11-21 1 9.8 2011
11 2011-11-22 1 5.6 2011
12 2011-11-23 1 20.4 2011
Here's a base solution:
DF$Date <- as.Date(DF$Date)
# new run id at every date gap larger than one day; rle() then gives run lengths
rles <- rle(cumsum(c(1, diff(DF$Date) != 1)))
# keep only runs spanning 3 or more consecutive days
rles$values <- rles$lengths >= 3
DF[inverse.rle(rles), ]
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
A similar approach in dplyr:
DF %>%
  mutate(Date = as.Date(Date)) %>%
  add_count(IDs = cumsum(c(1, diff(Date) != 1))) %>%
  filter(n >= 3)
# A tibble: 12 x 6
Date RR.x RR.y Y IDs n
<date> <int> <dbl> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984 3 3
2 1984-11-10 1 24.4 1984 3 3
3 1984-11-11 1 19 1984 3 3
4 1986-10-15 1 10.3 1986 5 3
5 1986-10-16 1 18.1 1986 5 3
6 1986-10-17 1 11.3 1986 5 3
7 2003-10-25 1 7.6 2003 8 3
8 2003-10-26 1 5 2003 8 3
9 2003-10-27 1 6.6 2003 8 3
10 2011-11-21 1 9.8 2011 13 3
11 2011-11-22 1 5.6 2011 13 3
12 2011-11-23 1 20.4 2011 13 3
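If you want the output to match the expected result exactly, the helper columns can be dropped at the end:
DF %>%
  mutate(Date = as.Date(Date)) %>%
  add_count(IDs = cumsum(c(1, diff(Date) != 1))) %>%
  filter(n >= 3) %>%
  select(-IDs, -n)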

Find first previous lower value for each value in dataframe column

Given the following dataframe:
library(dplyr)    # for %>% and mutate
library(stringr)  # the `words` vector comes from stringr
set.seed(1)
my_df = data.frame(x = rep(words[1:5], 50) %>% sort(),
                   y = 1:250,
                   z = sample(seq(from = 30, to = 90, by = 0.1), size = 250, replace = T))
my_df %>% head(30)
x y z
1 a 1 45.9
2 a 2 52.3
3 a 3 64.4
4 a 4 84.5
5 a 5 42.1
6 a 6 83.9
7 a 7 86.7
8 a 8 69.7
9 a 9 67.8
10 a 10 33.7
11 a 11 42.3
12 a 12 40.6
13 a 13 71.2
14 a 14 53.0
15 a 15 76.2
16 a 16 59.9
17 a 17 73.1
18 a 18 89.6
19 a 19 52.8
20 a 20 76.7
21 a 21 86.1
22 a 22 42.7
23 a 23 69.1
24 a 24 37.5
25 a 25 46.0
26 a 26 53.2
27 a 27 30.8
28 a 28 52.9
29 a 29 82.2
30 a 30 50.4
I would like to create the following column using dplyr mutate:
for each value in column z, show the row index of the nearest previous value in z which is lower.
For example:
for row 8 (z = 69.7) show 5 (z = 42.1)
for row 22 (z = 42.7) show 12 (z = 40.6)
I'm not sure how to do this using dplyr, but here is a data.table attempt using a self non-equi join
library(data.table)
setDT(my_df) %>% #convert to data.table
# Run a self non-equi join and find the closest lower value
.[., .N - which.max(rev(z < i.z)) + 1L, on = .(y <= y), by = .EACHI] %>%
# filter the cases where there are no such values
.[y != V1] %>%
# join the result back to the original data
my_df[., on = .(y), res := V1]
head(my_df, 22)
# x y z res
# 1: a 1 45.9 NA
# 2: a 2 52.3 1
# 3: a 3 64.4 2
# 4: a 4 84.5 3
# 5: a 5 42.1 NA
# 6: a 6 83.9 5
# 7: a 7 86.7 6
# 8: a 8 69.7 5
# 9: a 9 67.8 5
# 10: a 10 33.7 NA
# 11: a 11 42.3 10
# 12: a 12 40.6 10
# 13: a 13 71.2 12
# 14: a 14 53.0 12
# 15: a 15 76.2 14
# 16: a 16 59.9 14
# 17: a 17 73.1 16
# 18: a 18 89.6 17
# 19: a 19 52.8 12
# 20: a 20 76.7 19
# 21: a 21 86.1 20
# 22: a 22 42.7 12
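For comparison, a naive O(n^2) base R sketch of the same lookup (fine for 250 rows, slow on large data; res_base is a made-up name to avoid clobbering res):
my_df$res_base <- sapply(seq_len(nrow(my_df)), function(i) {
  j <- which(my_df$z[seq_len(i - 1)] < my_df$z[i])  # earlier rows with lower z
  if (length(j)) max(j) else NA_integer_            # nearest one, or NA if none
})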
I have managed to find a dplyr solution, inspired by a rollapply solution given to one of my other questions in this link.
library(zoo)  # rollapply comes from zoo
set.seed(1)
my_df = data.frame(x = rep(words[1:5], 50) %>% sort(),
                   y = 1:250,
                   z = sample(seq(from = 30, to = 90, by = 0.1), size = 250, replace = T))
my_df %>%
  mutate(First_Lower_z_Backwards = row_number() -
           rollapply(z,
                     width = list(0:(-n())),
                     FUN = function(x) which(x < x[1])[1] - 1,
                     fill = NA,
                     partial = TRUE)) %>%
  head(22)
x y z First_Lower_z_Backwards
1 a 1 45.9 NA
2 a 2 52.3 1
3 a 3 64.4 2
4 a 4 84.5 3
5 a 5 42.1 NA
6 a 6 83.9 5
7 a 7 86.7 6
8 a 8 69.7 5
9 a 9 67.8 5
10 a 10 33.7 NA
11 a 11 42.3 10
12 a 12 40.6 10
13 a 13 71.2 12
14 a 14 53.0 12
15 a 15 76.2 14
16 a 16 59.9 14
17 a 17 73.1 16
18 a 18 89.6 17
19 a 19 52.8 12
20 a 20 76.7 19
21 a 21 86.1 20
22 a 22 42.7 12

Calculate the percent occurrence of a variable in multiple groups

Sample data
set.seed(123)
df <- data.frame(loc.id = rep(1:1000, each = 35),
                 year = rep(1980:2014, times = 1000),
                 month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = T))
The data frame has 1000 locations x 35 years of data for a variable called month.id, which is basically the month of a year. For each year, I want to calculate the percent occurrence of each month. For example, for 1980:
month.vec <- df[df$year == 1980,]
table(month.vec$month.id)
1 2 3 4 8 9 10 12
106 132 116 122 114 130 141 139
To calculate the percent occurrence of months:
table(month.vec$month.id)/length(month.vec$month.id) * 100
1 2 3 4 8 9 10 12
10.6 13.2 11.6 12.2 11.4 13.0 14.1 13.9
I want to have a table something like this:
year month percent
1980 1 10.6
1980 2 13.2
1980 3 11.6
1980 4 12.2
1980 5 NA
1980 6 NA
1980 7 NA
1980 8 11.4
1980 9 13
1980 10 14.1
1980 11 NA
1980 12 13.9
Since months 5, 6, 7 and 11 are missing, I just want to add additional rows with NAs for those months. If possible, I would like a dplyr solution, something like this:
library(dplyr)
df %>% group_by(year) %>% summarise(percentage.contri = table(month.id)/length(month.id)*100)
A solution using dplyr and tidyr:
# To get month as integer use (or add as.integer to mutate):
# df$month.id <- as.integer(df$month.id)
library(dplyr)
library(tidyr)
df %>%
  group_by(year, month.id) %>%
  # Count occurrences per year & month
  summarise(n = n()) %>%
  # Get percent per month (the yearly total is sum(n))
  mutate(percent = n / sum(n) * 100) %>%
  # Fill in missing months
  complete(year, month.id = 1:12, fill = list(percent = 0)) %>%
  select(year, month.id, percent)
year month.id percent
<int> <dbl> <dbl>
1 1980 1.00 10.6
2 1980 2.00 13.2
3 1980 3.00 11.6
4 1980 4.00 12.2
5 1980 5.00 0
6 1980 6.00 0
7 1980 7.00 0
8 1980 8.00 11.4
9 1980 9.00 13.0
10 1980 10.0 14.1
11 1980 11.0 0
12 1980 12.0 13.9
A base R solution:
tab <- table(month.vec$year, factor(month.vec$month.id, levels = 1:12)) /
       length(month.vec$month.id) * 100
dfnew <- as.data.frame(tab)
which gives:
> dfnew
Var1 Var2 Freq
1 1980 1 10.6
2 1980 2 13.2
3 1980 3 11.6
4 1980 4 12.2
5 1980 5 0.0
6 1980 6 0.0
7 1980 7 0.0
8 1980 8 11.4
9 1980 9 13.0
10 1980 10 14.1
11 1980 11 0.0
12 1980 12 13.9
Or with data.table:
library(data.table)
setDT(month.vec)[, .N, by = .(year, month.id)
][.(year = 1980, month.id = 1:12), on = .(year, month.id)
][, N := 100 * N/sum(N, na.rm = TRUE)][]
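If you want the base table() idea for all years at once rather than just 1980, prop.table() over rows is a compact sketch:
tab <- prop.table(table(df$year, factor(df$month.id, levels = 1:12)), margin = 1) * 100
head(as.data.frame(tab), 12)  # long format: year, month, percent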

R program - getting particular values depending on another column

So I have data regarding Id number and time
Id number Time(hr)
1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4
I want this output
Time Id number
10 5
20 10
30 16
40 22
So I want the time to be in 10-hour intervals and get the Id that corresponds to each mark. I tried data <- data2[seq(0, nrow(data2), by=5), ], but that steps through the rows at a fixed interval instead of stepping through Time in 10-hour steps, which is not the output I want. So far I'm getting this output:
Id.number Time..s.
10 19.3
20 36.9
You can use the %% (mod) operator.
data[data$Time %% 10 == 0, ]
I use cut() and cumsum(table()) but I don't quite get the answer you are expecting. How exactly are you calculating this?
# first load the data
v.txt <- '1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4'
# load in the data... awkwardly...
v <- as.data.frame(matrix(as.numeric(unlist(strsplit(strsplit(v.txt, '\n')[[1]], ' +'))),
                          byrow = TRUE, ncol = 2))
names(v) <- c('Id', 'Time')  # needed: the matrix route leaves the columns named V1/V2
tens <- seq(from=0, by=10, to=100)
v$cut <- cut(v$Time, tens, labels=tens[-1])
v2 <- as.data.frame(cumsum(table(v$cut)))
names(v2) <- 'Time'
v2$Id <- rownames(v2)
rownames(v2) <- 1:nrow(v2)
v2 <- v2[,c(2,1)]
rm(v, v.txt, tens) # not needed anymore
v2 # the answer... but doesn't quite match your expected answer...
Id Time
1 10 5
2 20 10
3 30 15
4 40 21
5 50 25
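Judging from the expected table, each 10-hour mark seems to be matched to the Id whose Time is nearest to it. If that is the rule, this short sketch reproduces the expected output exactly (it uses v as loaded above, so skip the rm() line):
marks <- seq(10, 40, by = 10)
data.frame(Time = marks,
           Id = sapply(marks, function(m) v$Id[which.min(abs(v$Time - m))]))
# Time Id
#   10  5
#   20 10
#   30 16
#   40 22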
