How to do countif in R based on dates

I have this data in my Excel files, and there is too much of it to count by hand in Excel. I want to count how many days in each month have a value greater than 50.
I'd like to turn it into something like:
Could someone help me to solve this?

Another option is count with as.yearmon from zoo: filter the rows where 'Value' is greater than 50, then use count after converting the dates to the yearmon class with as.yearmon.
library(dplyr)
library(zoo)
df %>%
filter(Value > 50) %>%
count(month_year = as.yearmon(Date))
-output
month_year n
1 Jan 2010 3
2 Feb 2010 1
data
df <- structure(list(Date = structure(c(14610, 14611, 14612, 14618,
14618, 14624, 14641), class = "Date"), Value = c(27, 35, 78,
88, 57, 48, 99)), class = "data.frame", row.names = c(NA, -7L
))

Suppose your data is given by
df <- data.frame(Date = as.Date(c("1/1/2010", "1/2/2010", "1/3/2010", "1/9/2010", "1/9/2010", "1/15/2010", "2/1/2010"), "%m/%d/%Y"),
Value = c(27, 35, 78, 88, 57, 48, 99))
To count your specific values you could use
library(dplyr)
df %>%
group_by(month_year = format(Date, "%m-%y")) %>%
summarise(count = sum(Value > 50))
which returns
# A tibble: 2 x 2
month_year count
<chr> <int>
1 01-10 3
2 02-10 1
Note: Your Date column has to contain dates (as in as.Date).
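For comparison, here is a minimal base R sketch of the same count, assuming the sample data above. It also shows a variant that counts distinct days, since the sample contains the same date twice; whether duplicates should count once or twice is an assumption the question does not settle.

```r
# Sample data from the answers above
df <- data.frame(
  Date = as.Date(c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-09",
                   "2010-01-09", "2010-01-15", "2010-02-01")),
  Value = c(27, 35, 78, 88, 57, 48, 99)
)

# Keep only the rows above the threshold
over_50 <- df[df$Value > 50, ]

# Count rows per month (matches the dplyr results above)
counts <- table(format(over_50$Date, "%Y-%m"))

# Count distinct days per month (a date that appears twice counts once)
distinct_days <- table(format(unique(over_50$Date), "%Y-%m"))

print(counts)
print(distinct_days)
```

With this data the row count for January 2010 is 3, while the distinct-day count is 2, because 2010-01-09 appears twice.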

Related

Is there a way to use the lag function and subtract from the first row in the assigned group until a value is reached before repeating the process?

For example if I have a dataset that looks like this
structure(list(ID = c(123, 123, 123, 123, 123, 145, 145, 145,
145, 145, 145), `Date Time` = structure(c(1663037145, 1663037160,
1663040745, 1663042520, 1663043060, 1663372800, 1663373100, 1663376400,
1663376460, 1663376940, 1663377780), class = c("POSIXct", "POSIXt"
), tzone = "UTC")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-11L))
The way I use the lag function is
df %>%
group_by(ID) %>%
mutate(diff=`Date Time`-lag(`Date Time`))
However, this subtracts the previous row. How do I make it subtract the first row in the group until the value goes over 1 hour (3600 seconds), and then start again from the next row? Please no hard-coding; assume some of these groups have hundreds of rows and some have only one. The expected output would look like this:
structure(list(ID = c(123, 123, 123, 123, 123, 145, 145, 145,
145, 145, 145), `Date Time` = structure(c(1663037145, 1663037160,
1663040745, 1663042520, 1663043060, 1663372800, 1663373100, 1663376400,
1663376460, 1663376940, 1663377780), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), diff = c(NA, 15, 3600, NA, 540, NA, 300, 3600,
NA, 8, 1320)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-11L))
The purpose of lag and lead is to allow you to access a known & fixed number of rows ahead or behind in the dataset.
I tried the following experiment, which used a dynamic number of rows:
df %>%
group_by(ID) %>%
mutate(num_rows = n()) %>%
arrange(`Date Time`) %>%
mutate(first_dt = lag(`Date Time`, num_rows - 1))
But this fails, because lag will only accept "a positive integer, not a vector". So you are going to need another method.
The following code finds the minimum date time for each ID:
df %>%
group_by(ID) %>%
mutate(min_date_time = min(`Date Time`))
You can then manipulate using standard calculations:
... %>%
mutate(diff_from_start = `Date Time` - min_date_time)
If you want to roll across multiple rows, another option is to use cumsum or similar, for example to produce a cumulative sum of differences.
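The answer above stops at the difference from the group minimum. One way to get the reset behaviour the question asks for is a small loop that keeps a running anchor. The sketch below is only an illustration: diff_from_anchor is a hypothetical helper (not part of dplyr), and it assumes the rows are already sorted by time within each ID and that a difference of exactly 3600 seconds triggers the reset, as in the question's expected output.

```r
# Sample data from the question
df <- data.frame(
  ID = c(rep(123, 5), rep(145, 6)),
  `Date Time` = as.POSIXct(
    c(1663037145, 1663037160, 1663040745, 1663042520, 1663043060,
      1663372800, 1663373100, 1663376400, 1663376460, 1663376940,
      1663377780),
    origin = "1970-01-01", tz = "UTC"
  ),
  check.names = FALSE
)

# Hypothetical helper: seconds since the current anchor row.
# Once a difference reaches `limit`, the next row becomes the new anchor
# and gets NA, just like the first row of the group.
diff_from_anchor <- function(x, limit = 3600) {
  secs <- as.numeric(x)
  out <- rep(NA_real_, length(secs))
  anchor <- secs[1]
  reset <- FALSE
  for (i in seq_along(secs)[-1]) {
    if (reset) {
      anchor <- secs[i]
      reset <- FALSE
      next
    }
    out[i] <- secs[i] - anchor
    if (out[i] >= limit) reset <- TRUE
  }
  out
}

# Apply the helper within each ID using base R's ave()
df$diff <- ave(as.numeric(df$`Date Time`), df$ID, FUN = diff_from_anchor)
print(df)
```

Inside a dplyr pipeline the same helper could be called as group_by(ID) %>% mutate(diff = diff_from_anchor(`Date Time`)). For the sample data this gives 480 for the tenth row (1663376940 - 1663376460), where the expected output in the question lists 8.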

How to add a new column based on a few other variables

I am new to R and am having trouble creating a new variable using conditions from already existing variables. I have a dataset that has a few columns: Name, Month, Binary for Gender, and Price. I want to create a new variable, Price2, that will:
make the price charged 20 if [the month is 6-9(Jun-Sept) and Gender is 0]
make the price charged 30 if [the month is 6-9(Jun-Sept) and Gender is 1]
make the price charged 0 if [the month is 1-5(Jan-May) or month is 10-12(Oct-Dec)]
--
structure(list(Name = c("ADI", "SLI", "SKL", "SNK", "SIIEL", "DJD"), Mon = c(1, 2, 3, 4, 5, 6), Gender = c(1, NA, NA, NA, 1, NA), Price = c(23, 34, 32, 64, 23, 34)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
Using case_when() from the dplyr package:
mydf$newprice <- dplyr::case_when(
mydf$Mon >= 6 & mydf$Mon <= 9 & mydf$Gender == 0 ~ 20,
mydf$Mon >= 6 & mydf$Mon <= 9 & mydf$Gender == 1 ~ 30,
mydf$Mon < 6 | mydf$Mon > 9 ~ 0)
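As a rough sketch of how this behaves on the sample data, the variant below applies the same rules inside mutate() and names the result Price2 as in the question. The reordering (testing the off-season months first) is a stylistic assumption, not a requirement. Note that a summer month with a missing Gender matches no condition, so case_when() returns NA for it.

```r
library(dplyr)

# Sample data from the question
mydf <- data.frame(
  Name = c("ADI", "SLI", "SKL", "SNK", "SIIEL", "DJD"),
  Mon = c(1, 2, 3, 4, 5, 6),
  Gender = c(1, NA, NA, NA, 1, NA),
  Price = c(23, 34, 32, 64, 23, 34)
)

mydf <- mydf %>%
  mutate(Price2 = case_when(
    !(Mon %in% 6:9) ~ 0,   # Jan-May and Oct-Dec
    Gender == 0     ~ 20,  # Jun-Sep, Gender 0
    Gender == 1     ~ 30   # Jun-Sep, Gender 1
  ))

print(mydf)
```

In this sample only the last row (Mon 6, Gender NA) falls in the summer range, and it ends up as NA because its Gender is missing.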

Conditionally replace values across multiple columns based on string match in a separate column

I'm trying to conditionally replace values in multiple columns based on a string match in a different column. I'd like to do so in a single line of code using the across() function, but I keep getting errors that don't quite make sense to me. I feel like this probably has a simple solution, so if anyone could point me in the right direction, that would be fantastic!
df <- data.frame("type" = c("Park", "Neighborhood", "Airport", "Park", "Neighborhood", "Neighborhood"),
"total" = c(34, 56, 75, 89, 21, 56),
"group_a" = c(30, 26, 45, 60, 3, 46),
"group_b" = c(4, 30, 30, 29, 18, 10))
# working but not concise
df %>%
mutate(total = ifelse(str_detect(type, "Park"), NA, total),
group_a = ifelse(str_detect(type, "Park"), NA, group_a),
group_b = ifelse(str_detect(type, "Park"), NA, group_b))
# concise but not working
df %>% mutate(across(total, group_a, group_b), ifelse(str_detect(type, "Park"), NA, .))
Update
We got a solution that works with my dummy dataset but is not working with my real data, so I am going to share a small snippet of my real data frame with the numbers changed and organization names hidden. When I run this line of code (df %>% mutate(across(c(Attempts, Canvasses, Completes)), ~ifelse(str_detect(long_name, "park-cemetery"), NA, .))) on these data, I get the following error message:
Error: Problem with mutate() input ..2. x Input ..2 must be a
vector, not a formula object. i Input ..2 is
~ifelse(str_detect(long_name, "park-cemetery"), NA, .).
This is a small sample of the data that produces this error:
df <- structure(list(Org = c("OrgName", "OrgName", "OrgName", "OrgName",
"OrgName", "OrgName", "OrgName", "OrgName", "OrgName", "OrgName"
), nCode = c("M34", "R36", "R46", "X29", "M31", "K39", "Q12",
"Q39", "X41", "K27"), Attempts = c(100, 100, 100, 100, 100, 100,
100, 100, 100, 100), Canvasses = c(80, 80, 80, 80, 80, 80, 80,
80, 80, 80), Completes = c(50, 50, 50, 50, 50, 50, 50, 50, 50,
50), van_nocc_id = c(999, 999, 999, 999, 999, 999, 999, 999,
999, 999), van_name = c("M-Upper West Side", "SI-Rosebank", "SI-Tottenville",
"BX-park-cemetery-etc-Bronx", "M-Stuyvesant Town-Cooper Village",
"BK-Kensington", "Q-Broad Channel", "Q-Lindenwood", "BX-Wakefield",
"BK-East New York"), boro_short = c("M", "SI", "SI", "BX", "M",
"BK", "Q", "Q", "BX", "BK"), long_name = c("Upper West Side",
"Rosebank", "Tottenville", "park-cemetery-etc-Bronx", "Stuyvesant Town-Cooper Village",
"Kensington", "Broad Channel", "Lindenwood", "Wakefield", "East New York"
)), row.names = c(NA, -10L), class = "data.frame")
Final update
The curse of the misplaced closing bracket! Thanks to everyone for your help... the correct solution was df %>% mutate(across(c(Attempts, Canvasses, Completes), ~ifelse(str_detect(long_name, "park-cemetery"), NA, .)))
If you use the newly introduced function across (which is the correct way to approach this task), you have to specify the function you want to apply inside across itself. In this case the function ifelse(...) has to be a purrr-style lambda (so starting with ~). Check out the across documentation and look for the arguments .cols and .fns.
df %>%
mutate(across(c(total, group_a, group_b), ~ifelse(str_detect(type, "Park"), NA, .)))
Output
# type total group_a group_b
# 1 Park NA NA NA
# 2 Neighborhood 56 26 30
# 3 Airport 75 45 30
# 4 Park NA NA NA
# 5 Neighborhood 21 3 18
# 6 Neighborhood 56 46 10
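If the column names are held as strings, one option is to wrap them in all_of() inside across(). The sketch below assumes the dummy data from the question; the cols vector is a name introduced here for illustration, and grepl() stands in for str_detect() only to avoid loading stringr.

```r
library(dplyr)

# Dummy data from the question
df <- data.frame(
  type = c("Park", "Neighborhood", "Airport", "Park", "Neighborhood", "Neighborhood"),
  total = c(34, 56, 75, 89, 21, 56),
  group_a = c(30, 26, 45, 60, 3, 46),
  group_b = c(4, 30, 30, 29, 18, 10)
)

# Column names stored as a character vector (hypothetical name `cols`)
cols <- c("total", "group_a", "group_b")

# The lambda goes inside across(), as its second argument
result <- df %>%
  mutate(across(all_of(cols),
                ~ ifelse(grepl("Park", type, fixed = TRUE), NA, .x)))

print(result)
```

The key point is the same as in the answer above: both the column selection and the lambda are arguments to across(), so the closing bracket of across() comes after the lambda.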
Here is a data.table solution.
require(data.table)
df <- data.frame("type" = c("Park", "Neighborhood", "Airport", "Park", "Neighborhood", "Neighborhood"),
"total" = c(34, 56, 75, 89, 21, 56),
"group_a" = c(30, 26, 45, 60, 3, 46),
"group_b" = c(4, 30, 30, 29, 18, 10))
setDT(df)
df[type == "Park", c("total", "group_a", "group_b") := NA]
Update: that didn't take long to figure out! Just needed to place the columns in a vector:
# concise AND working!
df %>% mutate(across(c(total, group_a, group_b)), ifelse(str_detect(type, "Park"), NA, .))
I had tried this initially but placed the columns in quotes... don't do that :)

How to compile the number of observations that follow a specific pattern?

I have a dataset with three variables (DateTime, Transmitter, and timediff). The timediff column is the time difference between subsequent detections of a transmitter. I want to know how many times the time differences followed a specific pattern. Here is a sample of my data.
> dput(Example)
structure(list(DateTime = structure(c(1501117802, 1501117805,
1501117853, 1501117857, 1501117913, 1501117917, 1501186253, 1501186254,
1501186363, 1501186365, 1501186541, 1501186542, 1501186550, 1501186590,
1501186591, 1501186644, 1501186646, 1501186737, 1501186739, 1501187151
), class = c("POSIXct", "POSIXt"), tzone = "GMT"), Transmitter = c(30767L,
30767L, 30767L, 30767L, 30767L, 30767L, 30767L, 30767L, 30767L,
30767L, 30767L, 30767L, 30767L, 30767L, 30767L, 30767L, 30767L,
30767L, 30767L, 30767L), timediff = c(44, 3, 48, 4, 56, 4, 50,
1, 42, 2, 56, 1, 8, 40, 1, 53, 2, 37, 2, 42)), row.names = c(NA,
20L), class = "data.frame")
So looking at the time difference column, I want to know how many times there is a single timediff < 8 seconds, how many times there are two subsequent timediff < 8 seconds, how many times there are three subsequent timediff < 8 seconds, and so on.
Example: In the given dataset, a single timediff < 8 seconds happens 7 times while two subsequent timediffs < 8 seconds happens twice.
A "single timediff" = 44, 3 , 48
A "double timediff" = 56, 1, 8, 40
In terms of an output, I'd be looking for something like this...
> dput(output)
structure(list(ID = 30767, Single = 7, Double = 2), class = "data.frame", row.names = c(NA,
-1L))
Thanks for the help!
One dplyr possibility could be:
df %>%
mutate(cond = timediff <= 8) %>%
group_by(rleid = with(rle(cond), rep(seq_along(lengths), lengths))) %>%
add_count(rleid, name = "n_timediff") %>%
filter(cond & row_number() == 1) %>%
ungroup() %>%
count(n_timediff)
n_timediff n
<int> <int>
1 1 8
2 2 1
Considering there could be more values in "Transmitter", you can do (this also requires tidyr):
df %>%
mutate(cond = timediff <= 8) %>%
group_by(Transmitter, rleid = with(rle(cond), rep(seq_along(lengths), lengths))) %>%
add_count(rleid, name = "n_timediff") %>%
filter(cond & row_number() == 1) %>%
ungroup() %>%
group_by(Transmitter) %>%
count(n_timediff) %>%
mutate(n_timediff = paste("timediff", n_timediff, sep = "_")) %>%
spread(n_timediff, n)
Transmitter timediff_1 timediff_2
<int> <int> <int>
1 30767 8 1
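The run-length idea behind the dplyr answers can also be sketched in base R for a single transmitter. This assumes the same threshold as the answers above (timediff <= 8, so that the value 8 counts as short) and uses the timediff values from the question.

```r
# timediff values from the question's sample data
timediff <- c(44, 3, 48, 4, 56, 4, 50, 1, 42, 2,
              56, 1, 8, 40, 1, 53, 2, 37, 2, 42)

# Run-length encode the condition: each run is a stretch of
# consecutive TRUEs (short gaps) or FALSEs (long gaps)
runs <- rle(timediff <= 8)

# Keep only the runs of short gaps and tabulate their lengths
run_counts <- table(runs$lengths[runs$values])

print(run_counts)
```

For this sample the table has 8 runs of length 1 and 1 run of length 2, which agrees with the output shown above.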

15 Minute Period for Time Series

I have the chunk of code below where I am trying to fill the missing minutes in my data df_stuff by joining it to a time series which has all minutes for an entire year. I would actually like to aggregate this data at 15 minute intervals instead of minute. Does anyone know a simple way of doing this? I was looking at to.minutes15 from the xts package but it seems to have problems with my POSIXct format time series.
Code:
library("sqldf")
##Filling Gaps in time by minute
myTZ <- "America/Los_Angeles"
tseries <- seq(as.POSIXct("2015-01-01 00:00:00", tz=myTZ),
as.POSIXct("2015-12-31 23:59:00", tz=myTZ), by="min")
df2 <- data.frame(SeqDateTime=tseries)
finaldf <- sqldf("select df2.SeqDateTime,
median(df_stuff.brooms) as broomsTot
from df2
left outer join df_stuff on df2.SeqDateTime = df_stuff.broomTime
group by df2.SeqDateTime
order by df2.SeqDateTime asc")
Data:
df_stuff <- structure(list(brooms = c(27, 53, 10, 55, 14, 49, 26,
13, 12, NA, NA, 23, 28, 31, NA, 46, NA, 13, NA, 33, 12, 4, 28,
34, 0, 24, 7, 31, 33, 37, 56, 41, 50, 55, 41, 15, 23, 26, 14,
27, 22, 41, 48, 19, 28, 11, 11, NA, 49, NA), broomTime = structure(c(1423970100,
1424122200, 1424136180, 1424035260, 1424141580, 1424122440, 1423274580,
1424129580, 1424146320, 1429129320, 1429032060, 1429142940, 1428705000,
1429142460, 1429128720, 1429204560, 1422909480, 1424137200, 1424042100,
1424149620, 1424131920, 1424108940, 1424144820, 1424040600, 1424119620,
1424148660, 1443593040, 1443657120, 1424125860, 1424223120, 1424235240,
1424232720, 1424234940, 1424234640, 1424230440, 1424115300, 1429208280,
1429131720, 1429148460, 1429151040, 1424129760, 1424125380, 1424123220,
1424137380, 1424115780, 1424219340, 1424131560, 1424233560, 1424224920,
1443640800), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("brooms",
"broomTime"), row.names = c(NA, 50L), class = "data.frame")
You can summarize by any amount of time interval by using cut within the group_by function in dplyr.
library(dplyr)
ans <- finaldf %>%
group_by(SeqDateTime = cut(SeqDateTime, breaks = "15 min")) %>%
summarize(broomsTot = sum(as.numeric(broomsTot), na.rm = TRUE))
head(ans)
Source: local data frame [6 x 2]
SeqDateTime broomsTot
(fctr) (dbl)
1 2015-01-01 02:00:00 0
2 2015-01-01 02:15:00 0
3 2015-01-01 02:30:00 0
4 2015-01-01 02:45:00 0
5 2015-01-01 03:00:00 0
6 2015-01-01 03:15:00 0
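To see what cut does with breaks = "15 min" in isolation, here is a small base R sketch. The times and vals vectors are made-up illustration data (45 consecutive minutes with values 1 to 45), not the data from the question, and median is used as the aggregate to match the original query.

```r
# Hypothetical data: 45 one-minute timestamps and a value for each
times <- as.POSIXct("2015-01-01 00:00:00", tz = "UTC") + 60 * (0:44)
vals <- as.numeric(1:45)

# Assign every timestamp to the 15-minute interval it falls in
bins <- cut(times, breaks = "15 min")

# Aggregate the values per interval
agg <- aggregate(vals, by = list(interval = bins), FUN = median)

print(table(bins))
print(agg)
```

Each of the three intervals holds 15 rows, and the medians are 8, 23 and 38. Note that cut returns a factor, so the interval column would need converting back (for example with as.POSIXct) if it is to be used as a time again.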
I can assure you that xts does not have a problem with your POSIXct time series. xts uses POSIXct for its internal time index.
Here's how to join df_stuff with a 1-minute series and then aggregate that result to a 15-minute series.
library(xts)
# create xts object
xts_stuff <- with(df_stuff, xts(brooms, broomTime))
# merge with empty xts object that contains a regular 1-minute index
xts_stuff_1min <- merge(xts_stuff, xts(,tseries))
# aggregate to 15-minutes
ep15 <- endpoints(xts_stuff_1min, "minutes", 15)
final_df <- period.apply(xts_stuff_1min, ep15, median, na.rm=TRUE)
