I have some data:
structure(list(date = structure(c(17888, 17888, 17888, 17888,
17889, 17889, 17891, 17891, 17891, 17891, 17891, 17892, 17894
), class = "Date"), type = structure(c(4L, 6L, 15L, 16L, 2L,
5L, 2L, 3L, 5L, 6L, 8L, 2L, 2L), .Label = c("aborted-live-lead",
"conversation-archived", "conversation-auto-archived", "conversation-auto-archived-store-offline-or-busy",
"conversation-claimed", "conversation-created", "conversation-dropped",
"conversation-restarted", "conversation-transfered", "cs-transfer-connected",
"cs-transfer-ended", "cs-transfer-failed", "cs-transfer-initiate",
"cs-transfer-request", "getnotified-requested", "lead-created",
"lead-expired"), class = "factor"), count = c(1L, 1L, 1L, 1L,
3L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L)), row.names = c(NA, -13L), class = c("tbl_df",
"tbl", "data.frame"))
It looks like this:
> head(dat)
# A tibble: 6 x 3
date type count
<date> <fct> <int>
1 2018-12-23 conversation-auto-archived-store-offline-or-busy 1
2 2018-12-23 conversation-created 1
3 2018-12-23 getnotified-requested 1
4 2018-12-23 lead-created 1
5 2018-12-24 conversation-archived 3
6 2018-12-24 conversation-claimed 1
For each unique type value, there is an associated count per day.
How can I count all of the values of each type (regardless of the date) and list them in a two-column data frame (in a format like this):
type count
------ ------
conversation-created 10
conversation-archived 4
lead-created 2
...
The reason for this is to show an overall count of each event type over the entire date range.
I presume that I have to use the select() function from dplyr, but I am sure I am missing something.
This is what I have so far. It sums every value in the count column into a single total, which isn't what I want, as I want it broken down by type:
dat %>%
  select(type, count) %>%
  summarise(count = sum(count)) %>%
  ungroup()
Seems like a combination of group_by and summarize with sum does the job:
dat %>% group_by(type) %>% summarise(count = sum(count))
# A tibble: 8 x 2
# type count
# <fct> <int>
# 1 conversation-archived 7
# 2 conversation-auto-archived 1
# 3 conversation-auto-archived-store-offline-or-busy 1
# 4 conversation-claimed 3
# 5 conversation-created 3
# 6 conversation-restarted 1
# 7 getnotified-requested 1
# 8 lead-created 1
There is no need for select, as summarise drops all the other variables anyway. Perhaps you are confusing select with group_by, which is what we want in this case: it makes summarise add up the values of count separately for each value of type.
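To make the difference concrete, here is a minimal sketch on a made-up four-row data frame (dat_small and its values are invented for illustration, not taken from the question). Without group_by, summarise collapses everything into one row; with it, you get one row per type.

```r
library(dplyr)

# Hypothetical toy data: two event types observed on two days
dat_small <- data.frame(
  date  = as.Date(c("2018-12-23", "2018-12-23", "2018-12-24", "2018-12-24")),
  type  = c("conversation-created", "lead-created",
            "conversation-created", "lead-created"),
  count = c(1L, 1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# No grouping: a single grand total
total <- dat_small %>% summarise(count = sum(count))

# Grouped by type: one total per event type, regardless of date
by_type <- dat_small %>%
  group_by(type) %>%
  summarise(count = sum(count))

print(total)
print(by_type)
```

The date column disappears in both results because summarise only keeps the grouping columns and the summaries you ask for.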
I'm trying to write code to partially automate the process of subsetting a temperature dataset using start/end date-times from a winter mortality dataset. The latter has a little over 100 observations, each of which would end up with one such temperature subset. I plan to calculate some temperature variables from each of these subsets and add them to the mortality dataset, but I'm hung up on the subsetting step.
Here's example data and my code (let me know if you have suggestions for making this a minimal reproducible example; I haven't posted here much yet):
# Temperature data dput..
tempd <- structure(list(date = structure(c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L), .Label = c("12/1/2014", "12/2/2014", "12/3/2014", "12/4/2014", "12/5/2014", "12/6/2014"), class = "factor"), time = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("0:00:00", "12:00:00"), class = "factor"), temp = c(3.274, -0.986, -0.088, 0.495, 6.23, 0.934, 0.715, -4.227, -1.584, 0.88, 1.967)), .Names = c("date", "time", "temp"), class = "data.frame", row.names = c(NA, -11L))
# and mortality data dput..
owmd <- structure(list(siteyear = structure(c(1L, 1L, 1L), .Label = "s1.y1", class = "factor"), winter = c(1415L, 1415L, 1415L), date = structure(1:3, .Label = c("12/1/2014", "12/3/2014", "12/5/2014"), class = "factor"), site = structure(c(1L, 1L, 1L), .Label = "s1", class = "factor"), mort = c(0.06651485, 0.120592869, 0.135272089)), .Names = c("siteyear", "winter", "date", "site", "mort"), class = "data.frame", row.names = c(NA, -3L))
EDIT:
In case I've oversimplified my temp dataset, I'll say that my real temperature datasets (there are 10 of them, one for each site-year combination I have) contain values of temperature at 15 minute intervals (i.e. 96/day). Importantly, I want these temp subsets to start and end at 12pm, so I need to be able to specify the time as well as the date at the subset limits (note the very first temp subset of the dataset might not be able to start at 12pm if the dataset itself begins at a later time)
So, the code..
library(tidyverse)
library(lubridate)
# Factorize winter and 'date-ize' date
owmd$winter <- as.factor(owmd$winter)
owmd$date <- as.Date(owmd$date, '%m/%d/%Y')
# Create start date (date value for the prior observation)
owmd %>%
tbl_df() %>%
mutate(sdate = lag(date, 1)) -> owmd
# Now the temperature dataset
# Factorize date, do *something* with time, and create datetime
tempd$date2 <- as.Date(tempd$date, '%m/%d/%Y')
tempd %>%
mutate(datetime = ymd_hms('2014-12-01 12:00:00') + c(0:10) * hours(12),
time2 = parse_time(tempd$time)) -> tempd
# write a function that creates, for each observation in owmd, a subset of the tempd data bounded by owmd$date and owmd$sdate ('start date')
subfun <- function(x,y) {
start <- owmd[(x-1),3]
end <- owmd[x,3]
period <- filter(y, date2 >= start & time2 >= '12:00:00' & date2 <= end & time2 <= '12:00:00')
}
# test it
subfun(3, tempd)
Finding the right subset conditions in period is where I'm hung up. I'm getting
Warning messages:
1: In evalq((date2 >= start & time2 >= "12:00:00") & (date2 <= end & :
Incompatible methods (">=.Date", "Ops.data.frame") for ">="
2: In evalq((date2 >= start & time2 >= "12:00:00") & (date2 <= end & :
Incompatible methods ("<=.Date", "Ops.data.frame") for "<="
Seems like it shouldn't be too hard to use owmd$date and owmd$sdate (start date) as bounds for subsets of the temperature dataset, but I haven't managed to figure out the right subset conditions. Would a different format for tempd$time help? I include tempd$datetime in case it could be used, but I didn't see how.
Any thoughts for a beginner are greatly appreciated.
Here's my session info:
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Thanks for providing reproducible data! There are a number of fixes that can make this easier and more readable; below is how I would approach it. Essentially, your goal is to figure out which temperature measurements fall within each mortality period so that you can do statistics on them. There are a few key realisations:
You want a function that accepts a dataframe, a column to put conditions on, and two conditions, but this is already what dplyr::filter does, so we don't really need another wrapper function;
It would be helpful to keep the subsets with the corresponding mortality data and start/end dates, so we should use a list-column to hold them;
We can concatenate 12:00:00 to the dates in owmd before parsing so that they are datetimes;
We can make a bunch of style changes (e.g. stick to lubridate parsers, since you already use the package).
This means that I can do this:
library(tidyverse)
library(lubridate)
tempd <- structure(list(date = structure(c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L), .Label = c("12/1/2014", "12/2/2014", "12/3/2014", "12/4/2014", "12/5/2014", "12/6/2014"), class = "factor"), time = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("0:00:00", "12:00:00"), class = "factor"), temp = c(3.274, -0.986, -0.088, 0.495, 6.23, 0.934, 0.715, -4.227, -1.584, 0.88, 1.967)), .Names = c("date", "time", "temp"), class = "data.frame", row.names = c(NA, -11L))
owmd <- structure(list(siteyear = structure(c(1L, 1L, 1L), .Label = "s1.y1", class = "factor"), winter = c(1415L, 1415L, 1415L), date = structure(1:3, .Label = c("12/1/2014", "12/3/2014", "12/5/2014"), class = "factor"), site = structure(c(1L, 1L, 1L), .Label = "s1", class = "factor"), mort = c(0.06651485, 0.120592869, 0.135272089)), .Names = c("siteyear", "winter", "date", "site", "mort"), class = "data.frame", row.names = c(NA, -3L))
mort <- owmd %>%
  mutate(
    winter = factor(winter),
    date = date %>% str_c(., " 12:00:00") %>% mdy_hms(),
    sdate = lag(date)
  )
temp <- tempd %>%
  mutate(datetime = str_c(date, " ", time) %>% mdy_hms(), date = mdy(date))
mort_subs <- mort %>%
  as_tibble() %>%
  mutate(
    subset = map2(sdate, date, ~ filter(temp, datetime >= .x & datetime <= .y))
  )
mort_subs
#> # A tibble: 3 x 7
#> siteyear winter date site mort sdate
#> <fct> <fct> <dttm> <fct> <dbl> <dttm>
#> 1 s1.y1 1415 2014-12-01 12:00:00 s1 0.0665 NA
#> 2 s1.y1 1415 2014-12-03 12:00:00 s1 0.121 2014-12-01 12:00:00
#> 3 s1.y1 1415 2014-12-05 12:00:00 s1 0.135 2014-12-03 12:00:00
#> # ... with 1 more variable: subset <list>
So now we have the right subset for each original row of owmd. Each element of the subset column contains a dataframe that is the subset of temp between the corresponding sdate and date. The first row has no sdate (it is NA), so its subset is empty, which is why it drops out of the unnested result below. The tidyverse has a lot of tools that let us work with this data easily; we can unnest, group_by and summarise to get a statistic per original row:
mort_subs %>%
unnest() %>%
group_by(siteyear, winter, sdate, date, site, mort) %>%
summarise(mean_temp = mean(temp))
#> # A tibble: 2 x 7
#> # Groups: siteyear, winter, sdate, date, site [?]
#> siteyear winter sdate date site mort
#> <fct> <fct> <dttm> <dttm> <fct> <dbl>
#> 1 s1.y1 1415 2014-12-01 12:00:00 2014-12-03 12:00:00 s1 0.121
#> 2 s1.y1 1415 2014-12-03 12:00:00 2014-12-05 12:00:00 s1 0.135
#> # ... with 1 more variable: mean_temp <dbl>
Or we can accomplish a similar thing by using map_dbl to iterate over the subset list-column:
mort_subs %>%
mutate(mean_temp = map_dbl(subset, ~ mean(.$temp)))
#> # A tibble: 3 x 8
#> siteyear winter date site mort sdate
#> <fct> <fct> <dttm> <fct> <dbl> <dttm>
#> 1 s1.y1 1415 2014-12-01 12:00:00 s1 0.0665 NA
#> 2 s1.y1 1415 2014-12-03 12:00:00 s1 0.121 2014-12-01 12:00:00
#> 3 s1.y1 1415 2014-12-05 12:00:00 s1 0.135 2014-12-03 12:00:00
#> # ... with 2 more variables: subset <list>, mean_temp <dbl>
Created on 2018-10-24 by the reprex package (v0.2.0).
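As for the warning in the question: my reading is that it comes from owmd[(x-1),3]. Once owmd has been through tbl_df(), single-bracket indexing returns a one-cell tibble rather than a Date value, so date2 >= start ends up comparing a Date vector with a data frame. A minimal sketch of the difference, on a made-up three-row tibble (owmd_tbl, start_tbl, start_vec and temp_dates are names invented here):

```r
library(tibble)

# Hypothetical stand-in for the mortality data, with date as the third column
owmd_tbl <- tibble(
  site = c("s1", "s1", "s1"),
  mort = c(0.067, 0.121, 0.135),
  date = as.Date(c("2014-12-01", "2014-12-03", "2014-12-05"))
)

# Single brackets on a tibble keep the tibble structure (a 1 x 1 tibble)
start_tbl <- owmd_tbl[2, 3]
print(class(start_tbl))

# Extracting the column first gives a plain Date value
start_vec <- owmd_tbl$date[2]
print(class(start_vec))

# Comparing against a Date value behaves as intended
temp_dates <- as.Date(c("2014-12-02", "2014-12-04"))
print(temp_dates >= start_vec)
```

Separately, requiring both time2 >= '12:00:00' and time2 <= '12:00:00' can only keep rows at exactly noon, which is another reason to compare full datetimes as in the answer above.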
I have a data set of 300k+ cases where a customer id may be repeated several times. Each customer has a date and rank associated with it as well. I'd like to keep only one row per customer id: the one with the latest date and, if an id has duplicate rows for that date, the one with the rank closest to 1. An example of my data is like this:
Customer.ID Date Rank
576293 8/13/2012 2
576293 11/16/2015 6
581252 11/22/2013 4
581252 11/16/2011 6
581252 1/4/2016 5
581600 1/12/2015 3
581600 1/12/2015 2
582560 4/13/2016 1
591674 3/21/2012 6
586334 3/30/2014 1
Ideal outcome would then be like this:
Customer.ID Date Rank
576293 11/16/2015 6
581252 1/4/2016 5
581600 1/12/2015 2
582560 4/13/2016 1
591674 3/21/2012 6
586334 3/30/2014 1
With the desired output of the OP clarified:
We can also do this with base R, which should be faster than the dplyr approach below using group_by(Customer.ID), since we do not have to loop over every unique Customer.ID:
df <- df[order(-df$Customer.ID,as.Date(df$Date, format="%m/%d/%Y"),-df$Rank, decreasing=TRUE),]
res <- df[!duplicated(df$Customer.ID),]
Notes:
First, sort by Customer.ID in ascending order followed by Date in descending order followed by Rank in ascending order.
Remove the duplicates in Customer.ID so that only the first row for each Customer.ID is kept.
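The two steps in the notes can be sketched on the ten rows from the question. This variant spells out the direction of each sort key instead of combining decreasing=TRUE with negated columns; it should give the same ordering, but it is not the code above verbatim, and date_num and sorted are helper names introduced here.

```r
# The ten example rows from the question, with Date still as text
df <- data.frame(
  Customer.ID = c(576293L, 576293L, 581252L, 581252L, 581252L,
                  581600L, 581600L, 582560L, 591674L, 586334L),
  Date = c("8/13/2012", "11/16/2015", "11/22/2013", "11/16/2011", "1/4/2016",
           "1/12/2015", "1/12/2015", "4/13/2016", "3/21/2012", "3/30/2014"),
  Rank = c(2L, 6L, 4L, 6L, 5L, 3L, 2L, 1L, 6L, 1L),
  stringsAsFactors = FALSE
)

# Numeric version of the date so that it can be negated for a descending sort
date_num <- as.numeric(as.Date(df$Date, format = "%m/%d/%Y"))

# Customer.ID ascending, Date descending, Rank ascending
sorted <- df[order(df$Customer.ID, -date_num, df$Rank), ]

# Keep only the first row seen for each Customer.ID
res <- sorted[!duplicated(sorted$Customer.ID), ]
print(res)
```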
The result using your posted data as a data frame df (without converting the Date column) in ascending order for Customer.ID:
print(res)
## Customer.ID Date Rank
##2 576293 11/16/2015 6
##5 581252 1/4/2016 5
##7 581600 1/12/2015 2
##8 582560 4/13/2016 1
##10 586334 3/30/2014 1
##9 591674 3/21/2012 6
Data:
df <- structure(list(Customer.ID = c(591674L, 586334L, 582560L, 581600L,
581252L, 576293L), Date = c("3/21/2012", "3/30/2014", "4/13/2016",
"1/12/2015", "1/4/2016", "11/16/2015"), Rank = c(6L, 1L, 1L,
2L, 5L, 6L)), .Names = c("Customer.ID", "Date", "Rank"), row.names = c(9L,
10L, 8L, 7L, 5L, 2L), class = "data.frame")
If you want to keep only the latest date (followed by lower rank) row for each Customer.ID, you can do the following using dplyr:
library(dplyr)
res <- df %>% group_by(Customer.ID) %>% arrange(desc(Date),Rank) %>%
summarise_all(funs(first)) %>%
ungroup() %>% arrange(Customer.ID)
Notes:
group_by Customer.ID and sort using arrange by Date in descending order and Rank by ascending order.
summarise_all to keep only the first row from each Customer.ID.
Finally, ungroup and sort by Customer.ID to get your desired result.
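Note that funs() has since been deprecated in dplyr (as of 0.8.0), so the pipeline above may produce warnings on current versions. One alternative that avoids it, offered as a sketch rather than a drop-in replacement, is to sort once and keep the first row per id with distinct(.keep_all = TRUE):

```r
library(dplyr)

# The ten example rows from the question, with Date converted to Date class
df <- data.frame(
  Customer.ID = c(576293L, 576293L, 581252L, 581252L, 581252L,
                  581600L, 581600L, 582560L, 591674L, 586334L),
  Date = as.Date(c("8/13/2012", "11/16/2015", "11/22/2013", "11/16/2011",
                   "1/4/2016", "1/12/2015", "1/12/2015", "4/13/2016",
                   "3/21/2012", "3/30/2014"), format = "%m/%d/%Y"),
  Rank = c(2L, 6L, 4L, 6L, 5L, 3L, 2L, 1L, 6L, 1L)
)

res <- df %>%
  # Latest date first within each id, ties broken by the lowest rank
  arrange(Customer.ID, desc(Date), Rank) %>%
  # distinct() keeps the first row it meets for each Customer.ID
  distinct(Customer.ID, .keep_all = TRUE)

print(res)
```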
Using your data as a data frame df with the Date column converted to the Date class:
print(res)
## # A tibble: 6 x 3
## Customer.ID Date Rank
## <int> <date> <int>
##1 576293 2015-11-16 6
##2 581252 2016-01-04 5
##3 581600 2015-01-12 2
##4 582560 2016-04-13 1
##5 586334 2014-03-30 1
##6 591674 2012-03-21 6
Data:
df <- structure(list(Customer.ID = c(576293L, 576293L, 581252L, 581252L,
581252L, 581600L, 581600L, 582560L, 591674L, 586334L), Date = structure(c(15565,
16755, 16031, 15294, 16804, 16447, 16447, 16904, 15420, 16159
), class = "Date"), Rank = c(2L, 6L, 4L, 6L, 5L, 3L, 2L, 1L,
6L, 1L)), .Names = c("Customer.ID", "Date", "Rank"), row.names = c(NA,
-10L), class = "data.frame")
I have the following data,
data
date ID value1 value2
2016-04-03 1 0 1
2016-04-10 1 6 2
2016-04-17 1 7 3
2016-04-24 1 2 4
2016-04-03 2 1 5
2016-04-10 2 5 6
2016-04-17 2 9 7
2016-04-24 2 4 8
Now I want to group by ID and find the mean of value2 and latest value of value1. Latest value in the sense, I would like to get the value of latest date i.e. here I would like to get the value1 for corresponding value of 2016-04-24 for each IDs. My output should be like,
ID max_value1 mean_value2
1 2 2.5
2 4 6.5
The following is the command I am using,
data %>% group_by(ID) %>% summarize(mean_value2 = mean(value2))
But I am not sure how to do the first one. Can anybody help me in getting the latest value of value1 while summarizing in dplyr?
One way would be the following. My assumption here is that date is a Date object. First, sort the data by ID and date using arrange. Then, group the data by ID. In summarize, you can use last() to take the last value1 for each ID.
arrange(data,ID,date) %>%
group_by(ID) %>%
summarize(mean_value2 = mean(value2), max_value1 = last(value1))
# ID mean_value2 max_value1
# <int> <dbl> <int>
#1 1 2.5 2
#2 2 6.5 4
DATA
data <- structure(list(date = structure(c(16894, 16901, 16908, 16915,
16894, 16901, 16908, 16915), class = "Date"), ID = c(1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L), value1 = c(0L, 6L, 7L, 2L, 1L, 5L, 9L,
4L), value2 = 1:8), .Names = c("date", "ID", "value1", "value2"
), row.names = c(NA, -8L), class = "data.frame")
Here is an option with data.table
library(data.table)
setDT(data)[, .(max_value1 = value1[which.max(date)],
mean_value2 = mean(value2)) , by = ID]
# ID max_value1 mean_value2
#1: 1 2 2.5
#2: 2 4 6.5
You can do this using the function nth in dplyr which finds the nth value of a vector.
data %>% group_by(ID) %>%
summarize(max_value1 = nth(value1, n = length(value1)), mean_value2 = mean(value2))
This is based on the assumption that the data is ordered by date as in the example; otherwise use arrange as discussed above.
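If you would rather not depend on row order at all, the which.max(date) idea from the data.table answer also works inside summarize. A small sketch, using the example data but with the rows deliberately shuffled (the shuffled object is introduced here only to show that the order does not matter):

```r
library(dplyr)

data <- data.frame(
  date = as.Date(rep(c("2016-04-03", "2016-04-10", "2016-04-17", "2016-04-24"), 2)),
  ID = rep(1:2, each = 4),
  value1 = c(0L, 6L, 7L, 2L, 1L, 5L, 9L, 4L),
  value2 = 1:8
)

# Scramble the rows so that the latest date is no longer last within each ID
shuffled <- data[c(4, 1, 3, 2, 8, 5, 7, 6), ]

out <- shuffled %>%
  group_by(ID) %>%
  summarize(
    # Pick value1 from the row holding the latest date in each group
    max_value1 = value1[which.max(date)],
    mean_value2 = mean(value2)
  )

print(out)
```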
I am working on GPS data right now. The position of the animal has been collected every 4 hours where possible. The data looks like this (the XY data is not shown here):
ID TIME POSIXTIME date_only
1 1 12:00 2005-05-08 12:00:00 2005-05-08
2 2 16:01 2005-05-08 16:01:00 2005-05-08
3 3 20:01 2005-05-08 20:01:00 2005-05-08
4 4 0:01 2005-05-09 00:01:00 2005-05-09
5 5 8:01 2005-05-09 08:01:00 2005-05-09
6 6 12:01 2005-05-09 12:01:00 2005-05-09
7 7 16:02 2005-05-09 16:02:00 2005-05-09
8 8 20:02 2005-05-09 20:02:00 2005-05-09
9 9 0:01 2005-05-10 00:01:00 2005-05-10
10 10 4:00 2005-05-10 04:00:00 2005-05-10
I would now like to take only the first location per day. In most cases, this will be at 0:01. However, sometimes it will be 4:01 or even later, as there is missing data.
How can I get only the first locations per day? They should be included in a new dataframe. I tried it with:
tapply(as.numeric(Kandularaw$TIME),list(Kandularaw$date_only),min, na.rm=T)
However, this did not work, as R produces strange values when TIME is converted to numeric.
Is it possible to do it with an ifelse statement? If yes, roughly what would it look like?
I am grateful for every help I can get. Thank you for your efforts.
Cheers,
Jan
I am guessing you really want a row number as an index into a position record. If you know that these rows are ordered by date-time, and you are getting satisfactory group splits with that second argument to tapply (however it was created), then try this:
idx <- tapply(1:NROW(Kandularaw), Kandularaw$date_only, "[", 1)
If you want records (rows) in that same dataframe then just use:
Kandularaw[ idx, ]
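On the "strange values": assuming TIME was read in as a factor (I can't tell from the question), as.numeric() returns the internal level codes rather than anything to do with clock time. A small sketch with four made-up times shows the effect, along with one way to get minutes since midnight if a numeric time is really needed (TIME and minutes are illustrative names):

```r
# Hypothetical times stored as a factor, as read.csv often used to do by default
TIME <- factor(c("12:00", "16:01", "0:01", "4:00"))

# Level codes follow the sorted order of the labels as text, not the clock time
codes <- as.numeric(TIME)
print(levels(TIME))
print(codes)

# Convert to minutes since midnight by splitting the text on ":"
parts <- strsplit(as.character(TIME), ":", fixed = TRUE)
minutes <- sapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]))
print(minutes)
```

Indexing by row number, as above, sidesteps the conversion entirely.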
I would approach this from a simpler point of view. First, ensure that POSIXTIME is one of the "POSIX" classes. Then order the data by POSIXTIME. At this point we can use any of the split-apply-combine idioms to do what you want, making use of the head() function. Here I use aggregate():
Using this example data set:
dat <- structure(list(ID = 1:10, TIME = structure(c(4L, 6L, 8L, 1L,
3L, 5L, 7L, 9L, 1L, 2L), .Label = c("00:01:00", "04:00:00", "08:01:00",
"12:00:00", "12:01:00", "16:01:00", "16:02:00", "20:01:00", "20:02:00"
), class = "factor"), POSIXTIME = structure(1:10, .Label = c("2005/05/08 12:00:00",
"2005/05/08 16:01:00", "2005/05/08 20:01:00", "2005/05/09 00:01:00",
"2005/05/09 08:01:00", "2005/05/09 12:01:00", "2005/05/09 16:02:00",
"2005/05/09 20:02:00", "2005/05/10 00:01:00", "2005/05/10 04:00:00"
), class = "factor"), date_only = structure(c(1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L, 3L), .Label = c("2005/05/08", "2005/05/09",
"2005/05/10"), class = "factor")), .Names = c("ID", "TIME", "POSIXTIME",
"date_only"), class = "data.frame", row.names = c(NA, 10L))
First, get POSIXTIME and date_only in the correct formats:
dat <- transform(dat,
POSIXTIME = as.POSIXct(POSIXTIME, format = "%Y/%m/%d %H:%M:%S"),
date_only = as.Date(date_only, format = "%Y/%m/%d"))
Next, order by POSIXTIME:
dato <- with(dat, dat[order(POSIXTIME), ])
The final step is to use aggregate() to split the data by date_only and use head() to select the first row:
aggregate(dato[,1:3], by = list(date = dato$`date_only`), FUN = head, n = 1)
Notice that I pass the n argument of head() the value 1, indicating that it should extract only the first row of each day's observations. Because we sorted by datetime and split on date, the first row is the first observation of each day. Do be aware of rounding issues, however.
The final step results in:
> aggregate(dato[,1:3], by = list(date = dato$`date_only`), FUN = head, n = 1)
date ID TIME POSIXTIME
1 2005-05-08 1 12:00:00 2005-05-08 12:00:00
2 2005-05-09 4 00:01:00 2005-05-09 00:01:00
3 2005-05-10 9 00:01:00 2005-05-10 00:01:00
Instead of dato[,1:3] refer to whatever columns in your original data set contain the variables (locations?) you wanted.
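If all you need is the first row per day with every column kept, a shorter base R route is to order by time and drop rows whose date has already been seen. This is a sketch of that alternative, not the aggregate() method above; it rebuilds a cut-down version of the example data with the time zone fixed to UTC (an assumption made here so that dates and datetimes agree), and first_per_day is a name introduced for illustration.

```r
# Cut-down version of the example data: ID plus the timestamp
dat <- data.frame(
  ID = 1:10,
  POSIXTIME = as.POSIXct(
    c("2005-05-08 12:00:00", "2005-05-08 16:01:00", "2005-05-08 20:01:00",
      "2005-05-09 00:01:00", "2005-05-09 08:01:00", "2005-05-09 12:01:00",
      "2005-05-09 16:02:00", "2005-05-09 20:02:00", "2005-05-10 00:01:00",
      "2005-05-10 04:00:00"),
    tz = "UTC"
  )
)

# Derive the calendar date in the same time zone as the timestamps
dat$date_only <- as.Date(dat$POSIXTIME, tz = "UTC")

# Sort by time, then keep the first row encountered for each date
dato <- dat[order(dat$POSIXTIME), ]
first_per_day <- dato[!duplicated(dato$date_only), ]
print(first_per_day)
```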