R: Having trouble aggregating data over dates

I am trying to aggregate a table using ddply().
My table looks like this:
Year Month Count
2000 Jan 1
2000 Jan 2
2001 Feb 2
2001 Feb 1
I want to sum up the counts based on year and month. So I would have 2000, Jan, 3 and 2001, Feb, 3.
My code is
ddply(df,???,sum(Count))
I am not sure how to add in multiple variables.

We group by the variables 'Year' and 'Month' and get the sum of 'Count', specifying summarise from the plyr package.
Using plyr
library(plyr)
ddply(df, .(Year, Month), plyr::summarise, Count=sum(Count))
# Year Month Count
#1 2000 Jan 3
#2 2001 Feb 3
Or we can use the formula method of aggregate from base R.
aggregate(Count~., df, FUN=sum)
# Year Month Count
#1 2001 Feb 3
#2 2000 Jan 3
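Note that the formula Count ~ . groups by every other column in the data frame. A minimal sketch, assuming a hypothetical extra column 'Note' that should not take part in the grouping, names the grouping variables explicitly and then sorts the result by year:

```r
# Sample data from the question plus a hypothetical extra column 'Note'
# (the extra column is an assumption made for illustration only).
df <- data.frame(
  Year  = c(2000L, 2000L, 2001L, 2001L),
  Month = c("Jan", "Jan", "Feb", "Feb"),
  Count = c(1L, 2L, 2L, 1L),
  Note  = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Name the grouping variables instead of using '.', so 'Note' is ignored.
res <- aggregate(Count ~ Year + Month, data = df, FUN = sum)

# aggregate() orders rows by the grouping columns; re-sort by Year, then Month.
res <- res[order(res$Year, res$Month), ]
rownames(res) <- NULL
print(res)
```

With Count ~ . the 'Note' column would become a grouping variable too, and every row would end up in its own group.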
Or with dplyr, we group by the variables and summarise
library(dplyr)
df %>%
group_by(Year, Month) %>%
dplyr::summarise(Count=sum(Count))
# Year Month Count
# (int) (chr) (int)
#1 2000 Jan 3
#2 2001 Feb 3
Or we convert the 'data.frame' to 'data.table' (setDT(df)), group by the columns, and get the sum of 'Count'.
library(data.table)
setDT(df)[, list(Count=sum(Count)), .(Year, Month)]
# Year Month Count
#1: 2000 Jan 3
#2: 2001 Feb 3
NOTE: When several loaded packages export a function with the same name (here both plyr and dplyr export summarise), it is safer to call it as packagename::function (plyr::summarise and dplyr::summarise).
data
df <- structure(list(Year = c(2000L, 2000L, 2001L, 2001L),
Month = c("Jan",
"Jan", "Feb", "Feb"), Count = c(1L, 2L, 2L, 1L)), .Names = c("Year",
"Month", "Count"), class = "data.frame",
row.names = c(NA, -4L))

Related

Creating min. date and max. date columns based on Quarter, Month, YTD

I have a data frame like the following:
Frequency Period Period No. Year
Monthly 1 1 2018
Quarterly Q1 3 2018
YTD YTD-Feb 2 2019
Based on these columns, I'd like to add a min. date and max. date column so that the data frame looks like this:
Frequency Period Period No. Year Min. Date Max. Date
Monthly 1 1 2018 1/1/2018 1/31/2018
Quarterly Q1 3 2018 1/1/2018 3/31/2018
YTD YTD-Feb 2 2019 1/1/2019 2/28/2019
If we need the min and max based on the 'PeriodNo.' column, create a sequence of Dates by month from the 'Year' column, then extract the min and max:
library(dplyr)
library(purrr)
library(lubridate)
library(stringr)
df1 %>%
mutate(date = map2(as.Date(str_c(Year, '-01-01')),
PeriodNo., ~ seq(.x, length.out = .y, by = '1 month')),
Min.Date = do.call(c, map(date, min)),
Max.Date = do.call(c, map(date, ~ceiling_date(max(.x), 'month')-1))) %>%
select(-date)
# Frequency Period PeriodNo. Year Min.Date Max.Date
#1 Monthly 1 1 2018 2018-01-01 2018-01-31
#2 Quarterly Q1 3 2018 2018-01-01 2018-03-31
#3 YTD YTD-Feb 2 2019 2019-01-01 2019-02-28
Or an option with Map (months(1) below is the lubridate period constructor, so lubridate still needs to be loaded)
lst1 <- Map(function(x, y) seq(as.Date(paste0(x, "-01-01")),
length.out = y, by = '1 month'), df1$Year, df1$PeriodNo.)
df1$Min.Date <- do.call(c, lapply(lst1, min))
df1$Max.Date <- do.call(c, lapply(lst1, function(x) (max(x) + months(1) -1)) )
data
df1 <- structure(list(Frequency = c("Monthly", "Quarterly", "YTD"),
Period = c("1", "Q1", "YTD-Feb"), PeriodNo. = c(1L, 3L, 2L
), Year = c(2018L, 2018L, 2019L)), class = "data.frame",
row.names = c(NA,
-3L))
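The same idea can be sketched in base R only. This is a minimal sketch under the same assumption the answer makes, namely that every period starts on 1 January of 'Year' and spans 'PeriodNo.' months; the helper last_day_of_month is a hypothetical name introduced here, not part of any package.

```r
# Data from the question
df1 <- data.frame(
  Frequency = c("Monthly", "Quarterly", "YTD"),
  Period    = c("1", "Q1", "YTD-Feb"),
  PeriodNo. = c(1L, 3L, 2L),
  Year      = c(2018L, 2018L, 2019L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last calendar day of a given month.
# It steps to the first day of the following month and subtracts one day.
last_day_of_month <- function(year, month) {
  first <- as.Date(sprintf("%d-%02d-01", year, month))
  seq(first, by = "month", length.out = 2)[2] - 1
}

# Assumption: each period starts on 1 January and ends in month 'PeriodNo.'
df1$Min.Date <- as.Date(sprintf("%d-01-01", df1$Year))
df1$Max.Date <- do.call(c, Map(last_day_of_month, df1$Year, df1$PeriodNo.))
print(df1)
```

Because the end date is derived from the first day of the next month, leap years and months of different lengths need no special handling.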

Reading an excel file containing dates and non-date entries in a column

In an excel file, there are two columns labelled "id" and "date" as in the following data frame:
df <-
structure(
list(
id = c(1L, 2L, 3L, 4L,5L),
date = c("10/2/2013", "-5/3/2015", "-11/-4/2019", "3/10/2019","")
),
.Names = c("id", "date"),
class = "data.frame",
row.names = c(NA,-5L)
)
The "date" column has both dates, e.g. 10/2/2013, and non-date entries, e.g. -5/3/2015 and -11/-4/2019, as well as blank cells. I am looking for a way to read the excel file into R such that the dates and the non-dates are preserved and the blank cells are replaced by NAs.
I have tried to use the function "read_excel" and argument "col_types" as follows:
df1<- data.frame(read_excel("df.xlsx", col_types = c("numeric", "date")))
However, this reads the dates and replaces the non-dates with NAs. I have tried other options of col_types e.g. "guess" and "skip" but these did not work for me. Any help on this is much appreciated.
Here's an approach using tidyr::separate and dplyr to filter out negative months so that only positive months are converted to "yearmon" data with zoo:
library(tidyverse)
df %>%
separate(date, c("day", "month", "year"),
sep = "/", remove = FALSE, convert = TRUE) %>%
mutate(month = if_else(month < 0, NA_integer_, month)) %>%
mutate(date2 = zoo::as.yearmon(paste(year, month, sep = "-")))
# id date day month year date2
#1 1 10/2/2013 10 2 2013 Feb 2013
#2 2 -5/3/2015 -5 3 2015 Mar 2015
#3 3 -11/-4/2019 -11 NA 2019 <NA>
#4 4 3/10/2019 3 10 2019 Oct 2019
#5 5 NA NA NA <NA>
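A base R sketch of the "keep the text, flag the real dates" idea follows. It assumes the column is already available in R as character strings, as in the question's data frame, and that the entries are in day/month/year order as in the answer above; the column names date_parsed and is_date are hypothetical.

```r
# Data frame from the question
df <- data.frame(
  id   = 1:5,
  date = c("10/2/2013", "-5/3/2015", "-11/-4/2019", "3/10/2019", ""),
  stringsAsFactors = FALSE
)

# Replace blank entries by NA, keeping every other entry as text
df$date[df$date == ""] <- NA

# Try to parse each entry as day/month/year (an assumption about the format);
# entries that do not match, such as "-5/3/2015", become NA here
df$date_parsed <- as.Date(df$date, format = "%d/%m/%Y")

# Flag which original entries are real dates
df$is_date <- !is.na(df$date_parsed)
print(df)
```

The original text stays in the 'date' column, so the non-date entries are preserved while the parsed column can be used for date arithmetic.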

How to undummy a dataset with R

This is the library I am using for creating dummies
install.packages("fastDummies")
library(fastDummies)
This is the dataset
winners <- data.frame(
city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"),
year = c(1990, 2000, 1990),
crime = 1:3)
Let's then create dummies out of these cities:
dummy_cols(winners, select_columns = c("city"))
The results are
city year crime city_SaoPaulito city_NewAmsterdam city_BeatifulCow
1 SaoPaulito 1990 1 1 0 0
2 NewAmsterdam 2000 2 0 1 0
3 BeatifulCow 1990 3 0 0 1
So the question is how to return to the previous dataset. Any ideas?
Thanks in advance!
The dummy columns can also be created with dcast
library(data.table)
dcast(setDT(winners), crime ~ city, length)
If we need to get the input, it would be
subset(df1, select = 1:3)
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
Or with melt
melt(setDT(df1), measure = patterns("_"))[value == 1, .(city, year, crime)]
# city year crime
#1: SaoPaulito 1990 1
#2: NewAmsterdam 2000 2
#3: BeatifulCow 1990 3
data
df1 <- structure(list(city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"
), year = c(1990L, 2000L, 1990L), crime = 1:3, city_SaoPaulito = c(1L,
0L, 0L), city_NewAmsterdam = c(0L, 1L, 0L), city_BeatifulCow = c(0L,
0L, 1L)), class = "data.frame", row.names = c("1", "2", "3"))
If you are going to have only one city as 1 in each row, you can just skip the dummy columns
df[, 1:3]
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
If a row can have multiple cities, one way using dplyr and tidyr::gather is
library(dplyr)
df %>%
tidyr::gather(key, value, starts_with("city_")) %>%
filter(value == 1) %>%
select(-value, -key)
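When the original 'city' column is no longer available, it can be rebuilt from the dummy columns. A minimal base R sketch, assuming exactly one dummy equals 1 in each row and that all dummy columns share the "city_" prefix (dummy_names is a hypothetical variable name):

```r
# Dummy-coded data without the original 'city' column
df1 <- data.frame(
  year  = c(1990L, 2000L, 1990L),
  crime = 1:3,
  city_SaoPaulito   = c(1L, 0L, 0L),
  city_NewAmsterdam = c(0L, 1L, 0L),
  city_BeatifulCow  = c(0L, 0L, 1L)
)

# Names of the dummy columns (assumption: they all start with "city_")
dummy_names <- grep("^city_", names(df1), value = TRUE)

# For each row, find the position of the column holding the 1
idx <- max.col(as.matrix(df1[dummy_names]), ties.method = "first")

# Strip the prefix to recover the original category label
df1$city <- sub("^city_", "", dummy_names[idx])

restored <- df1[c("city", "year", "crime")]
print(restored)
```

If a row could contain several 1s, max.col would only report the first one, so the gather approach above is the safer choice in that case.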

In old data frame, order by two columns and store first of each row into new data frame

I have a data frame that contains 3 columns and I'd like to use the columns date and location to obtain the most recent observation of each location and store it in a new data frame.
> old.data
date location amount
2014 NY 1
2015 NJ 2
2016 NY 3
2015 NM 4
2013 NY 5
2014 NJ 6
2016 NM 7
2016 NJ 8
2015 NY 9
> new.data
date location amount
2016 NJ 8
2016 NM 7
2016 NY 3
Using dplyr:
library(dplyr)
new.data <- old.data %>% arrange(desc(date), location) %>% group_by(location) %>% slice(1)
new.data
Source: local data frame [3 x 2]
Groups: location [3]
date location
<int> <fctr>
1 2016 NJ
2 2016 NM
3 2016 NY
Using data.table:
library(data.table)
# Code updated by Arun
setDT(old.data)[order(-date, location), .(date = date[1L]), by = location]
location date
1: NJ 2016
2: NM 2016
3: NY 2016
Data
old.data <- structure(list(date = c(2014L, 2015L, 2016L, 2015L, 2013L, 2014L,
2016L, 2016L, 2015L), location = structure(c(3L, 1L, 3L, 2L,
3L, 1L, 2L, 1L, 3L), .Label = c("NJ", "NM", "NY"), class = "factor")), .Names = c("date",
"location"), class = "data.frame", row.names = c(NA, -9L))
Update (as OP changed the original dataframe)
The dplyr solution is still valid.
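For data.table, this is the only way I could think of (note that it keeps the rows whose date equals the overall maximum, which matches the desired output here only because every location has a 2016 row):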
setDT(old.data)[order(-date, location), colnames(old.data), with = F][date == max(date)]
date location amount
1: 2016 NJ 8
2: 2016 NM 7
3: 2016 NY 3
Using .SD and .SDcols as suggested by Arun
# adding more data
old.data$amount <- 1:9
old.data$a <- 10:18
# Retain all columns
keep_cols <- colnames(old.data)[-2] # Remove the column which is mentioned in by
setDT(old.data)[order(-date, location), .SD[1L], by = location, .SDcols = keep_cols]
# or assigning colnames to .SDcols directly:
setDT(old.data)[order(-date, location), .SD[1L], by = location, .SDcols = (colnames(old.data)[-2])]
location date amount a
1: NJ 2016 8 17
2: NM 2016 7 16
3: NY 2016 3 12
What about this:
library(dplyr)
date <- c(2014, 2015, 2016, 2015, 2013, 2014, 2016, 2016, 2015)
location <- c("NY", "NJ", "NY", "NM", "NY", "NJ", "NM", "NJ", "NY")
old.data <- data.frame(date, location)
new.data <- group_by(old.data, location)
new.data <- summarise(new.data, year = max(date))
Using the data.table package (dat is the question's data frame, called old.data above):
library(data.table)
setDT(dat)[order(-date), .SD[1L], by = location]
# location date
# 1: NY 2016
# 2: NM 2016
# 3: NJ 2016
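For comparison, the same "sort, then keep the first row per group" logic can be sketched in base R without extra packages. It uses the question's data including the 'amount' column; the intermediate name sorted is hypothetical.

```r
# Data from the question, including the 'amount' column
old.data <- data.frame(
  date     = c(2014L, 2015L, 2016L, 2015L, 2013L, 2014L, 2016L, 2016L, 2015L),
  location = c("NY", "NJ", "NY", "NM", "NY", "NJ", "NM", "NJ", "NY"),
  amount   = 1:9,
  stringsAsFactors = FALSE
)

# Sort by location, then by date descending, so the most recent
# observation of each location comes first within its group
sorted <- old.data[order(old.data$location, -old.data$date), ]

# duplicated() marks every row after the first one per location; drop those
new.data <- sorted[!duplicated(sorted$location), ]
rownames(new.data) <- NULL
print(new.data)
```

All columns are carried along automatically, so no column list has to be maintained when the data frame gains more variables.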

Combine data depending on the value of one column

I have a data frame in R
year group sales
1 2000 1 20
2 2001 1 25
3 2002 1 23
4 2003 1 30
5 2001 2 50
6 2002 2 55
I want to group the data by 'group', or create some kind of object: one array for each group that stores the year and the sales. Then I will try to save it as a JSON file with this structure:
[{"group": 1, "sales":[[2000,20],[2001, 25], [2002,23], [2003, 30]]},
{"group": 2, "sales":[[2001, 50], [2002,55]]}]
Is it possible to do it automatically?
Thanks a lot
We can use data.table to paste the 'year' and 'sales' columns grouped by 'group'. We convert the 'data.frame' to 'data.table' (setDT(df1)). Grouped by 'group', we use sprintf to paste 'year' and 'sales' together inside square brackets ([]), collapse the output to a single string with toString (a wrapper for paste(..., collapse=', ')), wrap the result in another pair of [], and use toJSON.
library(jsonlite)
library(data.table)
toJSON(setDT(df1)[, list(sales= paste0('[',toString(sprintf('[%d,%d]',
year, sales)),']')), by = group])
#[{"group":1,"sales":"[[2000,20], [2001,25], [2002,23], [2003,30]]"},
#{"group":2,"sales":"[[2001,50], [2002,55]]"}]
The paste by group can also be done in base R. We split the dataset by the 'group' column to create a list, loop through the list with lapply, and paste the 'year' and 'sales' columns as described above. For each list element we create a data.frame with the first element of 'group' and the string from the paste step, rbind the list elements into a single data.frame, and then use toJSON.
toJSON(
do.call(rbind,
lapply(
split(df1, df1$group),
function(x) data.frame(group=x$group[1L],
sales=paste0('[',
toString(sprintf('[%d,%d]', x$year, x$sales)),
']')))))
data
df1 <- structure(list(year = c(2000L, 2001L, 2002L, 2003L, 2001L, 2002L
), group = c(1L, 1L, 1L, 1L, 2L, 2L), sales = c(20L, 25L, 23L,
30L, 50L, 55L)), .Names = c("year", "group", "sales"),
class = "data.frame", row.names = c(NA, -6L))
Since the other answer uses data.table, I thought it would be an interesting exercise to try to do this in dplyr. This is not the optimal way, but it illustrates do(), which I'm not convinced is well enough documented. I have also shown the more appropriate summarise solution.
df <-read.table(textConnection('
year group sales expenses
2000 1 20 19
2001 1 25 19
2002 1 23 20
2003 1 30 15
2001 2 50 27
2002 2 55 30
'),header=TRUE)
library(dplyr)
library(jsonlite)
df %>%
group_by( group ) %>%
do(
sales = group_by(.,year) %>% select(sales) %>% apply(MARGIN=2,identity),
expenses = group_by(.,year) %>% select(expenses) %>% apply(MARGIN=2,identity)
)
df %>%
group_by( group ) %>%
summarise(
sales = list(apply( data.frame(year,sales), MARGIN=2, identity ))
,expenses = list(apply( data.frame(year,expenses), MARGIN=2, identity ))
) %>% jsonlite::toJSON()
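In the pasted-string approach shown earlier, toJSON emits 'sales' as a quoted string rather than a nested array. A rough sketch that assembles the requested JSON text by hand in base R is given below. It assumes all values are integers, so no escaping or quoting is needed; pairs, by_group and objects are hypothetical names. For anything beyond this simple case a JSON library is the safer choice.

```r
# Data from the question
df1 <- data.frame(
  year  = c(2000L, 2001L, 2002L, 2003L, 2001L, 2002L),
  group = c(1L, 1L, 1L, 1L, 2L, 2L),
  sales = c(20L, 25L, 23L, 30L, 50L, 55L)
)

# One "[year,sales]" pair per row
pairs <- sprintf("[%d,%d]", df1$year, df1$sales)

# Collapse the pairs of each group into one comma-separated string;
# the result is a character vector named by group
by_group <- vapply(split(pairs, df1$group), paste, character(1), collapse = ",")

# Wrap each group in a JSON object, then wrap all objects in an array
objects <- sprintf('{"group":%s,"sales":[%s]}', names(by_group), by_group)
json <- paste0("[", paste(objects, collapse = ","), "]")
cat(json, "\n")
```

The resulting string can be written to a file with writeLines(json, "sales.json"), where the file name is only an example.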
