How many counts are in each type in each year? With either group_by() or split() [duplicate] - r

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 4 years ago.
I have a data frame df as follows:
df
Code Time Country Type
1 n001 2000 France 1
2 n002 2001 Japan 5
3 n003 2003 USA 2
4 n004 2004 USA 2
5 n005 2004 Canada 1
6 n006 2005 Britain 2
7 n007 2005 USA 1
8 n008 2005 USA 2
9 n010 2005 USA 1
10 n011 2005 Canada 1
11 n012 2005 USA 2
12 n013 2005 USA 5
13 n014 2005 Canada 1
14 n015 2006 USA 2
15 n017 2006 Canada 1
16 n018 2006 Britain 1
17 n019 2006 Canada 1
18 n020 2006 USA 1
...
where Type is the type of news, and Time is the year when the news was published.
My aim is to count the number of each type of news each year.
I was thinking about a result like this:
...
$2005
Type: 1 Count: 4
Type: 2 Count: 3
Type: 5 Count: 1
$2006
Type: 1 Count: 4
...
I used the following code:
gp = group_by(df, Time)
summarise(gp, table(Time))
Error in summarise_impl(.data, dots) :
Evaluation error: unique() applies only to vectors.
Then I tried split(), thinking it might separate the data frame by year so that I could count the number of each type per year:
split(df, 'Time')
$Time
Code Time Country Type
1 n001 2000 France 1
2 n002 2001 Japan 5
3 n003 2003 USA 2
4 n004 2004 USA 2
...
Everything is almost the same, apart from the "$Time" sign.
I was wondering what I did wrong, and how to fix it.

We can split the Type column by Time and calculate its frequencies with table.
lapply(split(df$Type, df$Time), table)
#$`2000`
#1
#1
#$`2001`
#5
#1
#$`2003`
#2
#1
#$`2004`
#1 2
#1 1
#$`2005`
#1 2 5
#4 3 1
#$`2006`
#1 2
#4 1
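As for why split(df, 'Time') appeared to do nothing: split() expects the grouping values themselves, not a column name. The string 'Time' is recycled into a one-level factor, so every row falls into a single group called $Time. A minimal base R sketch, using a small made-up stand-in for df (the values are illustrative, not the question's data):

```r
# Small stand-in for the question's df (illustrative values only)
df <- data.frame(
  Time = c(2004, 2004, 2005, 2005, 2005),
  Type = c(2, 1, 2, 1, 1)
)

# A literal string is recycled into a one-level grouping factor,
# so all rows end up in a single group named "Time"
wrong <- split(df, "Time")
names(wrong)

# Passing the column itself yields one data frame per year
right <- split(df, df$Time)
names(right)

# Frequency of each Type within each year
counts <- lapply(right, function(d) table(d$Type))
counts
```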

How about this?
library(dplyr)
library(tidyr)
df %>%
group_by(Time, Type) %>%
count() %>%
spread(Type, n)
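If you would rather avoid extra packages, a base R cross-tabulation gives a similar year-by-type layout. This is only a sketch on a small made-up data frame. Note that xtabs() fills absent combinations with 0, whereas spread() leaves them as NA by default.

```r
# Illustrative data, not the question's df
df <- data.frame(
  Time = c(2004, 2004, 2005, 2005, 2005, 2005),
  Type = c(2, 1, 2, 1, 1, 5)
)

# Wide layout: one row per year, one column per type
wide <- xtabs(~ Time + Type, data = df)
wide

# Long layout: one row per (year, type) pair that actually occurs
long <- as.data.frame(wide, responseName = "n")
long <- long[long$n > 0, ]
long
```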

You could use something like this: split on Time, then group by Type and tally the result (map() comes from purrr).
df %>%
split(.$Time) %>%
map(~ group_by(., Type) %>% tally())
......
$`2004`
# A tibble: 2 x 2
Type n
<int> <int>
1 1 1
2 2 1
$`2005`
# A tibble: 3 x 2
Type n
<int> <int>
1 1 4
2 2 3
3 5 1
$`2006`
# A tibble: 2 x 2
......
Or use summarise instead of tally if you want a column called count instead of n:
df %>%
split(.$Time) %>%
map(~ group_by(., Type) %>% summarise(count = n()))

Related

Calculate Sum of Random observations as sum per week in R

I have a dataset of random, sometimes infrequent, events that I want to count as a sum per week. Due to the randomness they are not linear so other examples I have tried so far are not applicable.
The data is similar to this:
df_date <- data.frame( Name = c("Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim",
"Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue"),
Dates = c("2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28",
"2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28"),
Event = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1) )
What I'm trying to do is create a new table that contains the sum of events per week in the calendar year.
In this case producing something like this:
Name Week Events
Jim 1 3
Sue 1 3
Jim 2 0
Sue x ... x
and so on...
Update (OP's request for multiple years):
We could also use isoweek from lubridate instead of week.
Or we could add the year as follows:
df_date %>%
as_tibble() %>%
mutate(Week = week(ymd(Dates))) %>%
mutate(Year = year(ymd(Dates))) %>%
count(Name, Year, Week)
We could use lubridate's week function after converting the character Dates to dates with lubridate's ymd function.
Then we can use count, which is short for group_by(Name, Week) %>% summarise(n = n()):
library(dplyr)
library(lubridate)
df_date %>%
as_tibble() %>%
mutate(Week = week(ymd(Dates))) %>%
count(Name, Week)
Name Week n
<chr> <dbl> <int>
1 Jim 1 3
2 Jim 3 2
3 Jim 5 1
4 Jim 6 2
5 Jim 7 1
6 Jim 9 1
7 Sue 1 3
8 Sue 3 2
9 Sue 5 1
10 Sue 6 2
11 Sue 7 1
12 Sue 9 1
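For reference, lubridate documents week() as the number of complete seven-day periods since January 1st, plus one, so it does not follow the ISO 8601 calendar that isoweek() uses. A base R sketch of the difference on three of the question's dates (it assumes the "%V" format code is supported by format() on your platform):

```r
dates <- as.Date(c("2010-01-01", "2010-01-17", "2010-02-28"))

# 0-based day of the year: Jan 1 is 0
yday0 <- as.POSIXlt(dates)$yday

# Complete seven-day periods since Jan 1, plus one (the week() convention)
week_simple <- yday0 %/% 7 + 1

# ISO 8601 week number; Jan 1, 2010 belongs to the last ISO week of 2009
week_iso <- as.integer(format(dates, "%V"))

data.frame(dates, week_simple, week_iso)
```

With ISO weeks, group on the ISO year as well, otherwise early-January dates can be mixed up with late-December ones.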
Here is an approach that gets you each week (as computed by lubridate's week()) for each individual, with zeros when there are no events for that week for that individual:
get_dates_df <- function(d) {
data.frame(date = seq(min(d, na.rm=T),max(d,na.rm=T),1)) %>%
mutate(Year=year(date), Week=week(date)) %>%
distinct(Year, Week)
}
df_date = df_date %>% mutate(Dates=lubridate::ymd(Dates))
left_join(
full_join(distinct(df_date %>% select(Name)), get_dates_df(df_date$Dates), by=character()),
df_date %>%
group_by(Name,Year=year(Dates), Week=week(Dates)) %>%
summarize(Events = sum(Event), .groups="drop")
) %>%
mutate(Events=if_else(is.na(Events),0,Events))
Output:
Name Year Week Events
1 Jim 2010 1 3
2 Jim 2010 2 0
3 Jim 2010 3 2
4 Jim 2010 4 0
5 Jim 2010 5 1
6 Jim 2010 6 2
7 Jim 2010 7 1
8 Jim 2010 8 0
9 Jim 2010 9 1
10 Sue 2010 1 3
11 Sue 2010 2 0
12 Sue 2010 3 2
13 Sue 2010 4 0
14 Sue 2010 5 1
15 Sue 2010 6 2
16 Sue 2010 7 1
17 Sue 2010 8 0
18 Sue 2010 9 1
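The same zero-filling idea can be sketched in base R: build every Name/Week combination with expand.grid(), merge the observed totals onto it, and replace the resulting NA with 0. The tiny events data frame below is made up for illustration and is assumed to already have a Week column.

```r
# Illustrative events; Week is assumed to be computed beforehand
events <- data.frame(
  Name  = c("Jim", "Jim", "Jim", "Sue"),
  Week  = c(1, 1, 3, 2),
  Event = 1
)

# Totals for the Name/Week pairs that actually occur
observed <- aggregate(Event ~ Name + Week, data = events, FUN = sum)

# Every Name crossed with every week in the observed range
grid <- expand.grid(
  Name = unique(events$Name),
  Week = seq(min(events$Week), max(events$Week)),
  stringsAsFactors = FALSE
)

# Keep all grid rows; pairs with no events come back as NA, then become 0
filled <- merge(grid, observed, by = c("Name", "Week"), all.x = TRUE)
filled$Event[is.na(filled$Event)] <- 0
filled <- filled[order(filled$Name, filled$Week), ]
filled
```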

replacing NA with next available number within a group

I have a relatively large dataset, and I want to replace each NA price for a given year and ID with the value available in the next year for the same ID. Here is a reproducible example:
ID <- c(1,2,3,2,2,3,1,4,5,5,1,2,2)
year <- c(2000,2001,2002,2002,2003,2007,2001,2000,2005,2006,2002,2004,2005)
value <- c(1000,20000,30000,NA,40000,NA,6000,4000,NA,20000,7000,50000,60000)
data <- data.frame(ID, year, value)
ID year value
1 1 2000 1000
2 2 2001 20000
3 3 2002 30000
4 2 2002 NA
5 2 2003 40000
6 3 2007 NA
7 1 2001 6000
8 4 2000 4000
9 5 2005 NA
10 5 2006 20000
11 1 2002 7000
12 2 2004 50000
13 2 2005 60000
So, for example for ID=2 we have following value and years:
ID year value
2 2001 20000
2 2002 NA
2 2003 40000
2 2004 50000
2 2005 60000
So in the above case, NA should be replaced with 40000 (Values in next year). And the same story for other IDs.
the final result should be in this form:
ID year value
1 2000 1000
1 2001 6000
1 2002 7000
2 2001 20000
2 2002 40000
2 2003 40000
2 2004 50000
2 2005 60000
3 2007 NA
4 2000 4000
5 2005 20000
5 2006 20000
Please note that for ID=3 since there is no next year available, we want to leave it as is. That's why it's in the form of NA
I appreciate if you can suggest a solution
Thanks
dplyr solution
library(tidyverse)
data2 <- data %>%
dplyr::group_by(ID) %>%
dplyr::arrange(year) %>%
dplyr::mutate(replaced_value = ifelse(is.na(value), lead(value), value))
print(data2)
# A tibble: 13 x 4
# Groups: ID [5]
ID year value replaced_value
<dbl> <dbl> <dbl> <dbl>
1 1 2000 1000 1000
2 4 2000 4000 4000
3 2 2001 20000 20000
4 1 2001 6000 6000
5 3 2002 30000 30000
6 2 2002 NA 40000
7 1 2002 7000 7000
8 2 2003 40000 40000
9 2 2004 50000 50000
10 5 2005 NA 20000
11 2 2005 60000 60000
12 5 2006 20000 20000
13 3 2007 NA NA
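Note that lead(value) takes the next row within the ID, whatever its year. If the replacement should come strictly from year + 1, one possible base R sketch looks the next year up by key. The helper names key, next_key, next_val and res are mine; data is rebuilt from the question.

```r
ID <- c(1,2,3,2,2,3,1,4,5,5,1,2,2)
year <- c(2000,2001,2002,2002,2003,2007,2001,2000,2005,2006,2002,2004,2005)
value <- c(1000,20000,30000,NA,40000,NA,6000,4000,NA,20000,7000,50000,60000)
data <- data.frame(ID, year, value)

# Identify each row by ID and year, and build the key of "same ID, next year"
key      <- paste(data$ID, data$year)
next_key <- paste(data$ID, data$year + 1)

# Value recorded for the next year; NA when that ID has no such row
next_val <- data$value[match(next_key, key)]

# Only fill rows whose own value is missing
res <- data
res$value <- ifelse(is.na(data$value), next_val, data$value)
res <- res[order(res$ID, res$year), ]
res
```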
Try this tidyverse approach using a flag to check sequential years and fill() to complete data:
library(tidyverse)
#Data
ID <- c(1,2,3,2,2,3,1,4,5,5,1,2,2)
year <- c(2000,2001,2002,2002,2003,2007,2001,2000,2005,2006,2002,2004,2005)
value <- c(1000,20000,30000,NA,40000,NA,6000,4000,NA,20000,7000,50000,60000)
data <- data.frame(ID, year, value)
#Code
data2 <- data %>% arrange(ID,year) %>%
group_by(ID) %>%
mutate(Flag=c(1,diff(year))) %>%
fill(value,.direction = 'up') %>%
mutate(value=ifelse(Flag!=1,NA,value)) %>% select(-Flag)
Output:
# A tibble: 13 x 3
# Groups: ID [5]
ID year value
<dbl> <dbl> <dbl>
1 1 2000 1000
2 1 2001 6000
3 1 2002 7000
4 2 2001 20000
5 2 2002 40000
6 2 2003 40000
7 2 2004 50000
8 2 2005 60000
9 3 2002 30000
10 3 2007 NA
11 4 2000 4000
12 5 2005 20000
13 5 2006 20000
You could do:
library(dplyr)
data %>%
group_by(ID) %>%
mutate(value = coalesce(value, as.integer(sapply(pmin(year + 1, max(year)), function(x) value[year == x])))) %>%
arrange(ID, year)
Output:
# A tibble: 13 x 3
# Groups: ID [5]
ID year value
<dbl> <dbl> <dbl>
1 1 2000 1000
2 1 2001 6000
3 1 2002 7000
4 2 2001 20000
5 2 2002 40000
6 2 2003 40000
7 2 2004 50000
8 2 2005 60000
9 3 2002 30000
10 3 2007 NA
11 4 2000 4000
12 5 2005 20000
13 5 2006 20000
Now in case you want to replace NA with any value that follows immediately - i.e. even if the year is not necessarily consecutive - you could do:
library(tidyverse)
data %>%
arrange(ID, year) %>%
group_by(ID, idx = cumsum(is.na(value))) %>%
fill(value, .direction = 'up') %>%
ungroup %>%
select(-idx)
This is much more straightforward (and likely much faster) in data.table:
library(data.table)
setDT(data)[order(ID, year), ][
, value := nafill(value, type = 'nocb'), by = .(ID, cumsum(is.na(value)))]

r conditional subtract number

I am trying to do the following logic to create 'subtract' column.
I have years from 1986-2014 and around 100 firms.
year firm count sum_of_year subtract
1986 A 1 2 2
1986 B 1 2 4
1987 A 2 4 5
1987 C 1 4 2
1987 D 1 4 5
1988 C 3 5
1988 E 2 5
That is, if a firm i at t appears in t+1, then subtract its count at t+1 from the sum_of_year at t+1,
if a firm i does not appear in t+1, then just put sum_of_year at t+1 as shown in the sample.
I am having difficulties in creating this conditional code.
How can I do this in a generalized version?
Thank you for your help.
One way using dplyr with the help of tidyr::complete. We complete the missing combinations of year and firm and fill count with 0. For each year, we subtract each firm's count from the sum of counts for that year, and finally, for each firm, we take that value from the next year using lead.
library(dplyr)
df %>%
tidyr::complete(year, firm, fill = list(count = 0)) %>%
group_by(year) %>%
mutate(n = sum(count) - count) %>%
group_by(firm) %>%
mutate(subtract = lead(n)) %>%
filter(count != 0) %>%
select(-n)
# year firm count sum_of_year subtract
# <int> <fct> <dbl> <int> <dbl>
#1 1986 A 1 2 2
#2 1986 B 1 2 4
#3 1987 A 2 4 5
#4 1987 C 1 4 2
#5 1987 D 1 4 5
#6 1988 C 3 5 NA
#7 1988 E 2 5 NA
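The same rule can also be written as two lookups in base R, which may make the logic easier to follow: fetch next year's total, fetch the firm's own count next year (0 if it is absent), and subtract. This is a sketch on the question's sample rows; the names firms, sums, next_total and next_count are mine.

```r
firms <- data.frame(
  year  = c(1986, 1986, 1987, 1987, 1987, 1988, 1988),
  firm  = c("A", "B", "A", "C", "D", "C", "E"),
  count = c(1, 1, 2, 1, 1, 3, 2)
)

# Total count per year
sums <- aggregate(count ~ year, data = firms, FUN = sum)

# Total for year t+1; NA when t+1 is not in the data
next_total <- sums$count[match(firms$year + 1, sums$year)]

# The firm's own count in year t+1, or 0 if it does not appear then
idx <- match(paste(firms$firm, firms$year + 1),
             paste(firms$firm, firms$year))
next_count <- ifelse(is.na(idx), 0, firms$count[idx])

firms$subtract <- next_total - next_count
firms
```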

Assign unique ID based on two columns [duplicate]

This question already has answers here:
Add ID column by group [duplicate]
(4 answers)
How to create a consecutive group number
(13 answers)
Closed 5 years ago.
I have a dataframe (df) that looks like this:
School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000
And I would like to create a person ID column so that df looks like this:
ID School Student Year
1 A 10 1999
1 A 10 2000
2 A 20 1999
2 A 20 2000
2 A 20 2001
3 B 10 1999
3 B 10 2000
In other words, the ID variable indicates which person it is in the dataset, accounting for both Student number and School membership (here we have 3 students total).
I did df$ID <- df$Student and tried to request the value +1 if c("School", "Student") was unique. It isn't working. Help appreciated.
We can do this in base R without doing any group by operation
df$ID <- cumsum(!duplicated(df[1:2]))
df
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
NOTE: Assuming that 'School' and 'Student' are ordered
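If the rows are not ordered, a hedged alternative in base R is to build a key from both columns and number the keys by first appearance. The sketch below uses a small shuffled example of my own; the "|" separator is an assumption that it never occurs in the data.

```r
# Illustrative data where the same student reappears after another one
df <- data.frame(
  School  = c("A", "A", "B", "A", "B"),
  Student = c(10, 20, 10, 10, 10)
)

# cumsum(!duplicated()) only counts new combinations seen so far,
# so a repeated combination further down gets the wrong ID
id_cumsum <- cumsum(!duplicated(df[c("School", "Student")]))

# Matching each key against the unique keys does not depend on row order
key <- paste(df$School, df$Student, sep = "|")
df$ID <- match(key, unique(key))
df
```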
Or using tidyverse
library(dplyr)
df %>%
mutate(ID = group_indices_(df, .dots=c("School", "Student")))
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
As @radek mentioned, in recent versions (dplyr 0.8.0) we get a notification that group_indices_ is deprecated; use group_indices instead:
df %>%
mutate(ID = group_indices(., School, Student))
Group by School and Student, then assign group id to ID variable.
library('data.table')
df[, ID := .GRP, by = .(School, Student)]
# School Student Year ID
# 1: A 10 1999 1
# 2: A 10 2000 1
# 3: A 20 1999 2
# 4: A 20 2000 2
# 5: A 20 2001 2
# 6: B 10 1999 3
# 7: B 10 2000 3
Data:
df <- fread('School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000')

Summarizing a dataframe by date and group

I am trying to summarize a data set by a few different factors. Below is an example of my data:
household<-c("household1","household1","household1","household2","household2","household2","household3","household3","household3")
date<-c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value<-c(1:9)
type<-c("income","water","energy","income","water","energy","income","water","energy")
df<-data.frame(household,date,value,type)
household date value type
1 household1 1999-05-10 100 income
2 household1 1999-05-25 200 water
3 household1 1999-10-12 300 energy
4 household2 1999-02-02 400 income
5 household2 1999-08-20 500 water
6 household2 1999-02-19 600 energy
7 household3 1999-07-01 700 income
8 household3 1999-10-13 800 water
9 household3 1999-01-01 900 energy
I want to summarize the data by month. Ideally the resulting data set would have 12 rows per household (one for each month) and a column for each category of expenditure (water, energy, income) that is a sum of that month's total.
I tried starting by adding a column with a short date, and then I was going to filter for each type and create a separate data frame for the summed data per transaction type. I was then going to merge those data frames together to have the summarized df. I attempted to summarize it using ddply, but it aggregated too much, and I can't keep the household level info.
ddply(df,.(shortdate),summarize,mean_value=mean(value))
shortdate mean_value
1 14/07 15.88235
2 14/09 5.00000
3 14/10 5.00000
4 14/11 21.81818
5 14/12 20.00000
6 15/01 10.00000
7 15/02 12.50000
8 15/04 5.00000
Any help would be much appreciated!
It sounds like what you are looking for is a pivot table. I like to use reshape::cast for these types of tables. If there is more than one value returned for a given expenditure type for a given household/year/month combination, this will sum those values. If there is only one value, it returns the value. The "sum" argument is not required but only placed there to handle exceptions. I think if your data is clean you shouldn't need this argument.
hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3")
date <- c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value <- c(1:9)
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy")
df <- data.frame(hh, date, value, type)
# Load lubridate library, add month and year
library(lubridate)
df$month <- month(df$date)
df$year <- year(df$date)
# Load reshape library, run cast from reshape, creates pivot table
library(reshape)
dfNew <- cast(df, hh+year+month~type, value = "value", sum)
> dfNew
hh year month energy income water
1 hh1 1999 4 3 0 0
2 hh1 1999 10 0 1 0
3 hh1 1999 11 0 0 2
4 hh2 1999 2 0 4 0
5 hh2 1999 3 6 0 0
6 hh2 1999 6 0 0 5
7 hh3 1999 1 9 0 0
8 hh3 1999 4 0 7 0
9 hh3 1999 8 0 0 8
Try this:
df$ym<-zoo::as.yearmon(as.Date(df$date), "%y/%m")
library(dplyr)
df %>% group_by(ym,type) %>%
summarise(mean_value=mean(value))
Source: local data frame [9 x 3]
Groups: ym [?]
ym type mean_value
<S3: yearmon> <fctr> <dbl>
1 jan 1999 income 1
2 jun 1999 energy 3
3 jul 1999 energy 6
4 jul 1999 water 2
5 ago 1999 income 4
6 set 1999 energy 9
7 set 1999 income 7
8 nov 1999 water 5
9 dez 1999 water 8
Edit: the wide format:
reshape2::dcast(dfr, ym ~ type)
ym energy income water
1 jan 1999 NA 1 NA
2 jun 1999 3 NA NA
3 jul 1999 6 NA 2
4 ago 1999 NA 4 NA
5 set 1999 9 7 NA
6 nov 1999 NA NA 5
7 dez 1999 NA NA 8
If I understood your requirement correctly (from the description in the question), this is what you are looking for:
library(dplyr)
library(tidyr)
df %>% mutate(date = lubridate::month(date)) %>%
complete(household, date = 1:12) %>%
spread(type, value) %>% group_by(household, date) %>%
mutate(Total = sum(energy, income, water, na.rm = T)) %>%
select(household, Month = date, energy:water, Total)
#Source: local data frame [36 x 6]
#Groups: household, Month [36]
#
# household Month energy income water Total
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 household1 1 NA NA NA 0
#2 household1 2 NA NA NA 0
#3 household1 3 NA NA 200 200
#4 household1 4 NA NA NA 0
#5 household1 5 NA NA NA 0
#6 household1 6 NA NA NA 0
#7 household1 7 NA NA NA 0
#8 household1 8 NA NA NA 0
#9 household1 9 300 NA NA 300
#10 household1 10 NA NA NA 0
# ... with 26 more rows
Note: I used the same df you provided in the question. The only change I made was the value column. Instead of 1:9, I used seq(100, 900, 100)
If I got it wrong, please let me know and I will delete my answer. I will add an explanation of what's going on if this is correct.
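For comparison, a base R sketch of the same monthly pivot: turning the month into a factor with levels 1 to 12 makes xtabs() keep every month, including those without transactions. The three rows below are a made-up subset shaped like the question's data, and absent cells come out as 0 rather than NA.

```r
# Illustrative subset shaped like the question's df
df <- data.frame(
  household = c("household1", "household1", "household2"),
  date      = as.Date(c("1999-05-10", "1999-05-25", "1999-02-02")),
  value     = c(100, 200, 400),
  type      = c("income", "water", "income")
)

# Month as a factor with all 12 levels, so empty months are kept
df$month <- factor(as.integer(format(df$date, "%m")), levels = 1:12)

# Sum of value for every household x month x type combination
tab <- xtabs(value ~ household + month + type, data = df)

# Flat view: one row per household and month, one column per type
ftable(tab)
```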
