Cumulative sum for 2 criteria in R

I have a data frame in which I want to calculate cumulative sums according to 2 criteria:
dfdata = data.frame(car = c("toyota", "toyota", "toyota", "toyota", "toyota",
                            "honda", "honda", "honda", "honda",
                            "lada", "lada", "lada", "lada"),
                    year = c(2000, 2000, 2001, 2001, 2002, 2001, 2001, 2002, 2002, 2003, 2004, 2005, 2006),
                    id = c("a", "b", "a", "c", "a", "d", "d", "d", "e", "f", "f", "f", "f"))
Here is what the data looks like:
dfdata
car year id
1 toyota 2000 a
2 toyota 2000 b
3 toyota 2001 a
4 toyota 2001 c
5 toyota 2002 a
6 honda 2001 d
7 honda 2001 d
8 honda 2002 d
9 honda 2002 e
10 lada 2003 f
11 lada 2004 f
12 lada 2005 f
13 lada 2006 f
Imagine I was observing cars passing by, and that the plate on each car is its "id", so cars with the same id are the exact same car. I want three things:
1. the number of cars per company I've seen in each year;
2. the cumulative sum of cars per company I've seen across the years;
3. the cumulative sum of cars per company I've seen more than once (counting the ones I've seen again in the same year as well as in other years, AND another column counting only the ones I've seen again in other years).
Here is how I got points 1 and 2:
dfdata %>%
  group_by(car, year) %>%
  dplyr::summarise(nb = n()) %>%
  dplyr::mutate(cs = cumsum(nb)) %>%
  ungroup()
nb is the number of cars from a certain manufacturer I've seen in a particular year. cs is the cumulative sum of the cars across the years.
# A tibble: 9 x 4
car year nb cs
<fct> <dbl> <int> <int>
1 honda 2001 2 2
2 honda 2002 2 4
3 lada 2003 1 1
4 lada 2004 1 2
5 lada 2005 1 3
6 lada 2006 1 4
7 toyota 2000 2 2
8 toyota 2001 2 4
9 toyota 2002 1 5
But notice that I've lost the id column. How can I get the number of cars that I've seen multiple times with the same id?
The final output should be based on grouping by id (to answer point 3):
car year nb cs curetrap curetrap.no.same.year
1 honda 2001 2 2 1 0
2 honda 2002 2 4 2 1
3 lada 2003 1 1 0 0
4 lada 2004 1 2 1 1
5 lada 2005 1 3 2 2
6 lada 2006 1 4 3 3
7 toyota 2000 2 2 0 0
8 toyota 2001 2 4 1 1
9 toyota 2002 1 5 2 2
This is because "honda" has been seen 2 times in 2001 and 2 times in 2002, so the cumulative sum is 2 in 2001 and 2 + 2 = 4 in 2002. Within 2001 I saw the honda "d" twice, meaning that I "recaptured" the 2001 honda "d", hence the 1 in curetrap for 2001. In 2002 I recaptured the honda "d" again, so the cumulative sum increased to 2. curetrap.no.same.year is the same thing, except that I want to ignore the recapture of the honda "d" within 2001, since it happened in the same year.
How is it possible to do that? Since I'm losing the id information, do I need to do it in 2 steps?
So far this is what I have:
tab.df = cbind(table(dfdata$id, dfdata$year),
               car = as.character(dfdata[match(unique(dfdata$id), table = dfdata$id), "car"]))
df.df = as.data.frame(tab.df)
2000 2001 2002 2003 2004 2005 2006 car
a 1 1 1 0 0 0 0 toyota
b 1 0 0 0 0 0 0 toyota
c 0 1 0 0 0 0 0 toyota
d 0 2 1 0 0 0 0 honda
e 0 0 1 0 0 0 0 honda
f 0 0 0 1 1 1 1 lada
This shows, for each id, how many times I saw that car in each year.

You can factor the problem into 2 steps: first add binary variables to your original dataset that flag the records you want to count, then simply compute sum and cumsum of these flags.
The following code gives the result you want:
dfdata %>%
  group_by(car, id) %>%
  arrange(year, .by_group = TRUE) %>%
  dplyr::mutate(already_seen = row_number() > 1,
                already_seen_diff_year = year > year[1]) %>%
  group_by(car, year) %>%
  dplyr::summarise(nb = n(),
                   cs = nb,
                   curetrap = sum(already_seen),
                   curetrap.no.same.year = sum(already_seen_diff_year)) %>%
  dplyr::mutate_at(vars(cs, curetrap, curetrap.no.same.year), cumsum) %>%
  ungroup()
NB: duplicating the variable with cs = nb is just a trick to make the subsequent call to mutate_at easy to write.
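If you would rather avoid the duplication trick, here is a minimal sketch of the same pipeline with explicit cumsum calls instead of mutate_at (equivalent logic; only the position of the cs column changes):
library(dplyr)
dfdata %>%
  group_by(car, id) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(already_seen = row_number() > 1,               # any recapture of this id
         already_seen_diff_year = year > year[1]) %>%   # recapture in a later year only
  group_by(car, year) %>%
  summarise(nb = n(),
            curetrap = sum(already_seen),
            curetrap.no.same.year = sum(already_seen_diff_year)) %>%
  mutate(cs = cumsum(nb),                               # summarise leaves the result grouped by car
         curetrap = cumsum(curetrap),
         curetrap.no.same.year = cumsum(curetrap.no.same.year)) %>%
  ungroup()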

Related

Year sum dataframe per label

Given a data structure like this one:
dtest <- data.frame(label = c("yahoo", "google", "yahoo", "yahoo", "google", "google", "yahoo", "yahoo"),
                    year = c(2000, 2001, 2000, 2001, 2003, 2003, 2003, 2003))
how can I extract a new data frame like this one?
doutput <- data.frame(label = c("yahoo", "yahoo", "yahoo", "yahoo", "google", "google", "google", "google"),
                      year = c(2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003),
                      volume = c(2, 1, 0, 3, 0, 1, 0, 2))
> doutput
label year volume
1 yahoo 2000 2
2 yahoo 2001 1
3 yahoo 2002 0
4 yahoo 2003 3
5 google 2000 0
6 google 2001 1
7 google 2002 0
8 google 2003 2
One way is with dplyr:
library(dplyr)
dtest %>%
  group_by(label, year) %>%
  tally(name = "volume")
# A tibble: 5 x 3
# Groups: label [2]
label year volume
<fct> <dbl> <int>
1 google 2001 1
2 google 2003 2
3 yahoo 2000 2
4 yahoo 2001 1
5 yahoo 2003 2
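Note that group_by() + tally() returns only the label/year combinations that actually occur, while the desired doutput also contains the zero rows. A minimal sketch filling them in with tidyr::complete (assuming reasonably recent dplyr and tidyr):
library(dplyr)
library(tidyr)
dtest %>%
  count(label, year, name = "volume") %>%                             # per-combination counts
  complete(label, year = full_seq(year, 1), fill = list(volume = 0))  # add missing years as 0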
Here is a solution with base R:
as.data.frame(table(transform(dtest,
year = factor(year, levels = seq(min(year), max(year))))))
The result:
label year Freq
1 google 2000 0
2 yahoo 2000 2
3 google 2001 1
4 yahoo 2001 1
5 google 2002 0
6 yahoo 2002 0
7 google 2003 2
8 yahoo 2003 2

How many counts are in each type in each year? With either group_by() or split() [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 4 years ago.
I have a data frame df as follows:
df
Code Time Country Type
1 n001 2000 France 1
2 n002 2001 Japan 5
3 n003 2003 USA 2
4 n004 2004 USA 2
5 n005 2004 Canada 1
6 n006 2005 Britain 2
7 n007 2005 USA 1
8 n008 2005 USA 2
9 n010 2005 USA 1
10 n011 2005 Canada 1
11 n012 2005 USA 2
12 n013 2005 USA 5
13 n014 2005 Canada 1
14 n015 2006 USA 2
15 n017 2006 Canada 1
16 n018 2006 Britain 1
17 n019 2006 Canada 1
18 n020 2006 USA 1
...
where Type is the type of news, and Time is the year when the news was published.
My aim is to count the number of each type of news each year.
I was thinking about a result like this:
...
$2005
Type: 1 Count: 4
Type: 2 Count: 3
Type: 5 Count: 1
$2006
Type: 1 Count: 4
...
I used the following code:
gp = group_by(df, Time)
summarise(gp, table(Time))
Error in summarise_impl(.data, dots) :
Evaluation error: unique() applies only to vectors.
Then I tried split( ), thinking it may be able to separate the dataframe by year so I could count the number of each type by year
split(df, 'Time')
$Time
Code Time Country Type
1 n001 2000 France 1
2 n002 2001 Japan 5
3 n003 2003 USA 2
4 n004 2004 USA 2
...
Everything is almost the same, apart from the "$Time" sign.
I was wondering what I did wrong, and how to fix it.
We can split the Type column by Time and calculate its frequency with table.
lapply(split(df$Type, df$Time), table)
#$`2000`
#1
#1
#$`2001`
#5
#1
#$`2003`
#2
#1
#$`2004`
#1 2
#1 1
#$`2005`
#1 2 5
#4 3 1
#$`2006`
#1 2
#4 1
How about this? (spread comes from tidyr)
library(dplyr)
library(tidyr)
df %>%
  group_by(Time, Type) %>%
  count() %>%
  spread(Type, n)
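For reference, the wide layout that spread() produces is essentially a contingency table, which base R can build directly; a minimal sketch:
# Cross-tabulate Time against Type; rows are years, columns are news types
as.data.frame.matrix(table(df$Time, df$Type))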
You could use something like this: split on Time, then group by Type and tally the result (map comes from purrr).
library(dplyr)
library(purrr)
df %>%
  split(.$Time) %>%
  map(~ group_by(., Type) %>% tally())
......
$`2004`
# A tibble: 2 x 2
Type n
<int> <int>
1 1 1
2 2 1
$`2005`
# A tibble: 3 x 2
Type n
<int> <int>
1 1 4
2 2 3
3 5 1
$`2006`
# A tibble: 2 x 2
......
Or use summarise instead of tally if you want a column called count instead of n:
df %>%
  split(.$Time) %>%
  map(~ group_by(., Type) %>% summarise(count = n()))

How to find observations whose dummy variable changes from 1 to 0 (and not viceversa) in a df in r

I have a survey composed of n individuals; each individual is present more than one time in the survey (panel). I have a variable pens, which is a dummy that takes value 1 if the individual invests in a complementary pension form. For example:
df <- data.frame(year = c(2002, 2002, 2004, 2004, 2006, 2008),
                 id = c(1, 2, 1, 2, 3, 3),
                 y.b = c(1950, 1943, 1950, 1943, 1966, 1966),
                 sex = c("F", "M", "F", "M", "M", "M"),
                 income = c(100000, 55000, 88000, 66000, 12000, 24000),
                 pens = c(0, 1, 1, 0, 1, 1))
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
where id is the individual, y.b is year of birth, pens is the dummy variable regarding complementary pension.
I want to know if there are individuals that invested in a complementary pension form in year t but didn't hold it any more in year t+2 (the survey is conducted every two years). In this way I want to know how many people had a complementary pension form but gave it up before retirement (for example, for economic reasons).
I tried with this command:
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
which(df$x=="0")
and this actually gives me the individuals whose pens variable changed over time (the command checks whether a variable is constant in time). For this reason I find individuals whose pens variable changed from 0 (didn't have a complementary pension) in year t to 1 in year t+2 and vice versa; but I am interested only in individuals whose pens variable was 1 (had a complementary pension) in year t and 0 in year t+2.
If I use this command with the df I get that for id 1 and 2 the variable x is 0 (pens variable isn't constant), but I'd need to find a way to get just id 2 (whose pens variable changed from 1 to 0).
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
which(df$x=="0")
year id pens x
1 2002 1 0 0
2 2002 2 1 0
3 2004 1 1 0
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
(for the sake of simplicity I omitted the other variables)
So the desired output is:
year id pens x
1 2002 1 0 1
2 2002 2 1 0
3 2004 1 1 1
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
only id 2 has x=0 since the pens variable changed from 1 to 0.
Thanks in advance
This assigns 1 to the id's for which there is a decline in pens and 0 otherwise.
transform(d.d, x = ave(pens, id, FUN = function(x) any(diff(x) < 0)))
giving:
year id y.b sex income pens x
1 2002 1 1950 F 100000 0 0
2 2002 2 1943 M 55000 1 1
3 2004 1 1950 F 88000 1 0
4 2004 2 1943 M 66000 0 1
5 2006 3 1966 M 12000 1 0
6 2008 3 1966 M 24000 1 0
This should work even if there are more than 2 rows per id, but if we knew there were always exactly 2 rows we could omit the any, simplifying it to:
transform(d.d, x = ave(pens, id, FUN = diff) < 0)
Note: The input in reproducible form is:
Lines <- "year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1"
d.d <- read.table(text = Lines, header = TRUE, check.names = FALSE)
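For completeness, a minimal dplyr sketch of the same check (my addition; equivalent to the ave solution above, with x = 1 marking ids whose pens ever drops):
library(dplyr)
d.d %>%
  group_by(id) %>%
  arrange(year, .by_group = TRUE) %>%                # make sure diff() sees years in order
  mutate(x = as.integer(any(diff(pens) < 0))) %>%    # 1 if pens ever declines within an id
  ungroup()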

adding rows to data.frame conditionally

I have a big data.frame of flowers and fruits on a plant from a 30-year survey. I want to add zeros (0) in some rows, which represent individuals in specific months when the plant did not have flowers or fruits (because it is a seasonal species).
Example:
Year Month Flowers Fruits
2004 6 25 2
2004 7 48 4
2005 7 20 1
2005 8 16 1
I want to add the months that are not included, with values of zero, so I was thinking of a function that recognizes the missing months and fills them with 0.
Thanks.
## x is the data frame you gave in the question
x <- data.frame(
  Year = c(2004, 2004, 2005, 2005),
  Month = c(6, 7, 7, 8),
  Flowers = c(25, 48, 20, 16),
  Fruits = c(2, 4, 1, 1)
)
## y is the data frame that will provide the missing values,
## so you can replace 2004 and 2005 with whatever your desired
## time interval is
y <- expand.grid(Year = 2004:2005, Month = 1:12)
## this final step fills in missing dates and replaces NA's with zeros
library(tidyr)
x <- merge(x, y, all = TRUE) %>%
  replace_na(list(Flowers = 0, Fruits = 0))
## if you don't want to use tidyr, you can alternatively do
x <- merge(x, y, all = TRUE)
x[is.na(x)] <- 0
It looks like this:
head(x, 10)
# Year Month Flowers Fruits
# 1 2004 1 0 0
# 2 2004 2 0 0
# 3 2004 3 0 0
# 4 2004 4 0 0
# 5 2004 5 0 0
# 6 2004 6 25 2
# 7 2004 7 48 4
# 8 2004 8 0 0
# 9 2004 9 0 0
# 10 2004 10 0 0
Here is another option using expand and left_join (df1 here is the original data frame):
library(dplyr)
library(tidyr)
expand(df1, Year, Month = 1:12) %>%
  left_join(., df1) %>%
  replace_na(list(Flowers = 0, Fruits = 0))
# Year Month Flowers Fruits
# <int> <int> <dbl> <dbl>
#1 2004 1 0 0
#2 2004 2 0 0
#3 2004 3 0 0
#4 2004 4 0 0
#5 2004 5 0 0
#6 2004 6 25 2
#7 2004 7 48 4
#8 2004 8 0 0
#9 2004 9 0 0
#10 2004 10 0 0
#.. ... ... ... ...
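As a side note, tidyr's complete() wraps this expand-then-join pattern in a single call; a minimal sketch, using the x data frame from the first answer:
library(dplyr)
library(tidyr)
# Expand to all Year/Month combinations and fill the new rows with zeros
x %>%
  complete(Year, Month = 1:12, fill = list(Flowers = 0, Fruits = 0))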

Merging on pairs of values in one column of each data frame

I am trying to merge two data frames with different numbers of rows and different column values. To give the exact idea, DF1 is:
ID year freq1 mun
1 2005 2 61137
1 2006 1 61383
2 2005 3 14520
2 2006 2 14604
4 2005 3 101423
4 2006 1 102257
6 2005 0 39039
6 2006 1 39346
Whereas DF2 is:
ID year freq2 mun
1 2004 5 60857
1 2005 3 61137
2 2004 4 14278
2 2005 4 14520
3 2004 2 22563
3 2005 0 22635
4 2004 6 101015
4 2005 4 101423
5 2004 6 61152
5 2005 3 61932
6 2004 4 38456
6 2005 3 39039
As you can see, the year and mun variables are somewhat different between the two data frames and have only one common entry. So what I'm trying to achieve is to merge the freq1 and freq2 columns with respect to IDs. However, the trick is that DF1 should take priority (a left merge?) in such a way that the year and mun variables are the ones taken from DF1. Desired output:
ID year freq1 mun freq2
1 2005 2 61137 5
1 2006 1 61383 3
2 2005 3 14520 4
2 2006 2 14604 4
4 2005 3 101423 6
4 2006 1 102257 4
6 2005 0 39039 4
6 2006 1 39346 3
And the same the other way around, with DF2 taking priority:
ID year freq2 mun freq1
1 2004 5 60857 2
1 2005 3 61137 1
2 2004 4 14278 3
2 2005 4 14520 2
3 2004 2 22563 0
3 2005 0 22635 0
4 2004 6 101015 3
4 2005 4 101423 1
5 2004 6 61152 0
5 2005 3 61932 0
6 2004 4 38456 0
6 2005 3 39039 1
I've tried deleting the year and mun columns and merging freq1 and freq2 on the common IDs; however, that only gives me multiple duplicate entries. Any suggestions?
It appears that you are trying to match pairs of IDs in the data frames, in the order presented.
Matching on the ID column alone will cause a cross-product to be formed, giving four rows for ID == 1, which is what I assume you mean by "multiple duplicate entries."
To merge the pairs of ID values, you need to disambiguate the individual values, so the merge merges the first ID value in df1 with the first ID value in df2, and similarly for the second ID values.
This disambiguation can be done by adding another column that counts how many times each ID value has been seen so far. seq_along does the counting, and ave applies it within the "levels" of ID:
df1$ID2 <- ave(df1$ID, df1$ID, FUN=seq_along)
df2$ID2 <- ave(df2$ID, df2$ID, FUN=seq_along)
Here's the new df1. df2 is similarly modified.
> df1
ID year freq1 mun ID2
1 1 2005 2 61137 1
2 1 2006 1 61383 2
3 2 2005 3 14520 1
4 2 2006 2 14604 2
5 4 2005 3 101423 1
6 4 2006 1 102257 2
7 6 2005 0 39039 1
8 6 2006 1 39346 2
These are now appropriate for passing to merge to get the two sides that you want. Removing the unused column from each side prevents the merge from taking data that you don't want:
> merge(df1, df2[-c(2,4)], by=c('ID', 'ID2'), all.x=T)[-2]
ID year freq1 mun freq2
1 1 2005 2 61137 5
2 1 2006 1 61383 3
3 2 2005 3 14520 4
4 2 2006 2 14604 4
5 4 2005 3 101423 6
6 4 2006 1 102257 4
7 6 2005 0 39039 4
8 6 2006 1 39346 3
> merge(df1[-c(2,4)], df2, by=c('ID', 'ID2'), all.y=T)[-2]
ID freq1 year freq2 mun
1 1 2 2004 5 60857
2 1 1 2005 3 61137
3 2 3 2004 4 14278
4 2 2 2005 4 14520
5 3 NA 2004 2 22563
6 3 NA 2005 0 22635
7 4 3 2004 6 101015
8 4 1 2005 4 101423
9 5 NA 2004 6 61152
10 5 NA 2005 3 61932
11 6 0 2004 4 38456
12 6 1 2005 3 39039
Note that NA values are used where there is no match. You can replace these with 0 values if that is really appropriate.
The [-2] at the end removes the added column ID2.
This is a fairly unusual way to merge. It depends on the order of the data in addition to the values, so it does seem fragile. But I do think I've captured what you want to accomplish.
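If you prefer dplyr, a minimal sketch of the same counter-and-join idea (my addition; add_counter is a hypothetical helper, not from the answer above):
library(dplyr)
# Hypothetical helper: add a within-ID occurrence counter, like ID2 above
add_counter <- function(d) d %>% group_by(ID) %>% mutate(ID2 = row_number()) %>% ungroup()
# DF1 takes priority: keep all df1 rows, pull freq2 from df2
add_counter(df1) %>%
  left_join(add_counter(df2) %>% select(ID, ID2, freq2), by = c("ID", "ID2")) %>%
  select(-ID2)
Swapping df1 and df2 (and freq2 for freq1) gives the other direction; unmatched rows come back as NA, just as with merge.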
Use the match function to find corresponding rows between DF1 and DF2. Note that matching on year alone would pick the first row with that year regardless of ID, so match on ID combined with a within-ID occurrence counter (the same ID2 idea as above):
# Within-ID counters, so the k-th row of each ID in DF1 pairs with the k-th row of that ID in DF2
DF1$ID2 <- ave(DF1$ID, DF1$ID, FUN = seq_along)
DF2$ID2 <- ave(DF2$ID, DF2$ID, FUN = seq_along)
# Find rows in DF2 that match rows in DF1, and pull the "freq2" values from them.
cbind(DF1, freq2 = DF2[match(paste(DF1$ID, DF1$ID2), paste(DF2$ID, DF2$ID2)), "freq2"])
# Find rows in DF1 that match rows in DF2, and pull the "freq1" values from them.
cbind(DF2, freq1 = DF1[match(paste(DF2$ID, DF2$ID2), paste(DF1$ID, DF1$ID2)), "freq1"])
