adding rows to data.frame conditionally - r

I have a big data.frame of flower and fruit counts for a plant from a 30-year survey. I want to add zeros (0) in rows representing the specific months where the plant had no flowers or fruits (because it is a seasonal species).
Example:
Year Month Flowers Fruits
2004 6 25 2
2004 7 48 4
2005 7 20 1
2005 8 16 1
I want to add the months that are not included, with values of zero, so I was thinking of a function that recognizes the missing months and fills them with 0.
Thanks.

## x is the data frame you gave in the question
x <- data.frame(
Year = c(2004, 2004, 2005, 2005),
Month = c(6, 7, 7, 8),
Flowers = c(25, 48, 20, 16),
Fruits = c(2, 4, 1, 1)
)
## y is the data frame that will provide the missing values,
## so you can replace 2004 and 2005 with whatever your desired
## time interval is
y <- expand.grid(Year = 2004:2005, Month = 1:12)
## this final step fills in missing dates and replaces NA's with zeros
library(tidyr)
x <- merge(x, y, all = TRUE) %>%
replace_na(list(Flowers = 0, Fruits = 0))
## if you don't want to use tidyr, you can alternatively do
x <- merge(x, y, all = TRUE)
x[is.na(x)] <- 0
It looks like this:
head(x, 10)
# Year Month Flowers Fruits
# 1 2004 1 0 0
# 2 2004 2 0 0
# 3 2004 3 0 0
# 4 2004 4 0 0
# 5 2004 5 0 0
# 6 2004 6 25 2
# 7 2004 7 48 4
# 8 2004 8 0 0
# 9 2004 9 0 0
# 10 2004 10 0 0
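If the years should not be hard-coded, the same merge idea can be wrapped in a small function that builds the grid from the data. This is only a sketch in base R: fill_months and its value_cols argument are hypothetical names, and it assumes every year between the first and last observed year should appear, even a year with no rows at all.

```r
x <- data.frame(
  Year = c(2004, 2004, 2005, 2005),
  Month = c(6, 7, 7, 8),
  Flowers = c(25, 48, 20, 16),
  Fruits = c(2, 4, 1, 1)
)

# fill_months() is a hypothetical helper, not part of any package
fill_months <- function(d, value_cols = c("Flowers", "Fruits")) {
  # one row per Year/Month combination between the first and last observed year
  grid <- expand.grid(Year = seq(min(d$Year), max(d$Year)), Month = 1:12)
  # keep every grid row; months missing from d get NA in the value columns
  out <- merge(grid, d, by = c("Year", "Month"), all.x = TRUE)
  # replace NA with 0 only in the named value columns
  for (col in value_cols) out[[col]][is.na(out[[col]])] <- 0
  out[order(out$Year, out$Month), ]
}

filled <- fill_months(x)
```

Restricting the NA replacement to value_cols avoids overwriting genuine NAs in other columns, which x[is.na(x)] <- 0 would do.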

Here is another option using expand and left_join (df1 is the data frame from the question)
library(dplyr)
library(tidyr)
expand(df1, Year, Month = 1:12) %>%
left_join(., df1) %>%
replace_na(list(Flowers=0, Fruits=0))
# Year Month Flowers Fruits
# <int> <int> <dbl> <dbl>
#1 2004 1 0 0
#2 2004 2 0 0
#3 2004 3 0 0
#4 2004 4 0 0
#5 2004 5 0 0
#6 2004 6 25 2
#7 2004 7 48 4
#8 2004 8 0 0
#9 2004 9 0 0
#10 2004 10 0 0
#.. ... ... ... ...
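The expand, join and replace_na steps can also be collapsed into a single call to tidyr::complete(), which has a fill argument for the new rows. A minimal sketch, assuming the question's data is stored in df1 with integer months; note that, like expand(df1, Year, ...), it only uses the years that already occur in the data.

```r
library(tidyr)

df1 <- data.frame(
  Year = c(2004L, 2004L, 2005L, 2005L),
  Month = c(6L, 7L, 7L, 8L),
  Flowers = c(25, 48, 20, 16),
  Fruits = c(2, 4, 1, 1)
)

# add the missing Year/Month combinations and fill their counts with 0
filled <- complete(df1, Year, Month = 1:12,
                   fill = list(Flowers = 0, Fruits = 0))
```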

Related

How do I add a column indicating the years since a binary variable in R?

I thought this would be trivial, and I think it must be, but I am very tired and stuck at this problem at the moment.
Consider a df with two columns, one with a year, and the other with a binary variable indicating some event.
df <- data.frame(year = c(2000,2001,2002,2003,2004, 2005,2006,2007,2008,2010),
flag = c(0,0,0,1,0,0,0,1,0,0))
I want to create a third column that simply counts the years since the last flag and resets when a new flag appears.
I thought this code would do the job:
First, add a 0 as "year_since" for every year with a flag, then, if there was a flag in the previous year, add 1 to the value of the previous "year_since".
df <- df %>% mutate(year_since = ifelse(flag == 1, 0, NA)) %>%
mutate(year_since = ifelse(dplyr::lag(flag, n=1, order_by = "year") == 1 & is.na(year_since),
dplyr::lag(year_since, n=1, order_by = "year")+1, year_since))
However, this returns NA for every row that should be 1,2,3, and so on.
You could do
df %>%
group_by(group = cumsum(flag)) %>%
mutate(year_since = ifelse(group == 0, NA, seq(n()) - 1)) %>%
ungroup() %>%
select(-group)
#> # A tibble: 10 x 3
#> year flag year_since
#> <dbl> <dbl> <dbl>
#> 1 2000 0 NA
#> 2 2001 0 NA
#> 3 2002 0 NA
#> 4 2003 1 0
#> 5 2004 0 1
#> 6 2005 0 2
#> 7 2006 0 3
#> 8 2007 1 0
#> 9 2008 0 1
#> 10 2010 0 2
Created on 2022-09-16 with reprex v2.0.2
Using data.table
library(data.table)
setDT(df)[, year_since := (NA^!cummax(flag)) * rowid(cumsum(flag))-1]
-output
> df
year flag year_since
<num> <num> <num>
1: 2000 0 NA
2: 2001 0 NA
3: 2002 0 NA
4: 2003 1 0
5: 2004 0 1
6: 2005 0 2
7: 2006 0 3
8: 2007 1 0
9: 2008 0 1
10: 2010 0 2
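Both answers above count rows since the last flag, so the 2010 row gets 2 even though 2009 is absent from the data and 2010 is three calendar years after the 2007 flag. If calendar years are what is wanted (an assumption, since the question does not say), a base R sketch could subtract the year of the most recent flag instead. It assumes the rows are already sorted by year.

```r
df <- data.frame(
  year = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2010),
  flag = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0)
)

# group id: 0 before the first flag, then increases by 1 at every flag
grp <- cumsum(df$flag)
# year of the first row in each group, i.e. the year of the most recent flag
flag_year <- ave(df$year, grp, FUN = function(y) y[1])
# NA before the first flag, otherwise calendar years elapsed since the flag
df$year_since <- ifelse(grp == 0, NA, df$year - flag_year)
```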

Penalized cumulative sum in r

I need to calculate a penalized cumulative sum.
Individuals "A", "B" and "C" were supposed to get tested every other year. Every time they get tested, they accumulate 1 point. However, when they miss a test, their cumulative score is reduced by 1.
I have the following code:
data.frame(year = rep(1990:1995, 3), person.id = c(rep("A", 6), rep("B", 6), rep("C", 6)), needs.testing = rep(c("Yes", "No"), 9), test.compliance = c(c(1,0,1,0,1,0), c(1,0,1,0,0,0), c(1,0,0,0,0,0)), penalized.compliance.cum.sum = c(c(1,1,2,2,3,3), c(1,1,2,2,1,1), c(1,1,0,0,-1,-1)))
...which gives the following:
year person.id needs.testing test.compliance penalized.compliance.cum.sum
1 1990 A Yes 1 1
2 1991 A No 0 1
3 1992 A Yes 1 2
4 1993 A No 0 2
5 1994 A Yes 1 3
6 1995 A No 0 3
7 1990 B Yes 1 1
8 1991 B No 0 1
9 1992 B Yes 1 2
10 1993 B No 0 2
11 1994 B Yes 0 1
12 1995 B No 0 1
13 1990 C Yes 1 1
14 1991 C No 0 1
15 1992 C Yes 0 0
16 1993 C No 0 0
17 1994 C Yes 0 -1
18 1995 C No 0 -1
As it is evident, "A" fully complied. "B" somewhat complied (in year 1994 he's supposed to get tested, but he missed the test, and consequently his cumulative sum gets deducted from 2 to 1). Finally, "C" complies just once (in year 1990, and every time she needs to get tested, she misses the test).
What I need is some code to get the "penalized.compliance.cum.sum" variable.
Please note:
Tests are every other year.
The "penalized.compliance.cum.sum" variable keeps adding the previous score.
But starts deducting only if the individual misses the test on the testing year (denoted in the "needs.testing" variable).
For instance, individual "C" complies in year 1990. In 1991 she doesn't need to get tested, and hence keeps her score of 1. Then, she misses the 1992 test, and 1 is subtracted from her cumulative score, giving a score of 0 in 1992. Then she keeps missing tests, ending with a score of -1 at the end of the study.
Also, I need to assign different penalties (i.e. different numbers). In this example, it's just 1. However, I need to be able to penalize using other numbers such as 0.5, 0.1, and others.
Thanks!
Using case_when
library(dplyr)
df1 %>%
group_by(person.id) %>%
mutate(res = cumsum(case_when(needs.testing == "Yes" ~ 1- 2 *(test.compliance < 1), TRUE ~ 0)))
base R
do.call(rbind, by(dat, dat$person.id,
function(z) transform(z, res = cumsum(ifelse(needs.testing == "Yes", 1-2*(test.compliance < 1), 0)))
))
# year person.id needs.testing test.compliance penalized.compliance.cum.sum res
# A.1 1990 A Yes 1 1 1
# A.2 1991 A No 0 1 1
# A.3 1992 A Yes 1 2 2
# A.4 1993 A No 0 2 2
# A.5 1994 A Yes 1 3 3
# A.6 1995 A No 0 3 3
# B.7 1990 B Yes 1 1 1
# B.8 1991 B No 0 1 1
# B.9 1992 B Yes 1 2 2
# B.10 1993 B No 0 2 2
# B.11 1994 B Yes 0 1 1
# B.12 1995 B No 0 1 1
# C.13 1990 C Yes 1 1 1
# C.14 1991 C No 0 1 1
# C.15 1992 C Yes 0 0 0
# C.16 1993 C No 0 0 0
# C.17 1994 C Yes 0 -1 -1
# C.18 1995 C No 0 -1 -1
by splits a frame up by the INDICES (dat$person.id here), where in the function z is the data for just that group. This allows us to operate on the data without worrying that a vector spans more than one person.
by returns a list, and the canonical base-R way to combine lists into a frame is either rbind(a, b) when only two frames, or do.call(rbind, list(...)) when there may be more than two frames in the list.
The 1-2*(.) is just a trick to waffle between +1 and -1 based on test.compliance.
This has the side-effect of potentially changing the order of the rows. For instance, if it were ordered first by year then person.id, then the by-group calculations will still be good, but the output will be grouped by person.id (and ordered by year within the group). Minor, but note it if you need order to be something.
dplyr
library(dplyr)
dat %>%
group_by(person.id) %>%
mutate(res = cumsum(if_else(needs.testing == "Yes", 1-2*(test.compliance < 1), 0))) %>%
ungroup()
data.table
library(data.table)
datDT <- as.data.table(dat)
datDT[, res := cumsum(fifelse(needs.testing == "Yes", 1-2*(test.compliance < 1), 0)), by = .(person.id)]
This might do the trick for you?
df <- data.frame(year = rep(1990:1995, 3), person.id = c(rep("A", 6), rep("B", 6), rep("C", 6)), needs.testing = rep(c("Yes", "No"), 9), test.compliance = c(c(1,0,1,0,1,0), c(1,0,1,0,0,0), c(1,0,0,0,0,0)), penalized.compliance.cum.sum = c(c(1,1,2,2,3,3), c(1,1,2,2,1,1), c(1,1,0,0,-1,-1)))
library("dplyr")
penalty <- -1
df %>%
group_by(person.id) %>%
mutate(cumsum = cumsum(ifelse(needs.testing == "Yes" & test.compliance == 0, penalty, test.compliance)))
## A tibble: 18 x 6
## Groups: person.id [3]
# year person.id needs.testing test.compliance penalized.compliance.cum.sum cumsum
# <int> <chr> <chr> <dbl> <dbl> <dbl>
# 1 1990 A Yes 1 1 1
# 2 1991 A No 0 1 1
# 3 1992 A Yes 1 2 2
# 4 1993 A No 0 2 2
# 5 1994 A Yes 1 3 3
# 6 1995 A No 0 3 3
# 7 1990 B Yes 1 1 1
# 8 1991 B No 0 1 1
# 9 1992 B Yes 1 2 2
#10 1993 B No 0 2 2
#11 1994 B Yes 0 1 1
#12 1995 B No 0 1 1
#13 1990 C Yes 1 1 1
#14 1991 C No 0 1 1
#15 1992 C Yes 0 0 0
#16 1993 C No 0 0 0
#17 1994 C Yes 0 -1 -1
#18 1995 C No 0 -1 -1
You can then easily adjust the penalty variable to be whatever penalty you want.
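The 1-2*(.) approaches above fix the penalty at 1, while the question also asks for penalties such as 0.5 or 0.1. A base R sketch of the same idea with the penalty as an argument follows; penalized_cumsum is a hypothetical name, and it assumes a passed test is always worth +1 and rows are in year order within each person.

```r
dat <- data.frame(
  year = rep(1990:1995, 3),
  person.id = rep(c("A", "B", "C"), each = 6),
  needs.testing = rep(c("Yes", "No"), 9),
  test.compliance = c(1, 0, 1, 0, 1, 0,  1, 0, 1, 0, 0, 0,  1, 0, 0, 0, 0, 0)
)

# penalized_cumsum() is a hypothetical helper, not from any package
penalized_cumsum <- function(d, penalty = 1) {
  # per-row change: +1 for a passed test, -penalty for a missed one,
  # 0 in years where no test is needed
  step <- ifelse(d$needs.testing == "Yes",
                 ifelse(d$test.compliance == 1, 1, -penalty),
                 0)
  # running total within each person
  ave(step, d$person.id, FUN = cumsum)
}

dat$res_1 <- penalized_cumsum(dat, penalty = 1)
dat$res_half <- penalized_cumsum(dat, penalty = 0.5)
```

With penalty = 1 this reproduces penalized.compliance.cum.sum from the question; with penalty = 0.5, person "C" ends at 0 instead of -1.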

Cumulative sum for 2 criteria in R

I have a data frame where I want to calculate cumulative sums based on 2 criteria
dfdata = data.frame(car = c("toyota","toyota","toyota","toyota","toyota",
"honda","honda","honda","honda",
"lada","lada","lada","lada"),
year = c(2000,2000,2001,2001,2002,2001,2001,2002,2002,2003,2004,2005,2006),
id = c("a","b","a","c","a","d","d","d","e","f","f","f","f"))
Here is the data:
dfdata
car year id
1 toyota 2000 a
2 toyota 2000 b
3 toyota 2001 a
4 toyota 2001 c
5 toyota 2002 a
6 honda 2001 d
7 honda 2001 d
8 honda 2002 d
9 honda 2002 e
10 lada 2003 f
11 lada 2004 f
12 lada 2005 f
13 lada 2006 f
Imagine I was observing cars passing by and that the plate on it is an "ID". So a car with the same id is the exact same car.
1. I want the number of cars from each company that I've seen in one year.
2. I want the cumulative sum of cars from each company that I've seen across the years.
3. I want the cumulative sum of the cars from each company that I've seen more than once (counting sightings in the same year as well as in other years), AND another column counting only the ones I've seen again in other years.
Here is how I got point 1. and point 2.
dfdata %>%
group_by(car, year) %>%
dplyr::summarise(nb = n()) %>%
dplyr::mutate(cs = cumsum(nb)) %>%
ungroup()
nb is the number of cars from a certain manufacturer I've seen in a particular year. cs is the cumulative sum of the cars across the years.
# A tibble: 9 x 4
car year nb cs
<fct> <dbl> <int> <int>
1 honda 2001 2 2
2 honda 2002 2 4
3 lada 2003 1 1
4 lada 2004 1 2
5 lada 2005 1 3
6 lada 2006 1 4
7 toyota 2000 2 2
8 toyota 2001 2 4
9 toyota 2002 1 5
But notice that I've lost the ID column. How can I get the number of cars that I've seen multiple times for the same ID.
Final output should be based on grouping ID (to answer point 3):
car year nb cs curetrap curetrap.no.same.year
1 honda 2001 2 2 1 0
2 honda 2002 2 4 2 1
3 lada 2003 1 1 0 0
4 lada 2004 1 2 1 1
5 lada 2005 1 3 2 2
6 lada 2006 1 4 3 3
7 toyota 2000 2 2 0 0
8 toyota 2001 2 4 1 1
9 toyota 2002 1 5 2 2
This is because "honda" have been seen 2 times in 2001 and 2 times in 2002. So the cumulative sum is 2 in 2001 and 2 + 2 in 2002. Then, within the same year I've seen the honda "d" twice, meaning that I "recaptured" the "d" 2001 honda hence the "1" in curetrap for 2001. In 2002, I recaptured the honda "d" again, thus the cumulative sum increased. For "curetrap.no.same.year" it's the same thing, but I want to ignore the recapture of the honda "d" in 2001 since it's the same year.
How is it possible to do that? Since I'm losing the ID information, do I need to do it in 2 steps?
So far this is what I have:
tab.df = cbind(table(dfdata$id,dfdata$year),
car = as.character(dfdata[match(unique(dfdata$id),table = dfdata$id),"car"]))
df.df = as.data.frame(tab.df)
2000 2001 2002 2003 2004 2005 2006 car
a 1 1 1 0 0 0 0 toyota
b 1 0 0 0 0 0 0 toyota
c 0 1 0 0 0 0 0 toyota
d 0 2 1 0 0 0 0 honda
e 0 0 1 0 0 0 0 honda
f 0 0 0 1 1 1 1 lada
Which shows all the times I've seen a car in a year for a certain ID.
You can factor the problem into 2 steps by first adding binary variables in your original dataset which will flag the records you want to count, and then by simply computing sum and cumsum of these flags.
The following code gives the result you want
dfdata %>%
group_by(car, id) %>%
arrange(year, .by_group=TRUE) %>%
dplyr::mutate(already_seen = row_number()>1, already_seen_diff_year = year>year[1]) %>%
group_by(car, year) %>%
dplyr::summarise(nb = n(), cs = nb, curetrap = sum(already_seen), curetrap.no.same.year = sum(already_seen_diff_year)) %>%
dplyr::mutate_at(vars(cs, curetrap, curetrap.no.same.year), cumsum) %>%
ungroup()
NB: duplicating the variable (cs = nb) is just a trick to simplify the subsequent call to mutate_at
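The same two-step idea (flag each row first, then sum and cumulate the flags) can be sketched without dplyr. This base R version is an illustration only; it assumes that an id identifies a car within a company, and it stores the per-row flags directly under the final column names.

```r
dfdata <- data.frame(
  car  = c(rep("toyota", 5), rep("honda", 4), rep("lada", 4)),
  year = c(2000, 2000, 2001, 2001, 2002, 2001, 2001, 2002, 2002,
           2003, 2004, 2005, 2006),
  id   = c("a", "b", "a", "c", "a", "d", "d", "d", "e", "f", "f", "f", "f")
)

# step 1: per-row flags, with rows sorted so the first sighting comes first
dfdata <- dfdata[order(dfdata$car, dfdata$id, dfdata$year), ]
key <- paste(dfdata$car, dfdata$id)
dfdata$nb <- 1
dfdata$curetrap <- as.numeric(duplicated(key))            # seen before at all
dfdata$curetrap.no.same.year <-
  as.numeric(dfdata$year > ave(dfdata$year, key, FUN = min))  # seen in an earlier year

# step 2: sum the flags per car and year, then cumulate within each car
agg <- aggregate(cbind(nb, curetrap, curetrap.no.same.year) ~ car + year,
                 data = dfdata, FUN = sum)
agg <- agg[order(agg$car, agg$year), ]
agg$cs <- ave(agg$nb, agg$car, FUN = cumsum)
agg$curetrap <- ave(agg$curetrap, agg$car, FUN = cumsum)
agg$curetrap.no.same.year <- ave(agg$curetrap.no.same.year, agg$car, FUN = cumsum)
```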

How to find observations whose dummy variable changes from 1 to 0 (and not vice versa) in a df in r

I have a survey composed of n individuals; each individual is present more than one time in the survey (panel). I have a variable pens, which is a dummy that takes value 1 if the individual invests in a complementary pension form. For example:
df <- data.frame(year=c(2002,2002,2004,2004,2006,2008), id=c(1,2,1,2,3,3), y.b=c(1950,1943,1950,1943,1966,1966), sex=c("F", "M", "F", "M", "M", "M"), income=c(100000,55000,88000,66000,12000,24000), pens=c(0,1,1,0,1,1))
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
where id is the individual, y.b is year of birth, pens is the dummy variable regarding complementary pension.
I want to know if there are individuals that invested in a complementary pension form in year t but didn't hold the complementary pension form in year t+2 (the survey is conducted every two years). In this way I want to know how many people had a complementary pension form but released it before retirement or gave it up (for example for economic reasons).
I tried with this command:
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
which(df$x=="0")
This does identify the individuals whose pens variable changed over time (the command checks whether a variable is constant over time). As a result, I find individuals whose pens variable changed from 0 (no complementary pension) in year t to 1 in year t+2, and vice versa; but I am interested only in individuals whose pens variable was 1 (had a complementary pension) in year t and 0 in year t+2.
If I use this command with the df I get that for id 1 and 2 the variable x is 0 (pens variable isn't constant), but I'd need to find a way to get just id 2 (whose pens variable changed from 1 to 0).
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
which(df$x=="0")
year id pens x
1 2002 1 0 0
2 2002 2 1 0
3 2004 1 1 0
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
(for the sake of simplicity I omitted other variables)
So the desired output is:
year id pens x
1 2002 1 0 1
2 2002 2 1 0
3 2004 1 1 1
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
only id 2 has x=0 since the pens variable changed from 1 to 0.
Thanks in advance
This assigns 1 to the id's for which there is a decline in pens and 0 otherwise.
transform(d.d, x = ave(pens, id, FUN = function(x) any(diff(x) < 0)))
giving:
year id y.b sex income pens x
1 2002 1 1950 F 100000 0 0
2 2002 2 1943 M 55000 1 1
3 2004 1 1950 F 88000 1 0
4 2004 2 1943 M 66000 0 1
5 2006 3 1966 M 12000 1 0
6 2008 3 1966 M 24000 1 0
This should work even if there are more than 2 rows per id, but if we knew there were always 2 rows then we could omit the any, simplifying it to:
transform(d.d, x = ave(pens, id, FUN = diff) < 0)
Note: The input in reproducible form is:
Lines <- "year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1"
d.d <- read.table(text = Lines, header = TRUE, check.names = FALSE)
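Note that the x produced above is 1 for the ids with a decline, whereas the desired output in the question uses 0 for them. A base R sketch that lists the affected ids and builds the column with the question's coding follows; dropped and ids_dropped are hypothetical names, only the year, id and pens columns are reproduced, and rows are assumed to be in chronological order within each id.

```r
d.d <- data.frame(
  year = c(2002, 2002, 2004, 2004, 2006, 2008),
  id   = c(1, 2, 1, 2, 3, 3),
  pens = c(0, 1, 1, 0, 1, 1)
)

# TRUE for every id with at least one 1 -> 0 step between consecutive waves
dropped <- vapply(split(d.d$pens, d.d$id),
                  function(p) any(diff(p) == -1),
                  logical(1))
ids_dropped <- names(dropped)[dropped]

# question's coding: 0 for ids that went from 1 to 0, 1 for everyone else
d.d$x <- as.integer(!(d.d$id %in% ids_dropped))
```

length(ids_dropped) then answers the "how many people" part of the question directly.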

Summarizing a dataframe by date and group

I am trying to summarize a data set by a few different factors. Below is an example of my data:
household<-c("household1","household1","household1","household2","household2","household2","household3","household3","household3")
date<-c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value<-c(1:9)
type<-c("income","water","energy","income","water","energy","income","water","energy")
df<-data.frame(household,date,value,type)
household date value type
1 household1 1999-05-10 100 income
2 household1 1999-05-25 200 water
3 household1 1999-10-12 300 energy
4 household2 1999-02-02 400 income
5 household2 1999-08-20 500 water
6 household2 1999-02-19 600 energy
7 household3 1999-07-01 700 income
8 household3 1999-10-13 800 water
9 household3 1999-01-01 900 energy
I want to summarize the data by month. Ideally the resulting data set would have 12 rows per household (one for each month) and a column for each category of expenditure (water, energy, income) that is a sum of that month's total.
I tried starting by adding a column with a short date, and then I was going to filter for each type and create a separate data frame for the summed data per transaction type. I was then going to merge those data frames together to have the summarized df. I attempted to summarize it using ddply, but it aggregated too much, and I can't keep the household level info.
ddply(df,.(shortdate),summarize,mean_value=mean(value))
shortdate mean_value
1 14/07 15.88235
2 14/09 5.00000
3 14/10 5.00000
4 14/11 21.81818
5 14/12 20.00000
6 15/01 10.00000
7 15/02 12.50000
8 15/04 5.00000
Any help would be much appreciated!
It sounds like what you are looking for is a pivot table. I like to use reshape::cast for these types of tables. If there is more than one value returned for a given expenditure type for a given household/year/month combination, this will sum those values. If there is only one value, it returns the value. The "sum" argument is not required but only placed there to handle exceptions. I think if your data is clean you shouldn't need this argument.
hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3")
date <- c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value <- c(1:9)
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy")
df <- data.frame(hh, date, value, type)
# Load lubridate library, add date and year
library(lubridate)
df$month <- month(df$date)
df$year <- year(df$date)
# Load reshape library, run cast from reshape, creates pivot table
library(reshape)
dfNew <- cast(df, hh+year+month~type, value = "value", sum)
> dfNew
hh year month energy income water
1 hh1 1999 4 3 0 0
2 hh1 1999 10 0 1 0
3 hh1 1999 11 0 0 2
4 hh2 1999 2 0 4 0
5 hh2 1999 3 6 0 0
6 hh2 1999 6 0 0 5
7 hh3 1999 1 9 0 0
8 hh3 1999 4 0 7 0
9 hh3 1999 8 0 0 8
Try this:
df$ym<-zoo::as.yearmon(as.Date(df$date), "%y/%m")
library(dplyr)
df %>% group_by(ym,type) %>%
summarise(mean_value=mean(value))
Source: local data frame [9 x 3]
Groups: ym [?]
ym type mean_value
<S3: yearmon> <fctr> <dbl>
1 jan 1999 income 1
2 jun 1999 energy 3
3 jul 1999 energy 6
4 jul 1999 water 2
5 ago 1999 income 4
6 set 1999 energy 9
7 set 1999 income 7
8 nov 1999 water 5
9 dez 1999 water 8
Edit: the wide format (assuming dfr holds the summarised result from above):
reshape2::dcast(dfr, ym ~ type)
ym energy income water
1 jan 1999 NA 1 NA
2 jun 1999 3 NA NA
3 jul 1999 6 NA 2
4 ago 1999 NA 4 NA
5 set 1999 9 7 NA
6 nov 1999 NA NA 5
7 dez 1999 NA NA 8
If I understood your requirement correctly (from the description in the question), this is what you are looking for:
library(dplyr)
library(tidyr)
df %>% mutate(date = lubridate::month(date)) %>%
complete(household, date = 1:12) %>%
spread(type, value) %>% group_by(household, date) %>%
mutate(Total = sum(energy, income, water, na.rm = T)) %>%
select(household, Month = date, energy:water, Total)
#Source: local data frame [36 x 6]
#Groups: household, Month [36]
#
# household Month energy income water Total
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 household1 1 NA NA NA 0
#2 household1 2 NA NA NA 0
#3 household1 3 NA NA 200 200
#4 household1 4 NA NA NA 0
#5 household1 5 NA NA NA 0
#6 household1 6 NA NA NA 0
#7 household1 7 NA NA NA 0
#8 household1 8 NA NA NA 0
#9 household1 9 300 NA NA 300
#10 household1 10 NA NA NA 0
# ... with 26 more rows
Note: I used the same df you provided in the question. The only change I made was the value column. Instead of 1:9, I used seq(100, 900, 100)
If I got it wrong, please let me know and I will delete my answer. I will add an explanation of what's going on if this is correct.
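For comparison, the requested layout (12 rows per household, one summed column per type, zeros where nothing was recorded) can also be sketched in base R alone. The dates below are copied from the table printed in the question and the values follow seq(100, 900, 100); monthly and tp are hypothetical names.

```r
df <- data.frame(
  household = rep(c("household1", "household2", "household3"), each = 3),
  date = as.Date(c("1999-05-10", "1999-05-25", "1999-10-12",
                   "1999-02-02", "1999-08-20", "1999-02-19",
                   "1999-07-01", "1999-10-13", "1999-01-01")),
  value = seq(100, 900, 100),
  type = rep(c("income", "water", "energy"), 3)
)
df$month <- as.integer(format(df$date, "%m"))

# start from a full household x month grid, then attach one column per type
monthly <- expand.grid(household = unique(df$household), month = 1:12,
                       stringsAsFactors = FALSE)
for (tp in unique(df$type)) {
  # monthly sum of this type per household
  sums <- aggregate(value ~ household + month,
                    data = df[df$type == tp, ], FUN = sum)
  names(sums)[3] <- tp
  monthly <- merge(monthly, sums, by = c("household", "month"), all.x = TRUE)
  # months without a record of this type become 0
  monthly[[tp]][is.na(monthly[[tp]])] <- 0
}
```

If the data span more than one year, the year would need to be added to the grid and to the grouping as well.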
