Increase efficiency of dplyr summarising - r

I am trying to sort and make a new table from a large data set (>60k; NDw) with a sample here
Season ENo HNo Month Day Year Group
638447 2011 A903851 1881023 10 6 2011 Ducks
589219 2010 C409324 3648019 10 8 2010 Ducks
137451 2006 M576033 2506116 10 13 2006 Ducks
883040 2013 P886755 43313010 10 17 2013 Ducks
851378 2013 C700399 36413199 11 5 2013 Geese
552791 2010 M902312 2508141 11 16 2010 Ducks
152368 2006 M599973 2496101 10 3 2006 Ducks
395393 2008 C412049 3646096 10 28 2008 Ducks
857709 2013 C671619 36413012 9 15 2013 Ducks
67354 2005 C349762 3643011 10 22 2005 Geese
67126 2005 C427496 3643037 11 25 2005 Geese
62260 2005 C349776 3643023 10 7 2005 Ducks
847364 2013 C570491 36411001 10 5 2013 Ducks
447414 2009 A686943 1808206 11 3 2009 Geese
474743 2009 M813353 2509214 10 24 2009 Ducks
439477 2009 A746048 1639142 10 26 2009 Ducks
781218 2012 P792862 4142177 11 27 2012 Geese
806946 2013 M052893 20712036 11 5 2013 Ducks
174932 2006 C450351 3645098 12 5 2006 Geese
828816 2013 M054683 25012010 9 30 2013 Ducks
I want to group by Season and HNo and get a number of new variables. These include how many groups each Season/HNo is in, a count of rows total, in each group, and each group during each month. The result would look like this, but with all months.
Season HNo groupN total.envelopes ducks geese Octducks
1 2005 1253041 1 2 2 0 2
2 2005 1254026 1 5 5 0 5
3 2005 1254063 2 26 23 3 0
4 2005 1254115 2 14 10 4 10
5 2005 1274023 2 39 28 11 28
I have code that works, but it runs slowly and I feel there should be a better way to write this block. Maybe I'm wrong and it's not a large issue; I just want to learn how to make my code more efficient. Here is what I use to get the above output.
NDw1 = NDw %>%
  group_by(Season,HNo) %>%
  summarise(groupN = n_distinct(Group),
            total.envelopes=n(),
            ducks = length(ENo[Group %in% 'Ducks']),
            geese = length(ENo[Group %in% 'Geese']),
            Octducks = length(ENo[Group=='Ducks' & Month == 10]))
The entire code has lines for Aug-Jan ducks and geese. I tried to use count rather than length but it didn't work with a factor variable as is ENo. Any thoughts would be appreciated. Thanks for your time and help.
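One possible simplification, sketched here on made-up toy data rather than the real NDw: count matching rows with sum() over a logical condition instead of subsetting ENo and taking its length. This avoids building a subset vector for every group and does not depend on the type of ENo. The toy values and the .groups = "drop" argument (which assumes dplyr 1.0.0 or later) are assumptions, not part of the original code.

```r
library(dplyr)

# Toy stand-in for NDw; the values are invented for illustration only
NDw <- data.frame(
  Season = c(2005, 2005, 2005, 2006, 2006),
  HNo    = c(1, 1, 1, 2, 2),
  ENo    = factor(c("A1", "A2", "A3", "B1", "B2")),
  Month  = c(10, 10, 11, 10, 12),
  Group  = c("Ducks", "Geese", "Ducks", "Ducks", "Ducks")
)

NDw1 <- NDw %>%
  group_by(Season, HNo) %>%
  summarise(
    groupN          = n_distinct(Group),
    total.envelopes = n(),
    # sum() of a logical vector counts the TRUE values
    ducks           = sum(Group == "Ducks"),
    geese           = sum(Group == "Geese"),
    Octducks        = sum(Group == "Ducks" & Month == 10),
    .groups = "drop"
  )
```

The remaining month/group columns would follow the same pattern, one sum() per combination.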

Related

R - manipulating time series data

I have a time-series dataset with yearly values for 30 years for >200,000 study units that all start off as the same value of 'healthy==1' and can transition to 3 classes - 'exposed==2', 'infected==3' and 'recover==4'; some units also remain as 'healthy' throughout the time series. The dataset is in long format.
I would like to manipulate the dataset so that it keeps all 30 years for each unit but is collapsed to only 'healthy==1' and 'infected==3', i.e. I would classify 'exposed==2' as 'healthy==1', and the first time a 'healthy' unit gets 'infected==3', it remains infected for the remainder of the time series even though it might 'recover==4' or change state again (get infected and recover again).
Healthy units that never transition to another class will remain classified as healthy throughout the time series.
I am kinda stumped on how to code this out in r; any ideas would be greatly appreciated
Example of the dataset for two units: one remains healthy throughout the time series and the other has multiple transitions.
UID annual_change_val year
1 control1 1 1990
4 control1 1 1991
5 control1 1 1992
7 control1 1 1993
9 control1 1 1994
12 control1 1 1995
13 control1 1 1996
16 control1 1 1997
18 control1 1 1998
20 control1 1 1999
22 control1 1 2000
24 control1 1 2001
26 control1 1 2002
28 control1 1 2003
30 control1 1 2004
31 control1 1 2005
33 control1 1 2006
35 control1 1 2007
38 control1 1 2008
40 control1 1 2009
42 control1 1 2010
44 control1 1 2011
46 control1 1 2012
48 control1 1 2013
50 control1 1 2014
52 control1 1 2015
53 control1 1 2016
55 control1 1 2017
57 control1 1 2018
59 control1 1 2019
61 control1 1 2020
2 control64167 1 1990
3 control64167 1 1991
6 control64167 1 1992
8 control64167 2 1993
10 control64167 2 1994
11 control64167 2 1995
14 control64167 2 1996
15 control64167 2 1997
17 control64167 3 1998
19 control64167 3 1999
21 control64167 4 2000
23 control64167 4 2001
25 control64167 4 2002
27 control64167 4 2003
29 control64167 3 2004
32 control64167 4 2005
34 control64167 4 2006
36 control64167 4 2007
37 control64167 4 2008
39 control64167 4 2009
41 control64167 4 2010
43 control64167 4 2011
45 control64167 4 2012
47 control64167 4 2013
49 control64167 4 2014
51 control64167 4 2015
54 control64167 4 2016
56 control64167 4 2017
58 control64167 4 2018
60 control64167 4 2019
62 control64167 4 2020
If for some reason you only want to use base R,
df$annual_change_val[df$annual_change_val == 2] <- 1
df$annual_change_val[df$annual_change_val == 4] <- 3
The first line means: take the annual_change_val column from ($) dataframe df, subset it ([) so that you're only left with values equal to 2, and re-assign (<-) to those a value of 1 instead. Similarly for the second line.
Update, based on comment/clarification.
Here, I replace the values as before, and then I create a temp variable called first_inf which holds the first year in which the UID was "infected" (status=3), or Inf if it never was. I then set the status to 3 for any year on or after that year (within UID).
library(dplyr)
d %>%
  mutate(status = if_else(annual_change_val %in% c(1,2),1,3)) %>%
  group_by(UID) %>%
  # Inf for never-infected units, so the comparison below is FALSE for them
  mutate(first_inf = min(c(year[status==3], Inf)),
         status = if_else(year>=first_inf,3,status)) %>%
  ungroup() %>%
  select(!first_inf)
You can simply change the values from 2 to 1, and from 4 to 3, as Andrea mentioned in the comments. If d is your data, then with dplyr:
library(dplyr)
d %>% mutate(status = if_else(annual_change_val %in% c(1,2),1,3))
Or with data.table:
library(data.table)
setDT(d)[, status:=fifelse(annual_change_val %in% c(1,2),1,3)]
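For the "once infected, always infected" rule, a base R alternative is a running maximum within each unit. The sketch below is only an illustration on made-up toy data: the unit names u1 and u2, and the helper columns infected_now and ever_infected, are assumptions. It also assumes that only the value 3 marks an infection, and it sorts the rows by year within UID first.

```r
# Toy data: u1 stays healthy; u2 is exposed, infected, recovers, then changes again
d <- data.frame(
  UID  = rep(c("u1", "u2"), each = 6),
  year = rep(1990:1995, times = 2),
  annual_change_val = c(1, 1, 1, 1, 1, 1,
                        1, 2, 3, 4, 1, 4)
)

# The running maximum only makes sense if rows are in time order within each unit
d <- d[order(d$UID, d$year), ]

# 1 in years where the unit is infected, 0 otherwise
infected_now <- as.integer(d$annual_change_val == 3)

# cummax() switches to 1 at the first infection and stays there
ever_infected <- ave(infected_now, d$UID, FUN = cummax)

# Collapse to the two classes: 1 = healthy, 3 = infected
d$status <- ifelse(ever_infected == 1, 3, 1)
```

Units that never reach 3 keep a running maximum of 0, so they stay classified as healthy for all years.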

How to create a new column using looping and rbind in r?

I have data similar to this. I would like to make 3 columns (date1, date2, date3) by using a loop and rbind, because I am required to do it with that method only.
(all I was told is making a loop, subset the data, sort it make a new data frame then rbind it to make a new column.)
year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103
The result I expect is:
date1: number of days from 2011, January 1st, start again from 1 in a new year.
date2: number of days of an id working in a year, start again from 1 in a new year.
date3: number of days open within a year, start again from 1 in a new year.
(all of the dates are in ascending order)
year month day id date1 date2 date3
2011 1 5 3101 5 1 1
2011 1 14 3101 14 2 2
2011 2 3 3101 34 3 3
2011 2 4 3101 35 4 4
2012 1 27 3153 27 1 1
2012 2 20 3153 51 2 2
2012 2 22 3153 53 3 3
2012 3 1 3153 60 4 4
2013 1 31 3103 31 1 1
2013 2 1 3103 32 2 2
2013 2 4 3103 35 3 3
2013 3 4 3103 94 4 4
2013 3 6 3103 96 5 5
Please help! Thank you.
You can do it without a for loop or subsetting, which are unnecessary here. Here is the answer:
df <- read.table(text =" year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103",header = T)
library(lubridate)
df$date1 <- yday(mdy(paste0(df$month,"-",df$day,"-",df$year)))
df$date2 <- ave(df$year, df$id, FUN = seq_along)
df$date3 <- ave(df$year, df$year, FUN = seq_along)
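If the loop-and-rbind approach really is required, the steps described in the question (loop, subset, sort, build a new data frame, rbind) could be sketched in base R roughly as follows. The names result, sub and dates are illustrative assumptions, and date1 is taken here as the calendar day of the year from a Date object, which accounts for leap years.

```r
df <- data.frame(
  year  = c(rep(2011, 4), rep(2012, 4), rep(2013, 5)),
  month = c(1, 1, 2, 2, 1, 2, 2, 3, 1, 2, 2, 3, 3),
  day   = c(5, 14, 3, 4, 27, 20, 22, 1, 31, 1, 4, 4, 6),
  id    = c(rep(3101, 4), rep(3153, 4), rep(3103, 5))
)

result <- NULL
for (yr in sort(unique(df$year))) {
  # Subset one year and sort it by calendar date
  sub <- df[df$year == yr, ]
  sub <- sub[order(sub$month, sub$day), ]

  dates <- as.Date(paste(sub$year, sub$month, sub$day, sep = "-"))
  # date1: day of the year (restarts at 1 every year)
  sub$date1 <- as.integer(format(dates, "%j"))
  # date2: running count of rows per id within this year
  sub$date2 <- ave(seq_len(nrow(sub)), sub$id, FUN = seq_along)
  # date3: running count of rows within this year
  sub$date3 <- seq_len(nrow(sub))

  # Append this year's block to the accumulated result
  result <- rbind(result, sub)
}
```

Growing a data frame with rbind inside a loop is fine for small data like this, but it gets slow on large inputs, which is one reason the vectorised answer above is usually preferred.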

How to lump sum the number of days of a data of several year?

I have data similar to this. I would like to "lump sum" the days (I'm not sure whether "lump sum" is the right term) and create a new column "date" that holds the cumulative number of days across the 3 years of data, in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote this code, but the result was wrong and the code is also too long. It doesn't count February correctly, since February has only 28 days. Are there any shorter ways?
cday <- function(data,syear=2011,smonth=1,sday=1){
  year <- data[1]
  month <- data[2]
  day <- data[3]
  cmonth <- c(0,31,28,31,30,31,30,31,31,30,31,30,31)
  date <- (year-syear)*365+sum(cmonth[1:month])+day
  for(yr in c(syear:year)){
    if(yr==year){
      if(yr%%4==0&&month>2){date<-date+1}
    }else{
      if(yr%%4==0){date<-date+1}
    }
  }
  return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated, so look for tools that do this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.

replace NA with previous 2 years values

I have 2 data frames. In df1 there are NA values in p1 that need to be replaced with the mean of Average_f1 from df2 for the previous 2 years.
E.g. in df1, row 5 has year 2015 and bin 5, so its NA should be replaced with the mean over the previous 2 years (2013 and 2014) for the same bin in df2. For row 7, only 1 year's value is available.
df1
year p1 bin
2013 20 1
2013 24 1
2014 10 2
2014 11 2
2015 NA 5
2016 10 3
2017 NA 7
df2
year bin_p1 Average_f1
2013 5 29.5
2014 5 16.5
2015 NA 30
2016 7 12
output
df1
year p1 bin
2013 20 1
2013 24 1
2014 10 2
2014 11 2
2015 **23** 5
2016 10 3
2017 **12** 7
Thanks in advance
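A minimal base R sketch of one way to do this, assuming that "previous 2 years" means the two calendar years before the row's year and that df2's bin_p1 has to match df1's bin. The helper name prev is an assumption for illustration.

```r
df1 <- data.frame(
  year = c(2013, 2013, 2014, 2014, 2015, 2016, 2017),
  p1   = c(20, 24, 10, 11, NA, 10, NA),
  bin  = c(1, 1, 2, 2, 5, 3, 7)
)
df2 <- data.frame(
  year       = c(2013, 2014, 2015, 2016),
  bin_p1     = c(5, 5, NA, 7),
  Average_f1 = c(29.5, 16.5, 30, 12)
)

for (i in which(is.na(df1$p1))) {
  # Average_f1 values for the same bin in the two preceding years;
  # %in% treats an NA bin_p1 as "no match" instead of propagating NA
  prev <- df2$Average_f1[df2$bin_p1 %in% df1$bin[i] &
                         df2$year %in% (df1$year[i] - 2:1)]
  # Leave the NA in place when no earlier value is available
  if (length(prev) > 0) df1$p1[i] <- mean(prev)
}
```

With the sample data, row 5 uses both 2013 and 2014 for bin 5, while row 7 finds only the 2016 value for bin 7.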

Create groups based on time period

How can I create a new grouping variable for my data based on 5-year steps?
So from this:
group <- c(rep("A", 7), rep("B", 10))
year <- c(2008:2014, 2005:2014)
dat <- data.frame(group, year)
group year
1 A 2008
2 A 2009
3 A 2010
4 A 2011
5 A 2012
6 A 2013
7 A 2014
8 B 2005
9 B 2006
10 B 2007
11 B 2008
12 B 2009
13 B 2010
14 B 2011
15 B 2012
16 B 2013
17 B 2014
To this:
> dat
group year period
1 A 2008 2005_2009
2 A 2009 2005_2009
3 A 2010 2010_2014
4 A 2011 2010_2014
5 A 2012 2010_2014
6 A 2013 2010_2014
7 A 2014 2010_2014
8 B 2005 2005_2009
9 B 2006 2005_2009
10 B 2007 2005_2009
11 B 2008 2005_2009
12 B 2009 2005_2009
13 B 2010 2010_2014
14 B 2011 2010_2014
15 B 2012 2010_2014
16 B 2013 2010_2014
17 B 2014 2010_2014
I guess I could use cut(dat$year, breaks = ??) but I don't know how to set the breaks.
Here is one way of doing it:
start <- floor(dat$year/5)*5
dat$period <- paste(start, start+4, sep = "_")
I guess the trick here is to get the largest multiple of x that does not exceed your year, using floor(year/x)*x.
Here is a version that should work generally:
x <- 5
yearstart <- 2000
start <- floor((dat$year-yearstart)/x)*x+yearstart
dat$period <- paste(start, start+x-1, sep = "_")
You can use yearstart to ensure that e.g. the year 2000 is the first in a group even when 2000 is not a multiple of x.
cut should do the job if you create actual Date objects from your 'year' column.
## convert 'year' column to dates
yrs <- paste0(dat$year, "-01-01")
yrs <- as.Date(yrs)
## create cuts of 5 years and add them to data.frame
dat$period <- cut(yrs, "5 years")
## create desired factor levels
library(lubridate)
lvl <- as.Date(levels(dat$period))
lvl <- paste(year(lvl), year(lvl) + 4, sep = "_")
levels(dat$period) <- lvl
head(dat)
group year period
1 A 2008 2005_2009
2 A 2009 2005_2009
3 A 2010 2010_2014
4 A 2011 2010_2014
5 A 2012 2010_2014
6 A 2013 2010_2014
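Since the question asks how to set the breaks for cut() directly, here is a minimal sketch using numeric breaks. The names width, lo, hi, breaks and labels are illustrative assumptions; right = FALSE makes each interval include its lower bound and exclude its upper bound.

```r
dat <- data.frame(
  group = c(rep("A", 7), rep("B", 10)),
  year  = c(2008:2014, 2005:2014)
)

width <- 5
# Round the range outwards to multiples of the period width
lo <- floor(min(dat$year) / width) * width
hi <- (floor(max(dat$year) / width) + 1) * width
breaks <- seq(lo, hi, by = width)

# One label per interval, e.g. "2005_2009"
labels <- paste(head(breaks, -1), head(breaks, -1) + width - 1, sep = "_")

# Intervals are [2005, 2010), [2010, 2015), ...
dat$period <- cut(dat$year, breaks = breaks, labels = labels, right = FALSE)
```

Unlike the paste() versions, this returns a factor, so the periods keep their order in tables and plots.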
