I've got a df with multiple columns containing information on species sightings over the years at different sites, so each year may show multiple records. I would like to filter my df and calculate some operations based on certain columns, but I'd like to keep all columns for further analyses. I had some previous code using summarise, but since I want to keep all columns I was trying to avoid it.
Let's say the columns I'm interested in working with at the moment are as follows:
df <- data.frame("Country" = LETTERS[1:5], "Site"=LETTERS[6:10], "species"=1:5, "Year"=1981:2010)
I would like to calculate:
1- The cumulative sum of the records in which a species has been documented within each site, in a new column "Spsum".
2- The number of different years in which each species has been seen at a particular site (this could be a cumulative count as well), in a new column "nYear".
For example, if species 1 was recorded 5 times in 1981 and 2 times in 1982 at Site G, Spsum would show 7 (the cumulative sum of records) whereas nYear would show 2, as it was spotted over two different years. So far I've got this, but nYear is displaying 0s as a result.
Df1 <- df %>%
  filter(Year > 1980) %>%
  group_by(Country, Site, species, Year) %>%
  mutate(nYear = n_distinct(Year[species %in% Site])) %>%
  ungroup()
Thanks!
This could help, without the need for a join.
library(dplyr)
library(data.table)  # for rowid()

df %>%
  arrange(Country, Site, species, Year) %>%
  filter(Year > 1980) %>%
  group_by(Site, species) %>%
  mutate(nYear = length(unique(Year))) %>%
  mutate(spsum = rowid(species))
# A tibble: 30 x 6
# Groups: Site, species [5]
Country Site species Year nYear spsum
<chr> <chr> <int> <int> <int> <int>
1 A F 1 1981 6 1
2 A F 1 1986 6 2
3 A F 1 1991 6 3
4 A F 1 1996 6 4
5 A F 1 2001 6 5
6 A F 1 2006 6 6
7 B G 2 1982 6 1
8 B G 2 1987 6 2
9 B G 2 1992 6 3
10 B G 2 1997 6 4
# ... with 20 more rows
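If you'd rather not pull in data.table just for rowid(), row_number() does the same job inside the grouped mutate; likewise n_distinct(Year) is a drop-in replacement for length(unique(Year)).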
If the table contains multiple records per Country+Site+species+Year combination, I would first aggregate those and then calculate the cumulative counts from that. The counts can then be joined back to the original table.
Something along these lines:
cumulative_counts <- df %>%
  count(Country, Site, species, Year) %>%
  group_by(Country, Site, species) %>%
  arrange(Year) %>%
  mutate(Spsum = cumsum(n), nYear = row_number())

df %>%
  left_join(cumulative_counts)
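Called with no by argument, left_join() matches on all shared column names and prints a message telling you which ones it used. To make the keys explicit instead (a small sketch, using the column names from the toy data above):

df %>%
  left_join(cumulative_counts, by = c("Country", "Site", "species", "Year"))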
I have a tibble where each row contains a subject identifier and a year. My goal is to isolate, for each subject, only those rows which together constitute the longest sequence of rows in which the variable year increases by 1 from one row to the next.
I've tried quite a few things with a grouped filter, such as building helper variables that code whether the year on one row is one more (or less) than the year on the previous row, and using the rle() function, but so far nothing has worked exactly as it should.
Here is a toy example of my data. Note that the number of rows varies between subjects and that there are typically (some) gaps between years. Also note that the data have been arranged so that the year value always increases from one row to the next within each subject.
# A tibble: 8 x 2
subject year
<dbl> <dbl>
1 1 2012
2 1 2013
3 1 2015
4 1 2016
5 1 2017
6 1 2019
7 2 2011
8 2 2013
The toy example tibble can be recreated by running this code:
dat = structure(list(subject = c(1, 1, 1, 1, 1, 1, 2, 2),
                     year = c(2012, 2013, 2015, 2016, 2017, 2019, 2011, 2013)),
                row.names = c(NA, -8L),
                class = c("tbl_df", "tbl", "data.frame"))
To clarify, for this tibble the desired output is:
# A tibble: 3 x 2
subject year
<dbl> <dbl>
1 1 2015
2 1 2016
3 1 2017
(Note that subject 2 is dropped because she has no sequence of years increasing by one.)
There must be an elegant way to do this using dplyr!
This doesn't take into account ties, but ...
dat %>%
  group_by(subject) %>%
  # start a new run id r whenever year does not increase by exactly 1
  mutate(r = cumsum(c(TRUE, diff(year) != 1))) %>%
  group_by(subject, r) %>%
  mutate(rcount = n()) %>%   # length of each run
  group_by(subject) %>%
  # keep only the longest run(s) of length > 1 per subject
  filter(rcount > 1, rcount == max(rcount)) %>%
  select(-r, -rcount) %>%
  ungroup()
# # A tibble: 3 x 2
# subject year
# <dbl> <dbl>
# 1 1 2015
# 2 1 2016
# 3 1 2017
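If a subject has two runs tied for the maximal length, the filter above keeps both. One way to keep only the earliest run in that case (a quick sketch, reusing the run id r before dropping it):

dat %>%
  group_by(subject) %>%
  mutate(r = cumsum(c(TRUE, diff(year) != 1))) %>%
  group_by(subject, r) %>%
  mutate(rcount = n()) %>%
  group_by(subject) %>%
  filter(rcount > 1, rcount == max(rcount)) %>%
  filter(r == min(r)) %>%   # earliest of the tied runs
  select(-r, -rcount) %>%
  ungroup()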
For my data frame I want to filter the row with the highest year and, within that year, the highest month.
Sample data frame:
df <- data.frame(ID = c(1:5),
                 year = c(2018, 2018, 2018, 2018, 2019),
                 month = c(9, 10, 11, 12, 11))
I tried the following but this does not return any record (which makes sense, because the max year and max month are in different rows). Does anybody have the answer? Obviously my desired output would be row 5.
df %>% filter(year == max(year) & month == max(month))
We can do:
df %>%
  filter(year == max(year)) %>%   # keep only rows from the top year
  filter(month == max(month))     # then the top month within that year

  ID year month
1  5 2019    11
This will give back the highest year containing the highest month value within that year.
df <- df %>%
  arrange(desc(year), desc(month))
ID year month
1 5 2019 11
2 4 2018 12
3 3 2018 11
4 2 2018 10
5 1 2018 9
df[1,] # first row
ID year month
1 5 2019 11
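If you'd rather stay inside a single pipe, slice() can replace the manual indexing (and in dplyr 1.0+ slice_max() offers the same thing); a small sketch:

df %>%
  arrange(desc(year), desc(month)) %>%
  slice(1)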
I would like to generate a dummy treatment variable "treatment" based on a country variable "iso" and an earthquake dummy variable "quake" (in dataset "data").
I would basically like a dummy variable "treatment" where, if quake == 1 at least once for a given "iso" over my entire timeframe (say 2000-2018), all observations for that "iso" get treatment == 1, and all other countries get treatment == 0. So countries that are affected by earthquakes have 1 on all observations, others 0.
I have tried using dplyr, but since I'm still very green at R it has taken me multiple tries and I haven't found a solution yet. I've looked on this website and Google.
I suspect the solution should be something along these lines, but I can't finish it myself:
data %>%
  filter(quake == 1) %>%
  group_by(iso) %>%
  mutate(treatment)
Welcome to StackOverflow! You should really consider Sotos's links for your next questions on SO :)
Here is a dplyr solution (following what you started):
## data
set.seed(123)
data <- data.frame(year = rep(2000:2002, each = 26),
                   iso = rep(LETTERS, times = 3),
                   quake = sample(0:1, 26 * 3, replace = TRUE))
## solution (dplyr option)
library(dplyr)
data2 <- data %>%
  arrange(iso) %>%
  group_by(iso) %>%
  mutate(treatment = if_else(sum(quake) == 0, 0, 1))
data2
# A tibble: 78 x 4
# Groups: iso [26]
year iso quake treatment
<int> <fct> <int> <dbl>
1 2000 A 0 1
2 2001 A 1 1
3 2002 A 1 1
4 2000 B 1 1
5 2001 B 1 1
6 2002 B 0 1
7 2000 C 0 1
8 2001 C 0 1
9 2002 C 1 1
10 2000 D 1 1
# ... with 68 more rows
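For reference, the same idea in base R (a small sketch, no packages needed): ave() applies a function to quake within each iso group and returns a vector of the same length, so any() flags every row of an affected country.

data$treatment <- ave(data$quake, data$iso,
                      FUN = function(x) as.integer(any(x == 1)))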
I'm trying to modify a solution posted here: Create cohort dropout rate table from raw data.
I'd like to create a CUMULATIVE dropout rate table using these data.
library(data.table)

DT <- data.table(
  id = c(1,2,3,4,5,6,7,8,9,10,
         11,12,13,14,15,16,17,18,19,20,
         21,22,23,24,25,26,27,28,29,30,31,32,33,34,35),
  year = c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2014,
           2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,
           2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016),
  cohort = c(1,1,1,1,1,1,1,1,1,1,
             2,2,2,1,1,2,1,2,1,2,
             1,1,3,3,3,2,2,2,2,3,3,3,3,3,3))
So far, I've been able to get to this point:
library(tidyverse)
DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  spread(year, n) %>%
  mutate(y2014_2015_dropouts = `2014` - `2015`,
         y2015_2016_dropouts = `2015` - `2016`) %>%
  mutate(y2014_2015_cumulative = y2014_2015_dropouts / `2014`,
         y2015_2016_cumulative = y2015_2016_dropouts / `2014` + y2014_2015_cumulative) %>%
  replace_na(list(y2014_2015_dropouts = 0.0,
                  y2015_2016_dropouts = 0.0)) %>%
  select(cohort, y2014_2015_dropouts, y2015_2016_dropouts,
         y2014_2015_cumulative, y2015_2016_cumulative)
A cumulative dropout rate table reflects the proportion of students within a cohort who have dropped out of school by the end of each year.
# A tibble: 3 x 5
cohort y2014_2015_dropouts y2015_2016_dropouts y2014_2015_cumulative y2015_2016_cumulative
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 6 2 0.6 0.8
2 2 0 2 NA NA
3 3 0 0 NA NA
The last two columns of the tibble show that by the end of year 2014-2015, 60% of cohort 1 students dropped out; and by the end of year 2015-2016, 80% of cohort 1 students had dropped out.
I'd like to calculate the same for cohorts 2 and 3, but I don't know how to do it.
Here is an alternative data.table solution that keeps your data organized in a way that I find easier to deal with. Using your DT input data:
Organize and order by cohort and year:
DT2 <- DT[, .N, by = list(cohort, year)][order(cohort, year)]
Assign the year range (shift() is data.table's native lag; grouping by cohort keeps the label within each cohort):
DT2[, year := paste(shift(year), year, sep = "_"), by = cohort]
Get dropouts per year:
DT2[, dropouts := ifelse(!is.na(shift(N)), shift(N) - N, 0), by = cohort]
Get the cumulative sum of the proportion dropped out each year, per cohort:
DT2[, cumul := cumsum(dropouts) / max(N), by = cohort]
Output:
> DT2
   cohort      year  N dropouts     cumul
1:      1   NA_2014 10        0 0.0000000
2:      1 2014_2015  4        6 0.6000000
3:      1 2015_2016  2        2 0.8000000
4:      2   NA_2015  6        0 0.0000000
5:      2 2015_2016  4        2 0.3333333
6:      3   NA_2016  9        0 0.0000000
Because you spread your data by year early in your pipe, and the `2014` column has NA values for everything related to cohorts 2 and 3, you need to coalesce the denominator in your calculation for y2015_2016_cumulative. If you change the definition of that variable from the current

y2015_2016_cumulative = y2015_2016_dropouts / `2014` + y2014_2015_cumulative

to

y2015_2016_cumulative = y2015_2016_dropouts / coalesce(`2014`, `2015`) +
  coalesce(y2014_2015_cumulative, 0)

you should be good to go. The coalesce() function tries the first argument but falls back to the second wherever the first is NA. That being said, this method isn't very scalable: you would have to add an additional coalesce() for every year you added. If you keep your data in the long ("tidy") format instead, you can keep a running tally at the year-cohort level using:
DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  group_by(cohort) %>%
  mutate(dropouts = lag(n) - n,                  # students lost since the previous year
         dropout_rate = dropouts / max(n)) %>%   # relative to the cohort's starting size
  replace_na(list(dropouts = 0, n = 0, dropout_rate = 0)) %>%
  mutate(cumulative_dropouts = cumsum(dropouts),
         cumulative_dropout_rate = cumulative_dropouts / max(n))
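For the sample DT this should line up with the data.table answer above: cumulative rates of 0, 0.6, and 0.8 for cohort 1, 0 and roughly 0.33 for cohort 2, and 0 for cohort 3.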