I'm trying to create a new column that is conditionally based on several other columns. Here is my data. I am trying to create a year over year difference column.
person <- c(rep("A", 4), rep("B", 1), rep("C", 3), rep("D", 1))
score <- c(1, 1, 2, 4, 1, 1, 2, 2, 3)
year <- c(2017, 2016, 2015, 2014, 2015, 2017, 2015, 2014, 2017)
df <- data.frame(person, score, year)
The function should look up the previous year's score for the same person and subtract it from their current score. If there is no data for the previous year, it should return NA. So for my data, I would get a new column "difference" with the values 0, -1, -2, NA, NA, NA, 0, NA, NA.
I would love to see a dplyr answer, but base R solutions are welcome too.
By using dplyr
library(dplyr)
df %>%
arrange(person, year) %>%
group_by(person) %>%
mutate(per = ifelse(year - lag(year) == 1, score - lag(score), NA)) %>%
arrange(person, -year)
# A tibble: 9 x 4
# Groups: person [4]
person score year per
<fctr> <dbl> <dbl> <dbl>
1 A 1 2017 0
2 A 1 2016 -1
3 A 2 2015 -2
4 A 4 2014 NA
5 B 1 2015 NA
6 C 1 2017 NA
7 C 2 2015 0
8 C 2 2014 NA
9 D 3 2017 NA
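Since base R solutions were also welcome, here is a minimal sketch that avoids dplyr. It assumes each person has at most one row per year and rebuilds df from the vectors in the question; the helper name prev is only illustrative.

```r
person <- c(rep("A", 4), rep("B", 1), rep("C", 3), rep("D", 1))
score <- c(1, 1, 2, 4, 1, 1, 2, 2, 3)
year <- c(2017, 2016, 2015, 2014, 2015, 2017, 2015, 2014, 2017)
df <- data.frame(person, score, year)

# For each row, find the row holding the same person's previous year.
# match() returns NA when no such row exists.
prev <- match(paste(df$person, df$year - 1), paste(df$person, df$year))

# Indexing with NA yields NA, so rows without a previous year get NA.
df$difference <- df$score - df$score[prev]
df$difference
# 0 -1 -2 NA NA NA 0 NA NA
```

Because the lookup is done by key rather than by position, this version does not depend on the row order of df.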
Just to answer the question you asked under Wen's answer:
You can check out chapter 5 of this book (http://r4ds.had.co.nz/transform.html) to learn about every function and symbol used in Wen's answer.
You can also read this (http://varianceexplained.org/r/teach-tidyverse/) to get a basic sense of base R versus the tidyverse.
I am working with the R programming language.
I have the following dataset:
library(dplyr)
my_data = data.frame(id = c(1,1,1,1,1,1, 2,2,2) , year = c(2010, 2011, 2012, 2013, 2015, 2016, 2015, 2016, 2020), var = c(1,7,3,9,5,6, 88, 12, 5))
> my_data
id year var
1 1 2010 1
2 1 2011 7
3 1 2012 3
4 1 2013 9
5 1 2015 5
6 1 2016 6
7 2 2015 88
8 2 2016 12
9 2 2020 5
My Question: For each ID - I want to find out when the first "non-consecutive" year occurs, and then delete all remaining rows.
For example:
When ID = 1, the first "jump" occurs at 2013 (i.e. there is no 2014). Therefore, I would like to delete all rows after 2013.
When ID = 2, the first "jump" occurs at 2016 - therefore, I would like to delete all rows after 2016.
This was my attempt to write the code for this problem:
final = my_data %>%
group_by(id) %>%
mutate(break_index = which(diff(year) > 1)[1]) %>%
group_by(id, add = TRUE) %>%
slice(1:break_index)
The code appears to be working - but I get the following warning messages which are concerning me:
Warning messages:
1: In 1:break_index :
numerical expression has 6 elements: only the first used
2: In 1:break_index :
numerical expression has 3 elements: only the first used
Can someone please tell me if I have done this correctly?
Thanks!
You get the warning because break_index is a column, so within each group it holds one value per row (6 values for id 1 and 3 values for id 2). Those values are identical within a group and 1:break_index uses only the first one, so your attempt does give the right result. To avoid the warning, pass a single value explicitly: try slice(1:break_index[1]) or slice(1:first(break_index)).
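As a rough sketch, the asker's pipeline with that fix applied might look like this. It assumes the my_data frame from the question and drops the second group_by() call, which is not needed once the data are already grouped by id.

```r
library(dplyr)

my_data <- data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  year = c(2010, 2011, 2012, 2013, 2015, 2016, 2015, 2016, 2020),
  var  = c(1, 7, 3, 9, 5, 6, 88, 12, 5)
)

final <- my_data %>%
  group_by(id) %>%
  # position of the last row before the first gap, repeated on every row
  mutate(break_index = which(diff(year) > 1)[1]) %>%
  # first() collapses the repeated value to a single number, so no warning
  slice(1:first(break_index)) %>%
  ungroup()

final$year
# 2010 2011 2012 2013 2015 2016
```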
Here is another way to handle this.
library(dplyr)
my_data %>%
group_by(id) %>%
filter(row_number() <= which(diff(year) > 1)[1])
# id year var
# <dbl> <dbl> <dbl>
#1 1 2010 1
#2 1 2011 7
#3 1 2012 3
#4 1 2013 9
#5 2 2015 88
#6 2 2016 12
With dplyr 1.1.0 and later, we can use temporary grouping with .by:
my_data %>%
filter(row_number() <= which(diff(year) > 1)[1], .by = id)
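One edge case worth noting: if a group has no gap at all, which(diff(year) > 1)[1] is NA, the comparison is NA for every row, and filter() drops the whole group. The sketch below is one possible variant that keeps such groups intact. The extra id 3 rows are hypothetical data added only to show the difference.

```r
library(dplyr)

my_data <- data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  year = c(2010, 2011, 2012, 2013, 2015, 2016, 2015, 2016, 2020, 2001, 2002),
  var  = c(1, 7, 3, 9, 5, 6, 88, 12, 5, 4, 8)  # id 3 values are made up
)

kept <- my_data %>%
  group_by(id) %>%
  # the running count of gaps stays at 0 until the first jump of more than a year
  filter(cumsum(c(FALSE, diff(year) > 1)) == 0) %>%
  ungroup()

kept$year
# 2010 2011 2012 2013 2015 2016 2001 2002
```

Whether a gap-free group should be kept or dropped depends on the use case; this variant assumes it should be kept.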
I have a tibble where each row contains a subject identifier and a year. My goal is to isolate, for each subject, only those rows which together constitute the longest sequence of rows in which the variable year increases by 1 from one row to the next.
I've tried quite a few things with a grouped filter, such as building helper variables that code whether year on one row is one more or less than year on the previous row, and using the rle() function. But so far nothing has worked exactly as it should.
Here is a toy example of my data. Note that the number of rows varies between subjects and that there are typically (some) gaps between years. Also note that the data have been arranged so that the year value always increases from one row to the next within each subject.
# A tibble: 8 x 2
subject year
<dbl> <dbl>
1 1 2012
2 1 2013
3 1 2015
4 1 2016
5 1 2017
6 1 2019
7 2 2011
8 2 2013
The toy example tibble can be recreated by running this code:
dat = structure(list(subject = c(1, 1, 1, 1, 1, 1, 2, 2), year = c(2012,
2013, 2015, 2016, 2017, 2019, 2011, 2013)), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
To clarify, for this tibble the desired output is:
# A tibble: 3 x 2
subject year
<dbl> <dbl>
1 1 2015
2 1 2016
3 1 2017
(Note that subject 2 is dropped because she has no sequence of years increasing by one.)
There must be an elegant way to do this using dplyr!
This doesn't take into account ties, but ...
dat %>%
group_by(subject) %>%
mutate( r = cumsum(c(TRUE, diff(year) != 1)) ) %>%
group_by(subject, r) %>%
mutate( rcount = n() ) %>%
group_by(subject) %>%
filter(rcount > 1, rcount == max(rcount)) %>%
select(-r, -rcount) %>%
ungroup()
# # A tibble: 3 x 2
# subject year
# <dbl> <dbl>
# 1 1 2015
# 2 1 2016
# 3 1 2017
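If ties matter, one possible adjustment is to keep only the first of the longest runs by selecting the run id at which.max(rcount). The sketch below uses a hypothetical data frame, tied, with two runs of equal length; it is an illustration, not part of the original answer.

```r
library(dplyr)

# Hypothetical data: subject 1 has two runs of length 2 (a tie)
tied <- data.frame(
  subject = c(1, 1, 1, 1),
  year    = c(2012, 2013, 2015, 2016)
)

res <- tied %>%
  group_by(subject) %>%
  mutate(r = cumsum(c(TRUE, diff(year) != 1))) %>%   # run id within subject
  group_by(subject, r) %>%
  mutate(rcount = n()) %>%                           # length of each run
  group_by(subject) %>%
  # which.max() returns the first maximum, so only the earliest longest run is kept
  filter(rcount > 1, r == r[which.max(rcount)]) %>%
  ungroup() %>%
  select(-r, -rcount)

res$year
# 2012 2013
```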
I have a data frame in R containing year-wise transaction data for multiple individuals. I want a new data frame with columns based on certain conditions, such as the total revenue for each individual per year in a particular category.
For example:
ID year a b c d
1 2015 2 4 6 8
1 2015 4 6 8 10
1 2016 7 6 9 5
2 2015 7 5 6 2
2 2016 3 4 5 2
I want a data frame with a column of total values for ID 1 in 2015, ID 1 in 2016, ID 2 in 2015, and so on. I would also like to add another condition, such as totalling only those entries that have a value greater than 5 in column a.
Please give your suggestions. Any help will be appreciated.
Based on your question, I used the dplyr package, which is incredibly helpful if you don't already have it.
First, group your data by ID and then year. Then create sums for your four columns based on these groupings:
library(dplyr)
mydata <- data.frame("ID" = c(1,1,1,2,2),
"year" = c(2015, 2015, 2016, 2015, 2016),
"a" = c(2,4,7,7,3),
"b" = c(4,6,6,5,4),
"c" = c(6,8,9,6,5),
"d" = c(8, 10, 5, 2, 2))
mydata %>% group_by(ID, year) %>%
summarise(a = sum(a), b = sum(b),
c = sum(c), d = sum(d))
To sum only the values greater than 5, subset each column inside the summarise() call as follows:
mydata %>% group_by(ID, year) %>%
summarise(a = sum(a[a > 5]), b = sum(b[b > 5]),
c = sum(c[c > 5]), d = sum(d[d > 5]))
I hope this helps!
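The question can also be read as asking for totals over only those rows where column a exceeds 5. Under that assumption, a minimal sketch would filter the rows first and then sum. It uses across(), which needs dplyr 1.0.0 or later; the name totals is only illustrative.

```r
library(dplyr)

mydata <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  year = c(2015, 2015, 2016, 2015, 2016),
  a    = c(2, 4, 7, 7, 3),
  b    = c(4, 6, 6, 5, 4),
  c    = c(6, 8, 9, 6, 5),
  d    = c(8, 10, 5, 2, 2)
)

totals <- mydata %>%
  filter(a > 5) %>%                       # keep only rows where a is above 5
  group_by(ID, year) %>%
  summarise(across(a:d, sum), .groups = "drop")

totals
# two rows remain: ID 1 in 2016 and ID 2 in 2015
```

Note that ID/year combinations with no qualifying rows disappear from the result instead of showing a total of 0.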
I need help creating a cohort dropout table from raw data.
I have a dataset that looks like this:
library(data.table)
DT <- data.table(
id =c (1,2,3,4,5,6,7,8,9,10,
11,12,13,14,15,16,17,18,19,20,
21,22,23,24,25,26,27,28,29,30,31,32,33,34,35),
year =c (2014,2014,2014,2014,2014,2014,2014,2014,2014,2014,
2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,
2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016),
cohort =c(1,1,1,1,1,1,1,1,1,1,
2,2,2,1,1,2,1,2,1,2,
1,1,3,3,3,2,2,2,2,3,3,3,3,3,3))
I want to calculate the dropout rate by cohort, and get a table like this:
cohortdt<-data.table(
cohort =c(1,2,3),
drop_rateY1 =c(.60,0.0,0.0),
droprate_Y2 =c (.50,.33,0.0))
For cohort 1, the dropout rate at the end of Y1 is 60% (i.e. 60 percent of the students who were originally enrolled dropped out at the end of year 1). The value in Y2 means that 50% of those who remained at the end of year 1 dropped out at the end of year 2.
How can I create a table like this from the raw data?
Here is one solution:
library(tidyverse)
DT %>%
group_by(year) %>%
count(cohort) %>%
ungroup() %>%
spread(year, n) %>%
mutate(year_1_drop_rate = 1 - (`2015` / `2014`),
year_2_drop_rate = 1 - (`2016` / `2015`)) %>%
replace_na(list(year_1_drop_rate = 0.0,
year_2_drop_rate = 0.0)) %>%
select(cohort, year_1_drop_rate, year_2_drop_rate)
Which returns:
# A tibble: 3 x 3
cohort year_1_drop_rate year_2_drop_rate
<dbl> <dbl> <dbl>
1 1 0.6 0.5000000
2 2 0.0 0.3333333
3 3 0.0 0.0000000
Step by step:
1. group by year
2. count cohort within each year group
3. ungroup
4. spread year into the columns 2014, 2015, and 2016
5. mutate twice to get the dropout rates for year 1 and year 2
6. replace_na to turn missing rates into 0
7. select cohort, year_1_drop_rate, and year_2_drop_rate
This solution takes a tidy dataset and makes it untidy by spreading the year variable (i.e. 2014, 2015, and 2016 are separate columns).
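In current tidyr, spread() is superseded by pivot_wider(). A sketch of the same pipeline with that substitution follows, assuming tidyr 1.0.0 or later and using a plain data.frame copy of the question's data. As in the answer above, a missing rate is replaced by 0.

```r
library(dplyr)
library(tidyr)

DT <- data.frame(
  id     = 1:35,
  year   = rep(c(2014, 2015, 2016), times = c(10, 10, 15)),
  cohort = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             2, 2, 2, 1, 1, 2, 1, 2, 1, 2,
             1, 1, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
)

rates <- DT %>%
  count(cohort, year) %>%                              # students per cohort and year
  pivot_wider(names_from = year, values_from = n) %>%  # one column per year
  mutate(year_1_drop_rate = 1 - (`2015` / `2014`),
         year_2_drop_rate = 1 - (`2016` / `2015`)) %>%
  replace_na(list(year_1_drop_rate = 0, year_2_drop_rate = 0)) %>%
  select(cohort, year_1_drop_rate, year_2_drop_rate)
```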
I have a simple data.table solution:
x <- DT[,.N, by = .(cohort,year)]
This counts the number of students in each cohort each year and stores the result in the new data.table x.
x[,drop := (1-N/(c(NA,N[-.N])))*100,by = cohort]
Here I take the ratio between the number of students in a given year and the number in the previous year (c(NA,N[-.N]) is N shifted down by one position within each cohort), which gives the percentage of students lost each year.
x[,.SD,by = cohort]
cohort year N drop
1: 1 2014 10 NA
2: 1 2015 4 60.00000
3: 1 2016 2 50.00000
4: 2 2015 6 NA
5: 2 2016 4 33.33333
6: 3 2016 9 NA
Hope it helps
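The manual shift c(NA, N[-.N]) can also be written with data.table's shift(), which lags by one position by default. The sketch below starts from the per-cohort counts shown in the output above, so it runs on its own. The setorder() call is an added precaution, since the lag only makes sense when years are in ascending order within each cohort.

```r
library(data.table)

# Head counts per cohort and year, copied from the output above
x <- data.table(
  cohort = c(1, 1, 1, 2, 2, 3),
  year   = c(2014, 2015, 2016, 2015, 2016, 2016),
  N      = c(10, 4, 2, 6, 4, 9)
)

setorder(x, cohort, year)                            # ensure ascending years per cohort
x[, drop := (1 - N / shift(N)) * 100, by = cohort]   # shift(N) is the previous year's N
x$drop
# NA 60 50 NA 33.33333 NA
```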
I'm just getting started using dplyr and I have the following two problems, which should be easy to solve with group_by, but I don't get it.
I have data that looks like this:
data <- data.frame(cbind("year" = c(2010, 2010, 2010, 2011, 2012, 2012, 2012, 2012),
"institution" = c("a", "a", "b", "a", "a", "a", "b", "b"),
"branch.num" = c(1, 2, 1, 1, 1, 2, 1, 2)))
data
# year institution branch.num
#1 2010 a 1
#2 2010 a 2
#3 2010 b 1
#4 2011 a 1
#5 2012 a 1
#6 2012 a 2
#7 2012 b 1
#8 2012 b 2
The data is structured hierarchically: an institution, at the highest level, can have several branches, which are numbered starting at 1.
Problem 1: I want to select only the rows for branches that have a value in every year. In the example data that is only Branch 1 of Institution a, so the selection should be lines 1, 4 and 5.
Problem 2: I want to know the average number of branches an institution has over all years. In the example that is (2+1+2)/3 = 1.67 for Institution a and (1+0+2)/3 = 1 for Institution b.
Here is one solution:
Problem #1:
library(dplyr)
nYears <- n_distinct(data$year)
data %>% group_by(institution, branch.num) %>% filter(n_distinct(year) == nYears)
Source: local data frame [3 x 3]
Groups: institution, branch.num [1]
year institution branch.num
(fctr) (fctr) (fctr)
1 2010 a 1
2 2011 a 1
3 2012 a 1
Problem #2:
data %>%
group_by(institution, year) %>%
summarise(nBranches = n_distinct(branch.num)) %>%
ungroup() %>%
group_by(institution) %>%
summarise(meanBranches = sum(nBranches)/nYears)
Source: local data frame [2 x 2]
institution meanBranches
(fctr) (dbl)
1 a 1.666667
2 b 1.000000
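The (fctr) column types in the output come from building the data with data.frame(cbind(...)): cbind() on mixed numeric and character vectors produces a character matrix, so every column loses its numeric type. A minimal sketch that keeps year and branch.num numeric is shown below. The names data2 and mean_branches are illustrative, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Same values as in the question, built without cbind() so types are preserved
data2 <- data.frame(
  year        = c(2010, 2010, 2010, 2011, 2012, 2012, 2012, 2012),
  institution = c("a", "a", "b", "a", "a", "a", "b", "b"),
  branch.num  = c(1, 2, 1, 1, 1, 2, 1, 2)
)

nYears <- n_distinct(data2$year)

mean_branches <- data2 %>%
  group_by(institution, year) %>%
  summarise(nBranches = n_distinct(branch.num), .groups = "drop") %>%
  group_by(institution) %>%
  # dividing by nYears counts years without any branch as 0
  summarise(meanBranches = sum(nBranches) / nYears)

mean_branches$meanBranches
# 1.666667 1.000000
```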