I'm just getting started using dplyr and I have the following two problems, which should be easy to solve with group_by, but I don't get it.
I have data that looks like this:
data <- data.frame(cbind("year" = c(2010, 2010, 2010, 2011, 2012, 2012, 2012, 2012),
"institution" = c("a", "a", "b", "a", "a", "a", "b", "b"),
"branch.num" = c(1, 2, 1, 1, 1, 2, 1, 2)))
data
# year institution branch.num
#1 2010 a 1
#2 2010 a 2
#3 2010 b 1
#4 2011 a 1
#5 2012 a 1
#6 2012 a 2
#7 2012 b 1
#8 2012 b 2
The data is structured hierarchically: an institution, at the highest level, can have several branches, which are numbered starting at 1.
Problem 1: I want to select only the rows for branches that have a value in every year. In the example data that is only branch 1 of institution a, so the selection should be lines 1, 4 and 5.
Problem 2: I want to know the average number of branches an institution has over all years. In the example that is (2+1+2)/3 = 1.67 for institution a and (1+0+2)/3 = 1 for institution b.
Here is one solution:
Problem #1:
library(dplyr)
nYears <- n_distinct(data$year)
data %>% group_by(institution, branch.num) %>% filter(n_distinct(year) == nYears)
Source: local data frame [3 x 3]
Groups: institution, branch.num [1]
year institution branch.num
(fctr) (fctr) (fctr)
1 2010 a 1
2 2011 a 1
3 2012 a 1
Problem #2:
data %>%
  group_by(institution, year) %>%
  summarise(nBranches = n_distinct(branch.num)) %>%
  ungroup() %>%
  group_by(institution) %>%
  summarise(meanBranches = sum(nBranches)/nYears)
Source: local data frame [2 x 2]
institution meanBranches
(fctr) (dbl)
1 a 1.666667
2 b 1.000000
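To see why the answer above divides by nYears instead of calling mean(), here is a minimal sketch that computes both side by side. The names branches, n_years, per_year and compare, and the two output column names, are illustrative and not part of the original answer; the data frame is also rebuilt with plain numeric columns, which is an assumption made for readability.

```r
library(dplyr)

# Same values as the question, but built without cbind() so year stays numeric
branches <- data.frame(
  year        = c(2010, 2010, 2010, 2011, 2012, 2012, 2012, 2012),
  institution = c("a", "a", "b", "a", "a", "a", "b", "b"),
  branch.num  = c(1, 2, 1, 1, 1, 2, 1, 2)
)
n_years <- n_distinct(branches$year)

# Number of branches per institution and year (absent years have no row)
per_year <- branches %>%
  group_by(institution, year) %>%
  summarise(nBranches = n_distinct(branch.num), .groups = "drop")

# mean() only averages over years that are present;
# sum() / n_years treats absent years as zero branches
compare <- per_year %>%
  group_by(institution) %>%
  summarise(
    mean_over_present_years = mean(nBranches),
    mean_over_all_years     = sum(nBranches) / n_years
  )
```

Institution b has no row for 2011, so the two columns differ for it; for institution a they agree.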
I am working with the R programming language.
I have the following dataset:
library(dplyr)
my_data = data.frame(id = c(1,1,1,1,1,1, 2,2,2) , year = c(2010, 2011, 2012, 2013, 2015, 2016, 2015, 2016, 2020), var = c(1,7,3,9,5,6, 88, 12, 5))
> my_data
id year var
1 1 2010 1
2 1 2011 7
3 1 2012 3
4 1 2013 9
5 1 2015 5
6 1 2016 6
7 2 2015 88
8 2 2016 12
9 2 2020 5
My Question: For each ID - I want to find out when the first "non-consecutive" year occurs, and then delete all remaining rows.
For example:
When ID = 1, the first "jump" occurs at 2013 (i.e. there is no 2014). Therefore, I would like to delete all rows after 2013.
When ID = 2, the first "jump" occurs at 2016 - therefore, I would like to delete all rows after 2016.
This was my attempt to write the code for this problem:
final = my_data %>%
group_by(id) %>%
mutate(break_index = which(diff(year) > 1)[1]) %>%
group_by(id, add = TRUE) %>%
slice(1:break_index)
The code appears to be working - but I get the following warning messages which are concerning me:
Warning messages:
1: In 1:break_index :
numerical expression has 6 elements: only the first used
2: In 1:break_index :
numerical expression has 3 elements: only the first used
Can someone please tell me if I have done this correctly?
Thanks!
You get the warning because break_index is a column, so within each group it holds one copy of the value per row (6 for id 1, 3 for id 2), and 1:break_index uses only the first. Since every copy within a group is the same, your attempt still gives the right result. To avoid the warning, pick a single value explicitly: try slice(1:break_index[1]) or slice(1:first(break_index)).
Here is another way to handle this.
library(dplyr)
my_data %>%
group_by(id) %>%
filter(row_number() <= which(diff(year) > 1)[1])
# id year var
# <dbl> <dbl> <dbl>
#1 1 2010 1
#2 1 2011 7
#3 1 2012 3
#4 1 2013 9
#5 2 2015 88
#6 2 2016 12
With dplyr 1.1.0 or later, we can use temporary per-operation grouping with .by:
my_data %>%
filter(row_number() <= which(diff(year) > 1)[1], .by = id)
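One caveat with the filter() approaches above: if an id has no gap at all, which(diff(year) > 1)[1] is NA, the comparison is NA for every row, and filter() drops the whole group. If you would rather keep such groups in full, a possible sketch is to fall back to the group size with coalesce(). The data frame gap_data and the result name kept are hypothetical, made up for this illustration.

```r
library(dplyr)

# Hypothetical data: id 1 has a gap after 2011, id 3 has no gap at all
gap_data <- data.frame(
  id   = c(1, 1, 1, 3, 3, 3),
  year = c(2010, 2011, 2013, 2018, 2019, 2020)
)

kept <- gap_data %>%
  group_by(id) %>%
  # which(...)[1] is NA_integer_ when there is no gap; fall back to n()
  filter(row_number() <= coalesce(which(diff(year) > 1)[1], n())) %>%
  ungroup()
```

Whether a gap-free group should be kept or dropped depends on what you want; the original answers drop it.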
I have a tibble where each row contains a subject identifier and a year. My goal is to isolate, for each subject, only those rows which together constitute the longest sequence of rows in which the variable year increases by 1 from one row to the next.
I've tried quite a few things with a grouped filter, such as building helper variables that code whether year on one row is one more or less than year on the previous row, and using the rle() function. But so far nothing has worked exactly as it should.
Here is a toy example of my data. Note that the number of rows varies between subjects and that there are typically (some) gaps between years. Also note that the data have been arranged so that the year value always increases from one row to the next within each subject.
# A tibble: 8 x 2
subject year
<dbl> <dbl>
1 1 2012
2 1 2013
3 1 2015
4 1 2016
5 1 2017
6 1 2019
7 2 2011
8 2 2013
The toy example tibble can be recreated by running this code:
dat = structure(list(subject = c(1, 1, 1, 1, 1, 1, 2, 2), year = c(2012,
2013, 2015, 2016, 2017, 2019, 2011, 2013)), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
To clarify, for this tibble the desired output is:
# A tibble: 3 x 2
subject year
<dbl> <dbl>
1 1 2015
2 1 2016
3 1 2017
(Note that subject 2 is dropped because she has no sequence of years increasing by one.)
There must be an elegant way to do this using dplyr!
This doesn't take into account ties, but ...
dat %>%
group_by(subject) %>%
mutate( r = cumsum(c(TRUE, diff(year) != 1)) ) %>%
group_by(subject, r) %>%
mutate( rcount = n() ) %>%
group_by(subject) %>%
filter(rcount > 1, rcount == max(rcount)) %>%
select(-r, -rcount) %>%
ungroup()
# # A tibble: 3 x 2
# subject year
# <dbl> <dbl>
# 1 1 2015
# 2 1 2016
# 3 1 2017
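On the ties caveat: with the code above, if a subject has two runs of the same maximal length, both runs are kept. If you want only the first of them, one possible sketch is to compare the run id r against the run id of the first row that has the maximal count. The data tie_dat and the result name first_longest are hypothetical, invented for this illustration.

```r
library(dplyr)

# Hypothetical data: subject 1 has two runs of length 2 (2012-2013, 2015-2016)
tie_dat <- tibble(
  subject = c(1, 1, 1, 1, 1),
  year    = c(2012, 2013, 2015, 2016, 2018)
)

first_longest <- tie_dat %>%
  group_by(subject) %>%
  mutate(r = cumsum(c(TRUE, diff(year) != 1))) %>%  # run id within subject
  group_by(subject, r) %>%
  mutate(rcount = n()) %>%                          # length of each run
  group_by(subject) %>%
  # which.max() returns the first position of the maximum, so ties
  # resolve to the earliest run
  filter(rcount > 1, r == r[which.max(rcount)]) %>%
  ungroup() %>%
  select(-r, -rcount)
```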
I'm trying to create a new column that is conditionally based on several other columns. Here is my data. I am trying to create a year over year difference column.
> person <- c(rep("A", 4), rep("B", 1), rep("C",3), rep("D",1))
> score <- c(1,1,2,4,1,1,2,2,3)
> year <- c(2017, 2016, 2015, 2014, 2015, 2017, 2015, 2014, 2017)
The function I'm after would look for the previous year's data for that individual person and subtract that score from their current score. If there is no previous-year data, it returns NA. So for my data, I would get a new column "difference" with the values 0, -1, -2, NA, NA, NA, 0, NA, NA.
I would love to see a dplyr answer, but base R solutions are welcome too.
By using dplyr
library(dplyr)
# build the data frame from the vectors in the question
df <- data.frame(person, score, year)
df %>%
arrange(person, year) %>%
group_by(person) %>%
mutate(per = ifelse(year - lag(year) == 1, score - lag(score), NA)) %>%
arrange(person, -year)
# A tibble: 9 x 4
# Groups: person [4]
person score year per
<fctr> <dbl> <dbl> <dbl>
1 A 1 2017 0
2 A 1 2016 -1
3 A 2 2015 -2
4 A 4 2014 NA
5 B 1 2015 NA
6 C 1 2017 NA
7 C 2 2015 0
8 C 2 2014 NA
9 D 3 2017 NA
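As an alternative to sorting and lag(), the same "previous year" lookup can be sketched as a self-join: shift each row's year forward by one and join it back on person and year. This keeps the original row order and does not depend on the rows being sorted. The names scores, prev, prev_score and result are illustrative and not from the answer above.

```r
library(dplyr)

scores <- data.frame(
  person = c(rep("A", 4), rep("B", 1), rep("C", 3), rep("D", 1)),
  score  = c(1, 1, 2, 4, 1, 1, 2, 2, 3),
  year   = c(2017, 2016, 2015, 2014, 2015, 2017, 2015, 2014, 2017)
)

# Each row, relabelled as "the previous year" of year + 1
prev <- scores %>%
  transmute(person, year = year + 1, prev_score = score)

result <- scores %>%
  left_join(prev, by = c("person", "year")) %>%
  mutate(difference = score - prev_score) %>%  # NA where no previous year exists
  select(-prev_score)
```

This assumes at most one row per person and year; duplicates would multiply rows in the join.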
Just to answer the question you put forward under Wen's answer:
You can check out chapter 5 of this book (http://r4ds.had.co.nz/transform.html) to learn about every function and symbol used in Wen's answer.
You can also read this (http://varianceexplained.org/r/teach-tidyverse/) to get a basic sense of base R versus the tidyverse.
I have a data frame similar to df below, which is a registry of entries into and exits from a system.
df = data.frame(id = c("A", "B"), entry = c(2011, 2014), exit = c(2013, 2015))
> df
id entry exit
1 A 2011 2013
2 B 2014 2015
My aim is to represent my df in long format. gather() from tidyr lets me do something like this:
df_long = df %>% gather(registry, time, entry:exit) %>% arrange(id)
> df_long
id registry time
1 A entry 2011
2 A exit 2013
3 B entry 2014
4 B exit 2015
Yet, I am stuck on how I could incorporate additional rows that would represent the time that my observations (id) are effectively in the system. My desired data.frame then would look something like this:
id time
1 A 2011
2 A 2012
3 A 2013
4 B 2014
5 B 2015
Any idea of how I could do this is more than welcome and really appreciated.
Here's a way to get toward your desired solution:
df1 <- data.frame(id = c("A", "B"), entry = c(2011, 2014), exit = c(2013, 2015))
setNames(stack(by(df1, df1$id, function(x) x$entry : x$exit))[,c(2,1)],
c('id','time'))
id time
1 A 2011
2 A 2012
3 A 2013
4 B 2014
5 B 2015
UPDATE: Another solution based on plyr incorporating the comment above could be:
df1 <- data.frame(id = c("A", "B"), region = c("country.1", "country.2"), entry = c(2011, 2014), exit = c(2013, 2015))
library(plyr)
ddply(df1, .(id,region), summarize, time=seq(entry, exit))
That yields:
id region time
1 A country.1 2011
2 A country.1 2012
3 A country.1 2013
4 B country.2 2014
5 B country.2 2015
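For readers who prefer to stay in dplyr/tidyr, the same expansion can be sketched by building one sequence per row in a list column and then unnesting it. This is only one possible approach, not taken from the answers above; the names registry and df_expanded are illustrative.

```r
library(dplyr)
library(tidyr)

registry <- data.frame(id = c("A", "B"), entry = c(2011, 2014), exit = c(2013, 2015))

df_expanded <- registry %>%
  rowwise() %>%                                  # evaluate seq() once per row
  mutate(time = list(seq(entry, exit))) %>%      # one vector of years per row
  ungroup() %>%
  select(id, time) %>%
  unnest(time)                                   # one output row per year
```

Because it works row by row, this should also cope with an id that appears in several rows with different entry and exit years.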
I have a data set that is in a long format, and I can’t seem to get it the right shape for analysis. Perhaps this shape is appropriate — my experience has been almost entirely with wide format data, so this data file is not making sense to me. (Reproducible data file at end of the post.)
> head(df,10)
ID attributes values
1 1 AU AAA
2 1 AU BBB
3 1 YR 2014
4 2 AU CCC
5 2 AU DDD
6 2 AU EEE
7 2 AU FFF
8 2 AU GGG
9 2 YR 2013
10 3 AU HHH
The attributes column contain variables of interest to me, and I want to perform a series of aggregation functions. For example, I would like to:
1. Obtain a count of the number of authors (AU) for each ID. For example:
ID N.AU
1 2
2 5
3 1
4 2
5 5
6 1
2. Compute the median number of authors (AU) by year (YR):
YR Median.N.AU
2013 5.0
2014 1.5
For both of these examples, I have tried dplyr with group_by and summarise, but haven’t cracked the code. I have also tried dcast. My hope is to come up with a solution that I can easily generalize to a larger data frame that has many more attributes that take on either a single value or multiple values. Any help or pointers to a similar solution would be greatly appreciated.
attributes = c("AU", "AU", "YR", "AU", "AU", "AU", "AU", "AU", "YR", "AU", "YR",
"AU", "AU", "YR", "AU", "AU", "AU", "AU", "AU", "YR", "AU", "YR")
ID = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6)
values = c("AAA", "BBB", "2014", "CCC", "DDD", "EEE", "FFF", "GGG", "2013", "HHH", "2014",
"III", "JJJ", "2014", "KKK", "LLL", "MMM", "NNN", "OOO", "2013", "PPP", "2014")
df <- data.frame(ID, attributes, values)
I think you're getting confused because you actually have two tables of
data linked by a common ID:
library(dplyr)
df <- as_tibble(df)  # tbl_df() is deprecated in current dplyr
years <- df %>%
filter(attributes == "YR") %>%
select(id = ID, year = values)
years
#> Source: local data frame [6 x 2]
#>
#> id year
#> 1 1 2014
#> 2 2 2013
#> 3 3 2014
#> 4 4 2014
#> 5 5 2013
#> .. .. ...
authors <- df %>%
filter(attributes == "AU") %>%
select(id = ID, author = values)
authors
#> Source: local data frame [16 x 2]
#>
#> id author
#> 1 1 AAA
#> 2 1 BBB
#> 3 2 CCC
#> 4 2 DDD
#> 5 2 EEE
#> .. .. ...
Once you have the data in this form, it's easy to answer the questions
you're interested in:
Authors per paper:
n_authors <- authors %>%
group_by(id) %>%
summarise(n = n())
Or
n_authors <- authors %>% count(id)
Median authors per year:
n_authors %>%
left_join(years) %>%
group_by(year) %>%
summarise(median(n))
#> Joining by: "id"
#> Source: local data frame [2 x 2]
#>
#> year median(n)
#> 1 2013 5.0
#> 2 2014 1.5
Here's a possible data.table solution.
I would also suggest creating an aggregated data set with separate columns. For example:
library(data.table)
(subdf <- as.data.table(df)[, .(N.AU = sum(attributes == "AU"),
Year = values[attributes == "YR"]) , ID])
# ID N.AU Year
# 1: 1 2 2014
# 2: 2 5 2013
# 3: 3 1 2014
# 4: 4 2 2014
# 5: 5 5 2013
# 6: 6 1 2014
Calculating median per year
subdf[, .(Median.N.AU = median(N.AU)), keyby = Year]
# Year Median.N.AU
# 1: 2013 5.0
# 2: 2014 1.5
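The same one-row-per-ID aggregation can also be sketched in dplyr, mirroring the data.table code above. This assumes every ID has exactly one YR row, as in the sample data; the names pubs, per_id and medians are illustrative.

```r
library(dplyr)

attributes <- c("AU", "AU", "YR", "AU", "AU", "AU", "AU", "AU", "YR", "AU", "YR",
                "AU", "AU", "YR", "AU", "AU", "AU", "AU", "AU", "YR", "AU", "YR")
ID <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6)
values <- c("AAA", "BBB", "2014", "CCC", "DDD", "EEE", "FFF", "GGG", "2013", "HHH", "2014",
            "III", "JJJ", "2014", "KKK", "LLL", "MMM", "NNN", "OOO", "2013", "PPP", "2014")
pubs <- data.frame(ID, attributes, values)

# One row per ID: author count plus the single year value
per_id <- pubs %>%
  group_by(ID) %>%
  summarise(
    N.AU = sum(attributes == "AU"),
    Year = values[attributes == "YR"],  # assumes exactly one YR row per ID
    .groups = "drop"
  )

# Median author count per year
medians <- per_id %>%
  group_by(Year) %>%
  summarise(Median.N.AU = median(N.AU))
```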
I misunderstood the structure of your dataset initially. Thanks to the comments below I realize your data needs to be restructured.
# split the data out
df1 <- df[df$attributes == "AU",]
df2 <- df[df$attributes == "YR",]
# just keeping the columns with data as opposed to the label
df3 <- merge(df1, df2, by="ID")[,c(1,3,5)]
# set column names for clarification
colnames(df3) <- c("ID","author","year")
# get author counts (count() with vars= is plyr::count, which adds a freq column)
num.authors <- plyr::count(df3, vars=c("ID","year"))
ID year freq
1 1 2014 2
2 2 2013 5
3 3 2014 1
4 4 2014 2
5 5 2013 5
6 6 2014 1
# median author count per year (summaryBy() comes from the doBy package)
library(doBy)
summaryBy(freq ~ year, data = num.authors, FUN = list(median))
year freq.median
1 2013 5.0
2 2014 1.5
The nice thing about summaryBy is that you can add whichever functions you like to the list, and you will get another column for each additional metric (e.g. mean, sd, etc.).