I am working with the R programming language.
I have the following dataset:
library(dplyr)
my_data = data.frame(id = c(1,1,1,1,1,1, 2,2,2) , year = c(2010, 2011, 2012, 2013, 2015, 2016, 2015, 2016, 2020), var = c(1,7,3,9,5,6, 88, 12, 5))
> my_data
id year var
1 1 2010 1
2 1 2011 7
3 1 2012 3
4 1 2013 9
5 1 2015 5
6 1 2016 6
7 2 2015 88
8 2 2016 12
9 2 2020 5
My Question: For each ID, I want to find out when the first "non-consecutive" year occurs and then delete all remaining rows for that ID.
For example:
When ID = 1, the first "jump" occurs at 2013 (i.e. there is no 2014). Therefore, I would like to delete all rows after 2013.
When ID = 2, the first "jump" occurs at 2016 - therefore, I would like to delete all rows after 2016.
This was my attempt to write the code for this problem:
final = my_data %>%
  group_by(id) %>%
  mutate(break_index = which(diff(year) > 1)[1]) %>%
  group_by(id, add = TRUE) %>%
  slice(1:break_index)
The code appears to be working - but I get the following warning messages which are concerning me:
Warning messages:
1: In 1:break_index :
numerical expression has 6 elements: only the first used
2: In 1:break_index :
numerical expression has 3 elements: only the first used
Can someone please tell me if I have done this correctly?
Thanks!
You get the warning because break_index has more than one value per group (the same value repeated on every row), so only the first element is used and your attempt works. If you want to avoid the warning, select a single value of break_index: change slice(1:break_index) to slice(1:break_index[1]) or slice(1:first(break_index)).
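For reference, a minimal sketch of the corrected version of your own pipeline (keeping your approach, only changing the slice() call and dropping the redundant second group_by()):
library(dplyr)

final <- my_data %>%
  group_by(id) %>%
  mutate(break_index = which(diff(year) > 1)[1]) %>%
  slice(1:first(break_index)) %>%  # one value per group, so no warning
  ungroup() %>%
  select(-break_index)             # drop the helper column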
Here is another way to handle this.
library(dplyr)
my_data %>%
  group_by(id) %>%
  filter(row_number() <= which(diff(year) > 1)[1])
# id year var
# <dbl> <dbl> <dbl>
#1 1 2010 1
#2 1 2011 7
#3 1 2012 3
#4 1 2013 9
#5 2 2015 88
#6 2 2016 12
With dplyr 1.1.0, we can use temporary grouping with .by:
my_data %>%
  filter(row_number() <= which(diff(year) > 1)[1], .by = id)
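One caveat, based on an assumption about data not shown here: if some id has no gap at all, which(diff(year) > 1)[1] is NA and the filter would drop every row for that id. A hedged guard using coalesce() to fall back to keeping all rows:
my_data %>%
  filter(row_number() <= coalesce(which(diff(year) > 1)[1], n()), .by = id)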
I have a tibble where each row contains a subject identifier and a year. My goal is to isolate, for each subject, only those rows which together constitute the longest sequence of rows in which the variable year increases by 1 from one row to the next.
I've tried quite a few things with a grouped filter, such as building helper variables that code whether year on one row is one more or less than year on the previous row, and using the rle() function. But so far nothing has worked exactly as it should.
Here is a toy example of my data. Note that the number of rows varies between subjects and that there are typically (some) gaps between years. Also note that the data have been arranged so that the year value always increases from one row to the next within each subject.
# A tibble: 8 x 2
subject year
<dbl> <dbl>
1 1 2012
2 1 2013
3 1 2015
4 1 2016
5 1 2017
6 1 2019
7 2 2011
8 2 2013
The toy example tibble can be recreated by running this code:
dat = structure(list(subject = c(1, 1, 1, 1, 1, 1, 2, 2), year = c(2012,
2013, 2015, 2016, 2017, 2019, 2011, 2013)), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
To clarify, for this tibble the desired output is:
# A tibble: 3 x 2
subject year
<dbl> <dbl>
1 1 2015
2 1 2016
3 1 2017
(Note that subject 2 is dropped because she has no sequence of years increasing by one.)
There must be an elegant way to do this using dplyr!
This doesn't take into account ties, but ...
dat %>%
  group_by(subject) %>%
  mutate(r = cumsum(c(TRUE, diff(year) != 1))) %>%  # label each run of consecutive years
  group_by(subject, r) %>%
  mutate(rcount = n()) %>%                          # length of each run
  group_by(subject) %>%
  filter(rcount > 1, rcount == max(rcount)) %>%     # keep only the longest run, if longer than 1
  select(-r, -rcount) %>%
  ungroup()
# # A tibble: 3 x 2
# subject year
# <dbl> <dbl>
# 1 1 2015
# 2 1 2016
# 3 1 2017
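If ties did matter, one hedged variant (assuming the earliest of the tied longest runs is the one to keep) would be:
dat %>%
  group_by(subject) %>%
  mutate(r = cumsum(c(TRUE, diff(year) != 1))) %>%   # label runs of consecutive years
  group_by(subject, r) %>%
  mutate(rcount = n()) %>%                           # length of each run
  group_by(subject) %>%
  filter(rcount > 1,
         rcount == max(rcount),
         r == min(r[rcount == max(rcount)])) %>%     # earliest run wins a tie
  select(-r, -rcount) %>%
  ungroup()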
I have a data frame in R containing year-wise transaction data for multiple individuals. I want a new data frame with columns based on some conditions, such as the total revenue for an individual in each year in a particular category.
For example:
ID year a b c d
1 2015 2 4 6 8
1 2015 4 6 8 10
1 2016 7 6 9 5
2 2015 7 5 6 2
2 2016 3 4 5 2
I want a data frame where I get a column with total values for ID 1 in 2015, ID 1 in 2016, ID 2 in 2015, and so on. I would also like to add another condition, for example totalling only values greater than 5 in column a.
Please give your suggestions; any help will be appreciated.
Based on your question, I used the dplyr package, which is incredibly helpful if you don't already have it.
First, group your data by ID and then year. Then create sums for your four columns based on these groupings:
mydata <- data.frame("ID" = c(1, 1, 1, 2, 2),
                     "year" = c(2015, 2015, 2016, 2015, 2016),
                     "a" = c(2, 4, 7, 7, 3),
                     "b" = c(4, 6, 6, 5, 4),
                     "c" = c(6, 8, 9, 6, 5),
                     "d" = c(8, 10, 5, 2, 2))

mydata %>%
  group_by(ID, year) %>%
  summarise(a = sum(a), b = sum(b),
            c = sum(c), d = sum(d))
To sum only the values greater than 5, simply specify that inside summarise() as follows:
mydata %>%
  group_by(ID, year) %>%
  summarise(a = sum(a[a > 5]), b = sum(b[b > 5]),
            c = sum(c[c > 5]), d = sum(d[d > 5]))
I hope this helps!
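A hedged alternative, assuming dplyr >= 1.0.0 is available: across() avoids repeating the four column names.
mydata %>%
  group_by(ID, year) %>%
  summarise(across(a:d, ~ sum(.x[.x > 5])), .groups = "drop")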
I'm trying to create a new column that is conditionally based on several other columns: a year-over-year difference column. Here is my data:
> person <- c(rep("A", 4), rep("B", 1), rep("C",3), rep("D",1))
> score <- c(1,1,2,4,1,1,2,2,3)
> year <- c(2017, 2016, 2015, 2014, 2015, 2017, 2015, 2014, 2017)
This function would look for the previous year's data for that individual person, and subtract that score from their current score. If there is no previous year's data, it returns NA. So for my data, I would get a new column "difference" with the values 0, -1, -2, NA, NA, NA, 0, NA, NA.
Would love to see a dplyr answer, but vanilla R solutions are welcome.
Using dplyr:
library(dplyr)
df <- data.frame(person, score, year)

df %>%
  arrange(person, year) %>%
  group_by(person) %>%
  mutate(per = ifelse(year - lag(year) == 1, score - lag(score), NA)) %>%
  arrange(person, -year)
# A tibble: 9 x 4
# Groups: person [4]
person score year per
<fctr> <dbl> <dbl> <dbl>
1 A 1 2017 0
2 A 1 2016 -1
3 A 2 2015 -2
4 A 4 2014 NA
5 B 1 2015 NA
6 C 1 2017 NA
7 C 2 2015 0
8 C 2 2014 NA
9 D 3 2017 NA
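Since vanilla R solutions were also welcome, here is a minimal base-R sketch of the same idea, building df from the person, score, and year vectors above:
df <- data.frame(person, score, year)
df <- df[order(df$person, df$year), ]                      # sort within person by year
same_person <- c(FALSE, df$person[-1] == df$person[-nrow(df)])
consecutive <- same_person & c(FALSE, diff(df$year) == 1)  # previous row: same person, one year earlier
df$per <- ifelse(consecutive, c(NA, diff(df$score)), NA)
df[order(df$person, -df$year), ]                           # match the ordering shown above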
Just to answer the question you posted under Wen's answer: you can check out Chapter 5 of this book (http://r4ds.had.co.nz/transform.html) to figure out every function and symbol used in Wen's answer.
You can also read this (http://varianceexplained.org/r/teach-tidyverse/) to get a basic sense of base R versus the tidyverse.
I have a data frame similar to df below, which looks like a registry of entries into and exits from a system.
df = data.frame(id = c("A", "B"), entry = c(2011, 2014), exit = c(2013, 2015))
> df
id entry exit
1 A 2011 2013
2 B 2014 2015
My aim is to represent my df in long format. gather() from tidyr lets me do something like this:
df_long = df %>% gather(registry, time, entry:exit) %>% arrange(id)
> df_long
id registry time
1 A entry 2011
2 A exit 2013
3 B entry 2014
4 B exit 2015
Yet, I am stuck on how I could incorporate additional rows that would represent the time that my observations (id) are effectively in the system. My desired data.frame then would look something like this:
id time
1 A 2011
2 A 2012
3 A 2013
4 B 2014
5 B 2015
Any idea of how I could do this is more than welcome and really appreciated.
Here's a way to get toward your desired solution:
df1 <- data.frame(id = c("A", "B"), entry = c(2011, 2014), exit = c(2013, 2015))
setNames(stack(by(df1, df1$id, function(x) x$entry:x$exit))[, c(2, 1)],
         c('id', 'time'))
id time
1 A 2011
2 A 2012
3 A 2013
4 B 2014
5 B 2015
UPDATE: Another solution, based on plyr and keeping an additional column (region), could be:
df1 <- data.frame(id = c("A", "B"), region = c("country.1", "country.2"), entry = c(2011, 2014), exit = c(2013, 2015))
library(plyr)
ddply(df1, .(id,region), summarize, time=seq(entry, exit))
That yields:
id region time
1 A country.1 2011
2 A country.1 2012
3 A country.1 2013
4 B country.2 2014
5 B country.2 2015
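For completeness, a hedged dplyr sketch of the same expansion, assuming dplyr >= 1.1.0 (for reframe() and .by) and the original two-column df from the question:
library(dplyr)

df %>%
  reframe(time = seq(entry, exit), .by = id)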
I am struggling with creating a new variable in my data.frame. I apologize for the question title, which might not be very clear. I have a database that looks like this:
obs year type
1 2015 A
2 2015 A
3 2015 B
4 2014 A
5 2014 B
I want to add a column (freq2015) to the current data.frame that gives the number of rows by type for 2015, and to repeat that value on every row of the same type regardless of the year. Here is the output I am looking for:
obs year type freq2015
1 2015 A 2 (there are 2 obs. of type A in 2015)
2 2015 A 2 (there are 2 obs. of type A in 2015)
3 2015 B 1 (there is 1 obs. of type B in 2015)
4 2014 A 2 (there are 2 obs. of type A in 2015)
5 2014 B 1 (there is 1 obs. of type B in 2015)
I know how to add to my data.frame the number of rows by type by year using dplyr:
data <- data %>%
  group_by(year, type) %>%
  mutate(freq = n())
But then, for year=="2014" the added column will display the count of 2014 rows by race instead of that of 2015.
I know how to isolate into a new data.frame the number of rows by race for 2015:
data2015 <- dat[dat$year==2015,] %>%
group_by(type) %>%
mutate(freq2015 = n())
But I don't know how to add a column (with the count of rows by race for 2015) for the entire data.frame conditional on the type being the same (as shown in the example). I am looking for a solution that would prevent me from explicitly using the "type" variable modalities. That is, I don't want to use a code telling R: do this if type==A, do that otherwise. The reason for this restriction is that I have far too many types.
Any ideas? Thank you in advance.
If you group_by using only type, you can count the rows where year == 2015 by summing the logical condition.
data %>%
  group_by(type) %>%
  mutate(freq2015 = sum(year == 2015))
Source: local data frame [5 x 4]
Groups: type [2]
obs year type freq2015
<int> <int> <fctr> <int>
1 1 2015 A 2
2 2 2015 A 2
3 3 2015 B 1
4 4 2014 A 2
5 5 2014 B 1
Using data.table, we could do:
library(data.table)
setDT(df)
setkey(df, type)
df[df[year == 2015, .(freq2015 = .N), by = type]]  # join the per-type 2015 counts back onto df
Result:
obs year type freq2015
1: 1 2015 A 2
2: 2 2015 A 2
3: 4 2014 A 2
4: 3 2015 B 1
5: 5 2014 B 1
You could use a left_join(), as follows:
temp <- data %>%
  filter(year == 2015) %>%
  group_by(type) %>%
  summarize(freq2015 = n())

data <- data %>% left_join(temp, by = "type")
We can do this with base R using ave (without any external packages) and it is reasonably fast as well.
df1$freq2015 <- with(df1, ave(year == 2015, type, FUN = sum))
df1$freq2015
#[1] 2 2 1 2 1