I have a data frame similar to df below, which looks like a registry of entries into and exits from a system.
df = data.frame(id = c("A", "B"), entry = c(2011, 2014), exit = c(2013, 2015))
> df
id entry exit
1 A 2011 2013
2 B 2014 2015
My aim is to represent df in long format. gather() from tidyr lets me do something like this:
df_long = df %>% gather(registry, time, entry:exit) %>% arrange(id)
> df_long
id registry time
1 A entry 2011
2 A exit 2013
3 B entry 2014
4 B exit 2015
However, I am stuck on how to add rows for every year in which each observation (id) is actually in the system. My desired data.frame would look something like this:
id time
1 A 2011
2 A 2012
3 A 2013
4 B 2013
5 B 2014
6 B 2015
Any idea of how I could do this is more than welcome and really appreciated.
Here's a way to get toward your desired solution:
df1 <- data.frame(id = c("A", "B"), entry = c(2011, 2014), exit = c(2013, 2015))
setNames(stack(by(df1, df1$id, function(x) x$entry : x$exit))[,c(2,1)],
c('id','time'))
id time
1 A 2011
2 A 2012
3 A 2013
4 B 2014
5 B 2015
UPDATE: Another solution based on plyr incorporating the comment above could be:
df1 <- data.frame(id = c("A", "B"), region = c("country.1", "country.2"), entry = c(2011, 2014), exit = c(2013, 2015))
library(plyr)
ddply(df1, .(id,region), summarize, time=seq(entry, exit))
That yields:
id region time
1 A country.1 2011
2 A country.1 2012
3 A country.1 2013
4 B country.2 2014
5 B country.2 2015
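Since the question mentions tidyr, the same expansion can also be sketched with a list column that is then unnested. This is only an illustrative alternative, assuming dplyr and tidyr are installed; the name df_long is reused from the question and nothing here comes from the answers above.

```r
library(dplyr)
library(tidyr)

df <- data.frame(id = c("A", "B"), entry = c(2011, 2014), exit = c(2013, 2015))

df_long <- df %>%
  rowwise() %>%                             # evaluate seq() once per row
  mutate(time = list(seq(entry, exit))) %>% # one vector of years per id
  ungroup() %>%
  select(id, time) %>%
  unnest(time)                              # one row per id and year

df_long
```

Other columns such as region can be kept by adding them to the select() call.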
I am working with the R programming language.
I have the following dataset:
library(dplyr)
my_data = data.frame(id = c(1,1,1,1,1,1, 2,2,2) , year = c(2010, 2011, 2012, 2013, 2015, 2016, 2015, 2016, 2020), var = c(1,7,3,9,5,6, 88, 12, 5))
> my_data
id year var
1 1 2010 1
2 1 2011 7
3 1 2012 3
4 1 2013 9
5 1 2015 5
6 1 2016 6
7 2 2015 88
8 2 2016 12
9 2 2020 5
My Question: For each ID, I want to find the first "non-consecutive" year and then delete all rows after it.
For example:
When ID = 1, the first "jump" occurs at 2013 (i.e. there is no 2014). Therefore, I would like to delete all rows after 2013.
When ID = 2, the first "jump" occurs at 2016 - therefore, I would like to delete all rows after 2016.
This was my attempt to write the code for this problem:
final = my_data %>%
group_by(id) %>%
mutate(break_index = which(diff(year) > 1)[1]) %>%
group_by(id, add = TRUE) %>%
slice(1:break_index)
The code appears to be working - but I get the following warning messages which are concerning me:
Warning messages:
1: In 1:break_index :
numerical expression has 6 elements: only the first used
2: In 1:break_index :
numerical expression has 3 elements: only the first used
Can someone please tell me if I have done this correctly?
Thanks!
You get the warning because mutate() repeats break_index on every row of the group, so 1:break_index receives a vector with more than one element. The value is the same within each group, so your attempt still works. To avoid the warning, pick a single value of break_index: try slice(1:break_index[1]) or slice(1:first(break_index)).
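As a minimal sketch of that fix, assuming dplyr is installed, the pipeline from the question can be rerun with first() so that the upper bound of the range is a single number:

```r
library(dplyr)

my_data <- data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  year = c(2010, 2011, 2012, 2013, 2015, 2016, 2015, 2016, 2020),
  var  = c(1, 7, 3, 9, 5, 6, 88, 12, 5)
)

final <- my_data %>%
  group_by(id) %>%
  # position of the last row before the first gap, repeated on every row
  mutate(break_index = which(diff(year) > 1)[1]) %>%
  # first() collapses the repeated value to one number, so no warning
  slice(1:first(break_index)) %>%
  ungroup()

final
```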
Here is another way to handle this.
library(dplyr)
my_data %>%
group_by(id) %>%
filter(row_number() <= which(diff(year) > 1)[1])
# id year var
# <dbl> <dbl> <dbl>
#1 1 2010 1
#2 1 2011 7
#3 1 2012 3
#4 1 2013 9
#5 2 2015 88
#6 2 2016 12
With dplyr 1.1.0 or later, we can use temporary per-operation grouping with .by:
my_data %>%
filter(row_number() <= which(diff(year) > 1)[1], .by = id)
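One edge case worth noting: if a group has no gap at all, which(diff(year) > 1)[1] is NA, the comparison is NA for every row, and filter() drops the whole group. The sketch below is one possible way to keep such groups intact by falling back to the group size with coalesce(); the example data is made up for illustration and is not the data from the question.

```r
library(dplyr)

# Hypothetical data: id 1 has a gap after 2011, id 2 has no gap
gap_data <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2),
  year = c(2010, 2011, 2013, 2015, 2016, 2017),
  var  = 1:6
)

res <- gap_data %>%
  group_by(id) %>%
  # fall back to n() (keep every row) when the group has no gap
  filter(row_number() <= coalesce(which(diff(year) > 1)[1], n())) %>%
  ungroup()

res
```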
I have a data frame in R containing year-wise transaction data for multiple individuals. I want a new data frame with columns computed under certain conditions, such as the total revenue for an individual in each year in a particular category.
For example:
ID year a b c d
1 2015 2 4 6 8
1 2015 4 6 8 10
1 2016 7 6 9 5
2 2015 7 5 6 2
2 2016 3 4 5 2
I want a data frame with a column of total values for ID 1 in 2015, ID 1 in 2016, ID 2 in 2015, and so on. I also want to add another condition, such as totalling only the rows that have a value greater than 5 in column a.
Please give your suggestions. Any help will be appreciated.
Based on your question, I used the dplyr package, which is incredibly helpful if you don't already have it.
First, group your data by ID and then by year. Then create sums for your 4 columns based on these groupings:
library(dplyr)
mydata <- data.frame("ID" = c(1,1,1,2,2),
"year" = c(2015, 2015, 2016, 2015, 2016),
"a" = c(2,4,7,7,3),
"b" = c(4,6,6,5,4),
"c" = c(6,8,9,6,5),
"d" = c(8, 10, 5, 2, 2))
mydata %>% group_by(ID, year) %>% summarise(a = sum(a), b = sum(b),
c = sum(c), d = sum(d))
To sum only the values greater than 5 in each column, subset each column inside the summarise() call as follows:
mydata %>% group_by(ID, year) %>%
summarise(a = sum(a[a > 5]), b = sum(b[b > 5]),
c = sum(c[c > 5]), d = sum(d[d > 5]))
I hope this helps!
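If the condition is instead meant to apply to whole rows (total only the rows where column a is greater than 5, as the question could also be read), one possible sketch is to filter first and then sum. This assumes dplyr 1.0 or later for across(); the name totals is only illustrative.

```r
library(dplyr)

mydata <- data.frame(ID   = c(1, 1, 1, 2, 2),
                     year = c(2015, 2015, 2016, 2015, 2016),
                     a    = c(2, 4, 7, 7, 3),
                     b    = c(4, 6, 6, 5, 4),
                     c    = c(6, 8, 9, 6, 5),
                     d    = c(8, 10, 5, 2, 2))

totals <- mydata %>%
  filter(a > 5) %>%                              # keep only rows with a > 5
  group_by(ID, year) %>%
  summarise(across(a:d, sum), .groups = "drop")  # sum every value column

totals
```

Note that ID/year combinations with no qualifying rows disappear from the result with this approach, whereas the column-wise subsetting above keeps them with a sum of 0.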
I'm trying to create a new column that is conditionally based on several other columns. Here is my data. I am trying to create a year over year difference column.
> person <- c(rep("A", 4), rep("B", 1), rep("C",3), rep("D",1))
> score <- c(1,1,2,4,1,1,2,2,3)
> year <- c(2017, 2016, 2015, 2014, 2015, 2017, 2015, 2014, 2017)
The function would look up the previous year's data for that individual person and subtract that score from their current score. If there is no previous-year data, it returns NA. So for my data, I would get a new column "difference" with the values 0, -1, -2, NA, NA, NA, 0, NA, NA.
I would love to see a dplyr answer, but vanilla R solutions are welcome.
Using dplyr:
library(dplyr)
# Assemble the vectors from the question into a data frame
df <- data.frame(person, score, year)
df %>%
arrange(person, year) %>%
group_by(person) %>%
mutate(per = ifelse(year - lag(year) == 1, score - lag(score), NA)) %>%
arrange(person, -year)
# A tibble: 9 x 4
# Groups: person [4]
person score year per
<fctr> <dbl> <dbl> <dbl>
1 A 1 2017 0
2 A 1 2016 -1
3 A 2 2015 -2
4 A 4 2014 NA
5 B 1 2015 NA
6 C 1 2017 NA
7 C 2 2015 0
8 C 2 2014 NA
9 D 3 2017 NA
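Since vanilla R solutions were also welcome, here is a minimal base R sketch of the same idea. It looks up each person's previous-year row with match() on a pasted person/year key; the helper name prev is hypothetical, and the column is called difference as in the question.

```r
person <- c(rep("A", 4), rep("B", 1), rep("C", 3), rep("D", 1))
score  <- c(1, 1, 2, 4, 1, 1, 2, 2, 3)
year   <- c(2017, 2016, 2015, 2014, 2015, 2017, 2015, 2014, 2017)
df <- data.frame(person, score, year)

# Row index of the same person's record for the previous year (NA if none)
prev <- match(paste(df$person, df$year - 1), paste(df$person, df$year))

# Indexing with NA yields NA, so missing previous years propagate as NA
df$difference <- df$score - df$score[prev]

df
```

This does not require the rows to be sorted, because the lookup is by key rather than by position.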
Just to answer the question you put forward under Wen's answer:
You can check out chapter 5 of this book (http://r4ds.had.co.nz/transform.html) to figure out every function and symbol used in Wen's answer.
You can also read this (http://varianceexplained.org/r/teach-tidyverse/) to get a basic sense of base R versus the tidyverse.
I have the following data frame.
ID Year
A 2001
A 2002
A 2003
B 2009
B 2010
I would like to create a third column in which I subtract the minimum year of the corresponding ID from the year and then add one.
In short, I would like to have this:
ID Year New
A 2001 1
A 2002 2
A 2003 3
B 2009 1
B 2010 2
I am pretty new to R and dplyr and haven't found a way to do that without a loop.
Thank you in advance
In dplyr you need to use group_by and mutate like so:
library(dplyr)
df <- read.table(text = "ID Year
A 2001
A 2002
A 2003
B 2009
B 2010", header = TRUE)
df <- df %>%
group_by(ID) %>%
mutate(New = Year - min(Year) + 1)
df
# ID Year New
# A 2001 1
# A 2002 2
# A 2003 3
# B 2009 1
# B 2010 2
Using the tidyverse:
library(tidyverse)
data <- tribble(~ID, ~year,
"A", 2001,
"A", 2002,
"A", 2003,
"B", 2009,
"B", 2010
)
data %>% group_by(ID) %>%
mutate(new = year - min(year)+1)
Using ddply:
library(plyr)
df<-data.frame(ID=c("A","A","A","B","B"), Year=c(2001,2002,2003,2009,2010))
ddply(df, .(ID), transform, New=Year-min(Year)+1)
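For completeness, the same computation can be sketched in base R without any packages, using ave() to broadcast each group's minimum year back onto its rows. This is just an illustrative alternative to the answers above.

```r
df <- data.frame(ID = c("A", "A", "A", "B", "B"),
                 Year = c(2001, 2002, 2003, 2009, 2010))

# ave() returns, for every row, the minimum Year within that row's ID
df$New <- df$Year - ave(df$Year, df$ID, FUN = min) + 1

df
```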
I am struggling to create a new variable in my data.frame. I apologize for the question title, which might not be very clear. I have a dataset that looks like this:
obs year type
1 2015 A
2 2015 A
3 2015 B
4 2014 A
5 2014 B
I want to add a column (freq2015) to the current data.frame that gives the number of rows by type for 2015, and to report that count on every row of the same type regardless of the row's own year. Here is the output I am looking for:
obs year type freq2015
1 2015 A 2 (there are 2 obs. of type A in 2015)
2 2015 A 2 (there are 2 obs. of type A in 2015)
3 2015 B 1 (there is 1 obs. of type B in 2015)
4 2014 A 2 (there are 2 obs. of type A in 2015)
5 2014 B 1 (there is 1 obs. of type B in 2015)
I know how to add to my data.frame the number of rows by type by year using dplyr:
data <- data %>%
group_by(year, type) %>%
mutate(freq = n())
But then, for year == 2014, the added column will display the count of 2014 rows by type instead of the count for 2015.
I know how to isolate the number of rows by type for 2015 into a new data.frame:
data2015 <- data[data$year==2015,] %>%
group_by(type) %>%
mutate(freq2015 = n())
But I don't know how to add a column (with the count of rows by type for 2015) to the entire data.frame, conditional on the type being the same (as shown in the example). I am looking for a solution that does not require me to refer explicitly to the levels of the "type" variable. That is, I don't want to write code telling R: do this if type == A, do that otherwise. The reason for this restriction is that I have far too many types.
Any ideas? Thank you in advance.
If you group_by using only type, you can count the rows where year == 2015 by summing the logical condition.
data %>%
group_by(type) %>%
mutate(freq2015 = sum(year == 2015))
Source: local data frame [5 x 4]
Groups: type [2]
obs year type freq2015
<int> <int> <fctr> <int>
1 1 2015 A 2
2 2 2015 A 2
3 3 2015 B 1
4 4 2014 A 2
5 5 2014 B 1
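This works because year == 2015 is a logical vector and sum() counts each TRUE as 1. A self-contained sketch follows, with the example data reconstructed from the table in the question; the names data and res are only illustrative.

```r
library(dplyr)

data <- data.frame(obs  = 1:5,
                   year = c(2015L, 2015L, 2015L, 2014L, 2014L),
                   type = c("A", "A", "B", "A", "B"))

res <- data %>%
  group_by(type) %>%
  # TRUE counts as 1, so this is the number of 2015 rows within each type
  mutate(freq2015 = sum(year == 2015)) %>%
  ungroup()

res
```

A type with no rows in 2015 gets 0 here rather than NA, which may or may not be what is wanted.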
Using data.table we could do:
library(data.table)
setDT(df)
setkey(df,type)
df[ df[ year==2015, .(freq2015=.N), by = type]]
Result:
obs year type freq2015
1: 1 2015 A 2
2: 2 2015 A 2
3: 4 2014 A 2
4: 3 2015 B 1
5: 5 2014 B 1
You could use a left_join(), as follows:
temp <- data %>%
filter(year==2015) %>%
group_by(type) %>%
summarize(freq2015 = n())
data <- data %>% left_join(temp, by = "type")
We can do this with base R using ave (without any external packages) and it is reasonably fast as well.
df1$freq2015 <- with(df1, ave(year == 2015, type, FUN = sum))
df1$freq2015
#[1] 2 2 1 2 1