How to expand dataframe - r

I have this kind of data:
df <- data.frame(year=c(1999,1999,1999,2000,2000,2001,2011,2011,2011,2011), class=c("A","B","C","A","C","A","B","C","D","E"),
n=c(10,20,30,12,15,40,50,55,60,5), occurs=c(0,1,3,4,2,0,0,11,12,2))
> df
year class n occurs
1 1999 A 10 0
2 1999 B 20 1
3 1999 C 30 3
4 2000 A 12 4
5 2000 C 15 2
6 2001 A 40 0
7 2011 B 50 0
8 2011 C 55 11
9 2011 D 60 12
10 2011 E 5 2
I would like to expand this data like this:
year class n occurs
1 1999 A 1 0
1 1999 A 2 0
1 1999 A 3 0
...
1 1999 A 10 0
2 1999 B 0 0
2 1999 B 1 0
2 1999 B 2 0
...
2 1999 B 20 1
3 1999 C 1 1
3 1999 C 1 1
3 1999 C 1 0
3 1999 C 1 0
.. the rest of occurs is seq of zeros...because `n-occurs` = 27 zeros and seq of 3x `1`.
I want to repeat each row n times, as given by column n, and turn the occurs column into a 0/1 flag: within each expanded group, the flag is 1 exactly occurs times and 0 the remaining n - occurs times. For example, if occurs is the integer 5 and n = 10, there will be 10 rows (with the same year and class), of which 5 have occurs = 0 and 5 have occurs = 1.
EDIT: Please note that the new occurs column contains only 0s and 1s: the number of 0s is n - occurs and the number of 1s is the original value of occurs.

Consider combining lapply and do.call: build one data.frame per original row with the data.frame() constructor (constructing the 0/1 occurs vector along the way), then row-bind the pieces:
df_List <- lapply(seq(nrow(df)), FUN=function(d){
occ <- c(rep(1, df$occurs[[d]]), rep(0, df$n[[d]]-df$occurs[[d]]))
data.frame(year=df$year[[d]], class=df$class[[d]], n=seq(df$n[[d]]), occurs=occ)
})
finaldf <- do.call(rbind, df_List)
head(finaldf, 20)
# year class n occurs
# 1 1999 A 1 0
# 2 1999 A 2 0
# 3 1999 A 3 0
# 4 1999 A 4 0
# 5 1999 A 5 0
# 6 1999 A 6 0
# 7 1999 A 7 0
# 8 1999 A 8 0
# 9 1999 A 9 0
# 10 1999 A 10 0
# 11 1999 B 1 1
# 12 1999 B 2 0
# 13 1999 B 3 0
# 14 1999 B 4 0
# 15 1999 B 5 0
# 16 1999 B 6 0
# 17 1999 B 7 0
# 18 1999 B 8 0
# 19 1999 B 9 0
# 20 1999 B 10 0

Here is a base R method that is closely related to the linked post mentioned in my comment above. That answer provides the method for generating the first two columns of the data.frame.
dat <- data.frame(df[1:2][rep(1:nrow(df), df$n),],
n=sequence(df$n),
occurs=unlist(mapply(function(x, y) rep(0:1, c(x-y, y)), df$n, df$occurs)))
Here, the first 2 columns are generated using that answer. n is generated using sequence, and occurs uses mapply and rep, returning a vector with unlist. This puts the 1s at the end. You could use 1:0 to put the 1s at the beginning or feed the resulting vector to sample within mapply to get a random ordering of 1s and 0s.
We can check that the data.frame has the proper number of rows:
nrow(dat) == sum(df$n)
[1] TRUE
The first 15 observations of the result:
head(dat, 15)
year class n occurs
1 1999 A 1 0
1.1 1999 A 2 0
1.2 1999 A 3 0
1.3 1999 A 4 0
1.4 1999 A 5 0
1.5 1999 A 6 0
1.6 1999 A 7 0
1.7 1999 A 8 0
1.8 1999 A 9 0
1.9 1999 A 10 0
2 1999 B 1 0
2.1 1999 B 2 0
2.2 1999 B 3 0
2.3 1999 B 4 0
2.4 1999 B 5 0
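The ordering variants mentioned above (1s at the end, 1s at the beginning, or a random order) can be sketched in one place. This is only an illustration: the helper name make_flags and its order argument are made up for this example and are not part of either answer.

```r
# Example data from the question.
df <- data.frame(year = c(1999,1999,1999,2000,2000,2001,2011,2011,2011,2011),
                 class = c("A","B","C","A","C","A","B","C","D","E"),
                 n = c(10,20,30,12,15,40,50,55,60,5),
                 occurs = c(0,1,3,4,2,0,0,11,12,2))

# Hypothetical helper: build the 0/1 flags for every original row.
make_flags <- function(n, occurs,
                       order = c("zeros_first", "ones_first", "random")) {
  order <- match.arg(order)
  flags <- Map(function(x, y) {
    v <- rep(0:1, c(x - y, y))               # x - y zeros followed by y ones
    switch(order,
           zeros_first = v,
           ones_first  = rev(v),
           random      = v[sample.int(length(v))])
  }, n, occurs)
  unlist(flags, use.names = FALSE)
}

idx <- rep(seq_len(nrow(df)), df$n)          # original row index of each new row
expanded <- data.frame(df[idx, c("year", "class")],
                       n = sequence(df$n),
                       occurs = make_flags(df$n, df$occurs, "ones_first"))
rownames(expanded) <- NULL                   # drop the 1, 1.1, 1.2, ... row names

head(expanded, 12)
```

Whatever ordering is chosen, the number of 1s per original row should still equal the original occurs value, which can be checked with tapply(expanded$occurs, idx, sum).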

Related

Create sequence by condition in the case when condition changes

the data looks like:
df <- data.frame("Grp"=c(rep("A",10),rep("B",10)),
"Year"=c(seq(2001,2010,1),seq(2001,2010,1)),
"Treat"=c(as.character(c(0,0,1,1,1,1,0,0,1,1)),
as.character(c(1,1,1,0,0,0,1,1,1,0))))
df
Grp Year Treat
1 A 2001 0
2 A 2002 0
3 A 2003 1
4 A 2004 1
5 A 2005 1
6 A 2006 1
7 A 2007 0
8 A 2008 0
9 A 2009 1
10 A 2010 1
11 B 2001 1
12 B 2002 1
13 B 2003 1
14 B 2004 0
15 B 2005 0
16 B 2006 0
17 B 2007 1
18 B 2008 1
19 B 2009 1
20 B 2010 0
All I want is to generate another col seq to count the sequence of Treat by Grp, maintaining the sequence of Year. I think the hard part is that when Treat turns to 0, seq should be 0 or whatever, and the sequence of Treat should be re-counted when it turns back to non-zero again. An example of the final dataframe looks like below:
Grp Year Treat seq
1 A 2001 0 0
2 A 2002 0 0
3 A 2003 1 1
4 A 2004 1 2
5 A 2005 1 3
6 A 2006 1 4
7 A 2007 0 0
8 A 2008 0 0
9 A 2009 1 1
10 A 2010 1 2
11 B 2001 1 1
12 B 2002 1 2
13 B 2003 1 3
14 B 2004 0 0
15 B 2005 0 0
16 B 2006 0 0
17 B 2007 1 1
18 B 2008 1 2
19 B 2009 1 3
20 B 2010 0 0
Any suggestions would be much appreciated!
With rleid from data.table (called here inside a dplyr pipeline), you can do:
library(dplyr)
df %>%
group_by(Grp, grp = data.table::rleid(Treat)) %>%
mutate(seq = row_number() * as.integer(Treat)) %>%
ungroup %>%
select(-grp)
# Grp Year Treat seq
# <chr> <dbl> <chr> <int>
# 1 A 2001 0 0
# 2 A 2002 0 0
# 3 A 2003 1 1
# 4 A 2004 1 2
# 5 A 2005 1 3
# 6 A 2006 1 4
# 7 A 2007 0 0
# 8 A 2008 0 0
# 9 A 2009 1 1
#10 A 2010 1 2
#11 B 2001 1 1
#12 B 2002 1 2
#13 B 2003 1 3
#14 B 2004 0 0
#15 B 2005 0 0
#16 B 2006 0 0
#17 B 2007 1 1
#18 B 2008 1 2
#19 B 2009 1 3
#20 B 2010 0 0
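For readers who prefer not to load data.table or dplyr, the same run-length idea can be sketched in base R. This is an alternative written for illustration, not the answerer's code; the variable names treat, change and run_id are assumptions of the sketch.

```r
# Example data from the question.
df <- data.frame(Grp = rep(c("A", "B"), each = 10),
                 Year = rep(2001:2010, 2),
                 Treat = as.character(c(0,0,1,1,1,1,0,0,1,1,
                                        1,1,1,0,0,0,1,1,1,0)))

treat <- as.integer(df$Treat)
n <- nrow(df)

# A new run starts on the first row and whenever Treat or Grp differs
# from the previous row.
change <- c(TRUE, df$Treat[-1] != df$Treat[-n] | df$Grp[-1] != df$Grp[-n])
run_id <- cumsum(change)

# Count positions within each run, then zero out the runs where Treat is 0.
df$seq <- ave(treat, run_id, FUN = seq_along) * treat
df
```

Including Grp in the change test plays the same role as grouping by Grp in the pipeline above: a run never continues across two groups, even when the last Treat of one group equals the first Treat of the next.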

Coding Changes in Variables in R

I am working with data that looks like this:
ID Year Variable_of_Interest
1 a 2000 0
2 a 2001 0
3 a 2002 0
4 a 2003 0
5 a 2004 0
6 a 2005 1
7 a 2006 1
8 a 2007 1
9 a 2008 1
10 a 2009 1
11 b 2000 0
12 b 2001 0
13 b 2002 0
14 b 2003 1
15 b 2004 1
16 b 2005 1
17 b 2006 1
18 b 2007 1
19 b 2008 1
20 b 2009 1
21 c 2000 0
22 c 2001 0
23 c 2002 0
24 c 2003 0
25 c 2004 0
26 c 2005 0
27 c 2006 1
28 c 2007 1
29 c 2008 1
30 c 2009 1
31 d 2000 0
32 d 2001 0
33 d 2002 1
34 d 2003 1
35 d 2004 1
36 d 2005 1
37 d 2006 0
38 d 2007 0
39 d 2008 0
40 d 2009 0
The unit of analysis is ID. The IDs repeat across each year in the data. The variable of interest column represents changes to the IDs, wherein some years they are a 0 and other years they are a 1.
I want to create an additional column that codes changes (defined as going from 0 to 1) in the Variable_of_Interest at the year before and after the change, while also ignoring changes from (1 to 0) (as seen when the ID is equal to "d").
Any code that can help me achieve this solution would be greatly appreciated!
Preferably, I'd like the data to look like this:
ID Year Variable_of_Interest Solution
1 a 2000 0 -5
2 a 2001 0 -4
3 a 2002 0 -3
4 a 2003 0 -2
5 a 2004 0 -1
6 a 2005 1 0
7 a 2006 1 1
8 a 2007 1 2
9 a 2008 1 3
10 a 2009 1 4
11 b 2000 0 -3
12 b 2001 0 -2
13 b 2002 0 -1
14 b 2003 1 0
15 b 2004 1 1
16 b 2005 1 2
17 b 2006 1 3
18 b 2007 1 4
19 b 2008 1 5
20 b 2009 1 6
21 c 2000 0 -6
22 c 2001 0 -5
23 c 2002 0 -4
24 c 2003 0 -3
25 c 2004 0 -2
26 c 2005 0 -1
27 c 2006 1 0
28 c 2007 1 1
29 c 2008 1 2
30 c 2009 1 3
31 d 2000 0 -2
32 d 2001 0 -1
33 d 2002 1 0
34 d 2003 1 1
35 d 2004 1 2
36 d 2005 1 3
37 d 2006 0 NA
38 d 2007 0 NA
39 d 2008 0 NA
40 d 2009 0 NA
Here is the replication code:
ID <- c(rep("a",10), rep("b", 10), rep("c", 10), rep("d", 10)); length(ID)
Year <- rep(seq(2000,2009, 1), 4)
Variable_of_Interest <- c(rep(0,5), rep(1, 5),
rep(0,3), rep(1, 7),
rep(0,6), rep(1, 4),
rep(0,2), rep(1, 4), rep(0,4))
df <- data.frame(ID, Year, Variable_of_Interest)
Thank you for your help!
We could create a function :
library(dplyr)
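get_sequence <- function(x) {
  # position of the first change from 0 to 1
  inds <- which(x == 1 & lag(x) == 0)[1]
  # distance of every position from that change (negative before, 0 at the change)
  vals <- seq_along(x) - inds
  # position of the first change from 1 back to 0, if any
  inds <- which(x == 0 & lag(x) == 1)[1]
  # everything from that point onwards is set to NA
  if(!is.na(inds)) vals[inds:length(x)] <- NA
  return(vals)
}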
and apply it for each ID :
df %>% group_by(ID) %>% mutate(Solution = get_sequence(Variable_of_Interest))
# ID Year Variable_of_Interest Solution
#1 a 2000 0 -5
#2 a 2001 0 -4
#3 a 2002 0 -3
#4 a 2003 0 -2
#5 a 2004 0 -1
#6 a 2005 1 0
#7 a 2006 1 1
#8 a 2007 1 2
#9 a 2008 1 3
#10 a 2009 1 4
#11 b 2000 0 -3
#...
#...
#33 d 2002 1 0
#34 d 2003 1 1
#35 d 2004 1 2
#36 d 2005 1 3
#37 d 2006 0 NA
#38 d 2007 0 NA
#39 d 2008 0 NA
#40 d 2009 0 NA
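The same logic can be sketched without dplyr, which also makes it easy to trace the function on a single ID. This is a base R rewrite for illustration only: get_sequence_base is a made-up name, and the previous value is built by hand instead of using dplyr's lag().

```r
# Replication data from the question.
ID <- rep(c("a", "b", "c", "d"), each = 10)
Year <- rep(2000:2009, 4)
Variable_of_Interest <- c(rep(0,5), rep(1,5),
                          rep(0,3), rep(1,7),
                          rep(0,6), rep(1,4),
                          rep(0,2), rep(1,4), rep(0,4))
df <- data.frame(ID, Year, Variable_of_Interest)

# Hypothetical base R counterpart of get_sequence().
get_sequence_base <- function(x) {
  prev <- c(NA, x[-length(x)])           # previous value of x, NA for the first
  up   <- which(x == 1 & prev == 0)[1]   # first change from 0 to 1
  vals <- seq_along(x) - up              # 0 at the change, negative before it
  down <- which(x == 0 & prev == 1)[1]   # first change from 1 back to 0
  if (!is.na(down)) vals[down:length(x)] <- NA
  vals
}

# Trace on ID "d": the change is at position 3, the drop back at position 7.
get_sequence_base(c(0,0,1,1,1,1,0,0,0,0))

# Apply per ID with ave() instead of group_by() + mutate().
df$Solution <- ave(df$Variable_of_Interest, df$ID, FUN = get_sequence_base)
```

Like the dplyr version, this only looks at the first change in each direction, so it assumes each ID changes from 0 to 1 at most once before any change back.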

Creating sequence indicators corresponding to data timing in R

I am working with data that looks like this:
ID Year Variable_of_Interest
1 a 2000 0
2 a 2001 0
3 a 2002 0
4 a 2003 0
5 a 2004 0
6 a 2005 1
7 a 2006 1
8 a 2007 1
9 a 2008 1
10 a 2009 1
11 b 2000 0
12 b 2001 0
13 b 2002 0
14 b 2003 1
15 b 2004 1
16 b 2005 1
17 b 2006 0
18 b 2007 1
19 b 2008 1
20 b 2009 1
21 c 2000 0
22 c 2001 0
23 c 2002 0
24 c 2003 0
25 c 2004 0
26 c 2005 0
27 c 2006 1
28 c 2007 1
29 c 2008 1
30 c 2009 0
31 d 2000 0
32 d 2001 0
33 d 2002 1
34 d 2003 1
35 d 2004 0
36 d 2005 1
37 d 2006 1
38 d 2007 1
39 d 2008 1
40 d 2009 1
The unit of analysis is in the ID column. The IDs repeat across each year in the data.
The variable of interest column represents changes to the IDs, wherein some years the values are a 0 and other years they are a 1.
I want to create an additional column that includes sequences of numbers that document the time before and after a code changes (defined as going from 0 to 1) in the Variable_of_Interest at the year before and after the change.
The code must account for repeat code changes (defined as going from 0 to 1), such as ID b from 2002-2003 and 2006-2007.
NAs can be assigned to 0 values without changes back to 1, such as 0 in "c" 2009.
Such that, the data looks like:
ID Year Variable_of_Interest Solution
1 a 2000 0 -5
2 a 2001 0 -4
3 a 2002 0 -3
4 a 2003 0 -2
5 a 2004 0 -1
6 a 2005 1 0
7 a 2006 1 1
8 a 2007 1 2
9 a 2008 1 3
10 a 2009 1 4
11 b 2000 0 -3
12 b 2001 0 -2
13 b 2002 0 -1
14 b 2003 1 0
15 b 2004 1 1
16 b 2005 1 2
17 b 2006 0 -1
18 b 2007 1 0
19 b 2008 1 1
20 b 2009 1 2
21 c 2000 0 -6
22 c 2001 0 -5
23 c 2002 0 -4
24 c 2003 0 -3
25 c 2004 0 -2
26 c 2005 0 -1
27 c 2006 1 0
28 c 2007 1 1
29 c 2008 1 2
30 c 2009 0 NA
31 d 2000 0 -2
32 d 2001 0 -1
33 d 2002 1 0
34 d 2003 1 1
35 d 2004 0 -1
36 d 2005 1 0
37 d 2006 1 1
38 d 2007 1 2
39 d 2008 1 3
40 d 2009 1 4
Here is the replication code:
ID <- c(rep("a",10), rep("b", 10), rep("c", 10), rep("d", 10)); length(ID)
Year <- rep(seq(2000,2009, 1), 4)
Variable_of_Interest <- c(rep(0,5), rep(1, 5),
rep(0,3), rep(1, 3), rep(0, 1), rep(1, 3),
rep(0,6), rep(1, 3), rep(0, 1),
rep(0,2), rep(1, 2), rep(0,1), rep(1,5))
data.frame(ID, Year, Variable_of_Interest)
Thank you so much for your help!
Another option using data.table:
library(data.table)
#identify runs
setDT(DF)[, ri := rleid(voi)]
#generate the desired output depending on whether VOI is 1 or 0
DF[, soln := if (voi[1L]==1L) seq.int(.N) - 1L else -rev(seq.int(.N)), .(ID, ri)]
#replace trailing 0 with NA
DF[, soln := if(voi[.N]==0L) replace(soln, ri==ri[.N], NA_integer_) else soln, ID]
data:
ID <- c(rep("a",10), rep("b", 10), rep("c", 10), rep("d", 10)); length(ID)
Year <- rep(seq(2000,2009, 1), 4)
Variable_of_Interest <- c(rep(0,5), rep(1, 5),
rep(0,3), rep(1, 3), rep(0, 1), rep(1, 3),
rep(0,6), rep(1, 3), rep(0, 1),
rep(0,2), rep(1, 2), rep(0,1), rep(1,5))
DF <- data.frame(ID, Year, voi=Variable_of_Interest)
Here is a dplyr option:
library(dplyr)
df %>%
group_by(ID, idx = cumsum(Variable_of_Interest != lag(Variable_of_Interest, default = first(Variable_of_Interest)))) %>%
mutate(Solution = case_when(Variable_of_Interest == 0 ~ rev(-1:(-n())), Variable_of_Interest == 1 ~ 0:(n() - 1))) %>%
group_by(ID) %>%
mutate(Solution = replace(Solution, idx == max(idx) & Variable_of_Interest == 0, NA)) %>%
select(-idx)
Output (shown before the final select(-idx) step, so the helper column idx is still visible):
ID Year Variable_of_Interest idx Solution
1 a 2000 0 0 -5
2 a 2001 0 0 -4
3 a 2002 0 0 -3
4 a 2003 0 0 -2
5 a 2004 0 0 -1
6 a 2005 1 1 0
7 a 2006 1 1 1
8 a 2007 1 1 2
9 a 2008 1 1 3
10 a 2009 1 1 4
11 b 2000 0 2 -3
12 b 2001 0 2 -2
13 b 2002 0 2 -1
14 b 2003 1 3 0
15 b 2004 1 3 1
16 b 2005 1 3 2
17 b 2006 0 4 -1
18 b 2007 1 5 0
19 b 2008 1 5 1
20 b 2009 1 5 2
21 c 2000 0 6 -6
22 c 2001 0 6 -5
23 c 2002 0 6 -4
24 c 2003 0 6 -3
25 c 2004 0 6 -2
26 c 2005 0 6 -1
27 c 2006 1 7 0
28 c 2007 1 7 1
29 c 2008 1 7 2
30 c 2009 0 8 NA
31 d 2000 0 8 -2
32 d 2001 0 8 -1
33 d 2002 1 9 0
34 d 2003 1 9 1
35 d 2004 0 10 -1
36 d 2005 1 11 0
37 d 2006 1 11 1
38 d 2007 1 11 2
39 d 2008 1 11 3
40 d 2009 1 11 4
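The run-based approach used in both answers can also be sketched in base R with rle(). This is an illustrative alternative rather than either answerer's code; the helper name solve_one is an assumption of the sketch.

```r
# Replication data from the question.
ID <- rep(c("a", "b", "c", "d"), each = 10)
Year <- rep(2000:2009, 4)
Variable_of_Interest <- c(rep(0,5), rep(1,5),
                          rep(0,3), rep(1,3), rep(0,1), rep(1,3),
                          rep(0,6), rep(1,3), rep(0,1),
                          rep(0,2), rep(1,2), rep(0,1), rep(1,5))
df <- data.frame(ID, Year, Variable_of_Interest)

# Hypothetical helper: process the 0/1 vector of a single ID.
solve_one <- function(v) {
  r <- rle(v)                                   # runs of identical values
  out <- unlist(Map(function(len, val) {
    if (val == 1) seq_len(len) - 1L             # run of 1s: 0, 1, 2, ...
    else -rev(seq_len(len))                     # run of 0s: ..., -2, -1
  }, r$lengths, r$values), use.names = FALSE)
  k <- length(r$values)
  if (r$values[k] == 0) {                       # trailing 0s never change back to 1
    out[(length(v) - r$lengths[k] + 1):length(v)] <- NA
  }
  out
}

df$Solution <- ave(df$Variable_of_Interest, df$ID, FUN = solve_one)
df[df$ID %in% c("b", "c"), ]
```

Because rle() is applied separately to each ID through ave(), runs cannot leak from one ID into the next, so no extra run identifier is needed.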

reshape a dataframe R

I am facing a reshaping problem with a dataframe. It has many more rows and columns. Simplified, its structure looks like this:
rownames year x1 x2 x3
a 2000 2 6 11
b 2000 0 4 2
c 2000 0 3 5
a 2010 2 6 11
b 2010 0 0 0
c 2020 4 1 8
a 2020 10 1 7
b 2020 8 4 10
c 2020 22 1 16
I would like to come out with a dataframe that has one single row for the variable "year", copy the x1, x2, x3 values in subsequent columns, and rename the columns with a combination between the rowname and the x-variable. It should look like this:
year a_x1 a_x2 a_x3 b_x1 b_x2 b_x3 c_x1 c_x2 c_x3
2000 2 6 11 0 4 2 0 3 5
2010 2 6 11 0 0 0 4 1 8
2020 10 1 7 8 4 10 22 1 16
I thought to use subsequent cbind() functions, but since I have to do it for thousands of rows and hundreds columns, I hope there is a more direct way with the reshape package (with which I am not so familiar yet)
Thanks in advance!
First, I hope that rownames is a data.frame column and not the data.frame's rownames. Otherwise you'll encounter problems due to the non-uniqueness of the values.
I think your main problem is, that your data.frame is not entirely molten:
library(reshape2)
dt <- melt( dt, id.vars=c("year", "rownames") )
head(dt)
year rownames variable value
1 2000 a x1 2
2 2000 b x1 0
3 2000 c x1 0
4 2010 a x1 2
...
dcast( dt, year ~ rownames + variable )
year a_x1 a_x2 a_x3 b_x1 b_x2 b_x3 c_x1 c_x2 c_x3
1 2000 2 6 11 0 4 2 0 3 5
2 2010 2 6 11 0 0 0 4 1 8
3 2020 10 1 7 8 4 10 22 1 16
EDIT:
As @spdickson points out, there is also an error in your data that prevents a simple cast. Combinations of year and rownames have to be unique; otherwise you need an aggregation function that determines the resulting value for each non-unique combination. So we assume that row 6 in your data should read c 2010 4 1 8.
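Before casting, it can help to check for such non-unique combinations directly. The following base R sketch uses the data exactly as posted in the question (including the suspicious row 6); the object names dat, key and dup are made up for the example.

```r
# Data as posted in the question, with row 6 left as "c 2020".
dat <- data.frame(rownames = rep(c("a", "b", "c"), 3),
                  year = c(2000, 2000, 2000, 2010, 2010, 2020, 2020, 2020, 2020),
                  x1 = c(2, 0, 0, 2, 0, 4, 10, 8, 22),
                  x2 = c(6, 4, 3, 6, 0, 1, 1, 4, 1),
                  x3 = c(11, 2, 5, 11, 0, 8, 7, 10, 16))

# Flag every row whose (year, rownames) pair occurs more than once.
key <- dat[, c("year", "rownames")]
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

dat[dup, ]   # the rows that would collide when casting to wide format
```

If this returns no rows, each cell of the wide table is filled by exactly one value and no aggregation function is needed.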
You can try using reshape() from base R without having to melt your dataframe further:
df1 <- read.table(text="rownames year x1 x2 x3
a 2000 2 6 11
b 2000 0 4 2
c 2000 0 3 5
a 2010 2 6 11
b 2010 0 0 0
c 2010 4 1 8
a 2020 10 1 7
b 2020 8 4 10
c 2020 22 1 16",header=T,as.is=T)
reshape(df1,direction="wide",idvar="year",timevar="rownames")
# year x1.a x2.a x3.a x1.b x2.b x3.b x1.c x2.c x3.c
# 1 2000 2 6 11 0 4 2 0 3 5
# 4 2010 2 6 11 0 0 0 4 1 8
# 7 2020 10 1 7 8 4 10 22 1 16
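The column names produced by reshape() have the form x1.a rather than the a_x1 form requested in the question. One possible way to rename them is sketched below; the regular expression assumes the value columns are all named x followed by digits, as in the example data.

```r
df1 <- data.frame(rownames = rep(c("a", "b", "c"), 3),
                  year = rep(c(2000, 2010, 2020), each = 3),
                  x1 = c(2, 0, 0, 2, 0, 4, 10, 8, 22),
                  x2 = c(6, 4, 3, 6, 0, 1, 1, 4, 1),
                  x3 = c(11, 2, 5, 11, 0, 8, 7, 10, 16),
                  stringsAsFactors = FALSE)

wide <- reshape(df1, direction = "wide", idvar = "year", timevar = "rownames")

# Turn "x1.a" into "a_x1": swap the two parts and use "_" as the separator.
names(wide) <- sub("^(x[0-9]+)\\.(.*)$", "\\2_\\1", names(wide))
rownames(wide) <- NULL   # replace the 1, 4, 7 row names with 1, 2, 3
wide
```

The year column does not match the pattern, so sub() leaves it unchanged.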

Combining split() and cumsum()

I am trying to produce stats for cumulative goals by season by a particular soccer player. I have used the cut function to obtain the season from the game dates. I have data which corresponds to this dataframe
df.raw <-
data.frame(Game = 1:20,
Goals=c(1,0,0,2,1,0,3,2,0,0,0,1,0,4,1,2,0,0,0,3),
season = gl(4,5,labels = c("2001", "2002","2003", "2004")))
In real life, the number of games per season may not be constant.
I want to end up with data that looks like this:
df.seasoned <-
data.frame(Game = 1:20,seasonGame= rep(1:5),
Goals=c(1,0,0,2,1,0,3,2,0,0,0,1,0,4,1,2,0,0,0,3),
cumGoals = c(1,1,1,3,4,0,3,5,5,5,0,1,1,5,6,2,2,2,2,5),
season = gl(4,5,labels = c("2001", "2002","2003", "2004")))
With the goals cumulatively summed within year and a game number for the season
df.raw$cumGoals <- with(df.raw, ave(Goals, season, FUN=cumsum) )
# seq_along, not seq: seq() would misbehave for a season with a single game
df.raw$seasonGame <- with(df.raw, ave(Game, season, FUN=seq_along))
df.raw
Or with transform ... the original transform, that is:
df.seas <- transform(df.raw, seasonGame = ave(Game, season, FUN=seq_along),
cumGoals = ave(Goals, season, FUN=cumsum) )
df.seas
Game Goals season seasonGame cumGoals
1 1 1 2001 1 1
2 2 0 2001 2 1
3 3 0 2001 3 1
4 4 2 2001 4 3
5 5 1 2001 5 4
6 6 0 2002 1 0
7 7 3 2002 2 3
8 8 2 2002 3 5
9 9 0 2002 4 5
10 10 0 2002 5 5
snipped
Another job for ddply and transform (from the plyr package):
library(plyr)  # seq_along(Goals) avoids relying on plyr's internal `piece` variable
ddply(df.raw, .(season), transform, seasonGame = seq_along(Goals),
cumGoals = cumsum(Goals))
Game Goals season seasonGame cumGoals
1 1 1 2001 1 1
2 2 0 2001 2 1
3 3 0 2001 3 1
4 4 2 2001 4 3
5 5 1 2001 5 4
6 6 0 2002 1 0
7 7 3 2002 2 3
8 8 2 2002 3 5
9 9 0 2002 4 5
10 10 0 2002 5 5
11 11 0 2003 1 0
12 12 1 2003 2 1
13 13 0 2003 3 1
14 14 4 2003 4 5
15 15 1 2003 5 6
16 16 2 2004 1 2
17 17 0 2004 2 2
18 18 0 2004 3 2
19 19 0 2004 4 2
20 20 3 2004 5 5
Here is a solution using data.table which is very fast.
library(data.table)
df.raw.tab = data.table(df.raw)
df.raw.tab[,list(seasonGame = 1:NROW(Goals), cumGoals = cumsum(Goals)),'season']
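Since the question title asks about split() and cumsum() specifically, here is a minimal base R sketch of that literal combination. It uses a small made-up data set with unequal season lengths (including a season with a single game) rather than the data from the question, and the names parts and df.seasoned are chosen for the example.

```r
# Made-up example with 3, 1 and 3 games per season.
df.raw <- data.frame(Game = 1:7,
                     Goals = c(1, 0, 2, 3, 0, 1, 4),
                     season = factor(c("2001", "2001", "2001",
                                       "2002",
                                       "2003", "2003", "2003")))

# Split into one data.frame per season, add the two columns, and reassemble.
parts <- split(df.raw, df.raw$season)
parts <- lapply(parts, function(p) {
  p$seasonGame <- seq_len(nrow(p))   # game number within the season
  p$cumGoals   <- cumsum(p$Goals)    # running total within the season
  p
})
df.seasoned <- unsplit(parts, df.raw$season)
df.seasoned
```

unsplit() puts the rows back in their original order, so this also works when the games of different seasons are interleaved in the input.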

Resources