I need to create a new variable with the sum of the past three years' amounts for each ID.
If there are not three years' worth of data, there should be an 'NA'.
As an example:
ID YEAR AMOUNT
1 2010 5
1 2011 2
1 2012 4
1 2013 1
1 2014 3
2 2013 4
2 2014 6
2 2015 9
3 2012 4
3 2013 7
3 2014 2
3 2015 3
Here's what the result should be:
ID YEAR AMOUNT THREE_YR
1 2010 5 NA
1 2011 2 NA
1 2012 4 11
1 2013 1 7
1 2014 3 8
2 2013 4 NA
2 2014 6 NA
2 2015 9 19
3 2012 4 NA
3 2013 7 NA
3 2014 2 13
3 2015 3 12
How would I do this? Thanks!
We can use functions from dplyr and zoo. dt2 is the final output.
# Create example data frame
dt <- read.table(text = "ID YEAR AMOUNT
1 2010 5
1 2011 2
1 2012 4
1 2013 1
1 2014 3
2 2013 4
2 2014 6
2 2015 9
3 2012 4
3 2013 7
3 2014 2
3 2015 3",
header = TRUE, stringsAsFactors = FALSE)
# Load packages
library(dplyr)
library(zoo)
# Process the data
dt2 <- dt %>%
group_by(ID) %>%
mutate(THREE_YR = rollsum(AMOUNT, k = 3, fill = NA, align = "right"))
Update: ID groups with fewer than 3 records.
The OP asked what to do if some IDs have only one or two rows. I did not find a clean way to handle this in a single step. One workaround is to split the original data frame into two parts, apply rollsum only to the IDs that have three or more rows, and then bind the two parts back together. The IDs with fewer than three rows end up with NA in THREE_YR after bind_rows.
# Create example data frame
dt <- read.table(text = "ID YEAR AMOUNT
1 2010 5
1 2011 2
1 2012 4
1 2013 1
1 2014 3
2 2013 4
3 2012 4
3 2013 7
3 2014 2
3 2015 3",
header = TRUE, stringsAsFactors = FALSE)
# Load packages
library(dplyr)
library(zoo)
# Process the data
dt2 <- dt %>%
group_by(ID) %>%
filter(n() >= 3) %>%
mutate(THREE_YR = rollsum(AMOUNT, k = 3, fill = NA, align = "right")) %>%
bind_rows(dt %>% group_by(ID) %>% filter(n() < 3)) %>%
arrange(ID, YEAR)
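As an alternative to splitting and re-binding, here is a minimal base R sketch that returns NA for short groups directly. The helper name roll3 is hypothetical (it is not part of zoo or dplyr), and the sketch assumes the rows of each ID are consecutive years once sorted.

```r
# Same example data as above: ID 2 has only one row
dt <- data.frame(
  ID = c(1, 1, 1, 1, 1, 2, 3, 3, 3, 3),
  YEAR = c(2010:2014, 2013, 2012:2015),
  AMOUNT = c(5, 2, 4, 1, 3, 4, 4, 7, 2, 3)
)

# Hypothetical helper: right-aligned rolling sum over 3 rows,
# all NA when the group has fewer than 3 rows
roll3 <- function(x) {
  n <- length(x)
  if (n < 3) return(rep(NA_real_, n))
  cs <- cumsum(x)
  # window sum = cumulative sum now minus cumulative sum three rows earlier
  c(NA, NA, cs[3:n] - c(0, cs[seq_len(n - 3)]))
}

dt <- dt[order(dt$ID, dt$YEAR), ]
dt$THREE_YR <- ave(dt$AMOUNT, dt$ID, FUN = roll3)
dt
```

Because ave() applies the function within each ID and puts the results back in the original row positions, no filter or bind_rows step is needed.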
With data.table:
library(data.table)
setDT(dt)
setorder(dt,YEAR)
dt[,.(YEAR,AMOUNT,THREE_YR=AMOUNT+shift(AMOUNT,1)+shift(AMOUNT,2)),by=.(ID)]
#ID YEAR AMOUNT THREE_YR
# 1: 1 2010 5 NA
# 2: 1 2011 2 NA
# 3: 1 2012 4 11
# 4: 1 2013 1 7
# 5: 1 2014 3 8
# 6: 3 2012 4 NA
# 7: 3 2013 7 NA
# 8: 3 2014 2 13
# 9: 3 2015 3 12
#10: 2 2013 4 NA
#11: 2 2014 6 NA
#12: 2 2015 9 19
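Note that shift-based and rolling-window approaches count rows, not calendar years, so they assume every ID has one row per consecutive year. If years can be missing, a window defined on YEAR itself may be closer to what the question asks for. The following base R sketch illustrates that idea; the function name three_yr and the example data with a gap are made up for illustration, and it assumes at most one row per ID and year.

```r
# Hypothetical data: ID 1 has no row for 2013
x <- data.frame(
  ID = rep(1, 6),
  YEAR = c(2010, 2011, 2012, 2014, 2015, 2016),
  AMOUNT = c(5, 2, 4, 1, 3, 6)
)

# For each row, sum AMOUNT over the same ID and the years YEAR-2 .. YEAR;
# return NA unless all three years are present
three_yr <- function(id, year, amount) {
  vapply(seq_along(year), function(i) {
    in_window <- id == id[i] & year >= year[i] - 2 & year <= year[i]
    if (sum(in_window) == 3) sum(amount[in_window]) else NA_real_
  }, numeric(1))
}

x$THREE_YR <- three_yr(x$ID, x$YEAR, x$AMOUNT)
x
```

This is quadratic in the number of rows, so it is meant to show the logic rather than to scale to large tables.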
Using zoo::rollapplyr() and aggregate()
This will return NA if there are fewer than three members in a group.
x <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 3L), YEAR = c(2010L, 2011L, 2012L, 2013L, 2014L, 2013L, 2014L,
2015L, 2012L, 2013L, 2014L, 2015L), AMOUNT = c(5L, 2L, 4L, 1L,
3L, 4L, 6L, 9L, 4L, 7L, 2L, 3L)), .Names = c("ID", "YEAR", "AMOUNT"
), class = "data.frame", row.names = c(NA, -12L))
library(zoo)
rsum <- aggregate(AMOUNT ~ ID, data=x,
FUN=function(x) rollapplyr(x, 3, fill=NA, partial=TRUE,
FUN=function(y) if (length(y) >= 3) sum(y) else NA))
x$rsum <- do.call(c, rsum$AMOUNT)
x
# ID YEAR AMOUNT rsum
# 1 1 2010 5 NA
# 2 1 2011 2 NA
# 3 1 2012 4 11
# 4 1 2013 1 7
# 5 1 2014 3 8
# 6 2 2013 4 NA
# 7 2 2014 6 NA
# 8 2 2015 9 19
# 9 3 2012 4 NA
# 10 3 2013 7 NA
# 11 3 2014 2 13
# 12 3 2015 3 12
# remove one of the rows with ID 2
x <- x[-6, ]
rsum <- aggregate(AMOUNT ~ ID, data=x,
FUN=function(x) rollapplyr(x, 3, fill=NA, partial=TRUE,
FUN=function(y) if (length(y) >= 3) sum(y) else NA))
x$rsum <- do.call(c, rsum$AMOUNT)
x
# ID YEAR AMOUNT rsum
# 1 1 2010 5 NA
# 2 1 2011 2 NA
# 3 1 2012 4 11
# 4 1 2013 1 7
# 5 1 2014 3 8
# 7 2 2014 6 NA
# 8 2 2015 9 NA
# 9 3 2012 4 NA
# 10 3 2013 7 NA
# 11 3 2014 2 13
# 12 3 2015 3 12
I've been using wide table format to create a migration variable (year, municipality -> year, municipality, move) and was wondering if I can flip it back into long table format. However, I now have two groups of columns per year instead of one. I looked through the existing posts on SO, but couldn't find anything similar.
Here's what I have done:
library(tidyverse)
library(rlang)
# sample data
mydata <- data.frame(id = sort(rep(1:10,3)),
year = rep(seq(2009,2011),10),
municip = sample(c(NA,1:3),30,replace=TRUE))
The data looks like this:
id year municip
1 2009 2
1 2010 1
1 2011 3
2 2009 1
2 2010 1
2 2011 3
3 2009 NA
3 2010 NA
3 2011 NA
# turn sideways
mydata.wide <- mydata %>%
pivot_wider(names_from = year,
names_prefix = "municip.",
values_from = municip)
Now it looks like this:
id municip.2009 municip.2010 municip.2011
1 2 1 3
2 1 1 3
3 NA NA NA
4 1 NA 3
5 1 NA 2
6 3 2 2
7 2 NA 3
8 3 NA 3
9 NA 1 NA
10 1 NA 2
Then I'm adding a migration variable (in reality this is done for 12 years):
# create migration variable
for (i in 2009:2010){
text.string <- paste0("mydata.wide <- mydata.wide %>%
mutate(move.",i+1," = case_when(
is.na(municip.",i,") & is.na(municip.",i+1,") ~ \"NA\",
is.na(municip.",i,") & !is.na(municip.",i+1,") ~ \"1\",
!is.na(municip.",i,") & !is.na(municip.",i+1,")
& municip.",i," != municip.",i+1," ~ \"3\",
!is.na(municip.",i,") & is.na(municip.",i+1,") ~ \"4\",
TRUE ~ \"2\"
))")
eval(parse_expr(text.string))
}
# NA: missing in both cases
# 1: moved into region
# 2: stayed in region
# 3: moved within region
# 4: moved out of region
Now the table looks like this:
id municip.2009 municip.2010 municip.2011 move.2010 move.2011
1 2 1 3 3 3
2 1 1 3 2 3
3 NA NA NA NA NA
4 1 NA 3 4 1
5 1 NA 2 4 1
6 3 2 2 3 2
7 2 NA 3 4 1
8 3 NA 3 4 1
9 NA 1 NA 1 4
10 1 NA 2 4 1
What I want to do is to flip it back to create something like this:
id year municip move
1 2009 2 NA
1 2010 1 3
1 2011 3 3
2 2009 1 NA
2 2010 1 2
2 2011 3 3
3 2009 NA NA
3 2010 NA NA
3 2011 NA NA
I'm not sure if this can be done with just pivot_longer on its own. I tried a couple of variations. Any ideas?
You can try this:
df <- tribble(~id, ~municip.2009, ~municip.2010, ~municip.2011, ~move.2010, ~move.2011,
1, 2, 1, 3, 3, 3,
2, 1, 1, 3, 2, 3,
3, NA, NA, NA, NA, NA,
4, 1, NA, 3, 4, 1,
5, 1, NA, 2, 4, 1,
6, 3, 2, 2, 3, 2,
7, 2, NA, 3, 4, 1,
8, 3, NA, 3, 4, 1,
9, NA, 1, NA, 1, 4,
10, 1, NA, 2, 4, 1
)
df %>%
pivot_longer(cols = -1, names_to = "temp1", values_to = "count") %>%
separate(col = temp1, c("temp2", "year")) %>%
pivot_wider(names_from = temp2, values_from = count)
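pivot_longer first stacks all municip.* and move.* columns into a single pair of name and value columns; separate then splits each name into the variable (municip or move) and the year; finally pivot_wider spreads municip and move back into their own columns, giving one row per id and year.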
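On the question of whether pivot_longer can do this on its own: tidyr supports a special ".value" entry in names_to, which keeps the first part of each column name as an output column instead of stacking it. The sketch below assumes the column names follow the variable.year pattern shown in the question and uses a small made-up table with numeric move codes.

```r
library(tidyr)

# Small wide table in the same shape as the question's mydata.wide
wide <- data.frame(
  id = 1:3,
  municip.2009 = c(2, 1, NA),
  municip.2010 = c(1, 1, NA),
  municip.2011 = c(3, 3, NA),
  move.2010 = c(3, 2, NA),
  move.2011 = c(3, 3, NA)
)

# ".value" turns the part before the dot into column names (municip, move);
# the part after the dot becomes the year column
long <- pivot_longer(
  wide,
  cols = -id,
  names_to = c(".value", "year"),
  names_sep = "\\."
)
long
```

Since there is no move.2009 column, move is filled with NA for 2009. The year column comes back as character, so convert it with as.integer() if needed.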
Don't think sideways, think longways!
Now, I cannot answer your question completely, because I don't really understand what you are calculating. Is it some sort of factor (1-4)? But I believe you can finish this yourself. Consider the following:
> mydata %>% group_by(id) %>%
arrange(year) %>%
mutate(last_year = lag(municip)) %>%
ungroup %>%
arrange(id) %>% as.data.frame # ignore this line, it is simply for the pleasure of seeing the data.frame
id year municip last_year
1 1 2009 3 NA
2 1 2010 2 3
3 1 2011 NA 2
4 2 2009 NA NA
5 2 2010 NA NA
6 2 2011 1 NA
7 3 2009 3 NA
8 3 2010 2 3
9 3 2011 2 2
10 4 2009 2 NA
11 4 2010 NA 2
12 4 2011 1 NA
13 5 2009 3 NA
14 5 2010 NA 3
15 5 2011 2 NA
16 6 2009 1 NA
17 6 2010 3 1
18 6 2011 2 3
19 7 2009 3 NA
20 7 2010 2 3
21 7 2011 2 2
22 8 2009 NA NA
23 8 2010 NA NA
24 8 2011 3 NA
25 9 2009 1 NA
26 9 2010 NA 1
27 9 2011 1 NA
28 10 2009 3 NA
29 10 2010 NA 3
30 10 2011 NA NA
You see? In long form, you can now simply continue with
%>% mutate(move = case_when(
is.na(municip) & is.na(last_year) ~ "NA",
# etc.
))
Did you want the comparison from year i to the following year? Use the function lead instead of lag.
Lastly, a note on your generated code: inside mutate(), case_when() can refer to columns by their bare names. The .$ prefix was only needed in old dplyr releases and is best avoided, because it ignores grouping.
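To make the long-form idea concrete, here is a base R sketch that finishes the classification using the question's coding (1 = moved in, 2 = stayed, 3 = moved within, 4 = moved out). The helper classify_move is hypothetical, and the sketch makes two assumptions that differ slightly from the question: move is numeric with a real NA rather than the string "NA", and the first year of each id is set to NA because there is no earlier year to compare with.

```r
mydata <- data.frame(
  id = rep(1:3, each = 3),
  year = rep(2009:2011, times = 3),
  municip = c(2, 1, 3, 1, 1, 3, NA, NA, NA)
)
mydata <- mydata[order(mydata$id, mydata$year), ]

# Previous year's municipality within each id (NA for the first year)
mydata$last_year <- ave(mydata$municip, mydata$id,
                        FUN = function(v) c(NA, head(v, -1)))

# Hypothetical helper implementing the question's move codes
classify_move <- function(prev, cur) {
  ifelse(is.na(prev) & is.na(cur), NA,      # missing in both years
  ifelse(is.na(prev), 1,                    # moved into region
  ifelse(is.na(cur), 4,                     # moved out of region
  ifelse(prev != cur, 3, 2))))              # moved within / stayed
}

mydata$move <- classify_move(mydata$last_year, mydata$municip)
# No comparison is possible for the first observed year
mydata$move[mydata$year == min(mydata$year)] <- NA
mydata
```

The same nested conditions translate directly into a case_when() call if you prefer to stay in a dplyr pipeline.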
Something like this?
mydata.wide %>%
pivot_longer(
cols = -id,
names_pattern = "([a-z]+?)\\.(\\d+)",
names_to = c("name", "year"),
values_to = "val",
values_transform = list(val = as.character)
) %>%
pivot_wider(
names_from = name,
values_from = val
) %>%
print(n=30)
A tibble: 30 × 4
id year municip move
<int> <chr> <chr> <chr>
1 1 2009 2 NA
2 1 2010 3 3
3 1 2011 NA 4
4 2 2009 2 NA
5 2 2010 NA 4
6 2 2011 2 1
7 3 2009 1 NA
8 3 2010 2 3
9 3 2011 1 3
10 4 2009 NA NA
11 4 2010 NA NA
12 4 2011 1 1
13 5 2009 NA NA
14 5 2010 2 1
15 5 2011 3 3
16 6 2009 3 NA
17 6 2010 3 2
18 6 2011 3 2
19 7 2009 NA NA
20 7 2010 NA NA
21 7 2011 NA NA
22 8 2009 NA NA
23 8 2010 2 1
24 8 2011 NA 4
25 9 2009 3 NA
26 9 2010 2 3
27 9 2011 NA 4
28 10 2009 2 NA
29 10 2010 3 3
30 10 2011 1 3
I have a data frame whose rows are grouped by Year. Not every variable has observations in every year, but when it does, it has exactly 3 observations in that year, and they may sit in different rows from the other variables' observations.
> na_data
Year Peter Paul John
1 2011 1 NA NA
2 2011 2 NA NA
3 2011 3 NA NA
4 2011 NA 1 NA
5 2011 NA 2 NA
6 2011 NA 3 NA
7 2012 1 NA NA
8 2012 NA 3 NA
9 2012 2 NA NA
10 2012 NA 2 NA
11 2012 3 NA NA
12 2012 NA 1 NA
13 2013 NA 1 4
14 2013 NA 2 5
15 2013 NA 3 6
16 2013 1 NA NA
17 2013 2 NA NA
18 2013 3 NA NA
I want to remove the NAs in each column by group. Such that the output looks like this:
final_data
Year Peter Paul John
[1,] 2011 1 1 NA
[2,] 2011 2 2 NA
[3,] 2011 3 3 NA
[4,] 2012 1 3 NA
[5,] 2012 2 2 NA
[6,] 2012 3 1 NA
[7,] 2013 1 1 4
[8,] 2013 2 2 5
[9,] 2013 3 3 6
So far I have used a loop but I am looking for a cleaner solution if anyone can help that would be great. My solution:
cleaned_list <- vector("list", length(unique(na_data$Year)))
names(cleaned_list) <- unique(na_data$Year)
for(yr in unique(na_data$Year)) {
temp <- matrix(NA, nrow = 3, ncol = ncol(na_data),
dimnames = list(NULL, colnames(na_data)))
for(name in colnames(na_data)[-1]){
no_nas <- as.vector(na.omit(na_data[na_data$Year == yr, name]))
if (length(no_nas)!=0) temp[,name] <- no_nas
}
temp[,1] <- yr
cleaned_list[[as.character(yr)]] <- temp
}
final_data <- do.call("rbind", cleaned_list)
Data:
na_data <- data.frame(
Year = rep(c(2011,2012,2013), each = 6),
Peter = c(1:3, rep(NA, 3), 1,NA,2,NA,3,NA, rep(NA, 3),1:3),
Paul = c(rep(NA,3), 1:3, NA,3,NA,2,NA, 1, 1:3, rep(NA,3)),
John = c(rep(NA, 12), 4:6, rep(NA, 3))
)
desired <- data.frame(
Year = rep(c(2011,2012,2013), each = 3),
Peter = c(1:3, 1:3, 1:3),
Paul = c( 1:3, 3:1, 1:3),
John = c(rep(NA, 6), 4:6)
) # same as final_data but a dataframe
Here is one possible solution using data.table package:
library(data.table)
setDT(na_data)[, lapply(.SD, function(x) if(length(y<-na.omit(x))) y else first(x)), by=Year]
# Year Peter Paul John
# 1: 2011 1 1 NA
# 2: 2011 2 2 NA
# 3: 2011 3 3 NA
# 4: 2012 1 3 NA
# 5: 2012 2 2 NA
# 6: 2012 3 1 NA
# 7: 2013 1 1 4
# 8: 2013 2 2 5
# 9: 2013 3 3 6
dplyr equivalent:
library(dplyr)
na_data |>
group_by(Year) |>
summarise(across(.fns = ~ if(length(y<-na.omit(.x))) y else first(.x)))
# # A tibble: 9 x 4
# # Groups: Year [3]
# Year Peter Paul John
# <dbl> <dbl> <dbl> <int>
# 1 2011 1 1 NA
# 2 2011 2 2 NA
# 3 2011 3 3 NA
# 4 2012 1 3 NA
# 5 2012 2 2 NA
# 6 2012 3 1 NA
# 7 2013 1 1 4
# 8 2013 2 2 5
# 9 2013 3 3 6
Convert to long form, remove the NA's, add a sequence number n, convert back and remove n.
library(dplyr)
library(tidyr)
na_data %>%
pivot_longer(-Year) %>%
drop_na %>%
group_by(Year, name) %>%
mutate(n = 1:n()) %>%
ungroup %>%
pivot_wider %>%
select(-n)
giving:
# A tibble: 9 x 4
Year Peter Paul John
<dbl> <dbl> <dbl> <dbl>
1 2011 1 1 NA
2 2011 2 2 NA
3 2011 3 3 NA
4 2012 1 3 NA
5 2012 2 2 NA
6 2012 3 1 NA
7 2013 1 1 4
8 2013 2 2 5
9 2013 3 3 6
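For comparison, the same "drop the NAs within each group and line the values up" idea can be sketched in base R without a loop over columns and years. The helper compact_group is hypothetical, and the sketch assumes Year is the first column.

```r
na_data <- data.frame(
  Year = rep(c(2011, 2012, 2013), each = 6),
  Peter = c(1:3, rep(NA, 3), 1, NA, 2, NA, 3, NA, rep(NA, 3), 1:3),
  Paul = c(rep(NA, 3), 1:3, NA, 3, NA, 2, NA, 1, 1:3, rep(NA, 3)),
  John = c(rep(NA, 12), 4:6, rep(NA, 3))
)

# Hypothetical helper: within one Year, keep the non-NA values of each
# column in their original order and pad shorter columns with NA
compact_group <- function(g) {
  n_keep <- max(vapply(g[-1], function(col) sum(!is.na(col)), integer(1)))
  cols <- lapply(g[-1], function(col) {
    v <- col[!is.na(col)]
    c(v, rep(NA, n_keep - length(v)))
  })
  data.frame(Year = g$Year[seq_len(n_keep)], cols)
}

final_data <- do.call(rbind, lapply(split(na_data, na_data$Year), compact_group))
rownames(final_data) <- NULL
final_data
```

Padding to the longest column in each group means the sketch also copes with groups where the variables have different numbers of observations.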
I would like to find out how many columns have an entry in each row:
For example:
Date A B
1990 NA NA
1991 1 NA
1992 2 2
1993 3 3
1994 4 NA
1995 5 3
1996 NA NA
1997 7 8
1998 8 2
1999 NA NA
2000 8 4
Column C here would be the result I am wanting.
Date A B C
1990 NA NA 0
1991 1 NA 1
1992 2 2 2
1993 3 3 2
1994 4 NA 1
1995 5 3 2
1996 NA NA 0
1997 7 8 2
1998 8 2 2
1999 NA NA 0
2000 8 4 2
Many Thanks
Try this (you can choose the columns in df, here I excluded first column):
df$C <- apply(df[,-1],1,function(x) length(which(!is.na(x))))
df
Date A B C
1 1990 NA NA 0
2 1991 1 NA 1
3 1992 2 2 2
4 1993 3 3 2
5 1994 4 NA 1
6 1995 5 3 2
7 1996 NA NA 0
8 1997 7 8 2
9 1998 8 2 2
10 1999 NA NA 0
11 2000 8 4 2
Some data:
df <- structure(list(Date = 1990:2000, A = c(NA, 1L, 2L, 3L, 4L, 5L,
NA, 7L, 8L, NA, 8L), B = c(NA, NA, 2L, 3L, NA, 3L, NA, 8L, 2L,
NA, 4L)), row.names = c(NA, -11L), class = "data.frame")
A tidyverse approach:
library(tidyverse)
df %>%
rowwise() %>%
mutate(C = sum(!is.na(across(A:B)))) %>%
ungroup
# A tibble: 11 x 4
Date A B C
<int> <int> <int> <dbl>
1 1990 NA NA 0
2 1991 1 NA 1
3 1992 2 2 2
4 1993 3 3 2
5 1994 4 NA 1
6 1995 5 3 2
7 1996 NA NA 0
8 1997 7 8 2
9 1998 8 2 2
10 1999 NA NA 0
11 2000 8 4 2
Or simply combine mutate with akrun's rowSums approach, excluding the first column (Date):
df %>%
mutate(C = rowSums(!is.na(across(-1))))
You can try c_across()
library(dplyr)
df <- data.frame(obs = 1:5, COL_A = 6:10, COL_B = 11:15, COL_C = c(10, NA, 21, NA, 7))
df2 <- df %>%
rowwise() %>%
mutate(TOTAL = sum(c_across(COL_A:COL_C), na.rm = TRUE))
# A tibble: 5 x 5
# Rowwise:
# obs COL_A COL_B COL_C TOTAL
# <int> <int> <int> <dbl> <dbl>
# 1 1 6 11 10 27
# 2 2 7 12 NA 19
# 3 3 8 13 21 42
# 4 4 9 14 NA 23
# 5 5 10 15 7 32
We can use rowSums on a logical matrix (created with is.na)
df1$C <- rowSums(!is.na(df1[c('A', 'B')]))
NOTE: Added the rowSums approach first here
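As a small worked illustration of why this works (using a shortened, made-up version of the example data): !is.na() turns the selected columns into a logical matrix, and rowSums() then counts the TRUE values in each row because TRUE is treated as 1.

```r
df1 <- data.frame(
  Date = 1990:1993,
  A = c(NA, 1, 2, 3),
  B = c(NA, NA, 2, 3)
)

# Logical matrix: TRUE where a value is present
present <- !is.na(df1[c("A", "B")])
present

# Number of non-missing entries per row
df1$C <- rowSums(present)
df1
```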
I would like to sum a column (by ID) depending on another variable (group). If we take for instance:
ID t group
1 12 1
1 14 1
1 2 6
2 0.5 7
2 12 1
3 3 1
4 2 4
I'd like to sum values of column t separately for each ID only if group==1, and obtain:
ID t group sum
1 12 1 26
1 14 1 26
1 2 6 NA
2 0.5 7 NA
2 12 1 12
3 3 1 3
4 2 4 NA
Using dplyr,
df %>%
group_by(ID) %>%
mutate(new = sum(t[group == 1]),
new = replace(new, group != 1, NA))
which gives,
# A tibble: 7 x 4
# Groups: ID [4]
ID t group new
<int> <dbl> <int> <dbl>
1 1 12 1 26
2 1 14 1 26
3 1 2 6 NA
4 2 0.5 7 NA
5 2 12 1 12
6 3 3 1 3
7 4 2 4 NA
Consider base R with ifelse and ave() for conditional inline aggregation.
df$sum <- with(df, ifelse(group == 1, ave(t, ID, group, FUN=sum), NA))
df
# ID t group sum
# 1 1 12.0 1 26
# 2 1 14.0 1 26
# 3 1 2.0 6 NA
# 4 2 0.5 7 NA
# 5 2 12.0 1 12
# 6 3 3.0 1 3
# 7 4 2.0 4 NA
We can use data.table methods. Convert the 'data.frame' to a 'data.table' (setDT(df)), specify i with the logical expression group == 1, group by 'ID', get the sum of 't' and assign (:=) it to 'new'. Rows that do not satisfy i are left as NA by default.
library(data.table)
setDT(df)[group == 1, new := sum(t), ID]
df
# ID t group new
#1: 1 12.0 1 26
#2: 1 14.0 1 26
#3: 1 2.0 6 NA
#4: 2 0.5 7 NA
#5: 2 12.0 1 12
#6: 3 3.0 1 3
#7: 4 2.0 4 NA
data
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 4L), t = c(12,
14, 2, 0.5, 12, 3, 2), group = c(1L, 1L, 6L, 7L, 1L, 1L, 4L)),
class = "data.frame", row.names = c(NA,
-7L))
I have tried to adapt what I know about reshape() to my needs, but I cannot get it to work.
My data.frame has two sets of columns (a and b), which I want to reshape to long format separately.
It also has variables I want to keep unmodified. Like this:
id 2010a 2011a 2012a char 2010b 2011b 2012b
1 1 2 3 x 5 6 7
2 1 2 3 y 5 6 7
3 1 2 3 z 5 6 7
4 1 2 3 x 5 6 7
To this long format
id year a b char
1 2010 1 5 x
2 2010 1 5 y
3 2010 1 5 z
4 2010 1 5 x
1 2011 2 6 x
2 2011 2 6 y
3 2011 2 6 z
4 2011 2 6 x
1 2012 3 7 x
2 2012 3 7 y
3 2012 3 7 z
4 2012 3 7 x
Thank you!
A solution with tidyr:
library(tidyr)
library(dplyr)
# Exclude char as well as id from gathering so that it is carried along unchanged
dt_final <- gather(dt_initial, key = year, value = value, -id, -char) %>%
separate(col=year, into=c("year", "name"), sep=-1) %>%
spread(key = name, value = value) %>%
arrange(id, year)
What about this?
library(data.table)
data2 <- melt(setDT(data), id.vars = "id", variable.name = "year")
data2[, l := substr(year, 6,6)][, year := gsub("[a-zA-Z]", "", year)]
dcast(data2, id + year ~ l, value.var = "value")[order(year, id)]
id year a b
1: 1 2010 1 5
2: 2 2010 1 5
3: 3 2010 1 5
4: 4 2010 1 5
5: 1 2011 2 6
6: 2 2011 2 6
7: 3 2011 2 6
8: 4 2011 2 6
9: 1 2012 3 7
10: 2 2012 3 7
11: 3 2012 3 7
12: 4 2012 3 7
Data:
data <- data.frame(
id = 1:4,
`2010a` = c(1L, 1L, 1L, 1L),
`2011a` = c(2L, 2L, 2L, 2L),
`2012a` = c(3L, 3L, 3L, 3L),
`2010b` = c(5L, 5L, 5L, 5L),
`2011b` = c(6L, 6L, 6L, 6L),
`2012b` = c(7L, 7L, 7L, 7L)
)
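Since the question mentions reshape(), here is a base R sketch of that route. It lists the two column sets explicitly in varying rather than relying on reshape() to guess them from the names, which sidesteps the fact that the names start with the year. The object names wide and long are made up for illustration, and check.names = FALSE is used so the columns keep names like 2010a.

```r
wide <- data.frame(
  id = 1:4,
  `2010a` = 1, `2011a` = 2, `2012a` = 3,
  char = c("x", "y", "z", "x"),
  `2010b` = 5, `2011b` = 6, `2012b` = 7,
  check.names = FALSE
)

long <- reshape(
  wide,
  direction = "long",
  idvar = "id",
  # one vector of wide columns per long variable, in matching year order
  varying = list(c("2010a", "2011a", "2012a"),
                 c("2010b", "2011b", "2012b")),
  v.names = c("a", "b"),
  timevar = "year",
  times = 2010:2012
)

rownames(long) <- NULL
long <- long[, c("id", "year", "a", "b", "char")]
long
```

Columns that are not listed in varying, such as char, are treated as constant within each id and repeated for every year.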