I would like to find a dplyr way to take the average of the next 3 rows. Say I have a data frame:
data <- structure(list(x = 1:6, y = c(32.1056789265246, 3.48493686329687, 8.21300282100191, 6.72266588891445, 27.7353607044612, 18.5963631547696)), .Names = c("x", "y"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))
# A tibble: 6 × 2
      x         y
  <int>     <dbl>
1     1 32.105679
2     2  3.484937
3     3  8.213003
4     4  6.722666
5     5 27.735361
6     6 18.596363
I want to generate a new data frame that has 3 rows: the first holding the average of rows 2, 3 and 4, the next of rows 3, 4 and 5, and the last of rows 4, 5 and 6.
A for loop is probably the easiest way, but I would appreciate a more elegant dplyr way to go about it. Thanks!
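For example, the first value would be the mean of rows 2-4 of y:
mean(data$y[2:4])
# [1] 6.140202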
You can use the rollmean() function from the zoo package with lapply to loop through the columns; remove the first row if you don't need it:
library(zoo)
as.data.frame(lapply(data, rollmean, 3))
# x y
#1 2 14.601206
#2 3 6.140202
#3 4 14.223676
#4 5 17.684797
If you don't need the first row:
as.data.frame(lapply(data[-1,], rollmean, 3))
# x y
#1 3 6.140202
#2 4 14.223676
#3 5 17.684797
You can use the RcppRoll package to do that as follows:
require(RcppRoll)
roll_mean(data$y[-1], 3) ## 6.140202 14.223676 17.684797
As I am not sure what output you are looking for, you could do:
require(dplyr)
data %>%
  mutate(rmean = roll_meanl(y, 3)) %>%
  filter(between(x, 2, 4)) %>%
  select(-y)
Which results in:
# A tibble: 3 × 2
x rmean
<int> <dbl>
1 2 6.140202
2 3 14.223676
3 4 17.684797
Given that you asked specifically about dplyr, you could try this:
library(dplyr)
data %>%
  mutate(av3 = (lead(y, n = 1L) + lead(y, n = 2L) + lead(y, n = 3L)) / 3)
Which creates:
# A tibble: 6 × 3
x y av3
<int> <dbl> <dbl>
1 1 32.105679 6.140202
2 2 3.484937 14.223676
3 3 8.213003 17.684797
4 4 6.722666 NA
5 5 27.735361 NA
6 6 18.596363 NA
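If you only want the three complete averages (rows 2-4, 3-5 and 4-6 of y), a small follow-up sketch on the same idea drops the incomplete windows afterwards:
data %>%
  mutate(av3 = (lead(y, 1) + lead(y, 2) + lead(y, 3)) / 3) %>%
  filter(!is.na(av3)) %>%
  select(x, av3)
# A tibble: 3 × 2
#       x       av3
#   <int>     <dbl>
# 1     1  6.140202
# 2     2 14.223676
# 3     3 17.684797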
I have the following data frame:
df <- data.frame(category = c("a","b","b","b","b","a","c"), value = c(1,5,3,6,7,4,6))
and I want to record the number of occurrences of each category, so the output would be:
df <- data.frame(category = c("a","b","b","b","b","a","c"), value = c(1,5,3,6,7,4,6),
                 category_count = c(2,4,4,4,4,2,1))
Is there a simple way to do this?
# load package
library(data.table)
# set as data.table
setDT(df)
# count by category
df[, category_count := .N, category]
With dplyr:
library(dplyr)
df %>%
  group_by(category) %>%
  mutate(category_count = n()) %>%
  ungroup()
# A tibble: 7 × 3
category value category_count
<chr> <dbl> <int>
1 a 1 2
2 b 5 4
3 b 3 4
4 b 6 4
5 b 7 4
6 a 4 2
7 c 6 1
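For what it's worth, dplyr also ships add_count(), which collapses the group_by()/mutate()/ungroup() sequence into one call (a minimal equivalent, assuming a dplyr version where the name argument is available):
df %>% add_count(category, name = "category_count")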
base
df <- data.frame(category = c("a","b","b","b","b","a","c"), value = c(1,5,3,6,7,4,6),
                 category_count = c(2,4,4,4,4,2,1))
df$res <- with(df, ave(x = seq(nrow(df)), list(category), FUN = length))
df
#>   category value category_count res
#> 1 a 1 2 2
#> 2 b 5 4 4
#> 3 b 3 4 4
#> 4 b 6 4 4
#> 5 b 7 4 4
#> 6 a 4 2 2
#> 7 c 6 1 1
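Another base option (my addition, not from the original answer) is a table() lookup, which avoids ave():
df$category_count <- as.vector(table(df$category)[df$category])
df$category_count
#> [1] 2 4 4 4 4 2 1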
I have several columns in an R data.frame, and I want to create a new column based on ranges of values from an existing column. The ranges are irregular and are determined by the start and end values stored in the first two columns. I want the calculation to stay vectorized; I don't want a for loop underneath.
The required result, achieved here with a for loop:
df = data.frame(start=c(2,1,4,4,1), end=c(3,3,5,4,2), values=c(1:5))
for (i in 1:nrow(df)) {
  df[i, 'new'] <- sum(df[df[i, 'start']:df[i, 'end'], 'values'])
}
df
Here is a base R one-liner.
mapply(function(x1, x2, y){sum(y[x1:x2])}, df[['start']], df[['end']], MoreArgs = list(y = df[['values']]))
#[1] 5 6 9 4 3
And another one.
sapply(seq_len(nrow(df)), function(i) sum(df[['values']][df[i, 'start']:df[i, 'end']]))
#[1] 5 6 9 4 3
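Since the question explicitly asks for a vectorized calculation, one more option (my addition, not from the original answers): because each new value is a sum over a contiguous range, a prefix-sum lookup avoids any per-row loop or apply. A sketch, assuming start and end are valid row indices:
cs <- c(0, cumsum(df$values))            # prefix sums with a leading zero
df$new <- cs[df$end + 1] - cs[df$start]  # sum(values[start:end]) for each row
df$new
#[1] 5 6 9 4 3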
Here is an option with map2:
library(purrr)
library(dplyr)
df %>%
  mutate(new = map2_dbl(start, end, ~ sum(values[.x:.y])))
-output
# start end values new
#1 2 3 1 5
#2 1 3 2 6
#3 4 5 3 9
#4 4 4 4 4
#5 1 2 5 3
Or with rowwise
df %>%
  rowwise %>%
  mutate(new = sum(.$values[start:end])) %>%
  ungroup
-output
# A tibble: 5 x 4
# start end values new
# <dbl> <dbl> <int> <int>
#1 2 3 1 5
#2 1 3 2 6
#3 4 5 3 9
#4 4 4 4 4
#5 1 2 5 3
Or using data.table
library(data.table)
setDT(df)[, new := sum(df$values[start:end]), seq_len(nrow(df))]
I want to create a new column that labels each unique combination of values across x, y, z columns. My current work-around to achieve that is this:
> library(tidyverse)
>
> set.seed(100)
> df = tibble(x = sample.int(5, 50, replace = T), y = sample.int(5, 50, replace = T), z = sample.int(5, 50, replace = T))
> df
# A tibble: 50 x 3
x y z
<int> <int> <int>
1 2 4 4
2 3 4 4
3 1 3 5
4 2 1 4
5 4 2 5
6 4 5 2
7 2 3 4
8 3 5 4
9 2 4 1
10 5 5 2
# … with 40 more rows
>
> df2 = df %>% distinct(x,y,z) %>% rowid_to_column("unique_id") %>% left_join(df)
Joining, by = c("x", "y", "z")
> df2
# A tibble: 50 x 4
unique_id x y z
<int> <int> <int> <int>
1 1 2 4 4
2 2 3 4 4
3 3 1 3 5
4 4 2 1 4
5 4 2 1 4
6 5 4 2 5
7 5 4 2 5
8 6 4 5 2
9 6 4 5 2
10 7 2 3 4
# … with 40 more rows
What is a better/more efficient way to do this on a fairly large dataset? I'd like to stay within the tidyverse but am also open to other suggestions.
You could use rleidv from data.table
df$unique_id <- data.table::rleidv(df)
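One caveat worth noting (my note, not the original answer's): rleidv() assigns run-length ids, so identical combinations only share an id when they appear in adjacent rows; non-adjacent duplicates get distinct ids. A tiny illustration:
library(data.table)
d <- data.frame(x = c(1, 2, 1))
rleidv(d)
#[1] 1 2 3   # rows 1 and 3 are identical but get different ids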
In dplyr, we can use the group_indices function for this purpose, which generates a unique id for each group of values (it has since been superseded by cur_group_id()).
library(dplyr)
df %>% mutate(unique_id = group_indices(., x, y, z))
In more recent versions of dplyr (1.0.0+), we can use cur_group_id:
library(dplyr)
df %>%
  group_by_all() %>%
  mutate(unique_id = cur_group_id())
Or using .GRP from data.table
library(data.table)
setDT(df)[, unique_id := .GRP, names(df)]
How can I get a dense rank of multiple columns in a dataframe? For example,
# I have:
df <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
                 y = c(1,2,3,4,2,2,2,1,2,3))
# I want:
res <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
                  y = c(1,2,3,4,2,2,2,1,2,3),
                  r = c(1,2,3,4,5,5,5,6,7,8))
res
x y r
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 2 2 5
6 2 2 5
7 2 2 5
8 3 1 6
9 3 2 7
10 3 3 8
My hack approach works for this particular dataset:
df %>%
  arrange(x, y) %>%
  mutate(r = if_else(y - lag(y, default = 0) == 0, 0, 1)) %>%
  mutate(r = cumsum(r))
But there must be a more general solution, maybe using functions like dense_rank() or row_number(); I'm struggling to find it. dplyr solutions are ideal.
Right after posting, I think I found a solution here. In my case, it would be:
mutate(df, r = dense_rank(interaction(x,y,lex.order=T)))
But if you have a better solution, please share.
data.table
data.table has you covered with frank().
library(data.table)
frank(df, x, y, ties.method = 'dense')
[1] 1 2 3 4 5 5 5 6 7 8
You can run df$r <- frank(df, x, y, ties.method = 'dense') to add it as a new column.
tidyr/dplyr
Another option (though clunkier) is to use tidyr::unite to collapse your columns into one, plus dplyr::dense_rank.
library(tidyverse)
df %>%
  # add a single column with all the info
  unite(xy, x, y) %>%
  cbind(df) %>%
  # dense rank on that
  mutate(r = dense_rank(xy)) %>%
  # now drop the helper col
  select(-xy)
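One thing to watch with this approach (my note): unite() produces a character key, so dense_rank() ranks lexicographically. That is fine for single-digit values like these, but multi-digit values can order unexpectedly:
sort(c("2_1", "10_1"))
#[1] "10_1" "2_1"   # "10_1" ranks before "2_1"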
You can use cur_group_id:
library(dplyr)
df %>%
  group_by(x, y) %>%
  mutate(r = cur_group_id())
# x y r
# <dbl> <dbl> <int>
# 1 1 1 1
# 2 1 2 2
# 3 1 3 3
# 4 1 4 4
# 5 2 2 5
# 6 2 2 5
# 7 2 2 5
# 8 3 1 6
# 9 3 2 7
# 10 3 3 8
I have two data frames. Data frame A has many observations/rows, an ID for each observation, and many additional columns. For a subset of observations X, the values for a set of columns are missing/NA. Data frame B contains a subset of the observations in X (which can be matched across data frames using the ID) and variables with the same names as in data frame A, containing the values needed to replace the missing/NA values in that set of columns.
My code below (using a join operation) merely adds columns rather than replacing missing values: for each of the shared variables (let's name them W) in B, the resulting table contains both W.x and W.y.
library(dplyr)
foo <- data.frame(id = seq(1:6), x = c(NA, NA, NA, 1, 3, 8), z = seq_along(10:15))
bar <- data.frame(id = seq(1:2), x = c(10, 9))
dplyr::left_join(x = foo, y = bar, by = "id")
I am trying to replace the missing values in A with the values in B based on the ID, and to do so efficiently, since I have many columns and many rows. My goal is this:
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
One thought was to use ifelse() after joining, but typing out ifelse() calls for all of the variables is not feasible. Is there a way to do this simply without the database join, or a way to apply a function across all columns ending in .x, replacing each missing .x value with the corresponding .y value?
Another attempt, which should essentially be only one assignment operation. Using #alistaire's data again:
vars <- c("x","y")
foo[vars] <- Map(pmax, foo[vars], bar[match(foo$id, bar$id), vars], na.rm=TRUE)
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
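A caveat worth flagging (my note, not part of the original answer): pmax() keeps the larger value, so an existing non-NA entry in foo can still be overwritten when the matching bar value is bigger. For a strict NA-only fill, prefer the match()/replace() or coalesce() approaches on this page:
pmax(c(1, NA), c(5, 7), na.rm = TRUE)
#[1] 5 7   # the non-NA 1 was replaced by the larger 5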
EDIT
Updating the answer to use #alistaire's example data frame.
We can extend the same answer given below using mapply so that it can handle multiple columns for both foo and bar.
First, find the common columns between the two data frames and sort them so they are in the same order:
vars <- sort(intersect(names(foo), names(bar))[-1])
foo[vars] <- mapply(function(x, y) {
  ind = is.na(x)
  replace(x, ind, y[match(foo$id[ind], bar$id)])
}, foo[vars], bar[vars])
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
Original Answer
I think this does what you are looking for:
foo[-1] <- sapply(foo[-1], function(x) {
  ind = is.na(x)
  replace(x, ind, bar$x[match(foo$id[ind], bar$id)])
})
foo
# id x z
#1 1 10 1
#2 2 9 2
#3 3 NA 3
#4 4 1 4
#5 5 3 5
#6 6 8 6
For every column (except id), we find the missing values in foo and replace them with the corresponding values from bar.
If you don't mind a verbose base R approach, you can accomplish this using merge() and careful subsetting of your data frame.
df <- merge(foo, bar, by="id", all.x=TRUE)
names(df) <- c("id", "x", "z", "y")
df$x[is.na(df$x)] <- df$y[is.na(df$x)]
df <- df[c("id", "x", "z")]
> df
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
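A small variation (my tweak, same idea) avoids hardcoding the full names() vector, which silently depends on column order; the suffixes argument to merge() renames only the incoming column:
df <- merge(foo, bar, by = "id", all.x = TRUE, suffixes = c("", ".bar"))
df$x[is.na(df$x)] <- df$x.bar[is.na(df$x)]
df$x.bar <- NULL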
You can iterate dplyr::coalesce over the intersection of the non-grouping columns. It's not elegant, but it should scale reasonably well:
library(tidyverse)
foo <- data.frame(id = seq(1:6),
                  x = c(NA, NA, NA, 1, 3, 8),
                  y = 1:6, # add extra shared variable
                  z = seq_along(10:15))
bar <- data.frame(id = seq(1:2),
                  y = c(1L, NA),
                  x = c(10, 9))
# names of non-grouping variables in both
vars <- intersect(names(foo), names(bar))[-1]
foobar <- left_join(foo, bar, by = 'id')
foobar <- vars %>%
  map(paste0, c('.x', '.y')) %>%    # make list of columns to coalesce
  map(~foobar[.x]) %>%              # for each set, subset foobar to a two-column data.frame
  invoke_map(.f = coalesce) %>%     # ...and coalesce it into a vector
  set_names(vars) %>%               # add names to list elements
  bind_cols(foobar) %>%             # bind into data.frame and cbind to foobar
  select(union(names(foo), names(bar)))  # drop duplicated columns
foobar
#> # A tibble: 6 x 4
#> id x y z
#> <int> <dbl> <int> <int>
#> 1 1 10 1 1
#> 2 2 9 2 2
#> 3 3 NA 3 3
#> 4 4 1 4 4
#> 5 5 3 5 5
#> 6 6 8 6 6
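For what it's worth (my addition, not part of the original answer): dplyr 1.0.0 and later ship rows_patch(), which does exactly this kind of keyed NA-fill in one call, provided bar's columns are a subset of foo's and every id in bar also appears in foo:
library(dplyr)
foo %>% rows_patch(bar, by = "id")
#>   id  x y z
#> 1  1 10 1 1
#> 2  2  9 2 2
#> 3  3 NA 3 3
#> 4  4  1 4 4
#> 5  5  3 5 5
#> 6  6  8 6 6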