Counting the number of observations row-wise using dplyr - r

I have a dataset that looks like this:
sample <- tibble(x = c (1,2,3,NA), y = c (5, NA,2, NA))
sample
# A tibble: 4 x 2
x y
<dbl> <dbl>
1 1 5
2 2 NA
3 3 2
4 NA NA
Now I want to create a new variable z that counts how many non-missing observations are in each row. For the sample dataset above, the first value of z should be 2 because both x and y have values. Similarly, for the 2nd row the value of z is 1 because one value is missing, and for the 4th row the value is 0 because there are no observations in the row.
The expected dataset looks like this -
x y z
<dbl> <dbl> <dbl>
1 1 5 2
2 2 NA 1
3 3 2 2
4 NA NA 0
I want to do this on a few selected variables, not the whole dataset.

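Using base R. The first line checks all columns and the second checks columns by name. The third adds one term per column, so it does not scale well when there are many columns; note also that is.finite() is stricter than !is.na(), since it is FALSE for Inf and -Inf as well.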
sample$z1 <- rowSums(!is.na(sample))
sample$z2 <- rowSums(!is.na(sample[c("x", "y")]))
sample$z3 <- is.finite(sample$x) + is.finite(sample$y)
> sample
# A tibble: 4 x 5
x y z1 z2 z3
<dbl> <dbl> <dbl> <dbl> <int>
1 1 5 2 2 2
2 2 NA 1 1 1
3 3 2 2 2 2
4 NA NA 0 0 0
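The third variant differs from the first two in more than scalability. As a small base R sketch (the vector `v`, the data frame `d` and the name vector `cols` are made up for illustration), `is.finite()` also rejects `Inf`, and the per-column sum can be generalised with `Reduce()` when typing one term per column becomes tedious:

```r
# Illustrative values only: an ordinary number, NA, Inf and NaN
v <- c(1, NA, Inf, NaN)
not_missing <- !is.na(v)     # NaN counts as missing, Inf does not
finite      <- is.finite(v)  # only ordinary numbers pass

# Generalising the z3 idea to any set of columns
d <- data.frame(x = c(1, 2, 3, NA), y = c(5, NA, 2, NA))
cols <- c("x", "y")          # hypothetical selection of columns
d$z <- Reduce(`+`, lapply(d[cols], function(col) !is.na(col)))
d
```

Whether `Inf` should count as an observation depends on the data, so the choice between the two predicates is a judgment call rather than a rule.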

We can use rowwise() with cur_data() (cur_data() is deprecated as of dplyr 1.1.0 in favour of pick()):
library(dplyr)
sample %>%
rowwise %>%
mutate(z = sum(!is.na(cur_data()))) %>%
ungroup
-output
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <int>
#1 1 5 2
#2 2 NA 1
#3 3 2 2
#4 NA NA 0
To restrict the count to selected columns:
sample %>%
rowwise %>%
mutate(z = sum(!is.na(select(cur_data(), x:y))))
Or with rowSums on a logical matrix
sample %>%
mutate(z = rowSums(!is.na(cur_data())))
-output
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 5 2
#2 2 NA 1
#3 3 2 2
#4 NA NA 0
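For newer dplyr versions, the same column-restricted count might be written with `pick()` instead of `cur_data()`. This is only a sketch and assumes dplyr 1.1.0 or later; the object names `dat` and `out` are illustrative, chosen to avoid masking `base::sample`:

```r
library(dplyr)

dat <- tibble(x = c(1, 2, 3, NA), y = c(5, NA, 2, NA))

# pick() selects columns inside mutate(); rowSums() then counts the
# non-missing cells of each row in one vectorised call, so no
# rowwise() grouping is needed
out <- dat %>%
  mutate(z = rowSums(!is.na(pick(x, y))))
out
```

Any tidyselect expression, such as `x:y` or `starts_with("x")`, should work inside `pick()` in place of the explicit column names.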

An example using apply() on selected columns:
set.seed(7)
vals <- sample(c(1:20, NA, NA), 20)
sample <- matrix(vals, ncol = 5)
# Select columns 1, 3, 4
cols <- c(1, 3, 4)
rowcnts <- apply(sample[ , cols], 1, function(x) length(x[!is.na(x)]))
sample <- cbind(sample, rowcnts)
> sample
rowcnts
[1,] 10 15 16 NA 12 2
[2,] 19 8 14 18 9 3
[3,] 7 17 6 4 1 3
[4,] 2 3 13 NA 5 2
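As a possible alternative for matrices, `rowSums()` gives the same counts without an explicit `apply()` loop. The matrix `m` below is made up for illustration and is not the seeded data above. Adding `drop = FALSE` keeps the subset a matrix even when only one column is selected, a case where `apply()` on the dropped vector would fail:

```r
# Illustrative 4 x 3 matrix with missing values in column 3
m <- matrix(c(10, 19,  7,  2,
              15,  8, 17,  3,
              NA, 14,  6, NA), ncol = 3)

cols <- c(1, 3)
# Count non-missing cells per row over the selected columns
cnt <- rowSums(!is.na(m[, cols, drop = FALSE]))

# Still works with a single column because the subset stays a matrix
one <- rowSums(!is.na(m[, 3, drop = FALSE]))
cnt
one
```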

Related

Dplyr recursively grow a dataframe

I have the following values
values <- seq(1,3)
I would like to have the resulting dataframe
# A tibble: 6 x 2
facet values
<dbl> <dbl>
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
The facet column states the iteration of the recursive append.
My current solution is
facet1 <- values %>% head(1) %>% tibble(facet = 1, values = .)
facet2 <- values %>% head(2) %>% tibble(facet = 2, values = .)
facet3 <- values %>% head(3) %>% tibble(facet = 3, values = .)
bind_rows(facet1, facet2, facet3)
A more general solution is needed [Edit]
The current solutions will not work for my use case because they exploit the fact that, in my previous example, the sequence within each facet equals the values.
Here is a more general reproducible example where the values differ from the facets.
set.seed(42)
values <- rnorm(3,0,.2)
df_recursive <- tribble(~facet, ~values,
1, 0.27,
2, 0.27,
2, -0.11,
3, 0.27,
3, -0.11,
3, 0.07)
# A tibble: 6 x 2
facet values
<dbl> <dbl>
1 1 0.27
2 2 0.27
3 2 -0.11
4 3 0.27
5 3 -0.11
6 3 0.07
Here is an option:
library(tidyverse)
values <- seq(1,3)
map_dfr(values, ~tibble(facet = .x, values = 1:.x))
#> # A tibble: 6 x 2
#> facet values
#> <int> <int>
#> 1 1 1
#> 2 2 1
#> 3 2 2
#> 4 3 1
#> 5 3 2
#> 6 3 3
EDIT:
These approaches can all be adapted to your use case. For example:
set.seed(42)
values <- rnorm(3,0,.2)
map_dfr(seq_along(values), ~tibble(facet = .x, values = values[1:.x]))
#> # A tibble: 6 x 2
#> facet values
#> <int> <dbl>
#> 1 1 0.274
#> 2 2 0.274
#> 3 2 -0.113
#> 4 3 0.274
#> 5 3 -0.113
#> 6 3 0.0726
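One possible solution (from dplyr 1.1.0 on, summarise() warns when a group returns more than one row, and reframe() is the replacement):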
library(dplyr)
tibble(facet=1:3) %>%
group_by(facet) %>%
summarise(values = seq_len(facet)) %>%
ungroup()
# A tibble: 6 x 2
facet values
<int> <int>
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
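A sketch of the same idea with `reframe()`, adapted to arbitrary values. It assumes dplyr 1.1.0 or later (for `reframe()` and its `.by` argument); the vector name `vals` is illustrative and is kept distinct from the output column `values` to avoid ambiguity inside the expression:

```r
library(dplyr)

set.seed(42)
vals <- rnorm(3, 0, .2)

# One group per facet; each group returns the first `facet` elements
# of vals, so facet k contributes k rows
out <- tibble(facet = seq_along(vals)) %>%
  reframe(values = vals[seq_len(facet)], .by = facet)
out
```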
Or with data.table:
library(data.table)
data.table(facet=1:3)[, .(values = seq_len(facet)), by=facet]
facet values
<int> <int>
1: 1 1
2: 2 1
3: 2 2
4: 3 1
5: 3 2
6: 3 3
A base R solution, toying around with rep and sequence:
v <- seq(1, 3)
data.frame(facet = rep(v, v), values = sequence(v))
facet values
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
This can be adapted to any vector:
set.seed(42)
values <- rnorm(3, 0, .2)
v <- seq_along(values)
data.frame(facet = rep(v, v), values = values[sequence(v)])
facet values
1 1 0.27419169
2 2 0.27419169
3 2 -0.11293963
4 3 0.27419169
5 3 -0.11293963
6 3 0.07262568
And a base R version of AndS.'s solution:
do.call(rbind.data.frame, lapply(v, \(x) data.frame(facet = x, values = seq(x))))

How to conditionally update a R tibble using multiple conditions of another tibble

I have two tables. I would like to update the first table using a second table using multiple conditions. In base R I would use if...else type constructs to do this but would like to know how to achieve this using dplyr.
The table to be updated (have a field added) looks like this:
> Intvs
# A tibble: 12 x 3
Group From To
<chr> <dbl> <dbl>
1 A 0 1
2 A 1 2
3 A 2 3
4 A 3 4
5 A 4 5
6 A 5 6
7 B 0 1
8 B 1 2
9 B 2 3
10 B 3 4
11 B 4 5
12 B 5 6
The tibble that I would like to use to make the update looks like this:
> Zns
# A tibble: 2 x 4
Group From To Zone
<chr> <chr> <dbl> <dbl>
1 A X 1 5
2 B Y 3 4
I would like to update the Intvs tibble from the Zns tibble, matching rows where Group is equal, From >= From, and To <= To. The expected output should look like this:
> Intvs
# A tibble: 12 x 4
Group From To Zone
<chr> <dbl> <dbl> <chr>
1 A 0 1 NA
2 A 1 2 X
3 A 2 3 X
4 A 3 4 X
5 A 4 5 X
6 A 5 6 NA
7 B 0 1 NA
8 B 1 2 NA
9 B 2 3 NA
10 B 3 4 Y
11 B 4 5 NA
12 B 5 6 NA
What is the most efficient way to do this using dplyr?
The code below should make the dummy tables Intvs and Zns
# load packages
require(tidyverse)
# Intervals table
a <- c(rep("A", 6), rep("B", 6))
b <- c(seq(0,5,1), seq(0,5,1) )
c <- c(seq(1,6,1), seq(1,6,1))
Intvs <- bind_cols(a, b, c)
names(Intvs) <- c("Group", "From", "To")
# Zones table
a <- c("A", "B")
b <- c("X", "Y")
c <- c(1, 3)
d <- c(5, 4)
Zns <- bind_cols(a, b, c, d)
names(Zns) <- c("Group", "From", "To", "Zone")
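Using a non-equi join from data.table (this assumes the Zns columns are named Group, Zone, From, To, so that From and To are numeric):
library(data.table)
# i.Zone refers explicitly to the Zone column of Zns
setDT(Intvs)[Zns, Zone := i.Zone, on = .(Group, From >= From, To <= To)]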
-output
> Intvs
Group From To Zone
<char> <num> <num> <char>
1: A 0 1 <NA>
2: A 1 2 X
3: A 2 3 X
4: A 3 4 X
5: A 4 5 X
6: A 5 6 <NA>
7: B 0 1 <NA>
8: B 1 2 <NA>
9: B 2 3 <NA>
10: B 3 4 Y
11: B 4 5 <NA>
12: B 5 6 <NA>
This is the closest I could get. It does not give the expected output:
library(dplyr)
left_join(Intvs, Zns, by="Group") %>%
group_by(Group) %>%
mutate(Zone1 = case_when(From.x <= Zone & From.x >= To.y ~ From.y)) %>%
select(Group, From=From.x, To=To.x, Zone = Zone1)
Group From To Zone
<chr> <dbl> <dbl> <chr>
1 A 0 1 NA
2 A 1 2 X
3 A 2 3 X
4 A 3 4 X
5 A 4 5 X
6 A 5 6 X
7 B 0 1 NA
8 B 1 2 NA
9 B 2 3 NA
10 B 3 4 Y
11 B 4 5 Y
12 B 5 6 NA
Not sure why your first row does not give NA, since 0 - 1 is not in the range of 1 - 5.
First left_join the two dataframes on the Group column. Here I assign the suffix "_Zns" to columns coming from the Zns dataframe. Then use a single case_when (or ifelse) statement to assign NA to rows that do not fit the range. Finally, drop the columns that end with "Zns".
library(dplyr)
left_join(Intvs, Zns, by = "Group", suffix = c("", "_Zns")) %>%
mutate(Zone = case_when(From >= From_Zns & To <= To_Zns ~ Zone,
TRUE ~ NA_character_)) %>%
select(-ends_with("Zns"))
# A tibble: 12 × 4
Group From To Zone
<chr> <dbl> <dbl> <chr>
1 A 0 1 NA
2 A 1 2 X
3 A 2 3 X
4 A 3 4 X
5 A 4 5 X
6 A 5 6 NA
7 B 0 1 NA
8 B 1 2 NA
9 B 2 3 NA
10 B 3 4 Y
11 B 4 5 NA
12 B 5 6 NA
Data
Note that I have changed your column name order in the Zns dataframe.
a <- c(rep("A", 6), rep("B", 6))
b <- c(seq(0,5,1), seq(0,5,1) )
c <- c(seq(1,6,1), seq(1,6,1))
Intvs <- bind_cols(a, b, c)
names(Intvs) <- c("Group", "From", "To")
# Zones table
a <- c("A", "B")
b <- c("X", "Y")
c <- c(1, 3)
d <- c(5, 4)
Zns <- bind_cols(a, b, c, d)
colnames(Zns) <- c("Group", "Zone", "From", "To")
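With dplyr 1.1.0 or later, the range condition can probably be moved into the join itself with `join_by()`, which avoids building every Group-by-Group combination first. The following is a sketch under that version assumption; `Zns2`, `zFrom` and `zTo` are hypothetical names introduced so the bound columns are unambiguous after the join:

```r
library(dplyr)

Intvs <- tibble(Group = rep(c("A", "B"), each = 6),
                From  = rep(seq(0, 5, 1), 2),
                To    = rep(seq(1, 6, 1), 2))
Zns <- tibble(Group = c("A", "B"), Zone = c("X", "Y"),
              From = c(1, 3), To = c(5, 4))

# Rename the zone bounds so they cannot clash with Intvs columns
Zns2 <- rename(Zns, zFrom = From, zTo = To)

# Non-equi left join: same Group, interval fully inside the zone.
# Rows of Intvs with no matching zone are kept with Zone = NA.
out <- left_join(Intvs, Zns2,
                 by = join_by(Group, From >= zFrom, To <= zTo)) %>%
  select(-zFrom, -zTo)
out
```

If zones within one group can overlap, an interval may match more than one zone and appear on several rows, so that case would need an extra rule for picking one.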

Roll max in R. From first row to current row

I would like to calculate max value from first row to current row
df <- data.frame(id = c(1,1,1,1,2,2,2), value = c(2,5,3,2,4,5,4), result = c(NA,2,5,5,NA,4,5))
I have tried grouping by id with dplyr and using the rollmax function from zoo, but did not succeed.
1) rollmax is used with a fixed width but here we have a variable width so using rollapplyr, which seems close to the approach of the question, we have:
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
mutate(out = lag(rollapplyr(value, 1:n(), max))) %>%
ungroup
giving:
# A tibble: 7 x 4
id value result out
<dbl> <dbl> <dbl> <dbl>
1 1 2 NA NA
2 1 5 2 2
3 1 3 5 5
4 1 2 5 5
5 2 4 NA NA
6 2 5 4 4
7 2 4 5 5
2) It is also possible to perform the grouping via the width (second) argument of rollapplyr like this eliminating dplyr. In this case the widths are 1, 2, 3, 4, 1, 2, 3 and Max is like max except it does not use the last element of its argument x. (An alternate expression for the width would be seq_along(id) - match(id, id) + 1).
library(zoo)
Max <- function(x) if (length(x) == 1) NA else max(head(x, -1))
transform(df, out = rollapplyr(value, sequence(rle(id)$lengths), Max))
giving:
id value result out
1 1 2 NA NA
2 1 5 2 2
3 1 3 5 5
4 1 2 5 5
5 2 4 NA NA
6 2 5 4 4
7 2 4 5 5
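The two width expressions mentioned in 2) can be checked in base R alone. This small sketch uses the `id` vector from the question; the names `w1` and `w2` are illustrative. Both expressions assume that equal ids sit in contiguous runs:

```r
id <- c(1, 1, 1, 1, 2, 2, 2)

# Run lengths of id are 4 and 3; sequence() counts 1..n within each run
w1 <- sequence(rle(id)$lengths)

# Position in the vector minus the first position of the same id, plus 1
w2 <- seq_along(id) - match(id, id) + 1

w1
w2
```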
A data.table option using shift + cummax
> setDT(df)[, result2 := shift(cummax(value)), id][]
id value result result2
1: 1 2 NA NA
2: 1 5 2 2
3: 1 3 5 5
4: 1 2 5 5
5: 2 4 NA NA
6: 2 5 4 4
7: 2 4 5 5
A dplyr option using lag + cummax
library(dplyr)
df |>
group_by(id) |>
mutate(result = lag(cummax(value)))
# A tibble: 7 x 3
# Groups: id [2]
# id value result
# <dbl> <dbl> <dbl>
# 1 1 2 NA
# 2 1 5 2
# 3 1 3 5
# 4 1 2 5
# 5 2 4 NA
# 6 2 5 4
# 7 2 4 5
Here is a base R solution. This would just get you the cumulative maximum (note the grouping column is df$id; df$i only works through partial matching on a plain data.frame):
df$result = ave(df$value, df$id, FUN=cummax)
To get the cumulative maximum with the lag you wanted:
df$result = ave(df$value, df$id, FUN=function(x) c(NA,cummax(x[-(length(x))])))
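A self-contained version of the lagged variant, written so it can be compared against the expected `result` column from the question. The helper name `lag_cummax` and the column `out` are made up for this sketch:

```r
df <- data.frame(id     = c(1, 1, 1, 1, 2, 2, 2),
                 value  = c(2, 5, 3, 2, 4, 5, 4),
                 result = c(NA, 2, 5, 5, NA, 4, 5))

# Cumulative max shifted down by one: the first row of each group has
# no earlier rows, so it gets NA
lag_cummax <- function(x) c(NA, head(cummax(x), -1))

# ave() applies the helper within each id and returns a vector
# aligned with the original rows
df$out <- ave(df$value, df$id, FUN = lag_cummax)
df
```

Writing to a separate column keeps the expected values intact, which makes the comparison with `result` straightforward.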

Unnest a data frame and fill new rows with NAs

Let's say I have a nested df, and I want to unnest the columns:
df <- tibble::tribble(
~x, ~y, ~nestdf,
1, 2, tibble::tibble(a=1:2, b=3:4),
3, 4, tibble::tibble(a=3:5, b=5:7)
)
tidyr::unnest(df, nestdf)
# x y a b
# <dbl> <dbl> <int> <int>
#1 1 2 1 3
#2 1 2 2 4
#3 3 4 3 5
#4 3 4 4 6
#5 3 4 5 7
The result has the x and y columns extended to match the dimensions of nestdf, with the new rows using the existing values. However, I want the new rows to contain NA, like so:
# x y a b
# <dbl> <dbl> <int> <int>
#1 1 2 1 3
#2 NA NA 2 4
#3 3 4 3 5
#4 NA NA 4 6
#5 NA NA 5 7
Is it possible to do this with unnest? Either the first or last row for each group can be kept as non-NA, I don't mind.
Repeating rows, and binding with an unnest of the nested list column(s):
library(tidyr)
nr <- sapply(df$nestdf, nrow) - 1
cbind(
df[rep(rbind(seq_along(nr), NA), rbind(1, nr)), c("x","y")],
unnest(df["nestdf"], cols=everything())
)
# x y a b
#1 1 2 1 3
#2 NA NA 2 4
#3 3 4 3 5
#4 NA NA 4 6
#5 NA NA 5 7
One way would be to change the duplicates to NA.
df1 <- tidyr::unnest(df, nestdf)
cols <- c('x', 'y')
df1[duplicated(df1[cols]), cols] <- NA
df1
# x y a b
# <dbl> <dbl> <int> <int>
#1 1 2 1 3
#2 NA NA 2 4
#3 3 4 3 5
#4 NA NA 4 6
#5 NA NA 5 7
If the values in columns x and y can repeat you can create a row number to identify them uniquely -
library(dplyr)
library(tidyr)
df1 <- df %>% mutate(row = row_number()) %>% unnest(nestdf)
cols <- c('x', 'y', 'row')
df1[duplicated(df1[cols]), cols] <- NA
df1 <- select(df1, -row)
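The row-number idea can also be kept inside a single pipeline by blanking everything after the first row of each original row group. This is a sketch, assuming dplyr 1.0.0 or later for `across()`; the helper column name `row` follows the snippet above and `out` is illustrative:

```r
library(dplyr)
library(tidyr)

df <- tibble(x = c(1, 3), y = c(2, 4),
             nestdf = list(tibble(a = 1:2, b = 3:4),
                           tibble(a = 3:5, b = 5:7)))

out <- df %>%
  mutate(row = row_number()) %>%   # remember the original row
  unnest(nestdf) %>%
  group_by(row) %>%
  # keep x and y only on the first unnested row of each original row
  mutate(across(c(x, y), ~ replace(.x, seq_along(.x) > 1, NA))) %>%
  ungroup() %>%
  select(-row)
out
```

Because the grouping uses the row id rather than the values, repeated (x, y) pairs in different original rows are handled the same way as unique ones.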
You could convert x and y to lists first:
library(tidyverse)
df <- tibble::tribble(
~x, ~y, ~nestdf,
1, 2, tibble::tibble(a=1:2, b=3:4),
3, 4, tibble::tibble(a=3:5, b=5:7)
)
df %>%
mutate_at(vars(x:y), ~map2(., nestdf, ~.x[seq(nrow(.y))])) %>%
unnest(everything())
# A tibble: 5 x 4
#      x     y     a     b
#  <dbl> <dbl> <int> <int>
#1     1     2     1     3
#2    NA    NA     2     4
#3     3     4     3     5
#4    NA    NA     4     6
#5    NA    NA     5     7

How to combine the values of various columns in a tibble by the same row ID

So I have a tibble (data frame) like this (the actual data frame has 100+ rows)
sample_ID <- c(1, 2, 2, 3)
A <- c(NA, NA, 1, 3)
B <- c(1, 2, NA, 1)
C <- c(5, 1, NA, 2)
D <- c(NA, NA, 3, 1)
tibble(sample_ID,A,B,C,D)
# which reads
# A tibble: 4 × 5
sample_ID A B C D
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA 1 5 NA
2 2 NA 2 1 NA
3 2 1 NA NA 3
4 3 3 1 2 1
As can be seen here, the second and third rows have the same sample ID. I want to combine these two rows so that the tibble looks like
# A tibble: 3 × 5
sample_ID A B C D
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA 1 5 NA
2 2 1 2 1 3
3 3 3 1 2 1
In other words, I want the rows for sample_ID to be unique (order doesn't matter), and the values of other columns are merged (overwrite NA when possible). Can this be achieved in a simple way, such as using gather and spread? Many thanks.
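We can use summarise_each after grouping by 'sample_ID' (summarise_each() and funs() are deprecated in current dplyr, where across() is the replacement)
library(dplyr)
# the question builds the tibble without assigning it, so name it df here
df <- tibble(sample_ID, A, B, C, D)
df %>%
group_by(sample_ID) %>%
summarise_each(funs(na.omit))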
# A tibble: 3 × 5
# sample_ID A B C D
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 NA 1 5 NA
#2 2 1 2 1 3
#3 3 3 1 2 1
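In current dplyr the same merge might be sketched with `across()`. The helper `first_non_na` is a made-up name; it returns the first non-missing value of a column within a group, or NA when the whole group is missing, which sidesteps the zero-length result that `na.omit()` gives for an all-NA group. It assumes each ID has at most one real value per column, as in the example:

```r
library(dplyr)

df <- tibble(sample_ID = c(1, 2, 2, 3),
             A = c(NA, NA, 1, 3),
             B = c(1, 2, NA, 1),
             C = c(5, 1, NA, 2),
             D = c(NA, NA, 3, 1))

# Indexing an empty vector with [1] yields NA, so all-NA groups stay NA
first_non_na <- function(x) x[!is.na(x)][1]

out <- df %>%
  group_by(sample_ID) %>%
  summarise(across(everything(), first_non_na))
out
```

If an ID can carry conflicting non-missing values in the same column, this keeps only the first one, so such cases would need a different rule.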
