Get a value based on the value of another column in R - dplyr - r

I have this data frame:
df <- data.frame(month = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
day = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
flow = c(2,5,7,8,5,4,6,7,9,2,NA,1,6,10,2,NA,NA,NA,NA,NA))
and I want to get this result:
month day flow dayofminflow
1 1 1 2 1
2 1 2 5 1
3 1 3 7 1
4 1 4 8 1
5 1 5 5 1
6 2 1 4 5
7 2 2 6 5
8 2 3 7 5
9 2 4 9 5
10 2 5 2 5
11 3 1 NA 2
12 3 2 1 2
13 3 3 6 2
14 3 4 10 2
15 3 5 2 2
16 4 1 NA NA
17 4 2 NA NA
18 4 3 NA NA
19 4 4 NA NA
20 4 5 NA NA
I was using this solution, but it returns NA whenever a month contains an NA (min(flow) is then NA as well):
newdf <- df %>% group_by(month) %>% mutate(Val=day[flow==min(flow)][1])
And this solution returns an error when all data is NA:
library(dplyr)
df <- df %>%
group_by(month) %>%
mutate(dayminflowofthemonth = day[which.min(flow)]) %>%
ungroup()

You could just set na.rm = TRUE in min() (the default is FALSE) in the first solution to ignore NAs:
df %>%
group_by(month) %>%
mutate(dayofminflow = day[which(min(flow, na.rm = TRUE) == flow)][1])
# A tibble: 20 x 4
# Groups: month [4]
month day flow dayofminflow
<dbl> <dbl> <dbl> <dbl>
1 1 1 2 1
2 1 2 5 1
3 1 3 7 1
4 1 4 8 1
5 1 5 5 1
6 2 1 4 5
7 2 2 6 5
8 2 3 7 5
9 2 4 9 5
10 2 5 2 5
11 3 1 NA 2
12 3 2 1 2
13 3 3 6 2
14 3 4 10 2
15 3 5 2 2
16 4 1 NA NA
17 4 2 NA NA
18 4 3 NA NA
19 4 4 NA NA
20 4 5 NA NA
Note that you still get the warning "no non-missing arguments to min; returning Inf" for month 4, since all of its flow values are NA.
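If the warning for all-NA months is a problem, one option is to check for that case explicitly and use which.min(), which skips NAs on its own. This is only a sketch of an alternative, not part of the answer above; the result name res is made up for illustration.

```r
library(dplyr)

df <- data.frame(
  month = rep(c(1, 2, 3, 4), each = 5),
  day   = rep(c(1, 2, 3, 4, 5), times = 4),
  flow  = c(2, 5, 7, 8, 5, 4, 6, 7, 9, 2, NA, 1, 6, 10, 2, NA, NA, NA, NA, NA)
)

res <- df %>%
  group_by(month) %>%
  mutate(
    # which.min() ignores NAs but returns integer(0) when every value is NA,
    # so the all-NA case is handled first and yields NA for the whole month.
    dayofminflow = if (all(is.na(flow))) NA_real_ else day[which.min(flow)]
  ) %>%
  ungroup()

res
```

Like the `[1]` in the answer above, which.min() picks the first day when the minimum occurs more than once.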

Related

Creating an indexed column in R, grouped by user_id, and not increase when NA

I want to create a column (in R) that indexes the presence of a number in another column grouped by a user_id column. And when the other column is NA, the new desired column should not increase.
The example should bring clarity.
I have this df:
data <- data.frame(user_id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
one=c(1,NA,3,2,NA,0,NA,4,3,4,NA))
user_id tobeindexed
1 1 1
2 1 NA
3 1 3
4 2 2
5 2 NA
6 2 0
7 2 NA
8 3 4
9 3 3
10 3 4
11 3 NA
I want to make a new column looking like "desired" in the following df:
> cbind(data,data.frame(desired = c(1,1,2,1,1,2,2,1,2,3,3)))
user_id tobeindexed desired
1 1 1 1
2 1 NA 1
3 1 3 2
4 2 2 1
5 2 NA 1
6 2 0 2
7 2 NA 2
8 3 4 1
9 3 3 2
10 3 4 3
11 3 NA 3
How can I solve this?
Using cumsum and group_by gets me close, but the count does not start over from 1 when the user_id changes...
> data %>% group_by(user_id) %>% mutate(desired = cumsum(!is.na(tobeindexed)))
user_id tobeindexed desired
<dbl> <dbl> <int>
1 1 1 1
2 1 NA 1
3 1 3 2
4 2 2 3
5 2 NA 3
6 2 0 4
7 2 NA 4
8 3 4 5
9 3 3 6
10 3 4 7
11 3 NA 7
Given the sample data you provided (with the "one" column), this works unchanged. The code is retained below for demonstration.
base R
data$out <- ave(data$one, data$user_id, FUN = function(z) cumsum(!is.na(z)))
data
# user_id one out
# 1 1 1 1
# 2 1 NA 1
# 3 1 3 2
# 4 2 2 1
# 5 2 NA 1
# 6 2 0 2
# 7 2 NA 2
# 8 3 4 1
# 9 3 3 2
# 10 3 4 3
# 11 3 NA 3
dplyr
library(dplyr)
data %>%
group_by(user_id) %>%
mutate(out = cumsum(!is.na(one))) %>%
ungroup()
# # A tibble: 11 × 3
# user_id one out
# <dbl> <dbl> <int>
# 1 1 1 1
# 2 1 NA 1
# 3 1 3 2
# 4 2 2 1
# 5 2 NA 1
# 6 2 0 2
# 7 2 NA 2
# 8 3 4 1
# 9 3 3 2
# 10 3 4 3
# 11 3 NA 3
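To see why the grouping matters, here is a small sketch on plain vectors. Without grouping, the running count of non-NA values never resets, which reproduces the numbers in the question; applied per user_id, it restarts at 1. The names whole and per_user are made up for illustration, and the unlist() step assumes the rows are already sorted by user_id, as in the sample data.

```r
user_id <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
one     <- c(1, NA, 3, 2, NA, 0, NA, 4, 3, 4, NA)

# Running count of non-NA values over the whole vector (no grouping):
# the count keeps increasing across users.
whole <- cumsum(!is.na(one))

# The same count computed separately for each user_id:
# split() makes one chunk per user, and cumsum() restarts in each chunk.
per_user <- unlist(
  lapply(split(one, user_id), function(z) cumsum(!is.na(z))),
  use.names = FALSE
)

whole
per_user
```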

Fill missing values (NA) before the first non-NA value by group

I have a data frame grouped by 'id' and a variable 'age' which contains missing values, NA.
Within each 'id', I want to replace missing values of 'age', but only "fill up" before the first non-NA value.
data <- data.frame(id=c(1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
age=c(NA,6,NA,8,NA,NA,NA,NA,3,8,NA,NA,NA,7,NA,9))
id age
1 1 NA
2 1 6 # first non-NA in id = 1. Fill up from here
3 1 NA
4 1 8
5 1 NA
6 1 NA
7 2 NA
8 2 NA
9 2 3 # first non-NA in id = 2. Fill up from here
10 2 8
11 2 NA
12 3 NA
13 3 NA
14 3 7 # first non-NA in id = 3. Fill up from here
15 3 NA
16 3 9
Expected output:
1 1 6
2 1 6
3 1 NA
4 1 8
5 1 NA
6 1 NA
7 2 3
8 2 3
9 2 3
10 2 8
11 2 NA
12 3 7
13 3 7
14 3 7
15 3 NA
16 3 9
I tried using fill with .direction = "up" like this:
library(dplyr)
library(tidyr)
data1 <- data %>% group_by(id) %>%
fill(!is.na(age[1]), .direction = "up")
You could use cumall(is.na(age)) to find the positions before the first non-NA value.
library(dplyr)
data %>%
group_by(id) %>%
mutate(age2 = replace(age, cumall(is.na(age)), age[!is.na(age)][1])) %>%
ungroup()
# A tibble: 16 × 3
id age age2
<dbl> <dbl> <dbl>
1 1 NA 6
2 1 6 6
3 1 NA NA
4 1 8 8
5 1 NA NA
6 1 NA NA
7 2 NA 3
8 2 NA 3
9 2 3 3
10 2 8 8
11 2 NA NA
12 3 NA 7
13 3 NA 7
14 3 7 7
15 3 NA NA
16 3 9 9
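The same idea can be sketched in base R. Here cumsum(!is.na(a)) == 0 flags the positions before the first non-NA value, which is what cumall(is.na(age)) computes in the answer above. The guard for groups with no non-NA value at all is an added assumption; the sample data has no such group.

```r
data <- data.frame(
  id  = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  age = c(NA, 6, NA, 8, NA, NA, NA, NA, 3, 8, NA, NA, NA, 7, NA, 9)
)

data$age2 <- ave(data$age, data$id, FUN = function(a) {
  # TRUE for every position before the first non-NA value in this group
  leading <- cumsum(!is.na(a)) == 0
  # Fill only those leading NAs; leave all-NA groups untouched
  if (any(!is.na(a))) a[leading] <- a[!is.na(a)][1]
  a
})

data
```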
Another option (agnostic about where the missing and non-missing values start) could be:
data %>%
group_by(id) %>%
mutate(rleid = with(rle(is.na(age)), rep(seq_along(lengths), lengths)),
age2 = ifelse(rleid == min(rleid[is.na(age)]),
age[rleid == (min(rleid[is.na(age)]) + 1)][1],
age))
id age rleid age2
<dbl> <dbl> <int> <dbl>
1 1 NA 1 6
2 1 6 2 6
3 1 NA 3 NA
4 1 8 4 8
5 1 NA 5 NA
6 1 NA 5 NA
7 2 NA 1 3
8 2 NA 1 3
9 2 3 2 3
10 2 8 2 8
11 2 NA 3 NA
12 3 NA 1 7
13 3 NA 1 7
14 3 7 2 7
15 3 NA 3 NA
16 3 9 4 9

Create run-length ID for subset of values

In this type of dataframe:
df <- data.frame(
x = c(3,3,1,12,2,2,10,10,10,1,5,5,2,2,17,17)
)
how can I create a new column recording the run-length ID of only a subset of x values, say, 3-20?
My own attempt only succeeds at inserting NA where the run-length count should be interrupted; but internally it seems the count is uninterrupted:
library(data.table)
library(dplyr)
df %>%
mutate(rle = ifelse(x %in% 3:20, rleid(x), NA))
x rle
1 3 1
2 3 1
3 1 NA
4 12 3
5 2 NA
6 2 NA
7 10 5
8 10 5
9 10 5
10 1 NA
11 5 7
12 5 7
13 2 NA
14 2 NA
15 17 9
16 17 9
The expected result:
x rle
1 3 1
2 3 1
3 1 NA
4 12 2
5 2 NA
6 2 NA
7 10 3
8 10 3
9 10 3
10 1 NA
11 5 4
12 5 4
13 2 NA
14 2 NA
15 17 5
16 17 5
In base R, with rleid from data.table:
df[df$x %in% 3:20, "rle"] <- data.table::rleid(df[df$x %in% 3:20, ])
x rle
1 3 1
2 3 1
3 1 NA
4 12 2
5 2 NA
6 2 NA
7 10 3
8 10 3
9 10 3
10 1 NA
11 5 4
12 5 4
13 2 NA
14 2 NA
15 17 5
16 17 5
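With left_join (this numbers distinct values rather than runs, so it matches the expected result only as long as no value in the range recurs in a later run):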
left_join(df, df %>%
filter(x %in% 3:20) %>%
distinct() %>%
mutate(rle = row_number()))
Joining, by = "x"
x rle
1 3 1
2 3 1
3 1 NA
4 12 2
5 2 NA
6 2 NA
7 10 3
8 10 3
9 10 3
10 1 NA
11 5 4
12 5 4
13 2 NA
14 2 NA
15 17 5
16 17 5
With data.table:
library(data.table)
setDT(df)
df[x %between% c(3,20),rle:=rleid(x)][]
x rle
<num> <int>
1: 3 1
2: 3 1
3: 1 NA
4: 12 2
5: 2 NA
6: 2 NA
7: 10 3
8: 10 3
9: 10 3
10: 1 NA
11: 5 4
12: 5 4
13: 2 NA
14: 2 NA
15: 17 5
16: 17 5
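If you want to avoid the data.table dependency entirely, base R's rle() can do the same job. This is a sketch under the same assumption as the answers above: runs are counted after dropping the values outside 3-20, so two equal values separated only by excluded values end up in one run. The helper names keep and r are made up for illustration.

```r
df <- data.frame(
  x = c(3, 3, 1, 12, 2, 2, 10, 10, 10, 1, 5, 5, 2, 2, 17, 17)
)

# Rows whose x falls in the subset of interest
keep <- df$x %in% 3:20

# Run-length encoding of the kept values only
r <- rle(df$x[keep])

# Start with NA everywhere, then number the runs among the kept rows
df$rle <- NA_integer_
df$rle[keep] <- rep(seq_along(r$lengths), r$lengths)

df
```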

Filter to remove all rows before a particular value in a specific column, while this particular value occurs several times

I would like to filter out all rows that come before a particular value in a specific column. For example, in the data frame below, I would like to remove all rows before the "1" in column x, each time "1" occurs. Please note that the value "1" repeats many times, and I want to remove the NA rows before the "1" in column x within each group of column a.
Thanks
a b x
1 1 NA
1 2 NA
1 3 1
1 4 0
1 5 0
1 6 NA
1 7 NA
2 1 NA
2 2 NA
2 3 1
2 4 NA
2 5 0
2 6 0
2 7 NA
3 1 NA
3 2 NA
3 3 NA
3 4 NA
3 5 1
3 6 0
3 7 NA
The desired output would be like this:
a b x
1 3 1
1 4 0
1 5 0
1 6 NA
1 7 NA
2 3 1
2 4 NA
2 5 0
2 6 0
2 7 NA
3 5 1
3 6 0
3 7 NA
Does this solve your problem?
library(tidyverse)
dat <- read.table(text = "a b x
1 1 NA
1 2 NA
1 3 1
1 4 0
1 5 0
1 6 NA
1 7 NA
2 1 NA
2 2 NA
2 3 1
2 4 NA
2 5 0
2 6 0
2 7 NA
3 1 NA
3 2 NA
3 3 NA
3 4 NA
3 5 1
3 6 0
3 7 NA", header = TRUE)
dat %>%
group_by(a) %>%
filter(cummax(!is.na(x)) == 1)
#> # A tibble: 13 × 3
#> # Groups: a [3]
#> a b x
#> <int> <int> <int>
#> 1 1 3 1
#> 2 1 4 0
#> 3 1 5 0
#> 4 1 6 NA
#> 5 1 7 NA
#> 6 2 3 1
#> 7 2 4 NA
#> 8 2 5 0
#> 9 2 6 0
#> 10 2 7 NA
#> 11 3 5 1
#> 12 3 6 0
#> 13 3 7 NA
Created on 2021-12-07 by the reprex package (v2.0.1)
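Note that cummax(!is.na(x)) == 1 keeps everything from the first non-NA value of x onward, which happens to be the "1" in every group here. If the rows should be kept strictly from the first x == 1, a possible variant uses dplyr's cumany(). This is a sketch, not the answerer's code; the result name res is made up, and x %in% 1 is used so that NA counts as FALSE.

```r
library(dplyr)

dat <- data.frame(
  a = rep(1:3, each = 7),
  b = rep(1:7, times = 3),
  x = c(NA, NA, 1, 0, 0, NA, NA,
        NA, NA, 1, NA, 0, 0, NA,
        NA, NA, NA, NA, 1, 0, NA)
)

res <- dat %>%
  group_by(a) %>%
  # cumany() turns TRUE at the first row where x is 1 and stays TRUE after it
  filter(cumany(x %in% 1)) %>%
  ungroup()

res
```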

Delete rows when all numbers within a cycle of another variable equal to NA

My data are as follows:
Row x y
1 1 2
2 2 3
3 3 4
4 4 3
5 5 NA
6 1 NA
7 2 NA
8 3 NA
9 4 NA
10 5 7
11 1 NA
12 2 NA
13 3 NA
14 4 NA
15 5 NA
I wish to delete Rows 11 to 15, since y is NA for the whole cycle of x (y equals NA whatever value x takes in Rows 11 to 15). I do not want to delete the other rows, since there is at least one non-NA value of y as x moves from 1 to 5 (for example, in Rows 6 to 10, y is 7 when x is 5, so I keep Rows 6 to 10). How should I write R code to accomplish this?
Using base R, assuming that x is ordered within each cycle and that every cycle starts from 1:
subset(df,!ave(is.na(y),cumsum(c(1,diff(x)<0)),FUN=all))
Row x y
1 1 1 2
2 2 2 3
3 3 3 4
4 4 4 3
5 5 5 NA
6 6 1 NA
7 7 2 NA
8 8 3 NA
9 9 4 NA
10 10 5 7
Using tidyverse:
df%>%
group_by(m = cumsum(c(1,diff(x)<0)))%>%
filter(!all(is.na(y)))
# A tibble: 10 x 4
# Groups: m [2]
Row x y m
<int> <int> <int> <dbl>
1 1 1 2 1
2 2 2 3 1
3 3 3 4 1
4 4 4 3 1
5 5 5 NA 1
6 6 1 NA 2
7 7 2 NA 2
8 8 3 NA 2
9 9 4 NA 2
10 10 5 7 2
Of course, you can then ungroup and drop the helper column m.
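Put together, the tidyverse version with the helper column removed might look like the sketch below. The data frame is rebuilt from the table in the question, and the result name res is made up. The cycle id works because diff(x) < 0 is TRUE exactly where x drops back to the start of a new cycle, so the cumulative sum increases by one at each restart.

```r
library(dplyr)

df <- data.frame(
  Row = 1:15,
  x   = rep(1:5, times = 3),
  y   = c(2, 3, 4, 3, NA, NA, NA, NA, NA, 7, NA, NA, NA, NA, NA)
)

res <- df %>%
  # m is a cycle id: it goes up by 1 every time x decreases
  group_by(m = cumsum(c(1, diff(x) < 0))) %>%
  # keep a cycle only if it has at least one non-NA y
  filter(!all(is.na(y))) %>%
  ungroup() %>%
  select(-m)

res
```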
