Count the sequence of numbers while skipping missing values - r

I have a series of dates and I want to number each record in sequence, while skipping missing values.
Essentially, I want to see the following result, where a holds my dates and b is the index of each date record. You can see that row 5 is my 4th record, and row 7 is my 5th record.
tibble(a = c(12, 24, 32, NA, 55, NA, 73), b = c(1, 2, 3, NA, 4, NA, 5))
a b
<dbl> <dbl>
1 12 1
2 24 2
3 32 3
4 NA NA
5 55 4
6 NA NA
7 73 5
It seems that group_by() %>% mutate(sq = sequence(n())) doesn't work in this case, because I don't know how to filter out the missing values while counting. I need to keep the rows with missing values in place, and my data set is fairly large.
Is a separate operation of filtering the data, getting the sequence, and using left_join my best option?
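For reference, a minimal sketch of that filter-then-join approach (assuming the data sit in a tibble called dat with the date column a) could look like this; the answers below avoid the join entirely:
library(dplyr)

dat <- tibble(a = c(12, 24, 32, NA, 55, NA, 73))

# number only the non-missing rows, then join the index back onto the full data
dat %>%
  mutate(row = row_number()) %>%
  left_join(
    dat %>%
      mutate(row = row_number()) %>%
      filter(!is.na(a)) %>%
      mutate(b = row_number()) %>%
      select(row, b),
    by = "row"
  ) %>%
  select(-row)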

library(dplyr)
dat <- tibble(a = c(12, 24, 32, NA, 55, NA, 73))
dat %>%
mutate(sq = ifelse(is.na(a), NA, cumsum(!is.na(a))))
#> # A tibble: 7 x 2
#> a sq
#> <dbl> <int>
#> 1 12 1
#> 2 24 2
#> 3 32 3
#> 4 NA NA
#> 5 55 4
#> 6 NA NA
#> 7 73 5

Cumulatively sum an indicator of non-NA values, then add 0 * a: this turns any component that was originally NA back into NA while adding 0 to (and so not changing) the rest.
a <- c(12, 24, 32, NA, 55, NA, 73)
cumsum(!is.na(a)) + 0 * a
## [1] 1 2 3 NA 4 NA 5
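The same idiom drops straight into a dplyr pipeline on the question's tibble (a usage sketch, assuming the data are in dat as above):
library(dplyr)
dat %>%
  mutate(b = cumsum(!is.na(a)) + 0 * a)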

Maybe you can try replace + seq_along like below (using the question's dat):
within(
  dat,
  b <- replace(a, !is.na(a), seq_along(na.omit(a)))
)

We could specify i as the non-NA logical vector and create 'b' by assigning the sequence of rows:
library(data.table)
setDT(dat)[!is.na(a), b := seq_len(.N)]
-output
dat
# a b
#1: 12 1
#2: 24 2
#3: 32 3
#4: NA NA
#5: 55 4
#6: NA NA
#7: 73 5


How to use mutate_if to change values

How can I use mutate_if() to change values of b to NA when a > 25?
I can do it with ifelse(), but I feel mutate_if() was made for this kind of task.
library(tidyverse)
tbl <- tibble(a = c(10, 20, 30, 40, 10, 60),
b = c(12, 23, 34, 45, 56, 67))
In this small example, I'm not sure that you actually need mutate_if(). mutate_if is designed to use the _if part to determine which columns to subset and work on, rather than an if condition when modifying a value.
Rather, you can use mutate_at() to select your columns to operate on - either based on their exact name or by using vars(contains('your_string')).
See the help page for more info on the mutate_* functions: https://dplyr.tidyverse.org/reference/mutate_all.html
Here are 3 options, using mutate() and mutate_at():
# using mutate()
tbl %>%
mutate(
b = ifelse(a > 25, NA, b)
)
# mutate_at - we select only column 'b'
tbl %>%
mutate_at(vars(c('b')), ~ifelse(a > 25, NA, .))
# select only columns with 'b' in the col name
tbl %>%
mutate_at(vars(contains('b')), ~ifelse(a > 25, NA, .))
Which all produce the same output:
# A tibble: 6 x 2
a b
<dbl> <dbl>
1 10 12
2 20 23
3 30 NA
4 40 NA
5 10 56
6 60 NA
I know it's not mutate_if, but I suspect you don't actually need it.
The mutate_if() variant applies a predicate function (a function that
returns TRUE or FALSE) to each column in order to determine which columns to operate on, so the condition selects columns rather than rows. A typical use is applying a mathematical operation only to the numeric columns, as sketched below.
https://dplyr.tidyverse.org/reference/mutate_all.html
function (.tbl, .predicate, .funs, ...)
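For instance, a sketch of typical mutate_if() usage (not part of the original question): a column-level predicate such as is.numeric selects the columns to transform.
library(dplyr)
# double every numeric column; non-numeric columns are left untouched
tbl %>%
  mutate_if(is.numeric, ~ . * 2)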
library(dplyr)
# Below code gets the job done but as Hugh Allan explained it is probably not
# the right approach
tbl %>%
mutate_if(colnames(tbl) != 'a', ~ifelse(a > 25, NA, .))
# A tibble: 6 x 2
a b
<dbl> <dbl>
1 10 12
2 20 23
3 30 NA
4 40 NA
5 10 56
6 60 NA
We can use replace
library(dplyr)
tbl %>%
mutate(b = replace(b, a > 25, NA))
-output
# A tibble: 6 x 2
a b
<dbl> <dbl>
1 10 12
2 20 23
3 30 NA
4 40 NA
5 10 56
6 60 NA
And for completeness, here are the base R and data.table variants.
tbl$b[tbl$a > 25] <- NA
tbl
# a b
# <dbl> <dbl>
#1 10 12
#2 20 23
#3 30 NA
#4 40 NA
#5 10 56
#6 60 NA
In data.table -
library(data.table)
setDT(tbl)
tbl[a > 25, b := NA]
tbl

How to fill gaps by associating the same value with a specific factor? R [duplicate]

This question already has answers here:
Efficiently fill NAs by group
(3 answers)
Replace NA with previous or next value, by group, using dplyr
(5 answers)
Closed 1 year ago.
Assume I have a big data frame with 'Sample_n', 'Station', 'Lat' and 'Lon'.
Each station (from "1" to "100", for example) has a different number of samples (anywhere from 1 to 99, for example).
Although there are no gaps in 'Sample_n' and 'Station', 'Lat' and 'Lon' are full of NAs; however, there is at least one full coordinate (lat + lon) for each station, placed at a random row.
I would like to fill the gaps in 'Lat' and 'Lon', associating the right values with the right station.
A sample of your data would be of help. My best guess is that you can get what you want using the function fill from tidyr (also using functions in dplyr in the example):
library(tidyr)
library(dplyr)
df <- tibble(Sample_n=rep(1:3, each = 3), Station = rep(letters[1:3], each = 3),
lat = c(NA, 50, NA, 40, NA, NA, NA, 55, NA),
lon = c(NA, 150, NA, 140, NA, NA, NA, 155, NA))
df
# A tibble: 9 x 4
Sample_n Station lat lon
<int> <chr> <dbl> <dbl>
1 1 a NA NA
2 1 a 50 150
3 1 a NA NA
4 2 b 40 140
5 2 b NA NA
6 2 b NA NA
7 3 c NA NA
8 3 c 55 155
9 3 c NA NA
df %>% group_by(Sample_n, Station) %>%
fill(lat, lon, .direction="updown")
# A tibble: 9 x 4
# Groups: Sample_n, Station [3]
Sample_n Station lat lon
<int> <chr> <dbl> <dbl>
1 1 a 50 150
2 1 a 50 150
3 1 a 50 150
4 2 b 40 140
5 2 b 40 140
6 2 b 40 140
7 3 c 55 155
8 3 c 55 155
9 3 c 55 155
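As the linked duplicates suggest, the same by-group gap filling can also be done in data.table; a hedged sketch using nafill() (assuming lat and lon are numeric), filling forward and then backward within each Station:
library(data.table)
setDT(df)[, c("lat", "lon") := lapply(.SD, function(x) nafill(nafill(x, "locf"), "nocb")),
          by = Station, .SDcols = c("lat", "lon")]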

fill in NA by outcome of formula between previous and following non-NA values in R

I have the following dataframe:
day <- c(1,2,3,4,5,6,7,8,9, 10, 11)
totalItems <- c(700, NA, 32013, NA, NA, NA, 39599, NA, NA, NA, 107542)
df <- data.frame(day, totalItems)
I need to create another variable/column where NAs are replaced by the outcome of a formula: (following available non NA - previous available non NA) / (row number of following non NA - row number of previous non NA), in order to get this final dataframe :
day <- c(1,2,3,4,5,6,7,8,9, 10, 11)
totalItems <- c(700, NA, 32013, NA, NA, NA, 39599, NA, NA, NA, 107542)
estimatedDaily <- c(700, 15656, 15656, 1897, 1897, 1897, 1897, 16986, 16986, 16986, 16986)
df.new <- data.frame(day, totalItems, estimatedDaily)
I tried to juggle with tidyr::replace_na(), but I couldn't figure out how to define the formula so that it could identify the previous and the following available non-NA values.
Many thanks in advance for helping.
You can create groups in your data based on presence of NA values.
library(dplyr)
df1 <- df %>% mutate(group = cumsum(lag(!is.na(totalItems), default = TRUE)))
df1
# day totalItems group
#1 1 700 1
#2 2 NA 2
#3 3 32013 2
#4 4 NA 3
#5 5 NA 3
#6 6 NA 3
#7 7 39599 3
#8 8 NA 4
#9 9 NA 4
#10 10 NA 4
#11 11 107542 4
Keep only the last row of each group in df1 (the row that has a value), apply the formula to each group, and join the result back to df1 to get the same number of rows.
df1 %>%
group_by(group) %>%
slice(n()) %>%
ungroup %>%
transmute(group, estimatedDaily = (totalItems - lag(totalItems, default = 0))/
(day - lag(day, default = 0))) %>%
left_join(df1, by = 'group') %>%
select(-group)
# estimatedDaily day totalItems
# <dbl> <dbl> <dbl>
# 1 700 1 700
# 2 15656. 2 NA
# 3 15656. 3 32013
# 4 1896. 4 NA
# 5 1896. 5 NA
# 6 1896. 6 NA
# 7 1896. 7 39599
# 8 16986. 8 NA
# 9 16986. 9 NA
#10 16986. 10 NA
#11 16986. 11 107542
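For comparison, a base R sketch of the same per-gap formula (assuming df as defined in the question): each rate is the change in totalItems divided by the change in day between consecutive non-NA rows, repeated over the rows that gap covers.
idx  <- which(!is.na(df$totalItems))   # rows that hold a value
rate <- diff(c(0, df$totalItems[idx])) / diff(c(0, df$day[idx]))
df$estimatedDaily <- rep(rate, times = diff(c(0, idx)))
# gives 700, 15656.5 (x2), 1896.5 (x4), 16985.75 (x4) - matching the dplyr result above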

Combining/joining rows within the same dataframe based on grouping R [duplicate]

This question already has answers here:
combine rows in data frame containing NA to make complete row
(7 answers)
Closed 3 years ago.
I am executing a map_df function that results in a dataframe similar to the df below.
name <- c('foo', 'foo', 'foo', 'bar', 'bar', 'bar')
year <- c(19, 19, 19, 18, 18, 18)
A <- c(1, NA, NA, 2, NA, NA)
B <- c(NA, 3, NA, NA, 4, NA)
C <- c(NA, NA, 2, NA, NA, 5)
df <- data.frame(name, year, A, B, C)
name year A B C
1 foo 19 1 NA NA
2 foo 19 NA 3 NA
3 foo 19 NA NA 2
4 bar 18 2 NA NA
5 bar 18 NA 4 NA
6 bar 18 NA NA 5
Based on my unique groups within the df (in this case name + year), I want to merge the data into the same row. Desired result:
name year A B C
1 foo 19 1 3 2
2 bar 18 2 4 5
I can definitely accomplish this with a mix of filtering and joins, but with my actual dataframe that would be a lot of code and inefficient. I'm looking for a more elegant way to "squish" this dataframe.
library(dplyr)
df %>%
group_by(name, year) %>%
summarise_all(mean, na.rm = TRUE)
This is a dplyr answer. It works if your data really look like the example you posted, i.e. with exactly one non-NA value per group and column.
Output:
name year A B C
<fct> <dbl> <dbl> <dbl> <dbl>
1 bar 18 2 4 5
2 foo 19 1 3 2
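If the columns were not numeric (so that a mean would not make sense), a hedged alternative sketch is to take the first non-missing value per group instead:
library(dplyr)
df %>%
  group_by(name, year) %>%
  summarise_all(~ first(na.omit(.)))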

Is there an R function or piece of code that can help me with an if statement with multiple conditions

I'm cleaning up my dataset, a data frame with two columns obtained after joining multiple data frames. I'm trying to find code, or a logical approach, to tell R to create a third column using the following rules:
If both columns contain non-NA values, then the third column contains their average.
If one column contains an NA, then the third column takes the value of the other column.
For example:
df1 <-
data.frame(Var1 = c(34, 23, 23, NA, 32),
Var2 = c(NA, 34, NA, 35, 55))
df1
# Var1 Var2
# 1 34 NA
# 2 23 34
# 3 23 NA
# 4 NA 35
# 5 32 55
The result I want is:
# Var1 Var2 Var3
# 1 34 NA 34.0
# 2 23 34 28.5
# 3 23 NA 23.0
# 4 NA 35 35.0
# 5 32 55 43.5
We need rowMeans here (assuming missing values are coded as NA):
df1$Var3 <- rowMeans(df1, na.rm = TRUE)
If values above 100 or below 1 should also be treated as missing (that condition isn't entirely clear from the question), replace them with NA and then take the row means:
rowMeans(replace(df1, df1 < 1| df1 > 100, NA), na.rm = TRUE)
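If the real data frame has more columns than Var1 and Var2, the same idea can be restricted to just those two (a small usage sketch):
# average only Var1 and Var2, ignoring whichever one is NA
df1$Var3 <- rowMeans(df1[c("Var1", "Var2")], na.rm = TRUE)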
