Tackling missing data in R

I have encountered this problem in a project that I'm currently doing.
I have a sparse dataframe and I need to calculate the difference between the first and the last observation per row under some conditions:
Conditions:
If the row contains only NAs, the difference is 0.
If the row contains only one observation, the difference is 0.
If the row contains two or more non-NA values, the difference is the last minus the first (tail - head).
The dataframe that I have:
S1 S2 S3 S4 S5
1 NA NA NA NA NA
2 NA 3 NA 5 NA
3 1 NA NA NA 5
4 1 NA 2 NA 7
5 2 NA NA NA NA
6 NA NA 3 4 NA
7 NA NA 3 NA NA
The dataframe that I need:
S1 S2 S3 S4 S5 diff
1 NA NA NA NA NA 0
2 NA 3 NA 5 NA 2
3 1 NA NA NA 5 4
4 1 NA 2 NA 7 6
5 2 NA NA NA NA 0
6 NA NA 3 4 NA 1
7 NA NA 3 NA NA 0
What I've written up till now:
last_minus_first <- function(x, y = na.omit(x)) tail(y, 1) - y[1]
But it doesn't handle the case where the row contains only NAs.
Any help would be much appreciated.

I would suggest using a user-defined function with apply(). Here is the code:
#Data
df <- structure(list(S1 = c(NA, NA, 1L, 1L, 2L, NA, NA), S2 = c(NA,
3L, NA, NA, NA, NA, NA), S3 = c(NA, NA, NA, 2L, NA, 3L, 3L),
S4 = c(NA, 5L, NA, NA, NA, 4L, NA), S5 = c(NA, NA, 5L, 7L,
NA, NA, NA)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7"))
Code:
#Function
myown <- function(x) {
  # count the non-NA values
  i <- sum(!is.na(x))
  # compute
  if (i <= 1) {
    y <- 0
  } else {
    # positions of the last and first non-NA values
    j1 <- max(which(!is.na(x)))
    j2 <- min(which(!is.na(x)))
    # difference: last minus first
    y <- x[j1] - x[j2]
  }
  return(y)
}
#Apply function by row
df$NewVar <- apply(df,1,myown)
Output:
S1 S2 S3 S4 S5 NewVar
1 NA NA NA NA NA 0
2 NA 3 NA 5 NA 2
3 1 NA NA NA 5 4
4 1 NA 2 NA 7 6
5 2 NA NA NA NA 0
6 NA NA 3 4 NA 1
7 NA NA 3 NA NA 0
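The same logic compacts into a shorter helper (a sketch equivalent to myown, essentially the asker's last_minus_first with the missing guard added):
myown2 <- function(x) {
  y <- x[!is.na(x)]                              # keep only the observed values
  if (length(y) < 2) 0 else y[length(y)] - y[1]  # 0 unless two or more values
}
df$NewVar2 <- apply(df, 1, myown2)               # same result as NewVar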

Here's an easier (in my mind) way to handle this, using rowwise from the dplyr package to do calculations along rows.
df %>%
  dplyr::rowwise() %>%
  dplyr::mutate(max_pop = max(which(!is.na(dplyr::c_across(S1:S5)))),
                min_pop = min(which(!is.na(dplyr::c_across(S1:S5)))),
                diff = tidyr::replace_na(dplyr::c_across()[max_pop] - dplyr::c_across()[min_pop], 0))
I've broken that mutate call down into its various parts to show what we're doing, but essentially it goes across all columns in a row to find the last populated column (max_pop) and the first populated column (min_pop), and then uses those positions to retrieve the values therein.
You have to specify columns for max_pop and min_pop above because creating new interim columns affects the column indexing. c_across() defaults to using all columns, though, so you can actually do this all in one mutate call without specifying any columns.
df %>%
  rowwise() %>%
  mutate(diff = replace_na(c_across()[max(which(!is.na(c_across())))] -
                             c_across()[min(which(!is.na(c_across())))], 0))

A vectorized option in base R would be to extract the values based on row/column index and then subtract
df1$NewVar <- df1[cbind(seq_len(nrow(df1)), max.col(!is.na(df1), 'last'))] -
  df1[cbind(seq_len(nrow(df1)), max.col(!is.na(df1), 'first'))]
df1$NewVar[is.na(df1$NewVar)] <- 0
df1
# S1 S2 S3 S4 S5 NewVar
#1 NA NA NA NA NA 0
#2 NA 3 NA 5 NA 2
#3 1 NA NA NA 5 4
#4 1 NA 2 NA 7 6
#5 2 NA NA NA NA 0
#6 NA NA 3 4 NA 1
#7 NA NA 3 NA NA 0
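To unpack why this works (an explanatory aside, not part of the original answer): max.col() returns, for each row of a logical matrix, the column index of its maximum, so with ties.method 'first'/'last' it finds the first/last TRUE per row, and indexing a data frame with a two-column (row, column) matrix extracts one value per pair:
m <- !is.na(df1)
max.col(m, 'first')           # first non-NA column per row (arbitrary for all-NA rows)
max.col(m, 'last')            # last non-NA column per row
df1[cbind(c(3, 4), c(1, 5))]  # matrix indexing: df1[3, 1] and df1[4, 5], i.e. 1 and 7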
data
df1 <- structure(list(S1 = c(NA, NA, 1L, 1L, 2L, NA, NA), S2 = c(NA,
3L, NA, NA, NA, NA, NA), S3 = c(NA, NA, NA, 2L, NA, 3L, 3L),
S4 = c(NA, 5L, NA, NA, NA, 4L, NA), S5 = c(NA, NA, 5L, 7L,
NA, NA, NA)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7"))

How do I conduct a row sums for loop across specific columns using [row,col] distance indexing

Re-purposing a previous question but hopefully with more clarity + dput().
I'm working with the data below, which is almost like a key:value pairing in that every "type" variable has a corresponding variable containing its "total" value in each row.
structure(list(type3a1 = c(2L, 6L, 5L, NA, 1L, 3L, NA), type3b1 = c(NA,
3L, 1L, 5L, 6L, 3L, NA), type3a1_arc = c(1L, 2L, 5L, 4L, 5L,
4L, NA), type3b1_arc = c(2L, 2L, 3L, 4L, 1L, 1L, NA), testing = c("Yes",
NA, "No", "No", NA, "Yes", NA), cars = c(5L, 12L, 1L, 6L, NA,
2L, NA), house = c(5L, 4L, 0L, 5L, 0L, 10L, NA), type3a2 = c(50L,
NA, 20L, 4L, 5L, NA, NA), type3b2 = c(10L, 10L, 15L, 1L, 3L,
1L, NA), type3a2_arc = c(50L, 25L, 30L, 10L, NA, 10L, NA), type3b2_arc = c(NA,
20L, 10L, 50L, 5L, 1L, NA), X = c(NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-7L))
I am trying to do a summation loop that goes through every row, and scans each "type" variable (i.e. type3a1, type3b1, type3c1, etc.). Each "type" has a matching variable that contains its "total" value (i.e. type3a2, type3b2, type3c2, etc.)
Process:
Check if the "type" variable contains values in (1,2,3,4 or 5).
If that type column's [row,col] value is in (1:5), then move 7 columns from its current [row,col] index to grab its "total" value, ready for summation.
After checking every "type" variable, sum all the gathered "total" values and plop into a new overall totals column.
Essentially, I want to end up with a total value like the one below:
The first row shows a total of 100 since type3b1 has a value of "NA" which is not in (1:5). Hence, its total pairing (i.e. +7 columns away = cell value of "10") is not accounted for in the row summation.
My approach this time, compared to a previous attempt, is to use a for loop and rely on indexing based on how far one column is from another. I had a lot of trouble approaching this with dplyr / mutate methods and ran into many issues with the variability in the type:total name pairings (i.e. no pattern in the naming conventions, very messy data)...
# Matching pairing variables (i.e. type_vars:"type3a1" with total_vars:"type3a2")
type_vars <- c("type3a1", "type3b1", "type3a1_arc", "type3b1_arc")
total_vars <- c("type3a2", "type3b2", "type3a2_arc", "type3b2_arc")
valid_list <- c(1,2,3,4,5)
totals = list()
for (row in 1:nrow(df)) {
  sum = 0
  for (col in type_vars) {
    if (df[row, col] %in% valid_list) {
      sum <- sum + (df[row, col + 7])
    }
  }
  totals <- sum
}
I'm hoping this is the right approach but in either case, the code gives me an error at the sum <- sum + (df[row,col+7]) line where: Error in col + 7 : non-numeric argument to binary operator.
It's odd, since if I do this manually and write df[1, 1+2], it gives a value of "1", which is the value at the intersection [row 1, type3a1_arc] in the df above.
Any help or assistance would be appreciated.
The error you received is because col in your original for loop iterates through type_vars, which is a character vector. One way around this is to reference the column indices of type_vars using the which() function.
Here is a solution with just a couple of modifications to your for loop:
totals <- c()
for (row in 1:nrow(df)) {
  sum = 0
  for (col in which(names(df) %in% type_vars)) {
    if (df[row, col] %in% valid_list) {
      sum <- sum(c(sum, df[row, col + 7]), na.rm = TRUE)
    }
  }
  totals[row] <- sum
}
df$totals <- totals
df$totals
[1] 100 55 75 61 10 12 0
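If the fixed offset of 7 ever changes, a variant that pairs the columns by name (a sketch assuming type_vars and total_vars stay aligned element by element) avoids positional indexing entirely:
totals <- numeric(nrow(df))
for (k in seq_along(type_vars)) {
  ok  <- df[[type_vars[k]]] %in% valid_list  # TRUE where the type value is in 1:5
  tot <- df[[total_vars[k]]]
  tot[!ok | is.na(tot)] <- 0                 # zero out unmatched or missing totals
  totals <- totals + tot
}
df$totals <- totals                          # [1] 100 55 75 61 10 12 0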
Here is one way with the tidyverse. Loop across the columns whose names match 'type' followed by one or more digits (\\d+), a letter ([a-z]) and the number 2. For each such column, build the corresponding type column name by replacing the trailing 2 in the current column name (cur_column()) with 1, pull that column's values with cur_data(), create a logical vector with %in%, negate it (!), and replace the values whose type is not in 1:5 with NA. Finally, wrap in rowSums with na.rm = TRUE to get the total.
library(dplyr)
library(stringr)
df1 %>%
  mutate(total = rowSums(across(matches('^type\\d+[a-z]2'),
                                ~ replace(.x, !cur_data()[[str_replace(cur_column(),
                                    "(\\d+[a-z])\\d+", "\\11")]] %in% 1:5, NA)),
                         na.rm = TRUE))
-output
type3a1 type3b1 type3a1_arc type3b1_arc testing cars house type3a2 type3b2 type3a2_arc type3b2_arc X total
1 2 NA 1 2 Yes 5 5 50 10 50 NA NA 100
2 6 3 2 2 <NA> 12 4 NA 10 25 20 NA 55
3 5 1 5 3 No 1 0 20 15 30 10 NA 75
4 NA 5 4 4 No 6 5 4 1 10 50 NA 61
5 1 6 5 1 <NA> NA 0 5 3 NA 5 NA 10
6 3 3 4 1 Yes 2 10 NA 1 10 1 NA 12
7 NA NA NA NA <NA> NA NA NA NA NA NA NA 0
Or one may also use two across() calls (assuming the columns are in order):
df1 %>%
  mutate(total = rowSums(replace(across(8:11),
                                 !across(1:4, ~ .x %in% 1:5), NA), na.rm = TRUE))
-output
type3a1 type3b1 type3a1_arc type3b1_arc testing cars house type3a2 type3b2 type3a2_arc type3b2_arc X total
1 2 NA 1 2 Yes 5 5 50 10 50 NA NA 100
2 6 3 2 2 <NA> 12 4 NA 10 25 20 NA 55
3 5 1 5 3 No 1 0 20 15 30 10 NA 75
4 NA 5 4 4 No 6 5 4 1 10 50 NA 61
5 1 6 5 1 <NA> NA 0 5 3 NA 5 NA 10
6 3 3 4 1 Yes 2 10 NA 1 10 1 NA 12
7 NA NA NA NA <NA> NA NA NA NA NA NA NA 0
Or using base R
df1$total <- rowSums(mapply(\(x, y) replace(y, !x %in% 1:5, NA),
                            df1[1:4], df1[8:11]), na.rm = TRUE)
df1$total
[1] 100 55 75 61 10 12 0
Here’s a base R solution:
valid_vals <- sapply(type_vars, \(col) df[, col] %in% valid_list)
temp <- df[, total_vars]
temp[!valid_vals] <- NA
df$total <- rowSums(temp, na.rm = TRUE)
df$total
# [1] 100 55 75 61 10 12 0

Creating a new column based on several existing columns

Hoping to create a new column D based on three existing columns: A, B, and C. The dataset also has other variables E, F, G, etc.
Whenever one of A, B, or C has a value, the other two columns have NAs (E, F, G are not affected by them). The new variable D should take whatever value exists in any of the A, B, or C columns.
 A  B  C D E F G
 1 NA NA 1 1 1 1
NA  2 NA 2 1 1 1
NA  4 NA 4 1 1 1
NA NA  2 2 1 1 1
NA NA  3 3 1 1 1
Is there simple code in any package that can do the trick? Thank you in advance!
I have seen other code that would do the work, but those datasets only have A, B, and C; my dataset has other existing columns, so I need code that can target the A, B, and C columns specifically.
One option is to use coalesce on 'A', 'B', 'C' to create 'D' - coalesce returns the first non-NA value across the given columns for each row
library(dplyr)
df1 <- df1 %>%
  mutate(D = coalesce(A, B, C), .after = 'C')
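The behaviour is easy to check on bare vectors (a standalone illustration):
dplyr::coalesce(c(1, NA, NA), c(NA, 2, NA), c(NA, NA, 3))
# [1] 1 2 3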
A base R way to do it is to use pmax:
Data:
df <- data.frame(A = c(1, NA, NA, NA, NA),
                 B = c(NA, 2, 4, NA, NA),
                 C = c(NA, NA, NA, 2, 3))
Code:
df$D <- pmax(df$A, df$B, df$C, na.rm = TRUE)
# or
df$D <- with(df, pmax(A, B, C, na.rm = TRUE))
Output:
# A B C D
# 1 1 NA NA 1
# 2 NA 2 NA 2
# 3 NA 4 NA 4
# 4 NA NA 2 2
# 5 NA NA 3 3
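Note that pmax() only substitutes for coalesce() here because at most one of A, B, C is non-NA per row, so the row maximum is always the lone observed value:
pmax(c(1, NA, NA), c(NA, 2, NA), c(NA, NA, 3), na.rm = TRUE)
# [1] 1 2 3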
Update using across:
df %>%
  mutate(D = rowSums(across(A:C), na.rm = TRUE))
OR
We could use mutate with rowSums:
library(dplyr)
df %>%
  mutate(D = rowSums(.[1:3], na.rm = TRUE))
A B C D E F G
1 1 NA NA 1 1 1 1
2 NA 2 NA 2 1 1 1
3 NA 4 NA 4 1 1 1
4 NA NA 2 2 1 1 1
5 NA NA 3 3 1 1 1
data:
df <- structure(list(A = c(1L, NA, NA, NA, NA), B = c(NA, 2L, 4L, NA,
NA), C = c(NA, NA, NA, 2L, 3L), D = c(1L, 2L, 4L, 2L, 3L), E = c(1L,
1L, 1L, 1L, 1L), F = c(1L, 1L, 1L, 1L, 1L), G = c(1L, 1L, 1L,
1L, 1L)), class = "data.frame", row.names = c(NA, -5L))

Impute missing values with a value from previous month (if exists)

I have a dataframe with more than 100 000 rows and 30 000 unique ids.
My aim is to fill all the NAs among the different columns if there is a value from the previous month for the same id. However, most of the time the previously recorded value is from more than a month ago. Those NAs I would like to leave untouched.
The id column and the date column do not have NAs.
Here is an example of the data I have:
df3
id oxygen gluco dias bp date
1 0,25897842 0,20201604 0,17955655 0,14100962 31.7.2019
2 NA NA 0,38582622 0,12918231 31.12.2014
2 0,35817147 0,32943499 NA 0,43667462 30.11.2018
2 0,68557053 0,42898807 0,93897514 NA 31.10.2018
2 NA NA 0,99899076 0,44168223 31.7.2018
2 0,43848054 0,38604586 NA NA 30.4.2013
2 0,15823254 0,06216771 0,07829624 0,69755251 31.1.2016
2 NA NA 0,61645303 NA 29.2.2016
2 0,94671363 0,50682091 0,96770222 0,97403356 31.5.2018
3 NA 0,77352235 0,660479 0,11554399 30.4.2019
3 0,15567703 NA 0,4553325 NA 31.3.2017
3 NA NA 0,22181609 0,08527658 30.9.2017
3 0,93660763 NA NA NA 31.3.2018
3 0,73416759 NA NA 0,78501791 30.11.2018
3 NA NA NA NA 28.2.2019
3 0,84525106 0,54360374 NA 0,40595426 31.8.2014
3 0,76221263 0,62983336 0,84592719 0,10640734 31.8.2013
4 NA 0,29108942 0,3863479 NA 31.1.2018
4 0,74075742 NA 0,38117415 0,58849266 30.11.2018
4 0,09400641 0,68860814 NA 0,88895224 31.8.2014
4 0,72202944 0,49901387 0,19967415 NA 31.8.2018
4 0,98205262 0,85213969 0,34450998 0,98962306 30.11.2013
This is the last code implementation that I have tried:
df3 %>%
  group_by(id) %>%
  mutate_all(funs(na.locf(., na.rm = FALSE, maxgap = 30)))
But apparently "mutate_all() ignored the following grouping variables: Column id".
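As an aside, the message appears because mutate_all() skips grouping columns by design, and the superseded mutate_all()/funs() pair can be rewritten with across(). Note this sketch still fills per id no matter how far apart the dates are, so it does not enforce the one-month condition:
library(dplyr)
library(zoo)
df3 %>%
  group_by(id) %>%
  # fills in row order; arrange(id, date) first if needed
  mutate(across(everything(), ~ na.locf(.x, na.rm = FALSE))) %>%
  ungroup()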
You can use the tidyverse for that. Here's the approach:
Change the date column to class Date, then order by date.
Prepare the dates by dropping the day component, stored in Ym.
Get the time difference in months in mo.
Flag the rows that are at most one month apart.
Build group IDs with cumsum on the inverted flag logic.
Fill the rows within the same groups.
library(dplyr)
library(tidyr)
library(lubridate)
df$date <- as.Date(df$date, format="%d.%m.%Y")
df %>%
  arrange(date) %>%
  mutate(Ym = ym(strftime(date, "%Y-%m")),
         mo = interval(Ym, lag(Ym, default = as.Date("1970-01-01"))) / months(1),
         flag = cumsum(!(mo > -2 & mo < 1))) %>%
  group_by(id, flag) %>%
  fill(names(.), .direction = "down") %>%
  ungroup() %>%
  select(-c("Ym", "mo", "flag")) %>%
  print(n = nrow(.))
Output
# A tibble: 22 × 6
id oxygen gluco dias bp date
<int> <chr> <chr> <chr> <chr> <date>
1 2 0,43848054 0,38604586 NA NA 2013-04-30
2 3 0,76221263 0,62983336 0,84592719 0,10640734 2013-08-31
3 4 0,98205262 0,85213969 0,34450998 0,98962306 2013-11-30
4 3 0,84525106 0,54360374 NA 0,40595426 2014-08-31
5 4 0,09400641 0,68860814 NA 0,88895224 2014-08-31
6 2 NA NA 0,38582622 0,12918231 2014-12-31
7 2 0,15823254 0,06216771 0,07829624 0,69755251 2016-01-31
8 2 0,15823254 0,06216771 0,61645303 0,69755251 2016-02-29
9 3 0,15567703 NA 0,4553325 NA 2017-03-31
10 3 NA NA 0,22181609 0,08527658 2017-09-30
11 4 NA 0,29108942 0,3863479 NA 2018-01-31
12 3 0,93660763 NA NA NA 2018-03-31
13 2 0,94671363 0,50682091 0,96770222 0,97403356 2018-05-31
14 2 NA NA 0,99899076 0,44168223 2018-07-31
15 4 0,72202944 0,49901387 0,19967415 NA 2018-08-31
16 2 0,68557053 0,42898807 0,93897514 NA 2018-10-31
17 2 0,35817147 0,32943499 0,93897514 0,43667462 2018-11-30
18 3 0,73416759 NA NA 0,78501791 2018-11-30
19 4 0,74075742 NA 0,38117415 0,58849266 2018-11-30
20 3 NA NA NA NA 2019-02-28
21 3 NA 0,77352235 0,660479 0,11554399 2019-04-30
22 1 0,25897842 0,20201604 0,17955655 0,14100962 2019-07-31
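Since the measurements are stored as character with comma decimal marks, a final conversion step (an assumption about the desired end result; columns 2:5 are the measurement columns here) might be:
df[2:5] <- lapply(df[2:5], function(x) as.numeric(sub(",", ".", x, fixed = TRUE)))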
Data
df <- structure(list(id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L), oxygen = c("0,25897842",
NA, "0,35817147", "0,68557053", NA, "0,43848054", "0,15823254",
NA, "0,94671363", NA, "0,15567703", NA, "0,93660763", "0,73416759",
NA, "0,84525106", "0,76221263", NA, "0,74075742", "0,09400641",
"0,72202944", "0,98205262"), gluco = c("0,20201604", NA, "0,32943499",
"0,42898807", NA, "0,38604586", "0,06216771", NA, "0,50682091",
"0,77352235", NA, NA, NA, NA, NA, "0,54360374", "0,62983336",
"0,29108942", NA, "0,68860814", "0,49901387", "0,85213969"),
dias = c("0,17955655", "0,38582622", NA, "0,93897514", "0,99899076",
NA, "0,07829624", "0,61645303", "0,96770222", "0,660479",
"0,4553325", "0,22181609", NA, NA, NA, NA, "0,84592719",
"0,3863479", "0,38117415", NA, "0,19967415", "0,34450998"
), bp = c("0,14100962", "0,12918231", "0,43667462", NA, "0,44168223",
NA, "0,69755251", NA, "0,97403356", "0,11554399", NA, "0,08527658",
NA, "0,78501791", NA, "0,40595426", "0,10640734", NA, "0,58849266",
"0,88895224", NA, "0,98962306"), date = structure(c(18108,
16435, 17865, 17835, 17743, 15825, 16831, 16860, 17682, 18016,
17256, 17439, 17621, 17865, 17955, 16313, 15948, 17562, 17865,
16313, 17774, 16039), class = "Date")), row.names = c(NA,
-22L), class = "data.frame")

Delete/overwrite rows by partial matching

I need to check whether rows are partially duplicated and delete/overwrite those where two columns match a different row that has three values present. One problem is that the "real" dataframe contains a couple of list columns, which makes some operations unfeasible. The best case would be that any row with a match is checked independently of the number of columns, meaning that only the row with the most non-NA columns (out of all rows with matching column values) is kept.
o1 o2 o3
1 1 NA NA
2 2 NA NA
3 3 NA NA
4 4 NA NA
5 6 NA NA
6 7 NA NA
7 5 9 NA # this row has only 2 values, which match values from row 11, but the last value is NA
8 10 NA NA
9 12 NA NA
10 13 NA NA
11 5 9 14 # this row has values in all 3 columns
12 14 NA NA
13 8 11 15 # so does this row
14 16 NA NA
15 17 NA NA
16 18 NA NA
17 19 NA NA
18 20 NA NA
The result should be the same data frame - just without row 7, or with row 7 overwritten by row 11.
This should be easy to do, but for some reason I didn't manage it (except with a convoluted for loop that is hard to generalize should more columns be added later). Is there a straightforward way to do this?
dput of above df:
structure(list(o1 = c(1L, 2L, 3L, 4L, 6L, 7L, 5L, 10L, 12L, 13L,
5L, 14L, 8L, 16L, 17L, 18L, 19L, 20L), o2 = c(NA, NA, NA, NA,
NA, NA, 9L, NA, NA, NA, 9L, NA, 11L, NA, NA, NA, NA, NA), o3 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 14L, NA, 15L, NA, NA, NA,
NA, NA)), row.names = c(NA, -18L), class = "data.frame")
If there is already an answer for something like this, please let me know.
I thought of using dplyr:
library(dplyr)
df %>%
  mutate(rn = row_number(),
         count_na = rowSums(across(o1:o3, is.na))) %>%
  group_by(o1, o2) %>%
  slice_min(count_na) %>%
  arrange(rn) %>%
  ungroup() %>%
  select(o1:o3)
This returns
# A tibble: 17 x 3
o1 o2 o3
<int> <int> <int>
1 1 NA NA
2 2 NA NA
3 3 NA NA
4 4 NA NA
5 6 NA NA
6 7 NA NA
7 10 NA NA
8 12 NA NA
9 13 NA NA
10 5 9 14
11 14 NA NA
12 8 11 15
13 16 NA NA
14 17 NA NA
15 18 NA NA
16 19 NA NA
17 20 NA NA
This solution is based on the following ideas:
For every row we count the number of NAs in this row.
We group for o1 and o2 to create groups of data that belong together. Here is a possible flaw: perhaps it is a better approach to group by o1 only or do some other grouping. This depends on the structure of your data: should 1, <NA>, <NA> be overwritten by 1, 2, <NA>?
After grouping, we select the row with the smallest number of NAs.
Finally we do some clean up: removing the auxiliary columns, arranging the data and ungrouping.
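If the answer to that last question is yes, the same pipeline grouped by o1 alone (a variation, with with_ties = FALSE so that exact ties keep a single row) would be:
df %>%
  mutate(rn = row_number(),
         count_na = rowSums(across(o1:o3, is.na))) %>%
  group_by(o1) %>%
  slice_min(count_na, with_ties = FALSE) %>%
  arrange(rn) %>%
  ungroup() %>%
  select(o1:o3)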
A partial solution to detect the duplicates; it remains to specify which rows to delete (ran out of time). I've gone ahead and "duplicated" a couple more rows.
df=read.table(text="
o1 o2 o3
1 1 NA NA
2 2 NA NA
3 3 NA NA
4 4 NA NA
5 6 NA NA
6 7 NA NA
7 5 9 NA
8 10 NA NA
9 12 NA NA
10 13 NA NA
11 5 9 14
12 14 NA NA
13 8 11 15
14 16 NA NA
15 7 1 2
16 18 NA NA
17 7 1 3
18 20 NA NA",h=T)
The main trick is to calculate a distance matrix and check which rows have a distance of zero, since dist will automatically estimate a pairwise distance, removing missing values.
tmp=as.matrix(dist(df))
diag(tmp)=NA
tmp[lower.tri(tmp)]=NA
tod=data.frame(which(tmp==0,arr.ind=T))
resulting in
row col
X7 7 11
X6 6 15
X6.1 6 17
Here's another way which considers all columns; it should work with any number of columns, regardless of their names or positions.
library(dplyr)
mydf <- structure(list(o1 = c(1L, 2L, 3L, 4L, 6L, 7L, 5L, 10L, 12L, 13L,
5L, 14L, 8L, 16L, 17L, 18L, 19L, 20L),
o2 = c(NA, NA, NA, NA,
NA, NA, 9L, NA, NA, NA, 9L, NA, 11L, NA, NA, NA, NA, NA),
o3 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 14L, NA, 15L, NA, NA, NA,
NA, NA)),
row.names = c(NA, -18L),
class = "data.frame")
columns <- names(mydf)
dummy_cols <- paste0(columns, "_dummy")
mydf %>%
  # duplicate the dataframe
  cbind(mydf %>% `names<-`(dummy_cols)) %>%
  # arrange across all original columns (all_of() avoids tidyselect warnings
  # when selecting with an external character vector)
  arrange(across(all_of(columns))) %>%
  # fill NAs downwards
  tidyr::fill(all_of(dummy_cols), .direction = "down") %>%
  # create a dummy ID
  tidyr::unite(id_dummy, all_of(dummy_cols), sep = "") %>%
  # group by the id
  group_by(id_dummy) %>%
  # get the first row of each
  filter(row_number() == 1) %>%
  ungroup() %>%
  select(all_of(columns))
P.S. this also replaces 1 - NA - NA with 1 - 2 - NA, and 1 - NA - NA with 1 - NA - 3.

Increasingly count the number of times that a certain condition is met (R)

I would like to know how to increasingly count the number of times that a column in my data.frame satisfies a condition. Let's consider a data.frame such as:
x hour count
1 0 NA
2 1 NA
3 2 NA
4 3 NA
5 0 NA
6 1 NA
...
I would like to have this output:
x hour count
1 0 1
2 1 NA
3 2 NA
4 3 NA
5 0 2
6 1 NA
...
With the count column increasing by 1 every time the condition hour == 0 is met.
Is there a smart and efficient way to perform this? Thanks
You can use seq_along on the rows where hour == 0.
i <- x$hour == 0
x$count[i] <- seq_along(x$count[i]) # seq_along(i) would be length nrow(x); index the subset instead
x
# x hour count
#1 1 0 1
#2 2 1 NA
#3 3 2 NA
#4 4 3 NA
#5 5 0 2
#6 6 1 NA
Data:
x <- structure(list(x = 1:6, hour = c(0L, 1L, 2L, 3L, 0L, 1L), count = c(NA,
NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-6L))
You can use cumsum to count the incremental number of 0 occurrences, then replace the counts where the hour value is not 0 with NA.
library(dplyr)
df %>%
  mutate(count = cumsum(hour == 0),
         count = replace(count, hour != 0, NA))
# x hour count
#1 1 0 1
#2 2 1 NA
#3 3 2 NA
#4 4 3 NA
#5 5 0 2
#6 6 1 NA
data
df <- structure(list(x = 1:6, hour = c(0L, 1L, 2L, 3L, 0L, 1L)),
class = "data.frame", row.names = c(NA, -6L))
Using data.table
library(data.table)
setDT(df)[hour == 0, count := seq_len(.N)]
df
# x hour count
#1: 1 0 1
#2: 2 1 NA
#3: 3 2 NA
#4: 4 3 NA
#5: 5 0 2
#6: 6 1 NA
data
df <- structure(list(x = 1:6, hour = c(0L, 1L, 2L, 3L, 0L, 1L)),
class = "data.frame", row.names = c(NA, -6L))
