I have a data frame with data that looks like this:
Part Number Vendor Name Position Repair
123 ABC 1 2
NA <NA> 2 4
NA <NA> 3 1
NA <NA> 4 5
NA <NA> 5 6
NA <NA> 6 3
123 XYZ 1 4
NA <NA> 2 5
NA <NA> 3 7
NA <NA> 4 1
NA <NA> 5 2
NA <NA> 6 3
NA <NA> 7 6
The data are grouped by part number and vendor name. Within each group, I want to find the first row where Position > 3 and Repair == 1, and retrieve that row and all subsequent rows.
For example, for Part Number = 123 and Vendor Name = ABC, Repair == 1 occurs at Position = 3, so all rows that belong to that group should be excluded.
For Part Number = 123 and Vendor Name = XYZ, Repair == 1 occurs at Position = 4, so the 4th, 5th, 6th and 7th rows should be retrieved.
In short: once a row satisfies Position > 3 and Repair == 1, keep it and every subsequent row in that group.
Sample data:
Input <- structure(list(`Part Number` = c(123, NA, NA, NA, NA, NA, 123,
NA, NA, NA, NA, NA, NA), `Vendor Name` = c("ABC", NA, NA, NA,
NA, NA, "XYZ", NA, NA, NA, NA, NA, NA), Position = c(1, 2, 3,
4, 5, 6, 1, 2, 3, 4, 5, 6, 7), Repair = c(2, 4, 1, 5, 6, 3, 4,
5, 7, 1, 2, 3, 6)), .Names = c("Part Number", "Vendor Name", "Position",
"Repair"), row.names = c(NA, -13L), class = c("tbl_df", "tbl",
"data.frame"))
I've tried the following but it hasn't resulted in what I wanted:
output_table <- Input %>% group_by(`Part Number`,`Vendor Name`) %>%
mutate(rn=row_number()) %>% filter(rn>=which(pivot$Repair==1)) # Here I'm able to filter the rows from repair==1 onward, but how do I exclude the rows that don't meet the conditions above?
output_table <- Input[Input$Position >3 & Input$Repair==1,] # gives me rows matching the condition but I need subsequent rows once the condition is met
Your format seems geared towards presentation (reports) rather than data processing. Any processing like this should really be done before you do things like remove repeating labels for visual grouping.
Ultimately, the only part you need here within group_by is the use of cumany. The rest of the mutating code is to accommodate the NA fields.
Input %>%
# assuming order is "safe to assume"
mutate_at(vars(`Part Number`, `Vendor Name`), zoo::na.locf) %>%
group_by(`Part Number`,`Vendor Name`) %>%
filter(cumany(Position > 3 & Repair == 1)) %>%
# return the first two columns to NA
mutate(toprow = row_number() == 1L) %>%
ungroup() %>%
mutate_at(vars(`Part Number`, `Vendor Name`), ~ if_else(toprow, ., .[NA])) %>%
select(-toprow)
# # A tibble: 4 x 4
# `Part Number` `Vendor Name` Position Repair
# <dbl> <chr> <dbl> <dbl>
# 1 123 XYZ 4 1
# 2 NA <NA> 5 2
# 3 NA <NA> 6 3
# 4 NA <NA> 7 6
If you are doing more processing on the data, I'd suggest you don't undo "dragging the labels down", instead just doing:
Input %>%
# assuming order is "safe to assume"
mutate_at(vars(`Part Number`, `Vendor Name`), zoo::na.locf) %>%
group_by(`Part Number`,`Vendor Name`) %>%
filter(cumany(Position > 3 & Repair == 1)) %>%
ungroup()
# # A tibble: 4 x 4
# `Part Number` `Vendor Name` Position Repair
# <dbl> <chr> <dbl> <dbl>
# 1 123 XYZ 4 1
# 2 123 XYZ 5 2
# 3 123 XYZ 6 3
# 4 123 XYZ 7 6
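To see what cumany contributes on its own, here is a minimal sketch that applies it to bare vectors. The Position and Repair values are copied from the XYZ group in the question; the names hit and keep are only illustrative.

```r
library(dplyr)

# Position and Repair for the XYZ group, copied from the question
position <- c(1, 2, 3, 4, 5, 6, 7)
repair   <- c(4, 5, 7, 1, 2, 3, 6)

# TRUE only on the row that meets the condition
hit <- position > 3 & repair == 1

# cumany() stays TRUE from the first TRUE onward
keep <- cumany(hit)

position[keep]
```

Inside filter() after group_by(), the same logic restarts for each group, so a group that never meets the condition is dropped entirely.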
With dplyr and tidyr you can do this as follows:
library(dplyr)
library(tidyr)
Input %>%
fill(`Part Number`, `Vendor Name`) %>% # fill down missing values
group_by(`Part Number`, `Vendor Name`) %>% # group by `Part Number` & `Vendor Name`
filter( cumsum(Position>3 & Repair==1) >= 1) # select only rows where the cumulative sum of true/false condition >= 1
Output for that should be what you are looking for:
# A tibble: 4 x 4
`Part Number` `Vendor Name` Position Repair
<dbl> <chr> <dbl> <dbl>
1 123 XYZ 4 1
2 123 XYZ 5 2
3 123 XYZ 6 3
4 123 XYZ 7 6
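If extra packages are not an option, roughly the same filter can be sketched in base R with ave(), which applies a function within groups. This assumes the labels have already been filled down; the vendor, position and repair vectors below are the sample data in that filled-down form, and the other names are illustrative.

```r
# Sample data from the question, with the vendor label already filled down
vendor   <- c(rep("ABC", 6), rep("XYZ", 7))
position <- c(1:6, 1:7)
repair   <- c(2, 4, 1, 5, 6, 3, 4, 5, 7, 1, 2, 3, 6)

# TRUE only on rows that meet the condition
hit <- position > 3 & repair == 1

# Running count of hits within each vendor; >= 1 means "at or after the first hit"
keep <- ave(as.integer(hit), vendor, FUN = cumsum) >= 1

data.frame(vendor, position, repair)[keep, ]
```

With more than one grouping column, further grouping vectors can be passed to ave() before the FUN argument.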
I have a dataset in which I want to convert any duplicates across columns to NA. I've found answers that help with looking for duplicates in just one column, and I've found ways to remove duplicates entirely (e.g., distinct()). Instead, I have this data:
library(dplyr)
test <- tibble(job = c(1:6),
name = c("j", "j", "j", "c", "c", "c"),
id = c(1, 1, 2, 1, 5, 1))
And want this result:
library(dplyr)
answer <- tibble(job = c(1:6),
name = c("j", NA, "j", "c", "c", NA),
id = c(1, NA, 2, 1, 5, NA))
And I've tried a solution like this using duplicated(), but it fails:
#Attempted solution
library(dplyr)
test %>%
mutate_at(vars(id, name), ~case_when(
duplicated(id, name) ~ NA,
TRUE ~ .
))
I'd prefer to use tidy solutions, but I can be flexible as long as the answer can be piped.
We could create a helper and then identify duplicates and replace them with NA in an ifelse statement using across:
library(dplyr)
test %>%
mutate(helper = paste(id, name)) %>%
mutate(across(c(name, id), ~ifelse(duplicated(helper), NA, .)), .keep="unused")
job name id
<int> <chr> <dbl>
1 1 j 1
2 2 NA NA
3 3 j 2
4 4 c 1
5 5 c 5
6 6 NA NA
If we want to convert to NA, create a helper column that combines all the relevant columns with paste or unite, and then mutate with across:
library(dplyr)
library(tidyr)
test %>%
unite(full_nm, -job, remove = FALSE) %>%
mutate(across(-c(job, full_nm), ~ replace(.x, duplicated(full_nm), NA))) %>%
select(-full_nm)
Output:
# A tibble: 6 × 3
job name id
<int> <chr> <dbl>
1 1 j 1
2 2 <NA> NA
3 3 j 2
4 4 c 1
5 5 c 5
6 6 <NA> NA
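As a possible alternative to building a helper column, duplicated() also accepts a data frame and then compares whole rows. A minimal base R sketch under that assumption, using the test data from the question as a plain data.frame rather than a tibble (the name dup is illustrative):

```r
# Data from the question, as a plain data.frame
test <- data.frame(job = 1:6,
                   name = c("j", "j", "j", "c", "c", "c"),
                   id = c(1, 1, 2, 1, 5, 1),
                   stringsAsFactors = FALSE)

# duplicated() on a data frame flags rows that repeat an earlier (name, id) pair
dup <- duplicated(test[c("name", "id")])

# Blank out both columns on the repeated rows only
test[dup, c("name", "id")] <- NA
test
```

This is not pipe-friendly as written; wrapping the two steps in a small function would be one way to use it in a pipeline.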
I’ve got this data:
tribble(
~ranges, ~last,
0, NA,
1, NA,
1, NA,
1, NA,
1, NA,
2, NA,
2, NA,
2, NA,
3, NA,
3, NA
)
and I want to fill the last column only in the last row of each run of identical values in the ranges column. That is, it should look like this:
tribble(
~ranges, ~last,
0, 0,
1, NA,
1, NA,
1, NA,
1, 1,
2, NA,
2, NA,
2, 2,
3, NA,
3, 3
)
So far I came up with a row-wise approach:
for (r in seq.int(max(tmp$ranges))) {
print(r)
range <- which(tmp$ranges == r) |> max()
tmp$last[range] <- r
}
The main issue is that it is terribly slow. I am looking for a vectorized approach to this issue. Any creative solution out there?
Here's a dplyr solution:
library(dplyr)
tmp %>%
group_by(ranges) %>%
mutate(
last = case_when(row_number() == n() ~ ranges, TRUE ~ NA_real_)
) %>%
ungroup()
# # A tibble: 10 × 2
# ranges last
# <dbl> <dbl>
# 1 0 0
# 2 1 NA
# 3 1 NA
# 4 1 NA
# 5 1 1
# 6 2 NA
# 7 2 NA
# 8 2 2
# 9 3 NA
# 10 3 3
Or we could do something clever with base R for the same result. Here we calculate the difference of ranges to identify when the next row is different (i.e., the last of a group). We then stick a TRUE on the end so the last row is included. This assumes your data is already sorted by ranges.
tmp$last = ifelse(c(diff(tmp$ranges) != 0, TRUE), tmp$ranges, NA)
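To make the diff() trick easier to follow, here is a small sketch that spells out the intermediate vector, using the ranges values from the question. It also shows duplicated(..., fromLast = TRUE) as one possible variant that does not rely on the data being sorted; the variable names are illustrative.

```r
ranges <- c(0, 1, 1, 1, 1, 2, 2, 2, 3, 3)

# A non-zero difference means the next row starts a new value;
# the appended TRUE marks the final row as the end of its run
is_last <- c(diff(ranges) != 0, TRUE)

# Keep the value on "last" rows, NA elsewhere
last <- ifelse(is_last, ranges, NA)

# Variant: flag the last occurrence of each value, regardless of order
is_last_any_order <- !duplicated(ranges, fromLast = TRUE)

last
```

For sorted input both flags agree; for unsorted input the diff() version marks the end of every run, while the duplicated() version marks only the final occurrence of each value.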
Using replace:
library(dplyr)
df %>%
group_by(ranges) %>%
mutate(last = replace(last, n(), ranges[n()]))
Using ifelse:
library(dplyr)
df %>%
group_by(ranges) %>%
mutate(last = ifelse(row_number() == n(), ranges, NA))
Using tail:
library(dplyr)
df %>%
group_by(ranges) %>%
mutate(last = c(last[-n()], tail(ranges, 1)))
output
ranges last
<dbl> <dbl>
1 0 0
2 1 NA
3 1 NA
4 1 NA
5 1 1
6 2 NA
7 2 NA
8 2 2
9 3 NA
10 3 3
I have the following example data:
Example <- data.frame(col1 =c(1, NA, NA, 4, NA, NA, 6, NA, NA, NA, 6, 8, NA, 2, NA))
col1
1
NA
NA
4
NA
NA
6
NA
NA
NA
6
8
NA
2
NA
I want to fill the NAs with value from above, but only if the NAs are between 2 identical values. In this example the first NA gap from 1 to 4 should not be filled with 1s. But the gap between the first 6 and the second 6 should be filled, with 6s. All other values should stay NA.
Therefore, afterwards it should look like:
col1
1
NA
NA
4
NA
NA
6
6
6
6
6
8
NA
2
NA
In reality I have not just 15 observations but over 50,000, so I need an efficient solution, which is proving more difficult than I thought. I tried to use the fill function but was not able to come up with a solution.
One dplyr and zoo option could be:
library(dplyr)
library(zoo)
df <- Example # the data frame from the question
df %>%
mutate(cond = na.locf0(col1) == na.locf0(col1, fromLast = TRUE),
col1 = ifelse(cond, na.locf0(col1), col1)) %>%
select(-cond)
col1
1 1
2 NA
3 NA
4 4
5 NA
6 NA
7 6
8 6
9 6
10 6
11 6
12 8
13 NA
14 2
15 NA
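The comparison of the two fill directions can be made visible with a short sketch on the bare vector. It uses zoo::na.locf0 as in the answer above; the names fwd, bwd, same and filled are illustrative, and the explicit is.na() checks are an added precaution so that the flag is never NA.

```r
library(zoo)

col1 <- c(1, NA, NA, 4, NA, NA, 6, NA, NA, NA, 6, 8, NA, 2, NA)

fwd <- na.locf0(col1)                   # nearest non-NA value above
bwd <- na.locf0(col1, fromLast = TRUE)  # nearest non-NA value below

# A gap is filled only when the values on both sides agree
same <- !is.na(fwd) & !is.na(bwd) & fwd == bwd

filled <- ifelse(same, fwd, col1)
filled
```

Both fills are single vectorized passes, so this should scale comfortably to the 50,000 rows mentioned in the question.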
Here is a dplyr solution:
First I create the data in tibble format:
df <- tibble(
x = c(1, NA_real_, NA_real_,
4, NA_real_, NA_real_,
6, NA_real_, NA_real_, NA_real_,
6, 8, NA_real_, 2, NA_real_)
)
Next, I create two grouping variables which will be helpful in identifying the first and the last non-NA value.
I then save these reference values to ref_start and ref_end.
In the end I overwrite the values of x:
df %>%
mutate(gr1 = cumsum(!is.na(x))) %>%
group_by(gr1) %>%
mutate(ref_start = first(x)) %>%
ungroup() %>%
mutate(gr2 = lag(gr1, default = 1)) %>%
group_by(gr2) %>%
mutate(ref_end = last(x)) %>%
ungroup() %>%
mutate(x = if_else(is.na(x) & ref_start == ref_end, ref_start, x))
# A tibble: 15 x 1
x
<dbl>
1 1
2 NA
3 NA
4 4
5 NA
6 NA
7 6
8 6
9 6
10 6
11 6
12 8
13 NA
14 2
15 NA
df <- data.frame(col1 =c(1, NA, NA, 4, NA, NA, 6, NA, NA, NA, 6, 8, NA, 2, NA))
library(data.table)
library(magrittr)
setDT(df)[!is.na(col1), n := .N, by = col1] %>%
.[, n := nafill(n, type = "locf")] %>%
.[n == 2, col1 := nafill(col1, type = "locf")] %>%
.[, n := NULL] %>%
.[]
#> col1
#> 1: 1
#> 2: NA
#> 3: NA
#> 4: 4
#> 5: NA
#> 6: NA
#> 7: 6
#> 8: 6
#> 9: 6
#> 10: 6
#> 11: 6
#> 12: 8
#> 13: NA
#> 14: 2
#> 15: NA
Created on 2021-10-11 by the reprex package (v2.0.1)
Here is a tidyverse approach using dplyr and tidyr:
Logic:
1. Create an id column
2. Remove all NA rows
3. Flag if the next value is the same
4. right_join with the original Example df
5. Fill down the flag and the corresponding col1.y
6. mutate with an ifelse
library(dplyr)
library(tidyr)
Example <- Example %>%
mutate(id=row_number())
Example %>%
na.omit() %>%
mutate(flag = ifelse(col1==lead(col1), TRUE, FALSE)) %>%
right_join(Example, by="id") %>%
arrange(id) %>%
fill(col1.y, .direction="down") %>%
fill(flag, .direction="down") %>%
mutate(col1.x = ifelse(flag==TRUE, col1.y, col1.x), .keep="unused") %>%
select(col1 = col1.x)
Output:
col1
1 1
2 NA
3 NA
4 4
5 NA
6 NA
7 6
8 6
9 6
10 6
11 6
12 8
13 NA
14 2
15 NA
The data.table solution above (from Yuriy Saraykin) works only for the example. As Daniel Hendrick commented: "It seems the NAs get filled beyond the beginning and ending values, where the fill should really end. For example, if the data were (6, NA, NA, 6, NA, 8), your dplyr solution would give (6, 6, 6, 6, 6, 8)."
Here is another proposal with data.table:
library(data.table)
df <- data.table(col1 =c(1, NA, NA, 4, NA, NA, 6, NA, NA, NA, 6, NA, NA, 8, NA, 2, NA))
cond <- nafill(df$col1, type = "locf") == nafill(df$col1, type = "nocb")
idx <- which(cond) # rows where the nearest non-NA values on both sides agree
df[idx, col1 := nafill(df$col1, type = "locf")[idx]]
df$col1
[1] 1 NA NA 4 NA NA 6 6 6 6 6 NA NA 8 NA 2 NA
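As a quick check, the same two-sided comparison can be run on the counter-example from the comment, (6, NA, NA, 6, NA, 8). This is only a sketch on a bare vector, using data.table::nafill with the same type arguments as above; the names fwd, bwd and same are illustrative.

```r
library(data.table)

x <- c(6, NA, NA, 6, NA, 8)

fwd <- nafill(x, type = "locf")  # carry the last observation forward
bwd <- nafill(x, type = "nocb")  # carry the next observation backward

# Positions where both directions agree, i.e. inside a gap bounded by equal values
same <- which(fwd == bwd)
x[same] <- fwd[same]
x
```

The NA between 6 and 8 is left alone because the forward and backward fills disagree there.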
I have a column like this with some duplicated values:
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1), date = c(NA, NA,
NA, "2011/01/01", "2011/02/01", "2012/01/01", "2012/01/01", "2012/05/01"
)), class = "data.frame", row.names = c(NA, -8L))
I want to keep only one of the duplicated values, like this
structure(list(id2 = c(1, 1, 1, 1, 1, 1, 1),
date2 = c(NA, NA, NA, "2011/01/01", "2011/02/01", "2012/01/01", "2012/05/01")),
class = "data.frame", row.names = c(NA, -7L))
Depending on what you want exactly there are multiple alternatives:
dat %>%
filter(!duplicated(date))
gives
id date
1 1 <NA>
2 1 2011/01/01
3 1 2011/02/01
4 1 2012/01/01
As someone else also suggested, this gives the same result as
dat %>% distinct(date, .keep_all = TRUE)
Unlike that answer, I passed a column to distinct, as I assumed you only want to remove the duplicated dates, not necessarily duplicates in other columns (.keep_all is then necessary to keep those other columns).
However, it is unclear to me whether you want to keep all the NAs. If so, you need to add the NA rows back.
If you want all the NAs, you could for example do:
dat %>%
filter(!is.na(date) & !duplicated(date)) %>%
bind_rows(dat %>% filter(is.na(date)))
which gives
id date
1 1 2011/01/01
2 1 2011/02/01
3 1 2012/01/01
4 1 <NA>
5 1 <NA>
6 1 <NA>
Although there probably is a nicer way to do this.
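One candidate for that nicer way is a single logical condition that keeps every NA row plus the first occurrence of each non-NA date, which also preserves the original row order. A minimal base R sketch, assuming dat is the data frame from the question (rebuilt here so the block runs on its own; keep is an illustrative name):

```r
# Data from the question
dat <- data.frame(id = rep(1, 8),
                  date = c(NA, NA, NA, "2011/01/01", "2011/02/01",
                           "2012/01/01", "2012/01/01", "2012/05/01"),
                  stringsAsFactors = FALSE)

# Keep a row if its date is NA, or if the date has not been seen before
keep <- is.na(dat$date) | !duplicated(dat$date)
dat[keep, ]
```

The same condition can be used in a pipeline as dat %>% filter(is.na(date) | !duplicated(date)).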
Edit:
If you want to keep the entries but only want to make the duplicated values NA you can use the duplicated function this way:
dat %>%
mutate(
date1 = case_when(
duplicated(date) ~ NA_character_,
TRUE ~ date
)
)
I generally prefer case_when over if_else due to its readability. But in this case it would be the same.
It results in
id date date1
1 1 <NA> <NA>
2 1 <NA> <NA>
3 1 <NA> <NA>
4 1 2011/01/01 2011/01/01
5 1 2011/02/01 2011/02/01
6 1 2012/01/01 2012/01/01
7 1 2012/01/01 <NA>
8 1 2012/05/01 2012/05/01
I created an extra column for this example. But you could simply overwrite the date column in your actual analysis.
You can use dplyr::distinct:
library(tidyverse)
df <- structure(list(id = c(1, 1, 1, 1, 1, 1), date = c(NA, NA, NA,
"2011/01/01", "2011/02/01", "2012/01/01")), row.names = c(NA, 6L), class = "data.frame")
df
#> id date
#> 1 1 <NA>
#> 2 1 <NA>
#> 3 1 <NA>
#> 4 1 2011/01/01
#> 5 1 2011/02/01
#> 6 1 2012/01/01
df %>%
distinct()
#> id date
#> 1 1 <NA>
#> 2 1 2011/01/01
#> 3 1 2011/02/01
#> 4 1 2012/01/01
I have a dataset where I have to fill NA values using the previous value and a sum of current value in another column. Basically, my data looks like
library(lubridate)
library(tidyverse)
library(zoo)
df <- tibble(
Id = c(1, 1, 1, 1, 2, 2, 2, 2),
Time = ymd(c("2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04", "2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04")),
av = c(18, NA, NA, NA, 21, NA, NA, NA),
Value = c(121, NA,NA, NA, 146, NA, NA, NA)
)
# A tibble: 8 x 4
Id Time av Value
<dbl> <date> <dbl> <dbl>
1 2012-09-01 18 121
1 2012-09-02 NA NA
1 2012-09-03 NA NA
1 2012-09-04 NA NA
2 2012-09-01 21 146
2 2012-09-02 NA NA
2 2012-09-03 NA NA
2 2012-09-04 NA NA
What I want to do is this: where Value is NA, replace it with the sum of the previous Value and the current value of av. If av is NA, it can be replaced with the previous value. For av, I use the na.locf function from the zoo package:
df1 <- df %>% arrange(Id, Time) %>% group_by(Id) %>%
mutate(av = zoo::na.locf(av))
However, filling in Value seems to be more difficult. I can do it using a for loop:
# Back up the Value column for testing
df1$Value_backup <- df1$Value
for(i in 2:nrow(df1))
{
df1$Value[i] <- ifelse(is.na(df1$Value[i]), df1$av[i] + df1$Value[i-1], df1$Value[i])
}
This produces the result I want, but for a large dataset I believe there are better ways to do it in R. I tried the complete function from tidyr, but it adds two additional rows:
df1 <- df %>% arrange(Id, Time) %>% group_by(Id) %>% mutate(av = zoo::na.locf(av)) %>%
mutate(num_rows = n()) %>%
complete(nesting(Id), Value = seq(min(Value, na.rm = TRUE),
(min(Value, na.rm = TRUE) + max(num_rows) * min(na.omit(av))), min(na.omit(av))))
The output has two extra rows; 10 instead of 8
# A tibble: 10 x 5
# Groups: Id [2]
Id Value Time av num_rows
<dbl> <dbl> <date> <dbl> <int>
1 121 2012-09-01 18 4
1 139 NA NA NA
1 157 NA NA NA
1 175 NA NA NA
1 193 NA NA NA
2 146 2012-09-01 21 4
2 167 NA NA NA
2 188 NA NA NA
2 209 NA NA NA
2 230 NA NA NA
Any help to do it faster without loops would be greatly appreciated.
In the question av starts with a non-NA in each group and is followed by NAs so if this is the general pattern then this will work. Note that it is good form to close any group_by with ungroup; however, we did not do that below so that we could compare df2 with df1.
df2 <- df %>%
group_by(Id) %>%
mutate(Value_backup = Value,
av = first(av),
Value = first(Value) + cumsum(av) - av)
identical(df1, df2)
## [1] TRUE
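The expression first(Value) + cumsum(av) - av may look opaque, so here is the arithmetic for a single group as a small sketch. The numbers are those of Id == 1 in the question after av has been filled; first_value and value are illustrative names.

```r
# Group Id == 1 from the question: the first Value is 121 and av is 18 once filled
av          <- rep(18, 4)
first_value <- 121

# cumsum(av) is 18, 36, 54, 72; subtracting av shifts it to 0, 18, 36, 54,
# so the first row keeps its original Value and each later row adds one more av
value <- first_value + cumsum(av) - av
value
```

This matches the loop only under the stated assumption that each group has a single starting Value followed by NAs.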
Note
For reproducibility first run this (taken from question except we only load needed packages):
library(dplyr)
library(tibble)
library(lubridate)
df <- tibble(
Id = c(1, 1, 1, 1, 2, 2, 2, 2),
Time = ymd(c("2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04",
"2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04")),
av = c(18, NA, NA, NA, 21, NA, NA, NA),
Value = c(121, NA,NA, NA, 146, NA, NA, NA)
)
df1 <- df %>% arrange(Id, Time) %>% group_by(Id) %>%
mutate(av = zoo::na.locf(av))
df1$Value_backup <- df1$Value
for(i in 2:nrow(df1))
{
df1$Value[i] <- ifelse(is.na(df1$Value[i]), df1$av[i] + df1$Value[i-1], df1$Value[i])
}