Conditional replacement while matching on a variable - r

I want to replace the NA values for observations within a particular sub-group, but the sequence of the observations in that group is not ordered properly. So I am wondering if there exists some dplyr or plyr command that would allow me to replace missing values in a column belonging to one dataframe using the values from the same column from another dataframe while matching on the values of that "key" column.
Here's what I got. Hope someone could shed light on this. Thanks.
## data frame that contains missing values in "diff" column
df <- data.frame(type = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3),
diff = c(0.1, 0.3, NA, NA, NA, NA, NA, 0.2, 0.7, NA, 0.5, NA),
name = c("A", "B", "C", "D", "E", "A", "B", "C", "F", "A", "B", "C"))
## replace with values from this smaller data frame
df2 <- data.frame(diff_rep = c(0.3, 0.2, 0.4), name = c("A", "B", "C"))
## replace using ifelse
df$diff <- ifelse(is.na(df$diff) & (df$type == 2), df2$diff_rep , df$diff)
df
type diff name
1 1 0.1 A
2 1 0.3 B
3 1 NA C
4 2 0.3 D
5 2 0.2 E
6 2 0.4 A
7 2 0.3 B
8 2 0.2 C
9 2 0.7 F
10 3 NA A
11 3 0.5 B
12 3 NA C
## desired output
type diff name
1 1 0.1 A
2 1 0.3 B
3 1 NA C
4 2 NA D
5 2 NA E
6 2 0.3 A
7 2 0.2 B
8 2 0.4 C
9 2 0.7 F
10 3 NA A
11 3 0.5 B
12 3 NA C

Assuming row 9 is a mistake, you can use a left join first and then use if_else() and coalesce() to get your desired result. coalesce() returns the first non-missing value.
library(dplyr)
left_join(df, df2, by = "name") %>%
  mutate(diff_wanted = if_else(type == 2,
                               coalesce(diff, diff_rep),
                               diff),
         diff_wanted = ifelse(name %in% df2$name,
                              diff_wanted,
                              NA)) %>%
  select(type, diff_wanted, name)
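The ifelse() attempt in the question goes wrong because df2$diff_rep is recycled by position across all 12 rows instead of being looked up by name. As an alternative to the join, a minimal base R sketch could use match() for that lookup. It assumes each name appears at most once in df2 and starts from the original df, before the ifelse() line was run; the name to_fill is only illustrative.

```r
# Same data as in the question, before any replacement
df <- data.frame(
  type = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3),
  diff = c(0.1, 0.3, NA, NA, NA, NA, NA, 0.2, 0.7, NA, 0.5, NA),
  name = c("A", "B", "C", "D", "E", "A", "B", "C", "F", "A", "B", "C")
)
df2 <- data.frame(diff_rep = c(0.3, 0.2, 0.4), name = c("A", "B", "C"))

# Rows to fill: missing diff within type 2 (to_fill is an illustrative name)
to_fill <- is.na(df$diff) & df$type == 2

# match() gives the position of each name in df2$name (NA when absent),
# so names without a replacement (D, E) stay NA
df$diff[to_fill] <- df2$diff_rep[match(df$name[to_fill], df2$name)]

df
```

Values that were not missing to begin with, such as 0.2 in row 8 and 0.7 in row 9, are left untouched.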


Logic for filtering dependent on two columns

I am struggling to write the right logic to filter two columns based only on the condition in one column. I have multiple ids and if an id appears in 2020, I want all the data for the other years that id was measured to come along.
As an example, if a group contains the number 3, I want all the values in that group. We should end up with a dataframe with all the b and d rows.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
threes <- df4 %>%
filter(pop == 3 |&ifelse????
A bit slower than the other answers here (more steps involved), but for me a bit clearer:
df4 %>%
filter(pop == 3) %>%
distinct(group) %>%
pull(group) -> groups
df4 %>%
filter(group %in% groups)
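or, if you want to combine the two steps (the inner pipeline must be wrapped in parentheses, because %in% and %>% have the same precedence and are evaluated left to right):
df4 %>%
  filter(group %in% (df4 %>%
                       filter(pop == 3) %>%
                       distinct(group) %>%
                       pull(group)))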
You can do:
df4[df4$group %in% df4$group[df4$pop == 3],]
#> group pop value
#> 6 b 1 2.0
#> 7 b 2 3.0
#> 8 b 3 4.0
#> 9 b 4 3.5
#> 10 b 5 3.0
#> 16 d 1 0.5
#> 17 d 2 1.5
#> 18 d 3 6.0
#> 19 d 4 2.0
#> 20 d 5 1.5
You can do this by combining dplyr's group_by(), filter() and any(). any() returns TRUE if the condition holds for at least one row, and group_by() makes filter() evaluate that condition separately within each group.
Follow these steps:
First, pipe the data to group_by() to group by your group variable.
Then, pipe to filter() and use any() to keep every group in which at least one pop value equals 3.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
# load the library
library(dplyr)
threes <- df4 %>%
group_by(group) %>%
filter(any(pop == 3))
# print the result
threes
Output:
threes
# A tibble: 10 x 3
# Groups: group [2]
group pop value
<chr> <dbl> <dbl>
1 b 1 2
2 b 2 3
3 b 3 4
4 b 4 3.5
5 b 5 3
6 d 1 0.5
7 d 2 1.5
8 d 3 6
9 d 4 2
10 d 5 1.5
An easy base R option is using subset + ave
subset(
df4,
ave(pop == 3, group, FUN = any)
)
which gives
group pop value
6 b 1 2.0
7 b 2 3.0
8 b 3 4.0
9 b 4 3.5
10 b 5 3.0
16 d 1 0.5
17 d 2 1.5
18 d 3 6.0
19 d 4 2.0
20 d 5 1.5
Use dplyr:
df4 %>% group_by(group) %>% filter(any(pop == 3))
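The question's real use case (keep every year for any id that appears in 2020) follows the same group_by() + filter(any(...)) pattern. Here is a rough sketch on made-up data; the data frame obs and its columns id, year and value are assumptions for illustration, not taken from the question.

```r
library(dplyr)

# Hypothetical data: obs, id, year and value are illustrative names
obs <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3, 3, 3),
  year  = c(2018, 2019, 2020, 2018, 2019, 2019, 2020, 2021),
  value = c(10, 11, 12, 20, 21, 30, 31, 32)
)

# Keep all rows of every id that has at least one 2020 observation
kept <- obs %>%
  group_by(id) %>%
  filter(any(year == 2020)) %>%
  ungroup()

kept
```

Here id 2 is dropped because it was never measured in 2020, while ids 1 and 3 keep all of their years.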

find column value and name based on minimum value in other column

I have a data.table that looks like this
library( data.table )
dt <- data.table( p1 = c("a", "b", "c", "d", "e", "f", "g"),
p2 = c("b", "c", "d", "a", "f", "g", "h"),
p3 = c("z", "x", NA, NA, "y", NA, "s"),
t1 = c(1, 2, 3, NA, 5, 6, 7),
t2 = c(7, 6, 5, NA, 3, 2, NA),
t3 = c(8, 3, NA, NA, 2, NA, 1) )
# p1 p2 p3 t1 t2 t3
# 1: a b z 1 7 8
# 2: b c x 2 6 3
# 3: c d <NA> 3 5 NA
# 4: d a <NA> NA NA NA
# 5: e f y 5 3 2
# 6: f g <NA> 6 2 NA
# 7: g h s 7 NA 1
It has p-columns, representing names, and t-columns, representing values.
t1 is the value corresponding to p1, t2 to p2, etc..
On each row, values of p-columns are unique (or NA). The same goes for the values in the t-columns.
What I want to do is to create three new columns:
t_min, the minimum value of all t-columns for each row (exclude NA's)
p_min, if t_min exists (is not NA), the corresponding value of the p-column... so if the t2-column has the t-min value, the corresponding value of column p2.
p_col_min, the name of the column that holds the value of p_min. So if the p_min value comes from column p2, then "p2".
I prefer a data.table solution, since my actual data contains many more rows and columns. I know melting is an option, but I would like to keep memory use low (the production data contains several million rows and more than 200 columns).
So far I've found a way to create the t_min-column using the following:
t_cols = dt[ , .SD, .SDcols = grep( "t[1-3]", names( dt ), value = TRUE ) ]
dt[ !all( is.na( t_cols ) ),
t_min := do.call( pmin, c( .SD, list( na.rm = TRUE ) ) ),
.SDcols = names( t_cols ) ]
But I cannot wrap my head around creating the p_min and p_col_min columns. I suppose which.min() comes into play somewhere, but I cannot figure it out. Probably something simple I'm overlooking (it always seems to be.. ;-) ).
desired output
dt.desired <- data.table( p1 = c("a", "b", "c", "d", "e", "f", "g"),
p2 = c("b", "c", "d", "a", "f", "g", "h"),
p3 = c("z", "x", NA, NA, "y", NA, "s"),
t1 = c(1, 2, 3, NA, 5, 6, 7),
t2 = c(7, 6, 5, NA, 3, 2, NA),
t3 = c(8, 3, NA, NA, 2, NA, 1),
t_min = c(1,2,3,NA,2,2,1),
p_min = c("a","b","c",NA,"y","g","s"),
p_col_min = c("p1","p1","p1",NA,"p3","p2","p3") )
# p1 p2 p3 t1 t2 t3 t_min p_min p_col_min
# 1: a b z 1 7 8 1 a p1
# 2: b c x 2 6 3 2 b p1
# 3: c d <NA> 3 5 NA 3 c p1
# 4: d a <NA> NA NA NA NA <NA> <NA>
# 5: e f y 5 3 2 2 y p3
# 6: f g <NA> 6 2 NA 2 g p2
# 7: g h s 7 NA 1 1 s p3
I cannot guarantee whether this is a solution efficient enough for your working data, but this is what I would try first:
m1 <- as.matrix(dt[, grep('^t', names(dt)), with = FALSE])
m2 <- as.matrix(dt[, grep('^p', names(dt)), with = FALSE])
t_min <- apply(m1, 1, min, na.rm = TRUE)
t_min[is.infinite(t_min)] <- NA_real_
p_min_index <- rep(NA_integer_, length(t_min))
p_min_index[!is.na(t_min)] <- apply(m1[!is.na(t_min), ], 1, which.min)
dt[, t_min := t_min]
dt[, p_min := m2[cbind(seq_len(nrow(m2)), p_min_index)] ]
dt[, p_min_col := grep('^p', names(dt), value = TRUE)[p_min_index] ]
# p1 p2 p3 t1 t2 t3 t_min p_min p_min_col
# 1: a b z 1 7 8 1 a p1
# 2: b c x 2 6 3 2 b p1
# 3: c d <NA> 3 5 NA 3 c p1
# 4: d a <NA> NA NA NA NA <NA> <NA>
# 5: e f y 5 3 2 2 y p3
# 6: f g <NA> 6 2 NA 2 g p2
# 7: g h s 7 NA 1 1 s p3
In addition, it looks like the 2nd row in your desired output is incorrect?
A simple and efficient approach is to loop through the "t*" columns and track all respective values in a single pass.
First initialize appropriate vectors:
p.columns = which(startsWith(names(dt), "p"))
t.columns = which(startsWith(names(dt), "t"))
p_col_min = integer(nrow(dt))
p_min = character(nrow(dt))
t_min = rep_len(Inf, nrow(dt))
and iterate while updating:
for(i in seq_along(p.columns)) {
cur.min = which(dt[[t.columns[i]]] < t_min)
p_col_min[cur.min] = p.columns[i]
t_min[cur.min] = dt[[t.columns[i]]][cur.min]
p_min[cur.min] = dt[[p.columns[i]]][cur.min]
}
Finally fill with NAs where needed:
whichNA = is.infinite(t_min)
is.na(t_min) = is.na(p_min) = is.na(p_col_min) = whichNA
t_min
#[1] 1 2 3 NA 2 2 1
p_min
#[1] "a" "b" "c" NA "y" "g" "s"
p_col_min
#[1] 1 1 1 NA 3 2 3
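Note that the loop leaves p_col_min as integer column positions rather than column names. A self-contained sketch of the same single pass, ending with a conversion to names and attaching the results to dt by reference, might look like this. The use of data.table's set() for the final step and the name no_min are my additions, not part of the answer above.

```r
library(data.table)

dt <- data.table(p1 = c("a", "b", "c", "d", "e", "f", "g"),
                 p2 = c("b", "c", "d", "a", "f", "g", "h"),
                 p3 = c("z", "x", NA, NA, "y", NA, "s"),
                 t1 = c(1, 2, 3, NA, 5, 6, 7),
                 t2 = c(7, 6, 5, NA, 3, 2, NA),
                 t3 = c(8, 3, NA, NA, 2, NA, 1))

p.columns <- which(startsWith(names(dt), "p"))
t.columns <- which(startsWith(names(dt), "t"))

p_col_min <- integer(nrow(dt))
p_min <- character(nrow(dt))
t_min <- rep_len(Inf, nrow(dt))

# One pass over the t-columns, keeping the running minimum per row;
# which() drops NA comparisons, so missing values are skipped
for (i in seq_along(t.columns)) {
  cur.min <- which(dt[[t.columns[i]]] < t_min)
  p_col_min[cur.min] <- p.columns[i]
  t_min[cur.min] <- dt[[t.columns[i]]][cur.min]
  p_min[cur.min] <- dt[[p.columns[i]]][cur.min]
}

# Rows where every t-value was NA never left Inf (no_min is an illustrative name)
no_min <- is.infinite(t_min)
t_min[no_min] <- NA_real_
p_min[no_min] <- NA_character_
p_col_min[no_min] <- NA_integer_

# Assumption: attach the results by reference with set(), turning positions into names
set(dt, j = "t_min", value = t_min)
set(dt, j = "p_min", value = p_min)
set(dt, j = "p_col_min", value = names(dt)[p_col_min])
dt
```

Only three vectors of length nrow(dt) are allocated, so no melted or matrix copy of the table is needed.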
Here's another route:
dt[, t_min := do.call(pmin, c(.SD, na.rm = TRUE)), .SDcols = patterns('t[[:digit:]]')]
dt[!is.na(t_min),
c('p_min', 'p_min_col') := {
arr_ind = .SD[, which(t_min == .SD, arr.ind = TRUE), .SDcols = patterns('t[[:digit:]]')]
arr_ind = arr_ind[order(arr_ind[, 1]), ]
p_m = .SD[, as.matrix(.SD)[arr_ind], .SDcols = patterns('p')]
p_m_c = grep('^p', names(.SD), value = TRUE)[arr_ind[, 2]]
list(p_m, p_m_c)
}
]
Here is another option:
ri <- dt[, .I[rowSums(is.na(.SD))==ncol(.SD)], .SDcols=t1:t3]
dt[-ri, c("t_min", "p_min", "p_col_min") := {
pmat <- .SD[, .SD, .SDcols=p1:p3]
tmat <- as.matrix(.SD[, .SD, .SDcols=t1:t3])
i <- max.col(-replace(tmat, is.na(tmat), Inf), "first")
y <- cbind(seq_len(.N), i)
.(t_min = tmat[y],
p_min = as.matrix(pmat)[y],
p_col_min = names(pmat)[i])
}]
dt
output:
p1 p2 p3 t1 t2 t3 t_min p_min p_col_min
1: a b z 1 7 8 1 a p1
2: b c x 2 6 3 2 b p1
3: c d <NA> 3 5 NA 3 c p1
4: d a <NA> NA NA NA NA <NA> <NA>
5: e f y 5 3 2 2 y p3
6: f g <NA> 6 2 NA 2 g p2
7: g h s 7 NA 1 1 s p3

Generate unique dyad identifiers for unordered pairs

The dataframe I am working on is coded in dyadic format, where each observation (i.e., row) contains a source node (from) and a target node (to) along with some other dyadic covariates (such as dyadic correlation, corr).
For simplicity's sake, I want to treat each dyad as unordered and generate a unique identifier for each dyad, like the one (i.e., df1) below:
# original data
df <- data.frame(
from = c("A", "A", "A", "B", "C", "A", "D", "E", "F", "B"),
to = c("B", "C", "D", "C", "B", "B", "A", "A", "A", "A"),
corr = c(0.5, 0.7, 0.2, 0.15, 0.15, 0.5, 0.2, 0.45, 0.54, 0.5))
from to corr
1 A B 0.50
2 A C 0.70
3 A D 0.20
4 B C 0.15
5 C B 0.15
6 A B 0.50
7 D A 0.20
8 E A 0.45
9 F A 0.54
10 B A 0.50
# desired format
df1 <- data.frame(
from = c("A", "A", "A", "B", "C", "A", "D", "E", "F", "B"),
to = c("B", "C", "D", "C", "B", "B", "A", "A", "A", "A"),
corr = c(0.5, 0.7, 0.2, 0.15, 0.15, 0.5, 0.2, 0.45, 0.54, 0.5),
dyad = c(1, 2, 3, 4, 4, 1, 3, 5, 6, 1))
from to corr dyad
1 A B 0.50 1
2 A C 0.70 2
3 A D 0.20 3
4 B C 0.15 4
5 C B 0.15 4
6 A B 0.50 1
7 D A 0.20 3
8 E A 0.45 5
9 F A 0.54 6
10 B A 0.50 1
where dyad A-B/B-A, A-D/D-A are treated as identical pairs and are assigned with the same dyad identifiers.
While it's easy to extract a list of un-ordered pairs from the original data, it's hard to map them onto the original dataframe to generate un-ordered dyad identifiers. Could anyone offer some insights on this?
One dplyr option could be:
df %>%
mutate(dyad = group_indices(., paste0(pmax(from, to), pmin(from, to))))
from to corr dyad
1 A B 0.50 1
2 A C 0.70 2
3 A D 0.20 4
4 B C 0.15 3
5 C B 0.15 3
6 A B 0.50 1
7 D A 0.20 4
8 E A 0.45 5
9 F A 0.54 6
10 B A 0.50 1
Or:
df %>%
mutate(dyad = dense_rank(paste0(pmax(from, to), pmin(from, to))))
However, if you need to assign the identifiers in a specific order (meaning that the identifiers hold some information on their own), then the solution from @Ronak Shah could be better for you.
One way using apply is to sort and paste the values in the two columns, convert the result to a factor, and then to an integer to get a unique number for each combination.
df$temp <- apply(df[1:2], 1, function(x) paste(sort(x), collapse = "_"))
df$dyad <- as.integer(factor(df$temp, levels = unique(df$temp)))
df$temp <- NULL
df
# from to corr dyad
#1 A B 0.50 1
#2 A C 0.70 2
#3 A D 0.20 3
#4 B C 0.15 4
#5 C B 0.15 4
#6 A B 0.50 1
#7 D A 0.20 3
#8 E A 0.45 5
#9 F A 0.54 6
#10 B A 0.50 1
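If the row-wise apply() turns out to be slow on a large data frame, a vectorized sketch that still numbers dyads in order of first appearance could combine pmin()/pmax() with match(). This assumes from and to are character columns (not factors), and the name key is only illustrative.

```r
df <- data.frame(
  from = c("A", "A", "A", "B", "C", "A", "D", "E", "F", "B"),
  to   = c("B", "C", "D", "C", "B", "B", "A", "A", "A", "A"),
  corr = c(0.5, 0.7, 0.2, 0.15, 0.15, 0.5, 0.2, 0.45, 0.54, 0.5),
  stringsAsFactors = FALSE  # assumption: keep from/to as character
)

# Order-independent key: smaller label first, larger label second
key <- paste(pmin(df$from, df$to), pmax(df$from, df$to), sep = "_")

# match() against unique() numbers the keys by first appearance
df$dyad <- match(key, unique(key))
df
```

Because unique() preserves the order in which keys first occur, A-B gets 1, A-C gets 2, and so on, as in the desired df1.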

Forward and backward difference between rows with missing values

Here is the sample dataframe:
df <- data.frame(
id = c("A", "A", "A", "A", "B", "B", "B", "B"),
num = c(1, NA, 6, 3, 7, NA , NA, 2))
How do I get the forward and backward differences between rows within each id category? There should be two new columns: one with the difference between the current row and the previous row, and the other with the difference between the current row and the next row. If the previous row is NA, it should calculate the difference between the current row and the closest previous row that contains a real number. The same holds for the forward difference.
Many thanks!!
require(magrittr)
df$backdiff <- c(NA, sapply(2:nrow(df),
function(i){
df$num[i] - df$num[(i-1):1] %>% .[!is.na(.)][1]
}))
df$forward.diff <- c(sapply(2:nrow(df) - 1,
function(i){
df$num[i] - df$num[(i+1):nrow(df)] %>% .[!is.na(.)][1]
}), NA)
One solution is to use the fill() function from tidyr to create two helper columns (one for the previous-value calculation and one for the next-value calculation) in which the NA values are filled in.
df <- data.frame(
id = c("A", "A", "A", "A", "B", "B", "B", "B"),
num = c(1, NA, 6, 3, 7, NA , NA, 2))
library("tidyverse")
df %>% mutate(dup_num_prv = num, dup_num_nxt = num) %>%
group_by(id) %>%
fill(dup_num_prv, .direction = "down") %>%
fill(dup_num_nxt, .direction = "up") %>%
mutate(prev_diff = ifelse(is.na(num), NA, num - lag(dup_num_prv))) %>%
mutate(next_diff = ifelse(is.na(num), NA, num - lead(dup_num_nxt))) %>%
as.data.frame()
# Result is shown in columns 'prev_diff' and 'next_diff'
# id num dup_num_prv dup_num_nxt prev_diff next_diff
#1 A 1 1 1 NA -5
#2 A NA 1 6 NA NA
#3 A 6 6 6 5 3
#4 A 3 3 3 -3 NA
#5 B 7 7 7 NA 5
#6 B NA 7 2 NA NA
#7 B NA 7 2 NA NA
#8 B 2 2 2 -5 NA
Note: There are a few points the OP needs to clarify; the solution can be fine-tuned afterwards. dup_num_prv and dup_num_nxt are kept only to make the steps easier to follow. These columns can be removed.
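For comparison, here is a base R sketch that works within each id and needs no helper columns. It takes differences among the non-NA values only and writes them back to their original positions. The helper names back_diff and forward_diff are hypothetical, and the sketch assumes that rows with a missing num should get NA in both new columns.

```r
df <- data.frame(
  id  = c("A", "A", "A", "A", "B", "B", "B", "B"),
  num = c(1, NA, 6, 3, 7, NA, NA, 2)
)

# Hypothetical helper: current value minus the previous non-NA value
back_diff <- function(x) {
  ok <- !is.na(x)
  out <- rep(NA_real_, length(x))
  out[ok] <- c(NA, diff(x[ok]))
  out
}

# Hypothetical helper: current value minus the next non-NA value
forward_diff <- function(x) {
  ok <- !is.na(x)
  out <- rep(NA_real_, length(x))
  out[ok] <- c(-diff(x[ok]), NA)
  out
}

# ave() applies each helper separately within every id
df$prev_diff <- ave(df$num, df$id, FUN = back_diff)
df$next_diff <- ave(df$num, df$id, FUN = forward_diff)
df
```

On the sample data this gives the same prev_diff and next_diff values as the fill() approach.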

filtering a dataframe in R based on how many elements in a Row are filled out

I have the following data frame (dput at end):
> d
a b d
1 1 NA NA
2 NA NA NA
3 2 2 2
4 3 3 NA
I want to keep only the rows that have at least two items that are not NA. How do I get the following result?
> d
a b d
3 2 2 2
4 3 3 NA
> dput(d)
structure(list(a = c(1, NA, 2, 3), b = c(NA, NA, 2, 3), d = c(NA,
NA, 2, NA)), .Names = c("a", "b", "d"), row.names = c(NA, -4L
), class = "data.frame")
We can get the rowSums of the logical matrix (is.na(d)), use that to create a logical vector (..<2) to subset the rows.
d[rowSums(is.na(d))<2,]
# a b d
#3 2 2 2
#4 3 3 NA
Or, as @DavidArenburg mentioned, it can also be done with Reduce:
d[Reduce(`+`, lapply(d, is.na)) < 2, ]
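Both versions count missing values, which matches "at least two filled" only because d has exactly three columns. A sketch that counts the non-NA entries directly, and so does not depend on the number of columns, could look like this; the names min_filled and res are illustrative.

```r
d <- data.frame(a = c(1, NA, 2, 3),
                b = c(NA, NA, 2, 3),
                d = c(NA, NA, 2, NA))

# Illustrative threshold: minimum number of non-NA entries per row
min_filled <- 2

# rowSums() over the logical matrix !is.na(d) counts filled cells per row
res <- d[rowSums(!is.na(d)) >= min_filled, ]
res
```

Changing min_filled is then enough to tighten or relax the filter.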
