I have a Data Frame with a variable with different values for another variable.
Like this:
DataFrame
So, I need a subset when the value of S contain all the possible values of B. In this example, el subset is conformed by S = a and S = b:
Subset
Any idea? Thanks!!
An option would be to group by 'S' and filter the rows having all the unique values of the column 'B' %in% 'B'
library(dplyr)
un1 <- unique(df1$B)
df1 %>%
group_by(S) %>%
filter(all(un1 %in% B))
# A tibble: 8 x 2
# Groups: S [2]
# S B
# <fct> <dbl>
#1 a 1
#2 a 2
#3 a 3
#4 a 4
#5 d 1
#6 d 2
#7 d 3
#8 d 4
Or with data.table
library(data.table)
setDT(df1)[, .SD[all(un1 %in% B)], S]
Or using base R
df1[with(df1, ave(B, S, FUN = function(x) all(un1 %in% x)) == 1),]
data
df1 <- data.frame(S = rep(letters[1:4], c(4, 3, 2, 4)),
B = c(1:4, c(1, 3, 4), 1:2, 1:4))
Related
I have
a <- c(rep("A", 3), rep("B", 3), rep("C",2), rep("D", 1))
b <- c(1,1,2,4,1,1,2,2,5)
df <-data.frame(a,b)
Based on df$a, i would like to return only the values that do not have a duplicate (those rows that have a single occurence of df$a), in this example it would be 1 D 5
I have tried duplicate(), !duplicate() and unique() but none outputs what I need.
Cleanest way with dplyr:
library(dplyr)
df %>% group_by(a) %>%
filter(n() == 1)
Output:
# A tibble: 1 x 2
# Groups: a [1]
a b
<chr> <dbl>
1 D 5
One option
df[!(df$a %in% df$a[duplicated(df$a)]),]
a b
9 D 5
Using data.table
library(data.table)
setDT(df)
df[, tmp:= .N, by = a][tmp == 1, -"tmp"]
a b
1: D 5
With Base R ,
x <- table(df[,1])
df[rep(x<2,x),]
gives,
# a b
# 9 D 5
I have a dataframe with two columns per sample (n > 1000 samples):
df <- data.frame(
"sample1.a" = 1:5, "sample1.b" = 2,
"sample2.a" = 2:6, "sample2.b" = c(1, 3, 3, 3, 3),
"sample3.a" = 3:7, "sample3.b" = 2)
If there is a zero in column .b, the correspsonding value in column .a should be set to NA.
I thought to write a function over colnames (without suffix) to filter each pair of columns and conditional exchaning values. Is there a simpler approach based on tidyverse?
We can split the data.frame into a list of data.frames and do the replacement in base R
df1 <- do.call(cbind, lapply(split.default(df,
sub("\\..*", "", names(df))), function(x) {
x[,1][x[2] == 0] <- NA
x}))
Or another option is Map
acols <- endsWith(names(df), "a")
bcols <- endsWith(names(df), "b")
df[acols] <- Map(function(x, y) replace(x, y == 0, NA), df[acols], df[bcols])
Or if the columns are alternate with 'a', 'b' columns, use a logical index for recycling, create the logical matrix with 'b' columns and assign the corresponding values in 'a' columns to NA
df[c(TRUE, FALSE)][df[c(FALSE, TRUE)] == 0] <- NA
or an option with tidyverse by reshaping into 'long' format (pivot_longer), changing the 'a' column to NA if there is a correspoinding 0 in 'a', and reshape back to 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn, names_sep="\\.",
names_to = c('group', '.value')) %>%
mutate(a = na_if(b, a == 0)) %>%
pivot_wider(names_from = group, values_from = c(a, b)) %>%
select(-rn)
# A tibble: 5 x 6
# a_sample1 a_sample2 a_sample3 b_sample1 b_sample2 b_sample3
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 2 1 2 2 1 2
#2 2 3 2 2 3 2
#3 2 3 2 2 3 2
#4 2 3 2 2 3 2
#5 2 3 2 2 3 2
This question already has an answer here:
Replace NA with mode based on ID attribute
(1 answer)
Closed 2 years ago.
I'd like to fill the NA-values in F2-column, based on the the most common F2-value when grouped by F1-column.
F1 F2
1 A C
2 B D
3 A NA
4 A C
5 B NA
Desired outcome:
F1 F2
1 A C
2 B D
3 A C
4 A C
5 B D
Thank you for help
Here is a base R solution. First define a function for Mode (Taken from here) and then apply it to you data frame, i.e.
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
df$F2 <- with(df, ave(F2, F1, FUN = function(i) replace(i, is.na(i), Mode(i))))
df
# F1 F2
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
Here is one way using dplyr :
library(dplyr)
df %>%
group_by(F1) %>%
mutate(F2 = replace(F2, is.na(F2),
names(sort(table(F2), decreasing = TRUE)[1])))
# F1 F2
# <chr> <chr>
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
In case of ties, preference is given to lexicographic order.
Try this:
First in df2 I get max count by the variable F1 where F2 is not missing. That will give you the most common F2 value when groups by F1. I join it back onto the original data.frame and use a mutate to fill by the new variable F2_fill and then remove it from this variable from the data.frame.
library(tidyverse)
df <- tribble(
~F1, ~F2,
'A', 'C',
'B' , 'D',
'A' ,NA,
'A', 'C',
'B', NA)
df2 <- df %>%
group_by(F1) %>%
count(F2) %>%
filter(!is.na(F2), n == max(n)) %>%
select(-n) %>%
rename(F2_fill = F2)
df3 <- left_join(df,df2, by="F1") %>%
mutate(F2 = ifelse(is.na(F2), F2_fill,F2)) %>%
select(-F2_fill)
You can use ave with table and which.max and subsetting with is.na when it is a character.
i <- is.na(x$F2)
x$F2[i] <- ave(x$F2, x$F1, FUN=function(y) names(which.max(table(y))))[i]
x
# F1 F2
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
Data:
x <- data.frame(F1 = c("A", "B", "A", "A", "B")
, F2 = c("C", "D", NA, "C", NA))
I have a table with ID and other columns. I want to group the data by Ids and get the unique values of all columns.
from above table group by ID and get unique(Alt1, Alt2, Alt3)
Resul should be in vector form
A -> 1,2,3,5
B ->1,3,4,5,7
We can get data in long format and for each ID make a list of unique values.
library(dplyr)
library(tidyr)
df1 <- df %>%
pivot_longer(cols = -ID) %>%
group_by(ID) %>%
summarise(value = list(unique(value))) %>%
unnest(value)
df1
# ID value
# <fct> <dbl>
# 1 A 1
# 2 A 3
# 3 A 2
# 4 A 5
# 5 B 1
# 6 B 4
# 7 B 5
# 8 B 3
# 9 B 6
#10 B 7
We can store it as a list if needed using split.
split(df1$value, df1$ID)
#$A
#[1] 1 3 2 5
#$B
#[1] 1 4 5 3 6 7
data.table equivalent of the above would be :
library(Data.table)
setDT(df)
df2 <- melt(df, id.vars = 'ID')[, .(value = list(unique(value))), ID]
unique values are present in df2$value as a vector.
data
df <- data.frame(ID = c('A', 'A', 'B', 'B'),
Alt1 = c(1, 2, 1, 3),
Alt2 = c(3, 5, 4, 6),
Alt3 = c(1, 3, 5, 7))
I have a two-column data frame. The first column is a timestamp and the second column is some value. For example:
library(tidyverse)
set.seed(123)
data_df <- tibble(t = 1:15,
value = sample(letters, 15))
I have a another data frame that specifies the range of timestamps that need to be updated and their corresponding values. For example:
criteria_df <- tibble(start = c(1, 3, 7),
end = c(2, 5, 10),
value = c('a', 'b', 'c')
)
This means that I need to mutate the value column in data_df so that its value from t=1 to t=2 is 'a', from t=3 to t=5 is 'b' and from t=7 to t=10 is 'c'.
What is the recommended way to do this in R?
The only way I could think of is to loop each row in criteria_df and mutate the value column in data_df after filtering the t column, like so:
library(iterators)
library(foreach)
foreach(row = row_iter, .combine = c) %do% {
seg_start = row$start
seg_end = row$end
new_value = row$value
data_df %<>%
mutate(value = if_else(between(t, seg_start, seg_end),
new_value,
value))
NULL
}
We can do a two-step base R solution, where we first find the values which lies in the range of criteria_df start and end and then replace the data_df value from it's equivalent criteria_df's value if it matches or keep it as it is.
inds <- sapply(data_df$t, function(x) criteria_df$value[x >= criteria_df$start
& x <= criteria_df$end])
data_df$value <- unlist(ifelse(lengths(inds) > 0, inds, data_df$value))
data_df
# t value
# <int> <chr>
# 1 1 a
# 2 2 a
# 3 3 b
# 4 4 b
# 5 5 b
# 6 6 a
# 7 7 c
# 8 8 c
# 9 9 c
#10 10 c
#11 11 p
#12 12 g
#13 13 r
#14 14 s
#15 15 b