How can I calculate the combinations of pairs of columns in a data frame, restricted so that it does not consider combinations across rows?
I have a data frame like the following, where each column is a variable.
ID A B C D E F G H I J
1 12 185 NA NA NA NA NA NA NA NA
2 35 20 11 NA NA NA NA NA NA NA
3 45 NA NA NA NA NA NA NA NA NA
I want an output like this:
Var1
12, 185
35, 20
35, 11
20, 11
45, 45
I tried the following code, but it considers ALL possible pairs across both columns and rows. I want each row to be considered independently of the others. Does someone have an idea? Thanks.
numNetList <- read.csv2("abd.csv", sep=";")
comb <- lapply(numNetList, function(x) if (length(x) > 1)
combn(sort(as.numeric(x)), 2))
combb <- do.call(cbind, comb)
pajek_list <- as.data.frame(table(paste(combb[1,], combb[2,], sep = ',')))
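Not the most efficient method, but it solves the problem:
# For one row: keep the non-NA values, duplicate a lone value, return all pairs
func <- function(x){
  vals <- as.character(x[!is.na(x)])
  if (length(vals) == 1)
    vals <- rep(vals, 2)
  combn(vals, 2)
}
# df is the data frame from the question; df[-1] drops the ID column
l = apply(df[-1], 1, func)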
l1 <- as.data.frame(l)
colnames(l1) = NULL
l2= data.frame(t(l1))
library(tidyr)
unite(l2, "new_col", X1,X2 ,sep = ",")
# new_col
# 12,185
# 35,20
# 35,11
# 20,11
# 45,45
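The same per-row idea can be sketched in base R only. This is a minimal sketch, not a tuned solution: the helper name row_pairs and the three-column data frame are my own assumptions, built from the first columns of the question's example.

```r
# Example data modelled on the question (ID column plus value columns)
df <- data.frame(ID = 1:3,
                 A = c(12, 35, 45),
                 B = c(185, 20, NA),
                 C = c(NA, 11, NA))

# Hypothetical helper: all pairs within ONE row
row_pairs <- function(x) {
  v <- x[!is.na(x)]                      # keep only the values present in this row
  if (length(v) == 0) return(character(0))
  if (length(v) == 1) v <- rep(v, 2)     # a lone value is paired with itself
  combn(v, 2, FUN = paste, collapse = ", ")
}

# Apply the helper row by row, so values from different rows are never paired
pair_strings <- unlist(
  lapply(seq_len(nrow(df)), function(i) row_pairs(unlist(df[i, -1]))),
  use.names = FALSE
)
result <- data.frame(Var1 = pair_strings)
result
```

Because combn() is called once per row, no pair can mix values from two different rows, which is the restriction the question asks for.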
I would go with a combination of dplyr and tidyr:
library(dplyr)
library(tidyr)
df <- tibble(A = c(12,35,45), B = c(185, 20, NA), C = c(NA, 11, NA))
df %>%
mutate(group = 1:n()) %>%
gather(col, val, -group) %>%
group_by(group) %>%
expand(col, val) %>%
distinct(val) %>%
summarise(val = toString(val))
We are looking to rename columns in a dataframe in R, however the columns may be missing and this throws an error:
my_df <- data.frame(a = c(1,2,3), b = c(4,5,6))
my_df %>% dplyr::rename(aa = a, bb = b, cc = c)
Error: Can't rename columns that don't exist.
x Column `c` doesn't exist.
Our desired output is shown below: if an original column does not exist, a new column filled with NA values is created instead:
> my_df
aa bb c
1 1 4 NA
2 2 5 NA
3 3 6 NA
A possible solution:
library(tidyverse)
my_df <- data.frame(a = c(1,2,3), b = c(4,5,6))
cols <- c(a = NA_real_, b = NA_real_, c = NA_real_)
my_df %>% add_column(!!!cols[!names(cols) %in% names(.)]) %>%
rename(aa = a, bb = b, cc = c)
#> aa bb cc
#> 1 1 4 NA
#> 2 2 5 NA
#> 3 3 6 NA
You can use a named vector with any_of() to rename that won't error on missing variables. I'm uncertain of a dplyr way to then create the missing vars but it's easy enough in base R.
library(dplyr)
cols <- c(aa = "a", bb = "b", cc = "c")
my_df %>%
rename(any_of(cols)) %>%
`[<-`(., , setdiff(names(cols), names(.)), NA)
aa bb cc
1 1 4 NA
2 2 5 NA
3 3 6 NA
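If you would rather avoid the `[<-` trick, the two steps (rename what exists, then add what is missing) can be written out in base R. This is only a sketch; rename_or_add is a hypothetical helper name, and the mapping follows the same new = "old" convention as cols above.

```r
my_df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
cols <- c(aa = "a", bb = "b", cc = "c")   # new name = "old name"

# Hypothetical helper: rename the columns that exist, add the rest as NA
rename_or_add <- function(df, mapping) {
  present <- mapping[mapping %in% names(df)]            # old names found in df
  names(df)[match(present, names(df))] <- names(present)
  for (nm in setdiff(names(mapping), names(df))) {      # new names still absent
    df[[nm]] <- NA
  }
  df
}

out <- rename_or_add(my_df, cols)
out
```

The loop does nothing when every column is already present, so the helper is safe to call whether or not anything is missing.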
Here is a solution using the data.table function setnames. I've added a second "missing" column "d" to demonstrate generality.
library(tidyverse)
library(data.table)
my_df <- data.frame(a = c(1,2,3), b = c(4,5,6))
curr <- names(my_df)
cols <- data.frame(new=c("aa","bb","cc","dd"), old = c("a", "b", "c","d")) %>%
mutate(exist = old %in% curr)
foo <- filter(cols, exist)
bar <- filter(cols, !exist)
setnames(my_df, new = foo$new)
my_df[, bar$old] <- NA
my_df
#> my_df
# aa bb c d
#1 1 4 NA NA
#2 2 5 NA NA
#3 3 6 NA NA
Not sure what I'm doing wrong, but I'm struggling to get, for each row, the index of the last column (among several columns) that is not NA.
Using tidyverse and across, I'm getting as many output columns as input columns where I'd expect one single output column with the index of the respective column.
dat <- data.frame(id = c(1, 2, 3),
x = c(1, NA, NA),
y = c(NA, NA, NA),
z = c(3, 1, NA))
I tried the following (among others, inspired by this one: Return last data frame column which is not NA):
dat %>%
mutate(last = across(-id, ~max.col(!is.na(.x), ties.method="last")))
Expected outcome would be:
id x y z last
1 1 1 NA 3 3
2 2 NA NA 1 3
3 3 NA NA NA NA
The problems with your current flow:
across is going to pass one column at a time to the function/expression; your code needs a row or a matrix/frame. For this, across is not appropriate.
Your desired output of NA for the last row is inconsistent with the logic: !is.na(.x) should return c(F,F,F), which still has a max. Your logic then requires a custom function, since you need to handle it differently.
Try this adaptation of max.col into a custom function:
max.col.notna <- function (m, ties.method = c("random", "first", "last")) {
ties.method <- match.arg(ties.method)
tieM <- which(ties.method == eval(formals()[["ties.method"]]))
out <- .Internal(max.col(as.matrix(m), tieM))
m[] <- !m %in% c(0,NA) # 'm[] <-' is required to maintain the matrix shape
replace(out, rowSums(m) == 0, NA_integer_)
}
dat %>%
mutate(last = max.col.notna(!is.na(select(., -id)), ties.method = "last"))
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA
Note: I've edited/changed the function several times, trying to ensure a consistent API to the intent of this custom function. As it stands now, the notna in the function name to me reflects a sense of "emptiness" (either 0 or NA). With this logic, the function is usable with logical (as here) and numeric data. Perhaps it's overkill, but I prefer APIs that operate consistently/predictably across input classes.
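If you would prefer not to call .Internal directly, the same result can be had from the exported max.col() plus one fix-up step for the all-NA rows. This is a sketch under the assumption that the input is the dat from the question; the intermediate names m and last are mine.

```r
dat <- data.frame(id = c(1, 2, 3),
                  x = c(1, NA, NA),
                  y = c(NA, NA, NA),
                  z = c(3, 1, NA))

m <- !is.na(as.matrix(dat[-1]))            # TRUE where a value is present
last <- max.col(m, ties.method = "last")   # last TRUE in each row
last[rowSums(m) == 0] <- NA                # rows with no value at all get NA
dat$last <- last
dat
```

The fix-up line is needed because a row of all FALSE still has a "maximum", so max.col() would otherwise report the last column for it.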
The tidyverse isn't really designed for row-wise operations. Most of the time, reshaping the data into long format (as shown in @Rui Barradas's answer) is a good approach.
Here is one way using rowwise keeping the data wide.
library(dplyr)
dat %>%
rowwise() %>%
mutate(last = {ind = which(!is.na(c_across(x:z)));
if(length(ind)) tail(ind, 1) else NA})
# id x y z last
# <dbl> <dbl> <lgl> <dbl> <int>
#1 1 1 NA 3 3
#2 2 NA NA 1 3
#3 3 NA NA NA NA
An R base solution:
# index of the last non-NA value in each row, or NA if the whole row is NA
dat$last = apply(dat[,2:4], 1,
FUN = function(x) { i <- which(!is.na(x)); if (length(i)) max(i) else NA_integer_ })
dat
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA
You want to use c_across() and rowwise() to do this. rowwise() works like a grouping in which every row is its own group, so subsequent operations are computed one row at a time. c_across() creates flat vectors out of columns (whereas across() creates tibbles).
First, we define a function separately that returns the index of the last non-NA value, or NA if there is none:
get_last <- function(x){
y <- c(NA,which(!is.na(x)))
y[length(y)]
}
We can then apply that function to c_across() over the variables we need, but only after converting the data into a rowwise_df using rowwise():
dat %>%
rowwise() %>%
mutate(last = get_last(c_across(x:z)))
base R
df <- data.frame(id = c(1, 2, 3),
x = c(1, NA, NA),
y = c(NA, NA, NA),
z = c(3, 1, NA))
df$last <- apply(df[-1], 1, function(x) max(as.vector(!is.na(x)) * seq_len(length(x))))
df$last[df$last == 0] <- NA
df
#> id x y z last
#> 1 1 1 NA 3 3
#> 2 2 NA NA 1 3
#> 3 3 NA NA NA NA
Created on 2020-12-29 by the reprex package (v0.3.0)
Starting with a vector of NAs, you could step through each col and if the given element passes your check_fun returning TRUE, assign the index of that col to that element. The difference from the other answers here is that this does not check the condition row-wise or create a matrix from the data. Not sure whether creating two new temp vectors for each column is better/worse than just converting the entire data to a matrix first though.
library(tidyverse) # purrr and dplyr
last_matching_ind <- function(dat, check_fun){
check_fun <- as_mapper(check_fun)
reduce2(dat, seq_along(dat), .init = NA_integer_,
function(prev, dat, ind) if_else(check_fun(dat), ind, prev) )
}
dat %>%
mutate(last = last_matching_ind(dat[-1], ~ !is.na(.x)))
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA
Let's consider some random data in which some values have been replaced with NA.
df1=data.frame(sample(0:1,3,replace=T),sample(0:1,3,replace=T),sample(0:1,3,replace=T))
df2=data.frame(rnorm(3),runif(3),rexp(3))
df2[df1==1]<-NA
df2
rnorm.3. runif.3. rexp.3.
1 NA NA NA
2 0.6992316 NA 0.638913
3 0.6520083 0.1090714 NA
I want to replace those NAs using the formula 2*sd(x) + mean(x),
where sd is the standard deviation. I want to do this column by column, so the NA in row 1, column 1 should be replaced by 2*sd(c(0.6992316, 0.6520083)) + mean(c(0.6992316, 0.6520083)), and so on.
I tried the following code: df2[df2==NA]<-2*apply(df2,2,sd,na.rm=T)+apply(df2,2,mean,na.rm=T) but nothing happened. Do you have any idea how this can be done?
I would probably write the (vectorized) function using ifelse then apply to all the columns using mutate(across(everything()))
library(dplyr)
f <- function(x) ifelse(!is.na(x), x,
2 * sd(x, na.rm = TRUE) + mean(x, na.rm = TRUE))
df2 %>%
mutate(across(everything(), f))
#> rnorm.3. runif.3. rexp.3.
#> 1 0.7424038 NA NA
#> 2 0.6992316 NA 0.638913
#> 3 0.6520083 0.1090714 NA
Note that in your example this doesn't do anything for the second two columns because they only have a single non-NA value. Calling sd on a single non-NA value produces NA.
If however, we do it with only one NA in each column (as we get by re-running your code after setting set.seed(1)), we can see this working:
set.seed(1)
df1 <- data.frame(sample(0:1, 3, replace = TRUE),
sample(0:1, 3, replace = TRUE),
sample(0:1, 3, replace = TRUE))
df2 <- data.frame(rnorm(3), runif(3), rexp(3))
df2[df1 == 1] <- NA
df2
#> rnorm.3. runif.3. rexp.3.
#> 1 -1.5399500 0.4976992 1.2132879
#> 2 NA NA 0.5548904
#> 3 -0.2947204 0.9919061 NA
df2 %>% mutate(across(everything(), f))
#> rnorm.3. runif.3. rexp.3.
#> 1 -1.5399500 0.4976992 1.2132879
#> 2 0.8436853 1.4437167 0.5548904
#> 3 -0.2947204 0.9919061 1.8152038
Does this work? The second column still has NA because it has only one non-NA value: the standard deviation of a single value is NA, and adding the mean (or any other value) to NA is also NA, so nothing gets imputed there.
library(dplyr)
library(tidyr)
df2 %>% mutate(across(everything(), ~ replace_na(., 2*sd(., na.rm = T) + mean(., na.rm = T))))
rnorm.3. runif.3. rexp.3.
1 -0.3030444 NA 0.07332792
2 -0.2226609 NA 1.76854904
3 -0.3909707 0.9099274 0.95892457
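As for why the original attempt did nothing: any comparison with NA, including x == NA, returns NA rather than TRUE, so df2 == NA never selects a cell; is.na() is the test to use. A small sketch on a single vector, using the two values from the first column of the question (the names x and fill are mine):

```r
x <- c(NA, 0.6992316, 0.6520083)

cmp <- x == NA      # every element is NA, never TRUE
cmp
is.na(x)            # TRUE only where a value is missing

# Fill value from the question's formula, computed on the non-missing values
fill <- 2 * sd(x, na.rm = TRUE) + mean(x, na.rm = TRUE)
x[is.na(x)] <- fill
x
```

To do the same for every column of a data frame in base R, something along the lines of df2[] <- lapply(df2, f), with f as defined in the first answer, should work.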
I am analyzing some variables for some specific locations. These variables have NA values that should be replaced by the values from adjacent columns. I found a way to do it, but it's not an efficient way if I have more columns. Can you help?
Dataset:
location <- rep(c("A", "B", "C"), times = 2)
v1 <- c(11,92,NA,NA,NA,NA)
v2 <- c(NA,NA,NA,50,NA,NA)
v3 <- c(NA,NA,66,NA,NA,79)
v4 <- c(NA,NA,NA,74,23,88)
df <- data.frame(location,v1,v2,v3,v4)
I tried this approach to create a new column in which NA values are replaced by values from other columns.
library (dplyr)
col_1 <- df %>% mutate(new_col = v1 %>% is.na %>% ifelse(v2,v1))
col_2 <- col_1 %>% mutate(new_col_1 = new_col %>% is.na %>% ifelse(v3,new_col))
col_3 <- col_2 %>% mutate(final_col = new_col_1 %>% is.na %>% ifelse(v4,new_col_1))
It solves the problem but I have two questions:
1. Is there any efficient way to do this, instead of creating three columns?
2. For some cases in v3 and v4 OR v2 and v4, where more than one value is available to replace NA, can I take the mean of those values to replace? How?
Thanks in advance.
We can use coalesce
library(dplyr)
library(purrr)
df %>%
mutate(new = coalesce(!!! rlang::syms(names(.)[-1])))
Or
df %>%
mutate(new = reduce(.[-1], coalesce))
# location v1 v2 v3 v4 new
#1 A 11 NA NA NA 11
#2 B 92 NA NA NA 92
#3 C NA NA 66 NA 66
#4 A NA 50 NA 74 50
#5 B NA NA NA 23 23
#6 C NA NA 79 88 79
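For the second question (taking the mean when more than one value is available), one possible approach is rowMeans() with na.rm = TRUE. This is a sketch on the question's data; the column name new_mean is my own choice. With a single non-NA value the mean is just that value, so it agrees with coalesce in those rows.

```r
location <- rep(c("A", "B", "C"), times = 2)
v1 <- c(11, 92, NA, NA, NA, NA)
v2 <- c(NA, NA, NA, 50, NA, NA)
v3 <- c(NA, NA, 66, NA, NA, 79)
v4 <- c(NA, NA, NA, 74, 23, 88)
df <- data.frame(location, v1, v2, v3, v4)

# Row-wise mean of whatever values are present in v1..v4
df$new_mean <- rowMeans(df[-1], na.rm = TRUE)
df
```

Note that a row in which all of v1..v4 are NA would come out as NaN rather than NA; the example data has no such row.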
I have a list of data frames, all of which are the same dimensions (64 obs, 12 variables). I need to "flatten" these data frames in such a way that I return with 64 x 11 = 704 variables and one observation, deriving all combinations of one column that has all unique values and the column names of the data frame. Examples are provided below.
I have attempted using acast and melt to achieve this. However, the supporting operations both pre and post melt make this approach slow when having to lapply this approach over 100k+ data frames.
Here is an example data frame and my taken approach:
df <- data.frame(var1=c(1,2,3),name=c("these","are","names"),var3=c(4,NA,NA),var4=c(NA,NA,5),var6=c(NA,5,NA))
flattening <- function(df){
rownames(df) <- df$name
df$name <- NULL
df <- melt(as.matrix(df)) %>% group_by(name = paste0(Var1,"_",Var2)) %>% summarise(
value = first(value)
) %>% data.frame()
cnames <- df$name
df <- data.frame(values=df$value) %>% t() %>% data.frame()
names(df) <- cnames
df
}
flattening(df)
The example df looks as such:
var1 name var3 var4 var6
1 1 these 4 NA NA
2 2 are NA NA 5
3 3 names NA 5 NA
I am looking for the expected outcome:
are_var1 are_var3 are_var4 are_var6 names_var1 names_var3 names_var4 names_var6 these_var1 these_var3 these_var4 these_var6
values 2 NA NA 5 3 NA 5 NA 1 4 NA NA
RESULTS UPDATE:
I have a microbenchmark below where expr is the user's handle:
Unit: milliseconds
expr min lq mean median uq max neval cld
old 78.370093 81.038799 90.272721 85.694885 89.304528 1114.03968 500 c
tmfmnk 11.829791 12.697675 13.844833 13.134485 13.623065 34.91430 500 b
s_t 1.476159 1.774409 2.030418 1.873876 2.003681 16.89159 500 a
One dplyr and tidyr option could be:
df %>%
gather(var, val, -2) %>%
mutate(var = paste(name, var, sep = "_")) %>%
select(-name) %>%
spread(var, val)
are_var1 are_var3 are_var4 are_var6 names_var1 names_var3 names_var4 names_var6
1 2 NA NA 5 3 NA 5 NA
these_var1 these_var3 these_var4 these_var6
1 1 4 NA NA
It should be faster than your original approach; however, there are certainly faster possibilities.
You can also use reshape2::melt() then use base R:
library(reshape2)
dats <- melt(df)
rownames(dats) <- paste0(dats$name,'-',dats$variable)
dats <- t(dats)
dats <- dats[-c(1,2),]
dats <- sapply(dats,as.numeric)
dats
these-var1 are-var1 names-var1 these-var3 are-var3 names-var3 these-var4 are-var4 names-var4 these-var6 are-var6
1 2 3 4 NA NA NA NA 5 NA 5
names-var6
NA
edit
Here as data.frame:
dats <- as.data.frame.matrix(t(as.data.frame.numeric(dats)))
Using dcast from data.table which can take multiple value.var columns
library(data.table)
out <- dcast(setDT(df)[, rn := 1], rn ~ name,
value.var = paste0("var", c(1, 3, 4, 6)))[, rn := NULL][]
setnames(out, sub("([^_]+)_([^_]+)", "\\2_\\1", names(out)))
out
# are_var1 names_var1 these_var1 are_var3 names_var3 these_var3 are_var4 names_var4 these_var4 are_var6 names_var6 these_var6
#1: 2 3 1 NA NA 4 NA 5 NA 5 NA NA
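For completeness, here is a base R sketch that skips melt/cast entirely by building the combined names with outer() and flattening the value matrix in the same order. I have not benchmarked it against the timings above; the intermediate names vals, nm and flat are mine.

```r
df <- data.frame(var1 = c(1, 2, 3),
                 name = c("these", "are", "names"),
                 var3 = c(4, NA, NA),
                 var4 = c(NA, NA, 5),
                 var6 = c(NA, 5, NA),
                 stringsAsFactors = FALSE)

vals <- as.matrix(df[setdiff(names(df), "name")])        # numeric matrix, one row per name
nm <- outer(df$name, colnames(vals), paste, sep = "_")   # "these_var1", ... in the same layout

# Both objects are flattened column by column, so names and values line up
flat <- setNames(as.vector(vals), as.vector(nm))
flat <- flat[order(names(flat))]                         # alphabetical, as in the expected output

out <- as.data.frame(as.list(flat))
out
```

If the one-row data frame is not strictly needed, stopping at the named vector flat avoids the final conversion.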