Given the reproducible example below, my objective is to substitute the original values with NA, row by row, in the adjacent columns of a data frame. I know this problem (in its many variants) has been posted before, but I have not yet found a solution using the approach I'm after, i.e. applying a function composition.
In the reproducible example, column a drives the substitution: wherever a is NA, the values in the other columns of that row should become NA.
This is what I've done so far.
The very last code snippet is a failing attempt at what I'm actually looking for...
#-----------------------------------------------------------
# ifelse approach, it works but...
# it's error prone: i.e. copy and paste for all columns can introduce a lot of troubles
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
df$b<-ifelse(is.na(df$a), NA, df$b)
df$c<-ifelse(is.na(df$a), NA, df$c)
df
#--------------------------------------------------------
# extraction and substitution approach
# same as above
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
df$b[is.na(df$a)]<-NA
df$c[is.na(df$a)]<-NA
df
#----------------------------------------------------------
# definition of a function
# it's a bit better, but still error prone because of the copy and paste
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
fix<-function(x,y){
ifelse(is.na(x), NA, y)
}
df$b<-fix(df$a, df$b)
df$c<-fix(df$a, df$c)
df
#------------------------------------------------------------
# this approach is not working as expected!
# the idea behind is of function composition;
# lapply does the fix to some columns of data frame
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
fix2<-function(x){
x[is.na(x[1])]<-NA
x
}
df[]<-lapply(df, fix2)
df
Any help with this particular approach?
I'm stuck on how to properly write the substitution function passed to lapply.
Thanks
Using lexical closure
With a lexical closure, you first define a function that generates the function you need.
You can then use the generated function however you wish.
# given a column all other columns' values at that row should become NA
# if the driver column's value at that row is NA
# using lexical scoping of R function definitions, one can reach that.
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
# whatever vector given, this vector's value should be changed
# according to first column's value
na_accustomizer <- function(df, driver_col) {
## Returns a function which will accustomize any vector/column
## to driver column's NAs
function(vec) {
vec[is.na(df[, driver_col])] <- NA
vec
}
}
df[] <- lapply(df, na_accustomizer(df, "a"))
df
## a b c
## 1 1 3 NA
## 2 2 NA 5
## 3 NA NA NA
#
# na_accustomizer(df, "a") returns
#
# function(vec) {
# vec[is.na(df[, "a"])] <- NA
# vec
# }
#
# which then can be used like you want:
# df[] <- lapply(df, na_accustomizer(df, "a"))
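For context, the fix2 attempt in the question fails because lapply hands each column to the function as a plain vector, so x[1] is the first element of that column rather than the first column of the data frame; is.na(x[1]) is then a single logical that is recycled over the whole column. The sketch below is a small variant of the closure idea that captures only the NA mask of the driver column and applies it to an explicit set of target columns. The names make_na_masker and targets are hypothetical and not part of the original answer.

```r
# Hypothetical helper: capture only the NA mask of the driver column,
# so the returned function does not depend on the whole data frame.
make_na_masker <- function(driver) {
  mask <- is.na(driver)          # computed once, stored in the closure
  function(vec) {
    vec[mask] <- NA              # blank out rows where the driver is NA
    vec
  }
}

df <- data.frame(a = c(1, 2, NA), b = c(3, NA, 4), c = c(NA, 5, 6))
targets <- c("b", "c")           # columns to adjust; "a" is the driver
df[targets] <- lapply(df[targets], make_na_masker(df$a))
df
```

Restricting the assignment to df[targets] leaves the driver column untouched and makes it explicit which columns are rewritten.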
Using normal functions
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
# define it for one column
overtake_NA <- function(df, driver_col, target_col) {
df[, target_col] <- ifelse(is.na(df[, driver_col]), NA, df[, target_col])
df
}
# define it for all columns of df
overtake_driver_col_NAs <- function(df, driver_col) {
for (i in 1:ncol(df)) {
df <- overtake_NA(df, driver_col, i)
}
df
}
overtake_driver_col_NAs(df, "a")
# a b c
# 1 1 3 NA
# 2 2 NA 5
# 3 NA NA NA
Generalize for any predicate function
driver_col_to_other_cols <- function(df, driver_col, pred) {
## overtake any value of the driver column to the other columns of df,
## whenever predicate function (pred) is fulfilled.
# define it for one column
overtake_ <- function(df, driver_col, target_col, pred) {
selectors <- do.call(pred, list(df[, driver_col]))
# Reset NA selectors to FALSE. They arise when pred returns NA for NA
# entries of the driver column (e.g. x == 1). This is done unconditionally:
# inside this helper deparse(substitute(pred)) always yields "pred",
# so it cannot be used to detect whether is.na was passed.
selectors[is.na(selectors)] <- FALSE
df[, target_col] <- ifelse(selectors, df[, driver_col], df[, target_col])
df
}
for (i in 1:ncol(df)) {
df <- overtake_(df, driver_col, i, pred)
}
df
}
driver_col_to_other_cols(df, "a", function(x) x == 1)
# a b c
# 1 1 1 1
# 2 2 NA 5
# 3 NA 4 6
## if the NA entries of the selector vector were not reset to FALSE,
## this would give (because of the NA in the selector vector):
# a b c
# 1 1 1 1
# 2 2 NA 5
# 3 NA NA NA
## hence, when pred does not itself check for NA in 'a',
## those rows have to keep the original columns' values.
driver_col_to_other_cols(df, "a", is.na)
# a b c
# 1 1 3 NA
# 2 2 NA 5
# 3 NA NA NA
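A more compact way to express the same idea, as a sketch only: compute the selector once, treat NA results of the predicate as FALSE with %in% TRUE, and copy the driver values into each target column. The function name copy_driver_where and its targets argument are hypothetical and not part of the answer above.

```r
# Hypothetical helper: on every row where pred(driver) is TRUE,
# copy the driver column's value into the target columns.
copy_driver_where <- function(df, driver_col, pred,
                              targets = setdiff(names(df), driver_col)) {
  hit <- pred(df[[driver_col]]) %in% TRUE   # NA results count as FALSE
  for (col in targets) {
    df[[col]][hit] <- df[[driver_col]][hit]
  }
  df
}

df <- data.frame(a = c(1, 2, NA), b = c(3, NA, 4), c = c(NA, 5, 6))
res_eq1 <- copy_driver_where(df, "a", function(x) x == 1)
res_na  <- copy_driver_where(df, "a", is.na)
res_eq1
res_na
```

Because the NA handling lives in `%in% TRUE`, the helper does not need to know which predicate it was given.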
Try this function: it takes your original dataset as input and returns the cleaned one.
Input
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
> df
a b c
1 1 3 NA
2 2 NA 5
3 NA 4 6
Function
fix<-function(df,var_x,list_y)
{
df[is.na(df[,var_x]),list_y]<-NA
return(df)
}
Output
fix(df,"a",c("b","c"))
a b c
1 1 3 NA
2 2 NA 5
3 NA NA NA
I created a dataset by joining multiple tables to a primary table containing department_id and year. The final data frame had a lot of missing values which I then imputed with 'MISSING' for categorical and with '0'(zero) for continuous variables.
I now want to remove the subset of rows that are populated with either 'MISSING' or '0' (i.e. have no other values), how can I do this in R?
Thanks
I would strongly suggest leaving your NAs the way they are if you can. R has built-in generic functions for dealing with NAs across classes that can make your life much easier. If missing values are marked by a different sentinel for each data type, you need a separate comparison for each sentinel, which isn't very efficient.
It's also worth mentioning that the options below are generalizable, i.e. they will work on data frames with any number of columns, so you don't need to add a comparison for each new column.
First, generate some data to test with:
df <- data.frame(num = c(1, 0, 3, 4, 0, 5),
cat = c("a", "b", "c", "d", "MISSING", "MISSING")
)
#### OUTPUT ####
num cat
1 1 a
2 0 b # <- keep
3 3 c
4 4 d
5 0 MISSING # <- drop
6 5 MISSING # <- keep
You can filter using base R or dplyr (among other options):
# Base R option
df[rowSums(df == "MISSING" | df == 0) < ncol(df),]
# Tidyverse option using dplyr
library(dplyr)
filter_all(df, any_vars(!(. == "MISSING" | . == 0)))
The output for both options will look like this:
num cat
1 1 a
2 0 b # <- kept
3 3 c
4 4 d
5 5 MISSING # <- kept
Just for the sake of argument, here's how you can simplify things by leaving NAs as they are. First some new data:
df_na <- data.frame(num = c(1, NA, 3, 4, NA, 5),
cat = c("a", "b", "c", "d", NA, NA)
)
#### OUTPUT ####
num cat
1 1 a
2 NA b # <- keep
3 3 c
4 4 d
5 NA <NA> # <- drop
6 5 <NA> # <- keep
Now we can use the same strategies as above, but we only need to use is.na() rather than adding a comparison for each type of missing value:
# Using base R
df_na[rowSums(is.na(df_na)) < ncol(df_na),]
# Using dplyr
library(dplyr)
filter_all(df_na, any_vars(!is.na(.)))
#### OUTPUT ####
num cat
1 1 a
2 NA b # <- kept
3 3 c
4 4 d
6 5 <NA> # <- kept
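If the data have already been imputed, one possible route is to turn the sentinels back into NA first and then reuse the is.na() filter. The sketch below assumes that 0 and "MISSING" never occur as genuine values, which may not hold for your data (a real zero would be lost); the object names restored and kept are made up for illustration.

```r
df <- data.frame(num = c(1, 0, 3, 4, 0, 5),
                 cat = c("a", "b", "c", "d", "MISSING", "MISSING"),
                 stringsAsFactors = FALSE)

restored <- df
is_num <- vapply(restored, is.numeric, logical(1))

# Assumption: 0 marks a missing number, "MISSING" a missing category.
restored[is_num]  <- lapply(restored[is_num],
                            function(x) replace(x, x == 0, NA))
restored[!is_num] <- lapply(restored[!is_num],
                            function(x) replace(x, x == "MISSING", NA))

# Keep rows that have at least one non-missing value.
kept <- restored[rowSums(is.na(restored)) < ncol(restored), ]
kept
```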
You are right that Ott's solution doesn't do what they say it does. Here's his solution implemented correctly, in base R and in dplyr. Note that you will have to duplicate each != 0 clause for each of your columns.
# create some dummy data
data <- data.frame(
numeric = c(1, 2, 3, 0, 0, 0, 4, 5, 6),
categorical = c("MISSING", "A", "B", "MISSING", "C", "MISSING", "D", "MISSING", "E")
)
# base R solution
data[data$numeric != 0 | data$categorical != "MISSING", ]
# dplyr solution
library(dplyr)
filter(data, numeric != 0 | categorical != "MISSING")
Problem 1 (solved)
How can I sort vector DoB:
DoB <- c(NA, 9, NA, 2, 1, NA)
while keeping the NAs in the same position?
I would like to get:
> DoB
[1] NA 1 NA 2 9 NA
I have tried this (borrowing from this answer)
NAs_index <- which(is.na(DoB))
DoB <- sort(DoB, na.last = NA)
for(i in 0:(length(NAs_index)-1))
DoB <- append(DoB, NA, after=(NAs_index[i+1]+i))
but
> DoB
[1] 1 NA 2 9 NA NA
Answer is
DoB[!is.na(DoB)] <- sort(DoB)
Thanks to @BigDataScientist and @akrun
Now, Problem 2
Say, I have a vector id
id <- 1:6
I would also like to reorder it by the same principle, so that the values of id follow order(DoB) while the positions where DoB is NA stay fixed:
> id
[1] 1 5 3 4 2 6
You could do:
DoB[!is.na(DoB)] <- sort(DoB)
Edit: Concerning the follow up question in the comments:
You can use order() for that and take care of the NAs with the na.last parameter:
data <- data.frame(DoB = c(NA, 9, NA, 2, 1, NA), id = 1:6)
data$id[!is.na(data$DoB)] <- order(data$DoB, na.last = NA)
data$DoB[!is.na(data$DoB)] <- sort(data$DoB)
We create a logical index and then do the sort
i1 <- is.na(DoB)
DoB[!i1] <- sort(DoB[!i1])
DoB
#[1] NA 1 NA 2 9 NA
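Writing the result of order() directly into id works in the example because id happens to be 1:6, so positions and ids coincide. For arbitrary ids, a sketch along the following lines may be safer: order only the non-NA part and use that ordering to permute both vectors. The id values below are hypothetical.

```r
DoB <- c(NA, 9, NA, 2, 1, NA)
id  <- c(101, 102, 103, 104, 105, 106)   # hypothetical ids, not 1:6

keep <- !is.na(DoB)                      # positions that take part in the sort
ord  <- order(DoB[keep])                 # ordering among the non-NA values only

id[keep]  <- id[keep][ord]               # carry id along with DoB
DoB[keep] <- DoB[keep][ord]              # NA positions are left untouched

DoB
id
```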
I have three tables that I'm attempting to merge into one.
The main table is similar to:
Table1 <- data.frame("Data" = c(1, 2, 3, 4, 5), "Desc" = c("A", "A", "A", "B", "B"))
TableA <- data.frame("Values" = c(6, 2, 3))
TableB <- data.frame("Values" = c(2, 7))
I want to add another column to Table1 with the values from TableA and TableB, but Values coming from TableA must be placed in a row containing "A" in the "Desc" column and TableB values in rows containing "B" in the "Desc" column. The number of rows in Table A equal the number of rows Table1 with "A" and same for TableB.
The resulting Table should look like:
Table1 <- data.frame("Data" = c(1, 2, 3, 4, 5), "Desc" = c("A", "A", "A", "B", "B"), "Values" = c(6, 2, 3, 2, 7))
> Table1
Data Desc Values
1 1 A 6
2 2 A 2
3 3 A 3
4 4 B 2
5 5 B 7
First note that these are "data.frames", not "tables". A "table" is actually a different class in R and they aren't the same thing. This strategy should work
Table1$Values <- NA
Table1$Values[Table1$Desc=="A"] <- TableA$Values
Table1$Values[Table1$Desc=="B"] <- TableB$Values
Table1
# Data Desc Values
# 1 1 A 6
# 2 2 A 2
# 3 3 A 3
# 4 4 B 2
# 5 5 B 7
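The same strategy can be written as a loop over a named list, which avoids repeating one assignment per table. This is only a sketch: the list values_by_desc is a hypothetical structure introduced here, and the length check is an added safeguard that is not in the answer above.

```r
Table1 <- data.frame(Data = c(1, 2, 3, 4, 5),
                     Desc = c("A", "A", "A", "B", "B"))
TableA <- data.frame(Values = c(6, 2, 3))
TableB <- data.frame(Values = c(2, 7))

# Hypothetical lookup: names must match the entries of Table1$Desc.
values_by_desc <- list(A = TableA$Values, B = TableB$Values)

Table1$Values <- NA_real_
for (key in names(values_by_desc)) {
  rows <- Table1$Desc == key
  # Fail early if the row counts do not line up.
  stopifnot(sum(rows) == length(values_by_desc[[key]]))
  Table1$Values[rows] <- values_by_desc[[key]]
}
Table1
```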
If you have multiple tables (TableA, TableB, TableC, etc.) and need to match the suffix of each table name to the Desc column of Table1:
ls1 <- ls(pattern="Table")
ls1
#[1] "Table1" "TableA" "TableB"
library(stringr)
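# stringr's regex engine supports lookbehind directly; the perl() wrapper
# was deprecated in stringr 1.0.0 and is not needed
indx <- str_extract(ls1[-1], '(?<=Table)[A-Z]')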
lst1 <- mget(ls1[-1])
do.call(rbind,
lapply(seq_along(lst1),function(i) {
x1 <- lst1[[i]]
x2 <- Table1[!is.na(match(Table1$Desc, indx[i])),]
x2$Values <- x1$Values
x2}
))
# Data Desc Values
#1 1 A 6
#2 2 A 2
#3 3 A 3
#4 4 B 2
#5 5 B 7
In the first step, after creating the objects (Table1, TableA, TableB), I looked up their names with ls(pattern="Table").
I then extracted the suffix letters A and B from the object names that need to be matched. The regex uses a lookbehind: (?<=Table)[A-Z] matches an uppercase letter preceded by the string Table, and str_extract returns that letter.
mget returns the values of the objects as a list.
Finally, lapply loops over that list, matches the Desc column of Table1 against the extracted suffix, and creates the new column.
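If you would rather not depend on stringr, the suffix extraction can also be sketched in base R with regexpr() and regmatches(), using perl = TRUE to enable the lookbehind. The vector object_names below is a stand-in for the result of ls(pattern = "Table").

```r
# Stand-in for ls(pattern = "Table") in the session described above.
object_names <- c("Table1", "TableA", "TableB")
candidates <- object_names[-1]           # drop the main table

# perl = TRUE is needed for the lookbehind (?<=Table).
hits <- regexpr("(?<=Table)[A-Z]", candidates, perl = TRUE)

# Note: regmatches() silently drops elements without a match,
# so the result can be shorter than `candidates`.
suffix <- regmatches(candidates, hits)
suffix
```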
I have a data frame and want to find the numbers of the rows it has in common with another data frame.
To make the question clear, say I have data frame A and data frame B:
dfA <- data.frame(NAME = rep(c("a", "b"), each = 3),
TRIAL = rep(1:3, 2),
DATA = runif(6))
dfB <- data.frame(NAME = c("a", "b"),
TRIAL = c(2, 3))
dfA
# NAME TRIAL DATA
# 1 a 1 0.62948592
# 2 a 2 0.88041819
# 3 a 3 0.02479411
# 4 b 1 0.48031827
# 5 b 2 0.86591315
# 6 b 3 0.93448264
dfB
# NAME TRIAL
# 1 a 2
# 2 b 3
I want to get dfA's row number where dfA and dfB have the same NAME and TRIAL, in this case, row numbers are 2 and 6.
I tried the following code, which gives me rows 2, 3, 5 and 6. It matches NAME and TRIAL separately, so it doesn't work.
which(dfA$NAME %in% dfB$NAME & dfA$TRIAL %in% dfB$TRIAL)
# 2 3 5 6
Then I tried creating a dummy column and matching on it. This works, but the code would be verbose if dfB has many columns...
dfA$dummy <- paste0(dfA$NAME, dfA$TRIAL)
dfB$dummy <- paste0(dfB$NAME, dfB$TRIAL)
which(dfA$dummy %in% dfB$dummy)
# 2 6
I'm wondering if there are better ways to solve the problem, thanks for your help!
You can do:
merge(transform(dfA, row.num = 1:nrow(dfA)), dfB)$row.num
# [1] 2 6
And if the whole goal of finding the indices is so that you can subset dfA, then you can just do merge(dfA, dfB).
Or use duplicated:
apply(dfB, 1, function(x)
which(duplicated(rbind(x, dfA[1:2])))-1)
# [1] 2 6
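The dummy-column idea from the question can also be generalized so that it does not grow with the number of columns. The sketch below builds one key per row from all columns the two data frames share; the helper make_key is hypothetical, and the tab separator assumes that no value contains a tab character.

```r
set.seed(1)                              # only to make DATA reproducible
dfA <- data.frame(NAME  = rep(c("a", "b"), each = 3),
                  TRIAL = rep(1:3, 2),
                  DATA  = runif(6))
dfB <- data.frame(NAME  = c("a", "b"),
                  TRIAL = c(2, 3))

key_cols <- intersect(names(dfA), names(dfB))   # "NAME" "TRIAL"

# Hypothetical helper: paste the key columns of each row into one string.
make_key <- function(d) do.call(paste, c(d[key_cols], sep = "\t"))

rows <- which(make_key(dfA) %in% make_key(dfB))
rows
```

Unlike the merge() approach, this keeps the original row order of dfA and adds no columns to either data frame.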