Conditional replacement of NAs in two dataframes in R

Probably a simple but tricky question, especially for larger data sets. Given two dataframes (df1, df2) of equal dimensions as below:
head(df1)
a b c
1 0.8569720 0.45839112 NA
2 0.7789126 0.36591578 NA
3 0.6901663 0.88095485 NA
4 0.7705756 0.54775807 NA
5 0.1743111 0.89087819 NA
6 0.5812786 0.04361905 NA
and
head(df2)
a b c
1 0.21210312 0.7670091 NA
2 0.19767464 0.3050934 1
3 0.08982958 0.4453491 2
4 0.75196925 0.6745908 3
5 0.73216793 0.6418483 4
6 0.73640209 0.7448011 5
How can one find all columns where all values are NA (in this case c), then go to df2 and set all values in the matching column (c) to NA?
Desired output
head(df3)
a b c
1 0.21210312 0.7670091 NA
2 0.19767464 0.3050934 NA
3 0.08982958 0.4453491 NA
4 0.75196925 0.6745908 NA
5 0.73216793 0.6418483 NA
6 0.73640209 0.7448011 NA
My actual dataframes have more than 140000 columns.

We can use colSums on the negated logical matrix (!is.na(df1)), then negate (!) the resulting vector so that columns with 0 non-NA elements become TRUE and all others FALSE, and use this to subset the columns of 'df2' and assign NA.
df2[!colSums(!is.na(df1))] <- NA
df2
# a b c
#1 0.21210312 0.7670091 NA
#2 0.19767464 0.3050934 NA
#3 0.08982958 0.4453491 NA
#4 0.75196925 0.6745908 NA
#5 0.73216793 0.6418483 NA
#6 0.73640209 0.7448011 NA
Or another option is to loop over the columns and check whether all the elements are NA, creating a logical vector to subset the columns of 'df2' and assign NA:
df2[sapply(df1, function(x) all(is.na(x)))] <- NA
If these are big datasets, another option is set from data.table (this should be more efficient, as it does the assignment in place):
library(data.table)
setDT(df2)
j1 <- which(sapply(df1, function(x) all(is.na(x))))
for(j in j1){
set(df2, i = NULL, j = j, value = NA)
}
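For completeness, the same column selection can also be sketched with dplyr. This is my own variant, not part of the original answers, and it assumes dplyr >= 1.0 (for across() and all_of()); note the blanked columns become logical NA, just as with the base-R assignment:

```r
library(dplyr)

df1 <- data.frame(a = runif(6), b = runif(6), c = rep(NA_real_, 6))
df2 <- data.frame(a = runif(6), b = runif(6), c = 0:5)

# names of the columns in df1 that are entirely NA
all_na_cols <- names(df1)[sapply(df1, function(x) all(is.na(x)))]

# blank out the matching columns of df2
df2 <- df2 %>% mutate(across(all_of(all_na_cols), ~ NA))
```

With 140000+ columns the sapply() scan is still the cheap part; the data.table set() approach above remains preferable if you want to avoid copying df2.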

Related

Comparing two columns in R based on matching values in a common column

So, I have two datasets, such that all the columns in one dataset are present in the other, along with some extra columns. What I want to do is create a new dataset of the differences between the entries in common columns, on the basis of matching a common identifier column present in both datasets. How can I do that in R?
If it were a single column, I could have used the sqldf function as
sqldf("select a.v1 - b.v1 from ds1 a left join ds2 b on a.identifier=b.identifier")
But there are 900 common columns between both the datasets.
You can do this by simply joining the frames on identifier, and then subtracting the one frame from the other.
Here is an example of the approach using data.table
# load the library
library(data.table)
# do a left join on identifier
merged = setDT(ds2)[setDT(ds1), on="identifier"]
# get common column names, and remove "identifier" from that vector
cols = intersect(names(ds1), names(ds2))
cols = cols[cols!="identifier"]
# column-bind the identifier column with a subtraction of the two sub-frames
cbind(
merged[,.(identifier)],
setnames(merged[,.SD, .SDcols = paste0("i.",cols)] - merged[,.SD, .SDcols = cols],paste0("diff_",cols))
)
Here is the same approach using dplyr:
library(dplyr)
merged = left_join(ds1,ds2, by="identifier")
cols = intersect(names(ds1), names(ds2))
cols = cols[cols!="identifier"]
bind_cols(
select(merged,identifier),
(select(merged, all_of(paste0(cols, ".x"))) - select(merged, all_of(paste0(cols, ".y")))) %>%
rename_with(~paste0("diff_",cols), everything())
)
Output (same under either approach):
identifier diff_v1 diff_v2 diff_v3
<char> <num> <num> <num>
1: O 0.5028498 0.7573174 -1.00630610
2: S -2.5631238 -0.7041228 1.33877932
3: N NA NA NA
4: C NA NA NA
5: J NA NA NA
6: R NA NA NA
7: K NA NA NA
8: E NA NA NA
9: X -0.1830764 0.2924459 -0.01860763
10: Y NA NA NA
11: W NA NA NA
12: T -0.4912840 -2.8126285 -1.33661044
13: I NA NA NA
14: L NA NA NA
15: U NA NA NA
16: M -0.3130889 1.1590316 -0.44551660
17: P NA NA NA
18: H NA NA NA
19: B NA NA NA
20: G -2.2817049 2.4156583 -0.34393988
21: Z NA NA NA
22: A -0.1654816 -0.8807393 -1.08534789
23: F NA NA NA
24: V NA NA NA
25: D 1.4653655 0.2604109 -0.17733840
26: Q NA NA NA
identifier diff_v1 diff_v2 diff_v3
Input:
set.seed(123)
ds1 = data.frame(identifier = sample(LETTERS),v1 = rnorm(26),v2 = rnorm(26),v3 = rnorm(26))
ds2 = data.frame(identifier = sample(LETTERS,8),v1 = rnorm(8),v2 = rnorm(8),v3 = rnorm(8))
Using S1 and S2 as an example (see Note at end) we find the common column names using intersect. In the example data in the Note at the end there is only one numeric common column that is not ID but the same code should work even if there are many.
Then using Filter extract the names of the numeric common columns. We have assumed that if a common column in S1 is numeric then it is also numeric in S2 so we only have to check S1. The Filter line could be omitted if we knew that all common columns were numeric.
Next ensure that the ID column is excluded using setdiff. If ID is non numeric the Filter line would have already removed it in which case we could omit the setdiff line.
Now construct the select clause. sprintf creates a character vector of the elements of the select clause and toString collapses it to a comma separated string giving the final select string. Finally run the SQL statement. Note that fn$sqldf turns on string interpolation in the SQL statement and $sel inserts the contents of the sel variable into the SQL string.
library (sqldf)
nms <- intersect(names(S1), names(S2))
nms <- names(Filter(is.numeric, S1[nms]))
nms <- setdiff(nms, "ID")
sel <- toString(sprintf("a.[%s] - b.[%s] as [%s]", nms, nms, nms))
fn$sqldf("select ID, $sel  
from S1 a  
left join S2 b using(ID)")
## ID extra
## 1 1 0
## 2 2 0
## 3 3 0
## 4 4 0
## 5 5 0
## 6 6 0
Pipe
The nms <- lines above could alternatively be written in terms of pipes:
nms <- names(S1) |>
intersect(names(S2)) |>
subset(S1, select = _) |> 
Filter(f = is.numeric) |>
names() |>
setdiff("ID")
Note
The data frame sleep comes with R. S1 and S2 are used as an example.
S1 <- head(sleep)
S2 <- S1[-2]
S1
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
S2
## extra ID
## 1 0.7 1
## 2 -1.6 2
## 3 -0.2 3
## 4 -1.2 4
## 5 -0.1 5
## 6 3.4 6

How to select n random values from each rows of a dataframe in R?

I have a dataframe
df= data.frame(a=c(56,23,15,10),
b=c(43,NA,90.7,30.5),
c=c(12,7,10,2),
d=c(1,2,3,4),
e=c(NA,45,2,NA))
I want to select two random non-NA values from each row and convert the rest to NA.
Required Output- will differ because of randomness
df= data.frame(
a=c(56,NA,15,NA),
b=c(43,NA,NA,NA),
c=c(NA,7,NA,2),
d=c(NA,NA,3,4),
e=c(NA,45,NA,NA))
Code Used
I know to select random non-NA value from specific rows
set.seed(2)
sample(which(!is.na(df[1,])),2)
But I have no idea how to apply it to the whole dataframe and get the required output.
You may write a function to keep n random values in a row.
keep_n_value <- function(x, n) {
x1 <- which(!is.na(x))
x[-sample(x1, n)] <- NA
x
}
Apply the function by row using base R -
set.seed(123)
df[] <- t(apply(df, 1, keep_n_value, 2))
df
# a b c d e
#1 NA NA 12 1 NA
#2 NA NA 7 2 NA
#3 NA 90.7 10 NA NA
#4 NA 30.5 NA 4 NA
Or if you prefer tidyverse -
purrr::pmap_df(df, ~keep_n_value(c(...), 2))
Base R:
You could try a column-wise apply (sapply) and randomly replace two non-NA values with NA in each column, like:
as.data.frame(sapply(df, function(x) replace(x, sample(which(!is.na(x)), 2), NA)))
Example Output:
a b c d e
1 56 NA 12 NA NA
2 23 NA NA 2 NA
3 NA NA 10 3 NA
4 NA 30.5 NA NA NA
One option using dplyr and purrr could be:
df %>%
mutate(pmap_dfr(across(everything()), ~ `[<-`(c(...), !seq_along(c(...)) %in% sample(which(!is.na(c(...))), 2), NA)))
a b c d e
1 56 43.0 NA NA NA
2 23 NA 7 NA NA
3 15 NA NA NA 2
4 NA 30.5 2 NA NA

How to access variables of datatable when passed as a parameter to a function?

I'm new to R and I have searched over google for solution to the below problem.
I have
DT = data.table(y=c("a",NA,NA), y_1=c(NA,3,6), y_2=c(1,NA,3), y_3=c(1,1,1)).
I want to create a function passing the datatable and the column that needs to be changed.
fun <- function(dt, var)
{
dt[, (var) := ifelse(!is.na(get(var)), get(var),
              ifelse(!is.na(get(paste0(var, "_1"))), get(paste0(var, "_1")),
                     get(paste0(var, "_2"))))]  # ... and so on
return(dt)
}
I want to replace the values in y variable which are NA's with the values in y_1 if they are not NULL or else replace with y_2 and so on. Like this I want to create a function which can accept different variables but with the same ending.
Update: Uwe, thanks for pointing out that previous question. I found it pretty useful. But my requirement is slightly different: I need the same update for other variables as well where the values are NA. For example, I need to do it for (x, x_1, x_2, x_3, ...), (z, z_1, z_2, z_3, ...) and some other variables apart from y. Is there a way to use lapply or a function to do that?
Thanks in advance.
The OP is looking for a variant of the locf method (last observation carried forward) which is implemented as zoo::na.locf() for instance. While na.locf() usually is applied on a vector or a column of a data.frame, the OP is looking for a variant which is applied on each row of a data.table but restricted to a specific subset of columns. So, the function is being named na.locl() (last observation carried left).
In addition, the data.table is to be updated in place, e.g., without copying. The columns are named in a specific manner, e.g., x, x_1, x_2, x_3, etc. So, x is kind of a base name for the subset of columns.
The function below will look in each row of a specific subset of columns of a given data.table for the first non-NA column and copies this value to column x.
The implementation is based on this solution. It includes some plausibilty checks.
na.locl <- function(var, dt) {
checkmate::assert_data_table(dt)
checkmate::assert_string(var)
checkmate::assert_choice(var, names(dt))
ans_val = rep_len(NA_real_, nrow(dt))
selected_cols <- unlist(lapply(
var, function(x) stringr::str_subset(names(dt), paste0("^", x, "(_\\d*)?$"))))
for(col in selected_cols) {
i = is.na(ans_val) & (!is.na(dt[[col]]))
ans_val[i] = dt[[col]][i]
}
set(dt, i = NULL, j = var, value = ans_val)
return(invisible(NULL))
}
In addition, the OP has requested to repeat this for other variables. This can be accomplished using lapply() with the na.locl() function. To demonstrate this, sample data are required.
library(data.table)
DT0 <- data.table(y=c("a",NA,NA,NA), y_1=c(NA,3,NA,NA), y_2=c(1,NA,3,NA), y_3=c(1,1,1,NA))
DT <- cbind(DT0, setnames(copy(DT0), stringr::str_replace(names(DT0), "^y", "x")))
DT <- cbind(DT, setnames(copy(DT0), stringr::str_replace(names(DT0), "^y", "zzz")))
DT
# y y_1 y_2 y_3 x x_1 x_2 x_3 zzz zzz_1 zzz_2 zzz_3
#1: a NA 1 1 a NA 1 1 a NA 1 1
#2: NA 3 NA 1 NA 3 NA 1 NA 3 NA 1
#3: NA NA 3 1 NA NA 3 1 NA NA 3 1
#4: NA NA NA NA NA NA NA NA NA NA NA NA
y, x, and zzz are NA except in row 1. After applying the function on DT,
dummy <- lapply(c("x", "y", "zzz"), na.locl, dt = DT)
DT
# y y_1 y_2 y_3 x x_1 x_2 x_3 zzz zzz_1 zzz_2 zzz_3
#1: a NA 1 1 a NA 1 1 a NA 1 1
#2: 3 3 NA 1 3 3 NA 1 3 3 NA 1
#3: 3 NA 3 1 3 NA 3 1 3 NA 3 1
#4: NA NA NA NA NA NA NA NA NA NA NA NA
the missing values in columns y, x, and zzz have been replaced by the next non-NA value to the right if available within the subset of columns. Thus, row 4 is all NA as no non-NA (that's three negations in a row) is available in each of the column subsets.
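A more compact sketch of the same row-wise coalescing, assuming data.table >= 1.12 (which provides fcoalesce()) and that all candidate columns can be coerced to a common type; this is my own condensed variant, not the original answer's code:

```r
library(data.table)

DT <- data.table(y   = c("a", NA, NA, NA), y_1 = c(NA, 3, NA, NA),
                 y_2 = c(1, NA, 3, NA),    y_3 = c(1, 1, 1, NA))

for (v in c("y")) {
  # the base column plus its _1, _2, ... companions, in left-to-right order
  cols <- grep(paste0("^", v, "(_\\d+)?$"), names(DT), value = TRUE)
  # fcoalesce() picks the first non-NA value across the vectors, row by row;
  # everything is coerced to character here because y mixes character and numeric
  DT[, (v) := do.call(fcoalesce, lapply(.SD, as.character)), .SDcols = cols]
}
```

The character coercion is the price of mixing types within a column group; if all companion columns were numeric, the lapply(.SD, as.character) wrapper could be dropped.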

combine into a data frame vectors of different length but preserving the order

I have a character vector, of the following form:
vv<-c(1,2,"c", "2%", 3, "b", "4%") # original vector
From this vector I created some new vectors according to the following lines of code:
# original vector
vv<-c(1,2,"c", "2%", 3, "b", "4%")
# vector without the characters , i.e the c and b
vv1<-vv[-grep("[a-zA-Z]", vv)]
# Steps to create the vector of "integers"
strip_percents <- as.numeric(gsub("%", "", vv1))
no_percents <- as.numeric(vv1[-grep("%", vv1)])
# Vector that collects the strings
strings_vv1 <- vv[grep("[a-zA-Z]", vv)]
# Vector the collects the percentage numbers
perce_vv1 <-vv1[grep("%", vv1)]
perce_vv1 <- as.numeric(gsub("%", "", perce_vv1))/100
My purpose is to combine all these vectors and create a dataframe, but following the structure/order of the original vector. In other words, I want to fill the dataframe with NAs in "right place". So, for example, my data frame I want to look like this:
df<-data.frame(original=vv, numerics=c(1,2,NA,0.02,3,NA,0.04), integers=c(1,2,NA,NA,3,NA,NA), characters=c(NA,NA,"c",NA,NA,"b",NA))
original numerics integers characters
1 1 1.00 1 <NA>
2 2 2.00 2 <NA>
3 c NA NA c
4 2% 0.02 NA <NA>
5 3 3.00 3 <NA>
6 b NA NA b
7 4% 0.04 NA <NA>
Can someone help me with this task ?
We can create a numeric index of elements containing % ('i1'), replace the % in those elements with /100, evaluate each string, and assign the output back. Applying as.numeric on the changed vector ('vv1') results in NA for all non-numeric elements; similarly, as.integer on the original vector ('vv') turns all non-numeric elements into NA. The elements with letters can be identified with grepl, and ifelse converts all other elements to NA.
vv1 <- vv
i1 <- grep("%", vv)
library(gsubfn)
vv1[i1] <- sapply(gsubfn(".", list(`%`="/100"), vv[i1]), function(x) eval(parse(text=x)))
vv1 <- as.numeric(vv1)
vv2 <- as.integer(vv)
vv3 <- ifelse(grepl("^[A-Za-z]+$", vv), vv, NA)
data.frame(original=vv, numerics=vv1, integer=vv2, characters=vv3)
# original numerics integer characters
#1 1 1.00 1 <NA>
#2 2 2.00 2 <NA>
#3 c NA NA c
#4 2% 0.02 NA <NA>
#5 3 3.00 3 <NA>
#6 b NA NA b
#7 4% 0.04 NA <NA>
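The eval(parse()) step can be avoided entirely with plain string handling; here is a base-R sketch of the same result (my own variant, not part of the original answer):

```r
vv <- c(1, 2, "c", "2%", 3, "b", "4%")

# strip the % sign; letters become NA under as.numeric (warnings suppressed)
num <- suppressWarnings(as.numeric(sub("%", "", vv)))
# scale only the elements that carried a % sign
pct <- grepl("%", vv)
num[pct] <- num[pct] / 100

data.frame(original   = vv,
           numerics   = num,
           integers   = suppressWarnings(as.integer(vv)),
           characters = ifelse(grepl("^[A-Za-z]+$", vv), vv, NA))
```

This avoids pulling in gsubfn and evaluating generated code, at the cost of handling the percent scaling by hand.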

merge multiple data.frame by row in R

I would like to merge multiple data.frame in R using row.names, doing a full outer join. For this I was hoping to do the following:
x = as.data.frame(t(data.frame(a=10, b=13, c=14)))
y = as.data.frame(t(data.frame(a=1, b=2)))
z = as.data.frame(t(data.frame(a=3, b=4, c=3, d=11)))
res = Reduce(function(a,b) merge(a,b,by="row.names",all=T), list(x,y,z))
Warning message:
In merge.data.frame(a, b, by = "row.names", all = T) :
column name ‘Row.names’ is duplicated in the result
> res
Row.names Row.names V1.x V1.y V1
1 1 a 10 1 NA
2 2 b 13 2 NA
3 3 c 14 NA NA
4 a <NA> NA NA 3
5 b <NA> NA NA 4
6 c <NA> NA NA 3
7 d <NA> NA NA 11
What I was hoping to get would be:
V1 V2 V3
a 10 1 3
b 13 2 4
c 14 NA 3
d NA NA 11
The following works (up to some final column renaming):
res <- Reduce(function(a,b){
ans <- merge(a,b,by="row.names",all=T)
row.names(ans) <- ans[,"Row.names"]
ans[,!names(ans) %in% "Row.names"]
}, list(x,y,z))
Indeed:
> res
V1.x V1.y V1
a 10 1 3
b 13 2 4
c 14 NA 3
d NA NA 11
What happens with a merge on row names is that a column with the original row names is added to the result, which in turn only gets automatic row names:
> merge(x,y,by="row.names",all=T)
Row.names V1.x V1.y
1 a 10 1
2 b 13 2
3 c 14 NA
This behavior is documented in ?merge (under Value)
If the matching involved row names, an extra character column called
Row.names is added at the left, and in all cases the result has
‘automatic’ row names.
When Reduce tries to merge again, it doesn't find any match unless the names are cleaned up manually.
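An alternative that sidesteps the Row.names column entirely: promote the row names to a real column, reduce with full joins, then restore them. This is a sketch assuming dplyr and tibble are available, not code from the original answers:

```r
library(dplyr)
library(tibble)

x <- as.data.frame(t(data.frame(a = 10, b = 13, c = 14)))
y <- as.data.frame(t(data.frame(a = 1,  b = 2)))
z <- as.data.frame(t(data.frame(a = 3,  b = 4, c = 3, d = 11)))

res <- list(x, y, z) |>
  lapply(rownames_to_column, var = "rn") |>               # row names become a joinable column
  Reduce(f = function(a, b) full_join(a, b, by = "rn")) |>
  column_to_rownames(var = "rn")
```

Because "rn" is an ordinary column, full_join matches it cleanly at every step of the reduction, so no manual row-name cleanup is needed between merges.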
This is not a clean solution but a workaround: I transform the list argument of Reduce using sapply.
Reduce(function(a,b) merge(a,b,by=0,all=T),
sapply(list(x,y,z),rbind))[,-c(1,2)]
x y.x y.y
1 10 1 3
2 13 2 4
3 14 NA 3
4 NA NA 11
Warning message:
In merge.data.frame(a, b, by = 0, all = T) :
column name ‘Row.names’ is duplicated in the result
For some reason I did not have much success with Reduce. Given a list of data.frames (df.lst) and a list of suffixes (suff.lst) to rename identical columns, this is my solution (it's a loop, which I know is ugly by R standards, but it works):
df.merg <- as.data.frame(df.lst[1])
colnames(df.merg)[-1] <- paste(colnames(df.merg)[-1],suff.lst[[1]],sep="")
for (i in 2:length(df.lst)) {
df.i <- as.data.frame(df.lst[i])
colnames(df.i)[-1] <- paste(colnames(df.i)[-1],suff.lst[[i]],sep="")
df.merg <- merge(df.merg, df.i, by = 1, all = TRUE)  # join on the first (identifier) column
}
