Calculating on grouped rows without loops [duplicate] - r

I have a tidy data set which describes attributes of products. Each product has many attributes, and each attribute is described in its own row. My goal is to do some calculations on each product without using loops. The reason for not wanting to use loops is that there are several hundred thousand products, and thus many millions of attributes.
Toy dataset with only one product:
df <- data.frame(productID = 1, attributeID = seq(1,15,1), dataType = c('range', 'range', 'predefined', 'predefined', 'bool', 'bool', 'bool', 'bool', 'double', 'double', 'double', 'double', 'double', 'double', 'double'), double = c(NA,NA,NA,NA,NA,NA,NA,NA,0,0,15,11.4,6,0,0), logical = c(NA,NA,NA,NA,TRUE,FALSE,FALSE,FALSE,NA,NA,NA,NA,NA,NA,NA), predefined = c(NA,NA,'Black','Round',NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA), from.value = c(0,0,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA), to.value = c(249,368,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
# productID attributeID dataType double logical predefined from.value to.value
# 1 1 1 range NA NA <NA> 0 249
# 2 1 2 range NA NA <NA> 0 368
# 3 1 3 predefined NA NA Black NA NA
# 4 1 4 predefined NA NA Round NA NA
# 5 1 5 bool NA TRUE <NA> NA NA
# 6 1 6 bool NA FALSE <NA> NA NA
# 7 1 7 bool NA FALSE <NA> NA NA
# 8 1 8 bool NA FALSE <NA> NA NA
# 9 1 9 double 0.0 NA <NA> NA NA
# 10 1 10 double 0.0 NA <NA> NA NA
# 11 1 11 double 15.0 NA <NA> NA NA
# 12 1 12 double 11.4 NA <NA> NA NA
# 13 1 13 double 6.0 NA <NA> NA NA
# 14 1 14 double 0.0 NA <NA> NA NA
# 15 1 15 double 0.0 NA <NA> NA NA
For example, how would one go about counting the zeros for each product in the double column?

Since you're only after counting the number of zeros in the double column, the following should help:
library(tidyverse)
df %>%
group_by(productID) %>%
summarise(sum.of.zeros=sum(double==0, na.rm = T))
The above sums the instances where double equals zero: each comparison that is TRUE counts as 1 and each FALSE counts as 0. The na.rm = T is required because the expression NA==0 returns NA.
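To see why na.rm matters, here is a minimal sketch on a standalone vector. The values in `x` are made up for illustration and are not the question's data.

```r
# Hypothetical values standing in for one product's `double` column
x <- c(NA, 0, 0, 15, 11.4, 6, 0, 0)

# Comparing with 0 gives a logical vector; NA stays NA
is_zero <- x == 0

# TRUE counts as 1 and FALSE as 0; na.rm = TRUE drops the NA comparisons
zero_count <- sum(is_zero, na.rm = TRUE)

# Without na.rm the single NA makes the whole sum NA
naive_count <- sum(is_zero)
```

Here `zero_count` counts only the comparisons that could actually be evaluated, while `naive_count` is NA.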

Take a look at the tidyverse packages, and dplyr in particular
library(tidyverse)
df %>% group_by( productID, from.value ) %>% summarise( amount = n_distinct( attributeID ))
# # A tibble: 2 x 3
# # Groups: productID [?]
# productID from.value amount
# <dbl> <dbl> <int>
# 1 1 0 2
# 2 1 NA 13

With data.table you can do:
library("data.table")
setDT(df)[, sum(na.omit(double)==0), productID]
or
setDT(df)[, sum(double==0, na.rm=TRUE), productID]
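If no extra packages are wanted, the same grouped count can probably be sketched in base R with tapply. The `products` data frame below is a hypothetical two-product stand-in, not the question's data.

```r
# Hypothetical data: two products, some attributes without a numeric value
products <- data.frame(productID = c(1, 1, 1, 2, 2, 2),
                       double    = c(0, NA, 3, 0, 0, NA))

# Apply sum() to the logical comparison within each productID;
# na.rm = TRUE is passed through to sum()
zeros <- tapply(products$double == 0, products$productID, sum, na.rm = TRUE)
```

`zeros` is a named vector with one entry per productID.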

Making the rows of a data frame to NAs using R

I have a data frame as follows,
aid=c(1:10)
x1_var=rnorm(10,0,1)
x2_var=rnorm(10,0,1)
x3_var=rbinom(10,1,0.5)
data=data.frame(aid,x1_var,x2_var,x3_var)
head(data)
aid x1_var x2_var x3_var
1 1 -0.99759448 -0.2882535 1
2 2 -0.12755695 -1.3706875 0
3 3 1.04709366 0.8977596 1
4 4 0.48883458 -0.1965846 1
5 5 -0.40264114 0.2925659 1
6 6 -0.08409966 -1.3489460 1
I want to set every column of a row to NA if x3_var==1, without setting the aid column to NA.
I tried the following code.
> data[which(data$x3_var==1),]=NA
> data
aid x1_var x2_var x3_var
1 NA NA NA NA
2 2 -0.12755695 -1.3706875 0
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 NA NA NA NA
7 NA NA NA NA
8 8 -1.78160459 -1.8677633 0
9 9 -1.65895704 -0.8086148 0
10 10 -0.06281384 1.8888726 0
But this code has also set the values of the aid column to NA. Can anybody help me to fix this?
Also, are there any other methods that do the same thing?
Thank you
Your code would work if you excluded the aid column from the assignment.
data[which(data$x3_var==1),-1]=NA
You can also do this without which :
data[data$x3_var==1, -1]=NA
In the above two cases I am assuming that you know the position of the aid column, i.e. 1. If you don't know the position of the column, you can use match to get its position.
data[data$x3_var==1, -match('aid', names(data))] = NA
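As a small deterministic check of the negative-index approach, here is a sketch on a hypothetical four-row data frame (`toy` and its values are invented so the result does not depend on random draws):

```r
# Hypothetical data with the same column layout as the question
toy <- data.frame(aid    = 1:4,
                  x1_var = c(0.5, -1.2, 0.3, 2.0),
                  x3_var = c(1, 0, 1, 0))

# Look up the position of the id column by name
aid_pos <- match("aid", names(toy))

# Rows where x3_var is 1, every column except aid
toy[toy$x3_var == 1, -aid_pos] <- NA
```

After the assignment, `toy$aid` is still 1 to 4, while rows 1 and 3 are NA in the other columns.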
A dplyr solution. Assuming the columns to be altered begin with "x" as in the example data.
library(dplyr)
set.seed(1001)
df1 <- data.frame(aid = 1:10,
x1_var = rnorm(10,0,1),
x2_var = rnorm(10,0,1),
x3_var = rbinom(10,1,0.5))
df1 %>%
mutate(across(starts_with("x"), ~ifelse(x3_var == 1, NA, .x)))
aid x1_var x2_var x3_var
1 1 2.1886481 0.3026445 0
2 2 -0.1775473 1.6343924 0
3 3 NA NA NA
4 4 -2.5065362 0.4671611 0
5 5 NA NA NA
6 6 -0.1435595 0.1102652 0
7 7 NA NA NA
8 8 -0.6229437 -1.0302508 0
9 9 NA NA NA
10 10 NA NA NA

Summing columns with string match on dates

I have a dataframe df with an ID variable and daily dates (format XYYYYMMDD) as column headers:
ID <- c(101,102,203,207,209)
X20170101 <- c(1,NA,NA,2,1)
X20170102 <- c(NA,1,1,1,NA)
X20170103<-c(NA,NA,NA,2,1)
X20170201<-c(NA,2,NA,NA,1)
X20170202<-c(NA,1,1,NA,NA)
X20170301<-c(NA,1,NA,NA,NA)
df <- data.table(ID,X20170101,X20170102,X20170103,X20170201,X20170202,X20170301)
ID X20170101 X20170102 X20170103 X20170201 X20170202 X20170301
101 1 NA NA NA NA NA
102 NA 1 NA 2 1 1
203 NA 1 NA NA 1 NA
207 2 1 2 NA NA NA
209 1 NA 1 1 NA NA
For each ID, I would like to sum across all dates/columns belonging to the same month. If yyyymm is the vector of strings for the first three months
yyyymm <- c("X201701","X201702","X201703")
I would like to obtain the dataframe want with strings in yyyymm as headers of the columns. That is:
ID X201701 X201702 X201703
101 1 NA NA
102 1 3 1
203 1 1 NA
207 5 NA NA
209 2 1 NA
My idea was to avoid reshaping the format of my dataset and use functions lapply and grepl to partially match the strings, but I'm missing something.
test = lapply(df, function(x) colSums(df[,grepl(x, names(df))]))
Many thanks.
Here's one approach using the lubridate package to parse the dates and split.default to divide the data.frame into groups of columns that belong to the same month:
library(lubridate)
factors = sapply(ymd(gsub("X", "", names(df)[-1])), function(x)
paste0(year(x), sprintf("%02d", as.integer(month(x)))))
data.frame(df[,1],
lapply(split.default(df[,-1], factors), function(x)
rowSums(x, na.rm = TRUE) * (NA^(rowSums(is.na(x)) == NCOL(x)))))
# ID X201701 X201702 X201703
#1 101 1 NA NA
#2 102 1 3 1
#3 203 1 1 NA
#4 207 5 NA NA
#5 209 2 1 NA
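The `NA^(...)` factor in the code above is what keeps a month NA when an ID has no values at all in it. A minimal sketch of that trick in isolation, on a hypothetical two-column block:

```r
# Hypothetical two-day block for three IDs
block <- data.frame(X20170101 = c(1, NA, 2),
                    X20170102 = c(NA, NA, 1))

# TRUE for rows where every value is missing
all_missing <- rowSums(is.na(block)) == ncol(block)

# NA^FALSE is 1 and NA^TRUE is NA, so this is a multiplier of 1 or NA
mask <- NA^all_missing

# Sum ignoring NAs, then blank out the rows that had nothing to sum
monthly <- rowSums(block, na.rm = TRUE) * mask
```

Without the mask, the second row would come out as 0 rather than NA.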
Is there a reason you don't want to reshape your data?
library(tidyverse)
want <- df %>%
gather(key, value, -ID) %>%
mutate(key = substr(key, 1, 7)) %>%
group_by(ID, key) %>%
summarise(value = sum(value, na.rm=TRUE)) %>%
spread(key, value)
# A tibble: 5 x 4
# Groups: ID [5]
ID X201701 X201702 X201703
* <dbl> <dbl> <dbl> <dbl>
1 101 1 0 0
2 102 1 3 1
3 203 1 1 0
4 207 5 0 0
5 209 2 1 0
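Note that sum(value, na.rm=TRUE) returns 0 for a group that contains only NA, which is why this output shows 0 where the desired table has NA. One possible way around it is a small helper like the one below; `sum_or_na` is a hypothetical name, and the idea is that it could be used inside summarise() in place of the plain sum.

```r
# Hypothetical helper: a sum that stays NA when a group has no observed values
sum_or_na <- function(v) {
  if (all(is.na(v))) NA_real_ else sum(v, na.rm = TRUE)
}

empty_month <- sum_or_na(c(NA, NA))    # no observations in the group
mixed_month <- sum_or_na(c(2, NA, 1))  # NAs ignored, observed values summed
```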

Faster way to convert a list to a data.frame with some column values missing

I have this list of list
> head(train)
[[1]]
[[1]]$Physics
[1] 8
[[1]]$Chemistry
[1] 7
[[1]]$PhysicalEducation
[1] 3
[[1]]$English
[1] 4
[[1]]$Mathematics
[1] 6
[[1]]$serial
[1] 195490
.
.
[[6]]
[[6]]$Physics
[1] 2
[[6]]$Chemistry
[1] 1
[[6]]$Biology
[1] 2
[[6]]$English
[1] 4
[[6]]$Mathematics
[1] 8
[[6]]$serial
[1] 182318
each sub-list has any five elements out of these 12 and one extra named serial
columns <- c("Physics", "Chemistry", "PhysicalEducation", "English",
"Mathematics", "serial", "ComputerScience", "Hindi", "Biology",
"Economics", "Accountancy", "BusinessStudies")
I am trying to convert this list into a data frame.
Presently, I am doing this using this for loop by iterating one row at a time. Although this works, it takes a huge amount of time.
colclass <- rep("numeric",12)
comby <- read.table(text = '', colClasses = colclass, col.names = columns)
for(i in 1:length(train)){
comby[i,names(train[[i]])] <- train[[i]]
}
I tried using do.call(rbind, train) but that doesn't work as it keeps adding new data into the old columns from the first iteration.
What's a better, faster way? I have around 1.5 million observations.
Desired output: the data frame should have all the columns, with NA where there is no value. I am also interested in whether it can be done faster without using any additional packages.
Physics Chemistry PhysicalEducation English Mathematics serial ComputerScience Hindi Biology Economics Accountancy
1 8 7 3 4 6 195490 NA NA NA NA NA
2 1 1 1 3 3 190869 NA NA NA NA NA
3 1 2 2 1 2 3111 NA NA NA NA NA
4 8 7 6 7 7 47738 NA NA NA NA NA
5 1 1 1 3 2 85520 NA NA NA NA NA
6 2 1 NA 4 8 182318 NA NA 2 NA NA
BusinessStudies
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
Here is the reproducible code
train <- "[{\"Physics\":8,\"Chemistry\":7,\"PhysicalEducation\":3,\"English\":4,\"Mathematics\":6,\"serial\":195490},{\"Physics\":1,\"Chemistry\":1,\"PhysicalEducation\":1,\"English\":3,\"Mathematics\":3,\"serial\":190869},{\"Physics\":1,\"Chemistry\":2,\"PhysicalEducation\":2,\"English\":1,\"Mathematics\":2,\"serial\":3111},{\"Physics\":8,\"Chemistry\":7,\"PhysicalEducation\":6,\"English\":7,\"Mathematics\":7,\"serial\":47738},{\"Physics\":1,\"Chemistry\":1,\"PhysicalEducation\":1,\"English\":3,\"Mathematics\":2,\"serial\":85520},{\"Physics\":2,\"Chemistry\":1,\"Biology\":2,\"English\":4,\"Mathematics\":8,\"serial\":182318},{\"Physics\":3,\"Chemistry\":4,\"PhysicalEducation\":5,\"English\":5,\"Mathematics\":8,\"serial\":77482},{\"Accountancy\":2,\"BusinessStudies\":5,\"Economics\":3,\"English\":6,\"Mathematics\":7,\"serial\":152940},{\"Physics\":5,\"Chemistry\":6,\"Biology\":7,\"English\":3,\"Mathematics\":8,\"serial\":132620}]"
train <- rjson::fromJSON(train)
As a starting point you can use purrr::map as follows:
A sample data set:
x <- list(list(physics=8,
Chemistry=7,
PhysicalEducation=3,
English=4,
serial=195490),
list(physics=2,
Chemistry=1,
Biology=2,
English=4,
Mathematics=8,
serial=182318))
Sol.1 [Shortest to avoid loops]
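library(purrr)
# .default supplies the value used when a name is absent from a sub-list
zzz <- sapply(columns, function(n) map_dbl(x, n, .default = NA_real_)) %>%
data.frame()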
Which gives:
> zzz
Physics Chemistry PhysicalEducation English Mathematics serial ComputerScience Hindi Biology Economics
1 NA 7 3 4 NA 195490 NA NA NA NA
2 NA 1 NA 4 8 182318 NA NA 2 NA
Accountancy BusinessStudies
1 NA NA
2 NA NA
If you would like to understand how this works, you can check the longer solutions below.
Sol.2 [Manual assignment]
-pick the values for each column:
z <- data.frame(
serial = map_dbl(x, "serial", .default = NA_real_),
Biology = map_dbl(x, "Biology", .default = NA_real_),
Chemistry = map_dbl(x, "Chemistry", .default = NA_real_)
)
Which gives:
> z
serial Biology Chemistry
1 195490 NA 7
2 182318 2 1
>
Sol.3 [Pre-defined dataframe and for-loop]
create a dataframe with a fixed size
zz <- data.frame(matrix(NA, nrow = length(x), ncol = 12))
assign names
names(zz) <- columns
assign values from the lists
for(i in 1:ncol(zz)){
zz[columns[i]] <- map_dbl(x, columns[i], .default = NA_real_)
}
Which gives:
> zz
Physics Chemistry PhysicalEducation English Mathematics serial ComputerScience Hindi Biology Economics
1 NA 7 3 4 NA 195490 NA NA NA NA
2 NA 1 NA 4 8 182318 NA NA 2 NA
Accountancy BusinessStudies
1 NA NA
2 NA NA
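Since the question also asks for a way that avoids additional packages, the same column-by-column idea can be sketched in base R with vapply. This is only a sketch: `records`, `pick` and the shortened `columns` vector are hypothetical stand-ins for the question's `train` list and full column list.

```r
# Column names from the question, shortened here for brevity
columns <- c("Physics", "Chemistry", "Biology", "serial")

# Hypothetical records shaped like the question's `train` list
records <- list(list(Physics = 8, Chemistry = 7, serial = 195490),
                list(Chemistry = 1, Biology = 2, serial = 182318))

# Return the named element as a number, or NA when it is absent
pick <- function(rec, nm) {
  val <- rec[[nm]]
  if (is.null(val)) NA_real_ else as.numeric(val)
}

# Build one numeric vector per column, then bind them into a data frame
cols <- lapply(setNames(columns, columns), function(nm) {
  vapply(records, pick, numeric(1), nm = nm)
})
wide <- as.data.frame(cols)
```

Each column is filled in one vectorised pass over the records, rather than growing the data frame one row at a time.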
You can accomplish this in base R by combining Reduce and Map.
data
Here is a dataset that matches your structure.
set.seed(1234)
temp <- replicate(7, setNames(replicate(7, sample(1:10, 1), simplify=FALSE), letters[1:7]),
simplify=FALSE)
To produce a data.frame from this, you can use
Reduce(rbind, Map(data.frame, temp))
a b c d e f g
1 2 7 7 7 9 7 1
2 3 7 6 7 6 3 10
3 3 9 3 3 2 3 4
4 4 2 1 3 9 6 10
5 9 1 5 3 4 6 2
6 8 3 3 10 9 6 7
7 4 7 4 6 7 5 3
Where data.frame constructs data.frames with the inner elements. Map applies this to each element of the outer list, resulting in a list of data.frames. Finally, Reduce rbinds the data.frames in the list and produces a single data.frame.

reshape and aggregate datatable

I asked a very similar question and because I haven't quite gotten a handle on tidyr or reshape I have to ask another question. I have a datatable containing repeat id values (see below):
id Product NI
1 Direct Auto 15
2 Direct Auto 15
3 Direct Auto 15
4 Direct Auto 15
5 Direct Auto 15
6 Direct Auto 15
6 Mortgage 50
9 Direct Auto 15
10 Direct Auto 15
11 Direct Auto 15
12 Direct Auto 15
13 Direct Auto 15
14 Direct Auto 15
15 Direct Auto 15
16 Direct Auto 15
1 Mortgage 50
5 Personal 110
19 Direct Auto 15
20 Direct Auto 15
1 Direct Auto 15
I would like the id aggregated to one row, the Product column to be 'spread' so that the values become variables, another variable containing the aggregated count of each Product by id, and the NI to be summed for each of the product groups by ID. So see example below:
id DirectAuto DA_NI Mortgage Mortgage_NI Personal P_NI
1 2 30 1 50 NA NA
2 1 15 NA NA NA NA
3 1 15 NA NA NA NA
4 1 15 NA NA NA NA
5 1 15 NA NA 1 110
6 1 15 1 50 NA NA
9 1 15 NA NA NA NA
11 1 15 NA NA NA NA
12 1 15 NA NA NA NA
13 1 15 NA NA NA NA
14 1 15 NA NA NA NA
15 1 15 NA NA NA NA
16 1 15 NA NA NA NA
19 1 15 NA NA NA NA
20 1 15 NA NA NA NA
For example, id 1 has 2 Direct Auto, so his DA_NI is 30, and he has 1 Mortgage, so his Mortgage_NI is 50.
So, basically make a 'wider' datatable. I'm still reading and practicing tidyr and reshape, but in the mean-time maybe someone can help.
Here is some of my starting code:
df[, .(tot = .N, NI = sum(NI)), by = c("id","Product")]
Afterwards, using some tidyr & reshape commands I can't seem to get the final output I want.
data.table v1.9.5 has nicer features for melting and casting. Using dcast from the devel version:
require(data.table) # v1.9.5
dcast(dt, id ~ Product, fun.agg = list(sum, length), value.var="NI", fill=NA)
I think this is what you're looking for. You can also check out the new HTML vignettes.
Rename the columns to your liking.
It's a little tricky to do this. It can be done using tidyr and dplyr, though it goes against Hadley Wickham's tidy data principles.
dat %>%
group_by(id, Product) %>%
summarise(NI = sum(NI), n = n()) %>%
gather(variable, value, n, NI) %>%
mutate(
col_name = ifelse(variable == "n",
as.character(Product),
paste(Product, variable, sep = "_"))
) %>%
select(-c(Product, variable)) %>%
spread(col_name, value)
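For comparison, the wide layout with one count column and one NI column per product can also be sketched in base R with tapply. The `dat` below is a hypothetical five-row subset of the question's data, and the `_NI` column labels are an arbitrary naming choice.

```r
# Hypothetical subset of the question's data
dat <- data.frame(id      = c(1, 1, 1, 5, 5),
                  Product = c("Direct Auto", "Direct Auto", "Mortgage",
                              "Direct Auto", "Personal"),
                  NI      = c(15, 15, 50, 15, 110))

# One row per id, one column per Product; empty cells are NA
n_by_product  <- tapply(dat$NI, list(dat$id, dat$Product), length)
ni_by_product <- tapply(dat$NI, list(dat$id, dat$Product), sum)

# Label the NI columns and put the two tables side by side
colnames(ni_by_product) <- paste(colnames(ni_by_product), "NI", sep = "_")
wide <- cbind(n_by_product, ni_by_product)
```

The result is a matrix whose row names are the ids; combinations that never occur stay NA rather than 0.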

rowMean if row passes a test

I'm working on a data set where the source name is specified by a 2-letter abbreviation in front of the variable. So all variables from source AA start with AA_var1, and source bb has bb_variable_name_2. There are actually a lot of sources, and a lot of variable names, but I leave only 2 as a minimal example.
I want to create a mean variable for any row where the number of sources, that is, where the number of unique prefixes for which the data on that row is not NA, is greater than 1. If there's only one source, I want that total variable to be NA.
So, for example, my data looks like this:
> head(df)
AA_var1 AA_var2 myid bb_meow bb_A_v1
1 NA NA 123456 10 12
2 NA 10 194200 12 NA
3 12 10 132200 NA NA
4 12 NA 132201 NA 12
5 NA NA 132202 NA NA
6 12 13 132203 14 NA
And I want the following:
> head(df)
AA_var1 AA_var2 myid bb_meow bb_A_v1 rowMeanIfDiverseData
1 NA NA 123456 10 12 NA #has only bb
2 NA 10 194200 12 NA 11 #has AA and bb
3 12 10 132200 NA NA NA #has only AA
4 12 NA 132201 NA 12 12 #has AA and bb
5 NA NA 132202 NA NA NA #has neither
6 12 13 132203 14 NA 13 #has AA and bb
Normally, I just use rowMeans() for this kind of thing. But the additional subsetting of selecting only rows whose variable names follow a convention /at the row level/ has caught me confused between the item-level and the general apply-level statements I'm used to.
I can get the prefixes at the dataframe level:
mynames <- names(df[!names(df) %in% c("myid")])
tmp <- str_extract(mynames, perl("[A-Za-z]{2}(?=_)"))
uniq <- unique(tmp[!is.na(tmp)])
So,
> uniq
[1] "AA" "bb"
So, I can make this a function I can apply to df like so:
multiSource <- function(x){
nm = names(x[!names(x) %in% badnames]) # exclude c("myid")
tmp <- str_extract(nm, perl("[A-Za-z]{2}(?=_)")) # get prefixes
uniq <- unique(tmp[!is.na(tmp)]) # ensure unique and not NA
if (length(uniq) > 1){
return(T)
} else {
return(F)
}
}
But this is clearly confused, and still getting data-set level, ie:
> lapply(df,multiSource)
$AA_var1
[1] FALSE
$AA_var2
[1] FALSE
$bb_meow
[1] FALSE
$bb_A_v1
[1] FALSE
And...
> apply(df,MARGIN=1,FUN=multiSource)
Gives TRUE for all.
I'd otherwise like to be saying...
df$rowMean <- rowMeans(df, na.rm=T)
# so, in this case
rowMeansIfTest <- function(X,test) {
# is this row multiSource TRUE?
# if yes, return(rowMeans(X))
# else return(NA)
}
df$rowMeanIfDiverseData <- rowMeansIfTest(df, test=multiSource)
But it is unclear to me how to do this without some kind of for loop.
The strategy here is to split the data frame by columns into variable groups and, for each row, identify whether each group has any non-NA values. We then use rowSums to check that at least two groups have non-NA values for a row, and if so, add the mean of those values with cbind.
This will generalize to any number of columns so long as they are named in the AA_varXXX format, and so long as the only column not in that format is myid. Easy enough to fix if this isn't strictly the case, but these are the limitations on the code as written now.
df.dat <- df[!names(df) == "myid"]
diverse.rows <- rowSums(
sapply(
split.default(df.dat, gsub("^([A-Z]{2})_var.*", "\\1", names(df.dat))),
function(x) apply(x, 1, function(y) any(!is.na(y)))
) ) > 1
cbind(df, div.mean=ifelse(diverse.rows, rowMeans(df.dat, na.rm=T), NA))
Produces:
AA_var1 AA_var2 myid BB_var3 BB_var4 div.mean
1 NA NA 123456 10 12 NA
2 NA 10 194200 12 NA 11
3 12 10 132200 NA NA NA
4 12 NA 132201 NA 12 12
5 NA NA 132202 NA NA NA
6 12 13 132203 14 NA 13
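A variant of the same strategy can derive the groups from whatever precedes the first underscore, so that it also covers the question's original names such as bb_meow. This is a sketch under that naming assumption, not part of the answer above.

```r
# The question's example data, with its original column names
df <- data.frame(AA_var1 = c(NA, NA, 12, 12, NA, 12),
                 AA_var2 = c(NA, 10, 10, NA, NA, 13),
                 myid    = c(123456, 194200, 132200, 132201, 132202, 132203),
                 bb_meow = c(10, 12, NA, NA, NA, 14),
                 bb_A_v1 = c(12, NA, NA, 12, NA, NA))

# Everything except the id column is a measurement
vars <- setdiff(names(df), "myid")

# Assumption: the source prefix is whatever precedes the first underscore
prefix <- sub("_.*$", "", vars)

# One logical column per source: does this row have any value from it?
has_source <- sapply(split(vars, prefix), function(cols) {
  rowSums(!is.na(df[cols])) > 0
})

# Keep the row mean only where more than one source contributed
diverse <- rowSums(has_source) > 1
df$rowMeanIfDiverseData <- ifelse(diverse, rowMeans(df[vars], na.rm = TRUE), NA)
```

Rows with values from only one source, or from none, end up NA in the new column.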
This solution seems a little convoluted to me, so there's probably a better way, but it should work for you.
# Here's your data:
df <- data.frame(AA_var1 = c(NA,NA,12,12,NA,12),
AA_var2 = c(NA,10,10,NA,NA,13),
BB_var3 = c(10,12,NA,NA,NA,14),
BB_var4 = c(12,NA,NA,12,NA,NA))
# calculate rowMeans for each subset of variables
a <- rowMeans(df[,grepl('AA',names(df))], na.rm=TRUE)
b <- rowMeans(df[,grepl('BB',names(df))], na.rm=TRUE)
# count non-missing values for each subset of variables
a2 <- rowSums(!is.na(df[,grepl('AA',names(df))]), na.rm=TRUE)
b2 <- rowSums(!is.na(df[,grepl('BB',names(df))]), na.rm=TRUE)
# calculate means:
rowSums(cbind(a*a2,b*b2)) /
rowSums(!is.na(df[,grepl('AA|BB',names(df))]), na.rm=TRUE)
Result:
> df$rowMeanIfDiverseData <- rowSums(cbind(a*a2,b*b2)) /
+ rowSums(!is.na(df[,grepl('AA|BB',names(df))]), na.rm=TRUE)
> df
AA_var1 AA_var2 BB_var3 BB_var4 rowMeanIfDiverseData
1 NA NA 10 12 NaN
2 NA 10 12 NA 11
3 12 10 NA NA NaN
4 12 NA NA 12 12
5 NA NA NA NA NaN
6 12 13 14 NA 13
And a little cleanup to exactly match your intended output:
> df$rowMeanIfDiverseData[is.nan(df$rowMeanIfDiverseData)] <- NA
> df
AA_var1 AA_var2 BB_var3 BB_var4 rowMeanIfDiverseData
1 NA NA 10 12 NA
2 NA 10 12 NA 11
3 12 10 NA NA NA
4 12 NA NA 12 12
5 NA NA NA NA NA
6 12 13 14 NA 13
My attempt, somewhat longwinded.....
dat<-data.frame(AA_var1=c(NA,NA,12,12,NA,12),
AA_var2=c(NA,10,10,NA,NA,13),
myid=1:6,
BB_var3=c(10,12,NA,NA,NA,14),
BB_var4=c(12,NA,NA,12,NA,NA))
#which columns are associated with variables used in our mean
varcols<-grep("var[1-9]",names(dat),value=T)
#which rows have the requisite diversification of non-nulls
#I assume these column names start with capitals followed by an underscore
meanrow<-apply(!is.na(dat[,varcols]),1,function(x){n<-varcols[x]
1<length(unique(regmatches(n,regexpr("[A-Z]+_",n))))
})
#do the row mean for all
dat$meanval<-rowMeans(dat[,varcols],na.rm=T)
#null out for those without diversification (i.e. !meanrow)
dat[!meanrow,"meanval"]<-NA
I think some of the answers are making this seem more complicated than it is. This will do it:
df$means = ifelse(rowSums(!is.na(df[, grep('AA_var', names(df))])) &
rowSums(!is.na(df[, grep('BB_var', names(df))])),
rowMeans(df[, grep('_var', names(df))], na.rm = T), NA)
# AA_var1 AA_var2 myid BB_var3 BB_var4 means
#1 NA NA 123456 10 12 NA
#2 NA 10 194200 12 NA 11
#3 12 10 132200 NA NA NA
#4 12 NA 132201 NA 12 12
#5 NA NA 132202 NA NA NA
#6 12 13 132203 14 NA 13
Here's a generalization of the above, given the comment, assuming unique id's (if they're not, create a unique index instead):
library(data.table)
library(reshape2)
dt = data.table(df)
setkey(dt, myid) # not strictly necessary, but makes life easier
# find the conditional
cond = melt(dt, id.var = 'myid')[,
sum(!is.na(value)), by = list(myid, sub('_var.*', '', variable))][,
all(V1 != 0), keyby = myid]$V1
# fill in the means (could also do a join, but will rely on ordering instead)
dt[cond, means := rowMeans(.SD, na.rm = T), .SDcols = grep('_var', names(dt))]
dt
# AA_var1 AA_var2 myid BB_var3 BB_var4 means
#1: NA NA 123456 10 12 NA
#2: 12 10 132200 NA NA NA
#3: 12 NA 132201 NA 12 12
#4: NA NA 132202 NA NA NA
#5: 12 13 132203 14 NA 13
#6: NA 10 194200 12 NA 11
fun <- function(x) {
MEAN <- mean(c(x[1], x[2], x[4], x[5]), na.rm=TRUE)
CHECK <- sum(!is.na(c(x[1], x[2]))) > 0 & sum(!is.na(c(x[4], x[5]))) > 0
MEAN * ifelse(CHECK, 1, NaN)
}
df$rowMeanIfDiverseData <- apply(df, 1, fun)
df
AA_var1 AA_var2 myid BB_var3 BB_var4 rowMeanIfDiverseData
1 NA NA 123456 10 12 NaN
2 NA 10 194200 12 NA 11
3 12 10 132200 NA NA NaN
4 12 NA 132201 NA 12 12
5 NA NA 132202 NA NA NaN
6 12 13 132203 14 NA 13
