Excluding rows with NAs in column-wise calculations - r

I have the below function:
colNames = c(1, 4)
myfun = function(a, b){
  test$result = 0.0
  for (i in colNames) {
    test$result = test$result + (test[,i] * exp(-a*test[,i+1]) * exp(b*test[,i+2]))
  }
  return(test$result)
}
I am basically trying to multiply three columns at a time (applying exp to the (i+1)th and (i+2)th columns and multiplying them with column i) and adding the result to the same operation applied to the next group of three columns.
However, I have several NA values, and whenever a row of test[,i] contains an NA I want to exclude it from that part of the calculation and move on to the next iteration.
In other words, rows with NA values in test[,i] should not be used in the calculation of test$result. Is there any way to do this?
Sample data:
2 1708.637715 21.30199589 1 408.4464296 19.8614872
1 1708.637715 21.30199589 1 408.4464296 19.8614872
2 1708.637715 21.30199589 1 408.4464296 19.8614872
1 1708.637715 21.30199589 1 408.4464296 19.8614872
6 1708.637715 21.30199589 NA 408.4464296 19.8614872
0 1708.637715 21.30199589 NA 408.4464296 19.8614872
The first rows should be computed normally, but for the rows with an NA only the first group of columns (1 to 3) should be used in the addition.
Please help

You simply have to filter out any rows with NA before you enter the loop. To do that the code would be:
test <- test[!apply(is.na(test), 1, any),]
So then if you alter the function to:
new.myfun = function(a, b){
  test <- test[!apply(is.na(test), 1, any), ]
  test$result = 0.0
  for (i in colNames) {
    test$result = test$result + (test[,i] * exp(-a*test[,i+1]) * exp(b*test[,i+2]))
  }
  return(test$result)
}
new.myfun(1,1)
With the output:
[1] 1.736616e-169 1.736616e-169 1.736616e-169 1.736616e-169
Which is hopefully what you're trying to achieve.
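For reference, base R's complete.cases() should give the same row filter as the apply(is.na(...), 1, any) line above; a minimal sketch:
# keep only the rows with no NA in any column
test <- test[complete.cases(test), ]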
You can also explicitly iterate through the rows (or use an apply function):
new.myfun = function(a, b){
  # return 0 for this term when any of the three values is NA
  check.for.na <- function(x, y, z, a, b) {
    if(any(is.na(x), is.na(y), is.na(z))){
      return(0)
    }
    return(x * exp(-a*y) * exp(b*z))
  }
  result = rep(0, nrow(test))
  for (ROW in 1:nrow(test)){
    for (i in colNames){
      check_here_for_na <- check.for.na(test[ROW, i], test[ROW, i+1], test[ROW, i+2], a, b)
      result[ROW] = result[ROW] + check_here_for_na
    }
  }
  return(result)
}
new.myfun(1,1)
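If you'd rather avoid the explicit row loop, here is a vectorized sketch (assuming, as above, that an NA in any of the three columns should make that term contribute 0; vec.myfun is just an illustrative name):
vec.myfun <- function(a, b){
  result <- rep(0, nrow(test))
  for (i in colNames) {
    term <- test[, i] * exp(-a*test[, i+1]) * exp(b*test[, i+2])
    term[is.na(term)] <- 0  # NA in any of the three columns makes the whole term NA
    result <- result + term
  }
  result
}
vec.myfun(1, 1)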

Related

How to create a function with input from a dataframe and apply it over all rows?

I am trying to write a function in R that takes several variables from a dataframe as input and gives a vector of results as output.
Based on the post below, I wrote the function shown here:
How can create a function using variables in a dataframe
However, I receive this warning message:
the condition has length > 1 and only the first element will be used
I have tried to solve it with sapply in the function, following the post below, but without success.
https://datascience.stackexchange.com/questions/33351/what-is-the-problem-with-the-condition-has-length-1-and-only-the-first-elemen
# a data frame with columns a, x, y and z:
myData <- data.frame(a = 1:5,
                     x = 2:6,
                     y = 11:15,
                     z = 3:7)
myFun3 <- function(df, col1 = "x", col2 = "y", col3 = "z"){
  result <- 0
  if(df[,col1] == 2){
    result <- result + 10
  }
  if(df[,col2] == 11){
    result <- result + 100
  }
  return(result)
}
myFun3(myData)
> Warning messages:
> 1: In if (df[, col1] == 2) { :
> the condition has length > 1 and only the first element will be used
> 2: In if (df[, col2] == 11) { :
> the condition has length > 1 and only the first element will be used
Can someone explain to me how I can apply the function over all rows of the dataframe?
Thanks a lot!
We need ifelse instead of if/else as if/else is not vectorized
myFun3 <- function(df, col1 = "x", col2 = "y", col3 = "z"){
  result <- numeric(nrow(df))
  ifelse(df[[col1]] == 2, result + 10,
         ifelse(df[[col2]] == 11, result + 100, result))
}
myFun3(myData)
#[1] 10 0 0 0 0
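If the dplyr package is available (an assumption; this answer otherwise uses only base R), the same nested condition can also be written with case_when:
library(dplyr)
myFun3 <- function(df, col1 = "x", col2 = "y", col3 = "z"){
  case_when(df[[col1]] == 2  ~ 10,
            df[[col2]] == 11 ~ 100,
            TRUE             ~ 0)
}
myFun3(myData)
#[1] 10 0 0 0 0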
Or the OP's code can be Vectorized after making some changes, i.e. replacing the second if with an else if so the conditions form a ladder:
myFun3 <- Vectorize(function(x, y){
result <- 0
if(x == 2) {
result <- result + 10
} else if(y == 11){
result <- result + 100
} else result <- 0
return(result)
})
myFun3(myData$x, myData$y)
#[1] 10 0 0 0 0
Regarding the OP's doubt about what happens when multiple conditions are TRUE and only the first one should be executed: both ifelse (nested, if there are more than two conditions) and if/else if/else (an else if ladder, i.e. nested if/else) work, because the conditions are evaluated in the order in which we specify them and evaluation stops as soon as a TRUE condition occurs. Suppose we have multiple conditions
if(expr1) {
  1
} else if(expr2) {
  2
} else if(expr3) {
  3
} else if(expr4) {
  4
} else {
  5
}
This checks the first expression ('expr1') first, then the second, and so on. The moment an expression returns TRUE, it exits; i.e. it is equivalent to the nested condition
if(expr1) {
  1
} else {
  if(expr2) {
    2
  } else {
    if(expr3) {
      3
    } else {
      if(expr4) {
        4
      } else 5
    }
  }
}
There is a cost to this ordering: when most values match 'expr1', only 'expr1' is evaluated, which saves time, but when most values fall through to 5, all of the conditions have to be checked.
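As a small illustration of that ordering, using the Vectorized myFun3 from above: when both conditions hold, only the first branch is taken.
# x == 2 and y == 11 are both TRUE, but the first condition wins
myFun3(2, 11)
#[1] 10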

Elementary problems with a Loop in R

I am working with for-loops in R. I have a data frame which contains n columns, and I have to build a vector of length n where each element is 1 if the corresponding column is a double and 0 otherwise.
This is what I have tried:
y <- rep(0, dim.data.frame(datafr)[2])
attach(datafr)
x <- names(datafr)
for (j in 1:length(x)){
  for(i in x){
    if(is.double(i)){
      y[j] <- 1
    }else{
      y[j] <- 0
    }
  }
}
However, it does not work: the y vector that is returned contains no 1s, just n zeros.
All values in a column of a data.frame have the same class, so there is no need to check past the first value in a column; checking each column once is enough.
A simple approach is to create the vector with sapply, which loops (in this case) over the columns of the frame.
datafr <- data.frame(a=1:5, b=1:5 + 0, d=letters[1:5])
sapply(datafr, is.double)
# a b d
# FALSE TRUE FALSE
If you must use a for loop, this can be unrolled with
y <- integer(ncol(datafr)) # defaults to 0
y
# [1] 0 0 0
for (j in seq_along(datafr)) {
  if (is.double(datafr[[j]])) {
    y[j] <- 1L
  }
}
y
# [1] 0 1 0
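If you specifically want the 0/1 vector the question asks for rather than TRUE/FALSE, you can coerce the sapply result:
y <- as.integer(sapply(datafr, is.double))
y
# [1] 0 1 0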

while loop in R challenge

I have a dataset with 2 columns, a boolean column and a column of values. I would like to find the sum for the F (FALSE) rows using a while loop. The code is shown below but it gives an error:
sum <- 0
FM <- 0
idx <- 1
while (idx <= nrow(dataset)){
  if(subset(dataset, boolean == "F")){
    sum <- sum + dataset[idx, "value"]
    FM <- FM + 1
  }
  idx <- idx + 1
}
print(sum)
print(sum)
The error message is: Error in idx : object 'idx' not found
If you take the sum of logical values, you get a count of how many TRUE values are present. Since in this case you want to count the number of FALSE values, we can negate the values and then use sum.
sum(!df$boolean)
#[1] 2
However, I guess you want this in a while loop. You can iterate over every value in the boolean column, check whether it is FALSE, and increment the count.
i <- 1
FM <- 0
while(i <= nrow(df)) {
  if(!df$boolean[i])
    FM <- FM + 1
  i <- i + 1
}
FM
#[1] 2
We can also do this without the if condition (resetting i and FM first):
i <- 1
FM <- 0
while(i <= nrow(df)) {
  FM <- FM + !df$boolean[i]
  i <- i + 1
}
data
df <- data.frame(boolean= c(TRUE,FALSE,TRUE,TRUE,FALSE),value=c(8,16,4,12,9))
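If the goal is actually the sum of the value column for the FALSE rows (which is one way to read "the sum of the F value"), a loop-free one-liner would be:
sum(df$value[!df$boolean])
#[1] 25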

R programming. Dataframe and subsets

Select only the unique sets from a dataframe, where one set = one row of the data frame.
What is the syntax in R? I want the rows treated as sets.
See this example:
1 1 2
1 2 1
1 2 3
o/p:
1 1 2
1 2 3
Here row 1 and row 2 both form the set {1, 2}, so I need only one copy of such rows.
This is my code for the apriori algorithm. The function trim(data, r) is what I have been trying as a solution, but it isn't working out.
uniqueItemSets <- function(data){
  # unique items in basket
  items <- c()
  for(j in c(1:ncol(data))){
    items <- c(items, unique(data[,j]))
  }
  items <- unique(items)
  #return(as.list(items))
  return(items)
}

F_itemset <- function(data, candidate, sup){
  count <- rep(0, nrow(candidate))
  for(i in c(1:nrow(data))){         # every transaction
    for(j in c(1:nrow(candidate))){  # every dataset
      x <- candidate[j,]
      #x <- uniqueItemSets(x)
      y <- data[i,]
      #y <- uniqueItemSets(y)
      if(all(x %in% y)){
        count[j] <- count[j] + 1
      }
    }
  }
  # pruning
  pp <- cbind(candidate, count)
  pp <- as.data.frame(pp)
  pp <- subset(pp, pp$count >= sup)
  return(pp)
}

# k-itemset : k-value
makeItemSet <- function(candidate, k){
  l <- combn(candidate, k, simplify = FALSE)
  return(l)
}

aprio <- function(data, sup, conf, kmax){
  C <- uniqueItemSets(data)
  C <- as.data.frame(C)
  for(k in c(2:kmax)){
    F <- F_itemset(data, C, sup)
    F$count <- NULL
    if(nrow(F) < k){
      break
    }
    F <- t(F)
    C <- combn(F, k, simplify = FALSE)
    C <- as.data.frame(C)
    C <- t(C)  # transpose
    C <- unique(C)
    trim(C, 1)
  }
  return(F)
}
new <- data.frame()
trim <- function(data, r)
{
  x <- as.data.frame(data[r,])
  c <- c()
  for(j in c(1:ncol(x))){
    c <- c(c, x[,j])
  }
  c <- unique(c)
  if(r+1 <= nrow(data)){
    for(i in c((r+1):nrow(data))){
      t <- c()
      for(j in c(1:ncol(data))){
        t <- c(t, data[i,j])
      }
      t <- unique(t)
      if(all(t %in% c) && all(c %in% t))
      {
        data[-i,]
      }
    }
    new <- as.data.frame(data)
    if(r+1 < nrow(data)){
      trim(data[r+1:nrow(data),], r+1)
    }
  }
}
You can use apply with MARGIN = 1 to execute row-wise functions. The only thing to be aware of is that you need to transpose the outcome to get back the row orientation you need.
d <- data.frame(number1 = c(1, 1, 1),
                number2 = c(1, 2, 2),
                number3 = c(2, 1, 3))
# the next two statements can be run in one line of code if you want
d_sort <- t(apply(d, 1, sort))
# get rid of duplicate rows
unique(d_sort)
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 1 2 3
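If you would rather keep the original (unsorted) rows and just drop the later duplicates, the same sorted matrix can be passed to duplicated(); a sketch of that variant:
# keep the first occurrence of each row, viewed as a set
d[!duplicated(t(apply(d, 1, sort))), ]
#   number1 number2 number3
# 1       1       1       2
# 3       1       2       3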

How could I make this R snippet faster and more R-ish?

Coming from various other languages, I find R powerful and intuitive, but I am not thrilled with its performance. So I decided to try to improve a snippet I wrote and learn how to code better in R.
Here's a function I wrote, trying to determine if a vector is binary-valued (two distinct values or just one value) or not:
isBinaryVector <- function(v) {
  if (length(v) == 0) {
    return (c(0, 1))
  }
  a <- v[1]
  b <- a
  lapply(v, function(x) { if (x != a && x != b) { if (a != b) { return (c()) } else { b = x } } })
  if (a < b) {
    return (c(a, b))
  } else {
    return (c(b, a))
  }
}
EDIT: This function is expected to look through a vector and return c() if it is not binary-valued, or c(a, b) if it is, a being the smaller value and b the larger one (if a == b then just c(a, a)). E.g., for
A B C
1 1 1 0
2 2 2 0
3 3 1 0
I will lapply this isBinaryVector and get:
$A
[1] 1 1
$B
[1] 1 1
$C
[1] 0 0
The time it took on a moderate-sized dataset (about 1800 * 3500, with 2/3 of the columns binary-valued) is about 15 seconds. The set contains only floating-point numbers.
Is there any way I could do this faster?
Thanks for any input!
You are essentially trying to write a function that returns TRUE if a vector has exactly two unique values, and FALSE otherwise.
Try this:
> dat <- data.frame(
+ A = 1:3,
+ B = c(1, 2, 1),
+ C = 0
+ )
>
> sapply(dat, function(x)length(unique(x))==2)
A B C
FALSE TRUE FALSE
Next, you want to get the min and max value. The function range does this. So:
> sapply(dat, range)
A B C
[1,] 1 1 0
[2,] 3 2 0
And there you have all the ingredients to make a small function that is easy to understand and should be extremely quick, even on large amounts of data:
isBinary <- function(x) length(unique(x)) == 2
binaryValues <- function(x){
  if(isBinary(x)) range(x) else NA
}
sapply(dat, binaryValues)
$A
[1] NA
$B
[1] 1 2
$C
[1] NA
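A type-stable variant (my own tweak, not part of the answer above) is vapply with NA padding, if you always want a two-element result per column:
vapply(dat, function(x) if(isBinary(x)) range(x) else c(NA_real_, NA_real_), numeric(2))
#       A B  C
# [1,] NA 1 NA
# [2,] NA 2 NA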
This function returns TRUE or FALSE for vectors (or columns of a data frame):
is.binary <- function(v) {
  x <- unique(v)
  length(x) - sum(is.na(x)) == 2L
}
Also take a look at this post
I'd use something like this to get column indices:
bivalued <- apply(my.data.frame, 2, is.binary)
nominal <- my.data.frame[,!bivalued]
binary <- my.data.frame[,bivalued]
Sample data:
my.data.frame <- data.frame(c(0,1), rnorm(100), c(5, 19), letters[1:5], c('a', 'b'))
> apply(my.data.frame, 2, is.binary)
c.0..1. rnorm.100. c.5..19. letters.1.5. c..a....b..
TRUE FALSE TRUE FALSE TRUE
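One caveat worth noting (my addition, not from the answer): apply() coerces the data frame to a character matrix here, whereas sapply() works column by column and avoids that coercion; in this example both give the same result:
sapply(my.data.frame, is.binary)
# c.0..1. rnorm.100. c.5..19. letters.1.5. c..a....b..
#    TRUE      FALSE     TRUE        FALSE        TRUE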
