R: Vector of sums with conditions from different data frames

I want to build a vector of counts. For each value in a column of one data frame (DF1$A), I want the number of rows in a second data frame (DF2) where x is 1 and y is greater than or equal to that value. The result should be stored as a new column in DF1.
I have something like this:
DF1$A <- c(0.12, 0.29, 0.36, 0.55)
DF2
x <- c(0,0,1,0,1,0,1,0,0,1)
y <- c(0.11, 0.55, 0.23,0.33,0.59,0.66,0.88,0.11,0.05,0.90)
I want to create the vector DF1$B:
DF1B <- sum(DF2$y >= DF1$A & DF2$x == 1)
The problem is that I get a vector with a single value, whereas I want each element of the result to be computed from the corresponding value of DF1$A.
I am also getting this warning message:
longer object length is not a multiple of shorter object length

Ones and zeros can be coerced to logical values, and indexing a numeric vector with a logical vector keeps only the elements at the TRUE (1) positions.
as.logical(x)
# FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
y[as.logical(x)]
# 0.23 0.59 0.88 0.90
sum(y[as.logical(x)])
# 2.6
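
For the data in the question, the same idea can be applied once per threshold. The following is a minimal sketch, assuming DF1 and DF2 are data frames built from the vectors shown in the question (the data.frame() calls are an assumption about how the asker's objects were created). sapply() visits each value of DF1$A in turn, so every comparison is between the 10-element columns of DF2 and a single number, and no recycling warning arises.

```r
# Assumed reconstruction of the question's data
DF1 <- data.frame(A = c(0.12, 0.29, 0.36, 0.55))
DF2 <- data.frame(
  x = c(0, 0, 1, 0, 1, 0, 1, 0, 0, 1),
  y = c(0.11, 0.55, 0.23, 0.33, 0.59, 0.66, 0.88, 0.11, 0.05, 0.90)
)

# For each threshold a in DF1$A, count the rows of DF2 with x == 1 and y >= a
DF1$B <- sapply(DF1$A, function(a) sum(DF2$y >= a & DF2$x == 1))

DF1$B
# [1] 4 3 3 3
```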

Read the warning message and try the comparison in the console; you will see what happens:
c(1:4) >= c(1:10)
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Warning message:
In c(1:4) >= c(1:10) :
longer object length is not a multiple of shorter object length
You have to be careful about what you are comparing: length matters. R recycles the shorter vector to match the longer one, and warns when the longer length is not a multiple of the shorter.
This is fine:
c(1:4) >= c(1:4)
This is fine as well:
c(1:4) >= c(1:8)
or
c(c(1:4),c(1:4)) >= c(1:4)
Sometimes you want to compare one row against many rows, expecting all rows to have the same length. Recycling makes that possible, and the warning flags the case where the lengths do not line up.
In your case, the longer length (10) is not a multiple of the shorter length (4).
By the way, the length-related functions are length() for vectors and lists, and nrow(), ncol(), and dim() for table-like objects.
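
One way to sidestep recycling is to compare every element of one vector with every element of the other. The sketch below uses outer() for this; the vector names thresholds, x and y are illustrative and simply reuse the numbers from the question.

```r
thresholds <- c(0.12, 0.29, 0.36, 0.55)   # length 4
x <- c(0, 0, 1, 0, 1, 0, 1, 0, 0, 1)      # length 10
y <- c(0.11, 0.55, 0.23, 0.33, 0.59, 0.66, 0.88, 0.11, 0.05, 0.90)

# 10 is not a multiple of 4, which is why y >= thresholds warns
length(y) %% length(thresholds) == 0
# [1] FALSE

# 10 x 4 logical matrix: row i, column j holds y[i] >= thresholds[j]
cmp <- outer(y, thresholds, ">=")
dim(cmp)
# [1] 10  4

# Keep only the rows where x == 1, then count the TRUE values per column
counts <- colSums(cmp[x == 1, , drop = FALSE])
counts
# [1] 4 3 3 3
```

Compared with looping over the thresholds, this builds the whole comparison matrix at once, which is convenient for small inputs but uses memory proportional to the product of the two lengths.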

Related

Indexing tables of logical vectors with zero counts in R

I have the following:
> v1 <- c(T, F, T, T, F)
> table(v1)
v1
FALSE  TRUE
    2     3
To index the TRUE column, I do this:
> table(v1)[2]
TRUE
   3
However, if a logical vector contains only FALSE values, the table has only one column and the previous strategy no longer retrieves the TRUE column:
> v2 <- c(F, F, F, F, F)
> table(v2)[2]
<NA>
NA
How can one consistently index the TRUE column regardless of whether its count is zero? One solution is to do this:
> table(factor(v2, levels= c("FALSE", "TRUE")))[2]
TRUE
0
But this feels like cheating because it treats TRUE and FALSE as characters that become levels of a factor. For non-logical vectors, this behaviour is understandable, because there is no way of knowing what levels exist. (1) Is there a way to force table() to take into consideration the fact that logical vectors only take on two values and always present two columns for them? (2) Am I overthinking this and the last command is an acceptable and robust practice?

Convert to a factor with both levels specified so that it always has two levels. Without any TRUE value in the data, table() has no way of reporting a TRUE count, because that information is not present. With the factor levels specified, the TRUE count is reported as 0:
table(factor(v2, levels = c(FALSE, TRUE)))[2]
It is not clear why the TRUE values of a logical vector need to be counted with table() and then extracted by the TRUE/FALSE names. It is easier to use sum(), since TRUE is treated as 1 and FALSE as 0; negating with ! reverses this:
> sum(v1)
[1] 3
> sum(!v1)
[1] 2
> sum(v2)
[1] 0
> sum(!v2)
[1] 5

Because the logical case is so specific, I would write a dedicated function:
logitable <- function(x)
{
  x <- as.logical(x)
  kNA <- sum(is.na(x))           # number of missing values
  kT <- sum(x, na.rm=TRUE)       # number of TRUE values
  kF <- length(x) - kT - kNA     # whatever remains is FALSE
  return (structure(
    c(kT, kF, kNA),
    names = c("TRUE", "FALSE", "NA")
  ))
}
Please note that the returned object is not of class "table". Let me know if returning such an object is important to you.
Test with:
logitable(c(T,F,T,F,T))
logitable(c(T,T,T,T,T))
logitable(c(F,F,F,F,F))
logitable(c(T,F,T,F,NA))
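
If an object of class "table" is preferred, one possible sketch combines the factor-levels idea from the earlier answer with the useNA argument of table(). The function name logical_table is hypothetical and not part of base R.

```r
# Hypothetical helper: always reports FALSE, TRUE and NA counts as a "table"
logical_table <- function(x) {
  x <- as.logical(x)
  table(factor(x, levels = c(FALSE, TRUE)), useNA = "always")
}

tab <- logical_table(c(FALSE, FALSE, FALSE, FALSE, FALSE))
class(tab)
# [1] "table"

# Index by name rather than by position, so the lookup does not depend on order
as.vector(tab["TRUE"])
# [1] 0
as.vector(tab["FALSE"])
# [1] 5
```

Indexing by the name "TRUE" instead of position 2 keeps the lookup valid even if the order of the levels is changed later.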

Subsetting a logical vector with a logical vector in R

(Note: following the suggestions in the comments, I have changed the original title "Comparing the content of two vectors in R?" to "Subsetting a logical vector with a logical vector in R")
I am trying to understand the following R code snippet (by the way, the question originated while I was trying to understand this example.)
I have a vector a defined as:
a = c(FALSE, FALSE)
Then I can define b:
b <- a
I check b's content and everything looks OK:
b
#> [1] FALSE FALSE
Question
Now, what is the following code doing? Is it checking if b is equal to "not" a?
b[!a]
#> [1] FALSE FALSE
But if I try b[a] the result is different:
b[a]
#> logical(0)
I also tried a different example:
a = c(FALSE, TRUE)
b <- a
b
#> [1] FALSE TRUE
Now I try the same operations as above, but I get a different result:
b[!a]
#> [1] FALSE
b[a]
#> [1] TRUE
Created on 2021-03-23 by the reprex package (v0.3.0)

[] is used for subsetting a vector. You can subset a vector using integer indices or logical values.
When you use a logical vector to subset a vector, an element is selected if the corresponding logical value is TRUE. In your example you are subsetting a logical vector with a logical vector, which can be confusing. Let's take another example:
a <- c(10, 20)
b <- c(TRUE, FALSE)
a[b]
#[1] 10
Since 1st value is TRUE and second is FALSE, the first value is selected.
Now if we invert the values, 20 would be selected because !b returns FALSE TRUE.
a[!b]
#[1] 20
Now apply the same logic to your example:
a = c(FALSE, FALSE)
b <- a
!a returns TRUE TRUE, so both values are selected when you do b[!a], and none of the values are selected when you do b[a].

b[!a] returns the values of b at the positions where !a evaluates to TRUE.
Here !a is TRUE TRUE, so it returns the first and second values of b, which are FALSE and FALSE.
A clearer example:
a <- 1:4
b <- c(T, T, F, T)
Now a[!b] is equivalent to a[c(F, F, T, F)], i.e. only the third element of a.
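
The example above can be run as a short sketch, with which() added to show the integer positions that the logical index stands for:

```r
a <- 1:4
b <- c(TRUE, TRUE, FALSE, TRUE)

a[b]            # keep the positions where b is TRUE
# [1] 1 2 4
a[!b]           # keep the positions where b is FALSE
# [1] 3
which(!b)       # the integer position that a[!b] selects
# [1] 3
a[which(!b)]    # integer indexing gives the same result
# [1] 3

# An all-FALSE index selects nothing and gives a zero-length vector
a[c(FALSE, FALSE, FALSE, FALSE)]
# integer(0)
```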

Using which(), !is.na() and parameter like [1,]

Can someone describe exactly (I understand partially) what the following line does?
which(!is.na(table[1,]))
1) table[1,] = ? Row 1 or column 1 of an object called "table"?
2) !is.na = why the !? (is.na is used to find the NA values, but why the !? Normally, ! represents negation (not).)

If we split the expression into pieces,
table[1,]
subsets the first row of the dataset.
is.na(table[1,])
checks for NA values in the first row. It returns a logical vector (TRUE for NA and FALSE for non-NA).
! is the negation operator. It converts TRUE to FALSE and vice versa, giving a logical vector that is TRUE for the non-NA elements:
!is.na(table[1,])
Lastly, the which() wrapper gives the numeric indices of the TRUE values.
To demonstrate, say we have a matrix
m1 <- matrix(c(NA, 0, 1, 2), 2, 2)
Then, if we follow the steps:
m1[1,] #returns the 1st row as a vector
#[1] NA 1
is.na(m1[1,]) #returns TRUE for NA
#[1] TRUE FALSE
!is.na(m1[1,]) #returns TRUE for non-NA elements
#[1] FALSE TRUE
which(!is.na(m1[1,]))
#[1] 2
#or perhaps more usefully
which(is.na(m1[1,]))
#[1] 1
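
The same steps extend from one row to the whole matrix. As a small sketch reusing the m1 matrix from above, which() with arr.ind = TRUE reports the row and column of every non-NA element instead of a single index.

```r
m1 <- matrix(c(NA, 0, 1, 2), 2, 2)

# Row/column positions of all non-NA elements, in column-major order
idx <- which(!is.na(m1), arr.ind = TRUE)
idx
#      row col
# [1,]   2   1
# [2,]   1   2
# [3,]   2   2

# Number of non-NA elements in each row
rowSums(!is.na(m1))
# [1] 1 2
```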

Filtering with logical + NA values in one column

I have the following data frame:
df <- data.frame("Logical"=c("true",NA,"false","true","","false"),
"Numeric"=c(1,2,3,4,5,6))
unique(df$Logical)
length(df$Logical == TRUE)
I'm trying to figure out how many TRUE values I have in my df$Logical column. But it seems I'm missing something: length(df$Logical == TRUE) returns the number of records in the column.
What am I doing wrong in this particular case? The desired result is 2 for the TRUE values in the df$Logical column. Many thanks in advance.

We need to specify the string in lower case, as the values are 'true'/'false' and not exactly TRUE/FALSE. Also, sum should be used instead of length: the sum of a logical vector is the number of TRUE elements. Because the column contains an NA, the plain sum is NA:
sum(df$Logical == "true")
#[1] NA
If there are NA elements in the column, use na.rm = TRUE
sum(df$Logical=='true', na.rm = TRUE)
#[1] 2
The length of the comparison result is always the same as the number of rows of the dataset,
length(df$Logical == "true")
#[1] 6
because the comparison returns a logical vector of length 6.
df$Logical == "true"
#[1]  TRUE    NA FALSE  TRUE FALSE FALSE
To get the counts of both true and false, we can use table
table(df$Logical)

First of all, "true" and "false" as you put them into your data frame are not Booleans but plain strings.
Moreover, length(df$Logical == TRUE) will always return 6 in this example, i.e. the number of elements in the column. This is because df$Logical == TRUE returns a vector with one TRUE, FALSE, or NA value per element. In your case it returns
FALSE NA FALSE FALSE FALSE FALSE
because the comparison is never true. However, the length of this vector is 6, which is what length() reports.
To overcome the problem you might define your data frame like this
df <- data.frame("Logical"=c(TRUE,NA,FALSE,TRUE,NA,FALSE),
"Numeric"=c(1,2,3,4,5,6))
And then you can sum up the number of TRUE values
sum(df$Logical == TRUE, na.rm = T)
[1] 2
na.rm = T is important here because otherwise the sum will return NA if one or more elements are NA.
Alternatively, you can work with strings to indicate true or false (and empty strings as NA).
Then you could write
df <- data.frame("Logical"=c("true",NA,"false","true","","false"),
"Numeric"=c(1,2,3,4,5,6))
sum(df$Logical == "true", na.rm = T)
[1] 2
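
Another option, sketched below, is to convert the strings to real logical values with as.logical(), which accepts "true" and "false" and turns anything it does not recognise, including the empty string, into NA. The column name Flag is made up for illustration, and stringsAsFactors = FALSE is set explicitly as an assumption so that the column is a character vector regardless of R version.

```r
df <- data.frame(Logical = c("true", NA, "false", "true", "", "false"),
                 Numeric = c(1, 2, 3, 4, 5, 6),
                 stringsAsFactors = FALSE)

# "true"/"false" become TRUE/FALSE; NA and "" become NA
df$Flag <- as.logical(df$Logical)
df$Flag
# [1]  TRUE    NA FALSE  TRUE    NA FALSE

sum(df$Flag, na.rm = TRUE)     # number of TRUE values
# [1] 2
sum(!df$Flag, na.rm = TRUE)    # number of FALSE values
# [1] 2
sum(is.na(df$Flag))            # number of missing values
# [1] 2
```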

why sometimes R can't tell difference between NA and 0?

I am trying to extract the rows of data where the field "var" equals 0.
But I found that "NA" values were treated as 0:
There are 20 rows of 0 and 809 rows of "NA".
There are 81291 rows in total in data frame d.
> length(d$var[d$var == "0"])
[1] 829
> length(d$var[d$var == 0])
[1] 829
The above 829 values include both 0 and "NA"
> length(d$var[d$var == "NA"])
[1] 809
> length(d$var[d$var == NA])
[1] 81291
Why does the above code give the length of d?

x == NA is not the way to test whether the value of some variable x is NA. Use is.na() instead:
> 2 == NA
[1] NA
> is.na(2)
[1] FALSE
Similarly, use is.null() to test whether an object is the NULL object.

Here is a solution that gives the right answer:
length(which(d$var == 0))
The reason you are facing this problem is that the comparison does not give FALSE for the NA values; it gives NA instead, and an NA in a logical index produces an NA element in the result rather than dropping it. which() returns only the positions where the condition is TRUE, so you get the right answer.

One way to evaluate this is the inelegant
length(d$var[(d$var == 0) & (!is.na(d$var))])
(or slightly more compactly, sum(d$var==0 & !is.na(d$var)))
I think your code illustrates some misunderstandings you are having about R syntax. Let's make a compact, reproducible example to illustrate:
d <- data.frame(var=c(7, 0, NA, 0))
As you point out, length(d$var[d$var==0]) will return 3, because NA==0 is evaluated as NA.
When you enclose the value you're looking for in quotation marks, R evaluates it as a string. So length(d$var[d$var == "NA"]) is asking how many elements in d$var are the character string "NA". Since there are no "NA" strings in your data set, the only elements you get back are those where the comparison itself evaluates to NA (because NA == "NA" evaluates to NA).
In order to answer your last question, look at what d$var[d$var==NA] returns: a vector of NA of the same length as your original vector. Again, any == comparison with NA evaluates to NA. Since all of the comparisons in that expression are to NA, you'll get back a vector of NAs that is the same length as your original vector.
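
The behaviour described in these answers can be reproduced with the small data frame from above. The sketch below contrasts indexing with a logical vector that contains NA against a few NA-safe alternatives.

```r
d <- data.frame(var = c(7, 0, NA, 0))

d$var == 0
# [1] FALSE  TRUE    NA  TRUE

# An NA in a logical index yields an NA element, so the length is inflated
d$var[d$var == 0]
# [1]  0 NA  0

# NA-safe ways to count or select the zeros
sum(d$var == 0, na.rm = TRUE)
# [1] 2
which(d$var == 0)
# [1] 2 4
d$var[d$var %in% 0]
# [1] 0 0

# Count missing values with is.na(), never with == NA
sum(is.na(d$var))
# [1] 1
```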
