conditional statements with missing values - julia

Total Julia beginner here, coming from R. I am struggling to find a simple syntax to define a new variable based on two variables that have missing values. Suppose I have two variables in R as follows:
esi_offer_cps <- esi_own <- rep(NA,100)
esi_offer_cps[1:25] <- 0
esi_offer_cps[26:50] <- 1
esi_own[1:16] <- 0
esi_own[25:40] <- 0
esi_own[51:62] <- 0
esi_own[63:80] <- 1
t1 <- table(esi_offer_cps, esi_own, exclude=NULL)
print(t1)
             esi_own
esi_offer_cps  0  1 <NA>
         0    17  0    8
         1    15  0   10
         <NA> 12 18   20
In R I define a new variable as follows:
esi_offer <- ifelse(esi_offer_cps | (!is.na(esi_own) & esi_own == 1),1,0)
t2 <- table(esi_offer, exclude=NULL)
print(t2)
esi_offer
   0    1 <NA>
  25   43   32
The 43 records with esi_offer = 1 come from the sum 15+10+18, and the 25 records with value 0 come from 17+8. This works in R because rules such as NA | TRUE = TRUE and NA & FALSE = FALSE hold when vectorized. In Julia, I understand that I can't use Boolean expressions with .&& or .|| where missing values are present, and I am going nuts looking for a reasonable way to deal with cases like this in base Julia. Performance is not my worry at the moment (one thing at a time!). I understand that if these were columns of a data frame there might be ways specific to data frames, but I have not investigated those because I am trying to understand the basics. I have been exploring ismissing, ternary operators and bitwise operators, but I think I am missing the big picture: there must be some principled way to look at operations of this type, but I can't fathom it. If you can point me to an article or a basic book with examples I would be grateful (or give me an example of how to deal with this particular case; hopefully I can extrapolate from that). I am still at the stage where reading the manual is not illuminating yet. Thank you in advance for your patience.
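For reference, the R rules relied on above can be checked directly. This is only a small sketch; the names vals, or_table and and_table are made up for illustration.

```r
# The three possible logical values in R
vals <- c(FALSE, NA, TRUE)

# Full truth tables for | and &, rows = left operand, columns = right operand
or_table  <- outer(vals, vals, "|")
and_table <- outer(vals, vals, "&")
dimnames(or_table) <- dimnames(and_table) <- list(c("F", "NA", "T"), c("F", "NA", "T"))

or_table["NA", "T"]    # NA | TRUE  is TRUE: the answer does not depend on the NA
and_table["NA", "F"]   # NA & FALSE is FALSE, for the same reason
or_table["NA", "F"]    # NA | FALSE stays NA: the answer does depend on the NA
```

An NA only survives when the other operand cannot decide the result on its own, which is why the 32 NA records in t2 are exactly those where esi_offer_cps is NA and esi_own is not 1.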

As August notes, the && and || operators in Julia are strict, i.e. they require a Bool as their first argument. Conversely, & and | fully support three-valued logic:
julia> [false, missing, true] .| [false missing true]
3×3 Matrix{Union{Missing, Bool}}:
 false    missing  true
 missing  missing  true
 true     true     true

julia> [false, missing, true] .& [false missing true]
3×3 Matrix{Union{Missing, Bool}}:
 false  false    false
 false  missing  missing
 false  missing  true
If you need some additional explanation of this please comment.

Related

R. Display only TRUE values from a boolean expression?

Data science student here. New to R, in my first course. I've spent way too much time trying to figure out this exercise, so I figured I would ask someone on here.
I have created a dataframe built from 4 matrices, titled bee_numbers_data_2:
  buff_tail garden_bee red_tail honeybee carder_bee
         10          8       18       12          8
          1          3        9       13         27
         37         19        1       16          6
          5          6        2        9         32
         12          4        4       10         23
The exercise asks us to only show honeybee numbers >= 10.
So I've created a boolean expression to display the TRUE FALSE statements:
bee_numbers_data_2$honeybee>=10
Which returns:
[1] TRUE TRUE TRUE FALSE TRUE
However, I want to display a list of the VALUES of the true statements, not a list of TRUE FALSE statements.
I've been poring over my textbook and the internet trying to figure out this simple problem, so any help would be greatly appreciated. Thanks so much.
Although this is a fairly simple question, covered in most introductory texts on R, I could not find a duplicate on SO, so it seems worth answering here.
Let's break it down. As you already showed, we can use boolean expressions to generate a vector of boolean values:
bee_numbers_data_2 = data.frame(honeybee=c(12,13,16,9,10))
bee_numbers_data_2$honeybee >= 10
# [1] TRUE TRUE TRUE FALSE TRUE
If we want to know which of those are true, we can use the base R function which:
which(bee_numbers_data_2$honeybee >= 10)
# [1] 1 2 3 5
If we want to know the original values corresponding to those position indices, we can use those indices to subset the original data, using [:
bee_numbers_data_2$honeybee[which(bee_numbers_data_2$honeybee >= 10)]
# [1] 12 13 16 10
Or, equivalently and a little more simply, we can subset using the boolean values directly:
bee_numbers_data_2$honeybee[bee_numbers_data_2$honeybee >= 10]
Note that as you learn more R, you will find that there are also some more advanced ways to filter and subset data, such as the packages data.table and dplyr. However, it is best to understand how to use base R first, as shown above.
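As a rough sketch of the same idea applied to whole rows, here is the full table from the question rebuilt by hand (the names keep, high, high2 and x are made up for illustration). It also shows one case where the two subsetting styles above are not quite equivalent: when the column contains NA.

```r
# Rebuild the data from the question
bee_numbers_data_2 <- data.frame(
  buff_tail  = c(10, 1, 37, 5, 12),
  garden_bee = c(8, 3, 19, 6, 4),
  red_tail   = c(18, 9, 1, 2, 4),
  honeybee   = c(12, 13, 16, 9, 10),
  carder_bee = c(8, 27, 6, 32, 23)
)

# Keep entire rows where honeybee >= 10 (note the trailing comma)
keep  <- bee_numbers_data_2$honeybee >= 10
high  <- bee_numbers_data_2[keep, ]

# subset() does the same and lets you name the column directly
high2 <- subset(bee_numbers_data_2, honeybee >= 10)

# With missing values, a logical index keeps an NA placeholder,
# while which() silently drops it
x <- c(12, NA, 9)
x[x >= 10]          # 12 NA
x[which(x >= 10)]   # 12
```

If your data can contain NA, decide which of the two behaviours you want before picking a style.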

Referencing different data frames(same size) for conditional change of resulting data frame

I want to include values stored in different data frames(same size) into conditions to generate value for the resulting data frame.
My data:
int contains the intensity of different features (m2, m4, m1e3) measured for a number of people (in rows).
int<-data.frame(int_m2=c(33,32,35) ,int_m4=c(111,113,118), int_m1e3=c(104,99,110))
View(int)
Three different quality evaluations (s_, p_, i_) for each feature (m2, m4, m1e3).
s_<-data.frame(s_m2=rep(8,3) ,s_m4=rep(100,3), s_m1e3=c("NA", 100, 100 ))
p_<-data.frame(p_m2=rep(10,3), p_m4=rep(10,3), p_m1e3=c("NA", 10, 10 ))
i_<-data.frame(i_m2=rep(0.1,3), i_m4=rep(0.5,3), i_m1e3=c("NA", 0.1, 0.5 ))
I need results in which the intensity from int is kept if the quality-evaluation conditions are TRUE. In the case of "NA" or FALSE, the value should be 0.
Quality evaluation conditions:
s_[i,j] > 9 & p_[i,j] < 20 & i_[i,j] > 0.2
Expected results:
res<-data.frame(m2=c(0,0,0),m4=c(111,113,118), m1e3=c(0, 0, 110 ))
Elaborate ranting:
My goal was to apply this on a big data set and to use foreach and dopar to make it faster.
But after trying combinations of mutate, ifelse, sapply, or an old-school if/else loop, I always get stuck on how to make the inputs for the conditions change when applied over the data frame. All that I found during extensive research were constants used as conditions over data frames.
So each attempt I made was half-finished code and thus not very useful to share.
Any recommendations for further reading on the subject are more than welcome.
Instead of looping over i and j, build the logical expression on the whole data frames at once and multiply the result by int. Where the condition is FALSE (coerced to 0) the product is 0; where it is TRUE (coerced to 1) the product is the corresponding value from 'int'.
(s_ > 9 & p_ < 20 & i_ > 0.2) * int
-output
  int_m2 int_m4 int_m1e3
1      0    111        0
2      0    113        0
3      0    118        0
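Note that this output has 0 in the third row of int_m1e3, whereas the expected result in the question has 110. With the data as posted, the m1e3 quality columns contain the string "NA", which makes them character columns, so the comparisons on them are string comparisons (for example "100" > 9 compares "100" with "9" and is FALSE). A minimal sketch, assuming the quality values are meant to be numeric with a real NA, and using a made-up intermediate name ok:

```r
int <- data.frame(int_m2 = c(33, 32, 35), int_m4 = c(111, 113, 118),
                  int_m1e3 = c(104, 99, 110))

# Assumption: real NA instead of the string "NA", so columns stay numeric
s_ <- data.frame(s_m2 = rep(8, 3),   s_m4 = rep(100, 3), s_m1e3 = c(NA, 100, 100))
p_ <- data.frame(p_m2 = rep(10, 3),  p_m4 = rep(10, 3),  p_m1e3 = c(NA, 10, 10))
i_ <- data.frame(i_m2 = rep(0.1, 3), i_m4 = rep(0.5, 3), i_m1e3 = c(NA, 0.1, 0.5))

# Logical matrix of the combined condition; NA wherever an input is NA
ok <- s_ > 9 & p_ < 20 & i_ > 0.2

# The question asks for 0 when the quality value is missing
ok[is.na(ok)] <- FALSE

res <- ok * int
res
```

Under that assumption, res has 0, 0, 110 in the int_m1e3 column, matching the expected result in the question.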

How do you read the %in% operator in plain English?

I'm struggling with how to read the %in% operator in R in "plain English" terms. I've seen multiple examples of code for its use, but not a clear explanation of how to read it.
For example, I've found terminology for the pipe operator %>% that suggests to read it as "and then." I'm looking for a similar translation for the %in% operator.
In the book R for Data Science in chapter 5 titled "Data Transformation" there is an example from the flights data set that reads as follows:
The following code finds all flights that departed in November or December:
filter(flights, month == 11 | month == 12)
A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y. We could use it to rewrite the code above:
nov_dec <- filter(flights, month %in% c(11, 12))
When I read "a useful short-hand for this problem is x %in% y," and then look at the nov_dec example, it seems like this is to be understood as "select every row where month (x) is one of the values in c(11,12) (y)," which doesn't make sense to me.
However my brain wants to read it as something like, "Look for 11 and 12 in the month column." In this example, it seems like x should be the values of 11 and 12 and the %in% operator is checking if those values are in y which would be the month column. My brain is reading this example from right to left.
However, all of the code examples I've found seem to indicate that this x %in% y should be read left to right and not right to left.
Can anyone help me read the %in% operator in layman's terms please? Examples would be appreciated.
If I wanted to really "spell it out", I'd read x %in% y as "for each x value, is it in y"?
The left-vs-right thing is all about what you're asking about. x %in% y is asking (using my verbose phrasing above), "for each x value, is it in y?" With that phrasing, we know to expect an answer (TRUE or FALSE) for every item in x.
This might actually get clearer if we extend it a little more - two common related questions are "are any x values in y?" and "are all the x values in y"? These can be coded naturally as
any(x %in% y) # Are any x values in y?
all(x %in% y) # Are all x values in y?
To me, at least, those seem quite natural, and they use the left-to-right reading. It would get convoluted to try to use a right-to-left reading here, something like "look for the y values in x, did you cover every x value with your matches?"
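To make the left-to-right reading concrete, here is a small sketch with made-up vectors x and y (not the flights data):

```r
x <- c(11, 3, 12, 7)   # e.g. the month of each flight
y <- c(11, 12)         # the months we care about

# "For each x value, is it in y?" -> one answer per element of x
x_in_y <- x %in% y     # TRUE FALSE TRUE FALSE

# Swapping the sides asks a different question, with one answer per element of y
y_in_x <- y %in% x     # TRUE TRUE

any(x %in% y)          # TRUE:  at least one x value is in y
all(x %in% y)          # FALSE: 3 and 7 are not in y
```

The result always has the length of the left-hand side, which is why filter() needs the month column on the left: it wants one TRUE/FALSE per row.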
That's actually a really good question. Think about the literal nature here:
Is the answer yes?
Is the answer no?
Is the answer yes or no?
Is the answer yes and no?
When you use %in% it is in lieu of an 'or' statement-- are any of these in here?
answers = data.frame(ans = sample(rep(c("yes","no","maybe"),
each = 3, times = 2)),
ind = 1:9)
# yes or no?
answers[answers$ans == "yes"|answers$ans == "no",]
# ans ind
# 1 yes 1
# 2 yes 2
# 4 no 4
# 5 yes 5
# 6 no 6
# 8 no 8
# 10 yes 1
# 12 no 3
# 13 no 4
# 16 yes 7
# 17 no 8
# 18 yes 9
# now about %in%
answers[answers$ans %in% c("yes","no"),]
# ans ind
# 1 yes 1
# 2 yes 2
# 4 no 4
# 5 yes 5
# 6 no 6
# 8 no 8
# 10 yes 1
# 12 no 3
# 13 no 4
# 16 yes 7
# 17 no 8
# 18 yes 9
# yes and no?
answers[answers$ans == c("yes","no"),]
# ans ind
# 1 yes 1
# 4 no 4
# 5 yes 5
# 6 no 6
# 8 no 8
# 12 no 3
# what happened here? were you expecting that?
# this checked the first row for yes,
# the second row for no,
# the third row for yes,
# the fourth row for no and so on...
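The recycling behind that last result is easier to see on a small fixed vector. The vector ans below is made up for illustration, since the sample() call above gives a different order on every run:

```r
ans <- c("yes", "yes", "no", "no")

# == recycles c("yes", "no") along ans:
# position 1 vs "yes", 2 vs "no", 3 vs "yes", 4 vs "no"
eq_result <- ans == c("yes", "no")      # TRUE FALSE FALSE TRUE

# %in% asks, for each element of ans, whether it is either value
in_result <- ans %in% c("yes", "no")    # TRUE TRUE TRUE TRUE

# The explicit "or" gives the same answer as %in% here
or_result <- ans == "yes" | ans == "no"
```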
I think your disconnect is understanding how to apply "in" to a vector. You wrote that you want to read it as "Look for 11 and 12 in the month column." You can indeed think of it that way. Your example was:
nov_dec <- filter(flights, month %in% c(11, 12))
And that could be expressed in plain English as:
Give me all the flights where one of the values in c(11, 12) is in the month column
But we could also say that 11 and 12 are "in" the vector c(11, 12). That's what the left-to-right reading would be:
Give me all the flights whose month is in the vector c(11, 12).
Or, expressed slightly differently and more verbosely:
Give me all the flights whose month is equal to one of the values in the vector c(11, 12)
This is conceptually similar to using a bunch of | operators in a row (month == 11 | month == 12), but it's best not to think of those as exactly equivalent. Instead of explicitly comparing x to every value in y, you're asking the question "is x equal to one of the values in y?" That's different in the same way that saying "please turn off the lights" is different than saying "please walk over to that plate on the wall and pull the little stick on it downwards." It's expressing what you want instead of how to figure it out, which makes your code more readable, and code is read more often than it's written, so that's important!!!
Now I'm getting way out of my area - again, I don't know what R actually does here - but the underlying method of answering the question might also be different. It might use a binary search algorithm to find out if x is in y.
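For what it's worth, the help page for match documents x %in% table as being defined via match(x, table, nomatch = 0) > 0. A small sketch with made-up vectors shows that definition at work, and one concrete way in which %in% and a chain of | comparisons are not exactly equivalent: missing values.

```r
x <- c(11, NA, 5)
y <- c(11, 12)

in_result    <- x %in% y                        # TRUE FALSE FALSE
match_result <- match(x, y, nomatch = 0L) > 0L  # same thing, spelled out

# The chained comparison propagates the NA instead of returning FALSE
or_result <- x == 11 | x == 12                  # TRUE NA FALSE
```

So %in% never returns NA, which is convenient inside filter() or if conditions, whereas the == version leaves it to you to decide what a missing month should mean.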

How to write a SAS macro in R?

I want to test the SAS code my team has produced in R to compare the estimates that we get from each but being new to R am not having much luck. In SAS we have written 3 macros to produce three separate estimates (HFS010, HFS011, HFS012), an example of one given here;
%macro HFS010 (peninc_var, pengn_var, pentax_var, pentype_var, HFS010_x_var);
do i = 1 to dim(pentypex);
if &pentype_var = 1 and &pengn_var = 1 then &HFS010_x_var = &peninc_var;
else if &pentype_var = 1 and &pengn_var = 2 then &HFS010_x_var = &peninc_var + &pentax_var;
end;
%mend HFS010;
Basically the idea is that each macro produces an estimate for gross pension income (so where applicable adds tax deducted from pensions on to pension income value). There are three macros as we want separate estimates for cases where pentype = 1 (HFS010), pentype = 2 (HFS011) and pentype = 3 to 7 (HFS012) and the survey accepts up to 16 entries for pensions.
To attempt to produce an equivalent of the above code in R, I wrote the following;
library(dplyr)  # for case_when()

for(i in 1:16) {
  pens_data[[paste0("HFS010_",i)]] <- case_when(
    pens_data[[paste0("pentype",i)]] == 1 & pens_data[[paste0("pengn",i)]] == 1 ~ pens_data[[paste0("peninc",i)]],
    pens_data[[paste0("pentype",i)]] == 1 & pens_data[[paste0("pengn",i)]] == 2 ~ pens_data[[paste0("peninc",i)]] + pens_data[[paste0("pentax",i)]],
    TRUE ~ 0)
}
This code does not produce errors but upon inspecting the estimates, there were some cases that should have estimates that were left blank.
Does anyone know of a way to write a macro in R? I thought of writing a function potentially for each of HFS010, HFS011, HFS012 but being new to R am not sure how to go about this.
If anyone has any suggestions as to why my R code isn't producing the correct estimates, or how they would write the equivalent of a SAS macro in R it would be greatly appreciated! I have tried to use defmacro but could not get this to work without errors.
Thanks so much!
Ashlee
There are many ways to write this in R. But first, a couple of comments:
R works well with vectors, so we should manipulate whole vectors wherever possible. This is much faster and avoids slow for loops with side effects.
To help others answer, please provide a reproducible example that covers both use cases.
For example:
set.seed(1)
dx <- data.frame(
peninc_var=sample(c(1,3),5,TRUE),
pengn_var=sample(c(1,2),5,TRUE),
pentax_var=1:5)
Here is an option in base R. I am creating the new variable HFS010_x_var using ifelse:
dx$HFS010_x_var <-
with(dx,{
## I am adding a last NO condition here to assign missing NA
ifelse(peninc_var==1 & pengn_var==1,peninc_var,
ifelse(peninc_var==1 & pengn_var==2,peninc_var + pentax_var,NA))
})
   peninc_var pengn_var pentax_var HFS010_x_var
1:          1         2          1            2
2:          1         2          2            3
3:          3         2          3           NA
4:          3         2          4           NA
5:          1         1          5            1
Another option (with more syntactic sugar) is to use data.table:
library(data.table)
setDT(dx)
dx[peninc_var==1 & pengn_var==1,HFS010_x_var := peninc_var]
dx[peninc_var==1 & pengn_var==2,HFS010_x_var := peninc_var+pentax_var]
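Since the question also asks about writing a function, here is a rough base R sketch of what such a helper could look like. The name gross_pension, its types argument, the variable n_pensions and the four-row pens_data are all made up for illustration; only the column naming (pentype1, pengn1, ...) follows the question. This is one possible shape, not a translation checked against the SAS output.

```r
# Hypothetical helper: returns peninc (pengn == 1) or peninc + pentax
# (pengn == 2) for pensions whose type is in `types`, and NA otherwise.
gross_pension <- function(pentype, pengn, peninc, pentax, types) {
  in_type <- pentype %in% types          # %in% never returns NA
  as_is   <- in_type & pengn %in% 1
  add_tax <- in_type & pengn %in% 2

  out <- rep(NA_real_, length(pentype))
  out[as_is]   <- peninc[as_is]
  out[add_tax] <- peninc[add_tax] + pentax[add_tax]
  out
}

# Toy data with a single pension entry per person (the survey allows up to 16)
pens_data <- data.frame(
  pentype1 = c(1, 1, 2, NA),
  pengn1   = c(1, 2, 1, 1),
  peninc1  = c(100, 200, 300, 400),
  pentax1  = c(10, 20, 30, 40)
)

n_pensions <- 1  # would be 16 for the real survey data
for (i in seq_len(n_pensions)) {
  pens_data[[paste0("HFS010_", i)]] <- gross_pension(
    pens_data[[paste0("pentype", i)]], pens_data[[paste0("pengn", i)]],
    pens_data[[paste0("peninc", i)]],  pens_data[[paste0("pentax", i)]],
    types = 1
  )
}
pens_data$HFS010_1   # 100 220 NA NA
```

The same function could then be called with types = 2 for HFS011 and types = 3:7 for HFS012, which is roughly the role the three macros play in the SAS code.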

Check if two intervals overlap in R

Given values in four columns (FromUp, ToUp, FromDown, ToDown), two of them always define a range (FromUp, ToUp and FromDown, ToDown). How can I test whether the two ranges overlap? It is important to note that the range values are not sorted, so the "From" value can be higher than the "To" value and vice versa.
Some Example data:
FromUp<-c(5,32,1,5,15,1,6,1,5)
ToUp<-c(5,31,3,5,25,3,6,19,1)
FromDown<-c(1,2,8,1,22,2,1,2,6)
ToDown<-c(4,5,10,6,24,4,1,16,2)
ranges<-data.frame(FromUp,ToUp,FromDown,ToDown)
So that the result would look like:
FromUp ToUp FromDown ToDown Overlap
     5    5        1      4   FALSE
    32   31        2      5   FALSE
     1    3        8     10   FALSE
     5    5        1      6    TRUE
    15   25       22     24    TRUE
     1    3        2      4    TRUE
     6    6        1      1   FALSE
     1   19        2     16    TRUE
     5    1        6      2    TRUE
I tried a few things but did not get it to work; in particular, the fact that the intervals are not "sorted" makes it too difficult for my R skills to figure out a solution.
I thought about finding the min and max values of each pair of columns (e.g. FromUp, ToUp) and then comparing them?
Any help would be appreciated.
Sort them
rng = cbind(pmin(ranges[,1], ranges[,2]), pmax(ranges[,1], ranges[,2]),
pmin(ranges[,3], ranges[,4]), pmax(ranges[,3], ranges[,4]))
and write the condition
olap = (rng[,1] <= rng[,4]) & (rng[,2] >= rng[,3])
In one step this might be
(pmin(ranges[,1], ranges[,2]) <= pmax(ranges[,3], ranges[,4])) &
(pmax(ranges[,1], ranges[,2]) >= pmin(ranges[,3], ranges[,4]))
The foverlaps() function mentioned by others (or IRanges::findOverlaps()) would be appropriate if you were looking for overlaps between any pair of ranges, but you're looking for 'parallel' (within-row) overlaps.
The logic of the solution here is the same as in the answer by @Julius, but it is vectorized (e.g., one call to pmin() rather than nrow(ranges) calls to sort()) and should be much faster (though using more memory) for longer vectors of possible ranges.
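As a quick check, the one-step condition can be run against the example data from the question and compared with the expected Overlap column. Referring to the columns by name inside with() is just a readability choice in this sketch:

```r
FromUp   <- c(5, 32, 1, 5, 15, 1, 6, 1, 5)
ToUp     <- c(5, 31, 3, 5, 25, 3, 6, 19, 1)
FromDown <- c(1, 2, 8, 1, 22, 2, 1, 2, 6)
ToDown   <- c(4, 5, 10, 6, 24, 4, 1, 16, 2)
ranges   <- data.frame(FromUp, ToUp, FromDown, ToDown)

# Two closed intervals overlap when each one starts no later than the other ends
ranges$Overlap <- with(ranges,
  pmin(FromUp, ToUp) <= pmax(FromDown, ToDown) &
  pmax(FromUp, ToUp) >= pmin(FromDown, ToDown))

ranges$Overlap
# [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
```

This treats the intervals as closed, so two ranges that merely touch at an endpoint count as overlapping; replace <= and >= with < and > if touching should not count.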
In general:
apply(ranges,1,function(x){y<-c(sort(x[1:2]),sort(x[3:4]));max(y[c(1,3)])<=min(y[c(2,4)])})
or, in case intervals cannot overlap at just one point (e.g. because they are open):
!apply(ranges,1,function(x){y<-sort(x)[1:2];all(y==sort(x[1:2]))|all(y==sort(x[3:4]))})

Resources