R. Display only TRUE values from a boolean expression? - r

Data science student here. New to R, in my first course. I've spent way too much time trying to figure out this exercise, so I figured I would ask someone on here.
I have created a dataframe built from 4 matrices, titled bee_numbers_data_2:
buff_tail garden_bee red_tail honeybee carder_bee
10 8 18 12 8
1 3 9 13 27
37 19 1 16 6
5 6 2 9 32
12 4 4 10 23
The exercise asks us to only show honeybee numbers >= 10.
So I've created a boolean expression to display the TRUE FALSE statements:
bee_numbers_data_2$honeybee>=10
Which returns:
[1] TRUE TRUE TRUE FALSE TRUE
However, I want to display a list of the VALUES of the true statements, not a list of TRUE FALSE statements.
I've been poring over my textbook and the internet trying to figure out this simple problem, so any help would be greatly appreciated. Thanks so much.

Although this is a fairly simple question, covered in most introductory texts on R, I could not find a duplicate on SO, so it seems worth answering here.
Let's break it down. As you already showed, we can use boolean expressions to generate a vector of boolean values:
bee_numbers_data_2 = data.frame(honeybee=c(12,13,16,9,10))
bee_numbers_data_2$honeybee >= 10
# [1] TRUE TRUE TRUE FALSE TRUE
If we want to know which of those are true, we can use the base R function which:
which(bee_numbers_data_2$honeybee >= 10)
# [1] 1 2 3 5
If we want to know the original values corresponding to those position indices, we can use those indices to subset the original data, using [
bee_numbers_data_2$honeybee[which(bee_numbers_data_2$honeybee >= 10)]
# [1] 12 13 16 10
Or, equivalently and a little more simply, we can subset using the boolean values directly:
bee_numbers_data_2$honeybee[bee_numbers_data_2$honeybee >= 10]
Note that as you learn more R, you will find that there are also some more advanced ways to filter and subset data, such as the packages data.table and dplyr. However, it is best to understand how to use base R first, as shown above.
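As a minimal sketch of the same idea applied to the whole data frame from the question (the values are copied from the post; the names keep, honeybee_values, honeybee_rows and honeybee_rows2 are only illustrative), the logical vector can select either just the values or the matching rows:

```r
# Rebuild the data frame from the question (values copied from the post).
bee_numbers_data_2 <- data.frame(
  buff_tail  = c(10, 1, 37, 5, 12),
  garden_bee = c(8, 3, 19, 6, 4),
  red_tail   = c(18, 9, 1, 2, 4),
  honeybee   = c(12, 13, 16, 9, 10),
  carder_bee = c(8, 27, 6, 32, 23)
)

# Logical vector: TRUE where the honeybee count is at least 10.
keep <- bee_numbers_data_2$honeybee >= 10

# Only the honeybee values that satisfy the condition.
honeybee_values <- bee_numbers_data_2$honeybee[keep]

# The complete rows that satisfy the condition (note the trailing comma).
honeybee_rows <- bee_numbers_data_2[keep, ]

# subset() is a base R convenience wrapper for the same row filter.
honeybee_rows2 <- subset(bee_numbers_data_2, honeybee >= 10)

print(honeybee_values)
print(honeybee_rows)
```

One difference worth knowing: if the column contained NA, plain logical indexing would return NA entries, whereas which() and subset() drop them.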

Related

conditional statements with missing values

Total Julia beginner here, coming from R. I am struggling to find a simple syntax to define a new variable based on two variables that have missing values. Suppose I have two variables in R as follows:
esi_offer_cps <- esi_own <- rep(NA,100)
esi_offer_cps[1:25] <- 0
esi_offer_cps[26:50] <- 1
esi_own[1:16] <- 0
esi_own[25:40] <- 0
esi_own[51:62] <- 0
esi_own[63:80] <- 1
t1 <- table(esi_offer_cps, esi_own, exclude=NULL)
print(t1)
             esi_own
esi_offer_cps  0  1 <NA>
         0    17  0    8
         1    15  0   10
         <NA> 12 18   20
In R I define a new variable as follows:
esi_offer <- ifelse(esi_offer_cps | (!is.na(esi_own) & esi_own == 1),1,0)
t2 <- table(esi_offer, exclude=NULL)
print(t2)
esi_offer
0 1 <NA>
25 43 32
The 43 records with esi_offer = 1 come from the sum of 15+10+18, and the 25 records with value 0 come from 17+8. This works in R because features such as NA | T = T and NA & F = F work when vectorized. In Julia I understand I can't have Boolean statements with .&& or .|| where missing values are present, and I am going nuts looking for some reasonable way to deal with cases like this in base Julia. Performance is not my worry at the moment (one thing at a time!). I understand that if these are columns of data frames there might be ways specific to data frames, but I have not investigated those because I am trying to understand the basics. I have been exploring the use of ismissing, ternary operators and bitwise operators, but I think I am missing the big picture: there must be some principled way to look at operations of this type but I can't fathom it. If you can point me to an article or a basic book with examples I would be grateful (or give me an example of how to deal with this particular case; hopefully I can extrapolate from that). I am still at the stage where reading the manual is not illuminating yet. Thank you in advance for your patience.
As August notes, the && and || operators in Julia are strict, i.e. they require a Bool as their first argument. Conversely, & and | fully support three-valued logic:
julia> [false, missing, true] .| [false missing true]
3×3 Matrix{Union{Missing, Bool}}:
false missing true
missing missing true
true true true
julia> [false, missing, true] .& [false missing true]
3×3 Matrix{Union{Missing, Bool}}:
false false false
false missing missing
false missing true
If you need some additional explanation of this please comment.
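For readers coming from R, the same truth tables can be reproduced on the R side for comparison. This is only a sketch of R's behaviour (the names vals, or_table and and_table are illustrative), where NA plays the role that missing plays in the Julia output above:

```r
# The three logical states in R: FALSE, NA (unknown) and TRUE.
vals <- c(FALSE, NA, TRUE)
names(vals) <- c("FALSE", "NA", "TRUE")

# outer() applies the vectorized operator to every pair of values,
# giving 3x3 truth tables comparable to the Julia matrices above.
or_table  <- outer(vals, vals, "|")
and_table <- outer(vals, vals, "&")

print(or_table)
print(and_table)
```

The pattern is the same in both languages: an unknown operand only yields an unknown result when the other operand cannot decide the outcome on its own.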

R subset exclusion based on string creates extra column

I have a data set as such below
salaries <- read.csv('salaries.csv', header=TRUE)
print(salaries)
Name Job Salary CompanyExperience IndustryExperience
John Engineer 50000 3 12
Adam Manager 55000 6 7
Alice Manager #N/A 6 6
Bob Engineer 65000 5 #N/A
Carl Engineer 70000 #N/A 10
I would like to plot some of this information, however I would need to exclude any data points with "#N/A" by removing any rows where there is an "#N/A" text string (produced by MS Excel spreadsheet exported to CSV) to make a plot of Salary ~ CompanyExperience.
My code to subset is as follows:
salaries <-salaries[salaries$CompanyExperience!="#N/A" &
salaries$Salary!="#N/A",]
#write.csv(salaries, "salaries2.csv")
#salaries <- read.csv('salaries2.csv', header=TRUE)
print(salaries)
Now this seems to work without any issue, producing:
Name Job Salary CompanyExperience IndustryExperience
1 John Engineer 50000 3 12
2 Adam Manager 55000 6 7
4 Bob Engineer 65000 5 #N/A
Which seems fine, however as soon as I try to put this data subset into a linear regression, I get an error:
> salarylinear <- lm(salaries$CompanyExperience ~ salaries$Salary)
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
I've done some experimenting and have found that if I subset the data using things like "!=10000" or "<50", I don't get this error. Also, I've found that when I write this new subset into a CSV file and read it again (by removing the # tags in the code above), the data set will have a mysterious "X" column added at the front and won't produce the error when I run a linear regression:
X Name Job Salary CompanyExperience IndustryExperience
1 1 John Engineer 50000 3 12
2 2 Adam Manager 55000 6 7
3 4 Bob Engineer 65000 5 #N/A
I've searched the web and can't find any reason why this is happening. Is there a way I can produce a usable subset by excluding "#N/A" strings without having to resort to writing the data to disk and reading it into memory again?
Most likely what is happening is that columns of data that you think are numeric are not in fact numeric. Two things are leading to this:
read.csv() doesn't know that "#N/A" means "missing" and as a result, it is reading in "#N/A" as a string (not a number), causing it to think that the whole columns of Salary, CompanyExperience, and IndustryExperience are string variables.
read.csv() has a notorious default to read in strings as factors (this was the default before R 4.0.0; since then stringsAsFactors defaults to FALSE). If you're unfamiliar with factors, one good resource is this.
This combination of events is why lm() thinks your dependent variable is a factor and is throwing an error.
The solution is to add na.strings = "#N/A" as an argument to read.csv(). Then your data will be read in as numeric. You can proceed straight to running your regression because lm() will drop rows with NA's automatically.
However, to be a bit more explicit, you may also want to add stringsAsFactors = FALSE as an argument to read.csv() just in case you have any other things that mean "missing" but are coded as, say, a blank. And, if you want to handle the NAs manually before running your regression, you can drop rows with NAs using complete.cases() or something like salaries[!is.na(salaries$Salary), ]
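A minimal sketch of that fix, assuming the file looks like the table in the question. To keep the example self-contained it reads from an inline string via read.csv(text = ...) rather than from salaries.csv, and the names csv_text, complete_rows and fit are only illustrative:

```r
# Inline stand-in for salaries.csv, using the rows shown in the question.
csv_text <- "Name,Job,Salary,CompanyExperience,IndustryExperience
John,Engineer,50000,3,12
Adam,Manager,55000,6,7
Alice,Manager,#N/A,6,6
Bob,Engineer,65000,5,#N/A
Carl,Engineer,70000,#N/A,10"

# Tell read.csv() that "#N/A" means missing, and keep strings as character.
salaries <- read.csv(text = csv_text, header = TRUE,
                     na.strings = "#N/A", stringsAsFactors = FALSE)

# The numeric columns are now numeric, with NA where Excel wrote "#N/A".
str(salaries)

# Optional manual filtering on the two columns used in the model.
complete_rows <- salaries[complete.cases(salaries[, c("Salary", "CompanyExperience")]), ]

# lm() drops incomplete rows on its own, so the unfiltered data also works.
fit <- lm(CompanyExperience ~ Salary, data = salaries)
print(nobs(fit))
```

Passing data = salaries with bare column names in the formula is a stylistic choice here; the salaries$... form from the question should behave the same once the columns are numeric.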
Follow-up to our discussion in the comments about what happens when you subset a data.frame with a matrix:
First, we create a 3x2 dataframe to work with:
df <- data.frame(x=1:3, y=4:6)
Then, let's create a vector of TRUE/FALSE for the rows we want to keep when we subset our dataframe.
v <- c(T,T,F)
Here, v has 2 TRUEs followed by 1 FALSE so if we subset our 3-row dataframe with v, we will be selecting the first 2 rows and omitting the 3rd row:
df[v,]
x y
1 1 4
2 2 5
Great, that works as expected. But what about if we subset with a matrix? We create matrix m that has the same 3x2 dimensions as our dataframe. m is full of TRUEs except for 2 FALSEs in cells (1,1) and (3,2).
m <- matrix(c(F,T,T,T,T,F), ncol=2)
m
[,1] [,2]
[1,] FALSE TRUE
[2,] TRUE TRUE
[3,] TRUE FALSE
Now, if we try to subset our dataframe with m, we might at first think that we're going to get only row 2 back, because m has a FALSE in its first and third rows. That, of course, isn't what happens.
df[m,]
x y
2 2 5
3 3 6
NA NA NA
NA.1 NA NA
The trick to understanding this is to know that a matrix in R is just a vector with a dimension attribute. The dimension is as expected, because we created m:
dim(m)
[1] 3 2
But as a vector, what does m look like:
as.vector(m)
[1] FALSE TRUE TRUE TRUE TRUE FALSE
We see that m-as-a-vector is just the columns of m, repeated one after the other (because R "fills in" matrices column-wise). Let me re-write m with the original cells identified, in case my description isn't clear:
[1] FALSE TRUE TRUE TRUE TRUE FALSE
(1,1) (2,1) (3,1) (1,2) (2,2) (3,2)
So when we try to subset our dataframe with m, it's like using this length-6 vector, and this length-6 vector says to select rows 2:5. So when we write df[m, ] R faithfully selects rows 2 and 3, and then when it tries to select rows 4 and 5, they don't "exist" so R fills them in with NAs. This is why we get more rows in our subset than in our original dataframe.
Lastly, we saw that df[m, ] has funny rownames like NA.1. Rownames must be unique, so R calls the row 4 of the "subset" 'NA' and it calls row 5 of the subset 'NA.1'.
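The explanation above can be checked with a short sketch. The comparison against as.vector(m) follows directly from the description; the last line, which keeps only rows where every cell of m is TRUE, is an added illustration of one way to get the "row 2 only" result and is not part of the original discussion:

```r
df <- data.frame(x = 1:3, y = 4:6)
m  <- matrix(c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE), ncol = 2)

# Subsetting rows with the matrix behaves like subsetting with its
# underlying length-6 vector: positions 2 to 5 are TRUE.
by_matrix <- df[m, ]
by_vector <- df[as.vector(m), ]
print(which(as.vector(m)))

# Rows 4 and 5 do not exist in df, so they come back filled with NA.
print(by_matrix)

# If the intent is "keep rows where all cells of m are TRUE",
# reduce the matrix to one logical value per row first.
all_true_rows <- df[rowSums(m) == ncol(m), ]
print(all_true_rows)
```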
I hope this clears it up for you. Happy coding!

Check if two intervals overlap in R

Given values in four columns (FromUp,ToUp,FromDown,ToDown), each pair of columns defines a range (FromUp,ToUp and FromDown,ToDown). How can I test whether the two ranges overlap? It is important to state that the range values are not sorted, so the "From" value can be higher than the "To" value and the other way round.
Some Example data:
FromUp<-c(5,32,1,5,15,1,6,1,5)
ToUp<-c(5,31,3,5,25,3,6,19,1)
FromDown<-c(1,2,8,1,22,2,1,2,6)
ToDown<-c(4,5,10,6,24,4,1,16,2)
ranges<-data.frame(FromUp,ToUp,FromDown,ToDown)
So that the result would look like:
FromUp ToUp FromDown ToDown Overlap
5 5 1 4 FALSE
32 31 2 5 FALSE
1 3 8 10 FALSE
5 5 1 6 TRUE
15 25 22 24 TRUE
1 3 2 4 TRUE
6 6 1 1 FALSE
1 19 2 16 TRUE
5 1 6 2 TRUE
I tried a few things but did not get it to work; in particular, the fact that the intervals are not "sorted" makes it too difficult for my R skills to figure out a solution.
I thought about finding the min and max values of the pairs of columns (e.g. FromUp, ToUp) and then comparing them?
Any help would be appreciated.
Sort them
rng = cbind(pmin(ranges[,1], ranges[,2]), pmax(ranges[,1], ranges[,2]),
pmin(ranges[,3], ranges[,4]), pmax(ranges[,3], ranges[,4]))
and write the condition
olap = (rng[,1] <= rng[,4]) & (rng[,2] >= rng[,3])
In one step this might be
(pmin(ranges[,1], ranges[,2]) <= pmax(ranges[,3], ranges[,4])) &
(pmax(ranges[,1], ranges[,2]) >= pmin(ranges[,3], ranges[,4]))
The foverlaps() function mentioned by others (or IRanges::findOverlaps()) would be appropriate if you were looking for overlaps between any range, but you're looking for 'parallel' (within-row?) overlaps.
The logic of the solution here is the same as the answer of @Julius, but is 'vectorized' (e.g., 1 call to pmin(), rather than nrow(ranges) calls to sort()) and should be much faster (though using more memory) for longer vectors of possible ranges.
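As a sketch, the vectorized condition can be wrapped in a small helper and checked against the expected column from the question. The function name overlaps and the vector expected are illustrative; the data are copied from the post:

```r
FromUp   <- c(5, 32, 1, 5, 15, 1, 6, 1, 5)
ToUp     <- c(5, 31, 3, 5, 25, 3, 6, 19, 1)
FromDown <- c(1, 2, 8, 1, 22, 2, 1, 2, 6)
ToDown   <- c(4, 5, 10, 6, 24, 4, 1, 16, 2)
ranges   <- data.frame(FromUp, ToUp, FromDown, ToDown)

# Two closed intervals overlap when the lower end of each one
# does not exceed the upper end of the other. pmin()/pmax() put
# the unsorted endpoints in order, row by row.
overlaps <- function(a1, a2, b1, b2) {
  (pmin(a1, a2) <= pmax(b1, b2)) & (pmax(a1, a2) >= pmin(b1, b2))
}

ranges$Overlap <- with(ranges, overlaps(FromUp, ToUp, FromDown, ToDown))

# Expected result as listed in the question.
expected <- c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)
print(ranges)
print(identical(ranges$Overlap, expected))
```

Because the comparisons use <= and >=, intervals that touch at a single point count as overlapping, which matches row 4 of the example.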
In general:
apply(ranges,1,function(x){y<-c(sort(x[1:2]),sort(x[3:4]));max(y[c(1,3)])<=min(y[c(2,4)])})
or, in case intervals cannot overlap at just one point (e.g. because they are open):
!apply(ranges,1,function(x){y<-sort(x)[1:2];all(y==sort(x[1:2]))|all(y==sort(x[3:4]))})

How can I extract and merge corresponding row variables for designated column in data frame if the row has string variable in R?

So I have this data frame in R, and I'd like to bar plot the terms of one column via info <- table(df$ForPlot). But first I need to merge the corresponding row value from another column into that column IF that row of the column I'd like to plot has text (some rows have 2 terms, some have 1 and others have none). So for example from this:
ID Name ForPlot
1 cool
2 nice ready soft
3 fast
4 slow party
5 good low
6 bad
7 true yo fit
8 false
I need a function or a practical way of accomplishing this:
ID Name ForPlot
1 cool
2 nice nice ready soft
3 fast
4 slow slow party
5 good good low
6 bad
7 true true yo fit
8 false
So ONLY if my "ForPlot" column has a string, the corresponding row from the "Name" column should be extracted and merged. Any ideas?
UPDATE So I thought I knew how to plot the frequencies via info <- table(df$ForPlot), which I thought would take the frequencies of all the different texts in ForPlot, and then run a bar plot of that. I was wrong. Instead it counted the entire string of each row (multiple words) as a single value. Any ideas on how to make a bar plot from a column with multiple values?
You can do it with ifelse
df$ForPlot <- ifelse(df$ForPlot != "", paste(df$Name, df$ForPlot), " ")
> df
#Name ForPlot
#1 Cool
#2 nice nice ready soft
#3 fast
#4 slow slow party
#5 good good low
#6 bad
#7 true true yo fit
#8 false
EDIT: Updated the answer as per @Robert Dove's comment
Here is a way:
i <- df$ForPlot != ''
df$ForPlot[i] <- paste(df$Name[i], df$ForPlot[i])
We can use data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)); using the logical condition (ForPlot!='') in 'i', we assign the 'ForPlot' by pasting the 'Name' and 'ForPlot' columns. This should be very fast as we are assigning in place.
library(data.table)
setDT(df1)[ForPlot!='', ForPlot:= paste(Name, ForPlot)]
df1
# ID Name ForPlot
#1: 1 cool
#2: 2 nice nice ready soft
#3: 3 fast
#4: 4 slow slow party
#5: 5 good good low
#6: 6 bad
#7: 7 true true yo fit
#8: 8 false
Update
If we need a bar plot of the word frequency after the transformation, we can split the 'ForPlot' column by space (strsplit), unlist the output list, use table to get the frequency and then plot with barplot.
barplot(table(unlist(strsplit(df1$ForPlot, ' '))))
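A base R sketch of the whole pipeline, for readers without data.table. The data frame is rebuilt from the example table in the question, on the assumption that the first word of each row is Name and the rest is ForPlot; has_text, words and word_counts are illustrative names:

```r
# Example data rebuilt from the question (column split is an assumption).
df1 <- data.frame(
  ID      = 1:8,
  Name    = c("cool", "nice", "fast", "slow", "good", "bad", "true", "false"),
  ForPlot = c("", "ready soft", "", "party", "low", "", "yo fit", ""),
  stringsAsFactors = FALSE
)

# Prepend Name only where ForPlot already contains text.
has_text <- df1$ForPlot != ""
df1$ForPlot[has_text] <- paste(df1$Name[has_text], df1$ForPlot[has_text])

# Split every row into single words; empty strings contribute nothing.
words <- unlist(strsplit(df1$ForPlot, " ", fixed = TRUE))

# Count each word once per occurrence across all rows.
word_counts <- table(words)
print(word_counts)

# Uncomment to draw the bar plot of word frequencies.
# barplot(word_counts)
```

In this small example every word occurs exactly once, so all bars would have height 1; with real data, repeated words would accumulate across rows.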

Having trouble understanding how "Identical" works

I have a probably very stupid question regarding the identical() function.
I was writing a script to test whether some values occur several times in my data.frame so that I can regroup them. I compare values 2 by 2 over 4 columns.
I identified some in my table, and wanted to test my script. Here is part of the data.frame:
Ret..Time Mass Ret..Time Mass deltaRT deltaMZ
178 3.5700 797.6324 3.4898 797.6018 0.0802 0.0306
179 3.6957 797.6519 3.7502 797.5798 0.0545 0.0721
180 3.3526 797.6655 3.2913 797.5980 0.0613 0.0675
182 3.1561 797.7123 3.1650 797.5620 0.0089 0.1503
182.1 3.1561 797.7123 3.0623 797.6174 0.0938 0.0949
183 3.4495 797.8207 3.3526 797.6655 0.0969 0.1552
So here the elements of column 1 and 2 on row "180" are equal to those in 3 and 4 on row "183".
Here is what I get and what confuses me:
all.equal(result["180",1:2],result["183",3:4])
[1] "Attributes: < Component “row.names”: 1 string mismatch >"
identical(result["180",1:2],result["183",3:4])
[1] FALSE
identical(result["180",1],result["183",3]) & identical(result["180",2],result["183",4])
[1] TRUE
I get that all.equal reacts to the different rownames (although I don't really understand why, since I'm asking to compare the values in specific columns, not whole rows).
But why does identical need to compare the values separately? It doesn't work any better if I use result[180,c(1,2)] and result[183,c(3,4)]. Does identical() start to use the rownames too if I compare more than 1 value? How to prevent that? In my case, I have only 2 values to compare to 2 other values, but what if the string to compare was spanned over 10 columns? Would I need to add & and identical() to compare each of the 10 columns individually?
Thanks in advance!
Keep in mind that not only the value but also all attributes must match for identical to return TRUE. Consider:
foo<-1
bar<-1
dim(foo)<-c(1,1)
identical(foo,bar)
[1] FALSE
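One possible way to compare only the values across several columns is to strip the attributes first, for example with unlist(..., use.names = FALSE). The sketch below uses a two-row stand-in for the data in the question; the column names rt1, mz1, rt2 and mz2 are made up for the example:

```r
# Small stand-in for rows "180" and "183" of the question's data.
result <- data.frame(
  rt1 = c(3.3526, 3.4495), mz1 = c(797.6655, 797.8207),
  rt2 = c(3.2913, 3.3526), mz2 = c(797.5980, 797.6655),
  row.names = c("180", "183")
)

a <- result["180", 1:2]
b <- result["183", 3:4]

# FALSE: a and b are one-row data frames whose names and row names differ.
print(identical(a, b))

# Dropping names and row names leaves plain numeric vectors,
# so one call covers any number of columns.
same_values <- identical(unlist(a, use.names = FALSE),
                         unlist(b, use.names = FALSE))

# all.equal() on the bare vectors allows for floating-point tolerance.
close_enough <- isTRUE(all.equal(unlist(a, use.names = FALSE),
                                 unlist(b, use.names = FALSE)))
print(same_values)
print(close_enough)
```

Note that unlist() coerces mixed column types to a common type, so this approach is best suited to columns of the same type, as in the numeric data here.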

Resources