Having trouble understanding how "identical" works in R

I have a probably very stupid question regarding the identical() function.
I was writing a script to test whether some values appear several times in my data.frame, so that I can regroup them. I compare values two by two over four columns.
I identified some in my table and wanted to test my script. Here is part of the data.frame:
Ret..Time Mass Ret..Time Mass deltaRT deltaMZ
178 3.5700 797.6324 3.4898 797.6018 0.0802 0.0306
179 3.6957 797.6519 3.7502 797.5798 0.0545 0.0721
180 3.3526 797.6655 3.2913 797.5980 0.0613 0.0675
182 3.1561 797.7123 3.1650 797.5620 0.0089 0.1503
182.1 3.1561 797.7123 3.0623 797.6174 0.0938 0.0949
183 3.4495 797.8207 3.3526 797.6655 0.0969 0.1552
So here the elements of columns 1 and 2 on row "180" are equal to those in columns 3 and 4 on row "183".
Here is what I get and what confuses me:
all.equal(result["180",1:2],result["183",3:4])
[1] "Attributes: < Component “row.names”: 1 string mismatch >"
identical(result["180",1:2],result["183",3:4])
[1] FALSE
identical(result["180",1],result["183",3]) & identical(result["180",2],result["183",4])
[1] TRUE
I get that all.equal() reacts to the different row names (although I don't really understand why; I'm asking it to compare the values in specific columns, not whole rows).
But why does identical() need to compare the values separately? It doesn't work any better if I use result[180,c(1,2)] and result[183,c(3,4)]. Does identical() start to use the row names too if I compare more than one value? How do I prevent that? In my case I have only 2 values to compare to 2 other values, but what if the values to compare spanned 10 columns? Would I need to chain identical() calls with & to compare each of the 10 columns individually?
Thanks in advance!

Keep in mind that not only the values but also all attributes must match for identical() to return TRUE. Consider:
foo <- 1
bar <- 1
dim(foo) <- c(1, 1)   # foo is now a 1x1 matrix; its single value is still 1
identical(foo, bar)
[1] FALSE
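If the goal is to compare only the values and ignore names and other attributes, here is a sketch of two options (added here, not part of the original answer; both scale to comparisons spanning any number of columns, provided the stored values are exactly equal, as your element-wise identical() calls show):
# Flatten each one-row slice to a bare, unnamed vector before comparing
identical(unlist(result["180", 1:2], use.names = FALSE),
          unlist(result["183", 3:4], use.names = FALSE))
# Or tell all.equal() to ignore attributes such as row names
isTRUE(all.equal(result["180", 1:2], result["183", 3:4],
                 check.attributes = FALSE))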


R. Display only TRUE values from a boolean expression?

Data science student here. New to R, in my first course. I've spent way too much time trying to figure out this exercise, so I figured I would ask someone on here.
I have created a dataframe built from 4 matrices, titled bee_numbers_data_2:
buff_tail garden_bee red_tail honeybee carder_bee
10 8 18 12 8
1 3 9 13 27
37 19 1 16 6
5 6 2 9 32
12 4 4 10 23
The exercise asks us to only show honeybee numbers >= 10.
So I've created a boolean expression to display the TRUE FALSE statements:
bee_numbers_data_2$honeybee>=10
Which returns:
[1] TRUE TRUE TRUE FALSE TRUE
However, I want to display a list of the VALUES of the true statements, not a list of TRUE FALSE statements.
I've been poring over my textbook and the internet trying to figure out this simple problem, so any help would be greatly appreciated. Thanks so much.
Although this is a fairly simple question, covered in most introductory texts on R, I could not find a duplicate on SO, so it seems worth answering here.
Let's break it down. As you already showed, we can use boolean expressions to generate a vector of boolean values:
bee_numbers_data_2 = data.frame(honeybee=c(12,13,16,9,10))
bee_numbers_data_2$honeybee >= 10
# [1] TRUE TRUE TRUE FALSE TRUE
If we want to know which of those are true, we can use the base R function which:
which(bee_numbers_data_2$honeybee >= 10)
# [1] 1 2 3 5
If we want to know the original values corresponding to those position indices, we can use those indices to subset the original data with the [ operator:
bee_numbers_data_2$honeybee[which(bee_numbers_data_2$honeybee >= 10)]
# [1] 12 13 16 10
Or, equivalently and a little more simply, we can subset using the boolean values directly:
bee_numbers_data_2$honeybee[bee_numbers_data_2$honeybee >= 10]
Note that as you learn more R, you will find that there are also some more advanced ways to filter and subset data, such as the packages data.table and dplyr. However, it is best to understand how to use base R first, as shown above.
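For instance, here is a minimal sketch of the dplyr version (assuming the package is installed; filter() keeps the rows where the condition is TRUE):
library(dplyr)
# Keep only the rows where honeybee is at least 10
filter(bee_numbers_data_2, honeybee >= 10)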

R subset exclusion based on string creates extra column

I have a data set as such below
salaries <- read.csv('salaries.csv', header=TRUE)
print(salaries)
Name Job Salary CompanyExperience IndustryExperience
John Engineer 50000 3 12
Adam Manager 55000 6 7
Alice Manager #N/A 6 6
Bob Engineer 65000 5 #N/A
Carl Engineer 70000 #N/A 10
I would like to plot some of this information, but I need to exclude any data points containing "#N/A" (a text string produced by an MS Excel spreadsheet exported to CSV) by removing the rows where it appears, so that I can plot Salary ~ CompanyExperience.
My code to subset is as follows:
salaries <-salaries[salaries$CompanyExperience!="#N/A" &
salaries$Salary!="#N/A",]
#write.csv(salaries, "salaries2.csv")
#salaries <- read.csv('salaries2.csv', header=TRUE)
print(salaries)
Now this seems to work without any issue, producing:
Name Job Salary CompanyExperience IndustryExperience
1 John Engineer 50000 3 12
2 Adam Manager 55000 6 7
4 Bob Engineer 65000 5 #N/A
Which seems fine, however as soon as I try to put this data subset into a linear regression, I get an error:
> salarylinear <- lm(salaries$CompanyExperience ~ salaries$Salary)
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
Now, I've done some experimenting and have found that if I subset the data using things like "!=10000" or "<50", I don't get this error. I've also found that when I write this new subset to a CSV file and read it back in (by uncommenting the two lines in the code above), the data set gains a mysterious "X" column at the front and no longer produces the error when I run a linear regression:
X Name Job Salary CompanyExperience IndustryExperience
1 1 John Engineer 50000 3 12
2 2 Adam Manager 55000 6 7
3 4 Bob Engineer 65000 5 #N/A
I've searched the web and can't find any reason why this is happening. Is there a way I can produce a usable subset by excluding "#N/A" strings without having to resort to writing the data to disk and reading it into memory again?
Most likely what is happening is that columns of data that you think are numeric are not in fact numeric. Two things are leading to this:
read.csv() doesn't know that "#N/A" means "missing" and as a result, it is reading in "#N/A" as a string (not a number), causing it to think that the whole columns of Salary, CompanyExperience, and IndustryExperience are string variables.
read.csv() has a notorious default to read in strings as factors. If you're unfamiliar with factors, one good resource is this.
This combination of events is why lm() thinks your dependent variable is a factor and is throwing an error.
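To see why factors bite here, a small illustrative snippet (added for clarity, not from the original answer):
f <- factor(c("50000", "55000", "#N/A"))
as.numeric(f)                 # level codes, not the numbers: 2 3 1
as.numeric(as.character(f))   # correct conversion: 50000 55000 NA (with a warning)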
The solution is to add na.strings = "#N/A" as an argument to read.csv(). Then your data will be read in as numeric. You can proceed straight to running your regression because lm() will drop rows with NA's automatically.
However, to be a bit more explicit, you may also want to add stringsAsFactors = FALSE as an argument to read.csv(), just in case you have any other things that mean "missing" but are coded as, say, a blank. And if you want to handle the NAs manually before running your regression, you can drop rows with NAs using complete.cases() or something like salaries[!is.na(salaries$Salary), ].
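Putting that together, a minimal sketch of the suggested call (the file name is taken from the question):
# Treat "#N/A" as missing and keep strings as plain characters,
# so the numeric columns really come in as numeric
salaries <- read.csv("salaries.csv", header = TRUE,
                     na.strings = "#N/A", stringsAsFactors = FALSE)
# lm() drops rows containing NA by default (na.action = na.omit)
salarylinear <- lm(CompanyExperience ~ Salary, data = salaries)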
Follow-up to our discussion in the comments about what happens when you subset a data.frame with a matrix:
First, we create a 3x2 dataframe to work with:
df <- data.frame(x=1:3, y=4:6)
Then, let's create a vector of TRUE/FALSE for the rows we want to keep when we subset our dataframe.
v <- c(T,T,F)
Here, v has 2 TRUEs followed by 1 FALSE so if we subset our 3-row dataframe with v, we will be selecting the first 2 rows and omitting the 3rd row:
df[v,]
x y
1 1 4
2 2 5
Great, that works as expected. But what about if we subset with a matrix? We create matrix m that has the same 3x2 dimensions as our dataframe. m is full of TRUEs except for 2 FALSEs in cells (1,1) and (3,2).
m <- matrix(c(F,T,T,T,T,F), ncol=2)
m
[,1] [,2]
[1,] FALSE TRUE
[2,] TRUE TRUE
[3,] TRUE FALSE
Now, if we try to subset our dataframe with m, we might at first think that we're going to get only row 2 back, because m has a FALSE in its first and third rows. That, of course, isn't what happens.
df[m,]
x y
2 2 5
3 3 6
NA NA NA
NA.1 NA NA
The trick to understanding this is to know that a matrix in R is just a vector with a dimension attribute. The dimension is as expected, since that's how we created m:
dim(m)
[1] 3 2
But as a vector, what does m look like:
as.vector(m)
[1] FALSE TRUE TRUE TRUE TRUE FALSE
We see that m-as-a-vector is just the columns of m, repeated one after the other (because R "fills in" matrices column-wise). Let me re-write m with the original cells identified, in case my description isn't clear:
[1] FALSE TRUE TRUE TRUE TRUE FALSE
(1,1) (2,1) (3,1) (1,2) (2,2) (3,2)
So when we try to subset our dataframe with m, it's like using this length-6 vector, and this length-6 vector says to select rows 2 through 5. So when we write df[m, ], R faithfully selects rows 2 and 3, and then when it tries to select rows 4 and 5, they don't "exist", so R fills them in with NAs. This is why we get more rows in our subset than in our original dataframe.
Lastly, we saw that df[m, ] has funny rownames like NA.1. Rownames must be unique, so R calls row 4 of the "subset" NA and row 5 NA.1.
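As an aside (not part of the discussion above): if the intent was to keep only the rows of df where every cell of m is TRUE, one common idiom is to collapse the matrix to a per-row logical vector first:
# TRUE only for rows in which every cell of m is TRUE (here, just row 2)
df[rowSums(m) == ncol(m), ]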
I hope this clears it up for you. Happy coding!

Check if two intervals overlap in R

Given values in four columns (FromUp, ToUp, FromDown, ToDown), two of them always define a range (FromUp/ToUp and FromDown/ToDown). How can I test whether the two ranges overlap? It is important to state that the range values are not sorted, so the "From" value can be higher than the "To" value and the other way round.
Some Example data:
FromUp<-c(5,32,1,5,15,1,6,1,5)
ToUp<-c(5,31,3,5,25,3,6,19,1)
FromDown<-c(1,2,8,1,22,2,1,2,6)
ToDown<-c(4,5,10,6,24,4,1,16,2)
ranges<-data.frame(FromUp,ToUp,FromDown,ToDown)
So that the result would look like:
FromUp ToUp FromDown ToDown Overlap
5 5 1 4 FALSE
32 31 2 5 FALSE
1 3 8 10 FALSE
5 5 1 6 TRUE
15 25 22 24 TRUE
1 3 2 4 TRUE
6 6 1 1 FALSE
1 19 2 16 TRUE
5 1 6 2 TRUE
I tried a few things but did not get it to work; in particular, the fact that the intervals are not "sorted" makes it too difficult for my R skills to figure out a solution.
I thought about finding the min and max values of the pairs of columns (e.g. FromUp, ToUp) and then comparing them?
Any help would be appreciated.
Sort them
rng = cbind(pmin(ranges[,1], ranges[,2]), pmax(ranges[,1], ranges[,2]),
            pmin(ranges[,3], ranges[,4]), pmax(ranges[,3], ranges[,4]))
and write the condition
olap = (rng[,1] <= rng[,4]) & (rng[,2] >= rng[,3])
In one step this might be
(pmin(ranges[,1], ranges[,2]) <= pmax(ranges[,3], ranges[,4])) &
(pmax(ranges[,1], ranges[,2]) >= pmin(ranges[,3], ranges[,4]))
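Either way, the logical vector matches the Overlap column from the question and can be attached directly (a small usage note added here):
ranges$Overlap <- olap
ranges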
The foverlaps() function mentioned by others (or IRanges::findOverlaps()) would be appropriate if you were looking for overlaps between any pair of ranges, but you're looking for 'parallel' (within-row) overlaps.
The logic of the solution here is the same as in @Julius's answer, but it is 'vectorized' (e.g., one call to pmin() rather than nrow(ranges) calls to sort()) and should be much faster (though it uses more memory) for long vectors of possible ranges.
In general:
apply(ranges, 1, function(x) {
  y <- c(sort(x[1:2]), sort(x[3:4]))
  max(y[c(1, 3)]) <= min(y[c(2, 4)])
})
or, in case intervals cannot overlap at just one point (e.g. because they are open):
!apply(ranges, 1, function(x) {
  y <- sort(x)[1:2]
  all(y == sort(x[1:2])) | all(y == sort(x[3:4]))
})

Is there a simple way to rank on multiple criteria that preserves ties in R?

When a single criterion is well ordered, the rank function returns the obvious thing:
rank(c(2,4,1,3,5))
[1] 2 4 1 3 5
When a single criterion has ties, the rank function (by default) assigns average ranks to the ties:
rank(c(2,4,1,1,5))
[1] 3.0 4.0 1.5 1.5 5.0
The rank function doesn't let you sort on multiple criteria, so you have to use something else. One way to do it is by using match and order. For a single criterion without ties the results are the same:
rank(c(2,4,1,3,5))
[1] 2 4 1 3 5
match(1:5, order(c(2,4,1,3,5)))
[1] 2 4 1 3 5
For a single criterion with ties, however, the results differ:
rank(c(2,4,1,4,5))
[1] 2.0 3.5 1.0 3.5 5.0
match(1:5, order(c(2,4,1,4,5)))
[1] 2 3 1 4 5
The ties are broken in such a way that the tied elements have their original order preserved rather than being assigned equal ranks. This feature generalizes, obviously, when you sort on multiple criteria:
match(1:5, order(c(2,4,1,4,5),c(10,11,12,11,13)))
[1] 2 3 1 4 5
Finally, the question: Is there a simple, or built-in, way of computing rank using multiple criteria that preserves ties? I've written a function to do it, but it's ugly and seems ridiculously complicated for such a basic functionality...
interaction does what you need:
> rank(interaction(c(2,4,1,4,5),c(10,11,12,11,13), lex.order=TRUE))
[1] 2.0 3.5 1.0 3.5 5.0
Here is what is happening.
interaction() expects factors, so the numeric vectors are coerced to factors. Coercion orders the factor levels as sort.list() would, which for numeric vectors means numerically non-decreasing order.
Then, to combine the two factors, interaction() creates factor levels by varying the second argument fastest (because lex.order = TRUE). Thus ties in the first vector are resolved by the values in the second vector (where possible).
Finally, rank coerces the resulting factor to numeric.
What is actually ranked:
> as.numeric(interaction(c(2,4,1,4,5),c(10,11,12,11,13), lex.order=TRUE))
[1] 5 10 3 10 16
You will save some memory if you supply the option drop=TRUE to interaction. This will change the ranked numeric values, but not their order, so the final result is the same.
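To illustrate that note (the exact integer codes below are worked out from the factor levels and are only illustrative):
# With drop = TRUE, unused level combinations are removed, so the
# integer codes shrink, but their relative order is unchanged
as.numeric(interaction(c(2,4,1,4,5), c(10,11,12,11,13),
                       drop = TRUE, lex.order = TRUE))
# [1] 2 3 1 3 4
rank(interaction(c(2,4,1,4,5), c(10,11,12,11,13),
                 drop = TRUE, lex.order = TRUE))
# [1] 2.0 3.5 1.0 3.5 5.0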

Applying a function on each row of a data frame in R

I would like to apply some function to each row of a dataframe in R.
The function can return either a single-row dataframe or nothing (I guess return() returns nothing?).
I would like to apply this function to each of the rows of a given dataframe and get the resulting dataframe (which is possibly shorter, i.e. has fewer rows, than the original one).
For example, if the original dataframe is something like:
id size name
1 100 dave
2 200 sarah
3 50 ben
And the function I'm using gets a row of the dataframe (i.e. a single-row dataframe), returns it as-is if the name rhymes with "brave", and otherwise returns NULL, then the result should be:
id size name
1 100 dave
This example actually refers to filtering a dataframe, and I would love to get both an answer specific to this kind of task and one for the more general case, where even the result of the helper function (the one that operates on a single row) may be an arbitrary single-row data frame. Please note that even in the case of filtering, I would like to use some sophisticated logic (not something simple like $size > 100, but a more complex condition that is checked by a function, let's say boo(single_row_df)).
P.S.
What I have done so far in these cases is to use apply(df, MARGIN=1, ...) and then do.call(rbind, ...), but I think it gives me some trouble when my dataframe has only a single row (I get Error in do.call(rbind, filterd) : second argument must be a list).
UPDATE
Following Stephen's reply, I did the following:
ranges.filter <- function(ranges, boo) {
  subset(x = ranges, subset = !any(boo[start:end]))
}
I then call ranges.filter with some ranges dataframe that looks like this:
start end
100 200
250 400
698 1520
1988 2147
...
and some boolean vector
(TRUE,FALSE,TRUE,TRUE,TRUE,...)
I want to filter out any ranges that contain a TRUE value from the boolean vector. For example, the first range 100 .. 200 will be left in the data frame iff the boolean vector is FALSE in positions 100 .. 200.
This seems to do the job, but I get a warning saying numerical expression has 53 elements: only the first used.
For the more general case of processing a dataframe, get the plyr package from CRAN and look at the ddply function, for example.
install.packages("plyr")
library(plyr)
help(ddply)
Does what you want without masses of fiddling.
For example...
> d
x y z xx
1 1 0.68434946 0.643786918 8
2 2 0.64429292 0.231382912 5
3 3 0.15106083 0.307459540 3
4 4 0.65725669 0.553340712 5
5 5 0.02981373 0.736611949 4
6 6 0.83895251 0.845043443 4
7 7 0.22788855 0.606439470 4
8 8 0.88663285 0.048965094 9
9 9 0.44768780 0.009275935 9
10 10 0.23954606 0.356021488 4
We want to compute the mean and sd of x within groups defined by "xx":
> ddply(d,"xx",function(r){data.frame(mean=mean(r$x),sd=sd(r$x))})
xx mean sd
1 3 3.0 NA
2 4 7.0 2.1602469
3 5 3.0 1.4142136
4 8 1.0 NA
5 9 8.5 0.7071068
And it gracefully handles all the nasty edge cases that sometimes catch you out.
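As a side note (not part of the original answer), the same grouped summary in the newer dplyr package would look something like this:
library(dplyr)
# Group by xx, then compute the mean and sd of x within each group
d %>% group_by(xx) %>% summarise(mean = mean(x), sd = sd(x))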
You may have to use lapply instead of apply to force the result to be a list.
> rhymesWithBrave <- function(x) substring(x,nchar(x)-2) =="ave"
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) dfr[i,] else NULL,
+ dfr))
id size name
1 1 100 dave
But in this case, subset would be more appropriate:
> subset(dfr,rhymesWithBrave(name))
id size name
1 1 100 dave
If you want to perform additional transformations before returning the result, you can go back to the lapply approach above:
> add100tosize <- function(x) within(x,size <- size+100)
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) add100tosize(dfr[i,])
+ else NULL,dfr))
id size name
1 1 200 dave
Or, in this simple case, apply the function to the output of subset.
> add100tosize(subset(dfr,rhymesWithBrave(name)))
id size name
1 1 200 dave
UPDATE:
To select rows that do not fall between start and end, you might construct a different function (note: when summing a boolean/logical vector, TRUE values are treated as 1 and FALSE values as 0):
test <- function(x)
  rowSums(mapply(function(start, end, x) x >= start & x <= end,
                 start = c(100, 250, 698, 1988),
                 end = c(200, 400, 1520, 2147),
                 MoreArgs = list(x = x))) == 0   # TRUE when x lies in none of the ranges
subset(dfr,test(size))
It sounds like you want to use subset:
subset(orig.df,grepl("ave",name))
The second argument is evaluated as a logical expression that determines which rows are kept. You can make this expression use values from as many columns as you want, e.g. grepl("ave", name) & size > 50.
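For the "complex condition" case raised in the question, a hedged sketch: wrap the logic in a predicate function (boo() is the hypothetical name the question uses; the body here is only an example) and pass its result to subset():
# A vectorized predicate over whole columns; the condition is illustrative
boo <- function(df) grepl("ave$", df$name) & df$size >= 100
subset(orig.df, boo(orig.df))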
