Types and comparisons in R

I've been working with R for a month or so, and my comprehension of some subtleties is still quite superficial.
I have had an issue, which I managed to solve (details below), but I still can't explain precisely why it did not work with the first solution.
Note that the example below makes no practical sense, because I have simplified it as much as possible to make the problem clear.
ISSUE :
Given a data frame with 4 columns (email, first, last, company) :
> users <- data.frame(matrix(vector(), 0, 4, dimnames=list(c(), c("email", "first", "last", "company"))), stringsAsFactors=F)
> users[1,] <- c("robert#redford.com", "Robert", "Redford", "Paramount")
> users[2,] <- c("julia#roberts.com", "Erin", "B.", "Hinkley")
> users[3,] <- c("matt#damon.com", "Will", "H.", "Stanford")
> users[4,] <- c("john#malkovitch.com", "John", "M.", "JM")
I take one particular row :
> user <- users[3,]
When I try to subset the data frame on a criterion that should return the previously mentioned row, it returns no result.
> users[users$email == user["email"],]
[1] email first last company
<0 rows> (or 0-length row.names)
I instantly thought it was a casting issue (sorry for this bad one)
> users[users$email == as.character(user["email"]),]
email first last company
3 matt#damon.com Will H. Stanford
However, when I tried to figure out where exactly the issue was, and tried this :
> users[users$email == "matt#damon.com",]
email first last company
3 matt#damon.com Will H. Stanford
> user["email"] == "matt#damon.com"
email
3 TRUE
> users[3,]$email == user$email
[1] TRUE
I got quite confused :
First, I thought about it as a math problem : if A == B and B == C, then A == C (according to Captain Obvious). So, just replacing a member A by another member B which is supposed to be equal to A (given the "TRUE" statement) in some expression should have no impact on the result of this expression.
3 TRUE != [1] TRUE. I think [1] TRUE is a logical vector of length 1 whose first element is TRUE, whereas 3 TRUE is a 1x1 logical matrix with row name "3" and column name "email", whose single value is TRUE.
My problem is with consistency : either two objects of equal content but different types should be equal, or they should be different. I have a problem with "Sometimes there is type inference, and sometimes not". Is there a rule I can't see beyond this behavior ? (I guess there is one)
Another expression of the behavior I'd like to get is this one :
> unique(users$email) == "matt#damon.com"
[1] FALSE FALSE TRUE FALSE
> unique(users$email) == user["email"]
email
3 FALSE
Obviously R does get what I want (considering the fact that it gives me the matching row). But I can't explain (nor use) the result of the second statement.
Any explanations / thoughts?

In normal list situations, use double brackets to extract the element itself:
users$email == user[["email"]]
In data.frames, however, things get inconsistent and a lot worse!
tdf=data.frame(matrix(1:100,10,10))
tdf[] # returns data.frame everything
tdf[1] # returns data.frame first column
tdf[1,1] # returns the single element, with the type of that column
tdf[,1] # returns a vector of the first column
tdf[1,] # returns a data.frame of the first row # eeeeeugh... that is odd....
tdf[2:4] # returns a data.frame with 3 columns
tdf[1,2:4] # returns a data.frame of the first row of 3 columns
tdf[2:4,2:4] # returns a 3x3 data.frame
tdf[2:4,1] # returns a vector of rows 2:4 of the 1st column
tdf[,2:4] # returns a data.frame with 3 columns
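The shapes listed above can be checked with class(). This is only a small sketch on the same tdf; the drop = FALSE argument is not mentioned in the answer, but it is the documented way to keep a single column as a data.frame.

```r
tdf <- data.frame(matrix(1:100, 10, 10))

class(tdf[1])                  # "data.frame": list-style indexing keeps the container
class(tdf[, 1])                # "integer": a single column is dropped to a vector
class(tdf[, 1, drop = FALSE])  # "data.frame": drop = FALSE keeps the data frame
class(tdf[1, ])                # "data.frame": a row may mix column types, so it is not dropped
class(tdf[1, 1])               # "integer": a single element of the first column
```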
Then there is also the double [[ ]].
Do note that in data.frames things get horribly annoying and fugly:
tdf[[1]] # gives the first column as a vector
tdf[[1,1]] # gives the first element
and pretty much all other combinations give errors.
and assigning stuff to a data.frame or matrix, is an even bigger mess!
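Applied to the question's data, the difference between [ and [[ looks roughly like this. It is a sketch with a shortened, two-column version of users (the column selection is my simplification), showing that user["email"] is still a data.frame while user[["email"]] and user$email are plain character vectors.

```r
users <- data.frame(
  email = c("robert#redford.com", "julia#roberts.com",
            "matt#damon.com", "john#malkovitch.com"),
  first = c("Robert", "Erin", "Will", "John"),
  stringsAsFactors = FALSE
)
user <- users[3, ]

class(user["email"])    # "data.frame": single bracket keeps the container
class(user[["email"]])  # "character": double bracket extracts the column
class(user$email)       # "character"

# Comparing against the extracted character value is an ordinary
# element-wise vector comparison
users$email == user[["email"]]
# [1] FALSE FALSE  TRUE FALSE

matched <- users[users$email == user[["email"]], ]
matched$first
# [1] "Will"
```

So as.character(user["email"]) worked in the question because it turned the one-cell data.frame into a character vector before the comparison.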

Related

Why does R return integer(0) for under-indexing but NA for over-indexing a vector? [duplicate]

Say I have a vector, for example, x <- 1:10, then x[0] returns a zero-length vector of the same class as x, here integer(0).
I was wondering if there is a reason behind that choice, as opposed to throwing an error, or returning NA as x[11] would? Also, if you can think of a situation where having x[0] return integer(0) is useful, thank you for including it in your answer.
As seen in ?"["
NA and zero values are allowed: rows of an index matrix containing a
zero are ignored, whereas rows containing an NA produce an NA in the
result.
So an index of 0 just gets ignored. We can see this in the following
x <- 1:10
x[c(1, 3, 0, 5, 0)]
#[1] 1 3 5
So if the only index we give it is 0 then the appropriate response is to return an empty vector.
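A short sketch of the cases discussed here, side by side. The which() example is my own illustration of a situation where an empty result is convenient, not something taken from the help page.

```r
x <- 1:10

x[0]           # integer(0): the zero index is ignored, nothing is selected
x[11]          # NA: position 11 is requested but does not exist
x[FALSE]       # integer(0): a logical index that selects nothing

length(x[0])   # 0
sum(x[0])      # 0: an empty vector still works in later computations

# An empty index often arises from a search with no matches
idx <- which(x > 10)
x[idx]         # integer(0), rather than an error
```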
Here is my crack at it, though I am not a programmer and certainly do not contribute to the R source. I think it may be because you need some sort of placeholder to state that an operation happened here but nothing was returned. This becomes more apparent with things like table and split. For instance, when you tabulate values and a cell has zero entries, you still need something to record that the cell has no values. It would not be appropriate to have x[0] == 0, because the result is not the numeric value zero but the absence of any value.
So in the following splits we need a placeholder, and integer(0) holds the place of "no values returned", which is not the same as 0. Notice that the second one returns numeric(0), which is still a placeholder, this time stating that the missing values were numeric.
with(mtcars, split(as.integer(gear), list(cyl, am, carb)))
with(mtcars, split(gear, list(cyl, am, carb)))
So in a way my x[FALSE] retort is true in that it holds the place of the nonexistent zero position in the vector.
All right, this baloney I just spewed is true until someone disputes it and tears it down.
PS: page 19 of this guide (LINK) states that integer() and integer(0) are both an empty integer vector.
Related SO post: How to catch integer(0)?
Since the array indices are 1-based, index 0 has no meaning. The value is ignored as a vector index.

Add a specified number of blank rows to a data table without overwriting the heading

I'm trying to make a large blank data.table with a header row in order to add values in specific places once it is set up. I have been able to duplicate the first row and then clear every other row or every row, but what I'd like to do is clear every row after the header row. Some columns are numeric input and some are character input.
[input3]:
headers: header1 header2 header3..... header 60+
Values: NA NA NA ... NA
Duplicate row:
input3 <- input2[rep(1:nrow(input2), each = 2), ]
Clear every row:
input3[1:nrow(input3) %% 1 == 0, ] <- NA
But if I try to rewrite that as duplicating blank rows starting at row 2 (to preserve the header) I get this error:
input3[2:nrow(input3) %% 1 == 0, ] <- NA
"Error in [.data.table(x, i, which = TRUE) : i evaluates to a logical vector length 9 but there are 10 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle."
I need to be able to dynamically add rows while keeping the header as this is going to be a gigantic table I will export to another program.
Edit: this is different from this link in that I'm adding additional rows not specified originally in the data. Not just wiping rows.
Instead use
input3[c(FALSE, 2:nrow(input3) %% 1 == 0), ] <- NA
By using 2:nrow, you were explicitly giving a shortened vector. When that thing is a logical vector, it must be length 1 or the same as the number of rows. Period.
Though this has its problems and I discourage its use, perhaps you were expecting it to behave like this:
input3[which(2:nrow(input3) %% 1 == 0),] <- NA
The "good" of this is that the which(...) returns a vector of integer, so it does not need to be the same length as the number of rows in the frame/table.
From ?Extract (which includes [ and friends):
For '['-indexing only: 'i', 'j', '...' can be logical
vectors, indicating elements/slices to select. Such vectors
are recycled if necessary to match the corresponding extent.
'i', 'j', '...' can also be negative integers, indicating
elements/slices to leave out of the selection.
"Recycling" is why length 1 works: its logical value is used for all rows. If you use length 2 and there is an even number of rows (e.g., mtcars[c(T,F),]), then it will give every other row. In a similar vein, if you assume recycling and the number of rows is not a whole multiple of the vector's length (e.g., mtcars[c(T,F,F),]), then your assumptions start becoming less clear.
Add to that the behavior of data.table, which does not allow this recycling at all. Recycling can get you in trouble, so data.table doesn't encourage it.
library(data.table)
mt <- as.data.table(mtcars)
mt[c(T,F),] <- NA
# Error in `[.data.table`(x, i, which = TRUE) :
# i evaluates to a logical vector length 2 but there are 32 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
mt[c(1,3),] <- NA
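To make the two safe options concrete, here is a minimal sketch on a base data.frame. The small input3 below is a hypothetical stand-in for the real table, and the data.table error quoted above is not reproduced here. It shows a logical index of full length and the equivalent integer index, both clearing every row except the first.

```r
# Hypothetical stand-in for the real table: one numeric, one character column
input3 <- data.frame(header1 = c(1.5, 2.5, 3.5, 4.5),
                     header2 = c("a", "b", "c", "d"),
                     stringsAsFactors = FALSE)
n <- nrow(input3)

# Option 1: a logical index with exactly one entry per row
clear_lgl <- c(FALSE, rep(TRUE, n - 1))
length(clear_lgl) == n        # TRUE

# Option 2: integer positions, which may be any length
clear_int <- seq_len(n)[-1]   # 2, 3, 4

by_logical <- input3
by_logical[clear_lgl, ] <- NA

by_integer <- input3
by_integer[clear_int, ] <- NA

by_logical[1, ]               # the first row is untouched
```

Either form avoids recycling entirely, because the index says explicitly which rows are meant.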

$value in unidimensional integrals in R [duplicate]

I have transitioned from STATA to R, and I was experimenting with different data types so that R's data structures are clear in my mind.
Here's how I set up my data structure:
b<-list(u=5,v=12)
c<-list(u=7)
j<-list(name="Joe",salary=55000,union=T)
bcj<-list(b,c,j)
Now, I was trying to figure out different ways to access u=5. I believe there are three ways:
Try1:
bcj[[1]][[1]]
I got 5. Correct!
Try2:
bcj[[1]][["u"]]
I got 5. Correct!
Try3:
bcj[[1]]$u
I got 5. Correct!
Try4:
bcj[[1]][1][1]
Here's what I got:
bcj[[1]][1][1]
$u
[1] 5
class(bcj[[1]][1][1])
[1] "list"
Question 1: Why did this happen?
Also, I experimented with the following:
bcj[[1]][1][1][1][1][1]
$u
[1] 5
class(bcj[[1]][1][1][1][1][1])
[1] "list"
Question 2: I would have expected an error because I don't think so many lists exist in bcj, but R gave me a list. Why did this happen?
PS: I did look at this thread on SO, but it's talking about a different issue.
I think this is sufficient to answer your question. Consider a length-1 list:
x <- list(u = 5)
#$u
#[1] 5
length(x)
#[1] 1
x[1]
x[1][1]
x[1][1][1]
...
always gives you the same:
#$u
#[1] 5
In other words, x[1] is identical to x, so each further [1] simply returns the same list again. No matter how many [1] you write, you just get x itself.
If I create t1<-list(u=5,v=7), and then do t1[2][1][1][1]...this works as well. However, t1[[2]][2] gives NA
That is the difference between [[ and [ when indexing a list. Using [ will always end up with a list, while [[ will take out the content. Compare:
z1 <- t1[2]
## this is a length-1 list
#$v
#[1] 7
class(z1)
#[1] "list"
z2 <- t1[[2]]
## this takes out the content; in this case, a vector
#[1] 7
class(z2)
#[1] "numeric"
When you do z1[1][1]..., as discussed above, you always end up with z1 itself. While if you do z2[2], you surely get an NA, because z2 has only one element, and you are asking for the 2nd element.
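The claims above can be verified directly with identical(). This is just a sketch that reuses t1, z1 and z2 from the discussion.

```r
t1 <- list(u = 5, v = 7)

# `[` returns a sub-list, so indexing that sub-list with [1] changes nothing
z1 <- t1[2]
identical(z1[1], z1)        # TRUE
identical(z1[1][1][1], z1)  # TRUE

# `[[` extracts the element itself, here a numeric vector of length 1
z2 <- t1[[2]]
z2[1]                       # 7
z2[2]                       # NA: z2 has no second element
```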
Perhaps this post and my answer there is useful for you: Extract nested list elements using bracketed numbers and names?

How do I count the number of pattern occurrences, if the pattern includes NA, in R?

I have a string of 0's, 1's and NA's like so:
string<-c(0,1,1,0,1,1,NA,1,1,0,1,1,NA,1,0,
0,1,0,1,1,1,NA,1,0,1,NA,1,NA,1,0,1,0,NA,1)
I'd like to count the number of times the PATTERN "1-NA-1" occurs. In this instance, I would like get the count 5.
I've tried table(string), and trying to replicate this but nothing seems to work. I would appreciate anyone's help!
# some ugly code, but it seems to work
sum( head(string, -2) == 1 & is.na(head(string[-1],-1))
& string[-1:-2] == 1, na.rm = TRUE)
Something like:
x <- which(is.na(string))
x <- x[!x %in% c(1,length(string))]
length(x[string[x-1] & string[x+1]])
# [1] 5
-- REASONING --
First, we check which values of string are NA with is.na(string). Then we find those indices with which and store them in x.
As @Rick mentions, if the first/last value is NA it would lead to problems in our next step. So, we make sure that those are removed (as it shouldn't count anyway).
Next, we want to find the situation where both string[x-1] and string[x+1] are 1. In other words, 1 & 1. Note that FALSE and TRUE can be evaluated as 0 and 1 respectively. So, if you type 1 == TRUE you will get TRUE. If you type 1 & 1 you will also get TRUE back. So, string[x-1] & string[x+1] will return TRUE when both are 1, and FALSE otherwise. We basically obtain a logical vector, and subset x with that vector to get all positions in x that satisfy our search. Then we use length to determine how many there are.
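The same idea can be wrapped in a small function. This is a sketch, and count_1_na_1 is a hypothetical helper name. It compares each element with its two neighbours and uses %in% instead of ==, an assumption on my part, so that NA neighbours count as "not 1" rather than producing NA.

```r
string <- c(0,1,1,0,1,1,NA,1,1,0,1,1,NA,1,0,
            0,1,0,1,1,1,NA,1,0,1,NA,1,NA,1,0,1,0,NA,1)

# Hypothetical helper: count NA entries whose two neighbours are both 1
count_1_na_1 <- function(v) {
  n <- length(v)
  if (n < 3) return(0L)
  left   <- v[1:(n - 2)]
  middle <- v[2:(n - 1)]
  right  <- v[3:n]
  # %in% never returns NA, so an NA neighbour simply fails the test
  sum(left %in% 1 & is.na(middle) & right %in% 1)
}

count_1_na_1(string)
# [1] 5
```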

Why does R need the name of the dataframe?

If you have a dataframe like this
mydf <- data.frame(firstcol = c(1,2,1), secondcol = c(3,4,5))
Why would
mydf[mydf$firstcol,]
work but
mydf[firstcol,]
wouldn't?
You can do this:
mydf[,"firstcol"]
Remember that the column goes second, not first.
In your example, to see what mydf[mydf$firstcol,] gives you, let's break it down:
> mydf$firstcol
[1] 1 2 1
So really mydf[mydf$firstcol,] is the same as
> mydf[c(1,2,1),]
firstcol secondcol
1 1 3
2 2 4
1.1 1 3
So you are asking for rows 1, 2, and 1. That is, you are asking for your row one to be the same as row 1 of mydf, your row 2 to be the same as row 2 of mydf and your row 3 to be the same as row 1 of mydf; and you are asking for both columns.
Another question is why the following doesn't work:
> mydf[,firstcol]
Error in `[.data.frame`(mydf, , firstcol) : object 'firstcol' not found
That is, why do you have to put quotes around the column name when you ask for it like that, but not when you do mydf$firstcol? The answer is just that the operators you are using require different types of arguments. You can look at ?'$' to see the form x$name, and thus the second argument can be a name, which is not quoted. You can then look up ?'[', which will actually lead you to the same help page. There you will find the following, which explains it. Note that a "character" vector needs to have quoted entries (that is how you enter a character vector in R and in many other languages).
i, j, ...: indices specifying elements to extract or replace. Indices
are ‘numeric’ or ‘character’ vectors or empty (missing) or
‘NULL’. Numeric values are coerced to integer as by
‘as.integer’ (and hence truncated towards zero). Character
vectors will be matched to the ‘names’ of the object (or for
matrices/arrays, the ‘dimnames’): see ‘Character indices’
below for further details.
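A short sketch of what that means in practice. The variable wanted is a hypothetical name introduced only for illustration, and with() is shown as one base R way to refer to a column by its bare name.

```r
mydf <- data.frame(firstcol = c(1, 2, 1), secondcol = c(3, 4, 5))
wanted <- "firstcol"     # hypothetical variable holding a column name

mydf[, "firstcol"]       # `[` with a character index
mydf[, wanted]           # `[` evaluates `wanted`, giving the same column
mydf[[wanted]]           # `[[` evaluates its argument too
mydf$wanted              # NULL: `$` looks for a column literally named "wanted"

# with() evaluates an expression using the columns as variables
with(mydf, firstcol)

# Row selection still needs a logical or numeric vector in the first slot
mydf[mydf$firstcol == 1, ]
```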
Nothing to add to the very clear explanation of Xu Wang. You might want to note in addition that the package data.table allows you to use notation such as mydf[firstcol==1,] or mydf[,firstcol], that many find more natural.
