help me understand partial matching in data.frame column names [duplicate] - r

I've encountered a strange behavior when dropping columns from data.frame. Initially I have:
> a <- data.frame("a" = c(1,2,3), "abc" = c(3,2,1)); print(a)
a abc
1 1 3
2 2 2
3 3 1
Now, I remove a$a from the data.frame
> a$a <- NULL; print(a)
abc
1 3
2 2
3 1
As expected, I have only abc column in my data.frame. But the strange part begins, when I try to reference deleted column a.
> print(a$a)
[1] 3 2 1
> print(is.null(a$a))
[1] FALSE
It looks like R returns value of the a$abc instead of NULL.
This happens when the beginning of the name of remaining column exactly matches the name of deleted column.
Is it a bug or do I miss something here?

From the the help. ?$
name: A literal character string or a
name (possibly backtick quoted). For
extraction, this is normally (see
under ‘Environments’) partially
matched to the names of the object.
So that's the normal behaviour because the name is partially matched. See ?pmatch for more info about partial matching.
Cheers

Perhaps it's worth pointing out (since it didn't come up on the previous related question) that this partial matching behavior is potentially a reason to avoid using '$' except as a convenient shorthand when using R interactively (at least, it's a reason to be careful using it).
Selecting a column via dat[,'ind'] if you know the name of the column, but not the position, or via dat[,3] if you know the position, is often safer since you won't run afoul of the partial matching.

While your exact question has already been answered in the comments, an alternative to avoid this behaviour is to convert your data.frame to a tibble, which is a stripped downed version of a data.frame, without column name munging, among other things:
library(tibble)
df_t <- as_data_frame(a)
df_t
# A tibble: 3 × 1
abc
<dbl>
1 3
2 2
3 1
> df_t$a
NULL
Warning message:
Unknown column 'a'

From the R Language Definition [section 3.4.1 pg.16-17] --
https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf
• Character: The strings in i are matched against the names attribute of x and the resulting integers are used. For [[ and $ partial matching is used if exact matching fails, so x$aa will match x$aabb if x does not contain a component named "aa" and "aabb" is the only name which has prefix "aa". For [[, partial matching can be controlled via the exact argument which defaults to NA indicating that partial matching is allowed, but should result in a
warning when it occurs. Setting exact to TRUE prevents partial matching from occurring, a FALSE value allows it and does not issue any warnings. Note that [ always requires an exactmatch. The string "" is treated specially: it indicates ‘no name’ and matches no element (not even those without a name). Note that partial matching is only used when extracting
and not when replacing.

Related

Identifying if one field is partial derived from another in R

So given a pattern, say two letters, and the position of the pattern, is it a prefix, suffix or in the middle, I need to identify if a field is partial derived from another. So for example given the following dataset
data.V1 data.V2
1 GH GH1001
2 FD FD2002
3 TH 2345TH
4 ED ED56763
5 US 4345US
6 FG F6736tG
if LL is the pattern for column one where LL refers to two letters in this case. If the pattern for column 2 is LL#. This indicates that the position of the pattern is the first elements of each element in row 2. So in the dataset above rows 1,2&4 would obey the pattern.
I have tried if then statements but these did not work if the pattern was in the middle , #LL#. I have also tried the function regmathes but that did not work either.
apply(df,1,function(x) grepl(paste0("^",x["data.V1"]),x["data.V2"]))
For every row (that's the 1 in apply), this will check whether the contents of data.V1 appears right at the beginning of data.V2 (^ means the beginning of a string for regular expressions).
Result:
1 2 3 4 5 6
TRUE TRUE FALSE TRUE FALSE FALSE
Replace the grepl first argument with the following for:
End of string: paste0(x["data.V1"],"$")
Middle of string: paste0(".+",x["data.V1"],".+")
After n characters (n defined elsewhere): paste0(".{",n,"}",x["data.V1"])
(for the last one, the form "{n}" means the last character is repeated n times. Since it is preceded by ".", it means any n characters.)

R commands for finding mode in R seem to be wrong

I watched video on YouTube re finding mode in R from list of numerics. When I enter commands they do not work. R does not even give an error message. The vector is
X <- c(1,2,2,2,3,4,5,6,7,8,9)
Then instructor says use
temp <- table(as.vector(x))
to basically sort all unique values in list. R should give me from this command 1,2,3,4,5,6,7,8,9 but nothing happens except when the instructor does it this list is given. Then he says to use command,
names(temp)[temp--max(temp)]
which basically should give me this: 1,3,1,1,1,1,1,1,1 where 3 shows that the mode is 2 because it is repeated 3 times in list. I would like to stay with these commands as far as is possible as the instructor explains them in detail. Am I doing a typo or something?
You're kind of confused.
X <- c(1,2,2,2,3,4,5,6,7,8,9) ## define vector
temp <- table(as.vector(X))
to basically sort all unique values in list.
That's not exactly what this command does (sort(unique(X)) would give a sorted vector of the unique values; note that in R, lists and vectors are different kinds of objects, it's best not to use the words interchangeably). What table() does is to count the number of instances of each unique value (in sorted order); also, as.vector() is redundant.
R should give me from this command 1,2,3,4,5,6,7,8,9 but nothing happens except when the instructor does it this list is given.
If you assign results to a variable, R doesn't print anything. If you want to see the value of a variable, type the variable's name by itself:
temp
you should see
1 2 3 4 5 6 7 8 9
1 3 1 1 1 1 1 1 1
the first row is the labels (unique values), the second is the counts.
Then he says to use command, names(temp)[temp--max(temp)] which basically should give me this: 1,3,1,1,1,1,1,1,1 where 3 shows that the mode is 2 because it is repeated 3 times in list.
No. You already have the sequence of counts stored in temp. You should have typed
names(temp)[temp==max(temp)]
(note =, not -) which should print
[1] "2"
i.e., this is the mode. The logic here is that temp==max(temp) gives you a logical vector (a vector of TRUE and FALSE values) that's only TRUE for the elements of temp that are equal to the maximum value; names(temp)[temp==max(temp)] selects the elements of the names vector (the first row shown in the printout of temp above) that correspond to TRUE values ...

Using cbind on XTS object changes the dash (-) character in previous column names to a dot (.)

I have some R code that creates an XTS object, and then performs various cbind operations in the lifetime of that object. Some of my columns have names such as "adx-1". That is fine until another cbind() operation is performed. At that point, any columns with the "-" character are changes to a ".". So "adx-1" becomes "adx.1".
To reproduce:
x = xts(order.by=as.Date(c("2014-01-01","2014-01-02")))
x = cbind(x,c(1,2))
x
..2
2014-01-01 1
2014-01-02 2
colnames(x) = c("adx-1")
x
adx-1
2014-01-01 1
2014-01-02 2
x = cbind(x,c(1,2))
x
adx.1 ..2
2014-01-01 1 1
2014-01-02 2 2
It doesn't just do this with numbers either. It changes "test-text" to "test.text" as well. Multiple dashes are changed too. "test-text-two" is changed to "test.text.two".
Can someone please explain why this happens and, if possible, how to stop it from happening?
I can of course change my naming schemes, but it would be preferred if I didn't have to.
Thanks!
merge.xts converts the column names into syntactic names, which cannot contain -. According to ?Quotes:
Identifiers consist of a sequence of letters, digits, the period
('.') and the underscore. They must not start with a digit nor
underscore, nor with a period followed by a digit.
There is currently no way to alter this behavior.
The reason for the behavior is precisely the one Joshua Ulrich highlighted. It's common across many data types in R: you need "valid" names. Here is a great discussion of this "issue".
For data frames, you can pass the option check.names = FALSE as a workaround, but this is not implemented for xts object. This said, there are plenty of other workarounds available to you.
For instance, you could simply rename the columns of interest after very cbind. Using your code, simply add:
colnames(x)[1] <- c("adx-1")
to force back your desired column name.
Alternatively, you could consider this gsub solution if you wanted something potentially more systematic.

Types and comparisons in R

I've been working with R for a month or so, and my comprehension of some subtleties is still quite superficial.
I have had an issue, which I managed to solve (details below), but I still can't explain precisely why it did not work with the first solution.
Note that the example below makes no practical sense for I have simplified it as much as possible so that the problem is quite clear.
ISSUE :
Given a data frame with 4 columns (email, first, last, company) :
> users <- data.frame(matrix(vector(), 0, 4, dimnames=list(c(), c("email", "first", "last", "company"))), stringsAsFactors=F)
> users[1,] <- c("robert#redford.com", "Robert", "Redford", "Paramount")
> users[2,] <- c("julia#roberts.com", "Erin", "B.", "Hinkley")
> users[3,] <- c("matt#damon.com", "Will", "H.", "Stanford")
> users[4,] <- c("john#malkovitch.com", "John", "M.", "JM")
I take one particular row :
> user <- users[3,]
When I try to subset the dataframe on a criteria which could have lead to return the previously mentioned row, it returns no result.
> users[users$email == user["email"],]
[1] email first last company
<0 lignes> (ou 'row.names' de longueur nulle)
I instantly thought it was a casting issue (sorry for this bad one)
> users[users$email == as.character(user["email"]),]
email first last company
3 matt#damon.com Will H. Stanford
However, when I tried to figure out where exactly the issue was, and tried this :
> users[users$email == "matt#damon.com",]
email first last company
3 matt#damon.com Will H. Stanford
> user["email"] == "matt#damon.com"
email
3 TRUE
> users[3,]$email == user$email
[1] TRUE
I got quite confused :
First, I thought about it as a math problem : if A == B and B == C, then A == C (according to Captain Obvious). So, just replacing a member A by another member B which is supposed to be equal to A (given the "TRUE" statement) in some expression should have no impact on the result of this expression.
3 TRUE != [1] TRUE. I think [1] TRUE is a logical vector of size 1 which first element is TRUE. 3 TRUE is (1x1) matrix row, which column "email" value is TRUE.
My problem is with consistency : either two objects of equal content but different types should be equal, or they should be different. I have a problem with "Sometimes there is type inference, and sometimes not". Is there a rule I can't see beyond this behavior ? (I guess there is one)
Another expression of the behavior I'd like to get is this one :
> unique(users$email) == "matt#damon.com"
[1] FALSE FALSE TRUE FALSE
> unique(users$email) == user["email"]
email
3 FALSE
Obviously R does get what I want (considering the fact that it gives me the matching row). But I can't explain (nor use) the result of the second statement.
Any explanations / thoughts?
in normal list situations
users$email == user[["email"]]
however in data.frames things get inconsistent/ a lot worse!
tdf=data.frame(matrix(1:100,10,10))
tdf[] # returns data.frame everything
tdf[1] # returns data.frame first column
tdf[1,1] # returns object as type of the object...
tdf[,1] # returns a vector of the first column
tdf[1,] # returns a data.frame of the first row # eeeeeugh... that is odd....
tdf[2:4] # returns a data.frame with 3 columns
tdf[1,2:4] # returns a data.frame of the first row of 3 colums
tdf[2:4,2:4] # returns a 3x3 data.frame
tdf[2:4,1] # returns a vector of 2:4 row and 1st column
tdf[,2:4] # returns a data.frame with 3 columns
then there is also the double [[]]
do note that in data.frames things get horribly annoying and fugly
tdf[[1]] # gives the first row as a vector
tdf[[1,1]] # gives first element
and pretty much all other combinations gives errors
and assigning stuff to a data.frame or matrix, is an even bigger mess!

Why does R need the name of the dataframe?

If you have a dataframe like this
mydf <- data.frame(firstcol = c(1,2,1), secondcol = c(3,4,5))
Why would
mydf[mydf$firstcol,]
work but
mydf[firstcol,]
wouldn't?
You can do this:
mydf[,"firstcol"]
Remember that the column goes second, not first.
In your example, to see what mydf[mydf$firstcol,] gives you, let's break it down:
> mydf$firstcol
[1] 1 2 1
So really mydf[mydf$firstcol,] is the same as
> mydf[c(1,2,1),]
firstcol secondcol
1 1 3
2 2 4
1.1 1 3
So you are asking for rows 1, 2, and 1. That is, you are asking for your row one to be the same as row 1 of mydf, your row 2 to be the same as row 2 of mydf and your row 3 to be the same as row 1 of mydf; and you are asking for both columns.
Another question is why the following doesn't work:
> mydf[,firstcol]
Error in `[.data.frame`(mydf, , firstcol) : object 'firstcol' not found
That is, why do you have to put quotes around the column name when you ask for it like that but not when you do mydf$firstcol. The answer is just that the operators you are using require different types of arguments. You can look at '$' to see the form x$name and thus the second argument can be a name, which is not quoted. You can then look up ?'[', which will actually lead you to the same help page. And there you will find the following, which explains it. Note that a "character" vector needs to have quoted entries (that is how you enter a character vector in R (and many other languages).
i, j, ...: indices specifying elements to extract or replace. Indices
are ‘numeric’ or ‘character’ vectors or empty (missing) or
‘NULL’. Numeric values are coerced to integer as by
‘as.integer’ (and hence truncated towards zero). Character
vectors will be matched to the ‘names’ of the object (or for
matrices/arrays, the ‘dimnames’): see ‘Character indices’
below for further details.
Nothing to add to the very clear explanation of Xu Wang. You might want to note in addition that the package data.table allows you to use notation such as mydf[firstcol==1,] or mydf[,firstcol], that many find more natural.

Resources