Why does R need the name of the dataframe? - r

If you have a dataframe like this
mydf <- data.frame(firstcol = c(1,2,1), secondcol = c(3,4,5))
Why would
mydf[mydf$firstcol,]
work but
mydf[firstcol,]
wouldn't?

You can do this:
mydf[,"firstcol"]
Remember that the column goes second, not first.
In your example, to see what mydf[mydf$firstcol,] gives you, let's break it down:
> mydf$firstcol
[1] 1 2 1
So really mydf[mydf$firstcol,] is the same as
> mydf[c(1,2,1),]
    firstcol secondcol
1          1         3
2          2         4
1.1        1         3
So you are asking for rows 1, 2, and 1. That is, you are asking for your row one to be the same as row 1 of mydf, your row 2 to be the same as row 2 of mydf and your row 3 to be the same as row 1 of mydf; and you are asking for both columns.
Another question is why the following doesn't work:
> mydf[,firstcol]
Error in `[.data.frame`(mydf, , firstcol) : object 'firstcol' not found
That is, why do you have to put quotes around the column name when you ask for it like that, but not when you write mydf$firstcol? The answer is that the two operators take different types of arguments. Look up ?'$' and you will see the form x$name, so the second argument can be a name, which is not quoted. Then look up ?'[', which actually leads to the same help page. There you will find the following, which explains it. Note that the entries of a "character" vector must be quoted (that is how you enter a character vector in R and in many other languages).
i, j, ...: indices specifying elements to extract or replace. Indices
are ‘numeric’ or ‘character’ vectors or empty (missing) or
‘NULL’. Numeric values are coerced to integer as by
‘as.integer’ (and hence truncated towards zero). Character
vectors will be matched to the ‘names’ of the object (or for
matrices/arrays, the ‘dimnames’): see ‘Character indices’
below for further details.
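As a minimal sketch of the point above, using the mydf from the question (the variable names col_by_dollar, col_by_name, rows_where_one and rows_with are only illustrative), the following shows the two equivalent ways to pull out a column, and one way to select rows by a condition on a column without repeating the data frame name inside the condition:

```r
mydf <- data.frame(firstcol = c(1, 2, 1), secondcol = c(3, 4, 5))

# `$` takes an unquoted name; `[` takes a quoted (character) index
col_by_dollar <- mydf$firstcol
col_by_name   <- mydf[, "firstcol"]
identical(col_by_dollar, col_by_name)

# Selecting rows by a condition: the column must be reachable by name,
# so it is qualified with the data frame
rows_where_one <- mydf[mydf$firstcol == 1, ]

# with() evaluates the expression inside the data frame, so the bare
# column name firstcol can be found there
rows_with <- with(mydf, mydf[firstcol == 1, ])
```

Outside of with(), a bare firstcol is looked up as an ordinary variable in the calling environment, which is why mydf[firstcol,] fails when no such variable exists.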

Nothing to add to the very clear explanation of Xu Wang. You might want to note in addition that the package data.table allows you to use notation such as mydf[firstcol==1,] or mydf[,firstcol] (once mydf has been converted to a data.table), which many find more natural.

Related

number some patterns in the string using R

I have a string that contains some patterns, like this:
my_string = "`d#k`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`n$l`0.4`0.1`0.25`0.28`0.18`0.3`0.17`0.2`0.03`!lk`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`vnabgjd`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`pogk(`1.01`0.71`0.86`0.89`0.79`0.91`0.78`0.81`0.64`r!#^##niw`0.0014`0.0020`9.9999`9.9999`0.0020`0.0022`0.0032`9.9999`0.0000`"
As you can see, the pattern is a non-numeric token [`nonnumber] followed by a run of numeric tokens [`number.num~], repeated.
So I want to count how many [`number.num~] tokens lie between consecutive [`nonnumber] tokens.
I tried to use a regex:
index <- gregexpr("`(\\w{2,20})`\\d\\.\\d(.*?)`\\D", my_string)
regmatches(my_string, index)
But with this code the trailing [`\D] of one match overlaps the start of the next, so the patterns cannot be counted this way.
If you know a method for this, please reply.
Using strsplit: we split at the backticks, coerce the pieces to "numeric", and take the differences between the positions of the pieces that yield NA (the non-numeric labels). Note that we need to drop the first element after strsplit (the empty string before the leading backtick) and append an NA to the numerics so that the last group is counted too. The result is a vector named with the non-numeric elements using setNames (not very good names, actually, but they show what is going on).
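# Split at the backticks and drop the empty string before the first one
s <- el(strsplit(my_string, "\\`"))[-1]
# Labels become NA when coerced to numeric
s.num <- suppressWarnings(as.numeric(s))
# Distance between consecutive NA positions, minus 1, is the count of numbers
setNames(diff(which(is.na(c(s.num, NA)))) - 1,
         s[is.na(s.num)])
#       d#k       n$l       !lk   vnabgjd     pogk( r!#^##niw
#         9         9         9         9         9         9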
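The same idea can also be sketched by grouping rather than by position differences. This is only an illustrative alternative, not the answer's method: toy_string is a made-up short input, and the sketch assumes the first token is always a label. Each label starts a new group, and the numeric tokens in each group are counted.

```r
# Hypothetical short input in the same format as my_string
toy_string <- "`ab`0.1`0.2`c$d`0.3`0.4`0.5"

# Split at literal backticks and drop the empty leading element
s <- strsplit(toy_string, "`", fixed = TRUE)[[1]][-1]

# TRUE for tokens that parse as numbers, FALSE for labels
is_num <- !is.na(suppressWarnings(as.numeric(s)))

# Every label increments the group id, so each number belongs to
# the most recent label before it
grp <- cumsum(!is_num)

# Count the numeric tokens per group and name the counts by label
counts <- setNames(as.vector(tapply(is_num, grp, sum)), s[!is_num])
counts
```

A label that is followed by no numbers at all would simply get a count of 0 here.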

Counting all the matchings of a pattern in a vector in R

I have a boolean vector in which I want to count the number of occurrences of some patterns.
For example, for the pattern "(1,1)" and the vector "(1,1,1,0,1,1,1)", the answer should be 4.
The only built-in function I found to help is grepRaw, which finds the occurrences of a particular string in a longer string. However, it seems to fail when the sub-strings matching the pattern overlap:
length(grepRaw("11","1110111",all=TRUE))
# [1] 2
Do you have any ideas to obtain the right answer in this case?
Edit 1
I'm afraid that Rich's answer works for the particular example I posted, but fails in a more general setting:
> sum(duplicated(rbind(c(FALSE,FALSE),embed(c(TRUE,TRUE,TRUE,FALSE,TRUE,TRUE,TRUE),2))))
[1] 3
In this other example, the expected answer would be 0.
Using the function rollapply you can apply a moving window of width 2 that sums the values. Then you can count the windows whose sum is equal to 2, i.e. sum(c(1,1)):
library(zoo)
z <- c(1,1,1,0,1,1,1)
# Sum each window of width 2; a sum of 2 means both elements are 1
sum(rollapply(z, 2, sum) == 2)
# [1] 4
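The window-sum trick relies on the pattern being all ones. For an arbitrary pattern, a base R sketch along the following lines may help. The helper count_pattern is a hypothetical name introduced here, not part of any package; it compares every window of the vector with the pattern, so overlapping matches are counted.

```r
# Hypothetical helper: count (possibly overlapping) occurrences of
# `pattern` as a contiguous run inside `x`
count_pattern <- function(x, pattern) {
  k <- length(pattern)
  if (length(x) < k) return(0L)
  # embed() returns one row per window, with the window's elements in
  # reverse order, so each row is compared with the reversed pattern
  windows <- embed(x, k)
  sum(apply(windows, 1, function(w) all(w == rev(pattern))))
}

count_pattern(c(1, 1, 1, 0, 1, 1, 1), c(1, 1))
count_pattern(c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE), c(FALSE, FALSE))
count_pattern(c(1, 0, 1, 0, 1), c(1, 0, 1))
```

The first call corresponds to the example in the question, and the second to the case from Edit 1 where no match is expected.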

R commands for finding mode in R seem to be wrong

I watched a YouTube video on finding the mode of a list of numerics in R. When I enter the commands, they do not work; R does not even give an error message. The vector is
X <- c(1,2,2,2,3,4,5,6,7,8,9)
Then instructor says use
temp <- table(as.vector(x))
to basically sort all unique values in list. R should give me from this command 1,2,3,4,5,6,7,8,9 but nothing happens except when the instructor does it this list is given. Then he says to use command,
names(temp)[temp--max(temp)]
which basically should give me this: 1,3,1,1,1,1,1,1,1 where 3 shows that the mode is 2 because it is repeated 3 times in list. I would like to stay with these commands as far as is possible as the instructor explains them in detail. Am I doing a typo or something?
You're kind of confused.
X <- c(1,2,2,2,3,4,5,6,7,8,9) ## define vector
temp <- table(as.vector(X))
"to basically sort all unique values in list."
That's not exactly what this command does. sort(unique(X)) would give a sorted vector of the unique values. (Note that in R, lists and vectors are different kinds of objects; it's best not to use the words interchangeably.) What table() does is count the number of instances of each unique value (in sorted order). Also, as.vector() is redundant.
"R should give me from this command 1,2,3,4,5,6,7,8,9 but nothing happens except when the instructor does it this list is given."
If you assign results to a variable, R doesn't print anything. If you want to see the value of a variable, type the variable's name by itself:
temp
you should see
1 2 3 4 5 6 7 8 9
1 3 1 1 1 1 1 1 1
the first row is the labels (unique values), the second is the counts.
"Then he says to use command, names(temp)[temp--max(temp)] which basically should give me this: 1,3,1,1,1,1,1,1,1 where 3 shows that the mode is 2 because it is repeated 3 times in list."
No. You already have the sequence of counts stored in temp. You should have typed
names(temp)[temp==max(temp)]
(note ==, not --), which should print
[1] "2"
i.e., this is the mode. The logic here is that temp==max(temp) gives you a logical vector (a vector of TRUE and FALSE values) that's only TRUE for the elements of temp that are equal to the maximum value; names(temp)[temp==max(temp)] selects the elements of the names vector (the first row shown in the printout of temp above) that correspond to TRUE values ...
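The steps described above can be put together as a short sketch. The names is_max, mode_value, Y, temp_y and modes_y are introduced here only for illustration, and the second vector is a made-up example with a tie to show that this approach returns every value that attains the maximum count.

```r
X <- c(1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 9)

# Count how often each unique value occurs
temp <- table(X)

# Logical vector: TRUE only where the count equals the largest count
is_max <- temp == max(temp)

# names(temp) holds the unique values as strings; convert back to numbers
mode_value <- as.numeric(names(temp)[is_max])
mode_value

# Made-up vector with a tie: both 1 and 2 occur twice
Y <- c(1, 1, 2, 2, 3)
temp_y <- table(Y)
modes_y <- as.numeric(names(temp_y)[temp_y == max(temp_y)])
modes_y
```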

How can I assign a value using if-else conditions in R

I have this dataframe with a column a. I would like to add a different column 'b' based on column 'a'.
For: if a>=10, b='double'. Otherwise b='single'.
How can I do it?
Sample output:
a b
2 single
2 single
4 single
11 double
12 double
12 double
45 double
4 single
You can use ifelse, a vectorized form of if/else that acts on whole vectors.
ifelse(a>=10, "double", "single")
So your code could look like this
mydata <- data.frame(a, b = ifelse(a >= 10, "double", "single"))
(The comments below specify that if a = 10, then b is "double".)
Strictly speaking, the result of if-else can be assigned in R; that is,
x1 <- if (TRUE) 1 else 2
is legit. For details see https://adv-r.hadley.nz/control-flow.html#choices
However, as this vectorizes over neither the test condition nor the value branches, it is not applicable to the case described in the question, which is about adding a column conditionally. In such a situation ifelse or the more type-safe if_else (from dplyr) can be used.
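As a minimal base R sketch of the vectorized approach, using the sample values from the question (the data frame name df is an assumption, since the question does not name it):

```r
# Sample values of column a taken from the question's expected output
df <- data.frame(a = c(2, 2, 4, 11, 12, 12, 45, 4))

# ifelse() tests every element of df$a and picks the matching label
df$b <- ifelse(df$a >= 10, "double", "single")
df

# A scalar if/else, by contrast, returns a single value from one test
x1 <- if (df$a[1] >= 10) "double" else "single"
```

Assigning to df$b keeps a numeric and b character, whereas binding the two with cbind would coerce both into a character matrix.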

help me understand partial matching in data.frame column names [duplicate]

I've encountered strange behavior when dropping columns from a data.frame. Initially I have:
> a <- data.frame("a" = c(1,2,3), "abc" = c(3,2,1)); print(a)
  a abc
1 1   3
2 2   2
3 3   1
Now, I remove a$a from the data.frame
> a$a <- NULL; print(a)
  abc
1   3
2   2
3   1
As expected, only the abc column remains in my data.frame. But the strange part begins when I try to reference the deleted column a.
> print(a$a)
[1] 3 2 1
> print(is.null(a$a))
[1] FALSE
It looks like R returns the value of a$abc instead of NULL.
This happens when the name of the remaining column begins with the name of the deleted column.
Is it a bug, or am I missing something here?
From the help page ?"$":
name: A literal character string or a
name (possibly backtick quoted). For
extraction, this is normally (see
under ‘Environments’) partially
matched to the names of the object.
So that's the normal behaviour because the name is partially matched. See ?pmatch for more info about partial matching.
Cheers
Perhaps it's worth pointing out (since it didn't come up on the previous related question) that this partial matching behavior is potentially a reason to avoid using '$' except as a convenient shorthand when using R interactively (at least, it's a reason to be careful using it).
Selecting a column via dat[,'ind'] if you know the name of the column, but not the position, or via dat[,3] if you know the position, is often safer since you won't run afoul of the partial matching.
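A short sketch of the difference, rebuilding the data frame from the question after column a has been removed (the variable names partial, exact_lookup, partial_opt_in and has_column are only illustrative):

```r
# The data frame from the question once column a is gone
a <- data.frame(abc = c(3, 2, 1))

# `$` falls back to partial matching, so this returns the abc column
partial <- a$a

# `[[` on a data frame matches exactly by default and returns NULL
exact_lookup <- a[["a"]]

# Partial matching with `[[` has to be requested explicitly
partial_opt_in <- a[["a", exact = FALSE]]

# Checking the names directly avoids any ambiguity
has_column <- "a" %in% names(a)
```

Setting options(warnPartialMatchDollar = TRUE) makes R warn whenever `$` resolves a name by partial matching, which can help catch this in scripts.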
While your exact question has already been answered in the comments, an alternative that avoids this behaviour is to convert your data.frame to a tibble, which is a stripped-down version of a data.frame, without column name munging, among other things:
library(tibble)
df_t <- as_tibble(a)  # as_data_frame() is deprecated in current tibble
df_t
# A tibble: 3 × 1
    abc
  <dbl>
1     3
2     2
3     1
> df_t$a
NULL
Warning message:
Unknown column 'a'
From the R Language Definition [section 3.4.1 pg.16-17] --
https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf
• Character: The strings in i are matched against the names attribute of x and the resulting integers are used. For [[ and $ partial matching is used if exact matching fails, so x$aa will match x$aabb if x does not contain a component named "aa" and "aabb" is the only name which has prefix "aa". For [[, partial matching can be controlled via the exact argument which defaults to NA indicating that partial matching is allowed, but should result in a warning when it occurs. Setting exact to TRUE prevents partial matching from occurring, a FALSE value allows it and does not issue any warnings. Note that [ always requires an exact match. The string "" is treated specially: it indicates ‘no name’ and matches no element (not even those without a name). Note that partial matching is only used when extracting and not when replacing.
