Does R have a distinction between character vectors and strings? - r

As far as I know, what most languages call a string, R calls a character vector. For example, "Alice" is not a string, it's a character vector of length 1. Similarly, c("Alice", "Bob") is a character vector of length 2. I cannot recall my IDE or any of my work with R's type system telling me that R has any internal concept of "strings".
Despite this, R's documentation frequently uses the word "string":
?paste and ?nchar frequently talk of "character strings".
Many "See Also" sections mention strings without any qualifier, e.g. ?paste, ?chartr, and ?agrep.
?strsplit mentions "substrings".
?agrep, ?toString, and ?adist talk about strings both in their titles and "Description" sections.
strsplit, strwidth, and toString have string or a shorthand for it in their names.
So does R actually have a concept of strings, or does it always mean exactly the same thing as "character vector"?

Converting my comment to an answer.
A description of character and string can be found in the R Language Definition:
R has six basic (‘atomic’) vector types: logical, integer, real, complex, string (or character) and raw. The modes and storage modes for the different vector types are listed in the following table.
typeof mode storage.mode
logical logical logical
integer numeric integer
double numeric double
complex complex complex
character character character
raw raw raw
[...]
String vectors have mode and storage mode "character". A single element of a character vector is often referred to as a character string.
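A minimal sketch of what the quoted passage means in practice, using only base R functions; the variable names x and y are just for illustration. A quoted literal is a character vector of length 1, and its single element is what the documentation calls a "character string":

```r
# A quoted literal is a character vector of length 1.
x <- "Alice"
is.character(x)     # TRUE
length(x)           # 1: one element, i.e. one "character string"
nchar(x)            # 5: number of characters in that element

# A character vector with two elements (two strings).
y <- c("Alice", "Bob")
length(y)           # 2
nchar(y)            # 5 3: nchar() works element-wise

# Indexing gives back another character vector, not a separate string type.
typeof(y[1])        # "character"
identical(y[1], x)  # TRUE
```

So length() counts strings, while nchar() counts the characters inside each string.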

Related

What happens when you combine multiple datatypes in an atomic vector in R programming language?

I tried running a code to identify the type of the vector produced while combining different data types. Here is the code and what I got as the output. Can somebody explain why this output is seen?
v<-c(1L,2,TRUE)
typeof(v)
Output: [1] "double"
Seems like this is the rule:
When you attempt to combine different types they will be coerced in a fixed order: character → double → integer → logical. For example, combining a character and an integer yields a character.
An atomic vector can only hold values of a single data type. If you put several different types in it, they are coerced to a common type; in your case, double.
If you want to keep the data types of the original values, you need to use a list. Lists do not have this restriction.
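A short sketch of the coercion rule described above, assuming nothing beyond base R; the names v and lst are illustrative. The result of c() takes the most general type among its arguments, while list() keeps each element's type:

```r
# The example from the question: integer, double and logical combined.
v <- c(1L, 2, TRUE)
typeof(v)                  # "double"
v                          # 1 2 1 (TRUE became 1)

# Other combinations follow the same order of generality.
typeof(c(1L, TRUE))        # "integer"
typeof(c(1L, "a"))         # "character"
typeof(c(2.5, "a", TRUE))  # "character"

# A list keeps the type of each element.
lst <- list(1L, 2, TRUE)
vapply(lst, typeof, character(1))  # "integer" "double" "logical"
```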

In R what makes NULL atomical and therefore unable to exist in a vector?

In R for Everyone by Jared P. Lander on p. 54 it says "...NULL is atomical and cannot exist within a vector. If used inside a vector, it simply disappears."
I understand the concept of being atomic is being indivisible and that NULL represents "nothingness", used commonly to handle returns that are undefined.
Therefore, is NULL atomical because it always has this one value of "nothingness", meaning that something simply does not exist, so that R's way of handling it is to not let it exist in a vector, or, on assignment in a list, to actually remove that element?
Trying to wrap my head around it and find a more intuitive and comprehensive answer.
In my opinion talking about vectors as being "atomic" is more confusing than helpful. Instead, consider that R has a series of data types built into the language. They are given by definition and are distinct from one another.
For example, one such data type is "integer vector", which represents a sequence of integer values. Note that R does not have a data type of "integer". If we are talking about integer 5 in R, it is actually an integer vector of length 1.
Another built-in data type is NULL. There is a single object of type NULL, which is also called NULL. Since NULL is a type and an object, but not an integer value, it cannot be part of an integer vector.
Missing data in an integer vector are represented by NA. In this context NA is considered an integer value. Note that NA can also be a numeric value, logical value, etc. NA is not a data type, but a value.
A complete list of built-in data types can be found in the R source code and also in the documentation, e.g. https://cran.r-project.org/doc/manuals/r-release/R-ints.html#SEXPTYPEs
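The following sketch illustrates the points above with base R only; the names v, lst and lst2 are illustrative. NULL is its own type and drops out of an atomic vector, whereas NA is a value that exists in each atomic type:

```r
# NULL is an object of its own type and has length 0.
typeof(NULL)         # "NULL"
length(NULL)         # 0

# Inside c(), NULL contributes no elements, so it "disappears".
v <- c(1L, NULL, 3L)
length(v)            # 2

# NA is a value, and it comes in a variant for each atomic type.
typeof(NA)           # "logical"
typeof(NA_integer_)  # "integer"
typeof(c(1L, NA))    # "integer": NA is coerced like any other value

# A list can hold NULL as an element when built with list() ...
lst <- list(1, NULL, 3)
length(lst)          # 3

# ... but assigning NULL to an existing element removes it.
lst2 <- list(a = 1, b = 2)
lst2$b <- NULL
names(lst2)          # "a"
```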

What is the underlying logic when comparing strings?

What is the logic used by R to end up with the output FALSE in the logical operation on characters below? Is it just comparing the letter S with the letter T instead of the entire string?
"Sachin" > "Tendulkar"
Output: FALSE
This is in the documentation. ?">" gives:
Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use
In other words, this is just a regular dictionary-style comparison. Things can get very complicated or weird depending on the locale (e.g. how non-alphabetic, accented, or upper- vs. lower-case characters are handled), but this case looks straightforward. "S" comes before "T" in every locale I can imagine, so "S" < "T"; in a lexicographic sort, this first difference determines the order (ties would be broken by later letters in the sequence).
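A small sketch of the dictionary-style comparison, using base R only. The extra strings "Sachit" and "Sach" are made up for illustration, and the results shown assume a locale in which plain ASCII letters collate in alphabetical order:

```r
# The first differing character decides: "S" comes before "T".
"S" < "T"                 # TRUE
"Sachin" > "Tendulkar"    # FALSE

# When the leading characters tie, a later character decides.
"Sachin" < "Sachit"       # TRUE: "n" comes before "t"

# A string sorts before any longer string that it is a prefix of.
"Sach" < "Sachin"         # TRUE

# method = "radix" sorts in the C locale, independent of the session locale.
sorted <- sort(c("Tendulkar", "Sachin", "Sach"), method = "radix")
sorted                    # "Sach" "Sachin" "Tendulkar"
```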

Definition of vector in R

I learnt that a vector is a sequence of data elements of the same basic type. Then what do we call a in the following code (as it contains both numeric and character values):
a = c(1,"b")
is.vector(a)
[1] TRUE
So is the definition of vector wrong? I referred to this tutorial.
The tutorial simplifies and that can cause confusion. Its definition describes "basic vector types", but there are also "generic vectors".
From the language definition (which you should study):
2.1.1 Vectors
Vectors can be thought of as contiguous cells containing data. Cells
are accessed through indexing operations such as x[5]. More details
are given in Indexing.
R has six basic (‘atomic’) vector types: logical, integer, real,
complex, string (or character) and raw. The modes and storage modes
for the different vector types are listed in the following table.
typeof mode storage.mode
logical logical logical
integer numeric integer
double numeric double
complex complex complex
character character character
raw raw raw
Single numbers, such as 4.2,
and strings, such as "four point two" are still vectors, of length 1;
there are no more basic types. Vectors with length zero are possible
(and useful).
2.1.2 Lists
Lists (“generic vectors”) are another kind of data storage. Lists have
elements, each of which can contain any type of R
object, i.e. the elements of a list do not have to be of the same
type. List elements are accessed through three different indexing
operations. These are explained in detail in Indexing.
Lists are vectors, and the basic vector types are referred to as
atomic vectors where it is necessary to exclude lists.
From help("is.vector"):
If mode = "any", is.vector may return TRUE for the atomic modes, list
and expression. For any mode, it will return FALSE if x has any
attributes except names. [...]
(An expression is basically a list.)
Note that factors are not vectors; is.vector returns FALSE and as.vector converts a factor to a character vector for mode = "any".
Finally, as @Henrik points out, c coerces all arguments to the same type.
Actually, in your example, the 1 is coerced by R to the character string "1".
a<-c(1,"b")
typeof(a[1])
[1] "character"
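To tie the answers above together, here is a brief sketch in base R; the names lst and f are illustrative. It contrasts an atomic vector, a list (a "generic vector"), and a factor:

```r
# c() coerces its arguments, so a is an atomic character vector.
a <- c(1, "b")
a                 # "1" "b"
typeof(a)         # "character"
is.atomic(a)      # TRUE

# A list is also a vector, but not an atomic one; its elements keep their types.
lst <- list(1, "b")
is.vector(lst)    # TRUE
is.atomic(lst)    # FALSE
typeof(lst[[1]])  # "double"

# A factor carries attributes (levels, class), so is.vector() returns FALSE.
f <- factor(c("x", "y"))
is.vector(f)      # FALSE
as.vector(f)      # "x" "y"
```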

Sorting Algorithm in R

I had a question related to the sorting algorithm in R.
If I use order() to sort a particular column, the shorter string is not sorted first.
To give you an example: I had to sort a column of character type, and it puts firearm_weight above fire_weigh, which is not the dictionary way of sorting strings.
How can I change this while using the order() command?
Thanks!
"_" < "a" is TRUE on my system and locale.
help("Comparison") is relevant here:
Comparison of strings in character vectors is lexicographic within the
strings using the collating sequence of the locale in use: see
locales. The collating sequence of locales such as en_US is normally
different from C (which should use ASCII) and can be surprising.
Beware of making any assumptions about the collation order: [...]
Collation of non-letters (spaces, punctuation signs, hyphens,
fractions and so on) is even more problematic.
You could substitute "_" with something that is ordered after "z" on your system. E.g., a "µ" on my system.
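As an alternative to substituting characters (this is a different technique from the one suggested above), order() accepts method = "radix", which is documented to sort character vectors in the C locale regardless of the session locale. A minimal sketch, with x standing in for the column from the question:

```r
# Stand-in for the column from the question.
x <- c("firearm_weight", "fire_weigh")

# Default ordering depends on the collating sequence of the current locale.
Sys.getlocale("LC_COLLATE")
x[order(x)]

# Radix ordering uses the C locale, where "_" sorts before lower-case letters.
c_sorted <- x[order(x, method = "radix")]
c_sorted   # "fire_weigh" "firearm_weight"
```

Note that C-locale ordering also places all upper-case letters before lower-case ones, which may or may not be what you want for mixed-case data.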
