What are character vectors made of? - r

"Alice" is a character vector of length 1. "Bob" is also a character vector of length 1, but it's clearly shorter. At face value, it appears that R's character vectors are made out of something smaller than characters, but if you try to subset them, say "Alice"[1], you'll just get the original vector back. How does R internally make sense of this? What are character vectors actually made of?

You're mistaking vector length for string length.
In R, ordinary values are all vectors, so "Alice" and "Bob" are each a vector containing one string, whether or not you assign a name to them. Subsetting with [1] selects the first element of the vector, which is the whole string.
If you want to check the size of each string, use the nchar function:
nchar("Alice")
[1] 5
nchar("Bob")
[1] 3
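To make the distinction concrete, here is a small sketch contrasting length() with nchar(), and showing two base R ways to reach the individual characters (substr and strsplit). The variable names x and chars are only illustrative.

```r
# A character vector with two elements (two strings)
x <- c("Alice", "Bob")

length(x)   # number of elements in the vector: 2
nchar(x)    # number of characters in each string: 5 3

# [ subsets the vector, not the string, so this is still "Alice"
"Alice"[1]

# To get at the characters inside a string, use substr() or strsplit()
substr("Alice", 1, 1)               # "A"
chars <- strsplit("Alice", "")[[1]] # "A" "l" "i" "c" "e"
length(chars)                       # 5
```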

Related

Subsetting a string vector based on a partial match of unknown characters

I have a vector of 8-character file names of the format
"/relative/path/to/folder/a(bc|de|fg)...[xy]1.sav"
where the brackets hold one of two or three known characters, and the '...' are three unknown characters. I want to match all file names that have the same unknown sequence XXX and sort them into a list of character vectors.
I am not sure how to proceed on this. I am thinking about a way to extract the letters in the fourth to sixth position (...), put them into a vector, then use grep to get all the files with the matching string.
E.g.
# Pseudo-code. Not functioning code, but sort of the thing I want to do
> char.extr <- str_extract(file.vector, !"a(bc|de|fg)...[xy]1.sav")
> char.extr
"JKL", "MNO" ,"PQR" ...
# Use grep and lapply to put matched strings into list
> path.list <- lapply(char.extr, grep, file.vector)
> path.list
1. "/relative/path/to/folder/abcJKLx1.sav"
"/relative/path/to/folder/adeJKLy1.sav"
2. "/relative/path/to/folder/afgMNOx1.sav"
"/relative/path/to/folder/abcMNOy1.sav"
Since we know the name structure, I'd imagine extracting the 3-letter substring and then using split to group the paths into a list is what you're looking for.
split(file.vector, substr(basename(file.vector), 4, 6))
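As a rough illustration of that approach, here is a sketch that uses the four example paths from the question as input. The name key is hypothetical, and the code assumes every file name follows the stated layout, so the unknown part always sits in positions 4 to 6 of the base name.

```r
# Example paths taken from the question
file.vector <- c(
  "/relative/path/to/folder/abcJKLx1.sav",
  "/relative/path/to/folder/adeJKLy1.sav",
  "/relative/path/to/folder/afgMNOx1.sav",
  "/relative/path/to/folder/abcMNOy1.sav"
)

# Drop the directory part, then take characters 4 to 6 of each file name
key <- substr(basename(file.vector), 4, 6)

# Group the full paths by that three-character key
path.list <- split(file.vector, key)
names(path.list)   # "JKL" "MNO"
path.list$JKL      # the two paths containing JKL
```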

R: use strsplit transform to list that is not length of 1

I was writing some code that needs to inspect individual characters in an input string. I want to know whether there are numbers in the string, so I use strsplit to split the string into a list and then use any() to check for the numbers (the limitation is that I can't use grepl).
b<-"Idontknow456"
b<-strsplit(b,"")
length(b)
any(6%in%b)
c<-list("d","t","5")
length(c)
any(5%in%c)
The result:
1
FALSE
3
TRUE
May I ask how should I modify my code, so that I can split individual characters to inspect them?
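One possible way to do this, offered as a sketch rather than the only fix: strsplit returns a list with one element per input string, which is why length(b) is 1. Taking the first element with [[1]] gives a character vector of single characters, and the digits then have to be compared as strings. The name chars is only illustrative.

```r
b <- "Idontknow456"

# strsplit() returns a list; [[1]] extracts the character vector inside it
chars <- strsplit(b, "")[[1]]
length(chars)   # 12, one element per character

# Compare against the digits as strings, not as numbers
any(chars %in% as.character(0:9))   # TRUE
"6" %in% chars                      # TRUE
```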

Object in R is integer but has length of 8364

I have a data.frame from which I extracted a column called Volume. The code is as follows:
volume = aapl.us$Volume
In the console, I am told the following:
typeof(volume)
# "integer"
length(volume)
# 8364
How is this possible?
What you are seeing is not strange behavior in R, though it can seem unintuitive to users of other programming languages that distinguish between a scalar (a single number) and a vector (a one-dimensional array).
R does not have "scalar" data. The simplest data structure in R is a vector, which can be numeric, character, factor, integer, logical, or complex-valued. A single number in R is a "vector of length one", not a "scalar". All elements of a vector must be of the same type.
typeof() returns the type of a variable. In your case, volume is a vector that contains integers, and that vector has length 8364.
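A short sketch of the same idea follows. The values in volume are made up for illustration and are not the asker's data.

```r
# A "single number" is just a vector of length one
x <- 42L
is.vector(x)   # TRUE
length(x)      # 1
typeof(x)      # "integer"

# A longer integer vector has the same type, only a different length
volume <- c(100L, 250L, 75L)   # hypothetical values
typeof(volume)   # "integer"
length(volume)   # 3
```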

vector - character/integer class (under the hood)

Starting to learn R, and I would appreciate some help understanding how R decides the class of different vectors. I initialize vec <- c(1:6) and when I perform class(vec) I get 'integer'. Why is it not 'numeric', because I thought integers in R looked like this: 4L
Also with vec2 <- c(1,'a',2,TRUE), why is class(vec2) 'character'? I'm guessing R picks up on the characters and automatically assigns everything else to be characters...so then it actually looks like c('1','a','2','TRUE') am I correct?
Type the following to see the help page for the colon operator:
?`:`
Here is the relevant paragraph:
For numeric arguments, a numeric vector. This will be of type integer
if from is integer-valued and the result is representable in the R
integer type, otherwise of type "double" (aka mode "numeric").
So, in your example c(1:6), the from argument 1 is integer-valued and the result is representable as an integer, so the resulting sequence is of type integer.
By the way, c is not needed to create a vector in this case.
For the second question: all elements of a vector must have the same type, so R automatically converts them to a common type. Here everything can be converted to character, but "a" cannot be converted to numeric, so the result is a character vector, as you guessed.
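Both points can be checked directly at the console. The following is a small sketch using the vectors from the question plus a few extra cases added for comparison.

```r
# The colon operator yields integers when from is integer-valued
vec <- 1:6
typeof(vec)      # "integer"
class(vec)       # "integer"
typeof(1.5:6)    # "double", because from is not integer-valued

# Mixing types in c() coerces everything to one common type
vec2 <- c(1, "a", 2, TRUE)
vec2             # "1" "a" "2" "TRUE"
class(vec2)      # "character"

# Without a character element, logical values are coerced to numbers instead
class(c(1, TRUE))    # "numeric"
class(c(1L, TRUE))   # "integer"
```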

adding or retaining leading zeros without converting to character format

Is it possible to add or retain one or more leading zeros to a number without the result being converted to character? Every solution I have found for adding leading zeros returns a character string, including: paste, formatC, format, and sprintf.
For example, can x be 0123 or 00123, etc., instead of 123 and still be numeric?
x <- 0123
EDIT
It is not essential. I was just playing around with the following code and the last two lines gave the wrong answer. I just thought maybe if I could have leading zeros with numeric format obtaining the correct answer would be easier.
a7 = c(1,1,1,0); b7=c(0,1,1,1); # 4
a77 = '1110' ; b77='0111' ; # 4
a777 = 1110 ; b777=0111 ; # 4
length(b7[(b7 %in% intersect(a7,b7))])
# R - count matches between characters of one string and another, no replacement
keyword <- unlist(strsplit(a77, ''))
text <- unlist(strsplit(b77, ''))
sum(!is.na(pmatch(keyword, text)))
ab7 <- read.fwf(file = textConnection(as.character(rbind(a777, b777))), widths = c(1,1,1,1), colClasses = rep("character", 2))
length(ab7[2,][(ab7[2,] %in% intersect(ab7[1,],ab7[2,]))])
You are not thinking correctly about what a "number" is. Programming languages store an internal representation which retains full precision to the machine limit. You are apparently concerned with what gets printed to your screen or console. By definition, those number characters are string elements, which is to say, a couple bytes are processed by the ASCII decoder (or equivalent) to determine what to draw on the screen. What x "is," to draw happily on Presidential Testimony, depends on your definition of what "is" is.
You could always create your own class of objects that has one slot for the value of the number (but if it is stored as numeric then what we see as 123 will actually be stored as a binary value, something like 01111011 (though probably with more leading 0's)) and another slot or attribute for either the number of leading 0's or the number of significant digits. Then you can write methods for what to do with the number (and what effect that will have on the leading 0's, sig digits, etc.).
The print method could then make sure to print it with the leading zeros while keeping the internal value as a number.
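A minimal sketch of that idea, using an S3 class with an attribute rather than formal slots. The class name zeropad, its width attribute, and the constructor are all hypothetical names chosen for illustration. The first line also shows that a literal such as 0123 is simply read as the number 123.

```r
# Leading zeros in a numeric literal are not kept: this is TRUE
0123 == 123

# Hypothetical S3 class: a number plus the width to pad to when displayed
zeropad <- function(x, width) {
  structure(x, width = width, class = "zeropad")
}

# Build the zero-padded text; as.numeric() drops the class and attribute first
format.zeropad <- function(x, ...) {
  formatC(as.numeric(x), width = attr(x, "width"), flag = "0", format = "d")
}

# Print the padded form while the stored value stays numeric
print.zeropad <- function(x, ...) {
  cat(format(x), "\n")
  invisible(x)
}

x <- zeropad(123, width = 5)
print(x)            # displays 00123
as.numeric(x) + 1   # 124, arithmetic uses the underlying number
```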
But this seems a bit overkill in most cases (though I know that some fields make a big deal about indicating number of significant digits so that leading 0's could be important). It may be simpler to use the conversion to character methods that you already know about, but just do the printing in a way that does not look obviously like a number, see the cat and print functions for the options.
