Last characters of a column of a data.frame in R

I have a table stored in a data.frame, and I need to get only the last two characters of each value in one column. How do I do this?
Note: I tried str_sub, but with it I can only specify the start and end positions, and my strings vary in length. My attempt below does not solve the problem:
base$estado <- str_sub(psd_base$itbc_name, start = 2)

You can use the functions substr() and nchar() to select the last character of each string. Both are vectorized, so you can write:
names = c("Alpha","Bip","Charlemagne","Haggs","O")
substr(names,nchar(names),nchar(names))
Which will give the output:
[1] "a" "p" "e" "s" "O"
Since I do not have a reproducible example of your data, this example has to suffice. I think you get the idea.
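Since the question asks for the last two characters rather than one, the same idea can be extended by moving the start position back by one. Here is a minimal sketch in base R, assuming the column is a character vector. The data frame `df` and its column `name` are made up for illustration.

```r
# Hypothetical data; the real column in the question is psd_base$itbc_name
df <- data.frame(name = c("Alpha", "Bip", "Charlemagne", "Haggs", "O"),
                 stringsAsFactors = FALSE)

# Start one position before the end. substr() treats a start below 1 as 1,
# so one-character strings are returned unchanged.
n <- nchar(df$name)
df$last_two <- substr(df$name, n - 1, n)
df$last_two
# [1] "ha" "ip" "ne" "gs" "O"
```

If stringr is already loaded, str_sub() also accepts negative positions counted from the end of the string, so str_sub(df$name, start = -2) should give the same result.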

Related

Replace characters only if it is not repeating

Is there a way to replace a character only if it is not repeating, or repeating a certain number of times?
str = c("ddaabb", "daabb", "aaddbb", "aadbb")
gsub("d{1}", "c", str)
[1] "ccaabb" "caabb" "aaccbb" "aacbb"
#Expected output
[1] "ddaabb" "caabb" "aaddbb" "aacbb"
You can use negative lookarounds in your regex to exclude cases where d is preceded or followed by another d:
gsub("(?<!d)d(?!d)", "c", str, perl=TRUE)
Edit: added perl=TRUE as suggested by the OP. For more information about the regex engines in R, see this question.
Now that you've added "or repeating a specified number of times," the regex-based approaches may get messy. Thus I submit my wacky code from a previous comment.
foo <- unlist(strsplit(str, ''))
bar <- rle(foo)
and then look for instances of bar$lengths == desired_length and use the returned indices to locate (by summing all bar$lengths[1:k] ) the position in the original sequence. If you only want to replace a specific character, check the corresponding value of bar$values[k] and selectively replace as desired.
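The run-length idea described above can be sketched as a small helper. This is one possible reading of it, not the answerer's exact code: it works on each string separately (so runs never span two strings), and the function name `replace_runs` and its arguments are hypothetical.

```r
# Hypothetical helper: replace every run of `char` whose length is exactly
# `run_len`; each character in a matching run becomes `replacement`
replace_runs <- function(x, char, replacement, run_len = 1) {
  vapply(x, function(s) {
    runs <- rle(strsplit(s, "")[[1]])            # run-length encode the characters
    hit <- runs$values == char & runs$lengths == run_len
    runs$values[hit] <- replacement              # swap the value of matching runs
    paste(inverse.rle(runs), collapse = "")      # rebuild the string
  }, character(1), USE.NAMES = FALSE)
}

x <- c("ddaabb", "daabb", "aaddbb", "aadbb")
replace_runs(x, "d", "c")
# [1] "ddaabb" "caabb"  "aaddbb" "aacbb"
```

Setting run_len = 2 would instead target only the doubled runs, which is the "repeating a certain number of times" case.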

Wildcard to match string in R

This might sound quite silly but it's driving me nuts.
I have a matrix that has alphanumeric values and I'm struggling to test whether some elements of that matrix match only the initial and final letters. As I don't care about the middle character, I'm trying (without success) to use a wildcard.
As an example, consider this matrix:
m <- matrix(nrow=3,ncol=3)
m[1,]=c("NCF","NBB","FGF")
m[2,]=c("MCF","N2B","CCD")
m[3,]=c("A3B","N4F","MCP")
I want to evaluate if m[2,2] starts with "N" and ends with "B", regardless of the 2nd letter in the string. I've tried something like
grep("N.B",m)
and it works, but still I want to know if there is a more compact way of doing it, like:
m[2,2]="N.B"
which obviously didn't work!
Thanks
You can use grepl on the subsetted m like:
grepl("^N.B$", m[2,2])
#[1] TRUE
or use startsWith and endsWith:
startsWith(m[2,2], "N") & endsWith(m[2,2], "B")
#[1] TRUE
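Because grepl() is vectorized, the same pattern can be applied to the whole matrix at once. The sketch below rebuilds the matrix from the question and reshapes the result to the matrix's dimensions. The name `hits` is only illustrative.

```r
# Same matrix as in the question, filled row by row
m <- matrix(c("NCF", "NBB", "FGF",
              "MCF", "N2B", "CCD",
              "A3B", "N4F", "MCP"), nrow = 3, byrow = TRUE)

# grepl() returns a plain logical vector, so give it the matrix shape back
hits <- matrix(grepl("^N.B$", m), nrow = nrow(m))
hits[2, 2]
# [1] TRUE

# Row and column positions of every matching element
which(hits, arr.ind = TRUE)
```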

Why sort(), arrange() ignores special characters in R. How to suppress it?

When I sort the strings in R, I noticed that it ignores the special characters such as ~, [, etc.
For example
> sort(c('~a','b','c'))
[1] "~a" "b" "c"
while I was expecting the following results since '~' is ordered higher in the ASCII table
[1] "b", "c", "~a"
I would like a result that keeps the ASCII order. Is there a way to enforce that in sort() as well as arrange()? I particularly want a solution for arrange() because I am applying the sorting in a data frame.
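sort() is not ignoring the special characters; it is simply not using ASCII order. By default it collates according to the current locale, and in many locales punctuation such as ~ sorts before letters. See the following.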
> v=c("a", "~b", "c")
> sort(v)
[1] "~b" "a" "c"
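To get the ASCII-style order asked for in the question, one option in base R is the radix method, which is documented to sort character vectors in the C locale. The sketch below also applies it to a data frame through order(). The data frame `df` and its columns are invented for the example.

```r
v <- c("~a", "b", "c")

# Radix sorting uses C-locale collation, so "~" (ASCII 126) comes after letters
sort(v, method = "radix")
# [1] "b"  "c"  "~a"

# Hypothetical data frame sorted the same way with order()
df <- data.frame(key = v, value = 1:3, stringsAsFactors = FALSE)
df_sorted <- df[order(df$key, method = "radix"), ]
df_sorted$key
# [1] "b"  "c"  "~a"
```

For arrange(), recent dplyr versions document a .locale argument, so arrange(df, key, .locale = "C") may work if your installed version has it. Check ?arrange before relying on it.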

R: replacing a table column with a modified version of that column

I am currently using R and I have produced a table with 3 columns. The first column contains names that look like "XXX_YYY_ZZZ", and I only want to keep the "XXX" part. I tried gsub but couldn't make it work, so I turned to strapplyc(), which works but produces only one column. I want to keep my initial table, but with the first column replaced by the strapplyc() output. Any other approach you think would fit better is welcome too!
Thank you in advance.
Since you have not shown sample data, here is a simple example for testing.
cal1 <- c("XXX_YYY_ZZZ","XXX_YYY_ZZZ")
gsub("_.*","",cal1)
Output will be as follows.
> gsub("_.*","",cal1)
[1] "XXX" "XXX"
Works for me. Here is a regex which looks for three groups of text, separated by underscores. The ^ indicates start of string and $ indicates end of string. I capture first (\\1) group, but there's nothing stopping you from capturing \\2, \\3 or even \\1\\3.
gsub("^(.*)_(.*)_(.*)$", "\\1", "XXX_YYY_ZZZ")
[1] "XXX"
You could also use strsplit.
> strsplit("XXX_YYY_ZZZ", "_")[[1]][1]
[1] "XXX"
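To keep the original table and overwrite only its first column, as the question asks, the result can be assigned back to that column. This is a minimal sketch with an invented three-column data frame. The names `tab`, `name`, `x`, and `y` are placeholders.

```r
# Hypothetical table; only the first column has the "XXX_YYY_ZZZ" shape
tab <- data.frame(name = c("AAA_BBB_CCC", "DDD_EEE_FFF"),
                  x = c(1, 2),
                  y = c(3, 4),
                  stringsAsFactors = FALSE)

# Drop everything from the first underscore onward and write it back in place
tab$name <- sub("_.*", "", tab$name)
tab$name
# [1] "AAA" "DDD"
```

The other columns are left untouched, so no separate merge step is needed.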

Why combine produces a different behavior from readLines() function

I am learning R and so far I am not having any trouble in catching up besides the following problem that I am hopeful someone out there will help me to understand.
If I create a character vector in the following way test1 <- c("a", "b", "c")
I get one vector of type character and I can access each member of the vector through an indexer test1[n].
That makes sense and does what I understand it should do.
However if I do test2 <- readLines("file1.txt") where file1.txt contains one line (several random words space separated) I get one vector of class character (same as the first case) and I can't use an indexer (unless there's a way and I don't know about it yet).
Questions:
Why both are char type based but they are stored differently
How one could tell them apart without knowing how they have been created
Besides using a strsplit() is there a way to break it down like c() does at loading time from a file?
Any help to understand the insides of this language is wildly appreciated!
Why both are char type based but they are stored differently
Both are stored in exactly the same way. R has no specific type to represent a single character, and as a consequence a string is not a collection of characters that can be indexed.
In the first case you have simply a character vector of length 3 where each element has size 1
test1 <- c("a", "b", "c")
typeof(test1)
# [1] "character"
length(test1)
# [1] 3
nchar(test1)
# [1] 1 1 1
and in the second case a character vector of length equal to number of lines in an input file and each element has size equal to length of string:
writeLines("foobar", con="file1.txt")
test2 <- readLines("file1.txt")
typeof(test2)
# [1] "character"
length(test2)
# [1] 1
nchar(test2)
# [1] 6
Besides using a strsplit() is there a way to break it down like c() does at loading time from a file?
If you have fixed-size elements you can try readBin, but generally speaking strsplit is the way to go:
f <- "file1.txt"
library(magrittr)  # provides the %>% pipe used below
readBin(f, what = 'raw', size = 1, n = file.info(f)$size) %>% sapply(rawToChar)
# [1] "f" "o" "o" "b" "a" "r" "\n"
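For the case in the question, a single line of space-separated words, another option is scan(), which splits on whitespace while reading. The sketch below compares it with readLines() followed by strsplit(). The temporary file and its contents are made up for the example.

```r
# Write a one-line sample file (contents are illustrative)
f <- tempfile(fileext = ".txt")
writeLines("several random words", con = f)

# scan() tokenizes on whitespace at load time
words_scan <- scan(f, what = character(), quiet = TRUE)
words_scan[2]
# [1] "random"

# Equivalent two-step approach: read the line, then split it
words_split <- strsplit(readLines(f), " ", fixed = TRUE)[[1]]
identical(words_scan, words_split)
# [1] TRUE
```

Either way the result is an ordinary character vector, so it can be indexed just like one built with c().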
