Is there a built-in ordinal sequence vector in R?

I need a long ordinal sequence vector in R. As a simple example of what I want:
OS <- c("First","Second","Third")
Is there a built-in vector like that?

You can use ordinal() from the english package:
library(english)
ordinal(1:5)
# [1] first second third fourth fifth

I googled "R cardinal numbers" and got to the vignette for the toOrdinal package, but unfortunately it doesn't actually get you words.
library(toOrdinal)
sapply(1:5,toOrdinal)
## [1] "1st" "2nd" "3rd" "4th" "5th"
The docs say
convert_to: OPTIONAL. Output type that provided 'cardinal_number' is
converted into. Default is 'ordinal_number' which refers to
the 'cardinal_number' followed by the appropriate ordinal
indicator. Additional options planned include 'ordinal_word'.
so maybe this will eventually do what you want ...
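If the numeric form with a suffix is all you need and you would rather not depend on a package, the suffix rule is easy to sketch in base R. The helper name ordinal_suffix below is made up for illustration; it is not part of toOrdinal or english, and it assumes small non-negative whole numbers.

```r
# Hypothetical helper (not from any package): append the English ordinal
# suffix to whole numbers using base R only.
ordinal_suffix <- function(n) {
  last_two <- n %% 100
  last_one <- n %% 10
  suffix <- ifelse(last_two %in% 11:13, "th",   # 11th, 12th, 13th are irregular
            ifelse(last_one == 1, "st",
            ifelse(last_one == 2, "nd",
            ifelse(last_one == 3, "rd", "th"))))
  paste0(n, suffix)
}

ordinal_suffix(c(1:5, 11, 12, 13, 21, 102))
## [1] "1st"   "2nd"   "3rd"   "4th"   "5th"   "11th"  "12th"  "13th"  "21st"  "102nd"
```

This only reproduces the "1st"-style output; for actual words you still need something like the english package.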


readxl, returning values are slightly off from the values on Excel

I'm trying to read an Excel file into R.
I used read_excel function of the readxl package with parameter col_types = "text" since the columns of the Excel sheet contain mixed data types.
df <- read_excel("Test.xlsx",sheet="Sheet1",col_types = "text")
But a very slight difference appears in some of the numeric values. It's always the same few values, so I suspect some hidden attribute in Excel.
I tried formatting those values as numbers in Excel, and also tried adding 0s after the number, but neither worked.
I changed the numeric value of a cell from 2.3 to 2.4, and it was read correctly by R.
This is a consequence of floating-point imprecision, but it's a little tricky. When you enter the number 1.2 (for example) into R or Excel, it's not represented exactly as 1.2:
print(1.2,digits=22)
## [1] 1.199999999999999955591
Excel and R usually try to shield you from these details, which are inevitable if you're using fixed precision floating-point values (which most computer systems do), by limiting the printing precision to a level that will ignore those floating-point imprecisions. When you explicitly convert to character, however, R figures you don't want to lose information, so it gives you all the digits. Numbers that can be represented exactly in a binary representation, such as 2.375, don't gain all those extra digits.
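To see the difference between a value that is stored inexactly and one that is stored exactly, here is a small base R sketch. The variable names are arbitrary, and the exact digits shown for 1.2 may vary slightly by platform.

```r
# 1.2 has no exact binary representation; 2.375 (= 2 + 1/4 + 1/8) does.
x <- 1.2
y <- 2.375

print(x)                  # default printing uses 7 significant digits
format(x, digits = 22)    # asks for more digits than a double holds exactly
format(y, digits = 22)    # stays "2.375": there are no hidden digits to reveal

# The classic symptom of the same imprecision:
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: comparison with a tolerance
```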
However, there's a simple solution in this case:
readxl::read_excel("Test.xlsx", na="ND")
This tells R that the string "ND" should be treated as a special "not available" value, so all of your numeric values get handled properly. When you examine your data, the tiny imprecisions will still be there, but R will print the numbers the same way that Excel does.
I feel like there's probably a better way to approach this (mixed-type columns are really hard to deal with), but if you need to 'fix' the format of the numbers you can try something like this:
x <- c(format(1.2,digits=22),"abc")
## [1] "1.199999999999999955591" "abc"
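fix_nums <- function(x) {
  # Try to parse every entry as a number; non-numeric strings become NA
  nn <- suppressWarnings(as.numeric(x))
  # Re-format only the entries that parsed, using default print precision
  x[!is.na(nn)] <- format(nn[!is.na(nn)])
  return(x)
}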
fix_nums(x)
## [1] "1.2" "abc"
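Then, if you're using the tidyverse, you can apply it to every column with my_data %>% mutate_all(fix_nums) (or mutate(across(everything(), fix_nums)) in dplyr 1.0 and later).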
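If you'd rather stay in base R, the same column-wise application can be sketched with lapply. The data frame below is a made-up stand-in for what read_excel(..., col_types = "text") might return; the column names are hypothetical.

```r
fix_nums <- function(x) {
  nn <- suppressWarnings(as.numeric(x))
  x[!is.na(nn)] <- format(nn[!is.na(nn)])
  x
}

# Made-up stand-in for a sheet read with col_types = "text"
my_data <- data.frame(
  conc  = c("1.199999999999999955591", "ND"),
  label = c("abc", "2.375"),
  stringsAsFactors = FALSE
)

# Assigning into my_data[] keeps the data frame structure
my_data[] <- lapply(my_data, fix_nums)
my_data
```

One thing to watch: format() pads a whole vector to a common number of decimals, so a column holding both 1.2 and 2.375 would come back as "1.200" and "2.375".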

How to match any character existing between a pattern and a semicolon

I am trying to get anything existing between sample_id= and ; in a vector like this:
sample_id=10221108;gender=male
tissue_id=23;sample_id=321108;gender=male
treatment=no;tissue_id=98;sample_id=22
My desired output would be:
10221108
321108
22
How can I get this?
I've been trying several things like this, but I don't find the way to do it correctly:
clinical_data$sample_id<-c(sapply(myvector, function(x) sub("subject_id=.;", "\\1", x)))
You could use sub with a capture group to isolate that which you are trying to match:
out <- sub("^.*\\bsample_id=(\\d+).*$", "\\1", x)
out
[1] "10221108" "321108" "22"
Data:
x <- c("sample_id=10221108;gender=male",
"tissue_id=23;sample_id=321108;gender=male",
"treatment=no;tissue_id=98;sample_id=22")
Note that the output above is character, not numeric, but you can easily convert it with as.numeric if you need to.
Edit:
If you are unsure that the sample IDs would always be just digits, here is another version you may use to capture any content following sample_id:
out <- sub("^.*\\bsample_id=([^;]+).*$", "\\1", x)
out
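One thing to keep in mind with sub: when the pattern does not match, the input string is returned unchanged, so an entry without sample_id would come back whole. Here is a minimal sketch of guarding against that, with a fourth, made-up entry added to the data:

```r
x <- c("sample_id=10221108;gender=male",
       "tissue_id=23;sample_id=321108;gender=male",
       "treatment=no;tissue_id=98;sample_id=22",
       "gender=female")   # hypothetical entry with no sample_id

pattern <- "^.*\\bsample_id=([^;]+).*$"

# Keep the captured group where the pattern matches, NA elsewhere
out <- ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
out
as.numeric(out)   # numeric IDs, with NA for the entry that has none
```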
You could also try str_extract from the stringr package.
If each element of your vector holds a single sample_id, you can do:
library(stringr)
str_extract(x, "(?<=\\bsample_id=)[[:digit:]]+") # the lookbehind targets a run of digits preceded by sample_id=; the + captures all of the digits
This extracts just the number from each element. If an element can hold several sample_id entries, it becomes a tad more difficult, because you have to tell the extraction to continue after the first match. The code would look something like this:
str_extract_all(x, "(?<=sample_id=)\\d+")
This extracts all of the numbers you're looking for, and the output is a list with one character vector per element of x. From there you can manipulate the list as you see fit.
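If stringr is not available, roughly the same "all matches per element" idea can be sketched in base R with gregexpr and regmatches. This assumes perl = TRUE so that the lookbehind is supported; the third entry is made up to show an element with two IDs.

```r
x <- c("sample_id=10221108;gender=male",
       "tissue_id=23;sample_id=321108;gender=male",
       "sample_id=22;gender=male sample_id=35;gender=female")  # made-up entry with two IDs

# Find every run of digits preceded by "sample_id="
m   <- gregexpr("(?<=sample_id=)\\d+", x, perl = TRUE)
ids <- regmatches(x, m)   # list: one character vector per element of x

ids
lengths(ids)   # number of IDs found in each element
unlist(ids)    # flatten if you do not care which element they came from
```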

"Named tuples" in r

If you load the pracma package in the R console and type
gammainc(2,2)
you get
lowinc uppinc reginc
0.5939942 0.4060058 0.5939942
This looks like some kind of a named tuple or something.
But, I can't work out how to extract the number below the lowinc, namely 0.5939942. The code (gammainc(2,2))[1] doesn't work, we just get
lowinc
0.5939942
which isn't a number.
How is this done?
As can be checked with str(gammainc(2,2)[1]) and class(gammainc(2,2)[1]), the output mentioned in the OP is in fact a number. It is just a named number. The names used as attributes of the vector are supposed to make the output easier to understand.
The function unname() can be used to obtain the numerical vector without names:
unname(gammainc(2,2))
#[1] 0.5939942 0.4060058 0.5939942
To select the first entry, one can use:
unname(gammainc(2,2))[1]
#[1] 0.5939942
In this specific case, a clearer version of the same might be:
unname(gammainc(2,2)["lowinc"])
Double brackets will strip the names:
gammainc(2,2)[[1]]
gammainc(2,2)[["lowinc"]]
I don't claim it to be intuitive, or obvious, but it is mentioned in the manual:
For vectors and matrices the [[ forms are rarely used, although they
have some slight semantic differences from the [ form (e.g. it drops
any names or dimnames attribute, and that partial matching is used for
character indices).
The partial matching can be employed like this
gammainc(2, 2)[["low", exact=FALSE]]
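To try these out without installing pracma, here is a sketch using a plain named vector as a stand-in for the gammainc(2,2) result. The values are simply copied from the question, and res is an arbitrary name.

```r
# Stand-in for gammainc(2, 2): a named numeric vector
res <- c(lowinc = 0.5939942, uppinc = 0.4060058, reginc = 0.5939942)

is.numeric(res[1])            # TRUE: a named number is still a number
res[1] * 2                    # arithmetic works; the name is carried along

res[[1]]                      # value only, name dropped
res[["lowinc"]]               # same, selected by name
unname(res["lowinc"])         # same, via unname()
res[["low", exact = FALSE]]   # partial matching on the name
```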
In R, vectors may have a names() attribute. Here is an example:
vector <- c(1, 2, 3)
names(vector) <- c("first", "second", "third")
If you display vector, you get output in the same form as in the question:
> vector
 first second  third
     1      2      3
To check what type of object a function returns, you can use:
class(your_function())
I hope this helps.

how to use R language $ symbol to extract column from a matrix

I am new to R. If I use us_stocks$"LNC", I get the corresponding zoo series (us_stocks is a zoo object from the zoo package). resB is a list with the following elements:
resB
# [[1]] LNC 7
# [[2]] GAM 62
# [[3]] CMA 7
class(resB)
# [1] "list"
names(resB[[1]])
# [1] "LNC"
But when I use us_stocks$names(resB[[1]]), I do not get the zoo series. How can I fix this?
It often takes a while to understand what is meant by "... $ is a function which does not evaluate its second argument." Most R functions would take names(resB[[1]]), evaluate it, and then act on the value. But not $. It expects the second argument to be an actual column name, given as an unquoted string. This is an example of "non-standard evaluation". You will also see it operating in the functions library and help, as well as in many functions in what is known, perhaps flippantly, as the hadleyverse, which includes the packages 'ggplot2' and 'dplyr'. The names of data frame columns or the nodes of R lists are character literals; however, they are not really R names, in the sense that their values cannot be accessed with an unquoted sequence of letters typed at the console at the top level of R.
So as already stated you should be using d[[ names(resB[[1]]) ]]. This is also much safer to use in programming, since there are often problems with scoping involved with the use of the $-function in anything other than interactive console use.
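The difference is easy to see on a small example. This is only a sketch: a plain list with made-up values stands in for us_stocks, and I have not tried it on a zoo object.

```r
# Hypothetical stand-in for us_stocks: a plain list with named elements
prices <- list(LNC = c(7.1, 7.3), GAM = c(62.0, 61.5))
col <- "LNC"

prices$LNC      # works: LNC is taken literally as the element name
prices$col      # NULL: $ looks for an element literally called "col"
prices[[col]]   # works: [[ evaluates col and uses its value, "LNC"
```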

issues on the format for a given argument

I am trying to use textcat package for n-gram analysis, which has the following function:
textcat(x, p = TC_char_profiles, method = "CT", ..., options = list())
The function specification indicates that
The argument x can be a character vector of texts, or an R object which can be coerced to this using as.character.
I do not know what "R object which can be coerced to this using as.character" means. In other words, I do not quite understand what the correct input format for x should be according to the above description. Suppose I have 100 documents. How do I convert these documents into the format of x?
You really have two questions here.
(1). What does the "R object which can be coerced to this using as.character" mean?
That means that other classes of R object can be passed in, in place of one that is just character. An example is a factor, where as.character(x) will drop the extra features provided and revert to a simple character vector.
as.character(1:2) ## will give a vector c("1", "2")
This extends for other derived classes, and it's a standard R idiom to provide a method for common functions like as.character that define a coercion from any given class to character.
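A few concrete cases of that coercion, as a sketch. The class "doc" and its constructor new_doc are made up purely to illustrate the idiom of supplying an as.character method.

```r
# A factor stores integer codes plus labels; as.character() recovers the labels
f <- factor(c("en", "de", "en"))
as.integer(f)     # internal codes: 2 1 2
as.character(f)   # "en" "de" "en"

# Other built-in classes have their own coercions
as.character(as.Date("2020-01-31"))   # "2020-01-31"

# Hypothetical class "doc" with its own as.character method
new_doc <- function(title, body) {
  structure(list(title = title, body = body), class = "doc")
}
as.character.doc <- function(x, ...) paste(x$title, x$body, sep = ": ")

as.character(new_doc("Note", "hello world"))   # "Note: hello world"
```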
(2). In what format must my data be to input to textcat?
In short, it must be a character vector or something that can be coerced to one. You are asking about documents, so presumably you have text files. The function readLines will provide a character vector from a text file, a vector as long as the number of lines in the file. Anything more on this question needs a lot more detail from you about what the analysis is supposed to do: does the text need to be broken into lines from a file? Broken into words? Should sets of lines/words from different files be kept as separate sets? And so on.
In really simplistic terms, using the example in readLines, you could do something like this, though anything further needs more information from you:
cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file="ex.data",
    sep="\n")
readLines("ex.data", n=-1)
x <- readLines("ex.data", n=-1)
require(textcat)
textcat(x)
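If what you want is one element of x per document rather than per line, one possible approach is to collapse each file into a single string. This is only a sketch under assumptions: the directory name, file names, and file contents below are made up, and the textcat call is left commented out so the example runs without the package.

```r
# Create two small example documents in a temporary directory (made-up content)
doc_dir <- file.path(tempdir(), "textcat_docs")
dir.create(doc_dir, showWarnings = FALSE)
writeLines(c("This is the first document.", "It has two lines."),
           file.path(doc_dir, "doc1.txt"))
writeLines("Dies ist das zweite Dokument.", file.path(doc_dir, "doc2.txt"))

# Read every .txt file and collapse its lines into one string per document
files <- list.files(doc_dir, pattern = "\\.txt$", full.names = TRUE)
docs  <- vapply(files,
                function(f) paste(readLines(f), collapse = "\n"),
                character(1), USE.NAMES = FALSE)

length(docs)   # one element per document
# textcat(docs)   # with textcat loaded, this would give one result per document
```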
