I have a dataframe LoopVariable and the following couple of lines of code:
print(unique(LoopVariable[,"Job..R"]))
[1] "14047/2" "18331/3"
My output is two character strings, which is fine. My question now is: how can I count the values in my output for use in further calculations? In other words, I have two character values and I need the number of them as an integer. In my example here the integer value would be 2.
Use the length() function for this. You can find more about the function by typing ?length into your console.
This is likely what you should expect:
length(unique(LoopVariable[,"Job..R"]))
[1] 2
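For anyone who wants to try this without the original data, here is a minimal sketch. The data frame below is made up for illustration; only the column name "Job..R" and the two values come from the question.

```r
# Hypothetical stand-in for the asker's data frame
LoopVariable <- data.frame(
  Job..R = c("14047/2", "18331/3", "14047/2"),
  stringsAsFactors = FALSE
)

# length() returns an integer, so the result can be used directly in arithmetic
n_jobs <- length(unique(LoopVariable[, "Job..R"]))
n_jobs
# [1] 2
n_jobs * 10
# [1] 20
```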
I have two datasets that I want to link (inner_join) on a common key, which is a string. The problem is that in one of the two datasets the key is incomplete, but the incomplete key is contained in the other one, as in the following example:
key for 1st dataset: PV955--075P412171042--
and for the 2nd: PV955--???P412171042--
The ??? represents digits that are missing, so my question is: can we do some kind of string comparison/inclusion check to see whether the characters of my 2nd key are included in my 1st key, and do the join on that if so?
I don't know if the issue is clear. Thanks for the answers.
It's hard to answer without seeing your data, however you can try this:
library(stringr)
> str_detect("075P412171042","P412171042")
[1] TRUE
In base R with regular expressions :
key1 <- "PV955--075P412171042--"
key2 <- "PV955--???P412171042--"
key2re <- gsub("--...", "--...", key2)
grepl(key2re, key1)
## [1] TRUE
Replace the 3 unknown characters after "--" with dots, which match any single character in a regular expression.
Then grepl checks whether the two strings match.
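To carry this over to an actual join, one possible sketch in base R is below. The two data frames, their column names and the second full key are made up for illustration, and it assumes that "?" always stands for exactly one digit and that the keys contain no other regex metacharacters.

```r
# Hypothetical example data
full <- data.frame(
  key = c("PV955--075P412171042--", "PV955--123P999999999--"),
  value = c(10, 20),
  stringsAsFactors = FALSE
)
partial <- data.frame(
  key = "PV955--???P412171042--",
  label = "a",
  stringsAsFactors = FALSE
)

# Turn every "?" into a one-digit placeholder and anchor the pattern
partial$pattern <- paste0("^", gsub("?", "[0-9]", partial$key, fixed = TRUE), "$")

# For each incomplete key, look up the first complete key it matches
partial$full_key <- vapply(partial$pattern, function(p) {
  hit <- grep(p, full$key, value = TRUE)
  if (length(hit) > 0) hit[1] else NA_character_
}, character(1), USE.NAMES = FALSE)

# Inner join on the recovered complete key
joined <- merge(partial, full, by.x = "full_key", by.y = "key")
```

If several complete keys can match the same incomplete key, this keeps only the first one, so check for ambiguous matches before relying on it.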
I am analyzing an alignment of amino acid sequences using R and need a reproducible way to figure out where the start is for each sequence. My alignment can be read in as a data frame. Here is a sample of 3.
alignment <- data.frame("Strains" = c("Strain.1", "Strain.2", "Strain.3"),
"Sequence" = c("MASLIYRQLLTNSYTVNLSDEIQNIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTAEGVVIQGTNNTDRWLATILIEPNVQATNRTYNLFGQQETLLVENTSQTQWKFVDVSKTTSTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNAA-TGY-YSTTSYDTVNMTSSCDFYIIPRSQEGKCTEYINYGLPPIQNTRNVVPVALSAREIVHTRAQVNEDIVVSKTSLWKEMQYNRDITIRFKFDRTIIKAGGLGYKWSEISFKPITYQYTYTRDGEQITAHTTCSVNGVNNFSYNGGSL---------------------",
"MASLIYRQLLTNSYTVNLSDEIQNIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTAEGVVIQGTNNTDRWLATILIEPNVQATNRTYNLFGQQETLLVENTSQTQWKFVDVSKTTSTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNAA-TGY-YSTTSYDTVNMTSSCDFYIIPRSQEGKCTEYINYGLPPIQNTRNVVPVALSAREIVHTRAQVNEDIVVSKTSLWKEMQYNRDITIRFKFDRTIIKAGGLGYKWSEISFKPITYQYTYTRDGEQITAHTTCSVNGVNNFSYNGGSLPTDFAIS--------------",
"-----------------------NIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTVEGVVIQGTNNVDRWLATILIEPNVQATNRTYNLFGQQEILLIENTSQTQWKFVDVSKTTPTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNVT-TGY-YSTTNYDTVNMT-----------------------------------------------------"))
Each of the dashes represents a space. What I want to do is read through my data frame and count how many spaces are at the beginning of each sequence. So far I've tried using the str_count function. For example:
alignment$shift <- str_count(alignment$Sequence, "-")
but this fails me when I have gaps downstream in my sequence. Really I'm only interested in the gaps that occur at the beginning of the sequences.
I stumbled across a regex approach in a post that almost perfectly matches my problem (How to count the number of hyphens at the beginning of a string in javascript?), but it is in JavaScript and I'm not sure how to translate it to R.
My questions are:
1) Is it possible to have str_count stop looking for "-" characters once it reaches a non-"-" character?
2) Is there a way to use regex or a similar function in R that outputs the length of a character match at the beginning of a string?
You could do this...
alignment$Sequence <- as.character(alignment$Sequence) #in case they are factors (as above)
alignment$shift <- nchar(alignment$Sequence) - nchar(gsub("^-+", "", alignment$Sequence))
alignment$shift
[1] 0 0 23
It just counts the number of characters removed by telling gsub to delete a run of one or more dashes (-+) anchored at the start of the string (the ^). You could use str_replace instead of gsub.
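On question 2, base R can also report the length of a match at the start of a string directly. This is a small sketch on made-up short sequences rather than the real alignment; it relies on regexpr() storing the match length in the "match.length" attribute.

```r
# Short made-up sequences: no leading gap, five leading dashes, two leading dashes
seqs <- c("MASLIY-TGY", "-----NIGSAK--", "--MAS")

# "^-*" matches zero or more dashes at the start, so every string matches
# and strings without a leading gap get a match length of 0
lead <- attr(regexpr("^-*", seqs), "match.length")
lead
# [1] 0 5 2
```

Gaps further down the sequence are ignored because the pattern is anchored with ^.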
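Maybe this might help? It'll return the start and end positions of the run of dashes plus the first letter after it, but only if the run begins at the start of the string. The number of leading dashes is therefore end - 1, and sequences without leading dashes return no match.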
library(stringr)
str_locate_all(string = alignment$Sequence, pattern = "^-{1,}[A-Z]")
[[1]]
start end
[[2]]
start end
[[3]]
start end
[1,] 1 24
As part of my dataset, one of the columns is a series of 24-digit numbers.
Example:
bigonumber <- 429382748394831049284934
When I import it using either data.table::fread or read.csv, it shows up as numeric in exponential format (EG: 4.293827e+23).
options(digits=...) won't work since the number is longer than 22 digits.
When I do
as.character(bigonumber)
what I get is "4.29382748394831e+23"
Is there a way to get bigonumber converted to a character string and show all of the digits as characters? I don't need to do any math on it, but I do need to search against it and do dplyr joins on it.
I need to do this after import, since the column number varies from month to month.
(Yes, in the perfect world, my upstream data provider would use a hash instead of a long number and a static number of columns that stay the same every month, but I don't get to dictate that to them.)
You can specify colClasses on your fread or read.csv statement.
bignums
429382748394831049284934
429382748394831049284935
429382748394831049284936
429382748394831049284937
429382748394831049284938
429382748394831049284939
bignums <- read.csv("~/Desktop/bignums.txt", sep="", colClasses = 'character')
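Since the column position changes from month to month, it may help that colClasses can also be given as a named vector, so the column is picked by name rather than by position. A minimal sketch, using inline text instead of a file; the column names id and bignum are made up:

```r
# Inline stand-in for the monthly file; column names are hypothetical
csv_text <- "id,bignum\n1,429382748394831049284934\n2,429382748394831049284935"

# Only the named column is forced to character; the others are guessed as usual
dat <- read.csv(text = csv_text, colClasses = c(bignum = "character"))

dat$bignum
# [1] "429382748394831049284934" "429382748394831049284935"
```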
You can suppress the scientific notation with
options(scipen=999)
If you define the number then
bigonumber <- 429382748394831049284934
you can convert it into a string:
big.o.string <- as.character(bigonumber)
Unfortunately, this does not work because R converts the number to a double, thereby losing precision:
#[1] "429382748394831019507712"
The last digits are not preserved, as pointed out by @SabDeM. Even setting
options(digits=22)
doesn't help, and in any case 22 is the largest number that is allowed; and in your case there are 24 digits. So it seems that you will have to read the data directly as character or factor. Great answers have been posted showing how this can be achieved.
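The precision loss is easy to see on a much smaller number. This sketch only illustrates how doubles behave; it is not specific to the asker's data.

```r
# A double has a 53-bit significand, so above 2^53 not every integer is representable
2^53 == 2^53 + 1
# [1] TRUE

# The extra 1 is lost before any printing option comes into play
sprintf("%.0f", 2^53 + 1)
# [1] "9007199254740992"

# Kept as text, all 24 digits of the original value survive
big_chr <- "429382748394831049284934"
nchar(big_chr)
# [1] 24
```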
As a side note, there is a package called gmp that allows using arbitrarily large integer numbers. However, there is a catch: they have to be read as characters (again, in order to prevent R's internal conversion into double).
library(gmp)
bigonumber <- as.bigz("429382748394831049284934")
> bigonumber
Big Integer ('bigz') :
[1] 429382748394831049284934
> class(bigonumber)
[1] "bigz"
The advantage is that you can indeed treat these entries as numbers and perform calculations while preserving all the digits.
> bigonumber * 2
#Big Integer ('bigz') :
#[1] 858765496789662098569868
This package and my answer here may not solve your problem, because reading the numbers directly as characters is an easier way to achieve your goal, but I thought I would post it anyway as information for users who may need to use large integers with more than 22 digits.
Use digest::digest on bigonumber to generate an md5 hash of the number yourself?
bigonumber <- 429382748394831049284934
hash_big <- digest::digest(bigonumber)
hash_big
# "e47e7d8a9e1b7d74af6a492bf4f27193"
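I saw this before I posted my answer, but don't see it here anymore.
Set options(scipen) to a large value so that the number is not printed in scientific notation (the trailing digits still differ from the input, because the value is stored as a double):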
options(scipen = 999)
bigonumber <- 429382748394831049284934
bigonumber
# [1] 429382748394831019507712
as.character(bigonumber)
# [1] "429382748394831019507712"
Use "scan" to read the file - the "what" parameter lets you define the input type of each column.
If you want numbers as numbers you can't print all values. The digits option allows a maximum of 22 digits; its range is from 1 to 22. It is used by the print.default method. You can set it with:
options( digits = 22 )
Even with this option, the numbers will change. I don't know why that happens; most likely the object you are about to print (the number) is longer than the allowed number of digits, so R does some weird stuff. I'll investigate it.
I have a data set in which I want to pad zeroes in front of a set of dates that don't have six characters. For example, I have a date that reads 91003 (October 3rd, 2009) and I want it to read 091003, as well as any other date that is missing a zero in front. When I use the sprintf function, the code is:
data1$entrydate <- sprintf("%06d", data1$entrydate)
But what it spits out is something like 000127, or some other random number for all the other dates in the problem. I don't understand what's going on, and I would appreciate some help on the issue. Thanks.
PS. I am sometimes also getting an error message that sprintf is only for character values; I don't know if there is any code for numerical values.
I guess you got different results than expected because the column class was factor. You can convert the column to numeric either by as.numeric(as.character(datacolumn)) or as.numeric(levels(datacolumn))[datacolumn]. According to ?factor
To transform a factor ‘f’ to approximately its
original numeric values, ‘as.numeric(levels(f))[f]’ is recommended
and slightly more efficient than ‘as.numeric(as.character(f))’.
So, you can use
levels(data1$entrydate) <- sprintf('%06d', as.numeric(levels(data1$entrydate)))
Example
Here is an example that shows the problem
v1 <- factor(c(91003, 91104,90103))
sprintf('%06d', v1)
#[1] "000002" "000003" "000001"
Or, it is equivalent to
sprintf('%06d', as.numeric(v1)) #the formatted numbers are
# the numeric index of factor levels.
#[1] "000002" "000003" "000001"
When you convert the levels back to numeric, it works as expected
sprintf('%06d', as.numeric(levels(v1)))
#[1] "090103" "091003" "091104"
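If you would rather end up with a plain character column in the original row order, a sketch along these lines should work. The data frame is made up from the example values above.

```r
# Hypothetical data frame with the dates stored as a factor
data1 <- data.frame(entrydate = factor(c(91003, 91104, 90103)))

# Recover the original numbers from the factor, as recommended in ?factor
num <- as.numeric(levels(data1$entrydate))[data1$entrydate]

# Pad to six digits; the column is now character, not factor
data1$entrydate <- sprintf("%06d", as.integer(num))
data1$entrydate
# [1] "091003" "091104" "090103"
```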
I need help conditionally adding leading or trailing zeros.
I have a dataframe with one column containing icd9 diagnoses. As a vector, the column looks like:
"33.27" "38.45" "9.25" "4.15" "38.45" "39.9" "84.1" "41.5" "50.3"
I need all the values to have a length of 5, including the period in the middle (not counting the quotes). If a value has one digit before the period, it needs a leading zero. If a value has one digit after the period, it needs a zero at the end. So the result should look like this:
"33.27" "38.45" "09.25" "04.15" "38.45" "39.90" "84.10" "41.50" "50.30"
Here is the vector for R:
icd9 <- c("33.27", "38.45", "9.25", "4.15", "38.45", "39.9", "84.1", "41.5", "50.3" )
This does it in one line
formatC(as.numeric(icd9),width=5,format='f',digits=2,flag='0')
ICD-9 codes have some formatting quirks which can lead to misinterpretation with simple string processing. The icd package on CRAN takes care of all the corner cases when doing ICD processing, and has been battle-tested over about six years of use by many R users.
Using a function called change, which accepts the maximum number of characters as an argument, might help. Note that it only pads on the left, so a value such as "39.9" becomes "039.9" rather than "39.90".
change<-function(x, n=max(nchar(x))) gsub(" ", "0", formatC(x, width=n))
icd92<-gsub(" ","",paste(change(icd9,5)))
You can also use sprintf after converting the vector into numeric.
sprintf("%05.2f", as.numeric(icd9))
[1] "33.27" "38.45" "09.25" "04.15" "38.45" "39.90" "84.10" "41.50" "50.30"
Notes
See the examples in ?sprintf to work out the proper format.
There is some risk of introducing errors due to numerical precision here, though it works well in the example.
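The precision risk can be avoided by padding the codes as text and never converting them to numbers. This is only a sketch: it assumes every code contains exactly one period, with one or two digits on each side.

```r
icd9 <- c("33.27", "38.45", "9.25", "4.15", "38.45", "39.9", "84.1", "41.5", "50.3")

# Split each code at the literal period into its two parts
parts <- strsplit(icd9, ".", fixed = TRUE)
left  <- vapply(parts, `[`, character(1), 1)
right <- vapply(parts, `[`, character(1), 2)

# Pad with zeros as text: in front of the first part, behind the second part
left  <- paste0(strrep("0", pmax(0, 2 - nchar(left))), left)
right <- paste0(right, strrep("0", pmax(0, 2 - nchar(right))))

padded <- paste(left, right, sep = ".")
padded
# [1] "33.27" "38.45" "09.25" "04.15" "38.45" "39.90" "84.10" "41.50" "50.30"
```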