I have an ID variable with 20 digits. Once I read the data into R, it changes to scientific notation, and if I then write the same ID to a CSV file, its value changes.
For example, running the code below should print the value of x as "12345678912345678912", but it prints "12345678912345679872":
Code:
options(scipen=999)
x <- 12345678912345678912
print(x)
Output:
[1] 12345678912345679872
My questions are:
1) Why is this happening?
2) How can I fix this problem?
I know it has to do with how R stores data types, but I still think there should be some way to deal with this problem. I hope my question is clear.
I don't know if this question has been asked before, so point me to a link if it's a duplicate and I will remove this post.
I have gone through this, so I can relate it to my issue, but I am unable to fix it.
Any help would be highly appreciated. Thanks
By default, R does not handle integers larger than 2147483647 (the maximum 32-bit integer).
If you append an L to your number (to tell R it's an integer), you get:
x <- 12345678912345678912L
#Warning message:
#non-integer value 12345678912345678912L qualified with L; using numeric value
This also explains why the last digits change: R stores the number as a double, which carries only about 15-16 significant decimal digits of precision.
I think the gmp package should be able to handle large numbers in general. You should therefore either accept the loss of precision, store the IDs as character strings, or use a data type from the gmp package.
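For instance, a minimal sketch using gmp (note the number must be passed as a string, since a numeric literal would already have lost precision before gmp ever sees it):
library(gmp)
x <- as.bigz("12345678912345678912")  # arbitrary-precision integer
print(x)
# Big Integer ('bigz') :
# [1] 12345678912345678912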
To circumvent the problem with number storage/representation, you can import your ID variable directly as character with the colClasses option. For example, if using read.csv to import a data.frame with the ID column and another numeric column:
mydata <- read.csv("file.csv", colClasses = c("character", "numeric"), ...)
Using readr you can do
mydata <- readr::read_csv("file.csv", col_types = list(ID=col_character()))
where "ID" is the name of your ID column
I'm trying to rename a specific column in my R script using the colnames function, but with no success so far.
I'm kind of new to programming, so it may be something simple to solve.
Basically, I'm trying to rename a column called Reviewer Overall Notes to Nota Final in a data frame called notas, with this code:
colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
and it returns to me:
> colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
Error: object 'Nota Final' not found
I also found in this post some code that goes:
colnames(notas) [13] <- `Nota Final`
But it also returns the same message.
What am I doing wrong?
P.S.: Sorry for any misspelling; English is not my primary language.
You probably want
colnames(notas)[colnames(notas) == "Reviewer Overall Notes"] <- "Nota Final"
(@Whatif's answer shows how you can do this with the numeric index, but it's probably better practice to do it this way; working with strings rather than column indices makes your code both easier to read [you can see what you're renaming] and more robust [in case the order of columns changes in the future].)
Alternatively,
notas <- notas %>% dplyr::rename(`Nota Final` = `Reviewer Overall Notes`)
Here you do use back-ticks, because tidyverse (of which dplyr is a part) prefers its arguments to be passed as symbols rather than strings.
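A minimal reproducible sketch of the first approach (the notas data frame here is invented for illustration):
notas <- data.frame(`Reviewer Overall Notes` = c(8, 9), check.names = FALSE)
colnames(notas)[colnames(notas) == "Reviewer Overall Notes"] <- "Nota Final"
names(notas)
# [1] "Nota Final"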
Why use backticks? Use normal quotation marks instead:
colnames(notas)[13] <- 'Nota Final'
This seems to matter:
df <- data.frame(a = 1:4)
colnames(df)[1] <- `b`
Error: object 'b' not found
You should not use single or double quotes in naming:
I have learned that we should not use spaces in names. A name with spaces does work (it is called a non-syntactic name), and according to Hadley Wickham's description in the Advanced R book, the quote syntax exists for historical reasons:
"You can also create non-syntactic bindings using single or double quotes (e.g. "_abc" <- 1) instead of backticks, but you shouldn’t, because you’ll have to use a different syntax to retrieve the values. The ability to use strings on the left hand side of the assignment arrow is an historical artefact, used before R supported backticks."
To get an overview of what syntactic names are, use ?make.names:
make.names("Nota Final")
[1] "Nota.Final"
I have been trying to read a file which has a date field and a numeric field. I have the data in an Excel sheet, and it looks something like this:
Date X
1/25/2008 0.0023456
12/23/2008 0.001987
When I read this into R using the readxl::read_xlsx function, the data in R looks like this:
Date X
1/25/2008 0.0023456000000000
12/23/2009 0.0019870000000000
I have tried limiting the digits using functions like round, format (nsmall = 7), etc., but nothing seems to work. What am I doing wrong? I also tried saving the data as a csv and a txt and reading it with read.csv and read.delim, but I face the same issue again. Any help would be really appreciated!
As noted in the comments to the OP and the other answer, this problem is due to the way floating point math is handled on the processor being used to run R, and its interaction with the digits option.
To illustrate, we'll create an Excel spreadsheet with the data from the OP, then use a short R script to demonstrate what happens as we adjust the digits option.
> # first, display the number of significant digits set in R
> getOption("digits")
[1] 7
>
> # Next, read data file from Excel
> library(xlsx)
>
> theData <- read.xlsx("./data/smallNumbers.xlsx",1,header=TRUE)
>
> head(theData)
Date X
1 2008-01-25 0.0023456
2 2008-12-23 0.0019870
>
> # change digits to larger number to replicate SO question
> options(digits=17)
> getOption("digits")
[1] 17
> head(theData)
Date X
1 2008-01-25 0.0023456000000000002
2 2008-12-23 0.0019870000000000001
>
However, the behavior of printing significant digits varies by processor / operating system, as setting options(digits=16) results in the following on a machine running an Intel i7-6500U processor with Microsoft Windows 10:
> # what happens when we set digits = 16?
> options(digits=16)
> getOption("digits")
[1] 16
> head(theData)
Date X
1 2008-01-25 0.0023456
2 2008-12-23 0.0019870
>
library(formattable)
x <- formattable(x, digits = 7, format = "f")
Or you may want to add this to get the default formatting from R:
options(defaultPackages = "")
then restart your R session.
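For example, a minimal sketch of what formattable does with the OP's values (output as I would expect from formatC with these settings):
library(formattable)
y <- formattable(c(0.0023456, 0.001987), digits = 7, format = "f")
y
# [1] 0.0023456 0.0019870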
Perhaps the problem isn't your source file, as you say this happens with .csv and .txt as well.
Try checking the current value of your display digits option by running options()$digits
If the result is e.g. 14, then that is likely the problem.
In that case, try running the R command options(digits=8), which will set the display digits to 8 for the session.
Then simply reprint your data frame to see that the change has taken effect in how the decimals are displayed by default on the screen.
Consult ?options for more info about digits display setting and other session options.
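A minimal sketch of the display-only effect (the exact trailing digits are platform dependent):
options(digits = 17)
print(0.0023456)   # e.g. 0.0023456000000000002
options(digits = 7)
print(0.0023456)   # 0.0023456 -- the stored value never changed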
Edit to improve original answer and to clarify for future readers:
Changing options(digits=x) either up or down does not change the value that is stored or read into internal memory for floating point variables. The digits session option merely changes how the floating point values print, i.e. display on the screen, for common print functions per the ?options documentation:
digits: controls the number of significant digits to print when printing numeric values.
What the OP showed as the problem (R displaying more decimals after the last digit of a decimal number than the OP expected to see) was not caused by the source file having been read from Excel; given that the OP had the same problem with CSV and TXT, the import process didn't cause it.
If you are seeing more decimals than you want by default in your printed/displayed output (e.g. for data frames and numeric variables), check options()$digits and understand that this option is simply the default number of digits used by R's common display and printing methods. However, it does not affect the floating point storage of any of your data or variables.
Regarding floating point numbers, though, another answer here shows how setting options(digits=n) higher than the default can help demonstrate some precision/display idiosyncrasies that are related to floating point precision. That is a separate problem from what the OP displayed in his example, but it's well worth understanding.
For a much more detailed and topic-specific discussion of floating point precision than would be appropriate to rehash here, it's well worth reading this definitive SO question and answer: Why are these numbers not equal?
That question, answer, and discussion cover issues specifically around floating point precision and contain a long, well-presented list of references that you will find helpful if you need more information on the subject.
I'm using the sva package in R; dat is a csv file containing genes in rows and samples in columns. The file SIF.csv contains only 3 columns: array, sample, and batch.
http://www.filedropper.com/samplesmall
http://www.filedropper.com/sifsmall
I followed the sva manual, though I don't understand what modcombat does here. I understand it turns the data table into a matrix, but why do we write ~1 in the brackets? What does it mean?
Also, it generates an error. I think it means that the number of rows isn't matching; is there a way to fix that?
library(sva)
dat = read.csv("Combat_matrix_input.csv");
sif = read.csv("sif.csv");
modcombat = model.matrix(~1, data=dat)
newdata = ComBat(dat=dat, batch=sif$Batch, par.prior = TRUE, mod = modcombat)
Found 6 batches
Error in cbind(batchmod, mod) :
number of rows of matrices must match (see arg 2)
Firstly, please kindly post your csv files by uploading them to some cloud drive. Note that the "~" here is R's formula notation (~1 specifies an intercept-only model), not the plotmath "approximately equal" operator described at: http://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/plotmath.html
Also, for the model.matrix parameters, refer to the manual, which will help you understand the function; see: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/model.matrix.html
Edit: After looking at both of your files, I can see that the number of columns is different in each file, and you might have understood that by now. The following document details the steps; see: http://www.bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf. I hope it helps.
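For what it's worth, the usual shape of a ComBat call looks like the sketch below (untested, with file layouts assumed from the question): ComBat wants dat to be a numeric genes-by-samples matrix, and mod must have one row per sample, so the model matrix should be built from the sample sheet (sif), not from dat. That row mismatch is what the error message is complaining about.
library(sva)
# genes in rows, samples in columns; the first csv column is assumed to hold gene names
dat <- as.matrix(read.csv("Combat_matrix_input.csv", row.names = 1))
sif <- read.csv("sif.csv")  # one row per sample
# ~1 = intercept-only model: no covariates to preserve, just remove batch effects
modcombat <- model.matrix(~1, data = sif)
newdata <- ComBat(dat = dat, batch = sif$Batch, par.prior = TRUE, mod = modcombat)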
I am a new R user and having some difficulty when trying to rename certain records in a column.
My data have columns named classcode and fish_tl, among others. Classcode is a character value, fish_tl is numeric.
When classcode='OCAL' and fish_tl<20, I need to rename that value of classcode so that it is now "OCALYOY". I don't want to change any of the other records in classcode.
I'm running the following code:
data$classcode <- ifelse(data$classcode == 'OCAL' & data$fish_tl < 20,
                         'OCALYOY', data$classcode)
My problem seems to be with the "else" part: the code runs fine and returns 'OCALYOY' as expected, but the other values of classcode have now been converted to numbers (although when I check the mode of that field, it still returns "character").
What am I doing wrong?
Thanks very much!
You can make the else part as.character(data$classcode). ifelse has some odd semantics with regard to the classes of its arguments, and it is turning your factor into its underlying numeric representation. as.character will keep it as a character value.
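A minimal sketch of the surprise with toy data (assuming classcode was read in as a factor, which was the default in older versions of R):
classcode <- factor(c("OCAL", "BFRE"))
ifelse(classcode == "OCAL", "OCALYOY", classcode)
# [1] "OCALYOY" "1"      <- the factor's integer code leaks through
ifelse(classcode == "OCAL", "OCALYOY", as.character(classcode))
# [1] "OCALYOY" "BFRE"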
You may be getting tripped up in a factor vs character issue, though you point out that R thinks it's character. Regardless, wrapping as.character() around your code seems to fix the problem for me:
> ifelse(data$classcode=='OCAL'& data$fish_tl<20,
+ 'OCALYOY',as.character(data$classcode))
#-----
[1] "BFRE" "BFRE" "BFRE" "HARG" "OCALYOY" "OYT" "OYT" "PFUR"
[9] "SPAU" "BFRE" "OCALYOY" "OCAL"
If this isn't it, can you make your question reproducible by adding the output of dput() to your question instead of the text representation?
Sorry for what is possibly a complete noob question, but I have just started programming with R today and I am stuck already.
I am reading some data from a file which is in the following format:
3.482373 8.0093238198371388 47.393873
0.32 20.3131 31.313
What I want to do is split each line then deal with each of the individual numbers.
I have imported the stringr package and am using
x = str_split(line, " ")
This produces a list which I would like to index but don't know how.
I have learnt that x[[1:2]] gets the second element, but that is about it. Ideally I would like something like
x1 = x[1]
x2 = x[2]
x3 = x[3]
But I can't find any way of doing this.
Thanks in advance
By using unlist you will get a vector instead of a list of vectors, and you will then be able to index it directly:
R> unlist(str_split("foo bar baz", " "))
[1] "foo" "bar" "baz"
But maybe you should read your file directly with read.table or one of its variants?
And if you are beginning with R, you really should read one of the introductions available if you want to understand subsetting, indexing, etc.
You can wrap your call to str_split with unlist to get the behavior you're looking for.
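A minimal sketch, reusing the first line of the sample data from the question:
library(stringr)
line <- "3.482373 8.0093238198371388 47.393873"
x <- as.numeric(unlist(str_split(line, " ")))  # flatten, then convert to numbers
x1 <- x[1]; x2 <- x[2]; x3 <- x[3]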
The usual way to get this in would be to import it into a data frame (a special sort of list). If the file name is "fil.dat" and it is in "C:/dir/":
dfrm <- read.table("C:/dir/fil.dat") # resist the temptation to use backslashes
dfrm[2,2] # would give you the second item on the second row.
By default the field separator in R is "white-space", and that seems to be what you have, so you do not need to supply a sep= argument; the read.table function will attempt to import the columns as numeric. To be on the safe side, you might consider forcing that with colClasses=rep("numeric", 3), because if it encounters a strange item (such as is often produced by Excel dumps), you will get a factor variable and will probably not understand how to recover gracefully.
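A minimal sketch of that defensive read (the path is just the one assumed in the earlier example; with colClasses forced, read.table stops with an error if a non-numeric item turns up, instead of silently creating a factor):
dfrm <- read.table("C:/dir/fil.dat", colClasses = rep("numeric", 3))
dfrm[2, 2]  # second item on the second row, guaranteed numeric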