I am trying to multiply two numbers together. I am aware that R has a limit on integer size, but when I manually multiply the numbers at the console, I get a result.
However, when I multiply variables containing those exact same numbers, I get an "NAs produced by integer overflow" warning.
Why is this? Are the variables somehow not resolving properly before being multiplied? I need to be able to use variables, so is there a way to make this work? Floating-point numbers are not an option, since precision is needed.
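A minimal sketch reproducing the symptom (the numbers here are made up; the point is that literals typed at the console are doubles, while the variables are integer-typed, e.g. after as.integer() or a data import):

# Typed at the console, these literals are doubles, so the product is fine:
100000 * 100000
#> [1] 1e+10

# Integer-typed variables overflow past .Machine$integer.max (2147483647):
a <- 100000L
b <- 100000L
a * b
#> [1] NA
#> Warning message:
#> In a * b : NAs produced by integer overflow

# If exact arithmetic on large whole numbers is required, a 64-bit or
# arbitrary-precision package such as bit64 or gmp is one option;
# doubles stay exact only up to 2^53:
as.numeric(a) * as.numeric(b)
#> [1] 1e+10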
I have a data frame in R that I want to analyse. I want to know how many times a specific number appears in a data frame column. For example, I want to know the frequency of the number 0.9998558 by using
sum(deviation_multiple_regression_3cell_types_all_spots_all_intersection_genes_exclude_50_10dec_rowSums_not_0_for_moran_scaled[,3]== 0.9998558)
However, it seems that the decimal shown is not the actual stored value (it must be 0.9998558xxxxx), since the result I got from the above command is 0 (the correct answer should be 3468). How can I match that number without knowing its exact decimal digits, so that I get the correct answer?
The code below gives the number of occurrences in the column.
x <- 0.9998558
length(which(df$a==x))
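Note, however, that an exact == comparison often fails for floating-point values, which is the asker's actual problem here. A tolerance-based comparison such as dplyr::near() is more robust (the 1e-7 tolerance is an assumption, chosen to match seven decimal places):

library(dplyr)
# near() tests equality within a tolerance rather than bit-for-bit:
sum(near(df$a, 0.9998558, tol = 1e-7))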
If you are looking for numbers starting with 0.9998558, I think you can do it in two different ways: working with the data as numeric or as character.
Let x be your variable:
Data as character
This approach counts exactly the values you are looking for:
sum(substr(as.character(x),1,9)=="0.9998558")
Data as numeric
This will include all values whose difference from the reference value is less than 1e-7; this may include values that do not start exactly with 0.9998558:
sum(abs(x-0.9998558)<1e-7)
You can also "truncate" the numbers in your vector and compare them with the number you want. Here, we write 10^7 because 7 is the number of decimals you want to compare.
sum(trunc(x*10^7)/10^7 == 0.9998558)
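For instance, with a made-up vector x, the truncation picks up every value whose first seven decimals are 0.9998558:

x <- c(0.99985581, 0.99985589, 0.99990000)
sum(trunc(x * 10^7) / 10^7 == 0.9998558)
#> [1] 2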
I am setting up a pipeline to import, format, normalize, and plot a bunch of datasets. The pipeline will rely heavily on tidyverse solutions (dplyr and ggplot2).
During the input/format step I would like to decide if/when to use factors vs. characters in various columns that contain letters. Likewise, I need to decide if I should designate numerical columns as integers (when it's reasonable) or use double.
My gut feeling is that as a default I should just use character and double. Neither speed nor space is an issue, since the resulting datasets are relatively small (~20 x 10,000 max), so I figure this will give me the most flexibility. Are there disadvantages to going down this road?
Performance shouldn't be a concern in most use cases; the criterion is the meaning of the variables.
Factor vs character
Use character if your data is just strings that do not hold specific meaning; use factor if it's a categorical variable with a limited set of values. The main advantages of using factors (see the sketch after this list) are:
assigning a value that is not in the levels produces a warning and NA (so that can save you from typos)
you can give an order to the levels and get an ordered factor
some functions (especially when modelling) require an explicit factor for categorical variables
you make it clear to the reader that these are not random character strings.
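A quick sketch of the first two advantages (the variable names and levels are made up):

# Assigning a value outside the declared levels yields NA plus a warning,
# which catches typos early:
treatment <- factor(c("control", "drug"), levels = c("control", "drug"))
treatment[1] <- "controll"
#> Warning message:
#> invalid factor level, NA generated

# Ordered factors encode a ranking that comparisons respect:
dose <- factor(c("low", "high", "mid"),
               levels = c("low", "mid", "high"), ordered = TRUE)
dose[1] < dose[2]
#> [1] TRUE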
Integer vs double
If you know your column will only ever contain integer values, integer can be a better choice. Indeed, computations on doubles can accumulate numeric error, and in some situations you can end up with 26.0000000001 != 26. In addition, some packages may behave differently depending on the type of input (although I can't think of an example offhand).
For big numbers (more than 2^31 - 1, i.e. 2147483647), integers won't be able to store them, whereas doubles will still behave correctly:
as.integer(2147483647)
#> [1] 2147483647
as.integer(2147483648)
#> [1] NA
#> Warning message:
#> NAs introduced by coercion to integer range
But when the numbers get even bigger (above 2^53, where consecutive integers are no longer exactly representable), doubles will also start losing significant digits:
1234578901234567890 == 1234578901234567891
#> [1] TRUE
Overall, I don't think it makes a big difference in practice; using an integer type can be a way to signal to the reader and to the program that a decimal number appearing in that column means something went wrong.
I am trying to do some calculations in which I divide two vectors. Sometimes I encounter a division by zero, which cannot take place. Instead of attempting this division, I would like to store an empty element at that position in the output.
The question is: how do I do this? Can vectors have empty fields? Can a structure be the solution to my problem or what else should I use?
No: every element of a vector must hold some value. Simply store a NaN, or a sentinel such as INT_MIN for integer values.
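If you are working in R (as the surrounding questions are), NA is the natural placeholder; a minimal sketch:

a <- c(10, 4, 6)
b <- c(2, 0, 3)
# a / b is evaluated element-wise (division by zero yields Inf in R,
# not an error); ifelse() then replaces those positions with NA:
result <- ifelse(b == 0, NA, a / b)
result
#> [1]  5 NA  2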
I'm running into some problems with the R functions as.character() and paste(): they do not give back what they're being fed...
as.character(1415584236544311111)
## [1] "1415584236544311040"
paste(1415584236544311111)
## [1] "1415584236544311040"
What could be the problem, and is there a workaround so I can paste my number as a string?
Update
I found that using the bit64 library allowed me to retain the extra digits I needed with the function as.integer64().
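For reference, a minimal sketch of that workaround. Note that the value must be passed as a string: a bare numeric literal would already have been rounded to a double before bit64 ever sees it.

library(bit64)
# integer64 is a signed 64-bit type, exact up to about 9.2e18:
x <- as.integer64("1415584236544311111")
as.character(x)
#> [1] "1415584236544311111"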
Remember that numbers are stored in a fixed number of bytes, based on the hardware you are running on. Can you show that your very big integer is handled properly by normal arithmetic operations? If not, you're probably trying to store a number too large for the number of bytes your R install uses; the number you see is just what could fit.
You could try storing the number as a double, which is technically less precise but can represent larger numbers in scientific notation.
EDIT
Consider the answers in long/bigint/decimal equivalent datatype in R, which list solutions including arbitrary-precision packages.
I have a SQLite3 table with a column of type DECIMAL(7,2), but whenever I select rows whose values have no non-zero second decimal place (e.g. 3.00 or 3.10), the result always has the trailing zero(s) missing (e.g. 3 or 3.1). Is there any way to apply a formatting function in the SELECT statement so that I get the required 2dp? I have tried ROUND(), but this has no effect. Otherwise I have to keep converting the resulting column values into the required format for display (using Python in my case) every time I do a SELECT statement, which is a real pain.
I don't even mind if the result is string instead of numeric, as long as it has the right number of decimal places.
Any help would be appreciated.
Alan
SQLite internally uses IEEE binary floating point arithmetic, which truly does not lend itself well to maintaining a particular number of decimals. To get that type of decimal handling would require one of:
Fixed point math, or
IEEE decimal floating point (rather uncommon), or
Handling everything as strings.
Formatting the values (converting from floating point to string) after extraction is the simplest way to implement things. You could even hide that inside some sort of wrapper so that the rest of the code doesn't have to deal with the consequences. But if you're going to do arithmetic on the value afterwards, you're better off not formatting it and instead working with the value as returned by the query, because formatting and then reconverting back to binary floating point (which Python uses, just like the vast majority of other modern languages) loses information through the reduced precision.
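For example, SQLite's built-in printf() function can produce the fixed two-decimal string directly in the query (the table and column names here are hypothetical):

-- printf() returns a string with exactly two decimal places:
-- 3 becomes '3.00' and 3.1 becomes '3.10'
SELECT printf('%.2f', price) AS price_2dp FROM items;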