Is there a way to use sweep(dataframe) with integer division, or something that is equivalent to such?
This is a minimal example of sweep not using integer division - which I want to replace with integer division:
sweep(x = mtcars, MARGIN = 2, STATS = unlist(mtcars[1,]), FUN = '/')
Some limitations I need to stick to:
I need to preserve the column names of the dataframe, as done in the example above.
I cannot just use round, floor, ceil, or similar - it needs to be an equivalent of integer division (floor would have different effects on negative numbers than integer division).
If possible, I'd prefer to not store any information in additional variables during this process.
I'm dealing with a relatively large dataframe, so it could turn out that very slow solutions might not be an option here.
Does anyone know a way of achieving this in R?
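Pass '%/%' as the function; that is R's integer-division operator (see the arithmetic operators documentation, ?Arithmetic). Note that in R, x %/% y equals floor(x / y), so negative quotients are rounded toward negative infinity rather than toward zero.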
sweep(x = mtcars, MARGIN = 2, STATS = unlist(mtcars[1,]), FUN = '%/%')
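Since the question worries about negative numbers, here is a rough Python sketch (not R) of the two rounding conventions applied column by column. R's %/% floors the quotient, as does Python's //; truncation toward zero is the C-style alternative. The helper names and the sample data are made up for illustration.

```python
import math

def floor_div(x, y):
    """Quotient rounded toward negative infinity (the convention of R's %/% and Python's //)."""
    return math.floor(x / y)

def trunc_div(x, y):
    """Quotient rounded toward zero (C-style integer division)."""
    return math.trunc(x / y)

def sweep_columns(rows, stats, fun):
    """Apply fun(value, stat) column by column, loosely like sweep(..., MARGIN = 2)."""
    return [[fun(v, s) for v, s in zip(row, stats)] for row in rows]

rows = [[7.0, -7.0], [9.0, -9.0]]   # hypothetical data with two columns
stats = [2.0, 2.0]                  # one divisor per column

floored = sweep_columns(rows, stats, floor_div)
truncated = sweep_columns(rows, stats, trunc_div)
print(floored)    # [[3, -4], [4, -5]]
print(truncated)  # [[3, -3], [4, -4]]
```

The two results only differ in the negative column, which is the case to check before choosing between %/% and a truncating alternative.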
I am encountering an R problem. I was simply trying to sum all the values in a column of a big data set. The code looks like this:
sum(Animal$Pigs, na.rm = TRUE)
However R tells me:
In sum(Animal$Pigs, na.rm = TRUE) :
integer overflow - use sum(as.numeric(.))
Does it mean that the resulting integer is too big? Are there any packages that might help? If not, is there another language I could turn to for large data sets (I know a bit of Python)?
The manual of sum says:
Integer overflow should no longer happen since R version 3.5.0.
To calculate with large integer numbers you can use the gmp library.
sum(10L^100L, 10L^50L, 1L)
#[1] 1e+100
library(gmp)
sum.bigz(as.bigz("10")^100L, as.bigz("10")^50L, 1)
#[1] 10000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000001
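Because the question asks about other languages, here is a small Python sketch of the same sum. Python integers are arbitrary precision, so no extra package is needed for an exact result; the double-precision line mirrors what the plain R sum above reports. The variable names are illustrative only.

```python
INT32_MAX = 2**31 - 1  # largest value an R integer vector can hold

values = [10**100, 10**50, 1]
total = sum(values)    # Python ints grow as needed, so this sum is exact
print(total)
print(total > INT32_MAX)  # True

# The same sum in double precision silently drops the two small terms.
as_float = float(10**100) + float(10**50) + 1.0
print(as_float == float(10**100))  # True
```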
I have a vector of strings and I would like to hash each element individually to integers modulo n.
This SO post suggests an approach using digest and strtoi, but when I try it I get NA as the returned value:
library(digest)
strtoi(digest("cc", algo = "xxhash32"), 16L)
So the above approach will not work, as it cannot even produce an integer, let alone one reduced modulo n.
What's the best way to hash a large vector of strings to integers modulo n for some n? Efficient solutions are more than welcome as the vector is large.
R uses 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9. strtoi returns NA because the number is too big.
The mpfr-function from the Rmpfr package should work for you:
mpfr(x = digest("cc", algo = "xxhash32"), base = 16)
[1] 4192999065
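To make the range problem and the "hash modulo n" idea concrete, here is a hedged Python sketch. It swaps in CRC32 from the standard library for xxhash32, so the hash values will not match digest's; hash_mod is a made-up helper name.

```python
import zlib

INT32_MAX = 2**31 - 1   # upper limit of a signed 32-bit integer

value = 4192999065      # the hash value quoted above
print(value > INT32_MAX)  # True: it does not fit in a signed 32-bit integer

def hash_mod(strings, n):
    """Map each string to an integer in 0..n-1 using CRC32 (not xxhash32)."""
    return [zlib.crc32(s.encode("utf-8")) % n for s in strings]

buckets = hash_mod(["string1", "string2", "string1"], 17)
print(buckets)  # equal strings always land in the same bucket
```

Taking the modulus on an unsigned 32-bit hash keeps every result in 0..n-1, so the final values fit comfortably in an ordinary integer vector.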
I made an Rcpp implementation using code from this SO post, and the resulting code is quite fast even for large-ish string vectors.
To use it:
if(!require(disk.frame)) devtools::install_github("xiaodaigh/disk.frame")
modn = 17
disk.frame::hashstr2i(c("string1","string2"), modn)
I want to retrieve all the elements along the last dimension of an N-dimensional array A. That is, if idx is an (N-1) dimensional tuple, I want A[idx...,:]. I've figured out how to use CartesianRange for this, and it works as shown below
A = rand(2,3,4)
for idx in CartesianRange(size(A)[1:end-1])
    i = zeros(Int, length(idx))
    [i[bdx] = idx[bdx] for bdx in 1:length(idx)]
    @show(A[i...,:])
end
However, there must be an easier way to create the index i shown above. Splatting idx does not work. What am I doing wrong?
You can just index directly with the CartesianIndex that gets generated from the CartesianRange!
julia> for idx in CartesianRange(size(A)[1:end-1])
           @show(A[idx,:])
       end
A[idx,:] = [0.0334735,0.216738,0.941401,0.973918]
A[idx,:] = [0.842384,0.236736,0.103348,0.729471]
A[idx,:] = [0.056548,0.283617,0.504253,0.718918]
A[idx,:] = [0.551649,0.55043,0.126092,0.259216]
A[idx,:] = [0.65623,0.738998,0.781989,0.160111]
A[idx,:] = [0.177955,0.971617,0.942002,0.210386]
The other recommendation I'd have here is to use the un-exported Base.front function to extract the leading dimensions from size(A) instead of indexing into it. Working with tuples in a type-stable way like this can be a little tricky, but they're really fast once you get the hang of it.
It's also worth noting that Julia's arrays are column-major, so accessing the trailing dimension like this is going to be much slower than grabbing the columns.
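The column-major remark can be illustrated with a small Python sketch of how a multi-index maps to a linear memory offset. It uses zero-based indices and made-up helper names, and it only models the layout, not Julia's actual implementation.

```python
def column_major_strides(shape):
    """Stride of each dimension when the first index varies fastest."""
    strides = []
    step = 1
    for extent in shape:
        strides.append(step)
        step *= extent
    return strides

def linear_offset(index, strides):
    """Zero-based memory offset of a zero-based multi-index."""
    return sum(i * s for i, s in zip(index, strides))

shape = (2, 3, 4)                      # same shape as A = rand(2,3,4)
strides = column_major_strides(shape)
print(strides)                         # [1, 2, 6]

# Walking the first dimension touches neighbouring memory cells ...
first_dim = [linear_offset((i, 0, 0), strides) for i in range(shape[0])]
# ... while walking the last dimension jumps 6 elements at a time.
last_dim = [linear_offset((0, 0, k), strides) for k in range(shape[2])]
print(first_dim)  # [0, 1]
print(last_dim)   # [0, 6, 12, 18]
```

The larger the leading dimensions, the bigger the jump between consecutive elements of a trailing-dimension slice, which is why such slices tend to be less cache friendly.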
I have two p-times-n arrays x and missx, where x contains arbitrary numbers and missx is an array containing zeros and ones. I need to perform recursive calculations on those points where missx is zero. The obvious solution would be like this:
do i = 1, n
do j = 1, p
if(missx(j,i)==0) then
z(j,i) = ... something depending on the previous computations and x(j,i)
end if
end do
end do
The problem with this approach is that missx is 0 almost everywhere, so there are a lot of if statements that are nearly always true.
In R, I would do it like this:
for(i in 1:n)
  for(j in which(missx[,i]==0))
    z[j,i] <- ... something depending on the previous computations and x[j,i]
Is there a way to do the inner loop like that in Fortran? I did try a version like this:
do i = 1, n
  do j = 1, xlength(i) ! xlength(i) gives the number of zero elements in missx(:,i)
    j2 = whichx(j,i) ! whichx(1:xlength(i),i) contains the indices of the zero elements in missx(:,i)
    z(j2,i) = ... something depending on the previous computations and x(j2,i)
  end do
end do
This seemed slightly faster than the first solution (not counting the cost of building xlength and whichx), but is there some more clever way to do this, like the R version, so that I wouldn't need to store the xlength and whichx arrays?
I don't think you are going to get a dramatic speedup anyway. If you must do the iteration for most items, then storing just the list of those with the 0 value for the whole array is not an option. You can of course use the WHERE or FORALL construct.
forall(i = 1:n, j = 1:p, missx(j,i) == 0) z(j,i) = ...
or just
where(missx == 0) z = ...
But the usual limitations of these constructs apply.
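For comparison, the index-list idea from the question (R's which) can be sketched in Python as follows. The recursion used here, a running sum over the unmasked entries, is a hypothetical stand-in for the elided "something depending on the previous computations", and the function names are made up.

```python
def zero_indices(column):
    """Positions where the mask is 0, like which(col == 0) in R (zero-based here)."""
    return [j for j, m in enumerate(column) if m == 0]

def running_sum_masked(x_col, miss_col):
    """Hypothetical recursion: z[j] = previously computed z + x[j], skipping masked rows."""
    z = [None] * len(x_col)
    prev = 0.0
    for j in zero_indices(miss_col):
        prev = prev + x_col[j]
        z[j] = prev
    return z

x_col = [1.0, 2.0, 3.0, 4.0]
miss_col = [0, 1, 0, 0]          # second entry is flagged as missing
print(zero_indices(miss_col))    # [0, 2, 3]
print(running_sum_masked(x_col, miss_col))  # [1.0, None, 4.0, 8.0]
```

Note that building the index list still scans the mask once per column, so this mostly moves the test out of the inner loop rather than removing it.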
How do I create a vector like this:
a = [a_1; a_2; ...; a_n];
aNew = [a; a.^2; a.^3; ...; a.^T];
Is it possible to create aNew without a loop?
So you want different powers of a, all strung out into a vector? I would create an array, where each column of the array is a different power of a. Then string it out into a vector. Something like this...
aNew = bsxfun(@power,a,1:T);
aNew = aNew(:);
This does what you want, in a simple, efficient way. bsxfun is a more efficient way of writing the expansion than are other methods, such as repmat, ndgrid and meshgrid.
The code I wrote does assume that a is a column vector, as you have constructed it.
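As a quick check of the intended ordering, here is a plain-Python sketch of the same result: all first powers, then all squares, and so on up to T. The function name stacked_powers is made up for illustration.

```python
def stacked_powers(a, T):
    """Return [a, a^2, ..., a^T] concatenated into one flat list."""
    # The outer loop runs over the exponent, so each power of the whole
    # vector stays contiguous, matching a column-by-column flatten.
    return [x ** t for t in range(1, T + 1) for x in a]

a = [2, 3]
a_new = stacked_powers(a, 3)
print(a_new)  # [2, 3, 4, 9, 8, 27]
```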
The idea is to use meshgrid to create two arrays of size T x n:
[n_mesh, t_mesh] = meshgrid(a, 1:T);
Now n_mesh is an array where each row is a duplicate of a, and t_mesh is an array where each column is 1:T.
Now you can use an element-wise operation on them, then transpose and flatten so that the powers appear in the order a; a.^2; ...; a.^T:
aNew = (n_mesh .^ t_mesh).';
aNew = aNew(:);