Is it possible to read only a sample of data using readRDS? - r

I have a big matrix saved using saveRDS:
# create some big matrix and save it
x = matrix(c(1:(10*10000)),10000,10)
saveRDS(x, 'test.RDS')
Now I would like to analyze only a sample of the data. So far I have been reading the full matrix before taking the sample:
# load the big matrix, then take a sample of its rows
x <- readRDS('test.RDS')
set.seed(1)
x[sample.int(dim(x)[1],5),]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2656 12656 22656 32656 42656 52656 62656 72656 82656 92656
[2,] 3721 13721 23721 33721 43721 53721 63721 73721 83721 93721
[3,] 5728 15728 25728 35728 45728 55728 65728 75728 85728 95728
[4,] 9080 19080 29080 39080 49080 59080 69080 79080 89080 99080
[5,] 2017 12017 22017 32017 42017 52017 62017 72017 82017 92017
Is it possible to read only a sample of the data stored in an RDS file? That would mean not reading the whole matrix into memory before taking the sample, but somehow skipping the data that does not belong to the sample.
I tried the following and got the same result:
# find out the number of rows and load only the part of the matrix which is needed?
n <- dim(readRDS('test.RDS'))[1]
set.seed(1)
readRDS('test.RDS')[sample.int(n,5),]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2656 12656 22656 32656 42656 52656 62656 72656 82656 92656
[2,] 3721 13721 23721 33721 43721 53721 63721 73721 83721 93721
[3,] 5728 15728 25728 35728 45728 55728 65728 75728 85728 95728
[4,] 9080 19080 29080 39080 49080 59080 69080 79080 89080 99080
[5,] 2017 12017 22017 32017 42017 52017 62017 72017 82017 92017
How can I read a sample from an RDS file without temporarily putting the full data into memory?
Alternatively, what storing and loading functions should one use in order to read only a sample from a file containing a matrix or data frame?
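As far as I know, readRDS() has no option to read part of an object: an RDS file is a (by default compressed) serialization of one R object, so it is deserialized as a whole. One possible workaround, sketched here under the assumptions that the matrix is integer and that you are free to choose the storage format, is to write it row by row into a flat binary file with writeBin() and then jump to the wanted rows with seek() and readBin(). The temporary file and the helper read_rows() are made up for illustration.

```r
# Same matrix as in the question
x <- matrix(1:(10 * 10000), 10000, 10)

# Store it row by row as 4-byte integers (t() makes each row contiguous)
path <- tempfile(fileext = ".bin")
writeBin(as.vector(t(x)), path, size = 4L)

# Hypothetical helper: read only the requested rows from the binary file
read_rows <- function(path, rows, ncol, size = 4L) {
  con <- file(path, "rb")
  on.exit(close(con))
  out <- matrix(0L, length(rows), ncol)
  for (k in seq_along(rows)) {
    # byte offset of the first element of the requested row
    seek(con, where = (rows[k] - 1) * ncol * size, origin = "start")
    out[k, ] <- readBin(con, what = "integer", n = ncol, size = size)
  }
  out
}

set.seed(1)
idx <- sample.int(10000, 5)
s <- read_rows(path, idx, ncol = 10)
s
```

Only 5 x 10 integers are read here, at the cost of having to remember the number of columns and the element type yourself. The R documentation advises caution with seek() on Windows, so treat this as a sketch rather than a portable solution.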

Related

Moving from a matrix of character names to a vector of those names (for fMRI data)

I have a lower triangular matrix of fMRI network connectivities with sum(1:235) = 27730 values. I want to cbind another vector holding the names of the corresponding pairs of regions of interest (ROIs), but I'm not sure how to get from the vector of 236 ROI names to the full vector of 27730 pair names.
So the connections should go like this: SN1-SN2, SN1-SN3…..SN1-CB4, SN2-SN3 …. SN2-CB4, SN3-SN4 …SN3-CB4 and so on. If you take all the unique connections, then the first of 236 ROIs has 235 connections, second ROI has 234 connections, third ROI has 233 connections and so on. So the total unique connections are sum(1:235) = 27730.
Per a comment, though, I have shortened the vector to contain only 7 of these ROIs.
Thus, I've also changed the connectivities to have sum(1:6) = 21 values.
Thanks much!
roi <- c("SN2", "SN3", "SN4", "SN5", "CON1", "CON2", "CB4")
connectivities <- rnorm(21)
Here's a way:
m <- outer(roi, roi, paste, sep = "-")
m
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "SN2-SN2" "SN2-SN3" "SN2-SN4" "SN2-SN5" "SN2-CON1" "SN2-CON2" "SN2-CB4"
# [2,] "SN3-SN2" "SN3-SN3" "SN3-SN4" "SN3-SN5" "SN3-CON1" "SN3-CON2" "SN3-CB4"
# [3,] "SN4-SN2" "SN4-SN3" "SN4-SN4" "SN4-SN5" "SN4-CON1" "SN4-CON2" "SN4-CB4"
# [4,] "SN5-SN2" "SN5-SN3" "SN5-SN4" "SN5-SN5" "SN5-CON1" "SN5-CON2" "SN5-CB4"
# [5,] "CON1-SN2" "CON1-SN3" "CON1-SN4" "CON1-SN5" "CON1-CON1" "CON1-CON2" "CON1-CB4"
# [6,] "CON2-SN2" "CON2-SN3" "CON2-SN4" "CON2-SN5" "CON2-CON1" "CON2-CON2" "CON2-CB4"
# [7,] "CB4-SN2" "CB4-SN3" "CB4-SN4" "CB4-SN5" "CB4-CON1" "CB4-CON2" "CB4-CB4"
m[upper.tri(m)]
# [1] "SN2-SN3" "SN2-SN4" "SN3-SN4" "SN2-SN5" "SN3-SN5" "SN4-SN5" "SN2-CON1" "SN3-CON1" "SN4-CON1"
# [10] "SN5-CON1" "SN2-CON2" "SN3-CON2" "SN4-CON2" "SN5-CON2" "CON1-CON2" "SN2-CB4" "SN3-CB4" "SN4-CB4"
# [19] "SN5-CB4" "CON1-CB4" "CON2-CB4"
Because there are 7 in roi, the first element ("SN2") has six connections; second element ("SN3") has five; etc ... producing 21 total connections.
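Note that m[upper.tri(m)] walks the matrix column by column, so its order differs from the row-wise order requested in the question (SN2-SN3, SN2-SN4, ..., SN2-CB4, SN3-SN4, ...). A small sketch of one way to get the row-wise order, and to bind the names to the values, is shown here. The random connectivities are placeholders, and it is an assumption that the real values are stored in the same row-wise order.

```r
roi <- c("SN2", "SN3", "SN4", "SN5", "CON1", "CON2", "CB4")
m <- outer(roi, roi, paste, sep = "-")

# Transposing first and taking the lower triangle walks the
# original upper triangle row by row instead of column by column
pairs_by_row <- t(m)[lower.tri(m)]

# Placeholder values, one per connection (assumed to be in the same order)
set.seed(1)
connectivities <- rnorm(length(pairs_by_row))

result <- data.frame(connection = pairs_by_row, value = connectivities)
head(result, 3)
```

If the values are instead stored column by column, m[upper.tri(m)] from the answer is the matching order, so it is worth checking a few known pairs before trusting either version.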
Another way, using (and improving on) Ben's use of combn:
apply(combn(roi,2), 2, paste, collapse = "-")
# [1] "SN2-SN3" "SN2-SN4" "SN2-SN5" "SN2-CON1" "SN2-CON2" "SN2-CB4" "SN3-SN4" "SN3-SN5" "SN3-CON1"
# [10] "SN3-CON2" "SN3-CB4" "SN4-SN5" "SN4-CON1" "SN4-CON2" "SN4-CB4" "SN5-CON1" "SN5-CON2" "SN5-CB4"
# [19] "CON1-CON2" "CON1-CB4" "CON2-CB4"
Here is an example with a smaller set of values (7). For 7 values, there are 21 combinations: 6 + 5 + 4 + 3 + 2 + 1 = 21.
roi <- c("SN2", "SN3", "SN4", "SN5", "CON1", "CON2", "CB4")
The combn() function generates the desired pairs as a matrix with one pair per column:
combn(roi, 2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] "SN2" "SN2" "SN2" "SN2" "SN2" "SN2" "SN3" "SN3" "SN3" "SN3" "SN3"
[2,] "SN3" "SN4" "SN5" "CON1" "CON2" "CB4" "SN4" "SN5" "CON1" "CON2" "CB4"
[,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
[1,] "SN4" "SN4" "SN4" "SN4" "SN5" "SN5" "SN5" "CON1" "CON1" "CON2"
[2,] "SN5" "CON1" "CON2" "CB4" "CON1" "CON2" "CB4" "CON2" "CB4" "CB4"
To get your final desired output, transpose the matrix, convert to data.frame, and use unite() from tidyr to stitch the two roi values together.
library(dplyr) # for the pipe %>%
library(tidyr) # for unite()
combn(roi, 2) %>%
t() %>% as.data.frame() %>%
unite(col = "combination", sep = "-")
combination
1 SN2-SN3
2 SN2-SN4
3 SN2-SN5
4 SN2-CON1
5 SN2-CON2
6 SN2-CB4
7 SN3-SN4
8 SN3-SN5
9 SN3-CON1
10 SN3-CON2
11 SN3-CB4
12 SN4-SN5
13 SN4-CON1
14 SN4-CON2
15 SN4-CB4
16 SN5-CON1
17 SN5-CON2
18 SN5-CB4
19 CON1-CON2
20 CON1-CB4
21 CON2-CB4

Dataframe not recognized for GINI function (package reldist)

I want to calculate the Gini coefficient using the reldist package.
This is my code:
library(reldist)
year_return <- read.csv("year_return.csv")
year_return[3:19] <- lapply(year_return[3:19], function(x)
as.numeric(as.character(x)))
year_return[[2]] <- as.Date(year_return[[2]])
str(year_return)
gini(year_return[3:19],w)
This is the error I get:
Error in `[.data.frame`(x, ox) : undefined columns selected
This is w:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
0.04591712 0.04078667 0.04126135 0.05131896 0.04349168 0.04834431 0.04694083 0.03904389 0.04117694
[,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18]
0.04537461 0.04692524 0.04045692 0.04696848 0.05087293 0.1713231 0.08499888 0.04396601 0.0708321
This is what I get for str(year_return) :
X Date .SXQR .SXTR .SXNR .SXMR .SXAR .SX3R .SX6R .SXFR .SXOR .SXDR .SX4R .SXRR .SXER
1 1 2000-01-03 364.94 223.93 489.04 586.38 306.56 246.81 385.36 403.82 283.78 455.39 427.43 498.08 457.57
2 2 2000-01-04 345.04 218.90 474.05 566.15 301.13 239.24 374.64 390.41 275.93 434.92 414.10 476.17 435.72
3 3 2000-01-05 338.22 215.88 464.20 542.29 298.22 239.55 373.26 383.48 272.54 430.05 406.33 466.19 436.23
4 4 2000-01-06 343.13 218.18 470.82 529.33 300.69 249.75 377.26 383.48 272.47 434.15 417.91 464.59 438.26
5 5 2000-01-07 349.46 220.10 478.87 531.65 306.50 255.17 381.19 390.23 273.76 447.02 428.54 474.40 445.40
6 6 2000-01-10 356.20 223.01 484.07 581.82 310.84 252.75 387.74 393.75 278.76 453.80 431.81 473.14 440.15
.SXKR .SX7R .SX8R .SXIR .SXPR
1 1016.39 489.65 1070.72 466.36 368.62
2 971.51 471.23 1015.13 450.38 365.89
3 924.57 464.75 949.91 446.67 363.78
4 887.88 461.62 935.48 448.10 370.22
5 918.33 465.41 970.17 456.69 376.62
6 944.22 467.89 1002.93 460.26 373.81
Here you can find the dataset I am using (year_return.csv)

R image function

I have trouble understanding the image function in R. I have the following matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 6.931799 7.092166 7.136029 6.735593 6.621951 6.740000 6.049774 6.162304 6.169014 5.626374
[2,] 7.942623 7.909091 9.923077 5.888889 8.647059 8.166667 6.625000 6.529412 7.571429 5.590643
[3,] 8.446237 6.800000 9.000000 9.631579 8.892857 7.083333 6.857143 6.250000 6.413793 5.491525
[4,] 7.698276 6.666667 8.833333 7.565217 9.100000 6.705882 6.421053 7.045455 6.045455 5.267857
[5,] 6.082524 8.300000 8.250000 8.777778 7.250000 7.928571 6.500000 6.920000 5.041667 4.970833
[6,] 6.128571 8.636364 7.300000 6.266667 7.500000 7.384615 6.727273 6.312500 5.638889 4.569231
[7,] 6.146739 7.000000 7.625000 6.615385 5.466667 5.941176 7.100000 6.687500 5.789474 4.479675
[8,] 5.403509 7.714286 6.500000 8.500000 6.384615 7.133333 6.294118 5.900000 5.615385 4.759804
[9,] 5.444444 5.666667 4.875000 6.200000 6.777778 6.166667 5.642857 6.222222 5.428571 4.385093
[10,] 5.186180 5.621118 5.004878 5.045016 4.875433 4.594340 4.260377 4.276382 4.205128 3.632721
and I would like to display it as a heatmap. To do so, I use the image function as follows
image(1:10,1:10,mat,axes=FALSE)
but the result is definitely not what is in my matrix!
Any ideas?
Thanks
Firstly, you should keep in mind that the matrix is printed from top left but plotted from bottom left, like Badger has said. Increasing the row index would move you to the right on the plot.
The color intensity increases from red to white.
Another thing that you might want to change is the range of your z values. By default the plot takes the min and max values of your matrix as the range. You might want to add the argument zlim=c(0,10) so that the range runs from 0 to 10.
Lastly, if you want your plot to correspond to the locations of your z values in the matrix, you could create a new matrix where you rotate your original matrix by 90 degrees clockwise:
t(apply(mat, 2, rev))
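To see what this rotation does, here is a minimal sketch on a small stand-in matrix (the 2 x 3 values are made up, not the data from the question). image() draws z[i, j] at x = i, y = j, so after the rotation the first row of the original matrix ends up at the top of the plot.

```r
# Small stand-in for the matrix in the question
mat <- matrix(1:6, nrow = 2)
mat

# Reverse each column, then transpose: a 90 degree clockwise rotation
rot <- t(apply(mat, 2, rev))
rot

# rot[a, b] equals mat[nrow(mat) + 1 - b, a], so the highest y value
# (b = nrow(mat)) shows row 1 of the original matrix
rot[, nrow(mat)]
```

With the rotated matrix, a call such as image(1:ncol(mat), 1:nrow(mat), rot, zlim = c(0, 10), axes = FALSE) should then show the cells where they appear in the printed matrix.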

Error in pop * mx : non-conformable arrays?

I'm trying to use the Lee-Carter function call in R for mortality rates, but I keep getting this "Error in pop * mx : non-conformable arrays" message when I try to make the call.
I have the demogdata already stored (I know I don't have a range of ages for the ages argument for demogdata(), but I just want to find the overall mortality rate for the subset of the population I'm looking at).
> (xyz = demogdata(Rates, Pop, ages = mean(data$AGE), years = 2006:2014,
type = "mortality", label = "US", name = "total"))
Mortality data for US
Series: total
Years: 2006 - 2014
Ages: 28.9763585116791 - 28.9763585116791
All of the variables I have are as follows:
> Rates
[,1] [,2] [,3] [,4] [,5]
[1,] 0.002540197 0.002242095 0.001958826 0.001708285 0.001434417
[,6] [,7] [,8] [,9]
[1,] 0.001218796 0.0009218339 0.0006424075 0.0003361666
> years
[1] 2006 2007 2008 2009 2010 2011 2012 2013 2014
> Pop
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 179120 352795 516636 682556 851217 1012475 1180256 1343384 1493307
> (ages <- mean(data$AGE))
[1] 28.97636
Here are my input arguments to the lca() function call
out <- lca(xyz, years = c(2006:2014), ages = mean(all.mod2014$AGE),
adjust="dt", restype = "rates")
Error in pop * mx : non-conformable arrays

Matrix multiplication R

I just have an easy question: I have these two matrices
Matrix Y (264 rows and 4 columns)
[,1] [,2] [,3] [,4]
1751 -1.745529 0.3692280 0.04607022 -0.07004973
1752 -1.532722 0.5642921 0.07477571 0.03380135
1753 -1.657636 0.4660229 0.05772685 -0.03314599
1754 -1.685309 0.4540047 0.08254891 -0.01623810
1755 -1.702469 0.4483389 0.10709689 -0.03936556
1756 -1.761332 0.4505378 0.04801420 -0.06385137
Matrix E (4x4, with elements e; it is called P in the code below)
[,1] [,2] [,3] [,4]
[1,] -0.8769976 -0.4706054 -0.07186508 0.06512449
[2,] -0.4085563 0.8198519 -0.40067903 -0.01951755
[3,] 0.2190770 -0.3206892 -0.86394973 -0.32055350
[4,] -0.1263415 0.0594299 0.29644997 -0.94478745
For each year t, I want to compute the terms e[1,i]∙Y[t,i] of b(t)=∑(e[1,i]∙Y[,i]), with i from 1 to 4.
This is what I should get (a 264x4 matrix), and this is the code I've used
betaNew1<-(Y[,1]%*%t(P[1,1]))
betaNew2<-(Y[,2]%*%t(P[1,2]))
betaNew3<-(Y[,3]%*%t(P[1,3]))
betaNew4<-(Y[,3]%*%t(P[1,4]))
beta_t<-data.frame(betaNew1,betaNew2,betaNew3,betaNew4)
betaNew1 betaNew2 betaNew3 betaNew4
1 1.530825 -0.1737607 -0.003310840 0.003000300
2 1.344193 -0.2655589 -0.005373763 0.004869730
3 1.453743 -0.2193129 -0.004148544 0.003759431
4 1.478012 -0.2136570 -0.005932384 0.005375955
5 1.493062 -0.2109907 -0.007696526 0.006974630
6 1.544684 -0.2120255 -0.003450544 0.003126900
How can I avoid using four separate statements?
We can try
res <- lapply(seq_len(nrow(P)), function(i) Y*P[i,][col(Y)])
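This returns a list with one matrix per row of P, and res[[1]] corresponds to the beta_t from the question. If only the first row is needed, a possible alternative is sweep(), sketched here with the first two rows of Y and the values of E/P copied from the question. The fourth statement in the question multiplies Y[,3] rather than Y[,4], which looks like a typo, so the sketch uses column 4 there.

```r
# First two rows of Y from the question (filled column by column)
Y <- matrix(c(-1.745529, -1.532722,
               0.3692280, 0.5642921,
               0.04607022, 0.07477571,
              -0.07004973, 0.03380135), nrow = 2)

# Matrix E from the question, called P in the question's code
P <- matrix(c(-0.8769976, -0.4706054, -0.07186508,  0.06512449,
              -0.4085563,  0.8198519, -0.40067903, -0.01951755,
               0.2190770, -0.3206892, -0.86394973, -0.32055350,
              -0.1263415,  0.0594299,  0.29644997, -0.94478745),
            nrow = 4, byrow = TRUE)

# Multiply every column i of Y by P[1, i] in one statement
beta_t <- sweep(Y, 2, P[1, ], "*")

# The list version from the answer; its first element is the same matrix
res <- lapply(seq_len(nrow(P)), function(i) Y * P[i, ][col(Y)])
all.equal(beta_t, res[[1]])
beta_t
```

If the sum over i is wanted rather than the four separate terms, rowSums(beta_t) or Y %*% P[1, ] would collapse each row to a single value.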
