Dataframe not recognized for GINI function (package reldist) - r

I want to calculate the Gini coefficient using the reldist package.
This is my code:
library(reldist)
year_return <- read.csv("year_return.csv")
year_return[3:19] <- lapply(year_return[3:19], function(x)
as.numeric(as.character(x)))
year_return[[2]] <- as.Date(year_return[[2]])
str(year_return)
gini(year_return[3:19],w)
This is the error I get:
Error in `[.data.frame`(x, ox) : undefined columns selected
This is w:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
0.04591712 0.04078667 0.04126135 0.05131896 0.04349168 0.04834431 0.04694083 0.03904389 0.04117694
[,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18]
0.04537461 0.04692524 0.04045692 0.04696848 0.05087293 0.1713231 0.08499888 0.04396601 0.0708321
This is what the data look like (first rows of year_return):
X Date .SXQR .SXTR .SXNR .SXMR .SXAR .SX3R .SX6R .SXFR .SXOR .SXDR .SX4R .SXRR .SXER
1 1 2000-01-03 364.94 223.93 489.04 586.38 306.56 246.81 385.36 403.82 283.78 455.39 427.43 498.08 457.57
2 2 2000-01-04 345.04 218.90 474.05 566.15 301.13 239.24 374.64 390.41 275.93 434.92 414.10 476.17 435.72
3 3 2000-01-05 338.22 215.88 464.20 542.29 298.22 239.55 373.26 383.48 272.54 430.05 406.33 466.19 436.23
4 4 2000-01-06 343.13 218.18 470.82 529.33 300.69 249.75 377.26 383.48 272.47 434.15 417.91 464.59 438.26
5 5 2000-01-07 349.46 220.10 478.87 531.65 306.50 255.17 381.19 390.23 273.76 447.02 428.54 474.40 445.40
6 6 2000-01-10 356.20 223.01 484.07 581.82 310.84 252.75 387.74 393.75 278.76 453.80 431.81 473.14 440.15
.SXKR .SX7R .SX8R .SXIR .SXPR
1 1016.39 489.65 1070.72 466.36 368.62
2 971.51 471.23 1015.13 450.38 365.89
3 924.57 464.75 949.91 446.67 363.78
4 887.88 461.62 935.48 448.10 370.22
5 918.33 465.41 970.17 456.69 376.62
6 944.22 467.89 1002.93 460.26 373.81
Here you can find the dataset I am using (year_return.csv)
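The `[.data.frame`(x, ox) in the error message suggests that gini() orders and subsets its first argument as a plain vector, so a whole data frame is not accepted. Below is a minimal sketch, assuming that one coefficient per row (date) across the index columns is wanted and that w holds one weight per column. weighted_gini() is a hypothetical base-R stand-in written for this illustration, not the reldist implementation, and the toy data are made up. With reldist loaded, gini() called on a numeric vector would take its place.

```r
# Hypothetical stand-in: weighted Gini coefficient of a numeric vector
weighted_gini <- function(x, weights = rep(1, length(x))) {
  ox <- order(x)
  x <- x[ox]
  weights <- weights[ox] / sum(weights)
  p <- cumsum(weights)          # cumulative population share
  nu <- cumsum(weights * x)     # cumulative value share (unnormalised)
  n <- length(nu)
  nu <- nu / nu[n]
  sum(nu[-1] * p[-n]) - sum(nu[-n] * p[-1])
}

# Toy data standing in for the numeric columns of year_return
toy <- data.frame(a = c(1, 0), b = c(1, 0), c = c(1, 0), d = c(1, 1))
w_toy <- rep(0.25, ncol(toy))   # one weight per column (assumption)

# One coefficient per row: each row reaches the function as a plain vector
gini_by_row <- apply(as.matrix(toy), 1, weighted_gini, weights = w_toy)
```

Note that the weight vector must have as many entries as there are columns passed in. The w shown above has 18 entries, while year_return[3:19] selects 17 columns.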

Related

Moving from a matrix of character names to a vector of those names (for fMRI data)

I have a lower triangular matrix of fMRI network connectivities with sum(1:235) entries, so there are 27730 values. I want to cbind to these values another vector holding the names of the regions of interest (ROIs), but I'm not sure how to get from the vector of 236 ROI names to the full vector of 27730 connection names.
So the connections should go like this: SN1-SN2, SN1-SN3…..SN1-CB4, SN2-SN3 …. SN2-CB4, SN3-SN4 …SN3-CB4 and so on. If you take all the unique connections, then the first of 236 ROIs has 235 connections, second ROI has 234 connections, third ROI has 233 connections and so on. So the total unique connections are sum(1:235) = 27730.
Per a comment, though, I have changed the vector to only contain 7 of these values.
Thus, I've also changed the connectivities to have sum(1:6) = 21 values, one per unique pair.
Thanks much!
roi <- c("SN2", "SN3", "SN4", "SN5", "CON1", "CON2", "CB4")
connectivities <- rnorm(21)
Here's a way:
m <- outer(roi, roi, paste, sep = "-")
m
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "SN2-SN2" "SN2-SN3" "SN2-SN4" "SN2-SN5" "SN2-CON1" "SN2-CON2" "SN2-CB4"
# [2,] "SN3-SN2" "SN3-SN3" "SN3-SN4" "SN3-SN5" "SN3-CON1" "SN3-CON2" "SN3-CB4"
# [3,] "SN4-SN2" "SN4-SN3" "SN4-SN4" "SN4-SN5" "SN4-CON1" "SN4-CON2" "SN4-CB4"
# [4,] "SN5-SN2" "SN5-SN3" "SN5-SN4" "SN5-SN5" "SN5-CON1" "SN5-CON2" "SN5-CB4"
# [5,] "CON1-SN2" "CON1-SN3" "CON1-SN4" "CON1-SN5" "CON1-CON1" "CON1-CON2" "CON1-CB4"
# [6,] "CON2-SN2" "CON2-SN3" "CON2-SN4" "CON2-SN5" "CON2-CON1" "CON2-CON2" "CON2-CB4"
# [7,] "CB4-SN2" "CB4-SN3" "CB4-SN4" "CB4-SN5" "CB4-CON1" "CB4-CON2" "CB4-CB4"
m[upper.tri(m)]
# [1] "SN2-SN3" "SN2-SN4" "SN3-SN4" "SN2-SN5" "SN3-SN5" "SN4-SN5" "SN2-CON1" "SN3-CON1" "SN4-CON1"
# [10] "SN5-CON1" "SN2-CON2" "SN3-CON2" "SN4-CON2" "SN5-CON2" "CON1-CON2" "SN2-CB4" "SN3-CB4" "SN4-CB4"
# [19] "SN5-CB4" "CON1-CB4" "CON2-CB4"
Because there are 7 in roi, the first element ("SN2") has six connections; second element ("SN3") has five; etc ... producing 21 total connections.
Another way, using (and improving on) Ben's use of combn:
apply(combn(roi,2), 2, paste, collapse = "-")
# [1] "SN2-SN3" "SN2-SN4" "SN2-SN5" "SN2-CON1" "SN2-CON2" "SN2-CB4" "SN3-SN4" "SN3-SN5" "SN3-CON1"
# [10] "SN3-CON2" "SN3-CB4" "SN4-SN5" "SN4-CON1" "SN4-CON2" "SN4-CB4" "SN5-CON1" "SN5-CON2" "SN5-CB4"
# [19] "CON1-CON2" "CON1-CB4" "CON2-CB4"
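The two approaches return the same 21 labels in different orders, which matters once the labels are bound to the connectivity values. Here is a small sketch, assuming the values were read from the lower triangle column by column (the order in which lower.tri() extracts them). The connectivities are random placeholders.

```r
roi <- c("SN2", "SN3", "SN4", "SN5", "CON1", "CON2", "CB4")
m <- outer(roi, roi, paste, sep = "-")

# Pairs in the order SN2-SN3, SN2-SN4, ..., SN2-CB4, SN3-SN4, ...
labels_combn <- apply(combn(roi, 2), 2, paste, collapse = "-")

# Transposing first makes the column-wise walk of the triangle give that same order
labels_tri <- t(m)[lower.tri(m)]

# Placeholder values, one per unique connection
connectivities <- rnorm(length(labels_combn))
result <- data.frame(connection = labels_combn, value = connectivities)
```

If the values were stored in a different order, the label vector has to be built in that same order before binding.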
Here is an example with a smaller set of values (7). For 7 values, there are 21 combinations: 6 + 5 + 4 + 3 + 2 + 1 = 21.
roi <- c("SN2", "SN3", "SN4", "SN5", "CON1", "CON2", "CB4")
Calling combn(roi, 2) generates the desired pairs as a matrix with one column per pair:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] "SN2" "SN2" "SN2" "SN2" "SN2" "SN2" "SN3" "SN3" "SN3" "SN3" "SN3"
[2,] "SN3" "SN4" "SN5" "CON1" "CON2" "CB4" "SN4" "SN5" "CON1" "CON2" "CB4"
[,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
[1,] "SN4" "SN4" "SN4" "SN4" "SN5" "SN5" "SN5" "CON1" "CON1" "CON2"
[2,] "SN5" "CON1" "CON2" "CB4" "CON1" "CON2" "CB4" "CON2" "CB4" "CB4"
To get your final desired output, transpose the matrix, convert to data.frame, and use unite() from tidyr to stitch the two roi values together.
library(dplyr) # for the pipe %>%
library(tidyr) # for unite()
combn(roi, 2) %>%
t() %>% as.data.frame() %>%
unite(col = "combination", sep = "-")
combination
1 SN2-SN3
2 SN2-SN4
3 SN2-SN5
4 SN2-CON1
5 SN2-CON2
6 SN2-CB4
7 SN3-SN4
8 SN3-SN5
9 SN3-CON1
10 SN3-CON2
11 SN3-CB4
12 SN4-SN5
13 SN4-CON1
14 SN4-CON2
15 SN4-CB4
16 SN5-CON1
17 SN5-CON2
18 SN5-CB4
19 CON1-CON2
20 CON1-CB4
21 CON2-CB4

Is it possible to read only a sample of data using readRDS?

I have some big matrix saved using saveRDS:
# create some big matrix and save it
x = matrix(c(1:(10*10000)),10000,10)
saveRDS(x, 'test.RDS')
Now I would like to analyze only a sample of the data, but so far I have been reading the full matrix before taking the sample:
# load big matrix and take a sample of the data after reading it
x <- readRDS('test.RDS')
set.seed(1)
x[sample.int(dim(x)[1],5),]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2656 12656 22656 32656 42656 52656 62656 72656 82656 92656
[2,] 3721 13721 23721 33721 43721 53721 63721 73721 83721 93721
[3,] 5728 15728 25728 35728 45728 55728 65728 75728 85728 95728
[4,] 9080 19080 29080 39080 49080 59080 69080 79080 89080 99080
[5,] 2017 12017 22017 32017 42017 52017 62017 72017 82017 92017
Is it possible to read only a sample of the data stored in an RDS file? That would mean not reading the whole matrix into memory before taking the sample, but somehow skipping the data that does not belong to the sample.
I tried the following, and got the same result:
# find out the size of the matrix and load only the part of the matrix which is needed?
n <- dim(readRDS('test.RDS'))[1]
set.seed(1)
readRDS('test.RDS')[sample.int(n,5),]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2656 12656 22656 32656 42656 52656 62656 72656 82656 92656
[2,] 3721 13721 23721 33721 43721 53721 63721 73721 83721 93721
[3,] 5728 15728 25728 35728 45728 55728 65728 75728 85728 95728
[4,] 9080 19080 29080 39080 49080 59080 69080 79080 89080 99080
[5,] 2017 12017 22017 32017 42017 52017 62017 72017 82017 92017
How can I read a sample from an RDS file without temporarily putting the full data into memory?
Alternatively, what kind of storing and loading functions should one use to be able to read only a sample from a file containing a matrix or data frame?
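As far as I know, readRDS() always deserializes the whole object, so sampling before loading needs a different on-disk layout. Here is a minimal sketch under that assumption: the matrix is written row by row as raw 4-byte integers with writeBin(), and seek() plus readBin() then fetch only the sampled rows. read_rows() and the .bin file are hypothetical names for this illustration, and the approach assumes a plain integer matrix whose dimensions are known in advance.

```r
x <- matrix(1:(10 * 10000), 10000, 10)      # integer matrix, as in the question

# Write row-major so that each row is a contiguous run of ncol(x) integers
path <- tempfile(fileext = ".bin")
con <- file(path, "wb")
writeBin(as.vector(t(x)), con, size = 4L)
close(con)

# Hypothetical helper: read only the requested rows from the binary file
read_rows <- function(path, rows, ncol, size = 4L) {
  con <- file(path, "rb")
  on.exit(close(con))
  out <- matrix(0L, length(rows), ncol)
  for (i in seq_along(rows)) {
    # jump to the first byte of the wanted row
    seek(con, where = (rows[i] - 1) * ncol * size, origin = "start")
    out[i, ] <- readBin(con, what = "integer", n = ncol, size = size)
  }
  out
}

set.seed(1)
idx <- sample.int(10000, 5)
x_sample <- read_rows(path, idx, ncol = 10)
```

Only 5 * 10 integers are read here. R's documentation discourages seek() on Windows, so there a database or a memory-mapped format may be the safer choice.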

Is there a faster way to run sapply that is nested in two for loops?

I have a big data frame with > 1 million lines representing time series data for several individuals (with different individual data in different columns).
In addition, I have a 3D array that contains encounter frame numbers that indicate from which frame in my time series I want to extract data.
For a given individual and encounter type, I want to extract one time series of e.g. 100 frames. However, as I have many replicates for each meeting type and each individual, I want to directly calculate the average time series per individual and encounter type.
I managed to do so using sapply embedded in two for loops. However, running these for loops is very slow, and I now wonder whether there is a faster way of implementing this calculation in R, or whether I should rather do it in C++. Below is my code and a small sample of my data:
nb_ind = 3;
response_duration = 100;
nb_meeting_types = 2;
nb_variables = 2;
speed_offset = 2;
MEETING_START_OFFSET = 50;
replicate = 20;
# behavior_data is a data frame with columns: frame,speed1,head1,speed2,head2,speed3,head3
# there are about 1 million rows
dim(behavior_data)
[1] 1080000 7
head(behavior_data)
frame speed1 head1 speed2 head2 speed3 head3
1 0 0 25 2.4 179 1.1 16
2 1 1.5 20 2.0 -175 1.6 27
3 2 1.6 28 2.0 -178 1.0 37
4 3 0.8 56 1.6 170 0.8 37
5 4 0.3 56 1.8 162 0 40
# encounters is an array with frame numbers of dimension [nb_ind,replicate,nb_meeting_types]
# these frame numbers correspond to starting points of meetings, for which I want to calculate the speed
dim(encounters)
[1] 3 20 2
head(encounters[,,1])
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
[1,] 12049 17693 23350 29018 34666 40327 68608 57293 74264 45980 113864 79922 119522 102552 51636 153462 91235 142151 159121 62948
[2,] 12036 17694 23352 29014 34674 40322 68606 57296 74268 45982 113865 79929 119521 102558 51639 153463 91242 142161 159168 62952
[3,] 12037 17694 23351 29011 34669 40329 68606 57298 74263 45985 NA 79921 NA 102550 51641 NA 91234 NA NA 62950
all_average_speeds = array(NaN, c(nb_ind, response_duration, nb_meeting_types))
for (j in 1:nb_ind){
  # speed column of individual j
  speed_j = behavior_data[, (j-1)*nb_variables + speed_offset]
  for (i in 1:nb_meeting_types){
    # calculate the average speed response across all replicates of a given meeting type for a given individual
    all_average_speeds[j,,i] = sapply(1:response_duration, function(k){
      # %in% rather than ==, because == would recycle the replicate frame numbers along the rows
      mean(speed_j[behavior_data$frame %in% (encounters[j,,i] + k-1 - MEETING_START_OFFSET)], na.rm=TRUE)
    })
  }
}
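Most of the time in the loops above goes into scanning all rows of behavior_data once per value of k. Here is a sketch of one way to avoid that, assuming the goal is the mean speed over the frames encounters[j,,i] + k - 1 - MEETING_START_OFFSET: build all wanted frame numbers at once with outer(), look them up with a single match(), and average with colMeans(). The data below are random stand-ins with the same layout as in the question, and fast_average_speeds is a name introduced for the illustration.

```r
set.seed(1)
nb_ind <- 3; response_duration <- 100; nb_meeting_types <- 2
nb_variables <- 2; speed_offset <- 2; MEETING_START_OFFSET <- 50; replicate <- 20

# Toy stand-ins for the real data (same layout, far fewer rows)
n_frames <- 2000
behavior_data <- data.frame(frame = 0:(n_frames - 1),
                            speed1 = runif(n_frames), head1 = 0,
                            speed2 = runif(n_frames), head2 = 0,
                            speed3 = runif(n_frames), head3 = 0)
encounters <- array(sample(100:1800, nb_ind * replicate * nb_meeting_types),
                    c(nb_ind, replicate, nb_meeting_types))
encounters[3, 11, 1] <- NA   # missing replicates are allowed

# Frame offsets of the response window relative to each encounter
offsets <- seq_len(response_duration) - 1 - MEETING_START_OFFSET

fast_average_speeds <- array(NaN, c(nb_ind, response_duration, nb_meeting_types))
for (j in seq_len(nb_ind)) {
  speed_j <- behavior_data[[(j - 1) * nb_variables + speed_offset]]
  for (i in seq_len(nb_meeting_types)) {
    # replicate x response_duration matrix of wanted frame numbers
    frames <- outer(encounters[j, , i], offsets, "+")
    # one lookup for all frames instead of one full scan per k
    rows <- match(frames, behavior_data$frame)
    speeds <- matrix(speed_j[rows], nrow = nrow(frames))
    fast_average_speeds[j, , i] <- colMeans(speeds, na.rm = TRUE)
  }
}
```

One difference to keep in mind: if two replicates point at the same frame, match() counts that frame once per replicate, whereas a logical mask over the rows counts it once.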

Error in pop * mx : non-conformable arrays?

I'm trying to use the Lee-Carter function call in R for mortality rates, but I keep getting this "Error in pop * mx : non-conformable arrays" message when I try to make the call.
I have the demogdata already stored (I know I don't have a range of ages for the ages argument for demogdata(), but I just want to find the overall mortality rate for the subset of the population I'm looking at).
> (xyz = demogdata(Rates, Pop, ages = mean(data$AGE), years = 2006:2014,
type = "mortality", label = "US", name = "total"))
Mortality data for US
Series: total
Years: 2006 - 2014
Ages: 28.9763585116791 - 28.9763585116791
All of the variables I have are as follows:
> Rates
[,1] [,2] [,3] [,4] [,5]
[1,] 0.002540197 0.002242095 0.001958826 0.001708285 0.001434417
[,6] [,7] [,8] [,9]
[1,] 0.001218796 0.0009218339 0.0006424075 0.0003361666
> years
[1] 2006 2007 2008 2009 2010 2011 2012 2013 2014
> Pop
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 179120 352795 516636 682556 851217 1012475 1180256 1343384 1493307
> (ages <- mean(data$AGE))
[1] 28.97636
Here are my input arguments to the lca() function call
out <- lca(xyz, years = c(2006:2014), ages = mean(all.mod2014$AGE),
adjust="dt", restype = "rates")
Error in pop * mx : non-conformable arrays

Matrix multiplication R

I just have an easy question: I have these two matrices
Matrix Y (264 rows and 4 columns)
[,1] [,2] [,3] [,4]
1751 -1.745529 0.3692280 0.04607022 -0.07004973
1752 -1.532722 0.5642921 0.07477571 0.03380135
1753 -1.657636 0.4660229 0.05772685 -0.03314599
1754 -1.685309 0.4540047 0.08254891 -0.01623810
1755 -1.702469 0.4483389 0.10709689 -0.03936556
1756 -1.761332 0.4505378 0.04801420 -0.06385137
Matrix E (4x4, with elements e; called P in the code below)
[,1] [,2] [,3] [,4]
[1,] -0.8769976 -0.4706054 -0.07186508 0.06512449
[2,] -0.4085563 0.8198519 -0.40067903 -0.01951755
[3,] 0.2190770 -0.3206892 -0.86394973 -0.32055350
[4,] -0.1263415 0.0594299 0.29644997 -0.94478745
I want to do this for each year t: b(t)=∑(e[1,i]∙Y[t,i]), with i from 1 to 4.
This is what I should get (a 264x4 matrix), and this is the code I've used:
betaNew1<-(Y[,1]%*%t(P[1,1]))
betaNew2<-(Y[,2]%*%t(P[1,2]))
betaNew3<-(Y[,3]%*%t(P[1,3]))
betaNew4<-(Y[,3]%*%t(P[1,4]))
beta_t<-data.frame(betaNew1,betaNew2,betaNew3,betaNew4)
betaNew1 betaNew2 betaNew3 betaNew4
1 1.530825 -0.1737607 -0.003310840 0.003000300
2 1.344193 -0.2655589 -0.005373763 0.004869730
3 1.453743 -0.2193129 -0.004148544 0.003759431
4 1.478012 -0.2136570 -0.005932384 0.005375955
5 1.493062 -0.2109907 -0.007696526 0.006974630
6 1.544684 -0.2120255 -0.003450544 0.003126900
How can I avoid using four separate instructions?
We can try
res <- lapply(seq_len(nrow(P)), function(i) Y*P[i,][col(Y)])
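Here res is a list with one matrix per row of P, and res[[1]] holds the four columns from the question. If only the first row of P is needed, a shorter sketch is possible with sweep(), which scales column i of Y by P[1, i]. The small Y and P below are made-up stand-ins, and terms and b are names introduced for the illustration.

```r
# Toy stand-ins for Y (years x 4) and P (4 x 4)
Y <- matrix(c(1, 2, 3, 4,
              5, 6, 7, 8), nrow = 2, byrow = TRUE,
            dimnames = list(c("1751", "1752"), NULL))
P <- matrix((1:16) / 10, nrow = 4, byrow = TRUE)

# Scale column i of Y by P[1, i]: the same result as the four separate products
terms <- sweep(Y, 2, P[1, ], "*")

# The sum over i, that is b(t) for every year at once
b <- drop(Y %*% P[1, ])
```

rowSums(terms) gives the same b, so the intermediate 264x4 matrix is only needed if the individual terms are of interest.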
