I have an sf object containing the following columns of data:
HR60 HR70 HR80 HR90 HC60 HC70 HC80 HC90
0.000000 0.000000 8.855827 0.000000 0.0000000 0.0000000 0.3333333 0.0000000
0.000000 0.000000 17.208742 15.885624 0.0000000 0.0000000 1.0000000 1.0000000
1.863863 1.915158 3.450775 6.462453 0.3333333 0.3333333 1.0000000 2.0000000
...
How can I calculate the median of the HR60 to HR90 columns for each observation and place it in a new column, let's say HR-median? I tried to use apply(), but that works on the whole dataset, and I need only these 4 columns to be considered.
We can select those columns and apply median row-wise:
df1$HR_median <- apply(subset(df1, select = HR60:HR90), 1, median)
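Note that sf objects have a "sticky" geometry column that subsetting keeps around, which can get in the way of a row-wise median. A minimal sketch that drops the geometry first (assuming df1 really is an sf object and the sf package is loaded; the column names are taken from the question):
library(sf)
hr_cols <- st_drop_geometry(df1)[, c("HR60", "HR70", "HR80", "HR90")]  # plain data.frame, no geometry
df1$HR_median <- apply(hr_cols, 1, median)                             # row-wise median of the four columns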
Given the following vector,
x <- c(0, 0, 0, 5.0, 5.1, 5.0, 6.5, 6.7, 6.0, 0, 0, 0, 0, 3.8, 4.0, 3.6)
I would like to have a vector with the cumulative mean, like
cumsum(x)/seq_along(x)
but restarting the computation each time the difference between two subsequent values is greater than 1.3 or less than -1.3. My aim is to obtain a vector like
d <- c(0, 0, 0, 5, 5.05, 5.03, 6.5, 6.6, 6.37, 0, 0, 0, 0, 3.8, 3.9, 3.8)
You can use cumsum(abs(diff(x)) > 1.3) to define groups, which are used in aggregate to restart cumsum(x)/seq_along(x) each time the difference is greater than 1.3 or less than -1.3.
unlist(aggregate(x, list(c(0, cumsum(abs(diff(x)) > 1.3))),
function(x) cumsum(x)/seq_along(x))[,2])
# [1] 0.000000 0.000000 0.000000 5.000000 5.050000 5.033333 6.500000 6.600000
# [9] 6.400000 0.000000 0.000000 0.000000 0.000000 3.800000 3.900000 3.800000
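To see how the restart points translate into groups, you can inspect the grouping vector on its own; with the x from the question it labels the five runs 0 through 4:
c(0, cumsum(abs(diff(x)) > 1.3))
# [1] 0 0 0 1 1 1 2 2 2 3 3 3 3 4 4 4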
Maybe you can try ave + findInterval like below
ave(x,
    findInterval(seq_along(x), which(abs(diff(x)) > 1.3) + 1),
    FUN = function(v) cumsum(v)/seq_along(v))
which gives
[1] 0.000000 0.000000 0.000000 5.000000 5.050000 5.033333 6.500000 6.600000
[9] 6.400000 0.000000 0.000000 0.000000 0.000000 3.800000 3.900000 3.800000
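If you prefer a tidyverse style, the same grouping idea carries over; a sketch assuming the dplyr package is available (cummean is dplyr's cumulative mean):
library(dplyr)
tibble(x = x) %>%
  group_by(g = cumsum(c(0, abs(diff(x)) > 1.3))) %>%  # new group whenever the jump exceeds 1.3 in absolute value
  mutate(cm = cummean(x)) %>%
  pull(cm)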
I have a dataset with five categorical variables. I ran a multinomial logistic regression with the function multinom from the nnet package and then derived the p-values from the coefficients, but I do not know how to interpret the results.
The p-values were derived following UCLA's tutorial: https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/ .
Just like this:
z <- summary(test)$coefficients/summary(test)$standard.errors
p <- (1 - pnorm(abs(z), 0, 1)) * 2
p
And I got this:
(Intercept) Age1 Age2 Age3 Age4 Unit1 Unit2 Unit3 Unit4 Unit5 Level1 Level2 Area1 Area2
Not severe 0.7388029 9.094373e-01 0 0.000000e+00 0.000000e+00 0 0.75159758 0 0 0.0000000 0.8977727 0.9333862 0.6285447 0.4457171
Very severe 0.0000000 1.218272e-09 0 6.599380e-06 7.811761e-04 0 0.00000000 0 0 0.0000000 0.7658748 0.6209889 0.0000000 0.0000000
Severe 0.0000000 8.744405e-08 0 1.052835e-06 3.299770e-04 0 0.00000000 0 0 0.0000000 0.8843606 0.4862364 0.0000000 0.0000000
Just so so 0.0000000 1.685045e-07 0 5.507560e-03 2.973261e-06 0 0.08427447 0 NaN 0.3010429 0.5552963 0.7291180 0.0000000 0.0000000
Not severe at all 0.0000000 0.000000e+00 0 0.000000e+00 0.000000e+00 0 NaN NaN 0 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
But how should I interpret these p-values? Does this mean Age3 is significantly related to Very severe? I am new to statistics and have no idea. Please help me understand the results. Thank you in advance.
I suggest using the stargazer package to display the coefficients and p-values; I believe it is a more convenient and common way.
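For example, a minimal call could look like this (a sketch assuming test is the fitted multinom object from the question; stargazer supports multinom models, and the report argument controls whether standard errors or p-values are shown):
library(stargazer)
stargazer(test, type = "text", report = "vc*p")  # coefficient table with significance stars and p-values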
Regarding the interpretation of the results, in a multinomial model you can say: keeping all other variables constant, if Age3 is higher by one unit, the log odds of Very severe relative to the reference category change by the value of the coefficient. The p-value just tells you whether the association between the predictor and the response is statistically significant. The interpretation is the same as for other regression models.
Note: for every p-value the null hypothesis is that the coefficient is equal to zero (no effect at all). When the p-value is less than 0.05, you can reject the null hypothesis at the 5% level and state that the predictor has a statistically significant effect on the response variable.
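It can also help to exponentiate the coefficients into relative risk ratios, which are often easier to talk about than log odds (again assuming test is the fitted multinom object):
exp(coef(test))  # relative risk ratios for each outcome relative to the reference category
A ratio above 1 means the predictor makes that outcome more likely relative to the reference category; a ratio below 1 means less likely.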
I hope this gives you some useful hints.
I have an xts object of monthly returns across multiple columns, and I'm trying to calculate rolling annual (geometric) returns for each column.
Date Manager 1 Manager 2 Manager 3 Manager 4 Manager 5
20160430 0.0152000 0.0100700 0.0102210 0.0046160 NA
20160531 0.0462000 0.0515240 0.0287490 0.0374920 NA
20160630 0.0007000 0.0126830 0.0156410 0.0130820 NA
20160731 0.0200000 0.0158810 0.0239540 0.0214950 NA
20160831 0.0339000 0.0531980 0.0021170 0.0476160 0.0457650
20160930 -0.0071000 0.0047540 -0.0088080 0.0031540 -0.0034070
20161031 -0.0224000 -0.0181930 0.0181410 -0.0048280 0.0170850
20161130 -0.0439000 -0.0131600 -0.0243030 -0.0064650 -0.0007180
20161231 -0.0051000 0.0200130 0.0204210 0.0160740 0.0172270
20170131 0.0083000 0.0146560 0.0247000 0.0203410 0.0227060
20170228 0.0211000 -0.0067120 0.0257530 0.0029940 0.0124730
20170331 0.0530000 0.0532190 0.0283950 0.0416190 0.0237900
20170430 0.0638300 0.0592280 0.0341340 0.0437430 0.0293500
20170531 0.0339000 0.0264270 0.0287670 0.0207810 0.0179080
20170630 NA -0.0046950 -0.0091310 -0.0074520 -0.0137600
20170731 NA 0.0109280 0.0029630 0.0146560 0.0167990
20170831 NA 0.0290430 0.0372960 0.0284390 0.0229930
20170930 NA 0.0226390 0.0030190 0.0063850 -0.0087170
Expected Results:
Date Manager 1 Manager 2 Manager 3 Manager 4 Manager 5
20160430
20160531
20160630
20160731
20160831
20160930
20161031
20161130
20161231
20170131
20170228
20170331 0.121979182 0.212964432 0.176317288 0.213932804
20170430 0.175724107 0.271996881 0.204161963 0.261212111
20170531 0.161901314 0.241637796 0.204183032 0.240897626
20170630 0.220330851 0.174812396 0.215746067
20170731 0.214381041 0.150728807 0.207606539 0.200188843
20170831 0.186529323 0.191124778 0.185500853 0.174054195
20170930 0.207649992 0.205337395 0.189319163 0.167798654
I've been using the PerformanceAnalytics package, but I'm having some trouble applying the function across each column:
apply.rolling(ManagerReturns, width = 12, trim = FALSE ,FUN = Return.annualized)
apply.rolling is a wrapper around rollapply. For some reason apply.rolling doesn't work correctly with your data, but using rollapply will solve the issue.
Using rollapply I can get close to your expected outcome, with one caveat: Return.annualized removes the NA values and keeps calculating on whatever observations remain. You can see this happening with Manager1 and Manager5. This is not caused by rollapply but by Return.annualized; for example, Return.annualized(my_data$Manager5[1:12]) returns an annualized return of 0.2207884.
ra <- rollapply(my_data, width = 12, FUN = Return.annualized, fill = 0)
Manager1 Manager2 Manager3 Manager4 Manager5
2016-04-30 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-05-31 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-06-30 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-07-31 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-08-31 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-09-30 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-10-31 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-11-30 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2016-12-31 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2017-01-31 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2017-02-28 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
2017-03-31 0.1219792 0.2129644 0.1763173 0.2139328 0.2207884
2017-04-30 0.1757241 0.2719969 0.2041620 0.2612121 0.2409790
2017-05-31 0.1619013 0.2416378 0.2041830 0.2408976 0.2406184
2017-06-30 0.1769613 0.2203309 0.1748124 0.2157461 0.1982881
2017-07-31 0.1682027 0.2143810 0.1507288 0.2076065 0.2001888
2017-08-31 0.1368823 0.1865293 0.1911248 0.1855009 0.1740542
2017-09-30 0.1676742 0.2076500 0.2053374 0.1893192 0.1677987
Now you could do something like ra * !is.na(my_data), which multiplies ra by 0 wherever my_data is NA and so zeroes out the last 4 records of Manager1. But it will not help with Manager5's leading NAs.
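Alternatively, you can make the rolling function itself refuse incomplete windows, so any 12-month window containing an NA returns NA instead of a silently shortened calculation; this handles both Manager1 and Manager5 (a sketch wrapping Return.annualized):
ra_na <- rollapply(my_data, width = 12,
                   FUN = function(r) if (anyNA(r)) NA else Return.annualized(r),
                   fill = NA)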
data:
my_data <- structure(c(0.0152, 0.0462, 7e-04, 0.02, 0.0339, -0.0071, -0.0224,
-0.0439, -0.0051, 0.0083, 0.0211, 0.053, 0.06383, 0.0339, NA,
NA, NA, NA, 0.01007, 0.051524, 0.012683, 0.015881, 0.053198,
0.004754, -0.018193, -0.01316, 0.020013, 0.014656, -0.006712,
0.053219, 0.059228, 0.026427, -0.004695, 0.010928, 0.029043,
0.022639, 0.010221, 0.028749, 0.015641, 0.023954, 0.002117, -0.008808,
0.018141, -0.024303, 0.020421, 0.0247, 0.025753, 0.028395, 0.034134,
0.028767, -0.009131, 0.002963, 0.037296, 0.003019, 0.004616,
0.037492, 0.013082, 0.021495, 0.047616, 0.003154, -0.004828,
-0.006465, 0.016074, 0.020341, 0.002994, 0.041619, 0.043743,
0.020781, -0.007452, 0.014656, 0.028439, 0.006385, NA, NA, NA,
NA, 0.045765, -0.003407, 0.017085, -0.000718, 0.017227, 0.022706,
0.012473, 0.02379, 0.02935, 0.017908, -0.01376, 0.016799, 0.022993,
-0.008717), .Dim = c(18L, 5L), .Dimnames = list(NULL, c("Manager1",
"Manager2", "Manager3", "Manager4", "Manager5")), index = structure(c(1461974400,
1464652800, 1467244800, 1469923200, 1472601600, 1475193600, 1477872000,
1480464000, 1483142400, 1485820800, 1488240000, 1490918400, 1493510400,
1496188800, 1498780800, 1501459200, 1504137600, 1506729600), tzone = "UTC", tclass = "Date"), class = c("xts",
"zoo"), .indexCLASS = "Date", tclass = "Date", .indexTZ = "UTC", tzone = "UTC")
I am doing some clustering of documents using the cosine similarity between each pair of documents. This is fine. However, my problem is a little unusual in that I only want to cluster certain documents with others, not all of the documents against each other. Here's an example...
I have two spreadsheets with 3 labels apiece. I want to cluster the labels that are similar to each other BETWEEN the documents but not within the same document, so for instance
Doc1: has labels: sex and gender, tobacco use years, current age
Doc2: has labels: gender, age now, time of use
I want to cluster the labels between the two documents but not inside the document, so I've created a similarity matrix that looks like this:
d1_l1 d1_l2 d1_l3 d2_l1 d2_l2 d2_l3
d1_l1 1.0000000 NA NA 0.5773503 0.0 0.0000000
d1_l2 NA 1.0000000 NA 0.0000000 0.0 0.3333333
d1_l3 NA NA 1.0 0.0000000 0.5 0.0000000
d2_l1 0.5773503 0.0000000 0.0 1.0000000 NA NA
d2_l2 0.0000000 0.0000000 0.5 NA 1.0 NA
d2_l3 0.0000000 0.3333333 0.0 NA NA 1.0000000
where the cosine similarity between labels in the same document is set to NA. The problem is that agnes and other hierarchical clustering methods don't accept NA values. So what should I do? Am I thinking about this the wrong way?
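For concreteness, the within-document masking described above can be expressed directly; a sketch assuming sim is the full 6 x 6 cosine-similarity matrix with rows and columns ordered d1_l1..d1_l3, d2_l1..d2_l3:
doc <- c(1, 1, 1, 2, 2, 2)          # which document each label comes from
same_doc <- outer(doc, doc, "==")   # TRUE for pairs of labels from the same document
diag(same_doc) <- FALSE             # keep the 1s on the diagonal
sim[same_doc] <- NA                 # blank out within-document similarities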