I was creating a summary for an article and came across the following behaviour that I cannot understand. Two columns of the data frame report the min and max pressure as follows:
a <- c(80, 80, 80, 80, 80, 80, 80, 80, 70, 70, 75, 75, 70, 65, 60, 80, 75, 70, 80, 80, 80, 80, 80, 70, 80, 70, 75, 80, 70, 65, 70, 75, 70, 75, 80, 65, 85, 75, 70, 70, 70, 75, 80, 80, 70, 70, 80, 70, 80, 60, 80, 80, 70, 70, 85, 70, 70, 80, 70, 70, 75, 75, 70, 70, 70,
70, 70, 80, 80, 70)
b <- c(120, 120, 120, 120, 120, 120, 120, 120, 120, 125, 120, 135, 130, 120, 115, 110, 125, 120, 130, 125, 110, 120, 130, 110, 125, 130, 105, 100, 110, 110, 130, 120, 110, 120, 135, 125, 145, 135, 130, 110, 115, 145, 120, 125, 100, 120, 120, 130,
115, 120, 110, 160, 120, 130, 155, 125, 135, 155, 110, 130, 145, 155, 125, 130, 140, 110, 100, 150, 130, 130)
library(dplyr)      # for %>%
library(gtsummary)  # for tbl_summary()
pressure <- data.frame(a, b)
str(pressure)
pressure %>% tbl_summary()
and the result is the following. So for b I got the expected behaviour, while a is formatted as categorical, I guess. No matter what change I made (forcing as double, adding decimals, etc.), nothing worked to make a formatted like b. If I shorten the vectors the behaviour is different and they both look the same.
I've also forced the output with
pressure %>% tbl_summary(statistic = list(all_continuous() ~ "{mean} ({sd})"))
but I keep getting the same results.
Any help appreciated
It appears to be the default behavior of tbl_summary() to interpret any numeric variables with fewer than 10 unique values as categorical. You can observe that when running the following:
library(tidyverse)
library(gtsummary)
d <- map_dfc(8:12, \(x) rep(1:x, length.out = 100)) |>
  set_names(letters[1:5])
d |>
  tbl_summary()
This behavior can be overridden by specifying the type of the problematic variables:
d |>
  tbl_summary(type = list(c(a, b, c) ~ "continuous"))
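Applied to the pressure data frame from the question, only a falls below the threshold (6 unique values versus 13 for b), so forcing its type restores the continuous summary:
pressure %>% tbl_summary(type = list(a ~ "continuous"))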
Part 2 Boston
plot(boston, ylab = "Boston crime data", xlab = "Time")
#Time series seem to have homogeneous variance upon visual inspection
#Q2
#Trend looks linear in the plot, so for trend differencing operator take d=1
newboston = as.numeric(unlist(boston))
xdiff = diff(newboston)
plot(xdiff)
#Q3
#ADF
library(tseries)
adf.test(xdiff)
#The ADF p-value is small, so the null hypothesis of a unit root is rejected in favour of stationarity
#KPSS test
install.packages('fpp3', dependencies = TRUE)
library(fpp3)
unitroot_kpss(xdiff)
#The KPSS p-value is > 0.05, so we fail to reject the KPSS null hypothesis of stationarity
#Q4
library(astsa)
acf2(xdiff, max.lag = 50)
model1 = sarima(xdiff, p, 1, q) # p and q still to be determined (see below)
So this is what I have tried so far. I am quite new to R, so do be kind if my workings make little sense. For context, boston is the data I imported from an Excel file; it is simply a column of x-axis data.
Firstly, I am trying to do Q4, but I am not sure how I would go about finding p and q.
Second, I am unsure whether what I did in Q2 to detrend my data is correct in the first place.
Here is the output of dput(boston)
dput(boston)
structure(list(x = c(41, 39, 50, 40, 43, 38, 44, 35, 39, 35,
29, 49, 50, 59, 63, 32, 39, 47, 53, 60, 57, 52, 70, 90, 74, 62,
55, 84, 94, 70, 108, 139, 120, 97, 126, 149, 158, 124, 140, 109,
114, 77, 120, 133, 110, 92, 97, 78, 99, 107, 112, 90, 98, 125,
155, 190, 236, 189, 174, 178, 136, 161, 171, 149, 184, 155, 276,
224, 213, 279, 268, 287, 238, 213, 257, 293, 212, 246, 353, 339,
308, 247, 257, 322, 298, 273, 312, 249, 286, 279, 309, 401, 309,
328, 353, 354, 327, 324, 285, 243, 241, 287, 355, 460, 364, 487,
452, 391, 500, 451, 375, 372, 302, 316, 398, 394, 431, 431),
y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75,
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
117, 118)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-118L))
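For what it's worth, candidate orders are usually read off the acf2() output above: the lag where the PACF cuts off suggests p, and where the ACF cuts off suggests q. Alternatively, forecast::auto.arima() can propose a starting order; a minimal sketch, assuming the series of interest is the x column of boston:
library(forecast)
x <- boston$x                # the crime counts; y is just a time index
fit <- auto.arima(x, d = 1)  # force one difference, matching the linear trend
summary(fit)                 # the selected ARIMA(p,1,q) gives candidate p and q
One caveat on Q2: sarima(xdiff, p, 1, q) differences the already-differenced xdiff a second time; to fit on xdiff, the middle argument should be 0 (or pass the raw series with d = 1).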
I'm working with a PCA problem where I have 3 variables and I reduce them to 2 by doing PCA. I've already plotted all the points in 3D using scatter3D. My question is: how can I plot the plane determined by two vectors (the first two eigenvectors of the sample covariance matrix) in R?
This is what I have so far
library(plot3D)
X <- matrix(c(55, 75, 110,
47, 69, 108,
42, 71, 110,
48, 74, 114,
47, 75, 114,
52, 73, 104,
49, 72, 106,
44, 67, 107,
52, 73, 108,
45, 73, 111,
50, 80, 117,
50, 71, 110,
48, 75, 114,
51, 73, 106,
44, 66, 102,
42, 71, 112,
50, 68, 107,
48, 70, 108,
51, 72, 108,
52, 73, 109,
49, 72, 112,
49, 73, 108,
46, 70, 105,
39, 66, 100,
50, 76, 108,
52, 71, 108,
56, 75, 108,
53, 70, 112,
53, 72, 110,
49, 74, 113,
51, 72, 109,
55, 74, 110,
56, 75, 110,
62, 79, 118,
58, 77, 115,
50, 71, 105,
52, 67, 104,
52, 73, 107,
56, 73, 106,
55, 78, 118,
53, 68, 103), ncol = 3,nrow = 41,byrow = TRUE)
S <- cov(X)
Gamma <- eigen(S)$vectors
scatter3D(X[, 1], X[, 2], X[, 3], pch = 18, bty = "u", colkey = FALSE,
          main = "bty = 'u'", col.panel = "gray", expand = 0.4,
          col.grid = "white", ticktype = "detailed",
          phi = 25, theta = 45)
pc <- scale(X, center = TRUE, scale = FALSE) %*% Gamma[, c(1, 2)]
Now I would like to plot the plane using scatter3D.
Perhaps this will do, using the iris data. It uses scatter3d() from package car, which can add a regression surface to a 3D plot:
library(car)
data(iris)
iris.pr <- prcomp(iris[, 1:3], scale.=TRUE)
# Draw 3d plot with surface and color points by species
scatter3d(PC3 ~ PC1 + PC2, data = as.data.frame(iris.pr$x), point.col = c(rep(2, 50), rep(3, 50), rep(4, 50)))
This plots a regression surface predicting PC3 from PC1 and PC2. By definition the correlation between any two principal components is zero, so the surface should be PC3 = 0 for any values of PC1 and PC2, but I don't see a way to produce exactly that surface. It is pretty close, though.
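Alternatively, staying in plot3D, the plane spanned by the first two eigenvectors can be drawn exactly: evaluate centroid + u*Gamma[,1] + v*Gamma[,2] over a grid of (u, v) and overlay it on the scatter with surf3D(..., add = TRUE). A minimal sketch under those assumptions, reusing X, Gamma, and pc from the question (grid extents chosen from the score range):
library(plot3D)
ctr <- colMeans(X)   # the plane passes through the centroid of the data
rng <- range(pc)     # spread of the first two scores sets the grid extent
M <- mesh(seq(rng[1], rng[2], length.out = 10),
          seq(rng[1], rng[2], length.out = 10))
px <- ctr[1] + M$x * Gamma[1, 1] + M$y * Gamma[1, 2]
py <- ctr[2] + M$x * Gamma[2, 1] + M$y * Gamma[2, 2]
pz <- ctr[3] + M$x * Gamma[3, 1] + M$y * Gamma[3, 2]
surf3D(px, py, pz, add = TRUE, col = "lightblue", alpha = 0.4, colkey = FALSE)
Run this right after the scatter3D() call from the question so the surface lands on the same plot.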
I'm trying to use a range (160:280) instead of '160', '161' and so on. How would I do that?
group_by(disp = fct_collapse(as.character(disp), Group1 = c(160:280), Group2 = c(281:400)) %>%
summarise(meanHP = mean(hp)))
Error: Problem adding computed columns in `group_by()`.
x Problem with `mutate()` column `disp`.
i `disp = `%>%`(...)`.
x Each input to fct_recode must be a single named string. Problems at positions: 1, 2, 3, … [long list of positions truncated]
For a range of values it is better to use cut, where you can define breaks and labels.
library(dplyr)
library(forcats)
mtcars %>%
  group_by(disp = cut(disp, c(0, 160, 280, 400, Inf), paste0('Group', 1:4))) %>%
  summarise(meanHP = mean(hp))
#   disp   meanHP
#   <fct>   <dbl>
# 1 Group1    93.1
# 2 Group2   143
# 3 Group3   217.
# 4 Group4   217.
So here (0, 160] becomes 'Group1', (160, 280] becomes 'Group2', and so on.
With fct_collapse you can do -
mtcars %>%
  group_by(disp = fct_collapse(as.character(disp), Group1 = as.character(160:280), Group2 = as.character(281:400))) %>%
  summarise(meanHP = mean(hp)) %>%
  suppressWarnings()
However, this works only for exact values that are present, so 160 would fall in Group1 but 160.1 would not.
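A quick illustration of that limitation, using a hypothetical value:
fct_collapse(factor("160.1"), Group1 = as.character(160:280))
# [1] 160.1 -- stays as its own level, since "160.1" matches none of the listed strings
# (with warnings about mapping levels that do not occur in the factor)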
We could also do
library(dplyr)
library(stringr)
mtcars %>%
  group_by(disp = cut(disp, c(0, 160, 280, 400, Inf), str_c('Group', 1:4))) %>%
  summarise(meanHP = mean(hp))
I am trying to remove certain rows from lexicon::hash_valence_shifters in the sentimentr package. Specifically, I want to keep only these rows:
keep <- c(1, 2, 3, 6, 7, 13, 14, 16, 19, 24, 25, 26, 27, 28, 36, 38, 39, 41, 42, 43, 45, 46, 53, 54, 55, 56, 57, 59, 60, 65, 70, 71, 73, 74, 79, 84, 85, 87, 88, 89, 94, 95, 96, 97, 98, 99, 100, 102, 103, 104, 105, 106, 107, 114, 115, 119, 120, 123, 124, 125, 126, 127, 128, 129, 135, 136, 138)
I have tried the below approach:
vsm = lexicon::hash_valence_shifters[keep, ]
vsm[, y := as.numeric(y)]
vsm = sentimentr::as_key(vsm, comparison = NULL, sentiment = FALSE)
sentimentr::is_key(vsm)
vsn = sentimentr::update_valence_shifter_table(vsm, drop = c(dropvalue$x), x= lexicon::hash_valence_shifters, sentiment = FALSE, comparison = TRUE )
However, when I am calculating the sentiment using the updated valence shifter table "vsn", it is giving the sentiment as 0.
Can someone please let me know how to keep just specific rows of the valence shifter table?
Thanks!
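For reference, a minimal sketch that only subsets and rebuilds the key, skipping update_valence_shifter_table() entirely (keep is the index vector defined above; valence_shifters_dt is the sentiment() argument that takes this table):
library(data.table)
library(sentimentr)
vsm <- data.table::copy(lexicon::hash_valence_shifters)[keep]
vsm[, y := as.numeric(y)]
vsm <- as_key(vsm, comparison = NULL, sentiment = FALSE)
is_key(vsm)  # should be TRUE
sentiment("I am not happy", valence_shifters_dt = vsm)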
I have a dataset of environmental variables I would like to use for a GLMM. I am using the corvif function from the AED package (http://www.highstat.com/Book2/AED_1.0.zip) to identify and remove variables with high variance inflation factors.
Instead of manually removing one variable at a time from my dataset with GVIF values > 3 (highest value removed first), I would like to know how to write a loop that accomplishes this task automatically, with the result being a new dataset containing only the remaining variables (i.e. those with GVIF values < 3).
Any suggestions for how to approach this problem for a new R user?
Here is my sample data:
WW_Covs <- structure(list(Latitude = c(62.4419, 67.833333, 65.95, 63.72935,
60.966667, 60.266667, 55.660455, 62.216667, 61.3, 61.4, 62.084139,
55.662566, 64.48508, 63.208354, 62.87591, 62.70856, 62.64009,
63.79488, 59.55, 62.84206), BIO_02 = c(87, 82, 75, 70, 77, 70,
59, 84, 84, 79, 85, 60, 91, 87, 74, 74, 76, 70, 76, 74), BIO_03 = c(26,
23, 25, 26, 25, 24, 25, 25, 26, 25, 26, 26, 24, 25, 24, 25, 25,
25, 26, 24), BIO_04 = c(8443, 9219, 7594, 6939, 7928, 7593, 6160,
8317, 8167, 7972, 8323, 6170, 9489, 8578, 7814, 7680, 7904, 7149,
7445, 7803), BIO_05 = c(201, 169, 151, 166, 194, 210, 202, 205,
204, 186, 205, 200, 200, 195, 170, 154, 180, 166, 219, 170),
BIO_06 = c(-131, -183, -144, -102, -107, -75, -26, -119,
-113, -120, -120, -28, -169, -143, -131, -142, -124, -111,
-72, -129), BIO_08 = c(128, 109, 85, 78, 122, 145, 153, 134,
130, 126, 132, 152, 120, 119, 115, 98, 124, 104, 147, 115
), BIO_09 = c(-31, -81, -16, 13, -60, -6, 25, -25, -25, -70,
-25, 23, -56, -39, -47, -60, -39, 8, 0, -46), BIO_12 = c(667,
481, 760, 970, 645, 557, 645, 666, 652, 674, 670, 670, 568,
598, 650, 734, 620, 868, 571, 658), BIO_13 = c(78, 77, 96,
109, 85, 70, 67, 77, 84, 93, 78, 68, 72, 78, 93, 99, 90,
96, 72, 93), BIO_15 = c(23, 40, 25, 21, 36, 30, 21, 24, 28,
34, 24, 22, 28, 29, 34, 32, 36, 22, 30, 34), BIO_19 = c(147,
85, 180, 236, 108, 119, 154, 149, 135, 118, 148, 162, 117,
119, 120, 141, 111, 204, 111, 122)), .Names = c("Latitude",
"BIO_02", "BIO_03", "BIO_04", "BIO_05", "BIO_06", "BIO_08", "BIO_09",
"BIO_12", "BIO_13", "BIO_15", "BIO_19"), row.names = c(1:20), class = "data.frame")
Sample code:
library(AED)
WW_Final <- corvif(WW_Covs)
test <- corvif(WW_Covs)
test[order(-test$GVIF), ]
if(test$GVIF[1,] > 3, # this is where I get stuck...
Here is an algorithm for doing this. I illustrate with the built-in dataset longley, and I also use function vif in package car, rather than using package AED:
It's not pretty, and should be wrapped inside a function, but I leave that as an exercise for the interested reader.
The code:
library(car)
dat <- longley
cutoff <- 2
flag <- TRUE
while(flag){
  fit <- lm(Employed ~ ., data = dat)
  vfit <- vif(fit)
  if(max(vfit) > cutoff){
    # drop the predictor with the largest VIF and refit
    # (works here because the response, Employed, is the last column of longley)
    dat <- dat[, -which.max(vfit)]
  } else {
    flag <- FALSE
  }
}
print(fit)
print(vfit)
The output:
Call:
lm(formula = Employed ~ ., data = dat)
Coefficients:
 (Intercept)    Unemployed  Armed.Forces
    50.66281       0.02265       0.02847

  Unemployed Armed.Forces
    1.032501     1.032501
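To adapt this to the question's setting, where there is no response variable, the same loop can call corvif() directly on the shrinking covariate set. A sketch, assuming corvif() returns a data frame with a GVIF column (which the AED code appears to do) and using the cutoff of 3 from the question:
library(AED)  # for corvif()
dat <- WW_Covs
cutoff <- 3
repeat {
  gvif <- corvif(dat)$GVIF        # assumed return shape: data frame with column GVIF
  if (max(gvif) < cutoff) break   # stop once every GVIF is below the cutoff
  dat <- dat[, -which.max(gvif)]  # drop the worst offender and recompute
}
WW_Final <- dat                   # only covariates with GVIF < 3 remain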