tidyr::unite drops decimal 0 - r

I want to combine numbers from two and two columns within a data frame (values in the columns are the upper and lower values for confidence intervals in statistical analysis).
My preferred method would be to use tidyr and the unite function. But take 0.20 as an example: that number gets modified to 0.2, i.e. the last decimal is dropped when it equals zero. Is there any way to keep the original format when using unite?
unite is described here: https://www.rdocumentation.org/packages/tidyr/versions/0.8.2/topics/unite
Example:
# Dataframe
df <- structure(list(est = c(0.05, -0.16, -0.02, 0, -0.11, 0.15, -0.26,
-0.23), low2.5 = c(0.01, -0.2, -0.05, -0.03, -0.2, 0.1, -0.3,
-0.28), up2.5 = c(0.09, -0.12, 0, 0.04, -0.01, 0.2, -0.22, -0.17
)), row.names = c(NA, 8L), class = "data.frame")
Combining (uniting) the confidence-interval columns with unite, using a comma as a separator:
library(tidyr)
df <- unite(df, "CI", c("low2.5", "up2.5"), sep = ", ", remove=T)
gives
df
est CI
1 0.05 0.01, 0.09
2 -0.16 -0.2, -0.12
3 -0.02 -0.05, 0
4 0.00 -0.03, 0.04
5 -0.11 -0.2, -0.01
6 0.15 0.1, 0.2
7 -0.26 -0.3, -0.22
8 -0.23 -0.28, -0.17
I would want this:
est CI
1 0.05 0.01, 0.09
2 -0.16 -0.20, -0.12
3 -0.02 -0.05, 0.00
4 0.00 -0.03, 0.04
5 -0.11 -0.20, -0.01
6 0.15 0.10, 0.20
7 -0.26 -0.30, -0.22
8 -0.23 -0.28, -0.17
I believe doing this with base R would be complicated (having to move/rearrange the many combined columns and delete the old ones). Is there any way to stop unite from dropping trailing zero decimals?

This works:
library(tidyverse)
df %>%
mutate_if(is.numeric, ~format(., nsmall = 2)) %>%
unite("CI", c("low2.5", "up2.5"), sep = ", ", remove=T)
# est CI
#1 0.05 0.01, 0.09
#2 -0.16 -0.20, -0.12
#3 -0.02 -0.05, 0.00
#4 0.00 -0.03, 0.04
#5 -0.11 -0.20, -0.01
#6 0.15 0.10, 0.20
#7 -0.26 -0.30, -0.22
#8 -0.23 -0.28, -0.17
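If only the two CI columns should be formatted (keeping est numeric), a base R sketch also works: print the interval bounds with sprintf before pasting them together. This uses the same column names as the example above; "%.2f" always prints two decimals, so trailing zeros survive.

```r
# Same data frame as in the question
df <- structure(list(est = c(0.05, -0.16, -0.02, 0, -0.11, 0.15, -0.26,
                             -0.23),
                     low2.5 = c(0.01, -0.2, -0.05, -0.03, -0.2, 0.1, -0.3,
                                -0.28),
                     up2.5 = c(0.09, -0.12, 0, 0.04, -0.01, 0.2, -0.22,
                               -0.17)),
                row.names = c(NA, 8L), class = "data.frame")

# sprintf("%.2f", ...) keeps trailing zeros, so 0.2 becomes "0.20"
df$CI <- sprintf("%.2f, %.2f", df$low2.5, df$up2.5)
df <- df[, c("est", "CI")]
```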

Related

Can I use a correlation test function with a correlation matrix as an input?

I am trying to apply a correlation test function to a correlation matrix and I'm having no success. I don't have the raw data, just this matrix, which is why it is important to find a way to do this.
My data looks like this:
PM10 T Tmax Tmin P H PT V Vmax
PM10 1.00 -0.41 -0.26 -0.55 0.37 -0.13 -0.25 -0.27 -0.22
T -0.41 1.00 0.95 0.87 -0.18 -0.28 -0.01 -0.14 -0.05
Tmax -0.26 0.95 1.00 0.70 -0.08 -0.41 -0.09 -0.23 -0.08
Tmin -0.55 0.87 0.70 1.00 -0.30 0.07 0.14 -0.03 -0.01
P 0.37 -0.18 -0.08 -0.30 1.00 -0.18 -0.13 -0.29 -0.25
H -0.13 -0.28 -0.41 0.07 -0.18 1.00 0.32 -0.15 -0.19
PT -0.25 -0.01 -0.09 0.14 -0.13 0.32 1.00 0.11 0.07
V -0.27 -0.14 -0.23 -0.03 -0.29 -0.15 0.11 1.00 0.83
Vmax -0.22 -0.05 -0.08 -0.01 -0.25 -0.19 0.07 0.83 1.00
cor.test() doesn't accept a matrix as input and I can't seem to find any other way to do this.
Bartlett's test tests a correlation matrix against the identity, based on a chi-squared approximation to the distribution of the determinant of the correlation matrix. The sample size n should be supplied, but if not it will assume 100 -- see the warning below.
If the p-value is less than 0.05, say, then we can reject the null hypothesis that the correlation matrix is an identity matrix. Apply it to an identity matrix, cortest.bartlett(diag(9)), and to a matrix of all ones, cortest.bartlett(1 + 0*R), to see the difference.
library(psych)
cortest.bartlett(R)
giving:
$chisq
[1] 777.2996
$p.value
[1] 5.038841e-140
$df
[1] 36
Warning message:
In cortest.bartlett(R) : n not specified, 100 used
Note
R <- structure(c(1, -0.41, -0.26, -0.55, 0.37, -0.13, -0.25, -0.27,
-0.22, -0.41, 1, 0.95, 0.87, -0.18, -0.28, -0.01, -0.14, -0.05,
-0.26, 0.95, 1, 0.7, -0.08, -0.41, -0.09, -0.23, -0.08, -0.55,
0.87, 0.7, 1, -0.3, 0.07, 0.14, -0.03, -0.01, 0.37, -0.18, -0.08,
-0.3, 1, -0.18, -0.13, -0.29, -0.25, -0.13, -0.28, -0.41, 0.07,
-0.18, 1, 0.32, -0.15, -0.19, -0.25, -0.01, -0.09, 0.14, -0.13,
0.32, 1, 0.11, 0.07, -0.27, -0.14, -0.23, -0.03, -0.29, -0.15,
0.11, 1, 0.83, -0.22, -0.05, -0.08, -0.01, -0.25, -0.19, 0.07,
0.83, 1), .Dim = c(9L, 9L), .Dimnames = list(c("PM10", "T", "Tmax",
"Tmin", "P", "H", "PT", "V", "Vmax"), c("PM10", "T", "Tmax",
"Tmin", "P", "H", "PT", "V", "Vmax")))
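For reference, the statistic behind this is the standard Bartlett sphericity formula, which can be sketched in base R without the psych package. The formula below is the usual textbook definition and is assumed, not taken from the answer; compare against cortest.bartlett if you rely on it.

```r
# Bartlett's test of sphericity: compares a correlation matrix R against
# the identity using chisq = -((n-1) - (2p+5)/6) * log(det(R)),
# with df = p*(p-1)/2 degrees of freedom
bartlett_sphericity <- function(R, n = 100) {
  p     <- ncol(R)
  chisq <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
  df    <- p * (p - 1) / 2
  list(chisq   = chisq,
       p.value = pchisq(chisq, df, lower.tail = FALSE),
       df      = df)
}

# Sanity check on a 9x9 identity matrix: det = 1, so chisq = 0 and p = 1
bartlett_sphericity(diag(9))
```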

Winsorize function: Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) : undefined columns selected

I want to winsorize my data, which looks like the following (134 observations in total):
company id rev size age
1 Adeg 29.9 0.66 160 45
2 Agrana 32.0 2.80 9191 29
3 Allianz 36.5 87.75 142460 128
4 Andritz 34.0 6.89 29096 118
5 Apple 41.0 259.65 132000 41
To use the Winsorize function from the DescTools package, I created a single numeric vector of the variable rev, simply by using the select function: rev_vector <- select(data1, -...)
I then ran the function as follows, which gives me an error:
> Winsorize(rev_vector)
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) :
undefined columns selected
Is this caused by my passing a data.frame instead of a vector?
Alternatively, I tried the following:
> Winsorize(rev_vector$rev, probs = c(0.05, 0.95))
[1] 0.66 2.80 87.75 6.89 134.73 0.09 22.78 1.36 5.48 0.70 0.79 0.35 31.37 0.55 0.94 0.06
[17] 12.36 13.58 7.95 0.29 7.80 0.39 73.55 0.09 23.07 0.27 0.32 0.08 0.05 0.41 29.47 0.66
[33] 20.91 0.67 0.05 1.39 0.17 0.14 1.79 0.05 2.52 3.68 0.24 0.09 109.65 8.43 0.20 0.17
[49] 35.93 3.05 0.07 0.05 0.82 0.57 26.21 0.28 0.05 5.72 6.12 4.09 0.05 0.22 134.73 94.43
[65] 41.35 0.20 17.32 5.63 3.25 0.12 0.05 0.07 10.89 3.79 1.89 134.73 9.98 10.58 54.98 134.73
[81] 15.55 15.21 5.93 42.65 1.59 3.00 11.19 6.10 0.08 134.73 31.37 17.74 20.92 6.46 3.18 0.05
[97] 0.81 9.15 29.47 0.05 1.34 7.97 109.65 28.45 35.93 0.38 0.65 134.73 9.44 8.66 5.30 11.83
[113] 20.06 29.55 1.15 2.32 46.14 134.73 9.98 10.58 11.05 54.98 134.73 15.55 15.21 5.93 1.59 1.03
[129] 3.00 11.19 6.10
I am not sure what this outcome means. I don't think the winsorizing actually worked: looking at the summary of the vector, summary(rev_vector$rev), it is unchanged from before winsorizing.
Can somebody help me out here? Thanks!
You are almost there; it's only that you chose restrictive probs for the quantiles. Your vector already has a considerable number of equal values at its edges. Has it perhaps already been winsorized before?
library(DescTools)
x <- c(0.66, 2.8, 87.75, 6.89, 134.73, 0.09, 22.78, 1.36,
5.48, 0.7, 0.79, 0.35, 31.37, 0.55, 0.94, 0.06, 12.36, 13.58,
7.95, 0.29, 7.8, 0.39, 73.55, 0.09, 23.07, 0.27, 0.32, 0.08,
0.05, 0.41, 29.47, 0.66, 20.91, 0.67, 0.05, 1.39, 0.17, 0.14,
1.79, 0.05, 2.52, 3.68, 0.24, 0.09, 109.65, 8.43, 0.2, 0.17,
35.93, 3.05, 0.07, 0.05, 0.82, 0.57, 26.21, 0.28, 0.05, 5.72,
6.12, 4.09, 0.05, 0.22, 134.73, 94.43, 41.35, 0.2, 17.32, 5.63,
3.25, 0.12, 0.05, 0.07, 10.89, 3.79, 1.89, 134.73, 9.98, 10.58,
54.98, 134.73, 15.55, 15.21, 5.93, 42.65, 1.59, 3, 11.19, 6.1,
0.08, 134.73, 31.37, 17.74, 20.92, 6.46, 3.18, 0.05, 0.81, 9.15,
29.47, 0.05, 1.34, 7.97, 109.65, 28.45, 35.93, 0.38, 0.65, 134.73,
9.44, 8.66, 5.3, 11.83, 20.06, 29.55, 1.15, 2.32, 46.14, 134.73,
9.98, 10.58, 11.05, 54.98, 134.73, 15.55, 15.21, 5.93, 1.59,
1.03, 3, 11.19, 6.1)
summary() is somewhat coarse in this case.
summary(Winsorize(x))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.05 0.48 5.48 19.73 17.53 134.73
Using Desc() gives you a more detailed idea of what's going on in your data.
Desc(Winsorize(x))
# -----------------------------------------------------
# Winsorize(x) (numeric)
#
# length n NAs unique 0s mean meanCI
# 131 131 0 95 0 19.73 13.53
# 100.0% 0.0% 0.0% 25.92
#
# .05 .10 .25 median .75 .90 .95
# 0.05 0.08 0.48 5.48 17.53 54.98 134.73
#
# range sd vcoef mad IQR skew kurt
# 134.68 35.84 1.82 7.87 17.05 2.35 4.42
#
# lowest : 0.05 (9), 0.06, 0.07 (2), 0.08 (2), 0.09 (3)
# highest: 73.55, 87.75, 94.43, 109.65 (2), 134.73 (8)
You can see that the value 0.05 occurs 9 times and the value 134.73 occurs 8 times. So the quantiles with probs 0.05 and 0.95 are the same as the extremes, and the winsorized vector remains identical to the original one.
quantile(x=x, probs=c(0.05, 0.95))
# 5% 95%
# 0.05 134.73
Simply increase the probs to say c(0.1, 0.9) and you'll see the effect.
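What Winsorize does with those probs can be sketched in base R by clipping at the sample quantiles (the default quantile type here may differ slightly from DescTools'; this is an illustration, not the package's exact implementation):

```r
# Clip values below the 10% quantile and above the 90% quantile --
# a base R sketch of Winsorize(x, probs = c(0.1, 0.9))
winsorize_base <- function(x, probs = c(0.1, 0.9)) {
  q <- quantile(x, probs = probs, na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}

# First ten values of the vector from the question
x <- c(0.66, 2.8, 87.75, 6.89, 134.73, 0.09, 22.78, 1.36, 5.48, 0.7)
range(winsorize_base(x))  # extremes are pulled in to the quantiles
```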
PS: Winsorize() needs a vector as its argument and can't handle data.frames. (This is also described in the help file…)
PPS: a reproducible example would help… ;-)

Select multiple columns from ordered dataframe

I would like to calculate the mean value for each of my variables, and then I would like to create a list of the names of variables with the 3 largest mean values.
I will then use this list to subset my dataframe and will only include the 3 selected variables in additional analysis.
I'm close, but can't quite seem to write the code efficiently. And I'm trying to use pipes for the first time.
Here is a simplified dataset.
FA1 <- c(0.68, 0.79, 0.65, 0.72, 0.79, 0.78, 0.77, 0.67, 0.77, 0.7)
FA2 <- c(0.08, 0.12, 0.07, 0.13, 0.09, 0.12, 0.13, 0.08, 0.17, 0.09)
FA3 <- c(0.1, 0.06, 0.08, 0.09, 0.06, 0.08, 0.09, 0.09, 0.06, 0.08)
FA4 <- c(0.17, 0.11, 0.19, 0.13, 0.14, 0.14, 0.13, 0.16, 0.11, 0.16)
FA5 <- c(2.83, 0.9, 3.87, 1.55, 1.91, 1.46, 1.68, 2.5, 3.0, 1.45)
df <- data.frame(FA1, FA2, FA3, FA4, FA5)
And here is the piece of code I've written that doesn't quite get me what I want.
colMeans(df) %>% rank()
First identify the three columns with the highest means. I use colMeans to calculate the column means. I then sort the means by decreasing order and only keep the first three, which are the three largest.
three <- sort(colMeans(df), decreasing = TRUE)[1:3]
Then, keep only those columns.
df[, names(three)]
FA5 FA1 FA4
1 2.83 0.68 0.17
2 0.90 0.79 0.11
3 3.87 0.65 0.19
4 1.55 0.72 0.13
5 1.91 0.79 0.14
6 1.46 0.78 0.14
7 1.68 0.77 0.13
8 2.50 0.67 0.16
9 3.00 0.77 0.11
10 1.45 0.70 0.16
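Since the question mentions pipes: the same selection can be written as one pipeline with the native |> pipe (R >= 4.1). This is just the answer above rearranged, not a different method.

```r
FA1 <- c(0.68, 0.79, 0.65, 0.72, 0.79, 0.78, 0.77, 0.67, 0.77, 0.7)
FA2 <- c(0.08, 0.12, 0.07, 0.13, 0.09, 0.12, 0.13, 0.08, 0.17, 0.09)
FA3 <- c(0.1, 0.06, 0.08, 0.09, 0.06, 0.08, 0.09, 0.09, 0.06, 0.08)
FA4 <- c(0.17, 0.11, 0.19, 0.13, 0.14, 0.14, 0.13, 0.16, 0.11, 0.16)
FA5 <- c(2.83, 0.9, 3.87, 1.55, 1.91, 1.46, 1.68, 2.5, 3.0, 1.45)
df <- data.frame(FA1, FA2, FA3, FA4, FA5)

# names of the three columns with the largest means, then subset
top3 <- colMeans(df) |> sort(decreasing = TRUE) |> names() |> head(3)
df[, top3]
```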

Why are my columns getting repeated when I call the function a second time?

I am using this function to create n-day accumulations of a time series.
library(zoo)  # for rollapply

masterfunction <- function(df, accum_var, accum_days) {
  cbind(df, rollapply(df[[accum_var]], accum_days, sum, fill = NA, align = 'right'))
}
odd<-masterfunction(df=odd,accum_var = "PRECIP",accum_days = 4)
odd<-masterfunction(df=odd,accum_var = "OBS_Q",accum_days = 2)
But when I run it the second time (for OBS_Q), the first column (for PRECIP) gets repeated instead. Any pointers on how I can fix this? Also, any suggestions on how I could improve this code so that I can give the function a list of variables (accum_var) instead of calling it again and again?
odd<-structure(list(DATE = 19630101:19630106, PRECIP = c(0, 0, 0,
0, 0, 0.04), OBS_Q = c(1.61, 1.48, 1.4, 1.33, 1.28, 1.27), swb = c(1.75,
1.73, 1.7, 1.67, 1.65, 1.63), gr4j = c(1.9, 1.77, 1.67, 1.58,
1.51, 1.44), isba = c(0.83, 0.83, 0.83, 0.83, 0.83, 0.83), noah = c(1.31,
1.19, 1.24, 1.31, 1.44, 1.55), sac = c(1.99, 1.8, 1.66, 1.57,
1.46, 1.41), swap = c(1.1, 1.05, 1.08, 0.99, 0.88, 0.83), vic.mm.day. = c(2.1,
1.75, 1.55, 1.43, 1.32, 1.17)), .Names = c("DATE", "PRECIP",
"OBS_Q", "swb", "gr4j", "isba", "noah", "sac", "swap", "vic.mm.day."
), row.names = 366:371, class = "data.frame")
Thanks!
This should do the trick.
New function that performs the same operation as your masterfunction:
masterfunction2 <- function(accum_var, df = odd, suffix = "_new") {
  j <- data.frame(rollapply(data = df[, accum_var[1]],
                            width = as.numeric(accum_var[2]),
                            FUN = sum, fill = NA,
                            align = 'right'))
  names(j) <- paste0(accum_var[1], suffix)
  return(j)
}
Use a list as your input: each element is a column name followed by the width parameter.
i = list(c("PRECIP", 4),
c("PRECIP", 2),
c("OBS_Q", 2),
c("noah", 3))
Get the output. Use suffix to change your new column names, and df to change your data frame name.
cbind(odd, do.call(cbind, sapply(X = i, FUN = masterfunction2, df = odd, suffix = "_roll")))
DATE PRECIP OBS_Q swb gr4j isba noah sac swap vic.mm.day. PRECIP_roll OBS_Q_roll noah_roll
366 19630101 0.00 1.61 1.75 1.90 0.83 1.31 1.99 1.10 2.10 NA NA NA
367 19630102 0.00 1.48 1.73 1.77 0.83 1.19 1.80 1.05 1.75 NA 3.09 NA
368 19630103 0.00 1.40 1.70 1.67 0.83 1.24 1.66 1.08 1.55 NA 2.88 3.74
369 19630104 0.00 1.33 1.67 1.58 0.83 1.31 1.57 0.99 1.43 0.00 2.73 3.74
370 19630105 0.00 1.28 1.65 1.51 0.83 1.44 1.46 0.88 1.32 0.00 2.61 3.99
371 19630106 0.04 1.27 1.63 1.44 0.83 1.55 1.41 0.83 1.17 0.04 2.55 4.30
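If you'd rather avoid zoo entirely, a base R sketch of the same right-aligned n-day sums uses stats::filter, looping over name/width pairs so each new column gets a proper name (the _roll suffix and the two-column data frame here are just illustrative):

```r
# Right-aligned rolling sum: the first width-1 values are NA,
# matching rollapply(..., align = "right", fill = NA)
roll_sum <- function(x, width) {
  as.numeric(stats::filter(x, rep(1, width), sides = 1))
}

# Two columns from the example data
odd <- data.frame(PRECIP = c(0, 0, 0, 0, 0, 0.04),
                  OBS_Q  = c(1.61, 1.48, 1.4, 1.33, 1.28, 1.27))

# Each spec is c(column name, window width), as in the answer above
specs <- list(c("PRECIP", 4), c("OBS_Q", 2))
for (s in specs) {
  odd[[paste0(s[1], "_roll")]] <- roll_sum(odd[[s[1]]], as.numeric(s[2]))
}
```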

polr(..) ordinal logistic regression in R

I'm experiencing some trouble when using the polr function.
Here is a subset of the data I have:
# response variable
rep = factor(c(0.00, 0.04, 0.06, 0.13, 0.15, 0.05, 0.07, 0.00, 0.06, 0.04, 0.05, 0.00, 0.92, 0.95, 0.95, 1, 0.97, 0.06, 0.06, 0.03, 0.03, 0.08, 0.07, 0.04, 0.08, 0.03, 0.07, 0.05, 0.05, 0.06, 0.04, 0.04, 0.08, 0.04, 0.04, 0.04, 0.97, 0.03, 0.04, 0.02, 0.04, 0.01, 0.06, 0.06, 0.07, 0.08, 0.05, 0.03, 0.06,0.03))
# "rep" is a discrete variable representing a proportion, so it varies between 0 and 1
# It is a discrete proportion because it is the proportion of TRUE over a finite list of TRUE/FALSE. Example: if the list has 3 elements, the proportion can only be 0, 1/3, 2/3 or 1
# predictor variables
set.seed(10)
pred.1 = sample(x=rep(1:5,10),size=50)
pred.2 = sample(x=rep(c('a','b','c','d','e'),10),size=50)
# "pred" are discrete variables
# polr
polr(rep~pred.1+pred.2)
The subset I gave you works fine! But my entire data set, and some subsets of it, do not! And I can't find anything in my data that differs from this subset except the quantity. So here is my question: are there any limitations, in terms of the number of levels for example, that would lead to the following error message:
Error in optim(s0, fmin, gmin, method = "BFGS", ...) :
the initial value in 'vmin' is not finite
and the notification message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
(I had to translate these two messages into English so they might not be 100% accurate.)
I sometimes only get the notification message, and sometimes everything is fine, depending on what subset of my data I use.
For information, my rep variable has a total of 101 levels (and contains nothing other than the kind of data I described).
So it is a terrible question that I am asking, because I can't give you my full dataset and I don't know where the problem is. Can you guess where my problem comes from, given this information?
Thank you
Following @joran's advice that your problem is probably the 100-level factor, I'm going to recommend something that probably isn't statistically valid but will probably still be effective in your particular situation: don't use logistic regression at all. Just drop it. Perform a simple linear regression and then discretize your output as necessary using a specialized rounding procedure. Give it a shot and see how well it works for you.
rep.v = c(0.00, 0.04, 0.06, 0.13, 0.15, 0.05, 0.07, 0.00, 0.06, 0.04, 0.05, 0.00, 0.92, 0.95, 0.95, 1, 0.97, 0.06, 0.06, 0.03, 0.03, 0.08, 0.07, 0.04, 0.08, 0.03, 0.07, 0.05, 0.05, 0.06, 0.04, 0.04, 0.08, 0.04, 0.04, 0.04, 0.97, 0.03, 0.04, 0.02, 0.04, 0.01, 0.06, 0.06, 0.07, 0.08, 0.05, 0.03, 0.06,0.03)
set.seed(10)
pred.1 = factor(sample(x=rep(1:5,10),size=50))
pred.2 = factor(sample(x=rep(c('a','b','c','d','e'),10),size=50))
model = lm(rep.v~as.factor(pred.1) + as.factor(pred.2))
output = predict(model, newdata=data.frame(pred.1, pred.2))
# Here's one way you could accomplish the discretization/rounding
f.levels = unique(rep.v)
rounded = sapply(output, function(x){
d = abs(f.levels-x)
f.levels[d==min(d)]
}
)
> rounded
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
0.06 0.07 0.00 0.06 0.15 0.00 0.07 0.00 0.13 0.06 0.06 0.15 0.15 0.92 0.15 0.92 0.15 0.15 0.06 0.06 0.00 0.07 0.15 0.15
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
0.15 0.15 0.00 0.00 0.15 0.00 0.15 0.15 0.07 0.15 0.00 0.07 0.15 0.00 0.15 0.15 0.00 0.15 0.15 0.15 0.92 0.15 0.15 0.00
49 50
0.13 0.15
orm from the rms package can handle ordered outcomes with a large number of categories.
library(rms)
orm(rep ~ pred.1 + pred.2)