'x' and 'y' lengths differ in custom entropy function - r

I am trying to learn R and I am having problems with the way it works. I tried to write an entropy function of p and 1-p from scratch, and I am having problems when I add some ifs to avoid the NaN that log2(0) produces.
Without the ifs, the plot just works, although NaN shows up when I print the results. But when I add the ifs, it says:
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
entropy <- function(p){
  cat("p = ", p)
  if (p == 0 || p == 1) {
    result = 0
  } else {
    result = -p*log2(p) - (1-p)*log2(1-p)
  }
  cat("\nresult=", result)
  return(result)
}
p <- seq(0,1,0.01)
plot(p, entropy(p), type='l', main='Funcion entropia con dos valores posibles')
I don't understand this: x is the vector p and y is the function applied to that same vector, so the lengths should be equal with or without the ifs.
Console without the ifs:
p = 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1
result= NaN 0.08079314 0.1414405 0.1943919 0.2422922 0.286397 0.3274449 0.3659237 0.4021792 0.4364698 0.4689956 0.499916 0.5293609 0.5574382 0.5842388 0.6098403 0.6343096 0.6577048 0.680077 0.7014715 0.7219281 0.7414827 0.7601675 0.7780113 0.7950403 0.8112781 0.8267464 0.8414646 0.8554508 0.8687212 0.8812909 0.8931735 0.9043815 0.9149264 0.9248187 0.9340681 0.9426832 0.9506721 0.958042 0.9647995 0.9709506 0.9765005 0.9814539 0.985815 0.9895875 0.9927745 0.9953784 0.9974016 0.9988455 0.9997114 1 0.9997114 0.9988455 0.9974016 0.9953784 0.9927745 0.9895875 0.985815 0.9814539 0.9765005 0.9709506 0.9647995 0.958042 0.9506721 0.9426832 0.9340681 0.9248187 0.9149264 0.9043815 0.8931735 0.8812909 0.8687212 0.8554508 0.8414646 0.8267464 0.8112781 0.7950403 0.7780113 0.7601675 0.7414827 0.7219281 0.7014715 0.680077 0.6577048 0.6343096 0.6098403 0.5842388 0.5574382 0.5293609 0.499916 0.4689956 0.4364698 0.4021792 0.3659237 0.3274449 0.286397 0.2422922 0.1943919 0.1414405 0.08079314 NaN
Console with the ifs:
p = 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1
result= 0Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ

You did not create a vector but a scalar, because the if/else clause is not vectorized: `||` only looks at the first element of `p == 0`, which is TRUE (the first value of p is 0), so the whole function returns the single number 0. That is why the console shows "result= 0" and the plot then gets a length-1 y against a length-101 x.
This should work:
entropy <- function(p){
  # initialize a vector of the desired length with zeros
  result <- numeric(length(p))
  # subset the vector to the values you want to apply your formula to
  x <- p[!(p %in% c(0, 1))]
  # overwrite only those positions for which you want to calculate
  # values based on your formula
  result[!(p %in% c(0, 1))] <- -x*log2(x) - (1-x)*log2(1-x)
  #cat("\nresult=", result)
  return(result)
}
p <- seq(0,1,0.01)
plot(p, entropy(p), type='l', main='Funcion entropia con dos valores posibles')
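A vectorized ifelse() would be an even shorter sketch of the same idea. Note that it still evaluates the formula at 0 and 1 internally (producing NaN there), but those positions are then replaced by 0:
entropy <- function(p){
  # ifelse() works element-wise over the whole vector, unlike if/else
  ifelse(p %in% c(0, 1), 0, -p*log2(p) - (1-p)*log2(1-p))
}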

EDIT:
Even though I was advised to vectorize it, for now I wanted to do it in a way similar to other languages I know, since I am just starting. I was able to fix it, although I ended up using a for loop and plotting two arrays instead of the function itself.
entropy <- function(p){
  if (p == 0 || p == 1) {
    result = 0
  } else {
    result = -p*log2(p) - (1-p)*log2(1-p)
  }
  return(result)
}
x <- seq(0, 1, 0.01)
y <- numeric(length(x))  # note: length(x), not length(p); p is not defined yet at this point
i = 1
for (p in x) {
  y[i] = entropy(p)
  cat(x[i], "=", y[i], "\n")
  i = i + 1
}
plot(x, y, type='l', main='Funcion entropia con dos valores posibles')

I just applied your entropy function element-wise to the p vector with sapply before plotting it.
entropy <- function(p){
  cat("p = ", p)
  if (p == 0 || p == 1) {
    result = 0
  } else {
    result = -p*log2(p) - (1-p)*log2(1-p)
  }
  cat("\nresult=", result)
  return(result)
}
p <- seq(0,1,0.01)
# Apply the function over all the values of 'p'
entropy_p <- sapply(p,FUN = entropy)
plot(p, entropy_p, type='l', main='Funcion entropia con dos valores posibles')
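Base R's Vectorize() wraps the same scalar function for element-wise use (it is mapply() under the hood); entropy_v below is just an illustrative name:
entropy_v <- Vectorize(entropy)  # element-wise wrapper around the scalar function
plot(p, entropy_v(p), type='l', main='Funcion entropia con dos valores posibles')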

Related

Relationship between input variables and principal components in PCA

Here is the result of a PCA.
For RC1 and RC3 I can interpret which variables are related, but I cannot for RC2.
When the eigenvalues are checked, the number of factors is 3. But can there really be only two? Or which variables should be related to RC2?
There are 7 input variables, and I used the principal() function.
names(mydata)
[1] "A" "B" "C" "D" "E" "F" "G"
> x<-cbind(A, B, C, D, E, F, G)
> e_value<-eigen(cor(x))
> e_value
eigen() decomposition
$values
[1] 2.3502254 1.4170606 1.2658360 0.8148231 0.5608698 0.3438629 0.2473222
$vectors
           [,1]        [,2]        [,3]        [,4]        [,5]         [,6]         [,7]
[1,]  0.2388621  0.46839043  0.37003850  0.47205027 -0.58802244 -0.133939151 -0.009233395
[2,]  0.1671739 -0.71097984 -0.14062597  0.25083439 -0.26726985 -0.502411130 -0.244983436
[3,]  0.2132841 -0.19677142  0.64662974  0.34508779  0.61416969 -0.003950736  0.036814153
[4,]  0.1697817 -0.24468987  0.55631886 -0.69016805 -0.34039757  0.039899816  0.089531675
[5,]  0.4857016  0.36681570 -0.09905329 -0.31456085  0.26225761 -0.344919726 -0.577088755
[6,] -0.5359245  0.20164924  0.17958243 -0.13144417  0.11755661 -0.748885304  0.218966481
[7,]  0.5635252  0.03619081 -0.27131854 -0.05105919  0.08439733 -0.219629096  0.741315659
> PCA<-principal(x,nfactors = 3, rotate = "varimax")
> print(PCA)
Principal Components Analysis
Call: principal(r = x, nfactors = 3, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
    RC1   RC2   RC3   h2   u2 com
A  0.24  0.69  0.29 0.62 0.38 1.6
B  0.25 -0.83  0.24 0.81 0.19 1.3
C  0.06  0.05  0.83 0.69 0.31 1.0
D  0.03 -0.04  0.74 0.54 0.46 1.0
E  0.76  0.42 -0.01 0.76 0.24 1.5
F -0.83  0.24 -0.17 0.77 0.23 1.3
G  0.92 -0.01  0.00 0.84 0.16 1.0

                       RC1  RC2  RC3
SS loadings           2.23 1.40 1.40
Proportion Var        0.32 0.20 0.20
Cumulative Var        0.32 0.52 0.72
Proportion Explained  0.44 0.28 0.28
Cumulative Proportion 0.44 0.72 1.00
Mean item complexity = 1.3
Test of the hypothesis that 3 components are sufficient.
The root mean square of the residuals (RMSR) is 0.11
with the empirical chi square 63.33 with prob < 1.1e-13
Fit based upon off diagonal values = 0.84
The relationship between the principal components and the 7 variables is right there in the output:
Standardized loadings (pattern matrix) based upon correlation matrix
    RC1   RC2   RC3   h2   u2 com
A  0.24  0.69  0.29 0.62 0.38 1.6
B  0.25 -0.83  0.24 0.81 0.19 1.3
C  0.06  0.05  0.83 0.69 0.31 1.0
D  0.03 -0.04  0.74 0.54 0.46 1.0
E  0.76  0.42 -0.01 0.76 0.24 1.5
F -0.83  0.24 -0.17 0.77 0.23 1.3
G  0.92 -0.01  0.00 0.84 0.16 1.0
This matrix tells you which variables are correlated with each rotated component. So you can see that A and B are strongly correlated with the second component (labeled RC2).
PCA is a rotation of the data, so there are as many principal components as there are variables (7). But most people are interested in visualization, so generally only the first 2 or 3 principal components are chosen for plotting.
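As a quick illustration of that point, here is a sketch on a built-in dataset (USArrests stands in for mydata, which is not shown; pca is just an illustrative name):
# rotate the standardized data; there are as many components as variables
pca <- prcomp(USArrests, scale. = TRUE)
pca$rotation  # loadings: one column per component
summary(pca)  # proportion of variance explained by each component
biplot(pca)   # observations and variables on the first two components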

Logarithmic scale on the x-axis

I have the following code:
S = [100 200 500 1000 10000];
H = [0.14 0.15 0.17 0.19 0.28;0.14 0.16 0.18 0.20 0.29;0.15 0.17 0.19 0.21 0.31;0.16 0.17 0.20 0.22 0.32;0.23 0.22 0.28 0.30 0.44;0.23 0.23 0.29 0.3 0.5;0.33 0.32 0.4 0.42 0.63;0.32 0.31 0.39 0.40 0.61;0.23 0.23 0.30 0.30 0.50];
for i = 1:9
hold on
plot(S, H(i,:));
legend('GHM01','GHM02','GHM03','GHM04','GHM05','GHM06','GHM07','GHM08','GHM09'); %legend not correctly
axis([100 10000 0.1 1])
end
set(gca,'xscale','log')
The x-axis looks like this: [figure omitted]
Because the S-values are very far from each other, I used a logarithmic x-axis (and a linear y-axis).
I have 5 values on the axis (see S), and I only want those 5 values visible on the x-axis, with equidistant spacing between them. How do I do this? Or is there a better alternative for displaying my x-axis than a logarithmic scale?
If you want the x-axis ticks to be equally distant although the values are not (neither on a linear nor on a log scale), then you basically treat this axis as categorical: each tick gets an ordinal temporary value (say 1:5) that determines the spacing, and the labels show the original numbers.
Here is a quick implementation of your comment above:
S = {'100' '200' '500' '1000' '10000'};
H = [0.14 0.15 0.17 0.19 0.28;...
0.14 0.16 0.18 0.20 0.29;
0.15 0.17 0.19 0.21 0.31;
0.16 0.17 0.20 0.22 0.32;
0.23 0.22 0.28 0.30 0.44;
0.23 0.23 0.29 0.3 0.5;
0.33 0.32 0.4 0.42 0.63;
0.32 0.31 0.39 0.40 0.61;
0.23 0.23 0.30 0.30 0.50];
f = figure;
plot(1:length(S),H);
f.Children.XTick = 1:length(S);
f.Children.XTickLabel = S;
IMHO this is the most straightforward way to solve this problem ;)

How to do row wise operations on .SD columns in data.table

Although I've figured this out before, I still find myself searching (and unable to find) this syntax on stackoverflow, so...
I want to do row-wise operations on a subset of the data.table's columns, using .SD and .SDcols. I can never remember if the operations need an sapply, lapply, or if they belong inside the brackets of .SD.
As an example, say you have data for 10 students over two quarters. In both quarters they have two exams and a final exam. How would you take a straight average of the columns starting with q1?
Since overly trivial examples are annoying, I'd also like to calculate a weighted average for the columns starting with q2 (weights of 25%, 25%, and 50%).
library(data.table)
set.seed(10)
dt <- data.table(id = paste0("student_", sprintf("%02.f", 1:10)),
                 q1_exam1 = round(rnorm(10, .78, .05), 2),
                 q1_exam2 = round(rnorm(10, .68, .02), 2),
                 q1_final = round(rnorm(10, .88, .08), 2),
                 q2_exam1 = round(rnorm(10, .78, .05), 2),
                 q2_exam2 = round(rnorm(10, .68, .10), 2),
                 q2_final = round(rnorm(10, .88, .04), 2))
dt
# > dt
# id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final
# 1: student_01 0.78 0.70 0.83 0.69 0.79 0.86
# 2: student_02 0.77 0.70 0.71 0.78 0.60 0.87
# 3: student_03 0.71 0.68 0.83 0.83 0.60 0.93
# 4: student_04 0.75 0.70 0.71 0.79 0.76 0.97
# 5: student_05 0.79 0.69 0.78 0.71 0.58 0.90
# 6: student_06 0.80 0.68 0.85 0.71 0.68 0.91
# 7: student_07 0.72 0.66 0.82 0.80 0.70 0.84
# 8: student_08 0.76 0.68 0.81 0.69 0.65 0.90
# 9: student_09 0.70 0.70 0.87 0.76 0.61 0.85
# 10: student_10 0.77 0.69 0.86 0.75 0.75 0.89
Here are a few thoughts on your options, largely gathered from the comments:
apply along rows
The OP's approach uses apply(., 1, .) for the by-row operation, but this is discouraged because it unnecessarily coerces the data.table into a matrix. lapply/sapply are also unsuitable, since they are designed to work on each column separately, not to combine them.
rowMeans and similarly-named functions also coerce to a matrix.
Split by rows
As #Jaap said, you can use by=1:nrow(dt) for any rowwise operation, but it may be slow.
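For concreteness, a minimal sketch of that idiom on the data above (q1_rowavg is just an illustrative name):
# each row is its own group, so .SD is a one-row table of the q1 columns
dt[, q1_rowavg := mean(unlist(.SD)),
   by = 1:nrow(dt),
   .SDcols = grep("^q1", names(dt), value = TRUE)]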
Efficiently create new columns
This approach taken from eddi is probably the most efficient if you must keep your data in wide format:
jwts = list(
  q1_AVG  = c(q1_exam1 = 1,   q1_exam2 = 1,   q1_final = 1)/3,
  q2_WAVG = c(q2_exam1 = 1/4, q2_exam2 = 1/4, q2_final = 1/2)  # q2_exam1, not q1_exam1
)
for (newj in names(jwts)){
  w = jwts[[newj]]
  dt[, (newj) := Reduce("+", lapply(names(w), function(x) dt[[x]] * w[x]))]
}
This avoids coercion to a matrix and allows for different weighting rules (unlike rowMeans).
Go long
As #alexis_laz suggested, you might gain clarity and efficiency with a different structure, like
# reshape
m = melt(dt, id.vars="id", value.name="score")[,
c("quarter","exam") := tstrsplit(variable, "_")][, variable := NULL]
# input your weighting rules
w = unique(m[,c("quarter","exam")])
w[quarter=="q1" , wt := 1/.N]
w[quarter=="q2" & exam=="final", wt := .5]
w[quarter=="q2" & exam!="final", wt := (1-.5)/.N]
# merge and compute
m[w, on=c("quarter","exam")][, sum(score*wt), by=.(id,quarter)]
This is what I would do.
In any case, you should have your weighting rules stored somewhere explicitly rather than entered on the fly if you want to scale up the number of quarters.
In this case it is possible to use the apply function from base R, but that doesn't take advantage of the data.table framework. Also, it doesn't generalize well, because some cases will require more conditional checking.
apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)
# > apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)
# [1] 0.7700000 0.7266667 0.7400000 0.7200000 0.7533333 0.7766667 0.7333333 0.7500000 0.7566667 0.7733333
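The rowMeans spelling of the same computation is shorter, though as noted above it also coerces to a matrix:
rowMeans(dt[, grep("^q1", names(dt)), with = FALSE])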
In this case, again it's possible to put apply into the j argument of the data.table, and use it on the .SD columns:
dt[i = TRUE,
q1_AVG := round(apply(.SD, 1, mean), 2),
.SDcols = grep("^q1", colnames(dt))]
dt
# > dt
# id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG
# 1: student_01 0.78 0.70 0.83 0.69 0.79 0.86 0.77
# 2: student_02 0.77 0.70 0.71 0.78 0.60 0.87 0.73
# 3: student_03 0.71 0.68 0.83 0.83 0.60 0.93 0.74
# 4: student_04 0.75 0.70 0.71 0.79 0.76 0.97 0.72
# 5: student_05 0.79 0.69 0.78 0.71 0.58 0.90 0.75
# 6: student_06 0.80 0.68 0.85 0.71 0.68 0.91 0.78
# 7: student_07 0.72 0.66 0.82 0.80 0.70 0.84 0.73
# 8: student_08 0.76 0.68 0.81 0.69 0.65 0.90 0.75
# 9: student_09 0.70 0.70 0.87 0.76 0.61 0.85 0.76
# 10: student_10 0.77 0.69 0.86 0.75 0.75 0.89 0.77
The weighted average can be calculated using matrix multiplication:
dt[i = TRUE,
q2_WAVG := round(as.matrix(.SD) %*% c(.25, .25, .50), 2),
.SDcols = grep("^q2", colnames(dt))]
dt
# > dt
# id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG q2_WAVG
# 1: student_01 0.78 0.70 0.83 0.69 0.79 0.86 0.77 0.80
# 2: student_02 0.77 0.70 0.71 0.78 0.60 0.87 0.73 0.78
# 3: student_03 0.71 0.68 0.83 0.83 0.60 0.93 0.74 0.82
# 4: student_04 0.75 0.70 0.71 0.79 0.76 0.97 0.72 0.87
# 5: student_05 0.79 0.69 0.78 0.71 0.58 0.90 0.75 0.77
# 6: student_06 0.80 0.68 0.85 0.71 0.68 0.91 0.78 0.80
# 7: student_07 0.72 0.66 0.82 0.80 0.70 0.84 0.73 0.80
# 8: student_08 0.76 0.68 0.81 0.69 0.65 0.90 0.75 0.78
# 9: student_09 0.70 0.70 0.87 0.76 0.61 0.85 0.76 0.77
# 10: student_10 0.77 0.69 0.86 0.75 0.75 0.89 0.77 0.82

Error using corrplot

I need help with interpreting an error message using corrplot.
Here is my script
install.packages("ggplot2")
install.packages("corrplot")
install.packages("xlsx")
library(ggplot2)
library(corrplot)
library(xlsx)
#set working dir
setwd("C:/R")
#read xlsx data into R
df <- read.xlsx("TP_diff_frame.xlsx",1)
#set column as index
rownames(df) <- df$country
#remove column
df2<-subset(df, select = -c(country) )
#plot the data (this is the line that errors)
corrplot(df2, method="shade", shade.col=NA, tl.col="black", tl.srt=45)
My df2:
> df2
a b c d e f g
Sweden 0.09 0.19 0.00 -0.25 -0.04 0.01 0.00
Germany 0.11 0.19 0.01 -0.35 0.01 0.02 0.01
UnitedKingdom 0.14 0.21 0.03 -0.32 -0.05 0.00 0.00
RussianFederation 0.30 0.26 -0.07 -0.41 -0.09 0.00 0.00
Netherlands 0.09 0.16 -0.05 -0.26 0.02 0.02 0.01
Belgium 0.12 0.20 0.01 -0.34 0.01 0.00 0.00
Italy 0.14 0.22 0.01 -0.37 0.00 0.00 0.00
France 0.14 0.24 -0.04 -0.34 0.00 0.00 0.00
Finland 0.16 0.17 0.01 -0.26 -0.08 0.00 0.00
Norway 0.15 0.21 0.10 -0.37 -0.09 0.00 0.00
And the error message:
> corrplot(df2, method="shade",shade.col=NA, tl.col="black", tl.srt=45)
Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, :
length of 'dimnames' [2] not equal to array extent
I think the problem is that you are plotting the data frame instead of the correlation matrix. Try changing the last line to this:
corrplot(cor(df2), method="shade", shade.col=NA, tl.col="black", tl.srt=45)
The cor function calculates the correlation matrix, which is what you need to plot.
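A self-contained illustration with built-in data (mtcars stands in for the spreadsheet, which we don't have; m is just an illustrative name):
m <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])  # square numeric matrix
corrplot(m, method="shade", shade.col=NA, tl.col="black", tl.srt=45)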
In order to use the corrplot package for heatmap-style plots of the raw values, you should convert your data.frame to a matrix and also set the is.corr argument:
df2 <- as.matrix(df2)
corrplot(df2, is.corr=FALSE)
Another option is to break it up into two lines of code.
df2 <- cor(df2, use = "na.or.complete")  # cor() needs the numeric columns, so use df2, not df
corrplot(df2, method="shade", shade.col=NA, tl.col="black", tl.srt=45)
I'd run a simple corrplot first (e.g. corrplot.mixed(df2)) to make sure it works, then get into the fine-tuning and aesthetics.

Multiple boxplots with predefined statistics using lattice-like graphs in r

I have a dataset which looks like this
VegType 87MIN 87MAX 87Q25 87Q50 87Q75 96MIN 96MAX 96Q25 96Q50 96Q75 00MIN 00MAX 00Q25 00Q50 00Q75
1 0.02 0.32 0.11 0.12 0.13 0.02 0.26 0.08 0.09 0.10 0.02 0.28 0.10 0.11 0.12
2 0.02 0.45 0.12 0.13 0.13 0.02 0.20 0.09 0.10 0.11 0.02 0.26 0.11 0.12 0.12
3 0.02 0.29 0.13 0.14 0.14 0.02 0.27 0.11 0.11 0.12 0.02 0.26 0.12 0.13 0.13
4 0.02 0.41 0.13 0.13 0.14 0.02 0.58 0.10 0.11 0.12 0.02 0.34 0.12 0.13 0.13
5 0.02 0.42 0.12 0.13 0.14 0.02 0.46 0.10 0.11 0.11 0.02 0.28 0.12 0.12 0.13
6 0.02 0.32 0.13 0.14 0.14 0.02 0.52 0.12 0.12 0.13 0.02 0.29 0.13 0.14 0.14
7 0.02 0.55 0.12 0.13 0.14 0.02 0.24 0.10 0.11 0.11 0.02 0.37 0.12 0.12 0.13
8 0.02 0.55 0.12 0.13 0.14 0.02 0.19 0.10 0.11 0.12 0.02 0.22 0.11 0.12 0.13
In reality I have 26 variables and 5 years (87, 96 and 00 in the column names are years). In an ideal world I would like to have a lattice-like graph with 26 plots, one per variable, each plot containing 5 boxes, i.e. one per year. I understand that this is not possible in lattice, because lattice won't accept predefined statistics. Is there a fairly unpainful way to do this in R with predefined stats? I have used bxp for simple boxplots plotting all the variables for one year in a single plot, e.g.
Yr01 = read.csv('dat.csv',header=T)
dat01=t(Yr01[,c("01Min","01Q25","01Mean","01Q75","01Max")])
bxp(list(stats=dat01, n=rep(26, ncol(dat01))),ylim=c(0.07,0.2))
but I don't know how to go from there to what I need.
Thanks.
This can be done, at least using ggplot2, but you'll have to reshape your data quite a bit. And you really need data where the quantiles actually make sense; your quantile values are all messed up! For example, Var1 has 01Max = 0.26 and 01Q75 = .67!
First, I'll recreate a valid data:
n <- c("01Min", "01Max", "01Med", "01Q25", "01Q75", "02Min",
"02Max", "02Med", "02Q25", "02Q75")
v1 <- c(0.03, 0.76, 0.41, 0.13, 0.67, 0.10, 0.43, 0.27, 0.2, 0.33)
v2 <- c(0.03, 0.28, 0.14, 0.08, 0.20, 0.02, 0.77, 0.13, 0.06, 0.44)
df <- data.frame(v1=v1, v2=v2)
df <- as.data.frame(t(df))
names(df) <- n
df <- cbind(var=c("v1","v2"), df)
> df
# var 01Min 01Max 01Med 01Q25 01Q75 02Min 02Max 02Med 02Q25 02Q75
# v1 v1 0.03 0.76 0.41 0.13 0.67 0.10 0.43 0.27 0.20 0.33
# v2 v2 0.03 0.28 0.14 0.08 0.20 0.02 0.77 0.13 0.06 0.44
Next, we'll reshape the data:
require(reshape2)
df.m <- melt(df, id="var")
# look for a run of digits at the start of the string and capture it:
# () captures the pattern, and "\\1" refers back to the first captured group
df.m$year <- gsub("^([0-9]+)(.*)$", "\\1", df.m$variable)
# the same, but refer to the second captured group (the second
# parenthesis) using "\\2"
df.m$quan <- gsub("^([0-9]+)(.*)$", "\\2", df.m$variable)
df.f <- dcast(df.m, var+year ~ quan, value.var="value")
To get to this format:
> df.f
# var year Max Med Min Q25 Q75
# 1 v1 01 0.76 0.41 0.03 0.13 0.67
# 2 v1 02 0.43 0.27 0.10 0.20 0.33
# 3 v2 01 0.28 0.14 0.03 0.08 0.20
# 4 v2 02 0.77 0.13 0.02 0.06 0.44
Now we can plot by supplying the quantile values directly to the boxplot parameters, using the corresponding column names:
require(ggplot2)
require(scales)
p <- ggplot(df.f, aes(x=var, ymin=`Min`, lower=`Q25`, middle=`Med`,
upper=`Q75`, ymax=`Max`))
p <- p + geom_boxplot(aes(fill=year), stat="identity")
p
# if you want facetting:
p + facet_wrap( ~ var, scales="free")
You can now accomplish your task of plotting all years for each var in a separate plot by wrapping this code in lapply and subsetting as follows:
lapply(levels(df.f$var), function(x) {
  p <- ggplot(df.f[df.f$var == x, ],
              aes(x=var, ymin=`Min`, lower=`Q25`,
                  middle=`Med`, upper=`Q75`, ymax=`Max`))
  p <- p + geom_boxplot(aes(fill=year), stat="identity")
  # pass the plot explicitly; last_plot() is only set when a plot is printed
  ggsave(paste0(x, ".pdf"), p)
})
Edit: Your data is different from the earlier data you provided in some aspects. So, here's the version of the code for your new data:
# change var to VegType everywhere
require(reshape2)
df.m <- melt(df, id="VegType")
df.m$year <- gsub("^X([0-9]+)(.*$)", "\\1", df.m$variable) # pattern has a X
df.m$quan <- gsub("^X([0-9]+)(.*)$", "\\2", df.m$variable) # pattern has a X
df.f <- dcast(df.m, VegType+year ~ quan, value.var="value")
df.f$VegType <- factor(df.f$VegType) # convert integer to factor
require(ggplot2)
require(scales)
p <- ggplot(df.f, aes(x=VegType, ymin=`MIN`, lower=`Q25`, middle=`Q50`,
upper=`Q75`, ymax=`MAX`))
p <- p + geom_boxplot(aes(fill=year), stat="identity")
p
You can facet/write as separate plots using same code as before.
