Subsets of a dataset as separate dendrograms, but in the same plot - r

I know I can plot a dendrogram as follows
library(cluster)
d <- mtcars
d[,8:11] <- lapply(d[,8:11], as.factor)
gdist <- daisy(d, metric = c("gower"), stand = FALSE)
dendro <- hclust(gdist, method = "average")
plot(as.dendrogram(dendro))
However, I have some groups identified (e.g. by an iterative classification method), given as the last column in d:
G <- c(1,2,3,3,4,4,5,5,5,5,1,2,1,1,2,4,1,3,4,5,1,7,4,3,3,2,1,1,1,3,5,6)
d$Group <- G
head(d)
mpg cyl disp hp drat wt qsec vs am gear carb Group
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 4
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 4
I am trying to plot all the dendrograms together on the same plot with the same scale. The groups with only a single member (groups 6 and 7) also need to be plotted.
I am able to plot individual dendrograms for subsets of the data, except when a group has only one member. But I don't think this is the right approach.
layout(matrix(1:9, 3,3,byrow=TRUE))
gdist <- as.matrix(gdist)
N <- max(G)
for (i in 1:N){
rc_tokeep <- row.names(subset(d, G==i))
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend))
}
The loop gives this error for the last two groups (6 and 7), which have only a single member:
Error in hclust(dis, method = "average") :
must have n >= 2 objects to cluster
Essentially, I want to reproduce these types of plots. The clusters with single members are also plotted here.

If you want to mimic the last few graphs, you can do something like this:
N <- max(G)
layout(matrix(c(0,1:N,0),nc=1))
gdist <- as.matrix(gdist)
for (i in 1:N){
par(mar=c(0,3,0,7))
rc_tokeep <- row.names(subset(d, G==i))
if(length(rc_tokeep)>1){ #The idea is to catch the groups with one single element to plot them differently (hclust needs at least 2 objects)
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend),horiz=TRUE,
xlim=c(.8,0),axes=FALSE) # giving the same xlim will scale all of them, here i used 0.8 to fit your data but you can change it to whatever
}else{
plot(NA,xlim=c(.8,0),ylim=c(0,1),axes=F,ann=F)
segments(0,.5,.1,.5) #I don't know how you intend to compute the length of the branch in a group of 1 element, you might want to change that
text(0,.5, pos=4,rc_tokeep,xpd=TRUE)
}
}
With your example it gives:
If you want to add the scale you can add a grid in all graphs and a scale in the last one:
N <- max(G)
layout(matrix(c(0,1:N,0),nc=1))
gdist <- as.matrix(gdist)
for (i in 1:N){
par(mar=c(0,3,0,7))
rc_tokeep <- row.names(subset(d, G==i))
if(length(rc_tokeep)>1){ #hclust needs at least 2 objects
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend),horiz=TRUE,xlim=c(.8,0),xaxt="n",yaxt="n")
abline(v=seq(0,.8,.1),lty=3) #Here the grid
}else{
plot(NA,xlim=c(.8,0),ylim=c(0,1),axes=F,ann=F)
segments(0,.5,.1,.5)
text(0,.5, pos=4,rc_tokeep,xpd=TRUE)
abline(v=seq(0,.8,.1),lty=3) #Here the grid
}
}
axis(1,at=seq(0,.8,.1)) #Here the axis
And finally if you want to even the spaces between the different branches in the resulting plot, you can use table(d$Group) to get the number of members of each group and use it as a height for each subplot:
N <- max(G)
layout(matrix(c(0,1:N,0),nc=1), heights=c(3,table(d$Group),3)) #Plus the height of the empty spaces.
gdist <- as.matrix(gdist)
for (i in 1:N){
par(mar=c(0,3,0,7))
rc_tokeep <- row.names(subset(d, G==i))
if(length(rc_tokeep)>1){ #hclust needs at least 2 objects
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend),horiz=TRUE,xlim=c(.8,0),xaxt="n",yaxt="n")
abline(v=seq(0,.8,.1),lty=3)
}else{
plot(NA,xlim=c(.8,0),ylim=c(0,1),axes=F,ann=F)
segments(0,.5,.1,.5)
text(0,.5, pos=4,rc_tokeep,xpd=TRUE)
abline(v=seq(0,.8,.1),lty=3)
}
}
axis(1,at=seq(0,.8,.1))
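The 0.8 limit above was picked by eye. Here is a minimal sketch, assuming the same d, G and Gower distances as in the question, of deriving a shared limit from the data instead. The names gmat, trees and x_max are mine and do not appear in the original code.

```r
library(cluster)

# Same data and grouping as in the question
d <- mtcars
d[, 8:11] <- lapply(d[, 8:11], as.factor)
G <- c(1,2,3,3,4,4,5,5,5,5,1,2,1,1,2,4,1,3,4,5,1,7,4,3,3,2,1,1,1,3,5,6)
gmat <- as.matrix(daisy(d, metric = "gower"))

# Cluster each group separately; NULL marks groups too small for hclust()
trees <- lapply(split(rownames(d), G), function(members) {
  if (length(members) < 2) return(NULL)
  hclust(as.dist(gmat[members, members]), method = "average")
})

# Tallest merge height over all groups, rounded up to the next 0.1
tops <- vapply(trees, function(tr) if (is.null(tr)) 0 else max(tr$height),
               numeric(1))
x_max <- ceiling(max(tops) * 10) / 10
```

x_max could then replace .8 in xlim=c(.8,0) and in seq(0,.8,.1), so the scale follows the data if the grouping changes.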

Related

Multiply two matrices (sumproduct for multiple functions) - to get Fisher's linear discriminant function scores

I have a set of Fisher's discriminant linear functions that I need to multiply against some test data. Both data files are in the form of two matrices (variables lined up to match variable order), so I need to multiply them together.
Here is some example test data, to which I've added a constant=1 variable (you'll see why when we get to the coefficients):
testdata <- cbind(constant=1,mtcars[ 1:6 ,c("mpg","disp","hp") ])
> testdata
constant mpg disp hp
Mazda RX4 1 21.0 160 110
Mazda RX4 Wag 1 21.0 160 110
Datsun 710 1 22.8 108 93
Hornet 4 Drive 1 21.4 258 110
Hornet Sportabout 1 18.7 360 175
Valiant 1 18.1 225 105
Here is my coefficient matrix (the Fisher's linear discriminant functions):
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
mpg = c(4.01,3.49,3.69),
disp = c(0.14,0.15,0.22),
hp = c(0.13,0.15,0.20))
rownames(coefs) <- c("Function1","Function2","Function3")
> coefs
constant mpg disp hp
Function1 -67.67 4.01 0.14 0.13
Function2 -59.46 3.49 0.15 0.15
Function3 -89.70 3.69 0.22 0.20
I need to multiply the values in the test data by the respective coefficients to get three function scores per row. Here is how the values would be calculated:
for the first row, Function1 = 1*(-67.67)+21*(4.01)+160*(0.14)+110*(0.13)
for the first row, Function2 = 1*(-59.46)+21*(3.49)+160*(0.15)+110*(0.15)
for the first row, Function3 = 1*(-89.70)+21*(3.69)+160*(0.22)+110*(0.20)
It's kind of like a sumproduct of the coefficients against each row, done 3 times (once for each function).
So the df/matrix should look like this when multiplied: the same number of rows, with 3 function score variables.
> df_result
Function1 Function2 Function3
row1 53.24 54.33 44.99
row2
Not ideal, but at the moment I'm taking the data out and doing it in Excel. If this is possible to do in R, any help is greatly appreciated. Many thanks.
Are you just looking for the inner product?
testdata <- cbind(constant=1,mtcars[ 1:6 ,c("mpg","disp","hp") ])
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
mpg = c(4.01,3.49,3.69),
disp = c(0.14,0.15,0.22),
hp = c(0.13,0.15,0.20))
rownames(coefs) <- c("Function1","Function2","Function3")
as.matrix(testdata) %*% t(as.matrix(coefs))
# Function1 Function2 Function3
# Mazda RX4 53.240 54.330 44.990
# Mazda RX4 Wag 53.240 54.330 44.990
# Datsun 710 50.968 50.262 36.792
# Hornet 4 Drive 68.564 70.426 68.026
# Hornet Sportabout 80.467 86.053 93.503
# Valiant 50.061 53.209 47.589
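As a quick sanity check, here is a sketch that reproduces the first row's Function1 score by hand and compares it with the matrix product. The object names scores, by_hand and scores2 are mine. tcrossprod(x, y) is base R's shorthand for x %*% t(y), so it should give the same matrix.

```r
testdata <- cbind(constant = 1, mtcars[1:6, c("mpg", "disp", "hp")])
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
                    mpg  = c(4.01, 3.49, 3.69),
                    disp = c(0.14, 0.15, 0.22),
                    hp   = c(0.13, 0.15, 0.20))
rownames(coefs) <- c("Function1", "Function2", "Function3")

# Matrix product: one row per car, one column per function
scores <- as.matrix(testdata) %*% t(as.matrix(coefs))

# "Sumproduct" of the first data row with the Function1 coefficients
by_hand <- sum(unlist(testdata[1, ]) * unlist(coefs["Function1", ]))
isTRUE(all.equal(by_hand, scores[1, "Function1"], check.attributes = FALSE))

# Same result without the explicit transpose
scores2 <- tcrossprod(as.matrix(testdata), as.matrix(coefs))
```

This only works because the columns of testdata and coefs are in the same order; if they might not be, indexing with coefs[, colnames(testdata)] first would be a safer habit.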

Apply variable function to columns in data.table

I'm wondering if there's a way to apply a function in a string variable to .SD cols in a data.table.
I can generalize all other parts of function calls using a data.table, including input and output columns, which I'm very happy about. But the final piece seems to be applying a variable function to a data.table, which is something I believe I've done before with dplyr and do.call.
mtcars <- as.data.table(mtcars)
returnNames <- "calculatedColumn"
SDnames <- c("mpg","hp")
myfunc <- function(data) {
print(data)
return(data[,1]*data[,2])
}
This obviously works:
mtcars[,eval(returnNames) := myfunc(.SD),.SDcols = SDnames,by = cyl]
But if I want to apply a dynamic function, something like this does not work:
functionCall <- "myfunc"
mtcars[,eval(returnNames) := lapply(.SD,eval(functionCall)),.SDcols = SDnames,by = cyl]
I get this error:
Error in `[.data.table`(mtcars, , `:=`(eval(returnNames), lapply(.SD, : attempt to apply non-function
Is using "apply" with "eval" the right idea, or am I on the wrong track entirely?
You don't want lapply. Since myfunc takes a data.table with multiple columns, you just want to feed .SD into the function as one object.
To get the function from its name you need get instead of eval: eval(functionCall) simply returns the string "myfunc", which is why you see "attempt to apply non-function".
On the left-hand side of :=, you can just put the character vector in parentheses; eval isn't needed.
mtcars[, (returnNames) := get(functionCall)(.SD)
, .SDcols = SDnames
, by = cyl]
head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb calculatedColumn
# 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2310.0
# 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2310.0
# 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 2120.4
# 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2354.0
# 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3272.5
# 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1900.5
The code above was run after the following code
mtcars <- as.data.table(mtcars)
returnNames <- "calculatedColumn"
SDnames <- c("mpg","hp")
myfunc <- function(data) {
print(data)
return(data[,1]*data[,2])
}
functionCall <- "myfunc"
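A small variation on the same idea, as a sketch: look the function up once before the data.table call, and ask get() for a function specifically so that a non-function object with the same name is skipped. The names dt and f are mine, and myfunc is simplified here (no print, list-style column access), so treat it as an illustration rather than a drop-in replacement.

```r
library(data.table)

dt <- as.data.table(mtcars)
returnNames <- "calculatedColumn"
SDnames <- c("mpg", "hp")

# Simplified stand-in for myfunc: product of the first two columns
myfunc <- function(data) data[[1]] * data[[2]]
functionCall <- "myfunc"

# Resolve the name to a function object once, outside of `[`
f <- get(functionCall, mode = "function")

dt[, (returnNames) := f(.SD), .SDcols = SDnames, by = cyl]
head(dt)
```

If you control the calling code, passing the function object itself around instead of its name avoids the lookup entirely.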

R: geom_histogram

I am trying to create a histogram using geom_histogram() that uses a numeric variable for both the x and y axis.
The numeric x axis will be bucketed, and the y axis will show the sum of some other numeric value for each bucket. Right now I am not having any luck and was hoping someone could help.
attach(Pre_vitality_HZ_Data)
buckets_pre = seq(min(Pre_V_HzR),max(Pre_V_HzR)+1,0.05)
ggplot() +
geom_histogram(alpha = 0.2, aes(x=Pre_V_HzR, y = sum(Policy_Count)), bins = length(buckets_pre), fill = 'aquamarine3')
To make the plot you want with ggplot2, it's necessary to prepare the data before plotting. In the solution below, I propose dividing the continuous x-variable into a discrete variable with cut(), and using aggregate() to sum the y-values for each bin of x-values. Besides the base R function aggregate, there are many ways to summarize, aggregate and reshape your data. You may wish to look into the dplyr package or data.table package (two very powerful, well supported packages).
library(ggplot2)
# Use the built-in data set `mtcars` to make the example reproducible.
# Run ?mtcars to see a description of the data set.
head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Let's use `disp` (engine displacement) as the x-variable
# and `mpg` (miles per gallon) as the y-variable.
# Bin the `disp` column into discrete variable with `cut()`
disp_bin_edges = seq(from=71, to=472, length.out=21)
mtcars$disp_discrete = cut(mtcars$disp, breaks=disp_bin_edges)
# Use `aggregate()` to sum `mpg` over levels of `disp_discrete`,
# creating a new data.frame.
dat = aggregate(mpg ~ disp_discrete, data=mtcars, FUN=sum)
# Use `geom_bar(stat="identity")` to plot pre-computed y-values.
p1 = ggplot(dat, aes(x=disp_discrete, y=mpg)) +
geom_bar(stat="identity") +
scale_x_discrete(drop=FALSE) +
theme(axis.text.x=element_text(angle=90)) +
ylab("Sum of miles per gallon") +
xlab("Displacement, binned")
# For this example data, a scatterplot conveys a clearer story.
p2 = ggplot(mtcars, aes(x=disp, y=mpg)) +
geom_point(size=5, alpha=0.4) +
ylab("Miles per gallon") +
xlab("Displacement")
library(gridExtra)
ggsave("plots.png", arrangeGrob(p1, p2, nrow=1), height=4, width=8, dpi=150)
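If you would rather let ggplot2 do the binning, one alternative sketch is the weight aesthetic: stat_bin (which geom_histogram uses) sums the weights in each bin instead of counting rows. Below, disp and mpg from mtcars stand in for Pre_V_HzR and Policy_Count; the bin count of 20 and the object name p_weighted are arbitrary choices of mine.

```r
library(ggplot2)

# Each bar's height is the sum of mpg for the cars falling in that disp bin
p_weighted <- ggplot(mtcars, aes(x = disp, weight = mpg)) +
  geom_histogram(bins = 20, alpha = 0.2, fill = "aquamarine3") +
  ylab("Sum of miles per gallon") +
  xlab("Displacement")

p_weighted
```

The bin edges chosen by geom_histogram will generally differ from the cut() breaks above, so the two plots need not match bar for bar.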

Predicting data via regression model and storing in a vector

Apologies for what is probably a very basic question.
I have created a linear model for a massive meteorological dataset using multiple regression. My goal is to use that model to "predict" data during a certain period using predictors 1, 2 and 3. I will then compare those predicted data to the observed data for that period.
My approach thus far has been to create a new vector for the predicted values and loop through the vector, creating predicted values based on the extracted coefficients of the linear model. Then, I will simply subtract the predicted values from the observed values. For some reason, this approach results in the new predicted vector being NULL. Any idea how I could approach this?
A sample is below. "data" refers to the dataset containing the predictors.
coef <- coefficients(multipleRegressionModel)
predictedValues=c()
for(i in 1:length(data$timePeriod)){
predictedValues[i] = append(predictedValues, data$coef[1]+data$predictor1[i]*data$coef[2]+data$predictor2[i]*data$coef[3]+
data$predictor3[i]*data$coef[4])
}
diff=c()
diff=observedValues - predictedValues
It looks like you are making this more difficult than it needs to be. R has a predict() function that does all of this for you. If you had a sample data.frame like so:
set.seed(26)
mydf = data.frame (a=1:20 , b = rnorm(20),
c = 1:20 + runif(20,2,3)*runif(20, 2, 3),
d = 1:20 + rpois(20,5)*runif(1:20)*sin(1:20))
And you wanted to train on some rows, and test on the others
trainRows<-sample(1:20, 16)
mydf.train<-mydf[trainRows,]
mydf.test<-mydf[-trainRows,]
Then fit the model and predict
model<-lm(a~b+c+d, data = mydf.train)
summary(model) #gives info about your model.
mydf.test$pred<-predict(model, newdata = mydf.test) #use the fitted object, which is called model
MSE<-mean((mydf.test$pred-mydf.test$a)^2) #calculate mean squared error
MSE
#[1] 0.06321
View the predictions with mydf.test$pred
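As for the loop in the question, one likely culprit is data$coef: coef is a separate object, not a column of data, so data$coef is NULL. If you do want to work from the coefficients, here is a sketch on mtcars showing that indexing the coefficient vector directly gives the same numbers as predict(). The model and the names fit, b, manual and pred are only illustrative.

```r
# Illustrative model with three predictors
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
b <- coef(fit)  # named vector: (Intercept), wt, hp, qsec

# Vectorised version of the loop: no append(), no data$coef
manual <- b[1] + b["wt"] * mtcars$wt + b["hp"] * mtcars$hp +
  b["qsec"] * mtcars$qsec

pred <- predict(fit, newdata = mtcars)

# The two approaches should agree up to floating-point error
isTRUE(all.equal(unname(manual), unname(pred)))

# Observed minus predicted
difference <- mtcars$mpg - pred
```

predict() is still the less error-prone route, since it keeps working when the formula gains factors, interactions or transformed terms.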
Here is a simple example using a glm on the mtcars data.
Line<- #setting up the linear model function
function (train_dat, test_dat, variables, y_var, family = "gaussian")
{
fm <- as.formula(paste(y_var, " ~", paste(variables, collapse = "+"))) #formula
glm1 <- glm(fm, data = train_dat, family = family) #run the model
predict(glm1, newdata = test_dat) #return the predictions for test_dat
}
data(mtcars)
y_var<-'mpg'
x_vars<-setdiff(names(mtcars),y_var)
mtcars[,'linear_prediction']<-Line(mtcars,mtcars,x_vars,y_var)
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb linear_prediction
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 22.59951
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 22.11189
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 26.25064
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.23740
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 17.69343
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 20.38304

How to get width of dataframe being printed in terminal in R

I'm looking for a way to put a bottom line under a data frame that I print out in R.
The output looks like
I want the bottom line to be exactly as long as the data frame output, but the width varies.
Any idea?
EDIT
I'm trying to get rid of
cat("---------------------------------------\n")
and want the line to adapt to the output width of a given data frame. The "-----" line should be neither longer nor shorter than the data frame.
Use getOption("width"):
> getOption("width")
[1] 80
You can see a description of this option via ?options which states
‘width’: controls the maximum number of columns on a line used in
printing vectors, matrices and arrays, and when filling by
‘cat’.
That doesn't mean that the entire 80 (in my case) characters are used, but R's printing shouldn't extend beyond that so it should be an upper limit.
You should probably also check this in IDE or other front-ends to R. For example, RStudio might do something different depending on the width of the console widget in their app.
To format exactly the correct width for the data frame, you'll need to process the data frame into character strings for each line (much as print.data.frame does via its format method). Something like:
df <- data.frame(Price = round(runif(10), 2),
Date = Sys.Date() + 0:9,
Subject = rep(c("Foo", "Bar", "DJGHSJIBIBFUIBSFIUBFUIS"),
length.out = 10),
Category = rep("Media", 10))
class(df) <- c("MyDF", "data.frame")
print.MyDF <- function(x, ...) {
fdf <- format(x)
strings <- apply(x, 2, function(x) unlist(format(x)))[1, ]
rowname <- format(rownames(fdf))[[1]]
strings <- c(rowname, strings)
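names <- c("", colnames(x))
widths <- pmax(nchar(strings), nchar(names)) # each column is as wide as its widest cell or its header
csum <- sum(widths + 1) - 1 # one space between adjacent columns
print.data.frame(x) # print the argument rather than the global df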
writeLines(paste(rep("-", csum), collapse = ""))
writeLines("Balance: 48") ## FIXME !!
invisible(x)
}
which gives:
> df
Price Date Subject Category
1 0.73 2015-06-29 Foo Media
2 0.11 2015-06-30 Bar Media
3 0.19 2015-07-01 DJGHSJIBIBFUIBSFIUBFUIS Media
4 0.54 2015-07-02 Foo Media
5 0.04 2015-07-03 Bar Media
6 0.37 2015-07-04 DJGHSJIBIBFUIBSFIUBFUIS Media
7 0.59 2015-07-05 Foo Media
8 0.85 2015-07-06 Bar Media
9 0.15 2015-07-07 DJGHSJIBIBFUIBSFIUBFUIS Media
10 0.05 2015-07-08 Foo Media
----------------------------------------------------
Balance: 48
See how this works. Very simple counting of characters, no bells and whistles, but should do the expected job:
EDIT: Printing of the data.frame and the line done in the function.
# create a function that prints the data.frame with the line we want
lineLength <- function( testDF )
{
# start with the characters in the row names,
# plus empty space between columns
dashes <- max( nchar( rownames( testDF ) ) ) + length ( testDF )
# loop finding the longest string in each column, including header
for( i in 1 : length ( testDF ) )
{
x <- nchar( colnames( testDF ) )[ i ]
y <- max( nchar( testDF[ , i ] ) )
if( x > y ) dashes <- dashes + x else dashes <- dashes + y
}
myLine <- paste( rep( "-", dashes ), collapse = "" )
print( testDF )
cat( myLine, "\n" )
}
# sample data
data( mtcars )
# see how it works
lineLength( head( mtcars ) )
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
--------------------------------------------------------------------
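Another option, sketched below, is to skip the width arithmetic and measure what print() actually produces: capture.output() returns the printed lines as a character vector, and the longest one gives the length of the rule. The function name print_with_rule is made up for this example.

```r
print_with_rule <- function(df) {
  # Capture exactly what print() would show, one string per line
  out <- capture.output(print(df))
  writeLines(out)
  # Rule as long as the widest printed line
  writeLines(strrep("-", max(nchar(out, type = "width"))))
  invisible(df)
}

print_with_rule(head(mtcars))
```

If the data frame is wider than getOption("width"), print() wraps it into several blocks and the rule will match the widest block rather than the full table.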
