I have a dataset from which I had to remove outliers. I used the boxplot method to remove them; however, I feel this method has changed the structure of my data from a table-like structure to just a list. I am trying to use NbClust to estimate the number of clusters I should use. I also applied z-score scaling before attempting to use NbClust. I am really new to R, and I am not sure how to change the data back or whether this is the reason the error is occurring with NbClust.
In the Global Environment panel, the data showed as "846 obs. of 18 variables" before outlier removal and as "List of 18" after outlier removal.
Error: Error in t(jeu) %*% jeu :
requires numeric/complex matrix/vector arguments
I think the correct thing is to change it into a data frame, but I am not sure how to do this correctly.
Data before outlier removal using boxplot method:
After outliers removed using boxplot method:
Reproducible example
library(reshape2)
library(NbClust)
vehData <-
structure(
list(
Samples = 1:6,
Comp = c(95L, 91L, 104L, 93L, 85L,
107L),
Circ = c(48L, 41L, 50L, 41L, 44L, 57L),
D.Circ = c(83L,
84L, 106L, 82L, 70L, 106L),
Rad.Ra = c(178L, 141L, 209L, 159L,
205L, 172L),
Pr.Axis.Ra = c(72L, 57L, 66L, 63L, 103L, 50L),
Max.L.Ra = c(10L,
9L, 10L, 9L, 52L, 6L),
Scat.Ra = c(162L, 149L, 207L, 144L, 149L,
255L),
Elong = c(42L, 45L, 32L, 46L, 45L, 26L),
Pr.Axis.Rect = c(20L,
19L, 23L, 19L, 19L, 28L),
Max.L.Rect = c(159L, 143L, 158L, 143L,
144L, 169L),
Sc.Var.Maxis = c(176L, 170L, 223L, 160L, 241L, 280L),
Sc.Var.maxis = c(379L, 330L, 635L, 309L, 325L, 957L),
Ra.Gyr = c(184L,
158L, 220L, 127L, 188L, 264L),
Skew.Maxis = c(70L, 72L, 73L,
63L, 127L, 85L),
Skew.maxis = c(6L, 9L, 14L, 6L, 9L, 5L),
Kurt.maxis = c(16L,
14L, 9L, 10L, 11L, 9L),
Kurt.Maxis = c(187L, 189L, 188L, 199L,
180L, 181L),
Holl.Ra = c(197L, 199L, 196L, 207L, 183L, 183L),
Class = c("van", "van", "saab", "van", "bus", "bus")
),
row.names = c(NA,
6L), class = "data.frame")
#Remove outliers
removeOutliers <- function(data) {
OutVals <- boxplot(data)$out
remOutliers <- sapply(data, function(x) x[!x %in% OutVals])
return (remOutliers)
}
# Scale data -> same as scale() function
z_score <- function(x){
return ((x - mean(x))/sd(x))
}
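vehClass <- vehData$Class                # keep the class labels before the column is dropped
vehDataRemove1 <- vehData[, -1]          # drop the Samples column
vehDataRemove2 <- vehDataRemove1[, -19]  # drop the Class column
vehData <- vehDataRemove2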
#Begin removing outliers
removeOutliers1 <- removeOutliers(vehData)
removeOutliers2 <- removeOutliers(removeOutliers1)
removeOutliers3 <- removeOutliers(removeOutliers2)
removeOutliers4 <- removeOutliers(removeOutliers3)
cleanVehicleData <- removeOutliers4
cl_vehDataScale <- lapply(cleanVehicleData, z_score)
set.seed(26)
clusterNo <- NbClust(cl_vehDataScale, distance="euclidean", min.nc=2, max.nc=10,
method="kmeans", index="all")
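A possible explanation, offered as a sketch rather than a confirmed diagnosis: sapply() can only simplify its result to a matrix when every column comes back with the same length. Once different columns lose different numbers of values, it returns a plain list, and lapply() always returns a list. NbClust expects a numeric matrix or data frame, which would be consistent with the error above. One way to keep the table shape is to set outliers to NA column by column and then drop the affected rows. The toy data and the helper name dropOutlierRows below are made up for illustration; they are not part of the original code.

```r
# Small made-up numeric data frame (hypothetical values, not the vehicle data)
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(10, 11, 12, 13, 14))

# Hypothetical helper: flag outliers per column with boxplot.stats(),
# replace them with NA so every column keeps the same length,
# then drop any row that contains an NA
dropOutlierRows <- function(df) {
  flagged <- lapply(df, function(x) {
    x[x %in% boxplot.stats(x)$out] <- NA
    x
  })
  cleaned <- as.data.frame(flagged)
  cleaned[complete.cases(cleaned), , drop = FALSE]
}

cleanToy <- dropOutlierRows(toy)

# scale() returns a matrix; as.data.frame() turns it back into a data frame
scaledToy <- as.data.frame(scale(cleanToy))
str(scaledToy)
```

Under these assumptions the result stays a data frame with one row per remaining observation, so it could be passed to NbClust in place of the list. Note that this drops a whole row whenever any single column has an outlier, which may or may not be what you want.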
I am very new to Shiny applications. I have already built a customer prediction model (linear regression) in R. For that, I extracted the month, day, week, etc. from the date and used them as independent variables. Now I would like to build a Shiny application for the same model. The application will take a date as input and output the predicted number of customers. I have date-wise historical customer data. For the analysis, I derived month, week, and day as independent variables, and the number of customers is my dependent variable. I trained a linear regression for prediction. My problem is converting a new input date into month, day, week, etc. so that I can predict the number of customers for the new date. Kindly help me in this regard.
Prediction
library(caret)
mydata <- read.csv("main.csv", header = TRUE)
mydata$date <- as.Date(mydata$date, format = "%m/%d/%Y")
mydata$month <- strftime(mydata$date, "%m")
mydata$day <- strftime(mydata$date, "%d")
mydata$week <- strftime(mydata$date, "%w")
mydata$week_year <- strftime(mydata$date, "%W")
mydata$month <- as.factor(mydata$month)
mydata$day <- as.factor(mydata$day)
mydata$week <- as.factor(mydata$week)
mydata$week_year <- as.factor(mydata$week_year)
mydata <- mydata[c(1, 3, 4, 5, 6)]
ind <- sample(2, nrow(mydata), replace = TRUE, prob=c(0.7, 0.3))
trainset = mydata[ind == 1,]
testset = mydata[ind == 2,]
pred_cus <-glm(no_customer~month+week+day,
data = trainset,
family = gaussian)
testset$prediction <- predict(pred_cus, testset)
RMSE(testset$prediction, testset$no_customer)
structure(list(
Customer = c(94L, 61L, 51L, 28L, 29L, 56L, 99L, 87L, 88L, 71L, 40L, 33L,
57L, 71L, 84L, 81L, 57L, 31L, 28L, 77L, 84L, 69L, 76L, 65L,
36L, 26L, 60L, 70L, 82L, 81L, 49L, 54L, 18L, 66L, 89L, 69L,
61L, 88L, 40L, 25L, 82L, 77L, 88L, 72L, 75L, 40L, 24L, 79L,
67L, 82L, 55L, 78L, 44L, 14L, 76L, 89L, 87L, 93L, 64L, 23L,
34L, 65L, 83L, 92L, 87L, 105L, 40L, 32L, 80L, 76L, 83L, 76L,
70L, 43L, 33L, 75L, 75L, 70L, 55L, 70L, 36L, 13L, 64L, 72L,
79L, 62L, 52L, 30L, 32L, 85L, 87L, 84L, 93L, 73L, 21L, 19L,
101L, ''''''''''''''''''''''''''''''''
Label = c("1/1/2016", "1/10/2016", "1/11/2016", "1/12/2016", "1/13/2016",
"1/14/2016", "1/15/2016", "1/16/2016", "1/17/2016", "1/18/2016",
"1/19/2016", "1/2/2016", "1/20/2016", "1/21/2016", "1/22/2016",
"1/23/2016", "1/24/2016", "1/25/2016", "1/26/2016", "1/27/2016",
"1/28/2016", "1/29/2016", "1/3/2016", "1/30/2016", "1/31/2016",
"1/4/2016", "1/5/2016", "1/6/2016", "1/7/2016", "1/8/2016",
"1/9/2016", "10/1/2015", "10/10/2015", "10/11/2015", "10/12/2015",
"10/13/2015", "10/14/2015", "10/15/2015", "10/16/2015", "10/17/2015",
"10/18/2015", "10/19/2015", "10/2/2015", "10/20/2015", "10/21/2015",
"10/22/2015", "10/23/2015", "10/24/2015",
class = "factor")
),
.Names = c("Customer", "date"),
class = "data.frame",
row.names = c(NA, -457L)
)
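One way to handle the new input date, sketched under the assumption that it arrives as a string in the same "%m/%d/%Y" format as the training data: apply the same strftime() calls to it and build a one-row data frame with the same column names the model was trained on. The helper name makeDateFeatures is hypothetical, and fixing the factor levels to the full calendar range is an illustrative choice, not something taken from the original code.

```r
# Hypothetical helper: derive the same date-based predictors for new dates
makeDateFeatures <- function(dates, format = "%m/%d/%Y") {
  d <- as.Date(dates, format = format)
  data.frame(
    month = factor(strftime(d, "%m"), levels = sprintf("%02d", 1:12)),  # "01".."12"
    day   = factor(strftime(d, "%d"), levels = sprintf("%02d", 1:31)),  # "01".."31"
    week  = factor(strftime(d, "%w"), levels = as.character(0:6))       # 0 = Sunday
  )
}

# Example: 15 March 2016 was a Tuesday
newFeatures <- makeDateFeatures("3/15/2016")
print(newFeatures)
```

The resulting data frame could then presumably be passed as newdata to predict() on the fitted model, for example inside the Shiny server function. predict() will complain if the new date produces a factor value that never occurred in the training set, so that case may need separate handling.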
I have a dataframe of the form
Region Name   3-15   4-15   5-15   ...   3-16
Name1           30     82     56   ...     32
Name2           65     23     38   ...     11
...            ...    ...    ...   ...    ...
Name18          87     33     11   ...     51
The first column being the names of regions and the other columns being recorded events over time (monthly by column)
I'd like to plot the recorded monthly values over time with respect to their associated name. Specifically, a different line for each Named region with a differentiated colour. Any advice would be appreciated, a lot of the plotting functions for data frames seem to function on frames of a different format.
dput() data:
dataframe <- structure(list("LSOA Name" = c("Lancaster 001", "Lancaster 002",
"Lancaster 003", "Lancaster 004", "Lancaster 005", "Lancaster 006",
"Lancaster 008", "Lancaster 009", "Lancaster 010", "Lancaster 011",
"Lancaster 013", "Lancaster 014", "Lancaster 015", "Lancaster 016",
"Lancaster 017", "Lancaster 018", "Lancaster 019", "Lancaster 020"
), "3-15" = c(49L, 16L, 17L, 28L, 21L, 197L, 57L, 143L, 78L,
121L, 67L, 223L, 41L, 86L, 66L, 27L, 40L, 77L), "4-15" = c(63L,
11L, 26L, 29L, 19L, 203L, 69L, 154L, 82L, 125L, 62L, 198L, 44L,
99L, 64L, 26L, 42L, 99L), "5-15" = c(67L, 10L, 20L, 30L, 10L,
194L, 62L, 186L, 61L, 110L, 75L, 273L, 29L, 126L, 92L, 34L, 41L,
88L), "6-15" = c(58L, 8L, 18L, 36L, 29L, 198L, 62L, 167L, 83L,
110L, 59L, 254L, 26L, 99L, 73L, 17L, 30L, 109L), "7-15" = c(53L,
29L, 27L, 23L, 38L, 188L, 56L, 149L, 90L, 129L, 37L, 226L, 32L,
119L, 57L, 14L, 30L, 96L), "8-15" = c(44L, 9L, 25L, 28L, 29L,
237L, 69L, 171L, 78L, 108L, 45L, 261L, 22L, 103L, 68L, 33L, 35L,
108L), "9-15" = c(59L, 12L, 18L, 35L, 19L, 230L, 45L, 128L, 74L,
144L, 56L, 223L, 26L, 90L, 51L, 27L, 23L, 120L), "10-15" = c(45L,
26L, 31L, 23L, 25L, 195L, 53L, 155L, 74L, 120L, 58L, 276L, 38L,
92L, 72L, 25L, 40L, 123L), "11-15" = c(31L, 11L, 33L, 15L, 19L,
188L, 52L, 127L, 66L, 102L, 50L, 241L, 26L, 74L, 72L, 26L, 35L,
68L), "12-15" = c(34L, 22L, 21L, 22L, 17L, 205L, 80L, 150L, 73L,
109L, 50L, 228L, 29L, 57L, 59L, 14L, 45L, 93L), "1-16" = c(20L,
9L, 25L, 21L, 11L, 199L, 46L, 124L, 65L, 117L, 40L, 224L, 28L,
88L, 43L, 22L, 18L, 94L), "2-16" = c(54L, 11L, 29L, 20L, 11L,
164L, 44L, 117L, 70L, 85L, 46L, 192L, 23L, 89L, 50L, 27L, 29L,
86L), "3-16" = c(53L, 11L, 24L, 26L, 19L, 203L, 45L, 144L, 66L,
109L, 47L, 213L, 15L, 120L, 59L, 15L, 33L, 127L)), .Names = c("LSOA Name",
"3-15", "4-15", "5-15", "6-15", "7-15", "8-15", "9-15", "10-15",
"11-15", "12-15", "1-16", "2-16", "3-16"), row.names = c(NA,
-18L), class = "data.frame")
A typical way of plotting lines by groups in ggplot is to shift the data to long format, where one column identifies the group, and the other columns identify the x and y axis values.
This example shifts your data into long format with three columns: LSOAName, month_col, and values_col. It adds a day value onto the month-year, and converts that column to a date. Then it plots a line for each group.
I've renamed your dataframe d, because dataframe could be easily misinterpreted as the function data.frame().
# load libraries
library(magrittr)
library(dplyr)
library(tidyr)
library(ggplot2)
# rename dataframe so it doesn't look so much like the base function
d <- dataframe
# remove spaces in column names
names(d) <- gsub(" ", "", names(d))
# shift data from wide to long and then
# add a day value and convert day-month-year to date class
d %<>% gather(month_col, values_col, -LSOAName) %>%
mutate(month_col = as.Date(paste0("1-", month_col), "%d-%m-%y"))
# plot using ggplot2
ggplot(d, aes(x = month_col, y = values_col, colour = LSOAName)) +
geom_line()
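For reference, the wide-to-long step can also be sketched in base R without tidyr. The two-row data frame below is made up for illustration; the column names LSOAName, month_col, and values_col follow the example above.

```r
# Made-up wide data: one row per region, one column per month
wide <- data.frame(
  LSOAName = c("A", "B"),
  `3-15` = c(1L, 2L),
  `4-15` = c(3L, 4L),
  check.names = FALSE  # keep "3-15" style column names as they are
)

monthCols <- setdiff(names(wide), "LSOAName")

# Stack the month columns on top of each other: one row per region and month
long <- data.frame(
  LSOAName   = rep(wide$LSOAName, times = length(monthCols)),
  month_col  = rep(monthCols, each = nrow(wide)),
  values_col = unlist(wide[monthCols], use.names = FALSE)
)

# Prepend a day so the month-year string can be parsed as a Date
long$month_col <- as.Date(paste0("1-", long$month_col), "%d-%m-%y")
print(long)
```

The long data frame has the same three columns as the gather() result, so the ggplot() call above should work on it unchanged.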
Edit
%<>% is found in the magrittr package. It is a compound pipe assignment operator. While %>% returns the result of a pipeline, %<>% assigns the result back to the left side object.
Instead of writing
d <- d %>% [pipeline]
you can assign the results to d by writing
d %<>% [pipeline]
I know loops should be avoided in R. What is the best way of rewriting this code?
X <- structure(list(ELEMENT = c("TMAX", "TMIN", "PRCP", "AWND", "WDF2", "WSF2"),
VALUE1 = c(309L, 249L, 76L, 27L, 110L, 67L),
VALUE2 = c(317L, 274L, 20L, 66L, 110L, 93L),
VALUE3 = c(311L, 266L, 0L, 41L, 120L,57L),
VALUE4 = c(308L, 262L, 0L, 31L, 120L, 57L),
VALUE5 = c(316L, 240L, 0L, 18L, 90L, 41L),
VALUE6 = c(305L, 242L, 51L, 20L, 100L, 36L),
VALUE7 = c(323L, 245L, 0L, 21L, 90L, 41L),
VALUE8 = c(330L, 250L, 287L, 30L, 70L, 62L)),
.Names = c("ELEMENT", "VALUE1", "VALUE2", "VALUE3", "VALUE4",
"VALUE5", "VALUE6", "VALUE7", "VALUE8"),
row.names = 10240:10245, class = "data.frame")
PRCP <- rep(NA, 8)
PRCP[1] <- X[X$ELEMENT=="PRCP",][[2]]
PRCP[2] <- X[X$ELEMENT=="PRCP",][[3]]
PRCP[3] <- X[X$ELEMENT=="PRCP",][[4]]
PRCP[4] <- X[X$ELEMENT=="PRCP",][[5]]
PRCP[5] <- X[X$ELEMENT=="PRCP",][[6]]
PRCP[6] <- X[X$ELEMENT=="PRCP",][[7]]
PRCP[7] <- X[X$ELEMENT=="PRCP",][[8]]
PRCP[8] <- X[X$ELEMENT=="PRCP",][[9]]
Thanks!
Is unlist what you are looking for?
unlist(X[X$ELEMENT == "PRCP",2:9], use.names=FALSE)
I would use
as.integer(X[X$ELEMENT == "PRCP",2:9])
or else "as.numeric...". (It looks like you want integers.) That's not that different from "unlist", but that's typically what I do.
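A quick check on a cut-down, made-up version of X suggests the two suggestions give the same vector when exactly one row matches:

```r
# Cut-down stand-in for X (values copied from the first three VALUE columns)
X <- data.frame(
  ELEMENT = c("TMAX", "PRCP"),
  VALUE1 = c(309L, 76L),
  VALUE2 = c(317L, 20L),
  VALUE3 = c(311L, 0L)
)

# One-row data frame holding only the value columns for PRCP
prcpRow <- X[X$ELEMENT == "PRCP", -1]

viaUnlist    <- unlist(prcpRow, use.names = FALSE)
viaAsInteger <- as.integer(prcpRow)

identical(viaUnlist, viaAsInteger)
```

One difference worth knowing: if more than one row matched, unlist() would still return a vector (column by column), while as.integer() on a multi-row data frame fails with a coercion error.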
I have a data frame (df1) that contains 3 columns (Y1, Y2, x). I managed to plot boxplots of Y1 against x and Y2 against x. I have another data frame (df2) which contains two columns, A and x. I want to plot a line graph of A against x and add it to the boxplot. Note that the variable x in both data frames is the x-axis variable; however, it has different values in each. I tried to combine and reshape both data frames and plot based on factor(x), but I got 3 boxplots in one graph. I need to plot df2 as a line and df1 as boxplots in one graph.
df1 <- structure(list(Y1 = c(905L, 941L, 744L, 590L, 533L, 345L, 202L,
369L, 200L, 80L, 200L, 80L, 50L, 30L, 60L, 20L, 30L, 30L), Y2 = c(774L,
823L, 687L, 545L, 423L, 375L, 249L, 134L, 45L, 58L, 160L, 60L,
20L, 40L, 20L, 26L, 19L, 27L), x = c(10L, 10L, 10L, 20L, 20L,
20L, 40L, 40L, 40L, 50L, 50L, 50L, 70L, 70L, 70L, 90L, 90L, 90L
)), .Names = c("Y1", "Y2", "x"), row.names = c(NA, -18L), class = "data.frame")
df2 <- structure(list(Y3Line = c(384L, 717L, 914L, 359L, 241L, 265L,
240L, 174L, 114L, 165L, 184L, 96L, 59L, 60L, 127L, 54L, 31L,
44L), x = c(36L, 36L, 36L, 56L, 56L, 56L, 65L, 65L, 65L, 75L,
75L, 75L, 85L, 85L, 85L, 99L, 99L, 99L)), .Names = c("A",
"x"), row.names = c(NA, -18L), class = "data.frame")
df_l <- melt(df1, id.vars = "x")
ggplot(df_l, aes(x = factor(x), y =value, fill=variable )) +
geom_boxplot()+
# here I'm trying to add the line graph from df2
geom_line(data = df2, aes(x = x, y=A))
Any suggestions?
In the second dataset you have three y values per x value. Do you want to draw separate lines per x value, or a line through the mean per x value? Both are shown below. The trick is to first convert the x variables in both datasets to factors that contain all the levels of both variables.
df1 <-structure(list(Y1 = c(905L, 941L, 744L, 590L, 533L, 345L, 202L,
369L, 200L, 80L, 200L, 80L, 50L, 30L, 60L, 20L, 30L, 30L), Y2 = c(774L,
823L, 687L, 545L, 423L, 375L, 249L, 134L, 45L, 58L, 160L, 60L,
20L, 40L, 20L, 26L, 19L, 27L), x = c(10L, 10L, 10L, 20L, 20L,
20L, 40L, 40L, 40L, 50L, 50L, 50L, 70L, 70L, 70L, 90L, 90L, 90L
)), .Names = c("Y1", "Y2", "x"), row.names = c(NA, -18L), class = "data.frame")
df2 <- structure(list(Y3Line = c(384L, 717L, 914L, 359L, 241L, 265L,
240L, 174L, 114L, 165L, 184L, 96L, 59L, 60L, 127L, 54L, 31L,
44L), x = c(36L, 36L, 36L, 56L, 56L, 56L, 65L, 65L, 65L, 75L,
75L, 75L, 85L, 85L, 85L, 99L, 99L, 99L)), .Names = c("A",
"x"), row.names = c(NA, -18L), class = "data.frame")
library(ggplot2)
library(reshape2)
df_l <- melt(df1, id.vars = "x")
allLevels <- levels(factor(c(df_l$x,df2$x)))
df_l$x <- factor(df_l$x,levels=(allLevels))
df2$x <- factor(df2$x,levels=(allLevels))
Line per x category:
ggplot(data=df_l,aes(x = x, y =value))+geom_line(data=df2,aes(x = factor(x), y =A)) +
geom_boxplot(aes(fill=variable ))
Connected means of x categories:
ggplot(data=df2,aes(x = factor(x), y =A)) +
stat_summary(fun=mean, geom="line", aes(group=1)) +  # fun.y was renamed to fun in ggplot2 3.3.0
geom_boxplot(data=df_l,aes(x = x, y =value,fill=variable ))
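To see what the shared-levels trick does on its own, here is a minimal base R sketch with made-up x values (boxX and lineX are hypothetical stand-ins for df_l$x and df2$x):

```r
# Made-up x values from the two data sets
boxX  <- c(10L, 20L, 40L)
lineX <- c(36L, 56L)

# Build one level set covering both, sorted numerically before conversion
allLevels <- levels(factor(c(boxX, lineX)))

# Give both factors the same levels so they share one discrete axis
boxF  <- factor(boxX, levels = allLevels)
lineF <- factor(lineX, levels = allLevels)

allLevels
as.integer(boxF)
as.integer(lineF)
```

Because both factors share the same levels, each value maps to a consistent slot on the discrete axis. Keep in mind that the categories are equally spaced, so the distance between 36 and 40 looks the same as the distance between 10 and 20.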