Trying to draw an SPC chart in R

I am trying to create a control chart using the code below, but I get the error shown after it. The data has a date in the first column, followed by 12 other columns holding different variables.
library("qcc")
attach(data)
Data_Frame_Data <- as.data.frame.matrix(data)
q <- qcc(Cancer_Activity
, type="xbar"
, nsigmas=3)
Error in sd.xbar(c(1396310400, 1398902400, 1401580800, 1404172800,
1406851200, : group sizes must be larger than one
This is the output when I run str(data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 48 obs. of 13 variables:
$ Date : POSIXct, format: "2014-04-01" "2014-05-01" "2014-06-01" "2014-07-01" ...
$ CW_Activity : num 37 29.5 34 46 39.5 41.5 42 40 46 39.5 ...
$ CW_Breach : num 3.5 6 8.5 10 5.5 8 4.5 3 3.5 4 ...
$ ICHT_Activity: num 73.5 89 60 83.5 85 88.5 65.5 80 75.5 74 ...
$ ICHT_Breach : num 8 11.5 11.5 12 11 15 9.5 14 8.5 16.5 ...
$ LNWH_Activity: num 67 76.5 56 79.5 67 83 77.5 67 66 60.5 ...
$ LNWH_Breach : num 10 12.5 13 14 10.5 16 16.5 12 5 13.5 ...
$ THH_Activity : num 30 26 24.5 36 31 25 33 21.5 42 25.5 ...
$ THH_Breach : num 2 3 2 1 5 1.5 3.5 0.5 3.5 3 ...
$ RBH_Activity : num 2.5 5 6.5 7 6.5 7.5 3.5 9 8 6.5 ...
$ RBH_Breach : num 0.5 1 2 2 4 4 1 2 2.5 2 ...
$ NWL_Activity : num 210 226 181 252 229 ...
$ NWL_Breach : num 24 34 37 39 36 44.5 35 31.5 23 39 ...
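A possible reading of the error: an "xbar" chart expects subgroups of more than one observation, while this data has a single value per month (the numbers in the error message also look like the Date column in epoch seconds, so the whole table may have been passed). For one value per period, an individuals chart is the usual choice. The sketch below computes individuals-chart limits in base R from the moving range, using only the ten CW_Activity values visible in the str() output; the variable names are illustrative. As far as I know, qcc offers the same chart via type = "xbar.one", shown here only as a comment.

```r
# First ten CW_Activity values, copied from the str() output above
cw_activity <- c(37, 29.5, 34, 46, 39.5, 41.5, 42, 40, 46, 39.5)

# Centre line: the mean of the individual observations
center <- mean(cw_activity)

# Moving range between consecutive observations
moving_range <- abs(diff(cw_activity))

# 1.128 is the d2 constant for moving ranges of two consecutive points
sigma_hat <- mean(moving_range) / 1.128

# Three-sigma control limits
ucl <- center + 3 * sigma_hat
lcl <- center - 3 * sigma_hat
limits <- c(LCL = lcl, CL = center, UCL = ucl)
print(limits)

# Indices of points outside the limits (none expected for these ten values)
out_of_control <- which(cw_activity > ucl | cw_activity < lcl)

# Assumed qcc equivalent (requires the qcc package):
# library(qcc)
# q <- qcc(data$CW_Activity, type = "xbar.one", nsigmas = 3)
```

Passing a single numeric column rather than the whole data frame keeps the Date column out of the calculation.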

Related

What is the best way to sample data based on date range?

I have a weather dataset running from 01 Nov 2007 to 18 May 2008. My data is date-dependent.
I want to predict the temperature from 07 May 2008 to 18 May 2008 (roughly 10-15 observations in total). My data size is around 200 rows.
I will be using a decision tree/RF, SVM and NN to make my predictions.
I've never handled data like this, so I'm not sure how to sample it. If we ignore the bias factor, can I sample training data from 01 Nov 2007 to 18 May 2008 and test data from 07 May 2008 to 18 May 2008? Or is there a better way to handle this? Would it be better to first sort my data by date, split the ordered data 80:20 into training and test sets, and then just output the required dates?
install.packages("rattle")
install.packages("RGtk2")
library("rattle")
library("janitor")  # assumed source of convert_to_date()
seed <- 42
set.seed(seed)
fname <- system.file("csv", "weather.csv", package = "rattle")
dataset <- read.csv(fname, encoding = "UTF-8")
dataset$Date <- convert_to_date(dataset$Date)
# Date is already of class Date here, so order on it directly
dataset <- dataset[order(dataset$Date), ]
dataset <- dataset[1:200, ]
str(dataset)
> str(dataset)
'data.frame': 200 obs. of 24 variables:
$ Date : Date, format: "2007-11-01" "2007-11-02" "2007-11-03" ...
$ Location : chr "Canberra" "Canberra" "Canberra" "Canberra" ...
$ MinTemp : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 8.8 8.4 ...
$ MaxTemp : num 24.3 26.9 23.4 15.5 16.1 16.9 18.2 17 19.5 22.8 ...
$ Rainfall : num 0 3.6 3.6 39.8 2.8 0 0.2 0 0 16.2 ...
$ Evaporation : num 3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6 4 5.4 ...
$ Sunshine : num 6.3 9.7 3.3 9.1 10.6 8.2 8.4 4.6 4.1 7.7 ...
$ WindGustDir : chr "NW" "ENE" "NW" "NW" ...
$ WindGustSpeed: int 30 39 85 54 50 44 43 41 48 31 ...
$ WindDir9am : chr "SW" "E" "N" "WNW" ...
$ WindDir3pm : chr "NW" "W" "NNE" "W" ...
$ WindSpeed9am : int 6 4 6 30 20 20 19 11 19 7 ...
$ WindSpeed3pm : int 20 17 6 24 28 24 26 24 17 6 ...
$ Humidity9am : int 68 80 82 62 68 70 63 65 70 82 ...
$ Humidity3pm : int 29 36 69 56 49 57 47 57 48 32 ...
$ Pressure9am : num 1020 1012 1010 1006 1018 ...
$ Pressure3pm : num 1015 1008 1007 1007 1018 ...
$ Cloud9am : int 7 5 8 2 7 7 4 6 7 7 ...
$ Cloud3pm : int 7 3 7 7 7 5 6 7 7 1 ...
$ Temp9am : num 14.4 17.5 15.4 13.5 11.1 10.9 12.4 12.1 14.1 13.3 ...
$ Temp3pm : num 23.6 25.7 20.2 14.1 15.4 14.8 17.3 15.5 18.9 21.7 ...
$ RainToday : chr "No" "Yes" "Yes" "Yes" ...
$ RISK_MM : num 3.6 3.6 39.8 2.8 0 0.2 0 0 16.2 0 ...
$ RainTomorrow : chr "Yes" "Yes" "Yes" "Yes" ...
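One way to frame the split, sketched under assumptions: for date-dependent data it is common to split chronologically, so the model is trained only on days before the forecast window and tested on the window itself. The example uses a synthetic stand-in data frame (the temperatures are made up) covering the same date range; the names weather, cutoff, train and test are illustrative.

```r
# Synthetic stand-in for the weather data: one row per day, made-up temperatures
dates <- seq(as.Date("2007-11-01"), as.Date("2008-05-18"), by = "day")
set.seed(42)
weather <- data.frame(
  Date = dates,
  MaxTemp = rnorm(length(dates), mean = 20, sd = 5)
)

# First day of the period to be predicted
cutoff <- as.Date("2008-05-07")

# Train on everything strictly before the cutoff, test on the rest
train <- weather[weather$Date < cutoff, ]
test <- weather[weather$Date >= cutoff, ]

nrow(train)
nrow(test)
```

Splitting this way keeps the test dates out of the training set, which the overlapping ranges in the question would not.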

Can we use as.factor to convert categorical variables with multiple levels for a decision tree, or do we need to use model.matrix?

I am trying to build a decision tree model in R with both categorical and numerical variables. Some categorical variables have 3 levels, so can I just use as.factor and then use them in my model? I tried model.matrix, but my doubt is that model.matrix converts the variable into numeric 0/1 columns and the splitting then happens on those numeric values. For example, if Color has 3 levels (blue, red, green), the splitting rule will look like color_green < 0.5, although the column can only ever take the values 0 and 1.
If you are asking whether you can use factors to build an rpart decision tree, then yes. See the example below from the documentation. Note that there are many other packages for decision trees.
library(rpart)
rpart(Reliability ~ ., data=car90)
#> n=76 (35 observations deleted due to missingness)
#>
#> node), split, n, loss, yval, (yprob)
#> * denotes terminal node
#>
#> 1) root 76 53 average (0.2 0.12 0.3 0.11 0.28)
#> 2) Country=Germany,Korea,Mexico,Sweden,USA 49 29 average (0.31 0.18 0.41 0.1 0)
#> 4) Tires=145,155/80,165/80,185/80,195/60,195/65,195/70,205/60,215/65,225/75,275/40 17 9 Much worse (0.47 0.29 0 0.24 0) *
#> 5) Tires=175/70,185/65,185/70,185/75,195/75,205/70,205/75,215/70 32 12 average (0.22 0.12 0.62 0.031 0)
#> 10) HP.revs< 4650 13 7 Much worse (0.46 0.23 0.31 0 0) *
#> 11) HP.revs>=4650 19 3 average (0.053 0.053 0.84 0.053 0) *
#> 3) Country=Japan,Japan/USA 27 6 Much better (0 0 0.11 0.11 0.78) *
str(car90)
#> 'data.frame': 111 obs. of 34 variables:
#> $ Country : Factor w/ 10 levels "Brazil","England",..: 5 5 4 4 4 4 10 10 10 NA ...
#> $ Disp : num 112 163 141 121 152 209 151 231 231 189 ...
#> $ Disp2 : num 1.8 2.7 2.3 2 2.5 3.5 2.5 3.8 3.8 3.1 ...
#> $ Eng.Rev : num 2935 2505 2775 2835 2625 ...
#> $ Front.Hd : num 3.5 2 2.5 4 2 3 4 6 5 5.5 ...
#> $ Frt.Leg.Room: num 41.5 41.5 41.5 42 42 42 42 42 41 41 ...
#> $ Frt.Shld : num 53 55.5 56.5 52.5 52 54.5 56.5 58.5 59 58 ...
#> $ Gear.Ratio : num 3.26 2.95 3.27 3.25 3.02 2.8 NA NA NA NA ...
#> $ Gear2 : num 3.21 3.02 3.25 3.25 2.99 2.85 2.84 1.99 1.99 2.33 ...
#> $ HP : num 130 160 130 108 168 208 110 165 165 101 ...
#> $ HP.revs : num 6000 5900 5500 5300 5800 5700 5200 4800 4800 4400 ...
#> $ Height : num 47.5 50 51.5 50.5 49.5 51 49.5 50.5 51 50.5 ...
#> $ Length : num 177 191 193 176 175 186 189 197 197 192 ...
#> $ Luggage : num 16 14 17 10 12 12 16 16 16 15 ...
#> $ Mileage : num NA 20 NA 27 NA NA 21 NA 23 NA ...
#> $ Model2 : Factor w/ 21 levels ""," Turbo 4 (3)",..: 1 1 1 1 1 1 1 14 13 1 ...
#> $ Price : num 11950 24760 26900 18900 24650 ...
#> $ Rear.Hd : num 1.5 2 3 1 1 2.5 2.5 4.5 3.5 3.5 ...
#> $ Rear.Seating: num 26.5 28.5 31 28 25.5 27 28 30.5 28.5 27.5 ...
#> $ RearShld : num 52 55.5 55 52 51.5 55.5 56 58.5 58.5 56.5 ...
#> $ Reliability : Ord.factor w/ 5 levels "Much worse"<"worse"<..: 5 5 NA NA 4 NA 3 3 3 NA ...
#> $ Rim : Factor w/ 6 levels "R12","R13","R14",..: 3 4 4 3 3 4 3 3 3 3 ...
#> $ Sratio.m : num NA NA NA NA NA NA NA NA NA NA ...
#> $ Sratio.p : num 0.86 0.96 0.97 0.71 0.88 0.78 0.76 0.83 0.87 0.88 ...
#> $ Steering : Factor w/ 3 levels "manual","power",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ Tank : num 13.2 18 21.1 15.9 16.4 21.1 15.7 18 18 16.5 ...
#> $ Tires : Factor w/ 30 levels "145","145/80",..: 16 20 20 8 17 28 13 23 23 22 ...
#> $ Trans1 : Factor w/ 4 levels "","man.4","man.5",..: 3 3 3 3 3 3 1 1 1 1 ...
#> $ Trans2 : Factor w/ 4 levels "","auto.3","auto.4",..: 3 3 2 2 3 3 2 3 3 3 ...
#> $ Turning : num 37 42 39 35 35 39 41 43 42 41 ...
#> $ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 3 1 1 3 3 2 2 NA ...
#> $ Weight : num 2700 3265 2935 2670 2895 ...
#> $ Wheel.base : num 102 109 106 100 101 109 105 111 111 108 ...
#> $ Width : num 67 69 71 67 65 69 69 72 72 71 ...
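To connect this back to the three-level Color example in the question, here is a small sketch with a made-up data frame (toy, Color, Size and Label are all illustrative names and values). With the column converted by as.factor, rpart should report the split in terms of level names rather than a numeric threshold such as color_green < 0.5.

```r
library(rpart)

# Made-up data: a three-level categorical variable and a numeric one
toy <- data.frame(
  Color = rep(c("blue", "red", "green"), times = 20),
  Size = seq_len(60)
)

# Convert the character column to a factor; no dummy coding is needed
toy$Color <- as.factor(toy$Color)

# Made-up target that depends only on Color
toy$Label <- factor(ifelse(toy$Color == "green", "yes", "no"))

# Fit a classification tree directly on the factor
fit <- rpart(Label ~ Color + Size, data = toy, method = "class")
print(fit)

# Variable used for the first split
root_var <- as.character(fit$frame$var[1])
```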

How to apply dist_google (from stplanr package) to a list of data frames?

I am a beginner with the stplanr package. I have split a large data frame of long/lat points into chunks of 25 rows, because the dist_google function can be applied to at most 25 origin-destination pairs. Here is the original large data frame:
GPSLatitude GPSLongitude
1 40.66126 22.89565
2 40.66127 22.89565
3 40.66128 22.89565
4 40.66130 22.89566
5 40.66131 22.89567
6 40.66132 22.89569
7 40.66134 22.89573
8 40.66136 22.89577
9 40.66137 22.89582
10 40.66141 22.89594
11 40.66142 22.89601
12 40.66145 22.89609
13 40.66147 22.89618
14 40.66150 22.89627
15 40.66152 22.89635
16 40.66155 22.89644
17 40.66160 22.89650
18 40.66165 22.89654
19 40.66172 22.89656
20 40.66178 22.89658
21 40.66186 22.89659
22 40.66193 22.89660
23 40.66200 22.89662
24 40.66207 22.89663
25 40.66213 22.89664
26 40.66218 22.89665
27 40.66223 22.89665
28 40.66227 22.89664
29 40.66230 22.89663
30 40.66234 22.89662
31 40.66238 22.89661
32 40.66242 22.89662
33 40.66244 22.89664
34 40.66245 22.89666
35 40.66247 22.89669
36 40.66248 22.89671
37 40.66249 22.89673
38 40.66250 22.89674
39 40.66251 22.89676
40 40.66253 22.89679
41 40.66255 22.89683
42 40.66257 22.89686
43 40.66261 22.89694
44 40.66263 22.89698
45 40.66265 22.89700
46 40.66267 22.89702
47 40.66268 22.89705
48 40.66270 22.89707
49 40.66272 22.89709
50 40.66273 22.89710
51 40.66274 22.89711
52 40.66275 22.89712
53 40.66275 22.89714
54 40.66276 22.89716
55 40.66276 22.89718
56 40.66276 22.89721
57 40.66275 22.89725
58 40.66273 22.89728
Then I split this data frame into chunks of 25 rows with the following command:
pointssplit <- split(pointsdf, (seq_len(nrow(pointsdf)) - 1) %/% 25)
Finally, I have the following list of smaller data frames:
List of 3
$ 0:'data.frame': 25 obs. of 2 variables:
..$ GPSLatitude : num [1:25] 40.7 40.7 40.7 40.7 40.7 ...
..$ GPSLongitude: num [1:25] 22.9 22.9 22.9 22.9 22.9 ...
$ 1:'data.frame': 25 obs. of 2 variables:
..$ GPSLatitude : num [1:25] 40.7 40.7 40.7 40.7 40.7 ...
..$ GPSLongitude: num [1:25] 22.9 22.9 22.9 22.9 22.9 ...
$ 2:'data.frame': 8 obs. of 2 variables:
..$ GPSLatitude : num [1:8] 40.7 40.7 40.7 40.7 40.7 ...
..$ GPSLongitude: num [1:8] 22.9 22.9 22.9 22.9 22.9 ...
I've tried to use lapply to apply the dist_google() function:
lapply(length(pointssplit), dist_google(from = point2, to = pointssplit, mode = "driving", google_api = "my api key")) #point2 is my reference point
The problem is that I don't know how to handle the "to" argument inside dist_google() so that it takes the long/lat from each data frame separately, so I get an Error in match.fun(FUN).
Any ideas? Thank you in advance.
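A likely cause of the match.fun error: lapply expects the list as its first argument and a function as its second, whereas the call above passes a number (the list length) and the result of a dist_google() call. The sketch below shows the general pattern with a hypothetical stand-in function, fake_dist(), so that it runs without an API key; the coordinates are made up and the lon/lat order of point2 is an assumption.

```r
# Made-up coordinates standing in for the 58 GPS points
pointsdf <- data.frame(
  GPSLatitude = 40.66 + seq_len(58) * 1e-5,
  GPSLongitude = 22.89 + seq_len(58) * 1e-5
)

# Chunks of at most 25 rows
pointssplit <- split(pointsdf, (seq_len(nrow(pointsdf)) - 1) %/% 25)

# Hypothetical stand-in for dist_google(): one output row per destination
fake_dist <- function(from, to) {
  data.frame(
    to_lat = to$GPSLatitude,
    to_lon = to$GPSLongitude,
    dummy_distance = sqrt((to$GPSLatitude - from[2])^2 +
                          (to$GPSLongitude - from[1])^2)
  )
}

# Reference point, assumed to be in lon/lat order
point2 <- c(22.89565, 40.66126)

# Pass the list itself, and wrap the call in a function of one chunk
results <- lapply(pointssplit, function(chunk) fake_dist(from = point2, to = chunk))

# Stack the per-chunk results into one data frame
all_results <- do.call(rbind, results)
```

With the real function, the anonymous function body would presumably become dist_google(from = point2, to = chunk, mode = "driving", google_api = "my api key"), with the arguments as written in the question.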

How to deal with "rank-deficient fit may be misleading" in R?

I'm trying to predict values for a test data set from a model fitted on a training data set. The prediction runs without errors, but the predicted values deviate a lot from the original values: some are around -356, although none of the original values exceeds 200 and none is negative. The warning below bothers me because I suspect it is the reason for the deviation.
Warning message:
In predict.lm(fit2, data_test) :
prediction from a rank-deficient fit may be misleading
Is there any way I can get rid of this warning? The code is simple:
fit2 <- lm(runs~., data=train_data)
prediction<-predict(fit2, data_test)
prediction
I searched a lot, but to be honest I couldn't understand much about this warning.
The str output of the training and test data sets, in case someone needs it:
> str(train_data)
'data.frame': 36 obs. of 28 variables:
$ matchid : int 57 58 55 56 53 54 51 52 45 46 ...
$ TeamName : chr "South Africa" "West Indies" "South Africa" "West Indies" ...
$ Opp_TeamName : chr "West Indies" "South Africa" "West Indies" "South Africa" ...
$ TeamRank : int 4 3 4 3 4 3 10 7 5 1 ...
$ Opp_TeamRank : int 3 4 3 4 3 4 7 10 1 5 ...
$ Team_Top10RankingBatsman : int 0 1 0 1 0 1 0 0 2 2 ...
$ Team_Top50RankingBatsman : int 4 6 4 6 4 6 3 5 4 3 ...
$ Team_Top100RankingBatsman: int 6 8 6 8 6 8 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 0 1 0 1 0 0 0 2 2 ...
$ Opp_Top50RankingBatsman : int 6 4 6 4 6 4 5 3 3 4 ...
$ Opp_Top100RankingBatsman : int 8 6 8 6 8 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "1st innings" "2nd innings" ...
$ Runs_OverAll : num 361 705 348 630 347 ...
$ AVG_Overall : num 27.2 20 23.3 19.1 24 ...
$ SR_Overall : num 128 121 120 118 118 ...
$ Runs_Last10Matches : num 118.5 71 102.1 71 78.6 ...
$ AVG_Last10Matches : num 23.7 20.4 20.9 20.4 23.2 ...
$ SR_Last10Matches : num 120 106 114 106 116 ...
$ Runs_BatingFirst : num 236 459 230 394 203 ...
$ AVG_BatingFirst : num 30.6 23.2 24 21.2 27.1 ...
$ SR_BatingFirst : num 127 136 123 125 118 ...
$ Runs_BatingSecond : num 124 262 119 232 144 ...
$ AVG_BatingSecond : num 25.5 18.3 22.8 17.8 22.8 ...
$ SR_BatingSecond : num 125 118 112 117 114 ...
$ Runs_AgainstTeam2 : num 88.3 118.3 76.3 103.9 49.3 ...
$ AVG_AgainstTeam2 : num 28.2 23 24.7 22.1 16.4 ...
$ SR_AgainstTeam2 : num 139 127 131 128 111 ...
$ runs : int 165 168 231 236 195 126 143 141 191 135 ...
> str(data_test)
'data.frame': 34 obs. of 28 variables:
$ matchid : int 59 60 61 62 63 64 65 66 69 70 ...
$ TeamName : chr "India" "West Indies" "England" "New Zealand" ...
$ Opp_TeamName : chr "West Indies" "India" "New Zealand" "England" ...
$ TeamRank : int 2 3 5 1 4 8 6 2 10 1 ...
$ Opp_TeamRank : int 3 2 1 5 8 4 2 6 1 10 ...
$ Team_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 0 2 ...
$ Team_Top50RankingBatsman : int 5 6 4 3 4 2 5 5 3 3 ...
$ Team_Top100RankingBatsman: int 7 8 7 6 6 5 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 2 0 ...
$ Opp_Top50RankingBatsman : int 6 5 3 4 2 4 5 5 3 3 ...
$ Opp_Top100RankingBatsman : int 8 7 6 7 5 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "2nd innings" "1st innings" ...
$ Runs_OverAll : num 582 618 470 602 509 ...
$ AVG_Overall : num 25 21.8 20.3 20.7 19.6 ...
$ SR_Overall : num 113 120 123 120 112 ...
$ Runs_Last10Matches : num 182 107 117 167 140 ...
$ AVG_Last10Matches : num 37.1 43.8 21 24.9 27.3 ...
$ SR_Last10Matches : num 111 153 122 141 120 ...
$ Runs_BatingFirst : num 319 314 271 345 294 ...
$ AVG_BatingFirst : num 23.6 17.8 20.6 20.3 19.5 ...
$ SR_BatingFirst : num 116.9 98.5 118 124.3 115.8 ...
$ Runs_BatingSecond : num 264 282 304 256 186 ...
$ AVG_BatingSecond : num 28 23.7 31.9 21.6 16.5 ...
$ SR_BatingSecond : num 96.5 133.9 129.4 112 99.5 ...
$ Runs_AgainstTeam2 : num 98.2 95.2 106.9 75.4 88.5 ...
$ AVG_AgainstTeam2 : num 45.3 42.7 38.1 17.7 27.1 ...
$ SR_AgainstTeam2 : num 125 138 152 110 122 ...
$ runs : int 192 196 159 153 122 120 160 161 70 145 ...
In simple words, how can I get rid of this warning so that it doesn't affect my predictions?
(Intercept) matchid TeamNameBangladesh
1699.98232628 -0.06793787 59.29445330
TeamNameEngland TeamNameIndia TeamNameNew Zealand
347.33030177 -499.40074338 -179.19192936
TeamNamePakistan TeamNameSouth Africa TeamNameSri Lanka
-272.71610614 -3.54867488 -45.27920191
TeamNameWest Indies Opp_TeamNameBangladesh Opp_TeamNameEngland
-345.54349798 135.05901017 108.04227770
Opp_TeamNameIndia Opp_TeamNameNew Zealand Opp_TeamNamePakistan
-162.24418387 -60.55364436 -114.74599364
Opp_TeamNameSouth Africa Opp_TeamNameSri Lanka Opp_TeamNameWest Indies
196.90856999 150.70170068 -6.88997714
TeamRank Opp_TeamRank Team_Top10RankingBatsman
NA NA NA
Team_Top50RankingBatsman Team_Top100RankingBatsman Opp_Top10RankingBatsman
NA NA NA
Opp_Top50RankingBatsman Opp_Top100RankingBatsman InningType2nd innings
NA NA 24.24029455
Runs_OverAll AVG_Overall SR_Overall
-0.59935875 20.12721378 -13.60151334
Runs_Last10Matches AVG_Last10Matches SR_Last10Matches
-1.92526750 9.24182916 1.23914363
Runs_BatingFirst AVG_BatingFirst SR_BatingFirst
1.41001672 -9.88582744 -6.69780509
Runs_BatingSecond AVG_BatingSecond SR_BatingSecond
-0.90038727 -7.11580086 3.20915976
Runs_AgainstTeam2 AVG_AgainstTeam2 SR_AgainstTeam2
3.35936312 -5.90267210 2.36899131
You can have a look at this detailed discussion:
predict.lm() in a loop. warning: prediction from a rank-deficient fit may be misleading
In general, multicollinearity can lead to a rank-deficient design matrix in a regression; the NA coefficients in your output mark predictors that lm() could not estimate for that reason.
You can try applying PCA to tackle the multicollinearity and then fit the regression on the components afterwards.
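To make the warning concrete, here is a minimal sketch on made-up data (train, x1, x2, x3 and y are illustrative, not the cricket data). One predictor is an exact linear combination of two others, so lm() reports NA for it and the fit is rank-deficient; dropping the aliased column gives a full-rank fit. With only 36 training rows and a dummy column for every team name, something similar is plausibly happening in the question, though that is an assumption.

```r
set.seed(1)

# Made-up training data with an exactly collinear predictor
train <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
train$x3 <- train$x1 + train$x2
train$y <- 1 + 2 * train$x1 - train$x2 + rnorm(20, sd = 0.1)

fit <- lm(y ~ ., data = train)

# Coefficients that could not be estimated come back as NA
print(coef(fit))
aliased <- names(coef(fit))[is.na(coef(fit))]

# A fit is rank-deficient when its rank is below the number of coefficients
is_rank_deficient <- fit$rank < length(coef(fit))

# Refit without the aliased predictors
keep <- setdiff(names(train), aliased)
fit_reduced <- lm(y ~ ., data = train[, keep])
print(coef(fit_reduced))
```

Checking for NA coefficients before calling predict() is a cheap way to see which columns are redundant.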

Undefined columns selected when subsetting data frame

I have a data frame; str(data) shows the following about it:
> str(data)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
However, when I try, for example, to subset the rows where Ozone is above 14, the following code gives me an error:
> data[data$Ozone > 14 ]
Error in [.data.frame(data, data$Ozone > 14) : undefined columns selected
You want the rows where that condition is true, so you need a comma after the condition:
data[data$Ozone > 14, ]
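One caveat worth sketching: Ozone contains NA values, and a logical index that is NA returns a row filled with NA. The example below assumes the data frame is R's built-in airquality dataset, which matches the str() output above; the result names are illustrative.

```r
# Assumption: the data frame in the question is the built-in airquality dataset
data <- airquality

# Row subsetting with the comma; rows where Ozone is NA come back as all-NA rows
rows_with_na <- data[data$Ozone > 14, ]

# which() drops the NA comparisons and keeps only rows where the test is TRUE
rows_clean <- data[which(data$Ozone > 14), ]

# subset() treats NA as FALSE and gives the same rows
rows_subset <- subset(data, Ozone > 14)

nrow(rows_with_na)
nrow(rows_clean)
```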
