Summary of a Subset in R does not work - Why? - r

I am doing the Analytics Edge course on edX and ran into this problem. We have a dataset which we are subsetting. Running str on the subset works as intended, but trying summary on the same subset throws an error. Can someone explain why?
> str(WHO_Europe)
'data.frame': 53 obs. of 13 variables:
$ Country : Factor w/ 194 levels "Afghanistan",..: 2 4 8 10 11 16 17 22 26 42 ...
$ Region : Factor w/ 6 levels "Africa","Americas",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Population : int 3162 78 2969 8464 9309 9405 11060 3834 7278 4307 ...
$ Under15 : num 21.3 15.2 20.3 14.5 22.2 ...
$ Over60 : num 14.93 22.86 14.06 23.52 8.24 ...
$ FertilityRate : num 1.75 NA 1.74 1.44 1.96 1.47 1.85 1.26 1.51 1.48 ...
$ LifeExpectancy : int 74 82 71 81 71 71 80 76 74 77 ...
$ ChildMortality : num 16.7 3.2 16.4 4 35.2 5.2 4.2 6.7 12.1 4.7 ...
$ CellularSubscribers : num 96.4 75.5 103.6 154.8 108.8 ...
$ LiteracyRate : num NA NA 99.6 NA NA NA NA 97.9 NA 98.8 ...
$ GNI : num 8820 NA 6100 42050 8960 ...
$ PrimarySchoolEnrollmentMale : num NA 78.4 NA NA 85.3 NA 98.9 86.5 99.3 94.8 ...
$ PrimarySchoolEnrollmentFemale: num NA 79.4 NA NA 84.1 NA 99.2 88.4 99.7 97 ...
> Summary(WHO_Europe)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘Summary’ for signature ‘"data.frame"’
> write.csv(WHO_Europe,"WHO_Europe.CSV")
> Summary(WHO_Europe)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘Summary’ for signature ‘"data.frame"’
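A note on the likely cause: the calls above use Summary with a capital S, and R is case-sensitive. Summary is the name of a group generic from the methods package, not the summary() function, so no method exists for a data frame. The sketch below uses a small made-up data frame (a hypothetical stand-in, not the WHO data) to show that the lowercase call works on a subset while the capitalised call raises an error.

```r
# Hypothetical stand-in data; the values are only for illustration
who <- data.frame(
  Region     = factor(c("Europe", "Europe", "Africa")),
  Population = c(3162L, 78L, 29825L)
)
who_europe <- subset(who, Region == "Europe")

# Lowercase summary() is the function that summarises a data frame
s <- summary(who_europe)
print(s)

# Capitalised Summary() is a different object; calling it on a data frame errors
res <- tryCatch(Summary(who_europe), error = function(e) e)
print(inherits(res, "error"))
```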

Related

Agricolae, tapply, error: arguments must have same length

I am new to R and I am having issues moving forward with the data analysis. My Excel data has a lot of NAs, and I tried troubleshooting this error. Here is my code, in case anyone can help, and a link to a sample of my data:
file:///C:/Users/steph/Documents/DLI%20ANOVA%20Sample.htm
Some of my variables have 4 reps instead of the full 8 reps, so I have a lot of NAs in the Excel file. I keep getting this error after I try tapply:
Error in tapply(X = data1$gi..m3., INDEX = data1$cultivar, FUN = mean, :
arguments must have same length
library(agricolae)
data1=read.csv("DLI ANOVA Sample.csv", header=T, as.is=T)
#setting factors
block = as.factor(data1$block)
treatmentt = as.factor(data1$trt)
cultivar<-factor(data1$cv,c("CR", "LB","RF","RR","S","SNS","SNY","SSJ","YC"))
str(data1)
#Summary statistics
tapply(X = data1$growth.index, INDEX = data1$cultivar, FUN = mean, na.rm=T)
tapply(X = data1$growth.index, INDEX = data1$treatment, FUN = mean, na.rm=T)
'data.frame': 288 obs. of 24 variables:
$ block : int 1 1 2 2 3 3 4 4 1 1 ...
$ trt : chr "HL-L" "HL-L" "HL-L" "HL-L" ..
$ cv : chr "CR" "CR" "CR" "CR" ...
$ rep : int 1 2 3 4 5 6 7 8 1 2 ...
$ height : int 23 20 25 19 23 19 22 19 19 24
$ growth.index : num 0.0221 0.0258 0.0276 0.0227 0.0209
$ number.of.mature.fruit : int 34 30 35 34 28 25 40 24 12 16 ...
$ mature.fruit.fw : num 163 163 186 152 169 ...
$ number.of.immature.fruit : int 38 28 40 27 35 37 44 48 20 30 ...
$ immature.fruit.fw : num 77.4 66.6 87.6 43.4 81.3 ...
$ Total.number.of.fruit : num 72 58 75 61 63 62 84 72 32 46 ...
$ Total.fruit.fw : num 241 230 273 195 250 ...
$ Fruit.Water.Content..g. : num NA 209 NA 176 NA ...
$ Brix.. : num 4.9 NA 5.6 NA 4.7 NA 5.1 NA 5.6 NA ...
$ pH : num 4.17 NA 4.3 NA 4.1 ...
$ EC.uS.mL : num 4.46 NA 9.19 NA 8.24 ...
$ X..citric.Acid : num 0.704 NA 0.397 NA 0.653 ...
$ Sugar.Acid.Ratio : num 6.96 NA 14.11 NA 7.2 ...
$ oedema.injury.level..1.6. : int 3 3 1 2 1 1 1 2 2 1 ...
$ Stomatal.conductance : num NA 365 NA 422 NA ...
$ spad : num NA NA NA 64.3 NA 65.5 NA 68.7 NA 55.6 ...
$ Irrigation.Events : int NA 14 NA 12 NA 13 NA 16 NA 13 ...
$ WUE : num NA 0.00584 NA 0.00693 NA ...
$ transpiration..g.H2O.lost..g.dry.biomass.: num NA 117 NA 111 NA ...
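One probable cause, judging from the str output: data1 has columns named cv and trt, but the tapply calls index by data1$cultivar and data1$treatment. The factors were created as standalone variables, so those columns do not exist, and $ returns NULL for a missing column. A NULL index has length 0, which does not match the length of X. The sketch below reproduces this on a tiny made-up data frame (hypothetical values) and shows one way around it, assuming the intent was to group by cultivar.

```r
# Hypothetical miniature of data1; values are made up
data1 <- data.frame(
  cv           = c("CR", "CR", "LB", "LB"),
  growth.index = c(0.02, NA, 0.03, 0.025),
  stringsAsFactors = FALSE
)

# A standalone factor does not add a column to the data frame
cultivar <- factor(data1$cv)
missing_col <- is.null(data1$cultivar)   # the column lookup finds nothing

# Store the factor in the data frame (or index by the standalone variable)
data1$cultivar <- factor(data1$cv)
means <- tapply(X = data1$growth.index, INDEX = data1$cultivar,
                FUN = mean, na.rm = TRUE)
print(means)
```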

How to fix problem with "Error in plot.window(...) : need finite 'xlim' values" in R

I am trying to write a function that will do scatter plotting for me.
The data structure that I am working with looks as follows:
'data.frame': 129 obs. of 15 variables:
$ Player : Factor w/ 129 levels "Abbrederis, Jared",..: 1 2 3 4 5 6 7 8 9 10 ...
$ College : Factor w/ 79 levels "Alabama","Arizona",..: 78 20 65 77 27 48 67 31 31 19 ...
$ Position : Factor w/ 7 levels "DB","LB","OL",..: 7 7 6 4 4 4 2 2 7 7 ...
$ OverallGrade: num 5.2 5.96 5.4 5.16 5.45 5.1 6.6 5.37 5.9 6.4 ...
$ Height : int 73 73 77 70 68 73 77 73 71 77 ...
$ ArmLength : num 31.4 32.6 34 31.2 31 ...
$ Weight : int 195 212 265 225 173 218 255 237 198 240 ...
$ HandLength : num 9.62 9 9 9.5 8.88 ...
$ Dash40 : num 4.5 4.56 4.74 4.82 4.26 4.48 4.66 4.64 4.43 4.61 ...
$ BenchPress : int 4 14 28 20 20 19 15 22 7 13 ...
$ VerticalJump: num 30.5 39.5 33 29.5 38 38 34.5 35 38.5 32.5 ...
$ BroadJump : int 117 123 118 106 122 121 119 123 122 119 ...
$ Cone3Drill : num 6.8 6.82 7.42 7.24 6.86 7.07 6.82 7.24 6.69 7.33 ...
$ Shuttle20 : num 4.08 4.3 4.3 4.49 4.06 4.46 4.19 4.35 3.94 4.39 ...
$ Position1 : Factor w/ 7 levels "WO","DB","S",..: 1 1 6 4 4 4 5 5 1 1 ...
..- attr(*, "scores")= num [1:7(1d)] 4.54 4.75 5.22 4.59 4.58 ...
.. ..- attr(*, "dimnames")=List of 1
.. .. ..$ : chr "DB" "LB" "OL" "RB" ...
I managed to do the plotting without writing a function and the code works:
with(nfl,plot(nfl$Dash40,nfl$BenchPress,
pch=c(1,3,4,2,0,8,5),
col=c("black","red","blue","darkgreen","purple","orange","gray"),
xlab="40-yard dash time in seconds",
ylab="Bench Press weight",
panel.first = grid()))
legend("bottomright", legend=levels(nfl$Position),
pch=c(1,3,4,2,0,8,5),
cex=0.5,
col=c("black","red","blue","darkgreen","purple","orange","gray"))
a<-paste(nfl$Player,nfl$BenchPress)
text(nfl$Dash40,nfl$BenchPress,label=as.character(a),cex=0.5)
So basically I want to see the relationship between different numeric variables, and I thought that if the code above works, the following function should work as well:
myplot<-function(xvar,yvar,xlab,ylab){
b<-paste("xlab","vs","ylab")
xvar<-nfl$"xvar"
yvar<-nfl$"yvar"
with(nfl,plot(yvar,xvar),
pch=c(1,3,4,2,0,8,5),
col=c("black","red","blue","darkgreen","purple","orange","gray"),
xlab="xlab",ylab="ylab",
main="b")
}
myplot(Dash40,BenchPress,dash,bench)
I used Dash40 and BenchPress to test the function, but it turns out the function doesn't work:
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) :
Essentially I am trying to do the same job with two different pieces of code. Why doesn't the second one work? Could someone please give me some hints on how to solve the problem?
I am very confused by your function. Why does it have arguments xvar and yvar that it overwrites without using them?
What is meant by nfl$"xvar" and nfl$"yvar"?
Why is this done at all, since you then use with(nfl, ...)?
nfl$"xvar" looks up a column literally named "xvar". If nfl has no columns called xvar or yvar, the lookups return NULL; if you plot NULL, you need to specify the limits of the plot, since they cannot be found from the data.
The function fails because you are plotting NULL.
You also need to pass the data from nfl to the function, as there is a danger that Dash40 and BenchPress will not be found, since they only exist within nfl. To do this, use the $ operator: instead of BenchPress, pass nfl$BenchPress to the function.
The code below should work.
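myplot<-function(xvar,yvar,xlab,ylab){
# build the plot title from the two axis labels
b<-paste(xlab,"vs",ylab)
# xvar goes on the x-axis and yvar on the y-axis, matching xlab and ylab
plot(xvar,yvar,
pch=c(1,3,4,2,0,8,5),
col=c("black","red","blue","darkgreen","purple","orange","gray"),
xlab=xlab,ylab=ylab,
main=b)
}
# pass the columns themselves, and the labels as strings
myplot(nfl$Dash40,nfl$BenchPress,"dash","bench")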
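As an alternative sketch, the function can take the data frame plus column names as strings and look the columns up with [[ ]], which also lets the names double as axis labels. The function name myplot2, its arguments, and the three-row nfl stand-in are assumptions for illustration, not part of the original code.

```r
# Hypothetical stand-in for the nfl data frame
nfl <- data.frame(Dash40 = c(4.50, 4.56, 4.74),
                  BenchPress = c(4L, 14L, 28L))

myplot2 <- function(data, xname, yname) {
  # [[ ]] looks a column up by the string held in the variable
  x <- data[[xname]]
  y <- data[[yname]]
  if (is.null(x) || is.null(y)) stop("column not found in data")
  plot(x, y, xlab = xname, ylab = yname,
       main = paste(xname, "vs", yname))
  invisible(list(x = x, y = y))
}

pdf(NULL)   # off-screen device so the sketch runs non-interactively; drop it to see the plot
res <- myplot2(nfl, "Dash40", "BenchPress")
dev.off()

# The lookup from the question returns NULL, which is what plot() choked on
null_lookup <- is.null(nfl$"xvar")
```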

Trying to draw a SPC Chart in R

I am trying to create a control chart using the code below, but I am getting the error shown after it. The data has a date in the first column, followed by 12 other columns holding different variables.
library("qcc")
attach(data)
Data_Frame_Data <- as.data.frame.matrix(data)
q <- qcc(Cancer_Activity
, type="xbar"
, nsigmas=3)
Error in sd.xbar(c(1396310400, 1398902400, 1401580800, 1404172800,
1406851200, : group sizes must be larger than one
This is the output when I run str(data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 48 obs. of 13 variables:
$ Date : POSIXct, format: "2014-04-01" "2014-05-01" "2014-06-01" "2014-07-01" ...
$ CW_Activity : num 37 29.5 34 46 39.5 41.5 42 40 46 39.5 ...
$ CW_Breach : num 3.5 6 8.5 10 5.5 8 4.5 3 3.5 4 ...
$ ICHT_Activity: num 73.5 89 60 83.5 85 88.5 65.5 80 75.5 74 ...
$ ICHT_Breach : num 8 11.5 11.5 12 11 15 9.5 14 8.5 16.5 ...
$ LNWH_Activity: num 67 76.5 56 79.5 67 83 77.5 67 66 60.5 ...
$ LNWH_Breach : num 10 12.5 13 14 10.5 16 16.5 12 5 13.5 ...
$ THH_Activity : num 30 26 24.5 36 31 25 33 21.5 42 25.5 ...
$ THH_Breach : num 2 3 2 1 5 1.5 3.5 0.5 3.5 3 ...
$ RBH_Activity : num 2.5 5 6.5 7 6.5 7.5 3.5 9 8 6.5 ...
$ RBH_Breach : num 0.5 1 2 2 4 4 1 2 2.5 2 ...
$ NWL_Activity : num 210 226 181 252 229 ...
$ NWL_Breach : num 24 34 37 39 36 44.5 35 31.5 23 39 ...
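Two observations that may help here. First, the numbers in the error message look like the Date column expressed in seconds (1396310400 corresponds to 2014-04-01), which suggests a date column rather than an activity column reached qcc. Second, an x-bar chart estimates variation within subgroups, so a single vector with one value per month gives group sizes of one. For one observation per time point, the usual choice is an individuals chart, which qcc provides as type = "xbar.one". The base R sketch below shows what such a chart computes, assuming the standard moving-range estimate of sigma; the variable names are illustrative.

```r
# First ten CW_Activity values from the str() output above
x <- c(37, 29.5, 34, 46, 39.5, 41.5, 42, 40, 46, 39.5)

# Moving ranges between consecutive observations
mr <- abs(diff(x))

# Centre line and sigma estimate; 1.128 is the d2 constant for subgroups of size 2
center    <- mean(x)
sigma_hat <- mean(mr) / 1.128

# Three-sigma control limits
lcl <- center - 3 * sigma_hat
ucl <- center + 3 * sigma_hat

# Indices of points outside the limits, if any
out_of_control <- which(x < lcl | x > ucl)
print(c(LCL = lcl, CL = center, UCL = ucl))
```

With qcc itself, the equivalent call would be along the lines of qcc(data$CW_Activity, type = "xbar.one"), passing a numeric column explicitly rather than relying on attach.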

r subset(df, condition) gives a different result from df[condition, ] [duplicate]

This question already has answers here:
How to subset data in R without losing NA rows?
(3 answers)
Closed 4 years ago.
Some weird output when subsetting a data.frame in R.
Here is the file I used:
https://d37djvu3ytnwxt.cloudfront.net/assets/courseware/v1/ccdc87b80d92a9c24de2f04daec5bb58/asset-v1:MITx+15.071x+2T2017+type#asset+block/WHO.csv
After reading the data into R, there are 194 obs. of 13 vars.
> str(WHO)
'data.frame': 194 obs. of 13 variables:
$ Country : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Region : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
$ Population : int 29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
$ Under15 : num 47.4 21.3 27.4 15.2 47.6 ...
$ Over60 : num 3.82 14.93 7.17 22.86 3.84 ...
$ FertilityRate : num 5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
$ LifeExpectancy : int 60 74 73 82 51 75 76 71 82 81 ...
$ ChildMortality : num 98.5 16.7 20 3.2 163.5 ...
$ CellularSubscribers : num 54.3 96.4 99 75.5 48.4 ...
$ LiteracyRate : num NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
$ GNI : num 1140 8820 8310 NA 5230 ...
$ PrimarySchoolEnrollmentMale : num NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
$ PrimarySchoolEnrollmentFemale: num NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
But the result of subsetting with the subset function differs from that of df[,], as in the example below.
> Outliers <- WHO[WHO$GNI > 10000 & WHO$FertilityRate > 2.5,]
> nrow(Outliers)
[1] 27
Country Region Population Under15 Over60 FertilityRate LifeExpectancy ChildMortality CellularSubscribers
NA <NA> <NA> NA NA NA NA NA NA NA
23 Botswana Africa 2004 33.75 5.63 2.71 66 53.3 142.82
NA.1 <NA> <NA> NA NA NA NA NA NA NA
NA.2 <NA> <NA> NA NA NA NA NA NA NA
(trimmed ...)
There are a lot of NA rows.
Using the subset function instead yields the correct result.
> Outliers <- subset(WHO, GNI > 10000 & FertilityRate > 2.5)
> nrow(Outliers)
[1] 7
> Outliers
Country Region Population Under15 Over60 FertilityRate LifeExpectancy ChildMortality CellularSubscribers
23 Botswana Africa 2004 33.75 5.63 2.71 66 53.3 142.82
56 Equatorial Guinea Africa 736 38.95 4.53 5.04 54 100.3 59.15
63 Gabon Africa 1633 38.49 7.38 4.18 62 62.0 117.32
83 Israel Europe 7644 27.53 15.15 2.92 82 4.2 121.66
88 Kazakhstan Europe 16271 25.46 10.04 2.52 67 18.7 155.74
131 Panama Americas 3802 28.65 10.13 2.52 77 18.5 188.60
150 Saudi Arabia Eastern Mediterranean 28288 29.69 4.59 2.76 76 8.6 191.24
(trimmed ...)
What about making sure you get rid of the NAs first?
Outliers <- WHO[!is.na(WHO$GNI) & WHO$GNI > 10000 &
!is.na(WHO$FertilityRate) & WHO$FertilityRate > 2.5,]
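The reason for the difference: a comparison involving NA evaluates to NA, and indexing a data frame with an NA in the logical vector returns a row filled with NA, whereas subset() treats NA in the condition as FALSE. Wrapping the condition in which() is a shorter alternative to the explicit is.na checks. The sketch below uses a four-row made-up data frame (hypothetical values, with column names borrowed from WHO) to show all three behaviours.

```r
# Hypothetical miniature with the same column names as WHO
df <- data.frame(GNI           = c(12000, NA, 15000, 8000),
                 FertilityRate = c(2.7, 3.0, NA, 4.0))

cond <- df$GNI > 10000 & df$FertilityRate > 2.5
print(cond)   # contains NA wherever a needed value is missing

with_na_rows <- df[cond, ]          # NA entries in cond produce all-NA rows
via_subset   <- subset(df, GNI > 10000 & FertilityRate > 2.5)
via_which    <- df[which(cond), ]   # which() keeps only the TRUE positions

print(nrow(with_na_rows))
print(nrow(via_subset))
print(nrow(via_which))
```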

How to deal with " rank-deficient fit may be misleading" in R?

I'm trying to predict the values of a test data set based on a training data set. It predicts values without errors, but the predictions deviate A LOT from the original values. It even predicts values around -356, although none of the original values exceeds 200 (and there are no negative values). The warning is bugging me, as I think the values deviate so much because of it.
Warning message:
In predict.lm(fit2, data_test) :
prediction from a rank-deficient fit may be misleading
Is there any way I can get rid of this warning? The code is simple:
fit2 <- lm(runs~., data=train_data)
prediction<-predict(fit2, data_test)
prediction
I searched a lot, but to be honest I couldn't understand much about this error.
Here is the str of the test and train data sets, in case someone needs them:
> str(train_data)
'data.frame': 36 obs. of 28 variables:
$ matchid : int 57 58 55 56 53 54 51 52 45 46 ...
$ TeamName : chr "South Africa" "West Indies" "South Africa" "West Indies" ...
$ Opp_TeamName : chr "West Indies" "South Africa" "West Indies" "South Africa" ...
$ TeamRank : int 4 3 4 3 4 3 10 7 5 1 ...
$ Opp_TeamRank : int 3 4 3 4 3 4 7 10 1 5 ...
$ Team_Top10RankingBatsman : int 0 1 0 1 0 1 0 0 2 2 ...
$ Team_Top50RankingBatsman : int 4 6 4 6 4 6 3 5 4 3 ...
$ Team_Top100RankingBatsman: int 6 8 6 8 6 8 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 0 1 0 1 0 0 0 2 2 ...
$ Opp_Top50RankingBatsman : int 6 4 6 4 6 4 5 3 3 4 ...
$ Opp_Top100RankingBatsman : int 8 6 8 6 8 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "1st innings" "2nd innings" ...
$ Runs_OverAll : num 361 705 348 630 347 ...
$ AVG_Overall : num 27.2 20 23.3 19.1 24 ...
$ SR_Overall : num 128 121 120 118 118 ...
$ Runs_Last10Matches : num 118.5 71 102.1 71 78.6 ...
$ AVG_Last10Matches : num 23.7 20.4 20.9 20.4 23.2 ...
$ SR_Last10Matches : num 120 106 114 106 116 ...
$ Runs_BatingFirst : num 236 459 230 394 203 ...
$ AVG_BatingFirst : num 30.6 23.2 24 21.2 27.1 ...
$ SR_BatingFirst : num 127 136 123 125 118 ...
$ Runs_BatingSecond : num 124 262 119 232 144 ...
$ AVG_BatingSecond : num 25.5 18.3 22.8 17.8 22.8 ...
$ SR_BatingSecond : num 125 118 112 117 114 ...
$ Runs_AgainstTeam2 : num 88.3 118.3 76.3 103.9 49.3 ...
$ AVG_AgainstTeam2 : num 28.2 23 24.7 22.1 16.4 ...
$ SR_AgainstTeam2 : num 139 127 131 128 111 ...
$ runs : int 165 168 231 236 195 126 143 141 191 135 ...
> str(data_test)
'data.frame': 34 obs. of 28 variables:
$ matchid : int 59 60 61 62 63 64 65 66 69 70 ...
$ TeamName : chr "India" "West Indies" "England" "New Zealand" ...
$ Opp_TeamName : chr "West Indies" "India" "New Zealand" "England" ...
$ TeamRank : int 2 3 5 1 4 8 6 2 10 1 ...
$ Opp_TeamRank : int 3 2 1 5 8 4 2 6 1 10 ...
$ Team_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 0 2 ...
$ Team_Top50RankingBatsman : int 5 6 4 3 4 2 5 5 3 3 ...
$ Team_Top100RankingBatsman: int 7 8 7 6 6 5 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 2 0 ...
$ Opp_Top50RankingBatsman : int 6 5 3 4 2 4 5 5 3 3 ...
$ Opp_Top100RankingBatsman : int 8 7 6 7 5 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "2nd innings" "1st innings" ...
$ Runs_OverAll : num 582 618 470 602 509 ...
$ AVG_Overall : num 25 21.8 20.3 20.7 19.6 ...
$ SR_Overall : num 113 120 123 120 112 ...
$ Runs_Last10Matches : num 182 107 117 167 140 ...
$ AVG_Last10Matches : num 37.1 43.8 21 24.9 27.3 ...
$ SR_Last10Matches : num 111 153 122 141 120 ...
$ Runs_BatingFirst : num 319 314 271 345 294 ...
$ AVG_BatingFirst : num 23.6 17.8 20.6 20.3 19.5 ...
$ SR_BatingFirst : num 116.9 98.5 118 124.3 115.8 ...
$ Runs_BatingSecond : num 264 282 304 256 186 ...
$ AVG_BatingSecond : num 28 23.7 31.9 21.6 16.5 ...
$ SR_BatingSecond : num 96.5 133.9 129.4 112 99.5 ...
$ Runs_AgainstTeam2 : num 98.2 95.2 106.9 75.4 88.5 ...
$ AVG_AgainstTeam2 : num 45.3 42.7 38.1 17.7 27.1 ...
$ SR_AgainstTeam2 : num 125 138 152 110 122 ...
$ runs : int 192 196 159 153 122 120 160 161 70 145 ...
In simple words, how can I get rid of this warning so that it doesn't affect my predictions?
(Intercept) matchid TeamNameBangladesh
1699.98232628 -0.06793787 59.29445330
TeamNameEngland TeamNameIndia TeamNameNew Zealand
347.33030177 -499.40074338 -179.19192936
TeamNamePakistan TeamNameSouth Africa TeamNameSri Lanka
-272.71610614 -3.54867488 -45.27920191
TeamNameWest Indies Opp_TeamNameBangladesh Opp_TeamNameEngland
-345.54349798 135.05901017 108.04227770
Opp_TeamNameIndia Opp_TeamNameNew Zealand Opp_TeamNamePakistan
-162.24418387 -60.55364436 -114.74599364
Opp_TeamNameSouth Africa Opp_TeamNameSri Lanka Opp_TeamNameWest Indies
196.90856999 150.70170068 -6.88997714
TeamRank Opp_TeamRank Team_Top10RankingBatsman
NA NA NA
Team_Top50RankingBatsman Team_Top100RankingBatsman Opp_Top10RankingBatsman
NA NA NA
Opp_Top50RankingBatsman Opp_Top100RankingBatsman InningType2nd innings
NA NA 24.24029455
Runs_OverAll AVG_Overall SR_Overall
-0.59935875 20.12721378 -13.60151334
Runs_Last10Matches AVG_Last10Matches SR_Last10Matches
-1.92526750 9.24182916 1.23914363
Runs_BatingFirst AVG_BatingFirst SR_BatingFirst
1.41001672 -9.88582744 -6.69780509
Runs_BatingSecond AVG_BatingSecond SR_BatingSecond
-0.90038727 -7.11580086 3.20915976
Runs_AgainstTeam2 AVG_AgainstTeam2 SR_AgainstTeam2
3.35936312 -5.90267210 2.36899131
You can have a look at this detailed discussion:
predict.lm() in a loop. warning: prediction from a rank-deficient fit may be misleading
In general, multicollinearity can lead to a rank-deficient design matrix in regression; the NA coefficients in your output mark the terms that lm could not estimate.
You can try applying PCA to tackle the multicollinearity and then fit the regression afterwards.
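Before reaching for PCA, it can help to see which terms are aliased. The sketch below is a minimal illustration on simulated data (the variables x1, x2 and y are made up): it builds a perfectly collinear predictor, confirms that lm reports NA for it, and compares the fit's rank with the number of coefficients. Note also that the model in the question fits more coefficients than a 36-row training set can support once TeamName and Opp_TeamName expand into dummy variables, which is enough on its own to make the fit rank-deficient.

```r
set.seed(1)

# Simulated data with a perfectly collinear predictor (x2 = 2 * x1)
d <- data.frame(x1 = rnorm(20))
d$x2 <- 2 * d$x1
d$y  <- 1 + d$x1 + rnorm(20)

fit <- lm(y ~ ., data = d)

# Aliased terms show up as NA coefficients
dropped <- names(which(is.na(coef(fit))))
print(dropped)

# The fit is rank-deficient when its rank is below the number of coefficients
rank_deficient <- fit$rank < length(coef(fit))
print(rank_deficient)

# Refitting without the aliased term gives a full-rank model
fit_ok <- lm(y ~ x1, data = d)
```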
