I'm a beginner at ggplot, and I tried to use it to draw some time series data.
I want to draw bound_transporter_in_evolution.mean as a function of time, in different conditions where the attribute p_off (float) varies.
p4 <- ggplot(data=df, aes(x=timesteps.mean)) +
geom_line(aes(y=bound_transporter_in_evolution.mean, color=p_off)) +
xlab(label="Time (s)") +
ylab(label="Number of bound 'in' transporters")
ggsave("p4.pdf", width=8, height=3.3)
I get the following plot:
I expected this result, but with a line instead of points:
Thank you
Since p_off is a numeric variable, ggplot will create only one line connecting all the dots and color it along the values. If you want separate lines, you have to transform your colouring variable into a factor (assuming you have a limited number of different values). Let's take an example with a numeric color variable:
df=data.frame(x=c(1:5, 1:5), y=rnorm(10), z=c(1,1,1,1,1,2,2,2,2,2))
ggplot(data=df, aes(x=x)) + geom_line(aes(x=x, y=y, color=z))
This doesn't make much sense, since consecutive points come from different categories. Now turn it into a factor:
ggplot(data=df, aes(x=x)) + geom_line(aes(x=x, y=y, color=factor(z)))
In your first graph, the line constantly goes from one p_off value to another, and since you have a really big dataset it quickly saturates the screen.
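If the continuous colour scale is worth keeping, an alternative is to leave the variable numeric and also map it to the group aesthetic. A minimal sketch with made-up data (the data frame toy and its columns are assumptions for illustration, not the asker's data):

```r
library(ggplot2)

# Toy data: two time series distinguished by a numeric parameter p_off.
toy <- data.frame(
  time  = rep(1:5, times = 2),
  value = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10),
  p_off = rep(c(0.02, 0.05), each = 5)
)

# group = p_off draws one line per distinct value,
# while color = p_off keeps the continuous colour gradient.
p_grouped <- ggplot(toy, aes(x = time, y = value, group = p_off, color = p_off)) +
  geom_line()

# factor(p_off) also splits the lines, but switches to a discrete legend.
p_factor <- ggplot(toy, aes(x = time, y = value, color = factor(p_off))) +
  geom_line()
```

Which of the two reads better probably depends on how many distinct p_off values there are: a discrete legend suits a handful of values, a gradient suits many.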
Here is the output of str(df):
'data.frame': 150010 obs. of 34 variables:
$ bound_transporter_evolution.low : num [1:150010(1d)] 0 11.4 26.1 41.8 48.2 ...
$ bound_transporter_evolution.mean : num [1:150010(1d)] 0 15 28.2 45 53.8 63.8 71.6 77.8 86.2 91.2 ...
$ bound_transporter_evolution.up : num [1:150010(1d)] 0 18.6 30.3 48.2 59.4 ...
$ bound_transporter_in_evolution.low : num [1:150010(1d)] 0 11.4 26.1 41.8 48.2 ...
$ bound_transporter_in_evolution.mean : num [1:150010(1d)] 0 15 28.2 45 53.8 63.8 71.6 77.8 86.2 91.2 ...
$ bound_transporter_in_evolution.up : num [1:150010(1d)] 0 18.6 30.3 48.2 59.4 ...
$ bound_transporter_out_evolution.low : num [1:150010(1d)] 0 0 0 0 0 0 0 0 0 0 ...
$ bound_transporter_out_evolution.mean: num [1:150010(1d)] 0 0 0 0 0 0 0 0 0 0 ...
$ bound_transporter_out_evolution.up : num [1:150010(1d)] 0 0 0 0 0 0 0 0 0 0 ...
$ free_transporter_evolution.low : num [1:150010(1d)] 200 181 170 152 141 ...
$ free_transporter_evolution.mean : num [1:150010(1d)] 200 185 172 155 146 ...
$ free_transporter_evolution.up : num [1:150010(1d)] 200 189 174 158 152 ...
$ free_transporter_in_evolution.low : num [1:150010(1d)] 186 172 158 139 127 ...
$ free_transporter_in_evolution.mean : num [1:150010(1d)] 188 173 160 143 135 ...
$ free_transporter_in_evolution.up : num [1:150010(1d)] 191 175 162 148 142 ...
$ free_transporter_out_evolution.low : num [1:150010(1d)] 9.18 9.18 9.18 9.18 9.18 ...
$ free_transporter_out_evolution.mean : num [1:150010(1d)] 11.6 11.6 11.6 11.6 11.6 11.6 11.6 11.6 11.6 11.6 ...
$ free_transporter_out_evolution.up : num [1:150010(1d)] 14 14 14 14 14 ...
$ glutamate_evolution.low : num [1:150010(1d)] 2000 1981 1970 1951 1939 ...
$ glutamate_evolution.mean : num [1:150010(1d)] 2000 1985 1971 1954 1943 ...
$ glutamate_evolution.up : num [1:150010(1d)] 2000 1989 1973 1957 1948 ...
$ p_off : num [1:150010(1d)] 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ...
$ simulation_name : Factor w/ 1 level "Variable p-off large diffusion-limited area": 1 1 1 1 1 1 1 1 1 1 ...
$ timesteps.low : num [1:150010(1d)] 0e+00 1e-06 2e-06 3e-06 4e-06 5e-06 6e-06 7e-06 8e-06 9e-06 ...
$ timesteps.mean : num [1:150010(1d)] 0e+00 1e-06 2e-06 3e-06 4e-06 5e-06 6e-06 7e-06 8e-06 9e-06 ...
$ timesteps.up : num [1:150010(1d)] 0e+00 1e-06 2e-06 3e-06 4e-06 5e-06 6e-06 7e-06 8e-06 9e-06 ...
$ transporter_in_evolution.low : num [1:150010(1d)] 186 186 186 186 186 ...
$ transporter_in_evolution.mean : num [1:150010(1d)] 188 188 188 188 188 ...
$ transporter_in_evolution.up : num [1:150010(1d)] 191 191 191 191 191 ...
$ transporter_out_evolution.low : num [1:150010(1d)] 9.18 9.18 9.18 9.18 9.18 ...
$ transporter_out_evolution.mean : num [1:150010(1d)] 11.6 11.6 11.6 11.6 11.6 11.6 11.6 11.6 11.6 11.6 ...
$ transporter_out_evolution.up : num [1:150010(1d)] 14 14 14 14 14 ...
$ variable_parameter : Factor w/ 1 level "p_off": 1 1 1 1 1 1 1 1 1 1 ...
$ variable_value : num [1:150010(1d)] 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ...
I have built a logistic regression model with the dependent variable WinParty, which outputs fine. Then when trying to do variable selection with stepAIC I keep getting this error
Data Structure
tibble [2,467 × 25] (S3: tbl_df/tbl/data.frame)
$ PollingPlace : chr [1:2467] "Abbotsbury" "Abbotsford" "Abbotsford East" "Aberdare" ...
$ CoalitionVotes : int [1:2467] 9438 15548 3960 3164 2370 4524 3186 10710 372 5993 ...
$ VoteDifference : num [1:2467] 0.1397 -0.0579 0.0796 -0.2454 0.2623 ...
$ Liberal.National.Coalition.Percentage: num [1:2467] 57 47.1 54 37.7 63.1 ...
$ WinParty : num [1:2467] 1 0 1 0 1 0 0 0 1 0 ...
$ Median_age_persons : num [1:2467] 43 46 41.5 37 41 31 37 36 57.5 41 ...
$ Median_mortgage_repay_monthly : num [1:2467] 2232 3000 2831 1452 1559 ...
$ Median_tot_prsnl_inc_weekly : num [1:2467] 818 1262 1380 627 719 ...
$ Median_rent_weekly : num [1:2467] 550 595 576 310 290 ...
$ Median_tot_fam_inc_weekly : num [1:2467] 2541 3062 3126 1521 2021 ...
$ Average_household_size : num [1:2467] 3.27 2.35 2.28 2.46 2.38 ...
$ Indig_Percent : num [1:2467] 0 0 1.09 10.94 10.61 ...
$ BirthPlace_Aus : num [1:2467] 60.9 67.9 61.7 90.9 89 ...
$ Other_lang_Percen : num [1:2467] 44.97 25.85 28.71 2.58 2.45 ...
$ Aus_Cit_Percent : num [1:2467] 91.5 91.5 86.6 93.7 91.9 ...
$ Yr12_Comp_Percent : num [1:2467] 49.7 57.1 62.7 25 23.1 ...
$ Pop_Density_SQKM : num [1:2467] 2849 6112 7951 1686 334 ...
$ Industrial_Percent : num [1:2467] 6.24 3.95 4.69 8.3 15.31 ...
$ Population_Serving_Percent : num [1:2467] 16 12.9 15.1 16.1 13.6 ...
$ Health_Education_Percent : num [1:2467] 9.26 11.43 10.28 9.07 7.79 ...
$ Knowledge_Intensive_Percent : num [1:2467] 11.31 19.64 17.06 7.44 6.56 ...
$ Over60_Yr : num [1:2467] 25.1 31.6 24.9 20.6 25.3 ...
$ GenZ : num [1:2467] 24.5 20 25.9 26.2 23.6 ...
$ GenX : num [1:2467] 27 29.1 26.6 25.8 26.1 ...
$ Millenials : num [1:2467] 23.3 20.3 19.7 27.3 27.1 ...
- attr(*, "na.action")= 'omit' Named int [1:8] 264 647 843 1332 1774 2033 2077 2138
..- attr(*, "names")= chr [1:8] "264" "647" "843" "1332" ...
The glm function computes the logistic regression with no errors
mod1 <- glm(WinParty~Median_age_persons+Median_rent_weekly+
Median_tot_fam_inc_weekly+Indig_Percent+BirthPlace_Aus+
Other_lang_Percen+Aus_Cit_Percent+Yr12_Comp_Percent+
Industrial_Percent+Population_Serving_Percent+Health_Education_Percent+
Knowledge_Intensive_Percent+Over60_Yr+GenZ+GenX+Millenials,
family = binomial(link = "logit"), data = GS_PP_Agg)
summary(mod1)
step1 <- stepAIC(mod1, scope = list(lower = "~1",upper = "~Median_age_persons+Median_rent_weekly+
Median_tot_fam_inc_weekly+Indig_Percent+BirthPlace_Aus+
Other_lang_Percen+Aus_Cit_Percent+Yr12_Comp_Percent+
Industrial_Percent+Population_Serving_Percent+Health_Education_Percent+
Knowledge_Intensive_Percent+Over60_Yr+GenZ+GenX+Millenials"), data = GS_PP_Agg)
Step AIC function returns the error:
"Error in FUN(left, right) : non-numeric argument to binary operator"
Some help in solving this error would be greatly appreciated!
I'm trying to run cross-validation for boosted regression/classification trees using the function gbm.step() from the R package dismo, but it returns an empty output and I can't figure out why. This is the code I'm using:
# Extract the attribute table and drop incomplete rows first,
# so that DFbrt_df2 exists before its column names are looked up.
DFbrt_df <- DFbrt@data
DFbrt_df2 <- na.omit(DFbrt_df)
ColIndexCov <- match(names(myRS),colnames(DFbrt_df2))
ColIndexResp <- match(c("HasRes"),colnames(DFbrt_df2))
myBRT = gbm.step(data=DFbrt_df2,
gbm.x = ColIndexCov,
gbm.y = ColIndexResp,
tree.complexity = 3,
learning.rate = 10^(-8),
n.trees = 50,
family = "bernoulli",
n.folds = 4,
fold.vector = DFbrt_df2$Region.num,
step.size = 50,
verbose = F,
silent = T
)
str(DFbrt_df2)
'data.frame': 560845 obs. of 18 variables:
$ Nsamples : num 310 310 310 310 310 310 310 310 310 310 ...
$ cluster : num 39 39 39 39 39 39 39 39 39 39 ...
$ R : num 44.9 44.9 44.9 44.9 44.9 ...
$ P50 : num 0.565 0.544 0.609 0.605 0.593 ...
$ regions : Factor w/ 6 levels "China_east","China_middlesouth",..: 1 1 1 1 1 1 1 1 1 1 ...
$ HasRes : num 1 0 1 0 0 0 1 1 0 0 ...
$ use : num 10.02 9.75 0 9.38 8.77 ...
$ acc : num 0 0 0.4103 0.0769 0.0779 ...
$ tmp : num 2.46 2.46 2.46 2.46 2.45 ...
$ irg : num 1.788 0.399 1.205 1.836 1.841 ...
$ PgExt : num 3.11 0 3.7 3.11 3.18 ...
$ PgInt : num 4.69 2.76 0 3.99 2.22 ...
$ ChExt : num 3.74 0 4.33 3.74 3.81 ...
$ ChInt : num 5.01 5.99 5.35 4.88 4.97 ...
$ Ca : num 0 0 2.71 0 2.8 ...
$ veg : num 0 0 0 0 0 0 0 0 0 0 ...
$ Region.num: num 4 4 4 4 4 4 4 4 4 4 ...
$ Region : num 4 4 4 4 4 4 4 4 4 4 ...
- attr(*, "na.action")= 'omit' Named int 1 2 3 4 5 6 7 8 9 10 ...
..- attr(*, "names")= chr "1" "2" "3" "4" ...
The response variable is HasRes and the covariates are use, acc, tmp, irg, PgExt, PgInt, ChExt, ChInt, Ca, veg.
I have a list in R and I want to loop through all of its elements.
This is the structure of the object:
> str(AAPL.OPT[c])
List of 1
$ jun.12.2020:List of 2
..$ calls:'data.frame': 52 obs. of 7 variables:
.. ..$ Strike: num [1:52] 180 185 200 210 240 ...
.. ..$ Last : num [1:52] 123 118 131 120 85 ...
.. ..$ Chg : num [1:52] 0 0 7.61 9.48 0 ...
.. ..$ Bid : num [1:52] 149 144 129 119 89 ...
.. ..$ Ask : num [1:52] 153.3 148.5 133.5 123.7 93.5 ...
.. ..$ Vol : int [1:52] NA 15 16 2 1 1 3 36 1 2 ...
.. ..$ OI : int [1:52] 0 15 25 4 50 3 4 36 6 10 ...
..$ puts :'data.frame': 56 obs. of 7 variables:
.. ..$ Strike: num [1:56] 150 165 170 180 185 190 195 200 205 210 ...
.. ..$ Last : num [1:56] 0.05 0.02 0.14 0.05 0.03 0.02 0.01 0.02 0.01 0.01 ...
.. ..$ Chg : num [1:56] 0 0 0 0 0 0 0 0 0 0 ...
.. ..$ Bid : num [1:56] NA 0 0 0 0 0 0 0 0 0 ...
.. ..$ Ask : num [1:56] 2.13 0.11 0.11 1.8 1.87 0.01 1.88 0.5 1.88 2.13 ...
.. ..$ Vol : int [1:56] NA 1 1 2 1 16 1 17 1 21 ...
.. ..$ OI : int [1:56] 1 10 7 9 76 201 113 314 92 264 ...
I cannot access the next level of the object programmatically (by indexing the value).
I want to do something like this:
AAPL.OPT[c][1]
instead of this
AAPL.OPT[c]$jun.12.2020
Sample data of AAPL.OPT[c]
$`jun.12.2020`$`calls`
Strike Last Chg Bid Ask Vol OI
AAPL200612C00180000 180.0 123.29 0.00000000 149.00 153.35 NA 0
AAPL200612C00185000 185.0 117.60 0.00000000 144.00 148.50 15 15
AAPL200612C00200000 200.0 131.15 7.60999300 129.00 133.50 16 25
AAPL200612C00210000 210.0 119.95 9.47999600 119.30 123.65 2 4
....
AAPL.OPT[c] gives a list of length 1 whose single element is itself a list of two data frames. Using [[c]] returns that inner list of length 2. To access each data frame, subset further with [[, i.e. AAPL.OPT[[c]][[1]] and AAPL.OPT[[c]][[2]].
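The difference between [ and [[ can be sketched on a small stand-in list. The object opt, the index i, and all values below are made up for illustration; only the nesting mirrors the structure shown in the question:

```r
# Toy stand-in for the nested structure (values are made up).
opt <- list(
  jun.12.2020 = list(
    calls = data.frame(Strike = c(180, 185), Last = c(123.29, 117.60)),
    puts  = data.frame(Strike = c(150, 165), Last = c(0.05, 0.02))
  )
)
i <- 1  # hypothetical index; the question uses a variable named c

one_element_list <- opt[i]            # still a list of length 1
inner            <- opt[[i]]          # the list holding calls and puts
calls            <- opt[[i]][[1]]     # first data frame, by position
puts             <- opt[[i]][["puts"]] # second data frame, by name

# Looping over every expiry and every table by position:
for (expiry in seq_along(opt)) {
  for (tbl in seq_along(opt[[expiry]])) {
    cat(names(opt)[expiry], names(opt[[expiry]])[tbl],
        nrow(opt[[expiry]][[tbl]]), "\n")
  }
}
```

Indexing with seq_along() avoids hard-coding names such as jun.12.2020, which is what the question is after.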
We can use
AAPL.OPT[[c]]$jun.12.2020
I want to extract temperature (temp_c) at specific pressure levels (pres_hpa). When I filter my data (dat) using dplyr, each filter creates another data frame with the same 15 columns but a different number of observations. There are many solutions for plotting multiple time series from columns, but I can't adapt any of them to my case. How can I plot multiple time series showing temperature at different levels (x = date, y = temp_c, legend = Press_1000, Press_925, Press_850, Press_700)? Kindly help. Thank you.
library(ggplot2)
library(dplyr)
library(reshape2)
setwd("C:/Users/Hp/Documents/yr/climatology/")
dat <- read.csv("soundingWMKD.csv", head = TRUE, stringsAsFactors = F)
str(dat)
'data.frame': 6583 obs. of 15 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ pres_hpa : num 1006 1000 993 981 1005 ...
$ hght_m : int 16 70 132 238 16 62 141 213 302 329 ...
$ temp_c : num 24 23.6 23.2 24.6 24.2 24.2 24 23.8 23.3 23.2 ...
$ dwpt_c : num 23.4 22.4 21.5 21.6 23.6 23.1 22.9 22.7 22 21.8 ...
$ relh_pct : int 96 93 90 83 96 94 94 94 92 92 ...
$ mixr_g_kg: num 18.4 17.4 16.6 16.9 18.6 ...
$ drct_deg : int 0 0 NA NA 190 210 212 213 215 215 ...
$ sknt_knot: int 0 0 NA NA 1 3 6 8 11 11 ...
$ thta_k : num 297 297 297 299 297 ...
$ thte_k : num 350 347 345 349 351 ...
$ thtv_k : num 300 300 300 302 300 ...
$ date : chr "2017-11-02" "2017-11-02" "2017-11-02" "2017-11-02" ...
$ from_hr : int 0 0 0 0 0 0 0 0 0 0 ...
$ to_hr : int 0 0 0 0 0 0 0 0 0 0 ...
Press_1000 <- filter(dat,dat$pres_hpa == 1000)
Press_925 <- filter(dat,dat$pres_hpa == 925)
Press_850 <- filter(dat,dat$pres_hpa == 850)
Press_700 <- filter(dat,dat$pres_hpa == 700)
date <- as.Date(dat$date, "%Y-%m-%d")  # dates are stored as e.g. "2017-11-02"
str(Press_1000)
'data.frame': 80 obs. of 15 variables:
$ X : int 2 6 90 179 267 357 444 531 585 675 ...
$ pres_hpa : num 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
$ hght_m : int 70 62 63 63 62 73 84 71 74 78 ...
$ temp_c : num 23.6 24.2 24.4 24.2 25.4 24 23.8 24 23.8 24 ...
$ dwpt_c : num 22.4 23.1 23.2 22.3 23.9 23.1 23.4 23 23 23.1 ...
$ relh_pct : int 93 94 93 89 91 95 98 94 95 95 ...
$ mixr_g_kg: num 17.4 18.2 18.3 17.3 19.1 ...
$ drct_deg : int 0 210 240 210 210 340 205 290 315 0 ...
$ sknt_knot: int 0 3 2 3 3 2 4 1 1 0 ...
$ thta_k : num 297 297 298 297 299 ...
$ thte_k : num 347 350 351 348 354 ...
$ thtv_k : num 300 301 301 300 302 ...
$ date : chr "2017-11-02" "2017-11-03" "2017-11-04" "2017-11-05" ...
$ from_hr : int 0 0 0 0 0 0 0 0 0 0 ...
$ to_hr : int 0 0 0 0 0 0 0 0 0 0 ...
str(Press_925)
'data.frame': 79 obs. of 15 variables:
$ X : int 13 96 187 272 365 450 537 593 681 769 ...
$ pres_hpa : num 925 925 925 925 925 925 925 925 925 925 ...
$ hght_m : int 745 747 746 748 757 764 757 758 763 781 ...
$ temp_c : num 21.8 22 22.4 23.2 22.2 20.6 22.4 22 22.4 22.2 ...
$ ... 'truncated'
all_series = rbind(date,Press_1000,Press_925,Press_850,Press_700)
meltdf <- melt(all_series,id.vars ="date")
ggplot(meltdf,aes(x=date,y=value,colour=variable,group=variable)) +
geom_line()
There are two ways of approaching this. What you go for may depend on the bedrock question (which we don't know).
1) For each data.frame, you have all the necessary columns and you can plot each source (data.frame) using e.g.
ggplot()... +
geom_line(data = Press_1000, aes(...)) +
geom_line(data = Press_925, aes(...)) ...
Note that you will have to specify color for each source and having a legend with that is PITA.
2) Combine all data.frames into one big object and create an additional column indicating the origin of the data (which data.frame each observation comes from). This would be your mapping variable (e.g. color, fill, group) in your current aes call. Instant legend.
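The second approach might look roughly like the sketch below. The two small data frames are made-up stand-ins for the filtered ones, and the column name level is an assumption chosen for illustration:

```r
library(dplyr)
library(ggplot2)

# Toy stand-ins for the filtered data frames (values are made up).
Press_1000 <- data.frame(date = c("2017-11-02", "2017-11-03"),
                         temp_c = c(23.6, 24.2))
Press_925  <- data.frame(date = c("2017-11-02", "2017-11-03"),
                         temp_c = c(21.8, 22.0))

# Stack the data frames; .id stores the list names in a new column "level".
all_series <- bind_rows(
  list(Press_1000 = Press_1000, Press_925 = Press_925),
  .id = "level"
)

# Convert the character dates so the x axis is treated as time.
all_series$date <- as.Date(all_series$date, format = "%Y-%m-%d")

# One line per pressure level, with the legend generated from "level".
p <- ggplot(all_series, aes(x = date, y = temp_c, colour = level)) +
  geom_line()
```

With the original data, a similar result could probably be reached without splitting at all, by keeping only the rows whose pres_hpa is one of the wanted levels and mapping colour to factor(pres_hpa).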
I'm trying to predict the values of a test data set based on a training data set. It predicts values without errors, but the predictions deviate a lot from the original values. It even predicts values around -356, although none of the original values exceeds 200 (and there are no negative values). The warning is bugging me, as I think the values deviate so much because of it.
Warning message:
In predict.lm(fit2, data_test) :
prediction from a rank-deficient fit may be misleading
Is there any way I can get rid of this warning? The code is simple:
fit2 <- lm(runs~., data=train_data)
prediction<-predict(fit2, data_test)
prediction
I searched a lot but tbh I couldn't understand much about this error.
str of test and train data set in case someone needs them
> str(train_data)
'data.frame': 36 obs. of 28 variables:
$ matchid : int 57 58 55 56 53 54 51 52 45 46 ...
$ TeamName : chr "South Africa" "West Indies" "South Africa" "West Indies" ...
$ Opp_TeamName : chr "West Indies" "South Africa" "West Indies" "South Africa" ...
$ TeamRank : int 4 3 4 3 4 3 10 7 5 1 ...
$ Opp_TeamRank : int 3 4 3 4 3 4 7 10 1 5 ...
$ Team_Top10RankingBatsman : int 0 1 0 1 0 1 0 0 2 2 ...
$ Team_Top50RankingBatsman : int 4 6 4 6 4 6 3 5 4 3 ...
$ Team_Top100RankingBatsman: int 6 8 6 8 6 8 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 0 1 0 1 0 0 0 2 2 ...
$ Opp_Top50RankingBatsman : int 6 4 6 4 6 4 5 3 3 4 ...
$ Opp_Top100RankingBatsman : int 8 6 8 6 8 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "1st innings" "2nd innings" ...
$ Runs_OverAll : num 361 705 348 630 347 ...
$ AVG_Overall : num 27.2 20 23.3 19.1 24 ...
$ SR_Overall : num 128 121 120 118 118 ...
$ Runs_Last10Matches : num 118.5 71 102.1 71 78.6 ...
$ AVG_Last10Matches : num 23.7 20.4 20.9 20.4 23.2 ...
$ SR_Last10Matches : num 120 106 114 106 116 ...
$ Runs_BatingFirst : num 236 459 230 394 203 ...
$ AVG_BatingFirst : num 30.6 23.2 24 21.2 27.1 ...
$ SR_BatingFirst : num 127 136 123 125 118 ...
$ Runs_BatingSecond : num 124 262 119 232 144 ...
$ AVG_BatingSecond : num 25.5 18.3 22.8 17.8 22.8 ...
$ SR_BatingSecond : num 125 118 112 117 114 ...
$ Runs_AgainstTeam2 : num 88.3 118.3 76.3 103.9 49.3 ...
$ AVG_AgainstTeam2 : num 28.2 23 24.7 22.1 16.4 ...
$ SR_AgainstTeam2 : num 139 127 131 128 111 ...
$ runs : int 165 168 231 236 195 126 143 141 191 135 ...
> str(data_test)
'data.frame': 34 obs. of 28 variables:
$ matchid : int 59 60 61 62 63 64 65 66 69 70 ...
$ TeamName : chr "India" "West Indies" "England" "New Zealand" ...
$ Opp_TeamName : chr "West Indies" "India" "New Zealand" "England" ...
$ TeamRank : int 2 3 5 1 4 8 6 2 10 1 ...
$ Opp_TeamRank : int 3 2 1 5 8 4 2 6 1 10 ...
$ Team_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 0 2 ...
$ Team_Top50RankingBatsman : int 5 6 4 3 4 2 5 5 3 3 ...
$ Team_Top100RankingBatsman: int 7 8 7 6 6 5 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 2 0 ...
$ Opp_Top50RankingBatsman : int 6 5 3 4 2 4 5 5 3 3 ...
$ Opp_Top100RankingBatsman : int 8 7 6 7 5 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "2nd innings" "1st innings" ...
$ Runs_OverAll : num 582 618 470 602 509 ...
$ AVG_Overall : num 25 21.8 20.3 20.7 19.6 ...
$ SR_Overall : num 113 120 123 120 112 ...
$ Runs_Last10Matches : num 182 107 117 167 140 ...
$ AVG_Last10Matches : num 37.1 43.8 21 24.9 27.3 ...
$ SR_Last10Matches : num 111 153 122 141 120 ...
$ Runs_BatingFirst : num 319 314 271 345 294 ...
$ AVG_BatingFirst : num 23.6 17.8 20.6 20.3 19.5 ...
$ SR_BatingFirst : num 116.9 98.5 118 124.3 115.8 ...
$ Runs_BatingSecond : num 264 282 304 256 186 ...
$ AVG_BatingSecond : num 28 23.7 31.9 21.6 16.5 ...
$ SR_BatingSecond : num 96.5 133.9 129.4 112 99.5 ...
$ Runs_AgainstTeam2 : num 98.2 95.2 106.9 75.4 88.5 ...
$ AVG_AgainstTeam2 : num 45.3 42.7 38.1 17.7 27.1 ...
$ SR_AgainstTeam2 : num 125 138 152 110 122 ...
$ runs : int 192 196 159 153 122 120 160 161 70 145 ...
In simple words, how can I get rid of this warning so that it doesn't affect my predictions?
(Intercept) matchid TeamNameBangladesh
1699.98232628 -0.06793787 59.29445330
TeamNameEngland TeamNameIndia TeamNameNew Zealand
347.33030177 -499.40074338 -179.19192936
TeamNamePakistan TeamNameSouth Africa TeamNameSri Lanka
-272.71610614 -3.54867488 -45.27920191
TeamNameWest Indies Opp_TeamNameBangladesh Opp_TeamNameEngland
-345.54349798 135.05901017 108.04227770
Opp_TeamNameIndia Opp_TeamNameNew Zealand Opp_TeamNamePakistan
-162.24418387 -60.55364436 -114.74599364
Opp_TeamNameSouth Africa Opp_TeamNameSri Lanka Opp_TeamNameWest Indies
196.90856999 150.70170068 -6.88997714
TeamRank Opp_TeamRank Team_Top10RankingBatsman
NA NA NA
Team_Top50RankingBatsman Team_Top100RankingBatsman Opp_Top10RankingBatsman
NA NA NA
Opp_Top50RankingBatsman Opp_Top100RankingBatsman InningType2nd innings
NA NA 24.24029455
Runs_OverAll AVG_Overall SR_Overall
-0.59935875 20.12721378 -13.60151334
Runs_Last10Matches AVG_Last10Matches SR_Last10Matches
-1.92526750 9.24182916 1.23914363
Runs_BatingFirst AVG_BatingFirst SR_BatingFirst
1.41001672 -9.88582744 -6.69780509
Runs_BatingSecond AVG_BatingSecond SR_BatingSecond
-0.90038727 -7.11580086 3.20915976
Runs_AgainstTeam2 AVG_AgainstTeam2 SR_AgainstTeam2
3.35936312 -5.90267210 2.36899131
You can have a look at this detailed discussion :
predict.lm() in a loop. warning: prediction from a rank-deficient fit may be misleading
In general, multicollinearity can lead to a rank-deficient design matrix in linear regression.
You can try applying PCA to tackle the multicollinearity issue and then fit the regression on the resulting components.
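Before reaching for PCA, it may help to see which terms lm() could not estimate. The NA coefficients in the output above (TeamRank, Opp_TeamRank and the ranking-batsman counts) are such terms. A minimal sketch on made-up data, where x3 is deliberately an exact combination of x1 and x2 (all names and values here are assumptions for illustration):

```r
set.seed(1)

# Toy data in which x3 is an exact linear combination of x1 and x2.
toy <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
toy$x3 <- toy$x1 + toy$x2
toy$y  <- 1 + 2 * toy$x1 - toy$x2 + rnorm(20)

fit <- lm(y ~ ., data = toy)

# Coefficients reported as NA could not be estimated (rank deficiency).
dropped <- names(which(is.na(coef(fit))))

# Rank of the fit versus the number of coefficients requested.
rank_vs_coef <- c(rank = fit$rank, n_coef = length(coef(fit)))

# Refit without the redundant predictor; the fit is now full rank.
fit_reduced <- lm(y ~ x1 + x2, data = toy)
```

In the question, the training set has 36 rows while the formula runs~. expands the character columns into many dummy variables, so dropping or recoding TeamName and Opp_TeamName may be a simpler first step than PCA.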