predict() should display one value, but generates way too many values - r

I have a dataset of the German soccer league (Bundesliga) that shows every team in the league along with its player value, goals, and points. The Freiburg team has scored 19 goals with a player value of 1.12. From a linear model fitted to these data, I want to predict how many goals Freiburg could expect with a player value of 5.
When I run the code below, predict() returns not one value but 18, one per team. How can I change this so that I get just the single prediction? (Which should be 27.52 according to the linear model.)
m3 <- lm(bundesliga$Goals ~ bundesliga$PlayerValue)
summary(m3)
nd <- data.frame(PlayerValue = 5)
predict(m3, newdata = nd)

You have specified your model in a way that R discourages.
The preferred way is:
m3 <- lm(Goals ~ PlayerValue, data=bundesliga)
Then the prediction works as expected using your command:
nd <- data.frame(PlayerValue = 5)
predict(m3, newdata = nd)
#        1
# 27.52412
Although the help page of lm does say that the data argument is optional, specifying it in the model allows other functions, such as predict, to work. There is a note in the help page of predict.lm:
Note
Variables are first looked for in newdata and then searched for in the usual way (which will include the environment of the formula used in the fit). A warning will be given if the variables found are not of the same length as those in newdata if it was supplied.
This is why your original command doesn't work and you get the warning message:
predict(m3, newdata = nd)
#        1        2        3        4        5        6        7        8        9
# 40.06574 28.31378 26.08416 25.45708 25.31773 25.22483 24.22614 23.55261 23.36681
#       10       11       12       13       14       15       16       17       18
# 21.60169 20.51011 20.23140 20.25463 19.58110 19.48820 18.60564 18.60564 18.51274
# Warning message:
# 'newdata' had 1 row but variables found have 18 rows
In your original formula the model variable is literally named bundesliga$PlayerValue, so predict cannot match it to the PlayerValue column in nd. It falls back to the environment of the formula, finds the full 18-row vector there, and simply returns the fitted values for all 18 teams.
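The difference can be seen directly by fitting both forms on the same data (the bundesliga data frame is repeated here so the sketch is self-contained; m_bad and m_good are just illustrative names):

```r
bundesliga <- data.frame(
  PlayerValue = c(10.4, 5.34, 4.38, 4.11, 4.05, 4.01, 3.58, 3.29, 3.21,
                  2.45, 1.98, 1.86, 1.87, 1.58, 1.54, 1.16, 1.16, 1.12),
  Goals = c(34, 32, 34, 35, 32, 16, 26, 27, 23, 13, 10, 21, 22, 18, 24, 21, 12, 19)
)

m_bad  <- lm(bundesliga$Goals ~ bundesliga$PlayerValue)  # variable is literally "bundesliga$PlayerValue"
m_good <- lm(Goals ~ PlayerValue, data = bundesliga)     # variable is "PlayerValue"

nd <- data.frame(PlayerValue = 5)
length(suppressWarnings(predict(m_bad, newdata = nd)))  # 18 -- nd is ignored, fitted values are returned
length(predict(m_good, newdata = nd))                   # 1  -- PlayerValue is found in nd
```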
Data:
bundesliga <- structure(list(PlayerValue = c(10.4, 5.34, 4.38, 4.11, 4.05, 4.01,
3.58, 3.29, 3.21, 2.45, 1.98, 1.86, 1.87, 1.58, 1.54, 1.16, 1.16, 1.12),
Goals = c(34, 32, 34, 35, 32, 16, 26, 27, 23, 13, 10, 21, 22, 18, 24, 21, 12, 19)),
class = "data.frame", row.names = c(NA, -18L))

Plot variables as slope of line between points

Due to the nature of my specification, the results of my regression coefficients provide the slope (change in yield) between two points; therefore, I would like to plot these coefficients using the slope of a line between these two points with the first point (0, -0.7620) as the intercept. Please note this is a programming question; not a statistics question.
I'm not entirely sure how to implement this in base graphics or ggplot and would appreciate any help. Here is some sample data.
Sample Data:
df <- data.frame(x = c(0, 5, 8, 10, 12, 15, 20, 25, 29), y = c(-0.762,-0.000434, 0.00158, 0.0000822, -0.00294, 0.00246, -0.000521, -0.00009287, -0.01035) )
Output:
  x          y
1  0 -7.620e-01
2  5 -4.340e-04
3  8  1.580e-03
4 10  8.220e-05
5 12 -2.940e-03
6 15  2.460e-03
7 20 -5.210e-04
8 25 -9.287e-05
9 29 -1.035e-02
Example:
You can use cumsum, the cumulative sum, to calculate the intermediate y values:
df <- data.frame(x = c(0, 5, 8, 10, 12, 15, 20, 25, 29),
                 y = cumsum(c(-0.762, -0.000434, 0.00158, 0.0000822, -0.00294,
                              0.00246, -0.000521, -0.00009287, -0.0103)))
plot(df$x, df$y)
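If you also want the connecting segments drawn, so that each coefficient shows up as the slope of the line between consecutive points, type = "b" does that in base graphics. A small sketch, repeating the data frame from above so the snippet stands alone:

```r
# cumsum turns the per-interval slopes into y-values at each x
df <- data.frame(x = c(0, 5, 8, 10, 12, 15, 20, 25, 29),
                 y = cumsum(c(-0.762, -0.000434, 0.00158, 0.0000822, -0.00294,
                              0.00246, -0.000521, -0.00009287, -0.0103)))
plot(df$x, df$y, type = "b", xlab = "x", ylab = "y")  # "b" = points joined by line segments
```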

Generating separate plots for each unique subject ID and save them in the working directory with the subject ID number

I have a huge dataset with many subjects. The data has the following columns:
     ID TIME   CONC
7030104  2.0  0.536
7030104  2.5  1.320
7030104  3.0  1.460
7030104  4.0  5.070
7030104  5.0 17.300
7030104  6.0 38.600
  70304  8.0  0.589
  70304 10.0 35.400
  70304 12.0 29.400
  70304 24.0 10.900
  70304 36.0  3.260
  70304 48.0  1.290
I would like to draw a separate plot (CONC versus TIME) for each subject ID and automatically save it to the working directory with the ID number of the subject.
I am using a simple plotting call, but I need help applying it to all subject IDs and automatically saving the plots to my working directory.
setwd("..")
plotobj <- NULL
plotobj <- plot(sub$TIME,sub$CONC,type="b")
I am using RStudio.
Your assistance is appreciated!
You could save everything in a single PDF file, one page per plot, with the title of each plot identifying the subset's ID. Here I use lapply after splitting (split) the dataset by ID. Specify the plot arguments and wrap the call in invisible so that the NULL returned by lapply isn't printed on the R console.
par(mfrow = c(1, 1))
pdf('Amer.pdf')
lst <- split(df, df$ID)
invisible(lapply(lst, function(sub)
  with(sub, plot(TIME, CONC, type = 'b', main = paste('Plot of', ID[1])))))
dev.off()
Or, if you need separate .jpg files, lapply can still be used:
invisible(lapply(lst, function(sub) {
  jpeg(paste0(sub$ID[1], '.jpg'))
  with(sub, plot(TIME, CONC, type = 'b', main = paste('Plot of', ID[1])))
  dev.off()
}))
Data:
df <- structure(list(ID = c(7030104L, 7030104L, 7030104L, 7030104L,
7030104L, 7030104L, 70304L, 70304L, 70304L, 70304L, 70304L, 70304L
), TIME = c(2, 2.5, 3, 4, 5, 6, 8, 10, 12, 24, 36, 48), CONC = c(0.536,
1.32, 1.46, 5.07, 17.3, 38.6, 0.589, 35.4, 29.4, 10.9, 3.26,
1.29)), .Names = c("ID", "TIME", "CONC"), class = "data.frame",
row.names = c(NA, -12L))
First, get the list of unique IDs:
id_arr = unique(sub$ID)
Then save a plot for each ID:
for (i in id_arr) {
  sub_id = subset(sub, ID == i)
  jpeg(paste(i, ".jpg", sep = ""))
  plot(sub_id$TIME, sub_id$CONC, type = "b")
  dev.off()
}

Rank a vector based on order and replace ties with their average

I'm new to R, and I find it quite interesting.
I have MATLAB code that ranks a vector based on order, and it works fine. Now I want to convert it to R code: a typical Spearman ranking with ties:
# MATLAB CODE
function r = drank(x)
u = unique(x);
[xs, z1] = sort(x);
[z1, z2] = sort(z1);
r = (1:length(x))';
r = r(z2);
for i = 1:length(u)
    s = find(u(i) == x);
    r(s, 1) = mean(r(s));
end
This is what I tried:
# R CODE
x = c(10.5, 8.2, 11.3, 9.1, 13.0, 11.3, 8.2, 10.1)
drank <- function(x) {
  u = unique(x)
  xs = order(x)
  r = r[xs]
  for (i in 1:length(u)) {
    s = which(u[i] == x)
    r[i] = mean(r[s])
  }
  return(r)
}
r <- drank(x)
r <- drank(x)
Results:
r = 5, 1.5, 6.5, 3, 8, 6.5, 1.5, 4
Here 1.5 is the average of the two ranks of 8.2, which occurs twice (a tie), and 6.5 is the average of the two ranks of 11.3, which also occurs twice.
Can anyone help me check it?
Thanks,
R has a built-in function for ranking, called rank() and it gives precisely what you are looking for. rank has the argument ties.method, "a character string specifying how ties are treated", which defaults to "average", i.e. replaces ties by their mean.
x = c(10.5, 8.2, 11.3, 9.1, 13.0, 11.3, 8.2, 10.1)
expected <- c(5, 1.5, 6.5, 3, 8, 6.5, 1.5, 4)
rank(x)
# [1] 5.0 1.5 6.5 3.0 8.0 6.5 1.5 4.0
identical(expected, rank(x))
# [1] TRUE
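For comparison, here is how the other ties.method options treat the same ties (a quick sketch; the expected values follow from the sorted order worked out above):

```r
x <- c(10.5, 8.2, 11.3, 9.1, 13.0, 11.3, 8.2, 10.1)

rank(x, ties.method = "average")  # 5.0 1.5 6.5 3.0 8.0 6.5 1.5 4.0  (the default)
rank(x, ties.method = "min")      # 5 1 6 3 8 6 1 4  -- ties share the smallest rank
rank(x, ties.method = "first")    # 5 1 6 3 8 7 2 4  -- ties broken by position
```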

Permuting strings and passing as function arguments in Julia

In my continuing odyssey to kick the tires on Julia more thoroughly, I'm going back and reimplementing my solutions to some bayesian coursework exercises. Last time, I discovered the conjugate distributions facilities in Julia and decided to play with those this time. That part works rather well (as an aside, I haven't figured out if there's a good reason the NormalInverseGamma function won't take sufficient statistics rather than a vector of data, or if it's just not implemented yet).
Here, I'd like to make some comparisons between samples from several posterior distributions. I have three posterior samples that I'd like to compare all permutations of. I am able to permute what should be the arguments to my compare function:
using Distributions
# Data, the expurgated versions
d1 = [2.11, 9.75, 13.88, 11.3, 8.93, 15.66, 16.38, 4.54, 8.86, 11.94, 12.47]
d2 = [0.29, 1.13, 6.52, 11.72, 6.54, 5.63, 14.59, 11.74, 9.12, 9.43]
d3 = [4.33, 7.77, 4.15, 5.64, 7.69, 5.04, 10.01, 13.43, 13.63, 9.9]
# mu=5, sigsq=4, nu=2, k=1
# think I got those in there right... docs were a bit terse
pri = NormalInverseGamma(5, 4, 2, 1)
post1 = posterior(pri, Normal, d1)
post1_samp = [rand(post1)[1] for i in 1:5000]
post2 = posterior(pri, Normal, d2)
post2_samp = [rand(post2)[1] for i in 1:5000]
post3 = posterior(pri, Normal, d3)
post3_samp = [rand(post3)[1] for i in 1:5000];
# Where I want my permutations passed in as arguments
compare(a, b, c) = mean((a .> b) & (b .> c))
#perm = permutations([post1_samp, post2_samp, post3_samp]) # variables?
#perm = permutations([:post1_samp, :post2_samp, :post3_samp]) # symbols?
perm = permutations(["post1_samp", "post2_samp", "post3_samp"]) # strings?
[x for x in perm] # looks like what I want; now how to feed to compare()?
If I'm reading you correctly and you want six outputs, you could pass a tuple containing the arrays to permutations, and then use apply:
julia> perm = permutations((post1_samp, post2_samp, post3_samp))
Permutations{(Array{Any,1},Array{Any,1},Array{Any,1})}(({10.562517942895859,10.572164090071183,10.736702907907505,10.210772173751444,14.366729334490795,10.592629893299842,9.89659091860089,8.116412691836256,10.349724070315517,11.268377549210639 … 12.064725902593915,10.303602433314985,10.002042635051714,9.055831122365928,10.110819233623218,11.562207296236382,10.64265460839246,13.450063260877014,12.017400480458447,10.4932272939257},{7.405568037651908,8.02078920939688,7.511497830660621,7.887748694407902,7.698862774251405,6.007663099515951,7.848174806167786,9.23309632138448,7.205139036914154,8.277223275210972 … 7.06835013863376,6.488809918983307,9.250388581506368,7.350669918529516,5.546251008276725,8.778324046008263,10.833297020230216,9.2006982752771,9.882075423462595,3.253723211533207},{9.531489314208752,7.395780786761686,6.224734811478234,5.200474665890965,8.044992565567913,7.764939771450804,6.646382928269485,5.501893299017636,6.993003549302548,7.243273003116189 … 10.249365688182436,7.499165465689278,6.056692905419897,7.411776062227991,9.829197784956492,7.014685931227273,6.156474145474993,10.258900762434248,-1.044259248117803,7.284861693401341}))
julia> [apply(compare, p) for p in perm]
6-element Array{Any,1}:
0.4198
0.4182
0.0636
0.0154
0.0672
0.0158
Remember though that it's usually a misstep to have a bunch of variables with numbers in their names: that usually suggests they should be together in a named collection of some kind ("post_samples", say.)

Disk space Prediction using R

I have 2 vectors:
days = c(1, 2, 3, 4, 5, 6, 7)
pct_used = c(22.3, 22.1, 22.1, 22.1, 55.660198413, 56.001746032, 55.988769841)
fit <- lm(days ~ poly(pct_used,2,raw=TRUE))
prediction <- predict(fit, data.frame(pct_used=85))
days_remain <- (prediction - tail(days,1))
pct_used is basically disk-space usage in percent, so this code is meant to predict when disk usage will reach 85.
The returned prediction is 325.something, which feels weird to me. Does that mean it will take 325 days to reach pct_used = 85?
Where am I going wrong?
Try this to see what is happening:
plot(pct_used, days)
lines(pct_used, predict(fit))
plot(pct_used, days, xlim = c(min(pct_used), 85), ylim = c(-50, 350))
lines(seq(min(pct_used), 85, length = 50),
      predict(fit, newdata = data.frame(pct_used = seq(min(pct_used), 85, length = 50))))
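The second plot shows why the number is suspect: the model only ever sees pct_used values between roughly 22 and 56, so predicting at 85 extrapolates the quadratic far beyond the data. A small sketch of that check, reusing the variable names from the question:

```r
days <- c(1, 2, 3, 4, 5, 6, 7)
pct_used <- c(22.3, 22.1, 22.1, 22.1, 55.660198413, 56.001746032, 55.988769841)
fit <- lm(days ~ poly(pct_used, 2, raw = TRUE))

range(pct_used)                          # the data cover only ~22 to ~56
predict(fit, data.frame(pct_used = 85))  # 85 lies far outside that range: pure extrapolation
```

Whether the ~325-day figure means anything therefore depends entirely on trusting that extrapolated curve.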
