Hi, I am trying to fit a non-parametric regression smoother to the difference between the control and treatment groups, so as to determine the effectiveness of the appetite suppressant over time. Then I need to use my model to estimate the difference between the treatment and control groups at t = 0 and t = 50.
I want to use a P-spline smoother, but I do not have enough background on it.
This is my data :
t
0 1 3 7 8 10 14 15 17 21 22 24 28 29 31 35 36 38 42 43 45 49 50 52 56 57 59 63 64 70 73 77 80 84 87 91 94 98 105
con
20.5 19.399 22.25 17.949 19.899 21.449 16.899 21.5 22.8 24.699 26.2 28.5 24.35 24.399 26.6 26.2 26.649 29.25 27.55 29.6 24.899 27.6 28.1 27.85 26.899 27.8 30.25 27.6 27.449 27.199 27.8 28.199 28 27.3 27.899 28.699 27.6 28.6 27.5
trt
21.3 16.35 19.25 16.6 14.75 18.149 14.649 16.7 15.05 15.5 13.949 16.949 15.6 14.699 14.15 14.899 12.449 14.85 16.75 14.3 16 16.85 15.65 17.149 18.05 15.699 18.25 18.149 16.149 16.899 18.95 22 23.6 23.75 27.149 28.449 25.85 29.7 29.449
where:
t - the time in days since the experiment started.
con - the median food intake of the control group.
trt - the median food intake of the treatment group.
Can anybody help please?
Just to give you a start: the mgcv package implements various regression spline bases, including P-splines (penalized B-splines with a difference penalty).
First, you need to set up your data:
dat <- data.frame(time = rep(t, 2), y = c(con, trt),
grp = gl(2, 39, labels = c("con", "trt")))
Then call gam for non-parametric regression:
library(mgcv) # no need to install; it comes with R
fit <- gam(y ~ s(time, bs = 'ps', by = grp) + grp, data = dat)
Read mgcv: how to specify interaction between smooth and factor? for how to specify the interaction. bs = 'ps' selects the P-spline basis. By default, 10 (evenly spaced interior) knots are chosen; you can change k if you want.
For more about P-splines in mgcv, read mgcv: how to extract knots, basis, coefficients and predictions for P-splines in adaptive smooth?.
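To address the estimation part of the question: once the model is fitted, predict() with a small newdata grid gives fitted values for both groups at t = 0 and t = 50, and the treatment-minus-control difference follows by subtraction. A minimal, self-contained sketch (repeating the data setup from above):

```r
library(mgcv)  # ships with R

t   <- c(0, 1, 3, 7, 8, 10, 14, 15, 17, 21, 22, 24, 28, 29, 31, 35, 36, 38,
         42, 43, 45, 49, 50, 52, 56, 57, 59, 63, 64, 70, 73, 77, 80, 84, 87,
         91, 94, 98, 105)
con <- c(20.5, 19.399, 22.25, 17.949, 19.899, 21.449, 16.899, 21.5, 22.8,
         24.699, 26.2, 28.5, 24.35, 24.399, 26.6, 26.2, 26.649, 29.25, 27.55,
         29.6, 24.899, 27.6, 28.1, 27.85, 26.899, 27.8, 30.25, 27.6, 27.449,
         27.199, 27.8, 28.199, 28, 27.3, 27.899, 28.699, 27.6, 28.6, 27.5)
trt <- c(21.3, 16.35, 19.25, 16.6, 14.75, 18.149, 14.649, 16.7, 15.05, 15.5,
         13.949, 16.949, 15.6, 14.699, 14.15, 14.899, 12.449, 14.85, 16.75,
         14.3, 16, 16.85, 15.65, 17.149, 18.05, 15.699, 18.25, 18.149,
         16.149, 16.899, 18.95, 22, 23.6, 23.75, 27.149, 28.449, 25.85,
         29.7, 29.449)

dat <- data.frame(time = rep(t, 2), y = c(con, trt),
                  grp = gl(2, 39, labels = c("con", "trt")))
fit <- gam(y ~ s(time, bs = "ps", by = grp) + grp, data = dat)

## fitted values for both groups at the two time points of interest
newd <- expand.grid(time = c(0, 50),
                    grp  = factor(c("con", "trt"), levels = c("con", "trt")))
pred <- predict(fit, newdata = newd)

## treatment minus control at t = 0 and t = 50
diff_est <- pred[newd$grp == "trt"] - pred[newd$grp == "con"]
```

A strongly negative difference at t = 50 (the treated animals eat much less than the controls) would indicate the suppressant is working by that point.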
I created a null logistic model in RStudio:
nullModel <- glm(train$bigFire ~ 1, data = train, family = binomial)
Then the model is asked to make predictions on the test set:
nullModel.pred <- predict(nullModel, test, type = "response")
At this point I want to compute the confusion matrix in order to evaluate the performance of the model:
CM <- table(test$bigFire, nullModel.pred>0.5)
The resulting output is the following:
TRUE
0 58
1 46
Even if I change the cutoff value (now set to 0.5), the result is always the same. I don't understand why, since the model should behave differently with different cutoff values.
The dataset is the following:
month day FFMC DMC DC ISI temp RH wind rain zone bigFire
1 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 75 0
2 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 74 0
3 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 74 0
4 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 86 0
5 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 86 0
6 aug sun 92.3 85.3 488.0 14.7 22.2 29 5.4 0.0 86 0
It counts 517 rows.
The test and train sets are generated from the previous data frame with an 80% / 20% split (104 rows for test).
The length of the prediction vector is:
> length(nullModel.pred)
[1] 104
and it always contains the same value: 0.542.
This is reasonable, since a null model can only estimate the overall probability that the response equals 1.
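The behaviour is inherent to an intercept-only model: every fitted probability equals the training mean of the response, so pred > cutoff is either all TRUE or all FALSE, and the table collapses to a single column. The column only flips when the cutoff crosses that one fitted value. A minimal sketch with simulated labels standing in for bigFire:

```r
set.seed(42)
y <- rbinom(200, 1, 0.54)                 # stand-in for train$bigFire
nullModel <- glm(y ~ 1, family = binomial)

pred <- predict(nullModel, type = "response")

## every fitted value is mean(y) -- a single constant
unique(round(pred, 10))

## one-column "confusion matrix" for any cutoff below mean(y) ...
table(y, pred > 0.3)
## ... and for any cutoff above it, the single column just flips to FALSE
table(y, pred > 0.9)
```

To get cutoff-sensitive predictions you need a model with at least one covariate, so that the fitted probabilities actually vary across observations.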
I am trying to write a script to get the specific values satisfying the equation 25a + 20b = 1600, with a in the range 24:60 and b in 20:50.
I need to get the pairs of a and b satisfying the equation.
My first problem was how to define a and b with a single decimal place (a = 24.0, 24.1, 24.2, etc.), but I overcame that by defining a <- c(240:600)/10, so my first question is: is there a more direct method to do that?
Now, I have a couple of nested loops, and each time the equation is satisfied I get the solution in a vector. I want to use rbind() to attach this vector to a matrix or data frame, but it is not working, with no error or warning; it just keeps the value of the first vector and that's it!
Here is my code. Can someone help me find where the problem is?
solve_ms <- function() {
  index <- 1
  sol <- data.frame()
  temp <- vector("numeric")
  a <- c(240:600)/10
  b <- c(200:500)/10
  for (i in 1:length(a)) {
    for (j in 1:length(b)) {
      c <- 25*a[i] + 20*b[j]
      if (c == 1600) {
        temp <- c(a[i], b[j])
        if (index == 1) {
          sol <- temp
          index <- 0
        }
        else rbind(sol, temp)
      }
    }
  }
  return(sol)
}
I found out where my code's problem is: I was using rbind() without assigning its return value. I had to write sol <- rbind(sol, temp) and it works.
I will check the other suggestions as well. Thanks.
Try this instead:
#define a function
fun <- function(a,b) (25*a+20*b) == 1600
Since floating point precision could be an issue:
#alternative function
fun <- function(a,b,tol=.Machine$double.eps ^ 0.5) abs(25*a+20*b-1600) < tol
#create all possible combinations
paras <- expand.grid(a=c(240:600)/10, b=20:50)
paras[fun(paras$a,paras$b),]
a b
241 48.0 20
594 47.2 21
947 46.4 22
1300 45.6 23
1653 44.8 24
2006 44.0 25
2359 43.2 26
2712 42.4 27
3065 41.6 28
3418 40.8 29
3771 40.0 30
4124 39.2 31
4477 38.4 32
4830 37.6 33
5183 36.8 34
5536 36.0 35
5889 35.2 36
6242 34.4 37
6595 33.6 38
6948 32.8 39
7301 32.0 40
7654 31.2 41
8007 30.4 42
8360 29.6 43
8713 28.8 44
9066 28.0 45
9419 27.2 46
9772 26.4 47
10125 25.6 48
10478 24.8 49
10831 24.0 50
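As for the side question about building the decimal grid directly: seq() is the idiomatic way. It produces the same values as the c(240:600)/10 trick, up to floating-point tolerance (which is why the tolerance-based comparison above is a good idea in either case):

```r
a <- seq(24, 60, by = 0.1)   # 24.0, 24.1, ..., 60.0
b <- seq(20, 50, by = 0.1)

length(a)                    # 361 values
all.equal(a, c(240:600)/10)  # TRUE (tolerance-based comparison)
```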
If the problem is really this simple, i.e. solving for the roots of a two-variable linear equation, you can always rearrange the equation to write b in terms of a, i.e. b = (1600 - 25*a)/20, get the values of b for the corresponding values of a, and filter the combinations by b, e.g.:
a = c(240:600)/10
b = 20:50
RESULTS <- data.frame(a, b = (1600 - 25 * a)/20)[((1600 - 25 * a)/20) %in% b, ]
RESULTS
## a b
## 1 24.0 50
## 9 24.8 49
## 17 25.6 48
## 25 26.4 47
## 33 27.2 46
## 41 28.0 45
## 49 28.8 44
## 57 29.6 43
## 65 30.4 42
## 73 31.2 41
## 81 32.0 40
## 97 33.6 38
## 105 34.4 37
## 121 36.0 35
## 137 37.6 33
## 145 38.4 32
## 161 40.0 30
## 177 41.6 28
## 185 42.4 27
## 193 43.2 26
## 201 44.0 25
## 209 44.8 24
## 217 45.6 23
## 225 46.4 22
## 233 47.2 21
## 241 48.0 20
Is there a way to transform an ffdf into a normal data frame, assuming the object is small enough to fit in RAM?
for example:
library(ff)
library(ffbase)
data(trees)
Girth <- ff(trees$Girth)
Height <- ff(trees$Height)
Volume <- ff(trees$Volume)
aktiv <- ff(as.factor(sample(0:1,31,replace=T)))
#Create data frame with some added parameters.
data <- ffdf(Girth=Girth,Height=Height,Volume=Volume,aktiv=aktiv)
rm(Girth,Height,Volume,trees,aktiv)
aktiv <- subset.ffdf(data, data$aktiv== "1" )
and then convert aktiv to a data frame and save the RData.
(Sadly, the person awaiting the output doesn't want to learn how to work with the ff package, so I have no choice.)
Thanks.
Just use as.data.frame:
aktiv <- subset(as.data.frame(data), aktiv == 1)
Girth Height Volume aktiv
2 8.6 65 10.3 1
7 11.0 66 15.6 1
9 11.1 80 22.6 1
12 11.4 76 21.0 1
13 11.4 76 21.4 1
15 12.0 75 19.1 1
17 12.9 85 33.8 1
20 13.8 64 24.9 1
21 14.0 78 34.5 1
23 14.5 74 36.3 1
26 17.3 81 55.4 1
27 17.5 82 55.7 1
28 17.9 80 58.3 1
31 20.6 87 77.0 1
From here you can easily use save or write.csv, e.g.:
save(aktiv, file="aktiv.RData")
I have a simple data frame called msq:
sex wing index
1 h 54 67.4
2 m 60.5 67.9
3 m 60 64.5
4 m 59 66.6
5 m 63.5 63.3
6 m 63 66.7
7 m 61.5 71.8
8 m 62 67.9
9 m 63 67.8
10 m 62.5 72.7
11 m 61.5 70.3
12 h 54.5 70.7
13 m 60 61.1
14 m 63.5 50.9
15 m 63 72.1
My intention is to make a boxplot with ggplot, for which I use this code that works fine:
ggplot(msq, aes("index", index)) + geom_boxplot(aes(group = "sex"))
and then to plot an outlier that should stand alone at the top of the graph (a value of 73.9). The problem is that if I include it in the data set, the boxplot "absorbs" it, making the whisker longer. I have been looking at Hmisc and at stat_summary, but I can't get any clear idea.
Thank you.
You could use geom_point to add points to a plot generated with ggplot2.
library(ggplot2)
ggplot(msq, aes(sex, index)) + # Note. I modified the aes call
geom_boxplot() +
geom_point(aes(y = 73.9)) # add points
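Note that geom_point(aes(y = 73.9)) draws one such point per row of msq, all overplotted at each sex value. If you want a single extra point that is not part of the data, annotate() may be cleaner; here the choice of x = "m" (which group the outlier belongs to) is an assumption:

```r
library(ggplot2)

# the msq data from the question (only the columns the plot uses)
msq <- data.frame(
  sex   = c("h", "m", "m", "m", "m", "m", "m", "m", "m", "m", "m", "h",
            "m", "m", "m"),
  index = c(67.4, 67.9, 64.5, 66.6, 63.3, 66.7, 71.8, 67.9, 67.8, 72.7,
            70.3, 70.7, 61.1, 50.9, 72.1)
)

p <- ggplot(msq, aes(sex, index)) +
  geom_boxplot() +
  annotate("point", x = "m", y = 73.9)   # one standalone point, data untouched
```

Because the outlier never enters msq, the boxplot statistics (and hence the whiskers) are unaffected.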
Is there a way to aggregate multiple sub-totals with reshape2? E.g. for the airquality dataset
require(reshape2)
require(plyr)
names(airquality) <- tolower(names(airquality))
aqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)
aqm <- subset(aqm, month %in% 5:6 & day %in% 1:7)
I can make a subtotal column for each month, that has the average for all variables within that month:
dcast(aqm, day ~ month+variable, mean, margins = "variable")
day 5_ozone 5_solar.r 5_wind 5_temp 5_(all) 6_ozone 6_solar.r
1 1 41 190 7.4 67 76.350 NaN 286
2 2 36 118 8.0 72 58.500 NaN 287
3 3 12 149 12.6 74 61.900 NaN 242
4 4 18 313 11.5 62 101.125 NaN 186
5 5 NaN NaN 14.3 56 35.150 NaN 220
6 6 28 NaN 14.9 66 36.300 NaN 264
7 7 23 299 8.6 65 98.900 29 127
6_wind 6_temp 6_(all)
1 8.6 78 124.20000
2 9.7 74 123.56667
3 16.1 67 108.36667
4 9.2 84 93.06667
5 8.6 85 104.53333
6 14.3 79 119.10000
7 9.7 82 61.92500
I can also make a subtotal column for each variable, that has the average for all months within that variable:
dcast(aqm, day ~ variable+month, mean, margins = "month")
day ozone_5 ozone_6 ozone_(all) solar.r_5 solar.r_6 solar.r_(all)
1 1 41 NaN 41 190 286 238.0
2 2 36 NaN 36 118 287 202.5
3 3 12 NaN 12 149 242 195.5
4 4 18 NaN 18 313 186 249.5
5 5 NaN NaN NaN NaN 220 220.0
6 6 28 NaN 28 NaN 264 264.0
7 7 23 29 26 299 127 213.0
wind_5 wind_6 wind_(all) temp_5 temp_6 temp_(all)
1 7.4 8.6 8.00 67 78 72.5
2 8.0 9.7 8.85 72 74 73.0
3 12.6 16.1 14.35 74 67 70.5
4 11.5 9.2 10.35 62 84 73.0
5 14.3 8.6 11.45 56 85 70.5
6 14.9 14.3 14.60 66 79 72.5
7 8.6 9.7 9.15 65 82 73.5
Is there a way to tell reshape2 to calculate both sets of subtotals in one command? This command is close, adding in the grand total, but omits the monthly subtotals:
dcast(aqm, day ~ variable+month, mean, margins = c("variable", "month"))
If I understand your question correctly, you can use:
acast(aqm, day ~ variable ~ month, mean, margins = c("variable", "month"))[,,'(all)']
acast gets you the summary for each day, for each variable, for each month. The aggregate slice ([,,'(all)']) has a row for each day, with a column for each variable (averaged over all months) and an '(all)' column averaging each day over all variables and all months.
Is this what you needed?
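To see what the slice contains: the third dimension of the array carries one layer per month plus the '(all)' margin, and res[, , "(all)"] is the day-by-variable table of grand means. A self-contained check, repeating the setup from the question:

```r
library(reshape2)

names(airquality) <- tolower(names(airquality))
aqm <- melt(airquality, id = c("month", "day"), na.rm = TRUE)
aqm <- subset(aqm, month %in% 5:6 & day %in% 1:7)

res <- acast(aqm, day ~ variable ~ month, mean,
             margins = c("variable", "month"))

dimnames(res)[[3]]     # "5" "6" "(all)"
tot <- res[, , "(all)"]
dim(tot)               # 7 days x 5 columns (4 variables + the "(all)" margin)
```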