I need to fit an approximation to my data using the formula
y = a(exp(x/b) - 1) (code below).
library("ggplot2")
df <- read.table(file='vah_p_1',header =TRUE)
p <- ggplot(df, aes(x = x, y = y)) + geom_point() +
geom_smooth(data = df, method = "nls",size=0.4, se=FALSE,color ='cyan2',
formula = y ~ a(exp^(x*b)-1),method.args = list(start=c(a=1.0,b=0.0)))
p
Unfortunately the approximation line is not drawn. I think the problem is in method.args = list(start = c(a = 1.0, b = 0.0)). How can I find a and b?
The file vah_p_1 contains:
x y
0 4
0.25 5
0.27 6
0.29 7
0.31 8
0.33 10
0.34 13
0.36 16
0.37 20
0.38 23
0.39 28
0.4 37
0.41 43
0.42 55
0.43 67
0.44 81
0.45 94
0.46 118
0.47 143
0.48 187
0.49 225
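For what it's worth, two things stand out: the formula as written, y ~ a(exp^(x*b)-1), is not valid R (it needs an explicit *, and exp is a function, not a base of ^), and a starting value of b = 0 makes x/b undefined. Below is a minimal sketch of one way to get usable starting values: linearise the model for large y, where log(y) ≈ log(a) + x/b, and feed the rough estimates to nls() (lin, start and fit are just illustrative names).
library(ggplot2)
# Rough starting values from the log-linear fit log(y) ~ x
lin <- lm(log(y) ~ x, data = df)
start <- list(a = exp(coef(lin)[[1]]), b = 1 / coef(lin)[[2]])
fit <- nls(y ~ a * (exp(x / b) - 1), data = df, start = start)
coef(fit)
# The same starts should make the geom_smooth() call work; note the explicit *
ggplot(df, aes(x = x, y = y)) + geom_point() +
geom_smooth(method = "nls", se = FALSE, colour = "cyan2", size = 0.4,
            formula = y ~ a * (exp(x / b) - 1),
            method.args = list(start = start))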
I am having trouble with a legend in an LDA analysis. Here is toy data:
>d_e_a.train
Lymphoprol. CORT Testo FDL RBC.Ab. ifn.g il.4 Profile
52 0.00 0.58 1.94 2.54 6 98 40 Med
81 22.23 0.58 0.05 1.56 4 203 45 Med
66 5.31 1.75 0.30 2.73 3 49 74 High
62 35.00 0.81 0.95 4.30 6 322 60 Low
9 6.52 2.36 0.03 0.92 4 51 75 High
70 13.27 0.47 1.67 2.57 3 278 75 Med
56 18.23 0.46 1.89 2.99 4 54 60 High
72 31.25 0.31 1.52 3.37 5 305 57 Low
90 22.09 0.40 0.06 1.62 5 254 58 Med
37 4.32 1.34 0.05 0.71 3 41 73 High
3 15.65 0.50 0.07 0.97 5 280 67 Med
17 39.32 1.71 0.30 2.06 2 93 53 High
57 19.25 1.15 0.05 1.75 5 95 73 Med
24 17.03 0.14 1.28 3.22 4 79 77 Med
85 13.73 0.52 1.59 2.20 3 62 75 Med
41 23.16 0.89 0.09 1.48 2 99 57 Med
65 29.25 0.28 0.04 2.56 5 298 55 Low
75 0.00 0.86 0.11 1.65 3 110 47 Med
22 14.25 1.09 1.46 1.96 5 76 69 Med
20 35.14 0.26 1.12 5.16 6 282 47 Low
83 36.94 0.55 1.62 2.15 4 298 60 Low
45 28.58 1.50 0.21 1.41 5 201 65 Med
2 13.91 0.65 1.34 2.27 6 195 58 Med
73 0.00 0.99 0.09 0.92 3 133 77 Med
29 35.80 0.12 0.01 1.80 7 307 65 Low
and this is the model: model_a <- lda(Profile ~ ., data = d_e_a.train)
When I try to plot it using the following code, I get two legends, as can be seen in the plot:
library(ggplot2)
library(ggfortify)
library(devtools)
install_github('fawda123/ggord')
library(ggord)
plota<-ggord(model_a, d_e_a.train$Profile)+
theme_classic()+
scale_fill_manual(name = "Profile",
labels = c("Fischer - like", "Lewis - like", "Medium"))+
theme(text = element_text(size = 20 ),
axis.line.x = element_line(color="black", size = 1),
axis.line.y = element_line(color="black", size = 1),
axis.text.x=element_text(colour="black",angle = 360,vjust = 0.6),
axis.text.y=element_text(colour="black"))
plota
I would like to keep only the legend shown at the top.
Regards
You need to have both a fill scale and a color scale with the same labels. You also need to remove the shape guide that this function seems to add, even though the shape of the points appears constant.
ggord(model_a, d_e_a.train$Profile)+
theme_classic()+
scale_fill_discrete(name = "Profile",
labels = c("Fischer - like", "Lewis - like", "Medium"))+
scale_color_discrete(name = "Profile",
labels = c("Fischer - like", "Lewis - like", "Medium"))+
theme(text = element_text(size = 20 ),
axis.line.x = element_line(color="black", size = 1),
axis.line.y = element_line(color="black", size = 1),
axis.text.x=element_text(colour="black",angle = 360,vjust = 0.6),
axis.text.y=element_text(colour="black")) +
guides(shape = guide_none())
Suppose I have a dataset with two variables, x and y, on which I want to run a linear regression y ~ x.
All x values are equal and y varies between 1 and 10.
For example (in R code):
x <- rep(100, 50)
y <- runif(50, 1, 10)
If I add a new observation with x = 75, this new value will be considered a leverage point:
x <- c(x, 75)
y <- c(y, runif(1, 1, 10))
fit <- lm(y ~ x)
im <- influence.measures(fit)
tail(im$is.inf)
How many 75's do I need to add to the dataset for them not to be considered leverage points?
Is there any R package that returns that critical N size?
Edit after @RuiBarradas' comments:
The hat values with 51 observations (fifty 100's and one 75) are:
> im$infmat[, 6]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
49 50 51
0.02 0.02 1.00
What I want to know is how many 75's I have to add so that 75 is not considered a leverage point, because I'm deleting observations with high leverage from my analysis.
This has to be done programmatically for over 1000 cases.
Well, I don't think anyone is trying to question your approach, just to point out that as you drop high-leverage rows, more will appear, potentially with even higher leverage.
But best to simply arm you to do what you want. Here's a function that takes a dataset, a regression formula, and how_deep you want to go; it keeps pruning while at least how_deep rows remain.
We'll apply it to mtcars (32 rows) with how_deep = 27; it prints each removal and returns (invisibly) the vector of removed cars. Initially the Lincoln Continental has the highest leverage; once it is eliminated, the Chrysler Imperial tops the list with what is actually a higher leverage. Hopefully your data will be much different.
recursive_leverage <- function(data, formula, how_deep) {
  removed <- character(0)
  while (nrow(data) >= how_deep) {
    # Refit and find the observation with the highest hat value
    data$hatval <- hatvalues(lm(formula = formula, data = data))
    worst <- which.max(data$hatval)
    print(paste(rownames(data)[worst], data$hatval[worst]))
    removed <- c(removed, rownames(data)[worst])
    data <- data[-worst, ]
  }
  invisible(removed)
}
recursive_leverage(mtcars, mpg ~ wt + disp, 27)
#> [1] "Lincoln Continental 0.198891254526384"
#> [1] "Chrysler Imperial 0.240276382749128"
#> [1] "Cadillac Fleetwood 0.274528186610731"
#> [1] "Lotus Europa 0.224979770471144"
#> [1] "Honda Civic 0.223347867723116"
#> [1] "Ford Pantera L 0.21141405715464"
The code below creates a map of UK postcodes using ggplot, but leaves some parts of the map white/missing. Could you please advise how to make sure the whole map is filled and that the postcode areas have a border? Thanks.
[Map of the UK produced by the code below]
rm(list=ls())
library(tidyverse)
library(maptools)
library(raster)
library(plotrix)
library(ggrepel)
df2016 <- read.table(stringsAsFactors=FALSE, header=TRUE, text="
name value amount
LD 1 3
ZE 1 2
WS 0.79 19
ML 0.75 12
HS 0.75 4
TQ 0.74 38
WN 0.73 15
CA 0.71 28
HU 0.7 33
FY 0.69 16
HG 0.69 16
IV 0.68 19
DL 0.68 25
CB 0.68 115
TS 0.67 46
IP 0.67 87
AB 0.67 66
NP 0.67 45
FK 0.67 18
IM 0.67 9
SM 0.66 50
HD 0.66 32
EN 0.66 61
CO 0.65 52
ME 0.65 54
PE 0.64 266
EX 0.64 81
WV 0.63 49
JE 0.63 24
NE 0.62 148
YO 0.62 47
DE 0.62 78
LN 0.61 36
SN 0.61 109
IG 0.6 63
NR 0.6 90
SP 0.59 37
BA 0.59 93
UB 0.59 127
TN 0.59 95
BT 0.59 180
BD 0.59 51
HP 0.59 126
TA 0.59 46
PO 0.58 113
DH 0.58 55
WD 0.58 102
BH 0.57 96
DG 0.57 14
CV 0.57 225
RG 0.57 255
BN 0.56 158
DY 0.56 48
HA 0.56 148
W 0.56 359
WA 0.56 77
DA 0.55 38
CT 0.55 62
GU 0.55 231
RH 0.55 132
BL 0.55 33
HX 0.55 11
BS 0.54 184
SS 0.54 46
EH 0.54 185
DT 0.54 37
G 0.54 137
B 0.54 283
LU 0.54 41
NG 0.54 97
OX 0.53 208
S 0.53 179
CM 0.53 100
DD 0.53 17
GL 0.53 87
AL 0.53 89
HR 0.53 38
LS 0.52 122
TF 0.52 21
RM 0.52 44
SL 0.52 155
MK 0.52 136
SY 0.52 46
DN 0.52 81
N 0.52 191
M 0.52 226
SR 0.52 29
SK 0.52 64
BB 0.51 140
KY 0.51 41
WF 0.51 51
PR 0.51 63
L 0.51 81
KT 0.5 185
CF 0.5 118
ST 0.5 84
TR 0.5 46
CW 0.5 44
TD 0.5 12
P 0.5 2
SW 0.5 317
LL 0.49 49
CH 0.49 43
E 0.49 275
EC 0.48 364
PA 0.48 27
SO 0.48 157
CR 0.48 84
PL 0.48 61
SG 0.47 59
KA 0.47 15
LA 0.47 43
SA 0.46 78
LE 0.46 194
TW 0.45 125
OL 0.44 41
SE 0.44 297
NN 0.43 143
NW 0.42 236
WC 0.41 138
WR 0.38 73
BR 0.37 62
GY 0.26 35
PH 0.23 13
")
#df2016$amount <- NULL
df2016$name <- as.character(df2016$name)
# Download a shapefile of postal codes into your working directory
download.file(
"http://www.opendoorlogistics.com/wp-content/uploads/Data/UK-postcode-boundaries-Jan-2015.zip",
"postal_shapefile"
)
# Unzip the shapefile
unzip("postal_shapefile")
# Read the shapefile
postal <- readShapeSpatial("./Distribution/Areas")
postal.df <- fortify(postal, region = "name")
# Join your data to the shapefile
colnames(postal.df)[colnames(postal.df) == "id"] <- "name"
postal.df <- raster::merge(postal.df, df2016, by = "name")
postal.df$value[is.na(postal.df$value)] <- 0.50
# Get centroids of spatialPolygonDataFrame and convert to dataframe
# for use in plotting area names.
postal.centroids.df <- data.frame(long = coordinates(postal)[, 1],
lat = coordinates(postal)[, 2],
id=postal$name)
p <- ggplot(postal.df, aes(x = long, y = lat, group = group)) + geom_polygon(aes(fill = cut(value,5))) +
geom_text_repel(data = postal.centroids.df, aes(label = id, x = long, y = lat, group = id), size = 3, check_overlap = T) +
labs(x=" ", y=" ") +
theme_bw() + scale_fill_brewer('Success Rate 2016', palette = 15) +
coord_map() +
theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank()) +
theme(axis.ticks = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank()) +
theme(panel.border = element_blank())
p
Try arranging the postal code data by name or number just before plotting:
postal.centroids.df <- postal.centroids.df %>%
  arrange(id)
My county maps of the US did the same thing when they weren't in order. If that doesn't work, try arranging by lat or long as well.
The solution was to use left_join() from dplyr instead of merge():
# Reuse df2016, the downloaded shapefile, postal, postal.df (with its id
# column renamed to "name") and postal.centroids.df exactly as built in
# the question above; only the join and the plotting data change.
library(dplyr)
test <- left_join(postal.df, df2016, by = "name", copy = FALSE)
#postal.df <- raster::merge(postal.df, df2016, by = "name")
test$value[is.na(test$value)] <- 0.50
p <- ggplot(test, aes(x = long, y = lat, group = group)) + geom_polygon(aes(fill = cut(value,5))) +
geom_text_repel(data = postal.centroids.df, aes(label = id, x = long, y = lat, group = id), size = 3, check_overlap = T) +
labs(x=" ", y=" ") +
theme_bw() + scale_fill_brewer('Success Rate 2016', palette = 15) +
coord_map() +
theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank()) +
theme(axis.ticks = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank()) +
theme(panel.border = element_blank())
p
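For the second part of the question, the borders: geom_polygon() accepts an outline colour and width outside aes(), so replacing the polygon layer above with something like the following draws a border around each postcode area (the grey shade and line width are arbitrary choices):
geom_polygon(aes(fill = cut(value, 5)), colour = "grey30", size = 0.1)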
I have my data in long format, like this, with 20 different variables (they all have the same time points):
Time variable value
1 0 P1 0.07
2 1 P1 0.02
3 2 P1 0.12
4 3 P1 0.17
5 4 P1 0.10
6 5 P1 0.17
66 0 P12 0.02
67 1 P12 0.11
68 2 P12 0.20
69 3 P12 0.19
70 4 P12 0.07
71 5 P12 0.20
72 6 P12 0.19
73 7 P12 0.19
74 8 P12 0.12
75 10 P12 0.13
76 12 P12 0.08
77 14 P12 NA
78 24 P12 0.07
79 0 P13 0.14
80 1 P13 0.17
81 2 P13 0.24
82 3 P13 0.24
83 4 P13 0.26
84 5 P13 0.25
85 6 P13 0.21
86 7 P13 0.21
87 8 P13 NA
88 10 P13 0.19
89 12 P13 0.14
90 14 P13 NA
91 24 P13 0.12
I would like to calculate the area under the curve for each variable between time = 0 and time = 24. Ideally I would also like to calculate the area under the curve where y > 0.1.
I have tried the pracma package but it just comes out with NA.
trapz(x=P2ROKIlong$Time, y=P2ROKIlong$value)
Do I have to split my data into lots of different vectors and then do it manually or is there a way of getting it out of the long-format data?
The following code runs fine for me:
require(pracma)
df = data.frame(Time =c(0,1,2,3,4,5),value=c(0.07,0.02,0.12,0.17,0.10,0.17))
AUC = trapz(df$Time,df$value)
Is there anything strange (NA's?) in the rest of your dataframe?
EDIT: New code based on comments
May not be the most efficient, but the size of your data seems limited. This returns a vector AUC_result with the AUC per variable. Does this solve your issue?
require(pracma)
df = data.frame(Time = c(0,1,2,3,4,5),
                value = c(0.07,0.02,0.12,0.17,NA,0.17),
                variable = c("P1","P1","P1","P2","P2","P2"))
# Drop NA's first: trapz() would otherwise return NA
df = df[!is.na(df$value),]
unique_groups = as.character(unique(df$variable))
AUC_result = c()
# Trapezoidal AUC for each variable separately
for(i in 1:length(unique_groups))
{
  df_subset = df[df$variable %in% unique_groups[i],]
  AUC = trapz(df_subset$Time, df_subset$value)
  AUC_result[i] = AUC
  names(AUC_result)[i] = unique_groups[i]
}
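A more compact equivalent, assuming the long data frame is called P2ROKIlong as in the question (d, auc and auc_above are illustrative names): split by variable and apply trapz() to each piece. The last line is one reading of "AUC where y > 0.1" — it integrates only the part of the curve that sits above the 0.1 line; that interpretation is an assumption, not the only one.
library(pracma)
d <- P2ROKIlong[!is.na(P2ROKIlong$value), ]   # trapz() returns NA otherwise
auc <- sapply(split(d, d$variable), function(g) trapz(g$Time, g$value))
auc_above <- sapply(split(d, d$variable),
                    function(g) trapz(g$Time, pmax(g$value - 0.1, 0)))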
I'm trying to use "stat_sum_single" with a factor variable but I get the error:
Error: could not find function "stat_sum_single"
I tried converting the factor variable to a numeric but it doesn't seem to work - any ideas?
Full code:
ggplot(sn, aes(x = person,y = X, group=Plan, colour = Plan)) +
geom_line(size=0.5) +
scale_y_continuous(limits = c(0, 1.5)) +
scale_x_discrete(breaks = c(0,50,100), labels= c(0,50,100)) +
labs(x = "X",y = "%") +
stat_sum_single(mean, geom = 'line', aes(x = as.numeric(as.character(person))), size = 3, colour = 'red')
Data:
Plan person X m mad mmad
1 1 95 0.323000 0.400303 0.12
1 2 275 0.341818 0.400303 0.12
1 3 2 0.618000 0.400303 0.12
1 4 75 0.320000 0.400303 0.12
1 5 13 0.399000 0.400303 0.12
1 6 20 0.400000 0.400303 0.12
2 7 219 0.393000 0.353350 0.45
2 8 50 0.060000 0.353350 0.45
2 9 213 0.390000 0.353350 0.45
2 15 204 0.496100 0.353350 0.45
2 19 19 0.393000 0.353350 0.45
2 24 201 0.388000 0.353350 0.45
3 30 219 0.567 0.1254 0.89
3 14 50 0.679 0.1254 0.89
3 55 213 0.1234 0.1254 0.89
3 18 204 0.6135 0.1254 0.89
3 59 19 0.39356 0.1254 0.89
3 101 201 0.300 0.1254 0.89
Person is a factor variable.
The function stat_sum_single() isn't implemented in ggplot2 itself; it has to be defined before use, as shown in the examples for stat_summary():
stat_sum_single <- function(fun, geom="point", ...) {
stat_summary(fun.y=fun, colour="red", geom=geom, size = 3, ...)
}
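A sketch of calling it on the data above (sn, person, X and Plan are the asker's objects; aes(group = 1) is an assumption so that the mean line spans all persons). Note that colour and size are already fixed inside the helper, so they don't need to be passed again, and in ggplot2 >= 3.3 the fun.y argument should read fun:
ggplot(sn, aes(x = person, y = X, group = Plan, colour = Plan)) +
  geom_line(size = 0.5) +
  stat_sum_single(mean, geom = "line", aes(group = 1))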
Here is the ggplot2 reference manual on CRAN:
http://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf
On page 185 there is an example of using stat_sum_single.
I believe you need to define it yourself on top of stat_summary() first.