I am new to R. I am applying autofitVariogram to daily rainfall data from 50 stations. Sample data is provided below. Some of the stations have missing values, represented by "NaN".
My question is about the variogram fit. The variogram covers only a distance of 60,000 m. Why are the points in bins beyond 60 km not plotted? From a spatial correlation plot I had seen that the maximum distance computed from the lon-lat information is > 200 km.
The summary of the latitude and longitude information is provided below.
summary(lonlat)
      lon             lat
 Min.   :74.78   Min.   :15.77
 1st Qu.:75.14   1st Qu.:16.04
 Median :75.56   Median :16.33
 Mean   :75.54   Mean   :16.37
 3rd Qu.:75.94   3rd Qu.:16.66
 Max.   :76.31   Max.   :17.23
Sample data is given below:
dput(rain[140:145,])
structure(list(Col0 = c(0, 0, 1, 9, 6.5, 0), Col1 = c(1.5, 36,
21, 44, 4, 0), Col2 = c(0, 0, 24.5, 21.5, 7.5, 1), Col3 = c(0,
1, 45, 3, 0, 0), Col4 = c(2, 0, 5, 54.5, 13.5, 0), Col5 = c(0.5,
2, 0, 3.5, 13.5, 0), Col6 = c(0.5, 0, 0, 59, 15.5, 0), Col7 = c(0,
0, 2.5, 1, 0, 0), Col8 = c(0, 6, 24, 2, 5.5, 0), Col9 = c(0,
3, 6, 1, 0, 7), Col10 = c(0.5, 1, 64, 20, 1, 0.5), Col11 = c(NaN,
NaN, NaN, NaN, NaN, NaN), Col12 = c(0, 11, 75, 19, 15.5, 0),
Col13 = c(0, 4, 57.5, 50.5, 8.5, 0), Col14 = c(1.5, 0.5,
127, 33.5, 34.5, 0), Col15 = c(0, 7, 0.5, 13, 1, 0), Col16 = c(0,
0.5, 81.5, 15, 49, 0), Col17 = c(0, 0, 4.5, 17, 5.5, 1),
Col18 = c(0, 3, 2.5, 0.5, 0, 0), Col19 = c(NaN, NaN, NaN,
NaN, NaN, NaN), Col20 = c(0, 0, 0, 0, 7, 0), Col21 = c(0,
1, 0, 5, 3.5, 0), Col22 = c(0, 0, 11.5, 28, 3.5, 0), Col23 = c(0,
0, 48.5, 0, 24.5, 0), Col24 = c(0, 0, 0, 10, 0.5, 14), Col25 = c(NaN,
NaN, NaN, NaN, NaN, NaN), Col26 = c(0, 7.5, 16, 28.5, 20.5,
0), Col27 = c(1.5, 0.5, 38, 28.5, 50, 0), Col28 = c(NaN,
NaN, NaN, NaN, NaN, NaN), Col29 = c(NaN, NaN, NaN, NaN, NaN,
NaN), Col30 = c(2.5, 0, 0, 80.5, 28, 13.5), Col31 = c(1,
0, 17, 85.5, 3.5, 0), Col32 = c(0, 0.5, 8, 101, 20, 4), Col33 = c(NaN,
NaN, NaN, NaN, NaN, NaN), Col34 = c(4, 3, 17, 122, 2, 2),
Col35 = c(0, 15.5, 14.5, 20, 3.5, 0), Col36 = c(0, 6.5, 8.5,
21, 7, 0), Col37 = c(0, 0, 1.5, 14.5, 0, 1.5), Col38 = c(0,
28, 30, 4, 0, 73), Col39 = c(28.5, 0, 4.5, 9.5, 1, 0), Col40 = c(1.5,
11.5, 32.5, 55, 0, 1), Col41 = c(0, 14.5, 0, 19, 12.5, 47.5
), Col42 = c(0, 28, 29, 17, 0.5, 20.5), Col43 = c(NaN, NaN,
NaN, NaN, NaN, NaN), Col44 = c(0, 19, 3.5, 42, 0, 0), Col45 = c(0,
0, 85, 15.5, 1, 0), Col46 = c(0, 0.5, 8, 24, 0.5, 0), Col47 = c(0,
1.5, 7, 12, 8.5, 0), Col48 = c(0, 0, 0, 43.5, 0, 1.5), Col49 = c(0,
13.5, 1, 16, 1, 1)), .Names = c("Col0", "Col1", "Col2", "Col3",
"Col4", "Col5", "Col6", "Col7", "Col8", "Col9", "Col10", "Col11",
"Col12", "Col13", "Col14", "Col15", "Col16", "Col17", "Col18",
"Col19", "Col20", "Col21", "Col22", "Col23", "Col24", "Col25",
"Col26", "Col27", "Col28", "Col29", "Col30", "Col31", "Col32",
"Col33", "Col34", "Col35", "Col36", "Col37", "Col38", "Col39",
"Col40", "Col41", "Col42", "Col43", "Col44", "Col45", "Col46",
"Col47", "Col48", "Col49"), row.names = 143:148, class = "data.frame")
# Import the required libraries
library(rgdal)
library(maptools)
library(gstat)
library(sp)
library(automap)
library(XLConnect)
# Read the station data from xls file
stnrain = readWorksheetFromFile(path_fileName,"Sheet1", region = "D1:BA187", header = FALSE)
N = nrow(stnrain)
rain = stnrain[4:N,]
lat = as.numeric(t(stnrain[2,]))
lon = as.numeric(t(stnrain[3,]))
lonlat = cbind(lon,lat)
# Transform from geographic coordinates (GCS) to a UTM projection
sp = SpatialPoints(lonlat,proj4string = CRS("+proj=longlat"))
sp_utm = spTransform(sp, CRS("+proj=utm +zone=43 +datum=WGS84"))
krige_value = list() #prepare a list for storing the autokrige output
krige_stderr = list()
nRows = nrow(rain)
for (i in 1:nRows)
{
irain = rain[i,]
miss_indx = (irain == "NaN")
irain = irain[!miss_indx]
irain = as.numeric(irain)
isallZeros = (max(irain) == 0) # To take care of the cases of dry day(irain =0)
irain = as.data.frame(irain)
M = nrow(irain)
if ((M > 5) & (!isallZeros)) # To avoid cases of NaN across many stations
{
print(i)
foo_utm = sp_utm[!miss_indx] # Remove the locations with NaN values
data = data.frame(foo_utm,irain)
names(data) = c("Easting","Northing","rain")
coordinates(data) = c("Easting","Northing")
variogram = autofitVariogram(rain~1,data,model = "Sph",fix.values=c(0,NA,NA))
p = plot(variogram, main="Semi-variogram (Spherical Model)",xlab="Distance(m)",ylab="Semi-Variance(mm2)", sub=paste("Range: ",variogram$var_model$range[2], "Day",i))
print(p)
png(p)
dev.off()
}
else
{
krige_value[[i]] = list(rep(0, L))
krige_stderr[[i]] = list(rep(0, L))
}
}
Q2) How can I save the variogram fit as a png file inside a loop? I understand that dev.off() should be called after saving each figure, which I have done, but I am still not able to save the figure.
Any help would be appreciated.
Thanks,
In regard to your first question: the sample variogram is built using only point pairs up to a maximum distance of around 1/3 of the diagonal of the area of interest. The assumption here is that points farther apart than that are not related, and because they are in neither the sample variogram nor the variogram model, they are not plotted. This is just a choice, and it might not be the correct choice for every dataset, but when I wrote autofitVariogram it seemed to work well for my data. The variogram model you show is consistent with this: the fitted range is smaller than 60 km.
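As a rough check of the distances involved, the diagonal of the bounding box can be estimated from the lon/lat summary in the question. This is only a sketch: the 111.32 km per degree figure and the flat-earth approximation are assumptions made here for illustration, and the exact fraction used internally by autofitVariogram may differ slightly from one third.

```r
# Bounding box taken from the summary(lonlat) output in the question
lon_range <- c(74.78, 76.31)
lat_range <- c(15.77, 17.23)

# Assumption: about 111.32 km per degree of latitude; a degree of longitude
# shrinks by the cosine of the latitude (evaluated at the mid-latitude here)
km_per_degree <- 111.32
mid_lat_rad   <- mean(lat_range) * pi / 180

height_km <- diff(lat_range) * km_per_degree
width_km  <- diff(lon_range) * km_per_degree * cos(mid_lat_rad)

# Diagonal of the area of interest and the approximate cutoff of one third
diagonal_km <- sqrt(width_km^2 + height_km^2)
cutoff_km   <- diagonal_km / 3

print(c(diagonal_km = diagonal_km, cutoff_km = cutoff_km))
```

The diagonal comes out above 200 km, in line with the maximum distance reported in the question, while the cutoff is only a fraction of that, which is why the plotted bins stop well short of 200 km.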
For saving your png files I have two suggestions. First, call the plot command between png() and dev.off(), so not:
print(p)
png()
dev.off()
but:
png()
print(p)
dev.off()
Second, I would create meaningful names for the png files, for example names that include the day index.
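Both suggestions can be combined in a loop like the one below. This is a minimal sketch using a base-graphics placeholder plot and made-up values; the output directory and the file name pattern are hypothetical. For a lattice or ggplot2 object such as the p returned by plotting an autofitVariogram result, print(p) would take the place of the plot() call.

```r
# Hypothetical output directory; a temporary folder is used for illustration
out_dir <- file.path(tempdir(), "variogram_plots")
dir.create(out_dir, showWarnings = FALSE)

set.seed(42)
saved_files <- character(0)

for (i in 1:3) {
  # Meaningful file name that encodes the day index
  file_name <- file.path(out_dir, sprintf("variogram_day_%03d.png", i))

  png(file_name, width = 800, height = 600)  # 1. open the device
  # 2. draw while the device is open (placeholder data, not a real variogram)
  plot(seq(5000, 60000, length.out = 10), cumsum(runif(10)),
       xlab = "Distance (m)", ylab = "Semivariance",
       main = paste("Day", i))
  dev.off()                                  # 3. close the device to write the file

  saved_files <- c(saved_files, file_name)
}

print(basename(saved_files))
```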
To create sets of variogram plots, I would use ggplot2, with geom_line and facet_wrap. ggplot2 cannot deal directly with gstat/automap variogram models, but luckily you can create distance/semivariance data using the function variogramLine from gstat. See for example figure 3.1 and the plots in appendix A of this report I wrote. This answer I wrote earlier also includes an example of using ggplot2 for spatial data, this time to plot a grid map.
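A minimal sketch of that approach follows. The two spherical models and their sill and range values are made up for illustration; in practice they would be the var_model components returned by autofitVariogram for each day.

```r
library(gstat)    # vgm(), variogramLine()
library(ggplot2)

# Hypothetical fitted models, one per day (illustrative values only)
models <- list(
  "Day 1" = vgm(psill = 120, model = "Sph", range = 30000),
  "Day 2" = vgm(psill = 80,  model = "Sph", range = 45000)
)

# Turn each model into a data frame of distance (dist) and semivariance (gamma)
vg_lines <- do.call(rbind, lapply(names(models), function(day) {
  line <- variogramLine(models[[day]], maxdist = 60000, n = 100)
  line$day <- day
  line
}))

# One panel per day
p <- ggplot(vg_lines, aes(x = dist, y = gamma)) +
  geom_line() +
  facet_wrap(~ day) +
  labs(x = "Distance (m)", y = "Semivariance")
```

Calling print(p) between png() and dev.off(), or using ggsave(), would then write all panels to a single file.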
I am working on a project that requires me to one-hot encode a single variable, and I cannot seem to do it correctly.
I simply want to one-hot encode the variable data$Ratings so that the values 1, 2 and 3 are separated into their own columns in the data frame, each equal to either 0 or 1. E.g., if data$Ratings = 3 then the dummy for 3 would be 1. All the other columns should stay unchanged.
structure(list(ID = c(284921427, 284926400, 284946595, 285755462,
285831220, 286210009, 286313771, 286363959, 286566987, 286682679
), AUR = c(4, 3.5, 3, 3.5, 3.5, 3, 2.5, 2.5, 2.5, 2.5), URC = c(3553,
284, 8376, 190394, 28, 47, 35, 125, 44, 184), Price = c(2.99,
1.99, 0, 0, 2.99, 0, 0, 0.99, 0, 0), AgeRating = c(1, 1, 1, 1,
1, 1, 1, 1, 1, 1), Size = c(15853568, 12328960, 674816, 21552128,
34689024, 48672768, 6328320, 64333824, 2657280, 1466515), HasSubtitle = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), InAppSum = c(0, 0, 0, 0, 0, 1.99,
0, 0, 0, 0), InAppMin = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppMax = c(0,
0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppCount = c(0, 0, 0, 0, 0,
1, 0, 0, 0, 0), InAppAvg = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0),
descriptionTermCount = c(263, 204, 97, 272, 365, 368, 113,
129, 61, 87), LanguagesCount = c(17, 1, 1, 17, 15, 1, 0,
1, 1, 1), EngSupported = c(2, 2, 2, 2, 2, 2, 1, 2, 1, 2),
GenreCount = c(2, 2, 2, 2, 3, 3, 3, 2, 3, 2), months = c(7,
7, 7, 7, 7, 7, 7, 8, 8, 8), monthsSinceUpdate = c(29, 17,
25, 29, 15, 6, 71, 12, 23, 134), GameFree = c(0, 0, 0, 0,
0, 1, 0, 0, 0, 0), Ratings = c(3, 3, 3, 3, 2, 3, 2, 3, 2,
3)), row.names = c(NA, 10L), class = "data.frame")
install.packages("mlbench")
install.packages("neuralnet")
install.packages("mltools")
library(mlbench)
library(dplyr)
library(caret)
library(mltools)
library(tidyr)
data2 <- mutate_if(data, is.factor,as.numeric)
data3 <- lapply(data2, function(x) as.numeric(as.character(x)))
data <- data.frame(data3)
summary(data)
head(data)
str(data)
View(data)
#
dput(head(data, 10))
data %>% mutate(value = 1) %>% spread(data$Ratings, value, fill = 0 )
Is this what you want? I will assume your data is called data and continue with that for the data frame you supplied:
library(plm)
plm::make.dummies(data$Ratings) # returns a matrix
## 2 3
## 2 1 0
## 3 0 1
# returns the full data frame with dummies added:
plm::make.dummies(data, col = "Ratings")
## [not printed to save space]
There are some options for plm::make.dummies, e.g., you can select the base category via base and you can choose whether to include the base (add.base = TRUE) or not (add.base = FALSE).
The help page ?plm::make.dummies has more examples and explanation as well as a comparison for LSDV model estimation by a factor variable and by explicitly self-created dummies.
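For comparison, the same kind of result can be sketched in base R without any extra package. The small data frame and the Ratings_ column prefix below are chosen here for illustration; the Ratings values are the ten from the question.

```r
# Small stand-in for the question's data: an ID column plus Ratings
df <- data.frame(ID = 1:10,
                 Ratings = c(3, 3, 3, 3, 2, 3, 2, 3, 2, 3))

# Levels to encode; listing 1:3 explicitly keeps a column for rating 1
# even though it does not occur in these ten rows
rating_levels <- 1:3

# One 0/1 indicator column per level
dummies <- sapply(rating_levels, function(k) as.integer(df$Ratings == k))
colnames(dummies) <- paste0("Ratings_", rating_levels)

# Original columns stay untouched; the dummies are appended
df_onehot <- cbind(df, dummies)
print(df_onehot)
```

Building the dummies from an explicit vector of levels, rather than from the values present, keeps the column layout stable when a level happens to be absent from a subset.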
I have the data frame DATA1; a few rows are shown below:
structure(list(S = c(12, 12, 15, 15, 15, 9, 9), UG = c(84, 84,
84, 84, 84, 84, 84), CSi = c(0.487181441487271, 0.623551085193489,
0.505057492620447, 0.704318096382286, 0.575388552145397, 0.400731851672016,
0.490770631112789), N_l = c(1, 3, 1, 3, 5, 1, 3), N_b = c(5,
5, 5, 5, 5, 5, 5), m = c(1.2, 0.85, 1.2, 0.85, 0.65, 1.2, 0.85
), A = c(-12, -12, -15, -15, -15, -9, -9), x.sqr = c(1440, 1440,
2250, 2250, 2250, 810, 810), e_1 = c(21.8, 21.8, 29, 29, 29,
14.6, 14.6), e_2 = c(0, 9.8, 0, 17, 17, 0, 2.6), e_3 = c(0, -2.2,
0, 5, 5, 0, -9.4), e_4 = c(0, 0, 0, 0, -7, 0, 0), e_5 = c(0,
0, 0, 0, -19, 0, 0), K_g = c(6340598.65753794, 6340598.65753794,
6429472.98493414, 6429472.98493414, 6429472.98493414, 6296482.86883766,
6296482.86883766), stiff.girder = c(0.517988322166146, 0.517988322166146,
0.643978136780243, 0.643978136780243, 0.643978136780243, 0.416960174810184,
0.416960174810184), stiff.deck = c(276.422028597005, 276.422028597005,
147.89589537037, 147.89589537037, 147.89589537037, 642.725952664716,
642.725952664716)), row.names = c(10L, 30L, 50L, 70L, 90L, 110L,
130L), class = "data.frame")
I am trying to fit the proposed function below with nonlinear regression:
Proposed <- function(N_b,N_l,m,A,x.sqr,e_1,e_2,e_3,e_4,e_5,K_g,a,b,c,d) {
e <- data.frame(e_1,e_2,e_3,e_4,e_5,N_l)
CSi <- m * ((N_l/N_b) * ((a*K_g)^b) +
(max(A * apply(e,1,function(v) combn(v[1:5],v["N_l"],sum))) / x.sqr) * ((c*K_g)^d))
return(CSi)
}
library(minpack.lm)
G_1 <- nlsLM(CSi ~ Proposed(N_b,N_l,m,A,x.sqr,e_1,e_2,e_3,e_4,e_5,K_g,a,b,c,d),
data = DATA1,
start = c(a = 0.01, b = 0.01, c = 0.01, d = 0.01))
I get the error:
Error in A * apply(e, 1, function(v) combn(v[1:5], v["N_l"], sum)) :
non-numeric argument to binary operator
probit5 <- glm(shot_success ~ age + risk + loc + self_est + compet + regfoc + self_eff +
first_mover + gender + ten_throws + incentive + score_diff + first_mover*gender,
family = binomial(link = "probit"),
data=data)
intEff(probit5, vars=c("first_mover", "gender"), data=data)
Error:
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In cbind(deriv1, deriv2, deriv3, nn, deriv0) :
number of rows of result is not a multiple of vector length (arg 4)
Reproducible example:
example <- structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), id_opp = c(2,
2, 2, 2, 2, 2, 2, 2, 2, 2), shot_success = c(0, 1, 0, 0, 0, 0,
0, 0, 1, 0), tiebreak = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), score_diff = c(0,
0, 1, 0, 0, 0, 0, 0, 0, -2), first_mover = c(1, 0, 1, 0, 1, 0,
1, 0, 1, 0), incentive = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), ten_throws = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1), pre_training = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), gender = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), age = c(18,
18, 18, 18, 18, 18, 18, 18, 18, 18), self_est = c(1, 1, 1, 1,
1, 1, 1, 1, 1, 1), risk = c(3.6667, 3.6667, 3.6667, 3.6667, 3.6667,
3.6667, 3.6667, 3.6667, 3.6667, 3.6667), loc = c(4.8, 4.8, 4.8,
4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8), self_eff = c(2.8, 2.8, 2.8,
2.8, 2.8, 2.8, 2.8, 2.8, 2.8, 2.8), compet = c(4.54, 4.54, 4.54,
4.54, 4.54, 4.54, 4.54, 4.54, 4.54, 4.54), regfoc = c(3.6364,
3.6364, 3.6364, 3.6364, 3.6364, 3.6364, 3.6364, 3.6364, 3.6364,
3.6364)), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"))
probit_example <- glm(shot_success ~ gender + age + risk + loc + self_est + compet + regfoc + self_eff + first_mover + ten_throws
+ incentive + score_diff + first_mover*gender,
family = binomial(link = "probit"),
data = example)
intEff(probit_example, vars=c("first_mover", "gender"), data=example)
Using only these few rows returns a different error. With all 1040 rows of my data I get the error shown above, but including all rows here would take too much space.
When I try changing the axis in a forest plot I generated using ggforest (from the package survminer), the plot changes completely.
For example:
#Example data set
mydata <- data.frame(A = c(8, 6, 42, 97, 55, 1, 5, 7, 55, 4),
B = c(93, 9, 65, 2, 51, 89, 1, 1, 5, 62),
C = c(68.41, 68.86, 47.26, 31.06, 42.97, 69.16, 47.39, 56.57, 19.63, 45.58),
D = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0),
time = c(1.6, 34.6, 1.5, 35.8, 7.7, 38.6, 40.2, 4.7, 37.6, 8.6),
event= c(1, 0, 0, 0, 1, 0, 0, 1, 0, 1))
OS <- mydata$time
Event<-mydata$event
A<-mydata$A
B<-mydata$B
C<-mydata$C
D<-mydata$D
# load packages
library("survival")
library("survminer")
# dependent and independent variables
y <- Surv(OS, Event)
x <- cbind(A, B, C, D)
#cox regression
cox<-coxph(y~x, data=mydata, method= "breslow")
summary(cox)
#Forest plot using survminer
ggforest(cox, alpha = 0.05, plot.title = "Forest plot for Cox proportional hazard model")
Which produces this plot
If I try to change the axis limits, like so:
ggforest(cox, alpha = 0.05, plot.title = "Forest plot for Cox proportional hazard model", xlim=c(-10,10000))
it looks like this
Does anyone know a solution to this?
Here is a solution based on coord_flip with the ylim option:
ggforest(cox, alpha = 0.05, plot.title = "Forest plot for Cox proportional hazard model") +
coord_flip(ylim=c(10^(-4),10^6))
I am getting the following error when calculating VIF on a small dataset in RStudio. Could anyone help? I can provide more information on the dataset if needed.
"Error in as.vector(y) - mean(y) : non-numeric argument to binary operator".
Dataset: 80 obs. and 15 variables (all variables are numeric)
Steps Followed:
# 1. Determine correlation
library(corrplot)
cor.data <- cor(train)
corrplot(cor.data, method = 'color')
cor.data
# 2. Build Model
model2 <- lm(Volume~., train)
summary(model2)
# 3. Calculate VIF
library(VIF)
vif(model2)
Here is a sample dataset with 20 obs.
train <- structure(list(Price = c(949, 2249.99, 399, 409.99, 1079.99,
114.22, 379.99, 65.29, 119.99, 16.99, 6.55, 15, 52.5, 21.08,
18.98, 3.6, 3.6, 174.99, 9.99, 670), X.5.Star.Reviews. = c(3,
2, 3, 49, 58, 83, 11, 33, 16, 10, 21, 75, 10, 313, 349, 8, 11,
170, 15, 20), X.4.Star.Reviews. = c(3, 1, 0, 19, 31, 30, 3, 19,
9, 1, 2, 25, 8, 62, 118, 6, 5, 100, 12, 2), X.3.Star.Reviews. = c(2,
0, 0, 8, 11, 10, 0, 12, 2, 1, 2, 6, 5, 13, 27, 3, 2, 23, 4, 4
), X.2.Star.Reviews. = c(0, 0, 0, 3, 7, 9, 0, 5, 0, 0, 4, 3,
0, 8, 7, 2, 2, 20, 0, 2), X.1.Star.Reviews. = c(0, 0, 0, 9, 36,
40, 1, 9, 2, 0, 15, 3, 1, 16, 5, 1, 1, 20, 4, 4), X.Positive.Service.Review. = c(2,
1, 1, 7, 7, 12, 3, 5, 2, 2, 2, 9, 2, 44, 57, 0, 0, 310, 3, 4),
X.Negative.Service.Review. = c(0, 0, 0, 8, 20, 5, 0, 3, 1,
0, 1, 2, 0, 3, 3, 0, 0, 6, 1, 3), X.Would.consumer.recommend.product. = c(0.9,
0.9, 0.9, 0.8, 0.7, 0.3, 0.9, 0.7, 0.8, 0.9, 0.5, 0.2, 0.8,
0.9, 0.9, 0.8, 0.8, 0.8, 0.8, 0.7), X.Shipping.Weight..lbs.. = c(25.8,
50, 17.4, 5.7, 7, 1.6, 7.3, 12, 1.8, 0.75, 1, 2.2, 1.1, 0.35,
0.6, 0.01, 0.01, 1.4, 0.4, 0.25), X.Product.Depth. = c(23.94,
35, 10.5, 15, 12.9, 5.8, 6.7, 7.9, 10.6, 10.7, 7.3, 21.3,
15.6, 5.7, 1.7, 11.5, 11.5, 13.8, 11.1, 5.8), X.Product.Width. = c(6.62,
31.75, 8.3, 9.9, 0.3, 4, 10.3, 6.7, 9.4, 13.1, 7, 1.8, 3,
3.5, 13.5, 8.5, 8.5, 8.2, 7.6, 1.4), X.Product.Height. = c(16.89,
19, 10.2, 1.3, 8.9, 1, 11.5, 2.2, 4.7, 0.6, 1.6, 7.8, 15,
8.3, 10.2, 0.4, 0.4, 0.4, 0.5, 7.8), X.Profit.margin. = c(0.15,
0.25, 0.08, 0.08, 0.09, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05,
0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.15), Volume = c(12,
8, 12, 196, 232, 332, 44, 132, 64, 40, 84, 300, 40, 1252,
1396, 32, 44, 680, 60, 80)), .Names = c("Price", "X.5.Star.Reviews.",
"X.4.Star.Reviews.", "X.3.Star.Reviews.", "X.2.Star.Reviews.",
"X.1.Star.Reviews.", "X.Positive.Service.Review.", "X.Negative.Service.Review.",
"X.Would.consumer.recommend.product.", "X.Shipping.Weight..lbs..",
"X.Product.Depth.", "X.Product.Width.", "X.Product.Height.",
"X.Profit.margin.", "Volume"), row.names = c(NA, 20L), class = "data.frame")
The vif function from the VIF package does not estimate the Variance Inflation Factor (VIF). Instead, "it selects variables for a linear model" and "returns a subset of variables for building a linear model"; see here for the description.
What you want is the vif function from the car package.
install.packages("car")
library(car)
vif(model2) # This should do it
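To make clear what quantity is meant, the variance inflation factor of a predictor can also be computed by hand as 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below uses simulated data with hypothetical column names, not the data from the question, and is not how car implements it internally.

```r
set.seed(1)
n <- 80

# Simulated predictors; x2 is deliberately correlated with x1
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
d  <- data.frame(y = 1 + x1 + x2 + x3 + rnorm(n), x1, x2, x3)

predictors <- c("x1", "x2", "x3")

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other predictors
manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = d))$r.squared
  1 / (1 - r2)
})

print(manual_vif)
```

For a model with only numeric predictors, these values should agree with what car::vif reports for lm(y ~ ., data = d).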
Edit: I won't comment specifically on the statistics side, but it seems like you have a perfect fit, something quite unusual, suggesting some problem in your data.
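One way to look for the source of such a perfect fit is to compare the response with individual predictors. The sketch below copies two columns from the 20-row sample in the question; whether the same relationship holds in the full 80-row dataset is an assumption that would need checking.

```r
# Two columns copied from the sample data in the question
five_star <- c(3, 2, 3, 49, 58, 83, 11, 33, 16, 10, 21, 75, 10, 313, 349,
               8, 11, 170, 15, 20)
volume    <- c(12, 8, 12, 196, 232, 332, 44, 132, 64, 40, 84, 300, 40,
               1252, 1396, 32, 44, 680, 60, 80)

# Ratio of response to predictor, row by row
ratio <- volume / five_star
print(unique(ratio))

# TRUE when the response is an exact multiple of the predictor in every row
is_exact_multiple <- length(unique(ratio)) == 1
print(is_exact_multiple)
```

In this sample Volume is an exact multiple of the five-star review count in every row, so a linear model that includes that predictor can reproduce Volume without error.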
You're giving vif the wrong input. It wants the response y and predictor variables x:
vif(train$Volume,subset(train,select=-Volume),subsize=19)
I had to set the subsize argument to a value <= the number of observations (the default is 200).
There are two R packages, "car" and "VIF", that both define a function vif(), but differently. Your result or error depends on which package is attached in your current session; if both are attached, the one attached last masks the other. Calling car::vif() explicitly avoids the ambiguity.
If you attach the "VIF" package in the session and pass the linear model to vif(), you will get the error given in the initial query, as shown below:
> model1 = lm(Satisfaction~., data1)
> library(VIF)
Attaching package: ‘VIF’
The following object is masked from ‘package:car’:
vif
> vif(model1)
Error in as.vector(y) - mean(y) : non-numeric argument to binary operator
In addition: Warning message:
In mean.default(y) : argument is not numeric or logical: returning NA
If you load the "car" package in the R session and not "VIF", you will get the VIF values as expected for a linear model, as shown below:
> model1 = lm(Satisfaction~., data1)
> library(car)
Loading required package: carData
Attaching package: ‘car’
The following object is masked from ‘package:psych’:
logit
> vif(model1)
ProdQual Ecom TechSup CompRes Advertising ProdLine SalesFImage ComPricing
1.635797 2.756694 2.976796 4.730448 1.508933 3.488185 3.439420 1.635000
WartyClaim OrdBilling DelSpeed
3.198337 2.902999 6.516014
All the columns in data1 are numeric. Hope that helps.