I have a GeoJSON file from the Austin government with district polygons and their names.
gjson_austin <- geojson_read('Neighborhood Planning Areas.geojson',
method = "local", what = "sp")
gjson_austin#proj4string<-CRS("+proj=longlat +datum=WGS84")
gjson_austin<-as(gjson_austin, 'SpatialPolygons')
place_name_austin = fread("ci6h-vmgm.csv")
After that, I build a data frame:
austin_frame <- gjson_austin%>%fortify()
austin_frame$id <- as.numeric(austin_frame$id)
objects_austin <- left_join(austin_frame, place_name_austin%>%select(FID, PLANNING_A), by = c("id"="FID"))
Now I have a data frame:
'data.frame': 10268 obs. of 8 variables:
$ long : num -97.7 -97.7 -97.7 -97.7 -97.7 ...
$ lat : num 30.3 30.3 30.3 30.3 30.3 ...
$ order : int 1 2 3 4 5 6 7 8 9 10 ...
$ hole : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ piece : Factor w/ 16 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ id : num 0 0 0 0 0 0 0 0 0 0 ...
$ group : Factor w/ 126 levels "0.1","1.1","2.1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PLANNING_A: chr "CHESTNUT" "CHESTNUT" "CHESTNUT" "CHESTNUT" ...
The polygons are grouped by id, and every polygon has a name in the PLANNING_A column.
I want to draw these polygons with leaflet.
leaflet() %>% addTiles()%>%
addPolygons(data = objects_austin, lng = ~long, lat = ~lat)
But this does not give me what I need.
I can draw the polygons if I use
leaflet() %>% addTiles()%>%
addPolygons(data=gjson_austin)
but I need the values from the JSON for my calculations.
Please help me with this problem.
After that, I will calculate some values grouped by polygon, and I want to draw the polygons colored by these values, like a heatmap.
This has been driving me crazy: I've been looking through similar posts all day but can't seem to solve my problem. I have a naive Bayes model trained and stored as model. I'm attempting to predict with a newdata data frame, but I keep getting the error Error: $ operator is invalid for atomic vectors. Here is what I am running: stats::predict(model, newdata = newdata), where newdata is the first row of another data frame: newdata <- pbp[1, c("balls", "strikes", "outs_when_up", "stand", "pitcher", "p_throws", "inning")]
class(newdata) gives [1] "tbl_df" "tbl" "data.frame".
The issue is with the data used: it should match the factor levels used in training. For example, if we use one of the rows from trainingData to predict, it works:
predict(model, head(model$trainingData, 1))
#[1] Curveball
#Levels: Changeup Curveball Fastball Sinker Slider
Comparing the str of both datasets shows that some columns that are factors in the training data are character or integer in newdata:
str(model$trainingData)
'data.frame': 1277525 obs. of 7 variables:
$ pitcher : Factor w/ 1390 levels "112526","115629",..: 277 277 277 277 277 277 277 277 277 277 ...
$ stand : Factor w/ 2 levels "L","R": 1 1 2 2 2 2 2 1 1 1 ...
$ p_throws : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
$ balls : num 0 1 0 1 2 2 2 0 0 0 ...
$ strikes : num 0 0 0 0 0 1 2 0 1 2 ...
$ outs_when_up: num 1 1 1 1 1 1 1 2 2 2 ...
$ .outcome : Factor w/ 5 levels "Changeup","Curveball",..: 3 4 1 4 1 5 5 1 1 5 ...
str(newdata)
tibble [1 × 6] (S3: tbl_df/tbl/data.frame)
$ balls : int 3
$ strikes : int 2
$ outs_when_up: int 1
$ stand : chr "R"
$ pitcher : int 605200
$ p_throws : chr "R"
One option is to convert those columns to factors with the same levels as in the training data:
nm1 <- intersect(names(model$trainingData), names(newdata))
nm2 <- names(which(sapply(model$trainingData[nm1], is.factor)))
newdata[nm2] <- Map(function(x, y) factor(x, levels = levels(y)), newdata[nm2], model$trainingData[nm2])
Now do the prediction
predict(model, newdata)
#[1] Sinker
#Levels: Changeup Curveball Fastball Sinker Slider
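The level-matching step can be tried in isolation. The following is a minimal sketch on made-up data (the `train` and `new_row` tables are hypothetical stand-ins, not the pitch data from the question); it only illustrates how `factor(x, levels = ...)` rebuilds a character or integer column with the training levels.

```r
# Hypothetical training data: two factor columns and one numeric column
train <- data.frame(
  stand   = factor(c("L", "R", "R")),
  pitcher = factor(c("112526", "115629", "605200")),
  balls   = c(0, 1, 2)
)

# New data as it often arrives: character and integer instead of factor
new_row <- data.frame(stand = "R", pitcher = 605200L, balls = 3,
                      stringsAsFactors = FALSE)

# Columns shared by both tables that are factors in the training data
shared   <- intersect(names(train), names(new_row))
fac_cols <- names(which(sapply(train[shared], is.factor)))

# Rebuild each of those columns with the training levels
new_row[fac_cols] <- Map(function(x, y) factor(x, levels = levels(y)),
                         new_row[fac_cols], train[fac_cols])

str(new_row)
```

A value that does not occur among the training levels becomes NA after this conversion, so it is worth checking the result with anyNA() before predicting.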
I am trying to run an image classification using SVM, but I am facing an error that has already been reported on this forum; the posted solutions do not fit my case.
The data I want to classify is a raster stack of 2 layers:
> S1_images
class : RasterStack
dimensions : 1000, 1414, 1414000, 2 (nrow, ncol, ncell, nlayers)
resolution : 10, 10 (x, y)
extent : 670860, 685000, 6163420, 6173420 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=32 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
names : X20180415_VH, X20180415_VV
min values : 1.621079e-05, 1.929869e-04
max values : 24.6396, 159.7452
My training data is obtained using polygons as reference and extracting the pixel values at those locations:
training_S<-raster::extract(S_images_t, training, df=TRUE)
training_S$Class<-factor(training_S$Class)
> head(training_S)
ID X20180415_VH X20180415_VV Class
1 1 0.006463605 0.05813200 1
2 1 0.006663103 0.06266786 1
3 1 0.007048910 0.06308612 1
4 1 0.006351015 0.04774158 1
5 1 0.006822301 0.05248845 1
6 1 0.007194918 0.05911565 1
and
> str(training_S)
'data.frame': 33239 obs. of 4 variables:
$ ID : num 1 1 1 1 1 1 1 1 1 1 ...
$ X20180415_VH: num 0.00646 0.00666 0.00705 0.00635 0.00682 ...
$ X20180415_VV: num 0.0581 0.0627 0.0631 0.0477 0.0525 ...
$ Class : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
After using tune.svm to choose the best parameters, I create the model (so far, so good):
SVM<-svm(x=training_S[ ,c(1:(length(training_S)-1))], y=training_S$Class, gamma = 0.1, cost = 10)
Next, I try to use predict to classify my input data:
LC<-predict(S1_images, model=SVM, na.rm=TRUE)
and here comes my error:
> LC<-predict(S1_images, model=SVM, na.rm=TRUE)
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
Following the example of R Bloggers, I converted my raster stack into a data frame and renamed the columns accordingly:
S1_images_df <- data.frame(getValues(S1_images))
names(S1_images_df) <- c("X20180415_VH", "X20180415_VV")
When trying to run the classification again:
LC<-predict(SVM, S1_images_df)
> LC<-predict(SVM, S1_images_df)
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
Some extra information on my data:
> str(training_S)
'data.frame': 33239 obs. of 4 variables:
$ ID : num 1 1 1 1 1 1 1 1 1 1 ...
$ X20180415_VH: num 0.00646 0.00666 0.00705 0.00635 0.00682 ...
$ X20180415_VV: num 0.0581 0.0627 0.0631 0.0477 0.0525 ...
$ Class : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1
> str(S1_images_df)
'data.frame': 1414000 obs. of 2 variables:
$ X20180415_VH: num 0.005 0.00531 0.00514 0.0048 0.00461 ...
$ X20180415_VV: num 0.0954 0.0947 0.0933 0.0952 0.0951 ...
> dim(training_S)
[1] 33239 4
> dim(S1_images_df)
[1] 1414000 2
I have been checking these two older posts, but I am not sure how to implement their solution in my case:
Here and Here
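It looks like you're including ID as a covariate when training the model: the model is fitted on three columns (ID plus the two bands), while S1_images_df has only two, which is why the subscript is too long. If ID is meaningful and you want to include it in the model, you need to add a corresponding ID field to S1_images_df. More likely, you should exclude it when passing your training data to svm: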
SVM<-svm(x=training_S[, -c(1, ncol(training_S))], y=training_S$Class, gamma = 0.1, cost = 10)
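A quick way to catch this kind of mismatch before calling predict is to compare the predictor names with the columns of the new data. The sketch below uses small made-up tables that only borrow the column names from the question; it needs nothing beyond base R and does not fit any model.

```r
# Hypothetical stand-ins for the training table and the prediction table
training_S <- data.frame(ID = c(1, 1, 2),
                         X20180415_VH = c(0.0065, 0.0067, 0.0070),
                         X20180415_VV = c(0.058, 0.063, 0.063),
                         Class = factor(c(1, 1, 2)))
S1_images_df <- data.frame(X20180415_VH = c(0.0050, 0.0053),
                           X20180415_VV = c(0.095, 0.095))

# Predictors as selected in the question: every column except the last
x_question <- training_S[, 1:(length(training_S) - 1)]
# Predictors after dropping both ID (first) and Class (last)
x_fixed <- training_S[, -c(1, ncol(training_S))]

# Columns the model would expect but the new data does not provide
setdiff(names(x_question), names(S1_images_df))  # "ID"
setdiff(names(x_fixed), names(S1_images_df))     # character(0)
```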
Hej hej,
I would like to calculate growth rates and store them in a new column of my data frame, e.g. named growth.per.day. I am - as always - looking for a way that doesn't involve hundreds and hundreds of lines of manually edited code.
I have six levels of algae and 25 levels of nutrients.
This means I have 150 "subgroups" for which I want to calculate the rates. Those subsets differ in length depending on the individual algae.
So, basically:
Algae A ->
Nutrient (1) -> C.mikro.gr.L (Day 2) - C.mikro.gr.L (Day 1),C.mikro.gr.L (Day 3) - C.mikro.gr.L (Day 2) ... ;
Nutrient (2) -> C.mikro.gr.L (Day 2) - C.mikro.gr.L (Day 1),C.mikro.gr.L (Day 3) - C.mikro.gr.L (Day 2) ... etc.
I already split the data frame by algae
X <- split(data, data$ALGAE)
names(X) <- c("ANKI", "CHLAMY", "MIX_A", "MIX_B", "SCENE", "STAURA")
list2env(X, envir = .GlobalEnv)
and I have also split those again, creating the aforementioned lovely 150 subsets. Then I applied
ratio1$growth.per.day <- c(NA,ratio1[2:nrow(ratio1), 16] - ratio1[1:(nrow(ratio1)-1), 16])
which is perfect and does what I want, BUT I would really appreciate a shorter, more elegant way that doesn't butcher my data frame.
'data.frame': 3550 obs. of 16 variables:
$ SAMPLE.ID : Factor w/ 150 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ COMMUNITY : chr "com.1" "com.1" "com.1" "com.1" ...
$ NUTRIENT : Factor w/ 25 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ RATIO : Factor w/ 23 levels "3.2","4","5.4",..: 11 9 6 4 1 14 10 8 5 2 ...
$ PHOS : Factor w/ 5 levels "0.09","0.195",..: 5 5 5 5 5 4 4 4 4 4 ...
$ NIT : Factor w/ 5 levels "1.5482","3.0964",..: 5 4 3 2 1 5 4 3 2 1 ...
$ DATUM : Factor w/ 35 levels "30.08.16","31.08.16",..: 1 1 1 1 1 1 1 1 1 1 ...
$ DAY : int 0 0 0 0 0 0 0 0 0 0 ...
$ TYPE : chr "mono" "mono" "mono" "mono" ...
$ ALGAE : Factor w/ 6 levels "ANK","CHLA","MIX A",..: 5 5 5 5 5 5 5 5 5 5 ...
$ MEAN : num 864 868 882 873 872 ...
$ GROW : num 0.00116 0.00115 0.00113 0.00115 0.00115 ...
$ FLUORO : num NA NA NA NA NA NA NA NA NA NA ...
$ MEAN.MQ : num 0.964 0.969 0.985 0.975 0.973 ...
$ GROW.MQ : num 1.04 1.03 1.02 1.03 1.03 ...
$ C.mikro.gr.L: num -764 -913 -1394 -1085 -1039 ...
I hope this sufficiently describes the problem,
Thanks so much!
I hope this is what you asked for:
df = data.frame(algae = sort(rep(LETTERS[1:6], 20)),
nutrient = rep(letters[22:26], 24),
day = rep(c(rep(1, 5),
rep(2, 5),
rep(3, 5),
rep(4, 5)), 6),
growth = runif(120, 30, 60))
library(dplyr)
df = df %>% group_by(algae, nutrient) %>% mutate(rate = c(NA, diff(growth, lag = 1)))
And here is the table for alga A and nutrient v:
algae nutrient day growth rate
<fctr> <fctr> <dbl> <dbl> <dbl>
1 A v 1 48.68547 NA
2 A v 2 55.63570 6.950232
3 A v 3 53.28569 -2.350013
4 A v 4 44.83022 -8.455465
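If adding a dependency is not wanted, the same grouped difference can be sketched in base R with ave(). This is an alternative to the dplyr version, not part of the original answer; the data below are hypothetical, and the sketch assumes the rows are sorted by day within each algae/nutrient group.

```r
set.seed(1)  # only to make the hypothetical data reproducible
df <- data.frame(algae    = rep(c("A", "B"), each = 6),
                 nutrient = rep(rep(c("v", "w"), each = 3), 2),
                 day      = rep(1:3, 4),
                 growth   = runif(12, 30, 60))

# Sort so that days are consecutive within each algae/nutrient group
df <- df[order(df$algae, df$nutrient, df$day), ]

# ave() applies FUN within each group and returns a vector in row order
df$rate <- ave(df$growth, df$algae, df$nutrient,
               FUN = function(g) c(NA, diff(g)))

df[df$algae == "A" & df$nutrient == "v", ]
```

The first day of every group gets NA because there is no previous value to subtract.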
I would like to plot a 3D surface graph like the one in the figure:
My attempt with the plotly package is below:
library(plotly)
packageVersion("plotly")
# [1] ‘4.5.2’
# random data
a <- 0; s <- c(1:16)
x <- seq(a-3*max(s), a+3*max(s), len=10)
f <- sapply(s, function(ss) dnorm(x, a, ss))
df0=data.frame(x=rep(x,length(s)),
y=rep(s,each=length(x)),
z=f,
col=rep(seq(1,31,2),each=length(x)))
df0 %>% group_by(y) %>%
plot_ly(x = ~x, y = ~y, z = ~f, type = 'scatter3d', mode = 'lines',
line = list(width = 6,color = ~col,colorscale = 'Viridis'))
I get the error message:
Error in function_list[[i]](value) : could not find function "group_by"
The group argument is deprecated, and I have not had success with group_by.
Question: how do I rewrite the grouping with group_by?
There is a problem in the construction of the dataset 'df0'. If we look at the structure
str(df0)
#'data.frame': 160 obs. of 19 variables:
# $ x : num -48 -37.33 -26.67 -16 -5.33 ...
# $ y : int 1 1 1 1 1 1 1 1 1 1 ...
# $ z.1 : num 0.00 8.83e-304 1.53e-155 1.03e-56 2.66e-07 ...
# $ z.2 : num 1.67e-126 4.33e-77 4.97e-40 2.53e-15 5.70e-03 ...
# $ z.3 : num 3.42e-57 3.13e-35 9.26e-19 8.85e-08 2.74e-02 ...
# $ z.4 : num 5.37e-33 1.21e-20 2.23e-11 3.35e-05 4.10e-02 ...
# $ z.5 : num 7.76e-22 6.25e-14 5.31e-08 4.77e-04 4.52e-02 ...
# $ z.6 : num 8.42e-16 2.60e-10 3.42e-06 1.90e-03 4.48e-02 ...
# $ z.7 : num 3.51e-12 3.79e-08 4.02e-05 4.18e-03 4.26e-02 ...
# $ z.8 : num 7.59e-10 9.31e-07 1.93e-04 6.75e-03 3.99e-02 ...
# $ z.9 : num 2.95e-08 8.13e-06 5.50e-04 9.13e-03 3.72e-02 ...
# $ z.10: num 3.96e-07 3.75e-05 1.14e-03 1.11e-02 3.46e-02 ...
# $ z.11: num 2.66e-06 1.14e-04 1.92e-03 1.26e-02 3.22e-02 ...
# $ z.12: num 1.12e-05 2.63e-04 2.81e-03 1.37e-02 3.01e-02 ...
# $ z.13: num 3.36e-05 4.97e-04 3.74e-03 1.44e-02 2.82e-02 ...
# $ z.14: num 7.98e-05 8.14e-04 4.64e-03 1.48e-02 2.65e-02 ...
# $ z.15: num 0.000159 0.001201 0.005477 0.015058 0.024967 ...
# $ z.16: num 0.000277 0.001639 0.006217 0.015123 0.023586 ...
# $ col : num 1 1 1 1 1 1 1 1 1 1 ...
it will be evident. sapply returns 'f' as a matrix, so it has to be converted to a vector to create the single 'z' column:
df0 <- data.frame(x=rep(x,length(s)),
y=rep(s,each=length(x)),
z=c(f), ######
col=rep(seq(1,31,2),each=length(x)))
str(df0)
#'data.frame': 160 obs. of 4 variables:
#$ x : num -48 -37.33 -26.67 -16 -5.33 ...
#$ y : int 1 1 1 1 1 1 1 1 1 1 ...
#$ z : num 0.00 8.83e-304 1.53e-155 1.03e-56 2.66e-07 ...
#$ col: num 1 1 1 1 1 1 1 1 1 1 ...
The other error mentioned concerns group_by. If we load
library(dplyr)
that error message goes away as well.
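The difference between passing the matrix and passing the flattened vector is easy to see on a smaller case. This is a minimal sketch with a hypothetical 4-point grid and three standard deviations instead of the values used in the question.

```r
x <- c(-1, 0, 1, 2)   # hypothetical grid
s <- 1:3              # hypothetical standard deviations
f <- sapply(s, function(ss) dnorm(x, 0, ss))   # 4 x 3 matrix

# Passing the matrix directly creates one column per matrix column
wide <- data.frame(x = x, z = f)
names(wide)

# Flattening with c() stacks the matrix columns one after another,
# which lines up with rep(x, ...) and rep(s, each = ...)
long <- data.frame(x = rep(x, length(s)),
                   y = rep(s, each = length(x)),
                   z = c(f))
dim(long)
```

Because c() reads a matrix column by column, row 5 of `long` holds the density at the first grid point for the second standard deviation.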
When working on a hierarchical/multilevel/panel dataset, it may be very useful to adopt a package which returns the within- and between-group standard deviations of the available variables.
In Stata, with the data below, this can easily be done with the command
xtsum, i(momid)
I searched, but I cannot find any R package that does this.
edit:
Just to fix ideas, an example of hierarchical dataset could be this:
son_id mom_id hispanic mom_smoke son_birthweigth
1 1 1 1 3950
2 1 1 0 3890
3 1 1 0 3990
1 2 0 1 4200
2 2 0 1 4120
1 3 0 0 2975
2 3 0 1 2980
The "multilevel" structure is given by the fact that each mother (higher level) has two or more sons (lower level). Hence, each mother defines a group of observations.
Accordingly, each variable in the dataset can vary either both between and within mothers, or only between mothers. son_birthweigth varies between mothers, but also within the same mother. In contrast, hispanic is fixed for a given mother.
For example, the within-mother variance of son_birthweigth is:
# per-mother means
bwt_mean1 <- (3950+3890+3990)/3
bwt_mean2 <- (4200+4120)/2
bwt_mean3 <- (2975+2980)/2
# Within-mother variance for birthweigth
((3950-bwt_mean1)^2 + (3890-bwt_mean1)^2 + (3990-bwt_mean1)^2 +
(4200-bwt_mean2)^2 + (4120-bwt_mean2)^2 +
(2975-bwt_mean3)^2 + (2980-bwt_mean3)^2)/(7-1)
While the between-mother variance is:
# overall mean of birthweigth:
# mean <- sum(data$son_birthweigth)/length(data$son_birthweigth)
mean <- (3950+3890+3990+4200+4120+2975+2980)/7
# between variance:
((bwt_mean1-mean)^2 + (bwt_mean2-mean)^2 + (bwt_mean3-mean)^2)/(3-1)
I don't know what your Stata command should reproduce, but to answer the second part of the question, about the hierarchical structure: it is easy to do this with a list.
For example, you define a structure like this:
tree = list(
"var1" = list(
"panel" = list(type ='p',mean = 1,sd=0)
,"cluster" = list(type = 'c',value = c(5,8,10)))
,"var2" = list(
"panel" = list(type ='p',mean = 2,sd=0.5)
,"cluster" = list(type="c",value =c(1,2)))
)
To create this, lapply is convenient for working with lists:
tree <- lapply(list('var1','var2'),function(x){
ll <- list(panel= list(type ='p',mean = rnorm(1),sd=0), ## I use symbol here not name
cluster= list(type = 'c',value = rnorm(3))) ## R prefer symbols
})
names(tree) <-c('var1','var2')
You can view the structure with str:
str(tree)
List of 2
$ var1:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num 0.284
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] 0.0722 -0.9413 0.6649
$ var2:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num -0.144
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] -0.595 -1.795 -0.439
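Once such a nested list exists, single fields can be read by name and the same field can be collected across all variables. The sketch below rebuilds a list with the same shape as the hand-written `tree` above, using fixed values so the results are predictable.

```r
# Nested list with the same shape as the hand-written `tree` above
tree <- list(
  var1 = list(panel   = list(type = "p", mean = 1, sd = 0),
              cluster = list(type = "c", value = c(5, 8, 10))),
  var2 = list(panel   = list(type = "p", mean = 2, sd = 0.5),
              cluster = list(type = "c", value = c(1, 2)))
)

# One field, addressed by name at each level
tree$var2$panel$sd

# The same field for every variable, collected into a named vector
panel_means <- sapply(tree, function(v) v$panel$mean)
panel_means

# Number of cluster values per variable
n_cluster <- lengths(lapply(tree, function(v) v$cluster$value))
n_cluster
```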
Edit after OP clarification
I think that the reshape2 package is what you want. I will demonstrate it here.
The idea is that, in order to do the multilevel analysis, we need to reshape the data.
First, divide the variables into two groups: identifier variables and measured variables.
library(reshape2)
dat.m <- melt(dat,id.vars=c('son_id','mom_id')) ## other columns are measured
str(dat.m)
'data.frame': 21 obs. of 4 variables:
$ son_id : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 1 2 1 2 3 ...
$ mom_id : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 3 3 1 1 1 ...
$ variable: Factor w/ 3 levels "hispanic","mom_smoke",..: 1 1 1 1 1 1 1 2 2 2 ...
$ value : num 1 1 1 0 0 0 0 1 0 0 ..
Once you have the data in "molten" form, you can "cast" it to rearrange it into the shape that you want:
# per-mother means for all variables
acast(dat.m,variable~mom_id,mean)
1 2 3
hispanic 1.0000000 0 0.0
mom_smoke 0.3333333 1 0.5
son_birthweigth 3943.3333333 4160 2977.5
# Within-mother sum of squared deviations, per mother, for all variables
acast(dat.m,variable~mom_id,function(x) sum((x-mean(x))^2))
1 2 3
hispanic 0.0000000 0 0.0
mom_smoke 0.6666667 0 0.5
son_birthweigth 5066.6666667 3200 12.5
## overall mean of each variable
acast(dat.m,variable~.,mean)
[,1]
hispanic 0.4285714
mom_smoke 0.5714286
son_birthweigth 3729.2857143
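For comparison, the two quantities written out by hand in the question can also be sketched in base R with ave() and tapply(). This follows the question's own formulas (pooled within-mother deviations divided by N - 1, and mother means compared with the overall mean); it is not claimed to reproduce the exact output of Stata's xtsum.

```r
# The example data from the question
dat <- data.frame(
  mom_id          = c(1, 1, 1, 2, 2, 3, 3),
  son_birthweigth = c(3950, 3890, 3990, 4200, 4120, 2975, 2980)
)

x <- dat$son_birthweigth
g <- dat$mom_id

# Within: deviations of each son from his own mother's mean
# (ave() returns the group mean repeated for every row of the group)
within_var <- sum((x - ave(x, g))^2) / (length(x) - 1)

# Between: deviations of each mother's mean from the overall mean
mom_means   <- tapply(x, g, mean)
between_var <- sum((mom_means - mean(x))^2) / (length(mom_means) - 1)

c(within = within_var, between = between_var)
```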
I know this question is four years old, but recently I wanted to do the same in R and came up with the following function. It depends on dplyr and tibble. Here, df is the data frame, columns is a numeric vector of the column indices to summarize, and individuals is the name of the column that identifies the individuals.
xtsumR<-function(df,columns,individuals){
df<-dplyr::arrange_(df,individuals)
panel<-tibble::tibble()
for (i in columns){
v<-df %>% dplyr::group_by_() %>%
dplyr::summarize_(
mean=mean(df[[i]]),
sd=sd(df[[i]]),
min=min(df[[i]]),
max=max(df[[i]])
)
v<-tibble::add_column(v,variacao="overall",.before=-1)
v2<-aggregate(df[[i]],list(df[[individuals]]),"mean")[[2]]
sdB<-sd(v2)
varW<-df[[i]]-rep(v2,each=12) # assumes a balanced panel with exactly 12 observations per individual
varW<-varW+mean(df[[i]])
sdW<-sd(varW)
minB<-min(v2)
maxB<-max(v2)
minW<-min(varW)
maxW<-max(varW)
v<-rbind(v,c("between",NA,sdB,minB,maxB),c("within",NA,sdW,minW,maxW))
panel<-rbind(panel,v)
}
var<-rep(names(df)[columns])
n1<-rep(NA,length(columns))
n2<-rep(NA,length(columns))
var<-c(rbind(var,n1,n1))
panel$var<-var
panel<-panel[c(6,1:5)]
names(panel)<-c("variable","variation","mean","standard.deviation","min","max")
panel[3:6]<-as.numeric(unlist(panel[3:6]))
panel[3:6]<-round(unlist(panel[3:6]),2)
return(panel)
}