How to generate a map for property clusters in R
Could you help me make a graph in R similar to the one in the image below, which shows the properties on a map, differentiated by cluster? In my database I have 4 properties: properties 1 and 3 belong to cluster 1, and properties 2 and 4 belong to cluster 2. The database also has the coordinates of the properties, so I believe this information is enough to generate a graph like the one I inserted. Surely there must be some package in R that does something similar. Any help is welcome!
This link can help: https://rstudio-pubs-static.s3.amazonaws.com/176768_ec7fb4801e3a4772886d61e65885fbdd.html
#database
df<-structure(list(Properties = c(1,2,3,4),
Latitude = c(-24.930473, -24.95575,-24.924161,-24.95579),
Longitude = c(-49.994889, -49.990162,-50.004343, -50.007371),
cluster = c(1,2,1,2)), class = "data.frame", row.names = c(NA, -4L))
Properties Latitude Longitude cluster
1 1 -24.93047 -49.99489 1
2 2 -24.95575 -49.99016 2
3 3 -24.92416 -50.00434 1
4 4 -24.95579 -50.00737 2
Example of figure:
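For a first pass without a base map, a minimal sketch, assuming only ggplot2 (just the four points coloured by cluster):
library(ggplot2)
ggplot(df, aes(x = Longitude, y = Latitude, colour = factor(cluster))) +
  geom_point(size = 3) +
  labs(colour = "cluster")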
Update: applying your code to my full dataset:
#database
df<-structure(list(Propertie = c(1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
30, 31, 32, 33, 34, 35, 38, 39, 40, 42, 43, 44, 45, 46, 47, 48,
49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 61, 62, 64, 65, 66,
67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82,
83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,
99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124,
125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137,
138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150,
151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163,
164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176,
177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189,
190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202,
203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215,
216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228,
229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241,
242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254,
255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267,
268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280,
281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293,
294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306,
307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319,
320, 321, 322, 323, 324, 325), Latitude = c(-24.927417, -24.927417,
-24.927417, -24.927417, -24.930195, -24.930473, -24.946306, -24.949361,
-24.949361, -24.950195, -24.950195, -24.951584, -24.95575, -24.954084,
-24.96075, -24.957139, -24.95825, -24.96825, -24.961334, -24.968806,
-24.976861, -24.982139, -24.986584, -24.985487, -24.994362, -24.994362,
-24.999084, -24.771583, -24.77186, -24.772138, -24.772138, -24.78686,
-24.78436, -24.872139, -24.822222, -24.83549, -24.874916, -24.874916,
-24.874639, -24.865472, -24.873838, -24.87325, -24.858611, -24.874361,
-24.874361, -24.86, -24.860472, -24.874916, -24.814638, -24.814666,
-24.818527, -24.818527, -24.822694, -24.822694, -24.845472, -24.844638,
-24.878528, -24.879639, -24.879639, -24.906028, -24.897972, -24.900278,
-24.900278, -24.90075, -24.902972, -24.899361, -24.898611, -24.899083,
-24.913889, -24.908333, -24.914361, -24.914361, -24.924361, -24.915472,
-24.91075, -24.913805, -24.913528, -24.912139, -24.919917, -24.914083,
-24.914361, -24.914361, -24.925194, -24.92575, -24.928528, -24.929361,
-24.934361, -24.935278, -24.922694, -24.927139, -24.927972, -24.931861,
-24.936861, -24.878537, -24.887972, -24.882972, -24.901583, -24.901667,
-24.902139, -24.902139, -24.90325, -24.902972, -24.90299, -24.90575,
-24.905791, -24.899639, -24.899083, -24.875472, -24.878805, -24.883805,
-24.884916, -24.8905, -24.884083, -24.884087, -24.905194, -24.904125,
-24.894722, -24.895222, -24.895194, -24.911028, -24.907972, -24.908805,
-24.919916, -24.919361, -24.919639, -24.919639, -24.920194, -24.920472,
-24.917972, -24.908805, -24.911305, -24.91325, -24.917416, -24.928528,
-24.929083, -24.92325, -24.923805, -24.93188, -24.932139, -24.936028,
-24.935472, -24.937139, -24.923805, -24.922139, -24.922139, -24.926861,
-24.908805, -24.908333, -24.908805, -24.913805, -24.913805, -24.929638,
-24.939917, -24.943806, -24.942695, -24.94325, -24.944639, -24.946028,
-24.94825, -24.954084, -24.956111, -24.958611, -24.958806, -24.959084,
-24.958528, -24.958528, -24.956584, -24.955833, -24.95825, -24.960833,
-24.967417, -24.962695, -24.958611, -24.959083, -24.96075, -24.96075,
-24.964361, -24.961306, -24.961028, -24.962417, -24.965833, -24.964639,
-24.963806, -24.964917, -24.965472, -24.966861, -24.968528, -24.942972,
-24.948611, -24.950556, -24.951028, -24.951028, -24.93825, -24.941889,
-24.943528, -24.944639, -24.945194, -24.945472, -24.949083, -24.946861,
-24.94825, -24.949361, -24.951306, -24.948805, -24.948, -24.95075,
-24.952694, -24.959722, -24.961583, -24.96325, -24.96325, -24.96325,
-24.964639, -24.96575, -24.959361, -24.954639, -24.960472, -24.960472,
-24.966583, -24.970195, -24.972417, -24.976306, -24.974084, -24.974167,
-24.974639, -24.979362, -24.979639, -24.980278, -24.982973, -24.982973,
-24.977417, -24.979639, -24.981028, -24.981028, -24.98325, -24.969361,
-24.988056, -24.987139, -24.987139, -24.986584, -24.984639, -24.984639,
-24.984917, -24.984917, -24.994917, -24.987139, -24.989917, -24.992139,
-24.991861, -24.991861, -24.989639, -24.989917, -24.989917, -24.991861,
-24.989639, -24.992417, -24.975195, -24.97325, -24.979361, -24.972694,
-24.972972, -24.942417, -24.941861, -24.93825, -24.938273, -24.949639,
-24.948333, -24.948805, -24.949639, -24.949639, -24.951615, -24.951583,
-24.951615, -24.953611, -24.954639, -24.954639, -24.954639, -24.956861,
-24.956861, -24.966028, -24.956861, -24.955556, -24.957176, -24.96075,
-24.960194, -24.960231, -24.980194, -24.969106, -24.986306, -24.986306,
-24.993806, -24.877972, -24.878889, -24.87686, -24.886305, -24.875749,
-24.876305, -24.876319, -24.878805, -24.891027, -24.898527, -24.898527,
-24.904083, -24.904083, -24.905, -24.901328, -24.902138, -24.898268,
-24.900782, -24.901305, -24.88493, -24.887138, -24.929638, -25.001862,
-25.004084, -25.011028, -25.000194, -25.000472), Longitude = c(-49.98793,
-49.98793, -49.98793, -49.988778, -49.98962, -49.994889, -49.999912,
-49.991273, -49.991273, -49.996551, -49.996551, -49.995704, -49.990162,
-49.992945, -49.990718, -49.999056, -49.998222, -49.981259, -49.997389,
-49.979357, -49.999908, -49.995713, -49.980449, -49.995736, -49.980444,
-49.980444, -49.986852, -50.200149, -50.200172, -50.199602, -50.199603,
-50.199339, -50.209899, -50.038787, -50.243338, -50.235446, -50.139343,
-50.139348, -50.154871, -50.164607, -50.179621, -50.179895, -50.226412,
-50.196297, -50.196297, -50.233639, -50.234066, -50.242649, -50.251816,
-50.252098, -50.258233, -50.258233, -50.288502, -50.288525, -50.251001,
-50.261575, -50.039037, -50.044333, -50.044333, -50.015148, -50.115163,
-50.094472, -50.094472, -50.094899, -50.108204, -50.111829, -50.113653,
-50.114079, -50.010278, -50.017523, -50.010704, -50.010704, -50.004343,
-50.087667, -50.106547, -50.103487, -50.116283, -50.117968, -50.101301,
-50.119913, -50.120191, -50.120191, -50.079593, -50.080167, -50.082112,
-50.093519, -50.070172, -50.074194, -50.095459, -50.117959, -50.121024,
-50.094079, -50.102677, -50.129635, -50.140468, -50.143492, -50.166288,
-50.166426, -50.166816, -50.166844, -50.166024, -50.169635, -50.169635,
-50.165154, -50.165154, -50.175427, -50.182686, -50.188496, -50.203515,
-50.208765, -50.208487, -50.220728, -50.24933, -50.24933, -50.190159,
-50.204603, -50.241421, -50.241576, -50.241849, -50.135746, -50.144894,
-50.142117, -50.14408, -50.146839, -50.148223, -50.148223, -50.143802,
-50.144066, -50.151269, -50.163802, -50.159357, -50.160168, -50.159066,
-50.138232, -50.137107, -50.151288, -50.151001, -50.137376, -50.139061,
-50.132691, -50.132968, -50.152399, -50.170709, -50.176566, -50.176566,
-50.173237, -50.195182, -50.196949, -50.197376, -50.209608, -50.209608,
-50.239872, -50.007371, -50.006579, -50.007931, -50.008523, -50.01044,
-50.013787, -50.014607, -50.014037, -50.013056, -50.004181, -50.006569,
-50.004607, -50.008482, -50.008482, -50.026278, -50.030861, -50.018523,
-50.019444, -50.014903, -50.020181, -50.045875, -50.046301, -50.057121,
-50.057121, -50.036278, -50.040176, -50.043227, -50.044894, -50.036125,
-50.050158, -50.055186, -50.04876, -50.053213, -50.062385, -50.061561,
-50.085727, -50.093361, -50.083352, -50.083227, -50.083228, -50.10488,
-50.10351, -50.108783, -50.121816, -50.121279, -50.098487, -50.093788,
-50.104315, -50.10238, -50.107121, -50.108482, -50.111024, -50.124043,
-50.115723, -50.124343, -50.083375, -50.074315, -50.073515, -50.073514,
-50.073769, -50.070459, -50.072959, -50.106561, -50.116857, -50.113797,
-50.113797, -50.103802, -50.007107, -50.001815, -50.005185, -50.022371,
-50.021685, -50.022111, -50.004597, -50.006269, -50.007778, -50.001843,
-50.001843, -50.01906, -50.020185, -50.020185, -50.020426, -50.021843,
-50.06044, -50.00362, -50.00519, -50.00519, -50.007102, -50.024079,
-50.024079, -50.023778, -50.023778, -50.010732, -50.037686, -50.032936,
-50.03657, -50.038204, -50.038223, -50.041283, -50.042375, -50.044885,
-50.043227, -50.05851, -50.03988, -50.062653, -50.087385, -50.077112,
-50.110996, -50.119061, -50.126279, -50.132691, -50.149052, -50.149052,
-50.137371, -50.141431, -50.141858, -50.170992, -50.170992, -50.176288,
-50.176844, -50.176844, -50.14225, -50.142404, -50.142404, -50.142408,
-50.155432, -50.155432, -50.14852, -50.159344, -50.160579, -50.157409,
-50.158209, -50.170436, -50.170436, -50.132121, -50.165154, -50.144052,
-50.144052, -50.13408, -50.263247, -50.264755, -50.26821, -50.257386,
-50.28265, -50.2924, -50.2924, -50.303516, -50.264891, -50.251543,
-50.251543, -50.261302, -50.261539, -50.264755, -50.270455, -50.270747,
-50.294067, -50.290159, -50.290432, -50.315715, -50.320456, -50.251849,
-49.989338, -49.986551, -49.976296, -50.127404, -50.127654),
cluster = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1,
4, 4, 5, 5, 5, 5, 5, 5, 4, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 1, 1, 1, 1,
1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 4, 4, 5, 5, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 2, 5, 5, 5, 5, 5, 5,
5, 5, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 2, 5, 5)), row.names = c(NA,
-318L), class = c("tbl_df", "tbl", "data.frame"))
library(spatstat) # for convexhull.xy()
library(sf)
library(tmap)
w1<-convexhull.xy(df$Longitude[df$cluster==1], df$Latitude[df$cluster==1])
w2<-convexhull.xy(df$Longitude[df$cluster==2], df$Latitude[df$cluster==2])
w3<-convexhull.xy(df$Longitude[df$cluster==3], df$Latitude[df$cluster==3])
w4<-convexhull.xy(df$Longitude[df$cluster==4], df$Latitude[df$cluster==4])
w5<-convexhull.xy(df$Longitude[df$cluster==5], df$Latitude[df$cluster==5])
p1<-st_as_sf(w1, crs=4269)
p2<-st_as_sf(w2, crs=4269)
p3<-st_as_sf(w3, crs=4269)
p4<-st_as_sf(w4, crs=4269)
p5<-st_as_sf(w5, crs=4269)
poly<-rbind(p1,p2,p3,p4,p5)
poly[,"cluster"]<-c(1,2,3,4,5)
pts<-st_as_sf(df, coords=c("Longitude", "Latitude"), crs=4269)
tmap_mode("plot")
tm_shape(poly)+
tm_polygons(col="cluster", palette=c("darkolivegreen","skyblue","orchid","yellow","pink"), style="cat", title="cluster")+
tm_shape(pts)+
tm_dots(size=2)+
tm_layout(legend.outside = TRUE)
One approach would be to use a Voronoi partition. ggvoronoi will do this for you with ggplot2, and you could easily overlay it on a ggmap base map.
There is also an st_voronoi function in the sf package which will create a Voronoi partition geometry from a MULTIPOINT shape (see the update below).
Here is a simple example using your data. I have removed duplicated points (i.e. the same point appearing in different clusters), which break the Voronoi algorithm!
library(tidyverse) # specifically ggplot2 and dplyr for the pipe
library(ggvoronoi)
df %>%
  distinct(Longitude, Latitude, .keep_all = TRUE) %>%
  ggplot(aes(x = Longitude, y = Latitude, fill = factor(cluster), label = cluster)) +
  geom_voronoi() +
  geom_text()
Update:
To do this with st_voronoi you can do the following (unlike ggvoronoi, st_voronoi works without having to weed out the duplicates)...
pts_vor <- pts %>%
  st_union() %>%          # merge points into a MULTIPOINT
  st_voronoi() %>%        # calculate voronoi polygons
  st_cast() %>%           # next three lines return it to a usable list of polygons
  data.frame(geometry = .) %>%
  st_sf(.) %>%
  st_join(., pts)         # merge with clusters and other variables
pts_vor %>% ggplot(aes(fill = factor(cluster))) +
  geom_sf(colour = NA) +                           # draw voronoi tiles (no borders)
  geom_sf_text(data = pts, aes(label = cluster)) + # label each point with its cluster
  coord_sf(xlim = c(-50.35, -49.95), ylim = c(-25.05, -24.75))
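If you would rather not hard-code the coord_sf limits, the unbounded tiles can be clipped instead. A small sketch of my own (not part of the answer above), reusing pts_vor and pts:
bbox_poly <- st_as_sfc(st_bbox(pts))               # bounding box of the points as a polygon
vor_clipped <- st_intersection(pts_vor, bbox_poly) # trim the voronoi tiles to that box
vor_clipped %>% ggplot(aes(fill = factor(cluster))) + geom_sf(colour = NA)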
Antonio, I think this might be the solution you are after, but it requires at least three points per cluster to work, which from your figure I am assuming you have in your full dataset. The trick is to create convex hulls and convert them into polygons. This can be accomplished using the convexhull.xy() function in the spatstat:: package. These hulls can then be converted into simple features with the sf:: package and drawn with your mapping package of choice; I personally am a fan of the tmap:: package. Here is a reproducible example. Note that I had to add two more points to your example data to make this work (you cannot compute a polygon from only two points).
##Loading Necessary Packages##
library(spatstat)#For convexhull.xy() function
library(tmap)# For drawing the map
library(sf) #To create simple features for mapping
##Loading Example Data##
df<-structure(list(Properties = c(1,2,3,4,5,6),
Latitude = c(-24.930473, -24.95575,-24.924161,-24.95579, -24.94557, -24.93267),
Longitude = c(-49.994889, -49.990162,-50.004343, -50.007371, -50.01542, -50.00702),
cluster = c(1,2,1,2,1,2)), class = "data.frame", row.names = c(NA, -6L))
##Calculating convexhulls for each cluster##
w1<-convexhull.xy(df$Longitude[df$cluster==1], df$Latitude[df$cluster==1])
w2<-convexhull.xy(df$Longitude[df$cluster==2], df$Latitude[df$cluster==2])
##Converting hulls to simple features. Note, I assumed the EPSG 4269 (NAD83) coordinate reference system; use EPSG 4326 if your coordinates are WGS84
p1<-st_as_sf(w1, crs=4269)
p2<-st_as_sf(w2, crs=4269)
#Combining the two simple features together
poly<-rbind(p1,p2)
#Labelling the clusters
poly[,"cluster"]<-c(1,2)
#Creating a point simple feature from your property data in the dataframe
pts<-st_as_sf(df, coords=c("Longitude", "Latitude"), crs=4269)
#Setting the mapping mode to plot. Change this to "view" if you want an interactive map
tmap_mode("plot")
#Drawing the map
tm_shape(poly)+
tm_polygons(col="cluster", palette=c("darkolivegreen", "skyblue"), style="cat", title="cluster")+
tm_shape(pts)+
tm_dots(size=2)+
tm_layout(legend.outside = TRUE)
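As an aside, the hulls can also be built per cluster entirely within sf, without spatstat. A hedged sketch, assuming dplyr is loaded and reusing pts from above (each cluster still needs at least three points):
library(dplyr)
hulls <- pts %>%
  group_by(cluster) %>%
  summarise(geometry = st_union(geometry)) %>% # one MULTIPOINT per cluster
  st_convex_hull()                             # convex hull of each MULTIPOINT
# hulls can then replace poly in the tm_shape() call above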
Related
How to calculate the conditional expectation of a Weibull model?
I would like to calculate the conditional expectation of the Weibull model. Specifically, I would like to estimate the remaining tenure of a client looking at random moments (time = t) in his total tenure. To do so, I have calculated the total tenure for each client (currently active or inactive) and, based on the random moment for each client, calculated his/her tenure at that moment. The example below is a snapshot of my attempt. I use 2 variables, STED and TemporalTenure, to predict the dependent variable tenure, which has either status 0 = active or 1 = inactive. I use the survival package for obtaining the survival object (km_surv).
df = structure(list(ID = c(16008, 21736, 18851, 20387, 30749, 42159), STED = c(2, 5, 1, 3, 2, 2), TemporalTenure = c(84, 98, 255, 392, 108, 278), tenure = c(152, 166, 273, 460, 160, 289), status = c(0, 0, 1, 0, 1, 1)), row.names = c(NA, 6L), class = "data.frame")
km_surv <- Surv(time = df$tenure, event = df$status)
df <- data.frame(y = km_surv, df[,!(names(df) %in% c("tenure","status", "ID"))])
weibull_fit <- psm(y ~ ., dist="weibull", data = df)
quantsurv <- Quantile(weibull_fit, df)
lp <- predict(weibull_fit, df, type="lp")
print(quantsurv(0.5, lp))
The output of these estimations is way too high. I assume this is caused by including TemporalTenure, but I can't find out how psm calculates this, or whether there are other packages where it's possible to estimate the remaining tenure of client i at time t. How can I obtain the predicted tenure conditioned on the time that a client has already been active (a random moment in time: TemporalTenure), where the dependent tenure can be either a client that is still active or one that is inactive?
EDIT: To clarify, whenever I add time-conditional variables such as TemporalTenure, the number of received payments, and the number of complaints until time t, the predicted lifetime explodes in many cases. Therefore, I suspect that psm is not the right way to go. A similar question is asked here, but the solution given doesn't work for the same reasons. Below is a slightly bigger dataset which already causes problems.
df = structure(list(ID= c(16008, 21736, 18851, 20387, 30749, 42159, 34108, 47511, 47917, 61116, 66600, 131380, 112668, 90799, 113615, 147562, 166247, 191603, 169698, 1020841, 1004077, 1026953, 1125673, 1129788, 22457, 1147883, 1163870, 1220268, 2004623, 1233924, 2009026, 2026688, 2031284, 2042982, 2046137, 2043214, 2033631, 2034252, 2068467, 2070284, 2070697, 2084859, 2090567, 2087133, 2087685, 2095100, 2095720, 2100482, 2105150, 2109353, 28852, 29040, 29592, 29191, 31172, 2126369, 2114207, 2111947, 2102678, 237687, 1093221, 2111607, 2031732, 2105275, 2020226, 1146777, 1028487, 1030165, 1098033, 1142093, 1186763, 2005605, 2007182, 2021092, 2027676, 2027525, 2070471, 2070621, 2072706, 2081862, 2085084, 2085353, 2094429, 2096216, 2109774, 2114526, 2115510, 2117329, 2122045, 2119764, 2122522, 2123080, 2128547, 2130005, 30025, 24166, 61529, 94568, 70809, 159214), STED = c(2, 5, 1, 3, 2, 2, 3, 1, 2, 2, 2, 2, 2, 1, 2, 2, 4, 1, 4, 3, 2, 4, 1, 1, 2, 1, 4, 1, 1, 1, 2, 4, 2, 5, 4, 1, 4, 2, 5, 3, 2, 1, 4, 2, 1, 5, 3, 1, 1, 5, 2, 2, 2, 2, 3, 4, 3, 5, 1, 1, 5, 2, 5, 1, 3, 5, 3, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 3, 5, 2, 2, 1, 2, 1, 2, 3, 1, 1, 3, 5, 1, 2, 2, 2, 2, 1, 2, 1, 3, 1), TemporalTenure = c(84, 98, 255, 392, 108, 278, 120, 67, 209, 95, 224, 198, 204, 216, 204, 190, 36, 160, 184, 95, 140, 256, 142, 216, 56, 79, 194, 172, 155, 158, 78, 24, 140, 87, 134, 111, 15, 126, 41, 116, 66, 60, 0, 118, 22, 116, 110, 52, 66, 0, 325, 323, 53, 191, 60, 7, 45, 73, 42, 161, 30, 17, 30, 12, 87, 85, 251, 120, 7, 6, 38, 119, 156, 54, 11, 141, 50, 25, 33, 3, 48, 58, 13, 113, 25, 18, 23, 2, 102, 5, 90, 0, 101, 83, 44, 125, 226, 213, 216, 186), tenure = c(152, 166, 273, 460, 160, 289, 188, 72, 233, 163, 266, 266, 216, 232, 247, 258, 65, 228, 252, 99, 208, 324, 201, 284, 124, 84, 262, 180, 223, 226, 146, 92, 208, 155, 202, 179, 80, 185, 64, 184, 120, 65, 6, 186, 45, 120, 170, 96, 123, 12, 393, 391, 64, 259, 73, 42, 69, 141, 47, 229, 37, 19, 37, 17, 155, 99, 319, 188, 75, 11, 49, 187, 180, 55, 52, 209, 115, 93, 88, 6, 53, 126, 31, 123, 26, 26, 24, 9, 114, 6, 111, 4, 168, 84, 112, 193, 294, 278, 284, 210), status = c(0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1), TotalValue = c(2579.35, 2472.85, 581.19, 2579.35, 2472.85, 0, 1829.18, 0, 936.79, 2098.2, 850.47, 2579.35, 463.68, 463.68, 2171.31, 3043.03, 561.16, 3043.03, 3043.03, -68.06, 2098.2, 2504.4, 1536.67, 2719.7, 3043.03, 109.91, 2579.35, 265.57, 3560.34, 2266.95, 3123.16, 3544.4, 1379.19, 2288.35, 2472.85, 2560.48, 1414.45, 3741.49, 202.2, 2856.23, 1457.75, 313.68, 191.32, 2266.95, 661.01, 0, 2050.81, 298.76, 1605.44, 373.86, 3043.03, 2579.35, 448.63, 3043.03, 463.68, 977.28, 818.06, 2620.06, 0, 3235.8, 280.99, 0, 0, 194.04, 3212.75, -23.22, 1833.46, 1829.18, 2786.7, 0, 0, 3250.38, 936.79, 0, 1045.21, 3043.03, 1988.36, 2472.85, 1197.94, 0, 313.68, 3212.75, 1419.33, 531.14, 0, 96.28, 0, 142.92, 174.79, 0, 936.79, 156.19, 2472.85, 463.68, 3520.69, 2579.35, 3328.87, 2567.88, 3043.03, 1081.14)), row.names = c(NA, 100L), class = "data.frame")
So here's what I have done: 1) added a library call to load pkg:rms, 2) removed the attempt to place a Surv object in a dataframe column, 3) built the Surv object inside the formula as Therneau expects formulas to be built, and 4) removed ID from the covariates, where it most probably does not belong.
library(survival); library(rms)
#km_surv <- Surv(time = df$tenure, event = df$status)
#df <- data.frame(y = km_surv, df[,!(names(df) %in% c("tenure","status"))])
weibull_fit <- psm(Surv(time = tenure, event = status) ~ TemporalTenure + STED, dist="weibull", data = df)
quantsurv <- Quantile(weibull_fit, df)
lp <- predict(weibull_fit, df, type="lp")
Results:
print(quantsurv(0.5, lp))
       1        2        3        4        5        6 
151.4129 176.0490 268.4644 466.8266 164.8640 301.2630
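On the conditional part of the question (remaining tenure given survival to TemporalTenure), the Weibull form gives a closed-form conditional quantile. A sketch under the survreg/psm AFT parametrisation (shape a = 1/fit$scale, per-client scale b = exp(lp)); the parametrisation is my assumption, so verify it before relying on the numbers:
# Weibull survival: S(t) = exp(-(t/b)^a). Conditioning on survival to t0,
# the conditional median t_med solves S(t_med)/S(t0) = 0.5, i.e.
# t_med = b * ((t0/b)^a + log(2))^(1/a)
a <- 1 / weibull_fit$scale  # Weibull shape
b <- exp(lp)                # per-client Weibull scale
t0 <- df$TemporalTenure     # time already survived
t_med <- b * ((t0 / b)^a + log(2))^(1 / a)
t_med - t0                  # median remaining tenure at time t0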
auto.arima: transforming the time series, and xreg correlation with lags of the forecast series
I'm trying to forecast with an auto.arima() model like the one below. I was wondering, in general, whether it is necessary to transform a time series so that it resembles a normal distribution before passing it to auto.arima()? Also, does it matter if your xreg=... predictor is correlated with a lag of the time series you're trying to predict, or vice versa?
Code:
tsTrain <- tsTiTo[1:60]
tsTest <- tsTiTo[61:100]
Xreg <- CustCount[1:100] ## Predictor
xregTrain2 <- Xreg[1:60]
xregTest2 <- Xreg[61:100]
Arima.fit2 <- auto.arima(tsTrain, xreg = xregTrain2)
Acast2 <- forecast(Arima.fit2, h = 40, xreg = xregTest2)
Data:
#dput(ds$CustCount[1:100])
CustCount = c(3, 3, 1, 4, 1, 3, 2, 3, 2, 4, 1, 1, 5, 6, 8, 5, 2, 7, 7, 3, 2, 2, 2, 1, 3, 2, 3, 1, 1, 2, 1, 1, 3, 2, 2, 2, 3, 7, 5, 6, 8, 7, 3, 5, 6, 6, 8, 4, 2, 1, 2, 1, NA, NA, 4, 2, 2, 4, 11, 2, 8, 1, 4, 7, 11, 5, 3, 10, 7, 1, 1, NA, 2, NA, NA, 2, NA, NA, 1, 2, 3, 5, 9, 5, 9, 6, 6, 1, 5, 3, 7, 5, 8, 3, 2, 6, 3, 2, 3, 1)
#dput(tsTiTo[1:100])
tsTiTo = c(45, 34, 11, 79, 102, 45, 21, 45, 104, 20, 2, 207, 45, 2, 3, 153, 8, 2, 173, 11, 207, 79, 45, 153, 192, 173, 130, 4, 173, 174, 173, 130, 79, 154, 4, 104, 192, 153, 192, 104, 28, 173, 52, 45, 11, 29, 22, 81, 7, 79, 193, 104, 1, 1, 46, 130, 45, 154, 153, 7, 174, 21, 193, 45, 79, 173, 45, 153, 45, 173, 2, 1, 2, 1, 1, 8, 1, 1, 79, 45, 79, 173, 45, 2, 173, 130, 104, 19, 4, 34, 2, 192, 42, 41, 31, 39, 11, 79, 4, 79)
Short answer: no and no to both questions. See the long answer below.
"I was wondering in general if it was necessary to transform a timeseries so that it resembled a normal distribution before passing it to auto.arima()?"
No. In the case of time series data, it is the innovation errors that you want to be normally distributed, not the time series you are modelling. This is similar to a linear regression model: you don't expect the predictors to be normally distributed; it is the errors that you'd expect to be normally distributed.
"Also does it matter if your xreg=... predictor is correlated with a lag of the timeseries you're trying to predict, or vice versa?"
You'd hope xreg is correlated this way. We are trying to use that information when looking for an appropriate model to forecast.
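To check the innovation errors of a fitted model in practice, a quick sketch (assuming Arima.fit2 from the question and the forecast package):
library(forecast)
checkresiduals(Arima.fit2)           # residual plot, ACF, histogram and a Ljung-Box test
shapiro.test(residuals(Arima.fit2))  # optional formal normality test of the innovations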
Difference Predictors in Auto.Arima Forecast
I'm trying to build an auto.arima forecast with predictors, like the example below. I've noticed that my predictor is non-stationary, so I was wondering if I should difference the predictor before inputting it in the xreg parameter, as I've shown below. The real data set is much larger; this is just an example. Any advice is greatly appreciated.
Code:
tsTrain <- tsTiTo[1:60]
tsTest <- tsTiTo[61:100]
ndiffs(ds$CustCount) ## returns 1
diffedCustCount <- diff(ds$CustCount, differences = 1)
Xreg <- diffedCustCount[1:100] ## Predictor
xregTrain2 <- Xreg[1:60]
xregTest2 <- Xreg[61:100]
Arima.fit2 <- auto.arima(tsTrain, xreg = xregTrain2)
Acast2 <- forecast(Arima.fit2, h = 40, xreg = xregTest2)
Data:
dput(ds$CustCount[1:100])
c(3, 3, 1, 4, 1, 3, 2, 3, 2, 4, 1, 1, 5, 6, 8, 5, 2, 7, 7, 3, 2, 2, 2, 1, 3, 2, 3, 1, 1, 2, 1, 1, 3, 2, 2, 2, 3, 7, 5, 6, 8, 7, 3, 5, 6, 6, 8, 4, 2, 1, 2, 1, NA, NA, 4, 2, 2, 4, 11, 2, 8, 1, 4, 7, 11, 5, 3, 10, 7, 1, 1, NA, 2, NA, NA, 2, NA, NA, 1, 2, 3, 5, 9, 5, 9, 6, 6, 1, 5, 3, 7, 5, 8, 3, 2, 6, 3, 2, 3, 1)
dput(tsTiTo[1:100])
c(45, 34, 11, 79, 102, 45, 21, 45, 104, 20, 2, 207, 45, 2, 3, 153, 8, 2, 173, 11, 207, 79, 45, 153, 192, 173, 130, 4, 173, 174, 173, 130, 79, 154, 4, 104, 192, 153, 192, 104, 28, 173, 52, 45, 11, 29, 22, 81, 7, 79, 193, 104, 1, 1, 46, 130, 45, 154, 153, 7, 174, 21, 193, 45, 79, 173, 45, 153, 45, 173, 2, 1, 2, 1, 1, 8, 1, 1, 79, 45, 79, 173, 45, 2, 173, 130, 104, 19, 4, 34, 2, 192, 42, 41, 31, 39, 11, 79, 4, 79)
The xreg argument in auto.arima performs a dynamic regression, which is to say that you are performing a linear regression and fitting the errors with an ARMA process. While auto.arima() used to require manual differencing for non-stationary data when external regressors are included, this is no longer the case: auto.arima() will take non-stationary data as input and determine the order of differencing using a unit-root test. See this post from Rob Hyndman for further detail.
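To see the automatic differencing at work, a minimal sketch on synthetic data (the series and regressor here are illustrative, not from the question):
library(forecast)
set.seed(1)
x <- cumsum(rnorm(100))                          # a non-stationary regressor (random walk)
y <- ts(2 * x + arima.sim(list(ar = 0.5), 100))  # a non-stationary response driven by x
fit <- auto.arima(y, xreg = x)                   # no manual differencing of x or y
arimaorder(fit)                                  # the d component was chosen via a unit-root test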
Show a specific value on the x-axis in ggplot
I'm creating a ggplot with a geom_vline at a specific location on the x axis, and I would like the x axis to show that specific value. Following are my data and code:
dput(agg_data)
structure(list(latency = structure(c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 24, 26, 28, 29, 32, 36, 37, 40, 43, 46, 47, 48, 49, 54, 64, 71, 72, 75, 87, 88, 89, 93, 134, 151), class = "difftime", units = "days"), count = c(362, 11, 8, 5, 4, 2, 8, 6, 4, 2, 2, 1, 5, 1, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1), cum_sum = c(362, 373, 381, 386, 390, 392, 400, 406, 410, 412, 414, 415, 420, 421, 423, 425, 427, 428, 429, 430, 432, 433, 435, 436, 437, 438, 439, 440, 441, 442, 444, 446, 447, 448, 449, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460), cum_dist = c(0.78695652173913, 0.810869565217391, 0.828260869565217, 0.839130434782609, 0.847826086956522, 0.852173913043478, 0.869565217391304, 0.882608695652174, 0.891304347826087, 0.895652173913044, 0.9, 0.902173913043478, 0.91304347826087, 0.915217391304348, 0.919565217391304, 0.923913043478261, 0.928260869565217, 0.930434782608696, 0.932608695652174, 0.934782608695652, 0.939130434782609, 0.941304347826087, 0.945652173913043, 0.947826086956522, 0.95, 0.952173913043478, 0.954347826086957, 0.956521739130435, 0.958695652173913, 0.960869565217391, 0.965217391304348, 0.969565217391304, 0.971739130434783, 0.973913043478261, 0.976086956521739, 0.980434782608696, 0.982608695652174, 0.984782608695652, 0.98695652173913, 0.989130434782609, 0.991304347826087, 0.993478260869565, 0.995652173913044, 0.997826086956522, 1)), .Names = c("latency", "count", "cum_sum", "cum_dist"), row.names = c(NA, -45L), class = "data.frame")
and code:
library(ggplot2)
library(ggthemes)
russ <- ggplot(data = agg_data, aes(x = as.numeric(latency), y = cum_dist)) + geom_line(size = 2)
russ <- russ + ggtitle("Latency from first click to Qualified Demo:") + xlab("in Days") + ylab("Proportion of maturity") + theme_economist()
russ <- russ + geom_vline(aes(xintercept = 10), color = "black", linetype = "dashed")
russ
which creates the following plot. I want the plot to show the value '10' (the same location as the vline) on the x-axis. I looked at some other similar answers, like Customize axis labels, but that one recreates the x-axis labels (with scale_x_discrete) rather than adding a new number to the current scale, which is more what I'm looking for. Thanks in advance!
In your case the x scale is continuous, so you can use the function scale_x_continuous() and provide breaks at the positions you need.
russ + scale_x_continuous(breaks = c(0, 10, 50, 100, 150))
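If you would rather keep something close to the default labels and only add the vline position, one option (a sketch reusing agg_data and russ from the question; pretty() only approximates ggplot's default breaks):
vline_x <- 10
default_breaks <- pretty(range(as.numeric(agg_data$latency)))
russ + scale_x_continuous(breaks = sort(unique(c(default_breaks, vline_x))))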
Removing columns with similar variance
I have a dataframe of 3500 x 4000. I am trying to write a professional command in R to remove any columns in a matrix that show the same variance. I am able to do this with a long, complicated command such as
datavar <- apply(data, 2, var)
datavar <- datavar[!duplicated(datavar)]
and then assemble my data by matching the remaining column names, but this is SAD! I was hoping to do this in a single go. I was thinking of something like
data <- data[, which(apply(data, 2, function(col) !any(var(data) = any(var(data)) )))]
I know the last part of the above command is nonsense, but I also know there is some way it can be done in some... smart command! Here's some data that applies to the question:
data <- structure(list(V1 = c(3, 213, 1, 135, 5, 2323, 1231, 351, 1, 33, 2, 213, 153, 132, 1321, 53, 1, 231, 351, 3135, 13), V2 = c(1, 1, 1, 2, 3, 5, 13, 33, 53, 132, 135, 153, 213, 213, 231, 351, 351, 1231, 1321, 2323, 3135), V3 = c(65, 41, 1, 53132, 1, 6451, 3241, 561, 321, 534, 31, 135, 1, 1351, 31, 351, 31, 31, 3212, 3132, 1), V4 = c(2, 2, 5, 4654, 5641, 21, 21, 1, 1, 465, 31, 4, 651, 35153, 13, 132, 123, 1231, 321, 321, 5), V5 = c(23, 13, 213, 135, 15341, 564, 564, 8, 464, 8, 484, 6546, 132, 165, 123, 135, 132, 132, 123, 123, 2), V6 = c(2, 1, 84, 86468, 464, 18, 45, 55, 2, 5, 12, 4512, 5, 123, 132465, 12, 456, 15, 45, 123213, 12), V7 = c(1, 2, 2, 5, 5, 12, 12, 12, 15, 18, 45, 45, 55, 84, 123, 456, 464, 4512, 86468, 123213, 132465)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7"), row.names = c(NA, 21L), class = "data.frame")
Would I be able to keep one of the "similar variance" columns too? Thanks,
I might go a more cautious route, like
data[, !duplicated(round(sapply(data, var), your_precision_here))]
This is pretty similar to what you've come up with:
vars <- lapply(data, var)
data[, which(sapply(1:length(vars), function(x) !vars[x] %in% vars[-x]))]
One thing to think about, though, is whether you want to match variances exactly (as in this example) or just variances that are close. The latter would be a significantly more challenging problem.
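For the "close variances" case, a hedged sketch (the function name and tolerance are illustrative): keep a column only if its variance differs from every variance already kept by more than tol, so one representative of each near-duplicate group survives:
keep_distinct_var <- function(df, tol = 1e-6) {
  vars <- sapply(df, var)
  keep <- logical(length(vars))
  kept_vars <- numeric(0)
  for (i in order(vars)) {              # scan columns from smallest variance up
    if (length(kept_vars) == 0 || min(abs(vars[i] - kept_vars)) > tol) {
      keep[i] <- TRUE                   # first column seen in a new variance group
      kept_vars <- c(kept_vars, vars[i])
    }
  }
  df[, keep, drop = FALSE]
}
keep_distinct_var(data, tol = 0.5)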
... or as an alternative:
data[, !c(duplicated(apply(data, 2, var)) | duplicated(apply(data, 2, var), fromLast = TRUE))]
...but also not shorter :)