Support Vector Machine Visualization in R

I am having trouble graphing my SVM model in R. The formula is:
svm_linear <- svm(open ~ review_count + recession + duration + count + stars + Freq + avgRev + avgStar, data=yelp_train, cost=100, gamma=1)
plot(svm_linear, data=yelp_train)
I can't figure out why nothing appears after running the plot function. Please help.
I added the dput output.
I cut out some of the extra columns to keep it short.
newdata <- cleanDataFrame[2:10]
set.seed(10)
(newdata[sample(1:nrow(newdata), 30),])
structure(list(open = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L,
1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L,
1L, 1L, 1L, 0L), review_count = c(3L, 5L, 6L, 38L, 6L, 4L, 5L,
23L, 19L, 3L, 22L, 74L, 15L, 38L, 88L, 26L, 9L, 3L, 58L, 4L,
13L, 117L, 38L, 10L, 5L, 6L, 102L, 108L, 264L, 103L), stars = c(3,
4, 4.5, 4, 3, 3, 3, 4, 3.5, 3.5, 3.5, 4.5, 4.5, 4, 2.5, 3.5,
3.5, 3.5, 4, 3, 4.5, 4.5, 4, 3.5, 4, 3.5, 4, 3, 3.5, 4), Freq = c(166L,
12L, 166L, 15L, 45L, 166L, 66L, 79L, 33L, 58L, 150L, 389L, 150L,
1L, 389L, 20L, 389L, 389L, 389L, 166L, 74L, 0L, 389L, 32L, 389L,
161L, 126L, 389L, 98L, 3L), avgRev = c(23.7904191616766, 18.7692307692308,
23.7904191616766, 98, 78.804347826087, 23.7904191616766, 31.3283582089552,
64.3375, 23.1764705882353, 23.6949152542373, 60.6490066225166,
34.1923076923077, 60.6490066225166, 22, 34.1923076923077, 33.1904761904762,
34.1923076923077, 34.1923076923077, 34.1923076923077, 30.8443113772455,
27.6533333333333, 117, 34.1923076923077, 30.4545454545455, 34.1923076923077,
37.2716049382716, 47.3149606299213, 34.1923076923077, 64.3838383838384,
73.75), avgStar = c(3.53592814371257, 3.92307692307692, 3.53592814371257,
3.96875, 3.6195652173913, 3.53592814371257, 3.69402985074627,
3.58125, 3.5, 3.67796610169492, 3.63245033112583, 3.5551282051282,
3.63245033112583, 4, 3.5551282051282, 3.78571428571429, 3.5551282051282,
3.5551282051282, 3.5551282051282, 3.48203592814371, 3.72666666666667,
4.5, 3.5551282051282, 3.65151515151515, 3.5551282051282, 3.43827160493827,
3.63385826771654, 3.5551282051282, 3.60606060606061, 4.25), count = c(4L,
2L, 5L, 5L, 0L, 2L, 5L, 0L, 2L, 8L, 3L, 15L, 4L, 3L, 15L, 14L,
1L, 1L, 0L, 1L, 2L, 0L, 0L, 50L, 1L, 27L, 4L, 51L, 36L, 14L),
recession = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), duration = c(332L, 427L, 614L, 117L, 1894L,
1346L, 140L, 1909L, 1100L, 1030L, 1666L, 2096L, 1054L, 352L,
2145L, 1018L, 1763L, 391L, 2116L, 1567L, 693L, 674L, 1626L,
301L, 295L, 378L, 649L, 376L, 1028L, 2390L)), .Names = c("open",
"review_count", "stars", "Freq", "avgRev", "avgStar", "count",
"recession", "duration"), row.names = c(1439L, 870L, 1210L, 1962L,
242L, 639L, 777L, 771L, 1741L, 1214L, 1840L, 1603L, 322L, 1681L,
1010L, 1209L, 148L, 745L, 1124L, 2354L, 2433L, 1731L, 2180L,
1000L, 1141L, 1985L, 2814L, 674L, 2163L, 999L), class = "data.frame")

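It looks like you're trying to do classification, but your outcome variable is integer mode. To see this, do str(yelp_train). With a numeric outcome, svm() fits a regression model, and plot.svm() only draws classification models, which is why nothing appears. Turn the outcome into a factor and then try your plot again. For example: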
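library(e1071)
# Convert the 0/1 outcome to a factor so svm() fits a classifier
yelp_train$openF <- factor(yelp_train$open)
svm_linear <- svm(openF ~ review_count + recession + duration + count + stars + Freq + avgRev +
                  avgStar, data=yelp_train, cost=100, gamma=1)
# plot.svm() needs a formula naming the two features to display
plot(svm_linear, formula = review_count ~ Freq, data=yelp_train)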
One other thing. In the portion of the data you provided, recession is always zero. If this is the case with all of the data, then remove recession from your call to svm. I had to do this to avoid an error. Once I removed recession, I was able to run the model and plot several combinations of variables successfully.
Question in comments: Why isn't open the dependent variable in the formula in the plot function?
You're plotting where the decision boundary lies in relation to the values of two of the independent variables (or "features" in machine learning lingo). The predicted value of the dependent variable, open, is given by the fill colors: in this case, one color for open=1 and another for open=0. The boundary between the two colors is the decision boundary that the svm model came up with. The plot also includes points representing the pairs of values of the two features used for the plot. In plot.svm(), the color of each point shows its actual value of open, while the marker shape distinguishes support vectors (x by default) from the other data points (o by default), so you can see how many points your model classified correctly and how many it misclassified.
The full decision boundary is a hyperplane in a multi-dimensional space. For example, if you had 3 features in the model, the features would lie in a 3-dimensional space (imagine a 3D scatterplot) and the decision boundary would be a 2-dimensional hyperplane through that 3D space (which we of course refer to as a "plane" in this case; and in general, the decision boundary has dimension one less than the dimension of the feature space).
When you plot two features, you're looking at a two-dimensional slice through that multi-dimensional space. The plot function holds the other features at fixed values: according to the e1071 documentation for plot.svm, dimensions you do not specify default to 0 for numeric variables and to the first level for factors. The slice argument lets you set those values yourself for every feature besides the two you're plotting. That allows you to see how the decision boundary for two particular features varies with the values of the other features.
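As a rough sketch of the slice idea, the example below fits a classifier on synthetic data and draws the same pair of features twice, holding a third feature at two different values. The data frame toy, its values, and the linear kernel are assumptions made for illustration; only the column names echo the question.

```r
library(e1071)

# Synthetic stand-in for yelp_train (hypothetical values, three numeric features)
set.seed(1)
n <- 200
toy <- data.frame(
  review_count = rpois(n, 30),
  Freq         = rpois(n, 100),
  duration     = runif(n, 100, 2400)
)

# Outcome stored as a factor so svm() does classification, not regression
score <- toy$duration + 10 * toy$review_count + rnorm(n, sd = 300)
toy$openF <- factor(ifelse(score > 1500, 1, 0))

fit <- svm(openF ~ review_count + Freq + duration, data = toy,
           kernel = "linear", cost = 1)

# Same two features, with duration held at two different values
plot(fit, data = toy, formula = review_count ~ Freq, slice = list(duration = 500))
plot(fit, data = toy, formula = review_count ~ Freq, slice = list(duration = 2000))
```

Comparing the two plots shows how the boundary in the review_count/Freq plane shifts as the held-out feature changes.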
You might find the svm chapter of Introduction to Statistical Learning useful for additional info (you can download it at no charge).

Related

ANOVA error: why is each row of output *not* identified by a unique combination of keys?

I have a two-way ANOVA test (w/repeated measures) that I'm using with four almost identical datasets:
> res.aov <- anova_test(
+ data = LST_Weather_dataset_N, dv = LST, wid = Month,
+ within = c(Buffer, TimePeriod),
+ effect.size = "ges",
+ detailed = TRUE,
+ )
Where:
LST = surface temperature deviation in C
Month = 1-12
Buffer = a value 100-1900 - one of 19 areas outward from the boundary of a solar power plant (each 100m wide)
TimePeriod = a factor with a value of 1 or 2 corresponding to pre-/post-construction of a solar power plant.
For one dataset I get the error:
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 38 rows:
* 10, 11
* 217, 218
* 240, 241
* 263, 264
* 286, 287
* 309, 310
* 332, 333
...
As far as I can tell I have unique combinations.
dplyr::count(LST_Weather_dataset_N, LST, Month, Buffer, TimePeriod, sort = TRUE)
returns
LST Month Buffer TimePeriod n
1 -6.309045316 12 100 2 1
2 -5.655279925 9 1000 2 1
3 -5.224196295 12 200 2 1
4 -5.194473224 9 1100 2 1
5 -5.025429891 12 400 2 1
6 -4.987575966 9 700 2 1
7 -4.979453868 12 600 2 1
8 -4.825298768 12 300 2 1
9 -4.668994574 12 500 2 1
10 -4.652282192 12 700 2 1
...
'n' is always 1.
I can't work out why this is happening.
Extract of the data frame below:
> dput(LST_Weather_dataset_N[sample(1:nrow(LST_Weather_dataset_N), 50),])
structure(list(Buffer = c(1400L, 700L, 300L, 1400L, 100L, 200L,
1700L, 100L, 800L, 1900L, 1100L, 100L, 700L, 800L, 1400L, 400L,
1300L, 200L, 1200L, 500L, 1200L, 1300L, 400L, 1000L, 1300L, 1100L,
100L, 300L, 300L, 600L, 1100L, 1400L, 1500L, 1600L, 1700L, 1800L,
1700L, 1300L, 1200L, 300L, 1100L, 1900L, 1700L, 700L, 1400L,
1200L, 1600L, 1700L, 1900L, 1300L), Date = c("02/05/2014", "18/01/2017",
"19/06/2014", "25/12/2013", "15/09/2017", "08/04/2017", "22/08/2014",
"21/07/2014", "13/07/2017", "25/12/2013", "22/10/2013", "02/05/2014",
"07/03/2017", "15/03/2014", "13/07/2017", "19/06/2014", "25/12/2013",
"17/10/2017", "16/04/2014", "06/10/2013", "15/09/2017", "18/01/2017",
"10/01/2014", "17/12/2016", "13/07/2017", "19/06/2014", "07/03/2017",
"15/03/2014", "11/02/2014", "22/10/2013", "06/10/2013", "15/09/2017",
"16/04/2014", "18/01/2017", "15/03/2014", "21/07/2014", "17/10/2017",
"15/09/2017", "10/01/2014", "23/09/2014", "16/04/2014", "22/10/2013",
"11/06/2017", "26/05/2017", "19/06/2014", "14/08/2017", "11/02/2014",
"26/02/2017", "26/02/2017", "11/02/2014"), LST = c(1.255502397,
4.33385966, 3.327025603, -0.388631166, -0.865430798, 4.386292648,
-0.243018665, 3.276865987, 0.957036835, -0.065821795, 0.69731779,
4.846851651, -1.437700684, 1.003808572, 0.572460421, 2.995902374,
-0.334633662, -1.231447567, 0.644520741, 0.808262029, -3.392959991,
2.324569449, 2.346707612, -3.124354627, 0.58719862, 1.904859254,
1.701580958, 2.792443253, 1.638270039, 1.460743317, 0.699767335,
-3.015643366, 0.930527864, 1.309519336, 0.477789664, 0.147584938,
-0.498188865, -3.506795723, -1.007487965, 1.149604087, 1.192366386,
0.197471474, 0.999391224, -0.190613618, 1.27324015, 2.686622796,
0.573109026, 0.97847983, 0.395005095, -0.40855426), Month = c(5L,
1L, 6L, 12L, 9L, 4L, 8L, 7L, 7L, 12L, 10L, 5L, 3L, 3L, 7L, 6L,
12L, 10L, 4L, 10L, 9L, 1L, 1L, 12L, 7L, 6L, 3L, 3L, 2L, 10L,
10L, 9L, 4L, 1L, 3L, 7L, 10L, 9L, 1L, 9L, 4L, 10L, 6L, 5L, 6L,
8L, 2L, 2L, 2L, 2L), Year = c(2014L, 2017L, 2014L, 2013L, 2017L,
2017L, 2014L, 2014L, 2017L, 2013L, 2013L, 2014L, 2017L, 2014L,
2017L, 2014L, 2013L, 2017L, 2014L, 2013L, 2017L, 2017L, 2014L,
2016L, 2017L, 2014L, 2017L, 2014L, 2014L, 2013L, 2013L, 2017L,
2014L, 2017L, 2014L, 2014L, 2017L, 2017L, 2014L, 2014L, 2014L,
2013L, 2017L, 2017L, 2014L, 2017L, 2014L, 2017L, 2017L, 2014L
), JulianDay = c(122L, 18L, 170L, 359L, 258L, 98L, 234L, 202L,
194L, 359L, 295L, 122L, 66L, 74L, 194L, 170L, 359L, 290L, 106L,
279L, 258L, 18L, 10L, 352L, 194L, 170L, 66L, 74L, 42L, 295L,
279L, 258L, 106L, 18L, 74L, 202L, 290L, 258L, 10L, 266L, 106L,
295L, 162L, 146L, 170L, 226L, 42L, 57L, 57L, 42L), TimePeriod = c(1L,
2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L,
2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L,
2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L,
1L), Temperature = c(28L, 9L, 31L, 12L, 27L, 21L, 29L, 36L, 38L,
12L, 23L, 28L, 12L, 21L, 38L, 31L, 12L, 23L, 25L, 22L, 27L, 9L,
11L, 7L, 38L, 31L, 12L, 21L, 14L, 23L, 22L, 27L, 25L, 9L, 21L,
36L, 23L, 27L, 11L, 31L, 25L, 23L, 29L, 27L, 31L, 34L, 14L, 16L,
16L, 14L), Humidity = c(6L, 34L, 7L, 31L, 29L, 22L, 34L, 15L,
19L, 31L, 16L, 6L, 14L, 14L, 19L, 7L, 31L, 12L, 9L, 12L, 29L,
34L, 33L, 18L, 19L, 7L, 14L, 14L, 31L, 16L, 12L, 29L, 9L, 34L,
14L, 15L, 12L, 29L, 33L, 18L, 9L, 16L, 8L, 13L, 7L, 13L, 31L,
31L, 31L, 31L), Wind_speed = c(6L, 0L, 6L, 7L, 13L, 33L, 6L,
20L, 9L, 7L, 0L, 6L, 0L, 6L, 9L, 6L, 7L, 6L, 0L, 7L, 13L, 0L,
0L, 35L, 9L, 6L, 0L, 6L, 6L, 0L, 7L, 13L, 0L, 0L, 6L, 20L, 6L,
13L, 0L, 0L, 0L, 0L, 24L, 11L, 6L, 24L, 6L, 26L, 26L, 6L), Wind_gust = c(0L,
0L, 0L, 0L, 0L, 54L, 0L, 46L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 48L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 46L, 0L, 0L, 0L, 0L, 0L, 0L, 48L, 0L, 0L, 39L,
0L, 41L, 41L, 0L), Wind_trend = c(1L, 0L, 1L, 1L, 2L, 2L, 0L,
1L, 2L, 1L, 0L, 1L, 0L, 1L, 2L, 1L, 1L, 0L, 0L, 2L, 2L, 0L, 1L,
1L, 2L, 1L, 0L, 1L, 1L, 0L, 2L, 2L, 0L, 0L, 1L, 1L, 0L, 2L, 1L,
1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Wind_direction = c(0,
0, 0, 337.5, 360, 22.5, 0, 22.5, 0, 337.5, 0, 0, 0, 0, 0, 0,
337.5, 180, 0, 247.5, 360, 0, 0, 180, 0, 0, 0, 0, 337.5, 0, 247.5,
360, 0, 0, 0, 22.5, 180, 360, 0, 0, 0, 0, 360, 22.5, 0, 360,
337.5, 360, 360, 337.5), Pressure = c(940.2, 943.64, 937.69,
951.37, 932.69, 933.94, 937.07, 938.01, 937.69, 951.37, 939.72,
940.2, 948.33, 947.71, 937.69, 937.69, 951.37, 943.32, 932.69,
944.71, 932.69, 943.64, 942.31, 943.01, 937.69, 937.69, 948.33,
947.71, 941.94, 939.72, 944.71, 932.69, 932.69, 943.64, 947.71,
938.01, 943.32, 932.69, 942.31, 938.94, 932.69, 939.72, 928.31,
931.12, 937.69, 932.37, 941.94, 936.13, 936.13, 941.94), Pressure_trend = c(1L,
2L, 0L, 2L, 0L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 0L, 2L,
1L, 2L, 1L, 0L, 2L, 2L, 2L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 2L,
2L, 1L, 1L, 1L, 0L, 2L, 1L, 2L, 1L, 0L, 0L, 0L, 1L, 1L, 2L, 2L,
1L)), row.names = c(179L, 14L, 195L, 426L, 306L, 118L, 299L,
229L, 244L, 436L, 374L, 153L, 90L, 91L, 256L, 197L, 424L, 348L,
137L, 355L, 328L, 26L, 7L, 419L, 254L, 211L, 78L, 81L, 43L, 359L,
373L, 332L, 143L, 32L, 109L, 263L, 393L, 330L, 23L, 309L, 135L,
398L, 224L, 166L, 217L, 290L, 69L, 72L, 76L, 63L), class = "data.frame")
Well, this is a bit embarrassing.
The error arose because the months were not, in fact, paired. Instead of 38 observations (19x2) for each month, one month had 57 observations (19x3) because of a mistake in determining the month value. Correcting this, and checking that each month had the same number of paired observations for the ANOVA, allowed the test to run successfully.
> res.aov <- anova_test(
+ data = LST_Weather_dataset_N, dv = LST, wid = Month,
+ within = c(Buffer, TimePeriod),
+ effect.size = "ges",
+ detailed = TRUE,
+ )
> get_anova_table(res.aov, correction = "auto")
ANOVA Table (type III tests)
Effect DFn DFd SSn SSd F p p<.05 ges
1 (Intercept) 1 11 600.135 974.584 6.774 2.50e-02 * 0.189
2 Buffer 18 198 332.217 331.750 11.015 2.05e-21 * 0.115
3 TimePeriod 1 11 29.561 977.945 0.333 5.76e-01 0.011
4 Buffer:TimePeriod 18 198 13.055 283.797 0.506 9.53e-01 0.005
I still don't understand how the error message was telling me this, though.
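One possible reading of the message: the keys that must be unique are the subject identifier together with the within-subject factors (Month, Buffer and TimePeriod here), not the whole row. The message most likely comes from the step that reshapes the data to one row per subject, though that is an assumption about the package internals. Counting with LST included always gives n = 1, because the measured value makes every row distinct. A minimal base R sketch on made-up data (the data frame d and its values are hypothetical) that counts rows per design cell instead:

```r
# Hypothetical toy design: 2 months x 2 buffers x 2 time periods
d <- expand.grid(Month = 1:2, Buffer = c(100, 200), TimePeriod = 1:2)
d$LST <- seq_len(nrow(d)) / 10

# One extra observation that lands in an already occupied cell
extra <- data.frame(Month = 1, Buffer = 100, TimePeriod = 2, LST = 9.9)
d <- rbind(d, extra)

# Count rows per cell, leaving the dependent variable out of the key
cells <- as.data.frame(table(Month = d$Month, Buffer = d$Buffer,
                             TimePeriod = d$TimePeriod))
dup_cells <- cells[cells$Freq > 1, ]
dup_cells

# Rows per subject: every month should have the same count
rows_per_month <- table(d$Month)
rows_per_month
```

Any cell with a count above 1, or any month with more rows than the others, points at the rows that share keys.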

Automatically creating bins for a numeric variable in r

So I have a variable as below.
var <- c(0L, 5L, 4L, 115L, 0L, 0L, 0L, 2L, 365L, 4L, 20L, 61L, 365L,
0L, 365L, 0L, 14L, 0L, 0L, 72L, 0L, 0L, 6L, 105L, 150L, 0L, 365L,
0L, 1L, 28L, 161L, 6L, 0L, 2L, 12L, 0L, 10L, 49L, 7L, 2L, 51L,
0L, 0L, 11L, 0L, 0L, 17L, 0L, 0L, 7L, 0L, 28L, 0L, 0L, 0L, 44L,
0L, 3L, 0L, 0L, 0L, 1L, 1L, 0L, 4L, 87L, 0L, 321L, 0L, 0L, 0L,
0L, 9L, 0L, 0L, 0L, 140L, 0L, 0L, 0L, 0L, 0L, 1L, 8L, 20L, 0L,
4L, 14L, 3L, 0L, 0L, 0L, 39L, 4L, 9L, 0L, 0L, 0L, 1L, 7L)
I want to create bins (of equal or different sizes, either is fine) to categorize this variable and plot it as a bar chart.
I know it's possible to find automatic/recommended binning, but I am unsure how to do so in R.
I tried using the bin() function to no avail. I read about the Jenks method as well, but is there a way to create the best possible bins in R?
I would like to use the result for a bar plot in ggplot.
Your description sounds like you're wanting to plot a histogram of var. This can be done easily enough in ggplot using geom_histogram. The key here is that ggplot likes to have a data frame, so you just have to specify your variable in a dataframe first, which you can do inside the ggplot() function:
ggplot(data.frame(var), aes(var)) + geom_histogram(color='black', alpha=0.2)
Gives you this:
The default is to use 30 bins, but you can specify either number of bins via bins= or the size of the bins via binwidth=:
ggplot(data.frame(var), aes(var)) + geom_histogram(bins=10, color='black', alpha=0.2)
If you want to plot the basic bar geom, then geom_histogram() works just fine. If you change to use the stat_bin() function instead, it will perform the same binning method, but then you can apply and use a different geom if you want to:
ggplot(data.frame(var), aes(var)) +
stat_bin(geom='area', bins=10, alpha=0.2, color='black')
If you're looking to grab just the numbers/data from "binning" a variable like you have, one of the simplest ways might be to use cut() from base R.
Use of cut() is pretty simple. You specify the vector and a breaks= argument. Breaks can be specified as a vector of the points where you want to "cut" (or "bin") your data, or you can just set breaks=10 and it will give you 10 bins of equal width. The result is a factor whose levels correspond to the range of each bin. In the case of var with breaks=10, you get the following:
> var_cut <- cut(var, breaks = 10)
> levels(var_cut)
[1] "(-0.365,36.5]" "(36.5,73]" "(73,110]" "(110,146]" "(146,182]" "(182,219]" "(219,256]"
[8] "(256,292]" "(292,328]" "(328,365]"
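To get counts per bin rather than just the labels, cut() can be combined with table(), and hist() with plot = FALSE returns automatically chosen breaks without drawing anything. The sketch below uses a short made-up vector x and hand-picked breaks and labels; those are assumptions for illustration, not a recommendation for the original var.

```r
# Short illustrative vector (hypothetical, skewed like the question's data)
x <- c(0, 0, 0, 1, 2, 4, 5, 7, 14, 20, 28, 61, 115, 365)

# Unequal, hand-picked bins: (-Inf,0], (0,7], (7,30], (30,Inf]
binned <- cut(x, breaks = c(-Inf, 0, 7, 30, Inf),
              labels = c("0", "1-7", "8-30", ">30"))
counts <- table(binned)
counts

# Automatic breaks using the Freedman-Diaconis rule, without plotting
h <- hist(x, breaks = "FD", plot = FALSE)
h$breaks
h$counts
```

The counts table can be passed to barplot(), or converted with as.data.frame() for use with geom_col() in ggplot.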

How to go from long to wide with stated preference data in R

I hope this is not a cross-posting. I have been trying to understand from the available links on stackoverflow how to reshape data from long to wide. I think I am almost there, but there is still a lot missing.
I have stated preference data on the choice between an electric and a gasoline car. Some variables relate to cars in general, such as PREZZO, whilst others are specific to electric cars, such as AUTONOMIA_EV, and others to internal combustion engine cars, such as AUTONOMIA_ICEV.
Each respondent is identified by a number in the column INTERTOT. The first respondent has number 111. There are 20 rows corresponding to this individual because he faces 10 choices between two cars, one electric and one gasoline. In the column SCELTAEFF2, a value equal to 1 indicates the choice made by the individual. That value has to be compared to the corresponding one reported in column EV, where a value of 1 indicates that the option in that line is an electric vehicle.
Therefore, for example, if you look at line 4, which concerns the second choice faced by the first individual, the column SCELTAEFF2 takes a value of 1 and the corresponding row in the column EV is 1 as well. This means that the respondent, for the second pair of alternatives, chose the electric vehicle. If you look, instead, at line 8, which concerns the fourth choice, the individual chose a gasoline car. This is the case because the column SCELTAEFF2 takes a value of 1, but the corresponding row in the column EV is zero.
Now, I would like to have only one line for each respondent, INTERTOT, containing all the information that is now spread over 20 rows.
The file that I have is very big, which is why I am showing you only a part.
I would like to estimate a hybrid choice model and calculate the willingness-to-pay through the delta method, but the very first issue is getting the data into the right format.
The code I am trying is the following:
prova_reshape.wide = reshape(data = "prova_reshape", idvar = "INTERTOT", direction = "wide" )
but, of course, I get the following error message:
Error in data[, timevar] : incorrect number of dimensions
because I did not specify timevar. Well, I did not, because I did not know what to put in it. Moreover, I am not sure that specifying idvar = "INTERTOT" is enough.
I had a look at different sources on the web, such as the following and this.
I think I could be close, but I am not sure on how to proceed.
It would be very helpful if anyone could help me.
Marco
Here an excerpt of my dataset:
structure(list(QI = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), CSET = c(10L, 10L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L), INTERTOT = c(111L, 111L, 111L, 111L, 111L,
111L, 111L, 111L, 111L, 111L, 111L, 111L, 111L, 111L, 111L, 111L,
111L, 111L, 111L, 111L), NSCELTA = c(1L, 1L, 2L, 2L, 3L, 3L,
4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L), SCELTAEFF2 = c(0L,
1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L,
1L, 1L, 0L), MARCA = c(4L, 1L, 1L, 2L, 1L, 4L, 4L, 2L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 2L, 4L, 3L, 1L), EV = c(1L, 0L, 1L, 0L,
1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L
), PREZZO = c(25000L, 30000L, 25000L, 20000L, 30000L, 25000L,
20000L, 15000L, 25000L, 15000L, 30000L, 20000L, 20000L, 25000L,
35000L, 30000L, 20000L, 15000L, 35000L, 20000L), AUTONOMIA = c(150L,
1200L, 150L, 800L, 150L, 400L, 350L, 400L, 250L, 400L, 350L,
1200L, 150L, 800L, 350L, 800L, 250L, 1200L, 250L, 400L), AUTONOMIA_EV = c(150L,
0L, 150L, 0L, 150L, 0L, 350L, 0L, 250L, 0L, 350L, 0L, 150L, 0L,
350L, 0L, 250L, 0L, 250L, 0L), AUTONOMIA_ICEV = c(0L, 1200L,
0L, 800L, 0L, 400L, 0L, 400L, 0L, 400L, 0L, 1200L, 0L, 800L,
0L, 800L, 0L, 1200L, 0L, 400L)), row.names = c(NA, 20L), class = "data.frame")
I can't see the example data you posted but, if I've understood correctly, here is what you need, all in base R. It can be modified if necessary:
# add the timevar you need for reshape:
dats$timev <- with(dats, ave(rep(1, nrow(dats)), CHOICE_SET, FUN = seq_along))
# now you can reshape it:
dats_w <- reshape(dats, idvar = "CHOICE_SET", timevar = "timev", direction = "wide")
# choose the column you need
dats_w <- dats_w[,c(2,1,3,4,10,6,12,7,13)]
# last add the correct column names
colnames(dats_w) <- c('INTERVIEW','CHOICE_SET','CHOICE','BRAND_EV','BRAND_ICEV','PRICE_EV','PRICE_ICEV','RANGE_EV','RANGE_ICEV')
dats_w
INTERVIEW CHOICE_SET CHOICE BRAND_EV BRAND_ICEV PRICE_EV PRICE_ICEV RANGE_EV RANGE_ICEV
1 111 1 0 4 1 25000 30000 150 1200
3 111 2 1 1 2 25000 20000 150 800
5 111 3 1 1 4 30000 25000 150 400
7 111 4 0 4 2 20000 15000 350 400
9 111 5 1 2 3 25000 15000 250 400
11 111 6 1 1 2 30000 20000 350 1200
13 111 7 0 3 1 20000 25000 150 800
15 111 8 1 2 3 35000 30000 350 800
17 111 9 0 2 4 20000 15000 250 1200
19 111 10 1 3 1 35000 20000 250 400
With data:
dats <- structure(list(INTERVIEW = c(111L, 111L, 111L, 111L, 111L, 111L,
111L, 111L, 111L, 111L, 111L, 111L, 111L, 111L, 111L, 111L, 111L,
111L, 111L, 111L), CHOICE_SET = c(1L, 1L, 2L, 2L, 3L, 3L, 4L,
4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L), CHOICE = c(0L,
1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L,
1L, 1L, 0L), BRAND = c(4L, 1L, 1L, 2L, 1L, 4L, 4L, 2L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 2L, 4L, 3L, 1L), EV_DUMMY = c(1L, 0L,
1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L,
1L, 0L), PRICE = c(25000L, 30000L, 25000L, 20000L, 30000L, 25000L,
20000L, 15000L, 25000L, 15000L, 30000L, 20000L, 20000L, 25000L,
35000L, 30000L, 20000L, 15000L, 35000L, 20000L), RANGE = c(150L,
1200L, 150L, 800L, 150L, 400L, 350L, 400L, 250L, 400L, 350L,
1200L, 150L, 800L, 350L, 800L, 250L, 1200L, 250L, 400L), timev = c(1,
2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2)), row.names = c(NA,
20L), class = "data.frame")
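The code above groups by CHOICE_SET only, which works for a single respondent. With several respondents, one option is to put both the respondent and the choice set in idvar and to use the EV dummy as the time variable. The sketch below uses the question's column names on a tiny made-up data frame; the values, the helper column alt, and the reduced set of columns are assumptions for illustration.

```r
# Toy data in the question's layout: two respondents, two choice sets,
# two alternatives (electric and gasoline) per choice set
long <- data.frame(
  INTERTOT   = rep(c(111L, 112L), each = 4),
  NSCELTA    = rep(c(1L, 1L, 2L, 2L), times = 2),
  EV         = rep(c(1L, 0L), times = 4),
  SCELTAEFF2 = c(0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L),
  PREZZO     = c(25000L, 30000L, 25000L, 20000L, 30000L, 25000L, 20000L, 15000L)
)

# Label each alternative so the wide column names are readable
long$alt <- ifelse(long$EV == 1L, "EV", "ICEV")

# One row per respondent and choice set, one column set per alternative
wide <- reshape(
  long[, c("INTERTOT", "NSCELTA", "alt", "SCELTAEFF2", "PREZZO")],
  idvar     = c("INTERTOT", "NSCELTA"),
  timevar   = "alt",
  direction = "wide",
  sep       = "_"
)
wide
```

This gives one row per respondent and choice set; a second reshape() with idvar = "INTERTOT" and timevar = "NSCELTA" would collapse that further to one row per respondent if the model really needs it.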

Merge two legends (size and color) into one [duplicate]

This question already has answers here:
How to combine scales for colour and size into one legend?
What is the code to merge the two legends into one, a legend of circles with color?
I think a single legend, with circles sized and colored according to the total number of crimes, is the best way to show it.
Desired output:
1) There should be one legend: the circles, instead of being black, should be colored from yellow (0) to red (800).
My code:
library(maps)
library(ggmap)
# Get map from Google Maps
lima <- get_map(location = "lima", zoom = 11, maptype = c("terrain"))
# Plot
ggmap(lima) + geom_point(data = limanov2, aes(x = LONGITUD , y = LATITUD, color = TOTALES,
size = TOTALES)) +
scale_size_continuous(name = "Cantidad\ndelitos",range = c(2,12)) +
scale_color_gradient(name = "Cantidad\ndelitos", low = "yellow", high = "red") +
theme(legend.text= element_text(size=14)) +
ggtitle("TOTAL DELITOS - LIMA NOV 2012") +
theme(plot.title = element_text(size = 12, vjust=2, family="Verdana", face="italic"),
legend.position = 'left')
My data:
structure(list(DISTRITO = c("SAN JUAN DE LURIGANCHO", "CALLAO",
"LOS OLIVOS", "ATE", "LIMA", "SAN MARTIN DE PORRES", "SANTIAGO DE SURCO",
"CHORILLOS", "COMAS", "INDEPENDENCIA", "EL AGUSTINO", "LA VICTORIA",
"SAN JUAN DE MIRAFLORES", "VILLA EL SALVADOR", "SAN MIGUEL",
"CARABAYLLO", "MIRAFLORES", "SAN BORJA", "VENTANILLA", "SURQUILLO",
"BREÑA", "ANCON", "PTE. PIEDRA", "RIMAC", "BARRANCO", "LA MOLINA",
"SAN LUIS", "SANTA ANITA", "LURIGANCHO", "P. LIBRE", "MAGDALENA DEL MAR",
"LA PERLA", "CHACLACAYO", "PUENTE PIEDRA", "SAN ISIDRO", "JESUS MARIA",
"BELLAVISTA", "LINCE", "CARMEN DE LA LEGUA REYNOSO", "CIENEGUILLA",
"SANTA ROSA", "LURIN", "PUNTA NEGRA", "PUCUSANA", "LA PUNTA",
"PUNTA HERMOSA", "PACHACAMAC", "SAN BARTOLO", "SANTA MARIA"),
TOTALES = c(861L, 696L, 696L, 642L, 516L, 479L, 442L, 378L,
371L, 368L, 361L, 333L, 325L, 291L, 282L, 251L, 239L, 196L,
193L, 188L, 185L, 174L, 165L, 161L, 138L, 134L, 128L, 119L,
115L, 105L, 67L, 65L, 63L, 58L, 58L, 56L, 45L, 38L, 23L,
23L, 11L, 8L, 6L, 5L, 3L, 3L, 2L, 0L, 0L), HOMICIDIOS = c(1L,
7L, 0L, 1L, 2L, 0L, 0L, 1L, 7L, 4L, 4L, 4L, 0L, 0L, 0L, 2L,
0L, 0L, 7L, 0L, 0L, 0L, 0L, 4L, 0L, 0L, 2L, 0L, 0L, 0L, 0L,
0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), LESIONES = c(100L, 72L, 61L, 43L, 44L, 8L, 10L,
15L, 44L, 40L, 50L, 15L, 52L, 28L, 7L, 33L, 15L, 3L, 21L,
7L, 36L, 33L, 15L, 19L, 14L, 1L, 8L, 6L, 16L, 4L, 4L, 9L,
1L, 12L, 2L, 9L, 5L, 2L, 5L, 7L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), VIO..DE.LA.LIBERTAD.PERSONAL = c(0L, 7L, 6L,
5L, 6L, 1L, 1L, 0L, 3L, 1L, 2L, 0L, 2L, 0L, 1L, 0L, 1L, 0L,
1L, 1L, 0L, 3L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 0L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L), VIO..DE.LA.LIBERTAD.SEXUAL = c(56L, 14L, 12L, 15L, 7L,
10L, 2L, 9L, 11L, 13L, 8L, 9L, 7L, 14L, 4L, 15L, 4L, 2L,
17L, 7L, 3L, 4L, 6L, 12L, 2L, 1L, 5L, 3L, 11L, 4L, 1L, 2L,
0L, 6L, 2L, 0L, 3L, 0L, 2L, 2L, 0L, 4L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), HURTO.SIMPLE.Y.AGRAVADO = c(217L, 203L, 296L, 230L,
260L, 167L, 226L, 217L, 130L, 117L, 154L, 133L, 121L, 46L,
163L, 72L, 161L, 119L, 69L, 120L, 64L, 19L, 64L, 21L, 57L,
44L, 39L, 2L, 48L, 60L, 30L, 19L, 48L, 20L, 41L, 25L, 19L,
27L, 7L, 11L, 9L, 0L, 6L, 0L, 2L, 3L, 1L, 0L, 0L), ROBO.SIMPLE.Y.AGRAVADO = c(460L,
289L, 308L, 344L, 186L, 277L, 198L, 130L, 165L, 184L, 137L,
149L, 134L, 188L, 104L, 126L, 58L, 72L, 64L, 51L, 77L, 115L,
79L, 76L, 64L, 88L, 73L, 108L, 40L, 36L, 30L, 32L, 14L, 17L,
12L, 22L, 12L, 8L, 6L, 3L, 1L, 3L, 0L, 2L, 1L, 0L, 1L, 0L,
0L), MICRO.COM.DE.DROGAS = c(26L, 100L, 13L, 3L, 10L, 15L,
5L, 5L, 11L, 8L, 3L, 23L, 9L, 15L, 3L, 3L, 0L, 0L, 8L, 2L,
5L, 0L, 0L, 28L, 0L, 0L, 1L, 0L, 0L, 0L, 2L, 2L, 0L, 2L,
0L, 0L, 6L, 0L, 0L, 0L, 0L, 0L, 0L, 3L, 0L, 0L, 0L, 0L, 0L
), TENENCIA.ILEGAL.DE.ARMAS = c(1L, 4L, 0L, 1L, 1L, 1L, 0L,
1L, 0L, 1L, 3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 6L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), LONGITUD = c(-77,
-77.12, -77.08, -76.89, -77.04, -77.09, -76.99, -77.01, -77.05,
-77.05, -77, -77.02, -76.97, -76.94, -77.09, -76.99, -77.03,
-77, -77.13, -77.01, -77.05, -77.11, -77.08, -76.7, -77.02,
-76.92, -77, -76.96, -76.86, -77.06, -77.07, -77.12, -76.76,
-77.08, -77.03, -77.05, -77.11, -77.04, -77.09, -76.78, -77.16,
-76.81, -76.73, -76.77, -77.16, -76.76, -76.83, -76.73, -76.77
), LATITUD = c(-11.99, -12.04, -11.95, -12.04, -12.06, -12,
-12.16, -12.2, -11.93, -11.99, -12.04, -12.08, -12.16, -12.23,
-12.08, -11.79, -12.12, -12.1, -11.89, -12.11, -12.06, -11.69,
-11.88, -11.94, -12.15, -12.09, -12.08, -12.04, -11.98, -12.08,
-12.09, -12.07, -11.99, -11.88, -12.1, -12.08, -12.06, -12.09,
-12.04, -12.07, -11.81, -12.24, -12.32, -12.47, -12.07, -12.28,
-12.18, -12.38, -12.42)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -49L), .Names = c("DISTRITO", "TOTALES",
"HOMICIDIOS", "LESIONES", "VIO..DE.LA.LIBERTAD.PERSONAL", "VIO..DE.LA.LIBERTAD.SEXUAL",
"HURTO.SIMPLE.Y.AGRAVADO", "ROBO.SIMPLE.Y.AGRAVADO", "MICRO.COM.DE.DROGAS",
"TENENCIA.ILEGAL.DE.ARMAS", "LONGITUD", "LATITUD"))
I've found a solution by reading the documentation for ggplot2 version 0.9.
The new function guide_legend() should be used inside guides().
It gives you more control over how the legend is drawn.
This is the final code with the resulting output (see the last line):
ggmap(lima) + geom_point(data = limanov2, aes(x = LONGITUD , y = LATITUD, color = TOTALES,
size = TOTALES)) +
scale_size_continuous(name = "Cantidad\ndelitos",range = c(2,12)) +
scale_color_gradient(name = "Cantidad\ndelitos", low = "yellow", high = "red") +
theme(legend.text= element_text(size=14)) +
ggtitle("TOTAL DELITOS - LIMA NOV 2012") +
theme(plot.title = element_text(size = 12, vjust=2, family="Verdana", face="italic"),
legend.position = 'left') +
guides(colour = guide_legend())
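The same idea can be tried without downloading a map tile. The sketch below is a stripped-down version on a plain ggplot, assuming a small made-up data frame d in place of limanov2. The two scales share the same name and are mapped to the same variable, which is what lets ggplot2 combine them once the colour guide is drawn as a legend instead of a colour bar.

```r
library(ggplot2)

# Hypothetical stand-in for limanov2: a few points with a count column
d <- data.frame(
  LONGITUD = c(-77.00, -77.12, -77.08, -76.89, -77.04),
  LATITUD  = c(-11.99, -12.04, -11.95, -12.04, -12.06),
  TOTALES  = c(861, 696, 442, 138, 3)
)

p <- ggplot(d, aes(x = LONGITUD, y = LATITUD, colour = TOTALES, size = TOTALES)) +
  geom_point() +
  # Same title on both scales so the guides can be merged
  scale_size_continuous(name = "Cantidad\ndelitos", range = c(2, 12)) +
  scale_colour_gradient(name = "Cantidad\ndelitos", low = "yellow", high = "red") +
  # Draw colour as a discrete legend rather than a colour bar
  guides(colour = guide_legend())
```

Calling print(p) draws the plot with a single legend of colored circles.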

R: ggmap: "containing missing values (geom_point)" when plotting, but no NA values found in data.frame

I'm plotting some points over a map with the ggmap package.
The problem is that I get the message: "Removed 12 rows containing missing values (geom_point)".
But I don't have any NAs. I've looked at the data, and used:
sum(is.na(limanov2)) #Gives 0
to prove it.
This is my code:
library(maps)
library(ggmap)
lima <- get_map(location = "lima", zoom = 11)
ggmap(lima) + geom_point(data = limanov2, aes(x = LONGITUD , y = LATITUD, color = TOTALES,
size = TOTALES)) +
scale_color_gradient(low = "yellow", high = "red")
My data:
structure(list(DISTRITO = c("SAN JUAN DE LURIGANCHO", "CALLAO",
"LOS OLIVOS", "ATE VITARTE", "LIMA CERCADO", "SAN MARTÍN", "SANTIAGO DE SURCO",
"CHORILLOS", "COMAS", "INDEPENDENCIA", "EL AGUSTINO", "LA VICTORIA",
"SAN JUAN DE MIRAFLORES", "VILLA EL SALVADOR", "S. MIGUEL", "CARABAYLLO",
"MIRAFLORES", "PTE. PIEDRA", "SAN BORJA", "VENTANILLA", "SURQUILLO",
"BREÑA", "ANCÓN", "EL RIMAC", "BARRANCO", "LA MOLINA", "SAN LUIS",
"STA. ANITA", "LURIGANCHO", "P. LIBRE", "MAGDALENA", "LA PERLA",
"CHACLACAYO", "SAN ISIDRO", "J. MARÍA", "BELLAVISTA", "LINCE",
"C. DE LA LEGUA", "CIENEGUILLA", "STA.ROSA", "LURÍN", "PTA.NEGRA",
"PUCUSANA", "LA PUNTA", "PTA. HERMOSA", "PACHACAMAC", "SAN BARTOLO",
"SANTA MARÍA"), TOTALES = c(861L, 696L, 696L, 642L, 516L, 479L,
442L, 378L, 371L, 368L, 361L, 333L, 325L, 291L, 282L, 251L, 239L,
223L, 196L, 193L, 188L, 185L, 174L, 161L, 138L, 134L, 128L, 119L,
115L, 105L, 67L, 65L, 63L, 58L, 56L, 45L, 38L, 23L, 23L, 11L,
8L, 6L, 5L, 3L, 3L, 2L, 0L, 0L), HOMICIDIOS = c(1L, 7L, 0L, 1L,
2L, 0L, 0L, 1L, 7L, 4L, 4L, 4L, 0L, 0L, 0L, 2L, 0L, 1L, 0L, 7L,
0L, 0L, 0L, 4L, 0L, 0L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), LESIONES = c(100L,
72L, 61L, 43L, 44L, 8L, 10L, 15L, 44L, 40L, 50L, 15L, 52L, 28L,
7L, 33L, 15L, 27L, 3L, 21L, 7L, 36L, 33L, 19L, 14L, 1L, 8L, 6L,
16L, 4L, 4L, 9L, 1L, 2L, 9L, 5L, 2L, 5L, 7L, 1L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L), VIO..DE.LA.LIBERTAD.PERSONAL = c(0L, 7L,
6L, 5L, 6L, 1L, 1L, 0L, 3L, 1L, 2L, 0L, 2L, 0L, 1L, 0L, 1L, 1L,
0L, 1L, 1L, 0L, 3L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L,
0L, 0L, 0L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), VIO..DE.LA.LIBERTAD.SEXUAL = c(56L,
14L, 12L, 15L, 7L, 10L, 2L, 9L, 11L, 13L, 8L, 9L, 7L, 14L, 4L,
15L, 4L, 12L, 2L, 17L, 7L, 3L, 4L, 12L, 2L, 1L, 5L, 3L, 11L,
4L, 1L, 2L, 0L, 2L, 0L, 3L, 0L, 2L, 2L, 0L, 4L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), HURTO.SIMPLE.Y.AGRAVADO = c(217L, 203L, 296L, 230L,
260L, 167L, 226L, 217L, 130L, 117L, 154L, 133L, 121L, 46L, 163L,
72L, 161L, 84L, 119L, 69L, 120L, 64L, 19L, 21L, 57L, 44L, 39L,
2L, 48L, 60L, 30L, 19L, 48L, 41L, 25L, 19L, 27L, 7L, 11L, 9L,
0L, 6L, 0L, 2L, 3L, 1L, 0L, 0L), ROBO.SIMPLE.Y.AGRAVADO = c(460L,
289L, 308L, 344L, 186L, 277L, 198L, 130L, 165L, 184L, 137L, 149L,
134L, 188L, 104L, 126L, 58L, 96L, 72L, 64L, 51L, 77L, 115L, 76L,
64L, 88L, 73L, 108L, 40L, 36L, 30L, 32L, 14L, 12L, 22L, 12L,
8L, 6L, 3L, 1L, 3L, 0L, 2L, 1L, 0L, 1L, 0L, 0L), MICRO.COM.DE.DROGAS = c(26L,
100L, 13L, 3L, 10L, 15L, 5L, 5L, 11L, 8L, 3L, 23L, 9L, 15L, 3L,
3L, 0L, 2L, 0L, 8L, 2L, 5L, 0L, 28L, 0L, 0L, 1L, 0L, 0L, 0L,
2L, 2L, 0L, 0L, 0L, 6L, 0L, 0L, 0L, 0L, 0L, 0L, 3L, 0L, 0L, 0L,
0L, 0L), TENENCIA.ILEGAL.DE.ARMAS = c(1L, 4L, 0L, 1L, 1L, 1L,
0L, 1L, 0L, 1L, 3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 6L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), LONGITUD = c(-77, -77.12,
-77.08, -76.89, -77.04, -77.09, -76.99, -77.01, -77.05, -77.05,
-77, -77.02, -76.97, -76.94, -77.09, -76.99, -77.03, -77.08,
-77, -77.13, -77.01, -77.05, -77.11, -76.7, -77.02, -76.92, -77,
-76.96, -76.86, -77.06, -77.07, -77.12, -76.76, -77.03, -77.05,
-77.11, -77.04, -77.09, -76.78, -77.16, -76.81, -76.73, -76.77,
-77.16, -76.76, -76.83, -76.73, -76.77), LATITUD = c(-11.99,
-12.04, -11.97, -12.04, -12.06, -12, -12.16, -12.2, -11.93, -11.99,
-12.04, -12.08, -12.16, -12.23, -12.08, -11.79, -12.12, -11.88,
-12.1, -11.89, -12.11, -12.06, -11.69, -11.94, -12.15, -12.09,
-12.08, -12.04, -11.98, -12.08, -12.09, -12.07, -11.99, -12.1,
-12.08, -12.06, -12.09, -12.04, -12.07, -11.81, -12.24, -12.32,
-12.47, -12.07, -12.28, -12.18, -12.38, -12.42)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -48L), .Names = c("DISTRITO",
"TOTALES", "HOMICIDIOS", "LESIONES", "VIO..DE.LA.LIBERTAD.PERSONAL",
"VIO..DE.LA.LIBERTAD.SEXUAL", "HURTO.SIMPLE.Y.AGRAVADO", "ROBO.SIMPLE.Y.AGRAVADO",
"MICRO.COM.DE.DROGAS", "TENENCIA.ILEGAL.DE.ARMAS", "LONGITUD",
"LATITUD"))
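You have values outside the area covered by the base map at this zoom level, so those points are dropped and reported as missing. Try changing your zoom parameter so the map covers all of your points: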
library(maps)
library(ggmap)
lima <- get_map(location = "lima", zoom = 10)
ggmap(lima) +
geom_point(data = limanov2,
aes(x = LONGITUD , y = LATITUD,
color = TOTALES, size = TOTALES)) +
scale_color_gradient(low = "yellow", high = "red")
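To see which rows are being dropped, the coordinates can be compared with the map's bounding box. The sketch below is base R only and uses a made-up bounding box bb; as far as I know ggmap stores a similar one-row data frame in attr(lima, "bb") with columns ll.lat, ll.lon, ur.lat and ur.lon, but treat that layout as an assumption and check it on your own map object.

```r
# Hypothetical bounding box (lower-left and upper-right corners of the map)
bb <- data.frame(ll.lat = -12.25, ll.lon = -77.25, ur.lat = -11.85, ur.lon = -76.80)

# A few districts taken from the question's data
pts <- data.frame(
  DISTRITO = c("CALLAO", "ANCON", "PUCUSANA", "LIMA"),
  LONGITUD = c(-77.12, -77.11, -76.77, -77.04),
  LATITUD  = c(-12.04, -11.69, -12.47, -12.06)
)

# A point is plotted only if both coordinates fall inside the box
inside <- pts$LONGITUD >= bb$ll.lon & pts$LONGITUD <= bb$ur.lon &
          pts$LATITUD  >= bb$ll.lat & pts$LATITUD  <= bb$ur.lat
outside <- pts[!inside, ]
outside
```

Rows listed in outside are the ones geom_point would report as removed for a map with that extent.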
