Data subset error in R using %in% wildcard

My df:
> str(merged)
'data.frame': 714 obs. of 9 variables:
$ Date : Date, format: "2013-03-29" "2013-03-29" "2013-03-29" "2013-03-29" ...
$ patch : Factor w/ 7 levels "BVG1","BVG11",..: 1 2 3 4 5 6 7 1 2 3 ...
$ prod : num 2.93 2.77 2.86 2.87 3.01 ...
$ workmix_pct : int 100 10 16 13 17 21 22 100 11 19 ...
$ jobcounts : int 9480 968 1551 1267 1625 1946 2123 7328 810 1374 ...
$ travel : num 30.7 34.3 33.8 29.1 28.1 24.9 34 31.8 32.7 36.4 ...
$ FWIHweeklyAvg: num 1.63 4.48 3.1 1.36 1.55 ...
$ CST.NAME : Factor w/ 7 levels "Central Scotland",..: 4 2 3 1 5 7 6 4 2 3 ...
$ month : chr "March" "March" "March" "March" ...
> head(merged)
Date patch prod workmix_pct jobcounts travel FWIHweeklyAvg CST.NAME month
1 2013-03-29 BVG1 2.932208 100 9480 30.7 1.627024 Scotland March
2 2013-03-29 BVG11 2.769156 10 968 34.3 4.475714 Highlands & Islands March
3 2013-03-29 BVG12 2.857344 16 1551 33.8 3.098571 North East Scotland March
4 2013-03-29 BVG13 2.870111 13 1267 29.1 1.361429 Central Scotland March
5 2013-03-29 BVG14 3.011260 17 1625 28.1 1.550000 South East Scotland March
6 2013-03-29 BVG15 3.236246 21 1946 24.9 1.392857 West Central Scotland March
I am trying to subset on patch BVG1 by:
data=merged[patch %in% c("BVG1"),]
But getting an error:
Error in match(x, table, nomatch = 0L) : object 'patch' not found
Don't understand why...
I am trying to plot separate timeseries per patch using ggplot
This is what I have tried:
ggplot(data=merged, aes(x=merged$Date, y=merged$prod, group=patch)) + geom_line() + xlab("") + ylab("Weekly Prods")+ scale_x_date(labels = date_format("%b-%Y"),breaks = "1 month")
This plots all patches on one graph... But I want to show BVG1 timeseries only and this is what I was trying:
ggplot(data=merged[patch %in% c("BVG1"),], aes(x=merged$Date, y=merged$prod, group=patch)) + geom_line() + xlab("") + ylab("Weekly Prods")+ scale_x_date(labels = date_format("%b-%Y"),breaks = "1 month")
But getting the same error.
Any ideas?
UPDATE
Problem solved using [merged$patch %in% c("BVG1"),]

You could also do
data <- subset(merged, patch == "BVG1")
Since you are conditioning on a single value of patch, you don't need %in%; you can simply test for equality.
subset() evaluates the condition inside the data frame, so you can write patch instead of merged$patch.
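To make the scoping difference concrete, here is a minimal sketch on a made-up data frame (the values are invented; only the column names patch and prod come from the question). Inside [ ], the index is evaluated as ordinary R code, so patch has to be qualified; subset() and with() look the name up inside the data frame.

```r
# Toy stand-in for `merged`; values are illustrative only
merged <- data.frame(
  patch = factor(c("BVG1", "BVG11", "BVG1", "BVG12")),
  prod  = c(2.93, 2.77, 3.01, 2.86)
)

# `[` evaluates its index in the calling environment, so qualify the column
by_index <- merged[merged$patch == "BVG1", ]

# subset() evaluates the condition inside the data frame
by_subset <- subset(merged, patch == "BVG1")

# with() does the same for the row index, here with %in%
by_with <- merged[with(merged, patch %in% c("BVG1")), ]

# Unused factor levels survive subsetting; droplevels() removes them
remaining_levels <- levels(droplevels(by_index)$patch)
```

All three approaches select the same rows; which one to use is mostly a matter of taste, although subset() is generally meant for interactive use.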

Try
data=merged[merged$patch %in% c("BVG1"),]
That should solve your problem. patch is a column of your data frame, not a standalone object, so you need to tell R where to find it.
Additionally, you may want to look at facet_wrap instead of subsetting. For instance, adding + facet_wrap(~ patch) to your plot command should show you all patches at once. I am not sure this is what you desire as output, but I thought I should point it out as an idea...
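Both options can be sketched as follows. The data frame is a made-up stand-in for merged (values invented). Note that the sketch uses bare column names inside aes(): they are looked up in whatever is passed as data, whereas aes(x = merged$Date) would always refer to the full, unfiltered vector.

```r
library(ggplot2)

# Toy stand-in for `merged`; values are invented for illustration
merged <- data.frame(
  Date  = rep(as.Date("2013-03-29") + 7 * 0:3, times = 2),
  patch = rep(c("BVG1", "BVG11"), each = 4),
  prod  = c(2.93, 2.95, 3.01, 2.98, 2.77, 2.80, 2.75, 2.79)
)

# Option 1: filter first, then plot only BVG1
bvg1 <- merged[merged$patch == "BVG1", ]
p_one <- ggplot(bvg1, aes(x = Date, y = prod)) +
  geom_line() +
  labs(x = NULL, y = "Weekly Prods")

# Option 2: keep all rows and draw one panel per patch
p_all <- ggplot(merged, aes(x = Date, y = prod)) +
  geom_line() +
  facet_wrap(~ patch)
```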

Related

I'm having problems with my ggplot2 theme system

I'm having a problem with my ggplot2 plots and I don't know why. Even after restarting RStudio, the theme cannot be restored to the original default theme.
library(tidyverse)
chic <- read_csv("./chicago-nmmaps-custom.csv")
ggplot(chic, aes(x = date, y = temp)) +
geom_point()
That is the code I ran.
[Image: the plot I got when I ran it]
[Image: what the plot should normally look like]
You could use theme_set to replace older themes like this:
library(ggplot2)
p <- ggplot(mtcars, aes(mpg, wt)) +
geom_point()
p
old <- theme_set(theme_bw())
p
theme_set(old)
p
Created on 2022-10-08 with reprex v2.0.2
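If the active theme was changed earlier (for example by a theme_set() call in a startup file such as .Rprofile, which would survive a restart), a minimal way to reset and inspect it is sketched below. theme_gray() is ggplot2's built-in default theme; whether a stray theme_set() is really the cause in this case is only an assumption.

```r
library(ggplot2)

# Make the built-in default (grey) theme active again;
# theme_set() invisibly returns the theme that was active before
previous <- theme_set(theme_gray())

# theme_get() returns the currently active theme
current <- theme_get()

p <- ggplot(mtcars, aes(mpg, wt)) +
  geom_point()
p
```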
The problem is that column date is not a date object, it's a column of class "character". Coerce to class "Date" and the default grey theme is used.
The output of str shows the classes of the data set's columns, and date is displayed as chr, meaning a column of class "character". R has proper date and date-time classes, and this column should be converted to one of them. Everything afterwards becomes easier, including the ggplot2 code. ggplot2's scales scale_*_date and scale_*_datetime even have dedicated date_breaks and date_labels arguments.
str(chic)
#> 'data.frame': 5114 obs. of 9 variables:
#> $ city : chr "chic" "chic" "chic" "chic" ...
#> $ date : chr "1987-01-01" "1987-01-02" "1987-01-03" "1987-01-04" ...
#> $ death : int 130 150 101 135 126 130 129 109 125 153 ...
#> $ temp : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
#> $ dewpoint: num 31.5 29.9 27.4 28.6 28.9 ...
#> $ pm10 : num 27.8 NA 33.7 40.8 NA ...
#> $ o3 : num 4.03 4.58 3.4 3.94 4.4 ...
#> $ time : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ season : chr "winter" "winter" "winter" "winter" ...
library(ggplot2)
chic |>
dplyr::mutate(date = as.Date(date)) |>
ggplot(aes(date, temp)) +
geom_point() +
scale_x_date(date_breaks = "1 year", date_labels = "%Y")
Created on 2022-10-08 with reprex v2.0.2

How to use a multinomial logistic regression model to predict future observations

My question seems a little vague so I will provide background context and my reproducible code to try and clarify.
I am interested in classifying crime occurrences in various neighbourhoods of a city, based on each neighbourhood's socioeconomic indicators. My end goal is to generate a reasonably accurate prediction of the neighbourhood in which the next crime is most likely to occur. I chose to fit a multinomial regression model, and I am having a hard time interpreting its results.
Here is how my data looks:
> str(df)
'data.frame': 1796 obs. of 12 variables:
$ Time : chr "14:37:00" "14:37:00" "16:23:00" "00:10:00" ...
$ Neighbourhood : chr "Grand Boulevard" "Grand Boulevard" "West Town" "West Englewood" ...
$ Population : num 22209 22209 84698 26346 24976 ...
$ Area : num 1.74 1.74 4.58 3.15 2.55 2.95 3.15 1.04 7.15 1.28 ...
$ Density : chr "12,763.79" "12,763.79" "18,493.01" "8,363.81" ...
$ Crowded.Housing: num 3.3 3.3 2.3 4.8 2.7 3.3 4.8 2.4 6.3 9.4 ...
$ Poverty : num 29.3 29.3 14.7 34.4 8.9 27.8 34.4 21.7 28.6 41.7 ...
$ Unemployment : num 24.3 24.3 6.6 35.9 9.5 24 35.9 15.7 22.6 25.8 ...
$ Education : num 15.9 15.9 12.9 26.3 18.8 14.5 26.3 11.3 24.4 24.5 ...
$ Age : num 39.5 39.5 21.7 40.7 37.6 40.3 40.7 35.4 37.9 43.6 ...
$ Income : num 23472 23472 43198 11317 25113 ...
$ Hardship : num 57 57 10 89 29 60 89 26 73 92 ...
Here is the code for my model:
c.nnet = nnet::multinom(Neighbourhood ~
Crowded.Housing +
Poverty +
Unemployment +
Education +
Income +
Hardship,
data = df,
MaxNWts = 100000)
Here are some classification accuracy metrics:
> odds <- c.nnet[["fitted.values"]]
> pd = predict(c.nnet,type="class")
> table = table(df$Neighbourhood, pd); classAgreement(table)
$diag
[1] 0.6631403
$kappa
[1] 0.6451884
$rand
[1] 0.9560459
$crand
[1] 0.6035169
> sum(diag(table))/sum(table)
[1] 0.6631403
Lastly, here is the output of the predicted classes and the associated class probabilities.
>head(pd)
[1] Chatham Chatham West Town West Englewood New City Chatham
72 Levels: Albany Park Archer Heights Armour Square Ashburn Auburn Gresham Austin Avalon Park Avondale Belmont Cragin Bridgeport Brighton Park ... Woodlaw
> head(odds)
Albany Park Archer Heights Armour Square Ashburn Auburn Gresham Austin Avalon Park Avondale Belmont Cragin Bridgeport Brighton Park
1 8.293444e-04 3.078169e-04 3.394213e-04 5.070003e-04 0.0333699087 8.205015e-03 0.0140058699 3.519157e-04 0.0005199967 3.962345e-04 1.796575e-05
2 8.293444e-04 3.078169e-04 3.394213e-04 5.070003e-04 0.0333699087 8.205015e-03 0.0140058699 3.519157e-04 0.0005199967 3.962345e-04 1.796575e-05
3 7.276802e-04 2.796196e-06 1.540627e-03 9.642981e-03 0.0001623333 4.575838e-05 0.0004173684 1.229428e-03 0.0007718075 2.308536e-02 9.021844e-03
4 7.168266e-05 7.869570e-04 1.743114e-05 3.519012e-05 0.0473000895 9.256728e-02 0.0058524740 4.373425e-05 0.0002943829 4.752441e-06 6.214005e-07
5 2.376865e-03 3.647976e-04 3.261888e-03 5.958128e-02 0.0090540446 4.103546e-02 0.0028125946 9.329274e-03 0.0339153709 1.394973e-02 9.034131e-02
6 7.735586e-04 5.958576e-04 2.345032e-04 4.058962e-04 0.0833015893 2.374063e-02 0.0169124221 3.038695e-04 0.0005576943 2.163316e-04 1.263609e-05
As far as I understand, the latter output (odds) represents the probability of each crime occurrence belonging to each of the 72 unique neighbourhoods in my data, while the former (pd) represents the predicted classes based on my data set. This leads to my specific question: how can I use these predicted classes to generate some sort of forecast of where the next crime is likely to occur (i.e. something like a one-step-ahead time-series forecast)?
You can create a newdata data frame with the values you want to predict over and then use the predict function to obtain predicted probabilities for each class. For example,
# estimate model
library(nnet)
dat <- mtcars
dat$gear <- factor(dat$gear)
mod <- multinom(gear ~ mpg + hp, data = dat)
# what values we want predictions for
out_of_sample_data <- data.frame(mpg = c(19, 20), hp = c(130, 140))
# generate predicted probabilities
predict(mod, newdata = out_of_sample_data, type = "probs")
#> 3 4 5
#> 1 0.6993027 0.2777716 0.02292562
#> 2 0.6217686 0.2750779 0.10315351
Obviously, you'll need to populate your out-of-sample data with values you believe will occur in the future, which can be tricky (to say the least).
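To turn those probabilities into a single "most likely class" per new observation, one option is sketched below, again on mtcars. The new covariate values are assumed, not observed, and trace = FALSE only silences the fitting log.

```r
library(nnet)

dat <- mtcars
dat$gear <- factor(dat$gear)
mod <- multinom(gear ~ mpg + hp, data = dat, trace = FALSE)

# Hypothetical future covariate values (assumed, not observed)
new_obs <- data.frame(mpg = c(19, 20), hp = c(130, 140))

# One row per new observation, one column per class
probs <- predict(mod, newdata = new_obs, type = "probs")

# Most likely class per row, taken from the probability matrix
top_class <- colnames(probs)[max.col(probs, ties.method = "first")]

# The same answer directly from predict()
pred_class <- predict(mod, newdata = new_obs, type = "class")

# Classes ranked by predicted probability for the first new observation
ranked_first <- sort(probs[1, ], decreasing = TRUE)
```

Keep in mind that with a single new row, predict(..., type = "probs") returns a plain named vector rather than a matrix, so the max.col() step needs at least two rows or an explicit conversion.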

Plot map using ggplot2

I'm trying to plot data on a map of Switzerland
using this code:
require("rgdal")
require("maptools")
require("ggplot2")
require("plyr")
require("maps")
require("ggmap")
ggplot() + geom_polygon(data = da, aes(x=long, y = lat)) +
coord_fixed(1.3)+
geom_point(data=de, aes(x=lat, y=lon), color="orange")
Where data da is a map using swissmap package:
da<- shp_df[[6]]
& data de is:
'data.frame': 115 obs. of 5 variables:
$ FB : Factor w/ 3 levels "I","II","IV": 2 2 2 3 1 2 1 3 1 1
$ Nom : Factor w/ 115 levels "\"Patient Education\" Programm unipolare Depression",..: 9 31 95 112 92 41 70 84 13 21 ...
$ lon : num 7.36 8.54 7.08 NA 7.45 ...
$ lat : num 46.2 47.4 46.1 NA 46.9 ...
$ Coûts: int 100000 380000 150000 300000 2544000 300000 1897000 500000 2930000 2400000 ...
I got this result.
This is not what I want. I'm trying to plot the data in the de dataset at its locations (several rows sometimes share the same place).
Any help or advice would be appreciated.
Thank you
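Judging only from the code shown, two things look suspicious: the point layer maps lat to x and lon to y (the reverse of the polygon layer), and the polygon layer has no group aesthetic. A minimal sketch of a corrected call follows. The toy da and de below are invented stand-ins, the group column is an assumption about what shp_df[[6]] contains, and Couts is a hypothetical ASCII name for the cost column.

```r
library(ggplot2)

# Toy stand-ins: `da` mimics a polygon table, `de` the point data
da <- data.frame(
  long  = c(6, 10, 10, 6),
  lat   = c(45.8, 45.8, 47.8, 47.8),
  group = "CH"                      # assumed grouping column
)
de <- data.frame(
  lon   = c(7.36, 8.54, 8.54, NA),
  lat   = c(46.2, 47.4, 47.4, NA),
  Couts = c(100000, 380000, 150000, 300000)
)

# Drop rows without coordinates so ggplot2 does not warn about them
de_ok <- de[!is.na(de$lon) & !is.na(de$lat), ]

p <- ggplot() +
  # group keeps each polygon ring separate
  geom_polygon(data = da, aes(x = long, y = lat, group = group),
               fill = "grey90", colour = "grey40") +
  # longitude on x, latitude on y; transparency reveals overlapping points
  geom_point(data = de_ok, aes(x = lon, y = lat, size = Couts),
             colour = "orange", alpha = 0.6) +
  coord_fixed(1.3)
```

For many points at exactly the same place, geom_jitter() or geom_count() may be worth trying instead of geom_point().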

R circular wheel chart

I'm trying to make a wheel chart that has rings. In my result, the lines all seem to go back to zero before continuing to the next point. Is it a discrete/continuous issue? I've tried making Lap.Time and Lap both numeric, to no avail:
f1 <- read.csv("F1 2011 Turkey - Fuel Corrected Lap Times.csv", header = T)
str(f1)
# data.frame: 1263 obs. of 5 variables:
# $ Driver : Factor w/ 23 levels "1","2","3","4",..: 23 23 23 23 23 23 23 23 23 23 ...
# $ Lap : int 1 2 3 4 5 6 7 8 9 10 ...
# $ Lap.Time : num 107 99.3 98.4 97.5 97.4 ...
# $ Fuel.Adjusted.Laptime : num 102.3 94.7 93.9 93.1 93.1 ...
# $ Fuel.and.fastest.lap.adjusted.laptime: num 9.73 2.124 1.321 0.54 0.467 ...
library(ggplot2)
f1$Driver<-as.factor(f1$Driver)
p1 <- ggplot(data=subset(f1, Lap.Time <= 120), aes(x = Lap, y= Lap.Time, colour = Driver)) +
geom_point(aes(colour=Driver))
p2 <- ggplot(subset(f1, Lap.Time <= 120),
aes(x = Lap, y= Lap.Time, colour = Driver, group = 1)) +
geom_line(aes(colour=Driver))
pout <- p1 + coord_polar()
pout2 <- p2 + coord_polar()
pout
pout2
resulting chart image
All the data is in this csv:
https://docs.google.com/spreadsheets/d/1Ef2ewd1-0FM1mJL1o00C6c2gf7HFmanJh8an1EaAq2Q/edit?hl=en_GB&authkey=CMSemOQK#gid=0
Sample of csv:
Driver,Lap,Lap Time,Fuel Adjusted Laptime,Fuel and fastest lap adjusted laptime
25,1,106.951,102.334,9.73
25,2,99.264,94.728,2.124
25,3,98.38,93.925,1.321
25,4,97.518,93.144,0.54
25,5,97.364,93.071,0.467
25,6,97.853,93.641,1.037
25,7,98.381,94.25,1.646
25,8,98.142,94.092,1.488
25,9,97.585,93.616,1.012
25,10,97.567,93.679,1.075
25,11,97.566,93.759,1.155
25,12,97.771,94.045,1.441
25,13,98.532,94.887,2.283
25,14,99.146,95.582,2.978
25,15,98.529,95.046,2.442
25,16,99.419,96.017,3.413
25,17,114.593,111.272,18.668
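A possible explanation, judging only from the code in the question: group = 1 tells geom_line() to treat all rows as one series, so the path jumps from driver to driver at every lap. Grouping by Driver instead draws one line per driver. The sketch below uses a small invented data set in place of the CSV.

```r
library(ggplot2)

# Toy stand-in for the lap-time data; values are invented
f1 <- data.frame(
  Driver   = factor(rep(c("1", "25"), each = 5)),
  Lap      = rep(1:5, times = 2),
  Lap.Time = c(101.2, 98.9, 98.1, 97.6, 97.9,
               106.9, 99.3, 98.4, 97.5, 97.4)
)

# group = Driver gives one path per driver instead of a single zigzag
p2 <- ggplot(subset(f1, Lap.Time <= 120),
             aes(x = Lap, y = Lap.Time, colour = Driver, group = Driver)) +
  geom_line() +
  coord_polar()
p2
```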

Passing arguments to ggplot and facet_grid

I need some help with these lines of code.
My data set:
> str(data.tidy)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 9480 obs. of 11 variables:
$ Country.Name : Factor w/ 248 levels "Afghanistan",..: 234 12 13 20 22 31 17 16 25 28 ...
$ Country.Code : Factor w/ 248 levels "ABW","AFG","AGO",..: 7 12 13 16 17 18 19 21 27 28 ...
$ Year : Factor w/ 56 levels "1960","1961",..: 1 1 1 1 1 1 1 1 1 1 ...
$ InfantMortality : num 137.3 20.3 37.3 29.5 186.9 ...
$ AdolFertilityRate: num 176.9 44.8 48.4 27.1 85.8 ...
$ FertilityRate : num 6.93 3.45 2.69 2.54 6.28 ...
$ LifeExpectancy : num 52.2 70.8 68.6 69.7 37.3 ...
$ TotalUnemp : num NA NA NA NA NA NA NA NA NA NA ...
$ TotalPop : num 92612 10276477 7047539 9153489 2431620 ...
$ Region : Factor w/ 8 levels "","East Asia & Pacific",..: 5 2 3 3 8 8 7 5 4 4 ...
$ IncomeGroup : Factor w/ 6 levels "","High income: nonOECD",..: 2 3 3 3 4 4 5 2 5 6 ...
Reference code that I want to 'functionize':
ggplot(data.tidy,aes(as.numeric(as.character(Year)),y=InfantMortality))+
geom_line(aes(color=Country.Name))+
facet_grid(.~IncomeGroup)+
theme(legend.position="none")+
theme(strip.text.x = element_text(size = 7))+
labs(x='Year', title='Change in mortality rate over time')+
geom_smooth(color='black')
I want to replace data.tidy, InfantMortality, IncomeGroup and title in the example above.
Here was my attempt at the code:
facetedlineplot <- function(df,y,facet,title){
ggplot(df,aes(as.numeric(as.character(Year)),y=y))+
geom_line(aes(color=Country.Name))+
facet_grid(.~facet)+
theme(legend.position="none")+
theme(strip.text.x = element_text(size = 7))+
labs(x='Year',title=title)+
geom_smooth(color='black')
}
The error:
> facetedlineplot(data.tidy,y = 'InfantMortality',facet = 'IncomeGroup',title = 'Title goes here')
Error in layout_base(data, cols, drop = drop) :
At least one layer must contain all variables used for facetting
I have tried aes_string, but I couldn't get it to work. What does the error mean? How can I work around this issue?
Update:
I have some code that partially works now, using reformulate()
facetedlineplot <- function(df,y,facet,title){
year <- as.numeric(as.character(df$Year))
ggplot(df,aes(x=year,y=y))+
geom_line(aes(color=Country.Name))+
facet_grid(paste('.~',reformulate(facet)))+
theme(legend.position="none")+
theme(strip.text.x = element_text(size = 7))+
labs(x='Year',title=title)+
geom_smooth(color='black')
}
> facetedlineplot(data.tidy,y = 'InfantMortality', facet = 'IncomeGroup', title = 'Title goes here')
Warning message:
Computation failed in `stat_smooth()`:
x has insufficient unique values to support 10 knots: reduce k.
>
The plot is still incorrect.
Thank you in advance,
Rahul
I have the solution. Three steps worked for me:
- Change the data type of the Year variable in data.tidy from factor to numeric.
- Use aes_string for the ggplot argument.
- For facet_grid(), several approaches worked:
  - Use as.formula() to pass '~IncomeGroup'
  - Just pass '~IncomeGroup' directly to facet_grid()
Final code:
facetedlineplot <- function(df,y,facet,title){
ggplot(df,aes_string(x = 'Year', y = y))+
geom_line(aes(color=Country.Name))+
facet_grid(facet)+
theme(legend.position="none")+
theme(strip.text.x = element_text(size = 9))+
labs(x='Year',title=title)+
geom_smooth(color='black')
}
d <- data.tidy
d$Year <- as.numeric(as.character(d$Year))
facetedlineplot(d,'InfantMortality','~IncomeGroup','Title')
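As for what the original error means: inside the function, facet_grid(.~facet) looks for a column literally named facet, and aes(y = y) maps the constant string rather than the column it names. Since aes_string() is deprecated in current ggplot2 releases, an alternative sketch using the .data pronoun is shown below. It is not the author's code: the toy data frame is invented, geom_smooth() is left out to keep the example small, and facet is passed as a bare column name rather than '~IncomeGroup'.

```r
library(ggplot2)

facetedlineplot <- function(df, y, facet, title) {
  # .data[[y]] looks up the column whose name is stored in `y`
  ggplot(df, aes(x = Year, y = .data[[y]])) +
    geom_line(aes(colour = Country.Name)) +
    # Build the formula `. ~ <facet>` from the column name
    facet_grid(as.formula(paste(". ~", facet))) +
    theme(legend.position = "none") +
    labs(x = "Year", y = y, title = title)
}

# Toy stand-in for data.tidy; values are invented
toy <- data.frame(
  Country.Name    = rep(c("A", "B", "C"), each = 3),
  Year            = rep(c(1960, 1961, 1962), times = 3),
  InfantMortality = c(137, 135, 133, 20, 19, 18, 37, 36, 35),
  IncomeGroup     = rep(c("Low", "High", "High"), each = 3)
)

p <- facetedlineplot(toy, "InfantMortality", "IncomeGroup", "Title goes here")
```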
