barplot: selecting data in R

I have a problem building a barplot.
I am working on air traffic in different countries. I would like to get a barplot for each country, with the airport names on the X axis and the number of airlines using each airport on the Y axis.
My plan is to write the script for one country and replicate it manually for the others.
My data has the following columns:
Country / airport / destination.
So each row is actually one airline that is using the airport.
Do you have an idea about how to do this?
For now I have this idea:
UK<-traffic[traffic$Country=="UK",]
UK$airport <- as.factor(UK$airport)
countUK<-table(UK$airport)
barplot(countUK)
This is not working: the X axis shows a bunch of airports that are not in the UK...
Thanks for your help

Answer found:
You could try dropping the unused factor levels, i.e. add
UK <- droplevels(UK) after the line UK$airport <- as.factor(UK$airport).
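A minimal sketch of why the extra airports show up and what droplevels() changes. The toy traffic data frame and its airport codes are made up for illustration; only the column names come from the question. It assumes airport was already a factor before subsetting, which is the case in which unused levels survive the subset.

```r
# Hypothetical data: one row per airline using an airport
traffic <- data.frame(
  Country     = factor(c("UK", "UK", "UK", "FR", "FR")),
  airport     = factor(c("LHR", "LHR", "MAN", "CDG", "ORY")),
  destination = c("JFK", "CDG", "DUB", "LHR", "MAD")
)

UK <- traffic[traffic$Country == "UK", ]

# The subset still remembers every airport level, so table() reports
# zero counts for the non-UK airports
countBefore <- table(UK$airport)

# Drop the levels that no longer occur in the subset
UK <- droplevels(UK)
countUK <- table(UK$airport)

barplot(countUK, xlab = "Airport", ylab = "Number of airlines")
```

Comparing countBefore with countUK shows the zero-count airports disappearing once the unused levels are dropped.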

Related

Scatterplot for comparing species abundance

I have a homework question that states the following:
The file “channel_islands_counts_edit.csv” contains survey data on temperate rocky reef fishes from the Channel Islands, collected at many sites over many years. The data has columns for Year, Date, Site, count, and SpeciesName (broken into adults and juveniles). The version of the data that I’ve given you looks at 16 sites over 27 years, with count data for 27 categories of fish. Imagine we’re interested in whether the abundance of different species are correlated across sites (to get a sense for whether species have similar habitat preferences and/or interact with each other), and whether the across-site correlations are consistent over time. To visualize this, make some code that does the following:
For each year, draw a scatterplot that compares the abundance of Hypsypops rubicundus (adults) and the abundance of Paralabrax clathratus (adults) across sites. Feel free to transform the data for plotting purposes, if you think that helps you see any patterns.
I imported my data set, and ran the following code which is giving me 27 plots, with Site as x and Count as y, but there is no data shown in the plots.
head(channel_islands)
sapply(channel_islands, class)
levels(channel_islands$SpeciesName)
par(mfrow= c(6,5)) # set the plotting area into a 6 row*5 column array
for (i in 1:27) {
HR11<-subset(channel_islands,SpeciesName=="Hypsypops rubicundus,adult"[i] & Site==11)
PC15<-subset(channel_islands,SpeciesName=="Paralabrax clathratus,adult"[i] & Site==15)
with(HR11,plot(count~Site,type='b',pch=19,ylim=c(0,10),xlim=c(0,16),col='green',main=i))
with(PC15,plot(count~Site,type='b',pch=19,ylim=c(0,10),xlim=c(0,16),col='blue',main=i))
}
If anyone could help me figure out how to compare species abundance across sites, over 27 years, I would really appreciate it.
The expression "Hypsypops rubicundus,adult"[i] doesn't really make sense. It indexes a length-one character vector, so it works when i == 1, but for any larger i it just returns NA. SpeciesName == NA is never TRUE, so you get an empty subset.
Consider looking into using ggplot2 with facet_grid to quickly make multiple plots without the loop. The R Graphics Cookbook has good examples on using facets.
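As a rough illustration of both points, here is a base-R sketch (a swapped-in alternative to the ggplot2 facets suggested above). The toy channel_islands data is randomly generated and hypothetical; only the column names follow the question. It assumes one count per species, site and year; with several survey dates per year you would aggregate first.

```r
# Indexing a length-one character vector past its first element gives NA
sp <- "Hypsypops rubicundus,adult"
sp[2]

# Hypothetical data in the layout described in the question
set.seed(1)
channel_islands <- expand.grid(
  Year = 2001:2003,
  Site = 1:4,
  SpeciesName = c("Hypsypops rubicundus,adult", "Paralabrax clathratus,adult"),
  stringsAsFactors = FALSE
)
channel_islands$count <- rpois(nrow(channel_islands), lambda = 5)

years <- sort(unique(channel_islands$Year))
par(mfrow = c(1, length(years)))  # one panel per year
for (yr in years) {
  hr <- subset(channel_islands,
               Year == yr & SpeciesName == "Hypsypops rubicundus,adult",
               select = c(Site, count))
  pc <- subset(channel_islands,
               Year == yr & SpeciesName == "Paralabrax clathratus,adult",
               select = c(Site, count))
  # Pair the two species site by site
  both <- merge(hr, pc, by = "Site", suffixes = c("_hr", "_pc"))
  # log1p() is one possible transformation for count data
  plot(log1p(both$count_hr), log1p(both$count_pc), pch = 19,
       xlab = "log(1 + H. rubicundus)", ylab = "log(1 + P. clathratus)",
       main = yr)
}
```

Each panel then has one point per site, with one species on each axis, which is what the homework asks for.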

Stacked bar chart 4 variables with ggplot

I am very new to this, and I wanted to show the various ways in which I tried to reshape/melt the data. Here is my data in three different variations:
Version 1:
year,type,total,action,perc
2015,v,"1,199,310",crime,42.16
2015,p,"8,024,115",crime,18.24
2015,v,"505,681",arrest,42.16
2015,p,"1,463,213",arrest,18.24
2016,v,"1,250,162",crime,32.85
2016,p,"7,928,530",crime,17.07
2016,v,"410,717",arrest,32.85
2016,p,"1,353,283",arrest,17.07
2017,v,"1,247,321",crime,41.58
2017,p,"7,694,086",crime,16.24
2017,v,"518,617",arrest,41.58
2017,p,"1,249,757",arrest,16.24
Version 2:
year,type,crime,arrest,perc
2015,1,"1,199,310","505,681",42.16
2015,2,"8,024,115","1,463,213",18.24
2016,1,"1,250,162","410,717",32.85
2016,2,"7,928,530","1,353,283",17.07
2017,1,"1,247,321","518,617",41.58
2017,2,"7,694,086","1,249,757",16.24
Version 3:
df <- vpcrimetotal
year,vcrime,varrest,varrestperc,pcrime,parrest,parrestperc
2017,"1,247,321","518,617",0.4158,"7,694,086","1,249,757",0.1624
2016,"1,250,162","410,717",0.3285,"7,928,530","1,353,283",0.1707
2015,"1,199,310","505,681",0.4216,"8,024,115","1,463,213",0.1824
The idea is to show the total number of violent crimes versus property crimes from 1990-2017, with the number of arrests (labeled as a percent) inside each bar, based on crime type (property or violent). The preference is to stack all four into one bar per year, with a different color for each.
I found these, which helped, but I was still confused about how to fit my data into them: "how to create stacked bar charts for multiple variables with percentages", though I would like the result to look more like "Count and Percent Together using Stack Bar in R".
I have tried these data sets with that code, but it would probably be confusing if I posted all the different attempts that don't work.
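One possible starting point, sketched in base R rather than ggplot, uses the Version 1 layout. It assumes the totals are read as text because of the thousands separators, and it only stacks the four type/action combinations per year; it does not add the percent labels. The segment column is a helper introduced here for illustration.

```r
# Version 1 of the data, as posted
v1 <- read.csv(text = 'year,type,total,action,perc
2015,v,"1,199,310",crime,42.16
2015,p,"8,024,115",crime,18.24
2015,v,"505,681",arrest,42.16
2015,p,"1,463,213",arrest,18.24
2016,v,"1,250,162",crime,32.85
2016,p,"7,928,530",crime,17.07
2016,v,"410,717",arrest,32.85
2016,p,"1,353,283",arrest,17.07
2017,v,"1,247,321",crime,41.58
2017,p,"7,694,086",crime,16.24
2017,v,"518,617",arrest,41.58
2017,p,"1,249,757",arrest,16.24', stringsAsFactors = FALSE)

# Strip the thousands separators so the totals become numeric
v1$total <- as.numeric(gsub(",", "", v1$total))

# One stacked segment per type/action combination (hypothetical helper column)
v1$segment <- paste(v1$type, v1$action)

# Rows = segments, columns = years: the shape barplot() stacks by default
m <- xtabs(total ~ segment + year, data = v1)

barplot(m, col = c("grey60", "grey30", "salmon", "firebrick"),
        legend.text = TRUE, args.legend = list(x = "topright", cex = 0.7),
        xlab = "Year", ylab = "Count")
```

Note that stacking arrests on top of crimes adds the two counts together in the bar height, so whether this is the right picture depends on how the totals are meant to be read.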

Simple barplot displaying voting of a county

I'm fairly new to R and making plots, so sorry about that. I have a dataset of the voting for counties and I want to make a barplot showing how many mandates each county voted for.
What I've done so far is to extract one row, which includes the name of the county and the number of mandates it voted for the different parties (which are headers).
Fylker AP FRP H KrF SP
Ostlandet 3 2 2 0 1
Sorry for the bad display of code, whenever I paste the code, it looks really weird, despite indenting.
The data is called "Ostlandet" and is only 1 row. So as I tried to explain above, I want to make some sort of barplot out of this. The idea is to have the different parties on the x-axis and number of votes on y. I've tried this so far
ggplot(Ostfold, aes(x = Ostfold[1,])) +
geom_histogram(binwidth = 20)
Which just gave me tons of errors.
I've also tried using barplot, but I just can't seem to figure this out.
Sorry, this is probably super easy, but I'm just getting into coding.
You have a few issues. First, there's no need for extracting rows. Second, the data are in "wide" format (one column per party) instead of "long" format (one column naming the party and one column holding the value). Third, you already have the counts, so geom_col() is a better fit than geom_histogram(), which bins raw values.
The gather() function from the tidyr package will get your data from wide into long:
library(tidyr)
library(ggplot2)
Ostfold %>%
gather(Mandate, Votes, -Fylker)
That should generate something like this:
Fylker Mandate Votes
1 Ostlandet AP 3
2 Ostlandet FRP 2
3 Ostlandet H 2
4 Ostlandet KrF 0
5 Ostlandet SP 1
You can pass that to ggplot:
Ostfold %>%
gather(Mandate, Votes, -Fylker) %>%
ggplot(aes(Mandate, Votes)) + geom_col()
Result for your one row:
For a dataset with multiple counties, you might want to add + facet_wrap(~Fylker) to facet the plot by county, depending on how many there are.
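Since the question also mentions trying barplot(), here is a minimal base-R sketch of the same plot. The data frame is rebuilt from the single row shown in the question; unlist() on the row (minus the Fylker column) gives a named numeric vector, which is the form barplot() expects.

```r
# The one-row data from the question
Ostfold <- data.frame(Fylker = "Ostlandet",
                      AP = 3, FRP = 2, H = 2, KrF = 0, SP = 1,
                      stringsAsFactors = FALSE)

# Drop the county name and flatten the row into a named vector
votes <- unlist(Ostfold[1, -1])

barplot(votes, xlab = "Party", ylab = "Mandates",
        main = Ostfold$Fylker[1])
```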

How to plot data from Excel using the R corrplot function?

I am trying to learn R, and use the corrplot library to draw Y:City and X: Population graph. I wrote the below code:
When you look at the picture above, there are 2 columns City and population. When I run the code I get this error message:
Error in cor(Illere_Gore_Nufus) : 'x' must be numeric.
My excel data:
In general, a correlation plot (scatter plot) can be drawn only when you have two continuous variables. Correlation is a value that tells you how strongly two continuous variables are linearly related. It always falls between -1 and 1: a value of -1 indicates a perfect negative linear relationship, a value of 1 indicates a perfect positive linear relationship, and a value of 0 means there is no linear relationship between the two variables, although there could still be a curvilinear relationship.
For example
Area of the land Vs Price of the land
Here is the Data
The correlation value for this data is 0.896, which means that there is a strong linear correlation between Area of the land and Price of the land (Obviously!).
Scatter plot in R would look like this
The R code would be
area<-c(650,785,880,990,1100,1250,1350,1800,2200,2800)
price<-c(250,275,280,290,350,340,400,335,420,460)
cor(area,price)
plot(area,price)
In Excel, for the same example, you can select the two columns, go to Insert > Scatter plot (under charts section)
In your case, the information can be plotted as a bar graph with city on the y axis and population on the x axis, or vice versa!
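A small sketch of such a bar graph in base R. The city names and population figures below are placeholders, since the original Excel data is not shown; only the two-column layout (City, Population) is taken from the question.

```r
# Placeholder data in the City / Population layout
cities <- data.frame(
  City       = c("City A", "City B", "City C"),
  Population = c(1500000, 820000, 430000),
  stringsAsFactors = FALSE
)

# Sort so the largest city ends up at the top of the horizontal chart
cities <- cities[order(cities$Population), ]

# horiz = TRUE puts the cities on the y axis and population on the x axis
barplot(cities$Population, names.arg = cities$City,
        horiz = TRUE, las = 1, xlab = "Population")
```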
Hope I have answered your query!
Some assumptions
You are asking how to do this in Excel, but your question is tagged R and Power BI (also RStudio, but that has been edited away), so I'm going to show you how to do this with R and Power BI. I'm also going to show you why you got that error message, and also why you would get an error message either way because your dataset is just not sufficient to make a correlation plot.
My answer
I'm assuming you would like to make a correlation plot of the population between the cities in your table. For that, the table needs more information than only one year for each city. I would check your data sources and see if you can come up with population numbers for, let's say, the last 10 years. Lacking the exact numbers for the cities in your table, I'm going to use some semi-made-up numbers for the population of the 10 most populous countries (following your data structure):
Country 2017 2016 2015 2014 2013
China 1415045928 1412626453 1414944844 1411445597 1409517397
India 1354051854 1340371473 1339431384 1343418009 1339180127
United States 326766748 324472802 325279622 324521777 324459463
Indonesia 266794980 266244787 266591965 265394107 263991379
Brazil 210867954 210335253 209297939 209860881 209288278
Pakistan 200813818 199761249 200253292 197655630 197015955
Nigeria 195875237 192568158 195757661 191728478 190886311
Bangladesh 166368149 165630262 165936711 166124290 164669751
Russia 143964709 143658415 143146914 143341653 142989754
Mexico 137590740 137486490 136768870 137177870 136590740
Writing and debugging R code in Power BI is a real pain, so I would recommend installing RStudio, writing your little R snippets there, and then pasting them into Power BI.
The reason for your error message is that the function cor() only takes numerical data as arguments. In your code sample, the city names are passed in as well. And there are more potential traps in your code sample: you have to make sure that your dataset is numeric, and you have to make sure that your dataset has a shape that cor() will accept.
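To illustrate those two requirements without any extra packages, here is a small sketch using three rows and three years from the table above, typed in by hand. The variable names pop, num and m are just illustrative.

```r
# A slice of the table above (check.names = FALSE keeps "2017" etc. as names)
pop <- data.frame(
  Country = c("China", "India", "United States"),
  `2017`  = c(1415045928, 1354051854, 326766748),
  `2016`  = c(1412626453, 1340371473, 324472802),
  `2015`  = c(1414944844, 1339431384, 325279622),
  check.names = FALSE, stringsAsFactors = FALSE
)

# With the Country column included, cor() stops with an error
res <- try(cor(pop), silent = TRUE)
inherits(res, "try-error")

# Keep only the numeric columns and use the countries as row names
num <- pop[, -1]
rownames(num) <- pop$Country

# Transpose so that each country is a column (a variable) and each
# year is a row (an observation), then correlate the countries
m <- cor(t(num))
m
```

The result is a square matrix with one row and one column per country, which is the kind of input corrplot() is given in the script.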
Below is an R script that will do just that. Copy the data above, and store it in a file called data.xlsx on your C drive.
The Code
library(corrplot)
library(readxl)
# Read data
setwd("C:/")
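# read_excel() returns a tibble; convert to a plain data frame so row names can be set
data <- as.data.frame(read_excel("data.xlsx"))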
# Set Country names as row index
rownames(data) <- data$Country
# Remove Country from dataframe
data$Country <- NULL
# Transpose data into a readable format for cor()
data <- data.frame(t(data))
# Plot data
corrplot(cor(data))
The plot
Power BI
In Power BI, you need to import the data before you use it in an R visual:
Copy this:
Country,2017,2016,2015,2014,2013
China,1415045928,1412626453,1414944844,1411445597,1409517397
India,1354051854,1340371473,1339431384,1343418009,1339180127
United States,326766748,324472802,325279622,324521777,324459463
Indonesia,266794980,266244787,266591965,265394107,263991379
Brazil,210867954,210335253,209297939,209860881,209288278
Pakistan,200813818,199761249,200253292,197655630,197015955
Nigeria,195875237,192568158,195757661,191728478,190886311
Bangladesh,166368149,165630262,165936711,166124290,164669751
Russia,143964709,143658415,143146914,143341653,142989754
Mexico,137590740,137486490,136768870,137177870,136590740
Save it as countries.csv in a folder of your choosing, and pick it up in Power BI using
Get Data | Text/CSV, click Edit in the dialog box, and in the Power Query Editor, click Use First Row as headers so that you have this table in your Power Query Editor:
Click Close & Apply and make sure that you've got the data available under VISUALIZATIONS | FIELDS:
Click R under VISUALIZATIONS:
Select all columns under FIELDS | countries so that you get this setup:
Take parts of your R snippet that we prepared above
library(corrplot)
# Set Country names as row index
data <- dataset
rownames(data) <- data$Country
# Remove Country from dataframe
data$Country <- NULL
# Transpose data into a readable format for cor()
data <- data.frame(t(data))
# Plot data
corrplot(cor(data))
And paste it into the Power BI R script Editor:
Click Run R Script:
And you're gonna get this:
That's it!
If you change the procedure to import the data from an Excel file instead of a text file (using Get Data | Excel), you've successfully combined the powers of Excel, Power BI and R to produce a correlation plot!
I hope this is what you were looking for!

R "maps" package and choropleths

I would like to make a choropleth with the maps package in R. I have data which I have constructed to create bins and associate color names with those bins. Now, I need to use the col= argument to point the colors to the counties, in this example. How do I construct that argument? I would have thought that constructing a data frame would associate the county and color on the same line? Is that not true? So far I have the following
Example Data:
County | Value | Bin | Color
alamance | 100 | 1 | white
brunswick | 1000 | 2 | red
... through 100 counties
R code (which does not work):
library("maps")
DATA <- read.csv("~/Example_Data.csv")
DATA$County <- as.character(DATA$County)
DATA$Color <- as.character(DATA$Color)
NC <- map('county', 'north carolina', col= DATA$Color, Fill=TRUE)
So, after many iterations here is the essence of the solution. Instead of giving the R code which made it work (pretty bland), here are the rules that helped solve the problem.
The county.fips data included in the package has a column with all state and county names. This revealed the format that county names have to match: all lowercase, "state,county", with no space after the comma.
For the NC subset there are 102 entries, not 100, because Currituck County is subdivided into three entities. This was the source of most/all of the issues and was difficult to diagnose but easy to solve.
Solution 1 - Match a vector of colors to the vector of counties. 102 color entries IN THE PROPER ALPHA ORDER will produce a correct choropleth. Fastest, but also the least convenient if you were trying to do this for, say, all counties in the U.S.
Solution 2 - Add fips codes to the original data and then match on fips. Since the county.fips file has the Currituck entities listed as "north carolina,currituck:main", etc., this is still going to take some manipulation or finding an external fips reference. This is the method used in the maps documentation, but it would have taken too long, so I preferred the former. However, taking the time would allow you to approach a national dataset, for instance.
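A hedged sketch of one way to get the colors into polygon order by name instead of typing them out in order (this is not the code from the original solution). match_colors is a hypothetical helper; the currituck color and the sub-polygon name "currituck:knotts" are illustrative assumptions, while "currituck:main" is the name quoted above. The map() call only runs if the maps package is installed.

```r
# Hypothetical data in the County / Color layout from the question
DATA <- data.frame(
  County = c("alamance", "brunswick", "currituck"),
  Color  = c("white", "red", "blue"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn "state,county:part" into "county",
# then look up that county's color
match_colors <- function(poly_names, data) {
  county <- sub("^[^,]*,", "", poly_names)  # drop the "state," prefix
  county <- sub(":.*$", "", county)         # drop ":main" and similar suffixes
  data$Color[match(county, data$County)]
}

# Illustrative polygon names in the "state,county" format
poly_names <- c("north carolina,alamance",
                "north carolina,brunswick",
                "north carolina,currituck:knotts",
                "north carolina,currituck:main")
cols <- match_colors(poly_names, DATA)

if (requireNamespace("maps", quietly = TRUE)) {
  # Ask map() for the polygon names in drawing order, without plotting
  nc_names <- maps::map("county", "north carolina",
                        plot = FALSE, namesonly = TRUE)
  # Counties missing from DATA get NA and are left unfilled
  maps::map("county", "north carolina", fill = TRUE,
            col = match_colors(nc_names, DATA))
}
```

Because every Currituck sub-polygon maps back to "currituck", the color vector has one entry per polygon even though the data has one row per county.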

Resources