I have this dataset
airline avail_seat_km_per_week Number Year
1: Aer Lingus 320906734 2 1985-99
2: Aeroflot* 1197672318 76 1985-99
3: Aerolineas Argentinas 385803648 6 1985-99
4: Aeromexico* 596871813 3 1985-99
5: Air Canada 1865253802 2 1985-99
---
108: United / Continental* 7139291291 14 2000-14
109: US Airways / America West* 2455687887 11 2000-14
110: Vietnam Airlines 625084918 1 2000-14
111: Virgin Atlantic 1005248585 0 2000-14
112: Xiamen Airlines 430462962 2 2000-14
These are some instances of the dataset:
data.frame(airline=c("Aer Lingus", "Aeroflot*", "Aerolineas Argentinas", "Aeromexico*", "Air Canada", "Aer Lingus", "Aeroflot*", "Aerolineas Argentinas", "Aeromexico*", "Air Canada"), Number=c(2, 76, 6, 3, 2,0 ,6,1,5,2), Year=c("1985-99", "1985-99", "1985-99", "1985-99", "1985-99", "2000-14", "2000-14", "2000-14", "2000-14", "2000-14"))
It contains the number of crashes for airlines around the world in two periods, 1985-99 and 2000-14. I want to draw a scatterplot of the number of crashes in 1985-99 against the number in 2000-14. What is a neat way to do this with the dplyr and ggplot2 packages, preferably using pipes?
Please let me know if there is anything I could do to further specify the problem. I appreciate your help!
When asking for help with plots in general, and ggplot in particular, it's helpful if you're very clear about what data goes with each dimension: x, y, color, etc.
library(tidyr)
library(ggplot2)
# (calling your data d)
d %>%
  # widen the data so each plot dimension gets a column;
  # id_cols keeps one row per airline even if other columns
  # (such as avail_seat_km_per_week) differ between the two periods
  pivot_wider(id_cols = airline, names_from = Year, values_from = Number) %>%
  # use backticks for non-standard column names (because of the dash in this case)
  ggplot(aes(x = `1985-99`, y = `2000-14`, color = airline)) +
  geom_point()
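To make the answer above reproducible, here is a minimal sketch that builds the data frame from the sample rows in the question and then widens and plots it. The object names `wide` and `p` and the dashed reference line are my own additions for illustration, not part of the original answer.

```r
library(tidyr)
library(ggplot2)

# Sample rows copied from the question
d <- data.frame(
  airline = rep(c("Aer Lingus", "Aeroflot*", "Aerolineas Argentinas",
                  "Aeromexico*", "Air Canada"), times = 2),
  Number  = c(2, 76, 6, 3, 2, 0, 6, 1, 5, 2),
  Year    = rep(c("1985-99", "2000-14"), each = 5)
)

# One row per airline, one column per period
wide <- pivot_wider(d, id_cols = airline,
                    names_from = Year, values_from = Number)

p <- ggplot(wide, aes(x = `1985-99`, y = `2000-14`, color = airline)) +
  geom_point() +
  # dashed y = x line: points below it have a lower count in 2000-14
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")

print(p)
```

The reference line is optional; it only makes it easier to see which airlines went up or down between the two periods.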
I am using the csv version of the Lahman 2018 database found here: http://www.seanlahman.com/baseball-archive/statistics/.
In R, I would like to identify how many extra-base hits all Mets rookies have hit in their rookie seasons by game 95. I want to find out which Met rookie hit the most extra-base hits by game 95.
I have been experimenting with dplyr functions including select, filter, and summarize.
The main thing I am uncertain about is how to get only each Mets players' doubles, triples, and homers for the first 95 games of his first season.
This code shows more of what I have done than how I think my problem can be solved; for that, I am seeking tips.
library(dplyr)
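df %>%
  filter(teamID == 'NYN') %>%
  # 2B and 3B are not syntactic R names, so they need backticks
  # (read.csv() with default settings renames them to X2B and X3B instead)
  select(playerID, yearID, G, `2B`, `3B`, HR) %>%
  group_by(playerID, yearID) %>%
  summarise(xbh = sum(`2B`) + sum(`3B`) + sum(HR)) %>%
  arrange(desc(xbh))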
Here is how I would like the output to appear:
Player Season 2B 3B HR XBH
x 1975 10 2 8 20
y 1980 5 5 5 15
z 2000 9 0 4 13
and so on.
I would like the XBH to be in descending order.
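One possible way to get a table in that shape, sketched on toy data. The numbers below are made up and are not real Lahman rows, and two assumptions are baked in: a player's "rookie season" is taken to be his first yearID in the table, and the table holds season totals, so the cut-off at game 95 is not applied here (that would need game-level data).

```r
library(dplyr)

# Toy stand-in for the batting table (hypothetical values)
batting <- data.frame(
  playerID = c("x", "x", "y", "z", "z"),
  yearID   = c(1975, 1976, 1980, 1999, 2000),
  teamID   = c("NYN", "NYN", "NYN", "BOS", "NYN"),
  `2B`     = c(10, 12, 5, 3, 9),
  `3B`     = c(2, 1, 5, 0, 0),
  HR       = c(8, 9, 5, 1, 4),
  check.names = FALSE   # keep the names 2B and 3B as they are
)

rookie_xbh <- batting %>%
  group_by(playerID) %>%
  filter(yearID == min(yearID)) %>%   # each player's first season only
  ungroup() %>%
  filter(teamID == "NYN") %>%         # ... and only if it was with the Mets
  group_by(playerID, yearID) %>%
  summarise(`2B` = sum(`2B`), `3B` = sum(`3B`), HR = sum(HR),
            .groups = "drop") %>%
  mutate(XBH = `2B` + `3B` + HR) %>%
  arrange(desc(XBH))

print(rookie_xbh)
```

In this toy example player z is dropped because his first season was with another team.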
I have a dataset with longitude and latitude coordinates. I want to retrieve the corresponding census tract for each point. Is there a dataset or API that would allow me to do this?
My dataset looks like this:
lat lon
1 40.61847 -74.02123
2 40.71348 -73.96551
3 40.69948 -73.96104
4 40.70377 -73.93116
5 40.67859 -73.99049
6 40.71234 -73.92416
I want to add a column with the corresponding census tract.
Final output should look something like this (these are not the right numbers, just an example).
lat lon Census_Tract_Label
1 40.61847 -74.02123 5.01
2 40.71348 -73.96551 20
3 40.69948 -73.96104 41
4 40.70377 -73.93116 52.02
5 40.67859 -73.99049 58
6 40.71234 -73.92416 60
The tigris package includes a function called call_geolocator_latlon that should do what you're looking for. Here is some code using it:
> library(tigris)
> coord <- data.frame(lat = c(40.61847, 40.71348, 40.69948, 40.70377, 40.67859, 40.71234),
+ long = c(-74.02123, -73.96551, -73.96104, -73.93116, -73.99049, -73.92416))
>
> coord$census_code <- apply(coord, 1, function(row) call_geolocator_latlon(row['lat'], row['long']))
> coord
lat long census_code
1 40.61847 -74.02123 360470152003001
2 40.71348 -73.96551 360470551001009
3 40.69948 -73.96104 360470537002011
4 40.70377 -73.93116 360470425003000
5 40.67859 -73.99049 360470077001000
6 40.71234 -73.92416 360470449004075
As I understand it, the 15-digit code is several codes put together (the first two digits are the state, the next three the county, the following six the tract, and the last four the block). To get just the census tract code, I'd use the substr function to pull out those six digits, which sit at positions 6 to 11.
> coord$census_tract <- substr(coord$census_code, 6, 11)
> coord
lat long census_code census_tract
1 40.61847 -74.02123 360470152003001 015200
2 40.71348 -73.96551 360470551001009 055100
3 40.69948 -73.96104 360470537002011 053700
4 40.70377 -73.93116 360470425003000 042500
5 40.67859 -73.99049 360470077001000 007700
6 40.71234 -73.92416 360470449004075 044900
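The same substr idea can be used to split the whole code into its parts. This is a small sketch that runs without any network call; it uses the first code from the output above, and the names `geoid_parts` and `tract_label` are mine. The division by 100 assumes the usual convention that the last two tract digits are a decimal suffix.

```r
# First 15-digit code from the output above
code <- "360470152003001"

geoid_parts <- c(
  state  = substr(code, 1, 2),    # 2-digit state code
  county = substr(code, 3, 5),    # 3-digit county code
  tract  = substr(code, 6, 11),   # 6-digit tract code
  block  = substr(code, 12, 15)   # 4-digit block code
)
print(geoid_parts)

# Tract label with the implied decimal point, e.g. "015200" becomes 152
tract_label <- as.numeric(geoid_parts[["tract"]]) / 100
print(tract_label)
```

Because substr is vectorised, the same calls work on the whole coord$census_code column at once.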
I hope that helps!
I am trying to learn the R programming language to analyse and visualize my data. I have made some good progress so far and I am really enjoying learning R, but I am stumped here.
I am having some trouble creating line graphs for products in specific categories. I have no problem creating graphs that show sales across all categories, but I would like to pick a particular category and show the sales of its products.
This is what my data set looks like.
Can someone show me how I could do this? For example, I would like to create a line graph of the sales of products in the Bakery category, where the X axis has the product name and the Y axis has the quantity sold.
Any help would be greatly appreciated.
Next time, please include the head of your data. This can be done using
head(Store_sales)
ProductID category sales product
1 101 Bakery 9468 White bread
2 102 Personal Care 9390 Everday Female deodorant
3 103 Cereal 9372 Weetabix
4 104 Produce 9276 Apple
5 105 Meat 9268 Chicken Breasts
6 106 Bakery 9252 Pankcakes
I reproduced the relevant fields to help you out. The first thing to do is to filter the Bakery items out of the categories.
> install.packages("tidyverse")
> library(tidyverse)
Store sales before filter
> Store_sales
ProductID category sales product
1 101 Bakery 9468 White bread
2 102 Personal Care 9390 Everday Female deodorant
3 103 Cereal 9372 Weetabix
4 104 Produce 9276 Apple
5 105 Meat 9268 Chicken Breasts
6 106 Bakery 9252 Pankcakes
7 107 Produce 9228 Carrot
Keep only the "Bakery" rows of the category column and store them in Store_sales_bakery
> Store_sales_bakery <- filter(Store_sales, category == "Bakery")
What Store_sales_bakery includes
> Store_sales_bakery
ProductID category sales product
1 101 Bakery 9468 White bread
2 106 Bakery 9252 Pankcakes
Unfortunately, the picture you gave us does not contain enough information to produce a meaningful line graph (there is only one data point per product), so I created a point plot of the Bakery products instead.
ggplot(Store_sales_bakery, aes(x = product, y = sales)) + geom_point()
[Image: point plot]
Here is a bar plot of the same two variables
ggplot(Store_sales_bakery, aes(x = product, y = sales)) + geom_bar(stat = "identity")
[Image: bar plot]
If you had enough data to make a line graph, you would replace geom_bar() or geom_point() with geom_line(). Because product is a discrete variable, you would also need group = 1 inside aes() so that the points are joined into a single line.
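For completeness, a minimal sketch of what the line version could look like with the two Bakery rows shown above. The data frame is retyped by hand here so the snippet runs on its own; with only two products the line is not very informative, so treat it as a template.

```r
library(ggplot2)

# The two Bakery rows from the output above, retyped by hand
Store_sales_bakery <- data.frame(
  product = c("White bread", "Pankcakes"),
  sales   = c(9468, 9252)
)

# group = 1 joins all points with one line; without it ggplot2 treats
# every product as its own group and draws no line on a discrete x axis
p <- ggplot(Store_sales_bakery, aes(x = product, y = sales, group = 1)) +
  geom_line() +
  geom_point()

print(p)
```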
Here is a link to ggplot cheatsheet that may help you in the future
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
So I have the following data:
I have 5 regions and the years 1998-2009. What I would like to do is classify the countries by region for each year. I'm new to R, so the only step I've taken so far is the following:
finalData$Region = factor(finalData$Region, levels = c('Former Socialist Bloc', 'Independent', 'Western Europe','Scandinavia', 'Former Yugoslavia'), levels = c(1, 2, 3, 4, 5))
but I get this error:
Error in factor(finalData$Region, levels = c("Former Socialist Bloc",
: formal argument "levels" matched by multiple actual arguments
Could you please tell me how to fix this and suggest an approach to the classification? Thank you!
This is a terribly formulated question. But I am going to imagine this is roughly what you want to do to give you an idea.
You have samples (rows) which are countries and you have some variables (columns) which are observations about these samples. You want to use all/multiple variables (multivariate analysis) to cluster the countries. If this is what you want to do, then below is one approach.
I am creating a data.frame with a made-up dataset.
dfr <- data.frame(Country=c("USA","UK","Germany","Austria","Taiwan","China","Japan","South Korea"),
Year=factor(c(2009,2009,2009,2010,2010,2010,2010,2011)),
Language=c("English","English","German","German","Chinese","Chinese","Japanese","Korean"),
Region=c("North America","Europe","Europe","Europe","Asia","Asia","Asia","Asia"))
> head(dfr)
Country Year Language Region
1 USA 2009 English North America
2 UK 2009 English Europe
3 Germany 2009 German Europe
4 Austria 2010 German Europe
5 Taiwan 2010 Chinese Asia
6 China 2010 Chinese Asia
First thing you want to do is move the country names out to row names because country names are the sample labels and they are not observations.
rownames(dfr) <- dfr$Country
dfr$Country <- NULL
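Now you want all the remaining variables to be numeric or factors. Do that manually and carefully. I only have categorical observations. Finally, we want to recode all factors to integers, so that our final data.frame contains only numbers. (The as.factor() step matters in R 4.0 and later, where data.frame() keeps strings as character vectors and as.numeric() on those would return NA.)
dfr1 <- as.data.frame(sapply(dfr, function(x) as.numeric(as.factor(x))))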
rownames(dfr1) <- rownames(dfr)
> head(dfr1)
Year Language Region
USA 1 2 3
UK 1 2 2
Germany 1 3 2
Austria 2 3 2
Taiwan 2 1 1
China 2 1 1
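Now run some multivariate method on the numeric data. Here, for example, a PCA (strictly a dimension-reduction method rather than a clustering algorithm, but useful for seeing groups).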
pc <- prcomp(dfr1)
pcval <- pc$x
> head(pcval)
PC1 PC2 PC3
USA -1.04369951 1.2743507 0.36120850
UK -0.87597336 0.5087910 -0.25990844
Germany 0.06243255 0.8258430 -0.39728520
Austria 0.36452903 0.2660249 0.37429849
Taiwan -1.34455665 -1.1336389 0.02793507
China -1.34455665 -1.1336389 0.02793507
Combine the output principal components with original data.
pcval1 <- cbind(pcval,dfr)
rownames(pcval1) <- rownames(dfr)
> head(pcval1)
PC1 PC2 PC3 Year Language Region
USA -1.04369951 1.2743507 0.36120850 2009 English North America
UK -0.87597336 0.5087910 -0.25990844 2009 English Europe
Germany 0.06243255 0.8258430 -0.39728520 2009 German Europe
Austria 0.36452903 0.2660249 0.37429849 2010 German Europe
Taiwan -1.34455665 -1.1336389 0.02793507 2010 Chinese Asia
China -1.34455665 -1.1336389 0.02793507 2010 Chinese Asia
What PCA is and what is going on here is out of the scope of this answer. In short, it creates new variables (principal components) that are linear combinations of all your observed variables.
Scatterplot the principal components 1 and 2. Colour points by some variable. Say "Region". Add country names as text labels.
library(ggplot2)
pcval1$Country <- rownames(pcval1)
ggplot(pcval1,aes(x=PC1,y=PC2,colour=Region))+
geom_point(size=3)+
geom_text(aes(label=Country),hjust=1.5)+
theme_bw(base_size=15)
Now we see that countries have clustered together based on the observations in your dataset. We have the countries roughly grouping by Region. Obviously, there may or may not be any clustering depending on your data.
This is just an example. If you blindly follow this, you may be violating all sorts of statistical assumptions and what-not. You have to take into account what kind of data distributions you are dealing with and what clustering algorithm is suitable for that data etc.
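As for the error message in the question: it appears because the argument levels is passed twice to factor(). The second one was presumably meant to be labels. A minimal sketch of that fix follows; the vector `region` is a hypothetical stand-in for finalData$Region, since the real data is not shown.

```r
# Hypothetical stand-in for finalData$Region
region <- c("Independent", "Scandinavia", "Western Europe", "Independent")

region_levels <- c("Former Socialist Bloc", "Independent", "Western Europe",
                   "Scandinavia", "Former Yugoslavia")

# levels = the values that occur in the data, labels = what to call them
region_f <- factor(region, levels = region_levels, labels = 1:5)
print(region_f)

# Plain integer codes, if numbers are needed for further analysis
region_code <- as.integer(factor(region, levels = region_levels))
print(region_code)
```

Note that the labels of a factor are stored as text, so region_f holds "2", "4", ... rather than numbers; as.integer() on the factor gives the numeric codes.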
I use the arrange function to put my data frame in order by deaths, but when I try to draw a bar graph of the top 5, the bars are in alphabetical order. How do I get them into order by value? Do I need to use ggplot?
library(dplyr)
library(ggplot2)
EventsByDeaths <- arrange(SumByEvent, desc(deaths))
> head(EventsByDeaths, 10)
Source: local data frame [10 x 3]
EVTYPE deaths damage
1 TORNADO 4662 2584635.60
2 EXCESSIVE HEAT 1418 53.80
3 HEAT 708 277.00
4 LIGHTNING 569 338956.35
5 FLASH FLOOD 567 759870.68
6 TSTM WIND 474 1090728.50
7 FLOOD 270 358109.37
8 RIP CURRENTS 204 162.00
9 HIGH WIND 197 170981.81
10 HEAT WAVE 172 1269.25
qplot(y=deaths, x=EVTYPE, data=EventsByDeaths[1:5,], geom="bar", stat="identity")
You could use the reorder() function
EventsByDeaths <- transform(EventsByDeaths, EVTYPE = reorder(EVTYPE, -deaths))
Then your original qplot call should work as desired. Hope this helps!
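If you prefer ggplot() over qplot(), the same idea could look like the sketch below. It retypes the top five rows from the question so that it runs on its own; the names `top5` and `p` are only for illustration.

```r
library(ggplot2)

# Top five rows from the output in the question, retyped by hand
top5 <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "HEAT", "LIGHTNING", "FLASH FLOOD"),
  deaths = c(4662, 1418, 708, 569, 567)
)

# reorder() sorts the factor levels by -deaths, so the largest comes first
top5$EVTYPE <- reorder(top5$EVTYPE, -top5$deaths)

# geom_col() draws bars whose heights are the values in the data
p <- ggplot(top5, aes(x = EVTYPE, y = deaths)) +
  geom_col()

print(p)
```

The bar order follows the factor levels, not the row order of the data frame, which is why arrange() alone does not change the plot.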