I have a data frame of 6,497,651 observations of 6 variables, obtained from the National Emissions Inventory website. It has the following variables:
 fips      SCC Pollutant Emissions  type year
09001 10100401      PM25     15.14 POINT 1999
09001 10100402      PM25    234.75 POINT 1999
Here fips is the county code, SCC is the source name string, Pollutant is the type of pollutant (PM2.5 emissions in this case), Emissions is the amount of pollutant emitted in tons, type is the type of source the pollutant was emitted from (road, non-road, point, etc.), and year runs from 1999 to 2008.
Basically, I have to plot a simple line plot showing the change in the level of emissions by year. The year 1999 alone has over a thousand observations, and the same goes for every year through 2008. The problem is not difficult: I could form a new data frame for each year with the sum of all recorded emissions and then row-bind those subsetted data frames. A more efficient and tidier way might be a for loop that calculates the sum of all the values under 'Emissions' for each year and stores that in a new data frame, but I am stuck on where to start. What is the exact syntax to calculate the sum of values for each year? I should end up with a data frame that looks something like this:
Year Emissions
Where Emissions notes down the sum of values of all emissions in that specific year.
The data.table package is probably the most efficient for tasks like this. The syntax to calculate the sum of emissions for every year looks like this (assuming your data is stored in dt):
library(data.table)
dt <- as.data.table(dt)
dt[, .(Emissions = sum(Emissions)), by = year]
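From there the yearly totals can go straight into a line plot. A minimal self-contained sketch (the toy values below stand in for the real NEI data):

```r
library(data.table)

# toy data standing in for the NEI extract (made-up values)
dt <- data.table(
  Emissions = c(15.14, 234.75, 120.00, 98.50),
  year      = c(1999, 1999, 2002, 2002)
)

# sum of Emissions per year, as shown above
totals <- dt[, .(Emissions = sum(Emissions)), by = year]

# simple line plot of total emissions over time
plot(totals$year, totals$Emissions, type = "l",
     xlab = "Year", ylab = "Total PM2.5 emissions (tons)")
```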
A dplyr/ggplot option: group by 'year', get the sum of 'Emissions' using summarise, and plot with ggplot.
library(dplyr)
library(ggplot2)
df1 %>%
  group_by(year) %>%
  summarise(Emissions = sum(Emissions)) %>%
  ggplot(aes(x = year, y = Emissions)) +
  geom_line()
Or this can be done directly within ggplot (in ggplot2 >= 3.3.0 the fun argument replaces the deprecated fun.y):
ggplot(df1, aes(x = year, y = Emissions)) +
  stat_summary(fun = sum, geom = 'line')
Using RStudio with the tidyverse packages, using ggplot2 to plot:
Say we have a dataset called SoccerTeam. This dataset consists of variables Location, Goals, YearPlayed, etc., and each entry corresponds to a game: the game was played at Location X, they scored Y Goals, and it was played in year 19XX.
YearPlayed covers all the years the team has been active, say 1950 to 2020, and there is a whole season of data for each year.
Let's say 2002 has 30 games, so there would be 30 entries with YearPlayed = 2002.
Our goal is to plot how many goals the team has scored over time. If we plotted every single game from each year across the 70 years of play, the graph would be very messy and hard to interpret. To tackle this, I would like to take the average goals for each year and plot that over time. How would I do this?
If you need a general introduction to data wrangling in R, I recommend R for Data Science. That said, you need to group by the column YearPlayed, then compute the mean for each year, then pipe the result into the plot commands. The %>% symbol sends the left side's output into the right side, so you can chain them together like this:
SoccerTeam %>%
  group_by(YearPlayed) %>%
  summarize(Goals = mean(Goals)) %>%
  ggplot(aes(x = YearPlayed, y = Goals)) +
  geom_line()
I have data for number of cars sold each year for different brands like this:
But I also have data for how many of the cars sold were cars with a diesel engine for each one of the brands and years.
I want to stack the values in a bar chart and also add a second dimension to each bar, showing how many of the cars sold have a diesel engine for the specific brand (e.g. BMW). I want to do it either by colour, or by lines like below:
Is it possible to do that with ggplot in R?
Edit:
My data:
The data looks like this in Excel (first the number of cars sold, then the share with a diesel engine):
     BMW Volvo Audi
2010  50   400   50
2011  75   450   35
2012  45   350   55

     BMW         Volvo Audi
2010 0.2         0.2   0.5
2011 0.293333333 0.5   0.571428571
2012 0.488888889 0.5   0.272727273
You will need to do a bit of data preparation to make it easier to plot, but once you do this type of thing a few times, it becomes quite straightforward. I highly recommend reading about Tidy Data Principles, which I'll apply here.
Data
In the future, please post your data frames via the output of dput(your_data_frame), but your tables are small, so importing isn't that difficult:
df1 <- data.frame(year = 2010:2012, BMW = c(50, 75, 45),
                  Volvo = c(400, 450, 350), Audi = c(50, 35, 55))
df2 <- data.frame(year = 2010:2012, BMW = c(0.2, 0.29333333, 0.4888888),
                  Volvo = c(0.2, 0.5, 0.5),
                  Audi = c(0.5, 0.571428571, 0.2727272727272))
Your data should be converted into tidy data, in which the key principle is that each row is an observation, each column is one variable, and each cell holds that observation's value for that variable. Consider your first table: only three pieces of information (variables) are changing, namely Year, Model, and number of cars sold. So we need to collapse the three make columns (BMW, Volvo, and Audi) into two: one for Model and one for number sold. You can do that using gather() from tidyr (or a few other ways). Similarly, we need to combine the columns in the second dataset.
Then, you can merge the two datasets together. Then finally, I use the information from total sold * proportion which are diesel to identify the number of diesel vs. number that are not diesel. In this way, we create the final dataframe used for plotting:
library(dplyr)
library(tidyr)

df1.1 <- df1 %>% gather(key = 'Model', value = 'Total_Sold', -year)
df2.1 <- df2 %>% gather(key = 'Model', value = 'prop_diesel', -year)
df <- merge(df1.1, df2.1)
df$diesel <- df$Total_Sold * df$prop_diesel
df$non_diesel <- df$Total_Sold - df$diesel
df <- df %>% gather(key = 'type', value = 'sold', -(1:4))
Plot
To create the plot, the best way to show this seems to be a column plot, stacking "non-diesel" and "diesel" on top of one another so you can compare total sales across makes per year while also seeing the diesel/non-diesel proportion. Ideally we would use dodging (separating columns by make at the same x-axis value) as well as stacking (stacking diesel vs. non-diesel). You can't really do both at once in a column plot, so I use faceting to get the same effect: Model on the x axis, stacking for the amount sold, and faceting to create the subsets per year. Here's the code and result:
ggplot(df, aes(x = Model, y = sold)) +
  geom_col(aes(fill = type), position = 'stack') +
  facet_wrap(~year)
I have GDP values listed by country (rows) and list of years (column headings) in one dataset. I'm trying to combine it with another dataset where the values represent GINI. How do I merge these two massive datasets by country and year, when "year" is not a variable? (How do I manipulate each dataset so that I introduce "year" as a column and have repeating countries to represent each year?)
i.e. from the top dataframe to the bottom dataframe in the image?
Reshape the top dataset from wide to long and then merge it with your other dataset. There are many, many examples of reshaping data on this site with different approaches. A common one is the tidyr package, which has a function called gather that does just what you need.
long_table <- tidyr::gather(wide_table, key = year, value = GDP, `1960`:`1962`)
(note the backticks, needed because these column names start with a digit) or whatever the last year in your dataset is. You can install the tidyr package with install.packages('tidyr') if you don't have it yet.
Next time, please avoid putting pictures of your data and provide reproducible data so this is easier for others to answer exactly. You can use dput(..) to do so.
Hope this helps!
# sample data (added 'X' before numeric year columns as R doesn't allow
# a syntactic column name to start with a digit)
df <- data.frame(Country_Name = c('Belgium', 'Benin'),
                 X1960 = c(123, 234),
                 X1961 = c(567, 890))
library(dplyr)
library(tidyr)
df_new <- df %>%
  gather(Year, GDP, -Country_Name)
df_new$Year <- gsub('X', '', df_new$Year)
df_new
Output is:
Country_Name Year GDP
1 Belgium 1960 123
2 Benin 1960 234
3 Belgium 1961 567
4 Benin 1961 890
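As an aside, in current tidyr (1.0+) gather() is superseded by pivot_longer(), which can also strip the 'X' prefix in the same step via names_prefix. A sketch of the equivalent reshape:

```r
library(tidyr)

# same sample data as above
df <- data.frame(Country_Name = c('Belgium', 'Benin'),
                 X1960 = c(123, 234),
                 X1961 = c(567, 890))

# reshape wide -> long, dropping the leading 'X' from the year columns
df_new <- pivot_longer(df, cols = -Country_Name,
                       names_to = "Year", names_prefix = "X",
                       values_to = "GDP")
df_new
```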
(PS: As already suggested by others you should always share sample data using dput(df))
With the data in Excel, if you have Excel 2010 or later, you can use Power Query or Get & Transform to unpivot the "year" columns.
Power Query generates the code behind the scenes, but you can do all of this through the GUI.
And this is the result, although I had to format the GDP column to get your mix of Scientific and Number formatting, and I had a typo on Belgium 1962
I have a panel dataset containing data on civil war, with indices "side_a_id" and "year_month". Each observation is an individual 'event' of armed conflict, and variables include details on the actors involved, a unique ID for each individual event, and then for each event, the number of side_a deaths, side_b deaths, and civilian deaths.
Screenshot of sample dataset
I would like to aggregate the data on each separate deaths variable ('deaths_a', 'deaths_b' and 'civilian_deaths') according to which year_month they are in. Taking an example from my dataset below: instead of having 3 separate rows for the interaction between the Government of Haiti and a Military Faction (dyad_id = 14), I would like one row that contains all the deaths of each party for a specific month. I have tried using the aggregate() function, which seems to work until I try to re-merge it with my full dataset.
df <- aggregate(cbind(deaths_a, deaths_b, deaths_civilians) ~ side_a_id + year_month,
                panel_data, sum)
rebel <- full_join(panel_data, df, by = c("side_a_id", "year_month"))
Can anyone suggest a solution?
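One possible approach: the full_join repeats each monthly sum for every event row, which is where the duplicates come from. If you only need one row per actor per month, you can summarise directly with dplyr instead of aggregating and re-merging. A minimal sketch, using the column names from the question but made-up values:

```r
library(dplyr)

# hypothetical mini version of the panel data described above
panel_data <- data.frame(
  side_a_id        = c(14, 14, 14, 27),
  year_month       = c("1991-01", "1991-01", "1991-02", "1991-01"),
  deaths_a         = c(3, 5, 2, 10),
  deaths_b         = c(1, 0, 4, 7),
  deaths_civilians = c(0, 2, 1, 3)
)

# one row per side_a_id / year_month, with the death counts summed
monthly <- panel_data %>%
  group_by(side_a_id, year_month) %>%
  summarise(across(c(deaths_a, deaths_b, deaths_civilians), sum),
            .groups = "drop")
```

If you do need the sums alongside the event-level rows, joining `monthly` back with left_join will by design repeat the totals on every event row for that actor-month.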
I have been teaching myself R from scratch, so please bear with me. I have found multiple ways to count observations; however, I am trying to figure out how to count frequencies using (logical?) expressions. I have a massive set of data, approx. 1 million observations. The df is set up like so:
Latitude Longitude ID Year Month Day Value
66.16667 -10.16667 CPUELE25399 1979 1 7 0
66.16667 -10.16667 CPUELE25399 1979 1 8 0
66.16667 -10.16667 CPUELE25399 1979 1 9 0
There are 154 unique IDs and similarly 154 unique lat/long pairs. I am focusing on the top 1% of all values for each unique ID. For each unique ID I have calculated the 99th percentile from its associated values. I went further and calculated each ID's 99th percentile for individual years and months, e.g. for CPUELE25399, for 1979, month = 1, the 99th percentile value is 3 (3 being the floor of the top 1%).
Using these threshold values: for each ID, for each year, for each month, I need to count the number of times (per month, per year) that the value >= that ID's 99th percentile.
I have tried at least 100 different approaches to this but I think that I am fundamentally misunderstanding something maybe in the syntax? This is the snippet of code that has gotten me the farthest:
ddply(Total,
      c('Latitude', 'Longitude', 'ID', 'Year', 'Month'),
      function(x) c(Threshold = quantile(x$Value, probs = .99, na.rm = TRUE),
                    Frequency = nrow(x$Value >= quantile(x$Value, probs = .99, na.rm = TRUE))))
R throws a warning message saying that >= is not useful for factors?
If any one out there understands this convoluted message I would be supremely grateful for your help.
Using these threshold values: For each ID, for each year, for each month- I need to count the amount of times (per month per year) that the value >= that IDs 99th percentile
Does this mean you want to
calculate the 99th percentile for each ID (i.e. disregarding month year etc), and THEN
work out the number of times you exceed this value, but now split up by month and year as well as ID?
(note: your example code groups by lat/lon but this is not mentioned in your question, so I am ignoring it. If you wish to add it in, just add it as a grouping variable in the appropriate places).
In that case, you can use ddply to calculate the per-ID percentile first:
# calculate percentile for each ID
Total <- ddply(Total, .(ID), transform, Threshold = quantile(Value, probs = .99, na.rm = TRUE))
And now you can group by (ID, month and year) to see how many times you exceed:
Total <- ddply(Total, .(ID, Month, Year), summarize, Freq=sum(Value >= Threshold))
Note that summarize returns a data frame with only as many rows as there are unique combinations of ID, Month and Year, i.e. it drops the Latitude/Longitude columns. If you want to keep them, use transform instead of summarize; Freq will then be repeated across all the (Lat, Lon) rows within each (ID, Month, Year) combo.
Notes on ddply:
can do .(ID, Month, Year) rather than c('ID', 'Month', 'Year') as you have done
if you just want to add extra columns, using something like summarize or mutate or transform lets you do it slickly without needing to do all the Total$ in front of the column names.
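Since plyr is no longer under active development, the same two steps can also be written with dplyr, where a grouped mutate plays the role of transform. A sketch with made-up values standing in for the real data:

```r
library(dplyr)

# hypothetical mini version of the data frame described in the question
Total <- data.frame(
  ID    = rep("CPUELE25399", 6),
  Year  = 1979,
  Month = 1,
  Value = c(0, 0, 0, 1, 2, 5)
)

result <- Total %>%
  group_by(ID) %>%
  mutate(Threshold = quantile(Value, probs = .99, na.rm = TRUE)) %>%  # per-ID percentile
  group_by(ID, Year, Month) %>%
  summarise(Freq = sum(Value >= Threshold), .groups = "drop")
```

As with transform/summarize in plyr, the grouped mutate keeps every row (adding the Threshold column), while the final summarise collapses to one row per (ID, Year, Month).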