Subset data and plotting in R

I would like to use R to simplify and subset large datasets (over 100,000 values) and then plot them. Below is a simplified version of my dataset (Figure 1), reduced to three years and two crop types. Each row has a Year (2011-2013), a crop type (Corn or Soybean) and an Area.
I want to sum the Area of Corn and Soybean by year into a new table (example in Figure 2) with the year, type and total area, and then plot the total area by year for each crop (example plot in Figure 3).
Figure 1 Small example dataset
Figure 2 New total table
Figure 3 example of graph that I want to produce
I thought I could subset the data by year and crop with
corn2011 <- subset(CropTable, Year==2011 & Lulc=="Corn")
corn2012 <- subset(CropTable, Year==2012 & Lulc=="Corn")
and then summarize the data using the sum function
sum(corn2011[,3])
but I'm not sure how to plot the totals by year, or against each other, so that the result looks like Figure 3.

For your plot, you could try this:
library(ggplot2)
data.df <- read.table(text="
Year Type Area
1 2011 corn 30
2 2012 corn 15
3 2013 corn 50
4 2011 Soy 45
5 2012 Soy 30
6 2013 Soy 60",
header = TRUE)
ggplot(data=data.df, aes(x=as.factor(Year), y=Area, group=Type, color=Type)) +
geom_line() + xlab("Year") + ylab("Area (ha)") + theme_bw() +
scale_color_manual(values=c("red", "blue"))
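The plot above starts from a table that is already summarized. For the step before it (building the Figure 2 style table from the raw rows), one option is base R's aggregate(), which sums within every year and crop combination in a single call instead of one subset() per combination. This is only a sketch: the rows below are made up, and the column name Area is an assumption (the question only shows that the area sits in the third column of CropTable).

```r
# Hypothetical raw table: several fields per year and crop type.
# Column names Year and Lulc come from the question; Area is assumed.
CropTable <- data.frame(
  Year = c(2011, 2011, 2011, 2012, 2012, 2012),
  Lulc = c("Corn", "Corn", "Soybean", "Corn", "Soybean", "Soybean"),
  Area = c(10, 20, 45, 15, 10, 20)
)

# Sum Area within every Year/Lulc combination
totals <- aggregate(Area ~ Year + Lulc, data = CropTable, FUN = sum)
print(totals)
```

The resulting totals data frame has one row per year and crop, so it can be passed to ggplot() in place of data.df, with Lulc mapped to group and color.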

Related

Plot large panel data in R by category

I have a dataset (df) that looks like this:
EIN Year Cat Fund
1 16 2005 A 9784.490
2 16 2006 A 10020.720
3 16 2007 A 9232.796
4 15 2008 B 8567.893
5 15 2009 B 10292.670
6 17 2010 C 9274.589
The data has relatively large dimensions (around 300k observations), which makes plotting a potentially slow process. I would like to plot the variable Fund for each year, by the identifier EIN. Based on this post I have tried the following code:
library(ggplot2)
ggplot(df, mapping = aes(x = Year, y = Fund)) +
geom_line(aes(linetype = as.factor(EIN)))
Here are my questions:
This code becomes pretty slow given the large number of observations that I have. Do you suggest any alternatives that could speed up the process?
Since I have a huge number of EINs, the legend ends up taking all the space available for the graph, so I would like to get rid of it, so far unsuccessfully. I tried adding + guides(fill=FALSE) at the end, but it did not work. Any advice?
If I wanted to either subset or color code my plot by Cat, what would be the best way to do it?
Thanks a lot for your help!
You can get rid of the legend using:
+ theme(legend.position = 'none')
To subset (facet) your plot, especially if there aren't too many categories, use facet_wrap:
+ facet_wrap(~Cat)
To colour instead, put colour = Cat inside your aes() call.
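A possible way to combine these suggestions, as a sketch built from the six sample rows in the question: map EIN to group instead of linetype, so that each EIN still gets its own line but no per-EIN legend is created, and map Cat to colour. The guides(fill=FALSE) attempt had no effect because the legend in the original plot came from the linetype aesthetic, not from fill.

```r
library(ggplot2)

# Sample rows copied from the question
df <- data.frame(
  EIN  = c(16, 16, 16, 15, 15, 17),
  Year = c(2005, 2006, 2007, 2008, 2009, 2010),
  Cat  = c("A", "A", "A", "B", "B", "C"),
  Fund = c(9784.490, 10020.720, 9232.796, 8567.893, 10292.670, 9274.589)
)

# group = EIN draws one line per EIN without adding a legend entry for each;
# colour = Cat gives a short legend with one entry per category.
# geom_point() keeps EINs with a single observation visible.
p <- ggplot(df, aes(x = Year, y = Fund, group = EIN, colour = Cat)) +
  geom_line() +
  geom_point()
print(p)
```

Adding + theme(legend.position = 'none') or + facet_wrap(~Cat) to p works as described above.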

Stacked barplot histogram in R

I would like to make a histogram for my data but I would also like to visualize it in such a way that each category is coloured differently but stacked together.
This is what I'm trying to achieve: Stacked histogram from already summarized counts using ggplot2
but I'm unsure how to do it for my data set and my R skills are very much on the rusty side.
My data is formatted like this
Name Category Age Year
1 A 3 2017
2 B 6 2016
3 B 12 2017
4 B 8 2017
I'm only interested in Category B so I made a subset called catB. I would like the histogram to graph the frequency of the different ages, and I would like to colour the stacks based on year (in my data there are 5 year options).
I would appreciate any help! Thank you!
ggplot(catB, aes(x = Age, fill = factor(Year))) +
geom_histogram()
Another graphical option is to add an explicit frequency column (count). In the example given each row counts once, so count = 1, but check what the count should be in your real data:
catB <- cbind(catB, count=1)
ggplot(catB, aes(x=Age, y=count, fill=factor(Year))) + geom_col()
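To see what the count values look like once rows share the same Age and Year, the count column can be summed per combination before plotting. The sketch below assumes catB holds the three Category B rows shown in the question; the name counts is introduced here for illustration.

```r
# The Category B rows from the question
catB <- data.frame(
  Name     = c(2, 3, 4),
  Category = "B",
  Age      = c(6, 12, 8),
  Year     = c(2016, 2017, 2017)
)

# Each row counts once
catB <- cbind(catB, count = 1)

# Total count for every Age/Year combination that occurs in the data
counts <- aggregate(count ~ Age + Year, data = catB, FUN = sum)
print(counts)
```

With real data, repeated Age and Year pairs collapse into a single row whose count is the height of that segment in the stacked bar.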

Importing/Plotting a Time Series in R with two columns

I have RStudio and want to import a time series data set. The column on the x-axis should be the year, however when I use the ts.plot command it just plots Time on the x-axis. How can I make the years from the data set appear on my plot?
The data set is for Water Usage in NYC from 1898 to 1968. There are two columns, The Year and Water Usage.
This is the link to the data I used (I have downloaded the .TSV file):
https://datamarket.com/data/set/22tl/annual-water-use-in-new-york-city-litres-per-capita-per-day-1898-1968#!ds=22tl&display=line
These are the commands for importing my data:
nyc <- read.csv("~/Desktop/annual-water-use-in-new-york-cit.tsv", sep="")
View(nyc)
ts.plot(nyc)
This is what I get:
There are several ways to do this. I used the CSV file from your link in this demonstration.
library(tidyverse)
nyc <- read_csv("annual-water-use-in-new-york-cit.csv")
head(nyc)
# A tibble: 6 x 2
Year `Annual water use in New York city, litres per capita per day, 1898-1968`
<chr> <chr>
1 1898 402.8
2 1899 421.3
3 1900 431.2
4 1901 426.2
5 1902 425.5
6 1903 423.6
Method 1
Create a time series object and plot this time series.
Firstly, let us fix the column name of the annual water use so that it is easier to call in our code.
nyc <- nyc %>%
rename(
water_use = `Annual water use in New York city, litres per capita per day, 1898-1968`
)
Make the time series object nyc.ts with the ts() function.
nyc.ts <- ts(as.numeric(nyc$water_use), start = 1898)
You can then use the generic plot function to plot the time series.
plot(nyc.ts, xlab = "Years")
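The reason this puts years on the x-axis is that ts() stores the start year with the series, and plot() uses those time values for the axis. A small sketch using only the six values printed by head(nyc) above (the axis label text is an assumption):

```r
# First six observations, as printed by head(nyc)
water_use <- c(402.8, 421.3, 431.2, 426.2, 425.5, 423.6)

# One observation per year, starting in 1898
nyc.ts <- ts(water_use, start = 1898)

# The time index that plot() puts on the x-axis
print(time(nyc.ts))

plot(nyc.ts, xlab = "Years", ylab = "Litres per capita per day")
```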
Method 2
Use the autoplot function from the forecast package. Note that this function is built on top of ggplot2.
library(forecast)
autoplot(nyc.ts) + xlab("Years") + ylab("Amount in Litres")
Method 3
With just ggplot2:
nyc$Year <- as.POSIXct(nyc$Year, format = "%Y")
nyc$water_use <- as.numeric(nyc$water_use)
ggplot(nyc, aes(x = Year, y = water_use)) + geom_line() + xlab("Years") + ylab("Amount in Litres")

How to create a histogram for one variable, using another to determine its frequency?

I'm new to R, and am using Histograms for the first time. I need to construct a histogram chart to show the frequency of income for all 50 United States + District of Columbia.
This is the data given to me:
> data
X.Income. X.No.States.
1 -22.024 5
2 -25.027 13
3 -28.030 16
4 -31.033 9
5 -34.036 4
6 -37.039 2
7 -40.042 2
> hist(data$X.Income, col="red")
But that only produces a histogram of how often each income amount appears in the data, not the number of states that have that level of income. How do I account for the number of states that have each level of income in the chart?
Use a bar plot instead of a histogram, as the histogram expects to calculate the frequencies for you:
library(ggplot2)
# make some data to exercise
income = c(-22.024, -25.027, -28.030, -31.033, -34.036, -37.039,-40.042)
freq = c(5,13,16,9,4,2,2)
df <- data.frame(income, freq)
names(df) <- c("income","freq")
# the graph object
p <- ggplot(data=df) +
aes(x=income, y=freq) +
geom_bar(stat="identity", fill="red")
# call the object to view
p
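If ggplot2 is not available, the same idea can be sketched with base R's barplot(), which also takes precomputed heights. The axis labels and the name bar_mids are assumptions added for illustration.

```r
# Income levels and number of states at each level, from the question
income <- c(-22.024, -25.027, -28.030, -31.033, -34.036, -37.039, -40.042)
freq   <- c(5, 13, 16, 9, 4, 2, 2)

# One bar per income level, with the state count as the bar height.
# barplot() returns the x positions of the bar midpoints.
bar_mids <- barplot(height = freq, names.arg = income, col = "red",
                    xlab = "Income", ylab = "Number of states")
```

The counts add up to 51, matching the 50 states plus the District of Columbia.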

Trying to plot temperature and count data on same plot using xyplot?

I am using xyplot from lattice to make a plot that shows temperature change over time alongside count data. I am not sure if ggplot2 would be better. My data is arranged like this:
Year (1998 1998 1999 2000 2001 2001 2002)
Low (2.777778 8.333330 10.555556 4.444444 26.388889 15.555556 12.500000)
Geese (2 14 10 16 7 10 15)
State (Arkansas California California California California Florida California)
I am stuck at this part of the code:
xyplot(c(geese,low)~year,subset=state=="California", par.settings=bwtheme, auto.key=TRUE)
The plot draws geese and low (temperature) with the same type of point, and if I add a line there is no separation between the two. Any help with this would be much appreciated.
To plot multiple series on the same plot, use + rather than c() to specify multiple y values. For example
xyplot(geese + low ~year, subset=state=="California", auto.key=TRUE, type="b")
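That will produce a plot in which geese and low are drawn as separate series, each with its own colour and its own entry in the key.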
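A self-contained version of that call, as a sketch: the values are the ones listed in the question, while the data frame name birds and the lowercase column names are assumptions chosen to match the variables used in the xyplot() call.

```r
library(lattice)

# Values from the question, collected into one data frame
birds <- data.frame(
  year  = c(1998, 1998, 1999, 2000, 2001, 2001, 2002),
  low   = c(2.777778, 8.333330, 10.555556, 4.444444,
            26.388889, 15.555556, 12.500000),
  geese = c(2, 14, 10, 16, 7, 10, 15),
  state = c("Arkansas", "California", "California", "California",
            "California", "Florida", "California")
)

# geese + low on the left-hand side gives two series in one panel;
# type = "b" draws both points and lines for each series
p <- xyplot(geese + low ~ year, data = birds,
            subset = state == "California",
            auto.key = TRUE, type = "b")
print(p)
```

Passing data = birds lets xyplot() look up year, state, geese and low inside the data frame instead of requiring them as free-standing vectors.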
