All I'm trying to do is plot a cumulative row count over time (so that by 2021 the graph line has reached 73). I'm quite new to R, and I feel like this should be really easy, so I don't know why it's not working.
My data looks like this:
ID name year
73 name73 2021
72 name72 2021
71 name71 2019
70 name70 2017
69 name69 2015
68 name68 2015
I've tried this code and it kind of works, but sometimes the line goes down, which doesn't seem right since I just want a cumulative count.
ggplot(df, aes(x=year, y=ID)) +
geom_line()
Any help would be much appreciated!
Order the data by year and then by ID before plotting. The line then runs from the first year to the last and, within each year, from the smaller ID to the larger one.
x <- 'ID name year
73 name73 2021
72 name72 2021
71 name71 2019
70 name70 2017
69 name69 2015
68 name68 2015'
df <- read.table(textConnection(x), header = TRUE)
library(ggplot2)
i <- order(df$year, df$ID)
ggplot(df[i,], aes(x=year, y=ID)) +
geom_line()
Created on 2022-07-08 by the reprex package (v2.0.1)
An alternative, though I am not sure it is what the question is asking for, is to aggregate the IDs by year, keeping the maximum in each year.
The code below does this and pipes to the plot directly, without creating an extra data set in the global environment.
aggregate(ID ~ year, df, max) |>
ggplot(aes(x=year, y=ID)) +
geom_line()
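If the goal is literally a running total of rows per year, one option is to count the rows in each year and accumulate those counts. The sketch below is an assumption about what the question wants, not part of the answer above; `counts`, `n` and `cum_n` are names made up for illustration. On the six sample rows the total only reaches 6; on the full data it would reach the total number of rows.

```r
library(ggplot2)

x <- 'ID name year
73 name73 2021
72 name72 2021
71 name71 2019
70 name70 2017
69 name69 2015
68 name68 2015'
df <- read.table(text = x, header = TRUE)

# Count rows per year; table() returns the years in increasing order
counts <- as.data.frame(table(year = df$year), responseName = "n")
# table() stores the years as factor labels, so convert them back to numbers
counts$year <- as.integer(as.character(counts$year))
# Running total of rows up to and including each year
counts$cum_n <- cumsum(counts$n)

# cum_n never decreases, so the plotted line cannot go down
p <- ggplot(counts, aes(x = year, y = cum_n)) +
  geom_step()
print(p)
```

geom_step is used here instead of geom_line so the count stays flat between years; that is a presentation choice, and geom_line works on the same data.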
The data I am working with is from eBird, and I am looking to sort out species occurrence by both name and year. There are over 30k individual observations, each with its own number of birds. From the raw data I posted below, on Jan 1, 2021 someone observed 2 Cooper's Hawks, etc.
Raw looks like this:
specificName indivualCount eventDate year
Cooper's Hawk 1 (1/1/2018) 2018
Cooper's Hawk 1 (1/1/2020) 2020
Cooper's Hawk 2 (1/1/2021) 2021
Ideally, I would be able to group all the Cooper's Hawk specificName rows by the year they were observed and sum the individualCount values. That way I can make statistical comparisons between the number of birds observed in 2018, 2019, 2020, & 2021.
I created the separate column for the year
year <- as.POSIXct(ebird.df$eventDate, format = "%m/%d/%Y")
ebird.df$year <- as.numeric(format(year, "%Y"))
Then aggregated with the following:
aggdata <- aggregate(ebird.df$individualCount , by = list( ebird.df$specificname, ebird.df$year ), FUN = sum)
There are hundreds of bird species, and Cooper's Hawks start on the 115th row, so the output looks like this:
Group.1 Group.2 x
115 2018 Cooper's Hawk 86
116 2019 Cooper's Hawk 152
117 2020 Cooper's Hawk 221
118 2021 Cooper's Hawk 116
My question is: how do I get the data into a table that looks like the following?
Species Name 2018 2019 2020 2021
Cooper's Hawk 86 152 221 116
I want to eventually run some basic ecology stats on the data using vegan, but one problem first I guess lol
Thanks!
There are errors in the data and code in the question so we used the code and reproducible data given in the Note at the end.
Now, using xtabs we get an xtabs table directly from ebird.df like this. No packages are used.
xtabs(individualCount ~ specificName + year, ebird.df)
## year
## specificName 2018 2020 2021
## Cooper's Hawk 1 1 2
Optionally convert it to a data.frame:
xtabs(individualCount ~ specificName + year, ebird.df) |>
as.data.frame.matrix()
## 2018 2020 2021
## Cooper's Hawk 1 1 2
Although we did not need aggdata, if you need it for some other reason it can be computed using aggregate.formula like this:
aggregate(individualCount ~ specificName + year, ebird.df, sum)
Note
Lines <- "specificName,individualCount,eventDate,year
\"Cooper's Hawk\",1,(1/1/2018),2018
\"Cooper's Hawk\",1,(1/1/2020),2020
\"Cooper's Hawk\",2,(1/1/2021),2021"
ebird.df <- read.csv(text = Lines, strip.white = TRUE)
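The eventDate values in the sample are wrapped in parentheses, so parsing them directly with the format "%m/%d/%Y" would yield NA. A minimal sketch of one way to derive the year from them, assuming every date has the form (m/d/yyyy); `clean_date`, `parsed` and `year_from_date` are illustrative names, not part of the original data:

```r
Lines <- "specificName,individualCount,eventDate,year
\"Cooper's Hawk\",1,(1/1/2018),2018
\"Cooper's Hawk\",1,(1/1/2020),2020
\"Cooper's Hawk\",2,(1/1/2021),2021"
ebird.df <- read.csv(text = Lines, strip.white = TRUE)

# Drop the surrounding parentheses, then parse as month/day/year
clean_date <- gsub("[()]", "", ebird.df$eventDate)
parsed <- as.Date(clean_date, format = "%m/%d/%Y")

# Extract the four-digit year as an integer
ebird.df$year_from_date <- as.integer(format(parsed, "%Y"))
print(ebird.df)
```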
So I am trying to plot this data using gganimate:
YEAR WEEK COUNTRY CODE MARKET ARRIVALS DATE pct.chg
2020 1 Usa US CONTAINER SHIPS 347 2020-01-08 7.7639752
2020 2 Usa US CONTAINER SHIPS 395 2020-01-15 -2.2277228
2020 3 Usa US CONTAINER SHIPS 353 2020-01-22 -15.1442308
2020 4 Usa US CONTAINER SHIPS 359 2020-01-29 -11.3580247
2020 5 Usa US CONTAINER SHIPS 385 2020-02-05 0.2604167
The data is in an object called changesimp. I want to plot the arrivals over time, as you might expect. So here is the code I'm using to do that:
library(tidyverse)
library(gganimate)
library(ggthemes) # provides theme_clean()
changesimp %>%
filter(COUNTRY == "Usa") %>%
filter(YEAR == "2020") %>%
ggplot(aes(DATE, pct.chg)) +
geom_line() +
geom_point()+
labs(y="Year-over-year % change",
x="",
title="Percent change in port calls")+
theme_clean()+
transition_reveal(DATE)
This worked fine when I was just using geom_line. But when I added the geom_point part, things got a little weird and it gives me this output (this is just one frame from the animation):
What I'm trying to get is something like this, found here:
There is only one value of pct.chg per week, I have checked already. So I'm not sure why it is plotting multiple points like this. Any thoughts? Thanks.
When I use dummy data as
df <- data.frame(
COUNTRY = c(rep("Usa",26)),
YEAR = c(rep("2020",26)),
WEEK = c(1:26),
pct.chg = c(rnorm(26,0,15))
)
changesimp <- df %>% mutate(DATE=(7*WEEK+as.Date('2020-01-01', format="%Y-%m-%d")))
Your program works fine and generates the following output:
I have a dataset about a university's student body with 10 columns that represent different factors such as their student id, gender, ethnicity, etc.
For right now I'm just interested in the term they were admitted, and their ethnicity because I want to see how the number of students from different ethnic backgrounds has changed over time. So I created a new data frame with two columns called ethnicitydf:
> head(ethnicitydf)
admit_term ethn_desc
1 2011-10-01 White/Caucasian
2 2011-10-01 Filipino/Filipino-American
3 2011-10-01 White/Caucasian
4 2011-10-01 Latino/Other Spanish
5 2011-10-01 East Indian/Pakistani
6 2011-10-01 White/Caucasian
I'm not exactly sure how I would create a plot that has the admit_term (time) in the x-axis and the frequency that each ethnicity occurs for each admit_term. There are 12 unique ethnicities in the second column and I want to have the frequency of all 12 ethnicities for each admit_term (6 terms in total) in one graph, each ethnicity having a different color.
The first step I was thinking of was counting up each ethnicity for each term using length(which(ethnicitydf$admit_term == "2011-10-01" & ethnicitydf$ethn_desc == "White/Caucasian")) for example and recording the data in a new data frame, but I feel like there should be a faster and more efficient way of doing this. Maybe the use of a package? Could anybody help me out? Thank you!
A bar plot will do the counts for you.
library(ggplot2)
ethnicitydf <- data.frame(admit_term = sample(c("2011-10-01","2012-10-01","2013-10-01"), 100, TRUE),
ethn_desc =sample(c("White/Caucasian","Filipino/Filipino-American","East Indian/Pakistani"), 100, TRUE))
ggplot() +
geom_bar(data=ethnicitydf, mapping=aes(x=admit_term, fill=ethn_desc), position="dodge")
Created on 2019-07-03 by the reprex package (v0.3.0)
You can also just plot points if you have a lot of series, like this.
ggplot() +
geom_point(data=ethnicitydf, mapping=aes(x=admit_term, colour=ethn_desc), stat="count")
To get lines you will need to make sure your x axis is continuous (this turns the text dates into Date values).
ethnicitydf$admit_term <- as.Date(ethnicitydf$admit_term)
ggplot() +
geom_line(data=ethnicitydf, mapping=aes(x=admit_term, colour=ethn_desc), stat="count") +
geom_point(data=ethnicitydf, mapping=aes(x=admit_term, colour=ethn_desc), stat="count")
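If the counts themselves are needed as a data frame (for example to inspect them before plotting), base R's table() can tabulate every term and ethnicity combination in one call. A small sketch on made-up rows in the same shape as ethnicitydf; `counts` and `n` are illustrative names:

```r
# Made-up rows shaped like ethnicitydf, for illustration only
ethnicitydf <- data.frame(
  admit_term = c("2011-10-01", "2011-10-01", "2011-10-01",
                 "2012-10-01", "2012-10-01", "2012-10-01"),
  ethn_desc  = c("White/Caucasian", "Filipino/Filipino-American",
                 "White/Caucasian", "Latino/Other Spanish",
                 "White/Caucasian", "Latino/Other Spanish")
)

# Cross-tabulate both columns, then flatten to one row per combination
# (combinations that never occur get n = 0)
counts <- as.data.frame(table(ethnicitydf), responseName = "n")
# The terms come back as factor labels; convert them to dates for plotting
counts$admit_term <- as.Date(as.character(counts$admit_term))
print(counts)
```

The resulting counts could then be plotted with something like ggplot(counts, aes(admit_term, n, colour = ethn_desc)) + geom_line(), without relying on stat="count".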
I have a dataframe like this
geo 2001 2002
Spain 21 23
Germany 34 50
Italy 57 89
France 19 13
As the names of the 2nd and 3rd columns are treated as numbers, I'm not able to get a bar chart with ggplot2. Is there any way to have the column names treated as text?
data
pivot_dat <- read.table(text="geo 2001 2002
Spain 21 23
Germany 34 50
Italy 57 89
France 19 13", stringsAsFactors = FALSE, header = TRUE)
pivot_dat <- setNames(pivot_dat,c("geo","2001","2002"))
Here's how to do it :
library(ggplot2)
ggplot(pivot_dat, aes(x = geo, y = `2002`)) + geom_col()+ coord_flip()
By using backticks instead of single or double quotes, you make sure you pass a name to the function and not a string.
If you use quotes, ggplot will convert this character value to a factor and recycle it, so all bars will have the same length of 1 and a label of "2002".
Note 1 :
You might want to learn the difference between geom_col and geom_bar :
?ggplot2::geom_bar
In short geom_col is geom_bar with stat = "identity", which is what you want here since you want to show on your plot the raw values from your table.
Note 2:
aes_string can be used to pass strings instead of names, but here the plain string doesn't work because "2002" is parsed as a number, so the name needs backticks inside the string:
ggplot(pivot_dat, aes_string(x = "geo", y = "2002")) +
geom_col()+ coord_flip() # incorrect output
ggplot(pivot_dat, aes_string(x = "geo", y = "`2002`")) +
geom_col()+ coord_flip() # correct output
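In recent ggplot2 releases aes_string is deprecated. The .data pronoun is another way to refer to a column by a string, and it needs no backticks inside the string. A short sketch, assuming ggplot2 3.0.0 or later; `year_col` is an illustrative variable name:

```r
library(ggplot2)

pivot_dat <- data.frame(
  geo = c("Spain", "Germany", "Italy", "France"),
  `2001` = c(21, 34, 57, 19),
  `2002` = c(23, 50, 89, 13),
  check.names = FALSE  # keep "2001" and "2002" as column names
)

# Column chosen at run time, held as a plain string
year_col <- "2002"

# .data[[...]] looks the string up as a column of the plot data
p <- ggplot(pivot_dat, aes(x = geo, y = .data[[year_col]])) +
  geom_col() +
  coord_flip()
print(p)
```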
Without an example to see exactly what your problem is, and what you want, it is hard to give you a perfect answer. But here's the thing.
You can do a geom_bar with numeric data. There are 3 possible ways I see that you could have problems (but I may not be able to guess every way).
First, let's set up R for plotting.
library(readr)
library(ggplot2)
test <- read_csv("geo,2001,2002
Spain,21,23
Germany,34,50
Italy,57,89
France,19,13")
Next, let's make the first mistake: incorrectly calling the column name. In the next example I will tell ggplot to make a bar of the number 2001, not the column 2001! R has to decide whether we mean the number 2001 or the object named 2001, and a bare 2001 is always read as the number.
ggplot(test) +
geom_bar(aes(x=2001))
OK, that just gives you a bar at 2001, because you gave it a single number instead of a column. Let's fix that. Use backticks (`) to identify the column name 2001 instead of the number 2001.
ggplot(test) +
geom_bar(aes(x=`2001`))
This creates a perfectly workable bar chart. But maybe you don't want the spaces? That's the only possible reason you would use text instead of a number. But you want text so I'm going to show you how to use as.factor to do something similar (and more powerful).
ggplot(test) +
geom_bar(aes(x=as.factor(`2001`)))
I have a data frame newspaper_yearly, which contains the circulation ($CIRC) for a set of newspapers per year. I want to see how the distribution of these numbers change over time. So I want to create multiple, separate histograms for these different years.
I have tried the following:
ggplot(newspaper_yearly,aes(x=CIRC))+geom_histogram()+facet_grid(~YEAR==2004)+theme_bw()
But this shows two histograms, one where YEAR==2004 is true and one where YEAR==2004 is false. I only want to see the histogram for YEAR==2004.
Edit:
here's a cleaned up sample of the basic data structure:
YEAR CIRC
45938 1972 16557
10396 1900 2320
56311 2000 1195
1002 1872 1200
53335 1992 17764
7376 1896 1760
30101 1940 100651
18633 1916 11956
3171 1884 1900
54022 1992 5530
38751 1956 8006
42125 1964 10208
636 1872 1500
48706 1980 18830
22497 1924 NA
28024 1936 7211
7684 1896 21752
56087 2000 107129
43935 1968 9288
34692 1948 5083
I understand I could just make a subset like this (which is effectively the result I want), but I want to circumvent making a subset for every single year.
datahist2000 <- newspaper_yearly[ which(newspaper_yearly$YEAR == "2000"), ]
hist(datahist2000$CIRC)
Something like this might help.
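par(mfrow=c(3,3))
d <- newspaper_yearly # assumption: d is the data frame from the question
# levels() returns NULL unless YEAR is a factor; unique() works either way
for(i in sort(unique(d$YEAR))){
datahist <- d[which(d$YEAR == i), ]
hist(datahist$CIRC, main = paste("YEAR", i))}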
I used your subset approach to solve the problem with a for loop. I don't quite know if that is what you are trying to accomplish. I presume that there are quite a few entries for 'CIRC' per year, right? Otherwise separate plots don't make much sense, at least not for the data you provided.
If I understand the question correctly, you want a histogram for each year separately? In that case you can simply do
ggplot(newspaper_yearly, aes(x = CIRC)) + geom_histogram() + facet_grid(~YEAR) + theme_bw()
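If only a single year is wanted, as in the original facet_grid(~YEAR==2004) attempt, one option is to filter the data just before plotting, so the same two lines work for any year. A sketch using a few rows from the sample above (the sample has no 2004 rows, so 2000 is used); `one_year` is an illustrative name and bins = 10 is an arbitrary choice:

```r
library(ggplot2)

# A few rows copied from the sample in the question
newspaper_yearly <- data.frame(
  YEAR = c(1972, 1900, 2000, 1992, 2000),
  CIRC = c(16557, 2320, 1195, 17764, 107129)
)

# Keep only the rows for the year of interest
one_year <- subset(newspaper_yearly, YEAR == 2000)

p <- ggplot(one_year, aes(x = CIRC)) +
  geom_histogram(bins = 10) +
  theme_bw()
print(p)
```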
In case you want to group years in a more complicated way, I recommend adding a new variable group, such as the following
group_year<- function(year){
if (year >= 1900 && year < 1980) return ("1900 - 1980")
if (year >= 1980 && year < 2020) return ("1980 - 2020")
return ("default")
}
newspaper_yearly$group = sapply(newspaper_yearly$YEAR, group_year)
ggplot(newspaper_yearly, aes(x = CIRC)) + geom_histogram() + facet_grid(~group) + theme_bw()
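Because the grouping is a set of consecutive year ranges, base R's cut() can assign the same labels in a vectorised way, without sapply. A sketch that assumes the same boundaries as group_year above; the `years` vector is made up to exercise each branch:

```r
# Made-up years covering every branch of group_year
years <- c(1872, 1900, 1979, 1980, 2000, 2020)

# right = FALSE makes the intervals [1900, 1980) and [1980, 2020),
# matching the >= and < comparisons in group_year
group <- as.character(cut(years,
                          breaks = c(1900, 1980, 2020),
                          labels = c("1900 - 1980", "1980 - 2020"),
                          right = FALSE))

# Years outside every interval come back as NA; label them "default"
group[is.na(group)] <- "default"
print(group)
```

Applied to the real data, the same call with newspaper_yearly$YEAR in place of years would produce the group column used for faceting.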