Is there an R function for setting rows on aggregate data? - r

The data I am working with is from eBird, and I want to summarize species occurrence by both name and year. There are over 30k individual observations, each with its own count of birds. For example, in the raw data below, someone observed 2 Cooper's Hawks on Jan 1, 2021.
Raw looks like this:
specificName   individualCount  eventDate  year
Cooper's Hawk    1    (1/1/2018)   2018
Cooper's Hawk    1    (1/1/2020)    2020
Cooper's Hawk    2    (1/1/2021)    2021
Ideally, I would group all the Cooper's Hawk rows (specificName) by the year they were observed and sum the individualCount values. That way I can make statistical comparisons between the number of birds observed in 2018, 2019, 2020, and 2021.
I created the separate column for the year
year <- as.POSIXct(ebird.df$eventDate, format = "%m/%d/%Y")
ebird.df$year <- as.numeric(format(year, "%Y"))
Then aggregated with the following:
aggdata <- aggregate(ebird.df$individualCount , by = list( ebird.df$specificname, ebird.df$year ), FUN = sum)
There are hundreds of bird species, so Cooper's Hawks start on the 115th row so the output looks like this:
  Group.1   Group.2    x
115   2018  Cooper's Hawk  86
116   2019  Cooper's Hawk  152
117   2020  Cooper's Hawk  221
118   2021  Cooper's Hawk  116
My question is: how do I get the data into a table that looks like the following?
Species Name   2018 2019 2020 2021
Cooper's Hawk   86   152  221  116
I want to eventually run some basic ecology stats on the data using vegan, but one problem first I guess lol
Thanks!

There are errors in the data and code in the question, so we use the code and reproducible data given in the Note at the end.
Using xtabs, we get a cross-tabulation directly from ebird.df. No packages are used.
xtabs(individualCount ~ specificName + year, ebird.df)
##                year
## specificName    2018 2020 2021
##   Cooper's Hawk    1    1    2
Optionally convert it to a data.frame:
xtabs(individualCount ~ specificName + year, ebird.df) |>
  as.data.frame.matrix()
##               2018 2020 2021
## Cooper's Hawk    1    1    2
Although we did not need aggdata, if you need it for some other reason it can be computed using aggregate.formula like this:
aggregate(individualCount ~ specificName + year, ebird.df, sum)
Note
Lines <- "specificName,individualCount,eventDate,year
\"Cooper's Hawk\",1,(1/1/2018),2018
\"Cooper's Hawk\",1,(1/1/2020),2020
\"Cooper's Hawk\",2,(1/1/2021),2021"
ebird.df <- read.csv(text = Lines, strip.white = TRUE)
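If the long aggregate is already in hand, it can also be reshaped to one row per species with base R's reshape. The following is only a sketch: the second species (Barn Owl) and its count are made up here to show what happens when a species has no records in some year, and the names agg and wide are hypothetical.

```r
# Hypothetical example data: the Barn Owl row is invented for illustration
ebird.df <- data.frame(
  specificName    = c("Cooper's Hawk", "Cooper's Hawk", "Cooper's Hawk", "Barn Owl"),
  individualCount = c(1, 1, 2, 3),
  year            = c(2018, 2020, 2021, 2020)
)

# Long format: one row per species/year combination that occurs in the data
agg <- aggregate(individualCount ~ specificName + year, ebird.df, sum)

# Wide format: one row per species, one column per year
wide <- reshape(agg, idvar = "specificName", timevar = "year", direction = "wide")

# reshape() leaves NA where a species/year combination is absent; treat as zero
wide[is.na(wide)] <- 0

# Drop the "individualCount." prefix so the columns are named by year only
names(wide) <- sub("^individualCount\\.", "", names(wide))
wide
```

Unlike xtabs, reshape fills absent combinations with NA rather than 0, which is why the explicit replacement step is included.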

Related

How to plot total cumulative row count over time ggplot

All I'm trying to do is plot a cumulative row count (so that by 2021 the graph line has reached 73) over time. I'm quite new to R, and I feel like this should be really easy, so I don't know why it isn't working.
My data looks like this:
ID name year
73 name73 2021
72 name72 2021
71 name71 2019
70 name70 2017
69 name69 2015
68 name68 2015
I've tried this code and it kind of works but sometimes the line goes down which doesn't seem right, since I just want a cumulative count.
ggplot(df, aes(x=year, y=ID)) +
  geom_line()
Any help would be much appreciated!
Order the data by year and ID before plotting. The line then runs from the first year to the last and, within each year, from the smaller ID to the larger.
x <- 'ID name year
73 name73 2021
72 name72 2021
71 name71 2019
70 name70 2017
69 name69 2015
68 name68 2015'
df <- read.table(textConnection(x), header = TRUE)
library(ggplot2)
i <- order(df$year, df$ID)
ggplot(df[i,], aes(x=year, y=ID)) +
  geom_line()
Created on 2022-07-08 by the reprex package (v2.0.1)
An alternative, though I am not sure it is what the question asks for, is to aggregate the IDs by year, keeping the maximum in each year.
The code below does this and pipes the result directly to the plot, without creating an extra data set in the global environment.
aggregate(ID ~ year, df, max) |>
  ggplot(aes(x=year, y=ID)) +
  geom_line()
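If ID cannot be relied on as a running count, the cumulative row count can be computed from the rows themselves: count the rows per year, then take the cumulative sum. This is a sketch using only base R; the names counts, n and cumulative are made up here, and with the six sample rows the total reaches 6, not 73.

```r
df <- read.table(text = "ID name year
73 name73 2021
72 name72 2021
71 name71 2019
70 name70 2017
69 name69 2015
68 name68 2015", header = TRUE)

# Number of rows per year, as a data frame with columns year and n
counts <- as.data.frame(table(year = df$year), responseName = "n",
                        stringsAsFactors = FALSE)
counts$year <- as.integer(counts$year)

# Running total of rows up to and including each year
counts$cumulative <- cumsum(counts$n)
counts
```

The resulting data frame can be plotted in the same way as above, with year on the x axis and cumulative on the y axis.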

Simple Forecasting using Average method in R for Time series data for multiple groups

I've done forecasting and time series analysis for individual series, but not for a group of series in one go. I have historical data (36 months, dated the 1st of each month as required for the time series) for multiple groups (Model No.) in a data frame that looks like this:
ModelNo. Month_Year Quantity
a 2017-06-01 0
a 2017-07-01 5
a 2017-08-01 3
.. .......... ....
.. .......... ....
a 2020-05-01 6
b 2017-06-01 9
b 2017-07-01 0
b 2017-08-01 1
.. .......... ....
.. .......... ....
b 2020-05-01 4
c 2020-05-01 3
c 2017-06-01 1
c 2017-07-01 1
c 2017-08-01 0
.. .......... ....
.. .......... ....
c 2020-05-01 4
I then use the code below to subset my data frame for one group, so that I can generate a forecast with the simple average method:
Selected_data<-subset(data, ModelNo.=='a')
currentMonth<-month(Sys.Date())
currentYear<-year(Sys.Date())
I then create the time series object for 24 months, which I pass to my forecast function:
y_ts = ts(Selected_data$Quantity, start=c(currentYear-3, currentMonth), end=c(currentYear-1, currentMonth-1), frequency=12)
I then use the simple mean function to forecast the next 12 months (for which I already have actual Quantity values, June 2019 to May 2020):
meanf(y_ts, 12, level = c(95))
and I get output like this (from my original data, not the sample data above):
         Point Forecast     Lo 95    Hi 95
Jun 2019          1.875 -3.117887 6.867887
Jul 2019          1.875 -3.117887 6.867887
Aug 2019          1.875 -3.117887 6.867887
Sep 2019          1.875 -3.117887 6.867887
Oct 2019          1.875 -3.117887 6.867887
Nov 2019          1.875 -3.117887 6.867887
Dec 2019          1.875 -3.117887 6.867887
Jan 2020          1.875 -3.117887 6.867887
Feb 2020          1.875 -3.117887 6.867887
Mar 2020          1.875 -3.117887 6.867887
Apr 2020          1.875 -3.117887 6.867887
May 2020          1.875 -3.117887 6.867887
So I'm able to generate a forecast for one Model No. here. However, my questions are:
I have to generate this forecast for all groups in my data frame (a, b, c and so on). I don't know how to do this and store the result in a new data frame of forecast values, along with the dates, for each ModelNo.
I know that the code below returns the forecast values that meanf shows in its output:
meanf(y_ts, 12, level = c(95))$mean
But how do I store them for each group against the dates in a data frame? I tried mutate() and it didn't work.
Following on from Question 1, how should I then compare the forecast values with the actual values (as you can see, I only sliced 24 months of data to predict 12 months of values)? I know there are methods in R and time series analysis that use multiple historical train and test windows and then compare against actual values to measure forecast accuracy. I plan to expand this to try multiple forecasting methods.
I would be grateful if someone could help me with the two questions above.
I believe there is a learning curve here. I partially know the process, but I'm not sure how to fill this knowledge gap systematically so that I can use forecasting methods for multiple groups and test them against actual values. Apart from answers to the two questions above, any link to a tutorial that would help me learn more would be very helpful. Thank you very much.
Your questions are rather broad, so you can start with something like this and think about how to proceed from there. First of all, you did not provide reproducible data, so I used what you posted, with some tweaks to your code to make it work. The idea is to build a train and a test time series for each model, create the forecast, and store it in a data.frame. Then you can calculate, for example, the RMSE to see the goodness of fit on the test set.
library(forecast)
library(lubridate)
# set date limits to train and test
train_start <- ymd("2017-06-01")
train_end <- ymd("2019-05-01")
test_start <- ymd("2019-06-01") # end not necessary
# create an empty list
listed <- list()
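for (i in unique(data$ModelNo.))
{
  # subset one group
  Selected_data <- subset(data, ModelNo. == i)
  # as ts
  # note: month(max(...)) is the month of the last observation (May in the
  # sample data), so the series is labelled as starting one month before the
  # first observation; month(min(...)) would start it at the first observation
  y_ts <- ts(Selected_data$Quantity,
             start = c(year(min(data$Month_Year)),
                       month(max(data$Month_Year))),
             frequency = 12)
  # create train
  train_ts <- window(y_ts,
                     start = c(year(train_start), month(train_start)),
                     end = c(year(train_end), month(train_end)), frequency = 12)
  # create test (note: using parameters ok to your sample data)
  test_ts <- window(y_ts,
                    start = c(year(test_start), month(test_start)), frequency = 12)
  # forecast over the test horizon and keep the actual values alongside
  listed[[i]] <- cbind(
    data.frame(meanf(train_ts, length(test_ts), level = c(95))),
    real = as.vector(test_ts))
}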
Now for part 1, you can create a data.frame with the results:
res <- do.call(rbind,listed)
head(res) # only head to simplify output
           Point.Forecast     Lo.95    Hi.95 real
a.Jun 2019       49.29167 -22.57528 121.1586   95
a.Jul 2019       49.29167 -22.57528 121.1586   93
a.Aug 2019       49.29167 -22.57528 121.1586    5
a.Sep 2019       49.29167 -22.57528 121.1586   66
a.Oct 2019       49.29167 -22.57528 121.1586   47
a.Nov 2019       49.29167 -22.57528 121.1586   40
For point 2, you can calculate the RMSE (there is a handy function in the Metrics package) for each time series:
library(Metrics)
goodness <- lapply(listed, function(x) rmse(x$real, x$Point.Forecast))
goodness
$a
[1] 31.8692
$b
[1] 30.69859
$c
[1] 30.28037
With data:
set.seed(1234)
data <- data.frame(ModelNo. = c(rep("a",36), rep("b",36), rep("c",36)),
                   Month_Year = lubridate::ymd(rep(seq(as.Date("2017/6/1"), by = "month", length.out = 36), 3)),
                   Quantity = sample(1:100, 108, replace = TRUE)
)
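The same per-group idea can also be written with split() and lapply() instead of a for loop. The sketch below uses base R only and replaces meanf with the plain mean of the training values (which is the point forecast of the mean method), so it gives no prediction intervals. The function name forecast_one and the column name forecast are made up for this illustration, and the train/test split is done directly on the dates rather than through ts and window.

```r
set.seed(1234)
data <- data.frame(
  ModelNo.   = rep(c("a", "b", "c"), each = 36),
  Month_Year = rep(seq(as.Date("2017-06-01"), by = "month", length.out = 36), 3),
  Quantity   = sample(1:100, 108, replace = TRUE)
)

# Last month that belongs to the training window
train_end <- as.Date("2019-05-01")

# Hypothetical helper: mean forecast for one group, with actuals alongside
forecast_one <- function(d) {
  d     <- d[order(d$Month_Year), ]
  train <- d[d$Month_Year <= train_end, ]
  test  <- d[d$Month_Year >  train_end, ]
  data.frame(ModelNo.   = test$ModelNo.,
             Month_Year = test$Month_Year,
             forecast   = mean(train$Quantity),  # same value for every test month
             real       = test$Quantity)
}

# One data frame with forecast and actual values for every group and date
res <- do.call(rbind, lapply(split(data, data$ModelNo.), forecast_one))

# RMSE per group, computed by hand
rmse_by_model <- sapply(split(res, res$ModelNo.),
                        function(x) sqrt(mean((x$real - x$forecast)^2)))
head(res)
rmse_by_model
```

Keeping ModelNo. and Month_Year as ordinary columns makes it straightforward to bind in the results of other forecasting methods later and compare them per group.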

Grouping Like Variables in Tibbles for R

Working on my second project for R. I'm trying to create some variable groups using dplyr, but I'm not sure what the heck I'm doing here.
I'm working with financial data and among the categories, there are several different forms of travel, listed as such:
Travel - Gas, Travel - Airfare, Travel - Subway...
I want to create a new tibble that groups all the Travel subtypes as one Travel subgroup. Is there a good way to do this?
I've been trying to use the dplyr filter function to no effect so far.
Sorry, I was really tired and forgot to put an example up
I have data that's like this:
Month - Year - Category - Amount
01 - 2016 - "Travel- Air" - 247.02
01 - 2016 - "Travel- Car" - 29.04
01 - 2016 - "Retail" - 45.00
03 - 2017 - "Travel - Air" - 253.60
I'm trying to group things so that all the transactions of one type in a particular month/year are summed together like this:
Total_Category_Transactions_Month <- Total_Transactions %>%
  group_by(Month, Year, Category) %>%
  summarize(monthly = sum(Amount))
But after looking at my data, there are just way too many things that are grouped up as "Travel - foo." I'd like to keep that detail for later to analyze, but for the big scale picture, I want to see if I can lump all those Travel expenses as one thing each month.
The Output should end up being:
Month - Year - Category - Amount
01 - 2016 - "Travel" - 276.06
01 - 2016 - "Retail" - 45.00
03 - 2017 - "Travel" - 253.60
Where all the subtypes of category Travel_Foo from the same month and year are added together into one category just called Travel
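One option is to use tidyr::separate:
library(dplyr)
library(tidyr)
# keep only the first piece of Category (the text before the first separator)
df %>%
  separate(Category, into = c("Category"), extra = "drop") %>%
  group_by(Month, Year, Category) %>%
  summarise(Amount = sum(Amount)) %>%
  ungroup() %>%
  as.data.frame()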
# Month Year Category Amount
#1 1 2016 Retail 45.00
#2 1 2016 Travel 276.06
#3 3 2017 Travel 253.60
Note that the as.data.frame() is not really necessary here. I've only included it to show that the resulting Amounts are the ones from your expected output (the tibbles don't print the same number of decimal places).
Sample data
df <- read.table(text =
"Month Year Category Amount
01 2016 'Travel- Air' 247.02
01 2016 'Travel- Car' 29.04
01 2016 'Retail' 45.00
03 2017 'Travel - Air' 253.60", header = T)
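Since the question mentions keeping the detailed subtypes for later analysis, another possibility is to leave Category untouched and add a separate coarse grouping column. This is a base R sketch; the column name Group and the object name totals are made up for this illustration, and the pattern assumes that the top-level category is everything before the first hyphen.

```r
df <- read.table(text =
"Month Year Category Amount
01 2016 'Travel- Air' 247.02
01 2016 'Travel- Car' 29.04
01 2016 'Retail' 45.00
03 2017 'Travel - Air' 253.60", header = TRUE)

# Strip the first hyphen (with any spaces before it) and everything after it;
# categories without a hyphen, such as Retail, are left unchanged
df$Group <- sub("\\s*-.*$", "", df$Category)

# Monthly totals per coarse group; the original Category column is still in df
totals <- aggregate(Amount ~ Month + Year + Group, df, sum)
totals
```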

Frequency count per observation in R

I am trying to do a frequency count of a categorical variable (i.e., upper division class) per case in a dataset that is currently in long format. I am using R.
Current data set:
Student_ID   Class  UD_class
111       PSY 400   1
111      ENG 310   0
111       EE 510  1
I would like to convert it to a data frame that looks like this:
Student_ID   UD_class
111        2
I tried the following code, but it gives me the wrong frequencies:
data.frame(table(data$Student_ID, data$UD_class))
Any suggestions on how I can do this in R? Thank you!
Try:
with(data[data$UD_class==1,], data.frame(table(Student_ID)))
Try as.data.frame instead of data.frame. To keep your column headings, use the with function: as.data.frame(with(data, table(Student_ID, UD_class)))
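Because UD_class is coded as 0/1, summing it per student gives the count directly. The sketch below uses aggregate; the second student (222) and the class BIO 101 are invented here to show that a student with no upper division classes is kept with a count of 0, whereas filtering on UD_class == 1 first would drop that student.

```r
# Student 222 is hypothetical and added only to illustrate a zero count
data <- data.frame(
  Student_ID = c(111, 111, 111, 222),
  Class      = c("PSY 400", "ENG 310", "EE 510", "BIO 101"),
  UD_class   = c(1, 0, 1, 0)
)

# Sum of the 0/1 indicator per student = number of upper division classes
counts <- aggregate(UD_class ~ Student_ID, data, sum)
counts
```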

How to form linear model from two data frames?

MarriageLicen
Year Month Amount
1 2011 Jan 742
2 2011 Feb 796
3 2011 Mar 1210
4 2011 Apr 1376
BusinessLicen
Month Year MARRIAGE_LICENSES
1 Jan 2011 754
2 Feb 2011 2706
3 Mar 2011 2689
4 Apr 2011 738
My question is: how can we predict the number of Marriage Licenses (Y) issued by the city using the number of Business Licenses (X)?
And how can we join the two datasets together?
It says that you can join using the combined key of Month and Year.
I have been struggling with this question for several days.
There are three options here.
The first is to just be direct. I'm going to assume you have the labels swapped around for the data frames in your example (it doesn't make a whole lot of sense to have a MARRIAGE_LICENSES variable in the BusinessLicen data frame, if I'm following what you are trying to do).
You can model the relationship between those two variables with:
my.model <- lm(MarriageLicen$MARRIAGE_LICENSES ~ BusinessLicen$Amount)
The second (not very rational) option would be to make a new data frame explicitly, since it looks like you have an exact match on each of your rows:
new.df <- data.frame(marriage.licenses=MarriageLicen$MARRIAGE_LICENSES, business.licenses=BusinessLicen$Amount)
my.model <- lm(marriage.licenses ~ business.licenses, data=new.df)
Finally, if you don't actually have the perfect alignment shown in your example you can use merge.
my.df <- merge(BusinessLicen, MarriageLicen, by=c("Month", "Year"))
my.model <- lm(MARRIAGE_LICENSES ~ Amount, data=my.df)
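To make the merge option concrete, here is a self-contained sketch built from the four rows shown in the question. It follows the assumption made above that the data frame labels were swapped, so the table with the Amount column is treated as BusinessLicen and the one with MARRIAGE_LICENSES as MarriageLicen. The value 1000 passed to predict is an arbitrary example input, not something from the question.

```r
# Data from the question, with the labels assigned as assumed in the answer
BusinessLicen <- data.frame(
  Year   = 2011,
  Month  = c("Jan", "Feb", "Mar", "Apr"),
  Amount = c(742, 796, 1210, 1376)
)
MarriageLicen <- data.frame(
  Month             = c("Jan", "Feb", "Mar", "Apr"),
  Year              = 2011,
  MARRIAGE_LICENSES = c(754, 2706, 2689, 738)
)

# Join on the combined key, so rows match even if the two frames are ordered differently
my.df <- merge(BusinessLicen, MarriageLicen, by = c("Month", "Year"))

# Simple linear regression of marriage licenses on business licenses
my.model <- lm(MARRIAGE_LICENSES ~ Amount, data = my.df)

# Predicted marriage licenses for a hypothetical month with 1000 business licenses
pred <- predict(my.model, newdata = data.frame(Amount = 1000))
pred
```

With only four rows the fit itself is not meaningful; the point is the sequence merge, lm, predict.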
