Conversion of monthly data to yearly data in a dataframe in R

I have a dataframe showing monthly mgpp from 2000-2010:
dataframe1
Year Month mgpp
1: 2000 1 0.01986404
2: 2000 2 0.011178429
3: 2000 3 0.02662008
4: 2000 4 0.05034293
5: 2000 5 0.23491388
---
128: 2010 8 0.13234501
129: 2010 9 0.10432369
130: 2010 10 0.04329537
131: 2010 11 0.04343289
132: 2010 12 0.09494946
I am trying to convert dataframe1 into a raster that shows the variable mgpp. First, however, I want to reshape the dataframe so that it shows only the yearly mgpp. The expected outcome is shown below:
dataframe1
Year mgpp
1: 2000 0.01986704
2: 2001 0.01578429
3: 2002 0.02662328
4: 2003 0.05089593
5: 2004 0.07491388
6: 2005 0.11229201
7: 2006 0.10318569
8: 2007 0.07129537
9: 2008 0.04373689
10: 2009 0.02885386
11: 2010 0.74848348
I want to aggregate the months by mean. For instance, the value for 2000 should be the mean of January through December of that year. How can I achieve this? Help would be appreciated.

Here is a data.table approach.
library(data.table)
setDT(dataframe1)[,.(Yearly.mgpp = mean(mgpp)),by=Year]
Year Yearly.mgpp
1: 2000 0.06858387
2: 2010 0.08366928
Or if you prefer dplyr.
library(dplyr)
dataframe1 %>%
group_by(Year) %>%
summarise(Yearly.mgpp = mean(mgpp))
# A tibble: 2 x 2
Year Yearly.mgpp
<dbl> <dbl>
1 2000 0.0686
2 2010 0.0837
Or base R.
result <- sapply(split(dataframe1$mgpp,dataframe1$Year),mean)
data.frame(Year = as.numeric(names(result)),Yearly.mgpp = result)
Year Yearly.mgpp
2000 2000 0.06858387
2010 2010 0.08366928
Sample Data
dataframe1 <- structure(list(Year = c(2000, 2000, 2000, 2000, 2000, 2010, 2010,
2010, 2010, 2010), Month = c(1, 2, 3, 4, 5, 8, 9, 10, 11, 12),
mgpp = c(0.01986404, 0.011178429, 0.02662008, 0.05034293,
0.23491388, 0.13234501, 0.10432369, 0.04329537, 0.04343289,
0.09494946)), class = "data.frame", row.names = c(NA, -10L
))
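A further base R variant, sketched here with aggregate() and the formula interface. It is not one of the approaches posted above; it rebuilds the same sample data and renames the result column to match the other outputs.

```r
# Sample data, same values as in the answer above
dataframe1 <- data.frame(
  Year  = c(2000, 2000, 2000, 2000, 2000, 2010, 2010, 2010, 2010, 2010),
  Month = c(1, 2, 3, 4, 5, 8, 9, 10, 11, 12),
  mgpp  = c(0.01986404, 0.011178429, 0.02662008, 0.05034293, 0.23491388,
            0.13234501, 0.10432369, 0.04329537, 0.04343289, 0.09494946)
)

# Mean of mgpp within each Year; the result is a plain data.frame
yearly <- aggregate(mgpp ~ Year, data = dataframe1, FUN = mean)
names(yearly)[2] <- "Yearly.mgpp"
yearly
```

Because the result is an ordinary data.frame with one row per year, it can be passed on to whatever raster step follows.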

Related

Calculating the change in % of data by year

I am trying to calculate the % change by year in the following dataset. Does anyone know if this is possible?
I have the difference but am unsure how to turn it into a percentage:
diff(economy_df_by_year$gdp_per_capita)
df
year gdp
1998 8142.
1999 8248.
2000 8211.
2001 7926.
2002 8366.
2003 10122.
2004 11493.
2005 12443.
2006 13275.
2007 15284.
Assuming that gdp is the total value, you could do something like this:
library(tidyverse)
tribble(
~year, ~gdp,
1998, 8142,
1999, 8248,
2000, 8211,
2001, 7926,
2002, 8366,
2003, 10122,
2004, 11493,
2005, 12443,
2006, 13275,
2007, 15284
) -> df
df |>
mutate(pdiff = 100*(gdp - lag(gdp))/gdp)
#> # A tibble: 10 × 3
#> year gdp pdiff
#> <dbl> <dbl> <dbl>
#> 1 1998 8142 NA
#> 2 1999 8248 1.29
#> 3 2000 8211 -0.451
#> 4 2001 7926 -3.60
#> 5 2002 8366 5.26
#> 6 2003 10122 17.3
#> 7 2004 11493 11.9
#> 8 2005 12443 7.63
#> 9 2006 13275 6.27
#> 10 2007 15284 13.1
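This relies on the tidyverse (dplyr's mutate and lag). Note that it divides the difference by the current year's gdp; divide by lag(gdp) instead if you want growth relative to the previous year.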
If gdp is the difference, you will need the total to get a percentage, if that is what you mean by change in percentage by year.
df$change <- NA
df$change[2:10] <- (df$gdp[2:10] - df$gdp[1:9]) / df$gdp[1:9]
This assigns the yearly GDP growth (as a fraction of the previous year's value) to each row except the first, which remains NA.
Another possibility, which reports 0 rather than NA for the first year:
df$diff <- c(0, diff(df$gdp))
df$percentDiff <- 100 * df$diff / (df$gdp - df$diff)
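For comparison, a minimal base R sketch of the percent change relative to the previous year, written without hard-coded row indices. The column name pct_change is an assumption chosen for illustration; the data are the values from the question.

```r
df <- data.frame(
  year = 1998:2007,
  gdp  = c(8142, 8248, 8211, 7926, 8366, 10122, 11493, 12443, 13275, 15284)
)

# diff() gives year-over-year differences (one element shorter than gdp);
# head(gdp, -1) is the previous year's value for each of those differences.
# The first year has no predecessor, so it is set to NA.
df$pct_change <- c(NA, 100 * diff(df$gdp) / head(df$gdp, -1))
df
```

Using head(df$gdp, -1) instead of fixed indices such as 1:9 keeps the code valid if more years are added.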

R: How do I avoid getting an error when merging two data frames (group by/summarise)?

I have a big data frame of 80,000 rows. It was created by combining individual data frames from different years. The origin variable indicates the year of the entry's original data frame.
Here is an example of the first few of the big data frame rows that show how data frames from 2003 and 2011 were combined.
df_1:
ID City State origin
1 NY NY 2003
2 NY NY 2003
3 SF CA 2003
1 NY NY 2011
3 SF CA 2011
2 NY NY 2011
4 LA CA 2011
5 SD CA 2011
Now I want to create a new variable called first_appearance that takes the min of the origin variable for each ID:
final_df:
ID City State origin first_appearance
1 NY NY 2003 2003
2 NY NY 2003 2003
3 SF CA 2003 2003
1 NY NY 2011 2003
3 SF CA 2011 2003
2 NY NY 2011 2003
4 LA CA 2011 2011
5 SD CA 2011 2011
So far, I've tried using:
prestep_final <- df_1 %>% group_by(ID) %>% summarise(first_appearance = min(origin))
final_df <- merge(prestep_final, df_1, by = "ID")
Prestep_final works and produces a data frame with the ID and the first_appearance.
Unfortunately, the merge step doesn't work and yields a data frame with NA entries only.
How can I improve my code so that I can produce a table like final_df above? I'd appreciate any suggestions and don't have package preferences.
If you change summarise to mutate you get your desired result without merging:
library(tidyverse)
df <- tibble::tribble(
~ID, ~City, ~State, ~origin,
1, 'NY', 'NY', 2003,
2, 'NY', 'NY', 2003,
3, 'SF', 'CA', 2003,
1, 'NY', 'NY', 2011,
3, 'SF', 'CA', 2011,
2, 'NY', 'NY', 2011,
4, 'LA', 'CA', 2011,
5, 'SD', 'CA', 2011
)
df %>% group_by(ID) %>%
mutate(first_appearance = min(origin))
#> # A tibble: 8 x 5
#> # Groups: ID [5]
#> ID City State origin first_appearance
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 NY NY 2003 2003
#> 2 2 NY NY 2003 2003
#> 3 3 SF CA 2003 2003
#> 4 1 NY NY 2011 2003
#> 5 3 SF CA 2011 2003
#> 6 2 NY NY 2011 2003
#> 7 4 LA CA 2011 2011
#> 8 5 SD CA 2011 2011
Created on 2020-06-10 by the reprex package (v0.3.0)
An option with data.table
library(data.table)
setDT(df)[, first_appearance := min(origin), ID]
Or in base R
df$first_appearance <- with(df, ave(origin, ID, FUN = min))
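If a separate lookup table of first appearances is still wanted, the summarise-then-merge route can be sketched in base R as follows. This only shows that the two-step approach works on the sample rows; the cause of the NA entries in the original attempt cannot be determined from the post. The name first_seen is an assumption used for illustration.

```r
df_1 <- data.frame(
  ID     = c(1, 2, 3, 1, 3, 2, 4, 5),
  City   = c("NY", "NY", "SF", "NY", "SF", "NY", "LA", "SD"),
  State  = c("NY", "NY", "CA", "NY", "CA", "NY", "CA", "CA"),
  origin = c(2003, 2003, 2003, 2011, 2011, 2011, 2011, 2011)
)

# Step 1: one row per ID holding the earliest origin year
first_seen <- aggregate(origin ~ ID, data = df_1, FUN = min)
names(first_seen)[2] <- "first_appearance"

# Step 2: join the lookup back onto the full table by ID
final_df <- merge(df_1, first_seen, by = "ID")
final_df
```

Note that merge() sorts the result by the key, so the row order differs from df_1 even though the content matches the desired final_df.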

Obtaining back incidence data from cumulative data?

I have a dataframe for which I have date data and cumulative counts.
I am trying to do a reverse of cumsum to get the daily counts but also getting the counts per group.
I am trying to go from dataframe A to dataframe B.
I am using R and tidyr.
Here is the code :
df <- data.frame(cum_count = c(5, 14, 50, 5, 14, 50),
state = c("Alabama", "Alabama", "Alabama", "NY", "NY", "NY"),
Year = c(2012:2014, 2012:2014))
Dataframe A
cum_count state Year
1 5 Alabama 2012
2 14 Alabama 2013
3 50 Alabama 2014
4 5 NY 2012
5 14 NY 2013
6 50 NY 2014
Dataframe B
cum_count state Year
1 5 Alabama 2012
2 9 Alabama 2013
3 36 Alabama 2014
4 5 NY 2012
5 9 NY 2013
6 36 NY 2014
I have tried using the diff function :
df <- df %>% group_by(state) %>%
mutate(daily_count = diff(cum_count))
But I get
Error: Column daily_count must be length 3 (the number of rows) or one, not 2
Let me know what you think.
Thanks!
diff returns a vector one element shorter than its input, while mutate requires the new column to have the same length as the group (or length 1, which can be recycled). We can prepend a value, either NA or the first value of 'cum_count':
library(dplyr)
df %>%
group_by(state)%>%
mutate(daily_count = c(first(cum_count), diff(cum_count)))
# A tibble: 6 x 4
# Groups: state [2]
# cum_count state Year daily_count
# <dbl> <fct> <int> <dbl>
#1 5 Alabama 2012 5
#2 14 Alabama 2013 9
#3 50 Alabama 2014 36
#4 5 NY 2012 5
#5 14 NY 2013 9
#6 50 NY 2014 36
Or use lag and subtract it from the column itself (replace_na comes from tidyr, so load it as well):
library(tidyr)
df %>%
group_by(state) %>%
mutate(daily_count = replace_na(cum_count - lag(cum_count), first(cum_count)))
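The same idea can be sketched in base R with ave(), which applies a function within each group and returns a vector aligned with the original rows. This is an illustrative alternative, not part of the answer above.

```r
df <- data.frame(
  cum_count = c(5, 14, 50, 5, 14, 50),
  state     = c("Alabama", "Alabama", "Alabama", "NY", "NY", "NY"),
  Year      = c(2012:2014, 2012:2014)
)

# Within each state, keep the first cumulative value and then take
# successive differences, so the output has the same length as the group.
df$daily_count <- ave(df$cum_count, df$state,
                      FUN = function(x) c(x[1], diff(x)))
df
```

A quick sanity check is that cumsum() of daily_count within each state reproduces cum_count.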

How can I change row and column indexes of a dataframe in R?

I have a dataframe in R which has three columns Product_Name(name of books), Year and Units (number of units sold in that year) which looks like this:
Product_Name Year Units
A Modest Proposal 2011 10000
A Modest Proposal 2012 11000
A Modest Proposal 2013 12000
A Modest Proposal 2014 13000
Animal Farm 2011 8000
Animal Farm 2012 9000
Animal Farm 2013 11000
Animal Farm 2014 15000
Catch 22 2011 1000
Catch 22 2012 2000
Catch 22 2013 3000
Catch 22 2014 4000
....
I intend to make a R Shiny dashboard with that where I want to keep the year as a drop-down menu option, for which I wanted to have the dataframe in the following format
A Modest Proposal Animal Farm Catch 22
2011 10000 8000 1000
2012 11000 9000 2000
2013 12000 11000 3000
2014 13000 15000 4000
or the other way round where the Product Names are row indexes and Years are column indexes, either way goes.
How can I do this in R?
Your general issue is transforming long data to wide data. For this, you can use data.table's dcast function (among many others):
library(data.table)
dt = data.table(
Name = c(rep('A', 4), rep('B', 4), rep('C', 4)),
Year = c(rep(2011:2014, 3)),
Units = rnorm(12)
)
> dt
Name Year Units
1: A 2011 -0.26861318
2: A 2012 0.27194732
3: A 2013 -0.39331361
4: A 2014 0.58200101
5: B 2011 0.09885381
6: B 2012 -0.13786098
7: B 2013 0.03778400
8: B 2014 0.02576433
9: C 2011 -0.86682584
10: C 2012 -1.34319590
11: C 2013 0.10012673
12: C 2014 -0.42956207
> dcast(dt, Year ~ Name, value.var = 'Units')
Year A B C
1: 2011 -0.2686132 0.09885381 -0.8668258
2: 2012 0.2719473 -0.13786098 -1.3431959
3: 2013 -0.3933136 0.03778400 0.1001267
4: 2014 0.5820010 0.02576433 -0.4295621
For the next time, it is easier if you provide a reproducible example, so that the people assisting you do not have to manually recreate your data structure :)
You can use pivot_wider from the tidyr package. I assume your data is stored in df; dplyr is loaded here for the %>% pipe.
library(tidyr)
library(dplyr)
df %>%
pivot_wider(names_from = Product_Name, values_from = Units)
Assuming that your dataframe is ordered by Product_Name and by year, I will generate artificial data similar to your dataframe. Try this:
Col_1 <- sort(rep(LETTERS[1:3], 4))
Col_2 <- rep(2011:2014, 3)
# artificial data
resp <- ceiling(rnorm(12, 5000, 500))
uu <- data.frame(Col_1, Col_2, resp)
uu
# output is
Col_1 Col_2 resp
1 A 2011 5297
2 A 2012 4963
3 A 2013 4369
4 A 2014 4278
5 B 2011 4721
6 B 2012 5021
7 B 2013 4118
8 B 2014 5262
9 C 2011 4601
10 C 2012 5013
11 C 2013 5707
12 C 2014 5637
>
> # Here starts
> output <- aggregate(uu$resp, list(uu$Col_1), function(x) {x})
> output
Group.1 x.1 x.2 x.3 x.4
1 A 5297 4963 4369 4278
2 B 4721 5021 4118 5262
3 C 4601 5013 5707 5637
>
output2 <- output [, -1]
colnames(output2) <- levels(as.factor(uu$Col_2))
rownames(output2) <- levels(as.factor(uu$Col_1))
# transpose the matrix
> t(output2)
A B C
2011 5297 4721 4601
2012 4963 5021 5013
2013 4369 4118 5707
2014 4278 5262 5637
> # or convert to data.frame
> as.data.frame(t(output2))
A B C
2011 5297 4721 4601
2012 4963 5021 5013
2013 4369 4118 5707
2014 4278 5262 5637
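A shorter base R sketch of the same long-to-wide reshaping uses tapply() with two grouping factors. It rebuilds the rows shown in the question and does not depend on the data being sorted. Since each Product_Name and Year pair occurs once, sum simply returns that single value.

```r
df <- data.frame(
  Product_Name = rep(c("A Modest Proposal", "Animal Farm", "Catch 22"), each = 4),
  Year         = rep(2011:2014, times = 3),
  Units        = c(10000, 11000, 12000, 13000,
                   8000, 9000, 11000, 15000,
                   1000, 2000, 3000, 4000)
)

# Rows are years, columns are product names
wide <- tapply(df$Units, list(df$Year, df$Product_Name), sum)
wide

# Selecting one year, e.g. for a drop-down choice, is a row lookup by name
wide["2013", ]
```

The result is a numeric matrix with years as row names; as.data.frame(wide) converts it if a data frame is needed, and t(wide) gives the orientation with products as rows.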

R - date sequence with condition

I have this dataframe
test <-
data.frame(
id = c(4, 6, 9, 12),
open = c(as.Date("2011-01-01"), as.Date("2011-01-01"), as.Date("2011-01-01"), as.Date("2011-01-01")),
closed = c(as.Date("2011-12-01"), as.Date("2011-12-31"), as.Date("2012-01-01"), as.Date("2015-12-31"))
)
My goal is to find, for each id, the years in which the open-closed period reached the last day of the year. id 4 opened in 2011 and closed before December 31, so it should get NA. id 6 reached the last day of 2011 but not of 2012, and the same holds for id 9.
Result should be
summary <-
data.frame(
id = c(4, 6, 9, 12),
open = c(as.Date("2011-01-01"), as.Date("2011-01-01"), as.Date("2011-01-01"), as.Date("2011-01-01")),
closed = c(as.Date("2011-12-01"), as.Date("2011-12-31"), as.Date("2012-01-01"), as.Date("2015-12-31")),
open_summary = c(NA, 2011, 2011, 2011),
closed_summary = c(NA, 2011, 2011, 2015)
)
Then I'd like to create a sequence from these dates so result should be
result <-
data.frame(
y = c(2011, 2011, 2011, 2012, 2013, 2014, 2015),
id = c(6, 9, 12, 12, 12, 12, 12)
)
Here is a tidyverse solution, also using lubridate (for the year function)...
library(tidyverse)
library(lubridate)
summary <- test %>%
mutate(open_summary = year(open) * (year(open) > year(open - 1)),
closed_summary = (year(closed + 1) - 1) * (year(closed + 1) > year(open)))
output <- summary %>%
filter(open_summary * closed_summary > 1) %>%
mutate(open_year = map2(open_summary, closed_summary, seq)) %>%
select(id, open_year) %>%
unnest(c(open_year))
summary
id open closed open_summary closed_summary
1 4 2011-01-01 2011-12-01 2011 0
2 6 2011-01-01 2011-12-31 2011 2011
3 9 2011-01-01 2012-01-01 2011 2011
4 12 2011-01-01 2015-12-31 2011 2015
output
id open_year
1 6 2011
2 9 2011
3 12 2011
4 12 2012
5 12 2013
6 12 2014
7 12 2015
If either open_summary or closed_summary is zero, that is equivalent to your NA row.
Here is an approach using data.table:
library(data.table)
#create a lookup table of year end dates
yrend <- data.table(YR_END=seq(as.Date("2010-12-31"), as.Date("2015-12-31"), by="1 year"))[,
YR := year(YR_END)]
setDT(test)
#create open_summary column; it is just the year of the open column
test[, open_summary := year(open)]
#lookup the year for the closed date
test[, closed_summary := yrend[test, on=.(YR_END>=open, YR_END<=closed), mult="last", YR]]
#create the sequence in part 2 of the qn
test[!is.na(open_summary) & !is.na(closed_summary),
.(y=open_summary:closed_summary), id]
test output:
id open closed open_summary closed_summary
1: 4 2011-01-01 2011-12-01 2011 NA
2: 6 2011-01-01 2011-12-31 2011 2011
3: 9 2011-01-01 2012-01-01 2011 2011
4: 12 2011-01-01 2015-12-31 2011 2015
the other output:
id y
1: 6 2011
2: 9 2011
3: 12 2011
4: 12 2012
5: 12 2013
6: 12 2014
7: 12 2015
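For readers without either package, the second part of the question (one row per id and year whose December 31 falls inside the open-closed period) can be sketched in base R. The helper name years_reached is hypothetical, and the sketch assumes open is never later than closed.

```r
test <- data.frame(
  id     = c(4, 6, 9, 12),
  open   = as.Date(c("2011-01-01", "2011-01-01", "2011-01-01", "2011-01-01")),
  closed = as.Date(c("2011-12-01", "2011-12-31", "2012-01-01", "2015-12-31"))
)

# Hypothetical helper: years whose last day lies within [open, closed]
years_reached <- function(open, closed) {
  candidates <- seq(as.integer(format(open, "%Y")),
                    as.integer(format(closed, "%Y")))
  year_ends <- as.Date(paste0(candidates, "-12-31"))
  candidates[year_ends >= open & year_ends <= closed]
}

# Build one small data frame per id, then stack them
rows <- lapply(seq_len(nrow(test)), function(i) {
  y <- years_reached(test$open[i], test$closed[i])
  data.frame(y = y, id = rep(test$id[i], length(y)))
})
result <- do.call(rbind, rows)
result <- result[order(result$y, result$id), ]
rownames(result) <- NULL
result
```

An id that never reaches a year end, such as id 4, contributes no rows, which corresponds to the NA case in the summary table.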
