Sum rows in R and store the new value in a new dataframe - r

I have a dataframe that looks like this:
Year  Month      total_volume  us_assigned
1953  October           55154    18384.667
1953  November          22783     7594.333
1953  December          20996     6998.667
1954  January           22096     7365.333
1954  February          18869     6289.667
1954  March             11598     3866.000
1954  April             37051    12350.333
1954  May              105856    35285.333
1954  June              61320    20440.000
1954  July              44084    14694.667
1954  August           175152    58384.000
1954  September         80071    26690.333
The dataframe goes to the year 2021 with monthly observations as shown in the table above. I am trying to sum up 12 months (i.e., rows) at a time (from Oct. to Sept.) for the column "us_assigned" and save this value in a new dataframe, which would look like this:
Year  us_assigned
1     218343
2
3
Year 2 would have the sum of the next 12 months (i.e., the next Oct.-Sept.) and so on and so forth. I have thought of simply summing the rows by specifying them, like below, but this seems too tedious.
sum(us_volume[1:12,4])
I am sure there is a much easier way to do this. I am not too proficient with R so I appreciate any help.

This is relatively straightforward using group_by() and summarise() from the dplyr package, e.g.
library(dplyr)
df <- read.table(text = "Year Month total_volume us_assigned
1953 October 55154 18384.667
1953 November 22783 7594.333
1953 December 20996 6998.667
1954 January 22096 7365.333
1954 February 18869 6289.667
1954 March 11598 3866.000
1954 April 37051 12350.333
1954 May 105856 35285.333
1954 June 61320 20440.000
1954 July 44084 14694.667
1954 August 175152 58384.000
1954 September 80071 26690.333", header = TRUE)
df2 <- df %>%
  mutate(Year_int = cumsum(Month == "October")) %>% # every Oct add 1 to Year_int
  group_by(Year_int) %>%
  summarise(us_assigned = sum(us_assigned))
df2
#> # A tibble: 1 × 2
#> Year_int us_assigned
#> <int> <dbl>
#> 1 1 218343.
Created on 2023-01-23 with reprex v2.0.2
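If the data might not begin exactly in October, an alternative is to label each row with the calendar year in which its Oct.-Sept. block ends and sum within that label. The following is a minimal base R sketch under that assumption; the toy values and the column name block_end are made up for illustration and are not part of the original data.

```r
# Toy data: two complete Oct-Sep blocks (values are invented for illustration)
df <- data.frame(
  Year        = rep(c(1953, 1954, 1954, 1955), times = c(3, 9, 3, 9)),
  Month       = rep(c(month.name[10:12], month.name[1:9]), times = 2),
  us_assigned = c(rep(1, 12), rep(2, 12))
)

# Oct-Dec rows belong to the block that ends in the following calendar year
# (block_end is a hypothetical column name)
df$block_end <- df$Year + as.integer(df$Month %in% month.name[10:12])

# Sum us_assigned within each Oct-Sep block
totals <- aggregate(us_assigned ~ block_end, data = df, FUN = sum)
totals
```

Unlike a running counter, this label does not depend on the row order, so it should also work if the rows are shuffled or the first block is incomplete.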

Related

R transform data with year begin and year end into time series data

I have a problem that is very similar to this question: "R transform data frame with start and end year into time series". However, none of the solutions there have worked for me.
This is the original df:
df <- data.frame(country = c("Albania", "Albania", "Albania"), leader = c("Sali Berisha", "Sali Berisha", "Sali Berisha"), term = c(2, 2, 2), yearbegin = c(2009,2009, 2009), yearend = c(2013, 2013, 2013))
And it currently looks like this:
#> country leader term yearbegin yearend
#> 1 Albania Sali Berisha 2 2009 2013
#> 2 Albania Sali Berisha 2 2009 2013
#> 3 Albania Sali Berisha 2 2009 2013
And I'm trying to get it to look like this:
#> 1 Albania Sali Berisha 2 2009
#> 2 Albania Sali Berisha 2 2010
#> 3 Albania Sali Berisha 2 2011
#> 4 Albania Sali Berisha 2 2012
#> 5 Albania Sali Berisha 2 2013
When using this code:
library(tidyverse)
gpd_df <- gpd_df %>%
  mutate(year = map2(yearbegin, yearend, `:`)) %>%
  select(-yearbegin, -yearend) %>%
  unnest
I get a column that looks like this:
year
2009:2013
2009:2013
2009:2013
Many thanks in advance for your help!!
I am trying to transform the data from year begin/year end into time-series form, but so far I have only run into errors :')
Use distinct first:
library(dplyr)
library(tidyr)
library(purrr) # provides map2()
gpd_df %>%
  distinct() %>%
  mutate(year = map2(yearbegin, yearend, `:`), .keep = "unused") %>%
  unnest_longer(year)
country leader term year
1 Albania Sali Berisha 2 2009
2 Albania Sali Berisha 2 2010
3 Albania Sali Berisha 2 2011
4 Albania Sali Berisha 2 2012
5 Albania Sali Berisha 2 2013
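To see why removing the duplicates first matters, here is a minimal base R sketch. It is not the method from the answer above; expand_years is a hypothetical helper written only for this illustration. It expands each row into one row per year, with and without de-duplication.

```r
gpd_df <- data.frame(
  country   = rep("Albania", 3),
  leader    = rep("Sali Berisha", 3),
  term      = 2,
  yearbegin = 2009,
  yearend   = 2013
)

# Hypothetical helper: one output row per year in [yearbegin, yearend]
expand_years <- function(d) {
  n <- d$yearend - d$yearbegin + 1               # number of years per input row
  out <- d[rep(seq_len(nrow(d)), n), c("country", "leader", "term")]
  out$year <- unlist(Map(seq, d$yearbegin, d$yearend))
  rownames(out) <- NULL
  out
}

with_dups <- expand_years(gpd_df)          # every duplicate row is expanded
deduped   <- expand_years(unique(gpd_df))  # duplicates dropped first

nrow(with_dups)
nrow(deduped)
```

With the three identical input rows, the first call repeats every year three times, while the de-duplicated version has one row per year.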

How to plot various multiple variables (in the same column) over a long time series?

I am attempting to plot the following:
I have two columns. One is the type of disaster and the other is the year in which it occurred. I would like to make a stacked bar plot, or another plot that shows the number of each event by year: for example, a graph showing the number of tornadoes, hurricanes, and fires that occurred each year over 40 years, all in the same graph. Do I need to split them into separate columns, or is there a way to do this as is, without manipulating the data? I am also curious whether there is a way to count the values by year, as that would facilitate this process.
Thanks in advance! Data below...
inctype decdate
1 Tornado 1953
2 Tornado 1953
3 Flood 1953
4 Tornado 1953
5 Flood 1953
6 Tornado 1953
7 Tornado 1953
8 Flood 1953
9 Flood 1953
10 Fire 1953
11 Flood 1953
12 Other 1953
13 Tornado 1953
14 Flood 1954
15 Tornado 1954
16 Flood 1954
17 Flood 1954
18 Earthquake 1954
19 Flood 1954
20 Flood 1954
21 Hurricane 1954
22 Hurricane 1954
23 Hurricane 1954
24 Hurricane 1954
25 Hurricane 1954
26 Flood 1954
27 Hurricane 1954
28 Hurricane 1954
29 Flood 1954
30 Other 1954
31 Volcano 1955
32 Flood 1955
33 Tornado 1955
Getting the counts is very straightforward, so let's start with that. The easiest way is probably just the built in function table:
table(df$inctype, df$decdate)
#> 1953 1954 1955
#> Earthquake 0 1 0
#> Fire 1 0 0
#> Flood 5 7 1
#> Hurricane 0 7 0
#> Other 1 1 0
#> Tornado 6 1 1
#> Volcano 0 0 1
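If the counts are needed as a regular data frame (one row per event type and year) rather than as a cross-table, the table can be converted with as.data.frame(). A minimal sketch, using a small made-up subset of the data rather than the full data set:

```r
# Small invented sample with the same two columns as the question
df <- data.frame(
  inctype = c("Tornado", "Tornado", "Flood", "Flood", "Hurricane"),
  decdate = c(1953, 1953, 1953, 1954, 1954)
)

# Long format: one row per (inctype, decdate) combination, count in Freq.
# Combinations that never occur are kept with Freq = 0.
counts <- as.data.frame(
  table(inctype = df$inctype, decdate = df$decdate),
  stringsAsFactors = FALSE
)
counts
```

Note that decdate comes back as character here, so it may need as.numeric() before further numeric use.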
To make a stacked bar plot, you have several options. If for some reason you want to use only base R functions, you can do something like:
colours <- c("#440154FF", "#443A83FF", "#31688EFF", "#21908CFF",
             "#35B779FF", "#8FD744FF", "#FDE725FF")
barplot(table(df$inctype, df$decdate),
        col = colours,
        border = "gray20",
        font.axis = 2,
        xlab = "year",
        font.lab = 2)
legend(list(x = 2.6, y = 16),
       legend = levels(as.factor(df$inctype)),
       fill = colours)
Though perhaps the simplest and most popular way to get a stacked bar plot without manipulating your data is to use the ggplot2 package:
library(ggplot2)
ggplot(df, aes(decdate, fill = inctype)) + geom_bar()
This has the benefit of having a huge number of options to get something a bit more aesthetically pleasing:
ggplot(df, aes(decdate, fill = inctype)) +
  # on ggplot2 >= 3.4.0, use linewidth = 0.2 here instead of size
  geom_bar(color = "black", size = 0.2) +
  scale_fill_viridis_d() +
  labs(x = "Year", fill = "Event type") +
  theme_minimal()

what's the difference between these two classes and why I must use as_tibble function in my code?

Here is my code. I have two questions about it. First, what is the difference between these two classes? Second, why must I call as_tibble() before I can use pivot_wider()?
head(global_economy)
Country Code Year GDP Growth CPI Imports Exports Population
Afghanistan AFG 1960 537777811 NA NA 7.024793 4.132233 8996351
Afghanistan AFG 1961 548888896 NA NA 8.097166 4.453443 9166764
Afghanistan AFG 1962 546666678 NA NA 9.349593 4.878051 9345868
Afghanistan AFG 1963 751111191 NA NA 16.863910 9.171601 9533954
Afghanistan AFG 1964 800000044 NA NA 18.055555 8.888893 9731361
Afghanistan AFG 1965 1006666638 NA NA 21.412803 11.258279 9938414
write.table(global_economy,"global_economy.csv",sep=",",row.names=FALSE)
class(global_economy)
'tbl_ts' 'tbl_df' 'tbl' 'data.frame'
wider_tibble <- global_economy %>%
  as_tibble() %>%
  pivot_wider(names_from=Country,values_from=GDP)
class(wider_tibble)
'tbl_df' 'tbl' 'data.frame'
My guess: the first one is a time series tibble (a tsibble, class 'tbl_ts'). as_tibble() converts it to a regular tibble without the time series elements, so that pivot_wider() works.

Revaluing many observations with a for loop in R

I have a data set where I am looking at longitudinal data for countries.
master.set <- data.frame(
  Country = c(rep("Afghanistan", 3), rep("Albania", 3)),
  Country.ID = c(rep("Afghanistan", 3), rep("Albania", 3)),
  Year = c(2015, 2016, 2017, 2015, 2016, 2017),
  Happiness.Score = c(3.575, 3.360, 3.794, 4.959, 4.655, 4.644),
  GDP.PPP = c(1766.593, 1757.023, 1758.466, 10971.044, 11356.717, 11803.282),
  GINI = NA,
  Status = 2,
  stringsAsFactors = FALSE
)
> head(master.set)
Country Country.ID Year Happiness.Score GDP.PPP GINI Status
1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
4 Albania Albania 2015 4.959 10971.044 NA 2
5 Albania Albania 2016 4.655 11356.717 NA 2
6 Albania Albania 2017 4.644 11803.282 NA 2
I created that Country.ID variable with the intent of turning them into numerical values 1:159.
I am hoping to avoid doing something like this to replace the value at each individual observation:
master.set$Country.ID <- master.set$Country.ID[master.set$Country.ID == "Afghanistan"] <- 1
As I implied, there are 159 countries listed in the data set. Because it's longitudinal, there are 460 observations.
Is there any way to use a for loop to save me a lot of time? Here is what I attempted. I made a couple of lists and attempted to use an ifelse command to tell R to label each country the next number.
Here is what I have:
#List of country names
N.Countries <- length(unique(master.set$Country))
Country <- unique(master.set$Country)
Country.ID <- unique(master.set$Country.ID)
CountryList <- unique(master.set$Country)
#For Loop to make Country ID numerically match Country
for (i in 1:460){
for (j in N.Countries){
master.set[[Country.ID[i]]] <- ifelse(master.set[[Country[i]]] == CountryList[j], j, master.set$Country)
}
}
I received this error:
Error in `[[<-.data.frame`(`*tmp*`, Country.ID[i], value = logical(0)) :
replacement has 0 rows, data has 460
Does anyone know how I can accomplish this task? Or will I be stuck using the ifelse command 159 times?
Thanks!
Maybe something like
master.set$Country.ID <- as.numeric(as.factor(master.set$Country.ID))
Or alternatively, using dplyr
library(tidyverse)
master.set <- master.set %>% mutate(Country.ID = as.numeric(as.factor(Country.ID)))
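One thing to be aware of with the factor-based conversion: factor levels are sorted, so the IDs follow alphabetical order rather than the order in which countries first appear. If order of appearance is wanted instead, match() against the unique values is one option. A minimal sketch with a made-up vector (the names id_alpha and id_first are only for illustration):

```r
# Invented example vector; "Belize" appears first but sorts after "Albania"
country <- c("Belize", "Belize", "Albania", "Chad", "Albania")

# IDs by sorted factor level (alphabetical order)
id_alpha <- as.integer(factor(country))

# IDs by order of first appearance in the vector
id_first <- match(country, unique(country))

data.frame(country, id_alpha, id_first)
```

Either way each country gets one stable integer; only the numbering differs.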
Or this, which creates a new variable Country.ID2 based on a key-value pair between Country and 1:length(unique(Country)).
library(tidyverse)
master.set <- left_join(master.set,
                        data.frame(Country = unique(master.set$Country),
                                   Country.ID2 = 1:length(unique(master.set$Country))))
master.set
#> Country Country.ID Year Happiness.Score GDP.PPP GINI Status
#> 1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
#> 2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
#> 3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
#> 4 Albania Albania 2015 4.959 10971.044 NA 2
#> 5 Albania Albania 2016 4.655 11356.717 NA 2
#> 6 Albania Albania 2017 4.644 11803.282 NA 2
#> Country.ID2
#> 1 1
#> 2 1
#> 3 1
#> 4 2
#> 5 2
#> 6 2
Another option with dplyr is to number the groups directly:
library(dplyr)
df <- data.frame(Country = c("Afghanistan", "Afghanistan", "Afghanistan", "Albania", "Albania", "Albania"),
                 Year = c(2015, 2016, 2017, 2015, 2016, 2017),
                 Happiness.Score = c(3.575, 3.360, 3.794, 4.959, 4.655, 4.644),
                 GDP.PPP = c(1766.593, 1757.023, 1758.466, 10971.044, 11356.717, 11803.282),
                 GINI = NA,
                 Status = rep(2, 6))
# cur_group_id() (dplyr >= 1.0.0) replaces the deprecated group_indices_()
df1 <- df %>%
  arrange(Country) %>%
  group_by(Country) %>%
  mutate(Country_id = cur_group_id()) %>%
  ungroup()
View(df1)

Creating new variable and new data rows for country-conflict-year observations

I'm very new to R, still learning the very basics, and I haven't yet figured out how to perform this particular operation, but it would save me lots and lots of labor and time.
I have a dataset of international conflicts with columns for country and dates that looks something like this:
country dates
Angola 1951-1953
Belize 1970-1972
I would like to reorganize the data to create variables for start year and end year, as well as create a year-observed (call it 'yrobs') column, so the set looks more like this:
country yrobs yrstart yrend
Angola 1951 1951 1953
Angola 1952 1951 1953
Angola 1953 1951 1953
Belize 1970 1970 1972
Belize 1971 1970 1972
Belize 1972 1970 1972
Someone suggested using data frames and a double for-loop, but I got a little confused trying that. Any help would be greatly appreciated, and feel free to use dummy language, as I'm still pretty green to the programming here. Thanks much.
No need for any for loops here. Use the power of R and its contributed packages, particularly plyr and reshape2.
library(reshape2)
library(plyr)
Create some data:
df <- data.frame(
  country = c("Angola", "Belize"),
  dates = c("1951-1953", "1970-1972")
)
Use colsplit in the reshape2 package to split your dates column into two, and cbind this to the original data frame.
df <- cbind(df, colsplit(df$dates, "-", c("start", "end")))
Now for the fun bit. Use ddply in package plyr to split, apply and combine (SAC). This will take df and apply a function to each change in country. The anonymous function inside ddply creates a small data.frame with country and observations, and the key bit is to use seq() to generate a sequence from start to end date. The power of ddply is that it does all of this splitting, combining and applying in one step. Think of it as a loop in other languages, but you don't need to keep track of your indexing variables.
ddply(df, .(country), function(x){
  data.frame(
    country = x$country,
    yrobs = seq(x$start, x$end),
    yrstart = x$start,
    yrend = x$end
  )
})
And the results:
country yrobs yrstart yrend
1 Angola 1951 1951 1953
2 Angola 1952 1951 1953
3 Angola 1953 1951 1953
4 Belize 1970 1970 1972
5 Belize 1971 1970 1972
6 Belize 1972 1970 1972
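For readers who prefer not to install extra packages, the same split-and-expand idea can be sketched in base R. This is an alternative to the plyr/reshape2 code above, not a replacement for it; the intermediate names (parts, n, out) are chosen only for this illustration.

```r
df <- data.frame(
  country = c("Angola", "Belize"),
  dates   = c("1951-1953", "1970-1972"),
  stringsAsFactors = FALSE
)

# Split "start-end" into a two-column character matrix
parts   <- do.call(rbind, strsplit(df$dates, "-", fixed = TRUE))
yrstart <- as.integer(parts[, 1])
yrend   <- as.integer(parts[, 2])

# Number of observed years per conflict
n <- yrend - yrstart + 1

# Repeat each country once per year and attach the year sequence
out <- data.frame(
  country = rep(df$country, n),
  yrobs   = unlist(Map(seq, yrstart, yrend)),
  yrstart = rep(yrstart, n),
  yrend   = rep(yrend, n)
)
out
```

Map(seq, yrstart, yrend) builds one year sequence per row, and rep() with a vector of counts stretches the other columns to the same length.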
