R: Add 0s to dataframe

How can I add a row with amount 0 for source solar in year 1990 to the data frame below? There is currently no row for solar in 1990.
Data:
year source amount
1990 coal 19203
1990 nuclear 2345
1991 coal 18490
1991 nuclear 2398
1991 solar 123
1992 ... ...
... ... ...
2019 ... ...
Code:
data <- read.csv('annual_generation.csv')
data$source <- as.factor(data$source)
This doesn't work but it's the general idea:
for(i in 1990:2019) {
for (j in data$source) {
if (!data[i][j])
data[i][j] = 0
}
}
Edit: Based on the answer below, this was the final solution:
data <- complete(data, YEAR, STATE, ENERGY.SOURCE,
fill = list(
GEN = 0,
TYPE.OF.PRODUCER = 'Total Electric Power Industry'))
YEAR STATE ENERGY.SOURCE TYPE.OF.PRODUCER GEN
<int><fct> <fct> <fct> <dbl>
1 1990 IL Coal Total Electric Power Industry 54966018
...

We can use complete from tidyr
library(tidyr)
complete(data, year, source, fill = list(amount = 0))
-output
# A tibble: 6 x 3
# year source amount
# <int> <chr> <dbl>
#1 1990 coal 19203
#2 1990 nuclear 2345
#3 1990 solar 0
#4 1991 coal 18490
#5 1991 nuclear 2398
#6 1991 solar 123
Also, if some values of 'year' are missing entirely, we can pass the full range:
complete(data, year = 1990:2019, source, fill = list(amount = 0))
data
data <- structure(list(year = c(1990L, 1990L, 1991L, 1991L, 1991L),
source = c("coal",
"nuclear", "coal", "nuclear", "solar"), amount = c(19203L, 2345L,
18490L, 2398L, 123L)), class = "data.frame", row.names = c(NA,
-5L))
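If tidyr is not available, the same fill can be sketched in base R: build every year/source combination with expand.grid, left-join the observed rows with merge, and replace the resulting NA amounts with 0. This is only a sketch on the sample data; the names grid and filled are introduced here for illustration.

```r
# Sample data from the answer above
data <- data.frame(
  year = c(1990L, 1990L, 1991L, 1991L, 1991L),
  source = c("coal", "nuclear", "coal", "nuclear", "solar"),
  amount = c(19203L, 2345L, 18490L, 2398L, 123L)
)

# Every year/source combination (hypothetical helper name: grid)
grid <- expand.grid(
  year = unique(data$year),
  source = unique(data$source),
  stringsAsFactors = FALSE
)

# Left join: combinations absent from data get amount = NA
filled <- merge(grid, data, by = c("year", "source"), all.x = TRUE)

# Replace the missing amounts with 0 and sort for readability
filled$amount[is.na(filled$amount)] <- 0L
filled <- filled[order(filled$year, filled$source), ]
filled
```

To cover years that never appear in the data, year = 1990:2019 could be passed to expand.grid instead of unique(data$year).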

Related

Aggregate dataframe by condition in R

I have the following DataFrame in R:
Y ... Price Year Quantity Country
010190 ... 4781 2021 4 Germany
010190 ... 367 2021 3 Germany
010190 ... 4781 2021 6 France
010190 ... 250 2021 3 France
020190 ... 690 2021 NA USA
020190 ... 10 2021 6 USA
...... ... .... .. ...
217834 ... 56 2021 3 USA
217834 ... 567 2021 9 USA
As you can see, the numbers in the Y column start with 01..., 02..., 21.... I want to aggregate these rows from the 6-digit code to its first 2 digits, grouping by the categorical columns (e.g. Country and Year) and summing the numerical columns such as Quantity and Price. Rows containing NAs should also be taken into account in the calculation. In the end I want output like this:
Y Price Year Quantity Country
01 5148 2021 7 Germany
01 5031 2021 9 USA
02 700 2021 6 USA
.. .... ... .... ...
21 623 2021 12 USA
You can use group_by and summarize from dplyr
library(dplyr)
df %>%
mutate(Y = sprintf(as.numeric(factor(Y, unique(Y))), fmt = '%02d')) %>%
group_by(Y, Year, Country) %>%
summarize(across(where(is.numeric), sum))
#> # A tibble: 4 x 5
#> # Groups: Y, Year [3]
#> Y Year Country Price Quantity
#> <chr> <int> <chr> <int> <int>
#> 1 01 2021 France 5031 9
#> 2 01 2021 Germany 5148 7
#> 3 02 2021 USA 700 NA
Update, as requested:
library(dplyr)
df %>%
mutate(Y = substr(Y, 1, 2)) %>%
group_by(Y, Year, Country) %>%
summarise(across(c(Price, Quantity), ~sum(., na.rm = TRUE)))
We could use substr to get the first two characters from Y and group_by and summarise() with sum()
library(dplyr)
df %>%
mutate(Y = substr(Y, 1, 2)) %>%
group_by(Y, Year, Country) %>%
summarise(Price = sum(Price, na.rm = TRUE),
Quantity = sum(Quantity, na.rm = TRUE)
)
Y Year Country Price Quantity
<chr> <dbl> <chr> <dbl> <dbl>
1 01 2021 France 5031 9
2 01 2021 Germany 5148 7
3 02 2021 USA 700 6
4 21 2021 USA 623 12
Using aggregate and the substring of Y.
aggregate(cbind(Quantity, Price) ~ Y + Year + Country,
transform(dat, Y=substr(Y, 1, 2)), sum)
# Y Year Country Quantity Price
# 1 10 2021 France 9 5031
# 2 10 2021 Germany 7 5148
# 3 20 2021 USA 7 700
# 4 21 2021 USA 12 623
Data:
dat <- structure(list(Y = c(10190L, 10190L, 10190L, 10190L, 20190L,
20190L, 217834L, 217834L), foo = c("...", "...", "...", "...",
"...", "...", "...", "..."), Price = c(4781L, 367L, 4781L, 250L,
690L, 10L, 56L, 567L), Year = c(2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L), model = c(NA, NA, NA, NA, NA, NA, "Tesla",
"Tesla"), Quantity = c(4L, 3L, 6L, 3L, 1L, 6L, 3L, 9L), Country = c("Germany",
"Germany", "France", "France", "USA", "USA", "USA", "USA")), class = "data.frame", row.names = c(NA,
-8L))
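In the data above Y is stored as an integer, so the leading zero is lost and substr returns "10" and "20" rather than "01" and "02". A minimal sketch of one way around this, assuming every code is meant to be six digits wide: pad with sprintf before taking the prefix. Quantity contains an NA here, as in the question, and is summed with na.rm = TRUE.

```r
# Question's data with Y as integer (leading zeros lost) and one NA in Quantity
dat <- data.frame(
  Y = c(10190L, 10190L, 10190L, 10190L, 20190L, 20190L, 217834L, 217834L),
  Price = c(4781L, 367L, 4781L, 250L, 690L, 10L, 56L, 567L),
  Year = 2021L,
  Quantity = c(4L, 3L, 6L, 3L, NA, 6L, 3L, 9L),
  Country = c("Germany", "Germany", "France", "France", "USA", "USA", "USA", "USA")
)

# Assumption: codes are six digits wide, so pad with zeros before cutting
dat$Y <- substr(sprintf("%06d", dat$Y), 1, 2)

# na.action = na.pass keeps rows with NA; na.rm = TRUE is passed on to sum
res <- aggregate(cbind(Price, Quantity) ~ Y + Year + Country,
                 data = dat, FUN = sum, na.rm = TRUE, na.action = na.pass)
res
```

Without na.action = na.pass, the formula interface drops any row containing an NA before summing, so the Price of that row would be lost as well.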

How to create percentage column by year and type in R

I have table such as this one
Year Type Value
1991 A 4945
1991 B 525
1991 C 764
1992 A 640
1992 B 3935
1992 D 49
1993 K 49
I would like to generate a new column that holds the percentage of each type within each year. The types may change from year to year, and some years have only one type.
E.g. the first percentage should be 4945/(4945+525+764).
Any help would be very welcome. Thank you very much!
Do a group by 'Year' and get the proportions of 'Value'
library(dplyr)
df1 %>%
group_by(Year) %>%
mutate(new = proportions(Value) * 100) %>%
ungroup()
-output
# A tibble: 6 × 4
Year Type Value new
<int> <chr> <int> <dbl>
1 1991 A 4945 79.3
2 1991 B 525 8.42
3 1991 C 764 12.3
4 1992 A 640 13.8
5 1992 B 3935 85.1
6 1992 D 49 1.06
Or use base R with ave
df1$new <- with(df1, ave(Value, Year, FUN = proportions) * 100)
data
df1 <- structure(list(Year = c(1991L, 1991L, 1991L, 1992L, 1992L, 1992L
), Type = c("A", "B", "C", "A", "B", "D"), Value = c(4945L, 525L,
764L, 640L, 3935L, 49L)), class = "data.frame", row.names = c(NA,
-6L))
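proportions() is only available in fairly recent versions of R (4.0.1 onwards, as far as I know). A sketch that avoids it, dividing by the group sum explicitly, and then checks that the percentages within each year add up to 100. The 1993 row from the question is included to show a year with a single type; totals is a name introduced here.

```r
# Question's data, including the single-type year 1993
df1 <- data.frame(
  Year = c(1991L, 1991L, 1991L, 1992L, 1992L, 1992L, 1993L),
  Type = c("A", "B", "C", "A", "B", "D", "K"),
  Value = c(4945L, 525L, 764L, 640L, 3935L, 49L, 49L)
)

# Percentage of each Value within its Year
df1$new <- ave(as.numeric(df1$Value), df1$Year,
               FUN = function(v) v / sum(v) * 100)

# Sanity check: each year's percentages should sum to 100
totals <- tapply(df1$new, df1$Year, sum)
df1
totals
```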

How to summarize two different rows with different values to a single row with that sum using dplyr?

I have the following data frame but in a bigger scale of course:
country year strain num_cases
mex 1996 sp_m014 412
mex 1996 sp_f014 214
mex 1998 sp_m014 150
mex 1998 sp_f014 200
usa 1996 sp_m014 200
usa 1996 sp_f014 180
usa 1997 sp_m014 190
usa 1997 sp_f014 150
I want to get the following result, that is the sum of sp_m014 (male) and sp_f014 (female) for mex and usa individually:
country year strain num_cases
mex 1996 sp 626
mex 1998 sp 350
usa 1996 sp 380
usa 1997 sp 340
In my real data frame I have a lot more age ranges, here I only show the 014 for males and females. But I want to summarize them that way for every age range and gender.
Thanks!
Group by 'country' and 'year', then summarise: set 'strain' to 'sp' and take the sum of 'num_cases'
library(dplyr)
df1 %>%
group_by(country, year) %>%
summarise(strain = 'sp', num_cases = sum(num_cases), .groups = 'drop')
-output
# A tibble: 4 x 4
# country year strain num_cases
#* <chr> <int> <chr> <int>
#1 mex 1996 sp 626
#2 mex 1998 sp 350
#3 usa 1996 sp 380
#4 usa 1997 sp 340
data
df1 <- structure(list(country = c("mex", "mex", "mex", "mex", "usa",
"usa", "usa", "usa"), year = c(1996L, 1996L, 1998L, 1998L, 1996L,
1996L, 1997L, 1997L), strain = c("sp_m014", "sp_f014", "sp_m014",
"sp_f014", "sp_m014", "sp_f014", "sp_m014", "sp_f014"), num_cases = c(412L,
214L, 150L, 200L, 200L, 180L, 190L, 150L)),
class = "data.frame", row.names = c(NA,
-8L))
Here's an approach with tidyr::extract:
library(tidyr);library(dplyr)
df1 %>%
extract(strain, into = c("strain","sex","age"), "(\\w+)_([mf])(.*)") %>%
group_by(country,year,strain) %>%
summarise(across(num_cases,sum))
# A tibble: 4 x 4
# Groups: country, year [4]
country year strain num_cases
<chr> <int> <chr> <int>
1 mex 1996 sp 626
2 mex 1998 sp 350
3 usa 1996 sp 380
4 usa 1997 sp 340
Now that you have the strains fully parsed you can easily group by sex or age. Thanks to @akrun for the data.
Update:
To group by age range, parse_number from readr extracts the number from 'strain':
library(dplyr)
library(readr)
df1 %>%
mutate(age_range = parse_number(strain)) %>%
group_by(country, year, age_range) %>%
summarise(num_cases = sum(num_cases))
Output:
country year age_range num_cases
<chr> <int> <dbl> <int>
1 mex 1996 14 626
2 mex 1998 14 350
3 usa 1996 14 380
4 usa 1997 14 340
First answer:
Thanks to akrun for the data:
library(tidyverse)
df1 %>%
# shorten 'strain' before grouping so the groups use the new value
mutate(strain = str_extract(strain, "^.{2}")) %>%
group_by(country, year, strain) %>%
summarise(num_cases = sum(num_cases))
Output:
country year strain num_cases
<chr> <int> <chr> <int>
1 mex 1996 sp 626
2 mex 1998 sp 350
3 usa 1996 sp 380
4 usa 1997 sp 340
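For the real data with several age ranges, a base R sketch that sums males and females within each age range: strip the sex letter, keep the age part as its own column, and aggregate. It assumes every strain follows the pattern prefix_[m|f]age. The two sp_*1524 rows are hypothetical, added only to show a second age range.

```r
# Question's data plus two hypothetical rows for a second age range
df1 <- data.frame(
  country = c("mex", "mex", "mex", "mex", "usa", "usa", "usa", "usa", "mex", "mex"),
  year = c(1996L, 1996L, 1998L, 1998L, 1996L, 1996L, 1997L, 1997L, 1996L, 1996L),
  strain = c("sp_m014", "sp_f014", "sp_m014", "sp_f014", "sp_m014",
             "sp_f014", "sp_m014", "sp_f014",
             "sp_m1524", "sp_f1524"),               # hypothetical rows
  num_cases = c(412L, 214L, 150L, 200L, 200L, 180L, 190L, 150L,
                30L, 20L)                           # last two are hypothetical
)

# Age part: drop the prefix, the underscore and the sex letter ("014", "1524")
df1$age <- sub("^[a-z]+_[mf]", "", df1$strain)
# Strain prefix: everything before the underscore ("sp")
df1$strain <- sub("_.*$", "", df1$strain)

# Sum over sex within each country, year, strain and age range
res <- aggregate(num_cases ~ country + year + strain + age, data = df1, FUN = sum)
res
```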

How to undummy a dataset with R

This is the library I am using for creating dummies
install.packages("fastDummies")
library(fastDummies)
This is the dataset
winners <- data.frame(
city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"),
year = c(1990, 2000, 1990),
crime = 1:3)
Let's then create dummies out of these cities:
dummy_cols(winners, select_columns = c("city"))
The results are
city year crime city_SaoPaulito city_NewAmsterdam city_BeatifulCow
1 SaoPaulito 1990 1 1 0 0
2 NewAmsterdam 2000 2 0 1 0
3 BeatifulCow 1990 3 0 0 1
So the question is: I want to return to the previous dataset. Any ideas?
Thanks in advance!
We can use dcast
library(data.table)
dcast(setDT(winners), crime ~ city, length)
If we need to get the input, it would be
subset(df1, select = 1:3)
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
Or with melt
melt(setDT(df1), measure = patterns("_"))[value == 1, .(city, year, crime)]
# city year crime
#1: SaoPaulito 1990 1
#2: NewAmsterdam 2000 2
#3: BeatifulCow 1990 3
data
df1 <- structure(list(city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"
), year = c(1990L, 2000L, 1990L), crime = 1:3, city_SaoPaulito = c(1L,
0L, 0L), city_NewAmsterdam = c(0L, 1L, 0L), city_BeatifulCow = c(0L,
0L, 1L)), class = "data.frame", row.names = c("1", "2", "3"))
If you are going to have only one city as 1 in each row, you can just skip the dummy columns
df[, 1:3]
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
If a row can have multiple cities, one way using dplyr and tidyr::gather is:
library(dplyr)
df %>%
tidyr::gather(key, value, starts_with("city_")) %>%
filter(value == 1) %>%
select(-value, -key)
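If the original city column has been dropped and only the dummy columns remain, the city can be rebuilt from the column names. A base R sketch, assuming exactly one dummy is 1 in each row and that all dummy columns share the city_ prefix; dummies, dummy_names and restored are names introduced here.

```r
# Dummy-coded data without the original 'city' column
dummies <- data.frame(
  year = c(1990, 2000, 1990),
  crime = 1:3,
  city_SaoPaulito = c(1L, 0L, 0L),
  city_NewAmsterdam = c(0L, 1L, 0L),
  city_BeatifulCow = c(0L, 0L, 1L)
)

# Names of the dummy columns
dummy_names <- grep("^city_", names(dummies), value = TRUE)

# For each row, index of the column holding the 1
idx <- max.col(as.matrix(dummies[dummy_names]), ties.method = "first")

# Strip the prefix to recover the original city value
dummies$city <- sub("^city_", "", dummy_names[idx])
restored <- dummies[c("city", "year", "crime")]
restored
```

With ties.method = "first", a row that has several 1s (or none) is silently assigned the first matching column, so this sketch only applies to strict one-hot data.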

Summing data frames with different length

I have two data sets (one for each country) that look like this:
dfGermany
Country Sales Year Code
Germany 2000 2000 221
Germany 1500 2001 150
Germany 2150 2002 270
dfJapan
Country Sales Year Code
Japan 500 2000 221
Japan 750 2001 221
Japan 800 2001 270
Japan 1000 2002 270
Code here is the "name" of the product. What I want to do is take half of the Japanese sales and add it to the df for Germany if the code and the year match.
So for instance, half of the sales value for product 221 and 270 in dfJapan (250 € and 500 €) should be added to dfGermany for year 2000 and 2002. But nothing should happen to the values for 2001 since the code does not match with the year.
I tried with merge, but that function did not work since the data is of different size and I also want to match both year and value.
We can do a join on 'Year', 'Code' and then update the 'dfGermany' 'Sales' column
library(data.table)
setDT(dfGermany)[dfJapan, Sales := Sales + i.Sales/2, on = .(Year, Code)]
dfGermany
# Country Sales Year Code
#1: Germany 2250 2000 221
#2: Germany 1500 2001 150
#3: Germany 2650 2002 270
data
dfGermany <- structure(list(Country = c("Germany", "Germany", "Germany"),
Sales = c(2000, 1500, 2150), Year = 2000:2002, Code = c(221L,
150L, 270L)), row.names = c(NA, -3L), class = "data.frame")
dfJapan <- structure(list(Country = c("Japan", "Japan", "Japan", "Japan"
), Sales = c(500L, 750L, 800L, 1000L), Year = c(2000L, 2001L,
2001L, 2002L), Code = c(221L, 221L, 270L, 270L)),
class = "data.frame", row.names = c(NA, -4L))
Using dplyr and @akrun's provided data:
library(dplyr)
dfGermany %>%
left_join(dfJapan %>%
select(Year, Code, sales_japan = Sales),
by = c('Year', 'Code')) %>%
mutate(Sales = Sales + coalesce(sales_japan / 2, 0)) %>%
select(-sales_japan)
> dfGermany
Country Sales Year Code
1 Germany 2250 2000 221
2 Germany 1500 2001 150
3 Germany 2650 2002 270
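The same update can also be sketched in base R with merge, which handles data frames of different lengths when all.x = TRUE is set: unmatched German rows get NA for the Japanese sales and are left unchanged. The names japan, merged and result are introduced here for illustration.

```r
dfGermany <- data.frame(
  Country = "Germany",
  Sales = c(2000, 1500, 2150),
  Year = 2000:2002,
  Code = c(221L, 150L, 270L)
)
dfJapan <- data.frame(
  Country = "Japan",
  Sales = c(500, 750, 800, 1000),
  Year = c(2000L, 2001L, 2001L, 2002L),
  Code = c(221L, 221L, 270L, 270L)
)

# Keep only the join keys and the Japanese sales, renamed to avoid a clash
japan <- dfJapan[c("Year", "Code", "Sales")]
names(japan)[3] <- "sales_japan"

# Left join on Year and Code: rows without a match get sales_japan = NA
merged <- merge(dfGermany, japan, by = c("Year", "Code"), all.x = TRUE)

# Add half of the Japanese sales where a match exists, otherwise add 0
merged$Sales <- merged$Sales +
  ifelse(is.na(merged$sales_japan), 0, merged$sales_japan / 2)

result <- merged[order(merged$Year), c("Country", "Sales", "Year", "Code")]
result
```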
