Overwrite a specific value in a dataframe, based on matching values - r

My data is in a format like this:
#> country year value
#> 1 AUS 2019 100
#> 2 USA 2019 120
#> 3 AUS 2018 90
df <- data.frame(stringsAsFactors=FALSE,
country = c("AUS", "USA", "AUS"),
year = c(2019, 2019, 2018),
value = c(100, 120, 90)
)
and I have a one-row data frame that represents a revision that should overwrite the existing record in my data.
#> country year value
#> 1 AUS 2018 500
df2 <- data.frame(stringsAsFactors=FALSE,
country = c("AUS"),
year = c(2018),
value = c(500)
)
My desired output is:
#> country year value
#> 1 AUS 2019 100
#> 2 USA 2019 120
#> 3 AUS 2018 500
I know how to find the row to overwrite:
library(tidyverse)
df %>% filter(country == df2$country & year == df2$year) %>%
mutate(value = df2$value)
but how do I put that back in the original dataframe?
Tidyverse answers are easier for me to work with, but I'm open to any solutions.

Using mutate and if_else:
library(tidyverse)
df %>%
mutate(value = if_else(country %in% df2$country & year %in% df2$year, df2$value, value))
Results in:
country year value
1 AUS 2019 100
2 USA 2019 120
3 AUS 2018 500
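The condition above tests country and year independently, which is fine for a single revision row but could match unintended combinations once df2 has several rows. A minimal base R sketch, assuming a second, made-up revision row for USA (not in the question) and that each country/year pair appears at most once in df2, matches on a combined key instead:

```r
df <- data.frame(
  country = c("AUS", "USA", "AUS"),
  year = c(2019, 2019, 2018),
  value = c(100, 120, 90),
  stringsAsFactors = FALSE
)
# Hypothetical revisions: the USA row is invented for illustration
df2 <- data.frame(
  country = c("AUS", "USA"),
  year = c(2018, 2019),
  value = c(500, 130),
  stringsAsFactors = FALSE
)

# Build one key per row so country and year are matched together
key  <- paste(df$country, df$year, sep = "_")
key2 <- paste(df2$country, df2$year, sep = "_")

idx <- match(key, key2)   # position of each df row in df2, NA if no revision
hit <- !is.na(idx)
df$value[hit] <- df2$value[idx[hit]]
df
```

Rows without a matching revision keep their original value.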

An efficient approach here is an update join with data.table. Convert the data.frame to a data.table with setDT(df), join it with df2 on country and year, and assign (:=) the value column from the second dataset (i.value) to replace value in the original dataset:
library(data.table)
setDT(df)[df2, value := i.value, on = .(country, year)]
df
# country year value
#1: AUS 2019 100
#2: USA 2019 120
#3: AUS 2018 500

One possible tidyverse approach uses (1) anti_join to remove the rows from df that will be replaced and (2) bind_rows to add the replacement rows from df2:
library(dplyr)
anti_join(df, df2, by = c("country", "year")) %>% bind_rows(df2)
#> country year value
#> 1 AUS 2019 100
#> 2 USA 2019 120
#> 3 AUS 2018 500
Or another one, using (1) right_join to join the old and new values and (2) coalesce to prefer the new value where one exists:
right_join(df2, df, by = c("country", "year")) %>%
transmute(country, year, value = coalesce(value.x, value.y))
#> country year value
#> 1 AUS 2019 100
#> 2 USA 2019 120
#> 3 AUS 2018 500
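Another tidyverse option, assuming dplyr 1.0.0 or later is installed, is rows_update(), which overwrites values in rows whose key columns match and leaves the other rows and the row order untouched. A short sketch using the question's data:

```r
library(dplyr)

df <- data.frame(
  country = c("AUS", "USA", "AUS"),
  year = c(2019, 2019, 2018),
  value = c(100, 120, 90),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  country = "AUS",
  year = 2018,
  value = 500,
  stringsAsFactors = FALSE
)

# Rows are matched on country and year; only value is overwritten
updated <- rows_update(df, df2, by = c("country", "year"))
updated
```

By default rows_update() expects every key in df2 to already exist in df, so it suits revisions rather than new records.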

Related

How to SUM total number of a value per each month & year and join by other columns in R

Hey I've got this table:
country date cases
----------------------------------------------------------
USA 2022-05-01 5
Benin 2022-05-28 2
USA 2021-05-17 3
USA 2022-05-05 7
Benin 2022-02-11 3
I want to group by country and then calculate the total number of cases per country for each month of each year, with the month and year (no day) in a single column called date:
country date cases
----------------------------------------------------------
USA 2022-05 12
USA 2021-05 3
Benin 2022-05 2
Benin 2022-02 3
I recently asked an earlier question about grouping cases by month, and someone helped me with this code:
bymonth <- aggregate(cbind(cases) ~ substr(date, 1, 7), data=data1,
FUN=sum)
That was good, but it sums all cases across all countries for each month. I want the sum for each country, for each month of each year, as in the second table above.
The data is named data1 and its columns are 'cases', 'date' (class Date), and 'country'.
If the date column is of type Date, then you can use format(date, "%Y-%m") to format it as "yyyy-mm" (the result is no longer a Date object but a character string).
aggregate(cases ~ country + date2,
transform(df, date2 = format(date, "%Y-%m")),
FUN = sum)
# country date2 cases
# 1 USA 2021-05 3
# 2 Benin 2022-02 3
# 3 Benin 2022-05 2
# 4 USA 2022-05 12
If the date column is of type character, convert it to Date first with as.Date(date).
If you have attached dplyr, then the above method is equivalent to:
df %>%
count(country, date2 = format(date, "%Y-%m"), wt = cases, name = "cases")
# country date2 cases
# 1 Benin 2022-02 3
# 2 Benin 2022-05 2
# 3 USA 2021-05 3
# 4 USA 2022-05 12
where count() is shorthand for group_by(...) %>% summarise(cases = sum(cases)).
Data
df = data.frame(
country = c('USA', 'Benin', 'USA', 'USA', 'Benin'),
date = as.Date(c('2022-05-01', '2022-05-28', '2021-05-17', '2022-05-05', '2022-02-11')),
cases = c(5, 2, 3, 7, 3)
)
The lubridate package is not really necessary, but it does make things a little simpler:
library(lubridate)
aggregate(cases ~ country + year(date) + month(date), data = df, sum)
country year(date) month(date) cases
1 Benin 2022 2 3
2 USA 2021 5 3
3 Benin 2022 5 2
4 USA 2022 5 12
If you prefer tidyverse, you can create new grouping columns in group_by function.
df = data.frame(
'country'= c('usa', 'usa', 'benin', 'benin', 'usa'),
'date' = c('2022-05-02', '2021-04-03', '2022-05-02', '2022-02-03', '2022-05-05'),
'cases' = c(12, 3, 2, 4, 10)
)
library(tidyverse)
library(lubridate)
df %>% group_by(
country,
year=year(date),
month=month(date)) %>%
summarise(sum=sum(cases)) %>%
ungroup() %>%
mutate(year_mon = paste(year, month, sep='-')) %>%
select(country, year_mon, sum)
#> `summarise()` has grouped output by 'country', 'year'. You can override using the `.groups` argument.
#> # A tibble: 4 × 3
#> country year_mon sum
#> <chr> <chr> <dbl>
#> 1 benin 2022-2 4
#> 2 benin 2022-5 2
#> 3 usa 2021-4 3
#> 4 usa 2022-5 22
Created on 2022-05-13 by the reprex package (v2.0.1)
Here's another way with tidyverse, where we convert the date to a year-month string and create the grouping variables in one step, then summarise.
library(tidyverse)
df %>%
group_by(country, date = format(as.Date(date), "%Y-%m")) %>%
summarise(cases = sum(cases, na.rm = TRUE))
Output
country date cases
<chr> <chr> <dbl>
1 Benin 2022-02 3
2 Benin 2022-05 2
3 USA 2021-05 3
4 USA 2022-05 12
Data
df <- structure(list(country = c("USA", "Benin", "USA", "USA", "Benin"
), date = structure(c(19113, 19140, 18764, 19117, 19034), class = "Date"),
cases = c(5, 2, 3, 7, 3)), class = "data.frame", row.names = c(NA,
-5L))
Here is an alternative approach:
library(lubridate)
library(dplyr)
df %>%
mutate(Month_Yr = format_ISO8601(date, precision = "ym")) %>%
group_by(country, Month_Yr) %>%
summarise(cases = sum(cases))
country Month_Yr cases
<chr> <chr> <dbl>
1 Benin 2022-02 3
2 Benin 2022-05 2
3 USA 2021-05 3
4 USA 2022-05 12
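For completeness, here is a base R sketch of the case where date arrives as character rather than Date. It assumes the data frame is called data1, as in the question, and stores the year-month string in a new column named month (a name chosen here for illustration):

```r
data1 <- data.frame(
  country = c("USA", "Benin", "USA", "USA", "Benin"),
  date = c("2022-05-01", "2022-05-28", "2021-05-17", "2022-05-05", "2022-02-11"),
  cases = c(5, 2, 3, 7, 3),
  stringsAsFactors = FALSE
)

# Convert the character dates, then keep only year and month
data1$month <- format(as.Date(data1$date), "%Y-%m")

# Sum cases for every country/month combination
bymonth <- aggregate(cases ~ country + month, data = data1, FUN = sum)
bymonth
```

Adding country to the formula is what separates the totals per country, which the original substr() call did not do.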

Recoding rare categories of a variable to category -"other" based on condition

I need to transform the variable below based on how often each category appears in the dataset, so that categories that appear fewer than two times are renamed to the category "other".
Data example
Desirable output
I used to use the chunk of code below for this transformation, but since I moved to R 4.05 it throws an error.
levels(data$Country_of_origin) <- ifelse(table(data$Country_of_origin) < 2, "OTHER", levels(data$Country_of_origin))
Not sure why your code stopped working but forcats::fct_lump_*() is a great option for this application. See small example here:
library(tidyverse)
d <- c('USA', 'USA', 'Germany', 'Japan', 'USA', 'USA') %>% factor()
# original distribution
table(d)
#> d
#> Germany Japan USA
#> 1 1 4
# lumped distribution
fct_lump_min(d, min = 2) %>% table()
#> .
#> USA Other
#> 4 2
Created on 2022-02-10 by the reprex package (v2.0.1)
Here is another option:
Packages
library(dplyr)
library(tibble)
library(magrittr)
Input
data <- tibble( country = c('USA', 'USA', 'Germany', 'Japan', 'USA', 'USA'))
data
# A tibble: 6 x 1
country
<chr>
1 USA
2 USA
3 Germany
4 Japan
5 USA
6 USA
Solution
few_country <- data %>% count(country) %>% filter(n < 2)
data %>%
mutate(new_country = case_when(country %in% few_country$country ~ "OTHER",
TRUE ~ country))
Output
# A tibble: 6 x 2
country new_country
<chr> <chr>
1 USA USA
2 USA USA
3 Germany OTHER
4 Japan OTHER
5 USA USA
6 USA USA
Using dplyr, you can calculate frequency after group_by(country) and then mutate country when below a threshold:
library(dplyr)
library(tidyr)
data <- tibble( country = c('USA', 'USA', 'Germany', 'Japan', 'USA', 'USA'))
data |>
group_by(country) |>
mutate(country = ifelse(n() < 2, "OTHER", country))
# A tibble: 6 × 1
# Groups: country [2]
country
<chr>
1 USA
2 USA
3 OTHER
4 OTHER
5 USA
6 USA
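As for why the original levels() code might have stopped working: one possible explanation, an assumption since the question's data is not shown, is that from R 4.0.0 data.frame() and read.csv() no longer convert strings to factors by default, so levels() returns NULL on a character column. A base R sketch that works directly on a character vector:

```r
country <- c("USA", "USA", "Germany", "Japan", "USA", "USA")

# Count occurrences of each category
counts <- table(country)

# Categories that appear fewer than two times
rare <- names(counts)[counts < 2]

# Replace rare categories, keep the rest unchanged
recoded <- ifelse(country %in% rare, "OTHER", country)
recoded
```

Wrap the result in factor() if later code expects a factor.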

R Dplyr: How do I add columns from an ungrouped dataframe to a grouped dataframe and retain the grouping?

I have a main data frame (data) that contains information about purchases: names, year, city, and a few other variables:
Name Year City
N1 2018 NY
N2 2019 SF
N2 2018 SF
N1 2010 NY
N3 2020 AA
I used new_data <- data %>% group_by(Name) %>% tally(name = "Count") to get something like this:
Name Count
N1 2
N2 2
N3 1
My questions, preferably using dplyr:
1) How do I now add the city that corresponds to Name to new_data, i.e:
Name Count City
N1 2 NY
N2 2 SF
N3 1 AA
2) How do I add the earliest year of each Name to new_data, i.e.:
Name Count City Year
N1 2 NY 2010
N2 2 SF 2018
N3 1 AA 2020
It seems that summarise may suit you better, for example:
data %>%
group_by(Name, City) %>%
summarise(Count = n(),
Year = min(Year))
Output:
# A tibble: 3 x 4
# Groups: Name [3]
Name City Count Year
<fct> <fct> <int> <int>
1 N1 NY 2 2010
2 N2 SF 2 2018
3 N3 AA 1 2020
Grouping by City as well as Name keeps City in the output.
An option with data.table
library(data.table)
setDT(data)[, .(Count = .N, Year = min(Year)), .(Name, City)]
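If you would rather group by Name only, and assuming each Name maps to a single City as in the example, first(City) can carry the city through summarise(). A small sketch with the question's rows typed in by hand:

```r
library(dplyr)

data <- data.frame(
  Name = c("N1", "N2", "N2", "N1", "N3"),
  Year = c(2018, 2019, 2018, 2010, 2020),
  City = c("NY", "SF", "SF", "NY", "AA"),
  stringsAsFactors = FALSE
)

new_data <- data %>%
  group_by(Name) %>%
  summarise(
    Count = n(),          # number of purchases per Name
    City  = first(City),  # assumes one City per Name
    Year  = min(Year)     # earliest year per Name
  )
new_data
```

Because summarise() here reduces a single grouping variable, the result comes back ungrouped with one row per Name.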

How to work with special row after group_by in R

I have a data frame something like the one below:
amount <- sample(10000:2000, 20)
year<- sample(2015:2017, 20, replace = TRUE)
company<- sample(LETTERS[1:3],20, replace = TRUE)
df<-data.frame(company, year, amount)
Then I want to group by company and year so I have:
df %>%
group_by(company, year) %>%
summarise(
total= sum(amount)
)
company year total
<fct> <int> <int>
1 A 2015 1094
2 A 2016 3308
3 A 2017 4785
4 B 2015 1190
5 B 2016 6583
6 B 2017 1964
7 C 2015 4974
8 C 2016 1986
9 C 2017 3465
Now I want to divide the last row in each group by the first row. In other words, for each company, I want to divide the total for the last year by the total for the first year.
Thanks.
You could use last and first to access the last and first elements of total, respectively:
library(dplyr)
df %>%
group_by(company, year) %>%
summarise(total= sum(amount)) %>%
summarise(final = last(total)/first(total))
# company final
# <fct> <dbl>
#1 A 2.26
#2 B 1.92
#3 C 0.565
In base R, we can use aggregate
aggregate(amount~company, aggregate(amount~company+year, df, sum),
function(x) x[length(x)]/x[1])
# company amount
#1 A 2.262524
#2 B 1.919138
#3 C 0.565281
With data.table, we can do
library(data.table)
setDT(df)[ , .(total = sum(amount)), .(company, year)][,
.(final = last(total)/first(total)), .(company)]
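Because the question's data is random, the ratios are hard to check by eye. Here is a small base R sketch with fixed, made-up numbers (not from the question) that sorts by year explicitly before taking last over first:

```r
# Made-up data, deliberately out of year order
df <- data.frame(
  company = c("A", "A", "A", "B", "B"),
  year = c(2017, 2015, 2016, 2016, 2015),
  amount = c(300, 100, 200, 50, 200)
)

# Total per company and year
totals <- aggregate(amount ~ company + year, data = df, FUN = sum)

# Make the year order within each company explicit
totals <- totals[order(totals$company, totals$year), ]

# Last total divided by first total, per company
ratio <- sapply(
  split(totals$amount, totals$company),
  function(x) x[length(x)] / x[1]
)
ratio
```

With these numbers company A goes from 100 to 300 (ratio 3) and company B from 200 to 50 (ratio 0.25).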

R: creating new variable with conditions using dplyr

Hi, I am trying to create a new variable with dplyr.
My data looks like the following:
Land happy year
<fctr> <int> <dbl>
1 Country1 09 2002
2 Country1 08 2012
3 Country3 05 2008
...
To create a variable with the mean of happy per Land and year, I used this code:
New <-df %>%
group_by(Land, year) %>%
mutate(mean.happy = mean(happy, na.rm=T))
Now I would like to make a variable with this content:
(mean of happy in 2012)- (mean of happy in 2008) for each Country.
How can I build a new variable with these conditions?
Here's a dplyr/tidyr solution.
library(dplyr)
library(tidyr)
df <- df %>%
group_by(Land, year) %>%
mutate(mean.happy = mean(happy, na.rm=T)) %>%
spread(year, mean.happy)
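Here's a data.table solution, which is typically faster. Note that the second assignment has no by, so the difference is computed over the whole table; add by = Land to compute it per country.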
library(data.table)
dt = read.table("clipboard", header = TRUE)
setDT(dt)
dt[ , "mean.happy" := mean(happy), by = .(Land, year)]
dt[ , "diff.happiness" := mean(happy[year == 2012]) - mean(happy[year == 2008])]
> dt
Land happy year mean.happy diff.happiness
1: Country1 9 2002 9 3
2: Country1 8 2012 8 3
3: Country3 5 2008 5 3
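To get the difference for each country with dplyr alone, one option is to summarise per Land and join the result back. This is a sketch on made-up data (the question only shows three rows), and the name diffs is chosen here for illustration:

```r
library(dplyr)

# Made-up data with both years present for each country
df <- data.frame(
  Land = c("Country1", "Country1", "Country1", "Country3", "Country3"),
  happy = c(9, 8, 6, 5, 7),
  year = c(2008, 2012, 2012, 2008, 2012),
  stringsAsFactors = FALSE
)

# One row per country: mean of happy in 2012 minus mean of happy in 2008
diffs <- df %>%
  group_by(Land) %>%
  summarise(
    diff.happiness = mean(happy[year == 2012], na.rm = TRUE) -
      mean(happy[year == 2008], na.rm = TRUE)
  )

# Attach the per-country difference to every original row
result <- left_join(df, diffs, by = "Land")
result
```

A country that lacks one of the two years gets NaN, since the mean of an empty vector is NaN.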
