> dput(df1)
structure(list(X = c("cc_China", "bb_China", "dd_China", "cc_Egypt",
"bb_Egypt", "dd_Egypt"), Country = c("China", "China", "China",
"Egypt", "Egypt", "Egypt"), May = c(2, 3, 8, 2, 4, 1), Jun = c(2,
2, 5, 5, 5, 5), Jul = c(3, NA, NA, 3, 2, NA), Aug = c(4, 6, 3,
2, 3, NA)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I have a dataset as shown above; I have extracted the country from the X column into the Country column. For each country, I wish to get the latest month (and its values) for which all three rows (cc, bb and dd) are non-NA. For China the latest such month is Aug, where cc, bb and dd all have values. For Egypt it is Jun, the latest month where all three values are available. Thanks.
> df1
X Country May Jun Jul Aug
cc_China China 2 2 3 4
bb_China China 3 2 NA 6
dd_China China 8 5 NA 3
cc_Egypt Egypt 2 5 3 2
bb_Egypt Egypt 4 5 2 3
dd_Egypt Egypt 1 5 NA NA
I wish to get this
Month X Value
Aug cc_China 4
Aug bb_China 6
Aug dd_China 3
Jun cc_Egypt 5
Jun bb_Egypt 5
Jun dd_Egypt 5
Get the data in long format and, for each Country, keep only those months in which every row has a non-NA value. For each Country you can then keep only the latest (max) month.
Since we cannot compare character month names directly, I have converted them to numbers using the built-in vector month.abb.
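For reference, month.abb is the built-in vector of abbreviated month names, and match() returns a month's position in it:
month.abb
#>  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
match("Jun", month.abb)
#> [1] 6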
library(dplyr)
library(tidyr)

df1 %>%
  # reshape the month columns into long format (Month / value pairs)
  pivot_longer(cols = -c(X, Country), names_to = 'Month') %>%
  # convert month abbreviations to numbers so they can be compared
  mutate(month_num = match(Month, month.abb)) %>%
  # keep only months where no row of the country is NA
  group_by(Country, Month) %>%
  filter(all(!is.na(value))) %>%
  # within each country, keep the latest such month
  group_by(Country) %>%
  filter(month_num == max(month_num)) %>%
  ungroup() %>%
  select(-month_num, -Country)
# X Month value
# <chr> <chr> <dbl>
#1 cc_China Aug 4
#2 bb_China Aug 6
#3 dd_China Aug 3
#4 cc_Egypt Jun 5
#5 bb_Egypt Jun 5
#6 dd_Egypt Jun 5
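If the exact column names and order shown in the question are wanted, the result can be renamed and reordered at the end; here res is just an illustrative name for the output of the pipeline above (value is the default column name created by pivot_longer()):
# rename value to Value and put Month first, matching the requested output
res %>%
  select(Month, X, Value = value)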
I have the following DataFrame in R:
Y ... Price Year Quantity Country
010190 ... 4781 2021 4 Germany
010190 ... 367 2021 3 Germany
010190 ... 4781 2021 6 France
010190 ... 250 2021 3 France
020190 ... 690 2021 NA USA
020190 ... 10 2021 6 USA
...... ... .... .. ...
217834 ... 56 2021 3 USA
217834 ... 567 2021 9 USA
As you can see, the numbers in the Y column start with 01..., 02..., 21... I want to aggregate such rows from 6 digits down to 2 digits, grouping by the categorical columns (e.g. Country and Year) and summing numerical columns like Quantity and Price. I also want to take rows with NAs into account during the calculation. So, in the end I want this kind of output:
Y Price Year Quantity Country
01 5148 2021 7 Germany
01 5031 2021 9 USA
02 700 2021 6 USA
.. .... ... .... ...
21 623 2021 12 USA
You can use group_by and summarize from dplyr
library(dplyr)

df %>%
  # number the distinct Y codes sequentially and zero-pad to two digits
  mutate(Y = sprintf('%02d', as.numeric(factor(Y, unique(Y))))) %>%
  group_by(Y, Year, Country) %>%
  summarize(across(where(is.numeric), sum))
#> # A tibble: 4 x 5
#> # Groups: Y, Year [3]
#> Y Year Country Price Quantity
#> <chr> <int> <chr> <int> <int>
#> 1 01 2021 France 5031 9
#> 2 01 2021 Germany 5148 7
#> 3 02 2021 USA 700 NA
Update, as requested:
library(dplyr)
df %>%
mutate(Y = substr(Y, 1, 2)) %>%
group_by(Y, Year, Country) %>%
summarise(across(c(Price, Quantity), ~sum(., na.rm = TRUE)))
We could use substr() to get the first two characters of Y, then group_by() and summarise() with sum():
library(dplyr)

df %>%
  mutate(Y = substr(Y, 1, 2)) %>%
  group_by(Y, Year, Country) %>%
  summarise(Price = sum(Price, na.rm = TRUE),
            Quantity = sum(Quantity, na.rm = TRUE))
Y Year Country Price Quantity
<chr> <dbl> <chr> <dbl> <dbl>
1 01 2021 France 5031 9
2 01 2021 Germany 5148 7
3 02 2021 USA 700 6
4 21 2021 USA 623 12
Using aggregate and the substring of Y.
aggregate(cbind(Quantity, Price) ~ Y + Year + Country,
transform(dat, Y=substr(Y, 1, 2)), sum)
# Y Year Country Quantity Price
# 1 10 2021 France 9 5031
# 2 10 2021 Germany 7 5148
# 3 20 2021 USA 7 700
# 4 21 2021 USA 12 623
Data:
dat <- structure(list(Y = c(10190L, 10190L, 10190L, 10190L, 20190L,
20190L, 217834L, 217834L), foo = c("...", "...", "...", "...",
"...", "...", "...", "..."), Price = c(4781L, 367L, 4781L, 250L,
690L, 10L, 56L, 567L), Year = c(2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L), model = c(NA, NA, NA, NA, NA, NA, "Tesla",
"Tesla"), Quantity = c(4L, 3L, 6L, 3L, 1L, 6L, 3L, 9L), Country = c("Germany",
"Germany", "France", "France", "USA", "USA", "USA", "USA")), class = "data.frame", row.names = c(NA,
-8L))
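Note that dat stores Y as an integer, so the leading zeros are lost and the result above shows 10 and 20 rather than 01 and 02. A small variation, assuming every code is meant to be six digits, pads Y back to six characters before taking the first two:
# pad Y to six digits so the leading zero survives, then aggregate as before
aggregate(cbind(Quantity, Price) ~ Y + Year + Country,
          transform(dat, Y = substr(sprintf("%06d", Y), 1, 2)), sum)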
Say you have a database like gapminder with the population per country. Even though the current year is 2021, you also have predictions for the years to come.
location 2020.0 2021.0 2022.0
Canada 5 7 9
China 23 34 54
Congo 1 2 3
and another database like this, vaccins
location date amount_of_vaccins
Canada 2020-01-02 50
China 2021-05-03 59
Congo 2022-03-05 34
How can I merge the population of each country into the second database, following the dates in the second database?
I managed to merge them by country like this:
merge(gapminder,vaccins, by = "location")
but I'm getting this
location date amount_of_vaccins 2020.0 2021.0 2022.0
Canada 2020-01-02 50 5 7 9
China 2021-05-03 59 23 34 54
Congo 2022-03-05 34 1 2 3
I'd like to have only a new variable giving the population of the country according to the year. Thank you.
You could do something like this with the tidyverse.
library(tidyverse)

# reshape the population table to long format and extract the year from the column names
df1 <- df1 %>%
  pivot_longer(!location, names_to = "date", values_to = "population") %>%
  dplyr::mutate(year = str_sub(date, 1, 4))

# extract the year from the vaccination date and join the population on location + year
df2 %>%
  dplyr::mutate(year = str_sub(date, end = 4)) %>%
  dplyr::left_join(., df1, by = c("location", "year")) %>%
  dplyr::select(-c(date.y, year)) %>%
  dplyr::rename(date = date.x)
Output
location date amount_of_vaccins population
1 Canada 2020-01-02 50 5
2 China 2021-05-03 59 34
3 Congo 2022-03-05 54 3
Data
df1 <-
structure(
list(
location = c("Canada", "China", "Congo"),
`2020.0` = c(5, 23, 1),
`2021.0` = c(7, 34, 2),
`2022.0` = c(9, 54, 3)
),
class = "data.frame",
row.names = c(NA,-3L)
)
df2 <-
structure(
list(
location = c("Canada", "China", "Congo"),
date = c("2020-01-02",
"2021-05-03", "2022-03-05"),
amount_of_vaccins = c(50, 59, 54)
),
class = "data.frame",
row.names = c(NA,-3L)
)
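For completeness, a rough base-R sketch of the same idea (object names like pop_long and merged are only illustrative): reshape the population table to long format, derive the year from the date, then merge on location and year.
# stack the yearly population columns into one long column
pop_long <- reshape(df1, direction = "long",
                    varying = list(c("2020.0", "2021.0", "2022.0")),
                    v.names = "population", timevar = "year",
                    times = c("2020", "2021", "2022"), idvar = "location")
# derive the year from the vaccination date and merge
df2$year <- substr(df2$date, 1, 4)
merged <- merge(df2, pop_long, by = c("location", "year"))
merged[, c("location", "date", "amount_of_vaccins", "population")]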
I'm sorry if this question has already been answered, but I don't really know how to phrase my question.
I have a data frame structured in this way:
country  year  score
France   2020  10
France   2019  9
Germany  2020  15
Germany  2019  14
I would like to add a new column called previous_year_score that looks up, for each country, the "score" of "year - 1". In this case France 2020 would have a previous_year_score of 9, while France 2019 would have NA.
You can use match() for this. I imagine there are plenty of other solutions too.
Data:
df <- structure(list(country = c("France", "France", "Germany", "Germany"
), year = c(2020L, 2019L, 2020L, 2019L), score = c(10L, 9L, 15L,
14L), prev_score = c(9L, NA, 14L, NA)), row.names = c(NA, -4L
), class = "data.frame")
Solution:
# match each (country, year - 1) pair against the existing (country, year) pairs
i <- match(paste(df[[1]], df[[2]] - 1), paste(df[[1]], df[[2]]))
# pull the score from the matched row (NA where no previous year exists)
df$prev_score <- df[i, 3]
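Running this on the data above reproduces the prev_score column shown in the dput (NA where no previous year exists):
df
#   country year score prev_score
# 1  France 2020    10          9
# 2  France 2019     9         NA
# 3 Germany 2020    15         14
# 4 Germany 2019    14         NA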
You can use the following solution:
library(dplyr)
df %>%
group_by(country) %>%
arrange(year) %>%
mutate(prev_val = ifelse(year - lag(year) == 1, lag(score), NA))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 Germany 2019 14 NA
3 France 2020 10 9
4 Germany 2020 15 14
Using case_when
library(dplyr)
df1 %>%
arrange(country, year) %>%
group_by(country) %>%
mutate(prev_val = case_when(year - lag(year) == 1 ~ lag(score)))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 France 2020 10 9
3 Germany 2019 14 NA
4 Germany 2020 15 14
I have a dataframe like this:
data <- data.frame(Time = rep(c("Jan 1999", "Feb 1999", "Mar 1999"), each = 3),
                   Country = rep(c("Australia", "Brazil", "Canada"), 3),
                   Group = rep(c("A", "B", "A"), 3),
                   Intercept = NA)
and another dataframe with coefficients from a regression where A and B are the Intercepts for the different groups.
coeffs <- data.frame(Time = c("Jan 1999", "Feb 1999", "Mar 1999"), A = c(1,2,3), B = c(3,2,1))
Now I want to put the intercepts from the coeffs data frame into the Intercept column of data. I did it the following way:
l <- length(unique(data[, "Country"]))
data[, "Intercept"] <- ifelse(data[, "Group"] == "A",
                              rep(coeffs[, "A"], each = l),
                              rep(coeffs[, "B"], each = l))
This seems to work well for the 2 groups, but now I need to do the same thing for 7 groups and I don't see how I could generalize the approach above. I guess I could use a 7 level nested ifelse statement or a for loop, but there has to be a more elegant way.
Thanks for your help!
Get coeffs in long format and join with data:
library(dplyr)

coeffs %>%
  # reshape coeffs so each Time/Group combination holds its Intercept in one row
  tidyr::pivot_longer(cols = -Time, names_to = 'Group',
                      values_to = 'Intercept') %>%
  # attach the intercepts to data by Time and Group
  right_join(data, by = c('Time', 'Group'))
# A tibble: 9 x 4
# Time Group Intercept Country
# <chr> <chr> <dbl> <chr>
#1 Jan 1999 A 1 Australia
#2 Jan 1999 A 1 Canada
#3 Jan 1999 B 3 Brazil
#4 Feb 1999 A 2 Australia
#5 Feb 1999 A 2 Canada
#6 Feb 1999 B 2 Brazil
#7 Mar 1999 A 3 Australia
#8 Mar 1999 A 3 Canada
#9 Mar 1999 B 1 Brazil
Used this dataframe for data:
data <- data.frame(Time = rep(c("Jan 1999", "Feb 1999", "Mar 1999"), each = 3),
Country = rep(c("Australia", "Brazil", "Canada"), 3),
Group = rep(c("A", "B", "A"), 3))
I have a problem in my dataset with missing values. For some reason, several IDs are missing a value in the 'Names' column. This is strange, because other IDs with the same CODE (there are more than 10K codes in my whole dataset) and the same Year (6 possible years) do have a value in that column.
Can somebody help me figure out the code so that IDs with a missing value in the 'Names' column get the same character value in that column when other IDs with the same CODE and Year do have a value there?
For example: the NA at row 4 should change to 'Hospital', based on another ID with the same CODE and Year. (In my original data frame there is an ID with Year 2013 and CODE 01 with the name 'Hospital'; if there were none, it should stay NA.)
Sidenote: it is panel data, so each ID can appear in the dataset for multiple years (one row per year), and not every ID is present in every year. There are also more variables in my data frame.
> dput(Dataframe[1:7, ])
structure(list(ID = structure(c(1, 2, 2, 2, 2, 2, 2), format.spss = "F9.3"),
    CODE = c("01", "01", "01", "01", "01", "01", "01"),
    Year = structure(c(2018, 2014, 2018, 2013, 2013, 2015, 2015), format.spss = "F9.3"),
    Quarter = structure(c(3, 4, 4, 4, 3, 4, 3), format.spss = "F9.3"),
    Size = c(24.5, 23.25, 24.5, 30, 30, 19.25, 19.25),
    Names = c("Hospital", "Hospital", "Hospital", NA, "Hospital", NA, "Hospital")),
    row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
A tibble: 7 x 8
ID Gender CODE Year Quarter Size Names
<dbl> <dbl> <dttm> <chr> <dbl> <dbl> <dbl> <chr>
1 1 2 01 2018 3 24.5 Hospital
2 2 1 01 2014 4 23.2 Hospital
3 2 1 01 2018 4 24.5 Hospital
4 2 1 01 2013 4 30 NA
5 2 1 01 2013 3 30 Hospital
6 2 1 01 2015 4 19.2 NA
7 2 1 01 2015 3 19.2 Hospital
Selecting and checking individual rows is too much work; I have over 1.1 million rows.
Edit: it would also work to convert the 'Names' column to 1 if it has a (character) value and 0 if it is NA.
Thank you!
I'm not exactly sure, because in your example all the names are the same, but I think this might do what you are looking for.
I changed the example below to have the last Names be "Not Hospital".
df <- structure(list(ID = structure(c(1, 2, 2, 2, 2, 2, 2), format.spss = "F9.3"), CODE = c("01", "01", "01","01", "01", "01", "01"), Year = structure(c(2018, 2014, 2018, 2013, 2013, 2015, 2015), format.spss = "F9.3"), Quarter = structure(c(3, 4, 4, 4, 3, 4, 3), format.spss = "F9.3"), Size = c(24.5, 23.25, 24.5, 30, 30, 19.25, 19.25), Names = c("Hospital", "Hospital", "Hospital", NA, "Hospital", NA, "Not Hospital")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame") )
Original
# A tibble: 7 x 6
ID CODE Year Quarter Size Names
<dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 1 01 2018 3 24.5 Hospital
2 2 01 2014 4 23.2 Hospital
3 2 01 2018 4 24.5 Hospital
4 2 01 2013 4 30 NA
5 2 01 2013 3 30 Hospital
6 2 01 2015 4 19.2 NA
7 2 01 2015 3 19.2 Not Hospital
Here's the code to update the names.
library(dplyr)

df %>%
  # keep the distinct CODE/Year/Names combinations that do have a name
  filter(!is.na(Names)) %>%
  select(CODE, Year, Names) %>%
  group_by_all() %>%
  summarise() %>%
  # join those names back onto the full data by CODE and Year
  right_join(df, by = c("CODE", "Year")) %>%
  rename(Names = Names.x) %>%
  select(-Names.y)
Output:
# A tibble: 7 x 6
# Groups: CODE, Year [4]
CODE Year Names ID Quarter Size
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 01 2018 Hospital 1 3 24.5
2 01 2014 Hospital 2 4 23.2
3 01 2018 Hospital 2 4 24.5
4 01 2013 Hospital 2 4 30
5 01 2013 Hospital 2 3 30
6 01 2015 Not Hospital 2 4 19.2
7 01 2015 Not Hospital 2 3 19.2
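As a side note, the edit in the question (a 1/0 indicator for whether Names is present) is not covered by the join above. A minimal sketch, where Names_flag is just an illustrative column name:
library(dplyr)

df %>%
  # 1 when Names has a value, 0 when it is NA
  mutate(Names_flag = as.integer(!is.na(Names)))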
There are several ways to approach this problem, as far as I can see. However, I prefer the following solution.
The first step is to split the data frame in two: one data frame contains only the rows without NAs in the Names column, while the other contains only the rows with NAs. Then you simply search the former for each CODE/Year combination and return the name of the corresponding row.
# Your data frame
# df <- ...

# Split df into rows with and without NA Names
df.with.nas <- df[is.na(df$Names), ]
df.without.nas <- df[!is.na(df$Names), ]

# Define a function to separate the lookup logic
get.name <- function(row) {
  # row is an atomic vector, hence we have to use row["<SELECTOR>"]
  result <- subset(df.without.nas, CODE == row["CODE"] & Year == row["Year"])
  return(result["Names"])
}

# Finally, search and return
row.axis <- 1
df.with.nas$Names <- apply(df.with.nas, row.axis, get.name)

# Combine the dfs
df <- rbind(df.with.nas, df.without.nas)
This solution has a shortcoming: what should happen when we find duplicates?
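One possible way to deal with that, as a rough sketch: keep a single distinct name and fall back to NA when the matching rows disagree (the assumption being that any one of the agreeing names is acceptable):
# duplicate-safe variant of get.name(): returns one distinct name, or NA on conflict
get.name <- function(row) {
  result <- subset(df.without.nas, CODE == row["CODE"] & Year == row["Year"])
  hits <- unique(result$Names)
  if (length(hits) == 1) hits else NA_character_
}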
I hope this is useful!