I am trying to clean the age variable in panel data that follows individuals over time, removing data entry discrepancies. Many respondents have a jump in their age from one observation to the next because they missed a few waves and then came back, as we can see for the persons with ID 1 and 2 below. However, the person with ID 3 has a jump in age that does not equal the number of years s/he was out of the panel.
Could someone please guide me on how to filter out respondents whose change in age is unreasonable, i.e. not equal to the number of years they were out of the panel but due to other reasons such as data entry issues?
id year age
1 2005 50
1 2006 51
1 2010 55
2 2002 38
2 2005 41
2 2006 42
3 2006 30
3 2009 38
3 2010 39
structure(list(id = structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3), format.stata = "%9.0g"),
year = structure(c(2005, 2006, 2010, 2002, 2005, 2006, 2006,
2009, 2010), format.stata = "%9.0g"), age = structure(c(50,
51, 55, 38, 41, 42, 30, 38, 39), format.stata = "%9.0g")), row.names = c(NA,
-9L), class = c("tbl_df", "tbl", "data.frame"))
We can use diff to check, within each id, whether the change in age matches the change in year:
library(dplyr)
df %>%
  group_by(id) %>%
  # keep ids where the year gaps and the age gaps ever disagree
  filter(!all(diff(year) == diff(age)))
-output
# A tibble: 3 x 3
# Groups: id [1]
# id year age
# <dbl> <dbl> <dbl>
#1 3 2006 30
#2 3 2009 38
#3 3 2010 39
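Conversely, keeping only the respondents whose ages are consistent is the same filter without the negation:
library(dplyr)
# keep ids where every age change equals the gap in years
df %>%
  group_by(id) %>%
  filter(all(diff(year) == diff(age))) %>%
  ungroup()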
You can filter out the ids whose change in year and age is not in sync.
library(dplyr)
df %>%
  group_by(id) %>%
  filter(!all(year - min(year) == age - min(age))) -> unreasonable_data
unreasonable_data
# id year age
# <dbl> <dbl> <dbl>
#1 3 2006 30
#2 3 2009 38
#3 3 2010 39
The same logic can also be implemented using lag.
df %>%
  group_by(id) %>%
  filter(!all(year - lag(year) == age - lag(age))) -> unreasonable_data
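Note that year - lag(year) is NA in each group's first row, so all() returns NA for fully consistent groups; filter() then treats that NA as FALSE and drops them, which is why this still works. Adding na.rm = TRUE makes the intent explicit:
df %>%
  group_by(id) %>%
  filter(!all(year - lag(year) == age - lag(age), na.rm = TRUE)) -> unreasonable_data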
I want to create a country_year variable that is conditioned on the combinations of country and year, as shown below in this small subsample that I have created. This means that if I have 2 countries with 3 different years, the new country_year variable will have the values country1_year1, country1_year2, etc.
It seems so simple, but I am new to R and have looked at different questions that target this with no success. Could someone guide me a bit, please?
structure(list(id = structure(c(1, 1, 1, 2, 2, 2), format.stata = "%9.0g"),
country = structure(c("US", "US", "US", "UK", "UK", "UK"), format.stata = "%9s"),
year = structure(c(2003, 2004, 2005, 2003, 2004, 2005), format.stata = "%9.0g"),
country_year = structure(c(1, 2, 3, 4, 5, 6), format.stata = "%9.0g")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
It seems like you want to make a new variable, country_year:
Using base R:
df$country_year <- paste0(df$country, "_", df$year)
Using dplyr:
library(dplyr)
df %>%
  mutate(country_year = paste0(country, "_", year))
This gives us:
id country year country_year
<dbl> <chr> <dbl> <chr>
1 1 US 2003 US_2003
2 1 US 2004 US_2004
3 1 US 2005 US_2005
4 2 UK 2003 UK_2003
5 2 UK 2004 UK_2004
6 2 UK 2005 UK_2005
An option with tidyverse would be
library(dplyr)
library(tidyr)
df %>%
  unite(country_year, country, year, sep = "_", remove = FALSE)
-output
# A tibble: 6 x 4
# id country_year country year
# <dbl> <chr> <chr> <dbl>
#1 1 US_2003 US 2003
#2 1 US_2004 US 2004
#3 1 US_2005 US 2005
#4 2 UK_2003 UK 2003
#5 2 UK_2004 UK 2004
#6 2 UK_2005 UK 2005
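Note that the dput in the question actually shows country_year as the numeric codes 1 to 6 rather than pasted labels. If a numeric index per country-year combination is wanted instead, one option (a sketch on the same df; key is just a temporary helper column) is to match each combination against the unique combinations in order of appearance:
library(dplyr)
df %>%
  mutate(key = paste(country, year),
         # position of each combination among the unique ones
         country_year = match(key, unique(key))) %>%
  select(-key)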
I'm trying to graph excess deaths for 2020 against confirmed covid-19 deaths.
I have 2 dataframes: x_worldwide_weekly_deaths (covid-19) and another containing excess deaths. I want to add an excess deaths column to x_worldwide_weekly_deaths, matching on both ISO3 country code and week number.
Not every country tracks excess deaths, so I want those not within the original excess df to have an NA value.
Likewise, not every country that tracks excess deaths is as up to date; some have 37 weeks of data, others might only have 24, so I want NA values for the missing weeks as well.
Using the below, I've gotten halfway there: countries not on the original list have NA and those that are have a value. However, it only uses the first value rather than the changing total per week.
x_worldwide_weekly_death_values["excess_2020"] <- excess_death_2020$DTotal[match(x_worldwide_weekly_death_values$ISO3,
excess_death_2020$ISO3)]
Example of data not in the original excess_death_2020 file, which has had NAs added successfully:
ISO3 administrative_~ population pop_density_km2 week_number weekly_deaths date excess_2020
<chr> <chr> <int> <chr> <dbl> <dbl> <date> <dbl>
1 AFG Afghanistan 37172386 56.937760009803 1 0 2020-01-06 NA
2 AFG Afghanistan 37172386 56.937760009803 2 0 2020-01-13 NA
3 AFG Afghanistan 37172386 56.937760009803 3 0 2020-01-20 NA
dput() for the above:
dput(x_worldwide_weekly_death_values[1:3,])
structure(list(ISO3 = c("AFG", "AFG", "AFG"), administrative_area_level_1 = c("Afghanistan",
"Afghanistan", "Afghanistan"), population = c(37172386L, 37172386L,
37172386L), pop_density_km2 = c("56.937760009803", "56.937760009803",
"56.937760009803"), week_number = c(1, 2, 3), weekly_deaths = c(0,
0, 0), date = structure(c(18267, 18274, 18281), class = "Date"),
excess_2020 = c(NA_real_, NA_real_, NA_real_)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
Compared to Austria, where the week 1 value has been added to all cells:
ISO3 administrative_a~ population pop_density_km2 week_number weekly_deaths date excess_2020
<chr> <chr> <int> <chr> <dbl> <dbl> <date> <dbl>
1 AUT Austria 8840521 107.1279668605~ 1 0 2020-01-06 1610
2 AUT Austria 8840521 107.1279668605~ 2 0 2020-01-13 1610
3 AUT Austria 8840521 107.1279668605~ 3 0 2020-01-20 1610
dput() for the above:
dput(x_worldwide_weekly_death_values[371:373,])
structure(list(ISO3 = c("AUT", "AUT", "AUT"), administrative_area_level_1 = c("Austria",
"Austria", "Austria"), population = c(8840521L, 8840521L, 8840521L
), pop_density_km2 = c("107.127966860564", "107.127966860564",
"107.127966860564"), week_number = c(1, 2, 3), weekly_deaths = c(0,
0, 0), date = structure(c(18267, 18274, 18281), class = "Date"),
excess_2020 = c(1610, 1610, 1610)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
Expected output for the excess_2020 column would be the DTotal figures associated with each week number: Week 1 = 1610, Week 2 = 1702, Week 3 = 1797.
ISO3 Year Week Sex D0_14 D15_64 D65_74 D75_84 D85p DTotal R0_14 R15_64 R65_74 R75_84 R85p
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AUT 2020 1 b 1 220 221 481 687 1610 4.07e-5 0.00196 0.0134 0.0399 0.157
2 AUT 2020 2 b 8 231 261 490 712 1702 3.26e-4 0.00206 0.0158 0.0407 0.163
3 AUT 2020 3 b 12 223 272 537 753 1797 4.89e-4 0.00198 0.0165 0.0446 0.173
dput() for the above:
dput(excess_death_2020[1:3,])
structure(list(ISO3 = c("AUT", "AUT", "AUT"), Year = c(2020,
2020, 2020), Week = c(1, 2, 3), Sex = c("b", "b", "b"), D0_14 = c(1,
8, 12), D15_64 = c(220, 231, 223), D65_74 = c(221, 261, 272),
D75_84 = c(481, 490, 537), D85p = c(687, 712, 753), DTotal = c(1610,
1702, 1797), R0_14 = c(4.07296256273503e-05, 0.000325837005018803,
0.000488755507528204), R15_64 = c(0.00195783568851069, 0.00205572747293622,
0.00198453344789947), R65_74 = c(0.0133964529296798, 0.0158211502925177,
0.0164879420672982), R75_84 = c(0.0399495248686277, 0.0406970211759409,
0.044600613003021), R85p = c(0.157436284517545, 0.163165406952681,
0.172561167746305), RTotal = c(0.00948052042945739, 0.0100222644539978,
0.0105816740445559), Split = c(0, 0, 0), SplitSex = c(0,
0, 0), Forecast = c(1, 1, 1), date = structure(c(18267, 18274,
18281), class = "Date")), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
I tried a few variations of the below with little success
x_worldwide_weekly_deaths["excess_2020"] <- excess_death_2020$DTotal[excess_death_2020$Week[match(x_worldwide_weekly_death_values$week_number
[x_worldwide_weekly_death_values$ISO3],
excess_death_2020$Week[excess_death_2020$CountryCode])]]
Should I not be using match() on multiple criteria or am I not formatting it correctly?
Really appreciate any help and suggestions!
dplyr is really good/easy for this kind of thing. Here's a simplified example that achieves both of your goals (adding NA for countries that are not in the excess death data, and adding NA for weeks that are not in the excess death data)...
library(dplyr)
x_worldwide_weekly_death_values <-
tribble(
~iso3c, ~week, ~covid_deaths,
"AFG", 1, 0,
"AFG", 2, 10,
"AFG", 3, 30,
"AFG", 4, 50,
"AUT", 1, 120,
"AUT", 2, 200,
"AUT", 3, 320,
"AUT", 4, 465,
"XXX", 1, 10,
"XXX", 2, 20,
"XXX", 3, 30,
"XXX", 4, 40,
)
excess_death_2020 <-
tribble(
~iso3c, ~week, ~DTotal,
"AFG", 1, 0,
"AFG", 2, 0,
"AFG", 3, 0,
"AUT", 1, 1610,
"AUT", 2, 1702,
"AUT", 3, 1797,
)
x_worldwide_weekly_death_values %>%
  left_join(excess_death_2020, by = c("iso3c", "week"))
#> # A tibble: 12 x 4
#> iso3c week covid_deaths DTotal
#> <chr> <dbl> <dbl> <dbl>
#> 1 AFG 1 0 0
#> 2 AFG 2 10 0
#> 3 AFG 3 30 0
#> 4 AFG 4 50 NA
#> 5 AUT 1 120 1610
#> 6 AUT 2 200 1702
#> 7 AUT 3 320 1797
#> 8 AUT 4 465 NA
#> 9 XXX 1 10 NA
#> 10 XXX 2 20 NA
#> 11 XXX 3 30 NA
#> 12 XXX 4 40 NA
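To answer the match() question directly: match() compares one vector against another, so it cannot take two criteria at once, but a multi-column match can be emulated by pasting the key columns into a composite key (a base R sketch using the simplified column names above):
# build a composite ISO3-week key on both sides
key_x <- paste(x_worldwide_weekly_death_values$iso3c,
               x_worldwide_weekly_death_values$week)
key_e <- paste(excess_death_2020$iso3c, excess_death_2020$week)
# unmatched keys return NA, matching the left-join behaviour
x_worldwide_weekly_death_values$DTotal <-
  excess_death_2020$DTotal[match(key_x, key_e)]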
Given a dataframe df as follows:
df <- structure(list(year = c(2001, 2002, 2003, 2004), `1` = c(22.0775,
24.2460714285714, 29.4039285714286, 27.7110714285714), `2` = c(27.2535714285714,
35.9996428571429, 26.39, 27.8557142857143), `3` = c(24.7710714285714,
25.4428571428571, 15.1142857142857, 19.9657142857143)), row.names = c(NA,
-4L), groups = structure(list(year = c(2001, 2002, 2003, 2004
), .rows = structure(list(1L, 2L, 3L, 4L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, 4L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Out:
year 1 2 3
0 2001 22.07750 27.25357 24.77107
1 2002 24.24607 35.99964 25.44286
2 2003 29.40393 26.39000 15.11429
3 2004 27.71107 27.85571 19.96571
For columns 1, 2 and 3, how could I calculate the year-to-year absolute change?
The expected result will look like this:
year 1 2 3
0 2002 2.16857 8.74607 0.67179
1 2003 5.15786 9.60964 10.32857
2 2004 1.69286 1.46571 4.85142
The final objective is to compare the values of columns 1, 2 and 3 across all years and find the year and column with the largest change; in this example, it should be 2003 and column 3.
How could I do that in R? Thanks.
You can use:
library(dplyr)
data <- df %>% ungroup %>% summarise(across(-1, ~abs(diff(.))))
data
# A tibble: 3 x 3
# `1` `2` `3`
# <dbl> <dbl> <dbl>
#1 2.17 8.75 0.672
#2 5.16 9.61 10.3
#3 1.69 1.47 4.85
To get the max change:
mat <- which(data == max(data), arr.ind = TRUE)
mat
# row col
#[1,] 2 3
#Year name
df$year[mat[, 1] + 1]
#[1] 2003
#Column name
mat[, 2]
#col
# 3
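Since the columns here happen to be named 1, 2 and 3, the column index and the column name coincide; in general, the actual name can be looked up explicitly:
names(data)[mat[, 2]]
#[1] "3"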
You can try:
library(reshape2)
library(dplyr)
#Melt
Melted <- reshape2::melt(df,id.vars = 'year')
#Group
Melted %>%
  group_by(variable) %>%
  mutate(Diff = c(0, abs(diff(value)))) %>%
  ungroup() %>%
  filter(Diff == max(Diff))
# A tibble: 1 x 4
year variable value Diff
<dbl> <fct> <dbl> <dbl>
1 2003 3 15.1 10.3
We can apply diff to the entire dataset in base R by converting the numeric columns of interest to a matrix:
cbind(year = df$year[-1], abs(diff(as.matrix(df[-1]))))
# year 1 2 3
#[1,] 2002 2.168571 8.746071 0.6717857
#[2,] 2003 5.157857 9.609643 10.3285714
#[3,] 2004 1.692857 1.465714 4.8514286
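Continuing in base R, the year and column with the largest change can be pulled from the same matrix:
m <- abs(diff(as.matrix(df[-1])))
idx <- which(m == max(m), arr.ind = TRUE)
df$year[-1][idx[, "row"]]
#[1] 2003
colnames(m)[idx[, "col"]]
#[1] "3"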
I have a problem in my dataset with missing values. For some reason, several IDs are missing a value in the 'Names' column. This is strange, because other IDs with the same CODE (there are more codes in my whole dataset, >10K) and the same year (6 possible years) do have a value in that column.
Can somebody help me figure out the code, so that IDs with missing values in the 'Names' column get the same character value in that column when other IDs with the same code and year do have a value there?
For example: the NA at row 4 should change to 'Hospital', based on another ID with the same code and year. (In my original dataframe there is an ID with year 2013 and code 01 with name 'Hospital'; if there were not, it should stay NA.)
Sidenote: it is panel data, so each ID can be in the dataset for multiple years (one row per year) and not everybody is in for every year. There are also more variables in my dataframe.
> dput(Dataframe[1:7, ])
structure(list(ID = structure(c(1, 2, 2, 2, 2, 2, 2), format.spss = "F9.3"), CODE = c("01", "01", "01","01", "01", "01", "01"), Year = structure(c(2018, 2014, 2018, 2013, 2013, 2015, 2015), format.spss = "F9.3"), Quarter = structure(c(3, 4, 4, 4, 3, 4, 3), format.spss = "F9.3"), Size = c(24.5, 23.25, 24.5, 30, 30, 19.25, 19.25), Names = c("Hospital", "Hospital", "Hospital", NA, "Hospital", NA, "Hospital")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
A tibble: 7 x 7
ID Gender CODE Year Quarter Size Names
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 1 2 01 2018 3 24.5 Hospital
2 2 1 01 2014 4 23.2 Hospital
3 2 1 01 2018 4 24.5 Hospital
4 2 1 01 2013 4 30 NA
5 2 1 01 2013 3 30 Hospital
6 2 1 01 2015 4 19.2 NA
7 2 1 01 2015 3 19.2 Hospital
Selecting and checking individual rows is too much work; I have over 1.1 million rows.
Edit: it is also possible to convert the 'Names' column to 1 if it has a (character) value and 0 if NA.
Thank you!
I'm not exactly sure, because in your example all the names are the same, but I think this might do what you are looking for.
I changed the example below so that the last Names value is "Not Hospital".
df <- structure(list(ID = structure(c(1, 2, 2, 2, 2, 2, 2), format.spss = "F9.3"), CODE = c("01", "01", "01","01", "01", "01", "01"), Year = structure(c(2018, 2014, 2018, 2013, 2013, 2015, 2015), format.spss = "F9.3"), Quarter = structure(c(3, 4, 4, 4, 3, 4, 3), format.spss = "F9.3"), Size = c(24.5, 23.25, 24.5, 30, 30, 19.25, 19.25), Names = c("Hospital", "Hospital", "Hospital", NA, "Hospital", NA, "Not Hospital")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame") )
Original
# A tibble: 7 x 6
ID CODE Year Quarter Size Names
<dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 1 01 2018 3 24.5 Hospital
2 2 01 2014 4 23.2 Hospital
3 2 01 2018 4 24.5 Hospital
4 2 01 2013 4 30 NA
5 2 01 2013 3 30 Hospital
6 2 01 2015 4 19.2 NA
7 2 01 2015 3 19.2 Not Hospital
Here's the code to update the names.
library(dplyr)
df %>%
  filter(!is.na(Names)) %>%
  select(CODE, Year, Names) %>%
  # collapse to one row per distinct CODE/Year/Names combination
  group_by_all() %>%
  summarise() %>%
  # attach the looked-up name to every row of df
  right_join(df, by = c("CODE", "Year")) %>%
  rename(Names = Names.x) %>%
  select(-Names.y)
Output:
# A tibble: 7 x 6
# Groups: CODE, Year [4]
CODE Year Names ID Quarter Size
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 01 2018 Hospital 1 3 24.5
2 01 2014 Hospital 2 4 23.2
3 01 2018 Hospital 2 4 24.5
4 01 2013 Hospital 2 4 30
5 01 2013 Hospital 2 3 30
6 01 2015 Not Hospital 2 4 19.2
7 01 2015 Not Hospital 2 3 19.2
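A more compact variant of the same idea keeps df in place and fills only the NA rows within each CODE/Year group (a sketch assuming each group carries at most one distinct name):
library(dplyr)
df %>%
  group_by(CODE, Year) %>%
  # take the first non-NA name in the group; rows that already have a name keep theirs
  mutate(Names = coalesce(Names, Names[!is.na(Names)][1])) %>%
  ungroup()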
There are several ways to approach this problem, as far as I can see. However, I prefer the following solution.
The first step is to split the data frame into two: one data frame contains only rows without NAs in the Names column, while the other contains only rows with NAs in the Names column. Then you simply search the former for CODE and Year combinations and return the name of the corresponding row.
# Your data frame df, as defined in the question

# Split df
df.with.nas <- df[is.na(df$Names), ]
df.without.nas <- df[!is.na(df$Names), ]

# Define a function to separate the lookup logic
get.name <- function(row) {
  # row is an atomic vector, hence we have to use row["<SELECTOR>"]
  result <- subset(df.without.nas, CODE == row["CODE"] & Year == row["Year"])
  # return the first matching name, or NA if there is no match
  if (nrow(result) == 0) return(NA_character_)
  result$Names[1]
}

# Finally, search and return.
row.axis <- 1
df.with.nas$Names <- apply(df.with.nas, row.axis, get.name)

# Combine the dfs
df <- rbind(df.with.nas, df.without.nas)
This solution has a shortcoming: what should happen when we find duplicates?
I hope this is useful!
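As for the sidenote about recoding the Names column to 1 when a value is present and 0 when it is NA, a one-liner covers it (has_name is just an illustrative column name):
# 1 if Names has a character value, 0 if NA
df$has_name <- as.integer(!is.na(df$Names))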