I apologize if this is a duplicate question; I couldn't find anything quite like it.
I have some data that I am cleaning and I need to fill missing values. The data looks like this, with the dput() below. Decimals were dropped in the printed tibble but are included in the dput().
> print(tbl_df(df), n=26)
# A tibble: 26 x 6
Year Trial Group1 Group2 Group3 Group4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Year1 2 346588. 156266 34806. NA
2 Year1 3 342573 NA 34652. 292001.
3 Year1 5 286285. 129257. 29645. 252786.
4 Year1 7 234410. NA 24536. NA
5 Year1 9 184733. 82944. NA 170653
6 Year1 10 NA 81419. 19461 167273.
7 Year1 11 169620. 74688. 18065 155442
8 Year1 14 107652 48381. 11941. 100076
9 Year1 15 88440 39807 10123. 83137
10 Year1 17 NA 31608 7926 64551.
11 Year1 18 63622 29236 7444. 58848.
12 Year1 22 14143. 6366. 1683. 10889.
13 Year2 22 279904 102271 28221. 138804.
14 Year2 25 200386 78628. 21942 NA
15 Year2 26 157182. NA 18099. 91963.
16 Year2 28 121122. 54538 14532. 76422
17 Year2 30 25899. 16773 489. NA
18 Year2 32 112091. 51219. 11298. 71655.
19 Year2 33 108756 49311. 10589. 70167
20 Year2 34 NA 49127. NA 69195.
21 Year2 36 104827 42651. 8568. 63580.
22 Year2 38 44849 14114 2302. 11652
23 Year2 40 104407. 42545 6240 63318.
24 Year2 41 99059. 38423 6766. 58017
25 Year2 42 NA 40432. NA 57932.
26 Year2 44 49119. 8796. 4769. 11233.
dput(df)
structure(list(Year = c("Year1", "Year1", "Year1", "Year1", "Year1",
"Year1", "Year1", "Year1", "Year1", "Year1", "Year1", "Year1",
"Year2", "Year2", "Year2", "Year2", "Year2", "Year2", "Year2",
"Year2", "Year2", "Year2", "Year2", "Year2", "Year2", "Year2"
), Trial = c(2, 3, 5, 7, 9, 10, 11, 14, 15, 17, 18, 22, 22, 25,
26, 28, 30, 32, 33, 34, 36, 38, 40, 41, 42, 44), Group1 = c(346587.6667,
342573, 286285.3333, 234409.6667, 184733.3333, NA, 169620.3333,
107652, 88440, NA, 63622, 14143.33333, 279904, 200386, 157182.3333,
121122.3333, 25899.33333, 112090.6667, 108756, NA, 104827, 44849,
104407.3333, 99058.66667, NA, 49119.33333), Group2 = c(156266,
NA, 129257.3333, NA, 82943.66667, 81419.33333, 74688.33333, 48381.33333,
39807, 31608, 29236, 6365.666667, 102271, 78628.33333, NA, 54538,
16773, 51218.66667, 49311.33333, 49127.33333, 42650.66667, 14114,
42545, 38423, 40432.33333, 8795.666667), Group3 = c(34805.66667,
34651.66667, 29644.66667, 24535.66667, NA, 19461, 18065, 11941.33333,
10123.33333, 7926, 7444.333333, 1683.333333, 28221.33333, 21942,
18099.33333, 14532.33333, 489.3333333, 11297.66667, 10588.66667,
NA, 8567.666667, 2302.333333, 6240, 6765.666667, NA, 4769.333333
), Group4 = c(NA, 292000.6667, 252785.6667, NA, 170653, 167273.3333,
155442, 100076, 83137, 64551.33333, 58847.66667, 10888.66667,
138803.6667, NA, 91963.33333, 76422, NA, 71655.33333, 70167,
69195.33333, 63579.66667, 11652, 63317.66667, 58017, 57932.33333,
11232.66667)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -26L), spec = structure(list(cols = list(
Year = structure(list(), class = c("collector_character",
"collector")), Trial = structure(list(), class = c("collector_double",
"collector")), Group1 = structure(list(), class = c("collector_double",
"collector")), Group2 = structure(list(), class = c("collector_double",
"collector")), Group3 = structure(list(), class = c("collector_double",
"collector")), Group4 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Basically, I need to fill the NA values with the value from the next trial down. So, for example, I need to fill row 6, column 3 (Group1 at Trial 10) with the data from row 7, column 3 (Group1 at Trial 11).
But that's not all. I also need to create a row for each missing trial number, and then fill those rows from the last trial as well. This is the part I'm getting hung up on. Is there a way to accomplish both of these?
For example, I need to change tail(df) from A to B.
A.
Year Trial Group1 Group2 Group3 Group4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Year2 40 104407. 42545 6240 63318.
2 Year2 41 99059. 38423 6766. 58017
3 Year2 42 NA 40432. NA 57932.
4 Year2 44 49119. 8796. 4769. 11233.
B.
Year Trial Group1 Group2 Group3 Group4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Year2 40 104407. 42545 6240 63318.
2 Year2 41 99059. 38423 6766. 58017
3 Year2 42 49119. 40432. 4769. 57932.
4 Year2 43 49119. 40432. 4769. 57932.
5 Year2 44 49119. 8796. 4769. 11233.
You can use complete() and fill() with .direction = 'up':
library(dplyr)
library(tidyr)
df %>%
  group_by(Year) %>%
  complete(Trial = min(Trial):max(Trial)) %>%
  fill(starts_with('Group'), .direction = 'up') %>%
  ungroup()
# A tibble: 44 x 6
# Year Trial Group1 Group2 Group3 Group4
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Year1 2 346588. 156266 34806. 292001.
# 2 Year1 3 342573 129257. 34652. 292001.
# 3 Year1 4 286285. 129257. 29645. 252786.
# 4 Year1 5 286285. 129257. 29645. 252786.
# 5 Year1 6 234410. 82944. 24536. 170653
# 6 Year1 7 234410. 82944. 24536. 170653
# 7 Year1 8 184733. 82944. 19461 170653
# 8 Year1 9 184733. 82944. 19461 170653
# 9 Year1 10 169620. 81419. 19461 167273.
#10 Year1 11 169620. 74688. 18065 155442
# … with 34 more rows
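A rough data.table equivalent of the same idea, in case that is your preferred toolkit (a sketch, not the answer above; it assumes data.table >= 1.12.4 for nafill(), whose type = "nocb" carries the next observation backward, matching fill(.direction = 'up'), while a join against the full trial sequence plays the role of complete()):
library(data.table)

dt <- as.data.table(df)

# expand each Year to the full run of Trial numbers
full <- dt[, .(Trial = seq(min(Trial), max(Trial), by = 1)), by = Year]
out  <- dt[full, on = .(Year, Trial)]

# fill each Group column upward (next observation carried backward) within Year
grp_cols <- grep("^Group", names(out), value = TRUE)
out[, (grp_cols) := lapply(.SD, nafill, type = "nocb"),
    by = Year, .SDcols = grp_cols]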
I have longitudinal data with three follow-ups; the follow-up values are in columns 2, 3 and 4 (v_9, v_01 and v_03).
I want to set the value 99 in the columns v_9, v_01 and v_03 to NA, and I also want to set the corresponding cells in their companion columns ("d_9", "d_01", "d_03" and "a_9", "a_01", "a_03") to NA. As an example, ID 101 should end up as below:
How can I do this for all the individuals in my whole data set in R? Thanks in advance for the help.
"id" "v_9" "v_01" "v_03" "d_9" "d_01" "d_03" "a_9" "a_01" "a_03"
101 12 NA 10 2015-03-23 NA 2003-06-19 40.50650 NA 44.1065
structure(list(id = c(101, 102, 103, 104), v_9 = c(12, 99, 16,
25), v_01 = c(99, 12, 16, NA), v_03 = c(10, NA, 99, NA), d_9 = structure(c(16517,
17613, 16769, 10667), class = "Date"), d_01 = structure(c(13291,
NA, 13566, NA), class = "Date"), d_03 = structure(c(12222, NA,
12119, NA), class = "Date"), a_9 = c(40.5065, 40.5065, 30.19713,
51.40862), a_01 = c(42.5065, 41.5112, 32.42847, NA), a_03 = c(44.1065,
NA, 35.46543, NA)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
Try this function:
fn <- function(df){
  for(s in c("_9", "_01", "_03")){
    # rows where the visit value for this suffix is 99
    i <- which(df[[paste0("v", s)]] == 99)
    # blank out the visit value and its matching date and age columns
    df[i, paste0("v", s)] <- NA
    df[i, paste0("d", s)] <- NA
    df[i, paste0("a", s)] <- NA
  }
  df
}
df <- fn(df)
Output
# A tibble: 4 × 10
id v_9 v_01 v_03 d_9 d_01 d_03 a_9 a_01 a_03
<dbl> <dbl> <dbl> <dbl> <date> <date> <date> <dbl> <dbl> <dbl>
1 101 12 NA 10 2015-03-23 NA 2003-06-19 40.5 NA 44.1
2 102 NA 12 NA NA NA NA NA 41.5 NA
3 103 16 16 NA 2015-11-30 2007-02-22 NA 30.2 32.4 NA
4 104 25 NA NA 1999-03-17 NA NA 51.4 NA NA
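If you would rather stay in the tidyverse, the same suffix logic can be written with purrr::reduce() and across() (an alternative sketch, assuming dplyr 1.0+, not part of the answer above; replace() keeps the Date columns as dates):
library(dplyr)
library(purrr)

df <- reduce(c("_9", "_01", "_03"), function(d, s) {
  # rows where the visit value for this suffix is 99 (NAs excluded)
  hit <- !is.na(d[[paste0("v", s)]]) & d[[paste0("v", s)]] == 99
  d %>%
    mutate(across(all_of(paste0(c("v", "d", "a"), s)),
                  ~ replace(.x, hit, NA)))
}, .init = df)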
I have two datasets on the same 2 patients. With the second dataset I want to add new information to the first, but I can't seem to get the code right.
My first (incomplete) dataset has a patient ID, measurement time (either T0 or FU1), year of birth, date of the CT scan, and two outcomes (legs_mass and total_mass):
library(tidyverse)
library(dplyr)
library(magrittr)
library(lubridate)
df1 <- structure(list(ID = c(115, 115, 370, 370), time = structure(c(1L,
6L, 1L, 6L), .Label = c("T0", "T1M0", "T1M6", "T1M12", "T2M0",
"FU1"), class = "factor"), year_of_birth = c(1970, 1970, 1961,
1961), date_ct = structure(c(16651, 17842, 16651, 18535), class = "Date"),
legs_mass = c(9.1, NA, NA, NA), total_mass = c(14.5, NA,
NA, NA)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
# Which gives the following dataframe
df1
# A tibble: 4 x 6
ID time year_of_birth date_ct legs_mass total_mass
<dbl> <fct> <dbl> <date> <dbl> <dbl>
1 115 T0 1970 2015-08-04 9.1 14.5
2 115 FU1 1970 2018-11-07 NA NA
3 370 T0 1961 2015-08-04 NA NA
4 370 FU1 1961 2020-09-30 NA NA
The second dataset adds to the legs_mass and total_mass columns:
df2 <- structure(list(ID = c(115, 370), date_ct = structure(c(17842,
18535), class = "Date"), ctscan_label = c("PXE115_CT_20181107_xxxxx-3.tif",
"PXE370_CT_20200930_xxxxx-403.tif"), legs_mass = c(956.1, 21.3
), total_mass = c(1015.9, 21.3)), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
# Which gives the following dataframe:
df2
# A tibble: 2 x 5
ID date_ct ctscan_label legs_mass total_mass
<dbl> <date> <chr> <dbl> <dbl>
1 115 2018-11-07 PXE115_CT_20181107_xxxxx-3.tif 956. 1016.
2 370 2020-09-30 PXE370_CT_20200930_xxxxx-403.tif 21.3 21.3
What I am trying to do is:
1. Add the legs_mass and total_mass column values from df2 to df1, based on patient ID and date_ct.
2. Add the column of df2 that is not in df1 (ctscan_label) to df1, also based on patient ID and date_ct.
So that the final dataset df3 looks as follows:
df3 <- structure(list(ID = c(115, 115, 370, 370), time = structure(c(1L,
6L, 1L, 6L), .Label = c("T0", "T1M0", "T1M6", "T1M12", "T2M0",
"FU1"), class = "factor"), year_of_birth = c(1970, 1970, 1961,
1961), date_ct = structure(c(16651, 17842, 16651, 18535), class = "Date"),
legs_mass = c(9.1, 956.1, NA, 21.3), total_mass = c(14.5,
1015.9, NA, 21.3)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
# Corresponding to the following tibble:
# A tibble: 4 x 6
ID time year_of_birth date_ct legs_mass total_mass
<dbl> <fct> <dbl> <date> <dbl> <dbl>
1 115 T0 1970 2015-08-04 9.1 14.5
2 115 FU1 1970 2018-11-07 956. 1016.
3 370 T0 1961 2015-08-04 NA NA
4 370 FU1 1961 2020-09-30 21.3 21.3
I have tried merge() and rbind() from base R, and bind_rows() from dplyr, but can't seem to get it right.
Any help would be appreciated.
You can join the two datasets and use coalesce() to keep the non-NA value from whichever one has it.
library(dplyr)
left_join(df1, df2, by = c("ID", "date_ct")) %>%
  mutate(legs_mass = coalesce(legs_mass.x, legs_mass.y),
         total_mass = coalesce(total_mass.x, total_mass.y)) %>%
  select(-matches('\\.x|\\.y'), -ctscan_label)
# ID time year_of_birth date_ct legs_mass total_mass
# <dbl> <fct> <dbl> <date> <dbl> <dbl>
#1 115 T0 1970 2015-08-04 9.1 14.5
#2 115 FU1 1970 2018-11-07 956. 1016.
#3 370 T0 1961 2015-08-04 NA NA
#4 370 FU1 1961 2020-09-30 21.3 21.3
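If your dplyr is version 1.0.0 or newer, rows_patch() expresses the same intent more directly: it fills the NA cells of df1 with matching values from df2 (a sketch; ctscan_label is dropped first because rows_patch() only accepts columns that also exist in df1, and the desired df3 does not contain it anyway):
library(dplyr)

df3 <- rows_patch(df1,
                  df2 %>% select(ID, date_ct, legs_mass, total_mass),
                  by = c("ID", "date_ct"))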
We can use data.table methods
library(data.table)
setDT(df1)[setDT(df2), c("legs_mass", "total_mass") :=
             .(fcoalesce(legs_mass, i.legs_mass),
               fcoalesce(total_mass, i.total_mass)),
           on = .(ID, date_ct)]
Output:
df1
ID time year_of_birth date_ct legs_mass total_mass
1: 115 T0 1970 2015-08-04 9.1 14.5
2: 115 FU1 1970 2018-11-07 956.1 1015.9
3: 370 T0 1961 2015-08-04 NA NA
4: 370 FU1 1961 2020-09-30 21.3 21.3
I'm trying to graph excess deaths for 2020 against confirmed COVID-19 deaths.
I have two dataframes: x_worldwide_weekly_deaths (COVID-19 deaths) and another containing excess deaths. I want to add an excess-deaths column to x_worldwide_weekly_deaths, matching on both ISO3 country code and week number.
Not every country tracks excess deaths, so I want the countries not in the original excess-deaths df to have NA values.
Likewise, not every country that tracks excess deaths is equally up to date; some have 37 weeks of data, others might only have 24, so I also want NA values for the missing weeks.
Using the code below I've gotten halfway there: countries not on the original list get NA and those that are on it get a value, but it only uses the first value for every week rather than the total changing per week.
x_worldwide_weekly_death_values["excess_2020"] <- excess_death_2020$DTotal[match(x_worldwide_weekly_death_values$ISO3,
excess_death_2020$ISO3)]
Example of data for a country not in the original excess_death_2020 file, where the NAs have been added successfully:
ISO3 administrative_~ population pop_density_km2 week_number weekly_deaths date excess_2020
<chr> <chr> <int> <chr> <dbl> <dbl> <date> <dbl>
1 AFG Afghanistan 37172386 56.937760009803 1 0 2020-01-06 NA
2 AFG Afghanistan 37172386 56.937760009803 2 0 2020-01-13 NA
3 AFG Afghanistan 37172386 56.937760009803 3 0 2020-01-20 NA
dput() for the above:
dput(x_worldwide_weekly_death_values[1:3,])
structure(list(ISO3 = c("AFG", "AFG", "AFG"), administrative_area_level_1 = c("Afghanistan",
"Afghanistan", "Afghanistan"), population = c(37172386L, 37172386L,
37172386L), pop_density_km2 = c("56.937760009803", "56.937760009803",
"56.937760009803"), week_number = c(1, 2, 3), weekly_deaths = c(0,
0, 0), date = structure(c(18267, 18274, 18281), class = "Date"),
excess_2020 = c(NA_real_, NA_real_, NA_real_)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
Compare this with Austria, where the week 1 value has been added to every week:
ISO3 administrative_a~ population pop_density_km2 week_number weekly_deaths date excess_2020
<chr> <chr> <int> <chr> <dbl> <dbl> <date> <dbl>
1 AUT Austria 8840521 107.1279668605~ 1 0 2020-01-06 1610
2 AUT Austria 8840521 107.1279668605~ 2 0 2020-01-13 1610
3 AUT Austria 8840521 107.1279668605~ 3 0 2020-01-20 1610
dput() for the above:
dput(x_worldwide_weekly_death_values[371:373,])
structure(list(ISO3 = c("AUT", "AUT", "AUT"), administrative_area_level_1 = c("Austria",
"Austria", "Austria"), population = c(8840521L, 8840521L, 8840521L
), pop_density_km2 = c("107.127966860564", "107.127966860564",
"107.127966860564"), week_number = c(1, 2, 3), weekly_deaths = c(0,
0, 0), date = structure(c(18267, 18274, 18281), class = "Date"),
excess_2020 = c(1610, 1610, 1610)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
The expected output for the excess_2020 column is the DTotal figure associated with each week number: Week 1 = 1610, Week 2 = 1702, Week 3 = 1797. The excess_death_2020 data looks like this:
ISO3 Year Week Sex D0_14 D15_64 D65_74 D75_84 D85p DTotal R0_14 R15_64 R65_74 R75_84 R85p
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AUT 2020 1 b 1 220 221 481 687 1610 4.07e-5 0.00196 0.0134 0.0399 0.157
2 AUT 2020 2 b 8 231 261 490 712 1702 3.26e-4 0.00206 0.0158 0.0407 0.163
3 AUT 2020 3 b 12 223 272 537 753 1797 4.89e-4 0.00198 0.0165 0.0446 0.173
dput() for the above
dput(excess_death_2020[1:3,])
structure(list(ISO3 = c("AUT", "AUT", "AUT"), Year = c(2020,
2020, 2020), Week = c(1, 2, 3), Sex = c("b", "b", "b"), D0_14 = c(1,
8, 12), D15_64 = c(220, 231, 223), D65_74 = c(221, 261, 272),
D75_84 = c(481, 490, 537), D85p = c(687, 712, 753), DTotal = c(1610,
1702, 1797), R0_14 = c(4.07296256273503e-05, 0.000325837005018803,
0.000488755507528204), R15_64 = c(0.00195783568851069, 0.00205572747293622,
0.00198453344789947), R65_74 = c(0.0133964529296798, 0.0158211502925177,
0.0164879420672982), R75_84 = c(0.0399495248686277, 0.0406970211759409,
0.044600613003021), R85p = c(0.157436284517545, 0.163165406952681,
0.172561167746305), RTotal = c(0.00948052042945739, 0.0100222644539978,
0.0105816740445559), Split = c(0, 0, 0), SplitSex = c(0,
0, 0), Forecast = c(1, 1, 1), date = structure(c(18267, 18274,
18281), class = "Date")), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
I tried a few variations of the below with little success
x_worldwide_weekly_deaths["excess_2020"] <- excess_death_2020$DTotal[excess_death_2020$Week[match(x_worldwide_weekly_death_values$week_number
[x_worldwide_weekly_death_values$ISO3],
excess_death_2020$Week[excess_death_2020$CountryCode])]]
Should I not be using match() on multiple criteria or am I not formatting it correctly?
Really appreciate any help and suggestions!
dplyr is really good for this kind of thing. Here's a simplified example that achieves both of your goals (adding NA for countries that are not in the excess-death data, and adding NA for weeks that are not in the excess-death data)...
library(dplyr)
x_worldwide_weekly_death_values <-
  tribble(
    ~iso3c, ~week, ~covid_deaths,
    "AFG",  1,     0,
    "AFG",  2,     10,
    "AFG",  3,     30,
    "AFG",  4,     50,
    "AUT",  1,     120,
    "AUT",  2,     200,
    "AUT",  3,     320,
    "AUT",  4,     465,
    "XXX",  1,     10,
    "XXX",  2,     20,
    "XXX",  3,     30,
    "XXX",  4,     40,
  )
excess_death_2020 <-
  tribble(
    ~iso3c, ~week, ~DTotal,
    "AFG",  1,     0,
    "AFG",  2,     0,
    "AFG",  3,     0,
    "AUT",  1,     1610,
    "AUT",  2,     1702,
    "AUT",  3,     1797,
  )
x_worldwide_weekly_death_values %>%
  left_join(excess_death_2020, by = c("iso3c", "week"))
#> # A tibble: 12 x 4
#> iso3c week covid_deaths DTotal
#> <chr> <dbl> <dbl> <dbl>
#> 1 AFG 1 0 0
#> 2 AFG 2 10 0
#> 3 AFG 3 30 0
#> 4 AFG 4 50 NA
#> 5 AUT 1 120 1610
#> 6 AUT 2 200 1702
#> 7 AUT 3 320 1797
#> 8 AUT 4 465 NA
#> 9 XXX 1 10 NA
#> 10 XXX 2 20 NA
#> 11 XXX 3 30 NA
#> 12 XXX 4 40 NA
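Translated to the objects in your question, the same join would look roughly like this (a sketch: column names are taken from your dput() output, I assume excess_death_2020 has one row per ISO3 and week as in your dput, i.e. Year 2020 and Sex == "b" only, and the half-filled excess_2020 column from the match() attempt is dropped first):
library(dplyr)

x_worldwide_weekly_death_values %>%
  select(-excess_2020) %>%   # drop the column created by the match() attempt
  left_join(
    excess_death_2020 %>%
      select(ISO3, week_number = Week, excess_2020 = DTotal),
    by = c("ISO3", "week_number")
  )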
I know the sum of points for each person. What I need to know is the minimum number of points that a person could have, and the maximum number of points that a person could have.
What I have tried:
min_and_max <- dataset %>%
  group_by(person) %>%
  dplyr::filter(min(sum(points, na.rm = T))) %>%
  distinct(person) %>%
  pull()
min_and_max
My dataset:
id person points
201 rt99 NA
201 rt99 3
201 rt99 2
202 kt 4
202 kt NA
202 kt NA
203 rr 4
203 rr NA
203 rr NA
204 jk 2
204 jk 2
204 jk NA
322 knm3 5
322 knm3 NA
322 knm3 3
343 kll2 2
343 kll2 1
343 kll2 5
344 kll NA
344 kll 7
344 kll 1
I would suggest this dplyr approach; you can summarise the data like this:
library(tidyverse)
#Code
df %>%
  group_by(id, person) %>%
  summarise(Total = sum(points, na.rm = TRUE),
            min = min(points, na.rm = TRUE),
            max = max(points, na.rm = TRUE))
Output:
# A tibble: 7 x 5
# Groups: id [7]
id person Total min max
<int> <chr> <int> <int> <int>
1 201 rt99 5 2 3
2 202 kt 4 4 4
3 203 rr 4 4 4
4 204 jk 4 2 2
5 322 knm3 8 3 5
6 343 kll2 8 1 5
7 344 kll 8 1 7
Here is the data.table solution:
dataset[, min_points := min(points, na.rm = T), by = person]
dataset[, max_points := max(points, na.rm = T), by = person]
Since I don't have your data, I cannot test this code, but it should work fine.
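If you want a single row per person (a summary rather than new columns on every row), the grouped-summary version in data.table would look roughly like this (a sketch, untested for the same reason):
library(data.table)

setDT(dataset)[, .(min_points = min(points, na.rm = TRUE),
                   max_points = max(points, na.rm = TRUE)),
               by = person]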
The summarize() verb is what you want for this. You don't even need to filter out the NA values first, since both min() and max() accept na.rm = TRUE.
library(dplyr)
min_and_max <- dataset %>%
  group_by(person) %>%
  summarize(min = min(points, na.rm = TRUE),
            max = max(points, na.rm = TRUE))
min_and_max
# A tibble: 7 x 3
person min max
<chr> <dbl> <dbl>
1 jk 2 2
2 kll 1 7
3 kll2 1 5
4 knm3 3 5
5 kt 4 4
6 rr 4 4
7 rt99 2 3
dput(dataset)
structure(list(id = c(201, 201, 201, 202, 202, 202, 203, 203,
203, 204, 204, 204, 322, 322, 322, 343, 343, 343, 344, 344, 344
), person = c("rt99", "rt99", "rt99", "kt", "kt", "kt", "rr",
"rr", "rr", "jk", "jk", "jk", "knm3", "knm3", "knm3", "kll2",
"kll2", "kll2", "kll", "kll", "kll"), points = c(NA, 3, 2, 4,
NA, NA, 4, NA, NA, 2, 2, NA, 5, NA, 3, 2, 1, 5, NA, 7, 1)), class = "data.frame", row.names = c(NA,
-21L), spec = structure(list(cols = list(id = structure(list(), class = c("collector_double",
"collector")), person = structure(list(), class = c("collector_character",
"collector")), points = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
I have a dataframe containing n rows and m columns. Each row is an individual and each column is information on that individual.
df
id age income
1 18 12
2 24 24
3 36 12
4 18 24
. . .
. . .
. . .
I also have an r x c matrix with age buckets as rows and income buckets as columns; each element of the matrix is the % of people in that age-income bucket.
matrix age\income
12 24 36 .....
18 0.15 0.12 0.11 ....
24 0.12 0.6 0.2 ...
36 0.02 0.16 0.16 ...
. ..................
. ..................
For each individual in the dataframe, I need to find the right element of the matrix given the age and income bucket of the individual.
The desired output should look like this
df2
id age income y
1 18 12 0.15
2 24 24 0.6
3 36 12 0.02
4 18 24 0.12
. . .
. . .
. . .
I tried a series of ifs inside a loop (as in the example below):
for (i in 1:nrow(df)) {
  workingset <- df[i, ]
  if (workingset$age == 18) {
    temp <- marix[1, ]
    workingset$y <- ifelse(workingset$income <= 12, temp[1],
                           ifelse(workingset$income <= 24, temp[2], temp[3]))
  } else if (workingset$age == 24) {
    temp <- marix[2, ]
    workingset$y <- ifelse(workingset$income <= 12, temp[1],
                           ifelse(workingset$income <= 24, temp[2], temp[3]))
  } else if (...) {
    ...
  }
  if (i == 1) {
    df2 <- workingset
  } else {
    df2 <- rbind(df2, workingset)
  }
}
This code works, but it takes too long. Is there a way to do this job more efficiently?
Assuming your data looks exactly as shown, you could use dplyr and tidyr.
First convert your matrix (I named it my_mat) into a long data frame:
library(dplyr)
library(tidyr)

my_mat %>%
  as.data.frame() %>%
  mutate(age = rownames(.)) %>%
  pivot_longer(cols = -age, names_to = "income", values_to = "y") %>%
  mutate(across(where(is.character), as.numeric))
returns
# A tibble: 9 x 3
age income y
<dbl> <dbl> <dbl>
1 18 12 0.15
2 18 24 0.12
3 18 36 0.11
4 24 12 0.12
5 24 24 0.6
6 24 36 0.2
7 36 12 0.02
8 36 24 0.16
9 36 36 0.16
This can be left-joined with your data frame df, all in one go:
my_mat %>%
  as.data.frame() %>%
  mutate(age = rownames(.)) %>%
  pivot_longer(cols = -age, names_to = "income", values_to = "y") %>%
  mutate(across(where(is.character), as.numeric)) %>%
  left_join(df, ., by = c("age", "income"))
gives you
# A tibble: 4 x 4
id age income y
<dbl> <dbl> <dbl> <dbl>
1 1 18 12 0.15
2 2 24 24 0.6
3 3 36 12 0.02
4 4 18 24 0.12
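For what it's worth, base R can do this particular lookup without reshaping at all: a matrix accepts a two-column character matrix of (row name, column name) pairs as an index, one pair per individual (a small sketch using the my_mat and df objects from the Data section below):
# look up one matrix cell per row of df, by dimnames
df$y <- my_mat[cbind(as.character(df$age), as.character(df$income))]
df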
Data
my_mat <- structure(c(0.15, 0.12, 0.02, 0.12, 0.6, 0.16, 0.11, 0.2, 0.16
), .Dim = c(3L, 3L), .Dimnames = list(c("18", "24", "36"), c("12",
"24", "36")))
df <- structure(list(id = c(1, 2, 3, 4), age = c(18, 24, 36, 18), income = c(12,
24, 12, 24)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L), spec = structure(list(cols = list(
id = structure(list(), class = c("collector_double", "collector"
)), age = structure(list(), class = c("collector_double",
"collector")), income = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))