How to calculate average exposure for multiple columns - r

I have the dataset below where each year has a number that represents an exposure:
## ID DOB sector meters Oct Res_FROM Res_TO Exp_FROM
## 1 20100 1979-08-24 H38 6400 W 1979-08-15 1991-05-15 1979-08-24
## 2 20101 1980-05-05 B01 1600 NW 1980-05-15 1991-04-15 1980-05-15
## 3 20102 1979-03-17 H04 1600 SW 1972-06-15 1979-08-15 1979-03-17
## 4 20103 1981-11-30 B09 3200 NE 1982-01-15 1984-01-15 1982-01-15
## 5 20103 1981-11-30 B37 8000 N 1984-01-15 1986-04-15 1984-01-15
## 6 20104 1978-09-01 B09 3200 NE 1982-01-15 1984-01-15 1982-01-15
## Exp_TO Exps_Grp Yr1952 Yr1953 Yr1954 Yr1955 Yr1956 Yr1957 Yr1958 Yr1959
## 1 1988-12-31 fr51>88 NA NA NA NA NA NA NA NA
## 2 1988-12-31 fr51>88 NA NA NA NA NA NA NA NA
## 3 1979-08-15 between NA NA NA NA NA NA NA NA
## 4 1984-01-15 between NA NA NA NA NA NA NA NA
## 5 1986-04-15 between NA NA NA NA NA NA NA NA
## 6 1984-01-15 between NA NA NA NA NA NA NA NA
## Yr1960 Yr1961 Yr1962 Yr1963 Yr1964 Yr1965 Yr1966 Yr1967 Yr1968 Yr1969 Yr1970
## 1 NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA NA NA NA
## Yr1971 Yr1972 Yr1973 Yr1974 Yr1975 Yr1976 Yr1977 Yr1978 Yr1979 Yr1980
## 1 NA NA NA NA NA NA NA NA 5.950991 4.340588
## 2 NA NA NA NA NA NA NA NA NA 2.927725
## 3 NA NA NA NA NA NA NA NA 20.608986 NA
## 4 NA NA NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA NA NA
## Yr1981 Yr1982 Yr1983 Yr1984 Yr1985 Yr1986 Yr1987 Yr1988
## 1 4.340588 4.340588 4.340588 4.3405881 4.340588 4.3405881 4.340588 1.083782
## 2 4.447229 4.447229 4.447229 4.4472289 4.447229 4.4472289 4.447229 1.110409
## 3 NA NA NA NA NA NA NA NA
## 4 NA 15.365412 16.018407 0.6529943 NA NA NA NA
## 5 NA NA NA 2.9414202 3.052618 0.6918076 NA NA
## 6 NA 15.365412 16.018407 0.6529943 NA NA NA NA
## Yrs_Exp arth_mean median cumulative caldate Age Month_Res
## 1 9.3616438 4.175948 4.340588 41.759478 12/31/88 9 141
## 2 8.6356164 3.907637 4.447229 35.168736 12/31/88 9 131
## 3 0.4136986 20.608986 20.608986 20.608986 12/31/88 9 86
## 4 2.0000000 10.678938 15.365412 32.036813 12/31/88 9 24
## 5 2.2493151 2.228615 2.941420 6.685846 12/31/88 8 27
## 6 2.0000000 10.678938 15.365412 32.036813 12/31/88 9 24
I want to calculate the average exposure for each year and then find out which years had an average exposure exceeding a value of 4. How would I go about accomplishing this? My desired output would be a list of each year with its average exposure, and then a second list of the years whose averages exceed 4. Reproducible data below.
dat <- structure(list(ID = c(20100L, 20101L, 20102L, 20103L, 20103L,
20104L, 20104L, 20105L, 20105L, 20106L, 20106L), DOB = c("1979-08-24",
"1980-05-05", "1979-03-17", "1981-11-30", "1981-11-30", "1978-09-01",
"1978-09-01", "1980-12-03", "1980-12-03", "1978-04-25", "1978-04-25"
), sector = c("H38", "B01", "H04", "B09", "B37", "B09", "B37",
"B09", "B09", "B09", "B09"), meters = c(6400L, 1600L, 1600L,
3200L, 8000L, 3200L, 8000L, 3200L, 3200L, 3200L, 3200L), Oct = c("W",
"NW", "SW", "NE", "N", "NE", "N", "NE", "NE", "NE", "NE"), Res_FROM = c("1979-08-15",
"1980-05-15", "1972-06-15", "1982-01-15", "1984-01-15", "1982-01-15",
"1984-01-15", "1980-12-15", "1983-08-15", "1978-04-15", "1983-08-15"
), Res_TO = c("1991-05-15", "1991-04-15", "1979-08-15", "1984-01-15",
"1986-04-15", "1984-01-15", "1986-04-15", "1983-08-15", "1991-03-15",
"1983-08-15", "2000-01-15"), Exp_FROM = c("1979-08-24", "1980-05-15",
"1979-03-17", "1982-01-15", "1984-01-15", "1982-01-15", "1984-01-15",
"1980-12-15", "1983-08-15", "1978-04-25", "1983-08-15"), Exp_TO = c("1988-12-31",
"1988-12-31", "1979-08-15", "1984-01-15", "1986-04-15", "1984-01-15",
"1986-04-15", "1983-08-15", "1988-12-31", "1983-08-15", "1988-12-31"
), Exps_Grp = c("fr51>88", "fr51>88", "between", "between", "between",
"between", "between", "between", "fr51>88", "between", "fr51>88"
), Yr1952 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1953 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1954 = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), Yr1955 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), Yr1956 = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA), Yr1957 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), Yr1958 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), Yr1959 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1960 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1961 = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), Yr1962 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), Yr1963 = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA), Yr1964 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), Yr1965 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), Yr1966 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1967 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1968 = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), Yr1969 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), Yr1970 = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA), Yr1971 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), Yr1972 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), Yr1973 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1974 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1975 = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), Yr1976 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), Yr1977 = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA), Yr1978 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
79.39642441, NA), Yr1979 = c(5.950991161, NA, 20.60898553, NA,
NA, NA, NA, NA, NA, 59.94484924, NA), Yr1980 = c(4.340588078,
2.927724588, NA, NA, NA, NA, NA, 0.758267013, NA, 16.01840668,
NA), Yr1981 = c(4.340588078, 4.447228937, NA, NA, NA, NA, NA,
16.01840668, NA, 16.01840668, NA), Yr1982 = c(4.340588078, 4.447228937,
NA, 15.36541238, NA, 15.36541238, NA, 16.01840668, NA, 16.01840668,
NA), Yr1983 = c(4.340588078, 4.447228937, NA, 16.01840668, NA,
16.01840668, NA, 9.952203009, 6.066203667, 9.952203009, 6.066203667
), Yr1984 = c(4.340588078, 4.447228937, NA, 0.652994292, 2.941420153,
0.652994292, 2.941420153, NA, 16.01840668, NA, 16.01840668),
Yr1985 = c(4.340588078, 4.447228937, NA, NA, 3.052618478,
NA, 3.052618478, NA, 16.01840668, NA, 16.01840668), Yr1986 = c(4.340588078,
4.447228937, NA, NA, 0.691807598, NA, 0.691807598, NA, 16.01840668,
NA, 16.01840668), Yr1987 = c(4.340588078, 4.447228937, NA,
NA, NA, NA, NA, NA, 16.01840668, NA, 16.01840668), Yr1988 = c(1.083782142,
1.110408824, NA, NA, NA, NA, NA, NA, 3.999564755, NA, 3.999564755
), Yrs_Exp = c(9.361643836, 8.635616438, 0.41369863, 2, 2.249315068,
2, 2.249315068, 2.665753425, 5.383561644, 5.309589041, 5.383561644
), arth_mean = c(4.175947792, 3.907637331, 20.60898553, 10.67893778,
2.22861541, 10.67893778, 2.22861541, 10.68682084, 12.35656585,
32.89144945, 12.35656585), median = c(4.340588078, 4.447228937,
20.60898553, 15.36541238, 2.941420153, 15.36541238, 2.941420153,
12.98530484, 16.01840668, 16.01840668, 16.01840668), cumulative = c(41.75947792,
35.16873597, 20.60898553, 32.03681335, 6.685846229, 32.03681335,
6.685846229, 42.74728337, 74.13939513, 197.3486967, 74.13939513
), caldate = c("12/31/88", "12/31/88", "12/31/88", "12/31/88",
"12/31/88", "12/31/88", "12/31/88", "12/31/88", "12/31/88",
"12/31/88", "12/31/88"), Age = c(9L, 9L, 9L, 9L, 8L, 9L,
7L, 7L, 10L, 10L, 8L), Month_Res = c(141L, 131L, 86L, 24L,
27L, 24L, 27L, 32L, 91L, 64L, 197L)), class = "data.frame", row.names = c(NA,
-11L))

You'll have to think about what you want to do with NAs; these solutions just drop them.
Base R solution:
# subset to year columns
dat_years <- dat[, grep("^Yr1", names(dat))]
# compute averages
avg_by_year <- sapply(dat_years, \(col) mean(col, na.rm = TRUE))
# find years w avg > 4, and remove "Yr" prefix
years_gt_4 <- names(avg_by_year)[!is.na(avg_by_year) & avg_by_year > 4] |>
  sub(pattern = "Yr", replacement = "")
years_gt_4
# "1978" "1979" "1980" "1981" "1982" "1983" "1984" "1985" "1986" "1987"
tidyverse solution:
library(tidyverse)
avg_by_year <- dat %>%
  pivot_longer(
    cols = Yr1952:Yr1988,
    names_to = "Year",
    values_to = "Exposure",
    names_prefix = "Yr"
  ) %>%
  group_by(Year) %>%
  summarize(Exposure = mean(Exposure, na.rm = TRUE))
years_gt_4 <- avg_by_year %>%
  filter(Exposure > 4) %>%
  pull(Year)
years_gt_4
# "1978" "1979" "1980" "1981" "1982" "1983" "1984" "1985" "1986" "1987"
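Either way, keep in mind the note above about NAs: whether you drop them or treat a missing year as zero exposure changes which years clear the cutoff. A small sketch on a made-up toy frame (the column names mimic the Yr#### layout; the values are invented):

```r
# Toy frame mimicking the Yr#### layout (values are made up)
toy <- data.frame(Yr1980 = c(6, NA), Yr1981 = c(2, 2), Yr1982 = c(NA, NA))

# Convention 1: NA means "no measurement" -- average only the observed values
mean_drop <- sapply(toy, mean, na.rm = TRUE)   # Yr1980 = 6, Yr1981 = 2, Yr1982 = NaN

# Convention 2: NA means "zero exposure" -- replace NAs before averaging
toy0 <- toy
toy0[is.na(toy0)] <- 0
mean_zero <- colMeans(toy0)                    # Yr1980 = 3, Yr1981 = 2, Yr1982 = 0

# The >4 cutoff gives different answers under the two conventions
names(mean_drop)[!is.nan(mean_drop) & mean_drop > 4]   # "Yr1980"
names(mean_zero)[mean_zero > 4]                        # character(0)
```

With 14,000 real rows the two conventions can diverge much more than this, so it is worth deciding explicitly which one your exposure data calls for.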

Related

reshaping multiple columns in R, based on name values

Df <- data.frame(prop1 = c(NA, NA, NA, "French", NA, NA,NA, "-29 to -20", NA, NA, NA, "Pop", NA, NA, NA, "French", "-29 to -20", "Pop"),
prop1_rank = c(NA, NA, NA, 0, NA, NA,NA, 11, NA, NA, NA, 1, NA, NA, NA, 40, 0, 2),
prop2 = c(NA, NA, NA, "Spanish", NA, NA,NA, "-19 to -10", NA, NA, NA, "Rock", NA, NA, NA, "Spanish", "-19 to -10", "Rock"),
prop2_rank = c(NA, NA, NA, 10, NA, NA,NA, 4, NA, NA, NA, 1, NA, NA, NA, 1, 0, 2),
initOSF1 = c(NA, NA, NA, NA, NA, "French", NA,NA,NA, "-29 to -20", NA, NA, NA, "Pop", NA, NA, NA, NA),
initOSF1_freq = c(NA, NA, NA, NA, NA, 66, NA,NA,NA, 0, NA, NA, NA, 14, NA, NA, NA, NA),
initOSF2 = c(NA, NA, NA, NA, NA, "Spanish", NA,NA,NA, "-19 to -10", NA, NA, NA, "Rock", NA, NA, NA, NA),
initOSF2_freq = c(NA, NA, NA, NA, NA, 0, NA,NA,NA, 6, NA, NA, NA, 14, NA, NA, NA, NA))
Df
I would like to organize this into
3 columns consisting: c("propositions", "ranks", "freqs"),
where,
The Propositions column should hold the values "French", "Spanish", "-29 to -20", "-19 to -10", "Pop", and "Rock", with a separate column for the rank values (e.g., 0 for French, 10 for Spanish) and another for the frequency values (e.g., 66 for French, 0 for Spanish).
This is not an easy one. Probably a better solution exists:
library(tidyverse)
library(data.table)
setDT(Df) %>%
  select(contains(c('prop', 'rank', 'freq'))) %>%
  filter(!if_all(everything(), is.na)) %>%
  melt(measure.vars = patterns(c('prop.$', 'rank$', 'freq'))) %>%
  group_by(gr = cumsum(!is.na(value1))) %>%
  summarise(across(-variable, ~ if (length(.x) > 1) na.omit(.x) else .x))
# A tibble: 12 x 4
gr value1 value2 value3
<int> <chr> <dbl> <dbl>
1 1 French 0 66
2 2 -29 to -20 11 0
3 3 Pop 1 14
4 4 French 40 NA
5 5 -29 to -20 0 NA
6 6 Pop 2 NA
7 7 Spanish 10 0
8 8 -19 to -10 4 6
9 9 Rock 1 14
10 10 Spanish 1 NA
11 11 -19 to -10 0 NA
12 12 Rock 2 NA
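The key step above is patterns(): each regex selects a set of columns, and the matched sets are melted in parallel into value1/value2/value3. A minimal illustration on a made-up table (the column names here are invented, two patterns instead of three):

```r
library(data.table)

# Two patterns -> two parallel value columns; variable 1 pairs the first
# match of each pattern (a_x with a_y), variable 2 the second (b_x with b_y)
small <- data.table(a_x = 1:2, b_x = 3:4, a_y = 5:6, b_y = 7:8)
melt(small, measure.vars = patterns(c("_x$", "_y$")))
#    variable value1 value2
#           1      1      5
#           1      2      6
#           2      3      7
#           2      4      8
```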

Collapsing Dataframe Rows along several variables

I have a dataframe that looks something like this, in which I have several rows for each user, and many NAs in the columns.
user   Effect T1  Effect T2  Effect T3  Benchmark T1  Benchmark T2  Benchmark T3
Tom    01         NA         NA         02            NA            NA
Tom    NA         07         NA         NA            08            NA
Tom    NA         NA         13         NA            NA            14
Larry  03         NA         NA         04            NA            NA
Larry  NA         09         NA         NA            10            NA
Larry  NA         NA         15         NA            NA            16
Dave   05         NA         NA         06            NA            NA
Dave   NA         11         NA         NA            12            NA
Dave   NA         NA         17         NA            NA            18
I want to collapse the rows by user, filling in the values from each row, like this.
user   Effect T1  Effect T2  Effect T3  Benchmark T1  Benchmark T2  Benchmark T3
Tom    01         07         13         02            08            14
Larry  03         09         15         04            10            16
Dave   05         11         17         06            12            18
How might I accomplish this?
Thank you in advance for your help. Update: I've added the dput of a subset of the actual data below.
structure(list(name = c("Abraham_Ralph", "Abraham_Ralph", "Abraham_Ralph",
"Ackerman_Gary", "Adams_Alma", "Adams_Alma", "Adams_Alma", "Adams_Alma",
"Adams_Sandy", "Aderholt_Robert", "Aderholt_Robert", "Aderholt_Robert",
"Aderholt_Robert", "Aderholt_Robert", "Aguilar_Pete", "Aguilar_Pete",
"Aguilar_Pete"), state = c("LA", "LA", "LA", "NY", "NC", "NC",
"NC", "NC", "FL", "AL", "AL", "AL", "AL", "AL", "CA", "CA", "CA"
), seniority = c(1, 2, 3, 15, 1, 2, 3, 4, 1, 8, 9, 10, 11, 12,
1, 2, 3), legeffect_112 = c(NA, NA, NA, 0.202061712741852, NA,
NA, NA, NA, 1.30758035182953, 3.73544979095459, NA, NA, NA, NA,
NA, NA, NA), legeffect_113 = c(NA, NA, NA, NA, 0, NA, NA, NA,
NA, NA, 0.908495426177979, NA, NA, NA, NA, NA, NA), legeffect_114 = c(2.07501077651978,
NA, NA, NA, NA, 0.84164834022522, NA, NA, NA, NA, NA, 0.340001106262207,
NA, NA, 0.10985741019249, NA, NA), legeffect_115 = c(NA, 0.493490308523178,
NA, NA, NA, NA, 0.587624311447144, NA, NA, NA, NA, NA, 0.159877583384514,
NA, NA, 0.730929613113403, NA), legeffect_116 = c(NA, NA, 0.0397605448961258,
NA, NA, NA, NA, 1.78378939628601, NA, NA, NA, NA, NA, 0.0198802724480629,
NA, NA, 0.0497006773948669), benchmark_112 = c(NA, NA, NA, 0.738679468631744,
NA, NA, NA, NA, 0.82908970117569, 1.39835929870605, NA, NA, NA,
NA, NA, NA, NA), benchmark_113 = c(NA, NA, NA, NA, 0.391001850366592,
NA, NA, NA, NA, NA, 1.58223271369934, NA, NA, NA, NA, NA, NA),
benchmark_114 = c(1.40446054935455, NA, NA, NA, NA, 0.576326191425323,
NA, NA, NA, NA, NA, 1.42212760448456, NA, NA, 0.574363172054291,
NA, NA), benchmark_115 = c(NA, 1.3291300535202, NA, NA, NA,
NA, 0.537361204624176, NA, NA, NA, NA, NA, 1.45703768730164,
NA, NA, 0.523149251937866, NA), benchmark_116 = c(NA, NA,
0.483340591192245, NA, NA, NA, NA, 1.31058621406555, NA,
NA, NA, NA, NA, 0.751261711120605, NA, NA, 1.05683290958405
)), row.names = c(NA, -17L), class = c("tbl_df", "tbl", "data.frame"
))
A data.table solution:
# melt to long form, drop NAs and the non-measure columns, then recast to wide
dt <- dcast(melt(data.table(d), id.vars = "name")[
  !is.na(value) & !variable %in% c("state", "seniority")], name ~ variable)
dt
name legeffect_112 legeffect_113 legeffect_114 legeffect_115 legeffect_116 benchmark_112 benchmark_113 benchmark_114 benchmark_115 benchmark_116
1: Abraham_Ralph <NA> <NA> 2.07501077651978 0.493490308523178 0.0397605448961258 <NA> <NA> 1.40446054935455 1.3291300535202 0.483340591192245
2: Ackerman_Gary 0.202061712741852 <NA> <NA> <NA> <NA> 0.738679468631744 <NA> <NA> <NA> <NA>
3: Adams_Alma <NA> 0 0.84164834022522 0.587624311447144 1.78378939628601 <NA> 0.391001850366592 0.576326191425323 0.537361204624176 1.31058621406555
4: Adams_Sandy 1.30758035182953 <NA> <NA> <NA> <NA> 0.82908970117569 <NA> <NA> <NA> <NA>
5: Aderholt_Robert 3.73544979095459 0.908495426177979 0.340001106262207 0.159877583384514 0.0198802724480629 1.39835929870605 1.58223271369934 1.42212760448456 1.45703768730164 0.751261711120605
6: Aguilar_Pete <NA> <NA> 0.10985741019249 0.730929613113403 0.0497006773948669 <NA> <NA> 0.574363172054291 0.523149251937866 1.05683290958405
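Note that melting every column at once coerces the numeric measures to character (because state is character), which is why the wide result above prints <NA> strings rather than numeric NAs. A sketch on a toy table (hypothetical column names) that excludes the character columns before melting, so the values stay numeric:

```r
library(data.table)

# Toy version of the same shape: one observed value per (name, column)
toy <- data.table(name  = c("A", "A", "B"),
                  state = c("LA", "LA", "NY"),   # character -> excluded
                  eff_1 = c(1, NA, 3),
                  eff_2 = c(NA, 2, NA))

# Drop the character column, melt, filter NAs, recast; values remain numeric
molten <- melt(toy[, !"state"], id.vars = "name")[!is.na(value)]
dcast(molten, name ~ variable)
#    name eff_1 eff_2
#       A     1     2
#       B     3    NA
```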
Data/Setup
# Load data.table
# install.packages("data.table")
library(data.table)
# Read example data
d <- structure(list(name = c("Abraham_Ralph", "Abraham_Ralph", "Abraham_Ralph",
"Ackerman_Gary", "Adams_Alma", "Adams_Alma", "Adams_Alma", "Adams_Alma",
"Adams_Sandy", "Aderholt_Robert", "Aderholt_Robert", "Aderholt_Robert",
"Aderholt_Robert", "Aderholt_Robert", "Aguilar_Pete", "Aguilar_Pete",
"Aguilar_Pete"), state = c("LA", "LA", "LA", "NY", "NC", "NC",
"NC", "NC", "FL", "AL", "AL", "AL", "AL", "AL", "CA", "CA", "CA"
), seniority = c(1, 2, 3, 15, 1, 2, 3, 4, 1, 8, 9, 10, 11, 12,
1, 2, 3), legeffect_112 = c(NA, NA, NA, 0.202061712741852, NA,
NA, NA, NA, 1.30758035182953, 3.73544979095459, NA, NA, NA, NA,
NA, NA, NA), legeffect_113 = c(NA, NA, NA, NA, 0, NA, NA, NA,
NA, NA, 0.908495426177979, NA, NA, NA, NA, NA, NA), legeffect_114 = c(2.07501077651978,
NA, NA, NA, NA, 0.84164834022522, NA, NA, NA, NA, NA, 0.340001106262207,
NA, NA, 0.10985741019249, NA, NA), legeffect_115 = c(NA, 0.493490308523178,
NA, NA, NA, NA, 0.587624311447144, NA, NA, NA, NA, NA, 0.159877583384514,
NA, NA, 0.730929613113403, NA), legeffect_116 = c(NA, NA, 0.0397605448961258,
NA, NA, NA, NA, 1.78378939628601, NA, NA, NA, NA, NA, 0.0198802724480629,
NA, NA, 0.0497006773948669), benchmark_112 = c(NA, NA, NA, 0.738679468631744,
NA, NA, NA, NA, 0.82908970117569, 1.39835929870605, NA, NA, NA,
NA, NA, NA, NA), benchmark_113 = c(NA, NA, NA, NA, 0.391001850366592,
NA, NA, NA, NA, NA, 1.58223271369934, NA, NA, NA, NA, NA, NA),
benchmark_114 = c(1.40446054935455, NA, NA, NA, NA, 0.576326191425323,
NA, NA, NA, NA, NA, 1.42212760448456, NA, NA, 0.574363172054291,
NA, NA), benchmark_115 = c(NA, 1.3291300535202, NA, NA, NA,
NA, 0.537361204624176, NA, NA, NA, NA, NA, 1.45703768730164,
NA, NA, 0.523149251937866, NA), benchmark_116 = c(NA, NA,
0.483340591192245, NA, NA, NA, NA, 1.31058621406555, NA,
NA, NA, NA, NA, 0.751261711120605, NA, NA, 1.05683290958405
)), row.names = c(NA, -17L), class = c("tbl_df", "tbl", "data.frame"
))
This solution uses only base functions (no extra packages). The one-liner version may cause your eyes to cross, so I'll split it into several functions.
The plan is the following:
Split the original data.frame by the values in name column, using the function by;
For each partition of the data.frame, collapse the columns;
A collapsed column returns the max value of the column, or NA if all its values are NA;
The collapsed data.frame partitions are stacked together.
So, this is a function that does that:
dfr_collapse <- function(dfr, col0)
{
  # Collapse the columns of the data.frame "dfr", grouped by the values
  # of the column "col0"

  # Max/NA function: max of the non-NA values, or NA if all are NA
  namax <- function(x)
  {
    if (all(is.na(x)))
      NA # !!!
    else
      max(x, na.rm = TRUE)
  }

  # Column collapse function
  byfun <- function(x)
  {
    lapply(x, namax)
  }

  # Stack the partitioning results
  return(do.call(
    what = rbind,
    args = by(dfr, dfr[[col0]], byfun)
  ))
}
It may not look as slick as a one-liner, but it does the job. It can be turned into a one-liner, but you don't want that.
Assuming that df0 is the data.frame from your dput, you can test this function with
dfr_collapse(df0, "name")
Nota bene: for the sake of simplicity, I return an NA of type logical (see the comment # !!! above). The correct code should convert that NA to the mode of the x vector. Also, the function should check the type of its inputs, etc.

How to calculate number of IDs in another variable and how to normalize the data

I have this data below:
## ID DOB sector meters Oct Res_FROM Res_TO Exp_FROM
## 1 20100 1979-08-24 H38 6400 W 1979-08-15 1991-05-15 1979-08-24
## 2 20101 1980-05-05 B01 1600 NW 1980-05-15 1991-04-15 1980-05-15
## 3 20102 1979-03-17 H04 1600 SW 1972-06-15 1979-08-15 1979-03-17
## 4 20103 1981-11-30 B09 3200 NE 1982-01-15 1984-01-15 1982-01-15
## 5 20103 1981-11-30 B37 8000 N 1984-01-15 1986-04-15 1984-01-15
## 6 20104 1978-09-01 B09 3200 NE 1982-01-15 1984-01-15 1982-01-15
## Exp_TO Exps_Grp Yr1952 Yr1953 Yr1954 Yr1955 Yr1956 Yr1957 Yr1958 Yr1959
## 1 1988-12-31 fr51>88 NA NA NA NA NA NA NA NA
## 2 1988-12-31 fr51>88 NA NA NA NA NA NA NA NA
## 3 1979-08-15 between NA NA NA NA NA NA NA NA
## 4 1984-01-15 between NA NA NA NA NA NA NA NA
## 5 1986-04-15 between NA NA NA NA NA NA NA NA
## 6 1984-01-15 between NA NA NA NA NA NA NA NA
## Yr1960 Yr1961 Yr1962 Yr1963 Yr1964 Yr1965 Yr1966 Yr1967 Yr1968 Yr1969 Yr1970
## 1 NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA NA NA NA
## Yr1971 Yr1972 Yr1973 Yr1974 Yr1975 Yr1976 Yr1977 Yr1978 Yr1979 Yr1980
## 1 NA NA NA NA NA NA NA NA 5.950991 4.340588
## 2 NA NA NA NA NA NA NA NA NA 2.927725
## 3 NA NA NA NA NA NA NA NA 20.608986 NA
## 4 NA NA NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA NA NA
## Yr1981 Yr1982 Yr1983 Yr1984 Yr1985 Yr1986 Yr1987 Yr1988
## 1 4.340588 4.340588 4.340588 4.3405881 4.340588 4.3405881 4.340588 1.083782
## 2 4.447229 4.447229 4.447229 4.4472289 4.447229 4.4472289 4.447229 1.110409
## 3 NA NA NA NA NA NA NA NA
## 4 NA 15.365412 16.018407 0.6529943 NA NA NA NA
## 5 NA NA NA 2.9414202 3.052618 0.6918076 NA NA
## 6 NA 15.365412 16.018407 0.6529943 NA NA NA NA
## Yrs_Exp arth_mean median cumulative caldate Age Month_Res
## 1 9.3616438 4.175948 4.340588 41.759478 12/31/88 9 141
## 2 8.6356164 3.907637 4.447229 35.168736 12/31/88 9 131
## 3 0.4136986 20.608986 20.608986 20.608986 12/31/88 9 86
## 4 2.0000000 10.678938 15.365412 32.036813 12/31/88 9 24
## 5 2.2493151 2.228615 2.941420 6.685846 12/31/88 8 27
## 6 2.0000000 10.678938 15.365412 32.036813 12/31/88 9 24
I have talked with a couple of other folks and they have recommended that I normalize the cumulative exposures for each person (ID) based on the population of each sector and the average residence time per sector. I have a couple of questions. First, how would I write R code to determine how many people (IDs) are in each sector, and then how would I calculate the average residence time per sector (the Month_Res column gives how many months a resident lived in that sector)? I tried the R code below to count the total number of IDs in each sector, but it gave the error 'sum' not meaningful for factors.
Fernald_Normalized$ID <- as.factor(Fernald_Normalized$ID)
Fernald_1 <- aggregate(Fernald_Normalized$ID, list(Fernald_Normalized$sector), FUN=sum)
If I keep ID as numeric, it sums the ID values themselves within each sector and produces a large number. Additionally, once I have calculated the number of IDs per sector and the average residence time per sector, how would I use R to actually normalize the data? I have a basic understanding of why we normalize and generally what is done, but I haven't been able to write the code for it in R. Reproducible dataset below. This is only a small snippet; in reality there are around 14,000 rows.
dat <- structure(list(UC_ID = c(20100L, 20101L, 20102L, 20103L, 20103L,
20104L, 20104L, 20105L, 20105L, 20106L, 20106L), DOB = c("1979-08-24",
"1980-05-05", "1979-03-17", "1981-11-30", "1981-11-30", "1978-09-01",
"1978-09-01", "1980-12-03", "1980-12-03", "1978-04-25", "1978-04-25"
), sector = c("H38", "B01", "H04", "B09", "B37", "B09", "B37",
"B09", "B09", "B09", "B09"), meters = c(6400L, 1600L, 1600L,
3200L, 8000L, 3200L, 8000L, 3200L, 3200L, 3200L, 3200L), Oct = c("W",
"NW", "SW", "NE", "N", "NE", "N", "NE", "NE", "NE", "NE"), Res_FROM = c("1979-08-15",
"1980-05-15", "1972-06-15", "1982-01-15", "1984-01-15", "1982-01-15",
"1984-01-15", "1980-12-15", "1983-08-15", "1978-04-15", "1983-08-15"
), Res_TO = c("1991-05-15", "1991-04-15", "1979-08-15", "1984-01-15",
"1986-04-15", "1984-01-15", "1986-04-15", "1983-08-15", "1991-03-15",
"1983-08-15", "2000-01-15"), Exp_FROM = c("1979-08-24", "1980-05-15",
"1979-03-17", "1982-01-15", "1984-01-15", "1982-01-15", "1984-01-15",
"1980-12-15", "1983-08-15", "1978-04-25", "1983-08-15"), Exp_TO = c("1988-12-31",
"1988-12-31", "1979-08-15", "1984-01-15", "1986-04-15", "1984-01-15",
"1986-04-15", "1983-08-15", "1988-12-31", "1983-08-15", "1988-12-31"
), Exps_Grp = c("fr51>88", "fr51>88", "between", "between", "between",
"between", "between", "between", "fr51>88", "between", "fr51>88"
), Yr1952 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1953 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1954 = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), Yr1955 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), Yr1956 = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA), Yr1957 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), Yr1958 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), Yr1959 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1960 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1961 = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), Yr1962 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), Yr1963 = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA), Yr1964 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), Yr1965 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), Yr1966 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1967 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1968 = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), Yr1969 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), Yr1970 = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA), Yr1971 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), Yr1972 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), Yr1973 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1974 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Yr1975 = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), Yr1976 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), Yr1977 = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA), Yr1978 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
79.39642441, NA), Yr1979 = c(5.950991161, NA, 20.60898553, NA,
NA, NA, NA, NA, NA, 59.94484924, NA), Yr1980 = c(4.340588078,
2.927724588, NA, NA, NA, NA, NA, 0.758267013, NA, 16.01840668,
NA), Yr1981 = c(4.340588078, 4.447228937, NA, NA, NA, NA, NA,
16.01840668, NA, 16.01840668, NA), Yr1982 = c(4.340588078, 4.447228937,
NA, 15.36541238, NA, 15.36541238, NA, 16.01840668, NA, 16.01840668,
NA), Yr1983 = c(4.340588078, 4.447228937, NA, 16.01840668, NA,
16.01840668, NA, 9.952203009, 6.066203667, 9.952203009, 6.066203667
), Yr1984 = c(4.340588078, 4.447228937, NA, 0.652994292, 2.941420153,
0.652994292, 2.941420153, NA, 16.01840668, NA, 16.01840668),
Yr1985 = c(4.340588078, 4.447228937, NA, NA, 3.052618478,
NA, 3.052618478, NA, 16.01840668, NA, 16.01840668), Yr1986 = c(4.340588078,
4.447228937, NA, NA, 0.691807598, NA, 0.691807598, NA, 16.01840668,
NA, 16.01840668), Yr1987 = c(4.340588078, 4.447228937, NA,
NA, NA, NA, NA, NA, 16.01840668, NA, 16.01840668), Yr1988 = c(1.083782142,
1.110408824, NA, NA, NA, NA, NA, NA, 3.999564755, NA, 3.999564755
), Yrs_Exp = c(9.361643836, 8.635616438, 0.41369863, 2, 2.249315068,
2, 2.249315068, 2.665753425, 5.383561644, 5.309589041, 5.383561644
), arth_mean = c(4.175947792, 3.907637331, 20.60898553, 10.67893778,
2.22861541, 10.67893778, 2.22861541, 10.68682084, 12.35656585,
32.89144945, 12.35656585), median = c(4.340588078, 4.447228937,
20.60898553, 15.36541238, 2.941420153, 15.36541238, 2.941420153,
12.98530484, 16.01840668, 16.01840668, 16.01840668), cumulative = c(41.75947792,
35.16873597, 20.60898553, 32.03681335, 6.685846229, 32.03681335,
6.685846229, 42.74728337, 74.13939513, 197.3486967, 74.13939513
), caldate = c("12/31/88", "12/31/88", "12/31/88", "12/31/88",
"12/31/88", "12/31/88", "12/31/88", "12/31/88", "12/31/88",
"12/31/88", "12/31/88"), Age = c(9L, 9L, 9L, 9L, 8L, 9L,
7L, 7L, 10L, 10L, 8L), Month_Res = c(141L, 131L, 86L, 24L,
27L, 24L, 27L, 32L, 91L, 64L, 197L)), class = "data.frame", row.names = c(NA,
-11L))
Use ave to apply a function by group with base R. (Note that in your dput the ID column is named UC_ID.)
# Number of IDs per sector
with(dat, ave(UC_ID, sector, FUN = length))
# [1] 1 1 1 6 2 6 2 6 6 6 6
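For the second part of the question (average residence time per sector), aggregate with a formula follows the same pattern. A sketch on a toy frame with the relevant columns (swap in your real dat); note that counting rows counts residence spells, not distinct people, since the same UC_ID can appear in more than one row, so use length(unique(...)) for distinct IDs:

```r
# Toy frame with the relevant columns (values are made up)
toy <- data.frame(UC_ID     = c(20103, 20104, 20104, 20100),
                  sector    = c("B09", "B09", "B37", "H38"),
                  Month_Res = c(24, 24, 27, 141))

# rows (residence spells) per sector
aggregate(UC_ID ~ sector, toy, FUN = length)

# distinct people per sector
aggregate(UC_ID ~ sector, toy, FUN = function(x) length(unique(x)))

# average residence months per sector
aggregate(Month_Res ~ sector, toy, FUN = mean)
```

The resulting per-sector counts and means are then ordinary columns you can merge back onto the main data for whatever normalization formula you settle on.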

Missing cases while using summarise(across())

I have a data.frame that looks like this:
I want to quickly reshape it so that I have only one record for each ID, something that looks like this:
df can be built using the code below:
df<-structure(list(ID = structure(c("05-102", "05-102", "05-102",
"01-103", "01-103", "01-103", "08-104", "08-104", "08-104", "05-105",
"05-105", "05-105", "02-106", "02-106", "02-106", "05-107", "05-107",
"05-107", "08-108", "08-108", "08-108", "02-109", "02-109", "02-109",
"05-111", "05-111", "05-111", "07-115", "07-115", "07-115"), label = "Unique Subject Identifier", format.sas = "$"),
EXSTDTC1 = structure(c(NA, NA, NA, 17022, NA, NA, 17024,
NA, NA, 17032, NA, NA, 17038, NA, NA, 17092, NA, NA, 17108,
NA, NA, 17155, NA, NA, 17247, NA, NA, 17333, NA, NA), class = "Date"),
EXSTDTC6 = structure(c(NA, 16885, NA, NA, NA, 17031, NA,
NA, 17032, NA, NA, 17041, NA, NA, 17047, NA, NA, 17100, NA,
NA, 17116, NA, 17164, NA, NA, NA, 17256, NA, 17342, NA), class = "Date"),
EXSTDTC3 = structure(c(NA, NA, 16881, NA, 17027, NA, NA,
17029, NA, NA, 17037, NA, NA, 17043, NA, NA, 17097, NA, NA,
17113, NA, NA, NA, 17160, NA, 17252, NA, NA, NA, 17338), class = "Date"),
EXDOSEA1 = c("73.8+147.6", NA, NA, "64.5+129", NA, NA, "62.7+125.4",
NA, NA, "114+57", NA, NA, "60+117.5", NA, NA, "48.6+97.2",
NA, NA, "61.2+122.4", NA, NA, "47.7+95.4", NA, NA, "51.6+103.2",
NA, NA, "68+136", NA, NA), EXDOSEA6 = c(NA, "100", NA, NA,
NA, "86", NA, NA, "83.5", NA, NA, "76", NA, NA, "39.2", NA,
NA, "32", NA, NA, "81.5", NA, "69.6", NA, NA, NA, "68", NA,
"91", NA), EXDOSEA3 = c(NA, NA, "1600", NA, "4302", NA, NA,
"4185", NA, NA, "3900", NA, NA, "3921", NA, NA, "3300", NA,
NA, "4080", NA, NA, NA, "3183", NA, "3300", NA, NA, NA, "1514"
)), row.names = c(NA, -30L), class = c("tbl_df", "tbl", "data.frame"
))
Right now my code is:
df %>%
  group_by(ID) %>%
  summarise(across(EXSTDTC1:EXDOSEA3, na.omit))
But it seems to remove 05-102, as it did not have a value for EXSTDTC1. I would like to see how we can address this. Is it still possible to use across?
Many thanks.
We could use an if/else condition to address those cases where there are only NAs:
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(across(EXSTDTC1:EXDOSEA3,
                   ~ if (all(is.na(.))) NA else .[complete.cases(.)]),
            .groups = 'drop')
Output:
# A tibble: 10 x 7
# ID EXSTDTC1 EXSTDTC6 EXSTDTC3 EXDOSEA1 EXDOSEA6 EXDOSEA3
# <chr> <date> <date> <date> <chr> <chr> <chr>
# 1 01-103 2016-08-09 2016-08-18 2016-08-14 64.5+129 86 4302
# 2 02-106 2016-08-25 2016-09-03 2016-08-30 60+117.5 39.2 3921
# 3 02-109 2016-12-20 2016-12-29 2016-12-25 47.7+95.4 69.6 3183
# 4 05-102 NA 2016-03-25 2016-03-21 73.8+147.6 100 1600
# 5 05-105 2016-08-19 2016-08-28 2016-08-24 114+57 76 3900
# 6 05-107 2016-10-18 2016-10-26 2016-10-23 48.6+97.2 32 3300
# 7 05-111 2017-03-22 2017-03-31 2017-03-27 51.6+103.2 68 3300
# 8 07-115 2017-06-16 2017-06-25 2017-06-21 68+136 91 1514
# 9 08-104 2016-08-11 2016-08-19 2016-08-16 62.7+125.4 83.5 4185
#10 08-108 2016-11-03 2016-11-11 2016-11-08 61.2+122.4 81.5 4080
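The reason the plain across(..., na.omit) call misbehaves for 05-102: na.omit() of an all-NA column returns a zero-length vector, and a zero-length summary removes that group's row instead of yielding NA. A quick base R demonstration:

```r
# na.omit() on an all-NA vector returns a zero-length vector
x <- c(NA_real_, NA_real_, NA_real_)
length(na.omit(x))   # 0
# summarise() then has no value for that cell, so the whole group row
# disappears; the if/else above guards this by returning a single NA instead
```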

Create a row with character and numeric

I want to create a single row with NA values, 0 values, and characters, as shown below:
newrow = c(NA, NA, NA, NA, "GPP", numeric(0), NA, NA, NA, NA, NA, NA, NA, NA, numeric(0), NA, NA, NA, NA, NA, NA)
However, the zero values are transformed into NA values in the output.
[1] NA NA NA NA "GPP" NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Does anyone know what is going wrong?
Answer from Frank:
newrow = list(NA, NA, NA, NA, "GPP", numeric(1), NA, NA, NA, NA, NA, NA, NA, NA, numeric(1), NA, NA, NA, NA, NA, NA)
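To see what was going wrong with the original vector: numeric(0) has length zero, so c() simply drops it, and mixing in one character string coerces everything that remains to character. A list, as in Frank's answer, keeps each element's own type and length, and numeric(1) is an actual single 0:

```r
# numeric(0) is length zero: c() simply drops it
length(c(NA, numeric(0), NA))   # 2, not 3

# one character element coerces the whole atomic vector to character
c(NA, "GPP", 0)                 # NA "GPP" "0"

# a list preserves types; numeric(1) is a genuine single 0
newrow <- list(NA, "GPP", numeric(1))
sapply(newrow, class)           # "logical" "character" "numeric"
```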
