Merging monthly-level data with quarterly data?

I have 2 data sets: one is quarterly, which I need to match to monthly data, so the values from the quarterly data will be repeated thrice in the final data set. I have created a one-quarter sample below, but this would need to be repeated for many quarters.
month <- c("1/20", "2/20", "3/20")
rating <- c(0.5, 0.6, 0.65)
df1 <- cbind(month, rating)
quarter <- c("q1/20")
amount <- c(100)
df2 <- cbind(quarter, amount)
My final data set should have the following structure:
month <- c("1/20", "2/20", "3/20")
rating <- c(0.5, 0.6, 0.65)
quarter <- c("q1/20", "q1/20", "q1/20")
amount <- c(100, 100, 100)
df3 <- cbind(month, rating, quarter, amount)
In the full data set, some observations in the quarterly data (df2) are also monthly, so it might be a case of matching the monthly observations by month and the quarterly observations by quarter?
Thanks in anticipation.

Assuming you have this data.
head(m.dat)
# month rating
# 1 1/18 0.91
# 2 2/18 0.94
# 3 3/18 0.29
# 4 4/18 0.83
# 5 5/18 0.64
# 6 6/18 0.52
head(q.dat)
# quarter amount
# 1 q1/18 1
# 2 q2/18 21
# 3 q3/18 91
# 4 q4/18 61
# 5 q1/19 38
# 6 q2/19 44
You could match month information to quarters using an assignment matrix qm.
qm <- matrix(c(1:12, paste0("q", rep(1:4, each=3))), 12, 2)  # maps month number (col 1) to quarter label (col 2)
m.dat$quarter <- paste0(qm[match(gsub("(^\\d*).*", "\\1", m.dat$month), qm[, 1]), 2],
                        "/",
                        sapply(strsplit(m.dat$month, "/"), `[`, 2))
This enables you to use merge.
res <- merge(m.dat, q.dat, all=TRUE)
head(res)
# quarter month rating amount
# 1 q1/18 1/18 0.91 1
# 2 q1/18 2/18 0.94 1
# 3 q1/18 3/18 0.29 1
# 4 q1/19 1/19 0.93 38
# 5 q1/19 2/19 0.26 38
# 6 q1/19 3/19 0.46 38
Toy data
m.dat <- structure(list(month = c("1/18", "2/18", "3/18", "4/18", "5/18",
"6/18", "7/18", "8/18", "9/18", "10/18", "11/18", "12/18", "1/19",
"2/19", "3/19", "4/19", "5/19", "6/19", "7/19", "8/19", "9/19",
"10/19", "11/19", "12/19", "1/20", "2/20", "3/20", "4/20", "5/20",
"6/20", "7/20", "8/20", "9/20", "10/20", "11/20", "12/20"), rating = c(0.91,
0.94, 0.29, 0.83, 0.64, 0.52, 0.74, 0.13, 0.66, 0.71, 0.46, 0.72,
0.93, 0.26, 0.46, 0.94, 0.98, 0.12, 0.47, 0.56, 0.9, 0.14, 0.99,
0.95, 0.08, 0.51, 0.39, 0.91, 0.45, 0.84, 0.74, 0.81, 0.39, 0.69,
0, 0.83)), class = "data.frame", row.names = c(NA, -36L))
q.dat <- structure(list(quarter = c("q1/18", "q2/18", "q3/18", "q4/18",
"q1/19", "q2/19", "q3/19", "q4/19", "q1/20", "q2/20", "q3/20",
"q4/20"), amount = c(1, 21, 91, 61, 38, 44, 4, 97, 43, 96, 89,
64)), class = "data.frame", row.names = c(NA, -12L))

Assuming that df1 and df2 are the data frames shown in the Note at the end, create a yq column of class "yearqtr" in each and merge on that:
library(zoo)
df1 <- transform(df1, yq = as.yearqtr(month, "%m/%y"))
df2 <- transform(df2, yq = as.yearqtr(quarter, "q%q/%y"))
merge(df1, df2, by = "yq", all = TRUE)
giving:
yq month rating quarter amount
1 2020 Q1 1/20 0.50 q1/20 100
2 2020 Q1 2/20 0.60 q1/20 100
3 2020 Q1 3/20 0.65 q1/20 100
We could also consider converting the month column into a "yearmon" class column using as.yearmon.
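For example, a minimal sketch of that variant, assuming the same df1 and df2 from the Note: keep a yearmon column for the monthly resolution and derive the quarter from it for the merge.
library(zoo)

df1 <- transform(df1, ym = as.yearmon(month, "%m/%y"))  # monthly index
df1 <- transform(df1, yq = as.yearqtr(ym))              # quarter implied by the month
df2 <- transform(df2, yq = as.yearqtr(quarter, "q%q/%y"))
merge(df1, df2, by = "yq", all = TRUE)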
Note
df1 <- data.frame(month = c("1/20", "2/20", "3/20"), rating = c(0.5,0.6,0.65))
df2 <- data.frame(quarter = "q1/20", amount = 100)


Renaming entries of a list within a list, by using a vector containing the column names

My data looks as follows:
DT1 <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), Year = c(2000,
2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000,
2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001,
2001, 2002, 2002, 2002), Municipality = c("Something", "Anything",
"Nothing", "Something", "Anything", "Nothing", "Something", "Anything",
"Nothing", "Something", "Anything", "Nothing", "Something", "Anything",
"Nothing", "Something", "Anything", "Nothing", "Something", "Anything",
"Nothing", "Something", "Anything", "Nothing", "Something", "Anything",
"Nothing"), Values = c(0.59, 0.58, 0.66, 0.53, 0.94, 0.2, 0.86,
0.85, 0.99, 0.59, 0.58, 0.66, 0.53, 0.94, 0.2, 0.86, 0.85, 0.99,
0.59, 0.58, 0.66, 0.53, 0.94, 0.2, 0.86, 0.85, 0.99)), row.names = c(NA,
-27L), class = c("tbl_df", "tbl", "data.frame"))
library(dplyr)
DT1_list <- DT1 %>%
  group_split(Province, Year)
I want to rename all the columns, using a vector as follows:
colnames <- c("newname1","newname2","newname3","newname4")
for (i in DT1_list) {
  names(DT1_list)[[i]] <- colnames
}
The problem is that names(DT1_list)[[i]] does not give column names but NULL.
What is the correct way to do this?
EDIT:
I noticed that my question was not a good enough representation of my actual problem (my apologies, I did not foresee the issue). My actual problem is that I want to rename 3 out of 4 columns:
colnames <- c("newname1","newname2","newname3")
If I use the answers provided, the fourth column becomes NA. Is there any way to keep the other columns intact?
You could use purrr::map:
library(purrr)
DT1_list %>% map(~{
  colnames <- colnames(.x)
  # You could also use str_replace:
  # colnames <- stringr::str_replace(colnames, "Values", "NewValues")
  colnames[1:3] <- c("newname1", "newname2", "newname3")
  colnames(.x) <- colnames
  .x
})
[[1]]
# A tibble: 3 x 4
newname1 newname2 newname3 Values
<dbl> <dbl> <chr> <dbl>
1 1 2000 Something 0.59
2 1 2000 Anything 0.580
3 1 2000 Nothing 0.66
[[2]]
# A tibble: 3 x 4
newname1 newname2 newname3 Values
<dbl> <dbl> <chr> <dbl>
1 1 2001 Something 0.53
2 1 2001 Anything 0.94
3 1 2001 Nothing 0.2
...
You can use lapply/map:
lapply(DT1_list, setNames, colnames)
#[[1]]
# A tibble: 3 x 4
# newname1 newname2 newname3 newname4
# <dbl> <dbl> <chr> <dbl>
#1 1 2000 Something 0.59
#2 1 2000 Anything 0.580
#3 1 2000 Nothing 0.66
#[[2]]
# A tibble: 3 x 4
# newname1 newname2 newname3 newname4
# <dbl> <dbl> <chr> <dbl>
#1 1 2001 Something 0.53
#2 1 2001 Anything 0.94
#3 1 2001 Nothing 0.2
#...
When you want to rename fewer columns than your original data has, use:
inds <- seq_along(colnames)
lapply(DT1_list, function(x) {names(x)[inds] <- colnames; x})
Or :
library(dplyr)
library(purrr)
map(DT1_list, ~.x %>% rename_with(~colnames, inds))

How to compute a custom mean for each row over multiple columns, based on a row-specific criterion?

I want to compute the mean exposure to, say, ozone from a dataset like the example below. The mean value should be the ozone value from the year of birth to age 5. Is there a simple way to do this in R?
final = data.frame(ID = c(1, 2, 3, 4, 5, 6),
Zone = c("A", "B", "C", "D", "A", "B"),
dob = c(1993, 1997, 1994, 2001, 1999, 1993),
Ozone_1993 = c(0.12, 0.01, 0.36, 0.78, 0.12, 0.01),
Ozone_1994 = c(0.75, 0.23, 0.14, 0.98, 0.75, 0.23),
Ozone_1995 = c(1.38, 0.45, -0.08, 1.18, 1.38, 0.45),
Ozone_1996 = c(2.01, 0.67, -0.3, 1.38, 2.01, 0.67),
Ozone_1997 = c(2.64, 0.89, -0.52, 1.58, 2.64, 0.89),
Ozone_1998 = c(3.27, 1.11, -0.74, 1.78, 3.27, 1.11),
Ozone_1999 = c(3.9, 1.33, -0.96, 1.98, 3.9, 1.33),
Ozone_2000 = c(4.53, 1.55, -1.18, 2.18, 4.53, 1.55),
Ozone_2001 = c(5.16, 1.77, -1.4, 2.38, 5.16, 1.77),
Ozone_2002 = c(5.79, 1.99, -1.62, 2.58, 5.79, 1.99),
Ozone_2003 = c(6.42, 2.21, -1.84, 2.78, 6.42, 2.21),
Ozone_2004 = c(7.05, 2.43, -2.06, 2.98, 7.05, 2.43),
mean_under5_ozone = c(0.85, 1.33, -0.3, 2.68, 5.16, 0.45))
where the column (variable) mean_under5_ozone is the mean ozone exposure from birth year to age 5 or less, e.g. mean_under5_ozone for ID 1 is the row mean from Ozone_1993 to Ozone_1997.
From a novice,
Here is one way to do it with for loops. (It's not very elegant, but it avoids getting into too much detail of dplyr and rlang syntax.)
Loop over birth years (dob_yr below) to define a column containing the variable names to use for the custom mean (use_vars below).
Loop over rows and, for each row, extract the relevant variables using this new column (use_vars) and calculate the custom mean.
library(dplyr)
df <- tibble(id = 1:5)
df$zone <- rep('A', 5)
df$dob_yr <- 1991:1995
for (yr in 1991:1995) {
  df[[paste('x_', yr, sep = '')]] <- abs(rnorm(5))
}
df # check mock data
add_use_vars <- function(df, dob_yr_varname = 'dob_yr', prefix = 'x_', yr_within = 3) {
  vars <- names(df %>% select(starts_with(prefix)))
  vars_yr <- as.integer(sub(prefix, '', vars))
  df$use_vars <- NA
  for (i in seq_along(df[[dob_yr_varname]])) {
    yr <- df[[dob_yr_varname]][i]
    idx <- (vars_yr <= yr + yr_within) & (vars_yr >= yr)
    df$use_vars[i] <- list(vars[idx])  # list with one element
  }
  return(df)
}
df <- add_use_vars(df)
df$use_vars[1][[1]] # see the first row in use_vars
custom_mean <- function(df, varname_varlist = 'use_vars') {
  df$custom_mean <- NA
  for (i in seq_along(df[[varname_varlist]])) {
    vars <- df[[varname_varlist]][i][[1]]  # extract first element in list
    df$custom_mean[i] <- mean(as.numeric(df[i, vars]))
  }
  return(df)
}
df <- custom_mean(df)
df # see results
Note that for this mock data, each row averages over the columns within 0 to 3 years of the birth year.
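Applied to the question's final data frame, the same helpers should work with the "Ozone_" prefix and a birth-to-age-5 window, i.e. yr_within = 5 (a sketch; final2 is just an illustrative name):
final2 <- add_use_vars(final, dob_yr_varname = 'dob', prefix = 'Ozone_', yr_within = 5)
final2 <- custom_mean(final2)
final2$custom_mean  # one custom mean per row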
(Complete rewrite.)
I don't think I understand what mean_under5_ozone means, since I can't reproduce your numbers. For instance, for ID==1, born in 1993, that means we want data from 1993 through 1998 (to include age 5) or through 1997 (up to but not including it), but neither of those averages is 0.85:
mean(unlist(final[1, 4:9]))
# [1] 1.695
mean(unlist(final[1, 4:8]))
# [1] 1.38
Ignoring this, I'll give you what I think are the correct answers with your final data.
tidyverse
library(dplyr)
library(tidyr) # pivot_longer
final <- select(final, -mean_under5_ozone)
final %>%
  pivot_longer(starts_with("Ozone"), names_pattern = "(.*)_(.*)", names_to = c("type", "year")) %>%
  mutate(year = as.integer(year)) %>%
  group_by(ID) %>%
  summarize(mean_under5_ozone = mean(value[between(year, dob, dob + 5)]), .groups = "drop")
# # A tibble: 6 x 2
# ID mean_under5_ozone
# <dbl> <dbl>
# 1 1 1.70
# 2 2 1.44
# 3 3 -0.41
# 4 4 2.68
# 5 5 5.48
# 6 6 0.56
data.table
library(data.table)
library(magrittr) # %>%, not required but used for improved readability
finalDT <- as.data.table(final)       # assuming the original final from the question
finalDT[, mean_under5_ozone := NULL]  # drop it here as well
melt(finalDT, 1:3) %>%
  .[, year := as.integer(gsub("[^0-9]", "", variable))] %>%
  .[year >= dob, ] %>%
  .[, .(mean_under5_ozone = mean(value[between(year, dob, dob + 5)])), by = .(ID)] %>%
  .[order(ID), ]
# ID mean_under5_ozone
# 1: 1 1.695
# 2: 2 1.440
# 3: 3 -0.410
# 4: 4 2.680
# 5: 5 5.475
# 6: 6 0.560
A few thoughts, using random data.
set.seed(42)
dat <- data.frame(dob = sample(1990:2020, size=1000, replace=TRUE), Ozone_1993=runif(1000), Ozone_1994=runif(1000), Ozone_1995=runif(1000))
head(dat)
# dob Ozone_1993 Ozone_1994 Ozone_1995
# 1 2006 0.37383448 0.68624969 0.1681480
# 2 1994 0.46496563 0.29309851 0.8198724
# 3 1990 0.04660819 0.41994895 0.7501070
# 4 2014 0.98751620 0.73526105 0.2899959
# 5 1999 0.90845233 0.84982125 0.1798130
# 6 1993 0.97939015 0.07746459 0.6172919
tidyverse
library(dplyr)
dat %>%
  filter(dob >= 2015) %>%
  summarize_at(vars(starts_with("Ozone")), mean)
# Ozone_1993 Ozone_1994 Ozone_1995
# 1 0.5242029 0.4852803 0.4864364
That is the average per year. If you instead need a single statistic, then
# library(tidyr) # pivot_longer
dat %>%
  filter(dob >= 2015) %>%
  tidyr::pivot_longer(starts_with("Ozone")) %>%
  summarize(value = mean(value))
# # A tibble: 1 x 1
# value
# <dbl>
# 1 0.499
data.table
library(data.table)
datDT <- as.data.table(dat)
datDT[ dob >= 2015, ][, lapply(.SD, mean), .SDcols = patterns("^Ozone")]
# Ozone_1993 Ozone_1994 Ozone_1995
# 1: 0.5242029 0.4852803 0.4864364
melt(datDT[ dob >= 2015, ], "dob")[, .(value = mean(value))]
# value
# 1: 0.4986398
Base R
apply(subset(dat, dob >= 2015, select = Ozone_1993:Ozone_1995), 2, mean)
# Ozone_1993 Ozone_1994 Ozone_1995
# 0.5242029 0.4852803 0.4864364
mean(unlist(subset(dat, dob >= 2015, select = Ozone_1993:Ozone_1995)))
# [1] 0.4986398

Determine how close proportions are to an even split

I've got a dataset that has info about bunch of cities in it. Variables include % of residents that are several different race categories, % of residents in several employment sectors, etc. I'm trying to determine, for each category, how close each city is to an even split among the options.
So for race, there are 4 race categories, so a city that's 25% of each would be (for example) 1, while a city that was 100% white would be a 0. However, with 7 employment sectors, each would have to be 14.29% for a perfect score (the point being that I'm doing this on multiple categories with different numbers of groups in each category). My output would be a column that has some kind of numeric score for how evenly the group I'm looking at (for example, race) is spread out.
I'm programming in R, so a solution there would be great, but I'm up for whatever kind of answer might be useful.
Here's a sample data frame if that's useful
testdata <- structure(list(city = c("City1", "City2", "City3", "City4"),
  black = c(0.4, 0.1, 0.3, 0.2), white = c(0.3, 0.7, 0.1, 0.2),
  hisp = c(0.2, 0.1, 0.2, 0.2), asian = c(0.1, 0.1, 0.4, 0.4),
  service = c(0.10, 0.14, 0.4, 0.0), tech = c(0.00, 0.14, 0.6, 0.2),
  govt = c(0.15, 0.14, 0.0, 0.2), nonprofit = c(0.20, 0.14, 0.0, 0.3),
  agriculture = c(0.05, 0.14, 0.0, 0.1), manufacturing = c(0.40, 0.14, 0.0, 0.1),
  marketing = c(0.10, 0.16, 0.0, 0.1)), row.names = c(NA, -4L), class = "data.frame")
Here's one way to proceed:
Differentiate the data based on categories. In the example you have shared, there are two broad categories: race and employment sectors. Once you have the categories, you can get the even-split number by dividing 1 by the number of rows in each group, and subtract it from the value present.
library(dplyr)
testdata %>%
  tidyr::pivot_longer(cols = -city) %>%
  mutate(category = case_when(name %in% c('black', 'white', 'hisp', 'asian') ~ 'race',
                              TRUE ~ 'sectors')) %>%
  group_by(city, category) %>%
  mutate(close_ratio = abs(1/n() - value))
# city name value category close_ratio
# <chr> <chr> <dbl> <chr> <dbl>
# 1 City1 black 0.4 race 0.15
# 2 City1 white 0.3 race 0.0500
# 3 City1 hisp 0.2 race 0.0500
# 4 City1 asian 0.1 race 0.15
# 5 City1 service 0.1 sectors 0.0429
# 6 City1 tech 0 sectors 0.143
# 7 City1 govt 0.15 sectors 0.00714
# 8 City1 nonprofit 0.2 sectors 0.0571
# 9 City1 agriculture 0.05 sectors 0.0929
#10 City1 manufacturing 0.4 sectors 0.257
# … with 34 more rows
A close_ratio of 0 is ideal: the value exactly matches an even split. The further it is from 0, the more uneven the split.
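If you would rather have a single numeric score per city and category, as described in the question (1 = perfectly even, 0 = fully concentrated), one possible sketch is to total the deviations and rescale by the maximum possible total deviation, 2 * (1 - 1/n), which occurs when everything sits in one group; the normalization constant is my assumption, not part of the answer above:
library(dplyr)
library(tidyr)

testdata %>%
  pivot_longer(cols = -city) %>%
  mutate(category = case_when(name %in% c('black', 'white', 'hisp', 'asian') ~ 'race',
                              TRUE ~ 'sectors')) %>%
  group_by(city, category) %>%
  # 1 when every group has share 1/n, 0 when one group has everything
  summarize(evenness = 1 - sum(abs(value - 1/n())) / (2 * (1 - 1/n())),
            .groups = 'drop')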

If rowSums greater than one, divide by sum

I'd like to divide by the sum of rows if the rowSums() is greater than one. I haven't thought of a way to do this without a for() loop, so I'm looking for a solution without a loop.
Sample Data
dat <- structure(list(x1 = c(0.18, 0, 0.11, 0.24, 0.33), x2 = c(0.34,
0.14, 0.36, 0.35, 0.21), x3 = c(0.1, 0.36, 0.12, 0.07, 0.18),
x4 = c(0.08, 0.35, 0.06, 0.1, 0.09), x5 = c(0.26, 0.13, 0.22,
0.31, 0.22)), .Names = c("x1", "x2", "x3", "x4", "x5"), row.names = c(NA,
5L), class = "data.frame")
> rowSums(dat)
1 2 3 4 5
0.96 0.98 0.87 1.07 1.03
What I've tried
This works, but I wonder if there is a better way to do it:
a <- which(rowSums(dat) > 1)
dat[a, ] <- dat[a, ] / rowSums(dat[a, ])
> rowSums(dat)
1 2 3 4 5
0.96 0.98 0.87 1.00 1.00
This gives the same value as the expression near the end of the question:
dat / pmax(rowSums(dat), 1)
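As a quick check, the row sums after scaling match the desired output in the question:
rowSums(dat / pmax(rowSums(dat), 1))
   1    2    3    4    5
0.96 0.98 0.87 1.00 1.00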
This is inferior to G. Grothendieck's answer, but you can also use ifelse.
rs <- rowSums(dat)
dat / ifelse(rs > 1, rs, 1)
         x1        x2         x3         x4        x5
1 0.1800000 0.3400000 0.10000000 0.08000000 0.2600000
2 0.0000000 0.1400000 0.36000000 0.35000000 0.1300000
3 0.1100000 0.3600000 0.12000000 0.06000000 0.2200000
4 0.2242991 0.3271028 0.06542056 0.09345794 0.2897196
5 0.3203883 0.2038835 0.17475728 0.08737864 0.2135922

Keep NA values in their original position when reordering vector

I have a large set of data that I want to reorder in groups of twelve, using the sample() function in R, to generate randomised data sets with which I can carry out a permutation test. However, this data has NA values where data could not be collected, and I would like them to stay in their original positions when the data is shuffled.
Currently, NAs are shuffled randomly with all the other values. For example, where example.data is a made-up example set of 12 values:
example.data <- c(0.33, 0.12, NA, 0.25, 0.47, 0.83, 0.90, 0.64, NA, NA, 1.00, 0.42)
sample(example.data, replace = F, prob = NULL)
[1] 0.64 0.83 NA 0.33 0.47 0.90 0.25 NA 0.12 0.42 1.00 NA
Whereas a suitable reordering would be:
[1] 0.64 0.83 NA 0.33 0.47 0.90 0.25 0.12 NA NA 0.42 1.00
Is there a simple way to do this?
Thank you for your help!
This has been solved, but I have a follow-up question.
Extending from this, if I have a set of data with a length of 24, how would I go about re-ordering the first and second sets of 12 values individually?
For example, a vector extending from the first example:
example.data <- c(0.33, 0.12, NA, 0.25, 0.47, 0.83, 0.90, 0.64, NA, NA, 1.00, 0.42, 0.73, NA, 0.56, 0.12, 1.0, 0.47, NA, 0.62, NA, 0.98, NA, 0.05)
Where example.data[1:12] and example.data[13:24] are shuffled separately within their own respective groups.
The code I am trying to work this solution into is as follows:
shuffle.data <- function(input.data, nr, ns) {
  simdata <- input.data
  for (i in 1:nr) {
    start.row <- (ns * (i - 1)) + 1
    end.row <- start.row + actual.length[i] - 1
    newdata <- sample(input.data[start.row:end.row], size = actual.length[i], replace = F)
    simdata[start.row:end.row] <- newdata
  }
  return(simdata)
}
Where input.data is the raw input data (example.data); nr is the number of groups (2); ns is the size of each sample (12); and actual.length is the length of each group excluding NAs, stored in a vector (actual.length <- c(9, 8) in the example above).
Thank you again for your help!
Is this what you are looking for?
example.data[!is.na(example.data)] <- sample(example.data[!is.na(example.data)], replace = F, prob = NULL)
We can try with non-NA elements by creating an index
i1 <- which(!is.na(example.data))
example.data[i1] <- example.data[sample(i1)]
example.data
#[1] 0.25 0.64 NA 0.83 0.12 1.00 0.42 0.47 NA NA 0.33 0.90
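For the follow-up about shuffling the first and second sets of 12 separately, a minimal sketch applying the same index idea per block with ave; ns and blocks are assumptions matching the example:
ns <- 12
blocks <- rep(seq_len(length(example.data) / ns), each = ns)
example.data <- ave(example.data, blocks, FUN = function(x) {
  i <- which(!is.na(x))  # assumes each block has at least two non-NA values
  x[i] <- x[sample(i)]   # shuffle non-NA values within the block
  x
})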
