Summarize columns where names have a specific pattern in data.table - r

I have a very large data.table, which I want to summarise columns by group, where the column names starts with a certain pattern.
The columns I am interested in always have the same format, namely: f<X>_<Y>, m<X>_<Y>, f<X>, m<X>.
This is the list of all possible column names:
ageColsPossible <- c("m0_9", "m10_19", "m20_29", "m30_39", "m40_49", "m50_59", "m60_69",
"f0_9", "f10_19", "f20_29", "f30_39", "f40_49", "f50_59", "f60_69")
if there is not enough data available, my data.table will only have some of these columns. I would like to get a vector with the column names that are available in the data:
> names(myData)
[1] "clientID" "policyID" "startYear" "product" "NOplans" "grp"
[7] "policyid" "personid" "age" "gender" "dependant" "location"
[13] "region" "exposure" "startMonth" "cover_effective_date" "endexposuredate" "fromdate"
[19] "enddate" "planHistSufficiency" "productRank" "claim10month" "claim11month" "claim12month"
[25] "claim9month" "NA20_29" "NA30_39" "NA40_49" "NA50_59" "f0_9"
[31] "f10_19" "f20_29" "f30_39" "f40_49" "f50_59" "f60_69"
[37] "m0_9" "m10_19" "m20_29" "m30_39" "m40_49" "m50_59"
[43] "m60_69" "u0_9" "u10_19" "u20_29" "u30_39" "u40_49"
[49] "u50_59" "u60_69" "uNA"
I know of regrex and was thinking something along the line: regex = "(m|f)(\\d+)_?(\\d+)?", but i have also seen patern() function somewhere. Unfortunately i can no longer find it.
any ideas?

something like this will most likely do the trick.. assuming you only need one summary-function? (median() in this example)...
DT[, lapply( .SD, median), by=.(group), .SDcols = patterns( "^[mf]\\d+" ) ]

Related

left_join says column is not present even though it is present

I would like to join two data frames with two different variables tp join. There is an error which says it cannotfind the variable in the second dataframe. But when I run the function colnames(), the column name shows up. Why is this the case?
df_new <- left_join(master_settlement_current_month, master_settlement, by = c("D.settlecounty", "NAMECOUNTY"))
Error: Join columns must be present in data.
x Problem with `NAMECOUNTY`.
Run `rlang::last_error()` to see where the error occurred.
colnames(master_settlement_current_month)[1:5]
[1] "month" "D.info_state" "D.info_county" "D.info_settlement" "D.settlecounty"
colnames(master_settlement)
[1] "NAME" "NAMEJOIN" "NAMECOUNTY" "COUNTYJOIN" "DATE" "DATA_SOURC" "IMG_VERIFD"
[8] "X" "Y" "kobo_label" "X.3" "X.2" "X.1" "INDEX"
[15] "P_CODE" "aok_sett_id" "name_county_low" "ALT_NAME1" "ALT_NAME2" "ALT_NAME3" "ALT_NAME4"
[22] "FUNC_CLASS" "CONF_SCORE" "SRC_VERIFD" "num_dup" "check_coord_v38"
I think your syntax in the by = statement may be a little off.
library(dplyr)
df_new <- left_join(master_settlement_current_month, master_settlement, by = c("D.settlecounty" = "NAMECOUNTY"))

str_split on first and second occurence of delimter at different locations in character vector

I have a character list that has weather variables followed by "mean_#" where # is a number between 5 and 10. I want to subset the list to only have the weather variable names themselves. The mean weather variables look like this:
> mean_vars
[1] "dew_mean_10" "dew_mean_5" "dew_mean_6" "dew_mean_7"
[5] "dew_mean_8" "dew_mean_9" "humid_mean_10" "humid_mean_5"
[9] "humid_mean_6" "humid_mean_7" "humid_mean_8" "humid_mean_9"
[13] "rain_mean_10" "rain_mean_5" "rain_mean_6" "rain_mean_7"
[17] "rain_mean_8" "rain_mean_9" "soil_moist_mean_10" "soil_moist_mean_5"
[21] "soil_moist_mean_6" "soil_moist_mean_7" "soil_moist_mean_8" "soil_moist_mean_9"
[25] "soil_temp_mean_10" "soil_temp_mean_5" "soil_temp_mean_6" "soil_temp_mean_7"
[29] "soil_temp_mean_8" "soil_temp_mean_9" "solar_mean_10" "solar_mean_5"
[33] "solar_mean_6" "solar_mean_7" "solar_mean_8" "solar_mean_9"
[37] "temp_mean_10" "temp_mean_5" "temp_mean_6" "temp_mean_7"
[41] "temp_mean_8" "temp_mean_9" "wind_dir_mean_10" "wind_dir_mean_5"
[45] "wind_dir_mean_6" "wind_dir_mean_7" "wind_dir_mean_8" "wind_dir_mean_9"
[49] "wind_gust_mean_10" "wind_gust_mean_5" "wind_gust_mean_6" "wind_gust_mean_7"
[53] "wind_gust_mean_8" "wind_gust_mean_9" "wind_spd_mean_10" "wind_spd_mean_5"
[57] "wind_spd_mean_6" "wind_spd_mean_7" "wind_spd_mean_8" "wind_spd_mean_9"
And this is all I want at the end:
> var_names
"dew" "humid" "rain" "solar" "temp" "soil_moist" "soil_temp" "wind_dir" "wind_gust" "wind_spd"
Now I figured out how to do it but I fill my method is extraneous due to a lack of ability with regular expressions. I also will have to repeat my process 20 times substituting "mean" with other words.
var_names <- unique(str_split_fixed(mean_vars, "_", n = 3)[c(1:18,31:42),1])
var_names <- unlist(c(var_names, unique(unite(as_tibble(str_split_fixed(mean_vars, "_", n = 3)[c(19:30,43:60), 1:2])))))
I've been trying to stay within the realm of the tidyverse packages as much as possible so I was using stringr::str_split_fixed.
If you have a solution using this same function that would be ideal as I could continue the same programming style, but I'm open to all suggestions.
Thanks.
Use sub and unique. This is shorter and has no package dependencies (or use unique(str_replace(mean_vars, "_mean.*", "")) with stringr):
unique(sub("_mean.*", "", mean_vars))
giving:
[1] "dew" "humid" "rain" "soil_moist" "soil_temp"
[6] "solar" "temp" "wind_dir" "wind_gust" "wind_spd"
If for some reason you really want to use str_split then:
rmMean <- function(x) paste(head(x, -2), collapse = "_")
unique(sapply(str_split(mean_vars, "_"), rmMean))
Note
mean_vars <- c("dew_mean_10", "dew_mean_5", "dew_mean_6", "dew_mean_7", "dew_mean_8",
"dew_mean_9", "humid_mean_10", "humid_mean_5", "humid_mean_6",
"humid_mean_7", "humid_mean_8", "humid_mean_9", "rain_mean_10",
"rain_mean_5", "rain_mean_6", "rain_mean_7", "rain_mean_8", "rain_mean_9",
"soil_moist_mean_10", "soil_moist_mean_5", "soil_moist_mean_6",
"soil_moist_mean_7", "soil_moist_mean_8", "soil_moist_mean_9",
"soil_temp_mean_10", "soil_temp_mean_5", "soil_temp_mean_6",
"soil_temp_mean_7", "soil_temp_mean_8", "soil_temp_mean_9", "solar_mean_10",
"solar_mean_5", "solar_mean_6", "solar_mean_7", "solar_mean_8",
"solar_mean_9", "temp_mean_10", "temp_mean_5", "temp_mean_6",
"temp_mean_7", "temp_mean_8", "temp_mean_9", "wind_dir_mean_10",
"wind_dir_mean_5", "wind_dir_mean_6", "wind_dir_mean_7", "wind_dir_mean_8",
"wind_dir_mean_9", "wind_gust_mean_10", "wind_gust_mean_5", "wind_gust_mean_6",
"wind_gust_mean_7", "wind_gust_mean_8", "wind_gust_mean_9", "wind_spd_mean_10",
"wind_spd_mean_5", "wind_spd_mean_6", "wind_spd_mean_7", "wind_spd_mean_8",
"wind_spd_mean_9")

Error with R dplyr left_join

So I've been trying to use left_join to get the columns of a new dataset onto my main dataset (called employee)
I've double checked the vector names and the cleaning that I've don't and nothing seems to work. Here is my code. Would appreciate any help.
job_codes <- read_csv("Quest_UMMS_JobCodes.csv")
job_codes <- job_codes %>%
clean_names() %>%
select(job_code, pos_desc = pos_des_desc)
job_codes$is_nurse <- str_detect(tolower(job_codes$pos_desc), "nurse")
employee <- employee %>%
left_join(job_codes, by = "job_code")
The error I keep getting:Error in eval(substitute(expr), envir, enclos) :
'job_code' column not found in rhs, cannot join
here are the results of
names(job_code)
> names(job_codes)
[1] "job_code" "pos_desc" "is_nurse"
names(employee)
> names(employee)
[1] "REC_NUM" "ZIP" "STATE"
[4] "SEX" "EEO_CLASS" "BIRTH_YEAR"
[7] "EMP_STATUS" "PROCESS_LEVEL" "DEPARTMENT"
[10] "JOB_CODE" "UNION_CODE" "SUPERVISOR"
[13] "DATE_HIRED" "R_SHIFT" "SALARY_CLASS"
[16] "EXEMPT_EMP" "PAY_RATE" "ADJ_HIRE_DATE"
[19] "ANNIVERS_DATE" "TERM_DATE" "NBR_FTE"
[22] "PENSION_PLAN" "PAY_GRADE" "SCHEDULE"
[25] "OT_PLAN_CODE" "DECEASED" "POSITION"
[28] "WORK_SCHED" "SUPERVISOR_IND" "FTE_TOTAL"
[31] "PRO_RATE_TOTAL" "PRO_RATE_A_SAL" "NEW_HIRE_DATE"
[34] "COUNTY" "FST_DAY_WORKED" "date_hired"
[37] "date_hired_adj" "term_date" "employment_duration"
[40] "current" "age" "emp_duration_years"
[43] "DESCRIPTION.x" "PAY_STATUS.x" "DESCRIPTION.y"
[46] "PAY_STATUS.y"
Now, after the OP has added the column names of both tables in the Q, it is evident that the columns to join on are written in different ways (upper vs lower case).
If the column names are different, help("left_join") suggests:
To join by different variables on x and y use a named vector. For example, by = c("a" = "b") will match x.a to y.b.
So, in this case it should read
employee <- employee %>% left_join(job_codes, by = c("JOB_CODE" = "job_code"))

How to split a string list whose elements are "name-year" based on years in R

I have some codes like this example, if you run these codes
library(hurricaneexposure)
library(hurricaneexposuredata)
data("hurr_tracks")
storms <- unique(hurr_tracks$storm_id)
storms
then you will see that "storms" has a long string list with "stormname-year" structure.
[1] "Alberto-1988" "Beryl-1988" "Chris-1988" "Florence-1988" "Gilbert-1988" "Keith-1988" "Allison-1989" "Chantal-1989"
[9] "Hugo-1989" "Jerry-1989" "Bertha-1990" "Marco-1990" "Ana-1991" "Bob-1991" "Fabian-1991" "Notnamed-1991"
[17] "Andrew-1992" "Danielle-1992" "Earl-1992" "Arlene-1993" "Emily-1993" "Alberto-1994" "Beryl-1994" "Gordon-1994"
[25] "Allison-1995" "Dean-1995" "Erin-1995" "Gabrielle-1995" "Jerry-1995" "Opal-1995" "Arthur-1996" "Bertha-1996"
[33] "Edouard-1996" "Fran-1996" "Josephine-1996" "Subtrop-1997" "Ana-1997" "Danny-1997" "Bonnie-1998" "Charley-1998"
[41] "Earl-1998" "Frances-1998" "Georges-1998" "Hermine-1998" "Mitch-1998" "Bret-1999" "Dennis-1999" "Floyd-1999"
[49] "Harvey-1999" "Irene-1999" "Beryl-2000" "Gordon-2000" "Helene-2000" "Leslie-2000" "Allison-2001" "Barry-2001"
My question is how to split these elements based on same year. For example, I want to create a new variable "y1988" which is a list that has all storms in 1998. If I run y1988, it will output:
y1988
[1] "Alberto-1988" "Beryl-1988" "Chris-1988" "Florence-1988" "Gilbert-1988" "Keith-1988"
So as for y1989 until 2001. I am guessing it might use gsub() and a for-loop,however, I am a rookie in R, so really hope you could give me some suggestion.
We can use split with grouping variable created by removing the prefix substring including the - with sub.
lst <- split(storms, sub(".*-", "", storms))
lst$`1988`
#[1] "Alberto-1988" "Beryl-1988" "Chris-1988" "Florence-1988"
#[5] "Gilbert-1988" "Keith-1988"
data
storms <- c("Alberto-1988", "Beryl-1988", "Chris-1988", "Florence-1988",
"Gilbert-1988", "Keith-1988", "Allison-1989", "Chantal-1989",
"Hugo-1989", "Jerry-1989", "Bertha-1990", "Marco-1990", "Ana-1991",
"Bob-1991", "Fabian-1991", "Notnamed-1991", "Andrew-1992", "Danielle-1992",
"Earl-1992", "Arlene-1993", "Emily-1993", "Alberto-1994", "Beryl-1994",
"Gordon-1994", "Allison-1995", "Dean-1995", "Erin-1995", "Gabrielle-1995",
"Jerry-1995", "Opal-1995", "Arthur-1996", "Bertha-1996", "Edouard-1996",
"Fran-1996", "Josephine-1996", "Subtrop-1997", "Ana-1997", "Danny-1997",
"Bonnie-1998", "Charley-1998", "Earl-1998", "Frances-1998", "Georges-1998",
"Hermine-1998", "Mitch-1998", "Bret-1999", "Dennis-1999", "Floyd-1999",
"Harvey-1999", "Irene-1999", "Beryl-2000", "Gordon-2000", "Helene-2000",
"Leslie-2000", "Allison-2001", "Barry-2001")
Why don't you extract the year directly within your original dataframe? libraries dplyr and tidyr are well suited for problems like this.
I suggest the following:
library(dplyr)
library(tidyr)
hurr_tracks %>%
extract(storm_id, c("storm", "year"),"(.+)-(.+)")
Alternative way using stringr
split(storms,str_extract(storms,"[0-9]+"))

R - get values from multiple variables in the environment

I have some variables in my current R environment:
ls()
[1] "clt.list" "commands.list" "dirs.list" "eq" "hurs.list" "mlist" "prec.list" "temp.list" "vars"
[10] "vars.list" "wind.list"
where each one of the variables "clt.list", "hurs.list", "prec.list", "temp.list" and "wind.list" is a (huge) list of strings.
For example:
clt.list[1:20]
[1] "clt_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc" "clt_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc"
[3] "clt_Amon_bcc-csm1-1_historical_r1i1p1_185001-201212.nc" "clt_Amon_bcc-csm1-1-m_historical_r1i1p1_185001-201212.nc"
[5] "clt_Amon_BNU-ESM_historical_r1i1p1_185001-200512.nc" "clt_Amon_CanESM2_historical_r1i1p1_185001-200512.nc"
[7] "clt_Amon_CCSM4_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-BGC_historical_r1i1p1_185001-200512.nc"
[9] "clt_Amon_CESM1-CAM5_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-CAM5-1-FV2_historical_r1i1p1_185001-200512.nc"
[11] "clt_Amon_CESM1-FASTCHEM_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-WACCM_historical_r1i1p1_185001-200512.nc"
[13] "clt_Amon_CMCC-CESM_historical_r1i1p1_190001-190412.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_190001-200512.nc"
[15] "clt_Amon_CMCC-CESM_historical_r1i1p1_190501-190912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_191001-191412.nc"
[17] "clt_Amon_CMCC-CESM_historical_r1i1p1_191501-191912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_192001-192412.nc"
[19] "clt_Amon_CMCC-CESM_historical_r1i1p1_192501-192912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_193001-193412.nc"
What I need to do is extract the subset of the string that is between "Amon_" and "_historical".
I can do this for a single variable, as shown here:
levels(as.factor(sub(".*?Amon_(.*?)_historical.*", "\\1", clt.list[1:20])))
[1] "ACCESS1-0" "ACCESS1-3" "bcc-csm1-1" "bcc-csm1-1-m" "BNU-ESM" "CanESM2" "CCSM4"
[8] "CESM1-BGC" "CESM1-CAM5" "CESM1-CAM5-1-FV2" "CESM1-FASTCHEM" "CESM1-WACCM" "CMCC-CESM"
However, what I'd like to do is to run the command above for all the five variables at once. Instead of using just "ctl.list" as argument in the command above, I'd like to use all variables "clt.list", "hurs.list", "prec.list", "temp.list" and "wind.list" at once.
How can I do that?
Many thanks in advance!
You can put your operation into a function and then iterate over it:
get_my_substr <- function(vecname)
levels(as.factor(sub(".*?Amon_(.*?)_historical.*", "\\1", get(vecname))))
lapply(my_vecnames,get_my_substr)
lapply acts like a loop. You can create your list of vector names with
my_vecnames <- ls(pattern=".list$")
It is generally good practice to post a reproducible example in your question. Since none was provided here, I tested this approach with...
# example-maker
prestr <- "grr_Amon_"
posstr <- "_historical_zzz"
make_ex <- function()
replicate(
sample(10,1),
paste0(prestr,paste0(sample(LETTERS,sample(5,1)),collapse=""),posstr)
)
# make a couple examples
set.seed(1)
m01 <- make_ex()
m02 <- make_ex()
# test result
lapply(ls(pattern="^m[0-9][0-9]$"),get_my_substr)
One solution would be to create a vector containing the variable names that you want extract the data from, for example:
var.names <- c("clt.list", "commands.list", "dirs.list")
Then to access the value of each variable from the name:
for (var.name in var.names) {
var.value <- as.list(environment())[[var.name]]
# Do something with var.value
}

Resources