Calculate values based on matched substrings within names - r

I am trying to identify column names with matching substrings, and then calculate the differences of the values in those columns.
Sample data:
V1_ABC <- c(1,2,3,4)
V2_ABC <- c(2,3,4,5)
V1_WXYZ <- c(10,11,12,13)
V2_WXYZ <- c(11,12,13,14)
Date <- c(2001,2002,2003,2004)
So df looks like:
df <- data.frame(Date, V1_ABC, V2_ABC, V1_WXYZ, V2_WXYZ)
Date V1_ABC V2_ABC V1_WXYZ V2_WXYZ
1 2001 1 2 10 11
2 2002 2 3 11 12
3 2003 3 4 12 13
4 2004 4 5 13 14
I want to calculate V1 minus V2 for ABC and WXYZ. My original dataset is much larger, so I don't want to do this manually for each. I'd like to automate this so that R compares the column headers and finds which columns have the same ending substring (V1_ABC and V2_ABC, and V1_WXYZ and V2_WXYZ), then subtracts the V2_ from the V1_. Like this:
Date V1_ABC V2_ABC V1_WXYZ V2_WXYZ dif_ABC dif_WXYZ
1 2001 1 2 10 11 -1 -1
2 2002 2 3 11 12 -1 -1
3 2003 3 4 12 13 -1 -1
4 2004 4 5 13 14 -1 -1
Most of the functions I have found such as grep or intersect either look for a specific string you input, or return the values where the vectors are the same.
Any ideas on how to automate pairing based on names/substrings?

You could stack V1 and V2 separately, calculate the differences, and reshape them back to the wide form. This approach can deal with any number of V1_xxx and V2_xxx pairs.
library(tidyverse)
df %>%
pivot_longer(contains("_"), names_to = c(".value", "grp"), names_sep = "_") %>%
mutate(dif = V1 - V2) %>%
pivot_wider(names_from = grp, values_from = c(V1, V2, dif))
# # A tibble: 4 × 7
# Date V1_ABC V1_WXYZ V2_ABC V2_WXYZ dif_ABC dif_WXYZ
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2001 1 10 2 11 -1 -1
# 2 2002 2 11 3 12 -1 -1
# 3 2003 3 12 4 13 -1 -1
# 4 2004 4 13 5 14 -1 -1

Here is a base R solution. You mention that your data frame is large, so this checks for suffixes that are shared by exactly two columns and only operates on those. It assumes that the columns are all of the format "V1_suffix" and "V2_suffix", but it could easily be modified for other formats.
suffixes <- unlist(regmatches(names(df), gregexpr("_.+", names(df))))
# Limit to suffixes where there are 2
suffixes <- names(table(suffixes)[table(suffixes) == 2])
diffs <- sapply(suffixes,
\(suffix) df[[paste0("V1", suffix)]] - df[[paste0("V2", suffix)]]
)
diff_df <- data.frame(diffs) |>
setNames(paste0("dif", suffixes))
cbind(df, diff_df)
# Date V1_ABC V2_ABC V1_WXYZ V2_WXYZ dif_ABC dif_WXYZ
# 1 2001 1 2 10 11 -1 -1
# 2 2002 2 3 11 12 -1 -1
# 3 2003 3 4 12 13 -1 -1
# 4 2004 4 5 13 14 -1 -1
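As a rough sketch of the same pairing idea, the suffixes can also be derived by stripping the prefix and grouping the column names with split(). The regular expression "^V[12]_" and the helper names value_cols, suffix, and pairs are assumptions made for illustration; adjust the pattern if your prefixes differ.

```r
# Sample data from the question
df <- data.frame(
  Date = 2001:2004,
  V1_ABC = 1:4, V2_ABC = 2:5,
  V1_WXYZ = 10:13, V2_WXYZ = 11:14
)

# Columns that look like V1_<suffix> or V2_<suffix> (assumed naming scheme)
value_cols <- grep("^V[12]_", names(df), value = TRUE)
suffix <- sub("^V[12]_", "", value_cols)

# Group column names by suffix and keep only complete pairs
pairs <- split(value_cols, suffix)
pairs <- Filter(function(p) length(p) == 2, pairs)

# Add one dif_<suffix> column per pair: V1 minus V2
for (s in names(pairs)) {
  df[[paste0("dif_", s)]] <- df[[paste0("V1_", s)]] - df[[paste0("V2_", s)]]
}
df
```

Suffixes that appear only once (for example a V1 column with no V2 partner) are dropped by the Filter() step, so they do not produce an error.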

Related

Aggregating Max using h2o in R

I have started using h2o for aggregating large datasets and I have found peculiar behaviour when trying to aggregate the maximum value using h2o's h2o.group_by function. My dataframe often has variables which comprise some or all NA's for a given grouping. Below is an example dataframe.
df <- data.frame("ID" = 1:16)
df$Group<- c(1,1,1,1,2,2,2,3,3,3,4,4,5,5,5,5)
df$VarA <- c(NA_real_,1,2,3,12,12,12,12,0,14,NA_real_,14,16,16,NA_real_,16)
df$VarB <- c(NA_real_,NA_real_,NA_real_,NA_real_,10,12,14,16,10,12,14,16,10,12,14,16)
df$VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
ID Group VarA VarB VarD
1 1 1 NA NA 10
2 2 1 1 NA 12
3 3 1 2 NA 14
4 4 1 3 NA 16
5 5 2 12 10 10
6 6 2 12 12 12
7 7 2 12 14 14
8 8 3 12 16 16
9 9 3 0 10 10
10 10 3 14 12 12
11 11 4 NA 14 14
12 12 4 14 16 16
13 13 5 16 10 10
14 14 5 16 12 12
15 15 5 NA 14 14
16 16 5 16 16 16
In this dataframe Group == 1 is completely missing data for VarB (but this is important information to know, so the output for aggregating for the maximum should be NA), while for Group == 1 VarA only has one missing value so the maximum should be 3.
This link describes the behaviour of the na.methods argument (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/groupby.html).
If I set the na.methods = 'all' as below then the aggregated output is NA for Group 1 for both Vars A and B (which is not what I want, but I completely understand this behaviour).
h2o_agg <- h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "all"))
Group max_ID max_VarA max_VarB max_VarD
1 1 4 NaN NaN 16
2 2 7 12 14 14
3 3 10 14 16 16
4 4 12 NaN 16 16
5 5 16 NaN 16 16
If I set the na.methods = 'rm' as below then the aggregated output for Group 1 is 3 for VarA (which is the desired output and makes complete sense) but for VarB is -1.80e308 (which is not what I want, and I do not understand this behaviour).
h2o_agg <- h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "rm"))
Group max_ID max_VarA max_VarB max_VarD
<int> <int> <int> <dbl> <int>
1 1 4 3 -1.80e308 16
2 2 7 12 1.4 e 1 14
3 3 10 14 1.6 e 1 16
4 4 12 14 1.6 e 1 16
5 5 16 16 1.6 e 1 16
Similarly, I get the same output if I set na.methods = 'ignore'.
h2o_agg <- h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "ignore"))
Group max_ID max_VarA max_VarB max_VarD
<int> <int> <int> <dbl> <int>
1 1 4 3 -1.80e308 16
2 2 7 12 1.4 e 1 14
3 3 10 14 1.6 e 1 16
4 4 12 14 1.6 e 1 16
5 5 16 16 1.6 e 1 16
I am not sure why something as common as completely missing data for a given variable within a specific group is being given a value of -1.80e308. I tried the same workflow in dplyr and got results which match my expectations (but this is not a solution, as I cannot process datasets of this size in dplyr, hence my need for a solution in h2o). I realise dplyr is giving me -Inf values rather than NA, and I can easily recode both -1.80e308 and -Inf to NA, but I am trying to make sure that this isn't a symptom of a larger problem in h2o (or that I am not doing something fundamentally wrong in my code when attempting to aggregate in h2o). I also have to aggregate normalised datasets which often have values close to -1.80e308, so I do not want to accidentally recode legitimate values to NA.
library(dplyr)
df %>%
group_by(Group) %>%
summarise(across(everything(), ~max(.x, na.rm = TRUE)))
Group ID VarA VarB VarD
<dbl> <int> <dbl> <dbl> <dbl>
1 1 4 3 -Inf 16
2 2 7 12 14 14
3 3 10 14 16 16
4 4 12 14 16 16
5 5 16 16 16 16
This is happening because H2O considers value -Double.MAX_VALUE to be the lowest possible representable floating-point number. This value corresponds to -1.80e308. I agree this is confusing and I would consider this to be a bug. You can file an issue in our bug tracker: https://h2oai.atlassian.net/ (PUBDEV project)
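If you do recode in R after pulling the aggregate back, an exact comparison against the most negative finite double avoids touching values that are merely close to it. This is only a sketch: it assumes the sentinel arrives in R as exactly -.Machine$double.xmax, and the data frame agg below is made up to stand in for the converted result.

```r
# Most negative finite double in R, about -1.797693e308
sentinel <- -.Machine$double.xmax

# Hypothetical aggregate, standing in for the result converted to a data.frame
agg <- data.frame(Group = 1:3, max_VarB = c(sentinel, 14, 16))

# Recode only values that are exactly the sentinel; nearby values are untouched
agg$max_VarB[agg$max_VarB == sentinel] <- NA
agg
```

Values produced by dplyr as -Inf would need a separate check, for example with is.infinite().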
Not sure how to achieve that with h2o.group_by() – I get the same weird value when running your code. If you are open for a somewhat hacky workaround, you might want to try the following (I included the part on H2O initialization for future reference):
convert your frame to long format, ie key-value representation
split by group and apply aggregate function using h2o.ddply()
convert your frame back to wide format
## initialize h2o
library(h2o)
h2o.init(
nthreads = parallel::detectCores() * 0.5
)
df_h2o = as.h2o(
df
)
## aggregate per group
df_h2o |>
# convert to long format
h2o.melt(
id_vars = "Group"
, skipna = TRUE # does not include `NA` in the result
) |>
# calculate `max()` per group
h2o.ddply(
.variables = c("Group", "variable")
, FUN = function(df) {
max(df[, 3])
}
) |>
# convert back to wide format
h2o.pivot(
index = "Group"
, column = "variable"
, value = "ddply_C1"
)
# Group ID VarA VarB VarD
# 1 4 3 NaN 16
# 2 7 12 14 14
# 3 10 14 16 16
# 4 12 14 16 16
# 5 16 16 16 16
#
# [5 rows x 5 columns]
## shut down h2o instance
h2o.shutdown(
prompt = FALSE
)

Add a unique identifier to the same column value in R data frame

I have a data frame as follows:
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6
4 4 25 7
5 5 3 7
6 6 34 7
For each row with the sample_id, I would like to add a unique identifier as follows:
index val sample_id
1 1 14 5
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
Any suggestion? Thank you for your help.
Base R
dat$id2 <- ave(dat$sample_id, dat$sample_id,
FUN = function(z) if (length(z) > 1) paste(z, LETTERS[seq_along(z)], sep = "-") else as.character(z))
dat
# index val sample_id id2
# 1 1 14 5 5
# 2 2 22 6 6-A
# 3 3 1 6 6-B
# 4 4 25 7 7-A
# 5 5 3 7 7-B
# 6 6 34 7 7-C
tidyverse
library(dplyr)
dat %>%
group_by(sample_id) %>%
mutate(id2 = if (n() > 1) paste(sample_id, LETTERS[row_number()], sep = "-") else as.character(sample_id)) %>%
ungroup()
Minor note: it might be tempting to drop the as.character(.) from either or both of the code blocks. In the first, nothing will change with this data, because base R lets you be a little sloppy and coerces the result to character. If we rely on that and need the new field to always be character, though, then in the rare case where every row has a unique sample_id the column will remain integer. dplyr guards against this much more carefully: if you run the tidyverse code without as.character, you'll see an error.
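To make the point about as.character concrete, here is a small sketch with made-up data in which every sample_id occurs exactly once. The vector ids and the helper tag() are hypothetical names used only for this illustration.

```r
ids <- c(5L, 6L, 7L)  # hypothetical data: every sample_id is unique

# Same logic as the ave() call above, with the single-row fallback made pluggable
tag <- function(z, fallback) {
  if (length(z) > 1) paste(z, LETTERS[seq_along(z)], sep = "-") else fallback(z)
}

without_cast <- ave(ids, ids, FUN = function(z) tag(z, identity))
with_cast    <- ave(ids, ids, FUN = function(z) tag(z, as.character))

class(without_cast)  # stays integer, since no group ever returns a string
class(with_cast)     # character, regardless of the group sizes
```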
Using dplyr:
library(dplyr)
dplyr::group_by(df, sample_id) %>%
dplyr::mutate(sample_id = paste(sample_id, LETTERS[seq_along(sample_id)], sep = "-"))
index val sample_id
<int> <dbl> <chr>
1 1 14 5-A
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
If you just want to create unique tags for the same sample_id, maybe you can try make.unique like below
transform(
df,
sample_id = ave(as.character(sample_id),sample_id,FUN = function(x) make.unique(x,sep = "_"))
)
which gives
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6_1
4 4 25 7
5 5 3 7_1
6 6 34 7_2

pull information from each unique pair

I have coordinates for each site and the year each site was sampled (fake dataframe below).
dfA<-matrix(nrow=20,ncol=3)
dfA<-as.data.frame(dfA)
colnames(dfA)<-c("LAT","LONG","YEAR")
#fill LAT
dfA[,1]<-rep(1:5,4)
#fill LONG
dfA[,2]<-c(rep(11:15,3),16:20)
#fill YEAR
dfA[,3]<-2001:2020
> dfA
LAT LONG YEAR
1 1 11 2001
2 2 12 2002
3 3 13 2003
4 4 14 2004
5 5 15 2005
6 1 11 2006
7 2 12 2007
8 3 13 2008
9 4 14 2009
10 5 15 2010
11 1 11 2011
12 2 12 2012
13 3 13 2013
14 4 14 2014
15 5 15 2015
16 1 16 2016
17 2 17 2017
18 3 18 2018
19 4 19 2019
20 5 20 2020
I'm trying to pull out the year each unique location was sampled. So I first pulled out each unique location and the times it was sampled using the following code
dfB <- dfA %>%
group_by(LAT, LONG) %>%
summarise(Freq = n())
dfB<-as.data.frame(dfB)
LAT LONG Freq
1 1 11 3
2 1 16 1
3 2 12 3
4 2 17 1
5 3 13 3
6 3 18 1
7 4 14 3
8 4 19 1
9 5 15 3
10 5 20 1
I am now trying to get the year for each unique location. I.e. I ultimately want this:
LAT LONG Freq . Year
1 1 11 3 . 2001,2006,2011
2 1 16 1 . 2016
3 2 12 3 . 2002,2007,2012
4 2 17 1
5 3 13 3
6 3 18 1
7 4 14 3
8 4 19 1
9 5 15 3
10 5 20 1
This is what I've tried:
1) Find which rows in dfA correspond to rows in dfB:
dfB$obs_Year<-NA
idx <- match(paste(dfA$LAT,dfA$LONG), paste(dfB$LAT,dfB$LONG))
> idx
[1] 1 3 5 7 9 1 3 5 7 9 1 3 5 7 9 2 4 6 8 10
So idx[1] means row 1 of dfA matches row 1 of dfB, and rows 6 and 11 of dfA also match row 1 of dfB.
I've tried this to extract info:
for (row in 1:20){
year<-as.character(dfA$YEAR[row])
tmp<-dfB$obs_Year[idx[row]]
if(isTRUE(is.na(dfB$obs_Year[idx[row]]))){
dfB$obs_Year[idx[row]]<-year
}
if(isFALSE(is.na(dfB$obs_Year[idx[row]]))){
dfB$obs_Year[idx[row]]<-as.list(append(tmp,year))
}
}
I keep getting this error code:
number of items to replace is not a multiple of replacement length
Does anyone know how to extract years from matching pairs of dfA to dfB? I don't know if this is the most efficient code but this is as far as I've gotten....Thanks in advance!
You can do this with a dplyr chain that first builds your year column and then filters down to only unique observations.
The logic is to build the year variable by grouping your data by location, and then pasting all the years for a given location into a single string variable, which we call year_string. We also compute the frequency, although this is not strictly necessary.
The only column in your data that varies over time is YEAR, meaning that if we exclude that column you would see values repeated for locations. So we exclude the YEAR column and then ask R to return unique() values of the data.frame to us. It will pick one of the observations per location where multiple occur, but since they are identical that doesn't matter.
Code below:
library(dplyr)
dfA<-matrix(nrow=20,ncol=3)
dfA<-as.data.frame(dfA)
colnames(dfA)<-c("LAT","LONG","YEAR")
#fill LAT
dfA[,1]<-rep(1:5,4)
#fill LONG
dfA[,2]<-c(rep(11:15,3),16:20)
#fill YEAR
dfA[,3]<-2001:2020
# We assign the output to dfB
dfB <- dfA %>% group_by(LAT, LONG) %>% # We group by locations
mutate( # The mutate verb is for building new variables.
year_string = paste(YEAR, collapse = ","), # the function paste()
# collapses the vector YEAR into a string
# the argument collapse = "," says to
# separate each element of the string with a comma
Freq = n()) %>% # I compute the frequency as you did
select(LAT, LONG, Freq, year_string) %>%
# Now I select only the columns that index
# location, frequency and the combined years
unique() # Now I filter for only unique observations. Since I have not picked
# YEAR in the select function only unique locations are retained
dfB
#> # A tibble: 10 x 4
#> # Groups: LAT, LONG [10]
#> LAT LONG Freq year_string
#> <int> <int> <int> <chr>
#> 1 1 11 3 2001,2006,2011
#> 2 2 12 3 2002,2007,2012
#> 3 3 13 3 2003,2008,2013
#> 4 4 14 3 2004,2009,2014
#> 5 5 15 3 2005,2010,2015
#> 6 1 16 1 2016
#> 7 2 17 1 2017
#> 8 3 18 1 2018
#> 9 4 19 1 2019
#> 10 5 20 1 2020
Created on 2019-01-21 by the reprex package (v0.2.1)
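The same group-and-paste idea can also be sketched in base R with aggregate(), in case loading dplyr is not an option. The column name year_string is reused from the answer above; computing Freq by counting the pasted years is just one way to do it and is an assumption of this sketch.

```r
# Sample data from the question
dfA <- data.frame(
  LAT  = rep(1:5, 4),
  LONG = c(rep(11:15, 3), 16:20),
  YEAR = 2001:2020
)

# One row per (LAT, LONG) pair, with all years pasted into a single string
dfB <- aggregate(
  YEAR ~ LAT + LONG,
  data = dfA,
  FUN = function(y) paste(y, collapse = ",")
)
names(dfB)[names(dfB) == "YEAR"] <- "year_string"
dfB$year_string <- as.character(dfB$year_string)

# Frequency = number of years recorded for each location
dfB$Freq <- lengths(strsplit(dfB$year_string, ","))
dfB
```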

Create multiple sums

Ciao,
Here is a reproducible example.
df <- data.frame("STUDENT"=c(1,2,3,4,5),
"TEST1A"=c(NA,5,5,6,7),
"TEST2A"=c(NA,8,4,6,9),
"TEST3A"=c(NA,10,5,4,6),
"TEST1B"=c(5,6,7,4,1),
"TEST2B"=c(10,10,9,3,1),
"TEST3B"=c(0,5,6,9,NA),
"TEST1TOTAL"=c(NA,23,14,16,22),
"TEST2TOTAL"=c(10,16,15,12,NA))
I have columns STUDENT through TEST3B and want to create TEST1TOTAL TEST2TOTAL. TEST1TOTAL=TEST1A+TEST2A+TEST3A and so on for TEST2TOTAL. If there is any missing score in TEST1A TEST2A TEST3A then TEST1TOTAL is NA.
Here is my attempt, but is there a solution with fewer lines of code? As it stands, I would need to write this line out many times, because there are tests from A through O.
TEST1TOTAL=rowSums(df[,c('TEST1A', 'TEST2A', 'TEST3A')], na.rm=TRUE)
Using just R base functions:
output <- data.frame(df1, do.call(cbind, lapply(c("A$", "B$"), function(x) rowSums(df1[, grep(x, names(df1))]))))
Customizing colnames:
> colnames(output)[(ncol(output)-1):ncol(output)] <- c("TEST1TOTAL", "TEST2TOTAL")
> output
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL
1 1 NA NA NA 5 10 0 NA 15
2 2 5 8 10 6 10 5 23 21
3 3 5 4 5 7 9 6 14 22
4 4 6 6 4 4 3 9 16 16
5 5 7 9 6 1 1 NA 22 NA
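Since the question mentions tests running from A through O, the letters can be discovered from the column names instead of being typed out. The following is a sketch under the assumption that score columns are named TEST<number><letter> and that totals are numbered in the order the letters appear; score_cols and test_letters are illustrative names.

```r
# Score columns from the question, without the precomputed totals
df <- data.frame(
  STUDENT = 1:5,
  TEST1A = c(NA, 5, 5, 6, 7), TEST2A = c(NA, 8, 4, 6, 9), TEST3A = c(NA, 10, 5, 4, 6),
  TEST1B = c(5, 6, 7, 4, 1),  TEST2B = c(10, 10, 9, 3, 1), TEST3B = c(0, 5, 6, 9, NA)
)

# Find the score columns and the distinct trailing letters (A, B, ...)
score_cols   <- grep("^TEST[0-9]+[A-Z]$", names(df), value = TRUE)
test_letters <- unique(sub("^TEST[0-9]+", "", score_cols))

# One total per letter; rowSums() without na.rm gives NA if any score is missing
for (i in seq_along(test_letters)) {
  cols <- grep(paste0(test_letters[i], "$"), score_cols, value = TRUE)
  df[[paste0("TEST", i, "TOTAL")]] <- rowSums(df[, cols])
}
df
```

Leaving na.rm at its default of FALSE is what makes a total NA whenever one of its scores is missing.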
Try:
library(dplyr)
df %>%
mutate(TEST1TOTAL = TEST1A+TEST2A+TEST3A,
TEST2TOTAL = TEST1B+TEST2B+TEST3B)
or
df %>%
mutate(TEST1TOTAL = rowSums(select(df, ends_with("A"))),
TEST2TOTAL = rowSums(select(df, ends_with("B"))))
I think for what you want, Jilber Urbina's solution is the way to go. For completeness sake (and because I learned something figuring it out) here's a tidyverse way to get the score totals by test number for any number of tests.
The advantage is you don't need to specify the identifiers for the tests (beyond that they're numbered or have a trailing letter) and the same code will work for any number of tests.
library(tidyverse)
df_totals <- df %>%
gather(test, score, -STUDENT) %>% # Convert from wide to long format
mutate(test_num = paste0('TEST', gsub('[^0-9]', '', test),
'TOTAL'), # Extract test_number from variable
test_let = gsub('TEST[0-9]*', '', test)) %>% # Extract test_letter (optional)
group_by(STUDENT, test_num) %>% # group by student + test
summarize(score_tot = sum(score)) %>% # Sum score by student/test
spread(test_num, score_tot) # Spread back to wide format
df_totals
# A tibble: 5 x 4
# Groups: STUDENT [5]
STUDENT TEST1TOTAL TEST2TOTAL TEST3TOTAL
<dbl> <dbl> <dbl> <dbl>
1 1 NA NA NA
2 2 11 18 15
3 3 12 13 11
4 4 10 9 13
5 5 8 10 NA
If you want the individual scores too, just join the totals together with the original:
left_join(df, df_totals, by = 'STUDENT')
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL TEST3TOTAL
1 1 NA NA NA 5 10 0 NA NA NA
2 2 5 8 10 6 10 5 11 18 15
3 3 5 4 5 7 9 6 12 13 11
4 4 6 6 4 4 3 9 10 9 13
5 5 7 9 6 1 1 NA 8 10 NA

Trying to keep values of a column based on the unique values of two other columns

I want to keep only the 2 largest values in a column of a df according to the unique pair of values in two other columns. e.g., I have this df:
df <- data.frame('ID' = c(1,1,1,2,2,3,4,4,4,5),
'YEAR' = c(2002,2002,2003,2002,2003,2005,2010,2011,2012,2008),
'WAGES' = c(100,98,60,120,80,300,50,40,30,500));
And I want to drop the 3rd and 9th rows, equivalently, keep the first two largest values in WAGES column. The df has roughly 300,000 rows.
You can use dplyr's top_n:
library(dplyr)
df %>%
group_by(ID) %>%
top_n(n = 2, wt = WAGES)
## A tibble: 8 x 3
## Groups: ID [5]
# ID YEAR WAGES
# <dbl> <dbl> <dbl>
#1 1 2001 100
#2 1 2002 98
#3 2 2002 120
#4 2 2003 80
#5 3 2005 300
#6 4 2010 50
#7 4 2011 40
#8 5 2008 500
If I understood your question correctly, using base R:
for (i in 1:2) {
max_row <- which.max(df$WAGES)
df <- df[-c(max_row), ]
}
df
# ID YEAR WAGES
# 1 1 2001 100
# 2 1 2002 98
# 3 1 2003 60
# 4 2 2002 120
# 5 2 2003 80
# 7 4 2010 50
# 8 4 2011 40
# 9 4 2012 30
Note the minus sign (-) and the comma (,) in df <- df[-c(max_row), ].
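For comparison, here is a base R sketch that keeps the two largest WAGES within each ID rather than working on the data frame as a whole. It assumes that "per ID" is the grouping the question intends and that ties should be broken by row order; rank_in_id and top2 are illustrative names.

```r
# Sample data from the question
df <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 5),
  YEAR  = c(2002, 2002, 2003, 2002, 2003, 2005, 2010, 2011, 2012, 2008),
  WAGES = c(100, 98, 60, 120, 80, 300, 50, 40, 30, 500)
)

# Rank wages within each ID, largest first (rank 1 = highest wage)
rank_in_id <- ave(-df$WAGES, df$ID,
                  FUN = function(w) rank(w, ties.method = "first"))

# Keep the top two rows per ID; this drops rows 3 and 9
top2 <- df[rank_in_id <= 2, ]
top2
```

Because ave() is vectorised over groups, this should scale to a few hundred thousand rows without an explicit loop.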
