Finding the location of half the max value in a column in R

I have a list of data frames, and I'd like to apply a function to that list to find the location in the "julian" column that corresponds to half the max value in the "total_cover" column. Here's some data that represents the data I have:
df1 <- data.frame(julian = c(81,85,88,97,101,104,126,167),
                  total_cover = c(43,52,75,92,94,97,188,172))
df2 <- data.frame(julian = c(81,85,88,97,101,104,126,167),
                  total_cover = c(30,55,73,80,75,85,138,154))
df3 <- data.frame(julian = c(107,111,115,119,123,129,131,133,135,137),
                  total_cover = c(36,41,43,47,55,55,55,65,75,80))
data.list <- list(df1=df1, df2=df2, df3=df3)
The code below is what I've tried, but I'm not getting the correct output. It doesn't seem to be finding the julian day that corresponds to half the max value:
unlist(lapply(X = data.list, FUN = function(x){
  x[which.max(x[["total_cover"]] >= which.max(x[["total_cover"]])/2), "julian"]
}))
output:
df1 df2 df3
81 81 107
My ideal output would be what's shown below, with the julian dates that correspond to >= max(total_cover)/2
df1 df2 df3
101 97 111
Using R 4.2.2

I believe the following answers the question. (In the original attempt, the inner which.max() returns the position of the maximum rather than its value, so the comparison is against half an index; every total_cover clears that threshold, and which.max() then returns the first row.)
sapply(data.list, \(x) {
  half_max <- max(x$total_cover)/2
  d <- abs(x$total_cover - half_max)    # distance from half the max
  is.na(d) <- x$total_cover < half_max  # mask values below half the max
  x$julian[which.min(d)]                # julian of the closest remaining value
})
#> df1 df2 df3
#> 101 97 111
Created on 2022-12-13 with reprex v2.0.2

find_julian <- function(df){
  # calculate the distance from half of the maximum
  distance <- df[["total_cover"]] - max(df[["total_cover"]])/2
  # find the smallest value at or above half of the maximum
  # and select the corresponding julian
  df[distance == min(distance[distance >= 0]), "julian"]
}
unlist(lapply(X = data.list, FUN = find_julian))
df1 df2 df3
101  97 111

Here is a step-by-step dplyr solution. The main issue is that the difference from half the maximum is sometimes negative, and those cases have to be removed. The wrong result
df1 df2 df3
 81  81 107
occurs because the code does not take negative differences into account.
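A small illustration with df2's total_cover values (a sketch; half the max is 77, and the value closest to it, 75, lies below it):
x <- c(30, 55, 73, 80, 75, 85, 138, 154)  # df2$total_cover
half <- max(x)/2                          # 77
x[which.min(abs(x - half))]               # 75: closest overall, but below half
cand <- x[x >= half]                      # keep only values at or above half
cand[which.min(cand - half)]              # 80: the correct pick (julian 97)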
Long version:
library(dplyr)

bind_rows(data.list, .id = 'id') %>%
  group_by(id) %>%
  mutate(x = max(total_cover)/2) %>%
  mutate(y = total_cover - x) %>%
  filter(y >= 0) %>%
  filter(y == min(y)) %>%
  select(1:2) %>%
  pull(julian, name = id)
Or a little shorter:
bind_rows(data.list, .id = 'id') %>%
  group_by(id) %>%
  filter(total_cover - (max(total_cover)/2) >= 0) %>%
  filter(total_cover == min(total_cover)) %>%
  select(1:2) %>%
  pull(julian, name = id)
result:
df1 df2 df3
101 97 111

Related

Shifting Values in R in rows

I have a problem that sounds easy; however, I could not find a solution in R. I would like to shift values according to the first year of release: the first column holds the release year, and the remaining columns are the years in which devices broke (the values are counts of broken devices).
This is a solution in Python:
import numpy as np
import pandas as pd

def f(x):
    shifted = np.argmin((x.index.astype(int) < x.name[0]))
    return x.shift(-shifted)

df = df.set_index(['Delivery Year', 'Freq']).apply(f, axis=1)
df.columns = [f'Year.{i + 1}' for i in range(len(df.columns))]
df = df.reset_index()
df
I would like to have it in R too.
# TEST
data <- data.frame(
  `Delivery Year` = c('1976','1977','1978','1979'),
  `Freq` = c(120,100,80,60),
  `Year.1976` = c(10,NA,NA,NA),
  `Year.1977` = c(5,3,NA,NA),
  `Year.1978` = c(10,NA,8,NA),
  `Year.1979` = c(13,10,5,14)
)
data
# DESIRED
data <- data.frame(
  `Delivery Year` = c('1976','1977','1978','1979'),
  `Freq` = c(120,100,80,60),
  `Year.1` = c(10,3,8,14),
  `Year.2` = c(5,NA,5,NA),
  `Year.3` = c(10,10,NA,NA),
  `Year.4` = c(13,NA,NA,NA)
)
data
In addition, would it also be possible to transform the number of broken devices into a percentage of the Freq column?
Thank you
Using tidyverse:
library(tidyverse)

data %>%
  pivot_longer(!c(Delivery.Year, Freq)) %>%
  separate(name, c("Lab", "Year")) %>%
  select(-Lab) %>%
  mutate_all(as.numeric) %>%
  filter(Year >= Delivery.Year) %>%
  group_by(Delivery.Year, Freq) %>%
  mutate(ind = paste0("Year.", row_number()),
         per = value/Freq) %>%
  ungroup() %>%
  pivot_wider(id_cols = c(Delivery.Year, Freq), names_from = ind, values_from = c(value, per))
I pivoted the data into long form to begin with and separated the original column names (Year.1976, Year.1977, etc.) so as to keep just the years, dropping the "Year" piece. Then I converted all columns to numeric to allow mathematical operations such as filtering for Year >= Delivery.Year. I then created a column with the titles you requested (Year.1, Year.2, etc.) and calculated the percentage. Finally, pivot_wider puts it in the format you requested. One thing to note: I was unclear whether you wanted both the original values and the percentages or just the percentages. If you only want the percentages, values_from = per should do it for you.
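For instance, a sketch of that percent-only variant (the same pipeline with only the final values_from changed):
library(tidyverse)

data %>%
  pivot_longer(!c(Delivery.Year, Freq)) %>%
  separate(name, c("Lab", "Year")) %>%
  select(-Lab) %>%
  mutate_all(as.numeric) %>%
  filter(Year >= Delivery.Year) %>%
  group_by(Delivery.Year, Freq) %>%
  mutate(ind = paste0("Year.", row_number()),
         per = value/Freq) %>%
  ungroup() %>%
  pivot_wider(id_cols = c(Delivery.Year, Freq), names_from = ind, values_from = per)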
library(dplyr)

f <- function(df) {
  # year columns, sorted by the year embedded in their names
  years <- paste0("Year.", sort(as.vector(na.omit(as.integer(stringr::str_extract(colnames(df), "\\d+"))))))
  df1 <- df %>% select(all_of(years))
  df2 <- df %>% select(-all_of(years))
  val <- c()
  firstyear <- years[1]
  for (k in 1:nrow(df1)) {
    vec <- as.numeric(as.vector(df1[k, ]))
    # first non-NA value in the row
    val[k] <- (as.numeric(suppressWarnings(na.omit(vec))))[1]
  }
  df1[firstyear] <- val
  colnames(df1) <- paste0("Year.", seq_len(ncol(df1)))
  df <- cbind(df2, df1)
  print(df)
}
> f(data)
  Delivery.Year Freq Year.1 Year.2 Year.3 Year.4
1          1976  120     10      5     10     13
2          1977  100      3      3     NA     10
3          1978   80      8     NA      8      5
4          1979   60     14     NA     NA     14
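If the goal is exactly the desired layout from the question, here is a base-R sketch of the left shift itself (shift_left is a hypothetical helper; it assumes data holds the # TEST frame and that the year columns are in ascending order):
# shift a row's values k positions to the left, padding with NA
shift_left <- function(v, k) if (k > 0) c(v[-(1:k)], rep(NA, k)) else v

year_cols <- grep("^Year\\.", names(data), value = TRUE)
years <- as.integer(sub("Year\\.", "", year_cols))

res <- data
res[year_cols] <- t(mapply(shift_left,
                           asplit(as.matrix(data[year_cols]), 1),
                           as.integer(data$Delivery.Year) - min(years)))
names(res)[match(year_cols, names(res))] <- paste0("Year.", seq_along(year_cols))
res

# and, as also asked, the counts as a share of Freq:
res[paste0("Year.", seq_along(year_cols))] <-
  res[paste0("Year.", seq_along(year_cols))] / res$Freq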

Addressing another dataframe with dynamic variables in R (ideally dplyr, using mutate)

I have already read a variety of threads on dynamically named variables, but I couldn't quite find an answer.
I have two dataframes.
df <- data.frame(qno=c(1,2,3,4))
ref <- data.frame(Q1 = c(1:20),Q2 = c(21:40),Q3=c(41:60),Q4 = c(61:80))
Now I want to create another column 'average' in the df dataframe which gives me the average of each column in ref.
Intended output:
df <- data.frame(qno=c(1,2,3,4), average = c(10.5,30.5,50.5,70.5))
Here is what I have tried:
df <- df %>%
  mutate(average := mean(!!as.name(paste0("ref$Q", qno))))
I have also tried a version with a for loop, but that didn't work either.
for (i in 1:length(df$qno)){
  df$average[i] <- mean(as.name(paste0("ref$Q", df$qno[i])))
}
df <- df %>%
  mutate(average = mean(as.name(paste0("ref$Q", qno))))
Here it is with mutate:
df %>% mutate(average = t(ref %>% summarise(across(everything(), ~mean(.x, na.rm = TRUE)))))
  qno average
1   1    10.5
2   2    30.5
3   3    50.5
4   4    70.5
But you can use it without mutate entirely if you want the names from ref:
t(ref %>% summarise(across(everything(), list(mean), .names = "{.col}"))) %>%
  data.frame() %>%
  rename(average = 1)
   average
Q1    10.5
Q2    30.5
Q3    50.5
Q4    70.5
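For the dynamic lookup the question is actually about, the column name can be built with paste0() and used with [[ (a base-R sketch; as.name() cannot resolve "ref$Q1" as a single symbol, which is why the original attempts fail):
# look up each qno's column in ref by its constructed name and take the mean
df$average <- vapply(df$qno,
                     function(i) mean(ref[[paste0("Q", i)]], na.rm = TRUE),
                     numeric(1))
df
#   qno average
# 1   1    10.5
# 2   2    30.5
# 3   3    50.5
# 4   4    70.5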
Does this solve your problem?
ref <- data.frame(Q1 = c(1:20), Q2 = c(21:40), Q3 = c(41:60), Q4 = c(61:80))
out <- data.frame(qno = c(1,2,3,4), average = c(10.5,30.5,50.5,70.5))
df <- data.frame(qno = c(1:length(ref)))
for (i in seq_along(ref)) {
  df$average[i] <- mean(ref[[i]], na.rm = TRUE)
}
I was not really sure whether you wanted to name the rows like the variables; if so, you could just add this when you create the df object:
df <- data.frame(qno = paste0("Q", c(1:length(ref))))

Closest value to a specific column in R

I would like to find the closest value to column x3 below.
data=data.frame(x1=c(24,12,76),x2=c(15,30,20),x3=c(45,27,15))
data
x1 x2 x3
1 24 15 45
2 12 30 27
3 76 20 15
So desired output will be
Closest_Value_to_x3
24
30
20
Please help. Thank you
Use max.col(-abs(data[, 3] - data[, -3])) to find, for each row, the column position of the closest value, then use this result as part of an index matrix (built with cbind) to extract the desired values from your data:
col <- 3
data[, -col][cbind(1:nrow(data),
                   max.col(-abs(data[, col] - data[, -col])))]
#[1] 24 30 20
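To see why this works: a two-column matrix used as a subscript selects one (row, column) pair per line, so each row contributes exactly one value:
idx <- cbind(1:nrow(data), max.col(-abs(data[, 3] - data[, -3])))
idx
#      [,1] [,2]
# [1,]    1    1
# [2,]    2    2
# [3,]    3    2
data[, -3][idx]
#[1] 24 30 20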
A tidyverse solution:
library(tidyverse)

data %>%
  rowid_to_column() %>%
  gather(var, val, -c(x3, rowid)) %>%
  mutate(temp = x3 - val) %>%
  group_by(rowid) %>%
  filter(abs(temp) == min(abs(temp))) %>%
  ungroup() %>%
  select(val)
    val
  <dbl>
1    24
2    30
3    20
First, it adds a row ID. Second, it transforms the data from wide to long. Third, it calculates the difference between "x3" and the other variables. Finally, it groups by the row ID and keeps the rows where the absolute difference is the smallest.
Or:
data %>%
  rowid_to_column() %>%
  gather(var, val, -c(x3, rowid)) %>%
  mutate(temp = x3 - val) %>%
  group_by(rowid) %>%
  filter(abs(temp) == min(abs(temp))) %>%
  ungroup() %>%
  pull(val)
[1] 24 30 20
Or using an approach originally proposed by #markus (it assumes that your columns are named "x"):
data %>%
  mutate(temp = paste0("x", max.col(-abs(.[, -3] - .[, 3])))) %>%
  rowwise() %>%
  summarise(val = eval(as.symbol(temp)))
    val
  <dbl>
1    24
2    30
3    20
First, it determines the column index of the variable whose absolute difference with regard to "x3" is smallest and combines it with "x". Then it evaluates that combination of "x" and column index as a variable and returns the appropriate value.
Also borrowing the idea from #markus (not assuming that your columns are named "x"):
data %>%
  mutate(temp = max.col(-abs(.[, -3] - .[, 3]))) %>%
  rowwise %>%
  mutate(temp = names(.)[[temp]]) %>%
  summarise(val = eval(as.symbol(temp)))
First, it determines the column index of the variable whose absolute difference with regard to "x3" is smallest. Second, it returns the column name based on that index. Finally, it evaluates the name as a variable and returns the appropriate value.
Or a variant where you can reference the "x3" variable by its name and not by column index (the basic idea still from #markus):
data %>%
  mutate(temp = max.col(-abs(.[, !grepl("x3", colnames(.))] - .[, grepl("x3", colnames(.))]))) %>%
  rowwise %>%
  mutate(temp = names(.)[[temp]]) %>%
  summarise(val = eval(as.symbol(temp)))
Here is another approach using matrixStats
x <- as.matrix(data[,-3L])
y <- abs(x - .subset2(data, 3L))
x[matrixStats::rowMins(y) == y]
# [1] 24 30 20
Or in base using vapply
x <- as.matrix(data[,-3L])
y <- abs(x - .subset2(data, 3L))
vapply(1:nrow(data),
       function(k) x[k,][which.min(y[k,])],
       numeric(1))
# [1] 24 30 20
Define a function closest_to_3 that operates on a vector and returns the value in the vector that's closest to the third member:
closest_to_3 <- function(v) v[-3][which.min(abs( v[-3]-v[3] ))]
(The idiom v[-3] deletes the 3rd member from v.) Then apply this function to each row of your data frame:
apply(data, 1, closest_to_3)
#[1] 24 30 20
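The same idea generalises to any reference column (a hypothetical variant that parameterises the column index):
closest_to_col <- function(v, j) v[-j][which.min(abs( v[-j]-v[j] ))]
apply(data, 1, closest_to_col, j = 3)
#[1] 24 30 20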

Creating data partitions over a selected range of data to be fed into caret::train function for cross-validation

I want to create jack-knife data partitions for the data frame below, with the partitions to be used in caret::train (like those caret::groupKFold() produces). However, the catch is that I want to restrict the test points to, say, day 16 onward, while using the remainder of the data as the training set.
df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = seq(1:20))
The reason I want to do this is that I am only really interested in how well the model is predicting the upper bound, as this is the region of interest. I feel like there is a way to do this with the caret::groupKFold() function but I am not sure how. Any help would be greatly appreciated.
An example of what each CV fold would comprise:
TrainSet1 <- subset(df, Time != 16)
TestSet1 <- subset(df, Time == 16)
TrainSet2 <- subset(df, Time != 17)
TestSet2 <- subset(df, Time == 17)
TrainSet3 <- subset(df, Time != 18)
TestSet3 <- subset(df, Time == 18)
TrainSet4 <- subset(df, Time != 19)
TestSet4 <- subset(df, Time == 19)
TrainSet5 <- subset(df, Time != 20)
TestSet5 <- subset(df, Time == 20)
Albeit in the format that the caret::groupKFold function outputs, so that the folds could be fed into the caret::train function:
CVFolds <- caret::groupKFold(df$Time)
CVFolds
Thanks in advance!
For customized folds I find that built-in functions are usually not flexible enough, so I usually produce them using the tidyverse. One approach to your problem would be:
library(tidyverse)
df %>%
  mutate(id = row_number()) %>% # use the row number as a column called id
  filter(Time > 15) %>%         # filter Time as per your need
  split(.$Time) %>%             # split df into a list by Time
  map(~ .x %>% select(id))      # select row numbers for each list element
An example with two rows per Time:
df <- data.frame(Effect = seq(from = 0.025, to = 1, by = 0.025),
                 Time = rep(1:20, each = 2))

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 15) %>%
  split(.$Time) %>%
  map(~ .x %>% select(id)) -> test_folds

test_folds
#output
$`16`
id
1 31
2 32
$`17`
id
3 33
4 34
$`18`
id
5 35
6 36
$`19`
id
7 37
8 38
$`20`
id
9 39
10 40
With an unequal number of rows per Time:
df <- data.frame(Effect = seq(from = 0.55, to = 1, by = 0.05),
                 Time = c(rep(1, 5), rep(2, 3), rep(3, 2)))
df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time) %>%
  map(~ .x %>% select(id))
$`2`
id
1 6
2 7
3 8
$`3`
id
4 9
5 10
Now you can define these hold-out folds inside trainControl with the argument indexOut.
EDIT: to get output similar to caret::groupKFold, one can:
df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time) %>%
  map(~ .x %>%
        select(id) %>%
        unlist %>%
        unname) %>%
  unname
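To wire these folds into caret, here is a minimal sketch using the original df from the question (the "lm" method is only a placeholder model; the holdout list is built with the pipeline above):
library(caret)

holdout <- df %>%
  mutate(id = row_number()) %>%
  filter(Time > 15) %>%
  split(.$Time) %>%
  map(~ .x %>% select(id) %>% unlist %>% unname) %>%
  unname

training <- map(holdout, ~ setdiff(seq_len(nrow(df)), .x)) # training = complement

ctrl <- trainControl(method = "cv",
                     index = training,   # rows used to fit in each resample
                     indexOut = holdout) # rows held out in each resample

fit <- train(Effect ~ Time, data = df, method = "lm", trControl = ctrl)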

Filtering data in a data frame

I have a data frame that looks like this:
S1State S1Value S2State S2Value
NSW 20 VIC 30
WA 30 NSW 20
I would like to filter and select the state(from S1State and S2State) that has the maximum value(from S1Value and S2Value). The result should look like this:
SState SValue
VIC 30
WA 30
I am new to R and have been experimenting with dplyr.
The answer I was hinting at is as follows:
library(dplyr)

dt <- read.table(text = "S1State S1Value S2State S2Value
NSW 20 VIC 30
WA 30 NSW 20",
                 header = TRUE, stringsAsFactors = FALSE)

answer = dt %>%
  mutate(SState = ifelse(S1Value > S2Value, S1State, S2State),
         SValue = ifelse(S1Value > S2Value, S1Value, S2Value)) %>%
  select(SState, SValue)
Just to show that this is far from impossible with standard R tools:
nams <- c("State","Value")
tmp <- reshape(dt, direction="long", varying=lapply(nams, grep, x=names(dt)),
               v.names=nams, timevar=NULL)
tmp[with(tmp, Value == ave(Value, id, FUN=max)),]
#    State Value id
#2.1    WA    30  2
#1.2   VIC    30  1
I assume that the OP may have more states in the data frame, such as S3State, S4State, ...
The following solutions are based on this assumption and try to handle more than two state columns. If there are only two states, the approach proposed by #lebelinoz is simple and straightforward.
Solution 1
A solution using functions from dplyr and tidyr. dt2 is the final output.
# Load packages
library(dplyr)
library(tidyr)

# Process the data
dt2 <- dt %>%
  gather(Num, Value, contains("Value")) %>%
  gather(State, Name, contains("State")) %>%
  # Only keep records with the same state number
  filter(substring(Num, 1, 2) == substring(State, 1, 2)) %>%
  mutate(Group = substring(Num, 1, 2)) %>%
  group_by(Group) %>%
  filter(Value == max(Value)) %>%
  ungroup() %>%
  select(SState = Name, SValue = Value)
Solution 2
A solution using functions from dplyr, purrr, and stringr. I load the first two through the tidyverse package. Again, dt2 is the final output.
# Load packages
library(tidyverse)
library(stringr)

# Extract the column names
Col <- colnames(dt)

# Extract state numbers
ColNum <- Col %>%
  str_extract(pattern = "[0-9]") %>%
  unique()

# Design a function to process the data
dt_process <- function(pattern, dt){
  dt2 <- dt %>%
    # Extract columns based on a pattern (numbers)
    select(dplyr::contains(pattern)) %>%
    # Rename the columns
    rename_all(~sub(pattern, "", .)) %>%
    # Filter the maximum row
    filter(SValue == max(SValue))
  return(dt2)
}

# Apply the dt_process function
dt_list <- map(.x = ColNum, .f = dt_process, dt = dt)

# Bind all data frames
dt2 <- bind_rows(dt_list) %>% arrange(SState)
Data Preparation
# Create example data frame
dt <- read.table(text = "S1State S1Value S2State S2Value
NSW 20 VIC 30
WA 30 NSW 20",
                 header = TRUE, stringsAsFactors = FALSE)
