How do I find the column name corresponding to the maximum value in multiple rows and columns? - r

I can only find information for finding the max value for each row.
But I need the max value among multiple rows and columns and to find the column name corresponding to it.
For example, if my dataset looks like:
data <- data.frame(Year = c(2001, 2002, 2003),
X = c(3, 2, 45),
Y = c(6, 20, 23),
Z = c(10, 4, 4))
I want my code to return "X" because 45 is the maximum.

One way to approach this is to turn your wide dataset into a long (tidy) table, filter for the maximum value, and extract the corresponding column name.
library(tidyverse)
df <- read.table(text = "Year X Y Z
2001 3 6 10
2002 2 20 4
2003 45 23 4", header = TRUE)
df %>%
pivot_longer(cols = c("X", "Y", "Z"), names_to = "column") %>%
filter(max(value) == value) %>%
pull(column)
# [1] "X"
If you have a large number of columns, you can pivot your data from wide to long without listing every column name (as I do in the pivot_longer(...) command above) by running this instead:
df %>%
pivot_longer(cols = setdiff(names(.), "Year"), names_to = "column") %>%
filter(max(value) == value) %>%
pull(column)
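As a variation on the pipeline above, here is a minimal sketch that selects columns with `-Year` and uses `slice_max()` in place of the `filter()` step. It assumes dplyr 1.0.0 or later and tidyr 1.0.0 or later; the name `max_cols` is introduced only for illustration. `distinct()` makes sure a column name is reported once even if the maximum occurs in it several times.

```r
library(dplyr)
library(tidyr)

df <- data.frame(Year = c(2001, 2002, 2003),
                 X = c(3, 2, 45),
                 Y = c(6, 20, 23),
                 Z = c(10, 4, 4))

max_cols <- df %>%
  pivot_longer(cols = -Year, names_to = "column") %>%  # every column except Year
  slice_max(value, n = 1, with_ties = TRUE) %>%         # keep the row(s) holding the maximum
  distinct(column) %>%                                  # one entry per column name
  pull(column)

max_cols
```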

A base R solution:
Assuming that you want to exclude the Year variable from this analysis:
dat <- data.frame(Year = c(2000, 2001, 2002),
X = c(1, 2, 45),
Y = c(3, 4, 5))
dat_ex_year <- dat[, !names(dat) %in% c("Year"), drop = FALSE] # drop = FALSE keeps a data frame even if only one column is left
names(dat_ex_year)[which(dat_ex_year == max(dat_ex_year), arr.ind = TRUE)[,2]]
which gives:
[1] "X"
EDIT: I slightly adjusted the code so that it returns all column names when the maximum value is found in several columns, e.g. with:
dat <- data.frame(Year = c(2000, 2001, 2002),
X = c(1, 2, 45),
Y = c(3, 45, 5))
the code gives:
[1] "X" "Y"
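Another base R option, sketched here under the assumption that every column except Year is numeric, is to compute each column's maximum first and then compare those maxima. Each tied column name is returned once, even if the maximum occurs several times within a column. The names `col_max` and `max_cols` are introduced for illustration.

```r
data <- data.frame(Year = c(2001, 2002, 2003),
                   X = c(3, 2, 45),
                   Y = c(6, 20, 23),
                   Z = c(10, 4, 4))

# Per-column maxima, ignoring the Year column (assumes the rest are numeric)
col_max <- sapply(data[setdiff(names(data), "Year")], max, na.rm = TRUE)

# Name(s) of the column(s) whose maximum equals the overall maximum
max_cols <- names(col_max)[col_max == max(col_max)]
max_cols
```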

how to use multiple left_joins without leftover NAS

I am working with a handful of dataframes whose information I would like to combine.
df1 <- data.frame(id = c(1, 1, 2, 2), year = c(1990, 2000, 1990, 2000))
df1990 <- data.frame(id = c(1, 2), year = c(1990, 1990), X= c(1, 2))
df2000 <- data.frame(id = c(1, 2), year = c(2000, 2000), X= c(1, 2))
Above is code for creating toy inputs.
I want to append the information in df1990 and df2000 to df1, resulting in a dataframe like this
df <- data.frame(id = c(1, 1, 2, 2), year = c(1990, 2000, 1990, 2000), X = c(1, 1, 2, 2))
To do this, my first thought was to use left_join() but I can only do this successfully once -- it works with the first attempted merge, but the NAs remain NAs when I try to do a second merge.
So I run:
df <- left_join(df1, df1990)
df <- left_join(df, df2000)
But I still have NAs.
Any idea how I can fix this?
As suggested by TarJae in comments, create the right-hand side of the join, then join only once.
Assuming your dfs are all named dfYEAR, you can pull them out of the workspace:
year_dfs <- lapply(grep("^df[0-9]{4}$", ls(), value = TRUE), get) # anchored so that only names like df1990 match
(even better would be to load them into a list to begin with)
Then join the two tables
# base-r
merge(df1, do.call(rbind, year_dfs))
# dplyr (needs library(dplyr))
year_dfs |> bind_rows() |> right_join(df1, by = c("id", "year"))
Note that if you have non-unique combinations of id and year in the year_dfs, you will end up with more rows than df1 started with.
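A minimal sketch of this approach with the year tables kept in a list from the start, plus a check for the duplicate-key situation just mentioned. The names `rhs` and `combined` are illustrative, not from the original answer.

```r
df1 <- data.frame(id = c(1, 1, 2, 2), year = c(1990, 2000, 1990, 2000))

# Keep the per-year tables in a named list instead of separate objects
year_dfs <- list(
  `1990` = data.frame(id = c(1, 2), year = c(1990, 1990), X = c(1, 2)),
  `2000` = data.frame(id = c(1, 2), year = c(2000, 2000), X = c(1, 2))
)

# Stack them into a single right-hand side
rhs <- do.call(rbind, year_dfs)

# Each id/year pair should appear only once, otherwise the join adds rows
stopifnot(anyDuplicated(rhs[c("id", "year")]) == 0)

# One join on both keys; all.x = TRUE keeps every row of df1
combined <- merge(df1, rhs, by = c("id", "year"), all.x = TRUE)
combined
```

If the check fails, the duplicated id/year pairs need to be resolved (for example by aggregating) before joining.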
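Alternatively, use rows_update() after the first join, matching on both key columns:
library(dplyr)
left_join(df1, df1990, by = c("id", "year")) %>%
rows_update(df2000, by = c("id", "year"))
output
  id year X
1  1 1990 1
2  1 2000 1
3  2 1990 2
4  2 2000 2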

New column with random boolean values while controlling the ratio of TRUE/FALSE per category

In R I've got a dataset like this one:
df <- data.frame(
ID = c(1:30),
x1 = seq(0, 1, length.out = 30),
x2 = seq(100, 3000, length.out = 30),
category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)
Now I want to add a new column with randomized boolean values, but inside each category the proportion of TRUE and FALSE values should be the same (i.e. the randomizing process should generate the same count of true and false values, in the above data frame 5 TRUEs and 5 FALSEs in each of the 3 categories). How to do this?
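You can sample a vector of TRUE and FALSE values without replacement, which gives a randomized and balanced column for one category of 10 rows. Using logical values rather than the strings "TRUE" and "FALSE" keeps the new column boolean.
sample(rep(c(TRUE, FALSE), each = 5), 10, replace = FALSE)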
Based on Yacine Hajji's answer:
addRandomBool <- function(df, p){
n <- ceiling(nrow(df) * p) # number of TRUE values in this group
df$bool <- sample(rep(c(TRUE, FALSE), times = c(n, nrow(df) - n)))
df
}
Reduce(rbind, lapply(split(df, df$category), addRandomBool, p = 0.5))
where the parameter p determines the proportion of TRUE values in each category.
This will sample within each group from a vector of 5 TRUE and 5 FALSE without replacement. It will assume that there are always 10 records per group.
library(dplyr)
library(tidyr)
df <- data.frame(
ID = c(1:30),
x1 = seq(0, 1, length.out = 30),
x2 = seq(100, 3000, length.out = 30),
category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)
set.seed(pi)
df %>%
group_by(category) %>%
nest() %>%
mutate(data = lapply(data,
function(df){ # Function to sample and assign the new_col
df$new_col <- sample(rep(c(FALSE, TRUE),
each = 5),
size = 10,
replace = FALSE)
df
})) %>%
unnest(cols = "data")
This next example is a little more general, but still assumes an (approximately) even split of TRUE and FALSE within a group. It can accommodate variable group sizes, including groups with an odd number of records (in which case the group gets one more FALSE than TRUE).
library(dplyr)
library(tidyr)
df <- data.frame(
ID = c(1:30),
x1 = seq(0, 1, length.out = 30),
x2 = seq(100, 3000, length.out = 30),
category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)
set.seed(pi)
df %>%
group_by(category) %>%
nest() %>%
mutate(data = lapply(data,
function(df){
df$new_col <- sample(rep(c(FALSE, TRUE),
length.out = nrow(df)),
size = nrow(df),
replace = FALSE)
df
})) %>%
unnest(cols = "data")
Maintaining Column Order
A couple of options to maintain the column order:
First, you can save the column order before you do your group_by - nest, and then use select to set the order when you're done.
set.seed(pi)
orig_col <- names(df) # original column order
df %>%
group_by(category) %>%
nest() %>%
mutate(data = lapply(data,
function(df){
df$new_col <- sample(rep(c(FALSE, TRUE),
length.out = nrow(df)),
size = nrow(df),
replace = FALSE)
df
})) %>%
unnest(cols = "data") %>%
select(all_of(c(orig_col, "new_col"))) # Restore the column order
Or you can use a base R solution that doesn't change the column order in the first place
df <- split(df, df["category"])
df <- lapply(df,
function(df){
df$new_col <- sample(rep(c(FALSE, TRUE),
length.out = nrow(df)),
size = nrow(df),
replace = FALSE)
df
})
do.call("rbind", c(df, list(make.row.names = FALSE)))
There are likely a dozen other ways to do this, and probably more efficient ways that I'm not thinking of.
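One such alternative, offered as a sketch rather than a recommendation, uses base R's `ave()` to apply the same shuffle within each category without splitting the data frame, so both row and column order are left untouched. The seed value and the reduced example data frame are assumptions made for illustration.

```r
df <- data.frame(
  ID = 1:30,
  category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)

set.seed(42)

# Within each category, shuffle an alternating FALSE/TRUE vector of the right length.
# ave() writes the results back into its integer first argument (as 0/1),
# so the result is converted to logical afterwards.
df$new_col <- as.logical(
  ave(seq_len(nrow(df)), df$category,
      FUN = function(i) sample(rep(c(FALSE, TRUE), length.out = length(i))))
)

# Cross-tabulate to inspect the TRUE/FALSE counts per category
table(df$category, df$new_col)
```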

gather 3 different detections of three different variables

I have a dataframe of 96074 obs. of 31 variables.
The first two variables are the id and the date, followed by 9 columns of measurements (three different KPIs, each with three different time properties), and then various technical and geographical variables.
df <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d_1day_old = rnorm(9, 2, 1),
sum_i_1day_old = rnorm(9, 2, 1),
per_i_d_1day_old = rnorm(9, 0, 1),
sum_d_5days_old = rnorm(9, 0, 1),
sum_i_5days_old = rnorm(9, 0, 1),
per_i_d_5days_old = rnorm(9, 0, 1),
sum_d_15days_old = rnorm(9, 0, 1),
sum_i_15days_old = rnorm(9, 0, 1),
per_i_d_15days_old = rnorm(9, 0, 1)
)
I want to transform from wide to long, in order to do graphs with ggplot using facets for example.
If I had a df with just one variable with its three-time scans I would have no problem in using gather:
plotdf <- df %>%
gather(sum_d, value,
c(sum_d_1day_old, sum_d_5days_old, sum_d_15days_old),
factor_key = TRUE)
But having three different variables trips me up.
I would like to have this output:
plotdf <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d = rep(c("sum_d_1day_old", "sum_d_5days_old", "sum_d_15days_old"), 3),
values_sum_d = rnorm(9, 2, 1),
sum_i = rep(c("sum_i_1day_old", "sum_i_5days_old", "sum_i_15days_old"), 3),
values_sum_i = rnorm(9, 2, 1),
per_i_d = rep(c("per_i_d_1day_old", "per_i_d_5days_old", "per_i_d_15days_old"), 3),
values_per_i_d = rnorm(9, 2, 1)
)
with id, sum_d, sum_i and per_i_d of class factor, time of class Date, and the values of class numeric (I should add that these variables contain no negative measurements).
what I've tried to do:
plotdf <- gather(df, key, value, sum_d_1day_old:per_i_d_15days_old, factor_key = TRUE)
gathering all of the variables in a single column
plotdf$KPI <- paste(sapply(strsplit(as.character(plotdf$key), "_"), "[[", 1),
sapply(strsplit(as.character(plotdf$key), "_"), "[[", 2), sep = "_")
creating a new column with the name of the KPI, without the time specification
plotdf %>% unite(value2, key, value) %>%
#creating a new variable with the full name of the KPI attaching the value at the end
mutate(i = row_number()) %>% spread(KPI, value2) %>% select(-i)
#spreading
But spread creates rows with NAs.
To replace them, at first I used
group_by(id, date) %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "down") %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "up") %>%
But the problem is that there are already some measurements with NAs in the original df in the variable per_i_d (44 in total), so I lose that information.
I thought that I could replace the NAs in the original df with a dummy value and then replace the NAs back, but then I thought that there could be a more efficient solution for all of my problem.
After I replaced the NAs, my idea was to use slice(1) to select only the first row of each couple id/date, then do some manipulation with separate/unite to have the output I desired.
I actually did that, but then I remembered I had those aforementioned NAs in the original df.
library(tidyverse) # gather/spread (tidyr), mutate/select (dplyr), str_extract (stringr)
df %>%
gather(key,value,-id,-time) %>%
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+')) %>%
select(-key) %>%
spread(type,value)
gives
id time age per_i sum_d sum_i
1 1 2009-01-01 15days_old 0.8132301 0.8888928 0.077532040
2 1 2009-01-01 1day_old -2.0993199 2.8817133 3.047894196
3 1 2009-01-01 5days_old -0.4626151 -1.0002926 0.327102000
4 1 2009-01-02 15days_old 0.4089618 -1.6868523 0.866412133
5 1 2009-01-02 1day_old 0.8181313 3.7118065 3.701018419
...
EDIT:
adding non-value columns to the dataframe:
df %>%
gather(key,value,-id,-time) %>%
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+'),
info = paste(age,type,sep = "_")) %>%
select(-key) %>%
gather(key,value,-id,-time,-age,-type) %>%
unite(dummy,type,key) %>%
spread(dummy,value)
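With tidyr 1.0.0 or later, the same reshaping can be sketched in one step using `pivot_longer()` and the special `".value"` name, which sends the first part of each column name to its own value column. Unlike the regex above, this pattern keeps the full KPI name `per_i_d`, and NAs present in the original data stay where they are because nothing is filled. The regular expression, the seed, and the injected NA are assumptions made for this example.

```r
library(tidyr)

set.seed(1)
df <- data.frame(
  id = rep(1:3, 3),
  time = rep(as.Date("2009-01-01") + 0:2, each = 3),
  sum_d_1day_old = rnorm(9), sum_i_1day_old = rnorm(9), per_i_d_1day_old = rnorm(9),
  sum_d_5days_old = rnorm(9), sum_i_5days_old = rnorm(9), per_i_d_5days_old = rnorm(9),
  sum_d_15days_old = rnorm(9), sum_i_15days_old = rnorm(9), per_i_d_15days_old = rnorm(9)
)
df$per_i_d_1day_old[2] <- NA  # a genuine missing measurement that must survive

plotdf <- pivot_longer(
  df,
  cols = -c(id, time),
  # first capture group -> one column per KPI, second -> the "age" column
  names_to = c(".value", "age"),
  names_pattern = "(.+)_(\\d+days?_old)"
)

plotdf
```

The result has one row per id, time and age, with `sum_d`, `sum_i` and `per_i_d` as separate numeric columns, which suits `facet_wrap(~ age)` in ggplot2.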

Calculation between groups in one column in tidy data

I have data like that:
df <- (
tibble(
ID = rep(1:2, 4),
Group = c("A", "B", "A", "B","A", "B", "A", "B"),
Parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
Value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))
)
I want to calculate the ratio between "Height" and "Waist" and between "Waist" and "Hip".
I have the following solution. But my solution requires using spread() and delivers only the calculation for "Waist-to-hip".
df <- rbind(df,
spread(df, Parameter, Value)
%>% transmute(ID = ID,
Group = Group,
Parameter = "Ratio.Height-to-Hip",
Value = Height / Hip,
Parameter = "Ratio.Waist-to-Hip",
Value = Waist / Hip))
Is it possible to keep the data in its current long format and avoid switching to the wide format? And why is the calculation for "Height-to-hip" missing?
Here is one possible solution:
# Calculate ratios "Height" vs "Waist" and "Waist" vs "Hip"
# 1. Load packages
library(tidyr)
library(dplyr)
# 2. Data set
df <- tibble(
id = rep(1:2, 4),
group = c("A", "B", "A", "B","A", "B", "A", "B"),
parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))
# 3. Filter and transform data set
df <- df %>%
filter(parameter %in% c("Height", "Waist", "Hip")) %>%
spread(parameter, value)
# 4. Convert column names to lower case
colnames(df) <- tolower(colnames(df))
# 5. Calculate ratios
df <- df %>%
mutate(
ratio_height_vs_waist = round(height / waist, 2),
ratio_waist_vs_hip = round(waist / hip, 2))
The main problem is that the data are not in a tidy format.
Two key features of the tidy format are (Wickham, 2013):
Each variable forms a column;
Each observation forms a row.
In its original format, your data violates these two rules. For example, the Parameter column contains four variables (Blood, Height, Waist, and Hip). The knock-on effect of grouping several variables within Parameter is that each observation has to be repeated across several rows. In general, repeated rows of an identifier (ID in this case) in the absence of repeated measures is a sign that two or more variables have been grouped under a single column.
Anyway, here's my attempt to clean the data (I have used mutate and not transmute for illustrative purposes).
# Load packages
library(dplyr)
library(tidyr)
library(magrittr) # For the %<>% function, which I love
# Make data frame, df
df <- tibble(
ID = rep(1:2, 4),
Group = c("A", "B", "A", "B","A", "B", "A", "B"),
Parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
Value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))
# Wrangle df
df %<>%
# ID and Group appear to be repeated, so use them to group_by
group_by(ID, Group) %>%
# Spread the Value column by the Parameter column
spread(key = Parameter,
value = Value) %>%
# Ungroup, just because it's a good habit
ungroup() %>%
# Generate new columns.
mutate(Ratio_height_to_hip = Height / Hip,
Ratio_waist_to_hip = Waist / Hip)
# Print df
df
#> # A tibble: 2 x 8
#> ID Group Blood Height Hip Waist Ratio_height_to_hip
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 A 6.3 180 60 90 3.000000
#> 2 2 B 6.0 170 65 102 2.615385
#> # ... with 1 more variables: Ratio_waist_to_hip <dbl>
df <- df %>%
spread(Parameter, Value) %>%
mutate("Ratio.Height-to-Hip" = Height / Hip) %>%
mutate("Ratio.Waist-to-Hip" = Waist / Hip) %>%
gather("Parameter", "Value", -c("ID", "Group"))
Your data is not in tidy format ;) If you want your data in tidy format remove the last step.
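On the second part of the question: in the original transmute() call, Parameter and Value are each assigned twice, and a later assignment to the same name overwrites the earlier one, so only the Waist-to-Hip row survives. The sketch below shows one way to compute both ratios from the question text (Height to Waist, Waist to Hip) while leaving the data in its long layout. It uses base R only; the names `ratios` and `long`, and the use of a plain data frame instead of a tibble, are illustrative assumptions.

```r
df <- data.frame(
  ID = rep(1:2, 4),
  Group = c("A", "B", "A", "B", "A", "B", "A", "B"),
  Parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
  Value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65),
  stringsAsFactors = FALSE
)

# For each ID, look values up by parameter name and build two new long-format rows
ratios <- do.call(rbind, lapply(split(df, df$ID), function(d) {
  v <- setNames(d$Value, d$Parameter)
  data.frame(
    ID = d$ID[1],
    Group = d$Group[1],
    Parameter = c("Ratio.Height-to-Waist", "Ratio.Waist-to-Hip"),
    Value = c(v[["Height"]] / v[["Waist"]], v[["Waist"]] / v[["Hip"]]),
    stringsAsFactors = FALSE
  )
}))

# Append the ratio rows to the original long table
long <- rbind(df, ratios)
long
```

This assumes each ID has exactly one value per parameter; with repeated measures the lookup by name would silently pick the first match.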
