Find the average of 3 minimum prices of numeric column - r

How can I find the average of the 3 minimum prices of a numeric column (Country_1)? Imagine that I have thousands of values.
d<-structure(list(Subarea = c("SA_1", "SA_2", "SA_3", "SA_4", "SA_5",
"SA_6", "SA_7", "SA_8", "SA_10", "SA_9"), Country_1 = c(101.37519256645,
105.268942332558, 100.49933368058, 104.531597221684, NA, 83.4404308144341,
86.2833044714836, 81.808967345926, 79.6786979951661, 77.6863475527052
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-10L))

Sort your vector by ascending value, take the 3 first values, and compute the mean.
mean(head(sort(d$Country_1), 3))
# [1] 79.72467
Use sapply or dplyr::across if you want to do that to multiple columns:
sapply(d[, your_columns], \(x) mean(head(sort(x), 3)))
# or
library(dplyr)
d %>%
  mutate(across(all_of(your_columns), ~ mean(head(sort(.x), 3))))
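As a rough, self-contained sketch of the multi-column case (the Country_2 column, the rounded prices, and the helper name mean_n_smallest are made up for illustration), a small helper keeps the call short. It relies on sort() dropping NA values by default:

```r
# Hypothetical data: Country_2 and the rounded prices are invented for this sketch.
prices <- data.frame(
  Country_1 = c(101.4, 105.3, 100.5, NA, 83.4, 77.7),
  Country_2 = c(50, 40, 30, 20, 10, NA)
)

# Mean of the n smallest values; sort() removes NA by default (na.last = NA).
mean_n_smallest <- function(x, n = 3) mean(head(sort(x), n))

your_columns <- c("Country_1", "Country_2")

# One value per column, returned as a named numeric vector.
res <- sapply(prices[, your_columns], mean_n_smallest)
print(res)
```

With dplyr, summarise(across(all_of(your_columns), mean_n_smallest)) should return the same numbers as a one-row data frame, whereas mutate() repeats them on every row.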

If you only care about the 3 smallest values and the data is large, sort() with partial = 1:3 is more efficient than a full sort.
mean(sort(d$Country_1, partial = 1:3)[1:3])
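A quick way to convince yourself that the partial sort gives the same answer is to compare both versions on simulated data. This is only a sketch: the vector x and the seed are arbitrary assumptions, not part of the question.

```r
set.seed(42)          # arbitrary seed, only for reproducibility
x <- runif(1e5)       # simulated prices (hypothetical data)

# Full sort, then keep the 3 smallest values.
full <- mean(head(sort(x), 3))

# Partial sort: only positions 1 to 3 are guaranteed to hold their sorted values.
part <- mean(sort(x, partial = 1:3)[1:3])

print(c(full = full, part = part))
```

Note that partial sorting cannot be combined with decreasing = TRUE, so this shortcut is for the smallest values only.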

An option with slice_min
library(dplyr)
d %>%
  slice_min(n = 3, order_by = Country_1) %>%
  summarise(Mean = mean(Country_1))
# A tibble: 1 × 1
#    Mean
#   <dbl>
# 1  79.7
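One thing to keep in mind with slice_min is that it keeps ties by default, so it can return more than n rows. A minimal sketch with made-up prices (the data frame ties is hypothetical):

```r
library(dplyr)

# Hypothetical data with a tie at the third-smallest value.
ties <- data.frame(price = c(1, 2, 2, 2, 5))

# Default with_ties = TRUE keeps every row tied with the 3rd smallest value.
n_default <- ties %>% slice_min(order_by = price, n = 3) %>% nrow()

# with_ties = FALSE returns exactly 3 rows.
n_strict <- ties %>% slice_min(order_by = price, n = 3, with_ties = FALSE) %>% nrow()

print(c(n_default = n_default, n_strict = n_strict))
```

If the average must be taken over exactly 3 values, with_ties = FALSE is probably the safer setting.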

Related

R code to merge 2 data frames by whether values in the first "by" variable contain string values in the second "by" variable

I have 2 data frames: one with a list of medications, the other with a different but highly overlapping list of medications along with corresponding medication ID codes. I want to merge these two data frames to apply the medication codes to the first data frame's medication list. I have a lot of partial string matches, and I want to detect strings in a case-insensitive manner.
library(tidyverse)
library(stringr)
label <- c("0.4% Lidocaine Hydrochloride", "10% Dextrose", "Act Raloxifene")
df1 <- data.frame(label)
label2 <- c("LIDOCAINE", "RALOXIFENE", "JANUMET", "ESOMEPRAZOLE", "METFORMIN")
code <- c(0003, 0005, 0006, 0001, 0011)
df2 <- data.frame(label2, code)%>%
rename(label=label2)
I tried to use str_detect from the stringr package:
merge_df <- merge(df1, df2,
by.x=c("label" = ifelse(str_detect(df1$label, regex(df2$label, ignore_case = T)),
df1$label, NA)),
by.y=c("label" = ifelse(str_detect(df1$label, regex(df2$label, ignore_case = T)),
df2$label, NA)),
ignore.case=T,all.x=T,all.y=T,
suffixes = c("_list", "_dict"),
nomatch=0)
And I get the error:
Error in str_detect():
! Can't recycle string (size 3) to match pattern (size 5).
An approach using left_join.
First add a helper column l_label to both data sets that holds the lower-cased labels, split into single words with strsplit so that every word can be matched.
After joining and arranging by the y-labels, remove the duplicated entries and the helper column.
library(dplyr)
library(tidyr)
left_join(
  df1 %>%
    rowwise() %>%
    mutate(l_label = strsplit(tolower(label), " ")) %>%
    unnest(l_label),
  df2 %>%
    rowwise() %>%
    mutate(l_label = unlist(strsplit(tolower(label), " "))),
  "l_label"
) %>%
  arrange(label.y) %>%
  group_by(label.x) %>%
  filter(!duplicated(label.x)) %>%
  select(-l_label) %>%
  ungroup()
# A tibble: 3 × 3
label.x label.y code
<chr> <chr> <dbl>
1 0.4% Lidocaine Hydrochloride LIDOCAINE 3
2 Act Raloxifene RALOXIFENE 5
3 10% Dextrose NA NA
Data
df1 <- structure(list(label = c("0.4% Lidocaine Hydrochloride", "10% Dextrose",
"Act Raloxifene")), class = "data.frame", row.names = c(NA, -3L
))
df2 <- structure(list(label = c("LIDOCAINE", "RALOXIFENE", "JANUMET",
"ESOMEPRAZOLE", "METFORMIN"), code = c(3, 5, 6, 1, 11)),
class = "data.frame", row.names = c(NA,
-5L))
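The error in the question comes from str_detect pairing strings and patterns element by element, so 3 labels cannot be matched against 5 patterns. A base R sketch of the "does label A contain label B" idea, which tests every pair instead, could look like the following. The helper first_hit and the output column label_dict are names invented here, and taking only the first match per label is an assumption:

```r
df1 <- data.frame(label = c("0.4% Lidocaine Hydrochloride", "10% Dextrose",
                            "Act Raloxifene"))
df2 <- data.frame(label = c("LIDOCAINE", "RALOXIFENE", "JANUMET",
                            "ESOMEPRAZOLE", "METFORMIN"),
                  code = c(3, 5, 6, 1, 11))

# For one df1 label, return the position of the first df2 label it contains
# (NA if none). Lower-casing both sides and using fixed = TRUE gives a
# literal, case-insensitive substring test.
first_hit <- function(x) {
  hits <- which(vapply(df2$label,
                       function(p) grepl(tolower(p), tolower(x), fixed = TRUE),
                       logical(1)))
  if (length(hits) > 0) hits[[1]] else NA_integer_
}
idx <- vapply(df1$label, first_hit, integer(1), USE.NAMES = FALSE)

# Indexing with NA yields NA, so unmatched labels are kept with missing codes.
merged <- data.frame(label      = df1$label,
                     label_dict = df2$label[idx],
                     code       = df2$code[idx])
print(merged)
```

Unlike the word-based join, this also matches a dictionary entry that appears inside a longer word, which may or may not be what you want.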

How to find the clusters that produce the maximum colMeans in R?

I have a data frame like
V1 V2 V3
1 1 1 2
2 0 1 0
3 3 0 3
....
and I have a vector of the same length as the number of rows in the data frame (it's the cluster from kmeans, if that matters)
[1] 2 2 1...
From those I can get the colMeans for each cluster, like
cm1 <- colMeans(df[fit$cluster==1,])
cm2 <- colMeans(df[fit$cluster==2,])
(I don't think I should do that part explicitly, but that's how I'm thinking about the problem.)
What I want is to get, for each column of the data frame, the value from the vector for which the colMeans is the maximum. Also I'd like to do (separately is fine) the second-highest, third, etc. So in the example I would want the output to be a vector with one element for each column of the data frame:
1 2 1...
because for the first column of the data frame, the column mean for the first cluster is 3, while the column mean for the second cluster is 0.5.
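If the cluster vector has the same length as the number of rows of 'df', split the data by cluster into a list, stack the column means of each piece, bind the pieces together, and take which.max per column. Note that which.max returns the position of the largest mean within each column's group; it equals the cluster label only because the clusters are numbered 1, 2, ... and split() returns them in that order.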
lst1 <- lapply(split(df, fit$cluster), function(x) stack(colMeans(x)))
dat <- do.call(rbind, Map(cbind, cluster = names(lst1), lst1))
aggregate(values ~ ind, dat, FUN = which.max)
If we need the top n clusters for each column (highest, second-highest, and so on), add the 'cluster' column to the data, reshape to 'long' format with pivot_longer, compute the mean of 'value' grouped by 'cluster' and 'name', arrange by 'name' and descending 'value', then return the first n rows per 'name' with slice_head:
library(dplyr)
library(tidyr)
df %>%
  mutate(cluster = fit$cluster) %>%
  pivot_longer(cols = -cluster) %>%
  group_by(cluster, name) %>%
  summarise(value = mean(value), .groups = 'drop') %>%
  arrange(name, desc(value)) %>%
  group_by(name) %>%
  slice_head(n = 2)
data
df <- structure(list(V1 = c(1L, 0L, 3L), V2 = c(1L, 1L, 0L), V3 = c(2L,
0L, 3L)), class = "data.frame", row.names = c("1", "2", "3"))
fit <- structure(list(cluster = c(2, 2, 1)), class = "data.frame",
row.names = c(NA,
-3L))
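For comparison, here is a base R sketch that returns the cluster labels directly rather than positions, and also gives the full ranking per column. The object names cm, best and ranked are chosen here for illustration, and the cluster vector is written out instead of being taken from a kmeans fit:

```r
df <- data.frame(V1 = c(1, 0, 3), V2 = c(1, 1, 0), V3 = c(2, 0, 3))
cluster <- c(2, 2, 1)   # stands in for fit$cluster

# Matrix of per-cluster column means: one row per cluster, one column per variable.
cm <- sapply(df, function(col) tapply(col, cluster, mean))

# Label of the cluster with the highest mean, per column.
best <- rownames(cm)[apply(cm, 2, which.max)]

# Full ranking: row 1 holds the best cluster per column, row 2 the second best, ...
ranked <- apply(cm, 2, function(m) rownames(cm)[order(m, decreasing = TRUE)])

print(cm)
print(best)
print(ranked)
```

Because the labels are looked up through rownames(cm), this does not depend on the clusters being numbered consecutively from 1.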

Using pivot_longer to separate columns into long format

I have a df that is of non-finite length that looks like the table below.
The example here only has 2 traits: "lipids" and "density". Other rows may have 50 traits or more. But will always have the same pattern of trait, unit, method. When importing into R using read_excel it changes non unique names to xxx...[col.number]. I want to use pivot_longer to cast the data into a long format from wide. I'm having difficulty manipulating the function and would appreciate some help. The final column names I would like would be geno_name, observation_id, trait, value, unit, method
Sample Data
Desired Output (without the drop_na statement to show example)
x <- structure(list(geno_name = "MB mixed", observation_id = 10, lipids = NA,
unit...3 = NA, method...4 = NA, density = 1.125, unit...6 = "g cm^-3",
method...7 = "3D scanning"), class = "data.frame", row.names = c(NA,-1L))
So far I have:
x %>% pivot_longer(
cols = 3:ncol(x),
names_to = c("trait","unit","method"),
#need help with these other arguments
values_drop_na = T)
The columns to be reshaped do not all follow the same naming pattern, so the steps are:
1. Rename the columns that have no ... or _ in their names (the trait columns) by adding a value... prefix with str_c.
2. Reshape to long format with pivot_longer, splitting the names with names_sep (names_pattern would also work) and setting names_to to c(".value", "trait"), so that the part before the separator becomes the output column (value, unit, method) and the part after it is stored in 'trait'.
3. Create a grouping column from 'trait': some of its entries are only numbers (the column positions), so detect the non-numeric entries as a logical vector and take the cumulative sum. Group by it together with 'geno_name' and 'observation_id', which on their own do not identify a row uniquely.
4. Summarise the remaining columns by taking the first element after ordering on is.na, i.e. the first non-NA value if there is one, otherwise NA.
library(dplyr)
library(stringr)
library(tidyr)
x %>%
  rename_at(vars(names(.)[!str_detect(names(.), "[_.]+")]),
            ~ str_c("value...", .)) %>%
  pivot_longer(cols = 3:ncol(.),
               names_to = c(".value", "trait"), names_sep = "\\.+") %>%
  group_by(geno_name, observation_id,
           grp = cumsum(str_detect(trait, "\\D+"))) %>%
  summarise(across(everything(), ~ .[order(is.na(.))][1]),
            .groups = 'drop') %>%
  select(-grp)
-output
# A tibble: 2 x 6
# geno_name observation_id trait value unit method
# <chr> <dbl> <chr> <dbl> <chr> <chr>
#1 MB mixed 10 lipids NA <NA> <NA>
#2 MB mixed 10 density 1.12 g cm^-3 3D scanning
data
x <- structure(list(geno_name = "MB mixed", observation_id = 10, lipids = NA,
unit...3 = NA, method...4 = NA, density = 1.125, unit...6 = "g cm^-3",
method...7 = "3D scanning"), class = "data.frame", row.names = c(NA,
-1L))
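Since the question states that the columns always come in the order trait, unit, method, a position-based base R sketch is another option. It assumes exactly two id columns followed by complete triplets; the names id_cols, starts and long are invented for this example:

```r
x <- structure(list(geno_name = "MB mixed", observation_id = 10, lipids = NA,
    unit...3 = NA, method...4 = NA, density = 1.125, unit...6 = "g cm^-3",
    method...7 = "3D scanning"), class = "data.frame", row.names = c(NA, -1L))

id_cols <- 1:2                                  # geno_name, observation_id
starts  <- seq(from = 3, to = ncol(x), by = 3)  # first column of each triplet

# Build one long-format piece per trait and stack the pieces.
long <- do.call(rbind, lapply(starts, function(i) {
  data.frame(x[id_cols],
             trait  = names(x)[i],
             value  = as.numeric(x[[i]]),
             unit   = as.character(x[[i + 1]]),
             method = as.character(x[[i + 2]]))
}))
print(long)
```

Rows with a missing value can then be dropped with long[!is.na(long$value), ] if the drop_na behaviour is wanted.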

Separating Column Based on First Value of String

I have an ID variable that I am trying to separate into two separate columns based on their prefix being either a 1 or 2.
An example of my data is:
STR_ID
1434233
2343535
1243435
1434355
I have tried countless ways to try to separate these variables into columns based on their prefixes, but cannot seem to figure it out. Any ideas on how I would do this? Thank you in advance.
We create a grouping variable with substr by extracting the first character/digit of 'STR_ID', and spread it to 'wide' format
library(tidyverse)
df1 %>%
  group_by(grp = paste0('grp', substr(STR_ID, 1, 1))) %>%
  mutate(i = row_number()) %>%
  spread(grp, STR_ID) %>%
  select(-i)
# A tibble: 3 x 2
# grp1 grp2
# <int> <int>
#1 1434233 2343535
#2 1243435 NA
#3 1434355 NA
data
df1 <- structure(list(STR_ID = c(1434233L, 2343535L, 1243435L, 1434355L
)), class = "data.frame", row.names = c(NA, -4L))
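The same result can be sketched in base R with split(), padding the shorter group with NA so the columns have equal length. The names groups, n and wide are chosen here for illustration:

```r
df1 <- data.frame(STR_ID = c(1434233L, 2343535L, 1243435L, 1434355L))

# Split the IDs by their first digit; substr() coerces the integers to character.
groups <- split(df1$STR_ID, paste0("grp", substr(df1$STR_ID, 1, 1)))

# Indexing past the end of a vector yields NA, which pads the shorter groups.
n    <- max(lengths(groups))
wide <- as.data.frame(lapply(groups, function(v) v[seq_len(n)]))
print(wide)
```

In current tidyr, spread() is superseded by pivot_wider(), so new tidyverse code would normally use that function instead.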

How to use dplyr::mutate_all for rounding selected columns

I'm using the following package version
# devtools::install_github("hadley/dplyr")
> packageVersion("dplyr")
[1] ‘0.5.0.9001’
With the following tibble:
library(dplyr)
df <- structure(list(gene_symbol = structure(1:6, .Label = c("0610005C13Rik",
"0610007P14Rik", "0610009B22Rik", "0610009L18Rik", "0610009O20Rik",
"0610010B08Rik"), class = "factor"), fold_change = c(1.54037,
1.10976, 0.785, 0.79852, 0.91615, 0.87931), pvalue = c(0.5312,
0.00033, 0, 0.00011, 0.00387, 0.01455), ctr.mean_exp = c(0.00583,
59.67286, 83.2847, 6.88321, 14.67696, 1.10363), tre.mean_exp = c(0.00899,
66.22232, 65.37819, 5.49638, 13.4463, 0.97043), ctr.cv = c(5.49291,
0.20263, 0.17445, 0.46288, 0.2543, 0.39564), tre.cv = c(6.06505,
0.28827, 0.33958, 0.53295, 0.26679, 0.52364)), .Names = c("gene_symbol",
"fold_change", "pvalue", "ctr.mean_exp", "tre.mean_exp", "ctr.cv",
"tre.cv"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
That looks like this:
> df
# A tibble: 6 × 7
gene_symbol fold_change pvalue ctr.mean_exp tre.mean_exp ctr.cv tre.cv
<fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0610005C13Rik 1.54037 0.53120 0.00583 0.00899 5.49291 6.06505
2 0610007P14Rik 1.10976 0.00033 59.67286 66.22232 0.20263 0.28827
3 0610009B22Rik 0.78500 0.00000 83.28470 65.37819 0.17445 0.33958
4 0610009L18Rik 0.79852 0.00011 6.88321 5.49638 0.46288 0.53295
5 0610009O20Rik 0.91615 0.00387 14.67696 13.44630 0.25430 0.26679
6 0610010B08Rik 0.87931 0.01455 1.10363 0.97043 0.39564 0.52364
I'd like to round the floats (2nd column onward) to 3 digits. What's the way to do it with dplyr::mutate_all()?
I tried this:
cols <- names(df)[2:7]
# df <- df %>% mutate_each_(funs(round(.,3)), cols)
# Warning message:
#'mutate_each_' is deprecated.
# Use 'mutate_all' instead.
# See help("Deprecated")
df <- df %>% mutate_all(funs(round(.,3)), cols)
But get the following error:
Error in mutate_impl(.data, dots) :
3 arguments passed to 'round'which requires 1 or 2 arguments
While the new across() function is slightly more verbose than the previous mutate_if variant, the dplyr 1.0.0 updates make the tidyverse language and code more consistent and versatile.
This is how to round specified columns:
df %>% mutate(across(2:7, round, 3)) # columns 2-7 by position
df %>% mutate(across(all_of(cols), round, 3)) # columns named in the character vector cols
This is how to round all numeric columns to 3 decimal places:
df %>% mutate(across(where(is.numeric), round, 3))
This is how to round all columns, but it won't work in this case because gene_symbol is not numeric:
df %>% mutate(across(everything(), round, 3))
Where we put where(is.numeric) in across's arguments, you could put in other column specifications such as -1 or -gene_symbol to exclude column 1. See help(tidyselect) for even more options.
Update for dplyr 1.0.0
The across() function replaces the _if/_all/_at/_each variants of dplyr verbs. https://dplyr.tidyverse.org/dev/articles/colwise.html#how-do-you-convert-existing-code
Since some columns are not numeric, you could use mutate_if, which rounds a column if and only if it is numeric:
df %>% mutate_if(is.numeric, round, 3)
packageVersion("dplyr")
[1] '0.7.6'
Try
df %>% mutate_at(2:7, funs(round(., 3)))
It works!!
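The original error arises because mutate_all() has no argument for selecting columns: cols is forwarded to round() as a third argument. In recent dplyr versions (1.1.0 and later), passing extra arguments such as 3 through across() is deprecated in favour of an anonymous function. A minimal sketch on a made-up stand-in for the tibble (the data frame genes and its values are assumptions for illustration; the \(x) syntax needs R 4.1 or later):

```r
library(dplyr)

# Small stand-in for the tibble in the question (values are made up).
genes <- data.frame(
  gene_symbol = c("a", "b"),
  fold_change = c(1.54037, 1.10976),
  pvalue      = c(0.5312, 0.00033)
)
cols <- c("fold_change", "pvalue")

# Anonymous function instead of passing `3` through across()'s `...`.
by_name <- genes %>% mutate(across(all_of(cols), \(x) round(x, 3)))
by_type <- genes %>% mutate(across(where(is.numeric), \(x) round(x, 3)))

# Base R equivalent, no dplyr needed.
base_r <- genes
base_r[cols] <- lapply(base_r[cols], round, 3)

print(by_name)
```

All three variants should leave gene_symbol untouched and round only the numeric columns.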
