I am trying to order a dataset according to values in columns in ascending order.
I have a dataset with 1 row and 3000+ columns. I guess I can just change it to a list and use .[[n]] but I was thinking if there was another way.
data looks something like this only with more columns and values.
structure(list(a = -0.00106163456888295, b = -4.11357273721094e-05,
c = -0.000181424293930435), row.names = 1L, class = "data.frame")
I expect something like this:
b c a
1 -4.1135727372109401e-05 -0.00018142429393043499 -0.00106163456888295
I understand you can arrange by column number by doing the following:
.[[column number]]
for example:
mtcars %>% arrange(.[[2]])
what is the row number equivalent?
If I understand you correctly, you want to order the columns based on the values in the single row.
z <- structure(list(a = -0.00106163456888295, b = -4.11357273721094e-05,
c = -0.000181424293930435), row.names = 1L, class = "data.frame")
Base R:
z[,order(z[1,])]
# a c b
# 1 -0.00106163457 -0.000181424294 -0.0000411357274
Tidyverse:
library(dplyr)
z %>%
select_at(order(.))
Note: I think your expected output might not be correct, as the values are not ordered. Your intended output:
c(-0.000181424293930435, -0.00106163456888295, -4.11357273721094e-05)
# [1] -0.0001814242939 -0.0010616345689 -0.0000411357274
diff(c(-0.000181424293930435, -0.00106163456888295, -4.11357273721094e-05))
# [1] -0.000880210275 0.001020498842
shows the first value is greater than the second, but the second is less than the third. If they were ordered, I would expect the diff to be always-nonnegative; if reverse-ordered, diff should be always-nonpositive.
We can unlist the first row, order and use that in select
library(dplyr)
df1 %>%
select(order(-unlist(.[1,])))
# b c a
#1 -4.113573e-05 -0.0001814243 -0.001061635
It can be also used a general solution i.e if we want to do this based on a particular row
n <- 3
mtcars %>%
select(order(-unlist(.[n,])))
Or reshape to 'long' and then use arrange, get the column names and then select
library(tidyr)
df1 %>%
pivot_longer(everything()) %>%
arrange(desc(value)) %>%
pull(name) %>%
select(df1, .)
# b c a
#1 -4.113573e-05 -0.0001814243 -0.001061635
Or enframe, then do a arrange, pull the 'name' column and use that in select
library(tibble)
as.list(df1) %>%
enframe %>%
unnest(c(value)) %>%
arrange(desc(value)) %>%
pull(name) %>%
select(df1, .)
Or if we want to select the column 'c'
df1 %>%
select(c, everything())
# c a b
#1 -0.0001814243 -0.001061635 -4.113573e-05
In base R, we can do
df1[order(-unlist(df1[1,]))]
data
df1 <- structure(list(a = -0.00106163456888295, b = -4.11357273721094e-05,
c = -0.000181424293930435), row.names = 1L, class = "data.frame")
Related
I have two data frames.
The first one that contains all the possible combinations with their corresponding values and looks like this:
first
second
val
A
B
10
A
C
20
A
D
30
B
C
40
B
D
50
C
D
60
H
I
70
The second one that comes from the production line has two columns the date column that has grouped all the variables corresponding to their date and are concatenated:
date
var
2022-01-01
A
2022-02-01
B,C,F,E,G,H,I
I want to find all the combinations in the second data frame and to see if they match with any combinations in the first data frame. If a variable stands alone in the second data frame as A in 2022-01-01 to give me the 0 and otherwise the value of the combination.
Ideally I want the resulting data frame to look like this:
date
comb
val
2022-01-01
A
0
2022-02-01
B,C
40
2022-02-01
H,I
70
How can I do this in R using dplyr?
library(tidyverse)
first = c("A","A","A","B","B","C","H")
second = c("B","C","D","C","D","D","I")
val = c(10,20,30,40,50,60,70)
df1 = tibble(first,second,val);df1
date = c(as.Date("2022-01-01"),as.Date("2022-02-01"))
var = c("A","B,C,F,E,G,H,I")
df2 = tibble(date,var);df2
Using tidyverse:
library(tidyverse)
first = c("A","A","A","B","B","C","H")
second = c("B","C","D","C","D","D","I")
val = c(10,20,30,40,50,60,70)
df1 = tibble(first,second,val);df1
date = c(as.Date("2022-01-01"),as.Date("2022-02-01"))
var = c("A","B,C,F,E,G,H,I")
df2 = tibble(date,var);df2
df2_tidy <- df2 %>%
mutate(first = str_split(var, ","),
second = first) %>%
unnest(first) %>%
unnest(second) %>%
select(-var)
singles <- df2 %>%
filter(!str_detect(var, ",")) %>%
mutate(val = 0) %>%
select(date, comb = var, val)
combs <- df1 %>%
inner_join(df2_tidy, by = c("first", "second")) %>%
mutate(comb = paste(first, second, sep = ",")) %>%
select(date, comb, val)
bind_rows(singles, combs)
Dplyr provides a function top_n(), however in case of equal values it returns all rows (more than one). I would like to return exactly one row per group. See the example below.
df <- data.frame(id1=c(rep("A",3),rep("B",3),rep("C",3)),id2=c(8,8,4,7,7,4,5,5,5))
df %>% group_by(id1) %>% top_n(n=1)
You can use a combination of arrange and slice
df %>%
group_by(id1) %>%
arrange(desc(id2)) %>%
slice(1)
Use desc with in arrange if you want the larges element otherwise leave it out.
Apparently also slice_head is the new name of the function that you are looking for
df %>%
group_by(id1) %>%
arrange(desc(id2)) %>%
slice_head(id2, n=2)
Use slice_max() with the argument with_ties = FALSE:
library(dplyr)
df %>%
group_by(id1) %>%
slice_max(id2, with_ties = FALSE)
# A tibble: 3 x 2
# Groups: id1 [3]
id1 id2
<chr> <dbl>
1 A 8
2 B 7
3 C 5
If you don't want to remember so many {dplyr} function names that are prone to be changed anyway, I can recommend the {data.table} package for such tasks. Plus, it's faster.
require(data.table)
df <- data.frame(id1=c(rep("A",3),rep("B",3),rep("C",3)),id2=c(8,8,4,7,7,4,5,5,5))
setDT(df)
df[ ,
.(id2_head = head(id2, 1)),
by = id1 ]
Given a data frame like data:
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
We want to filter values for group == b using dplyr and use boxplot.stats to identify outliers:
library(dplyr)
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
This returns the error Column out.stats must be length 1 (a summary value), not 4, why does this not work? How do you apply functions like this inside a pipe?
The following answers to the question and to the last comment to the question, where the OP asks for the row numbers of the outliers.
what if we want to return the row numbers that go with
boxplot.stats()$out from the pipe? so if we did
b<-data%>%filter(group=='b') outside of the pipe, we could have used:
which(b$value %in% boxplot.stats(b$value)$out)
This is done by left_joining with the original data.
library(dplyr)
set.seed(1234)
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
data %>% filter(group == 'b') %>% pull(value) %>%
boxplot.stats() %>% '[['('out') %>%
data.frame() %>%
left_join(data, by = c('.' = 'value'))
# . group
#1 3.043766 b
#2 -2.732220 b
#3 -2.855759 b
We can use the new version of dplyr which can also return summarise with more than one row
library(dplyr) # >= 1.0.0
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
# out.stats
#1 -2.4804222, -0.7546693, 0.1304050, 0.6390749, 2.2682247
#2 100
#3 -0.08980661, 0.35061653
#4 -3.014914
so in a dataset, I have a column named "Interventions", and each row looks like this:
row1: "Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600"
row2: "Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
I want to only extract the Intervention type such as "Drug", "Biological", "Procedure" to remain in the column. And even better, if can only have the unique Intervention type instead of "Drug" 4 times like the first row.
The expected output would look like this:
row1: "Drug"
row2: "Biological, Drug, Procedure"
I am just getting started with r, I have tidyverse installed and kinda used to playing with the %>%. If anyone can help me with this, much appreciated !
If we want to extract only the prefix part before the :
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
df1 %>%
mutate(Interventions = map_chr(str_extract_all(Interventions,
"\\w+(?=:)"), ~ toString(sort(unique(.x)))))
# Interventions
#1 Drug
#2 Biological, Drug, Procedure
Or another option is to separate the rows based on the delimiters, slice the alternate rows and paste together the sorted unique values in 'Interventions'
df1 %>%
mutate(rn = row_number()) %>%
separate_rows(Interventions, sep="[:|]") %>%
group_by(rn) %>%
slice(seq(1, n(), by = 2)) %>%
distinct() %>%
summarise(Interventions = toString(sort(unique(Interventions)))) %>%
ungroup %>%
select(-rn)
# A tibble: 2 x 1
# Interventions
# <chr>
#1 Drug
#2 Biological, Drug, Procedure
data
df1 <- structure(list(Interventions = c("Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600",
"Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
)), class = "data.frame", row.names = c(NA, -2L))
Not as concise and the same logic as Akruns but in Base R:
# Create df:
df1 <- structure(list(Interventions = c("Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600",
"Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
)), class = "data.frame", row.names = c(NA, -2L))
# Assign a row id vec:
df1$row_num <- 1:nrow(df1)
# Split string on | delim:
split_up <- strsplit(df1$Interventions, split = "[|]")
# Roll down the dataframe - keep uniques:
rolled_out <- unique(data.frame(row_num = rep(df1$row_num, sapply(split_up, length)),
Interventions = gsub("[:].*","", unlist(split_up))))
# Stack the dataframe:
df2 <- aggregate(Interventions~row_num, rolled_out, paste0, collapse = ", ")
# Drop id vec:
df2 <- within(df2, rm("row_num"))
I've a dataset with 18 columns from which I need to return the column names with the highest value(s) for each observation, simple example below. I came across this answer, and it almost does what I need, but in some cases I need to combine the names (like abin maxcolbelow). How should I do this?
Any suggestions would be greatly appreciated! If it's possible it would be easier for me to understand a tidyverse based solution as I'm more familiar with that than base.
Edit: I forgot to mention that some of the columns in my data have NAs.
library(dplyr, warn.conflicts = FALSE)
#turn this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5)
#into this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5, maxol = c("ab", "b", "b"))
Created on 2018-10-30 by the reprex package (v0.2.1)
Continuing from the answer in the linked post, we can do
Df$maxcol <- apply(Df, 1, function(x) paste0(names(Df)[x == max(x)], collapse = ""))
Df
# a b c maxcol
# <int> <int> <int> <chr>
#1 4 4 3 ab
#2 3 5 4 b
#3 2 6 5 b
For every row, we check which position has max values and paste the names at that position together.
If you prefer the tidyverse approach
library(tidyverse)
Df %>%
mutate(row = row_number()) %>%
gather(values, key, -row) %>%
group_by(row) %>%
mutate(maxcol = paste0(values[key == max(key)], collapse = "")) %>%
spread(values, key) %>%
ungroup() %>%
select(-row)
# maxcol a b c
# <chr> <int> <int> <int>
#1 ab 4 4 3
#2 b 3 5 4
#3 b 2 6 5
We first convert dataframe from wide to long using gather, then group_by each row we paste column names for max key and then spread the long dataframe to wide again.
Here's a solution I found that loops through column names in case you find it hard to wrap your head around spread/gather (pivot_wider/longer)
out_df <- Df %>%
# calculate rowwise maximum
rowwise() %>%
mutate(rowmax = max(across())) %>%
# create empty maxcol column
mutate(maxcol = "")
# loop through column names
for (colname in colnames(Df)) {
out_df <- out_df %>%
# if the value at the specified column name is the maximum, paste it to the maxcol
mutate(maxcol = ifelse(.data[[colname]] == rowmax, paste0(maxcol, colname), maxcol))
}
# remove rowmax column if no longer needed
out_df <- out_df %>%
select(-rowmax)