arrange data based on user defined variables order? [duplicate] - r

This question already has answers here:
Order data frame rows according to vector with specific order
(6 answers)
Closed 1 year ago.
I have the following data.frame and would like to change the order of the rows in such a way that rows with variable == "C" come at the top followed by rows with "A" and then those with "B".
library(tidyverse)
set.seed(123)
D1 <- data.frame(Serial = 1:10, A= runif(10,1,5),
B = runif(10,3,6),
C = runif(10,2,5)) %>%
pivot_longer(-Serial, names_to = "variables", values_to = "Value" ) %>%
arrange(-desc(variables))

D1 %>%
mutate(variables = ordered(variables, c('C', 'A', 'B'))) %>%
arrange(variables)
Perhaps I misread the question. If you want C, then A, then B within each Serial, you could do:
D1 %>%
mutate(variables = ordered(variables, c('C', 'A', 'B'))) %>%
arrange(Serial, variables)

@Onyambu's answer is probably the most "tidyverse-ish" way to do it, but another is:
D1[order(match(D1$variables,c("C","A","B"))),]
or
D1 %>% slice(order(match(variables,c("C","A","B"))))
or
D1 %>% slice(variables %>% match(c("C","A","B")) %>% order())
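The match() idea above can also be passed straight to arrange(), which avoids both the factor conversion and the slice(order(...)) step. A minimal sketch, assuming the same D1 as in the question; custom_order and D1_sorted are names introduced here only for illustration:

```r
library(dplyr)
library(tidyr)

set.seed(123)
D1 <- data.frame(Serial = 1:10,
                 A = runif(10, 1, 5),
                 B = runif(10, 3, 6),
                 C = runif(10, 2, 5)) %>%
  pivot_longer(-Serial, names_to = "variables", values_to = "Value")

# Desired order of the groups (illustrative name)
custom_order <- c("C", "A", "B")

# match() returns each row's position in custom_order; arrange() sorts by it
D1_sorted <- D1 %>%
  arrange(match(variables, custom_order))
```

Unlike the ordered() approach, this leaves variables as a character column.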

Related

In R, dplyr mutate referencing column names by string [duplicate]

This question already has answers here:
How to pass dynamic column names in dplyr into custom function?
(2 answers)
Closed 1 year ago.
mydf = data.frame(a = c(1,2,3,4), b = c(5,6,7,8), c = c(3,4,5,6))
var1 = 'a'
var2 = 'b'
mydf = mydf %>% mutate(newCol = var1 + var2)
In our code, var1 and var2 can refer to different columns in mydf, and we need to create newCol as the sum of the columns whose names are stored in var1 and var2. I understand this can be done outside of dplyr, but I am wondering whether there is a solution that uses dplyr and %>% as above.
We can convert the strings to symbols and evaluate them with !!:
library(dplyr)
mydf %>%
mutate(newCol = !! rlang::sym(var1) + !! rlang::sym(var2))
Another option is to subset the columns with the .data pronoun:
mydf %>%
mutate(newCol = .data[[var1]] + .data[[var2]])
Or we may use rowSums (note that cur_data() is deprecated as of dplyr 1.1.0 in favor of pick()):
mydf %>%
mutate(newCol = rowSums(select(cur_data(), all_of(c(var1, var2)))))
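Since the linked duplicate is about passing column names into a custom function, the .data approach can be wrapped as below. This is only a sketch: add_sum_col and its arguments are hypothetical names, and the glue-style "{out}" := syntax assumes dplyr >= 1.0.0.

```r
library(dplyr)

# Hypothetical helper: x, y and out are column names given as strings
add_sum_col <- function(df, x, y, out = "newCol") {
  df %>%
    mutate("{out}" := .data[[x]] + .data[[y]])
}

mydf <- data.frame(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8), c = c(3, 4, 5, 6))
var1 <- "a"
var2 <- "b"

res <- add_sum_col(mydf, var1, var2)
```

Using .data[[x]] rather than !!sym(x) keeps the lookup restricted to columns of the data frame, so a misspelled name gives an error instead of picking up a variable from the calling environment.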

How can I remove the duplicate rows in R [duplicate]

This question already has answers here:
Deleting reversed duplicates with R
(3 answers)
Removing duplicate combinations (irrespective of order)
(1 answer)
data.table with two string columns of set elements, extract unique rows with each row unsorted
(4 answers)
Closed 2 years ago.
In my df, I treat c('apple', 'banana') and c('banana', 'apple') as the same, because the fruit types are the same and only the arrangement differs.
How can I remove rows 1 and 2 and keep only the last row (wanted_df)?
df = data.frame(fruit1 = c('apple', 'banana', 'fig'),
fruit2 = c('banana', 'apple', 'cherry'))
df
wanted_df = df[3,]
Any help will be highly appreciated!
============================
Something goes wrong with my real data.
frames2 loses the rows where lag = 2.
The data frame I want should look like wanted_frames.
pollution1 = c('pm2.5', 'pm10', 'so2', 'no2', 'o3', 'co')
pollution2 = c('pm2.5', 'pm10', 'so2', 'no2', 'o3', 'co')
dis = 'n'
lag = 1:2
frames = expand.grid(pollution1 = pollution1,
pollution2 = pollution2,
dis = dis,
lag = lag) %>%
mutate(pollution1 = as.character(pollution1),
pollution2 = as.character(pollution2),
dis = as.character(dis)) %>%
as_tibble() %>%
filter(pollution1 != pollution2)
vec<- with(frames, paste(pmin(pollution1, pollution2), pmax(pollution1, pollution2)))
frames2 = frames[!duplicated(vec), ]
wanted_frames = frames2 %>% mutate(lag = 2) %>% bind_rows(frames2)
Try this.
library(dplyr)
d <- filter(df, !(fruit1 %in% fruit2) | !(fruit2 %in% fruit1))
Which gives
> d
fruit1 fruit2
1 fig cherry
Update
As commented by @JonSpring and @Phil, the updated code should be
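df %>% rowwise() %>% filter(!(fruit1 %in% fruit2) | !(fruit2 %in% fruit1)) %>% ungroup()
# Note: under rowwise(), each %in% compares one value with one value, so this filter keeps every row where fruit1 != fruit2 (all three rows here).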
A base R way:
vec<- with(df, paste(pmin(fruit1, fruit2), pmax(fruit1, fruit2)))
df[!(duplicated(vec) | duplicated(vec, fromLast = TRUE)), ]
# fruit1 fruit2
#3 fig cherry
Here's a low-tech dplyr approach. Make a sorted key, then keep rows with unique keys.
library(dplyr)
df %>%
mutate(key = paste(pmin(fruit1, fruit2), pmax(fruit1, fruit2))) %>%
add_count(key) %>%
filter(n == 1)
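For the update in the question (frames2 losing the lag = 2 rows), the sorted-key idea seems to carry over if lag is made part of the key, so that a pair is only treated as a duplicate within the same lag. A sketch under two assumptions: that the first occurrence of each pair should be kept (as in the asker's wanted_frames), and that pollutants is just a shorthand name introduced here for the shared vector.

```r
library(dplyr)

pollutants <- c("pm2.5", "pm10", "so2", "no2", "o3", "co")

frames <- expand.grid(pollution1 = pollutants,
                      pollution2 = pollutants,
                      dis = "n",
                      lag = 1:2,
                      stringsAsFactors = FALSE) %>%
  as_tibble() %>%
  filter(pollution1 != pollution2)

frames2 <- frames %>%
  # Order-independent key for the pair, made unique per lag
  mutate(key = paste(pmin(pollution1, pollution2),
                     pmax(pollution1, pollution2),
                     lag)) %>%
  # Keep the first row for each key, drop the reversed duplicate
  distinct(key, .keep_all = TRUE) %>%
  select(-key)
```

Note the difference from the fruit example: there both copies of a duplicated pair are dropped (filter(n == 1)), whereas here one copy per pair and lag is kept.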

Applying functions in dplyr pipes

Given a data frame like data:
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
We want to filter values for group == b using dplyr and use boxplot.stats to identify outliers:
library(dplyr)
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
This returns the error "Column out.stats must be length 1 (a summary value), not 4". Why does this not work, and how do you apply functions like this inside a pipe?
The following answers both the question and the OP's last comment on it, which asks for the row numbers of the outliers:
"What if we want to return the row numbers that go with boxplot.stats()$out from the pipe? If we did b<-data%>%filter(group=='b') outside of the pipe, we could have used:
which(b$value %in% boxplot.stats(b$value)$out)"
This is done by left_joining with the original data.
library(dplyr)
set.seed(1234)
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
data %>% filter(group == 'b') %>% pull(value) %>%
boxplot.stats() %>% '[['('out') %>%
data.frame() %>%
left_join(data, by = c('.' = 'value'))
# . group
#1 3.043766 b
#2 -2.732220 b
#3 -2.855759 b
With dplyr >= 1.0.0, summarise() can return more than one row per group, so the original call runs (from dplyr 1.1.0 on, reframe() is the recommended function for multi-row results):
library(dplyr) # >= 1.0.0
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
# out.stats
#1 -2.4804222, -0.7546693, 0.1304050, 0.6390749, 2.2682247
#2 100
#3 -0.08980661, 0.35061653
#4 -3.014914
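To get the row numbers of the outliers without leaving the pipe, as the OP's comment asks, one possibility is to record the row number after filtering and then keep only the rows whose value is among boxplot.stats()$out. A minimal sketch; the column name row and the object name outliers are assumptions made for illustration:

```r
library(dplyr)

set.seed(1234)
data <- data.frame(group = rep(c("a", "b"), each = 100),
                   value = rnorm(200))

outliers <- data %>%
  filter(group == "b") %>%
  # Row position within the filtered group-b subset
  mutate(row = row_number()) %>%
  # Keep only values that boxplot.stats() flags as outliers
  filter(value %in% boxplot.stats(value)$out)
```

Moving mutate(row = row_number()) before the first filter() would instead give positions in the full data frame.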

Extract last value from vector in data frame [duplicate]

This question already has answers here:
Select the first and last row by group in a data frame
(6 answers)
Closed 2 years ago.
I have a data set with two columns: the first is ID and the second is VALUE. You can see the code and table below:
DATA_TEST <- data.frame(
ID = c("03740270423222","03740270423222","03740270423222","03740270423222","01380926325248","01380926325248","01380926325248"),
VALUE = c("100","200","300","200","300","200","300"))
The table contains many duplicated IDs, so I want to extract only the last value for each ID. The final result should look like the table below:
Can anybody help me solve this problem?
A base R solution with aggregate() and tail()
aggregate(VALUE~ ID, DATA_TEST, tail, 1)
# ID VALUE
# 1 01380926325248 300
# 2 03740270423222 200
Or with the dplyr package:
library(dplyr)
Option 1: summarise() + last()
DATA_TEST %>%
group_by(ID) %>%
summarise(VALUE = last(VALUE))
Option 2: slice_tail(), equivalent to slice(n())
DATA_TEST %>%
group_by(ID) %>%
slice_tail()
In data.table:
DATA_TEST<-data.frame(
ID=c("03740270423222","03740270423222","03740270423222","03740270423222","01380926325248","01380926325248","01380926325248"),
VALUE=c("100","200","300","200","300","200","300")
)
library(data.table)
DT <- as.data.table(DATA_TEST)
DT[, .(VALUE = last(VALUE)), by = ID]
#                ID VALUE
# 1: 03740270423222   200
# 2: 01380926325248   300
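For completeness, base R can also do this without any grouping function, by scanning for duplicates from the end. A short sketch on the same DATA_TEST; last_rows is an illustrative name:

```r
DATA_TEST <- data.frame(
  ID = c("03740270423222", "03740270423222", "03740270423222", "03740270423222",
         "01380926325248", "01380926325248", "01380926325248"),
  VALUE = c("100", "200", "300", "200", "300", "200", "300"))

# duplicated(..., fromLast = TRUE) marks every occurrence of an ID except its
# last one, so negating it keeps exactly the last row per ID
last_rows <- DATA_TEST[!duplicated(DATA_TEST$ID, fromLast = TRUE), ]
```

This keeps the original row order and row names, which can be handy when the position of the last row matters.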

R convert df from wide to long by splitting column names [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I am trying to convert the df_original data.frame below to the form in df_goal. The column names of the original data.frame should be split, with the latter part acting as a key and the first part kept as a variable name. I would prefer a tidyverse solution, but am open to any approach. Thank you very much!
df_original <-
data.frame(id = c(1,2,3),
variable1_partyx = c(4,5,6),
variable1_partyy = c(14,15,16),
variable2_partyx = c(24,25,26),
variable2_partyy = c(34,35,36))
df_goal <-
data.frame(id = c(1,1,2,2,3,3),
key = c("partyx","partyy","partyx","partyy","partyx","partyy"),
variable1 = c(4,14,5,15,6,16),
variable2 = c(24,34,25,35,26,36))
df_original %>%
tidyr::gather(key, value, -id) %>%
tidyr::separate(key, into = c("var", "key"), sep = "_") %>%
tidyr::spread(var, value)
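In tidyr >= 1.0.0, gather() and spread() are superseded, and the same reshape can be written as a single pivot_longer() call using the special ".value" name. A sketch on the question's data; df_long is an illustrative name:

```r
library(tidyr)

df_original <- data.frame(id = c(1, 2, 3),
                          variable1_partyx = c(4, 5, 6),
                          variable1_partyy = c(14, 15, 16),
                          variable2_partyx = c(24, 25, 26),
                          variable2_partyy = c(34, 35, 36))

df_long <- pivot_longer(
  df_original,
  cols = -id,
  # The part before "_" becomes the output column name (".value"),
  # the part after "_" goes into the "key" column
  names_to = c(".value", "key"),
  names_sep = "_"
)
```

The result is a tibble; wrap it in as.data.frame() if a plain data.frame like df_goal is needed.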
