I have this input:
t <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
I want the row-wise nth-lowest element of the data frame, i.e. the nth value of each row after sorting that row's values, so that the output looks something like this (example for nth_element = 2):
[1] 2 3 5 4
I tried a function like this:
apply(t, 1, nth, n=1, order_by = .)
But this does not work. Two questions:
What should I type in the order_by argument to make this function work?
What is the best way to summarise rows with a custom summary function if I don't want to mention the column names in the row-wise summary function?
Sidenote:
I don't want to mention the column names specifically; I want the function to use all columns in the dataset.
I tried the rownth function from the Rfast package, but it only returns one result. Does anybody know what I am doing wrong?
We can use apply and sort to do this.
d <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
nth_lowest <- 2
apply(d, 1, FUN = function(x) sort(x)[nth_lowest])
# [1] 2 3 5 4
Note that I am calling the data d instead of t. t is not a reserved word, but it is already the name of a base R function (matrix transpose), so reusing it for data is confusing.
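The same idea can be wrapped in a small helper so that n becomes an argument. This is only a minimal sketch: the name row_nth_lowest is made up for illustration and does not come from any package.

```r
# Hypothetical helper (not from any package): n-th lowest value of each row.
row_nth_lowest <- function(df, n) {
  m <- as.matrix(df)                 # apply() converts to a matrix anyway
  stopifnot(n >= 1, n <= ncol(m))    # n must be a valid position within a row
  unname(apply(m, 1, function(x) sort(x)[n]))
}

d <- data.frame(x = c(1, 2, 8, 4), y = c(2, 3, 4, 5), k = c(3, 4, 5, 1))
row_nth_lowest(d, 2)
# [1] 2 3 5 4
```

If you would rather stay with dplyr::nth, passing the row itself as order_by inside the anonymous function, e.g. function(x) nth(x, 2, order_by = x), should give the same result.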
Not as elegant as @bouncyball's answer, but using dplyr (and tidyr), one possibility is to do:
library(dplyr)
library(tidyr)
t %>% mutate(Row = row_number()) %>%
pivot_longer(-Row, names_to = "Col", values_to = "Val") %>%
group_by(Row) %>%
arrange(Val) %>%
slice(2) %>%
select(Val)
Adding missing grouping variables: `Row`
# A tibble: 4 x 2
# Groups: Row [4]
Row Val
<int> <dbl>
1 1 2
2 2 3
3 3 5
4 4 4
Using Rfast can reduce the run time for large inputs, but it works on matrices only, so the data frame has to be converted first.
d <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
d <- Rfast::data.frame.to_matrix(d)
# rownth() expects one n per row, so the vector needs nrow(d) entries
nth_lowests <- rep(2, nrow(d))
Rfast::rownth(d,nth_lowests)
# [1] 2 3 5 4
You could also use the parallel version of Rfast::rownth
Sample data frame
Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob","fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest,State)
Desired output
I have tried about a dozen different ideas but am not getting close. The closest was setting up a crosstab, but I didn't know how to get counts from that. Reshaping between long and wide got me nowhere, and so on. I guess I am still too new to think outside the box.
Try this approach. You can arrange your values and then use group_by() and summarise() to reach a structure similar to the expected one:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
arrange(Guest,State) %>%
group_by(Guest) %>%
summarise(Chain=paste0(State,collapse = '-')) %>%
group_by(Chain, .drop = TRUE) %>%
summarise(N=n())
Output:
# A tibble: 4 x 2
Chain N
<chr> <int>
1 AL-MA-TX 1
2 AL-TX 2
3 IA-MA 2
4 IA-TX 1
We can use base R with aggregate and table
table(aggregate(State~ Guest, df[do.call(order, df),], paste, collapse='-')$State)
Output:
# AL-MA-TX AL-TX IA-MA IA-TX
# 1 2 2 1
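The one-liner above does two things: it builds one sorted, "-"-joined chain of states per guest, and then counts how often each chain occurs. A step-by-step base R sketch of the same logic, using split() and vapply() instead of aggregate() (the intermediate names chains and counts are only illustrative):

```r
Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob",
           "fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest, State, stringsAsFactors = FALSE)

# Step 1: one sorted, "-"-joined chain of states per guest
chains <- vapply(split(df$State, df$Guest),
                 function(s) paste(sort(s), collapse = "-"),
                 character(1))

# Step 2: count how many guests share each chain
counts <- table(chains)
counts
```

Keeping chains as a named vector also makes it easy to check which guest produced which chain, e.g. chains["bob"].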
dplyr provides the function top_n(); however, when values are tied it returns all tied rows (more than one). I would like to return exactly one row per group. See the example below.
df <- data.frame(id1=c(rep("A",3),rep("B",3),rep("C",3)),id2=c(8,8,4,7,7,4,5,5,5))
df %>% group_by(id1) %>% top_n(n=1)
You can use a combination of arrange and slice
df %>%
group_by(id1) %>%
arrange(desc(id2)) %>%
slice(1)
Use desc within arrange if you want the largest element; otherwise leave it out.
Newer dplyr versions also provide slice_head, which does the same job as slice(1) here:
df %>%
group_by(id1) %>%
arrange(desc(id2)) %>%
slice_head(n = 1)
Use slice_max() with the argument with_ties = FALSE:
library(dplyr)
df %>%
group_by(id1) %>%
slice_max(id2, with_ties = FALSE)
# A tibble: 3 x 2
# Groups: id1 [3]
id1 id2
<chr> <dbl>
1 A 8
2 B 7
3 C 5
If you don't want to remember so many {dplyr} function names, which are prone to change anyway, I can recommend the {data.table} package for such tasks. Plus, it's faster.
require(data.table)
df <- data.frame(id1=c(rep("A",3),rep("B",3),rep("C",3)),id2=c(8,8,4,7,7,4,5,5,5))
setDT(df)
df[ ,
.(id2_head = head(id2, 1)),
by = id1 ]
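Note that head(id2, 1) returns the first row of each group, which is the maximum here only because the sample data is already sorted in descending order within each id1. If that cannot be assumed, sort first. A minimal base R sketch of "sort, then keep the first row per group" (the names ord and top1 are illustrative):

```r
df <- data.frame(id1 = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                 id2 = c(8, 8, 4, 7, 7, 4, 5, 5, 5))

# Sort by group, then by id2 descending within each group
ord <- df[order(df$id1, -df$id2), ]

# duplicated() flags every row after the first one per id1, so negating it
# keeps exactly one row per group even when the top values are tied
top1 <- ord[!duplicated(ord$id1), ]
top1
```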
I have a data.frame df with columns A and B:
df <- data.frame(A = 1:5, B = 11:15)
There's another data.frame, df2, which I'm building by various calculations that ends up having generic column names X1 and X2, which I cannot control directly (because it passes through being a matrix at one point). So it ends up being something like:
mtrx <- matrix(1:10, ncol = 2)
mtrx %>% data.frame()
I would like to rename the columns in df2 to be the same as those in df. I could, of course, do it after I finish building df2 with a simple assignment:
names(df2)<-names(df)
My question is: is there a way to do this directly within the pipe? I can't seem to use dplyr::rename, because its arguments have to be in the form newname = oldname, and I can't seem to vectorize it. The same goes for the data.frame call itself: I can't just give it a vector of column names, as far as I can tell. Is there another option I'm missing? What I'm hoping for is something like
mtrx %>% data.frame() %>% rename(names(df))
but this doesn't work; it gives the error Error: All arguments must be named.
Cheers!
You can use setNames
mtrx %>%
data.frame() %>%
setNames(., nm = names(df))
# A B
#1 1 6
#2 2 7
#3 3 8
#4 4 9
#5 5 10
Or use purrr's equivalent set_names
mtrx %>%
data.frame() %>%
purrr::set_names(., nm = names(df))
A third option is "names<-"
mtrx %>%
data.frame() %>%
"names<-"(names(df))
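All three options fit into a pipe for the same reason: each function returns the renamed object instead of modifying it in place. For comparison, here is the same step written as a plain nested call in base R, as a small sketch using the data from the question:

```r
df <- data.frame(A = 1:5, B = 11:15)
mtrx <- matrix(1:10, ncol = 2)

# setNames() returns its first argument with the names replaced
df2 <- setNames(data.frame(mtrx), names(df))
names(df2)
# [1] "A" "B"
```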
We can use rename_all from dplyr (part of the tidyverse)
library(tidyverse)
mtrx %>%
as.data.frame %>%
rename_all(~ names(df))
# A B
# 1 1 6
# 2 2 7
# 3 3 8
# 4 4 9
# 5 5 10
Well, I know that there are already tons of related questions, but none gave an answer to my particular need.
I want to use dplyr "summarize" on a table with 50 columns, and I need to apply different summary functions to these.
"Summarize_all" and "summarize_at" both seem to have the disadvantage that it's not possible to apply different functions to different subgroups of variables.
As an example, let's assume the iris dataset would have 50 columns, so we do not want to address columns by names. I want the sum over the first two columns, the mean over the third and the first value for all remaining columns (after a group_by(Species)). How could I do this?
Fortunately, there is a much simpler way available now.
With the new dplyr 1.0.0 coming out soon, you can leverage the across function for this purpose.
All you need to type is:
iris %>%
group_by(Species) %>%
summarize(
# I want the sum over the first two columns,
across(c(1,2), sum),
# the mean over the third
across(3, mean),
# the first value for all remaining columns (after a group_by(Species))
across(-c(1:3), first)
)
Great, isn't it?
I first thought the across is not necessary as the scoped variants worked just fine, but this use case is exactly why the across function can be very beneficial.
You can get the latest version of dplyr by devtools::install_github("tidyverse/dplyr")
As other people have mentioned, this is normally done by calling summarize_each / summarize_at / summarize_if for every group of columns that you want to apply a summarizing function to. As far as I know, you would have to create a custom function that performs the summarization on each subset. You can, for example, set the column names in such a way that you can use the select helpers (e.g. contains()) to pick just the columns that you want to apply the function to. If not, you can specify the column numbers that you want to summarize.
For the example you mentioned, you could try the following:
summarizer <- function(tb, colsone, colstwo, colsthree,
                       funsone, funstwo, funsthree, group_name) {
  # {{ }} forwards each column selection to select() unevaluated, so that
  # helpers such as contains() are evaluated inside select()
  bind_cols(
    summarize_all(select(tb, {{ colsone }}), .funs = funsone),
    summarize_all(select(tb, {{ colstwo }}), .funs = funstwo) %>%
      ungroup() %>% select(-matches(group_name)),
    summarize_all(select(tb, {{ colsthree }}), .funs = funsthree) %>%
      ungroup() %>% select(-matches(group_name))
  )
}
#With colnames
iris %>% as_tibble() %>%
group_by(Species) %>%
summarizer(colsone = contains("Sepal"),
colstwo = matches("Petal.Length"),
colsthree = c(-contains("Sepal"), -matches("Petal.Length")),
funsone = "sum",
funstwo = "mean",
funsthree = "first",
group_name = "Species")
#With indexes
iris %>% as_tibble() %>%
group_by(Species) %>%
summarizer(colsone = 1:2,
colstwo = 3,
colsthree = 4,
funsone = "sum",
funstwo = "mean",
funsthree = "first",
group_name = "Species")
You could summarise the data with each function separately and then join the data later if needed.
So something like this for the iris example:
sums <- iris %>% group_by(Species) %>% summarise_at(1:2, sum)
means <- iris %>% group_by(Species) %>% summarise_at(3, mean)
firsts <- iris %>% group_by(Species) %>% summarise_at(4, first)
full_join(sums, means) %>% full_join(firsts)
Though I would try to think of something else if there are more than a handful of summarising functions you need to use.
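One way to keep this manageable when the number of summaries grows is to collect the partial results in a list and merge them in a single step. A base R sketch of that idea, using aggregate() in place of summarise_at() and Reduce() to chain the joins (the names by_species, parts and out are illustrative):

```r
by_species <- list(Species = iris$Species)

# One aggregate() call per summary function, collected in a list
parts <- list(
  aggregate(iris[1:2], by = by_species, FUN = sum),
  aggregate(iris[3],   by = by_species, FUN = mean),
  aggregate(iris[4],   by = by_species, FUN = function(x) x[1])  # first value
)

# Merge all partial results on the grouping column in one go
out <- Reduce(function(a, b) merge(a, b, by = "Species"), parts)
out
```

Adding another summary then only means adding another element to parts.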
Try this:
library(plyr)
library(dplyr)
dataframe <- data.frame(var = c(1,1,1,2,2,2),var2 = c(10,9,8,7,6,5),var3=c(2,3,4,5,6,7),var4=c(5,5,3,2,4,2))
dataframe
# var var2 var3 var4
#1 1 10 2 5
#2 1 9 3 5
#3 1 8 4 3
#4 2 7 5 2
#5 2 6 6 4
#6 2 5 7 2
funnames<-c(sum,mean,first)
colnums<-c(2,3,4)
ddply(.data = dataframe,.variables = "var",
function(x,funcs,inds){
mapply(function(func,ind){
func(x[,ind])
},funcs,inds)
},funnames,colnums)
# var V1 V2 V3
#1 1 27 3 5
#2 2 18 6 2
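If you would rather not depend on plyr, the same pairing of functions with columns can be sketched in base R with split() and mapply(). This is an illustrative alternative rather than a drop-in replacement: the result is a matrix with one row per group, not a data frame.

```r
dataframe <- data.frame(var  = c(1, 1, 1, 2, 2, 2),
                        var2 = c(10, 9, 8, 7, 6, 5),
                        var3 = c(2, 3, 4, 5, 6, 7),
                        var4 = c(5, 5, 3, 2, 4, 2))

funs <- list(sum, mean, function(x) x[1])  # third function: first value
cols <- c(2, 3, 4)                         # column each function applies to

# For every group, apply the i-th function to the i-th column
res <- t(sapply(split(dataframe, dataframe$var), function(g) {
  mapply(function(f, i) f(g[[i]]), funs, cols)
}))
res
```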
I am working with the tidygraph package and am trying to find a "tidy" solution for the example below. The problem is not really tied to tidygraph and is more about data wrangling, but I think it is interesting for people working with this package.
In the following code chunk I just generate some sample data.
library(tidyverse)
library(tidygraph)
library(igraph)
library(randomNames)
library(reshape2)
graph <- play_smallworld(1, 100, 3, 0.05)
labeled_graph <- graph %>%
activate(nodes) %>%
mutate(group = sample(letters[1:3], size = 100, replace = TRUE),
name = randomNames(100)
)
sub_graphs_df <- labeled_graph %>%
morph(to_split, group) %>%
crystallise()
The resulting data.frame looks as follows:
sub_graphs_df
# A tibble: 3 x 2
name graph
<chr> <list>
1 group: a <S3: tbl_graph>
2 group: b <S3: tbl_graph>
3 group: c <S3: tbl_graph>
Now to my actual problem. I want to apply a function to each element in the column graph. The result of each call is simply a named vector.
sub_graphs_df$graph %>% map(degree)
The first thing I do not like is the subsetting by $. Is there a better way?
Next, I want to reshape this result into only one data.frame with 3 columns. One column for name (the name attribute of the vectors), one for group (the name attribute of the list) and one for the number (the elements of the vectors).
I tried melt from the reshape2 package.
sub_graphs_df$graph %>% map(degree) %>% melt
It works decently, but the names are lost and, as I understand it, one should use tidyr instead. However, I could not get gather to work because it only accepts data.frames.
Another option would be unlist:
sub_graphs_df$graph %>% map(degree) %>% unlist
Here the group and the name are in the names attribute and I would have to recover them with regular expressions.
I am pretty sure there is an easy way I just could not think of.
We can create a list column with mutate while applying the function with map, extract the names and integer and unnest to create the 'long' format dataset
sub_graphs_df %>%
mutate(newout = map(graph, degree)) %>%
transmute(name, group = map(newout, ~.x %>% names), number = map(newout, as.integer)) %>%
unnest(cols = c(group, number))
# A tibble: 100 x 3
# name group number
# <chr> <chr> <int>
# 1 group: a Seng, Trevor 0
# 2 group: a Buccieri, Joshua 1
# 3 group: a Street, Aimee 2
# 4 group: a Gonzalez, Corey 2
# 5 group: a Barber, Monique 1
# 6 group: a Doan, Christina 1
# 7 group: a Ninomiya, Janna 1
# 8 group: a Bazemore, Chao 1
# 9 group: a Perfecto, Jennifer 1
#10 group: a Lopez Jr, Vinette 0
# ... with 90 more rows
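The underlying reshaping step, turning a named list of named vectors into a long data frame, does not depend on tidygraph. Here is a base R sketch on made-up toy data: the list deg and the names in it are hypothetical stand-ins for the result of map(degree) in the question.

```r
# Hypothetical stand-in for the list returned by map(degree)
deg <- list("group: a" = c(Ann = 0, Bob = 1),
            "group: b" = c(Cy = 2, Dee = 2, Eve = 1))

long <- data.frame(
  group  = rep(names(deg), times = lengths(deg)),         # list names, repeated
  name   = unlist(lapply(deg, names), use.names = FALSE), # vector names
  number = unlist(deg, use.names = FALSE),                # vector values
  stringsAsFactors = FALSE
)
long
```

This keeps both sets of names as proper columns, so there is no need to recover them from the combined names that unlist produces.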