This question already has an answer here:
R delete last row in dataframe for each group
(1 answer)
Closed 2 years ago.
I need to delete every last row in a group after applying group_by.
I have tried something like this, but it does not work.
data=data %>%
group_by(isin) %>%
summarise(data=data[-length(isin),])
Thanks for your help!
We use the built-in iris data set as an example. It has three groups of 50 rows each, defined by the Species column. Next time please provide sample data in the question. See the top of the r tag page for info.
1) group_modify We can use group_modify from dplyr.
library(dplyr)
iris %>%
group_by(Species) %>%
group_modify(~ head(., -1)) %>%
ungroup
2) slice Another dplyr solution is to use slice
library(dplyr)
iris %>%
group_by(Species) %>%
slice(-n()) %>%
ungroup
3) by A base solution is to use by. It produces a list of data frames which we rbind back together.
do.call("rbind", by(iris, iris$Species, head, -1))
4) subset/ave Another base solution is to create a vector of numbers which count down to 1 for each group and then only keep those rows corresponding to a number greater than 1.
subset(iris, ave(1:nrow(iris), Species, FUN = function(x) length(x):1) > 1)
4a) or keep all rows except the one having the maximum row number in each group:
n <- nrow(iris)
subset(iris, ave(1:n, Species, FUN = max) != 1:n)
5) duplicated Yet another base solution uses duplicated. It only keeps rows whose Species column is duplicated counting back from the end.
subset(iris, duplicated(Species, fromLast = TRUE))
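As a quick sanity check, here is a minimal sketch that reruns two of the base solutions above on iris and confirms that each group lost exactly one row. The names res_by and res_dup are only illustrative.

```r
# Re-run two of the base R approaches on the built-in iris data
res_by  <- do.call("rbind", by(iris, iris$Species, head, -1))
res_dup <- subset(iris, duplicated(Species, fromLast = TRUE))

# Each Species started with 50 rows, so 49 should remain per group
table(res_by$Species)
table(res_dup$Species)

# subset() keeps the original row names, so the dropped rows can be listed
dropped <- setdiff(rownames(iris), rownames(res_dup))
dropped
```

Because subset() preserves row names, dropped shows which original rows were removed, which makes it easy to confirm they are the last row of each group.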
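Try using the base function by:
new_data <- do.call(rbind, by(data, data[, 'isin'], function(x) x[-nrow(x), ]))
by returns the groups as a list, and do.call(rbind, ...) converts that list back to a data.frame. The row to drop is indexed with nrow(x), because length(x) on a data frame counts columns, not rows.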
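A minimal sketch of the by approach on a made-up data frame (the isin and price values are hypothetical, not the asker's data). It also shows why the last row has to be indexed with nrow() rather than length():

```r
# Hypothetical data: two groups identified by 'isin'
df <- data.frame(isin = c("A", "A", "A", "B", "B"), price = 1:5)

length(df)  # number of columns in a data frame
nrow(df)    # number of rows

# Drop the last row of each group, then bind the pieces back together
trimmed <- do.call(rbind, by(df, df$isin, function(x) x[-nrow(x), ]))
trimmed
```

A group with a single row ends up empty and simply disappears from the result.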
I have a tibble with 20 variables. So far I've been using this pipe to find out which values appear more than once in a single column
as_tibble(iris) %>% group_by(Petal.Length) %>% summarise(n=sum(n())) %>% filter(n>1)
I was wondering if I could write a line that loops this over all the columns and returns 20 different tibbles (or as many as I need in the future), in the same way the pipe above returns one tibble. I have tried writing my own loops but have had no success; I am quite new to this.
The iris example dataset has 5 columns so feel free to give an answer with 5 columns.
Thank you!
library(dplyr)
col_names <- colnames(iris)
lapply(
col_names,
function(col) {
iris %>%
group_by_at(col) %>%
summarise(n = n()) %>%
filter(n > 1)
}
)
In base R 4.1+ we have this one-liner. For each column it applies table and then keeps only those elements whose count exceeds 1. Finally it converts what remains of the table to a data frame. Omit stack if it is ok to return a list of table objects instead of a list of data frames.
lapply(iris, \(x) stack(Filter(function(x) x > 1, table(x))))
A variation of that is to keep only duplicated items and then add 1 giving slightly fewer keystrokes. Again we can omit stack if returning a list of table objects is ok.
lapply(iris, \(x) stack(table(x[duplicated(x)]) + 1))
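To see why the two variants agree, here is a small sketch on a made-up character vector (the values in x are hypothetical). It leaves out stack and lapply so the counting step is visible on its own:

```r
# Hypothetical toy values: "a" appears 3 times, "b" twice, "c" once
x <- c("a", "b", "a", "c", "b", "a")

# Variant 1: count everything, then keep counts above 1
counts <- table(x)
repeated <- counts[counts > 1]
repeated

# Variant 2: count only the repeat occurrences, then add 1 for the
# first occurrence that duplicated() does not flag
dup_counts <- table(x[duplicated(x)]) + 1
dup_counts
```

Both report the same values and counts; values that occur only once never reach the second table because duplicated() flags nothing for them.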
How can I extract (or isolate) group-wise constant columns from a data frame, using dplyr/tidyverse?
This is an update of Dowle/Hadley's decades-old question here. The earlier poster's example...
Using a contrived example from iris (to generate a dataset with columns that are constant by group for this example):
irisX <- iris %>% mutate(
numspec = as.numeric(Species),
numspec2 = numspec*2
)
Now I want to generate a dataset that keeps the columns Species, numspec, and numspec2 only (and keeps only one row for each).
And I don't want to have to tell it which columns these are (constant by group) -- I want it to find these for me.
So what I want is
Species, numspec, numspec2
setosa, 1, 2
versicolor, 2, 4
virginica, 3, 6
Unlike in the older linked question I want to do something using the tidyverse so I can understand it better and the code looks cleaner.
I tried something like
single_iris <- irisX %>%
group_by(Species) %>%
select_if(function(.) n_distinct(.) == 1)
But the latter select_if ignores the groupings.
If we want to use select, do it outside the grouping
library(dplyr)
irisX %>%
select(where(~ n_distinct(.) == n_distinct(irisX$Species))) %>%
distinct()
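The select above compares overall distinct counts, which works for this example. A more explicit alternative, sketched here as an assumption rather than as the answerer's method, counts distinct values per group and keeps the columns whose count is 1 in every group. The names per_group, const_cols and result are illustrative.

```r
library(dplyr)

irisX <- iris %>%
  mutate(numspec = as.numeric(Species), numspec2 = numspec * 2)

# Number of distinct values of every non-grouping column within each Species
per_group <- irisX %>%
  group_by(Species) %>%
  summarise(across(everything(), n_distinct))

# A column is constant by group if that number is 1 in every group
is_const <- sapply(per_group[-1], function(n) all(n == 1))
const_cols <- names(per_group)[-1][is_const]

# Keep the grouping column plus the constant columns, one row per group
result <- irisX %>%
  select(Species, all_of(const_cols)) %>%
  distinct()
result
```

This does not rely on the constant columns having as many distinct values as there are groups, so it also handles a column that takes the same value in two different groups.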
You could do:
iris %>%
group_by(Species)%>%
summarise(numspec = as.numeric(first(Species)),
numspec2 = numspec*2)
This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
Here is my data:
For each x1 level, I am trying to duplicate the row a number of times equal to number.class, and I would like the length class to go from Lmin..cm. to Lmax..cm., increasing by 1 for each row. I came up with this code:
test<-A.M %>% filter(x1=="Crenimugil crenilabis")
for (i in 1:test$number.class){test<-test %>% add_row()}
for (i in 1:nrow(test)){test[i,]=test[1,]}
for (i in 1:nrow(test)){test$length.class[i]<-print(i+test$Lmin..cm.)}
test$length.class<-test$length.class-1
which basically works and gives me the expected results.
However, this script does not allow me to run this for every species.
Thank you.
Here, we could use uncount from tidyr to replicate the rows, then group by 'x1' and mutate 'Lmin..cm.' by adding row_number():
library(dplyr)
library(tidyr)
A.M %>%
uncount(number.class) %>%
group_by(x1) %>%
mutate(`Lmin..cm.` = `Lmin..cm.` + row_number())
If we need to create a sequence from Lmin..cm. to Lmax..cm., then instead of uncount, we could use map2 to create the sequence and then unnest:
library(purrr)
A.M %>%
mutate(new = map2(`Lmin..cm.`, `Lmax..cm.`, ~ seq(.x, .y, by = 1))) %>%
unnest(c(new))
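Since the question's data is not shown, here is a minimal sketch of the map2/unnest idea on a made-up stand-in for A.M. The data frame toy, its species labels, and its values are all hypothetical; only the column names follow the question.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical stand-in for A.M: one row per species with a length range
toy <- data.frame(
  x1 = c("sp1", "sp2"),
  Lmin..cm. = c(10, 20),
  Lmax..cm. = c(12, 21)
)

# Build one sequence per row, then expand it to one row per length class
expanded <- toy %>%
  mutate(length.class = map2(Lmin..cm., Lmax..cm., ~ seq(.x, .y, by = 1))) %>%
  unnest(length.class)
expanded
```

Each input row is repeated once per value between its minimum and maximum, so no separate number.class column is needed for this variant.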
I am trying to create a new data frame that is grouped by one column (i.e. Petal.Width below) and has new columns created from the groups of another variable (i.e. Species), holding the number of observations from each of the Species groups. I assume dplyr is able to do this, but I can't quite get what I need.
I have tried this code but it returns the length of all observations in Species rather than the length of each group (i.e. all columns have the same data)
iris=as.data.frame(iris)
groups= iris %>%
group_by(Petal.Width) %>%
summarize(Seposa=length(Species == "seposa"),
Versicolor=length(Species == "versicolor"),
Virginica=length(Species == "virginica"))
I assume I am just making a small error somewhere. Any help please!
As @Z.Lin notes, you need sum() instead of length() in your example, but with this method it's critical that you don't misspell a level name (the "seposa" in the question would silently count zero rows).
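A small sketch of the difference, using a made-up vector (species is hypothetical): a comparison returns one TRUE/FALSE per element, so length() reports the size of the whole group while sum() counts the TRUE values.

```r
# Hypothetical values standing in for one Petal.Width group
species <- c("setosa", "setosa", "virginica")

matches <- species == "setosa"  # logical vector, one entry per element

n_all     <- length(matches)  # size of the vector, whatever the comparison gave
n_matches <- sum(matches)     # TRUE counts as 1, FALSE as 0

# A misspelled level matches nothing and gives 0 without any warning
n_typo <- sum(species == "seposa")
```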
Here's another way to do it:
library(dplyr)
library(tidyr) # spread() comes from tidyr
iris=as.data.frame(iris)
iris %>%
group_by(Petal.Width, Species) %>%
count() %>%
spread(Species, n, fill = 0)
This question already has answers here:
Mean per group in a data.frame [duplicate]
(8 answers)
Closed 4 years ago.
I am trying to get better at using pipes (%>%) from the dplyr package. I understand that the whole point of a pipe is that it passes its left-hand side as the first argument of the function on its right. That is, in this example:
area = rep(c(3:7), 5) + rnorm(5)
the pipe
area %>%
mean
is equivalent to the ordinary function call `mean(area)`.
My problem starts when I work with a data frame. I would like to split the data frame into a list of data frames, and then calculate the mean of the area column for each. But I can't figure out how to refer to the column instead of the data frame.
I know that I can get means by year simply by aggregate(area~ year, df, mean) but I would like to practice pipes instead.
Thank you!
# Dummy data
set.seed(13)
df<-data.frame(year = rep(c(1:5), each = 5),
area = rep(c(3:7), each = 5) + rnorm(1))
# Calculate means.
# None of `mean(df$area)`, `mean("area")`, or `mean[area]` works. How do I refer to the column correctly?
df %>%
split(df$year) %>%
mean
This?
df %>%
group_by(year) %>%
summarise(Mean=mean(area))
We need to extract the column from the list of data.frames in split. One option is to loop through the list with map, and summarise the 'area'.
library(dplyr)
library(purrr) # map_df() comes from purrr
df %>%
split(.$year) %>%
map_df(~ .x %>%
summarise(area = mean(area)))
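For comparison, a base R sketch of the same split-then-summarise idea, using the dummy data from the question. It splits the area column itself rather than the whole data frame, which sidesteps the question of how to reach the column inside each piece. The name means is illustrative.

```r
# Dummy data from the question
set.seed(13)
df <- data.frame(year = rep(1:5, each = 5),
                 area = rep(3:7, each = 5) + rnorm(1))

# Split the column by year, then take the mean of each piece
means <- sapply(split(df$area, df$year), mean)
means
```

The result is a named numeric vector with one mean per year, holding the same values as aggregate(area ~ year, df, mean).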