Create new columns with length of groups in dplyr - r

I am trying to create a new data frame which is grouped by one column (i.e. Petal.Width below) and has new columns created from the groups of another variable (i.e. Species), holding the number of observations from each of the Species groups. I assume dplyr is able to do this, but I can't quite get what I need.
I have tried this code, but it returns the length of all observations in Species rather than the length of each group (i.e. all columns have the same data):
iris=as.data.frame(iris)
groups= iris %>%
group_by(Petal.Width) %>%
summarize(Seposa=length(Species == "seposa"),
Versicolor=length(Species == "versicolor"),
Virginica=length(Species == "virginica"))
I assume I am just making a small error somewhere. Any help please!

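As @Z.Lin notes, you need sum() instead of length() in your example: length() returns the number of elements in the logical vector, whereas sum() counts only the TRUE values. With this method it's also critical that you don't misspell the species names (the question has "seposa" where the data has "setosa").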
Here's another way to do it:
library(dplyr)
library(tidyr)  # spread() comes from tidyr
iris=as.data.frame(iris)
iris %>%
group_by(Petal.Width, Species) %>%
count() %>%
spread(Species, n, fill = 0)
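For comparison, the sum() fix mentioned above can be sketched as follows. This is a minimal illustration on the built-in iris data; the object name counts and the capitalised column names are just choices for this example.

```r
library(dplyr)

# sum() over a logical vector counts the TRUE values, i.e. the rows
# of each Petal.Width group that belong to the given species.
counts <- iris %>%
  group_by(Petal.Width) %>%
  summarise(
    Setosa     = sum(Species == "setosa"),
    Versicolor = sum(Species == "versicolor"),
    Virginica  = sum(Species == "virginica")
  )

counts
```

Because each species name is typed by hand, a typo silently yields a column of zeros, which is the risk the answer warns about; the count()/spread() version avoids it by taking the names from the data.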

Related

Merging many columns in R

I have an issue with merging many columns by the same ID. I know that this is possible for two lists, but I need to combine all species columns into one, so that the first column is species (combined), followed by w, w.1, w.2, w.3, w.4... The species columns all contain the same species but not in the same order, so I can't just drop every other column, as the w values would then no longer be associated with the right species. This is an extremely large dataset of 10000 rows and 2000 columns, so this would need to be automated. I need the w values to be associated with the corresponding species. Dataset attached.
Thank you for any help
If your data is in a frame called dt, you can use lapply() along with bind_rows() like this:
library(dplyr)
library(tidyr)
bind_rows(
lapply(seq(1,ncol(dt),2), function(x) {
dt[,c(x,x+1)] %>%
rename_with(~c("Species", "value")) %>%
mutate(w = colnames(dt)[x+1])
})
) %>%
pivot_wider(id_cols = Species, names_from = w)
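To see what this does, here is a small sketch on a made-up data frame. The column names and values in dt are assumptions standing in for the attached dataset, which is not shown; values_from is written out explicitly here, although pivot_wider() defaults to a column named value.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the real data: alternating species / w columns,
# with the species listed in a different order in each pair.
dt <- data.frame(
  species   = c("a", "b", "c"),
  w         = c(1, 2, 3),
  species.1 = c("c", "a", "b"),
  w.1       = c(30, 10, 20),
  stringsAsFactors = FALSE
)

merged <- bind_rows(
  # Take each (species, w) pair, give it common names, and remember
  # which w column the values came from.
  lapply(seq(1, ncol(dt), 2), function(x) {
    dt[, c(x, x + 1)] %>%
      rename_with(~ c("Species", "value")) %>%
      mutate(w = colnames(dt)[x + 1])
  })
) %>%
  pivot_wider(id_cols = Species, names_from = w, values_from = value)

merged
```

Each species ends up on one row, with its w and w.1 values matched by name rather than by row position.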

Extract (or isolate) 'group-wise constant' columns from a data frame, *using dplyr/tidyverse*

How can I extract (or isolate) group-wise constant columns from a data frame, using dplyr/tidyverse?
This is an update of Dowle/Hadley's decades-old question here. The earlier poster's example...
Using a contrived example from iris (to generate a dataset with columns that are constant by group for this example):
irisX <- iris %>% mutate(
numspec = as.numeric(Species),
numspec2 = numspec*2
)
Now I want to generate a dataset that keeps the columns Species, numspec, and numspec2 only (and keeps only one row for each).
And I don't want to have to tell it which columns these are (constant by group) -- I want it to find these for me.
So what I want is
Species, numspec, numspec2
setosa, 1, 2
versicolor, 2, 4
virginica, 3, 6
Unlike in the older linked question I want to do something using the tidyverse so I can understand it better and the code looks cleaner.
I tried something like
single_iris <- irisX %>%
group_by(Species) %>%
select_if(function(.) n_distinct(.) == 1)
But the latter select_if ignores the groupings.
If we want to use select, do it outside the grouping
library(dplyr)
irisX %>%
select(where(~ n_distinct(.) == n_distinct(irisX$Species))) %>%
distinct()
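Note that the answer above selects columns whose overall number of distinct values equals the number of groups, which is a proxy for being constant within each group. A stricter check, offered here as a sketch and not as the answerer's method, counts distinct values per group and keeps the columns where every group has exactly one. The name const_cols is just a label for this example.

```r
library(dplyr)

irisX <- iris %>%
  mutate(
    numspec  = as.numeric(Species),
    numspec2 = numspec * 2
  )

# Count distinct values of every non-grouping column within each Species,
# then keep the names of columns where that count is 1 in all groups.
const_cols <- irisX %>%
  group_by(Species) %>%
  summarise(across(everything(), n_distinct), .groups = "drop") %>%
  select(-Species) %>%
  select(where(~ all(. == 1))) %>%
  names()

single_iris <- irisX %>%
  select(Species, all_of(const_cols)) %>%
  distinct()

single_iris
```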
You could do:
iris %>%
group_by(Species)%>%
summarise(numspec = as.numeric(first(Species)),
numspec2 = numspec*2)

Deleting every last row in every group in R [duplicate]

This question already has an answer here:
R delete last row in dataframe for each group
(1 answer)
Closed 2 years ago.
I need to delete every last row in a group after applying group_by.
I have tried something like this, but it does not work.
data=data %>%
group_by(isin) %>%
summarise(data=data[-length(isin),])
Thanks for your help!
We use the built in iris data set as an example. It has three groups of 50 rows each defined by the Species column. Next time please provide sample data in the question. See the top of the r tag page for info.
1) group_modify We can use group_modify from dplyr.
library(dplyr)
iris %>%
group_by(Species) %>%
group_modify(~ head(., -1)) %>%
ungroup
2) slice Another dplyr solution is to use slice
library(dplyr)
iris %>%
group_by(Species) %>%
slice(-n()) %>%
ungroup
3) by A base solution is to use by. It produces a list of data frames which we rbind back together.
do.call("rbind", by(iris, iris$Species, head, -1))
4) subset/ave Another base solution is to create a vector of numbers which count down to 1 for each group and then only keep those rows corresponding to a number greater than 1.
subset(iris, ave(1:nrow(iris), Species, FUN = function(x) length(x):1) > 1)
4a) or keep all rows except the one having the maximum row number in each group:
n <- nrow(iris)
subset(iris, ave(1:n, Species, FUN = max) != 1:n)
5) duplicated Yet another base solution uses duplicated. It only keeps rows whose Species column is duplicated counting back from the end.
subset(iris, duplicated(Species, fromLast = TRUE))
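Try using the base function by. Note that the last row of each group is indexed with nrow(x), since length(x) on a data frame gives the number of columns:
new_data=do.call(rbind,by(data,data[,'isin'],function(x) x[-nrow(x),]))
by() returns the groups as a list, and do.call(rbind,...) combines that list back into a data.frame.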
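As a quick check of the behaviour on data shaped like the question's, here is a minimal sketch using the slice approach. The data frame below is invented for illustration; only the isin column name comes from the question.

```r
library(dplyr)

# Hypothetical data: two groups identified by isin.
data <- data.frame(
  isin  = c("A", "A", "A", "B", "B"),
  price = c(1, 2, 3, 10, 20)
)

# Within each group, n() is the group size, so -n() drops the last row.
trimmed <- data %>%
  group_by(isin) %>%
  slice(-n()) %>%
  ungroup()

trimmed
```

One consequence worth keeping in mind: a group with a single row disappears entirely, since its only row is also its last.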

R check for outliers in multiple variables

I need to check my data for outliers and I have 67 different variables, so I don't want to do it by hand. This is my code for checking a single variable by hand (I have three factors to be checked: voiceID, gender and VP). But I don't know how I should change it to a loop that iterates over columns.
features %>%
group_by(voiceID, gender, VP) %>%
identify_outliers(meanF0)
The values are all numbers. The output should tell me which rows for what factors are outliers.
Thanks for help
The output of identify_outliers is a tibble with multiple columns and it can take a single variable at a time. The variable name can be either quoted or unquoted. In that case, we can group_split the data by the grouping variables, then loop over the columns of interest, and apply the identify_outliers
library(dplyr)
library(purrr)
library(rstatix)
nm1 <- c("score", "score2")
demo.data %>%
group_split(gender) %>%
map(~ map(nm1, function(x) .x %>%
identify_outliers(x)))
If we want to count the outliers,
features %>%
group_by(voiceID, gender, VP) %>%
summarise(across(everything(), ~ length(boxplot(., plot = FALSE)$out)))
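The counting idea can be tried on a toy data frame. This is a minimal sketch: the features data below is made up (only the column name meanF0 and the grouping variable gender echo the question), and it uses base R's boxplot.stats(), the helper behind boxplot(), whose hinge-based 1.5 * IQR rule may differ slightly from the rule identify_outliers applies.

```r
library(dplyr)

# Hypothetical data: one obvious outlier per numeric column.
features <- data.frame(
  gender = rep(c("f", "m"), each = 10),
  meanF0 = c(1:9, 100, 1:10),    # 100 is an outlier in group "f"
  jitter = c(1:10, -50, 2:10)    # -50 is an outlier in group "m"
)

# For every numeric column, count the values flagged by the boxplot rule
# within each group.
outlier_counts <- features %>%
  group_by(gender) %>%
  summarise(
    across(where(is.numeric), ~ length(boxplot.stats(.x)$out)),
    .groups = "drop"
  )

outlier_counts
```

Restricting across() to where(is.numeric) keeps the call from failing if the real data also contains non-numeric columns outside the grouping variables.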

How might I summarize the sum of all columns in a filtered dataset using dplyr?

I'm having trouble getting the sum of a column from a filtered dataset. Would someone be able to show me where I am going wrong? This summarize method worked before, but now I get an error. Thank you,
select("STNAME", "CTYNAME", "YEAR", "AGEGRP", "TOT_POP", "TOT_MALE", "TOT_FEMALE")
save(popSample, file="./datafiles/popSample.rdata" )
load("./datafiles/popSample.rdata")
# We want to see Total Population for all years and all age groups
set1filter <- popSample %>%
filter(AGEGRP == 0) %>%
summarize(set1filter, set1 = sum(TOT_POP))
set1
There is an extra %>% at the end of the filter() line where set1filter is created. Either remove that pipe and call summarize(set1filter, ...) as a separate statement, or, if keeping a single chain, remove set1filter from the summarize() call:
library(dplyr)
popSample %>%
filter(AGEGRP == 0) %>%
summarise(set1 = sum(TOT_POP))
summarize() can't refer to an object that has not been created yet: set1filter only exists once the whole chain has finished.
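The other option, keeping set1filter as its own object and summarizing it in a second statement, might look like this. The popSample data frame here is a made-up stand-in with only the two columns the example needs.

```r
library(dplyr)

# Hypothetical stand-in for the census sample in the question.
popSample <- data.frame(
  AGEGRP  = c(0, 0, 1),
  TOT_POP = c(100, 250, 40)
)

# Step 1: create the filtered object (no trailing %>%).
set1filter <- popSample %>%
  filter(AGEGRP == 0)

# Step 2: summarize the object that now exists.
set1 <- summarize(set1filter, set1 = sum(TOT_POP))

set1
```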
