I'm computing the frequency by group under dplyr. But the output is not automatically saved as a dataframe and only shows the first 10 rows. Does anyone know how to do that? I need to use all rows of data for further analyses. THANKS!
library(dplyr)
data01 %>%
group_by(Country, relsta) %>%
summarize(Freq=n()) %>%
mutate (married = Freq/sum(Freq))
Output
Country relsta Freq married
<int> <chr> <int> <dbl>
1 1 1 15 0.176
2 1 3 1 0.0118
3 1 4 28 0.329
4 1 5 6 0.0706
5 1 6 22 0.259
6 1 7 1 0.0118
7 1 99 12 0.141
8 2 NA 273 1
9 3 NA 129 1
10 4 2 9 0.0796
# ... with 115 more rows
dplyr returns tibbles, and printing a tibble shows only the first 10 rows; the remaining rows are still there, just not displayed. Here is an example using iris:
library(dplyr)
res1 <- iris %>%
group_by(Sepal.Length, Species) %>%
summarize(Freq=n()) %>%
mutate(foo = Freq/sum(Freq))
res1
# Sepal.Length Species Freq foo
# <dbl> <fct> <int> <dbl>
# 1 4.3 setosa 1 1
# 2 4.4 setosa 3 1
# 3 4.5 setosa 1 1
# 4 4.6 setosa 4 1
# 5 4.7 setosa 2 1
# 6 4.8 setosa 5 1
# 7 4.9 setosa 4 0.667
# 8 4.9 versicolor 1 0.167
# 9 4.9 virginica 1 0.167
# 10 5 setosa 8 0.8
# # … with 47 more rows
Notice the … with 47 more rows. You may also check the dimensions:
dim(res1)
# [1] 57 4
Also,
class(res1)
# [1] "grouped_df" "tbl_df" "tbl" "data.frame"
whereas:
class(iris)
# [1] "data.frame"
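To see all rows, convert with as.data.frame() (or print the tibble with print(res1, n = Inf)). If the data is very large, base R printing also omits rows once the max.print limit is reached. You can raise that limit with e.g. options(max.print=3000); it is 1000 in RStudio, while plain R defaults to 99999.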
as.data.frame(res1)
# Sepal.Length Species Freq foo
# 1 4.3 setosa 1 1.0000000
# 2 4.4 setosa 3 1.0000000
# 3 4.5 setosa 1 1.0000000
# [...]
# 55 7.6 virginica 1 1.0000000
# 56 7.7 virginica 4 1.0000000
# 57 7.9 virginica 1 1.0000000
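To keep the result for further analyses, assign it to a name. Here is a minimal sketch, assuming dplyr >= 1.0.0 (for the .groups argument); the names res and res_df are just placeholders:

```r
library(dplyr)

# Same pipeline as above, assigned to a name so it can be reused later
res <- iris %>%
  group_by(Sepal.Length, Species) %>%
  summarize(Freq = n(), .groups = "drop_last") %>%
  mutate(foo = Freq / sum(Freq)) %>%
  ungroup()

print(res, n = Inf)            # show every row of the tibble

res_df <- as.data.frame(res)   # plain data.frame, if a tibble is not wanted
dim(res_df)
```

All rows are stored in res either way; only the printing is truncated.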
You could also consider using base R. Since the following line already gives you the "Freq" column,
as.data.frame.table(with(iris, table(Sepal.Length, Species)))
you could do this:
res2 <- with(iris, table(Sepal.Length, Species)) |>
as.data.frame.table() |>
transform(foo=ave(Freq, Sepal.Length, FUN=\(x) x/sum(x))) |>
subset(Freq > 0)
res2
# Sepal.Length Species Freq foo
# 1 4.3 setosa 1 1.0000000
# 2 4.4 setosa 3 1.0000000
# 3 4.5 setosa 1 1.0000000
# [...]
# 103 7.6 virginica 1 1.0000000
# 104 7.7 virginica 4 1.0000000
# 105 7.9 virginica 1 1.0000000
Where:
dim(res2)
# [1] 57 4
class(res2)
# [1] "data.frame"
Note: this requires R >= 4.1 for the native pipe |> and the \(x) lambda shorthand.
The summarize function always returns just one row per group. mutate will keep all the rows here. Try:
library(dplyr)
data02 = data01 %>%
group_by(Country, relsta) %>%
mutate(Freq=n()) %>%
mutate (married = Freq/sum(Freq))
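To see the difference in row counts, here is a small sketch on iris (per_group and per_row are made-up names for illustration):

```r
library(dplyr)

# summarize() collapses each group to a single row
per_group <- iris %>%
  group_by(Species) %>%
  summarize(Freq = n())

# mutate() keeps every original row and repeats the group count
per_row <- iris %>%
  group_by(Species) %>%
  mutate(Freq = n()) %>%
  ungroup()

nrow(per_group)  # one row per species
nrow(per_row)    # same number of rows as iris
```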
I'm trying to get the total number of entries of each row in a dataframe in order to compress on these fields later.
However the dataframe has over 60 rows and writing the below 60 times is extremely inefficient
df %>%
group_by(colname) %>%
count() %>%
arrange(desc(n))
Is there a way I can write a for loop to loop through all the names in the dataframe and produce the pipe function result for each? I tried
for (i in colnames(df)) {
df %>%
group_by(colname) %>%
count() %>%
arrange(desc(n))
}
But I'm getting an 'i is unknown' error. Any help would be appreciated thanks.
If I understand correctly you want to count the number of occurrences of the unique elements in every single column or did I get that completely wrong? Why are you not just using a combination of some apply function and table?
set.seed(101)
df <- data.frame("x" = 1:20,
"y" = LETTERS[sample(1:26, 20, replace = TRUE)],
"z" = letters[sample(1:26, 20, replace = TRUE)])
# lapply always returns a list; sapply could simplify to a matrix
l <- lapply(df, table)
lapply(l, sort, decreasing = TRUE)
You can try this:
#Data
df <- iris
#Create list
List <- list()
#Compute
for (colname in colnames(df)) {
List[[colname]]<- df %>%
group_by(df[,colname]) %>%
count() %>%
arrange(desc(n))
}
#Print
List
$Sepal.Length
# A tibble: 35 x 2
# Groups: df[, colname] [35]
`df[, colname]` n
<dbl> <int>
1 5 10
2 5.1 9
3 6.3 9
4 5.7 8
5 6.7 8
6 5.5 7
7 5.8 7
8 6.4 7
9 4.9 6
10 5.4 6
# ... with 25 more rows
$Sepal.Width
# A tibble: 23 x 2
# Groups: df[, colname] [23]
`df[, colname]` n
<dbl> <int>
1 3 26
2 2.8 14
3 3.2 13
4 3.4 12
5 3.1 11
6 2.9 10
7 2.7 9
8 2.5 8
9 3.3 6
10 3.5 6
# ... with 13 more rows
$Petal.Length
# A tibble: 43 x 2
# Groups: df[, colname] [43]
`df[, colname]` n
<dbl> <int>
1 1.4 13
2 1.5 13
3 4.5 8
4 5.1 8
5 1.3 7
6 1.6 7
7 5.6 6
8 4 5
9 4.7 5
10 4.9 5
# ... with 33 more rows
$Petal.Width
# A tibble: 22 x 2
# Groups: df[, colname] [22]
`df[, colname]` n
<dbl> <int>
1 0.2 29
2 1.3 13
3 1.5 12
4 1.8 12
5 1.4 8
6 2.3 8
7 0.3 7
8 0.4 7
9 1 7
10 2 6
# ... with 12 more rows
$Species
# A tibble: 3 x 2
# Groups: df[, colname] [3]
`df[, colname]` n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
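As far as I can tell, the loop in the question fails because the loop variable is never used inside the pipe: group_by(colname) looks for a column literally called colname. A sketch of one way around that is the .data pronoun, which takes the column name as a string (counts and col are names I made up here):

```r
library(dplyr)

counts <- list()
for (col in colnames(iris)) {
  counts[[col]] <- iris %>%
    count(.data[[col]]) %>%   # .data[[col]] looks up the column named by the string
    arrange(desc(n))
}

counts$Species
```

This also avoids the awkward `df[, colname]` column header in the output above.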
I wonder how to modify below code
xxx<-function(df,groupbys){
groupbys<-enquo(groupbys)
df%>%group_by_(groupbys)%>%summarise(count=n())
}
zzz<-xxx(iris,Species)
so that I have the option to feed in either one column or more than one column to group by? For example, group by both Species and Petal.Length with the iris dataset.
When using enquo (single argument) or enquos (multiple), you should use the !! and !!! operators, respectively.
xxx <- function(df, ...) {
grps <- enquos(...)
df %>%
group_by(!!!grps) %>%
tally() %>%
ungroup()
}
mtcars %>% xxx(cyl, am)
# # A tibble: 6 x 3
# cyl am n
# <dbl> <dbl> <int>
# 1 4 0 3
# 2 4 1 8
# 3 6 0 4
# 4 6 1 3
# 5 8 0 12
# 6 8 1 2
Or, if you want to keep a single argument in the function formals for one or more column names, I think you'll need to use vars() in the call. (Perhaps there's another way suggested in the Programming with dplyr vignette.)
xxx <- function(df, groups) {
df %>%
group_by(!!!groups) %>%
tally() %>%
ungroup()
}
xxx(mtcars, vars(cyl, am))
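For the single-column case (enquo plus !!), the curly-curly operator does both steps at once. A minimal sketch, assuming rlang >= 0.4.0; count_one is a hypothetical name:

```r
library(dplyr)

# {{ grp }} is shorthand for !!enquo(grp)
count_one <- function(df, grp) {
  df %>%
    group_by({{ grp }}) %>%
    tally() %>%
    ungroup()
}

res_one <- count_one(iris, Species)
res_one
```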
You just need to use the .dots argument of group_by here. Just make sure groupbys is a character vector, i.e.
xxx<-function(df,groupbys){
df%>%group_by(.dots = groupbys)%>%summarise(count=n())
}
xxx(iris,"Species")
# A tibble: 3 x 2
Species count
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
xxx(iris,c("Species","Petal.Length"))
# A tibble: 48 x 3
# Groups: Species [3]
Species Petal.Length count
<fct> <dbl> <int>
1 setosa 1 1
2 setosa 1.1 1
3 setosa 1.2 2
4 setosa 1.3 7
5 setosa 1.4 13
6 setosa 1.5 13
7 setosa 1.6 7
8 setosa 1.7 4
9 setosa 1.9 2
10 versicolor 3 1
Here are two approaches to the problem. If you want to pass column names as unquoted variables, you can use ... and pass it to count instead of using group_by + summarise.
xxx<-function(df,...){
df %>% count(...)
}
xxx(mtcars, cyl)
# A tibble: 3 x 2
# cyl n
# <dbl> <int>
#1 4 11
#2 6 7
#3 8 14
xxx(mtcars, cyl, am)
# A tibble: 6 x 3
# cyl am n
# <dbl> <dbl> <int>
#1 4 0 3
#2 4 1 8
#3 6 0 4
#4 6 1 3
#5 8 0 12
#6 8 1 2
The second approach: if you want to pass column names as quoted variables (strings), you can use group_by_at, which accepts string inputs.
xxx<-function(df,groupbys){
df %>% group_by_at(groupbys) %>% summarise(n = n())
}
xxx(mtcars, c("cyl", "am"))
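With dplyr >= 1.0.0, the same string-based grouping can be written with across() and all_of(). This is only a sketch under that version assumption, and count_by is a made-up name:

```r
library(dplyr)

# groupbys is a character vector of one or more column names
count_by <- function(df, groupbys) {
  df %>%
    group_by(across(all_of(groupbys))) %>%
    summarise(n = n(), .groups = "drop")
}

one <- count_by(mtcars, "cyl")
two <- count_by(mtcars, c("cyl", "am"))
two
```

all_of() raises an error if a name is not a column of df, which tends to catch typos early.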
I have a map of functions I want to apply to their respective columns.
Is there something like a mapped mutate_at?
my_map <-
data_frame(col = names(iris)[-5],
calc = rep(c("floor", "ceiling"), 2))
my_map
# A tibble: 4 x 2
col calc
<chr> <chr>
Sepal.Length floor
Sepal.Width ceiling
Petal.Length floor
Petal.Width ceiling
Failed attempt:
tbl_df(iris) %>% mutate_at(vars(my_map$col), funs_(my_map$calc))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_floor Sepal.Width_floor Petal.Length_floor Petal.Width_floor Sepal.Length_ceiling
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
5.1 3.5 1.4 0.2 setosa 5 3 1 0 6
4.9 3 1.4 0.2 setosa 4 3 1 0 5
Desired output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
5.0 4.0 1.0 1.0 setosa
4.0 3.0 1.0 1.0 setosa
One last thing: my_map$calc may contain functions that are not known in advance.
E.g., someone could change the last "floor" to "round".
I don't think there's a straightforward way to do this with the dplyr::mutate_* functions. One workaround is to use the reduce (or reduce2) function and mutate each column with its corresponding transform function, one by one:
library(tidyverse)
reduce2(.x = my_map$col,
.y = my_map$calc,
.f = function(df, col, f) mutate_at(df, vars(col), f),
.init = iris) %>% head(2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5 4 1 1 setosa
# 2 4 3 1 1 setosa
Here is a way to use map2 to replace each column.
library(tidyverse)
iris2 <- iris
iris2[, -5] <- map2(my_map$calc, my_map$col, function(x, y){
x2 <- eval(parse(text = x))
y2 <- iris2[[y]]
result <- x2(y2)
return(result)
})
head(iris2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5 4 1 1 setosa
# 2 4 3 1 1 setosa
# 3 4 4 1 1 setosa
# 4 4 4 1 1 setosa
# 5 5 4 1 1 setosa
# 6 5 4 1 1 setosa
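If you would rather not eval(parse()) the function names, match.fun() looks a function up by its name. Here is a base R sketch of the same idea; my_map is rebuilt with data.frame() so the block runs on its own:

```r
# Lookup table: which function (by name) to apply to which column
my_map <- data.frame(col  = names(iris)[-5],
                     calc = rep(c("floor", "ceiling"), 2),
                     stringsAsFactors = FALSE)

iris2 <- iris
for (i in seq_len(nrow(my_map))) {
  f <- match.fun(my_map$calc[i])                     # resolve name to function
  iris2[[my_map$col[i]]] <- f(iris[[my_map$col[i]]])
}

head(iris2, 2)
```

Changing an entry of my_map$calc to "round" should work the same way, since match.fun() resolves any function visible by that name.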
We could start from my_map :
library(tidyverse)
map2(my_map$col,my_map$calc,~transmute_at(iris,.x,.y)) %>%
bind_cols(iris[!names(iris) %in% my_map$col]) %>% # or less general: iris[-5]
head
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5 4 1 1 setosa
# 2 4 3 1 1 setosa
# 3 4 4 1 1 setosa
# 4 4 4 1 1 setosa
# 5 5 4 1 1 setosa
# 6 5 4 1 1 setosa
If we assume that all variables that should get the floor function share the same suffix (here Length), and all variables that should get the ceiling function share another one (here Width), then we can apply the following code:
library(tidyverse)
iris %>%
mutate_at(vars(ends_with("Length")), funs(floor)) %>%
mutate_at(vars(ends_with("Width")), funs(ceiling))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5 4 1 1 setosa
# 2 4 3 1 1 setosa
Although verbose, I find the following very readable and a simple implementation of map:
iris2 <- iris %>%
mutate(id = 1:n()) %>%
gather(key = col, value, my_map$col ) %>%
full_join(my_map, by = "col") %>%
mutate(value = invoke_map(.f = calc, .x = value)) %>%
unnest() %>%
select(-calc) %>%
spread(col, value) %>%
select(-id)
head(iris2)
# Species Petal.Length Petal.Width Sepal.Length Sepal.Width
# 1 setosa 1 1 5 4
# 2 setosa 1 1 4 3
# 3 setosa 1 1 4 4
# 4 setosa 1 1 4 4
# 5 setosa 1 1 5 4
How do you write a dplyr for loop that will provide summaries for each column of a data.table object?
Let's examine a toy example to help illustrate what I'm trying to achieve and what I have tried. We have 5 variables:
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
I can get a summary of counts by distinct Sepal.Length like so:
iris %>%
group_by(Sepal.Length) %>%
summarise(no_rows = length(Sepal.Length))
# A tibble: 35 x 2
Sepal.Length no_rows
<dbl> <int>
1 4.30 1
2 4.40 3
3 4.50 1
4 4.60 4
5 4.70 2
6 4.80 5
7 4.90 6
8 5.00 10
9 5.10 9
10 5.20 4
# ... with 25 more rows
I would like to write the above in a for loop that loops through each of the 5 variables in the data frame. I started by replacing Sepal.Length above with: paste(names(iris)[1]).
iris %>%
group_by( paste(names(iris)[1]) ) %>%
summarise(no_rows = length( paste(names(iris)[1])) )
But I get:
# A tibble: 1 x 2
`names(design_mat4)[1]` no_rows
<chr> <int>
1 email_status 1
Is there a better way of achieving my aims, perhaps one that avoids a for loop? Are there leads or suggestions that I can follow to write a working for loop? Code or suggestions welcome.
Not sure if it matters, but please note that I am working with a data.table object while the above toy example is a data.frame object. I know that there are nuances between the two that may impact the syntax needed.
Or do it in base R.
lapply(iris, function(x) aggregate(x, by = list(x), length))
This gives you the results in a list
lapply(names(iris),
function(var){
iris %>%
group_by(!!rlang::sym(var)) %>%
summarise(no_rows = n())
})
Here's a better dplyr answer from @Frank
lapply(names(iris) %>% setNames(.,.), function(var) iris %>% count(!!as.name(var)))
And a data.table answer
lapply(names(iris) %>% setNames(.,.), function(x) as.data.table(iris)[, .(n = .N), by = x])
If all variables are of the same type, a simpler way to approach the problem is to reshape to long form:
library(tidyverse)
iris %>%
select(-Species) %>%
gather(variable, value) %>%
count(variable, value)
#> # A tibble: 123 x 3
#> variable value n
#> <chr> <dbl> <int>
#> 1 Petal.Length 1.00 1
#> 2 Petal.Length 1.10 1
#> 3 Petal.Length 1.20 2
#> 4 Petal.Length 1.30 7
#> 5 Petal.Length 1.40 13
#> 6 Petal.Length 1.50 13
#> 7 Petal.Length 1.60 7
#> 8 Petal.Length 1.70 4
#> 9 Petal.Length 1.90 2
#> 10 Petal.Length 3.00 1
#> # ... with 113 more rows
If you include Species, the value column will be coerced to character, though.
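In newer tidyr versions (>= 1.0.0), gather() is superseded by pivot_longer(). A sketch of the same reshaping, assuming that version; long_counts is a made-up name:

```r
library(dplyr)
library(tidyr)

long_counts <- iris %>%
  select(-Species) %>%
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "value") %>%
  count(variable, value)   # one row per distinct (variable, value) pair

long_counts
```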
I have the following dataset
# Dataset
x<-tbl_df(data.frame(locus=c(1,2,2,3,4,4,5,5,5,6),v=c(1,1,2,1,1,2,1,2,3,1),rpkm=rnorm(10,10)))
If I use the following command
# Subset
x%>%group_by(locus)%>%summarize(max(rpkm))
I obtained
locus max(rpkm)
1 9.316949
2 10.273270
3 9.879886
4 10.944641
5 10.837681
6 13.450680
While I'd like to obtain
locus v max(rpkm)
1 1 9.316949
2 1 10.273270
3 1 9.879886
4 2 10.944641
5 1 10.837681
6 1 13.450680
So, I'd like the output table to include the "v" value from the corresponding row.
Is it possible?
Try:
x %>% group_by(locus) %>%
summarize(max(rpkm), v = v[which(rpkm==max(rpkm))])
You can use the top_n function instead
# with set.seed(15)
x %>% group_by(locus) %>% top_n(1, rpkm)
# locus v rpkm
# 1 1 1 10.258823
# 2 2 1 11.831121
# 3 3 1 10.897198
# 4 4 1 10.488016
# 5 5 2 11.090773
# 6 6 1 8.924999
Try this:
x %>% group_by(locus) %>% filter(rpkm==max(rpkm))
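In dplyr >= 1.0.0, slice_max() is meant as the replacement for top_n(). A minimal sketch under that assumption; the seed and the name best are only for illustration:

```r
library(dplyr)

set.seed(15)
x <- tibble(locus = c(1, 2, 2, 3, 4, 4, 5, 5, 5, 6),
            v     = c(1, 1, 2, 1, 1, 2, 1, 2, 3, 1),
            rpkm  = rnorm(10, 10))

# Keep the whole row holding the largest rpkm within each locus
best <- x %>%
  group_by(locus) %>%
  slice_max(rpkm, n = 1, with_ties = FALSE) %>%
  ungroup()

best
```

with_ties = FALSE guarantees exactly one row per locus even if two rows share the maximum.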
I assume you're looking for a way to avoid typing all of the column names by hand. You can achieve that by using across within summarize, like so:
iris %>%
group_by(Species) %>%
dplyr::summarize(
across(everything()),
mean_l = mean(Sepal.Length)
) %>%
head()
# A tibble: 6 × 6
# Groups: Species [1]
Species Sepal.Length Sepal.Width Petal.Length Petal.Width mean_l
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 1.4 0.2 5.01
2 setosa 4.9 3 1.4 0.2 5.01
3 setosa 4.7 3.2 1.3 0.2 5.01
4 setosa 4.6 3.1 1.5 0.2 5.01
5 setosa 5 3.6 1.4 0.2 5.01
6 setosa 5.4 3.9 1.7 0.4 5.01
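If the goal is just to keep every row and attach the group mean, mutate() is another option. A sketch, assuming that is what is wanted (with_mean is a made-up name):

```r
library(dplyr)

with_mean <- iris %>%
  group_by(Species) %>%
  mutate(mean_l = mean(Sepal.Length)) %>%   # group mean repeated on each row
  ungroup()

head(with_mean)
```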