I have some numeric variables which are categorised into a few bands (like 1-3, 3-5, 5-7 etc.). I want to maintain their band order. For example, in the data frame below:
df <- data.frame(x = c("1-3", "3-5","5-9", "9-10", "10-12"))
When I run any data manipulation operation (like group_by, count) in this column, it returns this output.
Current Output
library(tidyverse)
df %>% count(x)
x n
<fct> <int>
1 1-3 1
2 10-12 1
3 3-5 1
4 5-9 1
5 9-10 1
Desired Output
x n
<fct> <int>
1 1-3 1
2 3-5 1
3 5-9 1
4 9-10 1
5 10-12 1
Important note - the solution should be dynamic, meaning it should work on any numeric bands even if they start from 1000 or any other value (for example 1250-2500, 2500-5000, 5000-10000, 10000-20000, etc.). A dplyr solution is preferred.
If x is always sorted and in the same order as shown in the example, you could set the factor levels based on their order of appearance before using count.
library(dplyr)
library(rlang)
df %>%
mutate(x = factor(x, levels = unique(x))) %>%
count(x)
However, a general solution would be to get the number before "-" and arrange data based on that.
df %>%
mutate(x1 = as.numeric(sub('-.*', '', x)),
x = factor(x, levels = x[order(x1)])) %>%
count(x)
To wrap this in a function we can use:
count_band_data <- function(data, col, sep = '-') {
data %>%
mutate(temp = as.numeric(sub(paste0(sep, '.*'), '', {{col}})),
{{col}} := factor({{col}}, levels = {{col}}[order(temp)])) %>%
count({{col}})
}
and then use it as:
df %>% count_band_data(x)
# A tibble: 5 x 2
# x n
# <fct> <int>
#1 1-3 1
#2 3-5 1
#3 5-9 1
#4 9-10 1
#5 10-12 1
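As a quick check of the "dynamic" requirement, the same function handles bands starting at any value, provided they use the same "-" separator. The data below is made up for illustration:
df2 <- data.frame(x = c("2500-5000", "1250-2500", "10000-20000", "5000-10000"))
df2 %>% count_band_data(x)
#            x n
#1   1250-2500 1
#2   2500-5000 1
#3  5000-10000 1
#4 10000-20000 1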
I'm trying to summarize a data set with not only total counts per group, but also counts of subsets. So starting with something like this:
df <- data.frame(
Group=c('A','A','B','B','B'),
Size=c('Large','Large','Large','Small','Small')
)
df_summary <- df %>%
group_by(Group) %>%
summarize(group_n=n())
I can get a summary of the number of observations for each group:
> df_summary
# A tibble: 2 x 2
  Group group_n
  <chr>   <int>
1 A           2
2 B           3
Is there any way I can add some sort of subsetting information to n() to get, say, a count of how many observations per group were Large in this example? In other words, ending up with something like:
Group group_n Large_n
1 A 2 2
2 B 3 1
Thank you!
We could use count:
count(xyz) is a shorthand for group_by(xyz) %>% summarise(n = n())
library(dplyr)
df %>%
count(Group, Size)
Group Size n
1 A Large 2
2 B Large 1
3 B Small 2
OR
library(dplyr)
library(tidyr)
df %>%
count(Group, Size) %>%
pivot_wider(names_from = Size, values_from = n)
Group Large Small
<chr> <int> <int>
1 A 2 NA
2 B 1 2
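If you want the exact layout from the question (group_n alongside the per-Size counts), one option is to add the group total with add_count() before counting and pivoting. This is a sketch building on the answer above, not part of the original:
library(dplyr)
library(tidyr)
df %>%
  add_count(Group, name = "group_n") %>%   # total observations per Group
  count(Group, group_n, Size) %>%          # observations per Group/Size combination
  pivot_wider(names_from = Size, values_from = n, values_fill = 0)
#   Group group_n Large Small
#   <chr>   <int> <int> <int>
# 1 A           2     2     0
# 2 B           3     1     2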
I approach this problem using an ifelse and a sum:
df_summary <- df %>%
group_by(Group) %>%
summarize(group_n=n(),
Large_n = sum(ifelse(Size == "Large", 1, 0)))
The last line turns Size into a binary indicator taking the value 1 if Size == "Large" and 0 otherwise. Summing this indicator is equivalent to counting the number of rows with "Large".
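Since logical values sum as 0/1 in R, the ifelse() above can be dropped entirely; this is a small simplification of the same idea:
df_summary <- df %>%
  group_by(Group) %>%
  summarize(group_n = n(),
            Large_n = sum(Size == "Large"))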
df_summary <- df %>%
group_by(Group) %>%
mutate(group_n=n())%>%
ungroup() %>%
group_by(Group,Size) %>%
mutate(Large_n=n()) %>%
ungroup() %>%
distinct(Group, .keep_all = T)
# A tibble: 2 x 4
Group Size group_n Large_n
<chr> <chr> <int> <int>
1 A Large 2 2
2 B Large 3 1
I have a dataframe like so:
ID <- c('John', 'Bill', 'Alice','Paulina')
Type1 <- c(1,1,0,1)
Type2 <- c(0,1,1,0)
cluster <- c(1,2,3,1)
test <- data.frame(ID, Type1, Type2, cluster)
I want to group by cluster and sum the values in all the other columns, apart from ID, which should be dropped.
I achieved it through
test.sum <- test %>%
group_by(cluster)%>%
summarise(sum(Type1), sum(Type2))
However, I have thousands of types and I can't write out each column in summarise manually. Can you help me?
This is where across() and contains() come in incredibly useful to select the columns you want to summarise across:
test %>%
group_by(cluster) %>%
summarise(across(contains("Type"), sum))
cluster Type1 Type2
<dbl> <dbl> <dbl>
1 1 2 0
2 2 1 1
3 3 0 1
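If the columns to be summed don't share a name pattern like "Type", you can select them by exclusion or by type instead. A sketch of the same summarise, assuming ID is the only non-numeric column:
test %>%
  group_by(cluster) %>%
  summarise(across(-ID, sum))
# or, selecting every numeric column:
test %>%
  group_by(cluster) %>%
  summarise(across(where(is.numeric), sum))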
Alternatively, pivoting the dataset into long and then back into wide means you can easily analyse all groups and clusters at once:
library(dplyr)
library(tidyr)
test %>%
pivot_longer(-c(ID, cluster)) %>%
group_by(cluster, name) %>%
summarise(sum_value = sum(value)) %>%
pivot_wider(names_from = "name", values_from = "sum_value")
cluster Type1 Type2
<dbl> <dbl> <dbl>
1 1 2 0
2 2 1 1
3 3 0 1
Base R
You can exploit split(), which is base R's counterpart to group_by(). This should give you what you are looking for, regardless of how many Type columns you have.
my_split <- split(subset(test, select = grep('^Ty', names(test))), test[, -1]$cluster)
my_sums <- sapply(my_split, \(x) colSums(x))
my_sums <- data.frame( cluster = as.numeric(gsub("\\D", '', colnames(my_sums))),
t(my_sums) )
Output
> my_sums
cluster Type1 Type2
1 1 2 0
2 2 1 1
3 3 0 1
Note: use function(x) instead of \(x) if you use a version of R <4.1.0
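Another base R option that gives the same result is aggregate(), which sums every remaining column by cluster once ID is dropped (a sketch, assuming all non-ID columns are numeric):
aggregate(. ~ cluster, data = test[, -1], FUN = sum)
#   cluster Type1 Type2
# 1       1     2     0
# 2       2     1     1
# 3       3     0     1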
I have a task that seems easy, but after working on it for a few hours I've decided that I'm stumped.
I have a dataframe:
mydata <- read.table(header=TRUE, text="
rime point sound
Y Y Y
N N Y
Y Y Y
NA NA NA
")
I would like my dataframe to look like this:
mydata <- read.table(header=TRUE, text="
standard Y N NA
rime 2 1 1
point 2 1 1
sound 3 0 1
")
My first thought was to use dplyr::count(). I can get the correct numbers, but I have over 100 columns and don't want to call all of them by hand. Is there an R function that I could use to get the count I'm looking for?
xtabs
Using only base R we convert to long form using stack and then perform the frequency counting using xtabs. To get it into the orientation shown in the question the resulting table is transposed.
t(xtabs(~., stack(mydata), addNA = TRUE))
## values
## ind N Y <NA>
## rime 1 2 1
## point 1 2 1
## sound 0 3 1
table
This variation also works and gives a similar result. (The data portions of both are the same but the class of the xtabs solution is c("xtabs", "table") and it has a call attribute whereas the one below has the "table" class.)
t(table(stack(mydata), useNA = "ifany"))
tapply
We can use tapply giving a matrix output. We first change the NA's to ordinary levels since tapply would remove those NAs.
s <- transform(stack(mydata), values = addNA(values))
t(tapply(rownames(s), s, length, default = 0))
pivot_*
Using tidyr we can convert to long form and back to wide form giving a tibble result:
library(tidyr)
mydata %>%
pivot_longer(everything()) %>%
pivot_wider(id_cols = name, names_from = "value", values_fn = length, values_fill = 0)
## # A tibble: 3 x 4
## name Y N `NA`
## <chr> <int> <int> <int>
## 1 rime 2 1 1
## 2 point 2 1 1
## 3 sound 3 0 1
ctable
ctable in the summarytools package has many arguments to customize the output. Here is the default output.
library(summarytools)
with(stack(mydata), ctable(ind, values))
giving:
Cross-Tabulation, Row Proportions
ind * values
Data Frame: stack
------- -------- ----------- ----------- ----------- -------------
values N Y <NA> Total
ind
rime 1 (25.0%) 2 (50.0%) 1 (25.0%) 4 (100.0%)
point 1 (25.0%) 2 (50.0%) 1 (25.0%) 4 (100.0%)
sound 0 ( 0.0%) 3 (75.0%) 1 (25.0%) 4 (100.0%)
Total 2 (16.7%) 7 (58.3%) 3 (25.0%) 12 (100.0%)
------- -------- ----------- ----------- ----------- -------------
Update
Have added additional approaches.
Try this. The key is to reshape your data to long format and compute the desired summary (in your case, the number of observations per group). After that you can reshape back to wide to format the data as you want. Here is the code using a tidyverse approach with the data you shared (mydata):
library(tidyverse)
#Code
mydata %>%
pivot_longer(everything()) %>%
group_by(name,value) %>%
summarise(N=n()) %>% ungroup() %>%
group_by(name) %>% mutate(id=cur_group_id()) %>%
pivot_wider(names_from = value,values_from=N) %>%
select(-id) %>%
replace(is.na(.),0)
Output:
# A tibble: 3 x 4
# Groups: name [3]
name N Y `NA`
<chr> <int> <int> <int>
1 point 1 2 1
2 rime 1 2 1
3 sound 0 3 1
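A small variation on the same pipeline: count() can replace the group_by()/summarise() pair, and pivot_wider() can fill the missing combinations directly via values_fill, which avoids the replace() step. This is a sketch, not part of the original answer:
library(dplyr)
library(tidyr)
mydata %>%
  pivot_longer(everything()) %>%
  count(name, value) %>%
  pivot_wider(names_from = value, values_from = n, values_fill = 0)
#   name      N     Y  `NA`
# 1 point     1     2     1
# 2 rime      1     2     1
# 3 sound     0     3     1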
I'm trying to create decile factors corresponding to my dataframe's values. I would like the factors to appear as a range, e.g. if the value is 164 then the factored result should be "160 - 166".
In the past I would do this:
quantile(countries.Imported$Imported, seq(0,1, 0.1), na.rm = T) # display deciles
Imported.levels <- c(0, 1000, 10000, 20000, 30000, 50000, 80000) # create levels from observed deciles
Imported.labels <- c('< 1,000t', '1,000t - 10,000t', '10,000t - 20,000t', etc) # create corresponding labels
colfunc <- colorRampPalette(c('#E5E4E2', '#8290af','#512888'))
# apply factor function
Imported.colors <- colfunc(10)
names(Imported.colors) <- Imported.labels
countries.Imported$Imported.fc <- factor(
cut(countries.Imported$Imported, Imported.levels),labels = Imported.labels)
Instead, I would like to apply a function that will factor the values into decile ranges. I want to avoid manually setting factor labels, since I will be running many queries and plotting maps that have discrete legends. I've created a column called Value.fc, but I cannot format it as "160 - 166" instead of "(160,166]". Please see the problematic code below:
corn_df <- corn_df %>%
mutate(Value.fc = gtools::quantcut(Value, 10))
corn_df %>%
select(Value, unit_desc, domain_desc, Value.fc) %>%
head(6)
A tibble: 6 x 4
Value unit_desc domain_desc Value.fc
<dbl> <chr> <chr> <fct>
1 164. BU / ACRE TOTAL (160,166]
2 196. BU / ACRE TOTAL (191,200]
3 203. BU / ACRE TOTAL (200,230]
4 205. BU / ACRE TOTAL (200,230]
5 172. BU / ACRE TOTAL (171,178]
6 213. BU / ACRE TOTAL (200,230]
You can try dplyr::ntile() or Hmisc::cut2().
If you're interested in where each decile of the variable starts and ends, you can use Hmisc::cut2() and stringr::str_extract_all():
require(tidyverse)
require(Hmisc)
require(stringr)
df <- data.frame(value = 1:100) %>%
  mutate(decile = cut2(value, g = 10),
         decile = factor(sapply(str_extract_all(decile, "\\d+"),
                                function(x) paste(x, collapse = "-"))))
head(df)
value decile
1 1 1-11
2 2 1-11
3 3 1-11
4 4 1-11
5 5 1-11
6 6 1-11
If you're looking only for the decile number of the variable, you can use dplyr::ntile().
require(tidyverse)
df <- data.frame(value = 1:100) %>%
  mutate(decile = ntile(value, 10))
head(df)
value decile
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
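If you specifically want labels in the "160 - 166" style from the question, another option is to keep gtools::quantcut() and just rewrite its default "(160,166]" labels. A sketch, assuming a numeric Value column as in the question:
library(dplyr)
corn_df <- corn_df %>%
  mutate(Value.fc = gtools::quantcut(Value, 10),
         # keep the level order, but turn "(160,166]" into "160 - 166"
         Value.fc = factor(Value.fc,
                           levels = levels(Value.fc),
                           labels = gsub(",", " - ",
                                         gsub("[][()]", "", levels(Value.fc)))))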
I am pretty new to R, so this question may be a bit naive.
I have a tibble with several columns, and I want to create a factor (Bin) by binning the values in one of the columns into N bins, which is done in a pipe. However, I would like to be able to define the column to be binned at the top of the script (e.g. bin2use = RT), because I want this to be flexible.
I've tried several ways of referring to a column name using this variable, but I cannot get it to work. Amongst others, I have tried get(), eval() and [[ ]].
Simplified example code:
Subject <- c(rep(1,100), rep(2,100))
RT <- runif(200, 300, 800 )
data_st <- tibble(Subject, RT)
bin2use = 'RT'
nbin = 5
binned_data <- data_st %>%
group_by(Subject) %>%
mutate(
Bin = cut_number(get(bin2use), nbin, label = F)
)
Error in mutate_impl(.data, dots) :
non-numeric argument to binary operator
We can use non-standard evaluation with lazyeval:
library(dplyr)
library(ggplot2)
f1 <- function(colName, bin){
call <- lazyeval::interp(~cut_number(a, b, label = FALSE),
a = as.name(colName), b = bin)
data_st %>%
group_by(Subject) %>%
mutate_(.dots = setNames(list(call), "Bin"))
}
f1(bin2use, nbin)
#Source: local data frame [200 x 3]
#Groups: Subject [2]
# Subject RT Bin
# <dbl> <dbl> <int>
#1 1 752.2066 5
#2 1 353.0410 1
#3 1 676.5617 4
#4 1 493.0052 2
#5 1 532.2157 3
#6 1 467.5940 2
#7 1 791.6643 5
#8 1 333.1583 1
#9 1 342.5786 1
#10 1 637.8601 4
# ... with 190 more rows
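For reference, mutate_() and the lazyeval interface used above are deprecated in current dplyr. The same idea can be written with tidy evaluation; here is a minimal sketch (bin_column is a hypothetical name) that uses .data[[ ]] since bin2use holds the column name as a string:
library(dplyr)
library(ggplot2)   # for cut_number()
bin_column <- function(data, colName, nbin) {
  data %>%
    group_by(Subject) %>%
    mutate(Bin = cut_number(.data[[colName]], nbin, labels = FALSE)) %>%
    ungroup()
}
binned_data <- bin_column(data_st, bin2use, nbin)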