I want to create a new column that counts the number of rows matching a given value.
Creating reproducible data:
data <- tibble(Category = c("A", "B", "A", "A", "A"))
I want the data to eventually look like the code below. But instead of creating the variable manually like this, I want to create a new variable CountA, using a conditional mutate() or something similar, that counts the total number of rows where the value of Category is "A":
tibble(Category = c("A", "B", "A", "A", "A"), CountA = c(4,4,4,4,4))
I know that I could filter out the non-A values and then generate the CountA variable, but I still need to keep those rows for a different purpose.
You can create a logical vector inside mutate(), then sum the number of TRUE values.
library(dplyr)
data %>%
  mutate(countA = sum(Category == "A", na.rm = TRUE))
Or in base R:
data$countA <- sum(data$Category == "A", na.rm = TRUE)
Output
Category countA
<chr> <int>
1 A 4
2 B 4
3 A 4
4 A 4
5 A 4
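Side note: the na.rm = TRUE matters here because a single NA in the logical vector would otherwise propagate through sum():
sum(c(TRUE, NA, TRUE))                 # NA -- one NA poisons the total
sum(c(TRUE, NA, TRUE), na.rm = TRUE)   # 2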
If you want to create a new column for every Category, you could do something like this:
library(tidyverse)
data %>%
  group_by(Category) %>%
  mutate(obs = n(),
         grp = Category,
         row = row_number()) %>%
  pivot_wider(names_from = "grp", values_from = "obs", names_prefix = "Count") %>%
  ungroup() %>%
  select(-row) %>%
  fill(-"Category", .direction = "updown")
Output
Category CountA CountB
<chr> <int> <int>
1 A 4 1
2 B 4 1
3 A 4 1
4 A 4 1
5 A 4 1
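If the tidyverse reshaping feels heavy for this, a plain base-R loop is a simpler sketch of the same idea (one count column per distinct Category value, reusing the Count prefix from above):
cats <- unique(data$Category)
for (ct in cats) {
  data[[paste0("Count", ct)]] <- sum(data$Category == ct, na.rm = TRUE)
}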
Related
I have a table that looks like this:
Id  Types
1   A
1   A
1   A
1   B
2   A
2   B
3   A
3   B
4   A
4   B
4   B
What I would like to do is: 1. count, for every Id, the number of A's and B's it has; 2. compute the distribution of every combination of those A and B counts.
So at the end of step 2 I should have the table:
Amount of A  Amount of B  Number of Different IDs
1            1            2
1            2            1
3            1            1
How can this be achieved?
Thank you.
Here's a solution with dplyr and tidyr:
library(dplyr)
library(tidyr)
# ...
# Code to generate your original table: "your_table".
# ...
result <- your_table %>%
  # Count the amount of each type for each Id.
  group_by(Id) %>% count(Types) %>% ungroup() %>%
  # "Pivot" the Types column, such that each type (here "A" and "B") gets its
  # own column (here "Amount of A" and "Amount of B") to hold its amount (as
  # calculated right above). Note that id_cols must not include Types, since
  # Types is already consumed by names_from.
  pivot_wider(id_cols = Id,
              names_from = Types, names_prefix = "Amount of ",
              values_from = n) %>%
  # For each combination of amounts among those pivoted columns (i.e. all the
  # columns except "Id"), count how many distinct IDs there are.
  group_by(across(-Id)) %>%
  summarize("Number of Different IDs" = n_distinct(Id)) %>% ungroup()
# Print the result.
result
Given the example of your_table that you provided
your_table <- tibble::tribble(
  ~Id, ~Types,
  1, "A",
  1, "A",
  1, "A",
  1, "B",
  2, "A",
  2, "B",
  3, "A",
  3, "B",
  4, "A",
  4, "B",
  4, "B"
)
you should get the following result:
# A tibble: 3 x 3
`Amount of A` `Amount of B` `Number of Different IDs`
<int> <int> <int>
1 1 1 2
2 1 2 1
3 3 1 1
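A more compact variant of the same idea (a sketch under the same assumptions; values_fill guards against Ids that lack a type entirely) is to count twice, once per Id and type and once per combination of amounts:
your_table %>%
  count(Id, Types) %>%
  pivot_wider(names_from = Types, values_from = n,
              names_prefix = "Amount of ", values_fill = 0) %>%
  count(`Amount of A`, `Amount of B`, name = "Number of Different IDs")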
I'm trying to compute the mean + standard deviation for a dataset. I have a list of organizations, but one organization has just a single row for the column "cpue". When I compute the grouped mean and sd for each organization and another variable (scientific name), this organization ends up with NA, since sd cannot be computed from one row. I would like to retain the single-group value, however, and have it in the "mean" column so that I can plot it (without sd). Is there a way to tell dplyr to retain groups with a single row when calculating the mean? Data below:
l <- data.frame(organization = c("A", "B", "B", "A", "B", "A", "C"),
                species = c("turtle", "shark", "turtle", "bird", "turtle", "shark", "bird"),
                cpue = c(1, 2, 1, 5, 6, 1, 3))
l2 <- l %>%
  group_by(organization, species) %>%
  summarize(mean = mean(cpue),
            sd = sd(cpue))
Any help would be much appreciated!
We can use an if/else condition in sd to check the number of rows, i.e. if n() == 1, return the 'cpue' value itself, or else compute the sd of 'cpue':
library(dplyr)
l1 <- l %>%
  group_by(organization, species) %>%
  summarize(mean = mean(cpue),
            sd = if (n() == 1) cpue else sd(cpue),
            .groups = 'drop')
Output
l1
# A tibble: 6 x 4
# organization species mean sd
#* <chr> <chr> <dbl> <dbl>
#1 A bird 5 5
#2 A shark 1 1
#3 A turtle 1 1
#4 B shark 2 2
#5 B turtle 3.5 3.54
#6 C bird 3 3
If the condition is based on the value of the grouping variable 'organization', then create the condition in if/else by extracting the grouping variable with cur_group():
l %>%
  group_by(organization, species) %>%
  summarise(mean = mean(cpue),
            sd = if (cur_group()$organization == 'A') cpue else sd(cpue),
            .groups = 'drop')
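If the goal is just to plot the single-row groups without error bars, a small variation (a sketch, not from the answer above) is to return NA_real_ for the sd instead; plotting code can then simply drop the missing error bars:
l %>%
  group_by(organization, species) %>%
  summarise(mean = mean(cpue),
            sd = if (n() == 1) NA_real_ else sd(cpue),
            .groups = 'drop')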
I have a dataframe that looks like this (but with lots more columns, and no helpful "KEEP" column):
df <- tribble(
  ~Lots.of.cols, ~analyte, ~meta, ~value, ~KEEP,
  1, "A", "analyte", NA, FALSE,
  1, "A", "unit", "m", FALSE,
  1, "A", "method", NA, FALSE,
  1, "B", "analyte", "4", TRUE,
  1, "B", "unit", "kg", TRUE,
  1, "B", "method", "xxx", TRUE)
What I want to do is filter out all the rows of a particular analyte if, in the row where meta is "analyte", the value column is NA. So in the df above, the first three rows should be filtered out, because row one has meta = "analyte" and value = NA. The final three rows (analyte = "B") should be kept, because the fourth row (meta = "analyte") has !is.na(value).
There are two approaches I've tried. The first is to group_by(analyte) and then try filtering; the alternative is:
df %>%
  anti_join(.[is.na(.$value) & .$meta == "analyte", ],
            by = c("Lots.of.cols", "analyte", "meta")) -> df
With both approaches I can remove the individual row where meta = "analyte" & is.na(value) but not the other rows in the group.
The issue is that your table is not in tidy format, i.e. 1 observation = 1 row.
To have tidy data, you'd need to pivot wider. This is why I pivoted, filtered, then re-pivoted.
Also, it's confusing that you have two things named "analyte" that are not the same thing, hence why I changed the name.
library(tidyverse)

df %>%
  mutate(meta = str_replace(meta, "analyte", "analyte_value")) %>%
  pivot_wider(names_from = meta, values_from = value) %>%
  filter(!is.na(analyte_value)) %>%
  pivot_longer(cols = analyte_value:method)
#> # A tibble: 3 x 4
#> Lots.of.cols analyte name value
#> <dbl> <chr> <chr> <chr>
#> 1 1 B analyte_value 4
#> 2 1 B unit kg
#> 3 1 B method xxx
Your anti_join was almost good; just don't put the "meta" variable in the by = c(...) like that:
df %>%
  anti_join(.[is.na(.$value) & .$meta == "analyte", ],
            by = c("Lots.of.cols", "analyte")) -> df
Result:
# A tibble: 3 x 5
Lots.of.cols analyte meta value KEEP
<dbl> <chr> <chr> <chr> <lgl>
1 1 B analyte 4 TRUE
2 1 B unit kg TRUE
3 1 B method xxx TRUE
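Equivalently (a sketch, flipping the logic around), you can semi_join on the rows you want to keep:
df %>%
  semi_join(filter(df, meta == "analyte", !is.na(value)),
            by = c("Lots.of.cols", "analyte"))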
I would first fix your KEEP column, and then filter the data by it. First I group your data by analyte using group_by() from dplyr. Then I apply a logical test to discover, for each group, whether there is a row with meta = "analyte" and value = NA, using the any() function to check whether any of these test results are TRUE in the group. After that, I just use filter() to select the desired rows.
library(tidyverse)

df <- df %>%
  group_by(analyte) %>%
  mutate(KEEP = any(meta == "analyte" & is.na(value))) %>%
  filter(KEEP == FALSE)
Here is the result:
# A tibble: 3 x 5
# Groups: analyte [1]
Lots.of.cols analyte meta value KEEP
<dbl> <chr> <chr> <chr> <lgl>
1 1 B analyte 4 FALSE
2 1 B unit kg FALSE
3 1 B method xxx FALSE
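The same idea also works in one step, without materialising the KEEP column (a sketch):
df %>%
  group_by(analyte) %>%
  filter(!any(meta == "analyte" & is.na(value))) %>%
  ungroup()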
I have a dataframe with 3 categorical variables (x, y, z) along with an ID column:
df <- frame_data(
  ~id, ~x, ~y, ~z,
  1, "a", "c", "v",
  1, "b", "d", "f",
  2, "a", "d", "v",
  2, "b", "d", "v")
I want to apply spread() to each of the categorical variables, grouped by id.
The output should look like this:
id a b c d v f
1 1 1 1 1 1 1
2 1 1 0 2 2 0
I tried, but I was only able to do it for one variable at a time, not all together.
For example, applying spread() only to the y column (similarly, it can be done for x and z separately), but not all together in a single line:
df %>% count(id, y) %>% spread(y, n, fill = 0)
# A tibble: 2 x 3
id c d
<dbl> <int> <int>
1.00 1 1
2.00 0 2
Explaining my code in three steps:
Step 1: count frequency
df %>% count(id, y)
id y n
<dbl> <chr> <int>
1.00 c 1
1.00 d 1
2.00 d 2
Step 2: applying spread()
df %>% count(id, y) %>% spread(y, n)
# A tibble: 2 x 3
id c d
<dbl> <int> <int>
1 1.00 1 1
2 2.00 NA 2
Step 3: adding fill = 0 replaces NA, which means there were zero occurrences of c in the y column for id 2 (as you can see in df):
df %>% count(id, y) %>% spread(y, n, fill = 0)
# A tibble: 2 x 3
id c d
<dbl> <int> <int>
1.00 1 1
2.00 0 2
Problem: in my actual data set I have 20 such categorical variables, so I can't do it one by one for all of them. I am looking to do it all at once.
Is it possible to apply tidyr's spread() to all of the categorical variables together? If not, can you please suggest an alternative?
Note: I also tried these answers, but they were not helpful for this particular case:
R spreading multiple columns with tidyr
Is it possible to use spread on multiple columns in tidyr similar to dcast?
Can spread() in tidyr spread across multiple value?
Expanding columns associated with a categorical variable into multiple columns with dplyr/tidyr while retaining id variable
Additional related question:
It is possible that two categorical columns (e.g. in a survey dataset) have the same values, like below.
df <- frame_data(
  ~id, ~Do_you_Watch_TV, ~Do_you_Drive,
  1, "yes", "yes",
  1, "yes", "no",
  2, "yes", "no",
  2, "no", "yes")
# A tibble: 4 x 3
id Do_you_Watch_TV Do_you_Drive
<dbl> <chr> <chr>
1 1.00 yes yes
2 1.00 yes no
3 2.00 yes no
4 2.00 no yes
Running the code below does not differentiate the counts of yes and no between 'Do_you_Watch_TV' and 'Do_you_Drive':
df %>% gather(Key, value, -id) %>%
  group_by(id, value) %>%
  summarise(count = n()) %>%
  spread(value, count, fill = 0) %>%
  as.data.frame()
id no yes
1 1 3
2 2 2
Whereas, expected output should be :
id Do_you_Watch_TV_no Do_you_Watch_TV_yes Do_you_Drive_no Do_you_Drive_yes
1 0 2 1 1
2 1 1 1 1
So we need to treat the no and yes values from Do_you_Watch_TV and Do_you_Drive separately by adding a prefix: Do_you_Drive_yes, Do_you_Drive_no, Do_you_Watch_TV_yes, Do_you_Watch_TV_no.
How can we achieve this?
Thanks
First you need to convert your data frame to long format with tidyr::gather before you can transform it to wide format. Afterwards, you have a couple of options.
Option 1: using tidyr::spread:
# data
df <- frame_data(
  ~id, ~x, ~y, ~z,
  1, "a", "c", "v",
  1, "b", "d", "f",
  2, "a", "d", "v",
  2, "b", "d", "v")
library(tidyverse)
df %>% gather(Key, value, -id) %>%
  group_by(id, value) %>%
  summarise(count = n()) %>%
  spread(value, count, fill = 0) %>%
  as.data.frame()
# id a b c d f v
# 1 1 1 1 1 1 1 1
# 2 2 1 1 0 2 0 2
Option 2: another option is to use reshape2::dcast:
library(tidyverse)
library(reshape2)
df %>% gather(Key, value, -id) %>%
  dcast(id ~ value, fun.aggregate = length)
# id a b c d f v
# 1 1 1 1 1 1 1 1
# 2 2 1 1 0 2 0 2
Edit: to include a solution for the 2nd data frame.
# data
df1 <- frame_data(
  ~id, ~Do_you_Watch_TV, ~Do_you_Drive,
  1, "yes", "yes",
  1, "yes", "no",
  2, "yes", "no",
  2, "no", "yes")
library(tidyverse)
df1 %>% gather(Key, value, -id) %>%
  unite("value", c(Key, value)) %>%
  group_by(id, value) %>%
  summarise(count = n()) %>%
  spread(value, count, fill = 0) %>%
  as.data.frame()
# id Do_you_Drive_no Do_you_Drive_yes Do_you_Watch_TV_no Do_you_Watch_TV_yes
# 1 1 1 1 0 2
# 2 2 1 1 1 1
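For reference, in current tidyr (>= 1.0) gather/spread are superseded by pivot_longer/pivot_wider; a sketch of the same two reshapes:
library(tidyverse)
df %>%
  pivot_longer(-id, names_to = "Key", values_to = "value") %>%
  count(id, value) %>%
  pivot_wider(names_from = value, values_from = n, values_fill = 0)
df1 %>%
  pivot_longer(-id, names_to = "Key", values_to = "value") %>%
  count(id, Key, value) %>%
  pivot_wider(names_from = c(Key, value), values_from = n, values_fill = 0)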
Consider the following sample dataframe:
> df
id name time
1 1 b 10
2 1 b 12
3 1 a 0
4 2 a 5
5 2 b 11
6 2 a 9
7 2 b 7
8 1 a 15
9 2 b 1
10 1 a 3
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L),
                     name = c("b", "b", "a", "a", "b", "a", "b", "a", "b", "a"),
                     time = c(10L, 12L, 0L, 5L, 11L, 9L, 7L, 15L, 1L, 3L)),
                .Names = c("id", "name", "time"), row.names = c(NA, -10L),
                class = "data.frame")
I need to identify and record all sequences seq <- c("a", "b"), where "a" immediately precedes "b" based on the "time" column, for each id. No other names between "a" and "b" are permitted. (The real sequence length is at least 5.)
The expected result for the sample data is
a b
1 3 10
2 5 7
3 9 11
There is a similar question, Finding rows in R dataframe where a column value follows a sequence. However, it is not clear to me how to deal with the "id" column in my case. Is there a way to solve the problem using "dplyr"?
library(dplyr); library(tidyr)

# sort the data frame by id and time
df %>% arrange(id, time) %>% group_by(id) %>%
  # get a logical vector indicating rows of a followed by b, and mark each pair
  # as unique via cumsum
  mutate(ab = name == "a" & lead(name) == "b", g = cumsum(ab)) %>%
  # subset rows where the conditions are met
  filter(ab | lag(ab)) %>%
  # reshape the data frame to wide format
  select(-ab) %>% spread(name, time)
#Source: local data frame [3 x 4]
#Groups: id [2]
# id g a b
#* <int> <int> <int> <int>
#1 1 1 3 10
#2 2 1 5 7
#3 2 2 9 11
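A quick sanity check of the pairing logic on one id (not from the original answer): for id 1, only the a at time 3 is immediately followed by a b (at time 10), matching the first row of the output.
df %>% filter(id == 1) %>% arrange(time) %>% select(name, time)
#  name time
#1    a    0
#2    a    3
#3    b   10
#4    b   12
#5    a   15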
If the length of the sequence is larger than two, you will need to check multiple lags. One option is to use the shift function from data.table (which accepts a vector of lag/lead steps) combined with Reduce, say if we need to check the pattern abb:
library(dplyr); library(tidyr); library(data.table)

pattern <- c("a", "b", "b")
len_pattern <- length(pattern)

df %>% arrange(id, time) %>% group_by(id) %>%
  # same logic as before, but use Reduce to check the multiple-lags condition
  mutate(ab = Reduce("&", Map("==", shift(name, n = 0:(len_pattern - 1), type = "lead"), pattern)),
         g = cumsum(ab)) %>%
  # use Reduce with "|" to subset sequences of rows having the same length as the pattern
  filter(Reduce("|", shift(ab, n = 0:(len_pattern - 1), type = "lag"))) %>%
  # make unique names
  group_by(g, add = TRUE) %>% mutate(name = paste(name, 1:n(), sep = "_")) %>%
  # pivot the table to wide format
  select(-ab) %>% spread(name, time)
#Source: local data frame [1 x 5]
#Groups: id, g [1]
# id g a_1 b_2 b_3
#* <int> <int> <int> <int> <int>
#1 1 1 3 10 12
It's somewhat convoluted, but how about a rolling join?
library(data.table)

setorder(setDT(df), id, time)

# take the last "a" of each run of a's per id, then roll it forward
# (roll = -Inf) to the nearest later "b"
df[name == "b"][
  df[, if (name[1L] == "a") .(time = last(time)), by = .(id, name, r = rleid(id, name))],
  on = .(id, time),
  roll = -Inf,
  nomatch = 0,
  .(a = i.time, b = x.time)
]
a b
1: 3 10
2: 5 7
3: 9 11
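To unpack the roll = -Inf part (a small illustration, not from the answer itself): rolling with -Inf matches each i row to the next x row at or after it in time:
library(data.table)
x <- data.table(id = 1, time = c(7, 10), val = c("b_at_7", "b_at_10"))
i <- data.table(id = 1, time = 8)
x[i, on = .(id, time), roll = -Inf]   # picks val = "b_at_10", the next time >= 8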
You can use an ifelse in filter with lag and lead, and then tidyr::spread to reshape to wide:
library(tidyverse)
df %>% arrange(id, time) %>% group_by(id) %>%
  filter(ifelse(name == 'b',               # if name is b...
                lag(name) == 'a',          # ...is the previous name a?
                lead(name) == 'b')) %>%    # else (name is not b), is the next name b?
  ungroup() %>%
  mutate(i = rep(seq(n() / 2), each = 2)) %>%   # create indices to spread by
  spread(name, time) %>% select(a, b)           # spread to wide and clean up
## # A tibble: 3 × 2
## a b
## * <int> <int>
## 1 3 10
## 2 5 7
## 3 9 11
Based on the comment below, here's a version that uses gregexpr to find the first index of a matched pattern. While more complicated, it scales more easily to longer patterns like "aabb":
df %>% group_by(pattern = 'aabb', id) %>%   # add pattern as a column, group
  arrange(time) %>%
  # collapse each group to a string for name and a list column for time
  summarise(name = paste(name, collapse = ''), time = list(time)) %>%
  # group row-wise and add a list column of start indices for each match
  rowwise() %>% mutate(i = gregexpr(pattern, name)) %>%
  unnest(i, .drop = FALSE) %>%   # expand, keeping other list columns
  filter(i != -1) %>%            # chop out rows with no match from gregexpr
  rowwise() %>%                  # regroup
  # subset with the sequence from the index through the pattern length
  mutate(time = list(time[i + 0:(nchar(pattern) - 1)]),
         pattern = strsplit(pattern, '')) %>%   # expand pattern to a list column
  rownames_to_column('match') %>%   # add rownames as a match index column
  unnest(pattern, time) %>%         # expand matches in parallel
  # paste a sequence number onto each letter (important for spreading if letters repeat)
  group_by(match) %>% mutate(pattern = paste0(pattern, seq(n()))) %>%
  spread(pattern, time)   # spread to wide form
## Source: local data frame [1 x 8]
## Groups: match [1]
##
## match id name i a1 a2 b3 b4
## * <chr> <int> <chr> <int> <int> <int> <int> <int>
## 1 1 1 aabba 1 0 3 10 12
Note that if the pattern doesn't happen to be in alphabetical order, the resulting columns will not be ordered by their indices. Since indices are preserved, though, you can sort with something like select(1:4, parse_number(names(.)[-1:-4]) + 4).