How to count values per ID and display the results in new columns - r

This has been bugging me for two days.
I have data like this:
Account.ID asset_name
6yS A
6yS B
6yS B
6yS C
6yU D
876 C
From here I want to create dummy-style columns, but with only one row per ID.
My output should look like this:
Account.ID asset_name Flag_A Flag_B Flag_C Flag_D
6yS A 1 2 1 0
6yU D 0 0 0 1
876 C 0 0 1 0
I tried aggregating, but that produces a separate table which I would then have to merge back, and I would lose information along the way.
Please help me out.
Thank you all in advance.

This one?
library(dplyr)
df %>%
  count(Account.ID, asset_name) %>%
  tidyr::pivot_wider(names_from = asset_name,
                     values_from = n,
                     values_fill = list(n = 0))
# A tibble: 3 x 5
Account.ID A B C D
<chr> <int> <int> <int> <int>
1 6yS 1 2 1 0
2 6yU 0 0 0 1
3 876 0 0 1 0
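If the Flag_ prefix from the desired output is needed, pivot_wider() has a names_prefix argument that can add it directly (a minimal sketch on the same df):
library(dplyr)
df %>%
  count(Account.ID, asset_name) %>%
  tidyr::pivot_wider(names_from = asset_name,
                     values_from = n,
                     values_fill = list(n = 0),
                     names_prefix = "Flag_")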

You can use dcast from data.table with the fun.aggregate argument:
library(data.table)
dcast(data = setDT(df)[, asset_name := paste0('Flag_', asset_name)],
      formula = Account.ID ~ asset_name,
      fun.aggregate = length)
Output:
Account.ID Flag_A Flag_B Flag_C Flag_D
1: 6yS 1 2 1 0
2: 6yU 0 0 0 1
3: 876 0 0 1 0

Here's a tidyverse solution, although not the most elegant.
Account.ID <- c('6yS', '6yS', '6yS', '6yS', '6yU', '876')
asset_name <- c('A','B','B','C','D','C')
df <- data.frame(Account.ID, asset_name)
library(dplyr)
library(tidyr)
df <- df %>%
  group_by(Account.ID, asset_name) %>%
  summarise(Count = n()) %>%
  spread(key = asset_name, value = Count, fill = 0)
Returns:
Account.ID A B C D
<fct> <dbl> <dbl> <dbl> <dbl>
1 6yS 1 2 1 0
2 6yU 0 0 0 1
3 876 0 0 1 0
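Note that spread() has since been superseded by pivot_wider() in tidyr; an equivalent sketch of the same reshaping step (assuming tidyr >= 1.0.0 for pivot_wider() and a dplyr recent enough for count()'s name argument):
df %>%
  count(Account.ID, asset_name, name = "Count") %>%
  pivot_wider(names_from = asset_name, values_from = Count,
              values_fill = list(Count = 0))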

I think I have an answer for you. So this is your dataset:
Account.ID <- c("6yS", "6yS", "6yS", "6yS", "6yU", "876")
asset_name <- c("A", "B", "B", "C", "D", "C")
df <- data.frame(Account.ID, asset_name)
df
Account.ID asset_name
1 6yS A
2 6yS B
3 6yS B
4 6yS C
5 6yU D
6 876 C
For further transformations I am using tidyverse, so install it and load the library:
install.packages("tidyverse")
library(tidyverse)
df <- df %>%
  group_by(Account.ID, asset_name) %>%
  summarize(n = n()) %>%
  spread(asset_name, n)
df
# A tibble: 3 x 5
# Groups: Account.ID [3]
Account.ID A B C D
<fct> <int> <int> <int> <int>
1 6yS 1 2 1 NA
2 6yU NA NA NA 1
3 876 NA NA 1 NA
Now all that's left is to turn the NAs into 0 and rename the columns:
df[is.na(df)] <- 0
names(df)[2:ncol(df)] <- paste0("Flag_", names(df)[2:ncol(df)])
df
# A tibble: 3 x 5
# Groups: Account.ID [3]
Account.ID Flag_A Flag_B Flag_C Flag_D
<fct> <dbl> <dbl> <dbl> <dbl>
1 6yS 1 2 1 0
2 6yU 0 0 0 1
3 876 0 0 1 0
Is this what you were looking for?
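As a side note, the NA-replacement step can be skipped by filling zeros during the spread, and the Flag_ prefix can be added beforehand (a sketch on the same df, with dplyr/tidyr loaded as above):
df %>%
  mutate(asset_name = paste0("Flag_", asset_name)) %>%
  group_by(Account.ID, asset_name) %>%
  summarize(n = n()) %>%
  spread(asset_name, n, fill = 0)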

Related

dplyr mutate ifelse returning first value of group instead of by-row

I'm trying to mutate a data.frame using ifelse:
library(dplyr)
df = data.frame(grp = c('a', 'a', 'a', 'b', 'b', 'b'),
                value1 = c(0, 0, 0, 0, 1, 2),
                value2 = 1:6)
df %>%
  group_by(grp) %>%
  mutate(value2 = ifelse(all(value1 == 0), 0, value2))
which returns
# # A tibble: 6 x 3
# # Groups: grp [2]
# grp value1 value2
# <chr> <dbl> <dbl>
# 1 a 0 0
# 2 a 0 0
# 3 a 0 0
# 4 b 0 4
# 5 b 1 4
# 6 b 2 4
instead of
# # A tibble: 6 x 3
# # Groups: grp [2]
# grp value1 value2
# <chr> <dbl> <dbl>
# 1 a 0 0
# 2 a 0 0
# 3 a 0 0
# 4 b 0 4
# 5 b 1 5
# 6 b 2 6
How can I change the mutate so that the rows of "value2" are unchanged if the condition is false?
You can use if and else instead of ifelse():
df %>%
  group_by(grp) %>%
  mutate(value2 = if (all(value1 == 0)) 0 else value2)
grp value1 value2
<fct> <dbl> <dbl>
1 a 0 0
2 a 0 0
3 a 0 0
4 b 0 4
5 b 1 5
6 b 2 6
You can try ifelse as a mask, e.g.,
df %>%
  group_by(grp) %>%
  mutate(value2 = ifelse(all(value1 == 0), 0, 1) * value2)
or (thanks to @tmfmnk's comment)
df %>%
  group_by(grp) %>%
  mutate(value2 = any(value1 != 0) * value2)
which gives
grp value1 value2
<chr> <dbl> <dbl>
1 a 0 0
2 a 0 0
3 a 0 0
4 b 0 4
5 b 1 5
6 b 2 6
The problem you encountered is due to the fact that all(value1 == 0) returns a single logical value, and ifelse() returns a result with the same length as its test. You need a vector of logical values to get your desired output, e.g.,
df %>%
  group_by(grp) %>%
  mutate(value2 = ifelse(rep(all(value1 == 0), n()), 0, value2))
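A quick illustration of that length behaviour, independent of the data above:
ifelse(TRUE, 0, 1:6)         # returns 0: the result has the length of the test (1)
ifelse(rep(TRUE, 6), 0, 1:6) # returns c(0, 0, 0, 0, 0, 0): length-6 test, length-6 result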

How can I filter by subjects who have all levels of a factor?

I am trying to filter a data set to only include subjects who have data in all conditions (levels of a factor).
I have tried to filter by calculating the number of levels for each subject, but that does not work.
library(tidyverse)
Data <- data.frame(
  Subject = factor(c(rep(1, 3),
                     rep(2, 3),
                     rep(3, 1))),
  Condition = factor(c("A", "B", "C",
                       "A", "B", "C",
                       "A")),
  Val = c(1, 0, 1,
          0, 0, 1,
          1)
)
Data %>%
  semi_join(
    .,
    Data %>%
      group_by(Subject) %>%
      summarize(Num_Cond = length(levels(Condition))) %>%
      filter(Num_Cond == 3),
    by = "Subject"
  )
This attempt yields:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1
7 3 A 1
Desired output:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1
I want to filter subject 3 out because they only have data for one condition.
Is there a dplyr/tidyverse approach for this problem?
We can create a condition with all and levels
library(dplyr)
Data %>%
  group_by(Subject) %>%
  filter(all(levels(Condition) %in% Condition))
# A tibble: 6 x 3
# Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Or with n_distinct and nlevels
Data %>%
  group_by(Subject) %>%
  filter(nlevels(Condition) == n_distinct(Condition))
# A tibble: 6 x 3
# Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Here is a solution testing whether the number of rows of each group is equal to the number of levels of Condition.
Data %>%
  group_by(Subject) %>%
  filter(n() == nlevels(Condition))
## A tibble: 6 x 3
## Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Edit
Following the comment by user @akrun, I tested with a data set that duplicates every row, and the code above fails.
bind_rows(Data, Data) %>%
  group_by(Subject) %>%
  #distinct() %>%
  filter(n() == nlevels(Condition))
## A tibble: 0 x 3
## Groups: Subject [0]
## ... with 3 variables: Subject <fct>, Condition <fct>, Val <dbl>
Uncommenting the distinct() line solves the problem.
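Spelled out, that fix looks like this:
bind_rows(Data, Data) %>%
  group_by(Subject) %>%
  distinct() %>%
  filter(n() == nlevels(Condition))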
I found a relatively simple solution by subsetting on Subject:
Data %>%
  semi_join(
    .,
    Data %>%
      group_by(Subject) %>%
      droplevels() %>%
      summarize(Num_Cond = length(levels(Condition)[Subject])) %>%
      filter(Num_Cond == 3),
    by = "Subject"
  )
This gives the desired output:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1

Re-Framing datasets

Hi all, I have two datasets below. Dataset1 is derived from dataset2 (it is the count of users per app in dataset2). From these two datasets, can we build the third dataset (the expected output)?
dataset1
Apps  # user Entries
A 3
B 4
C 6
dataset2
Apps Users
A X
A Y
A Z
B Y
B Y
B Z
B A
C X
C X
C X
C X
C X
C X
Expected output
Apps Entries  X  Y  Z  A
A       3     1  1  1
B       4        2  1  1
C       6     6
We can first count Apps and Users, get the data in wide format, and then join with the table of counts per Apps.
library(dplyr)
df %>%
  count(Apps, Users) %>%
  tidyr::pivot_wider(names_from = Users, values_from = n,
                     values_fill = list(n = 0)) %>%
  left_join(df %>% count(Apps), by = 'Apps')
# Apps X Y Z A n
# <chr> <int> <int> <int> <int> <int>
#1 A 1 1 1 0 3
#2 B 0 2 1 1 4
#3 C 6 0 0 0 6
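If the count column should be called Entries as in the expected output, count()'s name argument can set that directly (a sketch, assuming a dplyr version where count() accepts name =):
df %>%
  count(Apps, Users) %>%
  tidyr::pivot_wider(names_from = Users, values_from = n,
                     values_fill = list(n = 0)) %>%
  left_join(df %>% count(Apps, name = "Entries"), by = 'Apps')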
If showing 0 is not a problem and a different column order is acceptable, you can use table and rowSums to produce the expected output.
x <- table(dataset2)
cbind(Entries=rowSums(x), x)
# Entries A X Y Z
#A 3 0 1 1 1
#B 4 1 0 2 1
#C 6 0 6 0 0
A solution where you do not need to calculate the total separately or do any joins...
This solution uses purrr::pmap and dplyr::mutate for dynamically calculating Total.
library(tidyverse) # dplyr, tidyr, purrr
df %>%
  count(Apps, Users) %>%
  pivot_wider(id_cols = Apps, names_from = Users, values_from = n, values_fill = list(n = 0)) %>%
  mutate(Total = pmap_int(.l = select_if(., is.numeric),
                          .f = sum))
which gives the output you need:
# A tibble: 3 x 6
Apps X Y Z A Total
<chr> <int> <int> <int> <int> <int>
1 A 1 1 1 0 3
2 B 0 2 1 1 4
3 C 6 0 0 0 6
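As a side note, with dplyr 1.0.0 or later the row total can also be computed with rowSums() and across(), avoiding purrr entirely (a sketch under that version assumption):
df %>%
  count(Apps, Users) %>%
  pivot_wider(id_cols = Apps, names_from = Users, values_from = n, values_fill = list(n = 0)) %>%
  mutate(Total = rowSums(across(where(is.numeric))))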

Filter Data completely user defined r - multiple columns and filters

I am attempting to create a function that will allow a user to specify an arbitrary number of columns and apply matching filters to those columns.
df <- data.frame(a=1:10, b=round(runif(10)), c=round(runif(10)))
|a| b|c|
|1| 1|1|
|2| 0|0|
|3| 0|1|
|4| 1|0|
|5| 1|0|
|6| 1|0|
|7| 1|1|
|8| 1|1|
|9| 1|0|
|10|1|1|
I would like the user to be able to filter the data based on either column, applying a different filter to each column. I know the following does not work, but this is the general idea.
test <- function(df, fCol, fParam){
  df %>% filter(fCol[1] %in% fParam[1] | fCol[2] %in% fParam[2])
}
test(df, c("b", "c"), c(1, 0))
# Which I would want it to return
|a|b|c|
|4|1|0|
|5|1|0|
|6|1|0|
|9|1|0|
The issue that I run into is that I won't know how many columns the user will want to filter, nor will I know the column names.
Any help at all would be greatly appreciated. Please ask questions if you have them. I tried my best to give a reprex.
I believe this should satisfy what you want
library(tidyr)
library(dplyr)
test <- function(df,
                 fCol,
                 fParam,
                 match_type = "any") {
  if (!is.element(match_type, c("any", "all")) | length(match_type) != 1) {
    stop()
  }
  # add a row id so rows can be matched back after reshaping
  df <- df %>% ungroup() %>%
    mutate(..id.. = 1:n())
  # lookup table: which value to match in which column
  meta <- data.frame(fCol = fCol, fParam = fParam)
  # long format: one row per (row id, filtered column), flagged if it matches
  logi <- df %>%
    select("..id..", fCol) %>%
    gather(key = "key", value = "value", -..id..) %>%
    left_join(., y = meta, by = c("key" = "fCol")) %>%
    mutate(match = value == fParam) %>%
    select(-key, -value, -fParam) %>%
    group_by_at(setdiff(names(.), "match")) %>%
    summarise(match = ifelse(match_type %in% "any", any(match), all(match)))
  # keep only the rows whose match flag is TRUE
  df2 <- left_join(df, logi, by = intersect(colnames(df), colnames(logi))) %>%
    filter(match) %>%
    select(-match, -..id..)
  return(df2)
}
df <- data.frame(a=1:10, b=round(runif(10)), c=round(runif(10)))
df
# a b c
#1 1 0 1
#2 2 1 0
#3 3 0 0
#4 4 0 1
#5 5 0 1
#6 6 0 1
#7 7 1 0
#8 8 1 1
#9 9 1 0
#10 10 1 0
#use "any" to do an | match
test(df, c("b","c"),c(1,0), match_type = "any")
# a b c
#1 2 1 0
#2 3 0 0
#3 7 1 0
#4 8 1 1
#5 9 1 0
#6 10 1 0
#use "all" to do an & match
test(df, c("b","c"),c(1,0), match_type = "all")
# a b c
#1 2 1 0
#2 7 1 0
#3 9 1 0
#4 10 1 0
You can also specify the same colname for fCol multiple times if you want to match multiple values
test(df, c("b","b"),c(1,0)) #matches everything but you get the point
(My original response:)
I am not sure this quite gives you the process you want, but here's my best attempt before running out of patience!!! :-)
I am sure there is a good way to make this an AND filter rather than an OR, but I can't quite get there myself. (Maybe a combination of map_dfc and inner_join?)
Edit: got there in the end! Improved code below (original code deleted).
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tibble))
suppressPackageStartupMessages(library(purrr))
my_df <- tibble(
  a = 1:10,
  b = round(runif(10)),
  c = round(runif(10))
)
my_df
#> # A tibble: 10 x 3
#> a b c
#> <int> <dbl> <dbl>
#> 1 1 1 0
#> 2 2 1 0
#> 3 3 0 1
#> 4 4 0 0
#> 5 5 1 1
#> 6 6 0 1
#> 7 7 0 0
#> 8 8 0 1
#> 9 9 1 0
#> 10 10 1 0
col_names <- c("b", "c")
tests <- c(1, 0)
# option 1: with a named function:
make_test_frame <- function(col_name, test) {
  tibble({{col_name}} := test)
}
my_df1 <- map2_dfc(col_names, tests, make_test_frame) %>%
  inner_join(x = my_df)
#> Joining, by = c("b", "c")
my_df1
#> # A tibble: 4 x 3
#> a b c
#> <int> <dbl> <dbl>
#> 1 1 1 0
#> 2 2 1 0
#> 3 9 1 0
#> 4 10 1 0
# 2. or with an anonymous function:
my_df1 <- map2_dfc(
  col_names, tests,
  function(col_name, test) {
    tibble({{col_name}} := test)
  }
) %>%
  inner_join(x = my_df)
#> Joining, by = c("b", "c")
my_df1
#> # A tibble: 4 x 3
#> a b c
#> <int> <dbl> <dbl>
#> 1 1 1 0
#> 2 2 1 0
#> 3 9 1 0
#> 4 10 1 0
# 3. or as one big, hairy function:
filter_df <- function(df, col_names, tests) {
  map2_dfc(
    col_names, tests,
    function(col_name, test) {
      tibble({{col_name}} := test)
    }
  ) %>%
    inner_join(x = df)
}
my_df1 <- filter_df(my_df, col_names = c("b", "c"), tests = c(1, 0))
#> Joining, by = c("b", "c")
my_df1
#> # A tibble: 4 x 3
#> a b c
#> <int> <dbl> <dbl>
#> 1 1 1 0
#> 2 2 1 0
#> 3 9 1 0
#> 4 10 1 0
Created on 2020-02-28 by the reprex package (v0.3.0)
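For comparison, a base R sketch of the same AND-style filter (assuming my_df, col_names, and tests as defined above):
# build one logical vector per (column, value) pair, then AND them together
keep <- Reduce(`&`, Map(function(col, val) my_df[[col]] == val, col_names, tests))
my_df[keep, ]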

Separate a row of data on different columns with the count of each item

I have a dataset with two columns. I want to separate the second one (delimited by |) into many columns, where each column is named after an item and each observation holds that item's count.
id column
1 a|b|a
2 a|b|c|d|e
3 a|c|c
I would like to have columns with the name of each item and its count. For example, it would be as follows:
id a b c d e
1 2 1 0 0 0
2 1 1 1 1 1
3 2 0 1 0 0
How do I separate this data so that the values are distributed across columns like this?
A tidyverse approach, assuming a data frame named mydata:
library(dplyr)
library(tidyr)
mydata %>%
  separate_rows(column, sep = "\\|") %>%
  count(id, column) %>%
  spread(column, n) %>%
  replace(., is.na(.), 0) # or just spread(column, n, fill = 0)
Result:
# A tibble: 3 x 6
id a b c d e
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0 0 0
2 2 1 1 1 1 1
3 3 1 0 2 0 0
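For reference, a base R sketch with strsplit and table should give the same counts (assuming mydata as above):
items <- strsplit(as.character(mydata$column), "|", fixed = TRUE)
as.data.frame.matrix(table(rep(mydata$id, lengths(items)), unlist(items)))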
