Summarizing dataframe based on multiple columns - r

I'm having some trouble figuring this one out. Say, I have a table like this:
Name Activity Day
1 John cycle 1
2 John work 1
3 Tina work 1
4 Monika work 1
5 Tina swim 1
6 Tina jogging 2
7 John work 2
8 Tina work 2
I want to summarize it in a way that the activity of each individual is grouped according to the day.
It should look like this:
Name Activity Day
1 John cycle;work 1
2 Tina work;swim 1
3 Monika work 1
4 Tina jogging;work 2
5 John work 2
I am thinking that the dplyr package would be the answer here, but I don't know how to do it. Any help?
Thanks!

try:
library(dplyr)
dat <- tribble(~"Name", ~"Activity", ~"Day",
"John", "cycle", 1,
"John", "work" , 1,
"Tina", "work", 1,
"Monika", "work", 1,
"Tina", "swim", 1,
"Tina", "jogging", 2,
"John", "work", 2,
"Tina", "work", 2)
dat %>%
group_by(Name, Day) %>%
summarise(activity = paste(Activity, collapse = "; "))
# A tibble: 5 x 3
# Groups: Name [3]
Name Day activity
<chr> <dbl> <chr>
1 John 1 cycle; work
2 John 2 work
3 Monika 1 work
4 Tina 1 work; swim
5 Tina 2 jogging; work
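If you want to match the ";" separator from the desired output exactly and drop the residual grouping shown in the header above, summarise() also takes a .groups argument (a small variation on the same code, assuming dplyr >= 1.0):
dat %>%
  group_by(Name, Day) %>%
  # collapse with ";" (no space) and return an ungrouped tibble
  summarise(Activity = paste(Activity, collapse = ";"), .groups = "drop")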

An option with data.table
library(data.table)
setDT(dat)[, .(Activity = toString(Activity)), .(Name, Day)]
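toString() collapses with ", "; if you prefer the ";" separator from the question, the same call can use paste() with collapse instead (a sketch on top of the answer above):
setDT(dat)[, .(Activity = paste(Activity, collapse = ";")), by = .(Name, Day)]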

You can use the aggregate function, for example:
> aggregate(dat$Activity, list(dat$Name, dat$Day), as.character)
Group.1 Group.2 x
1 John 1 cycle, work
2 Monika 1 work
3 Tina 1 work, swim
4 John 2 work
5 Tina 2 jogging, work
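A variation on the same base R idea (a sketch): the formula interface together with toString() returns a single collapsed string per group instead of a list column, and keeps the original column names:
aggregate(Activity ~ Name + Day, data = dat, FUN = toString)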

Related

Find rows that are identical in one column but not another

There should be a fairly simple solution to this but it's giving me trouble. I have a DF similar to this:
> df <- data.frame(name = c("george", "george", "george", "sara", "sara", "sam", "bill", "bill"),
id_num = c(1, 1, 2, 3, 3, 4, 5, 5))
> df
name id_num
1 george 1
2 george 1
3 george 2
4 sara 3
5 sara 3
6 sam 4
7 bill 5
8 bill 5
I'm looking for a way to find rows where the name and ID numbers are inconsistent in a very large dataset. I.e., George should always be "1" but in row three there is a mistake and he has also been assigned ID number "2".
I think the easiest way is to use dplyr::count twice; for your example:
df %>%
  count(name, id_num) %>%
  count(name)
The first count will give:
name id_num n
george 1 2
george 2 1
sara 3 2
sam 4 1
bill 5 2
Then the second count will give:
name n
george 2
sara 1
sam 1
bill 1
Of course, you could add filter(n > 1) to the end of your pipe, too, or arrange(desc(n))
df %>%
  count(name, id_num) %>%
  count(name) %>%
  arrange(desc(n)) %>%
  filter(n > 1)
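The same check can also be written in a single pass with n_distinct() instead of counting twice (a sketch; n_ids is just an illustrative column name):
library(dplyr)
df %>%
  group_by(name) %>%
  summarise(n_ids = n_distinct(id_num)) %>%  # distinct IDs per name
  filter(n_ids > 1)                          # names with inconsistent IDs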
Using tapply() to calculate the number of IDs per name, then subset for greater than 1.
res <- with(df, tapply(id_num, list(name), \(x) length(unique(x))))
res[res > 1]
# george
# 2
You probably want to correct this. A safe way is to rebuild the numeric IDs using as.factor(),
df$id_new <- as.integer(as.factor(df$name))
df
# name id_num id_new
# 1 george 1 2
# 2 george 1 2
# 3 george 2 2
# 4 sara 3 4
# 5 sara 3 4
# 6 sam 4 3
# 7 bill 5 1
# 8 bill 5 1
where numbers are assigned according to the names in alphabetical order, or factor(), reading in the levels in order of appearance:
df$id_new2 <- as.integer(factor(df$name, levels=unique(df$name)))
df
# name id_num id_new id_new2
# 1 george 1 2 1
# 2 george 1 2 1
# 3 george 2 2 1
# 4 sara 3 4 2
# 5 sara 3 4 2
# 6 sam 4 3 3
# 7 bill 5 1 4
# 8 bill 5 1 4
Note: R >= 4.1 used.
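Another base R equivalent of the first-appearance numbering (a sketch): match() against the unique names gives the same result as the factor() call above.
df$id_new3 <- match(df$name, unique(df$name))  # 1 1 1 2 2 3 4 4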
Data:
df <- structure(list(name = c("george", "george", "george", "sara",
"sara", "sam", "bill", "bill"), id_num = c(1, 1, 2, 3, 3, 4,
5, 5)), class = "data.frame", row.names = c(NA, -8L))

How to sort the values of each obs of a data.frame? [duplicate]

This question already has answers here:
Row wise Sorting in R
(2 answers)
Closed 3 years ago.
I have this data.set
people <- c("Arthur", "Jean", "Paul", "Fred", "Gary")
question1 <- c(1, 3, 2, 2, 5)
question2 <- c(1, 0, 1, 0, 3)
question3<- c(1, 0, 2, 2, 4)
question4 <- c(1, 5, 2, 1, 5)
test <- data.frame(people, question1, question2, question3, question4)
test
Here is my output :
people question1 question2 question3 question4
1 Arthur 1 1 1 1
2 Jean 3 0 0 5
3 Paul 2 1 2 2
4 Fred 2 0 2 1
5 Gary 5 3 4 5
I want to order the results for each person like this (descending order based on values from left to right columns) in a new data.frame. The names of the new columns can be letters or anything else.
people A B C D
1 Arthur 1 1 1 1
2 Jean 5 3 0 0
3 Paul 2 2 2 1
4 Fred 2 2 1 0
5 Gary 5 5 4 3
With base R, apply the function sort to the rows in question, but be careful: apply returns the transpose:
test[-1] <- t(apply(test[-1], 1, sort, decreasing = TRUE))
test
# people question1 question2 question3 question4
#1 Arthur 1 1 1 1
#2 Jean 5 3 0 0
#3 Paul 2 2 2 1
#4 Fred 2 2 1 0
#5 Gary 5 5 4 3
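If you also want the A-D column names from the desired output, a final rename on top of the base R answer does it (a small sketch):
names(test)[-1] <- LETTERS[1:4]  # question1..question4 become A, B, C, D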
Solution using tidyverse (i.e. dplyr and tidyr):
library(tidyverse)
test %>%
  pivot_longer(cols = -people, names_to = "variable", values_to = "values") %>%
  arrange(people, -values) %>%
  select(people, values) %>%
  mutate(new_names = rep(letters[1:4], length(unique(test$people)))) %>%
  pivot_wider(names_from = new_names, values_from = values)
This returns:
# A tibble: 5 x 5
people a b c d
<fct> <dbl> <dbl> <dbl> <dbl>
1 Arthur 1 1 1 1
2 Fred 2 2 1 0
3 Gary 5 5 4 3
4 Jean 5 3 0 0
5 Paul 2 2 2 1
Explanation:
bring data into 'long' form so we can order it on the values of all the 'question' variables.
order (arrange) on people and -values (see above)
remove the unused 'variable' column
create a new column to hold the new names, naming them a-d for each person
bring the data into 'wide' form, creating new columns from the new names
One dplyr and tidyr option could be:
test %>%
  pivot_longer(-people) %>%
  group_by(people) %>%
  arrange(desc(value), .by_group = TRUE) %>%
  mutate(name = LETTERS[1:n()]) %>%
  pivot_wider(names_from = "name", values_from = "value")
people A B C D
<fct> <dbl> <dbl> <dbl> <dbl>
1 Arthur 1 1 1 1
2 Fred 2 2 1 0
3 Gary 5 5 4 3
4 Jean 5 3 0 0
5 Paul 2 2 2 1

Find the number of times a unique value appears in more than one file and the number of those files

I have those 3 dataframes below:
Name<-c("jack","jack","bob","david","mary")
n1<-data.frame(Name)
Name<-c("jack","bill","dean","mary","steven")
n2<-data.frame(Name)
Name<-c("fred","alex","mary")
n3<-data.frame(Name)
I would like to create a new dataframe with 3 columns: all unique names present across all 3 source files in Column 1, the number of source files in which each name is located in Column 2, and the total number of instances of that name across all files in Column 3.
The result should be like
Name Number_of_files Number_of_instances
1 jack 2 3
2 bob 1 1
3 david 1 1
4 mary 3 3
5 bill 1 1
6 dean 1 1
7 steven 1 1
8 fred 1 1
9 alex 1 1
Is there an automated way to achieve all these at once?
One dplyr possibility could be:
bind_rows(n1, n2, n3, .id = "ID") %>%
  group_by(Name) %>%
  summarise(Number_of_files = n_distinct(ID),
            Number_of_instances = n())
Name Number_of_files Number_of_instances
<chr> <int> <int>
1 alex 1 1
2 bill 1 1
3 bob 1 1
4 david 1 1
5 dean 1 1
6 fred 1 1
7 jack 2 3
8 mary 3 3
9 steven 1 1
This is a conceptually similar answer to @tmfmnk's, but a base R version.
# Get the names of all the objects n1, n2, n3, ... etc.
name_df <- ls(pattern = "n\\d+")
# Combine them into one dataframe
all_df <- do.call(rbind, Map(cbind, mget(name_df), id = name_df))
# Get aggregated values
aggregate(id ~ Name, all_df, function(x) c(length(unique(x)), length(x)))
# Name id.1 id.2
#1 bob 1 1
#2 david 1 1
#3 jack 2 3
#4 mary 3 3
#5 bill 1 1
#6 dean 1 1
#7 steven 1 1
#8 alex 1 1
#9 fred 1 1
You can rename the columns if needed.
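For completeness, one way to do that rename (a sketch): aggregate() stores the two counts in a matrix column here, so flatten it into separate columns first and then set the names:
res <- aggregate(id ~ Name, all_df, function(x) c(length(unique(x)), length(x)))
out <- do.call(data.frame, res)  # splits the matrix column into id.1 / id.2
names(out) <- c("Name", "Number_of_files", "Number_of_instances")
out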
And for completeness, a data.table version:
library(data.table)
dt <- rbindlist(mget(name_df), idcol = "ID")
dt[, list(Number_of_files = uniqueN(ID), Number_of_instances = .N), by = .(Name)]

Find recurrences of pairs distributed in 2 columns of a data.frame

Let's say that there is a need to find out the frequency of each pair.
E.g. Mark - Maria appears three times and the rest appear once:
Name1 Name2
Mark Maria
John Xesca
Steve Rose
Mark Maria
John John
Mark Maria
John Xesca
What is the best way to perform this? Take into account that those are frequencies for both elements. I think this is more complex than expected... Thanks in advance.
If you need to take account of the order of name1 and name2 :
subset(as.data.frame(table(df)), Freq > 0)
# Name1 Name2 Freq
# 1 John John 1
# 5 Mark Maria 3
# 9 Steve Rose 1
# 10 John Xesca 2
We loop through the rows of the dataset, sort and paste each row together, then get the frequencies with table:
table(apply(df1, 1, function(x) paste(sort(x), collapse='-')))
# John-John John-Xesca Maria-Mark Rose-Steve
# 1 2 3 1
data
df1 <- structure(list(Name1 = c("Mark", "John", "Steve", "Mark", "John",
"Mark", "John"), Name2 = c("Maria", "Xesca", "Rose", "Maria",
"John", "Maria", "Xesca")), class = "data.frame", row.names = c(NA,
-7L))
Actually you don't even need to paste, just group:
dat %>%
  group_by(Name1, Name2) %>%
  count()
# # A tibble: 4 x 3
# # Groups: Name1, Name2 [4]
# Name1 Name2 n
# <fct> <fct> <int>
# 1 John John 1
# 2 John Xesca 2
# 3 Mark Maria 3
# 4 Steve Rose 1
You can paste0 together the columns then count with dplyr:
library(dplyr)
dat %>%
  mutate(pasted = paste0(Name1, Name2)) %>%
  group_by(pasted) %>%
  count()
# # A tibble: 4 x 2
# # Groups: pasted [4]
# pasted n
# <chr> <int>
# 1 JohnJohn 1
# 2 JohnXesca 2
# 3 MarkMaria 3
# 4 SteveRose 1
Note that JohnXesca will be treated as different from XescaJohn.
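If you want the unordered count in the same dplyr style, sort each pair with pmin()/pmax() before counting so that JohnXesca and XescaJohn fall into the same group (a sketch; p1 and p2 are just illustrative column names):
dat %>%
  mutate(p1 = pmin(as.character(Name1), as.character(Name2)),
         p2 = pmax(as.character(Name1), as.character(Name2))) %>%
  count(p1, p2)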
Data:
tt <- "Name1 Name2
Mark Maria
John Xesca
Steve Rose
Mark Maria
John John
Mark Maria
John Xesca"
dat <- read.table(text=tt, header = T)

Expanding a list to include all possible pairwise combinations within a group

I am currently running a randomization where individuals of a given population are sampled and placed into groups of defined size. The result is a data frame seen below:
Ind Group
Sally 1
Bob 1
Sue 1
Joe 2
Jeff 2
Jess 2
Mary 2
Jim 3
James 3
Is there a function which will allow me to expand the data set to show every possible within-group pairing? (Desired output below.) The pairings do not need to be reciprocal.
Group Ind1 Ind2
1 Sally Bob
1 Sally Sue
1 Sue Bob
2 Joe Jeff
2 Joe Jess
2 Joe Mary
2 Jeff Jess
2 Jess Mary
2 Jeff Mary
3 Jim James
I feel like there must be a way to do this in dplyr, but for the life of me I can't seem to sort it out.
An alternative dplyr & tidyr approach: The pipeline is a little longer, but the wrangling feels more straightforward to me. Start by combining all records in each group together. Next, pool and alphabetize all the names together to be able to eliminate the reciprocals/duplicates. Then finally separate the results back apart again.
library(dplyr)
library(tidyr)

left_join(dt, dt, by = "Group") %>%
  filter(Ind.x != Ind.y) %>%
  rowwise() %>%
  mutate(name = toString(sort(c(Ind.x, Ind.y)))) %>%
  select(Group, name) %>%
  distinct() %>%
  separate(name, into = c("Ind1", "Ind2")) %>%
  arrange(Group, Ind1, Ind2)
start off with a self-join on Group, which acts as a cross join of all records within each group
filter out the self joins
collect up all the names in each row, sort them, and set them down together in the name column.
now that the names are alphabetized, remove the alphabetized reciprocals
pull the data apart back into separate columns.
# A tibble: 10 x 3
Group Ind1 Ind2
* <int> <chr> <chr>
1 1 Bob Sally
2 1 Sally Sue
3 1 Bob Sue
4 2 Jeff Joe
5 2 Jess Joe
6 2 Joe Mary
7 2 Jeff Jess
8 2 Jeff Mary
9 2 Jess Mary
10 3 James Jim
Here is an option using data.table. Convert to data.table (setDT(dt)), do a cross join (CJ) grouped by 'Group', and remove the duplicated elements:
library(data.table)
setDT(dt)[, CJ(Ind1 = Ind, Ind2 = Ind, unique = TRUE)[Ind1 != Ind2],
Group][!duplicated(data.table(pmax(Ind1, Ind2), pmin(Ind1, Ind2)))]
# Group Ind1 Ind2
#1: 1 Bob Sally
#2: 1 Bob Sue
#3: 1 Sally Sue
#4: 2 Jeff Jess
#5: 2 Jeff Joe
#6: 2 Jeff Mary
#7: 2 Jess Joe
#8: 2 Jess Mary
#9: 2 Joe Mary
#10: 3 James Jim
Or using combn by 'Group'
setDT(dt)[, {temp <- combn(Ind, 2); .(Ind1 = temp[1,], Ind2 = temp[2,])}, Group]
A solution using dplyr. We can use group_by and do to apply the combn function to each group and combine the results to form a data frame.
library(dplyr)
dt2 <- dt %>%
  group_by(Group) %>%
  do(as_data_frame(t(combn(.$Ind, m = 2)))) %>%
  ungroup() %>%
  setNames(sub("V", "Ind", colnames(.)))
dt2
# # A tibble: 10 x 3
# Group Ind1 Ind2
# <int> <chr> <chr>
# 1 1 Sally Bob
# 2 1 Sally Sue
# 3 1 Bob Sue
# 4 2 Joe Jeff
# 5 2 Joe Jess
# 6 2 Joe Mary
# 7 2 Jeff Jess
# 8 2 Jeff Mary
# 9 2 Jess Mary
# 10 3 Jim James
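do() and as_data_frame() still work but are superseded in current dplyr; on dplyr 1.1 or later, roughly the same idea can be written with reframe(), which allows multi-row results per group (a sketch, not the original answer's code):
library(dplyr)
dt %>%
  group_by(Group) %>%
  reframe(as.data.frame(t(combn(Ind, 2)))) %>%  # V1/V2 hold each within-group pair
  rename(Ind1 = V1, Ind2 = V2)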
DATA
dt <- read.table(text = "Ind Group
Sally 1
Bob 1
Sue 1
Joe 2
Jeff 2
Jess 2
Mary 2
Jim 3
James 3",
header = TRUE, stringsAsFactors = FALSE)
