Find recurrencies in pairs distributed in 2 columns of a data.frame


Suppose we need to find the frequency of each pair of names.
E.g. Mark-Maria appears three times, John-Xesca twice, and the rest once:
Name1 Name2
Mark Maria
John Xesca
Steve Rose
Mark Maria
John John
Mark Maria
John Xesca
What is the best way to do this? Note that the frequencies are for both elements taken together as a pair. I think this is more complex than expected... Thanks in advance.

If you need to take the order of Name1 and Name2 into account:
subset(as.data.frame(table(df)), Freq > 0)
# Name1 Name2 Freq
# 1 John John 1
# 5 Mark Maria 3
# 9 Steve Rose 1
# 10 John Xesca 2

We loop through the rows of the dataset, sort each row and paste its elements together, then get the frequencies with table. This counts a pair the same way regardless of which column each name is in:
table(apply(df1, 1, function(x) paste(sort(x), collapse='-')))
# John-John John-Xesca Maria-Mark Rose-Steve
# 1 2 3 1
data
df1 <- structure(list(Name1 = c("Mark", "John", "Steve", "Mark", "John",
"Mark", "John"), Name2 = c("Maria", "Xesca", "Rose", "Maria",
"John", "Maria", "Xesca")), class = "data.frame", row.names = c(NA,
-7L))
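
The row-wise apply above can be slow on large data. As a rough sketch of a vectorised alternative (not part of the original answer), pmin() and pmax() can put each pair into a canonical order before counting. The names first, second and pair_counts are illustrative only, and the ordering of the two names follows the current locale's collation.

```r
# Same example data as above
df1 <- data.frame(
  Name1 = c("Mark", "John", "Steve", "Mark", "John", "Mark", "John"),
  Name2 = c("Maria", "Xesca", "Rose", "Maria", "John", "Maria", "Xesca"),
  stringsAsFactors = FALSE
)

# pmin()/pmax() work element-wise on character vectors, so each pair is
# ordered without looping over rows
first  <- pmin(df1$Name1, df1$Name2)
second <- pmax(df1$Name1, df1$Name2)

# Count the order-independent pairs; columns are `pair` and `Freq`
pair_counts <- as.data.frame(table(pair = paste(first, second, sep = "-")),
                             stringsAsFactors = FALSE)
pair_counts
```

Using a separator such as "-" also avoids ambiguous keys that could arise when two different pairs happen to paste to the same string.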

Actually you don't even need to paste, just group:
dat %>%
group_by(Name1, Name2) %>%
count()
# # A tibble: 4 x 3
# # Groups: Name1, Name2 [4]
# Name1 Name2 n
# <fct> <fct> <int>
# 1 John John 1
# 2 John Xesca 2
# 3 Mark Maria 3
# 4 Steve Rose 1
You can paste0 together the columns then count with dplyr:
library(dplyr)
dat %>%
mutate(pasted = paste0(Name1,Name2)) %>%
group_by(pasted) %>%
count()
# # A tibble: 4 x 2
# # Groups: pasted [4]
# pasted n
# <chr> <int>
# 1 JohnJohn 1
# 2 JohnXesca 2
# 3 MarkMaria 3
# 4 SteveRose 1
Note that JohnXesca will be treated as different from XescaJohn.
Data:
tt <- "Name1 Name2
Mark Maria
John Xesca
Steve Rose
Mark Maria
John John
Mark Maria
John Xesca"
dat <- read.table(text=tt, header = T)

Related

How to compare value of two columns then create a third/new data frame depending on the result

I have two columns, one name column and then a text column. I want to find the linking matches between the name columns that share the same text in the text column and create a third column with that name (the result could also just be a new data frame). Also, if you could tell me what this type of transform would be called - it might be easier for me to search for similar examples (or other users). I have tried to do some self joins and filtering and have not had much luck.
Name <- c("John","Sally","Alex", "Sarah", "Joe", "Sue")
Status <- c('A', 'B', 'A', 'B', 'C', 'A')
df <- data.frame(Name,Status)
What I have:
Name Status
John A
Sally B
Alex A
Sarah B
Joe C
Sue A
This is the result I want:
Column 1 Column 2 Column 3
John A Alex
John A Sue
Sally B Sarah
Joe C

Network analysis might be relevant here: using the igraph package we can find the membership.
For example, we can see "John", "Sue" and "Alex" are all connected to each other via status "A".
library(igraph)
g <- graph_from_data_frame(df)
plot(g)
components(g)$membership
# John Sally Alex Sarah Joe Sue A B C
# 1 2 1 2 3 1 1 2 3

Try a dplyr full-join:
library(dplyr)
full_join(df, df, by = "Status") %>%
group_by(Status) %>%
mutate(Name.y = if_else(Name.x == Name.y & n() == 1, Name.y[NA], Name.y)) %>%
filter(is.na(Name.y) | Name.x < Name.y) %>%
ungroup()
# # A tibble: 5 × 3
# Name.x Status Name.y
# <chr> <chr> <chr>
# 1 John A Sue
# 2 Sally B Sarah
# 3 Alex A John
# 4 Alex A Sue
# 5 Joe C NA

A data.table approach:
library(data.table)
setDT(df)
# split based on duplicates in the Status-column
dt1 <- df[!duplicated(Status), ]
dt2 <- df[duplicated(Status), ]
# join
final <- dt2[dt1, on = .(Status)]
# the following lines are for column-order and -naming only
setcolorder(final, c(3,2,1))
setnames(final, c("col1", "col2", "col3"))
col1 col2 col3
1: John A Alex
2: John A Sue
3: Sally B Sarah
4: Joe C <NA>

1) dplyr For each Status we create 3 columns consisting of the Name and Status columns of the first row and Name column of the other rows or NA if no other rows.
library(dplyr) # version 1.1.0 or later
df %>%
reframe(col1 = first(Name),
col2 = first(Status),
col3 = if (n() == 1) NA else tail(Name, -1), .by = Status) %>%
select(-Status)
giving:
col1 col2 col3
1 John A Alex
2 John A Sue
3 Sally B Sarah
4 Joe C <NA>
2) base R This uses base R to perform a left join:
merge(subset(df, !duplicated(Status)),
subset(df, duplicated(Status)), all.x = TRUE, by = "Status")
giving
Status Name.x Name.y
1 A John Alex
2 A John Sue
3 B Sally Sarah
4 C Joe <NA>

Restructure binary "multiple response" data to categorical

I want to restructure some "multiple response" survey data from binary to nominal categories.
The survey asks the responder which ten people they most often interact with and gives a list of 50 names. The data comes back with 50 columns, one column for each name, and a name value in each cell for each name selected and blank for unselected names. I want to convert the fifty columns into ten columns (name1 to name10).
Below is an example of what I mean with (for simplicity) 5 names, where the person must select two names with five responders.
id <- 1:5
mike <- c("","mike","","","mike")
tim <- c("tim","","tim","","")
mary <- c("mary","mary","mary","","")
jane <- c("","","","jane","jane")
liz <- c("","","","liz","")
surveyData <- data.frame(id,mike,tim,mary,jane,liz)
Name1 <- c("tim","mike","tim","jane","mike")
Name2 <- c("mary","mary","mary","liz","jane")
restructuredSurveyData <- data.frame(id,Name1,Name2)

replace '' with NA and apply na.omit.
cbind(surveyData[1], `colnames<-`(t(apply(replace(surveyData[-1],
surveyData[-1] == '', NA), 1,
na.omit)), paste0('name_', 1:2)))
# id name_1 name_2
# 1 1 tim mary
# 2 2 mike mary
# 3 3 tim mary
# 4 4 jane liz
# 5 5 mike jane
A spoiled eye may like this better these days:
replace(surveyData[-1], surveyData[-1] == '', NA) |>
apply(1, na.omit) |>
t() |>
`colnames<-`(paste0('name_', 1:2)) |>
cbind(surveyData[1]) |>
subset(select=c('id', 'name_1', 'name_2'))
# id name_1 name_2
# 1 1 tim mary
# 2 2 mike mary
# 3 3 tim mary
# 4 4 jane liz
# 5 5 mike jane
Note: R >= 4.1 used.

Another possible solution, based on tidyverse:
library(tidyverse)
surveyData %>%
pivot_longer(-id) %>%
filter(value != "") %>%
mutate(nam = if_else(row_number() %% 2 == 1, "names1", "names2")) %>%
pivot_wider(id_cols = id, names_from = nam) # recent tidyr versions expect id_cols to be named
#> # A tibble: 5 × 3
#> id names1 names2
#> <int> <chr> <chr>
#> 1 1 tim mary
#> 2 2 mike mary
#> 3 3 tim mary
#> 4 4 jane liz
#> 5 5 mike jane
Or using purrr::pmap_df:
library(tidyverse)
pmap_df(surveyData[-1], ~ str_c(c(...)[c(...) != ""], collapse = ",") %>%
set_names("names")) %>%
separate(names, into = str_c("names", 1:2), sep = ",") %>%
bind_cols(select(surveyData, id), .)
#> id names1 names2
#> 1 1 tim mary
#> 2 2 mike mary
#> 3 3 tim mary
#> 4 4 jane liz
#> 5 5 mike jane

Find rows that are identical in one column but not another

There should be a fairly simple solution to this but it's giving me trouble. I have a DF similar to this:
> df <- data.frame(name = c("george", "george", "george", "sara", "sara", "sam", "bill", "bill"),
id_num = c(1, 1, 2, 3, 3, 4, 5, 5))
> df
name id_num
1 george 1
2 george 1
3 george 2
4 sara 3
5 sara 3
6 sam 4
7 bill 5
8 bill 5
I'm looking for a way to find rows where the name and ID numbers are inconsistent in a very large dataset. I.e., George should always be "1" but in row three there is a mistake and he has also been assigned ID number "2".

I think the easiest way is to use dplyr::count twice. For your example (note that the ID column is called id_num):
library(dplyr)
df %>%
count(name, id_num) %>%
count(name)
The first count will give (count sorts the result by the grouping columns):
name id_num n
bill 5 2
george 1 2
george 2 1
sam 4 1
sara 3 2
Then the second count will give:
name n
bill 1
george 2
sam 1
sara 1
Of course, you could add filter(n > 1) to the end of your pipe, too, or arrange(desc(n)):
df %>%
count(name, id_num) %>%
count(name) %>%
arrange(desc(n)) %>%
filter(n > 1)

Use tapply() to calculate the number of distinct IDs per name, then subset for counts greater than 1.
res <- with(df, tapply(id_num, list(name), \(x) length(unique(x))))
res[res > 1]
# george
# 2
You probably want to correct this. A safe way is to rebuild the numeric IDs using as.factor(),
df$id_new <- as.integer(as.factor(df$name))
df
# name id_num id_new
# 1 george 1 2
# 2 george 1 2
# 3 george 2 2
# 4 sara 3 4
# 5 sara 3 4
# 6 sam 4 3
# 7 bill 5 1
# 8 bill 5 1
where numbers are assigned according to the names in alphabetical order, or using factor() with explicit levels, which numbers the names in order of appearance:
df$id_new2 <- as.integer(factor(df$name, levels=unique(df$name)))
df
# name id_num id_new id_new2
# 1 george 1 2 1
# 2 george 1 2 1
# 3 george 2 2 1
# 4 sara 3 4 2
# 5 sara 3 4 2
# 6 sam 4 3 3
# 7 bill 5 1 4
# 8 bill 5 1 4
Note: R >= 4.1 used.
Data:
df <- structure(list(name = c("george", "george", "george", "sara",
"sara", "sam", "bill", "bill"), id_num = c(1, 1, 2, 3, 3, 4,
5, 5)), class = "data.frame", row.names = c(NA, -8L))
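
The question asks for the offending rows rather than only the names. As a minimal base R sketch (an addition, not from the answers above), ave() can attach the per-name count of distinct IDs to every row, which then serves as a filter. The names n_ids and inconsistent are illustrative.

```r
df <- data.frame(
  name   = c("george", "george", "george", "sara", "sara", "sam", "bill", "bill"),
  id_num = c(1, 1, 2, 3, 3, 4, 5, 5)
)

# For each row, the number of distinct id_num values seen for that row's name
n_ids <- ave(df$id_num, df$name, FUN = function(x) length(unique(x)))

# Keep every row belonging to a name with more than one ID
inconsistent <- df[n_ids > 1, ]
inconsistent
```

This returns all rows for an affected name, since the data alone cannot tell which of the conflicting IDs is the mistaken one.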

Using dplyr, count non-numeric grades in each class

Given the input and code below, using dplyr and groups, how can I produce the results shown in the output? I know how to sum columns in groups using dplyr, but in this case I need to count how many of each non-numeric grade occurred in each class.
**INPUT**
Class Student Grade
1 Jack C
1 Mary B
1 Mo B
1 Jane A
1 Tom C
2 Don C
2 Betsy B
2 Sue C
2 Tayna B
2 Kim C
**CODE**
# Create the dataframe
Class <- c(1,1,1,1,1,2,2,2,2,2)
Name <- c("Jack", "Mary", "Mo", "Jane", "Tom", "Don", "Betsy", "Sue", "Tayna", "Kim")
Grade <- c("C","B","B","A","C","C","B","C","B","C")
StudentGrades <- data.frame(Class, Name, Grade)
**OUTPUT**
Class Grade-A Grade-B Grade-C
1 1 2 2
2 0 2 3

We can use count to get the frequency count and then with pivot_wider change from 'long' to 'wide' format
library(dplyr)
library(tidyr)
library(stringr)
StudentGrades %>%
count(Class, Grade = str_c('Grade_', Grade)) %>%
pivot_wider(names_from = Grade, values_from = n, values_fill = list(n = 0))
# A tibble: 2 x 4
# Class Grade_A Grade_B Grade_C
# <dbl> <int> <int> <int>
#1 1 1 2 2
#2 2 0 2 3
Or in base R
table(StudentGrades[c('Class', 'Grade')])
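
The table() call returns a contingency table rather than a data frame. If a data frame shaped like the requested output is needed, one possible sketch (an assumption about what the asker wants, with counts and wide as illustrative names) converts it with as.data.frame.matrix():

```r
StudentGrades <- data.frame(
  Class = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Name  = c("Jack", "Mary", "Mo", "Jane", "Tom", "Don", "Betsy", "Sue", "Tayna", "Kim"),
  Grade = c("C", "B", "B", "A", "C", "C", "B", "C", "B", "C")
)

# Cross-tabulate Class by Grade; combinations that never occur get a 0
counts <- table(StudentGrades[c("Class", "Grade")])

# Turn the 2-d table into a wide data frame, one column per grade
wide <- as.data.frame.matrix(counts)
names(wide) <- paste0("Grade-", names(wide))

# The classes end up as row names, so copy them into a proper column
wide <- cbind(Class = as.numeric(rownames(wide)), wide)
wide
```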

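Here is a base R solution using table() + split(). Grade is converted to a factor first so that every class reports all grade levels, including zero counts (since R 4.0, data.frame() keeps strings as character by default):
StudentGrades$Grade <- factor(StudentGrades$Grade)
dfout <- do.call(rbind,lapply(split(StudentGrades,StudentGrades$Class),
function(v) c(unique(v[1]),table(v$Grade))))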
such that
> dfout
Class A B C
1 1 1 2 2
2 2 0 2 3

Summarizing dataframe based on multiple columns

I'm having some trouble figuring this one out. Say, I have a table like this:
Name Activity Day
1 John cycle 1
2 John work 1
3 Tina work 1
4 Monika work 1
5 Tina swim 1
6 Tina jogging 2
7 John work 2
8 Tina work 2
I want to summarize it in a way that the activity of each individual is grouped according to the day.
It should look like this:
Name Activity Day
1 John cycle;work 1
2 Tina work;swim 1
3 Monika work 1
4 Tina jogging;work 2
5 John work 2
I am thinking that dplyr package would be the answer here, but I don't know how to do it. Any help?
Thanks!

try:
library(dplyr)
dat <- tribble(~"Name", ~"Activity", ~"Day",
"John", "cycle", 1,
"John", "work" , 1,
"Tina", "work", 1,
"Monika", "work", 1,
"Tina", "swim", 1,
"Tina", "jogging", 2,
"John", "work", 2,
"Tina", "work", 2)
dat %>%
group_by(Name, Day) %>%
summarise(activity = paste(Activity, collapse = "; "))
# A tibble: 5 x 3
# Groups: Name [3]
Name Day activity
<chr> <dbl> <chr>
1 John 1 cycle; work
2 John 2 work
3 Monika 1 work
4 Tina 1 work; swim
5 Tina 2 jogging; work

An option with data.table
library(data.table)
setDT(dat)[, .(Activity = toString(Activity)), .(Name, Day)]

You can use the aggregate function, for example:
> aggregate(dat$Activity,list(dat$Name,dat$Day),as.character)
Group.1 Group.2 x
1 John 1 cycle, work
2 Monika 1 work
3 Tina 1 work, swim
4 John 2 work
5 Tina 2 jogging, work
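
To get a single semicolon-separated string per group, as in the requested output, a variant using aggregate()'s formula interface might look like the sketch below. It is an addition rather than part of the answer above, and it rebuilds the example data as a plain data.frame so that it runs on its own.

```r
dat <- data.frame(
  Name     = c("John", "John", "Tina", "Monika", "Tina", "Tina", "John", "Tina"),
  Activity = c("cycle", "work", "work", "work", "swim", "jogging", "work", "work"),
  Day      = c(1, 1, 1, 1, 1, 2, 2, 2)
)

# Extra arguments after FUN are passed on to it, so collapse = ";" reaches
# paste() and each Name/Day group is reduced to one string
res <- aggregate(Activity ~ Name + Day, data = dat, FUN = paste, collapse = ";")
res
```

Within each group the activities keep the order in which they appear in the input rows.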
