How to count pairs in R with data in one column? - r

I have one column of names of children who have teamed up in class together over multiple projects / activities, like so:
Note: This is ONE column.
Names
Tom,Jack,Meave
Tom,Arial
Arial,Tim,Tom
Neena,Meave
Meave
Tim,Meave
I want to use R so that I can see how many times two children have been paired over the projects they have done:
So:
Pair Counts
Meave,Jack 1
Tom,Jack 1
Meave,none 1
Tom,Arial 2
.
.
.
How do I go about doing this? A tidy-friendly solution would be appreciated.
(Ultimately, I would like to use this data to make a circle-network graph, but that is for another question.)

In Base R:
a <- tcrossprod(table(stack(setNames(strsplit(df$Names,","), rownames(df)))))
a
values
values Arial Jack Meave Neena Tim Tom
Arial 2 0 0 0 1 2
Jack 0 1 1 0 0 1
Meave 0 1 4 1 1 1
Neena 0 0 1 1 0 0
Tim 1 0 1 0 2 1
Tom 2 1 1 0 1 3
You could make the above look like the data you want. eg:
subset(as.data.frame.table(a),
as.character(values) > as.character(values.1) & Freq>0)
values values.1 Freq
5 Tim Arial 1
6 Tom Arial 2
9 Meave Jack 1
12 Tom Jack 1
16 Neena Meave 1
17 Tim Meave 1
18 Tom Meave 1
30 Tom Tim 1
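To see what the one-liner does, it can be unpacked step by step. This is only a sketch on the question's sample data; the intermediate names (members, long, incidence, co_occurrence) are introduced here for illustration and are not part of the original answer.

```r
# Sample data from the question
df <- data.frame(Names = c("Tom,Jack,Meave", "Tom,Arial", "Arial,Tim,Tom",
                           "Neena,Meave", "Meave", "Tim,Meave"),
                 stringsAsFactors = FALSE)

# 1. Split each row into its members and label each group with its row number
members <- setNames(strsplit(df$Names, ","), seq_len(nrow(df)))

# 2. Stack into long form: one row per (name, group) combination
long <- stack(members)            # columns: values (name), ind (group id)

# 3. Cross-tabulate into a name-by-group incidence matrix of 0/1 entries
incidence <- table(long)

# 4. Multiply the incidence matrix by its transpose: entry [i, j] is the number
#    of groups shared by child i and child j (the diagonal counts appearances)
co_occurrence <- tcrossprod(incidence)

co_occurrence["Tom", "Arial"]     # groups containing both Tom and Arial
```

The diagonal holds each child's total number of appearances, which is why the answer filters on unequal names before listing pairs.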
In tidyverse:
library(tidyverse)
df %>%
rownames_to_column()%>%
separate_rows(Names)%>%
table()%>%
crossprod()%>%
as.data.frame.table()%>%
filter(Freq>0 & as.character(Names) > as.character(Names.1))
Names Names.1 Freq
1 Tim Arial 1
2 Tom Arial 2
3 Meave Jack 1
4 Tom Jack 1
5 Neena Meave 1
6 Tim Meave 1
7 Tom Meave 1
8 Tom Tim 1
Data:
df <- structure(list(Names = c("Tom,Jack,Meave", "Tom,Arial", "Arial,Tim,Tom",
"Neena,Meave", "Meave", "Tim,Meave")), class = "data.frame", row.names = c(NA,
-6L))

Here is one tidyverse approach...
If df is
Names
<chr>
Tom,Jack,Meave
Tom,Arial
Arial,Tim,Tom
Neena,Meave
Meave
Tim,Meave
Then
library(tidyverse)
df2 <- df %>%
mutate(ref = row_number(),
Names = ifelse(str_count(Names, ",") == 0, #add nobody if only one
paste0(Names, ",nobody"),
Names),
Names = str_split(Names, ",")) %>%
unnest(Names) %>%
nest(data = ref) %>% #creates a list of refs for each name
mutate(Names2 = list(Names)) %>% #add a column of second names for the pairs
unnest(Names2) %>%
filter(Names != Names2) %>% #remove self-pairs
left_join({.} %>% select(Names2 = Names, data2 = data) %>%
distinct()) %>% #create data for second column of names
mutate(paired = map2_dbl(data, data2, ~length(intersect(.x$ref, .y$ref)))) %>%
select(-data, -data2) %>%
filter(paired > 0, #remove non-occurring combinations
Names > Names2) #remove duplicates
Which gives...
> df2
# A tibble: 9 × 3
Names Names2 paired
<chr> <chr> <dbl>
1 Tom Jack 1
2 Tom Meave 1
3 Tom Arial 2
4 Tom Tim 1
5 Meave Jack 1
6 Tim Meave 1
7 Tim Arial 1
8 Neena Meave 1
9 nobody Meave 1
The code changes the dataframe from a list of names for each value of ref to a list of refs for each name. It then creates a column of other names (i.e. the second of a pair) and left-joins the refs to these other names. Note that the {.} in the left_join refers to the piped dataframe at that point, creating a left join with itself.
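A different way to get the same counts, sketched here in base R rather than taken from either answer, is to enumerate the pairs inside each group with combn and then tabulate them. The names groups, pair_list and pair_counts are illustrative, and the "none" label for a child working alone follows the question's desired output.

```r
# Sample data from the question
df <- data.frame(Names = c("Tom,Jack,Meave", "Tom,Arial", "Arial,Tim,Tom",
                           "Neena,Meave", "Meave", "Tim,Meave"),
                 stringsAsFactors = FALSE)

# Split each row into its members
groups <- strsplit(df$Names, ",")

# For each group, list every unordered pair; a lone child is paired with "none"
pair_list <- lapply(groups, function(g) {
  g <- sort(unique(g))            # sorting makes "Tom,Arial" and "Arial,Tom" identical
  if (length(g) < 2) return(paste(g, "none", sep = ","))
  combn(g, 2, FUN = paste, collapse = ",")
})

# Count how often each pair occurs across all groups
pair_counts <- as.data.frame(table(Pair = unlist(pair_list)),
                             responseName = "Counts",
                             stringsAsFactors = FALSE)
pair_counts
```

Because every group is expanded separately, this scales with the number of pairs per group, which should be fine for small class projects but may be slower than the matrix product for very large groups.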

Related

Restructure binary "multiple response" data to categorical

I want to restructure some "multiple response" survey data from binary to nominal categories.
The survey asks the responder which ten people they most often interact with and gives a list of 50 names. The data comes back with 50 columns, one column for each name, and a name value in each cell for each name selected and blank for unselected names. I want to convert the fifty columns into ten columns (name1 to name10).
Below is an example of what I mean with (for simplicity) 5 names, where the person must select two names with five responders.
id <- 1:5
mike <- c("","mike","","","mike")
tim <- c("tim","","tim","","")
mary <- c("mary","mary","mary","","")
jane <- c("","","","jane","jane")
liz <- c("","","","liz","")
surveyData <- data.frame(id,mike,tim,mary,jane,liz)
Name1 <- c("tim","mike","tim","jane","mike")
Name2 <- c("mary","mary","mary","liz","jane")
restructuredSurveyData <- data.frame(id,Name1,Name2)
Replace the empty strings with NA, then apply na.omit to each row.
cbind(surveyData[1], `colnames<-`(t(apply(replace(surveyData[-1],
surveyData[-1] == '', NA), 1,
na.omit)), paste0('name_', 1:2)))
# id name_1 name_2
# 1 1 tim mary
# 2 2 mike mary
# 3 3 tim mary
# 4 4 jane liz
# 5 5 mike jane
A spoiled eye may like this better these days:
replace(surveyData[-1], surveyData[-1] == '', NA) |>
apply(1, na.omit) |>
t() |>
`colnames<-`(paste0('name_', 1:2)) |>
cbind(surveyData[1]) |>
subset(select=c('id', 'name_1', 'name_2'))
# id name_1 name_2
# 1 1 tim mary
# 2 2 mike mary
# 3 3 tim mary
# 4 4 jane liz
# 5 5 mike jane
Note: the native pipe |> requires R >= 4.1.
Another possible solution, based on tidyverse:
library(tidyverse)
surveyData %>%
pivot_longer(-id) %>%
filter(value != "") %>%
mutate(nam = if_else(row_number() %% 2 == 1, "names1", "names2")) %>%
pivot_wider(id_cols = id, names_from = nam, values_from = value)
#> # A tibble: 5 × 3
#> id names1 names2
#> <int> <chr> <chr>
#> 1 1 tim mary
#> 2 2 mike mary
#> 3 3 tim mary
#> 4 4 jane liz
#> 5 5 mike jane
Or using purrr::pmap_df:
library(tidyverse)
pmap_df(surveyData[-1], ~ str_c(c(...)[c(...) != ""], collapse = ",") %>%
set_names("names")) %>%
separate(names, into = str_c("names", 1:2), sep = ",") %>%
bind_cols(select(surveyData, id), .)
#> id names1 names2
#> 1 1 tim mary
#> 2 2 mike mary
#> 3 3 tim mary
#> 4 4 jane liz
#> 5 5 mike jane
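The answers above rely on every respondent selecting exactly two names. If the number of selections can vary (an assumption, since the real survey fixes it at ten), a base R sketch along these lines pads shorter rows with NA. The names answers, picked, k and restructured are illustrative.

```r
# Sample data from the question
surveyData <- data.frame(
  id   = 1:5,
  mike = c("", "mike", "", "", "mike"),
  tim  = c("tim", "", "tim", "", ""),
  mary = c("mary", "mary", "mary", "", ""),
  jane = c("", "", "", "jane", "jane"),
  liz  = c("", "", "", "liz", ""),
  stringsAsFactors = FALSE
)

# Work on the answer columns as a character matrix
answers <- as.matrix(surveyData[-1])

# Keep only the non-empty selections of each respondent
picked <- lapply(seq_len(nrow(answers)), function(i) {
  row <- answers[i, ]
  unname(row[row != ""])
})

# Pad every respondent to the longest selection with NA and bind into a matrix
k <- max(lengths(picked))
wide <- do.call(rbind, lapply(picked, function(x) { length(x) <- k; x }))
colnames(wide) <- paste0("name_", seq_len(k))

restructured <- data.frame(id = surveyData$id, wide, stringsAsFactors = FALSE)
restructured
```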

In R, take sum of multiple variables if combination of values in two other columns are unique

I am trying to extend the answer to an already solved question, "Take Sum of a Variable if Combination of Values in Two Other Columns are Unique".
Because I am new to Stack Overflow, I can't comment directly on that post, so here is my problem:
I have a dataset like the following but with about 100 columns of binary data as shown in "ani1" and "bni2" columns.
Locations <- c("A","A","A","A","B","B","C","C","D", "D","D")
seasons <- c("2", "2", "3", "4","2","3","1","2","2","4","4")
ani1 <- c(1,1,1,1,0,1,1,1,0,1,0)
bni2 <- c(0,0,1,1,1,1,0,1,0,1,1)
df <- data.frame(Locations, seasons, ani1, bni2)
Locations seasons ani1 bni2
1 A 2 1 0
2 A 2 1 0
3 A 3 1 1
4 A 4 1 1
5 B 2 0 1
6 B 3 1 1
7 C 1 1 0
8 C 2 1 1
9 D 2 0 0
10 D 4 1 1
11 D 4 0 1
I want to sum every column from the third onwards for each unique combination of location and season, so that each combination appears in a single row with its totals.
The complication is that the columns all have different names, and not every column has a 1 for every combination of location and season.
I would like something like this:
Locations seasons ani1 bni2
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
Here is my attempt using a for loop:
df2 <- 0
for(i in 3:length(df)){
testdf <- data.frame(t(apply(df[1:2], 1, sort)), df[i])
df2 <- aggregate(i~., testdf, FUN=sum)
}
I get the following error:
Error in model.frame.default(formula = i ~ ., data = testdf) :
variable lengths differ (found for 'X1')
Thank you!
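You can use dplyr::summarise with across after group_by. The value columns do not share a prefix (ani1, bni2), so select all non-grouping columns with everything():
library(dplyr)
df %>%
  group_by(Locations, seasons) %>%
  summarise(across(everything(), ~sum(.x, na.rm = TRUE))) %>%
  ungroup()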
Another option is to reshape the data to long format using functions from the tidyr package. This avoids the issue of having to select columns 3 onwards.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(Locations, seasons)) %>%
group_by(Locations, seasons, name) %>%
summarise(Sum = sum(value, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = "name", values_from = "Sum")
Result:
# A tibble: 9 x 4
Locations seasons ani1 bni2
<chr> <chr> <dbl> <dbl>
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
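The loop in the question most likely fails because the i in the formula i ~ . is looked up as a variable (the loop index, a single number) rather than as column i. A base R sketch that stays close to the asker's aggregate attempt is to let the dot on the left-hand side stand for all remaining columns. The name totals is illustrative.

```r
# Sample data from the question
df <- data.frame(
  Locations = c("A", "A", "A", "A", "B", "B", "C", "C", "D", "D", "D"),
  seasons   = c("2", "2", "3", "4", "2", "3", "1", "2", "2", "4", "4"),
  ani1      = c(1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0),
  bni2      = c(0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# "." on the left means "every column not named on the right",
# so this sums ani1, bni2 and any further value columns in one call
totals <- aggregate(. ~ Locations + seasons, data = df, FUN = sum)

# aggregate() orders by the last grouping variable first; reorder for readability
totals <- totals[order(totals$Locations, totals$seasons), ]
totals
```

Note that the formula interface drops rows containing NA by default, so with missing values the by-based form aggregate(df[-(1:2)], by = df[1:2], FUN = sum, na.rm = TRUE) may be the safer choice.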

Counting occurrences and occurrences which do not appear

I have a dataframe which looks like this:
head(df)
id id_child
1 1
1 2
1 3
2 1
4 1
4 2
I would like to create a variable which counts the number of children per parent. So I would like something like this:
head(nb_chilren)
id id_child
1 3
2 1
3 0
4 2
If possible, I would like person 3 to be shown as having 0 children, even though she does not appear in the first data frame.
Note: ids are sequential, in real data they are 1 to 10628.
Any suggestions? I suppose I must use the split() function, but I really do not know how to use it.
Convert id to factor with levels from minimum id value to maximum.
df$id <- factor(df$id, levels = min(df$id):max(df$id))
You can then use table in base R :
stack(table(df$id))[2:1]
Or count in dplyr :
library(dplyr)
df %>% count(id, .drop = FALSE)
# id n
#1 1 3
#2 2 1
#3 3 0
#4 4 2
One dplyr option could be:
df %>%
group_by(id = factor(id, levels = min(id):max(id)), .drop = FALSE) %>%
summarise(id_child = n_distinct(id_child))
id id_child
<fct> <int>
1 1 3
2 2 1
3 3 0
4 4 2
Here is a solution with table
table(factor(df[[1]], levels = Reduce(':', range(df[[1]]))))
#1 2 3 4
#3 1 0 2
In data.frame format:
tbl <- table(id = factor(df[[1]], levels = Reduce(':', range(df[[1]]))))
as.data.frame(tbl)
# id Freq
#1 1 3
#2 2 1
#3 3 0
#4 4 2
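Since the ids are stated to be sequential positive integers, another sketch (not from the answers above) is base R's tabulate, which counts occurrences of 1, 2, ..., nbins and reports 0 for values that never appear. The sample data frame is reconstructed from the question's head(df) output, and nb_children is an illustrative name.

```r
# Sample data reconstructed from the question
df <- data.frame(id       = c(1, 1, 1, 2, 4, 4),
                 id_child = c(1, 2, 3, 1, 1, 2))

# Highest parent id; with the real data this would be 10628
n_ids <- max(df$id)

# tabulate() counts rows per id from 1 to n_ids, including ids with no rows
nb_children <- data.frame(id       = seq_len(n_ids),
                          id_child = tabulate(df$id, nbins = n_ids))
nb_children
```

This counts rows per parent, so it assumes each child appears at most once per parent; if duplicates are possible, deduplicate with unique(df) first.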

How to sort the values of each obs of a data.frame? [duplicate]

I have this data.set
people <- c("Arthur", "Jean", "Paul", "Fred", "Gary")
question1 <- c(1, 3, 2, 2, 5)
question2 <- c(1, 0, 1, 0, 3)
question3<- c(1, 0, 2, 2, 4)
question4 <- c(1, 5, 2, 1, 5)
test <- data.frame(people, question1, question2, question3, question4)
test
Here is my output :
people question1 question2 question3 question4
1 Arthur 1 1 1 1
2 Jean 3 0 0 5
3 Paul 2 1 2 2
4 Fred 2 0 2 1
5 Gary 5 3 4 5
I want to sort each person's results in descending order from left to right and store them in a new data.frame. The names of the new columns can be letters or anything else.
people A B C D
1 Arthur 1 1 1 1
2 Jean 5 3 0 0
3 Paul 2 2 2 1
4 Fred 2 2 1 0
5 Gary 5 5 4 3
With base R, apply the function sort to the rows in question, but be careful: apply returns the transpose, so wrap the result in t():
test[-1] <- t(apply(test[-1], 1, sort, decreasing = TRUE))
test
# people question1 question2 question3 question4
#1 Arthur 1 1 1 1
#2 Jean 5 3 0 0
#3 Paul 2 2 2 1
#4 Fred 2 2 1 0
#5 Gary 5 5 4 3
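The base R line above overwrites the question columns in place. If the result should instead go into a new data.frame with letter column names, as the question asks, a sketch could look like this (sorted and result are illustrative names):

```r
# Sample data from the question
test <- data.frame(
  people    = c("Arthur", "Jean", "Paul", "Fred", "Gary"),
  question1 = c(1, 3, 2, 2, 5),
  question2 = c(1, 0, 1, 0, 3),
  question3 = c(1, 0, 2, 2, 4),
  question4 = c(1, 5, 2, 1, 5),
  stringsAsFactors = FALSE
)

# Sort each row in descending order; apply() returns one column per row,
# so transpose to get one row per person again
sorted <- t(apply(test[-1], 1, sort, decreasing = TRUE))

# Name the new columns A, B, C, ... and attach them to the people column
colnames(sorted) <- LETTERS[seq_len(ncol(sorted))]
result <- data.frame(people = test$people, sorted, stringsAsFactors = FALSE)
result
```

Unlike the tidyverse versions, this keeps the original row order of people.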
Solution using tidyverse (i.e. dplyr and tidyr):
library(tidyverse)
test %>%
pivot_longer(cols=-people, names_to="variable",values_to = "values") %>%
arrange(people, -values) %>%
select(people, values) %>%
mutate(new_names = rep(letters[1:4], length(unique(test$people)))) %>%
pivot_wider(names_from = new_names,
values_from = values)
This returns:
# A tibble: 5 x 5
people a b c d
<fct> <dbl> <dbl> <dbl> <dbl>
1 Arthur 1 1 1 1
2 Fred 2 2 1 0
3 Gary 5 5 4 3
4 Jean 5 3 0 0
5 Paul 2 2 2 1
Explanation:
Bring the data into 'long' form so it can be ordered on the values of all the question columns.
Order (arrange) by people and by descending values (-values).
Drop the unused column named variable.
Create a new column holding the new names (a to d, repeated for each person).
Bring the data back into 'wide' form, creating the new columns from the new names.
One dplyr and tidyr option could be:
test %>%
pivot_longer(-people) %>%
group_by(people) %>%
arrange(desc(value), .by_group = TRUE) %>%
mutate(name = LETTERS[1:n()]) %>%
pivot_wider(names_from = "name", values_from = "value")
people A B C D
<fct> <dbl> <dbl> <dbl> <dbl>
1 Arthur 1 1 1 1
2 Fred 2 2 1 0
3 Gary 5 5 4 3
4 Jean 5 3 0 0
5 Paul 2 2 2 1

Find the number of times a unique value is appeared in more than one files and the number of those files

I have those 3 dataframes below:
Name<-c("jack","jack","bob","david","mary")
n1<-data.frame(Name)
Name<-c("jack","bill","dean","mary","steven")
n2<-data.frame(Name)
Name<-c("fred","alex","mary")
n3<-data.frame(Name)
I would like to create a new dataframe with 3 columns: all unique names present across the 3 source files in column 1, the number of source files in which each name appears in column 2, and the total number of instances of that name across all files in column 3.
The result should be like
Name Number_of_files Number_of_instances
1 jack 2 3
2 bob 1 1
3 david 1 1
4 mary 3 3
5 bill 1 1
6 dean 1 1
7 steven 1 1
8 fred 1 1
9 alex 1 1
Is there an automated way to achieve all these at once?
One dplyr possibility could be:
library(dplyr)
bind_rows(n1, n2, n3, .id = "ID") %>%
group_by(Name) %>%
summarise(Number_of_files = n_distinct(ID),
Number_of_instances = n())
Name Number_of_files Number_of_instances
<chr> <int> <int>
1 alex 1 1
2 bill 1 1
3 bob 1 1
4 david 1 1
5 dean 1 1
6 fred 1 1
7 jack 2 3
8 mary 3 3
9 steven 1 1
This is conceptually similar to @tmfmnk's answer, but in base R:
#Get the names of all objects called n1, n2, n3, n4, etc.
name_df <- ls(pattern = "^n\\d+$")
#Combine them in one dataframe
all_df <- do.call(rbind, Map(cbind, mget(name_df), id = name_df))
#get aggregated values
aggregate(id~Name, all_df, function(x) c(length(unique(x)), length(x)))
# Name id.1 id.2
#1 bob 1 1
#2 david 1 1
#3 jack 2 3
#4 mary 3 3
#5 bill 1 1
#6 dean 1 1
#7 steven 1 1
#8 alex 1 1
#9 fred 1 1
You can rename the columns if needed.
And for completeness data.table version
library(data.table)
dt <- rbindlist(mget(name_df), idcol = "ID")
dt[, list(Number_of_files = uniqueN(ID), Number_of_instances = .N), by = .(Name)]
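The base R answer leaves its two counts in a single matrix column (id.1, id.2). As a sketch of one way to get properly named columns directly, the names can be cross-tabulated against their source file. The names files, long, tab and result are illustrative, and the row order is alphabetical rather than the order shown in the question.

```r
# Sample data from the question
n1 <- data.frame(Name = c("jack", "jack", "bob", "david", "mary"),
                 stringsAsFactors = FALSE)
n2 <- data.frame(Name = c("jack", "bill", "dean", "mary", "steven"),
                 stringsAsFactors = FALSE)
n3 <- data.frame(Name = c("fred", "alex", "mary"),
                 stringsAsFactors = FALSE)

# One named character vector per source, stacked into long form
files <- list(n1 = n1$Name, n2 = n2$Name, n3 = n3$Name)
long  <- stack(files)                 # columns: values (name), ind (source)

# Name-by-source table of occurrence counts
tab <- table(long$values, long$ind)

result <- data.frame(
  Name                = rownames(tab),
  Number_of_files     = rowSums(tab > 0),   # sources with at least one hit
  Number_of_instances = rowSums(tab),       # total hits across sources
  row.names = NULL,
  stringsAsFactors = FALSE
)
result
```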
