Find rows that are identical in one column but not another

There should be a fairly simple solution to this but it's giving me trouble. I have a DF similar to this:
> df <- data.frame(name = c("george", "george", "george", "sara", "sara", "sam", "bill", "bill"),
id_num = c(1, 1, 2, 3, 3, 4, 5, 5))
> df
name id_num
1 george 1
2 george 1
3 george 2
4 sara 3
5 sara 3
6 sam 4
7 bill 5
8 bill 5
I'm looking for a way to find rows where the name and ID numbers are inconsistent in a very large dataset. I.e., George should always be "1" but in row three there is a mistake and he has also been assigned ID number "2".

I think the easiest way is to use dplyr::count twice. For your example:
library(dplyr)
df %>%
count(name, id_num) %>%
count(name)
The first count will give:
name id_num n
george 1 2
george 2 1
sara 3 2
sam 4 1
bill 5 2
Then the second count will give:
name n
george 2
sara 1
sam 1
bill 1
Of course, you could add filter(n > 1) to the end of your pipe, too, or arrange(desc(n))
df %>%
count(name, id_num) %>%
count(name) %>%
arrange(desc(n)) %>%
filter(n > 1)
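If the goal is to pull out the offending rows themselves rather than a per-name tally, a minimal sketch could filter on the number of distinct IDs within each name. It assumes the column names from the question's df; the object name inconsistent is only illustrative.

```r
library(dplyr)

df <- data.frame(
  name   = c("george", "george", "george", "sara", "sara", "sam", "bill", "bill"),
  id_num = c(1, 1, 2, 3, 3, 4, 5, 5)
)

# Keep every row whose name maps to more than one distinct id_num
inconsistent <- df %>%
  group_by(name) %>%
  filter(n_distinct(id_num) > 1) %>%
  ungroup()

inconsistent
```

For the example data this keeps the three george rows, so the conflicting IDs can be inspected side by side.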

Use tapply() to count the distinct IDs per name, then subset for counts greater than 1.
res <- with(df, tapply(id_num, list(name), \(x) length(unique(x))))
res[res > 1]
# george
# 2
You probably want to correct this. A safe way is to rebuild the numeric IDs from the names using as.factor(),
df$id_new <- as.integer(as.factor(df$name))
df
# name id_num id_new
# 1 george 1 2
# 2 george 1 2
# 3 george 2 2
# 4 sara 3 4
# 5 sara 3 4
# 6 sam 4 3
# 7 bill 5 1
# 8 bill 5 1
where numbers are assigned according to the names in alphabetical order, or factor(), reading in the levels in order of appearance.
df$id_new2 <- as.integer(factor(df$name, levels=unique(df$name)))
df
# name id_num id_new id_new2
# 1 george 1 2 1
# 2 george 1 2 1
# 3 george 2 2 1
# 4 sara 3 4 2
# 5 sara 3 4 2
# 6 sam 4 3 3
# 7 bill 5 1 4
# 8 bill 5 1 4
Note: R >= 4.1 used.
Data:
df <- structure(list(name = c("george", "george", "george", "sara",
"sara", "sam", "bill", "bill"), id_num = c(1, 1, 2, 3, 3, 4,
5, 5)), class = "data.frame", row.names = c(NA, -8L))

Related

R Group By and Sum to Ignore NA

I have a data frame like this:
library(dplyr)
name <- c("Bob", "Bob", "Bob", "Bob", "John", "John", "John")
count <- c(2, 3, 4, 5, 2, 3, 4)
score <- c(5, NA, NA, NA, 3, 4, 2)
my_df <- data.frame(cbind(name, count, score)) %>%
mutate(count = as.numeric(count),
score = as.numeric(score))
my_df
name count score
1 Bob 2 5
2 Bob 3 NA
3 Bob 4 NA
4 Bob 5 NA
5 John 2 3
6 John 3 4
7 John 4 2
Then I create another column by taking the product between count and score:
my_df %>%
mutate(product = count*score)
name count score product
1 Bob 2 5 10
2 Bob 3 NA NA
3 Bob 4 NA NA
4 Bob 5 NA NA
5 John 2 3 6
6 John 3 4 12
7 John 4 2 8
I want to group by name and aggregate for the sum(product)/sum(count) but I want the sum of product column to ignore any NA values in the sum (I did this below) AND I want any associated count values to be ignored in the summation. This is my current solution, but it is not right. Bob's result is calculated as 10/(2+3+4+5) = 0.71 but I want Bob's result to be 10/2 = 5.
my_df %>%
mutate(product = count*score) %>%
group_by(name) %>%
summarize(result = sum(product, na.rm = TRUE)/sum(count))
name result
<chr> <dbl>
1 Bob 0.714
2 John 2.89

We may need to subset the count by the non-NA values in 'product'
library(dplyr)
my_df %>%
mutate(product = count*score) %>%
group_by(name) %>%
summarise(result = sum(product, na.rm = TRUE)/sum(count[!is.na(product)]))
Output:
# A tibble: 2 × 2
name result
<chr> <dbl>
1 Bob 5
2 John 2.89
Or do a filter before the grouping
my_df %>%
filter(complete.cases(score)) %>%
group_by(name) %>%
summarise(result = sum(score * count)/sum(count))
# A tibble: 2 × 2
name result
<chr> <dbl>
1 Bob 5
2 John 2.89
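Because the requested quantity is a count-weighted mean of score, base R's weighted.mean() is another possible route: with na.rm = TRUE it drops missing scores together with their weights. This is a different technique from the answers above, sketched here on the assumption that my_df has the same columns as in the question; the name res is illustrative.

```r
my_df <- data.frame(
  name  = c("Bob", "Bob", "Bob", "Bob", "John", "John", "John"),
  count = c(2, 3, 4, 5, 2, 3, 4),
  score = c(5, NA, NA, NA, 3, 4, 2)
)

# One weighted mean per name; NA scores and their counts are excluded
res <- sapply(
  split(my_df, my_df$name),
  function(d) weighted.mean(d$score, d$count, na.rm = TRUE)
)
res
```

Bob comes out as 10/2 = 5 and John as 26/9, matching the dplyr results.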

Restructure binary "multiple response" data to categorical

I want to restructure some "multiple response" survey data from binary to nominal categories.
The survey asks the responder which ten people they most often interact with and gives a list of 50 names. The data comes back with 50 columns, one column for each name, and a name value in each cell for each name selected and blank for unselected names. I want to convert the fifty columns into ten columns (name1 to name10).
Below is an example of what I mean with (for simplicity) 5 names, where the person must select two names with five responders.
id <- 1:5
mike <- c("","mike","","","mike")
tim <- c("tim","","tim","","")
mary <- c("mary","mary","mary","","")
jane <- c("","","","jane","jane")
liz <- c("","","","liz","")
surveyData <- data.frame(id,mike,tim,mary,jane,liz)
Name1 <- c("tim","mike","tim","jane","mike")
Name2 <- c("mary","mary","mary","liz","jane")
restructuredSurveyData <- data.frame(id,Name1,Name2)

Replace '' with NA and apply na.omit to each row.
cbind(surveyData[1], `colnames<-`(t(apply(replace(surveyData[-1],
surveyData[-1] == '', NA), 1,
na.omit)), paste0('name_', 1:2)))
# id name_1 name_2
# 1 1 tim mary
# 2 2 mike mary
# 3 3 tim mary
# 4 4 jane liz
# 5 5 mike jane
A spoiled eye may like this better these days:
replace(surveyData[-1], surveyData[-1] == '', NA) |>
apply(1, na.omit) |>
t() |>
`colnames<-`(paste0('name_', 1:2)) |>
cbind(surveyData[1]) |>
subset(select=c('id', 'name_1', 'name_2'))
# id name_1 name_2
# 1 1 tim mary
# 2 2 mike mary
# 3 3 tim mary
# 4 4 jane liz
# 5 5 mike jane
Note: R >= 4.1 used.

Another possible solution, based on tidyverse:
library(tidyverse)
surveyData %>%
pivot_longer(-id) %>%
filter(value != "") %>%
mutate(nam = if_else(row_number() %% 2 == 1, "names1", "names2")) %>%
pivot_wider(id_cols = id, names_from = nam)
#> # A tibble: 5 × 3
#> id names1 names2
#> <int> <chr> <chr>
#> 1 1 tim mary
#> 2 2 mike mary
#> 3 3 tim mary
#> 4 4 jane liz
#> 5 5 mike jane
Or using purrr::pmap_df:
library(tidyverse)
pmap_df(surveyData[-1], ~ str_c(c(...)[c(...) != ""], collapse = ",") %>%
set_names("names")) %>%
separate(names, into = str_c("names", 1:2), sep = ",") %>%
bind_cols(select(surveyData, id), .)
#> id names1 names2
#> 1 1 tim mary
#> 2 2 mike mary
#> 3 3 tim mary
#> 4 4 jane liz
#> 5 5 mike jane

Combine groups within dataframe if they share at least one common item

In my dataframe, I have a 'Groups' column and a 'Person' column. I want to join groups together if they share at least one common person. Consider the following example data:
Group Person
1 David
1 Sarah
1 John
2 Brian
2 Andrew
3 David
3 Charlie
4 Clare
4 Greg
5 Greg
5 Clare
5 Alan
In this example, Group 1 and Group 3 share a common person - David. The people in Group 2 do not overlap with the people in any other group. Group 4 and Group 5 share two common people Clare and Greg.
My desired output would be as follows:
Group Person
1 David
1 Sarah
1 John
1 Charlie
2 Brian
2 Andrew
3 Clare
3 Greg
3 Alan
Reproducible data:
structure(list(Group = c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5), Person = c("David",
"Sarah", "John", "Brian", "Andrew", "David", "Charlie", "Clare",
"Greg", "Clare")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L), spec = structure(list(
cols = list(Group = structure(list(), class = c("collector_double",
"collector")), Person = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))

Using igraph cluster membership:
library(igraph)
# convert to a graph object (df1 holds the reproducible data from the question)
g <- graph_from_data_frame(df1)
# get cluster memberships
x <- clusters(g)$membership
x
#       1       2       3       4       5   David   Sarah    John   Brian  Andrew Charlie   Clare    Greg
#       1       2       1       3       3       1       1       1       2       2       1       3       3
# assign membership back to dataframe
df1$membership <- x[ df1$Person ]
df1
# Group Person membership
# 1 1 David 1
# 2 1 Sarah 1
# 3 1 John 1
# 4 2 Brian 2
# 5 2 Andrew 2
# 6 3 David 1
# 7 3 Charlie 1
# 8 4 Clare 3
# 9 5 Greg 3
# 10 5 Clare 3
We can use unique to avoid duplicated rows, and sort:
unique(df1[order(df1$membership), -1 ])
# Person membership
# 1 David 1
# 2 Sarah 1
# 3 John 1
# 7 Charlie 1
# 4 Brian 2
# 5 Andrew 2
# 8 Clare 3
# 9 Greg 3

This first groups by Person and collects each person's list of Groups in grouped_groups.
Then it groups by Group and creates a new character variable, new_grouping, which is the union of the lists of groups within each Group. All using tidyverse verbs.
library(tidyverse)
DF %>%
group_by(Person) %>%
mutate(grouped_groups = list(Group)) %>%
group_by(Group) %>%
mutate(new_grouping = paste(sort(reduce(grouped_groups, union)), collapse = "-"))

How to sort the values of each obs of a data.frame? [duplicate]

I have this data.set
people <- c("Arthur", "Jean", "Paul", "Fred", "Gary")
question1 <- c(1, 3, 2, 2, 5)
question2 <- c(1, 0, 1, 0, 3)
question3<- c(1, 0, 2, 2, 4)
question4 <- c(1, 5, 2, 1, 5)
test <- data.frame(people, question1, question2, question3, question4)
test
Here is my output :
people question1 question2 question3 question4
1 Arthur 1 1 1 1
2 Jean 3 0 0 5
3 Paul 2 1 2 2
4 Fred 2 0 2 1
5 Gary 5 3 4 5
I want to sort each person's results like this (values in descending order from left to right) in a new data.frame. The names of the new columns can be letters or anything else.
people A B C D
1 Arthur 1 1 1 1
2 Jean 5 3 0 0
3 Paul 2 2 2 1
4 Fred 2 2 1 0
5 Gary 5 5 4 3

With base R, apply the function sort to each row, but be careful: apply returns the transpose, hence the t():
test[-1] <- t(apply(test[-1], 1, sort, decreasing = TRUE))
test
# people question1 question2 question3 question4
#1 Arthur 1 1 1 1
#2 Jean 5 3 0 0
#3 Paul 2 2 2 1
#4 Fred 2 2 1 0
#5 Gary 5 5 4 3
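The question asks for a new data.frame with letter column names, whereas the line above overwrites test in place. A small variation on the same apply/sort idea might look like this; the names sorted and result are illustrative, not from the original answer.

```r
test <- data.frame(
  people    = c("Arthur", "Jean", "Paul", "Fred", "Gary"),
  question1 = c(1, 3, 2, 2, 5),
  question2 = c(1, 0, 1, 0, 3),
  question3 = c(1, 0, 2, 2, 4),
  question4 = c(1, 5, 2, 1, 5)
)

# Sort each row in decreasing order; t() undoes the transpose from apply()
sorted <- t(apply(test[-1], 1, sort, decreasing = TRUE))
colnames(sorted) <- LETTERS[seq_len(ncol(sorted))]

# Build a new data.frame; `test` itself is left unchanged
result <- data.frame(people = test$people, sorted)
result
```

Using seq_len(ncol(sorted)) rather than a hard-coded 1:4 keeps the naming correct if more question columns are added.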

Solution using tidyverse (i.e. dplyr and tidyr):
library(tidyverse)
test %>%
pivot_longer(cols=-people, names_to="variable",values_to = "values") %>%
arrange(people, -values) %>%
select(people, values) %>%
mutate(new_names = rep(letters[1:4], length(unique(test$people)))) %>%
pivot_wider(names_from = new_names,
values_from = values)
This returns:
# A tibble: 5 x 5
people a b c d
<fct> <dbl> <dbl> <dbl> <dbl>
1 Arthur 1 1 1 1
2 Fred 2 2 1 0
3 Gary 5 5 4 3
4 Jean 5 3 0 0
5 Paul 2 2 2 1
Explanation:
bring the data into 'long' form so we can order it on the values of all the 'question' variables.
order (arrange) on people and -values (i.e. values descending).
drop the now-unused column named variable.
create a new column to hold the new names, a-d, repeated for each person.
bring the data back into 'wide' form, creating new columns from the new names.

One dplyr and tidyr option could be:
test %>%
pivot_longer(-people) %>%
group_by(people) %>%
arrange(desc(value), .by_group = TRUE) %>%
mutate(name = LETTERS[1:n()]) %>%
pivot_wider(names_from = "name", values_from = "value")
people A B C D
<fct> <dbl> <dbl> <dbl> <dbl>
1 Arthur 1 1 1 1
2 Fred 2 2 1 0
3 Gary 5 5 4 3
4 Jean 5 3 0 0
5 Paul 2 2 2 1

Summarizing dataframe based on multiple columns

I'm having some trouble figuring this one out. Say, I have a table like this:
Name Activity Day
1 John cycle 1
2 John work 1
3 Tina work 1
4 Monika work 1
5 Tina swim 1
6 Tina jogging 2
7 John work 2
8 Tina work 2
I want to summarize it in a way that the activity of each individual is grouped according to the day.
It should look like this:
Name Activity Day
1 John cycle;work 1
2 Tina work;swim 1
3 Monika work 1
4 Tina jogging;work 2
5 John work 2
I am thinking that dplyr package would be the answer here, but I don't know how to do it. Any help?
Thanks!

Try:
library(dplyr)
dat <- tribble(~"Name", ~"Activity", ~"Day",
"John", "cycle", 1,
"John", "work" , 1,
"Tina", "work", 1,
"Monika", "work", 1,
"Tina", "swim", 1,
"Tina", "jogging", 2,
"John", "work", 2,
"Tina", "work", 2)
dat %>%
group_by(Name, Day) %>%
summarise(activity = paste(Activity, collapse = "; "))
# A tibble: 5 x 3
# Groups: Name [3]
Name Day activity
<chr> <dbl> <chr>
1 John 1 cycle; work
2 John 2 work
3 Monika 1 work
4 Tina 1 work; swim
5 Tina 2 jogging; work

An option with data.table
library(data.table)
setDT(dat)[, .(Activity = toString(Activity)), .(Name, Day)]

You can use the aggregate function, for example:
> aggregate(dat$Activity,list(dat$Name,dat$Day),as.character)
Group.1 Group.2 x
1 John 1 cycle, work
2 Monika 1 work
3 Tina 1 work, swim
4 John 2 work
5 Tina 2 jogging, work
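To get a plain character column with the ";" separator shown in the desired output, one possible variant is the formula interface of aggregate() with paste as the summary function. This is a sketch that assumes dat has the Name, Activity and Day columns from the question; res is an illustrative name.

```r
dat <- data.frame(
  Name     = c("John", "John", "Tina", "Monika", "Tina", "Tina", "John", "Tina"),
  Activity = c("cycle", "work", "work", "work", "swim", "jogging", "work", "work"),
  Day      = c(1, 1, 1, 1, 1, 2, 2, 2)
)

# Collapse the activities of each Name/Day pair into one ";"-separated string
res <- aggregate(Activity ~ Name + Day, data = dat, FUN = paste, collapse = ";")
res
```

The collapse argument is passed through aggregate() to paste(), so each group yields a single string instead of a list element.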
