There should be a fairly simple solution to this but it's giving me trouble. I have a DF similar to this:
> df <- data.frame(name = c("george", "george", "george", "sara", "sara", "sam", "bill", "bill"),
id_num = c(1, 1, 2, 3, 3, 4, 5, 5))
> df
name id_num
1 george 1
2 george 1
3 george 2
4 sara 3
5 sara 3
6 sam 4
7 bill 5
8 bill 5
I'm looking for a way to find rows where the name and ID numbers are inconsistent in a very large dataset. I.e., George should always be "1" but in row three there is a mistake and he has also been assigned ID number "2".
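I think the easiest way is to use dplyr::count() twice. For your example:
library(dplyr)
df %>%
  count(name, id_num) %>%
  count(name, name = "n_ids")
The first count gives one row per distinct name/ID pair:
name id_num n
bill 5 2
george 1 2
george 2 1
sam 4 1
sara 3 2
The second count then gives the number of distinct IDs per name (the explicit name = "n_ids" avoids a clash with the n column created by the first count):
name n_ids
bill 1
george 2
sam 1
sara 1
Of course, you could add filter(n_ids > 1) to the end of your pipe, too, or arrange(desc(n_ids)):
df %>%
  count(name, id_num) %>%
  count(name, name = "n_ids") %>%
  arrange(desc(n_ids)) %>%
  filter(n_ids > 1)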
Use tapply() to count the distinct IDs per name, then keep the names with more than one.
res <- with(df, tapply(id_num, list(name), \(x) length(unique(x))))
res[res > 1]
# george
# 2
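Since the question asks for the offending rows rather than just the names, the per-name counts can be used to subset the original data. A minimal base R sketch, assuming the df from the question; the names n_ids, bad_names and bad_rows are only illustrative:

```r
df <- data.frame(
  name = c("george", "george", "george", "sara", "sara", "sam", "bill", "bill"),
  id_num = c(1, 1, 2, 3, 3, 4, 5, 5),
  stringsAsFactors = FALSE
)

# Number of distinct IDs recorded for each name
n_ids <- tapply(df$id_num, df$name, function(x) length(unique(x)))

# Names that carry more than one ID
bad_names <- names(n_ids)[n_ids > 1]

# All rows belonging to those names, so the conflicting IDs can be inspected
bad_rows <- df[df$name %in% bad_names, ]
bad_rows
```

This keeps every row for an inconsistent name, not only the row with the minority ID, because the data alone does not say which ID is the correct one.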
You probably want to correct this. A safe way is to rebuild the numeric IDs using as.factor(),
df$id_new <- as.integer(as.factor(df$name))
df
# name id_num id_new
# 1 george 1 2
# 2 george 1 2
# 3 george 2 2
# 4 sara 3 4
# 5 sara 3 4
# 6 sam 4 3
# 7 bill 5 1
# 8 bill 5 1
where numbers are assigned according to the names in alphabetical order, or factor(), reading in the levels in order of appearance.
df$id_new2 <- as.integer(factor(df$name, levels=unique(df$name)))
df
# name id_num id_new id_new2
# 1 george 1 2 1
# 2 george 1 2 1
# 3 george 2 2 1
# 4 sara 3 4 2
# 5 sara 3 4 2
# 6 sam 4 3 3
# 7 bill 5 1 4
# 8 bill 5 1 4
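As a quick sanity check on the rebuilt IDs, the sketch below recomputes the order-of-appearance IDs with match() (a swapped-in base R equivalent, not part of the answer above) and confirms that each name now maps to exactly one ID. The variable names are only illustrative:

```r
df <- data.frame(
  name = c("george", "george", "george", "sara", "sara", "sam", "bill", "bill"),
  id_num = c(1, 1, 2, 3, 3, 4, 5, 5),
  stringsAsFactors = FALSE
)

# Position of each name in the vector of unique names = ID by order of appearance
id_by_appearance <- match(df$name, unique(df$name))

# Same result as the factor() approach with levels in order of appearance
same_as_factor <- identical(
  id_by_appearance,
  as.integer(factor(df$name, levels = unique(df$name)))
)

# After rebuilding, every name should have exactly one distinct ID
ids_per_name <- tapply(id_by_appearance, df$name, function(x) length(unique(x)))
all(ids_per_name == 1)
```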
Note: R >= 4.1 used.
Data:
df <- structure(list(name = c("george", "george", "george", "sara",
"sara", "sam", "bill", "bill"), id_num = c(1, 1, 2, 3, 3, 4,
5, 5)), class = "data.frame", row.names = c(NA, -8L))
I have a data frame like this:
library(dplyr)
name <- c("Bob", "Bob", "Bob", "Bob", "John", "John", "John")
count <- c(2, 3, 4, 5, 2, 3, 4)
score <- c(5, NA, NA, NA, 3, 4, 2)
my_df <- data.frame(cbind(name, count, score)) %>%
mutate(count = as.numeric(count),
score = as.numeric(score))
my_df
name count score
1 Bob 2 5
2 Bob 3 NA
3 Bob 4 NA
4 Bob 5 NA
5 John 2 3
6 John 3 4
7 John 4 2
Then I create another column by taking the product between count and score:
my_df %>%
mutate(product = count*score)
name count score product
1 Bob 2 5 10
2 Bob 3 NA NA
3 Bob 4 NA NA
4 Bob 5 NA NA
5 John 2 3 6
6 John 3 4 12
7 John 4 2 8
I want to group by name and aggregate for the sum(product)/sum(count) but I want the sum of product column to ignore any NA values in the sum (I did this below) AND I want any associated count values to be ignored in the summation. This is my current solution, but it is not right. Bob's result is calculated as 10/(2+3+4+5) = 0.71 but I want Bob's result to be 10/2 = 5.
my_df %>%
  mutate(product = count*score) %>%
  group_by(name) %>%
  summarize(result = sum(product, na.rm = TRUE)/sum(count))
name result
<chr> <dbl>
1 Bob 0.714
2 John 2.89
We may need to subset the count by the non-NA values in 'product'
library(dplyr)
my_df %>%
mutate(product = count*score) %>%
group_by(name) %>%
summarise(result = sum(product, na.rm = TRUE)/sum(count[!is.na(product)]))
-output
# A tibble: 2 × 2
name result
<chr> <dbl>
1 Bob 5
2 John 2.89
Or do a filter before the grouping
my_df %>%
filter(complete.cases(score)) %>%
group_by(name) %>%
summarise(result = sum(score * count)/sum(count))
# A tibble: 2 × 2
name result
<chr> <dbl>
1 Bob 5
2 John 2.89
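For comparison, base R's weighted.mean() drops the weights that belong to NA values when na.rm = TRUE, which is the same rule the two pipelines above apply by hand. A minimal base R sketch, assuming the my_df data from the question; by_name and result are only illustrative names:

```r
my_df <- data.frame(
  name  = c("Bob", "Bob", "Bob", "Bob", "John", "John", "John"),
  count = c(2, 3, 4, 5, 2, 3, 4),
  score = c(5, NA, NA, NA, 3, 4, 2),
  stringsAsFactors = FALSE
)

# One data frame per name
by_name <- split(my_df, my_df$name)

# Weighted mean of score with count as weight; rows with NA score are
# dropped together with their counts
result <- sapply(by_name, function(d) weighted.mean(d$score, d$count, na.rm = TRUE))
result
```

Bob comes out as 10/2 = 5 and John as 26/9, matching the dplyr results.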
I want to restructure some "multiple response" survey data from binary to nominal categories.
The survey asks the responder which ten people they most often interact with and gives a list of 50 names. The data comes back with 50 columns, one column for each name, and a name value in each cell for each name selected and blank for unselected names. I want to convert the fifty columns into ten columns (name1 to name10).
Below is an example of what I mean with (for simplicity) 5 names, where the person must select two names with five responders.
id <- 1:5
mike <- c("","mike","","","mike")
tim <- c("tim","","tim","","")
mary <- c("mary","mary","mary","","")
jane <- c("","","","jane","jane")
liz <- c("","","","liz","")
surveyData <- data.frame(id,mike,tim,mary,jane,liz)
Name1 <- c("tim","mike","tim","jane","mike")
Name2 <- c("mary","mary","mary","liz","jane")
restructuredSurveyData <- data.frame(id,Name1,Name2)
Replace '' with NA and apply na.omit to each row.
cbind(surveyData[1], `colnames<-`(t(apply(replace(surveyData[-1],
surveyData[-1] == '', NA), 1,
na.omit)), paste0('name_', 1:2)))
# id name_1 name_2
# 1 1 tim mary
# 2 2 mike mary
# 3 3 tim mary
# 4 4 jane liz
# 5 5 mike jane
A spoiled eye may like this better these days:
replace(surveyData[-1], surveyData[-1] == '', NA) |>
apply(1, na.omit) |>
t() |>
`colnames<-`(paste0('name_', 1:2)) |>
cbind(surveyData[1]) |>
subset(select=c('id', 'name_1', 'name_2'))
# id name_1 name_2
# 1 1 tim mary
# 2 2 mike mary
# 3 3 tim mary
# 4 4 jane liz
# 5 5 mike jane
Note: R >= 4.1 used.
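The apply()/na.omit approach relies on every responder selecting the same number of names; if the counts differ, apply() returns a list instead of a matrix. A hedged base R sketch that pads shorter rows with NA, assuming the surveyData from the question (picked, k, padded and result are illustrative names):

```r
surveyData <- data.frame(
  id   = 1:5,
  mike = c("", "mike", "", "", "mike"),
  tim  = c("tim", "", "tim", "", ""),
  mary = c("mary", "mary", "mary", "", ""),
  jane = c("", "", "", "jane", "jane"),
  liz  = c("", "", "", "liz", ""),
  stringsAsFactors = FALSE
)

# Selected (non-blank) names for each responder, in column order
picked <- lapply(seq_len(nrow(surveyData)), function(i) {
  vals <- unlist(surveyData[i, -1], use.names = FALSE)
  vals[vals != ""]
})

# Pad every responder to the largest number of selections
k <- max(lengths(picked))
padded <- t(vapply(
  picked,
  function(v) c(v, rep(NA_character_, k - length(v))),
  character(k)
))
colnames(padded) <- paste0("name_", seq_len(k))

result <- cbind(surveyData["id"], as.data.frame(padded, stringsAsFactors = FALSE))
result
```

With the example data every row has two selections, so no padding is actually added and the result matches the output above.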
Another possible solution, based on tidyverse:
library(tidyverse)
surveyData %>%
pivot_longer(-id) %>%
filter(value != "") %>%
mutate(nam = if_else(row_number() %% 2 == 1, "names1", "names2")) %>%
  pivot_wider(id_cols = id, names_from = nam, values_from = value)
#> # A tibble: 5 × 3
#> id names1 names2
#> <int> <chr> <chr>
#> 1 1 tim mary
#> 2 2 mike mary
#> 3 3 tim mary
#> 4 4 jane liz
#> 5 5 mike jane
Or using purrr::pmap_df:
library(tidyverse)
pmap_df(surveyData[-1], ~ str_c(c(...)[c(...) != ""], collapse = ",") %>%
set_names("names")) %>%
separate(names, into = str_c("names", 1:2), sep = ",") %>%
bind_cols(select(surveyData, id), .)
#> id names1 names2
#> 1 1 tim mary
#> 2 2 mike mary
#> 3 3 tim mary
#> 4 4 jane liz
#> 5 5 mike jane
In my dataframe, I have a 'Groups' column and a 'Person' column. I want to join groups together if they share at least one common person. Consider the following example data:
Group Person
1 David
1 Sarah
1 John
2 Brian
2 Andrew
3 David
3 Charlie
4 Clare
4 Greg
5 Greg
5 Clare
5 Alan
In this example, Group 1 and Group 3 share a common person - David. The people in Group 2 do not overlap with the people in any other group. Group 4 and Group 5 share two common people Clare and Greg.
My desired output would be as follows:
Group Person
1 David
1 Sarah
1 John
1 Charlie
2 Brian
2 Andrew
3 Clare
3 Greg
3 Alan
Reproducible data:
df1 <- structure(list(Group = c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5), Person = c("David",
"Sarah", "John", "Brian", "Andrew", "David", "Charlie", "Clare",
"Greg", "Clare")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L), spec = structure(list(
cols = list(Group = structure(list(), class = c("collector_double",
"collector")), Person = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
Using igraph cluster membership:
library(igraph)
#convert to graph object
g <- graph_from_data_frame(df1)
#get cluster memberships
x <- components(g)$membership
x
# 1 2 3 4 5 David Sarah John Brian Andrew Charlie Clare Greg
# 1 2 1 3 3 1 1 1 2 2 1 3 3
# assign membership back to dataframe
df1$membership <- x[ df1$Person ]
df1
# Group Person membership
# 1 1 David 1
# 2 1 Sarah 1
# 3 1 John 1
# 4 2 Brian 2
# 5 2 Andrew 2
# 6 3 David 1
# 7 3 Charlie 1
# 8 4 Clare 3
# 9 5 Greg 3
# 10 5 Clare 3
We can use unique to avoid duplicated rows, and sort:
unique(df1[order(df1$membership), -1 ])
# Person membership
# 1 David 1
# 2 Sarah 1
# 3 John 1
# 7 Charlie 1
# 4 Brian 2
# 5 Andrew 2
# 8 Clare 3
# 9 Greg 3
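If igraph is not available, the same connected-component idea can be sketched in base R by repeatedly propagating the smallest group label across shared persons and shared groups until nothing changes. This is a swapped-in technique for illustration, not the igraph algorithm, and it assumes the reproducible data from the question:

```r
df1 <- data.frame(
  Group  = c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5),
  Person = c("David", "Sarah", "John", "Brian", "Andrew",
             "David", "Charlie", "Clare", "Greg", "Clare"),
  stringsAsFactors = FALSE
)

# Start with each row labelled by its own group
label <- df1$Group
repeat {
  # A person takes the smallest label seen on any of their rows ...
  by_person <- ave(label, df1$Person, FUN = min)
  # ... and a group takes the smallest label seen on any of its rows
  new_label <- ave(by_person, df1$Group, FUN = min)
  if (identical(new_label, label)) break
  label <- new_label
}

# Renumber the surviving labels as 1, 2, 3, ...
df1$membership <- match(label, sort(unique(label)))
df1
```

Labels can only decrease, so the loop terminates; for large data the igraph route is likely to be much faster.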
This first groups by Person and stores each person's list of Groups in grouped_groups.
Then it groups by Group and creates a new character variable, new_grouping, which is the union of those lists within each Group, so groups that share a person get the same label (e.g. "1-3"). All using tidyverse verbs (dplyr plus purrr::reduce).
library(dplyr)
library(purrr)
df1 %>%
  group_by(Person) %>%
  mutate(grouped_groups = list(Group)) %>%
  group_by(Group) %>%
  mutate(new_grouping = paste(sort(reduce(grouped_groups, union)), collapse = "-"))
I have this data.set
people <- c("Arthur", "Jean", "Paul", "Fred", "Gary")
question1 <- c(1, 3, 2, 2, 5)
question2 <- c(1, 0, 1, 0, 3)
question3<- c(1, 0, 2, 2, 4)
question4 <- c(1, 5, 2, 1, 5)
test <- data.frame(people, question1, question2, question3, question4)
test
Here is my output :
people question1 question2 question3 question4
1 Arthur 1 1 1 1
2 Jean 3 0 0 5
3 Paul 2 1 2 2
4 Fred 2 0 2 1
5 Gary 5 3 4 5
I want to order the results for each person like this (values in descending order from left to right) in a new data.frame. The names of the new columns can be letters or anything else.
people A B C D
1 Arthur 1 1 1 1
2 Jean 5 3 0 0
3 Paul 2 2 2 1
4 Fred 2 2 1 0
5 Gary 5 5 4 3
With base R, apply the function sort to the rows in question, but be careful: apply returns the transpose, hence the call to t():
test[-1] <- t(apply(test[-1], 1, sort, decreasing = TRUE))
test
# people question1 question2 question3 question4
#1 Arthur 1 1 1 1
#2 Jean 5 3 0 0
#3 Paul 2 2 2 1
#4 Fred 2 2 1 0
#5 Gary 5 5 4 3
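The line above overwrites test in place and keeps the question column names. Since the question asked for a new data.frame with letter column names, here is a small base R variation on the same apply()/sort idea; sorted and result are illustrative names:

```r
test <- data.frame(
  people    = c("Arthur", "Jean", "Paul", "Fred", "Gary"),
  question1 = c(1, 3, 2, 2, 5),
  question2 = c(1, 0, 1, 0, 3),
  question3 = c(1, 0, 2, 2, 4),
  question4 = c(1, 5, 2, 1, 5),
  stringsAsFactors = FALSE
)

# Sort each row in decreasing order; t() undoes the transpose from apply()
sorted <- t(apply(test[-1], 1, sort, decreasing = TRUE))

# Name the sorted columns A, B, C, ...
colnames(sorted) <- LETTERS[seq_len(ncol(sorted))]

# Build a new data frame and leave `test` untouched
result <- data.frame(people = test$people, sorted, stringsAsFactors = FALSE)
result
```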
Solution using tidyverse (i.e. dplyr and tidyr):
library(tidyverse)
test %>%
pivot_longer(cols=-people, names_to="variable",values_to = "values") %>%
arrange(people, -values) %>%
select(people, values) %>%
mutate(new_names = rep(letters[1:4], length(unique(test$people)))) %>%
pivot_wider(names_from = new_names,
values_from = values)
This returns:
# A tibble: 5 x 5
people a b c d
<fct> <dbl> <dbl> <dbl> <dbl>
1 Arthur 1 1 1 1
2 Fred 2 2 1 0
3 Gary 5 5 4 3
4 Jean 5 3 0 0
5 Paul 2 2 2 1
Explanation:
Bring the data into 'long' form so we can order it on the values of all the 'question' variables.
Order (arrange) on people and -values, i.e. descending values within each person.
Remove the unused column named variable.
Create a new column to hold the new names, a to d, repeated for each person.
Bring the data back into 'wide' form, creating new columns from the new names.
One dplyr and tidyr option could be:
test %>%
pivot_longer(-people) %>%
group_by(people) %>%
arrange(desc(value), .by_group = TRUE) %>%
mutate(name = LETTERS[1:n()]) %>%
pivot_wider(names_from = "name", values_from = "value")
people A B C D
<fct> <dbl> <dbl> <dbl> <dbl>
1 Arthur 1 1 1 1
2 Fred 2 2 1 0
3 Gary 5 5 4 3
4 Jean 5 3 0 0
5 Paul 2 2 2 1
I'm having some trouble figuring this one out. Say, I have a table like this:
Name Activity Day
1 John cycle 1
2 John work 1
3 Tina work 1
4 Monika work 1
5 Tina swim 1
6 Tina jogging 2
7 John work 2
8 Tina work 2
I want to summarize it in a way that the activity of each individual is grouped according to the day.
It should look like this:
Name Activity Day
1 John cycle;work 1
2 Tina work;swim 1
3 Monika work 1
4 Tina jogging;work 2
5 John work 2
I am thinking that dplyr package would be the answer here, but I don't know how to do it. Any help?
Thanks!
Try:
library(dplyr)
dat <- tribble(~"Name", ~"Activity", ~"Day",
"John", "cycle", 1,
"John", "work" , 1,
"Tina", "work", 1,
"Monika", "work", 1,
"Tina", "swim", 1,
"Tina", "jogging", 2,
"John", "work", 2,
"Tina", "work", 2)
dat %>%
group_by(Name, Day) %>%
summarise(activity = paste(Activity, collapse = "; "))
# A tibble: 5 x 3
# Groups: Name [3]
Name Day activity
<chr> <dbl> <chr>
1 John 1 cycle; work
2 John 2 work
3 Monika 1 work
4 Tina 1 work; swim
5 Tina 2 jogging; work
An option with data.table
library(data.table)
setDT(dat)[, .(Activity = toString(Activity)), .(Name, Day)]
You can use the aggregate function, for example:
> aggregate(dat$Activity,list(dat$Name,dat$Day),as.character)
Group.1 Group.2 x
1 John 1 cycle, work
2 Monika 1 work
3 Tina 1 work, swim
4 John 2 work
5 Tina 2 jogging, work
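To get a plain character column with the ";" separator from the question, aggregate() can be given paste as the function, with collapse passed through. A minimal base R sketch, assuming the same table rebuilt as a plain data.frame (res is an illustrative name):

```r
dat <- data.frame(
  Name     = c("John", "John", "Tina", "Monika", "Tina", "Tina", "John", "Tina"),
  Activity = c("cycle", "work", "work", "work", "swim", "jogging", "work", "work"),
  Day      = c(1, 1, 1, 1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)

# One row per Name/Day pair; activities are joined in their original order
res <- aggregate(Activity ~ Name + Day, data = dat, FUN = paste, collapse = ";")
res
```

Unlike as.character, paste with collapse returns a single string per group, so the Activity column stays an ordinary character vector.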