I have data with a list of people's names and their ID numbers. Not all people with the same name will have the same ID number but everyone with different names should have a different ID number. Like this:
Name david david john john john john megan bill barbara chris chris
ID 1 1 2 2 2 3 4 5 6 7 8
I need to make sure that these IDs are correct. So I want to write code that says "subset only if the ID numbers are the same but the names are different" (so I will be subsetting only the ID errors). I don't even know where to start with this, because I tried
df1<-df(subset(duplicated(df$Name) & duplicated(df$ID)))
Error in subset.default(duplicated(df$officer) & duplicated(df$ID)) :
argument "subset" is missing, with no default
but it didn't work and I know it doesn't tell R to match and compare names and ID numbers.
Thank you so much in advance.
Updated with the information in the comments below
Here are some test data:
> DF <- data.frame(name = c("A", "A", "A", "B", "B", "C"), id=c(1,1,2,3,4,4))
> DF
name id
1 A 1
2 A 1
3 A 2
4 B 3
5 B 4
6 C 4
So ... if I understand your problem correctly you want to get the information that there are problems with id 4 since two different names (B and C) appear for that id.
library(dplyr)
DF %>% group_by(id) %>% distinct(name) %>% tally()
# A tibble: 4 x 2
id n
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 4 2
Here we get a summary and see that there are two different names (n) for id 4. You can combine that with filter to only see the ids with more than one name
> DF %>% group_by(id) %>% distinct(name) %>% tally() %>% filter(n > 1)
# A tibble: 1 x 2
id n
<dbl> <int>
1 4 2
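If the goal is to pull out the offending rows themselves rather than a count per id, a minimal sketch along the same lines could keep every row whose id is shared by more than one distinct name. It assumes dplyr is installed and reuses the DF test data from above; the name conflicts is only illustrative.

```r
library(dplyr)

DF <- data.frame(name = c("A", "A", "A", "B", "B", "C"),
                 id = c(1, 1, 2, 3, 4, 4))

# Keep only rows whose id is attached to more than one distinct name
conflicts <- DF %>%
  group_by(id) %>%
  filter(n_distinct(name) > 1) %>%
  ungroup()

print(conflicts)
```

For the test data this should leave just the two rows with id 4 (names B and C).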
Did that help?
A similar question was asked here. However, I did not manage to adapt that solution to my particular problem, hence the separate question.
An example dataset:
id group
1 1 5
2 1 998
3 2 2
4 2 3
5 3 998
I would like to delete all rows that are duplicated in id and where group has value 998.
In this example, only row 2 should be deleted.
I tried something along those lines:
df1 <- df %>%
subset((unique(by = "id") | group != 998))
but got
Error in is.factor(x) : Argument "x" is missing, with no default
Thank you in advance
Here is an idea
library(dplyr)
df %>%
group_by(id) %>%
filter(!any(n() > 1 & group == 998))
# A tibble: 3 x 2
# Groups: id [2]
id group
<int> <int>
1 2 2
2 2 3
3 3 998
In case you want to remove only the 998 entry from the group then,
df %>%
group_by(id) %>%
filter(!(n() > 1 & group == 998))
One way could be:
library(dplyr)
df1 <- df %>%
filter(duplicated(id) & group=="998")
anti_join(df, df1)
Joining, by = c("id", "group")
id group
1 1 5
3 2 2
4 2 3
5 3 998
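One caveat with duplicated(id): it flags only the second and later occurrences of an id, so a 998 row that comes first within a duplicated id would be missed. A base-R sketch that does not depend on row order is shown below. The example data is a reordered variant of the question's data, and dup_ids and cleaned are illustrative names.

```r
# Variant of the example data with the 998 row listed first for id 1
df <- data.frame(id = c(1, 1, 2, 2, 3),
                 group = c(998, 5, 2, 3, 998))

# Every id that occurs more than once, regardless of position
dup_ids <- unique(df$id[duplicated(df$id)])

# Drop rows that belong to a duplicated id AND have group 998
cleaned <- df[!(df$id %in% dup_ids & df$group == 998), ]

print(cleaned)
```

Here the 998 row for id 1 is dropped, while the 998 row for id 3 stays because that id occurs only once.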
I have a dataframe with several "people".
There are repeat instances for "people", however, the measured "value" is different in each instance.
Here is an example of dataframe.
df2 <- data.frame(
value = c(1, 2, 3, 4, 5),
people = c("d", "c", "b", "d", "b")
)
which looks like:
value people
1 d
2 c
3 b
4 d
5 b
I would like to group the data by "people", order the groups by their first (smallest) "value", and within each group sort ascending by "value".
That is, I want to keep duplicates together while sorting by value.
Here is how I would like the data to look:
value people
1 d
4 d
2 c
3 b
5 b
I have made multiple attempts with group_by and arrange using {dplyr}, but it seems I am missing something.
Thanks for the help.
I have made a change for clarity: I do not want "people" sorted alphabetically. This is a schedule in reality: person D has the first appointment (1), and his second appointment is 4, so I want those rows to appear first and together. Person C has appointment 2. Person B has appointment 3, and his other appointment is 5. I hope this makes it clearer. Thanks again.
You can use arrange in this form:
library(dplyr)
df2 %>%
arrange(value) %>%
arrange(match(people, unique(people)))
# value people
#1 1 d
#2 4 d
#3 2 c
#4 3 b
#5 5 b
This is longer, but it also works:
df2 %>% group_by(people) %>% arrange(value) %>%
mutate(d = first(value)) %>% arrange(d) %>% ungroup() %>% select(-d)
# A tibble: 5 x 2
value people
<dbl> <chr>
1 1 d
2 4 d
3 2 c
4 3 b
5 5 b
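The following base-R one-liner reproduces the expected output for this example, but only because the desired order of people (d, c, b) happens to be reverse alphabetical and the rows are already in ascending value order; it will not generalize to other schedules: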
df2[order(df2$people, decreasing = TRUE),]
# value people
# 1 1 d
# 4 4 d
# 2 2 c
# 3 3 b
# 5 5 b
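Since the question explicitly rules out an alphabetical sort, here is a base-R sketch that orders people by their earliest value instead. It assumes each person's earliest value is unique (as with appointment numbers); first_value and sorted are illustrative names, not part of the original answers.

```r
df2 <- data.frame(
  value = c(1, 2, 3, 4, 5),
  people = c("d", "c", "b", "d", "b")
)

# For every row, the smallest value recorded for that person
first_value <- ave(df2$value, df2$people, FUN = min)

# Order people by their earliest value, then rows by value within each person
sorted <- df2[order(first_value, df2$value), ]

print(sorted)
```

Because the ordering key is derived from value rather than from the names, renaming the people would not change the row order.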
I need to find common values between different groups ideally using dplyr and R.
From my dataset here:
group val
<fct> <dbl>
1 a 1
2 a 2
3 a 3
4 b 3
5 b 4
6 b 5
7 c 1
8 c 3
the expected output is
group val
<fct> <dbl>
1 a 3
2 b 3
3 c 3
as only number 3 occurs in all groups.
This code does not seem to work:
# Filter the data
dd %>%
group_by(group) %>%
filter(all(val)) # does not work
The example here solves a similar issue but relies on a predefined vector of shared values. What if I do not know which values are shared?
Dummy example:
# Reproducible example: filter all id by group
group = c("a", "a", "a",
"b", "b", "b",
"c", "c")
val = c(1,2,3,
3,4,5,
1,3)
dd <- data.frame(group,
val)
group_by isolates each group, so we can't very well group_by(group) and compare between groups. Instead, we can group_by(val) and see which values appear in all the groups:
dd %>%
group_by(val) %>%
filter(n_distinct(group) == n_distinct(dd$group))
# # A tibble: 3 x 2
# # Groups: val [1]
# group val
# <chr> <dbl>
# 1 a 3
# 2 b 3
# 3 c 3
This is one of the rare cases where we want to use data$column in a dplyr verb - n_distinct(dd$group) refers explicitly to the ungrouped original data to get the total number of groups. (It could also be pre-computed.) Whereas n_distinct(group) is using the grouped data piped in to filter, thus it gives the number of distinct groups for each value (because we group_by(val)).
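As a sketch of the pre-computed variant mentioned above, the total number of groups can be stored in a variable first, so that dd$group no longer appears inside filter. The names n_groups and shared are illustrative.

```r
library(dplyr)

dd <- data.frame(group = c("a", "a", "a", "b", "b", "b", "c", "c"),
                 val = c(1, 2, 3, 3, 4, 5, 1, 3))

# Total number of distinct groups, computed once on the ungrouped data
n_groups <- n_distinct(dd$group)

# Keep values that occur in every group
shared <- dd %>%
  group_by(val) %>%
  filter(n_distinct(group) == n_groups) %>%
  ungroup()

print(shared)
```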
A base R approach can be:
#Code
newd <- dd[dd$val %in% Reduce(intersect, split(dd$val, dd$group)),]
Output:
group val
3 a 3
4 b 3
8 c 3
A similar option in data.table, along the lines of @GregorThomas's solution, is
library(data.table)
setDT(dd)[dd[, .I[uniqueN(group) == uniqueN(dd$group)], val]$V1]
I have two data sets with one common variable, ID (there are duplicate ID numbers in both data sets). I need to link dates to one data set, but I can't use a left join because the first (left) file needs to stay as it is: I don't want the join to return all combinations and add rows. I also don't want it to link data like VLOOKUP in Excel, which finds the first match and returns it, so that duplicate ID numbers all get the first match. I need it to return the first match, then the second, then the third, and so on (the dates are sorted so that the newest date is always first for every ID number), BUT I can't have added rows. Is there any way to do this? Since I don't know how else to show you, I have included an example picture of what I need. I'm not sure if I made myself clear, but thank you in advance!
You can add a second column of sub-IDs that follow the order of the row numbers within each ID. Then you can use an inner_join on both columns to join everything together.
Since you don't have example data sets I created two to show the principle.
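library(dplyr)
# Number the rows within each ID so that the n-th occurrence in df1
# is matched with the n-th occurrence in df2
df1 <- df1 %>%
group_by(ID) %>%
mutate(sub_id = row_number())
df2 <- df2 %>%
group_by(ID) %>%
mutate(sub_id = row_number())
outcome <- df1 %>% inner_join(df2, by = c("ID", "sub_id"))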
# A tibble: 7 x 3
# Groups: ID [?]
ID sub_id var1
<dbl> <int> <fct>
1 1 1 a
2 1 2 b
3 2 1 e
4 3 1 f
5 4 1 h
6 4 2 i
7 4 3 j
data:
df1 <- data.frame(ID = c(1, 1, 2,3,4,4,4))
df2 <- data.frame(ID = c(1,1,1,1,2,3,3,4,4,4,4),
var1 = letters[1:11])
You need a secondary id column. Since you need the first n matches, just group by the id, create an auto-increment id within each group, then join as usual.
df1<-data.frame(id=c(1,1,2,3,4,4,4))
d1=sample(seq(as.Date('1999/01/01'), as.Date('2012/01/01'), by="day"),11)
df2<-data.frame(id=c(1,1,1,1,2,3,3,4,4,4,4),d1,d2=d1+sample.int(50,11))
library(dplyr)
df11 <- df1 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
df21 <- df2 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
left_join(df11,df21,by = c("id", "id2"))
# A tibble: 7 x 4
id id2 d1 d2
<dbl> <int> <date> <date>
1 1 1 2009-06-10 2009-06-13
2 1 2 2004-05-28 2004-07-11
3 2 1 2001-08-13 2001-09-06
4 3 1 2005-12-30 2006-01-19
5 4 1 2000-08-06 2000-08-17
6 4 2 2010-09-02 2010-09-10
7 4 3 2007-07-27 2007-09-05
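The same idea can be sketched in base R, together with a check that the join did not add rows to the left table. This assumes both tables are already sorted the way the matches should be paired; the var1 column and the names id2 and joined are illustrative.

```r
df1 <- data.frame(id = c(1, 1, 2, 3, 4, 4, 4))
df2 <- data.frame(id = c(1, 1, 1, 1, 2, 3, 3, 4, 4, 4, 4),
                  var1 = letters[1:11])

# Running counter within each id: 1 for the first occurrence, 2 for the next, ...
df1$id2 <- ave(seq_along(df1$id), df1$id, FUN = seq_along)
df2$id2 <- ave(seq_along(df2$id), df2$id, FUN = seq_along)

# Left join on both keys, so the n-th row of an id in df1 gets the n-th match in df2
joined <- merge(df1, df2, by = c("id", "id2"), all.x = TRUE)

# The left table must not have gained rows
stopifnot(nrow(joined) == nrow(df1))
print(joined)
```

Note that merge() sorts the result by the key columns by default, so if the left table is not already ordered by id, a helper column holding the original row number would be needed to restore its order.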
I am using tidyr from R and am running into an issue when using the spread() command with duplicate identifiers.
Here is a mock example that illustrates the problem:
X = data.frame(name=c("Eric","Bob","Mark","Bob","Bob","Mark","Eric","Bob","Mark"),
metric=c("height","height","height","weight","weight","weight","grade","grade","grade"),
values=c(6,5,4,120,118,180,"A","B","C"),
stringsAsFactors=FALSE)
tidyr::spread(X,metric,values)
So when I run this command I get the following error:
Error: Duplicate identifiers for rows (4, 5)
which makes sense as an error, because Bob is recorded twice for weight. It's actually not a mistake in the data, because Bob did have his weight recorded twice. What I would like to be able to do is run the command and have it give me back the following:
name height weight grade
Eric 6 NA A
Bob 5 120 B
Bob 5 118 B
Mark 4 180 C
Is spread not the command I should be using to accomplish this? And if there isn't an easy solution is there a simple way to remove the record with lowest weight for duplicates when running the spread() command?
After making unique identifiers, which can be done by making a new variable representing the index within each group, you can use fill to fill the second "Bob" row with a duplicate value for "height" and "grade".
You can remove the index variable at the end via select.
library(dplyr)
library(tidyr)
X %>%
group_by(name, metric) %>%
mutate(row = row_number() ) %>%
spread(metric, values) %>%
fill(grade, height) %>%
select(-row)
# A tibble: 4 x 4
# Groups: name [3]
name grade height weight
<chr> <chr> <chr> <chr>
1 Bob B 5 120
2 Bob B 5 118
3 Eric A 6 <NA>
4 Mark C 4 180
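In current tidyr releases spread() is superseded by pivot_wider(). A rough equivalent of the pipeline above, assuming tidyr 1.0.0 or later, might look like the sketch below. The name wide is illustrative, and the row and column order can differ from the spread() output.

```r
library(dplyr)
library(tidyr)

X <- data.frame(
  name = c("Eric", "Bob", "Mark", "Bob", "Bob", "Mark", "Eric", "Bob", "Mark"),
  metric = c("height", "height", "height", "weight", "weight", "weight",
             "grade", "grade", "grade"),
  values = c(6, 5, 4, 120, 118, 180, "A", "B", "C"),
  stringsAsFactors = FALSE
)

wide <- X %>%
  group_by(name, metric) %>%
  mutate(row = row_number()) %>%   # index within each name/metric pair
  ungroup() %>%
  pivot_wider(names_from = metric, values_from = values) %>%
  group_by(name) %>%
  fill(height, grade) %>%          # copy height and grade down within each name
  ungroup() %>%
  select(-row)

print(wide)
```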
To filter to the maximum value of each name/metric group (note that values is a character column here, so max() compares strings rather than numbers):
X %>%
group_by(name, metric) %>%
filter(values == max(values)) %>%
spread(metric, values)
# A tibble: 3 x 4
# Groups: name [3]
name grade height weight
* <chr> <chr> <chr> <chr>
1 Bob B 5 120
2 Eric A 6 <NA>
3 Mark C 4 180
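Because values holds text, max() on it is a string comparison, which happens to pick "120" over "118" but would rank "95" above "120". A hedged base-R sketch that compares the weights numerically before spreading is shown below; is_weight, w, keep_weight, deduped and wide are illustrative names.

```r
X <- data.frame(
  name = c("Eric", "Bob", "Mark", "Bob", "Bob", "Mark", "Eric", "Bob", "Mark"),
  metric = c("height", "height", "height", "weight", "weight", "weight",
             "grade", "grade", "grade"),
  values = c(6, 5, 4, 120, 118, 180, "A", "B", "C"),
  stringsAsFactors = FALSE
)

# String comparison pitfall: "9" sorts after "1"
string_max <- max(c("95", "120"))

# Compare the weight rows as numbers and keep the largest per name
is_weight <- X$metric == "weight"
w <- as.numeric(X$values[is_weight])
keep_weight <- w == ave(w, X$name[is_weight], FUN = max)

deduped <- rbind(X[!is_weight, ], X[is_weight, ][keep_weight, ])
wide <- tidyr::spread(deduped, metric, values)

print(wide)
```

Only the weight rows are converted, so the letter grades never pass through as.numeric().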