How to identify mismatches between two data sets in R?

I have two data sets, Dataset1 and Dataset2, shown below.
Dataset1:
family_id house_id number_family_member
1 1052 2
2 5042 3
3 1111 2
Dataset2:
family_id house_id age gender
1 1052 24 male
1 1052 25 female
2 5042 23 male
2 5042 20 female
3 1111 1 male
3 1111 20 female
3 1111 21 female
There are mismatches between the number of members recorded in Dataset1 and the individual records entered in Dataset2. For example, family_id 2 has 3 members according to Dataset1, but Dataset2 contains only 2 entries for it.
How can I identify this kind of mismatch between the two data sets?

Both of these views might be helpful:
library(dplyr)
# row-level view: every person, flagged by whether the family's count matches
dataset2 %>%
add_count(family_id) %>%
inner_join(dataset1) %>%
mutate(match = n == number_family_member)
# # A tibble: 7 x 7
# family_id house_id age gender n number_family_member match
# <int> <int> <int> <fctr> <int> <int> <lgl>
# 1 1 1052 24 male 2 2 TRUE
# 2 1 1052 25 female 2 2 TRUE
# 3 2 5042 23 male 2 3 FALSE
# 4 2 5042 20 female 2 3 FALSE
# 5 3 1111 1 male 3 2 FALSE
# 6 3 1111 20 female 3 2 FALSE
# 7 3 1111 21 female 3 2 FALSE
dataset2 %>%
count(family_id) %>%
inner_join(dataset1) %>%
mutate(match= n ==number_family_member)
# # A tibble: 3 x 5
# family_id n house_id number_family_member match
# <int> <int> <int> <int> <lgl>
# 1 1 2 1052 2 TRUE
# 2 2 2 5042 3 FALSE
# 3 3 3 1111 2 FALSE
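One caveat with an inner join is that a family present in only one of the two data sets drops out of the comparison. As a minimal base R sketch (assuming the dataset1 and dataset2 columns from the question; `counts`, `cmp`, `n_listed` and `mismatches` are names made up for this illustration), an outer merge keeps such families and reports them as mismatches too:

```r
dataset1 <- data.frame(
  family_id = c(1L, 2L, 3L),
  house_id = c(1052L, 5042L, 1111L),
  number_family_member = c(2L, 3L, 2L)
)
dataset2 <- data.frame(
  family_id = c(1L, 1L, 2L, 2L, 3L, 3L, 3L),
  house_id = c(1052L, 1052L, 5042L, 5042L, 1111L, 1111L, 1111L),
  age = c(24L, 25L, 23L, 20L, 1L, 20L, 21L),
  gender = c("male", "female", "male", "female", "male", "female", "female")
)

# Count the individual records per family in dataset2
counts <- as.data.frame(table(family_id = dataset2$family_id),
                        responseName = "n_listed", stringsAsFactors = FALSE)
counts$family_id <- as.integer(counts$family_id)

# Outer merge: families missing from either side are kept (as NA)
cmp <- merge(dataset1, counts, by = "family_id", all = TRUE)
cmp$n_listed[is.na(cmp$n_listed)] <- 0L  # no individual records at all

# Keep rows where the declared and listed counts disagree,
# or where the family is absent from dataset1
mismatches <- cmp[is.na(cmp$number_family_member) |
                    cmp$number_family_member != cmp$n_listed, ]
mismatches
```

For the sample data this keeps family_id 2 and 3, the same families the join-based views flag as FALSE.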

We can use count to count the number of family members and create a new data frame df3, and then use setequal to compare df1 and df3.
library(dplyr)
df3 <- df2 %>%
count(family_id, house_id) %>%
rename(number_family_member = n)
setequal(df1, df3)
# FALSE: Rows in x but not y: 2, 3. Rows in y but not x: 2, 3.
DATA
df1 <- read.table(text = "family_id house_id number_family_member
1 1052 2
2 5042 3
3 1111 2",
header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = "family_id house_id age gender
1 1052 24 male
1 1052 25 female
2 5042 23 male
2 5042 20 female
3 1111 1 male
3 1111 20 female
3 1111 21 female",
header = TRUE, stringsAsFactors = FALSE)

This can be done with aggregate and merge.
agg <- aggregate(family_id ~ factor(family_id), dataset2, length)
mrg <- merge(agg, dataset1[c(1, 3)], by.x = "factor(family_id)", by.y = "family_id")
result <- data.frame(family_id = dataset1$family_id)
result$Match <- ifelse(dataset1$number_family_member == mrg$family_id, "match", "mismatch")
result
# family_id Match
#1 1 match
#2 2 mismatch
#3 3 mismatch
rm(agg, mrg) # final clean up
DATA.
dataset1 <- read.table(text = "
family_id house_id number_family_member
1 1052 2
2 5042 3
3 1111 2
", header = TRUE)
dataset2 <- read.table(text = "
family_id house_id age gender
1 1052 24 male
1 1052 25 female
2 5042 23 male
2 5042 20 female
3 1111 1 male
3 1111 20 female
3 1111 21 female
", header = TRUE)

Related

R: Turning row data from one dataframe into column data by group in another

I have data in the following format:
ID Age Sex
1  29  M
2  32  F
3  18  F
4  89  M
5  45  M
and:
ID subID Type       Status Year
1  3     Car        Y
1  11    Toyota     NULL   2011
1  23    Kia        NULL   2009
2  5     Car        N
3  2     Car        Y
3  4     Honda      NULL   2019
3  7     Fiat       NULL   2006
3  8     Mitsubishi NULL   2020
4  1     Car        N
5  7     Car        Y
Each ID in the second table has a row specifying if they have a car, and additional rows stating the brand of car/s they own. Each person has a maximum of 3 cars. I want to simplify this data into a single table as so.
ID Age Sex Car? Car.1  Car1.year Car.2 Car2.year Car.3      Car3.year
1  29  M   Y    Toyota 2011      Kia   2009      NULL       NULL
2  32  F   N    NULL   NULL      NULL  NULL      NULL       NULL
3  18  F   Y    Honda  2019      Fiat  2006      Mitsubishi 2020
4  89  M   N    NULL   NULL      NULL  NULL      NULL       NULL
5  45  M   Y    NULL   NULL      NULL  NULL      NULL       NULL
I've tried using the mutate function in dplyr with the case_when function, but I can't check conditions in another dataframe. If I try to join the tables together, I would have multiple rows for each ID which I want to avoid. The non-standard set up of the second table makes things complicated. My only remaining idea is to switch to Python/Pandas and create a for loop that slowly loops through each ID, searches the second dataframe if the person has a car and the car brands, then mutates a column in the first dataframe. But given the size of my dataset, this would be inefficient and take a long time.
What is the best way to do this?
You can try the following code:
library(tidyverse)
df1
# A tibble: 5 x 3
ID Age Sex
<dbl> <dbl> <chr>
1 1 29 M
2 2 32 F
3 3 18 F
4 4 89 M
5 5 45 M
df2
# A tibble: 10 x 5
ID subID Type Status Year
<dbl> <dbl> <chr> <chr> <dbl>
1 1 3 Car Y NA
2 1 11 Toyota Y 2011
3 1 23 Kia Y 2009
4 2 5 Car N NA
5 3 2 Car Y NA
6 3 4 Honda Y 2019
7 3 7 Fiat Y 2006
8 3 8 Mitsubishi Y 2020
9 4 1 Clothed N NA
10 5 7 Clothed Y NA
df2 <- df2 %>% mutate(Status = if_else(Status == "NULL", "Y", Status))
df3 <- df2 %>% filter(!is.na(Year)) %>% group_by(ID) %>% mutate(index = row_number())
df4 <- df3 %>% pivot_wider(id_cols = c(ID), values_from = c(Type, Year), names_from = index )
So your desired output will be produced:
df1 %>% left_join(df2 %>% select(ID, Status) %>% distinct()) %>% left_join(df4)
# A tibble: 5 x 10
ID Age Sex Status Type_1 Type_2 Type_3 Year_1 Year_2 Year_3
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1 29 M Y Toyota Kia NA 2011 2009 NA
2 2 32 F N NA NA NA NA NA NA
3 3 18 F Y Honda Fiat Mitsubishi 2019 2006 2020
4 4 89 M N NA NA NA NA NA NA
5 5 45 M Y NA NA NA NA NA NA
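The same idea (number each person's cars, spread them into columns, then join back) can also be sketched in base R. This is only an illustration under stated assumptions: the flag row is the one with Type == "Car", the brand rows are all the others, and `people`, `vehicles`, `has_car`, `cars` and `wide` are names invented here.

```r
people <- data.frame(ID = c(1, 2, 3, 4, 5), Age = c(29, 32, 18, 89, 45),
                     Sex = c("M", "F", "F", "M", "M"))
vehicles <- data.frame(
  ID = c(1, 1, 1, 2, 3, 3, 3, 3, 4, 5),
  subID = c(3, 11, 23, 5, 2, 4, 7, 8, 1, 7),
  Type = c("Car", "Toyota", "Kia", "Car", "Car",
           "Honda", "Fiat", "Mitsubishi", "Car", "Car"),
  Status = c("Y", NA, NA, "N", "Y", NA, NA, NA, "N", "Y"),
  Year = c(NA, 2011, 2009, NA, NA, 2019, 2006, 2020, NA, NA)
)

# One flag row per person: does this ID own a car at all?
has_car <- vehicles[vehicles$Type == "Car", c("ID", "Status")]
names(has_car)[2] <- "Car"

# Brand rows only, numbered 1, 2, 3 within each ID
cars <- vehicles[vehicles$Type != "Car", c("ID", "Type", "Year")]
cars$index <- ave(cars$ID, cars$ID, FUN = seq_along)

# Spread to one row per ID: Type.1, Year.1, Type.2, Year.2, ...
wide <- reshape(cars, idvar = "ID", timevar = "index", direction = "wide")

# Join everything back onto the person table; IDs without cars get NA
result <- merge(merge(people, has_car, by = "ID"), wide, by = "ID", all.x = TRUE)
result
```

Because the numbering is done per ID before reshaping, the result has exactly one row per person regardless of how many cars they own.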

Is there a better way to add a new value/field for every key (sym) in a tibble instead of using mutate then pivot_longer?

I have the following table and I would like to apply a function (ret) to the values BY sym. However, instead of creating a new column with that result (simple mutate), I would like to keep the table in long format and create a new row (field/value) for each day/sym.
x <- tibble(day=rep(1:5,2),
sym=c(rep('a',5),rep('b',5)),
field=rep('price',10),
value=as.numeric(c(101:105,501:505))) %>%
arrange(day,sym)
> x
# A tibble: 10 x 4
day sym field value
<int> <chr> <chr> <dbl>
1 1 a price 101
2 1 b price 501
3 2 a price 102
4 2 b price 502
5 3 a price 103
6 3 b price 503
I can accomplish this task by mutate to create a new column and then pivot_longer and bind_rows but I have a feeling there is a more concise way...
Here is my solution:
ret <- function(x) c(NA,diff(x))/x
x2 <- x %>% group_by(sym) %>% mutate(ret=ret(value)) %>%
select(day,sym,ret) %>%
pivot_longer(cols=c(-day,-sym),names_to='field',values_to='value') %>%
bind_rows(x) %>%
ungroup() %>%
arrange(day,sym,field)
> x2
# A tibble: 20 x 4
day sym field value
<int> <chr> <chr> <dbl>
1 1 a price 101
2 1 a ret NA
3 1 b price 501
4 1 b ret NA
5 2 a price 102
6 2 a ret 0.00980
7 2 b price 502
8 2 b ret 0.00199
9 3 a price 103
10 3 a ret 0.00971
11 3 b price 503
12 3 b ret 0.00199
Thank you!! Please let me know your thoughts
D
There's no need to use bind_rows since you already have the price variable in the data.frame. If you rename value to price and don't remove it before pivoting, then you'll have both 'ret' and 'price' in your field variable without having to bind it back in:
x %>%
group_by(sym) %>%
mutate(ret = ret(value)) %>%
select(day, sym, ret, 'price' = value) %>%
pivot_longer(cols = c(-day, -sym),
names_to = 'field',
values_to = 'value')
# A tibble: 20 x 4
# Groups: sym [2]
day sym field value
<int> <chr> <chr> <dbl>
1 1 a ret NA
2 1 a price 101
3 1 b ret NA
4 1 b price 501
5 2 a ret 0.00980
6 2 a price 102
7 2 b ret 0.00199
8 2 b price 502
9 3 a ret 0.00971
10 3 a price 103
11 3 b ret 0.00199
12 3 b price 503
13 4 a ret 0.00962
14 4 a price 104
15 4 b ret 0.00198
16 4 b price 504
17 5 a ret 0.00952
18 5 a price 105
19 5 b ret 0.00198
20 5 b price 505
How about
library(dplyr)
x %>%
group_by(sym) %>%
mutate(value=ret(value), field="ret") %>%
full_join(x) %>%
arrange(day,sym,field)
which returns
Joining, by = c("day", "sym", "field", "value")
# A tibble: 20 x 4
# Groups: sym [2]
day sym field value
<int> <chr> <chr> <dbl>
1 1 a price 101
2 1 a ret NA
3 1 b price 501
4 1 b ret NA
5 2 a price 102
6 2 a ret 0.00980
7 2 b price 502
8 2 b ret 0.00199
9 3 a price 103
10 3 a ret 0.00971
11 3 b price 503
12 3 b ret 0.00199
13 4 a price 104
14 4 a ret 0.00962
15 4 b price 504
16 4 b ret 0.00198
17 5 a price 105
18 5 a ret 0.00952
19 5 b price 505
20 5 b ret 0.00198
Or replace the full_join(x) with rbind(x).
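For comparison, here is a minimal base R sketch of the same "compute per sym, relabel the field, stack" idea, with no pivoting at all. It reuses the question's ret function and data; `r` and `x2` are just illustrative names.

```r
ret <- function(x) c(NA, diff(x)) / x

x <- data.frame(day = rep(1:5, 2),
                sym = rep(c("a", "b"), each = 5),
                field = "price",
                value = as.numeric(c(101:105, 501:505)))

# Same rows as x, but with value replaced by the per-sym return
# and field relabelled; ave() applies ret() within each sym
r <- transform(x, field = "ret", value = ave(value, sym, FUN = ret))

# Stack the two long tables and sort
x2 <- rbind(x, r)
x2 <- x2[order(x2$day, x2$sym, x2$field), ]
head(x2)
```

Since both pieces already share the day/sym/field/value layout, stacking them is enough; a join is only needed if duplicate rows should be collapsed.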

Group records with time interval overlap

I have a data frame (N = 16) containing ID (numeric in the example below), w_from (date), and w_to (date). Each record represents a task.
Here’s the data in R.
ID <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
w_from <- c("2010-01-01","2010-01-05","2010-01-29","2010-01-29",
"2010-03-01","2010-03-15","2010-07-15","2010-09-10",
"2010-11-01","2010-11-30","2010-12-15","2010-12-31",
"2011-02-01","2012-04-01","2011-07-01","2011-07-01")
w_to <- c("2010-01-31","2010-01-15", "2010-02-13","2010-02-28",
"2010-03-16","2010-03-16","2010-08-14","2010-10-10",
"2010-12-01","2010-12-30","2010-12-20","2011-02-19",
"2011-03-23","2012-06-30","2011-07-31","2011-07-06")
df <- data.frame(ID, w_from, w_to)
df$w_from <- as.Date(df$w_from)
df$w_to <- as.Date(df$w_to)
I need to generate a group number, within each ID, for records whose time intervals overlap. Overlap is transitive here: if record #1 overlaps with record #2, and record #2 overlaps with record #3, then records #1, #2, and #3 all belong to the same group.
Likewise, if record #1 overlaps with both record #2 and record #3, but record #2 doesn't overlap with record #3, then records #1, #2, and #3 still belong to the same group.
In the example above and for ID=1, the first four records overlap.
Here is the final output:
Also, if this can be done using dplyr, that would be great!
Try this:
library(dplyr)
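df %>%
arrange(ID, w_from) %>% # sort within each ID; arrange() ignores groups by default
group_by(ID) %>%
# start a new group whenever the latest end date seen so far is before this start date
mutate(group = 1 + cumsum(
cummax(lag(as.numeric(w_to), default = first(as.numeric(w_to)))) < as.numeric(w_from)))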
# A tibble: 16 x 4
# Groups: ID [2]
ID w_from w_to group
<dbl> <date> <date> <dbl>
1 1 2010-01-01 2010-01-31 1
2 1 2010-01-05 2010-01-15 1
3 1 2010-01-29 2010-02-13 1
4 1 2010-01-29 2010-02-28 1
5 1 2010-03-01 2010-03-16 2
6 1 2010-03-15 2010-03-16 2
7 1 2010-07-15 2010-08-14 3
8 1 2010-09-10 2010-10-10 4
9 1 2010-11-01 2010-12-01 5
10 1 2010-11-30 2010-12-30 5
11 1 2010-12-15 2010-12-20 5
12 1 2010-12-31 2011-02-19 6
13 1 2011-02-01 2011-03-23 6
14 2 2011-07-01 2011-07-31 1
15 2 2011-07-01 2011-07-06 1
16 2 2012-04-01 2012-06-30 2
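The role of cummax is easier to see in a standalone function. The base R sketch below (with `overlap_group` as a hypothetical helper name) assumes the rows of one ID are already sorted by start date. It compares each start date with the latest end date seen so far, not just the previous row's end date, so a short record nested inside a long one does not split the group.

```r
overlap_group <- function(from, to) {
  from <- as.numeric(from)
  to <- as.numeric(to)
  # latest end date among all earlier rows (the first row is compared with itself)
  prev_max_to <- c(to[1], cummax(to)[-length(to)])
  # a new group starts when that latest end date is before this start date
  1 + cumsum(prev_max_to < from)
}

ID <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
w_from <- as.Date(c("2010-01-01","2010-01-05","2010-01-29","2010-01-29",
                    "2010-03-01","2010-03-15","2010-07-15","2010-09-10",
                    "2010-11-01","2010-11-30","2010-12-15","2010-12-31",
                    "2011-02-01","2012-04-01","2011-07-01","2011-07-01"))
w_to <- as.Date(c("2010-01-31","2010-01-15","2010-02-13","2010-02-28",
                  "2010-03-16","2010-03-16","2010-08-14","2010-10-10",
                  "2010-12-01","2010-12-30","2010-12-20","2011-02-19",
                  "2011-03-23","2012-06-30","2011-07-31","2011-07-06"))
df <- data.frame(ID, w_from, w_to)

# Sort by ID and start date, then number the groups separately for each ID
df <- df[order(df$ID, df$w_from), ]
df$group <- unlist(lapply(split(df, df$ID),
                          function(d) overlap_group(d$w_from, d$w_to)),
                   use.names = FALSE)
df
```

Note the strict `<`: a record that starts on the same day another one ends is treated as overlapping.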

Get difference with closest previous row in a group which meets criterion

I'm trying, for each row, to calculate the difference with the closest previous row belonging to the same group which meets a certain criterion.
Suppose I have the following dataframe:
s <- read.table(text = "Visit_num Patient Day Admitted
1 1 2015/01/01 Yes
2 1 2015/01/10 No
3 1 2015/01/15 Yes
4 1 2015/02/10 No
5 1 2015/03/08 Yes
6 2 2015/01/01 Yes
7 2 2015/04/01 No
8 2 2015/04/10 No
9 3 2015/04/01 No
10 3 2015/04/10 No", header = T, sep = "")
For each Visit_num and for each Patient, I'd like to get the difference with the closest row for which the patient was admitted (i.e. Yes). Note column day is ordered by day, and time unit for this example is days.
Here is what I wanted my dataframe to look like:
Visit_num Patient Day Admitted Diff_days
1 1 2015/01/01 Yes NA
2 1 2015/01/10 No 9
3 1 2015/01/15 Yes 14
4 1 2015/02/10 No 26
5 1 2015/03/08 Yes 52
6 2 2015/01/01 Yes NA
7 2 2015/04/01 No 90
8 2 2015/04/10 No 99
9 3 2015/04/01 No NA
10 3 2015/04/10 No NA
Any help is appreciated.
Here is an option with the tidyverse. Convert 'Day' to Date class and arrange by 'Patient' and 'Day'. Then, grouped by 'Patient', take the difference between adjacent 'Day' values, create a group 'grp' that starts a new run after each 'Yes' in 'Admitted', and take the cumulative sum of 'Diff_days' within each run.
library(tidyverse)
library(lubridate) # provides ymd(); not attached by older tidyverse versions
s %>%
mutate(Day = ymd(Day)) %>%
arrange(Patient, Day) %>%
group_by(Patient) %>%
mutate(Diff_days = c(NA, diff(Day))) %>%
# .add = TRUE keeps the Patient grouping (dplyr >= 1.0.0; older versions used add = TRUE)
group_by(grp = cumsum(lag(Admitted == "Yes", default = TRUE)), .add = TRUE) %>%
mutate(Diff_days = cumsum(replace_na(Diff_days, 0))) %>%
ungroup() %>%
select(-grp) %>%
mutate(Diff_days = na_if(Diff_days, 0))
# A tibble: 8 x 5
# Visit_num Patient Day Admitted Diff_days
# <int> <int> <date> <fct> <dbl>
#1 1 1 2015-01-01 Yes NA
#2 2 1 2015-01-10 No 9
#3 3 1 2015-01-15 Yes 14
#4 4 1 2015-02-10 No 26
#5 5 1 2015-03-08 Yes 52
#6 6 2 2015-01-01 Yes NA
#7 7 2 2015-04-01 No 90
#8 8 2 2015-04-10 No 99
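As a cross-check, the requirement can also be written directly as "current day minus the day of the most recent earlier admission". The base R sketch below does that per patient and covers all ten rows from the question, including patient 3, who is never admitted and should get NA throughout. `days_since_last_admission` is a hypothetical helper name, and the rows are assumed to be sorted by day within each patient.

```r
days_since_last_admission <- function(day, admitted) {
  day <- as.numeric(day)
  # admission day on 'Yes' rows, NA elsewhere
  adm_day <- ifelse(admitted, day, NA)
  # shift by one row so a visit never refers to itself
  prev <- c(NA, adm_day[-length(adm_day)])
  # position of the latest non-NA entry so far (0 if there is none yet)
  idx <- cummax(ifelse(is.na(prev), 0L, seq_along(prev)))
  last_adm <- c(NA, prev)[idx + 1]
  day - last_adm
}

s <- data.frame(
  Visit_num = 1:10,
  Patient = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  Day = as.Date(c("2015-01-01", "2015-01-10", "2015-01-15", "2015-02-10",
                  "2015-03-08", "2015-01-01", "2015-04-01", "2015-04-10",
                  "2015-04-01", "2015-04-10")),
  Admitted = c("Yes", "No", "Yes", "No", "Yes", "Yes", "No", "No", "No", "No")
)

# Sort, then apply the helper to each patient's rows in turn
s <- s[order(s$Patient, s$Day), ]
s$Diff_days <- unlist(lapply(split(s, s$Patient), function(d)
  days_since_last_admission(d$Day, d$Admitted == "Yes")), use.names = FALSE)
s
```

For the sample data this gives NA, 9, 14, 26, 52 for patient 1, NA, 90, 99 for patient 2, and NA, NA for patient 3, matching the desired table in the question.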

R: Calculating New Variable R Code

I have
id_1 id_2 name count total
1 001 111 a 15
2 001 111 b 3
3 001 111 sum 28 28
4 002 111 a 7
5 002 111 b 33
6 002 111 sum 48 48
I want the rows that share the same id_1 and id_2 to share the total, like
id_1 id_2 name count total
1 001 111 a 15 28
2 001 111 b 3 28
3 001 111 sum 28 28
4 002 111 a 7 48
5 002 111 b 33 48
6 002 111 sum 48 48
We can use fill from tidyr.
library(tidyr)
dat2 <- dat %>% fill(total, .direction = "up")
dat2
# id_1 id_2 name count total
# 1 1 111 a 15 28
# 2 1 111 b 3 28
# 3 1 111 sum 28 28
# 4 2 111 a 7 48
# 5 2 111 b 33 48
# 6 2 111 sum 48 48
DATA
dat <- read.table(text = " id_1 id_2 name count total
1 001 111 a 15 NA
2 001 111 b 3 NA
3 001 111 sum 28 28
4 002 111 a 7 NA
5 002 111 b 33 NA
6 002 111 sum 48 48",
header = TRUE, stringsAsFactors = FALSE)
Consider base R's ave to compute the group maximum (na.rm = TRUE to ignore NA):
df$total <- ave(df$total, df$id_1, df$id_2, FUN = function(i) max(i, na.rm = TRUE))
df
# id_1 id_2 name count total
# 1 1 111 a 15 28
# 2 1 111 b 3 28
# 3 1 111 sum 28 28
# 4 2 111 a 7 48
# 5 2 111 b 33 48
# 6 2 111 sum 48 48
Using zoo and data.table:
df <- read.table(text = "id_1 id_2 name count total
001 111 a 15 NA
001 111 b 3 NA
001 111 sum 28 28
002 111 a 7 NA
002 111 b 33 NA
002 111 sum 48 48",
header = TRUE, stringsAsFactors = FALSE) # create data
library(zoo) # load packages
library(data.table)
# convert df to data.table, then carry total forward and backward within each id pair
setDT(df)[, total := na.locf(na.locf(total, na.rm = FALSE), na.rm = FALSE, fromLast = TRUE), by = c("id_1", "id_2")]
Output:
id_1 id_2 name count total
1: 1 111 a 15 28
2: 1 111 b 3 28
3: 1 111 sum 28 28
4: 2 111 a 7 48
5: 2 111 b 33 48
6: 2 111 sum 48 48
Simple approach using the normal dplyr way:
dat %>% group_by(id_1, id_2) %>% mutate(total=count[name == "sum"])
Alternatively:
dat %>% group_by(id_1, id_2) %>% mutate(total=na.omit(total)[1])
id_1 id_2 name count total
<int> <int> <chr> <int> <int>
1 1 111 a 15 28
2 1 111 b 3 28
3 1 111 sum 28 28
4 2 111 a 7 48
5 2 111 b 33 48
6 2 111 sum 48 48
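Another way to look at the problem is as a lookup: pull the "sum" rows out into a small table of totals and merge it back on the two id columns. The base R sketch below assumes each id pair has at most one "sum" row; `totals` and `dat2` are illustrative names. A pair with no "sum" row ends up with NA rather than borrowing a total from a neighbouring group, which is what an ungrouped fill would do.

```r
dat <- data.frame(
  id_1 = c(1, 1, 1, 2, 2, 2),
  id_2 = 111,
  name = rep(c("a", "b", "sum"), 2),
  count = c(15, 3, 28, 7, 33, 48),
  total = c(NA, NA, 28, NA, NA, 48)
)

# Lookup table: one total per (id_1, id_2), taken from the "sum" rows
totals <- dat[dat$name == "sum", c("id_1", "id_2", "count")]
names(totals)[3] <- "total"

# Drop the sparse total column and merge the lookup back in
dat2 <- merge(dat[, c("id_1", "id_2", "name", "count")], totals,
              by = c("id_1", "id_2"), all.x = TRUE)
dat2
```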
