I would like to compare two data sets and identify specific instances of discrepancies between them (i.e., which variables were different).
While I have found out how to identify which records are not identical between the two data sets (using the function detailed here: http://www.cookbook-r.com/Manipulating_data/Comparing_data_frames/), I'm not sure how to flag which variables are different.
E.g.
Data set A:
id name dob vaccinedate vaccinename dose
100000 John Doe 1/1/2000 5/20/2012 MMR 4
100001 Jane Doe 7/3/2011 3/14/2013 VARICELLA 1
Data set B:
id name dob vaccinedate vaccinename dose
100000 John Doe 1/1/2000 5/20/2012 MMR 3
100001 Jane Doee 7/3/2011 3/24/2013 VARICELLA 1
100002 John Smith 2/5/2010 7/13/2013 HEPB 3
I want to identify which records are different, and which specific variable(s) have discrepancies. For example, the John Doe record has 1 discrepancy in dose, and the Jane Doe record has 2 discrepancies: in name and vaccinedate. Also, data set B has one additional record that was not in data set A, and I would want to identify these instances as well.
In the end, the goal is to find the frequency of the "types" of errors, e.g. how many records have a discrepancy in vaccinedate, vaccinename, dose, etc.
Thanks!
This should get you started, but there may be more elegant solutions.
First, establish df1 and df2 so others can reproduce quickly:
df1 <- structure(list(id = 100000:100001, name = structure(c(2L, 1L), .Label = c("Jane Doe","John Doe"), class = "factor"), dob = structure(1:2, .Label = c("1/1/2000", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L), .Label = c("3/14/2013", "5/20/2012"), class = "factor"), vaccinename = structure(1:2, .Label = c("MMR", "VARICELLA"), class = "factor"), dose = c(4L, 1L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(id = 100000:100002, name = structure(c(2L, 1L, 3L), .Label = c("Jane Doee", "John Doe", "John Smith"), class = "factor"), dob = structure(c(1L, 3L, 2L), .Label = c("1/1/2000", "2/5/2010", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L, 3L), .Label = c("3/24/2013", "5/20/2012", "7/13/2013"), class = "factor"), vaccinename = structure(c(2L, 3L, 1L), .Label = c("HEPB", "MMR", "VARICELLA"), class = "factor"), dose = c(3L, 1L, 3L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -3L))
Next, get the discrepancies from df1 to df2 via mapply and setdiff. That is, what's in set one that's not in set two:
discrep <- mapply(setdiff, df1, df2)
discrep
# $id
# integer(0)
#
# $name
# [1] "Jane Doe"
#
# $dob
# character(0)
#
# $vaccinedate
# [1] "3/14/2013"
#
# $vaccinename
# character(0)
#
# $dose
# [1] 4
To count them up we can use sapply:
num.discrep <- sapply(discrep, length)
num.discrep
# id name dob vaccinedate vaccinename dose
# 0 1 0 1 0 1
To find values that are in set two but not in set one, reverse the process with mapply(setdiff, df2, df1). If you only care about the ids themselves, setdiff(df2$id, df1$id) is enough.
For more on R's functional functions (e.g., mapply, sapply, lapply, etc.) see this post.
Updating with a purrr solution:
library(purrr)
map2(df1, df2, setdiff) %>%
  map_int(length)
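One caveat with the setdiff approach: it compares each column as a set of values, so it ignores which row a value sits in. The small sketch below uses hypothetical toy data frames x and y (not from the question) in which two doses are swapped between ids; the set-based count reports nothing, while an element-wise comparison flags both rows. SIMPLIFY = FALSE is passed so that mapply always returns a list, even when every result happens to have the same length.

```r
# Hypothetical toy data: same ids, but the doses are swapped between them
x <- data.frame(id = 1:2, dose = c(1L, 2L))
y <- data.frame(id = 1:2, dose = c(2L, 1L))

# Set-based comparison: every value of x$dose also occurs somewhere in y$dose
set_based <- sapply(mapply(setdiff, x, y, SIMPLIFY = FALSE), length)

# Row-aligned comparison: counts the positions where the two columns differ
row_based <- colSums(x != y)

print(set_based)
print(row_based)
```

So if the goal is to know which record differs, a row-aligned comparison keyed on id is the safer choice.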
One possibility. First, find out which ids both datasets have in common. The simplest way to do this is:
commonID<-intersect(A$id,B$id)
Then you can determine which rows are missing from A by:
B[!B$id %in% commonID,]
# id name dob vaccinedate vaccinename dose
# 3 100002 John Smith 2/5/2010 7/13/2013 HEPB 3
Next, you can restrict both datasets to the ids they have in common.
Acommon<-A[A$id %in% commonID,]
Bcommon<-B[B$id %in% commonID,]
If you can't assume that the ids are in the same order in both data sets, sort them both:
Acommon<-Acommon[order(Acommon$id),]
Bcommon<-Bcommon[order(Bcommon$id),]
Now you can see what fields are different like this.
diffs<-Acommon != Bcommon
diffs
# id name dob vaccinedate vaccinename dose
# 1 FALSE FALSE FALSE FALSE FALSE TRUE
# 2 FALSE TRUE FALSE TRUE FALSE FALSE
This is a logical matrix, and you can do whatever you want with it. For example, to find the total number of errors in each column:
colSums(diffs)
# id name dob vaccinedate vaccinename dose
# 0 1 0 1 0 1
To find all ids where the name is different:
Acommon$id[diffs[,"name"]]
# [1] 100001
And so on.
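Putting the steps above together, here is a self-contained sketch that also answers the "frequency of error types" part of the question. The data is retyped from the question with a subset of its columns, and the columns are stored as character rather than factor on purpose: comparing two factor columns with != stops with "level sets of factors are different" when their levels differ. The names Ac, Bc, per_record, per_variable and only_in_B are just illustrative.

```r
# Data retyped from the question (subset of columns), as character columns
A <- data.frame(
  id = c(100000L, 100001L),
  name = c("John Doe", "Jane Doe"),
  vaccinedate = c("5/20/2012", "3/14/2013"),
  dose = c(4L, 1L),
  stringsAsFactors = FALSE
)
B <- data.frame(
  id = c(100000L, 100001L, 100002L),
  name = c("John Doe", "Jane Doee", "John Smith"),
  vaccinedate = c("5/20/2012", "3/24/2013", "7/13/2013"),
  dose = c(3L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Align both data sets on the ids they share, in the same order
common <- intersect(A$id, B$id)
Ac <- A[match(common, A$id), ]
Bc <- B[match(common, B$id), ]

diffs <- Ac != Bc                 # logical matrix, TRUE where a field differs
per_record <- rowSums(diffs)      # number of discrepancies for each shared id
names(per_record) <- common
per_variable <- colSums(diffs)    # number of records differing in each variable
only_in_B <- setdiff(B$id, A$id)  # records present in B but missing from A

print(per_record)
print(per_variable)
print(only_in_B)
```

per_variable is the tally of error types by column, and per_record shows how many fields are off for each id.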
There is a new package called waldo:
install.packages("waldo")
library(waldo)
# construct the data frames
df1 <- structure(list(id = 100000:100001, name = structure(c(2L, 1L), .Label = c("Jane Doe","John Doe"), class = "factor"), dob = structure(1:2, .Label = c("1/1/2000", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L), .Label = c("3/14/2013", "5/20/2012"), class = "factor"), vaccinename = structure(1:2, .Label = c("MMR", "VARICELLA"), class = "factor"), dose = c(4L, 1L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(id = 100000:100002, name = structure(c(2L, 1L, 3L), .Label = c("Jane Doee", "John Doe", "John Smith"), class = "factor"), dob = structure(c(1L, 3L, 2L), .Label = c("1/1/2000", "2/5/2010", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L, 3L), .Label = c("3/24/2013", "5/20/2012", "7/13/2013"), class = "factor"), vaccinename = structure(c(2L, 3L, 1L), .Label = c("HEPB", "MMR", "VARICELLA"), class = "factor"), dose = c(3L, 1L, 3L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -3L))
# compare them
compare(df1,df2)
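The output looks like the following. Note that this sample comes from a different pair of data frames (with columns X, Y and Z), not from df1 and df2 above: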
`old` is length 2
`new` is length 3
`names(old)`: "X" "Y"
`names(new)`: "X" "Y" "Z"
`attr(old, 'row.names')`: 1 2 3
`attr(new, 'row.names')`: 1 2 3 4
`old$X`: 1 2 3
`new$X`: 1 2 3 4
`old$Y`: "a" "b" "c"
`new$Y`: "A" "b" "c" "d"
`old$Z` is absent
`new$Z` is a character vector ('k', 'l', 'm', 'n')
library(compareDF)
compare_df(dataframe1, dataframe2, c("columnname"))
My two dataframes are:
df1<-structure(list(header1 = structure(1:4, .Label = c("a", "b",
"c", "d"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
and
df2<-structure(list(sample_x = structure(c(1L, 1L, 2L, 3L), .Label = c("0",
"a", "c"), class = "factor"), sample_y = structure(c(1L, 3L,
2L, 4L), .Label = c("0", "a", "m", "t"), class = "factor"), sample_z = structure(c(3L,
2L, 1L, 1L), .Label = c("0", "a", "c"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
A 0 in df2 means there is no value.
Now I want to overlap df1 and df2 to produce an output data frame (df3):
df3<-structure(list(sample_x = c(2L, 2L, 0L), sample_y = c(1L, 3L,
2L), sample_z = c(2L, 2L, 0L)), class = "data.frame", row.names = c("overlap_df1_df2",
"unique_df1", "unique_df2"))
I tried the data.table function foverlaps:
setkeyv(df1, names(df1))
setkeyv(df2, names(df2))
df3<-foverlaps(df1,df2)
But it seems I need some common column names in the two data frames, which is not the case here.
Thank you!
Loop through columns, and use set operations:
sapply(df2, function(i){
  # drop NAs and the "0" placeholders that mark missing values in df2
  x = i[ !is.na(i) & i != "0" ]
  o = intersect(df1$header1, x)
  u_df1 = setdiff(df1$header1, o)
  u_df2 = setdiff(x, o)
  c(o = length(o),
    u_df1 = length(u_df1),
    u_df2 = length(u_df2))
})
# sample_x sample_y sample_z
# o 2 1 2
# u_df1 2 3 2
# u_df2 0 2 0
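For reference, here is a base R sketch that produces a data frame with the row names from the question's df3. It assumes the data is stored as character columns (retyped here from the dput output) and that "0" is the only placeholder for a missing value; count_sets is a hypothetical helper name.

```r
# Data retyped from the question as character columns
df1 <- data.frame(header1 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(
  sample_x = c("0", "0", "a", "c"),
  sample_y = c("0", "m", "a", "t"),
  sample_z = c("c", "a", "0", "0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: set sizes for one df2 column against the reference values
count_sets <- function(col, ref) {
  x <- unique(col[!is.na(col) & col != "0"])  # ignore NAs and "0" placeholders
  c(overlap_df1_df2 = length(intersect(ref, x)),
    unique_df1      = length(setdiff(ref, x)),
    unique_df2      = length(setdiff(x, ref)))
}

# vapply returns a 3 x 3 integer matrix; convert it to a data frame
df3 <- as.data.frame(vapply(df2, count_sets, integer(3), ref = df1$header1))
print(df3)
```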
A solution using map:
library(purrr)
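rbind(
  overlap = map_dbl(df2, ~length(intersect(df1$header1, .x))),
  unique_df1 = map_dbl(df2, ~length(setdiff(df1$header1, .x))),
  # count df2 values (ignoring the "0" placeholders) that are absent from df1
  unique_df2 = map_dbl(df2, ~length(setdiff(.x[.x != "0"], df1$header1)))
)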
sample_x sample_y sample_z
overlap 2 1 2
unique_df1 2 3 2
unique_df2 0 2 0
I have data where each row represents a household, and I would like to have one row per individual in the different households.
The data looks similar to this:
df <- data.frame(village = rep("aaa",5),household_ID = c(1,2,3,4,5),name_1 = c("Aldo","Giovanni","Giacomo","Pippo","Pippa"),outcome_1 = c("yes","no","yes","no","no"),name_2 = c("John","Mary","Cindy","Eva","Doron"),outcome_2 = c("yes","no","no","no","no"))
I would still like to keep the wide format of the data, just with one individual (and related outcome variables) per row. I could find examples that tell how to do the opposite, going from individual to grouped data using dcast, but I could not find examples of this problem I am facing now.
I have tried with melt
reshape2::melt(df, id.vars = "household_ID")
but that gives me the data in long format.
Any suggestions welcome...
Thank you
Use pivot_longer() in tidyr, and set ".value" in names_to to indicate new column names from the pattern of the original column names.
library(tidyr)
df %>%
pivot_longer(-c(village, household_ID),
names_to = c(".value", "n"),
names_sep = "_")
# # A tibble: 10 x 5
# village household_ID n name outcome
# <fct> <dbl> <chr> <fct> <fct>
# 1 aaa 1 1 Aldo yes
# 2 aaa 1 2 John yes
# 3 aaa 2 1 Giovanni no
# 4 aaa 2 2 Mary no
# 5 aaa 3 1 Giacomo yes
# 6 aaa 3 2 Cindy no
# 7 aaa 4 1 Pippo no
# 8 aaa 4 2 Eva no
# 9 aaa 5 1 Pippa no
# 10 aaa 5 2 Doron no
Data
df <- structure(list(village = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "aaa", class = "factor"),
household_ID = c(1, 2, 3, 4, 5), name_1 = structure(c(1L,
3L, 2L, 5L, 4L), .Label = c("Aldo", "Giacomo", "Giovanni",
"Pippa", "Pippo"), class = "factor"), outcome_1 = structure(c(2L,
1L, 2L, 1L, 1L), .Label = c("no", "yes"), class = "factor"),
name_2 = structure(c(4L, 5L, 1L, 3L, 2L), .Label = c("Cindy",
"Doron", "Eva", "John", "Mary"), class = "factor"), outcome_2 = structure(c(2L,
1L, 1L, 1L, 1L), .Label = c("no", "yes"), class = "factor")), class = "data.frame", row.names = c(NA, -5L))
I need to merge two tables in R.
The table X looks this way:
company_name country_code country cost1 cost2
1 Test1 FR <NA> NA 9.945000e-02
2 Test1 BR Brazil NA NA
3 Test2 <NA> USA 1 1.053000e-01
The table Y looks this way:
country country_code tier
France FR 1
Brazil BR 2
USA US 1
I need to merge X and Y to get Z:
name country_code tier
Test1 FR 1
Test2 BR 2
....
How can I merge on an OR condition, i.e. match on country_code or, where that is missing, on country?
The following will do it. Note that I use a function from package zoo, so you will need to have it installed.
m <- merge(df1, df2, all = TRUE)
m$country <- zoo::na.locf(m$country)
m <- lapply(split(m, m$country), function(.m) zoo::na.locf(.m, fromLast = TRUE))
m <- lapply(m, function(.m) zoo::na.locf(.m))
m <- do.call(rbind, m)
m <- m[!duplicated(m), c(3, 2, 4)]
row.names(m) <- NULL
m
# name country_code tier
#1 First FR 1
#2 Third US 1
#3 Second BR 2
DATA.
df1 <-
structure(list(name = structure(1:3, .Label = c("First", "Second",
"Third"), class = "factor"), country = structure(c(1L, NA, 2L
), .Label = c("France", "USA"), class = "factor"), country_code = structure(c(NA,
1L, 2L), .Label = c("BR", "US"), class = "factor")), .Names = c("name",
"country", "country_code"), class = "data.frame", row.names = c(NA,
-3L))
df2 <-
structure(list(country = structure(c(2L, 1L, 3L), .Label = c("Brazil",
"France", "USA"), class = "factor"), country_code = structure(c(2L,
1L, 3L), .Label = c("BR", "FR", "US"), class = "factor"), tier = c(1L,
2L, 1L)), .Names = c("country", "country_code", "tier"), class = "data.frame", row.names = c(NA,
-3L))
EDIT.
After the comments and the question edit by the OP, the input data has changed and the following code and new df1 reflect that change.
fun <- function(DF, col){
sp <- split(DF, DF[[col]])
m <- lapply(sp, function(.m) zoo::na.locf(.m, fromLast = TRUE))
m <- lapply(m, function(.m) zoo::na.locf(.m))
m <- do.call(rbind, m)
row.names(m) <- NULL
m
}
m <- merge(df1, df2, all = TRUE)
m$country <- zoo::na.locf(m$country)
m$country_code <- zoo::na.locf(m$country_code)
m <- fun(m, "country_code")
m <- m[!duplicated(m), ]
m
# country_code country company_name cost1 cost2 tier
#1 BR Brazil Test <NA> 0.0819 2
#2 FR France Test <NA> 0.09945 1
#4 US USA Test <NA> 0.1053 1
df1 <-
structure(list(company_name = structure(c(1L, 1L, 1L), .Label = "Test", class = "factor"),
country_code = structure(c(2L, 1L, NA), .Label = c("BR",
"FR"), class = "factor"), country = structure(c(NA, 1L, 2L
), .Label = c("Brazil", "USA"), class = "factor"), cost1 = c(NA,
NA, NA), cost2 = c(0.09945, 0.0819, 0.1053)), .Names = c("company_name",
"country_code", "country", "cost1", "cost2"), class = "data.frame", row.names = c("1",
"2", "3"))
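One way to read "merge by OR condition" is: look each row up by country_code first and fall back to country where the code gives no match. The following is a rough base R sketch of that idea, not the zoo-based approach used above. X and Y are retyped from the question (cost columns omitted), and it assumes Y itself has no missing codes or names.

```r
# Tables retyped from the question, without the cost columns
X <- data.frame(
  company_name = c("Test1", "Test1", "Test2"),
  country_code = c("FR", "BR", NA),
  country = c(NA, "Brazil", "USA"),
  stringsAsFactors = FALSE
)
Y <- data.frame(
  country = c("France", "Brazil", "USA"),
  country_code = c("FR", "BR", "US"),
  tier = c(1L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Row of Y matched by code; NA where the code is missing or unknown
idx <- match(X$country_code, Y$country_code)

# Fall back to the country name for rows that are still unmatched
by_name <- match(X$country, Y$country)
idx[is.na(idx)] <- by_name[is.na(idx)]

Z <- data.frame(
  name = X$company_name,
  country_code = Y$country_code[idx],
  tier = Y$tier[idx],
  stringsAsFactors = FALSE
)
print(Z)
```

Rows that match on neither key end up with NA in country_code and tier, which makes them easy to spot afterwards.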
I am experiencing a little problem using the matches function from dplyr package.
From this dataset, I would like to extract the column names starting with enj
enj1 enj2 Enjm
bbc 1 1 2
bca 1 1 2
With grepl, I can do this
dt[, grepl('enj', colnames(dt))]
and get
enj1 enj2
bbc 1 1
bca 1 1
However the function matches does not give me the correct answer
library(dplyr)
dt %>% select(matches('enj') )
# or
dt %>% select(matches('^enj') )
Any idea why ?
dt = structure(list(enj1 = structure(c(1L, 1L), .Names = c("bbc",
"bca"), .Label = "1", class = "factor"), enj2 = structure(c(1L,
1L), .Names = c("bbc", "bca"), .Label = "1", class = "factor"),
Enjm = structure(c(1L, 1L), .Names = c("bbc", "bca"), .Label = "2", class = "factor")), .Names = c("enj1",
"enj2", "Enjm"), row.names = c("bbc", "bca"), class = "data.frame")
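That's because matches() ignores case by default (ignore.case = TRUE), so 'enj' also matches Enjm. Set ignore.case = FALSE to make the match case-sensitive:
dt %>% select(matches('^enj', ignore.case = FALSE))
#     enj1 enj2
# bbc    1    1
# bca    1    1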
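The same difference can be reproduced in base R, where grepl() is case-sensitive unless told otherwise. A small sketch using just the column names from the question:

```r
cols <- c("enj1", "enj2", "Enjm")

# grepl() is case-sensitive by default, so "Enjm" is not matched
case_sensitive <- cols[grepl("^enj", cols)]

# With ignore.case = TRUE the pattern also picks up "Enjm",
# which mirrors what matches() does by default
case_insensitive <- cols[grepl("^enj", cols, ignore.case = TRUE)]

print(case_sensitive)
print(case_insensitive)
```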
I am studying this webpage, and cannot figure out how to rename freq to something else, say number of times imbibed
Here is the dput output, assigned to bevs:
bevs <- structure(list(name = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L), .Label = c("Bill", "Llib"), class = "factor"), drink = structure(c(2L,
3L, 1L, 4L, 2L, 3L, 1L, 4L), .Label = c("cocoa", "coffee", "tea",
"water"), class = "factor"), cost = 1:8), .Names = c("name",
"drink", "cost"), row.names = c(NA, -8L), class = "data.frame")
And this is working code with output. Again, I'd like to rename the freq column. Thanks!
library(plyr)
bevs$cost <- as.integer(bevs$cost)
count(bevs, "name")
Output
name freq
1 Bill 4
2 Llib 4
Are you trying to do this?
counts <- count(bevs, "name")
names(counts) <- c("name", "number of times imbibed")
counts
The count() function returns a data.frame. Just rename it like any other data.frame:
counts <- count(bevs, "name")
names(counts)[which(names(counts) == "freq")] <- "number of times imbibed"
print(counts)
# name number of times imbibed
# 1 Bill 4
# 2 Llib 4
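If plyr is not otherwise needed, a similar table can be built in base R, where the count column can be named up front through the responseName argument of as.data.frame(). This is a sketch on a cut-down version of bevs; times_imbibed is an arbitrary name. Keep in mind that a column name containing spaces is not a syntactic R name, so it has to be accessed with [[ ]] or backticks.

```r
# Cut-down version of the question's data
bevs <- data.frame(
  name = rep(c("Bill", "Llib"), 4),
  cost = 1:8,
  stringsAsFactors = FALSE
)

# Count rows per name and choose the count column's name directly
counts <- as.data.frame(table(name = bevs$name), responseName = "times_imbibed")

# Renaming to a label with spaces works, but it then needs [[ ]] or backticks
names(counts)[names(counts) == "times_imbibed"] <- "number of times imbibed"
imbibed <- counts[["number of times imbibed"]]

print(counts)
```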