CSV to CSV comparison in R - r

I need to compare two CSV files in R and write out the records that do not match between the two files. I was able to do this with the code below:
library(dplyr)
a <- c("ads", "ads", "abc")
b <- c(121, 345, 23.300)
c <- c(21,22,23)
srce <- cbind.data.frame(a,b,c)
d <- c("ads", "ds", "abc")
e <- c(121, 345, 23)
f <- c(21,22,23)
trgt <- cbind.data.frame(d, e, f)
colnames(trgt) <- colnames(srce)
#Compare csv files
nn <- anti_join(srce, trgt)
The final output gives me the rows with a mismatch,
but I need to find the cells that mismatch between the two files.
Is there a way to identify the mismatching cells rather than the entire records?
Thanks
Balaji.SJ

If you use the stringsAsFactors = FALSE argument with cbind.data.frame, a simple logical comparison will do the trick:
library(dplyr)
a <- c("ads", "ads", "abc")
b <- c(121, 345, 23.300)
c <- c(21,22,23)
srce <- cbind.data.frame(a,b,c, stringsAsFactors = FALSE)
d <- c("ads", "ds", "abc")
e <- c(121, 345, 23)
f <- c(21,22,23)
trgt <- cbind.data.frame(d, e, f, stringsAsFactors = FALSE)
colnames(trgt) <- colnames(srce)
# logical comparison:
srce == trgt
a b c
[1,] TRUE TRUE TRUE
[2,] FALSE TRUE TRUE
[3,] TRUE FALSE TRUE
To compare two dataframes and extract the column names of columns with nonmatches, calculate the colSums of the matrix srce != trgt and display the names of all columns with colSums > 0:
names(which(colSums(srce != trgt, na.rm = TRUE) > 0))
[1] "a" "b"
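To go one step further and list the individual mismatching cells, one option is which() with arr.ind = TRUE on the same comparison matrix. The following is a minimal sketch, assuming both data frames have identical dimensions and column order and contain no NA values; the names value_at and mismatches are only illustrative.

```r
srce <- data.frame(a = c("ads", "ads", "abc"), b = c(121, 345, 23.3),
                   c = c(21, 22, 23), stringsAsFactors = FALSE)
trgt <- data.frame(a = c("ads", "ds", "abc"), b = c(121, 345, 23),
                   c = c(21, 22, 23), stringsAsFactors = FALSE)

# Row/column positions of every cell that differs
idx <- which(srce != trgt, arr.ind = TRUE)

# Hypothetical helper: fetch each indexed cell as a string
value_at <- function(df, idx) {
  vapply(seq_len(nrow(idx)),
         function(k) as.character(df[idx[k, "row"], idx[k, "col"]]),
         character(1))
}

mismatches <- data.frame(
  row    = unname(idx[, "row"]),
  column = colnames(srce)[idx[, "col"]],
  source = value_at(srce, idx),
  target = value_at(trgt, idx),
  stringsAsFactors = FALSE
)
mismatches
```

Each row of mismatches then names one differing cell together with its value in both files.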

Related

R function to see if combinations of multiple vectors match a target vector

Given a set of any # of vectors:
a<-c("giraffe", "dolphin", "pig")
b<-c("elephant" , "pig")
c<-c("zebra","cobra","spider","porcupine")
d<-c("porcupine")
e<-c("spider","cobra")
f<-c("elephant","pig","porcupine")
and a target vector:
target<- c("elephant" , "pig","cobra","spider","porcupine")
Is there a way to check if any combinations of the vectors can match the target vector (order doesn't matter)?
In this case, answers would be:
b,d,e
e,f
Clarifying:
I need to know which combinations exactly match the target vector with no duplicates. Any answers that would repeat a value (e.x. b,d,e,f) would not work.
The solutions shown in the question consist of non-overlapping vectors, so we assume that this is a requirement: we are looking to partition the target into disjoint vectors that cover it. If the vectors may overlap, then use >= instead of = or == in the constraints involving A below.
The assumed problem is known as a set partitioning problem and the problem with overlaps is known as a set covering problem.
Using the list of vectors L and the target shown in the Note at the end, form the objective (all ones), the incidence matrix A of animals by vectors, and the right-hand side rhs of the constraint equations derived from the target, then run the linear program shown.
If a solution is found then we add a constraint that will eliminate it in the next iteration by insisting that at least one of its zeros be one. We iterate 5 times (i.e. up to 5 solutions) or until we can find no more solutions.
We show a solution using the lpSolveAPI package and then in the section after that repeat it using the CVXR package.
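Before bringing in a solver, it may help to see the constraints on their own. The sketch below uses base R only and rebuilds L, A and rhs in a self-contained way (using sapply rather than the outer call used later, which is an assumption made for readability). It checks that the selection e,f satisfies the equality constraints, while b,d,e,f satisfies only the >= (covering) version.

```r
L <- list(
  a = c("giraffe", "dolphin", "pig"),
  b = c("elephant", "pig"),
  c = c("zebra", "cobra", "spider", "porcupine"),
  d = c("porcupine"),
  e = c("spider", "cobra"),
  f = c("elephant", "pig", "porcupine")
)
target <- c("elephant", "pig", "cobra", "spider", "porcupine")

animals <- sort(unique(unlist(L)))
# Incidence matrix: one row per animal, one column per vector
A <- sapply(L, function(v) as.integer(animals %in% v))
rownames(A) <- animals
rhs <- as.integer(animals %in% target)

# x selects vectors e and f
x_ef <- as.integer(names(L) %in% c("e", "f"))
is_partition <- all(A %*% x_ef == rhs)

# x selects b, d, e and f: every target animal is covered, some twice
x_bdef <- as.integer(names(L) %in% c("b", "d", "e", "f"))
is_cover_only <- all(A %*% x_bdef >= rhs) && !all(A %*% x_bdef == rhs)
```

A row of A %*% x counts how many selected vectors contain that animal, which is why equality with rhs rules out both duplicates and animals outside the target.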
lpSolveAPI
library(lpSolveAPI)
animals <- sort(unique(unlist(L)))
A <- +outer(animals, L, Vectorize(`%in%`))
rownames(A) <- animals
nr <- nrow(A)
nc <- ncol(A)
rhs <- rownames(A) %in% target
lp <- make.lp(nr, nc)
set.objfn(lp, rep(1, nc))
for(i in 1:nr) add.constraint(lp, A[i, ], "=", rhs[i])
for(j in 1:nc) set.type(lp, j, type = "binary")
soln <- solns <- NULL
for(s in 1:5) {
if (!is.null(soln)) add.constraint(lp, 1-soln, ">=", 1)
if (solve(lp) != 0) break
soln <- get.variables(lp)
solns <- c(solns, list(names(L)[soln == 1]))
}
solns
## [[1]]
## [1] "e" "f"
##
## [[2]]
## [1] "b" "d" "e"
CVXR
An alternative to lpSolve is CVXR. We use nc, A and rhs from above. Below we find up to 5 solutions.
library(CVXR)
x <- Variable(nc, boolean = TRUE)
objective <- Minimize(sum(x))
constraints <- list(A %*% x == matrix(rhs))
solns <- soln <- NULL
for(i in 1:5) {
if (!is.null(soln)) constraints <- c(constraints, sum((1 - soln) * x) >= 1)
prob <- Problem(objective, constraints)
result <- solve(prob)
if (result$status != "optimal") break
soln <- result$getValue(x)
solns <- c(solns, list(names(L)[soln == 1]))
}
solns
## [[1]]
## [1] "e" "f"
##
## [[2]]
## [1] "b" "d" "e"
Note
L <- within(list(), {
a <- c("giraffe", "dolphin", "pig")
b <- c("elephant" , "pig")
c <- c("zebra","cobra","spider","porcupine")
d <- c("porcupine")
e <- c("spider","cobra")
f <- c("elephant","pig","porcupine")
})
L <- L[order(names(L))]
target<- c("elephant" , "pig","cobra","spider","porcupine")
By first converting your vectors into a list l <- list(a = a, b = b, c = c, d = d, e = e, f = f)
In base R you can use lapply:
unlist(lapply(l, FUN = function(x) all(x %in% target)))
a b c d e f
FALSE TRUE FALSE TRUE TRUE TRUE
You could accomplish this with the purrr library function imap_lgl:
library(purrr)
purrr::imap_lgl(l, ~ all( . %in% target))
a b c d e f
FALSE TRUE FALSE TRUE TRUE TRUE
If you pipe the result into names(.)[.], you get a character vector of the matching names instead:
purrr::imap_lgl(l, ~ all( . %in% target)) %>%
names(.)[.]
[1] "b" "d" "e" "f"
Both of these solutions use all and the operator %in%. %in% tests each element of the LHS vector for membership in the RHS vector and returns one logical value per element:
a %in% target
[1] FALSE FALSE TRUE
all(a %in% target)
[1] FALSE
Since "giraffe" and "dolphin" are not in target the first two values return FALSE and the last value is TRUE since "pig" is in target. all tests if all values of a vector are TRUE. Since not all values of a are in target it returns FALSE.
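Note that all(x %in% target) only tests whether a vector is a subset of target. As a rough sketch of the difference, assuming an exact match means the same elements with nothing repeated, setequal and anyDuplicated from base R can be combined as below; exact_match is a hypothetical helper name.

```r
target <- c("elephant", "pig", "cobra", "spider", "porcupine")
b <- c("elephant", "pig")
e <- c("spider", "cobra")
f <- c("elephant", "pig", "porcupine")

# Subset test: TRUE, although b alone does not reproduce target
b_is_subset <- all(b %in% target)

# Hypothetical helper: same elements as target and no repeated value
exact_match <- function(v, target) {
  setequal(v, target) && !anyDuplicated(v)
}

b_exact  <- exact_match(b, target)        # b alone
ef_exact <- exact_match(c(e, f), target)  # e and f combined
bf_exact <- exact_match(c(b, f), target)  # b and f share elements
```

Here b_is_subset is TRUE while b_exact is FALSE, and only the combination of e and f passes the exact test.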
Try this:
Build a list with your vectors
vec_list <- list(a, b, c, d, e, f)
names(vec_list) <- c("a", "b", "c", "d", "e", "f")
Write a function that identifies matches
match_elem <- function(i, the_list, target) {
if (all( the_list[[i]] %in% target)) {
return(names(the_list)[[i]])
}
}
Apply match_elem to each element of the list
unlist(lapply(1:6, match_elem, vec_list, target))
> "b" "d" "e" "f"
A base R option using combn
lst <- list(a, b, c, d, e, f)
nms <- c("a", "b", "c", "d", "e", "f")
names(
Filter(
isTRUE,
unlist(
lapply(
seq_along(lst),
function(k) {
setNames(
combn(lst, k, FUN = function(v) !(length(setdiff(unlist(v), target)) + length(setdiff(target, unlist(v))))),
combn(nms, k, toString)
)
}
)
)
)
)
or
subset(
unlist(
lapply(
seq_along(nms), function(k) combn(nms, k, toString)
)
),
unlist(
lapply(
seq_along(lst),
function(k) combn(lst, k, FUN = function(v) !(length(setdiff(unlist(v), target)) + length(setdiff(target, unlist(v)))))
)
)
)
gives
[1] "e, f" "b, d, e" "b, e, f" "d, e, f" "b, d, e, f"
Update
If you do need to find exclusive combinations, i.e., without overlap, we can try the code below. The check requires that the combined vectors contain no repeated value and have exactly the same elements as target.
subset(
unlist(
lapply(
seq_along(nms), function(k) combn(nms, k, toString)
)
),
unlist(
lapply(
seq_along(lst),
function(k) combn(lst, k, FUN = function(v) !anyDuplicated(unlist(v)) && setequal(unlist(v), target))
)
)
)
or
names(
Filter(
isTRUE,
unlist(
lapply(
seq_along(lst),
function(k) {
setNames(
combn(lst, k, FUN = function(v) !anyDuplicated(unlist(v)) && setequal(unlist(v), target)),
combn(nms, k, toString)
)
}
)
)
)
)
which gives
[1] "e, f" "b, d, e"

Building a Tree with Node Pairs

I have a data.table of node pairs where Parent is higher up the tree than Child.
I need to extract all the individual chains from these rules e.g. if I have in format parent>child: (a>b, b>c, b>e, c>d), the chains are (a>b>c>d, a>b>e).
I've made an example with some dummy data showing what I want to do. Any suggestions on how to do this would be great? It feels like it should be straightforward but I'm struggling to think how to start. Thank you :)
library(data.table)
library(data.tree)
# example input and expected output
input <- data.table(Parent = c("a", "b", "c",
"e", "b"),
Child = c("b", "c", "d",
"b", "f"))
output <- data.table(Tree = c(rep(1,4), rep(2,3), rep(3,3), rep(4,4)),
List = c("a", "b", "c", "d",
"e", "b", "f",
"a", "b", "f",
"e", "b", "c", "d"),
Hierarchy = c(1:4, 1:3, 1:3, 1:4))
# attempt with data.tree, only builds the node pairs.
# ignore world part, was following: https://cran.r-project.org/web/packages/data.tree/vignettes/data.tree.html#tree-creation
input[, pathString := paste("world", Parent, Child, sep = "/")]
data.tree::as.Node(input)
# attempt to re-structure
input[, Tree := .I]
dt1 <- input[, .(List = c(Parent, Child),
Hierarchy = 1:2), by=Tree]
Here is another possible solution - a little messy as well though
Output
output(input)
# tree_nums elems hierarchy
# 1: 1 a 1
# 2: 1 b 2
# 3: 1 c 3
# 4: 1 d 4
# 5: 2 e 1
# 6: 2 b 2
# 7: 2 c 3
# 8: 2 d 4
# 9: 3 a 1
# 10: 3 b 2
# 11: 3 f 3
# 12: 4 e 1
# 13: 4 b 2
# 14: 4 f 3
#
Function
output <- function (input) {
# init
helper <- do.call(paste0, input)
elements <- unique(unlist(input))
res <- integer(length(elements))
ind <- elements %in% input$Child
# first generation
parents <- elements[!ind]
res[!ind] <- 1L
# later generations
val <- 1L
parents <- parents
trees <- setNames(as.list(seq_along(parents)), parents)
while (any(res == 0L)) {
val <- val + 1L
children <- unique(input$Child[input$Parent %in% parents])
res[elements %in% children] <- val
# create the tree
nextHelper <- expand.grid(parents, children)
nextHelper$conc <- do.call(paste0, nextHelper)
nextHelper <- nextHelper[nextHelper$conc %in% helper,]
df_1 <- do.call(rbind, strsplit(names(trees),''))
df_2 <- base::merge(df_1, nextHelper[,-3L], by.x = ncol(df_1), by.y = 'Var1', all.x = TRUE)
n1 <- ncol(df_2)
if (n1 > 2L) df_2 <- df_2[,c(2:(n1-1),1L,n1)]
df_2$Var2 <- as.character(df_2$Var2)
df_2$Var2[is.na(df_2$Var2)] <- ''
treeNames <- do.call(paste0, df_2)
trees <- setNames(as.list(seq_along(treeNames)), treeNames)
parents <- children
}
elems <- strsplit(names(trees),'')
tree_nums <- rep(as.integer(trees), lengths(elems))
elems <- unlist(elems)
output <- data.table::data.table(tree_nums,elems)
out <- data.table::data.table(elements, res)
output$hierarchy <- out$res[match(output$elems, out$elements)]
output
}
I have a solution after a bit of a slog, but would prefer something more efficient if it exists.
library(stringi)
# convert to string
setkey(input, Parent)
sep <- ">>"
split_regex <- "(?<=%1$s)[^(%1$s)]*$"
trees <- sprintf("%s%s%s", input$Parent, sep, input$Child)
# get the base nodes, the children
children <- stri_extract_first_regex(trees, sprintf(split_regex, sep),
simplify = TRUE)
# find those which are parents
grid <- input[J(unique(children)), ][!is.na(Child), ]
update <- unique(grid$Parent)
N <- nrow(grid)
while(N > 0){
# add the children on for the ones at the base of the chains, might mean
# making more tree splits
all_trees <- unique(unlist(lapply(update, function(x){
pos <- children == x
y <- grid[Parent %in% x, Child]
trees <- c(trees[!pos], CJ(trees[pos], y)[, sprintf("%s%s%s", V1, sep, V2)])
trees
})))
# I have some trees embedded now, so remove these ones
trim <- sapply(seq_along(all_trees), function(i){
any(stri_detect_fixed(all_trees[-i], all_trees[i]))
})
trees <- all_trees[!trim]
# update operations on expanded trees until no children remain with a dependency
children <- stri_extract_first_regex(trees, sprintf(split_regex, sep, sep),
simplify = TRUE)
grid <- input[J(unique(children)), ][!is.na(Child), ]
update <- unique(grid$Parent)
N <- nrow(grid)
}
# re-structure to appropriate format
output <- data.table(pattern = trees)
output[, Tree := 1:.N]
output[, split := stri_split_regex(pattern, sep)]
output <- output[, .(List = split[[1]],
Hierarchy = 1:length(split[[1]])), by=Tree]
output[]
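For comparison, the chains can also be enumerated with a plain depth-first walk from each root. This is only a sketch in base R, assuming the node pairs contain no cycles; the names edges, walk and chains are illustrative and a data.frame stands in for the data.table.

```r
edges <- data.frame(Parent = c("a", "b", "c", "e", "b"),
                    Child  = c("b", "c", "d", "b", "f"),
                    stringsAsFactors = FALSE)

# Roots are parents that never appear as a child
roots <- setdiff(edges$Parent, edges$Child)

# Extend a path until its last node has no children;
# returns a list of complete root-to-leaf paths
walk <- function(path) {
  kids <- edges$Child[edges$Parent == path[length(path)]]
  if (length(kids) == 0) return(list(path))
  do.call(c, lapply(kids, function(k) walk(c(path, k))))
}

chains <- do.call(c, lapply(roots, walk))

# Long format: one row per node, numbered within its chain
output <- data.frame(Tree      = rep(seq_along(chains), lengths(chains)),
                     List      = unlist(chains),
                     Hierarchy = sequence(lengths(chains)),
                     stringsAsFactors = FALSE)
```

For the example input this yields four chains (a>b>c>d, a>b>f, e>b>c>d, e>b>f); the recursion would not terminate on cyclic input, so a guard would be needed in that case.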

Compare every n rows and show boolean vector

I have a similar issue to the one in this question: Compare every 2 rows and show mismatches in R
I would like to compare not only 2 rows but, for example, 3, 4, etc.
I have a data.table here:
DT <- data.table(A = rep(1:2, 2), B = rep(1:4, 2),
C = rep(1:2, 1), key = "A")
Then I use
dfs <- split(DT, DT$A)
comp <- function(x) sapply(x, function(u) u[1]==u[2])
matches <- sapply(dfs, comp)
For 3 rows :
comp <- function(x) sapply(x, function(u) u[1]==u[2] & u[1]==u[3])
Is that accurate? How can I generalize it in more elegant way?
try this:
comp2 <- function(dt, i, rws){
k <- length(rws)
tmp <- as.numeric(dt[i])
tmp <- as.data.table(matrix(rep(tmp, k), nrow = k, byrow = TRUE, dimnames = list(NULL, colnames(dt))))
ans <- (dt[rws] == tmp)
ans
}
this function takes three arguments:
-> dt -- your data.table (or sub-data.tables obtained from splitting your original one, up to you)
-> i -- the row you want to compare
-> rws -- vector of row numbers you want to compare i with (e.g. c(2,3,4) would compare i with rows 2, 3 and 4)
it then creates a new data.table that consists of row i stacked k times, so that a data.frame to data.frame comparison is possible.
example:
comp2(DT, 1, c(2, 3, 4))
# A B C
#[1,] TRUE FALSE TRUE
#[2,] FALSE FALSE FALSE
#[3,] FALSE FALSE FALSE
compares row 1 of your data.table DT to rows 2, 3 and 4.
if you want the output to tell you, column by column, whether your chosen row matches all of the rows you are comparing it to (FALSE meaning it differs from at least one of them), return colSums(ans) == k instead of ans.
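The per-group version from the question can also be generalized to any number of rows by comparing every value in a column with the first one. A minimal sketch, assuming a plain data.frame without NA values (the example data below is made up for illustration, not the DT from the question):

```r
df <- data.frame(A = rep(1:2, 4), B = rep(1:4, 2), C = rep(1:2, 4))

# TRUE when every value in the column equals the first one
all_same <- function(u) all(u == u[1])

groups  <- split(df, df$A)
matches <- sapply(groups, function(g) sapply(g, all_same))
matches
```

matches has one row per column of df and one column per group, so it works unchanged whether a group holds 2, 3 or more rows.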

Selecting logical vector elements based on certain conditions

I have a situation where I would like to detect conditions between two logical, named vectors based on the TRUE / FALSE combination at each position in the vector. For example:
x <- c(TRUE, FALSE, FALSE, TRUE)
names(x) <- c("a", "b", "c", "d")
y <- c(TRUE, TRUE, FALSE, FALSE)
names(y) <- names(x)
For each element in these two vectors I want to detect 3 conditions:
x[i] is TRUE and y[i] is TRUE;
x[i] is FALSE and y[i] is TRUE,
x[i] is TRUE and y[i] is FALSE.
The length of x and y are the same but could be longer than this example. I want to retrieve the name of the element for each condition and assign the element name to a new variable. For this example:
v1 <- "a"
v2 <- "b"
v3 <- "d"
In a longer version of these two vectors I might end up with something like:
v1 <- c("a", "e")
v2 <- c("b", "f", "g")
v3 <- c("d", "i", "k", "l")
What is the best vectorized way to do this. I think it is simple but I am unable to come up with the answer. Thanks in advance.
We can efficiently use split, but before that, we need a single grouping index. Here is a possibility:
g <- x + y + x
split(names(x), g)
To understand the above grouping index, consider this:
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)
x + y + x
#[1] 3 2 1 0
So you can see that 4 combinations of TRUE and FALSE are mapped to 4 integer values.
Ah, so "a" get assigned to T-T, "b" to T-F, etc. But, why the x + y + x?? I don't follow adding x twice.
If you only do x + y, the result is only 0, 1 and 2. You won't be able to differentiate T-F and F-T as they are both 1.
@thelatemail offers a more readable way:
split(names(x), interaction(x, y, drop=TRUE))
Update
Ah... stupid me... Why did I bother creating g. I suddenly remember that we can pass a list to f argument in split:
split(names(x), list(x, y))
Note, internally in split.default:
if (is.list(f))
f <- interaction(f, drop = drop, sep = sep)
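Applied to the vectors from the question, the list form of split can be sketched as follows. The group labels come from interaction's default "." separator, and the direct logical indexing at the end is shown only as a cross-check.

```r
x <- c(a = TRUE, b = FALSE, c = FALSE, d = TRUE)
y <- c(a = TRUE, b = TRUE,  c = FALSE, d = FALSE)

# One group per TRUE/FALSE combination, labelled "<x>.<y>"
groups <- split(names(x), list(x, y))

v1 <- groups[["TRUE.TRUE"]]   # x TRUE,  y TRUE
v2 <- groups[["FALSE.TRUE"]]  # x FALSE, y TRUE
v3 <- groups[["TRUE.FALSE"]]  # x TRUE,  y FALSE

# Cross-check with plain logical indexing
v1_direct <- names(x)[x & y]
```

For only three conditions the direct indexing is arguably just as readable; split pays off when all combinations are needed at once.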

extract a single column

I have a list of 701 given csv files. Each one has the same number of columns (7) but different number of rows (between 25000 and 28000).
Here is an extract of the first file:
Date,Week,Week Day,Hour,Price,Volume,Sale/Purchase
18/03/2011,11,5,1,-3000.00,17416,Sell
18/03/2011,11,5,1,-1001.10,17427,Sell
18/03/2011,11,5,1,-1000.00,18055,Sell
18/03/2011,11,5,1,-500.10,18057,Sell
18/03/2011,11,5,1,-500.00,18064,Sell
18/03/2011,11,5,1,-400.10,18066,Sell
18/03/2011,11,5,1,-400.00,18066,Sell
18/03/2011,11,5,1,-300.10,18068,Sell
18/03/2011,11,5,1,-300.00,18118,Sell
I fitted a nonlinear regression of the supply curve for the ninth hour for the year 2012. The data for 2012 are in CSV files 290 to 654.
allenamen <- dir(pattern="*.csv")
alledat <- lapply(allenamen, read.csv, header = TRUE, sep = ",", stringsAsFactors = FALSE)
h <- list()
for(i in 290:654) {
g <- function(a, b, c, d, p) {a*atan(b*p+c)+d}
f <- nlsLM(Volume ~ g(a,b,c,d,Price), data=subset(alledat[[i-289]], (Hour==9) & (Sale.Purchase == "Sell") & (!Price %in% as.character(-50:150))), start = list(a=4000, b=0.1, c=-5, d=32000))
h[[i-289]] <- coef(f)
}
This works and I get the coefficients a, b, c and d for every day in 2012.
This is the head(h):
[[1]]
a b c d
2.513378e+03 4.668218e-02 -3.181322e+00 2.637142e+04
[[2]]
a b c d
2.803172e+03 6.696201e-02 -4.576432e+00 2.574454e+04
[[3]]
a b c d
3.298991e+03 5.817949e-02 -3.425728e+00 2.393888e+04
[[4]]
a b c d
2.150487e+03 3.810406e-02 -2.658772e+00 2.675609e+04
[[5]]
a b c d
2.326199e+03 3.044967e-02 -1.780965e+00 2.604374e+04
[[6]]
a b c d
2934.0193270 0.0302937 -1.9912913 26283.0300823
And this is dput(head(h)):
list(structure(c(2513.37818972349, 0.0466821822063123, -3.18132213466142,
26371.4241646124), .Names = c("a", "b", "c", "d")), structure(c(2803.17230054557,
0.0669620116294894, -4.57643230249848, 25744.5376725213), .Names = c("a",
"b", "c", "d")), structure(c(3298.99066895304, 0.0581794881246528,
-3.42572804902504, 23938.8754575156), .Names = c("a", "b", "c",
"d")), structure(c(2150.48734655237, 0.0381040636898022, -2.65877160023262,
26756.0907073567), .Names = c("a", "b", "c", "d")), structure(c(2326.19873555633,
0.0304496684589379, -1.7809654498454, 26043.735374657), .Names = c("a",
"b", "c", "d")), structure(c(2934.01932702805, 0.0302937043170001,
-1.99129130343521, 26283.0300823458), .Names = c("a", "b", "c",
"d")))
Now I am trying to get just a column with h$a but I get NULL. How can I get just the a column?
In addition to this I want to plot the single coefficients and Date. I tried this code:
koeffreihe <- function(x) {
files <- list.files(pattern="*.csv")
df <- data.frame()
for(i in 1:length(files)){
xx <- read.csv(as.character(files[i]))
xx <- subset(xx, Sale.Purchase == "Sell" & Hour == 3)
df <- rbind(df, xx)
g <- function(a, b, c, d, p) {a*atan(b*p+c)+d}
f <- nlsLM(Volume ~ g(a,b,c,d,Price), data=subset(alledat[[i]], (Hour==9) & (Sale.Purchase == "Sell") & (!Price %in% as.character(-50:150))), start = list(a=4000, b=0.1, c=-5, d=32000))
h[[i]] <- coef(f)
}
df$Date <- as.Date(as.character(df$Date), format="%d/%m/%Y")
plot(h$x ~ Date, df, xlim = as.Date(c("2012-01-01", "2012-12-31")))
}
koeffreihe(a)
But I get this error:
invalid type (NULL) for variable 'h$x'
So the problem is that h$a is NULL. If someone can fix this problem I guess the code will work too.
Thank you for your help!
First transform your list into a data.frame:
h.df <- setNames(do.call(rbind.data.frame, h), names(h[[1]]))
# a b c d
#1 2513.378 0.04668218 -3.181322 26371.42
#2 2803.172 0.06696201 -4.576432 25744.54
#3 3298.991 0.05817949 -3.425728 23938.88
#4 2150.487 0.03810406 -2.658772 26756.09
#5 2326.199 0.03044967 -1.780965 26043.74
#6 2934.019 0.03029370 -1.991291 26283.03
Then you can extract variables easily:
h.df$a
#[1] 2513.378 2803.172 3298.991 2150.487 2326.199 2934.019
Alternatively you can iterate over the list to extract the variable:
sapply(h, "[", "a")
# a a a a a a
#2513.378 2803.172 3298.991 2150.487 2326.199 2934.019
In this line, although x is a variable, h$x is looking for a column named x in h:
plot(h$x ~ Date, df, xlim = as.Date(c("2012-01-01", "2012-12-31")))
You probably want h[[x]] instead.
From ?'[[':
x$name is equivalent to x[["name", exact = FALSE]].
That is, you are looking for a column literally named x.
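The points from both answers can be put together in a small sketch. The coefficient values below are made up for illustration, and h_mat is a hypothetical name; the idea is that h is a list of named vectors, so the names live inside each element rather than on the list itself.

```r
# Stand-in for the list of coef() results (values are illustrative)
h <- list(c(a = 1.5, b = 0.1, c = -3, d = 100),
          c(a = 2.5, b = 0.2, c = -4, d = 200))

# The outer list has no element named "a", hence NULL
a_is_null <- is.null(h$a)

# Pull "a" out of every element
a_values <- vapply(h, function(v) v[["a"]], numeric(1))

# Or bind into a matrix and index by a name held in a variable
h_mat <- do.call(rbind, h)
x <- "b"
b_values <- h_mat[, x]
```

Indexing with h_mat[, x] uses the value stored in x, whereas the $ operator would look for a component literally called x.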

Resources