How to find the difference between 2 dataframes? - r

I have 2 dataframes which are "exactly" the same. The difference between them is that one has 676 observations (rows) and the second has 666 observations. I don't know which rows are missing from the second dataframe.
It would be easiest for me if someone could show me the code to build a third dataframe containing those 10 missing rows.
The names of the dataframes:
- dataset1 (676)
- dataset2 (666)
Thx.

dataset1[tail(!duplicated(rbind(dataset2, dataset1)), nrow(dataset1)), ]
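To see how that one-liner works, here is a quick sketch on toy data (d1 and d2 are made-up names; this assumes the rows of dataset2 really are a subset of dataset1):
## toy data: d1 has two rows that d2 lacks
d1 <- data.frame(id = 1:5, val = letters[1:5])
d2 <- d1[-c(2, 4), ]
## stack d2 on top of d1; any d1 row already present in d2 is marked as a
## duplicate, so the non-duplicated entries in the tail are the extra rows
d1[tail(!duplicated(rbind(d2, d1)), nrow(d1)), ]
#   id val
# 2  2   b
# 4  4   d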

Here's an approach:
library(qdap)
## generate random problem
prob <- sample(1:nrow(mtcars), 1)
## remove the random problem row
mtcars2 <- mtcars[-prob, ]
## Throw it into a list of 2 dataframes so they're easier to work with
dat <- list(mtcars, mtcars2)
## Use qdap's `paste2` function to paste all columns together
dat2 <- lapply(dat, paste2)
## Find the shorter data set
wmn <- which.min(sapply(dat2, length))
## Add additional element to shorter one
dat2[[wmn]] <- c(dat2[[wmn]], NA)
## check each element of the 2 pasted data sets for equality
out <- mapply(identical, dat2[[1]], dat2[[2]])
## Which row's the problem
which(!out)[1]
which(!out)[1] == prob
If which(!out)[1] equals NA, the problem is in the last row.
When you start seeing FALSE, that's where the problem is located.
EDIT: removed the for loop

I would say try to use merge and then look for where the merge result has NA values.
Here's an example using dummy data:
set.seed(1)
df1 <- data.frame(x=rnorm(100),y=rnorm(100))
df2 <- df1[-sample(1:100,10),]
dim(df1)
# [1] 100 2
dim(df2)
# [1] 90 2
out <- merge(df1,df2,by='x',all.x=TRUE)
in1not2 <- which(is.na(out$y.y))
in1not2
# [1] 6 25 33 51 52 53 57 73 77 82
Then you can extract:
> df1[in1not2,]
x y
6 -0.8204684 1.76728727
25 0.6198257 -0.10019074
33 0.3876716 0.53149619
51 0.3981059 0.45018710
52 -0.6120264 -0.01855983
53 0.3411197 -0.31806837
57 -0.3672215 1.00002880
73 0.6107264 0.45699881
77 -0.4432919 0.78763961
82 -0.1351786 0.98389557
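If the x values were not guaranteed to be unique, a variation on the same idea is to merge on all shared columns and use an indicator column (a sketch along the same lines; in2 and out2 are made-up names):
df2$in2 <- TRUE                       # indicator column present only in df2
out2 <- merge(df1, df2, all.x=TRUE)   # merges on all shared columns (here x and y)
out2[is.na(out2$in2), c("x","y")]     # rows of df1 with no match in df2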


R lapply(): Change all columns within all data frames in a list to numeric, then convert all values to percentages

Question:
I am a little stumped as to how I can batch-process as.numeric() (or any other function, for that matter) over the columns in a list of data frames.
I understand that I can view specific data frames or columns within this list by using:
> my.list[[1]]
# or columns within this data frame using:
> my.list[[1]][1]
But my trouble comes when I try to wrap this in an lapply() call to change all of the data from integer to numeric.
# Example of what I am trying to do
> my.list[[each data frame in list]][each column in data frame] <-
    as.numeric(my.list[[each data frame in list]][each column in data frame])
If you can assist me in any way, or know of any resources that can help me out I would appreciate it.
Background:
My data frames are structured like the example below, where I have 5 habitat types and, for each individual, information on how much area of its home range extends into each habitat:
# Example data
spp.1.data <- data.frame(Habitat.A = c(100,45,0,9,0), Habitat.B = c(0,0,203,45,89), Habitat.C = c(80,22,8,9,20), Habitat.D = c(8,59,77,83,69), Habitat.E = c(23,15,99,0,10))
I have multiple data frames with the above structure which I have assigned to a list object:
all.spp.data <- list(spp.1.data, spp.2.data, spp.1.data...n)
I am then trying to coerce all data frames with as.numeric() so I can create data frames of % habitat use, i.e.:
# data, which is now numeric as per Phil's code ;)
data.numeric <- lapply(data, function(x) {
x[] <- lapply(x, as.numeric)
x
})
> head(data.numeric[[1]])
Habitat.A Habitat.B Habitat.C Habitat.D Habitat.E
1 100 0 80 8 23
2 45 0 22 59 15
3 0 203 8 77 99
4 9 45 9 83 0
5 0 89 20 69 10
EDIT: I would like to sum every row in all data frames
# Add row at the end of each data frame populated by rowSums()
f <- function(i){
  data.numeric[[i]]$Sums <- rowSums(data.numeric[[i]])
  data.numeric[[i]]
}
data.numeric.SUM <- lapply(seq_along(data.numeric), f)
head(data.numeric.SUM[[1]])
Habitat.A Habitat.B Habitat.C Habitat.D Habitat.E Sums
1 100 0 80 8 23 211
2 45 0 22 59 15 141
3 0 203 8 77 99 387
4 9 45 9 83 0 146
5 0 89 20 69 10 188
EDIT: This is the code I used to convert values within the data frames to % habitat used
# Used Phil's logic to convert all numbers in percentages
data.numeric.SUM.perc <- lapply(data.numeric.SUM,
function(x) {
x[] <- (x[]/x[,6])*100
x
})
Perc.Habitat.A Perc.Habitat.B Perc.Habitat.C Perc.Habitat.D Perc.Habitat.E
1 47 32 0 6 0
2 0 0 52 31 47
3 38 16 2 6 11
4 4 42 20 57 37
5 11 11 26 0 5
6 100 100 100 100 100
This is still not the most condensed way to do this, but it did the trick for me.
Thank you, Phil, Val and Leo P, for helping with this problem.
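For what it's worth, a more condensed sketch of the same pipeline (coerce to numeric, then express each value as a percentage of its row total; this assumes the same all.spp.data list, that every column is a habitat column, and perc.use is a made-up name):
perc.use <- lapply(all.spp.data, function(x) {
  x[] <- lapply(x, as.numeric)   # coerce every column to numeric
  100 * x / rowSums(x)           # each value as a % of its row total
})
head(perc.use[[1]])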
I'd do this a bit more explicitly:
all.spp.data <- lapply(all.spp.data, function(x) {
x[] <- lapply(x, as.numeric)
x
})
As a personal preference, this clearly conveys to me that I'm looping over each column in a data frame, and looping over each data frame in a list.
If you really want to do it all with lapply, here's a way to go:
lapply(all.spp.data, function(x) do.call(cbind, lapply(1:ncol(x), function(y) as.numeric(x[,y]))))
This uses a nested lapply call. The outer one assigns each single data.frame to x. The inner one assigns each column index of x to y. So in the end I can reference each column as x[,y].
Since everything gets split up into single vectors, I'm calling do.call(cbind, ...) to bring it back to a matrix. If you prefer, you could wrap data.frame() around it to get back to the original type.
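A variant of the same nested-lapply idea that also keeps the original column names and hands back data.frames (just a sketch, not tested against your real data) could look like this:
lapply(all.spp.data, function(x) {
  out <- do.call(cbind, lapply(seq_len(ncol(x)), function(y) as.numeric(x[, y])))
  colnames(out) <- names(x)   # restore the habitat column names lost by cbind
  as.data.frame(out)          # back to a data.frame rather than a matrix
})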

Is there a way to stop table() from sorting in R

Problem setup: I am creating a function that takes multiple CSV files selected by an ID column, combines them into one CSV, and then outputs the number of observations by ID.
Expected:
complete("specdata", 30:25) ##notice descending order of IDs requested
## id nobs
## 1 30 932
## 2 29 711
## 3 28 475
## 4 27 338
## 5 26 586
## 6 25 463
I get:
> complete("specdata", 30:25)
id nobs
1 25 463
2 26 586
3 27 338
4 28 475
5 29 711
6 30 932
Which is "wrong" because it has been sorted by id.
The CSV file I read from does have the data in descending order. My snippet:
dfTable<-read.csv("~/progAssign1/specdata/tmpdata.csv")
ccTab<-complete.cases(dfTable)
xTab3<-as.data.frame(table(dfTable$ID[ccTab]),)
colnames(xTab3)<-c("id","nobs")
And as near as I can tell, the third line is where sorting occurs. I broke out the expression and it happens in the table() call. I've not found any option or parameter I can pass to make something like sort=FALSE. You'd think...
Anyway. Any help appreciated!
So, the problem is in the output of table, which is sorted by default. For example:
> r = sample(5,15,replace = T)
> r
[1] 1 4 1 1 3 5 3 2 1 4 2 4 2 4 4
> table(r)
r
1 2 3 4 5
4 3 2 5 1
If you want to take the order of first appearance, you are going to get your hands a little bit dirty by recoding the table function:
unique_r = unique(r)
table_r = rbind(label=unique_r, count=sapply(unique_r,function(x)sum(r==x)))
table_r
[,1] [,2] [,3] [,4] [,5]
label 1 4 3 5 2
count 4 5 2 1 3
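A lighter-weight sketch of the same idea, if all you need is counts in order of first appearance, is to set the factor levels yourself before calling table (r_f is a made-up name):
r_f <- factor(r, levels = unique(r))   # levels in order of first appearance
table(r_f)
# r_f
# 1 4 3 5 2
# 4 5 2 1 3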
One way to get around this is...don't use table. Here's an example where I create three one-line data sets from your data. Then I read them in with a descending sequence, with read.table and it seems to be okay.
The real big thing here is that multiple data sets should be placed in a list upon being read into R. You'll get the exact order of data sets you want that way, among other benefits.
Once you've read them into R the way you want them, it's much easier to order them at the very end. Ordering of rows (for me) is usually the very last step.
> dat <- read.table(h=T, text = "id nobs
1 25 463
2 26 586
3 27 338
4 28 475
5 29 711
6 30 932")
Write three one-line files:
> write.table(dat[3,], "dat3.csv", row.names = FALSE)
> write.table(dat[2,], "dat2.csv", row.names = FALSE)
> write.table(dat[1,], "dat1.csv", row.names = FALSE)
Read them in using a 3:1 order:
> do.call(rbind, lapply(3:1, function(x){
read.table(paste0("dat", x, ".csv"), header = TRUE)
}))
# id nobs
# 1 27 338
# 2 26 586
# 3 25 463
Then, if we change 3:1 to 1:3, the rows "comply" with our request:
> do.call(rbind, lapply(1:3, function(x){
read.table(paste0("dat", x, ".csv"), header = TRUE)
}))
# id nobs
# 1 25 463
# 2 26 586
# 3 27 338
And just for fun
> fun <- function(z){
do.call(rbind, lapply(z, function(x){
read.table(paste0("dat", x, ".csv"), header = TRUE) }))
}
> fun(c(2, 3, 1))
# id nobs
# 1 26 586
# 2 27 338
# 3 25 463
You may try something like this:
t1 <- c(5,3,1,3,5,5,5)
as.data.frame(table(t1)) ##result in ascending order
# t1 Freq
#1 1 1
#2 3 2
#3 5 4
t1 <- factor(t1)
as.data.frame(table(reorder(t1, rep(-1, length(t1)),sum)))
# Var1 Freq
#1 5 4
#2 3 2
#3 1 1
In your case you are complaining that the table function, called with a single argument, returns the items with their names in ascending order, while you want them in descending order. You could simply have used the rev() function around the table call.
xTab3<-as.data.frame( rev( table( dfTable$ID[ccTab] ) ),)
(I'm not sure what that last comma is doing in there.) The sort order in the original would not be expected to determine the order of a table operation. Generally R will return results with discrete labels sorted in alpha (ascending) order unless the levels of a factor item have been specified differently. That's one of those R-specific rules that may be difficult to intuit. The other R-specific rule that may be difficult to grasp (although not really a problem here) is that arguments are often expected to be in the form of R-lists.
It's probably wise to think about R-table objects at this point (and what happens with the as.data.frame call). table-objects are actually R-matrices, so the feature that you wanted to sort by was actually the rownames of that table object, and those are of class character:
r = sample(5,15,replace = T)
table(r)
#r
#2 3 4 5
#5 3 2 5
rownames(table(r))
#[1] "2" "3" "4" "5"
str(as.data.frame(table(r)))
#-------
'data.frame': 4 obs. of 2 variables:
$ r : Factor w/ 4 levels "2","3","4","5": 1 2 3 4
$ Freq: int 5 3 2 5
I just want to share this homework solution I've done:
complete <- function(directory, id=1:332){
  setwd("E:/Coursera")
  files <- dir(directory, full.names = TRUE)
  data <- lapply(files, read.csv)
  specdata <- do.call(rbind, data)
  cleandata <- specdata[!is.na(specdata$sulfate) & !is.na(specdata$nitrate),]
  targetdata <- data.frame(Date=numeric(0), sulfate=numeric(0), nitrate=numeric(0), ID=numeric(0))
  result <- data.frame(id=numeric(0), nobs=numeric(0))
  for(i in id){
    targetdata <- cleandata[cleandata$ID == i, ]
    result <- rbind(result, data.frame(table(targetdata$ID)))
  }
  names(result) <- c("id","nobs")
  result
}
A simple solution that no one has proposed yet is combining table() with the unique() function. The unique() function gives the behaviour you are looking for (listing unique IDs in order of appearance).
In your case it would be something like this:
dfTable<-read.csv("~/progAssign1/specdata/tmpdata.csv")
ccTab<-complete.cases(dfTable)
x <- dfTable$ID[ccTab] # IDs of the complete cases
xTab3 <- as.data.frame(table(x)[as.character(unique(x))]) # reorder the table() result into order of first appearance (indexing by name, since the IDs are not positions)
colnames(xTab3)<-c("id","nobs")

Avoid using a loop to get sum of rows in R, where I want to start and stop the sum on different columns for each row

I am relatively new to R, coming from Stata. I have a data frame that has 100+ columns and thousands of rows. Each row has a start value, a stop value, and 100+ columns of numerical values. The goal is to get, for each row, the sum from the column that corresponds to the start value to the column that corresponds to the stop value. This is straightforward enough to do in a loop, which looks like this (the data.frame is df, start is the start column, stop is the stop column):
for(i in 1:nrow(df)) {
  df$out[i] <- rowSums(df[i, df$start[i]:df$stop[i]])
}
This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?
You can do this using some algebra (if you have a sufficient amount of memory):
DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))
# start end 1 2 3 4 5 6 7 8 9 10
#1 3 4 1 6 11 16 21 26 31 36 41 46
#2 4 5 2 7 12 17 22 27 32 37 42 47
#3 5 6 3 8 13 18 23 28 33 38 43 48
#4 6 7 4 9 14 19 24 29 34 39 44 49
#5 7 8 5 10 15 20 25 30 35 40 45 50
take <- outer(seq_len(ncol(DF)-2)+2, DF$start-1, ">") &
outer(seq_len(ncol(DF)-2)+2, DF$end+1, "<")
diag(as.matrix(DF[,-(1:2)]) %*% take)
#[1] 7 19 31 43 55
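If memory is a concern, the same numbers can be obtained without forming the full n-by-n cross product, by multiplying element-wise and summing each row (a small variant of the idea above):
rowSums(as.matrix(DF[,-(1:2)]) * t(take))
#[1]  7 19 31 43 55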
If you are dealing with values that are all of the same type, you typically want to do things in matrices. Here is a solution in matrix form:
rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=T)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=T))
# first 2 cols of matrix are start and end, the rest are
# random data
mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)
# use `apply` to apply a function to each row, here the
# function sums each row excluding the first two values
# from the value in the start column to the value in the
# end column
apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
# df version
df <- as.data.frame(mx)
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
You can convert your data.frame to a matrix with as.matrix. You can also run the apply directly on your data.frame as shown, which should still be reasonably fast. The real problem with your code is that you are modifying a data frame nrow times, and modifying data frames is very slow. By using apply you get around that by generating your answer (the $out column) in one go, which you can then cbind back to your data frame (and that means you modify your data frame just once).
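As a minimal sketch of that last point (re-using mx from above), compute the answer once and attach it in a single step:
out <- apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
df <- cbind(as.data.frame(mx), out = out)   # the data frame is modified only once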

finding unique vector elements in a list efficiently

I have a list of numerical vectors, and I need to create a list containing only one copy of each vector. There isn't a list method for the identical function, so I wrote a function to check every vector against every other one.
F1 <- function(x){
  to_remove <- c()
  for(i in 1:length(x)){
    for(j in 1:length(x)){
      ## flag only the later copy so the first occurrence is kept
      if(j > i && identical(x[[i]], x[[j]])) to_remove <- c(to_remove, j)
    }
  }
  if(is.null(to_remove)) x else x[-c(to_remove)]
}
The problem is that this function becomes very slow as the size of the input list x increases, partly due to the assignment of two large vectors by the for loops. I'm hoping for a method that will run in under one minute for a list of length 1.5 million with vectors of length 15, but that might be optimistic.
Does anyone know a more efficient way of comparing each vector in a list with every other vector? The vectors themselves are guaranteed to be equal in length.
Sample output is shown below.
x = list(1:4, 1:4, 2:5, 3:6)
F1(x)
> list(1:4, 2:5, 3:6)
As per @JoshuaUlrich and @thelatemail, ll[!duplicated(ll)] works just fine.
And thus, so should unique(ll)
I previously suggested a method using sapply with the idea of not checking every element in the list (I deleted that answer, as I think using unique makes more sense)
Since efficiency is a goal, we should benchmark these.
# Let's create some sample data
xx <- lapply(rep(100,15), sample)
ll <- as.list(sample(xx, 1000, T))
ll
Putting it up against some benchmarks:
fun1 <- function(ll) {
ll[c(TRUE, !sapply(2:length(ll), function(i) ll[i] %in% ll[1:(i-1)]))]
}
fun2 <- function(ll) {
ll[!duplicated(sapply(ll, digest))]
}
fun3 <- function(ll) {
ll[!duplicated(ll)]
}
fun4 <- function(ll) {
unique(ll)
}
#Make sure all the same
all(identical(fun1(ll), fun2(ll)), identical(fun2(ll), fun3(ll)),
identical(fun3(ll), fun4(ll)), identical(fun4(ll), fun1(ll)))
# [1] TRUE
library(digest)      ## needed for fun2's hashing
library(rbenchmark)
benchmark(digest=fun2(ll), duplicated=fun3(ll), unique=fun4(ll), replications=100, order="relative")[, c(1, 3:6)]
test elapsed relative user.self sys.self
3 unique 0.048 1.000 0.049 0.000
2 duplicated 0.050 1.042 0.050 0.000
1 digest 8.427 175.563 8.415 0.038
# I took out fun1, since when ll is large, it ran extremely slow
Fastest Option:
unique(ll)
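Applied to the sample data from the question, as a quick check:
x <- list(1:4, 1:4, 2:5, 3:6)
unique(x)
# [[1]]
# [1] 1 2 3 4
#
# [[2]]
# [1] 2 3 4 5
#
# [[3]]
# [1] 3 4 5 6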
You could hash each of the vectors and then use !duplicated() to identify unique elements of the resultant character vector:
library(digest)
## Some example data
x <- 1:44
y <- 2:10
z <- rnorm(10)
ll <- list(x,y,x,x,x,z,y)
ll[!duplicated(sapply(ll, digest))]
# [[1]]
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
#
# [[2]]
# [1] 2 3 4 5 6 7 8 9 10
#
# [[3]]
# [1] 1.24573610 -0.48894189 -0.18799758 -1.30696395 -0.05052373 0.94088670
# [7] -0.20254574 -1.08275938 -0.32937153 0.49454570
To see at a glance why this works, here's what the hashes look like:
sapply(ll, digest)
[1] "efe1bc7b6eca82ad78ac732d6f1507e7" "fd61b0fff79f76586ad840c9c0f497d1"
[3] "efe1bc7b6eca82ad78ac732d6f1507e7" "efe1bc7b6eca82ad78ac732d6f1507e7"
[5] "efe1bc7b6eca82ad78ac732d6f1507e7" "592e2e533582b2bbaf0bb460e558d0a5"
[7] "fd61b0fff79f76586ad840c9c0f497d1"

converting an ftable to a matrix

Take for example the following ftable
height <- c(rep('short', 7), rep('tall', 3))
girth <- c(rep('narrow', 4), rep('wide', 6))
measurement <- rnorm(10)
foo <- data.frame(height=height, girth=girth, measurement=measurement)
ftable.result <- ftable(foo$height, foo$girth)
I'd like to convert the above ftable.result into a matrix with row names and column names. Is there an efficient way of doing this? as.matrix() doesn't exactly work, since it won't attach the row names and column names for you.
You could do the following
ftable.matrix <- ftable.result
class(ftable.matrix) <- 'matrix'
rownames(ftable.matrix) <- unlist(attr(ftable.result, 'row.vars'))
colnames(ftable.matrix) <- unlist(attr(ftable.result, 'col.vars'))
However, it seems a bit heavy-handed. Is there a more efficient way of doing this?
It turns out that @Shane had originally posted (but quickly deleted) what is a correct answer with more recent versions of R.
Somewhere along the way, an as.matrix method was added for ftable (I haven't found it in the NEWS files I read through, though).
The as.matrix method for ftable lets you deal fairly nicely with "nested" frequency tables (which is exactly what ftable creates). Consider the following:
temp <- read.ftable(textConnection("breathless yes no
coughed yes no
age
20-24 9 7 95 1841
25-29 23 9 108 1654
30-34 54 19 177 1863"))
class(temp)
# [1] "ftable"
The head(as.table(...), Inf) trick doesn't work with such ftables because as.table would convert the result into a multi-dimensional array.
head(as.table(temp), Inf)
# [1] 9 23 54 95 108 177 7 9 19 1841 1654 1863
For the same reason, the second suggestion also doesn't work:
t <- as.table(temp)
class(t) <- "matrix"
# Error in class(t) <- "matrix" :
# invalid to set the class to matrix unless the dimension attribute is of length 2 (was 3)
However, with more recent versions of R, simply using as.matrix would be fine:
as.matrix(temp)
# breathless_coughed
# age yes_yes yes_no no_yes no_no
# 20-24 9 7 95 1841
# 25-29 23 9 108 1654
# 30-34 54 19 177 1863
class(.Last.value)
# [1] "matrix"
If you prefer a data.frame to a matrix, check out table2df from my "mrdwabmisc" package on GitHub.
I found 2 solutions on R-Help:
head(as.table(ftable.result), Inf)
Or
t <- as.table(ftable.result)
class(t) <- "matrix"
