R - Calculate difference (similarity measure) between similar datasets

R - Calculate difference (similarity measure) between similar datasets - r

I have seen many questions that touch on this topic but haven't yet found an answer. If I have missed a question that does answer this question, please do mark this and point us to the question.
Scenario: We have a benchmark dataset, we have imputation methods, we systematically delete values from the benchmark and use two different imputation methods. Thus we have a benchmark, imputedData1 and imputedData2.
Question: Is there a function that can produce a number that represents the difference between the benchmark and imputedData1 or/and the difference between the benchmark and imputedData2. Ie function(benchmark, imputedData1) = 3.3 and function(benchmark, imputedData2) = 2.8
Note: Datasets are numerical, datasets are the same size, method should work at the data level if possible (ie not creating a regression and comparing regressions - unless it can work with ANY numerical dataset).
Reproducible datasets, they have only been changed in the first row:
benchmark:
> head(mtcars,n=10)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
imputedData1:
> head(mtcars,n=10)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 22.0 4 108.0 100 3.90 2.200 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
imputedData2:
> head(mtcars,n=10)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 18.0 6 112.0 105 3.90 2.620 16.46 0 0 3 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
I have tried to use RMSE (root mean squared error) but it didn't work very well so I am trying to find other ways to tackle this problem.

You could also check out package ftsa. It has about 20 error measures that can be calculated. In your case, a scaled error would make sense as the units differ from column to column.
library(ftsa)
error(forecast=unlist(imputedData1),true=unlist(bench),
insampletrue = unlist(bench), method = "mase")
[1] 0.035136
error(forecast=unlist(imputedData2),true=unlist(bench),
insampletrue = unlist(bench), method = "mase")
[1] 0.031151
data
bench <- read.table(text='mpg cyl disp hp drat wt qsec vs am gear carb
21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4',header=TRUE,stringsAsFactors=FALSE)
imputedData1 <- read.table(text='mpg cyl disp hp drat wt qsec vs am gear carb
22.0 4 108.0 100 3.90 2.200 16.46 0 1 4 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4',header=TRUE,stringsAsFactors=FALSE)
imputedData2 <- read.table(text='mpg cyl disp hp drat wt qsec vs am gear carb
18.0 6 112.0 105 3.90 2.620 16.46 0 0 3 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4',header=TRUE,stringsAsFactors=FALSE)

One possible way is to calculate a norm of their difference and prefer the imputation method that minimises this value. There are different matrix norms for different purposes. I'll point you to the wikipedia as a starting point - https://en.wikipedia.org/wiki/Matrix_norm.
In the absence of any specifics about your data I can't really say which you should choose but one method could be to create your own index that averages across different matrix norms and select the imputation method that minimizes this average. Or you could just eyeball them and with any luck one of the methods is a clear winner across most or all matrix norms.

A simple implementation of what was discussed in the comments that gives a result with same order of magnitude as P Lapointe's answer, just FYI.
library(magrittr)
center_and_reduce_df <- function(df,bm){
centered <- mapply('-',df,sapply(bm,mean)) %>% as.data.frame(stringsAsFactors= FALSE)
reduced <- mapply('/',centered,sapply(bm,sd)) %>% as.data.frame(stringsAsFactors= FALSE)
}
mean((center_and_reduce_df(id1,bm) - center_and_reduce_df(bm,bm))^2) # 0.03083166

Not quite sure what you mean by "difference", but if you just want to know how much each cell differs from each cell on average (given the matrices are of the same shape and have indentical cols/rows), you could do absolute difference, or use Euclidean distance, or Kolmogorov-Smirnov distance - depending again on what you mean by "difference".
abs(head(mtcars) - (head(mtcars)*0.5)) # differences by cell
mean( as.matrix(abs(head(mtcars) - (head(mtcars)*0.5)))) # mean abs difference
dist( t(data.frame(as.vector(as.matrix(head(mtcars))), (as.vector(as.matrix(head(mtcars)*0.5)))))) # Euclidean; remove t() to see element by element
ks.test( as.vector(as.matrix(head(mtcars))), (as.vector(as.matrix(head(mtcars)*0.5))))$statistic # K-S

Related

add column with value depending on value in other column

Is there a smart way of doing this:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Now I want to add a column named X which could be either 1 or 0 depending on the name of the car. For example all cars starting with M should be 1 and the rest 0.
Best regards,
H

Many ways to do this :
mtcars$X <- +(startsWith(rownames(mtcars), 'M'))
You can also use grepl/str_detect :
mtcars$X <- as.integer(grepl('^M', rownames(mtcars)))
mtcars$X <- as.integer(stringr::str_detect(rownames(mtcars), '^M'))
The above two are similar to using ifelse :
mtcars$X <- ifelse(grepl('^M', rownames(mtcars)), 1, 0)
but they are more efficient than using ifelse.

Using dplyr, how should I create a column of strings repeating a character based on the value of another column?

With mtcars for example, I'd like to create a new column carb_dots such that when carb = 4, carb_dots = "...."
Using dplyr, I've tried
library(dplyr)
mtcars2 <- mtcars %>% mutate(carb_dots = rep(".", carb))
This errors with
Error in mutate_impl(.data, dots) :
Evaluation error: invalid 'times' argument.
What should I do? Thanks for your suggestions.

With the addition of stringr, you can do:
mtcars %>%
mutate(carb_dots = str_dup(".", carb))
mpg cyl disp hp drat wt qsec vs am gear carb carb_dots
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ....
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ....
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 .
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 .
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ..
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 .
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 ....
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 ..
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 ..
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 ....

We can use strrep
library(dplyr)
mtcars %>%
mutate(carb_dots = strrep(".", carb))
# mpg cyl disp hp drat wt qsec vs am gear carb carb_dots
#Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ....
#Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ....
#Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 .
#Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 .
#Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ..
#Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 .
#Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 ....
#...
If we need to use rep
mtcars %>%
rowwise %>%
mutate(carb_dots = paste(rep(".", carb), collapse=""))

How to iterate over columns with Sapply for Pearson coefficient

[i] indicates where I have to iterate pearsons coefficient over the columns and how to convert this into a dataframe attached onto a variable?
Code example:
*INSTEAD OF DOING THIS*
F.ReedBunting.pear<- cor.test(W_farmland_mean$Years,W_farmland_mean$ReedBunting,method='pearson')
F.Whitethroat.pear<- cor.test(W_farmland_mean$Years,W_farmland_mean$Whitethroat,method='pearson')
F.Rook.pear<- cor.test(W_farmland_mean$Years,W_farmland_mean$Rook,method='pearson')
.
.
.
*HOW CAN IT BE DONE QUICKLY WITH THIS*
workspaceone <- sapply(W_farmland_mean, function(x){
cor.test(W_farmland_mean$Years, W_farmland_mean[, 1[i]], method = 'pearson')
})

I think you should try:
result_cor <- apply(W_farmland_mean,2,function(x){cor.test(W_farmland_mean$Years,x, method = 'pearson')$estimate})
It will extract the Pearson coefficient of the comparison of each columns with the column years of your dataset.
Example
With the mtcars dataset:
df <- mtcars[c(1:10),]
> df
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
And if we apply the function:
result_cor = apply(df,2, function(x){cor.test(x,df$mpg,method ='pearson')$estimate})
And you get the following output:
> result_cor
mpg cyl disp hp drat wt qsec
1.0000000 -0.8614165 -0.7739868 -0.8937223 0.5413585 -0.5991894 0.5494131
vs am gear carb
0.4796102 0.2919683 0.6646449 -0.3711956

R - Create missingness in DataFrame for testing

I need to test some imputation evaluation software I'm creating and am struggling to get benchmark datasets.
Does anyone know of a way to delete a certain amount of data from a dataframe.
As an example of what I need:
You have a dataset and you want a random 20% of the rows to have a random amounts of variables in that row removed (ie. NA)
Or: Something that can turn
> head(mtcars,n=10)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Into:
> head(mtcars,n=10)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 NA 6 160.0 NA 3.90 2.620 NA 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 NA 108.0 93 NA NA 18.61 NA 1 NA 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
I have tried a couple of methods that manipulate the columns but these have some fundamental flaws in them which render them useless.
This is my first every question on here, if I have missed out anything or done something wrong, please do let me know.
All the best

This should do it:
df_new <- as.data.frame(apply(mtcars,2,function(x){
x[sample(1:length(x),round(length(x)*0.2))] <- NA
return(x)
}))
Apply() goes through the columns and in each column sample() is used to randomly select 20% of the values to be set to NA.
New answer after comment:
This randomly adds NA in 10% of all rows.
df <- mtcars
random_rows <- sample(1:nrow(df),round(nrow(df)*0.2))
for(i_row in random_rows){
df[i_row,sample(1:ncol(df),sample(1:ncol(df),1))] <- NA
}

How to have NA's displayed first using arrange()

Sample data:
temp = data.frame(col = list(NA, 1, 2, 3) )
Using arrange:
temp %>%
arrange(col)
gives
col
1 1
2 2
3 3
4 NA
and
temp %>%
arrange(desc(col))
gives
col
1 3
2 2
3 1
4 NA
I would like
col
1 NA
2 3
3 2
4 1
that is, to put NAs first. Does anyone know how to do this?

You could also do:
m %>%
arrange(!is.na(wt), wt) ##Spacedman's dataset
# mpg cyl disp hp drat wt qsec vs am gear carb
#1 18.7 8 360.0 175 3.15 NA 17.02 0 0 3 2
#2 24.4 4 146.7 62 3.69 NA 20.00 1 0 4 2
#3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#5 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#6 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#7 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#8 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#9 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#10 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4

Write a function that sorts a data frame and then pass the handy na.last=FALSE option to order. My original version can be found in the edit history, David Arenburg improved it to this:
> sortNA=function(d,n,...){d[order(d[[deparse(substitute(n))]],...),]}
Then use like this
> m=mtcars[1:10,]
> m$wt[5]=NA
> m$wt[8]=NA
> m %.% sortNA(wt, na.last=FALSE)
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 NA 17.02 0 0 3 2
Merc 240D 24.4 4 146.7 62 3.69 NA 20.00 1 0 4 2
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Add decreasing=TRUE to sort in the opposite order.
You might also consider posting an issue to the dplyr github issue tracker to suggest a new option to the arrange function to do this.

The order function in base R has an na.last argument:
> temp=data.frame(col=c(NA,1,2,3))
> temp[order(temp[,"col"],na.last=F),]
[1] NA 1 2 3

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R - Calculate difference (similarity measure) between similar datasets - r

Related

add column with value depending on value in other column

Using dplyr, how should I create a column of strings repeating a character based on the value of another column?

How to iterate over columns with Sapply for Pearson coefficient

R - Create missingness in DataFrame for testing

How to have NA's displayed first using arrange()

Categories

Resources