R - Create missingness in DataFrame for testing - r

I need to test some imputation evaluation software I'm creating and am struggling to get benchmark datasets.
Does anyone know of a way to delete a certain amount of data from a dataframe.
As an example of what I need:
You have a dataset and you want a random 20% of the rows to have a random amounts of variables in that row removed (ie. NA)
Or: Something that can turn
> head(mtcars,n=10)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Into:
> head(mtcars,n=10)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 NA 6 160.0 NA 3.90 2.620 NA 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 NA 108.0 93 NA NA 18.61 NA 1 NA 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
I have tried a couple of methods that manipulate the columns but these have some fundamental flaws in them which render them useless.
This is my first every question on here, if I have missed out anything or done something wrong, please do let me know.
All the best

This should do it:
df_new <- as.data.frame(apply(mtcars,2,function(x){
x[sample(1:length(x),round(length(x)*0.2))] <- NA
return(x)
}))
Apply() goes through the columns and in each column sample() is used to randomly select 20% of the values to be set to NA.
New answer after comment:
This randomly adds NA in 10% of all rows.
df <- mtcars
random_rows <- sample(1:nrow(df),round(nrow(df)*0.2))
for(i_row in random_rows){
df[i_row,sample(1:ncol(df),sample(1:ncol(df),1))] <- NA
}

Related

Merge several csv files from different folders using both the folder and the file names as identifiers

Dear R users and programmers:
I have tens of CSV files with similar col and row names but in different folders. I want to merge all of them while using the folder names and CSV file names as identifiers.
Is there any way to do that in R?
Thanks
Create data for a reproducible example
example_data <- mtcars
dirs <- c("dir_1", "dir_2", "dir_3", "dir_4")
df_list <- split(example_data, factor(sort(rank(row.names(example_data))%%4)))
for(i in c(1:4)){
dir.create(dirs[i])
write.csv(df_list[[i]],paste0(dirs[i],"/mtcars_",i,".csv"))}
Build data frame from files in multiple directories
# helper function to read a csv and
# add a column with the label of the data source
read_label <- function(files){
df <-read.csv(files)
df[,ncol(df)+1] <- files
names(df)[ncol(df)]<- "data_source"
return(df)
}
# List the files
files <- list.files(path = ".",pattern = "*.csv",recursive = TRUE)
# read files with labels into memory
do.call(rbind,lapply(files, read_label))
X mpg cyl disp hp drat wt qsec vs am gear carb data_source
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 dir_1/mtcars_1.csv
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 dir_1/mtcars_1.csv
3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 dir_1/mtcars_1.csv
4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 dir_1/mtcars_1.csv
5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 dir_1/mtcars_1.csv
6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 dir_1/mtcars_1.csv
7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 dir_1/mtcars_1.csv
8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 dir_1/mtcars_1.csv
9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 dir_2/mtcars_2.csv
10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 dir_2/mtcars_2.csv

add column with value depending on value in other column

Is there a smart way of doing this:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Now I want to add a column named X which could be either 1 or 0 depending on the name of the car. For example all cars starting with M should be 1 and the rest 0.
Best regards,
H
Many ways to do this :
mtcars$X <- +(startsWith(rownames(mtcars), 'M'))
You can also use grepl/str_detect :
mtcars$X <- as.integer(grepl('^M', rownames(mtcars)))
mtcars$X <- as.integer(stringr::str_detect(rownames(mtcars), '^M'))
The above two are similar to using ifelse :
mtcars$X <- ifelse(grepl('^M', rownames(mtcars)), 1, 0)
but they are more efficient than using ifelse.

How to iterate over columns with Sapply for Pearson coefficient

[i] indicates where I have to iterate pearsons coefficient over the columns and how to convert this into a dataframe attached onto a variable?
Code example:
*INSTEAD OF DOING THIS*
F.ReedBunting.pear<- cor.test(W_farmland_mean$Years,W_farmland_mean$ReedBunting,method='pearson')
F.Whitethroat.pear<- cor.test(W_farmland_mean$Years,W_farmland_mean$Whitethroat,method='pearson')
F.Rook.pear<- cor.test(W_farmland_mean$Years,W_farmland_mean$Rook,method='pearson')
.
.
.
*HOW CAN IT BE DONE QUICKLY WITH THIS*
workspaceone <- sapply(W_farmland_mean, function(x){
cor.test(W_farmland_mean$Years, W_farmland_mean[, 1[i]], method = 'pearson')
})
I think you should try:
result_cor <- apply(W_farmland_mean,2,function(x){cor.test(W_farmland_mean$Years,x, method = 'pearson')$estimate})
It will extract the Pearson coefficient of the comparison of each columns with the column years of your dataset.
Example
With the mtcars dataset:
df <- mtcars[c(1:10),]
> df
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
And if we apply the function:
result_cor = apply(df,2, function(x){cor.test(x,df$mpg,method ='pearson')$estimate})
And you get the following output:
> result_cor
mpg cyl disp hp drat wt qsec
1.0000000 -0.8614165 -0.7739868 -0.8937223 0.5413585 -0.5991894 0.5494131
vs am gear carb
0.4796102 0.2919683 0.6646449 -0.3711956

R - Calculate difference (similarity measure) between similar datasets

I have seen many questions that touch on this topic but haven't yet found an answer. If I have missed a question that does answer this question, please do mark this and point us to the question.
Scenario: We have a benchmark dataset, we have imputation methods, we systematically delete values from the benchmark and use two different imputation methods. Thus we have a benchmark, imputedData1 and imputedData2.
Question: Is there a function that can produce a number that represents the difference between the benchmark and imputedData1 or/and the difference between the benchmark and imputedData2. Ie function(benchmark, imputedData1) = 3.3 and function(benchmark, imputedData2) = 2.8
Note: Datasets are numerical, datasets are the same size, method should work at the data level if possible (ie not creating a regression and comparing regressions - unless it can work with ANY numerical dataset).
Reproducible datasets, they have only been changed in the first row:
benchmark:
> head(mtcars,n=10)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
imputedData1:
> head(mtcars,n=10)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 22.0 4 108.0 100 3.90 2.200 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
imputedData2:
> head(mtcars,n=10)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 18.0 6 112.0 105 3.90 2.620 16.46 0 0 3 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
I have tried to use RMSE (root mean squared error) but it didn't work very well so I am trying to find other ways to tackle this problem.
You could also check out package ftsa. It has about 20 error measures that can be calculated. In your case, a scaled error would make sense as the units differ from column to column.
library(ftsa)
error(forecast=unlist(imputedData1),true=unlist(bench),
insampletrue = unlist(bench), method = "mase")
[1] 0.035136
error(forecast=unlist(imputedData2),true=unlist(bench),
insampletrue = unlist(bench), method = "mase")
[1] 0.031151
data
bench <- read.table(text='mpg cyl disp hp drat wt qsec vs am gear carb
21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4',header=TRUE,stringsAsFactors=FALSE)
imputedData1 <- read.table(text='mpg cyl disp hp drat wt qsec vs am gear carb
22.0 4 108.0 100 3.90 2.200 16.46 0 1 4 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4',header=TRUE,stringsAsFactors=FALSE)
imputedData2 <- read.table(text='mpg cyl disp hp drat wt qsec vs am gear carb
18.0 6 112.0 105 3.90 2.620 16.46 0 0 3 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4',header=TRUE,stringsAsFactors=FALSE)
One possible way is to calculate a norm of their difference and prefer the imputation method that minimises this value. There are different matrix norms for different purposes. I'll point you to the wikipedia as a starting point - https://en.wikipedia.org/wiki/Matrix_norm.
In the absence of any specifics about your data I can't really say which you should choose but one method could be to create your own index that averages across different matrix norms and select the imputation method that minimizes this average. Or you could just eyeball them and with any luck one of the methods is a clear winner across most or all matrix norms.
A simple implementation of what was discussed in the comments that gives a result with same order of magnitude as P Lapointe's answer, just FYI.
library(magrittr)
center_and_reduce_df <- function(df,bm){
centered <- mapply('-',df,sapply(bm,mean)) %>% as.data.frame(stringsAsFactors= FALSE)
reduced <- mapply('/',centered,sapply(bm,sd)) %>% as.data.frame(stringsAsFactors= FALSE)
}
mean((center_and_reduce_df(id1,bm) - center_and_reduce_df(bm,bm))^2) # 0.03083166
Not quite sure what you mean by "difference", but if you just want to know how much each cell differs from each cell on average (given the matrices are of the same shape and have indentical cols/rows), you could do absolute difference, or use Euclidean distance, or Kolmogorov-Smirnov distance - depending again on what you mean by "difference".
abs(head(mtcars) - (head(mtcars)*0.5)) # differences by cell
mean( as.matrix(abs(head(mtcars) - (head(mtcars)*0.5)))) # mean abs difference
dist( t(data.frame(as.vector(as.matrix(head(mtcars))), (as.vector(as.matrix(head(mtcars)*0.5)))))) # Euclidean; remove t() to see element by element
ks.test( as.vector(as.matrix(head(mtcars))), (as.vector(as.matrix(head(mtcars)*0.5))))$statistic # K-S

How to have NA's displayed first using arrange()

Sample data:
temp = data.frame(col = list(NA, 1, 2, 3) )
Using arrange:
temp %>%
arrange(col)
gives
col
1 1
2 2
3 3
4 NA
and
temp %>%
arrange(desc(col))
gives
col
1 3
2 2
3 1
4 NA
I would like
col
1 NA
2 3
3 2
4 1
that is, to put NAs first. Does anyone know how to do this?
You could also do:
m %>%
arrange(!is.na(wt), wt) ##Spacedman's dataset
# mpg cyl disp hp drat wt qsec vs am gear carb
#1 18.7 8 360.0 175 3.15 NA 17.02 0 0 3 2
#2 24.4 4 146.7 62 3.69 NA 20.00 1 0 4 2
#3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#5 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#6 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#7 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#8 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#9 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#10 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Write a function that sorts a data frame and then pass the handy na.last=FALSE option to order. My original version can be found in the edit history, David Arenburg improved it to this:
> sortNA=function(d,n,...){d[order(d[[deparse(substitute(n))]],...),]}
Then use like this
> m=mtcars[1:10,]
> m$wt[5]=NA
> m$wt[8]=NA
> m %.% sortNA(wt, na.last=FALSE)
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 NA 17.02 0 0 3 2
Merc 240D 24.4 4 146.7 62 3.69 NA 20.00 1 0 4 2
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Add decreasing=TRUE to sort in the opposite order.
You might also consider posting an issue to the dplyr github issue tracker to suggest a new option to the arrange function to do this.
The order function in base R has an na.last argument:
> temp=data.frame(col=c(NA,1,2,3))
> temp[order(temp[,"col"],na.last=F),]
[1] NA 1 2 3

Resources