I need to put the vectors (attributes) returned by the remove_outliers function into a dataframe. Right now, I am getting a large matrix instead. I have tried appending in a loop, as follows:
# function to calculate the IQR-based lower and upper limits of a given
# attribute and replace values outside them with NA
remove_outliers <- function(attribute, na.rm = TRUE, ...) {
  IQR_val <- quantile(attribute, probs = c(.25, .75), na.rm = na.rm, ...)
  LF <- 1.5 * IQR(attribute, na.rm = na.rm)
  attribute_W_NA <- attribute
  attribute_W_NA[attribute < (IQR_val[1] - LF)] <- NA
  attribute_W_NA[attribute > (IQR_val[2] + LF)] <- NA
  attribute_W_NA
}
cleaned_data <- NULL
for (i in 1:ncol(data_rm_val)) {
  # cleaned data with NA entries replacing outliers
  cleaned_data <- cbind(cleaned_data, remove_outliers(data_rm_val[, i]))
}
This results in a large matrix (the cbind in the loop produces one), whereas the desired result is a dataframe with the same number of rows and columns as the input.
Any help would be greatly appreciated.
PS: I am a newbie in R and Data Science.
Simply use lapply with your user-defined function, avoiding the need to cbind or append. When you use lapply() on a dataframe, the operation runs on each column:
cleaned_data <- data.frame(lapply(data_rm_val, remove_outliers))
The above assumes your user-defined function, remove_outliers, returns a vector. To guarantee a vector output, consider vapply(), specifying an output template with length equal to the input, numeric(nrow(data_rm_val)):
cleaned_data <- data.frame(vapply(data_rm_val, remove_outliers, numeric(nrow(data_rm_val))))
Both options work on a dataset of random numbers (since the OP does not provide example data):
data_rm_val <- data.frame(matrix(rnorm(25),5))
# X1 X2 X3 X4 X5
# 1 0.4303766 1.8152041 0.3355174 -0.4880282 -0.63612820
# 2 0.2876950 -0.7613642 -1.5046115 0.1821653 0.09397964
# 3 -2.3402548 -0.6771749 -2.0122667 -0.9442210 -1.30994853
# 4 1.4224979 -1.7940421 -0.5110736 -0.2837820 -0.24240172
# 5 -0.7484131 -0.8159326 -1.2690513 -1.0422656 1.23811458
cleaned_data <- data.frame(lapply(data_rm_val, remove_outliers))
# X1 X2 X3 X4 X5
# 1 0.4303766 NA 0.3355174 -0.4880282 -0.63612820
# 2 0.2876950 -0.7613642 -1.5046115 0.1821653 0.09397964
# 3 -2.3402548 -0.6771749 -2.0122667 -0.9442210 -1.30994853
# 4 1.4224979 NA -0.5110736 -0.2837820 -0.24240172
# 5 -0.7484131 -0.8159326 -1.2690513 -1.0422656 NA
cleaned_data2 <- data.frame(vapply(data_rm_val, remove_outliers,
                                   numeric(nrow(data_rm_val))))
# X1 X2 X3 X4 X5
# 1 0.4303766 NA 0.3355174 -0.4880282 -0.63612820
# 2 0.2876950 -0.7613642 -1.5046115 0.1821653 0.09397964
# 3 -2.3402548 -0.6771749 -2.0122667 -0.9442210 -1.30994853
# 4 1.4224979 NA -0.5110736 -0.2837820 -0.24240172
# 5 -0.7484131 -0.8159326 -1.2690513 -1.0422656 NA
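As a quick sanity check on the stated requirement (same rows and columns, and a data.frame rather than a matrix), you can run:
dim(cleaned_data)
# [1] 5 5
class(cleaned_data)
# [1] "data.frame"
identical(dim(cleaned_data), dim(data_rm_val))
# [1] TRUE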
I have a dataframe with 300 columns (labeled like so X17.01, X24.05, X200.4...) and 500 rows. I want to rescale those columns to be between 0 and 1 but based on the min/max of each individual column. So, for example, I want to rescale column X17.01 separately from X24.05.
I have used the following two pieces of R code (below), but both rescale the whole data frame.
Code 1:
Data_profile_standardized <- data.frame(lapply(Data_profile, function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE)/1)))
Code 2:
normalize <- Vectorize(function(v) (v-min(v))/diff(range(v)))
dfout <- data.frame(normalize(Data_profile_standardized))
Let's make a nice small test case so we can see what's going on easily:
df = data.frame(X1 = 1:3, X2 = c(100, 150, 1000))
The problem with scale is not that it is applied to the whole data frame; rather, it is that with center = FALSE all it does is divide by the maximum, so you don't get any 0s:
data.frame(lapply(df, function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE)/1)))
# X1 X2
# 1 0.3333333 0.10
# 2 0.6666667 0.15
# 3 1.0000000 1.00
The problem with your normalize function is the Vectorize() wrapper, which is not necessary here. Vectorizing makes the function try to normalize each individual entry rather than each column, and since the diff(range()) of a single number is 0, you are dividing by 0 and getting NaN as a result:
normalize <- Vectorize(function(v) (v-min(v))/diff(range(v)))
data.frame(lapply(df, normalize))
# X1 X2
# 1 NaN NaN
# 2 NaN NaN
# 3 NaN NaN
Let's leave off the Vectorize (and add na.rm = TRUE for good measure, in case there are NA values in your actual data):
normalize = function(v) (v - min(v, na.rm = TRUE)) / diff(range(v, na.rm = TRUE))
data.frame(lapply(df, normalize))
# X1 X2
# 1 0.0 0.00000000
# 2 0.5 0.05555556
# 3 1.0 1.00000000
This works!
Note that we could work more with the scale function. If you specify center = min(x), then the minimums will be subtracted and you'll get 0s... but then max(x) is no longer the correct scale factor. We need to use diff(range()) here, just like in the other methods:
# also works
data.frame(lapply(df, function(x) scale(
  x,
  center = min(x, na.rm = TRUE),
  scale = diff(range(x, na.rm = TRUE))
)))
# X1 X2
# 1 0.0 0.00000000
# 2 0.5 0.05555556
# 3 1.0 1.00000000
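A side note: scale() stores the centering and scaling values as attributes on its result, which lets you replay the same transformation on new data later. With the df above:
sc <- scale(df$X2, center = min(df$X2), scale = diff(range(df$X2)))
attr(sc, "scaled:center")
# [1] 100
attr(sc, "scaled:scale")
# [1] 900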
Make a copy of the data.frame and then normalize it by lapply-ing the normalizing function over the columns.
Note the empty square brackets, without which dfout would not keep its tabular shape.
normalize <- function(v, na.rm = FALSE) (v - min(v, na.rm = na.rm))/diff(range(v, na.rm = na.rm))
dfout <- Data_profile
dfout[] <- lapply(dfout, normalize)
With the data in Gregor Thomas's answer, the result is the following.
dfout
# X1 X2
#1 0.0 0.00000000
#2 0.5 0.05555556
#3 1.0 1.00000000
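To see why the empty brackets matter, compare the two assignments (a minimal sketch using the same normalize function and data):
# bare assignment: the result is a plain list, the tabular shape is lost
out1 <- lapply(Data_profile, normalize)
class(out1)
# [1] "list"
# empty-bracket assignment: the data.frame shell (class, names, rows) is kept
out2 <- Data_profile
out2[] <- lapply(out2, normalize)
class(out2)
# [1] "data.frame"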
I have two large sparse matrices (about 41,000 x 55,000 in size). The density of nonzero elements is around 10%. They both have the same row index and column index for nonzero elements.
I now want to modify the values in the first sparse matrix if values in the second matrix are below a certain threshold.
library(Matrix)
# Generating the example matrices.
set.seed(42)
# Rows with values.
i <- sample(1:41000, 227000000, replace = TRUE)
# Columns with values.
j <- sample(1:55000, 227000000, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000)
# Values for the second matrix.
x2 <- sample(1:3, 227000000, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
I now get the rows, columns and values from the first matrix in a new matrix. This way, I can simply subset them and only the ones I am interested in remain.
# Getting the positions and values from the matrices.
position_matrix_from_m1 <- rbind(i = m1@i, j = summary(m1)$j, x = m1@x)
position_matrix_from_m2 <- rbind(i = m2@i, j = summary(m2)$j, x = m2@x)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- position_matrix_from_m1[,position_matrix_from_m1[3,] > 0 & position_matrix_from_m1[3,] < 0.05]
# We add 1 to the values, since the sparse matrix is 0-based.
position_matrix_from_m1[1,] <- position_matrix_from_m1[1,] + 1
position_matrix_from_m1[2,] <- position_matrix_from_m1[2,] + 1
Now I am getting into trouble. Overwriting the values in the second matrix takes too long. I let it run for several hours and it did not finish.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
I thought about pasting the row and column information together. Then I have a unique identifier for each value. This also takes too long and is probably just very bad practice.
# We would get the unique identifiers after the subsetting.
m1_identifiers <- paste0(position_matrix_from_m1[1,], "_", position_matrix_from_m1[2,])
m2_identifiers <- paste0(position_matrix_from_m2[1,], "_", position_matrix_from_m2[2,])
# Now, I could use which and get the position of the values I want to change.
# This also uses too much memory.
m2_identifiers_of_interest <- which(m2_identifiers %in% m1_identifiers)
# Then I would modify the x values in the position_matrix_from_m2 matrix and overwrite m2@x in the sparse matrix object.
Is there a fundamental error in my approach? What should I do to run this efficiently?
Is there a fundamental error in my approach?
Yes. Here it is.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
Syntax such as mat[rn, cn] (whether mat is a dense or a sparse matrix) selects all rows in rn and all columns in cn, so you get a length(rn) x length(cn) matrix. Here is a small example:
A <- matrix(1:9, 3, 3)
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
rn <- 1:2
cn <- 2:3
A[rn, cn]
# [,1] [,2]
#[1,] 4 7
#[2,] 5 8
What you intend is to select (rn[1], cn[1]), (rn[2], cn[2]), ..., only. The correct syntax for that is mat[cbind(rn, cn)]. Here is a demo:
A[cbind(rn, cn)]
#[1] 4 8
So you need to fix your code to:
m2[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 1
m1[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 0
Oh wait... Based on your construction of position_matrix_from_m1, this is just
ij <- t(position_matrix_from_m1[1:2, ])
m2[ij] <- 1
m1[ij] <- 0
Now, let me explain how you can do better. You have underused summary(). Applied to a sparse matrix, it returns a 3-column data frame giving the (i, j, x) triplets, where both i and j are indices starting from 1.
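For instance, on a tiny sparse matrix (a small illustration, not part of the original data):
library(Matrix)
small <- sparseMatrix(i = c(1, 3), j = c(2, 1), x = c(0.5, 2))
summary(small)
#   i j   x
# 1 3 1 2.0
# 2 1 2 0.5
You could have worked with this nice output directly, as follows: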
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# you never seem to use `position_matrix_from_m2` so I skip it
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
Now you can do:
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2[ij] <- 1
m1[ij] <- 0
Is there an even better solution? Yes! Note that the nonzero elements in m1 and m2 are located at the same positions, so basically you just need to change m2@x according to m1@x.
ind <- m1@x > 0 & m1@x < 0.05
m2@x[ind] <- 1
m1@x[ind] <- 0
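One small follow-up: assigning 0 into m1@x leaves explicit zeros stored in the sparse representation. They are harmless for arithmetic, but drop0() from Matrix removes them if you want a fully compacted matrix:
m1 <- drop0(m1)  # remove the explicitly stored zeros just written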
A complete R session
I don't have enough RAM to create your large matrix, so I reduced your problem size a little bit for testing. Everything worked smoothly.
library(Matrix)
# Generating the example matrices.
set.seed(42)
## reduce problem size to what my laptop can bear with
squeeze <- 0.1
# Rows with values.
i <- sample(1:(41000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Columns with values.
j <- sample(1:(55000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000 * squeeze ^ 2)
# Values for the second matrix.
x2 <- sample(1:3, 227000000 * squeeze ^ 2, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
## give me more usable RAM
rm(i, j, x1, x2)
##
## fix to your code
##
m1a <- m1
m2a <- m2
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2a[ij] <- 1
m1a[ij] <- 0
##
## the best solution
##
m1b <- m1
m2b <- m2
ind <- m1@x > 0 & m1@x < 0.05
m2b@x[ind] <- 1
m1b@x[ind] <- 0
##
## they are identical
##
all.equal(m1a, m1b)
#[1] TRUE
all.equal(m2a, m2b)
#[1] TRUE
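If you want to measure the difference yourself, a rough timing harness (results vary by machine, so none are shown here):
# time the fixed subset-assignment approach
system.time({
  ij <- as.matrix(subset(summary(m1), x > 0 & x < 0.05)[, 1:2])
  m2a[ij] <- 1
})
# time the slot-based approach
system.time({
  ind <- m1@x > 0 & m1@x < 0.05
  m2b@x[ind] <- 1
})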
Caveat:
I know that some people may propose
m1c <- m1
m2c <- m2
logi <- m1 > 0 & m1 < 0.05
m2c[logi] <- 1
m1c[logi] <- 0
It looks completely natural in R's syntax. But trust me, it is extremely slow for large matrices.
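The reason is that a comparison like m1 < 0.05 is TRUE for every structural zero as well (0 < 0.05), so the resulting logical matrix is effectively dense. A tiny sketch makes this visible:
small <- sparseMatrix(i = 1, j = 1, x = 2, dims = c(3, 3))
res <- small < 0.05
sum(res)
# [1] 8   -- every zero entry passes the test; only the stored value 2 fails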
I am looking to create a function that takes the training set and the testing set as arguments, min-max scales/normalizes the training set and returns it, and then uses the same minimum and range values to min-max scale/normalize and return the test set.
So far this is the function I have come up with:
min_max_scaling <- function(train, test) {
  min_vals <- sapply(train, min)
  range1 <- sapply(train, function(x) diff(range(x)))

  # scale the training data
  train_scaled <- data.frame(matrix(nrow = nrow(train), ncol = ncol(train)))
  for (i in seq_len(ncol(train))) {
    column <- (train[, i] - min_vals[i]) / range1[i]
    train_scaled[i] <- column
  }
  colnames(train_scaled) <- colnames(train)

  # scale the testing data using the min and range of the train data
  test_scaled <- data.frame(matrix(nrow = nrow(test), ncol = ncol(test)))
  for (i in seq_len(ncol(test))) {
    column <- (test[, i] - min_vals[i]) / range1[i]
    test_scaled[i] <- column
  }
  colnames(test_scaled) <- colnames(test)

  return(list(train = train_scaled, test = test_scaled))
}
The definition of min-max scaling is similar to this question asked earlier on SO: Normalisation of a two column data using min and max values
My questions are:
1. Is there a way to vectorize the two for loops in the function? e.g. using sapply()
2. Are there any packages that allow us to do what we are looking to do here?
Here is the code for the min-max normalization. See this Wikipedia page for the formulae, and also other ways of performing feature scaling.
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
To normalize a single column, call the function directly, e.g. normalize(df$name); to normalize every column at once, use apply() over the column margin:
as.data.frame(apply(df, 2, normalize))
Update to address Holger's suggestion.
If you want to pass additional arguments to min() and max(), e.g., na.rm, then you can use:
normalize <- function(x, ...) {
return((x - min(x, ...)) /(max(x, ...) - min(x, ...)))
}
x <- c(1, NA, 2, 3)
normalize(x)
# [1] NA NA NA NA
normalize(x, na.rm = TRUE)
# [1] 0.0  NA 0.5 1.0
Just keep in mind that whatever you pass to min() via the ellipsis ... you also implicitly pass to max(). In this case that shouldn't be a big problem, since min() and max() share the same function signature.
Regarding your 2nd question, you can use the caret package:
library(caret)
train = data.frame(a = 1:3, b = 10:12)
test = data.frame(a = 1:6, b = 7:12)
pp = preProcess(train, method = "range")
predict(pp, train)
# a b
# 1 0.0 0.0
# 2 0.5 0.5
# 3 1.0 1.0
predict(pp, test)
# a b
# 1 0.0 -1.5
# 2 0.5 -1.0
# 3 1.0 -0.5
# 4 1.5 0.0
# 5 2.0 0.5
# 6 2.5 1.0
This package also defines other transformation methods; see: http://machinelearningmastery.com/pre-process-your-dataset-in-r/
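As for question 1, the two loops can be replaced with a single Map() over the columns, pairing each column with its training minimum and range. A minimal sketch (assuming, as in the original function, that test has the same columns as train):
min_max_scaling <- function(train, test) {
  min_vals <- sapply(train, min)
  ranges <- sapply(train, function(x) diff(range(x)))
  # Map() walks the columns together with their training min and range
  scale_df <- function(df) {
    data.frame(Map(function(col, mn, rg) (col - mn) / rg, df, min_vals, ranges))
  }
  list(train = scale_df(train), test = scale_df(test))
}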
set.seed(1984)
### simulating a data set
df <- data.frame(var1 = rnorm(100, 5, 3),
                 var2 = rpois(100, 15),
                 var3 = runif(50, 90, 100))
df_train <- df[1:60,]
df_test <- df[61:100,]
## the function
normalize_data <- function(train_set, test_set) { ## the args are the two sets
  ranges <- sapply(train_set, function(x) max(x) - min(x)) ## range calculation
  normalized_train <- train_set / ranges # the normalization
  normalized_test <- test_set / ranges
  return(list(ranges = ranges, # returning a list
              normalized_train = normalized_train,
              normalized_test = normalized_test))
}
z <- normalize_data(df_train, df_test) ## applying the function
## the results
z$ranges
var1 var2 var3
13.051448 22.000000 9.945934
> head(z$normalized_train)
var1 var2 var3
1 0.47715854 1.1492978 7.289028
2 0.18322387 0.4545455 4.280883
3 0.69451066 1.3070668 9.703761
4 -0.04125108 1.6090169 7.277882
5 0.35731555 0.7272727 4.133561
6 0.86120315 0.6032616 9.246209
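A caution about the division step above: train_set/ranges recycles the ranges vector element-wise rather than matching one range per column, which is why the printed var3 values land nowhere near [0, 1]. A corrected sketch that pairs each column with its own range via sweep(), and also subtracts the training minimum so the result is a true min-max scaling:
normalize_data <- function(train_set, test_set) {
  mins <- sapply(train_set, min)
  ranges <- sapply(train_set, function(x) max(x) - min(x))
  # sweep() applies exactly one value per column, avoiding element-wise recycling
  scale_df <- function(df) sweep(sweep(df, 2, mins, "-"), 2, ranges, "/")
  list(ranges = ranges,
       normalized_train = scale_df(train_set),
       normalized_test = scale_df(test_set))
}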
I am using R to apply a self-written function, which takes as input two numeric vectors plus a numeric parameter, over the column margins of a data frame. Each column in the data frame is a numeric vector, and I want to perform pairwise computations and create a matrix that has all possible combinations of the columns with the result of the computation, essentially mimicking the behaviour of the cor() function.
# Data
> head(d)
1 2 3 4
1 -1.01035342 1.2490665 0.7202516 0.101467379
2 -0.50700743 1.4356733 0.9032172 -0.001583743
3 -0.09055243 0.4695046 2.4487632 -1.082570048
4 1.11230416 0.2885735 0.3534247 -0.728574628
5 -1.96115691 0.4831158 1.5650052 0.648675605
6 1.20434218 1.7668086 0.2170858 -0.161570792
> cor(d)
1 2 3 4
1 1.00000000 0.08320968 -0.06432155 0.04909430
2 0.08320968 1.00000000 -0.04557743 -0.01092765
3 -0.06432155 -0.04557743 1.00000000 -0.01654762
4 0.04909430 -0.01092765 -0.01654762 1.00000000
I found this useful answer: Perform pairwise comparison of matrix
Based on this, I wrote the function below, which makes use of another self-written function, compareFunctions():
createProbOfNonEqMatrix <- function(df, threshold) {
  combinations <- combn(ncol(df), 2)
  # df creation for predicted values from the density function
  predDF <- matrix(nrow = length(density(df[, 1])$y))
  for (i in 1:ncol(df)) {
    # convert df of original values to df of values predicted by the density function
    predCol <- density(df[, i])$y
    predDF <- cbind(predDF, predCol)
  }
  predDF <- predDF[, 2:ncol(predDF)]
  colnames(predDF) <- colnames(df) # give the predicted-value columns the original names
  predDF <- as.matrix(predDF)
  out.mx <- apply(X = combinations, MARGIN = 2, FUN = "compareFunctions",
                  predicted_by_first = predDF[, combinations[1]],
                  predicted_by_second = predDF[, combinations[2]],
                  threshold = threshold)
  return(out.mx)
}
The predicted_by_first, predicted_by_second and threshold arguments are inputs for compareFunctions. However, I get the following error:
Error in FUN(newX[, i], ...) : unused argument (newX[, i])
In desperation I tried this:
createProbOfNonEqMatrix <- function(df, threshold) {
  combinations <- combn(ncol(df), 2)
  predDF <- matrix(nrow = length(density(df[, 1])$y))
  for (i in 1:ncol(df)) {
    predCol <- density(df[, i])$y
    predDF <- cbind(predDF, predCol)
  }
  predDF <- predDF[, 2:ncol(predDF)]
  colnames(predDF) <- colnames(df)
  predDF <- as.matrix(predDF)
  out.mx <- apply(
    X = combinations, MARGIN = 2, FUN = function(x) {
      diff <- abs(predDF[, x[1]] - predDF[, x[2]])
      boolean <- diff < threshold
      acceptCount <- length(boolean[boolean == TRUE])
      probability <- acceptCount / length(diff)
      return(probability)
    }
  )
  return(out.mx)
}
It does seem to work, but instead of returning a pairwise matrix it gives me a vector:
> createProbOfNonEqMatrix(d,0.001)
[1] 0.10351562 0.08203125 0.13476562 0.13085938 0.14843750 0.10937500
Could you guide me on how to produce the desired pairwise matrix, even if it means rewriting the function code within apply()? Also, an idea of how to keep track of which pairwise comparisons are performed would be greatly appreciated.
Thank you,
Alex
Your output gives you the result of the calculation in the order of the pairs in combinations: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4). If you want to organise this into a symmetric square matrix you can do a basic manipulation on the result, e.g. as follows:
out.mx <- c(0.10351562, 0.08203125, 0.13476562, 0.13085938, 0.14843750, 0.10937500)
out.mtx <- matrix(nrow = ncol(d), ncol = ncol(d))
out.mtx[,] <- 1
for (i in 1:length(combinations[1, ])) {
  a <- combinations[1, i]
  b <- combinations[2, i]
  out.mtx[a, b] <- out.mtx[b, a] <- out.mx[i]
}
out.mtx
which gives you
[,1] [,2] [,3] [,4]
[1,] 1.00000000 0.1035156 0.08203125 0.1347656
[2,] 0.10351562 1.0000000 0.13085938 0.1484375
[3,] 0.08203125 0.1308594 1.00000000 0.1093750
[4,] 0.13476562 0.1484375 0.10937500 1.0000000
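Equivalently, and a bit more compactly, you can assign with matrix indexing, using the transpose of combinations as (row, column) index pairs (a sketch based on the same objects):
out.mtx <- matrix(1, ncol(d), ncol(d))
ij <- t(combinations)         # pairs (1,2), (1,3), ... as index rows
out.mtx[ij] <- out.mx         # fill the upper triangle
out.mtx[ij[, 2:1]] <- out.mx  # mirror into the lower triangle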
I have a data.frame with several variables X1, X2, ... and a grouping variable "site". I want to find the proportion of rows where X1 at site 1 is greater than X1 at site 2. I can do that with a fixed number of site levels, one variable at a time, but I would like to generalize it to any number of levels and several variables. Here is an example:
# Generate data
set.seed(20130226)
n <- 100
x1 <- matrix(c(rnorm(n, mean = 2), rnorm(n, mean = 5)), ncol = 2)
x2 <- matrix(c(rnorm(n, mean = 1), rnorm(n, mean = 4)), ncol = 2)
x3 <- matrix(c(rnorm(n, mean = 3), rnorm(n, mean = 3)), ncol = 2)
xx <- data.frame(x1, site = 1)
xx <- rbind(xx, data.frame(x2, site = 2))
xx <- rbind(xx, data.frame(x3, site = 3))
# comparisons
s <- unique(xx$site)
me1 <- with(xx, xx[site == s[1], ])
me2 <- with(xx, xx[site == s[2], ])
me3 <- with(xx, xx[site == s[3], ])
Pg1.gt.g2 <- sum(me1[, "X1"] > me2[, "X1"]) / nrow(me1)
Pg1.gt.g3 <- sum(me1[, "X1"] > me3[, "X1"]) / nrow(me1)
Pg2.gt.g3 <- sum(me2[, "X1"] > me3[, "X1"]) / nrow(me1)
# build table
comp1 <- data.frame(Group = c(paste(s[1], ">", s[2]),
                              paste(s[1], ">", s[3]),
                              paste(s[2], ">", s[3])),
                    P = c(Pg1.gt.g2, Pg1.gt.g3, Pg2.gt.g3))
print(comp1)
I can't figure out how to do this for different numbers of groups and several variables, maybe using plyr.
Thanks!
I would reshape the data into a matrix where each column represents a group:
# Unique sites
s <- unique(xx$site)
# Columns are each group, data are X1 values
mat <- do.call(cbind, lapply(split(xx, xx$site), function(x) x$X1))
# Compare all pairs of sites
do.call(rbind, apply(combn(seq_along(s), 2), 2,
function(x) data.frame(g1=s[x[1]], g2=s[x[2]],
prop=sum(mat[,x[1]] > mat[,x[2]])/nrow(mat))))
# g1 g2 prop
# 1 1 2 0.83
# 2 1 3 0.20
# 3 2 3 0.09
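The question also asks about several variables; the same idea extends with an outer lapply() over variable names (a sketch, assuming every site contributes the same number of rows, as the cbind step requires):
vars <- c("X1", "X2")
do.call(rbind, lapply(vars, function(v) {
  # one column per site for this variable
  m <- do.call(cbind, lapply(split(xx, xx$site), function(x) x[[v]]))
  do.call(rbind, apply(combn(seq_along(s), 2), 2, function(x)
    data.frame(var = v, g1 = s[x[1]], g2 = s[x[2]],
               prop = sum(m[, x[1]] > m[, x[2]]) / nrow(m))))
}))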