Suppose my data frame DF has two columns, $A and $B. $A is always present. $B is sometimes coded NaN when the value is missing. I want to predict the missing values of $B (call them $B.predicted) and create a new column $B.complete such that $B.complete[i] is $B.predicted[i] if $B[i] is NaN and $B[i] otherwise.
I use multinom, which requires a factor as the dependent variable, to fit the model on the rows where I have a full observation, using:
DF$B.factor <- factor(DF$B)
model.results <- multinom(formula=B.factor ~ A,
data=DF[!is.na(DF$B),])
B.predicted <- predict(model.results, newdata=DF, type="class")
The variable B.predicted is a factor.
My DF$B column is not a factor.
My question is: how do I merge DF$B and B.predicted to create B.complete? In particular, since B.predicted is a factor and DF$B is not, does this code pick up the correct values?
B.complete <- ifelse(is.na(DF$B), B.predicted, DF$B)
Use replace:
library(nnet)  # provides multinom()
set.seed(1)
DF <- data.frame(A = factor(sample(letters[1:5], 30, TRUE)),
                 B = sample(c(letters[1:3], NA), 30, TRUE, prob = rep(c(0.3, 0.1), c(3, 1))),
                 stringsAsFactors = FALSE)
DF$B.factor <- factor(DF$B)
# no need to include is.na(DF$B) as multinom will omit anyway
model <- multinom(B.factor ~ A, data = DF)
# use replace to replace the NA values (converting to character when necessary)
DF$B.complete <- replace(DF$B, is.na(DF$B), as.character(predict(model, newdata = DF[is.na(DF$B),])))
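To see why the conversion matters, here is a minimal sketch with toy data (the vectors below are made up for illustration, and B.predicted merely stands in for the output of predict(..., type = "class")). ifelse() drops the factor attributes, so the factor branch contributes integer level codes rather than the labels.

```r
# Toy data: B is numeric with missing values; B.predicted is a factor.
B <- c(10, NaN, 30, NaN)
B.predicted <- factor(c(10, 20, 30, 30))   # levels "10", "20", "30"

# The factor is reduced to its integer codes (2 and 3), not its labels
bad <- ifelse(is.na(B), B.predicted, B)

# Convert factor -> character -> numeric to recover the labels
good <- ifelse(is.na(B), as.numeric(as.character(B.predicted)), B)

print(bad)
print(good)
```

If B is character rather than numeric, as.character(B.predicted) alone should be enough.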
I'm trying to test various imputation methods in R and I've written a function which takes a data frame, inserts some random NA values, imputes the missing values and then compares the imputation method back to the original data using MAE.
My function looks as follows:
pacman::p_load(tidyverse)
impute_diamonds_accuracy <- function(df, col, prop) {
require(tidyverse)
# Sample the indices of the rows to convert to NA
n <- nrow(df)
idx_na <- sample(1:n, prop*n)
# Convert the values at the sampled indices to NA
df[idx_na, col] <- NA
# Impute missing values using mice with pmm method
imputed_df <- mice::mice(df, method='pmm', m=1, maxit=10)
imputed_df <- complete(imputed_df)
# Calculate MAE between imputed and original values
mae <- mean(abs(imputed_df[idx_na, col] - df[idx_na, col]), na.rm = TRUE)
return(list(original_data = df,imputed_data = imputed_df, accuracy = mae))
}
impute_diamonds_accuracy(df = diamonds, col = 'cut', prop = 0.02)
The function prints to the screen that it's doing the imputation, but it fails when it performs the MAE calculation with the following error:
Error in imputed_df[idx_na, col] - df[idx_na, col] :
non-numeric argument to binary operator
How can I compare the original data against the imputed version to get a sense of the accuracy?
diamonds is a tibble.
> library(ggplot2)
> data(diamonds)
> is_tibble(diamonds)
[1] TRUE
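so we may need to use [[ to extract the column as a vector (single-bracket indexing on a tibble returns another tibble, not a vector). Also, idx_na holds the indices of the rows that were set to NA, so after the assignment df no longer contains the original values at those positions. To compare on that subset, make a copy of the original data before assigning the NAs, and then compare the imputed data against the copy: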
mae <- mean(abs(imputed_df[[col]][idx_na] - df_cpy[[col]][idx_na]), na.rm = TRUE)
Full code:
impute_diamonds_accuracy <- function(df, col, prop) {
# Sample the indices of the rows to convert to NA
n <- nrow(df)
idx_na <- sample(1:n, prop*n)
df_cpy <- df  # keep the original values; plain assignment copies on modify
# Convert the values at the sampled indices to NA
df[idx_na, col] <- NA
# Impute missing values using mice with pmm method
imputed_df <- mice::mice(df, method='pmm', m=1, maxit=10)
imputed_df <- mice::complete(imputed_df)
# Calculate MAE between imputed and original values
mae <- mean(abs(imputed_df[[col]][idx_na] - df_cpy[[col]][idx_na]), na.rm = TRUE)
return(list(original_data = df,imputed_data = imputed_df, accuracy = mae))
}
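One caveat: diamonds$cut is an ordered factor, and subtraction is not defined for factors, so an absolute-error measure only makes sense for numeric columns. As a rough sketch (imputation_error is a hypothetical helper, not part of mice), the comparison could switch to a misclassification rate for non-numeric columns:

```r
# Hypothetical helper: compare imputed values with the originals at the
# positions that were set to NA.
imputation_error <- function(original, imputed, idx) {
  truth <- original[idx]
  guess <- imputed[idx]
  if (is.numeric(truth)) {
    # Mean absolute error for numeric columns
    mean(abs(guess - truth), na.rm = TRUE)
  } else {
    # Share of mismatched labels for factor or character columns
    mean(as.character(guess) != as.character(truth), na.rm = TRUE)
  }
}

# Numeric column: one of the two masked values is off by 2, so MAE is 1
num_err <- imputation_error(c(1, 2, 3, 4), c(1, 2, 5, 4), idx = 3:4)

# Factor column: one of the two masked labels is wrong, so the rate is 0.5
fac_err <- imputation_error(factor(c("a", "b", "a", "b")),
                            factor(c("a", "b", "b", "b")), idx = 3:4)
```

Inside the function above, this would be called as imputation_error(df_cpy[[col]], imputed_df[[col]], idx_na).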
I am trying to upsample an imbalanced dataset in R using the upSample function in Caret. However upon applying the function it completely removes the target variable C_flag from the dataset. Here is my code:
set.seed(100)
'%ni%' <- Negate('%in%')
up_train <- upSample(x = train[, colnames(train) %ni% "C_flag"], #all predictor variables
y = train$C_flag) #target variable
Here is the amount of each category of C_flag in the train set.
0 = 100193, 1=29651.
I test to see if C_flag is there with this result:
print(up_train$C_flag)
NULL
Does anyone know why this function is removing this variable instead of upsampling?
The first thing that comes to mind is whether train$C_flag is a factor or not. Anyway, I tried this sample dataset:
library(tidyverse)
library(caret)
train <- data.frame(x1 = c(2,3,4,2,3,3,3,8),
x2 = c(1,2,1,2,4,1,1,4),
C_flag = c("A","B","B","A","A","A","A","A"))
train$C_flag <- as.factor(train$C_flag)
'%ni%' <- Negate('%in%')
up_train <- upSample(x = train[,colnames(train) %ni% "C_flag"],
y = train$C_flag)
up_train$C_flag
And it returned NULL. Why? Because the target column was renamed "Class", which is the default value of the yname argument. So if you want the target column to be called C_flag, pass that name as yname:
up_train <- upSample(x = train[,colnames(train) %ni% "C_flag"],
y = train$C_flag,
yname = "C_flag")
print(up_train$C_flag)
[1] A A A A A A B B B B B B
Levels: A B
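To make the renaming visible without caret, here is a rough base-R sketch of the same idea. upsample_base is a hypothetical stand-in written for illustration; it is not caret's implementation, it only mimics the behaviour described above (resample every class up to the size of the largest one and attach the outcome under yname).

```r
# Hypothetical base-R stand-in for up-sampling; NOT caret::upSample itself.
upsample_base <- function(x, y, yname = "Class") {
  stopifnot(is.factor(y), nrow(x) == length(y))
  target <- max(table(y))                  # size of the largest class
  idx <- unlist(lapply(levels(y), function(lv) {
    rows <- which(y == lv)
    if (length(rows) == 0) return(integer(0))
    extra <- target - length(rows)         # how many rows to add
    c(rows, rows[sample.int(length(rows), extra, replace = TRUE)])
  }))
  out <- x[idx, , drop = FALSE]
  out[[yname]] <- y[idx]                   # outcome column is named by yname
  rownames(out) <- NULL
  out
}

set.seed(100)
train <- data.frame(x1 = c(2, 3, 4, 2, 3, 3, 3, 8),
                    x2 = c(1, 2, 1, 2, 4, 1, 1, 4),
                    C_flag = factor(c("A", "B", "B", "A", "A", "A", "A", "A")))

up_default <- upsample_base(train[, c("x1", "x2")], train$C_flag)
up_named   <- upsample_base(train[, c("x1", "x2")], train$C_flag, yname = "C_flag")

names(up_default)        # outcome stored as "Class", so $C_flag is NULL
table(up_named$C_flag)   # both classes now have the same count
```

Checking names(up_train) on the real caret result is a quick way to confirm where the target ended up.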
I know inputting a data.frame to cor will automatically give a correlation matrix (ex. cor(mtcars)).
But I wonder why when I input my own data.frame (dat_w) to cor I get the below error?
I have NA and Inf but have used use = 'pairwise.complete.obs'.
dat_w <- read.csv('https://raw.githubusercontent.com/izeh/n/master/w1.csv', stringsAsFactors = F)
cor(dat_w, use = 'pairwise.complete.obs')
# >Error : 'x' must be numeric
We can find the columns that are numeric automatically
i1 <- sapply(dat_w, is.numeric)
out <- cor(dat_w[i1], use = 'pairwise.complete.obs')
If we want to replace the NaN with some value, e.g. 0 (is.na() is also TRUE for NaN, so one check covers both):
out1 <- replace(out, is.na(out), 0)
Because your 2nd column (gender) is not numeric. Try:
cor(dat_w[-2], use = 'pairwise.complete.obs')
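A small self-contained sketch of both points, using a made-up data frame (dat, score, gender, and height are illustrative names, not the contents of w1.csv). It also turns Inf into NA first, since use = 'pairwise.complete.obs' only drops NA/NaN and an Inf left in a column leads to NaN correlations.

```r
# Made-up data: one character column, plus NA and Inf in a numeric column
dat <- data.frame(score  = c(1, 2, 3, Inf, NA),
                  gender = c("f", "m", "f", "m", "f"),
                  height = c(2, 4, 6, 8, 10),
                  stringsAsFactors = FALSE)

# Select the numeric columns only; cor() rejects anything else
num_cols <- vapply(dat, is.numeric, logical(1))
num_dat  <- dat[num_cols]

# Treat infinite values as missing so pairwise deletion can skip them
num_dat[] <- lapply(num_dat, function(v) replace(v, is.infinite(v), NA))

cm <- cor(num_dat, use = "pairwise.complete.obs")
print(cm)
```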
I've two dfs (d1 and d2). I have to apply a correlation function to each time series after matching the Product IDs from a third data frame (m, in this example). It can be a one-to-many match. For example, "c" is linked to both "C" and "D". So, for "A"-"a", cor function would be applied to (1,2) from d1 and (50,20) from d2. How do we automate this process? Any sort of help is appreciated! Thanks!
#Reproducible Example
d1 <- data.frame(ProductID1 = c("A","B","C","D"),
Aug16 = c(1,2,3,4), Sep16 = c(2,3,4,4))
d2 <- data.frame(ProductID2 = c("a","b","c"),
Aug16 = c(50,20,30), Sep16 = c(20,40,40))
m <- data.frame(ProductID1 = c("A","B","C","D"),
ProductID2 = c("a","b","c","c"))
# Look up the value of A in dataframe m, the value is "a".
#Find "a" in dataframe d2. Apply cor() to A's time series and a's time series.
# Output should look like this. I can put the correlation values(-0.1285341, etc.) in a matrix.
# A matched with a, apply cor function
cor(as.numeric(d1[1,2:3]),as.numeric(d2[1,2:3]))
[1] -0.1285341
# B matched with b, apply cor function
cor(as.numeric(d1[2,2:3]),as.numeric(d2[2,2:3]))
[1] 0.8808123
# C matched with c, apply cor function
cor(as.numeric(d1[3,2:3]),as.numeric(d2[3,2:3]))
[1] -1
# D matched with c, apply cor function
cor(as.numeric(d1[4,2:3]),as.numeric(d2[3,2:3]))
[1] NA
Not sure, but perhaps this is what you are looking for (it does not exactly match the given desired output):
myFunction <- function(x,y){
subset_d1 <- d1[ d1$ProductID1 == x, ]
subset_d2 <- d2[ d2$ProductID2 == y, ]
return( cor( as.numeric( subset_d1 ), as.numeric( subset_d2 ) ) )
}
mapply( myFunction, m$ProductID1, m$ProductID2, SIMPLIFY = TRUE )
Output:
[1] -0.1285341 0.8808123 0.7088739 NA
with a warning for the fourth row of m: the standard deviation is zero, so cor() returns NA.
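One thing to watch in the function above: as.numeric() is applied to the whole matched row, so the ID column is converted along with the monthly values and ends up in the correlation. A possible variant that drops the ID column first is sketched below (pair_cor is a hypothetical name, and the data frames are rebuilt with stringsAsFactors = FALSE so the IDs are plain strings). With only two months per product, each correlation can only be 1, -1, or NA.

```r
d1 <- data.frame(ProductID1 = c("A", "B", "C", "D"),
                 Aug16 = c(1, 2, 3, 4), Sep16 = c(2, 3, 4, 4),
                 stringsAsFactors = FALSE)
d2 <- data.frame(ProductID2 = c("a", "b", "c"),
                 Aug16 = c(50, 20, 30), Sep16 = c(20, 40, 40),
                 stringsAsFactors = FALSE)
m  <- data.frame(ProductID1 = c("A", "B", "C", "D"),
                 ProductID2 = c("a", "b", "c", "c"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: correlate the two time series for one matched pair
pair_cor <- function(id1, id2) {
  x <- unlist(d1[d1$ProductID1 == id1, -1])   # drop the ID column
  y <- unlist(d2[d2$ProductID2 == id2, -1])
  suppressWarnings(cor(x, y))                 # zero variance gives NA
}

res <- mapply(pair_cor, m$ProductID1, m$ProductID2, USE.NAMES = FALSE)
cbind(m, cor = res)
```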
I want to compute the Pearson correlation coefficient between each row of a multi-dimensional data set and one numeric vector in R. Basically, I expect to get the correlations using the Pearson method and then keep only the rows (the features) whose correlation coefficient passes a certain threshold. I tried an R implementation but didn't get the correct correlation matrix. How can I do this easily in R?
Reproducible example:
persons_df <- data.frame(person1=sample(1:20,10, replace = FALSE),
person2=as.factor(sample(10)),
person3=sample(1:25,10, replace = FALSE),
person4=sample(1:30,10, replace = FALSE),
person5=as.factor(sample(10)),
person6=as.factor(sample(10)))
row.names(persons_df) <-letters[1:10]
In persons_df, the rows are different features and the columns are different persons.
I also have age_df, which holds the age of each person.
age_df <- data.frame(personID= colnames(persons_df),
age=sample(1:50, 6 , replace = FALSE))
My initial attempt:
pearson_corr <- function(df1, df2, verbose=FALSE){
stopifnot(ncol(df1)==nrow(df2))
res <- as.data.frame()
lapply(colnames(df1), function(x){
lapply(x, rownames(y){
if(colnames(x) %in% rownames(df2)){
cor_mat <- stats::cor(y, df2$age, method = "pearson")
ncor <- ncol(cor_mat)
cmatt <- col(cor_mat)
ord <- order(-cmat, cor_mat, decreasing = TRUE)- (ncor*cmatt - ncor)
colnames(ord) <- colnames(cor_mat)
res <- cbind(ID=c(cold(ord), ID2=c(ord)))
res <- as.data.frame(cbind(out, cor=cor_mat[res]))
res <- cbind(res, cor=cor_mat[out])
}
})
})
return(final_df)
}
But the code above didn't return the correct correlation matrix. What I want to know is how each feature (each row, measured across the persons) correlates with the persons' ages. Is there an efficient way to do this?
Goal:
Basically, I want to keep the features that show a high correlation with age. Can anyone point out how to get this done easily and efficiently in R? Thanks.
mylist = do.call(rbind,
apply(persons_df, 1, function(x){
temp = cor.test(age_df$age, as.numeric(x))
data.frame(t = temp$statistic, p = temp$p.value)
}))
mylist
# t p
#a -1.060264 3.488012e-01
#b -2.292612 8.361623e-02
#c -16.785311 7.382895e-05
#d -1.362776 2.446304e-01
#e -1.922296 1.269356e-01
#f -4.671259 9.509393e-03
#g -3.719296 2.048710e-02
#h -2.684663 5.496171e-02
#i -15.814635 9.341701e-05
#j -2.423014 7.252635e-02
Then use mylist to filter out the rows you don't want.
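The filtering step might look like the sketch below. It uses small fixed data instead of the random persons_df so the result is reproducible; features, age, and the 0.8 cutoff are illustrative assumptions, not values from the question.

```r
# Made-up data: rows are features, columns are persons
age <- c(20, 30, 40, 50, 60, 70)
features <- data.frame(person1 = c(1, 1, 6), person2 = c(2, -1, 5),
                       person3 = c(3, 1, 4), person4 = c(4, -1, 3),
                       person5 = c(5, 1, 2), person6 = c(6, -1, 1),
                       row.names = c("a", "b", "c"))

# Pearson correlation of every feature (row) with age
feature_cor <- apply(features, 1, function(x) cor(age, as.numeric(x)))

# Keep the features whose absolute correlation reaches the cutoff
threshold <- 0.8
kept <- features[abs(feature_cor) >= threshold, , drop = FALSE]

print(round(feature_cor, 3))
print(rownames(kept))
```

Filtering on the p-value column of mylist instead works the same way, e.g. mylist[mylist$p < 0.05, ].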