Replace missing values with Mean and Mode (Custom function) - r

I am required to build a function which uses mean to replace missing values for continuous/integer variables and uses mode to replace missing values for categorical variables.
The data comes from the UCI credit screening dataset:
X <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data", header = FALSE, na.strings = '?')
The first column of the dataset is of factor type, the second and third columns are numeric, and so on.
I built a mode function
mode_function <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
Which works as intended.
The overall function that I am using on the dataset is
broken <- function(data){
  for(i in 1:ncol(data)){
    if(is.factor(data[,i])){
      data[is.na(data[,i]),i] <- mode_function(data[,i])
    }
    else{
      data[is.na(data[,i]),i] <- mean(data[,i], na.rm = TRUE)
    }
  }
  return(data)
}
Problem: I run this function and nothing changes in my dataset. I still have the same number of missing values as I did before the function was run.
This line outside of the function works just as intended. The same with the code that deals with mean.
data[is.na(data[,i]),i] <- mode_function(data[,i])
But once I try to use my function to perform the exact same operations nothing happens.

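The most likely reason for "nothing happening" is that the result was never assigned to a name. R functions work on a copy of their argument, so the original data frame stays unchanged unless you store what the function returns. Try this:
maybe_res <- broken(data)
Check this: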
> sapply(X, function(x) sum(is.na(x)))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
12 12 0 6 6 9 9 0 0 0 0 0 0 13 0 0
> sapply( broken(X), function(x) sum(is.na(x)))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
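I should warn you that mode functions are notorious for delivering answers that may not be what is desired. This one resolves ties by returning whichever value appears first, and it counts NA as a value, so NA itself can come back as the mode if it is the most frequent entry.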
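For what it's worth, here is a minimal sketch of a variant that drops NA before looking for the mode and treats every non-numeric column (factor or character) as categorical. The names mode_no_na, impute and the toy data frame are made up for illustration; they are not from the question. One thing worth keeping in mind: since R 4.0.0, read.csv no longer creates factors by default, so an is.factor() test may never be TRUE on freshly read data.

```r
# Mode that ignores NA, so a column with many gaps cannot return NA as its mode.
# (Hypothetical helper name; ties still go to the value that appears first.)
mode_no_na <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Fill numeric columns with the mean, everything else with the mode.
impute <- function(data) {
  for (i in seq_along(data)) {
    col <- data[[i]]
    if (!anyNA(col)) next                      # nothing to fill in this column
    fill <- if (is.numeric(col)) mean(col, na.rm = TRUE) else mode_no_na(col)
    data[[i]][is.na(col)] <- fill
  }
  data
}

# Toy data, invented for this sketch
toy <- data.frame(
  grp = c("a", NA, "b", "a"),
  val = c(1, NA, 3, 5),
  stringsAsFactors = TRUE
)

filled <- impute(toy)   # the result must be assigned; toy itself is unchanged
```

As in the answer above, the important part is the assignment on the last line: calling impute(toy) without storing the result leaves toy exactly as it was.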

Related

convert a list of class data frame objects into objects of class matrix - list

I have a list of data frames. Each data frame has 6 rows and 6 columns. They are all numbers, however, all data frames have their elements as class character.
Example:
$`A`
V1 V2 V3 V4 V5 V6
V1 0.1212 0.6231 0.4431 0.3213 0.6578 0.1259
V2 2.1234 0.6532 0.9845 0.8743 0.8732
V3 0.2314 0.7648 0.7634 0.8732
V4 0.1234 0.6544 0.3456
V5 0.7653 0.9812
V6 0.1265
$`B`
V1 V2 V3 V4 V5 V6
V1 0.2345 0.1234 0.5647 0.7891 0.6721 0.3259
V2 1.1334 0.4332 0.1245 0.2343 0.5332
V3 0.2914 0.1648 0.2334 0.1232
V4 0.1234 0.6744 0.5656
V5 0.3553 0.9812
V6 0.4665
I would like to change all data frames of the list to class matrix (numerical).
I tried:
lapply (list, data.matrix)
but the result is a list of data frames with integers. Example:
V1 V2 V3 V4 V5 V6
V1 2 2 2 2 2 4
V2 1 3 4 5 5 7
V3 1 1 3 4 6 3
V4 1 1 1 3 4 5
V5 1 1 1 1 1 1
V6 1 1 1 1 1 1
Also tried to run
lapply(list, as.matrix)
however, I got a list of quoted matrices, like this:
$`A`
V1 V2 V3 V4 V5 V6
V1 "0.1212" "0.6231" "0.4431" "0.3213" "0.6578" "0.1259"
V2 "2.1234" "0.6532" "0.9845" "0.8743" "0.8732"
V3 "0.2314" "0.7648" "0.7634" "0.8732"
V4 "0.1234" "0.6544" "0.3456"
V5 "0.7653" "0.9812"
V6 "0.1265"
How can I convert these data frames of my list from character class to matrix class?
We can loop over the list with lapply, convert each data.frame's columns to numeric with an inner lapply, assign the result back into the data.frame with x[] <- ... so that it keeps its structure, and return the data.frame (x):
list <- lapply(list, function(x) {x[] <- lapply(x, as.numeric);x})
If those are factor columns, convert to character first and then to numeric
lapply(list, function(x) {
  x[] <- lapply(x, function(y) as.numeric(as.character(y)))
  x
})
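The code above returns data frames with numeric columns. If actual numeric matrices are wanted, as the question asks, one possible sketch is to build the matrix column by column with vapply. The list dfs below is a tiny invented stand-in for the real data, and blank cells come out as NA.

```r
# Toy stand-in for the list in the question (values and shape are illustrative only)
dfs <- list(
  A = data.frame(V1 = c("0.1212", ""), V2 = c("0.6231", "0.6532"),
                 stringsAsFactors = FALSE),
  B = data.frame(V1 = c("0.2345", ""), V2 = c("0.1234", "0.4332"),
                 stringsAsFactors = FALSE)
)

mats <- lapply(dfs, function(x) {
  # as.character() first, so factor columns are converted by label, not by level code
  m <- vapply(x, function(col) as.numeric(as.character(col)), numeric(nrow(x)))
  rownames(m) <- rownames(x)
  m
})

str(mats$A)   # a numeric matrix; the blank cell has become NA
```

Note that vapply only returns a matrix here when each data frame has more than one row, which holds for the 6 x 6 case in the question.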
You can convert to numeric and then reset the matrix order:
lapply(dfs, function(x) matrix(as.numeric(x), ncol = n_cols))
Data
set.seed(1L)
n_cols <- 6
n_total <- 36
a <- matrix(rnorm(n_total), ncol = n_cols)
b <- matrix(rnorm(n_total), ncol = n_cols)
a[lower.tri(a)] <- ""
b[lower.tri(b)] <- ""
dfs <- list(a, b)

How do I sort my columns in a dataframe so that all the identical strings in a row are in the same column

I have a very large dataframe of 63 columns and 1697 rows. The end of the rows fill up with NAs but I want the matching values in rows to be in the same column, and stick the NAs into the gaps
a bit like this (updated):
v1 <- c("v1","v1","v1","v1","v1","v1","v1")
v2 <- c("v3","v2","v2","NA","v2","v2","v2")
v3 <- c("v4","v4","v3","NA","v3","v3", "v3")
v4 <- c("v5","v5","v4","NA","v5","v4","NA")
v5 <- c("NA","NA","v5","NA","v6","v6", "NA")
v6 <- c("NA","NA","v6","NA","v7","v7","NA")
v7 <- c("NA","NA","NA","NA","NA","NA","NA")
df <- data.frame(v1,v2,v3,v4,v5,v6,v7)
df
v1 v2 v3 v4 v5 v6 v7
1 v1 v3 v4 v5 NA NA NA
2 v1 v2 v4 v5 NA NA NA
3 v1 v2 v3 v4 v5 v6 NA
4 v1 NA NA NA NA NA NA
5 v1 v2 v3 v5 v6 v7 NA
6 v1 v2 v3 v4 v6 v7 NA
7 v1 v2 v3 NA NA NA NA
but I would like everything aligned like this:
v1 v2 v3 v4 v5 v6 v7
1 v1 NA v3 v4 v5 NA NA
2 v1 v2 NA v4 v5 NA NA
3 v1 v2 v3 v4 v5 v6 NA
4 v1 NA NA NA NA NA NA
5 v1 v2 v3 NA v5 v6 v7
6 v1 v2 v3 v4 NA v6 v7
7 v1 v2 v3 NA NA NA NA
I have tried map.values(), which didn't come out as expected, as well as an ifelse(), but both require me to enter specific cell data and change it by hand.
The column names do match the cell names.
I want to use the data to put into a presence absence plot, so I figured after I can just
for (i in 1:63){
  df[,i] <- gsub("NA", 0, df[,i])
}
and then same for anything containing "v" to have a binary 1 or 0 for presence or absence, but they have to be aligned
There are no predefined rules governing the data, the dataframe has been conglomerated together from many other .csv files and this is the best format I can get it into currently.
Any help would be appreciated!
Updated answer to match new input data
Data
I removed the quotation marks from NA:
v1 <- c("v1","v1","v1","v1","v1","v1","v1")
v2 <- c("v3","v2","v2",NA,"v2","v2","v2")
v3 <- c("v4","v4","v3",NA,"v3","v3", "v3")
v4 <- c("v5","v5","v4",NA,"v5","v4",NA)
v5 <- c(NA,NA,"v5",NA,"v6","v6", NA)
v6 <- c(NA,NA,"v6",NA,"v7","v7",NA)
v7 <- c(NA,NA,NA,NA,NA,NA,NA)
df <- data.frame(v1,v2,v3,v4,v5,v6,v7, stringsAsFactors = FALSE)
Code
l <- list()
u <- c("v1", "v2", "v3", "v4", "v5", "v6", "v7")
h <- NULL
for(k in 1:nrow(df)){
  # create a list entry for each row of the df
  l[[k]] <- df[k, ]
  for(i in 1:length(l[[k]])){
    # check if the value exists in the row
    if(u[i] %in% l[[k]]){
      # find the index of the value, given that it exists
      a <- which(l[[k]] == u[i])
      # assign to the "help" vector in order not to overwrite values
      h[i] <- l[[k]][a]
    }
    else{
      # values that do not exist in the row are assigned NA
      h[i] <- NA
    }
  }
  # replace the row by the sorted vector with NA placeholders ("help" vector)
  l[[k]] <- h
}
Result
df1 <- as.data.frame(do.call(rbind, l))
df1
V1 V2 V3 V4 V5 V6 V7
1 v1 NA v3 v4 v5 NA NA
2 v1 v2 NA v4 v5 NA NA
3 v1 v2 v3 v4 v5 v6 NA
4 v1 NA NA NA NA NA NA
5 v1 v2 v3 NA v5 v6 v7
6 v1 v2 v3 v4 NA v6 v7
7 v1 v2 v3 NA NA NA NA
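The same idea can probably be written without the nested loops. The following is only a sketch, assuming (as the question states) that the column names equal the cell labels; the names aligned and presence are mine. It also produces the 0/1 presence/absence matrix directly, which may make the gsub step unnecessary.

```r
df <- data.frame(
  v1 = c("v1","v1","v1","v1","v1","v1","v1"),
  v2 = c("v3","v2","v2",NA,"v2","v2","v2"),
  v3 = c("v4","v4","v3",NA,"v3","v3","v3"),
  v4 = c("v5","v5","v4",NA,"v5","v4",NA),
  v5 = c(NA,NA,"v5",NA,"v6","v6",NA),
  v6 = c(NA,NA,"v6",NA,"v7","v7",NA),
  v7 = NA_character_,
  stringsAsFactors = FALSE
)

u <- names(df)   # assumption: column names are the labels to align on

# For each row, keep label u[j] in column j if it occurs anywhere in that row
aligned <- t(apply(df, 1, function(r) ifelse(u %in% r, u, NA)))
colnames(aligned) <- u

# 1 = present, 0 = absent
presence <- +(!is.na(aligned))
```

Both results are matrices; wrap them in as.data.frame() if a data frame is needed for plotting.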

Fast method in R of obtaining a value from one column based on searched value in another?

I have two data frames, one of which has a large list of two identifiers:
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
And one which contains one of the two identifiers:
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 1_14907_G_A 0 14907 G A
What I want is to replace the ID in the second dataframe with the corresponding second ID from the first dataframe (I also couldn't think of a succinct way of phrasing that for the title of this question hence why it's phrased so awkwardly and why I might not have been able to find duplicates). I.e.:
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 rs6682375 0 14907 G A
My solution at present is to use a for ... if loop as follows:
for (x in 1:nrow(df2)){
  # position of this ID in df1 (NA if it is not there)
  idx <- match(df2$V2[x], df1$uniq_id)
  if (!is.na(idx)){
    df2$V2[x] <- df1$rsid[idx]
  }
}
However, because both files are extremely large, I believe that this is likely a very inefficient way of doing this and am wondering if there is a faster method.
Someone suggested that the match() function might be quicker. However, the R documentation for it suggests that %in% is the more intuitive interface, and since I have no experience with match() I'm not sure how to apply it here.
Any help appreciated.
This is an update-join, in data.table terminology. Assuming the first table is called df and the second is called df2:
library(data.table)
setDT(df)
setDT(df2)
df2[df, on = .(V2 = uniq_id), V2 := rsid]
df2
# V1 V2 V3 V4 V5 V6
# 1: 1 1_10439_A_AC 0 10439 A AC
# 2: 1 1_13417_CGAGA_C 0 13417 C CGAGA
# 3: 1 rs6682375 0 14907 G A
Data used
df <- fread('
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
')
df2 <- fread('
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 1_14907_G_A 0 14907 G A
')
A method using base R's merge(). It would also be easy to do this with dplyr and its left_join function if desired.
df <- data.table::fread('
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
')
df2 <- data.table::fread('
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 1_14907_G_A 0 14907 G A
')
df2 <- merge(df2,df,by.x = c("V2"),by.y = c("uniq_id"),all.x = TRUE)
df2$V2 <- ifelse(!is.na(df2$rsid),df2$rsid,df2$V2)
df2$rsid <- NULL
df2
# V2 V1 V3 V4 V5 V6
# 1: 1_10439_A_AC 1 0 10439 A AC
# 2: 1_13417_CGAGA_C 1 0 13417 C CGAGA
# 3: rs6682375 1 0 14907 G A
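Since the question mentions match(), here is a minimal base R sketch of how it could be applied to whole columns at once instead of row by row. The data frame names df1 and df2 follow the question's loop, and the trimmed-down df2 (only V1 and V2) is just for illustration.

```r
df1 <- data.frame(
  rsid    = c("rs796086906", "rs546169444", "rs6682375", "rs6682385"),
  uniq_id = c("1_13868_G_A", "1_14464_T_A", "1_14907_G_A", "1_14930_G_A"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  V1 = 1,
  V2 = c("1_10439_A_AC", "1_13417_CGAGA_C", "1_14907_G_A"),
  stringsAsFactors = FALSE
)

# For every V2, the row of df1 holding the same uniq_id (NA when there is none)
idx <- match(df2$V2, df1$uniq_id)
hit <- !is.na(idx)

# Overwrite only the IDs that were found; the others stay as they are
df2$V2[hit] <- df1$rsid[idx[hit]]
df2
```

Unlike merge(), this keeps the row and column order of df2 unchanged. If uniq_id contains duplicates, match() uses the first occurrence.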

Replace specific values in a data frame except first column

I have this line in one of my functions: result[result>0.05] <- "". It replaces all values in my data frame greater than 0.05, including the row names stored in the first column. How can I avoid this?
This is a fast way too:
df <- as.data.frame(matrix(runif(100),nrow=10))
df[-1][df[-1]>0.05] <- ''
Output:
> df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0.60105471
2 0.63340567
3 0.11625581
4 0.96227379 0.0173133104108274
5 0.07333583
6 0.05474430 0.0228175506927073
7 0.62610309
8 0.76867090
9 0.76684615 0.0459537433926016
10 0.83312158
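One side effect worth knowing about: assigning "" turns the affected numeric columns into character columns. If that is not wanted, a possible variation (a sketch with invented data and a hypothetical id column, not the asker's actual result object) is to blank the values with NA instead:

```r
# Invented example data; the first column holds the row labels
result <- data.frame(
  id = paste0("row", 1:5),
  p1 = c(0.01, 0.20, 0.04, 0.50, 0.03),
  p2 = c(0.30, 0.02, 0.60, 0.049, 0.90),
  stringsAsFactors = FALSE
)

# Logical matrix computed over every column except the first
mask <- result[-1] > 0.05

# NA keeps the columns numeric; "" would convert them to character
result[-1][mask] <- NA
result
```

When printing or exporting, the NAs can still be shown as blanks, for example with print(result, na.print = "") or write.csv(result, na = "").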

How to create a binary matrix from a data.frame?

I have a data.frame with 16 columns. Here's one example row.
> data[16,]
V1 V2 V3 V4
16 comp27182_c0_seq4 ENSP00000442096 ENSG00000011143 ENSFCAP00000011376
V5 V6 V7 V8
16 ENSFCAG00000012261 comp48601_c0_seq1 comp19130_c0_seq3 comp22796_c2_seq3
V9 V10 V11 V12
16 comp146901_c0_seq1 comp157916_c0_seq1 comp158124_c0_seq1
V13 V14 V15 V16
16 comp229797_c0_seq1 comp61875_c0_seq2
I'm only interested in columns 1 and 6-16. The first column contains the name I would like to use as a column name in the matrix, 6 to 16 may contain either a string or '' (nothing).
I would like to transform this data.frame into a matrix showing 1 or 0, reflecting the content in columns 6-16.
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
comp27182_c0_seq4 1 1 1 1 0 1 1 1 1 0 0
I've tried to use a mask, without success. I'm sure there's a very easy option out there.
Thanks for any help.
Try this:
do.call(cbind, lapply(c(1,6:16),
  function(x) as.numeric(nchar(as.character(data[,x])) > 0)))
I slightly modified your code to my exact needs. Now the first column is naming the rows.
a <- do.call(cbind, lapply(c(6:16),
  function(x) as.numeric(nchar(as.character(data[,x])) > 0)))
rownames(a) <- data[,1]
It works great, thanks!
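A more compact way to get the same kind of result might be to compare the whole block of columns against "" in one step. This is only a sketch on a cut-down, partly invented data frame (three content columns instead of 6 to 16, and the second row is made up), and it assumes empty cells are "" rather than NA.

```r
data <- data.frame(
  V1 = c("comp27182_c0_seq4", "comp00001_c0_seq1"),   # second name is invented
  V6 = c("comp48601_c0_seq1", ""),
  V7 = c("comp19130_c0_seq3", "comp00002_c0_seq1"),
  V8 = c("", ""),
  stringsAsFactors = FALSE
)

content_cols <- c("V6", "V7", "V8")   # would be 6:16 on the real data

# TRUE where a cell is non-empty; unary + turns the logical matrix into 0/1
a <- +(as.matrix(data[, content_cols]) != "")
rownames(a) <- data$V1
a
```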
