How do I impute values by factor levels using 'missForest'?

I am trying to impute missing values in my dataframe with the non-parametric method available in missForest.
My data (OneDrive link) consists of one categorical variable and five continuous variables.
head(data)
phylo sv1 sv2 sv3 sv4 sv5
1 Phaon_camerunensis 6.03803 NA 5121.257 NA 70
2 Umma_longistigma 6.03803 NA 5121.257 NA 53
3 Umma_longistigma 6.03803 NA 5121.257 NA 64
4 Umma_longistigma 6.03803 NA 5121.257 NA 63
5 Sapho_ciliata 6.03803 NA 5121.257 NA 63
6 Sapho_gloriosa 6.03803 NA 5121.257 NA 63
I was successful at first using missForest()
imp<- missForest(data[2:6])
However, instead of imputing over the whole data matrix at once, I would like to impute missing values separately within each level of phylo.
I tried data[2:6] %>% group_by(phylo) %>% and sapply(split(data[2:6], data$phylo)) %>%, but without success.
Any suggestions on how to deal with this?

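If you want to run missForest for each group, you can use group_map. The data frame is called data here, as in the question; keep the phylo column in it so that group_by can find it. The as.data.frame() call is only a precaution in case missForest does not accept the tibbles that group_map passes in:
imp <- data %>% group_by(phylo) %>% group_map(~ missForest(as.data.frame(.x)))
Each element of imp is the list returned by missForest. To get only the first item (ximp, the imputed data) from each result: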
imp2 <- t(sapply(imp, "[[", 1))
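As a rough illustration of the per-group idea without dplyr, here is a minimal base R sketch. The toy data and the impute_means helper are hypothetical and not part of the question. A simple column-mean fill stands in for missForest so the example runs without that package; it is a much cruder method and is only there to show the split, impute, recombine pattern.

```r
# Toy data: one grouping column and two numeric columns with gaps (made-up values)
toy <- data.frame(
  phylo = c("a", "a", "a", "b", "b", "b"),
  sv1   = c(1, NA, 3, 10, 20, NA),
  sv2   = c(NA, 2, 4, 5, NA, 7)
)

# Hypothetical stand-in imputer: fills NAs in each column with that column's mean
impute_means <- function(d) {
  d[] <- lapply(d, function(col) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
  d
}

# Split the numeric columns by group, impute each piece separately,
# then put the rows back in their original order
pieces  <- split(toy[c("sv1", "sv2")], toy$phylo)
imputed <- unsplit(lapply(pieces, impute_means), toy$phylo)
result  <- cbind(phylo = toy$phylo, imputed)
```

If missForest is installed, function(d) missForest(d)$ximp could be passed to lapply in place of impute_means. Very small groups may not give a random forest enough rows to work with, so that is worth checking first.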

Related

Displaying answers on ranking question in R

I have the following variables which are the result of one ranking question. On that question, participants got the 7 listed motivations presented and should rank them. Here, value 1 means the participant put the motivation on position 1, and value 7 means he put it on last position. The ranking is expressed through the numbers on these variables (numbers 1 to 7):
'data.frame': 25 obs. of 8 variables:
$ id : num 8 9 10 11 12 13 14 15 16 17 ...
$ motivation_quantity : num NA 3 1 NA 3 NA NA NA 1 NA ...
$ motivation_quality : num NA 1 6 NA 3 NA NA NA 3 NA ...
$ motivation_timesaving : num NA 6 4 NA 2 NA NA NA 5 NA ...
$ motivation_contribution : num NA 4 2 NA 1 NA NA NA 2 NA ...
$ motivation_alternativelms: num NA 5 3 NA 6 NA NA NA 7 NA ...
$ motivation_inspiration : num NA 2 7 NA 4 NA NA NA 4 NA ...
$ motivation_budget : num NA 7 5 NA 7 NA NA NA 6 NA ...
What I want to do now is calculate and visualize the results of the ranking question (i.e. visualize how the motivations were ranked). Since I haven't worked with R for a long time, I am not sure how best to do this.
One way I could imagine is to first calculate the top 3 answers (the motivations which were most frequently ranked in position "1", "2", and "3" across participants).
I would really appreciate it if someone could help with this or even show a better way to analyse and visualize my data.
I originally had a visualization in Microsoft Forms, but it got destroyed by a bug overnight. It looked like this:
R stores these variables as numeric, which in statistical terms means they are treated as continuous. The goal is to convert them into categorical variables (called factors in R).
Let's get to work:
library(dplyr)
library(tidyr)
library(ggplot2) # needed for the ggplot() call further down
# The id column must stay out of the conversion. Note that is.numeric() is also TRUE for integer columns, so converting id with as.integer() would not protect it from mutate_if(is.numeric, ...). Instead, pick the ranking columns by name (your dataframe is called df here).
df <- df %>% mutate(across(starts_with("motivation_"), ~ factor(.x, levels = 1:7)))
# I guess in your case that would be all, but if you want the content of the dataframe to read position_1, position_2 ... position_7, add labels like this:
df <- df %>% mutate(across(starts_with("motivation_"),
~ factor(.x, levels = 1:7,
labels = paste("position", 1:7, sep = "_"))))
# For the visualisation now, we need to use the function gather in order to convert the df dataframe into a two column dataframe (and keeping the id column), we shall name this new dataframe df1
df1 <- df %>% gather(key=Questions, value=Answers, motivation_quantity:motivation_budget,-id )
# the df1 dataframe now includes three columns : the id column - the Questions columns - the Answers column.
# we can now apply the ggplot function on the new dataframe for the visualisation
# first the colours
colours <- c("firebrick4","firebrick3", "firebrick1", "gray70", "blue", "blue3" ,"darkblue")
# ATTENTION since there are NAs in your dataframe, either you can recode them as zeros or delete them (for the visualisation) using the subset function within the ggplot function as follows :
ggplot(subset(df1,!is.na(Answers)))+
aes(x=Questions,fill=Answers)+
geom_bar()+
coord_flip()+
scale_fill_manual(values = colours) +
ylab("position_levels")
# of course you can enter many modifications into the visualisation but in total I think that's what you need.
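The question also asks how to find the motivations that were ranked highest overall. A minimal base R sketch of one way to summarise that, assuming a small made-up data frame called ranks with four of the motivation columns (the values are invented for illustration, and the summary must be run on the numeric ranks, before any conversion to factors):

```r
# Made-up example: 4 respondents, the first one skipped the question
ranks <- data.frame(
  id                    = 1:4,
  motivation_quantity   = c(NA, 3, 2, 1),
  motivation_quality    = c(NA, 1, 1, 2),
  motivation_timesaving = c(NA, 4, 3, 4),
  motivation_budget     = c(NA, 2, 4, 3)
)

# Select the ranking columns by their shared prefix
rank_cols <- grep("^motivation_", names(ranks), value = TRUE)

# Mean rank per motivation (lower means ranked higher on average)
mean_rank <- colMeans(ranks[rank_cols], na.rm = TRUE)

# How often each motivation was put in position 1
first_place <- colSums(ranks[rank_cols] == 1, na.rm = TRUE)

# Motivations ordered from best to worst average rank
sort(mean_rank)
```

Replacing == 1 with <= 3 would count how often each motivation landed in the top three instead.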

Is there a way to add only a name to an r data.frame? [duplicate]

Is there a way to add new columns to a data.frame and populate them with NAs (to be filled by a later function), where the names of the new columns come from a vector of character strings?
Example:
What I'd like to do is add some new columns based on the subject column of this data.frame
exam_results <- data.frame(
subject = c("maths", "economics", "chemistry"),
final_score = c(70, 78, 61),
midterm_score = c(53, 66,40)
)
i.e:
# new object for new df
exam_results_new_columns <- exam_results
# get the names of the different subjects
subjects <- unique(exam_results$subject)
column_names <- c()
for (i in 1:length(subjects)) {
column_names[i] <- paste0("subject_", subjects[i])
exam_results_new_columns$i <- NA
}
This will produce the following df:
subject final_score midterm_score i
maths 70 53 NA
economics 78 66 NA
chemistry 61 40 NA
but what I would like is:
subject final_score midterm_score subject_economics subject_chemistry
maths 70 53 NA NA
economics 78 66 NA NA
chemistry 61 40 NA NA
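Is there a way I can achieve this?
In the for loop, use square brackets [] instead of the dollar sign $. With $, the name after it is taken literally, so exam_results_new_columns$i creates a column called i instead of using the value stored in i: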
subjects <- unique(exam_results$subject)
column_names <- paste0("subject_", subjects)
for(i in column_names)
exam_results_new_columns[,i] <- NA
Output
# subject final_score midterm_score subject_maths subject_economics subject_chemistry
# 1 maths 70 53 NA NA NA
# 2 economics 78 66 NA NA NA
# 3 chemistry 61 40 NA NA NA
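The loop can also be dropped entirely, because square-bracket assignment accepts a whole vector of new column names at once. A minimal sketch using the data from the question:

```r
exam_results <- data.frame(
  subject       = c("maths", "economics", "chemistry"),
  final_score   = c(70, 78, 61),
  midterm_score = c(53, 66, 40)
)

# Build all new column names in one vectorised call
column_names <- paste0("subject_", unique(exam_results$subject))

# Work on a copy and create every new column in a single assignment
exam_results_new_columns <- exam_results
exam_results_new_columns[column_names] <- NA
```

The new columns are of type logical because a plain NA is logical; assigning NA_real_ or NA_character_ instead fixes the column type up front if a later function expects numbers or strings.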

Create column from data on dynamic number of columns depending on availability in R

Given an uncertain number of columns containing source values for the same variable, I would like to create a column that holds the final value, selected according to source importance and availability.
Reproducible data:
set.seed(123)
actuals = runif(10, 500, 1000)
get_rand_vector <- function(){return (runif(10, 0.95, 1.05))}
get_na_rand_ixs <- function(){return (round(runif(5,0,10),0))}
df = data.frame("source_1" = actuals*get_rand_vector(),
"source_2" = actuals*get_rand_vector(),
"source_n" = actuals*get_rand_vector())
df[["source_1"]][get_na_rand_ixs()] <- NA
df[["source_2"]][get_na_rand_ixs()] <- NA
df[["source_n"]][get_na_rand_ixs()] <- NA
My manual solution is as follows:
df$available <- ifelse(
!is.na(df$source_1),
df$source_1,
ifelse(
!is.na(df$source_2),
df$source_2,
df$source_n
)
)
Given the desired result of:
source_1 source_2 source_n available
1 NA NA NA NA
2 NA NA 930.1242 930.1242
3 716.9981 NA 717.9234 716.9981
4 NA 988.0446 NA 988.0446
5 931.7081 NA 924.1101 931.7081
6 543.6802 533.6798 NA 543.6802
7 744.6525 767.4196 783.8004 744.6525
8 902.8788 955.1173 NA 902.8788
9 762.3690 NA 761.6135 762.3690
10 761.4092 702.6064 708.7615 761.4092
How could I automatically iterate over the available sources to pick the value to be used? In some cases the number of sources could be anything from 1 to 7, and priority follows the natural order (1 > 2 > ...).
Once you have all of the candidate vectors in order and in an appropriate data structure (e.g., data.frame or matrix), you can use apply to apply a function over the rows. In this case, we just look for the first non-NA value. Thus, after the first block of code above, you only need the following line:
df$available <- apply(df, 1, FUN = function(x) x[which(!is.na(x))[1]])
coalesce() from dplyr is designed for this:
library(dplyr)
df %>%
mutate(available = coalesce(!!!.))
source_1 source_2 source_n available
1 NA NA NA NA
2 NA NA 930.1242 930.1242
3 716.9981 NA 717.9234 716.9981
4 NA 988.0446 NA 988.0446
5 931.7081 NA 924.1101 931.7081
6 543.6802 533.6798 NA 543.6802
7 744.6525 767.4196 783.8004 744.6525
8 902.8788 955.1173 NA 902.8788
9 762.3690 NA 761.6135 762.3690
10 761.4092 702.6064 708.7615 761.4092
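For comparison, the same first-non-NA logic can be sketched in base R with Reduce, which folds the manual nested ifelse over however many source columns exist. The small src data frame below is made up for illustration, and the sketch assumes the columns already appear in priority order:

```r
# Made-up example with three sources of decreasing priority
src <- data.frame(
  source_1 = c(NA, NA, 3, 4),
  source_2 = c(NA, 20, NA, 40),
  source_n = c(NA, 200, 300, NA)
)

# Pick up every source column, however many there are
source_cols <- grep("^source_", names(src), value = TRUE)

# Walk through the columns left to right, keeping the value found so far
# and falling back to the next column only where it is still NA
src$available <- Reduce(function(a, b) ifelse(is.na(a), b, a), src[source_cols])
```

Selecting the source columns by name before computing the result also keeps an existing available column out of the calculation if the code is run twice.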

How can I input missing sd in a dataframe and then enforce NAs on the column sd automatically as a function into a new data frame?

Here it is what I was trying to do:
Set up a dataframe:
df=data.frame(m=runif(500,0,100),n=round(runif(500,1,100)),sd=runif(500,1,25))
head(df)
df$sd=as.data.frame(lapply(df[3],function(cc)cc[sample(c(TRUE,NA),prob=c(0.85,0.15),size=length(cc),replace=TRUE)]))
Assess if the SD in the data are missing:
NaS=which(is.na(df),arr.ind=TRUE)[,1]
NaM=noquote(paste0(NaS,sep=","))
Get the mean values from the df where the sd is missing. This is the clunky bit, as I need to manually copy and paste the values of NaM here:
xm=df[c(...),1]
xm
Get the n values from the df where the sd are missing:
xn=df[c(...),2]
xn
Make this a dataframe:
Simdf=data.frame(xm,xn)
Hopefully I am understanding you correctly but it seems you just want the m and n columns where is.na(df$sd) == TRUE? I would just use subset for that:
df=data.frame(m=runif(500,0,100),n=round(runif(500,1,100)),sd=runif(500,1,25))
head(df)
df$sd=as.data.frame(lapply(df[3],function(cc)cc[sample(c(TRUE,NA),
prob=c(0.85,0.15),size=length(cc),
replace=TRUE)]))
df_NA <- subset(df, is.na(sd))
R> head(df_NA)
m n sd
8 0.8887 85 NA
20 86.1660 71 NA
26 46.9202 83 NA
48 84.4475 41 NA
51 4.8426 3 NA
53 61.7181 92 NA
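The manual copy-and-paste step can also be avoided with plain logical indexing. A minimal sketch on a smaller made-up data frame called sim; setting a fixed number of rows to NA is an assumption made here to keep the example predictable, and it differs from the 15% random draw in the question:

```r
set.seed(1)

# Smaller version of the simulated data: means, sample sizes and SDs
sim <- data.frame(
  m  = runif(20, 0, 100),
  n  = round(runif(20, 1, 100)),
  sd = runif(20, 1, 25)
)

# Blank out the sd of three randomly chosen rows
sim$sd[sample(nrow(sim), 3)] <- NA

# Keep only m and n for the rows where sd is missing, with no manual row numbers
missing_sd <- sim[is.na(sim$sd), c("m", "n")]
```

Here which(is.na(sim$sd)) would return the row numbers themselves if they are needed elsewhere.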

Removing certain values from the dataframe in R

I need to cluster the dataframe mydf, but first I want to leave out the Inf (infinite) values and any values greater than 50. How can I get a table that contains no Inf and no value greater than 50 (maybe by nullifying those cells)? The clustering itself is not a problem, because I can do that with the mfuzz package. The only problem is that I want to restrict the values being clustered to the 0-50 range.
mydf
s.no A B C
1 Inf Inf 999.9
2 0.43 30 23
3 34 22 233
4 3 43 45
You can use NA, the built in missing data indicator in R:
?NA
By doing this:
mydf[mydf > 50 | mydf == Inf] <- NA
mydf
s.no A B C
1 1 NA NA NA
2 2 0.43 30 23
3 3 34.00 22 NA
4 4 3.00 43 45
Any stuff you do downstream in R should have NA handling methods, even if it's just na.omit
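One thing to watch with the one-liner above is that it also tests the s.no column, so in a longer table any row number above 50 would be blanked too. A minimal sketch that limits the replacement to the measurement columns, assuming they are A, B and C as in the question:

```r
mydf <- data.frame(
  s.no = 1:4,
  A    = c(Inf, 0.43, 34, 3),
  B    = c(Inf, 30, 22, 43),
  C    = c(999.9, 23, 233, 45)
)

# Only these columns are checked; s.no is left alone
value_cols <- c("A", "B", "C")

vals <- mydf[value_cols]
# is.finite() has no data frame method, hence as.matrix();
# it flags Inf, -Inf and NaN in one go
vals[!is.finite(as.matrix(vals)) | vals > 50] <- NA
mydf[value_cols] <- vals
```

After this, na.omit(mydf) would keep only the rows with no blanked cells.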
