impute data with condition (4 levels) in R - r

I have two different variables in my data. first one: vein size (there are NAs).
Second variable is: procedure site (c=(1,2,3,4))
I want to impute different value to vein size based on different procedure site. I tried if else, but it wasn't successful. e.g.: if procedure site is 1 or 2, impute 3; if procedure site is 3, impute 4; if procedure site is 4, impute 5.I am new to this field. Any help is much appreciated!
vein.size<-(3,3,3,NA,NA,NA)
procedure.site<-(1,2,2,3,4,4)
df<-cbind(vein.size,procedure.site)
My expected output is:
vein.size<-(3,3,3,4,5,5)
thank you

You can use chain of ifelse statement or try case_when from dplyr :
library(dplyr)
df <- df %>%
mutate(output = case_when(is.na(vein.size) & procedure.site %in% 1:2 ~ 3,
is.na(vein.size) & procedure.site == 3 ~ 4,
is.na(vein.size) & procedure.site == 4 ~ 5,
TRUE ~ vein.size))
# vein.size procedure.site output
#1 3 1 3
#2 3 2 3
#3 3 2 3
#4 NA 3 4
#5 NA 4 5
#6 NA 4 5
data
vein.size<-c(3,3,3,NA,NA,NA)
procedure.site<-c(1,2,2,3,4,4)
df<-data.frame(vein.size,procedure.site)

You could use a lookup table and then merge:
# your data
vein.size <- c(3,3,3,NA,NA,NA)
procedure.site <- c(1,2,2,3,4,4)
your_df <- data.frame(vein.size = vein.size,
procedure.site = procedure.site)
# the lookup table
lookup_df <- data.frame(
procedure.site = c(1, 2, 3, 4),
imputation = c(3, 3, 4, 5)
)
# result
merge(your_df, lookup_df, by='procedure.site')
Which gives:
procedure.site vein.size imputation
1 1 3 3
2 2 3 3
3 2 3 3
4 3 NA 4
5 4 NA 5
6 4 NA 5

Related

How to list R data frame variables in alphabetical order? [duplicate]

This is possibly a simple question, but I do not know how to order columns alphabetically.
test = data.frame(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
# C A B
# 1 0 4 1
# 2 2 2 3
# 3 4 4 8
# 4 7 7 3
# 5 8 8 2
I like to order the columns by column names alphabetically, to achieve
# A B C
# 1 4 1 0
# 2 2 3 2
# 3 4 8 4
# 4 7 3 7
# 5 8 2 8
For others I want my own defined order:
# B A C
# 1 4 1 0
# 2 2 3 2
# 3 4 8 4
# 4 7 3 7
# 5 8 2 8
Please note that my datasets are huge, with 10000 variables. So the process needs to be more automated.
You can use order on the names, and use that to order the columns when subsetting:
test[ , order(names(test))]
A B C
1 4 1 0
2 2 3 2
3 4 8 4
4 7 3 7
5 8 2 8
For your own defined order, you will need to define your own mapping of the names to the ordering. This would depend on how you would like to do this, but swapping whatever function would to this with order above should give your desired output.
You may for example have a look at Order a data frame's rows according to a target vector that specifies the desired order, i.e. you can match your data frame names against a target vector containing the desired column order.
Here's the obligatory dplyr answer in case somebody wants to do this with the pipe.
test %>%
select(sort(names(.)))
test = data.frame(C=c(0,2,4, 7, 8), A=c(4,2,4, 7, 8), B=c(1, 3, 8,3,2))
Using the simple following function replacement can be performed (but only if data frame does not have many columns):
test <- test[, c("A", "B", "C")]
for others:
test <- test[, c("B", "A", "C")]
An alternative option is to use str_sort() from library stringr, with the argument numeric = TRUE. This will correctly order column that include numbers not just alphabetically:
str_sort(c("V3", "V1", "V10"), numeric = TRUE)
# [1] V1 V3 V10
test[,sort(names(test))]
sort on names of columns can work easily.
If you only want one or more columns in the front and don't care about the order of the rest:
require(dplyr)
test %>%
select(B, everything())
So to have a specific column come first, then the rest alphabetically, I'd propose this solution:
test[, c("myFirstColumn", sort(setdiff(names(test), "myFirstColumn")))]
Here is what I found out to achieve a similar problem with my data set.
First, do what James mentioned above, i.e.
test[ , order(names(test))]
Second, use the everything() function in dplyr to move specific columns of interest (e.g., "D", "G", "K") at the beginning of the data frame, putting the alphabetically ordered columns after those ones.
select(test, D, G, K, everything())
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
Similar to other syntax above but for learning - can you sort by column names?
sort(colnames(test[1:ncol(test)] ))
another option is..
mtcars %>% dplyr::select(order(names(mtcars)))
In data.table you can use the function setcolorder:
setcolorder reorders the columns of data.table, by reference, to the
new order provided.
Here a reproducible example:
library(data.table)
test = data.table(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
setcolorder(test, c(order(names(test))))
test
#> A B C
#> 1: 4 1 0
#> 2: 2 3 2
#> 3: 4 8 4
#> 4: 7 3 7
#> 5: 8 2 8
Created on 2022-07-10 by the reprex package (v2.0.1)

How to input nominal values into one column based on values in another column through If_else statements

I am trying to input nominal variables based on a column dedicated to age. Basically, if someone is between the ages of 1 to 5, indicated in the age column, then I want the age group column to have the value of 1, since they are in age group 1. I'm trying to do this in multiple columns since ages increase by one each year. I've tried doing this through a for loop that uses an if else function, but it does not work.
`my_vector_1<-c(1,3,5,7,9,11,2,4,6,8,10,12,3,5,7,9,11,13)
my_matrix_1<-matrix(data=my_vector_1, nrow=6, ncol=3)
colnames(my_matrix_1)<-c(paste0("Age", 2000:2002))
rownames(my_matrix_1)<-c(paste0("Participant", 1:6))
my_data_1<-data.frame(my_matrix_1)
my_data_1<-cbind("AgeGroup2000"=NA, "AgeGroup2001"=NA, "AgeGroup2002"=NA, my_data_1)
my_data_1
#I'm basically trying to make the below code into a for loop
my_data_1$AgeGroup2000[my_data_1$Age2000 %in% 1:5]<-1
my_data_1$AgeGroup2000[my_data_1$Age2000 %in% 6:10]<-2
my_data_1$AgeGroup2000[my_data_1$Age2000 %in% 11:15]<-3
my_data_1$AgeGroup2001[my_data_1$Age2001 %in% 1:5]<-1
my_data_1$AgeGroup2001[my_data_1$Age2001 %in% 6:10]<-2
my_data_1$AgeGroup2001[my_data_1$Age2001 %in% 11:15]<-3
my_data_1$AgeGroup2002[my_data_1$Age2002 %in% 1:5]<-1
my_data_1$AgeGroup2002[my_data_1$Age2002 %in% 6:10]<-2
my_data_1$AgeGroup2002[my_data_1$Age2002 %in% 11:15]<-3`
Maybe it is better to use findInterval or cut here. We can use lapply to apply it for multiple columns
my_data_1[paste0("AgeGroup_", 2000:2002)] <- lapply(my_data_1, findInterval, c(1, 6, 11))
# Age2000 Age2001 Age2002 AgeGroup_2000 AgeGroup_2001 AgeGroup_2002
#Participant1 1 2 3 1 1 1
#Participant2 3 4 5 1 1 1
#Participant3 5 6 7 1 2 2
#Participant4 7 8 9 2 2 2
#Participant5 9 10 11 2 3 3
#Participant6 11 12 13 3 3 3
Or mutate_all from dplyr
library(dplyr)
my_data_1 %>% mutate_all(list(Group = ~findInterval(., c(1, 6, 11))))
data
my_vector_1<-c(1,3,5,7,9,11,2,4,6,8,10,12,3,5,7,9,11,13)
my_matrix_1<-matrix(data=my_vector_1, nrow=6, ncol=3)
colnames(my_matrix_1)<-c(paste0("Age", 2000:2002))
rownames(my_matrix_1)<-c(paste0("Participant", 1:6))
my_data_1<-data.frame(my_matrix_1)

How to reorder columns of a data.frame with na condition? [duplicate]

This is possibly a simple question, but I do not know how to order columns alphabetically.
test = data.frame(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
# C A B
# 1 0 4 1
# 2 2 2 3
# 3 4 4 8
# 4 7 7 3
# 5 8 8 2
I like to order the columns by column names alphabetically, to achieve
# A B C
# 1 4 1 0
# 2 2 3 2
# 3 4 8 4
# 4 7 3 7
# 5 8 2 8
For others I want my own defined order:
# B A C
# 1 4 1 0
# 2 2 3 2
# 3 4 8 4
# 4 7 3 7
# 5 8 2 8
Please note that my datasets are huge, with 10000 variables. So the process needs to be more automated.
You can use order on the names, and use that to order the columns when subsetting:
test[ , order(names(test))]
A B C
1 4 1 0
2 2 3 2
3 4 8 4
4 7 3 7
5 8 2 8
For your own defined order, you will need to define your own mapping of the names to the ordering. This would depend on how you would like to do this, but swapping whatever function would to this with order above should give your desired output.
You may for example have a look at Order a data frame's rows according to a target vector that specifies the desired order, i.e. you can match your data frame names against a target vector containing the desired column order.
Here's the obligatory dplyr answer in case somebody wants to do this with the pipe.
test %>%
select(sort(names(.)))
test = data.frame(C=c(0,2,4, 7, 8), A=c(4,2,4, 7, 8), B=c(1, 3, 8,3,2))
Using the simple following function replacement can be performed (but only if data frame does not have many columns):
test <- test[, c("A", "B", "C")]
for others:
test <- test[, c("B", "A", "C")]
An alternative option is to use str_sort() from library stringr, with the argument numeric = TRUE. This will correctly order column that include numbers not just alphabetically:
str_sort(c("V3", "V1", "V10"), numeric = TRUE)
# [1] V1 V3 V10
test[,sort(names(test))]
sort on names of columns can work easily.
If you only want one or more columns in the front and don't care about the order of the rest:
require(dplyr)
test %>%
select(B, everything())
So to have a specific column come first, then the rest alphabetically, I'd propose this solution:
test[, c("myFirstColumn", sort(setdiff(names(test), "myFirstColumn")))]
Here is what I found out to achieve a similar problem with my data set.
First, do what James mentioned above, i.e.
test[ , order(names(test))]
Second, use the everything() function in dplyr to move specific columns of interest (e.g., "D", "G", "K") at the beginning of the data frame, putting the alphabetically ordered columns after those ones.
select(test, D, G, K, everything())
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
Similar to other syntax above but for learning - can you sort by column names?
sort(colnames(test[1:ncol(test)] ))
another option is..
mtcars %>% dplyr::select(order(names(mtcars)))
In data.table you can use the function setcolorder:
setcolorder reorders the columns of data.table, by reference, to the
new order provided.
Here a reproducible example:
library(data.table)
test = data.table(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
setcolorder(test, c(order(names(test))))
test
#> A B C
#> 1: 4 1 0
#> 2: 2 3 2
#> 3: 4 8 4
#> 4: 7 3 7
#> 5: 8 2 8
Created on 2022-07-10 by the reprex package (v2.0.1)

Reverse Scoring Items

I have a survey of about 80 items, primarily the items are valanced positively (higher scores indicate better outcome), but about 20 of them are negatively valanced, I need to find a way to reverse score the ones negatively valanced in R. I am completely lost on how to do so. I am definitely an R beginner, and this is probably a dumb question, but could someone point me in an direction code-wise?
Here's an example with some fake data that you can adapt to your data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat = data.frame(Q1=sample(1:5,10,replace=TRUE),
Q2=sample(1:5,10,replace=TRUE),
Q3=sample(1:5,10,replace=TRUE))
dat
Q1 Q2 Q3
1 2 2 5
2 2 1 2
3 3 4 4
4 5 2 1
5 2 4 2
6 5 3 2
7 5 4 1
8 4 5 2
9 4 2 5
10 1 4 2
# Say you want to reverse questions Q1 and Q3
cols = c("Q1", "Q3")
dat[ ,cols] = 6 - dat[ ,cols]
dat
Q1 Q2 Q3
1 4 2 1
2 4 1 4
3 3 4 2
4 1 2 5
5 4 4 4
6 1 3 4
7 1 4 5
8 2 5 4
9 2 2 1
10 5 4 4
If you have a lot of columns, you can use tidyverse functions to select multiple columns to recode in a single operation.
library(tidyverse)
# Reverse code columns Q1 and Q3
dat %>% mutate(across(matches("^Q[13]"), ~ 6 - .))
# Reverse code all columns that start with Q followed by one or two digits
dat %>% mutate(across(matches("^Q[0-9]{1,2}"), ~ 6 - .))
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ 6 - .))
If different columns could have different maximum values, you can (adapting #HellowWorld's suggestion) customize the reverse-coding to the maximum value of each column:
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ max(.) + 1 - .))
Here is an alternative approach using the psych package. If you are working with survey data this package has lots of good functions. Building on #eipi10 data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
original_data = data.frame(Q1=sample(1:5,10,replace=TRUE),
Q2=sample(1:5,10,replace=TRUE),
Q3=sample(1:5,10,replace=TRUE))
original_data
# Say you want to reverse questions Q1 and Q3. Set those keys to -1 and Q2 to 1.
# install.packages("psych") # Uncomment this if you haven't installed the psych package
library(psych)
keys <- c(-1,1,-1)
# Use the handy function from the pysch package
# mini is the minimum value and maxi is the maimum value
# mini and maxi can also be vectors if you have different scales
new_data <- reverse.code(keys,original_data,mini=1,maxi=5)
new_data
The pro to this approach is that you can recode your entire survey in one function. The con to this is you need a library. The stock R approach is more elegant as well.
FYI, this is my first post on stack overflow. Long time listener, first time caller. So please give me feedback on my response.
Just converting #eipi10's answer using tidyverse:
# Create same fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat <- data.frame(Q1 = sample(1:5,10, replace=TRUE),
Q2 = sample(1:5,10, replace=TRUE),
Q3 = sample(1:5,10, replace=TRUE))
# Reverse scores in the desired columns (Q2 and Q3)
dat <- dat %>%
mutate(Q2Reversed = 6 - Q2,
Q3Reversed = 6 - Q3)
Another example is to use recode in library(car).
#Example data
data = data.frame(Q1=sample(1:5,10, replace=TRUE))
# Say you want to reverse questions Q1
library(car)
data$Q1reversed <- recode(data$Q1, "1=5; 2=4; 3=3; 4=2; 5=1")
data
The psych package has the intuitive reverse.code() function that can be helpful. Using the dataset started by #eipi10 and the same goal or reversing q1 and q2:
set.seed(1)
dat <- data.frame(q1 =sample(1:5,10,replace=TRUE),
q2=sample(1:5,10,replace=TRUE),
q3 =sample(1:5,10,replace=TRUE))
You can use the reverse.code() function. The first argument is the keys. This is a vector of 1 and -1. -1 means that you want to reverse that item. These go in the same order as your data.
The second argument, called items, is simply the name of your dataset. That is, where are these items located?
Last, the mini and maxi arguments are the smallest and largest values that a participant could possibly score. You can also leave these arguments to NULL and the function will use the lowest and highest values in your data.
library(psych)
keys <- c(-1, 1, -1)
dat1 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat1
Alternatively, your keys can also contain the specific names of the variables that you want to reverse score. This is helpful if you have many variables to reverse score and yields the same answer:
library(psych)
keys <- c("q1", "q3")
dat2 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat2
Note that, after reverse scoring, reverse.code() slightly modifies the variable name to have a - behind it (i.e., q1 becomes q1- after being reverse scored).
The solutions above assume wide data (one score per column). This reverse scores specific rows in long data (one score per row).
library(magrittr)
max <- 5
df <- data.frame(score=sample(1:max, 20, replace=TRUE))
df <- mutate(df, question = rownames(df))
df
df[c(4,13,17),] %<>% mutate(score = max + 1 - score)
df
Here is another attempt that will generalize to any number of columns. Let's use some made up data to illustrate the function.
# create a df
{
A = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
B = c(9, 2, 3, 2, 4, 0, 2, 7, 2, 8)
C = c(2, 4, 1, 0, 2, 1, 3, 0, 7, 8)
df1 = data.frame(A, B, C)
print(df1)
}
A B C
1 3 9 2
2 3 2 4
3 3 3 1
4 3 2 0
5 3 4 2
6 3 0 1
7 3 2 3
8 3 7 0
9 3 2 7
10 3 8 8
The columns to reverse code
# variables to reverse code
vtcode = c("A", "B")
The function to reverse-code the selected columns
reverseCode <- function(data, rev){
# get maximum value per desired col: lapply(data[rev], max)
# subtract values in cols to reverse-code from max value plus 1
data[, rev] = mapply("-", lapply(data[rev], max), data[, rev]) + 1
return(data)
}
reverseCode(df1, vtcode)
A B C
1 1 1 2
2 1 8 4
3 1 7 1
4 1 8 0
5 1 6 2
6 1 10 1
7 1 8 3
8 1 3 0
9 1 8 7
10 1 2 8
This code was inspired by another response a response from #catastrophic-failure relating to subtract max of column from all entries in column R

Sort columns of a dataframe by column name

This is possibly a simple question, but I do not know how to order columns alphabetically.
test = data.frame(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
# C A B
# 1 0 4 1
# 2 2 2 3
# 3 4 4 8
# 4 7 7 3
# 5 8 8 2
I like to order the columns by column names alphabetically, to achieve
# A B C
# 1 4 1 0
# 2 2 3 2
# 3 4 8 4
# 4 7 3 7
# 5 8 2 8
For others I want my own defined order:
# B A C
# 1 4 1 0
# 2 2 3 2
# 3 4 8 4
# 4 7 3 7
# 5 8 2 8
Please note that my datasets are huge, with 10000 variables. So the process needs to be more automated.
You can use order on the names, and use that to order the columns when subsetting:
test[ , order(names(test))]
A B C
1 4 1 0
2 2 3 2
3 4 8 4
4 7 3 7
5 8 2 8
For your own defined order, you will need to define your own mapping of the names to the ordering. This would depend on how you would like to do this, but swapping whatever function would to this with order above should give your desired output.
You may for example have a look at Order a data frame's rows according to a target vector that specifies the desired order, i.e. you can match your data frame names against a target vector containing the desired column order.
Here's the obligatory dplyr answer in case somebody wants to do this with the pipe.
test %>%
select(sort(names(.)))
test = data.frame(C=c(0,2,4, 7, 8), A=c(4,2,4, 7, 8), B=c(1, 3, 8,3,2))
Using the simple following function replacement can be performed (but only if data frame does not have many columns):
test <- test[, c("A", "B", "C")]
for others:
test <- test[, c("B", "A", "C")]
An alternative option is to use str_sort() from library stringr, with the argument numeric = TRUE. This will correctly order column that include numbers not just alphabetically:
str_sort(c("V3", "V1", "V10"), numeric = TRUE)
# [1] V1 V3 V10
test[,sort(names(test))]
sort on names of columns can work easily.
If you only want one or more columns in the front and don't care about the order of the rest:
require(dplyr)
test %>%
select(B, everything())
So to have a specific column come first, then the rest alphabetically, I'd propose this solution:
test[, c("myFirstColumn", sort(setdiff(names(test), "myFirstColumn")))]
Here is what I found out to achieve a similar problem with my data set.
First, do what James mentioned above, i.e.
test[ , order(names(test))]
Second, use the everything() function in dplyr to move specific columns of interest (e.g., "D", "G", "K") at the beginning of the data frame, putting the alphabetically ordered columns after those ones.
select(test, D, G, K, everything())
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
Similar to other syntax above but for learning - can you sort by column names?
sort(colnames(test[1:ncol(test)] ))
another option is..
mtcars %>% dplyr::select(order(names(mtcars)))
In data.table you can use the function setcolorder:
setcolorder reorders the columns of data.table, by reference, to the
new order provided.
Here a reproducible example:
library(data.table)
test = data.table(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
setcolorder(test, c(order(names(test))))
test
#> A B C
#> 1: 4 1 0
#> 2: 2 3 2
#> 3: 4 8 4
#> 4: 7 3 7
#> 5: 8 2 8
Created on 2022-07-10 by the reprex package (v2.0.1)

Resources