Function in R that creates dummy variables if a condition is met

I am looking to create a function that will convert any factor variable with more than 4 levels into a dummy variable. The dataset has ~2311 columns, so I would really need to create a function. Your help would be immensely appreciated.
I have compiled the code below and was hoping to get it to work.
library(dummies)
# example function
for(i in names(Final_Dataset)){
if(count (Final_Dataset[i])>4){
y <- Final_Dataset[i]
Final_Dataset <- cbind(Final_Dataset, dummy(y, sep = "_"))
}
}
I was also considering an alternative approach where I would get all the number of columns that need to be dummied and then loop through all the columns and if the column number is in that array then create dummy variables out of the variable.

Example data
fct = data.frame(a = as.factor(letters[1:10]), b = 1:10, c = as.factor(sample(letters[1:4], 10, replace = T)), d = as.factor(letters[10:19]))
str(fct)
'data.frame': 10 obs. of 4 variables:
$ a: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
$ b: int 1 2 3 4 5 6 7 8 9 10
$ c: Factor w/ 4 levels "a","b","c","d": 2 4 1 3 1 1 2 3 1 2
$ d: Factor w/ 10 levels "j","k","l","m",..: 1 2 3 4 5 6 7 8 9 10
# flag factor columns with more than 4 levels
fact_cols = sapply(fct, function(x) is.factor(x) && nlevels(x) > 4)
# create dummy variables for that subset (no intercept); drop = FALSE keeps a
# data frame even when only one column qualifies
dummy_cols = model.matrix(~ . - 1, fct[, fact_cols, drop = FALSE])
# bind the dummies to the untouched columns
out_df = cbind(fct[, !fact_cols, drop = FALSE], dummy_cols)
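One caveat with the model.matrix() approach: with a formula like ~ . - 1, only the first factor keeps a column for every level, while later factors lose their first level under the default treatment contrasts. If a full set of indicator columns is wanted for every factor, a sketch along these lines may help. It passes full (non-reduced) contrast matrices through contrasts.arg; the example data frame is made up for illustration.

```r
# Made-up data shaped like the example above (deterministic, no sample())
fct <- data.frame(
  a = factor(letters[1:10]),
  b = 1:10,
  c = factor(rep(letters[1:4], length.out = 10)),
  d = factor(letters[10:19])
)

# Flag factor columns with more than 4 levels
fact_cols <- sapply(fct, function(x) is.factor(x) && nlevels(x) > 4)
big <- fct[, fact_cols, drop = FALSE]

# Ask for full indicator matrices so no level is dropped for any factor
full_contrasts <- lapply(big, contrasts, contrasts = FALSE)
dummy_cols <- model.matrix(~ . - 1, data = big, contrasts.arg = full_contrasts)

# Bind the dummies to the columns that were left alone
out_df <- cbind(fct[, !fact_cols, drop = FALSE], dummy_cols)
```

Each row of dummy_cols then has exactly one 1 per original factor. Whether a full set or a reduced set is appropriate depends on the model being fitted afterwards.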

You could get all the columns with more than a given number of levels (n = 4) with something like
which(sapply(Final_Dataset, function (c) length(levels(c)) > n))

Related

Stop R from converting a character factor to number

I am trying to convert missing factor values to NA in a data frame, and create a new data frame with replaced values but when I try to do that, previously character factors are all converted to numbers. I cannot figure out what I am doing wrong and cannot find a similar question. Could anybody please help?
Here is my code:
orders <- c('One','Two','Three', '')
ids <- c(1, 2, 3, 4)
values <- c(1.5, 100.6, 19.3, '')
df <- data.frame(orders, ids, values)
new.df <- as.data.frame(matrix( , ncol = ncol(df), nrow = 0))
names(new.df) <- names(df)
for(i in 1:nrow(df)){
row.df <- df[i, ]
print(row.df$orders) # "One", "Two", "Three", ""
print(str(row.df$orders)) # Factor
# Want to replace "orders" value in each row with NA if it is missing
row.df$orders <- ifelse(row.df$orders == "", NA, row.df$orders)
print(row.df$orders) # Converted to number
print(str(row.df$orders)) # int or logi
# Add the row with new value to the new data frame
new.df[nrow(new.df) + 1, ] <- row.df
}
and I get this:
> new.df
orders ids values
1 2 1 2
2 4 2 3
3 3 3 4
4 NA 4 1
but I want this:
> new.df
orders ids values
1 One 1 1.5
2 Two 2 100.6
3 Three 3 19.3
4 NA 4

Convert empty values to NA and use type.convert to change their class.
df[df == ''] <- NA
df <- type.convert(df)
df
# orders ids values
#1 One 1 1.5
#2 Two 2 100.6
#3 Three 3 19.3
#4 <NA> 4 NA
str(df)
#'data.frame': 4 obs. of 3 variables:
#$ orders: Factor w/ 4 levels "","One","Three",..: 2 4 3 1
#$ ids : int 1 2 3 4
#$ values: num 1.5 100.6 19.3 NA
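As for why the labels turned into numbers in the question: ifelse() strips attributes from its result, so a factor passed as the "no" value comes back as its underlying integer codes. The sketch below reproduces that on a small made-up factor and shows two ways around it; it is an illustration rather than a drop-in fix for the loop above.

```r
orders <- factor(c("One", "Two", "Three", ""))

# ifelse() drops the factor attributes, so the integer codes leak through
codes <- ifelse(orders == "", NA, orders)

# Converting to character first keeps the labels
labels <- ifelse(orders == "", NA, as.character(orders))

# Or assign NA in place and drop the now-unused "" level
orders[orders == ""] <- NA
orders <- droplevels(orders)
```

Here codes holds 2, 4, 3, NA (the positions of the labels in the sorted level set), which matches the unexpected output in the question.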

Thanks to the hint from Ronak Shah, I did this and it gave me what I wanted.
df$orders[df$orders == ''] <- NA
This will give me:
> df
orders ids values
1 One 1 1.5
2 Two 2 100.6
3 Three 3 19.3
4 <NA> 4
> str(df)
'data.frame': 4 obs. of 3 variables:
$ orders: Factor w/ 4 levels "","One","Three",..: 2 4 3 NA
$ ids : num 1 2 3 4
$ values: Factor w/ 4 levels "","1.5","100.6",..: 2 3 4 1
In case you are curious about the difference between NA and <NA>, as I was, you can find the answer here.
Your suggestion
df$orders[is.na(df$orders)] <- NA
did not work, maybe because the missing entry is not NA?

How to convert factor column in df to numeric strings per row?

I am using R for a research project that requires me to input a sequence of 1-5 of varying length and then calculate a score from that sequence.
The data frame I have stores the sequences as a factor. If I take a single entry and convert it to a numeric vector, I can input it into the formula. But if I try to do this for all rows I run into errors.
I have searched SO and other sources but only found information on how to convert factors to numeric if they contain one value per cell. My data contains a sequence of numbers per cell separated by commas.
If I take the input from one cell and combine as.numeric(), strsplit() and as.character(), it works. But I don't want to do this for every cell manually. How can I solve this?
This is what I did:
df <- read.csv2("example_seq_logs.csv", na.strings = "n/a")
df$seqtext <- as.character(df$hmm)
This is what the data frame looks like:
head(df)
lesson hmm
1 A 1,2,3,3,3,4,3,4,5,4,4,5,5,2,2,1,2,3,4,2,3
2 B 2,2,3,4,1,1,3,3,3,5,5,4,4,4,2,1
3 C 1,3,1,3,2,3,2,2,3,3,4,1,3,2,3,3,5,4,4,3,3
4 D 1,3,2,2,3,3,2,3,1,4,4,5,5,2,4,4,4,3
5 E 1,4,2,5,1,3,1,3,1,4,3,4,4
str(df)
'data.frame': 5 obs. of 2 variables:
$ lesson: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ hmm : Factor w/ 5 levels "1,2,3,3,3,4,3,4,5,4,4,5,5,2,2,1,2,3,4,2,3",..: 1 5 2 3 4
sapply(df, mode)
lesson hmm
"numeric" "numeric"
Now if I take a single entry I can do this:
testseq <- as.numeric(strsplit(df$seqtext[1], ",")[[1]])
str(testseq)
num [1:21] 1 2 3 3 3 4 3 4 5 4 ...
and then I can input the testseq sequence into the function I need.
But when I try the same for the whole column it results in an error
df$seq <- as.numeric(strsplit(df$seqtext, ","))[[1:58]]
Error: (list) object cannot be coerced to type 'double'
Thank you for your help!
Edit:
The first suggestion yields this error:
df$seq <- as.numeric(unlist(strsplit(paste(df$seqtext, collapse = ","), ",")))
Error in `$<-.data.frame`(`*tmp*`, seq, value = c(1, 2, 3, 3, 3, 4, 3, :
replacement has 89 rows, data has 5
It seems it turns the entire column into one long string.
a <- as.numeric(unlist(strsplit(paste(df$seqtext, collapse = ","), ",")))
print(a)
[1] 1 2 3 3 3 4 3 4 5 4 4 5 5 2 2 1 2 3 4 2 3 2 2 3 4 1 1 3 3 3 5 5 4 4 4 2 1 1 3 1 3 2 3 2 2 3 3 4 1 3 2 3
[53] 3 5 4 4 3 3 1 3 2 2 3 3 2 3 1 4 4 5 5 2 4 4 4 3 1 4 2 5 1 3 1 3 1 4 3 4 4
But I need each sequence to turn up in the right row as a string.
Edit:
I found that the function I need to calculate results with doesn't need numerics so now I've solved the issue using a for loop:
df$score <- 0
for (i in 1:nrow(df)) {
seq <- as.array(strsplit(as.character(df$hmm),","))
session_seq <- seq[i]
res = computehmm(session_seq)
df$score[i] <- res$score
}
But now it stops calculating once it reaches an empty df$hmm field.
I understand sapply would be better but I don't understand how to get it to work.

You can use paste as:
as.numeric(unlist(strsplit(paste(df$seqtext, collapse = ","), ",")))
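If each sequence has to stay in its own row, one option is to keep the result of strsplit() as a list column instead of collapsing everything into one vector. The following is a minimal sketch on made-up data; score_fun is a hypothetical stand-in for the asker's computehmm(), whose interface is not shown in the question.

```r
# Made-up data with one missing sequence
df <- data.frame(
  lesson = c("A", "B", "C"),
  hmm = c("1,2,3", "2,2", NA),
  stringsAsFactors = TRUE
)

# strsplit() returns one character vector per row; convert each to numeric
df$seq <- lapply(strsplit(as.character(df$hmm), ","), as.numeric)

# Hypothetical scoring function: returns NA for empty or missing sequences
score_fun <- function(x) {
  if (length(x) == 0 || all(is.na(x))) NA_real_ else sum(x)
}

# vapply() guarantees exactly one numeric score per row
df$score <- vapply(df$seq, score_fun, numeric(1))
```

Guarding against empty or missing input inside the scoring function is what keeps the computation from stopping at an empty field.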

Coerce and order character vector to factor, with factor levels ordered by another vector

Imagine a data set like this:
# creating data for test
set.seed(1839)
id <- as.character(1:10)
frequency <- sample(c("n", "r", "s", "o", "a"), 10, TRUE)
frequency_value <- sapply(
frequency, switch, "n" = -2, "r" = -1, "s" = 0, "o" = 1, "a" = 2
)
(test <- data.frame(id, frequency, frequency_value))
Which looks like:
id frequency frequency_value
1 1 a 2
2 2 o 1
3 3 r -1
4 4 o 1
5 5 o 1
6 6 s 0
7 7 n -2
8 8 n -2
9 9 r -1
10 10 n -2
The variable frequency has the response I'm interested in. It goes from never to rarely to sometimes to often to always. The labels are just the first letter of each of those words. The order is presented in frequency_value.
What I would like to do is make frequency a factor with levels in the order n, r, s, o, a. However, I want to make this dependent on the values in frequency_value. They should follow the order that is preserved in frequency_value and not be simply hard-coded (like one would do with factor(frequency, levels = c("n", "r", "s", "o", "a"))).
I have thought about using this, a tidyverse solution:
levels <- test[, c("frequency", "frequency_value")] %>%
unique() %>%
arrange(as.numeric(frequency_value)) %>%
pull(frequency) %>%
as.character()
test$frequency <- factor(test$frequency, levels)
But that seems to be computationally inefficient when I do this on big data sets with more than one variable that I want to make factor. Is there a more efficient solution?

Use order for unique combinations (what you were using) within with:
test$frequency <- factor(test$frequency,
with(unique(test[, -1]), frequency[order(frequency_value)]))
[1] a o r o o s n n r n
Levels: n r s o a

One option is to use dplyr (note that this also reorders the rows):
library(dplyr)
test <- test %>% arrange(frequency_value) %>%
mutate(frequency = factor(frequency, levels = unique(frequency)))
test
# id frequency frequency_value
# 1 7 n -2
# 2 8 n -2
# 3 10 n -2
# 4 3 r -1
# 5 9 r -1
# 6 6 s 0
# 7 2 o 1
# 8 4 o 1
# 9 5 o 1
# 10 1 a 2
str(test)
#'data.frame': 10 obs. of 3 variables:
# $ id : Factor w/ 10 levels "1","10","2","3",..: 8 9 2 4 10 7 3 5 6 1
# $ frequency : Factor w/ 5 levels "n","r","s","o",..: 1 1 1 2 2 3 4 4 4 5
# $ frequency_value: num -2 -2 -2 -1 -1 0 1 1 1 2
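For comparison, a base R sketch that derives the level order from frequency_value without reordering the rows. The data below is typed in by hand to match the printed example, so it does not depend on the random seed.

```r
test <- data.frame(
  frequency = c("a", "o", "r", "o", "o", "s", "n", "n", "r", "n"),
  frequency_value = c(2, 1, -1, 1, 1, 0, -2, -2, -1, -2),
  stringsAsFactors = FALSE
)

# Sort positions by the numeric value, then take the labels in that order
ord <- order(test$frequency_value)
test$frequency <- factor(test$frequency, levels = unique(test$frequency[ord]))

levels(test$frequency)
```

This assumes each label maps to exactly one value in frequency_value, as in the example.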

Reverse Coding Certain Columns in R

I have a dataset with 49 columns.
'data.frame': 1351 obs. of 47 variables:
$ ID : Factor w/ 1351 levels "PID0001","PID0002",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Survey: int 1 2 1 1 2 2 2 1 1 2 ...
$ hsinc1: int 2 4 4 4 5 4 3 3 1 1 ...
$ hsinc2: int 2 3 3 3 4 3 3 3 1 1 ...
$ hsinc3: int 4 4 2 3 3 4 5 4 5 5 ...
$ hsinc4: int 4 4 4 4 4 4 4 4 5 4 ...
$ hfair1: int 2 2 2 1 1 1 1 2 1 2 ...
$ hfair2: int 4 5 5 4 5 5 5 5 5 5 ...
$ hfair3: int 4 5 4 3 5 4 3 3 5 5 ...
etc ...
I want to reverse code columns 5,6,8,9,10,12,13,14,17 and 18 such that a score of 5 becomes a score of 1, and 4 becomes 2 etc.
At first, I thought this was achievable with the psych::reverse.code() function, so I tried the following, where the -1 entries mark columns 5,6,8,9,10,12,13,14,17 and 18:
library('psych')
keys <-c(1,1,1,1,-1,-1,1,-1,-1,-1,1,-1,-1,-1,1,1,-1,-1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
df_rev <- reverse.code(keys, items = df, mini = rep(1,49), maxi = rep(5,49))
However, when I run this code, I get the following error:
Error in items %*% keys.d :
requires numeric/complex matrix/vector arguments
Can anybody help with this, please?
Another method I have just been trying is to create a subset of the original data frame, with just the columns I want to reverse code:
data_to_rev <- df[c(5,6,8,9,10,12,13,14,17,18)]
And then reverse coding this subset:
keys <- c(-1,-1,-1,-1,-1,-1,-1,-1,-1,-1)
df_rev <- reverse.code(keys, items = data_to_rev, mini = rep(1,10), maxi = rep(5,10))
This works successfully. All variables are now reverse coded like I need them. However, how do I get this subset of reverse coded values and place it back into the original data frame - overwriting the old (non-reversed) columns?
Any help would be hugely appreciated, thank you!
EDIT - SOLUTION
I think I have managed to solve it using @MikeH's help.
I created a subset of just the participant ID's (the factor variable) data_ID <- df[1]
And then used:
data_rev <- reverse.code(keys, items = df[,-1], mini = rep(1,46), maxi = rep(5,46))
This leaves me with 2 data frames/subsets:
1 with all the participant ID's.
1 with all their data and columns 5,6,8,9,10,12,13,14,17 and 18 reverse coded.
I then used: data_final <- cbind(data_ID, data_rev) to join the 2 subsets back together.
Can anyone see anything wrong with this method? I think it has worked upon visual inspection...

df[c(5,6,8,9,10,12,13,14,17,18)] <- 6 - df[c(5,6,8,9,10,12,13,14,17,18)]
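The same idea on a small made-up data frame, written as (min + max) - x so the scale limits are explicit. The column names and the choice of which column to reverse are hypothetical.

```r
df <- data.frame(
  ID = factor(c("PID0001", "PID0002", "PID0003")),
  hsinc1 = c(2L, 4L, 5L),
  hfair1 = c(1L, 3L, 5L)
)

rev_cols <- c("hsinc1")  # hypothetical: columns to reverse code
lo <- 1                  # lowest possible score
hi <- 5                  # highest possible score

# On a 1-5 scale this maps 1 -> 5, 2 -> 4, ..., 5 -> 1
df[rev_cols] <- (lo + hi) - df[rev_cols]
```

Because the assignment only touches the selected columns, the ID factor and the other items stay in place and no cbind() step is needed.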

An efficient way to do it is to write the reverse function yourself and apply it to the columns you want
library(data.table)
start=1
end=5
myrev=function(x) end+start-x
dt=data.table(x=c(1,2,1,4),y=c(2,5,4,1))
cols=1:2
dt[, (cols) := lapply(.SD,myrev), .SDcols = cols]
Or, as an alternative to the previous line (run one or the other, not both):
dt[, (cols) := end + start - .SD, .SDcols = cols]

Subset columns based on certain columns missing value

My dataset is pretty big. I have about 2,000 variables and 1,000 observations.
I want to run a model for each variable using other variables.
To do so, for each dependent variable I need to drop the variables that have missing values in rows where the dependent variable is not missing.
For instance, for variable "A" I need to drop variables C and D because they have missing values where variable A does not. For variable "C" I can keep variable "D".
data <- read.table(text="
A B C D
1 3 9 4
2 1 3 4
NA NA 3 5
4 2 NA NA
2 5 4 3
1 1 1 2",header=T,sep="")
I think I need to make a loop to go through each variable.

I think this gets what you need:
for (i in 1:ncol(data)) {
# filter out rows with NAs in column 'i'
# which is the column we currently care about
tmp <- data[!is.na(data[,i]),]
# now column 'i' has no NA values, so remove other columns
# that have NAs in them from the data frame
tmp <- tmp[sapply(tmp, function(x) !any(is.na(x)))]
#run your model on 'tmp'
}
For each iteration of i, the tmp data frame looks like:
'data.frame': 5 obs. of 2 variables:
$ A: int 1 2 4 2 1
$ B: int 3 1 2 5 1
'data.frame': 5 obs. of 2 variables:
$ A: int 1 2 4 2 1
$ B: int 3 1 2 5 1
'data.frame': 5 obs. of 2 variables:
$ C: int 9 3 3 4 1
$ D: int 4 4 5 3 2
'data.frame': 5 obs. of 2 variables:
$ C: int 9 3 3 4 1
$ D: int 4 4 5 3 2

I'll provide a way to get the usable variables for each column you choose:
getVars <- function(data, col){
tmp<-!sapply(data[!is.na(data[[col]]),], function(x) { any(is.na(x)) })
names(data)[tmp & names(data) != col]
}
PS: I'm on my phone so I didn't test the above nor had the chance for a good code styling.
EDIT: Styling fixed!
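To get the usable variables for every column at once, the same logic can be applied over all column names. This is a sketch using the example data from the question; usable_vars is a hypothetical helper name, written with vapply() and anyNA().

```r
data <- read.table(text = "
A B C D
1 3 9 4
2 1 3 4
NA NA 3 5
4 2 NA NA
2 5 4 3
1 1 1 2", header = TRUE)

# For a target column, list the other columns that have no NA
# in the rows where the target itself is observed
usable_vars <- function(data, col) {
  rows <- !is.na(data[[col]])
  complete <- vapply(data[rows, , drop = FALSE],
                     function(x) !anyNA(x), logical(1))
  setdiff(names(data)[complete], col)
}

# Named list: one character vector of usable predictors per target column
usable <- lapply(setNames(names(data), names(data)), usable_vars, data = data)
```

With this data, usable$A is "B" and usable$C is "D", which agrees with the expectation stated in the question.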
