Split variable into new column variable for each factor - r

Using the R dataset mtcars, I want to make a new binary variable for each level of the "cyl" variable.
For example, the values of cyl are 6, 4, and 8.
I want a new dataset with variables "cyl_4", "cyl_6", and "cyl_8" equal to 1 when each of these numbers occur.
I am looking for solutions that create a new variable for each level of the original variable.
Have:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Want:
mpg cyl disp hp drat wt qsec vs am gear carb cyl_4 cyl_6 cyl_8
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0 1 0
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0 1 0
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1 0 0
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0 1 0
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0 0 1
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0 1 0

Here's one tidyverse solution: pivot on the cyl column, then replace the values in the 3 resulting columns with 0 where they are NA, otherwise with 1.
library(dplyr)
library(tidyr)
library(tibble)
mtcars %>%
rownames_to_column(var = "model") %>%
pivot_wider(names_from = "cyl",
values_from = "cyl",
names_prefix = "cyl_",
names_sort = TRUE) %>%
mutate(across(starts_with("cyl"), ~ ifelse(is.na(.), 0, 1)))
Result (first 5 rows):
# A tibble: 32 × 14
model mpg disp hp drat wt qsec vs am gear carb cyl_4 cyl_6 cyl_8
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 160 110 3.9 2.62 16.5 0 1 4 4 0 1 0
2 Mazda RX4 Wag 21 160 110 3.9 2.88 17.0 0 1 4 4 0 1 0
3 Datsun 710 22.8 108 93 3.85 2.32 18.6 1 1 4 1 1 0 0
4 Hornet 4 Drive 21.4 258 110 3.08 3.22 19.4 1 0 3 1 0 1 0
5 Hornet Sportabout 18.7 360 175 3.15 3.44 17.0 0 0 3 2 0 0 1

You could use model.matrix() to create the design matrix for a categorical variable.
cbind(mtcars, model.matrix(~ cyl - 1, transform(mtcars, cyl = as.factor(cyl))))
# mpg cyl disp hp drat wt qsec vs am gear carb cyl4 cyl6 cyl8
# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 0 1 0
# Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 0 1 0
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1 0 0
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 0 1 0
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 0 0 1
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 0 1 0
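If the underscore names from the question (cyl_4, cyl_6, cyl_8) matter and you want to keep the original cyl column without extra packages, a plain loop over the levels is one option. This is only a minimal base R sketch; the name `out` is illustrative and not part of either answer above.

```r
# Minimal base R sketch: one 0/1 column per distinct value of cyl.
out <- mtcars
for (lv in sort(unique(out$cyl))) {
  # e.g. lv = 4 creates column "cyl_4", equal to 1 where cyl == 4 and 0 elsewhere
  out[[paste0("cyl_", lv)]] <- as.integer(out$cyl == lv)
}
head(out[, c("cyl", "cyl_4", "cyl_6", "cyl_8")])
```

Each row ends up with exactly one of the three new columns set to 1, because every car has exactly one cyl value.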

Related

"Total score" based on multiple "above/below median values"

I am new to R and I have a problem that I simply can't find the solution to. I have tried reading through older posts etc. but I can't figure out how it could/should be done. I hope that some of you might be able to help.
Using the mtcars dataset as an example,
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
I wish to assign the cars a value (e.g. 1) if, for instance, the hp value is above the median and 0 if below. This could be called "hp_index".
I would do the same for, let's say, cyl, disp and drat, and then I would like to assign the cars an "index_score", where a car would be given the value 1 if at least 3 out of the 4 variables hp, cyl, disp and drat are above the median (that is, if 3 or 4 out of the 4 "hp_index", "cyl_index", "disp_index" and "drat_index" are 1).
Once again, I really hope that some of you might be able to help!
Thanks in advance, and have a nice day!
Here's a tidyverse solution:
library(tidyverse)
mtcars %>%
mutate(across(c(hp, cyl, disp, drat), .fns = ~if_else(.x >= median(.x), 1, 0), .names = "{.col}_index"),
index_score = apply(across(c(hp_index, cyl_index, disp_index, drat_index)), 1, sum),
index_score = if_else(index_score >= 3, 1, 0))
Note: I split the conditions as >= median and < median. If you use only > and <, a value exactly equal to the median satisfies neither condition, which can leave that row unassigned (missing).
First six cases:
mpg cyl disp hp drat wt qsec vs am gear carb hp_index cyl_index disp_index drat_index index_score
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0 1 0 1 0
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0 1 0 1 0
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0 0 0 1 0
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0 1 1 0 0
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1 1 1 0 1
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0 1 1 0 0
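To see why the choice between >= and > in the note above matters, cyl is a convenient check: several cars sit exactly on its median. A small sketch (the name `med` is just for illustration):

```r
# cyl has many values exactly equal to its median, so >= and > disagree.
med <- median(mtcars$cyl)
med                        # 6
sum(mtcars$cyl >= med)     # cars counted as "above" when ties are included
sum(mtcars$cyl >  med)     # cars counted as "above" when ties are excluded
```

The two counts differ by the number of six-cylinder cars, so whichever comparison you pick changes cyl_index for all of them.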
Using dplyr:
library(dplyr)
cols <- c('cyl', 'disp', 'drat', 'hp')
mtcars %>%
mutate(across(all_of(cols), list(index = ~+(.x > median(.x))))) %>%
mutate(index_score = +(rowSums(select(., ends_with('index'))) >= 3)) -> result
result
I used select(., ends_with('index')) inside rowSums, assuming that your actual dataset has no other columns ending in index apart from the ones newly created with across.
This version is based on all the columns; otherwise, specific columns can be selected:
mtcars$index=rowSums(
sapply(mtcars,function(x){
ifelse(x>median(x),1,0)
})
)
mpg cyl disp hp drat wt qsec vs am gear carb index
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 5
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 4
...
From here you can apply another ifelse to the "index" column to decide whether the score should be 0 or 1.
To use different filters for columns, as you proposed
rowSums(
cbind(
sapply(mtcars[c("cyl", "disp", "drat")],function(x){ifelse(x>median(x),1,0)}),
sapply(mtcars["hp"],function(x){ifelse(x<median(x),1,0)})
)
)

write r function to modify value in data frame

I have a set of variables, say Var1, Var2, up to Varn. They all take three possible values: 0, 1, and 2. I want to replace every 2 with 1,
like so
df$Var1[df$Var1 >= 1] <- 1
This does the job. But when I try to write a function to do this
MakeBinary <- function(varName, dfName){dfName$varName[dfName$varName >= 1] <- 1}
and use this function like:
MakeBinary(Var2, df)
I got an error message: Error in $<-.data.frame(*tmp*, "varName", value = numeric(0)) :
replacement has 0 rows, data has 512.
I just want to know why I got this message. Thanks. My sample size is 512.
If the column name is passed as a string, use [[ instead of $, and return the modified dataset from the function:
MakeBinary <- function(varName, dfName){
dfName[[varName]][dfName[[varName]] >= 1] <- 1
dfName
}
MakeBinary("Var2", df)
Example with mtcars:
MakeBinary("carb", head(mtcars))
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 1
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 1
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 1
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Unquoted variable names can be passed as well, but they need to be converted to strings inside the function:
MakeBinary <- function(varName, dfName){
varName <- deparse(substitute(varName))
dfName[[varName]][dfName[[varName]] >= 1] <- 1
dfName
}
MakeBinary(Var2, df)
Using a reproducible example with mtcars
MakeBinary(carb, head(mtcars))
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 1
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 1
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 1
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
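As for why the original function produced the error: `$` does not evaluate its right-hand side, so dfName$varName looks for a column literally named "varName", finds none, and returns NULL. The sketch below illustrates this on a small made-up data frame (the values in `df` are hypothetical, chosen only for the demonstration).

```r
# Hypothetical data with values 0, 1 and 2, as described in the question.
df <- data.frame(Var1 = c(0, 1, 2), Var2 = c(2, 0, 1))
varName <- "Var2"

is.null(df$varName)   # TRUE: `$` looks for a column literally named "varName"
df[[varName]]         # `[[` evaluates varName first, so this returns the Var2 column

# Function that takes the column name as a string and returns a modified copy.
MakeBinary <- function(varName, dfName) {
  dfName[[varName]][dfName[[varName]] >= 1] <- 1
  dfName
}

# The result has to be assigned back; df itself is not changed by the call.
df <- MakeBinary("Var2", df)
df$Var2
```

Note that R functions work on a copy of their arguments, which is why the function returns the data frame and the caller reassigns it.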

write function to replace variable with itself plus 1% of its median

I'm new to writing functions in R, but want to write a function to add 1% of the median of a variable to itself, using dplyr, and replace the variable with this transformation.
x is a numeric variable.
add_median <- function(df, x) {
x <- enquo(x)
x <- quo_name(x)
mutate(x=x+.01*median(x, na.rm=T))
}
When I run newDF <- DF %>% add_median(variable_of_interest), I get the following error:
Error in 0.01 * median(x, na.rm = T) : non-numeric argument to binary operator
What am I doing wrong here?
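In the original function, quo_name(x) turns the column into a character string, so x + .01 * median(x) fails with a non-numeric argument. We could change the function to evaluate the column with {{}} and then use the := operator instead of = in mutate: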
library(dplyr)
add_median <- function(df, x) {
df %>%
mutate({{x}} := {{x}} + .01 * median({{x}}, na.rm = TRUE))
}
If we need to change multiple columns, use mutate_at
add_median_multiple <- function(df, vec){
df %>%
mutate_at(vars(vec), ~ . + .01 * median(., na.rm = TRUE))
}
Testing:
data(mtcars)
head(mtcars) %>%
add_median(mpg)
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.21 6 160 110 3.90 2.620 16.46 0 1 4 4
#Mazda RX4 Wag 21.21 6 160 110 3.90 2.875 17.02 0 1 4 4
#Datsun 710 23.01 4 108 93 3.85 2.320 18.61 1 1 4 1
#Hornet 4 Drive 21.61 6 258 110 3.08 3.215 19.44 1 0 3 1
#Hornet Sportabout 18.91 8 360 175 3.15 3.440 17.02 0 0 3 2
#Valiant 18.31 6 225 105 2.76 3.460 20.22 1 0 3 1
Comparison with the original 'mpg' column:
head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
add_median_multiple(head(mtcars), c('mpg', 'wt'))
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.21 6 160 110 3.90 2.65045 16.46 0 1 4 4
#Mazda RX4 Wag 21.21 6 160 110 3.90 2.90545 17.02 0 1 4 4
#Datsun 710 23.01 4 108 93 3.85 2.35045 18.61 1 1 4 1
#Hornet 4 Drive 21.61 6 258 110 3.08 3.24545 19.44 1 0 3 1
#Hornet Sportabout 18.91 8 360 175 3.15 3.47045 17.02 0 0 3 2
#Valiant 18.31 6 225 105 2.76 3.49045 20.22 1 0 3 1
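Since the scoped verbs such as mutate_at are superseded in dplyr 1.0.0 and later, the multi-column version could also be written with across(). This is a sketch under that assumption; the function name `add_median_across` and the object `res` are illustrative and not part of the answer above.

```r
library(dplyr)

# Sketch of the multi-column version written with across() instead of mutate_at().
# `vec` is a character vector of column names; all_of() makes that explicit.
add_median_across <- function(df, vec) {
  df %>%
    mutate(across(all_of(vec), ~ .x + 0.01 * median(.x, na.rm = TRUE)))
}

res <- add_median_across(head(mtcars), c("mpg", "wt"))
res[, c("mpg", "wt")]
```

On head(mtcars) this should reproduce the mpg and wt columns shown in the add_median_multiple output.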

Adding a counting sequence to a nested dataframe in R using purrr

I have a list of nested data frames. I want to add a column of a sequential count to each nested dataframe.
This works in typical dataframe:
library(tidyverse)
mtcars %>% mutate(index = sequence(n()))
mpg cyl disp hp drat wt qsec vs am gear carb index
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
But it does not work in a nested structure:
table <-
tribble(~id, ~data, 1, mtcars) %>%
mutate(data_index = map(data, ~mutate(.x, index = sequence(n()))))
table
# A tibble: 1 x 3
id data data_index
<dbl> <list> <list>
1 1 <data.frame [32 x 11]> <data.frame [32 x 12]>
head(table$data_index[[1]])
mpg cyl disp hp drat wt qsec vs am gear carb index
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1
Any thoughts on how to overcome this?
I modified your code a little. The key is that n() here returns the number of rows of the outer data frame (the one-row tibble), not of mtcars, so I used nrow(.x) instead.
library(tidyverse)
dt <- tribble(~id, ~data, 1, mtcars)
dt2 <- dt %>%
mutate(data_index = map(data, ~mutate(.x, index = sequence(nrow(.x)))))
head(dt2$data_index[[1]])
# mpg cyl disp hp drat wt qsec vs am gear carb index
# 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
# 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2
# 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3
# 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 4
# 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 5
# 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 6
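The same idea, taking the row count from each data frame itself, also works without purrr. Below is a minimal base R sketch on a plain list of data frames; the list `dfs` and its contents are hypothetical stand-ins for the nested column.

```r
# Hypothetical list of data frames standing in for the nested column.
dfs <- list(a = head(mtcars, 3), b = head(iris, 5))

# Add a 1..n counter to each data frame, using that data frame's own row count.
dfs_indexed <- lapply(dfs, function(d) {
  d$index <- seq_len(nrow(d))
  d
})

dfs_indexed$b$index
```

seq_len(nrow(d)) is used rather than 1:nrow(d) so that an empty data frame gets an empty index instead of c(1, 0).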

get rid of first column when converting dtm Matrix to DataFrame

I've converted a Document Term Matrix to a dataframe using this simple line
dtm.df <- as.data.frame(inspect(dtm))
The problem is I want to remove the first column (filenames) but the column has no name.
There might be two different issues here: rownames vs. columns.
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Here you see a column printed without a name. These are the rownames.
mpg is the first column. If we wanted to remove this column without referring to its name, we could use
mtcars <- mtcars[,-1]
head(mtcars)
cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 6 225 105 2.76 3.460 20.22 1 0 3 1
On the other hand, if you are talking about the rownames, which are still printed, you can remove them with the function rownames:
rownames(mtcars) <- NULL
head(mtcars)
cyl disp hp drat wt qsec vs am gear carb
1 6 160 110 3.90 2.620 16.46 0 1 4 4
2 6 160 110 3.90 2.875 17.02 0 1 4 4
3 4 108 93 3.85 2.320 18.61 1 1 4 1
4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 8 360 175 3.15 3.440 17.02 0 0 3 2
6 6 225 105 2.76 3.460 20.22 1 0 3 1
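Applied to the document-term case, the file names are most likely row names rather than a real column. The sketch below uses a small made-up matrix in place of the dtm (the matrix `m` and its document and term names are hypothetical), so it only illustrates the row-name behaviour and not the tm API.

```r
# Hypothetical stand-in for a document-term matrix:
# documents as row names, terms as columns.
m <- matrix(c(1, 0, 2, 3), nrow = 2,
            dimnames = list(c("doc1.txt", "doc2.txt"), c("apple", "pear")))

dtm_df <- as.data.frame(m)
rownames(dtm_df)          # the file names live here, not in a column
ncol(dtm_df)              # 2: only the term columns

# Work on a copy so the original row names stay available if needed.
dtm_df_plain <- dtm_df
rownames(dtm_df_plain) <- NULL
dtm_df_plain
```

After the row names are reset, the data frame prints with the default 1, 2, ... labels and the term columns are unchanged.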
