I am new to R and I have a problem that I simply can't find the solution to. I have tried reading through older posts, but I can't figure out how it could or should be done. I hope that some of you might be able to help.
Using the mtcars dataset as an example,
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
I wish to assign each car a value (e.g. 1) if, for instance, its hp value is above the median and 0 if it is below. This could be called "hp_index".
I would do the same for, let's say, cyl, disp and drat. Then I would like to assign each car an "index_score", where a car is given the value 1 if at least 3 of the 4 variables hp, cyl, disp and drat are above the median (that is, if 3 or 4 of "hp_index", "cyl_index", "disp_index" and "drat_index" are 1).
Once again, I really hope that some of you might be able to help!
Thanks in advance, and have a nice day!
Here's a tidyverse solution:
library(tidyverse)
mtcars %>%
  mutate(
    # 1 if the value is at or above the column median, otherwise 0
    across(c(hp, cyl, disp, drat), .fns = ~ if_else(.x >= median(.x), 1, 0), .names = "{.col}_index"),
    # count the indicators per row, then flag rows with at least three of four
    index_score = apply(across(c(hp_index, cyl_index, disp_index, drat_index)), 1, sum),
    index_score = if_else(index_score >= 3, 1, 0)
  )
Note: I split the conditions into >= median and < median so that every row falls into one of the two groups. If you only test > and <, a row whose value is exactly the median matches neither condition and could end up as a missing value.
First six cases:
mpg cyl disp hp drat wt qsec vs am gear carb hp_index cyl_index disp_index drat_index index_score
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0 1 0 1 0
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0 1 0 1 0
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0 0 0 1 0
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0 1 1 0 0
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1 1 1 0 1
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0 1 1 0 0
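The note about ties can be checked directly. The following is a minimal base R sketch (not part of the answer above) that uses cyl, where several cars sit exactly on the median, to show how > and >= split the rows differently:

```r
# cyl has several values exactly equal to its median, so it makes ties visible
med <- median(mtcars$cyl)

above       <- sum(mtcars$cyl >  med)   # strictly above the median
at_or_above <- sum(mtcars$cyl >= med)   # ties counted together with "above"
below       <- sum(mtcars$cyl <  med)   # strictly below the median
ties        <- sum(mtcars$cyl == med)   # rows that neither > nor < would catch

c(median = med, above = above, at_or_above = at_or_above, below = below, ties = ties)
```

Whether a tie should count as "above" depends on what the index is meant to express; the question asks for "above the median", which would correspond to the strict >.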
Using dplyr:
library(dplyr)
cols <- c('cyl', 'disp', 'drat', 'hp')
mtcars %>%
  # all_of() selects the columns named in the character vector cols
  mutate(across(all_of(cols), list(index = ~ +(.x > median(.x))))) %>%
  mutate(index_score = +(rowSums(select(., ends_with('index'))) >= 3)) -> result
result
I used select(., ends_with('index')) inside rowSums on the assumption that your actual dataset has no columns ending in index other than the ones newly created with across.
In this case the index is based on all the columns; otherwise, specific columns can be selected.
mtcars$index <- rowSums(
  sapply(mtcars, function(x) {
    ifelse(x > median(x), 1, 0)
  })
)
mpg cyl disp hp drat wt qsec vs am gear carb index
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 5
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 4
...
From here you can apply another ifelse to the "index" column to decide whether the final value should be 0 or 1.
To use different filters for different columns, as you proposed:
rowSums(
  cbind(
    sapply(mtcars[c("cyl", "disp", "drat")], function(x) { ifelse(x > median(x), 1, 0) }),
    sapply(mtcars["hp"], function(x) { ifelse(x < median(x), 1, 0) })
  )
)
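Putting the two steps together for the four columns named in the question could look like the sketch below. It is one possible base R arrangement, not the answerer's code; the names indicators and scored are made up for illustration.

```r
cols <- c("hp", "cyl", "disp", "drat")   # the four variables from the question

# one 0/1 indicator per column: 1 if strictly above that column's median
indicators <- sapply(mtcars[cols], function(x) ifelse(x > median(x), 1, 0))
colnames(indicators) <- paste0(cols, "_index")

scored <- cbind(mtcars, indicators)
# second ifelse: 1 if at least 3 of the 4 indicators are 1, otherwise 0
scored$index_score <- ifelse(rowSums(indicators) >= 3, 1, 0)

head(scored[c(cols, colnames(indicators), "index_score")])
```

Working on a copy (scored) leaves mtcars itself untouched, which avoids accidentally including previously added index columns in a later sapply over all columns.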
Related
Using the R dataset mtcars, I want to make a new binary variable for each level of the "cyl" variable.
For example, the values of cyl are 6, 4, and 8.
I want a new dataset with variables "cyl_4", "cyl_6", and "cyl_8" that equal 1 when the corresponding number occurs and 0 otherwise.
I am looking for solutions that create a new variable for each level of the original variable.
Have:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Want:
mpg cyl disp hp drat wt qsec vs am gear carb cyl_4 cyl_6 cyl_8
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0 1 0
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0 1 0
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1 0 0
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0 1 0
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0 0 1
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0 1 0
Here's one tidyverse solution: pivot on the cyl column, then replace the values in the 3 resulting columns with 0 where they are NA, otherwise with 1.
library(dplyr)
library(tidyr)
library(tibble)
mtcars %>%
  rownames_to_column(var = "model") %>%
  pivot_wider(names_from = "cyl",
              values_from = "cyl",
              names_prefix = "cyl_",
              names_sort = TRUE) %>%
  mutate(across(starts_with("cyl"), ~ ifelse(is.na(.), 0, 1)))
Result (first 5 rows):
# A tibble: 32 × 14
model mpg disp hp drat wt qsec vs am gear carb cyl_4 cyl_6 cyl_8
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 160 110 3.9 2.62 16.5 0 1 4 4 0 1 0
2 Mazda RX4 Wag 21 160 110 3.9 2.88 17.0 0 1 4 4 0 1 0
3 Datsun 710 22.8 108 93 3.85 2.32 18.6 1 1 4 1 1 0 0
4 Hornet 4 Drive 21.4 258 110 3.08 3.22 19.4 1 0 3 1 0 1 0
5 Hornet Sportabout 18.7 360 175 3.15 3.44 17.0 0 0 3 2 0 0 1
You could use model.matrix() to create the design matrix for a categorical variable.
cbind(mtcars, model.matrix(~ cyl - 1, transform(mtcars, cyl = as.factor(cyl))))
# mpg cyl disp hp drat wt qsec vs am gear carb cyl4 cyl6 cyl8
# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 0 1 0
# Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 0 1 0
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1 0 0
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 0 1 0
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 0 0 1
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 0 1 0
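If the exact column names from the question ("cyl_4", "cyl_6", "cyl_8") matter, the same indicators can also be built by comparing cyl against each of its levels. This is a small base R sketch of that idea, offered as an alternative rather than as either answerer's method; dummies and with_dummies are illustrative names.

```r
cyl_levels <- sort(unique(mtcars$cyl))   # the distinct values of cyl

# one column per level: 1 where cyl equals that level, 0 elsewhere
dummies <- sapply(cyl_levels, function(v) as.integer(mtcars$cyl == v))
colnames(dummies) <- paste0("cyl_", cyl_levels)

with_dummies <- cbind(mtcars, dummies)
head(with_dummies)
```

Each row has exactly one of the new columns set to 1, which is a quick sanity check (all(rowSums(dummies) == 1)).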
I am trying to fill gaps, but in a loop. The data is monthly, and fill_gaps produces NAs for every day. I am not sure why.
for (x in 2:length(differencing)){
  for(micky in 1:length(differencing$`d_ BA`)){
    if(is.na(differencing[micky,x])== T){
      differencing[micky,x] = differencing[micky-1,x]
    }
  }
}
Here is the error that I am getting:
Error: Assigned data `differencing[(micky - 1), x]` must be compatible with row subscript `micky`.
x 1 row must be assigned.
x Assigned data has 0 rows.
i Row updates require a list value. Do you need `list()` or `as.list()`?
Run `rlang::last_error()` to see where the error occurred.
This can be done easily using fill:
library(tidyr)
library(dplyr)
differencing %>%
  fill(everything())
Or we can use na.locf from zoo:
library(zoo)
na.locf(differencing)
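In the OP's loop, the inner loop should start at row 2 rather than row 1:
for (micky in 2:nrow(differencing)) {
...
At micky = 1, differencing[micky - 1, x] selects row 0, which has no rows, and that is what the error message reports ("Assigned data has 0 rows"). Note also that the length of a data.frame is the number of columns (as mentioned in the comments), which is different from the length of a column, i.e. a vector; ncol and nrow make the intent clearer.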
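For reference, here is a minimal sketch of such a loop with the row index starting at 2, so that row - 1 never points at row 0 (which has no rows). The small data frame is made up for illustration and stands in for the OP's differencing; the first column is treated as a date-like key and skipped, as in the original loop.

```r
# made-up stand-in for the OP's data: column 1 is a key, the rest have gaps
df <- data.frame(month = 1:5,
                 a = c(1, NA, NA, 4, NA),
                 b = c(NA, 2, NA, NA, 5))

for (col in seq_len(ncol(df))[-1]) {        # columns 2..ncol
  for (row in seq_len(nrow(df))[-1]) {      # rows 2..nrow, so row - 1 >= 1
    if (is.na(df[row, col])) {
      df[row, col] <- df[row - 1, col]      # carry the previous value forward
    }
  }
}
df
```

A leading NA (as in column b) stays NA because there is no earlier value to carry forward, which matches the behaviour of fill and na.locf.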
Since the OP mentioned that none of these work (without providing an example), here is a small reproducible example ('tmp', defined under "data" below):
tmp %>%
  fill(everything())
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#Hornet Sportabout 18.7 6 258 110 3.15 3.440 17.02 0 0 3 2
#Valiant 18.1 6 258 110 2.76 3.460 20.22 1 0 3 1
or using na.locf
na.locf(tmp)
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#Hornet Sportabout 18.7 6 258 110 3.15 3.440 17.02 0 0 3 2
#Valiant 18.1 6 258 110 2.76 3.460 20.22 1 0 3 1
data
tmp <- head(mtcars)
tmp[c(2, 5, 6), c(3, 4, 2)] <- NA
I have a set of variables, say Var1, Var2, ..., Varn. They all take three possible values: 0, 1, and 2. I want to replace every 2 with 1,
like so:
df$Var1[df$Var1 >= 1] <- 1
This does the job. But when I try to write a function to do this:
MakeBinary <- function(varName, dfName){dfName$varName[dfName$varName >= 1] <- 1}
and use this function like:
MakeBinary(Var2, df)
I got an error message: Error in $<-.data.frame(*tmp*, "varName", value = numeric(0)) :
replacement has 0 rows, data has 512.
I just want to know why I got this message. Thanks. My sample size is 512.
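If we are passing the column name as a string, use [[ instead of $, because $ looks for a column literally named varName rather than the column whose name is stored in varName. The function should also return the modified dataset:
MakeBinary <- function(varName, dfName){
  dfName[[varName]][dfName[[varName]] >= 1] <- 1
  dfName
}
MakeBinary("Var2", df)
Example with mtcars: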
MakeBinary("carb", head(mtcars))
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 1
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 1
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 1
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
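The difference between $ and [[ can be seen on a tiny made-up data frame. This sketch assumes nothing beyond base R; the data values are invented for illustration.

```r
df <- data.frame(Var1 = c(0, 1, 2), Var2 = c(2, 0, 1))   # made-up data
varName <- "Var2"

is.null(df$varName)   # `$` looks for a column literally called "varName"
df[[varName]]         # `[[` uses the value stored in varName, i.e. "Var2"

MakeBinary <- function(varName, dfName) {
  dfName[[varName]][dfName[[varName]] >= 1] <- 1
  dfName              # return the modified copy
}

df <- MakeBinary("Var2", df)   # reassign, otherwise df itself is unchanged
df$Var2
```

Because R modifies a copy of the data frame inside the function, the result has to be assigned back to df (or to a new name) to keep the change.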
Unquoted variable names can be passed as well, but the argument then needs to be converted to a string:
MakeBinary <- function(varName, dfName){
  varName <- deparse(substitute(varName))
  dfName[[varName]][dfName[[varName]] >= 1] <- 1
  dfName
}
MakeBinary(Var2, df)
Using a reproducible example with mtcars
MakeBinary(carb, head(mtcars))
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 1
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 1
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 1
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
When I create a new variable, is there a way to specify in the function where to place it?
Right now, it adds it to the end of the dataframe, but for ease of viewing in Excel for example, I'd like to place a new calculated column beside the columns I used for the calculation.
Here's an example of code:
rawdata2 <- (rawdata1 %>% unite(location, locations1,locations2, locations3,
na.rm = TRUE, remove=TRUE)
%>% select(-location7, -location16)
%>% unite(Sector, Sectors, na.rm=TRUE, remove=TRUE)
%>% unite(TypeofSpace, TypesofSpace, type.of.spaceOther, na.rm=TRUE,
remove=TRUE)
)
You can rearrange the columns in your data frame. It looks like you are using dplyr::select in your example.
library(dplyr)
head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars2 <- mtcars %>%
select(mpg, carb, everything()) ## moves carb up behind mpg
head(mtcars2)
# mpg carb cyl disp hp drat wt qsec vs am gear
# Mazda RX4 21.0 4 6 160 110 3.90 2.620 16.46 0 1 4
# Mazda RX4 Wag 21.0 4 6 160 110 3.90 2.875 17.02 0 1 4
# Datsun 710 22.8 1 4 108 93 3.85 2.320 18.61 1 1 4
# Hornet 4 Drive 21.4 1 6 258 110 3.08 3.215 19.44 1 0 3
# Hornet Sportabout 18.7 2 8 360 175 3.15 3.440 17.02 0 0 3
# Valiant 18.1 1 6 225 105 2.76 3.460 20.22 1 0 3
You can do the same thing with base subsetting. For example, with a data frame with 11 columns you can move the 11th behind the first by
mtcars3 <- mtcars[,c(1,11,2:10)]
identical(mtcars2, mtcars3)
# [1] TRUE
I ended up using relocate, documentation here: dplyr.tidyverse.org/reference/relocate.html
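A short sketch of what that can look like, assuming dplyr 1.0.0 or later (where relocate and the .before/.after arguments of mutate are available). The column mpg_per_cyl is a made-up example, not something from the question.

```r
library(dplyr)

# move an existing column so it sits right after mpg
moved <- mtcars %>% relocate(carb, .after = mpg)
names(moved)[1:3]

# or place a new column next to its source when creating it;
# mpg_per_cyl is a made-up column used only for illustration
created <- mtcars %>% mutate(mpg_per_cyl = mpg / cyl, .after = cyl)
names(created)[1:4]
```

The second form avoids a separate reordering step when the column is being computed in the same pipeline.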
I'm new to writing functions in R, but want to write a function to add 1% of the median of a variable to itself, using dplyr, and replace the variable with this transformation.
x is a numeric variable.
add_median <- function(df, x) {
x <- enquo(x)
x <- quo_name(x)
mutate(x=x+.01*median(x, na.rm=T))
}
When I run newDF <- DF %>% add_median(variable_of_interest), I get the following error:
Error in 0.01 * median(x, na.rm = T) : non-numeric argument to binary operator
What am I doing wrong here?
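In the original function, quo_name() turns x into a character string, which is why the arithmetic fails with a non-numeric argument error, and df is never passed to mutate. We could change the function to evaluate the column with {{ }} and then use the assignment operator := instead of = in mutate, since the name on the left-hand side is itself being injected:
library(dplyr)
add_median <- function(df, x) {
  df %>%
    mutate({{x}} := {{x}} + .01 * median({{x}}, na.rm = TRUE))
}
If we need to change multiple columns, use mutate_at
add_median_multiple <- function(df, vec){
  df %>%
    mutate_at(vars(vec), ~ . + .01 * median(., na.rm = TRUE))
}
Testing: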
data(mtcars)
head(mtcars) %>%
add_median(mpg)
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.21 6 160 110 3.90 2.620 16.46 0 1 4 4
#Mazda RX4 Wag 21.21 6 160 110 3.90 2.875 17.02 0 1 4 4
#Datsun 710 23.01 4 108 93 3.85 2.320 18.61 1 1 4 1
#Hornet 4 Drive 21.61 6 258 110 3.08 3.215 19.44 1 0 3 1
#Hornet Sportabout 18.91 8 360 175 3.15 3.440 17.02 0 0 3 2
#Valiant 18.31 6 225 105 2.76 3.460 20.22 1 0 3 1
comparison with original 'mpg' column
head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
add_median_multiple(head(mtcars), c('mpg', 'wt'))
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.21 6 160 110 3.90 2.65045 16.46 0 1 4 4
#Mazda RX4 Wag 21.21 6 160 110 3.90 2.90545 17.02 0 1 4 4
#Datsun 710 23.01 4 108 93 3.85 2.35045 18.61 1 1 4 1
#Hornet 4 Drive 21.61 6 258 110 3.08 3.24545 19.44 1 0 3 1
#Hornet Sportabout 18.91 8 360 175 3.15 3.47045 17.02 0 0 3 2
#Valiant 18.31 6 225 105 2.76 3.49045 20.22 1 0 3 1
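Since mutate_at is superseded in dplyr 1.0.0 and later, the multi-column version can also be written with across. The following is a sketch of that variant, not the answerer's code; add_median_across is a hypothetical name, and all_of() is used so that the character vector is treated as a set of column names.

```r
library(dplyr)

# hypothetical variant of add_median_multiple() written with across()
add_median_across <- function(df, vec) {
  df %>%
    mutate(across(all_of(vec), ~ .x + 0.01 * median(.x, na.rm = TRUE)))
}

out <- add_median_across(head(mtcars), c("mpg", "wt"))
out[c("mpg", "wt")]
```

Columns not listed in vec are left unchanged, and all_of() raises an error if a requested column does not exist, which tends to surface typos early.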