Weird things with Automatically generate new variable names using dplyr mutate - r

OK, this is going to be a long post.
I am fairly new to R (I am currently using MR free 3.5, with no checkpoint), but I am trying to work with the tidyverse, which I find very elegant and often much simpler to write.
I decided to replicate an exercise from guru99 here. It is a simple k-means exercise. However, because I always want to write "generalizable" code, I was trying to have mutate automatically give the new variables new names. So I searched SO and found this solution here, which is very nice.
First, what works fine:
library(tidyverse) # needed for %>%, discard(), select() and mutate_all()
link <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/computers.csv"
df <- read.csv(link)
rescaled <- df %>% discard(is.factor) %>%
select(-X) %>%
mutate_all(
funs("scaled" = scale)
)
When you read the data with read.csv, df is a plain data.frame and everything works.
And now the weird things start. Suppose you read the data with read_csv, or make it a tibble at any point afterwards. (Note for future me and others: the first, unnamed column is then called X1, and is.factor has to become is.character, because strings are read as character rather than factor unless you explicitly ask for factors.)
Then run the code:
df1 <- read_csv(link)
df1 %>% discard(is.character) %>%
select(-X1) %>%
mutate_all(
funs("scaled" = scale)
)
The new variables are named price_scaled[,1], speed_scaled[,1], hd_scaled[,1], ram_scaled[,1], etc. when you view the output in the console, even if you call print() explicitly.
BUT if you call view() on it, you see the names you would expect: price_scaled, speed_scaled, hd_scaled, etc. ALSO, I am using an R Markdown document for the code, and when I change the chunk output to inline it displays the names correctly (hd_scaled etc.).
Does anyone have any idea how to get the names printed in the console as price_scaled etc.?
Why is this happening?
I thought this would be interesting to ask.

scale() returns a matrix, and dplyr/tibble isn't automatically coercing it to a vector. By changing your mutate_all() call to the below, we can have it return a vector. I identified this is what was happening by calling class(df1$speed_scaled) and seeing the result of "matrix".
library(tidyverse)
link <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/computers.csv"
df <- read_csv(link)
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#> X1 = col_double(),
#> price = col_double(),
#> speed = col_double(),
#> hd = col_double(),
#> ram = col_double(),
#> screen = col_double(),
#> cd = col_character(),
#> multi = col_character(),
#> premium = col_character(),
#> ads = col_double(),
#> trend = col_double()
#> )
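df %>% discard(is.character) %>%
select(-X1) %>%
mutate_all(
# scale(x)[, 1] drops the one-column matrix to a plain numeric vector.
# scale(x)[[1]] would keep only the first element and recycle it down the
# whole column, which is what the identical values in the output below show.
list("scaled" = function(x) scale(x)[, 1])
)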
#> # A tibble: 6,259 x 14
#> price speed hd ram screen ads trend price_scaled speed_scaled
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1499 25 80 4 14 94 1 -1.24 -1.28
#> 2 1795 33 85 2 14 94 1 -1.24 -1.28
#> 3 1595 25 170 4 15 94 1 -1.24 -1.28
#> 4 1849 25 170 8 14 94 1 -1.24 -1.28
#> 5 3295 33 340 16 14 94 1 -1.24 -1.28
#> 6 3695 66 340 16 14 94 1 -1.24 -1.28
#> 7 1720 25 170 4 14 94 1 -1.24 -1.28
#> 8 1995 50 85 2 14 94 1 -1.24 -1.28
#> 9 2225 50 210 8 14 94 1 -1.24 -1.28
#> 10 2575 50 210 4 15 94 1 -1.24 -1.28
#> # ... with 6,249 more rows, and 5 more variables: hd_scaled <dbl>,
#> # ram_scaled <dbl>, screen_scaled <dbl>, ads_scaled <dbl>,
#> # trend_scaled <dbl>
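The same idea can be sketched with across(), which is the current replacement for mutate_all(). This is only a minimal sketch: mtcars stands in for the computers data so that it runs offline, and the three selected columns are an arbitrary choice for illustration, not part of the original question.

```r
library(dplyr)

# mtcars is a stand-in for the computers data (assumption for illustration)
rescaled <- mtcars %>%
  select(mpg, hp, wt) %>%
  mutate(across(
    everything(),
    ~ as.numeric(scale(.x)),      # as.numeric() drops the matrix returned by scale()
    .names = "{.col}_scaled"      # builds mpg_scaled, hp_scaled, wt_scaled
  ))

names(rescaled)
sapply(rescaled, is.matrix)       # no column is a matrix any more
```

Because every new column is a plain numeric vector, the tibble printout shows the names without the [,1] suffix.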

Related

mutate( ) returns a matrix

After I updated RStudio today, I tried to get z-scores for a data frame using mutate() and scale(), and it now returns a matrix along with a 'New names' message:
df <- df %>% group_by(participants) %>% mutate(zscore=scale(answer))
New names:
* NA -> ...8
class(df$zscore)
[1] "matrix" "array"
The z-score column should have been named 'zscore', so why is it now named '...8'? I never had any problems with this code before. Is it because of the update?
I think you just added another column without a header or read in data with a column without a header. There is no issue with your classes.
library(tidyverse)
test <- mtcars|>
group_by(cyl) |>
mutate(zscore=scale(mpg))
#class of test
class(test)
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
#class of column
class(test$zscore)
#> [1] "matrix" "array"
#recreate warning
test <- test |>
bind_cols("")
#> New names:
#> * `` -> `...13`
The warning at the bottom means that I added a column without a name in the 13th position.
Part of the issue is that scale() returns a matrix. You can fix this by wrapping in as.double():
library(dplyr)
starwars2 <- starwars %>%
select(height, gender) %>%
group_by(gender) %>%
mutate(zscore = as.double(scale(height)))
Output:
# A tibble: 87 × 3
# Groups: gender [3]
height gender zscore
<int> <chr> <dbl>
1 172 masculine -0.120
2 167 masculine -0.253
3 96 masculine -2.14
4 202 masculine 0.677
5 150 feminine -0.624
6 178 masculine 0.0394
7 165 feminine 0.0133
8 97 masculine -2.11
9 183 masculine 0.172
10 182 masculine 0.146
# … with 77 more rows
But I’m not sure this explains your NA -> ...8 issue. If not, please update your question to include your data (using dput(df)) or a subset (using dput(head(df))).
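Another way to sidestep the matrix is to skip scale() altogether. The sketch below uses a small helper, zscore(), which is a hypothetical name introduced here and not something from either answer; mtcars again stands in for the asker's data.

```r
library(dplyr)

# Hypothetical helper: centre and scale a numeric vector, returning a plain vector
zscore <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

scored <- mtcars %>%
  group_by(cyl) %>%
  mutate(z = zscore(mpg)) %>%     # z-scores are computed within each cyl group
  ungroup()

class(scored$z)                   # "numeric"
```

The values agree with as.double(scale(mpg)) computed per group; the only difference is that no matrix is ever created.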

Error in using group_by and summarise in running correlation and test of significance with SPSS dataset

I borrowed an SPSS dataset prepared for Julie Pallant's SPSS Survival Manual and am analysing it in R.
I want to run a correlation and a significance test on three columns: toptim, tnegaff and sex. I select them with select: df <- survey %>% select(toptim, tnegaff, sex).
Then the problems start.
I'd like to know the correlation between toptim and tnegaff by sex, but cor does not work and I have to resort to correlate. Why is there an error, and is there any difference between the two methods?
df %>% group_by(sex) %>% summarise(cor = correlate(toptim, tnegaff)) <- OK (male = 0.22 female = 0.394)
df %>% group_by(sex) %>% summarise(cor = cor(toptim, tnegaff)) <- failed, returns with NA
I failed to obtain the test of significance with cor.test (The answer should be p = 0.0488)
Error in `summarise()`:
! Problem while computing `cor = cor.test(toptim, tnegaff)`.
✖ `cor` must be a vector, not a `htest` object.
ℹ The error occurred in group 1: sex = 1.
Then I try to follow past examples and use broom::tidy, but no output for p-values....
> df %>% group_by(sex) %>% broom::tidy(cor.test(toptim, tnegaff))
# A tibble: 3 × 13
column n mean sd median trimmed mad min max range skew kurtosis se
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 toptim 435 22.1 4.43 22 22.3 3 7 30 23 NA NA 0.212
2 tnegaff 435 19.4 7.07 18 18.6 4 10 39 29 NA NA 0.339
3 sex 439 1.58 0.494 2 1.58 0 1 2 1 -0.318 1.10 0.0236
How can I get the result? May I know the reason for such failure?
Thank you for your answers in advance.
By default cor() uses every value and returns NA as soon as it meets a missing value, which I presume is what is happening here. If you set use = "complete.obs" it should work. For the cor.test part, wrap the result in list() so that it is stored in a list-column: summarise() needs a vector, and an htest object is not one.
For the final tidying and the p-values, use map() to apply broom::tidy to each element of the cor.test column, then tidyr::unnest() to get a full, tidy data frame.
That's a few steps to go through, but I hope it helps!
df <- haven::read_sav("survey.sav")
library(tidyverse)
df %>%
group_by(sex) %>%
summarise(cor = cor(toptim, tnegaff, use = "complete.obs"),
cor.test = list(cor.test(toptim, tnegaff))) %>%
mutate(tidy_out = map(cor.test, broom::tidy)) %>%
unnest(tidy_out)
#> # A tibble: 2 × 11
#> sex cor cor.t…¹ estim…² stati…³ p.value param…⁴ conf.…⁵ conf.…⁶ method
#> <dbl+l> <dbl> <list> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr>
#> 1 1 [MAL… -0.220 <htest> -0.220 -3.04 2.73e- 3 182 -0.353 -0.0775 Pears…
#> 2 2 [FEM… -0.394 <htest> -0.394 -6.75 1.06e-10 248 -0.494 -0.284 Pears…
#> # … with 1 more variable: alternative <chr>, and abbreviated variable names
#> # ¹​cor.test, ²​estimate, ³​statistic, ⁴​parameter, ⁵​conf.low, ⁶​conf.high
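If survey.sav is not to hand, the list-column pattern can be tried on built-in data. This is a sketch under stated assumptions: mtcars with cyl, mpg and wt stands in for the survey columns, and the p-value is pulled out of each htest object directly instead of going through broom::tidy.

```r
library(dplyr)
library(purrr)

results <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    r    = cor(mpg, wt, use = "complete.obs"),  # one number per group
    test = list(cor.test(mpg, wt)),             # htest object kept in a list-column
    .groups = "drop"
  ) %>%
  mutate(p_value = map_dbl(test, ~ .x$p.value)) # extract the p-value from each test

results
```

Wrapping the test in list() is what keeps summarise() from complaining that the result is not a vector.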
Edit - examining difference in correlation
Borrowing the function from here you can examine the difference in correlation coefficients between sexes like this:
cor.diff.test(df$toptim[df$sex == 1], df$tnegaff[df$sex == 1], df$toptim[df$sex == 2], df$tnegaff[df$sex == 2])

How to move R code into functions to generalise behaviour

I have a huge messy piece of R code with loads of ugly repetition. There is an opportunity to massively reduce it. Starting with this piece of code:
table <-
risk_assigned %>%
group_by(rental_type, room_type) %>%
summarise_all(funs( sum(!is.na(.)) / length(.) ) ) %>%
select(-c(device_id, ts, room, hhi, temp)) %>%
adorn_pct_formatting()
I would like to generalise it into a function so it can be reused.
LayKable = function(kableDetails) {
table <-
risk_assigned %>%
group_by(kableDetails$group1 , kableDetails$group2) %>%
summarise_all(funs( sum(!is.na(.)) / length(.) ) ) #%>%
select(-c(device_id, ts, room, hhi, temp)) %>%
adorn_pct_formatting()
...
kable <- table
return(kable)
}
kableDetails <- list(
group1 = "rental_type",
group2 = "room_type"
)
newKable <- LayKable(kableDetails)
This rather half-hearted attempt serves to explain what I want to do. How can I pass things into this function inside a list? (I'm a C programmer, pretending it's a struct.)
When you pass column names as function arguments to a dplyr verb inside a function, you have to use tidy evaluation (the rlang helpers). It should be simple, though, to define a function that takes any number of grouping variables:
library(dplyr)
test_func <- function(..., data = mtcars) {
# Passing `data` as a default argument as it's nice to be flexible!
data %>%
group_by(!!!enquos(...)) %>%
summarise(across(everything(), sum), .groups = "drop") # .cols given explicitly; omitting it is deprecated in newer dplyr
}
test_func(cyl, gear)
#> # A tibble: 8 x 11
#> cyl gear mpg disp hp drat wt qsec vs am carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 3 21.5 120. 97 3.7 2.46 20.0 1 0 1
#> 2 4 4 215. 821 608 32.9 19.0 157. 8 6 12
#> 3 4 5 56.4 215. 204 8.2 3.65 33.6 1 2 4
#> 4 6 3 39.5 483 215 5.84 6.68 39.7 2 0 2
#> 5 6 4 79 655. 466 15.6 12.4 70.7 2 2 16
#> 6 6 5 19.7 145 175 3.62 2.77 15.5 0 1 6
#> 7 8 3 181. 4291. 2330 37.4 49.2 206. 0 0 37
#> 8 8 5 30.8 652 599 7.76 6.74 29.1 0 2 12
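For a function with individually named column arguments, the embrace operator {{ }} does the quote-and-unquote step in one go. The sketch below is an illustration only; mean_by() and its argument names are hypothetical and not taken from the answer.

```r
library(dplyr)

# Hypothetical helper: {{ }} forwards bare column names into dplyr verbs
mean_by <- function(data, summary_var, group_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(mean = mean({{ summary_var }}), .groups = "drop")
}

out <- mean_by(mtcars, hp, cyl)
out
```

This is equivalent to writing !!enquo(group_var) by hand, just shorter to read.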
Update - adding a list
I see that your ideal would be to build a list of arguments for each function call and pass that, rather than writing out the arguments in each call. You can do this with do.call, which passes a list of named arguments to a function. When the function uses dplyr verbs, wrap each variable name in quote() when constructing the list (so that R doesn't try to look it up in the global environment while building the list), and use !!enquo on each argument inside the function:
library(dplyr)
test_func2 <- function(.summary_var, .group_var, data = mtcars) {
data %>%
group_by(!!enquo(.group_var)) %>%
summarise(mean = mean(!!enquo(.summary_var)))
}
# Test with bare arguments
test_func2(hp, cyl)
#> # A tibble: 3 x 2
#> cyl mean
#> <dbl> <dbl>
#> 1 4 82.6
#> 2 6 122.
#> 3 8 209.
# Construct and pass list
args <- list(.summary_var = quote(hp), .group_var = quote(cyl))
do.call(test_func2, args = args)
#> # A tibble: 3 x 2
#> cyl mean
#> <dbl> <dbl>
#> 1 4 82.6
#> 2 6 122.
#> 3 8 209.
A handy guide to tidy evaluation where most of these ideas are explained more clearly.
Created on 2021-12-21 by the reprex package (v2.0.1)
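If the list holds column names as strings, as in the kableDetails attempt in the question, no quoting is needed at all: all_of() accepts a character vector. The following is a rough sketch, assuming mtcars as stand-in data; share_non_missing() and details are hypothetical names, and the summary mirrors the sum(!is.na(.)) / length(.) calculation from the question.

```r
library(dplyr)

# Hypothetical helper: grouping columns arrive as strings inside a list
share_non_missing <- function(data, details) {
  data %>%
    group_by(across(all_of(c(details$group1, details$group2)))) %>%
    # mean(!is.na(x)) is the same as sum(!is.na(x)) / length(x)
    summarise(across(everything(), ~ mean(!is.na(.x))), .groups = "drop")
}

details <- list(group1 = "cyl", group2 = "gear")
res <- share_non_missing(mtcars, details)
res
```

This keeps the struct-like list from the question while staying inside ordinary dplyr selection helpers.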

step_num2factor() Usage -- Tidymodel (Recipe Package)

Well, I've read the function reference for step_num2factor and, honestly, I haven't figured out how to use it properly.
temp_names <- as.character(unique(sort(all_raw$MSSubClass)))
price_recipe <-
recipe(SalePrice ~ . , data = train_raw) %>%
step_num2factor(MSSubClass, levels = temp_names)
temp_rec <- prep(price_recipe, training = train_raw, strings_as_factors = FALSE) # temporary recipe
temp_data <- bake(temp_rec, new_data = all_raw) # temporary data
class(all_raw$MSSubClass)
# > col_double()
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
After the step is applied, the output temp_data$MSSubClass is full of NA.
The observations are stored as 20, 30, 40, ..., 190 and I want to transform them to names (or even keep the same numbers, but as an unordered factor).
If you know of any blog posts about step_num2factor, or some code that uses it, I would be glad to see those as well.
The complete dataset is provided by Kaggle at:
kaggle data
Thanks in advance,
I don't think that step_num2factor() is the best fit for this variable. Take a look at the help again, and notice that you need to give a transform argument that can be used to modify the numeric values prior to determining the levels. This would work OK if this data was all multiples of 10, but you have some values like 75 and 85, so I don't think you want that. This recipe step works best for numeric/integer-ish variables that you can more easily transform to a set of integers with a simple function.
Instead, I think you should think about step_mutate() and a simple coercion to a factor type:
library(tidyverse)
library(recipes)
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#>
#> fixed
#> The following object is masked from 'package:stats':
#>
#> step
train_raw <- read_csv("~/Downloads/house-prices-advanced-regression-techniques/train.csv")
#> Parsed with column specification:
#> cols(
#> .default = col_character(),
#> Id = col_double(),
#> MSSubClass = col_double(),
#> LotFrontage = col_double(),
#> LotArea = col_double(),
#> OverallQual = col_double(),
#> OverallCond = col_double(),
#> YearBuilt = col_double(),
#> YearRemodAdd = col_double(),
#> MasVnrArea = col_double(),
#> BsmtFinSF1 = col_double(),
#> BsmtFinSF2 = col_double(),
#> BsmtUnfSF = col_double(),
#> TotalBsmtSF = col_double(),
#> `1stFlrSF` = col_double(),
#> `2ndFlrSF` = col_double(),
#> LowQualFinSF = col_double(),
#> GrLivArea = col_double(),
#> BsmtFullBath = col_double(),
#> BsmtHalfBath = col_double(),
#> FullBath = col_double()
#> # ... with 18 more columns
#> )
#> See spec(...) for full column specifications.
price_recipe <-
recipe(SalePrice ~ ., data = train_raw) %>%
step_mutate(MSSubClass = factor(MSSubClass))
juiced_price <- prep(price_recipe) %>%
juice()
levels(juiced_price$MSSubClass)
#> [1] "20" "30" "40" "45" "50" "60" "70" "75" "80" "85" "90" "120"
#> [13] "160" "180" "190"
juiced_price %>%
count(MSSubClass)
#> # A tibble: 15 x 2
#> MSSubClass n
#> <fct> <int>
#> 1 20 536
#> 2 30 69
#> 3 40 4
#> 4 45 12
#> 5 50 144
#> 6 60 299
#> 7 70 60
#> 8 75 16
#> 9 80 58
#> 10 85 20
#> 11 90 52
#> 12 120 87
#> 13 160 63
#> 14 180 10
#> 15 190 30
Created on 2020-05-03 by the reprex package (v0.3.0)
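This looks to me like it gets you the factor levels you want. If you want to use the strings from the .txt file, like "1-STORY 1945 & OLDER", save them as a new_levels vector and pass them as labels alongside the numeric codes, e.g. factor(MSSubClass, levels = codes, labels = new_levels), where codes is a (hypothetical) vector holding 20, 30, ... in the same order. Passing the strings as levels on their own would turn every value into NA, because none of the numeric codes match them.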
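A small base R sketch of the difference between levels and labels. The vectors codes, new_levels and x are made up for illustration and use only the first three entries from the data description above.

```r
# First three codes and descriptions from the data dictionary (illustrative subset)
codes      <- c(20, 30, 40)
new_levels <- c("1-STORY 1946 & NEWER ALL STYLES",
                "1-STORY 1945 & OLDER",
                "1-STORY W/FINISHED ATTIC ALL AGES")

x <- c(20, 40, 30, 20)  # made-up MSSubClass values

# levels says which values to expect, labels says what to call them
named <- factor(x, levels = codes, labels = new_levels)
named

# Using the strings as levels matches nothing, so every value becomes NA
broken <- factor(x, levels = new_levels)
broken
```

The same factor() call can be placed inside step_mutate() in the recipe.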

R - dplyr lag function

I am trying to calculate the absolute difference between lagged values over several columns. The first row of the resulting data set is NA, which is correct because there is no previous value to calculate the lag. What I don't understand is why the lag isn't calculated for the last value. Note that the last value in the example below (temp) is the lag between the 2nd to last and the 3rd to last values, the lag value between the last and 2nd to last value is missing.
library(tidyverse)
library(purrr)
dim(mtcars) # 32 rows
temp <- map_df(mtcars, ~ abs(diff(lag(.x))))
names(temp) <- paste(names(temp), '.abs.diff.lag', sep= '')
dim(temp) # 31 rows
It would be an awesome bonus if someone could show me how to pipe the renaming step; I played around with paste and enquo without success. The real dataset is too long for a gather/newcolumnname/spread approach.
Thanks in advance!
EDIT: added the libraries needed to run the script
I think the lag call in your existing code is unnecessary as diff calculates the lagged difference automatically (although perhaps I don't understand properly what you are trying to do). You can also use rename_all to add a suffix to all the variable names.
library(purrr)
library(dplyr)
mtcars %>%
map_df(~ abs(diff(.x))) %>%
rename_all(~ paste0(., ".abs.diff.lag")) # formula instead of the deprecated funs()
#> # A tibble: 31 x 11
#> mpg.abs.diff.lag cyl.abs.diff.lag disp.abs.diff.lag hp.abs.diff.lag
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0 0 0.0 0
#> 2 1.8 2 52.0 17
#> 3 1.4 2 150.0 17
#> 4 2.7 2 102.0 65
#> 5 0.6 2 135.0 70
#> 6 3.8 2 135.0 140
#> 7 10.1 4 213.3 183
#> 8 1.6 0 5.9 33
#> 9 3.6 2 26.8 28
#> 10 1.4 0 0.0 0
#> # ... with 21 more rows, and 7 more variables: drat.abs.diff.lag <dbl>,
#> # wt.abs.diff.lag <dbl>, qsec.abs.diff.lag <dbl>, vs.abs.diff.lag <dbl>,
#> # am.abs.diff.lag <dbl>, gear.abs.diff.lag <dbl>,
#> # carb.abs.diff.lag <dbl>
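A four-element vector makes it easier to see where the last value goes. This is a toy sketch; x is made up, and dplyr::lag is spelled out so that stats::lag is not picked up by accident.

```r
library(dplyr)

x <- c(1, 4, 9, 16)

dplyr::lag(x)        # NA 1 4 9   (the last value, 16, is shifted out)
diff(dplyr::lag(x))  # NA 3 5     (so the difference 16 - 9 never appears)
diff(x)              # 3 5 7      (diff() alone already gives every lagged difference)
```

lag() shifts the vector down by one and drops the final element, so combining it with diff() loses the last difference and adds an NA at the top.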
Maybe something like this:
dataCars <- mtcars%>%mutate(diffMPG = abs(mpg - lag(mpg)),
diffHP = abs(hp - lag(hp)))
And then do this for all the columns you are interested in.
I was not able to reproduce your issue with the lag function. When I run your sample code, I get a data frame of 31 rows, exactly as you mentioned, but the first row is not NA; it is already the difference between the 1st and 2nd rows.
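The column-by-column mutate() can be generalised with across(), which also takes care of the naming. A sketch, assuming every column of mtcars should be differenced; the suffix is the one from the question.

```r
library(dplyr)

out <- mtcars %>%
  mutate(across(
    everything(),
    ~ abs(.x - dplyr::lag(.x)),       # absolute difference from the previous row
    .names = "{.col}.abs.diff.lag"    # new columns are added next to the originals
  ))

dim(out)  # 32 rows are kept; the first row of each new column is NA
```

Unlike diff(), this keeps all 32 rows, so the new columns line up with the original data.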
Regarding your bonus question, the answer is provided here:
temp <- map_df(mtcars, ~ abs(diff(lag(.x)))) %>% setNames(paste0(names(.), '.abs.diff.lag'))
This should result in the desired column naming.

Resources