After I update my Rstudio today, when I tried to get z-scores of a data frame by using mutate() and scale(), it returns a matrix with a 'new name' warning:
df <- df %>% group_by(participants) %>% mutate(zscore=scale(answer))
New names:
* NA -> ...8
class(df$zscore)
[1] "matrix" "array"
The column of the z-scores should have been named 'zscore', but why it is now named '...8'? I never had any problems with the codes before. Is it because of the update?
I think you just added another column without a header or read in data with a column without a header. There is no issue with your classes.
library(tidyverse)
test <- mtcars|>
group_by(cyl) |>
mutate(zscore=scale(mpg))
#class of test
class(test)
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
#class of column
class(test$zscore)
#> [1] "matrix" "array"
#recreate warning
test <- test |>
bind_cols("")
#> New names:
#> * `` -> `...13`
The warning at the bottom means that I added a column without a name in the 13th position.
Part of the issue is that scale() returns a matrix. You can fix this by wrapping in as.double():
library(dplyr)
starwars2 <- starwars %>%
select(height, gender) %>%
group_by(gender) %>%
mutate(zscore = as.double(scale(height)))
Output:
# A tibble: 87 × 3
# Groups: gender [3]
height gender zscore
<int> <chr> <dbl>
1 172 masculine -0.120
2 167 masculine -0.253
3 96 masculine -2.14
4 202 masculine 0.677
5 150 feminine -0.624
6 178 masculine 0.0394
7 165 feminine 0.0133
8 97 masculine -2.11
9 183 masculine 0.172
10 182 masculine 0.146
# … with 77 more rows
But I’m not sure this explains your NA -> ...8 issue. If not, please update your question to include your data (using dput(df)) or a subset (using dput(head(df))).
Related
I borrow a dataset from SPSS prepared by Julie Pallant's SPSS Survival Manual and run it on R.
I select three columns to run correlation and significance test: toptim, tnegaff, sex. I select the columns using select: df <- survey %>% select(toptim, tnegaff, sex).
Then, problems emerge.
I'd like to know the correlation between toptim and tnegaff by sex. But I can't use cor and resort to correlate. Why is there error and any difference between the two methods?
df %>% group_by(sex) %>% summarise(cor = correlate(toptim, tnegaff)) <- OK (male = 0.22 female = 0.394)
df %>% group_by(sex) %>% summarise(cor = cor(toptim, tnegaff)) <- failed, returns with NA
I failed to obtain the test of significance with cor.test (The answer should be p = 0.0488)
Error in `summarise()`:
! Problem while computing `cor = cor.test(toptim, tnegaff)`.
✖ `cor` must be a vector, not a `htest` object.
ℹ The error occurred in group 1: sex = 1.
Then I try to follow past examples and use broom::tidy, but no output for p-values....
> df %>% group_by(sex) %>% broom::tidy(cor.test(toptim, tnegaff))
# A tibble: 3 × 13
column n mean sd median trimmed mad min max range skew kurtosis se
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 toptim 435 22.1 4.43 22 22.3 3 7 30 23 NA NA 0.212
2 tnegaff 435 19.4 7.07 18 18.6 4 10 39 29 NA NA 0.339
3 sex 439 1.58 0.494 2 1.58 0 1 2 1 -0.318 1.10 0.0236
How can I get the result? May I know the reason for such failure?
Thank you for your answers in advance.
It's trying to use all values and coming across NAs I presume. If you set to use "complete.obs" then it should work. For the cor.test part wrap the output in a list function to use the tibble's capabilities to have a column of a vector of objects.
For the final tidying and getting p-values, use map(cor.test, broom::tidy) then tidyr::unnest() to get a full and tidy dataframe.
That's a few steps to go through but hope it helps!
df <- haven::read_sav("survey.sav")
library(tidyverse)
df %>%
group_by(sex) %>%
summarise(cor = cor(toptim, tnegaff, use = "complete.obs"),
cor.test = list(cor.test(toptim, tnegaff))) %>%
mutate(tidy_out = map(cor.test, broom::tidy)) %>%
unnest(tidy_out)
#> # A tibble: 2 × 11
#> sex cor cor.t…¹ estim…² stati…³ p.value param…⁴ conf.…⁵ conf.…⁶ method
#> <dbl+l> <dbl> <list> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr>
#> 1 1 [MAL… -0.220 <htest> -0.220 -3.04 2.73e- 3 182 -0.353 -0.0775 Pears…
#> 2 2 [FEM… -0.394 <htest> -0.394 -6.75 1.06e-10 248 -0.494 -0.284 Pears…
#> # … with 1 more variable: alternative <chr>, and abbreviated variable names
#> # ¹cor.test, ²estimate, ³statistic, ⁴parameter, ⁵conf.low, ⁶conf.high
Edit - examining difference in correlation
Borrowing the function from here you can examine the difference in correlation coefficients between sexes like this:
cor.diff.test(df$toptim[df$sex == 1], df$tnegaff[df$sex == 1], df$toptim[df$sex == 2], df$tnegaff[df$sex == 2])
When I run the below data it shows an incorrect roc_curve.
Prep
The below code should be run-able for anyone using r-studio. The dataframe contains characteristics of different employees regarding: performance ratings, sales figures, and whether
or not they were promoted.
I am attempting to create a decision tree model that uses all other variables to predict if an employee was promoted. The primary purpose of this question is to find out what I am doing incorrectly when tring to use the roc_curve() function.
library(tidyverse)
library(tidymodels)
library(peopleanalyticsdata)
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read.csv(url)
salespeople <- salespeople %>% mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))
creating testing/training data
Using my own homemade train_test() function just for kicks!
train_test <- function(data, train.size=0.7, na.rm=FALSE) {
if(na.rm == TRUE) {
dt <- sample(x=nrow(data), size=nrow(data)* train.size)
data_nm <- na.omit(data)
train<-data_nm[dt,]
test<- data_nm[-dt,]
set <- list(train, test)
names(set) <- c("train", "test")
return(set)
} else {
dt <- sample(x=nrow(data), size=nrow(data)* train.size)
train<-data[dt,]
test<- data[-dt,]
set <- list(train, test)
names(set) <- c("train", "test")
return(set)
}
}
tt_list <- train_test(salespeople)
sales_train <- tt_list$train
sales_test <- tt_list$test
'''
creating decision tree model structure/final model/prediction dataframe
'''
tree <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
model <- tree %>% fit(promoted ~ ., data = sales_train)
predictions <- predict(model,
sales_test,
type = "prob") %>%
bind_cols(sales_test)
'''
Calculate & Plot the ROC curve
When I use the .pred_yes column as the estimate column, it calculates an ROC curve that is the inverse of what I want. It seems that it has identified .pred_no as the "real" estimate column
'''
roc <- roc_curve(predictions,
estimate = .pred_yes,
truth = promoted)
autoplot(roc)
'''
Thoughts
Seems like the issue goes away when I supply pred_no as the estimate column to roc_curve()
FYI: this is my first stack overflow post, if you have any suggestions to make this post more clear/better formatted please let me know!
In factor(c("yes", "no")), "no" is the first level, the level that most modeling packages assume is the one of interest. In tidymodels, you can adjust the level of interest via the event_level argument, as documented here:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read_csv(url) %>%
mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))
#> Rows: 351 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (4): promoted, sales, customer_rate, performance
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sales_split <- initial_split(salespeople)
sales_train <- training(sales_split)
sales_test <- testing(sales_split)
tree <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
tree_fit <- tree %>% fit(promoted ~ ., data = sales_train)
sales_preds <- augment(tree_fit, sales_test)
sales_preds
#> # A tibble: 88 × 7
#> promoted sales customer_rate performance .pred_class .pred_no .pred_yes
#> <fct> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
#> 1 no 364 4.89 1 no 0.973 0.0267
#> 2 no 342 3.74 3 no 0.973 0.0267
#> 3 yes 716 3.16 3 yes 0 1
#> 4 no 450 3.21 3 no 0.973 0.0267
#> 5 no 372 3.87 3 no 0.973 0.0267
#> 6 no 535 4.47 2 no 0.973 0.0267
#> 7 yes 736 3.94 4 yes 0 1
#> 8 no 330 2.54 2 no 0.973 0.0267
#> 9 no 478 3.48 2 no 0.973 0.0267
#> 10 yes 728 2.66 3 yes 0 1
#> # … with 78 more rows
sales_preds %>%
roc_curve(promoted, .pred_yes, event_level = "second") %>%
autoplot()
Created on 2021-09-08 by the reprex package (v2.0.1)
Im learning the basics of R and im going through an example where the user loads a .csv file containing the weights of mice fed a Normal Control or High Fat diet.
He proceeds to make two vectors (is this true? once extracted and unlisted?)
Im confused as to what purpose the unlist function serves here. Iv seen the unlist function used before graphing as well and am confused as to what difference it makes?
dplyr functions, such as filter() and select(), return tibbles (a variant on data.frames). Data frames and tibbles are a special type of list, where each element is a vector of the same length, but not necessarily the same type.
In the example given, each statement is selecting a single column, returned as a 1-column tibble. A 1-column tibble is a list with one element, in this case the vector of Bodyweights. However, many functions do not expect a 1-column tibble (or data.frame), but want a vector. By using unlist(), we are squashing the structure down to a single vector. This would be true whether you selected a single column or multiple columns.
The idiomatic way in dplyr would be to pipe pull(Bodyweight), as opposed to using unlist().
Consider this simple example for the difference
tib <- tibble(a = 1:5, b = letters[1:5])
select(tib, a)
class(select(tib, a))
# Notice the different printing and class when we unlist
unlist(select(tib, a))
class(unlist(select(tib, a))
Well that just depends on what you want to achieve. Before the unlist() you'll end up with data.frame (or more specific a tibble in this example because of the dplyr functionality applied to the data). When unlisting the single column tibble you'll end up with an atomic numeric (named) vector, which behaves totally different in some situations (the final rbind below is an example).
library(tidyverse)
mice <- structure(list(Diet=c("chow","chow","chow","chow","chow",
"chow","chow","chow","chow","chow","chow","chow","hf",
"hf","hf","hf","hf","hf","hf","hf","hf","hf","hf","hf"
),Bodyweight=c(21.51,28.14,24.04,23.45,23.68,19.79,28.4,
20.98,22.51,20.1,26.91,26.25,25.71,26.37,22.8,25.34,
24.97,28.14,29.58,30.92,34.02,21.9,31.53,20.73)),class=c("spec_tbl_df",
"tbl_df","tbl","data.frame"),row.names=c(NA,-24L),spec=structure(list(
cols=list(Diet=structure(list(),class=c("collector_character",
"collector")),Bodyweight=structure(list(),class=c("collector_double",
"collector"))),default=structure(list(),class=c("collector_guess",
"collector")),skip=1),class="col_spec"))
bodyweight <- mice %>% filter(Diet == "chow") %>% select(Bodyweight)
class(bodyweight)
#> [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
bodyweight
#> # A tibble: 12 x 1
#> Bodyweight
#> <dbl>
#> 1 21.5
#> 2 28.1
#> 3 24.0
#> 4 23.4
#> 5 23.7
#> 6 19.8
#> 7 28.4
#> 8 21.0
#> 9 22.5
#> 10 20.1
#> 11 26.9
#> 12 26.2
bodyweight_unl <- mice %>% filter(Diet == "chow") %>% select(Bodyweight) %>% unlist
class(bodyweight_unl)
#> [1] "numeric"
bodyweight_unl
#> Bodyweight1 Bodyweight2 Bodyweight3 Bodyweight4 Bodyweight5 Bodyweight6
#> 21.51 28.14 24.04 23.45 23.68 19.79
#> Bodyweight7 Bodyweight8 Bodyweight9 Bodyweight10 Bodyweight11 Bodyweight12
#> 28.40 20.98 22.51 20.10 26.91 26.25
rbind(bodyweight, 1:12)
rbind(bodyweight_unl, 1:12)
Created on 2020-07-12 by the reprex package (v0.3.0)
The purpose of unlist is to to flatten a list of vectors into a single vector. This is from R for Data Science. It certainly is worth of reading.
See further explanations in the comments below.
library(tidyverse)
head(data)
#> Diet Bodyweight
#> 1 chow 21.51
#> 2 chow 28.14
#> 3 chow 24.04
#> 4 chow 23.45
#> 5 chow 23.68
#> 6 chow 19.79
# without unlist you get a data.frame
dplyr::filter(data, Diet == 'chow') %>% select(Bodyweight) %>% class()
#> [1] "data.frame"
# by unlisting you get a named vector with the names taken from the selected data
dplyr::filter(data, Diet == 'chow') %>% select(Bodyweight) %>% unlist()
#> Bodyweight1 Bodyweight2 Bodyweight3 Bodyweight4 Bodyweight5 Bodyweight6
#> 21.51 28.14 24.04 23.45 23.68 19.79
#> Bodyweight7 Bodyweight8 Bodyweight9 Bodyweight10 Bodyweight11 Bodyweight12
#> 28.40 20.98 22.51 20.10 26.91 26.25
# If you set use.names=F you get a vector with the data you selected
dplyr::filter(data, Diet == 'chow') %>% select(Bodyweight) %>% unlist(use.names = F)
#> [1] 21.51 28.14 24.04 23.45 23.68 19.79 28.40 20.98 22.51 20.10 26.91 26.25
The following code
library(dplyr)
library(janeaustenr)
library(tidytext)
book_words <- austen_books() %>%
unnest_tokens(word, text) %>%
count(book, word, sort = TRUE)
book_words <- book_words %>%
bind_tf_idf(word, book, n)
book_words
taken from Term Frequency and Inverse Document Frequency (tf-idf) Using Tidy Data Principles, estimates the tf-idf in Jane Austen's works. Anyway, this code appears to be specific to Jane Austen's books. I would like to derive, istead, the tf-idf for the following data frame:
sentences<-c("The color blue neutralizes orange yellow reflections.",
"Zod stabbed me with blue Kryptonite.",
"Because blue is your favourite colour.",
"Red is wrong, blue is right.",
"You and I are going to yellowstone.",
"Van Gogh looked for some yellow at sunset.",
"You ruined my beautiful green dress.",
"You do not agree.",
"There's nothing wrong with green.")
df=data.frame(text = sentences,
class = c("A","B","A","C","A","B","A","C","D"),
weight = c(1,1,3,4,1,2,3,4,5))
There are two things you needed to change:
since you did not set stringsAsFactors = FALSE when constructing the data.frame, you need to convert text to character first.
You do not have a column named book, which means you have to select some other column as document. Since you put a column named class into your example, I assume you want to calculate the tf-idf over this column.
Here is the code:
library(dplyr)
library(janeaustenr)
library(tidytext)
book_words <- df %>%
mutate(text = as.character(text)) %>%
unnest_tokens(output = word, input = text) %>%
count(class, word, sort = TRUE)
book_words <- book_words %>%
bind_tf_idf(term = word, document = class, n)
book_words
#> # A tibble: 52 x 6
#> class word n tf idf tf_idf
#> <fct> <chr> <int> <dbl> <dbl> <dbl>
#> 1 A blue 2 0.0769 0.288 0.0221
#> 2 A you 2 0.0769 0.693 0.0533
#> 3 C is 2 0.2 0.693 0.139
#> 4 A and 1 0.0385 1.39 0.0533
#> 5 A are 1 0.0385 1.39 0.0533
#> 6 A beautiful 1 0.0385 1.39 0.0533
#> 7 A because 1 0.0385 1.39 0.0533
#> 8 A color 1 0.0385 1.39 0.0533
#> 9 A colour 1 0.0385 1.39 0.0533
#> 10 A dress 1 0.0385 1.39 0.0533
#> # ... with 42 more rows
The documentation has helpful remarks for this check out ?count and ?bind_tf_idf.
OK this is going to be a long post.
So i am fairly new with R (i am currently using the MR free 3.5, with no checkpoint) but i am trying to work with the tidyverse, which i find very elegant in writing code and a lot of times a lot more simple.
I decided to replicate an exercise from guru99 here. It is a simple k-means exercise. However because i always want to write "generalizeble" code i was trying to automatically rename the variables in mutate with new names. So i searched SO and found this solution here which is very nice.
First what works fine.
#library(tidyverse)
link <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/computers.csv"
df <- read.csv(link)
rescaled <- df %>% discard(is.factor) %>%
select(-X) %>%
mutate_all(
funs("scaled" = scale)
)
When you download the data with read.csv you get the df in dataframe class and everything works.
And now the weird thinks start. If you download the data with read_csv or make it a tibble at any point after (the first X variable will be named X1 and you need to change the is.factor to is.character because stings are converted to character not factors unless explicitly asked for, for future me and others.)
and then run the code
df1 <- read_csv(link)
df1 %>% discard(is.character) %>%
select(-X1) %>%
mutate_all(
funs("scaled" = scale)
)
the new named variables are named price_scaled[,1] speed_scaled[,1] hd_scaled[,1] ram_scaled[,1] etc. when you view the output in the console or you even if you print().
BUT if you view() on it you see the output with the names you expect which are price_scaled speed_scaled hd_scaled etc. ALSO I am using an Rmarkdown document for the code and when i change the chunk output to inline it diplays the names correctly with hd_scaled etc.
Any one has any idea how to get the names printed in the console like price_scaled etc.
Why this is happening?
Though that this would be interesting to ask.
scale() returns a matrix, and dplyr/tibble isn't automatically coercing it to a vector. By changing your mutate_all() call to the below, we can have it return a vector. I identified this is what was happening by calling class(df1$speed_scaled) and seeing the result of "matrix".
library(tidyverse)
link <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/computers.csv"
df <- read_csv(link)
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#> X1 = col_double(),
#> price = col_double(),
#> speed = col_double(),
#> hd = col_double(),
#> ram = col_double(),
#> screen = col_double(),
#> cd = col_character(),
#> multi = col_character(),
#> premium = col_character(),
#> ads = col_double(),
#> trend = col_double()
#> )
df %>% discard(is.character) %>%
select(-X1) %>%
mutate_all(
list("scaled" = function(x) scale(x)[[1]])
)
#> # A tibble: 6,259 x 14
#> price speed hd ram screen ads trend price_scaled speed_scaled
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1499 25 80 4 14 94 1 -1.24 -1.28
#> 2 1795 33 85 2 14 94 1 -1.24 -1.28
#> 3 1595 25 170 4 15 94 1 -1.24 -1.28
#> 4 1849 25 170 8 14 94 1 -1.24 -1.28
#> 5 3295 33 340 16 14 94 1 -1.24 -1.28
#> 6 3695 66 340 16 14 94 1 -1.24 -1.28
#> 7 1720 25 170 4 14 94 1 -1.24 -1.28
#> 8 1995 50 85 2 14 94 1 -1.24 -1.28
#> 9 2225 50 210 8 14 94 1 -1.24 -1.28
#> 10 2575 50 210 4 15 94 1 -1.24 -1.28
#> # ... with 6,249 more rows, and 5 more variables: hd_scaled <dbl>,
#> # ram_scaled <dbl>, screen_scaled <dbl>, ads_scaled <dbl>,
#> # trend_scaled <dbl>