Subsetting, Matrices - r

I am super new to R and currently playing with the "diamond" dataset.
I am trying to return the row corresponding to the lowest, mean and largest prices and put everything in a 10 by 4 matrix. Please explain an easier way of doing this if possible.
library(ggplot2)
data(diamonds)
min(diamonds$price)
mean(diamonds$price)
max(diamonds$price) # this one gives me the wrong val!
M<-matrix(1:cols, nrow = 1, ncol = cols)
colnames(M)<-c("carat","cut" , "color" , "clarity", "depth" , "table" , "price" , "x" , "y" ,"z")
# Here I need to add the rows corresponding to the min,mean,max to this matrix.
Thanks

If all you want to do is to select the rows in the diamonds data frame corresponding to the mean, minimum, and maximum of price, this is easily accomplished with a combination of the $ and [ forms of the extract operator in Base R.
Note that this will return a data frame with 3 rows, not 4, because there are two rows at the minimum price, no rows at the mean price, and one row at the maximum price.
library(ggplot2)
data(diamonds)
diamonds[diamonds$price %in% c(min(diamonds$price),mean(diamonds$price),max(diamonds$price)),]
...and the output:
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
A solution with dplyr uses filter() as follows.
# dplyr solution
library(dplyr)
diamonds %>% filter(price %in% c(min(price),mean(price),max(price)))
...and the output:
# A tibble: 3 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
>

matrix and dataframes are different in R. diamonds is a dataframe it is better and easy to process if you keep it as dataframe only.
summary(diamonds) gives you some nice summary stats for each column.
If you want to apply specific functions to columns using dplyr, you can do :
library(dplyr)
diamonds %>%
summarise(across(where(is.numeric),list(min = min, max = max, mean = mean))) %>%
tidyr::pivot_longer(cols = everything(),
names_to = c('col', '.value'),
names_sep = '_')
# A tibble: 7 x 4
# col min max mean
# <chr> <dbl> <dbl> <dbl>
#1 carat 0.2 5.01 0.798
#2 depth 43 79 61.7
#3 table 43 95 57.5
#4 price 326 18823 3933.
#5 x 0 10.7 5.73
#6 y 0 58.9 5.73
#7 z 0 31.8 3.54
Note that I applied these functions only to numeric columns since cut, color, clarity are factor columns.

Related

How do I create dummy variables for specific rows/observations in a dataframe?

I am doing question C12(iii) in chapter 9 of Wooldridge's Introductory Econometrics: A Modern Approach. The question asks the reader to first identify all observations for which the variable 'bs' is greater than 0.5. The question then asks the reader to assign a dummy variable to each of these observations for use in a regression.
I performed the first part of the question (identifying all observations for which 'bs' is greater than 0.5 using the following code:
library('wooldridge')
which(elem94_95$bs>0.5)
[1] 68 1127 1508 1670
After looking at the table this produces in rStudio, I find that the relevant rows/observations are 68; 1,127; 1,508; and 1,670.
I would like to create a dummy variable for each of these rows/observations, i.e., 'd68'; 'd1127'; 'd1508'; and 'd1670'. How do I do this? My intuitive first attempt solution was the following:
elem94_95$d68<-ifelse(row==68,1,0)
However, this does not work.
I've come up with the following solution:
elem94_95$rownumber<-1:nrow(elem94_95)
elem94_95$d68<-ifelse(elem94_95$rownumber==68,1,0)
elem94_95$d1127<-ifelse(elem94_95$rownumber==1127,1,0)
elem94_95$d1508<-ifelse(elem94_95$rownumber==1508,1,0)
elem94_95$d1670<-ifelse(elem94_95$rownumber==1670,1,0)
However, it feels inelegant. If anyone else has a way to directly include row numbers in a formula I would welcome that solution instead.
library(tidyverse)
df <- elem94_95 %>%
as_tibble() %>%
mutate(row = row_number(),
dummy = if_else(
bs > 0.5, str_c("d", row), NA_character_
))
df %>%
filter(!is.na(dummy))
# A tibble: 4 × 16
distid schid lunch enrol staff exppp avgsal avgben math4 story4 bs lavgsal lenrol lstaff row dummy
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <chr>
1 9030 192 40.7 167 85.6 3584 24425 16108 67.9 71.4 0.659 10.1 5.12 4.45 68 d68
2 63160 5783 3.60 411 115. 5394 30304 17418 83.9 92.9 0.575 10.3 6.02 4.75 1127 d1127
3 82010 701 69.4 896 78.3 1353 9297 9295 41.2 48.5 1.00 9.14 6.80 4.36 1508 d1508
4 82040 5357 32.9 304 49.9 3532 50042 25134 57.6 55.9 0.502 10.8 5.72 3.91 1670 d1670

Loop to plot boxplot with ggplot

I am using diamonds df,
I would like to plot a boxplot for each numerical column by category,
In this case category would be defined by "cut" column.
I am using a for-loop to accomplish this task,
Here's the code I am using:
##################################################################################
# Data #
# #
##################################################################################
data("diamonds")
basePlot <- diamonds[ names(diamonds)[!names(diamonds) %in% c("color", "clarity")] ]
##################################################################################
## set Plot view to 4 boxplots ##
par(mfrow = c(2,2))
## for-loop to boxplot all numerical columns ##
for (i in 1:(ncol(basePlot)-1)){
print(ggplot(basePlot, aes(as.factor(cut),
basePlot[c(i)],color=as.factor(cut)))
+ geom_boxplot(outlier.colour="black",outlier.shape=16,outlier.size=1,notch=FALSE)
+ xlab("Diamond Cut")
+ ylab(colnames(basePlot)[i])
)
}
Console output:
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
Error in is.finite(x) : default method not implemented for type 'list'
Is there any other way to accomplish this task?
Instead of multiple plots, I suggest facets. To do this, though, we need to convert the data from "wide" format to "longer" format, and the canonical way in the tidyverse is with tidyr::pivot_longer.
> basePlot
# A tibble: 53,940 x 8
carat cut depth table price x y z
<dbl> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium 59.8 61 326 3.89 3.84 2.31
3 0.23 Good 56.9 65 327 4.05 4.07 2.31
4 0.290 Premium 62.4 58 334 4.2 4.23 2.63
5 0.31 Good 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good 59.4 61 338 4 4.05 2.39
# ... with 53,930 more rows
> pivot_longer(basePlot, -cut, names_to="var", values_to="val")
# A tibble: 377,580 x 3
cut var val
<ord> <chr> <dbl>
1 Ideal carat 0.23
2 Ideal depth 61.5
3 Ideal table 55
4 Ideal price 326
5 Ideal x 3.95
6 Ideal y 3.98
7 Ideal z 2.43
8 Premium carat 0.21
9 Premium depth 59.8
10 Premium table 61
# ... with 377,570 more rows
With this, we only have to tell ggplot2 to worry about val for the values, and var for the x-axis.
library(ggplot2)
library(tidyr) # pivot_longer
ggplot(pivot_longer(basePlot, -cut, names_to="var", values_to="val"),
aes(cut, val, color=cut)) +
geom_boxplot(outlier.colour="black", outlier.shape=16, outlier.size=1, notch=FALSE) +
xlab("Diamond Cut") +
facet_wrap(~var, nrow=2, scales="free") +
scale_x_discrete(guide=guide_axis(n.dodge=2))
The reason you have cut both in the x-axis and in the legend is because color= will add the legend. Since it's redundant, we could either remove the color aesthetic (which would also remove the legend) or we could just suppress the legend (by adding + scale_color_discrete(guide=FALSE)).
There are two ways of faceting: facet_wrap and facet_grid. The latter is well tuned for multiple variables (one facet variable on the x, one on the y) and many other configurations. Granted, you can use facet_grid with just one variable (which is similar to facet_wrap(nrow=1) or ncol=1), but there are some styling distinctions between them.

Translating filter_all(any_vars()) to filter(across())

On updating my own answer to another thread, I wasn't able to come up with a good solution to replace the last example (see below). The idea is to get all rows where any column contains a certain string, in my example "V".
library(tidyverse)
#get all rows where any column contains 'V'
diamonds %>%
filter_all(any_vars(grepl('V',.))) %>%
head
#> # A tibble: 6 x 10
#> carat cut color clarity depth table price x y z
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 2 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#> 3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#> 4 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
#> 5 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
#> 6 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
# this does naturally not give the desired output!
diamonds %>%
filter(across(everything(), ~ grepl('V', .))) %>%
head
#> # A tibble: 0 x 10
I found a thread where the poster ponders over similar stuff, but applying a similar logic on grepl does not work.
### don't run, this is ugly and does not work
diamonds %>%
rowwise %>%
filter(any(grepl("V", across(everything())))) %>%
head
This is very difficult, because the example shows that you want to filter data from all columns when any of them meets the condition (i.e. you want a union). That's done with filter_all() and any_vars().
While filter(across(everything(), ...)) filters out from all columns when all of them meet the condition (i.e. this is a intersection, quite opposite of the previous).
To convert it from intersection to the union (i.e. to get again rows where any of the columns meet the condition), you probably need to check the row sum for that:
diamonds %>%
filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)
It will sum all the TRUEs that appear in the row, i.e. if there is at least one value meeting the condition, that row sum will be > 0 and will be shown.
I'm sorry for across() is not the very first child of filter(), but it's at least some idea how to do that. :-)
Evaluation:
Using #TimTeaFan's method to check that:
identical(
{diamonds %>%
filter_all(any_vars(grepl('V',.)))
},
{diamonds %>%
filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)
}
)
#> [1] TRUE
Benchmark:
As per our discussion under TimTeaFan's answer, here is a comparison, surprisingly, all solutions have a similar time:
library(tidyverse)
microbenchmark::microbenchmark(
filter_all = {diamonds %>%
filter_all(any_vars(grepl('V',.)))},
purrr_reduce = {diamonds %>%
filter(across(everything(), ~ grepl('V', .)) %>% purrr::reduce(`|`))},
base_reduce = {diamonds %>%
filter(across(everything(), ~ grepl('V', .)) %>% Reduce(`|`, .))},
rowsums = {diamonds %>%
filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)},
times = 100L,
check = "identical"
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> filter_all 295.7235 302.1311 309.6455 305.0491 310.0335 449.3619 100
#> purrr_reduce 297.8220 302.4411 310.2829 306.2929 312.2278 461.0194 100
#> base_reduce 298.5033 303.6170 309.4147 306.1839 312.3518 409.5273 100
#> rowsums 295.3863 301.0281 307.8517 305.3142 309.4793 372.8867 100
Created on 2020-07-14 by the reprex package (v0.3.0)
This is the equivalent to the filter_all call you posted. However, #akrun is totally correct to point out, that it should be converted to character first. Nevertheless, this also holds true for your filter_all statement.
The idea is to use across(everything(), ~ grepl('V', .)) to get the whole data.frame transformed into columns of TRUE and FALSE regarding grepl('V', .). However, filter needs a vector, or a data.frame with one column so we transform it by using reduce(|). It combines the first two columns with | then the result of this call with the third column and so on, until the original data.frame has one column with TRUE and FALSE which can then be used to filter the rows.
library(ggplot2)
library(dplyr)
diamonds %>%
filter(across(everything(), ~ grepl('V', .)) %>% purrr::reduce(`|`)) %>%
head
#> # A tibble: 6 x 10
#> carat cut color clarity depth table price x y z
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 2 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#> 3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#> 4 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
#> 5 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
#> 6 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
identical({diamonds %>%
filter_all(any_vars(grepl('V',.)))},
{diamonds %>%
filter(across(everything(), ~ grepl('V', .)) %>% purrr::reduce(`|`))
})
#> [1] TRUE
Created on 2020-07-14 by the reprex package (v0.3.0)
Some of the columns were ordered and it will affect with c_across. Instead, if we convert to character class and then do the grepl it should work
library(dplyr)
library(ggplot2)
diamonds %>%
head %>%
mutate(across(where(is.factor), as.character)) %>%
rowwise %>%
filter(any(grepl("V", c_across(where(is.character)))))
# A tibble: 3 x 10
# Rowwise:
# carat cut color clarity depth table price x y z
# <dbl> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#2 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48

Line plot with error bars in which each line is a different group and multiple variables are in the x axis

I'm trying to create a line plot with error bars in R/Rstudio, in which each line is a different group (coded by one variable) and different continuous variables compose the x axis.
Taking the dataset diamonds as examples, it would be a multiple line graph, in which each line is one category of the variable "color and x,y,z are variables in whose levels are in the y axis, but they are positioned in the x axis.
the head of diamonds in R is:
(as coded in R studio :
>head(diamonds)
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
an example of a similar graph would be the one attached in the pic, but I need one with error bars (and this was made in stata, which just can't add error bars to this command which is: profileplot varx vary varz, by(groups)
profile plot without errorbars as an example is here::
Before we start, we will plot x,y,z columns from diamonds,and because x and y and very close, i subtract 1 from y so we can see it, and also introduce some error for error bars
library(tidyr)
library(ggplot2)
library(dplyr)
mydata <- diamonds %>% select(color,x,y,z) %>% pivot_longer(-color)
# A tibble: 6 x 3
color name value
<ord> <chr> <dbl>
1 E x 1.80
2 E y 3.98
3 E z 2.43
4 E x 2.92
5 E y 3.84
6 E z 2.31
Then:
ggplot(mydata,aes(x=name,y=value,color=color)) +
stat_summary(fun.y=mean,geom="point") +
stat_summary(fun.y=mean,aes(group=color),geom="line") +
stat_summary(fun.data=mean_se,geom="errorbar",width=0.1)
In this case the errorbars etc don't make sense because the x, y and z values are pretty much similar.

How to use Sparklyr to summarize Categorical Variable Level

For each categorical variable in dataset, I want to get counts and summary stats for each level. I can do this using dlookr R package using their diagnose_category() function. Since at work I don't have that package I recreated the function using dplyr.
In sparklye I am able to get counts for a single variable at a time. Need help to extend it all categorical variable.
Need Help:
Implement the function via SparklyR
Table 1: Final output needed:
# A tibble: 20 x 6
variables levels N freq ratio rank
<chr> <ord> <int> <int> <dbl> <int>
1 cut Ideal 53940 21551 40.0 1
2 cut Premium 53940 13791 25.6 2
3 cut Very Good 53940 12082 22.4 3
4 cut Good 53940 4906 9.10 4
5 cut Fair 53940 1610 2.98 5
6 color G 53940 11292 20.9 1
7 color E 53940 9797 18.2 2
8 color F 53940 9542 17.7 3
9 color H 53940 8304 15.4 4
10 color D 53940 6775 12.6 5
11 color I 53940 5422 10.1 6
12 color J 53940 2808 5.21 7
13 clarity SI1 53940 13065 24.2 1
14 clarity VS2 53940 12258 22.7 2
15 clarity SI2 53940 9194 17.0 3
16 clarity VS1 53940 8171 15.1 4
17 clarity VVS2 53940 5066 9.39 5
18 clarity VVS1 53940 3655 6.78 6
19 clarity IF 53940 1790 3.32 7
20 clarity I1 53940 741 1.37 8
R Code:
# Categorical Variable Profile
# Table based on dlookr package, diagnose_category() function
# variables : variable names
# types: the data type of the variable
# levels: level names
# N : Number of observation
# freq : Number of observation at the level
# ratio : Percentage of observation at the level
# rank : Rank of occupancy ratio of levels
library(ggplot2)
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
library(stringr)
# Helper Function
cat_level_summary <- function(df,x) {
count(df,x, sort = TRUE) %>%
transmute(levels = x, N = sum(n), freq = n,
ratio = n / sum(n) * 100, rank = row_number())
}
# Loading
diamonds_tbl <- diamonds
# Main Code
CategoricalVariableProfile <- diamonds_tbl %>%
select_if(~!is.numeric(.)) %>%
map(~cat_level_summary(data.frame(x=.x), x)) %>%
do.call(rbind.data.frame, .) %>%
rownames_to_column(., "variables")%>%
mutate(variables = str_match(variables, ".*(?=\\.)")[, 1] )
Spark Code:
#Spark data Table
diamonds_tbl <- copy_to(sc, diamonds, "diamonds", overwrite = TRUE)
CategoricalVariableProfile <- diamonds_tbl %>%
group_by(cut) %>%
summarize(count = n()) %>%
sdf_register("CategoricalVariableProfile")
Flatten your data using sdf_gather:
long <- diamonds_tbl %>%
select(cut, color, clarity) %>%
sdf_gather("variable", "level", "cut", "color", "clarity")
Aggregate by variable and level:
counts <- long %>% group_by(variable, level) %>% summarise(freq = n())
And finally apply required window functions:
result <- counts %>%
arrange(-freq) %>%
mutate(
rank = rank(),
total = sum(freq, na.rm = TRUE),
ratio = freq / total * 100)
Which will give you
result
# Source: spark<?> [?? x 6]
# Groups: variable
# Ordered by: -freq
variable level freq rank total ratio
<chr> <chr> <dbl> <int> <dbl> <dbl>
1 cut Ideal 21551 1 53940 40.0
2 cut Premium 13791 2 53940 25.6
3 cut Very Good 12082 3 53940 22.4
4 cut Good 4906 4 53940 9.10
5 cut Fair 1610 5 53940 2.98
6 clarity SI1 13065 1 53940 24.2
7 clarity VS2 12258 2 53940 22.7
8 clarity SI2 9194 3 53940 17.0
9 clarity VS1 8171 4 53940 15.1
10 clarity VVS2 5066 5 53940 9.39
# … with more rows
with following optimized plan
optimizedPlan(result)
<jobj[165]>
org.apache.spark.sql.catalyst.plans.logical.Project
Project [variable#524, level#525, freq#1478L, rank#1479, total#1480L, ((cast(freq#1478L as double) / cast(total#1480L as double)) * 100.0) AS ratio#1481]
+- Window [rank(_w1#1493L) windowspecdefinition(variable#524, _w1#1493L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank#1479], [variable#524], [_w1#1493L ASC NULLS FIRST]
+- Window [sum(freq#1478L) windowspecdefinition(variable#524, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS total#1480L], [variable#524]
+- Project [variable#524, level#525, freq#1478L, -freq#1478L AS _w1#1493L]
+- Sort [-freq#1478L ASC NULLS FIRST], true
+- Aggregate [variable#524, level#525], [variable#524, level#525, count(1) AS freq#1478L]
+- Generate explode(map(cut, cut#19, color, color#20, clarity, clarity#21)), [0, 1, 2], false, [variable#524, level#525]
+- Project [cut#19, color#20, clarity#21]
+- InMemoryRelation [carat#18, cut#19, color#20, clarity#21, depth#22, table#23, price#24, x#25, y#26, z#27], StorageLevel(disk, memory, deserialized, 1 replicas)
+- Scan ExistingRDD[carat#18,cut#19,color#20,clarity#21,depth#22,table#23,price#24,x#25,y#26,z#27]
and query (sdf_gather component not included):
dbplyr::remote_query(result)
<SQL> SELECT `variable`, `level`, `freq`, `rank`, `total`, `freq` / `total` * 100.0 AS `ratio`
FROM (SELECT `variable`, `level`, `freq`, rank() OVER (PARTITION BY `variable` ORDER BY -`freq`) AS `rank`, sum(`freq`) OVER (PARTITION BY `variable`) AS `total`
FROM (SELECT *
FROM (SELECT `variable`, `level`, count(*) AS `freq`
FROM `sparklyr_tmp_ded2576b9f1`
GROUP BY `variable`, `level`) `dsbksdfhtf`
ORDER BY -`freq`) `obyrzsxeus`) `ekejqyjrfz`

Resources