Return value of 3rd quartile - r

Is there a way to return the value of the 3rd Qu. that comes up when you do the summary of a vector?
For example:
summary(data$attribute)
Returns:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.0002012 0.0218800 0.0454300 0.0707100 0.0961500 0.4845000

You can also use quantile and specify the probability to be 0.75:
quantile(1:10, probs = 0.75)
# 75%
#7.75
If you want to remove the name attribute:
quantile(1:10, probs = 0.75, names = FALSE)
#7.75
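One difference from summary() is worth keeping in mind: quantile() stops with an error if the vector contains missing values, unless na.rm = TRUE is set. A minimal sketch, using a made-up vector x with one NA (the values are only for illustration):

```r
# Illustrative data with one missing value (made up for this sketch)
x <- c(0.02, 0.05, NA, 0.10, 0.48)

# Without na.rm = TRUE this call would stop with an error because of the NA
q3 <- quantile(x, probs = 0.75, na.rm = TRUE, names = FALSE)

# summary() drops the NA on its own, so both report the same third quartile
s3 <- unname(summary(x)["3rd Qu."])
all.equal(q3, as.numeric(s3))
```

Both calls use the default quantile type, so they should agree as long as the defaults are left alone.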

You can access elements of the summary by index:
summary(1:10)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    1.00    3.25    5.50    5.50    7.75   10.00
summary(1:10)[5]
# 3rd Qu.
# 7.75
Or by name:
summary(1:10)["3rd Qu."]
# 3rd Qu.
# 7.75
We can use unname() to drop names:
unname(summary(1:10)[5])
# [1] 7.75
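If the goal is a bare number, double brackets do the lookup and drop the name in one step. A small sketch along the lines of the examples above (the variable names s and q3 are just for illustration):

```r
s <- summary(1:10)

# The labels that can be used for lookup by name
names(s)

# [[ ]] returns the plain value, without the "3rd Qu." name attached
q3 <- s[["3rd Qu."]]
q3
```

Looking the value up by name rather than by position makes it a little clearer to a reader which statistic is being pulled out.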

Related

write.csv() using format() adds white spaces to NA

Incidentally, I have found this problem with write.csv() and NA values if using format():
d <- data.frame(id=1:10, f=0.1*(1:10),f2=0.01*(1:10))
d$f2[3] <- NA
summary(d)
       id              f               f2
 Min.   : 1.00   Min.   :0.100   Min.   :0.01000
 1st Qu.: 3.25   1st Qu.:0.325   1st Qu.:0.04000
 Median : 5.50   Median :0.550   Median :0.06000
 Mean   : 5.50   Mean   :0.550   Mean   :0.05778
 3rd Qu.: 7.75   3rd Qu.:0.775   3rd Qu.:0.08000
 Max.   :10.00   Max.   :1.000   Max.   :0.10000
                                 NA's   :1
format(d, nsmall=3)
   id     f    f2
1   1 0.100 0.010
2   2 0.200 0.020
3   3 0.300    NA
4   4 0.400 0.040
5   5 0.500 0.050
6   6 0.600 0.060
7   7 0.700 0.070
8   8 0.800 0.080
9   9 0.900 0.090
10 10 1.000 0.100
format(d$f2, nsmall = 3)
[1] "0.010" "0.020" "   NA" "0.040" "0.050" "0.060" "0.070" "0.080" "0.090" "0.100"
format(d$f2[3])
[1] "NA"
write.csv(format(d,nsmall=3),file="test.csv",row.names = FALSE)
d2 <- read.csv("test.csv")
summary(d2)
       id              f               f2
 Min.   : 1.00   Min.   :0.100   Length:10
 1st Qu.: 3.25   1st Qu.:0.325   Class :character
 Median : 5.50   Median :0.550   Mode  :character
 Mean   : 5.50   Mean   :0.550
 3rd Qu.: 7.75   3rd Qu.:0.775
 Max.   :10.00   Max.   :1.000
I checked test.csv and found that the cell corresponding to d$f2[3] is not "NA" but "   NA" (padded with leading spaces to the width of the other entries).
d2 <- read.csv("test.csv", na.strings="   NA")
summary(d2)
       id              f               f2
 Min.   : 1.00   Min.   :0.100   Min.   :0.01000
 1st Qu.: 3.25   1st Qu.:0.325   1st Qu.:0.04000
 Median : 5.50   Median :0.550   Median :0.06000
 Mean   : 5.50   Mean   :0.550   Mean   :0.05778
 3rd Qu.: 7.75   3rd Qu.:0.775   3rd Qu.:0.08000
 Max.   :10.00   Max.   :1.000   Max.   :0.10000
                                 NA's   :1
Should this behavior of format(), adding white spaces to NAs, not be considered a bug?
This is not a critical issue, since using format() within write.csv() is not really necessary (I ran into it in a very particular case), but in principle NAs should not be affected by formatting. A nicer print to the console is one thing; writing those white spaces to a file that may be read back into R is another.
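One possible workaround, sketched under the assumption that the padding is the only obstacle: strip it with trimws() before writing, so the missing value is stored as a plain "NA" that read.csv() recognizes by default. The name d_fmt and the temporary file are made up for this example; the file stands in for test.csv.

```r
d <- data.frame(id = 1:10, f = 0.1 * (1:10), f2 = 0.01 * (1:10))
d$f2[3] <- NA

# format() pads every entry of a column to a common width,
# so remove the padding again before writing
d_fmt <- format(d, nsmall = 3)
d_fmt[] <- lapply(d_fmt, trimws)

tmp <- tempfile(fileext = ".csv")  # temporary file, stands in for test.csv
write.csv(d_fmt, file = tmp, row.names = FALSE)

d2 <- read.csv(tmp)
str(d2$f2)  # numeric again, with the missing value read back as NA
```

This keeps the three decimals in the file while leaving na.strings at its default.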

Finding the right cluster methods based on data distribution

My data set has 821050 rows and 18 columns. The rows represent different online users; the columns describe the browsing behavior of those users in an online shop. The variables include shopping cart cancellations, number of items in the shopping cart, detailed views of items, product list/multi-item views, detailed search views, etc. Half of the variables are discrete, half are continuous; 8 of the variables are dummy variables. Based on this data set, I want to apply different hard and soft clustering methods and analyze shopping cart abandonment more precisely. Using descriptive statistics, I analyzed the data set and obtained the following results.
WKA_ohneJB <- read.csv("WKA_ohneJB_PCA.csv", header = TRUE, sep = ";", stringsAsFactors = FALSE)
summary(WKA_ohneJB)
       X            BASKETS_NZ         LOGONS            PIS             PIS_AP           PIS_DV
 Min.   :     1   Min.   : 0.000   Min.   :0.0000   Min.   :  1.00   Min.   : 0.000   Min.   :  0.000
 1st Qu.:205263   1st Qu.: 1.000   1st Qu.:1.0000   1st Qu.:  9.00   1st Qu.: 0.000   1st Qu.:  0.000
 Median :410525   Median : 1.000   Median :1.0000   Median : 20.00   Median : 1.000   Median :  1.000
 Mean   :410525   Mean   : 1.023   Mean   :0.9471   Mean   : 31.11   Mean   : 1.783   Mean   :  4.554
 3rd Qu.:615786   3rd Qu.: 1.000   3rd Qu.:1.0000   3rd Qu.: 41.00   3rd Qu.: 2.000   3rd Qu.:  5.000
 Max.   :821048   Max.   :49.000   Max.   :1.0000   Max.   :593.00   Max.   :71.000   Max.   :203.000
     PIS_PL           PIS_SDV         PIS_SHOPS          PIS_SR           QUANTITY           WKA
 Min.   :  0.000   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Min.   :  1.00   Min.   :0.0000
 1st Qu.:  0.000   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:  1.00   1st Qu.:0.0000
 Median :  0.000   Median :  0.00   Median :  2.00   Median :  0.000   Median :  2.00   Median :1.0000
 Mean   :  5.729   Mean   :  2.03   Mean   : 10.67   Mean   :  3.873   Mean   :  3.14   Mean   :0.6341
 3rd Qu.:  4.000   3rd Qu.:  2.00   3rd Qu.: 11.00   3rd Qu.:  4.000   3rd Qu.:  4.00   3rd Qu.:1.0000
 Max.   :315.000   Max.   :142.00   Max.   :405.00   Max.   :222.000   Max.   :143.00   Max.   :1.0000
    NEW_CUST         EXIST_CUST        WEB_CUST       MOBILE_CUST      TABLET_CUST    LOGON_CUST_STEP2
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000
 Median :0.00000   Median :1.0000   Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000
 Mean   :0.07822   Mean   :0.9218   Mean   :0.4704   Mean   :0.3935   Mean   :0.1361   Mean   :0.1743
 3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000
 Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
With the non-dummy variables it is noticeable that they have a right-skewed distribution. For the dummy variables, 5 have a right-skewed distribution and 3 have a left-skewed distribution.
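The skewness statement above can be checked numerically rather than read off the summary. A rough sketch follows; skew() is a hypothetical helper (base R has no skewness function), and pis_sim is simulated count data standing in for a column such as PIS, since the original CSV is not available here.

```r
# Hypothetical helper: moment-based sample skewness (not part of base R)
skew <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated page-impression counts, made up for this sketch
set.seed(1)
pis_sim <- rpois(1000, lambda = rexp(1000, rate = 1 / 30))

skew(pis_sim)                    # positive for a right-skewed variable
mean(pis_sim) > median(pis_sim)  # the mean sits above the median
```

On the real data the same helper could be applied column by column, for example with sapply() over the non-dummy columns.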
I have also listed range and quantiles for the 9 non-dummies
# BASKETS_NZ
range(WKA_ohneJB$BASKETS_NZ) # 0 49
quantile(WKA_ohneJB$BASKETS_NZ, 0.5) # 1
quantile(WKA_ohneJB$BASKETS_NZ, 0.25) # 1
quantile(WKA_ohneJB$BASKETS_NZ, 0.75) # 1
# PIS
range(WKA_ohneJB$PIS) # 1 593
quantile(WKA_ohneJB$PIS, 0.25) # 9
quantile(WKA_ohneJB$PIS, 0.5) # 20
quantile(WKA_ohneJB$PIS, 0.75) # 41
# PIS_AP
range(WKA_ohneJB$PIS_AP) # 0 71
quantile(WKA_ohneJB$PIS_AP, 0.25) # 0
quantile(WKA_ohneJB$PIS_AP, 0.5) # 1
quantile(WKA_ohneJB$PIS_AP, 0.75) # 2
# PIS_DV
range(WKA_ohneJB$PIS_DV) # 0 203
quantile(WKA_ohneJB$PIS_DV, 0.25) # 0
quantile(WKA_ohneJB$PIS_DV, 0.5) # 1
quantile(WKA_ohneJB$PIS_DV, 0.75) # 5
# PIS_PL
range(WKA_ohneJB$PIS_PL) # 0 315
quantile(WKA_ohneJB$PIS_PL, 0.25) # 0
quantile(WKA_ohneJB$PIS_PL, 0.5) # 0
quantile(WKA_ohneJB$PIS_PL, 0.75) # 4
# PIS_SDV
range(WKA_ohneJB$PIS_SDV) # 0 142
quantile(WKA_ohneJB$PIS_SDV, 0.25) # 0
quantile(WKA_ohneJB$PIS_SDV, 0.5) # 0
quantile(WKA_ohneJB$PIS_SDV, 0.75) # 2
# PIS_SHOPS
range(WKA_ohneJB$PIS_SHOPS) # 0 405
quantile(WKA_ohneJB$PIS_SHOPS, 0.25) # 0
quantile(WKA_ohneJB$PIS_SHOPS, 0.5) # 2
quantile(WKA_ohneJB$PIS_SHOPS, 0.75) # 11
# PIS_SR
range(WKA_ohneJB$PIS_SR) # 0 222
quantile(WKA_ohneJB$PIS_SR, 0.25) # 0
quantile(WKA_ohneJB$PIS_SR, 0.5) # 0
quantile(WKA_ohneJB$PIS_SR, 0.75) # 4
# QUANTITY
range(WKA_ohneJB$QUANTITY) # 1 143
quantile(WKA_ohneJB$QUANTITY, 0.25) # 1
quantile(WKA_ohneJB$QUANTITY, 0.5) # 2
quantile(WKA_ohneJB$QUANTITY, 0.75) # 4
How can I recognize from the distribution of my data which cluster methods are suitable for mixed type clickstream data?

Summarize the same variables from multiple dataframes in one table

I have voter and party-data from several datasets that I further separated into different dataframes and lists to make it comparable. I could just use the summary command on each of them individually then compare manually, but I was wondering whether there was a way to get them all together and into one table?
Here's a sample of what I have:
> summary(eco$rilenew)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      3       4       4       4       4       5
> summary(ecovoters)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   3.000   4.000   3.744   5.000  10.000      26
> summary(lef$rilenew)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   3.000   3.000   3.692   4.000   7.000
> summary(lefvoters)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   2.000   3.000   3.612   5.000  10.000     332
> summary(soc$rilenew)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   4.000   4.000   4.143   5.000   6.000
> summary(socvoters)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   3.000   4.000   3.674   5.000  10.000     346
Is there a way I can summarize these lists (ecovoters, lefvoters, socvoters etc) and the dataframe variables (eco$rilenew, lef$rilenew, soc$rilenew etc) together and have them in one table?
You could put everything into a list and summarize with a small custom function.
L <- list(eco$rilenew, ecovoters, lef$rilenew,
lefvoters, soc$rilenew, socvoters)
t(sapply(L, function(x) {
s <- summary(x)
length(s) <- 7
names(s)[7] <- "NA's"
s[7] <- ifelse(!any(is.na(x)), 0, s[7])
return(s)
}))
           Min.   1st Qu.   Median     Mean  3rd Qu.      Max. NA's
[1,]  0.9820673 3.3320662 3.958665 3.949512 4.625109  7.229069    0
[2,] -4.8259384 0.5028293 3.220546 3.301452 6.229384  9.585749   26
[3,] -0.3717391 2.3280366 3.009360 3.013908 3.702156  6.584659    0
[4,] -2.6569493 1.6674330 3.069440 3.015325 4.281100  8.808432  332
[5,] -2.3625651 2.4964361 3.886673 3.912009 5.327401 10.349040    0
[6,] -2.4719404 1.3635785 2.790523 2.854812 4.154936  8.491347  346
Data
set.seed(42)
eco <- data.frame(rilenew=rnorm(800, 4, 1))
ecovoters <- rnorm(75, 4, 4)
ecovoters[sample(length(ecovoters), 26)] <- NA
lef <- data.frame(rilenew=rnorm(900, 3, 1))
lefvoters <- rnorm(700, 3, 2)
lefvoters[sample(length(lefvoters), 332)] <- NA
soc <- data.frame(rilenew=rnorm(900, 4, 2))
socvoters <- rnorm(700, 3, 2)
socvoters[sample(length(socvoters), 346)] <- NA
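A small variation on the approach above: if the list is named, sapply() carries the names through, so the rows of the table are labelled instead of showing up as [1,], [2,] and so on. The sketch below uses two made-up vectors (grp$a and grp$b) rather than the data above, and counts the NAs directly with sum(is.na(x)) instead of padding the summary.

```r
set.seed(42)
# Named list of illustrative vectors; the names become the row labels
grp <- list(a = rnorm(50, 4, 1), b = rnorm(50, 3, 2))
grp$b[1:5] <- NA

tab <- t(sapply(grp, function(x) {
  # First six entries are Min. to Max.; the NA count is added explicitly
  c(unclass(summary(x))[1:6], "NA's" = sum(is.na(x)))
}))
tab
```

Counting the NAs separately means every row has the same seven entries, whether or not the vector has missing values.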
You can use map() from purrr (part of the tidyverse) to get a list of summaries; if you want the result as a data frame, plyr::ldply() can convert the list:
ll = map(L, summary)
ll
plyr::ldply(ll, rbind)
> ll = map(L, summary)
> ll
[[1]]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.9821  3.3321  3.9587  3.9495  4.6251  7.2291
[[2]]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 -4.331   1.347   3.726   3.793   6.653  16.845      26
[[3]]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.3717  2.3360  3.0125  3.0174  3.7022  6.5847
[[4]]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 -2.657   1.795   3.039   3.013   4.395   9.942     332
[[5]]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -2.363   2.503   3.909   3.920   5.327  10.349
[[6]]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 -3.278   1.449   2.732   2.761   4.062   8.171     346
> plyr::ldply(ll, rbind)
        Min.  1st Qu.   Median     Mean  3rd Qu.      Max. NA's
1  0.9820673 3.332066 3.958665 3.949512 4.625109  7.229069   NA
2 -4.3312551 1.346532 3.725708 3.793431 6.652917 16.844796   26
3 -0.3717391 2.335959 3.012507 3.017438 3.702156  6.584659   NA
4 -2.6569493 1.795307 3.038905 3.012928 4.395338  9.941819  332
5 -2.3625651 2.503324 3.908727 3.920050 5.327401 10.349040   NA
6 -3.2779863 1.448814 2.732515 2.760569 4.061854  8.170793  346

Restructure output of R summary function

Is there an easy way to change the output format for R's summary function so that the results print in a column instead of row? R does this automatically when you pass summary a data frame. I'd like to print summary statistics in a column when I pass it a single vector. So instead of this:
>summary(vector)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   2.000   6.699   6.000 559.000
It would look something like this:
>summary(vector)
Min.      1.000
1st Qu.   1.000
Median    2.000
Mean      6.699
3rd Qu.   6.000
Max.    559.000
Sure. Treat it as a data.frame:
set.seed(1)
x <- sample(30, 100, TRUE)
summary(x)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    1.00   10.00   15.00   16.03   23.25   30.00
summary(data.frame(x))
#        x
#  Min.   : 1.00
#  1st Qu.:10.00
#  Median :15.00
#  Mean   :16.03
#  3rd Qu.:23.25
#  Max.   :30.00
For slightly more usable output, you can use data.frame(unclass(.)):
data.frame(val = unclass(summary(x)))
#           val
# Min.     1.00
# 1st Qu. 10.00
# Median  15.00
# Mean    16.03
# 3rd Qu. 23.25
# Max.    30.00
Or you can use stack:
stack(summary(x))
#   values     ind
# 1   1.00    Min.
# 2  10.00 1st Qu.
# 3  15.00  Median
# 4  16.03    Mean
# 5  23.25 3rd Qu.
# 6  30.00    Max.
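If this is needed often, the conversion can be wrapped in a small function. A sketch only: summary_col() is a made-up name, and the column names stat and value are arbitrary choices.

```r
# Hypothetical helper: returns the summary statistics as rows of a data frame
summary_col <- function(x) {
  s <- summary(x)
  data.frame(stat = names(s), value = as.numeric(s), stringsAsFactors = FALSE)
}

set.seed(1)
x <- sample(30, 100, TRUE)
res <- summary_col(x)
res
```

Keeping the statistic names in their own column, as stack() does, tends to be easier to filter or merge later than row names.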

summary to a data frame

Using summary(var) gives me the following output:
  PAY_BACK_ORG
 Min.   : -16.40
 1st Qu.:   0.00
 Median :  26.40
 Mean   :  34.37
 3rd Qu.:  53.60
 Max.   :4033.40
I want it as a dataframe which will look like this:
Min      -16.40
1st Qu     0.00
Median    26.40
Mean      34.37
3rd Qu    53.60
Max     4033.40
How can I get it into that form?
Like this?
var <- rnorm(100)
x <- summary(var)
data.frame(x=matrix(x),row.names=names(x))
##                x
## Min.    -2.68300
## 1st Qu. -0.70930
## Median  -0.09732
## Mean    -0.00809
## 3rd Qu.  0.71550
## Max.     2.58100

Resources