skimr: how to remove histogram? - r

I want to use the function skim from the R package skimr on Windows. Unfortunately, in many situations the hist column is printed incorrectly (as a run of <U+2587>-like symbols), as in the example below.
Question: is there an easy way to either stop the "hist" column from being printed or prevent it from being calculated at all? Is there an option like hist = FALSE?
capture.output(skimr::skim(iris))
#> [1] "Skim summary statistics"
#> [2] " n obs: 150 "
#> [3] " n variables: 5 "
#> [4] ""
#> [5] "-- Variable type:factor ------------------------------------------------------------------------"
#> [6] " variable missing complete n n_unique top_counts"
#> [7] " Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0"
#> [8] " ordered"
#> [9] " FALSE"
#> [10] ""
#> [11] "-- Variable type:numeric -----------------------------------------------------------------------"
#> [12] " variable missing complete n mean sd p0 p25 p50 p75 p100"
#> [13] " Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1 6.9"
#> [14] " Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3 1.8 2.5"
#> [15] " Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9"
#> [16] " Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4"
#> [17] " hist"
#> [18] " <U+2587><U+2581><U+2581><U+2582><U+2585><U+2585><U+2583><U+2581>"
#> [19] " <U+2587><U+2581><U+2581><U+2585><U+2583><U+2583><U+2582><U+2582>"
#> [20] " <U+2582><U+2587><U+2585><U+2587><U+2586><U+2585><U+2582><U+2582>"
#> [21] " <U+2581><U+2582><U+2585><U+2587><U+2583><U+2582><U+2581><U+2581>"
Changing the locale to Chinese (as in this answer) does not solve the problem, but makes it worse:
Sys.setlocale(locale = "Lithuanian")
df <- data.frame(x = 1:5, y = c("Ą", "Č", "Ę", "ū", "ž"))
Sys.setlocale(locale = "Chinese")
capture.output(skimr::skim(df))
#> Error in substr(names(x), 1, options$formats$.levels$max_char) : invalid multibyte string at '<c0>'

skim_with(numeric = list(hist = NULL))
This is described in the "Using Skimr" vignette.

You could also use skim_without_charts instead of skim.
More details in the docs here:
https://www.rdocumentation.org/packages/skimr/versions/2.0.2/topics/skim

Also keep in mind that the output from skimr is a dataframe so you can do:
# I'm using tidyverse here
iris %>%
skim() %>%
select(-numeric.hist)
The catch is that the name of the column is not hist but numeric.hist.
I actually got to this question because I wanted to do the opposite: keep only the histograms.
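With skimr 2.x, the usual way to drop the histogram is to build a custom skimming function. The sketch below assumes skimr version 2 or later is installed (it is skipped otherwise); skim_no_hist is a name chosen here purely for illustration.

```r
# Sketch for skimr >= 2.0; the block is skipped if skimr is not installed.
has_hist <- NA
if (requireNamespace("skimr", quietly = TRUE)) {
  # skim_with() returns a new skimming function; sfl(hist = NULL)
  # removes the histogram summary for numeric columns.
  skim_no_hist <- skimr::skim_with(numeric = skimr::sfl(hist = NULL))
  res <- skim_no_hist(iris)

  # Check whether a histogram column is still part of the result
  has_hist <- "numeric.hist" %in% names(res)
  print(has_hist)
}
```

Because the histogram is never computed here, this also sidesteps the encoding problem instead of hiding it after the fact.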

Related

Appending List Elements in write.table

I have a list of what are essentially tables of different variables, with a reproducible dummy example below (it's a little ugly, but it gets the idea across).
results <- list()
for(ii in names(iris)[1:4]) {
mytab <- table(iris[,ii] > mean(iris[,ii]), iris$Species)
myp <- chisq.test(mytab)$p.value
results[[ii]] <- as.data.frame(cbind(mytab, P.value=myp))
results[[ii]] <- tibble::rownames_to_column(results[[ii]], ii)
}
In a previous version of R (at least 4.0), I used to be able to do something like:
lapply(results, function(x) write.table(x, "myfile.txt", append=T, sep="\t", quote=F, row.names=F))
which would generate a file called myfile.txt and fill it with all of my tables, much like the list of printed tables from results. I've had this code (which was functioning as expected) since at least the end of 2021. However, I now get the error:
Error in write.table(x, "myfile.txt", append = T, sep = "\t", quote = T, :
(converted from warning) appending column names to file
And to some extent I get it -- the column names I'm using aren't identical to what I'm appending, but I don't really care for my purposes. I just want my printed list of tables. Is there a way to force appending irrespective of mismatched column names? I've tried using col.names=NA but then receive the error that using col.names=NA with row.names=F "makes no sense". Do I need to resign myself to using functions like sink for this? I'd really like everything to remain tab-separated if possible.
The warning appears to be baked in: it depends solely on the col.names and append arguments, and there is no easy way to squelch it from within write.table.
In general it's just a warning, but since it was elevated to Error status, that suggests you've set options(warn = 2) or higher. It's not a factor for these resolutions (which result in no warning being emitted and therefore no escalation to an error).
Suppress it and all other warnings (for good or bad):
write.table(data.frame(a=1,b=2), "quux.csv", append=T, sep="\t", quote=F, row.names=F)
# Error in write.table(data.frame(a = 1, b = 2), "quux.csv", append = T, :
# (converted from warning) appending column names to file
suppressWarnings(write.table(data.frame(a=1,b=2), "quux.csv", append=T, sep="\t", quote=F, row.names=F))
### nothing emitted, file appended
Suppress just that warning, allowing others (since suppressing all can hide other issues):
withCallingHandlers(
write.table(data.frame(a=1,b=2), "quux.csv", append=T, sep="\t", quote=F, row.names=F),
warning = function(w) {
if (grepl("appending column names to file", conditionMessage(w))) {
invokeRestart("muffleWarning")
}
})
### nothing emitted, file appended
withCallingHandlers(
write.table(data.frame(a=1,b=2), "quux.csv", append=T, sep="\t", quote=F, row.names=F),
warning = function(w) {
if (grepl("something else", conditionMessage(w))) {
invokeRestart("muffleWarning")
}
})
# Error in write.table(data.frame(a = 1, b = 2), "quux.csv", append = T, :
# (converted from warning) appending column names to file
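Since the warning depends only on the append and col.names arguments, another way to avoid it is to not use append = TRUE at all: open one file connection and write every table to it. This is a minimal sketch with two small stand-in tables (hypothetical data, not the tables from the question) and a temporary file.

```r
# Two small stand-in tables (hypothetical data, not from the question)
tables <- list(
  a = data.frame(x = 1:2, y = c("p", "q")),
  b = data.frame(u = 3:4, v = c("r", "s"))
)

out_file <- tempfile(fileext = ".txt")
old_opts <- options(warn = 2)      # warnings become errors, as in the question
con <- file(out_file, open = "w")  # one connection, opened once

for (tab in tables) {
  # append stays at its default (FALSE); output simply continues
  # where the open connection left off, headers included
  write.table(tab, con, sep = "\t", quote = FALSE, row.names = FALSE)
}

close(con)
options(old_opts)                  # restore the previous warning level

written <- readLines(out_file)
print(written)
```

Each table is written with its own header row and everything stays tab-separated, so no warning needs to be suppressed.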
Another potential solution is to use write.list() from the erer package:
library(erer)
#> Loading required package: lmtest
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
#> Registered S3 method overwritten by 'quantmod':
#> method from
#> as.zoo.data.frame zoo
results <- list()
for(ii in names(iris)[1:4]) {
mytab <- table(iris[,ii] > mean(iris[,ii]), iris$Species)
myp <- chisq.test(mytab)$p.value
results[[ii]] <- as.data.frame(cbind(mytab, P.value=myp))
results[[ii]] <- tibble::rownames_to_column(results[[ii]], ii)
}
write.list(z = results, file = "myfile.txt", row.names = FALSE, quote = FALSE)
read.csv("~/Desktop/myfile.txt")
#> Result Sepal.Length setosa versicolor virginica P.value
#> 1 Sepal.Length FALSE 50 24 6 8.373761e-18
#> 2 Sepal.Length TRUE 0 26 44 8.373761e-18
#> 3
#> 4 Result Sepal.Width setosa versicolor virginica P.value
#> 5 Sepal.Width FALSE 8 42 33 1.24116e-11
#> 6 Sepal.Width TRUE 42 8 17 1.24116e-11
#> 7
#> 8 Result Petal.Length setosa versicolor virginica P.value
#> 9 Petal.Length FALSE 50 7 0 9.471374e-28
#> 10 Petal.Length TRUE 0 43 50 9.471374e-28
#> 11
#> 12 Result Petal.Width setosa versicolor virginica P.value
#> 13 Petal.Width FALSE 50 10 0 4.636126e-26
#> 14 Petal.Width TRUE 0 40 50 4.636126e-26
#> 15
# You can also specify the table names, e.g.
write.list(z = results, file = "myfile2.txt", row.names = FALSE, quote = FALSE, t.name = 1:4)
read.csv("~/Desktop/myfile2.txt")
#> Result Sepal.Length setosa versicolor virginica P.value
#> 1 1 FALSE 50 24 6 8.373761e-18
#> 2 1 TRUE 0 26 44 8.373761e-18
#> 3
#> 4 Result Sepal.Width setosa versicolor virginica P.value
#> 5 2 FALSE 8 42 33 1.24116e-11
#> 6 2 TRUE 42 8 17 1.24116e-11
#> 7
#> 8 Result Petal.Length setosa versicolor virginica P.value
#> 9 3 FALSE 50 7 0 9.471374e-28
#> 10 3 TRUE 0 43 50 9.471374e-28
#> 11
#> 12 Result Petal.Width setosa versicolor virginica P.value
#> 13 4 FALSE 50 10 0 4.636126e-26
#> 14 4 TRUE 0 40 50 4.636126e-26
#> 15
Created on 2022-07-19 by the reprex package (v2.0.1)

For Loop List Not Storing Values

I have a data set of behaviours performed by individuals repeatedly at different temperatures, e.g.:
ID Test Behaviour Temperature
A12.4.2 ONE 8.64 4
A12.4.2 TWO 7.63 5
A6.3.3 ONE 1.81 3
A6.3.3 TWO 2.47 9
B12.4.1 ONE 1.17 12
B12.4.1 TWO 3.96 2
E9.4.2 ONE 13.04 13
E9.4.2 TWO 9.51 6
...
I use the following code to randomly subset that data set, and then run repeatability analysis on the subset, producing R values and CI values from the repeatability analysis at the end.
P<-10000
R_value<-numeric(length=P)
CI_value<-numeric(length=P)
for(i in 1:P){
newdata<-Data[Data$ID %in% sample(unique(Data$ID), 16), ]
m1<-rptR::rpt(((Behaviour))~Temperature+(1|ID),grname="ID",data=newdata,datatype="Gaussian",nboot=1000,npermut=1000)
R_value[i] <- m1$R
CI_value[i] <- m1$CI
}
Unfortunately this doesn't seem to be working. When I call R_value or CI_value, I am greeted with a string of 0's. Upon calling newdata or m1, R tells me that the object cannot be found.
Where I run the repeatability analysis outside of the for loop, everything turns out fine.
Can anyone help?
Your code runs. With the example data, the sample() call raised an error (the example has only 4 unique IDs, so 16 cannot be drawn without replacement), so I changed it to
sample(unique(Data$ID), 4), and then it runs. You could probably also have added replace, as in sample(unique(Data$ID), 16, replace = TRUE); this works, too. I have also reduced the values of nboot and npermut.
library(rptR)
Data <- read.table(text = "
ID Test Behaviour Temperature
A12.4.2 ONE 8.64 4
A12.4.2 TWO 7.63 5
A6.3.3 ONE 1.81 3
A6.3.3 TWO 2.47 9
B12.4.1 ONE 1.17 12
B12.4.1 TWO 3.96 2
E9.4.2 ONE 13.04 13
E9.4.2 TWO 9.51 6
", header =T)
Data
#> ID Test Behaviour Temperature
#> 1 A12.4.2 ONE 8.64 4
#> 2 A12.4.2 TWO 7.63 5
#> 3 A6.3.3 ONE 1.81 3
#> 4 A6.3.3 TWO 2.47 9
#> 5 B12.4.1 ONE 1.17 12
#> 6 B12.4.1 TWO 3.96 2
#> 7 E9.4.2 ONE 13.04 13
#> 8 E9.4.2 TWO 9.51 6
P<-10
R_value<-numeric(length=P)
CI_value<-numeric(length=P)
for(i in 1:P){
newdata<-Data[Data$ID %in% sample(unique(Data$ID), 4), ]
m1<-rptR::rpt(((Behaviour))~Temperature+(1|ID), grname="ID", data=newdata, datatype="Gaussian", nboot=10, npermut=10)
R_value[i] <- m1$R
CI_value[i] <- m1$CI
}
R_value
#> [[1]]
#> [1] 0.8324396
#>
#> [[2]]
#> [1] 0.8324396
#>
#> [[3]]
#> [1] 0.8324396
#>
#> [[4]]
#> [1] 0.8324396
#>
#> [[5]]
#> [1] 0.8324396
#>
#> [[6]]
#> [1] 0.8324396
#>
#> [[7]]
#> [1] 0.8324396
#>
#> [[8]]
#> [1] 0.8324396
#>
#> [[9]]
#> [1] 0.8324396
#>
#> [[10]]
#> [1] 0.8324396
CI_value
#> [1] 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95
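The list-style printout of R_value above is a side effect of what is being stored. It suggests that m1$R is a data frame (that is, a list), and assigning a list into an element of a numeric vector turns the whole vector into a list. A minimal base R sketch, where fake_R is a hypothetical stand-in for m1$R:

```r
fake_R <- data.frame(ID = 0.8324396)  # hypothetical stand-in for m1$R

as_list <- numeric(3)
as_list[1] <- fake_R       # assigning a list coerces the whole vector to a list
print(is.list(as_list))

as_num <- numeric(3)
as_num[1] <- fake_R[[1]]   # extract the bare number first
print(is.numeric(as_num))
```

If a plain numeric vector is wanted, extracting the value with [[1]] before storing it keeps R_value numeric.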

assign to is.na(clinical.trial$age)

I am looking at the code from here which has this at the beginning:
## generate data for medical example
clinical.trial <-
data.frame(patient = 1:100,
age = rnorm(100, mean = 60, sd = 6),
treatment = gl(2, 50,
labels = c("Treatment", "Control")),
center = sample(paste("Center", LETTERS[1:5]), 100, replace =
TRUE))
## set some ages to NA (missing)
is.na(clinical.trial$age) <- sample(1:100, 20)
I cannot understand this last line.
The LHS is a vector of all FALSE values. The RHS is a vector of 20 numbers selected from the vector 1:100.
I don't understand this kind of assignment. How does this result in clinical.trial$age getting some NA values? Does this kind of assignment have a name? At best I would say that the boolean vector on the LHS gets numbers assigned to it with recycling.
is.na(x) <- value is translated as x <- 'is.na<-'(x, value), so the result of the function call is assigned back to x.
You can think of 'is.na<-'(x, value) as 'assign NA to x, at position value'.
A perhaps better and more intuitive phrasing could be assign_NA(to = x, pos = value).
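A short base R check of that reading, using a made-up vector: the assignment form, the explicit function call, and plain index assignment should all give the same result.

```r
x <- c(10, 20, 30, 40, 50)

# 1. The assignment form from the question
y <- x
is.na(y) <- c(2, 4)

# 2. The explicit call to the replacement function
z <- `is.na<-`(x, c(2, 4))

# 3. Plain index assignment
w <- x
w[c(2, 4)] <- NA

print(y)
print(identical(y, z))
print(identical(y, w))
```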
Regarding other similar functions, we can find them in the base package:
x <- as.character(lsf.str("package:base"))
x[grep('<-', x)]
#> [1] "$<-" "$<-.data.frame"
#> [3] "@<-" "[[<-"
#> [5] "[[<-.data.frame" "[[<-.factor"
#> [7] "[[<-.numeric_version" "[<-"
#> [9] "[<-.data.frame" "[<-.Date"
#> [11] "[<-.factor" "[<-.numeric_version"
#> [13] "[<-.POSIXct" "[<-.POSIXlt"
#> [15] "<-" "<<-"
#> [17] "attr<-" "attributes<-"
#> [19] "body<-" "class<-"
#> [21] "colnames<-" "comment<-"
#> [23] "diag<-" "dim<-"
#> [25] "dimnames<-" "dimnames<-.data.frame"
#> [27] "Encoding<-" "environment<-"
#> [29] "formals<-" "is.na<-"
#> [31] "is.na<-.default" "is.na<-.factor"
#> [33] "is.na<-.numeric_version" "length<-"
#> [35] "length<-.factor" "levels<-"
#> [37] "levels<-.factor" "mode<-"
#> [39] "mostattributes<-" "names<-"
#> [41] "names<-.POSIXlt" "oldClass<-"
#> [43] "parent.env<-" "regmatches<-"
#> [45] "row.names<-" "row.names<-.data.frame"
#> [47] "row.names<-.default" "rownames<-"
#> [49] "split<-" "split<-.data.frame"
#> [51] "split<-.default" "storage.mode<-"
#> [53] "substr<-" "substring<-"
#> [55] "units<-" "units<-.difftime"
They all work the same way, in the sense that fun(x) <- val is equivalent to x <- 'fun<-'(x, val). Beyond that, they behave like any normal function.
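To see that nothing about this is special to base R, here is a sketch that defines its own replacement function. The name second<- is hypothetical and chosen only for this illustration.

```r
# A user-defined replacement function: the name must end in `<-`
# and the last argument must be called `value`.
`second<-` <- function(x, value) {
  x[2] <- value
  x
}

v <- c(1, 2, 3)
second(v) <- 10                    # sugar for: v <- `second<-`(v, 10)

u <- `second<-`(c(1, 2, 3), 10)    # the same thing, written out
print(v)
print(identical(v, u))
```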
R manuals: 3.4.4 Subset assignment
The help tells us that:
(xx <- c(0:4))
is.na(xx) <- c(2, 4)
xx #> 0 NA 2 NA 4
So,
is.na(xx) <- 1
behaves more like
set NA at position 1 on variable xx
@matt, to respond to your question asked above in the comments, here's an alternative way to do the same assignment that I think is easier to follow :-)
clinical.trial$age[sample(1:100, 20)] <- NA

Unable to pipe predict() output through filter() to ggplot()

I'm struggling to figure out why I can't use filter() on the results
of predict.gam() and then ggplot() the subset of predictions. I'm not
sure the prediction step is really part of the problem, but that's what
it takes to trigger the error. Just filter() %>% ggplot() with a
dataframe works fine.
library(dplyr)
library(ggplot2)
library(mgcv)
gam1 <- gam(Petal.Length~s(Petal.Width) + Species, data=iris)
nd <- expand.grid(Petal.Width = seq(0,5,0.05),
Species = levels(iris$Species),
stringsAsFactors = FALSE)
predicted <- predict(gam1,newdata=nd)
predicted <- cbind(predicted,nd)
filter(tbl_df(predicted), Species == "setosa") %>%
ggplot(aes(x=Petal.Width, y = predicted)) +
geom_point()
## Error: length(rows) == 1 is not TRUE
But:
filter(tbl_df(predicted), Species == "setosa")
## Source: local data frame [101 x 3]
##
## predicted Petal.Width Species
## (dbl[10]) (dbl) (chr)
## 1 1.294574 0.00 setosa
## 2 1.327482 0.05 setosa
## 3 1.360390 0.10 setosa
## 4 1.393365 0.15 setosa
## 5 1.426735 0.20 setosa
## 6 1.460927 0.25 setosa
## 7 1.496477 0.30 setosa
## 8 1.533949 0.35 setosa
## 9 1.573888 0.40 setosa
## 10 1.616810 0.45 setosa
## .. ... ... ...
And the problem is filter() because:
pick <- predicted$Species == "setosa"
ggplot(predicted[pick,],aes(x=Petal.Width, y = predicted)) +
geom_point()
I've also tried saving the result of filter to an object and using that directly in ggplot() but that has the same error.
Obviously not a crisis, because there's a workaround, but my mental
model of how to use filter() is obviously wrong! Any insights much
appreciated.
Edit: When I first posted this I was still using R 3.2.3 and was getting warnings from ggplot2 and dplyr. So I upgraded to 3.3.0 and it's still happening.
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mgcv_1.8-12 nlme_3.1-127 ggplot2_2.1.0 dplyr_0.4.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 knitr_1.11 magrittr_1.5 munsell_0.4.2
## [5] colorspace_1.2-6 lattice_0.20-33 R6_2.1.1 stringr_1.0.0
## [9] plyr_1.8.3 tools_3.3.0 parallel_3.3.0 grid_3.3.0
## [13] gtable_0.1.2 DBI_0.3.1 htmltools_0.2.6 lazyeval_0.1.10
## [17] yaml_2.1.13 assertthat_0.1 digest_0.6.8 Matrix_1.2-6
## [21] formatR_1.2 evaluate_0.7.2 rmarkdown_0.9.5 labeling_0.3
## [25] stringi_1.0-1 scales_0.3.0
The problem arises because your predict() call generates a named array, instead of just a numerical vector.
class(predicted$predicted)
# [1] "array"
On the surface, the first filter() gives the correct output. However, if you inspect it, you will notice that the predicted column is still the full 1-d array of 303 values, while the other columns have 101 rows.
str(filter(tbl_df(predicted), Species == "setosa"))
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 101 obs. of 3 variables:
$ predicted : num [1:303(1d)] 1.29 1.33 1.36 1.39 1.43 ...
..- attr(*, "dimnames")=List of 1
.. ..$ : chr "1" "2" "3" "4" ...
$ Petal.Width: num 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 ...
$ Species: chr "setosa" "setosa" "setosa" "setosa" ...
In contrast, good old logical subsetting does the job on all dimensions:
str(predicted[pick,])
'data.frame': 101 obs. of 3 variables:
$ predicted : num [1:101(1d)] 1.29 1.33 1.36 1.39 1.43 ... # Now 101 obs here too
..- attr(*, "dimnames")=List of 1
.. ..$ : chr "1" "2" "3" "4" ...
$ Petal.Width: num 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 ...
$ Species : chr "setosa" "setosa" "setosa" "setosa" ...
So either you coerce the predicted column to numeric:
library(dplyr)
library(ggplot2)
predicted %>% mutate(predicted = as.numeric(predicted)) %>%
filter(Species == "setosa") %>%
ggplot(aes(x = Petal.Width, y = predicted)) +
geom_point()
Or replace filter() by subset():
predicted %>%
subset(Species == "setosa") %>%
ggplot(aes(x = Petal.Width, y = predicted)) +
geom_point()
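The coercion step can be checked in isolation. The sketch below uses a small hand-built 1-d array as a stand-in for the predict() result (the values are copied from the output above, and the object names are made up) and shows that as.numeric() strips the dim and dimnames attributes.

```r
# Stand-in for the 1-d named array that predict() returned in the question
p <- array(c(1.29, 1.33, 1.36), dim = 3,
           dimnames = list(c("1", "2", "3")))
print(is.array(p))

# as.numeric() drops all attributes, leaving a plain numeric vector
flat <- as.numeric(p)
print(is.array(flat))
print(attributes(flat))
```

A plain vector like flat is what functions expecting an ordinary data frame column can subset row by row.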

how to write result of t.test into a file?

How can I write the result of t.test into a file?
> x
[1] 12.2 10.8 12.0 11.8 11.9 12.4 11.3 12.2 12.0 12.3
> t.test(x)
One Sample t-test
data: x
t = 76.2395, df = 9, p-value = 5.814e-14
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
11.5372 12.2428
sample estimates:
mean of x
11.89
> write(t.test(x),file="test")
Error in cat(list(...), file, sep, fill, labels, append) :
argument 1 (type 'list') cannot be handled by 'cat'
> sink("out.txt")
> x <- scan()
1: 12.2 10.8 12.0 11.8 11.9 12.4 11.3 12.2 12.0 12.3
11:
Read 10 items
> t.test(x)
> sink()
> readLines("out.txt")
[1] ""
[2] "\tOne Sample t-test"
[3] ""
[4] "data: x "
[5] "t = 76.2395, df = 9, p-value = 5.814e-14"
[6] "alternative hypothesis: true mean is not equal to 0 "
[7] "95 percent confidence interval:"
[8] " 11.5372 12.2428 "
[9] "sample estimates:"
[10] "mean of x "
[11] " 11.89 "
[12] ""
The broom package was released after this post. It makes dealing with model outputs much, much nicer. In particular, the function tidy() will convert the model output to a dataframe for further handling.
x <- c(12.2, 10.8, 12.0, 11.8, 11.9, 12.4, 11.3, 12.2, 12.0, 12.3)
t_test <- t.test(x)
library(broom)
tidy_t_test <- broom::tidy(t_test)
tidy_t_test
#> # A tibble: 1 x 8
#> estimate statistic p.value parameter conf.low conf.high method
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 11.9 76.2 5.81e-14 9 11.5 12.2 One S…
#> # … with 1 more variable: alternative <chr>
write.csv(tidy_t_test, "out.csv")
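If adding a package is not an option, much the same can be done in base R. This sketch writes the printed report to a text file with capture.output() and builds a one-row data frame by hand from the components of the test object. The file names are temporary files and the column names are chosen here, not prescribed by anything.

```r
x <- c(12.2, 10.8, 12.0, 11.8, 11.9, 12.4, 11.3, 12.2, 12.0, 12.3)
res <- t.test(x)

# 1. The printed report, exactly as shown on the console
txt_file <- tempfile(fileext = ".txt")
writeLines(capture.output(print(res)), txt_file)

# 2. The key numbers as a one-row table
summary_df <- data.frame(
  estimate  = unname(res$estimate),
  statistic = unname(res$statistic),
  df        = unname(res$parameter),
  p.value   = res$p.value,
  conf.low  = res$conf.int[1],
  conf.high = res$conf.int[2]
)
csv_file <- tempfile(fileext = ".csv")
write.csv(summary_df, csv_file, row.names = FALSE)

print(summary_df)
```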
