r - Display both levels of binary outcome in getDescriptionStatsBy

ex1 = sample(50, x=c("A","B"), replace=TRUE)
ex2 = sample(50, x=c("A","B"), replace=TRUE)
getDescriptionStatsBy(factor(ex1),ex2,html=TRUE,useNA="no",statistics=TRUE,add_total_col="last")
with useNA="no" or useNA="ifany", I get
A B Total P-value
A "13 (59.1%)" "16 (57.1%)" "29 (58.0%)" "1.0”
but with useNA="always", I get
A B Total P-value
A "13 (59.1%)" "16 (57.1%)" "29 (58.0%)" "1.0"
B "9 (40.9%)" "12 (42.9%)" "21 (42.0%)" ""
Missing "0 (0.0%)" "0 (0.0%)" "0 (0.0%)" “"
Is there a way to force the display of both levels of the binary outcome (A and B) with useNA=“ifany”? Although it is obvious to me that if there is no missing data, one must only show the row of “A” (and infer that B = 1-A), some of my colleagues seem to prefer that “A” and “B” are displayed always.

I was able to answer my own question by using a wrapper that removes the "Missing" row after calling getDescriptionStatsBy with useNA="always":
k2 = getDescriptionStatsBy(factor(ex1),ex2,html=TRUE,useNA="always",statistics=TRUE,add_total_col=FALSE)
r = table(ex2)
n0 = apply(k2,1,function(x) sum(x=="0 (0%)" | x=="0 (0.0%)"))
rmv = which(rownames(k2)=="Missing" & n0==length(r))
k2[-as.numeric(rmv),]
Note in the above, I set add_total_col=FALSE and consequently looked for n0==length(r); if add_total_col="last", I would look for n0==(length(r)+1)
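A minimal sketch of that wrapper, with the steps above collected into a function (the function name show_both_levels is my own; getDescriptionStatsBy() is from the Gmisc package, and the ... arguments are simply passed through):
show_both_levels <- function(x, by, ...) {
  # request the Missing row explicitly, then drop it if it is all zeros
  k2 <- getDescriptionStatsBy(factor(x), by, useNA = "always",
                              add_total_col = FALSE, ...)
  r <- table(by)
  # count the "0 (0%)" / "0 (0.0%)" cells in each row
  n0 <- apply(k2, 1, function(z) sum(z == "0 (0%)" | z == "0 (0.0%)"))
  # the Missing row is removed only when every group column is zero
  rmv <- which(rownames(k2) == "Missing" & n0 == length(r))
  if (length(rmv)) k2 <- k2[-rmv, , drop = FALSE]
  k2
}
show_both_levels(ex1, ex2, html = TRUE, statistics = TRUE)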

What is wrong with this ifelse-command? (rows get excluded that don't match the if-statement)

I want to exclude rows for participants whose error rate is above 15%.
When I look at the error rate of participant 2, for example, it is 2,97%:
semdata[2,"error_rate"]
[1] "2,97"
But if I run this ifelse statement, many participants get excluded whose error rate is not above 15% (e.g., participant 2), while others are correctly excluded.
for(i in 1:NROW(semdata)){
#single trial blocks
ifelse((semdata[i,"error_rate"] >= 15),print(paste(i, "exclusion: error rate ST too high",semdata[i,"dt_tswp.err.prop_st"])),0)
ifelse((semdata[i,"error_rate"] >= 15),semdata[i,6:NCOL(semdata)]<-NA,0)
#dual-task blocks
# ifelse((semdata[i,"error_rate"] >= 15),print(paste(i, "exclusion: error rate DT too high")),0)
# ifelse((semdata[i,"error_rate"] >= 15),semdata[i,6:NCOL(semdata)]<-NA,0)
}
[1] "1 exclusion: error rate ST too high 6,72"
[1] "2 exclusion: error rate ST too high 2,97"
[1] "7 exclusion: error rate ST too high 2,87"
[1] "9 exclusion: error rate ST too high 5,28"
...
What am I doing wrong here?
You are comparing strings here.
"6,72" > 15
#[1] TRUE
You should convert the data to numeric before comparing, which can be done using sub:
as.numeric(sub(",", ".", "6,72"))
#[1] 6.72
This can then be compared with 15:
as.numeric(sub(",", ".", "6,72")) > 15
#[1] FALSE
For the entire column you can do:
semdata$error_rate <- as.numeric(sub(",", ".", semdata$error_rate))
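Once the column is numeric, the exclusion itself can also be written without the row-by-row loop; ifelse() is meant for element-wise value selection, not for side effects, so if() plus which() is the more natural tool here. A rough sketch, assuming the measurement columns really do start at column 6 as in the question:
semdata$error_rate <- as.numeric(sub(",", ".", semdata$error_rate))
# row indices of participants to exclude
excl <- which(semdata$error_rate >= 15)
if (length(excl) > 0) {
  print(paste(excl, "exclusion: error rate ST too high", semdata$error_rate[excl]))
  # blank out the measurement columns for the excluded participants
  semdata[excl, 6:ncol(semdata)] <- NA
}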

Writing a function to clean string data and rename columns

I am writing a function to be applied to many individual matrices. Each matrix has 5 columns of string text. I want to remove from one column the piece of text that exactly matches the string in another column, apply a couple more stringr functions, convert the matrix to a data frame, rename the columns, and, as a last step, append a number to each column name, since I will apply this to many matrices and need to identify the columns later.
This is very similar to another function I wrote, so I can't figure out why it won't work. When I run each line individually, filling in the inputs like this, it works perfectly:
Review1[,4] <- str_remove(Review1[,4], Review1[,3])
Review1[,4] <- str_sub(Review1[,4], 4, -4)
Review1[,4] <- str_trim(Review1[,4], "both")
Review1 <- as.data.frame(Review1)
colnames(Review1) <- c("Title", "Rating", "Date", "User", "Text")
Review1 <- Review1 %>% rename_all(paste0, 1)
But when I run the function nothing seems to happen at all.
Transform_Reviews <- function(x, y, z, a) {
x[,y] <- str_remove(x[,y], x[,z])
x[,y] <- str_sub(x[,y], 4, -4)
x[,y] <- str_trim(x[,y], "both")
x <- as.data.frame(x)
colnames(x) <- c("Title", "Rating", "Date", "User", "Text")
x <- x %>% rename_all(paste0, a)
}
Transform_Reviews(Review1, 4, 3, 1)
This is the only warning message I get. I also receive it when I run the str_remove line individually, yet that still changes the elements; the UDF changes nothing.
Warning messages:
1: In stri_replace_first_regex(string, pattern, fix_replacement(replacement), ... :
empty search patterns are not supported
This is an example of the part of Review1 that I'm working with.
[,3] [,4]
[1,] "6 April 2014" "By Copnovelist on 6 April 2014"
[2,] "18 Dec. 2015" "By kenneth bell on 18 Dec. 2015"
[3,] "26 May 2015" "By Simon.B :-) on 26 May 2015"
[4,] "22 July 2013" "By Lilla Lukacs on 22 July 2013"
This is what I want the output to look like:
Date1 User1
1 6 April 2014 Copnovelist
2 18 Dec. 2015 kenneth bell
3 26 May 2015 Simon.B :-)
4 22 July 2013 Lilla Lukacs
I realized I just needed to use an assignment operator to see my function work.
Review1 <- Transform_Reviews(Review1, 4, 3, 1)
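For what it's worth, R functions don't modify their arguments in place, so the function has to return the transformed object and the caller has to capture it, which is what the assignment above does. A sketch of the same function with the return value made explicit (assuming stringr and dplyr are loaded, as in the original):
Transform_Reviews <- function(x, y, z, a) {
  x[, y] <- str_remove(x[, y], x[, z])
  x[, y] <- str_sub(x[, y], 4, -4)
  x[, y] <- str_trim(x[, y], "both")
  x <- as.data.frame(x)
  colnames(x) <- c("Title", "Rating", "Date", "User", "Text")
  x %>% rename_all(paste0, a)   # last expression is the value returned
}
Review1 <- Transform_Reviews(Review1, 4, 3, 1)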

r - Printing plots from list returns quosure error

I have generated a series of plots using ggplot and lapply, like so:
makeplot <- function(data){
require(ggplot2)
require(dplyr)
ggplot(data,aes(x=num,y=cat3, fill=cat3)) +
facet_wrap(~cat2)
# etc...
}
plot_list <- lapply(split(data, interaction(data[,c("province","cat1")]), drop = TRUE), makeplot)
I am using a large data frame that could be simplified to:
data <- data.frame(
province = sample(c("1","2","3","4")),
cat1 = sample(c("health","education","banks","etc")),
cat2 = sample(c("winter","spring","summer","fall")),
cat3 = sample(c("1 hour","2 hours","4 hours","8 hours")),
Y = sample(1:100))
This generates a list of plots like so:
[screenshot of the resulting plot list]
I am attempting to print or ggsave out this list, as per here: Saving plots within lapply.
However, all attempts to export or print the resulting plots, whether with an lapply loop or a simple print statement, return the following error.
lapply(plot_list,print)
Error: `quo` must be a quosure
Call `rlang::last_error()` to see a backtrace
I'm afraid the R documentation on quosures didn't yield any helpful insights. I'm not a developer and don't really understand most of the documentation. Can anyone help me out?
I originally posted this without much of the complex lapply filtering happening earlier, as it seemed a distracting irrelevance. I'm providing that now in case it's helpful. For reference, the actual data frame head looks like:
tribble(
  ~season, ~fac_type, ~trav_cat, ~avg_pc_pop, ~loop,
  "Monsoon season", "All financial institutions", "0 to 30 minutes", 0.41395948733655, "Monsoon season All financial institutions",
  "Normal season", "All health facilities", "0 to 30 minutes", 0.426855030030894, "Monsoon season All health facilities",
  "Other season", "All hospitals", "1 to 2 hours", 0.301967752836744, "Monsoon season All hospitals",
  "Monsoon season", "Commercial and development banks", "4 to 8 hours", 0.385783483483483, "Monsoon season Commercial and development banks",
  "Normal season", "District Headquarters", "16 to 32 hours", 0.270673828371869, "Monsoon season District Headquarters",
  "Other season", "Government hospitals", "1 to 2 hours", 0.263825993199371, "Monsoon season Government hospitals"
)

Replacing a substring

Hoping to get some guidance; I'm only an occasional analyst and couldn't really work out how to handle an expression with a preceding numeric value.
My data is below. I am hoping to convert the "4D" and "5D" type of values into "4 Door" and "5 Door".
a <- c("4D Sedan", "5D Wagon")
b <- c("4 Door Sedan", "5 Door Wagon")
dt <- cbind(a,b)
Thanks.
We can use gsub() here, searching for the pattern:
\\b(\\d+)D\\b
and replacing it with:
\\1 Door
Code:
a <- c("4D Sedan", "5D Wagon", "AB4D car 5D")
gsub("\\b(\\d+)D\\b", "\\1 Door", a)
[1] "4 Door Sedan" "5 Door Wagon" "AB4D car 5 Door"
Note in the above example that the 4D in AB4D car 5D does not get replaced, nor would we want this to happen. By using word boundaries in \\b(\\d+)D\\b we can avoid unwanted replacements from happening.
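To see why the boundaries matter, compare with the same pattern without \\b (a quick illustration):
gsub("(\\d+)D", "\\1 Door", "AB4D car 5D")
# [1] "AB4 Door car 5 Door"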

r - data.table 1.10.0 - why does a named column index value not work while an integer column index value works without with = FALSE

I am using data.table 1.10.0.
# install.packages("install.load") # install in order to use the load_package function
install.load::load_package("data.table", "gsubfn", "fpCompare")
# function to convert from fractions and numeric numbers to numeric (decimal)
# Source 1 begins
to_numeric <- function(n) {
p <- c(if (length(n) == 2) 0, as.numeric(n), 0:1)
p[1] + p[2] / p[3]
}
# Source 1 ends
Source 1 is Convert a character vector of mixed numbers, fractions, and integers to numeric
max_size_aggr <- 3 / 4
water_nonair <- structure(list(`Slump (in.)` = c("1 to 2", "3 to 4", "6 to 7",
"Approximate amount of entrapped air in nonair- entrained concrete (%)"), `3/8 in.` =
c(350, 385, 410, 3), `1/2 in.` = c(335, 365, 385, 2.5), `3/4 in.` = c(315, 340, 360, 2),
`1 in.` = c(300, 325, 340, 1.5), `1 1/2 in.` = c(275, 300, 315, 1), `2 in.` =
c(260, 285, 300, 0.5), `3 in.` = c(220, 245, 270, 0.3), `6 in.` = c(190, 210, NA, 0.2)),
.Names = c("Slump (in.)", "3/8 in.", "1/2 in.",
"3/4 in.", "1 in.", "1 1/2 in.", "2 in.", "3 in.", "6 in."), row.names = c(NA, -4L),
class = c("data.table", "data.frame"))
setnames(water_nonair, c("Slump (in.)", "3/8 in.", "1/2 in.", "3/4 in.", "1 in.",
"1 1/2 in.", "2 in.", "3 in.", "6 in."))
water_nonair_col_numeric <- gsub(" in.", "", colnames(water_nonair)[2:ncol(water_nonair)])
water_nonair_col_numeric <- sapply(strapplyc(water_nonair_col_numeric, "\\d+"), to_numeric)
# Source 1
New way (data.table 1.10.0)
water_nonair_column <- which(water_nonair_col_numeric %==% max_size_aggr)+1L
# [1] 4
water_nonair[2, water_nonair_column][[1]]
# [1] 4
Why does the following work when I spell out the column index directly, while the above, where water_nonair_column also has the value 4, does not?
water_nonair[2, 4][[1]]
# [1] 340
Old way (data.table 1.9.6)
water_nonair[2, which(water_nonair_col_numeric %==% max_size_aggr)+1L, with = FALSE][[1]]
# [1] 340
I removed the with = FALSE from the function after reading the data.table news after the release of version 1.9.8.
The long note 3 in v1.9.8 NEWS starts:
When j contains no unquoted variable names (whether column names or not), with= is now automatically set to FALSE. Thus ...
But your j does contain an unquoted variable name. In fact, it is solely an unquoted variable name. So that item does not apply to it.
That's what options(datatable.WhenJisSymbolThenCallingScope=TRUE) was about, so you could try out the new feature going forward. Please read that same NEWS item again. If you set that option, it will work as you expected.
HOWEVER, please don't, because yesterday I changed it and in development that option has now gone. A migration timeline is no longer needed. The new strategy needs no code changes and has no breakage. Please see the new notes in the latest development NEWS for v1.10.1; I won't copy them here to save duplication.
So going forward, when j is a symbol (i.e. an unquoted variable name) you either still need with=FALSE:
water_nonair[2, water_nonair_column, with=FALSE]
or you can use the new .. prefix from v1.10.1, added yesterday:
water_nonair[2, ..water_nonair_column]
Otherwise, if j is a symbol it must be a column name, for safety, consistency and backwards compatibility. If not, you'll now get the new, more helpful error message:
DT = data.table(a=1:3, b=4:6)
myCols = "b"
DT[,myCols]
Error in `[.data.table`(DT, , myCols) :
j (the 2nd argument inside [...]) is a single symbol but column name
'myCols' is not found. Perhaps you intended DT[,..myCols] or
DT[,myCols,with=FALSE]. This difference to data.frame is deliberate
and explained in FAQ 1.1.
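For completeness, both supported forms then return the selected column on that same toy data.table (a quick sketch of the expected output):
DT = data.table(a=1:3, b=4:6)
myCols = "b"
DT[, myCols, with=FALSE]   # still supported
#    b
# 1: 4
# 2: 5
# 3: 6
DT[, ..myCols]             # new ".." prefix, v1.10.1+
#    b
# 1: 4
# 2: 5
# 3: 6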
As mentioned in NEWS, I reran all 313 CRAN and Bioconductor packages that use data.table against data.table v1.10.1 and 2 of them do break with this change. But that is what we want because they do have a bug (the value of j in calling scope is being returned literally which cannot be what was intended). I've informed their maintainers. This is exactly what we wanted to reveal and improve. The other 311 packages all pass with this change. It doesn't rely on test coverage (which is weak for many packages). The new error happens when j is a symbol that isn't a column, whether there's a test for the result or not.
