Suppressing column names within a matrix and looping function - r

I am attempting to create my first custom function in R (yay!). I've got something that sort of works now but I think it could be improved.
Basically, I want to create my own custom table within R that can be run through xtable for a final report. I want the table to follow this format for each column:
group1mean, group1sd, group2mean, group2sd, t-value, p-value.
Currently, my function does this. However, it produces column names (e.g., V3 and V4) that I would like to leave blank, and I would like it to loop through multiple dependent variables and append the results as new rows in the matrix automatically. Right now, I have to write a line of code for each dependent variable manually (in the example below, the DVs are PWB, SWB, and EWB).
Here is my code so far:
data <- read.delim("~/c4044sol.txt", header=T)
library(psych)
proc.ttest <- function(dv,group,decimals) {
x1 <- describeBy((dv), (group), mat=TRUE)
stat1 <- t.test((dv) ~ (group))
output1 <- c(paste (round(x1$mean[1], digits=(decimals)),"(", round(x1$sd[1], digits= (decimals)), ")", sep =" "),
paste (round(x1$mean[2], digits=(decimals)), "(", round(x1$sd[2], digits=(decimals)), ")", sep =" "),
round(stat1$statistic, digits=2), round(stat1$p.value, digits=3))
return(output1)
}
toprow <- c("M (SD)", "M (SD)", "t", "p")
outtable <- rbind(toprow,
proc.ttest(data$PWB, data$college, 2),
proc.ttest(data$SWB, data$college, 2),
proc.ttest(data$EWB, data$college, 2))
colnames(outtable) <- c("College graduate", "Less than college graduate", "", "")
row.names(outtable) <- c("", "PWB", "SWB", "EWB")
library(xtable)
xtable(outtable)
So to repeat, I would like to suppress the column names "V3" and "V4" (leave them blank) and make the code run automatically on a list of variables. Are either of these things possible? Thanks for your time.

Try keeping outtable as you have it, but without toprow.
Instead, use toprow as the names:
toprow <- c("M (SD)", "M (SD)", "t", "p")
outtable <- rbind( # toprow,
proc.ttest(data$PWB, data$college, 2),
proc.ttest(data$SWB, data$college, 2),
proc.ttest(data$EWB, data$college, 2))
colnames(outtable) <- toprow  # outtable is a matrix, so use colnames(); names() would not label its columns
## note that the parens and spaces are
## not best practices, but this should still
## get you your desired results

I fixed the extra column labels printing issue by putting all the labels I actually wanted in the final table in the first two rows of the matrix...
toptoprow <- c("College graduate", "Less than college graduate", "", "")
toprow <- c("M (SD)", "M (SD)", "t", "p")
outtable <- rbind(toptoprow,toprow, proc.ttest(PWB, college, 2),
proc.ttest(SWB, college, 2),
proc.ttest(EWB, college, 2))
And then suppressing the colnames using the print function (as suggested by Ricardo)...
print(xtable(outtable), hline.after=c(-1,1,nrow(outtable)),include.colnames=FALSE)
I would still like to automate the function itself so that, ideally, I can give it a list of variable names, have it run on each variable, and populate the final matrix with the results. But one baby step at a time...
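For the remaining "run it over a list of variables" part, here is a minimal sketch rather than a definitive solution. It assumes a base-R variant of proc.ttest (the helper name ttest_row, the simulated data frame dat and the use of tapply instead of psych::describeBy are all my own stand-ins, not part of the original code). The idea is to lapply over the variable names and rbind the resulting rows:

```r
# Hypothetical stand-in for the poster's data: three DVs and a two-level group.
set.seed(1)
dat <- data.frame(
  PWB = rnorm(40), SWB = rnorm(40), EWB = rnorm(40),
  college = rep(c("yes", "no"), each = 20)
)

# Base-R variant of proc.ttest: one "M (SD)" cell per group, then t and p.
ttest_row <- function(dv, group, decimals = 2) {
  m  <- tapply(dv, group, mean)
  s  <- tapply(dv, group, sd)
  tt <- t.test(dv ~ group)
  c(paste0(formatC(m, format = "f", digits = decimals), " (",
           formatC(s, format = "f", digits = decimals), ")"),
    sprintf("%.2f", tt$statistic),
    sprintf("%.3f", tt$p.value))
}

# Loop over the DV names and stack one row per variable.
dvs  <- c("PWB", "SWB", "EWB")
rows <- lapply(dvs, function(v) ttest_row(dat[[v]], dat$college))
outtable <- do.call(rbind, rows)
rownames(outtable) <- dvs
colnames(outtable) <- c("M (SD)", "M (SD)", "t", "p")
outtable
```

The resulting character matrix can be passed to xtable() in the same way as the hand-built one above; adding a DV then only means adding its name to dvs.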

String conversion to array: Opening hours (over a week)

I've done an OSM-extraction and here you can see the column "osm_openin" for the opening hours for each object in R.
It has the following structure:
I would love to have new columns for each day of the week, containing either the symbol "X" (if the place is closed all day) or the opening hours for that day, e.g. "07:00 - 21:00".
My solution:
First, I am thinking of using representative values for the weekdays: "Mo = 1", "Tu = 2", ..., "Su = 7". This matters when a day is not mentioned explicitly but falls inside an interval.
For each value, I search for it in the column.
If the value is found, I take the opening hours that follow directly after it (I don't know which R command to use for that).
If not, the value has to be inside an interval. For example, if "2" (Tuesday) is not present, the script needs to recognize that Tuesday lies between Mo-Sa (I don't know which method to use for that).
Public Holiday is not important.
Any suggestion for a solution?
Thanks.
I don't know the best way, but maybe I can help.
First, we need to create a vector of weekdays:
wdays <- c("Mo", "Tu", "We", "Th", "Fr", "Sa", "Su")
Now let's write code that converts text like "Mo,We-Fr" to the vector c(1, 3, 4, 5). Algorithm:
Delete the holiday tokens ("PH", "SH");
Replace each weekday name with its number ("Mo" --> 1, "Tu" --> 2, etc.);
Replace - with :. For example, 3-5 becomes 3:5, which is valid R code;
Add c( to the beginning and ) to the end. For example, 1,3:5 becomes c(1,3:5);
c(1,3:5) is R code for a vector, so we can build the vector from the text with eval(parse(text = "c(1,3:5)")).
Full code:
GetWDays <- function(x, wdays) {
holi <- c("PH", "SH")
x <- gsub(paste0("(,|^)", holi, collapse = "|"), "", x) #delete holidays
for (i in 1:7) {
x <- gsub(wdays[i], i, x)
}
x <- gsub("-", ":", x)
x <- paste0("c(", x, ")")
wday_idx <- eval(parse(text = x))
return(wday_idx)
}
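If you would rather not run eval(parse()) on text that comes from outside, the same conversion can be done by splitting and matching. This is only a sketch under my own assumptions: the function name day_indices is hypothetical, and it handles the same simple forms as GetWDays (single days, comma lists and forward ranges such as "We-Fr", but not wrap-around ranges such as "Sa-Mo").

```r
wdays <- c("Mo", "Tu", "We", "Th", "Fr", "Sa", "Su")

# Hypothetical eval-free variant of GetWDays.
day_indices <- function(spec, wdays) {
  parts <- strsplit(spec, ",", fixed = TRUE)[[1]]
  parts <- parts[!parts %in% c("PH", "SH", "")]       # drop holiday tokens
  idx <- lapply(parts, function(p) {
    ends <- match(strsplit(p, "-", fixed = TRUE)[[1]], wdays)
    if (length(ends) == 1) ends else seq(ends[1], ends[2])  # expand "We-Fr"
  })
  as.integer(unlist(idx))
}

day_indices("Mo,We-Fr", wdays)  # 1 3 4 5
day_indices("Su,PH", wdays)     # 7
```

match() looks each day name up in wdays, so no text is ever evaluated as code.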
Now let's create a function that takes opening hours (like "Mo-Fr 6:30-19:00;Sa 09:00-17:00;Su,PH 09:00-15:00") as input and returns a data.frame with 7 columns (one per weekday). Algorithm:
Split the text by ;. From here on we work with one part at a time (for example, "Mo-Fr 6:30-19:00");
Split that part by a space: "Mo-Fr 6:30-19:00" --> "Mo-Fr" and "6:30-19:00";
Pass the first part ("Mo-Fr") to GetWDays and repeat the second part so that it has the same length. Example: "Mo-Fr" --> c(1,2,3,4,5), "6:30-19:00" --> rep("6:30-19:00", 5);
Make a data.frame from the 2 vectors (Day and Time);
Use bind_rows to combine the data.frames of all parts from the first step. Now we have one big data.frame, but some weekdays may be missing, and some weekdays may have "Off" in the Time column;
So add rows for the missing weekdays (with merge) and replace "Off" and NA with "X" (as you want);
Transpose the data.frame and return it.
Full code:
library(dplyr)  # provides bind_rows, bind_cols, arrange and %>%
GetTimetable <- function(x) {
wdays <- c("Mo", "Tu", "We", "Th", "Fr", "Sa", "Su")
tmp <- strsplit(strsplit(x, ";")[[1]], " ")
tmp <- lapply(tmp, function(x) {Day <- GetWDays(x[1], wdays); data.frame(Day, Time = rep(x[2], length(Day)))})
tmp <- bind_rows(tmp) %>% arrange(Day) %>% as.data.frame()
tmp <- merge(data.frame(Day = 1:7), tmp, all.x = T, by = "Day")
tmp$Time[is.na(tmp$Time) | tmp$Time == "Off"] = "X"
tmp <- tmp %>% t() %>% "["(2, ) %>% as.list() %>% setNames(wdays) %>% bind_cols()
return(tmp)
}
If you want to apply GetTimetable for each row you can use this code:
df_time <- df$osm_openin %>% lapply(GetTimetable) %>% bind_rows()
And if you want to add this data.frame to your data you can do something like this:
df <- bind_cols(df, df_time)

How to use regex over entire dataframe in R

new user to R so please go easy on me.
I have dataframe like:
df = data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
Confidence = c("ZLow", "High", "Med"),
Coverage = c("sub", "sub", "super"),
Aspect = c("ZPos", "ZUnd", "Neg"))
actual file is much larger and outputted from old hardware. For some reason some entries have "Z" put in front of them. How do I remove from entire dataset?
I tried df = gsub("Z", " ", df) but it just gives me nonsense. This darn thing!
[1] "1:3" "c(3, 1, 2)" "c(1, 1, 2)" "c(2, 3, 1)"
Looked on here at stackoverflow and tried stringr package but could also not get to work. Anyone know what to do?
Your approach with gsub() is not working because that function operates on vectors, and not dataframes. However, you can apply gsub() over each column of your dataframe to get what you want:
df[] <- lapply(df, function (x) {gsub("Z", "", x)})
For a stringr solution (that also uses dplyr), try:
library(tidyverse)
df <- mutate_all(df,
funs(str_replace_all(., "Z", "")))
P.S. I recommend using df <- instead of df = in the future. Good luck!
EDIT: corrected typo - thanks @thelatemail
You may use a simple ^Z regex in the following way:
df = data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
Confidence = c("ZLow", "High", "Med"),
Coverage = c("sub", "sub", "super"),
Aspect = c("ZPos", "ZUnd", "Neg"))
df[] <- lapply(df, sub, pattern = '^Z', replacement ="")
> df
Mineral Confidence Coverage Aspect
1 feldspar Low sub Pos
2 granite High sub Und
3 Silica Med super Neg
The ^Z pattern matches the start of the string with the ^ anchor, and then Z is matched and removed using sub (as there is only one possible match in each string, there is no point in using gsub).
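To make the difference between the anchored and the unanchored pattern concrete, here is a small sketch. The values "ZZircon" and "Topaz Zone" are made up for illustration and are not from the question's data:

```r
# Hypothetical values: one stray prefix, one real leading Z, one Z inside the text.
x <- c("Zfeldspar", "ZZircon", "Topaz Zone")

sub("^Z", "", x)   # "feldspar" "Zircon"  "Topaz Zone"  (only a leading Z is removed)
gsub("Z", "", x)   # "feldspar" "ircon"   "Topaz one"   (every Z is removed)

# The same anchored replacement applied to every column of a data frame.
df <- data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                 Confidence = c("ZLow", "High", "Med"))
df[] <- lapply(df, sub, pattern = "^Z", replacement = "")
df
```

If a genuine value can start with or contain a capital Z, the anchored sub() is the safer choice, since it strips at most one character per entry.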
You are close. If you want to go with base gsub
data$Mineral = gsub("Z", "", data$Mineral)
You can do this for all columns. Or use a combination of apply strategies (see other answers!)
PS. Naming your data data is not a good idea. At least do my_data
You could do:
as.data.frame(sapply(data, function(x) {gsub("Z", "", x)}))
You asked how to do it with the stringr (or stringi) package while avoiding the unwanted vector of indices you got:
> as.data.frame(apply(df, 2,
function(col) stringr::str_replace_all(col, '^Z', '')))
> as.data.frame(apply(df, 2,
function(col) stringi::stri_replace_first_regex(col, '^Z', '')))
Mineral Confidence Coverage Aspect
1 feldspar Low sub Pos
2 granite High sub Und
3 Silica Med super Neg
(where the as.data.frame() call is needed to turn the output array back into a data frame; see "R: apply-like function that returns a data frame?")
As for figuring out how exactly to call a str*_replace function over an entire data frame, I tried...
the entire df: stri_replace_first_regex(df, '^Z', '')
by rows: stri_replace_first_regex(df[1,], '^Z', '')
by columns: stri_replace_first_regex(df[,1], '^Z', '')
Only the last one works properly. Admittedly this is a design flaw in the str*_replace functions: at a minimum they should recognize an invalid object and produce a useful error message instead of spewing out indices.

Altering padding when combining tableGrob and ggplot objects in R

I am trying to combine multiple tableGrob objects with a ggplot object into a single .png file; I am having trouble understanding how to edit the tableGrob theme parameters in a way that lets me adjust the padding and dimensions of the table objects. Ideally, I want them to be in a 4*1 grid, with minimal padding between each. The text of the table objects should also be left-justified.
I am working with dummy data, and each row of my input dataset will be used to create its own .png file (two rows are included in the code snippet below to generate a reproducible example).
I tried to use this post as an example, and set the grid.arrange spacing based on the "heights" attributes of each table object, but this hasn't quite done the trick. As a side note, right now, the plot gets overwritten each time; I'll fix this later, am just concerned with getting the output dimensions/arrangement correct. Code is below; edited to include library calls, and fixed a typo:
require("ggplot2")
require("gridExtra")
require("grid")
# Generate dummy data frame
sampleVector <- c("1", "Amazing", "Awesome", "0.99", "0.75", "0.5", "$5,000.00", "0.55", "0.75", "0.31", "0.89", "0.25", "Strong community support", "Strong leadership", "Partners had experience", "", "CBO not supportive", "Limited experience", "Limited monitoring", "")
sampleVectorB <- c("3", "Amazing", "Awesome", "0.99", "0.75", "0.5", "$5,000.00", "0.55", "0.75", "0.31", "0.89", "0.25", "Strong community support", "Strong leadership", "Partners had experience", "", "CBO not supportive", "Limited experience", "Limited monitoring", "")
sampleDF <- data.frame(rbind(sampleVector, sampleVectorB))
colnames(sampleDF) <- c("CBO", "PMQ", "HMQ", "ER", "PR", "HR", "NS", "CTI", "Home and Hosp", "Home", "Phone", "Other", "S1", "S2", "S3", "S4", "C1", "C2", "C3", "C4")
indata <- sampleDF
#Finds the longest string from a vector of strings (i.e. #chars, incld. whitespace); returns index into this vector that corresponds to this string
findMax <- function(tempVector){
tempMaxIndex <- 1
for(i in 1:length(tempVector)){
print(nchar(tempVector[i]))
if(nchar(tempVector[i]) > nchar(tempVector[tempMaxIndex])){
tempMaxIndex <- i
}
}
return(tempMaxIndex)
}
# Same as above but w/o the colon:
addWhitespacePlain <- function(stringVec, maxNum){
for(i in 1:length(stringVec))
{
string <- stringVec[i]
while(nchar(string) < maxNum+1){
string <- paste(string, " ")
}
stringVec[i] <- string
}
return(stringVec)
}
staticText <- c("Participant Match Quality", "Hospital-Level Match Quality", "Enrollment Rate", "Participant Readmissions", "Hospital Readmissions", "Net Savings",
"Strengths", "Challenges")
m <- findMax(staticText)
staticText <- addWhitespacePlain(staticText, nchar(staticText[m]))
# Loop through our input data and keep only one CBO each pass
for(i in 1:length(indata$CBO)){
# Select only the row that has this CBO's data
temp <- indata[i,]
###############################################################################################
# MAKE TOP TEXT TABLE (as a DF)
# Get values from our input data set to fill in the values for the top text portion of the graphic
topVals <- t(data.frame(temp[2], temp[3], temp[4], temp[5], temp[6], temp[7]))
topDF <- data.frame(staticText[1:6], topVals, row.names=NULL)
colnames(topDF) <- c(paste("CBO", temp[1]), " ")
# Find which of the strings from the top text portion is the longest (i.e. max # of chars; including whitespace)
m2 <- findMax(rownames(topDF)) # returns an index into the vector; this index corresponds to string w/max num of chars
# Add whitespace to non-max strings so all have the same length and also include colons
rownames(topDF) <- addWhitespacePlain(rownames(topDF), nchar(rownames(topDF)[m2]))
# for testing
# print(topDF, right=FALSE)
###############################################################################################
# MAKE BAR CHART
#Subset the data to select the vars we need for the horizontal bar plot
graphdata <- t(data.frame(temp[,8:12]))
vars <- c("CTI", "Home & Hosp.", "Home", "Phone", "Other")
graphDF <- data.frame(vars, graphdata, row.names = NULL)
colnames(graphDF) <- c("vars", "values")
# Make the plot (ggplot object)
barGraph <- ggplot(graphDF, aes(x=vars, y=values,fill=factor(vars))) +
geom_bar(stat = "identity") +
theme(axis.title.y=element_blank())+
theme(legend.position="none")+
coord_flip()
# print(barGraph)
###############################################################################################
# MAKE BOTTOM TEXT TABLE
strengths <- t(data.frame(temp[13], temp[14], temp[15], temp[16]))
challenges <- t(data.frame(temp[17], temp[18], temp[19], temp[20]))
#Drop nulls
strengths <- data.frame(strengths[which(!is.na(strengths)),], row.names=NULL)
challenges <- data.frame(challenges[which(!is.na(challenges)),], row.names=NULL)
colnames(strengths) <- c(staticText[7])
colnames(challenges) <- c(staticText[8])
###############################################################################################
# OUTPUT (padding not resolved yet)
# Set the path for the combined image
png("test1", height=1500, width=1000)
#customTheme <- ttheme_minimal(core=list(fg_params=list(hjust=0, x=0.1)),
# rowhead=list(fg_params=list(hjust=0, x=0)))
# top<-tableGrob(topDF, theme=customTheme)
# bottom_strength <- tableGrob(strengths, theme=customTheme)
# bottom_challenges <- tableGrob(challenges, theme=customTheme)
top<-tableGrob(topDF)
bottom_strength <- tableGrob(strengths)
bottom_challenges <- tableGrob(challenges)
x <- sum(top$heights)
y <- sum(bottom_strength$heights)
z <- sum(bottom_challenges$heights)
grid.arrange(top, barGraph, bottom_strength, bottom_challenges,
as.table=TRUE,
heights=c(2, 1, 2, 2),
nrow = 4)
# heights= unit.c(x, unit(1), y, z))
dev.off()
}
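Your example is too complicated, so I'm not sure exactly what the issue is. The following 4x1 layout has zero padding; is that what you're after?
library(ggplot2)    # qplot(), theme() and the midwest data set
library(grid)       # unit(), unit.c(), grid.newpage()
library(gridExtra)  # tableGrob(), grid.arrange()
ta <- tableGrob(iris[1:4,1:2])
tb <- tableGrob(mtcars[1:3,1:3])
tc <- tableGrob(midwest[1:5,1:2])
p <- qplot(1,1) + theme(plot.background=element_rect(colour = "black"))
# each table gets exactly its own height; the plot takes whatever space is left ("null" unit)
h <- unit.c(sum(ta$heights), unit(1,"null"), sum(tb$heights), sum(tc$heights))
grid.newpage()
grid.arrange(ta,p,tb,tc, heights=h)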
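For the left-justification and cell padding part of the question, a theme along the lines of the commented-out customTheme might be enough. This is a sketch under assumptions: the name left_theme and the numbers (x = 0.05, 2 mm padding) are illustrative choices of mine, and a grey rectangle stands in for the ggplot object so that the example only needs grid and gridExtra.

```r
library(grid)
library(gridExtra)

# Left-justified cell text with reduced cell padding; the values are illustrative.
left_theme <- ttheme_minimal(
  core    = list(fg_params = list(hjust = 0, x = 0.05)),
  colhead = list(fg_params = list(hjust = 0, x = 0.05)),
  padding = unit(c(2, 2), "mm")
)

ta <- tableGrob(iris[1:4, 1:2], rows = NULL, theme = left_theme)
tb <- tableGrob(mtcars[1:3, 1:3], theme = left_theme)

# Tables get exactly their own height; the middle slot takes the remaining space.
h <- unit.c(sum(ta$heights), unit(1, "null"), sum(tb$heights))
grid.newpage()
grid.arrange(ta, rectGrob(gp = gpar(fill = "grey90")), tb, nrow = 3, heights = h)
```

hjust = 0 together with a small x moves the text to the left edge of each cell, and padding controls the space around the text inside the cells, which in turn changes the heights that the layout above picks up.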

Using paste to create logical expression for data frame subset

I have two dataframes, remove and dat (the actual dataframe). remove specifies various combinations of the factor variables found in dat, and how many to sample (remove$cases).
Reproducible example:
set.seed(83)
dat <- data.frame(RateeGender=sample(c("Male", "Female"), size = 1500, replace = TRUE),
RateeAgeGroup=sample(c("18-39", "40-49", "50+"), size = 1500, replace = TRUE),
Relationship=sample(c("Direct", "Manager", "Work Peer", "Friend/Family"), size = 1500, replace = TRUE),
X=rnorm(n=1500, mean=0, sd=1),
y=rnorm(n=1500, mean=0, sd=1),
z=rnorm(n=1500, mean=0, sd=1))
What I am trying to accomplish is to read in a row from remove and use it to subset dat. My current approach looks like:
remove <- expand.grid(RateeGender = c("Male", "Female"),
RateeAgeGroup = c("18-39","40-49", "50+"),
Relationship = c("Direct", "Manager", "Work Peer", "Friend/Family"))
remove$cases <- c(36,34,72,58,47,38,18,18,15,22,17,10,24,28,11,27,15,25,72,70,52,43,21,27)
# For each row of remove (combination of factor levels:)
for (i in 1:nrow(remove)) {
selection <- character()
# For each column of remove (particular selection):
for (j in 1:(ncol(remove)-1)){
add <- paste0("dat$", names(remove)[j], ' == "', remove[i,j], '" & ')
selection <- paste0(selection, add)
}
selection <- sub(' & $', '', selection) # Remove trailing ampersand
cat(selection, sep = "\n") # What does selection string look like?
tmp <- sample(dat[selection, ], size = remove$cases[i], replace = TRUE)
}
The output from cat() while the loop runs looks right, for example: dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct", and if I paste that into dat[dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct" ,], I get the right subset.
However, if I run the loop as written with dat[selection, ], each subset only returns NAs. I get the same outcome if I use subset(). Note, I have replace = TRUE in the above solely because of the random sampling. In the actual application, there will always be more cases per combination than required.
I know I can dynamically construct formulas for lm() and other functions using paste() in this way, but am obviously missing something in translating this into working with [,].
Any advice would be really appreciated!
You cannot use character expressions as you describe to subset either with [ or subset. If you wanted to do that you would have to construct the entire expression, and then use eval. That said, there is a better solution using merge. For example, let's get all the entries in dat that match the first two rows from remove:
merge(dat, remove[1:2,])
If we want all the rows that don't match those two, then:
subset(merge(dat, remove[1:2,], all.x=TRUE), is.na(cases))
This is assuming you want to join on the columns with the same names across the two tables. If you have a lot of data you should consider using data.table as it is very fast for this type of operation.
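To illustrate why dat[selection, ] returns NAs: a character index is looked up among the row names, and no row is named after the expression text. Here is a sketch of two ways around that, using a smaller made-up dat; the names crit, keep1 and keep2 are hypothetical. One evaluates the string, the other builds the logical mask directly from column/value pairs and needs no string pasting at all.

```r
set.seed(83)
dat <- data.frame(
  RateeGender   = sample(c("Male", "Female"), 200, replace = TRUE),
  RateeAgeGroup = sample(c("18-39", "40-49", "50+"), 200, replace = TRUE),
  X = rnorm(200)
)

# Option 1: turn the pasted string into an expression and evaluate it.
selection <- 'dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39"'
keep1 <- eval(parse(text = selection))

# Option 2: build the logical mask from column/value pairs, without eval.
crit  <- list(RateeGender = "Male", RateeAgeGroup = "18-39")
keep2 <- Reduce(`&`, Map(function(col, val) dat[[col]] == val, names(crit), crit))

subset_rows <- dat[keep2, ]
# Sample rows (not columns) from the subset.
picked <- subset_rows[sample(nrow(subset_rows), 5, replace = TRUE), ]
```

In the loop from the question, crit would correspond to one row of remove without the cases column.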
I upvoted BrodieG's answer before I realized it doesn't do what you wanted in situations where the size of the category is smaller than the number of samples desired. (In fact his method doesn't really do sampling at all, but I think it is an elegant solution to a different question so I'm not reversing my vote. And you could use a similar split strategy as illustrated below with that data.frame as the input.)
sub <- lapply( split(dat, with(dat, paste(RateeGender, # split vector
RateeAgeGroup,
Relationship, sep="_")) ),
function (d) { n <- with(remove, remove[
RateeGender==d$RateeGender[1]&
RateeAgeGroup==d$RateeAgeGroup[1]&
Relationship==d$Relationship[1],
"cases"])
cat(n, "") # show the number of cases requested for this group
# sample row indices; sample(d, ...) would sample columns of the data frame
d[sample(nrow(d), n, replace=TRUE), ] } )

Column names of a data.frame separated with comma

I want to get the column names of a data.frame separated by commas (,). I remember getting this result in the past, but I have forgotten the command.
df<- data.frame(x=1:10, y=11:20)
names(df)
Output
"x" "y"
Desired Output
c("x", "y")
The easiest way to get exactly what it sounds like you're asking for (without knowing exactly how you plan to use this information) is to use dput:
dput(names(df))
# c("x", "y")
By extension, without fussing with paste:
x <- capture.output(dput(names(df)))
x
# [1] "c(\"x\", \"y\")"
cat(x)
# c("x", "y")
Although @Jilber deleted his answer, you can use shQuote to go from what he had started with to the output of "x" above (type = "cmd" gives double quotes; the default type uses single quotes):
paste("c(", paste(shQuote(names(df), type = "cmd"), collapse = ", "), ")", sep = "")
# [1] "c(\"x\", \"y\")"
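If you want to control the quoting yourself instead of relying on dput or shQuote, a sprintf-based version is one more option. This is a minimal sketch; the names quoted and out are just illustrative:

```r
df <- data.frame(x = 1:10, y = 11:20)

# Wrap each name in double quotes, then join with commas inside c( ).
quoted <- sprintf('"%s"', names(df))
out <- paste0("c(", paste(quoted, collapse = ", "), ")")
cat(out, "\n")
# c("x", "y")

# Round trip: parsing the text gives back the original names.
identical(eval(parse(text = out)), names(df))
```

Note that, unlike dput, this does not escape quotes or backslashes inside column names, so it is only suitable for plain names.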
