how do you subset a data frame based on a variable name - r

my data frame called d:
dput(d)
structure(list(Hostname = structure(c(8L, 8L, 9L, 5L, 6L, 7L,
1L, 2L, 3L, 4L), .Label = c("db01", "db02", "farm01", "farm02",
"tom01", "tom02", "tom03", "web01", "web03"), class = "factor"),
Date = structure(c(6L, 10L, 5L, 3L, 2L, 1L, 8L, 9L, 7L, 4L
), .Label = c("10/5/2015 1:15", "10/5/2015 1:30", "10/5/2015 2:15",
"10/5/2015 4:30", "10/5/2015 8:30", "10/5/2015 8:45", "10/6/2015 8:15",
"10/6/2015 8:30", "9/11/2015 5:00", "9/11/2015 6:00"), class = "factor"),
Cpubusy = c(31L, 20L, 30L, 20L, 18L, 20L, 41L, 21L, 29L,
24L), UsedPercentMemory = c(99L, 98L, 95L, 99L, 99L, 99L,
99L, 98L, 63L, 99L)), .Names = c("Hostname", "Date", "Cpubusy",
"UsedPercentMemory"), class = "data.frame", row.names = c(NA,
-10L))
In a loop I need to go through this data frame based on the metrics variable. I need to create a subset data frame for summarization:
metrics<-as.vector(unique(colnames(d[,c(3:4)])))
for (m in metrics){
sub<-d[,c(1,m)]
}
I cannot use m in this subset line, any ideas how I could subset data frame based on a variable name?

In your subsetting call you are mixing a column index and a column name. c(1, m) coerces the 1 to the string "1", so R looks for a column named "1" and fails with an "undefined columns selected" error.
Either use column names:
for (m in metrics) {
sub <- d[, c(colnames(d)[1], m)]
}
Or indexes:
for (i in 3:4) {
sub <- d[, c(1, i)]
}
Having said that, for loops in R are usually for cases where dynamic assignments are needed or for calling functions with side effects or some other relatively unusual case. Creating a summary by slicing and dicing data in for loops is almost never the proper way to do it in R. If the usual functional tools are not enough there are fantastic packages like plyr, dplyr, etc that let you split-apply-combine your data in very convenient and idiomatic ways.
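As a rough illustration of both points, here is a minimal base R sketch. It rebuilds the first four rows of d by hand (without the Date column) so that it runs on its own; the names subsets and means are made up for this example, and aggregate stands in for the plyr/dplyr tools mentioned above.

```r
# First four rows of d from the question, rebuilt by hand
d <- data.frame(
  Hostname = c("web01", "web01", "web03", "tom01"),
  Cpubusy = c(31L, 20L, 30L, 20L),
  UsedPercentMemory = c(99L, 98L, 95L, 99L)
)
metrics <- c("Cpubusy", "UsedPercentMemory")

# Why the original call fails: mixing a number and a string coerces both to character
mixed <- c(1, "Cpubusy")   # the 1 becomes the string "1"

# Subset by name only: one small data frame per metric, no index/name mixing
subsets <- lapply(metrics, function(m) d[, c("Hostname", m)])
names(subsets) <- metrics

# Loop-free summary: mean of every metric per host
means <- aggregate(d[metrics], by = d["Hostname"], FUN = mean)
means
```

Building the selection entirely from strings (or entirely from integers) is what keeps the subsetting call unambiguous.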


What is the best way to use agricolae to do ANOVAs on a split plot design?

I'm trying to run some ANOVAs on data from a split plot experiment, ideally using the agricolae package. It's been a while since I've taken a stats class and I wanted to be sure I'm analyzing this data correctly, so I did some searching online and couldn't really find consistency in the way people were analyzing their split plot experiments. What is the best way for me to do this?
Here's the head of my data:
dput(head(rawData))
structure(list(ï..Plot = 2111:2116, Variety = structure(c(5L,
4L, 3L, 6L, 1L, 2L), .Label = c("Burbank", "Hodag", "Lamoka",
"Norkotah", "Silverton", "Snowden"), class = "factor"), Rate = c(4L,
4L, 4L, 4L, 4L, 4L), Rep = c(1L, 1L, 1L, 1L, 1L, 1L), totalTubers = c(594L,
605L, 656L, 729L, 694L, 548L), totalOzNoCulls = c(2544.18, 2382.07,
2140.69, 2401.56, 2440.56, 2503.5), totalCWTacNoCulls = c(461.76867,
432.345705, 388.535235, 435.88314, 442.96164, 454.38525), avgLWratio = c(1.260615419,
1.287949374, 1.111981583, 1.08647584, 1.350686661, 1.107173509
), Hollow = c(14L, 15L, 22L, 25L, 14L, 13L), Double = c(10L,
13L, 15L, 22L, 11L, 9L), Knob = c(86L, 80L, 139L, 156L, 77L,
126L), Researcher = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "Wang", class = "factor"),
CullsPounds = c(1.75, 1.15, 4.7, 1.85, 0.8, 5.55), CullsOz = c(28,
18.4, 75.2, 29.6, 12.8, 88.8), totalOz = c(2572.18, 2400.47,
2215.89, 2431.16, 2453.36, 2592.3), totalCWTacCulls = c(466.85067,
435.685305, 402.184035, 441.25554, 445.28484, 470.50245)), row.names = c(NA,
6L), class = "data.frame")
For these data, the whole plot is Rate, the split plot is Variety, the block is Rep, and for discussion's sake here, we can look at totalCWTacNoCulls as the response.
Any help would be very much appreciated! I am still getting the hang of Stack Overflow, so if I have made any mistakes or shared my data wrong, please let me know and I'll change it. Thank you!
You can do this with the agricolae package as follows:
library(agricolae)
attach(rawData)
Rate = factor(Rate)
Variety = factor(Variety)
Rep = factor(Rep)
sp.plot(Rep, Rate, Variety, totalCWTacNoCulls)
The usage according to the agricolae documentation is
sp.plot(block, pplot, splot, Y)
where block is the replication (block) factor, pplot is the main-plot factor, splot is the sub-plot factor, and Y is the response variable.
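For comparison, one common base R formulation of a split-plot ANOVA with blocks uses aov with an Error term. The sketch below is an assumption-laden illustration, not the agricolae method itself: the data frame sim, its yield column, and the rate levels are simulated placeholders, because the six rows shown in the question contain only one Rate and one Rep.

```r
# Simulated, purely hypothetical data: 4 blocks x 3 rates x 3 varieties
set.seed(1)
sim <- expand.grid(
  Variety = c("Burbank", "Hodag", "Lamoka"),
  Rate = c(2, 4, 6),
  Rep = 1:4
)
sim$yield <- rnorm(nrow(sim), mean = 450, sd = 25)  # placeholder response

# All design variables must be factors
sim$Variety <- factor(sim$Variety)
sim$Rate <- factor(sim$Rate)
sim$Rep <- factor(sim$Rep)

# Whole plots (Rate) are nested in blocks (Rep); Variety is the split-plot factor.
# Rate is tested in the Rep:Rate stratum, Variety and Rate:Variety in "Within".
fit <- aov(yield ~ Rate * Variety + Error(Rep / Rate), data = sim)
summary(fit)
```

In a balanced design this should partition the variation the same way a standard split-plot table does, but it is worth checking the degrees of freedom in each stratum against your own layout.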

how do you print table in knitr

I am trying to use knitr to print a data frame in table format using xtable:
```{r xtable,fig.width=10, fig.height=8, message=FALSE, results = 'asis', echo=FALSE, warning=FALSE, fig.cap='long caption', fig.scap='short',tidy=FALSE}
print(xtable(d),format="markdown")
```
This is the data frame d:
d <- structure(list(Hostname = structure(c(8L, 8L, 9L, 5L, 6L, 7L,
1L, 2L, 3L, 4L), .Label = c("db01", "db02", "farm01", "farm02",
"tom01", "tom02", "tom03", "web01", "web03"), class = "factor"),
Date = structure(c(6L, 10L, 5L, 3L, 2L, 1L, 8L, 9L, 7L, 4L
), .Label = c("10/5/2015 1:15", "10/5/2015 1:30", "10/5/2015 2:15",
"10/5/2015 4:30", "10/5/2015 8:30", "10/5/2015 8:45", "10/6/2015 8:15",
"10/6/2015 8:30", "9/11/2015 5:00", "9/11/2015 6:00"), class = "factor"),
Cpubusy = c(31L, 20L, 30L, 20L, 18L, 20L, 41L, 21L, 29L,
24L), UsedPercentMemory = c(99L, 98L, 95L, 99L, 99L, 99L,
99L, 98L, 63L, 99L)), .Names = c("Hostname", "Date", "Cpubusy",
"UsedPercentMemory"), class = "data.frame", row.names = c(NA,
-10L))
Any ideas what I am missing here?
Try kable from knitr. It will format the table nicely.
If you would like to use xtable try:
print(xtable(d), type="latex", comment=FALSE)
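A minimal sketch of the kable suggestion, assuming knitr is installed. The data frame d_small is a hand-built copy of the first three rows of d so that the example is self-contained.

```r
library(knitr)

# First three rows of d from the question, rebuilt by hand
d_small <- data.frame(
  Hostname = c("web01", "web01", "web03"),
  Date = c("10/5/2015 8:45", "9/11/2015 6:00", "10/5/2015 8:30"),
  Cpubusy = c(31L, 20L, 30L),
  UsedPercentMemory = c(99L, 98L, 95L)
)

# kable returns a formatted text table; row.names = FALSE drops the 1, 2, 3 column
tab <- kable(d_small, row.names = FALSE)
tab
```

Inside a knitr chunk the table should render without a print() call, since knitr knows how to display kable output directly.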
While Pierre’s solution works, this should ideally happen automatically. Luckily, you can use knitr hooks to make this work.
That is, given this code:
```{r}
d
```
We want knitr to automatically produce a nicely formatted table, without having to invoke a formatting function manually.
Here’s some code I’m using for that. You need to put this at the beginning of your knitr document, or in the code that’s compiling your document:
knitr::opts_chunk$set(render = function (object, ...) {
    if (pander_supported(object))
        pander(object, style = 'rmarkdown')
    else if (isS4(object))
        show(object)
    else
        print(object)
})
This uses pander and additionally requires a helper function, pander_supported:
library(pander)
pander_supported = function (object)
    UseMethod('pander_supported')
pander_supported.default = function (object)
    any(class(object) %in% sub('^pander\\.', '', methods('pander')))
pander.table = function (x, ...)
    pander(`rownames<-`(rbind(x), NULL), ...)
For nicer formatting, I also use these defaults:
panderOptions('table.split.table', Inf)
panderOptions('table.alignment.default',
function (df) ifelse(sapply(df, is.numeric), 'right', 'left'))
panderOptions('table.alignment.rownames', 'left')
If you are rendering your knitr/rmarkdown report to HTML, you can use the function rmarkdown::paged_table().
For example:
---
title: "My report"
output: html_document
---
```{r}
library(rmarkdown)
f <- function() {
paged_table(mtcars)
}
f()
```
This .Rmd is knit into the following HTML:
Also, consider using the gt package via gt().

How to add multiple data series to a scatterplot and how to format numbers to appear in standard form on y axis

My data set:
structure(list(Site = c(2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L,
4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 6L), Average.worm.weight..g. = c(0.1934,
0.249, 0.263, 0.262, 0.4186, 0.204, 0.311, 0.481, 0.326, 0.657,
0.347, 0.311, 0.239, 0.4156, 0.31, 0.3136, 0.4033, 0.302, 0.277
), Average.total.immune.cell.count = structure(c(8L, 16L, 11L,
12L, 10L, 1L, 4L, 15L, 4L, 3L, 17L, 13L, 18L, 7L, 5L, 6L, 9L,
14L, 2L), .Label = c("0", "168750", "18650000", "200,000", "21,600,000",
"226666.6", "22683333.33", "2533333.33", "283333.333", "291666.6",
"335833.3", "435800", "474816666.7", "500000", "6450000", "729166.667",
"7433333.3", "9916667"), class = "factor"), Average.eleocyte.number = structure(c(2L,
5L, 14L, 10L, 1L, 1L, 6L, 1L, 6L, 7L, 1L, 9L, 15L, 8L, 12L, 3L,
11L, 13L, 4L), .Label = c("0", "1266666.67", "153333.3", "168740",
"17", "200,000", "2266666.667", "22683333.33", "23116666.67",
"264000", "283333.333", "442", "500000", "7.3", "9916667"), class = "factor")), .Names = c("Site",
"Average.worm.weight..g.", "Average.total.immune.cell.count",
"Average.eleocyte.number"), class = "data.frame", row.names = c(NA,
-19L))
This is my R script so far:
# Plotting multiple data series on a graph
y1<-dframe1$"Average.total.immune.cell.count"
y2<-dframe1$"Average.eleocyte.number"
x<-dframe1$"Average.worm.weight..g."
plot.default(y1~x,type="p" )
points(y2~x)
I am trying to add two y series to the same scatterplot and I am struggling to do so. I want to have different symbols for the points so as to tell the two data series apart. I would also like the axes to meet at the bottom left-hand corner; how can I do that? Finally, I would like the y axis to be in standard form, but do not know how to get R to do that.
Best regards.
K.
So this is an object lesson in getting your data in the correct format to begin with. Your numbers have commas, which R does not like. Hence the numbers get converted to character and imported as factors (which your structure(...) clearly shows). You need to fix that, or better yet get rid of the commas prior to exporting.
Something like this will work
# dframe is the data frame from the question
colnames(dframe) <- c("Site","x","y1","y2")
# strip the thousands separators, then convert the text to numbers
dframe$y1 <- as.numeric(gsub(",","",as.character(dframe$y1),fixed=TRUE))
dframe$y2 <- as.numeric(gsub(",","",as.character(dframe$y2),fixed=TRUE))
plot(y1~x,dframe, col="red", pch=20)
points(y2~x,dframe, col="blue", pch=20)
But there are additional problems. One of the numbers (y1 in row 12) is more than an order of magnitude larger than all the others, so the plot above is not very informative. It's hard to know if this is a data input error, or a genuine outlier in your data.
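To see why the conversion has to go through the text rather than the factor itself, here is a small sketch using four of the values from the question. The names raw, codes and clean are made up for the example.

```r
# Four values from the question's data, stored as a factor as in the dput output
raw <- factor(c("200,000", "21,600,000", "0", "168750"))

# Pitfall: as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(raw)

# Fix: go through the text, remove the commas, then convert
clean <- as.numeric(gsub(",", "", as.character(raw), fixed = TRUE))
clean
```

The vector clean holds the real magnitudes, whereas codes only holds the positions of the levels.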
EDIT: Response to OP's comment
dframe <- dframe[-12,] # remove row 12
dframe <- dframe[order(dframe$x),] # order by increasing x
plot(y1~x,dframe, col="red", pch=20, type="b")
points(y2~x,dframe, col="blue", pch=20, type="b")
legend("topleft",legend=c("y1","y2"),col=c("red","blue"),pch=20)

R line chart - removing vexing zero line not associated with data

I have a simple (yet very large) data set of counts made at different sites from Apr to Aug.
Between mid Apr and July there are no zero counts - yet a line at zero extends from the earliest to latest date.
Here is the part of the data used to make the above chart (columns are- Site.ID, DATE, Visible Number):
data=structure(list(Site.ID = c(302L, 302L, 302L, 302L, 302L, 302L,
302L, 302L, 302L, 302L, 302L, 302L, 304L, 304L, 304L, 304L, 304L,
304L, 304L, 304L, 304L, 304L, 304L, 304L), DATE = structure(c(1L,
2L, 5L, 3L, 4L, 6L, 8L, 7L, 9L, 10L, 11L, 12L, 1L, 2L, 5L, 3L,
4L, 6L, 8L, 7L, 9L, 10L, 11L, 12L), .Label = c("3/21/2014", "3/27/2014",
"4/17/2014", "4/28/2014", "4/8/2014", "5/13/2014", "6/17/2014",
"6/6/2014", "7/10/2014", "7/22/2014", "7/29/2014", "8/5/2014"
), class = "factor"), Visible.Number = c(0L, 0L, 5L, 14L, 20L,
21L, 6L, 8L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 7L, 7L, 7L, 7L, 5L,
0L, 0L, 0L, 0L)), .Names = c("Site.ID", "DATE", "Visible.Number"
), class = "data.frame", row.names = c(NA, -24L))
attach(data)
DATE<-as.Date(DATE,"%m/%d/%Y")
plot(data$Visible.Number~DATE, type="l", ylab="Visible Number")
I have two sites but there are three lines. How to make R not plot a line along zero?
Thank you for your help!
Your problem is with the multiple site IDs. R plots the first site, then goes back (drawing a line) to draw the second one. Essentially, base plotting tries to draw all the lines without "lifting the pen". With base plotting, your option is to plot the sites separately with lines, perhaps in a for loop. I think this kind of thing is easier with ggplot2. Note that the code below expects DATE to be converted inside the data frame itself, not only in an attached copy:
data$DATE <- as.Date(data$DATE, "%m/%d/%Y")
library(ggplot2)
ggplot(data, aes(x = DATE, y = Visible.Number, group = Site.ID)) + geom_line()
# if you prefer more base-like styling
ggplot(data, aes(x = DATE, y = Visible.Number, group = Site.ID)) +
geom_line() +
theme_bw()
In base:
plot(data$DATE, data$Visible.Number, type = "n",
     ylab = "Visible Number", xlab = "Date")
for (site in unique(data$Site.ID)) {
    with(subset(data, Site.ID == site),
         lines(Visible.Number ~ DATE)
    )
}
N.B. I did not attach my data as you did, so I don't know if the subsetting in the base solution will work properly for you if you do attach. In general, avoid attach; with is a nice way to save typing without attaching, and is much less "risky" in that it doesn't copy your data columns into isolated vectors, thus making them more difficult to keep track of as you subset or otherwise work with your data.
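The risk with attach can be reproduced on two rows taken from the question's data. This is a minimal sketch; counts and class_before are names invented for the example.

```r
# Two rows from the question, with DATE stored as a factor as in the dput output
counts <- data.frame(
  DATE = factor(c("4/8/2014", "4/17/2014")),
  Visible.Number = c(5L, 14L)
)

attach(counts)
# This creates a NEW global vector called DATE; the data frame is not modified
DATE <- as.Date(DATE, format = "%m/%d/%Y")
detach(counts)

class(DATE)                          # the global copy is a Date
class_before <- class(counts$DATE)   # the column in the data frame is still a factor

# Converting the column in place keeps everything in one object
counts$DATE <- as.Date(as.character(counts$DATE), format = "%m/%d/%Y")
with(counts, range(DATE))
```

Once the column itself is a Date, with() and subset() see the converted values and no stray global vector needs to be kept in sync.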

Using frequency of column value in dataframe to calculate new column value

So I have an example dataframe that holds the columns id, count and username, with id and count being numbers and username being a string.
For every row of the dataframe I want to set a value of a new column called 'ratio', with ratio being defined as
count / number of rows where username == the username in this row
Example from the provided data:
In every row where the username is 'Tom' the ratio would be count/4 , because the user Tom is found four times in the data.
This is just a simplified version of my problem, a for-loop is not an option because my original dataframe has about 3.4 million rows and my previous approach where I used for-loops to iterate the unique values of e.g. 'username' to solve this problem takes forever.
dput of my dataframe:
structure(list(id = 1:20, count = c(140L, 89L, 17L, 114L, 129L,
86L, 21L, 50L, 197L, 160L, 8L, 14L, 78L, 208L, 155L, 55L, 63L,
20L, 189L, 79L), usernames = structure(c(4L, 3L, 5L, 5L, 2L,
3L, 1L, 1L, 3L, 1L, 3L, 2L, 5L, 5L, 4L, 4L, 2L, 2L, 2L, 3L), .Label = c("Jerry",
"Mark", "Phil", "Tina", "Tom"), class = "factor")), .Names = c("id",
"count", "usernames"), row.names = c(NA, 20L), class = "data.frame")
I hope I provided everything for you to understand and reproduce the problem, if something's missing don't hesitate to mention it in the comments.
There are several options. Here are three: one in base R, one with data.table, and one with "plyr". All three assume we're starting with a data.frame named "mydf":
Base R
within(mydf, {
temp <- as.numeric(ave(as.character(usernames), usernames, FUN = length))
ratio <- count/temp
rm(temp)
})
data.table
library(data.table)
DT <- data.table(mydf)
DT[, ratio := count/.N, by = "usernames"]
DT
plyr
library(plyr)
ddply(mydf, .(usernames), transform,
ratio = count/length(usernames))
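You can also use ave for this. Pass the numeric count column (not the usernames factor) as the first argument, because ave writes the group lengths back into a copy of that vector and a factor cannot hold them:
transform(d, ratio = count / ave(count, usernames, FUN = length))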
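The ave approach can be checked on a few rows. The sketch below uses the first five rows of the question's data, rebuilt by hand; in this subset Tom appears twice rather than four times. The names n_per_user and n_lookup are made up for the example, and the table lookup is included only as an independent cross-check.

```r
# First five rows of the question's data
mydf <- data.frame(
  id = 1:5,
  count = c(140L, 89L, 17L, 114L, 129L),
  usernames = factor(c("Tina", "Phil", "Tom", "Tom", "Mark"))
)

# Number of rows per user, repeated on every row of that user (vectorised, no loop)
n_per_user <- ave(mydf$count, mydf$usernames, FUN = length)
mydf$ratio <- mydf$count / n_per_user

# Cross-check: look the frequencies up in a table by name
n_lookup <- as.vector(table(mydf$usernames)[as.character(mydf$usernames)])

mydf
```

Both ways of counting work on whole vectors at once, which is what makes them practical for millions of rows.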
