Apply regression while looping through levels of a factor in R

I am trying to apply a regression function to each level of a factor (Subject). The idea is that, for each Subject, I get a predicted reading time based on their actual reading time (RT) and the length of the corresponding printed string (WordLen). A colleague helped me with some code that applies the function to each level of another factor (Region) within Subject. However, neither the original code nor my attempted modification (which applies the function across the levels of a single factor) works.
Here is an attempt at some sample data:
test0<-structure(list(Subject = c(101L, 101L, 101L, 101L, 101L, 101L,
101L, 101L, 101L, 101L, 102L, 102L, 102L, 102L, 102L, 102L, 102L,
102L, 102L, 102L, 103L, 103L, 103L, 103L, 103L, 103L, 103L, 103L,
103L, 103L), Region = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L), RT = c(294L, 241L, 346L, 339L, 332L, NA, 399L,
377L, 400L, 439L, 905L, 819L, 600L, 520L, 811L, 1021L, 508L,
550L, 1048L, 1246L, 470L, NA, 385L, 347L, 592L, 507L, 472L, 396L,
761L, 430L), WordLen = c(3L, 3L, 3L, 3L, 3L, 3L, 5L, 7L, 3L,
9L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 7L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 5L, 7L, 3L)), .Names = c("Subject", "Region", "RT", "WordLen"
), class = "data.frame", row.names = c(NA, -30L))
Unfortunately, this sample data produces an error that I don't get with my full dataset:
"Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases"
Maybe this is because the sample data is too small?
Anyway, I am hoping that someone will see the issue with the code, despite my inability to provide working data.
This is the original code (does not work):
for(i in 1:length(levels(test0$Subject)))
for(j in 1:length(levels(test0$Region)))
{tmp=predict(lm(RT~WordLen,test0[test0$Subject==levels(test0$Subject)[i] & test0$Region==levels(test0$Region)[j],],na.action="na.exclude"))
test0[names(tmp),"rt.predicted"]=tmp
}
And this is the modified code (which not surprisingly, also does not work):
for(i in 1:length(levels(test0$Subject)))
{tmp=predict(lm(RT~WordLen,test0[test0$Subject==levels(test0$Subject)[i],],na.action="na.exclude"))
test0[names(tmp),"rt.predicted"]=tmp
}
I would very much appreciate any suggestions.

You can achieve this with the function ddply() from the plyr package.
It splits the data frame by Subject, computes the predictions of the regression model within each group, and adds them to the data frame as a new column.
library(plyr)
ddply(test0,.(Subject),transform,
pred=predict(lm(RT~WordLen,na.action="na.exclude")))
Subject Region RT WordLen pred
1 101 1 294 3 327.9778
......
4 101 1 339 3 327.9778
5 101 1 332 3 327.9778
6 101 2 NA 3 NA
7 101 2 399 5 363.8444
.......
13 102 1 600 3 785.4146
To split the data by both Subject and Region, put both variables inside .().
ddply(test0,.(Subject,Region),transform,
pred=predict(lm(RT~WordLen,na.action="na.exclude")))

The only problem in your test data is that Subject and Region are not factors.
test0$Subject <- factor(test0$Subject)
test0$Region <- factor(test0$Region)
for(i in 1:length(levels(test0$Subject)))
for(j in 1:length(levels(test0$Region)))
{tmp=predict(lm(RT~WordLen,test0[test0$Subject==levels(test0$Subject)[i] & test0$Region==levels(test0$Region)[j],],na.action="na.exclude"))
test0[names(tmp),"rt.predicted"]=tmp
}
# 26 27 28 29 30
# 442.25 442.25 560.50 678.75 442.25
The reason you were getting that error (0 non-NA cases) is that you were subsetting on the levels of variables that were not factors. With your original dataset, try:
test0[test0$Subject==levels(test0$Subject)[1],]
You get:
# [1] Subject Region RT WordLen
# <0 rows> (or 0-length row.names)
That empty data frame is what lm() was trying to work with.
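A minimal sketch of why the subset comes back empty when the grouping column is a plain integer vector. The vector `subject` is a made-up stand-in for test0$Subject, not the original data.

```r
# Hypothetical stand-in for test0$Subject before it is converted to a factor
subject <- c(101L, 101L, 102L, 103L)

lv <- levels(subject)          # NULL: an integer vector has no levels
n_levels <- length(lv)         # 0
loop_index <- 1:n_levels       # c(1, 0): the loop still runs, first with 1, then with 0
mask <- subject == lv[1]       # logical(0): comparing against NULL selects no rows

# After conversion to a factor the same expressions behave as intended
subject_f <- factor(subject)
lv_f <- levels(subject_f)      # "101" "102" "103"
mask_f <- subject_f == lv_f[1] # TRUE TRUE FALSE FALSE
```

Using seq_along(levels(x)) instead of 1:length(levels(x)) as the loop index is a little safer, because it is empty when there are no levels and the loop body is then skipped.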

While your question seems to be asking for an explanation of the error, which others have answered (the data not being factors at all), here is a way to do it using only base packages:
test0$rt.predicted <- unlist(by(test0[, c("RT", "WordLen")], list(test0$Subject, test0$Region), FUN = function(x) predict(lm(RT ~
WordLen, x, na.action = "na.exclude"))))
test0
## Subject Region RT WordLen rt.predicted
## 1 101 1 294 3 310.4000
## 2 101 1 241 3 310.4000
## 3 101 1 346 3 310.4000
## 4 101 1 339 3 310.4000
## 5 101 1 332 3 310.4000
## 6 101 2 NA 3 731.0000
## 7 101 2 399 5 731.0000
## 8 101 2 377 7 731.0000
## 9 101 2 400 3 731.0000
## 10 101 2 439 9 731.0000
## 11 102 1 905 3 448.5000
## 12 102 1 819 3 NA
## 13 102 1 600 3 448.5000
## 14 102 1 520 3 448.5000
## 15 102 1 811 3 448.5000
## 16 102 2 1021 3 NA
## 17 102 2 508 3 399.0000
## 18 102 2 550 5 408.5000
## 19 102 2 1048 7 389.5000
## 20 102 2 1246 3 418.0000
## 21 103 1 470 3 870.4375
## 22 103 1 NA 3 870.4375
## 23 103 1 385 3 877.3750
## 24 103 1 347 3 884.3125
## 25 103 1 592 3 870.4375
## 26 103 2 507 3 442.2500
## 27 103 2 472 3 442.2500
## 28 103 2 396 5 560.5000
## 29 103 2 761 7 678.7500
## 30 103 2 430 3 442.2500

I would expect that this is caused by the fact that no data exists for some combination of your two categorical variables. What you could do is first extract the subset, check that it actually contains rows (an empty subset is a zero-row data frame rather than NULL), and only perform the lm if there is data.
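The check described above can be sketched as follows. The data frame `dat` is invented for illustration and only borrows the column names from the question. The loop iterates over unique() values, so it works whether or not the grouping columns are factors.

```r
# Small made-up data set; column names follow the question
dat <- data.frame(
  Subject = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Region  = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  RT      = c(300, 340, NA, 420, 500, 560, 610, 700),
  WordLen = c(3L, 5L, 7L, 9L, 3L, 5L, 7L, 9L)
)
dat$rt.predicted <- NA_real_

for (s in unique(dat$Subject)) {
  for (r in unique(dat$Region)) {
    idx <- which(dat$Subject == s & dat$Region == r)
    # Skip combinations with no rows or with no complete cases
    if (length(idx) == 0 ||
        !any(complete.cases(dat[idx, c("RT", "WordLen")]))) next
    fit <- lm(RT ~ WordLen, data = dat[idx, ], na.action = na.exclude)
    # With na.exclude, predict() pads missing cases with NA,
    # so the result has one value per row in idx
    dat$rt.predicted[idx] <- predict(fit)
  }
}
```

Assigning by row index rather than by names(tmp) keeps each prediction on the row it was computed from.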

Related

How do I count the number of cells from the CSV file in R?

The name of my dataset is student_performance which can be seen below:
gender race lunch math reading writing
2 2 2 72 72 74
2 3 2 69 90 88
2 2 2 90 95 93
1 1 1 47 57 44
1 3 2 76 78 75
2 2 2 71 83 78
2 2 2 88 95 92
1 2 1 40 43 39
1 4 1 64 64 67
2 2 1 38 60 50
I want to count how many times the value 2 appears in the gender column. For this I tried this code:
count(studentperformance$gender[1:10], vars = "2")
But the code throws an error. How can I achieve this?
As @user2974951 said, you can use base R for that:
sum(studentperformance$gender==2)
[1] 6
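A short sketch of why sum() works here, using the ten gender values from the question typed in as a plain vector. The extra vector `gender_na` is an assumption added only to show what happens when a missing value is present.

```r
# Gender values from the question's sample data
gender <- c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L)

# The comparison yields TRUE/FALSE; sum() counts TRUE as 1 and FALSE as 0
n_two <- sum(gender == 2)

# Hypothetical variant with a missing value: NA propagates unless removed
gender_na <- c(gender, NA)
n_two_raw <- sum(gender_na == 2)               # NA
n_two_na  <- sum(gender_na == 2, na.rm = TRUE) # counts only the non-missing matches
```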
You can also create a table for every level in gender:
table(studentperformance$gender,factor(studentperformance$gender))
1 2
1 4 0
2 0 6
Sample data:
studentperformance <- read.table(text = "gender race lunch math reading writing
2 2 2 72 72 74
2 3 2 69 90 88
2 2 2 90 95 93
1 1 1 47 57 44
1 3 2 76 78 75
2 2 2 71 83 78
2 2 2 88 95 92
1 2 1 40 43 39
1 4 1 64 64 67
2 2 1 38 60 50", header = TRUE)
You can create some simple tables without indexing or comparisons. Try the following with count, which will return the variable gender containing the unique values of gender, and n indicating the count of each unique value:
library(dplyr)
count(df, gender)
#### OUTPUT ####
# A tibble: 2 x 2
gender n
<int> <int>
1 1 4
2 2 6
You can do pretty much the same thing using base R's table. The output is just a little different: The unique values are now the variable headers 1 and 2, and the counts are the row just beneath, with 4 and 6:
table(df$gender)
#### OUTPUT ####
1 2
4 6
Consider also:
studentperformance <- transform(studentperformance,
count_by_gender = ave(studentperformance$gender,
studentperformance$gender,
FUN = length))
Data:
structure(
list(
gender = c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,
2L),
race = c(2L, 3L, 2L, 1L, 3L, 2L, 2L, 2L, 4L, 2L),
lunch = c(2L,
2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L),
math = c(72L, 69L, 90L,
47L, 76L, 71L, 88L, 40L, 64L, 38L),
reading = c(72L, 90L, 95L,
57L, 78L, 83L, 95L, 43L, 64L, 60L),
writing = c(74L, 88L, 93L,
44L, 75L, 78L, 92L, 39L, 67L, 50L),
count_by_gender = c(6L, 6L,
6L, 4L, 4L, 6L, 6L, 4L, 4L, 6L)
),
class = "data.frame",
row.names = c(NA,-10L)
)

how does one deal with x must be numeric error in correlation plot?

I'm trying to produce a correlation plot for my data, but I get an 'x must be numeric' error, and other fixes have not worked in my case. Do I have to convert the month to numeric as well, or is there a way of selecting only the numeric columns for my plot?
I tried converting everything to numeric, but the columns just change back to factor automatically.
getwd()
myDF <- read.csv("qbase.csv")
head(myDF)
str(myDF)
cp <-cor(myDF)
head(round(cp,2))
'data.frame': 12 obs. of 8 variables:
$ Month : Factor w/ 12 levels "18-Apr","18-Aug",..: 5 4 8 1 9 7 6 2 12 11 ...
$ Monthly.Recurring.Revenue: Factor w/ 2 levels "$25,000 ","$40,000 ": 1 1 1 1 1 2 2 2 2 2 ...
$ Price.per.Seat : Factor w/ 2 levels "$40 ","$50 ": 2 2 2 2 2 1 1 1 1 1 ...
$ Paid.Seats : int 500 500 500 500 500 1000 1000 1000 1000 1000 ...
$ Active.Users : int 10 50 50 100 450 550 800 900 950 800 ...
$ Support.Cases : int 0 0 1 5 35 155 100 75 50 45 ...
$ Users.Trained : int 1 5 0 50 100 300 50 30 0 100 ...
$ Features.Used : int 5 5 5 5 8 9 9 10 15 15 ...
The result of dput(myDF) is as follows:
dput( myDF)
structure(list(Month = structure(c(5L, 4L, 8L, 1L, 9L, 7L, 6L,
2L, 12L, 11L, 10L, 3L), .Label = c("18-Apr", "18-Aug", "18-Dec",
"18-Feb", "18-Jan", "18-Jul", "18-Jun", "18-Mar", "18-May", "18-Nov",
"18-Oct", "18-Sep"), class = "factor"), Monthly.Recurring.Revenue = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("$25,000 ",
"$40,000 "), class = "factor"), Price.per.Seat = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("$40 ",
"$50 "), class = "factor"), Paid.Seats = c(500L, 500L, 500L,
500L, 500L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L),
Active.Users = c(10L, 50L, 50L, 100L, 450L, 550L, 800L, 900L,
950L, 800L, 700L, 600L), Support.Cases = c(0L, 0L, 1L, 5L,
35L, 155L, 100L, 75L, 50L, 45L, 10L, 5L), Users.Trained = c(1L,
5L, 0L, 50L, 100L, 300L, 50L, 30L, 0L, 100L, 50L, 0L), Features.Used = c(5L,
5L, 5L, 5L, 8L, 9L, 9L, 10L, 15L, 15L, 15L, 15L)), class = "data.frame", row.names = c(NA,
-12L))
You can convert dates to POSIXct and also remove the dollar sign to convert the second and third columns to numeric:
myDF$Month <- as.numeric(as.POSIXct(myDF$Month, format="%d-%b", tz="GMT"))
myDF[,c(2,3)] <- sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
cp <-cor(myDF)
library(ggcorrplot)
ggcorrplot(cp)
You are trying to compute a correlation between factor and numeric columns, which can't work (cor handles only numeric input, hence the error). You can do:
library(data.table)
ir <- data.table(iris) # since you didn't produce a reproducible example
ir[, cor(.SD), .SDcols = names(ir)[(lapply(ir, class) == "numeric")]]
What is happening here:
cor(.SD) calculates the correlation matrix for a new data frame composed of a subset of the data.table (.SD, see ?data.table).
.SDcols determines which columns go into that subset. Here they are only those whose class is numeric.
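The same column selection can be sketched in base R, again on the built-in iris data. One difference worth noting: is.numeric() is TRUE for both integer and double columns, whereas a test of class == "numeric" skips columns whose class is "integer", such as the int columns shown in the str() output of the question.

```r
# Logical vector marking the numeric (integer or double) columns
num_cols <- vapply(iris, is.numeric, logical(1))

# Correlation matrix of the numeric columns only; Species (a factor) is dropped
cp <- cor(iris[, num_cols])
round(cp, 2)
```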
You can remove the dollar sign and change the integer variables to numeric using sapply, then calculate the correlation.
myDF[,c(2,3)] <- sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
newdf <- sapply(myDF[,2:8],as.numeric)
cor(newdf)
Edited:
If you want to use the month variable, install lubridate and use its month function.
For example:
library(lubridate)
myDF$Month<- month(as.POSIXct(myDF$Month, format="%d-%b", tz="GMT"))
myDF[,c(2,3)] <- sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
newdf <- sapply(myDF,as.numeric)
cor(as.data.frame(newdf))
The way to convert those months to Date class:
myDF$MonDt <- as.Date( paste0(myDF$Month, "-15"), format="%y-%b-%d")
You could also have used zoo::as.yearmon. Either method allows you to apply as.numeric to get a valid time-scaled value. The other answers are adequate for single-year data, but they incorrectly assume that the leading two digits are the day of the month rather than the year. They will therefore fail to deliver valid answers for any multi-year dataset, without throwing any warning about it.
with(myDF, cor(Active.Users, as.numeric(MonDt) ) )
[1] 0.8269705
As one of the other answers illustrated, removing the $ and the commas is needed before as.numeric will succeed on currency-formatted text. This is also factor data, so calling as.numeric directly could have yielded erroneous answers, although in this simple example it would not. A safe method would be:
myDF[2:3] <- lapply(myDF[2:3], function(x) as.numeric( gsub("[$,]", "", x)))
myDF
Month Monthly.Recurring.Revenue Price.per.Seat Paid.Seats Active.Users
1 18-Jan 25000 50 500 10
2 18-Feb 25000 50 500 50
3 18-Mar 25000 50 500 50
4 18-Apr 25000 50 500 100
5 18-May 25000 50 500 450
6 18-Jun 40000 40 1000 550
7 18-Jul 40000 40 1000 800
8 18-Aug 40000 40 1000 900
9 18-Sep 40000 40 1000 950
10 18-Oct 40000 40 1000 800
11 18-Nov 40000 40 1000 700
12 18-Dec 40000 40 1000 600
Support.Cases Users.Trained Features.Used MonDt
1 0 1 5 2018-01-15
2 0 5 5 2018-02-15
3 1 0 5 2018-03-15
4 5 50 5 2018-04-15
5 35 100 8 2018-05-15
6 155 300 9 2018-06-15
7 100 50 9 2018-07-15
8 75 30 10 2018-08-15
9 50 0 15 2018-09-15
10 45 100 15 2018-10-15
11 10 50 15 2018-11-15
12 5 0 15 2018-12-15
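The factor pitfall mentioned above can be seen in a small sketch. The vector `revenue` is a made-up factor that reuses the two labels from the dput output.

```r
# Hypothetical factor with the currency labels from the question
revenue <- factor(c("$25,000 ", "$40,000 ", "$25,000 "))

# as.numeric() on a factor returns the internal level codes, not the amounts
codes <- as.numeric(revenue)

# Convert to character, strip whitespace, "$" and ",", then convert to numeric
values <- as.numeric(gsub("[$,]", "", trimws(as.character(revenue))))
```

Here `codes` holds 1, 2, 1 while `values` holds 25000, 40000, 25000, which is why the text has to be cleaned before the conversion.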
This question gets an answer that allows multiple correlation coefficients to be calculated and the two way data associations plotted on one page:
How to add p values for correlation coefficients plotted using splom in lattice?

R- reshape2 with aggregation min function

I need to transpose a df in R, and the aggregation function has to be min.
Example:
library(reshape2)
N <- 20
df <- data.frame(rutcli=sample(101:103, N, replace=T),
mes_atras=sample(1:4, N, replace=T), pay_day=sample(1:30, N, replace=T))
s<-dcast(df, rutcli ~ mes_atras, fun.aggregate = min, value.var = 'pay_day')
View(s)
But I get a warning:
Warning message: In .fun(.value[0], ...) : no non-missing arguments to
min; returning Inf
And the results are not what I want:
rutcli 1 2 3 4
101 1 1 Inf 1
102 Inf 2 14 8
103 3 6 2 25
How can I solve this?
Thanks
You're getting the warning because you're asking for the minimum value of an empty set. For example, there are no values of pay_day for which rutcli=102 and mes_atras=1, so Inf is returned instead.
You can see this more easily if you set fun.aggregate=length. For example:
library(reshape2)
N <- 20
set.seed(11) # To make the `sample` function reproducible
df <- data.frame(rutcli=sample(101:103, N, replace=T),
mes_atras=sample(1:4, N, replace=T),
pay_day=sample(1:30, N, replace=T))
dcast(df, rutcli ~ mes_atras, fun.aggregate = length, value.var = 'pay_day')
rutcli 1 2 3 4
1 101 4 4 2 0
2 102 1 3 1 0
3 103 2 2 0 1
The zeros represent combinations of rutcli and mes_atras for which there are no values of pay_day. If we run dcast on this data frame with the min function, we'll get Inf where the zeros appear:
dcast(df, rutcli ~ mes_atras, fun.aggregate = min, value.var = 'pay_day')
rutcli 1 2 3 4
1 101 1 5 7 Inf
2 102 18 13 14 Inf
3 103 10 13 Inf 7
Warning message:
In .fun(.value[0], ...) : no non-missing arguments to min; returning Inf
You can get NA instead of Inf by using one of the split-apply-combine methods. @MatthewLundberg gives a base R method. Here's one with dplyr:
library(dplyr)
df %>%
group_by(rutcli, mes_atras) %>%
summarise(min_pay_day=min(pay_day)) %>%
dcast(rutcli ~ mes_atras, value.var="min_pay_day")
rutcli 1 2 3 4
1 101 1 5 7 NA
2 102 18 13 14 NA
3 103 10 13 NA 7
You can do this with aggregate and reshape from package stats:
reshape(
aggregate(pay_day ~ mes_atras + rutcli, data=df, FUN=min),
direction='wide', timevar='mes_atras', idvar='rutcli'
)
## rutcli pay_day.1 pay_day.2 pay_day.3 pay_day.4
## 1 101 1 20 15 2
## 5 102 18 30 NA 3
## 8 103 2 5 23 16
You can replace NA values with Inf if desired.
Here's my df:
structure(list(rutcli = c(103L, 103L, 103L, 103L, 103L, 103L,
102L, 102L, 103L, 102L, 101L, 101L, 101L, 101L, 101L, 103L, 102L,
101L, 101L, 103L), mes_atras = c(1L, 3L, 4L, 1L, 1L, 2L, 1L,
4L, 1L, 2L, 2L, 4L, 3L, 2L, 2L, 4L, 4L, 4L, 1L, 2L), pay_day = c(3L,
23L, 16L, 18L, 2L, 5L, 18L, 3L, 12L, 30L, 20L, 2L, 15L, 24L,
29L, 24L, 3L, 19L, 1L, 12L)), .Names = c("rutcli", "mes_atras",
"pay_day"), row.names = c(NA, -20L), class = "data.frame")
I did it with:
my.min <- function (v) {if (length(v) == 0) 0 else min(v)}
s<-dcast(df, rutcli ~ mes_atras, fun.aggregate = my.min, value.var = 'pay_day')
And because I know that I don't have any 0:
s[s == 0] <- NA
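A variant of the wrapper above that returns NA directly, so that no sentinel value such as 0 is needed. This is a base R sketch only. The name `safe_min` is made up here, and whether dcast accepts it as fun.aggregate has not been checked.

```r
# min() of an empty vector warns and returns Inf
empty_min <- suppressWarnings(min(integer(0)))

# Hypothetical wrapper: NA for an empty group, the usual minimum otherwise
safe_min <- function(v) if (length(v) == 0) NA_integer_ else min(v)

safe_min(integer(0))    # NA
safe_min(c(3L, 1L, 2L)) # 1
```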

Plotting data with Multiple Conditions on a Single Chart

I am attempting to make a plot using ggplot2 with side by side bars generated from certain conditions that can be calculated from the data. I suspect the problem is formatting my data properly so that ggplot will give me what I want. I can't for the life of me get it right though.
What I have is data frame filled with rows for each time a student takes a course at a school. The variables of interest are Student.ID, Course.ID, Session, Fiscal.Year, and Facility. Each row is an occurrence of a student taking a course and tells what course they took, where they took it, etc. As far as I know, this is what's required for the data to be in long form (correct me if I'm wrong). The only field with possible NA values is the Facility, but I plan to exclude those from the plot anyways so you can treat the data frame as being completely filled.
What I want to do is produce a plot showing by fiscal year how many courses had <= 2 students, how many had < 4 students, and how many had <= 4 students, and how many courses were offered total. (Note: When I'm talking about how many courses were offered, I'm taking into account that each course may be offered multiple times and each time it's offered it has a session number associated with it. The tricky part is that the session numbers are not unique. I hope that makes sense, and I can try to clarify more if needed.)
I envision the final product being multiple charts, faceted on the locations, with Fiscal.Year on the x-axis and the number of courses/sessions on the y-axis. For each FY in the chart, I want different colored bars placed side by side showing the numbers of <=2, <4, <=4, and total courses offered for that FY at that location. Consider the following chart, only instead of "Income, Expense, Loans", I want "<=2, <4, <=4, Total" (they would also be ascending from left to right, since each category includes the previous one).
Here is some sample data to work with (typed as CSV since I can't just copy the head of the file). I've excluded the Facility column because faceting by that is easy and we can just assume one FY for a test example I think. For reference, it should have 3 courses with <=2 students, 5 courses with < 4, and 6 with <= 4. The total number of courses offered in this sample set is 6.
ID,CourseID,Session,Fiscal.Year
101,1,1,FY13
102,1,1,FY13
103,1,1,FY13
104,1,1,FY13
101,2,1,FY13
102,2,1,FY13
103,2,1,FY13
101,2,2,FY13
102,2,2,FY13
103,2,2,FY13
101,3,1,FY13
102,3,1,FY13
101,3,2,FY13
102,3,2,FY13
101,3,3,FY13
102,3,3,FY13
I have tried:
Creating a new data frame using ddply with columns Course.ID, Session, FY, Facility, Count of Students. Then I created a new column called "TwoLess", which just has a 1 if the count is <=2 and 0 otherwise. (I repeated this process for the other conditions, creating similar new columns for them.) Using the ggplot code below I was able to get a faceted plot for only one of the conditions (i.e. only <=2 students), but wasn't able to get them to combine. I believe the following is equivalent to the code I used, changed to reflect my test set above:
ggplot(na.omit(df), aes(y = TwoLess, x = Fiscal.Year)) + geom_bar(stat = 'identity') + facet_wrap(~Facility)
I am thinking this approach is heavily flawed and I'm missing out on some of the "niceness" of having data in long form, since that's what ggplot wants as I understand it.
What is the best way to approach plotting this in ggplot?
It's also worth mentioning that while I have access to some of the more popular packages like ggplot2, plyr, reshape2, I do not have the ability to load all packages so I would prefer a solution that uses the above packages (or any of their dependencies). It shouldn't be that large of a restriction, I don't think.
Would something like this help?
Extending your data
> dput(df)
structure(list(ID = c(101L, 102L, 103L, 104L, 101L, 102L, 103L,
101L, 102L, 103L, 101L, 102L, 101L, 102L, 101L, 102L, 101L, 102L,
103L, 104L, 101L, 102L, 103L, 101L, 102L, 103L, 101L, 102L, 101L,
102L, 101L, 102L, 101L, 102L, 103L, 104L, 101L, 102L, 103L, 101L,
102L, 103L, 101L, 102L, 101L, 102L, 101L, 102L), CourseID = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L),
Session = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L,
2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L,
1L, 2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
1L, 1L, 2L, 2L, 3L, 3L), Fiscal.Year = c("FY13", "FY13",
"FY13", "FY13", "FY13", "FY13", "FY13", "FY13", "FY13", "FY13",
"FY13", "FY13", "FY13", "FY13", "FY13", "FY13", "FY14", "FY14",
"FY14", "FY14", "FY14", "FY14", "FY14", "FY14", "FY14", "FY14",
"FY14", "FY14", "FY14", "FY14", "FY14", "FY14", "FY15", "FY15",
"FY15", "FY15", "FY15", "FY15", "FY15", "FY15", "FY15", "FY15",
"FY15", "FY15", "FY15", "FY15", "FY15", "FY15")), .Names = c("ID",
"CourseID", "Session", "Fiscal.Year"), class = "data.frame", row.names = c(NA,
-48L))
df
ID CourseID Session Fiscal.Year
1 101 1 1 FY13
2 102 1 1 FY13
3 103 1 1 FY13
4 104 1 1 FY13
5 101 2 1 FY13
6 102 2 1 FY13
7 103 2 1 FY13
8 101 2 2 FY13
9 102 2 2 FY13
10 103 2 2 FY13
11 101 3 1 FY13
12 102 3 1 FY13
13 101 3 2 FY13
14 102 3 2 FY13
15 101 3 3 FY13
16 102 3 3 FY13
17 101 1 1 FY14
18 102 1 1 FY14
19 103 1 1 FY14
20 104 1 1 FY14
21 101 2 1 FY14
22 102 2 1 FY14
23 103 2 1 FY14
24 101 2 2 FY14
25 102 2 2 FY14
26 103 2 2 FY14
27 101 3 1 FY14
28 102 3 1 FY14
29 101 3 2 FY14
30 102 3 2 FY14
31 101 3 3 FY14
32 102 3 3 FY14
33 101 1 1 FY15
34 102 1 1 FY15
35 103 1 1 FY15
36 104 1 1 FY15
37 101 2 1 FY15
38 102 2 1 FY15
39 103 2 1 FY15
40 101 2 2 FY15
41 102 2 2 FY15
42 103 2 2 FY15
43 101 3 1 FY15
44 102 3 1 FY15
45 101 3 2 FY15
46 102 3 2 FY15
47 101 3 3 FY15
48 102 3 3 FY15
Summarise it with dplyr:
library(dplyr)
d1 <- df %>%
group_by(CourseID, Session, Fiscal.Year) %>%
summarise(n=length(ID))
And again
d2 <- d1 %>%
group_by(Fiscal.Year) %>%
summarise(d1 = length(n[n <= 2]),
d2 = length(n[n < 4]),
d3 = length(n[n <= 4])
)
Then reshape to long form and plot it with ggplot2:
library(reshape2)
library(ggplot2)
d3 <- melt(as.data.frame(d2), id.vars = "Fiscal.Year")
ggplot(d3, aes(Fiscal.Year, value, fill = variable)) +
geom_bar(stat = 'identity', position = 'dodge')
Someone must provide a clever option. I'm tired. Go to bed now.
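The counting step can also be sketched in base R, including the "Total" category the question asks for. The data frame `enrol` reproduces the 16 FY13 rows of the question's sample, and the names `sizes` and `counts` are made up for this sketch.

```r
# The FY13 sample rows from the question
enrol <- data.frame(
  ID = c(101, 102, 103, 104, 101, 102, 103, 101, 102, 103,
         101, 102, 101, 102, 101, 102),
  CourseID = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  Session  = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 3, 3),
  Fiscal.Year = "FY13"
)

# Number of students per course session and fiscal year
sizes <- aggregate(ID ~ CourseID + Session + Fiscal.Year,
                   data = enrol, FUN = length)
names(sizes)[names(sizes) == "ID"] <- "n"

# One row per fiscal year and category, already in long form
cats <- c("<=2", "<4", "<=4", "Total")
counts <- do.call(rbind, lapply(split(sizes, sizes$Fiscal.Year), function(s) {
  data.frame(
    Fiscal.Year = s$Fiscal.Year[1],
    category = factor(cats, levels = cats),
    value = c(sum(s$n <= 2), sum(s$n < 4), sum(s$n <= 4), nrow(s))
  )
}))
counts
```

For the sample this gives 3, 5, 6 and 6, matching the numbers stated in the question. The result can be passed to ggplot with fill = category and position = 'dodge', in the same way as d3 above.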

Calculating means based on conditionals for another column

I have a dataframe like
df <- structure(list(DATE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L), .Label = c("04/23/90", "04/28/90", "05/03/95", "05/07/95"
), class = "factor"), JULIAN = c(113L, 113L, 113L, 113L, 113L,
113L, 118L, 118L, 118L, 118L, 118L, 118L, 123L, 123L, 123L, 123L,
123L, 123L, 127L, 127L, 127L, 127L, 127L, 127L), ID = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L,
6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("AHFG-01", "AHFG-02",
"AHFG-03", "OIUR-01", "OIUR-02", "OIUR-03"), class = "factor"),
PERCENT = c(0L, 0L, 0L, 80L, 55L, 0L, 25L, 50L, 75L, 100L,
75L, 45L, 10L, 20L, 30L, 50L, 50L, 50L, 50L, 60L, 70L, 75L,
90L, 95L)), .Names = c("DATE", "JULIAN", "ID", "PERCENT"), class = "data.frame", row.names = c(NA,
-24L))
DATE JULIAN ID PERCENT
1 04/23/90 113 AHFG-01 0
2 04/23/90 113 AHFG-02 0
3 04/23/90 113 AHFG-03 0
4 04/23/90 113 OIUR-01 80
5 04/23/90 113 OIUR-02 55
6 04/23/90 113 OIUR-03 0
7 04/28/90 118 AHFG-01 25
8 04/28/90 118 AHFG-02 50
9 04/28/90 118 AHFG-03 75
10 04/28/90 118 OIUR-01 100
11 04/28/90 118 OIUR-02 75
12 04/28/90 118 OIUR-03 45
13 05/03/95 123 AHFG-01 10
14 05/03/95 123 AHFG-02 20
15 05/03/95 123 AHFG-03 30
16 05/03/95 123 OIUR-01 50
17 05/03/95 123 OIUR-02 50
18 05/03/95 123 OIUR-03 50
19 05/07/95 127 AHFG-01 50
20 05/07/95 127 AHFG-02 60
21 05/07/95 127 AHFG-03 70
22 05/07/95 127 OIUR-01 75
23 05/07/95 127 OIUR-02 90
24 05/07/95 127 OIUR-03 95
In this dataframe, ID gives replicates at different sites. For example, AHFG-01 is replicate 1 and AHFG-02 is replicate 2, both at site AHFG. PERCENT refers to percent completion.
I need to calculate two things:
1) Mean JULIAN when PERCENT first exceeds 50 for each site, across years
2) Mean JULIAN when PERCENT first exceeds 50 for all sites, across years
I am a bit baffled about the best way to proceed here. My approach is to:
1) Calculate mean PERCENT for each site (from ID) at each DATE/JULIAN
2) Identify JULIAN when mean PERCENT first exceeds 50, for each site for each YEAR
3) Calculate mean JULIAN from 2) for each site across years
4) Calculate mean JULIAN from 2) for all sites across years
For the dataframe above, the end results I need by site and for all sites together would look something like this:
SITE JULIAN
AHFG 122.5
OIUR 120.5
JULIAN, all sites combined = 121.5
What I have done so far is first create columns YEAR and SITE to use for operations:
df$DATE <- as.POSIXct(df$DATE, format='%m/%d/%y')
df$YEAR <- format(df$DATE, format='%Y')
df$SITE <- gsub("[^A-Za-z]", "", df$ID)
Then I can use aggregate to calculate SITE means for step 1 above:
df2 <- aggregate(PERCENT ~ SITE + JULIAN + YEAR,FUN=mean,data=df)
However, I am getting stuck at step 2 and beyond. Can anyone suggest a way to calculate the mean JULIAN when PERCENT first exceeds 50, for each SITE across years, and all combined SITEs across years?
Solution:
Here is a modified form of Henrik's excellent solution that is working for me. Note that Henrik's original solution did work, but my question was a bit unclear about what I wanted (see comments below).
# make year column
df$DATE <- as.POSIXct(df$DATE, format='%m/%d/%y')
df$YEAR <- format(df$DATE, format='%Y')
# make new SITE column (strip the replicate number from ID)
df$SITE <- gsub("[^A-Za-z]", "", df$ID)
# Calculate average PERCENT for each SITE
df2 <- aggregate(PERCENT ~ SITE + JULIAN + YEAR,FUN=mean,data=df)
# order by SITE and JULIAN
df2 <- df2[order(df2$SITE, df2$JULIAN), ]
# within each YEAR and SITE, select first registration where PERCENT is 50 or more
df2 <- do.call(rbind,
by(df2, list(df2$YEAR, df2$SITE), function(x){
x[x$PERCENT >= 50, ][1, ]
}))
# calculate mean JULIAN per SITE
aggregate(JULIAN ~ SITE, data = df2, mean)
# overall mean
mean(df2$JULIAN)
Here's one possibility:
# order by SITE and DATE
df <- df[order(df$SITE, df$DATE), ]
# within each YEAR and SITE, select first registration where PERCENT exceeds 50
df2 <- do.call(rbind,
by(df, list(df$YEAR, df$SITE), function(x){
x[x$PERCENT > 50, ][1, ]
}))
df2
# DATE JULIAN ID PERCENT YEAR SITE
# 6 1990-04-28 118 AHFG-03 75 1990 AHFG
# 11 1995-05-07 127 AHFG-02 60 1995 AHFG
# 13 1990-04-23 113 OIUR-01 80 1990 OIUR
# 22 1995-05-07 127 OIUR-01 75 1995 OIUR
# calculate mean JULIAN per SITE
aggregate(JULIAN ~ SITE, data = df2, mean)
# SITE JULIAN
# 1 AHFG 122.5
# 2 OIUR 120.0
# overall mean
mean(df2$JULIAN)
# [1] 121.25
Please note that I don't get the same mean for OIUR as in your example.
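A self-contained sketch of the "first row over the threshold per group" step, on invented data. The helper `first_over` and the data frame `obs` are hypothetical names. Unlike x[x$PERCENT > 50, ][1, ], which yields a row of NAs when no row in a group qualifies, this version drops such groups.

```r
# Made-up observations: site C never exceeds the threshold
obs <- data.frame(
  SITE    = rep(c("A", "B", "C"), each = 3),
  YEAR    = "1990",
  JULIAN  = rep(c(113L, 118L, 123L), times = 3),
  PERCENT = c(0, 40, 75,  80, 90, 100,  0, 10, 50)
)

# Return the earliest row whose PERCENT exceeds the threshold, or NULL if none does
first_over <- function(x, threshold = 50) {
  x <- x[order(x$JULIAN), ]
  hit <- which(x$PERCENT > threshold)
  if (length(hit) == 0) return(NULL)
  x[hit[1], ]
}

groups <- split(obs, list(obs$YEAR, obs$SITE), drop = TRUE)
hits <- Filter(Negate(is.null), lapply(groups, first_over))
firsts <- do.call(rbind, hits)

# Mean JULIAN per site, and over all sites
site_means <- aggregate(JULIAN ~ SITE, data = firsts, FUN = mean)
overall <- mean(firsts$JULIAN)
```

Changing > to >= in first_over gives the "50 or more" rule used in the asker's modified solution.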
