I am attempting to make a plot using ggplot2 with side by side bars generated from certain conditions that can be calculated from the data. I suspect the problem is formatting my data properly so that ggplot will give me what I want. I can't for the life of me get it right though.
What I have is a data frame filled with rows for each time a student takes a course at a school. The variables of interest are Student.ID, Course.ID, Session, Fiscal.Year, and Facility. Each row is an occurrence of a student taking a course and tells what course they took, where they took it, etc. As far as I know, this is what's required for the data to be in long form (correct me if I'm wrong). The only field with possible NA values is Facility, but I plan to exclude those from the plot anyway, so you can treat the data frame as being completely filled.
What I want to do is produce a plot showing, by fiscal year, how many courses had <= 2 students, how many had < 4 students, how many had <= 4 students, and how many courses were offered in total. (Note: When I'm talking about how many courses were offered, I'm taking into account that each course may be offered multiple times, and each time it's offered it has a session number associated with it. The tricky part is that the session numbers are not unique. I hope that makes sense, and I can try to clarify more if needed.)
I envision the final product being multiple charts using a facet on the locations, the x-axis being Fiscal.Year, and the y-axis being the number of courses/sessions. For each FY in the chart, I want different colored bars side by side showing the numbers of <=2, <4, <=4, and total courses offered for that FY at that location. Consider the following chart, only instead of "Income, Expense, Loans", I want "<=2, <4, <=4, Total" (they would also be ascending from left to right, since there is inclusion between the different categories).
Here is some sample data to work with (typed as CSV since I can't just copy the head of the file). I've excluded the Facility column because faceting by that is easy and we can just assume one FY for a test example I think. For reference, it should have 3 courses with <=2 students, 5 courses with < 4, and 6 with <= 4. The total number of courses offered in this sample set is 6.
ID,CourseID,Session,Fiscal.Year
101,1,1,FY13
102,1,1,FY13
103,1,1,FY13
104,1,1,FY13
101,2,1,FY13
102,2,1,FY13
103,2,1,FY13
101,2,2,FY13
102,2,2,FY13
103,2,2,FY13
101,3,1,FY13
102,3,1,FY13
101,3,2,FY13
102,3,2,FY13
101,3,3,FY13
102,3,3,FY13
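For convenience, the typed sample can be read directly into R; this is just a sketch for loading the data exactly as shown above:
df <- read.csv(text = "ID,CourseID,Session,Fiscal.Year
101,1,1,FY13
102,1,1,FY13
103,1,1,FY13
104,1,1,FY13
101,2,1,FY13
102,2,1,FY13
103,2,1,FY13
101,2,2,FY13
102,2,2,FY13
103,2,2,FY13
101,3,1,FY13
102,3,1,FY13
101,3,2,FY13
102,3,2,FY13
101,3,3,FY13
102,3,3,FY13", stringsAsFactors = FALSE)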
I have tried:
Creating a new data frame using ddply with columns Course.ID, Session, FY, Facility, and a count of students. Then I created a new column called "TwoLess", which just has a 1 if the count is <= 2 and 0 otherwise. (I repeated this process for the other conditions, creating similar columns for each.) Using the ggplot code below I was able to get a faceted plot for only one of the conditions (i.e. only <= 2 students), but wasn't able to get them to combine. I believe the following is the equivalent code, changed to reflect my test set above:
ggplot(na.omit(df), aes(y = TwoLess, x = Fiscal.Year)) + geom_bar(stat = 'identity') + facet_wrap(~Facility)
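Roughly, the ddply step and indicator column looked something like this (reconstructed from memory, using the column names of the full data, so treat it as a sketch rather than the exact code):
library(plyr)
# count students per offering, keeping the grouping/faceting columns
counts <- ddply(df, .(Course.ID, Session, Fiscal.Year, Facility), summarise,
                Students = length(Student.ID))
# indicator: 1 if the offering had <= 2 students, 0 otherwise
counts$TwoLess <- ifelse(counts$Students <= 2, 1, 0)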
I am thinking this approach is heavily flawed and I'm missing out on some of the "niceness" of having data in long form, since that's what ggplot wants as I understand it.
What is the best way to approach plotting this in ggplot?
It's also worth mentioning that while I have access to some of the more popular packages like ggplot2, plyr, reshape2, I do not have the ability to load all packages so I would prefer a solution that uses the above packages (or any of their dependencies). It shouldn't be that large of a restriction, I don't think.
Would something like this help?
Extending your data
> dput(df)
structure(list(ID = c(101L, 102L, 103L, 104L, 101L, 102L, 103L,
101L, 102L, 103L, 101L, 102L, 101L, 102L, 101L, 102L, 101L, 102L,
103L, 104L, 101L, 102L, 103L, 101L, 102L, 103L, 101L, 102L, 101L,
102L, 101L, 102L, 101L, 102L, 103L, 104L, 101L, 102L, 103L, 101L,
102L, 103L, 101L, 102L, 101L, 102L, 101L, 102L), CourseID = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L),
Session = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L,
2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L,
1L, 2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
1L, 1L, 2L, 2L, 3L, 3L), Fiscal.Year = c("FY13", "FY13",
"FY13", "FY13", "FY13", "FY13", "FY13", "FY13", "FY13", "FY13",
"FY13", "FY13", "FY13", "FY13", "FY13", "FY13", "FY14", "FY14",
"FY14", "FY14", "FY14", "FY14", "FY14", "FY14", "FY14", "FY14",
"FY14", "FY14", "FY14", "FY14", "FY14", "FY14", "FY15", "FY15",
"FY15", "FY15", "FY15", "FY15", "FY15", "FY15", "FY15", "FY15",
"FY15", "FY15", "FY15", "FY15", "FY15", "FY15")), .Names = c("ID",
"CourseID", "Session", "Fiscal.Year"), class = "data.frame", row.names = c(NA,
-48L))
df
ID CourseID Session Fiscal.Year
1 101 1 1 FY13
2 102 1 1 FY13
3 103 1 1 FY13
4 104 1 1 FY13
5 101 2 1 FY13
6 102 2 1 FY13
7 103 2 1 FY13
8 101 2 2 FY13
9 102 2 2 FY13
10 103 2 2 FY13
11 101 3 1 FY13
12 102 3 1 FY13
13 101 3 2 FY13
14 102 3 2 FY13
15 101 3 3 FY13
16 102 3 3 FY13
17 101 1 1 FY14
18 102 1 1 FY14
19 103 1 1 FY14
20 104 1 1 FY14
21 101 2 1 FY14
22 102 2 1 FY14
23 103 2 1 FY14
24 101 2 2 FY14
25 102 2 2 FY14
26 103 2 2 FY14
27 101 3 1 FY14
28 102 3 1 FY14
29 101 3 2 FY14
30 102 3 2 FY14
31 101 3 3 FY14
32 102 3 3 FY14
33 101 1 1 FY15
34 102 1 1 FY15
35 103 1 1 FY15
36 104 1 1 FY15
37 101 2 1 FY15
38 102 2 1 FY15
39 103 2 1 FY15
40 101 2 2 FY15
41 102 2 2 FY15
42 103 2 2 FY15
43 101 3 1 FY15
44 102 3 1 FY15
45 101 3 2 FY15
46 102 3 2 FY15
47 101 3 3 FY15
48 102 3 3 FY15
Summarise it with dplyr:
library(dplyr)
d1 <- df %>%
  group_by(CourseID, Session, Fiscal.Year) %>%
  summarise(n = length(ID))
And again:
d2 <- d1 %>%
  group_by(Fiscal.Year) %>%
  summarise(d1 = length(n[n <= 2]),
            d2 = length(n[n < 4]),
            d3 = length(n[n <= 4]))
Then melt it to long form and plot with ggplot2:
library(reshape2)
library(ggplot2)
d3 <- melt(d2)
ggplot(d3, aes(Fiscal.Year, value, fill = variable)) +
  geom_bar(stat = 'identity', position = 'dodge')
Someone else may well provide a cleverer option. I'm tired and off to bed now.
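Since the question notes that only packages like ggplot2, plyr, and reshape2 are available, here is a rough plyr-based equivalent of the summarise steps above; a sketch under that assumption, not a tested drop-in replacement:
library(plyr)
library(reshape2)
library(ggplot2)
# students per course offering (CourseID + Session + Fiscal.Year)
d1 <- ddply(df, .(CourseID, Session, Fiscal.Year), summarise, n = length(ID))
# offerings per fiscal year falling under each threshold, plus the total
d2 <- ddply(d1, .(Fiscal.Year), summarise,
            two_or_less  = sum(n <= 2),
            under_four   = sum(n < 4),
            four_or_less = sum(n <= 4),
            total        = length(n))
# long form, then dodged bars
d3 <- melt(d2, id.vars = "Fiscal.Year")
ggplot(d3, aes(Fiscal.Year, value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge")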
My dataset is called studentperformance and can be seen below:
gender race lunch math reading writing
2 2 2 72 72 74
2 3 2 69 90 88
2 2 2 90 95 93
1 1 1 47 57 44
1 3 2 76 78 75
2 2 2 71 83 78
2 2 2 88 95 92
1 2 1 40 43 39
1 4 1 64 64 67
2 2 1 38 60 50
I want to count how many times the value "2" appears in the gender column. For this I tried this code:
count(studentperformance$gender[1:10], vars = "2")
But the code throws an error. How can I achieve this?
As @user2974951 said, you can use base R for that:
sum(studentperformance$gender==2)
[1] 6
You can also create a table for every level in gender:
table(studentperformance$gender,factor(studentperformance$gender))
1 2
1 4 0
2 0 6
Sample data:
studentperformance <- read.table(text = "gender race lunch math reading writing
2 2 2 72 72 74
2 3 2 69 90 88
2 2 2 90 95 93
1 1 1 47 57 44
1 3 2 76 78 75
2 2 2 71 83 78
2 2 2 88 95 92
1 2 1 40 43 39
1 4 1 64 64 67
2 2 1 38 60 50", header = TRUE)
You can create some simple tables without indexing or comparisons. Try the following with count, which will return the variable gender containing the unique values of gender, and n indicating the count of each unique value:
library(dplyr)
count(df, gender)
#### OUTPUT ####
# A tibble: 2 x 2
gender n
<int> <int>
1 1 4
2 2 6
You can do pretty much the same thing using base R's table. The output is just a little different: The unique values are now the variable headers 1 and 2, and the counts are the row just beneath, with 4 and 6:
table(df$gender)
#### OUTPUT ####
1 2
4 6
Consider also:
studentperformance <- transform(studentperformance,
                                count_by_gender = ave(studentperformance$gender,
                                                      studentperformance$gender,
                                                      FUN = length))
Data:
structure(
list(
gender = c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,
2L),
race = c(2L, 3L, 2L, 1L, 3L, 2L, 2L, 2L, 4L, 2L),
lunch = c(2L,
2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L),
math = c(72L, 69L, 90L,
47L, 76L, 71L, 88L, 40L, 64L, 38L),
reading = c(72L, 90L, 95L,
57L, 78L, 83L, 95L, 43L, 64L, 60L),
writing = c(74L, 88L, 93L,
44L, 75L, 78L, 92L, 39L, 67L, 50L),
count_by_gender = c(6L, 6L,
6L, 4L, 4L, 6L, 6L, 4L, 4L, 6L)
),
class = "data.frame",
row.names = c(NA,-10L)
)
I'm trying to fix an NA problem and to make a dot plot of a data frame read from a .CSV file.
I'm trying to get the mean, median, and 10% trimmed mean of the given data frame, but somehow I'm getting an error. I have already tried previous suggestions and they still aren't helping. I have the data but I can't plot the dot chart from it.
Code for mean, median, and 10% trimmed mean:
data_val <- read.csv(file = "~/502_repos_2019/502_Problems/health_regiment.csv",
                     head = TRUE, sep = " ")
as.numeric(unlist(data_val))
print(ncol(data_val))
print(nrow(data_val))
# I have used several logics but it's not helping to solve the problem
mean(data_val, data_val$cholesterol_level[data_val$Treatment_type == 'Control_group'])
mean(data_val$cholesterol_level[data_val$Treatment_type == 'Treatment_group'])
Code for dot chart and dot plot:
data_val <- read.csv(file = "~/502_repos_2019/502_Problems/health_regiment.csv",
                     head = TRUE, sep = " ")
data_val
plot(data_val$Treatment_type ~ data_val$cholestrol_level,
     xlab = "Health Unit Range", ylab = " ",
     main = "Regiment_Health", type = "p") # p for point chart
#dotchart(data_val, data_val$Treatment_type ~ data_val$cholestrol_leve,
#         labels = row.names(data_val), cex = 0.6, xlab = "units")
Following is the error message and console output:
 [1]  1  1  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2  2  7  3 -4 14  2  5 22 -7  9  5 -6  5  9  4  4 12 37
[38]  5  3  3
[1] 2
[1] 20
argument is not numeric or logical: returning NA
[1] NA
argument is not numeric or logical: returning NA
[1] NA
Also, instead of a point plot I'm getting a bar chart, and the dot chart syntax is not working even though I think I have given the proper syntax.
.csv data
Treatment_type cholestrol_level
Control_group 7
Control_group 3
Control_group -4
Control_group 14
Control_group 2
Control_group 5
Control_group 22
Control_group -7
Control_group 9
Control_group 5
Treatment_group -6
Treatment_group 5
Treatment_group 9
Treatment_group 4
Treatment_group 4
Treatment_group 12
Treatment_group 37
Treatment_group 5
Treatment_group 3
Treatment_group 3
Data in dput format.
data_val <-
structure(list(Treatment_type = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("Control_group", "Treatment_group"),
class = "factor"), cholestrol_level = c(7L, 3L, -4L, 14L,
2L, 5L, 22L, -7L, 9L, 5L, -6L, 5L, 9L, 4L, 4L, 12L, 37L,
5L, 3L, 3L)), class = "data.frame", row.names = c(NA, -20L))
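For reference, here is a minimal sketch of the summaries and dot chart described above, using the data_val from the dput; this is not the original code, and it follows the data's spelling of cholestrol_level:
# per-group mean, median, and 10% trimmed mean of cholestrol_level
aggregate(cholestrol_level ~ Treatment_type, data = data_val,
          FUN = function(x) c(mean = mean(x),
                              median = median(x),
                              trimmed = mean(x, trim = 0.1)))
# base-graphics dot chart of the values, grouped by treatment
dotchart(data_val$cholestrol_level,
         groups = data_val$Treatment_type,
         xlab = "Cholesterol level")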
I am trying to filter some data that I have in R. It is formatted like this:
id config_id alpha begin end day
1 1 1 5 138 139 6
2 1 2 5 137 138 6
3 1 3 5 47 48 2
4 1 3 3 46 47 2
5 1 4 3 45 46 2
6 1 4 3 43 44 2
...
id config_id alpha begin end day
1 2 1 5 138 139 6
2 2 2 5 137 138 6
3 2 2 5 136 137 6
4 2 3 3 45 46 2
5 2 3 3 44 45 2
6 2 4 3 43 44 2
My goal is to remove any configuration which results in having beginnings and endings on the same day. For example, in the top example config_id 3 is not acceptable because both instances of config_id occur on day 2. Same story for config_id 4. In the bottom example config_id 2 and config_id 3 are unacceptable for the same reason.
Basically, if a config_id is repeated AND any value in the day column shows up more than once for that config_id, then I want to remove that config_id from the list.
Right now I'm using something of a fairly complex lapply algorithm but there must be an easier way.
Thanks!
You can do this several ways, assuming your data is stored in a data frame called my_data.
base R
same_day <- aggregate(my_data$day, my_data["config_id"], function(x) any(table(x) > 1))
names(same_day)[2] <- "same_day"
my_data <- merge(my_data, same_day, by = "config_id")
my_data <- my_data[!my_data$same_day, ]
dplyr
library(dplyr)
my_data <- my_data %>%
  group_by(config_id) %>%
  mutate(same_day = any(table(day) > 1)) %>%
  filter(!same_day)
data.table
library(data.table)
my_data <- data.table(my_data, key = "config_id")
same_day <- my_data[, .(same_day = any(table(day) > 1)), by = "config_id"]
my_data[!my_data[same_day]$same_day, ]
We can also use n_distinct from dplyr. Here, I am grouping by 'id' and 'config_id', then removing the rows using filter. If the number of elements within the group is greater than 1 (n()>1) and (&) the number of distinct elements in 'day' is equal to 1 (n_distinct==1), we remove it.
library(dplyr)
df1 %>%
group_by(id, config_id) %>%
filter(!(n()>1 & n_distinct(day)==1))
#Source: local data frame [4 x 6]
#Groups: id, config_id [4]
# id config_id alpha begin end day
# (int) (int) (int) (int) (int) (int)
#1 1 1 5 138 139 6
#2 1 2 5 137 138 6
#3 2 1 5 138 139 6
#4 2 4 3 43 44 2
This should also work if we have different 'day' for the same 'config_id'.
df1$day[4] <- 3
A similar option using data.table is uniqueN. We convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'id' and 'config_id', and subset the dataset (.SD) using the logical condition.
library(data.table)#v1.9.6+
setDT(df1)[, if(!(.N>1 & uniqueN(day) == 1L)) .SD, by = .(id, config_id)]
data
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), config_id = c(1L, 2L, 3L, 3L, 4L, 4L, 1L, 2L, 2L, 3L,
3L, 4L), alpha = c(5L, 5L, 5L, 3L, 3L, 3L, 5L, 5L, 5L, 3L, 3L,
3L), begin = c(138L, 137L, 47L, 46L, 45L, 43L, 138L, 137L, 136L,
45L, 44L, 43L), end = c(139L, 138L, 48L, 47L, 46L, 44L, 139L,
138L, 137L, 46L, 45L, 44L), day = c(6L, 6L, 2L, 2L, 2L, 2L, 6L,
6L, 6L, 2L, 2L, 2L)), .Names = c("id", "config_id", "alpha",
"begin", "end", "day"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
I have a dataframe like
df <- structure(list(DATE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L), .Label = c("04/23/90", "04/28/90", "05/03/95", "05/07/95"
), class = "factor"), JULIAN = c(113L, 113L, 113L, 113L, 113L,
113L, 118L, 118L, 118L, 118L, 118L, 118L, 123L, 123L, 123L, 123L,
123L, 123L, 127L, 127L, 127L, 127L, 127L, 127L), ID = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L,
6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("AHFG-01", "AHFG-02",
"AHFG-03", "OIUR-01", "OIUR-02", "OIUR-03"), class = "factor"),
PERCENT = c(0L, 0L, 0L, 80L, 55L, 0L, 25L, 50L, 75L, 100L,
75L, 45L, 10L, 20L, 30L, 50L, 50L, 50L, 50L, 60L, 70L, 75L,
90L, 95L)), .Names = c("DATE", "JULIAN", "ID", "PERCENT"), class = "data.frame", row.names = c(NA,
-24L))
DATE JULIAN ID PERCENT
1 04/23/90 113 AHFG-01 0
2 04/23/90 113 AHFG-02 0
3 04/23/90 113 AHFG-03 0
4 04/23/90 113 OIUR-01 80
5 04/23/90 113 OIUR-02 55
6 04/23/90 113 OIUR-03 0
7 04/28/90 118 AHFG-01 25
8 04/28/90 118 AHFG-02 50
9 04/28/90 118 AHFG-03 75
10 04/28/90 118 OIUR-01 100
11 04/28/90 118 OIUR-02 75
12 04/28/90 118 OIUR-03 45
13 05/03/95 123 AHFG-01 10
14 05/03/95 123 AHFG-02 20
15 05/03/95 123 AHFG-03 30
16 05/03/95 123 OIUR-01 50
17 05/03/95 123 OIUR-02 50
18 05/03/95 123 OIUR-03 50
19 05/07/95 127 AHFG-01 50
20 05/07/95 127 AHFG-02 60
21 05/07/95 127 AHFG-03 70
22 05/07/95 127 OIUR-01 75
23 05/07/95 127 OIUR-02 90
24 05/07/95 127 OIUR-03 95
In this dataframe, ID gives replicates at different sites. For example, AHFG-01 is replicate 1 and AHFG-02 is replicate 2, both at site AHFG. PERCENT refers to percent completion.
I need to calculate two things:
1) Mean JULIAN when PERCENT first exceeds 50 for each site, across years
2) Mean JULIAN when PERCENT first exceeds 50 for all sites, across years
I am a bit baffled about the best way to proceed here. My approach is to:
1) Calculate mean PERCENT for each site (from ID) at each DATE/JULIAN
2) Identify JULIAN when mean PERCENT first exceeds 50, for each site for each YEAR
3) Calculate mean JULIAN from 2) for each site across years
4) Calculate mean JULIAN from 2) for all sites across years
For the data frame above, the end results I need by site and for sites together would look something like this:
SITE JULIAN
AHFG 122.5
OIUR 120.5
JULIAN, all sites combined = 121.5
What I have done so far is first create columns YEAR and SITE to use for operations:
df$DATE <- as.POSIXct(df$DATE, format='%m/%d/%y')
df$YEAR <- format(df$DATE, format='%Y')
df$SITE <- gsub("[^aA-zZ]", " ", df$ID)
Then I can use aggregate to calculate SITE means for step 1 above:
df2 <- aggregate(PERCENT ~ SITE + JULIAN + YEAR,FUN=mean,data=df)
However, I am getting stuck at step 2 and beyond. Can anyone suggest a way to calculate the mean JULIAN when PERCENT first exceeds 50, for each SITE across years, and all combined SITEs across years?
Solution:
Here is a modified form of Henrik's excellent solution that is working for me. Note that Henrik's original solution did work, but my question was a bit unclear about what I wanted (see comments below).
# make year column
df$DATE <- as.POSIXct(df$DATE, format='%m/%d/%y')
df$YEAR <- format(df$DATE, format='%Y')
# make new ID column (remove numbers for individuals)
df$SITE <- gsub("[^aA-zZ]", " ", df$ID)
# Calculate average PERCENT for each SITE
df2 <- aggregate(PERCENT ~ SITE + JULIAN + YEAR,FUN=mean,data=df)
# order by SITE and JULIAN
df2 <- df2[order(df2$SITE, df2$JULIAN), ]
# within each YEAR and SITE, select first registration where PERCENT is 50 or more
df2 <- do.call(rbind,
by(df2, list(df2$YEAR, df2$SITE), function(x){
x[x$PERCENT >= 50, ][1, ]
}))
# calculate mean JULIAN per SITE
aggregate(JULIAN ~ SITE, data = df2, mean)
# overall mean
mean(df2$JULIAN)
Here's one possibility:
# order by SITE and DATE
df <- df[order(df$SITE, df$DATE), ]
# within each YEAR and SITE, select first registration where PERCENT exceeds 50
df2 <- do.call(rbind,
by(df, list(df$YEAR, df$SITE), function(x){
x[x$PERCENT > 50, ][1, ]
}))
df2
# DATE JULIAN ID PERCENT YEAR SITE
# 6 1990-04-28 118 AHFG-03 75 1990 AHFG
# 11 1995-05-07 127 AHFG-02 60 1995 AHFG
# 13 1990-04-23 113 OIUR-01 80 1990 OIUR
# 22 1995-05-07 127 OIUR-01 75 1995 OIUR
# calculate mean JULIAN per SITE
aggregate(JULIAN ~ SITE, data = df2, mean)
# SITE JULIAN
# 1 AHFG 122.5
# 2 OIUR 120.0
# overall mean
mean(df2$JULIAN)
# [1] 121.25
Please note that I don't get the same mean for OIUR as in your example.
I am trying to apply a regression function to each separate level of a factor (Subject). The idea is that for each Subject, I can get a predicted reading time based on their actual reading time (RT) and the length of the corresponding printed string (WordLen). I was helped along by a colleague with some code for applying the function based on each level of another factor (Region) within (Subject). However, neither the original code nor my attempted modification (to apply the function across breaks by a single factor) works.
Here is an attempt at some sample data:
test0<-structure(list(Subject = c(101L, 101L, 101L, 101L, 101L, 101L,
101L, 101L, 101L, 101L, 102L, 102L, 102L, 102L, 102L, 102L, 102L,
102L, 102L, 102L, 103L, 103L, 103L, 103L, 103L, 103L, 103L, 103L,
103L, 103L), Region = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L), RT = c(294L, 241L, 346L, 339L, 332L, NA, 399L,
377L, 400L, 439L, 905L, 819L, 600L, 520L, 811L, 1021L, 508L,
550L, 1048L, 1246L, 470L, NA, 385L, 347L, 592L, 507L, 472L, 396L,
761L, 430L), WordLen = c(3L, 3L, 3L, 3L, 3L, 3L, 5L, 7L, 3L,
9L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 7L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 5L, 7L, 3L)), .Names = c("Subject", "Region", "RT", "WordLen"
), class = "data.frame", row.names = c(NA, -30L))
The unfortunate thing is that this data is returning a problem that I don't get with my full dataset:
"Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases"
Maybe this is because the sample data is too small?
Anyway, I am hoping that someone will see the issue with the code, despite my inability to provide working data...
This is the original code (does not work):
for(i in 1:length(levels(test0$Subject)))
for(j in 1:length(levels(test0$Region)))
{tmp=predict(lm(RT~WordLen,test0[test0$Subject==levels(test0$Subject)[i] & test0$Region==levels(test0$Region)[j],],na.action="na.exclude"))
test0[names(tmp),"rt.predicted"]=tmp
}
And this is the modified code (which not surprisingly, also does not work):
for(i in 1:length(levels(test0$Subject)))
{tmp=predict(lm(RT~WordLen,test0[test0$Subject==levels(test0$Subject)[i],],na.action="na.exclude"))
test0[names(tmp),"rt.predicted"]=tmp
}
I would very much appreciate any suggestions.
You can achieve this with the ddply() function from the plyr package.
This will split the data frame according to Subject, calculate the predictions of the regression model, and then add them as a new column to the data frame.
library(plyr)
ddply(test0, .(Subject), transform,
      pred = predict(lm(RT ~ WordLen, na.action = "na.exclude")))
Subject Region RT WordLen pred
1 101 1 294 3 327.9778
......
4 101 1 339 3 327.9778
5 101 1 332 3 327.9778
6 101 2 NA 3 NA
7 101 2 399 5 363.8444
.......
13 102 1 600 3 785.4146
To split the data by Subject and Region you should put both variables inside .().
ddply(test0,.(Subject,Region),transform,
pred=predict(lm(RT~WordLen,na.action="na.exclude")))
The only problem in your test data is that Subject and Region are not factors.
test0$Subject <- factor(test0$Subject)
test0$Region <- factor(test0$Region)
for(i in 1:length(levels(test0$Subject)))
for(j in 1:length(levels(test0$Region)))
{tmp=predict(lm(RT~WordLen,test0[test0$Subject==levels(test0$Subject)[i] & test0$Region==levels(test0$Region)[j],],na.action="na.exclude"))
test0[names(tmp),"rt.predicted"]=tmp
}
# 26 27 28 29 30
# 442.25 442.25 560.50 678.75 442.25
The reason you were getting the error (0 non-NA cases) is that when you were subsetting, you were doing it on levels of variables that were not factors. In your original dataset, try:
test0[test0$Subject==levels(test0$Subject)[1],]
You get:
# [1] Subject Region RT WordLen
# <0 rows> (or 0-length row.names)
That is what lm() was trying to work with.
While your question seems to be asking for an explanation of the error, which others have answered (the variables not being factors at all), here is a way to do it using just base packages:
test0$rt.predicted <- unlist(by(test0[, c("RT", "WordLen")], list(test0$Subject, test0$Region),
                                FUN = function(x) predict(lm(RT ~ WordLen, x, na.action = "na.exclude"))))
test0
## Subject Region RT WordLen rt.predicted
## 1 101 1 294 3 310.4000
## 2 101 1 241 3 310.4000
## 3 101 1 346 3 310.4000
## 4 101 1 339 3 310.4000
## 5 101 1 332 3 310.4000
## 6 101 2 NA 3 731.0000
## 7 101 2 399 5 731.0000
## 8 101 2 377 7 731.0000
## 9 101 2 400 3 731.0000
## 10 101 2 439 9 731.0000
## 11 102 1 905 3 448.5000
## 12 102 1 819 3 NA
## 13 102 1 600 3 448.5000
## 14 102 1 520 3 448.5000
## 15 102 1 811 3 448.5000
## 16 102 2 1021 3 NA
## 17 102 2 508 3 399.0000
## 18 102 2 550 5 408.5000
## 19 102 2 1048 7 389.5000
## 20 102 2 1246 3 418.0000
## 21 103 1 470 3 870.4375
## 22 103 1 NA 3 870.4375
## 23 103 1 385 3 877.3750
## 24 103 1 347 3 884.3125
## 25 103 1 592 3 870.4375
## 26 103 2 507 3 442.2500
## 27 103 2 472 3 442.2500
## 28 103 2 396 5 560.5000
## 29 103 2 761 7 678.7500
## 30 103 2 430 3 442.2500
I would expect that this is caused by the fact that for some combination of your two categorical variables no data exists. What you could do is first extract the subset, check that it isn't empty, and only perform the lm() if there is data.
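A minimal sketch of that idea, assuming the test0 data from the question and the rt.predicted column used in the other answers (the empty/complete-case guard is the addition here):
test0$rt.predicted <- NA
for (s in unique(test0$Subject)) {
  for (r in unique(test0$Region)) {
    sub <- test0[test0$Subject == s & test0$Region == r, ]
    # only fit the model if the subset has at least two complete RT/WordLen pairs
    if (nrow(sub) > 0 && sum(complete.cases(sub[, c("RT", "WordLen")])) >= 2) {
      tmp <- predict(lm(RT ~ WordLen, data = sub, na.action = "na.exclude"))
      test0[names(tmp), "rt.predicted"] <- tmp
    }
  }
}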