selecting groups with zero values by action column in R

I have the following data:
mydat=structure(list(group = c(111L, 111L, 111L, 111L, 111L, 111L,
111L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 555L, 555L, 555L,
555L, 555L, 555L, 555L), group2 = c(222L, 222L, 222L, 222L, 222L,
222L, 222L, 444L, 444L, 444L, 444L, 444L, 444L, 444L, 666L, 666L,
666L, 666L, 666L, 666L, 666L), action = c(0L, 0L, 0L, 1L, 1L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L
), x1 = c(1L, 2L, 3L, 0L, 0L, 1L, 2L, 1L, 2L, 3L, 0L, 0L, 1L,
2L, 1L, 2L, 3L, 10L, 20L, 1L, 2L)), .Names = c("group", "group2",
"action", "x1"), class = "data.frame", row.names = c(NA, -21L
))
Here there are two group variables (group and group2) and three groups:
111 222
333 444
555 666
The action column can take only the values 0 and 1.
I need to find the groups where, for the action == 1 category, x1 contains only zero values.
In our case these are
111 222
333 444
because for every action == 1 row they have zeros in x1.
So I can work only with the 555 666 group, because it has at least one non-zero x1 value for action == 1.
The desired output:
mydat1: groups with at least one non-zero x1 value for action == 1.
group group2 action x1
555 666 0 1
555 666 0 2
555 666 0 3
555 666 1 10
555 666 1 20
555 666 0 1
555 666 0 2
mydat2: groups where x1 is zero for every action == 1 row.
group group2 action x1
111 222 0 1
111 222 0 2
111 222 0 3
111 222 1 0
111 222 1 0
111 222 0 1
111 222 0 2
333 444 0 1
333 444 0 2
333 444 0 3
333 444 1 0
333 444 1 0
333 444 0 1
333 444 0 2

If I understand you correctly, your question is:
I need to find the groups where, for the action == 1 category, x1 contains
only zero values.
So here is the response:
library(tidyverse)
mydat %>%
  group_by(action) %>%
  filter(action == 1 & x1 == 0)
and the response is:
group group2 action x1
<int> <int> <int> <int>
1 111 222 1 0
2 111 222 1 0
3 333 444 1 0
4 333 444 1 0
What does this code do?
It looks at the action column, which has two categories across all rows (0 and 1). It then keeps the observations that satisfy action == 1 & x1 == 0; that is, among the rows with action == 1, it keeps those where x1 == 0 as well.
Can the script return all values of the 555/666 group?
No, it does not return that group, and it should not. Let's write code that filters for 555 and 666:
library(tidyverse)
mydat %>%
  group_by(action) %>%
  filter(group == 555 | group2 == 666)
and the result is:
group group2 action x1
<int> <int> <int> <int>
1 555 666 0 1
2 555 666 0 2
3 555 666 0 3
4 555 666 1 10
5 555 666 1 20
6 555 666 0 1
7 555 666 0 2
As you can see, none of these observations fulfills the condition action == 1 & x1 == 0. Therefore, they are not part of the earlier result.
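If the goal is to actually produce mydat1 and mydat2 from the question, a group-wise filter is one way to do it. This is only a sketch (it assumes dplyr is loaded and that a group is identified by the pair group/group2):
library(dplyr)
# groups with at least one non-zero x1 among the action == 1 rows
mydat1 <- mydat %>%
  group_by(group, group2) %>%
  filter(any(x1[action == 1] != 0)) %>%
  ungroup()
# groups whose x1 is zero for every action == 1 row
mydat2 <- mydat %>%
  group_by(group, group2) %>%
  filter(all(x1[action == 1] == 0)) %>%
  ungroup()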

Computing mean and variance over time per group

I have a data frame with 1,000,000 rows. I would like to calculate the mean and variance of Tor over time for each SID, to see if I can predict when Tor is starting to go out of limits. The low limit is 0.4 and the high limit is 0.7. Below is a small example of my data.
dat <- structure(list(timestamp = c("29-06-2021-06:00", "29-06-2021-06:01",
"29-06-2021-06:02", "29-06-2021-06:03", "29-06-2021-06:04", "29-06-2021-06:05",
"29-06-2021-06:06", "29-06-2021-06:07", "29-06-2021-06:08", "29-06-2021-06:09",
"29-06-2021-06:10", "29-06-2021-06:11", "29-06-2021-06:12", "29-06-2021-06:13",
"29-06-2021-06:14", "29-06-2021-06:15", "29-06-2021-06:16", "29-06-2021-06:17",
"29-06-2021-06:18", "29-06-2021-06:19", "29-06-2021-06:20", "29-06-2021-06:21",
"29-06-2021-06:22", "29-06-2021-06:23", "29-06-2021-06:24", "29-06-2021-06:25",
"29-06-2021-06:26"), SID = c(301L, 351L, 304L, 357L, 358L, 302L,
303L, 309L, 356L, 304L, 308L, 351L, 304L, 357L, 358L, 302L, 303L,
352L, 307L, 353L, 304L, 308L, 352L, 307L, 304L, 354L, 356L),
Tor = c(0.70161919, 0.639416295, 0.288282073, 0.932362166,
0.368616626, 0.42175565, 0.409735918, 0.942170196, 0.381396521,
0.818102394, 0.659391671, 0.246387978, 0.196001777, 0.632630259,
0.66618385, 0.440625167, 0.639759498, 0.050001835, 0.775660271,
0.762934189, 0.516830196, 0.244674975, 0.38620466, 0.970792903,
0.752674581, 0.190366737, 0.56596405), Lowt = c(0L, 0L, 1L,
0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L), Hit = c(1L, 0L, 0L,
1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-27L))
head(dat)
# timestamp SID Tor Lowt Hit
#1 29-06-2021-06:00 301 0.7016192 0 1
#2 29-06-2021-06:01 351 0.6394163 0 0
#3 29-06-2021-06:02 304 0.2882821 1 0
#4 29-06-2021-06:03 357 0.9323622 0 1
#5 29-06-2021-06:04 358 0.3686166 1 0
#6 29-06-2021-06:05 302 0.4217556 0 0
timestamp is when the sample is recorded.
SID is the ID of the part taking the reading. These values can be 301-310 and 351-360.
Tor is the actual reading, and its data type is <dbl>.
Lowt is a binary variable showing that the Tor reading is below the lower limit.
Hit is a binary variable showing that the Tor reading is above the upper limit.
I have read up about variance but I can't seem to get my head around it. Any help would be great.
This is a very good question. You want to compute cumulative mean and cumulative variance of Tor (over time) per SID. Given the volume of your actual dataset, it is appropriate to use online algorithms. See my answer and Benjamin's answer on this topic back in 2018 for algorithmic details. In brief, my contribution is:
cummean <- function(x) cumsum(x) / seq_along(x)
cumvar <- function(x, sd = FALSE) {
  ## shift by a randomly picked observation to guard against loss of precision
  x <- x - x[sample.int(length(x), 1)]
  n <- seq_along(x)
  ## one-pass formula: (sum of squares - (sum)^2 / n) / (n - 1)
  v <- (cumsum(x ^ 2) - cumsum(x) ^ 2 / n) / (n - 1)
  ## optionally return the cumulative standard deviation instead
  if (sd) v <- sqrt(v)
  v
}
The extra work required here is to apply these functions per SID.
## sort data entries
sorted_dat <- dat[order(dat$SID, dat$timestamp), ]
## split Tor by SID
lst <- split(sorted_dat$Tor, sorted_dat$SID)
## apply cummean() and cumvar()
runmean <- unlist(lapply(lst, cummean), use.names = FALSE)
runvar <- unlist(lapply(lst, cumvar), use.names = FALSE)
## add back
sorted_dat$runmean <- runmean
sorted_dat$runvar <- runvar
Here is the result. Don't be surprised by the NaN in variance. The first value is always NaN within each SID. This is normal (variance can only be computed from at least two data points).
## inspection
sorted_dat
# timestamp SID Tor Lowt Hit runmean runvar
#1 29-06-2021-06:00 301 0.70161919 0 1 0.70161919 NaN
#6 29-06-2021-06:05 302 0.42175565 0 0 0.42175565 NaN
#16 29-06-2021-06:15 302 0.44062517 0 0 0.43119041 0.0001780293
#7 29-06-2021-06:06 303 0.40973592 1 0 0.40973592 NaN
#17 29-06-2021-06:16 303 0.63975950 0 0 0.52474771 0.0264554237
#3 29-06-2021-06:02 304 0.28828207 1 0 0.28828207 NaN
#10 29-06-2021-06:09 304 0.81810239 0 1 0.55319223 0.1403547863
#13 29-06-2021-06:12 304 0.19600178 1 0 0.43412875 0.1127057339
#21 29-06-2021-06:20 304 0.51683020 0 0 0.45480411 0.0768470383
#25 29-06-2021-06:24 304 0.75267458 0 1 0.51437820 0.0753806422
#19 29-06-2021-06:18 307 0.77566027 0 1 0.77566027 NaN
#24 29-06-2021-06:23 307 0.97079290 0 1 0.87322659 0.0190383720
#11 29-06-2021-06:10 308 0.65939167 0 0 0.65939167 NaN
#22 29-06-2021-06:21 308 0.24467497 1 0 0.45203332 0.0859949690
#8 29-06-2021-06:07 309 0.94217020 0 1 0.94217020 NaN
#2 29-06-2021-06:01 351 0.63941629 0 0 0.63941629 NaN
#12 29-06-2021-06:11 351 0.24638798 1 0 0.44290214 0.0772356290
#18 29-06-2021-06:17 352 0.05000184 1 0 0.05000184 NaN
#23 29-06-2021-06:22 352 0.38620466 1 0 0.21810325 0.0565161698
#20 29-06-2021-06:19 353 0.76293419 0 1 0.76293419 NaN
#26 29-06-2021-06:25 354 0.19036674 1 0 0.19036674 NaN
#9 29-06-2021-06:08 356 0.38139652 1 0 0.38139652 NaN
#27 29-06-2021-06:26 356 0.56596405 0 0 0.47368029 0.0170325864
#4 29-06-2021-06:03 357 0.93236217 0 1 0.93236217 NaN
#14 29-06-2021-06:13 357 0.63263026 0 0 0.78249621 0.0449196080
#5 29-06-2021-06:04 358 0.36861663 1 0 0.36861663 NaN
#15 29-06-2021-06:14 358 0.66618385 0 0 0.51740024 0.0442731264
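For reference, the same per-SID application could also be written as a dplyr pipeline. This is just a sketch that reuses the cummean() and cumvar() functions defined above:
library(dplyr)
dat %>%
  arrange(SID, timestamp) %>%
  group_by(SID) %>%
  mutate(runmean = cummean(Tor),
         runvar  = cumvar(Tor)) %>%
  ungroup()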

Identify cases where data sequence changes based on other column UserIDs

I am working on a data frame df which is as below:
Input:
TUserId SUID mid_sum final_sum
115 201 2 7
115 309 1 8
115 404 1 9
209 245 2 10
209 398 2 10
209 510 2 10
209 602 1 10
371 111 2 11
371 115 1 11
371 123 3 11
371 124 2 11
1- My data is arranged in a wide format, where each row has a unique student ID shown as SUID.
2- Several students can have the same teacher and hence the common teacher ID across multiple rows shown as TUserId.
3- The data includes student scores in mid-terms and then students' final scores.
4- I am interested in finding out if there are any instances where a teacher who gave similar scores to their students on mid-terms as shown by mid_sum gave inconsistent scores on their final exams as shown by final_sum. If such inconsistency is found in data, I want to add a column Status that records this inconsistency.
Requirement:
a- For this, my rule is that mid_sum and final_sum are sorted in ascending order, as I have done in this example data frame df, and I want to identify the cases where the ascending sequence breaks in either of these columns.
b- Can it be done if the data is not sorted?
Example 1:
For example, for SUID = 309, mid_sum decreases relative to the previous mid_sum, so it should be marked as inconsistent. This should only apply to students who were marked by the same teacher TUserId, which in this case is 115.
Example 2:
Similarly, for SUID = 602, mid_sum decreases relative to the previous mid_sum, so it should be marked as inconsistent. Again, this is within the same teacher, TUserId = 209.
To elaborate further, I want an output like this:
Output:
TUserId SUID mid_sum final_sum Status
115 201 2 7 consistent
115 309 1 8 inconsistent
115 404 1 9 consistent
209 245 2 10 consistent
209 398 2 10 consistent
209 510 2 10 consistent
209 602 1 10 inconsistent
371 111 2 11 consistent
371 115 1 11 inconsistent
371 123 3 11 consistent
371 124 2 11 inconsistent
Data import dput()
The dput() for the data frame is below:
dput(df)
structure(list(
TUserId = c(115L, 115L, 115L, 209L, 209L, 209L, 209L, 371L, 371L, 371L, 371L),
SUID = c(201L, 309L, 404L, 245L, 398L, 510L, 602L, 111L, 115L, 123L, 124L),
mid_sum = c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 3L, 2L),
final_sum = c(7L, 8L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L)),
class = "data.frame", row.names = c(NA, -11L))
I looked for similar questions on SO and found this R - identify consecutive sequences but it does not seem to help me address my question.
Another related post was Determine when a sequence of numbers has been broken in R but again, it does not help in my case.
Any advice on how to solve this problem would be greatly appreciated.
Thanks!
Here's a fairly straightforward way where we test the sign of the lagged difference. If the mid_sum difference sign is the same as the final_sum difference sign, they are "consistent".
library(dplyr)
df %>%
  arrange(TUserId, final_sum) %>%
  group_by(TUserId) %>%
  mutate(
    Status = if_else(
      sign(final_sum + 0.1 - lag(final_sum, default = 0)) ==
        sign(mid_sum + 0.1 - lag(mid_sum, default = 0)),
      "consistent", "inconsistent"
    )
  )
# # A tibble: 11 x 5
# # Groups: TUserId [3]
# TUserId SUID mid_sum final_sum Status
# <int> <int> <int> <int> <chr>
# 1 115 201 2 7 consistent
# 2 115 309 1 8 inconsistent
# 3 115 404 1 9 consistent
# 4 209 245 2 10 consistent
# 5 209 398 2 10 consistent
# 6 209 510 2 10 consistent
# 7 209 602 1 10 inconsistent
# 8 371 111 2 11 consistent
# 9 371 115 1 11 inconsistent
# 10 371 123 3 11 consistent
# 11 371 124 2 11 inconsistent
The + .1 serves to make rows where the scores stay the same count as a positive sign.
Perhaps the accumulate family of functions was designed for these situations. Using accumulate2 here:
As the first argument I pass mid_sum.
The second argument is the lagged value, i.e. lag(mid_sum), with a default that is neither NA nor a value the column can actually take; 0 is safe here.
.init can be given any value; I chose 'c'.
If the first argument (..2; ..1 is the accumulated value, not the first argument) is less than ..3, i.e. the second argument, return 'inconsistent', else 'consistent'.
Since .init is provided, the result is one value longer than the input, so its first value is stripped with [-1].
df <- structure(list(
TUserId = c(115L, 115L, 115L, 209L, 209L, 209L, 209L, 371L, 371L, 371L, 371L),
SUID = c(201L, 309L, 404L, 245L, 398L, 510L, 602L, 111L, 115L, 123L, 124L),
mid_sum = c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 3L, 2L),
final_sum = c(7L, 8L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L)),
class = "data.frame", row.names = c(NA, -11L))
library(tidyverse)
df %>%
  arrange(TUserId, final_sum) %>%
  group_by(TUserId) %>%
  mutate(status = unlist(accumulate2(mid_sum, lag(mid_sum, default = 0), .init = 'c',
                                     ~ if (..2 < ..3) 'inconsistent' else 'consistent')[-1]))
#> # A tibble: 11 x 5
#> # Groups: TUserId [3]
#> TUserId SUID mid_sum final_sum status
#> <int> <int> <int> <int> <chr>
#> 1 115 201 2 7 consistent
#> 2 115 309 1 8 inconsistent
#> 3 115 404 1 9 consistent
#> 4 209 245 2 10 consistent
#> 5 209 398 2 10 consistent
#> 6 209 510 2 10 consistent
#> 7 209 602 1 10 inconsistent
#> 8 371 111 2 11 consistent
#> 9 371 115 1 11 inconsistent
#> 10 371 123 3 11 consistent
#> 11 371 124 2 11 inconsistent
Created on 2021-06-15 by the reprex package (v2.0.0)
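For completeness, here is a base R sketch of the same rule. It assumes, as in the question, that rows are already sorted by final_sum within each TUserId, and it flags any within-teacher decrease of mid_sum:
## 1 where mid_sum drops relative to the previous row of the same teacher, else 0
drop <- with(df, ave(mid_sum, TUserId, FUN = function(m) c(0, diff(m)) < 0))
df$Status <- ifelse(drop == 1, "inconsistent", "consistent")
df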

Drop the rows after the Event/Diseased(1) occurred in R

I'm new to R. I have a set of patient IDs with disease status. I want to drop the rows after the first occurrence of Disease == 1 within each ID. My data set looks like:
ID Date Disease
123 02-03-2012 0
123 03-03-2013 1
123 04-03-2014 0
321 03-03-2015 1
423 06-06-2016 1
423 07-06-2017 1
543 08-05-2018 1
543 09-06-2019 0
645 08-09-2019 0
645 10-10-2018 0
645 11-10-2012 0
Expected Output
ID Date Disease
123 02-03-2012 0
123 03-03-2013 1
321 03-03-2015 1
423 06-06-2016 1
543 08-05-2018 1
645 08-09-2019 0
645 10-10-2018 0
645 11-10-2012 0
Kindly suggest code that returns the expected output.
Thanks in advance!
Using dplyr, one way would be to select all rows if no Disease == 1 occurs in an ID, or otherwise select rows only up to the first 1.
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(if (any(Disease == 1)) row_number() <= match(1, Disease) else TRUE)
# ID Date Disease
# <int> <chr> <int>
#1 123 02-03-2012 0
#2 123 03-03-2013 1
#3 321 03-03-2015 1
#4 423 06-06-2016 1
#5 543 08-05-2018 1
#6 645 08-09-2019 0
#7 645 10-10-2018 0
#8 645 11-10-2012 0
data
df <- structure(list(ID = c(123L, 123L, 123L, 321L, 423L, 423L, 543L,
543L, 645L, 645L, 645L), Date = c("02-03-2012", "03-03-2013",
"04-03-2014", "03-03-2015", "06-06-2016", "07-06-2017", "08-05-2018",
"09-06-2019", "08-09-2019", "10-10-2018", "11-10-2012"), Disease = c(0L,
1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-11L))
This would do it, shown here on a small simulated example:
library(dplyr)
set.seed(1012)
datas <- tibble(ids = rep(1:3, each = 3),
                times = runif(9, 0, 100),
                event = rep(c(0, 1, 0), 3)) %>%
  arrange(ids, times)
datas %>%
  group_by(ids) %>%
  filter(lag(cumsum(event), default = 0) == 0)
We can use cumsum to create a logical vector for subsetting
library(data.table)
setDT(df)[df[, .I[cumsum(cumsum(Disease)) <= 1], ID]$V1]
# ID Date Disease
#1: 123 02-03-2012 0
#2: 123 03-03-2013 1
#3: 321 03-03-2015 1
#4: 423 06-06-2016 1
#5: 543 08-05-2018 1
#6: 645 08-09-2019 0
#7: 645 10-10-2018 0
#8: 645 11-10-2012 0
Or using dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(cumsum(cumsum(Disease)) <= 1)
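To see why the double cumsum works, here is the trick applied to one ID's Disease vector:
d <- c(0, 1, 0)
cumsum(d)               # 0 1 1 : becomes 1 from the first event onward
cumsum(cumsum(d))       # 0 1 2 : exceeds 1 only after the first event row
cumsum(cumsum(d)) <= 1  # TRUE TRUE FALSE : keep rows up to and including the first 1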
data
df <- structure(list(ID = c(123L, 123L, 123L, 321L, 423L, 423L, 543L,
543L, 645L, 645L, 645L), Date = c("02-03-2012", "03-03-2013",
"04-03-2014", "03-03-2015", "06-06-2016", "07-06-2017", "08-05-2018",
"09-06-2019", "08-09-2019", "10-10-2018", "11-10-2012"), Disease = c(0L,
1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L)), class = "data.frame",
row.names = c(NA,
-11L))

function for event occurrence data in r

I have a patient data set. I need to drop the rows after the first occurrence of 1 in the Disease column. For instance:
ID Date Disease
123 02-03-2012 0
123 03-03-2013 1
123 04-03-2014 0
321 03-03-2015 1
423 06-06-2016 1
423 07-06-2017 1
543 08-05-2018 1
543 09-06-2019 0
645 08-09-2019 0
and the expected output I want:
ID Date Disease
123 02-03-2012 0
123 03-03-2013 1
321 03-03-2015 1
423 06-06-2016 1
543 08-05-2018 1
One way with dplyr: select rows up to the first occurrence of 1 for each ID.
library(dplyr)
df %>% group_by(ID) %>% filter(row_number() <= which(Disease == 1)[1])
# ID Date Disease
# <int> <fct> <int>
#1 123 02-03-2012 0
#2 123 03-03-2013 1
#3 321 03-03-2015 1
#4 423 06-06-2016 1
#5 543 08-05-2018 1
We can also use slice
df %>% group_by(ID) %>% slice(if(any(Disease == 1)) 1:which.max(Disease) else 0)
data
df <- structure(list(ID = c(123L, 123L, 123L, 321L, 423L, 423L, 543L,
543L, 645L), Date = structure(c(1L, 2L, 4L, 3L, 5L, 6L, 7L, 9L,
8L), .Label = c("02-03-2012", "03-03-2013", "03-03-2015", "04-03-2014",
"06-06-2016", "07-06-2017", "08-05-2018", "08-09-2019", "09-06-2019"
), class = "factor"), Disease = c(0L, 1L, 0L, 1L, 1L, 1L, 1L,
0L, 0L)), class = "data.frame", row.names = c(NA, -9L))
I have no idea why you don't have the line 645 08-09-2019 0 in your expected result. The first occurrence of disease for ID 645 has not appeared yet, so I guess you might have missed it in your expected result.
Based on my guess above, maybe you can try the base R solution below, using subset + ave
dfout <- subset(df, !!ave(Disease, ID, FUN = function(v) !duplicated(cumsum(v) > 0)))
such that
> dfout
ID Date Disease
1 123 02-03-2012 0
2 123 03-03-2013 1
4 321 03-03-2015 1
5 423 06-06-2016 1
7 543 08-05-2018 1
9 645 08-09-2019 0
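To unpack the ave() trick on one ID's Disease vector:
v <- c(0, 1, 0)
cumsum(v) > 0               # FALSE TRUE TRUE : has the event occurred yet?
!duplicated(cumsum(v) > 0)  # TRUE TRUE FALSE : the first FALSE and the first TRUE are kept
The leading !! in the subset() call then coerces the 0/1 vector returned by ave() back to logical.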
DATA
df <- structure(list(ID = c(123L, 123L, 123L, 321L, 423L, 423L, 543L,
543L, 645L), Date = c("02-03-2012", "03-03-2013", "04-03-2014",
"03-03-2015", "06-06-2016", "07-06-2017", "08-05-2018", "09-06-2019",
"08-09-2019"), Disease = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L
)), class = "data.frame", row.names = c(NA, -9L))

apply regression while looping through levels of a factor in R

I am trying to apply a regression function to each separate level of a factor (Subject). The idea is that for each Subject, I can get a predicted reading time based on their actual reading time (RT) and the length of the corresponding printed string (WordLen). I was helped along by a colleague with some code for applying the function to each level of another factor (Region) within (Subject). However, neither the original code nor my attempted modification (to apply the function across levels of a single factor) works.
Here is an attempt at some sample data:
test0<-structure(list(Subject = c(101L, 101L, 101L, 101L, 101L, 101L,
101L, 101L, 101L, 101L, 102L, 102L, 102L, 102L, 102L, 102L, 102L,
102L, 102L, 102L, 103L, 103L, 103L, 103L, 103L, 103L, 103L, 103L,
103L, 103L), Region = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L), RT = c(294L, 241L, 346L, 339L, 332L, NA, 399L,
377L, 400L, 439L, 905L, 819L, 600L, 520L, 811L, 1021L, 508L,
550L, 1048L, 1246L, 470L, NA, 385L, 347L, 592L, 507L, 472L, 396L,
761L, 430L), WordLen = c(3L, 3L, 3L, 3L, 3L, 3L, 5L, 7L, 3L,
9L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 7L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 5L, 7L, 3L)), .Names = c("Subject", "Region", "RT", "WordLen"
), class = "data.frame", row.names = c(NA, -30L))
The unfortunate thing is that this data is returning a problem that I don't get with my full dataset:
"Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases"
Maybe this is because the sample data is too small?
Anyway, I am hoping that someone will see the issue with the code, despite my inability to provide working data...
This is the original code (does not work):
for (i in 1:length(levels(test0$Subject)))
  for (j in 1:length(levels(test0$Region))) {
    tmp = predict(lm(RT ~ WordLen,
                     test0[test0$Subject == levels(test0$Subject)[i] &
                           test0$Region == levels(test0$Region)[j], ],
                     na.action = "na.exclude"))
    test0[names(tmp), "rt.predicted"] = tmp
  }
And this is the modified code (which not surprisingly, also does not work):
for (i in 1:length(levels(test0$Subject))) {
  tmp = predict(lm(RT ~ WordLen,
                   test0[test0$Subject == levels(test0$Subject)[i], ],
                   na.action = "na.exclude"))
  test0[names(tmp), "rt.predicted"] = tmp
}
I would very much appreciate any suggestions.
You can achieve this result with the function ddply() from the plyr library.
It will split the data frame according to Subject, calculate the predictions of the regression model, and then add them as a new column to the data frame.
library(plyr)
ddply(test0, .(Subject), transform,
      pred = predict(lm(RT ~ WordLen, na.action = "na.exclude")))
Subject Region RT WordLen pred
1 101 1 294 3 327.9778
......
4 101 1 339 3 327.9778
5 101 1 332 3 327.9778
6 101 2 NA 3 NA
7 101 2 399 5 363.8444
.......
13 102 1 600 3 785.4146
To split the data by Subject and Region, you should put both variables inside .().
ddply(test0, .(Subject, Region), transform,
      pred = predict(lm(RT ~ WordLen, na.action = "na.exclude")))
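Since plyr is now retired, a rough dplyr equivalent might look like the following sketch (it assumes dplyr >= 1.1 for pick()):
library(dplyr)
test0 %>%
  group_by(Subject, Region) %>%
  mutate(pred = predict(lm(RT ~ WordLen, data = pick(RT, WordLen),
                           na.action = "na.exclude"))) %>%
  ungroup()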
The only problem in your test data is that Subject and Region are not factors.
test0$Subject <- factor(test0$Subject)
test0$Region <- factor(test0$Region)
for (i in 1:length(levels(test0$Subject)))
  for (j in 1:length(levels(test0$Region))) {
    tmp = predict(lm(RT ~ WordLen,
                     test0[test0$Subject == levels(test0$Subject)[i] &
                           test0$Region == levels(test0$Region)[j], ],
                     na.action = "na.exclude"))
    test0[names(tmp), "rt.predicted"] = tmp
  }
# 26 27 28 29 30
# 442.25 442.25 560.50 678.75 442.25
The reason you were getting that error (0 non-NA cases) is that when you were subsetting, you were doing it on levels of variables that were not factors. In your original dataset, try:
test0[test0$Subject==levels(test0$Subject)[1],]
You get:
# [1] Subject Region RT WordLen
# <0 rows> (or 0-length row.names)
That empty data frame is what lm() was trying to work with.
While your question seems to ask for an explanation of the error, which others have answered (the data not being factors at all), here is a way to do it using just base packages:
test0$rt.predicted <- unlist(by(test0[, c("RT", "WordLen")],
                                list(test0$Subject, test0$Region),
                                FUN = function(x)
                                  predict(lm(RT ~ WordLen, x, na.action = "na.exclude"))))
test0
## Subject Region RT WordLen rt.predicted
## 1 101 1 294 3 310.4000
## 2 101 1 241 3 310.4000
## 3 101 1 346 3 310.4000
## 4 101 1 339 3 310.4000
## 5 101 1 332 3 310.4000
## 6 101 2 NA 3 731.0000
## 7 101 2 399 5 731.0000
## 8 101 2 377 7 731.0000
## 9 101 2 400 3 731.0000
## 10 101 2 439 9 731.0000
## 11 102 1 905 3 448.5000
## 12 102 1 819 3 NA
## 13 102 1 600 3 448.5000
## 14 102 1 520 3 448.5000
## 15 102 1 811 3 448.5000
## 16 102 2 1021 3 NA
## 17 102 2 508 3 399.0000
## 18 102 2 550 5 408.5000
## 19 102 2 1048 7 389.5000
## 20 102 2 1246 3 418.0000
## 21 103 1 470 3 870.4375
## 22 103 1 NA 3 870.4375
## 23 103 1 385 3 877.3750
## 24 103 1 347 3 884.3125
## 25 103 1 592 3 870.4375
## 26 103 2 507 3 442.2500
## 27 103 2 472 3 442.2500
## 28 103 2 396 5 560.5000
## 29 103 2 761 7 678.7500
## 30 103 2 430 3 442.2500
I would expect that this is caused by the fact that no data exists for some combination of your two categorical variables. What you could do is first extract the subset, check that it isn't NULL, and only run lm() if there is data.
