Calculating means based on conditionals for another column in R

I have a dataframe like
df <- structure(list(DATE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L), .Label = c("04/23/90", "04/28/90", "05/03/95", "05/07/95"
), class = "factor"), JULIAN = c(113L, 113L, 113L, 113L, 113L,
113L, 118L, 118L, 118L, 118L, 118L, 118L, 123L, 123L, 123L, 123L,
123L, 123L, 127L, 127L, 127L, 127L, 127L, 127L), ID = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L,
6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("AHFG-01", "AHFG-02",
"AHFG-03", "OIUR-01", "OIUR-02", "OIUR-03"), class = "factor"),
PERCENT = c(0L, 0L, 0L, 80L, 55L, 0L, 25L, 50L, 75L, 100L,
75L, 45L, 10L, 20L, 30L, 50L, 50L, 50L, 50L, 60L, 70L, 75L,
90L, 95L)), .Names = c("DATE", "JULIAN", "ID", "PERCENT"), class = "data.frame", row.names = c(NA,
-24L))
DATE JULIAN ID PERCENT
1 04/23/90 113 AHFG-01 0
2 04/23/90 113 AHFG-02 0
3 04/23/90 113 AHFG-03 0
4 04/23/90 113 OIUR-01 80
5 04/23/90 113 OIUR-02 55
6 04/23/90 113 OIUR-03 0
7 04/28/90 118 AHFG-01 25
8 04/28/90 118 AHFG-02 50
9 04/28/90 118 AHFG-03 75
10 04/28/90 118 OIUR-01 100
11 04/28/90 118 OIUR-02 75
12 04/28/90 118 OIUR-03 45
13 05/03/95 123 AHFG-01 10
14 05/03/95 123 AHFG-02 20
15 05/03/95 123 AHFG-03 30
16 05/03/95 123 OIUR-01 50
17 05/03/95 123 OIUR-02 50
18 05/03/95 123 OIUR-03 50
19 05/07/95 127 AHFG-01 50
20 05/07/95 127 AHFG-02 60
21 05/07/95 127 AHFG-03 70
22 05/07/95 127 OIUR-01 75
23 05/07/95 127 OIUR-02 90
24 05/07/95 127 OIUR-03 95
In this dataframe, ID gives replicates at different sites. For example, AHFG-01 is replicate 1 and AHFG-02 is replicate 2, both at site AHFG. PERCENT refers to percent completion.
I need to calculate two things:
1) Mean JULIAN when PERCENT first exceeds 50 for each site, across years
2) Mean JULIAN when PERCENT first exceeds 50 for all sites, across years
I am a bit baffled about the best way to proceed here. My approach is to:
1) Calculate mean PERCENT for each site (from ID) at each DATE/JULIAN
2) Identify JULIAN when mean PERCENT first exceeds 50, for each site for each YEAR
3) Calculate mean JULIAN from 2) for each site across years
4) Calculate mean JULIAN from 2) for all sites across years
For the dataframe above, the end results I need by site and for all sites together would look something like this:
SITE JULIAN
AHFG 122.5
OIUR 120.5
JULIAN, all sites combined = 121.5
What I have done so far is first create columns YEAR and SITE to use for operations:
df$DATE <- as.POSIXct(df$DATE, format='%m/%d/%y')
df$YEAR <- format(df$DATE, format='%Y')
df$SITE <- gsub("[^A-Za-z]", "", df$ID)
Then I can use aggregate to calculate SITE means for step 1 above:
df2 <- aggregate(PERCENT ~ SITE + JULIAN + YEAR,FUN=mean,data=df)
However, I am getting stuck at step 2 and beyond. Can anyone suggest a way to calculate the mean JULIAN when PERCENT first exceeds 50, for each SITE across years, and all combined SITEs across years?
Solution:
Here is a modified form of Henrik's excellent solution that works for me. Note that Henrik's original solution did work, but my question was a bit unclear about what I wanted (see comments below).
# make year column
df$DATE <- as.POSIXct(df$DATE, format='%m/%d/%y')
df$YEAR <- format(df$DATE, format='%Y')
# make new ID column (remove numbers for individuals)
df$SITE <- gsub("[^A-Za-z]", "", df$ID)
# Calculate average PERCENT for each SITE
df2 <- aggregate(PERCENT ~ SITE + JULIAN + YEAR,FUN=mean,data=df)
# order by SITE and JULIAN
df2 <- df2[order(df2$SITE, df2$JULIAN), ]
# within each YEAR and SITE, select first registration where PERCENT is 50 or more
df2 <- do.call(rbind,
  by(df2, list(df2$YEAR, df2$SITE), function(x) {
    x[x$PERCENT >= 50, ][1, ]
  }))
# calculate mean JULIAN per SITE
aggregate(JULIAN ~ SITE, data = df2, mean)
# overall mean
mean(df2$JULIAN)
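For comparison, the same four steps can be sketched with dplyr; this is my own restatement of the approach above, not part of the accepted answer, and it assumes the YEAR and SITE columns have been created as shown:

```r
library(dplyr)

first50 <- df %>%
  # step 1: mean PERCENT per SITE at each YEAR/JULIAN
  group_by(SITE, YEAR, JULIAN) %>%
  summarise(PERCENT = mean(PERCENT), .groups = "drop") %>%
  # step 2: first JULIAN where the mean reaches 50, per SITE and YEAR
  arrange(SITE, YEAR, JULIAN) %>%
  group_by(SITE, YEAR) %>%
  filter(PERCENT >= 50) %>%
  slice(1) %>%
  ungroup()

# step 3: mean JULIAN per SITE across years
first50 %>% group_by(SITE) %>% summarise(JULIAN = mean(JULIAN))
# step 4: overall mean across all sites and years
mean(first50$JULIAN)
```

With the sample data this reproduces the expected 122.5 (AHFG), 120.5 (OIUR), and 121.5 overall.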

Here's one possibility:
# order by SITE and DATE
df <- df[order(df$SITE, df$DATE), ]
# within each YEAR and SITE, select first registration where PERCENT exceeds 50
df2 <- do.call(rbind,
  by(df, list(df$YEAR, df$SITE), function(x) {
    x[x$PERCENT > 50, ][1, ]
  }))
df2
# DATE JULIAN ID PERCENT YEAR SITE
# 6 1990-04-28 118 AHFG-03 75 1990 AHFG
# 11 1995-05-07 127 AHFG-02 60 1995 AHFG
# 13 1990-04-23 113 OIUR-01 80 1990 OIUR
# 22 1995-05-07 127 OIUR-01 75 1995 OIUR
# calculate mean JULIAN per SITE
aggregate(JULIAN ~ SITE, data = df2, mean)
# SITE JULIAN
# 1 AHFG 122.5
# 2 OIUR 120.0
# overall mean
mean(df2$JULIAN)
# [1] 121.25
Please note that I don't get the same mean for OIUR as in your example.

Related

How do I count the number of cells from the CSV file in R?

The name of my dataset is studentperformance, which can be seen below:
gender race lunch math reading writing
2 2 2 72 72 74
2 3 2 69 90 88
2 2 2 90 95 93
1 1 1 47 57 44
1 3 2 76 78 75
2 2 2 71 83 78
2 2 2 88 95 92
1 2 1 40 43 39
1 4 1 64 64 67
2 2 1 38 60 50
I want to count how many times the value "2" appears in the gender column. For this I tried this code:
count(studentperformance$gender[1:10], vars = "2")
But the code throws an error. Please suggest how I can achieve this.
As @user2974951 said, you can use base R for that:
sum(studentperformance$gender==2)
[1] 6
You can also create a table for every level in gender:
table(studentperformance$gender,factor(studentperformance$gender))
1 2
1 4 0
2 0 6
Sample data:
studentperformance <- read.table(text = "gender race lunch math reading writing
2 2 2 72 72 74
2 3 2 69 90 88
2 2 2 90 95 93
1 1 1 47 57 44
1 3 2 76 78 75
2 2 2 71 83 78
2 2 2 88 95 92
1 2 1 40 43 39
1 4 1 64 64 67
2 2 1 38 60 50", header = TRUE)
You can create some simple tables without indexing or comparisons. Try the following with dplyr's count(), which returns the variable gender containing the unique values of gender, and n indicating the count of each unique value:
library(dplyr)
count(studentperformance, gender)
#### OUTPUT ####
# A tibble: 2 x 2
gender n
<int> <int>
1 1 4
2 2 6
You can do pretty much the same thing using base R's table(). The output is just a little different: the unique values are now the column headers 1 and 2, and the counts 4 and 6 sit in the row just beneath:
table(studentperformance$gender)
#### OUTPUT ####
1 2
4 6
Consider also:
studentperformance <- transform(studentperformance,
  count_by_gender = ave(gender, gender, FUN = length))
Data:
structure(
list(
gender = c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,
2L),
race = c(2L, 3L, 2L, 1L, 3L, 2L, 2L, 2L, 4L, 2L),
lunch = c(2L,
2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L),
math = c(72L, 69L, 90L,
47L, 76L, 71L, 88L, 40L, 64L, 38L),
reading = c(72L, 90L, 95L,
57L, 78L, 83L, 95L, 43L, 64L, 60L),
writing = c(74L, 88L, 93L,
44L, 75L, 78L, 92L, 39L, 67L, 50L),
count_by_gender = c(6L, 6L,
6L, 4L, 4L, 6L, 6L, 4L, 4L, 6L)
),
class = "data.frame",
row.names = c(NA,-10L)
)

How does one deal with the 'x must be numeric' error in a correlation plot?

I'm trying to produce a correlation plot for my data but I get an 'x must be numeric' error, and fixes from other questions have not worked in my case. Do I have to change the month to numeric as well? Or is there a way of selecting only the numeric columns for my plot?
I tried converting everything to numeric, but it just changes back to factor automatically.
getwd()
myDF <- read.csv("qbase.csv")
head(myDF)
str(myDF)
cp <-cor(myDF)
head(round(cp,2))
'data.frame': 12 obs. of 8 variables:
$ Month : Factor w/ 12 levels "18-Apr","18-Aug",..: 5 4 8 1 9 7 6 2 12 11 ...
$ Monthly.Recurring.Revenue: Factor w/ 2 levels "$25,000 ","$40,000 ": 1 1 1 1 1 2 2 2 2 2 ...
$ Price.per.Seat : Factor w/ 2 levels "$40 ","$50 ": 2 2 2 2 2 1 1 1 1 1 ...
$ Paid.Seats : int 500 500 500 500 500 1000 1000 1000 1000 1000 ...
$ Active.Users : int 10 50 50 100 450 550 800 900 950 800 ...
$ Support.Cases : int 0 0 1 5 35 155 100 75 50 45 ...
$ Users.Trained : int 1 5 0 50 100 300 50 30 0 100 ...
$ Features.Used : int 5 5 5 5 8 9 9 10 15 15 ...
The results of dput(myDF) are as follows:
dput(myDF)
structure(list(Month = structure(c(5L, 4L, 8L, 1L, 9L, 7L, 6L,
2L, 12L, 11L, 10L, 3L), .Label = c("18-Apr", "18-Aug", "18-Dec",
"18-Feb", "18-Jan", "18-Jul", "18-Jun", "18-Mar", "18-May", "18-Nov",
"18-Oct", "18-Sep"), class = "factor"), Monthly.Recurring.Revenue = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("$25,000 ",
"$40,000 "), class = "factor"), Price.per.Seat = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("$40 ",
"$50 "), class = "factor"), Paid.Seats = c(500L, 500L, 500L,
500L, 500L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L),
Active.Users = c(10L, 50L, 50L, 100L, 450L, 550L, 800L, 900L,
950L, 800L, 700L, 600L), Support.Cases = c(0L, 0L, 1L, 5L,
35L, 155L, 100L, 75L, 50L, 45L, 10L, 5L), Users.Trained = c(1L,
5L, 0L, 50L, 100L, 300L, 50L, 30L, 0L, 100L, 50L, 0L), Features.Used = c(5L,
5L, 5L, 5L, 8L, 9L, 9L, 10L, 15L, 15L, 15L, 15L)), class = "data.frame", row.names = c(NA,
-12L))
You can convert dates to POSIXct and also remove the dollar sign to convert the second and third columns to numeric:
myDF$Month <- as.numeric(as.POSIXct(myDF$Month, format="%d-%b", tz="GMT"))
myDF[,c(2,3)] <- sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
cp <-cor(myDF)
library(ggcorrplot)
ggcorrplot(cp)
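Alternatively, if you just want to drop the non-numeric columns rather than convert them (the question asked whether only numeric columns can be selected), here is a base R sketch; this is my addition, assuming myDF as in the question:

```r
# keep only the numeric/integer columns before correlating
num_cols <- sapply(myDF, is.numeric)
cp <- cor(myDF[, num_cols])
```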
You are trying to get a correlation between factor and numeric columns, which can't work (cor handles only numeric data, hence the error). You can do:
library(data.table)
ir <- data.table(iris) # since you didn't produce a reproducible example
ir[, cor(.SD), .SDcols = names(ir)[(lapply(ir, class) == "numeric")]]
What is in there:
cor(.SD) calculates the correlation matrix for a new data frame composed of a subset data.table (.SD, see ?data.table).
.SDcols establishes which columns go into that subset data.table: only those whose class is numeric.
You can remove the dollar sign and change the integer variables to numeric using sapply, then calculate the correlation.
myDF[,c(2,3)] <- sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
newdf <- sapply(myDF[,2:8],as.numeric)
cor(newdf)
Edit:
If you want to use the month variable, install lubridate and use its month() function.
For example:
library(lubridate)
myDF$Month<- month(as.POSIXct(myDF$Month, format="%d-%b", tz="GMT"))
myDF[,c(2,3)] <- sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
newdf <- sapply(myDF,as.numeric)
cor(as.data.frame(newdf))
The way to convert those months to Date class:
myDF$MonDt <- as.Date( paste0(myDF$Month, "-15"), format="%y-%b-%d")
Could also have used zoo::as.yearmon. Either method would allow you to apply as.numeric to get a valid time-scaled value. The other answers are adequate for single-year data, but because they incorrectly assume that the leading two digits are the day of the month rather than the year, they will fail to deliver valid answers on any multi-year dataset, and will not throw any warning about it.
with(myDF, cor(Active.Users, as.numeric(MonDt) ) )
[1] 0.8269705
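To illustrate the multi-year point, here is a small hypothetical example (month abbreviations assume an English locale): with %y-%b-%d the leading two digits are read as the year, so two different years parse correctly:

```r
mos <- c("18-Jan", "19-Feb")
as.Date(paste0(mos, "-15"), format = "%y-%b-%d")
# [1] "2018-01-15" "2019-02-15"  (in an English locale)
```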
As one of the other answers illustrated, removing the $ and commas is needed before as.numeric will succeed on currency-formatted text. Again, this is also factor data, so as.numeric alone could have yielded erroneous answers (it returns the underlying level codes, not the displayed values), although in this simple example it would not. A safe method would be:
myDF[2:3] <- lapply(myDF[2:3], function(x) as.numeric( gsub("[$,]", "", x)))
myDF
Month Monthly.Recurring.Revenue Price.per.Seat Paid.Seats Active.Users
1 18-Jan 25000 50 500 10
2 18-Feb 25000 50 500 50
3 18-Mar 25000 50 500 50
4 18-Apr 25000 50 500 100
5 18-May 25000 50 500 450
6 18-Jun 40000 40 1000 550
7 18-Jul 40000 40 1000 800
8 18-Aug 40000 40 1000 900
9 18-Sep 40000 40 1000 950
10 18-Oct 40000 40 1000 800
11 18-Nov 40000 40 1000 700
12 18-Dec 40000 40 1000 600
Support.Cases Users.Trained Features.Used MonDt
1 0 1 5 2018-01-15
2 0 5 5 2018-02-15
3 1 0 5 2018-03-15
4 5 50 5 2018-04-15
5 35 100 8 2018-05-15
6 155 300 9 2018-06-15
7 100 50 9 2018-07-15
8 75 30 10 2018-08-15
9 50 0 15 2018-09-15
10 45 100 15 2018-10-15
11 10 50 15 2018-11-15
12 5 0 15 2018-12-15
This question gets an answer that allows multiple correlation coefficients to be calculated and the two way data associations plotted on one page:
How to add p values for correlation coefficients plotted using splom in lattice?

Replacing NA depending on distribution type of gender in R

First I flagged the missing values here:
data[data=="na"] <- NA
data[!complete.cases(data),]
Now I must replace them, but the replacement depends on the type of distribution. If, according to shapiro.test, a variable's distribution is not normal, then the missing value must be replaced by the median; if it is normal, replace it with the mean. The distribution must be assessed separately for each gender (1 = girl, 2 = man).
data=structure(list(sex = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), emotion = c(20L,
15L, 49L, NA, 34L, 35L, 54L, 45L), IQ = c(101L, 98L, 105L, NA,
123L, 120L, 115L, NA)), .Names = c("sex", "emotion", "IQ"), class = "data.frame", row.names = c(NA,
-8L))
the desired output
sex emotion IQ
1 20 101
1 15 98
1 49 105
1 28 101
2 34 123
2 35 120
2 54 115
2 45 119
The following code replaces NA values according to the Shapiro test:
library(dplyr)
data %>%
  group_by(sex) %>%
  mutate(
    emotion = ifelse(!is.na(emotion), emotion,
                     ifelse(shapiro.test(emotion)$p.value > 0.05,
                            mean(emotion, na.rm = TRUE),
                            median(emotion, na.rm = TRUE))),
    IQ = ifelse(!is.na(IQ), IQ,
                ifelse(shapiro.test(IQ)$p.value > 0.05,
                       mean(IQ, na.rm = TRUE),
                       median(IQ, na.rm = TRUE)))
  )

Facing difficulty converting a data.frame to a time series object in R

I am a novice in R. I have a tab-separated text file with sales data for each day. The format is product-id, day0, day1, day2, day3, and so on. Part of the input file is given below:
productid 0 1 2 3 4 5 6
1 53 40 37 45 69 105 62
4 0 0 2 4 0 8 0
5 57 133 60 126 90 87 107
6 108 130 143 92 88 101 66
10 0 0 2 0 4 0 36
11 17 22 16 15 45 32 36
I used the code below to read the file:
pdInfo <- read.csv("products.txt", header = TRUE, sep = "\t")
This reads the entire file into the data frame pdInfo. I would like to convert this data frame to a time series object for further processing. On a stationarity test, the augmented Dickey–Fuller (ADF) test, it shows an error. I tried the code below:
x <- ts(data.matrix(pdInfo),frequency = 1)
adf <- adf.test(x)
error: Error in adf.test(x) : x is not a vector or univariate time series
Thanks in advance for the suggestions.
In R, time series are usually in the form "one row per date", where your data is in the form "one column per date". You probably need to transpose the data before you convert to a ts object.
First, transpose it:
y <- t(pdInfo)
Then make the top row (the product ids) into the column names:
colnames(y) <- y[1, ]
y <- y[-1, ]  # drop the first row
This should work:
x <- ts(y, frequency = 1)
library(purrr)
library(dplyr)
library(tidyr)
library(tseries)
# create the data
df <- structure(list(productid = c(1L, 4L, 5L, 6L, 10L, 11L),
X0 = c(53L, 0L, 57L, 108L, 0L, 17L),
X1 = c(40L, 0L, 133L, 130L, 0L, 22L),
X2 = c(37L, 2L, 60L, 143L, 2L, 16L),
X3 = c(45L, 4L, 126L, 92L, 0L, 15L),
X4 = c(69L, 0L, 90L, 88L, 4L, 45L),
X5 = c(105L, 8L, 87L, 101L, 0L, 32L),
X6 = c(62L, 0L, 107L, 66L, 36L, 36L)),
.Names = c("productid", "0", "1", "2", "3", "4", "5", "6"),
class = "data.frame", row.names = c(NA, -6L))
# apply adf.test to each productid and return p.value
adfTest <- df %>% gather(key = day, value = sales, -productid) %>%
arrange(productid, day) %>%
group_by(productid) %>%
nest() %>%
mutate(adf = data %>% map(., ~adf.test(as.ts(.$sales)))
,adf.p.value = adf %>% map_dbl(., "p.value")) %>%
select(productid, adf.p.value)

apply regression while looping through levels of a factor in R

I am trying to apply a regression function to each separate level of a factor (Subject). The idea is that for each Subject, I can get a predicted reading time based on their actual reading time (RT) and the length of the corresponding printed string (WordLen). A colleague helped me along with some code for applying the function based on each level of another factor (Region) within Subject. However, neither the original code nor my attempted modification (applying the function across breaks by a single factor) works.
Here is an attempt at some sample data:
test0<-structure(list(Subject = c(101L, 101L, 101L, 101L, 101L, 101L,
101L, 101L, 101L, 101L, 102L, 102L, 102L, 102L, 102L, 102L, 102L,
102L, 102L, 102L, 103L, 103L, 103L, 103L, 103L, 103L, 103L, 103L,
103L, 103L), Region = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L), RT = c(294L, 241L, 346L, 339L, 332L, NA, 399L,
377L, 400L, 439L, 905L, 819L, 600L, 520L, 811L, 1021L, 508L,
550L, 1048L, 1246L, 470L, NA, 385L, 347L, 592L, 507L, 472L, 396L,
761L, 430L), WordLen = c(3L, 3L, 3L, 3L, 3L, 3L, 5L, 7L, 3L,
9L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 7L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 5L, 7L, 3L)), .Names = c("Subject", "Region", "RT", "WordLen"
), class = "data.frame", row.names = c(NA, -30L))
The unfortunate thing is that this data is returning a problem that I don't get with my full dataset:
"Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases"
Maybe this is because the sample data is too small?
Anyway, I am hoping that someone will see the issue with the code, despite my inability to provide working sample data...
This is the original code (does not work):
for (i in 1:length(levels(test0$Subject)))
  for (j in 1:length(levels(test0$Region))) {
    tmp <- predict(lm(RT ~ WordLen,
                      test0[test0$Subject == levels(test0$Subject)[i] &
                            test0$Region == levels(test0$Region)[j], ],
                      na.action = "na.exclude"))
    test0[names(tmp), "rt.predicted"] <- tmp
  }
And this is the modified code (which not surprisingly, also does not work):
for (i in 1:length(levels(test0$Subject))) {
  tmp <- predict(lm(RT ~ WordLen,
                    test0[test0$Subject == levels(test0$Subject)[i], ],
                    na.action = "na.exclude"))
  test0[names(tmp), "rt.predicted"] <- tmp
}
I would very much appreciate any suggestions.
You can achieve this with the function ddply() from the plyr library.
This splits the data frame according to Subject, calculates the prediction of the regression model, and then adds it as a new column to the data frame.
library(plyr)
ddply(test0, .(Subject), transform,
      pred = predict(lm(RT ~ WordLen, na.action = "na.exclude")))
Subject Region RT WordLen pred
1 101 1 294 3 327.9778
......
4 101 1 339 3 327.9778
5 101 1 332 3 327.9778
6 101 2 NA 3 NA
7 101 2 399 5 363.8444
.......
13 102 1 600 3 785.4146
To split the data by Subject and Region, you should put both variables inside .():
ddply(test0, .(Subject, Region), transform,
      pred = predict(lm(RT ~ WordLen, na.action = "na.exclude")))
The only problem in your test data is that Subject and Region are not factors.
test0$Subject <- factor(test0$Subject)
test0$Region <- factor(test0$Region)
for (i in 1:length(levels(test0$Subject)))
  for (j in 1:length(levels(test0$Region))) {
    tmp <- predict(lm(RT ~ WordLen,
                      test0[test0$Subject == levels(test0$Subject)[i] &
                            test0$Region == levels(test0$Region)[j], ],
                      na.action = "na.exclude"))
    test0[names(tmp), "rt.predicted"] <- tmp
  }
# 26 27 28 29 30
# 442.25 442.25 560.50 678.75 442.25
The reason you were getting the error (0 non-NA cases) is that when you were subsetting, you were doing it on levels of variables that were not factors. In your original dataset, try:
test0[test0$Subject==levels(test0$Subject)[1],]
You get:
# [1] Subject Region RT WordLen
# <0 rows> (or 0-length row.names)
That empty data frame is what lm() was trying to work with.
While your question seems to ask for an explanation of the error, which others have answered (the data not being factors at all), here is a way to do it using just base packages:
test0$rt.predicted <- unlist(by(test0[, c("RT", "WordLen")],
  list(test0$Subject, test0$Region),
  FUN = function(x) predict(lm(RT ~ WordLen, x, na.action = "na.exclude"))))
test0
## Subject Region RT WordLen rt.predicted
## 1 101 1 294 3 310.4000
## 2 101 1 241 3 310.4000
## 3 101 1 346 3 310.4000
## 4 101 1 339 3 310.4000
## 5 101 1 332 3 310.4000
## 6 101 2 NA 3 731.0000
## 7 101 2 399 5 731.0000
## 8 101 2 377 7 731.0000
## 9 101 2 400 3 731.0000
## 10 101 2 439 9 731.0000
## 11 102 1 905 3 448.5000
## 12 102 1 819 3 NA
## 13 102 1 600 3 448.5000
## 14 102 1 520 3 448.5000
## 15 102 1 811 3 448.5000
## 16 102 2 1021 3 NA
## 17 102 2 508 3 399.0000
## 18 102 2 550 5 408.5000
## 19 102 2 1048 7 389.5000
## 20 102 2 1246 3 418.0000
## 21 103 1 470 3 870.4375
## 22 103 1 NA 3 870.4375
## 23 103 1 385 3 877.3750
## 24 103 1 347 3 884.3125
## 25 103 1 592 3 870.4375
## 26 103 2 507 3 442.2500
## 27 103 2 472 3 442.2500
## 28 103 2 396 5 560.5000
## 29 103 2 761 7 678.7500
## 30 103 2 430 3 442.2500
I would expect that this is caused by the fact that for some combination of your two categorical variables no data exists. What you could do is first extract the subset, check that it isn't empty, and only run the lm() if there is data.
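A sketch of that guarded loop (my illustration, assuming Subject and Region have already been converted to factors as in the earlier answer):

```r
for (i in levels(test0$Subject)) {
  for (j in levels(test0$Region)) {
    chunk <- test0[test0$Subject == i & test0$Region == j, ]
    # skip empty subsets and subsets with no usable RT values
    if (nrow(chunk) > 0 && any(!is.na(chunk$RT))) {
      tmp <- predict(lm(RT ~ WordLen, chunk, na.action = "na.exclude"))
      test0[names(tmp), "rt.predicted"] <- tmp
    }
  }
}
```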
