I have a large dataframe with 31181 observations and 9 variables in which the academic performance of students is recorded.
I also have a second dataframe in which each student occupies exactly one row; in that row I would like to store the student's results from the academic performance dataframe.
Dataframe 1 (let's call it Academic) looks as follows:
Programme Resits Student_ID Course_code Academic_year Course_Grade_Binned Graduated Master_Student Course.rating_M Rating.tutor_M Selfstudy_M
1 IB 0 9000006 ABC1198 2013 B TRUE 1 7.5 8.2 14.1
2 IB 0 9000006 ABC1192 2014 B TRUE 1 8.4 8.8 13.0
3 IB 0 9000006 ABC1277 2014 A TRUE 1 6.0 6.4 10.6
4 IB 0 9000006 ABC1448 2013 B TRUE 1 5.7 7.8 14.4
5 IB 0 9000006 ABC1120 2014 B TRUE 1 7.1 7.4 11.2
6 IB 0 9000006 ABC1362 2013 B TRUE 1 6.7 7.5 15.8
7 IB 0 9000006 ABC1213 2013 C TRUE 1 7.7 8.1 11.4
8 IB 0 9000006 ABC1382 2013 B TRUE 1 6.6 7.1 16.3
9 IB 0 9000006 ABC1108 2013 C TRUE 1 7.1 7.6 15.7
10 IB 1 9000006 ABC1329 2014 B TRUE 1 7.5 7.9 10.7
11 IB 0 9000006 ABC1126 2013 B TRUE 1 6.7 7.5 15.3
12 IB 0 9000006 ABC1003 2013 B TRUE 1 7.3 8.5 12.6
13 IB 0 9000014 ABC1309 2014 B TRUE 0 6.9 6.1 12.4
14 IB 0 9000014 ABC1198 2013 A TRUE 0 7.5 8.2 14.1
15 IB 0 9000014 ABC1277 2014 A TRUE 0 6.0 6.4 10.6
16 IB 0 9000014 ABC1448 2013 A TRUE 0 5.7 7.8 14.4
17 IB 0 9000014 ABC1362 2013 B TRUE 0 6.7 7.5 15.8
18 IB 0 9000014 ABC1213 2013 B TRUE 0 7.7 8.1 11.4
19 IB 0 9000014 ABC1152 2014 A TRUE 0 7.0 7.6 12.3
20 IB 0 9000014 ABC1382 2013 A TRUE 0 6.6 7.1 16.3
21 IB 0 9000014 ABC1108 2013 B TRUE 0 7.1 7.6 15.7
22 IB 0 9000014 ABC1455 2014 A TRUE 0 6.7 7.3 11.2
23 IB 0 9000014 ABC1126 2013 B TRUE 0 6.7 7.5 15.3
24 IB 0 9000014 ABC1003 2013 A TRUE 0 7.3 8.5 12.6
25 IB 1 9000028 ABC1213 2014 C TRUE 0 7.8 8.6 10.7
26 IB 0 9000028 ABC1198 2014 B TRUE 0 7.1 8.0 15.5
Dataframe 2 (let's call it NewData) looks like this:
Student_ID Master Resits Programme ABC1198 ABC1192 ABC1277 ABC1448 ABC1120 ABC1362 ABC1213 ABC1382 ABC1108 ABC1329 ABC1126 ABC1003 ABC1309 ABC1152 ABC1455 ABC1123 ABC1409
1 9000006 1 1 IB NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 9000014 0 0 IB NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 9000028 0 5 IB NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
4 9000045 1 5 EBE NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
As you can see, all the course columns are still NA. I would like to loop over the Academic dataframe by group (i.e. by Student_ID), check whether each Course_code occurs for that student, and put a 1 in the corresponding course column of NewData, or a 0 if the student didn't take that course.
The end result (the NewData) should thus look like this:
Student_ID Master Resits Programme ABC1198 ABC1192 ABC1277 ABC1448 ABC1120 ABC1362 ABC1213 ABC1382 ABC1108 ABC1329 ABC1126 ABC1003 ABC1309 ABC1152 ABC1455 ABC1123 ABC1409
1 9000006 1 1 IB 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0
Using base R, we can first define the columns of NewData that hold the courses, then split Course_code by Student_ID and build a 0/1 vector with %in% indicating which courses each student took.
cols <- 5:ncol(NewData)
NewData[cols] <- t(sapply(split(Academic$Course_code, Academic$Student_ID),
                          function(x) +(names(NewData)[cols] %in% x)))
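If the rows of NewData aren't guaranteed to be in the same order as the groups that split() returns, a safer variant looks each student up by ID explicitly; a sketch of the same idea, just order-independent:
# Index the split list by each NewData student so row order never matters.
crs <- split(Academic$Course_code, Academic$Student_ID)
NewData[cols] <- t(vapply(as.character(NewData$Student_ID),
                          function(id) +(names(NewData)[cols] %in% crs[[id]]),
                          integer(length(cols))))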
You can also use tidyr:
library(tidyr)
library(dplyr)
# Keep only the key columns before spreading: the other course-level columns
# (year, grade, ratings) vary per row and would stop spread() from collapsing
# the data to one row per student. fill = 0 writes 0 for courses not taken.
wide <- Academic %>%
  distinct(Student_ID, Course_code) %>%
  mutate(value = 1) %>%
  spread(key = Course_code, value = value, fill = 0)
NewData <- left_join(NewData[1:4], wide, by = "Student_ID")
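In current tidyr (1.0 and later), pivot_wider supersedes spread; the same step, as a sketch:
# pivot_wider replaces spread(); values_fill takes over the role of fill = 0.
wide <- Academic %>%
  distinct(Student_ID, Course_code) %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = Course_code, values_from = value, values_fill = 0)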
I would like to filter a dataframe based on its date column, keeping only the rows that form a run of at least 3 consecutive days. I would like to do this as efficiently and quickly as possible, so a vectorized approach would be welcome.
I tried to take inspiration from the following link, but it didn't really go well, as it is a different problem:
How to filter rows based on difference in dates between rows in R?
I tried to do it with a for loop and managed to flag the dates that are not consecutive, but it didn't give me the desired result, because it keeps every run of consecutive dates even when the run is shorter than 3.
tf is my dataframe:
library(lubridate)  # for the %m+% date arithmetic

for(i in 2:(nrow(tf)-1)){
  if(tf$Date[i] != tf$Date[i+1] %m+% days(-1)){
    if(tf$Date[i] != tf$Date[i-1] %m+% days(1)){
      tf$Date[i] <- as.Date(0)
    }
  }
}
The first 22 rows of my dataframe look something like this:
Date RR.x RR.y Y
1 1984-10-20 1 10.8 1984
2 1984-11-04 1 12.5 1984
3 1984-11-05 1 7.0 1984
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
7 1984-11-13 1 5.9 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
11 1986-11-17 1 14.1 1986
12 2003-10-17 1 7.8 2003
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
16 2003-11-15 1 26.4 2003
17 2003-11-20 1 10.0 2003
18 2011-10-29 1 10.0 2011
19 2011-11-04 1 11.4 2011
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
The result should be:
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
One possibility could be:
library(dplyr)

df %>%
  mutate(Date = as.Date(Date, format = "%Y-%m-%d"),
         diff = c(0, diff(Date))) %>%
  group_by(grp = cumsum(diff > 1 & lead(diff, default = last(diff)) == 1)) %>%
  filter(if_else(diff > 1 & lead(diff, default = last(diff)) == 1, 1, diff) == 1) %>%
  filter(n() >= 3) %>%
  ungroup() %>%
  select(-diff, -grp)
Date RR.x RR.y Y
<date> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984
2 1984-11-10 1 24.4 1984
3 1984-11-11 1 19 1984
4 1986-10-15 1 10.3 1986
5 1986-10-16 1 18.1 1986
6 1986-10-17 1 11.3 1986
7 2003-10-25 1 7.6 2003
8 2003-10-26 1 5 2003
9 2003-10-27 1 6.6 2003
10 2011-11-21 1 9.8 2011
11 2011-11-22 1 5.6 2011
12 2011-11-23 1 20.4 2011
Here's a base solution:
DF$Date <- as.Date(DF$Date)
# Rows in the same run of consecutive days share the same cumulative id,
# so rle() yields one run per block of consecutive dates.
rles <- rle(cumsum(c(1, diff(DF$Date) != 1)))
# Turn the run lengths into a keep/drop flag and expand it back to row level.
rles$values <- rles$lengths >= 3
DF[inverse.rle(rles), ]
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
A similar approach in dplyr:
DF %>%
  mutate(Date = as.Date(Date)) %>%
  add_count(IDs = cumsum(c(1, diff(Date) != 1))) %>%
  filter(n >= 3)
# A tibble: 12 x 6
Date RR.x RR.y Y IDs n
<date> <int> <dbl> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984 3 3
2 1984-11-10 1 24.4 1984 3 3
3 1984-11-11 1 19 1984 3 3
4 1986-10-15 1 10.3 1986 5 3
5 1986-10-16 1 18.1 1986 5 3
6 1986-10-17 1 11.3 1986 5 3
7 2003-10-25 1 7.6 2003 8 3
8 2003-10-26 1 5 2003 8 3
9 2003-10-27 1 6.6 2003 8 3
10 2011-11-21 1 9.8 2011 13 3
11 2011-11-22 1 5.6 2011 13 3
12 2011-11-23 1 20.4 2011 13 3
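Since the question asks for speed, the same run-id idea also ports directly to data.table; a sketch, not taken from the answers above:
library(data.table)
dt <- as.data.table(DF)
dt[, Date := as.Date(Date)]
dt[, grp := cumsum(c(1, diff(Date) != 1))]  # new run id whenever the gap exceeds one day
dt[, n := .N, by = grp]                     # length of each run
res <- dt[n >= 3][, c("grp", "n") := NULL][]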
I want to do a logistic regression to calculate the probability of a student pursuing a master's degree.
I have a dataset containing many students who have taken certain courses in certain years. These courses also receive a rating (as does the tutor), and this is course- and year-specific.
These students may or may not do a master's at the same university. Based on the results a student gets, the rating a course gets, and the number of resits a student does, I want to predict the probability of a student pursuing a master's.
To do so, I want to run a logistic regression, and hence I need to split the data into a training and validation/test set. However, as you see, multiple rows can revolve around the same student; e.g. rows 1 to 12 revolve around student 9000006.
The problem when doing a logistic regression now is that the regression sees every row as a separate unit, while in fact the students are 'grouped'.
Programme Resits Student_ID Course_code Academic_year Course_Grade_Binned Graduated Master_Student Course.rating_M Rating.tutor_M Selfstudy_M
1 IB 0 9000006 ABC1198 2013 B TRUE 1 7.5 8.2 14.1
2 IB 0 9000006 ABC1192 2014 B TRUE 1 8.4 8.8 13.0
3 IB 0 9000006 ABC1277 2014 A TRUE 1 6.0 6.4 10.6
4 IB 0 9000006 ABC1448 2013 B TRUE 1 5.7 7.8 14.4
5 IB 0 9000006 ABC1120 2014 B TRUE 1 7.1 7.4 11.2
6 IB 0 9000006 ABC1362 2013 B TRUE 1 6.7 7.5 15.8
7 IB 0 9000006 ABC1213 2013 C TRUE 1 7.7 8.1 11.4
8 IB 0 9000006 ABC1382 2013 B TRUE 1 6.6 7.1 16.3
9 IB 0 9000006 ABC1108 2013 C TRUE 1 7.1 7.6 15.7
10 IB 1 9000006 ABC1329 2014 B TRUE 1 7.5 7.9 10.7
11 IB 0 9000006 ABC1126 2013 B TRUE 1 6.7 7.5 15.3
12 IB 0 9000006 ABC1003 2013 B TRUE 1 7.3 8.5 12.6
13 IB 0 9000014 ABC1309 2014 B TRUE 0 6.9 6.1 12.4
14 IB 0 9000014 ABC1198 2013 A TRUE 0 7.5 8.2 14.1
15 IB 0 9000014 ABC1277 2014 A TRUE 0 6.0 6.4 10.6
16 IB 0 9000014 ABC1448 2013 A TRUE 0 5.7 7.8 14.4
17 IB 0 9000014 ABC1362 2013 B TRUE 0 6.7 7.5 15.8
18 IB 0 9000014 ABC1213 2013 B TRUE 0 7.7 8.1 11.4
19 IB 0 9000014 ABC1152 2014 A TRUE 0 7.0 7.6 12.3
20 IB 0 9000014 ABC1382 2013 A TRUE 0 6.6 7.1 16.3
21 IB 0 9000014 ABC1108 2013 B TRUE 0 7.1 7.6 15.7
22 IB 0 9000014 ABC1455 2014 A TRUE 0 6.7 7.3 11.2
23 IB 0 9000014 ABC1126 2013 B TRUE 0 6.7 7.5 15.3
24 IB 0 9000014 ABC1003 2013 A TRUE 0 7.3 8.5 12.6
25 IB 1 9000028 ABC1213 2014 C TRUE 0 7.8 8.6 10.7
26 IB 0 9000028 ABC1198 2014 B TRUE 0 7.1 8.0 15.5
Does anyone have any tips on how to perform a logistic regression on this kind of data? If you have another suggestion for calculating the probability of a student pursuing a master's, please let me know as well :)
Cheers!
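One way to deal with the grouping is to collapse the data to one row per student before fitting, so the regression's unit of analysis is the student rather than the course row; if you want to keep the course-level rows instead, a mixed-effects logistic regression (e.g. lme4::glmer with a random intercept per Student_ID) is the usual alternative. A minimal sketch of the aggregate-then-glm route, using the column names shown above; the choice of summaries (sums and means) is an assumption:
library(dplyr)
# One row per student: total resits plus mean ratings/self-study as features.
students <- Academic %>%
  group_by(Student_ID) %>%
  summarise(n_resits    = sum(Resits),
            mean_rating = mean(Course.rating_M),
            mean_tutor  = mean(Rating.tutor_M),
            mean_self   = mean(Selfstudy_M),
            Master      = first(Master_Student))

fit <- glm(Master ~ n_resits + mean_rating + mean_tutor + mean_self,
           data = students, family = binomial)
students$p_master <- predict(fit, type = "response")  # fitted probabilities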
I'm trying to compare model-based forecasts from two different models. Model 2, however, requires more non-missing data and thus has more missing values (NA) than model 1.
I am now wondering how I can quickly query both dataframes for non-missing values and identify the common sample. I used to work with Excel, where the function
=IF(AND(ISVALUE(a1);ISVALUE(b1));then;else)
comes to mind, but I don't know how to do this properly in R.
This is my df from model 1: Every observation is clearly identified by id and time.
(the rownumbers on the left are from my overall dataframe and are identical in both dataframes.)
> head(model1)
id time f1 f2 f3 f4 f5
9 1 1995 16.351261 -1.856662 6.577671 10.7883178 22.5349438
10 1 1996 15.942914 -1.749530 2.894190 0.6058255 1.7057163
11 1 1997 24.187390 15.099166 14.275441 -4.9963831 -0.1866863
12 1 1998 3.101094 -10.455754 -9.674086 -9.8456466 -8.5525140
13 1 1999 33.562234 2.610512 -15.237620 -18.8095980 -17.6351989
14 1 2000 59.979666 -45.106093 -100.352866 -56.6137325 -32.0315737
and this is model 2:
> head(model2)
id time meanf1 meanf2 meanf3 meanf4 meanf5
9 1 1995 4.56 5.14 6.05 NA NA
10 1 1996 4.38 4.94 NA NA NA
11 1 1997 4.05 4.51 NA NA NA
12 1 1998 4.07 5.04 6.52 NA NA
13 1 1999 3.61 4.96 NA NA NA
14 1 2000 4.35 4.83 6.46 NA NA
Thank you for your help and hints.
The function complete.cases gives the rows with non-missing data across all columns. The pairs (f4, meanf4) and (f5, meanf5) have no common non-missing values in the sample data, hence those list elements have no observations. Is this what you were looking for?
#Read Data
model1=read.table(text='id time f1 f2 f3 f4 f5
1 1995 16.351261 -1.856662 6.577671 10.7883178 22.5349438
1 1996 15.942914 -1.749530 2.894190 0.6058255 1.7057163
1 1997 24.187390 15.099166 14.275441 -4.9963831 -0.1866863
1 1998 3.101094 -10.455754 -9.674086 -9.8456466 -8.5525140
1 1999 33.562234 2.610512 -15.237620 -18.8095980 -17.6351989
1 2000 59.979666 -45.106093 -100.352866 -56.6137325 -32.0315737',header=TRUE)
model2=read.table(text=' id time meanf1 meanf2 meanf3 meanf4 meanf5
1 1995 4.56 5.14 6.05 NA NA
1 1996 4.38 4.94 NA NA NA
1 1997 4.05 4.51 NA NA NA
1 1998 4.07 5.04 6.52 NA NA
1 1999 3.61 4.96 NA NA NA
1 2000 4.35 4.83 6.46 NA NA',header=TRUE)
# Columns 3..7 hold f1..f5 in model1 and meanf1..meanf5 in model2.
# For each pair, merge the two models on id/time and keep only the
# rows that are non-missing in both, using complete.cases().
DF_list <- lapply(3:7, function(x) {
  DF <- merge(model1[, c(1, 2, x)], model2[, c(1, 2, x)], by = c("id", "time"))
  DF[complete.cases(DF), ]
})
DF_list
#[[1]]
# id time f1 meanf1
#1 1 1995 16.351261 4.56
#2 1 1996 15.942914 4.38
#3 1 1997 24.187390 4.05
#4 1 1998 3.101094 4.07
#5 1 1999 33.562234 3.61
#6 1 2000 59.979666 4.35
#
#[[2]]
# id time f2 meanf2
#1 1 1995 -1.856662 5.14
#2 1 1996 -1.749530 4.94
#3 1 1997 15.099166 4.51
#4 1 1998 -10.455754 5.04
#5 1 1999 2.610512 4.96
#6 1 2000 -45.106093 4.83
#
#[[3]]
# id time f3 meanf3
#1 1 1995 6.577671 6.05
#4 1 1998 -9.674086 6.52
#6 1 2000 -100.352866 6.46
#
#[[4]]
#[1] id time f4 meanf4
#<0 rows> (or 0-length row.names)
#
#[[5]]
#[1] id time f5 meanf5
#<0 rows> (or 0-length row.names)
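As a follow-up, if you want the single common sample where both models have every forecast, you can merge once and run complete.cases over the whole frame; a sketch (with the sample data above this comes out empty, since f4/f5 never overlap, as noted):
# Rows kept only where both models are fully observed across all forecasts.
both   <- merge(model1, model2, by = c("id", "time"))
common <- both[complete.cases(both), ]

# Row-wise analogue of the Excel IF(AND(...)) test, for a single pair:
ok <- complete.cases(both[, c("f3", "meanf3")])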
I am new to R and am having a problem solving this with my dataset.
df
ID Time Value
1001 -34 3.3
1001 14 4.2
1002 -34 3.8
1002 14 6.5
1004 -18 4.1
1004 -11 3.4
1004 37 3.8
1005 -16 5.8
1005 -10 6.0
1005 14 8.1
1006 -20 16.1
1006 -10 14.1
1006 158 14.1
1007 -35 7.1
1007 -20 4.6
1007 -20 5.1
1007 10 5.0
For each ID, if there is more than one reading at a negative time, replace those values with their mean; in every case the negative time becomes 0. The resulting dataset should be:
df1
ID Time Value
1001 0 3.3
1001 14 4.2
1002 0 3.8
1002 14 6.5
1004 0 3.75
1004 37 3.8
1005 0 5.9
1005 14 8.1
1006 0 15.1
1006 158 14.1
1007 0 5.6
1007 10 5.0
Thanks for help!
This will be pretty fast if you have lots of data.
# Convert to a data.table object
require("data.table")
dt <- data.table(df)
# Flag negative-time rows
dt[, Neg := (Time < 0) * 1]
# Make positive and negative datasets; the negative one collapses to
# a single row per ID with Time 0 and the mean Value
dt1 <- dt[Neg == 0]
dt2 <- dt[Neg == 1, list(Time = 0, Value = mean(Value, na.rm = TRUE), Neg = 1), by = "ID"]
# Recombine them
df.final <- rbindlist(list(dt1, dt2))[order(ID, Time)]
Here is the result
# ID Time Value Neg
# 1: 1001 0 3.30 1
# 2: 1001 14 4.20 0
# 3: 1002 0 3.80 1
# 4: 1002 14 6.50 0
# 5: 1004 0 3.75 1
# 6: 1004 37 3.80 0
# 7: 1005 0 5.90 1
# 8: 1005 14 8.10 0
# 9: 1006 0 15.10 1
# 10: 1006 158 14.10 0
# 11: 1007 0 5.60 1
# 12: 1007 10 5.00 0
You can also put it all together in a one-liner that gives a similar answer; grouping on whether Time is positive collapses each negative group to a single row:
dt[, list(Time  = if (tt) Time else 0,
          Value = if (tt) Value else mean(Value)),
   by = list(ID, tt = Time > 0)]
Here's yet another solution:
# Copy the raw data
dx <- df
# Flag the rows with negative times
lz <- dx$Time < 0
# Set those times to 0
dx$Time[lz] <- 0
# Replace the flagged values with each ID's mean of its negative-time values
dx$Value[lz] <- ave(dx$Value, dx$ID, lz, FUN = mean)[lz]
# Keep only the first of each ID's (now identical) negative-time rows
dx <- dx[!(duplicated(dx$ID) & lz), ]
And the results...
ID Time Value
1 1001 0 3.30
2 1001 14 4.20
3 1002 0 3.80
4 1002 14 6.50
5 1004 0 3.75
7 1004 37 3.80
8 1005 0 5.90
10 1005 14 8.10
11 1006 0 15.10
13 1006 158 14.10
14 1007 0 5.60
17 1007 10 5.00
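The same collapse-the-negatives idea also translates to dplyr; a sketch, not one of the original answers:
library(dplyr)
df %>%
  mutate(neg = Time < 0) %>%
  group_by(ID, neg) %>%
  mutate(Value = if (neg[1]) mean(Value) else Value,  # overwrite only negative rows
         Time  = if (neg[1]) 0 else Time) %>%
  ungroup() %>%
  distinct(ID, Time, Value) %>%   # collapse the now-identical Time-0 rows
  arrange(ID, Time)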
Given the following set of data:
transect <- c("B","N","C","D","H","J","E","L","I","I")
sampler <- c(rep("J",5),rep("W",5))
species <- c("ROB","HAW","HAW","ROB","PIG","HAW","PIG","PIG","HAW","HAW")
weight <- c(2.80,52.00,56.00,2.80,16.00,55.00,16.20,18.30,52.50,57.00)
wingspan <- c(13.9, 52.0, 57.0, 13.7, 11.0,52.5, 10.7, 11.1, 52.3, 55.1)
week <- c(1,2,3,4,5,6,7,8,9,9)
# Note: as.data.frame(cbind(...)) would coerce every column to character;
# build the data frame directly so weight, wingspan and week stay numeric.
ex <- data.frame(transect, sampler, species, weight, wingspan, week)
What I'm trying to achieve is to transpose the species and its associated weight and wingspan information. For a better idea of the expected result, please see below. My data set is about half a million lines long with approximately 200 different species, so it will be a very large dataframe.
transect sampler week ROBweight HAWweight PIGweight ROBwingspan HAWwingspan PIGwingspan
1 B J 1 2.8 0.0 0.0 13.9 0.0 0.0
2 N J 2 0.0 52.0 0.0 0.0 52.0 0.0
3 C J 3 0.0 56.0 0.0 0.0 57.0 0.0
4 D J 4 2.8 0.0 0.0 13.7 0.0 0.0
5 H J 5 0.0 0.0 16.0 0.0 0.0 11.0
6 J W 6 0.0 55.0 0.0 0.0 52.5 0.0
7 E W 7 0.0 0.0 16.2 0.0 0.0 10.7
8 L W 8 0.0 0.0 18.3 0.0 0.0 11.1
9 I W 9 0.0 52.5 0.0 0.0 52.3 0.0
10 I W 9 0.0 57.0 0.0 0.0 55.1 0.0
The main problem is that you don't currently have unique "id" variables, which will create problems for the usual suspects of reshape and dcast.
Here's a solution. I've used getanID from my "splitstackshape" package, but it's pretty easy to create your own unique ID variable using many different methods.
library(splitstackshape)
library(reshape2)
idvars <- c("transect", "sampler", "week")
ex <- getanID(ex, id.vars=idvars)
From here, you have two options:
reshape from base R:
reshape(ex, direction = "wide",
idvar=c("transect", "sampler", "week", ".id"),
timevar="species")
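reshape names the wide columns weight.ROB, wingspan.HAW, and so on, with NA where a species wasn't observed; if you want zeros instead, a small follow-up (a sketch, assuming ex is a plain data.frame at this point):
wide <- reshape(ex, direction = "wide",
                idvar = c("transect", "sampler", "week", ".id"),
                timevar = "species")
wide[is.na(wide)] <- 0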
melt and dcast from "reshape2"
First, melt your data into a "long" form.
exL <- melt(ex, id.vars=c(idvars, ".id", "species"))
Then, cast your data into a wide form.
dcast(exL, transect + sampler + week + .id ~ species + variable)
# transect sampler week .id HAW_weight HAW_wingspan PIG_weight PIG_wingspan ROB_weight ROB_wingspan
# 1 B J 1 1 NA NA NA NA 2.8 13.9
# 2 C J 3 1 56.0 57.0 NA NA NA NA
# 3 D J 4 1 NA NA NA NA 2.8 13.7
# 4 E W 7 1 NA NA 16.2 10.7 NA NA
# 5 H J 5 1 NA NA 16.0 11.0 NA NA
# 6 I W 9 1 52.5 52.3 NA NA NA NA
# 7 I W 9 2 57.0 55.1 NA NA NA NA
# 8 J W 6 1 55.0 52.5 NA NA NA NA
# 9 L W 8 1 NA NA 18.3 11.1 NA NA
# 10 N J 2 1 52.0 52.0 NA NA NA NA
A better option: "data.table"
Alternatively (and perhaps preferably), you can use the "data.table" package (at least version 1.8.11) as follows:
library(data.table)
library(reshape2) ## Also required here
packageVersion("data.table")
# [1] ‘1.8.11’
DT <- data.table(ex)
DT[, .id := sequence(.N), by = c("transect", "sampler", "week")]
DTL <- melt(DT, measure.vars=c("weight", "wingspan"))
dcast.data.table(DTL, transect + sampler + week + .id ~ species + variable)
# transect sampler week .id HAW_weight HAW_wingspan PIG_weight PIG_wingspan ROB_weight ROB_wingspan
# 1: B J 1 1 NA NA NA NA 2.8 13.9
# 2: C J 3 1 56.0 57.0 NA NA NA NA
# 3: D J 4 1 NA NA NA NA 2.8 13.7
# 4: E W 7 1 NA NA 16.2 10.7 NA NA
# 5: H J 5 1 NA NA 16.0 11.0 NA NA
# 6: I W 9 1 52.5 52.3 NA NA NA NA
# 7: I W 9 2 57.0 55.1 NA NA NA NA
# 8: J W 6 1 55.0 52.5 NA NA NA NA
# 9: L W 8 1 NA NA 18.3 11.1 NA NA
# 10: N J 2 1 52.0 52.0 NA NA NA NA
Add fill = 0 to either of the dcast versions to replace NA values with 0.
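For completeness, the modern tidyr route; a sketch assuming tidyr 1.0 or later, where pivot_wider replaces the melt/dcast pair and a row id again disambiguates the two week-9 "I" rows:
library(dplyr)
library(tidyr)
ex %>%
  group_by(transect, sampler, week) %>%
  mutate(.id = row_number()) %>%   # unique id within transect/sampler/week
  ungroup() %>%
  pivot_wider(names_from  = species,
              values_from = c(weight, wingspan),
              values_fill = 0)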