R: Convert data frame to matrix keeping certain columns as numeric

I have the following data frame df:
author filename a all also
1 dispt dispt_fed_49.txt 0.28 0.052 0.009
2 dispt dispt_fed_50.txt 0.18 0.063 0.013
3 dispt dispt_fed_51.txt 0.34 0.090 0.008
4 dispt dispt_fed_52.txt 0.27 0.024 0.016
5 dispt dispt_fed_53.txt 0.30 0.054 0.027
I want to convert it to a matrix using:
mat <- as.matrix(df)
But the result shows numeric columns as text:
author filename a all also
[1,] "dispt" "dispt_fed_49.txt" "0.280" "0.052" "0.009"
[2,] "dispt" "dispt_fed_50.txt" "0.177" "0.063" "0.013"
[3,] "dispt" "dispt_fed_51.txt" "0.339" "0.090" "0.008"
[4,] "dispt" "dispt_fed_52.txt" "0.270" "0.024" "0.016"
[5,] "dispt" "dispt_fed_53.txt" "0.303" "0.054" "0.027"
How can I keep these columns as numeric while converting to a matrix?
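One possible approach (a sketch, not taken from an answer in this thread): a matrix can only hold a single type, so keep just the numeric columns in the matrix and carry the identifiers along as rownames:
num_cols <- sapply(df, is.numeric)   # which columns are numeric
mat <- as.matrix(df[, num_cols])     # matrix containing only the numeric data
rownames(mat) <- df$filename         # keep the file names as row labels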

Related

How to find and replace min value in dataframe with text in r

I have a data frame with 20 columns and I would like to identify the minimum value in each column and replace it with text such as "min". Appreciate any help.
Sample data:
a b c
-0.05 0.31 0.62
0.78 0.25 -0.01
0.68 0.33 -0.04
-0.01 0.30 0.56
0.55 0.28 -0.03
Desired output
a b c
min 0.31 0.62
0.78 min -0.01
0.68 0.33 min
-0.01 0.30 0.56
0.55 0.28 -0.03
You can apply a function to each column that replaces the minimum value with a string. This returns a matrix, which can be converted back into a data frame if desired. As IceCreamToucan pointed out, all values will be of type character, since every element of a matrix must have the same type:
apply(df, 2, function(x) {
  x[x == min(x)] <- 'min'
  return(x)
})
a b c
[1,] "min" "0.31" "0.62"
[2,] "0.78" "min" "-0.01"
[3,] "0.68" "0.33" "min"
[4,] "-0.01" "0.3" "0.56"
[5,] "0.55" "0.28" "-0.03"
You can use the method below, but be aware that it converts all of your columns to character, since a vector's elements must all have the same type.
library(dplyr)
df %>%
mutate_all(~ replace(.x, which.min(.x), 'min'))
# a b c
# 1 min 0.31 0.62
# 2 0.78 min -0.01
# 3 0.68 0.33 min
# 4 -0.01 0.3 0.56
# 5 0.55 0.28 -0.03
apply(df, MARGIN = 2, FUN = function(x) { x[which.min(x)] <- 'min'; return(x) })
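If you are on a recent dplyr, the same idea can be written with across() (mutate_all is superseded); a sketch:
library(dplyr)
df %>%
  mutate(across(everything(), ~ replace(.x, which.min(.x), 'min')))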

Convert column headers into new columns

My data frame consists of time series financial data from many public companies. I purposely set companies' weights as their column headers while cleaning the data, and I also calculated log returns for each of them in order to calculate weighted returns in the next step.
Here is an example. There are four companies: A, B, C and D, and their corresponding weights in the portfolio are 0.4, 0.3, 0.2 and 0.1 respectively. So the current data set looks like:
df1 <- data.frame(matrix(vector(),ncol=9, nrow = 4))
colnames(df1) <- c("Date","0.4","0.4.Log","0.3","0.3.Log","0.2","0.2.Log","0.1","0.1.Log")
df1[1,] <- c("2004-10-29","103.238","0","131.149","0","99.913","0","104.254","0")
df1[2,] <- c("2004-11-30","104.821","0.015","138.989","0.058","99.872","0.000","103.997","-0.002")
df1[3,] <- c("2004-12-31","105.141","0.003","137.266","-0.012","99.993","0.001","104.025","0.000")
df1[4,] <- c("2005-01-31","107.682","0.024","137.08","-0.001","99.782","-0.002","105.287","0.012")
df1
Date 0.4 0.4.Log 0.3 0.3.Log 0.2 0.2.Log 0.1 0.1.Log
1 2004-10-29 103.238 0 131.149 0 99.913 0 104.254 0
2 2004-11-30 104.821 0.015 138.989 0.058 99.872 0.000 103.997 -0.002
3 2004-12-31 105.141 0.003 137.266 -0.012 99.993 0.001 104.025 0.000
4 2005-01-31 107.682 0.024 137.08 -0.001 99.782 -0.002 105.287 0.012
I want to create new columns that contain company weights so that I can calculate weighted returns in my next step:
Date 0.4 0.4.W 0.4.Log 0.3 0.3.W 0.3.Log 0.2 0.2.W 0.2.Log 0.1 0.1.W 0.1.Log
1 2004-10-29 103.238 0.400 0.000 131.149 0.300 0.000 99.913 0.200 0.000 104.254 0.100 0.000
2 2004-11-30 104.821 0.400 0.015 138.989 0.300 0.058 99.872 0.200 0.000 103.997 0.100 -0.002
3 2004-12-31 105.141 0.400 0.003 137.266 0.300 -0.012 99.993 0.200 0.001 104.025 0.100 0.000
4 2005-01-31 107.682 0.400 0.024 137.080 0.300 -0.001 99.782 0.200 -0.002 105.287 0.100 0.012
We can try
v1 <- grep("^[0-9.]+$", names(df1), value = TRUE)
df1[paste0(v1, ".w")] <- as.list(as.numeric(v1))
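Note that this appends the weight columns at the end of df1 (and uses a lowercase ".w" suffix, whereas the desired output shows ".W"). If the interleaved column order shown above matters, one possible follow-up, as a sketch, is to rebuild the column order explicitly:
ord <- c("Date", as.vector(rbind(v1, paste0(v1, ".w"), paste0(v1, ".Log"))))
df1 <- df1[ord]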

svd imputation R

I'm trying to use the SVD imputation from the bcv package but all the imputed values are the same (by column).
This is the dataset with missing data
http://pastebin.com/YS9qaUPs
library(bcv)
#load data
dataMiss = read.csv('dataMiss.csv')
#impute data
SVDimputation = round(impute.svd(dataMiss)$x, 2)
#find index of missing values
bool = apply(X = dataMiss, 2, is.na)
#put in a new data frame only the imputed value
SVDImpNA = mapply(function(x,y) x[y], as.data.frame(SVDimputation), as.data.frame(bool))
View(SVDImpNA)
head(SVDImpNA)
V1 V2 V3
[1,] -0.01 0.01 0.01
[2,] -0.01 0.01 0.01
[3,] -0.01 0.01 0.01
[4,] -0.01 0.01 0.01
[5,] -0.01 0.01 0.01
[6,] -0.01 0.01 0.01
Where am I wrong?
The impute.svd algorithm works as follows (a minimal R sketch of the loop is given after the steps):
1. Replace all missing values with the corresponding column means.
2. Compute a rank-k approximation to the imputed matrix.
3. Replace the values in the imputed positions with the corresponding values from the rank-k approximation computed in Step 2.
4. Repeat Steps 2 and 3 until convergence.
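Just to make the iteration concrete, here is a minimal sketch of that loop in plain R (illustrative only, not the bcv source; the function name impute_svd_sketch is made up), assuming a numeric matrix X with NAs and a target rank k:
impute_svd_sketch <- function(X, k, maxiter = 100, tol = 1e-6) {
  X <- as.matrix(X)
  miss <- is.na(X)
  # Step 1: start from the column means
  X[miss] <- colMeans(X, na.rm = TRUE)[col(X)[miss]]
  for (i in seq_len(maxiter)) {
    # Step 2: rank-k approximation of the current imputed matrix
    s <- svd(X, nu = k, nv = k)
    X_hat <- s$u %*% diag(s$d[1:k], k, k) %*% t(s$v)
    # Step 3: overwrite only the originally missing entries
    delta <- sum((X_hat[miss] - X[miss])^2)
    X[miss] <- X_hat[miss]
    # Step 4: stop once the imputed values no longer change
    if (delta < tol) break
  }
  X
}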
In your example code, you are setting k=min(n,p) (the default). Then, in Step 2, the rank-k approximation is exactly equal to the imputed matrix, so the algorithm converges after 0 iterations. That is, the algorithm sets all imputed entries to be the column means (or something extremely close to this, up to numerical error).
If you want to do something other than impute the missing values with the column means, you need to use a smaller value for k. The following code demonstrates this with your sample data:
> library("bcv")
> dataMiss = read.csv('dataMiss.csv')
# k = 3
> SVDimputation = impute.svd(dataMiss, k = 3, maxiter=10000)$x
> table(round(SVDimputation[is.na(dataMiss)], 2))
-0.01 0.01
531 1062
# k = 2
> SVDimputation = impute.svd(dataMiss, k = 2, maxiter=10000)$x
> table(round(SVDimputation[is.na(dataMiss)], 2))
-11.31 -6.94 -2.59 -2.52 -2.19 -2.02 -1.67 -1.63
25 23 61 2 54 23 5 44
-1.61 -1.2 -0.83 -0.8 -0.78 -0.43 -0.31 -0.15
14 10 13 19 39 1 14 19
-0.14 -0.02 0 0.01 0.02 0.03 0.06 0.17
83 96 94 77 30 96 82 28
0.46 0.53 0.55 0.56 0.83 0.91 1.26 1.53
1 209 83 23 28 111 16 8
1.77 5.63 9.99 14.34
112 12 33 5
Note that for your data, the default maximum number of iterations (100) was too low (I got a warning message). To fix this, I set maxiter=10000.
The problem you describe likely occurs because impute.svd initially sets all of the NA values equal to the column means and then converges immediately, leaving those values unchanged.
It depends on why you are using SVD imputation in the first place, but if you are flexible, a good solution might be to lower the rank used in the SVD call by setting k to, e.g., 1. Currently, k is set automatically to min(n, p), where n = nrow and p = ncol, which for your data means k = 3. If you set it to 1 (as in the example in the impute.svd documentation), the problem does not occur:
library(bcv)
dataMiss = read.csv("dataMiss.csv")
SVDimputation = round(impute.svd(dataMiss, k = 1)$x, 2)
head(SVDimputation)
[,1] [,2] [,3]
[1,] 0.96 -0.23 0.52
[2,] 0.02 -0.23 -1.92
[3,] -1.87 -0.23 0.52
[4,] -0.92 -0.23 0.52
[5,] 0.49 -0.46 0.52
[6,] -1.87 -0.23 0.52

Simulating data in R with multiple probability distributions

I am trying to simulate data via bootstrapping to create confidence bands for my real data with a funnel plot. I am building on the strategy of the accepted answer to a previous question. Instead of using a single probability distribution for simulating my data I want to modify it to use different probability distributions depending on the part of the data being simulated.
I greatly appreciate anyone who can help answer the question or help me phrase the question more clearly.
My problem is writing the appropriate R code to do a more complicated form of data simulation.
The current code is:
n <- 1e4
set.seed(42)
sims <- sapply(1:80, function(k)
  rowSums(replicate(k, sample((1:7)/10, n, TRUE, ps))) / k)
This code simulates data where each data point has a value which is the mean of between 1:80 observations.
For example, when the values of the data points are the mean of 10 observations (k=10), it randomly samples 10 values (which can be 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 or 0.7) based on a probability distribution ps, which gives the probability of each value (based on the entire empirical distribution).
ps looks like this:
ps <- prop.table(table((DF$mean_score)[DF$total_number_snps == 1]))
# 0.1 0.2 0.3 0.4 0.5 0.6 0.7
#0.582089552 0.194029851 0.124378109 0.059701493 0.029850746 0.004975124 0.004975124
e.g. the probability that the value of an observation is 0.1 is 0.582089552.
Now instead of using one frequency distribution for all simulations I would like to use different frequency distributions conditionally depending on the number of observations underlying each datapoint.
I made a table, cond_probs, that has a row for each of my real data points. There is a column with the total number of observations and a column giving the frequency of each of the values for each observation.
Example of the cond_probs table:
gene_name 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 total
A1 0.664 0.319 0.018 0.000 0.000 0.000 0.000 0.000 0.000 113.000
A2 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000
So for the data point A2, there is only 1 observation, which has a value of 0.1. Therefore the frequency of the 0.1 observations is 1. For A1, there are 113 observations and the majority of those (0.664) have the value 0.1. The idea is that cond_probs is like ps, but cond_probs has a probability distribution for each data point rather than one for all the data.
I would like to modify the above code so that the sampling is modified to use cond_probs instead of ps for the frequency distribution. And to use the number of observations, k , as a criteria when choosing which row in cond_probs to sample from. So it would work like this:
For data points with k number of observations:
look in the cond_probs table and randomly select a row where the total number of observations is similar in size to k: 0.9k-1.1k. If no such rows exist, continue.
Once a datapoint is selected, use the probability distribution from that line in cond_probs just like ps is used in the original code, to randomly sample k number of observations and output the mean of these observations.
For each of the n iterations of replicate, randomly sample with replacement a new data point from cond_probs, out of all rows where the value of total is similar to the current value of k ( 0.9k-1.1k).
The idea is that for this dataset one should condition which probability distribution to use based on the number of observations underlying a data point. This is because in this dataset the probability of an observation is influenced by the number of observations (genes with more SNPs tend to have a lower score per observation due to genetic linkage and background selection).
UPDATE USING ANSWER BELOW:
I tried using the answer below and it works for the simulated cond_probs data in the example but not for my real cond_probs file.
I imported and converted my cond_probs file to a matrix with
cond_probs <- read.table("cond_probs.txt", header = TRUE, check.names = FALSE)
cond_probs <- as.matrix(cond_probs)
and the first example ten rows (out of ~20,000 rows) looks like this:
>cond_probs
total 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
[1,] 109 0.404 0.174 0.064 0.183 0.165 0.009 0.000 0.000 0.000 0.000
[2,] 181 0.564 0.221 0.144 0.066 0.006 0.000 0.000 0.000 0.000 0.000
[3,] 289 0.388 0.166 0.118 0.114 0.090 0.093 0.028 0.003 0.000 0.000
[4,] 388 0.601 0.214 0.139 0.039 0.008 0.000 0.000 0.000 0.000 0.000
[5,] 133 0.541 0.331 0.113 0.000 0.008 0.008 0.000 0.000 0.000 0.000
[6,] 221 0.525 0.376 0.068 0.032 0.000 0.000 0.000 0.000 0.000 0.000
[7,] 147 0.517 0.190 0.150 0.054 0.034 0.048 0.007 0.000 0.000 0.000
[8,] 107 0.458 0.196 0.252 0.084 0.009 0.000 0.000 0.000 0.000 0.000
[9,] 13 0.846 0.154 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
If I run:
sampleSize <- 20
set.seed(42)
#replace 1:80 with 1: max number of SNPs in gene in dataset
sims_test <- sapply( 1:50, simulateData, sampleSize )
and look at the means from the sampling with x number of observations, I only get a single result when there should be 20.
for example:
> sims_test[[31]]
[1] 0.1
And sims_test is not ordered in the same way as sims:
>sims_test
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0.1 0.1 0.1666667 0.200 0.14 0.2666667 0.2000000 0.2375 0.1888889
[2,] 0.1 0.1 0.1333333 0.200 0.14 0.2333333 0.1571429 0.2625 0.1222222
[3,] 0.1 0.1 0.3333333 0.225 0.14 0.1833333 0.2285714 0.2125 0.1555556
[4,] 0.1 0.1 0.2666667 0.250 0.10 0.1500000 0.2000000 0.2625 0.2777778
[5,] 0.1 0.1 0.3000000 0.200 0.16 0.2000000 0.2428571 0.1750 0.1000000
[6,] 0.1 0.1 0.3666667 0.250 0.16 0.1666667 0.2142857 0.2500 0.2000000
[7,] 0.1 0.1 0.4000000 0.300 0.12 0.2166667 0.1857143 0.2375 0.1666667
[8,] 0.1 0.1 0.4000000 0.250 0.10 0.2500000 0.2714286 0.2375 0.2888889
[9,] 0.1 0.1 0.1333333 0.300 0.14 0.1666667 0.1714286 0.2750 0.2888889
UPDATE 2
Using cond_probs <- head(cond_probs, n), I have determined that the code works up to n = 517; for all sizes greater than this it produces the same output as above. I am not sure if this is an issue with the file itself or a memory issue. I found that if I remove line 518 and duplicate the preceding lines several times to make a larger file, it works, suggesting that the line itself is causing the problem. Line 518 looks like this:
9.000 0.889 0.000 0.000 0.000 0.111 0.000 0.000 0.000 0.000 0.000
I found another 4 offending lines:
9.000 0.444 0.333 0.111 0.111 0.000 0.000 0.000 0.000 0.000 0.000
9.000 0.444 0.333 0.111 0.111 0.000 0.000 0.000 0.000 0.000 0.000
9.000 0.111 0.222 0.222 0.111 0.111 0.222 0.000 0.000 0.000 0.000
9.000 0.667 0.111 0.000 0.000 0.000 0.222 0.000 0.000 0.000 0.000
I don't notice anything unusual about them. They all have a 'total' of 9 sites. If I remove these lines and run the 'cond_probs' file containing only the lines BEFORE these then the code works. But there must be other problematic lines as the entire 'cond_probs' still doesn't work.
I tried putting these problematic lines back into a smaller 'cond_probs' file and this file then works, so I am very confused as it doesn't seem the lines are inherently problematic. On the other hand the fact they all have 9 total sites suggests some kind of causative pattern.
I would be happy to share the entire file privately if that helps as I don't know what to do next for troubleshooting.
One further issue that comes up is I'm not sure if the code is working as expected. I made a dummy cond_probs file where there are two data points with a 'total' of '1' observation:
total 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
1.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000
1.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
So I would expect them to both be sampled for data points with '1' observation and therefore get roughly 50% of observations with a mean of '0.2' and 50% with a mean of '0.6'. However the mean is always 0.2:
sims_test[[1]]
[1] 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
Even if I sample 10000 times, all observations are 0.2 and never 0.6. My understanding of the code is that it should randomly select a new row from cond_probs with a similar size for each observation, but in this case it seems not to be doing so. Do I misunderstand the code, or is it still a problem with my input?
The entire cond_probs file can be found at the following address:
cond_probs
UPDATE 3
Changing sapply to lapply when running the simulations fixed this issue.
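So the earlier test call becomes, for example:
sims_test <- lapply(1:50, simulateData, sampleSize)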
Another reason I think leaving cond_probs as it is and choosing a distribution sampleSize number of times might be the best solution: The probability of choosing a distribution should be related to its frequency in cond_probs. If we combine distributions the odds of picking a distribution with total 9 or 10 will no longer depend on the number of observations with these totals. Example: If there are 90 distributions with total=10 and 10 with total=9 there should be a 90% chance to choose a distribution with total=10. If we combine distributions wouldn't the odds become 50/50 for choosing a distribution with 'total'= 9 or 10 (which would not be ideal)?
I simply wrote a function ps that chooses an appropriate distribution from cond_probs:
N <- 10 # The sampled values are 0.1, 0.2, ... , N/10
M <- 8 # number of distributions in "cond_probs"
#-------------------------------------------------------------------
# Example data:
set.seed(1)
cond_probs <- matrix(0,M,N)
is.numeric(cond_probs)
for(i in 1:nrow(cond_probs)){ cond_probs[i,] <- dnorm((1:N)/M,i/M,0.01*N) }
is.numeric(cond_probs)
total <- sort( sample(1:80,nrow(cond_probs)) )
cond_probs <- cbind( total, cond_probs/rowSums(cond_probs) )
colnames(cond_probs) <- c( "total", paste("P",1:N,sep="") )
#---------------------------------------------------------------------
# A function that chooses an appropriate distribution from "cond_probs",
# depending on the number of observations "numObs":
ps <- function(numObs, similarityLimit = 0.1) {
  similar <- which(abs(cond_probs[, "total"] - numObs) / numObs < similarityLimit)
  if (length(similar) == 0) {
    return(NA)
  } else {
    return(cond_probs[similar[sample(1:length(similar), 1)], -1])
  }
}
#-----------------------------------------------------------------
# A function that simulates data using a distribution that is
# appropriate to the number of observations, if possible:
simulateData <- function(numObs, sampleSize) {
  if (any(is.na(ps(numObs)))) {
    return(NA)
  } else {
    return(rowSums(
      replicate(numObs,
                replicate(sampleSize, sample((1:N)/10, 1, prob = ps(numObs))))
    ) / numObs)
  }
}
#-----------------------------------------------------------------
# Test:
sampleSize <- 30
set.seed(42)
sims <- lapply( 1:80, simulateData, sampleSize )
The distributions in cond_probs:
total P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
[1,] 16 6.654875e-01 3.046824e-01 2.923948e-02 5.881753e-04 2.480041e-06 2.191926e-09 4.060763e-13 1.576900e-17 1.283559e-22 2.189990e-28
[2,] 22 2.335299e-01 5.100762e-01 2.335299e-01 2.241119e-02 4.508188e-04 1.900877e-06 1.680045e-09 3.112453e-13 1.208647e-17 9.838095e-23
[3,] 30 2.191993e-02 2.284110e-01 4.988954e-01 2.284110e-01 2.191993e-02 4.409369e-04 1.859210e-06 1.643219e-09 3.044228e-13 1.182153e-17
[4,] 45 4.407425e-04 2.191027e-02 2.283103e-01 4.986755e-01 2.283103e-01 2.191027e-02 4.407425e-04 1.858391e-06 1.642495e-09 3.042886e-13
[5,] 49 1.858387e-06 4.407417e-04 2.191023e-02 2.283099e-01 4.986746e-01 2.283099e-01 2.191023e-02 4.407417e-04 1.858387e-06 1.642492e-09
[6,] 68 1.642492e-09 1.858387e-06 4.407417e-04 2.191023e-02 2.283099e-01 4.986746e-01 2.283099e-01 2.191023e-02 4.407417e-04 1.858387e-06
[7,] 70 3.042886e-13 1.642495e-09 1.858391e-06 4.407425e-04 2.191027e-02 2.283103e-01 4.986755e-01 2.283103e-01 2.191027e-02 4.407425e-04
[8,] 77 1.182153e-17 3.044228e-13 1.643219e-09 1.859210e-06 4.409369e-04 2.191993e-02 2.284110e-01 4.988954e-01 2.284110e-01 2.191993e-02
The means of the distributions:
> cond_probs[,-1] %*% (1:10)/10
[,1]
[1,] 0.1364936
[2,] 0.2046182
[3,] 0.3001330
[4,] 0.4000007
[5,] 0.5000000
[6,] 0.6000000
[7,] 0.6999993
[8,] 0.7998670
Means of the simulated data for 31 observations:
> sims[[31]]
[1] 0.2838710 0.3000000 0.2935484 0.3193548 0.3064516 0.2903226 0.3096774 0.2741935 0.3161290 0.3193548 0.3032258 0.2967742 0.2903226 0.3032258 0.2967742
[16] 0.3129032 0.2967742 0.2806452 0.3129032 0.3032258 0.2935484 0.2935484 0.2903226 0.3096774 0.3161290 0.2741935 0.3161290 0.3193548 0.2935484 0.3032258
The appropriate distribution is the third one:
> ps(31)
P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
2.191993e-02 2.284110e-01 4.988954e-01 2.284110e-01 2.191993e-02 4.409369e-04 1.859210e-06 1.643219e-09 3.044228e-13 1.182153e-17

reshape unique strings in rows into columns in R

I would like to reshape my data based on the unique strings in the "Bulls" column (the data frame is called all):
EBV Bulls
0.13 NE001362
0.17 NE001361
0.05 NE001378
-0.12 NE001359
-0.14 NE001379
0.13 NE001380
-0.46 NE001379
-0.46 NE001359
-0.68 NE001394
0.28 NE001391
0.84 NE001394
-0.43 NE001393
-0.18 NE001707
My expected output:
NE001362 NE001361 NE001378 NE001359 NE001379 NE001380 NE001394 NE001391 NE001393 NE001707
    0.13     0.17     0.05    -0.12    -0.14     0.13    -0.68     0.28    -0.43    -0.18
                              -0.46    -0.46             0.84
I tried dat2 <- dcast(all, EBV~variable, value.var = "Bulls") but it does not work.
You have two options. Indexing the multiple occurrences for each level of Bulls or using a list to hold the different levels of EBV.
Option 1: Indexing multiple occurrences
You can use data.table to generate an index that numbers multiple occurrences of EBV:
require(data.table)
setDT(all) ## convert to data.table
all[, index:=1:.N, by=Bulls] ## generate index
dcast.data.table(all, formula=index ~ Bulls, value.var='EBV')
Option 2: Using a list to store multiple values
You could use a list as a value with data.table (I'm not sure if plain data.frame supports it).
require(data.table)
setDT(all) ## convert to data.table
all[, list(list(EBV)), by=Bulls] ## multiple values stored as list
Just to make sure that base R gets some acknowledgement:
## Add an ID, like ilir did, but with base R functions
mydf$ID <- with(mydf, ave(rep(1, nrow(mydf)), Bulls, FUN = seq_along))
Here's reshape:
reshape(mydf, direction = "wide", idvar="ID", timevar="Bulls")
# ID EBV.NE001362 EBV.NE001361 EBV.NE001378 EBV.NE001359 EBV.NE001379
# 1 1 0.13 0.17 0.05 -0.12 -0.14
# 7 2 NA NA NA -0.46 -0.46
# EBV.NE001380 EBV.NE001394 EBV.NE001391 EBV.NE001393 EBV.NE001707
# 1 0.13 -0.68 0.28 -0.43 -0.18
# 7 NA 0.84 NA NA NA
And xtabs. Note: This is a table-like matrix, so if you want a data.frame, you'll have to use as.data.frame.matrix on the output.
xtabs(EBV ~ ID + Bulls, mydf)
# Bulls
# ID NE001359 NE001361 NE001362 NE001378 NE001379 NE001380 NE001391
# 1 -0.12 0.17 0.13 0.05 -0.14 0.13 0.28
# 2 -0.46 0.00 0.00 0.00 -0.46 0.00 0.00
# Bulls
# ID NE001393 NE001394 NE001707
# 1 -0.43 -0.68 -0.18
# 2 0.00 0.84 0.00
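For completeness, a tidyr/dplyr sketch of the same indexing idea (assuming the data frame is called mydf, as in the base R answer):
library(dplyr)
library(tidyr)
mydf %>%
  group_by(Bulls) %>%
  mutate(ID = row_number()) %>%   # index repeated bulls, like the ave() call above
  ungroup() %>%
  pivot_wider(names_from = Bulls, values_from = EBV)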
