Most efficient data processing strategy, potential double and missing values - r

I have a lot of (inefficient) ideas about how to process this data, but it was suggested on another question that I ask this directly as a separate question.
Basically, I have a lot of data that was taken by multiple users, with an ID number for the sample and two weight variables (pre- and post-processing). Because the data was not processed sequentially by ID, and the pre- and post-processing data were collected at very different times, it would be difficult (and would probably increase the likelihood of data entry errors) for the user to locate the ID and its pre- value in order to enter the post- value.
So instead the dataframe looks something like this:
#example data creation
id = c(rep(1:4,2),5:8)
pre = c(rep(10,4),rep(NA,4),rep(100,4))
post = c(rep(NA,4),rep(10,4),rep(100,4))
df = cbind(id,pre,post)
print(df)
id pre post
[1,] 1 10 NA
[2,] 2 10 NA
[3,] 3 10 NA
[4,] 4 10 NA
[5,] 1 NA 10
[6,] 2 NA 10
[7,] 3 NA 10
[8,] 4 NA 10
[9,] 5 100 100
[10,] 6 100 100
[11,] 7 100 100
[12,] 8 100 100
I asked another question on how to merge the data, so I feel alright about that. I want to know what the best method is to sweep the dataframe for user error before I merge the columns.
Specifically, I am looking for cases where one ID has a pre but not a post (or vice versa), or where there is a double entry for any of the values. Ideally I would like to dump all IDs with strange data (doubles, missing values) into a new dataframe so that I can go investigate and see what the problem is.
For example, if my data frame has this:
id pre post
1 1 10 NA
2 1 10 NA
3 2 10 NA
4 3 10 NA
6 2 NA 10
7 3 NA 10
8 4 NA 10
9 5 100 100
10 6 100 100
How do I get it to recognize that id #1 has been entered twice, and that ids 1 and 4 are missing a post and pre entry? All I need it to do is detect those anomalies and spit them out into a dataframe!
Thanks!

I am not sure the following is all the question asks for.
na <- rowSums(is.na(df[, -1])) > 0
df[duplicated(df[,1]) | na, ]
# id pre post
#[1,] 1 10 NA
#[2,] 2 10 NA
#[3,] 3 10 NA
#[4,] 4 10 NA
#[5,] 1 NA 10
#[6,] 2 NA 10
#[7,] 3 NA 10
#[8,] 4 NA 10
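A different angle (my own sketch, not part of the answer above) is to count the non-NA pre and post entries per ID and flag any ID whose counts are not exactly one each; that catches double entries and missing halves in one pass. It assumes df is built as a data.frame (e.g. df <- data.frame(id, pre, post)) rather than the cbind() matrix above.
# count non-NA pre and post values per ID
counts <- aggregate(list(n_pre = !is.na(df$pre), n_post = !is.na(df$post)),
                    by = list(id = df$id), FUN = sum)
# an ID is suspicious unless it has exactly one pre and one post entry
bad_ids <- counts$id[counts$n_pre != 1 | counts$n_post != 1]
problems <- df[df$id %in% bad_ids, ]   # rows to go investigate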

Related

Why group_by() not working with apply() function

I have a matrix and want to calculate the product of the values in each column of the matrix, by group. First, the groups correspond to different user numbers. Second, each column of the matrix should be computed separately.
For example, for user 1 and y column, the number should be 1*1; for user 2 and y column, the number should be 2*3*1
I used apply with group_by but the output is incorrect.
#data
user<-c(1,1,2,2,2)
y<-c(1,1,2,3,1)
z<-c(2,2,3,3,3)
dt<-data.frame(user,y,z)
user y z
1 1 1 2
2 1 1 2
3 2 2 3
4 2 3 3
5 2 1 3
The output below is not by user
dt%>%group_by(user)%>% (function(x){apply(x[,2:3],2,prod)})
y z
6 108
Desired output (you can remove the NA rows too; it does not matter whether this is added as a new column or not):
[,1] [,2]
[1,] NA NA
[2,] 1 4
[3,] NA NA
[4,] NA NA
[5,] 6 27
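For what it's worth, a sketch of my own (not an answer from the thread): apply() ignores the grouping because group_by() only attaches grouping metadata that dplyr verbs such as summarise() honour, while apply() just sees the underlying data frame. A dplyr-native way to get one product per user and column is:
library(dplyr)
# product of each column within each user; gives one row per user rather than
# the NA-padded layout shown above
dt %>%
  group_by(user) %>%
  summarise(across(c(y, z), prod))
# user 1: y = 1, z = 4; user 2: y = 6, z = 27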

R: append values from one matrix to another - at the right location [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 7 years ago.
I have two matrices, two columns each. The first column is the subject number, the second column the appropriate measurement of the subject.
I want to add the values of the first matrix ("value 1") to the second matrix but in the right position. I.e. in the position belonging to the correct subject.
There are two problems:
The matrices have different numbers of rows, and each matrix may contain subjects that do not appear in the other.
The subject numbers in the first matrix are not ordered.
I tried to do it with a for-loop:
a1<-c(1,18,2,6,25,8,9,7,4)
b1<-seq(1:15)
a2<-a1*100
b2<-seq(from=10,to=150,by=10)
a<-cbind(a1,a2)
b<-cbind(b1,b2,NA)
colnames(a)<-(c("subject","value 1"))
colnames(b)<-(c("subject","value 2","value 1"))
length_b<-length(b[,1])
for (i in length_b)
{
present<-(b[i,1] %in% a[,1])
if (present==TRUE)
{
location[i]<-which(a[,1]==b[i,1])
b[i,3]<-a[location,2]
}
}
But then
b
still looks like this:
subject value 2 value 1
[1,] 1 10 NA
[2,] 2 20 NA
[3,] 3 30 NA
[4,] 4 40 NA
[5,] 5 50 NA
[6,] 6 60 NA
[7,] 7 70 NA
[8,] 8 80 NA
[9,] 9 90 NA
[10,] 10 100 NA
[11,] 11 110 NA
[12,] 12 120 NA
[13,] 13 130 NA
[14,] 14 140 NA
[15,] 15 150 NA
Use merge:
merge(x=a,y = b, by.x="subject", by.y = "subject")
subject value 1.x value 2 value 1.y
1 1 100 10 NA
2 2 200 20 NA
3 4 400 40 NA
4 6 600 60 NA
5 7 700 70 NA
6 8 800 80 NA
7 9 900 90 NA
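As a side note (my own sketch, not part of the answer above): if the goal is to fill b's empty "value 1" column in place, keeping all 15 rows and their order, match() does it without a loop.
# look up each subject of b in a; subjects absent from a stay NA
b[, "value 1"] <- a[match(b[, "subject"], a[, "subject"]), "value 1"]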

How can I create a 2D heatmap grid in R?

I have a data set like this
head(data)
V1 V2
[1,] NA NA
[2,] NA NA
[3,] NA NA
[4,] 5 2
[5,] 5 2
[6,] 5 2
where
unique(data$V1)
[1] NA 5 4 3 2 1 0 6 7 9 8
unique(data$V2)
[1] NA 2 6 1 5 3 7 4 0 8 9
What I would like to do is a plot similar to this
plot(data$V1, data$V2)
but with a colour indicator showing how many matches there are in each cell of a grid, rather than individual points.
Can anyone help me?
It looks like this may be what you want - first you tabulate using table(), then plot a heatmap of the table using heatmap():
set.seed(1)
data <- data.frame(V1=sample(1:10,100,replace=TRUE),V2=sample(1:10,100,replace=TRUE))
foo <- table(data)
heatmap(foo,Rowv=NA,Colv=NA)
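If ggplot2 is an option (an assumption on my part, not part of the answer above), geom_bin2d() produces a similar gridded count plot directly from the raw points:
library(ggplot2)
# bin the (V1, V2) pairs into 1x1 cells and colour each cell by its count
ggplot(na.omit(data), aes(V1, V2)) +
  geom_bin2d(binwidth = c(1, 1))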

Data frame with NA in R

I have a data frame which has some rows with NA entries; I want to find the index of the row and the column at which the entry is NA. I am looping in a nested fashion to do that, and it is taking too long. Is there a quicker way to do it? Thanks.
set.seed(123)
dfrm <- data.frame(a=sample(c(1:5, NA), 25, replace=TRUE), b=sample(c(letters, NA), 25, replace=TRUE))
which(is.na(dfrm), arr.ind=TRUE)
row col
[1,] 4 1
[2,] 5 1
[3,] 8 1
[4,] 11 1
[5,] 16 1
[6,] 20 1
[7,] 21 1
[8,] 24 1
[9,] 6 2
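If you also want the offending rows themselves for inspection (my addition, not from the answer above), complete.cases() gives a quick row-level filter:
# rows of dfrm that contain at least one NA
dfrm[!complete.cases(dfrm), ]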

"Loop through" data.table to calculate conditional averages

I want to "loop through" the rows of a data.table and calculate an average for each row. The average should be calculated based on the following mechanism:
1. Look up the identifier ID in row i (ID(i)).
2. Look up the value of T2 in row i (T2(i)).
3. Calculate the average over the Data1 values in all rows j which meet both criteria: ID(j) = ID(i) and T1(j) = T2(i).
4. Enter the calculated average in the column Data2 of row i.
DF = data.frame(ID=rep(c("a","b"),each=6),
T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
DT = data.table(DF)
DT[ , Data2:=NA_real_]
ID T1 T2 Data1 Data2
[1,] a 1 1 1 NA
[2,] a 1 2 2 NA
[3,] a 1 3 3 NA
[4,] a 2 1 4 NA
[5,] a 2 2 5 NA
[6,] a 2 3 6 NA
[7,] b 1 1 7 NA
[8,] b 1 2 8 NA
[9,] b 1 3 9 NA
[10,] b 2 1 10 NA
[11,] b 2 2 11 NA
[12,] b 2 3 12 NA
For this simple example the result should look like this:
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
I think one way of doing this would be to loop through the rows, but I think that is inefficient. I've had a look at the apply() function, but I'm not sure it would solve my problem. I could also use data.frame instead of data.table if this would make it much more efficient or much easier. The real dataset contains approximately 1 million rows.
The rule of thumb is to aggregate first, and then join to that.
agg = DT[,mean(Data1),by=list(ID,T1)]
setkey(agg,ID,T1)
DT[,Data2:={JT=J(ID,T2);agg[JT,V1][[3]]}]
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
As you can see, it's a bit ugly in this case (but it will be fast). It's planned to add a drop argument, which will avoid the [[3]] bit, and maybe we could provide a way to tell [.data.table to evaluate i in the calling scope (i.e. no self join), which would avoid the JT= part that is needed here because ID is in both agg and DT.
keyby has been added to v1.8.0 on R-Forge so that avoids the need for the setkey, too.
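For readers on a current data.table version, the same aggregate-then-join idea can be written as an update join; this is my own sketch in modern syntax, not part of the original answer.
library(data.table)
# aggregate Data1 by ID and T1, then join back where DT's T2 matches agg's T1
agg <- DT[, .(avg = mean(Data1)), by = .(ID, T1)]
DT[agg, Data2 := i.avg, on = .(ID, T2 = T1)]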
A somewhat faster alternative to iterating over rows would be a solution which employs vectorization.
R> d <- data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
R> d
ID T1 T2 Data1
1 a 1 1 1
2 a 1 2 2
3 a 1 3 3
4 a 2 1 4
5 a 2 2 5
6 a 2 3 6
7 b 1 1 7
8 b 1 2 8
9 b 1 3 9
10 b 2 1 10
11 b 2 2 11
12 b 2 3 12
R> rowfunction <- function(i) with(d, mean(Data1[which(T1==T2[i] & ID==ID[i])]))
R> d$Data2 <- sapply(1:nrow(d), rowfunction)
R> d
ID T1 T2 Data1 Data2
1 a 1 1 1 2
2 a 1 2 2 5
3 a 1 3 3 NaN
4 a 2 1 4 2
5 a 2 2 5 5
6 a 2 3 6 NaN
7 b 1 1 7 8
8 b 1 2 8 11
9 b 1 3 9 NaN
10 b 2 1 10 8
11 b 2 2 11 11
12 b 2 3 12 NaN
(The NaN entries in Data2 appear because mean() returns NaN when given a zero-length vector; they can be converted to NA afterwards if needed.)
Also, I'd prefer to preprocess the data before getting it into R. For example, if you are retrieving the data from an SQL server, it might be a better choice to let the server calculate the averages, as it will very likely do a better job of this.
R is actually not very good at number crunching, for several reasons, but it is excellent for doing statistics on already-preprocessed data.
Using tapply and part of another recent post:
DF = data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
EDIT: Actually, most of the original function is redundant and was intended for something else. Here, simplified:
ansMat <- tapply(DF$Data1, DF[, c("ID", "T1")], mean)
i <- cbind(match(DF$ID, rownames(ansMat)), match(DF$T2, colnames(ansMat)))
DF<-cbind(DF,Data2 = ansMat[i])
# ansMat<-tapply(seq_len(nrow(DF)), DF[, c("ID", "T1")], function(x) {
# curSub <- DF[x, ]
# myIndex <- which(DF$T2 == curSub$T1 & DF$ID == curSub$ID)
# meanData1 <- mean(curSub$Data1)
# return(meanData1 = meanData1)
# })
The trick was doing tapply over ID and T1 instead of ID and T2. Anything speedier?
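For completeness, a base R sketch of the same aggregate-then-join idea using aggregate() and merge() (my own addition, not from the thread); note that merge() does not preserve the original row order.
# per-(ID, T1) means, renamed so they join against T2
means <- aggregate(Data1 ~ ID + T1, data = DF, FUN = mean)
names(means) <- c("ID", "T2", "Data2")
out <- merge(DF, means, by = c("ID", "T2"), all.x = TRUE)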
