I have a lot of (inefficient) ideas about how to process this data, but it was suggested on another question that I ask you all directly in a new question.
Basically, I have a lot of data that was taken by multiple users, with an ID number for the sample and two weight variables (pre- and post-processing). Because the data was not processed sequentially by ID, and the pre- and post-processing measurements were collected at very different times, it would be difficult (and would probably increase the likelihood of data-entry errors) for the user to locate the ID and its pre value in order to enter the post value.
So instead the dataframe looks something like this:
#example data creation
id   <- c(rep(1:4, 2), 5:8)
pre  <- c(rep(10, 4), rep(NA, 4), rep(100, 4))
post <- c(rep(NA, 4), rep(10, 4), rep(100, 4))
df <- cbind(id, pre, post)  # note: cbind() returns a matrix, not a data.frame
print(df)
id pre post
[1,] 1 10 NA
[2,] 2 10 NA
[3,] 3 10 NA
[4,] 4 10 NA
[5,] 1 NA 10
[6,] 2 NA 10
[7,] 3 NA 10
[8,] 4 NA 10
[9,] 5 100 100
[10,] 6 100 100
[11,] 7 100 100
[12,] 8 100 100
I asked another question about how to merge the data, so I feel alright about that. What I want to know is the best method to sweep the dataframe for user error before I merge the columns.
Specifically, I am looking for cases where an ID has a pre but not a post (or vice versa), or where there is a double entry for any of the values. Ideally, I would like to dump all of the anomalous IDs (duplicates, missing values) into a new dataframe so that I can investigate what the problem is.
For example, if my data frame has this:
id pre post
1 1 10 NA
2 1 10 NA
3 2 10 NA
4 3 10 NA
6 2 NA 10
7 3 NA 10
8 4 NA 10
9 5 100 100
10 6 100 100
How do I get it to recognize that id 1 has been entered twice, that id 1 is missing a post entry, and that id 4 is missing a pre entry? All I need it to do is detect those anomalies and spit them out into a dataframe!
Thanks!
I am not sure the following covers everything the question asks for.
# flag rows with a missing pre or post value
na <- rowSums(is.na(df[, -1])) > 0
# combine with rows whose ID is a repeat to pull out all suspect rows
df[duplicated(df[, 1]) | na, ]
# id pre post
#[1,] 1 10 NA
#[2,] 2 10 NA
#[3,] 3 10 NA
#[4,] 4 10 NA
#[5,] 1 NA 10
#[6,] 2 NA 10
#[7,] 3 NA 10
#[8,] 4 NA 10
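If you also want a per-ID summary before merging (for example, to catch an ID that has two pre entries or no post at all), one possible sketch using aggregate, with the column names from the example data above:
# count the non-missing pre and post entries for each ID;
# na.action = na.pass keeps rows that are NA in one column
chk <- aggregate(cbind(pre, post) ~ id, data = as.data.frame(df),
                 FUN = function(x) sum(!is.na(x)), na.action = na.pass)
# an ID is suspect unless it has exactly one pre and one post entry
chk[chk$pre != 1 | chk$post != 1, ]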
I want to plot a grouped bar chart. The dataset I'm plotting is very large; here is a small subset of it:
Label size
x 2 3 4 5
y 2 6 8
z 1 6 8
a 2 2
b 4 7 9 10 11
c 8 12
I want to plot a bar chart in which the labels are on the x axis and, for each label, there is one bar per size value.
For example, x has sizes 2 3 4 5, so there would be four bars with heights 2, 3, 4 and 5. Then y has sizes 2 6 8, so there would be three bars with heights 2, 6 and 8, and so on.
Can anyone help me out?
First let's save your data as a data.frame with two columns for the label and size.
mydata <- read.table(textConnection('
label\tsize
x\t2 3 4 5
y\t2 6 8
z\t1 6 8
a\t2 2
b\t4 7 9 10 11
c\t8 12
'),header=TRUE,sep="\t")
Show it in R:
> mydata
label size
1 x 2 3 4 5
2 y 2 6 8
3 z 1 6 8
4 a 2 2
5 b 4 7 9 10 11
6 c 8 12
Then comes the tricky part. We store each individual size value in a matrix, filling shorter rows with NA. This is inspired by this post.
# split the space-separated sizes into a list of character vectors
mylist <- strsplit(as.character(mydata[,"size"]), " ")
# length of the longest size vector
n <- max(sapply(mylist, length))
# pad each vector with NA to length n, bind into a matrix, and
# convert to numeric (apply over rows also transposes the result)
mymatrix <- apply(do.call(rbind, lapply(mylist, `[`, seq_len(n))), 1, as.numeric)
The matrix looks like this:
> mymatrix
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 2 2 1 2 4 8
[2,] 3 6 6 2 7 12
[3,] 4 8 8 NA 9 NA
[4,] 5 NA NA NA 10 NA
[5,] NA NA NA NA 11 NA
Finally we are ready to make the plot!
barplot(
  mymatrix, beside=TRUE,   # one group of bars per column (label)
  col=1:n,                 # one colour per position within a group
  names.arg=mydata$label)
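If ggplot2 is an option, an equivalent route is to reshape the data into a long data.frame with one row per label/size pair and let the plotting layer do the grouping. A sketch, assuming ggplot2 is installed (the idx counter only exists to separate the bars within a label):
# one row per (label, size) pair
vals <- strsplit(as.character(mydata$size), " ")
long <- data.frame(label = rep(mydata$label, lengths(vals)),
                   size  = as.numeric(unlist(vals)))
# within-label bar index, used only to dodge the bars
long$idx <- ave(long$size, long$label, FUN = seq_along)
library(ggplot2)
ggplot(long, aes(factor(label), size, fill = factor(idx))) +
  geom_col(position = "dodge")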
I have a bizarre problem where I've combined several data frames that contain species abundance data. I used rbind.fill() to collate them, but some of the column names for the same species are spelled slightly differently, so for several species I have 2-3 columns.
Does anyone know of a way I can merge the data from these columns together?
Simple example:
dat <- matrix(data=c(
Sp.a=c(1,2,3,4,5,NA,NA,NA,NA,NA),
Sp.b=c(3,4,5,6,7,5,4,6,3,4),
Sp.c=c(4,4,4,3,2,NA,NA,NA,NA,NA),
Spp.A=c(NA,NA,NA,NA,NA,2,3,4,2,3),
Spp.C=c(NA,NA,NA,NA,NA,3,4,2,5,4)
), 10,5)
colnames(dat)<- c("Sp.a", "Sp.b", "Sp.c", "Spp.A", "Spp.C")
dat
Sp.a Sp.b Sp.c Spp.A Spp.C
[1,] 1 3 4 NA NA
[2,] 2 4 4 NA NA
[3,] 3 5 4 NA NA
[4,] 4 6 3 NA NA
[5,] 5 7 2 NA NA
[6,] NA 5 NA 2 3
[7,] NA 4 NA 3 4
[8,] NA 6 NA 4 2
[9,] NA 3 NA 2 5
[10,] NA 4 NA 3 4
How can I get Sp.a and Spp.A into a single column (and likewise Sp.c and Spp.C)?
Thanks for any help,
Paul
Using reshape2 and going from wide --> long --> wide (again) format:
library(reshape2)
## long format
dat.m <- melt(dat)
## remove missing values
dat.m <- dat.m[!is.na(dat.m$value),]
## rename names
dat.m$Var2 <- tolower(sub("Spp","Sp", dat.m$Var2) )
## back to wide format
dcast(dat.m, Var1 ~ Var2)
Var1 sp.a sp.b sp.c
1 1 1 3 4
2 2 2 4 4
3 3 3 5 4
4 4 4 6 3
5 5 5 7 2
6 6 2 5 3
7 7 3 4 4
8 8 4 6 2
9 9 2 3 5
10 10 3 4 4
Here's one way. This is pretty general, and would even work if you had one series divided over three or more columns.
dat <- data.frame(dat)
# get the last letter of each column and make it lowercase,
# we'll be grouping the columns by this
ns <- tolower(gsub('^.+\\.', '', names(dat)))
# group the columns by their last letter, and run each group through pmax
result <- lapply(split.default(dat, ns), function(x) do.call(function(...) pmax(..., na.rm=TRUE), x))
do.call(cbind, result)
# a b c
# [1,] 1 3 4
# [2,] 2 4 4
# [3,] 3 5 4
# [4,] 4 6 3
# [5,] 5 7 2
# [6,] 2 5 3
# [7,] 3 4 4
# [8,] 4 6 2
# [9,] 2 3 5
# [10,] 3 4 4
# take the Sp.a value where present, otherwise fall back to Spp.A
# (dat must be a data.frame here, e.g. dat <- data.frame(dat))
ColsToMerge <- c("Sp.a", "Spp.A")
dat[["A.merged"]] <-
  apply(dat[, ColsToMerge], 1, function(rr) ifelse(is.na(rr[[1]]), rr[[2]], rr[[1]]))
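A vectorized alternative for the same pick-the-first-non-missing logic, assuming the dplyr package is available, is coalesce():
library(dplyr)
dat <- data.frame(dat)
# Sp.a where present, otherwise Spp.A (and likewise for the c columns)
dat$a.merged <- coalesce(dat$Sp.a, dat$Spp.A)
dat$c.merged <- coalesce(dat$Sp.c, dat$Spp.C)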
I want to create a function that produces a matrix containing several lags of a variable. A simple example that works is
a <- ts(1:10)
cbind(a, lag(a, -1))
To do this for multiple lags, I have
lagger <- function(var, lags) {
### Create list of lags
lagged <- lapply(1:lags, function(x){
lag(var, -x)
})
### Join lags together
do.call(cbind, list(var, lagged))
}
Using the above example gives unexpected results:
lagger(a, 1)
returns a list of length 20, with the original time series broken out into separate list slots and each of the final 10 entries being a copy of the lagged series.
Any suggestions to getting this working? Thanks!
This gives a lag of 0 and of 1.
library(zoo)
a <- ts(11:13)
lags <- -(0:1)
a.lag <- as.ts(lag(as.zoo(a), lags))
Now a.lag is this:
> a.lag
Time Series:
Start = 1
End = 4
Frequency = 1
lag0 lag-1
1 11 NA
2 12 11
3 13 12
4 NA 13
If you don't want the NA entries, use as.ts(na.omit(lag(as.zoo(a), lags))).
Based on @Joshua Ulrich's answer.
I think embed is the correct answer, but it returns the vectors the other way around, i.e. the lagged series come out in descending rather than ascending lag order. See the following:
lagged <- embed(a,4)
colnames(lagged) <- paste('t', 3:0, sep='-')
lagged
t-3 t-2 t-1 t-0
[1,] 4 3 2 1
[2,] 5 4 3 2
[3,] 6 5 4 3
[4,] 7 6 5 4
[5,] 8 7 6 5
[6,] 9 8 7 6
[7,] 10 9 8 7
This gives you the correct values, but not in the expected order, since the lags run in descending order.
If you reorder it like this:
lagged_OK <- lagged[,ncol(lagged):1]
colnames(lagged_OK) <- paste('t', 0:3, sep='-')
lagged_OK
t-0 t-1 t-2 t-3
[1,] 1 2 3 4
[2,] 2 3 4 5
[3,] 3 4 5 6
[4,] 4 5 6 7
[5,] 5 6 7 8
[6,] 6 7 8 9
[7,] 7 8 9 10
then you get the correctly ordered lag matrix. I added column names only for explanatory purposes; you can simply do:
embed(a,4)[ ,4:1]
If you really want a lagger function, try this:
lagger <- function(x, lag=1){
lag <- lag+1
Lagged <- embed(x,lag)[ ,lag:1]
colnames(Lagged) <- paste('lag', 0:(lag-1), sep='.')
return(Lagged)
}
lagger(a, 4)
lag.0 lag.1 lag.2 lag.3 lag.4
[1,] 1 2 3 4 5
[2,] 2 3 4 5 6
[3,] 3 4 5 6 7
[4,] 4 5 6 7 8
[5,] 5 6 7 8 9
[6,] 6 7 8 9 10
lagger(a, 1)
lag.0 lag.1
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 4 5
[5,] 5 6
[6,] 6 7
[7,] 7 8
[8,] 8 9
[9,] 9 10
I'm not sure what's wrong with your function, but you can probably use embed instead.
> embed(a,4)
[,1] [,2] [,3] [,4]
[1,] 4 3 2 1
[2,] 5 4 3 2
[3,] 6 5 4 3
[4,] 7 6 5 4
[5,] 8 7 6 5
[6,] 9 8 7 6
[7,] 10 9 8 7
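For what it's worth, the problem in the original lagger() is most likely the line do.call(cbind, list(var, lagged)): lagged is itself a list, so cbind() receives the whole list as a single argument instead of one argument per lag. Flattening with c() appears to fix it:
lagger <- function(var, lags) {
  # create the list of lagged series
  lagged <- lapply(seq_len(lags), function(x) lag(var, -x))
  # flatten so cbind() sees each lag as a separate argument,
  # i.e. cbind(var, lag1, lag2, ...)
  do.call(cbind, c(list(var), lagged))
}
lagger(a, 1)  # now equivalent to cbind(a, lag(a, -1))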
I want to "loop through" the rows of a data.table and calculate an average for each row. The average should be calculated based on the following mechanism:
Look up the identifier ID in row i (ID(i))
Look up the value of T2 in row i (T2(i))
Calculate the average over the Data1 values in all rows j, which meet these two criteria: ID(j) = ID(i) and T1(j) = T2(i)
Enter the calculated average in the column Data2 of row i
DF = data.frame(ID=rep(c("a","b"),each=6),
T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
DT = data.table(DF)
DT[ , Data2:=NA_real_]
ID T1 T2 Data1 Data2
[1,] a 1 1 1 NA
[2,] a 1 2 2 NA
[3,] a 1 3 3 NA
[4,] a 2 1 4 NA
[5,] a 2 2 5 NA
[6,] a 2 3 6 NA
[7,] b 1 1 7 NA
[8,] b 1 2 8 NA
[9,] b 1 3 9 NA
[10,] b 2 1 10 NA
[11,] b 2 2 11 NA
[12,] b 2 3 12 NA
For this simple example the result should look like this:
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
One way of doing this would be to loop through the rows, but I suspect that is inefficient. I've had a look at the apply() function, but I'm not sure it would solve my problem. I could also use a data.frame instead of a data.table if that would make it much more efficient or much easier. The real dataset contains approximately 1 million rows.
The rule of thumb is to aggregate first, and then join to that.
agg <- DT[, mean(Data1), by = list(ID, T1)]
setkey(agg, ID, T1)
DT[, Data2 := { JT <- J(ID, T2); agg[JT, V1][[3]] }]
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
As you can see it's a bit ugly in this case (but it will be fast). It's planned to add drop, which will avoid the [[3]] bit, and maybe we could provide a way to tell [.data.table to evaluate i in the calling scope (i.e. no self join), which would avoid the JT= bit; that is needed here because ID exists in both agg and DT.
keyby has been added to v1.8.0 on R-Forge, so that avoids the need for the setkey, too.
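On more recent data.table versions, the same aggregate-then-join can be written as an update join with the on= argument, which avoids both the setkey and the [[3]] indexing. A sketch:
# group averages of Data1 by ID and T1
agg <- DT[, .(avg = mean(Data1)), by = .(ID, T1)]
# update join: match each DT row's (ID, T2) against agg's (ID, T1)
# and write the matching group average into Data2 (non-matches stay NA)
DT[agg, on = .(ID, T2 = T1), Data2 := i.avg]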
A somewhat faster alternative to an explicit loop over the rows is to apply a lookup function across the row indices.
R> d <- data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
R> d
ID T1 T2 Data1
1 a 1 1 1
2 a 1 2 2
3 a 1 3 3
4 a 2 1 4
5 a 2 2 5
6 a 2 3 6
7 b 1 1 7
8 b 1 2 8
9 b 1 3 9
10 b 2 1 10
11 b 2 2 11
12 b 2 3 12
R> rowfunction <- function(i) with(d, mean(Data1[which(T1==T2[i] & ID==ID[i])]))
R> d$Data2 <- sapply(1:nrow(d), rowfunction)
R> d
ID T1 T2 Data1 Data2
1 a 1 1 1 2
2 a 1 2 2 5
3 a 1 3 3 NaN
4 a 2 1 4 2
5 a 2 2 5 5
6 a 2 3 6 NaN
7 b 1 1 7 8
8 b 1 2 8 11
9 b 1 3 9 NaN
10 b 2 1 10 8
11 b 2 2 11 11
12 b 2 3 12 NaN
Also, I'd prefer to preprocess the data before getting it into R. That is, if you are retrieving the data from an SQL server, it might be a better choice to let the server calculate the averages, as it will very likely do a better job of it.
R is actually not very good at raw number crunching, for several reasons, but it is excellent for doing statistics on already-preprocessed data.
Using tapply and part of another recent post:
DF = data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
EDIT: Actually, most of the original function is redundant and was intended for something else. Here, simplified:
# group means of Data1 by ID and T1
ansMat <- tapply(DF$Data1, DF[, c("ID", "T1")], mean)
# matrix-index the means by each row's ID (rows) and T2 (columns)
i <- cbind(match(DF$ID, rownames(ansMat)), match(DF$T2, colnames(ansMat)))
DF <- cbind(DF, Data2 = ansMat[i])
# ansMat<-tapply(seq_len(nrow(DF)), DF[, c("ID", "T1")], function(x) {
# curSub <- DF[x, ]
# myIndex <- which(DF$T2 == curSub$T1 & DF$ID == curSub$ID)
# meanData1 <- mean(curSub$Data1)
# return(meanData1 = meanData1)
# })
The trick was doing tapply over ID and T1 instead of ID and T2. Anything speedier?