I have the following data frame with a character variable that represents the number of lanes on a highway. Can I replace this vector with a similar vector that has numbers instead of letters?
df<- structure(list(Blocked.Lanes = c("|RS|RS|ML|", "|RS|", "|RS|ML|ML|ML|ML|",
"|RS|", "|RS|RE|", "|ML|ML|ML|", "|RS|ML|", "|RS|", "|ML|ML|ML|ML|ML|ML|",
"|RS|ML|ML|"), Event.Id = c(240314L, 240381L, 240396L, 240796L,
240948L, 241089L, 241190L, 241225L, 241226L, 241241L)), row.names = c(NA,
10L), class = "data.frame")
The output should be something like df2 below:
df2<- structure(list(Blocked.Lanes = c(3L, 1L, 5L, 1L, 2L, 3L, 2L,
1L, 6L, 3L), Event.Id = c(240314L, 240381L, 240396L, 240796L,
240948L, 241089L, 241190L, 241225L, 241226L, 241241L)), class = "data.frame", row.names = c(NA,
-10L))
One way would be to count the number of "|" characters in each string and subtract 1, since each string has one more "|" than it has lanes.
stringr::str_count(df$Blocked.Lanes, '\\|') - 1
#[1] 3 1 5 1 2 3 2 1 6 3
In base R :
lengths(gregexpr("\\|", df$Blocked.Lanes)) - 1
Another way would be to count the words in each string.
stringr::str_count(df$Blocked.Lanes, '\\w+')
lengths(gregexpr("\\w+", df$Blocked.Lanes))
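One caveat with the base R variant, shown as a small sketch: gregexpr() returns -1 for a string with no match, so lengths() reports 1 rather than 0 for it. The `lanes` vector below is made up for illustration (the empty string is not in the question's data); wrapping the result in regmatches() is one way to get a true count.

```r
# Hypothetical input: two strings from the question plus an empty one
lanes <- c("|RS|RS|ML|", "|RS|", "")

m <- gregexpr("\\w+", lanes)

# Counting the match positions directly miscounts the empty string,
# because "no match" is stored as a single -1
naive_count <- lengths(m)
print(naive_count)   # 3 1 1

# Extracting the matches first gives character(0) for "no match"
safe_count <- lengths(regmatches(lanes, m))
print(safe_count)    # 3 1 0
```

For the data in the question every string contains at least one lane, so both forms agree there.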
Similar to Ronak's solution you could also do:
stringr::str_count(df$Blocked.Lanes, "\\b[A-Z]{2}\\b")
if the lanes are always 2 letters long, or
stringr::str_count(df$Blocked.Lanes, "\\b[A-Z]+\\b")
if the lanes are always at least one letter long.
stringr::str_count(df$Blocked.Lanes, "(?<=\\|)[A-Z]+(?=\\|)")
also works.
Not as succinct as @Ronak Shah's, but another method in base R.
String split on string literal "|" and then count elements:
df2 <- transform(df, Blocked.Lanes = lengths(Map(function(x) x[x != ""],
strsplit(df$Blocked.Lanes, "|", fixed = TRUE))))
I have a data frame where for each Filename value, there is a set of values for Compound. Some compounds have a value for IS.Name, which is a value that is one of the Compound values for a Filename.
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
For each set of rows with the same Filename value in my data frame, I want to match the IS.Name value with the corresponding Compound value, and put the Chrom.1.RT value from the matched row into the IS.RT cell. For example, in the table above I want to take the Chrom.1.RT value from row 2 for Compound=IS-1 and put it into IS.RT on row 1 like this:
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0.61
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
If possible I need to do this in R. Thanks in advance for any help!
EDIT: Here is a larger, more detailed example:
Filename Compound Chrom.1.RT IS.Name IS.RT
1 Sample-001 IS-1 1.32495 NONE NA
2 Sample-001 Compound-1 1.344033333 IS-1 NA
3 Sample-001 IS-2 0.127416667 NONE NA
4 Sample-001 Compound-2 0 IS-2 NA
5 Sample-002 IS-1 1.32495 NONE NA
6 Sample-002 Compound-1 1.344033333 IS-1 NA
7 Sample-002 IS-2 0.127416667 NONE NA
8 Sample-002 Compound-2 0 IS-2 NA
This is chromatography data. For each sample, four compounds are being analyzed, and each compound has a retention time value (Chrom.1.RT). Two of these compounds are references that are used by the other two compounds. For example, Compound-1 is using IS-1, while IS-1 does not have a reference (IS). Within each sample I am trying to match up the IS.Name to the compound row for it, grab its Chrom.1.RT and put it in the IS.RT field. So for Compound-1, I want to find the Chrom.1.RT value for the Compound with the same name as the IS.Name field (IS-1) and put it in the IS.RT field for Compound-1. The tables I'm working with list all of the compounds together and don't match up the values for the references, which I need to do for the next step of calculating the difference between Chrom.1.RT and IS.RT for each compound. Does that help?
EDIT - Here's the code I found that seems to work:
sampleList <- unique(df1$Filename)
for (s in sampleList) {
  # s is the Filename value itself, so compare against it directly
  SampleRows <- which(df1$Filename == s)
  RefRows <- df1[SampleRows, ]
  # look up each IS.Name among this sample's compounds and copy its Chrom.1.RT
  df1$IS.RT[SampleRows] <- RefRows$Chrom.1.RT[match(df1$IS.Name[SampleRows], RefRows$Compound)]
}
I'm definitely open to any suggestions to make this more efficient though.
First of all, I suggest that in the future you provide your example as the output of dput(df1), as that makes it a lot easier to read into R than the space-delimited table you provided.
That being said, I've managed to wrangle it into R with the "help" of MS Excel.
df1=structure(list(Filename = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("Sample-001", "Sample-002"), class = "factor"),
Compound = structure(c(3L, 1L, 4L, 2L, 3L, 1L, 4L, 2L), .Label = c("Compound-1",
"Compound-2", "IS-1", "IS-2"), class = "factor"), Chrom.1.RT = c(1.32495,
1.344033333, 0.127416667, 0, 1.32495, 1.344033333, 0.127416667,
0), IS.Name = structure(c(3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L), .Label = c("IS-1",
"IS-2", "NONE"), class = "factor"), IS.RT = c(NA, NA, NA,
NA, NA, NA, NA, NA)), .Names = c("Filename", "Compound",
"Chrom.1.RT", "IS.Name", "IS.RT"), class = "data.frame", row.names = c(NA,
-8L))
The code below is severely clunky but it does the job.
library("dplyr")
df1=tbl_df(df1)
left_join(
  df1,
  left_join(
    df1 %>% select(-Compound),
    df1 %>% group_by(Compound) %>% summarise(unique(Chrom.1.RT)),
    c("IS.Name" = "Compound")
  )
) %>%
  select(-IS.RT) %>%
  rename(IS.RT = `unique(Chrom.1.RT)`)
Unless I got it wrong, this is what you need?
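A base R alternative, offered only as a sketch: build a composite key from Filename and Compound, then do a single match() over the whole data frame instead of looping per sample. The data frame is retyped from the larger example in the question; the key names and the tab separator are assumptions made for illustration (the separator must not occur in the real names).

```r
# Data retyped from the question's larger example
df1 <- data.frame(
  Filename   = rep(c("Sample-001", "Sample-002"), each = 4),
  Compound   = rep(c("IS-1", "Compound-1", "IS-2", "Compound-2"), times = 2),
  Chrom.1.RT = rep(c(1.32495, 1.344033333, 0.127416667, 0), times = 2),
  IS.Name    = rep(c("NONE", "IS-1", "NONE", "IS-2"), times = 2),
  stringsAsFactors = FALSE
)

# Composite keys: one identifies each row's compound within its sample,
# the other identifies the reference compound that row points to
key_compound <- paste(df1$Filename, df1$Compound, sep = "\t")
key_is       <- paste(df1$Filename, df1$IS.Name,  sep = "\t")

# match() returns NA where no reference exists (IS.Name == "NONE"),
# so those rows get IS.RT = NA
df1$IS.RT <- df1$Chrom.1.RT[match(key_is, key_compound)]

print(df1)
```

Because the key includes Filename, a reference is only ever taken from the same sample, which is the per-sample behaviour the question asks for.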
I've got a similarity matrix between all cases and, in a separate data frame, classes of these cases. I want to compute average similarity between cases from the same class, here is the equation for an example n from class j:
We have to compute a sum of all squared proximities between n and all cases k that come from the same class as n. Link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#outliers
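Written out, the sum described above would look like the following. The notation is a reconstruction from the description rather than a copy of the original formula: $\mathrm{prox}(n, k)$ stands for the proximity between cases $n$ and $k$, and $\mathrm{cl}(k)$ for the class of case $k$.

```latex
\bar{P}(n) = \sum_{k \,:\, \mathrm{cl}(k) = j} \mathrm{prox}^2(n, k)
```

The sum includes the case $k = n$ itself, which is consistent with the expected result given further down (for the first case, $1 + 0.70571875^2 + 0.6595^2 + 0.7134375^2 \approx 2.44197$).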
I implemented that with 2 for loops, but it is really slow. Is there a faster way to do such a thing in R?
Thanks.
//DATA (dput)
Data frame with classes:
structure(list(class = structure(c(1L, 2L, 2L, 1L, 3L, 3L, 1L,
1L, 2L, 3L), .Label = c("1", "2", "3", "5", "6", "7"), class = "factor")), .Names = "class", row.names = c(NA,
-10L), class = "data.frame")
Proximity matrix (row m and column m correspond to class in row m of data frame above):
structure(c(1, 0.60996875, 0.51775, 0.70571875, 0.581375, 0.42578125,
0.6595, 0.7134375, 0.645375, 0.468875, 0.60996875, 1, 0.77021875,
0.55171875, 0.540375, 0.53084375, 0.4943125, 0.462625, 0.7910625,
0.56321875, 0.51775, 0.77021875, 1, 0.451375, 0.60353125, 0.62353125,
0.5203125, 0.43934375, 0.6909375, 0.57159375, 0.70571875, 0.55171875,
0.451375, 1, 0.69196875, 0.59390625, 0.660375, 0.76834375, 0.606875,
0.65834375, 0.581375, 0.540375, 0.60353125, 0.69196875, 1, 0.7194375,
0.684, 0.68090625, 0.50553125, 0.60234375, 0.42578125, 0.53084375,
0.62353125, 0.59390625, 0.7194375, 1, 0.53665625, 0.553125, 0.513,
0.801625, 0.6595, 0.4943125, 0.5203125, 0.660375, 0.684, 0.53665625,
1, 0.8456875, 0.52878125, 0.65303125, 0.7134375, 0.462625, 0.43934375,
0.76834375, 0.68090625, 0.553125, 0.8456875, 1, 0.503, 0.6215,
0.645375, 0.7910625, 0.6909375, 0.606875, 0.50553125, 0.513,
0.52878125, 0.503, 1, 0.60653125, 0.468875, 0.56321875, 0.57159375,
0.65834375, 0.60234375, 0.801625, 0.65303125, 0.6215, 0.60653125,
1), .Dim = c(10L, 10L))
Correct result:
c(2.44197227050781, 2.21901680175781, 2.07063155175781, 2.52448621289062,
1.88040830957031, 2.16019295703125, 2.58622273828125, 2.81453253222656,
2.1031745078125, 2.00542063378906)
Should be possible. Your notation does not make clear whether we will find members of like classes in the rows or columns, so this answer presumes in the columns but the obvious modifications would work as well if they were in rows.
colSums(mat^2) # in R, ^2 is applied element-wise rather than as matrix multiplication.
Since both operations are vectorized it would be expected to be much faster than for-loops.
With the modification and assuming the matrix is named 'mat' and the class-dataframe named 'cldf':
sapply( 1:nrow(mat) ,
function(r) sum(mat[r, cldf[['class']][r] == cldf[['class']] ]^2) )
[1] 2.441972 2.219017 2.070632 2.524486 1.880408 2.160193 2.586223 2.814533 2.103175 2.005421
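The same per-class sum can also be sketched without sapply, by masking the squared matrix with a logical "same class" matrix. The 4 x 4 matrix and the class labels below are made up for illustration and are not the question's data; `same_class`, `vectorized` and `looped` are hypothetical names.

```r
# Small made-up symmetric proximity matrix and class labels
mat <- matrix(c(1.0, 0.5, 0.2, 0.4,
                0.5, 1.0, 0.3, 0.6,
                0.2, 0.3, 1.0, 0.1,
                0.4, 0.6, 0.1, 1.0), nrow = 4, byrow = TRUE)
cls <- c("a", "b", "a", "b")

# TRUE where the row case and the column case share a class
same_class <- outer(cls, cls, "==")

# Zero out other-class entries, then sum the squares per row
vectorized <- rowSums(mat^2 * same_class)

# The sapply form from the answer, for comparison
looped <- sapply(seq_len(nrow(mat)),
                 function(r) sum(mat[r, cls == cls[r]]^2))

print(vectorized)   # 1.04 1.36 1.04 1.36
```

The mask needs an n x n logical matrix, so for very large proximity matrices the sapply version may be the more memory-friendly choice.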
I am working with bluetooth sensor data and need to identify possible duplicate readings for each unique ID. The bluetooth sensor made a scan every five seconds, and may pick up the same device in subsequent readings if the device wasn't moving quickly (i.e. sitting in traffic). There may be multiple readings from the same device if that device made a round trip, but those should be separated by several minutes. I can't wrap my head around how to get rid of the duplicate data. I need to calculate a time difference column if the macid's match.
The data has the format:
macid time
00:03:7A:4D:F3:59 82333
00:03:7A:EF:58:6F 223556
00:03:7A:EF:58:6F 223601
00:03:7A:EF:58:6F 232731
00:03:7A:EF:58:6F 232736
00:05:4F:0B:45:F7 164141
And I need to create:
macid time timediff
00:03:7A:4D:F3:59 82333 NA
00:03:7A:EF:58:6F 223556 NA
00:03:7A:EF:58:6F 223601 45
00:03:7A:EF:58:6F 232731 9130
00:03:7A:EF:58:6F 232736 5
00:05:4F:0B:45:F7 164141 NA
My first attempt at this is extremely slow and not really usable:
dedupeIDs <- function (zz) {
  # Order by macid and then time
  zz <- zz[order(zz$macid, zz$time) ,]
  zz$timediff <- c(999999, diff(zz$time))
  for (i in 2:nrow(zz)) {
    if (zz[i, "macid"] == zz[i - 1, "macid"]) {
      print("Same ID")
    } else {
      # first reading of a new device: there is no earlier reading to compare with
      zz[i, "timediff"] <- 999999
    }
  }
  return(zz)
}
I'll then be able to filter the data.frame based on the time difference column.
Sample data:
structure(list(macid = structure(c(1L, 2L, 2L, 2L, 2L, 3L),
.Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F",
"00:05:4F:0B:45:F7"), class = "factor"),
time = c(82333, 223556, 223601, 232731, 232736, 164141)),
.Names = c("macid", "time"), row.names = c(NA, -6L),
class = "data.frame")
How about:
x <- structure(list(macid= structure(c(1L, 2L, 2L, 2L, 2L, 3L),
.Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
class = "factor"), time = c(82333, 223556, 223601, 232731, 232736, 164141)),
.Names = c("macid", "time"), row.names = c(NA, -6L), class = "data.frame")
# ensure 'x' is ordered properly
x <- x[order(x$macid,x$time),]
# add timediff column by macid
x$timediff <- ave(x$time, x$macid, FUN=function(x) c(NA,diff(x)))
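Once timediff exists, the filtering step mentioned in the question could be sketched as below. The 60-unit threshold `min_gap` is a hypothetical value chosen for illustration, and the sketch treats `time` as a plain number, exactly as the expected output in the question does.

```r
# Sample data retyped from the question
x <- data.frame(
  macid = c("00:03:7A:4D:F3:59", rep("00:03:7A:EF:58:6F", 4),
            "00:05:4F:0B:45:F7"),
  time  = c(82333, 223556, 223601, 232731, 232736, 164141),
  stringsAsFactors = FALSE
)

# Order by device and time, then take per-device differences
x <- x[order(x$macid, x$time), ]
x$timediff <- ave(x$time, x$macid, FUN = function(t) c(NA, diff(t)))

# Hypothetical threshold: readings closer together than this are
# treated as repeats of the previous reading
min_gap <- 60

# Keep the first reading of each device (NA) and readings after a long gap
deduped <- x[is.na(x$timediff) | x$timediff > min_gap, ]

print(deduped)
```

With this threshold the readings at 223601 and 232736 are dropped, and the reading at 232731 is kept as the start of a second pass by the same device.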
I have a data frame with one timed event per row. Some rows hold the sender's events (typeid=1) and the others hold the receiver's events (typeid=2). I want to calculate the delay between sender and receiver (the time difference).
My data is organized in a data.frame, as the following snapshot shows:
dd[1:10,]
timeid valid typeid
1 18,00035 1,00000 1
2 18,00528 0,00493 2
3 18,02035 2,00000 1
4 18,02116 0,00081 2
5 18,04035 3,00000 1
6 18,04116 0,00081 2
7 18,06035 4,00000 1
8 18,06116 0,00081 2
9 18,08035 5,00000 1
10 18,08116 0,00081 2
calc_DelayVIDEO <- function (dDelay ){
  pktProcess <- TRUE
  nLost <- 0
  myDelay <- data.frame(time=-1, delay=-1, jitter=-1, nLost=-1)
  myDelay <- myDelay[-1, ]
  tini <- 0
  tend <- 0
  for (itr in c(1:length(dDelay$timeid))) {
    aRec <- dDelay[itr,]
    if (aRec$typeid == 1){
      tini <- as.numeric(aRec$timeid)
      if (!pktProcess ) {
        nLost <- (nLost + 1)
        myprt(paste("Packet Lost at time ", aRec$timeid, " lost= ", nLost, sep=""))
      }
      pktProcess <- FALSE
    }else if (aRec$typeid == 2){
      tend <- as.numeric(aRec$timeid)
      dd <- tend - tini
      jit <- calc_Jitter(dant=myDelay[length(myDelay), 2], dcur=dd)
      myDelay <- rbind(myDelay, c(aRec$timeid, dd, jit, nLost))
      pktProcess <- TRUE
      #myprt(paste("time=", aRec$timeev, " delay=", dd, " Delay Var=", jit, " nLost=", nLost ))
    }
  }
  colnames(myDelay) <- c("time", "delay", "jitter", "nLost")
  return (myDelay)
}
To calculate the delay I use the calc_DelayVIDEO function; nevertheless, for data frames with a large number of records (~60000) it takes a long time.
How can I substitute the for loop with more optimized R functions?
Can I use lapply to do such computation? If so, can you provide me an example?
Thanks in advance,
The usual solution is to think hard enough about the problem to find something vectorized.
If that fails, I sometimes resort to rewriting the loop in C++; the Rcpp package can help with the interface.
The *apply family of functions is not an optimized form of for loop. In fact, I've worked on problems where for loops were faster than apply because apply used more memory and caused my machine to swap.
I would suggest fully initializing the myDelay object and avoid using rbind (which must re-allocate memory):
init <- rep(NA, length(dDelay$timeid))
myDelay <- data.frame(time=init, delay=init, jitter=init, nLost=init)
then replace:
myDelay <- rbind(myDelay, c(aRec$timeid, dd, jit, nLost))
with
myDelay[itr,] <- c(aRec$timeid, dd, jit, nLost)
As Dirk said: vectorization will help. An example of this would be to move the call to as.numeric out of the loop (since this function works with vectors).
dDelay$timeid <- as.numeric(dDelay$timeid)
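One thing to watch with this conversion, shown as a small sketch: if timeid was read in as a factor or as text with a decimal comma (as in the snapshot in the question), calling as.numeric() on it directly does not give the times. The `timeid` vector below is a two-value stand-in for the real column, and `time_num` is a hypothetical name.

```r
# Stand-in for the real column: a factor with decimal commas
timeid <- factor(c("18,00035", "18,00528"))

# Directly converting a factor returns its internal level codes
codes <- as.numeric(timeid)
print(codes)      # 1 2

# Convert to character, swap the decimal comma for a point, then convert
time_num <- as.numeric(sub(",", ".", as.character(timeid), fixed = TRUE))
print(time_num)   # 18.00035 18.00528
```

If the data comes from a file, reading it with read.csv2() or with dec = "," in read.table() avoids the problem at the source.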
Other things that may help are
Not bothering with the line aRec <- dDelay[itr,], since you can just access the row of dDelay, without creating a new variable.
Preallocating myDelay, since having it grow within the loop is likely to be a bottleneck. See Joshua's answer for more on this.
Another optimization : If I read your code right, you can easily calculate the vector nLost by using :
nLost <-cumsum(dDelay$typeid==1)
outside the loop. That one you can just add to the dataframe in the end. Saves you a lot of time already. If I use your dataframe, then :
> nLost <-cumsum(dd$typeid==1)
> nLost
[1] 1 1 2 2 3 3 4 4 5 5
Likewise the times at which the packets were lost can be calculated as:
> dd$timeid[which(dd$typeid==1)]
[1] 18,00035 18,02035 18,04035 18,06035 18,08035
in case you want to report them somewhere too.
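For comparison, here is a sketch of how the whole delay calculation might be vectorized. It follows a slightly different reading of the original loop, in which nLost only increases when a sender row arrives while the previous sender is still unanswered. The data is the question's snapshot with one receiver row removed to simulate a lost packet; jitter is left out because calc_Jitter is not shown, and all helper names (`is_send`, `send_no`, `res`, ...) are assumptions. A receiver with no earlier sender gets NA here, whereas the loop would subtract the initial tini of 0.

```r
# Question's snapshot with the receiver row at 18,02116 removed (simulated loss)
dd <- data.frame(
  timeid = c("18,00035", "18,00528", "18,02035", "18,04035", "18,04116"),
  typeid = c(1L, 2L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Decimal commas to numeric times
time <- as.numeric(sub(",", ".", dd$timeid, fixed = TRUE))
is_send <- dd$typeid == 1
is_recv <- dd$typeid == 2

# For every row, the time of the most recent sender row (NA if none yet)
send_no <- cumsum(is_send)
last_send_time <- c(NA, time[is_send])[send_no + 1]

# A sender row directly after another sender means the earlier packet was lost
prev_send <- c(FALSE, head(is_send, -1))
nLost <- cumsum(is_send & prev_send)

# One output row per receiver event
res <- data.frame(
  time  = time[is_recv],
  delay = time[is_recv] - last_send_time[is_recv],
  nLost = nLost[is_recv]
)
print(res)
```

Every step is a whole-vector operation, so the cost grows linearly with the number of rows and nothing is appended inside a loop.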
For testing, I used :
dd <- structure(list(timeid = structure(1:10, .Label = c("18,00035",
"18,00528", "18,02035", "18,02116", "18,04035", "18,04116", "18,06035",
"18,06116", "18,08035", "18,08116"), class = "factor"), valid = structure(c(3L,
2L, 4L, 1L, 5L, 1L, 6L, 1L, 7L, 1L), .Label = c("0,00081", "0,00493",
"1,00000", "2,00000", "3,00000", "4,00000", "5,00000"), class = "factor"),
typeid = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L)), .Names = c("timeid",
"valid", "typeid"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))