How to compare two consecutive rows with a reference value in R? - r

I have a data frame of vehicle trajectories. Here's a snapshot:
> head(df)
vehicle frame globalx class velocity lane
1 2 43 6451214 2 37.76 2
2 2 44 6451217 2 37.90 2
3 2 45 6451220 2 38.05 2
4 2 46 6451223 2 38.18 2
5 2 47 6451225 2 38.32 2
6 2 48 6451228 2 38.44 2
where, vehicle= vehicle id (repeats because same vehicle is observed in several time frames), frame= frame id of time frames in which it was observed, globalx = x coordinate of the front center of the vehicle, class=type of vehicle (1=motorcycle, 2=car, 3=truck), velocity=speed of vehicles in feet per second, lane= lane number (there are 6 lanes). I think following illustration will better explain the problem:
The 'frame' represents one tenth of a second i.e. one frame is 0.1 seconds long. At frame 't' the vehicle has globalx coordinate x(t) and at frame 't-1' (0.1 seconds before) it was x(t-1). The reference location is 'U'(globalx=6451179.1116) and I simply want a new column in df called 'u' which has 'yes' in the row where globalx of the vehicle was greater than reference coordinate at 'U' AND the previous consecutive globalx coordinate of this vehicle was less than reference coordinate at 'U'. This means that if df has 100 vehicles then there will be 100 'yes' in 'u' column because every vehicle will meet the above criteria once. I have tried to do this by running the function with ifelse and also tried to do the same using a for loop but it doesn't work for me. The output should have one new column:
vehicle frame globalx class velocity lane u
I tried using ifelse inside for loop and a function but it doesn't work for me.

I assume the data frame is sorted primarily for vehicle and secondarily for globalx. If it's not you can do it by:
idx <- with(df,order(vehicle,globalx))
df <- df[idx,]
Now, you can perform it with the following vectorized operations:
# example reference line
U <- 6451220
# adding the extra column
samecar <- duplicated(df[,"vehicle"])
passU <- c(FALSE,diff(sign(df[,"globalx"]-U+1e-10))>0)
df[,"u"] <- ifelse(samecar & passU,"yes","no")

Here is my solution:
First create dummy data, based on your provided data (I have saved it to data.txt on my desktop), duplicate the data so that there are two cars with the same identical data, but different vehicle id's:
library(plyr)
df <- read.table("~/Desktop/data.txt",header=T)
df.B <- df; df.B$vehicle = 3 #For demonstration
df <- rbind(df,df.B); rm(df.B)
Then we can build a function to process:
mvt <- function(xref=NULL,...,data=df){
if(!is.numeric(xref)) #Input must be numeric
stop("xref must be numeric",call.=F)
xref = xref[1]
##Split on vehicle and process.
ddply(data,"vehicle",function(d){
L = nrow(d) #Number of Rows
d$u = FALSE #Default to Not crossing
#One or more rows can be checked.
if(L == 1)
d$u = (d$globalx > xref)
else if(L > 1){
ix <- which(d$globalx[2:L] > xref & d$globalx[1:(L-1)] <= xref)
if(length(ix) > 0)
d$u[ix + 1] = TRUE
}
#done
return(d)
})
}
Which can be used in the following manner:
mvt(6451216)
mvt(6451217)

Related

How to return the range of values shared between two data frames in R?

I have several data frames that have the same columns names, and ID
, the following to are the start from and end to of a range and group label from each of them.
What I want is to find which values offrom and to from one of the data frames are included in the range of the other one. I leave an example picture to ilustrate what I want to achieve (no graph is need for the moment)
I thought I could accomplish this using between() of the dplyr package but no. This could be accomplish using if between() returns true then return the maximum value of from and the minimum value of to between the data frames.
I leave example data frames and the results I'm willing to obtain.
a <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),from=c(1,500,1000,1,500,1000,1,500,1000),
to=c(400,900,1400,400,900,1400,400,900,1400),group=rep("a",9))
b <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),from=c(300,1200,1900,1400,2800,3700,1300,2500,3500),
to=c(500,1500,2000,2500,3000,3900,1400,2800,3900),group=rep("b",9))
results <- data.frame(ID = c(1,1,1,2,3),from=c(300,500,1200,1400,1300),
to=c(400,500,1400,1400,1400),group=rep("a, b",5))
I tried using this function which will return me the values when there is a match but it doesn't return me the range shared between them
f <- function(vec, id) {
if(length(.x <- which(vec >= a$from & vec <= a$to & id == a$ID))) .x else NA
}
b$fromA <- a$from[mapply(f, b$from, b$ID)]
b$toA <- a$to[mapply(f, b$to, b$ID)]
We can play with the idea that the starting and ending points are in different columns and the ranges for the same group (a and b) do not overlap. This is my solution. I have called 'point_1' and 'point_2' your mutated 'from' and 'to' for clarity.
You can bind the two dataframes and compare the from col with the previous value lag(from) to see if the actual value is smaller. Also you compare the previous lag(to) to the actual to col to see if the max value of the range overlap the previous range or not.
Important, these operations do not distinguish if the two rows they are comparing are from the same group (a or b). Therefore, filtering the NAs in point_1 (the new mutated 'from' column) you will remove wrong mutated values.
Also, note that I assume that, for example, a range in 'a' cannot overlap two rows in 'b'. In your 'results' table that doesn't happen but you should check that in your dataframes.
res = rbind(a,b) %>% # Bind by rows
arrange(ID,from) %>% # arrange by ID and starting point (from)
group_by(ID) %>% # perform the following operations grouped by IDs
# Here is the trick. If the ranges for the same ID and group (i.e. 1,a) do
# not overlap, when you mutate the following cols the result will be NA for
# point_1.
mutate(point_1 = ifelse(from <= lag(to), from, NA),
point_2 = ifelse(lag(to)>=to, to, lag(to)),
groups = paste(lag(group), group, sep = ',')) %>%
filter(! is.na(point_1)) %>% # remove NAs in from
select(ID,point_1, point_2, groups) # get the result dataframe
If you play a bit with the code, not using the filter() and select() you will see how that's work.
> res
# A tibble: 5 x 4
# Groups: ID [3]
ID point_1 point_2 groups
<dbl> <dbl> <dbl> <chr>
1 1 300 400 a,b
2 1 500 500 b,a
3 1 1200 1400 a,b
4 2 1400 1400 a,b
5 3 1300 1400 a,b

Is there an R function equivalent to Excel's $ for "keep reference cell constant" [duplicate]

This question already has answers here:
Divide each data frame row by vector in R
(5 answers)
Closed 2 years ago.
I'm new to R and I've done my best googling for the answer to the question below, but nothing has come up so far.
In Excel you can keep a specific column or row constant when using a reference by putting $ before the row number or column letter. This is handy when performing operations across many cells when all cells are referring to something in a single other cell. For example, take a dataset with grades in a course: Row 1 has the total number of points per class assignment (each column is an assignment), and Rows 2:31 are the raw scores for each of 30 students. In Excel, to calculate percentage correct, I take each student's score for that assignment and refer it to the first row, holding row constant in the reference so I can drag down and apply that operation to all 30 rows below Row 1. Most importantly, in Excel I can also drag right to do this across all columns, without having to type a new operation.
What is the most efficient way to perform this operation--holding a reference row constant while performing an operation to all other rows, then applying this across columns while still holding the reference row constant--in R? So far I had to slice the reference row to a new dataframe, remove that row from the original dataframe, then type one operation per column while manually going back to the new dataframe to look up the reference number to apply for that column's operation. See my super-tedious code below.
For reference, each column is an assignment, and Row 1 had the number of points possible for that assignment. All subsequent rows were individual students and their grades.
# Extract number of points possible
outof <- slice(grades, 1)
# Now remove that row (Row 1)
grades <- grades[-c(1),]
# Turn number correct into percentage. The divided by
# number is from the sliced Row 1, which I had to
# look up and type one-by-one. I'm hoping there is
# code to do this automatically in R.
grades$ExamFinal < (grades$ExamFinal / 34) * 100
grades$Exam3 <- (grades$Exam3 / 26) * 100
grades$Exam4 <- (grades$Exam4 / 31) * 100
grades$q1.1 <- grades$q1.1 / 6
grades$q1.2 <- grades$q1.2 / 10
grades$q1.3 < grades$q1.3 / 6
grades$q2.2 <- grades$q2.2 / 3
grades$q2.4 <- grades$q2.4 / 12
grades$q3.1 <- grades$q3.1 / 9
grades$q3.2 <- grades$q3.2 / 8
grades$q3.3 <- grades$q3.3 / 12
grades$q4.1 <- grades$q4.1 / 13
grades$q4.2 <- grades$q4.2 / 5
grades$q6.1 <- grades$q6.1 / 5
grades$q6.2 <- grades$q6.2 / 6
grades$q6.3 <- grades$q6.3 / 11
grades$q7.1 <- grades$q7.1 / 7
grades$q7.2 <- grades$q7.2 / 8
grades$q8.1 <- grades$q8.1 / 7
grades$q8.3 <- grades$q8.3 / 13
grades$q9.2 <- grades$q9.2 / 13
grades$q10.1 <- grades$q10.1 / 8
grades$q12.1 <- grades$q12.1 / 12
You can use sweep
100*sweep(grades, 2, outof, "/")
# ExamFinal EXam3 EXam4
#1 100.00 76.92 32.26
#2 88.24 84.62 64.52
#3 29.41 100.00 96.77
Data:
grades
ExamFinal EXam3 EXam4
1 34 20 10
2 30 22 20
3 10 26 30
outof
[1] 34 26 31
grades <- data.frame(ExamFinal=c(34,30,10),
EXam3=c(20,22,26),
EXam4=c(10,20,30))
outof <- c(34,26,31)
You can use mapply on the original grades dataframe (don't remove the first row) to divide rows by the first row. Then convert the result back to a dataframe.
as.data.frame(mapply("/", grades[2:31, ], grades[1, ]))
The easiest way is to use some type of loop. In this case I am using the sapply function. To all of the elements in each column by the corresponding total score.
#Example data
outof<-data.frame(q1=c(3), q2=c(5))
grades<-data.frame(q1=c(1,2,3), q2=c(4,4, 5))
answermatrix <-sapply(1:ncol(grades), function(i) {
#grades[,i]/outof[i] #use this if "outof" is a vector
grades[,i]/outof[ ,i]
})
answermatrix
A loop would probably be your best bet.
The first part you would want to extract the most amount of points possible, as is listed in the first row, then use that number to calculate the percentage in the remaining rows per column:
`
j = 2 #sets the first row to 2 for later
for (i in 1:ncol(df) {
a <- df[1,] #this pulls the total points into a
#then we compute using that number
while(j <= nrow(df)-1){ #subtract the number of rows from removing the first
#row
b <- df[j,i] #gets the number per row per column that corresponds with each
#student
df[j,i] <- ((a/b)*100) #replaces that row,column with that percentage
j <- j+1 #goes to next row
}
}
`
The only drawback to this approach is data-frames produced in functions aren't copied to the global environment, but that can be fixed by introducing a function like so:
f1 <- function(x = <name of df> ,y= <name you want the completed df to be
called>) {
j = 2
for (i in 1:ncol(x) {
a <- x[1,]
while(j <= nrow(x)-1){
b <- df[j,i]
x[j,i] <- ((a/b)*100)
i <- i+1
}
}
arg_name <- deparse(substitute(y)) #gets argument name
var_name <- paste(arg_name) #construct the name
assign(var_name, x, env=.GlobalEnv) #produces global dataframe
}

R function to subset dataframe so that non-adjacent values in a column differ by >= X (starting with the first value)

I am looking for a function that iterates through the rows of a given column ("pos" for position, ascending) in a dataframe, and only keeps those rows whose values are at least let's say 10 different, starting with the first row.Thus it would start with the first row (and store it), and then carry on until it finds a row with a value at least 10 higher than the first, store this row, then start from this value again looking for the next >10diff one.
So far I have an R for loop that successfully finds adjacent rows at least X values apart, but it does not have the capability of looking any further than one row down, nor of stopping once it has found the given row and starting again from there.
Here is the function I have:
# example data frame
df <- data.frame(x=c(1:1000), pos=sort(sample(1:10000, 1000)))
# prep function (this only checks row above)
library(dplyr)
pos.apart.subset <- function(df, pos.diff) {
# create new dfs to store output
new.df <- list()
new.df1 <- data.frame()
# iterate through each row of df
for (i in 1:nrow(df)) {
# if the value of next row is higher or equal than value or row i+posdiff, keep
# if not ascending, keep
# if first row, keep
if(isTRUE(df$pos[i+1] >= df$pos[i]+pos.diff | df$pos[i+1] < df$pos[i] | i==1 )) {
# add rows that meet conditions to list
new.df[[i]] <- df[i,] }
}
# bind all rows that met conditions
new.df1 <- bind_rows(new.df)
return(new.df1)}
# test run for pos column adjacent values to be at least 10 apart
df1 <- pos.apart.subset(df, 10); head(df1)
Happy to do this in awk or any other language. Many thanks.
It seems I misunderstood the question earlier since we don't want to calculate the difference between consecutive rows, you can try :
nrows <- 1
previous_match <- 1
for(i in 2:nrow(df)) {
if(df$pos[i] - df$pos[previous_match] > 10) {
nrows <- c(nrows, i)
previous_match <- i
}
}
and then subset the selected rows :
df[nrows, ]
Earlier answer
We can use diff to get the difference between consecutive rows and select the row which has difference of greater than 10.
head(subset(df, c(TRUE, diff(pos) > 10)))
# x pos
#1 1 1
#2 2 31
#6 6 71
#9 9 134
#10 10 151
#13 13 185
The first TRUE is to by default select the first row.
In dplyr, we can use lag to get value from previous row :
library(dplyr)
df %>% filter(pos - lag(pos, default = -Inf) > 10)

R - how to index rank and accordingly display a data frame?

I have a data frame that lists down some names of individuals and their monetary transactions carried out in USD. The table lists down data according to several districts and the valid transactions made by either cash or credit cards, like so:
X Dist transact.cash transact.card
a 1 USD USD
b 1 USD USD
Where X is an individual and his/her transactions for a period of time keeping that period fixed and Dist is the district where he/she resides. There are over 4000 observations in total for an approx. 80-100 rows per Dist. So far, the sorting, slicing and everything else have been simple operations with dat.cash and dat.card being subsetted tables according to mode of transaction; but I'm having problems when extracting information in reference to ranking the dataset. For this, I have written a function where I specify a rank and the function should show those rows starting from that rank:
rankdat <- function(transact, numb) {
# Truncated
valid.nums = c('highest', 'lowest', 1:nrow(dat.cash)) # for cash subset
if (transact == 'cash' && numb == 'highest') { # This is easy
sort <- dat.cash[order(dat.cash[, 3], decreasing = T), ]# For sorting only cash data set
} else if (transact == 'cash' and numb == 1:nrow(dat.cash)) {
sort <- dat.cash[order(dat.cash[, 3], decreasing = T) == numb, ] } # Not getting results here
}
The last line is returning NULL instead of a ranked transaction and all its rows. Replacing == with %in% still gives NULL and using rank() doesn't change anything. For highest and lowest numbers, its not a great deal since it only involves simple sorting. If I specify rankdat('cash', 10), the function should return values starting from the 10th highest transaction and decreasing irrespective of Dist, similar to:
X Dist transact.cash
b 1 10th highest
h 2 11th highest
p 1 12th highest
and so on
This function is able to do that:
rankdat <- function(df,rank.by,num=10,method="top",decreasing=T){
# ------------------------------------------------------
# RANKDAT
# ------------------------------------------------------
# ARGUMENT
# ========
# df Input dataFrame [d.f]
# num Selected row [num]
# rank.by Name of column(s) used to rank dataFrame
# method Method used to extract rows
# top - to select top rank (e.g. 10 first rows)
# specific - to select specific row
# ------------------------------------------------------
eval(parse(text=paste("sort=df[with(df,order(",rank.by,"), decreasing=",decreasing,"),]",sep=""))) # order dataFrame by
if(method %in% "top"){
return(sort[1:num,])
}else if(method %in% "specific"){
return(sort[num,])
}else{
stop("Please select method used to extract data !!!")
}
}
Suppose that you have the following data.frame:
df=data.frame(X=c(rep('A',2),rep('B',3),rep('A',3),rep('B',2)),
Dist=c(rep(1,5),rep(0,5)),
transact.cash=c(rep('USD',5),rep('€',5)),
transact.card=c(rep('USD',5),rep('€',5)))
We obtain:
X Dist transact.cash transact.card
1 A 1 USD USD
2 A 1 USD USD
3 B 1 USD USD
4 B 1 USD USD
5 B 1 USD USD
6 A 0 € €
7 A 0 € €
8 A 0 € €
9 B 0 € €
10 B 0 € €
If you would like to sort a dataframe with multiple columns transact.cash or transact.cash you can used stackoverflow : How to sort a dataframe by column(s). In your example, you only specified dat.cash, thus :
sort = df[order(df$transact.cash, decreasing=T),] # Order your dataFrame with transact.cash column
If you want to extract rows which respect a specific statement, you need to use which() and == for numeric, double, logical match or %in% for string match. For example:
XA = df[which(df$X %in% "A"),] # Select row by user
XDist = df[which(df$Dist == 1),] # Select row by District
Finally, if you would like to select the first five row after ordering:
sort[1:5,] # Select first five rows
sort[1:numb,] # Select first numb rows
With that you can perform a simple function to easily extract data from your dataframe.
Hope it will help you

Custom function within subset of data, base functions, vector output?

Apologises for a semi 'double post'. I feel I should be able to crack this but I'm going round in circles. This is on a similar note to my previously well answered question:
Within ID, check for matches/differences
test <- data.frame(
ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05",
"2004-03-05","2004-06-05","2004-09-05","2005-01-05",
"2006-10-03","2007-02-05")
)
What I want to do is tag the subject whose first vist (as at DOV) was less than 180 days from their diagnosis (DOD). I have the following from the plyr package.
ddply(test, "ID", function(x) ifelse( (as.numeric(x$DOV[1]) - as.numeric(x$DOD[1])) < 180,1,0))
Which gives:
ID V1
1 A 1
2 B 0
3 C 1
What I would like is a vector 1,1,1,0,0,0,0,1,1 so I can append it as a column to the data frame. Basically this ddply function is fine, it makes a 'lookup' table where I can see which IDs have a their first visit within 180 days of their diagnosis, which I could then take my original test and go through and make an indicator variable, but I should be able to do this is one step I'd have thought.
I'd also like to use base if possible. I had a method with 'by', but again it only gave one result per ID and was also a list. Have been trying with aggregate but getting things like 'by has to be a list', then 'it's not the same length' and using the formula method of input I'm stumped 'cbind(DOV,DOD) ~ ID'...
Appreciate the input, keen to learn!
After wrapping as.Date around the creation of those date columns, this returns the desired marking vector assuming the df named 'test' is sorted by ID (and done in base):
# could put an ordering operation here if needed
0 + unlist( # to make vector from list and coerce logical to integer
lapply(split(test, test$ID), # to apply fn with ID
function(x) rep( # to extend a listwise value across all ID's
min(x$DOV-x$DOD) <180, # compare the minimum of a set of intervals
NROW(x)) ) )
11 12 13 21 22 23 24 31 32 # the labels
1 1 1 0 0 0 0 1 1 # the values
I have added to data.frame function stringsAsFactors=FALSE:
test <- data.frame(ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05","2004-03-05",
"2004-06-05","2004-09-05","2005-01-05","2006-10-03","2007-02-05")
, stringsAsFactors=FALSE)
CODE
test$V1 <- ifelse(c(FALSE, diff(test$ID) == 0), 0,
1*(as.numeric(as.Date(test$DOV)-as.Date(test$DOD))<180))
test$V1 <- ave(test$V1,test$ID,FUN=max)

Resources