R - Compare values in two columns in different rows - r

I have a dataframe df, shown below, with two features: a departure city and an arrival city. Each pair of rows stores an outbound flight and its return flight.
Departure Arrival
1 A B
2 B A
3 F G
4 G F
5 U V
6 V U
7 K L
8 K L
There is some inconsistency in the data where the same flight is repeated, as can be seen in the last two rows.
How can I compare, for every pair of rows, the departure city of the first row with the arrival city of the second row, and keep only the pairs where they are equal?
The dataset is very big and of course a for-loop is not considered an option.
Thank you in advance.

Here is a method that compares the pairs of rows using head and tail to line them up.
# find Departures that match the Arrival in the next row
sames <- which(head(dat$Departure, -1) == tail(dat$Arrival, -1))
# keep pairs of rows that match, maintaining order with `sort`
dat[sort(unique(c(sames, (sames + 1)))),]
Departure Arrival
1 A B
2 B A
3 F G
4 G F
5 U V
6 V U
Note that the two variables have to be character vectors, not factor variables. You can coerce them to character using as.character if necessary.
data
dat <-
structure(list(Departure = c("A", "B", "F", "G", "U", "V", "K",
"K"), Arrival = c("B", "A", "G", "F", "V", "U", "L", "L")), .Names = c("Departure",
"Arrival"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8"))
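To make the note about factors concrete, here is a small sketch. It assumes the columns arrived as factors (the data.frame() default before R 4.0.0); dat_f is a made-up name for this example. Comparing two factors whose level sets differ fails with an error, which is why the coercion comes first.

```r
# Assumption: the columns are factors, as older R versions created by default
dat_f <- data.frame(
  Departure = c("A", "B", "F", "G", "U", "V", "K", "K"),
  Arrival   = c("B", "A", "G", "F", "V", "U", "L", "L"),
  stringsAsFactors = TRUE
)

# Coerce both columns to character before comparing them
dat_f$Departure <- as.character(dat_f$Departure)
dat_f$Arrival   <- as.character(dat_f$Arrival)

# Same head/tail comparison as in the answer above
sames <- which(head(dat_f$Departure, -1) == tail(dat_f$Arrival, -1))
kept  <- dat_f[sort(unique(c(sames, sames + 1))), ]
```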

So you just want unique flight paths? There are a number of ways to do this. I'd think the fastest would be with data.table, something like:
library(data.table)
df <- as.data.table(df)
uniqueDf <- unique(df)
You can also use the duplicated function, something like
df <- df[!duplicated(df), ]
should do nicely.
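One thing worth checking before using this: dropping duplicates keeps the first copy of a repeated row, which is not the same as discarding the whole inconsistent pair. A minimal sketch on a shortened version of the question's data (the four-row df here is only for illustration):

```r
df <- data.frame(
  Departure = c("A", "B", "K", "K"),
  Arrival   = c("B", "A", "L", "L"),
  stringsAsFactors = FALSE
)

# duplicated() flags only the second K -> L row, so one copy survives
deduped <- df[!duplicated(df), ]
```

Here deduped still contains one K to L row, so this fits if unique routes are wanted but not if both rows of a bad pair should go.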

You could also do it this way:
right = rep(df[c(T,F),"Arrival"]==df[c(F,T),"Departure"],each=2)
df[right,]
This returns:
Departure Arrival
1 A B
2 B A
3 F G
4 G F
5 U V
6 V U
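A slightly stricter variant of the same idea, as a sketch: it assumes the rows come in strict pairs (an even number of rows, outbound first) and checks both directions of each pair. The names odd, even and is_round_trip are made up for this example.

```r
df <- data.frame(
  Departure = c("A", "B", "F", "G", "U", "V", "K", "K"),
  Arrival   = c("B", "A", "G", "F", "V", "U", "L", "L"),
  stringsAsFactors = FALSE
)

odd  <- seq(1, nrow(df), by = 2)  # first row of each pair
even <- odd + 1                   # second row of each pair

# A pair is kept only if the second row reverses the first
is_round_trip <- df$Departure[odd] == df$Arrival[even] &
  df$Arrival[odd] == df$Departure[even]

# Repeat each flag twice so it covers both rows of its pair
result <- df[rep(is_round_trip, each = 2), ]
```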

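This answer doesn't look for unique records. It specifically checks whether a row is a duplicate of the row before it.
Adding a new column with a 1 if the row repeats the previous one (the column has to exist before the loop fills it in):
df$test <- 0
for(i in 2:nrow(df)){df$test[i] <- ifelse(df$Departure[i] == df$Departure[i-1] & df$Arrival[i] == df$Arrival[i-1], 1, 0)}
Loops can be slow, though, so here is a vectorized version using shift from data.table (the first row gets NA because it has no previous row):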
library(data.table)
df$test2 = ifelse(df$Departure == shift(df$Departure) & df$Arrival == shift(df$Arrival), 1,0)
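If you would rather not load a package for this, the same previous-row check can be sketched in base R with head and tail. The column name test3 is made up here, and the first row is set to 0 by assumption since it has nothing to compare against.

```r
df <- data.frame(
  Departure = c("A", "B", "F", "G", "U", "V", "K", "K"),
  Arrival   = c("B", "A", "G", "F", "V", "U", "L", "L"),
  stringsAsFactors = FALSE
)

# head(x, -1) is rows 1..n-1, tail(x, -1) is rows 2..n,
# so element i compares row i with row i + 1
same_as_prev <- c(
  FALSE,
  head(df$Departure, -1) == tail(df$Departure, -1) &
    head(df$Arrival, -1) == tail(df$Arrival, -1)
)
df$test3 <- as.integer(same_as_prev)
```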

Try the following solution, if it works for you:
df[!duplicated(paste(df$Departure, df$Arrival, sep = "_")), ]
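A caveat on building keys by pasting columns together: without a separator, two different rows can produce the same string. The sketch below uses made-up city codes to show the collision, and shows that duplicated() can be given the columns directly instead.

```r
# Hypothetical city codes chosen so that the pasted keys collide
routes <- data.frame(
  Departure = c("AB", "A"),
  Arrival   = c("C", "BC"),
  stringsAsFactors = FALSE
)

pasted_keys <- paste0(routes$Departure, routes$Arrival)  # both are "ABC"
dup_pasted  <- duplicated(pasted_keys)                   # second row wrongly flagged

# duplicated() on the data frame compares the columns separately
dup_cols <- duplicated(routes[c("Departure", "Arrival")])
```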

Related

add a column based on values in three other columns

I have a data frame ('ju') that has three columns and 230 rows. The first two columns represent a pair of objects. The third column includes one of those objects. I'd like to add the fourth column which will contain the second object from that pair, as shown below.
I wrote some code to identify the value for the fourth column (loser), but it does not give me any output when I run it.
for (i in 1:230) {
if (ju$winner[i]==ju$letter2[i]) {
paste(ju$letter1[i])
} else {
paste (ju$letter2[i])
}
}
I can not see what is wrong with the code. Also I would appreciate if you can suggest how I could create this fourth column directly into my data frame, instead of creating a separate vector and then adding it to the data frame. Thanks
This will do it without a for loop:
ju$loser <- ifelse(ju$winner == ju$letter1, ju$letter2, ju$letter1)
Gives:
> ju
letter1 letter2 winner loser
1 a c a c
2 c b b c
3 t j j t
4 r k k r
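One thing to watch for with ifelse: if the columns are factors, it returns the underlying integer codes rather than the letters. A sketch that guards against this by coercing first (the data is rebuilt here from the output shown above, and the factor columns are an assumption about how ju was created):

```r
# Assumption: ju was built with factor columns
ju <- data.frame(
  letter1 = c("a", "c", "t", "r"),
  letter2 = c("c", "b", "j", "k"),
  winner  = c("a", "b", "j", "k"),
  stringsAsFactors = TRUE
)

# Work on character copies so ifelse() returns letters, not factor codes
l1 <- as.character(ju$letter1)
l2 <- as.character(ju$letter2)
w  <- as.character(ju$winner)

# Row-by-row comparison: the loser is whichever letter did not win
ju$loser <- ifelse(w == l1, l2, l1)
```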
If you want to print to console, you'll need to add:
cat(ju$letter1[i])
or
print(ju$letter1[i])
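The reason the original loop shows nothing is that R only auto-prints values at the top level, not inside a for loop. A tiny sketch of the difference, using capture.output to collect whatever would have been printed (the variable names are made up):

```r
# paste() inside a loop builds the string but nothing is printed
silent <- capture.output(for (i in 1:2) paste("value", i))

# An explicit print() makes the value appear
shown <- capture.output(for (i in 1:2) print(paste("value", i)))
```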
Regarding the new column question, here is a possible solution (a for loop is sub-optimal here; see the suggestion from @lab_rat_kid):
ju$NewColumn = NA
for (i in 1:230) {
if (ju$winner[i]==ju$letter2[i]) {
ju$NewColumn[i] <- ju$letter1[i]
} else {
ju$NewColumn[i] <- ju$letter2[i]
}
}
With tidyverse:
library(tidyverse)
dt <- tibble(l1 = c("a", "c", "t", "r"),
l2 = c("c", "b", "j", "k"),
winner = c("a", "b", "j", "k"))
dt <- dt %>%
mutate(loser = if_else(winner == l1, l2, l1))
(dt)

Combine two dataframes conditionally with dates

I am using two different data frames. I would like to complete one using the information that is contained in the other. The first data frame contains a list of observations of individual young animals whose birthdate and natal territory are known. The second data frame contains observations of adult animals that were present in given territories within given time intervals.
Here is a reproducible example:
#First dataframe:
ID_young <- c(rep(c("a", "b", "c"), each=3), "d") # All individuals observed three times except "d", observed once
Territory_young <- c(rep(c("x", "y", "z"), each=3), "x") # All individuals are from different territories, except "a" and "d" who are from the same territory, namely "x".
Birthdate <- c(rep(c("2014-01-29", "2014-12-17", "2013-11-19"), each=3), "2012-12-04")
Birthdate <- as.Date(Birthdate)
# Second dataframe:
ID_adult <- c("e", "f", "g", "h", "i", "j", "e","f")
Territory_adult <- c("x", "x", "y", "z", "z", "z", "z", "w")
First_date <- as.Date(c("2014-01-01", "2014-01-15", "2013-12-14", "2013-05-17", "2013-05-09", "2012-09-01", "2013-06-18", "2011-04-17"))
Last_date <- as.Date(c("2014-02-28", "2014-04-17", "2014-11-02", "2014-01-13", "2015-01-03", "2013-04-17", "2013-12-25", "2014-11-11"))
# Data frames complete:
df1 <- data.frame(ID_young, Territory_young, Birthdate)
df2 <- data.frame(ID_adult, Territory_adult, First_date, Last_date)
My goal is to create a new column in df1 that consists of the number of adult animals present in the young animal's territory at the time of its birth.
In other words,
For each line of df1:
find the corresponding territory in df2
count the number of lines in df2 where the interval between df2$First_date and df2$Last_date includes df1$Birthdate
fill in that number in the new column of df1
For example, for the first three lines of df1 (corresponding to the young animal "a"), that count would be 2, because adults "e" and "f" were present in territory "x" when young "a" was born (2014-01-29).
Could someone help me formulate the right combination of conditional statements that would allow me to do that? I am trying for and if statements at the moment but have nothing worth showing.
Thanks!
nb.adults = apply(df1, 1, function(row, df2) {
terr = as.character(row[2])
bd = row[3]
nb.adults = length(which(df2$First_date < bd & bd < df2$Last_date &
df2$Territory_adult==terr))
return(nb.adults)
}, df2)
df1 = cbind(df1, nb.adults)
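A variant of the same row-by-row count, as a sketch: it loops over row indices instead of using apply, so the Birthdate column stays a Date, and it treats the interval bounds as inclusive (an assumption; the code above uses strict inequalities). The column name nb_adults is made up.

```r
df1 <- data.frame(
  ID_young = c(rep(c("a", "b", "c"), each = 3), "d"),
  Territory_young = c(rep(c("x", "y", "z"), each = 3), "x"),
  Birthdate = as.Date(c(rep(c("2014-01-29", "2014-12-17", "2013-11-19"), each = 3),
                        "2012-12-04")),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID_adult = c("e", "f", "g", "h", "i", "j", "e", "f"),
  Territory_adult = c("x", "x", "y", "z", "z", "z", "z", "w"),
  First_date = as.Date(c("2014-01-01", "2014-01-15", "2013-12-14", "2013-05-17",
                         "2013-05-09", "2012-09-01", "2013-06-18", "2011-04-17")),
  Last_date = as.Date(c("2014-02-28", "2014-04-17", "2014-11-02", "2014-01-13",
                        "2015-01-03", "2013-04-17", "2013-12-25", "2014-11-11")),
  stringsAsFactors = FALSE
)

# For each young animal, count adults in the same territory whose
# presence interval contains the birthdate (bounds inclusive)
df1$nb_adults <- vapply(seq_len(nrow(df1)), function(i) {
  sum(df2$Territory_adult == df1$Territory_young[i] &
        df2$First_date <= df1$Birthdate[i] &
        df2$Last_date >= df1$Birthdate[i])
}, integer(1))
```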
The recent versions of data.table support non-equi joins which can be used for this purpose:
library(data.table) # CRAN version 1.10.4 used
# coerce to data.table
DT1 <- data.table(df1)
DT2 <- data.table(df2)
# right non-equi join to find any adults present in territory during birth
DT2[unique(DT1),
on = c("Territory_adult==Territory_young",
"First_date<=Birthdate",
"Last_date>=Birthdate")][
# count adults for each young
, .(Count_adult = sum(!is.na(ID_adult))), by = ID_young][
# join counts into each matching row of first data.table
DT1, on = "ID_young"]
ID_young Count_adult Territory_young Birthdate
1: a 2 x 2014-01-29
2: a 2 x 2014-01-29
3: a 2 x 2014-01-29
4: b 0 y 2014-12-17
5: b 0 y 2014-12-17
6: b 0 y 2014-12-17
7: c 3 z 2013-11-19
8: c 3 z 2013-11-19
9: c 3 z 2013-11-19
10: d 0 x 2012-12-04
Note that df1 (and therefore DT1) contains duplicate rows. This is why unique() is needed in the non-equi join with the adults, and why a final join is used to make sure that the adult count appears on each row.

How to combine columns in a data frame so that they overlap in R?

Basically, I have data from a between subjects study design that looks like this:
> head(have, 3)
  A B C
1   b
2 a
3     c
Here, A, B, C are various conditions of the study, so each subject (indicated by each row), only has a value for one of the conditions. I would like to combine these so that it looks like this:
> head(want, 3)
all
1 b
2 a
3 c
How can I combine the columns so that they "overlap" like this?
So far, I have tried using some of dplyr's join functions, but they haven't worked out for me. I appreciate any guidance in combining my columns in this way.
We can use pmax
want <- data.frame(all= do.call(pmax, have))
Or using dplyr
transmute(have, all= pmax(A, B, C))
# all
#1 b
#2 a
#3 c
data
have <- structure(list(A = c("", "a", ""), B = c("b", "", ""),
C = c("",
"", "c")), .Names = c("A", "B", "C"), class = "data.frame",
row.names = c("1", "2", "3"))
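Since each row has exactly one non-empty value, simply concatenating the columns gives the same result. A sketch under that assumption (it would glue values together if a row had more than one entry):

```r
have <- data.frame(
  A = c("", "a", ""),
  B = c("b", "", ""),
  C = c("", "", "c"),
  stringsAsFactors = FALSE
)

# Paste the columns row by row; empty strings contribute nothing
want <- data.frame(
  all = do.call(paste0, unname(as.list(have))),
  stringsAsFactors = FALSE
)
```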

Computing correlation of vectors by factor label

I have two data frames. The first one, df1, is a matrix of vectors with labeled columns, like the following:
df1 <- data.frame(A=rnorm(10), B=rnorm(10), C=rnorm(10), D=rnorm(10), E=rnorm(10))
> df1
A B C D E
-0.3200306 0.4370963 -0.9146660 1.03219577 0.5215359
-0.3193144 0.8900656 -1.1720264 -0.42591761 0.1936993
0.4897262 -1.3970806 0.6054637 0.12487936 1.0149530
0.3772420 0.8726322 0.3250020 -0.36952560 -0.5447512
-0.6921561 -0.6734468 0.3500812 -0.53373720 -0.6129472
0.2540649 -1.1911106 -0.3266428 0.14013437 1.0830148
0.6606825 -0.8942715 1.1099637 -1.52416540 -0.2383048
1.4767074 -2.1492360 0.2441242 -0.36136344 0.5589114
-0.5338117 -0.2357821 0.7694879 -0.21652356 0.3185631
3.4215916 -0.3157938 0.8895597 0.09946069 -1.0961730
The second data frame, df2, contains items that match the colnames of df1. Example:
group <- c("1", "1", "2", "2", "3", "3")
S1 <- c("A", "D", "E", "C", "B", "D")
S2 <- c("D", "B", "A", "C", "B", "A")
S3 <- c("B", "C", "A", "E", "E", "A")
df2 <- data.frame(group,S1, S2, S3)
> df2
group S1 S2 S3
1 A D B
1 D B C
2 E A A
2 C C E
3 B B E
3 D A A
I would like to compute the correlations between the column vectors in df1 that correspond to the labeled items in df2. Specifically, the vectors that match cor(df2$S1, df2$S2) and cor(df2$S1, df2$S3).
The output should be something like this:
group S1 S2 S3 cor.S1.S2 cor.S1.S3
1 A D B 0.003825055 -0.2817946
1 D B C -0.2817946 -0.4928023
2 E A A -0.3856809 -0.3856809
2 C C E 1 -0.3862433
3 B B E 1 -0.3888541
3 D A A 0.003825055 0.003825055
I've been trying to resolve this with cbind[] but keep running into problems such as the 'x' must be numeric error with cor. Thanks in advance for any help!
You can do this with mapply().
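my.cor <- function(x,y) {
cor(df1[,x],df1[,y])
}
# as.character() so that factor columns index df1 by name, not by level code
df2$cor.S1.S2 <- mapply(my.cor,as.character(df2$S1),as.character(df2$S2))
df2$cor.S1.S3 <- mapply(my.cor,as.character(df2$S1),as.character(df2$S3))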
Another approach would be to get the correlation between the matrices/data.frames obtained by subsetting the columns of 'df1' with the columns of 'df2', take the diag, and assign the output as new columns in 'df2'. Here, I am using lapply as we have to do both 'S1 vs S2' and 'S1 vs S3'.
df2[c('cor.S1.S2', 'cor.S1.S3')] <- lapply(c('S2', 'S3'),
function(x) diag(cor(df1[, df2[,x]], df1[,df2$S1])))
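For anyone who wants to try this end to end, here is a self-contained sketch. The seed is arbitrary, cor_by_label is a made-up helper name, and the label columns are kept as character so they index df1 by column name.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
df1 <- data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10),
                  D = rnorm(10), E = rnorm(10))
df2 <- data.frame(
  group = c("1", "1", "2", "2", "3", "3"),
  S1 = c("A", "D", "E", "C", "B", "D"),
  S2 = c("D", "B", "A", "C", "B", "A"),
  S3 = c("B", "C", "A", "E", "E", "A"),
  stringsAsFactors = FALSE
)

# Correlate the df1 columns named by each pair of labels, row by row
cor_by_label <- function(x, y) {
  mapply(function(a, b) cor(df1[[a]], df1[[b]]), x, y, USE.NAMES = FALSE)
}

df2$cor.S1.S2 <- cor_by_label(df2$S1, df2$S2)
df2$cor.S1.S3 <- cor_by_label(df2$S1, df2$S3)
```

Rows where both labels are the same (for example C and C) come out as 1, which is a quick sanity check.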

subsetting in r based on a vector of conditions

This is a restatement of my poorly worded previous question. (To those who replied to it, I appreciate your efforts, and I apologize for not being as clear with my question as I should have been.) I have a large dataset, a subset of which might look like this:
a<-c(1,2,3,4,5,1)
b<-c("a","b","a","b","c","a")
c<-c("m","f","f","m","m","f")
d<-1:6
e<-data.frame(a,b,c,d)
If I want the sum of the entries in the fourth column based on a specific condition, I could do something like this:
attach(e)
total<-sum(e[which(a==3 & b=="a"),4])
detach(e)
However, I have a "vector" of conditions (call it condition_vector), the first four elements of which look more like this:
a==3 & b == "a"
a==2
a==1 & b=="a" & c=="m"
c=="f"
I'd like to create a "generalized" version of the "total" formula above that produces a results_vector of totals by reading in the condition_vector of conditions. In this example, the first four entries in the results_vector would be calculated conceptually as follows:
results_vector[1]<-sum(e[which(a==3 & b=="a"),4])
results_vector[2]<-sum(e[which(a==2),4])
results_vector[3]<-sum(e[which(a==1 & b=="a" & c=="m"),4])
results_vector[4]<-sum(e[which(c=="f"),4])
My actual data set has more than 20 variables. So each record in the condition_vector can contain anywhere from 1 to more than 20 conditions (as opposed to between 1 and 3 conditions, used in this example).
Is there a way to accomplish this other than using an eval(parse(text= ... approach (which takes a long time to run on a relatively small dataset)?
Thanks in advance for any help you can provide (and again, I apologize that I wasn't as clear as I should have been last time around).
Spark
Here is a solution using eval(parse(text = ...)), even though you find it slow:
cond <- c('a==3 & b == "a"','a==2','a==1 & b=="a" & c=="m"','c=="f"')
names(cond) <- cond
# evaluate each condition inside the data frame e and sum column d
results_vector <- lapply(cond,function(x)
sum(e[eval(parse(text=x), e),"d"]))
$`a==3 & b == "a"`
[1] 3
$`a==2`
[1] 2
$`a==1 & b=="a" & c=="m"`
[1] 1
$`c=="f"`
[1] 11
The advantage of naming your conditions vector is that you can access your results by condition.
results_vector[cond[2]]
$`a==2`
[1] 2
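If the conditions can be written in code rather than read in as strings, they can be stored as unevaluated expressions, which avoids parsing text on every call. This is only a sketch under that assumption, and I have not timed it against the string version.

```r
e <- data.frame(
  a = c(1, 2, 3, 4, 5, 1),
  b = c("a", "b", "a", "b", "c", "a"),
  c = c("m", "f", "f", "m", "m", "f"),
  d = 1:6,
  stringsAsFactors = FALSE
)

# Conditions stored as unevaluated calls instead of strings
conds <- expression(
  a == 3 & b == "a",
  a == 2,
  a == 1 & b == "a" & c == "m",
  c == "f"
)

# Evaluate each condition inside the data frame and sum column d
results_vector <- vapply(seq_along(conds), function(i) {
  sum(e$d[eval(conds[[i]], e)])
}, numeric(1))
```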
Here is a function that takes as arguments the condition in each column (if no condition in a column, then NA as argument) and sums in a selected column of a selected data.frame:
conds.by.col <- function(..., sumcol, DF) #NA if not condition in a column
{
conds.ls <- list(...)
res.ls <- vector("list", length(conds.ls))
for(i in 1: length(conds.ls))
{
res.ls[[i]] <- which(DF[,i] == conds.ls[[i]])
}
res.ls <- res.ls[which(lapply(res.ls, length) != 0)]
which_rows <- Reduce(intersect, res.ls)
return(sum(DF[which_rows , sumcol]))
}
Test:
a <- c(1,2,3,4,5,1)
b <- c("a", "b", "a", "b", "c", "a")
c <- c("m", "f", "f", "m", "m", "f")
d <- 1:6
e <- data.frame(a, b, c, d)
conds.by.col(3, "a", "f", sumcol = 4, DF = e)
#[1] 3
For multiple conditions, mapply:
#all conditions in a data.frame:
myconds <- data.frame(con1 = c(3, "a", "f"),
con2 = c(NA, "a", NA),
con3 = c(1, NA, "f"),
stringsAsFactors = F)
mapply(conds.by.col, myconds[1,], myconds[2,], myconds[3,], MoreArgs = list(sumcol = 4, DF = e))
#con1 con2 con3
# 3 10 6
I guess "efficiency" isn't the first word that comes to mind when looking at this, though...
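A variant of the per-column idea, as a sketch: it builds one logical mask and combines conditions with &, so a condition that matches no rows gives a total of 0 instead of being skipped. sum_where and the named-list format for conds are made up for this example; NA means "no condition on that column".

```r
e <- data.frame(
  a = c(1, 2, 3, 4, 5, 1),
  b = c("a", "b", "a", "b", "c", "a"),
  c = c("m", "f", "f", "m", "m", "f"),
  d = 1:6,
  stringsAsFactors = FALSE
)

# conds: named list, one entry per constrained column (NA = no condition)
sum_where <- function(DF, conds, sumcol) {
  keep <- rep(TRUE, nrow(DF))
  for (nm in names(conds)) {
    if (!is.na(conds[[nm]])) {
      keep <- keep & DF[[nm]] == conds[[nm]]
    }
  }
  sum(DF[[sumcol]][keep])
}

r1 <- sum_where(e, list(a = 3, b = "a", c = "f"), "d")
r2 <- sum_where(e, list(a = NA, b = "a", c = NA), "d")
r3 <- sum_where(e, list(a = 1, b = NA, c = "f"), "d")
r4 <- sum_where(e, list(a = 9, b = "a"), "d")  # no row has a == 9
```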
