I have three sets of identifiers: "x", "y", and "z". I also have two 2-column data frames, each of which maps one set of identifiers to another.
x2y = data.frame( x = c("A","A","B","B","C","D","E","F"),
y = c(1,2,1,2,3,4,4,5) )
y2z = data.frame( y = c(1,1,2,3,4,4,5,5,5),
z = c(1,2,3,3,6,7,6,7,8) )
This can be visualized in the figure below. Note that each arrow corresponds to one row in a data frame.
Question:
How do I use these two mappings (two data frames) to make a mapping
from x to z (displayed on the right of the figure above)? I
think of it as a "transitive mapping": x to y and y to z gives x to z.
The data frame that I would like is...
x2z = data.frame( x = c("A","A","A","B","B","B","C","D","D","E","E","F","F","F"),
z = c(1,2,3,1,2,3,3,6,7,6,7,6,7,8) )
Notes: My data frames are usually ~50,000 rows, so efficient code is very important. When I've solved this problem with loops, it took several minutes to run.
My only requirement is that the code be in R.
You want to merge:
merge(x2y, y2z)[c('x','z')]
## x z
## 1 A 1
## 2 A 2
## 3 B 1
## 4 B 2
## 5 A 3
## 6 B 3
## 7 C 3
## 8 D 6
## 9 D 7
## 10 E 6
## 11 E 7
## 12 F 6
## 13 F 7
## 14 F 8
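This works without a by argument because merge joins on the column names the two data frames have in common, and here the shared column is named y in both.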
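As a sketch of how this could be made a little more defensive, the version below names the join column explicitly, drops duplicate (x, z) pairs, and sorts the result so it can be compared with the desired x2z. The unique() call is my assumption about what you want: the example data has no duplicates, but two different y values leading to the same z would otherwise produce repeated rows.

```r
x2y <- data.frame(x = c("A","A","B","B","C","D","E","F"),
                  y = c(1,2,1,2,3,4,4,5),
                  stringsAsFactors = FALSE)
y2z <- data.frame(y = c(1,1,2,3,4,4,5,5,5),
                  z = c(1,2,3,3,6,7,6,7,8))

# Join on the shared identifier, then keep only the endpoints of the mapping
x2z <- merge(x2y, y2z, by = "y")[c("x", "z")]

# Drop repeated (x, z) pairs and sort for easier comparison
x2z <- unique(x2z)
x2z <- x2z[order(x2z$x, x2z$z), ]
rownames(x2z) <- NULL
x2z
```

merge is vectorized, so at around 50,000 rows this should run far faster than a loop, although I have not timed it on your data.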
I have a data frame defined as follows:
t1 <- data.frame(x=c("A","B","C"),y=c(5,7,9))
> t1
x y
1 A 5
2 B 7
3 C 9
and a vector of picks:
picks <- c("B","C","B")
How do I get these rows, with replacement, in this order selected from the data frame?
I want:
x y
B 7
C 9
B 7
I tried
> t1[t1$x %in% picks,]
x y
2 B 7
3 C 9
and several other combinations of match, grep, which, etc., but I cannot get the result I want. It seems like it should be easy, but I'm not finding the path.
You can also perform a right join using data.table:
library(data.table)
picks <- data.table(x = picks)
setDT(t1)[picks, on = "x"]
# x y
#1: B 7
#2: C 9
#3: B 7
The rows of the result follow the order of x in picks, so both the ordering and the repeated "B" are preserved.
If only the y values are needed, we can also use a named-vector lookup:
setNames(t1$y, t1$x)[picks]
#B C B
#7 9 7
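For completeness, here is a minimal base R sketch using match(). The %in% attempt in the question fails because it produces one logical value per row of t1, so it can neither reorder rows nor repeat them. match() instead returns, for each pick, the row position of its first occurrence in t1$x, and indexing with those positions keeps order and repeats. This assumes the values in t1$x are unique.

```r
t1 <- data.frame(x = c("A", "B", "C"), y = c(5, 7, 9),
                 stringsAsFactors = FALSE)
picks <- c("B", "C", "B")

# Row positions of each pick, in pick order: 2, 3, 2
idx <- match(picks, t1$x)

# Index rows by position; repeated positions repeat the row
res <- t1[idx, ]
res
```

The row names of the result are made unique automatically; use rownames(res) <- NULL if you want them renumbered.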
I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
start end group
1 1 3 A
2 2 4 B
3 3 5 C
4 4 6 D
I want to create a new data frame in which labels are assigned to every element of the sequence on the range of coordinates:
V1 V2
1 1 A
2 2 A
3 3 A
4 2 B
5 3 B
6 4 B
7 3 C
8 4 C
9 5 C
10 4 D
11 5 D
12 6 D
The following code works but it is extremely slow with wide ranges:
df<-data.frame()
for(i in 1:dim(a)[1]){
s<-seq(a[i,1],a[i,2])
df<-rbind(df,data.frame(s,rep(a[i,3],length(s))))
}
colnames(df)<-c("V1","V2")
How can I speed this up?
You can try data.table
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
If you want a very fast solution in base R, you can manually create the data.frame in two steps:
1. Use mapply to create a list of your ranges from "start" to "end".
2. Use rep + lengths to repeat the "groups" column to the expected number of rows.
The base R approach shared here won't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(indf[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
}
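As a quick illustration of the point about not needing one row per group, here is a sketch that runs the same function on a small made-up data frame (a2 is hypothetical, not from the question) in which group "A" appears twice:

```r
myFun <- function(indf) {
  # One integer sequence per row, from start to end
  temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
  # Repeat each group label once per element of its sequence
  data.frame(group = rep(indf[["group"]], lengths(temp)),
             values = unlist(temp, use.names = FALSE))
}

# Hypothetical input: group "A" has two separate ranges
a2 <- data.frame(start = c(1, 5, 2), end = c(3, 6, 4),
                 group = c("A", "A", "B"), stringsAsFactors = FALSE)
out <- myFun(a2)
out
```

Each input row is expanded independently, so the two "A" ranges simply end up stacked one after the other.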
If you want some larger sample data to try it with, you can use the following:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
Note that this does seem to slow down as the number of different unique values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)
Suppose the data frame is like this:
df <- data.frame(x = c(1,7,8,15,24,100,9,19,128))
How do I create a new variable that satisfies the following condition:
y = 1 if 1<=x<=7
y = 2 if 8<=x<=14
y = 3 if 15<=x<=21
...
y = k if 1+7*(k-1)<= x<= 7+7*(k-1)
so that I can have the new data frame like this
df <- data.frame(y = c(1,1,2,3,4,15, 2,3, 19))
I am wondering if a for loop can be applied in this case.
Via simple algebra, you can do:
df$y <- floor((df$x+6)/7)
df
# x y
# 1 1 1
# 2 7 1
# 3 8 2
# 4 15 3
# 5 24 4
# 6 100 15
# 7 9 2
# 8 19 3
# 9 128 19
In R you will often find it easier (less typing and less thinking) to use vectorized operators than for loops for simple computations like this. In this case we performed calls to +, /, and floor over a whole vector instead of looping and using them on each element.
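To spell out the algebra: bin k covers the values 7*(k-1)+1 through 7*k, so k is x/7 rounded up. For positive whole numbers that is the same as floor((x+6)/7). The sketch below checks this against two equivalent spellings; the extra column names are just for illustration, and the equivalence assumes x holds positive whole numbers.

```r
df <- data.frame(x = c(1, 7, 8, 15, 24, 100, 9, 19, 128))

# Three equivalent ways to compute the bin index for positive whole numbers
df$y_floor   <- floor((df$x + 6) / 7)
df$y_ceiling <- ceiling(df$x / 7)
df$y_intdiv  <- (df$x - 1) %/% 7 + 1

# All three columns agree on this data
all(df$y_floor == df$y_ceiling & df$y_floor == df$y_intdiv)
```

If x can contain non-integers, the three forms no longer agree, and ceiling(x / 7) is probably the one that matches the intent.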
I've noticed that aggregate() appears to return its result ordered by the grouping column(s). Is this a guarantee? Can this be relied upon in surrounding logic?
A couple of examples:
set.seed(1)
df <- data.frame(group = sample(letters[1:3], 10, replace = TRUE), value = 1:10)
aggregate(value ~ group, df, sum)
## group value
## 1 a 16
## 2 b 22
## 3 c 17
And with two grouping columns (notice that the rows are sorted by the second grouping column first, with the first grouping column breaking ties):
set.seed(1)
df <- data.frame(group1 = sample(letters[1:3], 10, replace = TRUE),
                 group2 = sample(letters[4:6], 10, replace = TRUE),
                 value = 1:10)
aggregate(value ~ group1 + group2, df, sum)
## group1 group2 value
## 1 a d 1
## 2 b d 2
## 3 b e 9
## 4 c e 10
## 5 a f 15
## 6 b f 11
## 7 c f 7
Note: I'm asking because I just came up with an answer for "Aggregating while merging two dataframes in R" which, at least in its current form at the time of writing, depends on aggregate() returning its result ordered by the grouping column.
Yes, as long as you understand the natural ordering of factors to be by their integer codes (that is, by level order, which is not necessarily alphabetical). You can see this in the source code of aggregate.data.frame:
y <- as.data.frame(by, stringsAsFactors = FALSE)
... # y becomes the "integerized" dataframe of index vectors
grp <- rank(do.call(paste, c(lapply(rev(y), ident), list(sep = "."))),
ties.method = "min")
y <- y[match(sort(unique(grp)), grp, 0L), , drop = FALSE]
...
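Here is a small sketch that makes the "integer keys" point visible. The data frame and the deliberately non-alphabetical level order are made up for illustration. If the result were sorted alphabetically you would see a, b, c; instead it follows the factor levels.

```r
# Factor whose levels are NOT in alphabetical order
df <- data.frame(
  group = factor(c("a", "b", "c", "a", "b"), levels = c("c", "a", "b")),
  value = 1:5
)

res <- aggregate(value ~ group, df, sum)
res
##   group value
## 1     c     3
## 2     a     5
## 3     b     7
```

So relying on the order is reasonable as long as the surrounding logic expects level order. For character columns, which are converted with as.factor, the levels are the sorted unique values.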
So, I have several dataframes like this
1 2 a
2 3 b
3 4 c
4 5 d
3 5 e
......
1 2 j
2 3 i
3 4 t
3 5 r
.......
2 3 t
2 4 g
6 7 i
8 9 t
......
I want to merge all of these files into a single file that shows, for each pair of values in columns 1 and 2, the third-column value from every file, with 0 wherever that pair is not present in a file.
For the three files above (there are more), the output would be:
1 2 aj0
2 3 bit
3 4 ct0
4 5 d00
3 5 er0
6 7 00i
8 9 00t
......
So far I have read all of my .txt files into a single list, L.
Then I ran:
L <- lapply(seq_along(L), function(i) {
L[[i]][, paste0('DF', i)] <- 1
L[[i]]
})
This adds a column to each data frame that will indicate the presence of a value once they are merged.
I don't know how to proceed further. Any inputs will be great. Thanks!
Here is one way to do it with Reduce
# function to generate dummy data
gen_data <- function(){
  data.frame(
    x = 1:3,
    y = 2:4,
    z = sample(LETTERS, 3, replace = TRUE),
    # keep z as character so missing values can be replaced by "0"
    stringsAsFactors = FALSE
  )
}
# generate list of data frames to merge
L <- lapply(1:3, function(x) gen_data())
# function to merge by x and y and concatenate z
f <- function(x, y){
  # full outer join on the key columns; the z columns become z.x and z.y
  d <- merge(x, y, by = c('x', 'y'), all = TRUE)
  # set merged column to zero if no match is found
  d[['z.x']] <- ifelse(is.na(d[['z.x']]), "0", d[['z.x']])
  d[['z.y']] <- ifelse(is.na(d[['z.y']]), "0", d[['z.y']])
  # concatenate the two z columns, then drop the intermediates
  d$z <- paste0(d[['z.x']], d[['z.y']])
  d['z.x'] <- d['z.y'] <- NULL
  return(d)
}
# merge data frames
Reduce(f, L)
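One thing to watch for: because the padding is applied inside each pairwise merge, a pair that first appears in the third file gets only one leading zero ("0i" rather than the "00i" in the question). A possible variation, sketched below on the question's three example files, keeps one z column per file until the very end and only then fills in zeros and concatenates. The column names x, y, z and the helper names wide and zcols are my own choices, not anything from the question.

```r
# The three example files from the question, with assumed column names
df1 <- data.frame(x = c(1, 2, 3, 4, 3), y = c(2, 3, 4, 5, 5),
                  z = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1, 2, 3, 3), y = c(2, 3, 4, 5),
                  z = c("j", "i", "t", "r"), stringsAsFactors = FALSE)
df3 <- data.frame(x = c(2, 2, 6, 8), y = c(3, 4, 7, 9),
                  z = c("t", "g", "i", "t"), stringsAsFactors = FALSE)
L <- list(df1, df2, df3)

# Rename each file's value column to z1, z2, ... so nothing collides
L <- lapply(seq_along(L), function(i) {
  names(L[[i]])[3] <- paste0("z", i)
  L[[i]]
})

# Full outer join of all files on the key columns
wide <- Reduce(function(a, b) merge(a, b, by = c("x", "y"), all = TRUE), L)

# Replace missing values with "0", then concatenate in file order
zcols <- paste0("z", seq_along(L))
wide[zcols][is.na(wide[zcols])] <- "0"
wide$z <- do.call(paste0, wide[zcols])

result <- wide[c("x", "y", "z")]
result
```

Note that the pair (2, 4), which appears only in the third example file, also shows up in the result as "00g".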