How to select rows from a data frame with replacement - r

I have a data frame defined as follows:
t1 <- data.frame(x=c("A","B","C"),y=c(5,7,9))
> t1
x y
1 A 5
2 B 7
3 C 9
and a vector of picks:
picks <- c("B","C","B")
How do I get these rows, with replacement, in this order selected from the data frame?
I want:
x y
B 7
C 9
B 7
I tried
> t1[t1$x %in% picks,]
x y
2 B 7
3 C 9
and several other combinations of match, grep, which, etc., and cannot get what I want. It seems like it should be easy but I'm not finding the path.

Or you can perform a right join using data.table
library(data.table)
picks <- data.table(x = picks)
setDT(t1)[picks, on = "x"]
# x y
#1: B 7
#2: C 9
#3: B 7
By default the joined data.table follows the row order of x in picks, so a value that appears twice in picks is returned twice.

We can also use a named-vector lookup, which returns only the y values (named by x) rather than data frame rows:
setNames(t1$y, t1$x)[picks]
#B C B
#7 9 7
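A base R alternative, offered as a sketch and not taken from the answers above: match() returns, for each element of picks, the position of its first match in t1$x, so indexing with that vector repeats rows and keeps the order of picks. It assumes every pick occurs in x (an unmatched pick would produce a row of NAs) and that x contains no duplicate keys.

```r
t1 <- data.frame(x = c("A", "B", "C"), y = c(5, 7, 9))
picks <- c("B", "C", "B")

# One row index per pick, in the order of picks
idx <- match(picks, t1$x)

# Indexing by position repeats rows as needed
res <- t1[idx, ]

# Reset the row names, which would otherwise be "2", "3", "2.1"
rownames(res) <- NULL
res
```

Unlike the %in% attempt in the question, which only filters t1 in its original order, this indexes t1 once per pick.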

Related

Manipulating data.table column recursively on other column condition

I need to calculate a formula in a data frame. Each set of values across a few columns has to be, let's say for simplicity's sake, aggregated. However, I do not want a calculation across rows. I want to combine each set with another set that is selected by a condition elsewhere.
This is what I mean:
I have a data.table.
library(data.table)
data = data.table(A = c("a","c","b","b","a"),
B = c(1:5),
C = c(1:5)
)
setorder(data, A)
> data
A B C
1: a 1 1
2: a 5 5
3: b 3 3
4: b 4 4
5: c 2 2
In column D I need the aggregate of the values in B and C for the current row plus the values in B and C for a row where A is "a". As I have more than one "a", multiple aggregations are needed per row, and the minimum of those aggregates should be written to D.
Here is an example.
For row 1: (1+1)+(1+1)=4, (5+5)+(1+1)=12, so 4 is minimum - D1 =4.
For row 3: (3+3)+(1+1)=8, (3+3)+(5+5)=16, D3 = 8. And so on.
This is what I expect
> data_new
A B C D
1: a 1 1 4
2: a 5 5 12
3: b 3 3 8
4: b 4 4 10
5: c 2 2 6
I tried this and ran into issues.
for (i in data)data[i, D:=(min((data[i,B+C]) + (data[a=="a",(B+C)])))]
The minimum expression works fine on its own when I substitute a row number for i: min() then receives two numbers and returns the proper value. The answer below is 8.
min((data[3,B+C]) + (data[A=="a",(B+C)]))
My previous attempts involved expand.grid() and intersect(). However, with the size of my data set I ran into memory issues and RStudio quit on me. As a side note, I need to run the calculations because I cannot predict beforehand which "a" row gives the smallest outcome: the values are coordinates, and they do not correlate with the magnitude of the answer.
Any suggestions as to where my glaring issue is?
You can store the values of B + C where A == 'a' in a variable (val). Then, for each row, take the minimum of B + C + val.
library(data.table)
val <- data[A =='a', B + C]
data[, D := min(B + C + val), seq_len(nrow(data))]
data
# A B C D
#1: a 1 1 4
#2: a 5 5 12
#3: b 3 3 8
#4: b 4 4 10
#5: c 2 2 6
You can also use sapply, which returns a plain vector rather than a list:
data[, D := sapply(B + C, function(x) min(x + val))]
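Another option is to replicate the 'a' rows after taking the min of 'B' and 'C', and then add the result directly to the 'B' and 'C' columns. The advantage is that we don't have to group or loop. Note that this takes the minima of B and C separately, which matches min(B + C) only when the same 'a' row minimizes both columns, as it does here.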
library(data.table)
Reduce(`+`, (data[A == 'a', .(B = min(B), C = min(C))][rep(seq_len(.N), nrow(data))] + data[, .(B, C)]))
#[1] 4 12 8 10 6
Or in a single line
data[, D := B + C + min(B[A== 'a']) + min(C[A== 'a'])]
data$D
#[1] 4 12 8 10 6
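Since min(x + val) is the same as x + min(val), the per-row minimum can also be computed from the minimum of the row sums B + C over the 'a' rows, which avoids relying on one row minimizing both columns. The sketch below uses a plain data frame (named df here purely for illustration) so that it runs without data.table; the equivalent data.table call would be data[, D := B + C + min((B + C)[A == "a"])].

```r
# Same values as the question's table, built as a plain data frame
df <- data.frame(A = c("a", "a", "b", "b", "c"),
                 B = c(1L, 5L, 3L, 4L, 2L),
                 C = c(1L, 5L, 3L, 4L, 2L))

# B + C for every row where A is "a"
val <- with(df, (B + C)[A == "a"])

# min(x + val) equals x + min(val), so one vectorised addition suffices
df$D <- df$B + df$C + min(val)

# Cross-check against the explicit per-row minimum
brute <- sapply(df$B + df$C, function(x) min(x + val))
stopifnot(all(df$D == brute))
df$D
```

This keeps the work linear in the number of rows and never materializes all row-by-'a' combinations, which is what exhausted memory in the question's earlier attempts.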

Speed up data.frame rearrangement

I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
start end group
1 1 3 A
2 2 4 B
3 3 5 C
4 4 6 D
I want to create a new data frame in which labels are assigned to every element of the sequence on the range of coordinates:
V1 V2
1 1 A
2 2 A
3 3 A
4 2 B
5 3 B
6 4 B
7 3 C
8 4 C
9 5 C
10 4 D
11 5 D
12 6 D
The following code works but it is extremely slow with wide ranges:
df<-data.frame()
for(i in 1:dim(a)[1]){
s<-seq(a[i,1],a[i,2])
df<-rbind(df,data.frame(s,rep(a[i,3],length(s))))
}
colnames(df)<-c("V1","V2")
How can I speed this up?
You can try data.table
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
If you want a very fast solution in base R, you can manually create the data.frame in two steps:
Use mapply to create a list of your ranges from "start" to "end".
Use rep + lengths to repeat the "groups" column to the expected number of rows.
The base R approach shared here won't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(indf[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
}
Then, if you want some larger sample data to try it with, you can use the following:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
Note that this does seem to slow down as the number of unique values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)
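To illustrate the claim that the base R approach does not need one row per group, here is a small check with a repeated group. The data frame b is made up for illustration; myFun is copied from the answer above.

```r
myFun <- function(indf) {
  # One integer sequence per row, from "start" to "end"
  temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
  # Repeat each group label once per element of its sequence
  data.frame(group = rep(indf[["group"]], lengths(temp)),
             values = unlist(temp, use.names = FALSE))
}

# Two separate ranges that share the label "A"
b <- data.frame(start = c(1L, 10L), end = c(3L, 11L), group = c("A", "A"))
out <- myFun(b)
out
```

Each input row is expanded independently, so both ranges for "A" appear in the result: 1, 2, 3 followed by 10, 11.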

pair-wise duplicate removal from dataframe [duplicate]

This question already has an answer here:
Select equivalent rows [A-B & B-A] [duplicate]
(1 answer)
Closed 5 years ago.
This seems like a simple problem but I can't seem to figure it out. I'd like to remove duplicates from a dataframe (df) if two columns have the same values, even if those values are in the reverse order. What I mean is, say you have the following data frame:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c('A','B','B','C','A','A','B','B')
df <-data.frame(a,b)
a b
1 A A
2 A B
3 A B
4 B C
5 B A
6 B A
7 C B
8 C B
If I now remove duplicates, I get the following data frame:
df[duplicated(df),]
a b
3 A B
6 B A
8 C B
However, I would also like to remove row 6 in this data frame, since "A", "B" is the same as "B", "A". How can I do this automatically?
Ideally I could specify which two columns to compare since the data frames could have varying columns and can be quite large.
Thanks!
Extending Ari's answer to compare only specified columns when other columns are also present:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c('A','B','B','C','A','A','B','B')
df <-data.frame(a,b)
df$c = sample(1:10,8)
df$d = sample(LETTERS,8)
df
a b c d
1 A A 10 B
2 A B 8 S
3 A B 7 J
4 B C 3 Q
5 B A 2 I
6 B A 6 U
7 C B 4 L
8 C B 5 V
cols = c(1,2)
newdf = df[,cols]
for (i in 1:nrow(df)){
newdf[i, ] = sort(unlist(df[i, cols])) # unlist the row so sort() gets a plain vector
}
df[!duplicated(newdf),]
a b c d
1 A A 10 B
2 A B 8 S
4 B C 3 Q
One solution is to first sort each row of df:
for (i in 1:nrow(df))
{
df[i, ] = sort(unlist(df[i, ])) # unlist the row so sort() gets a plain vector
}
df
a b
1 A A
2 A B
3 A B
4 B C
5 A B
6 A B
7 B C
8 B C
At that point it's just a matter of removing the duplicated elements:
df = df[!duplicated(df),]
df
a b
1 A A
2 A B
4 B C
As thelatemail mentioned in the comments, your code actually keeps the duplicates. You need to use !duplicated to remove them.
The other answers use a for loop to assign a value for each and every row. While this is not an issue if you have 100 rows, or even a thousand, you're going to be waiting a while if you have large data of the order of 1M rows.
Stealing from the other linked answer using data.table, you could try something like:
df[!duplicated(data.frame(list(do.call(pmin,df),do.call(pmax,df)))),]
A comparison benchmark with a larger dataset (df2):
df2 <- df[sample(1:nrow(df),50000,replace=TRUE),]
system.time(
df2[!duplicated(data.frame(list(do.call(pmin,df2),do.call(pmax,df2)))),]
)
# user system elapsed
# 0.07 0.00 0.06
system.time({
for (i in 1:nrow(df2))
{
df2[i, ] = sort(df2[i, ])
}
df2[!duplicated(df2),]
}
)
# user system elapsed
# 42.07 0.02 42.09
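The question also asked for a way to name the two columns to compare when the data frame has others. A minimal sketch of the pmin/pmax idea wrapped in a function follows; dedupe_pairs and the id column are hypothetical names introduced here, not part of the original answers.

```r
# dedupe_pairs is a hypothetical helper; col1 and col2 name the two
# columns whose values are compared without regard to order
dedupe_pairs <- function(d, col1, col2) {
  # as.character() so the comparison also works if the columns are factors
  first  <- as.character(d[[col1]])
  second <- as.character(d[[col2]])
  # The smaller value goes in lo, the larger in hi, so A-B and B-A agree
  key <- data.frame(lo = pmin(first, second), hi = pmax(first, second))
  d[!duplicated(key), , drop = FALSE]
}

df <- data.frame(a = c(rep("A", 3), rep("B", 3), rep("C", 2)),
                 b = c("A", "B", "B", "C", "A", "A", "B", "B"),
                 id = 1:8)
res <- dedupe_pairs(df, "a", "b")
res
```

Only the two named columns feed the duplicate check, so the remaining columns are carried along untouched and the first row of each unordered pair is kept.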
Using apply will be a better option than loops.
newDf <- data.frame(t(apply(df,1,sort)))
All you need to do now is remove duplicates.
newDf <- newDf[!duplicated(newDf),]

Mapping identifiers transitively

I have three sets of identifiers: "x", "y", and "z". I also have two, 2-column data frames that each map one set of identifiers to another set of identifiers.
x2y = data.frame( x = c("A","A","B","B","C","D","E","F"),
y = c(1,2,1,2,3,4,4,5) )
y2z = data.frame( y = c(1,1,2,3,4,4,5,5,5),
z = c(1,2,3,3,6,7,6,7,8) )
This can be visualized in the figure below. Note that each arrow corresponds to one row in a data frame.
Question:
How do I use these two mappings (two data frames) to make a mapping
from x to z (displayed on the right of the figure above). I
think of it as a "transitive mapping": x to y and y to z gives x to z.
The data frame that I would like is...
x2z = data.frame( x = c("A","A","A","B","B","B","C","D","D","E","E","F","F","F"),
z = c(1,2,3,1,2,3,3,6,7,6,7,6,7,8) )
Notes: My data frames are usually ~50,000 rows, so efficient code is very important. When I've solved this problem with loops, it took several minutes to run.
My only requirement is that the code be in R.
You want to merge:
merge(x2y, y2z)[c('x','z')]
## x z
## 1 A 1
## 2 A 2
## 3 B 1
## 4 B 2
## 5 A 3
## 6 B 3
## 7 C 3
## 8 D 6
## 9 D 7
## 10 E 6
## 11 E 7
## 12 F 6
## 13 F 7
## 14 F 8
It helps here that the names agree where necessary.
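If the column names did not line up, or if you want the result in the same order as the desired x2z, a slightly more explicit version might look like the sketch below. The by = "y" argument, the unique() call and the final sort are additions for illustration; with this data no duplicate (x, z) pairs arise, but they could whenever two different y values connect the same x and z.

```r
x2y <- data.frame(x = c("A", "A", "B", "B", "C", "D", "E", "F"),
                  y = c(1, 2, 1, 2, 3, 4, 4, 5),
                  stringsAsFactors = FALSE)
y2z <- data.frame(y = c(1, 1, 2, 3, 4, 4, 5, 5, 5),
                  z = c(1, 2, 3, 3, 6, 7, 6, 7, 8))

# Join explicitly on y, then keep only the endpoints of each path
x2z <- merge(x2y, y2z, by = "y")[c("x", "z")]

# Two different y values can lead to the same (x, z) pair, so deduplicate
x2z <- unique(x2z)

# Sort for a stable, readable result and reset the row names
x2z <- x2z[order(x2z$x, x2z$z), ]
rownames(x2z) <- NULL
x2z
```

merge() is vectorised, so for inputs of around 50,000 rows it should be far quicker than an explicit loop, though the exact timing depends on how many z values each y maps to.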

Using variable value as column name in data.frame or cbind

Is there a way in R to have a variable evaluated as a column name when creating a data frame (or in similar situations like using cbind)?
For example
a <- "mycol";
d <- data.frame(a=1:10)
This creates a data frame with one column named a rather than mycol.
This is less important than the case that would help me remove quite a few lines from my code:
a <- "mycol";
d <- cbind(some.dataframe, a=some.sequence)
My current code has the tortured:
names(d)[dim(d)[2]] <- a;
which is aesthetically barftastic.
> d <- setNames( data.frame(a=1:10), a)
> d
mycol
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
Is structure(data.frame(1:10),names="mycol") aesthetically pleasing to you? :-)
Just use colnames after creation, e.g.:
a <- "mycolA"
b<- "mycolB"
d <- data.frame(a=1:10, b=rnorm(10))
colnames(d)<-c(a,b)
d
mycolA mycolB
1 1 -1.5873866
2 2 -0.4195322
3 3 -0.9511075
4 4 0.2259858
5 5 -0.6619433
6 6 3.4669774
7 7 0.4087541
8 8 -0.3891437
9 9 -1.6163175
10 10 0.7642909
Simple solution:
df <- data.frame(1:5, letters[1:5])
logics <- c(T,T,F,F,T)
cities <- c("Warsaw","London","Paris","NY","Tokio")
m <- as.matrix(logics)
m2 <- as.matrix(cities)
name <- "MyCities"
colnames(m) <- deparse(substitute(logics))
colnames(m2) <- eval(name)
df<-cbind(df,m)
cbind(df,m2)
X1.5 letters.1.5. logics MyCities
1 1 a TRUE Warsaw
2 2 b TRUE London
3 3 c FALSE Paris
4 4 d FALSE NY
5 5 e TRUE Tokio
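For the cbind case in the question, another possibility is to skip cbind and assign the column with [[ ]], which evaluates the variable holding the name. This is a minimal sketch; some_df and new_values are made-up stand-ins for the question's some.dataframe and some.sequence.

```r
a <- "mycol"

# Build a one-column data frame and name it from the variable
d <- setNames(data.frame(1:10), a)

# Add a column to an existing data frame under a name held in a variable
some_df <- data.frame(id = 1:3)
new_values <- c(10, 20, 30)
some_df[[a]] <- new_values

names(d)
names(some_df)
```

The [[ ]] assignment appends the column in place, so there is no need to rename the last column afterwards with names(d)[dim(d)[2]] <- a.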
