R: How to do the fastest replacement in R?

I have an input data frame like this (the real one is very large, so I want to do this quickly):
df1 <- data.frame(A=c(1:5), B=c(5:9), C=c(9:13))
A B C
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
5 5 9 13
I have a data frame of replacement codes like this (it may have more entries than df1):
df2 <- data.frame(X=c(1:15), Y=c(101:115))
X Y
1 1 101
2 2 102
3 3 103
4 4 104
5 5 105
6 6 106
7 7 107
8 8 108
9 9 109
10 10 110
11 11 111
12 12 112
13 13 113
14 14 114
15 15 115
By matching df2$X against the values in df1$A and df1$B, I want to build new_df1 by replacing df1$A and df1$B with the corresponding values from df2$Y, i.e. resulting in this new_df1:
A B C
1 101 105 9
2 102 106 10
3 103 107 11
4 104 108 12
5 105 109 13
Could you give me some guidance on how to do this quickly in R, since my data frame is very large? Many thanks.

As Thilo mentioned, Nico's answer assumes that df2 is ordered by X and that X contains every integer 1, 2, 3, ....
I would prefer to use match(), which covers the more general case:
df1 <- data.frame(A=c(1:5), B=c(5:9), C=c(9:13))
df2 <- data.frame(X=c(1:15), Y=c(101:115))
new_df1 <- df1
new_df1$A <- df2$Y[match(df1$A,df2$X)]
new_df1$B <- df2$Y[match(df1$B,df2$X)]
A B C
1 101 105 9
2 102 106 10
3 103 107 11
4 104 108 12
5 105 109 13
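For instance, here is a minimal check (a sketch) that the match()-based lookup gives the same result even when df2's rows are shuffled, which the direct-indexing approach below cannot handle:
set.seed(42)  # for a reproducible shuffle
df2_shuffled <- df2[sample(nrow(df2)), ]

check <- df1
check$A <- df2_shuffled$Y[match(df1$A, df2_shuffled$X)]
check$B <- df2_shuffled$Y[match(df1$B, df2_shuffled$X)]

identical(check, new_df1)  # should be TRUE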

It's super easy! You just need to use the values as offsets (indices) into the vector. Since df2$X here is simply 1, 2, ..., 15 in order, the values in df1$A and df1$B are themselves valid row positions.
So, for instance, to get the Y column of df2 corresponding to the values in the A column of df1 you write df2$Y[df1$A].
Hence, your code will be:
df_new <- data.frame("A" = df2$Y[df1$A], "B" = df2$Y[df1$B], "C" = df1$C)

Here is another (one-liner) way of doing it.
> with(c(df2,df1),data.frame(A = Y[match(A,X)],B = Y[match(B,X)],C))
A B C
1 101 105 9
2 102 106 10
3 103 107 11
4 104 108 12
5 105 109 13
However, I am not sure whether it will be faster than the other suggestions.
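If raw speed is the concern, a quick comparison sketch (assuming the microbenchmark package is installed):
library(microbenchmark)
microbenchmark(
  match_based  = df2$Y[match(df1$A, df2$X)],  # general: works for any df2$X
  direct_index = df2$Y[df1$A],                # only valid because df2$X is 1, 2, ..., n in order
  times = 100L
)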

decrease the time of script running (for loop based)

I wrote a script based on two for loops that I would like to optimize to speed up its running time.
Below are reproducible data that I simplified with the code that I am using on my own data.
nuc2 is a vector of 101 positions, and
tel is a data frame with pairs of coordinates "aa" and "bb".
The aim is to calculate, for each position, the number of coordinate pairs it lies between. For example, position 111 lies between 3 pairs of coordinates: G, I and J.
#data
tel=data.frame(aa=c(153,113,163,117,193,162,110,109,186,103),
bb=c(189,176,185,130,200,189,156,123,198,189),
ID=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"))
> tel
aa bb ID
1 153 189 A
2 113 176 B
3 163 185 C
4 117 130 D
5 193 200 E
6 162 189 F
7 110 156 G
8 109 123 H
9 186 198 I
10 103 189 J
nuc2=100:200
# Loop
count_occ = 0
count_occ_int = NULL
count_occ_fin = NULL
for (j in 1:length(nuc2)) {
  # count how many (aa, bb) intervals strictly contain position nuc2[j]
  for (i in 1:nrow(tel)) {
    if (nuc2[j] < tel$bb[i] & nuc2[j] > tel$aa[i]) {
      count_occ = count_occ + 1
    }
  }
  count_occ_int = count_occ
  count_occ_fin = c(count_occ_fin, count_occ_int)
  count_occ = 0
}
nuc_occ = data.frame(nuc = nuc2, occ = count_occ_fin)
> head(nuc_occ,20)
nuc occ
1 100 0
2 101 0
3 102 0
4 103 0
5 104 1
6 105 1
7 106 1
8 107 1
9 108 1
10 109 1
11 110 2
12 111 3
13 112 3
14 113 3
15 114 4
16 115 4
17 116 4
18 117 4
19 118 5
20 119 5
In my data, the length of my nuc vector is 9,304,567 and the number of coordinate pairs is 53 (I will have several hundred soon), and it took more than 60 hours to run the code!
Any idea to help me speed up this code?
I thought of the apply functions, but I am not sure how to combine the two for loop operations.
You can use a data.table non-equi join like this:
library(data.table)
setDT(tel)[SJ(v=nuc2), on=.(aa<=v, bb>=v)][,.(occ = sum(!is.na(ID))), by=.(nuc=aa)]
Explanation:
setDT(tel) sets the tel data.frame to be of class data.table.
SJ(v=nuc2) is a convenience function for converting a vector to a data.table; in this case it converts nuc2 to a data.table with a single column v. I'm doing this because I want to join two data.tables: tel (with columns aa, bb and ID) and one with a single column v holding the values in nuc2.
The join conditions are given in the on=... parameter of the setDT(tel)[...] clause; here the condition is that the v value must be >= the aa value and <= the bb value.
The final step (i.e. the next chained data.table operation) simply counts the number of rows where ID is not NA, by nuc value (by=.(nuc=aa)).
Output:
nuc occ
<int> <int>
1: 100 0
2: 101 0
3: 102 0
4: 103 1
5: 104 1
---
97: 196 2
98: 197 2
99: 198 2
100: 199 1
101: 200 1
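Note that the counts at the interval endpoints differ slightly from the original loop (e.g. nuc 103), because the join uses aa <= v and bb >= v while the loop uses strict inequalities. If you need the strict behaviour, the join conditions can use < and > instead (a sketch):
setDT(tel)[SJ(v = nuc2), on = .(aa < v, bb > v)][, .(occ = sum(!is.na(ID))), by = .(nuc = aa)]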
Here's a tidyverse solution:
library(dplyr)
library(tibble)

lapply(
  100:200,
  \(x) tel %>%
    filter(aa <= x & x <= bb) %>%
    summarise(occ = n(), .groups = "drop") %>%
    add_column(nuc = x, .before = 1)
) %>%
  bind_rows() %>%
  as_tibble()
# A tibble: 101 × 2
nuc occ
<int> <int>
1 100 0
2 101 0
3 102 0
4 103 1
5 104 1
6 105 1
7 106 1
8 107 1
9 108 1
10 109 2
# … with 91 more rows
Using microbenchmark to assess performance, this gives
Unit: nanoseconds
expr min lq mean median uq max neval
lapply 7 9 8.8 9 9 9 10
original 8 9 23.8 9 9 158 10
In other words, a reduction in run time of about two-thirds. And the tidyverse is not known for speed; a base R solution is likely to be faster still.
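For example, a minimal base R sketch (not benchmarked here) that vectorises the inner loop and keeps the original strict comparisons:
# for each position, count how many (aa, bb) intervals strictly contain it
occ <- vapply(nuc2, function(p) sum(tel$aa < p & p < tel$bb), integer(1L))
nuc_occ <- data.frame(nuc = nuc2, occ = occ)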

How do I regroup data?

I am looking to change the structure of my dataframe, but I am not really sure how to do it. I am not even sure how to word the question either.
ID <- c(1,8,6,2,4)
a <- c(111,94,85,76,72)
b <- c(75,37,86,55,62)
dataframe <- data.frame(ID,a,b)
ID a b
1 1 111 75
2 8 94 37
3 6 85 86
4 2 76 55
5 4 72 62
Above is the code with its output. However, I want the output to look like the following, and the only way I know how to do this is to type it out manually. Is there any way other than changing the input by hand? I have quite a large data set that I would like to restructure, and doing it manually would take forever.
ID letter value
1 1 a 111
2 1 b 75
3 8 a 94
4 8 b 37
5 6 a 85
6 6 b 86
7 2 a 76
8 2 b 55
9 4 a 72
10 4 b 62
We may use pivot_longer
library(dplyr)
library(tidyr)
dataframe %>%
  pivot_longer(cols = a:b, names_to = 'letter')
Output:
# A tibble: 10 × 3
ID letter value
<dbl> <chr> <dbl>
1 1 a 111
2 1 b 75
3 8 a 94
4 8 b 37
5 6 a 85
6 6 b 86
7 2 a 76
8 2 b 55
9 4 a 72
10 4 b 62
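Should you need the wide format back later, the reverse operation is pivot_wider; a sketch, assuming the long result above was saved as long_df:
long_df %>%
  pivot_wider(names_from = letter, values_from = value)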
A base R option using reshape:
df <- reshape(dataframe, direction = "long",
              v.names = "value",
              varying = 2:3,
              times = names(dataframe)[2:3],
              timevar = "letter",
              idvar = "ID")
df <- df[order(match(df$ID, dataframe$ID)), ]
row.names(df) <- NULL
Output
ID letter value
1 1 a 111
2 1 b 75
3 8 a 94
4 8 b 37
5 6 a 85
6 6 b 86
7 2 a 76
8 2 b 55
9 4 a 72
10 4 b 62

R: ifelse loop across multiple dataframes

I'd like to create an efficient ifelse statement such that if columns from df2 match columns from df1, then that row in df2 is coded a specific way. My code works but is very inefficient.
Example data:
Df1
A B C
111 2 1
111 5 2
111 7 3
112 2 4
112 8 5
113 2 6
Df2
A B
112 2
111 2
113 2
111 5
111 7
112 8
Desired Outcome:
Df2
A B C
112 2 4
111 2 1
113 2 6
111 5 2
111 7 3
112 8 5
What I've done is this:
Df2$C<- ifelse(Df2$A == 111 & Df2$B == 2, 1, 0)
Df2$C<- ifelse(Df2$A == 111 & Df2$B == 5, 2, 0)
Df2$C<- ifelse(Df2$A == 111 & Df2$B == 7, 3, 0)...
This works, but is there a way for df2 to reference the columns in df1 and create the column df2$C, so that each combination doesn't have to be typed out manually?
This would typically be done with a join. left_join from dplyr will connect each row in your first table with each matching row in the second table.
https://dplyr.tidyverse.org/reference/join.html
library(dplyr)
Df2 %>% left_join(Df1)
Joining, by = c("A", "B")
A B C
1 112 2 4
2 111 2 1
3 113 2 6
4 111 5 2
5 111 7 3
6 112 8 5
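If you want to silence the "Joining, by" message and be explicit about the keys, the join columns can be specified directly (a sketch):
Df2 %>% left_join(Df1, by = c("A", "B"))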
merge from base R will give a similar result, but doesn't keep the original row order without some extra wrangling.
Merge two data frames while keeping the original row order
merge(Df2, Df1)
A B C
1 111 2 1
2 111 5 2
3 111 7 3
4 112 2 4
5 112 8 5
6 113 2 6
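One way to do that extra wrangling with merge, keeping Df2's original row order (a sketch using a temporary index column):
Df2$row_order <- seq_len(nrow(Df2))             # remember the original order
out <- merge(Df2, Df1, by = c("A", "B"), all.x = TRUE)
out <- out[order(out$row_order), ]              # restore it after merging
out$row_order <- NULL
out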

Rank function to rank multiple variables in R

I am trying to rank multiple numeric variables (around 700+ variables) in my data and am not sure exactly how to do this, as I am still pretty new to using R.
I do not want to overwrite the original values with the ranks, so I need to create a new rank variable for each of these numeric variables.
From reading other posts, I believe the assign and transform functions, along with rank, may be able to solve this. I tried implementing it as below (sample data and code) and am struggling to get it to work.
The output dataset, in addition to the variables xcount, xvisit and ysales, needs to be populated with the variables xcount_rank, xvisit_rank and ysales_rank containing the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it creates the rank values as (101,230], (230,450], etc., whereas I would like the rank variables to be populated as 1, 2, etc., up to the 10 categories from the splits I did. Is there any way to achieve this?
input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
I would like to see each record assigned to the group it would fall under when ranking the interval values.
Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
input[nm1] <- mutate_each(input[2:4],funs(rank(., ties.method="first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2
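mutate_each() has since been deprecated in dplyr; a modern equivalent of the same idea (a sketch) uses across():
library(dplyr)
input <- input %>%
  mutate(across(c(xcount, xvisit, ysales),
                ~ cut(.x, breaks = 3, labels = FALSE),
                .names = "rank_{.col}"))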

Allow a maximum number of entries when certain conditions apply

I have a dataset with a lot of entries. Each entry belongs to a certain ID (belongID), the entries are unique (with uniqID), but multiple entries can come from the same source (sourceID). It is also possible that multiple entries from the same source share the same belongID. For the research I need to do on the dataset, I have to get rid of the entries of a single sourceID that occur more than 5 times for one belongID. The (at most) 5 entries to keep are the ones with the highest 'Time' value.
To illustrate this I have the following example dataset:
belongID sourceID uniqID Time
1 1001 101 5
1 1002 102 5
1 1001 103 4
1 1001 104 3
1 1001 105 3
1 1005 106 2
1 1001 107 2
1 1001 108 2
2 1005 109 5
2 1006 110 5
2 1005 111 5
2 1006 112 5
2 1005 113 5
2 1006 114 4
2 1005 115 4
2 1006 116 3
2 1005 117 3
2 1006 118 3
2 1005 119 2
2 1006 120 2
2 1005 121 1
2 1007 122 1
3 1010 123 5
3 1480 124 2
The example in the end should look like this:
belongID sourceID uniqID Time
1 1001 101 5
1 1002 102 5
1 1001 103 4
1 1001 104 3
1 1001 105 3
1 1005 106 2
1 1001 107 2
2 1005 109 5
2 1006 110 5
2 1005 111 5
2 1006 112 5
2 1005 113 5
2 1006 114 4
2 1005 115 4
2 1006 116 3
2 1005 117 3
2 1006 118 3
2 1007 122 1
3 1010 123 5
3 1480 124 2
There are a lot more columns with data in the file, but the selection has to be based purely on time. As shown in the example, it can also happen that the 5th and 6th entries of a sourceID with the same belongID have the same time. In this case only one of them should be kept, because the maximum is 5.
The dataset here is nicely ordered on belongID and time for illustrative purposes, but in the real dataset this is not the case. Any idea how to tackle this problem? I have not come across something similar yet.
If dat is your data frame:
do.call(rbind,
        by(dat, INDICES = list(dat$belongID, dat$sourceID),
           FUN = function(x) head(x[order(x$Time, decreasing = TRUE), ], 5)))
Say your data is in df. The ordered (by uniqID) output is obtained after this:
# count the number of entries for each (belongID, sourceID) combination
tab <- tapply(df$Time, list(df$belongID, df$sourceID), length)
bIDs <- rownames(tab)
sIDs <- colnames(tab)
for (i in bIDs)
{
  if (all(is.na(tab[bIDs == i, ]))) next
  # source IDs with more than 5 entries for this belongID
  ids <- na.omit(sIDs[tab[i, sIDs] > 5])
  for (j in ids)
  {
    cond <- df$belongID == i & df$sourceID == j
    old <- df[cond, ]
    # keep only the 5 entries with the highest Time
    id5 <- order(old$Time, decreasing = TRUE)[1:5]
    new <- old[id5, ]
    df <- df[!cond, ]
    df <- rbind(df, new)
  }
}
df[order(df$uniqID), ]
A solution in two lines using the plyr package:
library(plyr)
x <- ddply(dat, .(belongID, sourceID), function(x)tail(x[order(x$Time), ], 5))
xx <- x[order(x$belongID, x$uniqID), ]
The results:
belongID sourceID uniqID Time
5 1 1001 101 5
6 1 1002 102 5
4 1 1001 103 4
2 1 1001 104 3
3 1 1001 105 3
7 1 1005 106 2
1 1 1001 108 2
10 2 1005 109 5
16 2 1006 110 5
11 2 1005 111 5
17 2 1006 112 5
12 2 1005 113 5
15 2 1006 114 4
9 2 1005 115 4
13 2 1006 116 3
8 2 1005 117 3
14 2 1006 118 3
18 2 1007 122 1
19 3 1010 123 5
20 3 1480 124 2
The dataset on which this method is going to be used has 170,000+ entries and almost 30 columns.
Benchmarking each of the three provided solutions (by danas.zuokas, mplourde and Andrie) on my dataset gave the following outcomes:
danas.zuokas' solution:
User System Elapsed
2829.569 0 2827.86
mplourde's solution:
User System Elapsed
765.628 0.000 763.908
Andrie's solution:
User System Elapsed
984.989 0.000 984.010
Therefore I will use mplourde's solution. Thank you all!
This should be faster, using data.table:
library(data.table)
DT = as.data.table(dat)
DT[, .SD[tail(order(Time), 5)], by = list(belongID, sourceID)]
Aside: it is worth counting how many times the same variable name is repeated across the various answers to this question. Do you often end up with a lot of long or similar object names?
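For completeness, a modern dplyr equivalent of the same grouped top-5 selection (a sketch, not part of the original benchmark):
library(dplyr)
dat %>%
  group_by(belongID, sourceID) %>%
  slice_max(Time, n = 5, with_ties = FALSE) %>%  # keep at most 5 rows with the highest Time per group
  ungroup() %>%
  arrange(belongID, uniqID)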
