How to make this code more efficient in R?

I know this is a stupid question, but I'm kinda frustrated with my code because it takes so much time. Here is one part of my code.
Basically I have a matrix called "distance":
  a b c
1 2 5 7
2 6 8 4
3 9 2 3
and then let's say I have a column in a data frame that contains values from {a, b, c} (c2 and c3 are just more columns of the same kind):
c1 c2 c3
c  ... ...
a  ... ...
a  ... ...
b  ... ...
c  ... ...
So I want to do a match: I want to make another matrix with ncol = nrow(distance) and nrow = nrow(c1), where each factor value is replaced by its distance value. Here's an example of the first column of the matrix I'm going to make:
a will be replaced by 2
b will be replaced by 5
c will be replaced by 7
and for the second column, I will take row number 2 from the distance matrix, and so on... so the result will be like this:
m1 m2 m3
7 4 3
2 6 9
2 6 9
5 8 2
7 4 3
That is just a simple example. I'm running this code, but when it has to deal with a large number of iterations it gets really slow, which is kinda stressful.
for(l in 1:ncol(d.cat)){
  get.unique = sort(unique(d.cat[, l]))
  for(j in 1:nrow(d.cat)){
    value = as.character(d.cat[j, l])
    index = which(get.unique == value)
    d2[j, l] = d[[l]][i, index]
  }
}
d.cat is categorical data, and d[[...]] is the list of distance matrices, one for each column of d.cat (the row index i comes from enclosing code that isn't shown here).
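For reference, the inner j loop can usually be replaced by one vectorized lookup per column using match(); this is only a sketch, assuming (as in the loop above) that i is the distance-matrix row chosen by enclosing code and that d2 is pre-allocated:
for(l in 1:ncol(d.cat)){
  get.unique = sort(unique(d.cat[, l]))
  # one match() call per column instead of nrow(d.cat) calls to which()
  idx = match(as.character(d.cat[, l]), as.character(get.unique))
  d2[, l] = d[[l]][i, idx]
}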

Try to store the indices and do the updating in one go. Let's say your distance matrix is dmat and your data frame is df, and you want to create a matrix named newmat:
a.ind = which(df$c1=="a")
b.ind = which(df$c1=="b")
c.ind = which(df$c1=="c")
newmat = matrix(0,nrow=length(df$c1),ncol=3)
# repeat the distance row for each matching position (plain assignment would recycle column-wise)
newmat[a.ind,] = rep(dmat[,1], each = length(a.ind))
newmat[b.ind,] = rep(dmat[,2], each = length(b.ind))
newmat[c.ind,] = rep(dmat[,3], each = length(c.ind))
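If there are more than three levels, the same idea can be written as a loop over the columns of dmat; a sketch, assuming the columns of dmat are named after the levels:
newmat = matrix(0, nrow = length(df$c1), ncol = nrow(dmat))
for (lev in colnames(dmat)) {
  rows = which(df$c1 == lev)
  # every matching row of newmat gets the whole distance column for that level
  newmat[rows, ] = rep(dmat[, lev], each = length(rows))
}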

Here's some data
set.seed(123)
d = matrix(1:9, 3, dimnames=list(NULL, letters[1:3]))
df = data.frame(c1 = sample(letters[1:3], 10, TRUE), stringsAsFactors=FALSE)
and a solution
t(d[, match(df$c1, colnames(d))])
For example
> d
     a b c
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> df$c1
[1] "a" "c" "b" "c" "c" "a" "b" "c" "b" "b"
> t(d[,match(df$c1, colnames(d))])
  [,1] [,2] [,3]
a    1    2    3
c    7    8    9
b    4    5    6
c    7    8    9
c    7    8    9
a    1    2    3
b    4    5    6
c    7    8    9
b    4    5    6
b    4    5    6

Your data
mat <- matrix(c(2,6,9,5,8,2,7,4,3), nrow=3)
rownames(mat) <- 1:3
colnames(mat) <- letters[1:3]
library(dplyr)
set.seed(1)
df <- as.data.frame(matrix(sample(letters[1:3], 12, replace=TRUE), nrow=4)) %>%
  setNames(paste0("c", 1:3))
# c1 c2 c3
# 1 a a b
# 2 b c a
# 3 b c a
# 4 c b a
Using purrr::map2_df, iterate over the columns of df and the columns of tmat in parallel:
library(purrr)
tmat <- t(mat)
map2_df(df, seq_len(ncol(tmat)), ~tmat[,.y][.x])
# # A tibble: 4 x 3
# c1 c2 c3
# <dbl> <dbl> <dbl>
# 1 2. 6. 2.
# 2 5. 4. 9.
# 3 5. 4. 9.
# 4 7. 8. 9.
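For comparison, a base R sketch of the same lookup (it assumes the columns of df are character, not factor):
as.data.frame(mapply(function(col, j) tmat[, j][col], df, seq_len(ncol(tmat))))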

Here is my attempt using the tidyverse:
library(tidyverse)
# Let's create some example data
distance <- data_frame(a = sample(1:10, 1000, T), b = sample(1:10, 1000, T), c = sample(1:10, 1000, T))
c1 <- data_frame(c1 = sample(letters[1:3], 1000, T), c2 = sample(letters[1:3], 1000, T))
# First rearrange your data a little to make it tidier
distance2 <- distance %>%
  mutate(i = seq_len(n())) %>%
  gather(col, value, -i)
c12 <- c1 %>%
  mutate(i = seq_len(n())) %>%
  gather(col, value, -i)
# Now just join the data and spread it again
c12 %>%
  left_join(distance2, by = c("i", "value" = "col")) %>%
  select(i, col, value.y) %>%
  spread(col, value.y)
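gather() and spread() have since been superseded in tidyr; a sketch of the same pipeline with pivot_longer()/pivot_wider(), assuming tidyr >= 1.0:
distance2 <- distance %>%
  mutate(i = seq_len(n())) %>%
  pivot_longer(-i, names_to = "col", values_to = "value")
c12 <- c1 %>%
  mutate(i = seq_len(n())) %>%
  pivot_longer(-i, names_to = "col", values_to = "value")
c12 %>%
  left_join(distance2, by = c("i", "value" = "col")) %>%
  select(i, col, value.y) %>%
  pivot_wider(names_from = col, values_from = value.y)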

Related

extract and format data from dataset into matrix in R

I want to make this data frame into this matrix. I have tried:
x <- read.csv("sample1.csv")
ax <- matrix(c(x[1,1],x[2,1],x[1,3],x[1,1],x[3,1],x[1,4],x[1,1],x[4,1],x[1,5],x[1,1],x[5,1],x[1,6],x[1,1],x[6,1],x[1,7],x[2,1],x[1,1],x[2,2],x[2,1],x[3,1],x[2,4],x[2,1],x[4,1],x[2,5],x[2,1],x[5,1],x[2,6],x[3,1],x[6,1],x[2,7],x[3,1],x[1,1],x[3,2],x[3,1],x[2,1],x[3,3],x[3,1],x[4,1],x[3,5],x[3,1],x[5,1],x[3,6],x[3,1],x[6,1],x[3,7],x[4,1],x[1,1],x[4,2],x[4,1],x[2,1],x[4,3],x[4,1],x[3,1],x[4,4],x[4,1],x[5,1],x[4,6],x[4,1],x[6,1],x[4,7],x[5,1],x[1,1],x[2,2],x[5,1],x[2,1],x[2,4],x[5,1],x[3,1],x[2,5],x[5,1],x[4,1],x[2,6],x[5,1],x[6,1],x[2,7],x[6,1],x[1,1],x[2,2],x[6,1],x[2,1],x[2,4],x[6,1],x[3,1],x[2,5],x[6,1],x[4,1],x[2,6],x[6,1],x[5,1],x[2,7]),10,3, byrow=TRUE)
bx <- ax[order(ax[,3], decreasing = TRUE),]
But it's not beautiful at all, and it would also be a lot of work with different sample data.
So I'd like to simplify it if possible; any suggestions?
This can be achieved with the melt() function from the reshape2 package:
> a = matrix(c(1:9), nrow = 3, ncol = 3, dimnames = list(LETTERS[1:3], letters[1:3]))
> a
  a b c
A 1 4 7
B 2 5 8
C 3 6 9
> library(reshape2)
> melt(a, na.rm = TRUE)
Var1 Var2 value
1 A a 1
2 B a 2
3 C a 3
4 A b 4
5 B b 5
6 C b 6
7 A c 7
8 B c 8
9 C c 9
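If you prefer tidyr over reshape2, a roughly equivalent sketch with pivot_longer() (the rows come out grouped by Var1 rather than Var2):
library(tidyr)
library(tibble)
library(dplyr)
as.data.frame(a) %>%
  rownames_to_column("Var1") %>%
  pivot_longer(-Var1, names_to = "Var2", values_to = "value")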

Find values in data frame 2 which is found in data frame 1, within a certain range

I want to find which values in df1 are also present in df2, within a certain range. One value consists of both a and b in the data frames (a and b can't be split up). For example, can I find 9,1 (the first row of df1) in df2? It doesn't have to be at the same position. Also, we can allow a difference of, say, 1 for "a" and 1 for "b": I want to find all values 9±1, 1±1 in df2. "a" and "b" always go together; each row sticks together. Does anyone have a suggestion for how to code this? Many thanks!
set.seed(1)
a <- sample(10,5)
set.seed(1)
b <- sample(5,5, replace=T)
feature <- LETTERS[1:5]
df1 <- data.frame(feature,a,b)
df1
> df1
feature a b
A 9 1
B 4 4
C 7 1
D 1 2
E 2 5
set.seed(2)
a <- sample(10,5)
b <- sample(5,5, replace=T)
feature <- LETTERS[1:5]
df2 <- data.frame(feature,a,b)
df2
> df2
feature a b
A 5 1
B 6 4
C 9 5
D 1 1
E 10 2
It's not correct, but I imagine this can be done with a for loop somehow!
for(i in df1[,1]) {
  for(j in df1[,2]){
    s <- c(s, (df1[i,1] & df1[j,2] == df2[,1] & df2[,2])) # how to add certain allowed diff levels?
  }
}
s
Output wanted:
feature_df1 <- LETTERS[1:5]
match <- c(1,0,0,1,0)
feature_df2 <- c("E","","","D", "")
df <- data.frame(feature_df1, match, feature_df2)
df
feature_df1 match feature_df2
A 1 E
B 0
C 0
D 1 D
E 0
I loooove data.table, which is (imo) the weapon of choice for these kinds of problems.
library(data.table)
# make df1 and df2 data.tables
setDT(df1, key = "feature"); setDT(df2)
# now perform a join operation on each row of df1,
# creating an on-the-fly subset of df2
df1[df1, c("match", "feature_df2") := {
  val = df2[a %between% c(i.a - 1, i.a + 1) & b %between% c(i.b - 1, i.b + 1), ]
  unique_val = sort(unique(val$feature))
  num_val = length(unique_val)
  list(num_val, paste0(unique_val, collapse = ";"))
}, by = .EACHI][]
# feature a b match feature_df2
# 1: A 9 1 1 E
# 2: B 4 4 0
# 3: C 7 1 0
# 4: D 1 2 1 D
# 5: E 2 5 0
One way to go about this in base R would be to split the data frames into lists of row vectors, calculate the absolute difference between each pair of rows, and then check whether that difference exceeds a given threshold (the example data and the l1/l2 lists are defined in the Data section below).
Code
# Find the absolute difference of all row vectors
listdif <- lapply(l1, function(x){
  lapply(l2, function(y){
    abs(x - y)
  })
})
# Then flatten the list to a list of data.frames
listdifflat <- lapply(listdif, function(x){
  do.call(rbind, x)
})
# Finally see if a pair of numbers is within our threshold or not
m1 <- 2
m2 <- 3
listfin <- Map(function(x){
  x[1] > m1 | x[2] > m2
},
listdifflat)
head(listfin, 1)
[[1]]
V1
[1,] TRUE
[2,] FALSE
[3,] TRUE
[4,] TRUE
[5,] TRUE
[6,] TRUE
[7,] TRUE
[8,] TRUE
[9,] TRUE
[10,] TRUE
Data
df1 <- read.table(text = "
4 1
7 5
1 5
2 10
13 6
19 10
11 7
17 9
14 5
3 5")
df2 <- read.table(text = "
15 1
6 3
19 6
8 2
1 3
13 7
16 8
12 7
9 1
2 6")
# convert df to list of row vectors
l1 <- lapply(1:nrow(df1), function(x){
  df1[x, ]
})
l2 <- lapply(1:nrow(df2), function(x){
  df2[x, ]
})

Combine vectors into data frame, using vector name as a column

library(dplyr)
I have a set of vectors:
Sp_A <- c("A",1,2,3,4,5,6,7,8)
Sp_B <- c("B",9,10,11,12,13,14,15,16)
Sp_C <- c("C",17,18,19,20,21,22,23,24)
which I have made into a list of vectors:
list <- ls(pattern = "Sp_")
I want to use this list to loop over each vector in the list and make it into a data frame. I currently do this for one vector using this:
A_df <- select(data.frame(rep(Sp_A[1], each = 4), c(Sp_A[c(2,4,6,8)]), c(Sp_A[c(3,5,7,9)])), name = 1, var1 = 2, var2 = 3)
I have tried to make this operation into a for loop like this:
for(i in list) {
  test[i] <- select(A_df <- data.frame(rep(i[1], each = 4),
                                       c(i[c(2,4,6,8)]),
                                       c(i[c(3,5,7,9)]),
                                       name = 1, var1 = 2, var2 = 3))
}
but to no avail.
I have heard that I might be able to use apply() for this sort of thing but I don't know how.
Maybe this:
lapply(list,function(x) data.frame(name=get(x)[1],matrix(get(x)[-1],ncol = 2)))
[[1]]
name X1 X2
1 A 1 5
2 A 2 6
3 A 3 7
4 A 4 8
[[2]]
name X1 X2
1 B 9 13
2 B 10 14
3 B 11 15
4 B 12 16
[[3]]
name X1 X2
1 C 17 21
2 C 18 22
3 C 19 23
4 C 20 24
Or a simple for loop to assign the dataframes to objects:
for (x in 1:length(list)){
  assign(paste0("test", x), data.frame(name = get(list[x])[1], matrix(get(list[x])[-1], ncol = 2)))
}
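If a single combined data frame is the goal rather than separate objects, the lapply() result can be stacked in one step; a sketch (the values stay character here because the original vectors mix the name with the numbers):
res <- lapply(list, function(x) data.frame(name = get(x)[1], matrix(get(x)[-1], ncol = 2)))
combined <- do.call(rbind, res)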

Same function over multiple data frames in R

I am new to R, and this is a very simple question. I've found a lot of similar things to what I want but not exactly it. Basically I have multiple data frames and I simply want to run the same function across all of them. A for-loop could work but I'm not sure how to set it up properly to call data frames. It also seems most prefer the lapply approach with R. I've played with the get function as well to no avail. I apologize if this is a duplicated question. Any help would be greatly appreciated!
Here's my over simplified example:
2 data frames: df1, df2
df1
start stop ID
0 10 x
10 20 y
20 30 z
df2
start stop ID
0 10 a
10 20 b
20 30 c
what I want is a 4th column with the average of start and stop for both dfs
df1
start stop ID Avg
0 10 x 5
10 20 y 15
20 30 z 25
I can do this one data frame at a time with:
df1$Avg <- rowMeans(subset(df1, select = c(start, stop)), na.rm = TRUE)
but I want to run it on all of the dataframes.
Make a list of data frames then use lapply to apply the function to them all.
df.list <- list(df1,df2,...)
res <- lapply(df.list, function(x) rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE))
# to keep the original data.frame also
res <- lapply(df.list, function(x) cbind(x,"rowmean"=rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE)))
The lapply will then feed in each data frame as x sequentially.
Put them into a list and then run rowMeans over the list.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
lapply(list(df1, df2), function(w) { w$Avg <- rowMeans(w[1:2]); w })
[[1]]
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
[[2]]
x y ID Avg
1 5 2 f 3.5
2 5 3 g 4.0
3 5 4 h 4.5
4 5 5 i 5.0
5 5 6 j 5.5
In case you want all the outputs in the same data frame, this may help.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
z = list(df1, df2)
df = NULL
for (i in z) {
  i$Avg = (i$x + i$y)/2
  df <- rbind(df, i)
  print(df)
}
> df
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
6 5 2 f 3.5
7 5 3 g 4.0
8 5 4 h 4.5
9 5 5 i 5.0
10 5 6 j 5.5
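Note that rbind() inside a loop copies the accumulated data frame on every iteration, so with many or large data frames it is usually faster to build the pieces first and bind once; a sketch of the same result:
z <- list(df1, df2)
df <- do.call(rbind, lapply(z, function(d) { d$Avg <- (d$x + d$y) / 2; d }))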
Here's another possible solution using a for loop. I had the same problem (with more datasets) a few days ago, and the other solutions did not work for me.
Say you have n datasets:
df1 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[24:26])
df2 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[1:3])
...
dfn <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[n:(n + 2)])
The first thing to do is to make a list of the dfs:
df.list<-lapply(1:n, function(x) eval(parse(text=paste0("df", x)))) #In order to store all datasets in one list using their name
names(df.list)<-lapply(1:n, function(x) paste0("df", x)) #Adding the name of each df in case you want to unlist the list afterwards
Afterwards, you can use the for loop (that's the most important part):
for (i in 1:length(df.list)) {
  df.list[[i]][["Avg"]] <- rowMeans(df.list[[i]][1:2])
}
And you have (in case your list only includes the first two datasets):
> df.list
[[1]]
start stop ID Avg
1 0 10 x 5
2 10 20 y 15
3 20 30 z 25
[[2]]
start stop ID Avg
1 0 10 a 5
2 10 20 b 15
3 20 30 c 25
Finally, if you want your modified datasets from the list back in the global environment, you can do:
list2env(df.list,.GlobalEnv)
This technique can be applied to n datasets and other functions.
I find it to be the most flexible solution.
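As a side note, the list-building step can usually be done with mget() instead of eval(parse()); a sketch:
df.list <- mget(paste0("df", 1:n))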

Removal of constant columns in R

I was using the prcomp function when I received this error
Error in prcomp.default(x, ...) :
cannot rescale a constant/zero column to unit variance
I know I can scan my data manually but is there any function or command in R that can help me remove these constant variables?
I know this is a very simple task, but I have never come across any function that does this.
Thanks,
The problem here is that your column variance is equal to zero. You can check which columns of a data frame are constant this way, for example:
df <- data.frame(x=1:5, y=rep(1,5))
df
# x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# Supply names of columns that have 0 variance
names(df[, sapply(df, function(v) var(v, na.rm=TRUE)==0)])
# [1] "y"
So if you want to exclude these columns, you can use :
df[,sapply(df, function(v) var(v, na.rm=TRUE)!=0)]
EDIT: In fact it is simpler to use apply instead. Something like this:
df[,apply(df, 2, var, na.rm=TRUE) != 0]
I guess this Q&A is a popular Google search result but the answer is a bit slow for a large matrix, plus I do not have enough reputation to comment on the first answer. Therefore I post a new answer to the question.
For each column of a large matrix, checking whether the maximum is equal to the minimum is sufficient.
df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]
Here is the benchmark: more than 90% of the time is saved compared to the first answer, and it is also faster than the approach from the second comment on the question.
ncol = 1000000
nrow = 10
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]) # my method
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 })]) # Keith's method
time1
# user system elapsed
# 22.267 0.194 22.626
time2
# user system elapsed
# 2.073 0.077 2.155
time3
# user system elapsed
# 6.702 0.060 6.790
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
Since this Q&A is a popular Google search result, but the answer above is a bit slow for a large matrix and @raymkchow's version is slow with NAs, I propose a new version using exponential search and the power of data.table.
This is a function I implemented in the dataPreparation package.
First build an example data.table, with more rows than columns (which is usually the case) and 10% NAs:
library(data.table)
library(dataPreparation)
ncol = 1000
nrow = 100000
df <- matrix(sample(1:(ncol*nrow), ncol*nrow, replace = FALSE), ncol = ncol)
df <- apply(df, 2, function(x) {x[sample(c(1:nrow), floor(nrow/10))] <- NA; x}) # add 10% of NAs
df[, sample(1:ncol, 70, replace = FALSE)] <- rep(1, times = nrow) # df is a large matrix
df <- as.data.table(df)
Then benchmark all approaches:
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0, with = F]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE)), with = F]) # raymkchow
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 }), with = F]) # Keith's method
time4 <- system.time(df4 <- df[,-which_are_constant(df, verbose=FALSE)]) # My method
The results are the following:
time1 # Variance approch
# user system elapsed
# 2.55 1.45 4.07
time2 # Min = max approach
# user system elapsed
# 2.72 1.5 4.22
time3 # length(unique()) approach
# user system elapsed
# 6.7 2.75 9.53
time4 # Exponential search approach
# user system elapsed
# 0.39 0.07 0.45
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
all.equal(df4, df2)
# [1] TRUE
dataPreparation::which_are_constant is 10 times faster than the other approaches.
Plus, the more rows you have, the greater the benefit.
The janitor package has the function remove_constant that can help delete constant columns.
Let's create some synthetic data for illustration:
library(janitor)
test_dat <- data.frame(A=1, B=1:10, C= LETTERS[1:10])
test_dat
This is test_dat:
> test_dat
A B C
1 1 1 A
2 1 2 B
3 1 3 C
4 1 4 D
5 1 5 E
6 1 6 F
7 1 7 G
8 1 8 H
9 1 9 I
10 1 10 J
then the function remove_constant can help delete the constant column:
remove_constant(test_dat)
remove_constant(test_dat, na.rm= TRUE)
Using either of the two calls above, we get:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
NOTE: use the argument na.rm = TRUE to make sure that a column containing a single value plus NAs is also deleted. For example,
test_dat_with_NA <- data.frame(A=c(1, NA), B=1:10, C= LETTERS[1:10])
test_dat_with_NA
This is the test_dat_with_NA we get:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
then the call
remove_constant(test_dat_with_NA)
does not delete column A:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
while the call
remove_constant(test_dat_with_NA, na.rm = TRUE)
deletes column A, which contains only the value 1 and NAs:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
If you are after a dplyr solution that returns the non-constant variables in a df, I'd recommend the following. Optionally, you can add %>% colnames() if the column names are desired:
library(dplyr)
df <- data.frame(x = 1:5, y = rep(1,5))
# returns dataframe
var_df <- df %>%
select_if(function(v) var(v, na.rm=TRUE) != 0)
var_df %>% colnames() # returns column names
tidyverse version of Keith's comment:
df %>% purrr::keep(~length(unique(.x)) != 1)
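With dplyr 1.0 or later, the select_if() call above can also be written with the where() helper; a sketch:
df %>% select(where(function(v) length(unique(v)) != 1))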
