I'm writing a function to analyse .csv files in a directory on my hard drive, using a series of for and while loops (I know for loops are unpopular in R, but they're good for what I need).
The function creates a number of data-frames, and performs actions on each one in turn before overwriting them and moving on to the next file in the directory to repeat the action.
The part of the code that does not work so far is the creation of a matrix from vectors taken from the data files being analysed. A simplified version of the code is shown below:
data1 <- seq(1, 10, 1)
data2 <- seq(1, 7, 1)
data3 <- seq(1, 5, 1)
n <- max(length(data1), length(data2), length(data3))
k <- c(1, 2, 3)
for(a in k){
if(a == 1){
length(get(paste("data", a, sep = ""))) <- n
data_matrix <- get(paste("data", a, sep = ""))
}else{
while(exists(paste("data", a, sep = ""))){
length(get(paste("data", a, sep = ""))) <- n
data_matrix <- cbind(data_matrix, get(paste("data", a, sep = "")))
}
}
}
The nature of my data is that the length of the columns in my datasets varies with each data collection, so I've adapted a technique found in this post that deals with using cbind to bind objects of different lengths without replicating the data within the smaller objects.
The issue I have when trying to implement this code is I get the error message:
Error in length(get(paste("data", a, sep = ""))) <- n :
target of assignment expands to non-language object
I'm guessing the issue is that the function get() cannot be used to select items in the Global Environment and to modify them in this way.
You could use:
get("x")[1:n]
to get a vector called "x" padded with NA to length n.
That is:
> x=1:3
> n=10
> get("x")[1:n]
[1] 1 2 3 NA NA NA NA NA NA NA
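As for why the original line fails: length(x) <- n is a replacement call, and R needs a plain variable name on the left-hand side to store the result in, which get(...) does not provide. A minimal sketch of one workaround, assuming a hypothetical helper called pad_to that copies the object into a local variable before resizing it:

```r
# pad_to is a hypothetical helper name, not part of base R.
# It fetches an object by name, then pads the local copy with NA.
pad_to <- function(name, n, envir = parent.frame()) {
  v <- get(name, envir = envir)  # copy the object into a named variable
  length(v) <- n                 # now `length<-` has a name to assign to
  v
}

data2 <- 1:7
padded <- pad_to("data2", 10)
padded
```

The original data2 is left untouched; only the returned copy is padded.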
Having said that, this is a neater way to get the matrix you want (hopefully you can adapt to your scenario):
> datalist <- list(data1, data2, data3)
> maxlength <- max(lengths(datalist))
> sapply(datalist, function(x) x[1:maxlength] )
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 2 2
[3,] 3 3 3
[4,] 4 4 4
[5,] 5 5 5
[6,] 6 6 NA
[7,] 7 7 NA
[8,] 8 NA NA
[9,] 9 NA NA
[10,] 10 NA NA
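If the vectors only exist as separately named objects (data1, data2, ...), mget() can collect them into a list by name, so the same sapply call still applies. A small sketch, assuming the objects live in the calling environment and follow the "data" plus number naming used in the question:

```r
data1 <- seq(1, 10, 1)
data2 <- seq(1, 7, 1)
data3 <- seq(1, 5, 1)

# Collect the objects by name into a named list
datalist <- mget(paste0("data", 1:3))

# Pad every vector with NA up to the longest length and bind as columns
maxlength <- max(lengths(datalist))
data_matrix <- sapply(datalist, function(x) x[seq_len(maxlength)])
dim(data_matrix)
```

The column names of data_matrix are taken from the list names, which helps keep track of which file each column came from.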
For those who want to see how the solution proposed by @GeorgeSavva looks using the loop method that I am employing (my loop contained additional errors):
data1 <- seq(1, 10, 1)
data2 <- seq(1, 7, 1)
data3 <- seq(1, 5, 1)
n <- max(length(data1), length(data2), length(data3))
k <- c(1, 2, 3)
for(a in k){
if(a == 1){
data_matrix <- get(paste("data", a, sep = ""))[1:n]
}else{
data_matrix <- cbind(data_matrix, get(paste("data", a, sep = ""))[1:n])
}
}
The while loop was unnecessary. I have written my code this way to make it as versatile as possible, since I obtain a varying number of datasets on a daily basis, each of a varying size.
I can use common operations on each dataset, so I can write a function that will tidy the data, construct charts and compare the datasets automatically without having to write new commands for each analysis.
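For the directory-of-CSV-files case described in the question, the same padding idea might look roughly like this. It is only a sketch under assumptions: the temporary folder, the file names run1.csv and run2.csv, and the column name value are all made up for illustration.

```r
# Create two small example files in a temporary folder (illustrative data only)
tmp_dir <- file.path(tempdir(), "csv_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(value = 1:10), file.path(tmp_dir, "run1.csv"), row.names = FALSE)
write.csv(data.frame(value = 1:7),  file.path(tmp_dir, "run2.csv"), row.names = FALSE)

# Read one column from every .csv file in the folder
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
columns <- lapply(files, function(f) read.csv(f)$value)
names(columns) <- basename(files)

# Pad to the longest column and bind into a matrix
n <- max(lengths(columns))
data_matrix <- sapply(columns, function(x) x[seq_len(n)])
```

Because the files are discovered with list.files, the code does not need to know in advance how many datasets there are.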
I am attempting to create a loop that runs a function with specific values of i in a vector:
For example I would like to save i + 2 for when i is 1 and 5
test<- c()
for(i in c(1,5)){
test[i] <- i + 2
}
This ends up printing NA for 2 ,3 and 4:
[1] 3 NA NA NA 7
while the result I would like is:
[1] 3 7
This is probably very elementary but I cannot seem to figure this out.
R is vectorized, which means you can do this:
c(1, 5) + 2
# [1] 3 7
for loops in R can be slow, especially when they grow a result one element at a time; functions of the *apply family, whose looping is implemented in C, are a compact alternative, e.g.
sapply(c(1, 5), \(i) i + 2)
# [1] 3 7
If you really need to rely on a for loop, you may want to loop over the indices rather than the values (using the values themselves as indices is quite a common mistake!):
v <- c(1, 5)
test <- vector('numeric', length(v))
for (i in seq_along(v)) {
test[i] <- v[i] + 2
}
test
# [1] 3 7
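To see where the NAs in the question come from: assigning to position 5 of a shorter vector makes R extend the vector and fill the gap with NA. A short illustration, with names(res) added as an optional way (my own suggestion) to keep track of which input produced which result:

```r
# Assigning past the end of a vector pads the gap with NA
test <- c()
test[5] <- 7
test

# Looping over positions keeps the result compact;
# names record which input value each result belongs to
vals <- c(1, 5)
res <- numeric(length(vals))
names(res) <- vals
for (k in seq_along(vals)) {
  res[k] <- vals[k] + 2
}
res
```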
Use append
test<- c()
for(i in c(1,5)){
test<-append(test,i+2)
}
I'm trying to figure out how to iteratively load a matrix (this forms part of a bigger function I can't reproduce here).
Let's suppose that I create a matrix:
m <- matrix(c(1:9), nrow = 3, ncol = 3)
m
This matrix can be named "m", "x" or whatever. Then, I need to load the matrix iteratively in the function:
if (interactive()) {
  mat <- readline("Your matrix, please: ")
}
So far, the function "knows" the name of the matrix, since mat returns [1] "m", and m is an object listed in ls(). But when I try to get the matrix values, for example through x <- get(mat), I keep getting an error:
Error in get(mat) : unused argument (mat)
Can anybody be so kind as to tell me what I'm doing wrong here?
1) Assuming you mean interactive, not iterative,
get_matrix <- function() {
nr <- as.numeric(readline("how many rows? "))
cat("Enter space separated data row by row. Enter empty row when finished.\n")
nums <- scan(stdin())
matrix(nums, nr, byrow = TRUE)
}
m <- get_matrix()
Here is a test:
> m <- get_matrix()
how many rows? 3
Enter space separated data row by row. Enter empty row when finished.
1: 1 2
3: 3 4
5: 5 6
7:
Read 6 items
> m
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
>
2) Another possibility is to require that the user create a matrix using R and then just give the name of the matrix:
get_matrix2 <- function(envir = parent.frame()) {
m <- readline("Enter name of matrix: ")
get(m, envir)
}
Test it:
> m <- matrix(1:6, 3)
> mat <- get_matrix2()
Enter name of matrix: m
> mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
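As for the error in the question: "unused argument (mat)" is what R reports when the function being called does not accept that argument, so one plausible cause (an assumption, since the question does not show it) is that an object named get defined elsewhere is masking base::get. A minimal sketch that sidesteps this by calling base::get explicitly; lookup_matrix is a hypothetical helper name:

```r
# lookup_matrix is a hypothetical helper, not an existing R function.
lookup_matrix <- function(name, envir = parent.frame()) {
  if (!exists(name, envir = envir)) {
    stop("no object called '", name, "' found")
  }
  obj <- base::get(name, envir = envir)  # bypasses any masking definition of get
  if (!is.matrix(obj)) {
    stop("'", name, "' is not a matrix")
  }
  obj
}

m <- matrix(1:9, nrow = 3, ncol = 3)
mat <- lookup_matrix("m")
```

Running rm(get) in the session where the error occurs would be a quick way to check whether masking is really the cause.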
I am trying to multiply the elements of a column with each other, but I am unable to do it.
I have column A with values a, b, c, and I want the answer (a*b + a*c + b*c).
For example, with
A <- c(2, 3, 5), the expected output is 6 + 10 + 15 = 31.
I tried to do this with a for loop but it failed. Can anyone please provide R code to do this?
example data :
df1 <- data.frame(A=c(2,3,5))
combn will give you the combinations
combinations <- combn(df1$A,2)
# [,1] [,2] [,3]
# [1,] 2 2 3
# [2,] 3 5 5
apply with margin 2 (by columns), will do the multiplication
multiplied_terms <- apply(combinations,2,function(x) x[1]*x[2])
# [1] 6 10 15
Or shorter and more general, thanks to @zacdav:
multiplied_terms <- apply(combinations,2,prod)
then we can sum them
output <- sum(multiplied_terms)
# [1] 31
Piped for a compact solution:
library(magrittr)
df1$A %>% combn(2) %>% apply(2,prod) %>% sum
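As a side note, combn() accepts a function through its FUN argument and applies it to each combination, so the separate apply step can be folded in. A brief sketch using the same example data:

```r
df1 <- data.frame(A = c(2, 3, 5))

# FUN is applied to every pair, giving the products directly
multiplied_terms <- combn(df1$A, 2, FUN = prod)
output <- sum(multiplied_terms)
output
```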
Here's another way. The approach by @Moody_Mudskipper may be easier to extend to groups of 3 etc., but I think this should be much faster since there is no need to actually build the combinations.
Using for loop
It goes through the vector A, multiplying each element by all the elements after it and adding up the results.
A <- df1$A
len <- length(A)
res <- 0  # running total; numeric(0) would stay empty after every addition
for (j in seq_len(len - 1))
  res <- res + sum(A[j] * A[(j+1) : len])
res
#[1] 31
Using lapply or sapply
The for loop can be replaced by using lapply
res <- sum(unlist(lapply(1 : (len - 1), function(j) sum(A[j] * A[(j+1) : len]))))
or sapply,
res <- sum(sapply(1 : (len - 1), function(j) sum(A[j] * A[(j+1) : len])))
I didn't check which of these is the fastest.
# If you need to store the pairwise multiplications, then use the following;
# res <- NULL
# for (j in 1 : (len-1))
# res <- c(res, A[j] * A[(j+1) : len])
# res
# [1] 6 10 15
# sum(res)
# [1] 31
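If only the total is needed, there is also a closed form that avoids looping altogether. It follows from expanding the square of the sum: (a + b + c)^2 = a^2 + b^2 + c^2 + 2(ab + ac + bc), so the sum of pairwise products is (sum(A)^2 - sum(A^2)) / 2. A quick check against combn, offered as a sketch rather than a benchmarked recommendation:

```r
A <- c(2, 3, 5)

# (2 + 3 + 5)^2 = 100, 4 + 9 + 25 = 38, (100 - 38) / 2 = 31
pair_sum <- (sum(A)^2 - sum(A^2)) / 2
pair_sum

# Same value as summing the explicit pairwise products
check <- sum(combn(A, 2, FUN = prod))
```

With very large or non-integer values the two results may differ by floating-point rounding, so all.equal() is a safer comparison than == in general.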
I have a data frame with different p-values, including missing values (NA):
pvalue2=pvalue[1:679,3:10]
and I need to analyze it: for numbers greater than 0.05 I need to write "Normal", and for values less than 0.05 I need to write the value itself. I want the result to be written to another data frame.
This is my code:
a=data.frame()
for (i in 1:nrow(pvalue2)) {
for (j in 1:ncol(pvalue2)){
if (pvalue2[i,j] >=0.05) {
print (a[i,j]=="Normal")
} else {print a[i,j]==pvalue2[i,j] }
}
}
Can someone help me, please?
a <- ifelse(as.matrix(pvalue2) < .05, as.matrix(pvalue2), "normal")
a <- as.data.frame(a)
Since R is a high-level language that is not compiled, for loops have a tendency to get very slow as they grow. By using vectorized functions instead (which do the looping internally in a lower-level language) you speed up the code and make it more readable.
Example run
> set.seed(123)
> pvalue2 <- matrix(runif(18)/10, 6, 3)
> pvalue2[sample(length(pvalue2), 4)] <- NA
> pvalue2 <- as.data.frame(pvalue2)
> pvalue2
V1 V2 V3
1 0.02875775 0.05281055 0.067757064
2 0.07883051 0.08924190 0.057263340
3 0.04089769 0.05514350 NA
4 0.08830174 0.04566147 0.089982497
5 0.09404673 NA NA
6 NA 0.04533342 0.004205953
> ifelse(as.matrix(pvalue2) < .05, as.matrix(pvalue2), "normal")
V1 V2 V3
[1,] "0.0287577520124614" "normal" "normal"
[2,] "normal" "normal" "normal"
[3,] "0.04089769218117" "normal" NA
[4,] "normal" "0.0456614735303447" "normal"
[5,] "normal" NA NA
[6,] NA "0.0453334156190977" "0.00420595335308462"
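For comparison, here is roughly what the loop from the question would need in order to run. The original compares with == instead of assigning with <-, starts from an empty data frame, and would stop on the first NA because if() cannot test a missing value. The small pvalue2 below is made-up example data:

```r
# Made-up example data with missing values
pvalue2 <- data.frame(V1 = c(0.03, 0.08, NA), V2 = c(0.06, NA, 0.01))

# Pre-allocate a character data frame of the same shape
a <- as.data.frame(
  matrix(NA_character_, nrow(pvalue2), ncol(pvalue2),
         dimnames = list(NULL, names(pvalue2))),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(pvalue2))) {
  for (j in seq_len(ncol(pvalue2))) {
    p <- pvalue2[i, j]
    if (is.na(p)) next                       # leave missing values as NA
    a[i, j] <- if (p >= 0.05) "Normal" else as.character(p)
  }
}
a
```

The vectorized ifelse version above does the same job in one line, which is why it is usually preferred.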
I suppose, your p values are stored as factors. You need to convert them to numeric values first.
tmp <- sapply(pvalue2, function(x) as.numeric(as.character(x)))
Now, the object tmp can be used:
# copy the existing data frame to a new object
df2 <- pvalue2
# fill it with "Normal"
df2[ , ] <- "Normal"
# replace with values from tmp if value < 0.05
df2[tmp < 0.05] <- pvalue2[tmp < 0.05]
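The detour through as.character() matters because as.numeric() applied directly to a factor returns the internal level codes, not the numbers shown on screen. A tiny illustration with made-up values:

```r
f <- factor(c("0.03", "0.2", "0.03"))

codes  <- as.numeric(f)                # level codes: 1 2 1
values <- as.numeric(as.character(f))  # the actual numbers: 0.03 0.20 0.03
```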
Assuming your first data frame is called df:
df_2 <- data.frame(matrix(nrow = nrow(df), ncol = ncol(df)))
for (i in 1:ncol(df)) {
  # & is the vectorized "and"; && only works on single values
  df_2[, i] <- ifelse(!is.na(df[, i]) & df[, i] >= 0.05, "Normal",
                      ifelse(!is.na(df[, i]) & df[, i] < 0.05, df[, i], NA))
}
set.seed(42)
df <- data.frame(a=runif(10,0,0.1),b=runif(10,0,0.1))
#since there are only numeric values
#you can transform to matrix
m <- as.matrix(df)
#new matrix
m2 <- m
m2[m>0.05] <- "Normal"
df2 <- as.data.frame(m2)
a b
1 Normal 0.045774177624844
2 Normal Normal
3 0.0286139534786344 Normal
4 Normal 0.0255428824340925
5 Normal 0.0462292822543532
6 Normal Normal
7 Normal Normal
8 0.013466659723781 0.0117487361654639
9 Normal 0.0474997081561014
10 Normal Normal
I have two vectors, A and B. For every element in A I want to find the index of the first element in B that is greater and has a higher index. A and B have the same length.
So for vectors:
A <- c(10, 5, 3, 4, 7)
B <- c(4, 8, 11, 1, 5)
I want a result vector:
R <- c(3, 3, 5, 5, NA)
Of course I can do it with two loops, but it's very slow, and I don't know how to use apply() in this situation, when the indices matter. My data set has vectors of length 20000, so the speed is really important in this case.
A few bonus questions:
1) What if I have a sequence of numbers (like seq = 2:10), and I want to find the first number in B that is higher than a+s for every a of A and every s of seq?
2) As in question 1), but I want to know the first greater and the first lower value, and create a matrix which stores which one came first. So for example, I have a of A, and 10 from seq. I want to find the first value of B which is higher than a+10 or lower than a-10, and then store its index and value.
sapply(sapply(seq_along(A), function(x) which(B[-seq(x)] > A[x]) + x), "[", 1)
[1] 3 3 5 5 NA
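Since the one-liner is quite dense, here it is unpacked for a single position and then for all positions. This is the same idea split into steps, using the A and B from the question; lapply is used for the inner step only to make it explicit that a list is returned:

```r
A <- c(10, 5, 3, 4, 7)
B <- c(4, 8, 11, 1, 5)

# One position, step by step (x = 3, so A[x] is 3)
x <- 3
later <- B[-seq(x)]           # elements of B after position 3: 1 5
hits <- which(later > A[x])   # positions within `later` that are greater: 2
hits + x                      # shifted back to positions within B: 5

# All positions: every candidate index, then keep the first one (or NA)
candidates <- lapply(seq_along(A), function(x) which(B[-seq(x)] > A[x]) + x)
result <- sapply(candidates, "[", 1)
result
```

For the last element there is nothing after it in B, so the candidate vector is empty and taking its first element gives NA.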
This is a great example of when sapply is less efficient than loops.
Although the sapply does make the code look neater, you are paying for that neatness with time.
Instead you can wrap a while loop inside a for loop inside a nice, neat function.
Here are benchmarks comparing a nested-apply loop against nested for-while loop (and a mixed apply-while loop, for good measure). Update: added the vapply..match.. mentioned in comments. Faster than sapply, but still much slower than while loop.
BENCHMARK:
test elapsed relative
1 for.while 0.069 1.000
2 sapply.while 0.080 1.159
3 vapply.match 0.101 1.464
4 nested.sapply 0.104 1.507
Notice you save about a third of your time; the savings will likely be larger when you start adding the sequences to A.
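The benchmark code itself is not shown above, so the following is only a sketch of how such a comparison might be set up with base R's system.time. The two function names, the random test vectors and their length are my own assumptions, and the timings will differ from machine to machine:

```r
# Loop version: walk forward from j + 1 until a greater value is found
first_greater_loop <- function(A, B) {
  n <- length(A)
  out <- rep(NA_integer_, n)
  for (j in seq_len(n - 1L)) {
    i <- j + 1L
    while (i <= n && B[i] <= A[j]) i <- i + 1L
    if (i <= n) out[j] <- i
  }
  out
}

# Apply version: test every later element, then keep the first hit
first_greater_apply <- function(A, B) {
  vapply(seq_along(A), function(j) {
    hit <- which(B[-seq_len(j)] > A[j])
    if (length(hit)) hit[1] + j else NA_integer_
  }, integer(1))
}

set.seed(1)
A <- runif(2000)
B <- runif(2000)

t_loop  <- system.time(r_loop  <- first_greater_loop(A, B))["elapsed"]
t_apply <- system.time(r_apply <- first_greater_apply(A, B))["elapsed"]

# Both approaches should agree before their timings are compared
same_result <- identical(r_loop, r_apply)
```

The loop can stop at the first hit, while the apply version always scans the rest of B, which is one likely reason for the difference in the table above.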
For the second part of your question:
If you have this all wrapped up in a nice function, it is easy to add a seq to A:
# Sample data
A <- c(10, 5, 3, 4, 7, 100, 2)
B <- c(4, 8, 11, 1, 5, 18, 20)
# Sample sequence
S <- seq(1, 12, 3)
# matrix with all index values (with names cleaned up)
# findIndx.gt is defined under FUNCTIONS below
indexesOfB <- t(sapply(S, function(s) findIndx.gt(A+s, B)))
dimnames(indexesOfB) <- list(S, A)
Lastly, if you want to instead find values of B less than A, just swap the operation in the function.
(You could include an if-clause in the function and use only a single function. I find it more efficient
to have two separate functions)
findIndx.gt(A, B) # [1] 3 3 5 5 6 NA 8 NA NA
findIndx.lt(A, B) # [1] 2 4 4 NA 8 7 NA NA NA
Then you can wrap it up in one nice package:
rangeFindIndx(A, B, S)
# A S indxB.gt indxB.lt
# 10 1 3 2
# 5 1 3 4
# 3 1 5 4
# 4 1 5 NA
# 7 1 6 NA
# 100 1 NA 7
# 2 1 NA NA
# 10 4 6 4
# 5 4 3 4
# ...
FUNCTIONS
(Notice they depend on reshape2)
rangeFindIndx <- function(A, B, S) {
  # For each s in S, and for each a in A,
  # find the first value of B which is higher than a+s, or lower than a-s
  require(reshape2)
  # Create gt & lt matrices; add dimnames for the melting function
  indexesOfB.gt <- sapply(S, function(s) findIndx.gt(A+s, B))
  indexesOfB.lt <- sapply(S, function(s) findIndx.lt(A-s, B))
  dimnames(indexesOfB.gt) <- dimnames(indexesOfB.lt) <- list(A, S)
  # melt the matrices and combine into one
  gtltMatrix <- cbind(melt(indexesOfB.gt), melt(indexesOfB.lt)$value)
  # clean up their names
  names(gtltMatrix) <- c("A", "S", "indxB.gt", "indxB.lt")
  return(gtltMatrix)
}
findIndx.gt <- function(A, B) {
  lng <- length(A)
  # NA means no later element of B qualifies (always true for the last element)
  ret <- rep(NA_integer_, lng)
  for (j in seq_len(max(lng - 1L, 0L))) {
    i <- j + 1L
    # move forward while B[i] is still below A[j]
    while (i <= lng && B[[i]] < A[[j]]) {
      i <- i + 1L
    }
    # i == lng is a valid hit, so the check must be <= rather than <
    if (i <= lng) ret[j] <- i
  }
  ret
}
findIndx.lt <- function(A, B) {
  lng <- length(A)
  ret <- rep(NA_integer_, lng)
  for (j in seq_len(max(lng - 1L, 0L))) {
    i <- j + 1L
    while (i <= lng && B[[i]] > A[[j]]) { # this line contains the only difference from findIndx.gt
      i <- i + 1L
    }
    if (i <= lng) ret[j] <- i
  }
  ret
}