Two-Way Data Table in R - r

I am trying to do the following in R.
Along the rows, I have a set of values for a variable X. Along the columns, I have a set of values for a variable Y.
For each combination of X and Y, I would like to perform a calculation and then summarize the results in a two-way data table.
One way I thought of was to create a row matrix containing each combination of row and column, then rbind all the rows. But that process would be tedious and time-consuming. Is there a more efficient way to build this table using R?
Thanks.

What you need is the function outer. Here is a simple example of its use.
x = 1:5
y = seq(1, 9, 2)
names(x) = x
names(y) = y
MyFunction = function(x,y) x^2 + y^2
outer(x, y, MyFunction)
   1  3  5  7   9
1  2 10 26 50  82
2  5 13 29 53  85
3 10 18 34 58  90
4 17 25 41 65  97
5 26 34 50 74 106
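One caveat worth sketching: outer calls the function once with the fully expanded vectors, so the function has to be vectorized. If the calculation only works on single values, wrapping it in Vectorize is one way around that. The function my_scalar_fun below is a hypothetical example, not part of the original question.

```r
x <- 1:5
y <- seq(1, 9, 2)

# Hypothetical calculation that only works on one (x, y) pair at a time,
# because `if` needs a single TRUE/FALSE condition.
my_scalar_fun <- function(x, y) {
  if (x > y) x - y else x^2 + y^2
}

# Vectorize() wraps the scalar function so outer() can pass whole vectors.
res <- outer(x, y, Vectorize(my_scalar_fun))

# Label rows with the x values and columns with the y values.
dimnames(res) <- list(x, y)
res
```

This is slower than a truly vectorized function, since Vectorize loops over every pair internally, but it keeps the two-way table layout.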

Related

How to create area around value?

Sorry, my question is probably not very clear, because I cannot formulate it well. I will explain by example.
I have two dataframes df and df1:
df <- data.frame(a = c(25,15,35,45,2))
df1 <- data.frame(b = c(28,25,24,43,10))
I want to merge the two dataframes on the condition that the values are within 5 of each other, and create a column distance. For example, the first element in column a is 25; I want to compare 25 with all elements in column b and select only those within 5 of 25. The output should look like:
a  b  distance
25 28 3
   24 1
   25 0
15 10 5
45 43 2
Values with no match within 5, such as 2 and 35, should be excluded.
We may use outer to create a logical matrix, then get the row/column indices of the TRUE entries with which and arr.ind = TRUE. Use those indices to subset the 'a' and 'b' columns from the corresponding datasets and compute the difference.
i1 <- which(outer(df$a, df1$b, FUN = function(x, y)
abs(x - y) <=5), arr.ind = TRUE)
transform(data.frame(a = df$a[i1[,1]], b = df1$b[i1[,2]]), distance = abs(a - b))
Output:
   a  b distance
1 25 28        3
2 25 25        0
3 25 24        1
4 45 43        2
5 15 10        5
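As a rough alternative sketch (not the approach from the answer above), the same pairs can be found by building every combination with expand.grid and then filtering. The names pairs and result are illustrative only.

```r
df  <- data.frame(a = c(25, 15, 35, 45, 2))
df1 <- data.frame(b = c(28, 25, 24, 43, 10))

# All combinations of a and b (5 x 5 = 25 rows).
pairs <- expand.grid(a = df$a, b = df1$b)

# Absolute difference for every pair, then keep only pairs within 5.
pairs$distance <- abs(pairs$a - pairs$b)
result <- pairs[pairs$distance <= 5, ]
result
```

Like outer, this materializes every combination, so it is only practical when the product of the two lengths is small.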

R apply a vector of functions to a dataframe

I am currently working on a data frame with raw numeric data in columns. Each column contains data for one parameter (for example, gene expression data for gene xyz), while each row contains a subject. Some of the columns are normally distributed, while some are far from it. I ran Shapiro tests using apply with margin 2 for different transformations and then picked suitable transformations by comparing shapiro.test()$p.value. I stored my picks as character strings in a vector, giving me a vector of NA, log10, sqrt with length ncol(DataFrame). I now wonder whether it is possible to apply this vector to the data frame via an apply function or, if necessary, a for loop. How do I do this, or is there a better way? I guess I could loop over if-else statements, but there has to be a more efficient way, because my code is already slow.
Thanks all!
Update: I tried the code below but it is giving me "Error in file(filename, "r") : invalid 'description' argument"
TransformedExampleDF <- apply(exampleDF, 2 , function(x) eval(parse(paste(transformationVector , "(" , x , ")" , sep = "" ))))
exampleDF <- as.data.frame(matrix(c(1,2,3,4,1,10,100,1000,0.1,0.2,0.3,0.4), ncol=3, nrow = 4))
transformationVector <- c(NA, "log10", NA)
So you could do something like this. In the example below, I've cooked up four arbitrary functions whose names I've then stored in the character vector my_func_list (Note: the last function converts the data to NA; that is intentional).
Then, I created another function func_to_df() that accepts the data.frame and the vector of function names (func_list) as inputs, and applies (i.e., looks up with get() and executes) each function on the corresponding column of the data.frame. The output is returned (and in this example, is stored in the data.frame my_df1).
tl;dr: just look at what func_to_df() does. It might also be worthwhile looking into the purrr package (although it hasn't been used here).
#---------------------
#Example function 1
myaddtwo <- function(x){
  if(is.numeric(x)){
    x = x+2
  } else{
    warning("Input must be numeric!")
  }
  return(x)
  #Constraints such as the one shown above
  #can be added elsewhere to prevent
  #inappropriate action
}
#Example function 2
mymulttwo <- function(x){
  return(x*2)
}
#Example function 3
mysqrt <- function(x){
  return(sqrt(x))
}
#Example function 4
myna <- function(x){
  return(NA)
}
#---------------------
#Dummy data
my_df <- data.frame(
  matrix(sample(1:100, 40, replace = TRUE),
         nrow = 10, ncol = 4),
  stringsAsFactors = FALSE)
#User somehow ascertains that
#the following order of functions
#is the right one to be applied to the data.frame
my_func_list <- c("myaddtwo", "mymulttwo", "mysqrt", "myna")
#---------------------
#A function which applies
#the functions from func_list
#to the columns of df
func_to_df <- function(df, func_list){
  #seq_along() is safe even if func_list is empty
  for(i in seq_along(func_list)){
    df[, i] <- get(func_list[i])(df[, i])
    #Alternative to get()
    #df[, i] <- eval(as.name(func_list[i]))(df[, i])
  }
  return(df)
}
#---------------------
#Execution
my_df1 <- func_to_df(my_df, my_func_list)
#---------------------
#Output
my_df
# X1 X2 X3 X4
# 1 8 85 6 41
# 2 45 7 8 65
# 3 34 80 16 89
# 4 34 62 9 31
# 5 98 47 51 99
# 6 77 28 40 72
# 7 24 7 41 46
# 8 45 80 75 30
# 9 93 25 39 72
# 10 68 64 87 47
my_df1
# X1 X2 X3 X4
# 1 10 170 2.449490 NA
# 2 47 14 2.828427 NA
# 3 36 160 4.000000 NA
# 4 36 124 3.000000 NA
# 5 100 94 7.141428 NA
# 6 79 56 6.324555 NA
# 7 26 14 6.403124 NA
# 8 47 160 8.660254 NA
# 9 95 50 6.244998 NA
# 10 70 128 9.327379 NA
#---------------------
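For the data from the question itself, a minimal sketch along the same lines could use Map with match.fun, treating NA as "leave the column unchanged". The helper name apply_transformations is hypothetical, and the NA handling is an assumption about what the asker intended.

```r
exampleDF <- as.data.frame(matrix(c(1, 2, 3, 4, 1, 10, 100, 1000, 0.1, 0.2, 0.3, 0.4),
                                  ncol = 3, nrow = 4))
transformationVector <- c(NA, "log10", NA)

# Hypothetical helper: apply the i-th named function to the i-th column.
# An NA entry is assumed to mean "no transformation".
apply_transformations <- function(df, transformations) {
  stopifnot(length(transformations) == ncol(df))
  df[] <- Map(function(column, f) {
    if (is.na(f)) column else match.fun(f)(column)
  }, df, transformations)
  df
}

transformedDF <- apply_transformations(exampleDF, transformationVector)
transformedDF
```

As for the error in the question: the first argument of parse() is a file, so a string of code has to be passed as parse(text = ...). Looking the function up by name, as here or with get(), avoids building code strings at all.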

How to exclude variables in columns and save the column to a different name?

Question: I have a data table with a column named "xyz". In that column I want to drop the students marked "outstate", then save the result under the name "ppp".
PPP <- data[code here, ]
Since the question does not include enough accompanying data or code to make the question understandable, it's assumed that what is wanted is to add a new variable to an existing dataframe. If so, then just assign the new vector to the existing dataframe. The new vector should be the same length as the existing df.
x <- c(1:5)
y <- c(12:16)
z <- c(24:20)
df <- data.frame(x, y, z) # existing dataframe
o <- c(99:95) # other vector
df$new <- o # other vector assigned to existing df
print(df)
  x  y  z new
1 1 12 24  99
2 2 13 23  98
3 3 14 22  97
4 4 15 21  96
5 5 16 20  95
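If the question is instead read as "drop the rows where xyz is outstate and keep the rest under a new name", a minimal sketch could look like the following. The data frame contents here are made up for illustration; only the column name xyz, the value "outstate" and the result name PPP come from the question.

```r
# Hypothetical stand-in for the asker's data.
data <- data.frame(
  id  = 1:5,
  xyz = c("instate", "outstate", "instate", "outstate", "instate"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose xyz value is not "outstate".
PPP <- data[data$xyz != "outstate", ]
PPP
```

If xyz can contain NA, the comparison returns NA for those rows and they show up as all-NA rows in the result; filtering with !(data$xyz %in% "outstate") keeps them intact instead.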

Create frequency vector based on input vector

I have a variable test in the structure:
> test <- c(9,87)
> names(test) <- c("VGP", "GGW")
> dput(test)
structure(c(9, 87), .Names = c("VGP", "GGW"))
> class(test)
[1] "numeric"
This is a very simplified version of the input vector. I want the output to be a vector of length 100 which contains the frequency of each number from 1 to 100 inclusive. The real input vector has length ~1000000, so I am looking for an approach that will work for a vector of any length, assuming only the numbers 1-100 are in it.
In this example, all positions except 9 and 87 will show 0, and the 9th and 87th elements will both show 50.
How can I generate this output?
If we are looking for percentages that also cover the values absent from the vector (shown as 0), convert the vector to a factor with the levels specified, then apply table and prop.table:
100*prop.table(table(factor(test, levels = 1:100)))
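A possible shortcut for this specific case, sketched here on the example input: base R's tabulate counts positive integers directly into a fixed number of bins, which skips the factor conversion. The variable names counts and percent are illustrative.

```r
test <- c(VGP = 9, GGW = 87)

# Count how often each of the integers 1..100 occurs.
# Values outside 1..nbins are ignored by tabulate().
counts <- tabulate(test, nbins = 100)

# Convert counts to percentages of the input length.
percent <- 100 * counts / sum(counts)
percent[c(9, 87)]
```

The result is a plain numeric vector of length 100 without names, so position i holds the percentage for the value i.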
freq <- vector(mode = "numeric", length = 100)
for (i in X) {
  if (i >= 1 && i <= 100) {
    freq[i] <- freq[i] + 1
  }
}
freq
Here X is the input vector (10000 elements in this example).
The if condition ensures that only values in the range [1, 100] are counted.
Hope this helps.
If you have a numeric vector and just want to get a frequency table of the values, use the table function.
set.seed(1234)
d <- sample(1:10, 1000, replace = TRUE)
x <- table(d)
x
# 1 2 3 4 5 6 7 8 9 10
# 92 98 101 104 87 112 104 94 88 120
If some possible values may not occur in the data (say 11 is a possible value in my example), then I'd do the following:
y <- rep(0, 11)
names(y) <- as.character(1:11)
y[as.numeric(names(x))] <- x
y
#  1   2   3   4   5   6   7   8   9  10  11
# 92  98 101 104  87 112 104  94  88 120   0

Fastest way to find nearest value in vector

I have two integer/posixct vectors:
a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements
Now my resulting vector c should contain for each element of vector a the nearest element of b:
c <- c(4,4,4,4,4,6,6,...)
I tried it with apply and which.min(abs(a - b)) but it's very very slow.
Is there any more clever way to solve this? Is there a data.table solution?
As it is presented in this link you can do either:
which(abs(x - your.number) == min(abs(x - your.number)))
or
which.min(abs(x - your.number))
where x is your vector and your.number is the value. If you have a matrix or data.frame, first convert it to a numeric vector (for example with as.numeric or unlist) and then apply this to the resulting vector.
For example:
x <- 1:100
your.number <- 21.5
which(abs(x - your.number) == min(abs(x - your.number)))
would output:
[1] 21 22
Update: Based on the very kind comment of hendy I have added the following to make it more clear:
Note that the answers above (i.e. 21 and 22) are the indexes of the items (this is how which() works in R), so if you want the actual values, you have to use these indexes to look them up. Let's have another example:
x <- seq(from = 100, to = 10, by = -5)
x
[1] 100 95 90 85 80 75 70 65 60 55 50 45 40 35 30 25 20 15 10
Now let's find the number closest to 42:
your.number <- 42
target.index <- which(abs(x - your.number) == min(abs(x - your.number)))
x[target.index]
which would output the "value" we are looking for from the x vector:
[1] 40
Not quite sure how it will behave with your volume but cut is quite fast.
The idea is to cut your vector a at the midpoints between the elements of b.
Note that I am assuming the elements in b are strictly increasing!
Something like this:
a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements
cuts <- c(-Inf, b[-1]-diff(b)/2, Inf)
# Will yield: c(-Inf, 5, 8, 13, Inf)
cut(a, breaks=cuts, labels=b)
# [1] 4 4 4 4 4 6 6 6 10 10 10 10 10 16 16
# Levels: 4 6 10 16
This is even faster using a lower-level function like findInterval (which, again, assumes that breakpoints are non-decreasing).
findInterval(a, cuts)
# [1] 1 1 1 1 2 2 2 3 3 3 3 3 4 4 4
So of course you can do something like:
index = findInterval(a, cuts)
b[index]
# [1] 4 4 4 4 6 6 6 10 10 10 10 10 16 16 16
Note that you can choose what happens to elements of a that are equidistant from two elements of b by passing the relevant arguments to cut (or findInterval); see their help pages.
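The midpoint idea above can be wrapped in a small helper. This is only a sketch: the name nearest_value is hypothetical, and sorting b inside the function is an added assumption so that unsorted input still works.

```r
# Hypothetical helper: for each element of a, return the nearest element of b.
nearest_value <- function(a, b) {
  # findInterval() needs non-decreasing breakpoints, so sort and de-duplicate b.
  b <- sort(unique(b))
  # Breakpoints at the midpoints between neighbouring elements of b.
  cuts <- c(-Inf, b[-1] - diff(b) / 2, Inf)
  # findInterval() gives the index of the interval each a falls into,
  # which is also the index of the nearest element of b.
  b[findInterval(a, cuts)]
}

a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
b <- c(16, 4, 10, 6)   # deliberately unsorted
nearest_value(a, b)
# [1]  4  4  4  4  6  6  6 10 10 10 10 10 16 16 16
```

With the default findInterval settings, an element of a that lies exactly on a midpoint is assigned to the larger neighbour (5 maps to 6, 8 to 10, 13 to 16).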
library(data.table)
a=data.table(Value=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))
a[,merge:=Value]
b=data.table(Value=c(4,6,10,16))
b[,merge:=Value]
setkeyv(a,c('merge'))
setkeyv(b,c('merge'))
Merge_a_b=a[b,roll='nearest']
When two data.tables are joined, data.table offers the option roll = 'nearest', which matches each row of the table inside the brackets to the nearest key value in the outer table. The resulting data.table has as many rows as the table inside the brackets (b here), and the join requires a common key as usual. To get one row for each element of a instead, swap the roles and use b[a, roll = 'nearest'].
For those who would be satisfied with the slow solution:
sapply(a, function(a, b) {b[which.min(abs(a-b))]}, b)
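Here is a simple base R option, using max.col + outer. Note that outer builds a length(a) by length(b) matrix, so this only suits small inputs, and that max.col breaks ties at random by default (see its ties.method argument):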
b[max.col(-abs(outer(a,b,"-")))]
which gives
> b[max.col(-abs(outer(a,b,"-")))]
[1] 4 4 4 4 6 6 6 10 10 10 10 10 16 16 16
Late to the party, but there is now a function in the DescTools package called Closest which does almost exactly what you want (it just handles one target value at a time).
To get around this we can lapply over your vector a and find the closest value for each element.
library(DescTools)
lapply(a, function(i) Closest(x = b, a = i))
You might notice that more values are being returned than exist in a. This is because Closest will return both values if the value you are testing is exactly between two (e.g. 3 is exactly between 1 and 5, so both 1 and 5 would be returned).
To get around this, put either min or max around the result:
lapply(a, function(i) min(Closest(x = b, a = i)))
lapply(a, function(i) max(Closest(x = b, a = i)))
Then unlist the result to get a plain vector :)
