I am currently working on a data frame with raw numeric data in columns. Every column contains data for one parameter (for example, gene expression data of gene xyz), while each row contains a subject. Some of the data in the columns are normally distributed, while some are far from it. I ran Shapiro tests using apply with margin 2 for different transformations and then picked suitable transformations by comparing shapiro.test()$p.value. I stored my picks as character strings in a vector, giving me a vector of NA, log10, sqrt with length ncol(DataFrame). I now wonder if it is possible to apply the vector to the data frame via an apply function or, if necessary, a for loop. How do I do this, or is there a better way? I guess I could loop over if-else statements, but there has to be a more efficient way because my code is already slow.
Thanks all!
Update: I tried the code below but it is giving me "Error in file(filename, "r") : invalid 'description' argument"
TransformedExampleDF <- apply(exampleDF, 2 , function(x) eval(parse(paste(transformationVector , "(" , x , ")" , sep = "" ))))
exampleDF <- as.data.frame(matrix(c(1,2,3,4,1,10,100,1000,0.1,0.2,0.3,0.4), ncol=3, nrow = 4))
transformationVector <- c(NA, "log10", NA)
So you could do something like this. In the example below, I've cooked up four arbitrary functions whose names I've then stored in the character vector my_func_list (Note: the last function converts data to NA; that is intentional).
Then, I created another function func_to_df() that accepts the data.frame and the vector of function names (func_list) as inputs, and applies (i.e., executes using get()) each function to the corresponding column of the data.frame. The output is returned (and in this example, is stored in the data.frame my_df1).
tl;dr: just look at what func_to_df() does. It might also be worthwhile looking into the purrr package (although it hasn't been used here).
#---------------------
#Example function 1
myaddtwo <- function(x){
if(is.numeric(x)){
x = x+2
} else{
warning("Input must be numeric!")
}
return(x)
#Constraints such as the one shown above
#can be added elsewhere to prevent
#inappropriate action
}
#Example function 2
mymulttwo <- function(x){
return(x*2)
}
#Example function 3
mysqrt <- function(x){
return(sqrt(x))
}
#Example function 4
myna <- function(x){
return(NA)
}
#---------------------
#Dummy data
my_df <- data.frame(
matrix(sample(1:100, 40, replace = TRUE),
nrow = 10, ncol = 4),
stringsAsFactors = FALSE)
#User somehow ascertains that
#the following order of functions
#is the right one to be applied to the data.frame
my_func_list <- c("myaddtwo", "mymulttwo", "mysqrt", "myna")
#---------------------
#A function which applies
#the functions from func_list
#to the columns of df
func_to_df <- function(df, func_list){
for(i in seq_along(func_list)){
df[, i] <- get(func_list[i])(df[, i])
#Alternative to get()
#df[, i] <- eval(as.name(func_list[i]))(df[, i])
}
return(df)
}
#---------------------
#Execution
my_df1 <- func_to_df(my_df, my_func_list)
#---------------------
#Output
my_df
# X1 X2 X3 X4
# 1 8 85 6 41
# 2 45 7 8 65
# 3 34 80 16 89
# 4 34 62 9 31
# 5 98 47 51 99
# 6 77 28 40 72
# 7 24 7 41 46
# 8 45 80 75 30
# 9 93 25 39 72
# 10 68 64 87 47
my_df1
# X1 X2 X3 X4
# 1 10 170 2.449490 NA
# 2 47 14 2.828427 NA
# 3 36 160 4.000000 NA
# 4 36 124 3.000000 NA
# 5 100 94 7.141428 NA
# 6 79 56 6.324555 NA
# 7 26 14 6.403124 NA
# 8 47 160 8.660254 NA
# 9 95 50 6.244998 NA
# 10 70 128 9.327379 NA
#---------------------
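For the setup in the question, where NA in transformationVector means "leave the column unchanged", a minimal sketch could look like the following. The helper name apply_transformations is made up for illustration, and match.fun() is used to look up each function by name, much like get() above.

```r
# Data and transformation picks from the question
exampleDF <- as.data.frame(matrix(c(1, 2, 3, 4, 1, 10, 100, 1000, 0.1, 0.2, 0.3, 0.4),
                                  ncol = 3, nrow = 4))
transformationVector <- c(NA, "log10", NA)

# Hypothetical helper: apply the i-th named function to the i-th column,
# treating NA as "no transformation"
apply_transformations <- function(df, transformations) {
  stopifnot(length(transformations) == ncol(df))
  df[] <- Map(function(column, fname) {
    if (is.na(fname)) column else match.fun(fname)(column)
  }, df, transformations)
  df
}

transformedDF <- apply_transformations(exampleDF, transformationVector)
transformedDF
```

Because the function is looked up by name, no eval(parse()) is needed, and assigning into df[] keeps the result a data.frame rather than a matrix.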
Let's say I have a simple toy vector in R like:
x = seq(1:10);x
[1] 1 2 3 4 5 6 7 8 9 10
I want to use the rollapply function from the zoo package, but in a different way. rollapply applies a function to a vector x, with the width argument defining a rolling window. Instead of rolling, I want the window to be expanding. There are similar questions here and here, but they don't help me with my problem.
For example, I want to calculate the sum of the first five observations of vector x and then expand the window by 2 each time.
Doing so I did :
rollapplyr(x, seq_along(x) ,sum,by=2,partial = 5,fill=NA)
[1] NA NA NA NA 15 21 28 36 45 55
or replace the NA's
na.locf0(rollapplyr(x, 5 ,sum,by=2,partial = 5,fill=NA))
[1] NA NA NA NA 15 15 25 25 35 35
But what I ideally want as a result is:
[1] NA NA NA NA 15 15 28 28 45 45
Imagine that my dataset is huge (contains 2500 time series observations) and the function is some econometric - statistical model not a simple one like the sum that I use here.
How can I do it? Any help ?
x <- seq(10)
expandapply <- function(x, start, by, FUN){
# set points to apply function up to
checkpoints <- seq(start, length(x), by)
# apply function to all windows
vals <- sapply(checkpoints, function(i) FUN(x[seq(i)]))
# fill in numeric vector at these points (assumes output is numeric)
out <- replace(rep(NA_real_, length(x)), checkpoints, vals)
# forward-fill the gaps
zoo::na.locf(out, na.rm = FALSE)
}
expandapply(x, start = 5, by = 2, FUN = sum)
#> [1] NA NA NA NA 15 15 28 28 45 45
Created on 2022-03-13 by the reprex package (v2.0.1)
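If a dependency on zoo is not wanted, the forward-fill step can be sketched in base R with findInterval(). The function name expanding_apply is hypothetical, and the sketch assumes that FUN returns a single number and that start is not larger than length(x).

```r
# Hypothetical base-R variant of the expanding-window idea
expanding_apply <- function(x, start, by, FUN) {
  # positions at which the function is actually evaluated
  checkpoints <- seq(start, length(x), by = by)
  # evaluate FUN on the expanding window x[1..i] at each checkpoint
  vals <- vapply(checkpoints, function(i) FUN(x[seq_len(i)]), numeric(1))
  # for every position, index of the latest checkpoint at or before it (0 = none yet)
  idx <- findInterval(seq_along(x), checkpoints)
  out <- rep(NA_real_, length(x))
  out[idx > 0] <- vals[idx]  # zero indices are dropped on the right-hand side
  out
}

res <- expanding_apply(1:10, start = 5, by = 2, FUN = sum)
res
# [1] NA NA NA NA 15 15 28 28 45 45
```

Since FUN is only called at the checkpoints, an expensive model fit is run length(checkpoints) times rather than once per observation.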
Define nonNA as the positions which should not be NA. You can change x and nonNA to whatever you need.
Then assign w a vector of widths, using zero for those components which are to be NA. Finally, apply na.locf0.
(The two extreme cases are that if nonNA is seq_along(x) so that all elements are not to be NA'd out then this is the same as rollapplyr(x, seq_along(x), sum) and if nonNA is c() so that there are no non-NAs then it returns all NAs.)
library(zoo)
x <- 1:10
nonNA <- seq(5, length(x), 2)
w <- ifelse(seq_along(x) %in% nonNA, seq_along(x), 0)
na.locf0(rollapplyr(x, w, function(x) if (length(x)) sum(x) else NA, fill=NA))
## [1] NA NA NA NA 15 15 28 28 45 45
Another way is to use a list for the width= argument of rollapply whose components contain the offsets. x and nonNA are from above.
L <- lapply(seq_along(x), function(x) if (x %in% nonNA) -seq(x-1, 0))
na.locf0(rollapplyr(x, L, sum, fill = NA))
## [1] NA NA NA NA 15 15 28 28 45 45
Update
Simplified solution and added second approach.
I have two data frames. The first is called height.txt
ID: 1 2 3 4 5
Height: 67 60 62 55 69
and the next data frame is weight.txt
ID: 1 2 4 5 6
Weight: 110 123 150 170 185
The goal is to merge these two data frames together, without using the merge() function in R, and the output should be in the image. How would I do this? This is practice for a class: I know merge() does the trick, but I am supposed to do this without it.
Edit:
Data in a copy&paste format.
ID <- scan(text = "1 2 3 4 5")
Height <- scan(text = "67 60 62 55 69")
df1 <- data.frame(ID, Height)
ID <- scan(text = "1 2 4 5 6")
Weight <- scan(text = "110 123 150 170 185")
df2 <- data.frame(ID, Weight)
It's a simple repeated use of match.
Create a data.frame with all the elements of the common column, ID, with no repetitions.
match the ID's of each of the dataframes with the ID of the result res.
Assign the other columns.
Remember to create each of the other columns before assigning values to them.
res <- data.frame(ID = unique(c(df1$ID, df2$ID)))
i <- match(df1$ID, res$ID)
j <- na.omit(match(res$ID, df1$ID))
res$Height <- NA
res$Height[i] <- df1$Height[j]
i <- match(df2$ID, res$ID)
j <- na.omit(match(res$ID, df2$ID))
res$Weight <- NA
res$Weight[i] <- df2$Weight[j]
res
# ID Height Weight
#1 1 67 110
#2 2 60 123
#3 3 62 NA
#4 4 55 150
#5 5 69 170
#6 6 NA 185
identical(res, merge(df1, df2, all = TRUE))
#[1] TRUE
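The match steps above can also be condensed by looking up each data frame's rows from the combined ID vector. This is only a sketch for the two example data frames; the names all_ids and res2 are made up for illustration, and the data are re-created with c() so the block runs on its own.

```r
# Example data from the question
df1 <- data.frame(ID = c(1, 2, 3, 4, 5), Height = c(67, 60, 62, 55, 69))
df2 <- data.frame(ID = c(1, 2, 4, 5, 6), Weight = c(110, 123, 150, 170, 185))

# All IDs that occur in either data frame, sorted
all_ids <- sort(union(df1$ID, df2$ID))

# match() returns NA for IDs missing from a data frame,
# and indexing with NA yields NA in the result column
res2 <- data.frame(
  ID     = all_ids,
  Height = df1$Height[match(all_ids, df1$ID)],
  Weight = df2$Weight[match(all_ids, df2$ID)]
)
res2
```

Indexing with the result of match(all_ids, ...) does not depend on the row order of either input, as long as the IDs are unique within each data frame.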
Edit.
Answering a question in a comment about how general this solution is. From help("merge"):
Details
merge is a generic function whose principal method is for data frames:
the default method coerces its arguments to data frames and calls the
"data.frame" method.
The method merge.data.frame in R 3.6.2 is 158 code lines long, so this solution is not general at all.
Edit 2.
A function generalizing the code above could be the following.
merge_by_one_col <- function(X, Y, col = "ID"){
common <- unique(c(X[[col]], Y[[col]]))
res <- data.frame(common)
names(res) <- col
i <- match(X[[col]], res[[col]])
j <- na.omit(match(res[[col]], X[[col]]))
for(new in setdiff(names(X), col)){
res[[new]] <- NA
res[[new]][i] <- X[[new]][j]
}
i <- match(Y[[col]], res[[col]])
j <- na.omit(match(res[[col]], Y[[col]]))
for(new in setdiff(names(Y), names(res))){
res[[new]] <- NA
res[[new]][i] <- Y[[new]][j]
}
res
}
merge_by_one_col(df1, df2)
I used cbind after rbinding the missing IDs from each data frame and sorting by ID.
df1_ <- rbind(df1, data.frame(ID=setdiff(df2$ID, df1$ID), Height=NA))
df2_ <- rbind(df2, data.frame(ID=setdiff(df1$ID, df2$ID), Weight=NA))
cbind(df1_[order(df1_$ID),], df2_[order(df2_$ID), -1, drop=FALSE])
ID Height Weight
1 1 67 110
2 2 60 123
3 3 62 NA
4 4 55 150
5 5 69 170
6 6 NA 185
Edit:
Generalizing so that no column names are required (except the "by" column "ID")
n1 <- setdiff(df1$ID, df2$ID); n1
n2 <- setdiff(df2$ID, df1$ID); n2
df1a <- df1[rep(nrow(df1)+1, length(n2)),]; df1a
df2a <- df2[rep(nrow(df2)+1, length(n1)),]; df2a
df1a$ID <- n2
df2a$ID <- n1
df1_ <- rbind(df1, df1a)
df2_ <- rbind(df2, df2a)
res <- cbind(df1_[order(df1_$ID),], df2_[order(df2_$ID), -1, drop=FALSE])
rownames(res) <- 1:nrow(res)
res
ID Height Weight
1 1 67 110
2 2 60 123
3 3 62 NA
4 4 55 150
5 5 69 170
6 6 NA 185
Edit 2: Using rbind.fill from the plyr package:
library(plyr)
df1_ <- rbind.fill(df1, data.frame(ID=setdiff(df2$ID, df1$ID)))
df2_ <- rbind.fill(df2, data.frame(ID=setdiff(df1$ID, df2$ID)))
res <- cbind(df1_[order(df1_$ID),], df2_[order(df2_$ID), -1, drop=FALSE])
identical(res, merge(df1, df2, all=TRUE))
# TRUE
I have a vector of True and False values. The length of the vector is 1000.
vect <- [T T F T F F..... x1000]
I want to loop over the first 100 values (i.e. 1:100), count the true and false values, and store the result in some variables (e.g. True <- 51, False <- 49). Then I want to loop over the next 100 values (101:200) and do the same computation as before, and so on until I reach 1000.
The code below is pretty standard but, instead of slicing the vector, it calculates sums for the entire vector.
count_True = 0
count_False = 0
for (i in vect){
if (i){
count_True = count_True + 1
}
else {
count_False = count_False + 1
}
}
I am aware you can split the vector by
vect_splt <- split(vect,10)
but is there a way to combine these to do what I wanted or any other way?
Does something like this work:
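library(dplyr)
set.seed(42)
vect <- sample(rep(c(T, F), 500))
vect <- tibble(vect)
# Note: row_number() %/% 100 puts rows 1-99 in group 0 and row 1000 alone in group 10;
# use (row_number() - 1) %/% 100 instead for ten even groups of 100
vect %>%
mutate(seq = row_number() %/% 100) %>%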
group_by(seq) %>%
summarise(n_TRUE = sum(vect),
n_FALSE = sum(!vect))
# A tibble: 11 x 3
seq n_TRUE n_FALSE
<dbl> <int> <int>
1 0 42 57
2 1 56 44
3 2 50 50
4 3 55 45
5 4 43 57
6 5 48 52
7 6 48 52
8 7 54 46
9 8 51 49
10 9 53 47
11 10 0 1
We can combine split with table. With a grouping index created with gl, split the vector into a list of vectors, get the counts with table, and store them in a list.
out <- lapply(split(vect, as.integer(gl(length(vect), 100, length(vect)))), table)
It can be combined into a single matrix with rbind
out1 <- do.call(rbind, out)
data
set.seed(24)
vect <- sample(c(TRUE, FALSE), 1000, replace = TRUE)
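As a further base-R variant, the chunk index can be built with ceiling() and the counts taken with tapply(). This is a minimal sketch; the names chunk and counts are illustrative, and a chunk size of 100 is assumed as in the question.

```r
set.seed(24)
vect <- sample(c(TRUE, FALSE), 1000, replace = TRUE)

# Chunk index: positions 1-100 -> 1, 101-200 -> 2, ..., 901-1000 -> 10
chunk <- ceiling(seq_along(vect) / 100)

# Summing a logical vector counts its TRUE values
counts <- data.frame(
  chunk   = sort(unique(chunk)),
  n_TRUE  = as.vector(tapply(vect, chunk, sum)),
  n_FALSE = as.vector(tapply(!vect, chunk, sum))
)
counts
```

Each row of counts then holds the TRUE and FALSE totals for one block of 100 values.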
I have a variable test in the structure:
> test <- c(9,87)
> names(test) <- c("VGP", "GGW")
> dput(test)
structure(c(9, 87), .Names = c("VGP", "GGW"))
> class(test)
[1] "numeric"
This is a very simplified version of the input vector, but I want an output as a vector of length 100 which contains the frequency of each number 1-100 inclusive. The real input vector is of length ~1000000, so I am looking for an approach that will work for a vector of any length, assuming only numbers 1-100 are in it.
In this example, the numbers in all positions except 9 and 87 will show up as 0, and the 9th and 87th elements will both say 50.
How can I generate this output?
If we are looking for a proportion that includes the values not present in the vector, with those values shown as 0, convert the vector to a factor with the levels specified and then apply table and prop.table
100*prop.table(table(factor(test, levels = 1:100)))
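For a vector with around a million elements, tabulate() may be a lighter alternative to building a factor, since it counts positive integers directly. A minimal sketch, assuming the values are whole numbers from 1 to 100 (values outside that range are ignored by tabulate(), although they would still count in the denominator used here):

```r
test <- c(VGP = 9, GGW = 87)

# Count occurrences of each integer 1..100; absent values get 0
counts <- tabulate(test, nbins = 100)

# Convert counts to percentages of the input length
freq <- 100 * counts / length(test)
freq[c(9, 87)]
# [1] 50 50
```

The result is a plain numeric vector of length 100, so position i holds the percentage for the number i.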
freq <- vector(mode="numeric", length=100)
for(i in X)
{ if(i>=1 && i<=100)
freq[i]=freq[i]+1
}
freq
X is the vector containing 10000 elements
Adding an if condition could ensure that the values are in the range of [1,100].
Hope this helps.
If you have a numeric vector and just want to get a frequency table of the values, use the table function.
set.seed(1234)
d <- sample(1:10, 1000, replace = TRUE)
x <- table(d)
x
# 1 2 3 4 5 6 7 8 9 10
# 92 98 101 104 87 112 104 94 88 120
If there is a possibility of missing values, say 11 is a possibility in my example then I'd do the following:
y <- rep(0, 11)
names(y) <- as.character(1:11)
y[as.numeric(names(x))] <- x
y
# 1 2 3 4 5 6 7 8 9 10 11
# 92 98 101 104 87 112 104 94 88 120 0
I need to extract summed subsets of a data.frame row-by-row and use the output to return a new data.frame. However, I want to increase the number of columns to sum across by 4 each time. So, for example, I want to extract the 1st column by itself, then the sum of columns 2 to 6 on a row-by-row basis, then columns 7 to 15 and so on.
I have this code that returns the sum of a constant number of columns across a data.frame (by a maximum number of trials) into a new data.frame. I just need to find a way to add the escalating function.
t<- max(as.numeric(df[,c(5)]))
process.row <- function (x){
sapply(1:t,function(i){
return(sum(as.numeric(x[c((6+(i-1)*5):(10+(i-1)*5))]
)
)
)
})
}
t(apply(df,1,process.row)) -> collated.data
I've been really struggling with a way to do this so thanks very much for any help. I couldn't find an answer to this elsewhere so apologies if I've missed something.
I was thinking you wanted to sum the rows of the selected subset of columns. If so, perhaps this will help.
# fake data
mydf <- as.data.frame(matrix(sample(45*5), nrow=5))
mydf
# prepare matrix of start and ending columns
n <- 20
i <- 1:n
ncols <- 1 + (i-1)*4
endcols <- cumsum(ncols)
startcols <- c(1, cumsum(ncols[-length(endcols)])+1)
mymat <- cbind(endcols, startcols)
# function to sum the rows
myfun <- function(df, m) {
# select subset with end columns within the dimensions of the given df
subm <- m[m[, 1] <= dim(df)[2], , drop=FALSE]
# sum up the selected columns of df by rows
sapply(1:dim(subm)[1], function(j)
rowSums(df[, subm[j, 1]:subm[j, 2], drop=FALSE]))
}
mydf
myfun(df=mydf, m=mymat)
What you are looking for is a function that gives x (the lower value of the series) for the sequence-part i. Going by the code below, it is x(i) = 1 + sum_{k=1}^{i-1} (1 + 4(k - 1)), with x(1) = 1.
In R, the code looks like this:
# the foo part of the function
foo <- function(x) ifelse(x > 0, 1 + (x - 1) * 4, 0)
# the wrapper of the function
min.val <- function(i){
ifelse(i == 1, 1, 1 + sum(sapply(1:(i - 1), foo)))
}
# takes only one value
min.val(1)
# [1] 1
min.val(2)
# [1] 2
min.val(3)
# [1] 7
# to calculate multiple values, use it like this
sapply(1:5, min.val)
#[1] 1 2 7 16 29
If you want to get the maximum number, you can create another function, which looks like this
max.val <- function(i) min.val(i + 1) - 1
sapply(1:5, max.val)
#[1] 1 6 15 28 45
Testing:
# creating a series to test it
series <- 1:20
min.vals <- sapply(series, min.val)
max.vals <- sapply(series, max.val)
dat <- data.frame(min = min.vals, max = max.vals)
# dat
# min max
# 1 1 1
# 2 2 6
# 3 7 15
# 4 16 28
# 5 29 45
# 6 46 66
# 7 67 91
# 8 92 120
# 9 121 153
# 10 154 190
# 11 191 231
# 12 232 276
# 13 277 325
# 14 326 378
# 15 379 435
# 16 436 496
# 17 497 561
# 18 562 630
# 19 631 703
# 20 704 780
Does that give you what you want?
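Because the group widths grow linearly (1, 5, 9, ...), the bounds also have a closed form: the lower bound is 2i^2 - 5i + 4 and the upper bound is i(2i - 1), which reproduces the table above. The following is a sketch under that assumption; the names min_val, max_val, n_groups and sums are illustrative, and the fake data mirror the 5 x 45 example used earlier.

```r
# Closed-form column bounds for group i (widths 1, 5, 9, ...)
min_val <- function(i) 2 * i^2 - 5 * i + 4
max_val <- function(i) i * (2 * i - 1)

min_val(1:5)
# [1]  1  2  7 16 29
max_val(1:5)
# [1]  1  6 15 28 45

# Fake data: 5 rows, 45 columns, enough for exactly 5 groups
set.seed(1)
mydf <- as.data.frame(matrix(sample(45 * 5), nrow = 5))

# Row-wise sums for each group of columns; one output column per group
n_groups <- 5
sums <- sapply(seq_len(n_groups), function(i) {
  rowSums(mydf[, min_val(i):max_val(i), drop = FALSE])
})
sums
```

Both helper functions are vectorised, so no sapply() is needed to compute the bounds themselves.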