R dataframe with nested vector

Background
In R this works:
> df <- data.frame(a=numeric(), b=numeric())
> rbind(df, list(a=1, b=2))
a b
1 1 2
But if I want the list to have a vector, rbind fails:
> df <- data.frame(a=numeric(), b=vector(mode="numeric"))
> rbind(df, list(a=1, b=c(2,3)))
Error in rbind(deparse.level, ...) :
invalid list argument: all variables should have the same length
And if I try to specify the vector length, declaring the dataframe fails:
> df <- data.frame(a=numeric(), b=vector(mode="numeric", length=2))
Error in data.frame(a = numeric(), b = vector(mode = "numeric", length = 2)) :
arguments imply differing number of rows: 0, 2
Finally, if I eschew declaring the dataframe and try to rbind two lists directly, it looks like everything is working, but the datatypes are all wrong and none of the columns appear to exist.
> l1 <- list(a=1, b=c(2,3))
> l2 <- list(a=10, b=c(20,30))
> obj <- rbind(l1, l2)
> obj
a b
l1 1 Numeric,2
l2 10 Numeric,2
> typeof(obj)
[1] "list"
> obj$a
NULL
> obj$b
NULL
> names(obj)
NULL
My setup
I have an embedded device that gathers data every 50ms and spits out a packet of data. In my script, I'm parsing a waveform that represents the states of that process (process previous frame and transmit, gather new data, dead time where nothing happens) with a state machine. For each packet I'm calculating the duration of the process period; the data-gathering period, which is subdivided into 8 or 16 acquisition cycles (I calculate the time of each cycle); and the remaining dead time.
My list basically looks like `list(process=#, cycles=c(#,#,#,#), deadtime=#)`. Different packet types have different cycle counts, so I pass that in as a parameter and I want the script to work on any packet type.
My question
Is there a way to declare a dataframe that does what I want, or am I using R in a fundamentally wrong way and I should break each cycle into its own list element? I was hoping to avoid the latter as it will make treating the cycles as a group more difficult.
I will note that I've just started learning R so I'm probably doing some odd things with it.
Expected output
If I were to process 4 packets worth of signal with 3 acq. cycles each, this would be my ideal output:
df <- data.frame(processTime=numeric(), cycles=???, deadtime=numeric())
df <- rbind(df, list(processTime=0.05, cycles=c(0.08, 0.10, 0.07), deadtime=0.38))
etc...
processTime cycles deadtime
1 0.05 0.08 0.10 0.07 0.38
2 0.06 0.07 0.11 0.09 0.36
3 0.07 0.28 0.11 0.00 0.00
4 0.06 0.08 0.08 0.09 0.41
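For reference, base R can hold a vector per row via a list column: wrapping the list in I() inside data.frame() keeps it as one column instead of spreading it across several. A minimal sketch along the lines of the expected output (one possible approach, not necessarily the only one):

```r
# I() ("as is") keeps the list as a single list-column instead of
# trying to recycle it into multiple rows
df <- data.frame(processTime = 0.05,
                 cycles = I(list(c(0.08, 0.10, 0.07))),
                 deadtime = 0.38)
df <- rbind(df, data.frame(processTime = 0.06,
                           cycles = I(list(c(0.07, 0.11, 0.09))),
                           deadtime = 0.36))
df$cycles[[1]]   # the full cycle vector for packet 1
```

Each row then carries one cycles vector, and the column can still be iterated over as a group.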

I'll take a different stab. Dealing with just your first 2 records.
processTime<-c(.05,.06)
cycles<-list(list(.08,.10,.07), list(.07,.09,.38))
deadtime<-c(.38,.36)
For cycles, we created a list whose elements are themselves lists of 3 elements. So cycles[[1]][1] refers to .08, cycles[[1]][2] refers to the second element of the first list, and cycles[[2]][3] refers to the 3rd item in the second list.
If we use cbind to bind these we get the following:
test<-as.data.frame(cbind(processTime,cycles,deadtime))
test
processTime cycles deadtime
1 0.05 0.08, 0.10, 0.07 0.38
2 0.06 0.07, 0.09, 0.38 0.36
test$cycles[[1]] will return the first list:
test$cycles[[1]]
[[1]]
[[1]][[1]]
[1] 0.08
[[1]][[2]]
[1] 0.1
[[1]][[3]]
[1] 0.07
Whereas the 3rd element of the second list can be called with:
test$cycles[[2]][3]
[[1]]
[1] 0.38
You can also unlist later for calculations:
unlist(test$cycles[[2]])
[1] 0.07 0.09 0.38
To do this iteratively, as you requested:
test<-data.frame()
processTime<-c(.05)
cycles<-list(list(.08,.10,.07))
deadtime<-c(.38)
test<-as.data.frame(cbind(processTime,cycles,deadtime))
test
processTime cycles deadtime
1 0.05 0.08, 0.10, 0.07 0.38
processTime<-c(.06)
cycles<-list(list(.07,.09,.38))
deadtime<-c(.36)
test<- rbind(test,as.data.frame(cbind(processTime,cycles,deadtime)))
test
processTime cycles deadtime
1 0.05 0.08, 0.10, 0.07 0.38
2 0.06 0.07, 0.09, 0.38 0.36
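Building on the above, the cycles column can also be treated as a group with sapply() plus unlist(); a small self-contained sketch (rebuilding the two-row test frame first):

```r
# Rebuild the two-row frame from this answer
processTime <- c(.05, .06)
cycles <- list(list(.08, .10, .07), list(.07, .09, .38))
deadtime <- c(.38, .36)
test <- as.data.frame(cbind(processTime, cycles, deadtime))

# Mean acquisition-cycle time per packet: unlist each row's list of
# cycle times, then average it
sapply(test$cycles, function(cy) mean(unlist(cy)))
```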

Related

Fill out multi-dimensional array using Julia

I am trying to fill out a multi-dimensional array. For example, X[1] should be a vector containing all values k*h[1] where k=0,...,floor(Int,15/h[1])+1. I haven't been able to solve this problem.
h=[0.01 0.02 0.04 0.08 0.1 0.2 0.5 0.8]
X=[k*h[i] for k in 0:floor(Int,15/h[i])+1 for i in 1:8]
I got an error.
You can get a Vector of Vectors instead of a 2D Array, as noted by @PrzemyslawSzufel. This is because X[1] has a length that's different from X[2]'s, which is different from X[3]'s, etc.
The real error, then, is that i is not defined. The variable i is defined in the second comprehension, for i in 1:8, but it's not accessible in the first comprehension. To solve this, we can wrap the first comprehension in its own pair of brackets so that it becomes one element of the second comprehension. Then we get X[1] as a vector containing all values k*h[1], etc.
Note: In Julia, h = [0.01 0.02 ...] is a Matrix{Float64} and not a Vector. So, you should use h = [0.01, 0.02, ...] instead for Vectors. Also, the range 0:floor(Int,15/h[i])+1 is similar to 0:15/h[i]+1 because the default step is 1, except that the second range has Float64 values.
h = [0.01, 0.02, 0.04, 0.08, 0.1, 0.2, 0.5, 0.8];
X = [[k * h[i] for k in 1:15/h[i]+1] for i in 1:8];
X[1]
1501-element Vector{Float64}:
0.01
0.02
0.03
0.04
0.05
0.06
0.07
...

Use apply() on a 1-dim vector to find the best threshold

My current mission: pick up some "good" columns from an incomplete matrix, trying to remove NAs while keeping real data.
My idea: I can calculate every column's missing-data percentage (NA%). For a given threshold t, all columns with NA% > t will be removed. The removed columns also contain some real data; the present/missing ratio shows the "price" of deleting those columns. My idea is to search for the lowest "price" that deletes as many NAs as possible, for each dataset.
I already wrote my function till the last 2 steps:
myfunc1 <- function(x){
return(sum(is.na(x)))
}
myfunc2 <- function(x){
return (round(myfunc1(x) / length(x),4))
}
myfunc3 <- function(t, set){
m <- which(apply(set, MARGIN = 2, myfunc2) > t)
missed <- sum(is.na(set[m]))
present <- sum(!is.na(set[m]))
return(present/ missed)
}
myfunc3(0.5, setA) # worked
threshold <- seq(from = 0, to = 0.95, by = 0.05)
apply(X = threshold, MARGIN = 1, FUN = myfunc3, set = setA) # doesn't work; stuck here.
I have 10 datasets, setA to setJ, and I want to test all thresholds from 0 to 0.95. I want a matrix as the return value, with the 10 datasets as columns and 20 threshold rows at 0.05 intervals.
Did I do this correctly? Are there better ideas, or already existing libraries that I could use?
----------edit: example-----------
setA <- data.frame(cbind(c(1,2,3,4,NA,6,7,NA), c(1,2,NA,4,5,NA,NA,8),c(1,2,3,4,5,6,NA,8), c(1,2,3,4,5,6,7,8),c(NA,NA,NA,4,NA,6,NA,NA)))
colnames(setA) <- sprintf("col%s", seq(1:5))
rownames(setA) <- sprintf("sample%s", seq(1:8))
View(setA)
myfunc1 <- function(x){
return(sum(is.na(x)))
}
myfunc2 <- function(x){
return (round(myfunc1(x) / length(x),4))
}
myfunc3 <- function(t, set){
m <- which(apply(set, MARGIN = 2, myfunc2) > t)
missed <- sum(is.na(set[m]))
present <- sum(!is.na(set[m]))
return(present/ missed)
}
In setA, there are 8 samples. Each sample has 5 attributes that describe it. Unfortunately, some data are missing. I need to delete the columns with too many NAs. First, let me calculate every column's NA%.
> apply(setA, MARGIN = 2, myfunc2)
col1 col2 col3 col4 col5
0.250 0.375 0.125 0.000 0.750
If I set the threshold t = 0.3, that means col2 and col5 are considered to have too many NAs and need to be deleted; the others are acceptable. If I delete those 2 columns, I also delete some real data. (I deleted 7 real data points and 9 NAs; 7/9 = 0.78. This means I sacrifice 0.78 real data points for every NA I delete.)
> myfunc3(0.3, setA)
[1] 0.7777778
I want to try every threshold's result and then decide.
threshold <- seq(from = 0, to = 0.9, by = 0.1)
apply(X = threshold, MARGIN = 1, FUN = myfunc3, set = setA) # does not work
I manually calculated the setA part:
threshold: 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
price: 1.667 1.667 1.118 0.778 0.334 0.334 0.334 0.334 NaN NaN
At last I want a table like:
threshold: 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
setA: 1.667 1.667 1.118 0.778 0.334 0.334 0.334 0.334 NaN NaN
setB:
setC:
...
setJ:
Is my approach to the problem correct?
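For what it's worth, the failing line fails because apply() requires an object with a dim attribute, which a plain vector lacks; sapply() iterates over a vector directly. A sketch using the example setA (the helpers here are renamed versions of myfunc2/myfunc3):

```r
setA <- data.frame(col1 = c(1, 2, 3, 4, NA, 6, 7, NA),
                   col2 = c(1, 2, NA, 4, 5, NA, NA, 8),
                   col3 = c(1, 2, 3, 4, 5, 6, NA, 8),
                   col4 = c(1, 2, 3, 4, 5, 6, 7, 8),
                   col5 = c(NA, NA, NA, 4, NA, 6, NA, NA))

# fraction of NAs per column
na_frac <- function(x) round(sum(is.na(x)) / length(x), 4)

# "price": real values sacrificed per NA removed at threshold t
price <- function(t, set) {
  drop <- which(apply(set, MARGIN = 2, na_frac) > t)
  sum(!is.na(set[drop])) / sum(is.na(set[drop]))
}

thresholds <- seq(0, 0.9, by = 0.1)
# sapply() walks the 1-dimensional threshold vector, where apply() errors
sapply(thresholds, price, set = setA)
```

Applying sapply() over the datasets as well (e.g. a named list of setA..setJ) would then give the requested thresholds-by-datasets table.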
-----------Edit---------------
I have already solved the problem; please close the thread.

How to optimize recursive function for finding all permutations?

I wrote the following piece of code to find all permutations of a given vector:
perm <- function(v, r = NULL, P = NULL) {
l <- length(v)
if (l == 0) {
P <- rbind(P, r)
rownames(P) <- NULL
P
} else {
for (i in 1:l) {
new_r <- c(r, v[i])
new_v <- v[-i]
P <- perm(new_v, new_r, P)
}
P
}
}
P <- perm(1:9) # takes "forever" yet e.g. perm(1:7) is quite fast!?!
P
It does what it should but the problem is that it kind of runs forever if one uses vectors of length > 8 (as above).
My question
I don't really see the problem, I found some recursive implementations that don't look so different yet are much more efficient... So is there a simple way to optimize the code so that it runs faster?
As #akrun states, recursion in R is generally not that efficient. However, if you must have a recursive solution, look no further than gtools::permutations. Here is the implementation:
permGtools <- function(n, r, v) {
if (r == 1)
matrix(v, n, 1)
else if (n == 1)
matrix(v, 1, r)
else {
X <- NULL
for (i in 1:n) X <- rbind(X, cbind(v[i], permGtools(n - 1, r - 1, v[-i])))
X
}
}
By the way, to get the full source code, simply type gtools::permutations in the console and hit enter. For more information see How can I view the source code for a function?
And here are some timings:
system.time(perm(1:8))
user system elapsed
34.074 10.641 44.815
system.time(permGtools(8,8,1:8))
user system elapsed
0.253 0.001 0.255
And just for good measure:
system.time(permGtools(9, 9, 1:9))
user system elapsed
2.512 0.046 2.567
Why is the OP's implementation slower?
Skip to the summary if you don't want to read the details.
For starters, we can simply see that the OP's implementation makes more recursive calls than the implementation in gtools. To show this, we add count <<- count + 1L to the top of each function (N.B. We are using the <<- assignment operator which searches through the parent environments first). E.g:
permGtoolsCount <- function(n, r, v) {
count <<- count + 1L
if (r == 1)
.
.
And now we test a few lengths:
iterationsOP <- sapply(4:7, function(x) {
count <<- 0L
temp <- permCount(1:x)
count
})
iterationsOP
[1] 65 326 1957 13700
iterationsGtools <- sapply(4:7, function(x) {
count <<- 0L
temp <- permGtoolsCount(x, x, 1:x)
count
})
iterationsGtools
[1] 41 206 1237 8660
As you can see, the OP's implementation makes more calls in every case. In fact, it makes about 1.58... times the amount of recursive calls.
iterationsOP / iterationsGtools
[1] 1.585366 1.582524 1.582053 1.581986
As we have stated already, recursion in R has a bad reputation. I couldn't find anything pinpointing exactly why this is the case other than R does not employ tail-recursion.
At this point, it seems hard to believe that making about 1.58 times more recursive calls would explain the 175 times speed up we saw above (i.e. 44.815 / 0.255 ~= 175).
We can profile the code with Rprof in order to glean more information:
Rprof("perm.out", memory.profiling = TRUE)
a1 <- perm(1:8)
Rprof(NULL)
summaryRprof("perm.out", memory = "both")$by.total
total.time total.pct mem.total self.time self.pct
"perm" 43.42 100.00 15172.1 0.58 1.34
"rbind" 22.50 51.82 7513.7 22.50 51.82
"rownames<-" 20.32 46.80 7388.7 20.30 46.75
"c" 0.02 0.05 23.7 0.02 0.05
"length" 0.02 0.05 0.0 0.02 0.05
Rprof("permGtools.out", memory.profiling = TRUE)
a2 <- permGtools(8, 8, 1:8)
Rprof(NULL)
summaryRprof("permGtools.out", memory = "tseries")$by.total
total.time total.pct mem.total self.time self.pct
"rbind" 0.34 100.00 134.8 0.18 52.94
"cbind" 0.34 100.00 134.8 0.08 23.53
"permGtools" 0.34 100.00 134.8 0.06 17.65
"matrix" 0.02 5.88 0.0 0.02 5.88
One thing that jumps out immediately (other than the time) is the huge memory usage of the OP's implementation. The OP's implementation uses roughly 15 Gb of memory whereas the gtools implementation uses only 134 Mb.
Digging Deeper
In the above, we are simply looking at memory usage in a general view by setting the memory parameter to both. There is another setting called tseries that lets you look at the memory usage over time.
head(summaryRprof("perm.out", memory = "tseries"))
vsize.small vsize.large nodes duplications stack:2
0.02 4050448 25558992 49908432 2048 "perm":"perm"
0.04 98808 15220400 1873760 780 "perm":"perm"
0.06 61832 12024184 1173256 489 "perm":"perm"
0.08 45400 0 861728 358 "perm":"perm"
0.1 0 14253568 0 495 "perm":"perm"
0.12 75752 21412320 1436120 599 "perm":"perm"
head(summaryRprof("permGtools.out", memory = "tseries"))
vsize.small vsize.large nodes duplications stack:2
0.02 4685464 39860824 43891512 0 "permGtools":"rbind"
0.04 542080 552384 12520256 0 "permGtools":"rbind"
0.06 0 0 0 0 "permGtools":"rbind"
0.08 767992 1200864 17740912 0 "permGtools":"rbind"
0.1 500208 566592 11561312 0 "permGtools":"rbind"
0.12 0 151488 0 0 "permGtools":"rbind"
There is a lot going on here, but the thing to focus on is the duplications field. From the documentation for summaryRprof we have:
It also records the number of calls to the internal function duplicate in the time interval. duplicate is called by C code when arguments need to be copied.
Comparing the number of copies in each implementation:
sum(summaryRprof("perm.out", memory = "tseries")$duplications)
[1] 121006
sum(summaryRprof("permGtools.out", memory = "tseries")$duplications)
[1] 0
So we see that the OP's implementation requires many copies to be made. I guess this isn't surprising given that the desired object is a parameter in the function prototype. That is, P is the matrix of permutations to be returned, and it keeps getting larger and larger with each iteration. And with each iteration, we pass it along to perm. You will notice that this is not the case in the gtools implementation, as it simply has two numeric values and a vector for its parameters.
Summary
So there you have it: the OP's original implementation not only makes more recursive calls, but also requires many copies, which bogs down memory and drastically hurts efficiency.
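To illustrate the copying point, here is a hypothetical variant (the name permPrealloc and its structure are made up for this sketch, not taken from gtools) that allocates the result matrix once and fills rows in place, sidestepping the repeated rbind() growth:

```r
# Sketch: allocate the full factorial(n)-by-n result matrix up front and
# fill it row by row, so no rbind() happens on the hot path
permPrealloc <- function(v) {
  n <- length(v)
  P <- matrix(v[1], nrow = factorial(n), ncol = n)
  row <- 0L
  fill <- function(rest, acc) {
    if (length(rest) == 0L) {
      row <<- row + 1L        # <<- reaches the counter in the enclosing frame
      P[row, ] <<- acc        # write the finished permutation in place
    } else {
      for (i in seq_along(rest)) fill(rest[-i], c(acc, rest[i]))
    }
  }
  fill(v, v[0])
  P
}
permPrealloc(1:3)
```

The recursion structure matches the OP's perm, so the rows come out in the same (lexicographic) order; only the accumulation strategy changes.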
It may be better to use permuteGeneral from RcppAlgos
P <- perm(1:5) # OP's function
library(RcppAlgos)
P1 <- permuteGeneral(5, 5)
all.equal(P, P1, check.attributes = FALSE)
#[1] TRUE
Benchmarks
On a slightly longer sequence
system.time({
P2 <- permuteGeneral(8, 8)
})
#user system elapsed
# 0.001 0.000 0.001
system.time({
P20 <- perm(1:8) #OP's function
})
# user system elapsed
# 31.254 11.045 42.226
all.equal(P2, P20, check.attributes = FALSE)
#[1] TRUE
Generally, a recursive function can take longer because the recursive calls themselves add execution time.

I want to save the result of simulation from this programme

# library (energy)
RR=100
n=10
a=2
b=4
miu1=2
miu2=4
m22=(b^2)*(1-(rho^2))
# This is the point where am having problem
# I want the programme to retain the results average0.1, average0.05 and
# average0.01 for every 'rho' from the rho_list used for the simulation
# but I am stuck because I don't know how to get the result
rho_list=c(0,0.3,0.6)
for (rho in rho_list){
energy=rep(NA,RR)
for (i in 1:RR){
z1=rnorm(n,0,1)
z2=rnorm(n,0,1)
x1=miu1+a*z1
x2=miu2+(rho*b*z1)+(sqrt(m22)*z2)
X=matrix(c(x1,x2),byrow=TRUE,ncol=2)
energy[i]=mvnorm.etest(X)$p.value
}
average0.1=sum(energy<=0.1)/RR
average0.05=sum(energy<=0.05)/RR
average0.01=sum(energy<=0.01)/RR
}
I want the program to retain the results average0.1, average0.05 and average0.01 for every rho from the rho_list used for the simulation
but I am stuck because I don't know how to get the result
Your example is not reproducible, so I'm giving you some simulated data to demonstrate how to output the result.
rho_list=c(0,0.3,0.6)
result <- sapply(rho_list, FUN = function(rho, ...) {
average0.1 = runif(1)
average0.05 = runif(1)
average0.01 = runif(1)
c(rho = rho, a01 = average0.1, a0.05 = average0.05, a0.01 = average0.01)
}, RR = RR, n = n, a = a, b = b, miu1 = miu1, miu2 = miu2, m22 = m22, simplify = FALSE)
do.call("rbind", result)
rho a01 a0.05 a0.01
[1,] 0.0 0.0136175 0.08581583 0.07171591
[2,] 0.3 0.8334469 0.42103038 0.07857328
[3,] 0.6 0.8231120 0.40647485 0.65408540
One option would be to store the results in a list for each value of rho and then bind them into a single data frame. Here's an example. Note that since rho isn't defined in the set-up code, I've substituted the definition of m22 directly into the loop. Also, I've set RR=10 to save time in running the code.
library(energy)
RR=10
n=10
a=2
b=4
miu1=2
miu2=4
rho_list=c(0, 0.3, 0.6)
energy_threshold = c(0.1, 0.05, 0.01) # Store energy thresholds in a vector
# Create a list of data frames. Each data frame contains the result for each
# of the three energy thresholds for one value of rho.
results = lapply(rho_list, function(rho) {
energy=rep(NA,RR)
for (i in 1:RR) {
z1=rnorm(n,0,1)
z2=rnorm(n,0,1)
x1=miu1+a*z1
x2=miu2+(rho*b*z1)+(sqrt((b^2)*(1-(rho^2)))*z2)
X=matrix(c(x1,x2),byrow=TRUE,ncol=2)
energy[i]=mvnorm.etest(X)$p.value
}
data.frame(rho, energy_threshold, result=sapply(energy_threshold, function(y) sum(energy <= y)/RR))
})
# Bind the three data frames into a single data frame
results = do.call(rbind, results)
And here's the output:
results
rho energy_threshold result
1 0.0 0.10 0.1
2 0.0 0.05 0.0
3 0.0 0.01 0.0
4 0.3 0.10 0.2
5 0.3 0.05 0.1
6 0.3 0.01 0.0
7 0.6 0.10 0.0
8 0.6 0.05 0.0
9 0.6 0.01 0.0
I stored the results from the loop in numeric vectors and used rbind() to append each iteration's result. Here is the entire code:
library(energy)
RR=10
n=10
a=2
b=4
miu1=2
miu2=4
average0.1 <- numeric()
average0.05 <- numeric()
average0.01 <- numeric()
# This is the point where I was having a problem
# I want the programme to retain the results average0.1, average0.05 and
# average0.01 for every 'rho' from the rho_list used for the simulation
# but I was stuck because I didn't know how to get the result
rho_list=c(0,0.3,0.6)
for (rho in unique(rho_list)){
m22=(b^2)*(1-(rho^2)) # m22 depends on rho, so compute it inside the loop
energy=rep(NA,RR)
for (i in 1:RR){
z1=rnorm(n,0,1)
z2=rnorm(n,0,1)
x1=miu1+a*z1
x2=miu2+(rho*b*z1)+(sqrt(m22)*z2)
X=matrix(c(x1,x2),byrow=TRUE,ncol=2)
energy[i]=mvnorm.etest(X)$p.value
}
average0.1=rbind(average0.1, sum(energy<=0.1)/RR)
average0.05=rbind(average0.05, sum(energy<=0.05)/RR)
average0.01=rbind(average0.01, sum(energy<=0.01)/RR)
}

Transform Numbers - Trim Integral Part from Floating Point Number

Given a column of data (of the type 39600.432, 39600.433, etc) I would like to drop the integer part of the number and keep only the decimals (transforming 39600.432 into 432, and 39600.433 into 433). How can I do this?
Let's say your column is the vector x.
> x <- c(39.456, 976.902)
> x <- x - as.integer(x)
> x
[1] 0.456 0.902
That should work. You can then just multiply by 1000 to convert the current x to integers. You will need some more processing if you want 3.9 to become 9.
> x <- 1000*x
> x
[1] 456 902
Hope this helps!
Many good answers, here's one more using regular expressions.
> g <- c(134.3412,14234.5453)
> gsub("^[^\\.]*\\.", "", g)
[1] "3412" "5453"
To strip the integral part without a subtraction or regex, you can use the modulus operator.
x <- (10000:10010)/100
x
## [1] 100.00 100.01 100.02 100.03 100.04 100.05 100.06 100.07 100.08 100.09 100.10
x %% 1
## [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10
`%% 1` is valid in R. This does leave the value as a fraction, which may not be ideal for your use.
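Putting the pieces together, and assuming the values always carry three decimal places as in the question, the modulus result can be scaled and rounded to get integers (round() absorbs the binary floating-point noise, since 39600.432 %% 1 is not exactly 0.432):

```r
x <- c(39600.432, 39600.433)
# fractional part via %%, scaled to whole thousandths;
# round() cleans up floating-point representation error
round((x %% 1) * 1000)   # 432 433
```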
You are looking for the floor function, though you could use as.integer as well.
Here is an approach using regular expressions
g<-c(134.3412,14234.5453)
r<-regexpr("[0-9]+$",g)
as.numeric(regmatches(g,r))
This should do it:
g <- c(134.3412,14234.5453)
h <- floor(g)
g - h