I have a list of three data frames that are similar (same number of columns but different number of rows), and were split from a larger data set.
Here is some example code to make three data frames and put them in a list. It is really hard to make an exact replicate of my data since the files are so large (over 400 columns and the first 6 columns are not numerical)
a <- c(0,1,0,1,0,0,0,0,0,1,0,1)
b <- c(0,0,0,0,0,0,0,0,0,0,0,0)
c <- c(1,0,1,1,1,1,1,1,1,1,0,1)
d <- c(0,0,0,0,0,0,0,0,0,0,0,0)
e <- c(1,1,1,1,0,1,0,1,0,1,1,1)
f <- c(0,0,0,0,0,0,0,0,0,0,0,0)
g <- c(1,0,1,0,1,1,1,1,1,1)
h <- c(0,0,0,0,0,0,0,0,0,0)
i <- c(1,0,0,0,0,0,0,0,0,0)
j <- c(0,0,0,0,1,1,1,1,1,0)
k <- c(0,0,0,0,0)
l <- c(1,0,1,0,1)
m <- c(1,0,1,0,0)
n <- c(0,0,0,0,0)
o <- c(1,0,1,0,1)
df1 <- data.frame(a,b,c,d,e,f)
df2 <- data.frame(g,h,i,j)
df3 <- data.frame(k,l,m,n,o)
my.list <- list(df1,df2,df3)
I am looking to remove all the columns in each data frame whose total == 0. The code is below:
list2 <- lapply(my.list, function(x) {x[, colSums(x) != 0];x})
list2 <- lapply(my.list, function(x) {x[, colSums(x != 0) > 0];x})
Both of the above calls run, but neither actually removes the all-zero columns.
I am not sure why that is. Any tips are greatly appreciated.
The OP found a solution while exchanging comments with me, but I want to leave an explanation here. In lapply(my.list, function(x) {x[, colSums(x) != 0];x}), the OP asked R to do two things. The first was to subset each data frame in my.list. The second was to return each data frame. I think he assumed that each data frame was updated by the subsetting, but the result of the subset was never assigned to anything, so it was discarded. The second command, x, simply returned the original data frame, and because a function returns the value of its last expression, that is what lapply() gave back. (On the surface, he saw no change.) Following his approach, I would do something like this.
lapply(my.list, function(x) {foo <- x[, colSums(x) != 0]; foo})
This creates a temporary object inside the anonymous function and returns it. Alternatively, the subset can be returned directly:
lapply(my.list, function(x) x[, colSums(x) != 0])
For each data frame in my.list, run a logical check for each column. If colSums(x) != 0 is TRUE, keep the column. Otherwise remove it. Hope this will help future readers.
[[1]]
a c e
1 0 1 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 0
6 0 1 1
7 0 1 0
8 0 1 1
9 0 1 0
10 1 1 1
11 0 0 1
12 1 1 1
[[2]]
g i j
1 1 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1
9 1 0 1
10 1 0 0
[[3]]
l m o
1 1 1 1
2 0 0 0
3 1 1 1
4 0 0 0
5 1 0 1
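Two edge cases are worth a quick sketch. If only one column survives, x[, keep] returns a plain vector unless drop = FALSE is given, and colSums() fails on non-numeric columns such as the six leading ones the question mentions. The helper name drop_zero_cols and the small data frame below are made up for illustration, and the sketch assumes the numeric columns contain no NA values.

```r
# Hypothetical helper: remove numeric columns whose sum is 0,
# leaving non-numeric columns untouched.
drop_zero_cols <- function(x) {
  num <- vapply(x, is.numeric, logical(1))   # which columns are numeric
  zero <- num                                # non-numeric columns are never dropped
  zero[num] <- colSums(x[, num, drop = FALSE]) == 0
  x[, !zero, drop = FALSE]                   # drop = FALSE keeps a data frame
}

# Made-up example with one non-numeric column
df <- data.frame(id = c("u", "v", "w"),
                 a  = c(0, 1, 0),
                 b  = c(0, 0, 0))
res <- lapply(list(df), drop_zero_cols)
names(res[[1]])
```

With a list like my.list, the call would be lapply(my.list, drop_zero_cols).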
In R, how can I use a vector instead of a single value in the lag function? That is, for Lag(x,k=2), instead of 2 I want to pass a vector, because I want to lag each row by a different amount. One row could have a lag of 3, while another could have a lag of 0, and so on.
Example:
a #lags d
1 0 1
2 1 1
4 2 1
3 0 3
1 1 3
I think you may need to write your own function for this task. I wrote one that I think does what you need, or at least points you in the right direction:
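x1 <- c(75,98,65,45,78,94,123,54) # a fake data set for us to lag
y1 <- c(2,3,1,4,1,2,3,5) # vector of values to lag by
# the function below takes the data, x, and lags each element by the matching entry of y
dynlag <- function(x,y) {
idx <- seq_along(x) - y # position each element should be read from
idx[idx < 1] <- NA # a lag that reaches before the start of x gives NA
x[idx]
}
# test out the function
dynlag(x1,y1)
# [1] NA NA 98 NA 45 45 45 65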
hope this helps. :)
Here is a solution with index arithmetic:
D <- read.table(header=TRUE, text=
'a lags d
1 0 1
2 1 1
4 2 1
3 0 3
1 1 3')
i <- seq(length(D$a))
erg <- D$a[i - D$lags]
all.equal(erg, D$d)
I have a problem where I have k items, say {0,1}, and I have to enumerate all possible N draws, say N=3. That is, I am trying to find all possible samples of a given size from a vector, with replacement.
I can get there via the following loop approach:
for (i1 in c(0,1)){
for (i2 in c(0,1)){
for (i3 in c(0,1)){
print(paste(i1,i2,i3,sep="_"))
}}}
However, this feels like a kludge. Is there a better way to do this using base R?
You can use interaction:
items <- c(0, 1)
n <- 3
levels(interaction(rep(list(items), n), sep = "_"))
# [1] "0_0_0" "1_0_0" "0_1_0" "1_1_0" "0_0_1" "1_0_1" "0_1_1" "1_1_1"
If vec is your vector and n is the number of times you draw from vec, to enumerate all the possibilities, try:
expand.grid(replicate(n,vec,simplify=FALSE),
KEEP.OUT.ATTRS=FALSE,stringsAsFactors=FALSE)
For instance:
vec<-0:1
n<-3
expand.grid(replicate(n,vec,simplify=FALSE),
KEEP.OUT.ATTRS=FALSE,stringsAsFactors=FALSE)
# Var1 Var2 Var3
#1 0 0 0
#2 1 0 0
#3 0 1 0
#4 1 1 0
#5 0 0 1
#6 1 0 1
#7 0 1 1
#8 1 1 1
The number of possibilities rises very quickly (the result has length(vec)^n rows); try the above only for small vectors and small n.
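As a small sketch building on the expand.grid approach, the rows can be pasted into the same underscore-separated strings that the loop in the question prints, and the row count can be checked against length(vec)^n. The names grid and draws are just illustrative.

```r
vec <- 0:1
n <- 3

# One row per possible draw of size n, with replacement
grid <- expand.grid(replicate(n, vec, simplify = FALSE),
                    KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)

# Paste the columns of each row together, e.g. "0_0_0"
draws <- do.call(paste, c(grid, sep = "_"))
draws

# The number of rows equals length(vec)^n
nrow(grid) == length(vec)^n
```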
I'm trying to assess the performance of a simple prediction model in R by discretizing the predictions, i.e. binning them into defined intervals, and then comparing them with the corresponding (binned) actual values.
I have two vectors actual and predicted as shown:
> actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
> predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)
I need to perform binning here. First off, the values of 'actual' are factorized/discretized into different levels, say:
0-5: Level 1
6-10: Level 2
...
41-45: Level 9
Now, I've to bin the values of 'predicted' also into the above mentioned buckets.
I tried to achieve this using the cut() function in R:
binCount <- 5
binActual <- cut(actual,labels=1:binCount,breaks=binCount)
binPred <- cut(predicted,labels=1:binCount,breaks=binCount)
However, the second element of predicted (98.01) is labelled as 5, even though it doesn't actually fall in any of the desired intervals.
I feel that using a different binCount for predicted will not help. Can anyone please suggest a solution for this?
I'm not 100% sure what you want to do.
From what I understand, you want to find, for each element of each vector, the class it falls into, given a single set of classes that covers every value in both actual and predicted.
If that is what you want, note that your script builds its classes from the range of actual and uses them to bin that first vector.
It then builds a new set of classes, from the range of predicted, for the second vector.
The two classifications are therefore no longer the same.
Assuming I understood what you want to do, I would rather write:
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)
temporary <- c(actual, predicted) # pool both vectors so they share one set of breaks
maxi <- max(temporary)
mini <- min(temporary)
binCount <- 5
s <- seq(maxi, mini, length.out = binCount) # binCount break points give binCount - 1 bins
s <- sort(s)
binActual <- cut(actual,breaks=s, include.lowest = TRUE, labels = 1:(length(s)-1))
binPred <- cut(predicted,breaks=s, include.lowest = TRUE, labels = 1:(length(s)-1))
It gives :
> binActual
[1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4
> binPred
[1] 1 4 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4
I'm not sure it is what you're looking for, so let me know, I might be able to help you.
Best wishes.
Is this what you want?
intervals <- cbind(seq(0, 40, length = 9), seq(5, 45, length = 9))
cutFixed <- function(x, intervals) {
sapply(x, function(x) ifelse(x < min(intervals) | x >= max(intervals), NA, which(x >= intervals[,1] & x < intervals[,2])))
}
This gives the following result
> cutFixed(actual, intervals)
[1] 1 1 1 1 9 1 1 2 1 1 1 1 1 1 2 1 1 1 4 1
> cutFixed(predicted, intervals)
[1] 1 NA 1 1 7 1 1 1 1 1 1 3 1 2 1 1 1 2 1
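For comparison, here is a minimal sketch of the same idea using cut() with explicit break points instead of a bin count, so that both vectors are binned against identical fixed intervals. It assumes right-closed intervals, [0,5], (5,10], ..., (40,45], which matches the 0-5, 6-10, ... levels in the question for whole numbers; this differs from cutFixed above, which puts 5 in the second interval. Values outside 0 to 45 come back as NA.

```r
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,
               3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)

# Fixed break points shared by both vectors: 0, 5, 10, ..., 45
brks <- seq(0, 45, by = 5)

# labels = FALSE returns the integer bin number (1 to 9)
binActual <- cut(actual, breaks = brks, include.lowest = TRUE, labels = FALSE)
binPred <- cut(predicted, breaks = brks, include.lowest = TRUE, labels = FALSE)

binActual
binPred
```

Because 98.01 lies outside the fixed range, it is reported as NA rather than being forced into the top bin.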
t has 20 values, and c also has 20 values, each 0 or 1. I am interested in the matrix for t. I have a loop that repeats 5 times, and each time sel gives 20 values. I want to store the frequency of each value of t (how many times it was drawn in that repetition) in t.mat, but I don't know how to get that result. The resulting table might look like the one below.
t <- 1:20
# c <- seq(0:1, 10)
t.mat <- array(dim = c(20, 5))
rep <- 5
for(mm in 1:rep){
sel <- sample(1:20, replace = TRUE)
tt <- t[sel]
# cc <- c[sel]
t.mat[, mm] = tt[1:20] # here the problem lies, I have no clue how
}
The output for the above might look like the table below. t will have 20 values; I show only six rows as a rough illustration:
t v1 v2 v3 v4 v5
1 1 0 1 0 1
2 0 0 2 0 1
3 0 1 1 1 0
4 1 1 0 0 1
5 2 0 2 1 0
6 0 0 0 1 2
I'm guessing a little bit as to what you want, but it's probably this:
do.call(cbind, lapply(1:5, function(i)
tabulate(sample(t, replace = T), nbins = 20)))
sample generates the samples you want; tabulate counts the frequencies (nbins is specified manually because the largest value will not always occur in a sample); lapply repeats the procedure 5 times; and finally do.call(cbind, ...) binds the results together by column.
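An equivalent sketch with replicate(), which returns the 20 x 5 matrix directly. The seed is an assumption added only so the run is repeatable, and n_rep is an illustrative name used to avoid masking the base function rep.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility
t <- 1:20
n_rep <- 5

# Each repetition resamples t with replacement and counts how often
# each of the 20 values was drawn; replicate() binds the counts by column.
t.mat <- replicate(n_rep, tabulate(sample(t, replace = TRUE), nbins = length(t)))

dim(t.mat)      # 20 rows (one per value of t), 5 columns (one per repetition)
colSums(t.mat)  # every column sums to 20, the number of draws
```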