I am trying to solve the following problem:
Consider 5 simple sequences: 0:100, 100:0, rep(0,101), rep(50,101), rep(100,101)
I need sets of 3 numeric variables that contain the above sequences in all combinations. Since there are 5 sequences and 3 variables, there are 5*5*5 = 125 combinations, hence a total of 12625 (5*5*5*101) numbers in each variable (101 for each sequence).
These can be grouped in a data.frame of 12625 rows and 4 columns. The first column (V) will simply have seq(1:12625) (row numbers can be used in its place). The other 3 columns (A, B, C) will have the above 5 sequences in different combinations. For example, the first 101 rows will have 0:100 in all three of A, B and C. The next 101 rows will have 0:100 in A and B, and 100:0 in C. And so on...
I can create sequences as:
s = list()
s[[1]] = 0:100
s[[2]] = 100:0
s[[3]] = rep(0,101)
s[[4]] = rep(50,101)
s[[5]] = rep(100,101)
But how do I proceed from here? I do not really need the data frame, but I need a function that returns a list containing the values of c(A,B,C) for the number (first or V column) sent to it. The number can obviously vary from 1 to 12625.
How can I create such a function? I would prefer a vectorized solution or one using the apply family of functions, for speed.
You asked for a vectorized solution, so here's one using only data.table (similar to @SimonG's methodology)
library(data.table)
grd <- CJ(A = seq_len(5), B = seq_len(5), C = seq_len(5))
res <- grd[, lapply(.SD, function(x) unlist(s[x]))]
res
# A B C
# 1: 0 0 0
# 2: 1 1 1
# 3: 2 2 2
# 4: 3 3 3
# 5: 4 4 4
# ---
# 12621: 100 100 100
# 12622: 100 100 100
# 12623: 100 100 100
# 12624: 100 100 100
# 12625: 100 100 100
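The question also asks for a lookup function rather than the full table. Here is a minimal base R sketch of one, assuming the same row ordering as the table above (C changes fastest, then B, then A). The names get_abc and S are made up for this illustration.

```r
# the five sequences from the question, one per column (101 x 5)
S <- cbind(0:100, 100:0, rep(0, 101), rep(50, 101), rep(100, 101))

# get_abc() is a hypothetical helper: v is the row number (1..12625)
get_abc <- function(v) {
  stopifnot(all(v >= 1), all(v <= 5^3 * 101))
  pos   <- (v - 1) %% 101 + 1      # position inside the 101-row block
  block <- (v - 1) %/% 101         # which of the 125 combinations (0-based)
  ia <- block %/% 25 + 1           # sequence index for A (changes slowest)
  ib <- (block %/% 5) %% 5 + 1     # sequence index for B
  ic <- block %% 5 + 1             # sequence index for C (changes fastest)
  # matrix indexing keeps this vectorized over v
  cbind(A = S[cbind(pos, ia)], B = S[cbind(pos, ib)], C = S[cbind(pos, ic)])
}

get_abc(c(1, 102, 12625))
```

Because everything is computed from v with integer arithmetic, the 12625-row table never has to be built. If a list is needed, as.list() on a single row of the result would give one.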
I came up with two solutions. I find this hard to do with apply and the like, since they tend to give output that is not so nice to handle (maybe someone can "tame" them better than I can :D)
The first solution uses separate calls to lapply; the second one uses a for loop and some programming no-nos. Personally I prefer the second one, although the first one is faster...
grd <- expand.grid(a=1:5,b=1:5,c=1:5)
# apply-ish
A <- lapply(grd[,1], function(z){ s[[z]] })
B <- lapply(grd[,2], function(z){ s[[z]] })
C <- lapply(grd[,3], function(z){ s[[z]] })
dfr <- data.frame(A=do.call(c,A), B=do.call(c,B), C=do.call(c,C))
# for-ish
mat <- NULL
for(i in 1:nrow(grd)){
  cur <- grd[i,]
  tmp <- cbind(s[[cur[,1]]],s[[cur[,2]]],s[[cur[,3]]])
  mat <- rbind(mat,tmp)
}
The output of both dfr and mat seems to be what you describe.
Cheers!
Let's say I have the following list of data frames (in reality I have many more).
seq <- c("12345","67890")
li <- list()
for (i in 1:length(seq)){
li[[i]] <- list()
names(li)[i] <- seq[i]
li[[i]] <- data.frame(A = c(1,2,3),
B = c(2,4,6))
}
What I would like to do is calculate the mean of each cell position across the data frames in the list, keeping the same number of rows and columns as the originals. How could I do this? I believe I can use the apply() function, but I am unsure how.
The expected output (not surprising):
A B
1 1 2
2 2 4
3 3 6
In reality, the values within each list are not necessarily the same.
If there are no NAs, then we can Reduce to get the sum of observations for each element and divide by the length of the list
Reduce(`+`, li)/length(li)
# A B
#1 1 2
#2 2 4
#3 3 6
If there are NA values, then it may be better to use mean (which has an na.rm argument). For this, we can convert the list to an array and then use apply, passing na.rm = TRUE through to mean
apply(array(unlist(li), dim = c(dim(li[[1]]), length(li))), c(1, 2), mean, na.rm = TRUE)
An equivalent option in tidyverse would be
library(tidyverse)
reduce(li, `+`)/length(li)
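To make the NA case concrete, here is a small base R sketch. The two data frames are made up for illustration, and it assumes all data frames in the list have the same dimensions and column order.

```r
# made-up example: same shape, a few missing cells
li <- list(data.frame(A = c(1, 2, NA), B = c(2, 4, 6)),
           data.frame(A = c(3, 4, 5),  B = c(4, NA, 8)))

# stack the data frames into a rows x cols x list-length array
arr <- array(unlist(li), dim = c(dim(li[[1]]), length(li)),
             dimnames = list(NULL, names(li[[1]]), NULL))

# mean over the third dimension, ignoring NAs cell by cell
cell_means <- apply(arr, c(1, 2), mean, na.rm = TRUE)
cell_means
```

With these inputs the NA cells fall back to the single available value (5 for A in row 3, 4 for B in row 2), whereas the Reduce approach would return NA there.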
I have a data frame g with 3 columns, a, b and X. I need to multiply X by each entry in a and add b to create a new variable. Right now I'm using a for loop
for(i in 1:N) {
  g$Eout[i] = mean((g$a[i]*g$X+g$b[i]-(g$X)^2)^2);
}
which is really slow in R. Is there any way to do this faster?
Try this:
set.seed(2)
N <- 30
g <- data.frame(a=1:N,b=seq(1,2,length.out=N),X=seq(10,20,length.out=N))
g$new <- sapply(g$X, function(x) mean((g$a * x + g$b - x^2)^2))
head(g)
# a b X new
# 1 1 1.000000 10.00000 10735.67
# 2 2 1.034483 10.34483 11077.04
# 3 3 1.068966 10.68966 11416.58
# 4 4 1.103448 11.03448 11757.01
# 5 5 1.137931 11.37931 12101.40
# 6 6 1.172414 11.72414 12453.14
Since you want each value of X multiplying all values of g$a, etc., you need to resort to some vectorized goodness. (Using @thelatemail's suggested 3e4 takes about 7 sec per sapply ...)
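If the loop in the question is taken literally (for each row i, average over all values of X), the same numbers can be produced without a loop using outer(). This is only a sketch under that reading, reusing the example data from above. It builds an N x N matrix, so it trades memory for speed and would not be practical for N = 3e4.

```r
N <- 30
g <- data.frame(a = 1:N, b = seq(1, 2, length.out = N),
                X = seq(10, 20, length.out = N))

# rows index i (a[i], b[i]); columns index j (X[j])
# g$b recycles down each column, so row i gets b[i];
# rep(..., each = N) fills column j with X[j]^2
resid <- outer(g$a, g$X) + g$b - rep(g$X^2, each = N)
g$Eout <- rowMeans(resid^2)

# reference: the loop from the question
Eout_loop <- numeric(N)
for (i in 1:N) Eout_loop[i] <- mean((g$a[i] * g$X + g$b[i] - g$X^2)^2)
all.equal(g$Eout, Eout_loop)
```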
I have Valence Category for word stimuli in my psychology experiment.
1 = Negative, 2 = Neutral, 3 = Positive
I need to sort the thousands of stimuli with a pseudo-randomised condition.
Val_Category cannot have more than 2 of the same valence stimuli in a row i.e. no more than 2x negative stimuli in a row.
for example - 2, 2, 2 = not acceptable
2, 2, 1 = ok
I can't sequence the data i.e. decide the whole experiment will be 1,3,2,3,1,3,2,3,2,2,1 because I'm not allowed to have a pattern.
I tried dplyr and base functions such as sample, order and sort, and nothing so far solves the problem.
I think there are a thousand ways to do this, none of which are probably very pretty. I wrote a small function that takes care of the ordering. It's a bit hacky, but it appeared to work for what I tried.
To explain what I did, the function works as follows:
1. Take the vector of valences and sample from it.
2. If sequences are found that are longer than the desired length, then (for each such sequence) take the last value of that sequence and place it "somewhere else".
3. Check if the problem is solved. If so, return the reordered vector. If not, go back to 2.
# some vector of valences
val <- rep(1:3,each=50)
pseudoRandomize <- function(x, n){
  # take an initial sample (of the argument x, not the global val)
  out <- sample(x)
  # check if the sample is "bad" (containing sequences longer than n)
  bad.seq <- any(rle(out)$lengths > n)
  # length of the whole sample
  l0 <- length(out)
  while(bad.seq){
    # get lengths of all subsequences
    l1 <- rle(out)$lengths
    # find the bad ones
    ind <- l1 > n
    # take the last value of each bad sequence, and...
    for(i in cumsum(l1)[ind]){
      # take it out of the original sample
      tmp <- out[-i]
      # pick new position at random
      pos <- sample(2:(l0-2),1)
      # put the value back into the sample at the new position
      out <- c(tmp[1:(pos-1)],out[i],tmp[pos:(l0-1)])
    }
    # check if bad sequences (still) exist
    # if TRUE, then 'while' continues; if FALSE, then it doesn't
    bad.seq <- any(rle(out)$lengths > n)
  }
  # return the reordered sequence
  out
}
Example:
The function may be used on a vector with or without names. If the vector was named, then these names will still be present on the pseudo-randomized vector.
# simple unnamed vector
val <- rep(1:3,each=5)
pseudoRandomize(val, 2)
# gives:
# [1] 1 3 2 1 2 3 3 2 1 2 1 3 3 1 2
# when names assigned to the vector
names(val) <- 1:length(val)
pseudoRandomize(val, 2)
# gives (first row shows the names):
# 1 13 9 7 3 11 15 8 10 5 12 14 6 4 2
# 1 3 2 2 1 3 3 2 2 1 3 3 2 1 1
This property can be used for randomizing a whole data frame. To achieve that, the "valence" vector is taken out of the data frame, and names are assigned to it either by row index (1:nrow(dat)) or by row names (rownames(dat)).
# reorder a data.frame using a named vector
dat <- data.frame(val=rep(1:3,each=5), stim=rep(letters[1:5],3))
val <- dat$val
names(val) <- 1:nrow(dat)
new.val <- pseudoRandomize(val, 2)
new.dat <- dat[as.integer(names(new.val)),]
# gives:
# val stim
# 5 1 e
# 2 1 b
# 9 2 d
# 6 2 a
# 3 1 c
# 15 3 e
# ...
I believe this loop will set the valence categories appropriately. I've called the valence category treat.
#Generate example data
s1 = data.frame(id=c(1:10),treat=NA)
#Setting the first two rows
s1[1,"treat"] <- sample(1:3,1)
s1[2,"treat"] <- sample(1:3,1)
#Looping through the remainder of the rows
for (i in 3:length(s1$id))
{
  s1[i,"treat"] <- sample(1:3,1)
  #Check if the treat value is equal to the previous two values.
  if (s1[i,"treat"]==s1[i-1,"treat"] && s1[i-1,"treat"]==s1[i-2,"treat"])
  #If so draw one of the values not equal to that value
  {
    a = 1:3
    remove <- s1[i,"treat"]
    a=a[!a==remove]
    s1[i,"treat"] <- sample(a,1)
  }
}
This solution is not particularly elegant. There may be a much faster way to accomplish this by sorting several columns or something.
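Note that the loop above draws each value independently, so it does not keep a fixed number of stimuli per category. If the counts matter, one alternative is to draw sequentially from the remaining pool and forbid the value that would extend a full run. The following is only a sketch of that idea; constrained_shuffle and its arguments are names invented for this illustration, and it restarts from scratch when it hits a dead end.

```r
constrained_shuffle <- function(x, n = 2, max_tries = 100) {
  vals <- sort(unique(x))
  for (attempt in seq_len(max_tries)) {
    # how many of each value are still to be placed
    remaining <- tabulate(match(x, vals), nbins = length(vals))
    out <- integer(length(x))      # indices into vals
    last <- 0L
    run_len <- 0L
    dead_end <- FALSE
    for (k in seq_along(x)) {
      w <- remaining
      # a run of length n may not be extended
      if (run_len >= n) w[last] <- 0
      if (sum(w) == 0) { dead_end <- TRUE; break }
      # draw a value with probability proportional to what is left
      pick <- sample.int(length(vals), 1, prob = w)
      run_len <- if (pick == last) run_len + 1L else 1L
      last <- pick
      remaining[pick] <- remaining[pick] - 1L
      out[k] <- pick
    }
    if (!dead_end) return(vals[out])
  }
  stop("no valid ordering found in ", max_tries, " attempts")
}

set.seed(1)
v <- constrained_shuffle(rep(1:3, each = 50), n = 2)
max(rle(v)$lengths)   # never larger than n
table(v)              # counts are preserved
```

This returns only the reordered values. To reorder a whole data frame, the rows within each category could be shuffled separately and then assigned to the positions of that category in the result.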
I'm relatively new to R (~3 months), and so I'm just getting the hang of all the different data types. While lists are a super useful way of holding dissimilar data all in one place, they are also extremely inflexible for function calls, and riddle me with angst.
For the work I'm doing, I often use lists because I need to hold a bunch of vectors of different lengths. For example, I'm tracking performance statistics of about 10,000 different vehicles, and there are certain vehicles which are so similar they can essentially be treated as the same vehicle for certain analyses.
So let's say we have this list of vehicle ID's:
List <- list(a=1, b=c(2,3,4), c=5)
For simplicity's sake.
I want to do two things:
1. Tell me which element of the list a particular vehicle is in. So when I tell R I'm working with vehicle 2, it should tell me b or [2]. I feel like it should be something as simple as how you can do
match(3,b)
> 2
2. Convert it into a data frame or something similar so that it can be saved as a CSV. Unused rows could be blank or NA. What I've had to do so far is:
for(i in seq_along(List)) {
  length(List[[i]]) <- max(as.numeric(as.matrix(summary(List)[,1])))
}
DF <- as.data.frame(List)
Which seems dumb.
For your first question:
which(sapply(List, `%in%`, x = 3))
# b
# 2
For your second question, you could use a function like this one:
list.to.df <- function(arg.list) {
  max.len <- max(sapply(arg.list, length))
  arg.list <- lapply(arg.list, `length<-`, max.len)
  as.data.frame(arg.list)
}
list.to.df(List)
# a b c
# 1 1 2 5
# 2 NA 3 NA
# 3 NA 4 NA
Both of those tasks (and many others) would become much easier if you were to "flatten" your data into a data.frame. Here's one way to do that:
fun <- function(X)
    data.frame(element = X, vehicle = List[[X]], stringsAsFactors = FALSE)
df <- do.call(rbind, lapply(names(List), fun))
# element vehicle
# 1 a 1
# 2 b 2
# 3 b 3
# 4 b 4
# 5 c 5
With a data.frame in hand, here's how you could perform your two tasks:
## Task #1
with(df, element[match(3, vehicle)])
# [1] "b"
## Task #2
write.csv(df, file = "outfile.csv")
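As a side note, base R's stack() produces a similar long table directly from a named list of vectors. A short sketch using the example list from the question (the variable name long is arbitrary):

```r
List <- list(a = 1, b = c(2, 3, 4), c = 5)

# stack() returns a data frame with columns 'values' and 'ind',
# where 'ind' is a factor holding the list element names
long <- stack(List)

# Task 1: which element holds vehicle 3?
hit <- as.character(long$ind[match(3, long$values)])
hit
```

The resulting data frame can be passed to write.csv() in the same way as df above.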
I would like to aggregate a data.frame over 3 categories, with one of them varying. Unfortunately this one varying category contains NAs (actually it's the reason why it needs to vary). Thus I created a list of data.frames. Every data.frame within this list contains only complete cases with respect to three variables (with only one of them changing).
Let's reproduce this:
library(plyr)
mydata <- warpbreaks
names(mydata) <- c("someValue","group","size")
mydata$category <- c(1,2,3)
mydata$categoryA <- c("A","A","X","X","Z","Z")
# add some NA
mydata$category[c(8,10,19)] <- NA
mydata$categoryA[c(14,1,20)] <- NA
# create a list of dfs that contains TRUE FALSE
noNAList <- function(vec){
  res <- !is.na(vec)
  return(res)
}
testTF <- lapply(mydata[,c("category","categoryA")],noNAList)
# create a list of data.frames
selectDF <- function(TFvec){
  res <- mydata[TFvec,]
  return(res)
}
# check x and see that it may contain NAs as long
# as it's not in one of the 3 categories I want to aggregate over
x <- lapply(testTF,selectDF)
## let's ddply get to work
doddply <- function(df){
  ddply(df,.(group,size),summarize,sumTest = sum(someValue))
}
y <- lapply(x, doddply)
y
y comes very close to what I want to get
$category
group size sumTest
1 A L 375
2 A M 198
3 A H 185
4 B L 254
5 B M 259
6 B H 169
$categoryA
group size sumTest
1 A L 375
2 A M 204
3 A H 200
4 B L 254
5 B M 259
6 B H 169
But I need to implement aggregation over a third varying variable, which is in this case category and categoryA. Just like:
group size category sumTest sumTestTotal
1 A H 1 46 221
2 A H 2 46 221
3 A H 3 93 221
and so forth. How can I add names(x) to lapply, or do I need a loop or environment here?
EDIT:
Note that I want EITHER category OR categoryA added to the mix. In reality I have about 15 mutually exclusive categorical vars.
I think you might be making this really hard on yourself, if I understand your question correctly.
If you want to aggregate the data.frame mydata by three (or four) variables, you would simply do this:
aggregate(someValue ~ group + size + category + categoryA, sum, data=mydata)
group size category categoryA someValue
1 A L 1 A 51
2 B L 1 A 19
3 A M 1 A 17
4 B M 1 A 63
aggregate will automatically remove rows that include NA in any of the categories. If someValue is sometimes NA, then you can add the parameter na.rm=TRUE.
I also noted that you put a lot of unnecessary code into functions. For example:
# create a list of data.frames
selectDF <- function(TFvec){
  res <- mydata[TFvec,]
  return(res)
}
Can be written like:
selectDF <- function(TFvec) mydata[TFvec,]
Also, using lapply to create a list of two data frames without the NA is overkill. Try this code:
x = list(mydata[!is.na(mydata$category),],mydata[!is.na(mydata$categoryA),])
I know the question explicitly requests a ddply()/lapply() solution.
But ... if you are willing to come on over to the dark side, here is a data.table()-based function that should do the trick:
# Convert mydata to a data.table
library(data.table)
dt <- data.table(mydata, key = c("group", "size"))
# Define workhorse function
myfunction <- function(dt, VAR) {
    E <- as.name(substitute(VAR))
    dt[i = !is.na(eval(E)),
       j = {n <- sum(.SD[,someValue])
            .SD[, list(sumTest = sum(someValue),
                       sumTestTotal = n,
                       share = sum(someValue)/n),
                by = VAR]
           },
       by = key(dt)]
}
# Test it out
s1 <- myfunction(dt, "category")
s2 <- myfunction(dt, "categoryA")
ADDED ON EDIT
Here's how you could run this for a vector of different categorical variables:
catVars <- c("category", "categoryA")
ll <- lapply(catVars,
             FUN = function(X) {
                 do.call(myfunction, list(dt, X))
             })
names(ll) <- catVars
lapply(ll, head, 3)
# $category
# group size category sumTest sumTestTotal share
# [1,] A H 2 46 185 0.2486486
# [2,] A H 3 93 185 0.5027027
# [3,] A H 1 46 185 0.2486486
#
# $categoryA
# group size categoryA sumTest sumTestTotal share
# [1,] A H A 79 200 0.395
# [2,] A H X 68 200 0.340
# [3,] A H Z 53 200 0.265
Finally, I found a solution that might not be as slick as Josh's, but it works without any dark forces (data.table). You may laugh; here's my reproducible example using the same sample data as in the question.
qual <- c("category","categoryA")
# get T / F vectors
noNAList <- function(vec){
  res <- !is.na(vec)
  return(res)
}
selectDF <- function(TFvec) mydata[TFvec,]
NAcheck <- lapply(mydata[,qual],noNAList)
# create a list of data.frames
listOfDf <- lapply(NAcheck,selectDF)
workhorse <- function(charVec,listOfDf){
  dfs <- list2env(listOfDf)
  # create expression list (one ddply call per name in charVec)
  exlist <- list()
  for(i in seq_along(charVec)){
    exlist[[charVec[i]]] <- parse(text=paste("ddply(",charVec[i],
      ",.(group,size,",charVec[i],"),summarize,sumTest = sum(someValue))",
      sep=""))
  }
  res <- lapply(exlist,eval,envir=dfs)
  return(res)
}
Is this more like what you mean? I find your example extremely difficult to understand. In the below code, the method can take any column, and then aggregate by it. It can return multiple aggregation functions of someValue. I then find all the column names you would like to aggregate by, and then apply the function to that vector.
# Build a method to aggregate by column.
agg.by.col = function (column) {
  by.list=list(mydata$group,mydata$size,mydata[,column])
  names(by.list) = c('group','size',column)
  aggregate(mydata$someValue, by=by.list, function(x) c(sum=sum(x),mean=mean(x)))
}
# Find all the column names you want to aggregate by
cols = names(mydata)[!(names(mydata) %in% c('someValue','group','size'))]
# Apply the method to each column name.
lapply (cols, agg.by.col)
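To also get the per-group total that the question asks for (sumTestTotal), one possible base R sketch is shown below. It rebuilds the sample data from the question so that it runs on its own. The helper name agg_with_total is invented here, and it relies on the formula interface of aggregate dropping rows with NA in the variables used.

```r
# sample data as in the question
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")
mydata$category <- c(1, 2, 3)
mydata$categoryA <- c("A", "A", "X", "X", "Z", "Z")
mydata$category[c(8, 10, 19)] <- NA
mydata$categoryA[c(14, 1, 20)] <- NA

# hypothetical helper: aggregate by group, size and one varying category
agg_with_total <- function(data, cat_var) {
  # someValue ~ group + size + <cat_var>
  f <- reformulate(c("group", "size", cat_var), response = "someValue")
  out <- aggregate(f, data = data, FUN = sum)
  names(out)[names(out) == "someValue"] <- "sumTest"
  # total per group/size, repeated on every row of that group/size
  out$sumTestTotal <- ave(out$sumTest, out$group, out$size, FUN = sum)
  out
}

cat_vars <- c("category", "categoryA")
res <- lapply(setNames(cat_vars, cat_vars),
              function(v) agg_with_total(mydata, v))
lapply(res, head, 3)
```

Using setNames() on the vector of column names means the resulting list is already named by category, which is one way to carry names(x) through lapply.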