This is probably very easy, but I could use some help. I've built an agent-based model that runs 300 times for an arbitrary number of steps, depending on what happens inside the model. The result is one long data frame (AT6 below) with one row per step and a column identifying the run.
I want to isolate the runs that end in 150 steps or less and analyze them in their entirety (not just the final step). What's the best way to do this in R? Thanks!
First, use aggregate to get the number of steps per run:
n_steps <- aggregate(AT6$run, by = list(run = AT6$run), FUN = length)  # column x holds the step count per run
Now compute a filter variable:
n_steps <- within(n_steps, filter <- x <= 150)
Merge with the original data and keep only filtered runs:
AT6_f <- merge(AT6, n_steps)
AT6_f <- AT6_f[AT6_f$filter,]
And finally split the data.frame by runs:
result <- split(AT6_f, AT6_f$run)
Note: the result is a list, each element being a data.frame containing a single run. If you have a function f that analyses one run, you can apply it to every run at once with:
lapply(result, f)
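For reference, the same filter can be built in a couple of lines with base ave, under the same assumption that AT6 has a run column (a sketch, not a drop-in for your exact data):
keep <- ave(seq_along(AT6$run), AT6$run, FUN = length) <= 150  # per-row: length of that row's run
result <- split(AT6[keep, ], AT6$run[keep])                    # list of data.frames, one per kept run
lapply(result, f)                                              # then analyze as before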
I have a data frame named KWR with 90 observations and 124306 variables, all numeric. I want to run a Kruskal-Wallis analysis on every column, comparing between groups. I added a vector named "Group" to the data identifying each observation's group. To check the approach, I tried one peptide (column X2461) with this code:
kruskal.test(X2461 ~ Group, data = KWR)
This worked fine and gave me a result instantly. However, I need all the variables analyzed, so I used lapply after reading this post: How to loop Bartlett test and Kruskal tests for multiple columns in a dataframe?
cols <- names(KWR)[1:124306]
allKWR <- lapply(cols, function(x) kruskal.test(reformulate("Group", x), data = KWR))
However, after 2 hours of R working non-stop, I killed the job. Is there a more efficient way of doing this?
Thanks in advance.
NB: first time poster, beginner in R
Take a look at kruskaltests in the Rfast package. For the KWR data.frame, it appears it would be something like:
allKWR <- Rfast::kruskaltests(as.matrix(KWR[,1:124306]), as.numeric(as.factor(KWR$Group)))
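As a reproducible sketch of that call (simulated data standing in for the real KWR, with sizes shrunk for illustration; kruskaltests should return the test statistic and p-value for each column):
set.seed(1)
sim <- as.data.frame(matrix(rnorm(90 * 100), nrow = 90))  # 90 obs, 100 numeric columns
sim$Group <- rep(c("a", "b", "c"), each = 30)             # grouping vector, as in the question
res <- Rfast::kruskaltests(as.matrix(sim[, 1:100]),
                           as.numeric(as.factor(sim$Group)))
head(res)  # expected: a matrix with one row per column (statistic, p-value)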
This was great - I got 50 columns and several hundred cases in 0.01 system time.
I cannot for the life of me figure out the simple error in my for loop. It should run the same analysis over multiple data frames and save each iteration's result as a new data frame, named after the input plus an identifying string.
Here is my code:
john and jane are two data frames, among many, that I want to loop over and compare against bcm to find duplicated rows.
x <- list(john, jane)
for (i in x) {
  test <- rbind(bcm, i)
  test$dups <- duplicated(test$Full.Name, fromLast = TRUE)
  test$dups2 <- duplicated(test$Full.Name)
  test <- test[which(test$dups == TRUE | test$dups2 == TRUE), ]
  newname <- paste("dupl", i, sep = ".")  # problem line: i is the data frame itself, not its name
  assign(newname, test)
}
So far I can either get the naming to work but without actually using the data in x, or get the loop to run correctly but with the wrong names for the new data frames.
Intended Result: I am hoping to create new data frames dupl.john and dupl.jane to show which rows are duplicated in comparison to bcm.
I understand that lapply() might be better here and am very open to that kind of solution, but I could not figure out how to use it for my problem, so I turned to the more familiar for loop.
EDIT:
Sorry if I'm not being clear. I have about 13 data frames in total that I want to run the same analysis over, finding the duplicate rows in $Full.Name. I could run the first four lines of my loop and then write dupl.john <- test 13 times (once per data frame), but I am deliberately trying to write a for loop or lapply() to learn more R, and because I'm sure it is more efficient.
If I understand your intended result correctly, match_df from plyr could be an option.
library(plyr)
dupl.john <- match_df(john, bcm)
dupl.jane <- match_df(jane, bcm)
dupl.john and dupl.jane will both be data frames holding the rows that appear both in them and in bcm. Is this what you are trying to achieve?
EDITED after the first comment
library(plyr)
l <- list(john, jane)
res <- lapply(l, function(x) {match_df(x, bcm, on = "Full.Name")} )
dupl.john <- res[[1]]
dupl.jane <- res[[2]]
Now, res will have a list of the data frames with the matches, based on the column "Full.Name".
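If you want the dupl.* objects without writing one line per data frame, a named list extends this to all 13 frames (a sketch along the same lines; list2env is base R):
library(plyr)
l <- list(john = john, jane = jane)   # add the other data frames here
res <- lapply(l, function(x) match_df(x, bcm, on = "Full.Name"))
names(res) <- paste("dupl", names(l), sep = ".")
list2env(res, envir = .GlobalEnv)     # creates dupl.john, dupl.jane, ... in the workspace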
I have a numeric vector stock_data containing thousands of floating-point numbers. I know I can sample from it using
sample(stock_data, sample_size)
I want to take 100 different samples and collect them in a list.
How do I do that without using a loop to append each sample to the list?
I thought of creating a list replicating the stock data 100 times, then using lapply on it.
I tried:
all_repl <- as.list(rep(stock_data,100))
all_samples <- lapply(all_repl, sample, size=100)
But all_repl isn't a list of 100 copies of the data: rep() produces a single long vector with the data repeated 100 times, and as.list() then splits it into one element per number.
Can anyone point out what's wrong and suggest a better way to do what I want?
We can use replicate
replicate(100, sample(stock_data, sample_size))
Use simplify=FALSE to get the output as a list. With a reproducible example:
replicate(5, sample(1:9, 5), simplify=FALSE)
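If you prefer the lapply shape from the question, looping over an index instead of replicated data sidesteps the as.list problem entirely (a sketch, assuming sample_size is already defined):
all_samples <- lapply(seq_len(100), function(i) sample(stock_data, sample_size))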
I'm trying to create an R Script to summarize measures in a data frame. I'd like it to react dynamically to changes in the structure of the data frame. For example, I have the following block.
library(plyr) #loading plyr just to access baseball data frame
MyData <- baseball[,cbind("id","h")]
AggHits <- aggregate(x=MyData$h, by=list(MyData[,"id"]), FUN=sum)
This block creates a data frame (AggHits) with the total hits (h) for each player (id). Yay.
Suppose I want to bring in the team. How do I change the by argument so that AggHits has the total hits for each combination of id and team? I tried the following, and the second line throws an error: arguments must have same length.
MyData <- baseball[,cbind("id","team","h")]
AggHits <- aggregate(x=MyData$h, by=list(MyData[,cbind("id","team")]), FUN=sum)
More generally, I'd like to write the second line so that it automatically aggregates h by all variables except h. I can generate the list of variables to group by pretty easily using setdiff.
# set the list of variables to summarize by as everything except hits
SumOver <- setdiff(colnames(MyData),"h")
# total up all the hits - again this line throws an error
AggHits <- aggregate(x=MyData$h, by=list(MyData[,cbind(SumOver)]), FUN=sum)
The business purpose I'm using this for involves a csv file which has a single measure ($) and currently has about a half dozen dimensions (product, customer, state code, dates, etc.). I'd like to be able to add dimensions to the csv file without having to edit the script each time.
I should mention that I've been able to accomplish this using ddply, but I know that using ddply to summarize a single measure is wasteful in terms of run time; aggregate is much faster.
Thanks in advance!
ANSWER (specific to example in question)
Block should be
MyData <- baseball[,cbind("id","team","h")]
SumOver <- setdiff(colnames(MyData),"h")
AggHits <- aggregate(x=MyData$h, by=MyData[SumOver], FUN=sum)
This aggregates by every non-integer column (id, team, lg), but more generally shows a strategy to aggregate over an arbitrary set of columns (by=MyData[cols.to.group.on]):
MyData <- plyr::baseball
cols <- names(MyData)[sapply(MyData, class) != "integer"]
aggregate(MyData$h, by=MyData[cols], sum)
Here is a solution using aggregate from base R:
data(baseball, package = "plyr")
MyData <- baseball[,c("id","h", "team")]
AggHits <- aggregate(h ~ ., data = MyData, sum)
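Applied to the csv use case from the question, the formula interface needs no edits when new dimension columns are added; a sketch with a hypothetical file name and a measure column assumed to be called dollars:
MyCsv <- read.csv("measures.csv")                          # hypothetical file
AggAll <- aggregate(dollars ~ ., data = MyCsv, FUN = sum)  # sums the measure over every other column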
My data frame (m*n) has a few hundred columns. I need to compare each column with all other columns (as contingency tables), perform a chi-squared test, and save the results for each column in a different variable.
It's working for one column at a time, like this:
s <- function(x) {
  a <- table(x, data[, 1])  # contingency table of x against column 1
  b <- chisq.test(a)        # the assignment's value is returned invisibly
}
c1 <- apply(data, 2, s)
The results against column 1 are stored in c1, but how do I loop this over all columns and save the result for each column for further analysis?
If you're sure you want to do this (I wouldn't, thinking about the multiple-testing problem), work with lists:
Data <- data.frame(
  x = sample(letters[1:3], 20, TRUE),
  y = sample(letters[1:3], 20, TRUE),
  z = sample(letters[1:3], 20, TRUE)
)
# Make a nice list of indices
ids <- combn(names(Data),2,simplify=FALSE)
# use the appropriate apply
my.results <- lapply(ids,
function(z) chisq.test(table(Data[,z]))
)
# use some paste voodoo to give the results the names of the column indices
names(my.results) <- sapply(ids,paste,collapse="-")
# select all values for y :
my.results[grep("y",names(my.results))]
Not harder than that. As the last line shows, you can easily get all tests for a specific column, so there is no need to make a separate list for each column; that just takes longer and more space while giving the same information. You can write a small convenience function to extract the data you need:
extract <- function(col,l){
l[grep(col,names(l))]
}
extract("^y$",my.results)
This means you can even loop over the column names of your data frame and get back a list of lists:
lapply(names(Data),extract,my.results)
I strongly suggest you get yourself acquainted with working with lists, they're one of the most powerful and clean ways of doing things in R.
PS: Be aware that you save the whole chisq.test object in your list. If you only need the chi-square statistic or the p-value, extract them first.
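For instance, a one-liner to keep just the p-values from the my.results list built above:
pvals <- sapply(my.results, function(r) r$p.value)  # named numeric vector, names like "x-y"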
Fundamentally, you have a few problems here:
- You're relying heavily on global variables rather than local arguments, which makes the double usage of data confusing.
- Similarly, you rely on a hard-coded value (column 1) instead of passing it as an argument to the function.
- You're not extracting the one value you need from chisq.test(), so your result gets returned as a whole list.
You didn't provide example data, so here's some:
m <- 10
n <- 4
mytable <- matrix(runif(m * n), nrow = m, ncol = n)
Once you fix the above problems, simply run a loop over the columns (now that the column index is an argument rather than hard-coded) and store the results.
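A sketch of what that loop could look like, keeping only the p-value per column. Note chisq.test wants count-like categorical data, so the continuous runif demo above is discretized with cut() first; with real categorical columns you would table them directly:
pvals <- numeric(ncol(mytable))
for (j in seq_len(ncol(mytable))) {
  tab <- table(cut(mytable[, j], 3), cut(mytable[, 1], 3))  # column 1 as the (now explicit) reference
  pvals[j] <- chisq.test(tab)$p.value                       # extract just the p-value
}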