I am an R novice and have been running into a brick wall with what should be a simple for loop. The data consist of 81161 rows by 9 columns, with observations of individuals over time. The current need is to isolate each unique group of observations and randomly extract one observation from it. So far I have reviewed and attempted a few options, none of which executes properly: first a for loop, then apply.
To give a better idea of the workflow I have outlined: this should be a relatively straightforward split-apply-combine, where the "apply" is a sample restricted to unique individual_days. The code first defines the overall dimensions, then the unique values, then sorts and ranks (the unique individual_days are put on an ordinal scale and linked back to the original data using individual_day as the key). From this point I have attempted two alternatives with for loops: first using a split by rank to provide DSrank$'1, 2, 3...n' (attempted in example 2), and second using the subset seen in example 1. A single sample would then be extracted at random from each group and collated into a sub-dataset, on which further analysis will be performed.
### example 1: for loop
SDS <- list()
for (i in 1:length(UID)) {
  # sample one row index within the i-th piece, then keep that row
  SDS[[i]] <- SplitDS[[i]][sample(nrow(SplitDS[[i]]), 1), ]
  SDS[[i]]["Samples"] <- i
}
head(SDS)
### example 2: for loop
SDS <- list()
for (i in 1:length(UID)) {
  SubSDID <- subset(DSID, DSrank == i)  # compare to the value of i, not the string 'i'
  SDS[[i]] <- SubSDID[sample(nrow(SubSDID), 1), ]
}
head(SDS)
### example 3: apply subset
bootstrap <- lapply(1:length(UID), function(i) {
  piece <- DSID[DSID$DSrank == i, ]
  piece[sample(nrow(piece), 1), ]
})
These have been based on examples I have found through CRAN, Stack Overflow, and other R-code search results.
If you have any suggestions, tips, or tricks that you could share, it would be greatly appreciated.
MB
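A minimal sketch of the split-apply-combine described above, assuming (as in the examples) that DSID is the data frame and DSrank is the ordinal individual_day key; untested against the real data:
# split the data by individual_day rank, sample one row from each piece,
# then recombine the sampled rows into a single sub-dataset
SplitDS <- split(DSID, DSID$DSrank)
sampled <- lapply(SplitDS, function(piece) piece[sample(nrow(piece), 1), ])
SDS <- do.call(rbind, sampled)
head(SDS)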
Related
I wrote a function in R that parses values from a dataframe and outputs the old dataframe plus a new column with stats computed from each row.
I get the following warning:
Warning message:
In `[[.data.frame`(xx, sxx[j]) :
  named arguments other than 'exact' are discouraged
I am not sure what this means, to be honest. I did spot checks on the results and they seem OK to me.
The function itself is quite long, I will post it if needed to better answer the question.
Edit:
This is a sample dataframe:
my_df <- data.frame(
  'ALT'     = c('A,C', 'A,G'),
  'Sample1' = c('1/1:35,3,0,35,3,35:1:1:0:0,1,0', './.:0,0,0,0,0,0:0:0:0:0,0,0'),
  'Sample2' = c('2/2:188,188,188,33,33,0:11:11:0:0,0,11', '1/1:255,99,0,255,99,255:33:33:0:0,33,0'),
  'Sample3' = c('1/1:219,69,0,219,69,219:23:23:0:0,23,0', '0/1:36,0,78,48,87,120:7:3:0:4,3,0'))
And this is the function:
multi_allelic_filter_v2 <- function(in_vcf, start_col, end_col, threshold = 1) {
  # input: must have gone through biallelic_assessment first
  table0 <- in_vcf
  # ALT_alleles is the number of ALT alleles with coverage > threshold across samples

  # The following function calculates coverage across samples for a single allele
  single_allele_tot_cov_count <- function(list_of_unparsed_cov, allele_pos) {
    single_allele_coverage_count <- 0
    for (i in 1:length(list_of_unparsed_cov)) { # i is each group of coverages/sample
      single_allele_coverage_count <- single_allele_coverage_count +
        as.numeric(strsplit(as.character(list_of_unparsed_cov[i]),
                            split = ',')[[1]])[allele_pos]
    }
    return(single_allele_coverage_count)
  }

  # Single-row function: iterate over each ALT allele in the row
  single_row_assessment <- function(single_row) {
    # No. of alternative alleles over threshold
    alt_alleles0 <- 0
    if (single_row$is_biallelic == TRUE) {
      alt_alleles0 <- 1
    } else {
      alt_coverages <- numeric()      # coverages across samples of each ALT allele
      altcovs_unparsed <- character() # unparsed coverages from each sample
      for (i in start_col:end_col) {
        # Fill altcovs_unparsed with the 6th ":"-separated field of each sample
        altcovs_unparsed <- c(altcovs_unparsed,
                              strsplit(x = as.character(single_row[1, i]),
                                       split = ':')[[1]][6])
      }
      # Now calculate alt_coverages
      for (i in 1:lengths(strsplit(as.character(single_row$ALT), ',', fixed = TRUE))) {
        alt_coverages <- c(alt_coverages,
                           single_allele_tot_cov_count(
                             list_of_unparsed_cov = altcovs_unparsed,
                             allele_pos = i + 1))
      }
      # Now, let's see how many ALT alleles are over threshold
      alt_alleles0 <- sum(alt_coverages > threshold)
    }
    return(alt_alleles0)
  }

  # Now, let's iterate across each row:
  # ALT_alleles is the no. of ALT alleles with coverage > threshold across samples
  table0$ALT_alleles <- -99 # just a marker, to make sure the function works
  for (i in 1:nrow(table0)) {
    table0[i, 'ALT_alleles'] <- single_row_assessment(single_row = table0[i, ])
  }
  # Now we know how many ALT alleles >= threshold coverage are in each SNP
  return(table0)
}
Basically, in the following line:
'1/1:219,69,0,219,69,219:23:23:0:0,23,0'
fields are separated by ":", and I am interested in the last two numbers of the last field (23 and 0). For each row I want to sum all the numbers in those positions across samples (two separate sums) and output how many of the sums are over a threshold. I hope it makes sense...
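As a minimal illustration of that parsing step, using one of the sample strings above (the field and position numbers are the ones the function uses):
entry <- '1/1:219,69,0,219,69,219:23:23:0:0,23,0'
# the 6th ":"-separated field holds the per-allele coverages
covs <- strsplit(entry, split = ':')[[1]][6]       # "0,23,0"
# split on "," and drop the first value (the REF allele)
as.numeric(strsplit(covs, split = ',')[[1]])[-1]   # 23  0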
OK, I re-ran the script with the same dataset on the same computer (same project, then a new project), then ran it again on a different computer, and could not reproduce the warnings in any case. I am not sure what happened, and the results seem correct. Never mind, and thanks anyway for the comments and advice.
I've been working on scraping data from the web and manipulating it in R, but I'm having some trouble when I introduce the dreaded for loop. I am working with some arbitrary sports statistics, the idea being that I can calculate the per-game average of various stats for each player.
Each of these pieces works outside of the loop but falls apart inside it. Ideally, my code will do three things:
1) Scrape the data. I have a list of player names, names (by row), with the unique piece of the URL for each player in column 2. The website has a table named "stats" on each page, which is nice of them.
library(XML)
statMean <- matrix(ncol = 8, nrow = 20)
for (h in 1:20) {
  webname <- names[h, 2]
  # paste0 (or sep = "") keeps the URL free of inserted spaces
  vurl <- paste0("http://www.pro-football-reference.com/players/",
                 webname, "/gamelog/2015")
  tables <- readHTMLTable(vurl)
  t1 <- tables$stats
2) Pick the columns I want and turn the values into numeric values.
  temp <- t1[, c(9, 10, 12, 13, 14, 15, 17)]
  temp <- sapply(temp, function(x) as.numeric(as.character(x)))
3) Calculate the mean of each column, bind the unique player name to the column means, and rbind that vector to a full table.
  statMean <- rbind(statMean, c(webname, colMeans(temp)))
}
When I run through these steps outside of the loop, it seems to work ... when I run the loop I inevitably get:
Error in colMeans(temp) : 'x' must be an array of at least two dimensions
I've looked at a number of for loop questions on this site, and my code now looks a lot different than how it started. Unfortunately, each time I try something new, I end up with a version of the above error.
Any help would be fantastic. Thanks.
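For future readers: this error typically means temp stopped being two-dimensional, e.g. because tables$stats came back NULL or one-rowed for some player, in which case the sapply result collapses to a vector. A hedged sketch of a more defensive version of the loop (same columns, same hypothetical names table as in the question):
library(XML)

statMean <- NULL  # start empty and grow one row per player
for (h in 1:20) {
  webname <- names[h, 2]
  vurl <- paste0("http://www.pro-football-reference.com/players/",
                 webname, "/gamelog/2015")
  tables <- readHTMLTable(vurl)
  t1 <- tables$stats
  if (is.null(t1)) next  # no "stats" table on this page; skip the player

  # drop = FALSE keeps a one-row subset as a data frame
  temp <- t1[, c(9, 10, 12, 13, 14, 15, 17), drop = FALSE]
  # compute per-column means directly, avoiding colMeans on a collapsed vector
  means <- vapply(temp,
                  function(x) mean(as.numeric(as.character(x)), na.rm = TRUE),
                  numeric(1))
  statMean <- rbind(statMean, c(webname, means))
}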
I essentially need to iterate through a set of values for parameters A, B, and C to generate a table of results that will help me analyze the importance of those parameters. This is for a program in R.
Let's say that:
A goes from rangeA = 1:10
B goes from rangeB = 11:20
C goes from rangeC = 21:30
The simplest (not most efficient) solution that I currently use goes something like this:
### here I create an empty dataframe because I append each tmp result to it later
res <- data.frame()
### here I just create a random dataframe for the sake of a reproducible example
dataset <- data.frame(replicate(10, sample(0:1, 1000, rep = TRUE)))
ParameterAdjustment <- function() {
  for (a in rangeA) {
    for (b in rangeB) {
      for (c in rangeC) {
        ### this is a complicated calculation that is much more
        ### difficult than the replicable example below
        tmp <- CalculateSomething(dataset, a, b, c)
        ### an example calculation
        ### EDIT NEW EXAMPLE CALCULATION
        tmp <- colMeans(dataset + a * b * c)
        tmp <- data.frame(t(tmp), sd(tmp))
        res <- rbind(res, tmp)
      }
    }
  }
  return(res)
}
My problem is that this works fine with my original dataset, which runs the calculations on a 7000x500 dataframe. However, my new datasets are much larger, and performance has become a significant issue. Can anyone suggest or help with a more efficient solution? Thank you.
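One structural change that often helps, offered as a sketch (it assumes each combination is independent and uses the example colMeans calculation as a stand-in for CalculateSomething): enumerate the combinations with expand.grid and combine the results once at the end, since growing res with rbind inside the loops copies the whole data frame on every iteration.
# all (a, b, c) combinations up front
grid <- expand.grid(a = rangeA, b = rangeB, c = rangeC)

# one result row per combination, collected in a list
resList <- lapply(seq_len(nrow(grid)), function(i) {
  tmp <- colMeans(dataset + grid$a[i] * grid$b[i] * grid$c[i])
  data.frame(t(tmp), sd = sd(tmp))
})

# a single combine instead of 1000 incremental rbinds
res <- do.call(rbind, resList)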
Not sure what language the code in the question is, so not sure how relevant this is, but here goes: are you outputting/sending the data as you go, or collecting all the results in memory and then outputting them in one go at the end? When I've encountered similar problems with large datasets, this approach has helped me out a few times. For example, rather than generating an array of tens of thousands of data points for a graph and sending that to the client, I output each point to the screen as it is computed and then free up the memory. It still takes a while, but that's unavoidable; the important bit is that it doesn't crash.
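In R terms, that suggestion might look like streaming each result row to disk as it is computed instead of accumulating res in memory (a sketch; sweep_results.csv is a made-up output file):
outfile <- "sweep_results.csv"
first <- TRUE
for (a in rangeA) {
  for (b in rangeB) {
    for (c in rangeC) {
      tmp <- colMeans(dataset + a * b * c)
      row <- data.frame(t(tmp), sd = sd(tmp))
      # append each row as soon as it is ready; write the header only once
      write.table(row, outfile, sep = ",", append = !first,
                  col.names = first, row.names = FALSE)
      first <- FALSE
    }
  }
}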
I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times, which results in a list of 200. I then convert this list into a dataframe that is more conducive to the functions I eventually want to run on each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function on each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x = 1:200, y = 1:200)
# 100 replications of sampling 100 "positions"
resamp <- replicate(100, df[sample(nrow(df), 100), ])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2, ])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
# edit: kernel.area requires you to have an id field, but I am only dealing with
# one individual, so I'll construct a fake one of the same length as the positions
id <- replicate(100, "id")
id <- data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1, ncol(df3) - 1, 2)) {
  kud <- kernel.area(df3[, j:(j+1)], id = id, kern = "bivnorm",
                     unin = c("m"), unout = c("km2"))
  print(kud)
}
My end goal is to calculate kernel.area for each resampling event (ie rows 1:100 for every pair of columns up to 200), and be able to combine the results in a dataframe. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I changed it, and now I get the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the adehabitat package, which has a kernel.area function that I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple of suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but that is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has higher precedence than + (see ?Syntax for the order in which operations are evaluated in R). To get the desired two columns, use j:(j+1) instead.
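A quick way to see the precedence point at the console:
j <- 3
j:j+1    # evaluates as (j:j) + 1, i.e. the single value 4
j:(j+1)  # the intended pair of column indices: 3 4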
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1, ncol(df3) - 1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[, j:(j+1)]), kern = "bivnorm")
  kernAr <- kernel.area(kud, unin = c("m"), unout = c("km2"))
  print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().
I'd like to create a list of igraph objects, with the data used for each igraph object determined by another variable.
This is how I create a single igraph object:
netEdges <- NULL
for (idi in c("nom1", "nom2", "nom3")) {
netEdge <- net[c("id", idi)]
names(netEdge) <- c("id", "friendID")
netEdge$weight <- 1
netEdges <- rbind(netEdges, netEdge)
}
g <- graph.data.frame(netEdges, directed=TRUE)
For each unique value of net$community I'd like to make a new igraph object. Then I would like to calculate measures of centrality for each object and bring those measures back into my net dataset. Many thanks for your help!
Since the code you provide isn't completely reproducible, what follows is not guaranteed to run. It is intended as a guide for how to structure a real solution. If you provide example data that others can use to run your code, you will get better answers.
The simplest way to do this is probably to split net into a list with one element for each unique value of community, then apply your graph-building code to each piece, storing the results for each piece in another list. There are several ways of doing this type of thing in R, one of which is to use lapply:
#Break net into pieces based on unique values of community
netSplit <- split(net,net$community)
#Define a function to apply to each element of netSplit
myFun <- function(dataPiece){
  netEdges <- NULL
  for (idi in c("nom1", "nom2", "nom3")) {
    netEdge <- dataPiece[c("id", idi)]
    names(netEdge) <- c("id", "friendID")
    netEdge$weight <- 1
    netEdges <- rbind(netEdges, netEdge)
  }
  g <- graph.data.frame(netEdges, directed = TRUE)
  #This will return the graph itself; you could change the function
  # to return other values calculated on the graph
  g
}
#Apply your function to each subset (piece) of your data:
result <- lapply(netSplit, FUN = myFun)
If all has gone well, result should be a list containing a graph (or whatever you modified myFun to return) for each unique value of community. Other popular tools for doing similar tasks include ddply from the plyr package.
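To follow through on the centrality part of the question, here is one hedged sketch of how the measures could be brought back into net (it assumes the graphs built correctly, that igraph is loaded, and that net$id matches the vertex names graph.data.frame assigns):
#Compute per-node centrality measures for each community's graph
centralityList <- lapply(result, function(g) {
  data.frame(id = V(g)$name,
             degree = degree(g),
             betweenness = betweenness(g),
             stringsAsFactors = FALSE)
})
centralityTable <- do.call(rbind, centralityList)

#Merge the measures back into the original dataset by id
net <- merge(net, centralityTable, by = "id", all.x = TRUE)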