Bandwidth selection using NP package - r

New to R and having problem with a very simple task! I have read a few columns of .csv data into R, the contents of which contains of variables that are in the natural numbers plus zero, and have missing values. After trying to use the non-parametric package, I have two problems: first, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error that "number of regression data and response data do not match". Why do I get this, as I have the same number of elements in each vector?
Second, I would like to call the data ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic, it is just a vector with natural numbers and NA's?
Any clarifications would be greatly appreciated!

1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way of subsetting data. Note the difference in various ways of subsetting. data$x and data[,i] (where i = column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they call for data = (your data). In that case they work with column names

Related

Is there a way to run a wilcoxon test for variables with different lengths?

I am trying to run a wilcox.test() on two subsets of data from a data frame. They are not of equal length (48 vs. 260). I want to see if there is a difference between the dbh (diameter at breast height) of live oak trees and water oak trees.
Pine_stand <- read.csv("Pine_stand.csv")
live_oaks <- subset(Pine_stand,Species=="live oak",select=c("dbh"));live_oaks
water_oaks <- subset(Pine_stand,Species=="water oak",select=c("dbh"));water_oaks
wilcox.test(live_oaks~water_oaks,conf.int=T,correct=F)
Error in model.frame.default(formula = live_oaks ~ water_oaks) :
invalid type (list) for variable 'live_oaks'
that was my first attempt then I tried this
Pine_stand <- read.csv("Pine_stand.csv")
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh"));live_oaks
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh"));water_oaks
oaks<-c(live_dbh,water_dbh)
wilcox.test(dbh~Species,data=oaks)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 48, 260
>
and received that error. I have tried vectorizing the two groups and appending and tapply ... I know there is a simple answer I am overlooking, I just can't get it to work. All of the examples I am reading are comparing two vectors with the same length. I know I can do the Wilcoxon test by hand when there are different numbers, so there should be a way. Any advice is welcome.
Yes, you can run a wilcox.test for variables of different length. As stated in http://www.r-tutor.com/elementary-statistics/non-parametric-methods/mann-whitney-wilcoxon-test
“Using the Mann-Whitney-Wilcoxon Test, we can decide whether the
population distributions are identical without assuming them to follow
the normal distribution.”
Therefore it’s a non-parametric equivalent of the t-test that we can use, when the assumptions for the t-test are not met (for example distribution is not normal or variances in two samples are not equal).
The problem in your code is that with these two statements:
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh"))
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh"))
you are creating two vectors that contain only dph values, but you lose information about the labels (Species). Therefore you should write:
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh", “Species”))
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh", “Species”))
Secondly when you are trying two merge the two sets with this code:
oaks<-c(live_dbh,water_dbh)
instead of creating a data frame you create a list. Why is that happening? First, as we can read from documentation for c(), its name stands for “Combine Values into a Vector or List”. Probably you have already used it to merge two vectors into one. However in case of subset function it actually gives as a result one column data-frame and not a vector. Therefore our live_dbh and water_dbh sets are data frames (and now with the label they even have two columns).
In case of one column data-frame you can always use c() function with recursive parameter set to TRUE to merge them:
total<-c(one_column_df1, one_column_df2, recursive=TRUE)
However it’s usually safer to use rbind function (and it’s also the only function that will work in case we are merging data frames with more than one column). Rbind stands for row bind.
oaks<-rbind(live_dbh,water_dbh)
Now you should be able to run a wilcox.test:
wilcox.test(dbh~Species,data=oaks)
How about
wilcox.test(dbh~Species, data=Pine_stand,
subset=(Species %in% c("live oak", "water oak"))
? (If these are the only two species in your data set, you don't need the subset argument.)

Create a new data frame of the means of randomly selected rows - looped

Question:
I have a data.frame (hlth) that consists of 49 vectors - a mix of numeric(25:49) and factor(1:24). I am trying to randomly select 50 rows, then calculate column means only for the numeric columns (dropping the other values), and then place the random row mean(s) into a new data.frame (beta). I would then like to iterate this process 1000 times.
I have attempted this process but the values that get returned are identical and the new means will not enter the new data.frame
Here is a few rows and columns of the data.frame(hlth)
DateIn adgadj Sex VetMedCharges pwtfc
1/01/2006 3.033310 STEER 0.00 675.1151
1/10/1992 3.388245 STEER 2540.33 640.2261
1/10/1995 3.550847 STEER 572.78 607.6200
1/10/1996 2.893707 HEIFER 549.42 425.5217
1/10/1996 3.647233 STEER 669.18 403.8238
The code I have used thus far:
set.seed[25]
beta<-data.frame()
net.row<-function(n=50){
netcol=sample(1:nrow(hlth),size=n ,replace=TRUE)
rNames <- row.names(hlth)
subset(hlth,rNames%in%netrow,select=c(25:49))
colMeans(s1,na.rm=TRUE,dims=1)
}
beta$net.row=replicate(1000,net.row()); net.row
The two issues, that I have detected, are:
1) Returns the same value(s) each iteration
2) "Error during wrap-up: object of type 'closure' is not subsettable" when the beta$netrow
Any suggestions would be appreciated!!!
Just adding to my comment (and firstly pasting it):
netcol=sample(1:nrow(hlth),size=n ,replace=TRUE) should presumably by netrow = ... and the error is a scoping problem - R is trying to subset the function beta, presumably again, because it can't find netRowMeans in the data.frame you've defined, moves on to the global environment and throws an error there.
There are also a couple of other things. You don't assign subset(hlth,rNames%in%netrow,select=c(25:49)) to a variable, which I think you mean to assign to s1, so colMeans is probably running on something you've set in the global environment.
If you want to pass a variable directly in to the data frame beta in that manner, you'll have to initialise beta with the right number of columns and number of rows - the column means you've passed out will be a vector of (1 x 25), so won't fit in a single column. You would probably be better of initalising a matrix called mat or something (to avoid confusion with scoping errors masking the actual error messages) with 25 columns and 1000 rows.
EDIT: Question has been edited slightly since I posted this, but most points still stand.

Loop and clear the basic function in R

I've got this dataset
install.packages("combinat")
install.packages("quantmod")
library(quantmod)
library(combinat)
library(utils)
getSymbols("AAPL",from="2012-01-01")
data<-AAPL
p1<-4
dO<-data[,1]
dC<-data[,4]
emaO<-EMA(dO,n=p1)
emaC<-EMA(dC,n=p1)
Pos_emaO_dO_UP<-emaO>dO
Pos_emaO_dO_D<-emaO<dO
Pos_emaC_dC_UP<-emaC>dC
Pos_emaC_dC_D<-emaC<dC
Pos_emaC_dO_D<-emaC<dO
Pos_emaC_dO_UP<-emaC>dO
Pos_emaO_dC_UP<-emaO>dC
Pos_emaO_dC_D<-emaO<dC
Profit_L_1<-((lag(dC,-1)-lag(dO,-1))/(lag(dO,-1)))*100
Profit_L_2<-(((lag(dC,-2)-lag(dO,-1))/(lag(dO,-1)))*100)/2
Profit_L_3<-(((lag(dC,-3)-lag(dO,-1))/(lag(dO,-1)))*100)/3
Profit_L_4<-(((lag(dC,-4)-lag(dO,-1))/(lag(dO,-1)))*100)/4
Profit_L_5<-(((lag(dC,-5)-lag(dO,-1))/(lag(dO,-1)))*100)/5
Profit_L_6<-(((lag(dC,-6)-lag(dO,-1))/(lag(dO,-1)))*100)/6
Profit_L_7<-(((lag(dC,-7)-lag(dO,-1))/(lag(dO,-1)))*100)/7
Profit_L_8<-(((lag(dC,-8)-lag(dO,-1))/(lag(dO,-1)))*100)/8
Profit_L_9<-(((lag(dC,-9)-lag(dO,-1))/(lag(dO,-1)))*100)/9
Profit_L_10<-(((lag(dC,-10)-lag(dO,-1))/(lag(dO,-1)))*100)/10
which are given to this frame
frame<-data.frame(Pos_emaO_dO_UP,Pos_emaO_dO_D,Pos_emaC_dC_UP,Pos_emaC_dC_D,Pos_emaC_dO_D,Pos_emaC_dO_UP,Pos_emaO_dC_UP,Pos_emaO_dC_D,Profit_L_1,Profit_L_2,Profit_L_3,Profit_L_4,Profit_L_5,Profit_L_6,Profit_L_7,Profit_L_8,Profit_L_9,Profit_L_10)
colnames(frame)<-c("Pos_emaO_dO_UP","Pos_emaO_dO_D","Pos_emaC_dC_UP","Pos_emaC_dC_D","Pos_emaC_dO_D","Pos_emaC_dO_UP","Pos_emaO_dC_UP","Pos_emaO_dC_D","Profit_L_1","Profit_L_2","Profit_L_3","Profit_L_4","Profit_L_5","Profit_L_6","Profit_L_7","Profit_L_8","Profit_L_9","Profit_L_10")
There is vector with variables for later usage
vector<-c("Pos_emaO_dO_UP","Pos_emaO_dO_D","Pos_emaC_dC_UP","Pos_emaC_dC_D","Pos_emaC_dO_D","Pos_emaC_dO_UP","Pos_emaO_dC_UP","Pos_emaO_dC_D")
I made all possible combination with 4 variables of the vector (there are no depended variables)
comb<-as.data.frame(combn(vector,4))
comb
and get out the ,,nonsense" combination (where are both possible values of variable)
rc<-comb[!sapply(comb, function(x) any(duplicated(sub('_D|_UP', '', x))))]
rc
Then I prepare the first combination to later subseting
var<-paste(rc[,1],collapse=" & ")
var
and subset the frame (with all DVs)
kr<-eval(parse(text=paste0('subset(frame,' , var,')' )))
kr
Now I have the subseted df by the first combination of 4 variables.
Then I used the evaluation function on it
evaluation<-function(x){
s_1<-nrow(x[x$Profit_L_1>0,])/nrow(x)
s_2<-nrow(x[x$Profit_L_2>0,])/nrow(x)
s_3<-nrow(x[x$Profit_L_3>0,])/nrow(x)
s_4<-nrow(x[x$Profit_L_4>0,])/nrow(x)
s_5<-nrow(x[x$Profit_L_5>0,])/nrow(x)
s_6<-nrow(x[x$Profit_L_6>0,])/nrow(x)
s_7<-nrow(x[x$Profit_L_7>0,])/nrow(x)
s_8<-nrow(x[x$Profit_L_8>0,])/nrow(x)
s_9<-nrow(x[x$Profit_L_9>0,])/nrow(x)
s_10<-nrow(x[x$Profit_L_10>0,])/nrow(x)
n_1<-nrow(x[x$Profit_L_1>0,])/nrow(frame)
n_2<-nrow(x[x$Profit_L_2>0,])/nrow(frame)
n_3<-nrow(x[x$Profit_L_3>0,])/nrow(frame)
n_4<-nrow(x[x$Profit_L_4>0,])/nrow(frame)
n_5<-nrow(x[x$Profit_L_5>0,])/nrow(frame)
n_6<-nrow(x[x$Profit_L_6>0,])/nrow(frame)
n_7<-nrow(x[x$Profit_L_7>0,])/nrow(frame)
n_8<-nrow(x[x$Profit_L_8>0,])/nrow(frame)
n_9<-nrow(x[x$Profit_L_9>0,])/nrow(frame)
n_10<-nrow(x[x$Profit_L_10>0,])/nrow(frame)
pr_1<-sum(kr[,"Profit_L_1"])/nrow(kr[,kr=="Profit_L_1"])
pr_2<-sum(kr[,"Profit_L_2"])/nrow(kr[,kr=="Profit_L_2"])
pr_3<-sum(kr[,"Profit_L_3"])/nrow(kr[,kr=="Profit_L_3"])
pr_4<-sum(kr[,"Profit_L_4"])/nrow(kr[,kr=="Profit_L_4"])
pr_5<-sum(kr[,"Profit_L_5"])/nrow(kr[,kr=="Profit_L_5"])
pr_6<-sum(kr[,"Profit_L_6"])/nrow(kr[,kr=="Profit_L_6"])
pr_7<-sum(kr[,"Profit_L_7"])/nrow(kr[,kr=="Profit_L_7"])
pr_8<-sum(kr[,"Profit_L_8"])/nrow(kr[,kr=="Profit_L_8"])
pr_9<-sum(kr[,"Profit_L_9"])/nrow(kr[,kr=="Profit_L_9"])
pr_10<-sum(kr[,"Profit_L_10"])/nrow(kr[,kr=="Profit_L_10"])
mat<-matrix(c(s_1,n_1,pr_1,s_2,n_2,pr_2,s_3,n_3,pr_3,s_4,n_4,pr_4,s_5,n_5,pr_5,s_6,n_6,pr_6,s_7,n_7,pr_7,s_8,n_8,pr_8,s_9,n_9,pr_9,s_10,n_10,pr_10),ncol=3,nrow=10,dimnames=list(c(1:10),c("s","n","pr")))
df<-as.data.frame(mat)
return(df)
}
result<-evaluation(kr)
result
And I need to help in several cases.
1, in evaluation function the way the matrix is made is wrong (s_1,n_1,pr_1 are starting in first column but I need to start the order by rows)
2, I need to use some loop/lapply function to go trough all possible combinations (not only the first one like in this case (var<-paste(rc[,1],collapse=" & ")) and have the understandable output where is evaluation function used on every combination and I will be able to see for which combination of variables is the evaluation done (understand I need to recognize for what is this evaluation made) and compare evaluation results for each combination.
3, This is not main point, BUT I generally want to evaluate all possible combinations (it means for 2:n number of variables and also all combinations in each of them) and then get the best possible combination according to specific DV (Profit_L_1 or Profit_L_2 and so on). And I am so weak in looping now, so, if it this possible, keep in mind what am I going to do with it later.
Thanks, feel free to update, repair or improve the question (if there is something which could be done way more easily, effectively - do it - I am open for every senseful advice.

Efficient function to return varying length vector from lookup table

I have three data sources:
types<-c(1,3,3)
places<-list(c(1,2,3),1,c(2,3))
lookup.counts<-as.data.frame(matrix(runif(9,min=0,max=10),nrow=3,ncol=3))
assigned.places<-rep.int(0,length(types))
the numbers in the "types" vector tell me what 'type' a given observation is. The vectors in the places list tell me which places the observation can be found in (some observations are found in only one place, others in all places). By definition there is one entry in types and one list in places for each observation. Lookup.counts tells me how many observations of each type are located in each place (generated from another data source).
I want to randomly assign each observation to a place based on a probability generated from lookup.counts. Using for loops it looks something like"
for (i in 1:length(types)){
row<-types[i]
columns<-places[[i]]
this.obs<-lookup.counts[row,columns] #the counts of this type in each place
total<-sum(this.obs)
this.obs<-this.obs/total #the share of observations of this type in these places
pick<-runif(1,min=0,max=1)
#the following should really be a 'while' loop, but regardless it needs help
for(j in 1:length(this.obs[])){
if(this.obs[j] > pick){
#pick is less than this county so assign
pick<- 100 #just a way of making sure an observation doesn't get assigned twice
assigned.places[i]<-colnames(lookup.counts)[j]
}else{
#pick is greater, move to the next category
pick<- pick-this.obs[j]
}
}
}
I have been trying to vectorize this somehow, but am getting hung up on the variable length of 'places' and of 'this.obs'
In practice, of course, the lookup.counts table is quite a bit bigger (500 x 40) and I have some 900K observations with places lists of length 1 through length 39.
To vectorize the inner loop, you can use sample or sample.int to choose from several alternaives with prescribed probabilities. Unless I read your code incorrectly, you want something like this:
assigned.places[i] <- sample(colnames(this.obs), 1, prob = this.obs)
I'm a bit surprised that you're using colnames(lookup.counts) instead. Shouldn't this be subset by columns as well? It seems that either I missed something, or there is a bug in your code.
the different lengths of your lists are a severe obstacle to vectorizing your outer loops. Perhaps you could use the Matrix package to store that information as sparse matrices. Then you could simply multiply probabilities by that vector to exclude those columns which are not in the places list of a given observation. But as you'd probably still use apply for the above sampling code, you might as well keep the list and use some form of apply to iterate over that.
The overall result might look somewhat like this:
assigned.places <- colnames(lookup.counts)[
apply(cbind(types, places), 1, function(x) {
sample(x[[2]], 1, prob=lookup.counts[x[[1]],x[[2]]])
})
]
The use of cbind and apply isn't particularly beautiful, but seems to work. Each x is a list of two items, x[[1]] being the type and x[[2]] being the corresponding places. We use these to index lookup.counts just as you did. Then we use the found counts as relative probabilities when choosing the index of one of the columns we used in the subscript. Only after all these numbers have been assembled into a single vector by apply will the indices be turned into names based on colnames.
You can check whether things are faster if you don't cbindstuff together, but instead iterate over the indices only:
assigned.places <- colnames(lookup.counts)[
sapply(1:length(types), function(i) {
sample(places[[i]], 1, prob=lookup.counts[types[i],places[[i]]])
})
]
This appears to work as well:
# More convenient if lookup.counts is a matrix.
lookup.counts<-matrix(runif(9,min=0,max=10),nrow=3,ncol=3)
colnames(lookup.counts)<-paste0('V',1:ncol(lookup.counts))
# A function that does what the for loop does for each i
test<-function(i) {
this.places<-colnames(lookup.counts)[places[[i]]]
this.obs<-lookup.counts[types[i],this.places]
sample(this.places,size=1,prob=this.obs)
}
# Applies the function for all i
sapply(1:length(types),test)

perform function on pairs of columns

I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a dataframe that is more condusive to eventual functions I want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
print(kud)
}
My end goal is to calculate kernel.area for each resampling event (ie rows 1:100 for every pair of columns up to 200), and be able to combine the results in a dataframe. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I change it and now have the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be
df3 <- t(df2), but this is most likely correct in your actual code
and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the
loop. j:j+1 is just a single number, since the : has a higher
precedence than + (see ?Syntax for the order in which
mathematical operations are conducted in R). To get the desired two
columns, use j:(j+1) instead.
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1,ncol(df3)-1,2)) {
kud <-kernelUD(SpatialPoints(df3[,j:(j+1)]),kern="bivnorm")
kernAr<-kernel.area(kud,unin=c("m"),unout=c("km2"))
print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().

Resources