I'm very, very new to R and RStudio and not great with programming or stats. I need to calculate a dissimilarity index, and I'm trying to use the OasisR package. The function ISDuncan(x) computes an index for every population group, but it does so for the data.frame as a whole. I need a calculation for each observation and each population group. According to:
# https://github.com/cran/OasisR/blob/419f40ff60eb1756a2b8ed0960c5c9e8cb90368d/R/SegFunctions.R
ISDuncan <- function(x) {
    x <- segdataclean(as.matrix(x))$x   # OasisR's internal data-cleaning step
    result <- vector(length = ncol(x))
    for (i in 1:ncol(x))                # one index value per population group
        result[i] <- 0.5 * sum(abs((x[, i]/sum(x[, i])) - ((rowSums(x) - x[, i])/sum(rowSums(x) - x[, i]))))
    return(round(result, 4))
}
Can anyone help me? Thanks!!!
Jor
Thanks SOOOO much! The results look good, but it only works for the first observation? How can I get a matrix or data frame covering every observation in the data.frame? I'm planning to do this for a large number of observations.
Anyway, this is VERY helpful!!!
Thank you for your interest and help, I'm very lost. I have an Excel sheet where every row is a district of a province, and I want to know the ISDuncan value for the social groups A, B and C in each district.
data example
What I hope to get is a matrix where the rows represent the districts and the columns the social groups, so that every district has its own ISDuncan values. For now I'm trying this on a small dataset, but I would do the analysis for a large number of spatial units. Thanks!
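Since ISDuncan is a sum of one term per spatial unit, one possible reading of what you want is each district's contribution to that sum, returned as a districts-by-groups matrix. Below is a minimal sketch of that idea; ISDuncanByUnit is a hypothetical helper, not an OasisR function, and it skips OasisR's segdataclean() step, so it assumes no empty rows or columns:
ISDuncanByUnit <- function(x) {
    x <- as.matrix(x)
    result <- matrix(NA_real_, nrow = nrow(x), ncol = ncol(x), dimnames = dimnames(x))
    for (i in 1:ncol(x)) {
        rest <- rowSums(x) - x[, i]   # population of all other groups, per district
        result[, i] <- 0.5 * abs(x[, i]/sum(x[, i]) - rest/sum(rest))
    }
    round(result, 4)
}
# toy example: three districts (rows), groups A, B, C (columns)
d <- data.frame(A = c(10, 40, 50), B = c(30, 30, 40), C = c(5, 20, 10))
ISDuncanByUnit(d)
colSums(ISDuncanByUnit(d))   # sanity check: reproduces ISDuncan(d), one value per group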
Seems like quite an easy problem to solve, but I can't seem to get my head around it in R.
I have a dataset with the following columns:
'Biomass' where each row is a value of biomass for a particular species
'Count' where each row is the number of individual animals of that species counted
I need to create a histogram of biomasses, but if I use hist(DF$Biomass) I will get a histogram of the biomasses of the animals where each value is one animal.
I need to include the count, so that I have (for example) the weight frequencies of elephant x 2, giraffe x 56, etc.
you're not making my life easy :)
Is this what you want ?
DF <- data.frame(Biomass = c(200, 200, 1500), Count = c(36, 20, 2))
DF2 <- aggregate(Count ~ Biomass, DF, sum)   # sum the counts for each distinct Biomass value
barplot(DF2$Count, names.arg = DF2$Biomass)  # a barplot is more appropriate here than a histogram in the R sense
If I understood you right that is what you need :)
biomass <- c(1, 5, 7, 6, 3)
count <- c(1, 2, 1, 3, 4)
new <- NULL
for (i in 1:length(biomass)) {
    new <- c(new, rep(biomass[i], count[i]))  # repeat each biomass value count[i] times
}
new
hist(new)
So finally just type:
new <- NULL
for (i in 1:length(DF$Biomass)) {
    new <- c(new, rep(DF$Biomass[i], DF$Count[i]))
}
hist(new)
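As an aside, rep() is itself vectorized over its times argument, so the whole loop collapses to one line (using the same DF as above):
new <- rep(DF$Biomass, times = DF$Count)  # repeat each Biomass value Count times
hist(new)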
I want to create a random subset of a data.table df that is very large (around 2 million lines).
The data table has a weight column, wgt, that indicates how many observations each line represents.
To generate the vector of row numbers I want to extract, I proceed as follows:
I get the exact number of observations:
ns<- length(df$wgt)
I get the number of desired lines (30% of the sample):
lines<-round(0.3*ns)
I compute the vector of probabilities:
pr<-df$wgt/sum(df$wgt)
And then I compute the vector of line numbers to get the subsample:
ssout <- sample(1:ns, size = lines, prob = pr)
The final aim is to subset the data using df[ssout,]. However, R gets stuck when computing ssout.
Is there a faster/more efficient way to do this?
Thank you!
I'm guessing that df is a summary description of a data set that has repeated observations (with wgt being the count of repetitions). In that case, the only useful way to sample from it would be with replacement; and a proper 30% sample would be 30% of the real population, .3*sum(wgt):
# example data
wgt <- sample(10, 2e6, replace=TRUE)   # one weight per line of df
nobs <- sum(wgt)                       # size of the underlying population
pr <- wgt/sum(wgt)
# select rows, with replacement, weighted by pr
system.time(x <- sample.int(2e6, size=.3*nobs, prob=pr, replace=TRUE))
#   user  system elapsed
#   0.20    0.02    0.22
Sampling rows without replacement takes forever on my computer, but is also something that I don't think one needs to do here.
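To finish the job with the rows selected above (a sketch assuming df is the questioner's data.table and x is the index vector returned by sample.int()):
library(data.table)
sub <- df[x]   # 30% resample of the underlying population; rows can repeat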
I am very new to the R interface but need to use the program in order to run the relevant analyses for my clinical doctorate thesis. So, apologies in advance if this is a novice question.
I have a matrix of beta methylation values with dimensions 485577 x 894. The row names of the matrix are CpG sites, which are non-numeric and not in ascending order (e.g. "cg00000029" "cg00000108" "cg00000109" "cg00000165"), while the column names are participant IDs, which are also non-numeric and not in ascending order (e.g. "11209" "14140" "1260" "5414").
I would like to identify which beta methylation values are > 0.5 so that I can exclude them from further analyses, and I need the data to stay in matrix format. All my attempts so far have returned integer vectors rather than the data in a matrix.
I would be so grateful if someone could please advise me of the code to conduct this analysis.
Thank you for your time.
Cheers,
Alicia
set.seed(1)  # so the example is reproducible
m <- matrix(runif(1000, 0, 0.6), nrow=100)  # 100 rows x 10 cols, data in U[0, 0.6]
m[m > 0.5] <- NA  # anything > 0.5 set to NA
z <- na.omit(m)   # remove all rows containing any NAs
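If you only need to locate the offending values while keeping m a matrix, which() with arr.ind = TRUE returns their positions; run it before the NA replacement step, since which() skips NAs:
idx <- which(m > 0.5, arr.ind = TRUE)  # two-column matrix of (row, col) positions
head(idx)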
I have a dataset consisting of monthly observations for returns of US companies. I am trying to exclude from my sample all companies which have fewer than a certain number of non-NA observations.
I managed to do what I want using foreach, but my dataset is very large and this takes a long time. Here is a working example which shows how I accomplished what I wanted and hopefully makes my goal clear
#load required packages
library(data.table)
library(foreach)
#example data
myseries <- data.table(
X = sample(letters[1:6],30,replace=TRUE),
Y = sample(c(NA,1,2,3),30,replace=TRUE))
setkey(myseries,"X") #so X is the company identifier
#here I create another data table with each company identifier
#and its number of non-NA observations
nobsmyseries <- myseries[, list(NOBSnona = length(Y[complete.cases(Y)])), by=X]
# then I select the companies which have fewer than 3 non-NA observations
comps <- nobsmyseries[NOBSnona < 3,]
#finally I exclude all companies which are in the list "comps",
#that is, I exclude companies which have fewer than 3 non-NA observations,
#but I do it for each of the companies in the list, one by one,
#and this is what makes it slow.
for (i in 1:dim(comps)[1]) {
    myseries <- myseries[X != comps$X[i],]
}
How can I do this more efficiently? Is there a data.table way of getting the same result?
If you have more than one column you wish to consider for NA values you can use complete.cases(.SD); however, since you want to test a single column, I would suggest something like
naCases <- myseries[, list(nonNA = sum(!is.na(Y))), by=X]  # count of non-NA Y values per company
You can then join, given a threshold count of non-NA values, e.g.
threshold <- 3
myseries[naCases[nonNA >= threshold]]  # keep companies with at least 3 non-NA observations
You could also use a not-join to select the cases you have excluded:
myseries[!naCases[nonNA >= threshold]]
As noted in the comments, something like
myseries[, nonNA := sum(!is.na(Y)), by=X][nonNA >= 3]
would also work; however, in that case you perform a vector scan on the entire data.table, whereas the previous solution performs the vector scan on a table with only length(unique(myseries[['X']])) rows.
Given that it is a single vector scan it will be efficient regardless (and a binary join plus a small vector scan may even be slower than one larger vector scan), so I doubt there will be much difference either way.
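If you want to check that on your own data, here is a quick timing sketch (assuming the myseries and naCases objects from above; copy() keeps := from modifying the original table during the test):
library(data.table)
tmp <- copy(myseries)
system.time(r1 <- myseries[naCases[nonNA >= 3]])                       # join + small scan
system.time(r2 <- tmp[, nonNA := sum(!is.na(Y)), by=X][nonNA >= 3])    # full-table scan
On the 30-row toy data both are instantaneous; any difference only shows up at scale.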
How about aggregating the number of non-NA values in Y over X, and then subsetting?
# Count non-NA observations per company
num_nas <- as.data.table(aggregate(formula = Y ~ X, data = myseries, FUN = function(x) sum(!is.na(x))))
# Subset: keep companies with at least 3 non-NA observations
myseries[X %in% num_nas$X[num_nas$Y >= 3],]
I'm trying to save a number of spectral measurements in a data.frame. Each measurement has a number of attributes as well as two channels of spectral data, each with 2048 data points. I would like to have each channel be a single point of data in the data frame.
Something like this:
timestamp type integration channel1 channel2
1 2011-10-02 02:00:01 D 2000 (spec) (spec)
2 2011-10-02 02:00:07 D 2000 (spec) (spec)
Where each (spec) is a vector of 2048 values. I simply cannot get it to work, and I now turn to you guys for help.
Thanks in advance.
You can add a matrix as one of a data.frame's fields, so you have to put all the vectors in as matrix rows.
DF <- data.frame(timestamp=1:3, type=LETTERS[1:3], integration=rep(2000, 3))
DF$channel1 <- matrix(rnorm(3*2048), nrow=3)
DF$channel2 <- matrix(rnorm(3*2048), nrow=3)
ncol(DF)  # == 5
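A quick usage note: with a matrix column, one measurement's spectrum is a row of the embedded matrix:
spec <- DF$channel1[1, ]  # channel-1 spectrum of the first measurement
length(spec)              # 2048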
I think what you want is doable, though I may not be fully understanding your question. Heed Joris's suggestion, though, as that may be a better way of storing your data. You can accomplish what you want by storing the vectors of 2048 values in a list that you then add to the data frame as a column. Your table wasn't easily imported (for me, anyway) with read.table, so I made up my own data frame and example.
DF <- data.frame(timestamp=1:3, type=LETTERS[1:3], integration=rep(2000, 3))
DF$channel1 <- list(c(rnorm(2048)), c(rnorm(2048)), c(rnorm(2048)))
DF$channel2 <- list(c(rnorm(2048)), c(rnorm(2048)), c(rnorm(2048)))
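With the list-column version, each spectrum comes back out with [[:
spec <- DF$channel1[[1]]  # numeric vector of 2048 values for the first measurement
length(spec)              # 2048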