I am new to R and am trying to create a new dataframe of bootstrapped resamples of groups of different sizes. My dataframe has 6 variables and a group designation, and there are 128 groups of different Ns. Here is an example of my data:
head(PhenoM2)
ID Name PhenoNames Group HML RML FML TML FHD BIB
1 378607 PaleoAleut PaleoAleut 1 323.5 248.75 434.50 355.75 46.84 NA
2 378664 PaleoAleut PaleoAleut 1 NA 238.50 441.50 353.00 45.83 277.0
3 378377 PaleoAleut PaleoAleut 1 309.5 227.75 419.00 332.25 46.39 284.0
4 378463 PaleoAleut PaleoAleut 1 283.5 228.75 397.75 331.00 44.37 255.5
5 378602 PaleoAleut PaleoAleut 1 279.5 230.00 393.00 329.50 45.93 265.0
6 378610 PaleoAleut PaleoAleut 1 307.5 234.25 419.50 338.50 43.98 271.5
Pulling from this question - bootstrap resampling for hierarchical/multilevel data - and taking some advice from others (thanks!) I wrote the code:
resample.M <- NULL
for(i in 1000){
groups <- unique(PhenoM2$"Group")
for(ii in 1:128)
data.i.ii <- PhenoM2[PhenoM2$"Group"==groups[ii],]
resample.M[i] <- data.i.ii[sample(1:nrow(data.i.ii),replace=T),]
}
Unfortunately, this gives me the warning:
In resample.M[i] <- data.i.ii[sample(1:nrow(data.i.ii), replace = T),:
number of items to replace is not a multiple of replacement length
Which I understand, since each of the 128 groups has a different N and none of them is a multiple of 1000. I used resample.M[i] to try to accumulate all of the 1000x resamples of the 128 groups into a single database, and I'm pretty sure the problem is here.
Nearly all of the examples of for loops I've read create a vector to hold the results - numeric(1000) - and then plug in the information, but since I want all of the data (which includes factors, integers, and numerics) this doesn't work. I tried making a matrix to put the info in (there are 2187 unique individuals in the dataframe):
resample.M <- matrix(ncol=2187000,nrow=10)
But it's giving me the same warning.
So, since I'm sure I'm missing something basic here, I have three questions:
How can I get this code to resample all of the groups (with replacement and based on their individual Ns)?
How can I get this code to repeat this resampling 1000x?
How can I get the resamples of every group into the same database?
Thank you so much for your insight and expertise!
I think you may have wanted to use double square brackets to store the results in a list, i.e. resample.M[[i]] <- .... Apart from that, it makes more sense to write PhenoM2$Group than PhenoM2$"Group", and groups <- unique(PhenoM2$Group) can go outside of your for loop since you only need to compute it once. Also, replace 1:128 with 1:length(groups) or seq_along(groups), so that you don't need to hard-code the length of the vector.
Because you will often need to operate on data frames grouped by some variable, I suggest you familiarise yourself with a package designed to do that, rather than using for loops, which can be very slow. The best one for a beginner in R may be plyr, which has an easy syntax (although there are many possibilities, including the slightly more "advanced" packages like dplyr and data.table).
So for a subset d <- subset(PhenoM2, Group == 1), you already have the function you need to perform on it: function(d) d[sample(1:nrow(d), replace = TRUE),].
Now, to go over all such subsets, perform this operation, and arrange the results in a new data frame named samples, you do:
library(plyr)
samples <- ddply(PhenoM2, .(Group),
                 function(d) d[sample(1:nrow(d), replace = TRUE),])
So what remains is to iterate this 1000 or however many times you want. You can use a for loop for this, storing the results in a list. Note that you need to use double square bracket [[ to set elements of the list.
n <- 1000 # number of iterations
samples <- vector("list", n) # list of length n to store results
for (i in seq_along(samples))
samples[[i]] <- ddply(PhenoM2, .(Group),
function(d) d[sample(1:nrow(d), replace = TRUE),])
An alternative way would be to use the function replicate, which performs the same task many times.
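For instance, a minimal sketch using replicate (with simplify = FALSE so the result stays a list of data frames, just like with the for loop above):
n <- 1000
samples <- replicate(n,
                     ddply(PhenoM2, .(Group),
                           function(d) d[sample(1:nrow(d), replace = TRUE),]),
                     simplify = FALSE)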
Once you have done this, all resamples will be stored in a list. I am not sure what you mean by "How can I get the resamples of every group into the same database". If you want to group them in a single data frame, you do all.samples <- do.call(rbind, samples). In general, you can format your list of samples using do.call and lapply together with a function.
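For example, if you also want to keep track of which resample each row came from, one possible sketch (the replicate column name is my own illustrative choice, nothing required):
all.samples <- do.call(rbind, lapply(seq_along(samples),
                       function(i) cbind(replicate = i, samples[[i]])))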
I want to find outliers and eliminate them in my data (named "df"):
> head(df)
cluster machine.code age Good.Times repair.price
1 1 13010132 23 58.54 198170000
2 1 13010129 23 105.25 390847500
3 1 13010131 23 20.50 20701747
4 1 13010072 18 14.30 22340000
5 1 13010101 18 57.63 13220000
6 1 13010106 27 49.96 254450000
My data has 65 clusters, and I want to run the outlier detection within each cluster separately.
I had used the code below for outlier detection on a single cluster before and it was fine:
library("ggstatsplot")
df<- read.csv("C:/Users/gadmin/Desktop/dataE.csv",header = TRUE)
ggbetweenstats(df,cluster, repair.price , outlier.tagging = TRUE)
Q <- quantile(df$repair.price, probs=c(.25, .75), na.rm = FALSE)
iqr <- IQR(df$repair.price)
up <- Q[2]+1.5*iqr # Upper Range
low<- Q[1]-1.5*iqr # Lower Range
eliminated<- subset(df, df$repair.price > (Q[1] - 1.5*iqr) & df$repair.price < (Q[2]+1.5*iqr))
ggbetweenstats(eliminated, cluster, repair.price, outlier.tagging = TRUE)
Now I want to do the same thing for all 65 clusters using a "for" loop, something like this:
for(i in 1:length(unique(df$cluster))) {
...
}
but I don't know how. (I mean: after detecting the outliers in the first cluster, how should the data be subset so the process continues on to the next cluster?)
Core question
There are various ways to detect outliers. As for the core of your question, I understand it as "How do I subset the data so I can apply a for-loop to remove the outliers for each cluster?"
# maybe insert a column id that assigns an id (identical to the row number) to identify individual entries
df$id <- seq(1, nrow(df))
# make a list to store the outlier ids for each cluster
outlrs <- list()
# loop through the clusters
for(clust in unique(df$cluster)){
  # work on one cluster at a time (avoid calling this object "subset", which masks base::subset)
  clust_df <- df[df$cluster == clust, ]
  outlrs[[clust]] <- [INSERT YOUR OUTLIER DETECTION FUNCTION HERE*]
}
# remove the outliers
outliers <- do.call(rbind, outlrs)
df <- df[-outliers, ]
* the outlier detection function you use should ultimately output the id of the row containing the outlier. This part would have to be adapted to your method of outlier identification.
I didn't test it since I have insufficient data. You could use e.g. dput(df) to output a version of your data you can copy and paste to make it accessible to people who want to test their proposed solutions.
Edit: one (of many) alternative approaches
Alternatively, you could apply the functions you included in your question to a subset of the data within the loop, store the cleaned-up output (e.g. as a list), and subsequently apply do.call(rbind.data.frame, your_list) to the list, as sketched below.
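A rough, untested sketch of that idea, reusing the quantile/IQR code from the question (assuming repair.price is the column you screen within each cluster; cleaned and df_clean are just illustrative names):
cleaned <- list()
for (clust in unique(df$cluster)) {
  d   <- df[df$cluster == clust, ]
  Q   <- quantile(d$repair.price, probs = c(.25, .75), na.rm = TRUE)
  iqr <- IQR(d$repair.price, na.rm = TRUE)
  # keep only the rows of this cluster that fall inside the 1.5*IQR fences
  cleaned[[as.character(clust)]] <- subset(d, repair.price > (Q[1] - 1.5 * iqr) &
                                              repair.price < (Q[2] + 1.5 * iqr))
}
# stitch the cleaned clusters back into one data frame
df_clean <- do.call(rbind.data.frame, cleaned)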
Note
As Phil pointed out, it is questionable whether outliers should be removed at all, especially when you're just applying a loop that "takes care of them". While we can provide the means by which "outliers" can be removed programmatically, the question of whether you should actually remove those outliers in a given situation is a separate one (probably better suited to CrossValidated). It should also be noted that there are many algorithms for determining which values differ "significantly" from the bulk of the values, and the border between "significant" and "not significant" is arbitrary.
I have two data frames. One of them contains 165 columns (species names) and almost 193,000 rows; each cell holds a number from 0 to 1, which is the probability that the species is present at that point.
POINTID Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran
2 0.0279037 0.604687 0.0388309 0.0161980 0.0143966 0.240152
3 0.0294101 0.674846 0.0673055 0.0481405 0.0397423 0.231308
4 0.0292839 0.603869 0.0597947 0.0526606 0.0463431 0.188875
6 0.0331264 0.541165 0.0470451 0.0270871 0.0373348 0.256662
8 0.0393825 0.672371 0.0715808 0.0559353 0.0565391 0.230833
9 0.0376557 0.663732 0.0747417 0.0445794 0.0602539 0.229265
The second data frame contains 164 columns (the same species names as the first data frame) and one row, which is the threshold for each species: above it we assume the species is present, below it the species is absent.
Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran Acta_Spic
0.3155 0.2816 0.2579 0.2074 0.3007 0.3513 0.3514
What I want to do is make a new data frame that contains, for every species in the presence-probability data (my.data), the original probability if it is above the threshold (thres) and zero if it is below the threshold.
I know this would involve a for loop and an if statement, but I am new to R and I don't know how to do this.
Please help me.
I think you want something like this:
(First, make up a small reproducible example.)
set.seed(101)
speciesdat <- data.frame(pointID=1:10,matrix(runif(100),ncol=10,
dimnames=list(NULL,LETTERS[1:10])))
threshdat <- rbind(seq(0.1,1,by=0.1))
Now process:
thresh <- unlist(threshdat) ## make data frame into a vector
## 'sweep' runs the function column-by-column if MARGIN=2
ss2 <- sweep(as.matrix(speciesdat[,-1]),MARGIN=2,STATS=thresh,
FUN=function(x,y) ifelse(x<y,0,x))
## recombine results with the first column
speciesdat2 <- data.frame(pointID=speciesdat$pointID,ss2)
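Applied to your own objects, the same pattern might look like this (a sketch assuming my.data has POINTID as its first column and thres holds one threshold per species, in the same column order as my.data):
thresh <- unlist(thres)                 # one named threshold per species
ss <- sweep(as.matrix(my.data[, -1]), MARGIN = 2, STATS = thresh,
            FUN = function(x, y) ifelse(x < y, 0, x))
result <- data.frame(POINTID = my.data$POINTID, ss)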
It's simpler to have the same number of columns in both data frames (with the same meanings, of course). Calling your presence data frame1 and your threshold row frame2, you can pad frame2 with a dummy POINTID so the two line up:
frame2 = data.frame(POINTID=0, frame2)
R works with vectors so a row of frame1 can be directly compared to frame2
frame1[1,] < frame2
You could use an explicit loop over every row of frame1, but it's common to use the implicit loop of "apply":
answer = apply(frame1, 1, function(x) x < frame2)
This is all a rather sloppy solution (especially the modification of frame2), but it hopefully demonstrates some basic R. Also, I'd generally prefer arrays and matrices when possible (they can still use labels but are generally faster).
This produces a logical matrix which can be used to generate assignments with "[<-". (Assuming the multi-row data frame is named "cols" and the named vector of thresholds is "vec"):
sweep(cols[-1], 2, vec, ">") # identifies the items to keep
cols[-1][ sweep(cols[-1], 2, vec, "<") ] <- 0
Your example produced a warning about the mismatch of the number of columns with the length of the vector, but presumably you can adjust the length of the vector to be the correct number of entries.
I want to calculate the pooled (actually weighted) standard deviation for all the unique sites in my data frame.
The values for these sites are values for single-species forest stands, and I want to pool the mean and the SD so that I can compare broadleaved stands with conifer stands.
This is the data frame (df) with values for the broadleaved stands:
keybl n mean sd
Vest02DenmDesp 3 58.16 6.16
Vest02DenmDesp 5 54.45 7.85
Vest02DenmDesp 3 51.34 1.71
Vest02DenmDesp 3 59.57 5.11
Vest02DenmDesp 5 62.89 10.26
Vest02DenmDesp 3 77.33 2.14
Mato10GermDesp 4 41.89 12.6
Mato10GermDesp 4 11.92 1.8
Wawa07ChinDesp 18 0.097 0.004
Chen12ChinDesp 3 41.93 1.12
Hans11SwedDesp 2 1406.2 679.46
Hans11SwedDesp 2 1156.2 464.07
Hans11SwedDesp 2 4945.3 364.58
Keybl is the code for the site. The formula for the pooled SD is:
s = sqrt( ((n1-1)*s1^2 + (n2-1)*s2^2) / (n1+n2-2) )
(Sorry I can't post pictures and did not find a link that would directly go to the formula)
Here 2 is the number of groups, so it will change depending on the site. I know this formula is used for t-tests comparing two groups; in this case I'm not planning to compare these groups. My professor suggested I use this formula to get a weighted SD. I didn't find an R function that incorporates this formula in the way I need it, so I tried to build my own. I am, however, new to R and not very good at making functions and loops, so I hope for your help.
This is what I got so far:
sd=function (data) {
nc1=data[z,"nc"]
sc1=data[z, "sc"]
nc2=data[z+1, "nc"]
sc2=data[z+1, "sc"]
sd1=(nc1-1)*sc1^2 + (nc2-1)*sc2^2
sd2=sd1/(nc1+nc2-length(nc1))
sqrt(sd2)
}
splitdf=split(df, with(df, df$keybl), drop = TRUE)
for (c in 1:length(splitdf)) {
for (i in 1:length(splitdf[[i]])) {
a = (splitdf[[i]])
b =sd(a)
}
}
1) The function itself is not correct as it gives slightly lower values than it should and I don't understand why. Could it be that it does not stop when z+1 has reached the last row? If so, how can that be corrected?
2) The loop is totally wrong but it is what I could come up with after several hours of no success.
Can anybody help me?
Thanks,
Antra
What you're trying to do would benefit from a more general formula which will make it easier. If you didn't need to break it into pieces by the keybl variable you'd be done.
dd <- df #df is not a good name for a data.frame variable since df has a meaning in statistics
dd$df <- dd$n-1
pooledSD <- sqrt( sum(dd$sd^2 * dd$df) / sum(dd$df) )
# note, in this case I only pre-calculated df because I'll need it more than once. The sum of squares, variance, etc. are only used once.
An important general principle in R is that you use vector math as much as possible. In this trivial case it won't matter much but in order to see how to do this on large data.frame objects where compute speed is more important read on.
# First use R's vector facilities to define the variables you need for pooling.
dd$df <- dd$n-1
dd$s2 <- dd$sd^2 # sd isn't a good name for standard deviation variable even in a data.frame just because it's a bad habit to have... it's already a function and standard deviations have a standard name
dd$ss <- dd$s2 * dd$df
And now just use convenience functions for splitting and calculating the necessary sums. Note only one function is executed here in each implicit loop (*apply, aggregate, etc. are all implicit loops executing functions many times).
ds <- aggregate(ss ~ keybl, data = dd, sum)
ds$df <- tapply(dd$df, dd$keybl, sum) #two different built in methods for split apply, we could use aggregate for both if we wanted
# divide your ss by your df and voila
ds$s2 <- ds$ss / ds$df
# and also you can easly get your sd
ds$s <- sqrt(ds$s2)
And the correct answer is:
keybl ss df s2 s
1 Chen12ChinDesp 2.508800e+00 2 1.254400e+00 1.120000
2 Hans11SwedDesp 8.099454e+05 3 2.699818e+05 519.597740
3 Mato10GermDesp 4.860000e+02 6 8.100000e+01 9.000000
4 Vest02DenmDesp 8.106832e+02 16 5.066770e+01 7.118125
5 Wawa07ChinDesp 2.720000e-04 17 1.600000e-05 0.004000
This looks much less concise than other methods (like 42-'s answer) but if you unroll those in terms of how many R commands are actually being executed this is much more concise. For a short problem like this either way is fine but I thought I'd show you the method that uses the most vector math. It also highlights why those convenient implicit loop functions are available, for expressiveness. If you used for loops to accomplish the same then the temptation would be stronger to put everything in the loop. This can be a bad idea in R.
The pooled SD under the assumption of independence (so the covariance terms can be assumed to be zero) will be: sqrt( sum_over_groups[ (n_g - 1) * s_g^2 ] / ( sum_over_groups[ n_g - 1 ] - N_groups ) )
lapply( split(dat, dat$keybl),
function(dd) sqrt( sum( dd$sd^2 * (dd$n-1) )/(sum(dd$n-1)-nrow(dd)) ) )
#-------------------------
$Chen12ChinDesp
[1] 1.583919
$Hans11SwedDesp
[1] Inf
$Mato10GermDesp
[1] 11.0227
$Vest02DenmDesp
[1] 9.003795
$Wawa07ChinDesp
[1] 0.004123106
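If you would rather collect these into a single data frame than leave them as a printed list, one small sketch of the same computation (pooled_list and pooled_df are just illustrative names):
pooled_list <- lapply(split(dat, dat$keybl),
                      function(dd) sqrt(sum(dd$sd^2 * (dd$n - 1)) /
                                        (sum(dd$n - 1) - nrow(dd))))
pooled_df <- data.frame(keybl = names(pooled_list),
                        pooled_sd = unlist(pooled_list),
                        row.names = NULL)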
I have a dataset consisting of monthly observations of returns for US companies. I am trying to exclude from my sample all companies which have fewer than a certain number of non-NA observations.
I managed to do what I want using foreach, but my dataset is very large and this takes a long time. Here is a working example which shows how I accomplished what I wanted and hopefully makes my goal clear
#load required packages
library(data.table)
library(foreach)
#example data
myseries <- data.table(
X = sample(letters[1:6],30,replace=TRUE),
Y = sample(c(NA,1,2,3),30,replace=TRUE))
setkey(myseries,"X") #so X is the company identifier
#here I create another data table with each company identifier and its number
#of non NA observations
nobsmyseries <- myseries[,list(NOBSnona = length(Y[complete.cases(Y)])),by=X]
# then I select the companies which have less than 3 non NA observations
comps <- nobsmyseries[NOBSnona <3,]
#finally I exclude all companies which are in the list "comps",
#that is, I exclude companies which have less than 3 non NA observations
#but I do for each of the companies in the list, one by one,
#and this is what makes it slow.
for (i in 1:dim(comps)[1]){
myseries <- myseries[X != comps$X[i],]
}
How can I do this more efficiently? Is there a data.table way of getting the same result?
If you have more than 1 column you wish to consider for NA values then you can use complete.cases(.SD); however, as you want to test a single column, I would suggest something like
naCases <- myseries[,list(totalNA = sum(!is.na(Y))),by=X]
You can then join given a threshold for the count of non-NA values (note that, despite its name, totalNA here holds the number of non-missing observations), e.g.
threshold <- 3
myseries[naCases[totalNA > threshold]]
You could also select using a not-join to get the cases you have excluded:
myseries[!naCases[totalNA > threshold]]
As noted in the comments, something like
myseries[,totalNA := sum(!is.na(Y)),by=X][totalNA > 3]
would also work; however, in this case you are performing a vector scan on the entire data.table, whereas the previous solution performs the vector scan on a data.table with only length(unique(myseries[['X']])) rows.
Given that this is a single vector scan, it will be efficient regardless (and perhaps a binary join plus a small vector scan may be slower than one larger vector scan). However, I doubt there will be much difference either way.
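Putting the join approach together, a minimal end-to-end sketch (nonNA is my own, hopefully clearer, name for the count of non-missing values; the join uses on = "X" explicitly, although with setkey(myseries, "X") the key would be used anyway):
library(data.table)
threshold <- 3
counts <- myseries[, .(nonNA = sum(!is.na(Y))), by = X]
keep   <- counts[nonNA >= threshold]
myseries[keep, on = "X"]   # only companies with at least `threshold` non-NA values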
How about aggregating the number of non-NA values in Y over X, and then subsetting?
# Aggregate number of NAs
num_nas <- as.data.table(aggregate(formula=Y~X, data=myseries, FUN=function(x) sum(!is.na(x))))
# Subset
myseries[X %in% num_nas$X[num_nas$Y >= 3],]   # keep companies with at least 3 non-NA values
I'm an R newbie and am attempting to remove duplicate columns from a largish dataframe (50K rows, 215 columns). The frame has a mix of discrete, continuous, and categorical variables.
My approach has been to generate a table for each column in the frame, store them in a list, and then use the duplicated() function to find entries in the list that are duplicates, as follows:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
tables=apply(testframe,2,table)
dups=which(duplicated(tables))
testframe <- subset(testframe, select = -c(dups))
This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe containing duplicates):
summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))
If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?
How about:
testframe[!duplicated(as.list(testframe))]
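A quick check with the testframe built in the question (the duplicated height2 and gender2 columns should be dropped, keeping the first occurrence of each):
names(testframe[!duplicated(as.list(testframe))])
# should be: "age" "height" "gender"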
You can do with lapply:
testframe[!duplicated(lapply(testframe, summary))]
summary summarizes the distribution while ignoring the order.
Not 100% sure, but I would use digest if the data is huge:
library(digest)
testframe[!duplicated(lapply(testframe, digest))]
A nice trick that you can use is to transpose your data frame and then check for duplicates.
duplicated(t(testframe))
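To actually drop the duplicated columns with that logical vector, something along these lines should do (note that transposing coerces everything to character, so columns are compared by their printed values):
testframe[, !duplicated(t(testframe))]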
unique(testframe, MARGIN=2)
does not work, though I think it should, so try
as.data.frame(unique(as.matrix(testframe), MARGIN=2))
or if you are worried about numbers turning into factors,
testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))]
which produces
age height gender
1 18 76.1 M
2 19 77.0 F
3 20 78.1 M
4 21 78.2 M
5 22 78.8 F
6 23 79.7 F
7 24 79.9 M
8 25 81.1 M
9 26 81.2 F
10 27 81.8 M
11 28 82.8 F
12 29 83.5 M
It is probably best for you to first find the duplicate column names and treat them accordingly (for example, summing the two, taking the mean, the first, the last, the second, the mode, etc.). To find the duplicate columns:
names(df)[duplicated(names(df))]
What about just:
unique.matrix(testframe, MARGIN=2)
Actually, you would just need to invert the duplicated result in your code and could stick to using subset (which is more readable than bracket notation, imho):
require(dplyr)
iris %>% subset(., select=which(!duplicated(names(.))))
Here is a simple command that would work if the duplicated columns of your data frame had the same names:
testframe[names(testframe)[!duplicated(names(testframe))]]
If the problem is that dataframes have been merged one time too many using, for example:
testframe2 <- merge(testframe, testframe, by = c('age'))
It is also good to remove the .x suffix from the column names. I applied it here on top of Mostafa Rezaei's great answer:
testframe2 <- testframe2[!duplicated(as.list(testframe2))]
# anchor the pattern so only a literal ".x" suffix is removed (a bare '.x' would also match e.g. "max")
names(testframe2) <- gsub('\\.x$', '', names(testframe2))
Since this Q&A is a popular Google search result but the accepted answer is a bit slow for a large matrix, I propose a new version using exponential search and data.table power.
This is a function I implemented in the dataPreparation package.
The function
dataPreparation::which_are_bijection
which_are_in_double(testframe)
which returns 3 and 4, the columns that are duplicated in your example.
Build a data set with the desired dimensions for the performance tests:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
for (i in 1:12){
testframe = rbind(testframe,testframe)
}
# Result in 49152 rows
for (i in 1:5){
testframe = cbind(testframe,testframe)
}
# Result in 160 columns
The benchmark
To perform the benchmark, I use the rbenchmark library, which repeats each computation 100 times:
library(rbenchmark)
library(digest)
library(dataPreparation)
benchmark(
which_are_in_double(testframe, verbose=FALSE),
duplicated(lapply(testframe, summary)),
duplicated(lapply(testframe, digest))
)
test replications elapsed
3 duplicated(lapply(testframe, digest)) 100 39.505
2 duplicated(lapply(testframe, summary)) 100 20.412
1 which_are_in_double(testframe, verbose = FALSE) 100 13.581
So which_are_in_double is 1.5 to 3 times faster than the other proposed solutions.
NB 1: I excluded from the benchmark the solution testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))] because it was already 10 times slower with 12k rows.
NB 2: Please note that, the way this data set is constructed, we have a lot of duplicated columns, which reduces the advantage of exponential search. With just a few duplicated columns, one would see much better performance for which_are_bijection and similar performance for the other methods.