In my case, the data consisted of 8 variables and 500 observations. When I used the leaps() code, instead of showing the $2^8 - 1$ submodels, the output showed only 10 models corresponding to those n, where $n \choose 2$ > 10. How can I get the entire output?
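The counts involved can be checked with a few lines of arithmetic. The sketch below uses Python purely for illustration, and it assumes the truncation comes from the nbest argument of leaps(), which defaults to 10 and caps the number of models reported for each subset size. Under that assumption, raising nbest to at least the largest per-size count should list every submodel.

```python
from math import comb

p = 8        # number of candidate variables
nbest = 10   # assumed cap on models reported per subset size

# number of distinct submodels of each size k = 1..p
per_size = {k: comb(p, k) for k in range(1, p + 1)}

total_models = sum(per_size.values())                      # equals 2**p - 1
reported = sum(min(nbest, c) for c in per_size.values())   # what the cap lets through
needed_nbest = max(per_size.values())                      # smallest cap that hides nothing
```

Here only the sizes with more than nbest submodels are truncated, which matches the behaviour described in the question.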
I'm trying to generate a table with 15000 rows and 16 columns, however Julia loses or omits some variables.
I have tried different ways to run DataFrame but I get the following results:
df = DataFrame(periods=15000, households=5000, giniY=giniY)
15,000 rows × 3 columns
However, when I run with the 16 variables I get the following result
df = DataFrame(periods=15000, households=5000, gamma=gamma, delta=delta,
betta=betta, alfa=alfa, miz=miz, roz=roz,
phi=phi, rok=rok, mie=mie, roe=roe,
roez=roez)
1×13 DataFrame. Omitted printing of 3 columns
Your second df variable has 13 columns (I have aligned the code in my edit 4 variables per line so that it is clearly visible). Julia omits printing all columns if they would not fit the screen (imagine what would happen if you had a data frame with 10 000 columns and always tried to print them all).
In the REPL, Julia omits printing columns that do not fit the screen unless you pass the allcols=true keyword argument to show, or create a custom IOContext that defines a non-standard output width and pass it to show. All this is explained in the show documentation for DataFrame.
In Jupyter Notebook a similar thing happens, but by default the width of the output is governed by the "COLUMNS" environment variable. The details of how to set it are explained at the beginning of the DataFrames.jl manual here.
I previously asked the following question
Permutation of n bernoulli random variables in R
The answer to this question works great as long as n is relatively small (<30); otherwise the following error occurs: Error: cannot allocate vector of size 4.0 Gb. I can get the code to run with somewhat larger values by using my desktop at work, but eventually the same error occurs. Even for values that my computer can handle, say 25, the code is extremely slow.
The purpose of this code is to calculate the difference between the CDF of an exact distribution (hence the permutations) and a normal approximation. I randomly generate some data, calculate the test statistic, and then need to determine the CDF by counting all the permutations that result in a smaller test statistic value and dividing by the total number of permutations.
My thought is to just generate the list of permutations one at a time, note if it is smaller than my observed value and then go on to the next one, i.e. loop over all possible permutations, but I can't just have a data frame of all the permutations to loop over because that would cause the exact same size and speed issue.
Long story short: I need to generate all possible permutations of 1's and 0's for n bernoulli trials, but I need to do this one at a time such that all of them are generated and none are generated more than once for arbitrary n. For n = 3, 2^3 = 8, I would first generate
000
calculate if my test statistic was greater (1 or 0) then generate
001
calculate again, then generate
010
calculate, then generate
100
calculate, then generate
011
etc until 111
I'm fine with this being a loop over 2^n, that outputs the permutation at each step of the loop but doesn't save them all somewhere. Also I don't care what order they are generated in, the above is just how I would list these out if I was doing it by hand.
In addition, if there is any way to speed up the previous code, that would also be helpful.
A good solution for your problem is iterators. There is a package called arrangements that is able to generate permutations in an iterative fashion. Observe:
library(arrangements)
# initialize iterator
iperm <- ipermutations(0:1, 3, replace = T)
for (i in 1:(2^3)) {
print(iperm$getnext())
}
[1] 0 0 0
[1] 0 0 1
.
.
.
[1] 1 1 1
It is written in C and is very efficient. You can also generate m permutations at a time like so:
iperm$getnext(m)
This allows for better performance because the next permutations are being generated by a for loop in C as opposed to a for loop in R.
If you really need to ramp up performance, you can use the parallel package.
iperm <- ipermutations(0:1, 40, replace = T)
parallel::mclapply(1:100, function(x) {
myPerms <- iperm$getnext(10000)
# do something
}, mc.cores = parallel::detectCores() - 1)
Note: All code is untested.
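One caveat about the parallel snippet: as far as I can tell, mclapply forks the R session, so each worker would start from its own copy of iperm and the workers may end up producing the same batches. A way around shared iterator state is to give each worker a disjoint range of integers and decode every integer into its 0/1 vector. The sketch below illustrates that idea in Python rather than R; bits_of, count_at_most, and the toy statistic (the number of successes) are hypothetical names chosen for the example, not part of arrangements.

```python
def bits_of(k, n):
    """Decode integer k (0 <= k < 2**n) into a tuple of n zeros and ones."""
    return tuple((k >> (n - 1 - j)) & 1 for j in range(n))

def count_at_most(n, statistic, observed, start=0, stop=None):
    """Count outcomes with index in [start, stop) whose statistic is <= observed.

    Outcomes are produced one at a time and never stored.
    """
    stop = 2 ** n if stop is None else stop
    return sum(1 for k in range(start, stop)
               if statistic(bits_of(k, n)) <= observed)

# toy example: n = 3 trials, statistic = number of successes
n = 3
observed = 1
count = count_at_most(n, sum, observed)
exact_cdf = count / 2 ** n

# the same count assembled from two disjoint index ranges,
# as separate workers would do
split_count = (count_at_most(n, sum, observed, 0, 4)
               + count_at_most(n, sum, observed, 4, 8))
```

Memory use stays constant, but the running time still grows as 2^n, so this only removes the allocation error, not the exponential cost.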
I'm doing PRC using the vegan-package but run into trouble when I attempt to perform an Anova on the results. I get the following error-message:
Error in doShuffleSet(spln[[i]], nset = nset, control) :
number of items to replace is not a multiple of replacement length
The problem originates in the shuffleSet-function of the permute-package. I created a reproducible example below. The weird thing is that the shuffle-function does not cause trouble, but the shuffleSet-function does.
In my experiment 3 treatments were given to 4 animals. The animals received the treatments in different orders. On every day, 5 samples were collected over time.
I would like to permute my observations within animals and not between them. Therefore I use AnimalID as a block.
I would like to permute days (in my actual experiments animals received the same treatment multiple times) but keep the measurements within a day intact. Hence I chose to permute Days freely and have no permutations within Days.
require(permute)
TreatmentLevels=3
Animals=4
TimeSteps=5
AnimalID=rep(letters[1:Animals],each=TreatmentLevels*TimeSteps)
Time=rep(1:TimeSteps,Animals*TreatmentLevels)
#treatments were given in different order per animal.
Day=rep(c(1,2,3,2,3,1,3,2,1,2,3,1),each=TimeSteps)
Treatment=rep(rep(LETTERS[1:TreatmentLevels],each=TimeSteps),Animals)
dataset=as.data.frame(cbind(AnimalID,Treatment,Day,Time))
ctrl=how(blocks = dataset$AnimalID,plots = Plots(strata=dataset$Day,type = "free"),
within=Within(type="none"), nperm = 999)
#this works
shuffle(60,control=ctrl)
#this gives an error
shuffleSet(60,nset=1,control=ctrl)
shuffleSet(60,nset=10,control=ctrl)
The problem seems to be in the block. Because this works
dataset$AnimalDay=factor(paste0(dataset$AnimalID,dataset$Day))
ctrl=how(plots = Plots(strata=dataset$AnimalDay,type = "free"),
within=Within(type="none"), nperm = 999)
#this works
shuffle(60,control=ctrl)
shuffleSet(60,nset=1,control=ctrl)
shuffleSet(60,nset=10,control=ctrl)
The key problem seems to be nset = 1: the permutation is generated and shuffleSet works, but printing the result fails because one set is dropped to a vector and print expects a matrix. You can get the permutation, you can use the permutation, but you cannot print it.
We got to fix this.
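To make the intended design concrete (animals as blocks, days permuted freely within an animal, samples within a day left in order), here is a minimal sketch in Python. It is not the algorithm used by the permute package; permute_days_within_animals is a hypothetical helper, and it assumes every day within an animal has the same number of rows.

```python
import random

def permute_days_within_animals(animal, day, rng):
    """Return a permutation of row indices in which, inside each animal,
    whole days swap places while rows within a day keep their order."""
    # collect row indices per (animal, day), preserving row order
    groups = {}
    for i, key in enumerate(zip(animal, day)):
        groups.setdefault(key, []).append(i)

    perm = list(range(len(animal)))
    for a in dict.fromkeys(animal):            # animals in order of appearance
        days = [d for (aa, d) in groups if aa == a]
        shuffled = days[:]
        rng.shuffle(shuffled)                  # free permutation of the days
        for target, source in zip(days, shuffled):
            # move a whole day at once, keeping its internal order
            for t, s in zip(groups[(a, target)], groups[(a, source)]):
                perm[t] = s
    return perm

# same layout as the reproducible example: 4 animals, 3 days, 5 samples per day
animals, levels, steps = 4, 3, 5
animal = [chr(ord("a") + i) for i in range(animals) for _ in range(levels * steps)]
day_order = [1, 2, 3, 2, 3, 1, 3, 2, 1, 2, 3, 1]
day = [d for d in day_order for _ in range(steps)]

perm = permute_days_within_animals(animal, day, random.Random(1))
```

Every returned index stays inside its own animal, and each run of five rows belonging to one day is moved as a unit.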
I am working with a large federal dataset with thousands of observations and thousands of variables. Replicate weights are provided. I am using the "survey" package in R to apply these weights:
els.weighted=svrepdesign(data=els, repweights = ~els$F3F1PNLWT,
combined.weights = TRUE)
I am interested in some categorical descriptive characteristics of a subset of the population, such as family living arrangements. I want to get these sorted out into a contingency table that shows frequency. I would like to sort people based on four variables (none of which are binary, but all of which are numeric). This is what I would like to get:
[Image of the desired table layout; not reproduced here.]
The blank boxes are where the cross-tabulation/frequency counts would show. (I only put in 3 columns beneath F1FCOMP for brevity's sake, but it has 9 outcomes, indexed 1-9.)
My current code: svyby(~F1FCOMP, ~F1RTRCC +BYS33C +F1A10 +byurban, els.weighted, svytotal)
This code does sort the data, but it sorts every single combination, by default. I want them pared down to represent only specific subpopulations of each variable. I tried:
svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C==1 +F1A10==2 | F1A10==3 +byurban==3, els.weighted, svytotal)
But got stopped:
Error: unexpected '==' in "svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C=="
Additionally, my current version of the code tells me how many cases occur for each combination. This is a picture of what my current output looks like. There are hundreds more rows, one for each combination, when I keep scrolling down.
You can see in that picture that I only get one number for F1FCOMP per row: the number of cases that fit the specified combination, i.e. a specific subpopulation. I want to know more about that subpopulation. That is, F1FCOMP has nine different outcomes (indexed 1-9), and I want to see how many members of each subpopulation fall into each of the 9 outcomes of F1FCOMP.
I have a simple 9-column file. I want to compute certain statistics for each column and then plot them (using gnuplot).
1) This is how I compute statistics for every column excluding the first one.
stats 'data' every ::2 name "stats"
2) In the output screen I can see that the operation is successful. Note that the number of columns/records is 8
* FILE:
Records: 8
Out of range: 0
Invalid: 0
Blank: 0
Data Blocks: 1
* COLUMNS:
Mean: 6.5000 491742.6625
Std Dev: 2.2913 703.4865
Sum: 52.0000 3.93394e+06
Sum Sq.: 380.0000 1.93449e+12
Minimum: 3.0000 [0] 490312.0000 [2]
Maximum: 10.0000 [7] 492643.5000 [7]
Quartile: 4.5000 491329.5000
Median: 6.5000 491911.1500
Quartile: 8.5000 492252.2500
Linear Model: y = 121.8 x + 4.91e+05
Correlation: r = 0.3966
Sum xy: 2.558e+07
3) Now I can access statistics on the first 2 columns by appending _x and _y like this
print stats_median_x
print stats_median_y
My questions are:
How can I access statistics (let's say medians) for the remaining 6 columns?
How could I plot, let's say, a line over all medians against some X axis?
I know that I can simply add a python script to pre-compute all this, but I would prefer to avoid it if there is an easy way to do it using gnuplot itself.
Thanks!
Short answer(s)
"How can I access statistics of the other column?"
With stats 'data' using n you get the statistics of the nth column.
"How can I plot for example all medians?"
For example, set print combined with a do for loop can write a data file that you can then use for the plot.
A working solution
set print "StatDat.dat"
do for [i=2:9] { # Here you will use i for the column.
stats 'data.dat' u i nooutput ;
print i, STATS_median, STATS_mean , STATS_stddev # ...
}
set print
plot "StatDat.dat" us 1:2 # or whatever column you want...
Some words more about it
Asking gnuplot itself with help stats turns up a lot of interesting things :-).
Syntax:
stats 'filename' [using N[:M]] [name 'prefix'] [[no]output]
This command prepares a statistical summary of the data in one or two columns of a file. The using specifier is interpreted in the same way as for plot commands. See plot for details on the index, every, and using directives.
From the first sentence of that help text we learn that stats prepares statistics for one, or at most two, columns per call (a pity; let's see what future versions bring).
From the second sentence we learn that it follows the same syntax as the plot command:
so stats 'data' using 3 gives you the statistics of the 3rd column in x,
and stats 'data' using 4:5 gives those of the 4th and 5th columns in x and y.
Notes about your interpretations
You said
This is how I compute statistics for every column excluding the first one.
stats 'data' every ::2 name "stats"
Not quite: this computes the statistics for the first two columns, excluding the first two lines, because the every counter starts from 0 and not from 1.
As a consequence, when we read
Records: 8
it means that 8 lines were used: your file had 10 (usable) lines, you specified every ::2 and so skipped the first two, which leaves 8 records for the statistics.
This also makes it clearer what help stats means by
STATS_records # total number of in-range data records
namely the records "used to compute this statistic".
Tested on gnuplot 4.6 patchlevel 4
Working on gnuplot Version 5.0 patchlevel 1
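For comparison, the Python pre-computation that the question mentions as a fallback could look roughly like the sketch below. It assumes a whitespace-separated numeric file with optional # comment lines; column_medians is a hypothetical helper, and the three-column sample is made up and only stands in for the real 9-column file.

```python
import io
import statistics

def column_medians(lines, skip_columns=1):
    """Median of each whitespace-separated numeric column,
    ignoring the first skip_columns columns."""
    rows = [[float(x) for x in line.split()]
            for line in lines
            if line.strip() and not line.lstrip().startswith("#")]
    columns = list(zip(*rows))          # transpose rows into columns
    return [statistics.median(col) for col in columns[skip_columns:]]

# made-up sample standing in for the real data file
sample = io.StringIO("1 10 5\n2 30 7\n3 20 6\n")
medians = column_medians(sample)
```

With a real file, passing an open file handle instead of the StringIO object works the same way, and the resulting list can be written out one value per line for gnuplot to plot.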