Problems with DataFrame to generate a table of 15000 rows x 37 columns in Julia

I'm trying to generate a table with 15000 rows and 16 columns, but Julia loses or omits some of the variables.
I have tried different ways of calling DataFrame, but I get the following results:
df = DataFrame(periods=15000, households=5000, giniY=giniY)
15,000 rows × 3 columns
However, when I run it with all 16 variables I get the following result:
df = DataFrame(periods=15000, households=5000, gamma=gamma, delta=delta,
betta=betta, alfa=alfa, miz=miz, roz=roz,
phi=phi, rok=rok, mie=mie, roe=roe,
roez=roez)
1×13 DataFrame. Omitted printing of 3 columns

Your second df variable has 13 columns (in my edit I have aligned the code at 4 variables per line so that this is clearly visible). Julia omits printing all columns if they would not fit the screen (imagine what would happen if you had a data frame with 10,000 columns and it always tried to print them all).
In the REPL, Julia omits printing columns that do not fit the screen unless you pass the allcols=true keyword argument to show, or pass show a custom IOContext that defines a non-standard output width. All of this is explained in the show documentation for DataFrame.
In a Jupyter Notebook a similar thing happens, but by default the output width is governed by the "COLUMNS" environment variable. The details of how to set it are explained at the beginning of the DataFrames.jl manual here.

Related

R commands for finding mode in R seem to be wrong

I watched a video on YouTube about finding the mode in R from a list of numerics. When I enter the commands they do not work; R does not even give an error message. The vector is
X <- c(1,2,2,2,3,4,5,6,7,8,9)
Then the instructor says to use
temp <- table(as.vector(x))
to basically sort all unique values in the list. From this command R should give me 1,2,3,4,5,6,7,8,9, but nothing happens, although when the instructor does it this list appears. Then he says to use the command
names(temp)[temp--max(temp)]
which should basically give me this: 1,3,1,1,1,1,1,1,1, where the 3 shows that the mode is 2 because it is repeated 3 times in the list. I would like to stay with these commands as far as possible, since the instructor explains them in detail. Am I making a typo or something?
You're kind of confused.
X <- c(1,2,2,2,3,4,5,6,7,8,9) ## define vector
temp <- table(as.vector(X))
to basically sort all unique values in the list.
That's not exactly what this command does (sort(unique(X)) would give a sorted vector of the unique values; note that in R, lists and vectors are different kinds of objects, so it's best not to use the words interchangeably). What table() does is count the number of instances of each unique value (in sorted order); also, as.vector() is redundant here.
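For comparison, a quick illustration of the difference (using the X defined above):
X <- c(1,2,2,2,3,4,5,6,7,8,9)
sort(unique(X))  # the distinct values, sorted: 1 2 3 4 5 6 7 8 9
table(X)         # counts of each distinct value, in sorted order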
From this command R should give me 1,2,3,4,5,6,7,8,9, but nothing happens, although when the instructor does it this list appears.
If you assign the result to a variable, R doesn't print anything. If you want to see the value of a variable, type the variable's name by itself:
temp
you should see
1 2 3 4 5 6 7 8 9
1 3 1 1 1 1 1 1 1
The first row is the labels (the unique values); the second is the counts.
Then he says to use the command, names(temp)[temp--max(temp)], which should basically give me this: 1,3,1,1,1,1,1,1,1, where the 3 shows that the mode is 2 because it is repeated 3 times in the list.
No. You already have the sequence of counts stored in temp. You should have typed
names(temp)[temp==max(temp)]
(note ==, not --) which should print
[1] "2"
i.e., this is the mode. The logic here is that temp==max(temp) gives you a logical vector (a vector of TRUE and FALSE values) that's TRUE only for the elements of temp that are equal to the maximum value; names(temp)[temp==max(temp)] selects the elements of the names vector (the first row shown in the printout of temp above) that correspond to TRUE values ...
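Putting it all together, a minimal end-to-end run of the commands above:
X <- c(1,2,2,2,3,4,5,6,7,8,9)
temp <- table(X)                            # counts per unique value
temp == max(temp)                           # TRUE only where the count is maximal
names(temp)[temp == max(temp)]              # "2", the mode, as a character string
as.numeric(names(temp)[temp == max(temp)])  # 2, converted back to a number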

How to create contingency table with multiple criteria subpopulation from weighted data using svyby in the survey package?

I am working with a large federal dataset with thousands of observations and thousands of variables. Replicate weights are provided. I am using the "survey" package in R to apply these weights:
els.weighted <- svrepdesign(data = els, repweights = ~els$F3F1PNLWT,
                            combined.weights = TRUE)
I am interested in some categorical descriptive characteristics of a subset of the population, such as family living arrangements. I want to get these sorted into a contingency table that shows frequencies. I would like to sort people based on four variables (none of which are binary, but all of which are numeric). This is what I would like to get:
[image: the desired contingency table]
The blank boxes are where the cross-tabulation/frequency counts would show. (I only put in 3 columns beneath F1COMP for brevity's sake, but it has 9 outcomes – indexed 1-9)
My current code: svyby(~F1FCOMP, ~F1RTRCC + BYS33C + F1A10 + byurban, els.weighted, svytotal)
This code does sort the data, but by default it sorts every single combination. I want it pared down to represent only specific subpopulations of each variable. I tried:
svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C==1 +F1A10==2 | F1A10==3 +byurban==3, els.weighted, svytotal)
But got stopped:
Error: unexpected '==' in "svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C=="
Additionally, my current version of the code tells me how many cases occur for each combination. This is a picture of what my current output looks like; there are hundreds more rows, one for each combination, as I keep scrolling down:
[image: current svyby output, one row per combination]
You can see in that picture that I only get one number for F1FCOMP per row, the number of cases that fit the specified combination, i.e. a specific subpopulation. I want to know more about that subpopulation. That is, F1FCOMP has nine different outcomes (indexed 1-9), and I want to see how many people in each subpopulation fall into each of the 9 outcomes of F1FCOMP.
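As a hedged sketch of one standard pattern with the survey package: logical conditions such as F1RTRCC==2 cannot appear inside the formula, but you can restrict the design object first with subset(), which preserves the weighting information, and then cross-tabulate within that subpopulation (variable names taken from the question):
library(survey)

# Keep only the subpopulation of interest; subset() on a survey design
# retains the replicate-weight information needed for correct estimates.
els.sub <- subset(els.weighted,
                  (F1RTRCC == 2 | F1RTRCC == 3) &
                  BYS33C == 1 &
                  (F1A10 == 2 | F1A10 == 3) &
                  byurban == 3)

# Treating F1FCOMP as a factor makes svytotal() return one weighted
# count per outcome (all 9 levels) instead of a single total.
svyby(~factor(F1FCOMP), ~F1RTRCC + BYS33C + F1A10 + byurban,
      els.sub, svytotal)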

Show KMeans cluster results with clusters as columns

My data has 40+ variables and I am creating a 3-cluster model on it.
I have built a kmeans model:
teen_clusters <- kmeans(interests_z, 3)
It works fine; the issue is getting output that I can read.
When I print the model to the screen, it places the variables along the top (40 across) and the clusters as rows (3 deep). Very hard to read.
I want it the other way around: 3 cluster columns and 40 rows.
I have tried the below, but get the same thing; it does way too much screen wrap.
aggregate(interests_z,by=list(teen_clusters$cluster),FUN=mean)
Since we don't have your data, let's use mtcars ...
ret <- kmeans(mtcars,3)
ret$centers # the default format
t(ret$centers) # transposed as you want
To see the components of ret, use str(ret).
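The same transpose works on the aggregate() call from the question; a self-contained sketch, with mtcars standing in for interests_z:
ret <- kmeans(mtcars, 3)

# Per-cluster means computed the way the question does it ...
centers <- aggregate(mtcars, by = list(cluster = ret$cluster), FUN = mean)

# ... then transposed, so the 3 clusters become columns and the
# variables run down the rows, avoiding the screen wrap.
t(centers)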

Determine when columns of a data.frame change value and return indices of the change

I am trying to find a way to determine when a set of columns changes value in a data.frame. Let me get straight to the point; please consider the following example:
x<-data.frame(cnt=1:10, code=rep('ELEMENT 1',10), val0=rep(5,10), val1=rep(6,10),val2=rep(3,10))
x[4,]$val0=6
The cnt column is a unique ID (it could be a date or time column; for simplicity it's an int here).
The code column is like a code for the set of rows (imagine several such groups, each with a different code). code and cnt are the keys in my data.table.
The val0,val1,val2 columns are something like scores.
The data.frame above should be read as: the scores for 'ELEMENT 1' started as 5,6,3, remained as is until the 4th iteration, when they changed to 6,6,3, and then changed back to 5,6,3.
My question: is there a way to get the 1st, 4th, and 5th rows of the data.frame? Is there a way to detect when the columns change? (There are 12 such columns, by the way.)
I tried using data.table's duplicated (which worked perfectly in the majority of cases), but here it removes all duplicates and leaves only rows 1 and 4 (dropping the 5th).
Do you have any suggestions? I would rather not use a for loop as there are approx. 2M lines.
In data.table version 1.8.10 (the stable version on CRAN), there's an (unexported) function called duplist that does exactly this. It's also written in C and is therefore terribly fast.
require(data.table) # 1.8.10
data.table:::duplist(x[, 3:5])
# [1] 1 4 5
If you're using the development version of data.table (1.8.11), then there's a more efficient version (in terms of memory) renamed uniqlist, which does exactly the same job. Probably this should be exported for the next release; it seems to have come up on SO more than once. Let's see.
require(data.table) # 1.8.11
data.table:::uniqlist(x[, 3:5])
# [1] 1 4 5
Totally unreadable, but:
c(1,which(rowSums(sapply(x[,grep('val',names(x))],diff))!=0)+1)
# [1] 1 4 5
Basically, run diff on each of the val columns to find the changes between consecutive rows. If a change occurs in any column, then a change has occurred in that row.
Also, without the sapply:
c(1,which(rowSums(diff(as.matrix(x[,grep('val',names(x))])))!=0)+1)
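Broken into steps on the x defined above (a sketch; testing d != 0 cell by cell also guards against offsetting changes in different columns, which could cancel out inside rowSums):
vals <- as.matrix(x[, grep("val", names(x))])  # just the score columns
d <- diff(vals)                  # differences between consecutive rows
changed <- rowSums(d != 0) > 0   # TRUE where any score column changed
c(1, which(changed) + 1)         # row 1 always counts, plus each change point
# [1] 1 4 5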

Counting specific characters in a string, across a data frame. sapply

I have found similar problems to this here:
Count the number of words in a string in R?
and here
Faster way to split a string and count characters using R?
but I can't get either to work in my example.
I have quite a large dataframe. One of the columns has genomic locations for features and the entries are formatted as follows:
[hg19:2:224840068-224840089:-]
[hg19:17:37092945-37092969:-]
[hg19:20:3904018-3904040:+]
[hg19:16:67000244-67000248,67000628-67000647:+]
I am splitting these entries into their individual elements to get the following (i.e., for the first entry):
hg19 2 224840068 224840089 -
But in the case of the fourth entry, I would like to parse this into two separate locations,
i.e.
[hg19:16:67000244-67000248,67000628-67000647:+]
becomes
hg19 16 67000244 67000248 +
hg19 16 67000628 67000647 +
(with all the associated data in the adjacent columns filled in from the original)
An easy way for me to identify which rows need this action is simply to count the rows with commas (','), as commas don't appear in any other text in any other column, except where there are multiple genomic locations for the feature.
However, I am failing at the first hurdle, because the sapply command incorrectly returns '1' for every entry.
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates), length)
(or)
testdat$multiple <- sapply(gregexpr("\\,", testdat$genome_coordinates), length)
table(testdat$multiple)
1
4
Using the example I have posted above, I would expect the output to be
testdat$multiple
0
0
0
1
Actually, running grep -c on the same data at the command line shows that I have 10 entries containing ','.
So initially I would like to get this working, but I am also a bit stumped for ideas as to how to then extract the two (or more) locations and put them on their own rows, filling in the adjacent data.
Actually, what I intended to do was stick to something I know (on the command line): grep out the rows with ',', duplicate the file, split and awk the selected columns (the 1st and 2nd locations in the respective files), then cat and sort them. If there is a niftier way to do this in R, I would love a pointer.
gregexpr does in fact return an object of length 1 for every entry here: a match failure is encoded as -1, which also has length 1, so length() cannot distinguish matches from non-matches. You need to look at the returned value, not the length.
Try foo <- sapply(gregexpr(',', testdat$genome_coordinates), function(m) m[1]); foo > 0 to get the rows with a comma (test for a positive position rather than coercing with as.logical, since as.logical(-1) is TRUE).
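A sketch tying it together, reconstructing a small testdat from the four entries in the question (the reconstruction and the helper name split_locs below are mine, for illustration):
# Reconstruction of the example column from the question
testdat <- data.frame(
  genome_coordinates = c(
    "[hg19:2:224840068-224840089:-]",
    "[hg19:17:37092945-37092969:-]",
    "[hg19:20:3904018-3904040:+]",
    "[hg19:16:67000244-67000248,67000628-67000647:+]"),
  stringsAsFactors = FALSE)

# Count commas per entry: gregexpr() returns match positions,
# with -1 meaning "no match", so count only positions > 0.
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates),
                           function(m) sum(m > 0))
testdat$multiple
# [1] 0 0 0 1
And one possible way to put multi-location entries on their own rows, keeping a pointer back to the original row so the adjacent columns can be filled in:
# Split one entry into one row per comma-separated location
split_locs <- function(i) {
  s <- gsub("^\\[|\\]$", "", testdat$genome_coordinates[i])  # strip brackets
  f <- strsplit(s, ":")[[1]]          # genome, chromosome, ranges, strand
  ranges <- strsplit(f[3], ",")[[1]]  # one element per location
  data.frame(genome = f[1], chrom = f[2],
             start  = sub("-.*", "", ranges),
             end    = sub(".*-", "", ranges),
             strand = f[4], orig_row = i, stringsAsFactors = FALSE)
}
do.call(rbind, lapply(seq_len(nrow(testdat)), split_locs))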
