SPSS: correlating two vectors

I have two vectors in my dataset: Vs = s1 to s10 and Vt = t1 to t10.
They describe two pictures, and for each case I want to know the correlation between them.
However, there is no such function as Cor(Vs, Vt), because vectors apparently cannot be used in the standard functions. There is not even a mean(Vs)!
I tried to write syntax but failed, partly because of the problem of missing values (implementing pairwise deletion seems complex).
Any hint is welcome.
Is it possible to ask a question that is only seen by SPSS experts?

Calculating the correlation in the present structure is probably feasible but would be pretty complex. I suggest restructuring the data; then everything becomes easy:
The code assumes you have some line ID in the data, called lineNum.
If you don't, you'll need to create one using the first line.
compute lineNum=$casenum. /* this is only necessary if you don't have some other line ID.
varstocases /make V_s from S1 to S10 /make V_t from T1 to T10 /index=pairNum(V_s).
sort cases by lineNum.
split file by lineNum.
correlations V_s with V_t. /* you can edit the code here to add features to the analysis.
split file off.
That's it. The results will appear in the output window - one correlation for each of the original lines. If you need to import the correlations back into the original data, you can use OMS (Output Management System) to capture the results into a new dataset and then match it back to the original file.

Related

Stata tables/collect confidence interval in one cell

I work a lot with the new tables/collect commands in Stata 17. Does anybody know how to get the confidence interval in one cell of the table, rather than one column for the lower bound and one column for the upper bound estimate?
Alternatively, a quick fix in Word (or Excel, though my final document is in Word, and saving the output to Excel takes so long)?
As I see it there is no option to put it in one column, so maybe a layout workaround?
From the Stata documentation of the collect command, the quick start mentions
table (colname) (result), command(_r_b _r_ci: regress y x1 x2 x3). You should be able to use collect with it, but without a minimal reproducible example of your specific case it is hard to verify whether this works as intended for you. For the general idea of a minimal reproducible example please see here, and for specific advice on how to create one please see here.
Here is a general example that uses table, collect and putdocx to create a Word document with the confidence interval in one cell:
use https://www.stata-press.com/data/r17/nlsw88.dta
table (colname) (result), command(_r_b _r_ci: regress wage union occupation married age)
collect layout (colname) (result)
putdocx begin
putdocx collect
putdocx save Table, replace

Grouping and transposing data in R

It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured - for example, the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and the example structure above).
Each metric has many years of data (whilst writing the code I have restricted myself to just 3 years; the illustration of the structure is based on this test). The number of years captured will change over time - generally it will increase.
The number of policies will fluctuate. I've just labelled them Policy 1, Policy 2, etc. for sensitivity reasons, and I have limited the number whilst testing the code to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric, and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a dataframe, created a vector to be a column header and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric becomes the column and the grouped data becomes the row, and then expand the data so the metrics are repeated for each year. A friend of mine who does coding (but has never used R) has suggested that using loops might be a better way forward. Again, I am not sure of the best approach, so I welcome advice. On Reddit someone suggested using pivot_wider/pivot_longer, but this appears to be a summarise tool, and I am not trying to summarise the data, rather transform its structure.
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks
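For what it's worth, pivot_longer() and pivot_wider() from tidyr reshape data rather than summarise it, so they may well do what is described above. Here is a minimal sketch with invented policy, metric and year names (the real names are sensitive), turning one row per metric per policy into one row per policy per year with a column per metric:
library(dplyr)
library(tidyr)
# Invented example in the "structure I had" shape: one row per metric,
# one column per year, plus a policy identifier.
dat <- tibble(
  policy = rep(c("Policy 1", "Policy 2"), each = 2),
  metric = rep(c("Metric A", "Metric B"), times = 2),
  `2024` = c(10, 20, 30, 40),
  `2030` = c(11, 21, 31, 41),
  `2035` = c(12, 22, 32, 42)
)
# Lengthen the year columns, then spread the metrics into columns:
# the result has one row per policy per year and one column per metric.
dat %>%
  pivot_longer(c(`2024`, `2030`, `2035`), names_to = "year", values_to = "value") %>%
  pivot_wider(names_from = metric, values_from = value)
The same two calls work however many metrics, years or policies there are, which avoids the hand-built cbind()/rbind() steps.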

R - select cases so that the mean of a variable is some given number

I previously worked on a project where we examined some sociological data. I did the descriptive statistics and after several months, I was asked to make some graphs from the stats.
I made the graphs, but something seemed odd, and when I compared the graphs to the numbers in the report I noticed that they are different. Upon investigating further, I noticed that my cleaning code (which removed participants with duplicate IDs) now results in more rows, i.e. more participants with unique IDs than previously. I now have 730 participants, whereas previously there were 702. I don't know whether this was due to updates of some packages, and unfortunately I cannot post the actual data here because it is confidential, but I am trying to find out who these 28 participants are and what happened in the data.
Therefore, I would like to know if there is a method that allows the user to filter the cases so that the mean of some variables is a set number. Ideally it would be something like this, but of course I know that it's not going to work in this form:
iris %>%
filter_if(mean(.$Petal.Length) == 1.3)
I know that this was an incorrect attempt but I don't know any other way that I would try this, so I am looking for help and suggestions.
I'm not convinced this is a tractable problem, but you may get somewhere by doing the following.
Firstly, work out what the sum of the variable was in your original analysis, and what it is now:
old_sum <- 702 * old_mean
new_sum <- 730 * new_mean
Now work out what the sum of the variable in the extra 28 cases would be:
extra_sum <- new_sum - old_sum
This allows you to work out the relative proportions of the sum of the variable from the old cases and from the extra cases. Put these proportions in a vector:
contributions <- c(extra_sum/new_sum, old_sum/new_sum)
Now, using the functions described in my answer to this question, you can find the optimal solution to partitioning your variable to match these two proportions. The rows which end up in the "extra" partition are likely to be the new ones. Even if they aren't the new ones, you will be left with a sample that has a mean that differs from your original by less than one part in a million.
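The partitioning functions referred to above are not reproduced here, but as a rough sketch of the same idea, a simple greedy heuristic (not the optimal method mentioned) can split the values of a variable into two groups whose sums approximate the two proportions. Below, x is a placeholder for your numeric variable and contributions is the vector defined above:
partition_by_proportion <- function(x, proportions) {
  targets <- sum(x) * proportions            # target sum for each group
  groups  <- vector("list", length(proportions))
  sums    <- numeric(length(proportions))
  for (i in order(x, decreasing = TRUE)) {   # place the largest values first
    g <- which.max(targets - sums)           # group currently furthest below its target
    groups[[g]] <- c(groups[[g]], i)
    sums[g] <- sums[g] + x[i]
  }
  groups                                     # list of row indices, one element per group
}
# e.g. parts <- partition_by_proportion(x, contributions)
As noted above, the rows assigned to the "extra" share are only likely candidates for the new cases, not a guaranteed identification.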

How to automate a process by pulling elements from a data frame in R - looping with a string?

I am trying to automate a process instead of individually computing PPCC values for a large number of test cases. The details of my functions do not matter (though for reference I'm using lmomco); my issue is putting this into a loop, or somehow using plyr or apply to repeat it over and over. I do not know how to automate the string. For example, I have sorted data by the "M" parameter:
testx.100cv1<-by(x.cv1$first_year,x.cv1$M,sort)
I then apply a function here:
testexp<-lapply(testx.100cv1,parexp)
Now I want to do something to each "M", where in the example below, M = 1.02. Right now, I am manually changing this value and then recomputing for every M (and I have a lot of them). I'm looking for a way to write this M value into a loop so it reads it automatically.
exp<-quaexp(plotpos,testexp$'1.02')
PPCCexp<-cor(exp,testx.100cv1$'1.02')
I want to compute PPCC values for many distributions, so without automating, this will take over my life for a week.
Thanks!
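One way to avoid hard-coding each M value is to loop over the names of the list that by() and lapply() return, so the '1.02' string comes from the data itself. A rough sketch, assuming plotpos, testx.100cv1 and testexp are set up exactly as in the snippets above:
library(lmomco)   # for quaexp(), as in the question
PPCC <- sapply(names(testexp), function(m) {
  fitted_q <- quaexp(plotpos, testexp[[m]])   # quantiles for this M's fitted distribution
  cor(fitted_q, testx.100cv1[[m]])            # PPCC: correlation with the sorted data
})
PPCC   # named vector, one PPCC value per M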

R + Bioconductor : combining probesets in an ExpressionSet

First off, this may be the wrong forum for this question, as it's pretty darn R+Bioconductor specific. Here's what I have:
library('GEOquery')
GDS = getGEO('GDS785')
cd4T = GDS2eSet(GDS)
cd4T <- cd4T[!fData(cd4T)$symbol == "",]
Now cd4T is an ExpressionSet object which wraps a big matrix with 19794 rows (probesets) and 15 columns (samples). The final line gets rid of all probesets that do not have corresponding gene symbols. Now the trouble is that most genes in this set are assigned to more than one probeset. You can see this by doing
gene_symbols = factor(fData(cd4T)$Gene.symbol)
length(gene_symbols)-length(levels(gene_symbols))
[1] 6897
So 6897 of my 19794 probesets map to a gene symbol that some other probeset already covers. I'd like to somehow combine the expression levels of the probesets associated with each gene. I don't care much about the actual probe ID for each probe. I'd very much like to end up with an ExpressionSet containing the merged information, as all of my downstream analysis is designed to work with this class.
I think I can write some code that will do this by hand and build a new ExpressionSet from scratch. However, I'm assuming this can't be a new problem and that code already exists to do it, using a statistically sound method to combine the gene expression levels. I'm guessing there's a proper name for this too, but my searches aren't turning up much of use. Can anyone help?
I'm not an expert, but from what I've seen over the years everyone has their own favorite way of combining probesets. The two methods that I've seen used most on a large scale are keeping only the probeset with the largest variance across the expression matrix, and taking the mean of the probesets to create a meta-probeset. For smaller blocks of probesets I've seen people use more intensive methods involving looking at per-probeset plots to get a feel for what's going on; generally what happens is that one probeset turns out to be the 'good' one and the rest aren't very good.
I haven't seen generalized code to do this - as an example we recently realized in my lab that a few of us have our own private functions to do this same thing.
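As a rough illustration of the mean-per-gene idea (a sketch only, reusing cd4T and gene_symbols from the question; it collapses the matrix but does not rebuild a full ExpressionSet with matching featureData):
library(Biobase)
expr_mat  <- exprs(cd4T)                      # probesets x samples matrix
collapsed <- apply(expr_mat, 2, function(s) tapply(s, gene_symbols, mean))
dim(collapsed)                                # one row per gene symbol, one column per sample
The largest-variance alternative would instead keep, for each symbol, the single probeset whose row variance (or IQR) is highest.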
The function you are looking for is nsFilter() in the R genefilter package. It does two major things: it keeps only probesets that map to an Entrez gene ID (the rest are filtered out), and when an Entrez ID has multiple probesets it retains the one with the largest value of the variance function (the largest IQR by default) and removes the others. You are then left with a matrix mapped to unique Entrez gene IDs. Hope this helps.
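A minimal sketch of that, assuming cd4T is the ExpressionSet built above and that its annotation slot points to an installed annotation package (nsFilter needs that to look up Entrez IDs):
library(genefilter)
filtered <- nsFilter(cd4T,
                     require.entrez   = TRUE,    # drop probesets with no Entrez gene ID
                     remove.dupEntrez = TRUE,    # keep one probeset per Entrez ID...
                     var.func         = IQR,     # ...the one with the largest IQR
                     var.filter       = FALSE)   # don't additionally drop low-variance genes
cd4T_unique <- filtered$eset       # ExpressionSet with one probeset per gene
filtered$filter.log                # counts of probesets removed at each step
By default nsFilter() also drops low-variance features and AFFX control probes, so it is worth checking filter.log to see exactly what was removed.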
