I have a dataset in SPSS that has 100K+ rows and over 100 columns. I want to filter both the rows and columns at the same time into a new SPSS dataset.
I can accomplish this very easily using the subset command in R. For example:
new_data <- subset(old_data, subset = ColumnA > 10, select = c(ColumnA, ColumnC, ColumnZZ))
Even easier would be:
new_data <- old_data[old_data$ColumnA > 10, c(1, 4, 89)]
where I am passing the column indices instead.
What is the equivalent in SPSS?
I love R, but the read/write and data management speed of SPSS is significantly better.
I'm not sure exactly what you mean when you write that the read/write and data management speed of SPSS is "significantly better" than R's. Your question itself demonstrates how flexible R is at data management! And a dataset of 100k rows and 100 columns is by no means a large one.
But, to answer your question, perhaps you are looking for something like this. I'm providing a "programmatic" solution, rather than the GUI one, because you're asking the question on Stack Overflow, where the focus is more on the programming side of things. I'm using a sample data file that can be found here: http://www.ats.ucla.edu/stat/spss/examples/chp/p004.sav
Save that file to your SPSS working directory, open up your SPSS syntax editor, and type the following:
GET FILE='p004.sav'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'mynewdatafile.sav'
/KEEP currentm previous lactatio.
GET FILE='mynewdatafile.sav'.
More likely, though, you'll have to go through something like this:
FILE HANDLE directoryPath /NAME='C:\path\to\working\directory\' .
FILE HANDLE myFile /NAME='directoryPath/p004.sav' .
GET FILE='myFile'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'directoryPath/mynewdatafile.sav'
/KEEP currentm previous lactatio.
FILE HANDLE myFile /NAME='directoryPath/mynewdatafile.sav'.
GET FILE='myFile'.
You should now have a new file with just three columns, in which no value of "lactatio" is greater than 3.
So, the basic steps are:
Load the data you want to work with.
Select only the cases (rows) you're interested in, keeping all columns for now.
Save a new file with only the variables you're interested in.
Load that new file before you proceed.
With R, the basic steps are:
Load the data you want to work with.
Create an object with your subset of rows and columns (which you know how to do).
Hmm.... I don't know about you, but I know which method I prefer ;)
If speed really is an issue, the right tools in R will also let you read in just the subset you're interested in, without first loading the whole dataset.
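For example, data.table's fread can pick out specific columns while it reads a delimited file, and the rows can be filtered immediately afterwards. A minimal sketch, assuming the data is also available as a delimited text export (the file name is hypothetical; the column names are taken from the question):

library(data.table)

# Column selection happens at read time, so the other ~100 columns are never loaded.
new_data <- fread("old_data.csv",
                  select = c("ColumnA", "ColumnC", "ColumnZZ"))
new_data <- new_data[ColumnA > 10]   # then keep only the rows of interest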
In SPSS you can't combine the two actions in one command, but it's easy enough to do it in two:
dataset copy old_data. /* delete this if you don't need to keep both old and new data.
select if ColumnA>10.
add files /file=* /keep=ColumnA ColumnC ColumnZZ.
Related
I'm brand new to programming and am picking up RStudio as a stats tool.
I have a dataset which includes multiple questionnaires divided by weeks, and I'm trying to organize the data into meaningful chunks.
Right now this is what my code looks like:
w1a=table(qwest1,talm1)
w2a=table(qwest2,talm2)
w3a=table(quest3,talm3)
where quest and talm are the names of the variables and the number denotes the week.
Is there a way to compress all those lines into one line of code, so that I could make w1a, w2a, w3a, ... each their own object with the corresponding questionnaire added in?
Thank you for your help, I'm very new to coding and I don't know the etiquette or all the vocabulary.
This might do what you wanted (but not what you asked for):
tbl_list <- mapply(table, list(qwest1, qwest2, quest3),
                   list(talm1, talm2, talm3), SIMPLIFY = FALSE)
names(tbl_list) <- c('w1a', 'w2a','w3a')
You are committing a fairly typical new-R-user error in creating multiple similarly named and structured objects but not putting them in a list. This is my effort at pushing you in that direction. Could also have been done via:
qwest_lst <- list(qwest1, qwest2, quest3)
talm_lst <- list(talm1, talm2, talm3)
tbl_lst <- mapply(table, qwest_lst, talm_lst, SIMPLIFY = FALSE)
names(tbl_lst) <- paste0('w', 1:3, 'a')
There are other ways to programmatically access objects with character vectors, using get or mget.
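For illustration, a minimal sketch of that approach, assuming objects with these names already exist in the workspace (as in the question):

# Fetch several similarly named objects into a list using a character vector of names.
input_list <- mget(c("qwest1", "qwest2", "quest3"), envir = .GlobalEnv)

# get() does the same thing for a single name.
w1_input <- get("qwest1")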
I know that there are many threads called this but either the advice within hasn't worked or I haven't understood it.
I have read what was an SPSS file into R.
I cleaned some variables and added new ones.
By this point the file size is 1,000 MB.
I wanted to write it out to a CSV to look at it more easily, but R just stops responding; the file is too big, I guess.
So instead I want to create a subset of only the variables I need. I tried a couple of things
(besb <- bes[, c(1, 7, 8)])
data1 <- bes[,1:8]
I also tried referring to variables by name:
nf <- c(bes$approveGov, bes$politmoney)
All these attempts return errors about the number of dimensions.
Therefore could somebody please explain to me how to create a reduced subset of variables preferably using variable names?
An easy way to subset variables from a data.frame is with the dplyr package. You can select variables with their bare names. For example:
library(dplyr)
nf <- select(bes, approveGov, politmoney)
It's fast for large data frames too.
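For reference, the same column subset can also be taken in base R by indexing with a character vector of names (assuming bes is a data.frame):

# Keep only the named columns; note the comma and the quoted names.
nf <- bes[, c("approveGov", "politmoney")]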
I currently do a lot of descriptive analysis in R. I always work with a data.table like df
net <- seq(1, 20, by = 2)
gross <- seq(2, 20, by = 2)
color <- rep(c("green", "blue", "white"), length.out = 10)
height <- rep(c(170, 172, 180, 188), length.out = 10)
library(data.table)
df <- data.table(net,gross,color,height)
In order to obtain results, I apply a lot of filters.
Sometimes I use a single filter, sometimes a combination of several, e.g.:
df[color=="green" & height>175]
In my real data.table, I have 7 columns and all kinds of filter combinations.
Since I always address the same data.table, I'd like to find the most efficient way to filter the data.
So far, my files are organized like this (bottom-up):
execution level: multiple R scripts, each with a very specific job (no interaction between them), that calculate and write the results to an Excel file using XLConnect
source file: this file receives a pre-filtered data.table and sources all files from the execution level. It is necessary in case I add/remove files on the execution level.
filter files: read the data.table and apply one or more filters, as shown above (e.g. df_green_high <- df[color == "green" & height > 175]). By filtering, each filter file creates a new data.table and sources the "source file" with this new filtered table.
I am currently struggling because I have too many filter files. With 7 variables there are so many possible filter combinations that I'll get lost sooner or later.
How can I do my analysis more efficiently (and reduce the number of "filter files")?
How can I conveniently name the exported files according to the filters used?
I have read Workflow for statistical analysis and report writing and some other similar questions. However, in this case I always refer to the same basic table, so there should be a more efficient way. I do not have a CS background, so any help is highly appreciated. On Stack Overflow, I also read about creating a package, but I am not sure if that is reasonable here.
I usually do it like this:
create a list called, say, "my_case_list"
filter the data, do the computation on the filtered data
add a column called "case" to each filtered dataset, filled with a string that identifies the filter, e.g. "case 1: color=='green' & height>175"
put this data into my_case_list
convert the list to a data.frame-like object (a sketch of steps 1-5 follows after this list)
export the results to a SQL server
import the results from the SQL server into an Excel pivot table
make sense of the results
Automate the process as much as possible.
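A minimal sketch of those first steps in R with data.table, using the df from the question; the filter definitions, case labels, and the commented-out export line are illustrative assumptions, not part of the original workflow:

library(data.table)

# One named list of filters replaces the many separate "filter files".
filters <- list(
  green_tall = function(d) d[color == "green" & height > 175],
  blue_short = function(d) d[color == "blue" & height <= 175]
)

# Apply each filter, tag the rows with the case label, and collect the results.
my_case_list <- lapply(names(filters), function(case_name) {
  out <- filters[[case_name]](df)
  out[, case := case_name]   # the "case" column records which filter was used
  out
})

results <- rbindlist(my_case_list)   # one data.frame-like object holding all cases

# The case label can also be reused to name exported files consistently, e.g.:
# fwrite(results[case == "green_tall"], "results_green_tall.csv")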
So I'm trying to manipulate a simple Qualtrics CSV, and I want to use colSums on certain columns of data, given a certain filter.
For example: within the .csv file called data, I want to get the sum of a few columns, and print them with certain labels (say choice1, choice2 etc). That is easy enough by itself:
firstqn<-data.frame(choice1=data$Q7_2,choice2=data$Q7_3,choice3=data$Q7_4);
secondqn<-data.frame(choice1=data$Q8_6,choice2=data$Q8_7,choice3=data$Q8_8)
print(colSums(firstqn)); print(colSums(secondqn))
The problem comes when I want to repeat the above steps with different filters, say, only the rows where gender == 2.
The only way I know how is to create a new dataset data2 and replace data$ with data2$ in every line of the above code, such as:
data2<-(data[data$Q2==2,])
firstqn<-data.frame(choice1=data2$Q7_2,choice2=data2$Q7_3,choice3=data2$Q7_4);
However, I have 6 choices for each of 5 questions and am planning to apply about 5-10 different filters, and I don't relish the thought of copy/pasting data2, data3, etc. hundreds of times.
So my question is: Is there any way of getting R to reference data by default without using data$ in front of every variable name?
I can probably use attach() to achieve this, but I really don't want to:
data2<-(data[data$Q2==2,])
attach(data2)
firstqn<-data.frame(choice1=Q7_2,choice2=Q7_3,choice3=Q7_4);
detach(data2)
Is there a command like attach() that would allow me to avoid using data$ in front of every variable for a specified chunk of code? Then whenever I wanted to create a new filter, I could just copy/paste the same code and change the first command (defining a new dataset).
I guess I'm looking for some command like with(data2, *insert multiple commands here*)
Alternatively, if anyone has a better, entirely different way to do the above, please enlighten me; I'm not very proficient at R (yet).
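For what it's worth, with() does take a braced block of several expressions and makes the columns of the data frame visible inside it. A minimal sketch using the column names from the question (note that objects created inside the block are local to it, so only the printed output survives):

data2 <- data[data$Q2 == 2, ]   # change only this line for a different filter

with(data2, {
  firstqn  <- data.frame(choice1 = Q7_2, choice2 = Q7_3, choice3 = Q7_4)
  secondqn <- data.frame(choice1 = Q8_6, choice2 = Q8_7, choice3 = Q8_8)
  print(colSums(firstqn))   # objects created here are local to the with() block
  print(colSums(secondqn))
})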
I have a 3GB csv file. It is too large to load into R on my computer. Instead I would like to load a sample of the rows (say, 1000) without loading the full dataset.
Is this possible? I cannot seem to find an answer anywhere.
If you don't want to pay thousands of dollars to Revolution R so that you can load/analyze your data in one go, then sooner or later you will need to figure out a way to sample your data.
And that step is easier to do outside R.
(1) Linux Shell:
Assuming your data is in a consistent format, with one record per row, you can do:
sort -R data | head -n 1000 >data.sample
This will randomly sort all the rows and write the first 1000 of them into a separate file, data.sample.
(2) Database:
If the data is too large to fit into memory, there is also the option of storing it in a database. For example, I have many tables stored in a MySQL database in a nice tabular format, and I can draw a sample by doing:
select * from tablename order by rand() limit 1000
You can easily communicate between MySQL and R using RMySQL, and you can index your columns to guarantee query speed. You can also check the mean or standard deviation of the whole dataset against your sample, letting the database do the heavy lifting.
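For illustration, a minimal sketch of running that query from R via DBI/RMySQL; the connection details and table name are placeholders:

library(DBI)
library(RMySQL)

# Placeholder credentials; replace with your own server details.
con <- dbConnect(RMySQL::MySQL(), dbname = "mydb", host = "localhost",
                 user = "me", password = "secret")

# Pull a random sample of 1000 rows straight from the database.
sample_df <- dbGetQuery(con, "select * from tablename order by rand() limit 1000")

dbDisconnect(con)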
Based on my experience, these are the two most commonly used ways of dealing with "big" data.