How to get R to use a certain dataset for multiple commands without using attach() or appending data="" to every command - r

So I'm trying to manipulate a simple Qualtrics CSV, and I want to use colSums on certain columns of data, given a certain filter.
For example: within the .csv file called data, I want to get the sum of a few columns and print them with certain labels (say choice1, choice2, etc.). That is easy enough by itself:
firstqn<-data.frame(choice1=data$Q7_2,choice2=data$Q7_3,choice3=data$Q7_4);
secondqn<-data.frame(choice1=data$Q8_6,choice2=data$Q8_7,choice3=data$Q8_8)
print(colSums(firstqn)); print(colSums(secondqn))
The problem comes when I want to repeat the above steps with different filters, say only the rows where gender == 2.
The only way I know how is to create a new dataset data2 and replace data$ with data2$ in every line of the above code, such as:
data2<-(data[data$Q2==2,])
firstqn<-data.frame(choice1=data2$Q7_2,choice2=data2$Q7_3,choice3=data2$Q7_4);
However, I have 6 choices for each of 5 questions and am planning to apply about 5-10 different filters, and I don't relish the thought of copy/pasting data2, data3, etc. hundreds of times.
So my question is: Is there any way of getting R to reference data by default without using data$ in front of every variable name?
I can probably use attach() to achieve this, but I really don't want to:
data2<-(data[data$Q2==2,])
attach(data2)
firstqn<-data.frame(choice1=Q7_2,choice2=Q7_3,choice3=Q7_4);
detach(data2)
Is there a command like attach() that would allow me to avoid using data$ in front of every variable, for a specified block of code? Then whenever I wanted to create a new filter, I could just copy/paste the same code and change the first command (defining a new dataset).
I guess I'm looking for some command like with(data2, *insert multiple commands here*)
Alternatively, if anyone has a better, entirely different way to do the above, please enlighten me; I'm not very proficient at R (yet).
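As it happens, base R's with() does exactly this; a minimal sketch, reusing the filter and column names from the question:
data2 <- data[data$Q2 == 2, ]
with(data2, {
  # columns can now be referenced without the data2$ prefix
  firstqn <- data.frame(choice1 = Q7_2, choice2 = Q7_3, choice3 = Q7_4)
  secondqn <- data.frame(choice1 = Q8_6, choice2 = Q8_7, choice3 = Q8_8)
  print(colSums(firstqn))
  print(colSums(secondqn))
})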

Related

Assigning a list to column properties

I have a list containing values and I want to assign it to the column properties of a table in Spotfire. I am currently using a for loop to do it. Is there a better approach, like assigning the entire list in one go?
As mentioned, I am currently doing it with a for loop, shown below:
high <- c(5, 2, 10)
low <- c(3, 1, 0)
for (col in 1:ncol(temp)) {
  attributes(temp[, col])$SpotfireColumnMetaData$limits.whatif.upper <- high[col]
  attributes(temp[, col])$SpotfireColumnMetaData$limits.whatif.lower <- low[col]
}
I have also tried just to do
attributes(temp2)$SpotfireColumnData$limits.whatif.upper=high
but that didn't seem to work.
So I want the column for limits.whatif.upper to be 5 for the first row, 2 for the second, and 10 for the third. As I said, this code works, but I want to see if there is a faster way of doing it, since accessing the column property every time and changing it slows down the code a lot. The column properties already exist, so I am not creating new ones with this code.
It seems that Python works faster than R with column properties, so if you need to do it faster, it may be better just to transfer the data over to Python and do it from there. I don't have as much experience in R, so it may just be poorly written R code as well.
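If you'd rather stay in R, here is a hedged sketch that reads each column's metadata attribute once, updates both limits, and writes it back once, assuming (as stated in the question) that the SpotfireColumnMetaData attribute already exists on each column:
high <- c(5, 2, 10)
low <- c(3, 1, 0)
for (col in seq_len(ncol(temp))) {
  # one read and one write of the attribute per column
  meta <- attr(temp[[col]], "SpotfireColumnMetaData")
  meta$limits.whatif.upper <- high[col]
  meta$limits.whatif.lower <- low[col]
  attr(temp[[col]], "SpotfireColumnMetaData") <- meta
}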

Declaration of mass variables in column headings in R

I cannot figure out how to assign the column headers from my imported xlsx sheet as variables. I have several column headers, for example DAY_CHG and INPUT_CHG. So far, I can only run gls(DAY_CHG~INPUT_CHG) by first assigning the values to variables, e.g. X<-mydata$DAY_CHG. Is there some command to get these variables assigned automatically when I import?
I had horrible problems getting the program up and running, by the way, due to firewalls at the firm for which I'm working; I'm wondering if that's causing some of the issue.
Any help is much appreciated. Thanks!
attach(mydata) will allow you to directly use the variable names. However, attach may cause problems, especially with more complex data/analyses (see Do you use attach() or call variables by name or slicing? for a discussion)
An alternative would be to use with, such as with(mydata, gls(DAY_CHG ~ INPUT_CHG)).
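Many modelling functions also accept a data argument directly; assuming the gls here is nlme::gls (which takes one), a minimal sketch:
library(nlme)
# refer to the columns by name, scoped to mydata
fit <- gls(DAY_CHG ~ INPUT_CHG, data = mydata)
summary(fit)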
I would suggest using the $ in order to use the headers as variables and still be able to use other data sets. All that needs to be done is assign the data to an object such as your mydata; by putting a $ immediately after it, you can refer to your headers as variables.
As an example for your case, instead of creating a new object x, simply take what you assigned x to and put it directly into your command.
gls(mydata$DAY_CHG ~ mydata$INPUT_CHG)
When it becomes more complicated, with more data sets, this keeps all of them accessible rather than limiting you to the one data set you attach().

How to automate a process by pulling elements from a data frame in R - looping with a string?

I am trying to automate a process instead of individually computing PPCC values for a large number of test cases. The details of my functions do not matter (though for reference I'm using lmomco); my issue is putting this into a loop, or somehow using plyr or apply to repeat it over and over. I do not know how to automate the string. For example, I have sorted data by the "M" parameter:
testx.100cv1<-by(x.cv1$first_year,x.cv1$M,sort)
I then apply a function here:
testexp<-lapply(testx.100cv1,parexp)
Now I want to do something to each "M", where in the example below M = 1.02. Right now, I am manually changing this value and then recomputing for every M (and I have a lot of them). I'm looking for a way to write this M value into a loop so it is read automatically.
exp<-quaexp(plotpos,testexp$'1.02')
PPCCexp<-cor(exp,testx.100cv1$'1.02')
I want to compute PPCC values for many distributions, so without automating, this will take over my life for a week.
Thanks!
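A hedged sketch of one way to automate this: loop over names(testexp) instead of hard-coding '1.02'. This assumes plotpos and the objects above are already defined, and that quaexp and parexp come from lmomco:
# compute a PPCC value for every M, keyed by the M label
ppcc <- sapply(names(testexp), function(m) {
  fitted <- quaexp(plotpos, testexp[[m]])
  cor(fitted, testx.100cv1[[m]])
})
ppcc  # named vector, one PPCC per M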

How, in R, do I access numbered data sets with the loop variable?

Can someone please tell me how, in R, I can access numbered data sets with the loop variable?
So: I have a long list of files. In each one, I need to find all the places where a particular value appears in the second column, take the corresponding value in the same row of the third column, and list all of these in one file. How might I do this? The files are named by the title of the folder, the date, and the time, in this fashion: "name_0619_0123". There are the same number of files for each day, and they occur at the same times every day. Therefore, if there is a command that lets me put a variable (dependent on the loop counter) into the string I give for the file name, I can access a different file on each loop iteration.
Any and all ideas, please.
Also, if there is a more appropriate place for me to ask this question, please let me know.
There are probably lots of ways to do this in R:
You can use a command line script (see the R documentation).
i.e.
R CMD BATCH "--args arg1 arg2" foo.R &
Where foo.R is your R script and the args can be the loop variables you are interested in.
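Inside foo.R, those arguments can be picked up with commandArgs(); a minimal sketch (the file-name argument is illustrative):
# foo.R -- read the trailing batch arguments
args <- commandArgs(trailingOnly = TRUE)
fname <- args[1]  # e.g. a file name passed in from the shell loop
dat <- read.csv(fname)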
Another way to do this is to use regular expressions to parse out information from your file names.
If you provide a more concrete example I'll be able to show you some more specific code.
Here are some guidelines, assuming you can glob the files you need to process, either with a pattern or by picking them all up.
You can generate the list of files with list.files, read them in one shot with lapply and read.csv, and fetch what you need from each into a small data.frame. Then, using do.call and rbind on your list of data.frames, you can combine everything into a single data.frame without ever writing for explicitly.
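A hedged sketch of that pipeline, assuming the files are CSVs named like name_0619_0123.csv; target (the value to look for in the second column) and the file pattern are illustrative:
# find every file matching the naming scheme
files <- list.files(pattern = "^name_\\d{4}_\\d{4}\\.csv$")
# for each file, keep column 3 wherever column 2 equals the target value
pieces <- lapply(files, function(f) {
  d <- read.csv(f)
  data.frame(file = f, value = d[d[[2]] == target, 3])
})
# stack the per-file results and write them to a single file
result <- do.call(rbind, pieces)
write.csv(result, "matches.csv", row.names = FALSE)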

Select Rows and Columns At the Same Time in SPSS

I have a dataset in SPSS that has 100K+ rows and over 100 columns. I want to filter both the rows and columns at the same time into a new SPSS dataset.
I can accomplish this very easily using the subset command in R. For example:
new_data = subset(old_data, subset = ColumnA > 10, select = c(ColumnA, ColumnC, ColumnZZ))
Even easier would be:
new_data = old_data[old_data$ColumnA > 10, c(1, 4, 89)]
where I am passing the column indices instead.
What is the equivalent in SPSS?
I love R, but the read/write and data management speed of SPSS is significantly better.
I am not sure what exactly you are referring to when you write that "the read/write and data management speed of SPSS is significantly better" than R's. Your question itself demonstrates how flexible R is at data management! And a dataset of 100k rows and 100 columns is by no means a large one.
But, to answer your question, perhaps you are looking for something like this. I'm providing a "programmatic" solution, rather than the GUI one, because you're asking the question on Stack Overflow, where the focus is more on the programming side of things. I'm using a sample data file that can be found here: http://www.ats.ucla.edu/stat/spss/examples/chp/p004.sav
Save that file to your SPSS working directory, open up your SPSS syntax editor, and type the following:
GET FILE='p004.sav'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'mynewdatafile.sav'
/KEEP currentm previous lactatio.
GET FILE='mynewdatafile.sav'.
More likely, though, you'll have to go through something like this:
FILE HANDLE directoryPath /NAME='C:\path\to\working\directory\' .
FILE HANDLE myFile /NAME='directoryPath/p004.sav' .
GET FILE='myFile'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'directoryPath/mynewdatafile.sav'
/KEEP currentm previous lactatio.
FILE HANDLE myFile /NAME='directoryPath/mynewdatafile.sav'.
GET FILE='myFile'.
You should now have a new file created that has just three columns, and where no value in the "lactatio" column is greater than 3.
So, the basic steps are:
Load the data you want to work with.
Subset for the cases (rows) you're interested in.
Save a new file with only the variables you're interested in.
Load that new file before you proceed.
With R, the basic steps are:
Load the data you want to work with.
Create an object with your subset of rows and columns (which you know how to do).
Hmm.... I don't know about you, but I know which method I prefer ;)
If you're using the right tools with R, you can also directly read in the specific subset you are interested in without first loading the whole dataset if speed really is an issue.
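For instance, a hedged sketch with data.table's fread, assuming the data live in a CSV (the file name old_data.csv is illustrative):
library(data.table)
# read only the three columns of interest, then filter the rows
new_data <- fread("old_data.csv", select = c("ColumnA", "ColumnC", "ColumnZZ"))
new_data <- new_data[ColumnA > 10]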
In SPSS you can't combine the two actions in one command, but it's easy enough to do it in two:
dataset copy old_data. /* delete this if you don't need to keep both old and new data.
select if ColumnA>10.
add files /file=* /keep=ColumnA ColumnC ColumnZZ.
