Rstudio - how to write smaller code

Rstudio - how to write smaller code - r

I'm brand new to programming and an picking up Rstudio as a stats tool.
I have a dataset which includes multiple questionnaires divided by weeks, and I'm trying to organize the data into meaningful chunks.
Right now this is what my code looks like:
w1a=table(qwest1,talm1)
w2a=table(qwest2,talm2)
w3a=table(quest3,talm3)
Where quest and talm are the names of the variable and the number denotes the week.
Is there a way to compress all those lines into one line of code so that I could make w1a,w2a,w3a... each their own object with the corresponding questionnaire added in?
Thank you for your help, I'm very new to coding and I don't know the etiquette or all the vocabulary.

This might do what you wanted (but not what you asked for):
tbl_list <- mapply(table, list(qwest1, qwest2, quest3),
list(talm1, talm2, talm3) )
names(tbl_list) <- c('w1a', 'w2a','w3a')
You are committing a fairly typical new-R-user error in creating multiple similarly named and structured objects but not putting them in a list. This is my effort at pushing you in that direction. Could also have been done via:
qwest_lst <- list(qwest1, qwest2, quest3)
talm_lst <- list(talm1, talm2, talm3)
tbl_lst <- mapply(table, qwest_lst, talm_lst)
names(tbl_list) <- paste0('w', 1:3, 'a')
There are other ways to programmatically access objects with character vectors using get or wget.

Related

In R, Create Summary Data Frame from Multiple Objects

I'm trying to create a "summary" data frame that holds some high-level stats about a few objects in my R project. I'm having trouble even accomplishing this simple task and I've tried using For loops and Apply functions with no luck.
After searching (a lot) on SO I'm seeing that For loops might not be the best performing option, so I'm open to any solution that gets the job done.
I have three objects: text1 text2 and text3 of class "Large Character (vectors)" (imagine I might be exploring these objects and will create a NLP predictive model from them). Each are > 250 MB in size (upwards of 1 million "rows" each) once loaded into R.
My goal: Store the results of object.size() length() and max(nchar()) in a table for my 3 objects.
Method 1: Use an Apply() Function
Issue: I haven't successfully applied multiple functions to a single object. I understand how to do simple applies like lapply(x, mean) but I'm falling short here.
Method 2: Bind Rows Using a For loop
I'm liking this solution because I almost know how to implement it. A lot of SO users say this is a bad approach, but I'm lacking other ideas.
sources <- c("text1", "text2", "text3")
text.summary <- data.frame()
for (i in sources){
text.summary[i ,] <- rbind(i, object.size(get(i)), length(get(i)),
max(nchar(get(i))))
}
Issue: This returns the error data length exceeds size of matrix - I know I could define the structure of my data frame (on line 2), but I've seen too much feedback on other questions that advise against doing this.
Thanks for helping me understand the proper way to accomplish this. I know I'm going to have trouble doing NLP if I can't even figure out this simple problem, but R is my first foray into programming. Oof!

Just try for example:
do.call(rbind, lapply(list(text1,text2,text3),
function(x) c(objectSize=c(object.size(x)),length=length(x),max=max(nchar(x)))))
You'll obtain a matrix. You can coerce to data.frame later if you need.

Strangeness with filtering in R and showing summary of filtered data

I have a data frame loaded using the CSV Library in R, like
mySheet <- read.csv("Table.csv", sep=";")
I now can print a summary on that mySheet object
summary(mySheet)
and it will show me a summary for each column, for example, one column named Diagnose has the unique values RCM, UCM, HCM and it shows the number of occurences of each of these values.
I now filter by a diagnose, like
subSheet <- mySheet[mySheet$Diagnose=='UCM',]
which seems to be working, when I just type subSheet in the console it will print only the rows where the value has been matched with 'UCM'
However, if I do a summary on that subSheet, like
summary(subSheet)
it still 'knows' about the other two possibilities RCM and HCM and prints those having a value of 0. However, I expected that the new created object will NOT know about the possible values of the original mySheet I initially loaded.
Is there any way to get rid of those other possible values after filtering? I also tried subset but this one just seems to be some kind of shortcut to '[' for the interactive mode... I also tried DROP=TRUE as option, but this one didn't change the game.
Totally mind squeezing :D Any help is highly appreciated!

What you are dealing with here are factors from reading the csv file. You can get subSheet to forget the missing factors with
subSheet$Diagnose <- droplevels(subSheet$Diagnose)
or
subSheet$Diagnose <- subSheet$Diagnose[ , drop=TRUE]
just before you do summary(subSheet).
Personally I dislike factors, as they cause me too many problems, and I only convert strings to factors when I really need to. So I would have started with something like
mySheet <- read.csv("Table.csv", sep=";", stringsAsFactors=FALSE)

call columns from inside a for loop in R

I basically want to be capable to call columns from inside a for loop (in reality two nested for loops), using past() and i (j..) value of the loop to access
my data frames columns wise in a flexible manner.
#for the showcase I use the standard cars example
r1 <- cars
r2 <- cars
# in case there are more data to consider I would want to add, ore remove further with out changing the rest
# here I am entering the "dimension" of what I want to compare for the showcase its only one
num_r <- 2 #total number of reactors in the experiment
for( i in 1:num_r)
{
# shoud create proxie variable to be processed further
assign(paste("proxi_r",i,sep="", colapse="") , do.call("matrix",
list(get(paste("r",i,"$speed",sep="", colapse="" )))))
# further operations of gluing and arranging data follow so they fit tests formatting requirements
}
which gives me:
Error in get(paste("r", i, "$speed", sep = "", colapse = "")) :
object 'r1$speed' not found
but when typ r1$speed it obviously exists??
Sofare I searched "R object dont exist inside loop", "using paste() to acces variables inside loop", "foor loops and objects","do.call inside loops" ....and similar...
Is there anything to circumvent get() so I don’t have to look into the topic of environments, so I can keep the flexibility of my loops so I don’t have re-edit my script every time I have a changed the experimental configuration, which is really time consuming and allows a lot of errors to sneak inside.
The size of the data have crashed excel with extensive use of excel macros, which everyone in the lab here is using, several times :) , so there is no going back to the convort zone.
I am now trying to dig into R programming with a R statics book, and a lot of googling and reading tutorials, so please forgive my naive approach, and my lousy English.
I would be very thankful for any tips, as I feel sort of stuck right now.

This is a common confusion. You've created an object name "r1$speed" , i.e. a complete character string. This is not the same as the object r1 subsetted by $speed .
Try using get(paste('r',i,collapse='',sep=''))$speed

Select Rows and Columns At the Same Time in SPSS

I have a dataset in SPSS that has 100K+ rows and over 100 columns. I want to filter both the rows and columns at the same time into a new SPSS dataset.
I can accomplish this very easily using the subset command in R. For example:
new_data = subset(old_data, select = ColumnA >10, select = c(ColumnA, ColumnC, ColumnZZ))
Even easier would be:
new data = old_data[old_data$ColumnA >10, c(1, 4, 89)]
where I am passing the column indices instead.
What is the equivalent in SPSS?
I love R, but the read/write and data management speed of SPSS is significantly better.

I am not sure what exactly you are referring to when you write that "the read/write and data management speed of SPSS being significantly better" than R. Your question itself demonstrates how flexible R is at data management! And, a dataset of 100k rows and 100 columns is by no means a large one.
But, to answer your question, perhaps you are looking for something like this. I'm providing a "programmatic" solution, rather than the GUI one, because you're asking the question on Stack Overflow, where the focus is more on the programming side of things. I'm using a sample data file that can be found here: http://www.ats.ucla.edu/stat/spss/examples/chp/p004.sav
Save that file to your SPSS working directory, open up your SPSS syntax editor, and type the following:
GET FILE='p004.sav'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'mynewdatafile.sav'
/KEEP currentm previous lactatio.
GET FILE='mynewdatafile.sav'.
More likely, though, you'll have to go through something like this:
FILE HANDLE directoryPath /NAME='C:\path\to\working\directory\' .
FILE HANDLE myFile /NAME='directoryPath/p004.sav' .
GET FILE='myFile'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'directoryPath/mynewdatafile.sav'
/KEEP currentm previous lactatio.
FILE HANDLE myFile /NAME='directoryPath/mynewdatafile.sav'.
GET FILE='myFile'.
You should now have a new file created that has just three columns, and where no value in the "lactatio" column is greater than 3.
So, the basic steps are:
Load the data you want to work with.
Subset for all columns from all the cases you're interested in.
Save a new file with only the variables you're interested in.
Load that new file before you proceed.
With R, the basic steps are:
Load the data you want to work with.
Create an object with your subset of rows and columns (which you know how to do).
Hmm.... I don't know about you, but I know which method I prefer ;)
If you're using the right tools with R, you can also directly read in the specific subset you are interested in without first loading the whole dataset if speed really is an issue.

In spss you can't combine the two actions in one command, but it's easy enough to do it in two:
dataset copy old_data. /* delete this if you don't need to keep both old and new data.
select if ColumnA>10.
add files /file=* /keep=ColumnA ColumnC ColumnZZ.

Undo command in R

I can't find something to the effect of an undo command in R (neither on An Introduction to R nor in R in a Nutshell). I am particularly interested in undoing/deleting when dealing with interactive graphs.
What approaches do you suggest?

You should consider a different approach which leads to reproducible work:
Pick an editor you like and which has R support
Write your code in 'snippets', ie short files for functions, and then use the facilities of the editor / R integration to send the code to the R interpreter
If you make a mistake, re-edit your snippet and run it again
You will always have a log of what you did
All this works tremendously well in ESS which is why many experienced R users like this environment. But editors are a subjective and personal choice; other people like Eclipse with StatET better. There are other solutions for Mac OS X and Windows too, and all this has been discussed countless times before here on SO and on other places like the R lists.

In general I do adopt Dirk's strategy. You should aim for your code to be a completely reproducible record of how you have transformed your raw data into output.
However, if you have complex code it can take a long time to re-run it all. I've had code that takes over 30 minutes to process the data (i.e., import, transform, merge, etc.).
In these cases, a single data-destroying line of code would require me to wait 30 minutes to restore my workspace.
By data destroying code I mean things like:
x <- merge(x, y)
df$x <- df$x^2
e.g., merges, replacing an existing variable with a transformation, removing rows or columns, and so on. In these cases, it's easy, especially when first learning R to make a mistake.
To avoid having to wait this 30 minutes, I adopt several strategies:
If I'm about to do something where there's a risk of destroying my active objects, I'll first copy the result into a temporary object. I'll then check that it worked with the temporary object and then rerun replacing it with the proper object.
E.g., first run temp <- merge(x, y); check that it worked str(temp); head(temp); tail(temp) and if everything looks good x <- merge(x, y)
As is common in psychological research, I often have large data frames with hundreds of variables and different subsets of cases. For a given analysis (e.g., a table, a figure, some results text), I'll often extract just the subset of cases and variables that I need into a separate object for the analysis and work with that object when preparing and finalising my analysis code. That way, I'm less likely to accidentally damage my main data frame. This assumes that the results of the analysis does not need to be fed back into the main data frame.
If I have finished performing a large number of complex data transformations, I may save a copy of the core workspace objects. E.g., save(x, y, z , file = 'backup.Rdata') That way, If I make a mistake, I only have to reload these objects.
df$x <- NULL is a handy way of removing a variable in a data frame that you did not want to create
However, in the end I still run all the code from scratch to check that the result is reproducible.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex