Correlations and what brackets indicate - r

I have this code, from Julian Faraway's linear models book:
round(cor(seatpos[,-9]),2)
I am unsure what the [,-9] and the 2 are doing - could someone please explain?

When you are learning new material, nested functions can be difficult to read. The same computation can be done in steps, which might make it easier to see what KeonV and MrFlick are suggesting.
Here is an alternative way of doing this with the same functions, but in separate, easier-to-follow steps with simple explanations.
sub_seatpos <- seatpos[,-9]
This says: take a subset with all rows and all columns EXCEPT column number nine, and save it into sub_seatpos. (This subsetting was done in the initial code too, but not saved into a new variable; naming it just makes each step easier to see.) It corresponds to the seatpos[,-9] part of the original line:
round(cor(seatpos[,-9]), 2)
cor_seatpos <- cor(sub_seatpos)
This takes the correlations of sub_seatpos and saves them into a variable named cor_seatpos. It corresponds to the cor(...) part:
round(cor(seatpos[,-9]), 2)
The final step just says: round the correlations to 2 decimal places. On its own line it would look like this:
round(cor_seatpos, 2)
and it corresponds to the outer round(..., 2) call:
round(cor(seatpos[,-9]), 2)
What makes this confusing is that all of the functions are nested. As you become more proficient, nested calls become easier to read, but they can be hard to follow when the functions are new to you.
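Putting the steps back together, here is a minimal end-to-end sketch, assuming seatpos comes from the faraway package (its 9th column, hipcenter, is the response variable):
library(faraway)                 # provides the seatpos dataset
sub_seatpos <- seatpos[, -9]     # every row, every column except the 9th
cor_seatpos <- cor(sub_seatpos)  # pairwise correlations of the remaining columns
round(cor_seatpos, 2)            # round each correlation to 2 decimal places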

Related

LibreOffice Calc - How to reuse a multi-step formula?

I'm making a balance sheet: Sheet1 is for the ins and outs, where most values are added manually or with simple formulas, and Sheet2 is where I created a formula, in the hope of being able to reuse it.
I'm not an accountant, so I don't know how the calculations could be made simpler; and as a programmer, I suspect that the way I'm imagining the solution is likely impossible with the way LibreOffice Calc's formulas work.
So, to explain a bit.
On Sheet1, each column is a month, and the value is a tax that appears once each month, dependent on another value. The base value is in row 17, and in row 18 I would like the result to be set, for every month of course.
On Sheet2, I have the formula; it contains 5 steps, with the values being reused a lot (hence, collapsing everything into one line would be hell).
This is the complex formula in question: D1 is the input, C6 is the output.
The formula below is the one used in C2 and repeated down to C5.
I would like to keep the constants as a table, since that makes them easier to update in the future if they ever change.
I have been searching for a possible solution but found none, and I believe that's likely because I'm looking for a solution like a programmer would (use a sheet as a function); I should probably be seeking some other sort of approach, but I don't know how Calc works.
Regarding the calculation, I don't know its specific name, but the idea is: remove B1% of the amount between 0 and A1, then remove B2% of the amount between A1 and A2, and so on.
Of course, the formula's complexity comes from handling the lower values. For example, if D1 were 2K, then I would take 7.5% of R$ 96.02, and everything beyond that is 0, since there is nothing remaining in the higher brackets to calculate.
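To make that bracket logic concrete, here is a minimal sketch in R with made-up limits and rates; the asker's real constants live in Sheet2 and are not shown in the question:
# hypothetical bracket limits and rates, purely for illustration
limits <- c(1000, 2000, 3000)   # upper bound of each bracket
rates  <- c(0.00, 0.075, 0.15)  # rate charged within each bracket
progressive_tax <- function(base, limits, rates) {
  lower <- c(0, head(limits, -1))               # lower bound of each bracket
  taxed <- pmax(pmin(base, limits) - lower, 0)  # portion of base in each bracket
  sum(taxed * rates)                            # total removed across brackets
}
progressive_tax(2000, limits, rates)  # 1000 * 0.075 = 75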
Most of the descriptions I found of MULTIPLE.OPERATIONS were confusing, but I found one that made it much easier to understand.
The answer was to use this formula on Sheet1:
=MULTIPLE.OPERATIONS('Sheet2'.$C$6, 'Sheet2'.$D$1, C17)
I can just copy-paste it across the other columns and the calculation is executed properly.
To explain the arguments:
1 - the cell that holds the formula whose result you want (here, 'Sheet2'.$C$6)
2 - the location of the main/first formula variable (here, 'Sheet2'.$D$1)
3 - the location of the dynamic value you want substituted into that variable (here C17, which is why it points to Sheet1)
More arguments could be used if more variables were needed, but I only needed one.
This is the page with the best explanation I found for the function:
https://wiki.documentfoundation.org/Documentation/Calc_Functions/MULTIPLE.OPERATIONS

Am I using the most efficient (or right) R instructions?

First question; I'll try to get straight to the point.
I'm currently working with tables, and I chose R because it has no hard limit on data frame size and can perform many operations on the data within the tables. I am happy with that: I can manipulate the data at will, and merges, concatenations, and row and column manipulation all work fine. But I recently had to run a loop of about 0.00001 sec per instruction over a 6 million-row table, and it took over an hour.
Maybe the R approach was wrong to begin with, and I've tried to look for the most efficient ways to run some operations (using list assignment by index instead of c(list, new_element)). But since, as far as I can tell, this is not something you can optimize with some algorithm like graphs or heaps (it's just tables; you have to iterate through all of it), I was wondering whether there are other instructions or basic ways to work with tables that I don't know of (assign, extract, ...) that take less time, or some RStudio configuration that would improve performance.
This is the loop, in case it helps to understand the question:
library(dplyr)  # provides %>% and pull(), which this loop uses
my_list <- vector("list", nrow(table[, "Date_of_count"]))
for(i in 1:nrow(table[, "Date_of_count"])){
  my_list[[i]] <- format(as.POSIXct(strptime(table[i, "Date_of_count"] %>% pull(1), "%Y-%m-%d")), format = "%Y-%m-%d")
}
The table, as mentioned, has over 6 million rows and 25 variables. I want the list to be filled so I can append it to the table as a column once finished.
Please let me know if this lacks specificity or concreteness, or if it just does not belong here.
In order to improve performance (and to work properly with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, I realized, had some tibbles inside) into a data frame and followed the keys above.
df <- as.data.frame(table)
In this case, doing this converted the dates directly to character, so I did not have to apply any further conversions.
New execution time over 6 million rows: 25.25 sec.
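For comparison, here is a minimal vectorized sketch of the same conversion, assuming Date_of_count holds "YYYY-MM-DD" strings; the whole column is converted in one call instead of one row at a time (the Date_formatted column name is just for illustration):
df <- as.data.frame(table)  # flatten any nested tibbles first
# one vectorized conversion for all rows, no loop
df$Date_formatted <- format(as.Date(df$Date_of_count, format = "%Y-%m-%d"), "%Y-%m-%d")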

Matrix help: finding the average without the zeros

I'm creating a Monte Carlo model using R. My model creates matrices that are filled with either zeros or values that fall within the constraints. I'm running a couple hundred thousand n values through my model, and I want to find the average of the non-zero entries I've created. I'm guessing I can do something in the last section.
Thanks for the help!
Code:
n<-252500
PaidLoss_1<-numeric(n)
PaidLoss_2<-numeric(n)
PaidLoss_3<-numeric(n)
PaidLoss_4<-numeric(n)
PaidLoss_5<-numeric(n)
PaidLoss_6<-numeric(n)
PaidLoss_7<-numeric(n)
PaidLoss_8<-numeric(n)
PaidLoss_9<-numeric(n)
for(i in 1:n){
  # draw one of 19 claim types from the multinomial probabilities
  claim_type <- rmultinom(1, 1, c(0.00166439057698873, 0.000810856947763742, 0.00183509730283373, 0.000725503584841243, 0.00405428473881871, 0.00725503584841243, 0.0100290201433936, 0.00529190850119495, 0.0103277569136224, 0.0096449300102424, 0.00375554796858996, 0.00806589279617617, 0.00776715602594742, 0.000768180266302492, 0.00405428473881871, 0.00226186411744623, 0.00354216456128371, 0.00277398429498122, 0.000682826903379993))
  claim_type <- which(claim_type == 1)
  claim_Amanda <- runif(1, min = 34115, max = 2158707.51)
  claim_Bob    <- runif(1, min = 16443, max = 413150.50)
  claim_Claire <- runif(1, min = 30607.50, max = 1341330.97)
  claim_Doug   <- runif(1, min = 17554.20, max = 969871)
  # claim types 7 and above are not handled here, so those entries stay 0
  if(claim_type == 1){PaidLoss_1[i] <- 1 * claim_Amanda}
  if(claim_type == 2){PaidLoss_2[i] <- 0 * claim_Amanda}
  if(claim_type == 3){PaidLoss_3[i] <- 1 * claim_Bob}
  if(claim_type == 4){PaidLoss_4[i] <- 0 * claim_Bob}
  if(claim_type == 5){PaidLoss_5[i] <- 1 * claim_Claire}
  if(claim_type == 6){PaidLoss_6[i] <- 0 * claim_Claire}
}
PaidLoss1<-sum(PaidLoss_1)/2525
PaidLoss3<-sum(PaidLoss_3)/2525
PaidLoss5<-sum(PaidLoss_5)/2525
PaidLoss7<-sum(PaidLoss_7)/2525
(Partial output of the numeric matrix was attached as an image here.)
First, let me make sure I've wrapped my head around what you want to do: you have several columns -- in your example, PaidLoss_1, ..., PaidLoss_9, which have many entries. Some of these entries are 0, and you'd like to take the average (within each column) of the entries that are not zero. Did I get that right?
If so:
Comment 1: At the very end of your code, you might want to avoid using sum and dividing by a number to get the mean you want. It obviously works, but it opens you up to a risk: if you ever change the value of n at the top, then in the best case you have to edit several lines further down, and in the worst case you forget to do that. So, I'd suggest something more like mean(PaidLoss_1) to get your mean.
Right now, you have n as 252500, and your denominator at the end is 2525, which has the effect of inflating your mean by a factor of 100. Maybe that's what you wanted; if so, I'd recommend mean(PaidLoss_1) * 100 for the same reasons as above.
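As a tiny illustration of why mean() is safer than a hard-coded denominator:
x <- c(2, 4, 6)
sum(x) / length(x)  # 4
mean(x)             # 4 as well, with no denominator to keep in sync with n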
Comment 2: You can do what you want via subsetting. Take a smaller example as a demonstration:
test <- c(10, 0, 10, 0, 10, 0)
mean(test) # gives 5
test!=0 # a vector of TRUE/FALSE for which are nonzero
test[test!=0] # the subset of test which we found to be nonzero
mean(test[test!=0]) # gives 10, the average of the nonzero entries
The middle three lines are just for demonstration; the only necessary lines to do what you want are the first (to declare the vector) and the last (to get the mean). So your code should be something like PaidLoss1 <- mean(PaidLoss_1[PaidLoss_1 != 0]), or perhaps that times 100.
Comment 3: You might consider organizing your stuff into a dataframe. Instead of typing PaidLoss_1, PaidLoss_2, etc., it might make sense to organize all this PaidLoss stuff into a matrix. You could then access elements of the matrix with [ , ] indexing. This would be useful because it would clean up some of the code and prevent you from having to type lots of things; you could also then make use of things like the apply() family of functions to save you from having to type the same commands over and over for different columns (such as the mean). You could also use a dataframe or something else to organize it, but having some structure would make your life easier.
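For instance, a minimal sketch of that idea (the matrix layout and names here are my own, not part of your original code):
PaidLoss <- matrix(0, nrow = n, ncol = 9)  # column j replaces PaidLoss_j
# inside the simulation loop you would fill PaidLoss[i, claim_type] <- ...
# afterwards, one line computes the nonzero mean of every column
# (columns with no nonzero entries return NaN):
apply(PaidLoss, 2, function(col) mean(col[col != 0]))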
(And to be super clear, your code is exactly what my code looked like when I first started writing in R. You can decide if it's worth pursuing some of that optimization; it probably just depends how much time you plan to eventually spend in R.)

Using the value of a function & nested functions in R

I wrote a function in R called "filtre": it takes a data frame, and for each line it decides whether that line should go into, say, bin 1 or bin 2. At the end, we have two data frames that together sum to the original input, corresponding respectively to all lines thrown into bin 1 or bin 2. These two sets are referred to as filtre1 and filtre2. For convenience, the values of filtre1 and filtre2 are calculated but not returned, because they are an intermediate step in a bigger process (plus they are quite big data frames). I have the following issue:
(i) When I later want to use filtre1 (or filtre2), they simply don't show up... as if their values were stuck inside the function and not recognised elsewhere, which would force me to copy the whole function every time I want to use them - quite painful and heavy.
I suspect this is a rather simple thing, but I searched the web and did not really find the answer (I was not sure of the best keywords). Sorry for any inconvenience.
Thxs / g.
It's pretty hard to know the optimal way to achieve what you want, as you did not provide a proper example, but I'll give it a try. If your variables filtre1 and filtre2 are defined inside your function and you do not return them, of course they do not show up in your environment. But you could just return the classification and build filtre1 and filtre2 afterwards:
# example data
df <- data.frame(id = 1:20, x = sample(1:20, 20, replace = TRUE))

filtre <- function(df){
  # example function; this whole loop could of course be done by bins <- df$x < 10
  bins <- numeric(nrow(df))
  for(i in 1:nrow(df))
    if(df$x[i] < 10)  # index with [i]; comparing the whole column here was a bug
      bins[i] <- 1
  return(bins)
}

bins <- filtre(df)
filtre1 <- df[bins == 1, ]
filtre2 <- df[bins == 0, ]
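If you would rather have the function hand back both data frames at once, here is a minimal sketch (the list-based return is my own suggestion, not part of the answer above):
filtre_both <- function(df){
  bins <- df$x < 10                      # TRUE for bin 1, FALSE for bin 2
  list(filtre1 = df[bins, ], filtre2 = df[!bins, ])
}
res <- filtre_both(df)
head(res$filtre1)  # access each piece from the returned list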

How can I structure and recode messy categorical data in R?

I'm struggling with how to best structure categorical data that's messy, and comes from a dataset I'll need to clean.
The Coding Scheme
I'm analyzing data from a university science course exam. We're looking at patterns in
student responses, and we developed a coding scheme to represent the kinds of things
students are doing in their answers. A subset of the coding scheme is shown below.
Note that within each major code (1, 2, 3) are nested non-unique sub-codes (a, b, ...).
What the Raw Data Looks Like
I've created an anonymized, raw subset of my actual data which you can view here.
Part of my problem is that those who coded the data noticed that some students displayed multiple patterns. The coders' solution was to create enough columns (reason1, reason2, ...) to hold students with multiple patterns. That becomes important because the order (reason1, reason2) is arbitrary: two students (like student 41 and student 42 in my dataset) who correctly applied "dependency" should both register in an analysis, regardless of whether 3a appears in the reason1 column or the reason2 column.
How Can I Best Structure Student Data?
Part of my problem is that in the raw data, not all students display the same
patterns, or the same number of them, in the same order. Some students may do just one
thing, others may do several. So, an abstracted representation of example students might
look like this:
Note in the example above that student002 and student003 both are coded as "1b", although I've deliberately shown the order as different to reflect the reality of my data.
My (Practical) Questions
Should I concatenate reason1, reason2, ... into one column?
How can I (re)code the reasons in R to reflect the multiplicity for some students?
Thanks
I realize this question is as much about good data conceptualization as it is about specific features of R, but I thought it would be appropriate to ask it here. If you feel it's inappropriate for me to ask the question, please let me know in the comments, and stackoverflow will automatically flood my inbox with sadface emoticons. If I haven't been specific enough, please let me know and I'll do my best to be clearer.
Make it "long":
library(reshape)
dnow <- read.csv("~/Downloads/catsample20100504.csv")
dnow <- melt(dnow, id.vars=c("Student", "instructor"))
dnow$variable <- NULL ## since ordering does not matter
subset(dnow, Student%in%c(41,42)) ## see the results
What to do next will depend on the kind of analysis you would like to do, but the long format is the most useful for irregular data such as yours.
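For example, once the data is long, checking who used a given code no longer depends on which reason column it originally sat in (the column names below come from the melt() call above; the code value "3a" is the one from the question):
# every student coded "3a" in any of the reason columns
unique(dnow$Student[dnow$value %in% "3a"])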
You should use ddply from plyr and split on all of the columns if you want to take the different reasons into account; if you want to ignore them, don't use those columns in the split. You'll need to clean up some of the question marks and extra stuff first, though.
library(plyr)
# split_column1, split_column2, ... are placeholders for your real column names
x <- ddply(data, c("split_column1", "split_column2"), summarise,
           count = length(value))
What's the (bigger picture) question you're attempting to answer? Why is this information interesting to you?
Are you just trying to find patterns such as 'if the student does this, then they also likely do this'?
Something I'd consider if that's the case: split the data set into smaller random samples for your analysis, to reduce the risk of false positives; a quick sketch follows below.
Interesting problem though!
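A minimal sketch of that splitting idea, assuming the cleaned data lives in a data frame called dnow; the five-way split is arbitrary:
set.seed(42)                                       # make the sampling reproducible
grp <- sample(rep(1:5, length.out = nrow(dnow)))   # random group label for each row
samples <- split(dnow, grp)                        # a list of five random subsamples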
