closed/fixed:Interpretation of basic R code - r

I have a basic question regarding the R programming language.
I'm at a beginner's level and I wish to understand the meaning behind two lines of code I found online. Here is the code:
as.data.frame(y[1:(n-k)])
as.data.frame(y[(k+1):n])
... where y and n are given. I do understand that the results are transformed into a data frame by the function as.data.frame() but what about the rest? I'm still at a beginners level so pardon me if this question is off-topic or irrelevant in this forum. Thank you in advance, I appreciate every answer :)

Looks like you understand the as.data.frame() function so let's look at what is happening inside of it. We're looking at y[1:(n-k)]. Here, y is a vector which is a collection of data points of the same type. For example:
> y <- c(1,2,3,4,5,6)
Try running that and then calling back y. What you get are those numbers listed out. Now, consider the case you want to just call out the number 1 in that vector. How would you do that? Well, this is where the brackets come into play. If you wanted to just call the number 1 in y:
> y[1]
[1] 1
Therefore, the brackets are a way of calling out or indexing specific items in the vector. Note that the indexing starts at the value 1 and goes up to the number of items in the vector, or length. One last thing before we go back to the example you gave. What if we want to index the numbers 1, 2, and 3 from the vector but not the rest?
> y[1:3]
[1] 1 2 3
This is where the colon comes into play. It allows us to reference a subset of the numbers. However, it will reference all the numbers between the index left of the colon and right of it. Try this out for yourself in R! Play around and see what happens.
Finally going back to your example:
y[1:(n-k)]
How would this work based on what we discussed? Well, the colon means that we are indexing all values in the vector y from two index values. What are those values? Well, they are the numbers to the left and right of the colon. Therefore, we are asking R to give us the values from the first position (index of 1) to the (n-k) position. Therefore, it's important to know what n and k are. If n is 4 and k is 1 then the command becomes:
y[1:3]
The same logic applies to the second as.data.frame() command in your question: y[(k+1):n] picks the values from position k+1 through position n. Essentially, R is picking out two different subsets of the vector y and converting each one into a data frame.
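To make this concrete, here is a small sketch. The vector y and the values of n and k are made up for illustration (the question only says that y and n are given), and n is assumed to be the length of y.

```r
# Made-up values for illustration only
y <- c(10, 20, 30, 40, 50, 60)
n <- length(y)   # assumption: n is the length of y
k <- 2           # assumption: k is some small whole number

first_part  <- y[1:(n - k)]    # positions 1 to n-k
second_part <- y[(k + 1):n]    # positions k+1 to n

as.data.frame(first_part)
as.data.frame(second_part)

# The parentheses matter: ":" is evaluated before "-",
# so 1:n - k means (1:n) - k, which is a different set of numbers.
with_parens    <- 1:(n - k)
without_parens <- 1:n - k
```

Both pieces have the same length (n - k), and the second one is the first one shifted along by k positions.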
Hope this helps. The best way to learn R is to play around with a command, throw different numbers at it, guess what will happen, and then see what happens!

Related

Find the index of the last occurrence of fulfilled criteria in a matrix in r

I have an array (x) in R of size 30x11x10.
x=array(-2:20, c(30,11,10))
Each 'grid' or matrix represents a day of data for a month (30 days represented here). I want to find the index (i,j,k) of when the last occurrence of a number less than 2 occurs. Ideally, I would also like the value returned too. If this was in Matlab, I could just use [i,j,k]=find(x(x<2)) but I don't see an exact equivalent for this in R.
I have looked at 'match' as suggested in other posts here, but it seems to find elements when they are specified, but not when a criteria (x<2) is given?
I tried this:
xxx<-match(x,x<2,0) but it returns a long vector of integers that don't appear to show what I am looking for.
Then I tried:xxx<-match(x,x[x<2],0) which looks a bit more promising, but still isn't what I want (to be honest I'm not sure what the output is indexing).
I think I'm probably asking a foolish question here because if I want 3 indices and the value returned, then I should be assigning them to something preemptively right (which I'm not doing)? Can anyone offer any advice?
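One possible approach, offered only as a sketch: which() accepts arr.ind = TRUE, which returns the array indices (one row per match) rather than positions in one long vector. Here "last" is assumed to mean last in R's storage order, where the first index varies fastest; if "last" should mean something else (for example the latest day), the rows would need to be sorted differently.

```r
x <- array(-2:20, c(30, 11, 10))       # the array from the question

# One row per element that satisfies the condition; columns are i, j, k
hits <- which(x < 2, arr.ind = TRUE)

# Assumption: "last" means the final match in storage order
last_hit   <- hits[nrow(hits), ]
last_value <- x[last_hit[1], last_hit[2], last_hit[3]]

last_hit
last_value
```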

Matrix help: Finding average without the zeros

I'm creating a Monte Carlo model using R. My model creates matrices that are filled with either zeros or values that fall within the constraints. I'm running a couple hundred thousand n values through my model, and I want to find the average of the nonzero entries of the matrices that I've created. I'm guessing I can do something in the last section.
Thanks for the help!
Code:
n<-252500
PaidLoss_1<-numeric(n)
PaidLoss_2<-numeric(n)
PaidLoss_3<-numeric(n)
PaidLoss_4<-numeric(n)
PaidLoss_5<-numeric(n)
PaidLoss_6<-numeric(n)
PaidLoss_7<-numeric(n)
PaidLoss_8<-numeric(n)
PaidLoss_9<-numeric(n)
for(i in 1:n){
claim_type<-rmultinom(1,1,c(0.00166439057698873, 0.000810856947763742, 0.00183509730283373, 0.000725503584841243, 0.00405428473881871, 0.00725503584841243, 0.0100290201433936, 0.00529190850119495, 0.0103277569136224, 0.0096449300102424, 0.00375554796858996, 0.00806589279617617, 0.00776715602594742, 0.000768180266302492, 0.00405428473881871, 0.00226186411744623, 0.00354216456128371, 0.00277398429498122, 0.000682826903379993))
claim_type<-which(claim_type==1)
claim_Amanda<-runif(1, min=34115, max=2158707.51)
claim_Bob<-runif(1, min=16443, max=413150.50)
claim_Claire<-runif(1, min=30607.50, max=1341330.97)
claim_Doug<-runif(1, min=17554.20, max=969871)
if(claim_type==1){PaidLoss_1[i]<-1*claim_Amanda}
if(claim_type==2){PaidLoss_2[i]<-0*claim_Amanda}
if(claim_type==3){PaidLoss_3[i]<-1* claim_Bob}
if(claim_type==4){PaidLoss_4[i]<-0* claim_Bob}
if(claim_type==5){PaidLoss_5[i]<-1* claim_Claire}
if(claim_type==6){PaidLoss_6[i]<-0* claim_Claire}
}
PaidLoss1<-sum(PaidLoss_1)/2525
PaidLoss3<-sum(PaidLoss_3)/2525
PaidLoss5<-sum(PaidLoss_5)/2525
PaidLoss7<-sum(PaidLoss_7)/2525
(Image: partial output of my numeric matrix)
First, let me make sure I've wrapped my head around what you want to do: you have several columns -- in your example, PaidLoss_1, ..., PaidLoss_9, which have many entries. Some of these entries are 0, and you'd like to take the average (within each column) of the entries that are not zero. Did I get that right?
If so:
Comment 1: At the very end of your code, you might want to avoid using sum and dividing by a number to get the mean you want. It obviously works, but it opens you up to a risk: if you ever change the value of n at the top, then in the best case scenario you have to edit several lines down below, and in the worst case scenario you forget to do that. So, I'd suggest something more like mean(PaidLoss_1) to get your mean.
Right now, you have n as 252500, and your denominator at the end is 2525, which has the effect of inflating your mean by a factor of 100. Maybe that's what you wanted; if so, I'd recommend mean(PaidLoss_1) * 100 for the same reasons as above.
Comment 2: You can do what you want via subsetting. Take a smaller example as a demonstration:
test <- c(10, 0, 10, 0, 10, 0)
mean(test) # gives 5
test!=0 # a vector of TRUE/FALSE for which are nonzero
test[test!=0] # the subset of test which we found to be nonzero
mean(test[test!=0]) # gives 10, the average of the nonzero entries
The middle three lines are just for demonstration; the only necessary lines to do what you want are the first (to declare the vector) and the last (to get the mean). So your code should be something like PaidLoss1 <- mean(PaidLoss_1[PaidLoss_1 != 0]), or perhaps that times 100.
Comment 3: You might consider organizing your stuff into a matrix. Instead of typing PaidLoss_1, PaidLoss_2, etc., it might make sense to keep all this PaidLoss stuff in one matrix, with one column per PaidLoss vector. You could then access elements of the matrix with [ , ] indexing. This would be useful because it would clean up some of the code and prevent you from having to type lots of things; you could also then make use of things like the apply() family of functions to save you from having to type the same commands over and over for different columns (such as the mean). You could also use a dataframe or something else to organize it, but having some structure would make your life easier.
(And to be super clear, your code is exactly what my code looked like when I first started writing in R. You can decide if it's worth pursuing some of that optimization; it probably just depends how much time you plan to eventually spend in R.)
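As a rough sketch of Comment 3, assuming a small made-up matrix in place of the real simulation output (the name paid_loss, the helper nonzero_mean and the filled-in values are all hypothetical):

```r
n <- 10   # tiny n for illustration; the question uses 252500

# Hypothetical stand-in for the simulation results: one column per PaidLoss vector
paid_loss <- matrix(0, nrow = n, ncol = 3,
                    dimnames = list(NULL, paste0("PaidLoss_", 1:3)))
paid_loss[c(2, 5), "PaidLoss_1"]    <- c(100, 300)
paid_loss[c(1, 7, 9), "PaidLoss_3"] <- c(10, 20, 60)

# Mean of the nonzero entries of one column
nonzero_mean <- function(col) mean(col[col != 0])

# Apply it to every column (MARGIN = 2 means "by column")
col_means <- apply(paid_loss, 2, nonzero_mean)
col_means
```

Note that a column containing only zeros (PaidLoss_2 here) gives NaN, because there is nothing left to average.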

Understanding the logic of R code [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 5 years ago.
I am learning R through tutorials, but I have difficulties in "how to read" R code, which in turn makes it difficult to write R code. For example:
dir.create(file.path("testdir2","testdir3"), recursive = TRUE)
vs
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])
While I know what these lines of code do, I cannot read or interpret the logic of each line of code. Whether I read left to right or right to left. What strategies should I use when reading/writing R code?
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])
Don't let lines of code like this ruin writing R code for you
I'm going to be honest here. The code is bad. And for many reasons.
Not a lot of people can read a line like this and intuitively know what the output is.
The point is you should not write lines of code that you don't understand. This is not Excel, you do not have but 1 single line to fit everything within. You have a whole deliciously large script, an empty canvas. Use that space to break your code into smaller bits that make a beautiful mosaic piece of art! Let's dive in~
Dissecting the code: Data Frames
Reading a line of code is like looking at a face for familiar features. You can read left to right, middle to out, whatever -- as long as you can lock onto something that is familiar.
Okay you see data.combined. You know (hope) it has rows and columns... because it's data!
You spot a $ in the code and you know it has to be a data.frame. This is because only lists and data.frames (which are really just lists) allow you to subset columns using $ followed by the column name. Subset, by the way, just means looking at a portion of the whole. In R, subsetting for data.frames and matrices can be done using single brackets [ ], within which you will see [row, column]. Thus if we type data.combined[1,2], it would give you the value in row 1 of column 2.
Now, if you knew that the name of column 2 was name you can use data.combined[1,"name"] to get the same output as data.combined$name[1]. Look back at that code:
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])
Okay, so now we see our eyes should be locked on data.combined[SOMETHING IS IN HERE?!]) and slowly be picking out data.combined[ ?ROW? , Oh the "name" column]. Cool.
Finding those ROW values!
which(duplicated(as.character(data.combined$name)))
Anytime you see the which function, it is just giving you locations. An example: For the numeric vector a = c(1,2,2,1), which(a == 1) would give you 1 and 4, the locations of the 1s in a.
Now duplicated is simple too. duplicated(a) (which is just duplicated(c(1,2,2,1))) will give you back FALSE FALSE TRUE TRUE. If we ran which(duplicated(a)) it would return 3 and 4. Now here is a secret you will learn. If you have TRUEs and FALSEs, you don't need to use the which function, because a logical vector can be used directly inside the brackets! So maybe which was unnecessary here. And also as.character... since duplicated works on numbers and strings.
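The pieces described above can be tried on their own, using the same small vector a from the explanation:

```r
a <- c(1, 2, 2, 1)

ones_at  <- which(a == 1)         # locations of the 1s
is_dupe  <- duplicated(a)         # TRUE where a value has been seen before
dupes_at <- which(is_dupe)        # locations of those TRUEs

# Subsetting with the logical vector or with the locations gives the same values
by_logical  <- a[is_dupe]
by_location <- a[dupes_at]
identical(by_logical, by_location)
```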
What You Should Be Writing
Who am I to tell you how to write code? But here's my take.
Don't mix up ways of subsetting: use EITHER data.frame[,column] or data.frame$column...
The code could have been written a little bit more legibly as:
dupes <- duplicated(data.combined$name)
dupe.names <- data.combined$name[dupes]
or equally:
dupes <- duplicated(data.combined[,"name"])
dupe.names <- data.combined[dupes,"name"]
I know this was lengthy but I hope it helps.
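To check that the shorter version gives the same answer as the original one-liner, here is a sketch on a made-up data frame (the contents of data.combined below are hypothetical, and the name column is assumed to hold plain character strings):

```r
# Hypothetical data, only for illustration
data.combined <- data.frame(name = c("Ann", "Bob", "Ann", "Cy", "Bob"),
                            age  = c(30, 41, 30, 25, 41),
                            stringsAsFactors = FALSE)

# The original one-liner
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])

# The two-step version
dupes <- duplicated(data.combined$name)
dupe.names <- data.combined$name[dupes]

identical(dup.names, dupe.names)
```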
An easier way to read any code is to break it up into its components.
dup.names <-
  as.character(
    data.combined[which(
      duplicated(
        as.character(
          data.combined$name
        )
      )
    ), "name"]
  )
For each of the functions - those parts with rounded brackets following them e.g. as.character() you can learn more about what they do and how they work by typing ?as.character in the console
Square brackets [] are used to subset data frames, which are stored in your environment (the box to the upper right if you're using R within RStudio contains your values as well as any defined functions). In this case, you can tell that data.combined is the name that has been given to such a data frame in this example (type ?data.frame to find out more about data frames).
"Unwrapping" long lines of code can be daunting at first. Start by breaking it down into parentheses, brackets, and commas. Parentheses directly tacked onto a word indicate a function, and any commas that lie within them (unless they are part of another nested function or bracket) separate arguments, which modify the way the function behaves. We can reduce your 2nd line to an outer function as.character and its arguments:
dup.names <- as.character(argument_1)
Just from this, we know that dup.names will be assigned a value with the data type "character" off of a single argument.
Two functions in the first line, file.path() and dir.create(), contain a comma to denote two arguments. Arguments can either be a single value or specified with an equal sign. In this case, the output of file.path happens to perform as argument #1 of dir.create().
file.path(argument_1,argument_2)
dir.create(argument_1,argument_2)
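The inside-out reading of the first line can be sketched by running the inner function first and storing its result. To avoid creating folders in the working directory, this version puts them under tempdir(), which is an addition for the sake of the example and not part of the original line.

```r
# Inner function first: file.path() just glues path pieces together
target <- file.path(tempdir(), "testdir2", "testdir3")
target

# Outer function second: its first argument is the path built above,
# its second argument is given by name with an equal sign
created <- dir.create(target, recursive = TRUE)

dir.exists(target)
```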
Brackets are a way of subsetting data frames, with the general notation of dataframe_object[row,column]. Within your second line is a dataframe object, data.combined. You know it's a dataframe object because of the brackets directly tacked onto it, and knowing this allows you to infer that any functions inside the brackets are contributing to subsetting this data frame.
data.combined[row, column]
So from there, we can see that the internal functions within this bracket will produce an output that specifies the rows of data.combined that will contribute to the subset, and that only columns with name "name" will be selected.
Use the help function to start to unpack these lines by discovering what each function does, and what its arguments are.

how to view head of as.data.frame in R? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I have a huge data set with 20 columns and 20,000 rows. According to the manual of a program I use, we have to put the data in a data frame, though I'm not sure I understand what it does, and I can't seem to view the head of the data frame I created.
I wrote in Bold the part that I don't understand, I'm very new with R, can a kind mind explain to me how the following works?
First I read the CSV file
vData = read.csv("my_matrix.csv");
1) Here we create the data frame as per the manual, what does -c(1:8) do exactly??
dataExpr0 = as.data.frame(t(vData[, -c(1:8)]))
2) Here, to understand what the above part does, I tried to view only the header of the data frame with the following line, but it displays the first 2 columns for the 20,000 rows of data. Is there a way to view only the first 2 rows?
head(dataExpr0, n = 2)
Let's dissect what your call is doing, from the inside out.
Basic Indexing
When indexing a data.frame or matrix (assuming 2 dimensions), you access a single element of it with the square bracket notation, as you're seeing. For instance, to see the value in the fourth row, fifth column, you'd use vData[4,5]. This can work with ranges of rows and/or columns as well, such as vData[1:4,5] returning the first 4 rows and the 5th column as a vector.
Note: the range 1:4 can also be an arbitrary vector of numbers, such as vData[c(1,2,5),c(4,8)] which returns a 3 by 2 matrix.
BTW: by default, when the resulting slice/submatrix has one of its dimensions reduced to 1 (as in the vData[1:4,5] example above), R will drop it to the lower structure (e.g., matrix -> vector -> scalar). In this case, it will drop vData[1:4,5] to a vector. You can prevent this from happening by adding what appears to be a third dimension to the square brackets: vData[1:4,5,drop=FALSE], meaning "do not drop the simplified dimension". Now, you should get a result with 4 rows and 1 column that keeps its two dimensions.
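A small sketch of the dropping behaviour, using a made-up matrix m rather than the real vData:

```r
m <- matrix(1:20, nrow = 5)          # hypothetical 5 x 4 matrix

dropped <- m[1:4, 3]                 # one column selected: becomes a plain vector
kept    <- m[1:4, 3, drop = FALSE]   # same values, but still two-dimensional

dim(dropped)   # a plain vector has no dimensions
dim(kept)
```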
You can read a much more thorough explanation of how to subset data.frames by reading (for example) some of the "Hadleyverse". If you do that, I highly encourage you to make it an interactive session: play in R as you read, to help cement the methods.
Negative Indexing
Negative indices mean "everything except what is listed". In your example, you are subsetting the data to extract everything except columns 1:8. So your vData[,-c(1:8)] is returning all rows and columns 9 through 20, a 20K by 12 matrix. Not small.
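Negative indexing can be tried on a small stand-in for the real data. The data frame below is made up: 3 rows instead of 20,000, with 20 columns that R names V1 to V20 by default.

```r
# Hypothetical stand-in for the CSV contents
vData <- as.data.frame(matrix(1:60, nrow = 3, ncol = 20))

kept <- vData[, -c(1:8)]   # everything except columns 1 through 8

dim(kept)
names(kept)
```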
Transposition
You probably already know what t() does: transpose the matrix so that it is now 12 by 20K.
A word of warning: if all of your data.frame columns are of the same class (e.g., 'character', 'logical'), then all is fine. However, the fact that data.frames allow disparate types of data in different columns is not a feature shared by matrices. If one data.frame column is different than the others, they will be converted to the highest common format, e.g., logical < integer < numeric < character.
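The conversion described in the warning can be seen on two tiny made-up data frames, one with only numeric columns and one with a character column mixed in:

```r
num_df   <- data.frame(a = 1:2, b = c(3.5, 4.5))                        # numbers only
mixed_df <- data.frame(a = 1:2, label = c("x", "y"),
                       stringsAsFactors = FALSE)                        # numbers and text

num_t   <- t(num_df)     # stays numeric
mixed_t <- t(mixed_df)   # everything becomes character

is.numeric(num_t)
is.character(mixed_t)
```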
Back to a data.frame
After you transpose it (which converts to a matrix), you convert back to a data.frame, which may or may not be necessary depending on how you intend to deal with the data later. For instance, if the row names are not meaningful, then it may not be that useful to convert into a data.frame. That's relatively immaterial, but I'm a fan of not over-converting things. I'm also a fan of using the simpler data structure, and matrices are typically faster than data.frames.
Head
... merely gives you the top n rows of a data.frame or matrix. In your case, since you transposed it, it is now 20K columns wide, which may be a bit unwieldy on the command line.
Alternatives
Based on what I provided earlier, perhaps you just want to look at the top few rows and first few columns? dataExpr0[1:5,1:5] will work, as will (identically) head(dataExpr0[,1:5], n=5).
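Putting the steps together on a made-up stand-in for the real file (50 rows instead of 20,000; the contents are hypothetical), the two alternatives above can be compared directly:

```r
# Hypothetical stand-in for read.csv("my_matrix.csv")
vData <- as.data.frame(matrix(seq_len(50 * 20), nrow = 50, ncol = 20))

# Same call as in the question
dataExpr0 <- as.data.frame(t(vData[, -c(1:8)]))
dim(dataExpr0)   # rows are the 12 kept columns, columns are the original rows

# Two ways to look at just the top-left corner
corner_a <- dataExpr0[1:5, 1:5]
corner_b <- head(dataExpr0[, 1:5], n = 5)
all.equal(corner_a, corner_b)
```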
More Questions?
I strongly encourage you to read more of the Hadleyverse and become a little more familiar with subsetting and basic data management. It is fundamental to using R, and StackOverflow is not always patient enough to answer baseline questions like this. This forum is best suited for those who have already done some research, read documentation and help pages, and tried some code, and only after that cannot figure out why it is not working. You provided some basic code, which is good, but SO is not ideally suited to teach how to start with R.

Cumulative sum for n rows

I have been trying to produce a command in R that allows me to produce a new vector where each row is the sum of 25 rows from a previous vector.
I've tried making a function to do this, this allows me to produce a result for one data point.
I shall put where I have got to; I realise this is probably a fairly basic question but it is one I have been struggling with... any help would be greatly appreciated;
example<-c(1;200)
fun.1<-function(x)
{sum(x[1:25])}
checklist<-sapply(check,FUN=fun.1)
This then supplies me with a vector of length 200 where all values are NA.
Can anybody help at all?
Your example is a bit noisy (e.g., c(1;200) has no meaning, probably you want 1:200 there, or, if you would like to have a list of lists then something like rep, there is no check variable, it should have been example, etc.).
Here's the code that I think you probably need (as far as I was able to understand it):
x <- rep(list(1:200), 5)
f <- function(y) {y[1:20]}
sapply(x, f)
Next time please be more specific, try out the code you post as an example before submitting a question.
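For reference, the NA values in the question have a simple cause if check was meant to be 1:200: sapply hands the function one number at a time, so x[1:25] asks for 25 elements of a length-1 vector, 24 of which are NA, and their sum is NA. Below is a sketch of two readings of "the sum of 25 rows". Which one was intended is a guess, and the names block_sums and rolling_sums are made up here.

```r
example <- 1:200

# Reading 1: sums of consecutive, non-overlapping blocks of 25
# (fill a 25-row matrix column by column, then sum each column)
block_sums <- colSums(matrix(example, nrow = 25))

# Reading 2: rolling sums, one for every window of 25 consecutive values
cs <- cumsum(example)
rolling_sums <- cs[25:200] - c(0, cs[1:175])

length(block_sums)
length(rolling_sums)
```

The block version relies on the length of the vector being a multiple of 25, which holds for 200.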

Resources