Find the index of the last occurrence of a fulfilled criterion in a matrix in R

I have an array (x) in R of size 30x11x10.
x <- array(-2:20, dim = c(30, 11, 10))
Each 'grid' or matrix represents a day of data for a month (30 days represented here). I want to find the index (i,j,k) of the last occurrence of a value less than 2, and ideally I would also like the value itself returned. If this were Matlab, I could just use [i,j,k]=find(x(x<2)), but I don't see an exact equivalent for this in R.
I have looked at 'match' as suggested in other posts here, but it seems to find elements only when they are specified explicitly, not when a criterion (x<2) is given.
I tried this:
xxx <- match(x, x<2, 0), but it returns a long vector of integers that don't appear to show what I am looking for.
Then I tried: xxx <- match(x, x[x<2], 0), which looks a bit more promising, but still isn't what I want (to be honest I'm not sure what the output is indexing).
I think I'm probably asking a foolish question here, because if I want 3 indices and the value returned, then I should be assigning them to something beforehand, right (which I'm not doing)? Can anyone offer any advice?
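One possible base-R approach (sketched here for illustration, not part of the original post) is which() with arr.ind = TRUE, which returns one row of (i, j, k) array indices per matching element; the last row is then the last occurrence in R's column-major order:

x <- array(-2:20, dim = c(30, 11, 10))   # the array from the question
idx <- which(x < 2, arr.ind = TRUE)      # one row of (i, j, k) per element with x < 2
last <- idx[nrow(idx), , drop = FALSE]   # last occurrence in column-major order
last                                     # the (i, j, k) index
x[last]                                  # the value at that index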

Related

Interpretation of basic R code

I have a basic question in regards to the R programming language.
I'm at a beginner's level and I wish to understand the meaning behind two lines of code I found online. Here is the code:
as.data.frame(y[1:(n-k)])
as.data.frame(y[(k+1):n])
... where y, n, and k are given. I do understand that the results are transformed into a data frame by the function as.data.frame(), but what about the rest? I'm still at a beginner's level, so pardon me if this question is off-topic or irrelevant in this forum. Thank you in advance, I appreciate every answer :)
Looks like you understand the as.data.frame() function, so let's look at what is happening inside of it. We're looking at y[1:(n-k)]. Here, y is a vector, which is a collection of data points of the same type. For example:
> y <- c(1,2,3,4,5,6)
Try running that and then calling y again. What you get are those numbers listed out. Now, consider the case where you want to pick out just the number 1 in that vector. How would you do that? Well, this is where the brackets come into play. If you wanted to call just the number 1 in y:
> y[1]
[1] 1
Therefore, the brackets are a way of calling out or indexing specific items in the vector. Note that the indexing starts at the value 1 and goes up to the number of items in the vector, or length. One last thing before we go back to the example you gave. What if we want to index the numbers 1, 2, and 3 from the vector but not the rest?
> y[1:3]
[1] 1 2 3
This is where the colon comes into play. It allows us to reference a subset of the numbers: it references all the numbers from the index to the left of the colon through the index to its right, inclusive. Try this out for yourself in R! Play around and see what happens.
Finally going back to your example:
y[1:(n-k)]
How would this work based on what we discussed? Well, the colon means that we are indexing all values in the vector y between two index positions. What are those positions? They are the numbers to the left and right of the colon. Therefore, we are asking R to give us the values from the first position (index of 1) to the (n-k)th position. That is why it's important to know what n and k are. If n is 4 and k is 1, then the command becomes:
y[1:3]
The same logic applies to the second as.data.frame() command in your question. Essentially, R is picking out two different subsets of the vector y and turning each into a data frame.
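For a concrete illustration of both commands (values made up here, not from the original question):

y <- c(10, 20, 30, 40, 50, 60)
n <- length(y)                 # 6
k <- 2
as.data.frame(y[1:(n - k)])    # first n-k elements: 10 20 30 40
as.data.frame(y[(k + 1):n])    # elements k+1 through n: 30 40 50 60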
Hope this helps. The best way to learn R is to play around with a command, throw different numbers at it, guess what will happen, and then see what happens!

how to view head of as.data.frame in R? [closed]

I have a huge data set with 20 columns and 20,000 rows. According to the manual of a program I use, we have to put the data in a data frame, though I'm not sure I understand what that does, and I can't seem to view the head of the data frame I created.
I wrote out below the parts that I don't understand. I'm very new to R; can a kind mind explain to me how the following works?
First I read the CSV file
vData = read.csv("my_matrix.csv");
1) Here we create the data frame as per the manual. What does -c(1:8) do exactly?
dataExpr0 = as.data.frame(t(vData[, -c(1:8)]))
2) Here, to understand what the above part does, I tried to view only the header of the data frame with the following line, but it displays the first 2 columns for all 20,000 rows of data. Is there a way to view only the first 2 rows?
head(dataExpr0, n = 2)
Let's dissect what your call is doing, from the inside out.
Basic Indexing
When indexing a data.frame or matrix (assuming 2 dimensions), you access a single element of it with the square bracket notation, as you're seeing. For instance, to see the value in the fourth row, fifth column, you'd use vData[4,5]. This can work with ranges of rows and/or columns as well, such as vData[1:4,5] returning the first 4 rows and the 5th column as a vector.
Note: the range 1:4 can also be an arbitrary vector of numbers, such as vData[c(1,2,5),c(4,8)] which returns a 3 by 2 matrix.
BTW: by default, when the resulting slice/submatrix has one of its dimensions reduced to 1 (as in the vData[1:4,5] example), R will drop it to the lower structure (e.g., matrix -> vector -> scalar). In this case, it will drop vData[1:4,5] to a vector. You can prevent this from happening by adding what appears to be a third dimension to the square brackets: vData[1:4,5,drop=FALSE], meaning "do not drop the simplified dimension". Now, you should get a matrix of 4 rows and 1 column in return.
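A quick illustration of that drop behaviour (example matrix made up for this sketch, not the asker's data):

m <- matrix(1:20, nrow = 4)   # a 4 x 5 matrix
m[1:4, 5]                     # simplified (dropped) to a plain vector: 17 18 19 20
m[1:4, 5, drop = FALSE]       # stays a 4 x 1 matrix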
You can read a much more thorough explanation of how to subset data.frames by reading (for example) some of the "Hadleyverse". If you do that, I highly encourage you to make it an interactive session: play in R as you read, to help cement the methods.
Negative Indexing
Negative indices mean "everything except what is listed". In your example, you are subsetting the data to extract everything except columns 1:8. So your vData[,-c(1:8)] is returning all rows and columns 9 through 20, a 20K by 12 matrix. Not small.
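A tiny illustration of negative indexing (made-up data frame, not the asker's file):

df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
df[, -c(1)]      # everything except column 1, i.e. columns b and c
df[, -c(1:2)]    # everything except columns 1 and 2 (only c is left, so it drops to a vector)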
Transposition
You probably already know what t() does: transpose the matrix so that it is now 12 by 20K.
A word of warning: if all of your data.frame columns are of the same class (e.g., 'character', 'logical'), then all is fine. However, the fact that data.frames allow disparate types of data in different columns is not a feature shared by matrices. If one data.frame column is different than the others, they will be converted to the highest common format, e.g., logical < integer < numeric < character.
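A small demonstration of that coercion when transposing a mixed data.frame (illustrative data only):

df <- data.frame(id = 1:2, flag = c(TRUE, FALSE), name = c("a", "b"))
t(df)    # the result is a character matrix: every column is coerced to the highest common type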
Back to a data.frame
After you transpose it (which converts it to a matrix), you convert back to a data.frame, which may or may not be necessary depending on how you intend to deal with the data later. For instance, if the row names are not meaningful, then it may not be that useful to convert into a data.frame. That's relatively immaterial, but I'm a fan of not over-converting things. I'm also a fan of using the simpler data structure, and matrices are typically faster than data.frames.
Head
... merely gives you the top n rows of a data.frame or matrix. In your case, since you transposed it, it is now 20K columns wide, which may be a bit unwieldy on the command line.
Alternatives
Based on what I provided earlier, perhaps you just want to look at the top few rows and first few columns? dataExpr0[1:5,1:5] will work, as will (identically) head(dataExpr0[,1:5], n=5).
More Questions?
I strongly encourage you to read more of the Hadleyverse and become a little more familiar with subsetting and basic data management. It is fundamental to using R, and StackOverflow is not always patient enough to answer baseline questions like this. This forum is best suited for those who have already done some research, read documentation and help pages, and tried some code, and only after that cannot figure out why it is not working. You provided some basic code, which is good, but SO is not ideally suited to teach how to start with R.

Locating minimum on table/output

(RStudio)
Basic question about how to go about finding the minimum of a value on a table.
The table is longer than this (nsplit goes to 1532), which is why I'm looking for a search function.
In the picture, basically I'd like to find the minimum value of "xerror", and after that I'd like to find "nsplit" at the minimum of "xerror".
I'd definitely appreciate any help.
You can use the following code (assuming the name of your data frame is d):
d[which(d$xerror==min(d$xerror)),]
With this code you can find the values of every other variable (including "nsplit") at the minimum value of "xerror". You can also see which observation it is from the row label at the left of the output.
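Equivalently, assuming (as in the answer above) that d is a data frame with xerror and nsplit columns, which.min() gives the row of the minimum directly; note that it returns only the first minimum if there are ties:

best <- d[which.min(d$xerror), ]   # the row where xerror is smallest
best$nsplit                        # nsplit at the minimum of xerror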

Faster R code for fuzzy name matching using agrep() for multiple patterns...?

I'm a bit of an R novice and have been trying to experiment a bit using the agrep function in R. I have a large database of customers (1.5 million rows), of which I'm sure there are many duplicates. Many of the duplicates, though, are not revealed by using table() to get the frequency of repeated exact names. Just eyeballing some of the rows, I have noticed many duplicates that are "unique" because there was a minor mis-key in the spelling of the name.
So far, to find all of the duplicates in my data set, I have been using agrep() to accomplish the fuzzy name matching. I have been playing around with the max.distance argument in agrep() to return different approximate matches, and I think I have found a happy medium between returning false positives and missing out on true matches. As agrep() is limited to matching a single pattern at a time, I found an entry on Stack Overflow that helped me write an sapply call that allows me to match the data set against numerous patterns. Here is the code I am using to loop over numerous patterns as it combs through my data set for "duplicates".
dups4 <- data.frame(unlist(sapply(unique$name, agrep, value = TRUE, max.distance = .154, vf$name)))
unique$name= this is the unique index I developed that has all of the "patterns" I wish to hunt for in my data set.
vf$name= is the column in my data frame that contains all of my customer names.
This code works well on a small sample of 600 or so customers, and the agrep works fine. My problem is when I attempt to use a unique index of 250K+ names and agrep it against my 1.5 million customers. As I type out this question, the code is still running in R and has not yet stopped (we are going on 20 minutes at this point).
Does anyone have any suggestions to speed this up or improve the code that I have used? I have not yet tried anything out of the plyr package. Perhaps this might be faster... I am a little unfamiliar though with using the ddply or llply functions.
Any suggestions would be greatly appreciated.
I'm so sorry, I missed this last request to post a solution. Here is how I solved my agrep, multiple-pattern problem and then sped things up using parallel processing.
What I am essentially doing is taking a whole vector of character strings and then fuzzy matching them against themselves to find out if there are any fuzzy-matched duplicate records in the vector.
Here I create a cluster of twenty workers that I wish to use in the parallel process run by parSapply (makeCluster() and parSapply() come from the parallel package):
library(parallel)
cl <- makeCluster(20)
So let's start with the innermost nesting of the code, parSapply(). This is what allows me to run agrep() in a parallelised process. The first argument is cl, the cluster object I created above.
The 2nd argument is the specific vector of patterns I wish to match against. The third argument is the actual function I wish to use to do the matching (in this case agrep). The subsequent arguments are all passed on to agrep(): I have specified that I want the actual character strings returned (not their positions) using value=T, and I have specified the max.distance I am willing to accept in a fuzzy match, in this case a cost of 2. The last argument is the full vector of strings I wish the patterns (argument 2) to be matched against. As it so happens, I am looking to identify duplicates, hence I match the vector against itself.

The final output is a list, so I use unlist() and then wrap it in data.frame() to basically get a table of matches. From there, I can easily run a frequency table on the table I just created to find out which fuzzy-matched character strings have a frequency greater than 1, ultimately telling me that such a pattern matched itself and at least one other string in the vector.
truedupevf <- data.frame(unlist(parSapply(cl,
                                          s4dupe$fuzzydob, agrep, value = TRUE,
                                          max.distance = 2, s4dupe$fuzzydob)))
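A hedged sketch of the frequency-table step described above, plus releasing the cluster when finished (object names carried over from the answer; the column reference is illustrative):

freqs <- table(truedupevf[[1]])   # how often each fuzzy-matched string occurred
freqs[freqs > 1]                  # strings that matched more than just themselves
stopCluster(cl)                   # shut down the twenty worker processes when done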
I hope this helps.

Searching an ordered "list" matching condition when nothing matches the condition, list length = 1

I have a sorted list with 3 columns. I'm searching to see whether the second column matches 2 or 4, returning the first column's element if so, and passing that into a function.
noOutliers((L1LeanList[order(L1LeanList[,1]),])[(L1LeanList[order(L1LeanList[,1]),2]==2)|
(L1LeanList[order(L1LeanList[,1]),2]==4),1])
When nothing matches the condition, I get a
Error in ((L1LeanList[order(L1LeanList[, 1]), ])[1, ])[(L1LeanList[order(L1LeanList[, :
incorrect number of dimensions
due to the fact that we effectively have List[List[all false]]
I can't just sub out something like L1LLSorted <- L1LeanList[order(L1LeanList[,1]),]
and use L1LLSorted[,2] since this returns an error when the list is of length exactly 1
so now my code would need to look like
noOutliers(ifelse(any((L1LeanList[order(L1LeanList[,1]),2]==2)|
(L1LeanList[order(L1LeanList[,1]),2]==4)),0,
(L1LeanList[order(L1LeanList[,1]),])[(L1LeanList[order(L1LeanList[,1]),2]==2)|
(L1LeanList[order(L1LeanList[,1]),2]==4),1])))
which seems a bit ridiculous for the simple thing I'm requesting.
While writing this I realized that I can put all this error checking into the noOutliers function itself, so the call looks like
noOutliers(L1LeanList, 2, 2, 4), which will look much better, a necessity since slightly varying versions of this appear in my code dozens of times. I can't help but wonder, still, if there's a more elegant way to write the actual function.
For the curious, noOutliers finds the mean of the 30th-70th percentile range of the sorted data set, like so:
noOutliers <- function(oList)
{
  if (length(oList) <= 20) return("insufficient data")
  cumSum <- 0
  iterCount <- 0
  # The +/- .000001 adjustments deal with R's round-half-to-even behaviour, and the
  # trailing +1 shifts the whole range for 1-based indexing (e.g. for a list of 10,
  # taking positions 3-7 would cut off 1, 2, 8, 9, 10, which is imbalanced).
  for (i in round(length(oList)*3/10 - .000001):round(length(oList)*7/10 + .000001) + 1)
  {
    cumSum <- cumSum + oList[i]
    iterCount <- iterCount + 1
  }
  return(cumSum/iterCount)
}
Let's see...
foo <- bar[(bar[,2]==2 | bar[,2]==4),1]
should extract all the first-column values you want. Then run whatever function you want on foo, perhaps with the caveat "if (length(foo) < 1) then {exit, or skip, or something}".
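Putting that together with the noOutliers() call from the question, a minimal sketch of the guard (bar stands in for the already-sorted matrix, and the NA fallback is just one choice of sentinel):

foo <- bar[bar[, 2] == 2 | bar[, 2] == 4, 1]
if (length(foo) < 1) {
  result <- NA            # nothing matched: skip, or substitute a sentinel value
} else {
  result <- noOutliers(foo)
}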
