I'm writing a QC program in R to handle data from an instrument that reports its own error codes. The codes are reported as bit values, so
0
means "all OK", while:-
1, 2, 4, 8, 16, 32, 64, 128
Each represent a unique error. Multiple errors can occur simultaneously, in which case the codes are summed to give a new number, e.g:-
error "2" + error "32" = code "34"
And because these sums are each unique, any given code value can be broken down into its constituent errors. I'm looking for a way to program the identification of errors from these codes. I'm struggling with an approach, but everything I can think of involves either look-up-tables or a big stack of loops... neither of which seems very elegant.
Rather than re-invent the wheel, I'm wondering if there's an R function that already exists to do this.
Has anyone come across this sort of problem before?
You could convert the number to bits, and use that representation to find the errors.
2^(which(intToBits(34)==1)-1)
returns
2 32
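If you also want to map those values back to human-readable error descriptions, a small wrapper along these lines works (the error labels here are hypothetical placeholders; substitute the instrument's real descriptions):
decode_errors <- function(code, labels = paste0("error_", 2^(0:7))) {
  # bits 1..8 correspond to the codes 1, 2, 4, ..., 128
  bits <- which(intToBits(code)[1:8] == 1)
  labels[bits]
}
decode_errors(34)   # "error_2"  "error_32"
decode_errors(0)    # character(0), i.e. all OK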
Hope this helps!
I just want to learn how to make a simple statistical summary of the random numbers from rows 1 to 5 in R (as shown in the picture),
and then assign these rows to a single variable.
Hope you can help!
When you type something like 3 on a single line and ask R to "run" it, it doesn't store that anywhere -- it just evaluates it, meaning that it tries to make sense out of whatever you've typed (such as 3, or 2+1, or sqrt(9), all of which would return the same value) and then it more or less evaporates. You can think of your lines 1 through 5 as behaving like you've used a handheld scientific calculator; once you type something like 300 / 100 into such a calculator, it just shows you a 3, and then after you have executed another computation, that 3 is more or less permanently gone.
To do something with your data, you need to do one of two things: either store it in your environment somehow, or "pipe" your data directly into a useful function.
In your question, you used this script:
1
3
2
7
6
summary()
I don't think it's possible to repair this strategy in the way that you're hoping -- and if it is possible, it's not quite the "right" approach. By typing the numbers on individual lines, you've structured them so that they'll evaluate individually and then evaporate. In order to run the summary() function on those numbers, you will need to bind them together inside a single vector somehow, then feed that vector into summary(). The "store it" approach would be
my_vector <- c(1, 3, 7, 2, 6)
summary(my_vector)
The important part isn't actually the parentheses; it's the function c(), which combines ("concatenates") its arguments and instructs R to treat those 5 numbers as a single collective object (i.e. a vector). We then pass that single object into summary().
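For reference, running summary(my_vector) on those five numbers should print something like this (using R's default quartile method):
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     2.0     3.0     3.8     6.0     7.0 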
To use the "piping" approach and avoid having to store something in the environment, you can do this instead (requires R 4.1.0+):
c(1, 3, 7, 2, 6) |> summary()
Note again that the use of c() is required, because we need to bind the five numbers together first. If you have an older version of R, the magrittr package provides a slightly different pipe operator (%>%) that works the same way here. The point is that this "binding" step is essential and can't be skipped.
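For example, the magrittr version of the same one-liner (assuming magrittr is installed) would be:
library(magrittr)
c(1, 3, 7, 2, 6) %>% summary()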
Now, the crux of your question: presumably, your data doesn't really look like the example you used. Most likely, it's in some separate .csv file or something like that; if not, hopefully it is easy to get it into that format. Assuming this is true, this means that R will actually be able to do the heavy lifting for you in terms of formatting your data.
As a very simple example, let's say I have a plain text file, my_example.txt, whose contents are
1
3
7
2
6
In this case, I can ask R to parse this file for me. Assuming you're using RStudio, the simplest way to do this is to use the File -> Import Dataset part of the GUI. There are various options dealing with things such as headers, separators, and so forth, but I can't say much meaningful about what you'd need to do there without seeing your actual dataset.
When I import that file, I notice that it does two things in my R console:
my_example <- read.table(...)
View(my_example)
The first line stores an object (called a "data frame" in this case) in my environment; the second shows a nice view of how it's rendered. To get the summary I wanted, I just need to extract the vector of numbers I want, which I see from the view is called V1, which I can do with summary(my_example$V1).
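If you prefer to skip the GUI, the same thing can be done directly in code; a minimal version for the little file above (assuming it really is just one unnamed column of numbers) would be:
my_example <- read.table("my_example.txt", header = FALSE)   # one column, named V1 by default
summary(my_example$V1)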
This example is probably not helpful for your actual data set, because there are so many variations on the theme here, but the theme itself is important: point R at a file, ask it to render an object, then work with that object. That's the approach I'd recommend instead of typing data as lines within an R script, as it's much faster and less error-prone.
Hopefully this will get you pointed in the right direction in terms of getting your data into R and working with it.
I really tried my best searching through stackoverflow for a solution but unfortunatelly I couldn't find a suitable question. Therefore, I have to raise a question on my own.
I'm working with a data set containing sessionIDs and topics. I wanted to find out how many items of specific topics have been purchased together. Thankfully, a Stack Overflow member had a great idea: using a combination of the table() function and the crossprod() function.
topicPairs <- crossprod(table(as.data.frame(transactions)))
You can look this up here: How can I count, how many Items have been in one session together?
For the topics (or genres) this approach worked really well and the final matrix was really small in terms of storage usage.
However, now I want to find out, how many artists have been purchased together in different sessions. Therefore, I just replace the genres (I have 360 of them) with the artists (here, I have 35727) and apply this 'table-crossprod-combination'. Unfortunately, R throws the following error message:
attempt to make a table with >= 2^31 elements
I also understand what happened: the table function generates one entry per session and genre. Since I only have 360 different genres, this is no problem, because the number of sessions multiplied by the number of genres is less than 2^31. On the other hand, I have 35727 different artists. If I multiply this number by the number of sessions, I exceed 2^31 elements.
This is really a shame, since the solution is so smart and easy and it worked really well. Therefore, I want to ask if there is a way to circumvent this problem. Sure, my dataset is quite big ... but there are people using much bigger data sets.
Perhaps I have to split the data set up into smaller subsets and merge them together in a final step. But this is not that easy, since some artists appear e.g. in subset 1 but not in subset 2. Therefore, I cannot simply add the matrices element-wise.
It would be awesome if you could provide a solution for this problem, since it drives me crazy being this close to the perfect solution.
Thank you very much in advance!
When your results matrix is likely to be sparse, in that there is a high percentage of zeros, it is worth using sparse matrices to save space, if possible.
So for your data:
sessionID <- c(1, 2, 2, 3, 4, 4, 5, 6, 6, 6)
topic <- c("rock", "house", "country", "rock", "r'n'b", "pop", "classic", "house", "rock", "country")
transactions <- cbind(sessionID, topic)
You can use xtabs to return a sparse matrix (instead of the dense matrix returned by table), and then use the Matrix package to take the crossproduct, which retains the sparsity.
tab <- xtabs(~ sessionID + topic, data = as.data.frame(transactions), sparse = TRUE)
Matrix::crossprod(tab)
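The same pattern should carry over to the artist case, because the sparse cross-product only stores the non-zero co-purchase counts. A sketch, assuming your real transactions data has columns named sessionID and artist:
tab_artists <- xtabs(~ sessionID + artist, data = as.data.frame(transactions), sparse = TRUE)
artistPairs <- Matrix::crossprod(tab_artists)   # 35727 x 35727, but stored sparsely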
I'm trying to compute (MACD - signal) / signal on the prices of the Russell 1000 (an index of 1,000 US large-cap stocks). I keep getting this error message and simply can't figure out why:
Error in EMA(c(49.85, 48.98, 48.6, 49.15, 48.85, 50.1, 50.85, 51.63, 53.5, :n = 360 is outside valid range: [1, 198]
I'm still relatively new to R, although I'm proficient in Python. I suppose I could have used "try" to just work around this error, but I do want to understand at least what the cause of it is.
Without further ado, this is the code:
N <- 1000
DF_t <- data.frame(ticker = rep("", N), macd = rep(NA, N), stringsAsFactors = FALSE)
stock <- test[['Ticker']]
i <- 0
for (val in stock) {
  dfpx <- bdh(c(val), c("px_last"), start.date = as.Date("2018-1-01"), end.date = as.Date("2019-12-30"))
  macd <- MACD(dfpx[, "px_last"], 60, 360, 45, maType = "EMA")
  num <- dim(macd)[1]
  ma <- (macd[num, ][1] - macd[num, ][2]) / macd[num, ][2]
  i <- i + 1
  DF_t[i, ] <- list(val, ma)
}
For your information, bdh() is a Bloomberg command to fetch historical data, dfpx is a data frame, and MACD() is a function that takes a time series of prices and outputs a matrix, where the first column contains the MACD values and the second column contains the signal values.
Thank you very much! Any advice would be really appreciated. Btw, the code works with a small sample of a few stocks but it will cause the error message when I try to apply it to the universe of one thousand stocks. In addition, the number of data points is about 500, which should be large enough for my setup of the parameters to compute MACD.
Q : "...and error handling"
If I may add a grain of salt onto this, the error-prevention is way better than any ex-post error-handling.
For this, there is a cheap, constant O(1) in both [TIME]- and [SPACE]-Domains step, that principally prevents any such error-related crashes :
Just prepend to the instantiated the vector of TimeSERIES data with that many constant and process-invariant value cells, that make it to the maximum depth of any vector-processing, and any such error or exception is principally avoided :
processing-invariant value, in most cases, is the first know value, to be repeated that many times, as needed back, in the direction of time towards older ( not present ) bars ( yes, not relying on NaN-s and how NaN-s might get us in troubles in methods, that are sensitive to missing data, which was described above. Q.E.D. )
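A minimal sketch of that padding idea (pad_to_depth is a hypothetical helper; 360 + 45 covers the slow EMA window plus the signal window from the question, so at least one valid signal value exists; whether constant padding is acceptable for the analysis is a separate judgement call, since it flattens the early part of the moving averages):
pad_to_depth <- function(x, depth) {
  # repeat the first known value backwards until the series has at least 'depth' points
  if (length(x) >= depth) return(x)
  c(rep(x[1], depth - length(x)), x)
}
px   <- pad_to_depth(dfpx[, "px_last"], 360 + 45)
macd <- MACD(px, 60, 360, 45, maType = "EMA")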
For those who are interested, I have found the cause of the error: some stocks have missing prices. It's as simple as that. For instance, Dow US Equity has only about 180 daily prices (for whatever reason) over the past one and a half years, which definitely isn't enough to compute a moving average over 360 days.
I basically ran small samples until I eventually pinpointed what caused the error message. Generally speaking, unless you are trying to extract data for more than about 6,000 stocks while querying, say, 50 fields, you are okay. A rule of thumb for Bloomberg's daily usage limit is said to be around 500,000 for a school console. A PhD colleague working at a trading firm also told me that professional Bloomberg consoles are more forgiving.
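If you run into the same thing, a simple guard inside the loop from the question (a sketch; 360 is the slow EMA window used above) skips tickers whose history is too short instead of letting MACD() throw:
if (nrow(dfpx) < 360) {
  i <- i + 1
  DF_t[i, ] <- list(val, NA)   # keep the ticker, but record no MACD value
  next
}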
I am trying to read in a csv file (exported from SurveyMonkey).
I have tried survey <- read.csv("Survey Item Evaluation2.csv", header=TRUE, stringsAsFactors = FALSE)
I ran skim(survey), which shows it is reading in as characters.
str(survey) output: 'data.frame': 623 obs. of 68 variables. G1 (which is a survey item) reads in as chr "1" "3" "4" "1" ...
How do I change those survey item variables to numeric?
The correct answer to your question is given in the first two comments by two very well respected people with a combined reputation of over 600k. I'll post their very similar answer here:
as.numeric(survey$G1)
However, that is not very good advice in my opinion. Your question should really have been:
"Why am I getting character data when I'm sure this variable should be numeric?"
To which the answer would be: "Either you're not reading the data correctly (does the data actually start at row 3?), or there is non-numeric (garbage) data mixed in with the numeric data (for example, NA entered as . or some other character), or certain people entered a , instead of a . for the decimal point (such as nationals of Indonesia and some European countries), or they used a thin space as a thousands separator instead of a comma, or there is some other unknown cause that needs further investigation. Maybe a certain group of people entered text instead of numbers for their age (fifty instead of 50), or they put a . at the end of the number, for example 62.5. instead of 62.5 for their age (older folks were taught to always end a sentence with a period!). In these last two cases, a certain group (the elderly) will end up with missing data, and your data is then missing not at random (MNAR), a big source of bias in your analysis."
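One practical way to act on that advice is to first look at exactly which values refuse to convert (a sketch, using the G1 column from the question; the example offenders in the comment are hypothetical):
bad <- survey$G1[is.na(suppressWarnings(as.numeric(survey$G1))) & !is.na(survey$G1)]
unique(bad)   # shows the offending entries, e.g. ".", "fifty" or "62.5."
# only once you understand (and clean) these should you convert:
survey$G1 <- as.numeric(survey$G1)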
I see this all too often, and I worry that new users of R are making terrible mistakes due to being given poor advice, or because they didn't learn the basics. Importing data is the first step of analysis. It can be difficult because data files come in all shapes and sizes; there is no global standard. Data is also often entered without any quality-control mechanisms. I'm glad that you added the stringsAsFactors = FALSE argument in your command to import the data; someone gave you good advice there. But that person forgot to advise you not to trust your data, especially if it was given to you by someone else to analyse. Always check every variable carefully before the analysis. This can take time, but it is worth the investment.
Hope that helps at least someone out there.
I have about 42,000 lists of 24 random numbers, all in the range [0, 255]. For example, the first list might be [32, 15, 26, 27, ... 11]. The second list might be [44, 44, 18, 19, ... 113]. How can I choose one number from each of the lists (so I will end up with a new list of about 42,000 numbers) such that this new list is most compressible using ZIP?
The ZIP file format uses DEFLATE as its compression algorithm, so you need to consider how that algorithm works and pick data such that the algorithm finds it easy to compress. According to the Wikipedia article, there are two stages of compression. The first uses LZ77 to find repeated sections of data and replace them with short references. The second uses Huffman coding to take the remaining data and strip out redundancy across the whole block. This is called entropy coding: if the information isn't very random (has low entropy), the coder replaces common symbols with short codes, so the compressed output carries more information per bit.
In general, then, lists with lots of repeated runs (e.g., [111,2,44,93,111,2,44,93,...]) will compress well in the first pass. Lists with frequently repeated numbers mixed into otherwise random data (e.g., [111,34,43,50,111,34,111,111,2,34,22,60,111,98,2], where 34 and 111 show up often) will compress well in the second pass.
To find suitable numbers, I think the easiest thing to do is just sort each list, then merge them, keeping the merge sorted, until you get to 42,000 output numbers. You'll get runs as they happen. This won't be optimal; for example, 255 might appear in every input list and this technique could miss that, but it would be easy.
Another approach would be to histogram the numbers into 256 bins. Any bins that stand out indicate numbers that should be grouped. After that, I guess you have to search for sequences. Again, sorting the inputs will probably make this easier.
I just noticed you had the constraint that you have to pick one number from each list. So in both cases you could sort each list then remove duplicates.
Additionally, Huffman codes can be generated using a tree, so I wonder if there's some magic tree structure you could put the numbers into that would automatically give the right answer.
This smells NP-complete to me, but I am nowhere near able to prove it. On the face of it, there are approximately 7.45e+57968 (!) possible configurations to test, and it doesn't seem that you can rule out a particular configuration early, since a configuration with an incompressible initial section could still be highly compressible later on.
My best guess for "good" compression would be to count the number of occurrences of each number across the entire million-element set and select from each list the number with the most occurrences overall. For example, if every list has 42 present in it, selecting that from each one would give you a very compressible array of 42,000 instances of the same value.