R matrix online structure taking too long

This question follows up on a previous one I asked this week.
In general, my problem is as follows:
I have a data stream of records entering R via a socket, and I want to run some analyses on it.
They come sequentially like this:
individual 1 | 1 | 2 | timestamp 1
individual 2 | 4 | 10 | timestamp 2
individual 1 | 2 | 4 | timestamp 3
I need to create a structure to maintain those records. The main idea is discussed in the previous question, but in general I've created a structure that looks like this:
*var1* *var2* *timestamp*
- individual 1 | [1,2,3] | [2,4,6] | [timestamp1, timestamp3...]
- individual 2 | [4,7,8] | [10,11,12] | [timestamp2, ...]
IMPORTANT: this structure is created and enlarged at runtime. I think this is not the best choice, as it takes too long to build. The main structure is a matrix, and inside each individual-variable pair I have a list of records.
The individuals are numerous and vary a lot over time, so without going through a number of records I don't have enough information to do a good analysis. I am thinking about creating some kind of cache at runtime in R by saving the records of individuals to disk.
My full database is approximately 100 GB. I want to analyse it mainly by seasonal blocks within each individual (dependent on the timestamp variable).
Building the structure takes longer and longer as the number of records I collect grows.
The idea of using a matrix with lists inside each individual-variable pair was adapted from a three-dimensional matrix, because I don't have observations at the same timestamps. I don't know if it was a good idea.
If anyone has any idea on this matter I would appreciate it.
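One possible direction, as a minimal sketch rather than a tested solution: keep one environment per individual and let each hold preallocated vectors that double in capacity when full, instead of extending a list by one element per record. The names `store`, `add_record` and `get_records`, the initial capacity of 16, and the use of plain numbers as timestamps are all assumptions made for illustration. Whether this is fast enough at the 100 GB scale is not something the sketch demonstrates.

```r
# Hypothetical store: one environment entry per individual.
store <- new.env(hash = TRUE)

add_record <- function(store, id, var1, var2, timestamp) {
  rec <- store[[id]]
  if (is.null(rec)) {
    # First record for this individual: create preallocated buffers.
    rec <- new.env()
    rec$n <- 0L
    rec$var1 <- numeric(16)
    rec$var2 <- numeric(16)
    rec$timestamp <- numeric(16)
    store[[id]] <- rec
  }
  if (rec$n == length(rec$var1)) {
    # Buffers are full: double the capacity in one step.
    pad <- numeric(length(rec$var1))
    rec$var1 <- c(rec$var1, pad)
    rec$var2 <- c(rec$var2, pad)
    rec$timestamp <- c(rec$timestamp, pad)
  }
  rec$n <- rec$n + 1L
  rec$var1[rec$n] <- var1
  rec$var2[rec$n] <- var2
  rec$timestamp[rec$n] <- timestamp
  invisible(store)
}

# Return only the filled part of the buffers as a data frame.
get_records <- function(store, id) {
  rec <- store[[id]]
  idx <- seq_len(rec$n)
  data.frame(var1 = rec$var1[idx],
             var2 = rec$var2[idx],
             timestamp = rec$timestamp[idx])
}

# The three example records from the question, with timestamps 1, 2, 3.
add_record(store, "individual 1", 1, 2, 1)
add_record(store, "individual 2", 4, 10, 2)
add_record(store, "individual 1", 2, 4, 3)
get_records(store, "individual 1")
```

Because each individual has its own environment, an individual's records could also be written to disk and dropped from `store` independently, which is in the spirit of the cache idea mentioned above.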

Related

Impossible calculated field

I have been trying for the past several hours to write a calculated field in Google Data Studio.
I need to know how to get a percentage calculation on some events. Table below:
|Event Label|Event Action |Total Events|
|-----------|----------------|------------|
|CTA 1 |Link Displayed | 100 |
|CTA 1 |Link Clicked | 20 |
I want to get the conversion, which means dividing 20 by 100, but I can't seem to write a calculated field which does that. I feel like I've tried everything, e.g.:
sum((total events(link clicked)) / total events(link displayed)))
And the like. Please help!
Thanks
This function is not available yet in Data Studio (note that Data Studio is brand new). For this you have to use the API, which I would strongly recommend!
Here you could use R and the sqldf package, which would give you the same data as Data Studio using very simple SQL queries. A similar package is available in Python.
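As a rough illustration of the calculation once the data is in R, here is a sketch in base R (not sqldf, so that it runs without extra packages). It assumes the table from the question has already been exported into a data frame; the column names `event_label`, `event_action` and `total_events` and the helper `sum_by_label` are made up for this example.

```r
# The table from the question, entered by hand.
events <- data.frame(
  event_label  = c("CTA 1", "CTA 1"),
  event_action = c("Link Displayed", "Link Clicked"),
  total_events = c(100, 20),
  stringsAsFactors = FALSE
)

# Sum total_events per label for one event action.
sum_by_label <- function(action) {
  rows <- events$event_action == action
  tapply(events$total_events[rows], events$event_label[rows], sum)
}

displayed <- sum_by_label("Link Displayed")
clicked   <- sum_by_label("Link Clicked")

# Conversion per label: clicks divided by displays.
conversion <- clicked[names(displayed)] / displayed
conversion
```

For the single label in the question this gives 20 / 100 = 0.2, i.e. a 20% conversion.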

How to count occurrence of value and percentage of a subset in tableau public?

I have a set of data in the following format:
Resp | Q1 | Q2
P1 | 4 | 5
P2 | 1 | 2
P3 | 4 | 3
P4 | 6 | 4
I'd like to show the count and % of people who gave an answer greater than 3. So in this case, the output would be:
Question | Count | Percent
Q1 | 3 | 75%
Q2 | 2 | 50%
Any suggestions?
Although it sounds like a fairly easy thing, it is a bit more complicated.
Firstly your data is not row based so you will have to pivot it.
Load your data into Tableau
In the Data Source screen choose columns Q1 and Q2, right click on them and choose "Pivot"
Name the column with the answers "Answer" (just for clarity).
You should get a table that looks like this:
Now you need to create a calculated field (I called it Overthreshold) to check for your condition:
if [Answer] > 3 then
[Answer]
End
At this point you could substitute the 3 with a parameter in case you want to easily change that condition.
You can already drop the pills as follows to get the count:
Now if you want the percentage it gets a bit more complicated, since you have to determine the count of the questions and the count of the answers > 3 which is information that is stored in two different columns.
Create another Calculated field with this calculation COUNT([Overthreshold]) / AVG({fixed [Question]:count([Answer])})
drop the created pill onto the "text" field or into the columns drawer and see the percentage values
right click on the field and choose Default Properties / Number Format to have it as percentage rather than a float
To explain what the formula does:
It takes the count of the answers that are over the threshold and divides it by the count of answers for each question. This is done by the fixed part of the formula, which counts the rows that have the same value in the Question column. The AVG is only there because Tableau needs an aggregation there. Since the value will be the same for every record of the question, you could also use MIN or MAX.
It feels like there should be an easier solution but right now I cannot think of one.
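As a cross-check outside Tableau, the same logic (pivot to one row per answer, then count of answers over the threshold divided by count of answers per question) can be sketched in base R. This is only an illustration on the four rows from the question; the data frame names `responses`, `long` and `result` are made up here.

```r
# The sample data from the question.
responses <- data.frame(
  Resp = c("P1", "P2", "P3", "P4"),
  Q1 = c(4, 1, 4, 6),
  Q2 = c(5, 2, 3, 4)
)

# Pivot: one row per (Question, Answer) pair.
long <- data.frame(
  Question = rep(c("Q1", "Q2"), each = nrow(responses)),
  Answer = c(responses$Q1, responses$Q2)
)

# TRUE where the answer is over the threshold of 3.
over <- long$Answer > 3

count <- tapply(over, long$Question, sum)     # answers over the threshold
total <- tapply(over, long$Question, length)  # all answers per question

result <- data.frame(
  Question = names(count),
  Count = as.vector(count),
  Percent = 100 * as.vector(count) / as.vector(total)
)
result
```

This reproduces the expected output from the question: Q1 has 3 answers over the threshold (75%) and Q2 has 2 (50%).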
Here is a variation on @Alexander's correct answer. Some folks might find it slightly simpler, and it at least shows some of the Tableau features for calculating percentages.
Starting as in Alexander's answer, revise Overthreshold into a boolean-valued field, defined as Answer > 3
Instead of creating a second calculated field for the percentage, drag Question, Overthreshold and SUM(Number Of Records) onto the viz as shown below.
Right click on SUM(Number of Records) and choose Quick Table Calculation->Percentage of Total
Double click Number of Records in the data pane on the left to add it to the sheet, which is a shortcut for bringing out the Measure Names and Measure Values meta-fields. Move Measure Names from Rows to Columns to get the view below, which also uses aliases on Measure Names to shorten the column titles.
If you don't want to show the below threshold data, simply right click on the column header False and choose Hide. (You can unhide it if needed by right clicking on the Overthreshold field)
Finally, to pretty it up a bit, you can move Overthreshold to the detail shelf (you can't remove it from the view though), and adjust the number formatting for the fields being displayed to get your result.
Technically, Alexander's solution uses LOD calculations to compute the percentages on the server side, while this solution uses Table calculations to compute the percentage on the client side. Both are useful, and can have different performance impacts. This just barely nicks the surface of what you can do with each approach; each has power and complexity that you need to start to understand to use in more complex situations.

Finding partitioning solution in R with pre-written Simulated annealing package?

I have a data set, which consists of a number of elements -- divided into two distinct categories (with an equal number of elements for each category) -- and with two continuous variables describing them, like so:
ID | Category | Variable_1 | Variable_2
--------------------------------------------
1 | Triangle | 4.3522 | 5.2321
2 | Triangle | 3.6423 | 6.3223
3 | Circle | 5.2331 | 3.2452
4 | Circle | 2.6334 | 7.3443
... | ... | ... | ...
Now, what I want to do is to create a list of one-to-one pairings so that every element of category Triangle is paired to an element of category Circle, and so that the combined distance between the points within each pairing in a 2D space defined by Variable_1 and Variable_2 is as small as possible. In other words, if I had to travel from each Triangle element to a Circle element (but never to the same Circle element twice), I want to find out how to minimize the total traveling distance (see illustration below).
Since I'm not really in the mood of trying to brute force this problem, I've been thinking that Simulated annealing probably would be a suitable optimisation method to use. I'd also like to work in R.
The good news is that I've found a couple of options for doing simulated annealing within R, for example the GenSA package and the optim function. The bad news is that I don't really know how to use them for my specific input needs. That is, as input I would like to specify a list of numbers denoting elements of one category and the order in which they should be paired with the elements of the other category. However, this would mean that my simulated annealing algorithm should only use integers and never let the same integer appear twice, something that seems to go against how the options above are implemented.
Is there some way that I could make effective use of some pre-written Simulated annealing package for R, or do I need to write my own methods for this problem?
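One way this could work with existing tools, as a sketch under assumptions rather than a recommended solution: with method "SANN", `optim` treats its `gr` argument as a function that generates a new candidate point, so a generator that swaps two positions keeps every candidate a valid permutation of integers. The coordinates below are randomly generated stand-ins (the full data set is not shown in the question), and `total_distance`, `swap_two` and the control settings are illustrative choices. The problem as stated is also an instance of the linear assignment problem, for which exact methods exist; the sketch stays with simulated annealing because that is what the question asks about.

```r
set.seed(42)  # reproducible hypothetical data and annealing run

n <- 6
# Hypothetical coordinates standing in for Variable_1 and Variable_2.
triangles <- matrix(runif(2 * n, 2, 8), ncol = 2)
circles   <- matrix(runif(2 * n, 2, 8), ncol = 2)

# perm[i] is the Circle row paired with Triangle row i.
total_distance <- function(perm) {
  d <- triangles - circles[perm, , drop = FALSE]
  sum(sqrt(rowSums(d^2)))
}

# Candidate generator: swap two positions, so the result
# is still a permutation with no repeated integers.
swap_two <- function(perm) {
  idx <- sample(length(perm), 2)
  perm[idx] <- perm[rev(idx)]
  perm
}

start <- seq_len(n)
res <- optim(start, total_distance, swap_two, method = "SANN",
             control = list(maxit = 5000, temp = 10))

# Resulting pairing and its total distance.
pairing <- data.frame(Triangle = seq_len(n), Circle = res$par)
res$value
```

Simulated annealing gives no guarantee of reaching the global minimum, so the result is best read as a good pairing rather than a provably optimal one.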

How do I print every row in its own separate .pdf file in r?

I have tabulated data. I have to write some code to dynamically generate some .pdf reports. Once I know how to make R read and publish only 1 row at a time, I will be using Sweave to format it and make it look nice.
For example, if my data set looks like this:
Name | Sport | Country
Ronaldo | Football | Portugal
Federer | Tennis | Switzerland
Woods | Golf | USA
My output would be composed of three .pdf files. The first one would say "Ronaldo plays football for Portugal"; and so on for the other two rows.
I have started with a for-loop, but every forum I have trawled through talks about the advantages of the apply family of functions over it, and I don't know how to apply one to every row of the data.
PS: This is my first post on stackoverflow.com. Excuse me if I am not following the community rules here. I will try my best to ensure that the question conforms to the guidelines based on your feedback.
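A minimal sketch of the row-by-row part, assuming the table is already in a data frame: iterate over the row indices with `vapply` and open one PDF device per row. The plain `pdf()` device is used here only as a placeholder for the Sweave step, and the names `athletes`, `write_row_pdf` and the use of a temporary output directory are assumptions made for illustration.

```r
# The example table from the question.
athletes <- data.frame(
  Name = c("Ronaldo", "Federer", "Woods"),
  Sport = c("Football", "Tennis", "Golf"),
  Country = c("Portugal", "Switzerland", "USA"),
  stringsAsFactors = FALSE
)

out_dir <- tempdir()  # hypothetical output location

# Write one single-page PDF for row i and return its path.
write_row_pdf <- function(i) {
  row <- athletes[i, ]
  sentence <- sprintf("%s plays %s for %s",
                      row$Name, tolower(row$Sport), row$Country)
  path <- file.path(out_dir, paste0(row$Name, ".pdf"))
  pdf(path)
  plot.new()
  text(0.5, 0.5, sentence)  # centre the sentence on the page
  dev.off()
  path
}

# One call per row; returns the three file paths.
paths <- vapply(seq_len(nrow(athletes)), write_row_pdf, character(1))
paths
```

Iterating over `seq_len(nrow(athletes))` rather than over the data frame itself keeps each row's columns together, which is usually what is wanted when one output file per row is needed.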

Creating New Variables in R that relate to

I have 7 different variables in an Excel spreadsheet that I have imported into R. Each is a column with 3331 entries. They are:
'Tribe' - there are 8 of them
'Month' - when the sampling was carried out
'Year' - the year when the sampling was carried out
'ID' - an identifier for each snail
'Weight' - weight of a snail in grams
'Length' - length of a snail shell in millimetres
'Width' - width of a snail shell in millimetres
This is a case where 8 different tribes have been asked to record data on a suspected endangered species of snail to see if they are getting rarer, or changing in size or weight.
This happened at different frequencies between 1993 and 1998.
I would like to know how to add new variables to the data so that if I entered names(Snails), it would list the 7 given variables plus any variables that I have added.
The dataset is limited as it stands, so I would like to add new variables, such as the count of snails in any given month.
This would rely on Tribe, Month, Year and ID: if the IDs (snail identifiers) were listed for any given month, I would be able to count them to see if there are any changes in counts. I have tried:
count=c(Tribe,Year,Month,ID)
count
But after doing things like that, R just gives me one long vector that is 4 times the size of the dataset. I would like to be able to create a new variable with the same column size, n=3331.
Or maybe I would like to create a simpler variable so I can see whether a tribe collected in any given month. I don't know how I can do this.
I have looked at other forums and searched, but there is nothing that I can see that helps in my case. I appreciate any help. Thanks
I'm guessing you need to organise your variables in a single structure, such as a data.frame.
See ?data.frame for the help file.
To get you started, you could do something like:
snails <- data.frame(Tribe,Year,Month,ID)
snails
# or for just the first few rows
head(snails)
Then this would have your data looking similar to your Excel file like:
Tribe Year Month ID
1 1 1 1 a
2 2 2 2 b
3 3 3 3 c
<<etc>>
Then if you do names(snails) it will list out your column names.
You could possibly avoid some of this mucking about by just importing your Excel file either directly from Excel, or saving as a csv (comma separated values) file first and then using read.csv("name_of_your_file.csv")
See http://www.statmethods.net/input/importingdata.html for some more specifics on this.
To tabulate your data, you can do things like...
table(snails$Tribe)
...to see the number of snail records collected by each tribe. Or...
table(snails$Tribe,snails$Year)
...to see the trends in each tribe by each year. The $ character will let you access the named variable (column) inside a data.frame in the same way you are currently using the free floating variables. This might seem like more work initially, but it will pay off greatly when you need to do some more involved analysis.
Take for example if you want to only analyse the weights from tribe "1", you could do:
snails$Weight[snails$Tribe==1]
# mean of these weights
mean(snails$Weight[snails$Tribe==1])
There are a lot more things I could explain but you would probably be better served by reading an excellent website like Quick-R here: http://www.statmethods.net/management/index.html to get you doing some more advanced analysis and plotting.
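To come back to the original request for a new column of the same length as the data, here is one possible sketch using `ave`, which computes a per-group value and returns it aligned with the original rows. The small data frame below is made-up illustration data (not the real snail records), and the column name `Count` is an arbitrary choice.

```r
# Made-up stand-in for the snail data.
snails <- data.frame(
  Tribe = c(1, 1, 1, 2, 2),
  Year  = c(1993, 1993, 1994, 1993, 1993),
  Month = c(5, 5, 5, 6, 7),
  ID    = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# New column: number of snail records sharing this row's
# Tribe, Year and Month. It has one value per row.
snails$Count <- ave(seq_len(nrow(snails)),
                    snails$Tribe, snails$Year, snails$Month,
                    FUN = length)

names(snails)  # now includes "Count"
snails
```

Because `Count` has one value per row, the same approach applied to the full data would give a column of length 3331, and any tribe, year and month combination that appears in the data shows how many snails were recorded for it.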
