I'm trying to make a Shiny app that uses some geographical data I have stored in MySQL. The data currently contains the longitude and latitude of each client's location, plus client age, client gender, and various other demographics. For example:
+--------+------+------+--------+
| Member | Long | Lat  | Gender |
+--------+------+------+--------+
| A      | 34   | -118 | M      |
| B      | 34   | -118 | F      |
| C      | 41   | -74  | M      |
| D      | 39   | -77  | M      |
+--------+------+------+--------+
I want to use leaflet to create a bubble map of the locations, where a bigger bubble at a location means that more clients are present in that area. The function addCircles seems to do this well, but I don't have a count of the number of clients at each location to assign to the radius parameter: each row in my table represents a particular client, not a location. Is there a way to obtain that count?
My best guess is to create a new table where each row represents a distinct longitude and latitude, with a count column holding the number of clients at that location. I'm not sure this is the best way, though, since it involves creating a new table, and I would have to add columns like "number of males" and "number of females" for every factor I want to account for. And what if I wanted to restrict my map to females in the age range of 40-50? The number of columns I would have to create would easily exceed 100.
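One way to avoid precomputing a column for every demographic combination is to filter the client rows first and only then count rows per coordinate pair. The sketch below uses Python rather than R purely for illustration; the field names and the `count_by_location` helper are assumptions, not part of the original data or of the leaflet API.

```python
from collections import Counter

# Hypothetical client rows mirroring the example table (one row per client).
clients = [
    {"member": "A", "long": 34, "lat": -118, "gender": "M"},
    {"member": "B", "long": 34, "lat": -118, "gender": "F"},
    {"member": "C", "long": 41, "lat": -74, "gender": "M"},
    {"member": "D", "long": 39, "lat": -77, "gender": "M"},
]


def count_by_location(rows, keep=lambda row: True):
    """Count the rows that pass `keep`, grouped by (long, lat)."""
    return Counter((row["long"], row["lat"]) for row in rows if keep(row))


# All clients: the location shared by A and B gets a count of 2.
all_counts = count_by_location(clients)

# Only male clients: the filter is applied on the fly, no extra columns needed.
male_counts = count_by_location(clients, keep=lambda row: row["gender"] == "M")
```

The resulting counts could then drive the bubble size, and an age-range filter would simply be another `keep` predicate.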
I am working with a dataset that was obtained from a global environment and loaded into R. It has been saved as a CSV and is read into R as a data frame from that CSV. This dataset (survey_df) has almost 3 million entries. I am trying to join it, based on an ID column (each id is repeated, since there are multiple entries per id), to what was originally a shapefile and is now loaded in R as a data frame, shapefile_df. That data frame has 60,000 unique entries, each representing a geometry in a country. We expect to have many entries per geometry in most cases.
I am using a simple left_join, which should in theory join these two datasets together. The problem is that they do not fully join: only some entries match. I have tried inner, full, and right joins as well as merge, and I keep getting the same issue. I made a full_join with a copy of the id columns to compare the ones that are not joining, and I do not see any pattern. The ids appear to be the same, yet they do not join. I tried converting them with as.character and as.factor, with no change. Below is a sample of the joined and unjoined rows.
Matched ids
survey_df_id   | survey_id_copy | shapefile_df_id
-------------- | -------------- | ---------------
0901200010229  | 0901200010229  | 0901200010229
0901500010729  | 0901500010729  | 0901500010729
090050001087A  | 090050001087A  | 090050001087A
0900600010467  | 0900600010467  | 0900600010467
0901400010897  | 0901400010897  | 0901400010897
0901200011960  | 0901200011960  | 0901200011960
Unmatched ids
survey_df_id   | survey_id_copy | shapefile_df_id
-------------- | -------------- | ---------------
01903900010480 | 01903900010480 | NA
070470001010A  | NA             | 070470001010A
0704700010117  | NA             | 0704700010117
0704700010140  | NA             | 0704700010140
0705200010672  | NA             | 0705200010672
0705200010742  | NA             | 0705200010742
Most of the unmatched entries are like the first row, where shapefile_df_id is NA. However, there are a few where survey_id_copy is NA. This field is simply a mutate of survey_df_id and in theory should not be any different, yet it is. Any idea what could be causing this? I suspect a formatting issue but, as I said, converting with as.character and as.factor hasn't fixed it. I am using tidyverse and read.csv. Any help?
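A quick way to test the formatting hypothesis is to compare the raw key strings on both sides and inspect their exact representation and length. The sketch below is in Python for illustration only; the sample values and the `normalize_id` helper are assumptions, and whether whitespace or case differences are the actual cause here is not known.

```python
def normalize_id(raw):
    """Strip surrounding whitespace and upper-case an ID (assumed cleanup rules)."""
    return str(raw).strip().upper()


# Hypothetical keys: one survey ID carries a trailing space, one has no partner.
survey_ids = ["0901200010229", "070470001010A ", "01903900010480"]
shape_ids = ["0901200010229", "070470001010A"]

# Keys that fail an exact match; repr() and len() expose hidden characters.
unmatched_raw = sorted(set(survey_ids) - set(shape_ids))
for key in unmatched_raw:
    print(repr(key), len(key))

# After normalization only the genuinely missing ID remains unmatched.
unmatched_clean = sorted(
    {normalize_id(k) for k in survey_ids} - {normalize_id(k) for k in shape_ids}
)
```

If the unmatched list shrinks after normalization, the difference is in the key formatting rather than in the data itself.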
Background
This is a theoretical problem similar to an actual problem I am having.
Imagine I have a database of millions of grades for students with rows looking something like the following:
student_id | first_name | last_name | maths_grade | english_grade | physics_grade | chemistry_grade | biology_grade
-------------------------------------------------------------------------------------------------------------------
15643      | John       | Smith     | 68          | 87            | 54            | 36              | 98
13465      | Alice      | Jones     | 87          | 54            | 52            | 84              | 23
....
The Problem
I want to find students with the most similar results so I can pair them up as study partners, that way they can both learn from each other, unlike a typical mentor/mentee relationship where only one student learns.
When a given student submits a request for a study partner, I want to find the student in our database that has the most similar grades to the student who made a request.
All subject grades have equal weighting.
The Required Solution
Since this database has millions of grades, simply scanning the entire database and computing a 'similarity score' against the given student for every row won't work: it would be too slow and wouldn't scale.
Instead, we need to add a grade_fingerprint field to the database that contains a 'similarity fingerprint' of the students' grades. We can then index the table on this field and quickly find the two closest students to our given student.
This grade_fingerprint would be similar to a hash code in that it acts as a summary of the table entry. However, two objects with similar properties do not generally have similar hash codes, which makes ordinary hash codes unsuitable.
Accepted Answers
The name of a sub-domain of computer science that covers the fingerprinting of objects in a way that allows for easy searching for similar objects
The pseudocode of an algorithm that can compute a 'fingerprint' of an object that can be used to compare object similarity
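One family of techniques that fits this description is locality-sensitive hashing (LSH), where similar items are deliberately hashed to the same bucket. As a rough illustration only, the Python sketch below uses a simple grid quantization in that spirit: each grade is divided by an assumed bucket width, and the tuple of bucket numbers serves as the fingerprint. The `fingerprint` function, the bucket width of 10, and the third sample student are all hypothetical.

```python
from collections import defaultdict

BUCKET_WIDTH = 10  # assumed granularity; grades in the same 10-point band collide


def fingerprint(grades, width=BUCKET_WIDTH):
    """Map a sequence of grades to a tuple of bucket numbers."""
    return tuple(grade // width for grade in grades)


# Students as (maths, english, physics, chemistry, biology).
students = {
    15643: (68, 87, 54, 36, 98),
    13465: (87, 54, 52, 84, 23),
    20001: (65, 83, 59, 31, 95),  # made-up student with grades close to 15643
}

# Index student ids by fingerprint; a database index on the field plays this role.
index = defaultdict(list)
for student_id, grades in students.items():
    index[fingerprint(grades)].append(student_id)

# Candidate partners for 15643 are the other students in the same bucket.
candidates = [s for s in index[fingerprint(students[15643])] if s != 15643]
```

Students near a bucket boundary can land in different buckets, so a real system would probably also probe neighboring buckets or combine several offset grids before ranking the candidates by actual distance.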
I have two tables that are linked via a relation (edit -> data table properties -> relations). One contains some raw data, and the other contains aggregated data (calculation on the value).
You can see some examples below. Here, data are linked on "category" column.
RAW DATA
category | id | value
---------+----+------
A        | 1  | 10
A        | 2  | 20
A        | 3  | 30
A        | 4  | 30
B        | 1  | 20
B        | 2  | 20
COMPUTED DATA
category | any_calculation   // aggregation of raw data based on category
---------+----------------
A        | 10
B        | 20
To do the calculation, I use an R/TERR function that takes the raw data as input and outputs the computed data.
Then I display raw data in a scatter plot (one per category), and I add a curve that is taken from the column "any_calculation" of the computed data.
My main problem is that my table with computed data isn't filled by the R/TERR script. The cause is, in my opinion, the cyclic dependency between those two tables.
Do you have any idea, workaround, or fix?
I should also add that I can't do the calculation in the scatter plot (huge calculation). I use Spotfire 7.8.0.
It seems like a table can't be modified/edited by different sources, that is to say multiple scripts (R and Python) can't have the same table as an output.
To fix my problem, I created a new table in one of my scripts. Then I created a relation between this table and the one from the other script.
I'm new to Power BI. I have followed most of the tutorials on MS but haven't yet figured out how to create a graph that resembles this one, which I made as an Excel Pivot Graph from the same data table.
What I need to recreate in Power BI is a column graph of the most requested products (pre-order requests as a % of the total sum) in different price ranges.
Pivot Graph
Table, e.g.:
| Date  | Product | 3 to 5 Eur | 5 to 8 Eur | 8 to 11 Eur |
-----------------------------------------------------------
| mar17 | Coffee  | 12         | 7          | 2           |
| mar17 | Milk    | 15         | 3          | 1           |
| mar17 | Honey   | 17         | 0          | 5           |
| mar17 | Sugar   | 20         | 9          | 8           |
Thanks in advance for the help.
Bests,
Alberto
Edit - Thanks to Mike Honey for pointing out the original request was for % of grand total. I have added an additional step to accomplish this and cleaned up some existing steps.
When I imported your sample data into Power BI, I got this (looking at the data in the Query Editor window).
From there, select the Date and Product columns and then click on Transform -> Unpivot Columns -> Unpivot Other Columns...
... which results in this.
Just to clean this up, I renamed the Attribute and Value columns and changed the data type of the Value column. In the end, it looks like this.
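The unpivot step turns each price-range column into its own row. The following Python sketch is only meant to show the shape of that transformation, not how Power BI implements it; the `unpivot` helper is hypothetical, and the output column names (Price Range, Quantity) are assumed from the field names used later in this answer.

```python
def unpivot(rows, id_columns, name_column, value_column):
    """Turn every non-id column of each row into its own (name, value) row."""
    long_rows = []
    for row in rows:
        ids = {col: row[col] for col in id_columns}
        for col, value in row.items():
            if col not in id_columns:
                long_rows.append({**ids, name_column: col, value_column: value})
    return long_rows


# Two rows of the sample table in its original wide layout.
wide = [
    {"Date": "mar17", "Product": "Milk", "3 to 5 Eur": 15, "5 to 8 Eur": 3, "8 to 11 Eur": 1},
    {"Date": "mar17", "Product": "Honey", "3 to 5 Eur": 17, "5 to 8 Eur": 0, "8 to 11 Eur": 5},
]

long_rows = unpivot(wide, ["Date", "Product"], "Price Range", "Quantity")
```

Each wide row with three price-range columns becomes three long rows, one per price range.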
Then just click on Home -> Close & Apply to get back to the Report Editor window, where you can create a graph and configure it as follows:
Axis:
Price Range
Product
Value:
Quantity
Then click on the forked drill-down arrow in the top left corner of the graph to show Price Range and Product.
Which looks like this.
Next (this is not necessary, but I think it looks nicer), with the graph selected, click on the paint roller icon and expand the X-Axis category. In there, turn off Concatenate labels.
Finally, to get the bars to be % grand total, simply right click on Quantity in the Value section of the graph's fields and then select Show value as -> Percent of grand total.
The final result looks like this.
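For reference, percent of grand total divides each value by the sum of all values. A small Python sketch of that arithmetic on the sample quantities follows; the `percent_of_grand_total` helper is hypothetical and is not a claim about how Power BI computes the figure internally.

```python
def percent_of_grand_total(quantities):
    """Return each quantity as a percentage of the sum of all quantities."""
    grand_total = sum(quantities.values())
    return {key: 100 * value / grand_total for key, value in quantities.items()}


# Sample quantities keyed by (price range, product).
quantities = {
    ("3 to 5 Eur", "Coffee"): 12, ("5 to 8 Eur", "Coffee"): 7, ("8 to 11 Eur", "Coffee"): 2,
    ("3 to 5 Eur", "Milk"): 15, ("5 to 8 Eur", "Milk"): 3, ("8 to 11 Eur", "Milk"): 1,
    ("3 to 5 Eur", "Honey"): 17, ("5 to 8 Eur", "Honey"): 0, ("8 to 11 Eur", "Honey"): 5,
    ("3 to 5 Eur", "Sugar"): 20, ("5 to 8 Eur", "Sugar"): 9, ("8 to 11 Eur", "Sugar"): 8,
}

shares = percent_of_grand_total(quantities)
```

With these numbers the grand total is 99, so every bar is its quantity divided by 99.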
I have two columns in my Google Sheet that correspond to 1) the frequencies of the elements and 2) their respective 'values'.
What I want is a diagram with the different frequencies on the x-axis and, for each frequency, that frequency's value on the y-axis (if more than one element has that frequency, I want it to plot their mean value).
Two elements can share the same frequency and/or the same score, which is why I want the mean functionality as well.
If the following were my data:
280 6
280 4
250 2
240 1
230 3
Forgive my ascii-skills, but I'd want the graph to plot the following in that case:
[ASCII bar chart: x-axis labeled 230, 240, 250, 260, 270, 280, ...; one vertical bar per frequency, with height equal to that frequency's mean value]
I'm not entirely familiar with Google Sheets yet and I'm not really sure how to accomplish this.
I think a pivot table should serve. Put your left-hand column in Rows and your right-hand column in Values, with Summarise by set to AVERAGE. Then chart the results: select what in the image is B14:B17, choose Insert, then Chart, and accept the first recommendation.
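What the pivot table computes here is a grouped average: rows are grouped by frequency and the values in each group are averaged. As a sanity check on the expected chart heights, here is the same calculation as a short Python sketch using the sample data from the question; the `mean_by_key` helper is hypothetical.

```python
from collections import defaultdict


def mean_by_key(pairs):
    """Average the values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


# (frequency, value) pairs from the question.
data = [(280, 6), (280, 4), (250, 2), (240, 1), (230, 3)]

# 280 appears twice, so its bar height is the mean of 6 and 4.
means = mean_by_key(data)
```

Frequencies with no rows, such as 260 and 270, simply do not appear in the result.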