Providing summary on data by Month in R - r

I have a dataset that contains information on incidents reported (either Fraud or Error). Information includes the unique ID number, date of the incident, the type of incident, the size, and whether the case is open or closed and, if closed, the closed date.
I am looking to write a script that produces a new table grouping the data by month, so that for each month we can see how many new incidents there have been, as well as their average size.
Additionally, I would like to see how many cases are still open in each month, ideally broken down by the type of incident.
Counting both the new cases and the open cases is what I am having the most trouble with.
Any help would be appreciated.
| Error Type | Date of Error | ID | Size | Open/Closed | Date Closed |
| --- | --- | --- | --- | --- | --- |
| Fraud | 12/05/2021 | 233 | 5,000 | Open | - |
| Fraud | 19/07/2020 | 194 | 23,000 | Closed | 24/01/22 |
| Error | 23/05/2021 | 241 | 9,000 | Open | - |
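One possible way to approach the two counts, sketched in base R. The data frame `incidents` and its column names (`type`, `opened`, `closed`, `size`) are assumptions made for illustration, and "open in a month" is taken here to mean "opened on or before the last day of the month and not yet closed on that day", which may not be the definition you need.

```r
# Toy data mirroring the question's table; the column names are assumptions.
incidents <- data.frame(
  type   = c("Fraud", "Fraud", "Error"),
  id     = c(233, 194, 241),
  size   = c(5000, 23000, 9000),
  opened = as.Date(c("12/05/2021", "19/07/2020", "23/05/2021"), format = "%d/%m/%Y"),
  closed = as.Date(c(NA, "24/01/22", NA), format = "%d/%m/%y"),  # NA = still open
  stringsAsFactors = FALSE
)

# New incidents and average size per month and type.
incidents$month <- format(incidents$opened, "%Y-%m")
incidents$n_new <- 1
new_cases <- aggregate(cbind(n_new, total_size = size) ~ month + type,
                       data = incidents, FUN = sum)
new_cases$mean_size <- new_cases$total_size / new_cases$n_new

# Every month between the first incident and the last recorded date.
month_starts <- seq(
  from = as.Date(format(min(incidents$opened), "%Y-%m-01")),
  to   = as.Date(format(max(c(incidents$opened, incidents$closed), na.rm = TRUE),
                        "%Y-%m-01")),
  by   = "month"
)

# Cases of one type that were open on the last day of a given month.
count_open <- function(month_start, type) {
  month_end <- seq(month_start, by = "month", length.out = 2)[2] - 1
  sum(incidents$type == type &
        incidents$opened <= month_end &
        (is.na(incidents$closed) | incidents$closed > month_end))
}

# One row per month and type, filled in with the open-case count.
types <- unique(incidents$type)
open_cases <- data.frame(
  month_start = rep(month_starts, times = length(types)),
  type        = rep(types, each = length(month_starts)),
  stringsAsFactors = FALSE
)
open_cases$month <- format(open_cases$month_start, "%Y-%m")
open_cases$open_at_month_end <- vapply(
  seq_len(nrow(open_cases)),
  function(i) count_open(open_cases$month_start[i], open_cases$type[i]),
  integer(1)
)
```

The two tables can then be combined with `merge(open_cases, new_cases, by = c("month", "type"), all.x = TRUE)` if a single monthly summary is wanted; months with no new incidents would show `NA` in the new-case columns.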

Related

Join and merge are only working partially and completely in R

I am working with a dataset that was obtained from a global environment and loaded into R. It has been saved as a CSV and is read into R as a data frame from that CSV. This dataset (survey_df) has almost 3 million entries. I am trying to join it, on an ID column (each ID is repeated, since there are multiple entries per ID), to what was originally a shapefile and is now loaded in R as a data frame, shapefile_df. That data frame has 60,000 unique entries, each representing a geometry in a country, so we expect many survey entries per geometry in most cases. I am using a simple left_join, which should in theory join the two datasets. The problem is that they do not fully join: only some entries match. I have tried inner, full, and right joins as well as merge, and I keep getting the same result. I ran a full_join with a copy of the ID columns to compare the entries that do not join, and I do not see any pattern. The IDs look identical, yet they do not join. I tried converting them with as.character and as.factor, with no change. Below is a sample of the matched and unmatched rows.
Matched ids
| survey_df_id | survey_id_copy | shapefile_df_id |
| -------------- | -------------- | -------------- |
| 0901200010229 | 0901200010229 | 0901200010229 |
| 0901500010729 | 0901500010729 | 0901500010729 |
| 090050001087A | 090050001087A | 090050001087A |
| 0900600010467 | 0900600010467 | 0900600010467 |
| 0901400010897 | 0901400010897 | 0901400010897 |
| 0901200011960 | 0901200011960 | 0901200011960 |
Unmatched ids
| survey_df_id | survey_id_copy | shapefile_df_id |
| -------------- | -------------- | -------------- |
| 01903900010480 | 01903900010480 | NA |
| 070470001010A | NA | 070470001010A |
| 0704700010117 | NA | 0704700010117 |
| 0704700010140 | NA | 0704700010140 |
| 0705200010672 | NA | 0705200010672 |
| 0705200010742 | NA | 0705200010742 |
Most of the unmatched entries are like the first row, where shapefile_df_id is NA. However, there are a few where survey_id_copy is NA. This field is simply a mutate of survey_df_id and in theory should not differ from it, yet it does. Any idea what could be causing this? I suspect a formatting issue, but as I said, converting with as.character or as.factor hasn't fixed it. I am using tidyverse and read.csv. Any help?
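A small base R sketch of two common causes of IDs that look equal but do not join: leading zeros lost when read.csv parses an all-digit column as a number, and stray whitespace. Whether either applies to this data is an assumption; the toy data frames below only reuse the names survey_df and shapefile_df, and the `zone` column is hypothetical.

```r
# 1. Without colClasses, read.csv parses an all-digit ID column as a number,
#    which drops the leading zero. Forcing character keeps the ID intact.
csv_text  <- "id,value\n0901200010229,1\n0704700010117,2\n"
as_number <- read.csv(text = csv_text)
as_text   <- read.csv(text = csv_text, colClasses = c(id = "character"))

# 2. Stray whitespace makes IDs that print alike compare as different.
survey_df <- data.frame(
  id = c("0901200010229", "0704700010117 ", " 0705200010672"),
  stringsAsFactors = FALSE
)
shapefile_df <- data.frame(
  id   = c("0901200010229", "0704700010117", "0705200010672"),
  zone = c("a", "b", "c"),  # hypothetical attribute column
  stringsAsFactors = FALSE
)

# IDs present in the survey but absent from the shapefile table.
unmatched <- setdiff(survey_df$id, shapefile_df$id)

# Differing lengths are a quick hint of padding or lost zeros.
id_lengths <- nchar(unmatched)

# Trim the whitespace and join again.
survey_df$id <- trimws(survey_df$id)
joined <- merge(survey_df, shapefile_df, by = "id", all.x = TRUE)
```

Comparing `table(nchar(...))` for the ID column of each data frame is a cheap first check: the sample above already shows one 14-character ID among 13-character ones.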

R shiny leaflet bubble map by Count

I'm trying to make a Shiny app that uses some Geographical data I have stored in MySQL. The data currently contains a list of Longitudes and Latitudes (of client location), client age, client gender, and other various demographics. For example:
+--------+-------+--------+---------+
| Member | Long | Lat | Gender |
+--------+-------+--------+---------+
| A | 34 | -118 | M |
| B | 34 | -118 | F |
| C | 41 | -74 | M |
| D | 39 | -77 | M |
+--------+-------+--------+---------+
I want to use leaflet to create a bubble map for the locations: a bigger bubble at a location means that more clients are present in that area. The function addCircles seems to do this pretty well, but the problem is that I don't have a count of the number of clients at each location to assign to the radius parameter. Each row in my table represents information for a particular client, not a location. Is there a way to obtain that info?
My best guess is to create a new table where each row represents a distinct longitude and latitude, with a count column for the number of times that location appears among clients. I'm not sure this is the best way, since it involves creating a new table, and I would have to create additional columns like "number of males" and "number of females" for every factor I want to account for. And what if I wanted to restrict my map to females in the age range of 40-50? The number of columns I would have to create would easily exceed 100.
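A minimal base R sketch of one alternative to precomputing a column for every combination: filter the client rows first, then count per location on the fly. The data frame `clients` mirrors the example table above, and `count_by_location` is a hypothetical helper name.

```r
# Toy data copied from the example table; one row per client.
clients <- data.frame(
  Member = c("A", "B", "C", "D"),
  Long   = c(34, 34, 41, 39),
  Lat    = c(-118, -118, -74, -77),
  Gender = c("M", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse client rows to one row per location with a count.
count_by_location <- function(df) {
  df$n_clients <- 1
  aggregate(n_clients ~ Long + Lat, data = df, FUN = sum)
}

# Counts for everyone, and for a filtered subset, from the same client table.
all_counts    <- count_by_location(clients)
female_counts <- count_by_location(subset(clients, Gender == "F"))
```

The resulting `n_clients` column could then be scaled and passed to the radius argument of addCircles. In a Shiny app, the filtering step would presumably sit inside a reactive driven by the user's inputs, so no extra per-factor columns need to be stored.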

Tidy one unique identifier across several time period tables in r

I want to import data and tidy it in R. I have achieved some of the results I want using functions in Excel, but it is tedious and must be redone by hand each time I get a new Excel file with updated data. I have an Excel file with separate worksheets for each time period. This Excel file is updated multiple times each year, keeping the same style but adding additional data, including additional time period worksheets. Each worksheet follows the same format, as follows:
Student_ID| Major_ID | Gender | Age | Semester_Registered | Marital_Status | Home_State
20130001 | 10022 | M | 22 | 3 | S | AZ
20130002 | 10022 | F | 23 | 5 | M | CA
20140001 | 10022 | M | 21 | 3 | M | CA
20140004 | 10034 | F | 24 | 4 | S | AZ
This would be the example for the first few records of a given time period worksheet, let's say 2016_Semester_1. Student ID is assigned to a student when they register for classes and serves as a unique identifier. Major_ID corresponds to a table with Major_ID and Major_Name and Campus. The codes stay the same for each worksheet, but a student can change majors or change campus, thus Major_ID could be different for a given student from one time period to another. Gender and age are self-explanatory. Semester_Registered is a number from 1 to 8. When a student first registers for classes, they are in Semester_Registered 1, then their second semester in their first year they should move on to 2, their first semester of their sophomore year they should be in 3, all the way to 8 in the second semester of their senior year. However, some students do not move through the sequence of semesters at the normal rate, for example if they have to repeat a semester due to failed courses or if they have to leave the university for a time in order to earn more money before returning later and continuing their studies. Marital_Status is either S for Single, M for Married, D for Divorced or W for Widowed. Home_State is the two letter abbreviation for the US State the student is from, mainly needed to see if the student qualifies for in-state tuition rates, but also useful for reports to see where most students come from to focus marketing activities on those states.
The Excel workbook that I have contains a worksheet for each academic semester from 2014_1 to 2019_1. I want to consolidate the data and tidy it in two main ways. First, I want to make new tables for each Freshman class, including only those who were in Semester_Registered 1 in the 2014_1 semester in one table, in the 2015_1 semester in another table, up through 2019_1. The headers for the data I want in these tables would look like this:
First_Semester | Student_ID | Major_ID_Start | Gender | Age_Start | Marital_Status_Start | Final_Semester_Time | Final_Semester_Registered | Graduated_On_Time | Graduated_Late | Major_ID_End | Age_End | Marital_Status_End | Still_Enrolled
All of the records in a given table would have the same First_Semester value, such as 2014_1 or 2015_1. Student_ID is the identifier. Major_ID_Start is the Major_ID the student had in First_Semester. Gender could probably be collected only once, from First_Semester. Age_Start and Marital_Status_Start are their respective values as listed in First_Semester. Final_Semester_Time needs to look through each time period worksheet until it finds that the given Student_ID no longer appears on the list of registered students; for students who graduate, this should be the time period when Semester_Registered equals 8, but some students drop out before graduation, so this would show in which time period they were last registered before dropping out. Final_Semester_Registered shows the value of Semester_Registered in Final_Semester_Time, which should be 8 if the student graduated; if not, it shows how far the student advanced in their studies before dropping out. Graduated_On_Time is either true or false: true if the student shows up with Semester_Registered 8 exactly 4 years after the First_Semester year, such as a student who started their freshman year in 2014_1 and graduated at the end of 2018_2. Graduated_Late is also true or false, and is true if the student reached Semester_Registered 8 at some point more than 4 years after their First_Semester year. Major_ID_End shows the last registered Major_ID for the last semester in which the given Student_ID shows up in the list of registered students, and is useful to compare with Major_ID_Start to see if the student changed majors. Age_End and Marital_Status_End record their respective values in the time period of Final_Semester_Time.
Still_Enrolled is true or false, and it is true if the Student_ID is still present in the latest time period worksheet, at present this would be 2019_1 but it would be ideal to have this update in the future to use the latest time period sheet included in the data (since, for example, in a few months we will put in new data which will include 2019_2).
Second, I want a table simply showing Student_ID of students who are no longer registered in the latest time period. This would have column headers as follows:
First_Semester | Student_ID | Major_ID_Start | Gender | Age_Start | Marital_Status_Start | Final_Semester_Time | Final_Semester_Registered | Graduated_On_Time | Graduated_Late | Dropped_Out | Major_ID_End | Age_End | Marital_Status_End
The columns are the same as the other example, except for Dropped_Out, which is true or false and it is true if the student has a Final_Semester_Registered less than 8. The key point here is that this table should only include those Student_IDs where Still_Enrolled is false, and serves as a consolidated list of all of the students who used to be enrolled in the university but are no longer enrolled, allowing for analysis between those who graduated on time, those who graduated late, and those who dropped out.
I have achieved some of these results using Excel, but it is a drawn-out and manual process, which must be redone every time the data file updates. Excel has also become fairly slow in loading the file and updating the formula calculations, so I would like to move this to R. For reference, though, here are some of the formulas I used in Excel, to give an idea of what might be adaptable into R.
I have a consolidated table with each Student_ID as a row and it includes columns like:
Student_ID | Major_ID_2014_1 | Major_ID_2014_2 | Major_ID_2015_1 | Semester_Registered_2014_1 | Semester_Registered_2014_2 | Semester_Registered_2015_1 | Final_Semester_Time | Final_Semester_Registered | Age_Start | Age_End
This is abbreviated, since it includes both Major_ID and Semester_Registered columns from 2014_1 up through 2019_1, but here in my example I am just showing up to 2015_1 to give the idea.
The formula for Major_ID_2014_1 is =IFERROR(INDEX(Semester_2014_1,MATCH(Student_ID_Cell,Student_IDs_2014_1,0)),"") where Semester_2014_1 and Student_IDs_2014_1 are named ranges from the worksheet of the time period 2014_1 including the relevant rows. A similar formula uses a different named data set for the rows related to Semester_Registered. Then I can use something like =IF(SUMPRODUCT(1/COUNTIF(F3:R3,F3:R3))<3,FALSE,TRUE) on the range of cells for Major_ID from 2014_1 to 2019_1 (each in its own column) to see if the Major_ID changed (meaning the student changed majors or changed campuses), and I can use a MAX() formula on the range of columns for Semester_Registered to find the highest semester the student reached. A formula like =LOOKUP(2,1/(V3:AH3<>""),$V$2:$AH$2), which goes over the same range of columns for Semester_Registered where the second row has a header like 2014_1, 2014_2, etc., returns the last column that is not blank (thus the last column the student was registered in). This can then be used with an INDIRECT() in order to reference a named data set (I had to manually name all the data sets in each worksheet by time period) like =IFERROR(INDEX(INDIRECT(CONCATENATE("DATA_",AK3)),MATCH(T3,INDIRECT(CONCATENATE("Student_IDs_",AK3)),0),4),"") where AK3 contains the Final_Semester_Time, like 2014_1.
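A rough base R sketch of how the per-sheet lookups described above might be replaced: stack every worksheet into one long table with a time period column, then take each student's first and last row. The list `sheets` stands in for the imported worksheets (reading them from the workbook, for example with the readxl package, is left out), and the toy values and the `Time_Period` column name are assumptions.

```r
# Stand-in for the imported worksheets: a named list, one data frame per period.
sheets <- list(
  "2014_1" = data.frame(Student_ID = c(1, 2), Major_ID = c(10022, 10022),
                        Semester_Registered = c(1, 1)),
  "2014_2" = data.frame(Student_ID = c(1, 2), Major_ID = c(10022, 10034),
                        Semester_Registered = c(2, 2)),
  "2015_1" = data.frame(Student_ID = 1, Major_ID = 10022,
                        Semester_Registered = 3)
)

# Stack all periods into one long table, tagging each row with its period.
long <- do.call(rbind, Map(
  function(df, period) data.frame(Time_Period = period, df, stringsAsFactors = FALSE),
  sheets, names(sheets)
))

# Labels like "2014_1" sort chronologically as plain text.
long <- long[order(long$Student_ID, long$Time_Period), ]

# First and last appearance of each student (both ordered by Student_ID).
first_rows <- long[!duplicated(long$Student_ID), ]
last_rows  <- long[!duplicated(long$Student_ID, fromLast = TRUE), ]

# The latest period is taken from the data, so new sheets are picked up.
latest_period <- max(long$Time_Period)

summary_tbl <- data.frame(
  Student_ID                = first_rows$Student_ID,
  First_Semester            = first_rows$Time_Period,
  Major_ID_Start            = first_rows$Major_ID,
  Final_Semester_Time       = last_rows$Time_Period,
  Final_Semester_Registered = last_rows$Semester_Registered,
  Major_ID_End              = last_rows$Major_ID,
  Still_Enrolled            = last_rows$Time_Period == latest_period,
  stringsAsFactors = FALSE
)
```

Splitting by cohort would then be a matter of `split(summary_tbl, summary_tbl$First_Semester)`, and the list of former students is `summary_tbl[!summary_tbl$Still_Enrolled, ]`. The graduation flags are not covered here; they would need a rule for turning period labels into elapsed years.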

Selenium IDE, identify row in table based on 3 columns

I am trying to find a row in a table which contains specific values on three columns.
I have tried the methods in @paul trmbrth's answer to find an XPath to identify a cell in a table based on another column. They worked fine for 2 columns, but didn't work with 3. I didn't find any example for cases with more than 2 values.
VEHICLE CATEGORY | CATEGORY | SUBCATEGORY
A | Exteriors | Badges
A | Exteriors | Badges
A | Exteriors | Mirrors
A | Interiors | Wheels
A | Interiors | Rears
Want cell with the combination that contains:
A | Exteriors | Mirrors
I have tried but no success:
//tr[contains(td[1], 'A')]/td[2][contains(., 'Exterior')] td[3][contains(., 'Mirror')]
//tr[contains(td[1], 'A')]/td[2][contains(., 'Exterior')] /td[3][contains(., 'Mirror')]
css=tr([td:contains('A')][td:contains('Exterior')][td:contains('Mirror')])
Can anyone help?
I think you have a couple of typos:
//tr[contains(td[1], '1') and contains(td[2], 'Eve') and contains(td[3], 'Jackson')]
I'm not 100% sure this is the most efficient approach, but it will work.

Getting summary data from graphite

I have created a dashboard using the data published from my application using statsd with a graphite backend. This has worked great for building nice data visualizations. (Kudos to etsy and others!)
Now I need to make a summary dashboard that will display a grid, more-or-less, that shows each stat with a count. (No graphs on this page but clicking on the stat name will take you to the graph.)
So, for example, I am collecting statistics for how many messages each node in our cluster receives, processes successfully, and fails to process. What I need is something like the following:
| Node Name | Messages Received | Successful | Failed |
| ------------- |:-----------------:| -----------:| ---------:|
| Node1 | 1126 | 1120 | 6 |
| Node2 | 1155 | 1100 | 55 |
| Node3 | 1124 | 1119 | 5 |
| Node4 | 1204 | 1198 | 6 |
I have a timespan selector on the toolbar, and based on that selection these numbers should update to reflect the selected timespan. I'm having a hard time getting numbers that align with what I expect. In some scenarios the summarize function returns decimal values, which does not make sense to me for counts.
Any help or guidance would be greatly appreciated.
