I'm trying to create the following warning in Google Spreadsheet: when I enter a combination of values in the Name and Date columns that is already present, the Result column should show the message Duplicate date.
Here is an example:
Name | Date | Result
Alex | 27/11/2013
John | 28/11/2013
Alan | 29/11/2013
Val | 30/11/2013
Jack | 2/12/2013
Alex | 27/11/2013 |Duplicate date
I know how to raise a "warning" if a duplicated Date exists, by changing the ColumnC cell text "Date" into that message, but I don't know how to pair the Name and Date values.
I use this =IF:
=IF(COUNTA(B2:B)>COUNTA(UNIQUE(B2:B));"Duplicate date";"Date")
Please try:
=If(ArrayFormula(SUMPRODUCT((A:A=A3)*(B:B=B3))>1);"Duplicate date";"")
in C3 and copied down (assuming John is in A4).
Note, however, that this flags every row of a duplicated pair (i.e. both your first and last rows above), rather than only the later repetition (the last row, which repeats the first).
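To make the difference between the two behaviors concrete, here is a minimal Python sketch (not spreadsheet code). The function names are made up for illustration; the data is the example from the question.

```python
from collections import Counter

# (Name, Date) pairs from the example in the question.
rows = [
    ("Alex", "27/11/2013"),
    ("John", "28/11/2013"),
    ("Alan", "29/11/2013"),
    ("Val", "30/11/2013"),
    ("Jack", "2/12/2013"),
    ("Alex", "27/11/2013"),
]

def flag_all_duplicates(rows):
    """Mark every row whose (name, date) pair occurs more than once.
    This mirrors what the SUMPRODUCT formula does."""
    counts = Counter(rows)
    return ["Duplicate date" if counts[r] > 1 else "" for r in rows]

def flag_repeats_only(rows):
    """Mark only the second and later occurrences of a pair."""
    seen = set()
    flags = []
    for r in rows:
        flags.append("Duplicate date" if r in seen else "")
        seen.add(r)
    return flags
```

The first function flags both Alex rows and the second flags only the last one. In the sheet itself, a range that grows as the formula is copied down (for example a COUNTIFS over A$3:A3 and B$3:B3) might give the second behavior, but I have not tested that here.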
I want to import data and tidy it in R. I have achieved some of the results I want using functions in Excel, but it is tedious and must be redone by hand each time I get a new Excel file with updated data. I have an Excel file with a separate worksheet for each time period. This Excel file is updated multiple times each year, keeping the same style but adding additional data, including additional time period worksheets. Each worksheet follows the same format, as follows:
Student_ID| Major_ID | Gender | Age | Semester_Registered | Marital_Status | Home_State
20130001 | 10022 | M | 22 | 3 | S | AZ
20130002 | 10022 | F | 23 | 5 | M | CA
20140001 | 10022 | M | 21 | 3 | M | CA
20140004 | 10034 | F | 24 | 4 | S | AZ
This would be the example for the first few records of a given time period worksheet, let's say 2016_Semester_1. Student ID is assigned to a student when they register for classes and serves as a unique identifier. Major_ID corresponds to a table with Major_ID and Major_Name and Campus. The codes stay the same for each worksheet, but a student can change majors or change campus, thus Major_ID could be different for a given student from one time period to another. Gender and age are self-explanatory. Semester_Registered is a number from 1 to 8. When a student first registers for classes, they are in Semester_Registered 1, then their second semester in their first year they should move on to 2, their first semester of their sophomore year they should be in 3, all the way to 8 in the second semester of their senior year. However, some students do not move through the sequence of semesters at the normal rate, for example if they have to repeat a semester due to failed courses or if they have to leave the university for a time in order to earn more money before returning later and continuing their studies. Marital_Status is either S for Single, M for Married, D for Divorced or W for Widowed. Home_State is the two letter abbreviation for the US State the student is from, mainly needed to see if the student qualifies for in-state tuition rates, but also useful for reports to see where most students come from to focus marketing activities on those states.
The Excel workbook that I have contains a worksheet for each academic semester from 2014_1 to 2019_1. I want to consolidate the data and tidy it in two main ways. First, I want to make new tables for each Freshman class, including only those who were in Semester_Registered 1 in the 2014_1 semester in one table, in the 2015_1 semester in another table, up through 2019_1. The headers for the data I want in these tables would look like this:
First_Semester | Student_ID | Major_ID_Start | Gender | Age_Start | Marital_Status_Start | Final_Semester_Time | Final_Semester_Registered | Graduated_On_Time | Graduated_Late | Major_ID_End | Age_End | Marital_Status_End | Still_Enrolled
All of the records in a given table would have the same First_Semester value, such as 2014_1 or 2015_1. Student_ID is the identifier. Major_ID_Start is the Major_ID the student had in First_Semester. Gender could probably be collected only once, from First_Semester. Age_Start and Marital_Status_Start are their respective values as listed in First_Semester. Final_Semester_Time needs to look through each time period worksheet until it finds that the given Student_ID no longer appears on the list of registered students; for students who graduate, this should be the time period when Semester_Registered equals 8, but some students drop out before graduation, so this would show in which time period they were last registered before dropping out. Final_Semester_Registered shows the value of Semester_Registered in Final_Semester_Time, which should be 8 if the student graduated; if not, it shows how far the student advanced in their studies before dropping out. Graduated_On_Time is either true or false: true if the student shows up with Semester_Registered 8 exactly 4 years after the First_Semester year, such as a student who started their freshman year in 2014_1 and graduated at the end of 2018_2. Graduated_Late is also true or false, and is true if the student reached Semester_Registered 8 at some point more than 4 years after their First_Semester year. Major_ID_End shows the Major_ID for the last semester in which the given Student_ID shows up in the list of registered students, and is useful to compare with Major_ID_Start to see if the student changed majors. Age_End and Marital_Status_End record their respective values in the time period of Final_Semester_Time.
Still_Enrolled is true or false, and it is true if the Student_ID is still present in the latest time period worksheet, at present this would be 2019_1 but it would be ideal to have this update in the future to use the latest time period sheet included in the data (since, for example, in a few months we will put in new data which will include 2019_2).
Second, I want a table simply showing Student_ID of students who are no longer registered in the latest time period. This would have column headers as follows:
First_Semester | Student_ID | Major_ID_Start | Gender | Age_Start | Marital_Status_Start | Final_Semester_Time | Final_Semester_Registered | Graduated_On_Time | Graduated_Late | Dropped_Out | Major_ID_End | Age_End | Marital_Status_End
The columns are the same as the other example, except for Dropped_Out, which is true or false and it is true if the student has a Final_Semester_Registered less than 8. The key point here is that this table should only include those Student_IDs where Still_Enrolled is false, and serves as a consolidated list of all of the students who used to be enrolled in the university but are no longer enrolled, allowing for analysis between those who graduated on time, those who graduated late, and those who dropped out.
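As a language-neutral illustration of the consolidation described above, here is a rough Python sketch; the same steps should translate to R. It assumes each worksheet has already been read into a list of row dictionaries keyed by a label such as 2014_1. The names `sheets`, `consolidate` and `freshman_cohort` are hypothetical. It also assumes Final_Semester_Time means the last worksheet in which the student appears at all, so a student who leaves and later returns is not treated specially. Graduated_On_Time and Graduated_Late are left out.

```python
def semester_key(label):
    """Turn '2014_1' into (2014, 1) so labels sort chronologically."""
    year, term = label.split("_")
    return int(year), int(term)

def consolidate(sheets):
    """sheets maps a label like '2014_1' to a list of row dicts."""
    labels = sorted(sheets, key=semester_key)
    latest = labels[-1]  # picks up newly added worksheets automatically
    students = {}
    for label in labels:
        for row in sheets[label]:
            sid = row["Student_ID"]
            if sid not in students:
                # First appearance: record the *_Start values once.
                students[sid] = {
                    "First_Semester": label,
                    "Student_ID": sid,
                    "Major_ID_Start": row["Major_ID"],
                    "Gender": row["Gender"],
                    "Age_Start": row["Age"],
                    "Marital_Status_Start": row["Marital_Status"],
                }
            rec = students[sid]
            # Later worksheets overwrite these, so the last one wins.
            rec["Final_Semester_Time"] = label
            rec["Final_Semester_Registered"] = row["Semester_Registered"]
            rec["Major_ID_End"] = row["Major_ID"]
            rec["Age_End"] = row["Age"]
            rec["Marital_Status_End"] = row["Marital_Status"]
    for rec in students.values():
        rec["Still_Enrolled"] = rec["Final_Semester_Time"] == latest
        # Only meaningful for students who are no longer enrolled.
        rec["Dropped_Out"] = (not rec["Still_Enrolled"]
                              and rec["Final_Semester_Registered"] < 8)
    return students

def freshman_cohort(sheets, students, label):
    """Students in Semester_Registered 1 who first appear in `label`."""
    return [students[row["Student_ID"]] for row in sheets[label]
            if row["Semester_Registered"] == 1
            and students[row["Student_ID"]]["First_Semester"] == label]
```

The second table would then be the records where Still_Enrolled is false. Because the latest worksheet is found by sorting the labels, adding 2019_2 later should not need any code changes.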
I have achieved some of these results using Excel, but it is a drawn-out, manual process which must be redone every time the data file updates. Excel has also become fairly slow in loading the file and updating the formula calculations, so I would like to move this to the R statistical software. For reference, though, here are some of the formulas I used in Excel, to give an idea of what might be adaptable into R.
I have a consolidated table with each Student_ID as a row and it includes columns like:
Student_ID | Major_ID_2014_1 | Major_ID_2014_2 | Major_ID_2015_1 | Semester_Registered_2014_1 | Semester_Registered_2014_2 | Semester_Registered_2015_1 | Final_Semester_Time | Final_Semester_Registered | Age_Start | Age_End
This is abbreviated, since it includes both Major_ID and Semester_Registered columns from 2014_1 up through 2019_1, but here in my example I am just showing up to 2015_1 to give the idea.
The formula for Major_ID_2014_1 is =IFERROR(INDEX(Semester_2014_1,MATCH(Student_ID_Cell,Student_IDs_2014_1,0)),"") where Semester_2014_1 and Student_IDs_2014_1 are named ranges from the worksheet of the time period 2014_1 including the relevant rows. A similar formula uses a different named data set for the rows related to Semester_Registered. Then I can use something like =IF(SUMPRODUCT(1/COUNTIF(F3:R3,F3:R3))<3,FALSE,TRUE) on the range of cells for Major_ID from 2014_1 to 2019_1 (each in its own column) to see if the Major_ID changed (meaning the student changed majors or changed campuses), and I can use a MAX() formula on the range of columns for Semester_Registered to find the highest semester the student reached. A formula like =LOOKUP(2,1/(V3:AH3<>""),$V$2:$AH$2), which goes over the same range of columns for Semester_Registered where the second row has a header like 2014_1, 2014_2, etc., returns the header of the last column that is not blank (thus the last time period the student was registered in). This can then be used with an INDIRECT() in order to reference a named data set (I had to manually name all the data sets in each worksheet by time period) like =IFERROR(INDEX(INDIRECT(CONCATENATE("DATA_",AK3)),MATCH(T3,INDIRECT(CONCATENATE("Student_IDs_",AK3)),0),4),"") where AK3 contains the Final_Semester_Time, like 2014_1.
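For comparison, the three row-wise Excel checks above (distinct Major_ID count, MAX() of Semester_Registered, and the LOOKUP(2,1/(...)) last-non-blank trick) could be expressed roughly like this in Python. The function names are made up for illustration, and blanks are assumed to be empty strings. Unlike the SUMPRODUCT/COUNTIF formula, which counts the blank as one of the distinct values, this sketch simply ignores blanks.

```python
def major_changed(values):
    """True if more than one distinct non-blank value appears in the row."""
    return len({v for v in values if v != ""}) > 1

def highest_semester(values):
    """Largest non-blank Semester_Registered in the row, or None if all blank."""
    filled = [v for v in values if v != ""]
    return max(filled) if filled else None

def last_registered(headers, values):
    """Header of the last non-blank column, scanning from the right."""
    for header, value in zip(reversed(headers), reversed(values)):
        if value != "":
            return header
    return None
```

Here `headers` would be the list of time period labels (2014_1, 2014_2, and so on) and `values` the matching cells for one student.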
I have this Kusto code that I have been trying to develop and any help would be greatly appreciated.
The objective is to count to the first occurrence of the CurrentOwningTeamId in the OwningTeamId column.
I packed the Owning Team number and parsed the value into a column of its own. I need to count the owning teams until I get to the current owning team.
Columns are (example):
Objective: Count to the first occurrence of the CurrentOwningTeam value in the OwningTeamId column using Kusto (Application Insights code):
[CODE]
OwningTeamId, CurrentOwningTeam, CreateDate, RequestType
155523 **888888** 2017-07-02 PRIMARY
256924 **888888** 2017-08-02 TRANSFER
**888888** **888888** 2017-09-02 TRANSFER
954005 **888888** 2017-10-02 TRANSFER
**888888** **888888** 2017-11-02 TRANSFER
155523 **888888** 2017-12-02 TRANSFER
954005 **888888** 2017-13-02 TRANSFER
**888888** **888888** 2017-14-02 TRANSFER
[/CODE]
I think you can match the current owning team with the countof() function, but I don't know how to go about it using regex. Note: the values differ for each owning team on every incident, which is why I capture the owning team on the incident first and then try to count up to the very first instance of the CurrentOwningTeam number in the OwningTeamId column. In other words, I want to count the number of rows it takes to get to the very first occurrence of the current owning team. In this case, it would be three.
Note: OwningTeamId's and CurrentOwningTeam can change on every incident, I first capture the CurrentOwningTeam then try to match in the OwningTeamId column.
Note: This is just one incident, but I am trying to do multiple Incidents.
Below is how I got the Current Owning Team Value.
[CODE]
| extend CurrentOwningTeam=pack_array(OwningTeamId)
| parse CurrentOwningTeam with * "[" CurrentOwningTeam:int "]" *
| serialize CurrentOwningTeam
[/CODE]
I tried using row_number() but it will not work for multiple incidents, only per incident, so I have to use count or countof functions or another way of doing it.
Thanks for the clarification. Here is a suggestion for a query that counts rows, ordered by time, until a certain condition is reached (the count restarts for each IncidentId).
datatable(IncidentId:string, OwningTeamId:string, CurrentOwningTeam:string, CreateDate:datetime, RequestType:string)
[
'Id1','155523','888888',datetime(2017-02-07),'PRIMARY',
'Id1','256924','888888',datetime(2017-02-08),'TRANSFER',
'Id1','888888','888888',datetime(2017-02-09),'TRANSFER',
'Id1','954005','888888',datetime(2017-02-10),'TRANSFER',
'Id1','888888','888888',datetime(2017-02-11),'TRANSFER',
'Id1','155523','888888',datetime(2017-02-12),'TRANSFER',
'Id1','954005','888888',datetime(2017-02-13),'TRANSFER',
'Id1','888888','888888',datetime(2017-02-14),'TRANSFER',
// Id2
'Id2','155523','888888',datetime(2017-02-07),'PRIMARY',
'Id2','256924','888888',datetime(2017-02-08),'TRANSFER',
'Id2','999999','888888',datetime(2017-02-09),'TRANSFER',
'Id2','954005','888888',datetime(2017-02-10),'TRANSFER',
'Id2','888888','888888',datetime(2017-02-11),'TRANSFER',
'Id2','155523','888888',datetime(2017-02-12),'TRANSFER',
'Id2','954005','888888',datetime(2017-02-13),'TRANSFER',
'Id2','888888','888888',datetime(2017-02-14),'TRANSFER',
]
| order by IncidentId, CreateDate asc
| extend c= row_cumsum(1, IncidentId!=prev(IncidentId))
| where OwningTeamId == CurrentOwningTeam
| summarize arg_min(CreateDate, c) by IncidentId
Result:
IncidentId CreateDate c
Id1 2017-02-09 00:00:00.0000000 3
Id2 2017-02-11 00:00:00.0000000 5
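To show the logic outside of Kusto, here is a minimal Python sketch of the same idea: order by incident and date, number the rows within each incident, and keep the first row where the owning team equals the current owning team. The function name is hypothetical. The data mirrors the datatable above, with dates written as ISO strings so that they sort correctly as text.

```python
from itertools import groupby
from operator import itemgetter

id1_teams = ["155523", "256924", "888888", "954005",
             "888888", "155523", "954005", "888888"]
id2_teams = ["155523", "256924", "999999", "954005",
             "888888", "155523", "954005", "888888"]

# (IncidentId, OwningTeamId, CurrentOwningTeam, CreateDate)
rows = [
    (incident, team, "888888", f"2017-02-{7 + i:02d}")
    for incident, teams in (("Id1", id1_teams), ("Id2", id2_teams))
    for i, team in enumerate(teams)
]

def first_match_position(rows):
    """For each incident, return (CreateDate, position) of the first row
    whose OwningTeamId equals CurrentOwningTeam."""
    ordered = sorted(rows, key=itemgetter(0, 3))  # by incident, then date
    result = {}
    for incident, group in groupby(ordered, key=itemgetter(0)):
        # enumerate restarts at 1 for every incident, like row_cumsum
        # with a restart condition.
        for position, (_, owning, current, created) in enumerate(group, start=1):
            if owning == current:
                result[incident] = (created, position)
                break
    return result

print(first_match_position(rows))
# {'Id1': ('2017-02-09', 3), 'Id2': ('2017-02-11', 5)}
```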
Here are links to the docs that explain how to find the earliest record using the arg_min() aggregation, and to the row_cumsum() (cumulative sum) function.
https://learn.microsoft.com/en-us/azure/kusto/query/arg-min-aggfunction
https://learn.microsoft.com/en-us/azure/kusto/query/rowcumsumfunction
I figured it out by computing a RowNumber per Id inside the table, then summing it to get my total count.
[CODE]
| serialize Id
| extend RowNumber=row_number(1, (Id) ==Id)
| summarize TotalOwningTeamChanges=sum(RowNumber) by Id
[/CODE]
Then after that I got the Minimum Date to extract the entire data set to the first instance of the current OwningTeamName.
[CODE]
//Outside the scope of the table.
| extend ExtractFirstOwningTeamCreateDate=CreateDate2
| extend VeryFirstOwningTeamCreateDate=MinimumCreateDate
| where FirstOwningTeamRow == true or MinimumCreateDate <=
ExtractFirstOwningTeamCreateDate
| serialize VeryFirstOwningTeamCreateDate
[/CODE]
Basic requirements:
I have a table with a bunch of attributes (20-30), but only 3 are used in querying: User, Category, and Date, and would be structured something like this...
User | Category | Date | ...
1 | Red | 5/15
1 | Green | 5/15
1 | Red | 5/16
1 | Green | 5/16
2 | Red | 5/18
2 | Green | 5/18
I want to be able to query this table in the following 2 ways:
Most recent rows (based on Date) by User. e.g., User=1 returns the 2 rows from 5/16 (3rd and 4th row)
Most recent rows (based on Date) by User and Category. e.g., User=1, Category=Red returns the 5/16 row only (the 3rd row).
Is the best way to model this with a HASH on User, RANGE on Date, and a GSI with HASH on User+Category and RANGE on Date? Is there anything else that might be more efficient? If that's the path of least resistance, I'd still need to know how many rows to return, which would require doing a count against distinct categories or something?
I've decided that it's going to be easier to just change the way I'm storing the documents. I'll move the category and other attributes into a sub-document so I can easily query against User+Date and I'll do any User+Category+Date querying with some client-side code against the User+Date result set.
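For what it's worth, the client-side step could look roughly like this. It is only a sketch in Python over plain dictionaries, not DynamoDB SDK code. The function names are made up, and the dates are assumed to be (month, day) tuples so that they compare correctly.

```python
# Rows as they might come back from a User+Date query (assumed shape).
rows = [
    {"User": 1, "Category": "Red", "Date": (5, 15)},
    {"User": 1, "Category": "Green", "Date": (5, 15)},
    {"User": 1, "Category": "Red", "Date": (5, 16)},
    {"User": 1, "Category": "Green", "Date": (5, 16)},
    {"User": 2, "Category": "Red", "Date": (5, 18)},
    {"User": 2, "Category": "Green", "Date": (5, 18)},
]

def most_recent(rows):
    """Keep only the rows that share the newest Date."""
    if not rows:
        return []
    newest = max(r["Date"] for r in rows)
    return [r for r in rows if r["Date"] == newest]

def latest_for_user(rows, user):
    return most_recent([r for r in rows if r["User"] == user])

def latest_for_user_category(rows, user, category):
    # Filter on category first, so a category missing from the newest
    # date still returns its own most recent row.
    return most_recent([r for r in rows
                        if r["User"] == user and r["Category"] == category])
```

Filtering on the newest date also sidesteps the question of how many rows to return, since no count of distinct categories is needed.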
I have a database that is the output of a python script involving a basic game. When the code saves to the database, it saves to a table called points with the data: name, account_name, time, score. What I want is for the data to be saved into a second table, grouped by name; I will then do the same with account_name. Some of the points table:
name |account_name | time | score
oliver |Oliver | 10:29:14-01:04:2017 | 250
oliver |Oliver | 10:29:20-01:04:2017 | 500
dave |Oliver | 10:29:34-01:04:2017 | 250
What I want is for the data to be sorted into a table called name, where the score is totalled for all records with the same name and a column keeps track of how many entries have been merged (in this case, it will be equal to the number of games played). For example:
name | totalpoints | totalgames
oliver| 750 | 2
dave | 250 | 1
I will use this format to do the same with account_name. I have found information on how to group and sum the data but not into a second table. Thank you in advance.
First, create your table:
CREATE TABLE `stats` (
`name` TEXT PRIMARY KEY ON CONFLICT REPLACE,
`totalpoints` INTEGER,
`totalgames` INTEGER
);
then insert into your table with:
INSERT INTO stats
SELECT points.name, SUM(points.score) AS totalpoints, COUNT(*) AS totalgames
FROM points
GROUP BY points.name;
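Since the data comes from a Python script, here is a small self-contained sketch of the same two statements run through the standard sqlite3 module. It assumes SQLite (which the ON CONFLICT REPLACE clause suggests) and that the source table is called points, as in the question. The in-memory database and the helper name `refresh_stats` are just for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.executescript("""
CREATE TABLE points (name TEXT, account_name TEXT, time TEXT, score INTEGER);
INSERT INTO points VALUES
  ('oliver', 'Oliver', '10:29:14-01:04:2017', 250),
  ('oliver', 'Oliver', '10:29:20-01:04:2017', 500),
  ('dave',   'Oliver', '10:29:34-01:04:2017', 250);
CREATE TABLE stats (
  name TEXT PRIMARY KEY ON CONFLICT REPLACE,
  totalpoints INTEGER,
  totalgames INTEGER
);
""")

def refresh_stats(conn):
    """Recompute the totals; existing names are replaced, not duplicated."""
    conn.execute("""
        INSERT INTO stats
        SELECT name, SUM(score), COUNT(*)
        FROM points
        GROUP BY name
    """)
    conn.commit()

refresh_stats(conn)
refresh_stats(conn)  # running it again replaces the rows instead of failing

stats = conn.execute(
    "SELECT name, totalpoints, totalgames FROM stats ORDER BY name"
).fetchall()
print(stats)  # [('dave', 250, 1), ('oliver', 750, 2)]
```

Because of ON CONFLICT REPLACE on the primary key, the insert can be rerun after every game and the totals are overwritten with the current values.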
I am creating a report and have a field with multiple values that represent different data values, e.g. 4-Completeness, 5-accuracy, etc. What I need to do is make multiple columns, each with that field filtered down to one value. The problem is that when I try to edit the query item in the report, I get the error 'Boolean value expression as query item is not supported'. How do I fix this?
example:
ID column | Data Value = 4 | Actual Data | Data Value = 5
EDIT:
I currently have a case when [Data value] = 4 then [percentage] for the different columns but I am still getting wrong output. I am getting
ID1 | 45% | | |
ID1 | | 35% | |
ID1 | | | 67% |
I need all of ID1 to be in one row.
You can fix this by totaling by ID which will combine all three rows in your example to one:
total([Measure] for [ID])
Change each of the three percentage columns to use this expression, substituting their respective data item for [Measure].
Normally, you don't want to total percentages, but this is an exception. Since only one row has actual data, the total will match that row and the other two null values will not be included in the total.
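The effect of totaling by ID can be illustrated outside the report tool with a short Python sketch. This is not report-tool syntax, the function name is made up, and the values are the percentages from the question with None standing for the empty cells.

```python
# One row per (ID, data value); only one column is filled in per row.
rows = [
    ("ID1", 45, None, None),
    ("ID1", None, 35, None),
    ("ID1", None, None, 67),
]

def total_by_id(rows):
    """Sum each column per ID, skipping nulls, so the rows collapse into one."""
    totals = {}
    for row_id, *values in rows:
        acc = totals.setdefault(row_id, [None] * len(values))
        for i, value in enumerate(values):
            if value is not None:
                acc[i] = value if acc[i] is None else acc[i] + value
    return {row_id: tuple(acc) for row_id, acc in totals.items()}

print(total_by_id(rows))  # {'ID1': (45, 35, 67)}
```

Each column has a single non-null value per ID, so the total just carries that value through.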
A simple way would be to do it for each data value in three separate queries and join them on ID.