I am developing a report for a University. They require around 15 different groupings of their choice (e.g. Campus, Faculty, Course, School, Program, Major, Minor, Nationality, Mode of Study... and the list goes on).
They require the Headcount and EFTS (Equivalent Full Time Student) for each of these groupings.
They may require a single grouping, a selection in any order of groupings, or all groupings.
A solution was provided here: https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014834406#77777777-0000-0000-0000-000014837873
The solution suggests using conditional blocks and multiple lists. However, that would mean I would need 50+ lists, one for every possible combination of my groupings (e.g. Campus + Faculty, Campus + School, School + Faculty, Faculty + Campus, ...).
Is there any way for users to dynamically select the order of their groupings, and which groupings to include/exclude? Thanks
Let's say I have a dataset of ACT test scores. Each "observation" is a student's results from taking the ACT. The ACT has five subjects: reading, English, math, science, and writing (plus a composite score). Each test subject has a scale score, a national percentile rank, and a college readiness indicator (Y or N).
My question is (and always seems to be since I work a lot with assessment data), which format is "tidy"?
1. Where each row is a unique student test + subject combination, with a subject column and then scaleScore, percentile, and readiness columns for each value.
2. Where each row is a unique student test, with all the subjects and their respective values listed out in separate columns.
3. Or something like the first option, but split into six tables, one for each subject, with a key to join on?
I've been working in SQL + Excel for a while, but I want to expand my EDA skills in R. Any help would be much appreciated! The key focus is on subsequent visualization with ggplot. I'm guessing the answer may just be "it depends" with a willingness to gather and spread for different plotting purposes.
Columns being student, test, subject, scaleScore, percentile, readiness.
Student and test variables would identify each observation.
Subject is a variable. Reading, English, math, etc. are values of the subject variable. This is essentially the heart of the tidy approach, which tends to be deep, not wide, and lends itself to joining, grouping, plotting, and so forth.
OR to make it really tidy, score and scoreType are variables, and their respective values are included as observations.
Either way, in one table the student and test would be repeated on multiple rows. But this serves to illustrate the tidy perspective. Clearly, normalized tables are a worthy consideration, in terms of the big picture.
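For example, here is a minimal sketch (with made-up column names, assuming tidyr >= 1.0, where pivot_longer() supersedes gather()/spread()) of reshaping the wide per-test format into the deep per-subject format:

library(tidyr)
library(dplyr)

# Hypothetical wide data: one row per student test, one column per subject/measure
act_wide <- tibble(
  student = c("A", "B"),
  test    = c("2017-04", "2017-04"),
  reading_scaleScore = c(25, 30), reading_percentile = c(74, 93), reading_readiness = c("Y", "Y"),
  math_scaleScore    = c(22, 28), math_percentile    = c(63, 88), math_readiness    = c("N", "Y")
)

# Deep/tidy form: one row per student test + subject,
# with scaleScore, percentile, and readiness as value columns
act_tidy <- act_wide %>%
  pivot_longer(cols      = -c(student, test),
               names_to  = c("subject", ".value"),
               names_sep = "_")
act_tidy

From this deep form, grouping by subject or faceting in ggplot follows directly.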
Please consider the following situation:
I measure values every hour (time) (campaigns lasting from a few months to ~10 years)
of several species (1 to 10)
with several instruments (1 to 5)
on several measurement sites (~70)
and each site has several sampling levels (1 to 5)
and each value has a flag indicating if it is valid or not
I am looking for the fastest and simplest way to store these data, considering the fact that the database/files/whatever should be readable and writeable with R.
Note that:
Some experiments consist of measuring a few species over a very long time, for a single instrument and sampling level,
Some experiments consist of comparing the same few-month timeframe across a lot of sites (~70)
Some sites have many sampling levels and/or instruments (which will be compared)
The storage system must be readable (and if possible writeable) in parallel
What I tried so far:
A MySQL database, with one table per site/species, each table containing the following columns: time, sampling level, instrument, value, and flag. Of course, as the number of sites grows, the number of tables grows too. Comparing sites is painful, as it requires a lot of queries. Moreover, sampling level and instrument are repeated many times within each table, which wastes space.
NetCDF files: interesting for their ability to store multi-dimensional data, they are good for storing a set of data but are not practical for daily modification and not very "scalable".
Druid, a multidimensional database management system, originally "business intelligence"-oriented. The principle is good, but it is much too heavy and slow for my application.
Thus, I am looking for a system which:
Takes more or less the same time to retrieve
100 hours of data of 1 site, 1 species, 1 instrument, 1 sampling level, or
10 hours of data of 10 sites, 1 species, 1 instrument, 1 sampling level, or
10 hours of data of 1 site, 2 species, 1 instrument, 5 sampling levels, or
etc.
Allows parallel R/W
Minimizes the time to write to and read from the database
Minimizes disk space used
Allows easy addition of a new site, or instrument, or species, etc.
Works with R
A good system would be a kind of hypercube which allows complex requests on all dimensions...
A relational database with a multi-column primary key (or candidate key) is well suited to store this kind of multi-dimensional data. From your description, it seems that the appropriate primary key would be time, species, instrument, site, and sampling_level. The flag appears to be an attribute of the value, not a key. This table should have indexes for all the columns that you will use to select data for retrieval. You may want additional tables to store descriptions or other attributes of the species, instruments, and sites. The main data table would have foreign keys into each of these.
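As a rough illustration only (table and column names are assumptions, and SQLite simply stands in for whatever relational engine you choose), the schema could be created from R with DBI like this:

library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "measurements.sqlite")

# One fact table, keyed on all five dimensions; flag is an attribute of the value
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS measurement (
    time           TEXT    NOT NULL,
    species_id     INTEGER NOT NULL,
    instrument_id  INTEGER NOT NULL,
    site_id        INTEGER NOT NULL,
    sampling_level INTEGER NOT NULL,
    value          REAL,
    flag           INTEGER,
    PRIMARY KEY (time, species_id, instrument_id, site_id, sampling_level)
  )")

# Indexes on the columns used to filter, so the retrieval patterns listed above
# take roughly the same time whichever dimensions are fixed
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_site_time    ON measurement (site_id, time)")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_species_time ON measurement (species_id, time)")

dbDisconnect(con)

For heavy parallel writes, a server-based engine such as PostgreSQL or MySQL would be a better fit than SQLite; the schema itself carries over unchanged.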
I am working on a data set in R having dimensions
dim(adData)
[1] 15844717 11
Out of 11 features,
one feature has 273596 unique values (random integers used as an id) out of 15844717.
the second feature has 884353 unique values (random integers used as an id) out of 15844717.
My confusion is whether or not to convert them into factors, because categorical variables with a large number of levels will create problems at modelling time. Please suggest how to treat them.
I am new to Data Science and never worked on large data sets before.
~300k categories for one variable is sure to cause computational issues. I would first take a step back and examine the nature of this variable and its relevance to the prediction at hand. Without knowing the source of the data, it is hard to give specific advice.
If it is truly a categorical variable, it would be silly to leave the ids as numeric variables since the scale and order of the ids are likely meaningless.
Is it possible to group the levels into fewer but still meaningful categories?
Example 1: If the ids were zipcodes in the United States, there are potentially 40,000 unique values. These can be grouped into states or regions, reducing the number of levels to 50 or fewer.
Example 2: If the ids were product ids from an e-commerce site, they could be grouped by product category or sub-category. There would be far fewer distinct values to work with.
Another option is to examine the relative frequency of each category. If there are a few very common categories and thousands of rare categories, you can leave the common levels intact and group the rare levels into an 'other' category.
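As a sketch of that last idea (assuming the forcats package; the cut-off of 5 and the id values are made up), lumping the rare levels could look like this:

library(forcats)

set.seed(1)
# Hypothetical id variable: a few common ids plus many ids seen only once
ids <- factor(c(sample(c("id_1", "id_2", "id_3"), 500, replace = TRUE),
                paste0("id_", 100:400)))

# Keep levels appearing at least 5 times; collapse the rest into "other"
ids_lumped <- fct_lump_min(ids, min = 5, other_level = "other")
table(ids_lumped)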
This is an unusual and difficult question which has perplexed me for a number of days, and I hope I explain it correctly. I have two databases, i.e. data frames in R. The first is approximately 90,000 rows and is a record of every racehorse in the UK. It contains numerous fields, most importantly the NAME of each horse and its SIRE; one record per horse (first database, sample and fields). The second database contains over one million rows and is a history of every race a horse has taken part in over the last ten years, i.e. races it has run, or as I call them, 'appearances'; it contains NAME, DATE, TRACK, etc.; one record per appearance (second database, sample and fields).
What I am attempting to do is write a few lines of code - not a loop - that will provide me with the total number of appearances made by the siblings of a particular horse, i.e. one grand total. The first step is easy - finding the siblings, i.e. horses with a common sire - and you can see it below (N.B. FindSire is my own function which does what it says and finds the sire of a horse by referencing the same data frame. I have simplified the code somewhat for clarity):
TestHorse <- "Save The Bees"
Siblings <- which(FindSire(TestHorse) == Horses$Sire)  # rows of horses sharing the same sire
Sibsname <- Horses[Siblings, 1]                        # their names (first column)
This produces Sibsname, which is 636 names long (snippet below), although the average horse will only have 50 or so siblings. I could construct a loop, search the second 'appearances' data frame, individually match the sibling names, and then total the appearances of all the siblings combined. However, I would like to know if I could avoid a loop - and the time associated with it - and write a few lines of code to achieve the same end, i.e. search for all 636 horses in the appearances database, calculate the number of times each appears, and total all these appearances; or to put it another way, how many races have the siblings of "Save The Bees" taken part in? Thanks in advance.
[1] "abdication " "aberdonian " "acclamatory " "accolation " ..... to [636]
Using dplyr, calling your "first database" horses and your "second database" races:
library(dplyr)
test_horse = "Save The Bees"
select(horses, Name, Sire) %>%
  filter(Sire == Sire[Name == tolower(test_horse)]) %>%    # keep the test horse's sibling group
  inner_join(races, c("Name" = "SELECTION_NAME")) %>%      # one row per appearance by a sibling
  summarize(horse = test_horse, sibling_group_races = n()) # grand total of those appearances
I am making the assumption that you want the number of appearances of the sibling group to include the appearances of the test horse - to omit them instead add , Name != tolower(test_horse) to the filter() command.
As you haven't shared data reproducibly, I cannot test the code. If you have additional problems I will not be able to help you solve them unless you share data reproducibly. ycw's comment has a helpful link for doing that - I would encourage you to edit your question to include either (a) code to simulate a small sample of data, or (b) the output of dput() on a small sample of your data, to share a few rows in a copy/pasteable format.
The code above will do for querying one horse at a time - if you intend to use it frequently it would be much simpler to just create a table where each row represents a sibling group and contains the number of races. Then you could just reference the table instead of calculating on the fly every time. That would look like this:
sibling_appearances =
  left_join(horses, races, by = c("Name" = "SELECTION_NAME")) %>%
  group_by(Sire) %>%
  summarize(offspring_appearances = n())
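A single lookup (using your own FindSire() helper, which I assume returns the sire's name) then replaces the per-horse calculation:

# Total appearances for the sibling group of one horse
filter(sibling_appearances, Sire == FindSire("Save The Bees"))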
Let's say that I have the following data frame with a user id and location as the two columns. A user id can have multiple locations. I'm interested in finding the count of each possible location sequence based on the user id.
So if my data looked like this:
places = data.frame(user_id = c(1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5),
                    location = c("home", "school", "work", "home", "school", "work",
                                 "lunch", "airport", "gym", "breakfast", "work", "home"))
places
I want to find the following:
freq = data.frame(location_path = c("home - school", "work", "home - school - work",
                                    "lunch - airport", "gym - breakfast - work - home"),
                  count = c(1, 1, 1, 1, 1))
freq
This second data frame tells me that the 'home' then 'school' pairing occurred twice in that order, while the home, school, and work sequence occurred only once.
Of course, there could be instances where the same path occurs multiple times. In the following case, the home, school, and work path would have a count of 2.
places = data.frame(user_id = c(1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6),
                    location = c("home", "school", "work", "home", "school", "work",
                                 "lunch", "airport", "gym", "breakfast", "work", "home",
                                 "home", "school", "work"))
places
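One possible approach, as a sketch (assuming dplyr >= 1.0 and that the row order within each user_id reflects the visit order): collapse each user's locations into a single path string, then count identical paths.

library(dplyr)

freq <- places %>%
  group_by(user_id) %>%
  summarize(location_path = paste(location, collapse = " - "), .groups = "drop") %>%
  count(location_path, name = "count")
freq
# With the second places above, "home - school - work" gets a count of 2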