Julia DataFrames: Problems with Split-Apply-Combine strategy

I have some data (from an R course assignment, but that doesn't matter) to which I want to apply the split-apply-combine strategy, but I'm having some problems. The data is in a DataFrame called outcome, and each row represents a hospital. Each column holds a piece of information about that hospital, like name, location, rates, etc.
My objective is to obtain the hospital with the lowest "Mortality by Heart Attack Rate" in each State.
I was playing around with some strategies and ran into a problem using the by function:
best_heart_rate(df) = sort(df, cols = :Mortality)[end,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
The idea was to split the hospitals DataFrame by State, sort each of the SubDataFrames by mortality rate, take the lowest one, and combine the resulting rows into a new DataFrame.
But when I used this strategy, I got:
ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
in f at none:1
in based_on at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
in by at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202
I suppose the nrow function is not implemented for SubDataFrames, hence the error. So I used nastier code:
best_heart_rate(df) = (df[sortperm(df[:,:Mortality] , rev=true), :])[1,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
Seems to work. But now there is an NA problem: how can I remove the rows from the SubDataFrames that have NA in the Mortality column? Is there a better strategy to accomplish my objective?

I think this might work, if I've understood you correctly:
# Let me make up some data about hospitals in states
using DataFrames, StatsBase  # sample comes from StatsBase
hospitals = DataFrame(State=sample(["CA", "MA", "PA"], 10), mortality=rand(10), hospital=split("abcdefghij", ""))
hospitals[3, :mortality] = NA
# You can use the indmax function to find the index of the maximum element
by(hospitals[complete_cases(hospitals), :], :State, df -> df[indmax(df[:mortality]), [:mortality, :hospital]])
State mortality hospital
1 CA 0.9469632421111882 j
2 MA 0.7137144590022733 f
3 PA 0.8811901895164764 e

Related

Creating a large frequency table from 'scratch' with specific ratios / values?

I have a problem that I can't figure out how to solve.
I have 3 (tibble) data frames with just names of different populations.
df1 is all unique surnames in Sweden, with a column giving the count for each name.
There are 382,492 unique names, and summing the counts gives 10,002,985 people in df1.
10,002,985 is then the total population in this 'experiment'.
df2 is a list of all registered lawyers in Sweden:
6,211 lawyers in total in the population.
df3 is a list of all people with noble family surnames in Sweden:
there are 542 unique names and 46,851 people with noble surnames in the population.
We also know that in the lawyer subgroup there are:
106 lawyers with a noble surname.
Now my problem is that I want to create just one df with all this info.
It should look like this:
The main idea is to create a df with one row per person in the population: 10,002,985 rows.
noble and lawyer are then dummy variables where 1 = yes, 0 = no. So, for example: in the total population, 46,851 people should have noble = 1, and 106 of that group should also have lawyer = 1.
Notice that I don't really care what the names are - I just care about the ratios.
Notice also that the reason I want to create a new data frame without the names is that I think this is the only way to solve the problem, or at least the easiest. But if anyone insists, I can upload some sample data from each df.
In the end I want to run some probability tests.
Let me know if the question is confusing. Also, let me know if this is a really dumb way to go about this :p
SOLUTION:
It was quite easy once I realized what I was looking for :)
There is probably a more elegant solution.
library(tibble)

# pop: one row per person in the total population
pop <- 1:10002985

# noble: the first 46851 people are noble, the remaining 9956134 are not
n <- c(46851, 9956134)
noble <- rep(1:0, n)

# attorney: 106 noble lawyers, then 46745 noble non-lawyers,
# then 6105 non-noble lawyers, then 9950029 non-noble non-lawyers
# (106 + 6105 = 6211 lawyers in total, matching the counts above)
a <- c(106, 46745, 6105, 9950029)
attorney <- rep(c(1, 0, 1, 0), a)

final_data <- tibble(pop, noble, attorney)
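As a quick sanity check (my addition, not part of the original solution), the marginal sums and the cross-tabulation should reproduce the stated counts:
sum(final_data$noble)     # 46851 nobles
sum(final_data$attorney)  # 6211 lawyers
with(final_data, table(noble, attorney))  # 106 in the noble = 1, attorney = 1 cell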

Change distribution of fields in a categorical variable by given distribution

Let's say we have a data.frame such as the one below:
City
NYC
Boston
NYC
NYC
Providence
Boston
NYC
I want to write the simplest possible function
redistribute <- function(data, column, unique_value, decrease_by) {
  # data = dataframe provided by user
  # column = column of the respective dataframe
  # unique_value = a value contained within the respective column of the dataframe
  # decrease_by = the desired "portion" or "distribution" of the unique_value within column
}
Edit:
I will rephrase the question, as it seems to be slightly confusing.
I need to calculate the frequency of unique_value within the column. For example, that would be 4/7, or 0.57, for NYC in the City column.
Then decrease the number of occurrences of unique_value so that its frequency reaches the one provided by the user in the decrease_by argument. For example, from 0.57 down to, say, 0.10 for NYC.
Then replace the fields originally occupied by unique_value with other values from the column, chosen at random. For example, we remove occurrences of the 'NYC' field to reduce the overall frequency of 'NYC' from 0.57 toward 0.10, and replace each removed one with some random city, 'Boston' for example.
So the expected outcome would be:
City
NYC
Boston
Boston
Providence
Boston
Providence
Boston
I'd like to avoid doing a dozen transformations. I'm looking for the most logical/efficient approach.
I think what you're trying to do is really just a matter of putting a few things together into a function. Using your example, let's assume new_level is the proportion of that factor level that you want in the new data.
city = c("NYC", "Boston", "NYC", "NYC", "Providence", "Boston", "NYC")
data = data.frame(city=city)
redistribute <- function(data, column, unique_value, new_level){
  ## Names of factor levels and size of data
  fac_names <- levels(factor(data[, column]))
  size <- nrow(data)
  ## Build a new vector using rep and sample with the desired ratio
  new_col <- c(rep(unique_value, floor(new_level * size)),
               sample(fac_names[which(fac_names != unique_value)],
                      size = size - floor(new_level * size),
                      replace = TRUE))
  ## Mix up and assign back to the data frame
  data[, column] <- sample(new_col)
  return(data)
}
redistribute(data, column = "city",
             unique_value = "NYC",
             new_level = 0.3)
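Since sample() is random, every call shuffles differently; here is a quick check (my addition) that the target proportion is reached:
set.seed(1)  # fix the seed so the shuffle is reproducible
out <- redistribute(data, column = "city", unique_value = "NYC", new_level = 0.3)
prop.table(table(out$city))  # "NYC" should be floor(0.3 * 7) / 7 = 2/7, about 0.29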

Table of average score of peer per percentile

I'm quite a newbie in R, so I'm interested in the optimality of my solution. Even though it works, it could be (a bit) long, and I wanted your advice on whether the way I solved it is the best; it would help me learn new techniques and functions in R.
I have a dataset on students identified by their id, and for each student I have the school they are matched to and the score they obtained on a specific test (so, for short, 3 variables: id, match and score).
I need to construct the following table: for students between two percentiles of score, I need to calculate the average (across students) of the average score of the students at the school they are matched to. So for each school I take the average score of the students matched to it, and then I average those school averages within percentile classes (yes, a school's average can appear twice in this calculation). In plain English, it lets me answer: "A student belonging to the x-th percentile in terms of score will on average be matched to a school of this average quality".
Here is an example:
So in that case, if I split at the median (15) rather than at percentiles, I would like to obtain:
[0,15] : 9.5
(15,24] : 20.25
So for students having a score between 0 and 15, I take the average of the average scores of the schools they are matched to (note that b's average will appear twice, but that's OK).
Here is how I did it:
match <- c("a", "b", "a", "b", "c")
score <- c(18, 4, 15, 8, 24)
scoreQuant <- cut(score, quantile(score, probs = seq(0, 1, 0.1), na.rm = TRUE))
AvgeSchScore <- tapply(score, match, mean, na.rm = TRUE)
AvgScore <- 0
for (i in 1:length(score)) {
  AvgScore[i] <- AvgeSchScore[match[i]]
}
results <- tapply(AvgScore, scoreQuant, mean, na.rm = TRUE)
Do you have a more direct way of doing it? I think the weak point is step 3), the loop; maybe apply() would be better? But I'm not sure how to use it here (I tried to write my own function but it crashed, so I brute-forced it).
Thanks :)
The main fix is to eliminate the for loop with:
AvgScore <- AvgeSchScore[match]
R allows you to subset in ways that you cannot in other languages. The tapply function returns a vector named by the levels of the factor you grouped by; indexing that vector with match looks up each student's school average by name.
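A minimal illustration of this named-vector lookup (my own toy example, not from the answer):
v <- c(a = 16.5, b = 6, c = 24)
v[c("a", "b", "a", "b", "c")]  # 16.5 6.0 16.5 6.0 24.0, one element per student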
data.table
If you would like to try data.table you may see speed improvements.
library(data.table)
match <- c("a","b","a","b","c")
score <- c(18,4,15,8,24)
dt <- data.table(id=1:5, match, score)
scoreQuant <- cut(dt$score,quantile(dt$score,probs=seq(0,1,0.1),na.rm=TRUE))
dt[, AvgeScore := mean(score), match][, mean(AvgeScore), scoreQuant]
# scoreQuant V1
#1: (17.4,19.2] 16.5
#2: NA 6.0
#3: (12.2,15] 16.5
#4: (7.2,9.4] 6.0
#5: (21.6,24] 24.0
It may be faster than base R. The NA row appears because cut() excludes the left endpoint of the lowest interval by default, so the minimum score gets no bin; if that bothers you, you can delete the row afterwards.
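Alternatively, you can avoid the NA in the first place: base R's cut() accepts include.lowest = TRUE so the minimum score lands in the first interval (my addition, not in the original answer):
scoreQuant <- cut(dt$score, quantile(dt$score, probs = seq(0, 1, 0.1), na.rm = TRUE),
                  include.lowest = TRUE)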

How do I generate a dataframe displaying the number of unique pairs between two vectors, for each unique value in one of the vectors?

First of all, I apologize for the title. I really don't know how to succinctly explain this issue in one sentence.
I have a dataframe where each row represents some aspect of a hospital visit by a patient. A single patient might have thousands of rows for dozens of hospital visits, and each hospital visit could account for several rows.
One column is Medical.Record.Number, which corresponds to patient IDs, and the other is Patient.ID.Visit, which corresponds to an ID for an individual hospital visit. I am trying to calculate the number of hospital visits each patient has had.
For example:
Medical.Record.Number    Patient.ID.Visit
AAAXXX           1111
AAAXXX           1112
AAAXXX           1113
AAAZZZ           1114
AAAZZZ           1114
AAABBB           1115
AAABBB           1116
would produce the following:
Medical.Record.Number   Number.Of.Visits
AAAXXX          3
AAAZZZ          1
AAABBB          2
The solution I am currently using is the following, where "data" is my dataframe:
# this function returns the number of unique hospital visits associated with
# the supplied record number
countVisits <- function(record.number){
  visits.by.number <- data$Patient.ID.Visit[which(data$Medical.Record.Number == record.number)]
  return(length(unique(visits.by.number)))
}
recordNumbers <- unique(data$Medical.Record.Number)
visits <- integer()
for (record in recordNumbers){
  visits <- c(visits, countVisits(record))
}
visit.counts <- data.frame(recordNumbers, visits)
This works, but it is pretty slow. I am dealing with potentially millions of rows of data, so I'd like something efficient. From what little I know about R, I know there's usually a faster way to do things without using a for-loop.
This essentially looks like a table() operation after you take out duplicates. First, some sample data
#sample data
dd<-read.table(text="Medical.Record.Number Patient.ID.Visit
AAAXXX 1111
AAAXXX 1112
AAAXXX 1113
AAAZZZ 1114
AAAZZZ 1114
AAABBB 1115
AAABBB 1116", header=T)
then you could do
tt <- table(Medical.Record.Number=unique(dd)$Medical.Record.Number)
as.data.frame(tt, responseName="Number.Of.Visits") #to get a data.frame rather than named vector (table)
# Medical.Record.Number Number.Of.Visits
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
Or you could also think of this as an aggregation problem
aggregate(Patient.ID.Visit~Medical.Record.Number, dd, function(x) length(unique(x)))
# Medical.Record.Number Patient.ID.Visit
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
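For completeness, the same per-patient count as a named vector via base R's tapply (my addition, not part of the original answer):
tapply(dd$Patient.ID.Visit, dd$Medical.Record.Number, function(x) length(unique(x)))
# AAABBB AAAXXX AAAZZZ
#      2      3      1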
There are many ways to do this; @MrFlick provided a handful of perfectly valid approaches. Personally, I'm fond of the data.table package. It's faster on large data frames, and I find the logic more intuitive than the base functions. I'd check it out if you are having problems with execution time.
library(data.table)
med.dt <- data.table(med_tbl)  # med_tbl stands in for your full table
num.visits.dt <- med.dt[, .(num_visits = length(unique(Patient.ID.Visit))),
                        by = Medical.Record.Number]
data.table should be much faster than data.frame on large tables.
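To see it run, here is the same call wired to the sample data dd built above (my wiring; the original answer assumed your real table):
library(data.table)
med.dt <- as.data.table(dd)
med.dt[, .(num_visits = length(unique(Patient.ID.Visit))), by = Medical.Record.Number]
#    Medical.Record.Number num_visits
# 1:                AAAXXX          3
# 2:                AAAZZZ          1
# 3:                AAABBB          2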

Summary statistics of retail prices grouped by categorical data

I need some help writing a function that takes three categorical inputs and returns a vector of summary statistics based on these inputs.
The data set contains information on retail goods that can be specified by their retail segment, brand name, and type of good along with its retail price and what it actually sold for.
Now I need to write a function that will take these inputs and average, count, and calculate whatever else is needed.
I have set the function up as follows (using made up data):
dataold = data.frame(segment=c("golf","tenis","football","tenis","golf","golf"),
                     brand=c("x","y","z","y","x","a"),
                     type=c("iron","ball","helmet","shoe","driver","iron"),
                     retail=c(124,.60,80,75,150,108),
                     actual=c(112,.60,72,75,135,100))
retailsum = funtion(segment,brand,type){
  datanew = dataold[which(dataold$segment='segment' &
                          dataold$brand='brand' &
                          dataold$type='type'),c("retail","actaul")]
  summary = c(dim(datanew)[1],colMeans(datanew))
  return(summary)
}
The code inside the function braces works on its own, but once I wrap a function around it I start getting errors, or it just returns 0 counts and NaN for the means.
Any help would be greatly appreciated. I have very little experience in R, so I apologize if this is a trivial question, but I have not been able to find a solution.
There are rather a lot of errors in your code, including:
misspelling of function
using single = (assignment) rather than == (equality test)
misspelling of actual (as actaul)
hardcoding of segment, brand and type in your function, rather than referencing the arguments.
This is how your function could look; referencing the data frame's columns explicitly keeps the arguments from being masked, so the comparisons actually filter rows:
retailsum <- function(data, segment, brand, type, FUN = colMeans){
  ## Inside with(data, ...), segment == segment would compare the column to
  ## itself; reference the columns explicitly, and use elementwise & (not &&)
  ## so every row is tested
  x <- data[data$segment == segment & data$brand == brand & data$type == type,
            c("retail", "actual")]
  match.fun(FUN)(x)
}
retailsum(dataold, "golf", "x", "iron", colMeans)
# retail actual
#    124    112
And here is a (possibly much more flexible) solution using the plyr package. This calculates your function for all combinations of segment, brand and type:
library(plyr)
ddply(dataold, .(segment, brand, type), colwise(mean))
segment brand type retail actual
1 football z helmet 80.0 72.0
2 golf a iron 108.0 100.0
3 golf x driver 150.0 135.0
4 golf x iron 124.0 112.0
5 tenis y ball 0.6 0.6
6 tenis y shoe 75.0 75.0
Andrie's solution is pretty complete already. (ddply is cool! Didn't know about that function...)
Just one addition, though: if you want to compute the summary values over all possible combinations, you can do this as a one-liner using R's built-in by function:
by(dataold, list(dataold$segment, dataold$brand, dataold$type),
   function(x) summary(x[, c('retail', 'actual')]))
That is not strictly what you asked for, but may still be instructive.
