Change distribution of fields in a categorical variable by given distribution

Let's say we have a data.frame such as the one below:
City
NYC
Boston
NYC
NYC
Providence
Boston
NYC
I want to write the simplest possible function:
redistribute <- function(data, column, unique_value, decrease_by) {
  # data         = data frame provided by the user
  # column       = the column of that data frame to operate on
  # unique_value = a value occurring within that column
  # decrease_by  = the desired "portion" or "distribution" of unique_value within the column
}
Edit:
I will rephrase the question, as it seems to be slightly confusing.
I need to calculate the frequency of unique_value (the argument) within the column. For example, that would be 4/7, or 0.57, for NYC in the City column.
Decrease the number of occurrences of unique_value so that its frequency reaches the one provided by the user in the function argument. For example, from 0.57 down to decrease_by for NYC; say, from 0.57 to 0.10.
Replace the fields originally occupied by unique_value with the other values in the column. Do this randomly. For example, we remove occurrences of the 'NYC' field to reduce the overall frequency of the unique value 'NYC' from 0.57 to 0.10, and replace each with some other random city, 'Boston' for example.
So the expected outcome would be:
City
NYC
Boston
Boston
Providence
Boston
Providence
Boston
I'd like to avoid doing a dozen transformations. I'm looking for the most logical/efficient approach.

I think what you're trying to do is really just putting a few things together into a function. Using your example, let's assume new_level is the percentage of that factor that you want in the new data.
city = c("NYC", "Boston", "NYC", "NYC", "Providence", "Boston", "NYC")
data = data.frame(city=city)
redistribute <- function(data, column, unique_value, new_level){
## Names of factors and size of data
fac_names <- levels(factor(data[,column]))
size <- nrow(data)
## Make new list using rep and sample with desired ratio
new_col <- c(rep(unique_value,
floor(new_level*size)),
sample(fac_names[which(fac_names!=unique_value)],
size=(size-floor(new_level*size)),
replace=TRUE))
## Mix up and assign to data frame
data[,column] <- sample(new_col)
return(data)
}
redistribute(data, column = "city", unique_value = "NYC", new_level = 0.3)
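Because the non-target values are drawn with sample(), the output differs from run to run; a quick check of the achieved frequency might look like this (the seed is arbitrary, used only for reproducibility):
set.seed(1)  # arbitrary seed so the example is reproducible
out <- redistribute(data, column = "city", unique_value = "NYC", new_level = 0.3)
mean(out$city == "NYC")  # floor(0.3 * 7) / 7 = 2/7, roughly 0.29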

Related

Create a subset of the dataset from Part 1 with only the top 5 departments based on the number of employees working in that department

Question:
Initialize the city of Boston earnings dataset as shown below:
boston <- read.csv("https://people.bu.edu/kalathur/datasets/bostonCityEarnings.csv",
                   colClasses = c("character", "character", "character",
                                  "integer", "character"))
Generate a subset of the Boston earnings dataset with only the top 5 departments based on the number of employees working in each department. The top 5 departments should be computed using R code. Then use the %in% operator to create the required subset.
Use a sample size of 50 for each of the following.
Set the start seed for random numbers as the last 4 digits of 1000
a) Show the sample drawn using simple random sampling without replacement.
Show the frequencies for the selected departments.
Show the percentages of these with respect to sample size.
I've tried to write the code, but I still don't know how to create the subset with the top 5 departments, and I don't know how to turn the frequencies into percentages either.
Thank you all for the help!
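One possible approach (a sketch; it assumes the department column in the CSV is named Department, which the question does not show):
boston <- read.csv("https://people.bu.edu/kalathur/datasets/bostonCityEarnings.csv",
                   colClasses = c("character", "character", "character",
                                  "integer", "character"))
# Top 5 departments by number of employees
top5 <- names(sort(table(boston$Department), decreasing = TRUE)[1:5])
# Subset with the %in% operator
boston_top5 <- boston[boston$Department %in% top5, ]
# a) Simple random sample of size 50 without replacement
set.seed(1000)  # reading the question's seed instruction as 1000
srs <- boston_top5[sample(nrow(boston_top5), 50, replace = FALSE), ]
table(srs$Department)              # frequencies for the selected departments
100 * table(srs$Department) / 50   # percentages with respect to the sample size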

Filter factor variable based on counts

I have a dataframe containing house price data, with price and lots of variables. One of these variables is a "sub-area" for the property, and I am trying to incorporate this into various regressions. However, it is a factor variable with almost 3000 levels.
For example:
table(df$sub_area)

  La Jolla   Carlsbad  Escondido
         2          5          1
...etc
I want to filter out those places that have only 1 count, since they don't offer much predictive power but add lots of computation time. However, I want to replace the sub_area entry for that property with blank or NA, since I still want to use the rest of the information for that property, such as bedrooms, bathrooms, etc.
For reference, an individual property entry might look like:
ID Beds Baths City Sub_area sqm... etc
1 4 2 San Diego La Jolla 100....
Then I can do
lm(price ~ beds + baths + city + sub_area)
under the new, smaller sub_area variable with fewer levels.
I want to do this because most of the predictive price power is contained in sub_area for the locations I'm working on.
One way, in base R (the threshold of 10 is arbitrary; use > 1 to drop only the single-count levels):
areas <- names(which(table(df$Sub_area) > 10))
df$Sub_area[!df$Sub_area %in% areas] <- NA
Create a new dataframe with the number of occurrences for each subarea and keep the subareas that occur at least twice.
Then add NAs to the original dataframe if the subarea does not appear in the filtered sub_area_count.
library(dplyr)
sub_area_count <- df %>%
  count(sub_area) %>%
  filter(n > 1)

boo <- !df$sub_area %in% sub_area_count$sub_area
df$sub_area[boo] <- NA
You didn't give a reproducible example, but I think this will work for identifying the places where count == 1:
count_1 <- as.data.frame(table(df$sub_area))
count_1 <- count_1$Var1[which(count_1$Freq == 1)]
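With count_1 computed, the singleton sub-areas can then be blanked out the same way as in the answers above (count_1 is a factor vector of level names, which %in% compares by value):
df$sub_area[df$sub_area %in% count_1] <- NA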

R Storing Values from For Loop to Data Frame

I am trying to write a function that has 2 arguments: column name and ranking number. The function will read a CSV file that has hospitals from every state. The function should return a data frame with the hospital name that was at the specified rank.
My solution has been to split the main CSV file by state, order each data frame by the desired column, loop through each state's data frame, grab the row (where row number = rank number), store each state's hospital name into a vector, then create a dataframe using the vector from the loop.
When I test each part of my function in the console, I am able to receive the results I need. However, when I run the function altogether, it isn't storing the hospital names as desired.
Here's what I have:
rankall <- function(outcome, num = "best") {
  outcomedf <- read.csv("outcome-of-care-measures.csv")
  # using this as a test
  outcomedf <- outcomedf[order(outcomedf[, 11], outcomedf[, 2]), ]
  # create empty vectors for hospital name and state
  hospital <- c()
  state <- c()
  # split the read dataframe
  splitdf <- split(outcomedf, outcomedf$State)
  # for loop through each split df
  for (i in 1:length(splitdf)) {
    # store the ranked hospital name into hospital vector
    hospital[i] <- as.character(splitdf[[i]][num, 2])
    # store the ranked hospital state into state vector
    state[i] <- as.character(splitdf[[i]][, 7])
  }
  # create a df with hospital and state
  rankdf <- data.frame(hospital, state)
  return(rankdf)
}
When I run the function altogether, I receive NA in my 'hospital' column, but when I run each part of the function individually, I get the desired hospital names. I'm a little confused as to why each individual part returns the results I want outside of the function, but not when I run the function as a whole. Thank you.
This question is related to Programming Assignment 3 in the Johns Hopkins University R Programming course on Coursera. As such, we can't provide a "corrected version" of the code because doing so would violate the Coursera Honor code.
When run with the default settings, your implementation of the rankall() function fails because, when you use num in the extract operator, there is no ["best", 2] row. If you run your code as head(rankall("heart attack", 1)), it produces the following output.
> head(rankall("heart attack",1))
hospital state
1 PROVIDENCE ALASKA MEDICAL CENTER AK
2 CRESTWOOD MEDICAL CENTER AL
3 ARKANSAS HEART HOSPITAL AR
4 MAYO CLINIC HOSPITAL AZ
5 GLENDALE ADVENTIST MEDICAL CENTER CA
6 ST MARYS HOSPITAL AND MEDICAL CENTER CO
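To see why, note that indexing a data frame with a row name that does not exist simply returns NA; a toy example (unrelated to the assignment data):
df <- data.frame(hospital = c("x", "y"), rate = 1:2)
df["best", 2]  # no row is named "best", so this returns NA
df[1, 2]       # a numeric rank works as intended and returns 1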
To completely correct your function, you'll need to make the following changes.
Add code to sort the data by the desired outcome column: heart attack, heart failure, or pneumonia.
Add code to handle the condition when a state does not have enough hospitals to return a valid result (e.g. the 43rd-ranked hospital for pneumonia in Puerto Rico).
Add code to handle "best" and "worst" as inputs to the num argument.

Averaging rows based upon a known, irregular relationship using R

I have data on energy companies whose jurisdiction overlaps in places. I want to be able to compute an average of sales for the places where these companies overlap. These companies will always overlap - so how can I use this information to calculate the averages just for those pairs? There are about 20 pairs of companies.
data <- data.frame(Company = c("Energy USA", "Good Energy", "Hydropower 4 U",
                               "Coal Town", "Energy USA/Good Energy",
                               "Good Energy/Coal Town"),
                   Sales = c(100, 2500, 550, 6000, "?", "?"))
Company Sales
1 Energy USA 100
2 Good Energy 2500
3 Hydropower 4 U 550
4 Coal Town 6000
5 Energy USA/Good Energy ? (Answer: 1300)
6 Good Energy/Coal Town ? (Answer: 4250)
We use grepl to get a logical index of the 'Company' elements that contain more than one company, i.e. those separated by '/'. Then we split those elements by the delimiter (the output is a list), loop through the list with sapply, match each element against the 'Company' column to get its position, and use that to pull the corresponding 'Sales' elements. As the 'Sales' column is a factor, we need to convert it to numeric to take the mean. When we convert factor to numeric, all non-numeric elements (i.e. '?') become NA. Finally, we replace those NA elements with the computed means.
i1 <- grepl('/', data$Company)
v1 <- sapply(strsplit(as.character(data$Company[i1]), '/'),
             function(x) mean(as.numeric(as.character(
               data$Sales[match(x, data$Company)]))))
data$Sales <- as.numeric(as.character(data$Sales))
data$Sales[is.na(data$Sales)] <- v1
data
# Company Sales
#1 Energy USA 100
#2 Good Energy 2500
#3 Hydropower 4 U 550
#4 Coal Town 6000
#5 Energy USA/Good Energy 1300
#6 Good Energy/Coal Town 4250
Without knowing how your original data is structured, it is hard to give a working answer. However, assuming your data has Company and Sales columns with multiple rows for each company, you can do something like this:
mean(data$Sales[data$Company %in% c('Energy USA', 'Good Energy')])
mean(data$Sales[data$Company %in% c('Good Energy', 'Coal Town')])
You could create a new column "jurisdiction" in "data", if your dataset is rather small:
MeansByJurisdiction <- tapply(data$Sales, data$jurisdiction, mean)
Then you could convert the vector to a data frame:
MeansByJurisdiction <- data.frame(MeansByJurisdiction)
The row names of the MeansByJurisdiction data frame will be populated with the jurisdictions, and you can extract them with a simple line of code:
MeansByJurisdiction$jurisdictions <- row.names(MeansByJurisdiction)
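For illustration, a hypothetical long-format version of the data, where each row is one company's sales inside a shared jurisdiction (the column names are made up to match the code above):
data <- data.frame(Sales = c(100, 2500, 2500, 6000),
                   jurisdiction = c("Energy USA/Good Energy",
                                    "Energy USA/Good Energy",
                                    "Good Energy/Coal Town",
                                    "Good Energy/Coal Town"))
MeansByJurisdiction <- data.frame(MeansByJurisdiction = tapply(data$Sales, data$jurisdiction, mean))
MeansByJurisdiction$jurisdictions <- row.names(MeansByJurisdiction)
MeansByJurisdiction
#                        MeansByJurisdiction          jurisdictions
# Energy USA/Good Energy                1300 Energy USA/Good Energy
# Good Energy/Coal Town                 4250  Good Energy/Coal Town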

Julia DataFrames: Problems with Split-Apply-Combine strategy

I have some data (from an R course assignment, but that doesn't matter) that I want to analyze with a split-apply-combine strategy, but I'm having some problems. The data is in a DataFrame, called outcome, and each row represents a hospital. Each column has a piece of information about that hospital, like name, location, rates, etc.
My objective is to obtain the Hospital with the lowest "Mortality by Heart Attack Rate" of each State.
I was playing around with some strategies, and got a problem using the by function:
best_heart_rate(df) = sort(df, cols = :Mortality)[end,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
The idea was to split the hospitals DataFrame by State, sort each of the SubDataFrames by mortality rate, take the lowest one, and combine the rows into a new DataFrame.
But when I used this strategy, I got:
ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
in f at none:1
in based_on at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
in by at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202
I suppose the nrow function is not implemented for SubDataFrames, hence the error. So I used nastier code:
best_heart_rate(df) = (df[sortperm(df[:, :Mortality], rev=true), :])[1, :]
best_hospitals = by(hospitals, :State, best_heart_rate)
It seems to work. But now there is an NA problem: how can I remove the rows that have NA in the Mortality column from the SubDataFrames? Is there a better strategy to accomplish my objective?
I think this might work, if I've understood you correctly:
# Let me make up some data about hospitals in states
hospitals = DataFrame(State=sample(["CA", "MA", "PA"], 10),
                      mortality=rand(10),
                      hospital=split("abcdefghij", ""))
hospitals[3, :mortality] = NA
# You can use the indmax function to find the index of the maximum element
by(hospitals[complete_cases(hospitals), :], :State,
   df -> df[indmax(df[:mortality]), [:mortality, :hospital]])
State mortality hospital
1 CA 0.9469632421111882 j
2 MA 0.7137144590022733 f
3 PA 0.8811901895164764 e
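Since the data originally comes from an R course assignment, the analogous base-R split-apply-combine may be useful for comparison (a sketch with made-up column names; which.min targets the question's stated objective of the lowest rate, and NA rows are dropped first):
hospitals <- data.frame(State = c("CA", "CA", "MA", "PA"),
                        Mortality = c(10.2, 9.1, NA, 11.5),
                        Hospital = c("a", "b", "c", "d"))
complete <- hospitals[!is.na(hospitals$Mortality), ]
do.call(rbind, lapply(split(complete, complete$State),
                      function(d) d[which.min(d$Mortality), ]))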
