R: Interpolating multiple columns by group using a target values column

I have a 15x6 data frame:
df <- data.frame(
  PART = c("A7","A7","A7","A7","A7","A1","A1","A1","A1","A1","A7","A7","A7","A7","A7"),
  LIMIT = c(50,50,50,50,50,55,55,55,55,55,52.5,52.5,52.5,52.5,52.5),
  MEAS = c(14.008,19.053,22.244,24.554,25.521,18.495,22.3,24.867,26.825,27.169,15.299,20.239,23.384,25.606,26.516),
  MEAS_TARGET = c(16.5,16.5,16.5,16.5,16.5,21.2,21.2,21.2,21.2,21.2,21.5,21.5,21.5,21.5,21.5),
  INT = c(1.5,2.5,3.5,4.5,5,2,3,4,5,5.2,1.5,2.5,3.5,4.5,5),
  COL = c(-31.845,-25.51,-21.377,-18.537,-17.546,-41.1,-39.294,-36.813,-33.779,-33.361,-53.589,-49.664,-46.836,-43.581,-40.64))
I am trying to group by the PART and LIMIT columns and, using linear interpolation, find the missing values in the INT and COL columns where MEAS = MEAS_TARGET, creating the following 18x6 data frame result:
result <- data.frame(
  PART = c("A7","A7","A7","A7","A7","A1","A1","A1","A1","A1","A7","A7","A7","A7","A7","A7","A1","A7"),
  LIMIT = c(50,50,50,50,50,55,55,55,55,55,52.5,52.5,52.5,52.5,52.5,50,55,52.5),
  MEAS = c(14.008,19.053,22.244,24.554,25.521,18.495,22.3,24.867,26.825,27.169,15.299,20.239,23.384,25.606,26.516,16.5,21.2,21.5),
  MEAS_TARGET = c(16.5,16.5,16.5,16.5,16.5,21.2,21.2,21.2,21.2,21.2,21.5,21.5,21.5,21.5,21.5,16.5,21.2,21.5),
  INT = c(1.5,2.5,3.5,4.5,5,2,3,4,5,5.2,1.5,2.5,3.5,4.5,5,1.99,2.7,2.9),
  COL = c(-31.845,-25.51,-21.377,-18.537,-17.546,-41.1,-39.294,-36.813,-33.779,-33.361,-53.589,-49.664,-46.836,-43.581,-40.64,-28.716,-39.816,-48.53))
I tried to create NA rows for each group and adapt approaches from related answers, but couldn't make it work. Any advice on this would be greatly appreciated.

We can use complete to add new rows where MEAS = MEAS_TARGET and then interpolate the INT and COL columns with zoo::na.approx, interpolating against MEAS itself (rather than row position) so the new values land exactly on the target measurements.
library(dplyr)
library(tidyr)

df %>%
  group_by(PART, LIMIT) %>%
  # add a row for the target measurement in each group; INT and COL start as NA
  complete(MEAS = unique(c(MEAS, MEAS_TARGET))) %>%
  # interpolate with respect to MEAS (not row position), so e.g. MEAS = 16.5
  # between 14.008 and 19.053 gives INT = 1.99 rather than the midpoint 2.0
  mutate(across(c(INT, COL), ~ zoo::na.approx(.x, x = MEAS))) %>%
  fill(MEAS_TARGET) %>%
  ungroup()
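For comparison, a base R sketch of the same idea, assuming the df defined above: approx() interpolates INT and COL at the target MEAS within each (PART, LIMIT) group, and the new rows are appended to the original data.
# base R sketch: interpolate each group's INT and COL at its MEAS_TARGET
new_rows <- do.call(rbind, lapply(
  split(df, list(df$PART, df$LIMIT), drop = TRUE),
  function(g) data.frame(
    PART = g$PART[1], LIMIT = g$LIMIT[1],
    MEAS = g$MEAS_TARGET[1], MEAS_TARGET = g$MEAS_TARGET[1],
    INT = approx(g$MEAS, g$INT, xout = g$MEAS_TARGET[1])$y,
    COL = approx(g$MEAS, g$COL, xout = g$MEAS_TARGET[1])$y)))
result <- rbind(df, new_rows)
Both versions reproduce the interpolated rows of the expected result, e.g. INT ≈ 1.99 and COL ≈ -28.716 for PART A7 at LIMIT 50.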

Related

Convert data frame columns to vectors

I have a dataframe named "Continents_tmap" from which I want to return 3 vectors, as in the following examples. Note: "Values" needs to be Cases, as named in the dataframe.
labels = c("France","Germany","India", etc)
Parent = c("Europe","Europe","Asia",etc)
Values = c(100,345,456,etc)
My current code is as follows.
covid_1 <- read.csv("C:/Users/Owner/Downloads/COVID-19 Activity.csv",
                    stringsAsFactors = FALSE)

df1 <- select(
  covid_1,
  REPORT_DATE,
  COUNTRY_SHORT_NAME,
  COUNTRY_ALPHA_3_CODE,
  PEOPLE_DEATH_NEW_COUNT,
  PEOPLE_POSITIVE_NEW_CASES_COUNT,
  PEOPLE_DEATH_COUNT,
  PEOPLE_POSITIVE_CASES_COUNT,
  CONTINENT_NAME
)

# Continent, Country.x, Deaths and Positive_Cases are presumably renamed
# versions of the selected columns (the rename step isn't shown here)
Continents_tmap <- df1 %>%
  group_by(Continent, Country.x) %>%
  summarise(Deaths = sum(Deaths), Cases = sum(Positive_Cases))
Continents_tmap <- data.frame(Continents_tmap)
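Once Continents_tmap exists, each requested vector is just a column pulled out of the data frame; a minimal sketch, assuming the Continent, Country.x and Cases columns produced above:
# each requested vector is one column of the summarised data frame
labels <- as.character(Continents_tmap$Country.x) # country names
Parent <- as.character(Continents_tmap$Continent) # parent continent per country
Values <- Continents_tmap$Cases                   # summed Cases, as required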

How Can I Turn 120/80 into Two Columns (120 and 80)?

I have a column of blood pressures which read as ###/##; all I want to do is split the numerator into one column and the denominator into another column.
Please help?
library(dplyr)
library(stringr)

df <- data.frame(
  first_bp = c("120/80", "90/60"),
  id = c("0001234", "0001235"),
  amount = c(18.50, -18.50),
  stringsAsFactors = FALSE)

df %>%
  # split "120/80" into a list column holding c("120", "80")
  mutate(s0 = str_split(first_bp, "/")) %>%
  rowwise() %>%
  mutate(systole = as.numeric(s0[1]),
         diastole = as.numeric(s0[2])) %>%
  select(first_bp, id, amount, systole, diastole)
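As a side note, tidyr's separate() can do the split and the numeric conversion in one step, without the rowwise pass; a sketch assuming the same df:
library(tidyr)

# separate() splits first_bp on "/" into two new columns;
# convert = TRUE coerces the resulting pieces to numeric
df %>%
  separate(first_bp, into = c("systole", "diastole"),
           sep = "/", convert = TRUE, remove = FALSE)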
Alternatively, in Python you can do it with the str.split() method.
Here is an example:
blood_pressure = '120/80'
blood_pressure = blood_pressure.split('/')
numerator = blood_pressure[0]
denominator = blood_pressure[1]
print(numerator, denominator)
Output:
120 80

Is there a way of creating a loop that will create a new variable for each of the original 18 variables?

I have a data set with 4 variables; one of these is a dummy stating whether the individual graduated from a particular program (exits). I need a loop that will, for each of the 3 remaining variables, create two new variables (the mean for dummy = 1 and the mean for dummy = 0). This is my code. I want to make it more efficient, since afterwards I want to create a new data.frame for exits == 0 and subtract both.
summary_means_1 = bf %>%
  filter(exits == 1) %>%
  summarise(
    # note: the bf$ prefix pulls from the original, unfiltered data frame and
    # so bypasses filter(); bare column names should be used instead
    v1_1 = as.double(mean(bf$v25_grad, na.rm = TRUE)),
    v2_1 = as.double(mean(bf$v29_read, na.rm = TRUE)),
    v3_1 = as.double(mean(bf$v30_math, na.rm = TRUE))
  )
You can do this with the dplyr package (group_by and summarise are dplyr verbs):
Say this is your data (simplified):
df <- data.frame(Dummy=sample(0:1, 10, T), V1=rnorm(10, 10), V2=rpois(10, 0.5))
This code will calculate the mean of each column, split by dummy:
library(magrittr)
library(dplyr)

df %>%
  group_by(Dummy) %>%
  summarise(Mean_V1 = mean(V1, na.rm = TRUE),
            Mean_V2 = mean(V2, na.rm = TRUE))
You'll need to add a new line in the summarise section for each column (though see the across() sketch further below for a way around this).
Using base R you can use colMeans with subsetted data:
colMeans(df[df$Dummy==0, -1])
colMeans(df[df$Dummy==1, -1])
Or you could combine them like this:
data.frame(Col = c("V1", "V2"),
           Mean_0 = colMeans(df[df$Dummy == 0, -1]),
           Mean_1 = colMeans(df[df$Dummy == 1, -1]))
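If there are many columns, dplyr's across() avoids writing one summarise line per column; a sketch using the same simplified df:
library(dplyr)

# take the mean of every non-grouping column at once, split by Dummy
df %>%
  group_by(Dummy) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))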

Multiply a grouped data frame by a matrix dplyr

My problem:
I have two data frames, one for industries and one for occupations. They are nested by state, and show employment.
I also have a concordance matrix, which shows the weights of each of the occupations in each industry.
I would like to create a new employment number in the Occupation data frame, using the Industry employments and the concordance matrix.
I have made dummy version of my problem - which I think is clear:
Update
I have solved the issue, but I would like to know if there is a more elegant solution. In reality my dimensions are 7 states * 200 industries * 350 occupations, so it becomes rather data hungry.
library(dplyr)
library(tidyr)

# create industry data frame
set.seed(12345)
ind_df <- data.frame(
  State = c(rep("a", len = 6), rep("b", len = 6), rep("c", len = 6)),
  industry = rep(c("Ind1","Ind2","Ind3","Ind4","Ind5","Ind6"), len = 18),
  emp = rnorm(18, 20, 2))

# create occupation data frame
Occ_df <- data.frame(
  State = c(rep("a", len = 5), rep("b", len = 5), rep("c", len = 5)),
  occupation = rep(c("Occ1","Occ2","Occ3","Occ4","Occ5"), len = 15),
  emp = rnorm(15, 10, 1))

# create concordance matrix
Ind_Occ_Conc <- matrix(rnorm(6 * 5, 1, 0.5), 6, 5) %>% as.data.frame()

# name rows and cols of the concordance matrix
colnames(Ind_Occ_Conc) <- unique(Occ_df$occupation)
rownames(Ind_Occ_Conc) <- unique(ind_df$industry)

# solution: weight each occupation column by industry employment ...
Ind_combined <- cbind(Ind_Occ_Conc, ind_df)
Ind_combined <- Ind_combined %>%
  group_by(State) %>%
  mutate(Occ1 = emp * Occ1,
         Occ2 = emp * Occ2,
         Occ3 = emp * Occ3,
         Occ4 = emp * Occ4,
         Occ5 = emp * Occ5)

# ... reshape to long form ...
Ind_combined <- Ind_combined %>%
  gather(key = "occupation",
         value = "emp2",
         -State,
         -industry,
         -emp)

# ... and sum the weighted employment by state and occupation
Ind_combined <- Ind_combined %>%
  group_by(State, occupation) %>%
  summarise(emp2 = sum(emp2))

Occ_df <- left_join(Occ_df, Ind_combined)
My solution seems pretty inefficient; is there a better / faster way to do this?
Also, I am not quite sure how to get to this, but the expected outcome would be another column added to Occ_df called emp2, derived from the ind_df emp column and Ind_Occ_Conc. I have tried to step this out for Occupation 1; essentially Ind_Occ_Conc contains weights and the result is a weighted average.
I'm not sure what you want to do with the sum(Ind$emp*Occ1_coeff) line, but maybe this is what you're looking for:
# Instead of doing the computation only for state a, get expected
# outcomes for all states (with dplyr):
Ind <- ind_df %>%
  group_by(State) %>%
  summarize(rez = sum(emp))

# Then do some computations on Ind, which is an N-element vector
# (one for each state)
# ...

# And finally, join Ind and Occ_df using merge
Occ_df <- merge(x = Occ_df, y = Ind, by = "State", all = TRUE)
Final output would then have the Ind values in a new column: one value for state a, one for b and one for c.
Hope it will help ;)
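Since the weighted sum over industries is exactly a matrix product, a more compact alternative is to multiply each state's employment vector through the concordance matrix directly. A sketch, assuming the objects defined in the question and that industries appear in the same order as the rows of Ind_Occ_Conc:
library(dplyr)
library(tibble)

conc <- as.matrix(Ind_Occ_Conc)

# per state: a (1 x industries) emp vector times the
# (industries x occupations) weights gives one emp2 value per occupation
emp2_by_state <- ind_df %>%
  group_split(State) %>%
  lapply(function(d) tibble(
    State = d$State[1],
    occupation = colnames(conc),
    emp2 = as.vector(d$emp %*% conc))) %>%
  bind_rows()

Occ_df <- left_join(Occ_df, emp2_by_state, by = c("State", "occupation"))
This avoids building and reshaping the wide intermediate table, which should matter at 7 states * 200 industries * 350 occupations.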

R Calculate change using time and not lag

I'm trying to calculate the change from one quarter to the next. It's a little complex, as I'm grouping variables and I'd like to not use lag if possible. I'm having some difficulty finding a solution. Does anyone have a suggestion?
This is where I am now.
library(dplyr)
library(lubridate)

x <- mutate(x, QTR.YR = quarter(Month.of.Day.Timestamp, with_year = TRUE))

New.DF <- x %>%
  group_by(location_country, Device.Type, QTR.YR) %>% # select grouping columns to summarise
  summarise(Sum.Basket = sum(Baskets.Created), Sum.Visits = sum(Visits),
            Sum.Checkout.Starts = sum(Checkout.Starts), Sum.Converted.Visit = sum(Converted.Visits),
            Sum.Store.Find = sum(store_find), Sum.Time.on.Page = sum(time_on_pages),
            Sum.Time.on.Product.Page = sum(time_on_product_pages),
            # note: sum((lag(Visits)), 4) adds 4 to the sum; lag(Visits, 4) was
            # probably intended, but QTR.YR is a grouping column here, so lag()
            # only ever sees rows from a single quarter
            Visit.Change = sum(Visits) / sum((lag(Visits)), 4) - 1)
View(New.DF)
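One way to avoid lag() entirely is to summarise by quarter first and then join each quarter's row to the previous quarter's explicitly. A hedged sketch, assuming the New.DF produced above; the next_qtr() helper is hypothetical, written here just to step 2020.4 forward to 2021.1:
library(dplyr)

# hypothetical helper: the quarter following q, where q looks like 2020.3
next_qtr <- function(q) {
  yr <- floor(q)
  qq <- round((q - yr) * 10)
  round(ifelse(qq == 4, yr + 1.1, yr + (qq + 1) / 10), 1)
}

# shift each quarter forward, then join so every row sees its previous quarter
prev <- New.DF %>%
  ungroup() %>%
  transmute(location_country, Device.Type,
            QTR.YR = next_qtr(QTR.YR),
            Prev.Visits = Sum.Visits)

New.DF <- New.DF %>%
  left_join(prev, by = c("location_country", "Device.Type", "QTR.YR")) %>%
  mutate(Visit.Change = Sum.Visits / Prev.Visits - 1)
Quarters with no preceding quarter simply get NA for Visit.Change, which is usually the desired behaviour.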
