Related
I am just starting off in R and I am stuck on a fairly simple problem. I have the following test dataset:
T1 <- data.frame(Make = c("Nissan","Nissan","Nissan","Nissan","Nissan","Nissan","Nissan",
"FORD","FORD","FORD","FORD","FORD","FORD"),
YearMonth = c("Apr-13","May-13","Jun-13","Jul-13","Aug-13","Sep-13","Oct-13","Apr-16","May-16",
"Jun-16","Jul-16","Aug-16","Sep-16")),
Value = c(10000,9500,8000,7500,6000,5000,4000,12000,11000,10000,8000,7000,5000))
I would like to create two extra columns to return "value in 3months" and "final value" so something like:
Any help would be greatly appreciated
We could do a lead by n = 2 after grouping
library(dplyr)
T1 %>%
group_by(Make) %>%
mutate(Value_in_3_months = lead(Value, 2), FinalValue = last(Value))
If we want to create an index base on the 'YearMonth' column, convert to yearmon class and do a match with the original column
library(zoo)
T1 %>%
mutate(YearMonth = as.yearmon(YearMonth, "%b-%y")) %>%
group_by(Make) %>%
mutate(Valuein3 = Value[match(YearMonth + 2/12, YearMonth)])
Have a 15x6 dataframe:
df <- data.frame(PART = c("A7","A7","A7","A7","A7","A1","A1","A1","A1","A1","A7","A7","A7","A7","A7"),
LIMIT = c(50,50,50,50,50,55,55,55,55,55,52.5,52.5,52.5,52.5,52.5),
MEAS = c(14.008,19.053,22.244,24.554,25.521,18.495,22.3,24.867,26.825,27.169,15.299,20.239,23.384,25.606,26.516),
MEAS_TARGET = c(16.5,16.5,16.5,16.5,16.5,21.2,21.2,21.2,21.2,21.2,21.5,21.5,21.5,21.5,21.5),
INT = c(1.5,2.5,3.5,4.5,5,2,3,4,5,5.2,1.5,2.5,3.5,4.5,5),
COL = c(-31.845,-25.51,-21.377,-18.537,-17.546,-41.1,-39.294,-36.813,-33.779,-33.361,-53.589,-49.664,-46.836,-43.581,-40.64))
I am trying to group by PART and LIMIT columns and using linear interpolation find the missing values in INT and COL columns when MEAS = MEAS_TARGET and create the following 18x6 dataframe result:
result <- data.frame(PART = c("A7","A7","A7","A7","A7","A1","A1","A1","A1","A1","A7","A7","A7","A7","A7","A7","A1","A7"),
LIMIT = c(50,50,50,50,50,55,55,55,55,55,52.5,52.5,52.5,52.5,52.5,50,55,52.5),
MEAS = c(14.008,19.053,22.244,24.554,25.521,18.495,22.3,24.867,26.825,27.169,15.299,20.239,23.384,25.606,26.516,16.5,21.2,21.5),
MEAS_TARGET = c(16.5,16.5,16.5,16.5,16.5,21.2,21.2,21.2,21.2,21.2,21.5,21.5,21.5,21.5,21.5,16.5,21.2,21.5),
INT = c(1.5,2.5,3.5,4.5,5,2,3,4,5,5.2,1.5,2.5,3.5,4.5,5,1.99,2.7,2.9),
COL = c(-31.845,-25.51,-21.377,-18.537,-17.546,-41.1,-39.294,-36.813,-33.779,-33.361,-53.589,-49.664,-46.836,-43.581,-40.64,-28.716,-39.816,-48.53))
I tried to create NA rows for each group and use this and this but couldn't make it work. Any advice on this would be greatly appreciated.
We can use complete to include new rows where MEAS = MEAS_TARGET and interpolate INT and COL columns with zoo::na.approx.
library(dplyr)
library(tidyr)
df %>%
group_by(PART, LIMIT) %>%
complete(MEAS = unique(c(MEAS, MEAS_TARGET))) %>%
mutate(across(c(INT, COL), zoo::na.approx)) %>%
fill(MEAS_TARGET) %>%
ungroup
I have a dataframe like the following:
observations<- data.frame(X=c("00KS089001","00KS089001","00KS089002","00KS089002","00KS089003","00KS089003","00KS105001","00KS105001", "00KS177011","00KS177011","00P0006","00P006","00P006","00P006"), hzdept = c(0,20,0,15,0,13,0,20,0,16,0,6,13,29), hzdepb = c(20,30,15,30,13,30,20,30,16,30,6,13,29,30),Y=c("Red","White","Red","White","Green","Red","Red","Blue", "Black","Black","Red","White","White","White"), Z = c(0.67,0.33,0.5,0.5,0.43,0.57,0.67,0.33,0.53,0.47,0.2,0.23,0.53,0.04))
I want to be able to reduce this so that anytime X and Y are the same for two rows, the observations are combined i.e.
data.frame(X=c("00KS089001","00KS089001","00KS089002","00KS089002","00KS089003","00KS089003","00KS105001","00KS105001", "00KS177011","00P0006","00P006"), hzdept = c(0,20,0,15,0,13,0,20,0,0,6), hzdepb = c(20,30,15,30,13,30,20,30,30,6,30),Y=c("Red","White","Red","White","Green","Red","Red","Blue", "Black","Red","White"), Z = c(0.67,0.33,0.5,0.5,0.43,0.57,0.67,0.33,1.00,0.20,0.80))
Any suggestions on how to best go about this?
Edit: ok, now that I see how hzdept and hzdepb are supposed to be combined from your commment above:
library(tidyverse)
df <- observations %>% count(X,Y,wt = Z,name = "Z")
df_hzdept <- observations %>%
arrange(hzdept) %>%
distinct(X,Y,.keep_all = T) %>%
select(X,Y,hzdept)
df_hzdepb <- observations %>%
arrange(desc(hzdepb)) %>%
distinct(X,Y,.keep_all = T) %>%
select(X,Y,hzdepb)
df <- df %>% left_join(df_hzdept) %>% left_join(df_hzdepb)
Using dplyr
Here is how you would group by two columns and summarize using the minimum, max, and sum other columns in a dataframe:
library(magrittr) # For the pipe: %>%
observations %>%
dplyr::group_by(X, Y) %>%
dplyr::summarise(hzdept = min(hzdept),
hzdepb = max(hzdepb),
Z = sum(Z), .groups = 'drop')
I'm trying to replace the NAs in multiple column variables with randomly generated values from each student_id's subset row data:
data snapshot
so for student 3, systolic needs two NAs replaced. I used the min and max values for each variable within the student 3 subset to generate random values.
library(dplyr)
library(tidyr)
library(tibble)
library(tidyverse)
dplyr::filter(exercise, student_id == "3") %>% replace_na(list(systolic= round(sample(runif(1000, 125,130),2),0),
diastolic =round(sample(runif(1000, 85,85),3),0), heart_rate= round(sample(runif(1000, 79,86),2),0),
phys_score = round(sample(runif(1000, 8,9),2),0)
However it works only when one NA needs replacing: successfully replaced systolic NA values. When I try to replace more than one NAs, this error comes up.
Error: Replacement for `systolic` is length 2, not length 1
Is there a way to fix this? I tried converting the column variables to data frames instead of the vectors they are now, but it only returned the original data without any replacement changes.
Are there any simpler ways to this? Any suggestions/comments would be appreciated. Thanks.
A solution that makes things a little more automated but may be unnecessarily complex.
Generated some grouped missing data from the mtcars dataset
library(magrittr)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)
## Generate some missing data with a subset of car make
mtcars_miss <- mtcars %>%
as_tibble(rownames = "car") %>%
select(car) %>%
separate(car, c("make", "name"), " ") %>%
bind_cols(mtcars[, -1] %>%
map_df(~.[sample(c(TRUE, NA), prob = c(0.8, 0.2),
size = length(.), replace = TRUE)])) %>%
filter(make %in% c("Mazda", "Hornet", "Merc"))
Function to replace na values from a given variable by sampling within the min and max and depending on some group (here make).
replace_na_sample <- function(df_miss, var, group = "make") {
var <- enquo(var)
df_miss %>%
group_by(.dots = group) %>%
mutate(replace_var := round(runif(n(), min(!!var, na.rm = T),
max(!!var, na.rm = T)), 0)) %>%
rowwise %>%
mutate_at(.vars = vars(!!var),
.funs = funs(replace_na(., replace_var))) %>%
select(-replace_var) %>%
ungroup
}
Example replacing several missing values in multiple columns.
mtcars_replaced <- mtcars_miss %>%
replace_na_sample(cyl, group = "make") %>%
replace_na_sample(disp, group = "make") %>%
replace_na_sample(hp, group = "make")
I have working code below, which does what I am after, and does it fine for a test subset of +- 1.000 records. However, in the actual dataset, I have about half a million rows, where suddenly the code takes up over five minutes. Could anyone tell me why or how to improve the code?
The end result I need is to keep only the first value of duplicated ID's, but for each year this should be renewed (i.e. double values are fine if they are in different years, but not in the same year).
Test %>%
group_by(year, id) %>%
mutate(is_duplicate = duplicated(id)) %>%
mutate(oppervlakt = ifelse(is_duplicate == FALSE, oppervlakt, 0))%>%
select(-is_duplicate)
I think you could remove id from grouping and should get same results. See this example:
library(dplyr)
# some sample data:
n_rows <- 1E6
df <- data.frame(year = sample(x = c(2000:2018), size = n_rows, replace = TRUE),
id = sample(x = seq_len(1000), size = n_rows, replace = TRUE),
oppervlakt = rnorm(n = n_rows))
# Roughly 1 second:
system.time(df_slow <- df %>% group_by(year, id) %>% mutate(oppervlakt = ifelse(duplicated(id), 0, oppervlakt)))
# Roughly .1 second:
system.time(df_fast <- df %>% group_by(year) %>% mutate(oppervlakt = ifelse(duplicated(id), 0, oppervlakt)))
all.equal(df_slow, df_fast)
[1] TRUE