Replace NA with Previous Value - r

I have a column in a data frame that I created in R. After a certain month, the values become NA. I would like to replace each NA with the record from 12 months back. Is there a function in R for this, or do I have to use a loop?
So Jan-11 would then become 10, Feb-11 would become 11, and so forth.
EDIT:
I also tried:
for (i in 1:length(df$var)) {
  df$var[i] <- ifelse(is.na(df$var[i]), df$var[i - 12], df$var[i])
}
but the whole column ends up being NA.

Aha, from the last comment it sounds like you'd like a "chained" lag, where it uses the last value of that month that is available, however many years back you need to go:
"Jan-11 will show the value 10, but when it comes to Jan-12, it shows NA (when it should be 10)."
Here's an approach that relies on first grouping by month, and then using tidyr::fill() to fill in from the last valid value for that month.
First, some fake data. (BTW it would be useful to include something like this in your question so that answerers don't have to retype your numbers or generate new ones.)
# Make fake data with 1 year of values, 2 years of NAs
library(lubridate)
set.seed(42)
data <- data.frame(
  dates = seq.Date(from = ymd(20100101), to = ymd(20121201), by = "month"),
  values = c(as.integer(rnorm(12, 10, 3)), rep(NA_integer_, 24))
)
# Group by month, fill within groups, ungroup
library(tidyverse)
data_filled <- data %>%
  group_by(month = month(dates)) %>%
  fill(values) %>%
  ungroup() %>%
  arrange(dates)
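As a quick sanity check (just a usage sketch using the month column created by the grouping above), you can pull out the January rows and confirm that the 2010 value has been carried forward through 2011 and 2012:
# All January rows should now share the January 2010 value
data_filled %>% filter(month == 1)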

I can't think of a way to do this without a loop, but this should give you what you need:
df <- data.frame(col1 = LETTERS[1:24],
                 col2 = c(rnorm(12), rep(NA, 12)))
for (i in 1:nrow(df)) {
  if (is.na(df[i, 2])) {
    df[i, 2] <- df[i - 12, 2]
  }
}
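If you'd rather avoid the explicit loop for this single-year-of-NAs case, here is a vectorized sketch (it assumes the rows are in month order, and each pass only fills one year back):
# Shift the column down 12 rows, then fill NAs from the shifted copy
lagged <- c(rep(NA, 12), df$col2[seq_len(nrow(df) - 12)])
df$col2 <- ifelse(is.na(df$col2), lagged, df$col2)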

Related

filter data set when both conditions are met

I have some data that requires that the first day of every year be removed (1/1/xx), but I want to leave all the other days of January. It doesn't follow traditional date formats; instead it has separate Month and Day columns. I've tried subsetting several ways using "&", trying to make sure a row is removed only when both requirements are met, but it removes all days from January instead. Here's a reproducible example of what I have been trying.
Month <- sample(1:12, 1000, replace = TRUE)
Day <- sample(1:30, 1000, replace = TRUE)
X <- sample(1:100, 1000, replace = TRUE)
df.ex <- as.data.frame(cbind(Month, Day, X))
library(dplyr)
df.ex <- df.ex %>%
  filter(Month != 1 & Day != 1)
This results in every January row being removed, as well as the 1st day of every month.
Here, you need to use the ! operator to negate your whole condition:
df.ex %>%
  filter(!(Month == 1 & Day == 1))
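Equivalently, by De Morgan's laws, you can push the negation inward, which some find easier to read:
# A row is kept unless it is both January AND the 1st
df.ex %>%
  filter(Month != 1 | Day != 1)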

summarizing data.table - creating multiple columns subset by date in R

I have data about ID and the corresponding amount over multiple years. Something like this:
ID <- c(rep("A", 5), rep("B", 7), rep("C", 3))
amount <- c(sample(1:10000, 15))
Date <- c("2016-01-22","2016-07-25", "2016-09-22", "2017-10-22", "2017-01-02",
"2016-08-22", "2016-09-22", "2016-10-22", "2017-08-22", "2017-09-22", "2017-10-22", "2018-08-22",
"2016-10-22","2017-10-25", "2018-10-22")
Now, I want to analyse every year of every ID. Specifically, I am interested in the amount. For one, I want to know the overall amount for every year. Then, I also want to know the overall amount for the first 11 months of every year, the first 10 months, the first 9 months, and the first 8 months. For this purpose I have calculated the cumulative sum for every ID per year as follows:
library(data.table)
myData <- cbind(ID, amount, Date)
myData <- as.data.table(myData)
# Create cumsum per ID per year
myData$Date <- as.Date(myData$Date, format = "%Y-%m-%d")
myData <- myData[order(ID, Date)]
myData[, CumSum := cumsum(amount), by = .(ID, year(Date))]
How can I summarise the data.table such that I get columns amount9month, amount10month, amount11month for every ID in every year?
Between cumsum, by and dcast this is almost straightforward. The most difficult bit is dealing with months that have no data. Hence this solution isn't as brief as it almost was, but it does things the "data.table way" and avoids slow operations like looping through rows.
# Just sort the formatting out first
myData[, Date := as.Date(Date)]
myData[, `:=`(amount = as.numeric(amount),
              year = year(Date),
              month = month(Date))]
bycols <- c('ID', 'year', 'month')
# Summarise all transactions for the same ID in the same month
summary <- myData[, .(amt = sum(amount)), by = bycols]
# Create a skeleton table with all possible combinations of ID, year and month, to fill in any gaps
skeleton <- myData[, CJ(ID, year, month = 1:12, unique = TRUE)]
# Join the skeleton to the actual data, to recreate the data but with no gaps
result.long <- summary[skeleton, on = bycols, allow.cartesian = TRUE]
result.long[, amt.cum := cumsum(fcoalesce(amt, 0)), by = c('ID', 'year')]
# Cast the data into wide format to have one column per month
result.wide <- dcast(result.long, ID + year ~ paste0('amount', month, 'month'), value.var = 'amt.cum')
NB. If you don't have fcoalesce, update your data.table package.
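If updating data.table isn't an option, the same step can be written without fcoalesce (a sketch; the behaviour should be identical):
# Replace NA with 0 before taking the cumulative sum
result.long[, amt.cum := cumsum(ifelse(is.na(amt), 0, amt)), by = c('ID', 'year')]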
In which format do you want the result? There are two simple options:
# Prepare the data
ID <- c(rep("A", 5), rep("B", 7), rep("C", 3))
amount <- sample(1:1, 15, replace = TRUE)
Date <- c("2016-01-22", "2016-07-25", "2016-09-22", "2017-10-22", "2017-01-02",
          "2016-08-22", "2016-09-22", "2016-10-22", "2017-08-22", "2017-09-22",
          "2017-10-22", "2018-08-22", "2016-10-22", "2017-10-25", "2018-10-22")
myData <- data.frame(ID, amount, Date)
# Add a year column
myData$Date <- as.Date(myData$Date, format = "%Y-%m-%d")
myData$year <- format(myData$Date, "%Y")
Please note that I changed the amounts for testing purposes. Now two solutions.
# Format 1
by(myData$amount, list(myData$ID, myData$year), cumsum, simplify = TRUE)
# Format 2
aggregate(myData$amount, list(ID = myData$ID, Date = myData$year), cumsum)
However, you might want the result as a new column in the data frame. That can be done with a loop:
# Format: new column
myData <- myData[order(myData$year, myData$ID), ]  # sort by year and ID
myData$cumsum <- rep(0, nrow(myData))
for (r in 1:nrow(myData)) {
  if (r > 1 && myData$year[r - 1] == myData$year[r] && myData$ID[r - 1] == myData$ID[r])
    myData$cumsum[r] <- myData$cumsum[r - 1] + myData$amount[r]
  else
    myData$cumsum[r] <- myData$amount[r]
}
I do not know a smooth solution in base R. Maybe someone from the "dplyr faction" has a neat trick up their sleeve?
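For what it's worth, base R's ave() gives a compact version of the new-column approach (a sketch; it assumes the rows are already sorted by date within each ID/year group):
# Cumulative sum of amount within each ID-year group, preserving row order
myData$cumsum <- ave(myData$amount, myData$ID, myData$year, FUN = cumsum)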

Rearrange dataframe to fit longitudinal model in R

I have a dataframe where each entry relates to a job posting in the NHS specifying the week the job was posted, and what NHS Trust (and region) the job is in.
At the moment my dataframe looks something like this:
set.seed(1)
df1 <- data.frame(
  NHS_Trust = sample(1:30, 20, T),
  Week = sample(1:10, 20, T),
  Region = sample(1:15, 20, T))
And I would like to count the number of jobs for each week across each NHS Trust and assign that value to a new column 'jobs' so my dataframe looks like this:
set.seed(1)
df2 <- data.frame(
  NHS_Trust = rep(1:30, each = 10),
  Week = rep(seq(1, 10), 30),
  Region = rep(as.integer(runif(30, 1, 15)), 1, each = 10),
  Jobs = rpois(10 * 30, lambda = 2))
The dataframe can then be used to fit a Poisson longitudinal multilevel model for the number of jobs.
Using the data.table package you can group, count and assign to a new column in a single expression. The general syntax for a data.table is dt[i, j, by]. The i part selects or orders rows; it is empty here, so all rows are used in their original order. The j part says what to do: here we count the number of occurrences with .N and assign the result to the new variable count using the assignment operator :=. The by part takes a list of variables, and the j operation is performed within each group.
library(data.table)
setDT(df1)
df1[, count := .N, by = .(NHS_Trust, Week, Region)]
A tidyverse approach would be
library(tidyverse)
df1 <- df1 %>%
  group_by(NHS_Trust, Week, Region) %>%
  count()
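Note that count() collapses the data to one row per group, whereas the data.table version above keeps every row and adds a column. To mirror that behaviour in the tidyverse, add_count() applied to the original df1 may be closer (a sketch):
# Keeps all rows and adds the group size as a new column
df1 %>%
  add_count(NHS_Trust, Week, Region, name = "count")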
You can use count to count the number of jobs across each Region, NHS_Trust and Week, and use complete to fill in missing combinations.
library(dplyr)
df1 %>%
  count(Region, NHS_Trust, Week, name = 'Jobs') %>%
  tidyr::complete(Region, Week = 1:10, fill = list(Jobs = 0))
I guess I'm moving my comment to an answer:
df2 <- df1 %>% group_by(Region, NHS_Trust, Week) %>% count()
colnames(df2)[4] <- "Jobs"
df2$combo <- paste0(df2$Region, "_", df2$NHS_Trust, "_", df2$Week)
for (i in 1:length(unique(df2$Region))) {
  for (j in 1:length(unique(df2$NHS_Trust))) {
    for (k in 1:length(unique(df2$Week))) {
      curr_combo <- paste0(unique(df2$Region)[i], "_",
                           unique(df2$NHS_Trust)[j], "_",
                           unique(df2$Week)[k])
      if (!curr_combo %in% df2$combo) {
        curdat <- data.frame(unique(df2$Region)[i],
                             unique(df2$NHS_Trust)[j],
                             unique(df2$Week)[k],
                             0,
                             curr_combo,
                             stringsAsFactors = FALSE)
        names(curdat) <- names(df2)
        df2 <- rbind(as.data.frame(df2), curdat)
      }
    }
  }
}
tail(df2)
# Region NHS_Trust Week Jobs combo
# 4495 15 1 4 0 15_1_4
# 4496 15 1 5 0 15_1_5
# 4497 15 1 8 0 15_1_8
# 4498 15 1 3 0 15_1_3
# 4499 15 1 6 0 15_1_6
# 4500 15 1 9 0 15_1_9
The for loops here check which Region-NHS_Trust-Week combinations are missing from df2 and append those to df2 with a corresponding Jobs value of 0. The checking is done with the help of the new variable combo, which is just a concatenation of the values of the fields mentioned earlier, separated by underscores.
Edit: I am quite sure the people here can come up with something more elegant than this.

Making new variables for every group of observation in R

I have 11 variables in my dataframe. The first is a unique identifier of the observation (a plane). The second is a number from 1 to 21 representing a given plane's flight. The rest of the variables are time, velocity, distance, etc.
What I want to do is make new variables for every group (number) of flight, e.g. time_1, time_2, ..., velocity_1, velocity_2, etc., and consequently reduce the number of (repeating) observations.
I don't really have an idea how to start. I was thinking about a mutate call like:
mutate(df, time_1 = ifelse(n_flight == 1, time, NA))
But that would be a lot of typing and a new problem may appear, perhaps.
Basically, you want to convert long to wide data for each variable. You can lapply over these with tidyr::spread in that case. Suppose the data looks like the following:
library(dplyr)
library(tidyr)
df <- data.frame(
  ID = c(rep("A", 3), rep("B", 3)),
  n_flight = rep(seq(3), 2),
  time = seq(19, 24),
  velocity = rev(seq(65, 60))
)
Then the following will generate your outcome of interest, as long as you get rid of the extra ID variables.
lapply(
  setdiff(names(df), c("ID", "n_flight")), function(x) {
    df %>%
      select(ID, n_flight, !!x) %>%
      tidyr::spread(., key = "n_flight", value = x) %>%
      setNames(paste(x, names(.), sep = "_"))
  }
) %>%
  bind_cols()
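As an aside, spread() has since been superseded by pivot_wider(), which can produce the same shape in a single call (a sketch, assuming tidyr >= 1.0.0 for the names_glue argument):
# Widen both time and velocity at once, naming columns time_1, velocity_1, ...
df %>%
  tidyr::pivot_wider(id_cols = ID,
                     names_from = n_flight,
                     values_from = c(time, velocity),
                     names_glue = "{.value}_{n_flight}")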
Let me know if this wasn't what you were going for.

Copying data between groups in a grouped df

I have grouped data that has blocks of missing values. I used dplyr to compute the sum of my target variable over each group. For groups where the sum is zero, I want to replace that group's values with the ones from the previous group. I could do this in a loop, but since my data is in a large data frame, that would be extremely inefficient.
Here's a synthetic example:
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
                                 c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
  group_by(group) %>%
  mutate(total = sum(var, na.rm = TRUE))
Output:
Source: local data frame [8 x 3]
Groups: group

  group       var   total
1     1 1.3697267 4.74936
2     1 1.5263502 4.74936
3     1 0.4065596 4.74936
4     1 1.4467237 4.74936
5     2        NA 0.00000
6     2        NA 0.00000
7     2        NA 0.00000
8     2        NA 0.00000
In this case, I want to replace the values of var in group 2 with the values of var in group 1, and I want to do it by detecting that total = 0 in group 2.
I've tried to come up with a custom function to feed into do() that does this, but can't figure out how to tell it to replace values in the current group with values from a different group. With the above example, I tried the following, which will always replace using the values from group 1:
CheckDay <- function(x) {
  if (all(x$total == 0)) { x$var <- df[df$group == 1, 2] }
  x
}
do(df, CheckDay)
CheckDay does return a df, but do() throws an error:
Error: Results are not data frames at positions: 1, 2
Is there a way to get this to work?
There are a couple of things going on. First, you need to make sure df is a data.frame. Second, your function CheckDay(x) uses both its local argument x (to which you pass df) and the global variable df itself; it's better to keep everything inside the function local. Finally, your call do(df, CheckDay) is missing the (.) part. Try this; it should work:
library("dplyr")
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
df <- as.data.frame(df)
CheckDay <- function(x) {
if( all( (x[x$group == 2, ])$total == 0) ) {
x$var <- x[x$group == 1, 2]
}
x
}
result <- do(df, CheckDay(.))
print(result)
To expand on Brouwer's answer, here is what I implemented to accomplish my goal:
Generate df as previously.
Create df.shift, a copy of df with the variables shifted down by one group, so that group 1's values sit in group 2's rows, and so on. (The rows in group 1 of df.shift could also simply be left blank.)
Get the indices where total = 0 and copy the values from df.shift into df at those indices.
This can all be done in base R. It creates one copy, but is much cheaper and faster than looping over the groups.
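A minimal sketch of those steps in base R, assuming equal-sized, consecutively numbered groups as in the example (n is the number of rows per group):
n <- 4                                           # rows per group
df.shift <- df
df.shift$var <- c(rep(NA, n), head(df$var, -n))  # shift var down by one group
idx <- which(df$total == 0)                      # rows in zero-total groups
df$var[idx] <- df.shift$var[idx]                 # copy values from the previous group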
