Insert new rows of imputed data into data table by group [duplicate] - r

This question already has answers here:
Complete dataframe with missing combinations of values
(2 answers)
Interpolate NA values in a data frame with na.approx
(3 answers)
I have a data table and I would like to insert new rows, imputing values between two years. This will be done over many ID groups. How do I go about replicating the data into the new rows?
# data table
library(data.table)
dt <- data.table(ID = rep(1:3, each = 3),
                 attrib1 = rep(c("sdf", "gghgf", "eww"), each = 3),
                 attrib2 = rep(c("444", "222", "777"), each = 3),
                 Year = rep(c(1990, 1995, 1996), 3),
                 value = c(12, 6, 7, 6, 3, 1, 9, 17, 18))
So for all groups (ID), Year should run from 1990 to 1996, with the values between 1990 and 1995 imputed linearly. All other attributes would remain the same and be copied into the new rows.
I've done this with a hideously long workaround and attempted a custom function, but to no avail.

You can use tidyr::complete to expand the years and zoo::na.approx for interpolation of the values.
library(dplyr)
dt %>%
  group_by(ID, attrib1, attrib2) %>%
  tidyr::complete(Year = min(Year):max(Year)) %>%
  mutate(value = zoo::na.approx(value))
# ID attrib1 attrib2 Year value
# <int> <chr> <chr> <dbl> <dbl>
# 1 1 sdf 444 1990 12
# 2 1 sdf 444 1991 10.8
# 3 1 sdf 444 1992 9.6
# 4 1 sdf 444 1993 8.4
# 5 1 sdf 444 1994 7.2
# 6 1 sdf 444 1995 6
# 7 1 sdf 444 1996 7
# 8 2 gghgf 222 1990 6
# 9 2 gghgf 222 1991 5.4
#10 2 gghgf 222 1992 4.8
# … with 11 more rows
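If you prefer to stay in data.table, a similar approach can be sketched with a grouped year sequence and a join (this assumes, as in the example, that every group has values for its first and last year, so there are no leading or trailing NAs to interpolate):
library(data.table)
full <- dt[, .(Year = min(Year):max(Year)), by = .(ID, attrib1, attrib2)]  # all years per group
out <- dt[full, on = .(ID, attrib1, attrib2, Year)]  # value is NA for the inserted years
out[, value := zoo::na.approx(value, na.rm = FALSE), by = ID]  # linear interpolation within each ID
out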

Related

Add a calculated column based on same and two other columns in r

I'm trying to add a calculated column whose values come from the same column in another row, matched on a third column. There are three columns: year, id, and value. If the id for 2011 matches the id for 2005, then subtract the 2005 value from the 2011 value. So the differences are 10-11=-1, 20-5=15, and 30-6=24... and the remaining rows can be 0 or NA, it doesn't matter. The resulting table would have this new difference column added.
I know I could split the data into two tables and then create the column via a simple subtraction if the two tables are ordered the same by year and id, but that's not an option for this particular problem. I tried to work out how I could use case_when or ifelse, but it's a mind-bender and I can't get my head around it. The examples I've found don't address this: they're mostly based on a comparison between only two columns, or perhaps three, whereas here one of the values comes from the same column. How can I address this?
Your help is greatly appreciated in advance.
Here is the code for the original table:
dat <- data.frame(year = c(2011, 2011, 2011, 2005, 2005, 2005),
                  id = c(1, 2, 3, 1, 2, 3),
                  value = c(10, 20, 30, 11, 5, 6))
For situations where there are multiple rows per id, as in your comment on Ronak's answer, you can do:
library(tidyr)
library(dplyr)
dat2 |>
  pivot_wider(id_cols = id, values_from = value, names_from = year) |>
  unnest(c(`2011`, `2005`)) |>
  mutate(difference = `2011` - `2005`) |>
  pivot_longer(c(`2011`, `2005`), names_to = "year")
# A tibble: 10 x 4
id difference year value
<dbl> <dbl> <chr> <dbl>
1 1 -1 2011 10
2 1 -1 2005 11
3 1 -1 2011 10
4 1 -1 2005 11
5 2 15 2011 20
6 2 15 2005 5
7 2 15 2011 20
8 2 15 2005 5
9 3 24 2011 30
10 3 24 2005 6
Arrange the data in descending order of year and, for each id, subtract the next value from the current one.
library(dplyr)
dat %>%
  arrange(desc(year)) %>%
  group_by(id) %>%
  mutate(difference = value - lead(value)) %>%
  # to get 0 instead of NA, use this instead:
  # mutate(difference = value - lead(value, default = last(value))) %>%
  ungroup()
# year id value difference
# <dbl> <dbl> <dbl> <dbl>
#1 2011 1 10 -1
#2 2011 2 20 15
#3 2011 3 30 24
#4 2005 1 11 NA
#5 2005 2 5 NA
#6 2005 3 6 NA
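For completeness, the same matching idea in base R (a sketch assuming exactly one 2011 row and one 2005 row per id): put the two years side by side with merge, then subtract.
wide <- merge(dat[dat$year == 2011, c("id", "value")],
              dat[dat$year == 2005, c("id", "value")],
              by = "id", suffixes = c("_2011", "_2005"))
wide$difference <- wide$value_2011 - wide$value_2005
wide
#  id value_2011 value_2005 difference
#1  1         10         11         -1
#2  2         20          5         15
#3  3         30          6         24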

Extracting data from different columns in a data frame based on row values

From each row in the data frame df, I want to extract values from certain columns, as explained below, and create a new data frame, output.
When Year equals 2003, I need the values in the Y_2001 and Y_2002 columns to appear in the output data frame as Year1 and Year2; these are the values for the two years prior to the year given in the Year column. Similarly, if Year equals 2006, I need the values in Y_2004 and Y_2005 in the output data frame. Likewise for all years in the Year column.
> df
ID Year Y_2001 Y_2002 Y_2003 Y_2004 Y_2005
[1,] 1 2003 2 4 6 4 3
[2,] 2 2004 5 9 7 1 2
[3,] 3 2006 4 3 5 7 8
[4,] 4 2004 7 6 4 8 9
> output
ID Year Year1 Year2
[1,] 1 2003 2 4
[2,] 2 2004 9 7
[3,] 3 2006 7 8
[4,] 4 2004 6 4
Can someone please help me create code to get the above output? Any support is highly appreciated.
Here is a tidyverse solution:
This takes the data and puts it into long form with pivot_longer. The data values of interest are those where the Year "row" is 1 or 2 years less than the "column" year, so you can filter on these differences (the filter here is explicit about the 1- and 2-year differences).
An additional column is created with mutate for your column names Year1 and Year2 (note that Year1 is a difference of 2 years and Year2 a difference of 1 year, so the difference is subtracted from 3 for this reversal). Finally, pivot_wider puts the data back in wide form.
library(tidyverse)
df %>%
  pivot_longer(cols = -c(ID, Year), names_to = c(".value", "Year_Sep"),
               names_sep = "_",
               names_transform = list(Year_Sep = as.numeric)) %>%  # names_ptypes in tidyr < 1.1
  filter(Year - Year_Sep == 1 | Year - Year_Sep == 2) %>%
  mutate(YearCol = paste0("Year", 3 - (Year - Year_Sep))) %>%
  pivot_wider(id_cols = c(ID, Year), names_from = YearCol, values_from = Y)
Output
# A tibble: 4 x 4
ID Year Year1 Year2
<int> <int> <int> <int>
1 1 2003 2 4
2 2 2004 9 7
3 3 2006 7 8
4 4 2004 6 4
Bit of a clunky solution, but ...
i.col <- function(data, n) { # returns, per row, the column index matching Year minus n
  sapply(data$Year - n, function(x) grep(x, names(data)))
}
df$Year1 <- diag(as.matrix(df[, i.col(df, n = 2)]))
df$Year2 <- diag(as.matrix(df[, i.col(df, n = 1)]))
Edit:
Apparently using diag is very slow. Using cbind to access matrix elements is preferred.
df$Year1 <- df[cbind(seq_len(nrow(df)), i.col(df, n = 2))]  # seq_len(nrow(df)) gives the row indices
df$Year2 <- df[cbind(seq_len(nrow(df)), i.col(df, n = 1))]
df
ID Year Y_2001 Y_2002 Y_2003 Y_2004 Y_2005 Year1 Year2
1 1 2003 2 4 6 4 3 2 4
2 2 2004 5 9 7 1 2 9 7
3 3 2006 4 3 5 7 8 7 8
4 4 2004 7 6 4 8 9 6 4
Here is one way with a row-wise apply, assuming you can find out the starting year (2001):
cbind(df[1:2], t(apply(df[-1], 1, function(x) {
  vals <- x[1] - 2001  # x[1] is Year, so x[vals] is the value two years prior
  x[vals:(vals + 1)]
})))
# ID Year 1 2
#1 1 2003 2 4
#2 2 2004 9 7
#3 3 2006 7 8
#4 4 2004 6 4
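A compact variant of the matrix-indexing idea above (assuming the year columns always follow the Y_<year> naming pattern) builds the column names directly and locates them with match() instead of grep():
rows <- seq_len(nrow(df))
df$Year1 <- df[cbind(rows, match(paste0("Y_", df$Year - 2), names(df)))]  # two years prior
df$Year2 <- df[cbind(rows, match(paste0("Y_", df$Year - 1), names(df)))]  # one year prior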

Keep less recent duplicate row in R

So, I have a dataset with bill numbers, day, month, year, and aggregate value. There are a bunch of bill number duplicates, and I want to keep the first ones. If there are duplicates with the same day, month, and year, I want to keep the one with the highest aggregate value.
For example, if the dataset now looks like this:
Bill Number  Day  Month  Year  Ag. Value
          1   10      4  1998         10
          1   11      4  1998         14
          2   23     11  2001         12
          2   23     11  2001          9
          3   11      3  2005          8
          3   12      3  2005          9
          3   13      3  2005          4
I want the result to look like this:
Bill Number  Day  Month  Year  Ag. Value
          1   10      4  1998         10
          2   23     11  2001         12
          3   11      3  2005          8
I'm not sure if there is a single command I can use with all these conditions, or if I should do it in stages, but either way I'm not sure how to begin. I used duplicated() and unique() and then got stuck.
Thanks!
library(data.table)
dt <- fread("Bill_Number Day Month Year Ag_Value
1 10 4 1998 10
1 11 4 1998 14
2 23 11 2001 12
2 23 11 2001 9
3 11 3 2005 8
3 12 3 2005 9
3 13 3 2005 4", header = TRUE)
dt[!duplicated(Bill_Number), ]
# Bill_Number Day Month Year Ag_Value
# 1: 1 10 4 1998 10
# 2: 2 23 11 2001 12
# 3: 3 11 3 2005 8
or
dt[, .SD[1], by = .(Bill_Number) ] #other approach, a bit slower
duplicated() flags the entries that are identical to an earlier one (i.e. the ones with smaller subscripts). Therefore, sorting your bill numbers by date (earliest at the top) and then removing duplicates should do the trick. Aggregating your Day, Month and Year columns into one date column might be helpful.
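For instance, a sketch of that suggestion with the data.table dt from above (also sorting by descending Ag_Value so that same-day ties keep the higher value):
setorder(dt, Bill_Number, Year, Month, Day, -Ag_Value)  # earliest date first; higher value first on ties
dt[!duplicated(Bill_Number), ]  # keep the first row per bill number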
This answer uses the dplyr package and satisfies your condition: "If there is a duplicate with the same day, month, and year, I want to keep the one with the highest amount in aggregate value."
library(data.table)
library(dplyr)
myData <- fread("Bill_Number Day Month Year Ag_Value
1 10 4 1998 10
1 11 4 1998 14
2 23 11 2001 12
2 23 11 2001 9
3 11 3 2005 8
3 12 3 2005 9
3 13 3 2005 4", header = TRUE)
myData <- as_tibble(myData)  # tibble form (as.tibble is deprecated)
sData <- arrange(myData, Bill_Number, Year, Month, Day, desc(Ag_Value))  # sort the data in the required order
fData <- distinct(sData, Bill_Number, .keep_all = TRUE)  # final data
fData
# A tibble: 3 x 5
Bill_Number Day Month Year Ag_Value
<int> <int> <int> <int> <int>
1 1 10 4 1998 10
2 2 23 11 2001 12
3 3 11 3 2005 8
I used some loops and condition checks, and tried it with a test set in addition to the "base" set you mentioned.
library(tidyverse)
#base dataset
billNumber <- c(1,1,2,2,3,3,3)
day <- c(10,11,23,23,11,12,13)
month <- c(4,4,11,11,3,3,3)
year <- c(1998,1998,2001,2001,2005,2005,2005)
agValue <- c(10,14,12,9,8,9,4)
#test dataset (running this block overwrites the base dataset above; the output shown below uses the base dataset)
billNumber <- c(1,1,2,2,3,3,3,4,4,4)
day <- c(10,11,23,23,11,12,13,15,15,15)
month <- c(4,4,11,11,3,3,3,6,6,6)
year <- c(1998,1998,2001,2001,2005,2005,2005,2020,2020,2020)
agValue <- c(10,14,9,12,8,9,4,13,15,8)
#build the dataset
df <- data.frame(billNumber,day,month,year,agValue)
#add a couple of working columns
df_full <- df %>%
  mutate(
    concat = paste(billNumber, day, month, year, sep = "-"),  # column references need no df$ inside mutate()
    flag = ""
  )
df_full
billNumber day month year agValue concat flag
1 1 10 4 1998 10 1-10-4-1998
2 1 11 4 1998 14 1-11-4-1998
3 2 23 11 2001 12 2-23-11-2001
4 2 23 11 2001 9 2-23-11-2001
5 3 11 3 2005 8 3-11-3-2005
6 3 12 3 2005 9 3-12-3-2005
7 3 13 3 2005 4 3-13-3-2005
#separate records with single/multiple occurrences as defined in the question
row_single <- df_full %>% count(concat) %>% filter(n == 1)
df_full_single <- df_full[df_full$concat %in% row_single$concat,]
row_multi <- df_full %>% count(concat) %>% filter(n > 1)
df_full_multi <- df_full[df_full$concat %in% row_multi$concat,]
#flag the rows with a single occurrence
df_full_single[1, ]$flag = "Y"
for (row in 2:nrow(df_full_single)) {
  if (df_full_single[row, ]$billNumber == df_full_single[row - 1, ]$billNumber) {
    df_full_single[row, ]$flag = "N"
  } else {
    df_full_single[row, ]$flag = "Y"
  }
}
df_full_single
#flag the rows with multiple occurrences
df_full_multi[1, ]$flag = "Y"
for (row in 2:nrow(df_full_multi)) {
  if ((df_full_multi[row, ]$billNumber == df_full_multi[row - 1, ]$billNumber) &
      (df_full_multi[row, ]$agValue > df_full_multi[row - 1, ]$agValue)) {
    df_full_multi[row, ]$flag = "Y"
    df_full_multi[row - 1, ]$flag = "N"
  } else {
    df_full_multi[row, ]$flag = "N"
  }
}
df_full_multi
#rebuild full dataset and retrieve the desired output
df_full_final <- rbind(df_full_single,df_full_multi)
df_full_final <- df_full_final[df_full_final$flag == "Y",c(1,2,3,4,5)]
df_full_final <- df_full_final[order(df_full_final$billNumber),]
df_full_final
billNumber day month year agValue
1 1 10 4 1998 10
3 2 23 11 2001 12
5 3 11 3 2005 8

Remove rows out of a specific year range, without using a for loop in R

I am looking for a way to omit the rows which are not between two specific values, without using a for loop. All values in the year column are between 1999 and 2002; however, some runs do not include all of the years between these two dates. You can see the initial data as follows:
a <- data.frame(year = c(2000:2002, 1999:2002, 1999:2002, 1999:2001),
                id = c(4, 6, 2, 1, 3, 5, 7, 4, 2, 0, -1, -3, 4, 3))
year id
1 2000 4
2 2001 6
3 2002 2
4 1999 1
5 2000 3
6 2001 5
7 2002 7
8 1999 4
9 2000 2
10 2001 0
11 2002 -1
12 1999 -3
13 2000 4
14 2001 3
The processed dataset should only include complete consecutive runs of 1999:2002. The following data.frame is exactly what I need:
year id
1 1999 1
2 2000 3
3 2001 5
4 2002 7
5 1999 4
6 2000 2
7 2001 0
8 2002 -1
When I execute the following for loop, I get the previous data.frame without any problem:
for (i in 1:max(which(a$year == 2002))) {  # up to the last row where year == 2002
  if (a[i, 1] == 1999 & a[i + 3, 1] == 2002) {
    b <- a[i:(i + 3), ]
  } else {
    next
  }
  if (!exists("d")) {
    d <- b
  } else {
    d <- rbind(d, b)
  }
}
However, I have more than 1 million rows and I need to do this without using a for loop. Is there a faster way?
You could try this. First we create groups of consecutive years, then we join with the full year range, then we filter out any group that is not complete. If you already have a grouping variable, this can be cut down a lot.
library(tidyverse)
df <- tibble(year = c(2000:2002, 1999:2002, 1999:2002, 1999:2001),
             id = c(4, 6, 2, 1, 3, 5, 7, 4, 2, 0, -1, -3, 4, 3))
df %>%
  mutate(groups = cumsum(c(0, diff(year) != 1))) %>%
  nest(data = -groups) %>%  # nest(-groups) in older tidyr
  mutate(data = map(data, ~ full_join(.x, tibble(year = 1999:2002), by = "year")),
         drop = map_lgl(data, ~ any(is.na(.x$id)))) %>%
  filter(drop == FALSE) %>%
  unnest(data) %>%
  select(-c(groups, drop))
#> # A tibble: 8 x 2
#> year id
#> <int> <dbl>
#> 1 1999 1
#> 2 2000 3
#> 3 2001 5
#> 4 2002 7
#> 5 1999 4
#> 6 2000 2
#> 7 2001 0
#> 8 2002 -1
Created on 2018-08-31 by the reprex package (v0.2.0).
There is a function that can do this.
First, install the dplyr or tidyverse package with install.packages("dplyr") or install.packages("tidyverse").
Then, load the package with library(dplyr).
Then, use the filter function: a_filtered <- filter(a, year >= 1999 & year <= 2002).
This should be fast even when there are many rows.
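A runnable version of those steps, for reference (note that this is a plain range filter: it keeps every row whose year falls within 1999:2002 and does not check that each run of years is complete):
library(dplyr)
a_filtered <- filter(a, year >= 1999 & year <= 2002)  # range filter only; no completeness check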
We could also do this by creating a grouping column based on a logical expression checking for 'year' 1999, then filtering groups whose first 'year' is 1999, whose last is 2002, and where all the years in between are present for the particular 'grp':
library(dplyr)
a %>%
  group_by(grp = cumsum(year == 1999)) %>%
  filter(dplyr::first(year) == 1999,
         dplyr::last(year) == 2002,
         all(1999:2002 %in% year)) %>%
  ungroup %>%  # in case you want to remove the 'grp' grouping
  select(-grp)
# A tibble: 8 x 2
# year id
# <int> <dbl>
#1 1999 1
#2 2000 3
#3 2001 5
#4 2002 7
#5 1999 4
#6 2000 2
#7 2001 0
#8 2002 -1
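The same grouping idea translates to data.table, which may be worth trying at a million rows (a sketch assuming, as above, that every run starts at 1999):
library(data.table)
setDT(a)[, grp := cumsum(year == 1999)]  # start a new group at each 1999
a[, if (all(1999:2002 %in% year)) .SD, by = grp][, grp := NULL][]  # keep only complete runs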

Selecting distinct rows in dplyr [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(9 answers)
dat <- data.frame(loc.id = rep(1:2, each = 3),
                  year = rep(1981:1983, times = 2),
                  prod = c(200, 300, 400, 150, 450, 350),
                  yld = c(1200, 1250, 1200, 3000, 3200, 3200))
If I want to select for each loc.id distinct values of yld, I do this:
dat %>% group_by(loc.id) %>% distinct(yld)
loc.id yld
<int> <dbl>
1 1200
1 1250
2 3000
2 3200
However, what I want is: for each loc.id, if multiple years have the same yld, keep the row with the lower prod value. The final data frame should also include the year and prod columns, i.e. it should look like:
loc.id year prod yld
1 1981 200 1200
1 1982 300 1250
2 1981 150 3000
2 1983 350 3200
We can arrange by 'prod' and then slice the first observation per group:
dat %>%
  arrange(loc.id, prod) %>%
  group_by(loc.id, yld) %>%
  slice(1)
# A tibble: 4 x 4
# Groups: loc.id, yld [4]
# loc.id year prod yld
# <int> <int> <dbl> <dbl>
#1 1 1981 200 1200
#2 1 1982 300 1250
#3 2 1981 150 3000
#4 2 1983 350 3200
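With dplyr 1.0.0 or later, the same idea can be written with slice_min(), which directly keeps the row with the smallest prod per group:
dat %>%
  group_by(loc.id, yld) %>%
  slice_min(prod, n = 1, with_ties = FALSE) %>%  # lowest prod per loc.id/yld pair
  ungroup()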
