Subsetting a data frame and replacing a column based on condition - r

I am working on a data frame with three columns labelled as id, time1 and time2. A sample is:
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L),
time1 = c(12L, 5L, 3L, 5L, 6L, 30L, 3L, 30L, 7L, 2L, 17L, 5L, 8L, 3L, 22L, 5L, 15L, 4L, 7L, 23L),
time2=c(23L,23L,23L,23L,23L,22L,22L,22L,22L,22L,25L,25L,25L,25L,25L,24L,24L,24L,24L,24L)
),
.Names = c("id", "time1","time2"),
class = "data.frame",
row.names = c(NA,-20L)
)
I am using R and I am trying to subset this data and replace column time2 with a new column based on the following criteria:
Sum the values of time1 for each id until it is greater than or equal to the corresponding value of time2 for that id.
Replace the cells in time1 where the summations terminate with the corresponding time2 value for each id.
Column time2 is to be replaced with a new column labelled as status which consists of 0's and 1's. That is, status takes on 1 for the non-replaced values of time1 and 0 for all the replaced values of time1.
In summary, I am expecting to see something like this:
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
time1 = c(12L, 5L, 3L, 23, 22L, 17L, 5L, 25L, 5L, 15L, 24L),
status=c(1L,1L,1L,0L,0L,1L,1L,0L,1L,1L,0L)
),
.Names = c("id", "time1","status"),
class = "data.frame",
row.names = c(NA,-11L)
)
I greatly appreciate any help on this.

We can do the following:
library(tidyverse);
df %>%
group_by(id) %>%
mutate(
status = as.numeric(cumsum(time1) < time2),
time1 = ifelse(status == 1, time1, time2)) %>%
group_by(id, status) %>%
mutate(n = 1:n()) %>%
ungroup() %>%
filter(status == 1 | (status == 0 & n == 1)) %>%
select(-n, -time2)
## A tibble: 11 x 3
# id time1 status
# <int> <int> <dbl>
# 1 1 12 1.
# 2 1 5 1.
# 3 1 3 1.
# 4 1 23 0.
# 5 2 22 0.
# 6 3 17 1.
# 7 3 5 1.
# 8 3 25 0.
# 9 4 5 1.
#10 4 15 1.
#11 4 24 0.
Explanation: We group rows by id, then calculate the cumulative sum of time1 entries, and flag those rows where cumsum(time1) < time2 with 1, else with 0; we replace time1 entries with time2 entries if status == 0. Lastly we need to remove excess status = 0 rows; to do so, we regroup by id and status, number rows consecutively, and keep only the first status = 0 row per id.
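The same logic can be sketched in base R without any packages. This is only an illustration of the cumulative-sum idea, assuming time1 is never negative so the running total cannot fall back below time2; the names cs, prev_cs, keep and res are made up for this example.

```r
# Sample data from the question
df <- data.frame(
  id    = rep(1:4, each = 5),
  time1 = c(12, 5, 3, 5, 6, 30, 3, 30, 7, 2, 17, 5, 8, 3, 22, 5, 15, 4, 7, 23),
  time2 = rep(c(23, 22, 25, 24), each = 5)
)

# Running total of time1 within each id
cs <- ave(df$time1, df$id, FUN = cumsum)

# status is 1 while the running total is still below time2, else 0
status <- as.integer(cs < df$time2)

# Running total before the current row (0 for the first row of each id)
prev_cs <- cs - df$time1

# Keep every row up to and including the first one that reaches time2
keep <- prev_cs < df$time2

res <- data.frame(
  id     = df$id,
  time1  = ifelse(status == 1, df$time1, df$time2),
  status = status
)[keep, ]
rownames(res) <- NULL
res
```

Filtering on the previous running total avoids the second grouping step: a row is kept exactly when the total had not yet reached time2 before that row.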

Related

Get rows from a column per group based on a condition

I have a data.frame as shown below:
Basic requirement is to find average of "n" number of "value" after certain date per group.
For ex:, user provides:
Certain Date = Failure Date
n = 4
Hence, for A, the average would be (60+70+80+100)/4 ; ignoring NAs
and for B, the average would be (80+90+100)/3. Note for B, n=4 does not happen as there are only 3 values after the satisfied condition failuredate = valuedate.
Here is the dput:
structure(list(Name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), FailureDate = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L), .Label = c("1/5/2020", "1/7/2020"), class = "factor"), ValueDate = structure(c(1L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 2L, 1L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 2L), .Label = c("1/1/2020", "1/10/2020", "1/2/2020",
"1/3/2020", "1/4/2020", "1/5/2020", "1/6/2020", "1/7/2020", "1/8/2020",
"1/9/2020"), class = "factor"), Value = c(10L, 20L, 30L, 40L,
NA, 60L, 70L, 80L, NA, 100L, 10L, 20L, 30L, 40L, 50L, 60L, 70L,
80L, 90L, 100L)), class = "data.frame", row.names = c(NA, -20L
))
We could create an index with cumsum after grouping by 'Name', extract the 'Value' elements and get the mean
library(dplyr)
n <- 4
df1 %>%
type.convert(as.is = TRUE) %>%
group_by(Name) %>%
summarise(Ave = mean(head(na.omit(Value[lag(cumsum(FailureDate == ValueDate),
default = 0) > 0]), n), na.rm = TRUE))
# A tibble: 2 x 2
# Name Ave
# <chr> <dbl>
#1 A 77.5
#2 B 90
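To see why the lagged cumsum index picks out the rows after the failure date, here is a small base R sketch. The helper mean_after is hypothetical (it is not part of the answer above) and the vectors are typed in by hand from the dput, one group at a time.

```r
# Hypothetical helper: mean of the first n non-NA values that come
# after the row where value_date equals failure_date.
mean_after <- function(value_date, failure_date, value, n) {
  hit <- value_date == failure_date
  # Shift the running count down one row so the matching row itself is excluded
  seen_before <- c(0, head(cumsum(hit), -1)) > 0
  mean(head(na.omit(value[seen_before]), n))
}

dates   <- paste0("1/", 1:10, "/2020")   # "1/1/2020" ... "1/10/2020"
value_a <- c(10, 20, 30, 40, NA, 60, 70, 80, NA, 100)
value_b <- seq(10, 100, by = 10)

ave_a <- mean_after(dates, "1/5/2020", value_a, n = 4)
ave_b <- mean_after(dates, "1/7/2020", value_b, n = 4)
c(A = ave_a, B = ave_b)
```

Because head() returns fewer than n elements when fewer are available, group B is averaged over its three remaining values without any special handling.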
You can convert factor dates to the Date object and then compute averages of "n" numbers after FailureDate per group. Note that "n" numbers should exclude NA, so tidyr::drop_na() is used here.
library(dplyr)
df %>%
mutate(across(contains("Date"), as.Date, "%m/%d/%Y")) %>%
tidyr::drop_na(Value) %>%
group_by(Name) %>%
summarise(mean = mean(Value[ValueDate > FailureDate][1:4], na.rm = T))
# # A tibble: 2 x 2
# Name mean
# <fct> <dbl>
# 1 A 77.5
# 2 B 90
You can try this:
library(dplyr)
n <- 4
df %>%
mutate(condition = as.character(FailureDate) == as.character(ValueDate)) %>%
group_by(Name) %>%
mutate(condition = cumsum(condition)) %>%
filter(condition == 1) %>%
slice(-1) %>%
filter(!is.na(Value)) %>%
slice(1:n) %>%
summarise(mean_col = mean(Value))
> df
# A tibble: 2 x 2
Name mean_col
<fct> <dbl>
1 A 77.5
2 B 90

Sum consecutive hours when condition is met

I have a dataframe that has a timestamp and a numeric variable; the data is recorded once every hour. Ultimately, I'd like to know the mean number of hours that the variable is at or below a certain value. For example, what is the average number of hours that var1 is at or below 4? There are missing timestamps in the dataframe, so if the time is not consecutive the sum needs to restart.
In the example data frame the columns HoursBelow5 and RunningGroup were generated 'by hand'. If I could create these columns programmatically, I could filter to remove the RunningGroups that were associated with var1 values greater than 4 and then use dplyr::slice to get the maximum HoursBelow5 per group. I could then find the mean of these values.
So, in this approach I would need to create the restarting cumulative sum HoursBelow5, which restarts when the condition var1<5 is not met, or when the timestamp is not consecutive hours. I could then use ifelse statements to create the RunningGroup variable. Is this possible? I may be lacking the jargon to find the procedure. Cumsum and lag seemed promising, but I have yet to construct a procedure that does the above.
Or, there may be a smarter way to do this using the timestamp.
edit: result incorporating code from answer below
df1 <- df %>%
group_by(group = data.table::rleid(var1 > 4),
group1 = cumsum(ts - lag(ts, default = first(ts)) > 3600)) %>%
mutate(temp = row_number() * (var1 <= 4)) %>%
ungroup() %>%
filter(var1 <= 4) %>%
select(ts, var1, temp)
df2 <- df1 %>% mutate(temp2 = ifelse(temp==1, 1, 0),
newgroup = cumsum(temp2))
df3 <- df2 %>% group_by(newgroup) %>% slice(which.max(temp))
mean(df3$temp)
# example dataframe with desired output columns to then get actual output
df <- structure(list(ts = structure(c(-2208967200, -2208963600, -2208960000,
-2208956400, -2208952800, -2208949200, -2208945600, -2208942000,
-2208938400, -2208934800, -2208931200, -2208927600, -2208924000,
-2208913200, -2208909600, -2208906000, -2208902400, -2208898800,
-2208895200, -2208891600, -2208888000, -2208884400, -2208880800,
-2208877200, -2208852000, -2208848400, -2208844800, -2208841200,
-2208837600, -2208834000, -2208830400, -2208826800, -2208823200,
-2208819600, -2208816000, -2208812400, -2208808800, -2208805200,
-2208801600), class = c("POSIXct", "POSIXt"), tzone = ""), var1 = c(1L,
3L, 4L, 5L, 4L, 3L, 5L, 6L, 7L, 8L, 3L, 2L, 2L, 2L, 3L, 3L, 2L,
2L, 1L, 1L, 1L, 1L, 4L, 4L, 3L, 9L, 3L, 3L, 3L, 2L, 2L, 3L, 4L,
5L, 3L, 2L, 1L, 2L, 3L), HoursBelow5 = c(1L, 2L, 3L, 0L, 1L,
2L, 0L, 0L, 0L, 0L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 1L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 0L, 1L, 2L,
3L, 4L, 5L), RunningGroup = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L,
6L, 7L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
10L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 14L, 14L, 14L,
14L, 14L), NotContinuous = c("", "", "", "", "", "", "", "",
"", "", "", "", "", "NC", "", "", "", "", "", "", "", "", "",
"", "NC", "", "", "", "", "", "", "", "", "", "", "", "", "",
"")), row.names = c(NA, -39L), class = "data.frame")
One way, using dplyr and data.table::rleid, could be:
library(dplyr)
df %>%
group_by(group = data.table::rleid(var1 > 4),
group1 = cumsum(ts - lag(ts, default = first(ts)) > 3600)) %>%
mutate(temp = row_number() * (var1 <= 4)) %>%
ungroup() %>%
select(ts, var1, HoursBelow5, temp)
# ts var1 HoursBelow5 temp
# <dttm> <int> <int> <int>
# 1 1900-01-01 12:46:46 1 1 1
# 2 1900-01-01 13:46:46 3 2 2
# 3 1900-01-01 14:46:46 4 3 3
# 4 1900-01-01 15:46:46 5 0 0
# 5 1900-01-01 16:46:46 4 1 1
# 6 1900-01-01 17:46:46 3 2 2
# 7 1900-01-01 18:46:46 5 0 0
# 8 1900-01-01 19:46:46 6 0 0
# 9 1900-01-01 20:46:46 7 0 0
#10 1900-01-01 21:46:46 8 0 0
# … with 29 more rows
temp column is the one which was generated programmatically and HoursBelow5 is kept as it is for comparison purposes. If you also need RunningGroup you could use group and group1 together.
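For readers without data.table, the restarting counter can also be sketched in base R. The timestamps and values below are a small made-up series (not the question's data), and the names below, gap, changed, run_id and hours_below are illustrative only.

```r
# Made-up hourly series with one missing stretch (hours 6 and 7 are absent)
ts   <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC") +
  3600 * c(0, 1, 2, 3, 4, 5, 8, 9, 10)
var1 <- c(1, 3, 4, 5, 4, 3, 2, 2, 6)

below   <- var1 <= 4
# TRUE where more than one hour passed since the previous row
gap     <- c(FALSE, diff(as.numeric(ts)) > 3600)
# TRUE where the condition flipped compared with the previous row
changed <- c(FALSE, below[-1] != below[-length(below)])

# A new run starts at every gap or flip
run_id <- cumsum(gap | changed) + 1

# Count hours within each run; runs above the threshold stay at 0
hours_below <- ave(as.integer(below), run_id, FUN = cumsum)

# Length of each run at or below the threshold, then their mean
run_lengths <- tapply(hours_below[below], run_id[below], max)
mean_run    <- mean(run_lengths)
```

Here run_id plays the role of RunningGroup and hours_below the role of HoursBelow5 from the question.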

Number of remaining days of a month after maximum value appear

I have a panel data frame like this
date firms return
5/1/1988 A 5
6/1/1988 A 6
7/1/1988 A 4
8/1/1988 A 5
9/1/1988 A 6
11/1/1988 A 6
12/1/1988 A 13
13/01/1988 A 3
14/01/1988 A 2
15/01/1988 A 5
16/01/1988 A 2
18/01/1988 A 7
19/01/1988 A 3
20/01/1988 A 5
21/01/1988 A 7
22/01/1988 A 5
23/01/1988 A 9
25/01/1988 A 1
26/01/1988 A 5
27/01/1988 A 2
28/01/1988 A 7
29/01/1988 A 2
5/1/1988 B 5
6/1/1988 B 7
7/1/1988 B 5
8/1/1988 B 9
9/1/1988 B 1
11/1/1988 B 5
12/1/1988 B 2
13/01/1988 B 7
14/01/1988 B 2
15/01/1988 B 5
16/01/1988 B 6
18/01/1988 B 8
19/01/1988 B 5
20/01/1988 B 4
21/01/1988 B 3
22/01/1988 B 18
23/01/1988 B 5
25/01/1988 B 2
26/01/1988 B 7
27/01/1988 B 3
28/01/1988 B 9
29/01/1988 B 2
From the above panel data, I want to compute a variable called DMAX. DMAX is the number of days between the day of the maximum return and the last trading day of the same month. For example, in January 1988 the maximum return for firm A appears on 12 Jan 1988. Hence DMAX is the number of days from 12 Jan 1988 to the end of that month, which is 15 days.
For firm B, the maximum value appears on 22 Jan 1988. So the remaining number of days of that month is 6 days. Therefore the expected outcome is
date Firms DMAX(days)
Jan-88 A 15
Jan-88 B 6
I would be grateful if you can help me in this regard.
One way using the dplyr package would be the following. I called your data mydf. First, manipulate date. Then, group the data by date and firms. Then, you look for the row with the largest value in return and handle subtraction.
library(dplyr)
mutate(mydf, date = format(as.Date(date, format = "%d/%m/%Y"), "%m-%Y")) %>%
group_by(date, firms) %>%
summarize(DMAX = n() - which.max(return))
# A tibble: 2 x 3
# Groups: date [?]
# date firms DMAX
# <chr> <fct> <int>
#1 01-1988 A 15
#2 01-1988 B 6
DATA
mydf <-structure(list(date = structure(c(18L, 19L, 20L, 21L, 22L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L, 21L, 22L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L), .Label = c("11/1/1988",
"12/1/1988", "13/01/1988", "14/01/1988", "15/01/1988", "16/01/1988",
"18/01/1988", "19/01/1988", "20/01/1988", "21/01/1988", "22/01/1988",
"23/01/1988", "25/01/1988", "26/01/1988", "27/01/1988", "28/01/1988",
"29/01/1988", "5/1/1988", "6/1/1988", "7/1/1988", "8/1/1988",
"9/1/1988"), class = "factor"), firms = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
return = c(5L, 6L, 4L, 5L, 6L, 6L, 13L, 3L, 2L, 5L, 2L, 7L,
3L, 5L, 7L, 5L, 9L, 1L, 5L, 2L, 7L, 2L, 5L, 7L, 5L, 9L, 1L,
5L, 2L, 7L, 2L, 5L, 6L, 8L, 5L, 4L, 3L, 18L, 5L, 2L, 7L,
3L, 9L, 2L)), class = "data.frame", row.names = c(NA, -44L
))
1) Base R For each year/month and firm aggregate the difference between the number of rows and the position of the maximum return row. No packages are used.
with(transform(DF, date = as.Date(date, "%d/%m/%Y")),
aggregate(list(DMAX = return),
data.frame(date = format(date, "%Y-%m"), firms),
function(x) length(x) - which.max(x)))
giving:
date firms DMAX
1 1988-01 A 15
2 1988-01 B 6
2) zoo Read DF into a zoo object zd with one column per firm and then aggregate that by year/month. Finally melt it to a long form data frame using fortify.zoo. The fortify.zoo line can be omitted if a zoo time series object is ok as the result.
library(zoo)
zd <- read.zoo(DF, index = "date", format = "%d/%m/%Y", split = "firms")
ag <- aggregate(zd, as.yearmon, function(x) length(na.omit(x)) - which.max(na.omit(x)))
fortify.zoo(ag, melt = TRUE)
giving:
Index Series Value
1 Jan 1988 A 15
2 Jan 1988 B 6
Note that ag is a monthly zoo series of the form:
> ag
A B
Jan 1988 15 6
3) data.table
library(data.table)
DT <- as.data.table(DF)
DT[, list(DMAX = .N - which.max(return)),
by = list(date = format(as.Date(date, "%d/%m/%Y"), "%Y-%m"), firms)]
giving:
date firms DMAX
1: 1988-01 A 15
2: 1988-01 B 6
Note
Lines <- "
date firms return
5/1/1988 A 5
6/1/1988 A 6
7/1/1988 A 4
8/1/1988 A 5
9/1/1988 A 6
11/1/1988 A 6
12/1/1988 A 13
13/01/1988 A 3
14/01/1988 A 2
15/01/1988 A 5
16/01/1988 A 2
18/01/1988 A 7
19/01/1988 A 3
20/01/1988 A 5
21/01/1988 A 7
22/01/1988 A 5
23/01/1988 A 9
25/01/1988 A 1
26/01/1988 A 5
27/01/1988 A 2
28/01/1988 A 7
29/01/1988 A 2
5/1/1988 B 5
6/1/1988 B 7
7/1/1988 B 5
8/1/1988 B 9
9/1/1988 B 1
11/1/1988 B 5
12/1/1988 B 2
13/01/1988 B 7
14/01/1988 B 2
15/01/1988 B 5
16/01/1988 B 6
18/01/1988 B 8
19/01/1988 B 5
20/01/1988 B 4
21/01/1988 B 3
22/01/1988 B 18
23/01/1988 B 5
25/01/1988 B 2
26/01/1988 B 7
27/01/1988 B 3
28/01/1988 B 9
29/01/1988 B 2
"
DF <- read.table(text = Lines, header = TRUE)
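Here is a tidyverse solution. Note that it measures calendar days between the date of the maximum return and the last date in the month, so it gives 17 and 7 days rather than the 15 and 6 trading days (rows) asked for in the question.
library(tidyverse)
library(lubridate) # provides dmy(); attached explicitly in case tidyverse does not load it
library(zoo)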
df1 %>%
mutate(date = dmy(date),
month = as.yearmon(date)) %>%
group_by(firms, month) %>%
summarise(i = which(return == max(return)),
DMAX = last(date) - date[last(i)]) %>%
select(month, firms, DMAX)
## A tibble: 2 x 3
## Groups: firms [2]
# month firms DMAX
# <S3: yearmon> <chr> <time>
#1 Jan 1988 A 17 days
#2 Jan 1988 B " 7 days"
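The answers in this thread count two different things: rows remaining after the maximum (trading days) and calendar days to the last date. A minimal base R sketch for firm A, with the vectors typed in by hand from the question, shows both side by side; the variable names are illustrative.

```r
# Firm A, January 1988: day of month and return, copied from the question
days <- c(5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16,
          18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29)
ret  <- c(5, 6, 4, 5, 6, 6, 13, 3, 2, 5, 2,
          7, 3, 5, 7, 5, 9, 1, 5, 2, 7, 2)
date <- as.Date(sprintf("1988-01-%02d", days))

i_max <- which.max(ret)   # position of the first maximum return

# Trading days left: rows after the maximum
dmax_trading  <- length(ret) - i_max

# Calendar days left: last date minus date of the maximum
dmax_calendar <- as.numeric(max(date) - date[i_max])

c(trading = dmax_trading, calendar = dmax_calendar)
```

The question's expected value of 15 for firm A matches the row count, so the row-based versions are the ones that reproduce the requested output.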

Delete rows of Dataframe based on dates in R

I have a data frame with over 4000 columns and 3000 rows. The columns are companies and the rows hold daily stock closing prices, one row per date. I want to remove every row except the last date of each month, i.e. keep only the data for the last day of each month. The last date of each month should be taken from the dates available in the Date column of my data frame.
The main challenge, and the difference between my question and others, is that the last date of each month must come from the dates present in my data frame. It is financial data, so non-trading days and the number of trading days differ from other sectors.
I illustrate some part of my dataframe.
Date A B
30/12/1999 1 3
04/01/2000 1 3
05/01/2000 1 3
06/01/2000 1 3
07/01/2000 1 3
10/01/2000 1 3
11/01/2000 1 3
12/01/2000 1 3
13/01/2000 1 3
14/01/2000 1 3
17/01/2000 1 3
18/01/2000 1 3
19/01/2000 1 3
20/01/2000 1 3
21/01/2000 1 3
24/01/2000 1 3
25/01/2000 1 3
26/01/2000 1 3
27/01/2000 1 3
28/01/2000 1 3
31/01/2000 1 3
01/02/2000 1 3
02/02/2000 1 3
03/02/2000 1 3
04/02/2000 1 3
07/02/2000 1 3
08/02/2000 1 3
09/02/2000 1 3
10/02/2000 1 3
11/02/2000 1 3
14/02/2000 1 3
15/02/2000 1 3
16/02/2000 1 3
17/02/2000 1 3
18/02/2000 1 3
21/02/2000 1 3
22/02/2000 1 3
23/02/2000 1 3
24/02/2000 1 3
25/02/2000 1 3
28/02/2000 1 3
29/02/2000 1 3
Desired output
Date A B
30/12/1999 1 3
31/01/2000 1 3
29/02/2000 1 3
I would really appreciate your help in this regard.
Using lubridate and dplyr, first parse Date
library(lubridate)
library(dplyr)
df$Date <- dmy(df$Date)
Now we can build a dplyr chain to filter:
df %>% group_by(month = month(Date), year = year(Date)) %>% filter(Date == max(Date))
where we group_by month and year columns we add, and then filter down to only the dates that are the max for each group. It returns
Source: local data frame [3 x 5]
Groups: month, year [3]
Date A B month year
(time) (int) (int) (dbl) (dbl)
1 1999-12-30 1 3 12 1999
2 2000-01-31 1 3 1 2000
3 2000-02-29 1 3 2 2000
You could, of course, do this all in base R if you prefer.
Edit: H/T @Jaap for recommending using group_by to add columns instead of a separate mutate. You could also use slice(which.max(Date)) instead of the filter term; it would likely be a bit faster, if that's a concern.
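As a rough illustration of the base R route mentioned above, the sketch below keeps the rows whose date is the latest one present in its month. It uses a shortened, hand-typed version of the question's data, and the names d, ym, last_in_month and res are illustrative only.

```r
# Shortened version of the question's data
df <- data.frame(
  Date = c("30/12/1999", "04/01/2000", "28/01/2000",
           "31/01/2000", "01/02/2000", "29/02/2000"),
  A = 1L,
  B = 3L,
  stringsAsFactors = FALSE
)

d  <- as.Date(df$Date, format = "%d/%m/%Y")
ym <- format(d, "%Y-%m")   # year-month key, e.g. "2000-01"

# A row is kept when its date equals the latest date seen in its month
last_in_month <- as.numeric(d) == ave(as.numeric(d), ym, FUN = max)

res <- df[last_in_month, ]
res
```

Because the maximum is taken over the dates actually present, the result follows the trading calendar in the data rather than the calendar month end, and it does not require the rows to be sorted.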
We can also use data.table
library(data.table)
library(lubridate)
setDT(df1)[, c('month', 'year', 'Date') :={tmp <- dmy(Date)
list(month= month(tmp), year= year(tmp), Date= tmp)}
][, .SD[ which.max(Date)] ,.(month, year)]
# month year Date A B
#1: 12 1999 1999-12-30 1 3
#2: 1 2000 2000-01-31 1 3
#3: 2 2000 2000-02-29 1 3
Here's another possibility:
month_year <- as.numeric(as.factor(sub("^[0-9]*/","",df1$Date)))
df1[!!c(diff(month_year),1),]
# Date A B
#1 30/12/1999 1 3
#21 31/01/2000 1 3
#42 29/02/2000 1 3
This solution does not change the format of the date in the original dataframe. However, it is assumed that the data is chronologically ordered like the data displayed in the OP.
data
df1 <- structure(list(Date = structure(c(41L, 4L, 6L, 7L, 8L, 12L, 14L,
16L, 17L, 18L, 22L, 24L, 26L, 27L, 28L, 32L, 34L, 36L, 37L, 38L,
42L, 1L, 2L, 3L, 5L, 9L, 10L, 11L, 13L, 15L, 19L, 20L, 21L, 23L,
25L, 29L, 30L, 31L, 33L, 35L, 39L, 40L), .Label = c("01/02/2000",
"02/02/2000", "03/02/2000", "04/01/2000", "04/02/2000", "05/01/2000",
"06/01/2000", "07/01/2000", "07/02/2000", "08/02/2000", "09/02/2000",
"10/01/2000", "10/02/2000", "11/01/2000", "11/02/2000", "12/01/2000",
"13/01/2000", "14/01/2000", "14/02/2000", "15/02/2000", "16/02/2000",
"17/01/2000", "17/02/2000", "18/01/2000", "18/02/2000", "19/01/2000",
"20/01/2000", "21/01/2000", "21/02/2000", "22/02/2000", "23/02/2000",
"24/01/2000", "24/02/2000", "25/01/2000", "25/02/2000", "26/01/2000",
"27/01/2000", "28/01/2000", "28/02/2000", "29/02/2000", "30/12/1999",
"31/01/2000"), class = "factor"), A = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), B = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L
)), .Names = c("Date", "A", "B"), class = "data.frame", row.names = c(NA,
-42L))
I'd create a vector containing the end of month dates for your data like so:
library(dplyr)
df.dates = seq(as.Date("1999-01-01"),as.Date(Sys.Date()),by="months")-1
df.dates = as.data.frame(df.dates)
names(df.dates) = "Date"
df.joined = inner_join(df.dates, df)
This assumes that you have your data in a data frame with the Date column named "Date"
*Re-reading the question, this won't work if the last trading day isn't the last day of the month. @alistaire has a better solution using max(Date).

Replace NA values in dataframe variable with values from other dataframe by "ID"

I would like to know if there is a more concise way to replace NA values for a variable in a dataframe than what I did below. The code below seems to be longer than what I think might be possible in R. For example, I am unaware of some package/tool that might do this more succinctly.
Is there a way to replace, or merge values only if they are NA? After merging two dataframes using all.x = T I have some NA values, I'd like to replace those with information from another dataframe using a common variable to link the replacement.
# get dataframes
breaks <- structure(list(Break = 1:11, Value = c(2L, 13L, 7L, 9L, 40L,
21L, 10L, 37L, 7L, 26L, 42L)), .Names = c("Break", "Value"), class = "data.frame", row.names = c(NA,
-11L))
fsites <- structure(list(Site = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L), Plot = c(0L, 1L, 2L, 3L, 4L, 0L, 1L, 2L, 0L,
1L, 2L, 3L, 4L, 5L), Break = c(1L, 5L, 7L, 8L, 11L, 1L, 6L, 11L,
1L, 4L, 6L, 8L, 9L, 11L)), .Names = c("Site", "Plot", "Break"
), class = "data.frame", row.names = c(NA, -14L))
bps <- structure(list(Site = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L,
3L), Plot = c(0L, 1L, 2L, 3L, 1L, 2L, 0L, 1L, 2L, 3L, 4L), Value = c(0.393309653,
0.12465733, 0.27380161, 0.027288989, 0.439712533, 0.289724079,
0.036429062, 0.577460008, 0.820375917, 0.323217357, 0.28637503
)), .Names = c("Site", "Plot", "Value"), class = "data.frame", row.names = c(NA,
-11L))
# merge fsites and bps
df1 <- merge(fsites, bps, by=c("Site", "Plot"), all.x=T)
# merge df1 and breaks to get values to eventually replace the NA values in
# df1$Values.x, here "Break" is the ID by which to replace the NA values
df2 <- merge(df1, breaks, by=c("Break"))
# Create a new column 'Value' that uses Value.x, unless NA, then Value.y
df3 <- df2
df3$Value <- df2$Value.x
df2.na <- is.na(df2$Value.x)
df3$Value[df2.na] <- df2$Value.y[df2.na]
# get rid of unnecessary columns
cols <- c(1:3,6)
df4 <- df3[,cols]
At the stage where there is only (breaks, fsites, bps and) df1 around:
df1$Value <- ifelse(is.na(df1$Value),
breaks$Value[match(df1$Break, breaks$Break)], df1$Value)
#> df1
# Site Plot Break Value
#1 1 0 1 0.39330965
#2 1 1 5 0.12465733
#3 1 2 7 0.27380161
#4 1 3 8 0.02728899
#5 1 4 11 42.00000000
#6 2 0 1 2.00000000
#7 2 1 6 0.43971253
#8 2 2 11 0.28972408
#9 3 0 1 0.03642906
#10 3 1 4 0.57746001
#11 3 2 6 0.82037592
#12 3 3 8 0.32321736
#13 3 4 9 0.28637503
#14 3 5 11 42.00000000
#just to test with your `df4`
> sort(df1$Value) == sort(df4$Value)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
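The match() lookup can also be applied only to the NA rows, which avoids ifelse() altogether. The sketch below uses tiny made-up tables (not the question's data) to show the idea; na_rows is an illustrative name.

```r
# Toy tables: df1 has gaps in Value, breaks holds the fallback per Break
df1 <- data.frame(
  Break = c(1L, 5L, 11L, 11L),
  Value = c(0.39, NA, NA, 0.29)
)
breaks <- data.frame(
  Break = c(1L, 5L, 11L),
  Value = c(2, 40, 42)
)

# Rows that still need a value
na_rows <- is.na(df1$Value)

# Look up each missing row's Break in breaks and copy the matching Value
df1$Value[na_rows] <- breaks$Value[match(df1$Break[na_rows], breaks$Break)]
df1
```

Rows that already had a value are left untouched, and a Break with no entry in breaks would simply stay NA.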
