Conditional replacing with NA in R (two dataframes)

I have
idx <- c(1397, 2000, 3409, 3415, 4077, 4445, 5021, 5155)
idy <- c( 1397, 2000, 2860, 3029, 3415, 3707, 4077, 4445, 5021, 5155,
5251, 5560)
agex <- c(NA, NA, NA, 35, NA, 62, 35, 46)
agey <- c( 3, 45, 0, 89, 7, 2, 13, 24, 58, 8, 3, 45)
dat1 <- as.data.frame(cbind(idx, agex))
dat2 <- as.data.frame(cbind(idy, agey))
Now, whenever agex is NA and idx equals idy, I want agey to be NA, so that
idy agey
1 1397 NA
2 2000 NA
3 2860 0
4 3029 89
5 3415 7
6 3707 2
7 4077 NA
8 4445 24
9 5021 58
10 5155 8
11 5251 3
12 5560 45
I have tried this:
ifelse(is.na(dat1$agex) | dat1$idx %in% dat2$idy, NA, dat2$agey)
It returns NAs at the correct indices, but the result is shortened to the length of idx instead of keeping the length of idy.

I want whenever agex = NA, and idx = idy, that agey = NA
With a data.table update join...
library(data.table)
setDT(dat1); setDT(dat2)
dat2[dat1[is.na(agex)], on=.(idy = idx), agey := NA]
dat2
idy agey
1: 1397 NA
2: 2000 NA
3: 2860 0
4: 3029 89
5: 3415 7
6: 3707 2
7: 4077 NA
8: 4445 24
9: 5021 58
10: 5155 8
11: 5251 3
12: 5560 45
How it works
dat1[is.na(agex)] is the subset where agex is NA
DT[mDT, on=, j] is a join where rows of mDT are looked up in DT using on=
j is done in the joined subset of DT
when j is k := expr, column k of DT is updated
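For comparison, the same update can be sketched in base R without a join, using the data from the question (na_ids is just an illustrative helper name). The ifelse() attempt in the question returns a vector as long as its test, which is built from dat1 and therefore has 8 elements; indexing dat2 directly keeps all 12 rows.

```r
# Data as in the question
idx  <- c(1397, 2000, 3409, 3415, 4077, 4445, 5021, 5155)
idy  <- c(1397, 2000, 2860, 3029, 3415, 3707, 4077, 4445, 5021, 5155,
          5251, 5560)
agex <- c(NA, NA, NA, 35, NA, 62, 35, 46)
agey <- c(3, 45, 0, 89, 7, 2, 13, 24, 58, 8, 3, 45)
dat1 <- data.frame(idx, agex)
dat2 <- data.frame(idy, agey)

# ids whose age is missing in dat1 (hypothetical helper name)
na_ids <- dat1$idx[is.na(dat1$agex)]

# set agey to NA only for rows of dat2 whose id is among them
dat2$agey[dat2$idy %in% na_ids] <- NA
dat2
```

An id such as 3409, which has a missing agex but does not occur in idy, is simply ignored.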

Related

How to set missing some columns and their corresponding columns in data frame in R

I have longitudinal data with three follow-ups (columns 2, 3 and 4).
I want to set the value 99 in the columns v_9, v_01, and v_03 to NA, and I also want to set the corresponding entries in the columns "d_9", "d_01", "d_03" and "a_9", "a_01", "a_03" to NA. The expected result for ID 101 is shown below.
How can I do this for all individuals across my whole data set in R? Thanks in advance for the help.
"id" "v_9" "v_01" "v_03" "d_9" "d_01" "d_03" "a_9" "a_01" "a_03"
101 12 NA 10 2015-03-23 NA 2003-06-19 40.50650 NA 44.1065
structure(list(id = c(101, 102, 103, 104), v_9 = c(12, 99, 16,
25), v_01 = c(99, 12, 16, NA), v_03 = c(10, NA, 99, NA), d_9 = structure(c(16517,
17613, 16769, 10667), class = "Date"), d_01 = structure(c(13291,
NA, 13566, NA), class = "Date"), d_03 = structure(c(12222, NA,
12119, NA), class = "Date"), a_9 = c(40.5065, 40.5065, 30.19713,
51.40862), a_01 = c(42.5065, 41.5112, 32.42847, NA), a_03 = c(44.1065,
NA, 35.46543, NA)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
Try this function:
fn <- function(df) {
  for (s in c("_9", "_01", "_03")) {
    # rows where the follow-up value is coded as 99
    i <- which(df[[paste0("v", s)]] == 99)
    df[i, paste0("v", s)] <- NA
    df[i, paste0("d", s)] <- NA
    df[i, paste0("a", s)] <- NA
  }
  df
}
df <- fn(df)
Output
# A tibble: 4 × 10
id v_9 v_01 v_03 d_9 d_01 d_03 a_9 a_01 a_03
<dbl> <dbl> <dbl> <dbl> <date> <date> <date> <dbl> <dbl> <dbl>
1 101 12 NA 10 2015-03-23 NA 2003-06-19 40.5 NA 44.1
2 102 NA 12 NA NA NA NA NA 41.5 NA
3 103 16 16 NA 2015-11-30 2007-02-22 NA 30.2 32.4 NA
4 104 25 NA NA 1999-03-17 NA NA 51.4 NA NA
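If the follow-up suffixes should not be typed by hand, they could be derived from the column names. The following is a minimal sketch under assumptions: it uses a shortened, made-up stand-in for the question's data (a plain data.frame with two follow-ups), and it assumes every v_ column has matching d_ and a_ columns.

```r
# Small stand-in for the question's data (values shortened, two follow-ups only)
df <- data.frame(
  id   = c(101, 102, 103),
  v_9  = c(12, 99, 16),
  v_01 = c(99, 12, 16),
  d_9  = as.Date(c("2015-03-23", "2018-03-23", "2015-11-30")),
  d_01 = as.Date(c("2006-05-23", NA, "2007-02-22")),
  a_9  = c(40.5, 40.5, 30.2),
  a_01 = c(42.5, 41.5, 32.4)
)

# Suffixes taken from the v_ columns: "_9", "_01"
suffixes <- sub("^v", "", grep("^v_", names(df), value = TRUE))

for (s in suffixes) {
  v   <- df[[paste0("v", s)]]
  hit <- !is.na(v) & v == 99              # rows coded 99 for this follow-up
  for (prefix in c("v", "d", "a")) {
    df[[paste0(prefix, s)]][hit] <- NA    # blank value, date and age together
  }
}
df
```

Using a logical index means a follow-up without any 99 is left untouched.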

Combining components of a list in r

I have a list that contains data by year. I want to combine these components into a single dataframe, which is matched by row. Example list:
List [[1]]
State Year X Y
23 1971 etc etc
47 1971 etc etc
List[[2]]
State Year X Y
13 1972 etc etc
23 1973 etc etc
47 1973 etc etc
etc....
List[[45]]
State Year X Y
1 2017 etc etc
2 2017 etc etc
3 2017 etc etc
1 2017 etc etc
23 2017 etc etc
47 2017 etc etc
I want the dataframe to look like this (I know I will have to go through and remove some extra columns):
State 1971_X 1971_Y 1972_X 1972_Y....2018_X 2019_Y
1 NA NA NA NA etc etc
2 NA NA etc etc etc etc
3 etc etc etc etc etc etc
...
50 NA NA etc etc etc etc
I have tried the command Outcomewanted=do.call("cbind", examplelist) but get the message
"Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 36, 40, 20, 42, 38, 26, 17, 31, 35, 23, 33, 13, 29, 28, 32, 34, 41, 37, 43, 39, 30, 14, 10, 4, 7"
It seems that the cbind.fill command could be an option but has been retired? Thanks for any help in advance.
You may use reshape after a do.call(rbind()) manoeuvre.
res <- reshape(do.call(rbind, lst), idvar="state", timevar="year", direction="wide")
res
# state x.1971 y.1971 x.1972 y.1972 x.1973 y.1973
# 1 23 1.3709584 0.3631284 NA NA -0.1061245 2.0184237
# 2 24 -0.5646982 0.6328626 NA NA 1.5115220 -0.0627141
# 3 13 NA NA 0.4042683 -0.09465904 NA NA
Data
lst <- list(structure(list(state = c(23, 24), year = c(1971, 1971),
x = c(1.37095844714667, -0.564698171396089), y = c(0.363128411337339,
0.63286260496104)), class = "data.frame", row.names = c(NA,
-2L)), structure(list(state = c(13, 23, 24), year = c(1972, 1973,
1973), x = c(0.404268323140999, -0.106124516091484, 1.51152199743894
), y = c(-0.0946590384130976, 2.01842371387704, -0.062714099052421
)), class = "data.frame", row.names = c(NA, -3L)))
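To get closer to the Year_Variable column names asked for in the question, the wide result could be renamed and sorted afterwards. This is only a sketch on a made-up two-element list; the regular expression assumes that reshape() produced names of the form variable.year, as in the output above.

```r
# Toy list in the same shape as the example data (values are made up)
lst <- list(
  data.frame(state = c(23, 24), year = 1971, x = c(1, 2), y = c(3, 4)),
  data.frame(state = c(13, 23), year = 1972, x = c(5, 6), y = c(7, 8))
)

# Stack the years, then spread them into one row per state
res <- reshape(do.call(rbind, lst), idvar = "state", timevar = "year",
               direction = "wide")

# "x.1971" -> "1971_x"; names without a ".<year>" suffix (state) are unchanged
names(res) <- sub("^(.+)\\.([0-9]+)$", "\\2_\\1", names(res))

# Sort by state and reset the row names
res <- res[order(res$state), ]
rownames(res) <- NULL
res
```

Names that start with a digit are not syntactic in R, so such columns have to be accessed as res[["1971_x"]] or with backticks.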

R: Sum values based on 2 condition from 2 data frame of different length

I'm trying to get the total and average of some variables based on date. I have fuel data and driving data for several vehicles. The fuel data covers fewer dates than the driving data (logically, you only refuel after several trips). The end result should be the sum and average of the driving data for each fuel date.
Fuel data:
plate = c("AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345")
date = c("2017-09-08", "2017-09-11", "2017-09-13", "2017-09-20", "2017-09-06", "2017-09-08", "2017-09-15", "2017-09-23", "2017-09-10", "2017-09-18")
liter = c(33, 15, 28, 40, 43, 20, 25, 50, 26, 48)
df1 = data.frame(plate, date, liter)
Driving data:
plate = c("AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345")
date = c("2017-09-01", "2017-01-05", "2017-09-08", "2017-09-10", "2017-09-11", "2017-09-12", "2017-09-13", "2017-09-16", "2017-09-17", "2017-09-20", "2017-09-22", "2017-09-25", "2017-09-02", "2017-09-03", "2017-09-06", "2017-09-07", "2017-09-08", "2017-09-09", "2017-09-13", "2017-09-15", "2017-09-17", "2017-09-20", "2017-09-23", "2017-09-25", "2017-09-01", "2017-09-04", "2017-09-09", "2017-09-12", "2017-09-15", "2017-09-18", "2017-09-19", "2017-09-20", "2017-09-23", "2017-09-27", "2017-09-30")
mileage = c(50, 64, 45, 70, 58, 41, 22, 15, 90, 48, 52, 48, 29, 65, 70, 46, 88, 71, 40, 51, 38, 91, 74, 61, 41, 33, 59, 81, 72, 65, 43, 81, 20, 49, 39)
accx = c(0, 3, 4, 0, 8, 11, 2, 5, 9, 10, 2, 22, 9, 6, 7, 6, 8, 1, 0, 1, 8, 1, 7, 6, 4, 3, 9, 11, 22, 15, 13, 1, 2, 4, 9)
df2 = data.frame(plate, date, mileage,accx)
Merging both data sets:
library(dplyr)
df.all = left_join(df2, df1, by = c("plate", "date"))
I'd like to get the total mileage(sum) and average accx based on fuel dates. The end result should look like this:
Is there a way to use dplyr to mutate the desired results? FYI I only need the mutated parameters with their plates (result line 1,2,4,6,8,etc isn't needed)
Thanks in advance!
There might be a more elegant way, but this works:
library(dplyr)
df.all %>%
  mutate(date = as.Date(date)) %>%
  group_by(plate) %>%
  arrange(plate, date) %>%
  mutate(t.mileage = cumsum(mileage) * !is.na(liter),
         t.accx = cumsum(accx) * !is.na(liter),
         n = seq_len(n())) %>%
  filter(!is.na(liter)) %>%
  mutate(t.mileage.lag = lag(t.mileage),
         t.accx.lag = lag(t.accx),
         n.lag = lag(n),
         t.mileage = ifelse(!is.na(t.mileage.lag), t.mileage - t.mileage.lag, t.mileage),
         a.accx = ifelse(!is.na(t.accx.lag), (t.accx - t.accx.lag)/(n - n.lag), t.accx/n)) %>%
  select(-t.mileage.lag, -t.accx.lag, -n.lag, -n, -t.accx)
# A tibble: 9 x 7
# Groups: plate [3]
# plate date mileage accx liter t.mileage a.accx
# <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 AB123 2017-09-08 45 4 33 159 2.3333333
# 2 AB123 2017-09-11 58 8 15 128 4.0000000
# 3 AB123 2017-09-13 22 2 28 63 6.5000000
# 4 AB123 2017-09-20 48 10 40 153 8.0000000
# 5 AC234 2017-09-06 70 7 43 164 7.3333333
# 6 AC234 2017-09-08 88 8 20 134 7.0000000
# 7 AC234 2017-09-15 51 1 25 162 0.6666667
# 8 AC234 2017-09-23 74 7 50 203 5.3333333
# 9 AD345 2017-09-18 65 15 48 351 10.6666667
Group by plate so that everything that follows is calculated per plate. Then calculate the total mileage and total accx with cumsum, but keep the values only where liter is non-missing. Also number the drives with n. As we are only interested in the rows where the car was refuelled, filter by non-missing liter. Using lag, subtract the previous total mileage and accx from each value (unless there is no previous value, i.e. the lag is NA), and then calculate the average accx.
Data
df1 <-
data.frame(plate = c("AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345"),
date = c("2017-09-08", "2017-09-11", "2017-09-13", "2017-09-20", "2017-09-06", "2017-09-08",
"2017-09-15", "2017-09-23", "2017-09-10", "2017-09-18"),
liter = c(33, 15, 28, 40, 43, 20, 25, 50, 26, 48),
stringsAsFactors = FALSE)
df2 <-
data.frame(plate = c("AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345"),
date = c("2017-09-01", "2017-01-05", "2017-09-08", "2017-09-10", "2017-09-11", "2017-09-12", "2017-09-13", "2017-09-16", "2017-09-17", "2017-09-20", "2017-09-22", "2017-09-25", "2017-09-02", "2017-09-03", "2017-09-06", "2017-09-07", "2017-09-08", "2017-09-09", "2017-09-13", "2017-09-15", "2017-09-17", "2017-09-20", "2017-09-23", "2017-09-25", "2017-09-01", "2017-09-04", "2017-09-09", "2017-09-12", "2017-09-15", "2017-09-18", "2017-09-19", "2017-09-20", "2017-09-23", "2017-09-27", "2017-09-30"),
mileage = c(50, 64, 45, 70, 58, 41, 22, 15, 90, 48, 52, 48, 29, 65, 70, 46, 88, 71, 40, 51, 38, 91, 74, 61, 41, 33, 59, 81, 72, 65, 43, 81, 20, 49, 39),
accx = c(0, 3, 4, 0, 8, 11, 2, 5, 9, 10, 2, 22, 9, 6, 7, 6, 8, 1, 0, 1, 8, 1, 7, 6, 4, 3, 9, 11, 22, 15, 13, 1, 2, 4, 9),
stringsAsFactors = FALSE)
df.all <- left_join(df2, df1, by = c("plate", "date"))
Although @kath provided a more convenient solution, here's a base R one (if only because I spent some time working on it):
# generate factor to split on
temp <- which(!is.na(df.all$liter))
vec <- temp - c(0, temp[-length(temp)])
df.all$split <- rep(seq(1, length(temp)+1), c(vec, nrow(df.all)-temp[length(temp)]))
# split df.all and calculate t.mileage and a.accx for each subsample
df.temp <- split(df.all, df.all$split)
t.mileage <- sapply(df.temp, function(x) sum(x[, "mileage"]))
a.accx <- sapply(df.temp, function(x) mean(x[, "accx"]))
# generate new variables and insert calculated values
df.all$t.mileage <- NA
df.all$t.mileage[temp] <- t.mileage[-length(t.mileage)]
df.all$a.accx <- NA
df.all$a.accx[temp] <- a.accx[-length(a.accx)]
# display df.all without splitting factor
df.all <- subset(df.all, select = -split)
> df.all
plate date mileage accx liter t.mileage a.accx
1 AB123 2017-09-01 50 0 NA NA NA
2 AB123 2017-01-05 64 3 NA NA NA
3 AB123 2017-09-08 45 4 33 159 2.3333333
4 AB123 2017-09-10 70 0 NA NA NA
5 AB123 2017-09-11 58 8 15 128 4.0000000
6 AB123 2017-09-12 41 11 NA NA NA
7 AB123 2017-09-13 22 2 28 63 6.5000000
8 AB123 2017-09-16 15 5 NA NA NA
9 AB123 2017-09-17 90 9 NA NA NA
10 AB123 2017-09-20 48 10 40 153 8.0000000
11 AB123 2017-09-22 52 2 NA NA NA
12 AB123 2017-09-25 48 22 NA NA NA
13 AC234 2017-09-02 29 9 NA NA NA
14 AC234 2017-09-03 65 6 NA NA NA
15 AC234 2017-09-06 70 7 43 264 9.2000000
16 AC234 2017-09-07 46 6 NA NA NA
17 AC234 2017-09-08 88 8 20 134 7.0000000
18 AC234 2017-09-09 71 1 NA NA NA
19 AC234 2017-09-13 40 0 NA NA NA
20 AC234 2017-09-15 51 1 25 162 0.6666667
21 AC234 2017-09-17 38 8 NA NA NA
22 AC234 2017-09-20 91 1 NA NA NA
23 AC234 2017-09-23 74 7 50 203 5.3333333
24 AC234 2017-09-25 61 6 NA NA NA
25 AD345 2017-09-01 41 4 NA NA NA
26 AD345 2017-09-04 33 3 NA NA NA
27 AD345 2017-09-09 59 9 NA NA NA
28 AD345 2017-09-12 81 11 NA NA NA
29 AD345 2017-09-15 72 22 NA NA NA
30 AD345 2017-09-18 65 15 48 412 10.0000000
31 AD345 2017-09-19 43 13 NA NA NA
32 AD345 2017-09-20 81 1 NA NA NA
33 AD345 2017-09-23 20 2 NA NA NA
34 AD345 2017-09-27 49 4 NA NA NA
35 AD345 2017-09-30 39 9 NA NA NA
By the way, there has to be a simpler way to generate the splitting factor of step 1 above; does anyone know how?
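One possible answer to that closing question, as a base R sketch on made-up toy data (the trips data frame, its values and the helper column grp are hypothetical): counting, within each plate, the refuels at or after each row gives every trip the id of the refuel it leads up to. Because the counter restarts for each plate, trips of one vehicle are not added to the first refuel of the next one, which is what happens in rows 15 and 30 of the output above (264 and 412, versus 164 and 351 in the grouped dplyr result).

```r
# Toy data: two plates; a non-missing liter marks a refuel
trips <- data.frame(
  plate   = c("A", "A", "A", "A", "B", "B", "B"),
  mileage = c(10, 20, 30, 40, 5, 6, 7),
  accx    = c(1, 3, 2, 4, 0, 3, 6),
  liter   = c(NA, 33, NA, 15, NA, NA, 20)
)

is_fuel <- !is.na(trips$liter)

# Per plate: number of refuels at or after each row.
# Trips after the last refuel of a plate get 0.
trips$grp <- ave(as.integer(is_fuel), trips$plate,
                 FUN = function(z) rev(cumsum(rev(z))))

# Sum and mean within each plate/refuel block
trips$t.mileage <- ave(trips$mileage, trips$plate, trips$grp, FUN = sum)
trips$a.accx    <- ave(trips$accx,    trips$plate, trips$grp, FUN = mean)

# Keep the results only on the refuel rows
trips[!is_fuel, c("t.mileage", "a.accx")] <- NA
trips
```

The rows are assumed to be sorted by plate and date beforehand, as in the dplyr answer.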

R: Combining several character columns into one by replacing NA-rows

I have a data frame consisting of character variables which looks like this:
V1 V2 V3 V4 V5
1 ID Date pic1 pic2 pic3
2 1 15.06.16 11:50 abc <NA> def
3 1 16.06.16 11:19 <NA> hij <NA>
4 1 17.06.16 11:41 <NA> <NA> nop
5 2 28.05.16 11:40 tuv <NA> <NA>
6 2 29.05.16 11:39 <NA> zab <NA>
7 2 30.05.16 09:07 <NA> <NA> wxy
8 3 03.06.16 07:31 lmn <NA> <NA>
9 3 04.06.16 11:01 <NA> rst <NA>
10 3 05.06.16 13:57 <NA> <NA> opq
So on each day one of the pic-variables contains a value, the rest is NA.
Now I want to combine all pic-values into one variable by replacing the NAs. Sorry if this is a duplicate; I've already tried a lot of suggested solutions but nothing has worked so far.
Thanks!
We can try data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)); then, grouped by 'ID' and 'Date', unlist the Subset of Data.table (.SD) and omit the NA elements (na.omit).
library(data.table)
setDT(df1)[, .(pic = na.omit(unlist(.SD))), by = .(ID, Date)]
# ID Date pic
# 1: 1 15.06.16 11:50 abc
# 2: 1 15.06.16 11:50 def
# 3: 1 16.06.16 11:19 hij
# 4: 1 17.06.16 11:41 nop
# 5: 2 28.05.16 11:40 tuv
# 6: 2 29.05.16 11:39 zab
# 7: 2 30.05.16 09:07 wxy
# 8: 3 03.06.16 07:31 lmn
# 9: 3 04.06.16 11:01 rst
#10: 3 05.06.16 13:57 opq
Or another option is pmax if there is only a single non-NA per row
setDT(df1)[, pic := do.call(pmax, c(.SD, na.rm = TRUE)),
.SDcols = pic1:pic3][, paste0("pic", 1:3) := NULL][]
Or using dplyr
library(dplyr)
df1 %>%
mutate(pic = pmax(pic1, pic2, pic3, na.rm=TRUE))%>%
select(-(pic1:pic3))
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), Date = c("15.06.16 11:50",
"16.06.16 11:19", "17.06.16 11:41", "28.05.16 11:40", "29.05.16 11:39",
"30.05.16 09:07", "03.06.16 07:31", "04.06.16 11:01", "05.06.16 13:57"
), pic1 = c("abc", NA, NA, "tuv", NA, NA, "lmn", NA, NA), pic2 = c(NA,
"hij", NA, NA, "zab", NA, NA, "rst", NA), pic3 = c("def", NA,
"nop", NA, NA, "wxy", NA, NA, "opq")), .Names = c("ID", "Date",
"pic1", "pic2", "pic3"), row.names = c(NA, -9L), class = "data.frame")
Assuming that "on each day one of the pic-variables contains a value, the rest is NA", you can use coalesce from dplyr to get what you want:
library(dplyr)
result <- df1 %>% mutate(pic = coalesce(pic1, pic2, pic3)) %>%
select(-(pic1:pic3))
With the data supplied by akrun:
print(result)
## ID Date pic
##1 1 15.06.16 11:50 abc
##2 1 16.06.16 11:19 hij
##3 1 17.06.16 11:41 nop
##4 2 28.05.16 11:40 tuv
##5 2 29.05.16 11:39 zab
##6 2 30.05.16 09:07 wxy
##7 3 03.06.16 07:31 lmn
##8 3 04.06.16 11:01 rst
##9 3 05.06.16 13:57 opq
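The same "first non-missing value per row" idea can also be sketched in base R without extra packages. The example uses the first three rows of the sample data; first_non_na and pic_cols are illustrative names, not part of any package.

```r
# First three rows of the sample data
df1 <- data.frame(
  ID   = c(1L, 1L, 1L),
  Date = c("15.06.16 11:50", "16.06.16 11:19", "17.06.16 11:41"),
  pic1 = c("abc", NA, NA),
  pic2 = c(NA, "hij", NA),
  pic3 = c("def", NA, "nop"),
  stringsAsFactors = FALSE
)

pic_cols <- c("pic1", "pic2", "pic3")

# Keep the value found so far; fall back to the next column where it is NA
first_non_na <- function(a, b) ifelse(is.na(a), b, a)

# Fold the pic columns from left to right, then drop the originals
df1$pic <- Reduce(first_non_na, df1[pic_cols])
df1 <- df1[setdiff(names(df1), pic_cols)]
df1
```

Row 1 of the sample data has values in both pic1 and pic3; this sketch keeps the leftmost one ("abc"), in line with the coalesce output above.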

Align timeseries in R

I have a data frame like this.
date X1 X2
1: 2001-12-31 96.32 NA
2: 2002-01-29 NA 100.7
3: 2002-01-31 96.59 NA
4: 2002-02-28 96.67 100.7
5: 2002-03-29 NA 100.7
6: 2002-03-31 97.36 NA
7: 2002-04-29 NA 87.3
8: 2002-04-30 97.72 NA
9: 2002-05-29 NA 87.3
10: 2002-05-31 97.60 NA
The values are recorded on different dates and I would like to align them to month end: X1 serves as the "base", and the X2 values should be moved to the same month-end dates as X1. The end product would be a clean data frame with matching dates and without the NAs.
Expected output:
date X1 X2
1: 2001-12-31 96.32 NA
2: 2002-01-31 96.59 100.7
3: 2002-02-28 96.67 100.7
4: 2002-03-31 97.36 100.7
5: 2002-04-30 97.72 87.3
6: 2002-05-31 97.60 87.3
Data
df <- structure(list(date = structure(c(11687L, 11716L, 11718L, 11746L,
11775L, 11777L, 11806L, 11807L, 11836L, 11838L), class = "Date"),
X1 = c(96.32, NA, 96.59, 96.67, NA, 97.36, NA, 97.72, NA,
97.6), X2 = c(NA, 100.7, NA, 100.7, 100.7, NA, 87.3, NA,
87.3, NA)), .Names = c("date", "X1", "X2"), row.names = c(NA,
10L), class = "data.frame")
We could try the following using data.table.
library(data.table)
setDT(df)[,month := month(date)][,lapply(.SD, max, na.rm = TRUE), by = month]
# month date X1 X2
#1: 12 2001-12-31 96.32 -Inf
#2: 1 2002-01-31 96.59 100.7
#3: 2 2002-02-28 96.67 100.7
#4: 3 2002-03-31 97.36 100.7
#5: 4 2002-04-30 97.72 87.3
#6: 5 2002-05-31 97.60 87.3
There is a new variable month that has been created for grouping purposes (and to keep the original date column), you can always get rid of it if not needed afterwards.
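Two details of the result above are worth noting: month(date) groups by month number only, so the same month of different years would be merged once the data spans more than one year, and max() with na.rm = TRUE returns -Inf for a group that contains only NAs. A base R sketch that groups by year and month and returns NA in that case could look as follows; it uses the first four rows of the sample data, and last_obs is a hypothetical helper.

```r
# First rows of the sample data
df <- data.frame(
  date = as.Date(c("2001-12-31", "2002-01-29", "2002-01-31", "2002-02-28")),
  X1   = c(96.32, NA, 96.59, 96.67),
  X2   = c(NA, 100.7, NA, 100.7)
)

# Last non-missing value of a vector, or NA if there is none
last_obs <- function(z) {
  z <- z[!is.na(z)]
  if (length(z) > 0) z[length(z)] else NA_real_
}

# One row per calendar month (year and month), dated at the latest date seen
by_month <- split(df, format(df$date, "%Y-%m"))
aligned <- do.call(rbind, lapply(by_month, function(g) {
  data.frame(date = max(g$date), X1 = last_obs(g$X1), X2 = last_obs(g$X2))
}))
rownames(aligned) <- NULL
aligned
```

The sketch assumes the rows are in date order within each month, so that the last non-missing value is also the most recent one.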
