I have a large set of data with various indices and such. I would like to change my data from something like this:
id time var1_t1 var1_t2 var1_t3 var2_t1 var2_t2 var2_t3
1 1 1.5 NA NA 3.3 NA NA
1 2 NA 2.5 NA NA 1.2 NA
1 3 NA NA 3.5 NA NA .7
to something like this:
id time var1_t1 var1_t2 var1_t3 var2_t1 var2_t2 var2_t3
1 1 1.5 2.5 3.5 3.3 1.2 .7
1 2 NA 2.5 NA NA 1.2 NA
1 3 NA NA 3.5 NA NA .7
or this:
id time var1_t1 var1_t2 var1_t3 var2_t1 var2_t2 var2_t3
1 1 1.5 2.5 3.5 3.3 1.2 .7
1 2 1.5 2.5 3.5 3.3 1.2 .7
1 3 1.5 2.5 3.5 3.3 1.2 .7
Except that there are rather more than six columns I need to do this for, and "id" has values other than 1.
I can do this for a single column as follows:
for (i in 1:max(df$id) ){
df[df$id == i & df$time == 1,]$var1_t1 <- max(df[df$id == i,]$var1_t1,
na.rm = TRUE)
}
But that uses a for loop, so it is a terrible idea. And I would have to repeat that line for each column. Is there a way I can do this more elegantly?
If you want to replace all NA's with the column-wise max value by group of id, you could define a little custom function:
f <- function(x) {
x[is.na(x)] <- max(x, na.rm = TRUE)
x
}
And then use your favorite data manipulation functions/package, for example dplyr:
library(dplyr)
df %>% group_by(id) %>% mutate(across(everything(), f))
Or data.table:
library(data.table)
setDT(df)[, lapply(.SD, f), by = id]
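For anyone who wants to stay in base R, here is a minimal sketch of the same idea using ave(). The sample data and the names fill_max and value_cols are made up for illustration; the assumption is that every column other than id and time should be filled. Note that a group with no non-missing value at all would get -Inf (with a warning) from max().

```r
# Illustrative data in the layout of the question (a second id added)
df <- data.frame(
  id      = c(1, 1, 1, 2, 2),
  time    = c(1, 2, 3, 1, 2),
  var1_t1 = c(1.5, NA, NA, 4.0, NA),
  var1_t2 = c(NA, 2.5, NA, NA, 5.0)
)

# Replace NAs in a vector with the maximum of its non-missing values
fill_max <- function(x) {
  x[is.na(x)] <- max(x, na.rm = TRUE)
  x
}

# Apply fill_max within each id, for every column except id and time
value_cols <- setdiff(names(df), c("id", "time"))
df[value_cols] <- lapply(df[value_cols],
                         function(col) ave(col, df$id, FUN = fill_max))
df
```

This corresponds to the third layout in the question, where every row of a group carries the group's value.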
I have a partially filled-in table with NAs at the top and bottom (column X in the table below). I want to fill in the table using a rate (0.3) to get the results in the Goal column. This is similar to the fill up/down function in Excel used to copy a formula and fill cells.
df <- data.frame(X = matrix(nrow = 10, ncol = 1, NA))
df [3:5,1] <- 2:4
X Goal
1 NA 1.4
2 NA 1.7
3 2 2
4 3 3
5 4 4
6 NA 4.3
7 NA 4.6
8 NA 4.9
9 NA 5.2
10 NA 5.5
Essentially what I want the code to do is this:
1.4 (X2 answer - 0.3)
1.7 (2 - 0.3)
2
3
4
4.3 (4 + 0.3)
4.6 (X6 answer + 0.3)
4.9 (X7 answer + 0.3)
5.2 (X8 answer + 0.3)
5.5 (X9 answer + 0.3)
I know this can probably be done using loops, but I find them intimidating given my skill level, so I'm looking for a solution that avoids them (if that's even possible).
Avoiding loops with nafill() and fcoalesce() from data.table.
library(data.table)
loc = range(which(!is.na(df$X)))
df$Goal =
fcoalesce(nafill(df$X, "locf"), nafill(df$X, "nocb")) +
c( -((loc[1] - 1):1)*0.3, rep(0, diff(loc)+1), (1:(nrow(df) - loc[2]))*0.3 )
Still, it is (arguably) much easier to keep track of what is happening in each case with a loop:
# Preallocate
df$Goal = 0
for (i in 1:nrow(df)) {
if (i < loc[1]) df$Goal[i] = df$X[loc[1]] - (loc[1] - i) * 0.3
else if (i > loc[2]) df$Goal[i] = df$X[loc[2]] + (i - loc[2]) * 0.3
else df$Goal[i] = df$X[i ]
}
# X Goal
# 1 NA 1.4
# 2 NA 1.7
# 3 2 2.0
# 4 3 3.0
# 5 4 4.0
# 6 NA 4.3
# 7 NA 4.6
# 8 NA 4.9
# 9 NA 5.2
# 10 NA 5.5
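The three cases in the loop can also be written as a vectorised ifelse() in base R. This is only a sketch, assuming the non-missing values of X form one contiguous block; idx and rate are names introduced here for illustration.

```r
X <- c(NA, NA, 2, 3, 4, NA, NA, NA, NA, NA)
rate <- 0.3

idx <- seq_along(X)
loc <- range(which(!is.na(X)))   # first and last non-missing positions

Goal <- ifelse(idx < loc[1], X[loc[1]] - (loc[1] - idx) * rate,   # above the block
        ifelse(idx > loc[2], X[loc[2]] + (idx - loc[2]) * rate,   # below the block
               X))                                                # inside the block
Goal
```

Each element is computed from its distance to the nearest end of the known block, so nothing depends on a previously filled value.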
I have a dataframe here:
df <- data.frame("Time" = 1:10, "Value" = c(1.7,NA,-999,-999,1.5,1.6,NA,4,-999,8))
"NA" means there is no observation, just leave them there. "-999" means the observation is identified as an outlier.
Now I am trying to replace the "-999" with the average of the nearest values. For example:
The first "-999" should be replaced with (1.7+1.5)/2 = 1.6
The second "-999" should be replaced with (1.7+1.5)/2 = 1.6
The last "-999" should be replaced with (4.0+8.0)/2 = 6
I tried to use a next statement to find the next iteration, and an if statement to decide where to stop. But how can I go back to check the previous iterations? Or is there another kind of solution to this?
Many thanks.
One approach utilizing dplyr, purrr and tidyr could be:
library(dplyr)
library(purrr)
df %>%
mutate(New_Value = if_else(Value == -999,
map_dbl(.x = seq_along(Value),
~ mean(c(tail(na.omit(na_if(Value[1:(.x - 1)], -999)), 1),
head(na.omit(na_if(Value[(.x + 1):n()], -999)), 1)))),
Value))
Time Value New_Value
1 1 1.7 1.7
2 2 NA NA
3 3 -999.0 1.6
4 4 -999.0 1.6
5 5 1.5 1.5
6 6 1.6 1.6
7 7 NA NA
8 8 4.0 4.0
9 9 -999.0 6.0
10 10 8.0 8.0
Using a few while loops, which bump up how far we lag/lead, we can accomplish this. I am not sure how performant this operation will be on large data sets. But it seems to get the job done for your sample data.
# find where replacements are needed and initialize
where_to_replace <- which(df$Value == -999)
len_replace <- length(where_to_replace)
lag_value <- rep(NA, len_replace)
lead_value <- rep(NA, len_replace)
# more initializing
i <- 1
lag_n <- 1
lead_n <- 1
while(i <= len_replace){
# find appropriate lagged value
# can't use NA or lag value == -999
while(is.na(lag_value[i]) | lag_value[i] == -999){
lag_value[i] <- dplyr::lag(df$Value, lag_n)[where_to_replace[i]]
lag_n <- lag_n + 1
}
# find appropriate lead value
# can't use NA or -999 as lead value
while(is.na(lead_value[i]) | lead_value[i] == -999){
lead_value[i] <- dplyr::lead(df$Value, lead_n)[where_to_replace[i]]
lead_n <- lead_n + 1
}
# reset iterators
i <- i + 1
lag_n <- 1
lead_n <- 1
}
# replacement value
df$Value[where_to_replace] <- (lead_value + lag_value) / 2
# Time Value
# 1 1 1.7
# 2 2 NA
# 3 3 1.6
# 4 4 1.6
# 5 5 1.5
# 6 6 1.6
# 7 7 NA
# 8 8 4.0
# 9 9 6.0
# 10 10 8.0
I created two new helper columns, before and after.
before fills every NA and -999 with the nearest valid value above it, and after fills every NA and -999 with the nearest valid value below it. In the next step I overwrote each -999 with the mean of the two values.
library(dplyr)
library(tidyr)
df <- data.frame(Time = 1:10,
Value = c(1.7, NA, -999, -999, 1.5,
1.6, NA,
4, -999, 8))
df <- df %>%
mutate(before = recode(Value, `-999` = NA_real_),
after = recode(Value, `-999` = NA_real_)) %>%
fill(before, .direction = "down") %>%
fill(after, .direction = "up") %>%
mutate(Value = case_when(Value == -999 ~ (before + after)/2,
TRUE ~ Value)) %>%
select(Time, Value)
The Output
Time Value
1 1 1.7
2 2 NA
3 3 1.6
4 4 1.6
5 5 1.5
6 6 1.6
7 7 NA
8 8 4.0
9 9 6.0
10 10 8.0
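The before/after idea does not strictly need tidyr. Below is a rough base R sketch of it; locf is a hypothetical helper written for this example (not a function from any package) that carries the last non-missing value forward.

```r
# Hypothetical helper: last observation carried forward
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # index of the latest non-NA so far
  idx[idx == 0] <- NA                      # nothing to carry before the first value
  x[idx]
}

v <- c(1.7, NA, -999, -999, 1.5, 1.6, NA, 4, -999, 8)
clean  <- replace(v, which(v == -999), NA)  # treat outliers as missing
before <- locf(clean)                       # nearest valid value above
after  <- rev(locf(rev(clean)))             # nearest valid value below

# Only the -999 positions are replaced; genuine NAs stay NA
out <- ifelse(!is.na(v) & v == -999, (before + after) / 2, v)
out
```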
Here is a base R option using findInterval
x <- which(df$Value == -999)
y <- setdiff(which(!is.na(df$Value)),x)
ind <- findInterval(x,y)
dfout <- within(df,Value <- replace(Value,x,rowMeans(cbind(Value[y[ind]],Value[y[ind+1]]))))
such that
> dfout
Time Value
1 1 1.7
2 2 NA
3 3 1.6
4 4 1.6
5 5 1.5
6 6 1.6
7 7 NA
8 8 4.0
9 9 6.0
10 10 8.0
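To see what findInterval is doing there, it may help to print the intermediate vectors. The sketch below reuses the data from the question; left, right and filled are names added for illustration. It assumes every outlier has a valid value on both sides: an outlier before the first or after the last valid value would give an index of 0 or one past the end of y, which this does not handle.

```r
df <- data.frame(Time = 1:10,
                 Value = c(1.7, NA, -999, -999, 1.5, 1.6, NA, 4, -999, 8))

x <- which(df$Value == -999)               # outlier positions: 3 4 9
y <- setdiff(which(!is.na(df$Value)), x)   # valid positions:   1 5 6 8 10
ind <- findInterval(x, y)                  # 1 1 4: which gap of y each x falls in

left  <- y[ind]       # nearest valid position before each outlier: 1 1 8
right <- y[ind + 1]   # nearest valid position after each outlier:  5 5 10

filled <- (df$Value[left] + df$Value[right]) / 2
filled
```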
Just sticking with base R data.frames we can make a function and use sapply over indices of interest.
outliers <- df$Value == -999 # Keep as logical for now
fillers <- which(!is.na(df$Value) & !outliers)
outliers <- which(outliers) # Now convert to indices; FALSE and NA do not appear
filled_outliers <- sapply(outliers, function(x) {
before_ind = max(fillers[fillers < x]) # maximum INDEX before an outlier
after_ind = min(fillers[fillers > x])
0.5*(df$Value[before_ind] + df$Value[after_ind])
})
df$Value[outliers] <- filled_outliers
df
Gives:
Time Value
1 1 1.7
2 2 NA
3 3 1.6
4 4 1.6
5 5 1.5
6 6 1.6
7 7 NA
8 8 4.0
9 9 6.0
10 10 8.0
I have a dataframe df with columns ID, X and Y
ID = c(1,1,2,2)
X = c(1,0.4,0.8,0.1)
Y = c(0.5,0.5,0.7,0.7)
df <- data.frame(ID,X,Y)
ID X Y
1 1.0 0.5
1 0.4 0.5
2 0.8 0.7
2 0.1 0.7
I would like to obtain two new columns:
Xg equal to X when X is greater than Y and NA otherwise
Xl equal to X when X is less than Y and NA otherwise. That is,
ID X Y Xg Xl
1 1.0 0.5 1.0 NA
1 0.4 0.5 NA 0.4
2 0.8 0.7 0.8 NA
2 0.1 0.7 NA 0.1
Below should work, even if there are NA's in X or Y:
library(dplyr)
df %>%
mutate(Xg = ifelse(X > Y, X, NA),
Xl = ifelse(X < Y, X, NA))
If you want to use if_else from dplyr, you have to convert NA to numeric. if_else is stricter than ifelse in that it checks whether the TRUE and FALSE values are the same type:
df %>%
mutate(Xg = if_else(X > Y, X, NA_real_),
Xl = if_else(X < Y, X, NA_real_))
Result:
ID X Y Xg Xl
1 1 1.0 0.5 1.0 NA
2 1 0.4 0.5 NA 0.4
3 2 0.8 0.7 0.8 NA
4 2 0.1 0.7 NA 0.1
5 3 NA 1.0 NA NA
6 3 3.0 NA NA NA
Data:
ID = c(1,1,2,2,3,3)
X = c(1,0.4,0.8,0.1,NA,3)
Y = c(0.5,0.5,0.7,0.7,1,NA)
df <- data.frame(ID,X,Y)
What about some plain old R indexing and subsetting?
ID <- c(1,1,2,2, 3, 3)
X <- c(1,0.4,0.8,0.1, NA, 2)
Y <- c(0.5,0.5,0.7,0.7, 2, NA)
Xg <- Xl <- rep(NA_real_, length(ID))
Xg[which(X > Y)] <- X[which(X > Y)]
Xl[which(X < Y)] <- X[which(X < Y)]
data.frame(ID, X, Y, Xg, Xl)
Note: I assume that if X or Y is missing, Xg and Xl should be NA.
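The reason which() is used there is that a comparison involving NA returns NA, and which() keeps only the positions that are TRUE. A small sketch with made-up values:

```r
X <- c(1, 0.4, NA, 2)
Y <- c(0.5, 0.5, 2, NA)

X > Y          # TRUE FALSE NA NA
which(X > Y)   # 1: the NA comparisons are dropped

# Start from all-NA vectors and fill only the matching positions
Xg <- Xl <- rep(NA_real_, length(X))
Xg[which(X > Y)] <- X[which(X > Y)]
Xl[which(X < Y)] <- X[which(X < Y)]
```

Rows where either X or Y is missing are never selected, so they keep the NA they were initialised with.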
For the sake of completeness, and as the question originally used data.table() before it was edited (and because I like the concise code), here is a "one-liner" using data.table's update in place:
library(data.table)
setDT(df)[X > Y, Xg := X][X < Y, Xl := X][]
ID X Y Xg Xl
1: 1 1.0 0.5 1.0 NA
2: 1 0.4 0.5 NA 0.4
3: 2 0.8 0.7 0.8 NA
4: 2 0.1 0.7 NA 0.1
5: 3 NA 1.0 NA NA
6: 3 3.0 NA NA NA
(Using the data of useR)
NA's are handled automatically as only matching rows are updated.
I have a data frame looking like this:
as.data.frame(matrix(c(1,2,3,NA,4,5,NA,NA,9), nrow = 3, ncol = 3))
V1 V2 V3
1 1 NA NA
2 2 4 NA
3 3 5 9
I would like to calculate a cumulative mean per column, which ignores NAs, so something like this:
V1 V2 V3
1 1 NA NA
2 1.5 4 NA
3 2 4.5 9
I tried this:
B[!is.na(A)] <- as.data.frame(apply(B[!is.na(A)], 2, cummean))
But received this error message:
dim(X) must have a positive length
Thanks for your help!
Cheers
This should work:
A <- as.data.frame(matrix(c(1,2,3,NA,4,5,NA,NA,9), nrow = 3, ncol = 3))
B <- as.data.frame(apply(A,2,function(col){
col[!is.na(col)] <- dplyr::cummean(col[!is.na(col)])
return(col)
}))
> B
V1 V2 V3
1 1.0 NA NA
2 1.5 4.0 NA
3 2.0 4.5 9
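If you would rather not depend on dplyr for cummean, the running mean is just cumsum divided by the running count. A minimal base R sketch, where cummean_na is a hypothetical helper name:

```r
# Cumulative mean over the non-missing values; NAs stay where they are
cummean_na <- function(x) {
  ok <- !is.na(x)
  x[ok] <- cumsum(x[ok]) / seq_len(sum(ok))
  x
}

A <- as.data.frame(matrix(c(1, 2, 3, NA, 4, 5, NA, NA, 9), nrow = 3, ncol = 3))
B <- as.data.frame(lapply(A, cummean_na))
B
```

lapply over the data frame works column by column, so no apply/matrix conversion is needed.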
We can use data.table
library(data.table)
library(dplyr)
d1 <- as.data.frame(matrix(c(1,2,3,NA,4,5,NA,NA,9), nrow = 3, ncol = 3))
setDT(d1)
for(j in seq_along(d1)){
set(d1, i = which(!is.na(d1[[j]])), j=j, value = cummean(d1[[j]][!is.na(d1[[j]])]))
}
d1
# V1 V2 V3
#1: 1.0 NA NA
#2: 1.5 4.0 NA
#3: 2.0 4.5 9
I'm having a brain-freeze.
This is what I have:
C <- c(C1, C2, C3) # A constant for every row in the data frame
r <- c(r1, r2, r3, r4) # A ratio for every column in the data frame
My data frame looks like this:
1 2 3 4
a 0.7 0.4 NA NA
b NA NA 0.3 NA
c NA 0.6 NA 0.4
I need to fill in the NA's with a multiplication of C and r so that it looks like this:
1 2 3 4
a 0.7 0.4 C1*r3 C1*r4
b C2*r1 C2*r2 0.3 C2*r4
c C3*r1 0.6 C3*r3 0.4
Notice that the multiplication is only done for the NA's and not for numbers that already exist. I know is.na is used to pick out the NA's, and it's probably just linear algebra, but my brain has quit for the day. Any help would be great.
Thanks.
If mm is your matrix, you can fill missing values like this:
mm[is.na(mm)] <- outer(C,r)[is.na(mm)]
Example with data:
mm <- read.table(text=' 1 2 3 4
a 0.7 0.4 NA NA
b NA NA 0.3 NA
c NA 0.6 NA 0.4')
C <- c(1, 1, 1) # A constant for every row in the data frame
r <- c(2, 2, 2, 2)
mm[is.na(mm)] <- outer(C,r)[is.na(mm)]
# X1 X2 X3 X4
# a 0.7 0.4 2.0 2.0
# b 2.0 2.0 0.3 2.0
# c 2.0 0.6 2.0 0.4
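With all constants equal it is hard to see which product lands where, so here is the same thing sketched with distinct (made-up) values for C and r and a plain matrix instead of read.table. outer(C, r)[i, j] is C[i] * r[j], and the logical mask picks out exactly the cells that were NA.

```r
C <- c(10, 20, 30)     # one constant per row (illustrative values)
r <- c(1, 2, 3, 4)     # one ratio per column (illustrative values)

mm <- matrix(c(0.7, NA,  NA,    # column 1
               0.4, NA,  0.6,   # column 2
               NA,  0.3, NA,    # column 3
               NA,  NA,  0.4),  # column 4
             nrow = 3,
             dimnames = list(c("a", "b", "c"), as.character(1:4)))

fill <- outer(C, r)              # 3 x 4 matrix of all C[i] * r[j]
mm[is.na(mm)] <- fill[is.na(mm)] # overwrite only the missing cells
mm
#      1    2    3    4
# a  0.7  0.4 30.0 40.0
# b 20.0 40.0  0.3 80.0
# c 30.0  0.6 90.0  0.4
```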