R sum of columns divided by number of columns without NA - r

I can't seem to figure this out. What I want to do is make a new column in my data frame containing the sum of several columns divided by the number of columns that contribute to the sum (that is, the ones that are not NA).
So from this:
ID 2003 2004 2005 2006
1 1 4 1 NA
2 2 2 NA 3
3 1 3 NA NA
4 4 1 1 NA
5 3 1 4 2
to this:
ID 2003 2004 2005 2006 SUM/col
1 1 4 1 NA 2
2 2 2 NA 3 2.33
3 1 3 NA NA 2
4 4 1 1 NA 2
5 3 1 4 2 2.5

We can use the rowMeans function and set na.rm = TRUE. dt[, -1] is a way to exclude the first column for the analysis.
dt$`SUM/col` <- rowMeans(dt[, -1], na.rm = TRUE)
dt
ID X2003 X2004 X2005 X2006 SUM/col
1 1 1 4 1 NA 2.000000
2 2 2 2 NA 3 2.333333
3 3 1 3 NA NA 2.000000
4 4 4 1 1 NA 2.000000
5 5 3 1 4 2 2.500000
DATA
dt <- read.table(text = "ID 2003 2004 2005 2006
1 1 4 1 NA
2 2 2 NA 3
3 1 3 NA NA
4 4 1 1 NA
5 3 1 4 2",
header = TRUE)

If your data.frame is called df, then try:
df$"SUM/col" <- apply(df[, -1], 1, function(x) mean(x, na.rm = TRUE))
The apply function calculates, for each row, the mean of the non-NA values, i.e. the sum (excluding NAs) divided by the number of non-NA elements. df[, -1] drops the ID column so that it does not enter the mean. The resulting vector is then added as a column to df.
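To check that the mean with na.rm = TRUE really is "sum divided by number of non-NA columns", here is a minimal base R sketch. It rebuilds the question's data by hand (using the X-prefixed column names shown in the output above), computes the sum and the non-NA count separately, and the result can then be compared with rowMeans. The names vals, row_sum, row_n and manual are only for illustration.

```r
# Question's data, typed in by hand; names follow the X-prefixed form above
dt <- data.frame(ID    = 1:5,
                 X2003 = c(1, 2, 1, 4, 3),
                 X2004 = c(4, 2, 3, 1, 1),
                 X2005 = c(1, NA, NA, 1, 4),
                 X2006 = c(NA, 3, NA, NA, 2))

vals    <- dt[, -1]                      # drop the ID column
row_sum <- rowSums(vals, na.rm = TRUE)   # sum of the non-NA values per row
row_n   <- rowSums(!is.na(vals))         # how many values are non-NA per row
manual  <- row_sum / row_n               # sum / count, computed explicitly

# Should agree with the one-liner from the answer
all.equal(manual, rowMeans(vals, na.rm = TRUE))
```

A row that is entirely NA gives 0 / 0 = NaN here, which is also what rowMeans returns for such a row.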

Related

Applying multiple if-else conditions on different columns in R

I have the following dataset:
Column1 Column2 Column3
3 3 1
2 3 2
1 NA 2
NA 4 1
2 NA NA
NA NA NA
I want to create a new column (Column 4) with the following conditions:
If columns 1 and 2 have the same value, the value in column 4 is the same as columns 1 and 2.
If columns 1 and 2 have different values, the value in column 4 should be 5.
If column 1 or column 2 has an NA, pick the value from column 3.
If 2 columns out of 3 have NA, then the value in column 4 should be that of the column that has a non-NA value.
If all the columns have NA, then column 4 should have NA too.
Column1 Column2 Column3 Column4
3 3 1 3 (Condition 1)
2 3 2 5 (Condition 2)
1 NA 2 2 (Condition 3)
NA 4 1 1 (Condition 3)
2 NA NA 2 (Condition 4)
NA NA NA NA (Condition 5)
Thanks in advance for answering this query.
How about this?
df <- read.table(text = "Column1 Column2 Column3
3 3 1
2 3 2
1 NA 2
NA 4 1
2 NA NA
NA NA NA ", header = T)
library(dplyr)
# note: & binds more tightly than |, so the NA checks are grouped explicitly
df %>%
  mutate(col4 = case_when(
    is.na(Column1) & is.na(Column2) & is.na(Column3) ~ NA_real_, # Con 5
    (is.na(Column1) | is.na(Column2)) & !is.na(Column3) ~ as.numeric(Column3), # Con 3
    is.na(Column1) | is.na(Column2) ~ as.numeric(coalesce(Column1, Column2)), # Con 4
    Column1 == Column2 ~ as.numeric(Column1), # Con 1
    TRUE ~ 5 # Con 2
  ))
Column1 Column2 Column3 col4
<int> <int> <int> <dbl>
1 3 3 1 3
2 2 3 2 5
3 1 NA 2 2
4 NA 4 1 1
5 2 NA NA 2
6 NA NA NA NA
New code
dummy <- data.frame(
ck6ethrace = c(2,2,3,2,2,2,NA,NA,2,NA,3,NA,1,3,NA,2,NA,2,4,2),
cm1ethrace = c(2,2,3,1,2,2,2,1,2,2,3,2,1,3,1,2,3,2,4,2),
cf1ethrace = c(2,2,3,2,2,2,3,1,2,2,2,2,1,3,3,2,3,2,4,2)
)
dummy %>%
  mutate(race = case_when(
    is.na(ck6ethrace) & is.na(cm1ethrace) & is.na(cf1ethrace) ~ NA_real_, # Con 5
    (is.na(ck6ethrace) | is.na(cm1ethrace)) & !is.na(cf1ethrace) ~ as.numeric(cf1ethrace), # Con 3
    is.na(ck6ethrace) | is.na(cm1ethrace) ~ as.numeric(coalesce(ck6ethrace, cm1ethrace)), # Con 4
    ck6ethrace == cm1ethrace ~ as.numeric(ck6ethrace), # Con 1
    TRUE ~ 5 # Con 2
  ))
result
ck6ethrace cm1ethrace cf1ethrace race
1 2 2 2 2
2 2 2 2 2
3 3 3 3 3
4 2 1 2 5
5 2 2 2 2
6 2 2 2 2
7 NA 2 3 3
8 NA 1 1 1
9 2 2 2 2
10 NA 2 2 2
11 3 3 2 3
12 NA 2 2 2
13 1 1 1 1
14 3 3 3 3
15 NA 1 3 3
16 2 2 2 2
17 NA 3 3 3
18 2 2 2 2
19 4 4 4 4
20 2 2 2 2
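For comparison, the five conditions can also be written in base R with nested ifelse calls. This is only a sketch of one possible reading of the rules, not the answer's method: when Column1 or Column2 is NA it takes Column3 if available and otherwise whichever of Column1/Column2 is present. The helper names c1, c2, c3 and fallback are made up for the example. Because & binds more tightly than | in R, the NA checks are kept in separate, explicitly grouped expressions.

```r
# Data from the question
df <- data.frame(Column1 = c(3, 2, 1, NA, 2, NA),
                 Column2 = c(3, 3, NA, 4, NA, NA),
                 Column3 = c(1, 2, 2, 1, NA, NA))

c1 <- df$Column1
c2 <- df$Column2
c3 <- df$Column3

# Used when Column1 or Column2 is missing:
# prefer Column3, then Column1, then Column2 (stays NA if all three are NA)
fallback <- ifelse(!is.na(c3), c3, ifelse(!is.na(c1), c1, c2))

df$Column4 <- ifelse(is.na(c1) | is.na(c2),
                     fallback,                 # conditions 3, 4 and 5
                     ifelse(c1 == c2, c1, 5))  # conditions 1 and 2
df
```

On the question's six rows this gives 3, 5, 2, 1, 2, NA, matching the desired Column4.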

R First Non-NA Value From Cols

df <- data.frame(ID=c(1,2,3,4,5,6),
CO=c(-6,4,2,3,0,2),
CATFOX=c(1,NA,NA,3,0,NA),
DOGFOX=c(NA,NA,5,1,2,NA),
RABFOX=c(NA,3,NA,5,3,NA),
D=c(0,4,5,6,1,2),
WANT=c(1,3,5,3,0,NA))
I have a data frame and I want the column WANT to take the first non-NA value among 'CATFOX', 'DOGFOX' and 'RABFOX'. Is there a data.table solution? I tried this, but it did not produce the desired outcome:
df$WANT=do.call(coalesce, data[grepl('FOX',names(data))])
You have coalesce in your example, which is a dplyr function. Try data.table's fcoalesce instead:
library(data.table)
setDT(df)[, WANT2 := fcoalesce(CATFOX, DOGFOX, RABFOX)]
Output:
ID CO CATFOX DOGFOX RABFOX D WANT WANT2
1: 1 -6 1 NA NA 0 1 1
2: 2 4 NA NA 3 4 3 3
3: 3 2 NA 5 NA 5 5 5
4: 4 3 3 1 5 6 3 3
5: 5 0 0 2 3 1 0 0
6: 6 2 NA NA NA 2 NA NA
We can use a vectorized option in base R
i1 <- endsWith(names(df), 'FOX')
df$WANT2 <- df[i1][cbind(seq_len(nrow(df)), max.col(!is.na(df[i1]), 'first'))]
df$WANT2
#[1] 1 3 5 3 0 NA
You could try this base R solution:
#Data
data=data.frame(ID=c(1,2,3,4,5),
CO=c(-6,4,2,3,0),
CATFOX=c(1,NA,NA,3,0),
DOGFOX=c(NA,NA,5,1,2),
RABFOX=c(NA,3,NA,5,3),
D=c(0,4,5,6,1),
WANT=c(1,3,5,3,0))
#Process
index <- which(names(data) %in% c('CATFOX','DOGFOX','RABFOX'))
data$WANT2 <- apply(data[,index],1,function(x) x[min(which(!is.na(x)))])
Output:
ID CO CATFOX DOGFOX RABFOX D WANT WANT2
1 1 -6 1 NA NA 0 1 1
2 2 4 NA NA 3 4 3 3
3 3 2 NA 5 NA 5 5 5
4 4 3 3 1 5 6 3 3
5 5 0 0 2 3 1 0 0
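If the FOX columns should be picked up by name pattern rather than listed one by one, one more base R possibility is to fold over the columns with Reduce, keeping the value already found and filling in from the next column only where it is still NA. This is a rough sketch, assuming df is a plain data.frame as in the question; fox_cols is just an illustrative name.

```r
# Data from the question
df <- data.frame(ID     = c(1, 2, 3, 4, 5, 6),
                 CO     = c(-6, 4, 2, 3, 0, 2),
                 CATFOX = c(1, NA, NA, 3, 0, NA),
                 DOGFOX = c(NA, NA, 5, 1, 2, NA),
                 RABFOX = c(NA, 3, NA, 5, 3, NA),
                 D      = c(0, 4, 5, 6, 1, 2),
                 WANT   = c(1, 3, 5, 3, 0, NA))

# Columns whose names end in "FOX", in their order of appearance
fox_cols <- grep("FOX$", names(df), value = TRUE)

# Walk through the columns left to right: keep a where it is known,
# otherwise take the value from the next column b
df$WANT2 <- Reduce(function(a, b) ifelse(is.na(a), b, a), df[fox_cols])
df$WANT2
```

Rows where every FOX column is NA simply stay NA, without a warning.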

Shifting the last non-NA value by id

I have a data table that looks like this:
DT<-data.table(day=c(1,2,3,4,5,6,7,8),Consumption=c(5,9,10,2,NA,NA,NA,NA),id=c(1,2,3,1,1,2,2,1))
day Consumption id
1: 1 5 1
2: 2 9 2
3: 3 10 3
4: 4 2 1
5: 5 NA 1
6: 6 NA 2
7: 7 NA 2
8: 8 NA 1
I want to create two columns that show, within each id group, the last non-NA consumption value before the observation and the number of days between the two observations. So far, I have tried this:
DT[, j := day-shift(day, fill = NA,n=1), by = id]
DT[, yj := shift(Consumption, fill = NA,n=1), by = id]
day Consumption id j yj
1: 1 5 1 NA NA
2: 2 9 2 NA NA
3: 3 10 3 NA NA
4: 4 2 1 3 5
5: 5 NA 1 1 2
6: 6 NA 2 4 9
7: 7 NA 2 1 NA
8: 8 NA 1 3 NA
However, I want the lagged consumption values with n=1 to come only from rows that have non-NA consumption values. For example, in the 7th row, the yj value is NA because it comes from the 6th row, which has NA consumption. I want it to come from the 2nd row instead. Therefore, I would like to end up with this data table:
day Consumption id j yj
1: 1 5 1 NA NA
2: 2 9 2 NA NA
3: 3 10 3 NA NA
4: 4 2 1 3 5
5: 5 NA 1 1 2
6: 6 NA 2 4 9
7: 7 NA 2 5 9
8: 8 NA 1 4 2
Note: The reason for specifically using the parameter n of shift function is that I will also need the 2nd last non-Na consumption values in the next step.
Thank You
Here's a data.table solution with an assist from zoo:
library(data.table)
library(zoo)
DT[, `:=`(day_shift = shift(day),
          yj = shift(Consumption)),
   by = id]
# where the lagged consumption is NA, make the lagged day NA as well
DT[is.na(yj), day_shift := NA_real_]
# carry the last non-NA value forward within each id
DT[,
   `:=`(day_shift = zoo::na.locf(day_shift, na.rm = FALSE),
        yj = zoo::na.locf(yj, na.rm = FALSE)),
   by = id]
# finally calculate j
DT[, j := day - day_shift]
# you can clean up the ordering or remove columns later
DT
day Consumption id day_shift yj j
1: 1 5 1 NA NA NA
2: 2 9 2 NA NA NA
3: 3 10 3 NA NA NA
4: 4 2 1 1 5 3
5: 5 NA 1 4 2 1
6: 6 NA 2 2 9 4
7: 7 NA 2 2 9 5
8: 8 NA 1 4 2 4
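To make the fill step less of a black box: with na.rm = FALSE, na.locf carries the last observed value forward and leaves leading NAs in place. A small base R function with roughly that behaviour can be sketched as follows. The name locf is made up for the example and it is not a replacement for the zoo function, just an illustration of the idea.

```r
# Last observation carried forward; leading NAs are kept as NA
locf <- function(x) {
  seen <- cumsum(!is.na(x))         # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1]     # position 1 is the NA used before any value
}

# Lagged days for id 1 from the answer, after blanking rows whose yj is NA
locf(c(NA, 1, 4, NA))

# Lagged consumption for id 2
locf(c(NA, 9, NA))
```

The first call returns NA 1 4 4 and the second NA 9 9, which are the day_shift values for id 1 and the yj values for id 2 in the output above.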

Calculate price differences in an unbalanced data set

I have an unbalanced data frame with dates, localities and prices. I would like to calculate the price differences between the different localities for each date. Because my data are unbalanced, I am thinking of adding rows (localities) to balance the data so that I can get all the price differences.
My data look like:
library(dplyr)
set.seed(123)
df= data.frame(date=(1:3),
locality= rbinom(21,3, 0.2),
price=rnorm(21, 50, 20))
df %>%
arrange(date, locality)
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 2 0 26.68910
9 2 1 100.56673
10 2 1 48.88628
11 2 1 48.29153
12 2 2 29.02214
13 2 2 45.68269
14 2 2 43.59887
15 3 0 60.98193
16 3 0 75.89527
17 3 0 43.30174
18 3 0 71.41221
19 3 0 33.62969
20 3 1 34.31236
21 3 1 23.76955
To balance the data, I am thinking of something like:
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 1 2 NA
9 1 2 NA
10 2 0 26.68910
10 2 0 NA
10 2 0 NA
11 2 1 100.56673
12 2 1 48.88628
13 2 1 48.29153
14 2 2 29.02214
15 2 2 45.68269
16 2 2 43.59887
etc...
Finally, to get the price difference between each pair of localities, I am thinking of:
> date diff(price, 0-1) diff(price, 0-2) diff(price, 1-2)
1 1 60.07625-54.76426 60.07625-47.09213 etc...
2 1 35.32994-66.51080 35.32994-NA
3 1 63.69872-28.28602 63.69872-NA
You don't need to balance your data. If you use dcast, it will add the NAs for you.
First transform the data to show individual columns for each locality
library(data.table)
library(tidyverse)
setDT(df)
df[, rid := rowid(date, locality)]
df2 <- dcast(df, rid + date ~ locality, value.var = 'price')
# rid date 0 1 2
# 1: 1 1 60.07625 54.76426 47.09213
# 2: 1 2 26.68910 100.56673 29.02214
# 3: 1 3 60.98193 34.31236 NA
# 4: 2 1 35.32994 66.51080 NA
# 5: 2 2 NA 48.88628 45.68269
# 6: 2 3 75.89527 23.76955 NA
# 7: 3 1 63.69872 28.28602 NA
# 8: 3 2 NA 48.29153 43.59887
# 9: 3 3 43.30174 NA NA
# 10: 4 3 71.41221 NA NA
# 11: 5 3 33.62969 NA NA
Then create a data frame to_diff of differences to calculate, and pmap over that to calculate the differences. Here c0_1 corresponds to what you call in your question diff(price, 0-1).
to_diff <- CJ(0:2, 0:2)[V1 < V2]
pmap(to_diff, ~ df2[[as.character(.x)]] - df2[[as.character(.y)]]) %>%
setNames(paste0('c', to_diff[[1]], '_', to_diff[[2]])) %>%
bind_cols(df2[, 1:2])
# A tibble: 11 x 5
# c0_1 c0_2 c1_2 rid date
# <dbl> <dbl> <dbl> <int> <int>
# 1 5.31 13.0 7.67 1 1
# 2 -73.9 -2.33 71.5 1 2
# 3 26.7 NA NA 1 3
# 4 -31.2 NA NA 2 1
# 5 NA NA 3.20 2 2
# 6 52.1 NA NA 2 3
# 7 35.4 NA NA 3 1
# 8 NA NA 4.69 3 2
# 9 NA NA NA 3 3
# 10 NA NA NA 4 3
# 11 NA NA NA 5 3
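The pairwise step can also be sketched in base R with combn, in case purrr is not available. The example below is an illustration only: it types in two rows of the wide table from above (rid 1 and rid 2 for date 1) instead of running dcast, and the names wide, pairs and diffs are made up.

```r
# Two rows of the wide table above (date 1, rid 1 and 2), typed in by hand
wide <- data.frame(`0` = c(60.07625, 35.32994),
                   `1` = c(54.76426, 66.51080),
                   `2` = c(47.09213, NA),
                   check.names = FALSE)

# All unordered pairs of locality columns: one pair per column of 'pairs'
pairs <- combn(names(wide), 2)

# For each pair, subtract the second locality's price from the first
diffs <- apply(pairs, 2, function(p) wide[[p[1]]] - wide[[p[2]]])
colnames(diffs) <- paste0("c", pairs[1, ], "_", pairs[2, ])
diffs
```

Any difference that involves a missing price comes out as NA, so no balancing is needed here either. Note that apply returns a plain vector rather than a matrix if the table has only one row.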

Selecting values in a dataframe based on a priority list

I am new to R so am still getting my head around the way it works. My problem is as follows: I have a data frame and a prioritised list of columns (pl), and I need to:
1. Find the maximum value from the columns in pl for each row and create a new column with this value (df$max)
2. Using the priority list, subtract this maximum value from the priority value, ignoring NAs and returning the absolute difference
Probably better with an example:
My priority list is
pl <- c("E","D","A","B")
and the data frame is:
A B C D E F G
1 15 5 20 9 NA 6 1
2 3 2 NA 5 1 3 2
3 NA NA 3 NA NA NA NA
4 0 1 0 7 8 NA 6
5 1 2 3 NA NA 1 6
So for the first row, the maximum is from column A (15) and the priority value is from column D (9), since E is NA. The answer I want should look like this:
A B C D E F G MAX MAX-PR
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA NA NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
How about this?
df$MAX <- apply(df[,pl], 1, max, na.rm = T)
df$MAX_PR <- df$MAX - apply(df[,pl], 1, function(x) x[!is.na(x)][1])
df$MAX[is.infinite(df$MAX)] <- NA
> df
# A B C D E F G MAX MAX_PR
# 1 15 5 20 9 NA 6 1 15 6
# 2 3 2 NA 5 1 3 2 5 4
# 3 NA NA 3 NA NA NA NA NA NA
# 4 0 1 0 7 8 NA 6 8 0
# 5 1 2 3 NA NA 1 6 2 1
Example:
df <- data.frame(A=c(1,NA,2,5,3,1),B=c(3,5,NA,6,NA,10),C=c(NA,3,4,5,1,4))
pl <- c("B","A","C")
#now we find the maximum per row, ignoring NAs
max.per.row <- apply(df,1,max,na.rm=T)
#and the first element according to the priority list, ignoring NAs
#(there may be a more efficient way to do this)
first.per.row <- apply(df[,pl],1, function(x) as.vector(na.omit(x))[1])
#and finally compute the difference
max.less.first.per.row <- max.per.row - first.per.row
Note that this code will break for any row that is all NA. There is no check against that.
Here is a simple version. First, I take only the pl columns; then, for each row, I remove the NAs and compute the max and its difference from the first (highest-priority) remaining value.
df <- dat[, pl]
cbind(dat, t(apply(df, 1, function(x) {
  x <- na.omit(x)
  c(max(x), max(x) - x[1])
})))
A B C D E F G 1 2
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA -Inf NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
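The -Inf in row 3 (and the warning that comes with it) appears because max is called on an empty vector. One way around it, sketched here under the assumption that the data frame is called dat as in the last answer, is to return NA explicitly for rows where every priority column is NA. row_stats is a hypothetical helper name.

```r
# Data and priority list from the question
dat <- data.frame(A = c(15, 3, NA, 0, 1),
                  B = c(5, 2, NA, 1, 2),
                  C = c(20, NA, 3, 0, 3),
                  D = c(9, 5, NA, 7, NA),
                  E = c(NA, 1, NA, 8, NA),
                  F = c(6, 3, NA, NA, 1),
                  G = c(1, 2, NA, 6, 6))
pl <- c("E", "D", "A", "B")

# Per-row helper: max over the priority columns and max minus the
# first non-NA value in priority order; NA for rows with no data
row_stats <- function(x) {
  x <- unname(x[!is.na(x)])
  if (length(x) == 0) return(c(MAX = NA_real_, MAX_PR = NA_real_))
  c(MAX = max(x), MAX_PR = max(x) - x[1])
}

dat <- cbind(dat, t(apply(dat[, pl], 1, row_stats)))
dat
```

Since the maximum is never smaller than the priority value, MAX - priority is already the absolute difference asked for in the question.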
