R First Non-NA Value From Cols - r

df <- data.frame(ID=c(1,2,3,4,5,6),
CO=c(-6,4,2,3,0,2),
CATFOX=c(1,NA,NA,3,0,NA),
DOGFOX=c(NA,NA,5,1,2,NA),
RABFOX=c(NA,3,NA,5,3,NA),
D=c(0,4,5,6,1,2),
WANT=c(1,3,5,3,0,NA))
I have a dataframe and i wish to make column WANT take the first value of 'CATFOX' 'DOGFOX' 'RABFOX' that is not NA. Is there a data.table solution? I tried this but it did not produce the desired outcome:
df$WANT=do.call(coalesce, data[grepl('FOX',names(data))])

You have coalesce in your example which is dplyr's construct. Try fcoalesce:
library(data.table)
setDT(df)[, WANT2 := fcoalesce(CATFOX, DOGFOX, RABFOX)]
Output:
ID CO CATFOX DOGFOX RABFOX D WANT WANT2
1: 1 -6 1 NA NA 0 1 1
2: 2 4 NA NA 3 4 3 3
3: 3 2 NA 5 NA 5 5 5
4: 4 3 3 1 5 6 3 3
5: 5 0 0 2 3 1 0 0
6: 6 2 NA NA NA 2 NA NA

We can use a vectorized option in base R
i1 <- endsWith(names(df), 'FOX')
df$WANT2 <- df[i1][cbind(seq_len(nrow(df)), max.col(!is.na(df[i1]), 'first'))]
df$WAN2
#[1] 1 3 5 3 0 NA

You could try this base R solution:
#Data
data=data.frame(ID=c(1,2,3,4,5),
CO=c(-6,4,2,3,0),
CATFOX=c(1,NA,NA,3,0),
DOGFOX=c(NA,NA,5,1,2),
RABFOX=c(NA,3,NA,5,3),
D=c(0,4,5,6,1),
WANT=c(1,3,5,3,0))
#Process
index <- which(names(data) %in% c('CATFOX','DOGFOX','RABFOX'))
data$WANT2 <- apply(data[,index],1,function(x) x[min(which(!is.na(x)))])
Output:
ID CO CATFOX DOGFOX RABFOX D WANT WANT2
1 1 -6 1 NA NA 0 1 1
2 2 4 NA NA 3 4 3 3
3 3 2 NA 5 NA 5 5 5
4 4 3 3 1 5 6 3 3
5 5 0 0 2 3 1 0 0

Related

How to make the next number in a column a sequence in r

sorry to bother everyone. I have been stuck with coding
Student Number
1 NA
1 NA
1 1
1 1
2 NA
2 1
2 1
2 1
3 NA
3 NA
3 1
3 1
I tried using dplyr to cluster by students try to find a way so that every time it reads that 1, it adds it to the following column so it would read as
Student Number
1 NA
1 NA
1 1
1 2
2 NA
2 1
2 2
2 3
3 NA
3 NA
3 1
3 2
etc
Thank you! It'd help with attendance.
data.table solution;
library(data.table)
setDT(df)
df[!is.na(Number),Number:=cumsum(Number),by=Student]
df
Student Number
<int> <int>
1 1 NA
2 1 NA
3 1 1
4 1 2
5 2 NA
6 2 1
7 2 2
8 2 3
9 3 NA
10 3 NA
11 3 1
12 3 2
Try using cumsum, note that cumsum itself cannot ignore NA
library(dplyr)
df %>%
group_by(Student) %>%
mutate(n = cumsum(ifelse(is.na(Number), 0, Number)) + 0 * Number)
Student Number n
<int> <int> <dbl>
1 1 NA NA
2 1 NA NA
3 1 1 1
4 1 1 2
5 2 NA NA
6 2 1 1
7 2 1 2
8 2 1 3
9 3 NA NA
10 3 NA NA
11 3 1 1
12 3 1 2

R Insert Value within Dataframe

I have a very complex problem, i hope someone can help -> i want to copy a row value (i.e. Player 1 or Player 2) into two other rows (for Player 3 and 4) if and only if these players are in the same Treatment, Group and Period AND this player was indeed picked (see column Player.Picked)
I know that with tidyverse I can group_by my columns of interest: Treatment, Group, and Period.
However, I am unsure how to proceed with the condition that Player Picked is fulfilled and then how to extract this value appropriately for the players 3 and 4 in the same treatment, group, period.
The column "extracted.Player 1/2 Value" should be the output. (I have manually provided the first four correct solutions).
Any ideas? Help would be very much appreciated. Thanks a lot in advance!
df
T Player Group Player.Picked Period Player1/2Value extracted.Player1/2Value
1 1 6 1 1 10
1 2 6 1 1 9
1 3 5 2 1 NA -> 4
1 4 6 1 1 NA -> 10
1 5 3 1 1 NA
1 1 5 2 1 8
1 2 1 0 1 7
1 3 6 1 1 NA -> 10
1 4 2 2 1 NA
1 5 2 2 1 NA
1 1 1 0 1 7
1 2 2 2 1 11
1 3 3 1 1 NA
1 4 4 1 1 NA
1 5 4 1 1 NA
1 1 2 2 1 21
1 2 4 1 1 17
1 3 1 0 1 NA
1 4 5 2 1 NA -> 4
1 5 6 1 1 NA
1 1 3 1 1 12
1 2 3 1 1 15
1 3 4 1 1 NA
1 4 1 0 1 NA
1 5 1 0 1 NA
1 1 4 1 1 11
1 2 5 2 1 4
1 3 2 2 1 NA
1 4 3 1 1 NA
1 5 5 2 1 NA
I'm not sure if I understood the required logic; here I'm assuming that Player 5 always picks Player 1 or 2 per Group.
So, here is my go at this using library(data.table):
library(data.table)
DT <- data.table::data.table(
check.names = FALSE,
T = c(1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L),
Player = c(1L,2L,3L,
4L,5L,1L,2L,3L,4L,5L,1L,2L,3L,4L,5L,
1L,2L,3L,4L,5L,1L,2L,3L,4L,5L,1L,
2L,3L,4L,5L),
Group = c(6L,6L,5L,
6L,3L,5L,1L,6L,2L,2L,1L,2L,3L,4L,4L,
2L,4L,1L,5L,6L,3L,3L,4L,1L,1L,4L,
5L,2L,3L,5L),
Player.Picked = c(1L,1L,2L,
1L,1L,2L,0L,1L,2L,2L,0L,2L,1L,1L,1L,
2L,1L,0L,2L,1L,1L,1L,1L,1L,0L,0L,
1L,2L,2L,2L),
Period = c(1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L),
`Player1/2Value` = c(10L,9L,NA,
NA,NA,8L,7L,NA,NA,NA,7L,11L,NA,NA,
NA,21L,17L,NA,NA,NA,12L,15L,NA,NA,NA,
11L,4L,NA,NA,NA),
`extracted.Player1/2Value` = c(NA,NA,4L,
10L,NA,NA,NA,10L,NA,NA,NA,NA,NA,NA,
NA,NA,NA,NA,4L,NA,NA,NA,NA,NA,NA,NA,
NA,NA,NA,NA)
)
setorderv(DT, cols = c("T", "Group", "Period", "Player"))
Player5PickedDT <- DT[Player == 5, Player.Picked, by = c("T", "Group", "Period")]
setnames(Player5PickedDT, old = "Player.Picked", new = "Player5Picked")
DT <- DT[Player5PickedDT, on = c("T", "Group", "Period")]
extractedDT <- DT[Player == Player5Picked & Player5Picked > 0, `Player1/2Value`, by = c("T", "Group", "Period")]
setnames(extractedDT, old = "Player1/2Value", new = "extractedValue")
DT[, "Player5Picked" := NULL]
DT <- extractedDT[DT, on = c("T", "Group", "Period")]
DT[, extractedValue := fifelse(Player %in% c(3, 4), yes = extractedValue, no = NA_real_)]
setcolorder(DT, c("T", "Group", "Period", "Player", "Player.Picked", "Player1/2Value", "extracted.Player1/2Value", "extractedValue"))
DT
The resulting table differs from your expected result (extracted.Player1/2Value vs extractedValue, but in my eyes it is following the explained logic):
T Group Period Player Player.Picked Player1/2Value extracted.Player1/2Value extractedValue
1: 1 1 1 1 0 7 NA NA
2: 1 1 1 2 0 7 NA NA
3: 1 1 1 3 0 NA NA NA
4: 1 1 1 4 1 NA NA NA
5: 1 1 1 5 0 NA NA NA
6: 1 2 1 1 2 21 NA NA
7: 1 2 1 2 2 11 NA NA
8: 1 2 1 3 2 NA NA 11
9: 1 2 1 4 2 NA NA 11
10: 1 2 1 5 2 NA NA NA
11: 1 3 1 1 1 12 NA NA
12: 1 3 1 2 1 15 NA NA
13: 1 3 1 3 1 NA NA 12
14: 1 3 1 4 2 NA NA 12
15: 1 3 1 5 1 NA NA NA
16: 1 4 1 1 0 11 NA NA
17: 1 4 1 2 1 17 NA NA
18: 1 4 1 3 1 NA NA 11
19: 1 4 1 4 1 NA NA 11
20: 1 4 1 5 1 NA NA NA
21: 1 5 1 1 2 8 NA NA
22: 1 5 1 2 1 4 NA NA
23: 1 5 1 3 2 NA 4 4
24: 1 5 1 4 2 NA 4 4
25: 1 5 1 5 2 NA NA NA
26: 1 6 1 1 1 10 NA NA
27: 1 6 1 2 1 9 NA NA
28: 1 6 1 3 1 NA 10 10
29: 1 6 1 4 1 NA 10 10
30: 1 6 1 5 1 NA NA NA
T Group Period Player Player.Picked Player1/2Value extracted.Player1/2Value extractedValue

Calculate diff price in a unbalanced set

I have a unbalanced data frame with date, localities and prices. I would like calculate diff price among diferents localities by date. My data its unbalanced and to get all diff price I think in create data(localities) to balance data.
My data look like:
library(dplyr)
set.seed(123)
df= data.frame(date=(1:3),
locality= rbinom(21,3, 0.2),
price=rnorm(21, 50, 20))
df %>%
arrange(date, locality)
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 2 0 26.68910
9 2 1 100.56673
10 2 1 48.88628
11 2 1 48.29153
12 2 2 29.02214
13 2 2 45.68269
14 2 2 43.59887
15 3 0 60.98193
16 3 0 75.89527
17 3 0 43.30174
18 3 0 71.41221
19 3 0 33.62969
20 3 1 34.31236
21 3 1 23.76955
To get balanced data I think in:
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 1 2 NA
9 1 2 NA
10 2 0 26.68910
10 2 0 NA
10 2 0 NA
11 2 1 100.56673
12 2 1 48.88628
13 2 1 48.29153
14 2 2 29.02214
15 2 2 45.68269
16 2 2 43.59887
etc...
Finally to get diff price beetwen pair localities I think:
> date diff(price, 0-1) diff(price, 0-2) diff(price, 1-2)
1 1 60.07625-54.76426 60.07625-47.09213 etc...
2 1 35.32994-66.51080 35.32994-NA
3 1 63.69872-28.28602 63.69872-NA
You don't need to balance your data. If you use dcast, it will add the NAs for you.
First transform the data to show individual columns for each locality
library(data.table)
library(tidyverse)
setDT(df)
df[, rid := rowid(date, locality)]
df2 <- dcast(df, rid + date ~ locality, value.var = 'price')
# rid date 0 1 2
# 1: 1 1 60.07625 54.76426 47.09213
# 2: 1 2 26.68910 100.56673 29.02214
# 3: 1 3 60.98193 34.31236 NA
# 4: 2 1 35.32994 66.51080 NA
# 5: 2 2 NA 48.88628 45.68269
# 6: 2 3 75.89527 23.76955 NA
# 7: 3 1 63.69872 28.28602 NA
# 8: 3 2 NA 48.29153 43.59887
# 9: 3 3 43.30174 NA NA
# 10: 4 3 71.41221 NA NA
# 11: 5 3 33.62969 NA NA
Then create a data frame to_diff of differences to calculate, and pmap over that to calculate the differences. Here c0_1 corresponds to what you call in your question diff(price, 0-1).
to_diff <- CJ(0:2, 0:2)[V1 < V2]
pmap(to_diff, ~ df2[[as.character(.x)]] - df2[[as.character(.y)]]) %>%
setNames(paste0('c', to_diff[[1]], '_', to_diff[[2]])) %>%
bind_cols(df2[, 1:2])
# A tibble: 11 x 5
# c0_1 c0_2 c1_2 rid date
# <dbl> <dbl> <dbl> <int> <int>
# 1 5.31 13.0 7.67 1 1
# 2 -73.9 -2.33 71.5 1 2
# 3 26.7 NA NA 1 3
# 4 -31.2 NA NA 2 1
# 5 NA NA 3.20 2 2
# 6 52.1 NA NA 2 3
# 7 35.4 NA NA 3 1
# 8 NA NA 4.69 3 2
# 9 NA NA NA 3 3
# 10 NA NA NA 4 3
# 11 NA NA NA 5 3

R sum of columns divided by number of columns without NA

i cant seem to figure this out. What i want to do is make a new column in my dataframe with the sum of several columns divided by the number of columns which constribute to the sum.
so like this:
ID 2003 2004 2005 2006
1 1 4 1 NA
2 2 2 NA 3
3 1 3 NA NA
4 4 1 1 NA
5 3 1 4 2
to this:
ID 2003 2004 2005 2006 SUM/col
1 1 4 1 NA 2
2 2 2 NA 3 2.33
3 1 3 NA NA 2
4 4 1 1 NA 3
5 3 1 4 2 2.5
We can use the rowMeans function and set na.rm = TRUE. dt[, -1] is a way to exclude the first column for the analysis.
dt$`SUM/col` <- rowMeans(dt[, -1], na.rm = TRUE)
dt
ID X2003 X2004 X2005 X2006 SUM/col
1 1 1 4 1 NA 2.000000
2 2 2 2 NA 3 2.333333
3 3 1 3 NA NA 2.000000
4 4 4 1 1 NA 2.000000
5 5 3 1 4 2 2.500000
DATA
dt <- read.table(text = "ID 2003 2004 2005 2006
1 1 4 1 NA
2 2 2 NA 3
3 1 3 NA NA
4 4 1 1 NA
5 3 1 4 2",
header = TRUE)
If your data.frame is called df, then try:
df$"SUM/col" <- apply(df, 1, function(x) mean(x, na.rm=T))
The apply function calculates, for each row, the sum (excluding NAs) divided by the total number of non-NA elements. The resulting vector is then added as a column to df.

Selecting values in a dataframe based on a priority list

I am new to R so am still getting my head around the way it works. My problem is as follows, I have a data frame and a prioritised list of columns (pl), I need:
To find the maximum value from the columns in pl for each row and create a new column with this value (df$max)
Using the priority list, subtract this maximum value from the priority value, ignoring NAs and returning the absolute difference
Probably better with an example:
My priority list is
pl <- c("E","D","A","B")
and the data frame is:
A B C D E F G
1 15 5 20 9 NA 6 1
2 3 2 NA 5 1 3 2
3 NA NA 3 NA NA NA NA
4 0 1 0 7 8 NA 6
5 1 2 3 NA NA 1 6
So for the first line the maximum is from column A (15) and the priority value is from column D (9) since E is a NA. The answer I want should look like this.
A B C D E F G MAX MAX-PR
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA NA NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
How about this?
df$MAX <- apply(df[,pl], 1, max, na.rm = T)
df$MAX_PR <- df$MAX - apply(df[,pl], 1, function(x) x[!is.na(x)][1])
df$MAX[is.infinite(df$MAX)] <- NA
> df
# A B C D E F G MAX MAX_PR
# 1 15 5 20 9 NA 6 1 15 6
# 2 3 2 NA 5 1 3 2 5 4
# 3 NA NA 3 NA NA NA NA NA NA
# 4 0 1 0 7 8 NA 6 8 0
# 5 1 2 3 NA NA 1 6 2 1
Example:
df <- data.frame(A=c(1,NA,2,5,3,1),B=c(3,5,NA,6,NA,10),C=c(NA,3,4,5,1,4))
pl <- c("B","A","C")
#now we find the maximum per row, ignoring NAs
max.per.row <- apply(df,1,max,na.rm=T)
#and the first element according to the priority list, ignoring NAs
#(there may be a more efficient way to do this)
first.per.row <- apply(df[,pl],1, function(x) as.vector(na.omit(x))[1])
#and finally compute the difference
max.less.first.per.row <- max.per.row - first.per.row
Note that this code will break for any row that is all NA. There is no check against that.
Here a simple version. First , I take only pl columns , for each line I remove na then I compute the max.
df <- dat[,pl]
cbind(dat, t(apply(df, 1, function(x) {
x <- na.omit(x)
c(max(x),max(x)-x[1])
}
)
)
)
A B C D E F G 1 2
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA -Inf NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1

Resources