conditional statements in data.table - r

I am trying to use conditional statements to obtain some variables in a data table. Here's some simplified data, the code and the results:
> dt
id trial bet outcome
1: 11 1 1 6
2: 11 2 456 2
3: 11 3 3456 3
4: 11 4 456 6
5: 12 1 34 6
6: 12 2 3456 2
7: 12 3 12 4
8: 12 4 123 2
dt1=dt[,list(
nbet=nchar(bet),
if (nchar(bet)>2.5) riskybet=1 else riskybet=0,
if (grepl(outcome,bet)==TRUE) win=1 else win=0),
by='id,trial']
> dt1
id trial nbet V2 V3
1: 11 1 1 0 0
2: 11 2 3 1 0
3: 11 3 4 1 1
4: 11 4 3 1 1
5: 12 1 2 0 0
6: 12 2 4 1 0
7: 12 3 2 0 0
8: 12 4 3 1 1
The conditional statements are working as they should but without the assigned variable names 'riskybet' and 'win', i.e. they appear as V2 and V3. What am I doing wrong?

You are assigning values to variables "inside" the if/else statement, so the list elements themselves stay unnamed. Put the name outside and let if/else return the value:
dt1=dt[,list(
nbet=nchar(bet),
riskybet = if (nchar(bet)>2.5) 1 else 0,
win = if (grepl(outcome, bet)) 1 else 0),
by='id,trial']
id trial nbet riskybet win
1: 11 1 1 0 0
2: 11 2 3 1 0
3: 11 3 4 1 1
4: 11 4 3 1 1
5: 12 1 2 0 0
6: 12 2 4 1 0
7: 12 3 2 0 0
8: 12 4 3 1 1
Alternatively you could also use ifelse instead of the traditional if-else.
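A minimal sketch of the ifelse variant, assuming the sample data from the question (rebuilt here by hand). Because ifelse is vectorised, the by= grouping is no longer needed for riskybet. grepl is not vectorised over its pattern, so the sketch pairs each outcome with its bet via mapply; that pairing is my own choice, not part of the original answer.

```r
library(data.table)

# Sample data rebuilt from the question
dt <- data.table(
  id      = rep(c(11, 12), each = 4),
  trial   = rep(1:4, times = 2),
  bet     = c(1, 456, 3456, 456, 34, 3456, 12, 123),
  outcome = c(6, 2, 3, 6, 6, 2, 4, 2)
)

dt1 <- dt[, .(
  id, trial,
  nbet     = nchar(bet),
  # ifelse works on the whole column at once
  riskybet = ifelse(nchar(bet) > 2.5, 1, 0),
  # compare each outcome with the bet on the same row
  win      = ifelse(
    mapply(grepl, as.character(outcome), as.character(bet),
           MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE),
    1, 0
  )
)]
dt1
```

The named arguments riskybet = and win = sit outside the ifelse calls, which is what gives the result columns their names.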

Related

Sorting a specific range of column names in dplyr

I have a data frame and wish to sort specific columns alphabetically in dplyr. I know I can use the code below to sort all columns, but I would only like to sort columns C, B and A alphabetically. I tried using the across function as I would effectively like to select columns C:A, but this did not work.
df <- data.frame(1:16)
df$Testinfo1 <- 1
df$Band <- 1
df$Alpha <- 1
df$C <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$B <- c(10,0,0,0,12,12,12,12,0,14,NA_real_,14,16,16,16,16)
df$A <- c(1,1,1,1,1,1,1,1,1,1,1,14,NA_real_,NA_real_,NA_real_,16)
df
df %>%
select(sort(names(.)))
A Alpha B Band C Testinfo1 X1.16
1: 1 1 10 1 10 1 1
2: 1 1 0 1 12 1 2
3: 1 1 0 1 14 1 3
4: 1 1 0 1 16 1 4
5: 1 1 12 1 10 1 5
6: 1 1 12 1 12 1 6
7: 1 1 12 1 14 1 7
8: 1 1 12 1 16 1 8
9: 1 1 0 1 10 1 9
10: 1 1 14 1 12 1 10
11: 1 1 NA 1 14 1 11
12: 14 1 14 1 16 1 12
13: NA 1 16 1 10 1 13
14: NA 1 16 1 12 1 14
15: NA 1 16 1 14 1 15
16: 16 1 16 1 16 1 16
My desired output is below:
X1.16 Testinfo1 Band Alpha A B C
1: 1 1 1 1 1 10 10
2: 2 1 1 1 1 0 12
3: 3 1 1 1 1 0 14
4: 4 1 1 1 1 0 16
5: 5 1 1 1 1 12 10
6: 6 1 1 1 1 12 12
7: 7 1 1 1 1 12 14
8: 8 1 1 1 1 12 16
9: 9 1 1 1 1 0 10
10: 10 1 1 1 1 14 12
11: 11 1 1 1 1 NA 14
12: 12 1 1 1 14 14 16
13: 13 1 1 1 NA 16 10
14: 14 1 1 1 NA 16 12
15: 15 1 1 1 NA 16 14
16: 16 1 1 1 16 16 16
You can use relocate() (from dplyr 1.0.0 onwards):
library(dplyr)
vars <- c("C", "B", "A")
df %>%
relocate(all_of(sort(vars)), .after = last_col())
If you are passing a character vector of names you should wrap it in all_of() (which will error if any variables are missing) or any_of() which won't.
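The difference between all_of() and any_of() can be sketched as follows. The small data frame and the missing column name "Z" are made up for illustration and are not part of the question.

```r
library(dplyr)

# Hypothetical data: columns C, B, A are to be sorted, "Z" does not exist
small <- data.frame(id = 1:3, C = 7:9, B = 4:6, A = 1:3)
vars  <- c("C", "B", "A", "Z")

# any_of() silently skips the missing "Z"
out <- small %>%
  relocate(any_of(sort(vars)), .after = last_col())
names(out)

# all_of() refuses to continue when a name is missing
res <- tryCatch(
  small %>% relocate(all_of(sort(vars)), .after = last_col()),
  error = function(e) "all_of() raised an error"
)
res
```

Use all_of() when a missing column should be treated as a bug, and any_of() when the column list is only a wish list.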
You can do
sortcols <- c("A","B","C")
library(dplyr)
df %>%
select(-all_of(sortcols), all_of(sort(sortcols)))
The -all_of(sortcols) part selects everything except the columns you want to sort; the sorted columns are then placed after those.
A base R option for the case where the columns you want to sort are not at the end of the data frame.
To illustrate, we add a new column D whose position should not change.
df$D <- 1:16
cols_to_sort <- c('A', 'B', 'C')
inds <- match(cols_to_sort, names(df))
cols <- seq_along(df)
cols[cols %in% inds] <- inds
df[cols]
# X1.16 Testinfo1 Band Alpha A B C D
#1 1 1 1 1 1 10 10 1
#2 2 1 1 1 1 0 12 2
#3 3 1 1 1 1 0 14 3
#4 4 1 1 1 1 0 16 4
#5 5 1 1 1 1 12 10 5
#6 6 1 1 1 1 12 12 6
#7 7 1 1 1 1 12 14 7
#8 8 1 1 1 1 12 16 8
#9 9 1 1 1 1 0 10 9
#10 10 1 1 1 1 14 12 10
#11 11 1 1 1 1 NA 14 11
#12 12 1 1 1 14 14 16 12
#13 13 1 1 1 NA 16 10 13
#14 14 1 1 1 NA 16 12 14
#15 15 1 1 1 NA 16 14 15
#16 16 1 1 1 16 16 16 16

R: How do I replace NAs in a dataframe column with values from conditions leveraging other multiple columns?

Using R, I am trying to fill NAs in a column with values derived from conditions on other columns. The data frame has 4 columns, described below.
"Water_level": Has some values, which also include NAs. This is the column in which I want to replace the NAs. Take this column as the amount of water in liters in a tank.
"Tank": Unique identifier for tanks. In this sample, I have tank 1 and tank 2.
"Flag": This has a series of 0's and 1's. When the value is 0 the tap is open and the Water_level value decreases by a constant 0.05 per row. When Flag is 1, the tank is being pumped, so the water level in the respective tank increases gradually to the peak value at the end of the series of 1's. The rate of increase varies and is determined by the length of the series of 1's in the Flag column, i.e. the Counter number at the end of the series of 1's.
"Counter": A column counting the number of 0's and 1's in the flag column in order.
I need to fill the NAs in the "Water_level" column with the conditions of the other columns.
Honestly, I haven't been able to try anything despite clearly understanding the outcome required.
df <- data.frame(
Water_level = c(67.92, rep(NA,9),67.96,10.5,rep(NA,8),20),
Flag = c(rep(0,5),rep(1,6),rep(0,5),rep(1,5)),
Tank= c(rep(1, 11), rep(2, 10)),
Counter = c(seq(1:5),seq(1:6), seq(1:5),seq(1:5))
)
df
Water_level Flag Tank Counter
1 67.92 0 1 1
2 NA 0 1 2
3 NA 0 1 3
4 NA 0 1 4
5 NA 0 1 5
6 NA 1 1 1
7 NA 1 1 2
8 NA 1 1 3
9 NA 1 1 4
10 NA 1 1 5
11 67.96 1 1 6
12 10.50 0 2 1
13 NA 0 2 2
14 NA 0 2 3
15 NA 0 2 4
16 NA 0 2 5
17 NA 1 2 1
18 NA 1 2 2
19 NA 1 2 3
20 NA 1 2 4
21 20.00 1 2 5
The result expected is to fill the NAs in the Water_level as described by the conditions in my introduction.
For example, line 2 in the "Water_level" column should be 67.92 - 0.05 = 67.87. This is because the tap is open, i.e. Flag is 0. Line 3 will be 67.87 - 0.05 = 67.82, and so on.
The tricky part is line 6, where the Flag changes to 1, i.e. the tank is being pumped. We can see the series of 1's for Tank 1 ends at line 11. The peak value recorded for Water_level is 67.96. So the rate of increase from line 6 to line 10 is given by the formula below.
(67.96 - value at line 5 following the decrease pattern) / number of Counter steps, i.e. 6 in this case
This calculation continues for Tank 2.
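The arithmetic for Tank 1 can be checked by hand with a few lines of base R. This is only a worked example of the rule described above; the variable names are made up for illustration.

```r
start      <- 67.92   # measured level at line 1
peak       <- 67.96   # measured level at line 11
drain_step <- 0.05    # constant decrease while Flag is 0
n_fill     <- 6       # Counter value at the end of the series of 1's

# Lines 1 to 5: tap open, level drops by 0.05 per line
drain <- start - drain_step * (0:4)

# Lines 6 to 11: equal steps from the level at line 5 up to the peak
fill_step <- (peak - drain[5]) / n_fill
fill      <- drain[5] + fill_step * seq_len(n_fill)

round(drain, 2)
round(fill_step, 2)
round(fill, 2)
```

With these numbers the level at line 5 is 67.72, so each pumping step adds (67.96 - 67.72) / 6 = 0.04.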
Thanks in anticipation of a solution.
Update.
@manotheshark. This is a good beginning, but it doesn't generalise well. When I include rows 12 to 16, it produces the wrong output, i.e. it doesn't decline by 0.05 from line 11.
df <- data.frame(
Water_level = c(67.92, rep(NA,9),67.96, rep(NA,5),10.5,rep(NA,8),20),
Flag = c(rep(0,5),rep(1,6),rep(0,5),rep(0,5),rep(1,5)),
Tank= c(rep(1, 16), rep(2, 10)),
Counter = c(seq(1:5),seq(1:6),seq(1:5), seq(1:5),seq(1:5))
)
df
Water_level Flag Tank Counter
1 67.92 0 1 1
2 NA 0 1 2
3 NA 0 1 3
4 NA 0 1 4
5 NA 0 1 5
6 NA 1 1 1
7 NA 1 1 2
8 NA 1 1 3
9 NA 1 1 4
10 NA 1 1 5
11 67.96 1 1 6
12 NA 0 1 1
13 NA 0 1 2
14 NA 0 1 3
15 NA 0 1 4
16 NA 0 1 5
17 10.50 0 2 1
18 NA 0 2 2
19 NA 0 2 3
20 NA 0 2 4
21 NA 0 2 5
22 NA 1 2 1
23 NA 1 2 2
24 NA 1 2 3
25 NA 1 2 4
26 20.00 1 2 5
The output running your solution is presented below. Line 12 should be 67.96 - 0.05 = 67.91.
Water_level Flag Tank Counter
1 67.92000 0 1 1
2 67.87000 0 1 2
3 67.82000 0 1 3
4 67.77000 0 1 4
5 67.72000 0 1 5
6 67.30167 1 1 1
7 67.43333 1 1 2
8 67.56500 1 1 3
9 67.69667 1 1 4
10 67.82833 1 1 5
11 67.96000 1 1 6
12 67.37000 0 1 1
13 67.32000 0 1 2
14 67.27000 0 1 3
15 67.22000 0 1 4
16 67.17000 0 1 5
17 10.50000 0 2 1
18 10.45000 0 2 2
19 10.40000 0 2 3
20 10.35000 0 2 4
21 10.30000 0 2 5
22 12.24000 1 2 1
23 14.18000 1 2 2
24 16.12000 1 2 3
25 18.06000 1 2 4
26 20.00000 1 2 5
Not tested with multiple tank cycles. The data.frame is converted to a data.table first:
library(data.table)
setDT(df)
# calculate tank levels when dropping with Flag of 0
df[Flag == 0, Water_level := first(Water_level) - 0.05 * (.I - first(.I)), by = .(Flag, Tank)]
# use sequence to determine tank levels when filling from previous minimum to new max
df[Flag == 1, Water_level := seq(df[Flag == 0, last(Water_level), by = .(Flag, Tank)][,V1][.GRP], last(Water_level), length.out = .N + 1)[-1], by = .(Flag, Tank)]
> df
Water_level Flag Tank Counter
1: 67.92 0 1 1
2: 67.87 0 1 2
3: 67.82 0 1 3
4: 67.77 0 1 4
5: 67.72 0 1 5
6: 67.76 1 1 1
7: 67.80 1 1 2
8: 67.84 1 1 3
9: 67.88 1 1 4
10: 67.92 1 1 5
11: 67.96 1 1 6
12: 10.50 0 2 1
13: 10.45 0 2 2
14: 10.40 0 2 3
15: 10.35 0 2 4
16: 10.30 0 2 5
17: 12.24 1 2 1
18: 14.18 1 2 2
19: 16.12 1 2 3
20: 18.06 1 2 4
21: 20.00 1 2 5
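For the multi-cycle case raised in the question's update, one possible base R sketch is to walk through each tank run by run. The helper fill_levels() is hypothetical and written for this example; it assumes rows are in time order within each tank, that every draining run either starts with a measured value or follows a filled row, and that every pumping run ends with a measured peak.

```r
# Hypothetical helper: fill one tank's levels, one run of equal Flag at a time
fill_levels <- function(level, flag, drain_step = 0.05) {
  runs   <- rle(flag)
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  for (k in seq_along(ends)) {
    idx  <- starts[k]:ends[k]
    n    <- length(idx)
    prev <- if (starts[k] > 1) level[starts[k] - 1] else NA_real_
    if (runs$values[k] == 0) {
      if (!is.na(level[idx[1]])) {
        # Run starts with a measured value: count down from it
        level[idx] <- level[idx[1]] - drain_step * (seq_len(n) - 1)
      } else {
        # Otherwise keep draining from the previous row
        level[idx] <- prev - drain_step * seq_len(n)
      }
    } else if (!is.na(prev)) {
      # Pumping: equal steps from the previous row up to the measured peak
      peak <- level[ends[k]]
      level[idx] <- prev + (peak - prev) * seq_len(n) / n
    }
  }
  level
}

# Data from the update in the question
df <- data.frame(
  Water_level = c(67.92, rep(NA, 9), 67.96, rep(NA, 5), 10.5, rep(NA, 8), 20),
  Flag        = c(rep(0, 5), rep(1, 6), rep(0, 5), rep(0, 5), rep(1, 5)),
  Tank        = c(rep(1, 16), rep(2, 10)),
  Counter     = c(1:5, 1:6, 1:5, 1:5, 1:5)
)

# Apply the helper separately to each tank
for (tk in unique(df$Tank)) {
  rows <- df$Tank == tk
  df$Water_level[rows] <- fill_levels(df$Water_level[rows], df$Flag[rows])
}
round(df$Water_level, 2)
```

On this data, line 12 comes out as 67.96 - 0.05 = 67.91, as requested in the update. The loop trades speed for readability; on large tables a grouped data.table version would likely be faster.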

If a value appears in the row, all subsequent rows should take this value (with dplyr)

I'm just starting to learn R and I'm already facing the first bigger problem.
Let's take the following panel dataset as an example:
N=5
T=3
time<-rep(1:T, times=N)
id<- rep(1:N,each=T)
dummy<- c(0,0,1,1,0,0,0,1,0,0,0,1,0,1,0)
df<-as.data.frame(cbind(id, time,dummy))
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 0
6 2 3 0
7 3 1 0
8 3 2 1
9 3 3 0
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 0
I now want the dummy variable for all rows of a cross section to take the value 1 after the 1 for this cross section appears for the first time. So, what I want is:
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 0
8 3 2 1
9 3 3 1
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 1
So I guess I need something like:
df_new<-df %>%
group_by(id) %>%
???
I already tried to set all zeros to NA and use the na.locf function, but it didn't really work.
Anybody got an idea?
Thanks!
Use cummax
df %>%
group_by(id) %>%
mutate(dummy = cummax(dummy))
# A tibble: 15 x 3
# Groups: id [5]
# id time dummy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 0
# 8 3 2 1
# 9 3 3 1
#10 4 1 0
#11 4 2 0
#12 4 3 1
#13 5 1 0
#14 5 2 1
#15 5 3 1
Without additional packages you could do
transform(df, dummy = ave(dummy, id, FUN = cummax))
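Why this works can be seen on a plain vector: cummax returns the running maximum, so once a 1 has appeared every later position stays at 1. The sketch below assumes the dummy only takes the values 0 and 1 and that rows are already ordered by time within each id; the short vectors are made up for illustration.

```r
# Running maximum on a single cross section
one_group <- cummax(c(0, 1, 0))
one_group

# The same idea applied per id, using base R's ave()
id     <- rep(1:2, each = 3)
dummy  <- c(0, 0, 1, 1, 0, 0)
filled <- ave(dummy, id, FUN = cummax)
filled
```

If the data are not sorted by time, sort them first, since cummax depends on row order.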

Count non-zero values of column in R [duplicate]

Suppose I have a data frame like this one:
DF
Id X Y Z
1 1 5 0
1 2 0 0
1 3 0 5
1 4 9 0
1 5 2 3
1 6 5 0
2 1 5 0
2 2 4 0
2 3 0 6
2 4 9 6
2 5 2 0
2 6 5 2
3 1 5 6
3 2 4 0
3 3 6 5
3 4 9 0
3 5 2 0
3 6 5 0
I want to count the number of non zero entries for variable Z in a particular Id and record that value in a new column Count, so the new data frame will look like
DF1
Id X Y Z Count
1 1 5 0 2
1 2 0 0 2
1 3 0 5 2
1 4 9 0 2
1 5 2 3 2
1 6 5 0 2
2 1 5 0 3
2 2 4 0 3
2 3 0 6 3
2 4 9 6 3
2 5 2 0 3
2 6 5 2 3
3 1 5 6 2
3 2 4 0 2
3 3 6 5 2
3 4 9 0 2
3 5 2 0 2
3 6 5 0 2
We can use base R's ave() to count the number of non-zero values in column Z, grouped by Id:
df$Count <- ave(df$Z, df$Id, FUN = function(x) sum(x!=0))
df$Count
#[1] 2 2 2 2 2 2 3 3 3 3 3 3 2 2 2 2 2 2
You can try this, it gives you exactly what you want:
library(data.table)
dt <- data.table(df)
dt[, Count := sum(Z != 0), by = Id]
dt
# Id X Y Z Count
# 1: 1 1 5 0 2
# 2: 1 2 0 0 2
# 3: 1 3 0 5 2
# 4: 1 4 9 0 2
# 5: 1 5 2 3 2
# 6: 1 6 5 0 2
# 7: 2 1 5 0 3
# 8: 2 2 4 0 3
# 9: 2 3 0 6 3
# 10: 2 4 9 6 3
# 11: 2 5 2 0 3
# 12: 2 6 5 2 3
# 13: 3 1 5 6 2
# 14: 3 2 4 0 2
# 15: 3 3 6 5 2
# 16: 3 4 9 0 2
# 17: 3 5 2 0 2
# 18: 3 6 5 0 2
This will also work, provided the rows are sorted by Id and every Id has at least one non-zero Z:
df$Count <- rep(aggregate(Z~Id, df[df$Z != 0,], length)$Z, table(df$Id))
Id X Y Z Count
1 1 1 5 0 2
2 1 2 0 0 2
3 1 3 0 5 2
4 1 4 9 0 2
5 1 5 2 3 2
6 1 6 5 0 2
7 2 1 5 0 3
8 2 2 4 0 3
9 2 3 0 6 3
10 2 4 9 6 3
11 2 5 2 0 3
12 2 6 5 2 3
13 3 1 5 6 2
14 3 2 4 0 2
15 3 3 6 5 2
16 3 4 9 0 2
17 3 5 2 0 2
18 3 6 5 0 2

How to assign a value to a particular sequence in a column in r?

Data
I have the following data frame:
dm <- data.frame(vehid = rep(c(2,3), each=10),
frameid=rep(c(7,10,11,13,6,5,14,13,12,1),2),
frame.diff = rep(c(NA,3,1,2,7,1,9,1,1,11), 2),
s.frame = rep(c(rep(0,9), 12), 2))
Question
For the row where s.frame != 0, I want to find all frameid values that lie in the sequence from that row's frameid to its s.frame, and assign the value 'loop' to them and '.' to all others.
Desired Output
> dom
vehid frameid frame.diff s.frame loop
1 2 7 NA 0 loop
2 2 10 3 0 loop
3 2 11 1 0 loop
4 2 13 2 0 .
5 2 6 7 0 loop
6 2 5 1 0 loop
7 2 14 9 0 .
8 2 13 1 0 .
9 2 12 1 0 loop
10 2 1 11 12 loop
11 3 7 NA 0 loop
12 3 10 3 0 loop
13 3 11 1 0 loop
14 3 13 2 0 .
15 3 6 7 0 loop
16 3 5 1 0 loop
17 3 14 9 0 .
18 3 13 1 0 .
19 3 12 1 0 loop
20 3 1 11 12 loop
In the above example, for both vehids, the sequence was 1 to 12 (frameid:s.frame). The output shows the word 'loop' for the sequence and '.' for all others.
There should be a simple solution for this but I can't figure it out. I would appreciate it if someone could provide a data.table or plyr solution. Thanks.
I'm not sure I understand your question correctly, but here's a data.table option
library(data.table)
setDT(dm)[, loop := ifelse(frameid %between% c(frameid[s.frame != 0], s.frame[s.frame != 0]), "loop", "."), by = vehid]
dm
# vehid frameid frame.diff s.frame loop
# 1: 2 7 NA 0 loop
# 2: 2 10 3 0 loop
# 3: 2 11 1 0 loop
# 4: 2 13 2 0 .
# 5: 2 6 7 0 loop
# 6: 2 5 1 0 loop
# 7: 2 14 9 0 .
# 8: 2 13 1 0 .
# 9: 2 12 1 0 loop
# 10: 2 1 11 12 loop
# 11: 3 7 NA 0 loop
# 12: 3 10 3 0 loop
# 13: 3 11 1 0 loop
# 14: 3 13 2 0 .
# 15: 3 6 7 0 loop
# 16: 3 5 1 0 loop
# 17: 3 14 9 0 .
# 18: 3 13 1 0 .
# 19: 3 12 1 0 loop
# 20: 3 1 11 12 loop
Using dplyr
library(dplyr)
dm %>%
group_by(vehid) %>%
mutate(loop = c(".", "loop")[(max(frameid[s.frame != 0]) <= frameid & frameid <= min(s.frame[s.frame != 0])) + 1])
# vehid frameid frame.diff s.frame loop
# 1 2 7 NA 0 loop
# 2 2 10 3 0 loop
# 3 2 11 1 0 loop
# 4 2 13 2 0 .
# 5 2 6 7 0 loop
# 6 2 5 1 0 loop
# 7 2 14 9 0 .
# 8 2 13 1 0 .
# 9 2 12 1 0 loop
# 10 2 1 11 12 loop
# 11 3 7 NA 0 loop
# 12 3 10 3 0 loop
# 13 3 11 1 0 loop
# 14 3 13 2 0 .
# 15 3 6 7 0 loop
# 16 3 5 1 0 loop
# 17 3 14 9 0 .
# 18 3 13 1 0 .
# 19 3 12 1 0 loop
# 20 3 1 11 12 loop
