Flag first by-group in R data frame

I have a data frame which looks like this:
id score
1 15
1 18
1 16
2 10
2 9
3 8
3 47
3 21
I'd like to identify a way to flag the first occurrence of each id -- similar to first. and last. in SAS. I've tried !duplicated(), but I need to actually append the flag column to my data frame since I'm running it through a loop later on. I'd like to get something like this:
id score first_ind
1 15 1
1 18 0
1 16 0
2 10 1
2 9 0
3 8 1
3 47 0
3 21 0

> df$first_ind <- as.numeric(!duplicated(df$id))
> df
id score first_ind
1 1 15 1
2 1 18 0
3 1 16 0
4 2 10 1
5 2 9 0
6 3 8 1
7 3 47 0
8 3 21 0
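Note that !duplicated() flags the first time each id appears anywhere in the vector, so it also behaves sensibly if an id reappears later (a small hypothetical example, not from the question):

```r
# an id vector where id 1 shows up again after id 2
ids <- c(1, 1, 2, 1, 3)

# only the true first occurrence of each id is flagged
as.numeric(!duplicated(ids))
# [1] 1 0 1 0 1
```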

You can find the edges using diff.
x <- read.table(text = "id score
1 15
1 18
1 16
2 10
2 9
3 8
3 47
3 21", header = TRUE)
x$first_id <- c(1, diff(x$id))
x
id score first_id
1 1 15 1
2 1 18 0
3 1 16 0
4 2 10 1
5 2 9 0
6 3 8 1
7 3 47 0
8 3 21 0
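One caveat (my addition): diff() returns the raw jump between consecutive ids, so if ids increase by more than 1 at a group boundary the flag will not be exactly 1. Comparing the difference against 0 fixes that:

```r
# ids that jump by 2 at the boundary
x <- data.frame(id = c(1, 1, 3, 3))

c(1, diff(x$id))                      # [1] 1 0 2 0 -- boundary flagged as 2
as.numeric(c(TRUE, diff(x$id) != 0))  # [1] 1 0 1 0 -- a clean 0/1 flag
```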

Using plyr:
library("plyr")
ddply(x,"id",transform,first=as.numeric(seq(length(score))==1))
or if you prefer dplyr:
x %>% group_by(id) %>%
  mutate(first = c(1, rep(0, n() - 1)))
(although if you're operating completely in the plyr/dplyr framework you probably wouldn't need this flag variable anyway ...)
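A more current dplyr spelling (my sketch; x is rebuilt here so the chunk runs standalone) uses row_number():

```r
library(dplyr)

x <- data.frame(id = rep(1:3, times = c(3, 2, 3)),
                score = c(15, 18, 16, 10, 9, 8, 47, 21))

# flag row 1 within each id group
res <- x %>%
  group_by(id) %>%
  mutate(first = as.numeric(row_number() == 1)) %>%
  ungroup()
res$first
# [1] 1 0 0 1 0 1 0 0
```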

Another base R option:
df$first_ind <- ave(df$id, df$id, FUN = seq_along) == 1
df
# id score first_ind
#1 1 15 TRUE
#2 1 18 FALSE
#3 1 16 FALSE
#4 2 10 TRUE
#5 2 9 FALSE
#6 3 8 TRUE
#7 3 47 FALSE
#8 3 21 FALSE
This also works in case of unsorted ids. If you want 1/0 instead of T/F you can easily wrap it in as.integer(.).
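Spelled out with the as.integer(.) wrapper the answer mentions (df rebuilt here so it runs standalone):

```r
df <- data.frame(id = rep(1:3, times = c(3, 2, 3)),
                 score = c(15, 18, 16, 10, 9, 8, 47, 21))

# seq_along within each id group; the comparison flags position 1
df$first_ind <- as.integer(ave(df$id, df$id, FUN = seq_along) == 1)
df$first_ind
# [1] 1 0 0 1 0 1 0 0
```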


anti-join not working - giving 0 rows, why?

I am trying to use anti_join exactly as I have done many times before, to establish which rows across two datasets do not have matches for two specific columns. For some reason I keep getting 0 rows in the result and I can't understand why.
Below are two dummy data frames containing the two columns I am trying to compare. You will see one is missing an entry (df1, SITE 2, PLOT 8), so when I use anti_join to compare the two data frames, this entry should be returned, but I just get a result of 0 rows.
a<- seq(1:3)
SITE <- rep(a, times = c(16,15,1))
PLOT <- c(1:16,1:7,9:16,1)
df1 <- data.frame(SITE,PLOT)
SITE <- rep(a, times = c(16,16,1))
PLOT <- c(rep(1:16,2),1)
df2 <- data.frame(SITE,PLOT)
df1             df2
SITE PLOT       SITE PLOT
1    1          1    1
1    2          1    2
1    3          1    3
1    4          1    4
1    5          1    5
1    6          1    6
1    7          1    7
1    8          1    8
1    9          1    9
1    10         1    10
1    11         1    11
1    12         1    12
1    13         1    13
1    14         1    14
1    15         1    15
1    16         1    16
2    1          2    1
2    2          2    2
2    3          2    3
2    4          2    4
2    5          2    5
2    6          2    6
2    7          2    7
2    9          2    8
2    10         2    9
2    11         2    10
2    12         2    11
2    13         2    12
2    14         2    13
2    15         2    14
2    16         2    15
3    1          2    16
                3    1
a <- anti_join(df1, df2, by=c('SITE', 'PLOT'))
a
<0 rows> (or 0-length row.names)
I'm sure the answer is obvious but I can't see it.
The answer can be found in the help file.
anti_join() returns all rows from x without a match in y.
So reversing the input for df1 and df2 will give you what you expect.
anti_join(df2, df1, by=c('SITE', 'PLOT'))
# SITE PLOT
# 1 2 8
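If you are not sure in advance which side holds the extra rows, one option (my sketch, using the data from the question) is to run the anti-join in both directions and bind the results, tagging each row with its origin:

```r
library(dplyr)

df1 <- data.frame(SITE = rep(1:3, times = c(16, 15, 1)),
                  PLOT = c(1:16, 1:7, 9:16, 1))
df2 <- data.frame(SITE = rep(1:3, times = c(16, 16, 1)),
                  PLOT = c(rep(1:16, 2), 1))

# rows present in one data frame but not the other, either way round
mismatches <- bind_rows(
  only_in_df1 = anti_join(df1, df2, by = c("SITE", "PLOT")),
  only_in_df2 = anti_join(df2, df1, by = c("SITE", "PLOT")),
  .id = "source"
)
mismatches
#        source SITE PLOT
# 1 only_in_df2    2    8
```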

Add new variables to financial dataset in R

I have a question about a financial transaction dataset.
The data set looks like this:
Account_from Account_to Value Timestamp
1 1 2 10 1
2 1 3 15 1
3 3 4 20 1
4 2 1 10 2
5 1 3 25 2
6 2 1 15 3
7 1 3 10 3
8 1 4 20 4
I would like to create a couple of extra variables based on the Account_from column. The variables I want to create are:
(total value out of Account_from in the last two timestamps),
(total value incoming to Account_from in the last two timestamps),
(total transaction value out of all transactions done during that timestamp),
(value out of Account_from / previous value out of Account_from),
That it will look like this:
Acc_from Acc_to Value Timestamp Tot_val_out Tot_val_inc Tot_val_out_1time val_out/prev_val_out
1 1 2 10 1 10 0 45 0
2 1 3 15 1 25 0 45 1.5
3 3 4 20 1 20 15 45 0
4 2 1 10 2 10 10 35 0
5 1 3 25 2 50 10 35 1.67
6 2 1 15 3 25 0 25 1.5
7 1 3 10 3 35 25 25 0.4
8 1 4 20 4 30 15 20 2
For example, in row 5 Tot_val_out is 50: account 1 transferred a total of 50 in the last two timestamps (timestamps 1 and 2). In row 8, account 1 transferred 30 in the last two timestamps (timestamps 3 and 4).
The same should be done for incoming value.
Additionally I would like to create the variables:
(number of transactions done by account from in the previous 4 timestamps)
(number of transactions done by account from in the previous 2 timestamps)
So that:
Account_from Account_to Value Timestamp Transactions_previous4 Transactions_previous2
1 1 2 10 1 1 1
2 1 3 15 1 2 2
3 3 4 20 1 1 1
4 2 1 10 2 1 1
5 1 3 25 2 3 3
6 2 1 15 3 2 2
7 1 3 10 3 4 2
8 1 4 20 4 5 2
At row 8 account 1 has made 5 transactions in the last 4 timestamps (timestamps 1 to 4), but only 2 transactions in the last 2 timestamps (timestamps 3 and 4).
I cannot figure out how to do this. It would be extremely helpful if someone knows how.
Thanks in advance,
Here is a base R approach; the columns are pulled out once before the loop:
Account_from <- df$Account_from
Account_to <- df$Account_to
Value <- df$Value
Timestamp <- df$Timestamp
nr <- seq(nrow(df))
for (k in nr) {
  df$Tot_val_out[k] <- sum(Value[which(Account_from == Account_from[k] & Timestamp %in% (Timestamp[k] - (1:0)) & nr <= k)])
  df$Tot_val_in[k] <- sum(Value[which(Account_to == Account_from[k] & Timestamp %in% (Timestamp[k] - (1:0)) & nr <= k)])
  df$Transactions_previous4[k] <- length(which(Account_from == Account_from[k] & Timestamp %in% (Timestamp[k] - (3:0)) & nr <= k))
  df$Transactions_previous2[k] <- length(which(Account_from == Account_from[k] & Timestamp %in% (Timestamp[k] - (1:0)) & nr <= k))
}
dfout <- cbind(df,
with(df,
data.frame(Tot_val_out_1time = ave(Value,Timestamp,FUN = sum),
val_out_prev_val_out = ave(Value, Account_from, FUN = function(x) c(0,x[-1]/x[-length(x)])))))
such that
> dfout
Account_from Account_to Value Timestamp Tot_val_out Tot_val_in Transactions_previous4 Transactions_previous2 Tot_val_out_1time val_out_prev_val_out
1 1 2 10 1 10 0 1 1 45 0.000000
2 1 3 15 1 25 0 2 2 45 1.500000
3 3 4 20 1 20 15 1 1 45 0.000000
4 2 1 10 2 10 10 1 1 35 0.000000
5 1 3 25 2 50 10 3 3 35 1.666667
6 2 1 15 3 25 0 2 2 25 1.500000
7 1 3 10 3 35 25 4 2 25 0.400000
8 1 4 20 4 30 15 5 2 20 2.000000
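For a runnable starting point, here is a sketch that builds the posted data and computes one of the rolling columns with sapply; the window logic mirrors the loop above (only rows up to k, within the current and previous timestamp):

```r
# data as posted in the question
df <- data.frame(
  Account_from = c(1, 1, 3, 2, 1, 2, 1, 1),
  Account_to   = c(2, 3, 4, 1, 3, 1, 3, 4),
  Value        = c(10, 15, 20, 10, 25, 15, 10, 20),
  Timestamp    = c(1, 1, 1, 2, 2, 3, 3, 4)
)

# Tot_val_out: value sent by Account_from[k] during Timestamp[k]-1 and
# Timestamp[k], counting only rows up to k (as in the worked example)
df$Tot_val_out <- sapply(seq_len(nrow(df)), function(k) {
  idx <- df$Account_from == df$Account_from[k] &
    df$Timestamp %in% (df$Timestamp[k] - (1:0)) &
    seq_len(nrow(df)) <= k
  sum(df$Value[idx])
})
df$Tot_val_out
# [1] 10 25 20 10 50 25 35 30
```

The other rolling columns follow the same pattern with Account_to or a wider timestamp window.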

Split columns to few variables and move corresponding value to new column

I have a data frame like this (with many more rows):
id act_l_n pas_l_n act_q_p pas_q_p act_l_p pas_l_p act_q_n pas_q_n
1 14 8 14 10 21 11 21 11
2 19 9 11 17 22 11 20 11
Every column name contains information about 3 variables separated by '_' (each has 2 levels named act/pas, l/q, n/p). Values are scores corresponding to each combination of variables (i.e. 1 of 8 conditions).
I need to move 3 variables to 3 separate columns, mark their levels by digits, and move corresponding value to separate column called "score". So from 1st row of current data frame I'd get something like this:
id score actpas lq pn
1 14 1 1 1
1 8 2 1 1
1 14 1 2 2
1 10 2 2 2
1 21 1 1 2
1 11 2 1 2
1 21 1 2 1
1 11 2 2 1
I've tried wrangling this with dplyr using the gather and separate functions, but I can't really get what I need. Help with dplyr would be most appreciated!
If I understand correctly:
df<-read.table(textConnection(
"id,act_l_n,pas_l_n,act_q_p,pas_q_p,act_l_p,pas_l_p,act_q_n,pas_q_n
1,14,8,14,10,21,11,21,11
2,19,9,11,17,22,11,20,11"),
header=TRUE,sep=",")
library(tidyr)
library(dplyr)
gather(df, k, score, -id) %>%
  mutate(v1 = 1 + as.integer(substr(k, 1, 3) == "pas"),
         v2 = 1 + as.integer(substr(k, 5, 5) == "q"),
         v3 = 1 + as.integer(substr(k, 7, 7) == "p")) %>%
  select(-2) %>%
  arrange(id)
# id score v1 v2 v3
#1 1 14 1 1 1
#2 1 8 2 1 1
#3 1 14 1 2 2
#4 1 10 2 2 2
#5 1 21 1 1 2
#6 1 11 2 1 2
#7 1 21 1 2 1
#8 1 11 2 2 1
#9 2 19 1 1 1
#10 2 9 2 1 1
#11 2 11 1 2 2
#12 2 17 2 2 2
#13 2 22 1 1 2
#14 2 11 2 1 2
#15 2 20 1 2 1
#16 2 11 2 2 1
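With current tidyr (1.0+), pivot_longer() can split the three name parts in one step; a sketch that reproduces the same 1/2 coding (the variable names actpas/lq/pn are taken from the question, and df is rebuilt so the chunk runs standalone):

```r
library(tidyr)
library(dplyr)

df <- data.frame(id = 1:2,
                 act_l_n = c(14, 19), pas_l_n = c(8, 9),
                 act_q_p = c(14, 11), pas_q_p = c(10, 17),
                 act_l_p = c(21, 22), pas_l_p = c(11, 11),
                 act_q_n = c(21, 20), pas_q_n = c(11, 11))

# split each column name on "_" into three factor columns
out <- df %>%
  pivot_longer(-id,
               names_to = c("actpas", "lq", "pn"),
               names_sep = "_",
               values_to = "score") %>%
  mutate(actpas = ifelse(actpas == "act", 1, 2),
         lq     = ifelse(lq == "l", 1, 2),
         pn     = ifelse(pn == "n", 1, 2)) %>%
  arrange(id)
```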

replace values in row if it matches with last row in R

I have below data frame in R
df <- read.table(text = "
A B C D E
14 6 8 16 14
5 6 10 6 4
2 4 6 3 4
26 6 18 39 36
1 2 3 1 2
3 1 1 1 1
3 5 1 4 11
", header = TRUE)
Now if the values in the last two rows are the same, I need to replace those values with 0. Can anyone help me with this, if it is doable in R?
For example:
The values in the last two rows of column 1 are both 3, so I need to replace 3 by 0.
Similarly, the last two rows of column 3 are both 1, so I need to replace 1 by 0.
You can compare the last 2 rows and replace the values in the columns where they are the same:
nr <- nrow(df)
df[(nr-1):nr, df[nr-1, ]==df[nr, ]] <- 0
df
# A B C D E
#1 14 6 8 16 14
#2 5 6 10 6 4
#3 2 4 6 3 4
#4 26 6 18 39 36
#5 1 2 3 1 2
#6 0 1 0 1 1
#7 0 5 0 4 11
One option is to loop through the columns, check if the last two elements (tail(x, 2)) are duplicated, then replace them with 0, or else return the column unchanged, and assign the output back to the dataset. The [] makes sure that the structure is intact.
df[] <- lapply(df, function(x) if(anyDuplicated(tail(x, 2))>0)
replace(x, c(length(x)-1, length(x)), 0) else x)
df
# A B C D E
#1 14 6 8 16 14
#2 5 6 10 6 4
#3 2 4 6 3 4
#4 26 6 18 39 36
#5 1 2 3 1 2
#6 0 1 0 1 1
#7 0 5 0 4 11
You could also do this:
r <- tail(df, 2)
r[,r[1,]==r[2,]] <- 0
df <- rbind(head(df, -2), r)

Give unique identifier to consecutive groupings

I'm trying to identify groups based on sequential numbers. For example, I have a dataframe that looks like this (simplified):
UID
1
2
3
4
5
6
7
11
12
13
15
17
20
21
22
And I would like to add a column that identifies groupings of consecutive numbers. For example, 1 to 7 is the first consecutive run, so those rows get 1; the second consecutive run gets 2, and so on.
UID Group
1 1
2 1
3 1
4 1
5 1
6 1
7 1
11 2
12 2
13 2
15 3
17 4
20 5
21 5
22 5
None of the existing code I have found solved this issue.
Here is one base R method that uses diff, a logical check, and cumsum:
cumsum(c(1, diff(df$UID) > 1))
[1] 1 1 1 1 1 1 1 2 2 2 3 4 5 5 5
Adding this onto the data.frame, we get:
df$id <- cumsum(c(1, diff(df$UID) > 1))
df
UID id
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 11 2
9 12 2
10 13 2
11 15 3
12 17 4
13 20 5
14 21 5
15 22 5
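Breaking that one-liner into steps (my breakdown, with the UID values typed in from the question):

```r
uid <- c(1:7, 11:13, 15, 17, 20:22)

gaps   <- diff(uid)             # distance from each UID to the previous one
starts <- gaps > 1              # TRUE wherever a new consecutive run begins
grp    <- cumsum(c(1, starts))  # running count of run starts = group id
grp
# [1] 1 1 1 1 1 1 1 2 2 2 3 4 5 5 5
```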
Or you can also use dplyr as follows:
library(dplyr)
df %>% mutate(ID = cumsum(c(1, diff(UID) > 1)))
# UID ID
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 5 1
#6 6 1
#7 7 1
#8 11 2
#9 12 2
#10 13 2
#11 15 3
#12 17 4
#13 20 5
#14 21 5
#15 22 5
We can also get the difference between the current row and the previous row using the shift function from data.table, get the cumulative sum of the logical vector and assign it to create the 'Group' column. This will be faster.
library(data.table)
setDT(df1)[, Group := cumsum(UID - shift(UID, fill = UID[1]) > 1) + 1]
df1
# UID Group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 6 1
# 7: 7 1
# 8: 11 2
# 9: 12 2
#10: 13 2
#11: 15 3
#12: 17 4
#13: 20 5
#14: 21 5
#15: 22 5
