How to recode multiple variables for a subset of a dataframe? - r

I'm lost, so any directions would be helpful. Let's say I have a dataframe:
df <- data.frame(
id = 1:12,
v1 = rep(c(1:4), 3),
v2 = rep(c(1:3), 4),
v3 = rep(c(1:6), 2),
v4 = rep(c(1:2), 6))
My goal would be to recode 2=4 and 4=2 for variables v3 and v4 but only for the first 4 cases (id < 5). I'm looking for a solution that works for up to twenty variables. I know how to do basic recoding but I don't see a simple way to implement the subset condition while manipulating multiple variables.

Here is a base R solution,
df[1:4, c('v3', 'v4')] <- lapply(df[1:4, c('v3', 'v4')], function(i)
ifelse(i == 2, 4, ifelse(i == 4, 2, i)))
which gives,
id v1 v2 v3 v4
1 1 1 1 1 1
2 2 2 2 4 4
3 3 3 3 3 1
4 4 4 1 2 4
5 5 1 2 5 1
6 6 2 3 6 2
7 7 3 1 1 1
8 8 4 2 2 2
9 9 1 3 3 1
10 10 2 1 4 2
11 11 3 2 5 1
12 12 4 3 6 2
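
For up to twenty variables it may be easier to keep the row condition and the column names in variables. The following is only a sketch of the same idea: `rows` and `cols` are illustrative names that do not appear in the question, and `drop = FALSE` is added so the code also works when `cols` holds a single column.

```r
df <- data.frame(
  id = 1:12,
  v1 = rep(1:4, 3),
  v2 = rep(1:3, 4),
  v3 = rep(1:6, 2),
  v4 = rep(1:2, 6))

rows <- df$id < 5        # logical row filter taken from the question (id < 5)
cols <- c("v3", "v4")    # extend this vector with further column names as needed

# Swap 2 and 4 in the selected rows and columns, leave other values alone
df[rows, cols] <- lapply(df[rows, cols, drop = FALSE], function(x)
  ifelse(x == 2, 4, ifelse(x == 4, 2, x)))
```

Using a logical condition instead of `1:4` keeps the code correct when the rows are not sorted by `id`.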

You can try mutate_at with case_when in dplyr
library(dplyr)
df %>%
mutate_at(vars(v3:v4), ~case_when(id < 5 & . == 4 ~ 2L,
id < 5 & . == 2 ~ 4L,
TRUE ~.))
# id v1 v2 v3 v4
#1 1 1 1 1 1
#2 2 2 2 4 4
#3 3 3 3 3 1
#4 4 4 1 2 4
#5 5 1 2 5 1
#6 6 2 3 6 2
#7 7 3 1 1 1
#8 8 4 2 2 2
#9 9 1 3 3 1
#10 10 2 1 4 2
#11 11 3 2 5 1
#12 12 4 3 6 2
With mutate_at you can specify a range of columns to apply the function to.
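
In dplyr 1.0.0 and later, mutate_at() is superseded by across(). A rough equivalent of the call above, assuming a recent dplyr version and the `df` from the question (the name `recoded` is only illustrative), could look like this:

```r
library(dplyr)

df <- data.frame(
  id = 1:12,
  v1 = rep(1:4, 3),
  v2 = rep(1:3, 4),
  v3 = rep(1:6, 2),
  v4 = rep(1:2, 6))

recoded <- df %>%
  mutate(across(v3:v4, ~ case_when(
    id < 5 & .x == 4L ~ 2L,   # 4 becomes 2 in the first four cases
    id < 5 & .x == 2L ~ 4L,   # 2 becomes 4 in the first four cases
    TRUE ~ .x)))              # everything else is kept as is
```

The column range `v3:v4` can be replaced by any tidyselect expression, for example `all_of(cols)` with a character vector of up to twenty names.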

Another, more direct, option is to get the indices of the numbers to replace, and to replace them by 6 minus the number (6-4=2, 6-2=4):
whToChange <- which(df[1:4, c("v3", "v4")] == 2 | df[1:4, c("v3", "v4")] == 4, arr.ind=TRUE)
df[, c("v3", "v4")][whToChange] <- 6-df[, c("v3", "v4")][whToChange]
head(df, 5)
# id v1 v2 v3 v4
#1 1 1 1 1 1
#2 2 2 2 4 4
#3 3 3 3 3 1
#4 4 4 1 2 4
#5 5 1 2 5 1

You can use match and a lookup table, just in case you have to recode more than two values.
rosetta <- matrix(c(2,4,4,2), 2)
df[1:4, c("v3", "v4")] <- lapply(df[1:4, c("v3", "v4")], function(x) {
i <- match(x, rosetta[1,]); j <- !is.na(i); "[<-"(x, j, rosetta[2, i[j]])})
df
# id v1 v2 v3 v4
#1 1 1 1 1 1
#2 2 2 2 4 4
#3 3 3 3 3 1
#4 4 4 1 2 4
#5 5 1 2 5 1
#6 6 2 3 6 2
#7 7 3 1 1 1
#8 8 4 2 2 2
#9 9 1 3 3 1
#10 10 2 1 4 2
#11 11 3 2 5 1
#12 12 4 3 6 2
Also have a look at "R: How to recode multiple variables at once" or "Recoding multiple variables in R".
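
A named vector can serve as the lookup table as well, which may be easier to read when many values have to be recoded. This is only a sketch: `lookup`, `recode_with`, `rows` and `cols` are hypothetical names introduced here, not part of the answer above.

```r
df <- data.frame(
  id = 1:12,
  v1 = rep(1:4, 3),
  v2 = rep(1:3, 4),
  v3 = rep(1:6, 2),
  v4 = rep(1:2, 6))

# Names are the old values, elements are the new values
lookup <- c("2" = 4L, "4" = 2L)

recode_with <- function(x, lookup) {
  hit <- match(as.character(x), names(lookup))  # position in lookup, NA if absent
  found <- !is.na(hit)
  x[found] <- lookup[hit[found]]                # replace only the values found
  x
}

rows <- df$id < 5
cols <- c("v3", "v4")
df[rows, cols] <- lapply(df[rows, cols, drop = FALSE], recode_with, lookup = lookup)
```

Adding further pairs to `lookup` extends the recoding without touching the function.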

Related

Find all combinations of one column based on the unique values of another column in a dataframe

Suppose that I have a dataframe
data.frame(v1 = c(1,1,1,2,2,3), v2 = c(6,1,6,3,4,2))
v1 v2
1 1 6
2 1 1
3 1 6
4 2 3
5 2 4
6 3 2
Is there an R function to return the following dataframe, i.e. the combinations of v2 values based on the unique values of v1?
data.frame(v1 = rep(1:3, 6), v2 = c(6,3,2, 6,4,2, 1,3,2, 1,4,2, 6,3,2, 6,4,2))
v1 v2
1 1 6
2 2 3
3 3 2
4 1 6
5 2 4
6 3 2
7 1 1
8 2 3
9 3 2
10 1 1
11 2 4
12 3 2
13 1 6
14 2 3
15 3 2
16 1 6
17 2 4
18 3 2
P.S. I don't think my question is a duplicate. Here v2 has duplicated values and the output dataframe has to keep the order (i.e. v1 = c(1,2,3, 1,2,3, ...)). The desired output has 18 rows, but expand.grid gives 36 rows and crossing gives 15 rows.
Try the code below
dfout <- data.frame(
v1 = unique(df$v1),
v2 = c(t(rev(expand.grid(rev(with(df, split(v2, v1)))))))
)
which gives
> dfout
v1 v2
1 1 6
2 2 3
3 3 2
4 1 6
5 2 4
6 3 2
7 1 1
8 2 3
9 3 2
10 1 1
11 2 4
12 3 2
13 1 6
14 2 3
15 3 2
16 1 6
17 2 4
18 3 2

In R: How to coerce a list of vectors with unequal length to a dataframe using tidyverse?

Suppose you have the following list in R:
list_test <- list(c(2,4,5, 6), c(1,2,3), c(7,8))
What I am looking for is a dataframe of the following form:
value list_index
2 1
4 1
5 1
6 1
1 2
2 2
3 2
7 3
8 3
I tried to find a solution with the tidyverse but either lost the list_index/name or had problems with the unequal length of the vectors.
You can name the list elements and then use stack in base R.
names(list_test) <- seq_along(list_test)
stack(list_test)
# values ind
#1 2 1
#2 4 1
#3 5 1
#4 6 1
#5 1 2
#6 2 2
#7 3 2
#8 7 3
#9 8 3
If interested in a tidyverse solution we can use enframe with unnest.
library(magrittr) # provides %>%
tibble::enframe(list_test) %>% tidyr::unnest(value)
Or imap_dfr from purrr.
purrr::imap_dfr(list_test, ~tibble::tibble(value = .x, list_index = .y))
Another option could be:
library(purrr); library(tibble); library(dplyr)
map_dfr(list_test, ~ enframe(.) %>%
select(-name), .id = "name")
name value
<chr> <dbl>
1 1 2
2 1 4
3 1 5
4 1 6
5 2 1
6 2 2
7 2 3
8 3 7
9 3 8
Or, if you don't mind also having a column with the position within each vector:
map_dfr(list_test, enframe, .id = "name_list")
name_list name value
<chr> <int> <dbl>
1 1 1 2
2 1 2 4
3 1 3 5
4 1 4 6
5 2 1 1
6 2 2 2
7 2 3 3
8 3 1 7
9 3 2 8
In base R, we can use lengths to replicate the sequence and unlist the list elements into a two column 'data.frame'
data.frame(value = unlist(list_test),
list_index = rep(seq_along(list_test), lengths(list_test)))
# value list_index
#1 2 1
#2 4 1
#3 5 1
#4 6 1
#5 1 2
#6 2 2
#7 3 2
#8 7 3
#9 8 3

Count interactions with unique accounts in financial transaction dataset

I have a question about a dataset with financial transactions:
Account_from Account_to Value
1 1 2 25.0
2 1 3 30.0
3 2 1 28.0
4 2 3 10.0
5 2 3 12.0
6 3 1 40.0
7 3 1 30.0
8 3 1 20.0
Each row represents a transaction. I would like to add columns that count the number of unique accounts each account interacts with.
The result would look like the following:
Account_from Account_to Value Count_interactions_out Count_interactions_in
1 1 2 25.0 2 2
2 1 3 30.0 2 2
3 2 1 28.0 2 1
4 2 3 10.0 2 1
5 2 3 12.0 2 1
6 3 1 40.0 1 2
7 3 1 30.0 1 2
8 3 1 20.0 1 2
Account 3 only interacts with account 1, therefore Count_interactions_out is 1. However, it receives interactions from account 1 and 2, therefore the count_interactions_in is 2.
How can I apply this to the whole dataset?
Thanks
Here's an approach using dplyr
library(dplyr)
financial.data %>%
group_by(Account_from) %>%
mutate(Count_interactions_out = nlevels(factor(Account_to))) %>%
ungroup() %>%
group_by(Account_to) %>%
mutate(Count_interactions_in = nlevels(factor(Account_from))) %>%
ungroup()
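
Note that grouping by Account_to counts the senders of each row's receiving account, whereas the desired table appears to report the incoming count for the account in Account_from. A possible variant that matches the desired table is sketched below; it assumes the data frame is called `financial.data` as above, and `in_counts` and `result` are illustrative names.

```r
library(dplyr)

financial.data <- data.frame(
  Account_from = c(1, 1, 2, 2, 2, 3, 3, 3),
  Account_to   = c(2, 3, 1, 3, 3, 1, 1, 1),
  Value        = c(25, 30, 28, 10, 12, 40, 30, 20))

# Number of distinct senders for every receiving account
in_counts <- financial.data %>%
  group_by(Account_to) %>%
  summarise(Count_interactions_in = n_distinct(Account_from))

result <- financial.data %>%
  group_by(Account_from) %>%
  mutate(Count_interactions_out = n_distinct(Account_to)) %>%
  ungroup() %>%
  # attach the incoming count that belongs to the sending account of each row
  left_join(in_counts, by = c("Account_from" = "Account_to"))
```

Accounts that never receive a transaction would get NA in Count_interactions_in after the join; whether that should be 0 depends on the use case.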
Here is a solution with base R, where ave() is used
df <- cbind(df,
with(df, list(
Count_interactions_out = ave(Account_to,Account_from,FUN = function(x) length(unique(x))),
Count_interactions_in = ave(Account_from,Account_to,FUN = function(x) length(unique(x)))[match(Account_from,Account_to)])))
such that
> df
Account_from Account_to Value Count_interactions_out Count_interactions_in
1 1 2 25 2 2
2 1 3 30 2 2
3 2 1 28 2 1
4 2 3 10 2 1
5 2 3 12 2 1
6 3 1 40 1 2
7 3 1 30 1 2
8 3 1 20 1 2
or
df <- within(df, list(
Count_interactions_out <- ave(Account_to,Account_from,FUN = function(x) length(unique(x))),
Count_interactions_in <- ave(Account_from,Account_to,FUN = function(x) length(unique(x)))[match(Account_from,Account_to)]))
such that
> df
Account_from Account_to Value Count_interactions_in Count_interactions_out
1 1 2 25 2 2
2 1 3 30 2 2
3 2 1 28 1 2
4 2 3 10 1 2
5 2 3 12 1 2
6 3 1 40 2 1
7 3 1 30 2 1
8 3 1 20 2 1

Looping through Columns replicating each column fetched six times

I have this data frame where the column names are from v1 to v292. There are 17 observations. I need to iterate over the columns and replicate each column fetched 6 times.
For example:
v1 v2 v3 v4
1 3 4 6
3 4 3 1
What the output should be
x
1
3
1
3
1
3
1
3
1
3
1
3
3
4
3
4
3
4
3
4
3
4
3
4 .. and so on.
Please help. Thank you in advance.
You could use rep
data.frame(x = unlist(rep(df, each = 6)))
Checking output with each = 2
data.frame(x = unlist(rep(df, each = 2)))
# x
#1 1
#2 3
#3 1
#4 3
#5 3
#6 4
#7 3
#8 4
#9 4
#10 3
#11 4
#12 3
#13 6
#14 1
#15 6
#16 1

Create a block column based on id and the value of another column in R

Given the first two columns below (id and time_diff), I want to generate the 'block' column:
test
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5
The data is already sorted by id and time. The time_diff was computed based on the difference of the previous time and the time value for the row, given the same id. I want to create a block id which is an auto-increment value and increases when a new ID or a time_diff of >10 with the same id is encountered.
How can I achieve this in R?
Importing your data as a data frame with something like:
df = read.table(text='
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5')
You can do a one-liner like this to get occurrences satisfying your two conditions:
> new_col = as.vector(cumsum(
na.exclude(
c(FALSE,diff(as.numeric(as.factor(df$id)))) | # change of id OR
df$time_diff > 10 # time_diff greater than 10
)
))
> new_col
[1] 0 0 0 0 0 1 2 2 2 2 3 3 4 4 4
And finally append this new column to your dataframe with cbind:
> cbind(df, block = c(0,new_col))
id time_diff block block
1 a NA 1 0
2 a 1 1 0
3 a 1 1 0
4 a 1 1 0
5 a 3 1 0
6 a 3 1 0
7 b NA 2 1
8 b 11 3 2
9 b 1 3 2
10 b 1 3 2
11 b 1 3 2
12 b 12 4 3
13 b 1 4 3
14 c NA 5 4
15 c 4 5 4
16 c 7 5 4
You will notice an offset between your wanted block variable and mine: correcting it is easy and can be done at several different steps, so I will leave it to you :)
Another variation of @Jealie's method would be:
with(test, cumsum(c(TRUE,id[-1]!=id[-nrow(test)])|time_diff>10))
#[1] 1 1 1 1 1 1 2 3 3 3 3 4 4 5 5 5
After learning from Jealie and akrun, I came up with this idea.
library(dplyr)
mydf %>%
mutate(group = cumsum(time_diff > 10 |!duplicated(id)))
# id time_diff block group
#1 a NA 1 1
#2 a 1 1 1
#3 a 1 1 1
#4 a 1 1 1
#5 a 3 1 1
#6 a 3 1 1
#7 b NA 2 2
#8 b 11 3 3
#9 b 1 3 3
#10 b 1 3 3
#11 b 1 3 3
#12 b 12 4 4
#13 b 1 4 4
#14 c NA 5 5
#15 c 4 5 5
#16 c 7 5 5
Here is an approach using dplyr:
require(dplyr)
set.seed(999)
test <- data.frame(
id = rep(letters[1:4], each = 3),
time_diff = sample(4:15)
)
test %>%
mutate(
# factor() is needed because id is a character column by default since R 4.0.0
b = as.integer(factor(id)) - lag(as.integer(factor(id))),
more10 = time_diff > 10,
increment = pmax(b, more10, na.rm = TRUE),
increment = ifelse(row_number() == 1, 1, increment),
block = cumsum(increment)
) %>%
select(id, time_diff, block)
Try:
> df
id time_diff
1 a NA
2 a 1
3 a 1
4 a 1
5 a 3
6 a 3
7 b NA
8 b 11
9 b 1
10 b 1
11 b 1
12 b 12
13 b 1
14 c NA
15 c 4
16 c 7
block= c(1)
for(i in 2:nrow(df))
block[i] = ifelse(df$time_diff[i]>10 || df$id[i]!=df$id[i-1],
block[i-1]+1,
block[i-1])
df$block = block
df
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5
