difference between two comma separated strings - r

I have the following dataframe yy
fundId Year Qtr StockCurrentQtr StockNextQtr
1 2015 1 1,2,3,4,5 2,3,4,51
1 2015 2 2,3,4,51 7,8,9,4,2
1 2015 3 7,8,9,4,2 NA
2 2015 1 10,11,14 14,16,19
2 2015 2 14,16,19 20,21,45
2 2015 3 20,21,45 NA
I want to know the difference between StockNextQtr and StockCurrentQtr for each row, grouped by fundId, or equivalently the difference between successive rows of the column 'StockCurrentQtr' within each fundId. This is what I tried:
yy <- yy %>%
group_by(fundId) %>%
mutate(StockDiff = apply(yy,2,function(x){
paste(setdiff(unlist(strsplit(x[5], split = ",")), unlist(strsplit(x[4],
split = ","))),collapse = ",")}))
I am getting the following error:
Column StockDiff must be length 3 (the group size) or one, not 5

You don't have to use apply here. Split both columns into list columns and work rowwise, i.e.
library(dplyr)
df %>%
mutate_at(vars(4:5), funs(strsplit(., ','))) %>%
rowwise() %>%
mutate(new = toString(setdiff(StocCurrentQtr, StockNextQtr)))
which gives,
Source: local data frame [6 x 6]
Groups: <by row>
# A tibble: 6 x 6
fundId Year Qtr StocCurrentQtr StockNextQtr new
<int> <int> <int> <list> <list> <chr>
1 1 2015 1 <chr [5]> <chr [4]> 1, 5
2 1 2015 2 <chr [4]> <chr [5]> 3, 51
3 1 2015 3 <chr [5]> <chr [1]> 7, 8, 9, 4, 2
4 2 2015 1 <chr [3]> <chr [3]> 10, 11
5 2 2015 2 <chr [3]> <chr [3]> 14, 16, 19
6 2 2015 3 <chr [3]> <chr [1]> 20, 21, 45
The equivalent in base R,
mapply(function(x, y)toString(setdiff(x, y)), strsplit(df$StocCurrentQtr, ','),
strsplit(df$StockNextQtr, ','))
#[1] "1, 5" "3, 51" "7, 8, 9, 4, 2" "10, 11" "14, 16, 19" "20, 21, 45"
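As a side note on the error in the question, here is a minimal base R sketch of why apply(yy, 2, ...) cannot work there: MARGIN = 2 calls the function once per column, so the result has one element per column rather than one per row. The toy data frame below is an assumption (a four-column cut of the question's data), and the set difference follows the direction used in the question's own code (next quarter minus current quarter).

```r
# Toy data mirroring fund 1 from the question (4 columns, 3 rows).
yy <- data.frame(
  fundId          = c(1, 1, 1),
  Qtr             = c(1, 2, 3),
  StockCurrentQtr = c("1,2,3,4,5", "2,3,4,51", "7,8,9,4,2"),
  StockNextQtr    = c("2,3,4,51", "7,8,9,4,2", NA),
  stringsAsFactors = FALSE
)

# MARGIN = 2 iterates over columns: 4 results for 4 columns.
per_column <- apply(yy, 2, function(x) length(x))

# MARGIN = 1 iterates over rows: 3 results for 3 rows.
# apply() converts the data frame to a character matrix first,
# so each x is a named character vector.
per_row <- apply(yy, 1, function(x) {
  nxt <- unlist(strsplit(x[["StockNextQtr"]], ","))
  cur <- unlist(strsplit(x[["StockCurrentQtr"]], ","))
  paste(setdiff(nxt, cur), collapse = ",")
})

length(per_column)  # one value per column
length(per_row)     # one value per row
per_row             # a missing next quarter ends up as the string "NA"
```

With MARGIN = 1 the result has the same length as the number of rows, which is what mutate() expects for a new column.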
If StockNextQtr is missing, we can create it first and continue in the same manner as before, i.e.
df %>%
group_by(fundId) %>%
mutate(StockNextQtr = lead(StocCurrentQtr)) %>%
mutate_at(vars(4:5), funs(strsplit(., ','))) %>%
rowwise() %>%
mutate(new = toString(setdiff(StocCurrentQtr, StockNextQtr)))
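The lead() variant can also be sketched without list columns or the older mutate_at()/funs() helpers. This is only an illustration under stated assumptions: the tibble below is rebuilt by hand with the column names from the question (StockCurrentQtr), and mapply() is swapped in for the rowwise() step.

```r
library(dplyr)

# Data rebuilt by hand from the question; StockNextQtr is deliberately absent.
df <- tibble(
  fundId = c(1, 1, 1, 2, 2, 2),
  Qtr    = c(1, 2, 3, 1, 2, 3),
  StockCurrentQtr = c("1,2,3,4,5", "2,3,4,51", "7,8,9,4,2",
                      "10,11,14", "14,16,19", "20,21,45")
)

res <- df %>%
  group_by(fundId) %>%
  # The next quarter's holdings are the following row within the same fund.
  mutate(StockNextQtr = lead(StockCurrentQtr)) %>%
  ungroup() %>%
  # Split both strings and take the set difference pair by pair.
  mutate(new = mapply(
    function(cur, nxt) toString(setdiff(cur, nxt)),
    strsplit(StockCurrentQtr, ","),
    strsplit(StockNextQtr, ","),
    USE.NAMES = FALSE
  ))

res$new
```

The last quarter of each fund has no following row, so lead() returns NA there and the whole current holding is reported as the difference.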

I found another way:
yy <- yy %>%
group_by(fundId, Year, Qtr) %>%
mutate(new = paste(setdiff(unlist(strsplit(StockCurrentQtr, split = ",")),
unlist(strsplit(StockNextQtr, split = ","))),
collapse = ","))

Related

dplyr mutate based on columns condition and an external vector

I am trying to add a list column to a tibble data frame. The resulting list column is calculated from two columns contained in the data frame and a vector which is external / independent.
Suppose that the data frame and the vector are the following:
library(dplyr)
library(magrittr)
dat <- tibble(A = c(12, 27, 22, 1, 15, 30, 20, 28, 19),
B = c(68, 46, 69, 7, 44, 76, 72, 50, 51))
vec <- c(12, 25, 28, 58, 98)
Now, I would like to add (mutate) the column y so that for each row y is a list containing the elements of vec between A and B (inclusive).
The not-so-proper way to do this would be via a loop. I initialize the column y as a list and update it row-wise based on the condition A <= vec & vec <= B:
dat %<>%
mutate(y = list(vec))
for (i in 1:nrow(dat)){
dat[i,]$y[[1]] <- (vec[dat[i,]$A <= vec & vec <= dat[i,]$B])
}
The result is a data frame with y being a list of dbl of variable length:
> dat
# A tibble: 9 x 3
A B y
<dbl> <dbl> <list>
1 12 68 <dbl [4]>
2 27 46 <dbl [1]>
3 22 69 <dbl [3]>
4 1 7 <dbl [0]>
5 15 44 <dbl [2]>
6 30 76 <dbl [1]>
7 20 72 <dbl [3]>
8 28 50 <dbl [1]>
9 19 51 <dbl [2]>
The first four values of y are:
[[1]]
[1] 12 25 28 58
[[2]]
[1] 28
[[3]]
[1] 25 28 58
[[4]]
numeric(0)
Note: the fourth list is empty, because no value of vec is between A=1 and B=7.
I have tried as an intermediate step with getting the subscripts via which using mutate(y = list(which(A <= vec & vec <= B))) or with a combination of seq and %in%, for instance mutate(y = list(vec %in% seq(A, B))). These both give an error. However, I don't need the subscripts, I need a subset of vec.
Create a small helper function with the logic that you want to implement.
return_values_in_between <- function(vec, A, B) {
vec[A <= vec & vec <= B]
}
and call the function for each row (using rowwise) -
library(dplyr)
result <- dat %>%
rowwise() %>%
mutate(y = list(return_values_in_between(vec, A, B))) %>%
ungroup()
result
# A tibble: 9 × 3
# A B y
# <dbl> <dbl> <list>
#1 12 68 <dbl [4]>
#2 27 46 <dbl [1]>
#3 22 69 <dbl [3]>
#4 1 7 <dbl [0]>
#5 15 44 <dbl [2]>
#6 30 76 <dbl [1]>
#7 20 72 <dbl [3]>
#8 28 50 <dbl [1]>
#9 19 51 <dbl [2]>
Checking the first 4 values in result$y -
result$y
#[[1]]
#[1] 12 25 28 58
#[[2]]
#[1] 28
#[[3]]
#[1] 25 28 58
#[[4]]
#numeric(0)
#...
#...
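For comparison, the same per-row subsetting can be sketched in base R with Map(), which walks over A and B in parallel. This is a minimal illustration on the first four rows of the question's data, not a replacement for the rowwise() answer; a plain data.frame is assumed instead of a tibble.

```r
# First four rows of the question's data, as a plain data.frame.
dat <- data.frame(A = c(12, 27, 22, 1), B = c(68, 46, 69, 7))
vec <- c(12, 25, 28, 58, 98)

# Map() calls the function once per (A, B) pair and returns a list,
# which is assigned as a list column.
dat$y <- Map(function(a, b) vec[a <= vec & vec <= b], dat$A, dat$B)

dat$y[[1]]  # values of vec between 12 and 68
dat$y[[4]]  # empty: nothing in vec lies between 1 and 7
```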
With the help of @Ronak Shah, I was able to come up with a solution that doesn't require a dedicated function and also makes sure that vec is pulled from the global environment (in case there is a column named vec in the data frame):
library(tidyverse)
dat |>
rowwise() |>
mutate(y = list(.GlobalEnv$vec[.GlobalEnv$vec >= A & .GlobalEnv$vec <= B])) |>
ungroup()
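A possible alternative to .GlobalEnv$vec is the .env pronoun that tidy evaluation provides inside dplyr verbs. The sketch below assumes a reasonably recent dplyr; the extra column named vec is hypothetical and only there to show that the column does not mask the external vector.

```r
library(dplyr)

vec <- c(12, 25, 28, 58, 98)

# Hypothetical data frame with a column that is also called `vec`.
dat <- tibble(A = c(12, 1), B = c(68, 7), vec = c(0, 0))

res <- dat %>%
  rowwise() %>%
  # .env$vec refers to the vector outside the data frame,
  # while A and B refer to the columns of the current row.
  mutate(y = list(.env$vec[.env$vec >= A & .env$vec <= B])) %>%
  ungroup()

res$y
```

Unlike .GlobalEnv, .env looks in the environment where the expression was written, so it should also work when the pipeline sits inside a function.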

How to join tibbles while merging non-matched columns into lists

I'm looking for a way to perform a full join on 2+ tibbles, by a column with unique indices, in a way that would preserve the original column names and merge (non-identical) values into a vector or list. The tibbles have the same column names.
Example input tibbles
> a
# A tibble: 3 × 3
id name location
<dbl> <chr> <chr>
1 1 Caspar NL
2 2 Monica USA
3 3 Martin DE
> b
# A tibble: 3 × 3
id name location
<dbl> <chr> <chr>
1 1 Caspar WWW
2 2 Monique USA
3 4 Francis FR
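Desired output (not reproduced here): one row per id, with the original column names kept and the differing values of name and location merged into a vector or list.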
The ability to handle more than just 2 tibbles at the same time would be ideal.
All I know is dplyr's full_join(), which doesn't give me the desired result:
> dplyr::full_join(a,b, by='id')
# A tibble: 4 × 5
id name.x location.x name.y location.y
<dbl> <chr> <chr> <chr> <chr>
1 1 Caspar NL Caspar WWW
2 2 Monica USA Monique USA
3 3 Martin DE NA NA
4 4 NA NA Francis FR
Reprex
a <- tibble::tribble(~id, ~name, ~location, 1, 'Caspar', 'NL', 2, 'Monica', 'USA', 3, 'Martin', 'DE')
b <- tibble::tribble(~id, ~name, ~location, 1, 'Caspar', 'WWW', 2, 'Monique', 'USA', 4, 'Francis', 'FR')
It may be better to bind the rows first and then do a group_by() followed by summarise():
library(dplyr)
bind_rows(a, b) %>%
group_by(id) %>%
summarise(across(c('name', 'location'), list), .groups = 'drop')
-output
# A tibble: 4 × 3
id name location
<dbl> <list> <list>
1 1 <chr [2]> <chr [2]>
2 2 <chr [2]> <chr [2]>
3 3 <chr [1]> <chr [1]>
4 4 <chr [1]> <chr [1]>
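The same idea extends to any number of tibbles by passing a list to bind_rows(). The sketch below also wraps each group in unique(), on the assumption that identical values (such as 'Caspar' appearing in both inputs) should be stored only once; drop unique() if duplicates should be kept.

```r
library(dplyr)

a <- tibble::tribble(~id, ~name, ~location,
                     1, 'Caspar', 'NL', 2, 'Monica', 'USA', 3, 'Martin', 'DE')
b <- tibble::tribble(~id, ~name, ~location,
                     1, 'Caspar', 'WWW', 2, 'Monique', 'USA', 4, 'Francis', 'FR')

# Any number of tibbles with the same columns can go into this list.
tbls <- list(a, b)

merged <- bind_rows(tbls) %>%
  group_by(id) %>%
  # One list element per id, holding only the distinct values.
  summarise(across(c(name, location), ~ list(unique(.x))), .groups = 'drop')

merged$name[[1]]      # 'Caspar' appears once
merged$location[[1]]  # both locations are kept
```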

Remove empty lists from a tibble in R

I am trying to remove any list from my tibble that has "<chr [0]>"
library(tidyverse)
df <- tibble(x = 1:3, y = list(as.character()),
z=list(as.character("ATC"),as.character("TAC"), as.character()))
df
#> # A tibble: 3 × 3
#> x y z
#> <int> <list> <list>
#> 1 1 <chr [0]> <chr [1]>
#> 2 2 <chr [0]> <chr [1]>
#> 3 3 <chr [0]> <chr [0]>
Created on 2022-02-15 by the reprex package (v2.0.1)
I want my tibble to look like this
#> # A tibble: 3 × 3
#> x z
#> <int> <list>
#> 1 1 <chr [1]>
#> 2 2 <chr [1]>
#> 3 3 NA
any help is appreciated
You can do:
df %>%
select(where(~!all(lengths(.) == 0))) %>%
mutate(z = lapply(z, function(x) ifelse(length(x) == 0, NA, x)))
# A tibble: 3 x 2
x z
<int> <list>
1 1 <chr [1]>
2 2 <chr [1]>
3 3 <lgl [1]>
Note that in your z column you can't have list elements for rows 1 and 2 and a bare logical value NA for row 3. The whole column needs to be a list.
If all elements of z have only one element, you can add another line of code with mutate(z = unlist(z)).
The OP asked for a more dynamic solution that handles several columns.
Here is an example where I simply created another z2 variable. Generally, you can repeat the recoding for several columns using across.
library(tidyverse)
df <- tibble(x = 1:3, y = list(as.character()),
z=list(as.character("ATC"),as.character("TAC"), as.character()),
z2 = z)
df %>%
select(where(~!all(lengths(.) == 0))) %>%
mutate(across(starts_with('z'), ~ lapply(., function(x) ifelse(length(x) == 0, NA, x))))
Which gives:
# A tibble: 3 x 3
x z z2
<int> <list> <list>
1 1 <chr [1]> <chr [1]>
2 2 <chr [1]> <chr [1]>
3 3 <lgl [1]> <lgl [1]>
A two-step way using base R:
df <- tibble(x = 1:3, y = list(as.character()),
z=list(as.character("ATC"),as.character("TAC"), as.character()))
df <- df[apply(df, 2, function(x) any(lapply(x, length) > 0))] #Remove empty columns
df[apply(df, 2, function(x) lapply(x, length) == 0)] <- NA #Replace empty lists with NA
df
# A tibble: 3 x 2
x z
<int> <list>
1 1 <chr [1]>
2 2 <chr [1]>
3 3 <NULL>
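A variation on the base R approach can be sketched with lengths(), so that empty elements become NA rather than NULL. This is only an illustration: it uses a plain data.frame with list columns instead of a tibble, and it assumes NA_character_ is the wanted placeholder.

```r
# Plain data.frame with two list columns, mirroring the question's tibble.
df <- data.frame(x = 1:3)
df$y <- list(character(0), character(0), character(0))
df$z <- list("ATC", "TAC", character(0))

# Step 1: drop list columns in which every element is empty.
all_empty <- vapply(
  df,
  function(col) is.list(col) && all(lengths(col) == 0),
  logical(1)
)
df <- df[!all_empty]

# Step 2: in the remaining list columns, replace empty elements with NA.
for (nm in names(df)[vapply(df, is.list, logical(1))]) {
  empty <- lengths(df[[nm]]) == 0
  df[[nm]][empty] <- list(NA_character_)
}

names(df)
df$z
```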

Simplifying the list for nested data frame

Sorry, I am new to R.
I need to get a data frame ready for a JSON format, but I am having trouble putting the variable back into its original format c(1,2,3,...). For example:
library(tidyr)
x<-tibble(x = 1:3, y = list(c(1,5), c(1,5,10), c(1,2,3,20)))
View(x)
This shows
1 1 c(1, 5)
2 2 c(1, 5, 10)
3 3 c(1, 2, 3, 20)
x1<-x %>% unnest(y)
x2<-x1 %>% nest(data=c(y))
View(x2)
This shows
1 1 1 variable
2 2 1 variable
3 3 1 variable
The desired format is c(...) rather than "1 variable", so that the data is ready for the JSON file:
1 1 c(1, 5)
2 2 c(1, 5, 10)
3 3 c(1, 2, 3, 20)
Please help
x$y is a list-column of doubles, whereas x2$data is a list-column of tibbles.
Use map and unlist to turn the tibbles into doubles.
library(tidyverse)
x2 %>%
mutate(data = map(data, unlist))
#> # A tibble: 3 x 2
#> x data
#> <int> <list>
#> 1 1 <dbl [2]>
#> 2 2 <dbl [3]>
#> 3 3 <dbl [4]>
Alternatively, instead of nesting, you can use summarise.
x1 %>%
group_by(x) %>%
summarise(data = list(y))
#> # A tibble: 3 x 2
#> x data
#> <int> <list>
#> 1 1 <dbl [2]>
#> 2 2 <dbl [3]>
#> 3 3 <dbl [4]>
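If the tidyverse is not available, the regrouping step can be sketched in base R with split(). The long-format data frame below is typed in by hand to match x1 from the question, and the name x_nested is made up for this illustration.

```r
# Long format, equivalent to x1 in the question.
x1 <- data.frame(
  x = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  y = c(1, 5, 1, 5, 10, 1, 2, 3, 20)
)

# split() returns one numeric vector per value of x, named "1", "2", "3".
y_by_x <- split(x1$y, x1$x)

# Rebuild a data frame with a list column of plain numeric vectors.
x_nested <- data.frame(x = as.integer(names(y_by_x)))
x_nested$y <- unname(y_by_x)

x_nested$y
```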

Checking whether a value is present in each row of a list column of a data frame

I have this dataframe :
df = structure(list(session_id = 1:14, rv = list(c(1, 2, 3), 4, c(5,
6), c(7, 8), 5, c(9, 6, 10, 10), c(9, 6), c(11, 9, 12, 13), c(8,
3, 9), 3, 14, c(13, 11, 15), c(6, 6), 16)), row.names = c(NA,
14L), vars = list(session_id), drop = TRUE, .Names = c("session_id",
"rv"), class = c("rowwise_df", "tbl_df", "tbl", "data.frame"))
Now I want to check, for each session_id, whether the value 9 is present in rv and, if so, at which position.
For example, in the first row 9 is not present in rv[[1]], so return 0. The same holds for rv[[2]]. In the 6th row, rv[[6]] contains 9 at position 1, so return 1. Likewise, in the 9th row, rv[[9]] contains 9 at index 3, so return 3. So the idea is: if the value 9 is present in rv, return its index position, else return 0.
I hope this is enough to explain the idea.
Looking for dplyr way.
Using a combination of dplyr and purrr you can try:
df %>% ungroup() %>%
mutate(index = map_int(rv, function(l) if_else(any(l == 9), which.max(l == 9), 0L)))
# A tibble: 14 x 3
session_id rv index
<int> <list> <int>
1 1 <dbl [3]> 0
2 2 <dbl [1]> 0
3 3 <dbl [2]> 0
4 4 <dbl [2]> 0
5 5 <dbl [1]> 0
6 6 <dbl [4]> 1
7 7 <dbl [2]> 1
8 8 <dbl [4]> 2
9 9 <dbl [3]> 3
10 10 <dbl [1]> 0
11 11 <dbl [1]> 0
12 12 <dbl [3]> 0
13 13 <dbl [2]> 0
14 14 <dbl [1]> 0
Here I use map_int because the input is a list and you want to have an integer as output.
In case there are several 9 in one of the vectors, it returns the first index.
I had to use ungroup as your data.frame is a "rowwise_df".
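The same lookup can also be sketched in base R with match(), which returns the first position of a value and takes a nomatch argument for the "not found" case. The list below is copied from the rv column of the question; the surrounding data frame is left out to keep the example self-contained.

```r
# The rv column from the question, as a plain list.
rv <- list(c(1, 2, 3), 4, c(5, 6), c(7, 8), 5, c(9, 6, 10, 10), c(9, 6),
           c(11, 9, 12, 13), c(8, 3, 9), 3, 14, c(13, 11, 15), c(6, 6), 16)

# match() gives the first index of 9 in each vector, or 0 if it is absent.
index <- vapply(rv, function(l) match(9, l, nomatch = 0L), integer(1))

index
```

As with which.max() in the answer above, only the first occurrence is reported when a vector contains several 9s.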
