I have a very large data.table that I want to trim down as follows:
Only one row per unique id
If the same log contains any art other than "X", that other value should stay
If the log contains only "X", then the first "X" should stay
If there is more than one value other than "X", then all of them should stay, separated by commas, but not the "X".
Sample dataset:
library(data.table)
dt <- data.table(
id=c(1,1,2,3,3,4,4,4,5,5),
log=c(11,11,11,12,12,12,12,12,13,13),
art=c("X", "Y", "X", "X", "X", "Z", "X", "Y","X", "X")
)
dt
id log art
1: 1 11 X
2: 1 11 Y
3: 2 11 X
4: 3 12 X
5: 3 12 X
6: 4 12 Z
7: 4 12 X
8: 4 12 Y
9: 5 13 X
10: 5 13 X
Required output:
id log art
1 11 Y
2 11 Y
3 12 Z,Y
4 12 Z,Y
5 13 X
Here is one method, though there may be a more efficient approach.
unique(dt[,.(id, log)])[dt[, .(art=if(.N == 1 || all(art == "X"))
art[1] else toString(unique(art[art != "X"]))),
by=log], on="log"]
which returns
id log art
1: 1 11 Y
2: 2 11 Y
3: 3 12 Z, Y
4: 4 12 Z, Y
5: 5 13 X
This performs a left join of the desired art value for each log onto the unique pairs of id and log. It assumes that no id spans two logs, which is the case in the example.
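The same idea can be written as two explicit steps, which may be easier to follow. This is only a sketch: the names per_log, ids and res are made up for illustration, and paste(collapse = ",") is used instead of toString() so that the separator matches the required output ("Z,Y" rather than "Z, Y").

```r
library(data.table)

dt <- data.table(
  id  = c(1, 1, 2, 3, 3, 4, 4, 4, 5, 5),
  log = c(11, 11, 11, 12, 12, 12, 12, 12, 13, 13),
  art = c("X", "Y", "X", "X", "X", "Z", "X", "Y", "X", "X")
)

# Step 1: one art value per log (keep "X" only if nothing else is present)
per_log <- dt[, .(art = if (all(art == "X")) "X"
                        else paste(unique(art[art != "X"]), collapse = ",")),
              by = log]

# Step 2: attach that value to every unique (id, log) pair
ids <- unique(dt[, .(id, log)])
res <- ids[per_log, on = "log"]
res
```

Like the one-liner, this assumes that each id belongs to a single log.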
We can try
dt[, .(art = if(all(art=="X")) "X" else
toString(unique(art[art != "X"]))), .(id, logbld = log)]
# id logbld art
#1: 1 11 Y
#2: 2 11 X
#3: 3 12 X
#4: 4 12 Z, Y
#5: 5 13 X
Just wanted to try this with dplyr:
library(data.table)
library(dplyr)
dat <- setDT(dt %>% group_by(id) %>%
unique() %>%
summarise(bldlog = mean(log),
art = gsub("X,|,X", "",paste(art, collapse = ","))))
dat
# id bldlog art
# 1: 1 11 Y
# 2: 2 11 X
# 3: 3 12 X
# 4: 4 12 Z,Y
# 5: 5 13 X
I have a dt:
library(data.table)
DT <- data.table(a = c(1,2,3,4,5), b = c(4,5,6,7,8), c = c("X","X","X","Y","Y") )
I want to add a column d, computed within each group of column c:
in the first row of each group, d[i] should equal b[i],
from the second row to the last row of each group, d[i] should be d[i-1] + 2*b[i]
Intended results:
a b c d
1: 1 4 X 4
2: 2 5 X 14
3: 3 6 X 26
4: 4 7 Y 7
5: 5 8 Y 23
I tried functions such as shift, but I struggle to update rows dynamically (so to speak) here.
Is there an elegant data.table-style solution?
We can use cumsum and subtract the first row using [1]:
DT[, d := cumsum(2 * b) - b[1], .(c)][]
#> a b c d
#> 1: 1 4 X 4
#> 2: 2 5 X 14
#> 3: 3 6 X 26
#> 4: 4 7 Y 7
#> 5: 5 8 Y 23
Here we can use accumulate
library(purrr)
library(data.table)
DT[, d := accumulate(b, ~ .x + 2 *.y), by = c]
Or with Reduce and accumulate = TRUE from base R
DT[, d := Reduce(function(x, y) x + 2 * y, b, accumulate = TRUE), by = c]
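To see why the cumsum shortcut matches the recurrence: unrolling d[i] = d[i-1] + 2*b[i] with d[1] = b[1] gives d[i] = b[1] + 2*(b[2] + ... + b[i]), which is cumsum(2*b)[i] - b[1]. A quick base R check on the values of group "X" (the names d_loop and d_cumsum are only illustrative):

```r
b <- c(4, 5, 6)  # column b for group "X"

# Direct recurrence: d[1] = b[1], d[i] = d[i-1] + 2 * b[i]
d_loop <- numeric(length(b))
d_loop[1] <- b[1]
for (i in seq_along(b)[-1]) d_loop[i] <- d_loop[i - 1] + 2 * b[i]

# Closed form used in the cumsum answer
d_cumsum <- cumsum(2 * b) - b[1]

all.equal(d_loop, d_cumsum)
```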
I have the following dataframe
library(tidyverse)
x <- c(1,2,3,NA,NA,4,5)
y <- c(1,2,3,5,5,4,5)
z <- c(1,1,1,6,7,7,8)
df <- data.frame(x,y,z)
df
x y z
1 1 1 1
2 2 2 1
3 3 3 1
4 NA 5 6
5 NA 5 7
6 4 4 7
7 5 5 8
I would like to update the dataframe according to the following conditions
If z==1, update to x=1, else leave the current value for x
If z==1, update to y=2, else leave the current value for y
The following code does the job fine
df %>% mutate(x=if_else(z==1,1,x),y=if_else(z==1,2,y))
x y z
1 1 2 1
2 1 2 1
3 1 2 1
4 NA 5 6
5 NA 5 7
6 4 4 7
7 5 5 8
However, I have to repeat the if_else condition for both x and y inside mutate. This has the potential to make my code complicated and hard to read. To give you a SQL analogy, consider the following code:
UPDATE df
SET x= 1, y= 2
WHERE z = 1;
I would like to achieve the following:
Specify the update condition ahead of time, so I don't have to repeat it for every mutate function
I would like to avoid using data.table or base R. I am using dplyr so I would like to stick to it for consistency
Using mutate_cond, a user-defined helper (not part of dplyr) posted in the answers to "dplyr mutate/replace several columns on a subset of rows", we can do this:
df %>% mutate_cond(z == 1, x = 1, y = 2)
giving:
x y z
1 1 2 1
2 1 2 1
3 1 2 1
4 NA 5 6
5 NA 5 7
6 4 4 7
7 5 5 8
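Since mutate_cond has to be defined before it can be used, here is a minimal sketch of such a helper. It is written for illustration and may differ from the definition in the linked answer; it also assumes the condition contains no NA values.

```r
library(dplyr)

# Sketch of a mutate_cond-style helper; this is not a dplyr function
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  # Evaluate the condition using the columns of .data
  condition <- eval(substitute(condition), .data, envir)
  # Apply mutate() only to the matching rows and write them back
  .data[condition, ] <- mutate(.data[condition, , drop = FALSE], ...)
  .data
}

df <- data.frame(x = c(1, 2, 3, NA, NA, 4, 5),
                 y = c(1, 2, 3, 5, 5, 4, 5),
                 z = c(1, 1, 1, 6, 7, 7, 8))
res <- mutate_cond(df, z == 1, x = 1, y = 2)
res
```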
sqldf
Of course you can directly implement it in SQL with sqldf -- ignore the warning message that the backend RSQLite issues.
library(sqldf)
sqldf(c("update df set x = 1, y = 2 where z = 1", "select * from df"))
base R
It is straightforward in base R:
df[df$z == 1, c("x", "y")] <- list(1, 2)
library(dplyr)
df %>%
mutate(x = replace(x, z == 1, 1),
y = replace(y, z == 1, 2))
# x y z
#1 1 2 1
#2 1 2 1
#3 1 2 1
#4 NA 5 6
#5 NA 5 7
#6 4 4 7
#7 5 5 8
In base R
transform(df,
x = replace(x, z == 1, 1),
y = replace(y, z == 1, 2))
If you store the condition in a variable, you don't have to type it multiple times
condn = (df$z == 1)
transform(df,
x = replace(x, condn, 1),
y = replace(y, condn, 2))
Here is one option with map2. Loop over the 'x' and 'y' columns together with the replacement values and apply case_when: where 'z' is 1, return the new value, otherwise keep the original value. Then bind the result to the remaining column of the original dataset and restore the column order.
library(dplyr)
library(purrr)
map2_df(df %>%
select(x, y), c(1, 2), ~ case_when(df$z == 1 ~ .y, TRUE ~ .x)) %>%
bind_cols(df %>%
select(z), .) %>%
select(names(df))
Or using base R, create a logical vector, use that to subset the rows of columns 'x', 'y' and update by assigning to a list of values
i1 <- df$z == 1
df[i1, c('x', 'y')] <- list(1, 2)
df
# x y z
#1 1 2 1
#2 1 2 1
#3 1 2 1
#4 NA 5 6
#5 NA 5 7
#6 4 4 7
#7 5 5 8
The advantage of both solutions is that they extend to any number of columns and corresponding values without repeating code.
If you have an SQL background, you should really check out data.table:
library(data.table)
dt <- as.data.table(df)
set(dt, which(dt$z == 1), c('x', 'y'), list(1, 2))
dt
# or perhaps more classic syntax
dt <- as.data.table(df)
dt
# x y z
#1: 1 1 1
#2: 2 2 1
#3: 3 3 1
#4: NA 5 6
#5: NA 5 7
#6: 4 4 7
#7: 5 5 8
dt[z == 1, `:=`(x = 1, y = 2)]
dt
# x y z
#1: 1 2 1
#2: 1 2 1
#3: 1 2 1
#4: NA 5 6
#5: NA 5 7
#6: 4 4 7
#7: 5 5 8
The last option is an update join. This is convenient if the lookup data is already prepared up front:
# update join:
dt <- as.data.table(df)
dt_lookup <- data.table(x = 1, y = 2, z = 1)
dt[dt_lookup, on = .(z), `:=`(x = i.x, y = i.y)]
dt
I have the mydf data frame below. I want to split every cell that contains comma-separated data into separate rows. I am looking for a data frame similar to y below. How can I do this efficiently in a few steps? Currently I am using the cSplit function on one column at a time.
I tried cSplit(mydf, c("name","new"), ",", direction = "long"), but that didn't work.
library(splitstackshape)
mydf=data.frame(name = c("AB,BW","x,y,z"), AB = c('A','B'), new=c("1,2,3","4,5,6,7"))
mydf
x=cSplit(mydf, c("name"), ",", direction = "long")
x
y=cSplit(x, c("new"), ",", direction = "long")
y
There are times when a for loop is totally fine to work with in R. This is one of those times. Try:
library(splitstackshape)
cols <- c("name", "new")
for (i in cols) {
mydf <- cSplit(mydf, i, ",", "long")
}
mydf
## name AB new
## 1: AB A 1
## 2: AB A 2
## 3: AB A 3
## 4: BW A 1
## 5: BW A 2
## 6: BW A 3
## 7: x B 4
## 8: x B 5
## 9: x B 6
## 10: x B 7
## 11: y B 4
## 12: y B 5
## 13: y B 6
## 14: y B 7
## 15: z B 4
## 16: z B 5
## 17: z B 6
## 18: z B 7
Here's a small test using slightly bigger data:
# concat.test = sample data from "splitstackshape"
test <- do.call(rbind, replicate(5000, concat.test, FALSE))
fun1 <- function() {
cols <- c("Likes", "Siblings")
for (i in cols) {
test <- cSplit(test, i, ",", "long")
}
test
}
fun2 <- function() {
test %>%
separate_rows("Likes") %>%
separate_rows("Siblings")
}
system.time(fun1())
# user system elapsed
# 3.205 0.056 3.261
system.time(fun2())
# user system elapsed
# 11.598 0.066 11.662
We can use the separate_rows function from the tidyr package.
library(tidyr)
mydf2 <- mydf %>%
separate_rows("name") %>%
separate_rows("new")
mydf2
# AB name new
# 1 A AB 1
# 2 A AB 2
# 3 A AB 3
# 4 A BW 1
# 5 A BW 2
# 6 A BW 3
# 7 B x 4
# 8 B x 5
# 9 B x 6
# 10 B x 7
# 11 B y 4
# 12 B y 5
# 13 B y 6
# 14 B y 7
# 15 B z 4
# 16 B z 5
# 17 B z 6
# 18 B z 7
If you don't want to call separate_rows more than once, we can write a function that applies it iteratively.
expand_fun <- function(df, vars){
while (length(vars) > 0){
df <- df %>% separate_rows(vars[1])
vars <- vars[-1]
}
return(df)
}
expand_fun takes two arguments. The first, df, is the original data frame. The second, vars, is a character vector with the names of the columns we want to expand. Here is an example using the function.
mydf3 <- expand_fun(mydf, vars = c("name", "new"))
mydf3
# AB name new
# 1 A AB 1
# 2 A AB 2
# 3 A AB 3
# 4 A BW 1
# 5 A BW 2
# 6 A BW 3
# 7 B x 4
# 8 B x 5
# 9 B x 6
# 10 B x 7
# 11 B y 4
# 12 B y 5
# 13 B y 6
# 14 B y 7
# 15 B z 4
# 16 B z 5
# 17 B z 6
# 18 B z 7
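As a possible alternative to the while loop, the same column-by-column expansion can be sketched with Reduce from base R. This assumes a tidyr version that re-exports all_of(); the name out is only illustrative.

```r
library(tidyr)

mydf <- data.frame(name = c("AB,BW", "x,y,z"),
                   AB   = c("A", "B"),
                   new  = c("1,2,3", "4,5,6,7"))

# Split each listed column in turn; all_of() selects a column by its name
out <- Reduce(function(d, col) separate_rows(d, all_of(col), sep = ","),
              c("name", "new"), mydf)
nrow(out)
```

Each column is expanded on its own, so the row count is the product of the piece counts per original row (2 * 3 + 3 * 4 = 18 here).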
I would like to transform a list like this:
l <- list(x = c(1, 2), y = c(3, 4, 5))
into a tibble like this:
Name Value
x 1
x 2
y 3
y 4
y 5
I think nothing will be easier than using the stack function from base R:
df <- stack(l)
gives you a dataframe back:
> df
values ind
1 1 x
2 2 x
3 3 y
4 4 y
5 5 y
Because you asked for tibble as output, you can do as_tibble(df) (from the tibble-package) to get that.
Or more directly: df <- as_tibble(stack(l)).
Another pure base R method:
df <- data.frame(ind = rep(names(l), lengths(l)), value = unlist(l), row.names = NULL)
which gives a similar result:
> df
ind value
1 x 1
2 x 2
3 y 3
4 y 4
5 y 5
The row.names = NULL argument isn't strictly needed, but it gives plain row numbers as row names instead of the names inherited from unlist(l).
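If the column names and order from the question (Name, Value) matter, the stack result can be rearranged in base R before converting it to a tibble. A small sketch, assuming character names are wanted rather than the factor that stack returns:

```r
l <- list(x = c(1, 2), y = c(3, 4, 5))

df <- stack(l)[, c("ind", "values")]  # put the name column first
names(df) <- c("Name", "Value")       # rename to match the question
df$Name <- as.character(df$Name)      # stack() returns ind as a factor
df
```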
Update
I found a better solution.
This works both in case of simple and complicated lists like the one I posted before (below)
l %>% map_dfr(~ .x %>% as_tibble(), .id = "name")
gives us
# A tibble: 5 x 2
name value
<chr> <dbl>
1 x 1.
2 x 2.
3 y 3.
4 y 4.
5 y 5.
==============================================
Original answer
From tidyverse:
l %>%
map(~ as_tibble(.x)) %>%
map2(names(.), ~ add_column(.x, Name = rep(.y, nrow(.x)))) %>%
bind_rows()
gives us
# A tibble: 5 × 2
value Name
<dbl> <chr>
1 1 x
2 2 x
3 3 y
4 4 y
5 5 y
The stack function from base R is great for simple lists as Jaap showed.
However, with more complicated lists like:
l <- list(
a = list(num = 1:3, let_a = letters[1:3]),
b = list(num = 101:103, let_b = letters[4:6]),
c = list()
)
we get
stack(l)
values ind
1 1 a
2 2 a
3 3 b
4 a b
5 b a
6 c a
7 101 b
8 102 b
9 103 a
10 d a
11 e b
12 f b
which is wrong.
The tidyverse solution shown above works fine, keeping the data from different elements of the nested list separated:
# A tibble: 6 × 4
num let_a Name let_b
<int> <chr> <chr> <chr>
1 1 a a <NA>
2 2 b a <NA>
3 3 c a <NA>
4 101 <NA> b d
5 102 <NA> b e
6 103 <NA> b f
We can use melt from reshape2
library(reshape2)
melt(l)
# value L1
#1 1 x
#2 2 x
#3 3 y
#4 4 y
#5 5 y
In the following dataset:
Day Place Name
22 X A
22 X A
22 X B
22 X A
22 Y C
22 Y C
22 Y D
23 X B
23 X A
How can I assign numbering to the variable Name in the following order using R:
Day Place Name Number
22 X A 1
22 X A 1
22 X B 2
22 X A 1
22 Y C 1
22 Y C 1
22 Y D 2
23 X B 1
23 X A 2
In a nutshell, I need to number the names according to their order of occurrence on a certain day and at a certain place.
In base R using tapply:
dat$Number <-
unlist(tapply(dat$Name,paste(dat$Day,dat$Place),
FUN=function(x){
y <- as.character(x)
as.integer(factor(y,levels=unique(y)))
}))
# Day Place Name Number
# 1 22 X A 1
# 2 22 X A 1
# 3 22 X B 2
# 4 22 Y C 1
# 5 22 Y C 1
# 6 22 Y D 2
# 7 23 X B 1
# 8 23 X A 2
Idea:
Group by Day and Place using tapply.
For each group, coerce Name to a factor whose levels keep their order of first appearance.
Coerce that factor to integer to get the final result.
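One caveat: unlist(tapply(...)) returns the values in group order, so the assignment lines up with the rows only when the data is already sorted by Day and Place. A sketch of the same idea with ave, which keeps the original row order (match(x, unique(x)) stands in for the factor coercion):

```r
dat <- read.table(header = TRUE, text = "Day Place Name
22 X A
22 X A
22 X B
22 X A
22 Y C
22 Y C
22 Y D
23 X B
23 X A")

# Row positions are grouped by Day and Place; within each group, match()
# gives the position of each name among the group's unique names
dat$Number <- ave(seq_len(nrow(dat)), dat$Day, dat$Place,
                  FUN = function(i) match(dat$Name[i], unique(dat$Name[i])))
dat
```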
Using data.table (syntactic sugar):
library(data.table)
setDT(dat)[,Number := {
y <- as.character(Name)
as.integer(factor(y,levels=unique(y)))
},"Day,Place"]
Day Place Name Number
1: 22 X A 1
2: 22 X A 1
3: 22 X B 2
4: 22 Y C 1
5: 22 Y C 1
6: 22 Y D 2
7: 23 X B 1
8: 23 X A 2
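Another base R option uses ave. Note that idx numbers consecutive runs, so it matches the desired output only when equal names are adjacent within each group:
idx <- function(x) cumsum(c(TRUE, tail(x, -1) != head(x, -1)))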
transform(dat, Number = ave(idx(Name), Day, Place, FUN = idx))
# Day Place Name Number
# 1 22 X A 1
# 2 22 X A 1
# 3 22 X B 2
# 4 22 Y C 1
# 5 22 Y C 1
# 6 22 Y D 2
# 7 23 X B 1
# 8 23 X A 2
Use ddply from plyr.
dfr <- read.table(header = TRUE, text = "Day Place Name
22 X A
22 X A
22 X B
22 X A
22 Y C
22 Y C
22 Y D
23 X B
23 X A")
library(plyr)
ddply(
dfr,
.(Day, Place),
mutate,
Number = as.integer(factor(Name, levels = unique(Name)))
)
Or use dplyr, in a variant of beginneR's deleted answer.
library(dplyr)
dfr %>%
group_by(Day, Place) %>%
mutate(Number = as.integer(factor(Name, levels = unique(Name))))