How to split a dataframe into chunks when a particular column value occurs (R)? [duplicate] - r

This question already has answers here:
Split data.frame by value
(2 answers)
Closed 3 years ago.
I am trying to split a dataframe into chunks, based on a particular value in a column (rather than a grouping value), so every time the column matches this value, it should chunk the dataframe. For example, with dataframe x:
f1 f2
3 0
4 1
5 2
6 0
7 1
8 2
9 3
How would I split x to be a list of, where the split occurs anytime "f2"==0:
[1]
f1 f2
3 0
4 1
5 2
[2]
f1 f2
6 0
7 1
8 2
9 3
I have tried
split(x, x$f2 == 0)
which just creates a list of two elements, one where x x$f2 == 0 is FALSE and one where x$f2 == 0 is TRUE.
I have also tried to use apply() as in
mm <- apply(x, function(x) split(x$f2 == 0))
but I get the error "Error in match.fun(FUN) : argument "FUN" is missing, with no default"
Code to make a simple dataframe as above:
f1 <- c(3,4,5,6,7,8,9)
f2 <- c(0,1,2,0,1,2,3)
x <- data.frame(f1,f2)

Using base R's split with for example cumsum like this would be a way:
split(x, cumsum(x$f2 == 0))
Output
# $`1`
# f1 f2
# 1 3 0
# 2 4 1
# 3 5 2
#
# $`2`
# f1 f2
# 4 6 0
# 5 7 1
# 6 8 2
# 7 9 3

With dplyr, you can do (basically the same thing as the idea by #jogo):
df %>%
group_split(cumsum(f2 == 0), keep = FALSE)
[[1]]
# A tibble: 3 x 2
f1 f2
<int> <int>
1 3 0
2 4 1
3 5 2
[[2]]
# A tibble: 4 x 2
f1 f2
<int> <int>
1 6 0
2 7 1
3 8 2
4 9 3

Related

How to vectorize the RHS of dplyr::case_when?

Suppose I have a dataframe that looks like this:
> data <- data.frame(x = c(1,1,2,2,3,4,5,6), y = c(1,2,3,4,5,6,7,8))
> data
x y
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 4 6
7 5 7
8 6 8
I want to use mutate and case_when to create a new id variable that will identify rows using the variable x, and give rows missing x a unique id. In other words, I should have the same id for rows one and two, rows three and four, while rows 5-8 should have their own unique ids. Suppose I want to generate these id values with a function:
id_function <- function(x, n){
set.seed(x)
res <- character(n)
for(i in seq(n)){
res[i] <- paste0(sample(c(letters, LETTERS, 0:9), 32), collapse="")
}
res
}
id_function(1, 1)
[1] "4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf"
I am trying to use this function on the RHS of a case_when expression like this:
data %>%
mutate(my_id = id_function(1234, nrow(.)),
my_id = dplyr::case_when(!is.na(x) ~ id_function(x, 1),
TRUE ~ my_id))
But the RHS does not seem to be vectorized and I get the same value for all non-missing values of x:
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
4 2 4 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
5 NA 5 0vnws5giVNIzp86BHKuOZ9ch4dtL3Fqy
6 NA 6 IbKU6DjvW9ypitl7qc25Lr4sOwEfghdk
7 NA 7 8oqQMPx6IrkGhXv4KlUtYfcJ5Z1RCaDy
8 NA 8 BRsjumlCEGS6v4ANrw1bxLynOKkF90ao
I'm sure there's a way to vectorize the RHS, what am I doing wrong? Is there an easier approach to solving this problem?
I guess rowwise() would do the trick:
data %>%
rowwise() %>%
mutate(my_id = id_function(x, 1))
x y my_id
1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN
purrr map functions can be used for non-vectorized functions. The following will give you a similar result. map2 will take the two arguments expected by your id_function.
library(tidyverse)
data %>%
mutate(my_id = map2(x, 1, id_function))
Output
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
4 2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
5 3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
6 4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
7 5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
8 6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN

Changing a subset of column names in a list of data frames in R

This question is an extension of Changing Column Names in a List of Data Frames in R.
That post addresses changing names of all columns of a data.frame.
But how do you change the names of only a selected number of columns?
Example:
I want to change the name of the first column only in each data.frame in my list:
dat <- data.frame(Foo = 1:5,Bar = 1:5)
lst <- list(dat,dat)
print(lst)
[[1]]
Foo Bar
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
[[2]]
Foo Bar
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
(Failed) Attempts:
lapply(1:2, function(x) names(lst[[x]])[names(lst[[x]]) == 'Foo'] <- 'New')
lapply(1:2, function(x) names(lst[[x]])[names(lst[[x]]) == 'Foo']) <- rep('New',2)
lapply(1:2, function(x) setNames(lst[[x]][names(lst[[x]]) == 'Foo'],'New'))
Here is one possibility using setNames and gsub:
# Sample data
dat <- data.frame(Foo = 1:5,Bar = 1:5)
lst <- list(dat,dat[, 2:1])
# Replace Foo with FooFoo
lst <- lapply(lst, function(x) setNames(x, gsub("^Foo$", "FooFoo", names(x))) )
#[[1]]
# FooFoo Bar
#1 1 1
#2 2 2
#3 3 3
#4 4 4
#5 5 5
#
#[[2]]
# Bar FooFoo
#1 1 1
#2 2 2
#3 3 3
#4 4 4
#5 5 5
Two problems with your attempts:
It's weird to use lapply(1:2, ...) instead of lapply(lst, ...). This makes your anonymous function more awkward.
Your anonymous function doesn't return the data frame. The last line of a function is returned (in absence of a return() statement). In your first attempt, the value of the last line is just the value assigned, "new" - we need to return the whole data frame with the modified name.
Solution:
lapply(lst, function(x) {names(x)[names(x) == 'Foo'] <- 'New'; x})
# [[1]]
# New Bar
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
#
# [[2]]
# New Bar
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
Here is a way to change the name of the column by column index.
lapply(lst, function(x, pos = 1, newname = "New"){
# x: data frame, pos: column index, newname: new name of the column
column <- names(x)
column[pos] <- newname
names(x) <- column
return(x)
})
# [[1]]
# New Bar
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
#
# [[2]]
# New Bar
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
I posted this answer before I saw an updated comment from the OP saying that the index of the target column from each data frame could be different. This is not mentioned in the original post. Please see others' post as my answer only works if the column index is consistent.
My solution is more complicated than the others but here it goes.
The main difference is that instead of == it uses grep (with argument ignore.case = TRUE).
lapply(lst, function(DF) {
inx <- grep("^foo$", names(DF), ignore.case = TRUE)
names(DF)[inx] <- "New"
DF
})
#[[1]]
# New Bar
#1 1 1
#2 2 2
#3 3 3
#4 4 4
#5 5 5
#
#[[2]]
# New Bar
#1 1 1
#2 2 2
#3 3 3
#4 4 4
#5 5 5
Using tidyverse:
library(tidyverse)
map(lst,rename_at,"Foo",~"New")
# [[1]]
# New Bar
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
#
# [[2]]
# New Bar
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
Using data.table:
library(data.table)
lst2 <- copy(lst)
lapply(lst2,setnames,"Foo","New")
# [[1]]
# New Bar
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
#
# [[2]]
# New Bar
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
Here changes are made by reference so we make a copy first.
Note without the assignment, it doesn't change the original object.
lst <- purrr::map(lst, ~setNames(.x, c('new', names(.x)[-1])))

Subset data frame that include a variable

I have a list of events and sequences. I would like to print the sequences in a separate table if event = x is included somewhere in the sequence. See table below:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4
In this case I would like a new table that includes only the sequences where Event=x was included:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
Base R solution:
d[d$Sequence %in% d$Sequence[d$Event == "x"], ]
Event Sequence
1: a 1
2: a 1
3: x 1
4: a 3
5: a 3
6: x 3
data.table solution:
library(data.table)
setDT(d)[Sequence %in% Sequence[Event == "x"]]
As you can see syntax/logic is quite similar between these two solutions:
Find event's that are equal to x
Extract their Sequence
Subset table according to specified Sequence
We can use dplyr to group the data and filter the sequence with any "x" in it.
library(dplyr)
df2 <- df %>%
group_by(Sequence) %>%
filter(any(Event %in% "x")) %>%
ungroup()
df2
# A tibble: 6 x 2
Event Sequence
<chr> <int>
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
DATA
df <- read.table(text = " Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4",
header = TRUE, stringsAsFactors = FALSE)

Repeat vector to fill down column in data frame

Seems like this very simple maneuver used to work for me, and now it simply doesn't. A dummy version of the problem:
df <- data.frame(x = 1:5) # create simple dataframe
df
x
1 1
2 2
3 3
4 4
5 5
df$y <- c(1:5) # adding a new column with a vector of the exact same length. Works out like it should
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
df$z <- c(1:4) # trying to add a new colum, this time with a vector with less elements than there are rows in the dataframe.
Error in `$<-.data.frame`(`*tmp*`, "z", value = 1:4) :
replacement has 4 rows, data has 5
I was expecting this to work with the following result:
x y z
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 1
I.e. the shorter vector should just start repeating itself automatically. I'm pretty certain this used to work for me (it's in a script that I've been running a hundred times before without problems). Now I can't even get the above dummy example to work like I want to. What am I missing?
If the vector can be evenly recycled, into the data.frame, you do not get and error or a warning:
df <- data.frame(x = 1:10)
df$z <- 1:5
This may be what you were experiencing before.
You can get your vector to fit as you mention with rep_len:
df$y <- rep_len(1:3, length.out=10)
This results in
df
x z y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 1
5 5 5 2
6 6 1 3
7 7 2 1
8 8 3 2
9 9 4 3
10 10 5 1
Note that in place of rep_len, you could use the more common rep function:
df$y <- rep(1:3,len=10)
From the help file for rep:
rep.int and rep_len are faster simplified versions for two common cases. They are not generic.
If the total number of rows is a multiple of the length of your new vector, it works fine. When it is not, it does not work everywhere. In particular, probably you have used this type of recycling with matrices:
data.frame(1:6, 1:3, 1:4) # not a multiply
# Error in data.frame(1:6, 1:3, 1:4) :
# arguments imply differing number of rows: 6, 3, 4
data.frame(1:6, 1:3) # a multiple
# X1.6 X1.3
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 1
# 5 5 2
# 6 6 3
cbind(1:6, 1:3, 1:4) # works even with not a multiple
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 2 2 2
# [3,] 3 3 3
# [4,] 4 1 4
# [5,] 5 2 1
# [6,] 6 3 2
# Warning message:
# In cbind(1:6, 1:3, 1:4) :
# number of rows of result is not a multiple of vector length (arg 3)

Performing calculations on binned counts in R

I have a dataset stored in a text file in the format of bins of values followed by counts, like this:
var_a 1:5 5:12 7:9 9:14 ...
indicating that var_a took on the value 1 5 times in the dataset, 5 12 times, etc. Each variable is on its own line in that format.
I'd like to be able to perform calculations on this dataset in R, like quantiles, variance, and so on. Is there an easy way to load the data from the file and calculate these statistics? Ultimately I'd like to make a box-and-whisker plot for each variable.
Cheers!
You could use readLines to read in the data file
.x <- readLines(datafile)
I will create some dummy data, as I don't have the file. This should be the equivalent of the output of readLines
## dummy
.x <- c("var_a 1:5 5:12 7:9 9:14", 'var_b 1:5 2:12 3:9 4:14')
I split by spacing to get each
#split by space
space_split <- strsplit(.x, ' ')
# get the variable names (first in each list)
variable_names <- lapply(space_split,'[[',1)
# get the variable contents (everything but the first element in each list)
variable_contents <- lapply(space_split,'[',-1)
# a function to do the appropriate replicates
do_rep <- function(x){rep.int(x[1],x[2])}
# recreate the variables
variables <- lapply(variable_contents, function(x){
.list <- strsplit(x, ':')
unlist(lapply(lapply(.list, as.numeric), do_rep))
})
names(variables) <- variable_names
you could get the variance for each variable using
lapply(variables, var)
## $var_a
## [1] 6.848718
##
## $var_b
## [1] 1.138462
or get boxplots
boxplot(variables, ~.)
Not knowing the actual form that your data is in, I would probably use something like readLines to get each line in as a vector, then do something like the following:
# Some sample data
temp = c("var_a 1:5 5:12 7:9 9:14",
"var_b 1:7 4:9 3:11 2:10",
"var_c 2:5 5:14 6:6 3:14")
# Extract the names
NAMES = gsub("[0-9: ]", "", temp)
# Extract the data
temp_1 = strsplit(temp, " |:")
temp_1 = lapply(temp_1, function(x) as.numeric(x[-1]))
# "Expand" the data
temp_1 = lapply(1:length(temp_1),
function(x) rep(temp_1[[x]][seq(1, length(temp_1[[x]]), by=2)],
temp_1[[x]][seq(2, length(temp_1[[x]]), by=2)]))
names(temp_1) = NAMES
temp_1
# $var_a
# [1] 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 9 9 9 9 9 9
#
# $var_b
# [1] 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2
#
# $var_c
# [1] 2 2 2 2 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 3 3 3 3 3 3 3 3 3 3 3 3 3 3

Resources