Expand each row with specific value in tidyr [duplicate] - r

This question already has answers here:
Repeat rows of a data.frame [duplicate]
(10 answers)
Closed 3 years ago.
I have a dataset with one row per replicate (grouped observations). I would like to expand each row from a single observation per replicate to a set number of observations (in this case 20 each).
In the attached picture, each replicate is a row. I would like to expand each row into 20, so the "wellA" row for "LS x SB" becomes 20 identical rows. As a bonus, I would also like to make a new column called "Replicate2" that numbers these 20 new rows per replicate from 1 to 20.
The idea would then be to add the survival status per individual (reflected in the new columns "Status" and "Event").
I think the expand function in tidyr has potential, but I can't figure out how to add a fixed number of rows per replicate. Using the "Alive" column adds a variable number of observations.
expand <- DF %>% expand(nesting(Date, Time, Cumulative.hrs, Timepoint, Treatment, Boat, Parentage, Well, Mom, Dad, Cone, NumParents, Parents), Alive)
Any help appreciated!

In base R, we can use rep to repeat the rows and transform to add the new column:
n <- 20
transform(DF[rep(seq_len(nrow(DF)), each = n), ], Replicate2 = 1:n, row.names = NULL)
Using a reproducible example with n = 3
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
n <- 3
transform(df[rep(seq_len(nrow(df)), each = n), ], Replicate2 = 1:n, row.names = NULL)
# a b c Replicate2
#1 1 4 7 1
#2 1 4 7 2
#3 1 4 7 3
#4 2 5 8 1
#5 2 5 8 2
#6 2 5 8 3
#7 3 6 9 1
#8 3 6 9 2
#9 3 6 9 3
Using dplyr, we can use slice to repeat the rows and mutate to add the new column:
library(dplyr)
df %>%
  slice(rep(seq_len(n()), each = n)) %>%
  mutate(Replicate2 = rep(seq_len(n), length.out = n()))
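Since the question asks about tidyr specifically: a minimal sketch (not part of the answers above, assuming tidyr >= 0.8) using uncount(), which repeats each row a fixed number of times and can add the counter column itself:
library(tidyr)
# uncount() repeats each row `weights` times; .id adds the 1:20 counter per original row
DF %>% uncount(weights = 20, .id = "Replicate2")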

Do a cross join between your existing data and the numbers 1:20.
tidyr::crossing(DF, replicate2 = 1:20)
If you want to add additional columns, use mutate:
... %>% mutate(status = 1, event = FALSE)
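As a quick illustration with a made-up stand-in for the real data (a sketch; only two of the question's columns are used here):
library(dplyr)
library(tidyr)
# two example rows standing in for the poster's replicates
DF <- data.frame(Well = c("wellA", "wellB"), Parents = c("LS x SB", "LS x LS"))
crossing(DF, replicate2 = 1:20) %>%
  mutate(status = 1, event = FALSE)
# every Well/Parents row now appears 20 times, once per replicate2 value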

Related

Eliminate duplicates in R [duplicate]

This question already has answers here:
How can I remove all duplicates so that NONE are left in a data frame?
(3 answers)
Closed 1 year ago.
If I have a df like this
data<-data.frame(id=c(1,1,3,4),n=c("x","y","e","w"))
data
id n
1 1 x
2 1 y
3 3 e
4 4 w
I want to get a new df like this:
data
id n
3 3 e
4 4 w
That is, I want to remove all rows whose id is repeated. I've tried functions like distinct from dplyr, but it always keeps one of the repeated rows.
A subset option with ave:
subset(
  data,
  ave(n, id, FUN = length) == 1
)
gives
id n
3 3 e
4 4 w
We can use duplicated, checking from both ends so that every copy of a repeated id is dropped:
subset(data, !(duplicated(id)|duplicated(id, fromLast = TRUE)))
id n
3 3 e
4 4 w
or use table
subset(data, id %in% names(which(table(id) == 1)))
id n
3 3 e
4 4 w
Just adding to the already useful answers with a dplyr solution.
library(dplyr)
data %>%
  filter(!(duplicated(id, fromLast = FALSE) | duplicated(id, fromLast = TRUE)))
distinct won't work for you, as it keeps one row per distinct id; since id 1 is among the distinct values, one of its rows is always retained.
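To see the difference on the question's data (a quick sketch):
data %>% distinct(id, .keep_all = TRUE)
#   id n
# 1  1 x
# 2  3 e
# 3  4 w
# one row is kept per id, so id 1 still shows up once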
Although more verbose, you can also use base R.
data[!(duplicated(data["id"])|duplicated(data["id"], fromLast=TRUE)),]
Output
id n
3 3 e
4 4 w
Or use dplyr.
library(dplyr)
data %>%
  dplyr::group_by(id) %>%
  dplyr::filter(n() == 1) %>%
  dplyr::ungroup()

How to filter rows where there are changes in a categorical variable

We have a dataframe called data with 2 columns: Time which is arranged in ascending order, and Place which describes where the individual was:
data <- data.frame(Time = seq(1, 20, 1),
                   Place = rep(letters[c(1:3, 1)], c(5, 5, 3, 7)))
Since this data is in ascending order with respect to Time, we want to subset the rows where Place changes from the previous observation.
The resulting dataframe for this data would look like this:
Time Place
1 a
6 b
11 c
14 a
Notice that the same Place can show up later, like Place == a did in this example. How can we perform this kind of subset in R?
Apply duplicated on the rleid of the Place column:
library(dplyr)
library(data.table)
data %>%
  filter(!duplicated(rleid(Place)))
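For intuition: rleid() assigns one id per consecutive run of the same Place, so duplicated() then keeps only the first row of each run (a sketch of the intermediate step):
rleid(data$Place)
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 4 4 4 4 4 4 4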
Or in base R with rle
subset(data, !duplicated(with(rle(Place), rep(seq_along(values), lengths))))
Output:
Time Place
1 1 a
6 6 b
11 11 c
14 14 a
Another base R option using subset + tail + head
subset(
  data,
  c(TRUE, tail(Place, -1) != head(Place, -1))
)
which gives
Time Place
1 1 a
6 6 b
11 11 c
14 14 a
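A dplyr-only alternative (my own sketch, not part of the answers above) keeps a row whenever Place differs from the previous row, using lag(); the first row has no predecessor (lag() returns NA there), so the is.na() check keeps it:
library(dplyr)
data %>%
  filter(is.na(lag(Place)) | Place != lag(Place))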

R Fill empty cells of separate column with matching ID [duplicate]

This question already has answers here:
Replace missing values (NA) with most recent non-NA by group
(7 answers)
Replacing NAs with latest non-NA value
(21 answers)
Closed 3 years ago.
I have just added new data to my original data frame, so simplified, it looks like this:
df <- data.frame(ID = rep(letters[1:4], times = c(4, 4, 5, 3)), Volume = NA_real_)
df$Volume[c(1, 5, 9, 14)] <- c(1.23, 4.74, 5.35, 1.53)
df
   ID Volume
1   a   1.23
2   a     NA
3   a     NA
4   a     NA
5   b   4.74
6   b     NA
7   b     NA
8   b     NA
9   c   5.35
10  c     NA
11  c     NA
12  c     NA
13  c     NA
14  d   1.53
15  d     NA
16  d     NA
where I have an ID column with differing numbers of entries for each ID and a volume for one entry, but not the others.
Is there a way to populate the empty Volume cells with the filled cell of the corresponding ID?
I'm essentially trying to remove the step of going into Excel and using the "drag to fill" for each ID (I have over 2000 IDs). Not every ID has the same number of entries (i.e. ID "a" has 4, whereas ID "c" has 5 and ID "d" has 3).
I'm thinking dplyr will have a tool to do this, but I have not been able to find the answer.
In the tidyverse, we can group by ID and use fill(), which carries the last non-missing value down within each group:
library(tidyverse)
df %>%
  group_by(ID) %>%
  fill(Volume) %>%
  ungroup()
Here is a base R solution, where ave() is used to fill up the NAs
df <- within(df, Volume <- ave(Volume, ID, FUN = function(x) unique(x[!is.na(x)])))
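Either way, each ID's single recorded Volume is copied to its remaining rows; checking the first group after running the within()/ave() line (a sketch):
subset(df, ID == "a")
#   ID Volume
# 1  a   1.23
# 2  a   1.23
# 3  a   1.23
# 4  a   1.23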

Transform table [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 4 years ago.
I would like to repeat entire rows in a data-frame based on the samples column.
My input:
df <- 'chr start end samples
1 10 20 2
2 4 10 3'
df <- read.table(text=df, header=TRUE)
My expected output:
df <- 'chr start end samples
1 10 20 1-10-20-s1
1 10 20 1-10-20-s2
2 4 10 2-4-10-s1
2 4 10 2-4-10-s2
2 4 10 2-4-10-s3'
Any idea how to do this cleanly?
We can use expandRows from splitstackshape to expand the rows based on the value in the 'samples' column, convert the result to a data.table, and then, grouped by 'chr', paste the columns together along with the row sequence using sprintf to update the 'samples' column.
library(splitstackshape)
setDT(expandRows(df, "samples"))[,
samples := sprintf("%d-%d-%d-%s%d", chr, start, end, "s",1:.N) , chr][]
# chr start end samples
#1: 1 10 20 1-10-20-s1
#2: 1 10 20 1-10-20-s2
#3: 2 4 10 2-4-10-s1
#4: 2 4 10 2-4-10-s2
#5: 2 4 10 2-4-10-s3
NOTE: data.table will be loaded when we load splitstackshape.
You can achieve this using base R (i.e. avoiding data.tables), with the following code:
df <- 'chr start end samples
1 10 20 2
2 4 10 3'
df <- read.table(text = df, header = TRUE)
duplicate_rows <- function(chr, starts, ends, samples) {
  expanded_samples <- paste0(chr, "-", starts, "-", ends, "-", "s", 1:samples)
  repeated_rows <- data.frame("chr" = chr, "starts" = starts, "ends" = ends, "samples" = expanded_samples)
  repeated_rows
}
expanded_rows <- Map(f = duplicate_rows, df$chr, df$start, df$end, df$samples)
new_df <- do.call(rbind, expanded_rows)
The basic idea is to define a function that takes a single row of your initial data.frame and duplicates it based on the value in the samples column (while also creating the distinct character strings you're after). This function is then applied to each row of the initial data.frame via Map. The output is a list of data.frames that are re-combined into a single data.frame using the do.call(rbind, ...) pattern.
The above code can be made cleaner by using Hadley Wickham's purrr package (on CRAN) and the data.frame-specific version of map (see the documentation for the by_row function), but this may be overkill for what you're after.
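For reference, a compact sketch of the same idea with purrr's pmap_dfr() (my own sketch, not the by_row helper mentioned above); the function arguments are matched to the data.frame's column names:
library(purrr)
# pmap_dfr() calls the function once per row, passing the columns by name,
# and row-binds the resulting data.frames
pmap_dfr(df, function(chr, start, end, samples) {
  data.frame(chr = chr, start = start, end = end,
             samples = paste0(chr, "-", start, "-", end, "-s", seq_len(samples)))
})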
Example using the DataFrame function from the S4Vectors package:
df <- DataFrame(x=c('a', 'b', 'c', 'd', 'e'), y=1:5)
rep(df, df$y)
where the y column gives the number of times to repeat the corresponding row.
Result:
DataFrame with 15 rows and 2 columns
x y
<character> <integer>
1 a 1
2 b 2
3 b 2
4 c 3
5 c 3
... ... ...
11 e 5
12 e 5
13 e 5
14 e 5
15 e 5

sum cells of certain columns for each row

I would like to calculate a sum over certain columns for each row. Unfortunately, I can only get as far as a single row. How do I make it happen for every row? I know that R doesn't usually need explicit loops; what are good approaches?
My matrix (zscore) looks like this:
a b c t y
1 3 4 7 7 4
2 4 56 6 6 4
3 3 3 2 1 7
4 3 88 9 9 9
Now I would want to calculate the row sum for each row, based on some of the columns. For one row it could look like this:
f1 <- sum(zscore[1,1:2], zscore[1,3], zscore[1,5])
How do I do that now for each row?
You could do something like this:
summed <- rowSums(zscore[, c(1, 2, 3, 5)])
The row-wise summation can also be done with dplyr's row-wise operations (here col1, col2, col3 stand in for the selected columns):
library(tidyverse)
df <- df %>%
  rowwise() %>%
  mutate(rowsum = sum(c(col1, col2, col3)))
If there are no NAs (and zscore is a data frame, so $ indexing works), you can simply add the columns from the question:
suma.zscore <- zscore$a + zscore$b + zscore$c + zscore$y
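If NAs might be present, a sketch using rowSums() with na.rm = TRUE (selecting the question's columns by name, assuming zscore has those column names) is safer:
# sums columns a, b, c, y for every row, ignoring missing values
rowSums(zscore[, c("a", "b", "c", "y")], na.rm = TRUE)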
