R Fill empty cells of separate column with matching ID [duplicate]

This question already has answers here:
Replace missing values (NA) with most recent non-NA by group
(7 answers)
Replacing NAs with latest non-NA value
(21 answers)
Closed 3 years ago.
I have just added new data to my original data frame, so simplified, it looks like this:
df <- data.frame(ID = rep(letters[1:4], times = c(4, 4, 5, 3)),
                 Volume = c(1.23, rep(NA, 3), 4.74, rep(NA, 3),
                            5.35, rep(NA, 4), 1.53, rep(NA, 2)))
df
   ID Volume
1   a   1.23
2   a     NA
3   a     NA
4   a     NA
5   b   4.74
6   b     NA
7   b     NA
8   b     NA
9   c   5.35
10  c     NA
11  c     NA
12  c     NA
13  c     NA
14  d   1.53
15  d     NA
16  d     NA
where I have an ID column with differing numbers of entries for each ID and a volume for one entry, but not the others.
Is there a way to populate the empty Volume cells with the filled cell of the corresponding ID?
I'm essentially trying to remove the step of going into Excel and using "drag to fill" for each ID (I have over 2000 IDs). Not every ID has the same number of entries (e.g. ID "a" has 4, whereas ID "c" has 5 and ID "d" has 3).
I'm thinking dplyr will have a tool to do this, but I have not been able to find the answer.

In the tidyverse
library(tidyverse)
df %>%
  group_by(ID) %>%
  fill(Volume)

Here is a base R solution, where ave() is used to fill up the NAs
df <- within(df, Volume <- ave(Volume, ID, FUN = function(x) unique(x[!is.na(x)])))
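One caveat with the ave() call above: unique(x[!is.na(x)]) must return exactly one value per group, otherwise the assignment fails. A slightly more defensive sketch of the same idea takes the first non-NA value per group (sample data is hypothetical):

```r
# Fill each ID's missing Volume with the first non-NA value in that group;
# x[!is.na(x)][1] yields NA (rather than an error) if a group has no value.
df <- data.frame(
  ID     = rep(c("a", "b"), each = 3),
  Volume = c(1.23, NA, NA, NA, 4.74, NA)
)
df$Volume <- ave(df$Volume, df$ID, FUN = function(x) x[!is.na(x)][1])
df$Volume
# [1] 1.23 1.23 1.23 4.74 4.74 4.74
```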

Related

How to filter rows where there are changes in a categorical variable

We have a dataframe called data with 2 columns: Time which is arranged in ascending order, and Place which describes where the individual was:
data <- data.frame(Time = seq(1, 20, 1),
                   Place = rep(letters[c(1:3, 1)], c(5, 5, 3, 7)))
Since this data is in ascending order with respect to Time, we want to subset the rows where Place changes from the previous observation.
The resulting dataframe for this data would look like this:
Time Place
1 a
6 b
11 c
14 a
Notice that the same Place can show up later, like Place == a did in this example. How can we perform this kind of subset in R?
Apply duplicated() to the rleid() of 'Place':
library(dplyr)
library(data.table)
data %>%
  filter(!duplicated(rleid(Place)))
Or in base R with rle
subset(data, !duplicated(with(rle(Place), rep(seq_along(values), lengths))))
Output:
Time Place
1 1 a
6 6 b
11 11 c
14 14 a
Another base R option using subset + tail + head
subset(
  data,
  c(TRUE, tail(Place, -1) != head(Place, -1))
)
which gives
Time Place
1 1 a
6 6 b
11 11 c
14 14 a
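The same run-boundary subset can also be written in plain dplyr with lag(), avoiding data.table entirely. A sketch, assuming the frame is already sorted by Time; the row_number() guard keeps the first row, where lag() is NA:

```r
library(dplyr)

data <- data.frame(Time = seq(1, 20, 1),
                   Place = rep(letters[c(1:3, 1)], c(5, 5, 3, 7)))

# Keep the first row plus every row whose Place differs from the previous one.
changes <- data %>%
  filter(row_number() == 1 | Place != lag(Place))
changes$Time
# [1]  1  6 11 14
```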

Transform table [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 4 years ago.
I would like to repeat entire rows in a data-frame based on the samples column.
My input:
df <- 'chr start end samples
1 10 20 2
2 4 10 3'
df <- read.table(text=df, header=TRUE)
My expected output:
df <- 'chr start end samples
1 10 20 1-10-20-s1
1 10 20 1-10-20-s2
2 4 10 2-4-10-s1
2 4 10 2-4-10-s2
2 4 10 2-4-10-s3'
Some idea how to perform it wisely?
We can use expandRows() to expand the rows based on the value in the 'samples' column, then convert to data.table; grouped by 'chr', we paste the columns together with a within-group sequence using sprintf() to update the 'samples' column.
library(splitstackshape)
setDT(expandRows(df, "samples"))[,
samples := sprintf("%d-%d-%d-%s%d", chr, start, end, "s",1:.N) , chr][]
# chr start end samples
#1: 1 10 20 1-10-20-s1
#2: 1 10 20 1-10-20-s2
#3: 2 4 10 2-4-10-s1
#4: 2 4 10 2-4-10-s2
#5: 2 4 10 2-4-10-s3
NOTE: data.table will be loaded when we load splitstackshape.
You can achieve this using base R (i.e. avoiding data.tables), with the following code:
df <- 'chr start end samples
1 10 20 2
2 4 10 3'
df <- read.table(text = df, header = TRUE)
duplicate_rows <- function(chr, starts, ends, samples) {
expanded_samples <- paste0(chr, "-", starts, "-", ends, "-", "s", 1:samples)
repeated_rows <- data.frame("chr" = chr, "starts" = starts, "ends" = ends, "samples" = expanded_samples)
repeated_rows
}
expanded_rows <- Map(f = duplicate_rows, df$chr, df$start, df$end, df$samples)
new_df <- do.call(rbind, expanded_rows)
The basic idea is to define a function that will take a single row from your initial data.frame and duplicate rows based on the value in the samples column (as well as creating the distinct character strings you're after). This function is then applied to each row of your initial data.frame. The output is a list of data.frames that then need to be re-combined into a single data.frame using the do.call pattern.
The above code can be made cleaner by using Hadley Wickham's purrr package (on CRAN) and the data.frame-specific version of map (see the documentation for the by_row function), but this may be overkill for what you're after.
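The Map/do.call pattern above can also be collapsed into plain indexing: rep() on the row numbers duplicates each row, and ave() numbers the copies within each original row. A minimal sketch of that idea on the question's data:

```r
df <- read.table(text = 'chr start end samples
1 10 20 2
2 4 10 3', header = TRUE)

# Duplicate row i df$samples[i] times, then build one label per copy.
idx <- rep(seq_len(nrow(df)), df$samples)
out <- df[idx, ]
copy_no <- ave(out$samples, idx, FUN = seq_along)
out$samples <- paste0(out$chr, "-", out$start, "-", out$end, "-s", copy_no)
out$samples
# [1] "1-10-20-s1" "1-10-20-s2" "2-4-10-s1"  "2-4-10-s2"  "2-4-10-s3"
```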
Example using the DataFrame function from the S4Vectors package (Bioconductor):
df <- DataFrame(x=c('a', 'b', 'c', 'd', 'e'), y=1:5)
rep(df, df$y)
where the y column represents the number of times to repeat its corresponding row.
Result:
DataFrame with 15 rows and 2 columns
x y
<character> <integer>
1 a 1
2 b 2
3 b 2
4 c 3
5 c 3
... ... ...
11 e 5
12 e 5
13 e 5
14 e 5
15 e 5

Expand each row with specific value in tidyr [duplicate]

This question already has answers here:
Repeat rows of a data.frame [duplicate]
(10 answers)
Closed 3 years ago.
I have a dataset with grouped observations per row. However, I would like to expand each row observation from a single observation per replicate to a set number (in this case "20" observations each).
In the attached picture,
Each replicate is a row. I would like to expand each row into 20, so "wellA" for "LS x SB" expands to 20 identical lines. As a bonus, I would also like to make a new column called "Replicate2" that numbers these 20 new rows per replicate from 1 to 20.
The idea would to then add the survival status per individual (reflected in the new columns "Status" and "Event").
I think the "expand" function in tidyr has potential but can't figure out how to just add a fixed number per replicate. Using the "Alive" column is adding a variable number of observations.
expand <- DF %>% expand(nesting(Date, Time, Cumulative.hrs, Timepoint, Treatment, Boat, Parentage, Well, Mom, Dad, Cone, NumParents, Parents), Alive)
Any help appreciated!
In base R, we can use rep to repeat rows and transform to add new columns
n <- 20
transform(df[rep(seq_len(nrow(df)), each = n), ], Replicate2 = 1:n, row.names = NULL)
Using a reproducible example with n = 3
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
n <- 3
transform(df[rep(seq_len(nrow(df)), each = n), ], Replicate2 = 1:n, row.names = NULL)
# a b c Replicate2
#1 1 4 7 1
#2 1 4 7 2
#3 1 4 7 3
#4 2 5 8 1
#5 2 5 8 2
#6 2 5 8 3
#7 3 6 9 1
#8 3 6 9 2
#9 3 6 9 3
Using dplyr we can use slice to repeat rows and mutate to add new column.
library(dplyr)
df %>%
  slice(rep(seq_len(n()), each = n)) %>%
  mutate(Replicate2 = rep(seq_len(n), length.out = n()))
Do a cross join between your existing data and the numbers 1:20.
tidyr::crossing(DF, replicate2 = 1:20)
If you want to add additional columns, use mutate:
... %>% mutate(status = 1, event = FALSE)
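tidyr also ships uncount(), which is arguably the most direct tool when every row should be repeated a fixed number of times; its .id argument adds the replicate counter in the same step. A sketch on a small hypothetical frame (n = 3 instead of 20):

```r
library(tidyr)

df <- data.frame(a = 1:3, b = 4:6)
n  <- 3

# Repeat every row n times and record the copy number in Replicate2.
out <- uncount(df, weights = n, .id = "Replicate2")
head(out, 4)
```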

Using a loop to create multiple dataframes in R based on columns criteria [duplicate]

This question already has answers here:
Split dataframe using two columns of data and apply common transformation on list of resulting dataframes
(3 answers)
Closed 4 years ago.
Suppose I have a dataframe with 3 columns. I would like to create separate sub-dataframes for each of the unique combinations of a few columns.
For example, suppose we have just 3 columns,
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
df <- data.frame(a,b,c)
I would like to get a separate dataframe for each of the unique combinations of columns 'a' and 'b'.
I started by using unique to get a list of the unique combinations, as follows:
factors <- unique(df[,c('a','b')])
a b
1 1 a
2 5 a
3 2 f
4 3 d
5 4 f
6 5 c
7 3 a
8 2 r
10 3 c
But I am not sure what to do next.
The code below is for illustration purposes. Ideally this would be done in a loop that uses each row of factors to create the dataframes.
df_1_a <- df %>% filter(a==1, b=='a')
a b c
1 1 a 0.2
2 1 a 0.9
df_3_a <- df %>% filter(a==3, b=='a')
a b c
1 3 a 0.112
.
.
.
This is kinda dirty and I'm not sure it answers your question, but try this:
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
d <- paste0(a,b)
df <- data.frame(a,b,c,d)
df_splited <- split(df,df$d)
You obtain a list of dataframes, one for each unique combination of a and b.
You can use split after you get the unique combinations you are after.
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
df <- data.frame(a,b,c,stringsAsFactors = FALSE)
fx <- unique(df[,c('a','b')])
fx_list <- split(fx,rownames(fx))
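In current dplyr the same split-by-combination is a one-liner with group_split(); a sketch on the question's data (note the list elements come back unnamed):

```r
library(dplyr)

a <- c(1, 5, 2, 3, 4, 5, 3, 2, 1, 3)
b <- c("a", "a", "f", "d", "f", "c", "a", "r", "a", "c")
c <- c(.2, .6, .4, .545, .98, .312, .112, .4, .9, .5)
df <- data.frame(a, b, c)

# One tibble per unique (a, b) combination.
pieces <- df %>% group_by(a, b) %>% group_split()
length(pieces)
# [1] 9
```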

remove ALL copies of duplicated rows from data frame (not just the duplicated copies) [duplicate]

This question already has answers here:
How can I remove all duplicates so that NONE are left in a data frame?
(3 answers)
Closed 7 years ago.
I want to keep only non-duplicated rows in a dataset. This is going one step beyond "removing duplicates"; that is, I want to eliminate ALL copies of duplicated rows, not just the duplicate copies, and keep only the rows that were never duplicated in the first place.
Dataset:
df <- data.frame(A = c(5,5,6,7,8,8,8), B = sample(1:100, 7))
df
A B
5 91
5 46
6 41
7 98
8 35
8 56
8 36
Want to turn it into:
A B
6 41
7 98
Here is what I tried using dplyr:
df_single <- df %>% count(A) %>% filter(n == 1)
# Returns all the values of A for which only one row exists
df %>% filter(A == df_single$A)
# Trying to subset only those values of A, but this returns error
# "longer object length is not a multiple of shorter object length"
Thanks for your help. A nice bonus would be additional code for doing the opposite (keeping all the OTHER rows - i.e., eliminating only the non-duplicated rows from the dataset).
Try this (no packages needed):
subset(df, !duplicated(A) & !duplicated(A, fromLast = TRUE))
giving:
A B
3 6 41
4 7 98
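For completeness, a dplyr sketch that also covers the bonus "opposite" subset: group by A and keep groups of size one (never duplicated) or size greater than one (everything else). B is fixed here instead of sampled so the result is reproducible:

```r
library(dplyr)

df <- data.frame(A = c(5, 5, 6, 7, 8, 8, 8), B = c(91, 46, 41, 98, 35, 56, 36))

# Rows whose A value occurs exactly once ...
never_duplicated <- df %>% group_by(A) %>% filter(n() == 1) %>% ungroup()
# ... and the opposite: rows whose A value occurs more than once.
only_duplicated  <- df %>% group_by(A) %>% filter(n() > 1) %>% ungroup()

never_duplicated$A
# [1] 6 7
```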
