Repeat (duplicate) just one row twice in R

I'm trying to duplicate just the second row in a data frame, so that row will appear twice. A dplyr or tidyverse approach would be great. I've tried using slice(), but I can only get it to either duplicate the row I want and drop all the other rows, or duplicate every row, not just the second one.
So I want something like df1:
df <- data.frame(t = c(1,2,3,4,5),
                 r = c(2,3,4,5,6))
df1 <- data.frame(t = c(1,2,2,3,4,5),
                  r = c(2,3,3,4,5,6))
Thanks!

Here's also a tidyverse approach with uncount:
library(tidyverse)
df %>%
  mutate(nreps = if_else(row_number() == 2, 2, 1)) %>%
  uncount(nreps)
Basically the idea is to set the number of times you want each row to occur: row number 2 (hence row_number() == 2) occurs twice and all other rows occur once. You could also construct a more complex column where each row gets a different number of repetitions (see the sketch after the output below), and then uncount() that variable (called nreps in the code).
Output:
t r
1 1 2
2 2 3
2.1 2 3
3 3 4
4 4 5
5 5 6
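As a sketch of the more general case mentioned above (the way nreps is derived here is purely illustrative), each row can carry its own repetition count before uncount():
library(tidyverse)
df %>%
  mutate(nreps = t) %>%  # e.g. repeat each row as many times as its value of t
  uncount(nreps)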

One way with slice would be:
library(dplyr)
df %>% slice(sort(c(row_number(), 2)))
# t r
#1 1 2
#2 2 3
#3 2 3
#4 3 4
#5 4 5
#6 5 6
Also:
df %>% slice(sort(c(seq_len(n()), 2)))
In base R, this can be written as:
df[sort(c(seq(nrow(df)), 2)), ]
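For completeness, another base R spelling of the same idea (a sketch that builds the index vector with rep(); the reps helper is my own naming):
reps <- ifelse(seq_len(nrow(df)) == 2, 2, 1)  # row 2 twice, every other row once
df[rep(seq_len(nrow(df)), times = reps), ]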

Automate filtering to subset data based on multiple columns

Here is a data set I am trying to subset:
df <- data.frame(
  id = c(1:5),
  ax1 = c(5,3,7,-1,9),
  bx1 = c(0,1,-1,0,3),
  cx1 = c(2,1,5,-1,5),
  dx1 = c(3,7,2,1,8))
The data set has a variable x1 that is measured at different time points, denoted by ax1, bx1, cx1 and dx1. I am trying to subset these data by deleting the rows with -1 in any of these columns (i.e. ax1, bx1, cx1, dx1). I would like to know if there is a way to automate the filtering (or the filter function) to perform this task. I am familiar with situations where the focus is to filter rows based on a single column (or variable).
For the current case, I made an attempt by starting with
mutate_at( vars(ends_with("x1"))
to select the required columns, but I am not sure how to combine this with the filter function to produce the desired result. The expected output would have the 3rd and 4th rows deleted. I appreciate any help on this. There is a similar case resolved here, but it was not done through automation. I want to adapt the automation to the case of large data with many columns.
You can use filter() with across().
library(dplyr)
df %>%
  filter(across(ends_with("x1"), ~ .x != -1))
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8
It's equivalent to filter_at() with all_vars(), which has been superseded in dplyr 1.0.0.
df %>%
  filter_at(vars(ends_with("x1")), all_vars(. != -1))
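As a side note, in more recent dplyr releases the use of across() inside filter() is itself deprecated in favour of if_all(); a sketch of the equivalent call, assuming a current dplyr:
df %>%
  filter(if_all(ends_with("x1"), ~ .x != -1))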
Using base R:
With rowSums:
cols <- grep('x1$', names(df))
df[rowSums(df[cols] == -1) == 0, ]
# id ax1 bx1 cx1 dx1
#1 1 5 0 2 3
#2 2 3 1 1 7
#5 5 9 3 5 8
Or with apply:
df[!apply(df[cols] == -1, 1, any), ]
Using filter_at:
library(tidyverse)
df <- data.frame(
  id = c(1:5),
  ax1 = c(5,3,7,-1,9),
  bx1 = c(0,1,-1,0,3),
  cx1 = c(2,1,5,-1,5),
  dx1 = c(3,7,2,1,8))
df
df %>%
  filter_at(vars(ax1:dx1), ~ . != as.numeric(-1))
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8

Remove duplicate rows with certain value in specific column

I have a data frame and I want to remove rows that are duplicated in all columns except one column and choose to keep the ones that are not certain values.
In the example below, the 3rd and 4th rows are duplicated in all columns except for col3 (column c in the reproducible data), so I want to keep only one of them. The complicated part is that I want to keep the 4th row instead of the 3rd, because the 3rd row's col3 value is "excluded". In general, among the duplicated rows I only want to keep the ones that do not have "excluded".
My real data frame has lots of duplicated rows, and of any two rows that are duplicates of each other, one of them is "excluded" for sure.
Below is a reproducible example:
a <- c(1,2,3,3,7)
b <- c(4,5,6,6,8)
c <- c("red","green","excluded","orange","excluded")
d <- data.frame(a,b,c)
Thank you so much!
Update: Or, when removing duplicates, only keep the second observation (the 4th row).
dplyr with some base R should work for this:
library(dplyr)
a <- c(1,2,3,3,3,7)
b <- c(4,5,6,6,6,8)
c <- c("red","green","brown","excluded","orange","excluded")
d <- data.frame(a,b,c)
d <- filter(d, !duplicated(d[,1:2]) | c!="excluded")
Result:
a b c
1 1 4 red
2 2 5 green
3 3 6 brown
4 3 6 orange
5 7 8 excluded
The filter drops rows that are both duplicated (in the first two columns) and "excluded", and keeps everything else. I added a non-excluded duplicate ('brown') to your example to test this as well.
Here is an example with a loop:
a <- c(1,2,3,3,7)
b <- c(4,5,6,6,8)
c <- c("red","green","excluded","orange","excluded")
d <- data.frame(a,b,c)
# Row indices of duplicated rows (only the second and later occurrences are flagged)
duplicated_rows <- which(duplicated(d[c("a","b")]))
to_remove <- c()
# Loop over the duplicated rows
for(i in duplicated_rows){
  # Find rows with the same a and b values
  selection <- which(d$a == d$a[i] & d$b == d$b[i])
  # Store the indices of the rows in this duplicate set which are "excluded"
  to_remove <- c(to_remove, selection[which(d$c[selection] == "excluded")])
}
# Remove those rows
d <- d[-to_remove, ]
print(d)
  a b        c
1 1 4      red
2 2 5    green
4 3 6   orange
5 7 8 excluded
Here is a possibility ... I hope it can help :)
nquit <- (d %>%
  mutate(code = 1:nrow(d)) %>%
  group_by(a, b) %>%
  mutate(nDuplicate = n()) %>%
  filter(nDuplicate > 1) %>%
  filter(c == "excluded"))$code
e <- d[-nquit, ]
Shortening the approach by @Klone a bit, another dplyr solution:
d %>%
  mutate(c = factor(c, ordered = TRUE,
                    levels = c("red", "green", "orange", "excluded"))) %>% # Order the factor variable
  arrange(c) %>%         # Sort the data frame so that "excluded" comes last
  group_by(a, b) %>%     # Group by the two columns that determine duplicates
  mutate(id = 1:n()) %>% # Assign IDs within each group
  filter(id == 1)        # Only keep the first row of each group
Result:
# A tibble: 4 x 4
# Groups: a, b [4]
a b c id
<dbl> <dbl> <ord> <int>
1 1 4 red 1
2 2 5 green 1
3 3 6 orange 1
4 7 8 excluded 1
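A slightly more compact sketch of the same idea (keeping the first row per group with slice(), so the helper id column is not needed):
d %>%
  mutate(c = factor(c, ordered = TRUE,
                    levels = c("red", "green", "orange", "excluded"))) %>%
  arrange(c) %>%   # "excluded" sorts last
  group_by(a, b) %>%
  slice(1) %>%     # keep the first row of each group
  ungroup()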
Regarding your edit at the end of the question:
Update: Or, when removing duplicate, only keep the second observation (4th row).
note that if the ordering of the rows means the record to keep is always the last one within each duplicate group, you can simply set fromLast = TRUE in duplicated() so that rows are flagged as duplicates counting from the last occurrence in each group.
Using a slightly modified version of your data (where I added more duplicate groups to better show that the process works in a more general case):
a <- c(1,1,2,3,3,3,7)
b <- c(4,4,5,6,6,6,8)
c <- c("excluded", "red","green","excluded", "excluded","orange","excluded")
d <- data.frame(a,b,c)
a b c
1 1 4 excluded
2 1 4 red
3 2 5 green
4 3 6 excluded
5 3 6 excluded
6 3 6 orange
7 7 8 excluded
using:
ind2remove = duplicated(d[,c("a", "b")], fromLast=TRUE)
(d_noduplicates = d[!ind2remove,])
we get:
a b c
2 1 4 red
3 2 5 green
6 3 6 orange
7 7 8 excluded
Note that this doesn't require the rows in each duplicate group to be all together in the original data. The only important thing is that you want to keep the record showing up last in the data from each duplicate group.

Restructuring DataFrame Based on Single Column Values

I am trying to move data from one column to another based on multiple existing values. I researched and found a simple solution for a single column, as seen in the current code below. However, I would like a way to do it for all rows at once. I have been trying to find a way to apply a loop to this, without success. Any help would be great. I am using the latest version of R and RStudio. Thanks!
CURRENT DATAFRAME:
Row #People
A 3
A 2
A 2
B 1
B 1
C 3
C 3
C 2
C 1
Desired DataFrame:
A  B  C
3  1  3
2  1  3
2     2
      1
Current Code:
files <- read.csv("SampleData3.csv", header = TRUE)
subset <- as.data.frame(files[files$RowID == "A", "DisRank"])
Try the following:
library("qpcR")
do.call(qpcR:::data.frame.na,split(df$X.People, df$Row))
A B C
1 3 1 3
2 2 1 3
3 2 NA 2
4 NA NA 1
Here's a tidyverse way of doing it using tidyr::spread. You'll also need to add row numbers, which I get rid of in the end by using dplyr's select(-id).
Start by creating the data:
df = read.table(text="Row People
A 3
A 2
A 2
B 1
B 1
C 3
C 3
C 2
C 1", header = TRUE)
Now do the work:
library(tidyverse)
df %>%
  group_by(Row) %>%
  mutate(id = row_number()) %>%
  spread(key = Row, value = People) %>%
  select(-id)
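spread() has since been superseded; a sketch of roughly the same call with pivot_wider(), assuming the same df as above:
df %>%
  group_by(Row) %>%
  mutate(id = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = Row, values_from = People) %>%
  select(-id)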
As far as I know, your desired DataFrame is not a valid data frame in R, because its columns would have different lengths (unless you pad the shorter ones with NA). So, strictly speaking, it is impossible. You should explain why you want something like that. Other data types, like lists, can store data in a structure like that, but I have no idea what you want to do with it afterwards.
How about reshape2::dcast(. ~ Row, data = dta, fun.aggregate = list)[, -1]? This gives you a data.frame with a list in each cell (here dta is the original long data frame).
Output:
A B C
1 3, 2, 2 1, 1 3, 3, 2, 1

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means the row with ID 1 will be chosen, one of the two rows with ID 2, the row with ID 3, one of the two rows with ID 4, and so on.
There can be more than two duplicates of an ID.
What is the most R-esque way to do this, without manually subsetting and then sampling each subset?
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
If your data is truly a matrix and not a data.frame, you can work around this too, with:
dat[tapply(as.character(seq(nrow(dat))),dat$ID,FUN=sample,1),]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
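Note that sample_n() has since been superseded in dplyr; a sketch of the current equivalent, assuming dplyr 1.0.0 or later:
df %>% group_by(ID) %>% slice_sample(n = 1)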
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

keep values of a data frame column R

In my data frame df I want to get the Id numbers satisfying the condition that the value for A is greater than the value for B. In the example I would only want Id = 2.
Id Name Value
1 A 3
1 B 5
1 C 4
2 A 7
2 B 6
2 C 8
vecA <- vector()
vecB <- vector()
vecId <- vector()
i <- 1
while(i <= length(dim(df)[1])){
  if(df$Name[[i]] == "A"){vecA <- c(vecA, df$Value)}
  if(df$Name[[i]] == "B"){vecB <- c(vecB, df$Value)}
  if(vecA[i] > vecB[i]){vecId <- c(vecId, )}
  i <- i + 1
}
First, you could convert your data from long to wide so you have one row for each ID:
library(reshape2)
(wide <- dcast(df, Id~Name, value.var="Value"))
# Id A B C
# 1 1 3 5 4
# 2 2 7 6 8
Now you can use normal indexing to get the ids with larger A than B:
wide$Id[wide$A > wide$B]
# [1] 2
The first answer works out well for sure. I wanted to get to regular subset operations as well, and to show some of the more recent R packages. (If you had 3 groups to compare, that would get more interesting.) In the code below, exp is the exact data.frame you started with.
library(plyr)
library(dplyr)
comp <- exp %>%
  filter(Name %in% c("A", "B")) %>%
  group_by(Id) %>%
  filter(min_rank(Value) > 1)
# If the whole row is needed
comp[which.max(comp$Value),]
# If not
comp[which.max(comp$Value),"Id"]
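For completeness, a sketch of a more direct tidyverse-only variant of the same idea (reshape to wide, then filter; df is the long-format data from the question, and the reshaping step assumes tidyr is available):
library(dplyr)
library(tidyr)
df %>%
  pivot_wider(names_from = Name, values_from = Value) %>%
  filter(A > B) %>%
  pull(Id)
# [1] 2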
