R, conditionally remove duplicate rows

I have a data frame in R with the columns ID.A, ID.B and DISTANCE, where DISTANCE is the distance between ID.A and ID.B. Each value (1 to n) of ID.A may appear in several rows, each with a different ID.B and DISTANCE (e.g. several rows with ID.A = 4).
I would like to remove the rows where ID.A is duplicated, conditional on the distance, so that I am left with only the row with the smallest DISTANCE for each ID.A.
Hopefully that makes sense?
Many thanks in advance
EDIT
Hopefully an example will prove more useful than my text. Here I would like to remove the second and third rows where ID.A = 3:
myDF <- read.table(text="ID.A ID.B DISTANCE
1 3 1
2 6 8
3 2 0.4
3 3 1
3 8 5
4 8 7
5 2 11", header = TRUE)

You can also do it easily in base R. If dat is your dataframe,
do.call(rbind,
by(dat, INDICES=list(dat$ID.A),
FUN=function(x) head(x[order(x$DISTANCE), ], 1)))

One possibility:
myDF <- myDF[order(myDF$ID.A, myDF$DISTANCE), ]
newdata <- myDF[which(!duplicated(myDF$ID.A)),]
Which gives (row names are those of the example myDF above):
ID.A ID.B DISTANCE
1 1 3 1.0
2 2 6 8.0
3 3 2 0.4
6 4 8 7.0
7 5 2 11.0

You can use the plyr package for that. For example, if your data look like this:
d <- data.frame(id.a=c(1,1,1,2,2,3,3,3,3),
id.b=c(1,2,3,1,2,1,2,3,4),
dist=c(12,10,15,20,18,16,17,25,9))
id.a id.b dist
1 1 1 12
2 1 2 10
3 1 3 15
4 2 1 20
5 2 2 18
6 3 1 16
7 3 2 17
8 3 3 25
9 3 4 9
You can use the ddply function like this:
library(plyr)
ddply(d, "id.a", function(df) return(df[df$dist==min(df$dist),]))
Which gives:
id.a id.b dist
1 1 2 10
2 2 2 18
3 3 4 9
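One thing to watch with the df$dist == min(df$dist) comparison: it keeps every row tied for the minimum, whereas the order/duplicated approach keeps exactly one row per ID. A minimal base R sketch of the same "keep all minima" logic, applied to the question's example data, could use ave(). This is only an illustrative alternative, not something taken from the answers above:

```r
myDF <- read.table(text = "ID.A ID.B DISTANCE
1 3 1
2 6 8
3 2 0.4
3 3 1
3 8 5
4 8 7
5 2 11", header = TRUE)

# ave() returns, for every row, the minimum DISTANCE within that row's ID.A group
group_min <- ave(myDF$DISTANCE, myDF$ID.A, FUN = min)

# keep the rows whose DISTANCE equals their group's minimum (ties are all kept)
result <- myDF[myDF$DISTANCE == group_min, ]
result
```

If exactly one row per ID.A is required even when distances tie, the order/duplicated version is the safer choice.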

Related

Randomly select number (without repetition) for each group in R

I have the following dataframe containing a variable "group" and a variable "number of elements per group"
group elements
1 3
2 1
3 14
4 10
.. ..
.. ..
30 5
Then I have a set of numbers going from 1 to (let's say) 30.
Summing "elements" gives 900. What I want is to randomly select numbers from 1 to 30 and assign them to each group until I have filled the number of elements for that group. Each number should appear 30 times in total.
Thus, for group 1, I want to randomly select 3 numbers from 1 to 30;
for group 2, 1 number from 1 to 30, etc., until I have filled all of the groups.
the final table should look like this:
group number(randomly selected)
1 7
1 20
1 7
2 4
3 21
3 20
...
any suggestions on how I can achieve this?
In base R, if you have df like this...
df
group elements
1 3
2 1
3 14
Then you can do this...
data.frame(group = rep(df$group, #repeat group no...
df$elements), #elements times
number = unlist(sapply(df$elements, #for each elements...
sample.int, #...sample <elements> numbers
n=30, #from 1 to 30
replace = FALSE))) #without duplicates
group number
1 1 19
2 1 15
3 1 28
4 2 15
5 3 20
6 3 18
7 3 27
8 3 10
9 3 23
10 3 12
11 3 25
12 3 11
13 3 14
14 3 13
15 3 16
16 3 26
17 3 22
18 3 7
Give this a try:
df <- read.table(text = "group elements
1 3
2 1
3 14
4 10
30 5", header = TRUE)
# reproducibility
set.seed(1)
df_split2 <- do.call("rbind",
(lapply(split(df, df$group),
function(m) cbind(m,
`number(randomly selected)` =
sample(1:30, replace = TRUE,
size = m$elements),
row.names = NULL
))))
# remove the elements column
df_split2$elements <- NULL
head(df_split2)
#> group number(randomly selected)
#> 1.1 1 25
#> 1.2 1 4
#> 1.3 1 7
#> 2 2 1
#> 3.1 3 2
#> 3.2 3 29
The split function splits df into chunks based on the group column. We then take those smaller data frames and add a column to each by sampling from 1:30 a total of elements times. Finally, do.call with rbind binds the list back together into one data frame.
You have to generate a new data frame that repeats each group elements times, and then use sample to generate exactly that many random numbers:
data<-data.frame(group=c(1,2,3,4,5),
elements=c(2,5,2,1,3))
data.elements<-data.frame(group=rep(data$group,data$elements),
number=sample(1:30,sum(data$elements)))
The result:
group number
1 1 9
2 1 4
3 2 29
4 2 28
5 2 18
6 2 7
7 2 25
8 3 17
9 3 22
10 4 5
11 5 3
12 5 8
13 5 26
I solved it as follows:
random_sample <- rep(1:30, each=30)
random_sample <- sample(random_sample)
Then I create a data frame with this variable and a second variable that lists each group once per element, i.e. repeated as many times as the group has elements.
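A minimal sketch of that last step, assuming the element counts sum to exactly 900. The data frame below is made up for illustration, as are the seed and the names pool and result:

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# hypothetical input: 30 groups whose element counts sum to 900
df <- data.frame(group = 1:30, elements = c(rep(20, 15), rep(40, 15)))
stopifnot(sum(df$elements) == 900)

# every number from 1 to 30 appears exactly 30 times, in shuffled order
pool <- sample(rep(1:30, each = 30))

# one row per element: each group repeated elements times, paired with the pool
result <- data.frame(group = rep(df$group, df$elements), number = pool)
head(result)
```

Note that this guarantees each number is used 30 times overall, but it does not prevent the same number from appearing twice within one group.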

How to return the positions of first occurrence for (different) duplicated rows in a data.frame?

Suppose you have a data frame like the following:
dfiris <- rbind(iris[1:5, -5], iris[1:5, -5], iris[1:5, -5], iris[1:5, -5], iris[1:5, -5])
Since the first 5 rows are then repeated another 4 times, I would like to efficiently get, for each row, the position of its first occurrence:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
The function duplicated() does not help me because it only returns TRUE from the second occurrence onwards of a duplicated row.
My (inefficient) solution:
apply(dfiris, 1, function(df) {
which(apply(unique(dfiris), 1, function(df_u) identical(df, df_u)))
})
There must be a quicker way to do that. Any suggestions?
Using data.table:
library(data.table)
setDT(dfiris, keep.rownames=TRUE)
print(setkey(dfiris[, list(rn=as.numeric(rn), firstOcc=.I[1]),
by=c(names(dfiris)[-1])], rn))
You may also try:
library(dplyr)
left_join(dfiris, mutate(distinct(dfiris), rn = row_number())) %>%
  select(rn)
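For comparison, here is a base R sketch that needs no packages. It is an illustrative alternative rather than one of the answers above: it pastes each row into a single string key and relies on match() returning the first matching position. The separator is an arbitrary choice and is assumed not to occur in the data; numeric columns are compared through their character representation.

```r
dfiris <- rbind(iris[1:5, -5], iris[1:5, -5], iris[1:5, -5], iris[1:5, -5], iris[1:5, -5])

# collapse each row into one string key
key <- do.call(paste, c(dfiris, sep = "\r"))

# match(key, key) gives, for each row, the position of the first row with the same key
first_occ <- match(key, key)
first_occ
```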

Remove rows from a dataframe based on a value in one column

I have a dataframe (imported from a csv file) as follows
moose loose hoose
2 3 8
1 3 4
5 4 2
10 1 4
The R code should generate a mean column and then I would like to remove all rows where the value of the mean is <4 so that I end up with:
moose loose hoose mean
2 3 8 4.3
1 3 4 2.7
5 4 2 3.7
10 1 4 5
which should then end up as:
moose loose hoose mean
2 3 8 4.3
10 1 4 5
How can I do this in R?
dat2 <- subset(transform(dat1, Mean=round(rowMeans(dat1),1)), Mean >=4)
dat2
# moose loose hoose Mean
#1 2 3 8 4.3
#4 10 1 4 5.0
Using data.table:
library(data.table)
setDT(dat1)[, Mean:=rowMeans(.SD)][Mean>=4]
# moose loose hoose Mean
#1: 2 3 8 4.333333
#2: 10 1 4 5.000000
I will assume your data is called d. Then you run:
d$mean <- rowMeans(d) ## create a new column with the mean of each row
d <- d[d$mean >= 4, ] ## filter the data using this column in the condition
I suggest you read about creating variables in a data.frame, and filtering data. These are very common operations that you can use in many many contexts.
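As a self-contained sketch of those two steps, with the question's data typed inline: the only change from the answer is that the mean is computed over explicitly named columns (the vector value_cols is just an illustrative name), so re-running the line later does not average the mean column into itself.

```r
d <- data.frame(moose = c(2, 1, 5, 10),
                loose = c(3, 3, 4, 1),
                hoose = c(8, 4, 2, 4))

# columns that go into the mean, named explicitly
value_cols <- c("moose", "loose", "hoose")

d$mean <- rowMeans(d[, value_cols])  # row-wise mean of the three columns
d <- d[d$mean >= 4, ]                # keep only rows with mean of at least 4
d
```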
You could also use within, which allows you to assign/remove columns and then returns the transformed data. Start with df,
> df
# moose loose hoose
#1 2 3 8
#2 1 3 4
#3 5 4 2
#4 10 1 4
> within(d <- df[rowMeans(df) >= 4, ], { means <- round(rowMeans(d), 1) })
# moose loose hoose means
#1 2 3 8 4.3
#4 10 1 4 5.0

R: How to use intervals as input data for histograms?

I would like to import the data into R as intervals, then count all the numbers falling within these intervals and draw a histogram from these counts.
Example:
start end freq
1 8 3
5 10 2
7 11 5
.
.
.
Result:
number freq
1 3
2 3
3 3
4 3
5 5
6 5
7 10
8 10
9 7
10 7
11 5
Any suggestions?
Thank you very much!
Assuming your data is in df, you can create a data set that has each number in the range repeated by freq. Once you have that it's trivial to use the summarizing functions in R. This is a little roundabout, but a lot easier than explicitly computing the sum of the overlaps (though that isn't that hard either).
dat <- unlist(apply(df, 1, function(x) rep(x[[1]]:x[[2]], x[[3]])))
hist(dat, breaks=0:max(df$end))
You can also do table(dat)
dat
1 2 3 4 5 6 7 8 9 10 11
3 3 3 3 5 5 10 10 7 7 5
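The "explicitly computing the sum of the overlaps" route mentioned above can be sketched as follows. It assumes integer interval bounds and uses the three example rows from the question; the names numbers and freq are only illustrative.

```r
df <- data.frame(start = c(1, 5, 7), end = c(8, 10, 11), freq = c(3, 2, 5))

# every integer covered by at least the overall range of the intervals
numbers <- min(df$start):max(df$end)

# for each number, add up the freq of all intervals that contain it
freq <- sapply(numbers, function(n) sum(df$freq[df$start <= n & n <= df$end]))

data.frame(number = numbers, freq = freq)
```

This avoids materialising the repeated values, which may matter when the freq counts are large; barplot(freq, names.arg = numbers) would then draw the counts directly.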

Compute the difference between rows in R and set the first difference to zero

Hi everybody, I am trying to solve a little problem in R. I want to compute the difference between consecutive rows in a data frame. My data frame looks like this:
df <- data.frame(ID=1:8, x2=8:1, x3=11:18, x4=c(2,4,10,0,1,1,9,12))
I want to create a new column named diff.var that holds the differences between consecutive rows of a variable. One possible solution is the diff() function. When I used this function I got this:
diff(df$x4)
[1] 2 6 -10 1 0 8 3
That works fine, but when I try to apply it to my data frame using df$diff.var=diff(df$x4) I get this:
Error in `$<-.data.frame`(`*tmp*`, "diff.var", value = c(2, 6, -10, 1, :
replacement has 7 rows, data has 8
Because the first row has no previous row to compute a difference from, I want to set its value to zero. I would like to get something like this:
ID x2 x3 x4 diff.var
1 8 11 2 0
2 7 12 4 2
3 6 13 10 6
4 5 14 0 -10
5 4 15 1 1
6 3 16 1 0
7 2 17 9 8
8 1 18 12 3
Here the first element of diff.var is zero because that element has no previous element. I would like to build a function that sets the first element of diff.var to zero and computes the differences for the remaining rows. I want a new data frame with all variables plus diff.var, because ID is used together with diff.var in later analysis. On its own, diff() does not let me create this new variable. Thanks for your help.
This question has been asked before in this forum and can be found elsewhere. Anyway, do what Frank suggests:
df <- data.frame(ID=1:8, x2=8:1, x3=11:18, x4=c(2,4,10,0,1,1,9,12))
df$vardiff <- c(0, diff(df$x4))
df
ID x2 x3 x4 vardiff
1 1 8 11 2 0
2 2 7 12 4 2
3 3 6 13 10 6
4 4 5 14 0 -10
5 5 4 15 1 1
6 6 3 16 1 0
7 7 2 17 9 8
8 8 1 18 12 3
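Since the question asks for a function, the same idea can be wrapped in a small helper. The name diff_from_previous is made up for this sketch; it simply pads the output of diff() with a leading zero so that the result has as many elements as the input.

```r
# hypothetical helper: difference to the previous element, with 0 for the first one
diff_from_previous <- function(x) c(0, diff(x))

df <- data.frame(ID = 1:8, x2 = 8:1, x3 = 11:18, x4 = c(2, 4, 10, 0, 1, 1, 9, 12))
df$diff.var <- diff_from_previous(df$x4)
df
```

The helper can be applied to any numeric column in the same way, for example diff_from_previous(df$x3).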

Resources