function similar to dplyr::distinct with a cutoff - r

I have a dataframe of points with x,y positions (in pixels) and would like to filter out all the points +/- 5 pixels. Is there a function similar to dplyr::distinct() but with a cutoff.
Example dataset:
X.1 X Y
1 637 614
2 559 503
3 601 459
4 601 459
5 603 462
6 604 460
I am expecting an output of :
X.1 X Y
1 637 614
2 559 503
3 601 459 <- the first element is preserved.
Thanks

A simple solution is to round your data to the nearest multiple of 5 and then use a regular distinct function:
X.1$x <- round(X.1$x/5)*5
X.1$y <- round(X.1$y/5)*5
distinct(X.1,.keep_all = TRUE)
#Output:
X.1 X Y
1 635 615
2 560 505
3 600 560
Your problem may require a higher level of accuracy however.

Related

recoding a numerical variable based on a specific criterion in r

I would like to recode a numerical variable based on a cut score criterion. If the cut scores are not available in the variable, I would like to recode the closest smaller value as a cut score. Here is a snapshot of dataset:
ids <- c(1,2,3,4,5,6,7,8,9,10)
scores <- c(512,531,541,555,562,565,570,572,573,588)
data <- data.frame(ids, scores)
> data
ids scores
1 1 512
2 2 531
3 3 541
4 4 555
5 5 562
6 6 565
7 7 570
8 8 572
9 9 573
10 10 588
cuts <- c(531, 560, 575)
The first cut score (531) is in the dataset. So it will stay the same as 531. However, 560 and 575 were not available. I would like to recode the closest smaller value (555) to the second cut score as 560 in the new column, and for the third cut score, I'd like to recode 573 as 575.
Here is what I would like to get.
ids scores rescored
1 1 512 512
2 2 531 531
3 3 541 541
4 4 555 560
5 5 562 562
6 6 565 565
7 7 570 570
8 8 572 572
9 9 573 575
10 10 588 588
Any thoughts?
Thanks
One option would be to find the index with findInterval and then get the pmax of the 'scores' corresponding to that index with the 'cuts' and updated the 'rescored' column elements on that index
i1 <- with(data, findInterval(cuts, scores))
data$rescored <- data$scores
data$rescored[i1] <- with(data, pmax(scores[i1], cuts))
data
# ids scores rescored
#1 1 512 512
#2 2 531 531
#3 3 541 541
#4 4 555 560
#5 5 562 562
#6 6 565 565
#7 7 570 570
#8 8 572 572
#9 9 573 575
#10 10 588 588

Custom sorting of dataframe by column with duplicates

I wanted to ask for help, because I am having difficulties ordering my table, because the column for the table to be ordered has duplicates (coltoorder). This is a tiny part of my table. The desired order is custom, roughly speaking, it is based on the order of the first column, except for the first value (887).
text<-"col1 col2 col3 coltoorder
895 2 1374 887
888 2 14 887
1018 3 1065 895
896 2 307 895
889 2 4 888
891 2 8 888
1055 2 971 1018
926 3 241 896
1021 2 87 1018
897 2 64 896"
mytable<-read.table(text=text, header = T)
mytable
desired order
myindex<-c(887,895,888,1018,896) # equivalent to
myindex2<-c(887,887,895,895,888,888,1018,1018,896,896)
some failed attemps
try1<-mytable[match(myindex, mytable$coltoorder),]
try2<-mytable[match(myindex2, mytable$coltoorder),]
try3<-mytable[mytable$coltoorder %in% myindex,]
try3<-mytable[myindex %in% mytable$coltoorder,]
try4<-mytable[myindex2 %in% mytable$coltoorder,]
rownames(mytable) <- mytable$coltoorder # error
It seems like coltoorder should be treated categorically, not numerically. All factors have an order of their levels, so we'll convert to a factor where the levels are ordered according to myindex. Then this ordering is "baked in" to the column and we can use order normally on it.
mytable$coltoorder = factor(mytable$coltoorder, levels = myindex)
mytable[order(mytable$coltoorder), ]
# col1 col2 col3 coltoorder
# 8 895 2 1374 887
# 1 888 2 14 887
# 131 1018 3 1065 895
# 9 896 2 307 895
# 2 889 2 4 888
# 4 891 2 8 888
# 168 1055 2 971 1018
# 134 1021 2 87 1018
# 39 926 3 241 896
# 10 897 2 64 896
Do be careful - this column is now a factor not a numeric. If you want to recover the numeric values from a factor, you need to convert via character: original_values = as.numeric(as.character(mytable$coltoorder)).
Your data sample suggests that your desired sort order is equivalent to the first appearance in column coltoorder.
If this is true, the function fct_inorder() from Hadley Wickham's forcats package may be particular helpful here:
mytable$coltoorder <- forcats::fct_inorder(as.character(mytable$coltoorder))
mytable[order(mytable$coltoorder), ]
col1 col2 col3 coltoorder
1 895 2 1374 887
2 888 2 14 887
3 1018 3 1065 895
4 896 2 307 895
5 889 2 4 888
6 891 2 8 888
7 1055 2 971 1018
9 1021 2 87 1018
8 926 3 241 896
10 897 2 64 896
fct_inorder() reorders factors levels by first appearance. So, there is no need to create a separate myindex vector.
However, the caveats from Gregor's answer apply as well.

Calculating median based on segments in r [duplicate]

This question already has answers here:
Calculate group mean, sum, or other summary stats. and assign column to original data
(4 answers)
Closed 5 years ago.
Hi I want to calculate the median of certain values based on the segment they fall into which we get by another column. The initial data structure is like given below:
Column A Column B
559 1
559 1
322 1
661 2
661 2
662 2
661 2
753 3
752 3
752 3
752 3
752 3
328 4
328 4
328 4
The calculated medians would be based on column A and the output would look like this:
Column A Column B Median
559 1 559
559 1 559
322 1 559
661 2 661
661 2 661
662 2 661
661 2 661
753 3 752
752 3 752
752 3 752
752 3 752
752 3 752
328 4 328
328 4 328
328 4 328
Median is calculated based on column A and for the set of values of column B which are same. For example we should calculate medians of all values of column A where column B values are same and paste them in the column Median.
I need to do this operation in r but haven'e been able to crack it. Is there a way to do this through dplyr or any other package?
Thanks
you can use the library(data.table) and then put your data in a data.table
dt <- as.data.table(data)
dt[,Median:=median('Column A'),by="Column B"]
here it is, done in base R and data.table way. Apologies in advance - my base r approach might be a bit cumbersome - i do not use it too often.
exampleData=data.frame(A=runif(10,0,10),B=sample(2,10,replace=T))
# Data.frame option
exampleData$Median=tapply(exampleData$A,exampleData$B,median)[as.character(exampleData$B)]
# Data.table option
library(data.table)
exampleData=data.table(exampleData)
exampleData[,Median_Data_Table_Way:=median(A),by=B]

Checking the value from given threshold in a set of observation and continue till end of vector

Task:
I have to check that if the value in the data vector is above from the given threshold,
If in my data vector, I found 5 consecutive values greater then the given threshold then I keep these values as they are.
If I have less then 5 values (not 5 consecutive values) then I will replace these values with NA's.
The sample data and required output is shown below. In this example the threshold value is 1000. X is input data variable and the desired output is: Y = X(Threshold > 1000)
X Y
580 580
457 457
980 980
1250 NA
3600 NA
598 598
1200 1200
1345 1345
9658 9658
1253 1253
4500 4500
1150 1150
596 596
594 594
550 550
1450 NA
320 320
1780 NA
592 592
590 590
I have used the following code in R for my desired output but unable to get the appropriate one:
for (i in 1:nrow(X)) # X is my data vector
{counter=0
if (X[i]>10000)
{
for (j in i:(i+4))
{
if (X[j]>10000)
{counter=counter+1}
}
ifelse (counter < 5, NA, X[j])
}
X[i]<- NA
}
X
I am sure that the above code contain some error. I need help in the form of either a new code or modifying this code or any package in R.
Here is an approach using dplyr, using a cumulative sum of diff(x > 1000) to group the values.
library(dplyr)
df <- data.frame(x)
df
# x
# 1 580
# 2 457
# 3 980
# 4 1250
# 5 3600
# 6 598
# 7 1200
# 8 1345
# 9 9658
# 10 1253
# 11 4500
# 12 1150
# 13 596
# 14 594
# 15 550
# 16 1450
# 17 320
# 18 1780
# 19 592
# 20 590
df %>% mutate(group = cumsum(c(0, abs(diff(x>1000))))) %>%
group_by(group) %>%
mutate(count = n()) %>%
ungroup() %>%
mutate(y = ifelse(x<1000 | count > 5, x, NA))
# x group count y
# (int) (dbl) (int) (int)
# 1 580 0 3 580
# 2 457 0 3 457
# 3 980 0 3 980
# 4 1250 1 2 NA
# 5 3600 1 2 NA
# 6 598 2 1 598
# 7 1200 3 6 1200
# 8 1345 3 6 1345
# 9 9658 3 6 9658
# 10 1253 3 6 1253
# 11 4500 3 6 4500
# 12 1150 3 6 1150
# 13 596 4 3 596
# 14 594 4 3 594
# 15 550 4 3 550
# 16 1450 5 1 NA
# 17 320 6 1 320
# 18 1780 7 1 NA
# 19 592 8 2 592
# 20 590 8 2 590
Another approach :
Y<-rep(NA,nrow(X))
for (i in 1:nrow(X)) {
if (X[i,1]<1000) {Y[i]<-X[i,1]} else if (sum(X[i:min((i+4),nrow(X)),1]>1000)>=5) {
Y[i:min((i+4),nrow(X))]<-X[i:min((i+4),nrow(X)),1]}
}
returns
> Y
[1] 580 457 980 NA NA 598 1200 1345 9658 1253 4500 1150 596 594 550 NA 320 NA 592 590
This assumes that the values of X are in the first column of a dataframe named X.
It then creates Y with NAand only change the values if the criteria are met.

How can I draw a boxplot of an R table?

I have a table produced by calling table(...) on a column of data, and I get a table that looks like:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
346 351 341 333 345 415 421 425 429 437 436 469 379 424 387 419 392 396 381 421
I'd like to draw a boxplot of these frequencies, but calling boxplot on the table results in an error:
Error in Axis.table(x = c(333, 368.5, 409.5, 427, 469), side = 2) :
only for 1-D table
I've tried coercing the table to an array with as.array but it seems to make no difference. What am I doing wrong?
If I understand you correctly, boxplot(c(tab)) or boxplot(as.vector(tab)) should work (credit to #joran as well).

Resources