Calculating median based on segments in R [duplicate]

This question already has answers here:
Calculate group mean, sum, or other summary stats. and assign column to original data
(4 answers)
Closed 5 years ago.
Hi, I want to calculate the median of the values in one column, grouped by the segment each value falls into, where the segment is given by another column. The initial data structure looks like this:
Column A Column B
559 1
559 1
322 1
661 2
661 2
662 2
661 2
753 3
752 3
752 3
752 3
752 3
328 4
328 4
328 4
The calculated medians would be based on column A and the output would look like this:
Column A Column B Median
559 1 559
559 1 559
322 1 559
661 2 661
661 2 661
662 2 661
661 2 661
753 3 752
752 3 752
752 3 752
752 3 752
752 3 752
328 4 328
328 4 328
328 4 328
The median is calculated on Column A, separately for each set of rows that share the same value in Column B. For example, we should take all values of Column A where the Column B values are the same, calculate their median, and put it in the Median column for those rows.
I need to do this operation in R but haven't been able to crack it. Is there a way to do this with dplyr or any other package?
Thanks

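You can use library(data.table) and then put your data in a data.table. A column name that contains a space must be wrapped in backticks; quoting it as 'Column A' would take the median of the string itself rather than of the column:
library(data.table)
dt <- as.data.table(data)
dt[, Median := median(`Column A`), by = "Column B"]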

Here it is, done both the base R way and the data.table way. Apologies in advance: my base R approach might be a bit cumbersome, as I do not use it too often.
exampleData=data.frame(A=runif(10,0,10),B=sample(2,10,replace=TRUE))
# Data.frame option
exampleData$Median=tapply(exampleData$A,exampleData$B,median)[as.character(exampleData$B)]
# Data.table option
library(data.table)
exampleData=data.table(exampleData)
exampleData[,Median_Data_Table_Way:=median(A),by=B]
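For a base R version that avoids the tapply lookup, a minimal sketch using ave() on the data from the question is shown here. The data frame name df and the short column names A and B are assumptions made for the illustration; with dplyr, the same idea is usually written as group_by() followed by mutate().

```r
# Data from the question, with shortened column names (an assumption)
df <- data.frame(
  A = c(559, 559, 322, 661, 661, 662, 661, 753, 752, 752, 752, 752, 328, 328, 328),
  B = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4)
)

# ave() applies FUN within each group of B and returns a vector
# aligned with the original rows, so it can be assigned directly
df$Median <- ave(df$A, df$B, FUN = median)
df
```

Because ave() returns one value per input row, no separate join or lookup back onto the original data is needed.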

Related

recoding a numerical variable based on a specific criterion in r

I would like to recode a numerical variable based on a cut score criterion. If the cut scores are not available in the variable, I would like to recode the closest smaller value as a cut score. Here is a snapshot of dataset:
ids <- c(1,2,3,4,5,6,7,8,9,10)
scores <- c(512,531,541,555,562,565,570,572,573,588)
data <- data.frame(ids, scores)
> data
ids scores
1 1 512
2 2 531
3 3 541
4 4 555
5 5 562
6 6 565
7 7 570
8 8 572
9 9 573
10 10 588
cuts <- c(531, 560, 575)
The first cut score (531) is in the dataset. So it will stay the same as 531. However, 560 and 575 were not available. I would like to recode the closest smaller value (555) to the second cut score as 560 in the new column, and for the third cut score, I'd like to recode 573 as 575.
Here is what I would like to get.
ids scores rescored
1 1 512 512
2 2 531 531
3 3 541 541
4 4 555 560
5 5 562 562
6 6 565 565
7 7 570 570
8 8 572 572
9 9 573 575
10 10 588 588
Any thoughts?
Thanks
One option would be to find the index with findInterval, take the pmax of the 'scores' at that index and the 'cuts', and update the 'rescored' column elements at that index. Note that findInterval requires 'scores' to be sorted in increasing order, as it is here:
i1 <- with(data, findInterval(cuts, scores))
data$rescored <- data$scores
data$rescored[i1] <- with(data, pmax(scores[i1], cuts))
data
# ids scores rescored
#1 1 512 512
#2 2 531 531
#3 3 541 541
#4 4 555 560
#5 5 562 562
#6 6 565 565
#7 7 570 570
#8 8 572 572
#9 9 573 575
#10 10 588 588
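If the scores are not already sorted, the same idea can be wrapped in a small helper that sorts first and maps the positions back. This is only a sketch: the function name rescore_to_cuts is hypothetical, and it assumes that a cut smaller than every score should simply be ignored.

```r
# Hypothetical helper: replace the closest score <= each cut with the cut itself
rescore_to_cuts <- function(scores, cuts) {
  ord <- order(scores)                     # positions of scores in increasing order
  pos <- findInterval(cuts, scores[ord])   # index (in sorted order) of the largest score <= cut
  keep <- pos > 0                          # pos == 0 means no score is <= that cut
  idx <- ord[pos[keep]]                    # map back to positions in the original vector
  rescored <- scores
  rescored[idx] <- cuts[keep]              # the matched score is <= the cut, so the cut replaces it
  rescored
}

scores <- c(512, 531, 541, 555, 562, 565, 570, 572, 573, 588)
cuts <- c(531, 560, 575)
rescore_to_cuts(scores, cuts)
```

If two cuts fall in the same interval they target the same score, and the later cut overwrites the earlier one.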

function similar to dplyr::distinct with a cutoff

I have a data frame of points with x,y positions (in pixels) and would like to filter out all points that lie within +/- 5 pixels of a point that has already been kept. Is there a function similar to dplyr::distinct() but with a cutoff?
Example dataset:
X.1 X Y
1 637 614
2 559 503
3 601 459
4 601 459
5 603 462
6 604 460
I am expecting an output of :
X.1 X Y
1 637 614
2 559 503
3 601 459 <- the first element is preserved.
Thanks
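A simple solution is to round your data to the nearest multiple of 5 and then use a regular distinct function on the coordinate columns:
X.1$X <- round(X.1$X/5)*5
X.1$Y <- round(X.1$Y/5)*5
distinct(X.1, X, Y, .keep_all = TRUE)
#Output:
X.1 X Y
1 635 615
2 560 505
3 600 460
5 605 460
Rounding only merges points that land in the same 5-pixel bin. Here 601 rounds down to 600 while 603 and 604 round up to 605, so one extra point survives even though it is within 5 pixels of row 3. Your problem may therefore require a higher level of accuracy.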
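Where rounding to a grid is too coarse, one alternative is a greedy pass that keeps a point only if no previously kept point is within the tolerance on both axes. The sketch below assumes that "within 5 pixels" means within 5 on X and within 5 on Y; the names pts and filter_near_duplicates are made up for the illustration.

```r
# Hypothetical helper: keep the first point of every cluster of nearby points
filter_near_duplicates <- function(df, tol = 5) {
  keep <- logical(nrow(df))                # all FALSE to start with
  for (i in seq_len(nrow(df))) {
    kept <- which(keep)                    # rows accepted so far
    near <- abs(df$X[kept] - df$X[i]) <= tol &
            abs(df$Y[kept] - df$Y[i]) <= tol
    keep[i] <- !any(near)                  # keep only if no accepted row is close
  }
  df[keep, , drop = FALSE]
}

pts <- data.frame(
  X = c(637, 559, 601, 601, 603, 604),
  Y = c(614, 503, 459, 459, 462, 460)
)
filter_near_duplicates(pts)
```

The loop compares each point against all kept points, so it is quadratic in the worst case; that is fine for small data sets but may be slow for very large ones.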

Find range in which each vector element falls in [duplicate]

This question already has answers here:
Cut by Defined Interval
(2 answers)
Closed 4 years ago.
I have a vector of random numbers.
x=sample(1:1000, 3)
Is there a simple way to get the range in which each element falls?
id=seq(1, 1000, by=50)
[1] 1 51 101 151 201 251 301 351 401 451 501 551
[13] 601 651 701 751 801 851 901 951
eg.
x
[1] 637 374 68
distribution
[1] "601~650" "351~400" "51~100"
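Try this easy solution using findInterval. Here lim_inf is the start of the interval and lim_sup is the start of the next one, so each value lies in lim_inf to lim_sup - 1: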
cbind(x,lim_inf=id[findInterval(x,id)],lim_sup=id[findInterval(x,id)+1])
x lim_inf lim_sup
[1,] 378 351 401
[2,] 609 601 651
[3,] 496 451 501
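To get the labels in the "601~650" form shown in the question, the lower bound found by findInterval can be pasted together with the upper bound. This is a small sketch that assumes every interval is 50 wide, as in the id vector from the question, and uses the example values of x rather than a random sample.

```r
id <- seq(1, 1000, by = 50)        # interval starts: 1, 51, 101, ...
x <- c(637, 374, 68)               # example values from the question

lower <- id[findInterval(x, id)]   # start of the interval containing each x
distribution <- paste0(lower, "~", lower + 49)
distribution
```

Computing the upper bound as lower + 49 also avoids the NA that id[findInterval(x, id) + 1] would give for values in the last interval.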

Custom sorting of dataframe by column with duplicates

I would like to ask for help, because I am having difficulty ordering my table: the column to order by (coltoorder) contains duplicates. This is a tiny part of my table. The desired order is custom; roughly speaking, it follows the order of the first column, except for the first value (887).
text<-"col1 col2 col3 coltoorder
895 2 1374 887
888 2 14 887
1018 3 1065 895
896 2 307 895
889 2 4 888
891 2 8 888
1055 2 971 1018
926 3 241 896
1021 2 87 1018
897 2 64 896"
mytable<-read.table(text=text, header = T)
mytable
desired order
myindex<-c(887,895,888,1018,896) # equivalent to
myindex2<-c(887,887,895,895,888,888,1018,1018,896,896)
some failed attempts
try1<-mytable[match(myindex, mytable$coltoorder),]
try2<-mytable[match(myindex2, mytable$coltoorder),]
try3<-mytable[mytable$coltoorder %in% myindex,]
try3<-mytable[myindex %in% mytable$coltoorder,]
try4<-mytable[myindex2 %in% mytable$coltoorder,]
rownames(mytable) <- mytable$coltoorder # error
It seems like coltoorder should be treated categorically, not numerically. All factors have an order of their levels, so we'll convert to a factor where the levels are ordered according to myindex. Then this ordering is "baked in" to the column and we can use order normally on it.
mytable$coltoorder = factor(mytable$coltoorder, levels = myindex)
mytable[order(mytable$coltoorder), ]
# col1 col2 col3 coltoorder
# 8 895 2 1374 887
# 1 888 2 14 887
# 131 1018 3 1065 895
# 9 896 2 307 895
# 2 889 2 4 888
# 4 891 2 8 888
# 168 1055 2 971 1018
# 134 1021 2 87 1018
# 39 926 3 241 896
# 10 897 2 64 896
Do be careful - this column is now a factor not a numeric. If you want to recover the numeric values from a factor, you need to convert via character: original_values = as.numeric(as.character(mytable$coltoorder)).
Your data sample suggests that your desired sort order is equivalent to the order of first appearance in column coltoorder.
If this is true, the function fct_inorder() from Hadley Wickham's forcats package may be particularly helpful here:
mytable$coltoorder <- forcats::fct_inorder(as.character(mytable$coltoorder))
mytable[order(mytable$coltoorder), ]
col1 col2 col3 coltoorder
1 895 2 1374 887
2 888 2 14 887
3 1018 3 1065 895
4 896 2 307 895
5 889 2 4 888
6 891 2 8 888
7 1055 2 971 1018
9 1021 2 87 1018
8 926 3 241 896
10 897 2 64 896
fct_inorder() reorders factor levels by first appearance, so there is no need to create a separate myindex vector.
However, the caveats from Gregor's answer apply as well.
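If converting the column to a factor is undesirable, the same custom order can be obtained by ordering on the position of each value in myindex. This is a sketch of that alternative using match(), and it leaves coltoorder numeric; it reuses the table and the myindex vector from the question.

```r
text <- "col1 col2 col3 coltoorder
895 2 1374 887
888 2 14 887
1018 3 1065 895
896 2 307 895
889 2 4 888
891 2 8 888
1055 2 971 1018
926 3 241 896
1021 2 87 1018
897 2 64 896"
mytable <- read.table(text = text, header = TRUE)
myindex <- c(887, 895, 888, 1018, 896)

# match() gives each row's rank in the custom order; order() sorts by that rank
sorted <- mytable[order(match(mytable$coltoorder, myindex)), ]
sorted
```

Values of coltoorder that do not occur in myindex get NA from match() and are placed last by order().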

Add column to data frame based on values of another column in another row

I searched for an answer to my specific problem but didn't find one. I found this: Add column to Data Frame based on values of other columns, but it wasn't exactly what I need in my specific case.
I'm really a beginner in R, so I hope someone can help me or has a good hint for me.
Here is an example of what my data frame looks like:
ID answer 1.partnerID
125 3 715
235 4 845
370 7 985
560 1 950
715 5 235
950 5 560
845 6 370
985 6 125
I'll try to describe what I want to do with an example:
The first row holds the data of the person with ID 125. The first partner of this person is the person with ID 715. I want to create a new column containing the answer given by each person's partner. It should look like this:
ID answer 1.partnerID 1.partneranswer
125 3 715 5
235 4 845 6
370 7 985 6
560 1 950 5
715 5 235 4
950 5 560 1
845 6 370 7
985 6 125 3
So R should take the value of the column 1.partnerID, which in this case is "715", and search for the row where "715" is the value in the column ID (no ID occurs more than once).
From that row R should take the value from the column answer (in this example that's the "5") and put it into the new column "1.partneranswer", but in the row of person 125.
I hope someone can understand what I want to do ...
My problem is that I can imagine how to write this for each row by hand, but I think there must be an easier way to do it for all rows at once. (Especially because my original data.frame has 5 partners per person and more than one column whose values should be transferred, so writing it out by hand for every single row would take many hours.)
I hope someone can help.
Thank you!
One solution is to use apply as follows:
df$partneranswer <- apply(df, 1, function(x) df$answer[df$ID == x[3]])
Output will be as desired above. There may be a loop-less approach.
EDIT: Adding a loop-less (vectorized) answer using match:
df$partneranswer <- df$answer[match(df$X1.partnerID, df$ID)]
df
ID answer X1.partnerID partneranswer
1 125 3 715 5
2 235 4 845 6
3 370 7 985 6
4 560 1 950 5
5 715 5 235 4
6 950 5 560 1
7 845 6 370 7
8 985 6 125 3
Update: This can also be done with a self join. The first two columns define a mapping from ID to answer, so to find the answers for the partner IDs you can merge the data frame with itself, keying the first copy on partnerID and the second copy on ID.
Suppose df is (with the column names tidied up a little):
df
# ID answer partnerID
#1 125 3 715
#2 235 4 845
#3 370 7 985
#4 560 1 950
#5 715 5 235
#6 950 5 560
#7 845 6 370
#8 985 6 125
merge(df, df[c('ID', 'answer')], by.x = "partnerID", by.y = "ID")
# partnerID ID answer.x answer.y
#1 125 985 6 3
#2 235 715 5 4
#3 370 845 6 7
#4 560 950 5 1
#5 715 125 3 5
#6 845 235 4 6
#7 950 560 1 5
#8 985 370 7 6
Old answer:
If ID and partnerID are mapped to each other one-to-one, you can try:
df$partneranswer <- with(df, answer[sapply(X1.partnerID, function(partnerID) which(ID == partnerID))])
df
# ID answer X1.partnerID partneranswer
#1 125 3 715 5
#2 235 4 845 6
#3 370 7 985 6
#4 560 1 950 5
#5 715 5 235 4
#6 950 5 560 1
#7 845 6 370 7
#8 985 6 125 3
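Since the question mentions five partner columns per person, the match() lookup can be repeated over every partner column in a short loop. The sketch below assumes the partner columns all end in "partnerID" (for example X1.partnerID, X2.partnerID, ...); that naming pattern is an assumption, and only the first partner column from the example is included.

```r
df <- data.frame(
  ID = c(125, 235, 370, 560, 715, 950, 845, 985),
  answer = c(3, 4, 7, 1, 5, 5, 6, 6),
  X1.partnerID = c(715, 845, 985, 950, 235, 560, 370, 125)
)

# Find every partner column by its (assumed) name suffix
partner_cols <- grep("partnerID$", names(df), value = TRUE)

for (col in partner_cols) {
  new_col <- sub("partnerID$", "partneranswer", col)    # e.g. X1.partneranswer
  df[[new_col]] <- df$answer[match(df[[col]], df$ID)]   # look up each partner's answer
}
df
```

A partner ID that does not appear in the ID column yields NA in the new column rather than an error.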
