Custom sorting of dataframe by column with duplicates

I am having trouble ordering my table because the column I want to order by (coltoorder) contains duplicates. Below is a small part of my table. The desired order is custom: roughly speaking, it follows the order of the first column, except that it starts with the value 887.
text<-"col1 col2 col3 coltoorder
895 2 1374 887
888 2 14 887
1018 3 1065 895
896 2 307 895
889 2 4 888
891 2 8 888
1055 2 971 1018
926 3 241 896
1021 2 87 1018
897 2 64 896"
mytable<-read.table(text=text, header = T)
mytable
Desired order:
myindex<-c(887,895,888,1018,896) # equivalent to
myindex2<-c(887,887,895,895,888,888,1018,1018,896,896)
Some failed attempts:
try1<-mytable[match(myindex, mytable$coltoorder),]
try2<-mytable[match(myindex2, mytable$coltoorder),]
try3<-mytable[mytable$coltoorder %in% myindex,]
try3<-mytable[myindex %in% mytable$coltoorder,]
try4<-mytable[myindex2 %in% mytable$coltoorder,]
rownames(mytable) <- mytable$coltoorder # error

It seems like coltoorder should be treated categorically, not numerically. All factors have an order of their levels, so we'll convert to a factor where the levels are ordered according to myindex. Then this ordering is "baked in" to the column and we can use order normally on it.
mytable$coltoorder = factor(mytable$coltoorder, levels = myindex)
mytable[order(mytable$coltoorder), ]
# col1 col2 col3 coltoorder
# 8 895 2 1374 887
# 1 888 2 14 887
# 131 1018 3 1065 895
# 9 896 2 307 895
# 2 889 2 4 888
# 4 891 2 8 888
# 168 1055 2 971 1018
# 134 1021 2 87 1018
# 39 926 3 241 896
# 10 897 2 64 896
Do be careful - this column is now a factor not a numeric. If you want to recover the numeric values from a factor, you need to convert via character: original_values = as.numeric(as.character(mytable$coltoorder)).
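If you would rather keep coltoorder numeric, one possible alternative (a sketch, not part of the answer above) is to sort on the position of each value in myindex, which match() returns. order() leaves ties in their original order, so rows sharing a coltoorder value keep their relative order. Values missing from myindex would get NA and be sorted last. The last two lines illustrate the factor pitfall mentioned above.

```r
text <- "col1 col2 col3 coltoorder
895 2 1374 887
888 2 14 887
1018 3 1065 895
896 2 307 895
889 2 4 888
891 2 8 888
1055 2 971 1018
926 3 241 896
1021 2 87 1018
897 2 64 896"
mytable <- read.table(text = text, header = TRUE)
myindex <- c(887, 895, 888, 1018, 896)

# match() gives each row the position of its coltoorder value in myindex
sorted <- mytable[order(match(mytable$coltoorder, myindex)), ]
sorted

# Pitfall when converting a factor back: as.numeric() alone returns level codes
f <- factor(mytable$coltoorder, levels = myindex)
head(as.numeric(f))                # level codes (1, 1, 2, ...)
head(as.numeric(as.character(f)))  # original values (887, 887, 895, ...)
```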

Your data sample suggests that your desired sort order is equivalent to the first appearance in column coltoorder.
If this is true, the function fct_inorder() from Hadley Wickham's forcats package may be particularly helpful here:
mytable$coltoorder <- forcats::fct_inorder(as.character(mytable$coltoorder))
mytable[order(mytable$coltoorder), ]
col1 col2 col3 coltoorder
1 895 2 1374 887
2 888 2 14 887
3 1018 3 1065 895
4 896 2 307 895
5 889 2 4 888
6 891 2 8 888
7 1055 2 971 1018
9 1021 2 87 1018
8 926 3 241 896
10 897 2 64 896
fct_inorder() reorders factor levels by first appearance, so there is no need to create a separate myindex vector.
However, the caveats from Gregor's answer apply as well.
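If you do not want to depend on forcats, the same idea can be sketched in base R by taking the levels from unique(), which returns values in order of first appearance. The vector below is just the coltoorder column from the question, typed in by hand.

```r
coltoorder <- c(887, 887, 895, 895, 888, 888, 1018, 896, 1018, 896)

# unique() keeps the order of first appearance, so it can supply the levels
f <- factor(coltoorder, levels = unique(coltoorder))
levels(f)

# row positions in the desired order; ties stay in their original order
idx <- order(f)
idx
```

Using idx to subset the data frame, as in mytable[idx, ], gives the same row order as the fct_inorder() version.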

Related

Pivot/Reshape data in R [duplicate]

Thank you all for your answers. I thought I was smarter than I am and hoped I would understand them, but I did not. I think I also messed up the presentation of my data, so I have edited my post to show my sample data more clearly. Sorry for the inconvenience, and I truly hope that someone can help me.
I have a question about reshaping my data. The data collected looks as such:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID Time Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would I get there? I have looked up some things about converting wide to long format, but they don't seem to do the trick. I am relatively new to RStudio and Stack Overflow (if you couldn't tell that already).
Kind regards, and thank you in advance.

Here is a slightly different pivot_longer() version.
library(tidyr)
library(dplyr)
dw %>%
pivot_longer(cols = -PID, names_to =".value", names_pattern = "(.+)[0-9]")
# A tibble: 9 x 3
PID T measurement
<dbl> <dbl> <dbl>
1 1 1 100
2 1 4 200
3 1 7 50
4 2 2 150
5 2 5 300
6 2 8 60
7 3 3 120
8 3 6 210
9 3 9 70
The names_to = ".value" argument creates new columns from the column names, based on the names_pattern argument. The names_pattern argument takes a regular expression with capture groups. In this case, here is the breakdown:
(.+) # capture group: whatever it matches becomes the ".value" column name
[0-9] # a single digit - tells the pattern that the number at the end
# is excluded from ".value". If you have multiple-digit numbers,
# use (\\D+)[0-9]+ instead, because a greedy (.+) would swallow
# all but the last digit
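For a runnable illustration, here is a small sketch on a hand-typed data frame with the same shape as the output above (the object name df and the column name "set" are my own choices, not from the question). It uses a variant of the pattern with two capture groups, so the trailing number is kept in its own column and multi-digit suffixes are handled as well.

```r
library(tidyr)

# toy wide data: one T/measurement pair per numeric suffix
df <- data.frame(PID = 1:3,
                 T1 = 1:3, measurement1 = c(100, 150, 120),
                 T2 = 4:6, measurement2 = c(200, 300, 210),
                 T3 = 7:9, measurement3 = c(50, 60, 70))

# first group -> names of the new value columns (".value"),
# second group -> the numeric suffix, stored in a column called "set"
long <- pivot_longer(df, cols = -PID,
                     names_to = c(".value", "set"),
                     names_pattern = "(\\D+)(\\d+)")
long
```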

In the last edit you asked for a solution that is easy to understand. A very simple approach would be to stack the measurement columns on top of each other and the Tdays columns on top of each other. Although specialty packages make things very concise and elegant, for simplicity we can solve this without additional packages. Standard R has a convenient function aptly named stack, which works like this:
> exp <- data.frame(value1 = 1:5, value2 = 6:10)
> stack(exp)
values ind
1 1 value1
2 2 value1
3 3 value1
4 4 value1
5 5 value1
6 6 value2
7 7 value2
8 8 value2
9 9 value2
10 10 value2
We can stack measurements and Tdays separately and then combine them via cbind:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
cbind(stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
This keeps measurements and Tdays neatly together but leaves us without pid, which we can add by using rep to replicate the original pid column 4 times (once per measurement/Tdays pair):
result <- cbind(pid = rep(data$pid, 4),
stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
The head of which looks like
> head(result)
pid values ind values ind
1 1 1356 measurement1 1435 Tdays1
2 2 943 measurement1 1848 Tdays1
3 3 1590 measurement1 185 Tdays1
4 4 130 measurement1 72 Tdays1
5 4 140 measurement1 82 Tdays1
6 4 220 measurement1 126 Tdays1
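This is not yet in the order you asked for. If that matters, you can sort the data.frame by pid and keep only the pid, Time and Value columns. Note that combinations that were NA in the wide table remain in the result as NA rows: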
result <- result[order(result$pid), c(1, 4, 2)]
names(result) <- c("pid", "Time", "Value")
leading to the final result
> head(result)
pid Time Value
1 1 1435 1356
13 1 1405 1483
25 1 1374 1563
37 1 NA NA
2 2 1848 943
14 2 1818 1173
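If the NA rows are not wanted, the same stacking idea can be sketched with unlist() and a filter. This is only an illustration on the first three rows and first two measurement/Tdays pairs of the question's data. It assumes the measurement and Tdays columns appear in matching order; the names meas_cols, time_cols and long are made up for this example.

```r
data <- read.table(header = TRUE, text = "
pid measurement1 Tdays1 measurement2 Tdays2
1 1356 1435 1483 1405
2 943 1848 1173 1818
3 1590 185 NA NA
")

meas_cols <- grep("^measurement", names(data), value = TRUE)
time_cols <- grep("^Tdays", names(data), value = TRUE)

# unlist() on a data frame concatenates its columns one after another
long <- data.frame(
  pid   = rep(data$pid, times = length(meas_cols)),
  Time  = unlist(data[time_cols], use.names = FALSE),
  Value = unlist(data[meas_cols], use.names = FALSE)
)

# drop rows where both Time and Value are missing, then sort by pid
long <- long[!(is.na(long$Time) & is.na(long$Value)), ]
long <- long[order(long$pid), ]
long
```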

tidyverse solution
library(tidyverse)
dw %>%
pivot_longer(-PID) %>%
mutate(name = gsub('^([A-Za-z]+)(\\d+)$', '\\1_\\2', name )) %>%
separate(name, into = c('A', 'B'), sep = '_', convert = T) %>%
pivot_wider(names_from = A, values_from = value)
Gives the following output
# A tibble: 9 x 4
PID B T measurement
<int> <int> <int> <int>
1 1 1 1 100
2 1 2 4 200
3 1 3 7 50
4 2 1 2 150
5 2 2 5 300
6 2 3 8 60
7 3 1 3 120
8 3 2 6 210
9 3 3 9 70

Considering a dataframe, df like the following:
PID T1 measurement1 T2 measurement2 T3 measurement3
1 1 100 4 200 7 50
2 2 150 5 300 8 60
3 3 120 6 210 9 70
You can use this solution to get your required dataframe:
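iters = seq(from = 4, to = length(colnames(df))-1, by = 2)
finalDf = df[, c(1,2,3)]
for(j in iters){
  tobind = df[, c(1,j,j+1)]
  # rbind() matches data frame columns by name, so align the names first
  names(tobind) = names(finalDf)
  finalDf = rbind(finalDf, tobind)
}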
finalDf = finalDf[order(finalDf[,1]),]
print(finalDf)
The output of the print statement is this:
PID T1 measurement1
1 1 1 100
4 1 4 200
7 1 7 50
2 2 2 150
5 2 5 300
8 2 8 60
3 3 3 120
6 3 6 210
9 3 9 70

Maybe you can try reshape like below
reshape(
setNames(data, gsub("(\\d+)$", "\\.\\1", names(data))),
direction = "long",
varying = 2:ncol(data)
)
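To see what this does, here is a small self-contained sketch on a hand-typed data frame (df, dotted and long are names chosen for the example). reshape() guesses the variable names and times by splitting column names such as "T.1" at the dot, which is why the names are rewritten first.

```r
df <- data.frame(PID = 1:3,
                 T1 = 1:3, measurement1 = c(100, 150, 120),
                 T2 = 4:6, measurement2 = c(200, 300, 210))

# turn "T1" into "T.1", "measurement2" into "measurement.2", and so on
dotted <- setNames(df, sub("(\\d+)$", ".\\1", names(df)))

# every column except PID varies over time
long <- reshape(dotted, direction = "long", varying = names(dotted)[-1])
long[order(long$PID), ]
```

The result also contains a time column (the numeric suffix) and an id column that reshape() adds to identify the original rows.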

recoding a numerical variable based on a specific criterion in r

I would like to recode a numerical variable based on a cut score criterion. If the cut scores are not available in the variable, I would like to recode the closest smaller value as a cut score. Here is a snapshot of dataset:
ids <- c(1,2,3,4,5,6,7,8,9,10)
scores <- c(512,531,541,555,562,565,570,572,573,588)
data <- data.frame(ids, scores)
> data
ids scores
1 1 512
2 2 531
3 3 541
4 4 555
5 5 562
6 6 565
7 7 570
8 8 572
9 9 573
10 10 588
cuts <- c(531, 560, 575)
The first cut score (531) is in the dataset. So it will stay the same as 531. However, 560 and 575 were not available. I would like to recode the closest smaller value (555) to the second cut score as 560 in the new column, and for the third cut score, I'd like to recode 573 as 575.
Here is what I would like to get.
ids scores rescored
1 1 512 512
2 2 531 531
3 3 541 541
4 4 555 560
5 5 562 562
6 6 565 565
7 7 570 570
8 8 572 572
9 9 573 575
10 10 588 588
Any thoughts?
Thanks

One option is to find, with findInterval, the index of the closest score at or below each cut. Then take the pmax of the 'scores' at those indices and the 'cuts', and update the 'rescored' column at those indices:
i1 <- with(data, findInterval(cuts, scores))
data$rescored <- data$scores
data$rescored[i1] <- with(data, pmax(scores[i1], cuts))
data
# ids scores rescored
#1 1 512 512
#2 2 531 531
#3 3 541 541
#4 4 555 560
#5 5 562 562
#6 6 565 565
#7 7 570 570
#8 8 572 572
#9 9 573 575
#10 10 588 588
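To make the intermediate steps visible, here is the same computation broken down on the question's data. Two assumptions are worth stating: findInterval needs scores sorted in non-decreasing order, and a cut below the smallest score would return index 0, which this sketch does not handle.

```r
scores <- c(512, 531, 541, 555, 562, 565, 570, 572, 573, 588)
cuts <- c(531, 560, 575)

# for each cut, the position of the largest score that is <= the cut
i1 <- findInterval(cuts, scores)
i1                      # 2 4 9
scores[i1]              # 531 555 573

# pmax keeps 531 (already a cut) and lifts 555 and 573 up to their cuts
new_values <- pmax(scores[i1], cuts)
new_values              # 531 560 575
```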

Checking the value from given threshold in a set of observation and continue till end of vector

Task:
I have to check whether the values in my data vector are above a given threshold.
If I find 5 consecutive values greater than the threshold, I keep these values as they are.
If I find fewer than 5 consecutive values above the threshold, I replace these values with NAs.
The sample data and required output are shown below. In this example the threshold value is 1000. X is the input data variable and the desired output is: Y = X(Threshold > 1000)
X Y
580 580
457 457
980 980
1250 NA
3600 NA
598 598
1200 1200
1345 1345
9658 9658
1253 1253
4500 4500
1150 1150
596 596
594 594
550 550
1450 NA
320 320
1780 NA
592 592
590 590
I have used the following code in R for my desired output but unable to get the appropriate one:
for (i in 1:nrow(X)) # X is my data vector
{counter=0
if (X[i]>10000)
{
for (j in i:(i+4))
{
if (X[j]>10000)
{counter=counter+1}
}
ifelse (counter < 5, NA, X[j])
}
X[i]<- NA
}
X
I am sure that the above code contains some errors. I need help in the form of new code, a fix for this code, or a suitable R package.

Here is an approach using dplyr, using a cumulative sum of diff(x > 1000) to group the values.
library(dplyr)
df <- data.frame(x)
df
# x
# 1 580
# 2 457
# 3 980
# 4 1250
# 5 3600
# 6 598
# 7 1200
# 8 1345
# 9 9658
# 10 1253
# 11 4500
# 12 1150
# 13 596
# 14 594
# 15 550
# 16 1450
# 17 320
# 18 1780
# 19 592
# 20 590
df %>% mutate(group = cumsum(c(0, abs(diff(x>1000))))) %>%
group_by(group) %>%
mutate(count = n()) %>%
ungroup() %>%
# keep values at or below the threshold, and runs of at least 5 values above it
mutate(y = ifelse(x <= 1000 | count >= 5, x, NA))
# x group count y
# (int) (dbl) (int) (int)
# 1 580 0 3 580
# 2 457 0 3 457
# 3 980 0 3 980
# 4 1250 1 2 NA
# 5 3600 1 2 NA
# 6 598 2 1 598
# 7 1200 3 6 1200
# 8 1345 3 6 1345
# 9 9658 3 6 9658
# 10 1253 3 6 1253
# 11 4500 3 6 4500
# 12 1150 3 6 1150
# 13 596 4 3 596
# 14 594 4 3 594
# 15 550 4 3 550
# 16 1450 5 1 NA
# 17 320 6 1 320
# 18 1780 7 1 NA
# 19 592 8 2 592
# 20 590 8 2 590
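The same run-based grouping can also be sketched in base R with rle(), which returns the lengths of consecutive runs directly. This is an alternative to the dplyr pipeline, not a rewrite of it; the threshold and min_run names are my own.

```r
x <- c(580, 457, 980, 1250, 3600, 598, 1200, 1345, 9658, 1253,
       4500, 1150, 596, 594, 550, 1450, 320, 1780, 592, 590)
threshold <- 1000
min_run <- 5

# runs of consecutive TRUE/FALSE values of x > threshold
r <- rle(x > threshold)

# keep a value if its run is not above the threshold or is long enough
keep <- rep(!r$values | r$lengths >= min_run, times = r$lengths)
y <- ifelse(keep, x, NA)
y
```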

Another approach :
Y<-rep(NA,nrow(X))
for (i in 1:nrow(X)) {
if (X[i,1]<1000) {Y[i]<-X[i,1]} else if (sum(X[i:min((i+4),nrow(X)),1]>1000)>=5) {
Y[i:min((i+4),nrow(X))]<-X[i:min((i+4),nrow(X)),1]}
}
returns
> Y
[1] 580 457 980 NA NA 598 1200 1345 9658 1253 4500 1150 596 594 550 NA 320 NA 592 590
This assumes that the values of X are in the first column of a data frame named X.
It then creates Y filled with NA and only changes the values if the criteria are met.

sort data.frame based on the number of identical repeats in a column

I want to sort the data.frame based on the highest number of times a given character is repeated in the last column
data=
chr start end name
1 234 267 ttn
2 345 367 Elm
3 445 489 ttn
4 544 598 Rm
5 644 680 ttn
I want something like this:
chr start end name
1 234 267 ttn
3 445 489 ttn
5 644 680 ttn
2 345 367 Elm
4 544 598 Rm

Here's a quick data.table approach which will sort the data by reference
library(data.table)
setorder(setDT(df)[, indx := .N, by = name], -indx)[]
# chr start end name indx
# 1: 1 234 267 ttn 3
# 2: 3 445 489 ttn 3
# 3: 5 644 680 ttn 3
# 4: 2 345 367 Elm 1
# 5: 4 544 598 Rm 1

Try
data[with(data, order(-ave(seq_along(name), name, FUN=length))),]
# chr start end name
#1 1 234 267 ttn
#3 3 445 489 ttn
#5 5 644 680 ttn
#2 2 345 367 Elm
#4 4 544 598 Rm
Or another base R approach is
data[order(factor(data$name, levels=names(sort(-table(data$name))))),]
# chr start end name
# 1 1 234 267 ttn
# 3 3 445 489 ttn
# 5 5 644 680 ttn
# 2 2 345 367 Elm
# 4 4 544 598 Rm
Or using dplyr
library(dplyr)
data %>%
group_by(name) %>%
mutate(n=n()) %>%
arrange(-n) %>%
select(-n)
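For anyone who wants to run the answers above, here is a reproducible version of the sample data together with one more base R sketch that looks up each row's count in a table. The data frame is typed in from the question; counts and sorted are names chosen for the example.

```r
data <- data.frame(chr = 1:5,
                   start = c(234, 345, 445, 544, 644),
                   end = c(267, 367, 489, 598, 680),
                   name = c("ttn", "Elm", "ttn", "Rm", "ttn"),
                   stringsAsFactors = FALSE)

# number of occurrences of each name
counts <- table(data$name)

# look up the count for every row and sort by it, largest first;
# rows with equal counts keep their original order
sorted <- data[order(-as.vector(counts[data$name])), ]
sorted
```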

Merge data frames from a list with each other

What I need:
I have a huge data frame with the following columns (and some more, but these are not important). Here's an example:
user_id video_id group_id x y
1 1 0 0 39 108
2 1 0 0 39 108
3 1 10 0 135 180
4 2 0 0 20 123
User, video and group IDs are factors, of course. For example, there are 20 videos, but each of them has several "observations" for each user and group.
I'd like to transform this data frame into the following format, where there are as many x.N, y.N as there are users (N).
video_id x.1 y.1 x.2 y.2 …
0 39 108 20 123
So, for video 0, the x and y values from user 1 are in columns x.1 and y.1, respectively. For user 2, their values are in columns x.2, y.2, and so on.
What I've tried:
I made myself a list of data frames that are solely composed of all the x, y observations for each video_id:
summaryList = dlply(allData, .(user_id), function(x) unique(x[c("video_id","x","y")]) )
This is what it looks like:
List of 15
$ 1 :'data.frame': 20 obs. of 3 variables:
..$ video_id: Factor w/ 20 levels "0","1","2","3",..: 1 11 8 5 12 9 20 13 7 10 ...
..$ x : int [1:20] 39 135 86 122 28 167 203 433 549 490 ...
..$ y : int [1:20] 108 180 164 103 187 128 185 355 360 368 ...
$ 2 :'data.frame': 20 obs. of 3 variables:
..$ video_id: Factor w/ 20 levels "0","1","2","3",..: 2 14 15 4 20 6 19 3 13 18 ...
..$ x : int [1:20] 128 688 435 218 528 362 299 134 83 417 ...
..$ y : int [1:20] 165 117 135 179 96 328 332 563 623 476 ...
Where I'm stuck:
What's left to do is:
Merge each data frame from the summaryList with each other, based on the video_id. I can't find a nice way to access the actual data frames in the list, which are summaryList[1]$`1`, summaryList[2]$`2`, et cetera.
@James found a partial solution:
Reduce(function(x,y) merge(x,y,by="video_id"),summaryList)
Ensure the column names are renamed after the user ID and not kept as-is. Right now my summaryList doesn't contain any info about the user ID, and the output of Reduce has duplicate column names like x.x y.x x.y y.y x.x y.x and so on.
How do I go about doing this? Or is there any easier way to get to the result than what I'm currently doing?

I am still somewhat confused. However, I guess you simply want to melt and dcast.
library(reshape2)
d <- melt(allData,id.vars=c("user_id","video_id"), measure.vars=c("x","y"))
dcast(d,video_id~user_id+variable,value.var="value",fun.aggregate=mean)
Resulting in:
video_id 1_x 1_y 2_x 2_y 3_x 3_y 4_x 4_y 5_x 5_y 6_x 6_y 7_x 7_y 8_x 8_y 9_x 9_y 10_x 10_y 11_x 11_y 12_x 12_y 14_x 14_y 15_x 15_y 16_x 16_y
1 0 39 108 899 132 61 357 149 298 1105 415 148 208 442 200 210 134 58 244 910 403 152 52 1092 617 1012 114 1105 424 548 394
2 1 1125 70 128 165 1151 390 171 587 623 623 80 643 866 310 994 114 854 129 781 306 672 -1 1096 354 525 524 150

Reduce does the trick:
reducedData <- Reduce(function(x,y) merge(x,y,by="video_id"),summaryList)
… but you need to fix the names afterwards:
names(reducedData)[-1] <- do.call(function(...) paste(...,sep="."),expand.grid(letters[24:25],names(summaryList)))
The result is:
video_id x.1 y.1 x.2 y.2 x.3 y.3 x.4 y.4 x.5 y.5 x.6 y.6 x.7 y.7 x.8
1 0 39 108 899 132 61 357 149 298 1105 415 148 208 442 200 210
2 1 1125 70 128 165 1151 390 171 587 623 623 80 643 866 310 994
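An alternative sketch is to rename the x and y columns after the user ID before merging, so no fix-up is needed afterwards. The two-element summaryList below is a toy stand-in built from the first values of the output above, and renamed is a name chosen for the example. List elements can be accessed by name with summaryList[["1"]].

```r
# toy stand-in for the real summaryList (names are the user IDs)
summaryList <- list(
  `1` = data.frame(video_id = c(0, 1), x = c(39, 1125), y = c(108, 70)),
  `2` = data.frame(video_id = c(0, 1), x = c(899, 128), y = c(132, 165))
)

# append the user ID to every column except the merge key
renamed <- lapply(names(summaryList), function(id) {
  d <- summaryList[[id]]
  value_cols <- names(d) != "video_id"
  names(d)[value_cols] <- paste(names(d)[value_cols], id, sep = ".")
  d
})

reducedData <- Reduce(function(a, b) merge(a, b, by = "video_id"), renamed)
reducedData
```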
