R: order data frame according to one column

I have data like this:
> bbT11
range X0 X1 total BR GDis BDis WOE IV Index
1 (1,23] 5718 194 5912 0.03281461 12.291488 8.009909 0.42822753 1.83348973 1.534535
2 (23,26] 5249 330 5579 0.05915039 11.283319 13.625103 -0.18858848 0.44163352 1.207544
3 (26,28] 3105 209 3314 0.06306578 6.674549 8.629232 -0.25685394 0.50206815 1.292856
4 (28,33] 6277 416 6693 0.06215449 13.493121 17.175888 -0.24132650 0.88874916 1.272937
5 (33,37] 4443 239 4682 0.05104656 9.550731 9.867878 -0.03266713 0.01036028 1.033207
6 (37,41] 4277 237 4514 0.05250332 9.193895 9.785301 -0.06234172 0.03686928 1.064326
7 (41,46] 4904 265 5169 0.05126717 10.541702 10.941371 -0.03721203 0.01487247 1.037913
8 (46,51] 4582 230 4812 0.04779717 9.849527 9.496284 0.03652287 0.01290145 1.037198
9 (51,57] 4039 197 4236 0.04650614 8.682287 8.133774 0.06526000 0.03579599 1.067437
10 (57,76] 3926 105 4031 0.02604813 8.439381 4.335260 0.66612734 2.73386708 1.946684
I need to add an additional column "Bin" that will show numbers from 1 to 10, depending on BR column being in descending order, so for example 10th row becomes first, then first row becomes second, etc.
Any help would be appreciated

A very straightforward way is to use one of the rank functions from "dplyr" (e.g. dense_rank, min_rank). Here, I've actually just used rank from base R inside mutate. I've dropped some columns from the output below for presentation purposes; mydf stands for your data frame (bbT11 in the question).
library(dplyr)
mydf %>% mutate(bin = rank(BR))
# range X0 X1 total BR ... Index bin
# 1 (1,23] 5718 194 5912 0.03281461 ... 1.534535 2
# 2 (23,26] 5249 330 5579 0.05915039 ... 1.207544 8
# 3 (26,28] 3105 209 3314 0.06306578 ... 1.292856 10
# 4 (28,33] 6277 416 6693 0.06215449 ... 1.272937 9
# 5 (33,37] 4443 239 4682 0.05104656 ... 1.033207 5
# 6 (37,41] 4277 237 4514 0.05250332 ... 1.064326 7
# 7 (41,46] 4904 265 5169 0.05126717 ... 1.037913 6
# 8 (46,51] 4582 230 4812 0.04779717 ... 1.037198 4
# 9 (51,57] 4039 197 4236 0.04650614 ... 1.067437 3
# 10 (57,76] 3926 105 4031 0.02604813 ... 1.946684 1
If you just want to reorder the rows, use arrange instead:
mydf %>% arrange(BR)

bbT11$Bin[order(bbT11$BR)] <- 1:nrow(bbT11)
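The one-liner above works because order(BR) returns the row positions sorted by BR, so writing 1..n into those positions numbers the rows from the smallest BR upwards. A minimal sketch on a made-up three-row data frame (toy and its values are hypothetical, not the asker's table) shows that this matches rank() when there are no ties, and that rank(-BR) gives the numbering from the largest BR downwards:

```r
# Toy data (hypothetical values, not the asker's table)
toy <- data.frame(range = c("a", "b", "c"), BR = c(0.03, 0.06, 0.02))

# order() returns the row positions sorted by BR: 3, 1, 2.
# Writing 1..n into those positions numbers the rows from smallest BR up.
toy$Bin[order(toy$BR)] <- seq_len(nrow(toy))

# With no ties this equals rank(); rank(-BR) numbers from largest BR down.
toy$BinRank <- rank(toy$BR)
toy$BinDesc <- rank(-toy$BR)

toy
```

If BR can contain ties, rank() returns averaged ranks by default, while the order() version always yields distinct integers.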


Pivot/Reshape data in R [duplicate]

Thank you all for your answers. I thought I was smarter than I am and hoped I would have understood some of them. I think I also messed up the presentation of my data, so I have edited my post to better show my sample data. Sorry for the inconvenience, and I truly hope that someone can help me.
I have a question about reshaping my data. The data collected looks as such:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID Time Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would I get there? I have looked up some things about wide to long format, but they don't seem to do the trick. I am relatively new to RStudio and Stack Overflow (if you couldn't tell that already).
Kind regards, and thank you in advance.
Here is a slightly different pivot_longer() version.
library(tidyr)
library(dplyr)
dw %>%
pivot_longer(cols = -PID, names_to =".value", names_pattern = "(.+)[0-9]")
# A tibble: 9 x 3
PID T measurement
<dbl> <dbl> <dbl>
1 1 1 100
2 1 4 200
3 1 7 50
4 2 2 150
5 2 5 300
6 2 8 60
7 3 3 120
8 3 6 210
9 3 9 70
The names_to = ".value" argument creates new columns from column names based on the names_pattern argument. The names_pattern argument takes a regular expression with capture groups. In this case, here is the breakdown:
(.+) # match everything - whatever this group captures becomes the ".value" column name
[0-9] # a single numeric character - tells the pattern that the digit
# at the end is excluded from ".value". If the suffixes can have several
# digits, [0-9*] will not help (it matches one digit or a literal "*");
# use a letters-only group instead, e.g. "([A-Za-z]+)[0-9]+"
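For suffixes with more than one digit, the greedy (.+) group needs some care. The following base R check is only a sketch: the column names in nms are made up for illustration, and sub() with anchors stands in for the way the capture group is extracted, so tidyr's own matching may differ in detail.

```r
# Hypothetical column names, including two-digit suffixes
nms <- c("T1", "measurement1", "T12", "measurement12")

# Pattern from the answer: the greedy (.+) swallows all but the last digit,
# so "T12" yields "T1" instead of "T".
greedy <- sub("^(.+)[0-9]$", "\\1", nms)

# Letters-only group followed by one or more digits keeps the stem intact.
letters_only <- sub("^([A-Za-z]+)[0-9]+$", "\\1", nms)

greedy
letters_only
```

With the letters-only group, T12 and measurement12 map to the same stems as T1 and measurement1.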
In the last edit you asked for a solution that is easy to understand. A very simple approach would be to stack the measurement columns on top of each other and the Tdays columns on top of each other. Although specialty packages make things very concise and elegant, for simplicity we can solve this without additional packages. Standard R has a convenient function aptly named stack, which works like this:
> exp <- data.frame(value1 = 1:5, value2 = 6:10)
> stack(exp)
values ind
1 1 value1
2 2 value1
3 3 value1
4 4 value1
5 5 value1
6 6 value2
7 7 value2
8 8 value2
9 9 value2
10 10 value2
We can stack measurements and Tdays separately and then combine them via cbind:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
cbind(stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
This keeps measurements and Tdays neatly together but leaves us without pid, which we can add by using rep to replicate the original pid column 4 times:
result <- cbind(pid = rep(data$pid, 4),
stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
The head of which looks like
> head(result)
pid values ind values ind
1 1 1356 measurement1 1435 Tdays1
2 2 943 measurement1 1848 Tdays1
3 3 1590 measurement1 185 Tdays1
4 4 130 measurement1 72 Tdays1
5 4 140 measurement1 82 Tdays1
6 4 220 measurement1 126 Tdays1
This is not in the order you expected; if that is a concern, you can sort the data.frame by pid and pick the columns you need:
result <- result[order(result$pid), c(1, 4, 2)]
names(result) <- c("pid", "Time", "Value")
leading to the final result
> head(result)
pid Time Value
1 1 1435 1356
13 1 1405 1483
25 1 1374 1563
37 1 NA NA
2 2 1848 943
14 2 1818 1173
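The sorted result still contains the rows where both Time and Value are NA (row 37 above), whereas the desired output in the question lists none. One way to drop them, sketched here on a small stand-in for result built from the first rows shown above, is complete.cases (result_clean is a name introduced for illustration):

```r
# Small stand-in for `result` (values copied from the head() output above)
result <- data.frame(pid   = c(1, 1, 1, 1, 2),
                     Time  = c(1435, 1405, 1374, NA, 1848),
                     Value = c(1356, 1483, 1563, NA, 943))

# Keep only rows where both Time and Value are present
result_clean <- result[complete.cases(result[, c("Time", "Value")]), ]

# Renumber the rows after filtering
rownames(result_clean) <- NULL
result_clean
```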
tidyverse solution
library(tidyverse)
dw %>%
pivot_longer(-PID) %>%
mutate(name = gsub('^([A-Za-z]+)(\\d+)$', '\\1_\\2', name )) %>%
separate(name, into = c('A', 'B'), sep = '_', convert = T) %>%
pivot_wider(names_from = A, values_from = value)
Gives the following output
# A tibble: 9 x 4
PID B T measurement
<int> <int> <int> <int>
1 1 1 1 100
2 1 2 4 200
3 1 3 7 50
4 2 1 2 150
5 2 2 5 300
6 2 3 8 60
7 3 1 3 120
8 3 2 6 210
9 3 3 9 70
Considering a dataframe, df like the following:
PID T1 measurement1 T2 measurement2 T3 measurement3
1 1 100 4 200 7 50
2 2 150 5 300 8 60
3 3 120 6 210 9 70
You can use this solution to get your required dataframe:
iters = seq(from = 4, to = length(colnames(df))-1, by = 2)
finalDf = df[, c(1,2,3)]
for(j in iters){
tobind = df[, c(1,j,j+1)]
finalDf = rbind(finalDf, tobind)
}
finalDf = finalDf[order(finalDf[,1]),]
print(finalDf)
The output of the print statement is this:
PID T1 measurement1
1 1 1 100
4 1 4 200
7 1 7 50
2 2 2 150
5 2 5 300
8 2 8 60
3 3 3 120
6 3 6 210
9 3 9 70
Maybe you can try reshape like below
reshape(
setNames(data, gsub("(\\d+)$", "\\.\\1", names(data))),
direction = "long",
varying = 2:ncol(data)
)
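To see what the reshape call above relies on: with direction = "long", reshape can guess the variable names and time indices from column names of the form name.index, which is why a "." is inserted before the trailing digits first. A minimal sketch on a made-up two-row data frame (toy and its values are hypothetical):

```r
# Toy wide data (hypothetical values)
toy <- data.frame(pid = 1:2,
                  measurement1 = c(10, 20), Tdays1 = c(1, 2),
                  measurement2 = c(30, 40), Tdays2 = c(3, 4))

# Insert a "." before the trailing digits: measurement1 -> measurement.1
names(toy) <- sub("(\\d+)$", ".\\1", names(toy))

# reshape() splits each varying name at the "." into variable and time index
long <- reshape(toy, direction = "long", varying = 2:ncol(toy))
long
```

The result has one row per pid and time index, with measurement and Tdays as ordinary columns plus the time and id columns that reshape adds.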

R Time Series, date isn't being read properly

I have this data that I want to plot as a time series.
Date Units.Sold
1 Jan-16 588
2 Feb-16 448
3 Mar-16 490
4 Apr-16 512
5 May-16 528
6 Jun-16 432
7 Jul-16 470
8 Aug-16 446
9 Sep-16 465
10 Oct-16 388
11 Nov-16 429
12 Dec-16 414
However, when I use ts(datasetName), I get this:
Time Series:
Start = 1
End = 12
Frequency = 1
Date Units.Sold
1 5 588
2 4 448
3 8 490
4 1 512
5 9 528
6 7 432
7 6 470
8 2 446
9 12 465
10 11 388
11 10 429
12 3 414
As you can see, the dates are in the wrong order. I want January to correspond with 1, February with 2, and so on. Can anybody help?
You need to convert your column named 'Date' to a Date-class object first. You can use as.Date for that, but you'll need to add a year first (the format below treats the "16" in "Jan-16" as the day of the month).
your_year <- 2018
df$Date <- as.Date(paste0(df$Date, '-', your_year), format = '%b-%d-%Y')
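The numbers 5, 4, 8, ... in the ts output are presumably the integer codes of a factor: if the Date column was read in as a factor, its levels are sorted alphabetically, and ts() only sees those codes. If the "16" is in fact a two-digit year rather than a day (an assumption about the data), a monthly series can be built from Units.Sold alone. The sketch below uses the first three rows of the question's data; month_num, year and sales are names introduced for illustration.

```r
# First three rows of the question's data, with Date stored as a factor
df <- data.frame(Date = c("Jan-16", "Feb-16", "Mar-16"),
                 Units.Sold = c(588, 448, 490),
                 stringsAsFactors = TRUE)

# Factor levels are sorted alphabetically, so the codes are not month numbers
as.integer(df$Date)   # 2 1 3  (Feb = 1, Jan = 2, Mar = 3)

# Month number from the English abbreviation (month.abb is locale-independent)
month_num <- match(sub("-.*$", "", as.character(df$Date)), month.abb)

# Assumption: "16" is a two-digit year, i.e. 2016
year <- 2000 + as.integer(sub("^.*-", "", as.character(df$Date)))

# Monthly series starting at the first observation
sales <- ts(df$Units.Sold, start = c(year[1], month_num[1]), frequency = 12)
sales
```

Printing or plotting sales then shows the months in calendar order.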

Stacking time series data vertically

I am struggling with manipulation of time series data. The first column of the dataset contains the time points of data collection, and the second column onwards contain data from different studies. I have several hundred studies. As an example I have included sample data for 5 studies. I want to stack the dataset vertically with time and data points for each study. The example data set looks like this:
TIME Study1 Study2 Study3 Study4 Study5
0.00 52.12 53.66 52.03 50.36 51.34
90.00 49.49 51.71 49.49 48.48 50.19
180.00 47.00 49.83 47.07 46.67 49.05
270.00 44.63 48.02 44.77 44.93 47.95
360.00 42.38 46.28 42.59 43.25 46.87
450.00 40.24 44.60 40.50 41.64 45.81
540.00 38.21 42.98 38.53 40.08 44.78
I am looking for an output in the form of:
TIME Study ID
0 52.12 1
90 49.49 1
180 47 1
270 44.63 1
360 42.38 1
450 40.24 1
540 38.21 1
0 53.66 2
90 51.71 2
180 49.83 2
270 48.02 2
360 46.28 2
450 44.6 2
540 42.98 2
0 52.03 3
90 49.49 3
180 47.07 3
270 44.77 3
...
This is a classic 'wide to long' dataset manipulation. Below, I show the use of the base function ?reshape for your data:
d.l <- reshape(d, varying=list(c("Study1","Study2","Study3","Study4","Study5")),
v.names="Y", idvar="TIME", times=1:5, timevar="Study",
direction="long")
d.l <- d.l[,c(2,1,3)]
rownames(d.l) <- NULL
d.l
# Study TIME Y
# 1 1 0 52.12
# 2 1 90 49.49
# 3 1 180 47.00
# 4 1 270 44.63
# 5 1 360 42.38
# 6 1 450 40.24
# 7 1 540 38.21
# 8 2 0 53.66
# 9 2 90 51.71
# 10 2 180 49.83
# 11 2 270 48.02
# 12 2 360 46.28
# 13 2 450 44.60
# 14 2 540 42.98
# 15 3 0 52.03
# 16 3 90 49.49
# 17 3 180 47.07
# ...
However, there are many ways to do this in R: the most basic reference on SO (of which this is probably a duplicate) is Reshaping data.frame from wide to long format, but there are many other relevant threads (see this search: [r] wide to long). Beyond using reshape, @lmo's method can be used, as well as methods based on the reshape2, tidyr, and data.table packages (presumably among others).
Here is one method using cbind and stack:
longdf <- cbind(df$TIME, stack(df[, -1]))
names(longdf) <- c("TIME", "Study", "id")
This returns
longdf
TIME Study id
1 0 52.12 Study1
2 90 49.49 Study1
3 180 47.00 Study1
4 270 44.63 Study1
5 360 42.38 Study1
6 450 40.24 Study1
7 540 38.21 Study1
8 0 53.66 Study2
9 90 51.71 Study2
...
If you want to change id to integers as in your example, use
longdf$id <- as.integer(longdf$id)
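Note that as.integer on a factor returns the position of each level, not the number inside the label. The two agree here because the columns are Study1 to Study5 in order, but they would differ if a column were missing or out of order. A small sketch, using a hypothetical id column with a gap in the numbering, of extracting the number from the label instead:

```r
# Hypothetical id factor: Study10 is the third level
longdf <- data.frame(id = factor(c("Study1", "Study2", "Study10"),
                                 levels = c("Study1", "Study2", "Study10")))

# Level codes: the position of each level
codes <- as.integer(longdf$id)                               # 1 2 3

# Number embedded in the label
study_no <- as.integer(sub("^Study", "", as.character(longdf$id)))  # 1 2 10

codes
study_no
```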

row average of columns that match string

I have a data frame below and I want to find the average row value for all columns with header *R and all columns with *G.
The output should then be four columns: Rfam, Classes, avg.rowR, avg.rowG
I was playing around with the rowMeans() function, but I am not sure how to specify the columns.
Rfam Classes 26G 26R 35G 35R 46G 46R 48G 48R 55G 55R
5_8S_rRNA rRNA 63 39 8 27 26 17 28 43 41 17
5S_rRNA rRNA 171 149 119 109 681 47 95 161 417 153
7SK 7SK 53 282 748 371 248 42 425 384 316 198
ACA64 Other 7 8 19 2 10 1 36 10 10 4
let-7 miRNA 121825 73207 25259 75080 54301 63510 30444 53800 78961 47533
lin-4 miRNA 10149 16263 5629 19680 11297 37866 3816 9677 11713 10068
Metazoa_SRP SRP 317 1629 1008 418 1205 407 1116 1225 1413 1075
mir-1 miRNA 3 4 1 2 0 26 1 1 0 4
mir-10 miRNA 912163 1411287 523793 1487160 517017 1466085 107597 551381 727720 788201
mir-101 miRNA 461 320 199 553 174 460 278 297 256 254
mir-103 miRNA 937 419 202 497 318 217 328 343 891 439
mir-1180 miRNA 110 32 4 17 53 47 6 29 35 22
mir-1226 miRNA 11 3 0 3 6 0 1 2 5 4
mir-1237 miRNA 3 2 1 1 0 1 0 2 1 1
mir-1249 miRNA 5 14 2 9 4 5 9 5 7 7
newcols <- sapply(c("R$", "G$"), function(x) rowMeans(df[grep(x, names(df))]))
setNames(cbind(df[1:2], newcols), c(names(df)[1:2], "avg.rowR", "avg.rowG"))
# Rfam Classes avg.rowR avg.rowG
# 1 5_8S_rRNA rRNA 28.6 33.2
# 2 5S_rRNA rRNA 123.8 296.6
# 3 7SK 7SK 255.4 358.0
# 4 ACA64 Other 5.0 16.4
# 5 let-7 miRNA 62626.0 62158.0
# 6 lin-4 miRNA 18710.8 8520.8
# 7 Metazoa_SRP SRP 950.8 1011.8
# 8 mir-1 miRNA 7.4 1.0
# 9 mir-10 miRNA 1140822.8 557658.0
# 10 mir-101 miRNA 376.8 273.6
# 11 mir-103 miRNA 383.0 535.2
# 12 mir-1180 miRNA 29.4 41.6
# 13 mir-1226 miRNA 2.4 4.6
# 14 mir-1237 miRNA 1.4 1.0
# 15 mir-1249 miRNA 8.0 5.4
One way to look for patterns in column names is to use the grep family of functions. The function call grep("R$", names(df)) will return the index of all column names that end with R. When we use it with sapply we can search for the R and G columns in one expression.
The core of the second line is cbind(df[1:2], newcols). That is the binding of the first two columns of df and the two new columns of mean values. Wrapping it with setNames(..., c(names(df)[1:2], ...)) sets the column names to match your desired output.
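The column selection can also be checked step by step. The sketch below uses the first two rows and four of the count columns from the question; check.names = FALSE is an assumption made here so that names such as 26G are kept as-is (by default R would prefix them with X, which still ends in G or R). r_cols, g_cols and out are names introduced for illustration.

```r
# First two rows and four count columns from the question's data
df <- data.frame(Rfam = c("5_8S_rRNA", "5S_rRNA"),
                 Classes = c("rRNA", "rRNA"),
                 `26G` = c(63, 171), `26R` = c(39, 149),
                 `35G` = c(8, 119),  `35R` = c(27, 109),
                 check.names = FALSE)

# Positions of the columns whose names end in "R" or "G"
r_cols <- grep("R$", names(df))   # 4 6
g_cols <- grep("G$", names(df))   # 3 5

# Row means over each group, next to the two identifying columns
out <- data.frame(df[1:2],
                  avg.rowR = rowMeans(df[r_cols]),
                  avg.rowG = rowMeans(df[g_cols]))
out
```

Rfam and Classes are not picked up because the patterns are anchored to the end of the name with $.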

How to transform particular rows into columns in R

I'm new to R and my question might seem easy for most of you. I have a data like this
> data.frame(table(dat),total)
AGEintervals mytest.G_B_FLAG Freq total
1 (1,23] 0 5718 5912
2 (23,26] 0 5249 5579
3 (26,28] 0 3105 3314
4 (28,33] 0 6277 6693
5 (33,37] 0 4443 4682
6 (37,41] 0 4277 4514
7 (41,46] 0 4904 5169
8 (46,51] 0 4582 4812
9 (51,57] 0 4039 4236
10 (57,76] 0 3926 4031
11 (1,23] 1 194 5912
12 (23,26] 1 330 5579
13 (26,28] 1 209 3314
14 (28,33] 1 416 6693
15 (33,37] 1 239 4682
16 (37,41] 1 237 4514
17 (41,46] 1 265 5169
18 (46,51] 1 230 4812
19 (51,57] 1 197 4236
20 (57,76] 1 105 4031
As you might have noticed, the age intervals start repeating at row 11.
All I need is to get 10 rows, with the 0's and 1's in different columns, like this:
AGEintervals 1 0 total
1 (1,23] 194 5718 5912
2 (23,26] 330 5249 5579
3 (26,28] 209 3105 3314
4 (28,33] 416 6277 6693
5 (33,37] 239 4443 4682
6 (37,41] 237 4277 4514
7 (41,46] 265 4904 5169
8 (46,51] 230 4582 4812
9 (51,57] 197 4039 4236
10 (57,76] 105 3926 4031
Many thanks
This is a straightforward "long" to "wide" transformation that is easy to achieve with reshape from base R:
reshape(mydf, idvar = c("AGEintervals", "total"),
timevar = "mytest.G_B_FLAG", direction = "wide")
# AGEintervals total Freq.0 Freq.1
# 1 (1,23] 5912 5718 194
# 2 (23,26] 5579 5249 330
# 3 (26,28] 3314 3105 209
# 4 (28,33] 6693 6277 416
# 5 (33,37] 4682 4443 239
# 6 (37,41] 4514 4277 237
# 7 (41,46] 5169 4904 265
# 8 (46,51] 4812 4582 230
# 9 (51,57] 4236 4039 197
# 10 (57,76] 4031 3926 105
Other alternatives include:
reshape2
library(reshape2)
dcast(mydf, ... ~ mytest.G_B_FLAG, value.var='Freq')
tidyr
library(tidyr)
spread(mydf, mytest.G_B_FLAG, Freq)
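Another base R route, offered here only as a sketch, is xtabs, which cross-tabulates Freq by interval and flag; the total can then be recomputed from the two count columns. The data frame below holds just the first two intervals from the question, and wide is a name introduced for illustration.

```r
# First two age intervals from the question, in long form
mydf <- data.frame(AGEintervals = rep(c("(1,23]", "(23,26]"), 2),
                   mytest.G_B_FLAG = rep(c(0, 1), each = 2),
                   Freq = c(5718, 5249, 194, 330),
                   total = rep(c(5912, 5579), 2))

# Sum Freq for every interval/flag combination, then convert to a data frame
wide <- as.data.frame.matrix(
  xtabs(Freq ~ AGEintervals + mytest.G_B_FLAG, data = mydf))

# Recompute the total from the two count columns
wide$total <- rowSums(wide)
wide
```

Here the intervals end up as row names and the flag values 0 and 1 as column names.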
Update
This problem is possibly avoidable in the first place.
Run the following example code and compare the output at each stage:
## Create some sample data
set.seed(1)
dat <- data.frame(V1 = sample(letters[1:3], 20, TRUE),
V2 = sample(c(0, 1), 20, TRUE))
## View the output
dat
## Look what happens when we use `data.frame` on a `table`
data.frame(table(dat))
## Compare it with `as.data.frame.matrix`
as.data.frame.matrix(table(dat))
## The total can be added automatically with `addmargins`
as.data.frame.matrix(addmargins(table(dat), 2, sum))
