Ranking data that have the same values [duplicate] - r

I have a large data set including a column of counts for different genetic markers. I want to generate an overall ranking that takes into account the count number regardless of the genetic marker. For instance if 2 or more genetic markers all have a count of 5 they should all have the same rank number and I want the rank numbers to be displayed in a separate column. I have this dataframe;
SNP count
a1 26
a2 18
a3 16
a4 15
a5 14
a6 14
a7 14
a8 15
a9 13
a10 12
a11 12
a12 11
a13 10
a14 9
a15 8
I want the output to be:
SNP count rank
a1 26 1
a2 18 2
a3 16 3
a4 15 4
a8 15 4
a5 14 5
a6 14 5
a7 14 5
a9 13 7
a10 12 8
a11 12 8
a12 11 9
a13 10 10
a14 9 11
a15 8 12
Note that SNPs a4 and a8 have the same count, as do a5, a6 and a7, and also a10 and a11. I've tried
transform(df, x= ave(count,FUN=function(x) order(x,decreasing=T)))
but it's not what I want.

What you are looking for is the rleid function from the data.table package.
data.table::rleid(df$count)
[1] 1 2 3 4 5 5 5 6 7 8 8 9 10 11 12
df is obtained like so:
df <- read.table(text ="SNP count
a1 26
a2 18
a3 16
a4 15
a5 14
a6 14
a7 14
a8 15
a9 13
a10 12
a11 12
a12 11
a13 10
a14 9
a15 8",
stringsAsFactors =FALSE,
header = TRUE)
And for thoroughness:
df$rank <- data.table::rleid(df$count)
df
SNP count rank
1 a1 26 1
2 a2 18 2
3 a3 16 3
4 a4 15 4
5 a5 14 5
6 a6 14 5
7 a7 14 5
8 a8 15 6
9 a9 13 7
10 a10 12 8
11 a11 12 8
12 a12 11 9
13 a13 10 10
14 a14 9 11
15 a15 8 12
Edit:
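Thanks to @Frank, a better solution is to order the rows by descending count before applying rleid, so that equal counts are adjacent and share a rank even when they are not next to each other in the original data:
library(data.table)
setDT(df)[order(-count), rank := rleid(count)]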
Which gives:
df
SNP count rank
1: a1 26 1
2: a2 18 2
3: a3 16 3
4: a4 15 4
5: a5 14 5
6: a6 14 5
7: a7 14 5
8: a8 15 4
9: a9 13 6
10: a10 12 7
11: a11 12 7
12: a12 11 8
13: a13 10 9
14: a14 9 10
15: a15 8 11
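For comparison, the same "dense" ranking can be sketched in base R without data.table. This is only an illustration of the idea, not part of the original answer: each count is looked up in the vector of distinct counts sorted from largest to smallest, so ties share a rank and no rank numbers are skipped.

```r
# Rebuild the example data from the question
df <- data.frame(
  SNP = paste0("a", 1:15),
  count = c(26, 18, 16, 15, 14, 14, 14, 15, 13, 12, 12, 11, 10, 9, 8)
)

# Dense rank: position of each count among the distinct counts, largest first
df$rank <- match(df$count, sort(unique(df$count), decreasing = TRUE))

# Show the rows in rank order, as in the desired output
df[order(df$rank), ]
```

If data.table or dplyr is already loaded, data.table::frank(-df$count, ties.method = "dense") or dplyr::dense_rank(dplyr::desc(df$count)) should give the same ranks.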

Related

R create new column based on data range at a certain time point

I have a large data frame (>50 columns). A sample of the relevant columns is here:
tb <- data.frame(RowID=c("A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10", "A11", "A12", "A13", "A14", "A15"),
Patient=c("001", "001", "001", "002", "002", "035", "035", "035", "035", "035", "100", "100", "105", "105", "105"),
Time=c(1,2,3,1,2,1,2,3,4,5,1,2,1,2,3),
Value=c(NA,10,23,100,30,10,15,NA,60,56.7,30,51,3,13,77))
I am trying to create a new column (Value_status) that ranks the initial value for each patient as either low or high (Value <50, Value >=50). The Value_status should be carried through to the other rows for that patient.
Here's what I have:
tb %>%
group_by(Patient) %>%
mutate(Value_status = if_else(Time == 1 & Value < 50, "low", "high"))
I thought I had solved it by adding group_by, but it doesn't give the same value for each individual patient as I hoped. I think I need to nest the if_else with more conditions, something like this?
Note: If a patient is missing Value at a time point other than 1, then they can still be grouped according to high/low.
tb %>%
group_by(Patient) %>%
mutate(Value_status = if_else(Time == 1 & Value < 50, "low",
if_else(Time == 1 & Value >= 50, "high",
if_else(#Apply the value from time point 1#))))
The output I am trying to get should look like this:
It should group patients based on whether or not their baseline values are high
RowID Patient Time Value Value_status
1 A1 001 1 NA <NA>
2 A2 001 2 10.0 <NA>
3 A3 001 3 23.0 <NA>
4 A4 002 1 100.0 high
5 A5 002 2 30.0 high
6 A6 035 1 10.0 low
7 A7 035 2 15.0 low
8 A8 035 3 NA low
9 A9 035 4 60.0 low
10 A10 035 5 56.7 low
11 A11 100 1 30.0 low
12 A12 100 2 51.0 low
13 A13 105 1 3.0 low
14 A14 105 2 13.0 low
15 A15 105 3 77.0 low
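Instead of nesting if_else, we can use case_when, which accepts multiple conditions. Rows that match none of them (any Time other than 1, or a missing baseline Value) get NA. We then group by 'Patient' and use fill to replace the NA elements of 'Value_status' with the previous non-NA value within each patient.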
library(dplyr)
library(tidyr)
tb %>%
mutate(Value_status = case_when(Time == 1 & Value < 50 ~ "low",
Time == 1 & Value >= 50 ~ "high"
)) %>%
group_by(Patient) %>%
fill(Value_status) %>%
ungroup
-output
# A tibble: 15 x 5
RowID Patient Time Value Value_status
<chr> <chr> <dbl> <dbl> <chr>
1 A1 001 1 NA <NA>
2 A2 001 2 10 <NA>
3 A3 001 3 23 <NA>
4 A4 002 1 100 high
5 A5 002 2 30 high
6 A6 035 1 10 low
7 A7 035 2 15 low
8 A8 035 3 NA low
9 A9 035 4 60 low
10 A10 035 5 56.7 low
11 A11 100 1 30 low
12 A12 100 2 51 low
13 A13 105 1 3 low
14 A14 105 2 13 low
15 A15 105 3 77 low
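The same baseline lookup can also be sketched in base R. This is a minimal illustration rather than the answerer's method, and it assumes every patient has exactly one row with Time == 1; the intermediate names baseline and status are made up for the example.

```r
# Relevant columns of the example data from the question
tb <- data.frame(
  Patient = c("001", "001", "001", "002", "002", "035", "035", "035", "035", "035",
              "100", "100", "105", "105", "105"),
  Time    = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3),
  Value   = c(NA, 10, 23, 100, 30, 10, 15, NA, 60, 56.7, 30, 51, 3, 13, 77)
)

# One baseline row per patient (assumption: exactly one Time == 1 row each)
baseline <- tb[tb$Time == 1, ]

# Classify each baseline value; a missing baseline stays NA
status <- setNames(ifelse(baseline$Value < 50, "low", "high"),
                   as.character(baseline$Patient))

# Look up each row's patient to carry the status through to all of its rows
tb$Value_status <- unname(status[as.character(tb$Patient)])
tb
```

Because the status is looked up by patient rather than filled downwards, the result does not depend on the row order within a patient.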
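Here is a solution with a nested ifelse. Note that, as the output shows, it labels only the Time == 1 rows (plus an illustrative "medium" branch) and does not carry the status through to the patient's other rows: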
tb %>%
mutate(Value_status = ifelse(Time != 1 & Value ==10, "medium",
ifelse(Time == 1 & Value < 50, "low",
ifelse(Time == 1 & Value >= 50, "high", NA)
)
))
Output:
RowID Patient Time Value Value_status
1 A1 001 1 NA <NA>
2 A2 001 2 10 medium
3 A3 001 3 23 <NA>
4 A4 002 1 100 high
5 A5 002 2 30 <NA>
6 A6 035 1 10 low
7 A7 035 2 15 <NA>
8 A8 035 3 NA <NA>
9 A9 035 4 60 <NA>
10 A10 035 5 57 <NA>
11 A11 100 1 30 low
12 A12 100 2 51 <NA>
13 A13 105 1 3 low
14 A14 105 2 13 <NA>
15 A15 105 3 77 <NA>

R: Dataframe Manipulation

I have the following data frame:
ID  COUNT OF STOCK  YEAR
A1  10              2000
A1  20              2000
A1  18              2000
A1  15              2001
A1  30              2001
A2  35              2002
A2  50              2001
A2  10              2002
A2  22              2002
A3  11              2001
A3  15              2001
A3  28              2000
I would like to change the data frame to the one shown below by grouping by ID and YEAR, summing COUNT OF STOCK within each group, and replacing YEAR with the number of years from 2020 (2020 - YEAR):
ID  Sum of COUNT OF STOCK  number of years from 2020 (2020-year)
A1  48                     20
A1  45                     19
A2  67                     18
A2  50                     19
A3  26                     19
A3  28                     20
Thanks in advance!!
This is pretty straightforward. Because the column names contain spaces, you have to quote them with backticks, which can be a little awkward.
dat %>% group_by( ID, YEAR ) %>%
summarise(
`Sum of COUNT OF STOCK` = sum( `COUNT OF STOCK` ),
`number of years from 2020 (2020-year)` = 2020 - first(YEAR)
) %>% select( -YEAR )
Output:
ID `Sum of COUNT OF STOCK` `number of years from 2020 (2020-year)`
<chr> <int> <dbl>
1 A1 48 20
2 A1 45 19
3 A2 50 19
4 A2 67 18
5 A3 28 20
6 A3 26 19
Simply do this.
df %>% group_by(D, number_of_years = 2020 - YEAR) %>%
summarise(Sum_of_stock = sum(COUNT_OF_STOCK))
# A tibble: 6 x 3
# Groups: D [3]
D number_of_years Sum_of_stock
<chr> <dbl> <int>
1 A1 19 45
2 A1 20 48
3 A2 18 67
4 A2 19 50
5 A3 19 26
6 A3 20 28
data
df <- read.table(text = "D COUNT_OF_STOCK YEAR
A1 10 2000
A1 20 2000
A1 18 2000
A1 15 2001
A1 30 2001
A2 35 2002
A2 50 2001
A2 10 2002
A2 22 2002
A3 11 2001
A3 15 2001
A3 28 2000", header = T)
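For comparison, the same grouped sum can be sketched in base R with aggregate(). This is only an illustration: it assumes the underscore column names from the data block above, with the first column called ID rather than D, and the names res and years_from_2020 are made up for the example.

```r
# Example data (same values as above; the ID column name is an assumption)
df <- read.table(text = "ID COUNT_OF_STOCK YEAR
A1 10 2000
A1 20 2000
A1 18 2000
A1 15 2001
A1 30 2001
A2 35 2002
A2 50 2001
A2 10 2002
A2 22 2002
A3 11 2001
A3 15 2001
A3 28 2000", header = TRUE)

# Sum the stock count within each ID/YEAR combination
res <- aggregate(COUNT_OF_STOCK ~ ID + YEAR, data = df, FUN = sum)

# Derive the number of years from 2020 and sort for readability
res$years_from_2020 <- 2020 - res$YEAR
res <- res[order(res$ID, res$YEAR), ]
res
```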

Split up a dataframe by number of NAs in each row

Consider a data frame with thousands of rows and columns that includes several NAs. I'd like to split this data frame into smaller ones based on the number of NAs in each row: all rows that contain the same number of NAs should end up in the same group. The new data frames are then saved separately.
> DF
ID C1 C2 C3 C4 C5
aa 12 13 10 NA 12
ff 12 NA NA 23 13
ee 67 23 NA NA 21
jj 31 14 NA 41 11
ss NA 15 11 12 11
The desired output will be:
> DF_chunk_1
ID C1 C2 C3 C4 C5
aa 12 13 10 NA 12
jj 31 14 NA 41 11
ss NA 15 11 12 11
> DF_chunk_2
ID C1 C2 C3 C4 C5
ff 12 NA NA 23 13
ee 67 23 NA NA 21
I appreciate any suggestion.
Try the following, which incorporates some useful comments. Use apply() to count the NAs in each row and split() to group the rows by that count:
#Code
new <- split(DF,apply(DF[,-1],1,function(x)sum(is.na(x))))
Output:
$`1`
ID C1 C2 C3 C4 C5
1 aa 12 13 10 NA 12
4 jj 31 14 NA 41 11
5 ss NA 15 11 12 11
$`2`
ID C1 C2 C3 C4 C5
2 ff 12 NA NA 23 13
3 ee 67 23 NA NA 21
A more practical way (many thanks and credit to @RuiBarradas):
#Code2
new <- split(DF, rowSums(is.na(DF[-1])))
Same output.
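The question also asks for the pieces to be saved separately. A minimal sketch of that step, assuming CSV output and a hypothetical DF_chunk_<n>.csv naming scheme written to a temporary directory:

```r
# Example data from the question
DF <- data.frame(
  ID = c("aa", "ff", "ee", "jj", "ss"),
  C1 = c(12, 12, 67, 31, NA),
  C2 = c(13, NA, 23, 14, 15),
  C3 = c(10, NA, NA, NA, 11),
  C4 = c(NA, 23, NA, 41, 12),
  C5 = c(12, 13, 21, 11, 11)
)

# Split the rows by their NA count (ID column excluded from the count)
chunks <- split(DF, rowSums(is.na(DF[-1])))

# One file per chunk; the file name pattern and directory are assumptions
out_dir <- tempdir()
paths <- file.path(out_dir, paste0("DF_chunk_", names(chunks), ".csv"))
for (i in seq_along(chunks)) {
  write.csv(chunks[[i]], paths[i], row.names = FALSE)
}
```

The list names are the NA counts themselves, so each file name records how many NAs its rows contain.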

How to insert a place holder in a list, where pieces of data are missing using R?

I have a set of data from an experiment that I have to analyse. Because it also contains a lot of data that is not important to me, I want to tidy the files up a bit using R, as doing so manually is too much work.
The data in those .csv files comes from time course experiments, so the order of the measurements matters and the values have to stay in a specific order.
Until now, I have already managed to select all the columns that I need and to sort them by the different conditions using the following code:
used_columns <- select(df,
ImageNumber,
FrameNumber,
Treatment,
Intensity1,
Intensity2)
used_columns.t <- as_tibble(used_columns)
df_sorted <- used_columns.t %>%
filter(Treatment == "B2") %>%
.[order(as.integer(.$FrameNumber),decreasing = FALSE), ]
Using this code, df_sorted yields a data frame that looks like this:
ImageNumber FrameNumber Treatment Intensity1 Intensity2
1 1 B2 1598,45 0,14
2 1 B2 930,40 0,11
3 1 B2 107,86 0,04
4 1 B2 881,09 0,11
7 1 B2 2201,98 0,15
8 1 B2 161,30 0,04
9 1 B2 1208,14 0,17
4 2 B2 831,75 0,12
5 2 B2 1027,41 0,14
7 2 B2 2052,16 0,15
8 2 B2 159,63 0,05
9 2 B2 1111,49 0,16
10 2 B2 1312,15 0,12
1 3 B2 863,79 0,10
2 3 B2 104,06 0,04
3 3 B2 816,02 0,11
4 3 B2 1053,02 0,14
5 3 B2 132,32 0,03
6 3 B2 2059,03 0,14
7 3 B2 153,49 0,04
8 3 B2 1118,69 0,15
9 3 B2 1632,66 0,18
10 3 B2 1302,15 0,12
However, I would like to have a table like this, where the missing values are indicated as NA (or whatever other placeholder):
ImageNumber FrameNumber Treatment Intensity1 Intensity2
1 1 B2 1598,45 0,14
2 1 B2 930,40 0,11
3 1 B2 107,86 0,04
4 1 B2 881,09 0,11
5 NA NA NA NA
6 NA NA NA NA
7 1 B2 2201,98 0,15
8 1 B2 161,30 0,04
9 1 B2 1208,14 0,17
10 NA NA NA NA
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 2 B2 831,75 0,12
5 2 B2 1027,41 0,14
6 NA NA NA NA
7 2 B2 2052,16 0,15
8 2 B2 159,63 0,05
9 2 B2 1111,49 0,16
10 2 B2 1312,15 0,12
1 3 B2 863,79 0,10
2 3 B2 104,06 0,04
3 3 B2 816,02 0,11
4 3 B2 1053,02 0,14
5 3 B2 132,32 0,03
6 3 B2 2059,03 0,14
7 3 B2 153,49 0,04
8 3 B2 1118,69 0,15
9 3 B2 1632,66 0,18
10 3 B2 1302,15 0,12
This is just a very short extract of my table; in reality, depending on the condition, the ImageNumber may go up to 1441. Is there a way to solve this problem?
I would be very grateful if anybody could help me with this!
Here is a split-apply-combine approach in base R
out <- do.call(rbind,
by(
data = df1,
INDICES = df1$FrameNumber,
FUN = merge,
y = data.frame(ImageNumber = seq(min(df1$ImageNumber), max(df1$ImageNumber))),
all.y = TRUE
))
out
# ImageNumber FrameNumber Treatment Intensity1 Intensity2
#1.1 1 1 B2 1598,45 0,14
#1.2 2 1 B2 930,40 0,11
#1.3 3 1 B2 107,86 0,04
#1.4 4 1 B2 881,09 0,11
#1.5 5 NA <NA> <NA> <NA>
#1.6 6 NA <NA> <NA> <NA>
#1.7 7 1 B2 2201,98 0,15
#1.8 8 1 B2 161,30 0,04
#1.9 9 1 B2 1208,14 0,17
#1.10 10 NA <NA> <NA> <NA>
#2.1 1 NA <NA> <NA> <NA>
#2.2 2 NA <NA> <NA> <NA>
#2.3 3 NA <NA> <NA> <NA>
#2.4 4 2 B2 831,75 0,12
# ...
We split your data by FrameNumber and merge each list element with a data frame that contains a single column called ImageNumber. That column contains the values from min(df1$ImageNumber) to max(df1$ImageNumber), that is, from 1 to 10 in your example. The argument all.y = TRUE, which belongs to merge, turns implicit missing values into explicit missing values.
Finally we combine the list back to a data frame with do.call(rbind, ...).
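A variation on the same idea, sketched with made-up toy values rather than the real measurements: build every ImageNumber/FrameNumber combination with expand.grid() and left-join the data onto it. Unlike the desired output in the question, this keeps FrameNumber filled in on the placeholder rows, which may or may not be what you want.

```r
# Toy data (hypothetical values): image 3 is missing in frame 1,
# images 1 and 4 are missing in frame 2
df1 <- data.frame(
  ImageNumber = c(1, 2, 4, 2, 3),
  FrameNumber = c(1, 1, 1, 2, 2),
  Intensity1  = c(10.5, 20.1, 30.7, 40.2, 50.9)
)

# Every combination of image number and frame number
grid <- expand.grid(
  ImageNumber = seq(min(df1$ImageNumber), max(df1$ImageNumber)),
  FrameNumber = sort(unique(df1$FrameNumber))
)

# Keep all grid rows; combinations without a measurement get NA
out <- merge(grid, df1, by = c("ImageNumber", "FrameNumber"), all.x = TRUE)
out <- out[order(out$FrameNumber, out$ImageNumber), ]
out
```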

How to extract date based on condition over two different variables in R

I have a dataset of 100 observations containing the patient ID, drug code and prescription date. I want to create a new column "index date", which is the date when the patient changed drug for the third time.
PatientID DrugCode Prescriptiondate
A1 3 07-08-2014
A1 3 08-09-2014
A1 7 19-09-2014
A1 5 30-09-2014
A2 4 11-07-2014
A2 4 21-07-2014
A2 3 13-08-2014
A2 5 26-08-2014
A2 5 30-09-2014
A3 2 16-08-2014
A3 3 17-09-2014
A4 5 08-06-2014
A4 5 29-06-2014
A4 6 20-08-2014
A4 6 24-09-2014
A4 4 22-10-2014
A4 4 25-10-2014
The data set should look like this:
PatientID DrugCode Prescriptiondate IndexDate
A1 3 07-08-2014 30-09-2014
A1 3 08-09-2014 30-09-2014
A1 7 19-09-2014 30-09-2014
A1 5 30-09-2014 30-09-2014
A2 4 11-07-2014 26-08-2014
A2 4 21-07-2014 26-08-2014
A2 3 13-08-2014 26-08-2014
A2 5 26-08-2014 26-08-2014
A2 5 30-09-2014 26-08-2014
A3 2 16-08-2014 NA
A3 3 17-09-2014 NA
A4 5 08-06-2014 22-10-2014
A4 5 29-06-2014 22-10-2014
A4 6 20-08-2014 22-10-2014
A4 6 24-09-2014 22-10-2014
A4 4 22-10-2014 22-10-2014
A4 4 25-10-2014 22-10-2014
In the above case, patients A1 and A2 changed to their third drug (drug 5) on 30-09-2014 and 26-08-2014 respectively; A3 never reached a third drug; and A4 changed to drug 4 on 22-10-2014. So the index dates should be 30-09-2014, 26-08-2014, NA and 22-10-2014 respectively.
Could anyone help with writing the code for this problem?
This is a possible dplyr solution:
df %>% group_by(PatientID) %>% mutate(IndexDate = Prescriptiondate[match(unique(DrugCode)[3], DrugCode)])
# Source: local data frame [17 x 4]
# Groups: PatientID
#
# PatientID DrugCode Prescriptiondate IndexDate
# 1 A1 3 07-08-2014 30-09-2014
# 2 A1 3 08-09-2014 30-09-2014
# 3 A1 7 19-09-2014 30-09-2014
# 4 A1 5 30-09-2014 30-09-2014
# 5 A2 4 11-07-2014 26-08-2014
# 6 A2 4 21-07-2014 26-08-2014
# 7 A2 3 13-08-2014 26-08-2014
# 8 A2 5 26-08-2014 26-08-2014
# 9 A2 5 30-09-2014 26-08-2014
# 10 A3 2 16-08-2014 NA
# 11 A3 3 17-09-2014 NA
# 12 A4 5 08-06-2014 22-10-2014
# 13 A4 5 29-06-2014 22-10-2014
# 14 A4 6 20-08-2014 22-10-2014
# 15 A4 6 24-09-2014 22-10-2014
# 16 A4 4 22-10-2014 22-10-2014
# 17 A4 4 25-10-2014 22-10-2014
I guess it's the same idea with data.table
dt[, IndexDate := Prescriptiondate[match(unique(DrugCode)[3], DrugCode)], PatientID]
# PatientID DrugCode Prescriptiondate IndexDate
# 1: A1 3 07-08-2014 30-09-2014
# 2: A1 3 08-09-2014 30-09-2014
# 3: A1 7 19-09-2014 30-09-2014
# 4: A1 5 30-09-2014 30-09-2014
# 5: A2 4 11-07-2014 26-08-2014
# 6: A2 4 21-07-2014 26-08-2014
# 7: A2 3 13-08-2014 26-08-2014
# 8: A2 5 26-08-2014 26-08-2014
# 9: A2 5 30-09-2014 26-08-2014
# 10: A3 2 16-08-2014 NA
# 11: A3 3 17-09-2014 NA
# 12: A4 5 08-06-2014 22-10-2014
# 13: A4 5 29-06-2014 22-10-2014
# 14: A4 6 20-08-2014 22-10-2014
# 15: A4 6 24-09-2014 22-10-2014
# 16: A4 4 22-10-2014 22-10-2014
# 17: A4 4 25-10-2014 22-10-2014
match works because it returns the position of the first match only, so it does not matter whether a drug is prescribed on one date or on many. unique works because it returns the distinct values in the order in which they first appear, so unique(x)[3] is the third distinct drug the patient received. If a patient has fewer than three distinct drugs, unique(x)[3] is NA, match returns NA, and the index date is NA.
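The mechanics can be traced on a single patient. The sketch below is an illustration only; index_date and third_run_start are hypothetical helper names, not functions from any package. The second helper shows one possible variant for data where a patient may return to an earlier drug, a case that unique() does not count as a new change.

```r
# Index date = date of the first prescription of the third distinct drug
index_date <- function(drug, dates) {
  dates[match(unique(drug)[3], drug)]
}

# Patient A4 from the question
drug  <- c(5, 5, 6, 6, 4, 4)
dates <- as.Date(c("08-06-2014", "29-06-2014", "20-08-2014",
                   "24-09-2014", "22-10-2014", "25-10-2014"), "%d-%m-%Y")

unique(drug)                     # distinct drugs in order of first appearance
match(unique(drug)[3], drug)     # row of the first prescription of the third drug
a4 <- index_date(drug, dates)

# Patient A3 has only two distinct drugs, so the result is NA
a3 <- index_date(c(2, 3), as.Date(c("16-08-2014", "17-09-2014"), "%d-%m-%Y"))

# Variant (assumption: switching back to an earlier drug counts as a change):
# start row of the third consecutive run of identical drug codes
third_run_start <- function(drug) {
  which(c(TRUE, diff(drug) != 0))[3]
}
third_run_start(c(3, 7, 3))      # third run starts at row 3
```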
Here's a base R solution, shamelessly stealing Pierre Lafortune's brilliant match-unique idea:
df <- data.frame(
PatientID = c('A1','A1','A1','A1','A2','A2','A2','A2','A2','A3','A3','A4','A4','A4','A4','A4','A4'),
DrugCode = c(3,3,7,5,4,4,3,5,5,2,3,5,5,6,6,4,4),
Prescriptiondate = as.Date(c('07-08-2014','08-09-2014','19-09-2014','30-09-2014','11-07-2014','21-07-2014','13-08-2014','26-08-2014','30-09-2014','16-08-2014','17-09-2014','08-06-2014','29-06-2014','20-08-2014','24-09-2014','22-10-2014','25-10-2014'), '%d-%m-%Y'))
df$IndexDate <- do.call('c', by(df, df$PatientID, function(g) rep(g$Prescriptiondate[match(unique(g$DrugCode)[3], g$DrugCode)], nrow(g))))
df
## PatientID DrugCode Prescriptiondate IndexDate
## 1 A1 3 2014-08-07 2014-09-30
## 2 A1 3 2014-09-08 2014-09-30
## 3 A1 7 2014-09-19 2014-09-30
## 4 A1 5 2014-09-30 2014-09-30
## 5 A2 4 2014-07-11 2014-08-26
## 6 A2 4 2014-07-21 2014-08-26
## 7 A2 3 2014-08-13 2014-08-26
## 8 A2 5 2014-08-26 2014-08-26
## 9 A2 5 2014-09-30 2014-08-26
## 10 A3 2 2014-08-16 <NA>
## 11 A3 3 2014-09-17 <NA>
## 12 A4 5 2014-06-08 2014-10-22
## 13 A4 5 2014-06-29 2014-10-22
## 14 A4 6 2014-08-20 2014-10-22
## 15 A4 6 2014-09-24 2014-10-22
## 16 A4 4 2014-10-22 2014-10-22
## 17 A4 4 2014-10-25 2014-10-22
