I have a large data frame (>50 columns). A sample of the relevant columns is here:
tb <- data.frame(RowID=c("A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10", "A11", "A12", "A13", "A14", "A15"),
Patient=c("001", "001", "001", "002", "002", "035", "035", "035", "035", "035", "100", "100", "105", "105", "105"),
Time=c(1,2,3,1,2,1,2,3,4,5,1,2,1,2,3),
Value=c(NA,10,23,100,30,10,15,NA,60,56.7,30,51,3,13,77))
I am trying to create a new column (Value_status) that ranks the initial value for each patient as either low or high (Value <50, Value >=50). The Value_status should be carried through to the other rows for that patient.
Here's what I have:
tb %>%
group_by(Patient) %>%
mutate(Value_status = if_else(Time == 1 & Value < 50, "low", "high"))
I thought I had solved it by adding group_by, but it doesn't give the same value for each individual patient as I hoped. I think I need to nest the if_else with more conditions, something like this?
Note: If a patient is missing Value at a time point other than 1, then they can still be grouped according to high/low.
tb %>%
group_by(Patient) %>%
mutate(Value_status = if_else(Time == 1 & Value < 50, "low",
if_else(Time == 1 & Value >= 50, "high",
if_else(#Apply the value from time point 1#))))
The output I am trying to get should look like this:
It should group patients based on whether or not their baseline values are high
RowID Patient Time Value Value_status
1 A1 001 1 NA <NA>
2 A2 001 2 10.0 <NA>
3 A3 001 3 23.0 <NA>
4 A4 002 1 100.0 high
5 A5 002 2 30.0 high
6 A6 035 1 10.0 low
7 A7 035 2 15.0 low
8 A8 035 3 NA low
9 A9 035 4 60.0 low
10 A10 035 5 56.7 low
11 A11 100 1 30.0 low
12 A12 100 2 51.0 low
13 A13 105 1 3.0 low
14 A14 105 2 13.0 low
15 A15 105 3 77.0 low
Instead of nesting if_else, we could use case_when, which accepts multiple conditions. Rows that match no condition (any time point other than 1, or a missing Value) get NA. We then group_by 'Patient' and use fill to replace the NA elements of 'Value_status' with the previous non-NA value.
library(dplyr)
library(tidyr)
tb %>%
mutate(Value_status = case_when(Time == 1 & Value < 50 ~ "low",
Time == 1 & Value >= 50 ~ "high"
)) %>%
group_by(Patient) %>%
fill(Value_status) %>%
ungroup
-output
# A tibble: 15 x 5
RowID Patient Time Value Value_status
<chr> <chr> <dbl> <dbl> <chr>
1 A1 001 1 NA <NA>
2 A2 001 2 10 <NA>
3 A3 001 3 23 <NA>
4 A4 002 1 100 high
5 A5 002 2 30 high
6 A6 035 1 10 low
7 A7 035 2 15 low
8 A8 035 3 NA low
9 A9 035 4 60 low
10 A10 035 5 56.7 low
11 A11 100 1 30 low
12 A12 100 2 51 low
13 A13 105 1 3 low
14 A14 105 2 13 low
15 A15 105 3 77 low
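As a cross-check, the same carry-forward can be sketched in base R by looking up each patient's time-point-1 value directly instead of filling downwards. This is only a sketch: it assumes every patient has exactly one row with Time == 1, and the helper names first_rows, baseline and status are illustrative, not from the original answer.

```r
tb <- data.frame(
  RowID   = paste0("A", 1:15),
  Patient = c("001", "001", "001", "002", "002", "035", "035", "035",
              "035", "035", "100", "100", "105", "105", "105"),
  Time    = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3),
  Value   = c(NA, 10, 23, 100, 30, 10, 15, NA, 60, 56.7, 30, 51, 3, 13, 77),
  stringsAsFactors = FALSE
)

# One baseline value per patient (assumption: exactly one Time == 1 row each)
first_rows <- tb[tb$Time == 1, ]
baseline <- setNames(first_rows$Value, first_rows$Patient)

# Classify the baseline; a missing baseline stays NA
status <- ifelse(baseline < 50, "low", "high")

# Look the status up by patient so every row of a patient gets the same label
tb$Value_status <- unname(status[tb$Patient])
```

Because the status is derived from the baseline alone, a missing Value at a later time point (row A8) does not affect the label.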
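Here is a solution with a nested ifelse. Note that it labels each row on its own and does not carry the status from time point 1 through to the patient's other rows: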
tb %>%
mutate(Value_status = ifelse(Time != 1 & Value ==10, "medium",
ifelse(Time == 1 & Value < 50, "low",
ifelse(Time == 1 & Value >= 50, "high", NA)
)
))
Output:
RowID Patient Time Value Value_status
1 A1 001 1 NA <NA>
2 A2 001 2 10 medium
3 A3 001 3 23 <NA>
4 A4 002 1 100 high
5 A5 002 2 30 <NA>
6 A6 035 1 10 low
7 A7 035 2 15 <NA>
8 A8 035 3 NA <NA>
9 A9 035 4 60 <NA>
10 A10 035 5 57 <NA>
11 A11 100 1 30 low
12 A12 100 2 51 <NA>
13 A13 105 1 3 low
14 A14 105 2 13 <NA>
15 A15 105 3 77 <NA>
I have the following data frame:
ID  COUNT OF STOCK  YEAR
A1  10              2000
A1  20              2000
A1  18              2000
A1  15              2001
A1  30              2001
A2  35              2002
A2  50              2001
A2  10              2002
A2  22              2002
A3  11              2001
A3  15              2001
A3  28              2000
I would like to change the data frame into the one shown below by grouping on ID and YEAR and summing COUNT OF STOCK within each group. YEAR is then used to compute the number of years from 2020 (2020 - year).
ID  Sum of COUNT OF STOCK  number of years from 2020 (2020-year)
A1  48                     20
A1  45                     19
A2  67                     18
A2  50                     19
A3  26                     19
A3  28                     20
Thanks in advance!!
This is pretty straightforward. To work with those verbose column names you will have to quote them with backticks, though, which might be a challenge.
dat %>% group_by( ID, YEAR ) %>%
summarise(
`Sum of COUNT OF STOCK` = sum( `COUNT OF STOCK` ),
`number of years from 2020 (2020-year)` = 2020 - first(YEAR)
) %>% select( -YEAR )
Output:
ID `Sum of COUNT OF STOCK` `number of years from 2020 (2020-year)`
<chr> <int> <dbl>
1 A1 48 20
2 A1 45 19
3 A2 50 19
4 A2 67 18
5 A3 28 20
6 A3 26 19
Simply do this.
df %>% group_by(D, number_of_years = 2020 - YEAR) %>%
summarise(Sum_of_stock = sum(COUNT_OF_STOCK))
# A tibble: 6 x 3
# Groups: D [3]
D number_of_years Sum_of_stock
<chr> <dbl> <int>
1 A1 19 45
2 A1 20 48
3 A2 18 67
4 A2 19 50
5 A3 19 26
6 A3 20 28
data
df <- read.table(text = "D COUNT_OF_STOCK YEAR
A1 10 2000
A1 20 2000
A1 18 2000
A1 15 2001
A1 30 2001
A2 35 2002
A2 50 2001
A2 10 2002
A2 22 2002
A3 11 2001
A3 15 2001
A3 28 2000", header = T)
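For comparison, the same grouping can be sketched in base R with aggregate(). This is an alternative to the dplyr answers above, not a replacement for them. It reuses the df and column names from the data block, and the final ordering step is only there to make the rows easy to compare.

```r
df <- read.table(text = "D COUNT_OF_STOCK YEAR
A1 10 2000
A1 20 2000
A1 18 2000
A1 15 2001
A1 30 2001
A2 35 2002
A2 50 2001
A2 10 2002
A2 22 2002
A3 11 2001
A3 15 2001
A3 28 2000", header = TRUE)

# Years elapsed up to 2020, computed once per row
df$number_of_years <- 2020 - df$YEAR

# Sum the stock counts within each (D, number_of_years) group
out <- aggregate(COUNT_OF_STOCK ~ D + number_of_years, data = df, FUN = sum)

# Sort by ID, then by number of years
out <- out[order(out$D, out$number_of_years), ]
out
```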
Consider a data frame made up of thousands of rows and columns that includes several NAs. I'd like to split this data frame into smaller ones based on the number of NAs in each row. All rows that contain the same number of NAs, if there are any, should be in the same group. The new data frames are then saved separately.
> DF
ID C1 C2 C3 C4 C5
aa 12 13 10 NA 12
ff 12 NA NA 23 13
ee 67 23 NA NA 21
jj 31 14 NA 41 11
ss NA 15 11 12 11
The desired output will be:
> DF_chunk_1
ID C1 C2 C3 C4 C5
aa 12 13 10 NA 12
jj 31 14 NA 41 11
ss NA 15 11 12 11
> DF_chunk_2
ID C1 C2 C3 C4 C5
ff 12 NA NA 23 13
ee 67 23 NA NA 21
I appreciate any suggestion.
Try this, following the useful comments. You can use apply() to count the NAs in each row and split() to group the rows by that count:
#Code
new <- split(DF,apply(DF[,-1],1,function(x)sum(is.na(x))))
Output:
$`1`
ID C1 C2 C3 C4 C5
1 aa 12 13 10 NA 12
4 jj 31 14 NA 41 11
5 ss NA 15 11 12 11
$`2`
ID C1 C2 C3 C4 C5
2 ff 12 NA NA 23 13
3 ee 67 23 NA NA 21
A more practical way (many thanks and credit to @RuiBarradas):
#Code2
new <- split(DF, rowSums(is.na(DF[-1])))
Same output.
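The question also asks for the pieces to be saved separately. A minimal sketch of that step follows, assuming CSV output is acceptable. The output directory (tempdir()) and the DF_chunk_<n>.csv file names are illustrative choices, not part of the original answer.

```r
DF <- data.frame(
  ID = c("aa", "ff", "ee", "jj", "ss"),
  C1 = c(12, 12, 67, 31, NA),
  C2 = c(13, NA, 23, 14, 15),
  C3 = c(10, NA, NA, NA, 11),
  C4 = c(NA, 23, NA, 41, 12),
  C5 = c(12, 13, 21, 11, 11),
  stringsAsFactors = FALSE
)

# Group rows by their NA count (ID column excluded from the count)
chunks <- split(DF, rowSums(is.na(DF[-1])))

# Write each group to its own file; out_dir is a placeholder location
out_dir <- tempdir()
paths <- vapply(names(chunks), function(n) {
  path <- file.path(out_dir, paste0("DF_chunk_", n, ".csv"))
  write.csv(chunks[[n]], path, row.names = FALSE)
  path
}, character(1))
```

The list names are the NA counts, so each file name records how many NAs its rows contain.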
I have a set of data from an experiment that I have to analyse. Because it also contains a lot of data that is not important to me, I wanted to tidy those files up a bit using R, as doing it manually is too much work.
As the data in those .csv files comes from time-course experiments, the order of the measurements matters and the values have to stay in a specific order.
So far, I have managed to select all the columns that I need and to sort them by condition using the following code:
used_columns <- select(df,
ImageNumber,
FrameNumber,
Treatment,
Intensity1,
Intensity2)
used_columns.t <- as_tibble(used_columns)
df_sorted <- used_columns.t %>%
filter(Treatment == "B2") %>%
.[order(as.integer(.$FrameNumber),decreasing = FALSE), ]
Using this code, df_sorted yields a data frame that looks like this:
ImageNumber FrameNumber Treatment Intensity1 Intensity2
1 1 B2 1598,45 0,14
2 1 B2 930,40 0,11
3 1 B2 107,86 0,04
4 1 B2 881,09 0,11
7 1 B2 2201,98 0,15
8 1 B2 161,30 0,04
9 1 B2 1208,14 0,17
4 2 B2 831,75 0,12
5 2 B2 1027,41 0,14
7 2 B2 2052,16 0,15
8 2 B2 159,63 0,05
9 2 B2 1111,49 0,16
10 2 B2 1312,15 0,12
1 3 B2 863,79 0,10
2 3 B2 104,06 0,04
3 3 B2 816,02 0,11
4 3 B2 1053,02 0,14
5 3 B2 132,32 0,03
6 3 B2 2059,03 0,14
7 3 B2 153,49 0,04
8 3 B2 1118,69 0,15
9 3 B2 1632,66 0,18
10 3 B2 1302,15 0,12
However, I would like to have a table like this, where the missing values are indicated as NA (or whatever other placeholder):
ImageNumber FrameNumber Treatment Intensity1 Intensity2
1 1 B2 1598,45 0,14
2 1 B2 930,40 0,11
3 1 B2 107,86 0,04
4 1 B2 881,09 0,11
5 NA NA NA NA
6 NA NA NA NA
7 1 B2 2201,98 0,15
8 1 B2 161,30 0,04
9 1 B2 1208,14 0,17
10 NA NA NA NA
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 2 B2 831,75 0,12
5 2 B2 1027,41 0,14
6 NA NA NA NA
7 2 B2 2052,16 0,15
8 2 B2 159,63 0,05
9 2 B2 1111,49 0,16
10 2 B2 1312,15 0,12
1 3 B2 863,79 0,10
2 3 B2 104,06 0,04
3 3 B2 816,02 0,11
4 3 B2 1053,02 0,14
5 3 B2 132,32 0,03
6 3 B2 2059,03 0,14
7 3 B2 153,49 0,04
8 3 B2 1118,69 0,15
9 3 B2 1632,66 0,18
10 3 B2 1302,15 0,12
This is just a very short extract of the table that I have; in reality, depending on the condition, the ImageNumber may go up to 1441. Do you know of any way to solve this problem?
I would be very grateful if anybody could help me with this!
Here is a split-apply-combine approach in base R
out <- do.call(rbind,
by(
data = df1,
INDICES = df1$FrameNumber,
FUN = merge,
y = data.frame(ImageNumber = seq(min(df1$ImageNumber), max(df1$ImageNumber))),
all.y = TRUE
))
out
# ImageNumber FrameNumber Treatment Intensity1 Intensity2
#1.1 1 1 B2 1598,45 0,14
#1.2 2 1 B2 930,40 0,11
#1.3 3 1 B2 107,86 0,04
#1.4 4 1 B2 881,09 0,11
#1.5 5 NA <NA> <NA> <NA>
#1.6 6 NA <NA> <NA> <NA>
#1.7 7 1 B2 2201,98 0,15
#1.8 8 1 B2 161,30 0,04
#1.9 9 1 B2 1208,14 0,17
#1.10 10 NA <NA> <NA> <NA>
#2.1 1 NA <NA> <NA> <NA>
#2.2 2 NA <NA> <NA> <NA>
#2.3 3 NA <NA> <NA> <NA>
#2.4 4 2 B2 831,75 0,12
# ...
We split your data by FrameNumber and merge each list element with a data frame that contains a single column called ImageNumber. That column contains the values from min(df1$ImageNumber) to max(df1$ImageNumber), that is, from 1 to 10 in your example. The argument all.y = TRUE, which belongs to merge, turns implicit missing values into explicit missing values.
Finally we combine the list back to a data frame with do.call(rbind, ...).
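A related variant, sketched here under assumptions, builds the full ImageNumber by FrameNumber grid with expand.grid() and merges once. Unlike the output above, the padded rows then keep their FrameNumber and only the measurement columns are NA. The toy df1 below is hypothetical data, and the Intensity1 values are made up for illustration.

```r
# Hypothetical toy data: frame 1 lacks image 3, frame 2 lacks images 1 and 4
df1 <- data.frame(
  ImageNumber = c(1, 2, 4, 2, 3),
  FrameNumber = c(1, 1, 1, 2, 2),
  Treatment   = "B2",
  Intensity1  = c(10.5, 20.1, 30.7, 15.2, 25.9),
  stringsAsFactors = FALSE
)

# Every combination of image and frame that should exist
grid <- expand.grid(
  ImageNumber = seq(min(df1$ImageNumber), max(df1$ImageNumber)),
  FrameNumber = unique(df1$FrameNumber)
)

# Left join on the shared columns: missing combinations become NA rows
out <- merge(grid, df1, all.x = TRUE)

# Restore the time-course order: by frame, then by image
out <- out[order(out$FrameNumber, out$ImageNumber), ]
out
```

Whether a filled-in FrameNumber is preferable to NA depends on how the table is used afterwards.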
I have a dataset of 100 observations which contains patient ID, drug code and prescription date. I want to create a new column, "index date", which is the date when the patient switched to a third drug.
PatientID DrugCode Prescriptiondate
A1 3 07-08-2014
A1 3 08-09-2014
A1 7 19-09-2014
A1 5 30-09-2014
A2 4 11-07-2014
A2 4 21-07-2014
A2 3 13-08-2014
A2 5 26-08-2014
A2 5 30-09-2014
A3 2 16-08-2014
A3 3 17-09-2014
A4 5 08-06-2014
A4 5 29-06-2014
A4 6 20-08-2014
A4 6 24-09-2014
A4 4 22-10-2014
A4 4 25-10-2014
The data set should look like this:
PatientID DrugCode Prescriptiondate IndexDate
A1 3 07-08-2014 30-09-2014
A1 3 08-09-2014 30-09-2014
A1 7 19-09-2014 30-09-2014
A1 5 30-09-2014 30-09-2014
A2 4 11-07-2014 26-08-2014
A2 4 21-07-2014 26-08-2014
A2 3 13-08-2014 26-08-2014
A2 5 26-08-2014 26-08-2014
A2 5 30-09-2014 26-08-2014
A3 2 16-08-2014 NA
A3 3 17-09-2014 NA
A4 5 08-06-2014 22-10-2014
A4 5 29-06-2014 22-10-2014
A4 6 20-08-2014 22-10-2014
A4 6 24-09-2014 22-10-2014
A4 4 22-10-2014 22-10-2014
A4 4 25-10-2014 22-10-2014
In the above case, patients A1 and A2 switched to their third drug (drug 5) on 30-09-2014 and 26-08-2014 respectively; A3 has not switched to a third drug; and A4 switched to drug 4 on 22-10-2014. So the index dates should be 30-09-2014, 26-08-2014, NA and 22-10-2014 respectively.
Could anyone please help with writing the code for this problem?
This is a possible dplyr solution:
df %>% group_by(PatientID) %>% mutate(IndexDate = Prescriptiondate[match(unique(DrugCode)[3], DrugCode)])
# Source: local data frame [17 x 4]
# Groups: PatientID
#
# PatientID DrugCode Prescriptiondate IndexDate
# 1 A1 3 07-08-2014 30-09-2014
# 2 A1 3 08-09-2014 30-09-2014
# 3 A1 7 19-09-2014 30-09-2014
# 4 A1 5 30-09-2014 30-09-2014
# 5 A2 4 11-07-2014 26-08-2014
# 6 A2 4 21-07-2014 26-08-2014
# 7 A2 3 13-08-2014 26-08-2014
# 8 A2 5 26-08-2014 26-08-2014
# 9 A2 5 30-09-2014 26-08-2014
# 10 A3 2 16-08-2014 NA
# 11 A3 3 17-09-2014 NA
# 12 A4 5 08-06-2014 22-10-2014
# 13 A4 5 29-06-2014 22-10-2014
# 14 A4 6 20-08-2014 22-10-2014
# 15 A4 6 24-09-2014 22-10-2014
# 16 A4 4 22-10-2014 22-10-2014
# 17 A4 4 25-10-2014 22-10-2014
I guess it's the same idea with data.table
dt[, IndexDate := Prescriptiondate[match(unique(DrugCode)[3], DrugCode)], PatientID]
# PatientID DrugCode Prescriptiondate IndexDate
# 1: A1 3 07-08-2014 30-09-2014
# 2: A1 3 08-09-2014 30-09-2014
# 3: A1 7 19-09-2014 30-09-2014
# 4: A1 5 30-09-2014 30-09-2014
# 5: A2 4 11-07-2014 26-08-2014
# 6: A2 4 21-07-2014 26-08-2014
# 7: A2 3 13-08-2014 26-08-2014
# 8: A2 5 26-08-2014 26-08-2014
# 9: A2 5 30-09-2014 26-08-2014
# 10: A3 2 16-08-2014 NA
# 11: A3 3 17-09-2014 NA
# 12: A4 5 08-06-2014 22-10-2014
# 13: A4 5 29-06-2014 22-10-2014
# 14: A4 6 20-08-2014 22-10-2014
# 15: A4 6 24-09-2014 22-10-2014
# 16: A4 4 22-10-2014 22-10-2014
# 17: A4 4 25-10-2014 22-10-2014
match works because it returns only the position of the first match. So whether a drug is prescribed on one day or on many, the outcome does not change: we get the first prescription of the third drug. unique works because it returns its values in the order in which they first appear, so unique(x)[3] gives the third distinct drug code.
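To make the two steps concrete, here is a small sketch on patient A2's rows and on patient A3's rows from the question. The variable names (drug_a2, third_code and so on) are illustrative only.

```r
# Patient A2: drug codes and prescription dates in row order
drug_a2  <- c(4, 4, 3, 5, 5)
dates_a2 <- c("11-07-2014", "21-07-2014", "13-08-2014", "26-08-2014", "30-09-2014")

distinct_codes <- unique(drug_a2)       # distinct codes in order of first appearance
third_code <- distinct_codes[3]         # third distinct drug code
first_pos  <- match(third_code, drug_a2)  # position of its first occurrence
index_a2   <- dates_a2[first_pos]

# Patient A3 only ever has two distinct drugs, so the third one is NA
drug_a3  <- c(2, 3)
dates_a3 <- c("16-08-2014", "17-09-2014")
index_a3 <- dates_a3[match(unique(drug_a3)[3], drug_a3)]
```

Note that unique() counts distinct codes, so a switch back to an earlier drug is not counted as a new change.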
Here's a base R solution, shamelessly stealing Pierre Lafortune's brilliant match-unique idea:
df <- data.frame(
  PatientID = c('A1','A1','A1','A1','A2','A2','A2','A2','A2','A3','A3','A4','A4','A4','A4','A4','A4'),
  DrugCode = c(3,3,7,5,4,4,3,5,5,2,3,5,5,6,6,4,4),
  Prescriptiondate = as.Date(c('07-08-2014','08-09-2014','19-09-2014','30-09-2014','11-07-2014','21-07-2014','13-08-2014','26-08-2014','30-09-2014','16-08-2014','17-09-2014','08-06-2014','29-06-2014','20-08-2014','24-09-2014','22-10-2014','25-10-2014'), '%d-%m-%Y')
)
# For each patient, repeat the date of the first prescription of the third distinct drug
df$IndexDate <- do.call('c', by(df, df$PatientID, function(g)
  rep(g$Prescriptiondate[match(unique(g$DrugCode)[3], g$DrugCode)], nrow(g))))
df
## PatientID DrugCode Prescriptiondate IndexDate
## 1 A1 3 2014-08-07 2014-09-30
## 2 A1 3 2014-09-08 2014-09-30
## 3 A1 7 2014-09-19 2014-09-30
## 4 A1 5 2014-09-30 2014-09-30
## 5 A2 4 2014-07-11 2014-08-26
## 6 A2 4 2014-07-21 2014-08-26
## 7 A2 3 2014-08-13 2014-08-26
## 8 A2 5 2014-08-26 2014-08-26
## 9 A2 5 2014-09-30 2014-08-26
## 10 A3 2 2014-08-16 <NA>
## 11 A3 3 2014-09-17 <NA>
## 12 A4 5 2014-06-08 2014-10-22
## 13 A4 5 2014-06-29 2014-10-22
## 14 A4 6 2014-08-20 2014-10-22
## 15 A4 6 2014-09-24 2014-10-22
## 16 A4 4 2014-10-22 2014-10-22
## 17 A4 4 2014-10-25 2014-10-22