I have a dataset in R with information about individuals and diagnoses. The variables are group, age, weight, id and diagnosis. An individual can have either one row with X in diagnosis (meaning no diagnosis) or one or more rows with diagnoses. I want to create a new variable holding the number of diagnoses each individual received, so that each individual has exactly one row in the dataset with the variables group, age, weight, id and number of diagnoses. In this new column, individuals with no diagnosis should get 0, those with one diagnosis 1, those with two diagnoses 2, and so on. Can anyone help me?
I tried group_by and count, but I cannot get 0 for individuals with no diagnosis (X in the diagnosis column), and the other variables (group, age and weight) do not appear in the result.
Here is the data:
pr <- read_csv("~/Desktop/Data.csv")
head(pr)
dput(pr)
structure(list(GROUP = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3,
3, 3, 3, 3, 4, 4, 4), AGE = c(23, 34, 61, 23, 45, 34, 34, 55,
56, 43, 56, 49, 61, 49, 74, 49, 51, 46, 75), WEIGHT = c(56, 72,
70, 56, 101, 72, 72, 62, 60, 78, 60, 55, 79, 55, 89, 55, 67,
60, 105), ID = c(4, 1, 2, 4, 3, 1, 1, 5, 7, 6, 7, 8, 9, 8, 10,
8, 11, 12, 13), DIAGNOSIS = c("J01", "J01", "X", "J01", "J01",
"J01", "J01", "J01", "J01", "J01", "J01", "J01", "X", "J01",
"J01", "J01", "X", "J01", "J01")), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -19L), spec = structure(list(
cols = list(GROUP = structure(list(), class = c("collector_double",
"collector")), AGE = structure(list(), class = c("collector_double",
"collector")), WEIGHT = structure(list(), class = c("collector_double",
"collector")), ID = structure(list(), class = c("collector_double",
"collector")), DIAGNOSIS = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
[Image: desired output]
One way to approach this is to group_by several columns at once, provided their values are repeated on every row for a given individual (which is the case in this example). The grouping columns are then kept in the result. Instead of count, use summarise to sum the rows where DIAGNOSIS is not "X"; rows with "X" contribute nothing to the sum, so individuals without a diagnosis get zero.
library(dplyr)
pr %>%
group_by(GROUP, ID, AGE, WEIGHT) %>%
summarise(NUMBER = sum(DIAGNOSIS != "X"))
Output
GROUP ID AGE WEIGHT NUMBER
<dbl> <dbl> <dbl> <dbl> <int>
1 1 1 34 72 3
2 1 2 61 70 0
3 1 3 45 101 1
4 1 4 23 56 2
5 2 5 55 62 1
6 2 6 43 78 1
7 2 7 56 60 2
8 3 8 49 55 3
9 3 9 61 79 0
10 3 10 74 89 1
11 4 11 51 67 0
12 4 12 46 60 1
13 4 13 75 105 1
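The same counting idea can be sketched without dplyr, assuming base R's aggregate is acceptable. The names pr_small and HAS_DX are hypothetical and used only for this illustration; the rows are a subset of the question's data.

```r
# A small subset of the question's data (pr_small is a hypothetical name).
pr_small <- data.frame(
  GROUP = c(1, 1, 1, 1, 1),
  AGE = c(34, 61, 45, 34, 34),
  WEIGHT = c(72, 70, 101, 72, 72),
  ID = c(1, 2, 3, 1, 1),
  DIAGNOSIS = c("J01", "X", "J01", "J01", "J01"),
  stringsAsFactors = FALSE
)

# TRUE for a real diagnosis, FALSE for the "X" placeholder.
pr_small$HAS_DX <- pr_small$DIAGNOSIS != "X"

# Summing a logical vector counts its TRUE values,
# so individuals who only have "X" end up with 0.
res <- aggregate(HAS_DX ~ GROUP + ID + AGE + WEIGHT, data = pr_small, FUN = sum)
names(res)[names(res) == "HAS_DX"] <- "NUMBER"
res
```

The key point is that the comparison happens before the aggregation, which is why a plain row count cannot produce the zeros.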
I have two tables: the first has stress scores recorded at various time points and the second has the date of treatment. I want to get the stress scores before and after treatment for each participant who has received the treatment. I also want a column that says how long before or after treatment each stress score was recorded. I do not know where to begin or what my code should look like.
score.dt = data.table(
participant.index = c(1, 1, 1, 3, 4, 4, 13, 21, 21, 25, 37, 40, 41, 41, 41, 43, 43, 43, 44),
repeat.instance = c(2, 3, 6, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 3, 1, 2, 3, 1),
date.recorded = c(
'2017-07-13',
'2017-06-26',
'2018-09-17',
'2016-04-14',
'2014-03-24',
'2016-05-30',
'2018-06-20',
'2014-08-03',
'2015-07-06',
'2014-12-17',
'2014-09-05',
'2013-06-10',
'2015-10-04',
'2016-11-04',
'2016-04-18',
'2014-02-13',
'2013-05-24',
'2014-09-10',
'2014-11-25'
),
subscale = c(
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress"
),
score = c(18, 10, 18, 36, 16, 30, 28, 10, 12, 40, 16, 12, 10, 14, 6, 32, 42, 26, 18)
)
date.treatment.dt = data.table (
participant.index = c(1, 4, 5, 6, 8, 10, 11, 12, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26),
date.treatment = c(
'2018 - 06 - 27',
'2001 - 07 - 16',
'2009 - 12 - 09',
'2009 - 05 - 20',
'2009 - 07 - 22',
'2008-07 - 02',
'2009 - 11 - 25',
'2009 - 09 - 16',
'1991 - 07 - 30',
'2016 - 05 - 25',
'2012 - 07 - 25',
'2007 - 03 - 19',
'2012 - 01 - 25',
'2011 - 09 - 21',
'2000 - 03 - 06',
'2001 - 09 - 25',
'1999 - 12 - 20',
'1997 -07 - 28',
'2002 - 03 - 12',
'2008 - 01 - 23'
))
The desired output columns are something like this:
score.date.dt = c("candidate.index.x", "repeat.instance", "subscale", "score", "date.treatment", "date.recorded", "score.before.treatment", "score.after.treatment", "months.before.treatment", "months.after.treatment")
Here the column months.before.treatment indicates how many months before treatment the stress score was measured, and months.after.treatment indicates how many months after treatment it was measured.
In your example set, only four individuals with stress scores have any rows in the treatment table (participants 1, 4, 21 and 25). Only one of these, participant 1, has both a pre-treatment and a post-treatment stress measure.
Here is one way to produce the information you need, once both date columns have been converted to Date (see the note after the output):
library(tidyverse)
inner_join(score.dt,date.treatment.dt, by="participant.index") %>%
group_by(participant.index, date.treatment) %>%
summarize(pre_treatment = min(date.recorded[date.recorded<=date.treatment]),
post_treatment = max(date.recorded[date.recorded>=date.treatment])) %>%
pivot_longer(cols = -(participant.index:date.treatment), names_to = "period", values_to = "date.recorded") %>%
left_join(score.dt, by=c("participant.index", "date.recorded" )) %>%
mutate(period=str_extract(period,".*(?=_)"),
months = abs(as.numeric(date.treatment-date.recorded))/(365.25/12)) %>%
pivot_wider(id_cols = participant.index:date.treatment, names_from = period, values_from=c(date.recorded, subscale, months,score))
Output:
participant.index date.treatment date.recorded_pre date.recorded_post subscale_pre subscale_post months_pre months_post score_pre score_post
<dbl> <date> <date> <date> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 2018-06-27 2017-06-26 2018-09-17 stress stress 12.0 2.69 10 18
2 4 2001-07-16 NA 2016-05-30 NA stress Inf 178. NA 30
3 21 2000-03-06 NA 2015-07-06 NA stress Inf 184. NA 12
4 25 2002-03-12 NA 2014-12-17 NA stress Inf 153. NA 40
Note: you will first have to fix the date columns in the two source tables, like this:
# first, strip the stray spaces from the date.treatment column and convert it to Date
date.treatment.dt[, date.treatment := as.Date(str_replace_all(date.treatment," ",""), "%Y-%m-%d")]
# second, similarly fix the date column in your stress score table
score.dt[,date.recorded := as.Date(date.recorded,"%Y-%m-%d")]
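As a minimal base R sketch of the two steps above (cleaning the spaced date strings, then expressing a gap in months), the following uses a few of the treatment dates from the question. The helper months_between is a hypothetical name, and dividing by 365.25/12 is the same average-month-length approximation used in the pipeline, not an exact calendar-month count.

```r
# Three of the inconsistently spaced treatment dates from the question.
raw <- c("2018 - 06 - 27", "2008-07 - 02", "1997 -07 - 28")

# Remove every space, then parse as year-month-day.
clean <- as.Date(gsub(" ", "", raw, fixed = TRUE), format = "%Y-%m-%d")

# Hypothetical helper: elapsed time in average-length months (365.25 / 12 days).
months_between <- function(from, to) {
  as.numeric(to - from) / (365.25 / 12)
}

# Participant 1: treatment on 2018-06-27, score recorded on 2018-09-17.
m <- months_between(clean[1], as.Date("2018-09-17"))
round(m, 2)
```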
It seems like there are a few parts to what you're asking. First, you need to merge the two tables together. Here I use dplyr::inner_join(), which automatically detects that participant.index is the only column in common and merges on it, discarding records found in only one of the tables. Second, we convert both date columns to a date format so that the elapsed months can be calculated.
library(tidyverse)
library(data.table)
library(lubridate)
score.dt <- structure(list(participant.index = c(1, 1, 1, 3, 4, 4, 13, 21, 21, 25, 37, 40, 41, 41, 41, 43, 43, 43, 44), repeat.instance = c(2, 3, 6, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 3, 1, 2, 3, 1), date.recorded = c("2017-07-13", "2017-06-26", "2018-09-17", "2016-04-14", "2014-03-24", "2016-05-30", "2018-06-20", "2014-08-03", "2015-07-06", "2014-12-17", "2014-09-05", "2013-06-10", "2015-10-04", "2016-11-04", "2016-04-18", "2014-02-13", "2013-05-24", "2014-09-10", "2014-11-25"), subscale = c("stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress"), score = c(18, 10, 18, 36, 16, 30, 28, 10, 12, 40, 16, 12, 10, 14, 6, 32, 42, 26, 18)), row.names = c(NA, -19L), class = c("data.table", "data.frame"))
date.treatment.dt <- structure(list(participant.index = c(1, 4, 5, 6, 8, 10, 11, 12, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26), date.treatment = c("2018 - 06 - 27", "2001 - 07 - 16", "2009 - 12 - 09", "2009 - 05 - 20", "2009 - 07 - 22", "2008-07 - 02", "2009 - 11 - 25", "2009 - 09 - 16", "1991 - 07 - 30", "2016 - 05 - 25", "2012 - 07 - 25", "2007 - 03 - 19", "2012 - 01 - 25", "2011 - 09 - 21", "2000 - 03 - 06", "2001 - 09 - 25", "1999 - 12 - 20", "1997 -07 - 28", "2002 - 03 - 12", "2008 - 01 - 23")), row.names = c(NA, -20L), class = c("data.table", "data.frame"))
inner_join(date.treatment.dt, score.dt) %>%
mutate(across(contains("date"), as_date)) %>%
mutate(months.after = interval(date.treatment, date.recorded) %/% months(1)) %>%
mutate(months.before = 0 - months.after)
#> Joining, by = "participant.index"
#> participant.index date.treatment repeat.instance date.recorded subscale
#> 1: 1 2018-06-27 2 2017-07-13 stress
#> 2: 1 2018-06-27 3 2017-06-26 stress
#> 3: 1 2018-06-27 6 2018-09-17 stress
#> 4: 4 2001-07-16 1 2014-03-24 stress
#> 5: 4 2001-07-16 2 2016-05-30 stress
#> 6: 21 2000-03-06 1 2014-08-03 stress
#> 7: 21 2000-03-06 2 2015-07-06 stress
#> 8: 25 2002-03-12 1 2014-12-17 stress
#> score months.after months.before
#> 1: 18 -11 11
#> 2: 10 -12 12
#> 3: 18 2 -2
#> 4: 16 152 -152
#> 5: 30 178 -178
#> 6: 10 172 -172
#> 7: 12 184 -184
#> 8: 40 153 -153
Created on 2022-04-05 by the reprex package (v2.0.1)
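To illustrate what the inner join keeps and drops, here is a small base R sketch using merge, which performs an inner join by default. The data frames scores and treatments are hypothetical, cut-down versions of the two tables in the question.

```r
# Cut-down versions of the two tables (hypothetical names).
scores <- data.frame(
  participant.index = c(1, 1, 3, 4),
  score = c(18, 10, 36, 16)
)
treatments <- data.frame(
  participant.index = c(1, 4, 5),
  date.treatment = as.Date(c("2018-06-27", "2001-07-16", "2009-12-09"))
)

# merge() keeps only participants present in both tables:
# participant 3 (no treatment) and participant 5 (no score) are dropped.
joined <- merge(scores, treatments, by = "participant.index")
joined
```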
How do I generate a dummy variable for each year that is zero before Year and takes the value 1 from Year onwards, up to 2019?
Original data:
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8), Year = c(2017,
2015, 2018, 2018, 2018, 2018, 2018, 2018)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -8L))
What I need:
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8), Year = c(2017,
2015, 2018, 2018, 2018, 2018, 2018, 2018), `2015` = c(NA, 1,
NA, NA, NA, NA, NA, NA), `2016` = c(NA, 1, NA, NA, NA, NA, NA,
NA), `2017` = c(1, 1, NA, NA, NA, NA, NA, NA), `2018` = c(1,
1, 1, 1, 1, 1, 1, 1), `2019` = c(1, 1, 1, 1, 1, 1, 1, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -8L))
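Split Year by id, extend each value to the range Year:2019, then reshape from long to wide (df2 is the data frame shown above). Note that the cells then hold the year itself rather than 1: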
res <- reshape(stack(sapply(split(df2$Year, df2$id), function(i) i:2019)),
timevar = "values", v.names = "values", idvar = "ind",
direction = "wide")
# fix the column names order
res <- res[ sort(colnames(res)) ]
res
# ind values.2015 values.2016 values.2017 values.2018 values.2019
# 1 1 NA NA 2017 2018 2019
# 4 2 2015 2016 2017 2018 2019
# 9 3 NA NA NA 2018 2019
# 11 4 NA NA NA 2018 2019
# 13 5 NA NA NA 2018 2019
# 15 6 NA NA NA 2018 2019
# 17 7 NA NA NA 2018 2019
# 19 8 NA NA NA 2018 2019
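If the cells should hold 1 and NA exactly as in the requested output, one possible sketch compares every Year against every column year with outer. This is an alternative to the reshape approach, not a restatement of it; the object names years, dummies and res are chosen for illustration.

```r
df2 <- data.frame(
  id = 1:8,
  Year = c(2017, 2015, 2018, 2018, 2018, 2018, 2018, 2018)
)

# One column per calendar year in the requested range.
years <- 2015:2019

# outer() builds an 8 x 5 logical matrix: TRUE where the column year
# is at or after the row's Year. ifelse() maps TRUE to 1 and FALSE to NA.
dummies <- ifelse(outer(df2$Year, years, "<="), 1, NA)
colnames(dummies) <- years

# Bind the dummy columns to the original data.
res <- cbind(df2, dummies)
res
```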
I have some data from an API that gives timestamps in an unusual format that includes the day of the week and the day of the year at the end. For example, [2021, 8, 22, 22, 0, 20, 6, 234] is 2021/08/22 22:00:20 on the 6th day of the week, 234th day of the year. I want to convert this into a lubridate date-time object but don't know how to strip out the last two values.
For example, I'd like to take this data
example <- tibble(timestamp = c("[2021, 8, 22, 22, 0, 20, 6, 234]", "[2021, 8, 22, 22, 0, 30, 6, 234]", "[2021, 8, 22, 22, 0, 41, 6, 234]"), temperature = c(28,29,30))
and turn the timestamp column into a lubridate date-time type. Any ideas?
You can use strptime and supply a matching format string. Each input is only processed as far as the format requires, so the trailing two values are ignored:
example %>% dplyr::mutate(
datetime = strptime(timestamp, format = "[%Y, %m, %d, %H, %M, %S"))
# A tibble: 3 x 3
timestamp temperature datetime
<chr> <dbl> <dttm>
1 [2021, 8, 22, 22, 0, 20, 6, 234] 28 2021-08-22 22:00:20
2 [2021, 8, 22, 22, 0, 30, 6, 234] 29 2021-08-22 22:00:30
3 [2021, 8, 22, 22, 0, 41, 6, 234] 30 2021-08-22 22:00:41
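A standalone base R sketch of the same idea follows. It wraps the result in as.POSIXct, because strptime returns a POSIXlt object and POSIXct is usually easier to store in a data frame column. The tz = "UTC" argument is an assumption; the question does not say which time zone the API uses.

```r
timestamp <- c("[2021, 8, 22, 22, 0, 20, 6, 234]",
               "[2021, 8, 22, 22, 0, 30, 6, 234]")

# The format covers only the first six fields; the rest of each string is ignored.
# tz = "UTC" is an assumption about the source of the data.
datetime <- as.POSIXct(
  strptime(timestamp, format = "[%Y, %m, %d, %H, %M, %S", tz = "UTC")
)

format(datetime, "%Y-%m-%d %H:%M:%S")
```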
How about this.
library(tidyverse)
example <- tibble(timestamp = c("[2021, 8, 22, 22, 0, 20, 6, 234]", "[2021, 8, 22, 22, 0, 30, 6, 234]", "[2021, 8, 22, 22, 0, 41, 6, 234]"), temperature = c(28,29,30))
example %>%
mutate(timestamp = str_split(timestamp, ","),
timestamp = map_chr(timestamp, ~paste(parse_number(.x[1:6]), collapse = ".")),
timestamp = lubridate::ymd_hms(timestamp))
#> # A tibble: 3 x 2
#> timestamp temperature
#> <dttm> <dbl>
#> 1 2021-08-22 22:00:20 28
#> 2 2021-08-22 22:00:30 29
#> 3 2021-08-22 22:00:41 30
I split each string on commas, parse the numbers (which also removes the brackets), collapse the first six elements back into one string, dropping the last two, and finally parse the date-time.
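The split, keep-six, parse sequence described above can also be sketched in base R with a regular expression. The function name parse_api_timestamp is hypothetical, and the default tz = "UTC" is an assumption, since the question does not state the time zone.

```r
# Hypothetical helper: extract the numeric fields, keep the first six
# (year, month, day, hour, minute, second) and build a date-time from them.
parse_api_timestamp <- function(x, tz = "UTC") {
  # One character vector of digit runs per input string.
  nums <- regmatches(x, gregexpr("[0-9]+", x))
  # 6 x n matrix; the day-of-week and day-of-year fields are discarded.
  m <- vapply(nums, function(p) as.numeric(p[1:6]), numeric(6))
  ISOdatetime(m[1, ], m[2, ], m[3, ], m[4, ], m[5, ], m[6, ], tz = tz)
}

parsed <- parse_api_timestamp(c("[2021, 8, 22, 22, 0, 20, 6, 234]",
                                "[2021, 8, 22, 22, 0, 41, 6, 234]"))
format(parsed, "%Y-%m-%d %H:%M:%S")
```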
I have data in the following format, with variables as column groups, one column per year within each variable, and A, B, C, D as the row IDs.
Variable 1 blank column Variable 2
2008 2009 2010 2011 2008 2009 2010 2011
A 1 5 9 13 5 10 15 20
B 2 6 10 14 25 30 35 40
C 3 7 11 15 45 50 55 60
D 4 8 12 16 65 70 75 80
I would like to get it in this format:
Variable Year Data
A Variable1 2008 1
A Variable1 2009 5
.....
.....
D Variable2 2010 75
D Variable2 2011 80
I thought of using gather from library(tidyr) but I can't figure out how to do it. Sorry, I do not have a reproducible example.
structure(list(X1 = c(NA, "A", "B", "C", "D"), Variable1 = c(2008,
1, 2, 3, 4), X3 = c(2009, 5, 6, 7, 8), X4 = c(2010, 9, 10, 11,
12), X5 = c(2011, 13, 14, 15, 16), Variable1 = c(2008, 5, 25,
45, 65), X7 = c(2009, 10, 30, 50, 70), X8 = c(2010, 15, 35, 55,
75), X9 = c(2011, 20, 40, 60, 80)), .Names = c("X1", "Variable1",
"X3", "X4", "X5", "Variable1", "X7", "X8", "X9"), row.names = c(NA,
5L), class = "data.frame")
library(tidyverse)
names(df) <- c("row_name",
paste(c(t(replicate(4, names(df)[1 + seq(1, length.out=floor(length(names(df))/4), by=4)]))),
df[1,-1],
sep="_"))
df[-1,] %>%
gather(Variable_Year, Data, -row_name) %>%
separate(Variable_Year, into=c("Variable", "Year"), sep="_") %>%
arrange(row_name, Variable, Year)
Note that a data frame cannot have non-unique row names, so you may need an alternative way to handle the row_name column in the output below.
Output is:
row_name Variable Year Data
1 A Variable1 2008 1
2 A Variable1 2009 5
...
31 D Variable2 2010 75
32 D Variable2 2011 80
Sample data:
df <- structure(list(row_name = c(NA, "A", "B", "C", "D"), Variable1_2008 = c(2008,
1, 2, 3, 4), Variable1_2009 = c(2009, 5, 6, 7, 8), Variable1_2010 = c(2010,
9, 10, 11, 12), Variable1_2011 = c(2011, 13, 14, 15, 16), Variable2_2008 = c(2008,
5, 25, 45, 65), Variable2_2009 = c(2009, 10, 30, 50, 70), Variable2_2010 = c(2010,
15, 35, 55, 75), Variable2_2011 = c(2011, 20, 40, 60, 80)), .Names = c("row_name",
"Variable1_2008", "Variable1_2009", "Variable1_2010", "Variable1_2011",
"Variable2_2008", "Variable2_2009", "Variable2_2010", "Variable2_2011"
), row.names = c(NA, 5L), class = "data.frame")
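For readers who prefer not to depend on tidyr, the gather and separate steps can be sketched in base R as below. In current tidyr, pivot_longer is the recommended successor to gather, but this sketch sticks to base functions. It assumes the columns have already been renamed to the Variable_Year pattern and that the header row has been dropped; wide, value_cols and long are hypothetical names.

```r
# A cut-down wide table with columns already named Variable_Year.
wide <- data.frame(
  row_name = c("A", "B"),
  Variable1_2008 = c(1, 2),
  Variable1_2009 = c(5, 6),
  Variable2_2008 = c(5, 25),
  Variable2_2009 = c(10, 30),
  stringsAsFactors = FALSE
)

value_cols <- setdiff(names(wide), "row_name")

# Stack the value columns: one output row per (row_name, column) pair.
long <- data.frame(
  row_name = rep(wide$row_name, times = length(value_cols)),
  key = rep(value_cols, each = nrow(wide)),
  Data = unlist(wide[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Split "Variable1_2008" into its variable and year parts.
parts <- strsplit(long$key, "_", fixed = TRUE)
long$Variable <- vapply(parts, `[`, character(1), 1)
long$Year <- as.integer(vapply(parts, `[`, character(1), 2))

# Order and select columns to match the requested layout.
long <- long[order(long$row_name, long$Variable, long$Year),
             c("row_name", "Variable", "Year", "Data")]
long
```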