I am working with a large dataset and have started using R for it. I am trying to create a variable named time to diagnosis (time_to_dx), a time-to-event variable computed as yearHb - Dx for each patient ID. I would also like to drop all measurements taken prior to the diagnosis, but I am guessing that once I am able to create the time_to_dx variable, that should be straightforward.
I am attaching an example of the dataset and the expected outcome.
Many thanks for your help.
ID  Dx    Hb    yearHb  time_to_dx
1   2001  16.5  1997
1   2001  21.3  2002    1
1   2001  19.5  2005    4
2   2005  14.5  2002
2   2005  15.6  2004
2   2005  21    2006    1
2   2005  22    2007    2
2   2005  17.9  2003
3   2006  18.1  2003
3   2006  19.7  2006    0
3   2006  19.1  2008    2
3   2006  17.3  2007    1
Assuming that dt is the name of your data.frame
Using dplyr:
library(dplyr)
result <- dt %>%
  mutate(
    time_to_dx = yearHb - Dx,
    time_to_dx = ifelse(time_to_dx < 0, NA, time_to_dx)
  )
Using base R:
dt$time_to_dx <- dt$yearHb - dt$Dx
dt$time_to_dx <- ifelse(dt$time_to_dx < 0, NA, dt$time_to_dx)
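To also drop the measurements taken before diagnosis, one option is to keep only the rows where time_to_dx is non-negative. A minimal base R sketch, assuming the example data from the question is rebuilt by hand as dt (the name dt_post is made up for illustration):

```r
# Example data copied from the question (ID, Dx, Hb, yearHb)
dt <- data.frame(
  ID     = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3),
  Dx     = c(2001, 2001, 2001, 2005, 2005, 2005, 2005, 2005,
             2006, 2006, 2006, 2006),
  Hb     = c(16.5, 21.3, 19.5, 14.5, 15.6, 21, 22, 17.9,
             18.1, 19.7, 19.1, 17.3),
  yearHb = c(1997, 2002, 2005, 2002, 2004, 2006, 2007, 2003,
             2003, 2006, 2008, 2007)
)

# Years between diagnosis and measurement; negative means "before diagnosis"
dt$time_to_dx <- dt$yearHb - dt$Dx

# Keep only measurements taken in or after the diagnosis year
dt_post <- dt[dt$time_to_dx >= 0, ]
dt_post
```

Because Dx is repeated on every row of a patient, the subtraction is already per patient and no grouping step is needed.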
I've been working on this problem for days but have not been able to solve it.
I have a matrix
gesamt
     year <10 10-60 60-100 100-150 >150
2001 2001 376    57      7       0    0
2002 2015 322    60     10       2    0
2003 2016 324    59      5       2    0
which I convert to a data.frame: df <- data.frame(gesamt)
and melt it:
df.molten <- melt(df, id.vars='year', value.name='mean')
df.molten
year variable mean
1 2001 <10 376
2 2015 <10 322
3 2016 <10 324
4 2001 10-60 57
5 2015 10-60 60
6 2016 10-60 59
7 2001 60-100 7
8 2015 60-100 10
9 2016 60-100 5
10 2001 100-150 0
11 2015 100-150 2
12 2016 100-150 2
13 2001 >150 0
14 2015 >150 0
15 2016 >150 0
I would like to get a solution like this:
But I am not able to plot it, whatever I try (I haven't had enough time to study ggplot2).
Any hints?
This should be straightforward with ggplot2, using the molten data frame:
library(tidyverse)
df.molten %>%
  ggplot(aes(year, mean, fill = variable)) +
  geom_col(position = position_dodge())
Since your dataset has a large gap in years, you can convert the year column to character with df.molten$year <- as.character(df.molten$year) before plotting, and you won't have that large gap in your plot.
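If you would rather not learn ggplot2 yet, a grouped bar chart can also be sketched in base R with barplot(). This is only an alternative under assumptions: the matrix below is retyped from the question with the year values used as row names, and counts is a made-up helper name.

```r
# Counts retyped from the question; years are used as row names (assumption)
gesamt <- matrix(
  c(376, 57,  7, 0, 0,
    322, 60, 10, 2, 0,
    324, 59,  5, 2, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("2001", "2015", "2016"),
                  c("<10", "10-60", "60-100", "100-150", ">150"))
)

# barplot() draws one group of bars per column, so put the years in columns
counts <- t(gesamt)

# beside = TRUE places the classes side by side instead of stacking them
barplot(counts, beside = TRUE,
        legend.text = rownames(counts),
        xlab = "year", ylab = "mean")
```

Because the years are column names here, they are treated as labels, so the gap between 2001 and 2015 does not appear on the axis.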
I am trying to summarize a data set by a few different factors. Below is an example of my data:
household<-c("household1","household1","household1","household2","household2","household2","household3","household3","household3")
date<-c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value<-c(1:9)
type<-c("income","water","energy","income","water","energy","income","water","energy")
df<-data.frame(household,date,value,type)
household date value type
1 household1 1999-05-10 100 income
2 household1 1999-05-25 200 water
3 household1 1999-10-12 300 energy
4 household2 1999-02-02 400 income
5 household2 1999-08-20 500 water
6 household2 1999-02-19 600 energy
7 household3 1999-07-01 700 income
8 household3 1999-10-13 800 water
9 household3 1999-01-01 900 energy
I want to summarize the data by month. Ideally the resulting data set would have 12 rows per household (one for each month) and a column for each category of expenditure (water, energy, income) that is a sum of that month's total.
I tried starting by adding a column with a short date, and then I was going to filter for each type and create a separate data frame for the summed data per transaction type. I was then going to merge those data frames together to have the summarized df. I attempted to summarize it using ddply, but it aggregated too much, and I can't keep the household level info.
ddply(df,.(shortdate),summarize,mean_value=mean(value))
shortdate mean_value
1 14/07 15.88235
2 14/09 5.00000
3 14/10 5.00000
4 14/11 21.81818
5 14/12 20.00000
6 15/01 10.00000
7 15/02 12.50000
8 15/04 5.00000
Any help would be much appreciated!
It sounds like what you are looking for is a pivot table. I like to use reshape::cast for these types of tables. If there is more than one value returned for a given expenditure type for a given household/year/month combination, this will sum those values. If there is only one value, it returns the value. The "sum" argument is not required but only placed there to handle exceptions. I think if your data is clean you shouldn't need this argument.
hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3")
date <- c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value <- c(1:9)
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy")
df <- data.frame(hh, date, value, type)
# Load lubridate library, add month and year
library(lubridate)
df$month <- month(df$date)
df$year <- year(df$date)
# Load reshape library, run cast from reshape, creates pivot table
library(reshape)
dfNew <- cast(df, hh+year+month~type, value = "value", sum)
> dfNew
hh year month energy income water
1 hh1 1999 4 3 0 0
2 hh1 1999 10 0 1 0
3 hh1 1999 11 0 0 2
4 hh2 1999 2 0 4 0
5 hh2 1999 3 6 0 0
6 hh2 1999 6 0 0 5
7 hh3 1999 1 9 0 0
8 hh3 1999 4 0 7 0
9 hh3 1999 8 0 0 8
Try this:
df$ym<-zoo::as.yearmon(as.Date(df$date), "%y/%m")
library(dplyr)
df %>% group_by(ym,type) %>%
summarise(mean_value=mean(value))
Source: local data frame [9 x 3]
Groups: ym [?]
ym type mean_value
<S3: yearmon> <fctr> <dbl>
1 jan 1999 income 1
2 jun 1999 energy 3
3 jul 1999 energy 6
4 jul 1999 water 2
5 ago 1999 income 4
6 set 1999 energy 9
7 set 1999 income 7
8 nov 1999 water 5
9 dez 1999 water 8
Edit: the wide format, assuming the summarised result above has been stored as dfr:
reshape2::dcast(dfr, ym ~ type, value.var = "mean_value")
ym energy income water
1 jan 1999 NA 1 NA
2 jun 1999 3 NA NA
3 jul 1999 6 NA 2
4 ago 1999 NA 4 NA
5 set 1999 9 7 NA
6 nov 1999 NA NA 5
7 dez 1999 NA NA 8
If I understood your requirement correctly (from the description in the question), this is what you are looking for:
library(dplyr)
library(tidyr)
df %>%
  mutate(date = lubridate::month(date)) %>%
  complete(household, date = 1:12) %>%
  spread(type, value) %>%
  group_by(household, date) %>%
  mutate(Total = sum(energy, income, water, na.rm = TRUE)) %>%
  select(household, Month = date, energy:water, Total)
#Source: local data frame [36 x 6]
#Groups: household, Month [36]
#
# household Month energy income water Total
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 household1 1 NA NA NA 0
#2 household1 2 NA NA NA 0
#3 household1 3 NA NA 200 200
#4 household1 4 NA NA NA 0
#5 household1 5 NA NA NA 0
#6 household1 6 NA NA NA 0
#7 household1 7 NA NA NA 0
#8 household1 8 NA NA NA 0
#9 household1 9 300 NA NA 300
#10 household1 10 NA NA NA 0
# ... with 26 more rows
Note: I used the same df you provided in the question. The only change I made was the value column. Instead of 1:9, I used seq(100, 900, 100)
If I got it wrong, please let me know and I will delete my answer. I will add an explanation of what's going on if this is correct.
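For readers who want to avoid extra packages, the same household-by-month table can be sketched in base R with xtabs() and reshape(). This is a sketch under assumptions: the data is retyped from the table printed in the question (with values 100 to 900), and sums and wide are made-up names.

```r
# Data retyped from the table printed in the question
df <- data.frame(
  household = rep(c("household1", "household2", "household3"), each = 3),
  date = as.Date(c("1999-05-10", "1999-05-25", "1999-10-12",
                   "1999-02-02", "1999-08-20", "1999-02-19",
                   "1999-07-01", "1999-10-13", "1999-01-01")),
  value = seq(100, 900, 100),
  type = rep(c("income", "water", "energy"), times = 3)
)

# Month as a factor with all 12 levels, so empty months are kept
df$month <- factor(as.integer(format(df$date, "%m")), levels = 1:12)

# Sum value for every household/month/type combination (0 where no data)
sums <- as.data.frame(xtabs(value ~ household + month + type, data = df))

# One column per type, one row per household and month
wide <- reshape(sums, idvar = c("household", "month"),
                timevar = "type", direction = "wide")
names(wide) <- sub("^Freq\\.", "", names(wide))
wide <- wide[order(wide$household, wide$month), ]
head(wide)
```

Unlike the tidyr version, months without data show 0 rather than NA, because xtabs() sums over an empty set.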
I want to create a time series from 01/01/2004 until 31/12/2010 of daily mortality data in R. The raw data that I have now (.csv file), has as columns day - month - year and every row is a death case. So if the mortality on a certain day is for example equal to four, there are four rows with that date. If there is no death case reported on a specific day, that day is omitted in the dataset.
What I need is a time-series with 2557 rows (from 01/01/2004 until 31/12/2010) wherein the total number of death cases per day is listed. If there is no death case on a certain day, I still need that day to be in the list with a "0" assigned to it.
Does anyone know how to do this?
Thanks,
Gosia
Example of the raw data:
day month year
1 1 2004
3 1 2004
3 1 2004
3 1 2004
6 1 2004
7 1 2004
What I need:
day month year deaths
1 1 2004 1
2 1 2004 0
3 1 2004 3
4 1 2004 0
5 1 2004 0
6 1 2004 1
df <- read.table(text="day month year
1 1 2004
3 1 2004
3 1 2004
3 1 2004
6 1 2004
7 1 2004",header=TRUE)
#transform to dates
dates <- as.Date(with(df,paste(year,month,day,sep="-")))
#contingency table
tab <- as.data.frame(table(dates))
names(tab)[2] <- "deaths"
tab$dates <- as.Date(tab$dates)
#sequence of dates
res <- data.frame(dates=seq(from=min(dates),to=max(dates),by="1 day"))
#merge
res <- merge(res,tab,by="dates",all.x=TRUE)
res[is.na(res$deaths),"deaths"] <- 0
res
# dates deaths
#1 2004-01-01 1
#2 2004-01-02 0
#3 2004-01-03 3
#4 2004-01-04 0
#5 2004-01-05 0
#6 2004-01-06 1
#7 2004-01-07 1
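The sequence above only spans the first to the last date found in the data. To cover the whole requested period (01/01/2004 to 31/12/2010, 2557 days) even when its first or last days have no deaths, one option is to count against a fixed calendar. A minimal sketch, assuming the same df as above; all_days and res_full are names introduced here for illustration:

```r
df <- read.table(text = "day month year
1 1 2004
3 1 2004
3 1 2004
3 1 2004
6 1 2004
7 1 2004", header = TRUE)

# One Date per death case
dates <- as.Date(with(df, paste(year, month, day, sep = "-")))

# Every calendar day in the study period
all_days <- seq(as.Date("2004-01-01"), as.Date("2010-12-31"), by = "day")

# A factor with one level per calendar day makes table() report 0 for
# days that never occur in the data
counts <- table(factor(as.character(dates), levels = as.character(all_days)))

res_full <- data.frame(dates = all_days, deaths = as.vector(counts))
head(res_full, 7)
nrow(res_full)
```

If day, month and year columns are needed again, they can be derived from res_full$dates with format().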
Related to How to get column mean for specific rows only?
I am trying to create a new column in my dataframe that scales the "Score" column into sections based off the "Round" column.
Score Quarter
98.7 QTR 1 2011
88.6 QTR 1 2011
76.5 QTR 1 2011
93.5 QTR 2 2011
97.7 QTR 2 2011
89.1 QTR 1 2012
79.4 QTR 1 2012
80.3 QTR 1 2012
Would look like this
Unit Score Quarter Scale
6 98.7 QTR 1 2011 1.01
1 88.6 QTR 1 2011 .98
3 76.5 QTR 1 2011 .01
5 93.5 QTR 2 2011 2.0
6 88.6 QTR 2 2011 2.5
9 89.1 QTR 1 2012 2.2
1 79.4 QTR 1 2012 -.09
3 80.3 QTR 1 2012 -.01
3 98.7 QTR 1 2011 -2.2
I do not want to standardize the entire column because I want to trend the data and truly see how units did relative to each other quarter to quarter rather than scale(data$Score) which would compare all points to each other regardless of round.
I've tried variants of something like this:
data$Score_Scale <- with (data, scale(Score), findInterval(QTR, c(-Inf,"2011-01-01","2011-06-30", Inf)), FUN= scale)
Using ave might be a good option here:
Get your data:
test <- read.csv(textConnection("Score,Quarter
98.7,Round 1 2011
88.6,Round 1 2011
76.5,Round 1 2011
93.5,Round 2 2011
97.7,Round 2 2011
89.1,Round 1 2012
79.4,Round 1 2012
80.3,Round 1 2012"),header=TRUE)
Scale the data within each Quarter group:
test$score_scale <- ave(test$Score,test$Quarter,FUN=scale)
test
Score Quarter score_scale
1 98.7 Round 1 2011 0.96866054
2 88.6 Round 1 2011 0.05997898
3 76.5 Round 1 2011 -1.02863953
4 93.5 Round 2 2011 -0.70710678
5 97.7 Round 2 2011 0.70710678
6 89.1 Round 1 2012 1.15062301
7 79.4 Round 1 2012 -0.65927589
8 80.3 Round 1 2012 -0.49134712
Just to make it obvious that this works, here are the individual results for each Quarter group:
> as.vector(scale(test$Score[test$Quarter=="Round 1 2011"]))
[1] 0.96866054 0.05997898 -1.02863953
> as.vector(scale(test$Score[test$Quarter=="Round 2 2011"]))
[1] -0.7071068 0.7071068
> as.vector(scale(test$Score[test$Quarter=="Round 1 2012"]))
[1] 1.1506230 -0.6592759 -0.4913471
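With its default arguments, scale() subtracts the mean and divides by the standard deviation, so the same grouped result can be sketched with an explicit z-score function passed to ave(). The helper name zscore and the column name score_z are made up for illustration:

```r
test <- data.frame(
  Score = c(98.7, 88.6, 76.5, 93.5, 97.7, 89.1, 79.4, 80.3),
  Quarter = c(rep("Round 1 2011", 3),
              rep("Round 2 2011", 2),
              rep("Round 1 2012", 3))
)

# z-score: distance from the group mean in units of the group's sd
zscore <- function(x) (x - mean(x)) / sd(x)

# ave() applies zscore separately within each Quarter and returns the
# results in the original row order
test$score_z <- ave(test$Score, test$Quarter, FUN = zscore)
test
```

Note that a group containing a single score gets NA, because the standard deviation of one value is undefined.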