I have a time series data.frame in long format, with all values stacked below each other. On every date there are several cases, and the cases recur regularly. Based on the time series I am adding a column with some calculations. These calculations are case-specific, and each one needs the value from the previous date of that case. I have no idea which function to use. Can anybody point me to a function or an example somewhere on the net? Thanks!!
To be clear, this is what I mean. On date 1 the old value (before the score) for case 'a' is 1200. Based on the score of 1 the new value becomes 1250. On date 2 I want this new value, 1250, placed in the column 'old value' (and then do some calculations to come to the new value; that new value has to be placed again in the column 'old value' on date 4 or so, et cetera). The same goes for case B: the new value after the score on date 1 is 1190 and has to be placed in the correct row on date 3 (on date 2 there is no case B), et cetera, for thousands of cases and dates.
date name_case score old_value new_value
1 a 1 1200 1250
1 b 2 1275 1190
1 c 1 1300 1310
2 a 3 1250
2 c 1 1310
3 B 1 1190
Maybe this will do it. Assuming that we start with:
> dat
date name_case score old_value new_value
1 1 a 1 1200 1250
2 1 b 2 1275 1190
3 1 c 1 1300 1310
4 2 a 3 NA NA
5 2 c 1 NA NA
6 3 b 1 NA NA # note ... fixed cap issue
And then make a subset with values for new_value:
dat1 <- dat[ !is.na(dat$old_value), ]
And then replace the NA old_values with results from new_values in the subset by match-ing on name_case
dat[ is.na(dat$old_value), "old_value" ] <-
    dat1$new_value[ match(dat[ is.na(dat$old_value), "name_case" ],
                          dat1$name_case) ]
match generates a numeric vector that is used to index the new_values.
> dat
date name_case score old_value new_value
1 1 a 1 1200 1250
2 1 b 2 1275 1190
3 1 c 1 1300 1310
4 2 a 3 1250 NA
5 2 c 1 1310 NA
6 3 b 1 1190 NA
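The lookup step can also be written as a per-case lag. The following is only a sketch: the helper lag_one and the column prev_new are names made up for this illustration, and it assumes the rows are already sorted by date. In the real problem the new values have to be computed date by date, so this step would be repeated after each round of calculations.

```r
# Toy data from the question (case names lower-cased as in the answer above)
dat <- data.frame(
  date      = c(1, 1, 1, 2, 2, 3),
  name_case = c("a", "b", "c", "a", "c", "b"),
  score     = c(1, 2, 1, 3, 1, 1),
  old_value = c(1200, 1275, 1300, NA, NA, NA),
  new_value = c(1250, 1190, 1310, NA, NA, NA)
)

# Hypothetical helper: shift a vector down by one position
lag_one <- function(x) c(NA, head(x, -1))

# Previous new_value within each case (rows assumed sorted by date)
dat$prev_new <- ave(dat$new_value, dat$name_case, FUN = lag_one)

# Fill the missing old values from the previous row of the same case
fill <- is.na(dat$old_value)
dat$old_value[fill] <- dat$prev_new[fill]
```

Unlike match, which always finds the first row of a case, the lag picks up the immediately preceding row, which matters once a case has more than two dates.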
Related
I used the function ddply (package plyr) to calculate the mean of a response variable for each group "Trial" and "Treatment". I get this data frame:
Trial Treatment N Mean
1 A 458 125.258
1 B 459 168.748
2 A 742 214.266
2 B 142 475.786
3 A 247 145.689
3 B 968 234.129
4 A 436 456.287
This data frame shows that for trial 4 and treatment B there are no observations for the response variable (no such row appears in the data frame). So, is it possible to automatically add a row of zeros to the data frame (built with the function “ddply”) when there are no observations for a given combination?
I would like to get this data frame:
Trial Treatment N Mean
1 A 458 125.258
1 B 459 168.748
2 A 742 214.266
2 B 142 475.786
3 A 247 145.689
3 B 968 234.129
4 A 436 456.287
4 B 0 0
We can merge the original dataset with another data.frame created with the full combination of unique values in 'Trial', and 'Treatment'. It will give an output with the missing combinations filled with NA. If needed, this can be changed to 0 (but it is better to have the missing combination as NA).
res <- merge(expand.grid(lapply(df1[1:2], unique)), df1, all.x=TRUE)
res[is.na(res)] <- 0   # optional: replace the NA values with 0
Or with dplyr/tidyr, we can use complete (from tidyr)
library(dplyr)
library(tidyr)
df1 %>%
complete(Trial, Treatment, fill= list(N=0, Mean=0))
# Trial Treatment N Mean
# (int) (chr) (dbl) (dbl)
#1 1 A 458 125.258
#2 1 B 459 168.748
#3 2 A 742 214.266
#4 2 B 142 475.786
#5 3 A 247 145.689
#6 3 B 968 234.129
#7 4 A 436 456.287
#8 4 B 0 0.000
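The base R merge approach above can be tried end to end with a small self-contained sketch. The data frame df1 is rebuilt here from the table in the question; the name all_combos is only illustrative.

```r
# Summary table from the question, with the (4, B) combination missing
df1 <- data.frame(
  Trial     = c(1, 1, 2, 2, 3, 3, 4),
  Treatment = c("A", "B", "A", "B", "A", "B", "A"),
  N         = c(458, 459, 742, 142, 247, 968, 436),
  Mean      = c(125.258, 168.748, 214.266, 475.786, 145.689, 234.129, 456.287)
)

# Every Trial x Treatment combination, whether observed or not
all_combos <- expand.grid(Trial = unique(df1$Trial),
                          Treatment = unique(df1$Treatment),
                          stringsAsFactors = FALSE)

# Left join: unobserved combinations get NA in N and Mean
res <- merge(all_combos, df1, all.x = TRUE)

# Optional: turn the NA values into 0
res[is.na(res)] <- 0
```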
I have the following data frame (this is only the head of the data frame). The ID column is subject (I have more subjects in the data frame, not only subject #99). I want to calculate the mean "rt" by "subject" and "condition" only for observations that have z.score (in absolute values) smaller than 1.
> b
subject rt ac condition z.score
1 99 1253 1 200_9 1.20862682
2 99 1895 1 102_2 2.95813507
3 99 1049 1 68_1 1.16862102
4 99 1732 1 68_9 2.94415384
5 99 765 1 34_9 -0.63991180
7 99 1016 1 68_2 -0.03191493
I know I can do it using tapply or dcast (from reshape2) after subsetting the data:
b1 <- subset(b, abs(z.score) < 1)
b2 <- dcast(b1, subject~condition, mean, value.var = "rt")
subject 34_1 34_2 34_9 68_1 68_2 68_9 102_1 102_2 102_9 200_1 200_2 200_9
1 99 1028.5714 957.5385 861.6818 837.0000 969.7222 856.4000 912.5556 977.7273 858.7800 1006.0000 1015.3684 913.2449
2 5203 957.8889 815.2500 845.7750 933.0000 893.0000 883.0435 926.0000 879.2778 813.7308 804.2857 803.8125 843.7200
3 5205 1456.3333 1008.4286 850.7170 1142.4444 910.4706 998.4667 935.2500 980.9167 897.4681 1040.8000 838.7917 819.9710
4 5306 1022.2000 940.5882 904.6562 1525.0000 1216.0000 929.5167 955.8571 981.7500 902.8913 997.6000 924.6818 883.4583
5 5307 1396.1250 1217.1111 1044.4038 1055.5000 1115.6000 980.5833 1003.5714 1482.8571 941.4490 1091.5556 1125.2143 989.4918
6 5308 659.8571 904.2857 966.7755 960.9091 1048.6000 904.5082 836.2000 1753.6667 926.0400 870.2222 1066.6667 930.7500
In the example above for b1 each of the subjects had observations that met the subset demands.
However, it can be that for a certain subject I won't have observations after I subset. In this case I want to get NA in b2 for that subject in the specific condition in which he doesn't have observations that meet the subset demands. Does anyone have an idea for a way to do that?
Any help will be greatly appreciated.
Best,
Ayala
There is a drop argument in dcast that you can use in this situation, but you'll need to convert subject to a factor.
Here is a dataset with a second subject ID that has no values that meet your condition that the absolute value of z.score is less than one.
library(reshape2)
bb = data.frame(subject=c(99,99,99,99,99,11,11,11), rt=c(100,150,2,4,10,15,1,2),
ac=rep(1,8), condition=c("A","A","B","D","C","C","D","D"),
z.score=c(0.2,0.3,0.2,0.3,.2,2,2,2))
If you reshape this to a wide format with dcast, you lose subject number 11 even with the drop argument.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE)
subject A B C D
1 99 125 2 10 4
Make subject a factor.
bb$subject = factor(bb$subject)
Now you can dcast with drop = FALSE to keep all subjects in the wide dataset.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE)
subject A B C D
1 11 NaN NaN NaN NaN
2 99 125 2 10 4
To get NA instead of NaN you can use the fill argument.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE, fill = as.numeric(NA))
subject A B C D
1 11 NA NA NA NA
2 99 125 2 10 4
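Since the question also mentions tapply, here is a hedged base R sketch of the same idea. It assumes both subject and condition are converted to factors before subsetting, so that their levels survive the subset; the names kept and wide are only illustrative.

```r
# Same toy data as above: subject 11 has no rows with |z.score| < 1
bb <- data.frame(subject   = c(99, 99, 99, 99, 99, 11, 11, 11),
                 rt        = c(100, 150, 2, 4, 10, 15, 1, 2),
                 condition = c("A", "A", "B", "D", "C", "C", "D", "D"),
                 z.score   = c(0.2, 0.3, 0.2, 0.3, 0.2, 2, 2, 2))

# Factors keep their levels after subsetting
bb$subject   <- factor(bb$subject)
bb$condition <- factor(bb$condition)

kept <- subset(bb, abs(z.score) < 1)

# tapply returns NA for subject/condition cells with no observations
wide <- with(kept, tapply(rt, list(subject, condition), mean))
```

The result is a matrix with subjects as row names rather than a data frame, so it may need as.data.frame() afterwards.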
Is it the following you are after? I created a similar dataset "bb"
library("reshape2") # provides dcast
library("plyr") # needed for the . function below
bb<- data.frame(subject=c(99,99,99,99,99,11,11,11),rt=c(100,150,2,4,10,15,1,2), ac=rep(1,8) ,condition=c("A","A","B","D","C","C","D","D"), z.score=c(0.2,0.3,0.2,0.3,1.5,-0.3,0.8,0.7))
bb
subject rt ac condition z.score
#1 99 100 1 A 0.2
#2 99 150 1 A 0.3
#3 99 2 1 B 0.2
#4 99 4 1 D 0.3
#5 99 10 1 C 1.5
#6 11 15 1 C -0.3
#7 11 1 1 D 0.8
#8 11 2 1 D 0.7
Then you call dcast with subset included:
cc<-dcast(bb,subject~condition, mean, value.var = "rt",subset = .(abs(z.score)<1))
cc
subject A B C D
#1 11 NaN NaN 15 1.5
#2 99 125 2 NaN 4.0
I would like to assign overall industry/parent codes to a data.frame (df below) containing more detailed/child codes (called ChildCodes below). The following data serves to illustrate my data.frame containing the detailed codes:
> df <- as.data.frame(cbind(c(1,2,3,4,5,6),c(110,101,200,2041,3651,2102)))
> names(df) <- c('Id','ChildCodes')
> df
Id ChildCodes
1 1 110
2 2 101
3 3 200
4 4 2041
5 5 3651
6 6 2102
The industry/parent codes are in the .csv file here: https://www.dropbox.com/s/5qtb7ysys1ar0lj/IndustryCodes.csv
The problem for me is the format of the .csv file. The file shows the parent/industry code in column 1 and ranges of child/detailed codes in the next 2 columns. Here is a subset:
> IndustryCodes <- as.data.frame(cbind(c(1,1,2,5,6),c(100,200,2040,2100,3650),c(199,299,2046,2199,3651)))
> names(IndustryCodes) <- c('IndustryGroup','LowerRange','UpperRange')
> IndustryCodes
IndustryGroup LowerRange UpperRange
1 1 100 199
2 1 200 299
3 2 2040 2046
4 5 2100 2199
5 6 3650 3651
So ChildCode 110 corresponds to industry group 1, 2041 to industry group 2, etc. How do I best assign the industry/parent codes (IndustryGroup) to df in R?
Thanks!
You can use sapply to get the Industry code for every child code:
sapply(df$ChildCodes,
function(x) IndustryCodes$IndustryGroup[IndustryCodes$LowerRange <= x &
x <= IndustryCodes$UpperRange])
# [1] 1 1 1 2 6 5
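If some child code falls outside every range, the sapply call above returns a list instead of a vector. A possible variant, sketched here with a hypothetical helper lookup_group, returns NA for such codes and takes the first match if ranges overlap (both are assumptions about the desired behaviour):

```r
df <- data.frame(Id = 1:6,
                 ChildCodes = c(110, 101, 200, 2041, 3651, 2102))
IndustryCodes <- data.frame(IndustryGroup = c(1, 1, 2, 5, 6),
                            LowerRange    = c(100, 200, 2040, 2100, 3650),
                            UpperRange    = c(199, 299, 2046, 2199, 3651))

# Hypothetical helper: group of the first range containing `code`, else NA
lookup_group <- function(code, ranges) {
  hit <- ranges$IndustryGroup[ranges$LowerRange <= code &
                              code <= ranges$UpperRange]
  if (length(hit) == 0) NA_real_ else hit[1]
}

# vapply guarantees a numeric vector of the same length as ChildCodes
df$IndustryGroup <- vapply(df$ChildCodes, lookup_group, numeric(1),
                           ranges = IndustryCodes)
```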
My problem has to do with finding row differences in a data frame by group. I've tried to do this a few ways. Here's an example. The real data set is several million rows long.
set.seed(314)
df = data.frame("group_id"=rep(c(1,2,3),3),
"date"=sample(seq(as.Date("1970-01-01"),Sys.Date(),by=1),9,replace=F),
"logical_value"=sample(c(T,F),9,replace=T),
"integer"=sample(1:100,9,replace=T),
"float"=runif(9))
df = df[order(df$group_id,df$date),]
I ordered it by group_id and date so that the diff function can find the sequential differences, which results in time ordered differences of the logical, integer, and float variables. I could easily do some sort of apply(df,2,diff), but I need it by group_id. Hence, doing apply(df,2,diff) results in extra unneeded results.
df
group_id date logical_value integer float
1 1 1974-05-13 FALSE 4 0.03472876
4 1 1979-12-02 TRUE 45 0.24493995
7 1 1980-08-18 TRUE 2 0.46662253
5 2 1978-12-08 TRUE 56 0.60039164
2 2 1981-12-26 TRUE 34 0.20081799
8 2 1986-05-19 FALSE 60 0.43928929
6 3 1983-05-22 FALSE 25 0.01792820
9 3 1994-04-20 FALSE 34 0.10905326
3 3 2003-11-04 TRUE 63 0.58365922
So I thought I could break up my data frame into chunks by group_id, and pass each chunk into a user defined function:
create_differences = function(data_group){
apply(data_group, 2, diff)
}
But I get errors using the code:
diff_df = lapply(split(df,df$group_id),create_differences)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator
by(df,df$group_id,create_differences)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator
As a side note, the data is nice, no NAs, nulls, blanks, and every group_id has at least 2 rows associated with it.
Edit 1: User alexis_laz correctly pointed out that my function needs to be sapply(data_group, diff).
Using this edit, I get a list of data frames (one list entry per group).
Edit 2:
The expected output would be a combined data frame of differences. Ideally, I would like to keep the group_id, but if not, it's not a big deal. Here is what the sample output should be like:
diff_df
group_id date logical_value integer float
[1,] 1 2029 1 41 0.2102112
[2,] 1 260 0 -43 0.2216826
[1,] 2 1114 0 -22 -0.3995737
[2,] 2 1605 -1 26 0.2384713
[1,] 3 3986 0 9 0.09112507
[2,] 3 3485 1 29 0.47460596
Since you have millions of rows, I think you could move to data.table, which is well suited to by-group operations.
library(data.table)
DT <- as.data.table(df)
## set the key: this sorts by group and then by date
setkeyv(DT, c('group_id', 'date'))
## apply diff to every column, by group
DT[, lapply(.SD, diff), by = group_id]
# group_id date logical_value integer float
# 1: 1 2029 days 1 41 0.21021119
# 2: 1 260 days 0 -43 0.22168257
# 3: 2 1114 days 0 -22 -0.39957366
# 4: 2 1604 days -1 26 0.23847130
# 5: 3 3987 days 0 9 0.09112507
# 6: 3 3485 days 1 29 0.47460596
It certainly won't be as quick compared to data.table but below is an only slightly ugly base solution using aggregate:
result <- aggregate(. ~ group_id, data=df, FUN=diff)
result <- cbind(result[1],lapply(result[-1], as.vector))
result[order(result$group_id),]
# group_id date logical_value integer float
#1 1 2029 1 41 0.21021119
#4 1 260 0 -43 0.22168257
#2 2 1114 0 -22 -0.39957366
#5 2 1604 -1 26 0.23847130
#3 3 3987 0 9 0.09112507
#6 3 3485 1 29 0.47460596
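For completeness, the split/lapply idea from the question can be made to work in base R while keeping group_id. This is only a sketch on made-up toy data (not the sampled data above), and diff_one_group is an illustrative name; it converts every difference to numeric, so dates become day counts.

```r
# Toy data, already ordered by group_id and date
df <- data.frame(
  group_id      = c(1, 1, 1, 2, 2),
  date          = as.Date(c("2000-01-01", "2000-01-11", "2000-01-31",
                            "2000-03-01", "2000-03-06")),
  logical_value = c(FALSE, TRUE, TRUE, TRUE, FALSE),
  integer       = c(4L, 45L, 2L, 10L, 7L)
)

# Column-wise differences for one group; sapply/lapply work per column,
# whereas apply() first coerces the data frame to a character matrix
diff_one_group <- function(g) {
  value_cols <- setdiff(names(g), "group_id")
  data.frame(group_id = g$group_id[-1],
             lapply(g[value_cols], function(col) as.numeric(diff(col))))
}

diff_df <- do.call(rbind, lapply(split(df, df$group_id), diff_one_group))
rownames(diff_df) <- NULL
```

With millions of rows the data.table version is likely to be much faster; this mainly shows why the original apply-based function failed.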
I want to determine the length of the snow season in the following data frame:
DATE SNOW
1998-11-01 0
1998-11-02 0
1998-11-03 0.9
1998-11-04 1
1998-11-05 0
1998-11-06 1
1998-11-07 0.6
1998-11-08 1
1998-11-09 2
1998-11-10 2
1998-11-11 2.5
1998-11-12 3
1998-11-13 6.5
1999-01-01 15
1999-01-02 15
1999-01-03 19
1999-01-04 18
1999-01-05 17
1999-01-06 17
1999-01-07 17
1999-01-08 17
1999-01-09 16
1999-03-01 6
1999-03-02 5
1999-03-03 5
1999-03-04 5
1999-03-05 5
1999-03-06 2
1999-03-07 2
1999-03-08 1.6
1999-03-09 1.2
1999-03-10 1
1999-03-11 0.6
1999-03-12 0
1999-03-13 1
Snow season is defined by a snow depth (SNOW) of at least 1 cm for at least 10 consecutive days (so if there is snow one day in November but it melts afterwards and the depth is < 1 cm, we consider the season not started).
My idea would be to determine:
1) the date of snowpack establishment (in my example 1998-11-08)
2) the date of "disappearing" (here 1999-03-11)
3) calculate the length of the period (number of days between 1998-11-08 and 1999-03-11)
For the 3rd step I can easily get the number between 2 dates using this method.
But how to define the dates with conditions?
This is one way:
# copy data from clipboard (readClipboard() is only available on Windows)
d <- read.table(text=readClipboard(), header=TRUE)
# coerce DATE to Date type, add event grouping variable that numbers the groups
# sequentially and has NA for values not in events.
d <- transform(d, DATE=as.Date(DATE),
event=with(rle(d$SNOW >= 1), rep(replace(ave(values, values, FUN=seq), !values, NA), lengths)))
# aggregate event lengths in days
event.days <- aggregate(DATE ~ event, data=d, function(x) as.numeric(max(x) - min(x), units='days'))
# get those events greater than 10 days
subset(event.days, DATE > 10)
# event DATE
# 3 3 122
You can also use the event grouping variable to find the start dates:
starts <- aggregate(DATE ~ event, data=d, FUN=head, 1)
# 1 1 1998-11-04
# 2 2 1998-11-06
# 3 3 1998-11-08
# 4 4 1999-03-13
And then merge this with event.days:
merge(event.days, starts, by='event')
# event DATE.x DATE.y
# 1 1 0 1998-11-04
# 2 2 0 1998-11-06
# 3 3 122 1998-11-08
# 4 4 0 1999-03-13
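The same rle idea can be wrapped in a small function that returns the start date, end date and span of each qualifying run. This is a sketch under assumptions: the function name snow_season and its parameters are made up here, the span is measured as last date minus first date (as in the answer above), and whether the threshold should be >= or > is left for the reader to adjust.

```r
# Hypothetical helper: runs of SNOW >= min_depth spanning at least min_span days
snow_season <- function(d, min_depth = 1, min_span = 10) {
  runs   <- rle(d$SNOW >= min_depth)
  ends   <- cumsum(runs$lengths)          # last row index of each run
  starts <- ends - runs$lengths + 1       # first row index of each run
  keep   <- runs$values                   # TRUE for runs with snow
  out <- data.frame(start = d$DATE[starts[keep]],
                    end   = d$DATE[ends[keep]])
  out$days <- as.numeric(out$end - out$start)
  out[out$days >= min_span, ]
}

# Toy example: one isolated snow day, then an 11-day run
d <- data.frame(DATE = seq(as.Date("2000-01-01"), by = "day", length.out = 15),
                SNOW = c(0, 1, 0, rep(2, 11), 0))
res <- snow_season(d)
```

Here res contains a single row for the run from 2000-01-04 to 2000-01-14; the isolated snow day on 2000-01-02 is dropped because its span is 0.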