I am struggling with the following problem:
The dataframe below contains the development of a value over time for various ids. What i try to get is the increase/decrease of these values based on a the value in a year when event occurred. Several events can occur within one id, so a new event becomes the new baseline year for the id.
To make things clearer, I also add the outcome I want below
What i have
id value year event
a 100 1950 NA
a 101 1951 NA
a 102 1952 NA
a 103 1953 NA
a 104 1954 NA
a 105 1955 X
a 106 1956 NA
a 107 1957 NA
a 108 1958 NA
a 107 1959 Y
a 106 1960 NA
a 105 1961 NA
a 104.8 1962 NA
a 104.2 1963 NA
b 70 1970 NA
b 75 1971 NA
b 80 1972 NA
b 85 1973 NA
b 90 1974 NA
b 60 1975 Z
b 59 1976 NA
b 58 1977 NA
b 57 1978 NA
b 56 1979 NA
b 55 1980 W
b 54 1981 NA
b 53 1982 NA
b 52 1983 NA
b 51 1984 NA
What I am looking for
id value year event index growth
a 100 1950 NA 0
a 101 1951 NA 0
a 102 1952 NA 0
a 103 1953 NA 0
a 104 1954 NA 0
a 105 1955 X 1 1
a 106 1956 NA 2 1.00952381
a 107 1957 NA 3 1.019047619
a 108 1958 NA 4 1.028571429
a 107 1959 Y 1 1 #new baseline year
a 106 1960 NA 2 0.990654206
a 105 1961 NA 3 0.981308411
a 104.8 1962 NA 4 0.979439252
a 104.2 1963 NA 5 0.973831776
b 70 1970 NA 6
b 75 1971 NA 7
b 80 1972 NA 8
b 85 1973 NA 9
b 90 1974 NA 10
b 60 1975 Z 1 1
b 59 1976 NA 2 0.983333333
b 58 1977 NA 3 0.966666667
b 57 1978 NA 4 0.95
b 56 1979 NA 5 0.933333333
b 55 1980 W 1 1 #new baseline year
b 54 1981 NA 2 0.981818182
b 53 1982 NA 3 0.963636364
b 52 1983 NA 4 0.945454545
b 51 1984 NA 5 0.927272727
What I tried
This and this post were quite helpful and I managed to create differences between the years, however, I fail to reset the base year (index) when there is a new event. Furthermore, I am doubtful whether my approach is indeed the most efficient/elegant one. Seems a bit clumsy to me...
x <- ddply(x, .(id), transform, year.min=min(year[!is.na(event)])) #identifies first event year
x1 <- ddply(x[x$year>=x$year.min,], .(id), transform, index=seq_along(id)) #creates counter years following first event; prior years are removed
x1 <- x1[order(x1$id, x1$year),] #sort
x1 <- ddply(x1, .(id), transform, growth=100*(value/value[1])) #calculate difference, however, based on first event year; this is wrong.
library(Interact) #i then merge the df with the years prior to first event which have been removed in the begining
x$id.year <- interaction(x$id,x$year)
x1$id.year <- interaction(x1$id,x1$year)
x$index <- x$growth <- NA
y <- rbind(x[x$year<x$year.min,],x1)
y <- y[order(y$id,y$year),]
Many thanks for any advice.
# Create a tag to indicate the start of each new event by id or
# when id changes
dat$tag <- with(dat, ave(as.character(event), as.character(id),
FUN=function(i) cumsum(!is.na(i))))
# Calculate the growth by id and tag
# this will also produce results for each id before an event has happened
dat$growth <- with(dat, ave(value, tag, id, FUN=function(i) i/i[1] ))
# remove growth prior to an event (this will be when tag equals zero as no
# event have occurred)
dat$growth[dat$tag==0] <- NA
Here is a solution with dplyr.
ana <- group_by(mydf, id) %>%
do(na.locf(., na.rm = FALSE)) %>%
mutate(value = as.numeric(value)) %>%
group_by(id, event) %>%
mutate(growth = value/value[1]) %>%
mutate(index = row_number(event))
ana$growth[is.na(ana$event)] <- 0
id value year event growth index
1 a 100.0 1950 NA 0.0000000 1
2 a 101.0 1951 NA 0.0000000 2
3 a 102.0 1952 NA 0.0000000 3
4 a 103.0 1953 NA 0.0000000 4
5 a 104.0 1954 NA 0.0000000 5
6 a 105.0 1955 X 1.0000000 1
7 a 106.0 1956 X 1.0095238 2
8 a 107.0 1957 X 1.0190476 3
9 a 108.0 1958 X 1.0285714 4
10 a 107.0 1959 Y 1.0000000 1
11 a 106.0 1960 Y 0.9906542 2
12 a 105.0 1961 Y 0.9813084 3
13 a 104.8 1962 Y 0.9794393 4
14 a 104.2 1963 Y 0.9738318 5
15 b 70.0 1970 NA 0.0000000 1
16 b 75.0 1971 NA 0.0000000 2
17 b 80.0 1972 NA 0.0000000 3
18 b 85.0 1973 NA 0.0000000 4
19 b 90.0 1974 NA 0.0000000 5
20 b 60.0 1975 Z 1.0000000 1
21 b 59.0 1976 Z 0.9833333 2
22 b 58.0 1977 Z 0.9666667 3
23 b 57.0 1978 Z 0.9500000 4
24 b 56.0 1979 Z 0.9333333 5
25 b 55.0 1980 W 1.0000000 1
26 b 54.0 1981 W 0.9818182 2
27 b 53.0 1982 W 0.9636364 3
28 b 52.0 1983 W 0.9454545 4
Try:
ddf$index=0
ddf$growth=0
baseline =0
r=1; start=FALSE
for(r in 1:nrow(ddf)){
if(is.na(ddf$event[r])){
if(start) {
ddf$index[r] = ddf$index[r-1]+1
ddf$growth[r] = ddf$value[r]/baseline
}
else {ddf$index[r] = 0;
}
}
else{
start=T
ddf$index[r] = 1
ddf$growth[r]=1
baseline = ddf$value[r]
}
}
ddf
id value year event index growth
1 a 100.0 1950 <NA> 0 0.0000000
2 a 101.0 1951 <NA> 0 0.0000000
3 a 102.0 1952 <NA> 0 0.0000000
4 a 103.0 1953 <NA> 0 0.0000000
5 a 104.0 1954 <NA> 0 0.0000000
6 a 105.0 1955 X 1 1.0000000
7 a 106.0 1956 <NA> 2 1.0095238
8 a 107.0 1957 <NA> 3 1.0190476
9 a 108.0 1958 <NA> 4 1.0285714
10 a 107.0 1959 Y 1 1.0000000
11 a 106.0 1960 <NA> 2 0.9906542
12 a 105.0 1961 <NA> 3 0.9813084
13 a 104.8 1962 <NA> 4 0.9794393
14 a 104.2 1963 <NA> 5 0.9738318
15 b 70.0 1970 <NA> 6 0.6542056
16 b 75.0 1971 <NA> 7 0.7009346
17 b 80.0 1972 <NA> 8 0.7476636
18 b 85.0 1973 <NA> 9 0.7943925
19 b 90.0 1974 <NA> 10 0.8411215
20 b 60.0 1975 Z 1 1.0000000
21 b 59.0 1976 <NA> 2 0.9833333
22 b 58.0 1977 <NA> 3 0.9666667
23 b 57.0 1978 <NA> 4 0.9500000
24 b 56.0 1979 <NA> 5 0.9333333
25 b 55.0 1980 W 1 1.0000000
26 b 54.0 1981 <NA> 2 0.9818182
27 b 53.0 1982 <NA> 3 0.9636364
28 b 52.0 1983 <NA> 4 0.9454545
29 b 51.0 1984 <NA> 5 0.9272727
Related
I have a quarterly time series. I am trying to apply a function which is supposed calculate the year-to-year growth and year-to-year difference and multiply a variable by (-1).
I already used a similar function for calculating quarter-to-quarter changes and it worked.
I modified this function for yoy changes and it does not have any effect on my data frame. And any error popped up.
Do you have any suggestion how to modify the function or how to accomplish to apply the yoy change function on a time series?
Here is the code:
Date <- c("2004-01-01","2004-04-01", "2004-07-01","2004-10-01","2005-01-01","2005-04-01","2005-07-01","2005-10-01","2006-01-01","2006-04-01","2006-07-01","2006-10-01","2007-01-01","2007-04-01","2007-07-01","2007-10-01")
B1 <- c(3189.30,3482.05,3792.03,4128.66,4443.62,4876.54,5393.01,5885.01,6360.00,6930.00,7430.00,7901.00,8279.00,8867.00,9439.00,10101.00)
B2 <- c(7939.97,7950.58,7834.06,7746.23,7760.59,8209.00,8583.05,8930.74,9424.00,9992.00,10041.00,10900.00,11149.00,12022.00,12662.00,13470.00)
B3 <- as.numeric(c("","","","",140.20,140.30,147.30,151.20,159.60,165.60,173.20,177.30,185.30,199.30,217.10,234.90))
B4 <- as.numeric(c("","","","",-3.50,-14.60,-11.60,-10.20,-3.10,-16.00,-4.90,-17.60,-5.30,-10.90,-12.80,-8.40))
df <- data.frame(Date,B1,B2,B3,B4)
The code will produce following data frame:
Date B1 B2 B3 B4
1 2004-01-01 3189.30 7939.97 NA NA
2 2004-04-01 3482.05 7950.58 NA NA
3 2004-07-01 3792.03 7834.06 NA NA
4 2004-10-01 4128.66 7746.23 NA NA
5 2005-01-01 4443.62 7760.59 140.2 -3.5
6 2005-04-01 4876.54 8209.00 140.3 -14.6
7 2005-07-01 5393.01 8583.05 147.3 -11.6
8 2005-10-01 5885.01 8930.74 151.2 -10.2
9 2006-01-01 6360.00 9424.00 159.6 -3.1
10 2006-04-01 6930.00 9992.00 165.6 -16.0
11 2006-07-01 7430.00 10041.00 173.2 -4.9
12 2006-10-01 7901.00 10900.00 177.3 -17.6
13 2007-01-01 8279.00 11149.00 185.3 -5.3
14 2007-04-01 8867.00 12022.00 199.3 -10.9
15 2007-07-01 9439.00 12662.00 217.1 -12.8
16 2007-10-01 10101.00 13470.00 234.9 -8.4
And I want to apply following changes on the variables:
# yoy absolute difference change
abs.diff = c("B1","B2")
# yoy percentage change
percent.change = c("B3")
# make the variable negative
negative = c("B4")
This is the fuction that I am trying to use for my data frame.
transformation = function(D,abs.diff,percent.change,negative)
{
TT <- dim(D)[1]
DData <- D[-1,]
nms <- c()
for (i in c(2:dim(D)[2])) {
# yoy absolute difference change
if (names(D)[i] %in% abs.diff)
{ DData[,i] = (D[5:TT,i]-D[1:(TT-4),i])
names(DData)[i] = paste('a',names(D)[i],sep='') }
# yoy percent. change
if (names(D)[i] %in% percent.change)
{ DData[,i] = 100*(D[5:TT,i]-D[1:(TT-4),i])/D[1:(TT-4),i]
names(DData)[i] = paste('p',names(D)[i],sep='') }
#CA.deficit
if (names(D)[i] %in% negative)
{ DData[,i] = (-1)*D[1:TT,i] }
}
return(DData)
}
This is what I would like to get :
Date pB1 pB2 aB3 B4
1 2004-01-01 NA NA NA NA
2 2004-04-01 NA NA NA NA
3 2004-07-01 NA NA NA NA
4 2004-10-01 NA NA NA NA
5 2005-01-01 39.33 -2.26 NA 3.5
6 2005-04-01 40.05 3.25 NA 14.6
7 2005-07-01 42.22 9.56 NA 11.6
8 2005-10-01 42.54 15.29 11.0 10.2
9 2006-01-01 43.13 21.43 19.3 3.1
10 2006-04-01 42.11 21.72 18.3 16.0
11 2006-07-01 37.77 16.99 22.0 4.9
12 2006-10-01 34.26 22.05 17.7 17.6
13 2007-01-01 30.17 18.3 19.7 5.3
14 2007-04-01 27.95 20.32 26.1 10.9
15 2007-07-01 27.04 26.1 39.8 12.8
16 2007-10-01 27.84 23.58 49.6 8.4
Grouping by the months, i.e. 6th and 7th substring using ave and do the necessary calculations. With sapply we may loop over the columns.
f <- function(x) {
g <- substr(Date, 6, 7)
l <- length(unique(g))
o <- ave(x, g, FUN=function(x) 100/x * c(x[-1], NA) - 100)
c(rep(NA, l), head(o, -4))
}
cbind(df[1], sapply(df[-1], f))
# Date B1 B2 B3 B4
# 1 2004-01-01 NA NA NA NA
# 2 2004-04-01 NA NA NA NA
# 3 2004-07-01 NA NA NA NA
# 4 2004-10-01 NA NA NA NA
# 5 2005-01-01 39.32901 -2.259202 NA NA
# 6 2005-04-01 40.04796 3.250329 NA NA
# 7 2005-07-01 42.21960 9.560688 NA NA
# 8 2005-10-01 42.54044 15.291439 NA NA
# 9 2006-01-01 43.12655 21.434066 13.83738 -11.428571
# 10 2006-04-01 42.10895 21.720063 18.03279 9.589041
# 11 2006-07-01 37.77093 16.986386 17.58316 -57.758621
# 12 2006-10-01 34.25636 22.050356 17.26190 72.549020
# 13 2007-01-01 30.17296 18.304329 16.10276 70.967742
# 14 2007-04-01 27.95094 20.316253 20.35024 -31.875000
# 15 2007-07-01 27.03903 26.102978 25.34642 161.224490
# 16 2007-10-01 27.84458 23.577982 32.48731 -52.272727
I have a database with sales value for individual firms that belong to different industries.
In the example dataset below:
set.seed(123)
df <- data.table(year=rep(1980:1984,each=4),sale=sample(100:150,20),ind=sample(LETTERS[1:2],20,replace = TRUE))
df[order(year,ind)]
year sale ind
1: 1980 114 A
2: 1980 102 A
3: 1980 130 B
4: 1980 113 B
5: 1981 136 A
6: 1981 148 A
7: 1981 141 B
8: 1981 142 B
9: 1982 124 A
10: 1982 125 A
11: 1982 104 A
12: 1982 126 B
13: 1983 108 A
14: 1983 128 A
15: 1983 140 B
16: 1983 127 B
17: 1984 134 A
18: 1984 107 A
19: 1984 106 A
20: 1984 146 B
The column "ind" represents industry and I have omitted the firm identifiers (no use in this example).
I want an average defined as follows:
For each year, the desired average is the average of all firms within the industry over the past three years. If the data for past three years is not available, a minimum of two observations is also acceptable.
For example, in the above dataset, if year=1982, and ind=A, there are only two observations for past years (which is still acceptable), so the desired average is the average of all sale values in years 1980 and 1981 for industry A.
If year=1983, and ind=A, we have three prior years, and the desired average is the average of all sale values in years 1980, 1981, and 1982 for industry A.
If year=1984, and ind=A, we have three prior years, and the desired average is the average of all sale values in years 1981, 1982, and 1983 for industry A.
The desired output, thus, will be as follows:
year sale ind mymean
1: 1980 130 B NA
2: 1980 114 A NA
3: 1980 113 B NA
4: 1980 102 A NA
5: 1981 141 B NA
6: 1981 142 B NA
7: 1981 136 A NA
8: 1981 148 A NA
9: 1982 124 A 125.0000
10: 1982 125 A 125.0000
11: 1982 126 B 131.5000
12: 1982 104 A 125.0000
13: 1983 140 B 130.4000
14: 1983 127 B 130.4000
15: 1983 108 A 121.8571
16: 1983 128 A 121.8571
17: 1984 134 A 124.7143
18: 1984 107 A 124.7143
19: 1984 146 B 135.2000
20: 1984 106 A 124.7143
A data.table solution is much preferred for fast implementation.
Many thanks in advance.
I am not very good in data.table. Here is one tidyverse solution if you like or if you can translate it to data.table
library(tidyverse)
df %>% group_by(ind, year) %>%
summarise(ds = sum(sale),
dn = n()) %>%
mutate(ds = (lag(ds,1)+lag(ds,2)+ifelse(is.na(lag(ds,3)), 0, lag(ds,3)))/(lag(dn,1)+lag(dn,2)+ifelse(is.na(lag(dn,3)), 0, lag(dn,3)))
) %>% select(ind, year, mymean = ds) %>%
right_join(df, by = c("ind", "year"))
`summarise()` regrouping output by 'ind' (override with `.groups` argument)
# A tibble: 20 x 4
ind year mymean sale
<chr> <int> <dbl> <int>
1 A 1980 NA 114
2 A 1980 NA 102
3 A 1981 NA 136
4 A 1981 NA 148
5 A 1982 125 124
6 A 1982 125 125
7 A 1982 125 104
8 A 1983 122. 108
9 A 1983 122. 128
10 A 1984 125. 134
11 A 1984 125. 107
12 A 1984 125. 106
13 B 1980 NA 130
14 B 1980 NA 113
15 B 1981 NA 141
16 B 1981 NA 142
17 B 1982 132. 126
18 B 1983 130. 140
19 B 1983 130. 127
20 B 1984 135. 146
You can use zoo's rollapply function to perform this rolling calculation. Note that there are dedicated functions to calculate rolling mean like frollmean in data.table and rollmean in zoo but they lack the argument partial = TRUE present in rollapply. partial = TRUE is useful here since you want to calculate the mean even if the window size is less than 3.
We can first calculate mean of sale value for each ind and year, then perform rolling mean calculation with window size of 3 and join this data with the original dataframe to get all the rows of original dataframe back.
library(data.table)
library(zoo)
df1 <- df[, .(sale = mean(sale)), .(ind, year)]
df2 <- df1[, my_mean := shift(rollapplyr(sale, 3, function(x)
if(length(x) > 1) mean(x, na.rm = TRUE) else NA, partial = TRUE)), ind]
df[df2, on = .(ind, year)]
This can be written using dplyr as :
library(dplyr)
df %>%
group_by(ind, year) %>%
summarise(sale = mean(sale)) %>%
mutate(avg_mean = lag(rollapplyr(sale, 3, partial = TRUE, function(x)
if(length(x) > 1) mean(x, na.rm = TRUE) else NA))) %>%
left_join(df, by = c('ind', 'year'))
Based on Ronak's answer (the mean of previous means), a more general way (the mean of all previous values), and a data.table solution then can be:
library(data.table)
library(roll)
df1 <- df[, .(sum_1 = sum(sale), n=length(sale)), .(ind, year)]
df1[,`:=`(
my_sum = roll_sum(shift(sum_1),3,min_obs = 2),
my_n = roll_sum(shift(n),3,min_obs = 2)
),by=.(ind)]
df1[,`:=`(my_mean=(my_sum/my_n))]
> df[df1[,!c("sum_1","n","my_sum","my_n")] ,on = .(ind, year)]
year sale ind my_mean
1: 1980 130 B NA
2: 1980 113 B NA
3: 1980 114 A NA
4: 1980 102 A NA
5: 1981 141 B NA
6: 1981 142 B NA
7: 1981 136 A NA
8: 1981 148 A NA
9: 1982 124 A 125.0000
10: 1982 125 A 125.0000
11: 1982 104 A 125.0000
12: 1982 126 B 131.5000
13: 1983 140 B 130.4000
14: 1983 127 B 130.4000
15: 1983 108 A 121.8571
16: 1983 128 A 121.8571
17: 1984 134 A 124.7143
18: 1984 107 A 124.7143
19: 1984 106 A 124.7143
20: 1984 146 B 135.2000
This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 3 years ago.
I have a dataframe like this:
ID year fcmstat secmstat mstat
138 4 1998 NA NA 1
139 4 1999 NA NA 1
140 4 2000 NA NA 1
141 4 2001 NA NA 1
142 4 2002 NA NA 1
143 4 2003 2 NA 2
144 4 2004 NA NA NA
145 4 2005 NA NA NA
146 4 2006 NA 3 3
147 4 2007 NA NA NA
375 19 2001 NA NA 2
376 19 2002 6 NA 6
377 19 2003 NA NA NA
378 19 2004 NA 5 5
379 19 2005 NA NA NA
380 19 2006 NA NA 1
fcmstat: type of first marital status change
secmstat: type of second marital status change
first marital status, for ID 4(19), fsmstat was changed in 2003(2002) and second marital status secmstat was changed in 2006(2004). So, for ID 4, in 2004 and 2005 marital status was same as fcmstat of 2003 and for ID 19, 2003's mstat should be same as fcmstat of 2002.
I want to fill in t he last column as follows:
ID year fcmstat secmstat mstat
138 4 1998 NA NA 1
139 4 1999 NA NA 1
140 4 2000 NA NA 1
141 4 2001 NA NA 1
142 4 2002 NA NA 1
143 4 2003 2 NA 2
144 4 2004 NA NA 2
145 4 2005 NA NA 2
146 4 2006 NA 3 3
147 4 2007 NA NA NA
375 19 2001 NA NA 2
376 19 2002 6 NA 6
377 19 2003 NA NA 6
378 19 2004 NA 5 5
379 19 2005 NA NA NA
380 19 2006 NA NA 1
Also, before any first change, the mstatshould be same as before. Consider the following case.
ID year fcmstat secmstat mstat
1171 61 1978 NA NA 0
1172 61 1979 NA NA 0
1173 61 1980 NA NA 0
1174 61 1981 NA NA 0
1175 61 1982 NA NA 0
1176 61 1983 NA NA NA
1177 61 1984 NA NA NA
1178 61 1985 1 NA 1
1179 61 1986 NA NA 1
1180 61 1987 NA NA 1
the first change was in 1985. So, the missing mstat in 1984 and 1983 should be same as mstat of 1982. SO for this case, my desired output is:
ID year fcmstat secmstat mstat
1171 61 1978 NA NA 0
1172 61 1979 NA NA 0
1173 61 1980 NA NA 0
1174 61 1981 NA NA 0
1175 61 1982 NA NA 0
1176 61 1983 NA NA 0
1177 61 1984 NA NA 0
1178 61 1985 1 NA 1
1179 61 1986 NA NA 1
1180 61 1987 NA NA 1
As suggested by Schilker the code df$mstat_updated<-na.locf(df$mstat) gives the following:
ID year fcmstat secmstat mstat mstat_updated
138 4 1998 NA NA 1 1
139 4 1999 NA NA 1 1
140 4 2000 NA NA 1 1
141 4 2001 NA NA 1 1
142 4 2002 NA NA 1 1
143 4 2003 2 NA 2 2
144 4 2004 NA NA NA 2
145 4 2005 NA NA NA 2
146 4 2006 NA 3 3 3
147 4 2007 NA NA NA 3
148 4 2008 NA NA NA 3
However, I do want to fill in mstat for 2004 and 2005 but not in 2007 and 2008. I want to fill in NA's only between first marstat change, fcmstat and second marstat, secmstat change.
As I mentioned in my comment this a duplicate of here
library(zoo)
df<-data.frame(ID=c('4','4','4','4'),
year=c(2003,2004,2005,2006),
mstat=c(2,NA,NA,3))
df$mstat<-na.locf(df$mstat)
I have the following data.table:
Month Day Lat Long Temperature
1: 10 01 80.0 180 -6.383330333333309
2: 10 01 77.5 180 -6.193327999999976
3: 10 01 75.0 180 -6.263328333333312
4: 10 01 72.5 180 -5.759997333333306
5: 10 01 70.0 180 -4.838330999999976
---
117020: 12 31 32.5 310 11.840003833333355
117021: 12 31 30.0 310 13.065001833333357
117022: 12 31 27.5 310 14.685003333333356
117023: 12 31 25.0 310 15.946669666666690
117024: 12 31 22.5 310 16.578336333333358
For every location (given by Lat and Long), I have a temperature for each day from 1 October to 31 December.
There are 1,272 locations consisting of each pairwise combination of Lat:
Lat
1 80.0
2 77.5
3 75.0
4 72.5
5 70.0
--------
21 30.0
22 27.5
23 25.0
24 22.5
and Long:
Long
1 180.0
2 182.5
3 185.0
4 187.5
5 190.0
---------
49 300.0
50 302.5
51 305.0
52 307.5
53 310.0
I'm trying to create a data.table that consists of 1,272 rows (one per location) and 92 columns (one per day). Each element of that data.table will then contain the temperature at that location on that day.
Any advice about how to accomplish that goal without using a for loop?
Here we use ChickWeights as the data, where we use "Chick-Diet" as the equivalent of your "lat-lon", and "Time" as your "Date":
dcast.data.table(data.table(ChickWeight), Chick + Diet ~ Time)
Produces:
Chick Diet 0 2 4 6 8 10 12 14 16 18 20 21
1: 18 1 1 1 NA NA NA NA NA NA NA NA NA NA
2: 16 1 1 1 1 1 1 1 1 NA NA NA NA NA
3: 15 1 1 1 1 1 1 1 1 1 NA NA NA NA
4: 13 1 1 1 1 1 1 1 1 1 1 1 1 1
5: ... 46 rows omitted
You will likely need to lat + lon ~ Month + Day or some such for your formula.
In the future, please make your question reproducible as I did here by using a built-in data set.
First create a date value using the lubridate package (I assumed year = 2014, adjust as necessary):
library(lubridate)
df$datetext <- paste(df$Month,df$Day,"2014",sep="-")
df$date <- mdy(df$datetext)
Then one option is to use the tidyr package to spread the columns:
library(tidyr)
spread(df[,-c(1:2,6)],date,Temperature)
Lat Long 2014-10-01 2014-12-31
1 22.5 310 NA 16.57834
2 25.0 310 NA 15.94667
3 27.5 310 NA 14.68500
4 30.0 310 NA 13.06500
5 32.5 310 NA 11.84000
6 70.0 180 -4.838331 NA
7 72.5 180 -5.759997 NA
8 75.0 180 -6.263328 NA
9 77.5 180 -6.193328 NA
10 80.0 180 -6.383330 NA
I'm trying to compare model-based forecasts from two different models. Model number 2 however requires more non-missing data and has thus more missing values (NA) than model 1.
I am wondering now how I can quickly query both dataframes for non-missing values and identify the common sample? I used to work with excel and the function
=IF(AND(ISVALUE(a1);ISVALUE(b1));then;else)
comes to my mind but I don't know how to do it properly with R.
This is my df from model 1: Every observation is clearly identified by id and time.
(the rownumbers on the left are from my overall dataframe and are identical in both dataframes.)
> head(model1)
id time f1 f2 f3 f4 f5
9 1 1995 16.351261 -1.856662 6.577671 10.7883178 22.5349438
10 1 1996 15.942914 -1.749530 2.894190 0.6058255 1.7057163
11 1 1997 24.187390 15.099166 14.275441 -4.9963831 -0.1866863
12 1 1998 3.101094 -10.455754 -9.674086 -9.8456466 -8.5525140
13 1 1999 33.562234 2.610512 -15.237620 -18.8095980 -17.6351989
14 1 2000 59.979666 -45.106093 -100.352866 -56.6137325 -32.0315737
and this model 2:
> head(model2)
id time meanf1 meanf2 meanf3 meanf4 meanf5
9 1 1995 4.56 5.14 6.05 NA NA
10 1 1996 4.38 4.94 NA NA NA
11 1 1997 4.05 4.51 NA NA NA
12 1 1998 4.07 5.04 6.52 NA NA
13 1 1999 3.61 4.96 NA NA NA
14 1 2000 4.35 4.83 6.46 NA NA
Thank you for your help and hints.
The function complete.cases gives non-missing data across all columns. The sets (f4,meanf4) and (f5,meanf5) have no "common" non-missing values in the sample data, hence have no observations. Is this what you were looking for
#Read Data
model1=read.table(text='id time f1 f2 f3 f4 f5
1 1995 16.351261 -1.856662 6.577671 10.7883178 22.5349438
1 1996 15.942914 -1.749530 2.894190 0.6058255 1.7057163
1 1997 24.187390 15.099166 14.275441 -4.9963831 -0.1866863
1 1998 3.101094 -10.455754 -9.674086 -9.8456466 -8.5525140
1 1999 33.562234 2.610512 -15.237620 -18.8095980 -17.6351989
1 2000 59.979666 -45.106093 -100.352866 -56.6137325 -32.0315737',header=TRUE)
model2=read.table(text=' id time meanf1 meanf2 meanf3 meanf4 meanf5
1 1995 4.56 5.14 6.05 NA NA
1 1996 4.38 4.94 NA NA NA
1 1997 4.05 4.51 NA NA NA
1 1998 4.07 5.04 6.52 NA NA
1 1999 3.61 4.96 NA NA NA
1 2000 4.35 4.83 6.46 NA NA',header=TRUE)
#name indices of f1..f5 = 3..7
#merge data for each f1..f5 and keep only non-missing values using, complete.cases()
DF_list = lapply(3:7,function(x) {
DF=merge(model1[,c(1,2,x)],model2[,c(1,2,x)],by=c("id","time"));
DF=DF[complete.cases(DF),];
return(DF);
})
DF_list
#[[1]]
# id time f1 meanf1
#1 1 1995 16.351261 4.56
#2 1 1996 15.942914 4.38
#3 1 1997 24.187390 4.05
#4 1 1998 3.101094 4.07
#5 1 1999 33.562234 3.61
#6 1 2000 59.979666 4.35
#
#[[2]]
# id time f2 meanf2
#1 1 1995 -1.856662 5.14
#2 1 1996 -1.749530 4.94
#3 1 1997 15.099166 4.51
#4 1 1998 -10.455754 5.04
#5 1 1999 2.610512 4.96
#6 1 2000 -45.106093 4.83
#
#[[3]]
# id time f3 meanf3
#1 1 1995 6.577671 6.05
#4 1 1998 -9.674086 6.52
#6 1 2000 -100.352866 6.46
#
#[[4]]
#[1] id time f4 meanf4
#<0 rows> (or 0-length row.names)
#
#[[5]]
#[1] id time f5 meanf5
#<0 rows> (or 0-length row.names)