Common data/sample in two dataframes in R

I'm trying to compare model-based forecasts from two different models. Model 2, however, requires more non-missing data and thus has more missing values (NA) than model 1.
I am now wondering how I can quickly query both dataframes for non-missing values and identify the common sample. I used to work with Excel, where the function
=IF(AND(ISVALUE(a1);ISVALUE(b1));then;else)
comes to mind, but I don't know how to do this properly in R.
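(In base R, the element-wise analogue of that Excel test would be something like ifelse(!is.na(a) & !is.na(b), yes, no); the answer below generalizes this with complete.cases().)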
This is my df from model 1. Every observation is uniquely identified by id and time.
(The row numbers on the left are from my overall dataframe and are identical in both dataframes.)
> head(model1)
id time f1 f2 f3 f4 f5
9 1 1995 16.351261 -1.856662 6.577671 10.7883178 22.5349438
10 1 1996 15.942914 -1.749530 2.894190 0.6058255 1.7057163
11 1 1997 24.187390 15.099166 14.275441 -4.9963831 -0.1866863
12 1 1998 3.101094 -10.455754 -9.674086 -9.8456466 -8.5525140
13 1 1999 33.562234 2.610512 -15.237620 -18.8095980 -17.6351989
14 1 2000 59.979666 -45.106093 -100.352866 -56.6137325 -32.0315737
and this model 2:
> head(model2)
id time meanf1 meanf2 meanf3 meanf4 meanf5
9 1 1995 4.56 5.14 6.05 NA NA
10 1 1996 4.38 4.94 NA NA NA
11 1 1997 4.05 4.51 NA NA NA
12 1 1998 4.07 5.04 6.52 NA NA
13 1 1999 3.61 4.96 NA NA NA
14 1 2000 4.35 4.83 6.46 NA NA
Thank you for your help and hints.

The function complete.cases() gives the rows that are non-missing across all supplied columns. The pairs (f4, meanf4) and (f5, meanf5) have no "common" non-missing values in the sample data, hence yield no observations. Is this what you were looking for?
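For intuition, complete.cases() flags exactly the rows that contain no NA in any of the supplied columns:
complete.cases(data.frame(a = c(1, NA), b = c(2, 3)))
#[1]  TRUE FALSE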
#Read Data
model1=read.table(text='id time f1 f2 f3 f4 f5
1 1995 16.351261 -1.856662 6.577671 10.7883178 22.5349438
1 1996 15.942914 -1.749530 2.894190 0.6058255 1.7057163
1 1997 24.187390 15.099166 14.275441 -4.9963831 -0.1866863
1 1998 3.101094 -10.455754 -9.674086 -9.8456466 -8.5525140
1 1999 33.562234 2.610512 -15.237620 -18.8095980 -17.6351989
1 2000 59.979666 -45.106093 -100.352866 -56.6137325 -32.0315737',header=TRUE)
model2=read.table(text=' id time meanf1 meanf2 meanf3 meanf4 meanf5
1 1995 4.56 5.14 6.05 NA NA
1 1996 4.38 4.94 NA NA NA
1 1997 4.05 4.51 NA NA NA
1 1998 4.07 5.04 6.52 NA NA
1 1999 3.61 4.96 NA NA NA
1 2000 4.35 4.83 6.46 NA NA',header=TRUE)
# Column indices of f1..f5 are 3..7
# Merge model1 and model2 for each forecast pair and keep only rows that are
# non-missing in both, using complete.cases()
DF_list = lapply(3:7, function(x) {
  DF = merge(model1[, c(1, 2, x)], model2[, c(1, 2, x)], by = c("id", "time"))
  DF = DF[complete.cases(DF), ]
  return(DF)
})
DF_list
#[[1]]
# id time f1 meanf1
#1 1 1995 16.351261 4.56
#2 1 1996 15.942914 4.38
#3 1 1997 24.187390 4.05
#4 1 1998 3.101094 4.07
#5 1 1999 33.562234 3.61
#6 1 2000 59.979666 4.35
#
#[[2]]
# id time f2 meanf2
#1 1 1995 -1.856662 5.14
#2 1 1996 -1.749530 4.94
#3 1 1997 15.099166 4.51
#4 1 1998 -10.455754 5.04
#5 1 1999 2.610512 4.96
#6 1 2000 -45.106093 4.83
#
#[[3]]
# id time f3 meanf3
#1 1 1995 6.577671 6.05
#4 1 1998 -9.674086 6.52
#6 1 2000 -100.352866 6.46
#
#[[4]]
#[1] id time f4 meanf4
#<0 rows> (or 0-length row.names)
#
#[[5]]
#[1] id time f5 meanf5
#<0 rows> (or 0-length row.names)
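If you only need the common sample for one forecast pair, the same idea works on a single merged frame; a minimal variant using the frames above:
# Merge once, then flag the rows where a given pair is complete
both = merge(model1, model2, by = c("id", "time"))
both[complete.cases(both[, c("f3", "meanf3")]), c("id", "time", "f3", "meanf3")]
This reproduces DF_list[[3]] above.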

Related

Creating averages across time periods

I'm a beginner with R. I have the dataframe below (the full data has more observations), in which each 'id' is observed in at most the three years 91, 99, and 07.
I want to create a variable avg_ln_rd by 'id' that takes the average of the current 'ln_rd' and the 'ln_rd' from year 91 if the first ln_rd observation is from 99, and from year 99 if the first ln_rd observation is from 07.
id year ln_rd
<dbl> <dbl> <dbl>
1 1013 1991 3.51
2 1013 1999 5.64
3 1013 2007 4.26
4 1021 1991 0.899
5 1021 1999 0.791
6 1021 2007 0.704
7 1034 1991 2.58
8 1034 1999 3.72
9 1034 2007 4.95
10 1037 1991 0.262
I also already dropped any observations of 'id' that only exist for one of the three years.
My first thought was to create a standalone variable for ln_rd for each year, but then I would still need to filter by id, which I do not know how to do.
Then I tried using these standalone variables to form an if clause.
df$lagln_rd_99 <- ifelse(df$year == 1999, df$ln_rd_91, NA)
But again I do not know how to keep 'id' fixed.
Any help would be greatly appreciated.
EDIT:
I grouped by id using dplyr. Can I then just sort my df by id and create a new variable that is ln_rd but shifted by one row?
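(For the one-row shift itself, a grouped lag can be written with dplyr; a minimal sketch, where prev_ln_rd is an illustrative name:)
library(dplyr)
df %>%
  arrange(id, year) %>%
  group_by(id) %>%
  mutate(prev_ln_rd = lag(ln_rd)) %>%
  ungroup()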
Still a bit unclear what to do if all years are present in a group but this might help.
-- edited -- to show the desired output.
library(dplyr)
df %>%
  group_by(id) %>%
  arrange(id, year) %>%
  mutate(avg91 = mean(c(ln_rd[year == 1991], ln_rd[year == 1999])),
         avg99 = mean(c(ln_rd[year == 1999], ln_rd[year == 2007])),
         avg91 = ifelse(year == 1991, avg91, NA),
         avg99 = ifelse(year == 2007, avg99, NA)) %>%
  ungroup()
# A tibble: 15 × 5
year id ln_rd avg91 avg99
<int> <int> <dbl> <dbl> <dbl>
1 1991 3505 3.38 3.09 NA
2 1999 3505 2.80 NA NA
3 1991 4584 1.45 1.34 NA
4 1999 4584 1.22 NA NA
5 1991 5709 1.90 2.13 NA
6 1999 5709 2.36 NA NA
7 2007 5709 3.11 NA 2.74
8 2007 9777 2.36 NA 2.36
9 1991 18729 4.82 5.07 NA
10 1999 18729 5.32 NA NA
11 2007 18729 5.53 NA 5.42
12 1991 20054 0.588 0.307 NA
13 1999 20054 0.0266 NA NA
14 1999 62169 1.91 NA NA
15 2007 62169 1.45 NA 1.68

Processing data.frame that needs order and cumulative days

With the small reproducible example below, I'd like to identify the dplyr approach that arrives at the data.frame shown at the end of this note. The desired output ensures that the data.frame is sorted by date (note that the dates 1999-04-13 and 1999-03-12 are out of order) and then "accumulates", within each wy grouping (wy = "water year"; Oct 1-Sep 30), the number of days on which Q is above a threshold of 3.0.
dat <- read.table(text="
Date wy Q
1997-01-01 1997 9.82
1997-02-01 1997 3.51
1997-02-02 1997 9.35
1997-10-04 1998 0.93
1997-11-01 1998 1.66
1997-12-02 1998 0.81
1998-04-03 1998 5.65
1998-05-05 1998 7.82
1998-07-05 1998 6.33
1998-09-06 1998 0.55
1998-09-07 1998 4.54
1998-10-09 1999 6.50
1998-12-31 1999 2.17
1999-01-01 1999 5.67
1999-04-13 1999 5.66
1999-03-12 1999 4.67
1999-06-05 1999 3.34
1999-09-30 1999 1.99
1999-11-06 2000 5.75
2000-03-04 2000 6.28
2000-06-07 2000 0.81
2000-07-06 2000 9.66
2000-09-09 2000 9.08
2000-09-21 2000 6.72", header=TRUE)
dat$Date <- as.Date(dat$Date)
mdat <- dat %>%
  group_by(wy) %>%
  filter(Q > 3) %>%
  ?
Desired results:
Date wy Q abvThreshCum
1997-01-01 1997 9.82 1
1997-02-01 1997 3.51 2
1997-02-02 1997 9.35 3
1997-10-04 1998 0.93 0
1997-11-01 1998 1.66 0
1997-12-02 1998 0.81 0
1998-04-03 1998 5.65 1
1998-05-05 1998 7.82 2
1998-07-05 1998 6.33 3
1998-09-06 1998 0.55 3
1998-09-07 1998 4.54 4
1998-10-09 1999 6.50 1
1998-12-31 1999 2.17 1
1999-01-01 1999 5.67 2
1999-03-12 1999 4.67 3
1999-04-13 1999 5.66 4
1999-06-05 1999 3.34 5
1999-09-30 1999 1.99 5
1999-11-06 2000 5.75 1
2000-03-04 2000 6.28 2
2000-06-07 2000 0.81 2
2000-07-06 2000 9.66 3
2000-09-09 2000 9.08 4
2000-09-21 2000 6.72 5
library(dplyr)
dat %>%
  arrange(Date) %>%
  group_by(wy) %>%
  mutate(abv = cumsum(Q > 3)) %>%
  ungroup()
# # A tibble: 24 x 4
# Date wy Q abv
# <date> <int> <dbl> <int>
# 1 1997-01-01 1997 9.82 1
# 2 1997-02-01 1997 3.51 2
# 3 1997-02-02 1997 9.35 3
# 4 1997-10-04 1998 0.93 0
# 5 1997-11-01 1998 1.66 0
# 6 1997-12-02 1998 0.81 0
# 7 1998-04-03 1998 5.65 1
# 8 1998-05-05 1998 7.82 2
# 9 1998-07-05 1998 6.33 3
# 10 1998-09-06 1998 0.55 3
# # ... with 14 more rows
A data.table approach:
library(data.table)
setDT(dat, key = "Date")[, abvThreshCum := cumsum(Q > 3), by = .(wy)]
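The same cumulative count can also be written in base R with ave(); a sketch, reusing dat from above:
dat <- dat[order(dat$Date), ]
dat$abvThreshCum <- ave(dat$Q > 3, dat$wy, FUN = cumsum)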

Create a Custom Function that Extracts Certain Rows

head(MYK)
X Analyte Subject Cohort DayNominal HourNominal Concentration uniqueID FS EF VTI deltaFS deltaEF deltaVTI HR
2 MYK-461 005-010 1 1 0.25 31.00 005-0100.25 31.82 64.86 0.00 3 -1 -100 58
3 MYK-461 005-010 1 1 0.50 31.80 005-0100.5 NA NA NA NA NA NA NA
4 MYK-461 005-010 1 1 1.00 9.69 005-0101 26.13 69.11 0.00 -15 6 -100 55
5 MYK-461 005-010 1 1 1.50 8.01 005-0101.5 NA NA NA NA NA NA NA
6 MYK-461 005-010 1 1 2.00 5.25 005-0102 NA NA NA NA NA NA NA
7 MYK-461 005-010 1 1 3.00 3.26 005-0103 29.89 60.99 23.49 -3 -7 9 55
105 MYK-461 005-033 2 1 0.25 3.4 005-0330.25 30.18 68.59 23.22 1 0 16 47
106 MYK-461 005-033 2 1 0.50 12.4 005-0330.5 NA NA NA NA NA NA NA
107 MYK-461 005-033 2 1 0.75 27.1 005-0330.75 NA NA NA NA NA NA NA
108 MYK-461 005-033 2 1 1.00 23.5 005-0331 32.12 69.60 21.06 7 2 5 43
109 MYK-461 005-033 2 1 1.50 16.8 005-0331.5 NA NA NA NA NA NA NA
110 MYK-461 005-033 2 1 2.00 15.8 005-0332 NA NA NA NA NA NA NA
organize = function(x, y) {
  g1 = subset(x, Cohort == y)
  g1 = aggregate(x[, 'Concentration'], by = list(x[, 'HourNominal']), FUN = mean)
  g1 = setNames(g1, c('HourNominal', 'Concentration'))
  g2 = aggregate(x[, 'Concentration'], by = list(x[, 'HourNominal']), FUN = sd)
  g2 = setNames(g2, c('HourNominal', 'SD'))
  g1[, 'SD'] = g2$SD
  g1$top = g1$Concentration + g1$SD
  g1$bottom = g1$Concentration - g1$SD
  return(g1)
}
I have a dataframe here, along with some code to subset the dataframe based on a certain Cohort, and to aggregate the Concentration based on Hour. However, all of the dataframes look the same.
CA1 = organize(MYK, 1)
CA2 = organize(MYK, 2)
Yet whenever I use these two commands, the two datasets are identical.
I want a dataset that looks like
HourNominal Concentration SD top bottom
1 0.25 27.287500 25.112204 52.399704 2.1752958
2 0.50 41.989722 32.856013 74.845735 9.1337094
3 0.75 49.866667 22.485254 72.351921 27.3814122
4 1.00 107.168889 104.612098 211.780987 2.5567908
5 1.50 191.766389 264.375466 456.141855 -72.6090774
6 1.75 319.233333 290.685423 609.918757 28.5479100
7 2.00 226.785278 272.983234 499.768512 -46.1979560
8 2.25 341.145833 301.555769 642.701602 39.5900645
9 2.50 341.145833 319.099679 660.245512 22.0461542
10 3.00 195.303333 276.530533 471.833866 -81.2271993
11 4.00 107.913889 140.251991 248.165880 -32.3381024
12 6.00 50.174167 64.700785 114.874952 -14.5266184
13 8.00 38.132639 47.099796 85.232435 -8.9671572
14 12.00 31.404444 39.667850 71.072294 -8.2634051
15 24.00 33.488583 41.267392 74.755975 -7.7788087
16 48.00 29.304833 38.233776 67.538609 -8.9289422
17 72.00 7.322792 6.548898 13.871690 0.7738932
18 96.00 7.002833 6.350251 13.353085 0.6525821
19 144.00 6.463875 5.612630 12.076505 0.8512452
20 216.00 5.007792 4.808156 9.815948 0.1996353
21 312.00 3.964727 4.351626 8.316353 -0.3868988
22 480.00 2.452857 3.220947 5.673804 -0.7680897
23 648.00 1.826625 2.569129 4.395754 -0.7425044
The problem is that even when I try to separate the values by Cohort, the two dataframes have the same content. They should not be identical.
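For reference, the likely culprit is that the aggregate() calls inside organize() operate on the full input x instead of the subset g1, so the Cohort filter never takes effect. A corrected sketch of the same function:
organize = function(x, y) {
  g1 = subset(x, Cohort == y)  # keep only the requested cohort
  m = aggregate(g1[, 'Concentration'], by = list(g1[, 'HourNominal']), FUN = mean)
  m = setNames(m, c('HourNominal', 'Concentration'))
  g2 = aggregate(g1[, 'Concentration'], by = list(g1[, 'HourNominal']), FUN = sd)
  g2 = setNames(g2, c('HourNominal', 'SD'))
  m$SD = g2$SD
  m$top = m$Concentration + m$SD
  m$bottom = m$Concentration - m$SD
  return(m)
}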

R Creating new data.table with specified rows of a single column from an old data.table

I have the following data.table:
Month Day Lat Long Temperature
1: 10 01 80.0 180 -6.383330333333309
2: 10 01 77.5 180 -6.193327999999976
3: 10 01 75.0 180 -6.263328333333312
4: 10 01 72.5 180 -5.759997333333306
5: 10 01 70.0 180 -4.838330999999976
---
117020: 12 31 32.5 310 11.840003833333355
117021: 12 31 30.0 310 13.065001833333357
117022: 12 31 27.5 310 14.685003333333356
117023: 12 31 25.0 310 15.946669666666690
117024: 12 31 22.5 310 16.578336333333358
For every location (given by Lat and Long), I have a temperature for each day from 1 October to 31 December.
There are 1,272 locations consisting of each pairwise combination of Lat:
Lat
1 80.0
2 77.5
3 75.0
4 72.5
5 70.0
--------
21 30.0
22 27.5
23 25.0
24 22.5
and Long:
Long
1 180.0
2 182.5
3 185.0
4 187.5
5 190.0
---------
49 300.0
50 302.5
51 305.0
52 307.5
53 310.0
I'm trying to create a data.table that consists of 1,272 rows (one per location) and 92 columns (one per day). Each element of that data.table will then contain the temperature at that location on that day.
Any advice about how to accomplish that goal without using a for loop?
Here we use ChickWeight as the data, where "Chick-Diet" stands in for your "Lat-Long" and "Time" for your "Date":
dcast.data.table(data.table(ChickWeight), Chick + Diet ~ Time)
Produces:
Chick Diet 0 2 4 6 8 10 12 14 16 18 20 21
1: 18 1 1 1 NA NA NA NA NA NA NA NA NA NA
2: 16 1 1 1 1 1 1 1 1 NA NA NA NA NA
3: 15 1 1 1 1 1 1 1 1 1 NA NA NA NA
4: 13 1 1 1 1 1 1 1 1 1 1 1 1 1
5: ... 46 rows omitted
You will likely need a formula like Lat + Long ~ Month + Day for your data.
In the future, please make your question reproducible as I did here by using a built-in data set.
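Applied to the data in the question, that would look something like the following (a sketch, assuming the table is a data.table named dat):
library(data.table)
# one row per Lat/Long location, one column per Month_Day combination
wide <- dcast.data.table(dat, Lat + Long ~ Month + Day, value.var = "Temperature")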
First create a date value using the lubridate package (I assumed year = 2014, adjust as necessary):
library(lubridate)
df$datetext <- paste(df$Month,df$Day,"2014",sep="-")
df$date <- mdy(df$datetext)
Then one option is to use the tidyr package to spread the columns:
library(tidyr)
spread(df[,-c(1:2,6)],date,Temperature)
Lat Long 2014-10-01 2014-12-31
1 22.5 310 NA 16.57834
2 25.0 310 NA 15.94667
3 27.5 310 NA 14.68500
4 30.0 310 NA 13.06500
5 32.5 310 NA 11.84000
6 70.0 180 -4.838331 NA
7 72.5 180 -5.759997 NA
8 75.0 180 -6.263328 NA
9 77.5 180 -6.193328 NA
10 80.0 180 -6.383330 NA
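On current tidyr versions, spread() has been superseded by pivot_wider(); a roughly equivalent call would be:
library(tidyr)
pivot_wider(df[, c("Lat", "Long", "date", "Temperature")],
            names_from = date, values_from = Temperature)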

Creating an index conditioned on a value in another column; differences over time

I am struggling with the following problem:
The dataframe below contains the development of a value over time for various ids. What I am trying to get is the increase/decrease of these values relative to the value in the year in which an event occurred. Several events can occur within one id, so each new event becomes the new baseline year for that id.
To make things clearer, I also add the desired outcome below.
What i have
id value year event
a 100 1950 NA
a 101 1951 NA
a 102 1952 NA
a 103 1953 NA
a 104 1954 NA
a 105 1955 X
a 106 1956 NA
a 107 1957 NA
a 108 1958 NA
a 107 1959 Y
a 106 1960 NA
a 105 1961 NA
a 104.8 1962 NA
a 104.2 1963 NA
b 70 1970 NA
b 75 1971 NA
b 80 1972 NA
b 85 1973 NA
b 90 1974 NA
b 60 1975 Z
b 59 1976 NA
b 58 1977 NA
b 57 1978 NA
b 56 1979 NA
b 55 1980 W
b 54 1981 NA
b 53 1982 NA
b 52 1983 NA
b 51 1984 NA
What I am looking for
id value year event index growth
a 100 1950 NA 0
a 101 1951 NA 0
a 102 1952 NA 0
a 103 1953 NA 0
a 104 1954 NA 0
a 105 1955 X 1 1
a 106 1956 NA 2 1.00952381
a 107 1957 NA 3 1.019047619
a 108 1958 NA 4 1.028571429
a 107 1959 Y 1 1 #new baseline year
a 106 1960 NA 2 0.990654206
a 105 1961 NA 3 0.981308411
a 104.8 1962 NA 4 0.979439252
a 104.2 1963 NA 5 0.973831776
b 70 1970 NA 6
b 75 1971 NA 7
b 80 1972 NA 8
b 85 1973 NA 9
b 90 1974 NA 10
b 60 1975 Z 1 1
b 59 1976 NA 2 0.983333333
b 58 1977 NA 3 0.966666667
b 57 1978 NA 4 0.95
b 56 1979 NA 5 0.933333333
b 55 1980 W 1 1 #new baseline year
b 54 1981 NA 2 0.981818182
b 53 1982 NA 3 0.963636364
b 52 1983 NA 4 0.945454545
b 51 1984 NA 5 0.927272727
What I tried
This and this post were quite helpful, and I managed to create differences between the years; however, I fail to reset the base year (index) when there is a new event. Furthermore, I am doubtful whether my approach is indeed the most efficient/elegant one. It seems a bit clumsy to me...
library(plyr)  # for ddply()
x <- ddply(x, .(id), transform, year.min = min(year[!is.na(event)]))  # identifies the first event year
x1 <- ddply(x[x$year >= x$year.min, ], .(id), transform, index = seq_along(id))  # creates a counter for the years from the first event on; prior years are removed
x1 <- x1[order(x1$id, x1$year), ]  # sort
x1 <- ddply(x1, .(id), transform, growth = 100 * (value / value[1]))  # calculates the difference, but based on the first event year only; this is wrong
# I then merge the df with the years prior to the first event, which were removed
# at the beginning (interaction() is base R, so no extra package is needed)
x$id.year <- interaction(x$id, x$year)
x1$id.year <- interaction(x1$id, x1$year)
x$index <- x$growth <- NA
y <- rbind(x[x$year < x$year.min, ], x1)
y <- y[order(y$id, y$year), ]
Many thanks for any advice.
# Create a tag to indicate the start of each new event by id or
# when id changes
dat$tag <- with(dat, ave(as.character(event), as.character(id),
                         FUN = function(i) cumsum(!is.na(i))))
# Calculate the growth by id and tag
# this will also produce results for each id before an event has happened
dat$growth <- with(dat, ave(value, tag, id, FUN=function(i) i/i[1] ))
# remove growth prior to an event (this will be when tag equals zero as no
# event have occurred)
dat$growth[dat$tag==0] <- NA
Here is a solution with dplyr.
library(dplyr)
library(zoo)  # na.locf() carries the last event label forward
ana <- group_by(mydf, id) %>%
  do(na.locf(., na.rm = FALSE)) %>%
  mutate(value = as.numeric(value)) %>%
  group_by(id, event) %>%
  mutate(growth = value / value[1]) %>%
  mutate(index = row_number(event))
ana$growth[is.na(ana$event)] <- 0
id value year event growth index
1 a 100.0 1950 NA 0.0000000 1
2 a 101.0 1951 NA 0.0000000 2
3 a 102.0 1952 NA 0.0000000 3
4 a 103.0 1953 NA 0.0000000 4
5 a 104.0 1954 NA 0.0000000 5
6 a 105.0 1955 X 1.0000000 1
7 a 106.0 1956 X 1.0095238 2
8 a 107.0 1957 X 1.0190476 3
9 a 108.0 1958 X 1.0285714 4
10 a 107.0 1959 Y 1.0000000 1
11 a 106.0 1960 Y 0.9906542 2
12 a 105.0 1961 Y 0.9813084 3
13 a 104.8 1962 Y 0.9794393 4
14 a 104.2 1963 Y 0.9738318 5
15 b 70.0 1970 NA 0.0000000 1
16 b 75.0 1971 NA 0.0000000 2
17 b 80.0 1972 NA 0.0000000 3
18 b 85.0 1973 NA 0.0000000 4
19 b 90.0 1974 NA 0.0000000 5
20 b 60.0 1975 Z 1.0000000 1
21 b 59.0 1976 Z 0.9833333 2
22 b 58.0 1977 Z 0.9666667 3
23 b 57.0 1978 Z 0.9500000 4
24 b 56.0 1979 Z 0.9333333 5
25 b 55.0 1980 W 1.0000000 1
26 b 54.0 1981 W 0.9818182 2
27 b 53.0 1982 W 0.9636364 3
28 b 52.0 1983 W 0.9454545 4
Try:
ddf$index = 0
ddf$growth = 0
baseline = 0
start = FALSE
for (r in 1:nrow(ddf)) {
  if (is.na(ddf$event[r])) {
    if (start) {
      ddf$index[r] = ddf$index[r - 1] + 1
      ddf$growth[r] = ddf$value[r] / baseline
    } else {
      ddf$index[r] = 0
    }
  } else {
    start = TRUE
    ddf$index[r] = 1
    ddf$growth[r] = 1
    baseline = ddf$value[r]
  }
}
ddf
id value year event index growth
1 a 100.0 1950 <NA> 0 0.0000000
2 a 101.0 1951 <NA> 0 0.0000000
3 a 102.0 1952 <NA> 0 0.0000000
4 a 103.0 1953 <NA> 0 0.0000000
5 a 104.0 1954 <NA> 0 0.0000000
6 a 105.0 1955 X 1 1.0000000
7 a 106.0 1956 <NA> 2 1.0095238
8 a 107.0 1957 <NA> 3 1.0190476
9 a 108.0 1958 <NA> 4 1.0285714
10 a 107.0 1959 Y 1 1.0000000
11 a 106.0 1960 <NA> 2 0.9906542
12 a 105.0 1961 <NA> 3 0.9813084
13 a 104.8 1962 <NA> 4 0.9794393
14 a 104.2 1963 <NA> 5 0.9738318
15 b 70.0 1970 <NA> 6 0.6542056
16 b 75.0 1971 <NA> 7 0.7009346
17 b 80.0 1972 <NA> 8 0.7476636
18 b 85.0 1973 <NA> 9 0.7943925
19 b 90.0 1974 <NA> 10 0.8411215
20 b 60.0 1975 Z 1 1.0000000
21 b 59.0 1976 <NA> 2 0.9833333
22 b 58.0 1977 <NA> 3 0.9666667
23 b 57.0 1978 <NA> 4 0.9500000
24 b 56.0 1979 <NA> 5 0.9333333
25 b 55.0 1980 W 1 1.0000000
26 b 54.0 1981 <NA> 2 0.9818182
27 b 53.0 1982 <NA> 3 0.9636364
28 b 52.0 1983 <NA> 4 0.9454545
29 b 51.0 1984 <NA> 5 0.9272727
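For comparison, the loop's baseline-reset logic can also be vectorized with dplyr, combining it with the cumulative-event tag from the first answer (a sketch, assuming the data frame is named ddf):
library(dplyr)
ddf %>%
  mutate(tag = cumsum(!is.na(event))) %>%  # tag increments at every event row
  group_by(tag) %>%
  mutate(index = ifelse(tag == 0, 0, row_number()),
         growth = ifelse(tag == 0, 0, value / first(value))) %>%
  ungroup() %>%
  select(-tag)
On the data above this should reproduce the loop's result, including the carried-over baseline for id b's pre-event rows.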
