R: subsetting all observations of individuals that have one matching observation

Sorry for another dang subsetting question; I just can't find this case described, though it must be common. Boiled-down data looks like this:
Plot Year BA
A    1980 44
A    1990 54
A    2000 66
B    1980 58
B    1990 69
B    2000 80
I want all observations for any plot with BA < 50 in 1980 -- in the above, all three A rows. I understand subset(df, BA < 50 & Year == 1980) but can't figure out the next level of indexing.
Also if anyone has a better way to phrase the title I'll change it. Every way I could think of to search on only turned up the &/| questions. (So many &/| questions...)
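For anyone who wants to reproduce this, here is the boiled-down data as a copy-pasteable data frame:
df <- read.table(text = "Plot Year BA
A 1980 44
A 1990 54
A 2000 66
B 1980 58
B 1990 69
B 2000 80", header = TRUE)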

Index your condition on Plot, checking membership with %in% in case there is more than one Plot satisfying the condition in the real data.
subset(df, Plot %in% unique(Plot[BA < 50 & Year == 1980]))
# Plot Year BA
# 1 A 1980 44
# 2 A 1990 54
# 3 A 2000 66
Or, with standard-evaluation [ subsetting:
df[with(df, Plot %in% unique(Plot[BA < 50 & Year == 1980])), ]
# Plot Year BA
# 1 A 1980 44
# 2 A 1990 54
# 3 A 2000 66

Another option is dplyr. This assumes there is only one Year == 1980 record per plot; otherwise you may want to wrap the condition in all() or any(), depending on your desired logic:
library(dplyr)
df %>% group_by(Plot) %>% filter(BA[Year == 1980] < 50)
# Source: local data frame [3 x 3]
# Groups: Plot [1]
# Plot Year BA
# <fctr> <int> <int>
# 1 A 1980 44
# 2 A 1990 54
# 3 A 2000 66
If some plots have multiple 1980 records, the logic of @DirtySockSniffer's answer is equivalent to df %>% group_by(Plot) %>% filter(any(BA[Year == 1980] < 50)) in dplyr.
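To see the difference on data where a plot has two 1980 rows, here is a small sketch (the extra row is hypothetical, purely for illustration):
df2 <- rbind(df, data.frame(Plot = "A", Year = 1980, BA = 60))  # hypothetical second 1980 row for A
df2 %>% group_by(Plot) %>% filter(any(BA[Year == 1980] < 50))   # keeps all A rows: at least one 1980 BA < 50
df2 %>% group_by(Plot) %>% filter(all(BA[Year == 1980] < 50))   # drops A: not every 1980 BA is < 50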

We can use data.table
library(data.table)
setDT(df)[, if (all(BA[Year == 1980] < 50)) .SD, by = Plot]
# Plot Year BA
#1: A 1980 44
#2: A 1990 54
#3: A 2000 66
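As with the dplyr answer, if a plot can have several 1980 rows you can swap all() for any(), depending on the logic you want; a sketch:
setDT(df)[, if (any(BA[Year == 1980] < 50)) .SD, by = Plot]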

Related

Subset a dataframe by unique combination of values from another dataframe in R

I have a large dataframe A, similar to the following, and a second one, B, containing only lat/lon values.
What I am trying to do is subset dataframe A based on the unique combinations of lat/lon from dataframe B.
So far I have tried the following, but it does not work.
How should I change my code to do this effectively?
head(A)
vals time lon lat mo year
1 5 1978-11-01 100 32 01 1988
2 3 1978-11-02 100 45 02 1988
3 3 1978-11-03 100 45 01 1998
4 9 1978-11-04 100 50 05 1998
5 1 1978-11-05 100 60 05 1998
6 4 1978-11-06 100 32 05 1998
A_subset <- subset(A, A[, "lon"] %in% B$lon | A[, "lat"] %in% B$lat)
Consider running expand.grid on data frame B to build all combinations of its unique coordinates, then merge the result with data frame A:
B_all_combns <- expand.grid(lon = unique(B$lon), lat = unique(B$lat))
A_subset <- merge(A, B_all_combns, by=c("lon", "lat"))
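Note that expand.grid pairs every unique lon with every unique lat, i.e. the full cross of coordinates. If you only want the lon/lat pairs that actually occur together in B, a sketch of that variant is to merge on the de-duplicated rows of B instead:
A_subset <- merge(A, unique(B[, c("lon", "lat")]), by = c("lon", "lat"))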

From panel data to cross-sectional data using averages

I am very new to R so I am not sure how basic my question is, but I am stuck at the following point.
I have data that has a panel structure, similar to this
Country Year Outcome Country-characteristic
A       1990 10      40
A       1991 12      40
A       1992 14      40
B       1991 10      60
B       1992 12      60
For some reason I need to put this into a cross-sectional structure, with averages over all years for each country. In the end it should look like this:
Country Outcome Country-Characteristic
A       12      40
B       11      60
Has anybody faced a similar problem? I was playing with lapply(table$country, table$outcome, mean) but that did not do what I wanted.
Two tips: (1) when you ask a question, provide a reproducible example of the data too (as I did with read.table below); (2) it's not a good idea to use "-" in column names -- use "_" instead.
You can get a summary using the dplyr package:
df1 <- read.table(text="Country Year Outcome Countrycharacteristic
A 1990 10 40
A 1991 12 40
A 1992 14 40
B 1991 10 60
B 1992 12 60", header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
  group_by(Country) %>%
  summarize(Outcome = mean(Outcome), Countrycharacteristic = mean(Countrycharacteristic))
# A tibble: 2 x 3
Country Outcome Countrycharacteristic
<chr> <dbl> <dbl>
1 A 12 40
2 B 11 60
We can do this in base R with aggregate:
aggregate(. ~ Country, df1[-2], mean)  # df1[-2] drops the Year column
# Country Outcome Countrycharacteristic
#1 A 12 40
#2 B 11 60
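If you would rather name the columns explicitly than drop Year by position, the formula interface of aggregate does the same thing:
aggregate(cbind(Outcome, Countrycharacteristic) ~ Country, data = df1, FUN = mean)
#  Country Outcome Countrycharacteristic
#1       A      12                    40
#2       B      11                    60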

tapply based on multiple indexes in R

I have a data frame, much like this one:
ref    <- rep(c("A", "B"), each = 240)
year   <- rep(rep(2014:2015, each = 120), 2)
month  <- rep(rep(1:12, each = 10), 4)
values <- c(rep(NA, 200), rnorm(100, 2, 1), rep(NA, 50), rnorm(40, 4, 2), rep(NA, 90))
DF     <- data.frame(ref, year, month, values)
I would like to compute the maximum number of consecutive NAs per reference, per year.
I have created a function that works out the maximum number of consecutive NAs, but it can only group by one variable.
For example,
func <- function(x) {
  runs <- rle(is.na(x))
  max(runs$lengths[runs$values])  # longest run of NAs specifically, not just the longest run
}
with(DF, tapply(values, ref, func))
#   A   B
# 200  90
with(DF, tapply(values, year, func))
# 2014 2015
#  120   90
So there are a maximum of 200 consecutive NAs in ref A in total, and maximum of 90 in ref B, which is correct. There are also 120 NAs in 2014, and 90 in 2015.
What I'd like is a result per ref and year, such as:
A 2015 80
A 2014 120
B 2015 90
B 2014 50
There are multiple ways of doing this, one is with the plyr library:
library(plyr)
ddply(DF, c('ref', 'year'), summarise, NAs = func(values))  # reuses func from the question
  ref year NAs
1   A 2014 120
2   A 2015  80
3   B 2014  50
4   B 2015  90
Using your function, you could also try:
with(DF, tapply(values, list(ref, year), func))
which gives a slightly different output:
  2014 2015
A  120   80
B   50   90
Using melt() (e.g. from reshape2), you can however reshape this into the same long data frame.
Very similar to the tapply solution above, but I find aggregate gives a better output than tapply:
with(DF, aggregate(list(Value = values), list(Year = year, ref = ref), func))
  Year ref Value
1 2014   A   120
2 2015   A    80
3 2014   B    50
4 2015   B    90
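For completeness, a data.table sketch of the same grouped run-length computation (it assumes every ref/year group contains at least one NA; otherwise max() would warn on an empty vector):
library(data.table)
setDT(DF)[, .(NAs = {r <- rle(is.na(values)); max(r$lengths[r$values])}), by = .(ref, year)]
#   ref year NAs
#1:   A 2014 120
#2:   A 2015  80
#3:   B 2014  50
#4:   B 2015  90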
I like the recipe format
library(dplyr)
DF$values[is.na(DF$values)] <- 1     # flag the NA cells with a sentinel value
DF %>%
  filter(values == 1) %>%            # keep only the flagged (formerly NA) rows
  group_by(ref, year) %>%
  mutate(csum = cumsum(values)) %>%  # running count of NAs within each group
  summarise(max(csum))               # the last cumulative count, i.e. the total
Source: local data frame [4 x 3]
Groups: ref [?]
     ref  year max(csum)
  (fctr) (int)     (dbl)
1      A  2014       120
2      A  2015        80
3      B  2014        50
4      B  2015        90
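One caveat: the cumsum recipe counts the total number of NA rows per group, which matches the maximum consecutive run here only because each group's NAs happen to be contiguous. A run-length-based dplyr sketch that also handles scattered NAs (starting from a fresh DF, before the NA replacement above):
DF %>%
  group_by(ref, year) %>%
  summarise(NAs = {r <- rle(is.na(values)); max(r$lengths[r$values])})  # longest NA run per group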

How to re-order data in R, and creating a new variable for the data?

I have been working with the CDC FluView dataset, retrieved by this code:
library(cdcfluview)
library(ggplot2)
usflu <- get_flu_data("national", "ilinet", years=1998:2015)
What I am trying to do is create a new week variable, call it "week_new", that reorders the WEEK variable from this dataset so that week 30 is the first week of each year. For example, in 1998, instead of week 1 being the first week of that year, week 30 would be week 1 on the new scale, and every subsequent year would follow the same scale. I am also trying to create another new variable called "season", which puts each week into its corresponding flu season, say "1998-1999" for week 30 of 1998 through 1999, and so on.
I believe this involves a for loop and conditional statements, but I am not familiar with how to use these in R. I am new to programming and am learning Java and R at the same time, and have only worked with loops in Java so far.
Here is what I have tried so far, I think it's supposed to be something like this:
wk_num <- 1
for (i in nrow(usflu)) {
  if (week == 31) {
    wk_num <- 1
    wk_new[i] <- wk_num
    wk_num <- wk_num + 1
  }
  if (week < 53) {
    season[i] <- paste(Yr[i], '-', Yr[i] + 1)
  } else {
  }
}
Any help is greatly appreciated and hopefully what I am asking makes sense. I am hoping to understand re-ordering for the future as I believe it will be an important tool for me to have at my disposal for coding in R.
Here's one way to accomplish this with the packages dplyr and tidyr:
library(dplyr)
library(tidyr)

usflu_df <- tbl_df(usflu)

usflu_df %>%
  complete(YEAR, WEEK) %>%
  filter(!(YEAR == 1998 & WEEK < 30)) %>%
  mutate(season = cumsum(WEEK == 30),
         season_nm = paste(1997 + season, 1998 + season, sep = "-")) %>%
  group_by(season) %>%
  mutate(new_wk = seq_along(season)) %>%
  select(YEAR, WEEK, new_wk, season, season_nm)
# YEAR WEEK new_wk season season_nm
# (int) (int) (int) (int) (chr)
# 1 1998 30 1 1 1998-1999
# 2 1998 31 2 1 1998-1999
# 3 1998 32 3 1 1998-1999
# 4 1998 33 4 1 1998-1999
# 5 1998 34 5 1 1998-1999
# 6 1998 35 6 1 1998-1999
# 7 1998 36 7 1 1998-1999
# 8 1998 37 8 1 1998-1999
# 9 1998 38 9 1 1998-1999
# 10 1998 39 10 1 1998-1999
Talking through this...
First, use tidyr::complete to turn implicit missing values into explicit missing values -- the original data pulled back did not have all of the weeks for 1998. Next, filter out the irrelevant 1998 records, i.e. anything before week 30 of 1998, to make our lives easier. We then create two new variables, season and season_nm, via cumsum and a simple paste. The season counter simply increments every time it sees WEEK == 30, which keeps it correct even in years with 53 reporting weeks. Finally, we group_by season and use seq_along to number the weeks within each season as new_wk.
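As a quick sanity check, you can tabulate the weeks per season; usflu_seasons here is a hypothetical name for the stored result of the pipeline above:
usflu_seasons %>%
  count(season_nm)   # completed seasons should all contain the same number of weeks after complete()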

Order multiple columns in R

Sample data:
now <- data.frame(id   = c(123, 123, 123, 222, 222, 222, 135, 135, 135),
                  year = c(2002, 2001, 2003, 2006, 2007, 2005, 2001, 2002, 2003),
                  freq = c(3, 1, 2, 2, 3, 1, 3, 1, 2))
Desired output:
wanted <- data.frame(id   = c(123, 123, 123, 222, 222, 222, 135, 135, 135),
                     year = c(2001, 2002, 2003, 2005, 2006, 2007, 2001, 2002, 2003),
                     freq = c(1, 2, 3, 1, 2, 3, 1, 2, 3))
This solution works, but I'm getting a memory error (cannot assign 134kb...):
ddply(now, .(id), transform, year = sort(year))
Please note I need a speed-efficient solution, as I have a dataframe of 300K rows and 50 columns. Thanks.
You can use dplyr to sort it (which is called arrange in dplyr). dplyr is also faster than plyr.
wanted <- now %>% arrange(id, year)
# or: wanted <- arrange(now, id, year)
> wanted
# id year freq
#1 123 2001 1
#2 123 2002 3
#3 123 2003 2
#4 135 2001 3
#5 135 2002 1
#6 135 2003 2
#7 222 2005 1
#8 222 2006 2
#9 222 2007 3
You could do the same with base R:
wanted <- now[order(now$id, now$year),]
However, there is a difference between your now and wanted data frames for id == 123 and year == 2002 (in your now df the freq is 3, while it is 2 in the wanted df). Based on your question, I assume this is a typo and that you did not actually want to change the freq values.
You could use the base R function order here:
now <- now[order(now$id, now$year), ]
or data.table for faster performance
library(data.table)
setDT(now)[order(id, year)]
or
now <- data.table(now, key = c("id", "year"))
or
setDT(now)
setkey(now, id, year)
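Since the motivation is speed on ~300K rows, here is a rough benchmark sketch (it assumes the microbenchmark package is installed; big_df is a hypothetical stand-in built by resampling now):
library(microbenchmark)
library(data.table)
big_df <- as.data.frame(now)[sample(nrow(now), 3e5, replace = TRUE), ]  # hypothetical 300K-row stand-in
big_dt <- as.data.table(big_df)
microbenchmark(
  base       = big_df[order(big_df$id, big_df$year), ],
  data.table = big_dt[order(id, year)],
  times = 10
)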
