Adding missing values after applying aggregate in R

I am trying to calculate monthly means from daily values. My data has many missing values, and I want the months with missing data to appear as NA in the output instead of being dropped.
For example this is the desired output:
"MM","YY","RR"
10,1961,NA
10,1962,NA
10,1963,NA
10,1964,NA
10,1965,NA
10,1966,NA
10,1967,NA
10,1968,NA
10,1969,NA
10,1970,NA
10,1971,14.8290322580645
10,1972,5.92903225806452
10,1973,7.10645161290323
10,1974,9.25806451612903
10,1975,6.13225806451613
10,1976,NA
10,1977,NA
10,1978,NA
10,1979,11.358064516129
10,1980,NA
10,1981,20.8354838709677
10,1982,NA
10,1983,NA
10,1984,7.4741935483871
10,1985,NA
10,1986,NA
10,1987,NA
10,1988,NA
10,1989,NA
10,1990,NA
10,1991,NA
10,1992,NA
10,1993,NA
10,1994,NA
10,1995,NA
10,1996,NA
10,1997,NA
10,1998,NA
10,1999,NA
10,2000,NA
10,2001,12.2548387096774
10,2002,7.19354838709677
10,2003,4.34193548387097
10,2004,8.09354838709677
10,2005,10.3354838709677
10,2006,5.49677419354839
10,2007,9.58709677419355
10,2008,NA
10,2009,NA
10,2010,17.4548387096774
The test data can be downloaded from this link:
Link to Data
I am using the aggregate function to calculate the mean
Below is my script:
library(plyr)
dat<- read.csv("test.csv",header=TRUE,sep=",")
dat[dat == -999]<- NA
dat[dat == -888]<- 0
monthly_mean<-aggregate(RR ~ MM + YY,dat,mean)
#Filter October only
oct<-monthly_mean[which(monthly_mean$MM == 10),]
dat2 <- as.data.frame(oct)
#monthly_mean <- ddply(dat,.(MM, DD), summarise, mean_r = mean(RR,na.rm=TRUE))
write.table(dat2,file="test_oct.csv",sep=",",col.names=T,row.names=F, na="NA")
Problems:
[1] When I ran this script, the missing years are also removed.
I'll appreciate any suggestions on how to do this correctly in R.

You can retain the NA rows by passing na.action=na.pass to the aggregate call:
monthly_mean<-aggregate(RR ~ MM + YY,dat,mean,na.action=na.pass)
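As a minimal sketch of the difference, assuming a toy data frame whose values are made up for illustration (they are not taken from test.csv): the default na.action drops rows with NA before grouping, while na.pass keeps them so that mean() returns NA for that group. Note that na.pass can only keep groups that exist in the data; if a year has no rows at all, one option is to merge the result against a full grid of years. The names dropped, kept, all_years and full are hypothetical.

```r
# Toy daily data (illustrative values): October 1961 is complete,
# October 1962 is entirely NA, and 1963 has no rows at all.
dat <- data.frame(
  MM = 10,
  YY = rep(c(1961, 1962), each = 3),
  RR = c(1, 2, 3, NA, NA, NA)
)

# Default behaviour (na.action = na.omit): rows with NA are removed
# before grouping, so 1962 disappears from the result.
dropped <- aggregate(RR ~ MM + YY, dat, mean)

# na.pass keeps those rows; mean() then yields NA for 1962.
kept <- aggregate(RR ~ MM + YY, dat, mean, na.action = na.pass)

# Years that are absent from the data altogether can be added by
# merging against the full set of month/year combinations.
all_years <- data.frame(MM = 10, YY = 1961:1963)
full <- merge(all_years, kept, by = c("MM", "YY"), all.x = TRUE)

print(dropped)
print(kept)
print(full)
```

With na.pass, a month becomes NA as soon as a single daily value is missing, so this only fits if that is the intended rule.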

Related

Looping row numbers from one dataframe to create new data using logical operations in R

I would like to extract a dataframe that shows how many years it takes for the NInd variable (dataset p1) to recover after the culling events shown in dataframe e1.
I have the following datasets (mine are much bigger, but just to give you something to play with):
# Dataset 1
Batch <- c(2,2,2,2,2,2,2,2,2,2)
Rep <- c(0,0,0,0,0,0,0,0,0,0)
Year <- c(0,0,1,1,2,2,3,3,4,4)
RepSeason <- c(0,0,0,0,0,0,0,0,0,0)
PatchID <- c(17,25,19,16,21,24,23,20,18,33)
Species <- c(0,0,0,0,0,0,0,0,0,0)
Selected <- c(1,1,1,1,1,1,1,1,1,1)
Nculled <- c(811,4068,1755,449,1195,1711,619,4332,457,5883)
e1 <- data.frame(Batch,Rep,Year,RepSeason,PatchID,Species,Selected,Nculled)
# Dataset 2
Batch <- c(2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
Rep <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
Year <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)
RepSeason <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
PatchID <- c(17,25,19,16,21,24,23,20,18,33,17,25,19,16,21,24,23,20,18,33,17,25,19,16,21,24,23,20,18,33)
Ncells <- c(6,5,6,4,4,5,6,5,5,5,6,5,6,4,4,5,6,7,3,5,4,4,3,3,4,4,5,5,6,4)
Species <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
NInd <- c(656,656,262,350,175,218,919,218,984,875,700,190,93,127,52,54,292,12,43,68,308,1000,98,29,656,656,262,350,175,300)
p1 <- data.frame(Batch, Rep, Year, RepSeason, PatchID, Ncells, Species, NInd)
The dataset called e1 contains only those years in which culling happened to the population in a specific PatchID.
I have written the following script, which uses each row of e1 to compute a Recovery number. There may be an easier way to get there, but this is the one I managed to work out.
When you run this, you are working on ONE row of e1: we take the first PatchID encountered, do some calculations to match it up with p1, and finally get a number named Recovery.
The thing is, my dataframe has 50,000 rows, so repeating this by hand would be very tedious. That is why I thought a loop might be useful, but I have tried and had no luck making it work.
#here is where I would like the loop
e2 <- e1[1,] # Trial for one row only; the idea is to have a loop here that repeats what comes next for each row
e3 <- e2 %>%
select(1,2,4,5)
p2 <- p1[,c(1,2,4,5,3,6,7,8)] # Re-order
row2 <- which(apply(p2, 1, function(x) return(all(x == e3))))
p3 <- p1 %>%
slice(row2) # all years with that particular patch in that particular Batch
#How many times was this patch cull during this replicate?
e4 <- e2[,c(1,2,4,5,3,6,7,8)]
e4 <- e4 %>%
select(1,2,3,4)
c_batch <- e1[,c(1,2,4,5,3,6,7,8)]
row <- which(apply(c_batch, 1, function(x) return(all(x == e4))))
c4 <- c_batch %>%
slice(row)
# Number of year to recover to 95% that had before culled
c5 <- c4[1,] # extract the first time was culled
c5 <- c5 %>%
select(1:5)
row3 <- which(apply(p2, 1, function(x) return(all(x == c5))))
Before <- p2 %>%
slice(row3)
NInd <- Before[,8] # Before culling number of individuals
Year2 <- Before[,5] # Year number where first culling happened (that actually the number corresponds to individuals before culling, as the Pop file is developed during reproduction, while Cull file is developed after!)
Percent <- (95*NInd)/100 # 95% recovery we want to achieve would correspond to having 95% of NInd BEFORE culled happened (Year2)
After <- p3 %>%
filter(NInd >= Percent & Year > Year2) # Look rows that match number of ind and Year
After2 <- After[1,] # we just want the first year where the recovery was successfully achieved
Recovery <- After2$Year - Before$Year
# no. of years to reach 95% of the population immediately before the cull
I reckon the end would have to change somehow to tell R that we are creating a dataframe with the Recovery values, something like:
Batch <- c(1,1,2,2)
Rep <- c(0,0,0,0)
PatchID <- c(17,25,30,12)
Recovery <- c(1,2,1,5)
Final <- data.frame(Batch, Rep, PatchID, Recovery)
Would that be possible? Or is this just too messed up, and should I try a different way?
Does the following solve the problem correctly?
I first added a unique ID to your data.frames to allow matching of the cull and population files (this saves most of your complicated look-up code):
# Add a unique ID for the patch/replicate etc. (as done in the example code)
e1$RepID = paste(e1$Batch, e1$Rep, e1$RepSeason, e1$PatchID, sep = ":")
p1$RepID = paste(p1$Batch, p1$Rep, p1$RepSeason, p1$PatchID, sep = ":")
If you want a quick overview of the number of times each patch was culled, the new RepID makes this easy:
# How many times was each patch culled?
table(e1$RepID)
Then you want a loop to check the recovery time after each cull.
My solution uses an sapply loop (which also retains the RepIDs so you can match to other metadata later):
sapply(unique(e1$RepID), function(rep_id){
all_cull_events = e1[e1$RepID == rep_id, , drop = F]
first_year = order(all_cull_events$Year)[1] # The first cull year (assuming data might not be in temporal order)
first_cull_event = all_cull_events[first_year, ] # The row corresponding to the first cull event
population_counts = p1[p1$RepID == first_cull_event$RepID, ] # The population counts for this plot/replicate
population_counts = population_counts[order(population_counts$Year), ] # Order by year (assuming data might not be in temporal order)
pop_at_first_cull_event = population_counts[population_counts$Year == first_cull_event$Year, "NInd"]
population_counts_after_cull = population_counts[population_counts$Year > first_cull_event$Year, , drop = F]
years_to_recovery = which(population_counts_after_cull$NInd >= (pop_at_first_cull_event * .95))[1] # First year to pass 95% threshold
return(years_to_recovery)
})
2:0:0:17 2:0:0:25 2:0:0:19 2:0:0:16 2:0:0:21 2:0:0:24 2:0:0:23 2:0:0:20 2:0:0:18 2:0:0:33
1 2 1 NA NA NA NA NA NA NA
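(The output contains some NAs because, for those patches, either the first cull year was outside the range of population counts in the data you gave us or the population did not reach the 95% threshold within the years available.)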
Please check this against your expected output though. There were some aspects of the question and example code that were not clear (see comments).
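If a data frame like the Final in the question is wanted, the names of the result can be split back into columns. This is only a sketch: the recovery vector below is typed in by hand to mimic the shape of the sapply output (a named vector with "Batch:Rep:RepSeason:PatchID" names), and parts is a hypothetical helper name.

```r
# Hand-typed stand-in for the named vector returned by sapply()
recovery <- c("2:0:0:17" = 1, "2:0:0:25" = 2, "2:0:0:16" = NA)

# Split each "Batch:Rep:RepSeason:PatchID" name into its four pieces;
# rbind turns the list of pieces into a character matrix (one row per ID)
parts <- do.call(rbind, strsplit(names(recovery), ":", fixed = TRUE))

# Rebuild the columns the question asked for
Final <- data.frame(
  Batch    = as.numeric(parts[, 1]),
  Rep      = as.numeric(parts[, 2]),
  PatchID  = as.numeric(parts[, 4]),
  Recovery = unname(recovery)
)

print(Final)
```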

Problem with factor and reordering facet_grid

I have constructed a dataset from the gss data (https://gss.norc.org/) associating data in decades
env_data <- select(gss, year, sex, degree, natenvir) %>% na.omit()
env_datadecades <- env_data %>%
mutate(decade=as.factor(ifelse(year<1980,
"70s",
ifelse(year>1980 & year<=1990,
"80s",
ifelse(year>1990 & year<2000, "90s", "00s")))))
I want to plot it with ggplot2 and facet_grid(), but the facets are not in the right order, so I reordered the factor levels as I had seen done elsewhere:
set.seed(6809)
env_datadecades$decade <- factor(env_datadecades$decade,
levels = c("Seventies", "Eighties", "Nineties", "Twothous"))
It worked the first time, but when I run the code again I get NA for every value in decade. What is happening?
I just made a simple dataset of years
df <- data.frame(Years = sample(1970:2010, 20, replace = T))
Convert it into the required factors by this method,
df <- df %>%
mutate(Decades = case_when(Years < 1980 ~ "Seventies",
1980 <= Years & Years < 1990 ~ "Eighties",
1990 <= Years & Years < 2000 ~ "Nineties",
2000 <= Years ~ "TwoThousands"))
df$Decades <- factor(df$Decades, levels = c("Seventies", "Eighties", "Nineties", "TwoThousands"), ordered = T)
and now try faceting.
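I think the problem with your code is that you gave the levels one set of names ("70s", "80s", ...) when you first converted the variable to a factor, and then in the second line of code you gave them another set ("Seventies", "Eighties", ...). factor() turns every value that is not listed in levels into NA, which is why the whole column ends up NA. Stick to the same set of names, and it should work.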
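A small sketch of that behaviour, using a hand-made vector in place of the gss data (decade, mismatched and reordered are illustrative names):

```r
# Stand-in for the decade column created with the short labels
decade <- factor(c("70s", "80s", "90s", "00s", "80s"))

# Re-levelling with names that do not occur in the data:
# every value fails to match a level and becomes NA
mismatched <- factor(decade,
                     levels = c("Seventies", "Eighties", "Nineties", "Twothous"))

# Re-levelling with the labels that are actually present:
# the values survive and only the level order changes
reordered <- factor(decade, levels = c("70s", "80s", "90s", "00s"))

print(mismatched)
print(levels(reordered))
```

Faceting on reordered should then follow the order given in levels.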

Different methods to expand R data

I have the following data, and I would like to expand it. For example, if June has two successes, and one failure, my dataset should look like:
month | is_success
------------------
6 | T
6 | T
6 | F
Dataset is as follows:
# Months from July to December
months <- 7:12
# Number of success (failures) for each month
successes <- c(11,22,12,7,6,13)
failures <- c(20,19,11,16,13,10)
A sample solution is as follows:
dataset<-data.frame()
for (i in 1:length(months)) {
dataset <- rbind(dataset,cbind(rep(months[i], successes[i]), rep(T, successes[i])))
dataset <- rbind(dataset,cbind(rep(months[i], failures[i]), rep(F, failures[i])))
}
names(dataset) <- c("months", "is_success")
dataset[,"is_success"] <- as.factor(dataset[,"is_success"])
Question: What are the different ways to rewrite this code?
I am looking for a comprehensive solution with different but efficient ways (matrix, loop, apply).
Thank you!
Here is one way with rep. Create a dataset with 'months' and 'is_success' based on replication of 1 and 0. Then replicate the rows by the values of 'successes', 'failures', order if necessary and set the row names to 'NULL'
d1 <- data.frame(months, is_success = factor(rep(c(1, 0), each = length(months))))
d2 <- d1[rep(1:nrow(d1), c(successes, failures)),]
d2 <- d2[order(d2$months),]
row.names(d2) <- NULL
Now, we check whether this is equal to the data created from for loop
all.equal(d2, dataset, check.attributes = FALSE)
#[1] TRUE
Or as @thelatemail suggested, 'd1' can be created with expand.grid
d1 <- expand.grid(month=months, is_success=1:0)
Using mapply, you can try this:
createdf<-function(month,successes,failures){
data.frame(month=rep(x = month,(successes+failures)),
is_success=c(rep(x = T,successes),
rep(x = F,failures))
)
}
Now create a list of required data.frames:
lofdf<-mapply(FUN = createdf,months,successes,failures,SIMPLIFY = F)
And then combine using the plyr package's ldply function:
library(plyr)
resdf<-ldply(lofdf,.fun = data.frame)
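Since the question also asked for a matrix-flavoured variant, here is one possible sketch in base R. It stacks the counts into a 2 x 6 matrix and relies on column-major order to interleave successes and failures month by month. The names counts and expanded are illustrative, and this is one way among several rather than a recommended one.

```r
months    <- 7:12
successes <- c(11, 22, 12, 7, 6, 13)
failures  <- c(20, 19, 11, 16, 13, 10)

# Row 1 = successes, row 2 = failures; as.vector() reads the matrix
# column by column, i.e. s(July), f(July), s(August), f(August), ...
counts <- rbind(successes, failures)
times  <- as.vector(counts)

# One (month, outcome) pair per matrix cell, repeated by its count
expanded <- data.frame(
  months     = rep(rep(months, each = 2), times = times),
  is_success = rep(rep(c(TRUE, FALSE), times = length(months)), times = times)
)

# Cross-check the expansion against the original counts
print(table(expanded$months, expanded$is_success))
```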

Create matrix after aggregate a table in R

I am really new to R (I've been learning it for one week now) and even newer to Stack Overflow. I have a question about how to use the aggregate function. I am trying the following code:
a = aggregate(dom$pesoA,
by = list(tipoE = addNA(dom$typeEsg), mun =dom$codMun),
FUN = sum,
na.rm=FALSE)
where:
- dom$pesoA has only the values that I need to sum
- dom$typeEsg has numbers from 1 to 6 and also many NAs
- dom$codMun has municipalities codes (no NAs)
Can I transform this data frame (a) into a matrix where the tipoE values are the columns, the mun values are the rows, and the sums of dom$pesoA are the elements (some combinations of mun and tipoE are missing)?
I don't know if my explanation is clear; if you have any questions I will try to answer them.
This is what my a df looks like
Thanks in advance
TR
If your dataframe really does look like that then there is a serious mismatch between your column names and your code.
dom <- data.frame(tipoE=sample(c(letters[1:4],NA), 30, rep=TRUE),
mun=rep(c(3200102,3200106,3200310) , each=10),
x=runif(30, 100,200) )
dom
This reworking succeeds:
a = aggregate(dom$x,
by = list(tipoE = addNA(dom$tipoE), mun = dom$mun),
FUN = sum)
a
This use of xtabs then gives your requests:
> aT <- xtabs( x ~ tipoE + mun, a)
> aT
       mun
tipoE    3200102  3200106  3200310
  a     340.7700 367.1412 180.0594
  b     280.9851 485.8780 798.4880
  c     280.7682 236.3637 165.2295
  d     176.6967 125.0732 132.5339
  <NA>  376.4278 117.1063 251.2514
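As an alternative sketch that skips the intermediate aggregate step, tapply can sum straight into a matrix. The data below are small made-up values so the result is easy to check by eye. Unlike xtabs, which fills absent combinations with 0, tapply leaves them as NA. The name m is illustrative.

```r
# Small deterministic stand-in for dom (values chosen for illustration)
dom <- data.frame(
  tipoE = c("a", "b", NA, "a", "a"),
  mun   = c(3200102, 3200102, 3200106, 3200106, 3200106),
  x     = c(10, 20, 30, 40, 5)
)

# addNA() turns NA into a real factor level so those rows are kept;
# tapply() sums x within each tipoE/mun cell and returns a matrix
m <- tapply(dom$x,
            list(tipoE = addNA(factor(dom$tipoE)), mun = dom$mun),
            sum)

print(m)
```

Here tipoE runs down the rows, as in the xtabs output above; t(m) would give the orientation described in the question (mun as rows, tipoE as columns).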

How to subsetting efficiently by using loop in R?

I have a csv file named "table_parameter". Please, download from here. Data look like this:
time avg.PM10 sill range nugget
1 2012030101 52.2692307692308 0.11054330 45574.072 0.0372612157
2 2012030102 55.3142857142857 0.20250974 87306.391 0.0483153769
3 2012030103 56.0380952380952 0.17711558 56806.827 0.0349567088
4 2012030104 55.9047619047619 0.16466350 104767.669 0.0307528346
.
.
.
25 2012030201 67.1047619047619 0.14349774 72755.326 0.0300378129
26 2012030202 71.6571428571429 0.11373430 72755.326 0.0320594776
27 2012030203 73.352380952381 0.13893530 72755.326 0.0311135434
28 2012030204 70.2095238095238 0.12642303 29594.037 0.0281416079
.
.
My dataframe has a variable named time that contains hourly values from 1 March 2012 to 7 March 2012 in numeric form. For example, 1 March 2012, 1:00 a.m. is written as 2012030101, and so on.
From this dataset I want to create (24*11) subset dataframes, as in the table below:
For example, I want one dataframe for 1 a.m. (2012030101, 2012030201, ..., 2012030701) and avg.PM10<=10. Some of these dataframes will probably contain no observations, but that is okay, because I will be working with a very large dataset.
I can do this subsetting manually by writing (24*11 = 264) lines of code like this:
table_par<-read.csv("table_parameter.csv")
times<-as.numeric(substr(table_par$time,9,10))
par_1am_0to10 <-subset(table_par,times ==1 & avg.PM10<=10)
par_1am_10to20 <-subset(table_par,times ==1 & avg.PM10>10 & avg.PM10<=20)
par_1am_20to30 <-subset(table_par,times ==1 & avg.PM10>20 & avg.PM10<=30)
.
.
.
par_24pm_80to90 <-subset(table_par,times ==24 & avg.PM10>80 & avg.PM10<=90)
par_24pm_90to100 <-subset(table_par,times==24 & avg.PM10>90 & avg.PM10<=100)
par_24pm_100up <-subset(table_par,times ==24 & avg.PM10>100)
But I understand that this code is very inefficient. Is there a way to do it efficiently using a loop?
FYI: in the future, I want to draw some plots using these (24*11) datasets.
Update: After this subsetting, I want to draw boxplots of the range variable for every dataset. The problem is that I want to show all (24*11) boxplots of range in one plot, arranged like a matrix (as in the figure above). If you have any further questions, please let me know. Thanks a lot in advance.
You can do this using some plyr, dplyr and tidyr magic :
library(tidyr)
library(dplyr)
# I am not loading plyr here because it interferes with dplyr; I just want it for the round_any function anyway
# Read data
dfData <- read.csv("table_parameter.csv")
dfData %>%
# Extract hour and compute the rounded Avg.PM10 using round_any
mutate(hour = as.numeric(substr(time, 9, 10)),
roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
roundedPM.10 = ifelse(roundedPM.10 > 100, 100,roundedPM.10)) %>%
# Keep only the relevant columns
select(hour, roundedPM.10) %>%
# Count the number of occurences per hour
count(roundedPM.10, hour) %>%
# Use spread (from tidyr) to transform it into wide format
spread(hour, n)
If you plan on using ggplot2, you can forget about tidyr and the last line of the code in order to keep the dataframe in long format, it will be easier to plot this way.
EDIT: After reading your comment, I realised I had misunderstood your question. This will give you a boxplot for each combination of hour and Avg.PM10 interval:
library(tidyr)
library(dplyr)
library(ggplot2)
# I am not loading plyr here because it interferes with dplyr; I just want it
# for the round_any function anyway
# Read data
dfData <- read.csv("C:/Users/pformont/Desktop/table_parameter.csv")
dfDataPlot <- dfData %>%
# Extract hour and compute the rounded Avg.PM10 using round_any
mutate(hour = as.numeric(substr(time, 9, 10)),
roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
roundedPM.10 = ifelse(roundedPM.10 > 100, 100,roundedPM.10)) %>%
# Keep only the relevant columns
select(roundedPM.10, hour, range)
# Plot range as a function of hour (as a factor to have separate plots)
# and facet it according to roundedPM.10 on the y axis
ggplot(dfDataPlot, aes(factor(hour), range)) +
geom_boxplot() +
facet_grid(roundedPM.10~.)
How about a double loop like this:
table_par<-read.csv("table_parameter.csv")
times<-as.numeric(substr(table_par$time,9,10))
#create empty dataframe for output
sub.df <- data.frame(name=NA, X=NA, time=NA,Avg.PM10=NA,sill=NA,range=NA,nugget=NA)[numeric(0), ]
t_list=seq(1,24,1)
PM_list=seq(0,100,10)
for (t in t_list){
#t=t_list[1]
for (PM in PM_list){
#PM=PM_list[4]
PM2=PM+10
sub <-subset(table_par,times ==t & Avg.PM10>PM & Avg.PM10<=PM2)
if (length(sub$X)!=0) { #to avoid errors because of empty sub
name = paste("par_",t,"am_",PM,"to",PM2 , sep="")
sub$name = name
sub.df <- rbind(sub.df , sub) }
}
}
sub.df #print data frame
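Another possible sketch avoids the explicit loops by binning Avg.PM10 with cut() and splitting on hour and bin together. The four rows below are made up to stand in for table_parameter.csv, and hour, pm_bin and subsets are illustrative names. The open-ended last break (Inf) covers the "100up" group from the question.

```r
# Made-up rows standing in for table_parameter.csv
table_par <- data.frame(
  time     = c(2012030101, 2012030102, 2012030201, 2012030202),
  Avg.PM10 = c(52.3, 55.3, 67.1, 105.0),
  range    = c(45574, 87306, 72755, 29594)
)

# Hour of day = last two digits of the time stamp
hour <- as.numeric(substr(table_par$time, 9, 10))

# 11 bins: [0,10], (10,20], ..., (90,100], (100,Inf]
pm_bin <- cut(table_par$Avg.PM10,
              breaks = c(seq(0, 100, by = 10), Inf),
              include.lowest = TRUE)

# One data frame per hour/bin combination; drop = TRUE omits
# combinations with no observations. List names look like "1.(50,60]".
subsets <- split(table_par, list(hour, pm_bin), drop = TRUE)

print(names(subsets))
```

Each element of subsets is an ordinary data frame, so it can be passed to lapply for per-group work, although for plotting the long-format approach above is usually simpler.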
