In R, subset data using the longest contiguous stretch of non-NA values

I am prepping data for linear regression and want to address missing values (NA) by using the longest contiguous stretch of non-NA values in a given year and site.
I have tried na.contiguous(), but my code is not applying the function by year or site.
Thanks for your assistance.
The test data is a multivariate time series that spans 2 years and 2 sites. My hope is that the solution will accommodate data with many more years and 32 sites, so some level of automation and QA/QC is appreciated.
library(dataRetrieval)
library(dplyr)
library(lubridate)

# Read in data: q is discharge, wt is stream temperature
wt <- readNWISdv(siteNumbers = c("08181800", "07308500"),
                 parameterCd = c("00010", "00060"), statCd = c("00003"),
                 startDate = "1998-07-01", endDate = "1999-09-30")
dfwt <- wt %>%
  group_by(site_no) %>%
  select(Date, site_no, X_00010_00003, X_00060_00003) %>%
  rename(wt = X_00010_00003, q = X_00060_00003)

# Subset summer season, add dummy air temp (at)
dfwt$Date <- ymd(dfwt$Date, tz = Sys.timezone())
dfwt$month <- month(dfwt$Date)
dfwt$year <- year(dfwt$Date)
df <- dfwt %>%
  group_by(site_no) %>%
  subset(month >= 7 & month <= 9) %>%
  mutate(at = wt * 1.3)

# Add NA values
df[35:38, 3] <- NA
df[155, 3] <- NA
df[194, 3] <- NA

# This is the attempt that does not work: na.contiguous() ignores the grouping
test <- df %>%
  group_by(site_no, year) %>%
  na.contiguous(df)

Using a for loop, I found the following solution:
library(zoo)
library(plyr)
library(lubridate)

dfz <- df  # note: zoo(df) on its own does nothing; assign the data you loop over
sites <- as.vector(unique(dfz$site_no))
bfi_allsites <- data.frame()
for (i in seq_along(sites)) {
  site1 <- subset(dfz, site_no == sites[i])
  ss1 <- split(site1, site1$year)
  site1result <- lapply(ss1, na.contiguous)  # works per year
  site_df <- ldply(site1result, data.frame)
  bfi_allsites <- rbind(bfi_allsites, site_df)
}
head(bfi_allsites)
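For the automation and QA/QC the question asks for, a hedged base-R sketch (my suggestion, not from the thread): a small replacement for na.contiguous() built on rle() and complete.cases(), which works on plain data frames and can be applied per site and year with split()/lapply(), so it scales to 32 sites. The toy data frame below is a stand-in for one site-year of df.

```r
# Find the longest contiguous run of complete (NA-free) rows in a data frame
longest_complete_run <- function(d) {
  r <- rle(complete.cases(d))           # runs of complete / incomplete rows
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  ok <- which(r$values)                 # indices of the complete runs
  if (length(ok) == 0) return(d[0, , drop = FALSE])
  best <- ok[which.max(r$lengths[ok])]  # longest complete run
  d[starts[best]:ends[best], , drop = FALSE]
}

# Toy demo standing in for one site-year of df
toy <- data.frame(wt = c(1, NA, 3, 4, 5, NA, 7), q = 1:7)
longest_complete_run(toy)  # rows 3:5, the longest NA-free stretch
```

Applied per group it would look like `do.call(rbind, lapply(split(df, list(df$site_no, df$year)), longest_complete_run))`, with no hand-maintained site count.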

R Sample Programming: Show the absolute counts of female and male patients and status at end of observation in the sample [duplicate]

This question already has answers here:
Counting the frequency of an element in a data frame [duplicate]
(2 answers)
Closed 1 year ago.
Use the Aids2 dataset (see below). Use a sample size of 50 for each of the following.
Show the sample drawn using simple random sampling with replacement.
Show the frequencies (the absolute counts) of female and male patients (sex) and status at end of observation (status) in the sample.
Packages prob and sampling may be used.
My code is here:
aids <- read.csv(file = "D:/Aids2.csv", header = TRUE)
library(sampling)
nrow(aids)
s <- srswor(50, 2843)
rows <- (1:nrow(aids))[s != 0]
rows <- rep(rows, s[s != 0])
female <- rows[aids$sex == "F"]
male <- rows[aids$sex == "M"]
table(female)
table(male)
dead <- rows[aids$status == "D"]
alive <- rows[aids$status == "A"]
table(dead)
table(alive)
So I know everything runs fine, but I'm not sure how to achieve exactly what the question is asking. Can anyone help me fix my script?
Link to the file: https://drive.google.com/file/d/1vVKYQ_oDu6Fv00P-vgypifxfMgnyV7qw/view?usp=sharing
I don't have access to the data, but something like this should work.
aids <- read.csv(file = "D:/Aids2.csv", header = TRUE)
# Select 50 random rows with replacement
sample_data <- aids[sample(nrow(aids), 50, replace = TRUE), ]
# Count sex in sample_data
table(sample_data$sex)
# Count status
table(sample_data$status)
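One caveat worth flagging (my observation, not from the thread): the question asks for sampling *with* replacement, but srswor() in the sampling package samples *without* replacement; srswr() is the with-replacement version. A self-contained sketch on a mock data frame standing in for Aids2 (the sex/status column names are assumed from the question):

```r
# Mock stand-in for the Aids2 data (real file not available here)
aids <- data.frame(
  sex    = sample(c("F", "M"), 200, replace = TRUE),
  status = sample(c("A", "D"), 200, replace = TRUE)
)
# Simple random sampling WITH replacement, base R
idx <- sample(nrow(aids), 50, replace = TRUE)
sample_data <- aids[idx, ]
table(sample_data$sex)     # absolute counts of female/male in the sample
table(sample_data$status)  # absolute counts of status at end of observation
```

With the sampling package, the equivalent draw would be `s <- srswr(50, nrow(aids))` followed by `aids[rep(seq_len(nrow(aids)), times = s), ]`.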

Loop of missing data by year

I need to run a loop for my dataset, ds. The dimensions of ds are 4000 x 11. Every country of the world is represented, and each country has data from 1970 to 1999.
The data set has missing data among 8 of its columns. I need to run a loop that calculates how much missing data there is PER year. The year is in ds$year.
I am pretty sure the years (1970, 1971, 1972...) are numeric values.
This is my current code
missingds <- c()
for (i in 1:length(ds)) {
  missingds[names(ds)[i]] <- sum(is.na(ds[i])) / 4000
}
This gives me the proportion of missing data per variable of ds. I just cannot figure out how to get it to report the proportion across the variables per year.
I do have an indicator variable ds$missing which reports 1 if there is an NA in any of the columns of that row and 0 if not.
[A picture of ds was attached here.]
To count the number of NA values in each column using dplyr you can do:
library(dplyr)
result <- data %>%
  group_by(year) %>%
  summarise(across(gdp_growth:polity, ~ sum(is.na(.))))
In base R you can use aggregate (note na.action = NULL, without which aggregate drops the very NA rows you are trying to count):
aggregate(cbind(gdp_growth, gdp_per_capita, inf_mort, pop_density, polity) ~ year,
          data, function(x) sum(is.na(x)), na.action = NULL)
Replace sum with mean if you want the proportion of NA values in each year.
Using data.table:
library(data.table)
setDT(data)[, lapply(.SD, function(x) sum(is.na(x))),
            by = year, .SDcols = gdp_growth:polity]
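Since the asker already has the indicator column ds$missing (1 if any column in that row is NA, 0 otherwise), the per-year proportion of incomplete rows is a one-liner in base R. A minimal sketch on mock data (the real ds isn't available here):

```r
# Mock stand-in for ds: 4 rows per year, with the asker's 0/1 indicator
ds <- data.frame(
  year    = rep(1970:1972, each = 4),
  missing = c(0, 1, 0, 0,  1, 1, 0, 0,  0, 0, 0, 1)
)
# Proportion of rows with any NA, per year
prop_by_year <- tapply(ds$missing, ds$year, mean)
prop_by_year  # 1970: 0.25, 1971: 0.50, 1972: 0.25
```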

Creating a data frame from a data set with values over (certain point) [duplicate]

This question already has answers here:
Filter data.frame rows by a logical condition
(9 answers)
Closed 2 years ago.
So I have a data set from a package and I want to create a new data frame, with only Cities with crime rates above 30% for one column.
The data set has a column, Crime, which has the crime rates for cities. The values are in decimal form.
df2 <- Cities[,"Crime" > .30]
But it's not returning only the cities with crime rates above 0.30; it's returning all of them. I'm not sure why, since I've specified > 0.30 in the code. I spent some time looking for help on subsetting and creating data frames, but none of it addressed this type of problem, only general subsetting where you select a whole column.
I feel like I'm very close, and I've tried other things, but I'm getting frustrated.
You have to index correctly: in your code you are indexing at the column level. If you want to filter rows, you have to index at the row level, which is the left side of the comma inside the brackets. Here is an example:
set.seed(123)
# Data
Cities <- data.frame(Crime = runif(100, 0, 1))
# Filter
df2 <- Cities[Cities$Crime > .30, , drop = FALSE]
Rows in the data frames:
# Original
nrow(Cities)
[1] 100
# Filtered
nrow(df2)
[1] 65
Or using dplyr:
library(dplyr)
# Code
df2 <- Cities %>% filter(Crime > .30)
Or base R subset():
# Code 2
df2 <- subset(Cities, Crime > .30)
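It may also help to see exactly why the original line returned everything. A hedged sketch (my explanation, not from the thread): `"Crime" > .30` never touches the column; it compares the string "Crime" with 0.30 (coerced to "0.3"), which evaluates to a single TRUE, and `Cities[, TRUE]` keeps every column and every row.

```r
# The literal string comparison the original code performed
"Crime" > .30                                     # TRUE: not a column lookup
# Toy stand-in data (the real Cities dataset isn't available here)
Cities <- data.frame(Crime = c(0.1, 0.5, 0.9))
nrow(Cities[, "Crime" > .30, drop = FALSE])       # 3: nothing was filtered
nrow(Cities[Cities$Crime > .30, , drop = FALSE])  # 2: the intended row filter
```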

How can I add the populations of males and females together to remove gender as a variable in a demographics table. In R Studio [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 2 years ago.
This is my first time posting a question, so may not have the correct info to start, apologies in advance. Am new to R. Prefer to use dplyr or tidyverse because those are the packages we've used so far. I did search for a similar question, but most gender/sex related questions are around separating the data, or performing operations on each separately.
I have a table of population counts, with variables (factors) Age Range, Year and Sex, with Population as the dependent variable. I want to create a plot to show if the population is aging - that is, showing how the relative proportion of different ages groups changes over time. But gender is not relevant, so I want to add together the population counts for males and females, for each year and age range.
I don't know how to provide a copy of the raw data .csv file, so if you have any suggestions, please let me know.
This is a sample of the data (output table):
And here is the code so far:
file_name <- "AusPopDemographics.csv"
AusDemo_df <- read.table(file_name, sep = ",", header = TRUE)
(grp_AusDemo_df <- AusDemo_df %>% group_by(Year, Age))
I am guessing it may be something like pivot_wider() to bring male and female up as column headings, then transmute() to sum them into a new population column.
Thanks for your help.
With dplyr you could do something like this:
library(dplyr)
grp_AusDemo_df <- AusDemo_df %>%
  group_by(Year, Age) %>%
  summarise(Population = sum(Population, na.rm = TRUE))
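A base-R equivalent using aggregate(), shown on mock data since the CSV isn't shared; the column names Year, Age, Sex, and Population are assumed from the question:

```r
# Mock stand-in for AusPopDemographics.csv
AusDemo_df <- data.frame(
  Year = c(2000, 2000, 2000, 2000),
  Age  = c("0-4", "0-4", "5-9", "5-9"),
  Sex  = c("F", "M", "F", "M"),
  Population = c(10, 12, 8, 9)
)
# Sum the sexes within each Year/Age cell, dropping Sex as a variable
pop <- aggregate(Population ~ Year + Age, data = AusDemo_df, FUN = sum)
pop  # one row per Year/Age combination: 22 and 17
```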

NA values in a dplyr summary in R

I am trying to create a table, which includes relative frequencies (counts) of variables taken from two groups (A and B) that fall within pre-given temporal intervals. My problem is that if a row starts with 0 seconds (see start_sec), the variable does not fall within the 0-5 seconds interval but is marked as NA (see Output). My wish is to include these cases within the above-mentioned interval.
This is a dummy example:
Variables
group <- c("A","A","A","A","A","A","B","B","B")
person <- c("p1","p1","p1","p3","p2","p2","p1","p1","p2")
start_sec <- c(0,10.7,11.8,3.9,7.4,12.1,0,3.3,0)
dur_sec <- c(7.1,8.2,9.3,10.4,11.5,12.6,13.7,14.8,15.9)
Data frame
df <- data.frame(group,person,start_sec,dur_sec)
df
Pipeline
library(dplyr)
df %>%
  group_by(group, person, interval = cut(start_sec, breaks = c(0, 5, 10, 15))) %>%
  summarise(counts = n(), sum_dur_sec = sum(dur_sec))
Output (so far)
Thank you in advance for all comments and feedback!
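A likely fix (my suggestion, not an answer from the thread): cut() builds left-open intervals by default, so start_sec == 0 falls outside (0,5] and becomes NA. Passing include.lowest = TRUE closes the first interval to [0,5], which pulls those rows into the 0-5 second bin.

```r
# The start_sec values from the dummy example above
start_sec <- c(0, 10.7, 11.8, 3.9, 7.4, 12.1, 0, 3.3, 0)
# include.lowest = TRUE makes the first interval [0,5] instead of (0,5]
interval <- cut(start_sec, breaks = c(0, 5, 10, 15), include.lowest = TRUE)
any(is.na(interval))  # FALSE: the zeros now land in [0,5]
table(interval)
```

The same argument can be passed inside the group_by() pipeline's cut() call.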
