I am trying to interpolate/extrapolate NA values. The dataset comes from a measuring station that measures soil temperature at 4 depths every 5 minutes. In this specific example there are erroneous values (-888.88) at the end of the measurements for the 0 cm and 1-5 cm depth variables, which I converted to NA. Now my professor wants me to interpolate/extrapolate these values for this and all the other datasets I have. I am aware that extrapolating so many values after the last observation could be statistically inaccurate, but I am trying to at least come up with working code.
So far I have tried to extrapolate one of the variables (SoilTemp_1.5cm). The final line runs, but when I open the data frame the NAs are still there.
library(dplyr)
library(Hmisc)
MyD <- read.csv("2319538_Bodentemp_braun_WILDKOGEL_17_18 - Copy.csv",header=TRUE, sep=";")
MyD$date <- as.Date(MyD$Date, "%d.%m.%Y")
MyD$datetime <- as.POSIXct(MyD$Date.Time..GMT.01.00, format = "%d.%m.%Y %H:%M")
MyD[,-c(1,2,3,4,9)][MyD[,-c(1,2,3,4,9)] == -888.88] <- NA #convert erroneous data to NA
MyD %>% mutate(`SoilTemp_1.5cm`=approxExtrap(x=SoilTemp_5cm, y=SoilTemp_1.5cm, xout=SoilTemp_5cm)$y)
I also tried the way below, which gives me a list of 2 that turns into a lot of columns instead of rows when I convert it to a data frame. I will admit this approxExtrap syntax confuses me a little.
MyD1 <- approxExtrap(MyD$SoilTemp_5cm, MyD$SoilTemp_1.5cm,xout=MyD$SoilTemp_5cm)
MyD1
I am honestly not sure how best to make the data reproducible, so here is a pastebin link to a dput() output: https://pastebin.com/NFZdmm4L. I tried to include as much output as I could. Keep in mind that I excluded some of the columns when running dput(), so the code MyD[,-c(1,2,3,4,9)][MyD[,-c(1,2,3,4,9)] == -888.88] might differ. In any case, the dput() output already has the NAs included, so you might not even need that step.
Thanks in advance.
Best regards,
Zorin
na.approx will fill in NAs with interpolated values, and rule = 2 will extend the first and last non-NA values out to the ends of the series.
library(zoo)
x <- c(NA, 4, NA, 5, NA) # test input
na.approx(x, rule = 2)
## [1] 4.0 4.0 4.5 5.0 5.0
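To apply this to your data frame, remember that mutate() returns a new data frame, so the result has to be assigned back (which is why the NAs seemed to stay in your original attempt). Here is a minimal sketch, assuming the temperature columns all start with "SoilTemp" and that the 5-minute spacing is regular; adjust the column selection to your actual names.
library(dplyr)
library(zoo)
# interpolate NAs in every soil-temperature column and carry the first/last
# observed values out to the ends of each series (rule = 2)
MyD <- MyD %>%
  mutate(across(starts_with("SoilTemp"), ~ na.approx(.x, rule = 2)))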
I have the following data frame in R:
df <- data.frame(time=c("10:01","10:05","10:11","10:21"),
power=c(30,32,35,36))
Problem: I want to calculate the energy consumption, so I need the sum of the time differences multiplied by the power. But every row has only one timestamp, meaning I need to subtract values from two different rows, and that is the part I cannot figure out. I guess I need some kind of function, but I couldn't find any hints online.
Example: It has to subtract row1$time from row2$time, and then multiply the difference by row1$power.
As I said, I do not know how to implement this in one call; the subtraction part confuses me since it takes values from different rows.
Expected output: E=662
Try this:
tmp = strptime(df$time, format="%H:%M")
df$interval = c(as.numeric(diff(tmp)), NA)
sum(df$interval*df$power, na.rm=TRUE)
I got 662 back.
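To see where 662 comes from: diff(tmp) gives the gaps between consecutive timestamps, 4, 6 and 10 minutes, and each gap is multiplied by the power at its start, so 30*4 + 32*6 + 35*10 = 120 + 192 + 350 = 662. The last power value (36) is dropped because there is no following timestamp, which is what the trailing NA and na.rm = TRUE take care of.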
I have a data frame that is missing some data in the end_station_id column. It was read in properly from a csv file (3,489,749 rows), with 147,242 rows missing that value (NA).
I would like to fill in the missing end_station_id by finding a match on the end latitude/longitude pair of a known end_station_id.
```{r}
end_station_id <chr>   end_lat <dbl>   end_lng <dbl>
NA                     41.92           -87.70
NA                     41.92           -87.70
NA                     41.86           -87.63
ta52                   NA              NA
499                    41.9306         -87.7238
255                    41.92           -87.7078
```
So in the above example I would like to replace the first two NAs with 255 because the gps pairs match.
I know that I probably have to use lapply somehow, but I have no clue how.
The next complication is that, because of the way the GPS was recorded, the match might not be exact: the bicycles were put in racks, and some of the bikes recorded more GPS significant digits than others.
So, to make the matching easier, one thought I had was to find the mean lat/lng for each station: create a new DF with the unique station ids and the mean of all the GPS points for each id, then replace those mean points back into the original df so that there are only 709 station GPS points.
OR
I think there are enough lat/lon points that, just by scanning the entire DF, there should be an exact match somewhere in the data set.
So how do I do the lapply() or apply() to see if there is a match on lat/lon and then save the matching station id in the df?
It would seem I first need a DF with no missing IDs, which I can get by filtering. Then, as I find a match, I would rbind the fixed row onto the clean DF.
Sorry, but I just don't have enough R training with apply(x, function) yet, if that helps.
So, to sum up: I have a df with missing data that could be filled in by comparing the other columns.
I'd worry about the accuracy of your first method. Rounding the lat/long values to two decimals wouldn't give you the matches you're looking for, as rounding the lon of station 255 to two digits would give you -87.71, which is different from the NA station lon (-87.70).
Here's an implementation of your second method, using dplyr:
library(dplyr)
# Separate data into those with and without ids
df_clean <- df %>% filter(!is.na(end_station_id))
df_na <- df %>% filter(is.na(end_station_id))
# match stations to NAs based on lat/lng
df_matched <- df_na %>%
  left_join(df_clean,
            by = c("end_lat", "end_lng"),
            suffix = c(".na", ".clean")) %>%
  mutate(end_station_id = end_station_id.clean) %>%
  select(-end_station_id.na, -end_station_id.clean)
# Recombine data
df_cleaned <- rbind(df_clean, df_matched)
Maybe rounding the values before joining would give you better matching.
Another (better?/more involved) way to go about it would be to define min and max allowable values for each station, then assign the station based on being within those ranges. Or find the station that is the smallest distance away.
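For the rounding idea mentioned above, a rough sketch could look like this. The lat_r/lng_r helper columns and the 2-decimal precision are just placeholders to adjust to your data, and ties (one rounded cell covering several stations) are resolved arbitrarily here.
library(dplyr)
# one station id per rounded coordinate cell
station_lookup <- df_clean %>%
  mutate(lat_r = round(end_lat, 2), lng_r = round(end_lng, 2)) %>%
  distinct(lat_r, lng_r, .keep_all = TRUE) %>%
  select(end_station_id, lat_r, lng_r)
# attach the looked-up id to the rows that are missing one
df_matched <- df_na %>%
  mutate(lat_r = round(end_lat, 2), lng_r = round(end_lng, 2)) %>%
  select(-end_station_id) %>%
  left_join(station_lookup, by = c("lat_r", "lng_r")) %>%
  select(-lat_r, -lng_r)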
I am prepping data for linear regression and want to address missing values (NA) by using the longest contiguous stretch of non-NA values in a given year and site.
I have tried na.contiguous(), but my code is not applying the function by year or site.
Thanks for your assistance.
The test data is a multivariate time series that spans 2 years and 2 sites. My hope is that the solution will accommodate data with many more years and 32 sites, so some level of automation and QA/QC is appreciated.
library(dataRetrieval)
library(dplyr)
library(lubridate)  # used below for ymd(), month() and year()
# read in data: q is discharge and wt is stream temperature
wt <- readNWISdv(siteNumbers = c("08181800", "07308500"),
                 parameterCd = c("00010", "00060"), statCd = c("00003"),
                 startDate = "1998-07-01", endDate = "1999-09-30")
dfwt <- wt %>%
  group_by(site_no) %>%
  select(Date, site_no, X_00010_00003, X_00060_00003) %>%
  rename(wt = X_00010_00003, q = X_00060_00003)
#Subset summer season, add dummy air temp (at).
dfwt$Date<-ymd(dfwt$Date, tz=Sys.timezone())
dfwt$month<-month(dfwt$Date)
dfwt$year<-year(dfwt$Date)
df <- dfwt %>%
  group_by(site_no) %>%
  subset(month >= 7 & month <= 9) %>%
  mutate(at = wt * 1.3)
# add NA
df[35:38,3]<-NA
df[155,3]<-NA
df[194,3]<-NA
test <- df %>%
  group_by(site_no, year) %>%
  na.contiguous(df)
Using a for loop, I found the following solution:
library(zoo)
library(plyr)
library(lubridate)
dfz <- df  # working copy of the grouped data
sites <- as.vector(unique(dfz$site_no))
bfi_allsites <- data.frame()
for (i in seq_along(sites)) {
  site1 <- subset(dfz, site_no == sites[i])  # one site at a time
  str(site1)
  ss1 <- split(site1, site1$year)            # split that site by year
  site1result <- lapply(ss1, na.contiguous)  # longest NA-free stretch per year
  site_df <- ldply(site1result, data.frame)
  bfi_allsites <- rbind(bfi_allsites, site_df)
}
head(bfi_allsites)
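If you prefer to stay inside dplyr, a more compact version of the same idea is group_modify(). This is only a sketch, assuming na.contiguous() accepts the per-group data frames the same way it does inside the loop above.
library(dplyr)
# apply na.contiguous() within each site/year group
test <- df %>%
  group_by(site_no, year) %>%
  group_modify(~ as.data.frame(na.contiguous(.x))) %>%
  ungroup()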
I am trying to subset a data frame based on a range of time. Someone asked this question in the past and the answer was to use R CMD INSTALL lubridate_1.3.1.tar.gz (see: subset rows according to a range of time).
The issue with this answer is that I get the following warning:
> install.packages("lubridate_1.3.2.tar.gz")
Warning in install.packages :
package ‘lubridate_1.3.2.tar.gz’ is not available (for R version 3.1.2)
I am looking for something very similar to this answer but I cannot figure out how to do this. I have a MasterTable with all of my data organized into columns. One of my columns is called maxNormalizedRFU.
My question is simple:
How can I subset my maxNormalizedRFU column by time?
I would simply like to add another column that only displays the maxNormalizedRFU data between 10 hours and 14 hours. Here is what I have so far:
#Creates the master table
MasterTable <- inner_join(LongRFU, LongOD, by= c("Time.h", "Well", "Conc.nM", "Assay"))
#normalizes my data by fluorescence (RFU) and optical density (OD) based on 6 different subsets called "Assay"
MasterTable$NormalizedRFU <- MasterTable$AvgRFU/MasterTable$AvgOD
#creates a column that only picks the maximum value of each "Assay"
MasterTable <- ddply(MasterTable, .(Conc.nM, Assay), transform, maxNormalizedRFU=max(NormalizedRFU))
#The issue
MasterTable$CutmaxNormalizedRFU <- ddply(maxNormalizedRFU, "Time.h", transform, [MasterTable$Time.h < 23.00 & MasterTable$Time.h > 10.00,])
Attached is a sample of my dataset. Since the original file has over 90 000 lines, I have only attached a small fraction of it (only one assay and one concentration).
My line is currently using ddply to do the subset but this simply does not work. Does anyone have a suggestion as to how to fix this issue?
Thank you in advance!
Marty
I downloaded your data and had a look. If I am not mistaken, all you need is to subset the data using Time.h, keeping the range of time (10-23) you want. I used dplyr and did the following, which asks R to pick out the rows whose Time.h values lie between 10 and 23. Your data frame is called mydf here.
library(dplyr)
filter(mydf, between(Time.h, 10, 23))
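Since you mentioned wanting an extra column rather than a filtered table, one variation (a sketch, assuming the data frame is your MasterTable and that 10-23 is the window you settle on) would be:
library(dplyr)
# keep maxNormalizedRFU inside the time window, NA elsewhere
MasterTable <- MasterTable %>%
  mutate(CutmaxNormalizedRFU = ifelse(between(Time.h, 10, 23),
                                      maxNormalizedRFU, NA))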
Hi, I am using R to analyze my data. I have time-series data in the following format:
dates ID
2008-02-12 3
2008-03-12 3
2008-05-12 3
2008-09-12 3
2008-02-12 8
2008-04-12 6
I would like to create a plot with dates on the x axis and ID on the y axis, such that it draws a point if an id is reported for that date and nothing if there is no data.
In the original dataset I only have an id if a value is reported on that date. For example, for 2008-02-12 there is no data reported for id 6, hence it is missing from my dataset.
I was able to get all the dates with the unique(df$dates) function, but I don't know enough about R data structures to loop through the data, build a matrix of 1s and 0s for all ids, and then plot it.
I will be grateful if you can help me with the code or give me some pointers on an effective way to approach this problem.
Thanks in advance.
It seems you want something like a scatter plot:
# input data
DF <- read.csv(
  text = 'Year,ID
2008-02-12,3
2008-03-12,3
2008-05-12,3
2008-09-12,3
2008-02-12,8
2008-04-12,6',
  colClasses = c('character', 'integer'))
# convert first column from characters to dates
DF$Year <- as.POSIXct(DF$Year,format='%Y-%m-%d',tz='GMT')
# scatter plot
plot(x = DF$Year, y = DF$ID, type = 'p', xlab = 'Date', ylab = 'ID',
     main = 'Reported Values', pch = 19, col = 'red')
Result:
But this approach has a problem: if you have, for example, unique(ids) = c(1,2,1000), the space on the y axis between id=2 and id=1000 will be very large (the same holds for the dates on the x axis).
Maybe you want a sort of id-date "map", like the following:
# input data
DF <- read.csv(
  text = 'Year,ID
2008-02-12,3
2008-03-12,3
2008-05-12,3
2008-09-12,3
2008-02-12,8
2008-04-12,6',
  colClasses = c('character', 'integer'))
dates <- as.factor(DF$Year)
ids <- as.factor(DF$ID)
plot(x = as.integer(dates), y = as.integer(ids), type = "p",
     xlim = c(0.5, length(levels(dates)) + 0.5),
     ylim = c(0.5, length(levels(ids)) + 0.5),
     xaxs = "i", yaxs = "i",
     xaxt = "n", yaxt = "n", main = "Reported Values",
     xlab = "Date", ylab = "ID", pch = 19, col = 'red')
axis(1, at = 1:length(levels(dates)), labels = levels(dates))
axis(2, at = 1:length(levels(ids)), labels = levels(ids))
# add grid lines between the factor levels
abline(v = (1:(length(levels(dates)) - 1)) + 0.5, col = "gray80", lty = 2)
abline(h = (1:(length(levels(ids)) - 1)) + 0.5, col = "gray80", lty = 2)
Result: