I am creating a raster layer for an area with multiple environmental variables. The data usually come as netCDF files (arrays) containing lat, long, date, and the variable in question - in this case sea_ice_fraction.
The data for sea surface temperature (sst) came in an understandable format, at least from the point of view of building a prediction grid:
, , Date = 2019-11-25
Long
Lat 294.875 295.125 295.375 295.625 295.875 296.125 296.375 296.625 296.875 297.125
-60.125 2.23000002 2.04 1.83 1.53 1.18 1.00 0.9800000 1.06 1.25 1.40999997
-60.375 2.06999993 1.79 1.60 1.31 1.09 0.97 1.0000000 1.15 1.30 1.42999995
-60.625 1.93999994 1.64 1.45 1.28 1.14 1.02 0.9899999 1.03 1.10 1.13000000
Each row is a single latitude coordinate (at the data's resolution), each column is a longitude coordinate, and each slice of the third dimension is a date.
My goal is to calculate the mean across all dates for each coordinate cell, which in the array case is easy:
sst.c1 <- apply(sst.c1, c(1,2), mean)
Then I project the result to a RasterLayer.
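For reference, a minimal sketch of that projection step (assuming sst.c1 keeps the lat/lon coordinates in its dimnames, with rows running from highest to lowest latitude, and the 0.25° spacing shown in the printout above):
library(raster)
# sst.c1 is the lat x lon matrix of date-averaged values; raster() fills a
# matrix from the top row down, which matches latitudes decreasing by row
lats <- as.numeric(rownames(sst.c1))
lons <- as.numeric(colnames(sst.c1))
cell <- 0.25  # grid spacing; the coordinates above are cell centres
sst.r <- raster(sst.c1,
                xmn = min(lons) - cell/2, xmx = max(lons) + cell/2,
                ymn = min(lats) - cell/2, ymx = max(lats) + cell/2,
                crs = "+proj=longlat +datum=WGS84")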
However, the sea ice data came as a data frame with 4 columns: lat, long, date, and sea_ice_fraction:
time lat lon sea_ice_fraction
<chr> <dbl> <dbl> <dbl>
1 2019-11-25T12:00:00Z -66.1 -65.1 0.580
2 2019-11-25T12:00:00Z -66.1 -65.1 NA
3 2019-11-25T12:00:00Z -66.1 -65.0 NA
4 2019-11-25T12:00:00Z -66.1 -65.0 NA
5 2019-11-25T12:00:00Z -66.1 -64.9 NA
How can I turn this data frame into an array like the sst data? Or directly into a raster holding the mean across dates for each cell?
Can you not just do this using dplyr?
The following should work fine:
library(dplyr)
df %>%
  group_by(lat, lon) %>%
  # na.rm = TRUE so cells with some missing dates still get a mean
  summarize(sea_ice_fraction = mean(sea_ice_fraction, na.rm = TRUE)) %>%
  ungroup()
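If you then want a RasterLayer rather than a data frame, a rough sketch of the extra step (assuming the lat/lon pairs lie on a regular grid, which rasterFromXYZ() requires):
library(dplyr)
library(raster)
ice_mean <- df %>%
  group_by(lat, lon) %>%
  summarize(sea_ice_fraction = mean(sea_ice_fraction, na.rm = TRUE)) %>%
  ungroup()
# rasterFromXYZ() expects the columns in x (lon), y (lat), value order
ice_r <- rasterFromXYZ(ice_mean[, c("lon", "lat", "sea_ice_fraction")],
                       crs = "+proj=longlat +datum=WGS84")
plot(ice_r)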
I have temperature data points in different depth intervals, with associated lat/long values, across the study area. I want to make a raster and then interpolate between raster cells where there is no data. I can do it using Krig in the fields package, but I wonder whether there are better approaches. The data points are irregularly spaced and we want to take space into account. For each depth interval, we want to create a separate raster.
This is an example of what my data looks like:
# A tibble: 21 x 8
date.time id lon.x lat.y depthbin1 depthbin2 depthbin3 depthbin4
<dttm> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-12-09 23:09:44 Kopaitic_Inc1_KI04 451144. 2985192. -0.7 -0.742 -0.838 NA
2 2018-12-09 23:12:25 Kopaitic_Inc1_KI04 451076. 2985416. -0.9 NA NA NA
3 2018-12-09 23:13:15 Kopaitic_Inc1_KI04 451054. 2985489. -0.51 -0.546 -0.595 -0.622
4 2018-12-09 23:16:00 Kopaitic_Inc1_KI04 450985. 2985731. -0.474 -0.525 -0.575 -0.645
5 2018-12-09 23:17:56 Kopaitic_Inc1_KI04 450940. 2985903. -0.6 NA NA NA
6 2018-12-09 23:18:36 Kopaitic_Inc1_KI04 450926. 2985962. -0.544 -0.526 -0.592 -0.639
7 2018-12-09 23:21:39 Kopaitic_Inc1_KI04 450870. 2986226. -0.6 -0.595 -0.627 -0.665
8 2018-12-09 23:25:10 Kopaitic_Inc1_KI04 450820. 2986512. -0.5 -0.526 -0.567 -0.576
9 2018-12-09 23:29:41 Kopaitic_Inc1_KI04 450777. 2986829. -0.4 -0.405 -0.512 -0.610
10 2018-12-09 23:32:19 Kopaitic_Inc1_KI04 450763. 2986985. -0.896 NA NA NA
# ... with 11 more rows
There are date, id, longitude and latitude variables. The mean temperature was measured by a device in each depth bin whenever the animal dived to that depth interval; if the animal didn't dive that deep, the depth-bin value is empty.
This is how I am interpolating at the moment:
# Make a raster layer
library(raster)
# projection
utm.prj = " +proj=utm +zone=21 +south +datum=WGS84 +units=m +no_defs "
# create a SpatialPointsDataFrame
coordinates(divetemps) = ~lon.x+lat.y
proj4string(divetemps) <-CRS(utm.prj)
# create an empty raster object to the extent of the points
rast <- raster(ext=extent(divetemps),crs = CRS(utm.prj), resolution = 500) # 500 m x 500 m
rast
# rasterize your irregular points
rasOut<-raster::rasterize(divetemps, rast, divetemps$depthbin1, fun = mean) # we use a mean function here to regularly grid the irregular input points
plot(rasOut)
library(fields)
# Function to Krig
krigR <- function(rast){
  # coordinates and values of every cell in the rasterized layer
  xy <- data.frame(raster::xyFromCell(rast, 1:ncell(rast)))
  v <- getValues(rast)
  # fit a Krig surface to the cell values and predict over the whole raster
  krg <- fields::Krig(xy, v)
  ras.int <- raster::interpolate(rast, krg)
  proj4string(ras.int) <- proj4string(rast)
  return(ras.int)
}
surface = krigR(rasOut)
plot(surface)
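As mentioned above, we want one raster per depth interval; this is a rough sketch of how I repeat the same steps for the other bins, reusing the empty raster rast and the krigR() helper from above (and assuming the depthbin1 to depthbin4 column names shown in the data):
# rasterize and krige each depth-bin column in turn
bins <- paste0("depthbin", 1:4)
surfaces <- lapply(bins, function(b) {
  r <- raster::rasterize(divetemps, rast, divetemps[[b]], fun = mean)
  krigR(r)
})
names(surfaces) <- bins
plot(surfaces[["depthbin2"]])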
This is an example of the plots that I get when using the fields::Krig function to interpolate the temperature values for depth bin 1 across the whole study area:
Interpolated temperature values over study area
I am not entirely happy with the plots I get from fields::Krig because I don't know how accurate they are. I know there is not a big difference in temperatures across the study area, but I am sure my plots can look better than this.
So I would like to try out other R packages and functions for interpolating temperature values across a study area. Does anyone have suggestions for functions or packages you have used before that I could look into and try out?
I have a huge number of DFs in R (>50), which correspond to different filters I've applied; here's an example of 7 of them:
Steps_Day1 <- filter(PD2, Gait_Day == 1)
Steps_Day2 <- filter(PD2, Gait_Day == 2)
Steps_Day3 <- filter(PD2, Gait_Day == 3)
Steps_Day4 <- filter(PD2, Gait_Day == 4)
Steps_Day5 <- filter(PD2, Gait_Day == 5)
Steps_Day6 <- filter(PD2, Gait_Day == 6)
Steps_Day7 <- filter(PD2, Gait_Day == 7)
Each of the data frames contains 19 variables; however, I'm only interested in speed (to calculate a mean) and subject ID, as each subject has multiple speed observations in the same DF.
An example of the data we're interested in, in dataframe - Steps_Day1:
Speed SubjectID
0.6 1
0.7 1
0.7 2
0.8 2
0.1 2
1.1 3
1.2 3
1.5 4
1.7 4
0.8 4
The data go up to 61 participants, and each participant's number of observations is much larger than shown here.
What I want to do is write code that automatically cycles through each of the 50+ data frames (taking the 7 above as an example), calculates the mean speed for each participant, and saves it in a new data frame alongside the means for each participant from the other DFs.
An example of the desired per-participant means for Steps_Day1 (values not accurate):
Speed SubjectID
0.6 1
0.7 2
1.2 3
1.7 4
and so on, before I end up with a final DF containing, as columns, the means for each participant from each of the other data frames, which may look something like:
Steps_Day1 StepsDay2 StepsDay3 StepsDay4 SubjectID
0.6 0.8 0.5 0.4 1
0.7 0.9 0.6 0.6 2
1.2 1.1 0.4 0.7 3
1.7 1.3 0.3 0.8 4
I could do this with some horrible, messy, long code, but I'm looking to see if anyone has more intuitive ideas please!
:)
To add to the previous answer, I agree that it is much easier to do this without creating a new data frame for each day. Using some generated data, you can achieve your desired results as follows:
library(dplyr)
library(tidyr)
# Generate some data
df <- data.frame(
  day = rep(1:5, 1, 100),
  subject = rep(5:10, 1, 100),
  speed = runif(500)
)
df %>%
  group_by(day, subject) %>%
  summarise(avg_speed = mean(speed)) %>%
  pivot_wider(names_from = day,
              names_prefix = "Steps_Day",
              values_from = avg_speed)
# A tibble: 6 × 6
subject Steps_Day1 Steps_Day2 Steps_Day3 Steps_Day4 Steps_Day5
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 0.605 0.416 0.502 0.516 0.517
2 6 0.592 0.458 0.625 0.531 0.460
3 7 0.475 0.396 0.586 0.517 0.449
4 8 0.430 0.435 0.489 0.512 0.548
5 9 0.512 0.645 0.509 0.484 0.566
6 10 0.530 0.453 0.545 0.497 0.460
You don't include an MCVE of your dataset, so I can't test a solution, but it seems like a pretty simple problem using tidyverse tools.
First, why split PD2 into separate data frames at all? If you skip that, you can just use group_by and summarize to get the average per group:
PD2 %>%
group_by(Gait_Day, SubjectID) %>%
summarize(Steps = mean(Speed))
This will give you a "long-form" data.frame with 3 variables: Gait_Day, SubjectID, and Steps, which holds the mean speed for that subject and day. If you want it in the format you show at the end, just pivot into "wide-form" using pivot_wider, as sketched below. You can see this question for further explanation: How to reshape data from long to wide format
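A rough sketch of that last step, building on the code above (using tidyr's pivot_wider):
library(dplyr)
library(tidyr)
PD2 %>%
  group_by(Gait_Day, SubjectID) %>%
  summarize(Steps = mean(Speed)) %>%
  pivot_wider(names_from = Gait_Day,
              names_prefix = "Steps_Day",
              values_from = Steps)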
I have two data frames. The first (REF) contains daily temperatures (along the columns) from a climate model on 2.5° x 2.5° grids (along the rows) covering the whole globe, and the other (OBS) contains daily temperatures for the same time period on finer 0.5° x 0.5° grids also covering the whole globe.
REF 5760 obs. of 9864 variables
X Y 1979.01.01 1979.01.02 1979.01.03
0.00 40.00 10.50 10.40 11.20
2.50 40.00 9.65 8.45 9.30
5.00 40.00 7.75 10.80 8.80
OBS 61143 obs. of 9864 variables
X Y 1979.01.01 1979.01.02 1979.01.03
0.00 40.00 9.50 8.60 10.10
0.50 40.00 8.65 8.75 9.70
1.00 40.00 8.75 9.80 8.10
I wish to interpolate the daily temperature values from the REF data frame to match the finer spatial resolution of the coordinates in the OBS data frame. My output (REF2) should therefore be a data frame with the same dimensions as OBS. I have looked at various interp solutions but am getting lost. Any suggestions?
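To make the target concrete, this is a rough sketch of the kind of thing I imagine for a single date column, using the raster package (assuming X/Y are regular lon-lat coordinates and the date columns keep the names shown above), though I'm not sure it's the right direction:
library(raster)
# build a 2.5-degree raster from one REF date column, then sample it
# by bilinear interpolation at the finer 0.5-degree OBS coordinates
ref_day  <- rasterFromXYZ(REF[, c("X", "Y", "1979.01.01")],
                          crs = "+proj=longlat +datum=WGS84")
REF2_day <- raster::extract(ref_day, OBS[, c("X", "Y")], method = "bilinear")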
To give a small working example, suppose I have the following data frame:
library(dplyr)
country <- rep(c("A", "B", "C"), each = 6)
year <- rep(c(1,2,3), each = 2, times = 3)
categ <- rep(c(0,1), times = 9)
pop <- rep(c(NA, runif(n=8)), each=2)
money <- runif(18)+100
df <- data.frame(Country = country,
Year = year,
Category = categ,
Population = pop,
Money = money)
The data I'm actually working with has many more repetitions: for every country, year, and category there are many repeated rows corresponding to various sources of money, and I want to sum these all together. However, for now it's enough to have just one row for each country, year, and category, and to trivially apply sum() to each row; this will still exhibit the behavior I'm trying to get rid of.
Notice that for country A in year 1, the population listed is NA. Therefore when I run
aggregate(Money ~ Country+Year+Category+Population, df, sum)
the resulting data frame has dropped the rows corresponding to country A and year 1. I'm only using the ...+Population... bit of code because I want the output data frame to retain this column.
I'm wondering how to make the aggregate() function not drop things that have NAs in the columns by which the grouping occurs--it'd be nice if, for instance, the NAs themselves could be treated as values to group by.
My attempts: I tried turning the Population column into factors but that didn't change the behavior. I read something on the na.action argument but neither na.action=NULL nor na.action=na.skip changed the behavior. I thought about trying to turn all the NAs to 0s, and I can't think of what that would hurt but it feels like a hack that might bite me later on--not sure. But if I try to do it, I'm not sure how I would. When I wrote a function with the is.na() function in it, it didn't apply the if (is.na(x)) test in a vectorized way and gave the error that it would just use the first element of the vector. I thought about perhaps using lapply() on the column and coercing it back to a vector and sticking that in the column, but that also sounds kind of hacky and needlessly round-about.
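For reference, the vectorized form of the NA-to-0 replacement I was fumbling with would be something like the line below, though as I said I'm not convinced it's a good idea:
# element-wise replacement; is.na() is vectorized, unlike an if () test
df$Population[is.na(df$Population)] <- 0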
The solution here seemed to be about keeping the NA values out of the data frame in the first place, which I can't do: Aggregate raster in R with NA values
Since you already mention dplyr before your data, you can use the dplyr::summarise function; summarise supports grouping on NA values.
library(dplyr)
df %>% group_by(Country,Year,Category,Population) %>%
summarise(Money = sum(Money))
# # A tibble: 18 x 5
# # Groups: Country, Year, Category [?]
# Country Year Category Population Money
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 A 1.00 0 NA 101
# 2 A 1.00 1.00 NA 100
# 3 A 2.00 0 0.482 101
# 4 A 2.00 1.00 0.482 101
# 5 A 3.00 0 0.600 101
# 6 A 3.00 1.00 0.600 101
# 7 B 1.00 0 0.494 101
# 8 B 1.00 1.00 0.494 101
# 9 B 2.00 0 0.186 100
# 10 B 2.00 1.00 0.186 100
# 11 B 3.00 0 0.827 101
# 12 B 3.00 1.00 0.827 101
# 13 C 1.00 0 0.668 100
# 14 C 1.00 1.00 0.668 101
# 15 C 2.00 0 0.794 100
# 16 C 2.00 1.00 0.794 100
# 17 C 3.00 0 0.108 100
# 18 C 3.00 1.00 0.108 100
Note: the OP's sample data doesn't have multiple rows for the same groups, hence the number of summarized rows is the same as the number of original rows.
I would like to regress the first column, the market return (as y), on the rest of the columns (as X) and create a data frame with the monthly slope coefficients. My data frame looks like this:
Date Market return AFARAK GROUP PLC AFFECTO OYJ
1/3/2007 -0.45 0.00 0.85
1/4/2007 -0.92 2.47 -0.85
1/5/2007 -1.98 3.98 -1.14
The expected output data frame of slope coefficients looks like this:
Date AFARAK GROUP PLC AFFECTO OYJ
Jan-07 1 0.5
Feb-07 2 1.5
Mar-07 2 1
Apr-07 3 2
Could someone help me in this regard?
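For illustration, this is roughly the shape of the computation I have in mind, though I haven't got it working; the data frame name returns, the date format, and the column names Date and Market.return are just placeholders for my real data:
library(dplyr)
library(tidyr)
# reshape to long format (one row per date and stock), then fit one
# regression of the market return on each stock's return per month
slopes <- returns %>%
  mutate(Month = format(as.Date(Date, "%m/%d/%Y"), "%b-%y")) %>%
  pivot_longer(-c(Date, Month, Market.return),
               names_to = "Stock", values_to = "Return") %>%
  group_by(Month, Stock) %>%
  summarise(Slope = coef(lm(Market.return ~ Return))[2], .groups = "drop") %>%
  pivot_wider(names_from = Stock, values_from = Slope)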