How to group repeating sequences of numbers using R - r

The simplest description of what I am trying to do is that I have a column in a data.frame like 1,2,3,..., n, 1,2,3,...n,.... and I want to group the first 1...n as 1, the second 1...n as 2, and so on.
The full context is: I am using the R spcosa package to do equal-area stratification and composite sampling on parcels of land. I start with a shapefile from a GIS that contains a number of polygons (land parcels). The end result I want is a GIS file containing each of the strata and sample locations, with each stratum and sample location labeled by land parcel, stratum and sample id. So far I can do all of this except one bit, which is identifying the stratum that each sample belongs to and including it in the sample label. The sample label needs to look like "parcel#-strata#-composite#" (where # is the number). In practice I don't need this actual label, just the separate attributes in the GIS file.
The basic workflow is as follows.
For each individual polygon, I use spcosa::stratify to divide it into a number of equal-area strata, like so:
strata.CSEA <- stratify(poly[i,], nStrata = n, nTry = 1, equalArea = TRUE, nGridCells = x)
Note that spcosa::stratify generates a CompactStratificationEqualArea object. I coerce this to a SpatialPixelsDataFrame, then use rasterToPolygons to be able to output it as a GIS file.
I then generate the sample locations as follows:
samples.SPRC <- spsample(strata.CSEA, n = n, type = "composite")
spcosa::spsample creates a SamplingPatternRandomComposite object. I coerce this to a SpatialPointsDataFrame
samples.SPDF <- as(samples.SPRC, "SpatialPointsDataFrame")
and add two columns to the @data slot
samples.SPDF@data$Strata <- "this is the bit I can't do yet"
samples.SPDF@data$CEA <- poly[i,]$name
I can then write samples.SPDF as a GIS file (i.e. writeOGR) with all the wanted attributes.
As above, the part I can't sort out is how the sample ids relate to the strata ids. The sample points are a vector like 1,2,3...n, 1,2,3...n,.... How do I extract which sample goes with which stratum? As the actual strata numbers are arbitrary, I can just group (as per my simple question above), but ideally I would like to use the numbering of the actual strata so everything lines up.
To give any contributors access to a hands on example I copy below the code from the spcosa documentation slightly modified to generate the correct objects.
# Note: the example below requires the 'rgdal'-package.
# You may consider the 'maptools'-package as an alternative
if (require(rgdal)) {
# read a vector representation of the `Farmsum' field
shpFarmsum <- readOGR(
dsn = system.file("maps", package = "spcosa"),
layer = "farmsum"
)
# stratify `Farmsum' into 50 strata
# NB: increase argument 'nTry' to get better results
set.seed(314)
myStratification <- stratify(shpFarmsum, nStrata = 50, nTry = 1, equalArea = TRUE)
# sample two sampling units per stratum
mySamplingPattern <- spsample(myStratification, n = 2, type = "composite")
# plot the resulting sampling pattern on
# top of the stratification
plot(myStratification, mySamplingPattern)
}

Maybe the order() function can help you:
n <- 10
dat <- data.frame(col1 = rep(1:n, 2), col2 = rnorm(2*n))
head(dat)
dat[order(dat$col1), ]
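For the simplified version of the question (label the first 1...n run as 1, the second as 2, and so on), a running counter is another option. The sketch below is only an illustration on a made-up vector: it assumes every run restarts at 1, and the names x and group are hypothetical. It does not recover the actual stratum numbering used by spcosa.

```r
# Made-up stand-in for the sample id column: three runs of 1..n
n <- 4
x <- rep(seq_len(n), times = 3)

# Every time the sequence restarts at 1, the counter increases by one
group <- cumsum(x == 1)

data.frame(sample = x, group = group)
```

If the runs all have the same known length n, rep(seq_len(length(x) / n), each = n) gives the same grouping without relying on the restart value.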

I did not get where the "ID" (1,2,3...n) is to be found, so let's assume you have your SpatialPolygonsDataFrame called shpFarmsum with an attribute data column "ID". You can access this column via shpFarmsum$ID. Therefore, if you want to create individual subsets for each ID, this is one way to go:
for (i in unique(shpFarmsum$ID)) {
  # keep only the polygons that carry the current ID
  tempSubset <- shpFarmsum[shpFarmsum$ID == i,]
  writeOGR(tempSubset, ".", paste0("subset_", i), driver = "ESRI Shapefile")
}
I added the line writeOGR(... so all subsets are written to your working directory. However, you can change this line or add further analysis inside the for loop.
How it works
unique(shpFarmsum$ID) extracts all occurring IDs (comparable to your 1,2,3...n).
In each repetition of the for loop, another one of these IDs is used to create a subset of the whole SpatialPolygonsDataFrame, which you can use for further analysis.
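The same loop can be tried without any GIS packages. The sketch below uses a plain data.frame as a hypothetical stand-in for the attribute table of shpFarmsum; the column names and values are made up, and the subsets are collected in a list instead of being written to disk.

```r
# Hypothetical attribute table standing in for the spatial object
parcels <- data.frame(
  ID   = c(1, 2, 3, 1, 2, 3),
  area = c(10, 20, 30, 11, 21, 31)
)

# One subset per distinct ID, stored under the ID as its name
subsets <- list()
for (i in unique(parcels$ID)) {
  subsets[[as.character(i)]] <- parcels[parcels$ID == i, ]
}

names(subsets)
```

For a plain data.frame, split(parcels, parcels$ID) builds the same list in one call.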

Related

Convert List of lists to data frame where each list within the list are the results from using Sapply + decompose on multiple columns

This is my first project using a coded environment, so I may not phrase things accurately. I am building an ARIMA forecast.
I want to forecast for multiple sectors (business areas) at a time. Using help forums I have managed to write code that takes my time series data as input, fits the model, and sends the outputs to CSV. I am happy with this.
My problem is that I would also like to capture the results from the decomposition analysis at sector level. Currently, the solution I found elsewhere writes a CSV in an unusable format, where everything is spread across rows and the different lists are split between one row and the next.
Thanks in advance!
My current solution (probably not super efficient but, like I say, cobbled together based on forum tips):
Clean data down to TS
library(readxl)  # read_excel()
library(tibble)  # as_tibble()
NLDemand <- read_excel("TS Demand 2018 + Non London no lockdown.xlsx")
NLDemand <- as_tibble(NLDemand)
NLDemand <- na.omit(NLDemand)
NLDemand <- subset(NLDemand, select = -c(Month,Year))
NLDemand <- subset(NLDemand, select = -c(YearMonth))
##this gets the data to a point where each column has a header of business sector and the time series data underneath it, with no categorical columns left, e.g.:
Sector 1a, sector1b, sector...
500,450,300
450,500,350
...,...,...
Season capture for all sectors
tsData<-sapply(NLDemand, FUN = ts, simplify = FALSE,USE.NAMES = TRUE,start=c(2018,1),frequency=12)
tsData
timeseriescomponents <- sapply(tsData,FUN=decompose,simplify = FALSE, USE.NAMES = TRUE)
timeseriescomponents
This produces a list of lists where each sublist holds the decomposed elements of one sector's time series.
##Convert all season captures to the same length
TSC <- list(timeseriescomponents[1:41])
n.obs <- sapply(TSC, length)
seq.max <- seq_len(max(n.obs))
mat <- t(sapply(TSC, "[", i = seq.max ))
##Export to CSV
write.csv(mat, "Non london 2018 + S-T componants.csv", row.names=FALSE)
***What I want as output is a table that shows each component as a column
Desired output format
Current output(sample)
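One possible way to get a table with one column per component is to turn each decompose() result into its own data frame and stack them. This is only a sketch on simulated data: the demand data frame and its two sector columns are made up to stand in for NLDemand, and the column names of the output are my own choice.

```r
# Simulated stand-in for NLDemand: two hypothetical sectors, 36 months
set.seed(1)
demand <- data.frame(
  sector1a = 500 + 50 * sin(2 * pi * (1:36) / 12) + rnorm(36),
  sector1b = 450 + 30 * cos(2 * pi * (1:36) / 12) + rnorm(36)
)

# One monthly ts per sector, then one decomposition per ts
ts_data <- lapply(demand, ts, start = c(2018, 1), frequency = 12)
components <- lapply(ts_data, decompose)

# Build one data frame per sector and stack them into a long table
component_table <- do.call(rbind, lapply(names(components), function(sector) {
  d <- components[[sector]]
  data.frame(
    sector   = sector,
    time     = as.numeric(time(d$x)),
    observed = as.numeric(d$x),
    trend    = as.numeric(d$trend),
    seasonal = as.numeric(d$seasonal),
    random   = as.numeric(d$random)
  )
}))

head(component_table)
```

The resulting table can be passed to write.csv() directly. The trend and random columns contain NA at both ends of each series, because the moving average used by decompose() is undefined there.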

Extracting from the data frame produced using GageRR/GageRRDesign in R

How do I extract the "VarCompContrib" column in the data frame produced using the gageRR function in R?
This is for a GageRR analysis of a measurement system. I'm trying to make a very user friendly program where other people can just enter the information required, like number of operators, parts, and measurements, as well as the measurements themselves, and output the correct analysis. I'm gonna use an if-statement later on to do the "analysis" portion, but I am having trouble actually managing the data frame produced with gageRR.
library(MASS)
library(Rsolnp)
library(qualityTools)
design = gageRRDesign(Operators=3, Parts=10, Measurements=2, randomize=FALSE)
response(design) = c(23,22,22,22,22,25,23,22,23,22,20,22,22,22,24,25,27,28,
23,24,23,24,24,22,22,22,24,23,22,24,20,20,25,24,22,24,21,20,21,22,21,22,21,
21,24,27,25,27,23,22,25,23,23,22,22,23,25,21,24,23)
gdo=gageRR(design)
plot(gdo)
I am looking to get a 7 number column vector under VarCompContrib
For starters, you can look at the structure of gdo with str(gdo). From there, we see that Varcomp is a slot, so we can access it with gdo@Varcomp and just convert it to a data.frame:
library(qualityTools)
design <- gageRRDesign(Operators = 3, Parts = 10, Measurements = 2, randomize = FALSE)
response(design) <- c(
23,22,22,22,22,25,23,22,23,22,20,22,22,22,24,25,27,28,23,24,23,24,24,22,22,22,24,23,22,24,
20,20,25,24,22,24,21,20,21,22,21,22,21,21,24,27,25,27,23,22,25,23,23,22,22,23,25,21,24,23
)
gdo <- gageRR(design)
data.frame(gdo@Varcomp)
#   totalRR repeatability reproducibility         a a_b     bTob totalVar
# 1 1.66441      1.209028       0.4553819 0.4553819   0 1.781211 3.445621
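To go from these variance components to a 7-number contribution vector, one option is to divide each component by the total variance. The sketch below assumes that VarCompContrib is each component's share of totalVar (the usual Gage R&R definition); I have not checked this against the package source. The numbers are copied from the output above so the snippet runs without qualityTools.

```r
# Variance components copied from the printed output above
varcomp <- c(
  totalRR         = 1.66441,
  repeatability   = 1.209028,
  reproducibility = 0.4553819,
  a               = 0.4553819,
  a_b             = 0,
  bTob            = 1.781211,
  totalVar        = 3.445621
)

# Assumed definition: share of the total variance per component
contrib <- varcomp / varcomp[["totalVar"]]

round(contrib, 4)
```

With the live object, unlist(gdo@Varcomp) should give the same starting vector, assuming the slot holds one number per component as the printed data frame suggests.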

Using value-labels in R with sjlabelled package

Recently I have switched from STATA to R.
In STATA, you have something called value label. Using the command encode for example allows you to turn a string variable into a numeric, with a string label attached to each number. Since string variables contain names (which repeat themselves most of the time), using value labels allows you to save a lot of space when dealing with large dataset.
Unfortunately, I did not manage to find a similar command in R. The only package I have found that could attach labels to my values vector is sjlabelled. It does the attachment, but when I try to merge the labelled numeric vector into another data frame, the labels seem to "fall off".
Example: Start with a string variable.
paragraph <- "Melanija Knavs was born in Novo Mesto, and grew up in Sevnica, in the Yugoslav republic of Slovenia. She worked as a fashion model through agencies in Milan and Paris, later moving to New York City in 1996. Her modeling career was associated with Irene Marie Models and Trump Model Management"
install.packages("sjlabelled")
library(sjlabelled)
sentences <- strsplit(paragraph, " ")
sentences <- unlist(sentences, use.names = FALSE)
# Now we have a vector of string values.
sentrnces_df <- as.data.frame(sentences)
sentences <- unique(sentrnces_df$sentences)
group_sentences <- c(1:length(sentences))
sentences <- as.data.frame(sentences)
group_sentences <- as.data.frame(group_sentences)
z <- cbind(sentences,group_sentences)
z$group_sentences <- set_labels(z$group_sentences, labels = (z$sentences))
sentrnces_df <- merge(sentrnces_df, z, by = c('sentences'))
get_labels(z$group_sentences) # the labels I was attaching using set labels
get_labels(sentrnces_df$group_sentences) # the output is just “NULL”
Thanks!
P.S. Sorry about the inelegant code, as I said before, I'm pretty new in R.
source: https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/
...
Around June of 2007, R introduced hashing of CHARSXP elements in the
underlying C code thanks to Seth Falcon. What this meant was that
effectively, character strings were hashed to an integer
representation and stored in a global table in R. Anytime a given
string was needed in R, it could be referenced by its underlying
integer. This effectively put in place, globally, the factor encoding
behavior of strings from before. Once this was implemented, there was
little to be gained from an efficiency standpoint by encoding
character variables as factor. Of course, you still needed to use
‘factors’ for the modeling functions.
...
I adjusted your initial test data a little bit. I was confused by so many strings and am unsure whether they are necessary for this issue. Let me know, if I missed a point. Here is my adjustment and the answer:
#####################################
# initial problem rephrased
#####################################
# create test data
id = seq(1:20)
variable1 = sample(30:35, 20, replace=TRUE)
variable2 = sample(36:40, 20, replace=TRUE)
df1 <- data.frame(id, variable1)
df2 <- data.frame(id, variable2)
# set arbitrary labels
df1$variable1 <- set_labels(df1$variable1, labels = c("few" = 1, "lots" = 5))
# show labels in this frame
get_labels(df1)
# include associated values
get_labels(df1, values = "as.prefix")
# merge df1 and df2
df_merge <- merge(df1, df2, by = c('id'))
# labels lost after merge
get_labels(df_merge, values = "as.prefix")
#####################################
# solution with dplyr
#####################################
library(dplyr)
df_merge2 <- left_join(x = df1, y = df2, by = "id")
get_labels(df_merge2, values = "as.prefix")
Solution attributed to:
Merging and keeping variable labels in R
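As the quoted passage hints, base R's factor is probably the closest built-in counterpart to Stata's encode: integer codes with a string label per code. The sketch below is an illustration with made-up data (the city names, ids and visits column are hypothetical) showing that the labels of a factor column survive a base merge().

```r
# Made-up string variable with repeated values
cities <- c("Milan", "Paris", "Milan", "New York", "Paris")

# factor() stores integer codes plus one label (level) per code
city   <- factor(cities)
codes  <- as.integer(city)   # the underlying integer codes
labels <- levels(city)       # the label attached to each code

# The factor column keeps its levels through merge()
left   <- data.frame(id = 1:5, city = city)
right  <- data.frame(id = 1:5, visits = c(3, 1, 4, 1, 5))
merged <- merge(left, right, by = "id")

levels(merged$city)
```

Unlike sjlabelled value labels, factor levels are always tied to consecutive codes 1, 2, 3, ..., so this does not cover labelling arbitrary numeric values.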

Creating SpatialLinesDataFrame from SpatialLines object and basic df

Using leaflet, I'm trying to plot some lines and set their color based on a 'speed' variable. My data start at an encoded polyline level (i.e. a series of lat/long points, encoded as an alphanumeric string) with a single speed value for each EPL.
I'm able to decode the polylines to get series of lat/long points (thanks to Max, here), and I'm able to create segments from those series of points and format them as a SpatialLines object (thanks to Kyle Walker, here).
My problem: I can plot the lines properly using leaflet, but I can't join the SpatialLines object to the base data to create a SpatialLinesDataFrame, and so I can't code the line color based on the speed var. I suspect the issue is that the IDs I'm assigning SL segments aren't matching to those present in the base df.
The objects I've tried to join, with SpatialLinesDataFrame():
"sl_object", a SpatialLines object with ~140 observations, one for each segment; I'm using Kyle's code, linked above, with one key change - instead of creating an arbitrary iterative ID value for each segment, I'm pulling the associated ID from my base data. (Or at least I'm trying to.) So, I've replaced:
id <- paste0("line", as.character(p))
with
lguy <- data.frame(paths[[p]][1])
id <- unique(lguy[,1])
"speed_object", a df with ~140 observations of a single speed var and row.names set to the same id var that I thought I created in the SL object above. (The number of observations will never exceed but may be smaller than the number of segments in the SL object.)
My joining code:
splndf <- SpatialLinesDataFrame(sl = sl_object, data = speed_object)
And the result:
row.names of data and Lines IDs do not match
Thanks, all. I'm posting this in part because I've seen some similar questions - including some referring specifically to changing the ID output of Kyle's great tool - and haven't been able to find a good answer.
EDIT: Including data samples.
From sl_obj, a single segment:
print(sl_obj)
Slot "ID":
[1] "4763655"
[[151]]
An object of class "Lines"
Slot "Lines":
[[1]]
An object of class "Line"
Slot "coords":
lon lat
1955 -74.05228 40.60397
1956 -74.05021 40.60465
1957 -74.04182 40.60737
1958 -74.03997 40.60795
1959 -74.03919 40.60821
And the corresponding record from speed_obj:
row.names speed
... ...
4763657 44.74
4763655 34.8 # this one matches the ID above
4616250 57.79
... ...
To get rid of this error message, either make the row.names of data and Lines IDs match by preparing sl_object and/or speed_object, or, in case you are certain that they should be matched in the order they appear, use
splndf <- SpatialLinesDataFrame(sl = sl_object, data = speed_object, match.ID = FALSE)
This is documented in ?SpatialLinesDataFrame.
All right, I figured it out. The error came from my speed_obj not being the same length as my sl_obj, as mentioned here: "data = object of class data.frame; the number of rows in data should equal the number of Lines elements in sl".
Resolution: used a quick loop to pull out all of the unique lines IDs, then performed a left join against that list of uniques to create an exhaustive speed_obj (with NAs, which seem to be OK).
ids <- data.frame()
for (i in (1:length(sl_obj))) {
id <- data.frame(sl_obj@lines[[i]]@ID)
ids <- rbind(ids, id)
}
colnames(ids)[1] <- "linkId"
speed_full <- join(ids, speed_obj)
speed_full_short <- data.frame(speed_full[,c(-1)])  # drop the id column from the joined table
row.names(speed_full_short) <- speed_full$linkId
splndf <- SpatialLinesDataFrame(sl_obj, data = speed_full_short, match.ID = T)
Works fine now!
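The alignment step can also be done with match() instead of a loop and a join. The sketch below uses only base R and made-up IDs (the fourth ID is hypothetical and has no speed record); line_ids stands in for the IDs pulled from the SpatialLines object.

```r
# Hypothetical Lines IDs, in the order they appear in the SpatialLines object
line_ids <- c("4763655", "4763657", "4616250", "4700001")

# Speed table keyed by row.names, shorter than the list of lines
speed_obj <- data.frame(
  speed     = c(44.74, 34.8, 57.79),
  row.names = c("4763657", "4763655", "4616250")
)

# One row per line, in line order; lines without a speed get NA
speed_full <- data.frame(
  speed     = speed_obj[match(line_ids, row.names(speed_obj)), "speed"],
  row.names = line_ids
)

speed_full
```

With the real objects, line_ids could be collected with sapply(sl_obj@lines, function(l) l@ID), and speed_full would then be passed as the data argument of SpatialLinesDataFrame().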
I may have deciphered the issue.
When I am pulling in my spatial lines data and I check the class it reads as
"Spatial Lines Data Frame" even though I know it's a simple linear shapefile, I'm using readOGR to bring the data in and I believe this is where the conversion is occurring. With that in mind the speed assignment is relatively easy.
sl_object$speed <- speed_object[ match( sl_object$ID , row.names( speed_object ) ) , "speed" ]
This should do the trick, as I'm willing to bet your class(sl_object) is "Spatial Lines Data Frame".
EDIT: I had received the same error as OP, driving me to check class()
I am under the impression that the error that was populated for you is because you were trying to coerce a data frame into a data frame and R wasn't a fan of that.

R ncdf package - put.var.ncdf requiring incorrect number of dimensions

I am organizing weather data into netCDF files in R. Everything goes fine until I try to populate the netcdf variables with data, because it is asking me to specify only one dimension for two-dimensional variables.
library(ncdf)
These are the dimension tags for the variables. Each variable uses the Threshold dimension and one of the other two dimensions.
th <- dim.def.ncdf("Threshold", "level", c(5,6,7,8,9,10,50,75,100))
rt <- dim.def.ncdf("RainMinimum", "cm", c(5, 10, 25))
wt <- dim.def.ncdf("WindMinimum", "m/s", c(18, 30, 50))
The variables are created in a loop, and there are a lot of them, so for the sake of easy understanding, in my example I'll only populate the list of variables with one variable.
vars <- list()
v1 <- var.def.ncdf("ARMM_rain", "percent", list(th, rt), -1, prec="double")
vars[[length(vars)+1]] <- v1
ncdata <- create.ncdf("composite.nc", vars)
I use another loop to extract data from different data files into a 9x3 data frame named subframe while iterating through the variables of the netcdf file with varindex. For the sake of reproducing, I'll give a quick initialization for these values.
varindex <- 1
subframe <- data.frame(matrix(nrow=9, ncol=3, rep(.01, 27)))
The desired outcome from there is to populate each ncdf variable with the contents of subframe. The code to do so is:
for(x in 1:9) {
for(y in 1:3) {
value <- ifelse(is.na(subframe[x,y]), -1, subframe[x,y])
put.var.ncdf(ncdata, varindex, value, start=c(x,y), count=1)
}
}
The error message is:
Error in put.var.ncdf(ncdata, varindex, value, start = c(x, y), count = 1) :
'start' should specify 1 dims but actually specifies 2
tl;dr: I have defined two-dimensional variables using ncdf in R, I am trying to write data to them, but I am getting an error message because R believes they are single-dimensional variables instead.
Anyone know how to fix this error?
