Extracting numbers from column then create new row - r

I have a dataset which includes multiple latitude and longitude points within the same column and also has columns with additional variables, like so:
(Image: what the data currently looks like)
What I would like to do is extract the numbers in multiples of 2 (i.e. 144.81803494458699788 and -37.80978699721590175 then 144.8183146450259926 -37.80819285880839686) into their own rows. The new rows will also duplicate the rest of the original row from which they came i.e.
(Image: what I would like the data to look like)
I'm pretty new to R, hence what might seem like a basic question to you all. Update: I've now used
new$latlongs <- str_extract_all(roadchar$X.wkt_geom, "(?>-)*[0-9]+\\.[0-9]+")
and have the numbers/latlongs extracted including the negative sign :)

You can use a loop that combines gsub and strsplit:
## The data.frame
df <- data.frame ("Polyline" = c("MultiLineString((1.1 - 1.1, 2.2 - 2.2))",
"MultiLineString((3.3 - 3.3, 4.4 - 4.4, 5.5 - 5.5))"),
t(matrix(c(LETTERS[c(1:3,24:26)]), 3,
dimnames = list(c("Char1", "Char2", "Char3")))),
stringsAsFactors = FALSE)
# Polyline Char1 Char2 Char3
# 1 MultiLineString((1.1 - 1.1, 2.2 - 2.2)) A B C
# 2 MultiLineString((3.3 - 3.3, 4.4 - 4.4, 5.5 - 5.5)) X Y Z
## Function for splitting the line
split.polyline <- function(line, df) {
    ## Removing the text and brackets
    cleaned_line <- gsub("\\)\\)", "", gsub("MultiLineString\\(\\(", "", as.character(df$Polyline[line])))
    ## Splitting the line
    split_line <- strsplit(cleaned_line, split = ", ")[[1]]
    ## Making the line into a data.frame
    df_out <- data.frame("Polyline" = split_line,
                         matrix(rep(df[line, -1], length(split_line)),
                                nrow = length(split_line), byrow = TRUE,
                                dimnames = list(c(), names(df)[-1]))
                         )
    return(df_out)
}
## You can use the function like this for the first row for example
df_out <- split.polyline(1, df)
# Polyline Char1 Char2 Char3
# 1 1.1 - 1.1 A B C
# 2 2.2 - 2.2 A B C
## Or loop through all the rows
for(line in 2:nrow(df)){
    df_out <- rbind(df_out, split.polyline(line, df))
}
# Polyline Char1 Char2 Char3
# 1 1.1 - 1.1 A B C
# 2 2.2 - 2.2 A B C
# 3 3.3 - 3.3 X Y Z
# 4 4.4 - 4.4 X Y Z
# 5 5.5 - 5.5 X Y Z
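For the longitude/latitude layout described in the question, a minimal base R sketch could look like the following. The data frame roads, its column names, and the coordinate values are hypothetical stand-ins, since the original data is only shown as an image. The sketch assumes that every number in the geometry string belongs to a longitude/latitude pair, in that order.

```r
# Hypothetical data: names and values are illustrative only
roads <- data.frame(
  wkt_geom = c("MultiLineString ((144.818 -37.809, 144.819 -37.808))",
               "MultiLineString ((145.001 -37.700, 145.002 -37.701, 145.003 -37.702))"),
  road_name = c("A", "B"),
  stringsAsFactors = FALSE
)

# Extract every signed decimal number from each geometry string
nums <- regmatches(roads$wkt_geom,
                   gregexpr("-?[0-9]+\\.[0-9]+", roads$wkt_geom))

# Numbers alternate longitude, latitude: fill a two-column matrix by row
coords <- lapply(nums, function(x) matrix(as.numeric(x), ncol = 2, byrow = TRUE))
n_pts <- vapply(coords, nrow, integer(1))

# Repeat each original row once per coordinate pair it contained
idx <- rep(seq_len(nrow(roads)), n_pts)
all_xy <- do.call(rbind, coords)
out <- data.frame(lon = all_xy[, 1],
                  lat = all_xy[, 2],
                  roads[idx, names(roads) != "wkt_geom", drop = FALSE])
rownames(out) <- NULL
out
```

Every column other than wkt_geom is carried over unchanged, so the same code should work with any number of additional variables.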


How to create a ranked and sorted index using dplyr?

I am working on an R function that creates a ranked and sorted index with a given set of starting values (in a list) and a total number of slots to fill for the index. If the list values count is < total number of slots, then sequential numbers are inserted into the gaps. How would I modify the reproducible code provided at the bottom of this post, which accurately addresses Example 1 in the below image, so that it also covers Examples 2 and 3 in the below image?
The yellow highlighting shows the index gaps that are to be sequentially filled (although I filled the gap for slot 5 in Example 1 using the dplyr::dense_rank(...) function).
Please note that Slot 1 in all cases must always = 1 or 1.1!
Also I'm open to any improvements in the below reproducible code including completely migrating it into dplyr!
Reproducible code correctly gives these Example 1 results:
Slot Value
1 1 1.1
2 2 1.2
3 3 2.1
4 4 2.2
5 5 3.0
Reproducible code:
library(dplyr)
# user inputs
Value <- c(2.1, 1.2, 1.1, 2.2)
totalSlots <- 5
# calculations:
Slot <- c(1:totalSlots) # set up left column of sequential numbers to use in DF
maxLength = max(length(Value), totalSlots) # calculate longer of Value or totalSlots
Value <- c(sort(Value), rep(NA, maxLength - length(Value))) # extend Value to maxLength with empty slots filled NA
indexDF <- data.frame(Slot,Value) # create DF
indexDF <- indexDF %>% mutate(Value = coalesce(Value,Slot)) # replace NA with corresponding index Slot
decimals <- indexDF[,"Value"] - as.integer(indexDF[,"Value"]) # extract decimals for later use
rankData <- dplyr::dense_rank(as.integer(indexDF[,"Value"])) # smooth out the ranks
rankData <- rankData + decimals # add back decimals to rankData and then Value
indexDF$Value <- rankData # replace DF column with sorted and smoothed values
print.data.frame(indexDF)
We could write the same steps as a single pipe: expand the data to 'totalSlots' rows with complete, then apply the same calculations
library(dplyr)
library(tidyr)
tibble(Value) %>%
mutate(Slot = row_number()) %>%
complete(Slot = seq_len(totalSlots)) %>%
mutate(Value = coalesce(Value[order(Value)], Slot),
Value = dense_rank(as.integer(Value)) + Value - as.integer(Value))
Output:
# A tibble: 5 × 2
Slot Value
<int> <dbl>
1 1 1.1
2 2 1.2
3 3 2.1
4 4 2.2
5 5 3
Below is a concise solution generating object indexDF, with different example scenarios for totalSlots and Value to prove it out:
# Example 1:
Value <- c(2.1, 1.2, 1.1, 2.2)
totalSlots <- 5
# Example 2:
Value <- c(2.1, 2.2)
totalSlots <- 3
# Example 3:
Value <- c(1.1, 1.2, 3.1, 3.2, 3.3, 6.1, 6.2)
totalSlots <- 10
# Example 4:
Value <- c(4.1, 4.2, 4.3)
totalSlots <- 6
indexDF <- data.frame(
Slot = c(1:totalSlots),
Value = sort(c(setdiff(1:totalSlots, floor(Value)), Value))[1:totalSlots])
indexDF
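To try the scenarios above without overwriting Value and totalSlots each time, the same expression can be wrapped in a function. This is only a sketch: the name make_index is made up for illustration, and the logic is the setdiff/sort approach from the answer above, unchanged.

```r
# make_index is a hypothetical helper name; the body repeats the answer's logic
make_index <- function(Value, totalSlots) {
  # whole numbers in 1..totalSlots that no Value already "occupies"
  gaps <- setdiff(seq_len(totalSlots), floor(Value))
  data.frame(
    Slot = seq_len(totalSlots),
    Value = sort(c(gaps, Value))[seq_len(totalSlots)]
  )
}

ex1 <- make_index(c(2.1, 1.2, 1.1, 2.2), 5)
ex2 <- make_index(c(2.1, 2.2), 3)
ex1$Value  # 1.1 1.2 2.1 2.2 3.0
ex2$Value  # 1.0 2.1 2.2
```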

Backwards rollapply with zoo object

Suppose I have a zoo object:
> df <- data.frame(col1=c(1,2,3,4), col2=c("a","b","c","d"))
> v <- zoo(df, order.by = df$col2)
> v
col1 col2
a 1 a
b 2 b
c 3 c
d 4 d
I can calculate the mean as:
> rollapply(v, 2, by.column = F, function(x) { mean(as.numeric(x[,"col1"])) })
a b c
1.5 2.5 3.5
How do I rollapply mean in DESCENDING order? (please no solutions where you just reverse the results AFTER applying the regular rollapply)
I would like my output to look like:
d c b
3.5 2.5 1.5
The oo in zoo stands for ordered observations and such objects are always ordered by the index; however, what is shown in the question is not ordered by the index so it cannot be a valid zoo object.
Also, the line starting v <- in the question is not likely what is wanted since it seems to ask for a mix of numeric and character data. Fixing that line and creating a data frame with the order shown we have:
library(zoo)
v <- read.zoo(df, index = "col2", FUN = c)
r <- rollapplyr(v, 2, mean)
fortify.zoo(r)[length(r):1, ]
giving:
Index r
3 d 3.5
2 c 2.5
1 b 1.5
Per G. Grothendieck:
rollapply(rev.zoo(v), 2, by.column = F, function(x) { mean(as.numeric(x[,"col1"])) })
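Because a zoo object is always stored in index order, another way to get a descending result is to do the rolling calculation on a plain named vector. The sketch below uses base R's embed instead of zoo, so it is a different technique from the answers above. The vector x is an assumption that mirrors col1 and col2 from the question.

```r
# Plain named vector standing in for col1 indexed by col2 (assumption)
x <- c(a = 1, b = 2, c = 3, d = 4)

rx <- rev(x)                     # order is now d, c, b, a
windows <- embed(unname(rx), 2)  # each row holds one window of two neighbours
desc_means <- rowMeans(windows)

# Label each window with the first (highest) index it covers
names(desc_means) <- names(rx)[seq_len(nrow(windows))]
desc_means
```

Here the data is reversed before the windows are built, so the means are computed in descending order and not reordered afterwards.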

Group dataframes by columns and match by n elements

So here is my issue. I have two dataframes. A simplified version of them is below.
df1
ID String
1.1 a
1.1 a
1.1 b
1.1 c
...
1.2 a
1.2 a
1.2 c
1.2 c
...
2.1 a
2.1 n
2.1 o
2.1 o
...
2.2 a
2.2 n
2.2 n
2.2 o
...
3.1 a
3.1 a
3.1 x
3.1 x
...
3.2 a
3.2 x
3.2 a
3.2 x
...
4.1 a
4.1 b
4.1 o
4.1 o
...
4.2 a
4.2 b
4.2 b
4.2 o
Imagine each ID (e.g. 1.1) has over 1000 rows. Also note that IDs sharing the same leading number (e.g. 1.1 and 1.2) are very similar, but not an exact match to one another.
df2
string2
a
b
a
c
The df2 is a test df.
I want to see which df1 ID is the closest match to df2, with one very important condition: I want to match n elements at a time, not the whole of one dataframe against the other.
My pseudo code for this:
df2-elements-to-match <- df2$string2[1:n] #only the first n elements
group df1 by ID
df1-elements-to-match <- df1$String[1:n of every ID] #only the first n elements of each ID
Output a column with score of how many matches.
Filter df1 to remove ID groups with < m score. #m here could be any number.
Filtered df1 becomes new df1.
n <- n+1
df2-elements-to-match and df1-elements-to-match both slide down to the next n elements. Overlap is optional. (ex: if first was 1:2, then 3:4 or even 2:3 and then 3:4)
Reiterate loop with updated variables
If one ID remains stop loop.
The idea here is to get a predicted match without having to match the whole test dataframe.
## minimal dfs
df1 <- data.frame(ID=c(rep(1.1, 5),
rep(1.2, 6),
rep(1.3, 3)),
str=unlist(strsplit("aabaaaabcababc", "")), stringsAsFactors=F)
df2 <- data.frame(str=c("a", "b", "a", "b"), stringsAsFactors=F)
## functions
distance <- function(df, query.df, df.col, query.df.col) {
    deviating <- df[, df.col] != query.df[, query.df.col]
    sum(deviating, na.rm=TRUE) # if too few rows, there will be NA, ignore NA
}
distances <- function(dfs, query.df, dfs.col, query.df.col) {
    sapply(dfs, function(df) distance(df, query.df, dfs.col, query.df.col))
}
orderedDistances <- function(dfs, query.df, dfs.col, query.df.col) {
    dists <- distances(dfs, query.df, dfs.col, query.df.col)
    sort(dists)
}
orderByDistance <- function(dfs, query.df, dfs.col, query.df.col, dfs.split.col) {
    dfs.split <- split(dfs, dfs[, dfs.split.col])
    dfs.split.N <- lapply(dfs.split, function(df) df[1:nrow(query.df), ])
    orderedDistances(dfs.split.N, query.df, dfs.col, query.df.col)
}
orderByDistance(df1, df2, "str", "str", "ID")
# 1.3 1.1 1.2
# 1 3 3
# 1.3 is the closest to df2!
Your problem is essentially a distance problem: minimizing the distance is the same as finding the most similar sequence.
The distance shown here compares df2 with each sub-df of df1 position by position, counting a mismatch as 1 and a match as 0 (a Hamming distance). The sum is the dissimilarity score between the two compared sequences of strings.
orderByDistance takes dfs (df1), a query df (df2), the columns that should be compared, and the column by which dfs should be split (here "ID").
It first splits dfs, then takes the first N rows of each sub-df (N = number of rows of the query df), and finally applies orderedDistances to the resulting list of sub-dfs.
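The question asks for a comparison on only the first n elements, not on the whole query. A minimal sketch of that variant follows. The function name first_n_distance and its arguments are hypothetical, and the data repeats the minimal dfs from above so that the block runs on its own.

```r
# Same minimal data as above, repeated so this block is self-contained
df1 <- data.frame(ID = c(rep(1.1, 5), rep(1.2, 6), rep(1.3, 3)),
                  str = unlist(strsplit("aabaaaabcababc", "")),
                  stringsAsFactors = FALSE)
df2 <- data.frame(str = c("a", "b", "a", "b"), stringsAsFactors = FALSE)

# Hypothetical helper: mismatch count over the first n positions of each ID
first_n_distance <- function(df, query, n, id_col = "ID", str_col = "str") {
  target <- query[[str_col]][seq_len(n)]
  groups <- split(df[[str_col]], df[[id_col]])
  dists <- vapply(groups,
                  function(s) sum(s[seq_len(n)] != target, na.rm = TRUE),
                  integer(1))
  sort(dists)
}

res <- first_n_distance(df1, df2, n = 2)
res
```

IDs whose score exceeds a chosen threshold could then be dropped before repeating the comparison on the next window, as outlined in the question's pseudocode.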

Create an arbitrary number of new columns using dplyr in R

I'm not sure if the title is worded well, but here is the situation:
I have a meta data dataset, which can have any number of rows in it, e.g.:
Control_DF <- cbind.data.frame(
Scenario = c("A","B","C")
,Variable = c("V1","V2","V3")
,Weight = c("w1","w2","w3")
)
Using the data contained in Control_DF, I want to create a new version of each Variable on my main dataset, where I multiply the variable by the weight. So if my main dataset looks like this:
Main_Data <- cbind.data.frame(
V1 = c(1,2,3,4)
,V2 = c(2,3,4,5)
,V3 = c(3,4,5,6)
,w1 = c(0.1,0.5,1,0.8)
,w2 = c(0.2,1,0.3,0.6)
,w3 = c(0.3,0.7,0.1,0.2)
)
Then, in open code, what I want to do looks like this:
New_Data <- Main_Data %>%
mutate(
weighted_V1 = V1 * w1
,weighted_V2 = V2 * w2
,weighted_V3 = V3 * w3
)
However, I need a way of not hard coding this, and such that the number of variables being referenced is arbitrary.
Can anyone help me?
In base R with lapply, Map and cbind you could do as follows:
# with Control_DF create a list with pairs of <varName,wgt>
controlVarList = lapply(Control_DF$Scenario,function(x)
as.vector(as.matrix(Control_DF[Control_DF$Scenario==x,c("Variable","Weight")] ))
)
controlVarList
#[[1]]
#[1] "V1" "w1"
#
#[[2]]
#[1] "V2" "w2"
#
#[[3]]
#[1] "V3" "w3"
# A custom function for multiplication of both columns
fn_weightedVars = function(x) {
    # x = c("V1","w1"); hence x[1] = "V1", x[2] = "w1"
    # reference these columns in Main_Data and do scaling
    wgtedCol = matrix(Main_Data[,x[1]] * Main_Data[,x[2]],ncol=1)
    #rename as required
    colnames(wgtedCol)= paste0("weighted_",x[1])
    #return var
    wgtedCol
}
#call function on each list element
scaledList = Map(fn_weightedVars ,controlVarList)
Output:
scaledDF = do.call(cbind,scaledList)
#combine datasets
New_Data = data.frame(Main_Data,scaledDF)
New_Data
# V1 V2 V3 w1 w2 w3 weighted_V1 weighted_V2 weighted_V3
#1 1 2 3 0.1 0.2 0.3 0.1 0.4 0.9
#2 2 3 4 0.5 1.0 0.7 1.0 3.0 2.8
#3 3 4 5 1.0 0.3 0.1 3.0 1.2 0.5
#4 4 5 6 0.8 0.6 0.2 3.2 3.0 1.2
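A shorter base R variant is sketched below. It relies on the fact that two data frames with the same dimensions can be multiplied element by element. The data is rebuilt here with the column names V3 and w3, which is what the expected output above implies. The as.character calls are a precaution in case the control columns are stored as factors.

```r
Control_DF <- data.frame(Scenario = c("A", "B", "C"),
                         Variable = c("V1", "V2", "V3"),
                         Weight   = c("w1", "w2", "w3"),
                         stringsAsFactors = FALSE)
Main_Data <- data.frame(V1 = c(1, 2, 3, 4),
                        V2 = c(2, 3, 4, 5),
                        V3 = c(3, 4, 5, 6),
                        w1 = c(0.1, 0.5, 1, 0.8),
                        w2 = c(0.2, 1, 0.3, 0.6),
                        w3 = c(0.3, 0.7, 0.1, 0.2))

vars <- as.character(Control_DF$Variable)
wgts <- as.character(Control_DF$Weight)

# Multiply the variable columns by the matching weight columns in one step
New_Data <- Main_Data
New_Data[paste0("weighted_", vars)] <- Main_Data[vars] * Main_Data[wgts]
New_Data
```

The pairing depends only on the row order of Control_DF, so adding a row there adds one more weighted column without further changes.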

Split arbitrary column into melted data frame

I have a data.frame with an ugly column of structured data. Each cell can hold from 1 to 40 values of interest, separated by an HTML break "<br />". Each extracted value has the form 1.1, i.e. an integer, a period, and another integer.
How can I separate and melt these columns into different rows?
I know lapply and tidyr::separate are probably the way to go, but I have not succeeded yet, so I am asking for help.
testdata is here:
testdata <- dget("http://pastebin.com/download.php?i=VS2cq2rB")
The data frame holds two columns: "id" and "moduler".
I'd like to have "id" and "value" instead. The end result should be something like this.
"id", "value"
1, 1.1
1, 1.2
1, 1.3
1, 2.4
2, 1.1
2, 1.3
2, 3.3
This is my latest take, pretty far from where I started with lapply.
origdf <- data.frame()
#names(newdf) <- c("id", 'pnummer', 'moduler')
for (i in 1:nrow(hs)) {
newdf <- data.frame()
newdf[i, 'id'] <- hs[i, 'id']
newdf[i, 'pnummer'] <- hs[i, 'pnummer']
tmp <- unlist(strsplit(as.character(hs[i,'moduler']), "<br />", fixed=T))
for (m in 3:length(tmp)+3) {
newdf[i, m] <- tmp[m]
}
origdf <- dplyr::bind_rows(newdf, origdf)
}
Here's a possible data.table approach. Basically I'm just splitting moduler by "<br />" or "<br />Installationsmontør" by id
library(data.table)
setDT(testdata)[, .(value = unlist(strsplit(as.character(moduler),
"<br />|<br />Installationsmontør"))), by = id]
# id value
# 1: 2862 1.1
# 2: 2862 1.2
# 3: 2862 1.3
# 4: 2862 1.4
# 5: 2862 1.5
# ---
# 132: 2877 3.6
# 133: 2877 4.1
# 134: 2877 4.4
# 135: 2877 4.5
# 136: 2877 4.6
Or similarly with the splitstackshape package
library(splitstackshape)
cSplit(testdata, splitCols = "moduler",
sep = "<br />|<br />Installationsmontør",
direction = "long", fixed = FALSE, stripWhite = FALSE)
I would try to use strsplit function with a simple loop:
newdata <- NULL
a <- 1
b <- 0
for (k in 1:length(testdata$moduler)) {
M <- unlist(strsplit(as.character(testdata$moduler[k]),"<br />|<br />Installationsmontør"))
b <- b + length(M)
newdata$moduler[a:b] <- M
newdata$id[a:b] <- testdata$id[k]
a <- b + 1
}
newdata <- as.data.frame(newdata)
Here is another option using unnest from tidyr. We extract the numeric part ([0-9.]+) using str_extract_all from library(stringr). The output is a list. We set the names of the list elements as the 'id' column of 'testdata' and unnest
library(tidyr)
library(stringr)
res <- unnest(setNames(lapply(str_extract_all(testdata$moduler, '[0-9.]+'),
as.numeric), testdata$id), id)
colnames(res)[2] <- 'value'
head(res)
# id value
#1 2862 1.1
#2 2862 1.2
#3 2862 1.3
#4 2862 1.4
#5 2862 1.5
#6 2862 1.6
dim(res)
#[1] 136 2
Or a base R approach would be to extract the numeric elements with regmatches/gregexpr in a list, get the length of the list element with lengths, replicate the 'id' column from 'testdata' based on that, unlist the 'lst' and create a new 'data.frame'.
lst <- lapply(regmatches(testdata$moduler, gregexpr('[0-9.]+',
testdata$moduler)), as.numeric)
res2 <- data.frame(id = testdata$id[rep(1:nrow(testdata), lengths(lst))],
value= unlist(lst))
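Since the pastebin data may not be available, here is a small self-contained sketch of the same split-and-repeat idea. The data frame toy is hypothetical and is built to reproduce the "id", "value" result sketched in the question. It assumes that every piece between the breaks is a plain number.

```r
# Hypothetical stand-in for testdata
toy <- data.frame(
  id = c(1, 2),
  moduler = c("1.1<br />1.2<br />1.3<br />2.4", "1.1<br />1.3<br />3.3"),
  stringsAsFactors = FALSE
)

# Split each cell on the literal break tag
parts <- strsplit(toy$moduler, "<br />", fixed = TRUE)

# Repeat each id once per extracted value and flatten the pieces
long <- data.frame(
  id = rep(toy$id, lengths(parts)),
  value = as.numeric(unlist(parts))
)
long
```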
