Group dataframes by columns and match by n elements - r

So here is my issue. I have two dataframes. A simplified version of them is below.
df1
ID String
1.1 a
1.1 a
1.1 b
1.1 c
...
1.2 a
1.2 a
1.2 c
1.2 c
...
2.1 a
2.1 n
2.1 o
2.1 o
...
2.2 a
2.2 n
2.2 n
2.2 o
...
3.1 a
3.1 a
3.1 x
3.1 x
...
3.2 a
3.2 x
3.2 a
3.2 x
...
4.1 a
4.1 b
4.1 o
4.1 o
...
4.2 a
4.2 b
4.2 b
4.2 o
Imagine each ID (ex: 1.1) has over 1000 rows. Also note that IDs sharing the same leading number (ex: 1.1 and 1.2) are very similar, but not an exact match to one another.
df2
string2
a
b
a
c
The df2 is a test df.
I want to see which ID in df1 is the closest match to df2, with one very important condition: I want to match n elements at a time, not one whole dataframe against the other.
My pseudo code for this:
df2-elements-to-match <- df2$string2[1:n] #only the first n elements
group df1 by ID
df1-elements-to-match <- df1$String[1:n of every ID] #only the first n elements of each ID
Output a column with score of how many matches.
Filter df1 to remove ID groups with < m score. #m here could be any number.
Filtered df1 becomes new df1.
n <- n+1
df2-elements-to-match and df1-elements-to-match both slide down to the next n elements. Overlap is optional. (ex: if first was 1:2, then 3:4 or even 2:3 and then 3:4)
Reiterate loop with updated variables
If one ID remains stop loop.
The idea here is to get a predicted match without having to match the whole test dataframe.

## minimal dfs
df1 <- data.frame(ID=c(rep(1.1, 5),
rep(1.2, 6),
rep(1.3, 3)),
str=unlist(strsplit("aabaaaabcababc", "")), stringsAsFactors=F)
df2 <- data.frame(str=c("a", "b", "a", "b"), stringsAsFactors=F)
## functions
distance <- function(df, query.df, df.col, query.df.col) {
deviating <- df[, df.col] != query.df[, query.df.col]
sum(deviating, na.rm=TRUE) # if too few rows, there will be NA, ignore NA
}
distances <- function(dfs, query.df, dfs.col, query.df.col) {
sapply(dfs, function(df) distance(df, query.df, dfs.col, query.df.col))
}
orderedDistances <- function(dfs, query.df, dfs.col, query.df.col) {
dists <- distances(dfs, query.df, dfs.col, query.df.col)
sort(dists)
}
orderByDistance <- function(dfs, query.df, dfs.col, query.df.col, dfs.split.col) {
dfs.split <- split(dfs, dfs[, dfs.split.col])
dfs.split.N <- lapply(dfs.split, function(df) df[1:nrow(query.df), ])
orderedDistances(dfs.split.N, query.df, dfs.col, query.df.col)
}
orderByDistance(df1, df2, "str", "str", "ID")
# 1.3 1.1 1.2
# 1 3 3
# 1.3 is the closest to df2!
Your problem is essentially a distance problem.
Minimizing the distance = finding the most similar sequence.
The distance I use here compares df2 with each sub-df of df1 position by position: a deviation counts as 1 and a match as 0. The sum is the dissimilarity score between the compared data frames, i.e. between the two sequences of strings.
orderByDistance takes dfs (df1), a query df (df2), the columns to compare, and the column by which dfs should be split (here "ID").
It first splits dfs, then takes the first N rows of each sub-df (N = number of rows of the query df) so that the lengths line up, and finally applies orderedDistances to the resulting list. Sub-dfs with fewer than N rows are padded with NA rows, and those positions are ignored rather than counted as deviations.
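The question also asks for a windowed version: score only n positions at a time, drop IDs that score below m, and move on to the next window. Here is a rough sketch of that idea, not a tested solution for the real data. The function name narrowDown and its arguments n and m are made up for illustration, the score counts matches (not deviations), and the windows do not overlap.

```r
# Same toy data as in the answer above
df1 <- data.frame(ID = c(rep(1.1, 5), rep(1.2, 6), rep(1.3, 3)),
                  str = unlist(strsplit("aabaaaabcababc", "")),
                  stringsAsFactors = FALSE)
df2 <- data.frame(str = c("a", "b", "a", "b"), stringsAsFactors = FALSE)

# Hypothetical helper: n = window size, m = minimum number of matches to survive
narrowDown <- function(df, query, n = 2, m = 1) {
  groups <- split(df$str, df$ID)   # one character vector per ID
  start <- 1
  while (length(groups) > 1 && start <= length(query)) {
    idx <- start:min(start + n - 1, length(query))
    # matches inside the current window; positions past the end of a group are NA
    score <- sapply(groups, function(g) sum(g[idx] == query[idx], na.rm = TRUE))
    keep <- score >= m
    if (!any(keep)) break          # never throw away every candidate
    groups <- groups[keep]
    start <- start + n             # slide to the next, non-overlapping window
  }
  names(groups)                    # IDs that are still in the race
}

remaining <- narrowDown(df1, df2$str, n = 2, m = 2)
remaining
```

With a low m, several IDs can survive all windows, so the result may contain more than one ID. In that case the remaining candidates could still be ranked with orderByDistance.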

Related

Extracting numbers from column then create new row

I have a dataset which includes multiple latitude and longitude points within the same column and also has columns with additional variables, like so:
What data currently looks like
What I would like to do is extract the numbers in multiples of 2 (i.e. 144.81803494458699788 and -37.80978699721590175 then 144.8183146450259926 -37.80819285880839686) into their own rows. The new rows will also duplicate the rest of the original row from which they came i.e.
What I would like the data to look like
I'm pretty new to R, hence what might seem like a basic question to you all. Update: I've now used
new$latlongs <- str_extract_all(roadchar$X.wkt_geom, "(?>-)*[0-9]+\\.[0-9]+")
and have the numbers/latlongs extracted including the negative sign :)
You can use a loop that combines gsub and strsplit:
## The data.frame
df <- data.frame ("Polyline" = c("MultiLineString((1.1 - 1.1, 2.2 - 2.2))",
"MultiLineString((3.3 - 3.3, 4.4 - 4.4, 5.5 - 5.5))"),
t(matrix(c(LETTERS[c(1:3,24:26)]), 3,
dimnames = list(c("Char1", "Char2", "Char3")))),
stringsAsFactors = FALSE)
# Polyline Char1 Char2 Char3
# 1 MultiLineString((1.1 - 1.1, 2.2 - 2.2)) A B C
# 2 MultiLineString((3.3 - 3.3, 4.4 - 4.4, 5.5 - 5.5)) X Y Z
## Function for splitting the line
split.polyline <- function(line, df) {
## Removing the text and brackets
cleaned_line <- gsub("\\)\\)", "", gsub("MultiLineString\\(\\(", "", as.character(df$Polyline[line])))
## Splitting the line
split_line <- strsplit(cleaned_line, split = ", ")[[1]]
## Making the line into a data.frame
df_out <- data.frame("Polyline" = split_line,
matrix(rep(df[line, -1], length(split_line)),
nrow = length(split_line), byrow = TRUE,
dimnames = list(c(), names(df)[-1]))
)
return(df_out)
}
## You can use the function like this for the first row for example
df_out <- split.polyline(1, df)
# Polyline Char1 Char2 Char3
# 1 1.1 - 1.1 A B C
# 2 2.2 - 2.2 A B C
## Or loop through all the rows
for(line in 2:nrow(df)){
df_out <- rbind(df_out, split.polyline(line, df))
}
# Polyline Char1 Char2 Char3
# 1 1.1 - 1.1 A B C
# 2 2.2 - 2.2 A B C
# 3 3.3 - 3.3 X Y Z
# 4 4.4 - 4.4 X Y Z
# 5 5.5 - 5.5 X Y Z
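If the coordinates are needed as numbers in separate columns, another option is to extract all signed decimals and fold them into pairs. This is only a sketch on made-up data: the column names wkt and road, and the assumption that the numbers always alternate longitude, latitude, are mine and not taken from the question's dataset.

```r
# Toy input; column names and values are invented for illustration
roads <- data.frame(
  wkt  = c("MultiLineString((144.81 -37.80, 144.82 -37.81))",
           "MultiLineString((145.00 -38.00, 145.01 -38.01, 145.02 -38.02))"),
  road = c("A", "B"),
  stringsAsFactors = FALSE
)

# All signed decimal numbers per row, as a list of numeric vectors
nums <- lapply(regmatches(roads$wkt, gregexpr("-?[0-9]+\\.[0-9]+", roads$wkt)),
               as.numeric)

# Each vector alternates lon, lat, so fill a two-column matrix row by row
pairs <- lapply(nums, function(v) matrix(v, ncol = 2, byrow = TRUE))
n_pairs <- sapply(pairs, nrow)

# One output row per coordinate pair; the other columns are repeated
out <- data.frame(
  lon  = unlist(lapply(pairs, function(p) p[, 1])),
  lat  = unlist(lapply(pairs, function(p) p[, 2])),
  road = rep(roads$road, n_pairs),
  stringsAsFactors = FALSE
)
out
```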

Separate Million and Billion Data from one column

I am trying the code below to separate the "M" and "B" entries, with their values, into 2 different columns.
I want output like this:
level 1 level 2
M 3.2 B 3.6
M 4 B 2.8
B 3.5
Input:
reve=c("M 3.2","B 3.6","B 2.8","B 3.5","M 4")
#class(reve)
data=data.frame(reve)
Here is what I have tried.
index=which(grepl("M ",data$reve))
data$reve=gsub("M ","",data$reve)
data$reve=gsub("B ","",data$reve)
data$reve=as.numeric(data$reve)
If you have a data frame you can do that with separate() from tidyr.
Here is an example:
library(dplyr)
library(tidyr) # separate() lives in tidyr, not dplyr
df <- tibble(coupe = c("M 2.3", "M 4.5", "B 1"))
df %>% separate(coupe, c("MorB","Quant"), " ")
OUTPUT
# MorB Quant
# <chr> <chr>
#1 M 2.3
#2 M 4.5
#3 B 1
Hope it helps!
For counting the number of "M" rows:
df %>% separate(YourColumn, c("MorB","Quant"), " ") %>%
filter(MorB == "M") %>% nrow()
Here is a base R approach.
lst <- split(reve, substr(reve, 1, 1))
df1 <- as.data.frame(lapply(lst, `length<-`, max(lengths(lst))))
df1
# B M
#1 B 3.6 M 3.2
#2 B 2.8 M 4
#3 B 3.5 <NA>
Split the vector in two by the first letter. This gives you a list with entries of unequal length. Use lapply with `length<-` to give all entries the same length, i.e. pad the shorter one with NAs. Then call as.data.frame.
If you want to change the names, you can use setNames
setNames(df1, c("level_2", "level_1"))
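If the values should end up numeric instead of strings like "B 3.6", the same split-and-pad idea can be applied after stripping the label. A small sketch using the reve vector from the question; the names label, value, by_label and wide are just illustrative.

```r
reve <- c("M 3.2", "B 3.6", "B 2.8", "B 3.5", "M 4")

label <- substr(reve, 1, 1)                     # "M" or "B"
value <- as.numeric(sub("^[MB] ", "", reve))    # numeric part only

by_label <- split(value, label)                 # list with elements B and M
width <- max(lengths(by_label))

# Pad every element with NA up to the longest one, then bind as columns
padded <- lapply(by_label, function(v) { length(v) <- width; v })
wide <- as.data.frame(padded)
wide
```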
In case I misunderstood your desired output, try
df1 <- data.frame(do.call(rbind, (strsplit(reve, " "))), stringsAsFactors = FALSE)
df1[] <- lapply(df1, type.convert, as.is = TRUE)
df1
# X1 X2
#1 M 3.2
#2 B 3.6
#3 B 2.8
#4 B 3.5
#5 M 4.0
I think options rooted in regex may also be helpful for these types of problems
reve=c("M 3.2","B 3.6","B 2.8","B 3.5","M 4")
data=data.frame(reve, stringsAsFactors = F) # handle your data as strings, not factors
# regex to extract M vals and B vals
mvals <- unlist(stringi::stri_extract_all_regex(data$reve, "M+\\s[0-9]\\.[0-9]|M+\\s[0-9]", omit_no_match = TRUE))
bvals <- unlist(stringi::stri_extract_all_regex(data$reve, "B+\\s[0-9]\\.[0-9]|B+\\s[0-9]", omit_no_match = TRUE))
# gluing things together into a single df
len <- max(length(mvals), length(bvals)) # find the length
data.frame(M = c(mvals, rep(NA, len - length(mvals))) # ensure vectors are the same size
,B = c(bvals, rep(NA, len - length(bvals)))) # ensure vectors are the same size
In case regex is unfamiliar: the first expression searches for one or more "M" followed by a whitespace character, then a digit (0 through 9), then a period, then another digit. The vertical pipe is an "or" operator, so the expression also matches "M" followed by whitespace and a single digit. This second alternative accounts for cases like "M 4". The second expression does the same thing for entries that contain "B" instead of "M".
These are quick and dirty regex statements. I'm sure cleaner formulations are possible to get the same results.
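One possibly cleaner formulation, as a sketch: make the decimal part optional with a group and let base R's grep return the matching entries. The anchors ^ and $ assume every entry is exactly a label, one space, and a number, which holds for the sample input but may not hold for real data.

```r
reve <- c("M 3.2", "B 3.6", "B 2.8", "B 3.5", "M 4")

# label, one space, digits, optionally followed by ".digits"
pattern_m <- "^M [0-9]+(\\.[0-9]+)?$"
pattern_b <- "^B [0-9]+(\\.[0-9]+)?$"

mvals <- grep(pattern_m, reve, value = TRUE)
bvals <- grep(pattern_b, reve, value = TRUE)
mvals
bvals
```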
We can count Millions or Billions as follows:
Input dataset:
reve=c("M 3.2","B 3.6","B 2.8","B 3.5","M 4")
data=data.frame(reve)
Code
library(dplyr)
library(tidyr)
data %>%
separate(reve, c("Label", "Value"),extra = "merge") %>%
group_by(Label) %>%
summarise(n = n())
Output
# A tibble: 2 x 2
Label n
<chr> <int>
1 B 3
2 M 2
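For comparison, the same counts can probably be had in base R without any packages, assuming the label is always the first character of each entry:

```r
reve <- c("M 3.2", "B 3.6", "B 2.8", "B 3.5", "M 4")

# Tabulate the first character of every entry
counts <- table(substr(reve, 1, 1))
counts
```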

Split arbitrary column into melted data frame

I have a data.frame with an ugly column of structured data. Each cell can hold from 1 to 40 values of interest, separated by an HTML break "<br />". Each extracted value has the form 1.1, i.e. an integer, a period, and another integer.
How can I separate and melt this column into different rows?
I know lapply and tidyr::separate are probably the way to go, but I have not succeeded yet, so I'm asking for help.
testdata is here:
testdata <- dget("http://pastebin.com/download.php?i=VS2cq2rB")
The data frame holds two columns: "id" and "moduler".
I'd like to have "id" and "value" instead. The end result should be something like this.
"id", "value"
1, 1.1
1, 1.2
1, 1.3
1, 2.4
2, 1.1
2, 1.3
2, 3.3
This is my latest take - pretty far from where I started with lapply.
origdf <- data.frame()
#names(newdf) <- c("id", 'pnummer', 'moduler')
for (i in 1:nrow(hs)) {
newdf <- data.frame()
newdf[i, 'id'] <- hs[i, 'id']
newdf[i, 'pnummer'] <- hs[i, 'pnummer']
tmp <- unlist(strsplit(as.character(hs[i,'moduler']), "<br />", fixed=T))
for (m in 3:length(tmp)+3) {
newdf[i, m] <- tmp[m]
}
origdf <- dplyr::bind_rows(newdf, origdf)
}
Here's a possible data.table approach. Basically I'm just splitting moduler by "<br />" or "<br />Installationsmontør" by id
library(data.table)
setDT(testdata)[, .(value = unlist(strsplit(as.character(moduler),
"<br />|<br />Installationsmontør"))), by = id]
# id value
# 1: 2862 1.1
# 2: 2862 1.2
# 3: 2862 1.3
# 4: 2862 1.4
# 5: 2862 1.5
# ---
# 132: 2877 3.6
# 133: 2877 4.1
# 134: 2877 4.4
# 135: 2877 4.5
# 136: 2877 4.6
Or similarly with the splitstackshape package
library(splitstackshape)
cSplit(testdata, splitCols = "moduler",
sep = "<br />|<br />Installationsmontør",
direction = "long", fixed = FALSE, stripWhite = FALSE)
I would try to use strsplit function with a simple loop:
newdata <- NULL
a <- 1
b <- 0
for (k in 1:length(testdata$moduler)) {
M <- unlist(strsplit(as.character(testdata$moduler[k]),"<br />|<br />Installationsmontør"))
b <- b + length(M)
newdata$moduler[a:b] <- M
newdata$id[a:b] <- testdata$id[k]
a <- b + 1
}
newdata <- as.data.frame(newdata)
Here is another option using unnest from tidyr. We extract the numeric part ([0-9.]+) using str_extract_all from library(stringr). The output is a list. We set the names of the list elements as the 'id' column of 'testdata' and unnest
library(tidyr)
library(stringr)
res <- unnest(setNames(lapply(str_extract_all(testdata$moduler, '[0-9.]+'),
as.numeric), testdata$id), id)
colnames(res)[2] <- 'value'
head(res)
# id value
#1 2862 1.1
#2 2862 1.2
#3 2862 1.3
#4 2862 1.4
#5 2862 1.5
#6 2862 1.6
dim(res)
#[1] 136 2
Or a base R approach would be to extract the numeric elements with regmatches/gregexpr in a list, get the length of the list element with lengths, replicate the 'id' column from 'testdata' based on that, unlist the 'lst' and create a new 'data.frame'.
lst <- lapply(regmatches(testdata$moduler, gregexpr('[0-9.]+',
testdata$moduler)), as.numeric)
res2 <- data.frame(id = testdata$id[rep(1:nrow(testdata), lengths(lst))],
value= unlist(lst))
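Since testdata comes from an external link, here is the same base R idea on a tiny made-up data frame that mirrors the example in the question. The object names toy, parts and long are only for illustration, and the sketch assumes the cells contain nothing but values separated by "<br />".

```r
# Made-up stand-in for testdata, following the question's example
toy <- data.frame(
  id = c(1, 2),
  moduler = c("1.1<br />1.2<br />1.3<br />2.4", "1.1<br />1.3<br />3.3"),
  stringsAsFactors = FALSE
)

# Split every cell on the literal separator
parts <- strsplit(toy$moduler, "<br />", fixed = TRUE)

# Repeat each id once per extracted value, then flatten the values
long <- data.frame(id = rep(toy$id, lengths(parts)),
                   value = as.numeric(unlist(parts)))
long
```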

In R how to make two columns an ID and get a frequency histogram for each ID

Example Dataset:
A 2 1.5
A 2 1.5
B 3 2.0
B 3 2.5
B 3 2.6
C 4 3.2
C 4 3.5
I would like to create 3 frequency histograms, one per combination of the first two columns, i.e. A2, B3, and C4. I am new to R, so any help would be greatly appreciated. Should I flatten out the data so it's like this:
A 2 1.5 1.5
B 3 2.0 2.5 2.6 etc...
Thank you
Here's an alternative solution, that is based on by-function, which is just a wrapper for the tapply that Jilber suggested. You might find the 'ex'-variable useful:
set.seed(1)
dat <- data.frame(First = LETTERS[1:3], Second = 1:2, Num = rnorm(60))
# Extract third column per each unique combination of columns 'First' and 'Second'
ex <- by(dat, INDICES =
# Create names like A.1, A.2, ...
apply(dat[,c("First","Second")], MARGIN=1, FUN=function(z) paste(z, collapse=".")),
# Extract third column per each unique combination
FUN=function(x) x[,3])
# Draw histograms
par(mfrow=c(3,2))
for(i in 1:length(ex)){
hist(ex[[i]], main=names(ex)[i], xlim=extendrange(unlist(ex)))
}
Assuming your dataset is called x and the columns are a,b,c respectively I think this command should do the trick
library(lattice)
histogram(~c|a+b,x)
Notice that this requires you to have the package lattice installed
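A base R sketch on the example data from the question, assuming the three columns are called grp, num and val (the question does not name them). It builds the combined ID with paste0, so the data does not need to be flattened first.

```r
# Example data from the question; column names are assumptions
dat <- data.frame(
  grp = c("A", "A", "B", "B", "B", "C", "C"),
  num = c(2, 2, 3, 3, 3, 4, 4),
  val = c(1.5, 1.5, 2.0, 2.5, 2.6, 3.2, 3.5),
  stringsAsFactors = FALSE
)

# Combine the first two columns into one ID: "A2", "B3", "C4"
dat$id <- paste0(dat$grp, dat$num)

# One numeric vector per ID
vals_by_id <- split(dat$val, dat$id)

# plot = FALSE only computes the bins; drop it to draw the histograms
hists <- lapply(vals_by_id, hist, plot = FALSE)
names(hists)
```

To actually draw them, loop over vals_by_id and call hist(vals_by_id[[i]], main = names(vals_by_id)[i]), as in the by-based answer above.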

How can this code be compacted?

Can the following code be made more "R like"?
Given data.frame inDF:
V1 V2 V3 V4
1 a ha 1;2;3 A
2 c hb 4 B
3 d hc 5;6 C
4 f hd 7 D
In inDF I want to:
1. find all rows whose "V3" column holds multiple values separated by ";",
2. replicate each of those rows as many times as it has individual values in the "V3" column,
3. give each replicated row only one of the initial values in its "V3" column.
In short, the output data.frame (= outDF) will look like:
V1 V2 V3 V4
1 a ha 1 A
1 a ha 2 A
1 a ha 3 A
2 c hb 4 B
3 d hc 5 C
3 d hc 6 C
4 f hd 7 D
So, if from inDF I want to get to outDF, I would write the following code:
library(stringr) # provides str_count() and str_split() used below
#load inDF from csv file
inDF <- read.csv(file='example.csv', header=FALSE, sep=",", fill=TRUE)
#search in inDF, on the V3 column, all the cells with multiple values
rowlist <- grep(";", inDF[,3])
# create empty data.frame and add headers from "headDF"
xDF <- data.frame(matrix(0, nrow=0, ncol=4))
colnames(xDF)=colnames(inDF)
#take every row from the inDF data.frame which has multiple values in col3 and break it in several rows with only one value
for(i in rowlist[])
{
#count the number of individual values in one cell
value_nr <- str_count(inDF[i,3], ";"); value_nr <- value_nr+1
# replicate each row a number of times equal with its value number, and transform it to character
extracted_inDF <- inDF[rep(i, times=value_nr[]),]
extracted_inDF <- data.frame(lapply(extracted_inDF, as.character), stringsAsFactors=FALSE)
# split the values in V3 cell in individual values, place them in a list
value_ls <- str_split(inDF[i, 3], ";")
#initialize f, to use it later to increment both row number and element in the list of values
f = 1
# replace the multiple values with individual values
for(j in extracted_inDF[,3])
{
extracted_inDF[f,3] <- value_ls[[1]][as.integer(f)]
f <- f+1
}
#put all the "demultiplied" rows in xDF
xDF <- merge(extracted_inDF[], xDF[], all=TRUE)
}
# delete the rows with multiple values from the inDF
inDF <- inDF[-rowlist[],]
#create outDF
outDF <- merge(inDF, xDF, all=TRUE)
Could you please
I'm not sure that I'm one to speak about whether you are using R in the "right" or "wrong" way... I mostly just use it to answer questions on Stack Overflow. :-)
However, there are many ways in which your code could be improved. For starters, YES, you should try to become familiar with the predefined functions. They will often be much more efficient, and will make your code much more transparent to other users of the same language. Despite your concise description of what you wanted to achieve, and my knowing an answer virtually right away, I found your code daunting to look through.
I would break up your problem into two main pieces: (1) splitting up the data and (2) recombining it with your original dataset.
For part 1: You obviously know some of the functions you need--or at least the main one you need: strsplit. If you use strsplit, you'll see that it returns a list, but you need a simple vector. How do you get there? Look for unlist. The first part of your problem is now solved.
For part 2: You first need to determine how many times you need to replicate each row of your original dataset. For this, you drill through your list (for example, with l/s/v-apply) and count each item's length. I picked sapply since I knew it would create a vector that I could use with rep.
Then, if you've played with data.frames enough, particularly with extracting data, you would have come to realize that mydf[c(1, 1, 1, 2), ] will result in a data.frame where the first row is repeated two additional times. Knowing this, we can use the length calculation we just made to "expand" our original data.frame.
Finally, with that expanded data.frame, we just need to replace the relevant column with the unlisted values.
Here is the above in action. I've named your dataset "mydf":
V3 <- strsplit(mydf$V3, ";", fixed=TRUE)
sapply(V3, length) ## How many times to repeat each row?
# [1] 3 1 2 1
## ^^ Use that along with `[` to "expand" your data.frame
mydf2 <- mydf[rep(seq_along(V3), sapply(V3, length)), ]
mydf2$V3 <- unlist(V3)
mydf2
# V1 V2 V3 V4
# 1 a ha 1 A
# 1.1 a ha 2 A
# 1.2 a ha 3 A
# 2 c hb 4 B
# 3 d hc 5 C
# 3.1 d hc 6 C
# 4 f hd 7 D
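To try these steps without the CSV file, the data can be typed in by hand. This is a self-contained sketch of the same approach; it uses lengths() in place of sapply(V3, length) and assumes V3 is a character column (not a factor).

```r
# The question's inDF, entered manually
mydf <- data.frame(V1 = c("a", "c", "d", "f"),
                   V2 = c("ha", "hb", "hc", "hd"),
                   V3 = c("1;2;3", "4", "5;6", "7"),
                   V4 = c("A", "B", "C", "D"),
                   stringsAsFactors = FALSE)

# Part 1: split V3 into its individual values (a list, one element per row)
pieces <- strsplit(mydf$V3, ";", fixed = TRUE)

# Part 2: repeat each row once per value, then overwrite V3
expanded <- mydf[rep(seq_len(nrow(mydf)), lengths(pieces)), ]
expanded$V3 <- unlist(pieces)
rownames(expanded) <- NULL   # replace 1, 1.1, 1.2, ... with 1:7
expanded
```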
To share some more options...
The "data.table" package can actually be pretty useful for something like this.
library(data.table)
DT <- data.table(mydf)
DT2 <- DT[, list(new = unlist(strsplit(as.character(V3), ";", fixed = TRUE))), by = V1]
merge(DT, DT2, by = "V1")
Alternatively, concat.split.multiple from my "splitstackshape" package pretty much does it in one step, but if you want your exact output, you'll need to drop the NA values and reorder the rows.
library(splitstackshape)
df2 <- concat.split.multiple(mydf, split.cols="V3", seps=";", direction="long")
df2 <- df2[complete.cases(df2), ] ## Optional, perhaps
df2[order(df2$V1), ] ## Optional, perhaps
In this case, you can use the split-apply-combine paradigm for reshaping the data.
You want to split inDF by its rows, since you want to operate on each row separately. I've used the split function here to split it up by row:
spl = split(inDF, 1:nrow(inDF))
spl is a list that contains a 1-row data frame for each row in inDF.
Next, you'll want to apply a function to transform the split up data into the final format you need. Here, I'll use the lapply function to transform the 1-row data frames, using strsplit to break up the variable V3 into its appropriate parts:
transformed = lapply(spl, function(x) {
data.frame(V1=x$V1, V2=x$V2, V3=strsplit(x$V3, ";")[[1]], V4=x$V4)
})
transformed is now a list where the first element is a 3-row data frame, the third element is a 2-row data frame, and the second and fourth are 1-row data frames.
The last step is to combine this list together into outDF, using do.call with the rbind function. That has the same effect of calling rbind with all of the elements of the transformed list.
outDF = do.call(rbind, transformed)
This yields the desired final data frame:
outDF
# V1 V2 V3 V4
# 1.1 a ha 1 A
# 1.2 a ha 2 A
# 1.3 a ha 3 A
# 2 c hb 4 B
# 3.1 d hc 5 C
# 3.2 d hc 6 C
# 4 f hd 7 D
