How to find which range of values a given value falls in - R

I have the data below:
Y z
100-800 a
150-600 b
200-300 c
400-600 d
4000-12000 e
Based on a given value of x (e.g. x = 100), it should find which ranges of Y contain that value and return the corresponding values of Y and z. If the given value of x does not fall inside any of the ranges of Y, it should find the nearest range and return the corresponding values of Y and z. This is what I have tried so far:
DT[, list(OK = 1 %in% seq(Y, Y)), by = Z]
Any help would be really appreciated.
For x = 110 the output should be
Y z
100-800 a
For x=200
Y z
100-800 a
150-600 b
200-300 c
For x=12500
Y z
4000-12000 e

We can write a helper function that uses tidyr::separate to split the range column into numeric bounds. If no range contains the value, we compare the value with the lowest and highest bounds in the dataframe and return the nearest row accordingly.
subset_fun <- function(df, val) {
  # split "low-high" ranges into numeric columns
  df1 <- tidyr::separate(df, Y, c("low", "high"), sep = "-", convert = TRUE)
  inds <- with(df1, val >= low & val <= high)
  if (any(inds))
    df[inds, ]
  else if (min(df1$low) > val)
    df[which.min(df1$low), ]   # value below every range: take the lowest range
  else
    df[which.max(df1$high), ]  # otherwise take the range with the largest upper bound
}
subset_fun(df, 100)
# Y z
#1 100-800 a
subset_fun(df, 200)
# Y z
#1 100-800 a
#2 150-600 b
#3 200-300 c
subset_fun(df, 12500)
# Y z
#5 4000-12000 e
subset_fun(df, 0)
# Y z
#1 100-800 a
data
df <- structure(list(Y = structure(1:5, .Label = c("100-800", "150-600",
"200-300", "400-600", "4000-12000"), class = "factor"),
z = structure(1:5, .Label = c("a", "b", "c", "d", "e"), class = "factor")),
class = "data.frame", row.names = c(NA, -5L))

We can create a helper and use this to subset:
library(dplyr)
library(tidyr)
find_number <- function(x, high, low){
  x >= low & x < high # might be able to use between()
}
df %>%
  separate(Y, c("Low", "High"), convert = TRUE) -> new_df
new_df[new_df %>%
         mutate(Logi = find_number(200, high = High, low = Low)) %>%
         pull("Logi"), ]
Low High z
1 100 800 a
2 150 600 b
3 200 300 c
EDIT: An attempt to automate this process. Using NSE might be a much better option, since it would eliminate the need to have the exact same column names as in this answer; in other words, redefine the function with data and column-name arguments (a hedged sketch of that idea follows the data below). For now:
find_number <- function(x){
  new_df[new_df %>%
           mutate(Logi = x >= Low & x < High,
                  isMax = ifelse(High == max(High) & x > High,
                                 TRUE, Logi)) %>%
           pull("isMax"), ]
}
find_number(12500)
Low High z
5 4000 12000 e
Data:
new_df<-structure(list(Low = c(100, 150, 200, 400, 4000), High = c(800,
600, 300, 600, 12000), z = c("a", "b", "c", "d", "e")), class = "data.frame", row.names = c(NA,
-5L))
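As promised in the EDIT above, here is a hedged sketch (not part of the original answer) of an NSE version that takes the data and the column names as arguments, assuming the curly-curly operator from recent rlang/dplyr versions; find_number_nse is an illustrative name:
library(dplyr)
find_number_nse <- function(data, x, low, high) {
  data %>%
    filter((x >= {{ low }} & x < {{ high }}) |
             ({{ high }} == max({{ high }}) & x > {{ high }}))
}
find_number_nse(new_df, 200, Low, High)   # returns the three matching rows
find_number_nse(new_df, 12500, Low, High) # falls back to the 4000-12000 row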

Related

Replacing cells in a data frame

Let's say I have such a data frame with columns X, Y, Z, T and over 100 rows:
order X Y Z T
i a k b n
j c a b n
As you can see, if the i-th X and the j-th Y have the same value (i.e. a), then the i-th Z and the j-th Z have the same value (b), and the i-th T and the j-th T have the same value (n).
What I want to do is: if the i-th X and the j-th Y have the same value (i.e. a), then the i-th Z stays b and the i-th T stays n, while the j-th Z becomes n and the j-th T becomes b:
order X Y Z T
i a k b n
j c a n b
I have tried doing this in R using if/else and a for loop, but I couldn't get it to work.
Can anyone help me do that in R?
It can be done with case_when:
library(dplyr)
df1 <- df1 %>%
  mutate(Z = case_when(lag(X) == Y ~ T, TRUE ~ Z),
         T = case_when(lag(X) == Y ~ lag(Z), TRUE ~ T))
-output
df1
# order X Y Z T
#1 i a k b n
#2 j c a n b
data
df1 <- structure(list(order = c("i", "j"), X = c("a", "c"), Y = c("k",
"a"), Z = c("b", "b"), T = c("n", "n")), class = "data.frame", row.names = c(NA,
-2L))
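The lag() approach assumes the matching X and Y values sit on adjacent rows. As a hedged, more general base R sketch (not from the original answer), assuming each Y matches at most one X:
# i[j] is the row whose X equals row j's Y (NA if there is no match)
i <- match(df1$Y, df1$X)
hit <- !is.na(i) & i != seq_len(nrow(df1))   # ignore self-matches
old_Z <- df1$Z
df1$Z[hit] <- df1$T[hit]        # j-th Z takes the j-th T ("n")
df1$T[hit] <- old_Z[i[hit]]     # j-th T takes the matched row's original Z ("b")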

Add a column to a dataframe in R based on a greater/less than condition on the values in existing columns

I have a dataframe and want to create a new column z populated with a value of "tw" or "ok" for each record: if x > y, z = "ok"; if x < y, z = "tw".
x y
a 1 2
b 2 3
c 5 1
result
x y z
a 1 2 tw
b 2 3 tw
c 5 1 ok
Maybe you can try ifelse() like below:
df <- within(df, z <- ifelse(x > y, "ok", "tw"))
If the output for the case x == y is not defined, you may want to add the following line:
df$z[df$x == df$y] <- NA
Alternatively, you can do it on the dataframe directly:
# creation of dataframe
df = data.frame("x" = c(1, 2, 5), "y" = c(2, 3, 1))
# column creation of z
df$z[(df$x > df$y)] <- "ok"
df$z[(df$x < df$y)] <- "tw"
We can also do this directly:
df$z <- c("tw", "ok")[(df$x > df$y) + 1]
df
# x y z
#a 1 2 tw
#b 2 3 tw
#c 5 1 ok
Not exactly clear what you want to do when x == y (the above assigns "tw").
We can also use case_when from dplyr to assign values based on multiple conditions.
library(dplyr)
df %>%
  mutate(z = case_when(x > y ~ "ok",
                       x < y ~ "tw",
                       TRUE ~ NA_character_))
data
df <- structure(list(x = c(1L, 2L, 5L), y = c(2L, 3L, 1L), z = c("tw",
"tw", "ok")), row.names = c("a", "b", "c"), class = "data.frame")

Add column to list of data frames and do incremental addition / Loop through df for simple addition

I have this data frame (actually, a list of those dfs):
ALL <- data.frame(x = 1:3, y = c("a", "b", "c"))
I want to add a column which adds up a certain value until the end of the data frame, like this:
0 + 0.05 = 0.05, 0.05 + 0.05 = 0.1, 0.1 + 0.05 = 0.15 and so on.
So, in my example the result would be
ALL <- data.frame(x = 1:3, y = c("a", "b", "c"), z=(0,0.05,0.1)
I guess the way to go would be using cbind with lapply (assuming ALL is a list of dfs):
ALL <- lapply(ALL, function(x) cbind(x, z = ???))
But my brain simply doesn't come up with the right formula for z.
Your help is much appreciated.
Greetings
trumfnator
We can use transform to add the new column, with seq generating the sequence:
lapply(list_df, function(df)
  transform(df, z = seq(0, by = 0.05, length.out = nrow(df))))
#[[1]]
# x y z
#1 1 a 0.00
#2 2 b 0.05
#3 3 c 0.10
#[[2]]
# x y z
#1 1 a 0.00
#2 2 b 0.05
#3 3 c 0.10
#4 4 d 0.15
In the tidyverse we can do the same with
library(dplyr)
library(purrr)
map(list_df, ~.x %>% mutate(z = seq(0, by = 0.05, length.out = n())))
data
ALL <- data.frame(x = 1:3, y = c("a", "b", "c"))
ALL1 <- data.frame(x = 1:4, y = c("a", "b", "c", "d"))
list_df <- list(ALL, ALL1)
Maybe this is what you want
ALL$z <- 0.05*(0:(nrow(ALL)-1))
We can do an assignment and then return the dataset
lapply(list_df, function(x) {x$z <- seq(0, by= 0.05, length.out = nrow(x));x})
data
ALL <- data.frame(x = 1:3, y = c("a", "b", "c"))
ALL1 <- data.frame(x = 1:4, y = c("a", "b", "c", "d"))
list_df <- list(ALL, ALL1)
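Since the question describes a running sum, a hedged alternative sketch (not from the original answers) builds z with cumsum() across the list:
lapply(list_df, function(df) {
  # running sum starting at 0 and increasing by 0.05 per row
  df$z <- cumsum(c(0, rep(0.05, nrow(df) - 1)))
  df
})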

How to conditionally summarize on other entries in the group - R

In my dataset I have Cartesian coordinates of different items over time, identified by an EventID, event_type, ID_number, x position, y position, identity_type, broad_category, and frame_id. For each EventID, event_type, and frame_id combination I need to go through each ID_number and work out which other ID_number with a different broad_category has the minimum distance from the current row. I would like to avoid for loops because the dataset is several million lines long.
I tried formulating this as a group_by and summarize call using dplyr, but couldn't quite wrap my head around how to compare the current row's x and y against all the other x and y values and then take the conditional minimum.
two_dim_euclid = function(x1, x2, y1, y2){
  a <- sqrt((x1 - x2)^2 + (y1 - y2)^2)
  return(a)
}
# Example Data
df <- data.frame(stringsAsFactors = FALSE,
EventID = c(1003, 1003, 1003, 1003),
event_type = c(893, 893, 893, 893),
ID_number = c(80427, 2346, 24954, 27765),
x = c(86.07, 72.4, 43.08, 80.13),
y = c(35.58, 26.43, 34.8, 34.79),
identity_type = c("A", "C", "B", "B"),
broad_category = c("set1", "set1", "set2", "set2"),
frame_id = c(1, 1, 1, 1))
df
# EventID event_type ID_number x y identity_type broad_category frame_id
#1 1003 893 80427 86.07 35.58 A set1 1
#2 1003 893 2346 72.40 26.43 C set1 1
#3 1003 893 24954 43.08 34.80 B set2 1
#4 1003 893 27765 80.13 34.79 B set2 1
The expected result for row 1 would be 5.992303: it looks at all the entries not belonging to set1 with the same EventID, event_type, and frame_id and returns the minimum Euclidean distance among them.
Also, I want to do this for every entry with identity_type A. But identity_type and broad_category are not always tied together; A can belong to either set1 or set2.
Here's a base way that relies on dist().
res <- as.matrix(dist(cbind(df$x, df$y)))
res[res == 0] <- Inf
apply(res, 1, min)
1 2 3 4
5.992303 11.386066 30.491299 5.992303
# or potentially more performant
res[cbind(seq_len(nrow(res)), max.col(-res))]
[1] 5.992303 11.386066 30.491299 5.992303
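The dist() matrix above ignores the broad_category restriction. A hedged extension (not part of the original answer) masks same-category pairs with Inf so each row's minimum only considers the other set, still within this single EventID/event_type/frame_id group; min_cross_dist is an illustrative column name:
res <- as.matrix(dist(cbind(df$x, df$y)))
same_cat <- outer(df$broad_category, df$broad_category, "==")
res[same_cat] <- Inf              # also removes self-distances
df$min_cross_dist <- apply(res, 1, min)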
A potential way with data.table would be to do a cartesian join but it would need a lot of memory and would likely be slower:
library(data.table)
dt <- as.data.table(df)
dt[, ID := .I]
CJ.dt = function(X, Y) {
  stopifnot(is.data.table(X), is.data.table(Y))
  k = NULL
  X = X[, c(k = 1, .SD)]
  setkey(X, k)
  Y = Y[, c(k = 1, .SD)]
  setkey(Y, NULL)
  X[Y, allow.cartesian = TRUE][, k := NULL][]
}
CJ.dt(dt, dt)[ID != i.ID, min(sqrt((x - i.x)^2 + (y-i.y)^2)), by = i.ID]
i.ID V1
1: 1 5.992303
2: 2 11.386066
3: 3 30.491299
4: 4 5.992303
For data.table cartesian join, see here:
R: data.table cross-join not working
While I'm not sure about your exact criteria, it seems you must use loops in some way if you want to iterate. I'm sure others can provide Rcpp solutions that are very quick; in the meantime, here is one possible way with base R.
# In the future, please provide the code to create your example data
dat <- structure(list(EventID = c(1003L, 1003L, 1003L, 1003L),
event_type = c(893L, 893L, 893L, 893L),
ID_number = c(80427L, 2346L, 24954L, 27765L),
x = c(86.07, 72.4, 43.08, 80.13),
y = c(35.58, 26.43, 34.8, 34.79),
identity_type = structure(c(1L, 3L, 2L, 2L),
.Label = c("A", "B", "C"),
class = "factor"),
broad_category = structure(c(1L, 1L, 2L, 2L),
.Label = c("set1", "set2"),
class = "factor"),
frame_id = c(1L, 1L, 1L, 1L)),
.Names = c("EventID", "event_type", "ID_number","x", "y",
"identity_type", "broad_category", "frame_id"),
class = "data.frame", row.names = c("1", "2", "3", "4"))
# Define your criteria here
dat$uniqueID <- paste0(dat$EventID, dat$event_type, dat$frame_id, dat$broad_category)
# modified your function to take two 2-D vectors instead, since that's simpler to pass in
two_dim_euclid = function(a, b) return(sqrt((a[1] - b[1])^2 + (a[2] - b[2])^2))
n <- nrow(dat)
vec <- numeric(n)
for(i in 1:n){
  # minimum distance from row i to every row outside its uniqueID group
  vec[i] <- min(apply(dat[dat$uniqueID != dat$uniqueID[i], c("x", "y")], 1,
                      function(r) two_dim_euclid(unlist(dat[i, c("x", "y")]), r)),
                na.rm = TRUE)
  if(i %% 10000 == 0) cat(i, "completed...\n") # progress check since > 1 mil rows
}
dat$result <- vec

How to melt a dataframe with measured variables and associated standard deviations in two columns

I have a premade dataframe, in which each measured variable features an adjacent column with the standard deviations:
df <-
structure(list(Factor = structure(1:3, .Label = c("K", "L", "M"
), class = "factor"), A = c(52127802.82, 63410325.61, 76455661.87
), SD = c(9124562.98, 21975533.21, 9864019.36), B = c(63752980.62,
68303447.17, 73250794.15), SD.1 = c(34800000, 22600000, 6090000
), C = c(103512032.04, 65074190.8, 92686982.97), SD.2 = c(23900000,
20800000, 38300000), D = c(100006463.22, NA, 37406494.3)), .Names = c("Factor",
"A", "SD", "B", "SD.1", "C", "SD.2", "D"), class = "data.frame", row.names = c(NA,
-3L))
(SD.1, SD.2 were auto-renamed; originally they were all called "SD").
I want to melt this into long format by Factor:
library(reshape)
df.melt <- melt(df, id.vars = "Factor")
However, I would like the melted object to keep the SD values attached to their associated variables:
Factor Variable value value.sd
K A 52127802.82 9124562
so that I can call geom_errorbar(ymin = value - value.sd, ymax = value + value.sd) in ggplot(df.melt, aes(Factor, value)) + geom_bar(stat = "identity") + facet_wrap(~variable).
Is that possible, even though the SD columns end up with different names?
First, I would drop df$D from the dataset via df$D <- NULL, since it has no matching SD column and looks like an error:
# Factor A SD B SD.1 C SD.2
# 1 K 52127803 9124563 63752981 34800000 103512032 23900000
# 2 L 63410326 21975533 68303447 22600000 65074191 20800000
# 3 M 76455662 9864019 73250794 6090000 92686983 38300000
Then, I would rename the columns (this looks more complicated than it is, and I encourage feedback/suggestions that would make this part more straightforward). The reason I am renaming the columns is so that I can use separate and spread from the tidyr package:
names(df)[-1][seq(2, length(names(df)) - 1, 2)] <-
  paste0(names(df)[-1][seq(1, length(names(df)) - 1, 2)], "-SD")
names(df)[-1][seq(1, length(names(df)) - 1, 2)] <-
  paste0(names(df)[-1][seq(1, length(names(df)) - 1, 2)], "-measure")
df
# Factor A-measure A-SD B-measure B-SD C-measure C-SD
# 1 K 52127803 9124563 63752981 34800000 103512032 23900000
# 2 L 63410326 21975533 68303447 22600000 65074191 20800000
# 3 M 76455662 9864019 73250794 6090000 92686983 38300000
This enables me to make df_clean (using dplyr for the pipe and tidyr for the reshaping verbs):
library(dplyr)
library(tidyr)
df_clean <- df %>%
  gather(measure, value, -Factor) %>%
  separate(measure, c("measure_letter", "temp_var")) %>%
  spread(temp_var, value)
df_clean
# Factor measure_letter measure SD
# 1 K A 52127803 9124563
# 2 K B 63752981 34800000
# 3 K C 103512032 23900000
# 4 L A 63410326 21975533
# 5 L B 68303447 22600000
# 6 L C 65074191 20800000
# 7 M A 76455662 9864019
# 8 M B 73250794 6090000
# 9 M C 92686983 38300000
Now that our dataset is clean/tidy, we can plot accordingly:
library(ggplot2)
ggplot(df_clean, aes(x = Factor, y = measure, fill = Factor)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = measure - SD, ymax = measure + SD)) +
  facet_wrap(~ measure_letter)
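As a side note, a hedged sketch (not part of the original answer) of the same reshape with the newer tidyr::pivot_longer, assuming the renamed "A-measure"/"A-SD" style columns from above; df_clean2 is an illustrative name:
library(tidyr)
# ".value" tells pivot_longer that the part after "-" names the output column,
# so "A-measure" and "A-SD" collapse into one row with measure and SD columns.
df_clean2 <- pivot_longer(df, -Factor,
                          names_to = c("measure_letter", ".value"),
                          names_sep = "-")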

Resources