Matching points and calculating total distance/accuracy in R

I have a list, list.response, with 309 data frames. Each data frame has 10 rows and two columns. The columns are X and Y coordinates and represent "clicks" that a survey respondent made on a picture.
Furthermore, I have another data frame, df.true, with 10 XY coordinates. These represent the coordinates of the objects that the respondent tried to click on in the survey.
GOAL: For each respondent (i.e., each data frame in list.response), I want to calculate how accurate they were when trying to click on the objects. In other words: what is the distance between the coordinates of their 10 clicks and the coordinates of the 10 objects in df.true?
My problem is that the coordinates of their clicks and the coordinates of the objects are not in the same order. For instance, respondent A might have clicked on objects from left to right, whereas respondent B might have clicked on objects from right to left, which screws up the order of clicks and objects. Therefore, I need to match each respondent's clicks with the nearest object. The criteria for matching are:
The spatial distance between a click and an object should be as small as possible.
One click can only be matched with one object and vice versa (i.e., if there is a click-object match, these should not be used in any other matches, even if it would be useful in terms of shortest distance).
Finally, I want to calculate the total distance between all the matched points (i.e., sum the distances over all matched pairs). This will be my measure of the respondent's overall accuracy in clicking on the objects.
I have looked at several solutions to somewhat similar problems (see Working with spatial data: How to find the nearest neighbour of points without replacement? and https://gis.stackexchange.com/questions/297153/excluding-point-from-nearest-neighbor-search-once-its-been-matched-using-r); however, I haven't been able to make them work in my case. Disclaimer: I'm new to R and programming.
I hope someone is able to help me?
DATA FOR REPRODUCIBLE EXAMPLE:
Sample of 20 df's from the lists with clicks:
list.response <- list(structure(list(X = c(536, 160, 467, 552, 476, 242, 355,
414, 556, 0), Y = c(91, 181, 128, 84, 52, 379, 434, 528, 551,
0)), row.names = c(NA, -10L), class = "data.frame"), structure(list(
X = c(536, 542, 455, 148, 70, 239, 369, 416, 553, 0), Y = c(91,
94, 110, 185, 98, 387, 427, 509, 554, 0)), row.names = c(NA,
-10L), class = "data.frame"), structure(list(X = c(536, 160,
232, 374, 425, 561, 461, 544, 473, 0), Y = c(91, 193, 380, 426,
513, 559, 105, 97, 37, 0)), row.names = c(NA, -10L), class = "data.frame"),
structure(list(X = c(536, 156, 240, 375, 455, 476, 549, 414,
547, 0), Y = c(91, 194, 389, 425, 116, 37, 87, 494, 553,
0)), row.names = c(NA, -10L), class = "data.frame"), structure(list(
X = c(536, 70, 455, 543, 482, 241, 368, 418, 551, 0),
Y = c(91, 99, 107, 93, 47, 385, 427, 511, 552, 0)), row.names = c(NA,
-10L), class = "data.frame"), structure(list(X = c(536, 480,
458, 81, 158, 231, 393, 409, 558, 0), Y = c(91, 35, 91, 114,
175, 385, 423, 508, 562, 0)), row.names = c(NA, -10L), class = "data.frame"),
structure(list(X = c(536, 67, 492, 460, 542, 240, 364, 407,
554, 0), Y = c(91, 98, 48, 108, 98, 391, 428, 507, 553, 0
)), row.names = c(NA, -10L), class = "data.frame"), structure(list(
X = c(536, 156, 240, 371, 409, 563, 449, 480, 547, 0),
Y = c(91, 194, 387, 414, 510, 549, 110, 44, 96, 0)), row.names = c(NA,
-10L), class = "data.frame"), structure(list(X = c(536, 485,
462, 419, 556, 371, 240, 156, 71, 0), Y = c(91, 50, 110,
499, 556, 423, 380, 183, 96, 0)), row.names = c(NA, -10L), class = "data.frame"),
structure(list(X = c(536, 423, 362, 76, 156, 243, 551, 480,
455, 0), Y = c(91, 505, 434, 103, 187, 386, 547, 50, 114,
0)), row.names = c(NA, -10L), class = "data.frame"), structure(list(
X = c(536, 155, 245, 359, 414, 552, 456, 535, 483, 0),
Y = c(91, 185, 391, 423, 508, 544, 119, 92, 48, 0)), row.names = c(NA,
-10L), class = "data.frame"), structure(list(X = c(536, 419,
366, 242, 155, 76, 451, 538, 480, 0), Y = c(91, 510, 425,
393, 190, 103, 107, 96, 53, 0)), row.names = c(NA, -10L), class = "data.frame"),
structure(list(X = c(536, 412, 369, 243, 153, 76, 458, 481,
543, 0), Y = c(91, 512, 425, 386, 187, 100, 114, 48, 96,
0)), row.names = c(NA, -10L), class = "data.frame"), structure(list(
X = c(536, 483, 457, 151, 73, 241, 368, 416, 552, 0),
Y = c(91, 45, 108, 186, 99, 386, 426, 507, 556, 0)), row.names = c(NA,
-10L), class = "data.frame"), structure(list(X = c(536, 151,
483, 455, 544, 239, 368, 418, 547, 0), Y = c(91, 182, 43,
104, 96, 388, 426, 508, 554, 0)), row.names = c(NA, -10L), class = "data.frame"),
structure(list(X = c(536, 418, 368, 238, 154, 73, 454, 482,
543, 0), Y = c(91, 510, 430, 387, 184, 100, 110, 48, 93,
0)), row.names = c(NA, -10L), class = "data.frame"), structure(list(
X = c(536, 481, 455, 149, 70, 240, 369, 417, 555, 0),
Y = c(91, 46, 109, 184, 99, 386, 427, 509, 555, 0)), row.names = c(NA,
-10L), class = "data.frame"), structure(list(X = c(536, 456,
541, 148, 71, 244, 370, 418, 555, 0), Y = c(91, 110, 96,
186, 88, 389, 427, 511, 553, 0)), row.names = c(NA, -10L), class = "data.frame"),
structure(list(X = c(536, 454, 240, 151, 71, 541, 366, 416,
551, 0), Y = c(91, 108, 389, 183, 99, 92, 428, 510, 552,
0)), row.names = c(NA, -10L), class = "data.frame"), structure(list(
X = c(536, 147, 476, 499, 553, 244, 385, 417, 557, 0),
Y = c(91, 185, 110, 38, 87, 397, 433, 506, 552, 0)), row.names = c(NA,
-10L), class = "data.frame"))
And the df.true coordinates:
df.true <- structure(list(X = c(71, 151, 240, 370, 415, 552, 542, 456, 482,
0), Y = c(99, 186, 387, 429, 509, 553, 91, 108, 45, 0)), row.names = c(NA,
-10L), class = "data.frame")

I came up with a solution. First, I converted all dataframes to matrices. I then used the function from here: https://gis.stackexchange.com/questions/297153/excluding-point-from-nearest-neighbor-search-once-its-been-matched-using-r:
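pairup <- function(list1, list2){
  keep = 1:nrow(list2)   # original row numbers of the rows of list2 not yet matched
  used = c()             # matched row of list2 for each row of list1, in order
  for(i in 1:nrow(list1)){
    # position (among the remaining rows) of the nearest neighbour of point i
    nearest = FNN::get.knnx(list2, list1[i,,drop=FALSE], 1)$nn.index[1,1]
    used = c(used, keep[nearest])
    # drop the matched row so it cannot be matched again
    keep = keep[-nearest]
    list2 = list2[-nearest,,drop=FALSE]
  }
  used
}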
And then I ran the for loop:
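library(raster)  # provides pointDistance(); pairup() calls FNN::get.knnx()
# Convert the object coordinates to a matrix once
true.mat <- as.matrix(df.true)
# Pre-allocate one total distance per respondent
pm <- numeric(length(list.response))
# Match each respondent's clicks to the objects and sum the distances
for (i in seq_along(list.response)) {
  clicks <- as.matrix(list.response[[i]])
  matched <- pairup(clicks, true.mat)
  pm[i] <- sum(pointDistance(clicks, true.mat[matched, ], lonlat = FALSE))
}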
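For comparison, here is a minimal base-R sketch of the same idea without FNN or raster. It is not the function from the linked answer: the helper names match_clicks and total_distance are made up for illustration, and the rule is slightly different (it repeatedly takes the closest remaining click-object pair from the full distance matrix instead of walking through the clicks in row order). Like pairup, it is a greedy heuristic and is not guaranteed to give the smallest possible total distance; it also assumes there are at least as many objects as clicks.

```r
# Greedy one-to-one matching on the full distance matrix (base R only).
# match_clicks() and total_distance() are hypothetical helper names.
match_clicks <- function(clicks, targets) {
  clicks  <- as.matrix(clicks)
  targets <- as.matrix(targets)
  n <- nrow(clicks)
  # d[i, j] = Euclidean distance between click i and object j
  d <- sqrt(outer(clicks[, 1], targets[, 1], "-")^2 +
            outer(clicks[, 2], targets[, 2], "-")^2)
  assigned <- integer(n)
  for (k in seq_len(n)) {
    idx <- which(d == min(d), arr.ind = TRUE)[1, ]  # closest remaining pair
    assigned[idx[1]] <- idx[2]
    d[idx[1], ] <- Inf  # this click is used up
    d[, idx[2]] <- Inf  # this object is used up
  }
  assigned
}

# Sum of distances between each click and the object it was matched to
total_distance <- function(clicks, targets) {
  m <- match_clicks(clicks, targets)
  clicks  <- as.matrix(clicks)
  targets <- as.matrix(targets)
  sum(sqrt(rowSums((clicks - targets[m, , drop = FALSE])^2)))
}

# Tiny made-up example: two clicks, two objects, all on the x-axis
clicks  <- data.frame(X = c(10, 0), Y = c(0, 0))
targets <- data.frame(X = c(0, 9),  Y = c(0, 0))
match_clicks(clicks, targets)    # 2 1
total_distance(clicks, targets)  # 1
```

With the data from the question, something like sapply(list.response, total_distance, targets = df.true) should give one total per respondent. If the true minimum total distance matters, this is the assignment problem, for which the Hungarian algorithm is the standard exact method.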

Related

Histogram and density plots with multiple groups

I have a dataset consisting of 4 variables: CR, EN, LC and VU.
The first few values of my dataset:
CR = c(2, 9, 10, 14, 24, 27, 29, 30, 34, 43, 50, 74, 86, 105, 140, 155, 200, …)
EN = c(24, 52, 86, 110, 144, 154, 206, 242, 300, 302, 366, 403, 422, 427, 427, 434, 448, …)
LC = c(447, 476, 543, 580, 647, 685, 745, 763, 819, 821, 863, 904, 908, 926, 934, 951, 968, …)
VU = c(75, 96, 97, 217, 297, 498, 511, 551, 560, 564, 570, 575, 609, 673, 681, 700, 755,...)
I want to plot these variables as grouped histograms in R, showing the normal distribution and the density, similar to the plot below...
Could you please help me?
Here are the distributions, a clear-cut use of geom_density.
But first, to address "grouping", we need to pivot/reshape the data so that ggplot2 can automatically handle grouping. This will result in a column with a character (or factor) for each of the "CR", "EN", "LC", or "VU", and another column with the particular value. When pivoting, there is typically one or more columns that are preserved (an id, an x-value, a time/date, or something similar), but we don't have any data that would suggest something to preserve.
longdat <- tidyr::pivot_longer(dat, everything())
longdat
# # A tibble: 68 × 2
# name value
# <chr> <dbl>
# 1 CR 2
# 2 EN 24
# 3 LC 447
# 4 VU 75
# 5 CR 9
# 6 EN 52
# 7 LC 476
# 8 VU 96
# 9 CR 10
# 10 EN 86
# # … with 58 more rows
# # ℹ Use `print(n = ...)` to see more rows
library(ggplot2)
ggplot(longdat, aes(x = value, group = name, fill = name)) +
  geom_density(alpha = 0.2)
tidyr::pivot_longer works; one can also use melt from either reshape2 or data.table:
longdat <- reshape2::melt(dat, c())
## names are 'variable' and 'value' instead of 'name' and 'value'
Data
dat <- structure(list(CR = c(2, 9, 10, 14, 24, 27, 29, 30, 34, 43, 50, 74, 86, 105, 140, 155, 200), EN = c(24, 52, 86, 110, 144, 154, 206, 242, 300, 302, 366, 403, 422, 427, 427, 434, 448), LC = c(447, 476, 543, 580, 647, 685, 745, 763, 819, 821, 863, 904, 908, 926, 934, 951, 968), VU = c(75, 96, 97, 217, 297, 498, 511, 551, 560, 564, 570, 575, 609, 673, 681, 700, 755)), class = "data.frame", row.names = c(NA, -17L))
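If neither tidyr nor reshape2 is available, the reshaping step can also be sketched in base R with stack(). This is only an illustration using two of the four columns from the data above; the renaming to value/name is just to mirror the pivot_longer output, and the row order differs (all CR rows come first, then all EN rows), which does not matter for grouping.

```r
# Two of the columns from the question's data, to keep the example short
dat <- data.frame(
  CR = c(2, 9, 10, 14, 24, 27, 29, 30, 34, 43, 50, 74, 86, 105, 140, 155, 200),
  EN = c(24, 52, 86, 110, 144, 154, 206, 242, 300, 302, 366, 403, 422, 427,
         427, 434, 448)
)

# stack() returns the columns 'values' and 'ind'; rename them to match
# the pivot_longer() output used above
longdat <- stack(dat)
names(longdat) <- c("value", "name")

# One kernel density estimate per group, without ggplot2
dens <- lapply(split(longdat$value, longdat$name), density)
names(dens)  # "CR" "EN"
```

Each element of dens can then be drawn with plot() or added to an existing plot with lines().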

Apply a mutate over columns in R

I have some missing data that I am trying to impute to the mean of each column. My code,
apply(train_new, 2, function(x)
mutate(
ifelse(is.na(x) | x < 0, mean(x), x)
)
)
is meant to impute all 17 columns to the mean of each column in one fell swoop, but this returns Error during wrapup: no applicable method for 'mutate_' applied to an object of class "c('double', 'numeric')", and leads me to a debug screen. I'm sure this is just a syntactical issue, but I'm at a loss as to where it is.
Sample data:
structure(list(INDEX = c(1, 2, 3, 4, 5, 6), TARGET_WINS = c(39,
70, 86, 70, 82, 75), TEAM_BATTING_H = c(1445, 1339, 1377, 1387,
1297, 1279), TEAM_BATTING_2B = c(194, 219, 232, 209, 186, 200
), TEAM_BATTING_3B = c(39, 22, 35, 38, 27, 36), TEAM_BATTING_HR = c(13,
190, 137, 96, 102, 92), TEAM_BATTING_BB = c(457.7607, 685, 602,
451, 472, 443), TEAM_BATTING_SO = c(842, 1075, 917, 922, 920,
973), TEAM_BASERUN_SB = c(97.288, 37, 46, 43, 49, 107), TEAM_BASERUN_CS = c(NA,
28, 27, 30, 39, 59), TEAM_PITCHING_H = c(NA, 1347, 1377, 1396,
1297, 1279), TEAM_PITCHING_HR = c(84, 191, 137, 97, 102, 92),
TEAM_PITCHING_BB = c(530.9595, 689, 602, 454, 472, 443),
TEAM_PITCHING_SO = c(737.105, 1082, 917, 928, 920, 973),
TEAM_FIELDING_E = c(NA, 193, 175, 164, 138, 123), TEAM_FIELDING_DP = c(146.234708045,
155, 153, 156, 168, 149), TEAM_BATTING_1B = c(1199, 908,
973, 1044, 982, 951)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
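mutate() works on a data frame, but apply() hands each column to your function as a plain numeric vector, which is why the call fails. You could try:
library(dplyr)
# across() needs dplyr >= 1.0.0; older versions used mutate_all(funs(...))
train_new %>%
  mutate(across(everything(), ~ ifelse(is.na(.x) | .x < 0, mean(.x, na.rm = TRUE), .x)))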
Here is one option with na.aggregate (from zoo)
library(zoo)
na.aggregate(replace(train_new, train_new < 0, NA))
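The same idea can be sketched in base R with lapply(), which keeps the result a data frame. Note one assumption that differs from the dplyr answer above: here the mean is computed only from the values that are kept (non-missing and non-negative), so negative entries do not pull the replacement value down. The helper name impute_mean and the toy data are made up for illustration.

```r
# Replace NA and negative values with the mean of the remaining values.
# impute_mean() is a hypothetical helper, not part of any package.
impute_mean <- function(x) {
  bad <- is.na(x) | x < 0   # NA entries evaluate to TRUE here
  x[bad] <- mean(x[!bad])
  x
}

# Small made-up data frame
df <- data.frame(a = c(1, NA, 3), b = c(-5, 10, 20))

# df[] <- keeps the data.frame structure while replacing every column
df[] <- lapply(df, impute_mean)
df$a  # 1 2 3
df$b  # 15 10 20
```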

which() function in filter() with dplyr

I am trying to filter a data set then set the outliers to the mean. Sample data frame:
structure(list(INDEX = c(1, 2, 3, 4, 5, 6), TARGET_WINS = c(39,
70, 86, 70, 82, 75), TEAM_BATTING_H = c(1445, 1339, 1377, 1387,
1297, 1279), TEAM_BATTING_2B = c(194, 219, 232, 209, 186, 200
), TEAM_BATTING_3B = c(39, 22, 35, 38, 27, 36), TEAM_BATTING_HR = c(13,
190, 137, 96, 102, 92), TEAM_BATTING_BB = c(143, 685, 602, 451,
472, 443), TEAM_BATTING_SO = c(842, 1075, 917, 922, 920, 973),
TEAM_BASERUN_SB = c(NA, 37, 46, 43, 49, 107), TEAM_BASERUN_CS = c(NA,
28, 27, 30, 39, 59), TEAM_BATTING_HBP = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), TEAM_PITCHING_H = c(9364,
1347, 1377, 1396, 1297, 1279), TEAM_PITCHING_HR = c(84, 191,
137, 97, 102, 92), TEAM_PITCHING_BB = c(927, 689, 602, 454,
472, 443), TEAM_PITCHING_SO = c(5456, 1082, 917, 928, 920,
973), TEAM_FIELDING_E = c(1011, 193, 175, 164, 138, 123),
TEAM_FIELDING_DP = c(NA, 155, 153, 156, 168, 149)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Using dplyr, I filter the outliers, then attempt to mutate the TEAM_FIELDING_E column based on the corrected (non-outlier) mean:
train %>%
filter(which(boxplot.stats(train$TEAM_FIELDING_E)$out %in% train$TEAM_FIELDING_E, arr.ind = TRUE) == TRUE) %>%
mutate(
TEAM_FIELDING_E = NA,
TEAM_FIELDING_E = mean(train$TEAM_FIELDING_E)
)
This returns error Error in filter_impl(.data, quo) : Result must have length 2276, not 303 (the original data set contains 303 TEAM_FIELDING_E outliers and 2276 rows). How do I utilize filter() such that my mutate() will only affect those filtered rows?
Within dplyr verbs, use bare variable names rather than [[ or $. Additionally, if you're trying to filter on a value, you can just filter on the value directly rather than using which to determine the position of the match.
For this case, you can get what you want with an if_else within mutate.
out <- boxplot.stats(train$TEAM_FIELDING_E)$out
train %>%
  mutate(TEAM_FIELDING_E = if_else(TEAM_FIELDING_E %in% out,
                                   mean(TEAM_FIELDING_E[!(TEAM_FIELDING_E %in% out)]),
                                   TEAM_FIELDING_E))
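To see what this does, here is a small base-R sketch on the six TEAM_FIELDING_E values from the sample data in the question. It only illustrates the replacement logic on a plain vector; the variable names are made up.

```r
# TEAM_FIELDING_E from the sample data in the question
x <- c(1011, 193, 175, 164, 138, 123)

# Values flagged as outliers by the boxplot rule (1.5 * IQR beyond the hinges)
out <- boxplot.stats(x)$out
is_out <- x %in% out

# Replace each outlier with the mean of the non-outlying values
x_fixed <- ifelse(is_out, mean(x[!is_out]), x)
x_fixed  # 158.6 193.0 175.0 164.0 138.0 123.0
```

Only 1011 is flagged here, so it is replaced by the mean of the other five values and the rest of the vector is unchanged.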

How to change distance between breaks for continuous x-axis on ggplot?

I have a dataset with y-axis = diversity indices and x-axis = depth. I am looking at how diversity changes with depth (increases/decreases). It is informative to visualize these changes over depth (so transforming isn't helpful); however, this is difficult because of the disparity in the number of samples at different depths (more samples at shallower than at deeper depths). With the following code:
breaks_depth=c(0,50,100,150,250,350,450,500,1200)
ggplot(data=df, aes(x=Depth, y=Diversity)) +
geom_line()+
scale_y_continuous(breaks=seq(0,1400,200), limits=c(0,1400))+
scale_x_continuous(breaks=breaks_depth, limits=c(0,1200))
I get the following plot:
I would like to get a plot such that the distance between 500m and 1200m depth is smaller and the distance between the shallower depths (0-150m) is greater. Is this possible? I have tried expand and different break and limit variations. The dput() of this dataset can be found here. The rownames are the sample IDs and the columns I am using for the plot are: y-axis=invsimpson_rd, and x-axis=Depth_rd. TIA.
****EDIT*****
Winner code and plot modified from Calum's answer below.
ggplot(data=a_div, aes(x=Depth_rd, y=invsimpson_rd)) +
geom_line()+
scale_y_continuous(breaks=seq(0,1400,200), limits=c(0,1400))+
scale_x_continuous(trans="log10",breaks = c(0,
15,25,50,100,150,200,250,300,350,400,450, seq(600, 1200, by = 200)))
Here's an example with the built-in economics dataset. You can specify the breaks however you want, as usual, but the "sqrt" transformation shifts the actual plotted values so that there is more space near the beginning of the series. You can use other built-in transformations or define your own as well.
EDIT: updated with example data and some comparison of common different trans options.
library(tidyverse)
tbl <- structure(list(yval = c(742, 494, 919, 625, 124, 788, 583, 213, 715, 363, 15, 313, 472, 559, 314, 494, 388, 735, 242, 153, 884, 504, 267, 454, 325, 305, 746, 628, 549, 345, 327, 230, 271, 486, 971, 979, 857, 779, 394, 903, 585, 238, 702, 850, 611, 710, 694, 674, 1133, 468, 784, 634, 234, 61, 325, 505, 693, 1019, 766, 435, 407, 772, 925, 877, 187, 290, 782, 674, 1263, 1156, 935, 499, 791, 797, 537, 308, 761, 744, 674, 764, 560, 805, 540, 427, 711), xval = c(80, 350, 750, 100, 20, 200, 350, 50, 110, 20, 200, 350, 60, 100, 20, 40, 60, 100, 20, 40, 350, 50, 20, 40, 50, 30, 40, 260, 1000, 200, 200, 200, 500, 50, 350, 360, 380, 250, 60, 190, 40, 70, 70, 40, 40, 70, 180, 180, 440, 370, 130, 1200, 20, 20, 30, 80, 120, 200, 220, 120, 40, 80, 350, 750, 20, 80, 200, 320, 500, 220, 160, 80, 140, 350, 100, 40, 350, 100, 200, 340, 60, 40, 100, 60, 40)), .Names = c("yval", "xval"), row.names = c(NA, -85L), class = c("tbl_df", "tbl", "data.frame"))
ggplot(tbl) +
geom_line(aes(x = xval, y = yval)) +
scale_x_continuous(trans = "sqrt", breaks = c(0,50,100,150,250,350,450,500,1200))
ggplot(tbl) +
geom_line(aes(x = xval, y = yval)) +
scale_x_continuous(trans = "log10", breaks = c(0,50,100,150,250,350,450,500,1200))
Created on 2018-04-27 by the reprex package (v0.2.0).
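A quick base-R calculation makes the effect of the transformation concrete. This is only a sketch: axis_share is a made-up helper that computes what fraction of an axis running from 0 to 1200 is taken up by a given depth interval under a transformation.

```r
breaks_depth <- c(0, 50, 100, 150, 250, 350, 450, 500, 1200)

# Fraction of the axis length covered by [lo, hi] under transformation f,
# for an axis running from lim[1] to lim[2]. Hypothetical helper.
axis_share <- function(f, lo, hi, lim = c(0, 1200)) {
  (f(hi) - f(lo)) / (f(lim[2]) - f(lim[1]))
}

# Shallow depths (0-150 m): untransformed vs. square-root axis
axis_share(identity, 0, 150)   # 0.125
axis_share(sqrt, 0, 150)

# Deep range (500-1200 m): untransformed vs. square-root axis
axis_share(identity, 500, 1200)
axis_share(sqrt, 500, 1200)

# A log10 axis cannot place a break at 0
log10(breaks_depth)[1]         # -Inf
```

The square-root axis gives the 0-150 m range a larger share and the 500-1200 m range a smaller one, which is the behaviour asked for. Because log10(0) is -Inf, a break or data point at depth 0 has no finite position on a log10 axis.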

Getting the error "level sets of factors are different" when running a for loop

I have the following 3 tables:
AggData <- structure(list(Path = c("NonBrand", "Brand", "NonBrand,NonBrand",
"Brand,Brand", "NonBrand,NonBrand,NonBrand", "Brand,Brand,Brand",
"Brand,NonBrand", "NonBrand,Brand", "NonBrand,NonBrand,NonBrand,NonBrand",
"Brand,Brand,Brand,Brand", "NonBrand,NonBrand,NonBrand,NonBrand,NonBrand",
"Brand,Brand,Brand,Brand,Brand", "Brand,Brand,NonBrand", "NonBrand,Brand,Brand",
"Brand,NonBrand,NonBrand", "NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand",
"NonBrand,NonBrand,Brand", "Brand,NonBrand,Brand", "NonBrand,Brand,NonBrand",
"NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand",
"Brand,Brand,Brand,Brand,Brand,Brand", "NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand",
"NonBrand,Brand,Brand,Brand", "NonBrand,NonBrand,NonBrand,Brand",
"Brand,Brand,Brand,NonBrand", "Brand,Brand,Brand,Brand,Brand,Brand,Brand",
"Brand,NonBrand,NonBrand,NonBrand", "NonBrand,NonBrand,Brand,Brand",
"Brand,Brand,NonBrand,NonBrand", "Brand,NonBrand,Brand,Brand",
"NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand",
"Brand,Brand,NonBrand,Brand", "NonBrand,Brand,NonBrand,NonBrand",
"Brand,Brand,Brand,Brand,Brand,Brand,Brand,Brand", "NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand",
"NonBrand,NonBrand,Brand,NonBrand", "Brand,NonBrand,NonBrand,Brand",
"NonBrand,Brand,Brand,Brand,Brand", "NonBrand,NonBrand,NonBrand,NonBrand,Brand",
"Brand,NonBrand,Brand,NonBrand", "NonBrand,Brand,Brand,NonBrand",
"Brand,Brand,Brand,Brand,NonBrand", "Brand,NonBrand,NonBrand,NonBrand,NonBrand",
"Brand,Brand,Brand,Brand,Brand,Brand,Brand,Brand,Brand", "NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand",
"Brand,NonBrand,Brand,Brand,Brand", "NonBrand,Brand,NonBrand,Brand",
"Brand,Brand,Brand,NonBrand,Brand", "NonBrand,NonBrand,Brand,Brand,Brand",
"NonBrand,NonBrand,NonBrand,Brand,Brand", "Brand,Brand,NonBrand,Brand,Brand",
"Brand,Brand,Brand,NonBrand,NonBrand", "Brand,Brand,Brand,Brand,Brand,Brand,Brand,Brand,Brand,Brand",
"NonBrand,NonBrand,NonBrand,Brand,NonBrand", "Brand,Brand,NonBrand,NonBrand,NonBrand",
"NonBrand,Brand,Brand,Brand,Brand,Brand", "NonBrand,Brand,NonBrand,NonBrand,NonBrand",
"NonBrand,NonBrand,Brand,NonBrand,NonBrand", "NonBrand,NonBrand,NonBrand,NonBrand,NonBrand,Brand",
"Brand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand", "Brand,Brand,Brand,Brand,Brand,NonBrand",
"NonBrand,Brand,Brand,NonBrand,NonBrand", "Brand,NonBrand,NonBrand,Brand,Brand",
"NonBrand,NonBrand,NonBrand,NonBrand,Brand,Brand", "NonBrand,NonBrand,Brand,Brand,Brand,Brand",
"NonBrand,NonBrand,NonBrand,NonBrand,Brand,NonBrand", "NonBrand,NonBrand,Brand,NonBrand,Brand",
"Brand,NonBrand,NonBrand,Brand,NonBrand", "NonBrand,NonBrand,NonBrand,Brand,Brand,Brand",
"NonBrand,Brand,Brand,NonBrand,Brand", "Brand,NonBrand,NonBrand,NonBrand,NonBrand,Brand",
"Brand,Brand,NonBrand,NonBrand,NonBrand,NonBrand,NonBrand", "Brand,Brand,Brand,Brand,NonBrand,NonBrand,NonBrand"
), click_count = c(1799265, 874478, 198657, 128159, 45728, 30172,
20520, 17815, 16718, 9479, 6554, 3722, 3561, 3408, 3391, 3366,
3256, 2526, 1846, 1708, 1682, 1013, 951, 899, 881, 782, 780,
703, 642, 625, 615, 601, 453, 442, 414, 407, 362, 343, 313, 284,
281, 281, 271, 269, 268, 229, 223, 218, 215, 212, 204, 162, 161,
158, 155, 145, 132, 130, 115, 103, 102, 86, 77, 77, 72, 68, 68,
67, 58, 52, 32, 18, 18), conv_count = c(30938, 19652, 7401, 3803,
2014, 1072, 1084, 981, 652, 379, 230, 166, 205, 246, 254, 93,
239, 104, 112, 51, 76, 23, 66, 81, 55, 29, 62, 57, 50, 37, 17,
33, 38, 17, 8, 41, 33, 30, 24, 16, 26, 18, 16, 17, 7, 21, 10,
8, 27, 23, 11, 13, 6, 15, 14, 16, 8, 10, 6, 6, 11, 11, 8, 9,
8, 8, 9, 7, 7, 6, 6, 6, 7), CR = c(0.0171947989873643, 0.0224728352228415,
0.0372551684561833, 0.0296740767328085, 0.0440430370888733, 0.0355296301206417,
0.0528265107212476, 0.0550659556553466, 0.0389998803684651, 0.0399831205823399,
0.0350930729325603, 0.0445996775926921, 0.057568098848638, 0.0721830985915493,
0.0749041580654674, 0.0276292335115865, 0.0734029484029484, 0.0411718131433096,
0.0606717226435536, 0.0298594847775176, 0.0451843043995244, 0.0227048371174729,
0.0694006309148265, 0.0901001112347052, 0.0624290578887628, 0.0370843989769821,
0.0794871794871795, 0.0810810810810811, 0.0778816199376947, 0.0592,
0.0276422764227642, 0.0549084858569052, 0.0838852097130243, 0.0384615384615385,
0.0193236714975845, 0.100737100737101, 0.0911602209944751, 0.0874635568513119,
0.0766773162939297, 0.0563380281690141, 0.0925266903914591, 0.0640569395017794,
0.0590405904059041, 0.0631970260223048, 0.0261194029850746, 0.091703056768559,
0.0448430493273543, 0.036697247706422, 0.125581395348837, 0.108490566037736,
0.053921568627451, 0.0802469135802469, 0.0372670807453416, 0.0949367088607595,
0.0903225806451613, 0.110344827586207, 0.0606060606060606, 0.0769230769230769,
0.0521739130434783, 0.058252427184466, 0.107843137254902, 0.127906976744186,
0.103896103896104, 0.116883116883117, 0.111111111111111, 0.117647058823529,
0.132352941176471, 0.104477611940299, 0.120689655172414, 0.115384615384615,
0.1875, 0.333333333333333, 0.388888888888889)), .Names = c("Path",
"click_count", "conv_count", "CR"), row.names = c(NA, -73L), class = "data.frame")
another one here:
breakVector <- structure(list(breakVector = structure(c(1L, 1L), .Label = "NonBrand", class = "factor"),
CR = c(0.461541302855402, 0.538458697144598)), .Names = c("breakVector",
"CR"), row.names = c(NA, -2L), class = "data.frame")
and:
FinalTable <- structure(list(autribution_category = structure(c(2L, 1L), .Label = c("Brand",
"NonBrand"), class = "factor"), attributed_result = c(0, 0)), .Names = c("autribution_category",
"attributed_result"), row.names = 1:2, class = "data.frame")
when I run the following command:
if (FinalTable [2,1] == breakVector[1,1]) {
FinalTable$attributed_result[2] <- FinalTable$attributed_result[2] +
breakVector[1,2] * AggData$conv_count[3];
break}
I get the following error:
Error in Ops.factor(FinalTable[2, 1], breakVector[1, 1]) :
level sets of factors are different
This is pretty weird: both values that I'm comparing are factors, so I don't see any reason why R can't compare the two levels.
FinalTable[2,1] and breakVector[1,1] do not have the same levels:
> FinalTable[2,1]
[1] Brand
Levels: Brand NonBrand
> breakVector[1,1]
[1] NonBrand
Levels: NonBrand
This is easily fixed by using
breakVector[,1] <- factor(breakVector[,1], levels=c("Brand", "NonBrand"))
or, more generally
breakVector[,1] <- factor(breakVector[,1], levels=levels(FinalTable[,1]))
Alternatively, it may be simpler to compare both values as strings:
if (as.character(FinalTable [2,1]) == as.character(breakVector[1,1])) {
FinalTable$attributed_result[2] <- FinalTable$attributed_result[2] +
breakVector[1,2] * AggData$conv_count[3];
break}
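Both fixes can be checked on a minimal example. The two one-element factors below are stand-ins for FinalTable[2,1] and breakVector[1,1], built with the same values and levels as shown above; the variable names are made up.

```r
a <- factor("Brand", levels = c("Brand", "NonBrand"))  # like FinalTable[2,1]
b <- factor("NonBrand")                                # like breakVector[1,1]

# Comparing factors with different level sets is an error; capture it
res <- tryCatch(a == b, error = function(e) e)
inherits(res, "error")  # TRUE

# Fix 1: give both factors the same levels before comparing
b2 <- factor(b, levels = levels(a))
a == b2  # FALSE

# Fix 2: compare the labels as character strings
as.character(a) == as.character(b)  # FALSE
```

Either way the comparison now returns FALSE instead of stopping with an error, because the labels "Brand" and "NonBrand" differ.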
