I am trying to use $names on my OutVals (outliers) to find the class each outlier is associated with, and then put the outliers and their class names in a data frame so I can see clearly which class each outlier came from.
However, when I try this, the class names come back as "1", "2", etc. and not "van", "bus", etc. as they appear in the dataset.
Have I missed something or am I approaching this completely wrong?
The goal is to get the outliers in the data and place them in a table that shows which class each outlier came from.
Any help would be appreciated.
I have shown my data frame and my reproducible code below.
library(reshape2)
vehData <-
structure(
list(
Samples = 1:6,
Comp = c(95L, 91L, 104L, 93L, 85L,
107L),
Circ = c(48L, 41L, 50L, 41L, 44L, 57L),
D.Circ = c(83L,
84L, 106L, 82L, 70L, 106L),
Rad.Ra = c(178L, 141L, 209L, 159L,
205L, 172L),
Pr.Axis.Ra = c(72L, 57L, 66L, 63L, 103L, 50L),
Max.L.Ra = c(10L,
9L, 10L, 9L, 52L, 6L),
Scat.Ra = c(162L, 149L, 207L, 144L, 149L,
255L),
Elong = c(42L, 45L, 32L, 46L, 45L, 26L),
Pr.Axis.Rect = c(20L,
19L, 23L, 19L, 19L, 28L),
Max.L.Rect = c(159L, 143L, 158L, 143L,
144L, 169L),
Sc.Var.Maxis = c(176L, 170L, 223L, 160L, 241L, 280L),
Sc.Var.maxis = c(379L, 330L, 635L, 309L, 325L, 957L),
Ra.Gyr = c(184L,
158L, 220L, 127L, 188L, 264L),
Skew.Maxis = c(70L, 72L, 73L,
63L, 127L, 85L),
Skew.maxis = c(6L, 9L, 14L, 6L, 9L, 5L),
Kurt.maxis = c(16L,
14L, 9L, 10L, 11L, 9L),
Kurt.Maxis = c(187L, 189L, 188L, 199L,
180L, 181L),
Holl.Ra = c(197L, 199L, 196L, 207L, 183L, 183L),
Class = c("van", "van", "saab", "van", "bus", "bus")
),
row.names = c(NA,
6L), class = "data.frame")
#Remove outliers function
removeOutliers <- function(data) {
OutVals <- boxplot(data)$out
namesforgroups <- boxplot(OutVals)$names #get group name of the outliers
dataf <- as.data.frame(OutVals, col.names = namesforgroups)#dataframe of outlier + names
print(OutVals) # show all outliers
remOutliers <- sapply(data, function(x) x[!x %in% OutVals]) #remove outliers from data
return (remOutliers)
}
#Remove class column and sample number
vehDataRemove1 <- vehData[, -1]
vehDataRemove2 <- vehDataRemove1[,-19]
vehData <- vehDataRemove2 #assign to new variable
vehClass <- vehData$Class #store original class names
#Begin removing outliers
removeOutliers1 <- removeOutliers(vehData) #remove first set of outliers
removeOutliers2 <- removeOutliers(removeOutliers1) #test again for more and remove
The boxplot object does not tell you which row or class each outlier belongs to; you have to work that out yourself. What it does give you is the column each outlier came from, in boxplot(data)$group, so you can use which() to find the row and then look up that row's class. I rewrote your function so that it prints a table of each outlier value, the column it came from, and the class of the row it came from. In the first iteration there are 5 outliers from 3 rows; in the second iteration there are none, which makes sense because those rows have been removed.
removeOutliers <- function(data, class) {
x=boxplot(data)
OutVals <- x$out
columns <- x$group #get group name of the outliers
ind=numeric()
classes=c()
if (length(columns) > 0) {
for (i in 1:length(columns)) {
rows=which(data[,columns[i]]==OutVals[i])
ind=union(ind, rows)
classes=c(classes, class[rows])
}
dt=data.frame(OutVals, columns, classes) # show all outliers
print(dt)
return (list(data[-ind,], class[-ind]))
}
return(list(data, class))
}
#Remove class column and sample number
vehData1 <- vehData[, -c(1,20)]
vehClass <- vehData$Class #store original class names
#Begin removing outliers
removeOutliers1 <- removeOutliers(vehData1, vehClass) #remove first set of outliers
OutVals columns classes
1 103 5 bus
2 52 6 bus
3 6 6 bus
4 127 14 bus
5 14 15 saab
removeOutliers2 <- removeOutliers(removeOutliers1[[1]], removeOutliers1[[2]])
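To see what the boxplot object actually holds, here is a minimal sketch on a made-up two-column data frame. The names toy, bp and rows are illustrative only and are not part of the original data; the sketch assumes the default 1.5 * IQR rule that boxplot() uses to flag outliers.

```r
# Toy data: column a has one obvious outlier (100) in row 5
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(5, 6, 7, 8, 9))

# plot = FALSE returns the statistics without drawing anything
bp <- boxplot(toy, plot = FALSE)

bp$out    # the outlier values
bp$group  # the column index each outlier came from
bp$names  # the column names, in the same order as the indices in $group

# Recover the row of the first outlier by matching its value in its column
rows <- which(toy[, bp$group[1]] == bp$out[1])
rows
```

Note that matching by value can return more than one row when the same value occurs several times in a column, so the recovered rows are only as unique as the data.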
The first function returns a data frame with the outlier rows removed. The second function returns a table containing information about each outlier (the class, the column, and the value).
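library(dplyr) # provides %>%, select() and bind_rows()
# Note: x$group indexes the columns of data without Class; the indices line up with data only because Class is the last column
removeOutliers=function(data) {
x=boxplot(data %>% select(-Class), plot=FALSE)
outlierRows=c()
for (i in seq_along(x$out)) { # seq_along() makes the loop a no-op when there are no outliers
outlierRows=c(outlierRows, which(data[,x$group[i]]==x$out[i]))
}
if (length(outlierRows)==0) return(data) # data[-c(),] would drop every row
return(data[-outlierRows,])
}
getOutliers=function(data) {
x=boxplot(data %>% select(-Class))
outlierInfo=data.frame()
for (i in seq_along(x$out)) {
rows=which(data[,x$group[i]]==x$out[i])
outlierInfo=bind_rows(outlierInfo, data.frame(class=data$Class[rows],
value=x$out[i],
column=names(data)[x$group[i]]))
}
return(outlierInfo)
}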
removeOutliers(vehData)
Samples Comp Circ D.Circ Rad.Ra Pr.Axis.Ra Max.L.Ra Scat.Ra Elong Pr.Axis.Rect Max.L.Rect
1 1 95 48 83 178 72 10 162 42 20 159
2 2 91 41 84 141 57 9 149 45 19 143
4 4 93 41 82 159 63 9 144 46 19 143
Sc.Var.Maxis Sc.Var.maxis Ra.Gyr Skew.Maxis Skew.maxis Kurt.maxis Kurt.Maxis Holl.Ra Class
1 176 379 184 70 6 16 187 197 van
2 170 330 158 72 9 14 189 199 van
4 160 309 127 63 6 10 199 207 van
getOutliers(vehData)
class value column
1 bus 103 Pr.Axis.Ra
2 bus 52 Max.L.Ra
3 bus 6 Max.L.Ra
4 bus 127 Skew.Maxis
5 saab 14 Skew.maxis
I've done a self-paced reading experiment in which 151 participants read 112 sentences divided into three lists and I'm having some problems cleaning the data in R. I'm not a programmer so I'm kind of struggling with all this!
I've got the results file which looks something like this:
results
part item word n.word rt
51 106 * 1 382
51 106 El 2 286
51 106 asistente 3 327
51 106 del 4 344
51 106 carnicero 5 394
51 106 que 6 274
51 106 abaplía 7 2327
51 106 el 8 1104
51 106 sabor 9 409
51 106 del 10 360
51 106 pollo 11 1605
51 106 envipió 12 256
51 106 un 13 4573
51 106 libro 14 660
51 106 *. 15 519
Part=participant; item=sentences; n.word=number of word; rt=reading times.
In the results file, I have the reading times of every word of every sentence read by every participant. Every participant read about 40 sentences. My problem is that I am interested in the reading times of specific words, such as the main verb or the last word of each sentence. But as every sentence is a bit different, the main verb is not always in the same position. So I've made another table with the positions of the words I'm interested in for every sentence.
rules
item v1 v2 n1 n2
106 12 7 3 5
107 11 8 3 6
108 11 8 3 6
item=sentence; v1=main verb; v2=secondary verb; n1=first noun; n2=second noun.
So this should be read as: for sentence 106, the main verb is word number 12, the secondary verb is word number 7, and so on.
I want to have a final table that looks like this:
results2
part item v1 v2 n1 n2
51 106 256 2327 327 394
51 107 ...
52 106 ...
Does anyone know how to do this? It's a kind of long-to-wide reshaping problem, but with a more complex scenario.
If anyone could help me, I would really appreciate it! Thanks!!
You can try the following code, which joins your results data to a long-format version of rules (one row per item and word role), and then reshapes the result into a wider form.
library(tidyr)
library(dplyr)
inner_join(select(results, -word),
pivot_longer(rules, -item), c("item", "n.word"="value")) %>%
select(-n.word) %>%
pivot_wider(names_from=name, values_from=rt) %>%
select(part, item, v1, v2, n1, n2)
# A tibble: 1 x 6
# part item v1 v2 n1 n2
# <int> <int> <int> <int> <int> <int>
#1 51 106 256 2327 327 394
Data:
results <- structure(list(part = c(51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L), item = c(106L, 106L, 106L,
106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L,
106L), word = c("*", "El", "asistente", "del", "carnicero", "que",
"abaplía", "el", "sabor", "del", "pollo", "envipió", "un", "libro",
"*."), n.word = 1:15, rt = c(382L, 286L, 327L, 344L, 394L, 274L,
2327L, 1104L, 409L, 360L, 1605L, 256L, 4573L, 660L, 519L)), class = "data.frame", row.names = c(NA,
-15L))
rules <- structure(list(item = 106:108, v1 = c(12L, 11L, 11L), v2 = c(7L,
8L, 8L), n1 = c(3L, 3L, 3L), n2 = c(5L, 6L, 6L)), class = "data.frame", row.names = c(NA,
-3L))
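For comparison, the same lookup can be sketched in base R with merge(). This is only an illustration, under the assumption that each (part, item) pair has exactly one row per word position; lookup_rt and roles are names made up for this sketch, and the data are a trimmed copy of the sample above.

```r
# Trimmed copy of the sample data above (the word column is not needed)
results <- data.frame(
  part = 51L, item = 106L, n.word = 1:15,
  rt = c(382L, 286L, 327L, 344L, 394L, 274L, 2327L, 1104L,
         409L, 360L, 1605L, 256L, 4573L, 660L, 519L))
rules <- data.frame(item = 106:108, v1 = c(12L, 11L, 11L), v2 = c(7L, 8L, 8L),
                    n1 = c(3L, 3L, 3L), n2 = c(5L, 6L, 6L))

# Hypothetical helper: one reading-time column per role listed in `roles`
lookup_rt <- function(results, rules, roles = c("v1", "v2", "n1", "n2")) {
  # Attach each sentence's word positions to every row of that sentence
  merged <- merge(results, rules, by = "item")
  pieces <- lapply(roles, function(r) {
    # Keep only the rows whose word position matches this role
    hit <- merged[merged$n.word == merged[[r]], c("part", "item", "rt")]
    names(hit)[3] <- r
    hit
  })
  # Join the per-role tables back together on participant and sentence
  Reduce(function(a, b) merge(a, b, by = c("part", "item")), pieces)
}

wide <- lookup_rt(results, rules)
wide
```

On the sample data this gives a single row for participant 51 and sentence 106, with the same four reading times as the tidyr version.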
I am trying to fit the Richards model in R to the data below but can't get it to work.
time Volume
3 12
6 25
9 38
12 53
15 73
21 108
27 136
33 160
39 180
48 202
60 222
72 241
96 255
Richards <- nls(
Volume ~ (Vi*Vf)/((Vi^n) + ((Vf^n-(Vi^n))*exp(-u*time)))^(1/n),
data=dat1,
start=c(Vi=3, Vf=255, u=6, n=-0.5))
Any help is appreciated!
It is better to use dput to provide data since it is quicker to get into R and preserves data types such as integer and factor:
dat1 <- structure(list(time = c(3L, 6L, 9L, 12L, 15L, 21L, 27L, 33L,
39L, 48L, 60L, 72L, 96L), Volume = c(12L, 25L, 38L, 53L, 73L,
108L, 136L, 160L, 180L, 202L, 222L, 241L, 255L)), class = "data.frame",
row.names = c(NA, -13L))
Here are the data and the curve you are trying to fit with your starting values:
plot(Volume~time, dat1)
Vi <- 3; Vf <- 255; u <- 6; n <- -0.5
Vol.pred <- (Vi*Vf)/((Vi^n) + ((Vf^n-(Vi^n))*exp(-u*dat1$time)))^(1/n)
lines(dat1$time, Vol.pred, col="red")
You can see the predicted line is nowhere near the data. As @Maurits Evers indicated, it is not clear that the Richards curve is appropriate, but you can try changing the starting values to get something closer, e.g. by changing u to .05:
u <- .05
Vol.pred <- (Vi*Vf)/((Vi^n) + ((Vf^n-(Vi^n))*exp(-u*dat1$time)))^(1/n)
lines(dat1$time, Vol.pred, col="blue")
That gives us starting values that will work:
Richards <- nls(
Volume ~ (Vi*Vf)/((Vi^n) + ((Vf^n-(Vi^n))*exp(-u*time)))^(1/n),
data=dat1,
start=c(Vi=3, Vf=255, u=.05, n=-0.5))
lines(dat1$time, predict(Richards), col="darkgreen")
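One way to sanity-check starting values before calling nls() is to wrap the curve in a function and look at its limits. The sketch below uses the same formula as the fit above; the function name richards and the variables start_value and late_value are made up for illustration.

```r
# Same formula as in the nls() call, wrapped in a function (hypothetical name)
richards <- function(time, Vi, Vf, u, n) {
  (Vi * Vf) / ((Vi^n) + ((Vf^n - (Vi^n)) * exp(-u * time)))^(1 / n)
}

# At time 0 the exponential term is 1, so the curve starts at Vi
start_value <- richards(0, Vi = 3, Vf = 255, u = 0.05, n = -0.5)

# For large time the exponential term vanishes, so the curve levels off at Vf
late_value <- richards(1000, Vi = 3, Vf = 255, u = 0.05, n = -0.5)

# Compare the two candidate rates at an observed time point (time = 15, Volume = 73)
richards(15, Vi = 3, Vf = 255, u = 6, n = -0.5)
richards(15, Vi = 3, Vf = 255, u = 0.05, n = -0.5)
```

With u = 6 the curve has already flattened out at Vf by time 15, which is why the red line misses the data, whereas u = 0.05 gives a value of the same order as the observed Volume.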
I have a list of nodes, and I need to randomly assign 'p' hubs to 'n' clients.
I have the following data, where the first row shows:
The total number of nodes.
The requested number of hubs.
The total supply capacity for each hub.
The following lines show:
The first column the node number.
The second column the "x" coordinate.
The third the "y" coordinate.
The raw data is shown further below; with column names added it would look something like this:
total_nodes hubs_required total_capacity
50 5 120
node number x_coordinate y_coordinate node_demand
1 2 62 3
2 80 25 14
3 36 88 1
4 57 23 14
. . . .
. . . .
. . . .
50 1 58 2
The x and y values are provided so we can calculate the Euclidean distance.
nodes:
50 5 120
1 2 62 3
2 80 25 14
3 36 88 1
4 57 23 14
5 33 17 19
6 76 43 2
7 77 85 14
8 94 6 6
9 89 11 7
10 59 72 6
11 39 82 10
12 87 24 18
13 44 76 3
14 2 83 6
15 19 43 20
16 5 27 4
17 58 72 14
18 14 50 11
19 43 18 19
20 87 7 15
21 11 56 15
22 31 16 4
23 51 94 13
24 55 13 13
25 84 57 5
26 12 2 16
27 53 33 3
28 53 10 7
29 33 32 14
30 69 67 17
31 43 5 3
32 10 75 3
33 8 26 12
34 3 1 14
35 96 22 20
36 6 48 13
37 59 22 10
38 66 69 9
39 22 50 6
40 75 21 18
41 4 81 7
42 41 97 20
43 92 34 9
44 12 64 1
45 60 84 8
46 35 100 5
47 38 2 1
48 9 9 7
49 54 59 9
50 1 58 2
I extracted the information from the first line.
nodes <- as.matrix(read.table(data))
header<-colnames(nodes)
clean_header <-gsub('X','',header)
requested_hubs <- as.numeric(clean_header[2])
max_supply_capacity <- as.numeric(clean_header[3])
I need to randomly select 5 nodes, that will act as hubs
set.seed(37)
node_to_hub <-nodes[sample(nrow(nodes),requested_hubs,replace = FALSE),]
Then I need to randomly assign nodes to each hub, calculate the distance between the hub and each of its nodes, and, when max_supply_capacity (120) is exceeded, move on to the next hub and repeat the process.
After the final iteration I need to return the cumulative sum of distances for all the hubs.
I need to repeat this process 100 times and return the min() value of the cumulative sum of distances.
This is where I'm completely stuck, since I'm not sure how to loop through a matrix, let alone when I have to select elements randomly.
I have the following elements:
capacity <- c(numeric()) # needs to be <= to 120
distance_sum <- c(numeric())
global_hub_distance <- c(numeric())
The formula for the euclidean distance (rounded) would be as below but I'm not sure how I can reflect the random selection when assigning nodes.
distance <-round(sqrt(((node_to_hub[i,2]-nodes[i,2]))^2+(node_to_hub[random,3]-nodes[random,3])^2))
The idea for the loop I think I need is below, but as I mentioned before I don't know how to deal with the sample client selection, and the distance calculation of the random clients.
for(i in 1:100){
node_to_hub
for(i in 1:nrow(node_to_hub){
#Should I randomly sample the clients here???
while(capacity < 120){
node_demand <- nodes[**random**,3]
distance <-round(sqrt(((node_to_hub[i,2]-nodes[i,2]))^2+(node_to_hub[**random**,3]-nodes[**random**,3])^2))
capacity <-c(capacity, node_demand)
distance_sum <- c(distance_sum,distance)
}
global_hub_distance <- c(global_hub_distance,distance_sum)
capacity <- 0
distance_sum <- 0
}
min(global_hub_distance)
}
I'm not exactly sure what you are looking for, but this code may help. It's not extremely fast: instead of using a while loop to stop once total_capacity is reached, it does a cumsum over the full node list and finds the place where the demand exceeds 120.
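The cumsum cut-off can be seen in isolation on a toy vector. The demand values below are made up for illustration and are not taken from the node data.

```r
# Made-up demands for nodes that are already in a random order
demand <- c(50, 40, 20, 30, 10)
capacity <- 120

# Running total of demand as nodes are added one by one
running <- cumsum(demand)

# Index of the first node that pushes the total over capacity
first_over <- which(running > capacity)[1]

# The node just before that one is the last node that still fits
last <- first_over - 1

# If capacity is never exceeded, which() finds nothing and last is NA
if (is.na(last)) last <- length(demand)
last
```

Here the first three nodes add up to 110, and the fourth would bring the total to 140, so last is 3.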
nodes <- structure(list(node_number = 1:50,
x = c(2L, 80L, 36L, 57L, 33L, 76L, 77L, 94L,
89L, 59L, 39L, 87L, 44L, 2L, 19L, 5L,
58L, 14L, 43L, 87L, 11L, 31L, 51L, 55L,
84L, 12L, 53L, 53L, 33L, 69L, 43L, 10L,
8L, 3L, 96L, 6L, 59L, 66L, 22L, 75L, 4L,
41L, 92L, 12L, 60L, 35L, 38L, 9L, 54L, 1L),
y = c(62L, 25L, 88L, 23L, 17L, 43L, 85L, 6L, 11L,
72L, 82L, 24L, 76L, 83L, 43L, 27L, 72L, 50L,
18L, 7L, 56L, 16L, 94L, 13L, 57L, 2L, 33L, 10L,
32L, 67L, 5L, 75L, 26L, 1L, 22L, 48L, 22L, 69L,
50L, 21L, 81L, 97L, 34L, 64L, 84L, 100L, 2L, 9L, 59L, 58L),
node_demand = c(3L, 14L, 1L, 14L, 19L, 2L, 14L, 6L,
7L, 6L, 10L, 18L, 3L, 6L, 20L, 4L,
14L, 11L, 19L, 15L, 15L, 4L, 13L,
13L, 5L, 16L, 3L, 7L, 14L, 17L,
3L, 3L, 12L, 14L, 20L, 13L, 10L,
9L, 6L, 18L, 7L, 20L, 9L, 1L, 8L,
5L, 1L, 7L, 9L, 2L)),
.Names = c("node_number", "x", "y", "node_demand"),
class = "data.frame", row.names = c(NA, -50L))
total_nodes = nrow(nodes)
hubs_required = 5
total_capacity = 120
iterations <- 100
track_sums <- matrix(NA, nrow = iterations, ncol = hubs_required)
colnames(track_sums) <- paste0("demand_at_hub",1:hubs_required)
I prefer using a function for the distance; here A and B are two separate coordinate vectors, each of the form c(x, y).
euc.dist <- function(A, B) round(sqrt(sum((A - B) ^ 2))) # distances
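As a quick check of the distance function, here is a minimal sketch on two made-up points (hub and node are illustrative values, not rows from the data). The function is repeated so the snippet runs on its own.

```r
# Rounded Euclidean distance, as defined above
euc.dist <- function(A, B) round(sqrt(sum((A - B) ^ 2)))

# Two made-up points forming a 3-4-5 right triangle
hub <- c(x = 0, y = 0)
node <- c(x = 3, y = 4)

d <- euc.dist(hub, node)
d

# Because of round(), non-integer distances are reported as whole numbers
d_diag <- euc.dist(c(0, 0), c(1, 1))
d_diag
```

The first distance is exactly 5; the second is sqrt(2) rounded to 1, which shows that the rounding discards fractional distance.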
The Loop:
for(i in 1:iterations){
# random hub selection
hubs <- nodes[sample(1:total_nodes, hubs_required, replace = FALSE),]
for(h in 1:hubs_required){
# sample the nodes into a random order
random_nodes <- nodes[sample(1:nrow(nodes), size = nrow(nodes), replace = FALSE),]
# cumulative sum their demand, and get which number passes 120,
# and subtract 1 to get the node before that
last <- which(cumsum(random_nodes$node_demand) > total_capacity) [1] - 1
# get sum of all distances to those nodes (1 through the last)
all_distances <- apply(random_nodes[1:last,], 1, function(rn) {
euc.dist(A = hubs[h,c("x","y")],
B = rn[c("x","y")])
})
track_sums[i,h] <- sum(all_distances)
}
}
min(rowSums(track_sums))
EDIT
as a function:
hubnode <- function(nodes, hubs_required = 5, total_capacity = 120, iterations = 10){
# initialize results matrices
track_sums <- node_count <- matrix(NA, nrow = iterations, ncol = hubs_required)
colnames(track_sums) <- paste0("demand_at_hub",1:hubs_required)
colnames(node_count) <- paste0("nodes_at_hub",1:hubs_required)
# user defined distance function (only exists within hubnode() function)
euc.dist <- function(A, B) round(sqrt(sum((A - B) ^ 2)))
for(i in 1:iterations){
# random hub selection
assigned_hubs <- sample(1:nrow(nodes), hubs_required, replace = FALSE)
hubs <- nodes[assigned_hubs,]
assigned_nodes <- NULL
for(h in 1:hubs_required){
# sample the nodes into a random order
assigned_nodes <- sample((1:nrow(nodes))[-assigned_hubs], replace = FALSE)
random_nodes <- nodes[assigned_nodes,]
# cumulative sum their demand, and get which number passes 120,
# and subtract 1 to get the node before that
last <- which(cumsum(random_nodes$node_demand) > total_capacity) [1] - 1
# if there are none
if(is.na(last)) last = nrow(random_nodes)
node_count[i,h] <- last
# get sum of all distances to those nodes (1 through the last)
all_distances <- apply(random_nodes[1:last,], 1, function(rn) {
euc.dist(A = hubs[h,c("x","y")],
B = rn[c("x","y")])
})
track_sums[i,h] <- sum(all_distances)
}
}
return(list(track_sums = track_sums, node_count = node_count))
}
output <- hubnode(nodes, iterations = 100)
node_count <- output$node_count
track_sums <- output$track_sums
plot(rowSums(node_count),
rowSums(track_sums), xlab = "Node Count", ylab = "Total Distance", main = paste("Result of", 100, "iterations"))
min(rowSums(track_sums))
After running the replicate() function [a close relative of lapply()] on some data I ended up with an output that looks like this
myList <- structure(list(c(55L, 13L, 61L, 38L, 24L), 6.6435972422341, c(37L, 1L, 57L, 8L, 40L), 5.68336098665417, c(19L, 10L, 23L, 52L, 60L ),
5.80430476680636, c(39L, 47L, 60L, 14L, 3L), 6.67554407822367,
c(57L, 8L, 53L, 6L, 2L), 5.67149520387856, c(40L, 8L, 21L,
17L, 13L), 5.88446015238962, c(52L, 21L, 22L, 55L, 54L),
6.01685181395007, c(12L, 7L, 1L, 2L, 14L), 6.66299948053721,
c(41L, 46L, 21L, 30L, 6L), 6.67239635545512, c(46L, 31L,
11L, 44L, 32L), 6.44174324641076), .Dim = c(2L, 10L), .Dimnames = list(
c("reps", "score"), NULL))
In this case the vectors of integers are indexes that went into a function that I won't get into and the scalar-floats are scores.
I'd like a data frame that looks like
Index 1 Index 2 Index 3 Index 4 Index 5 Score
55 13 61 38 24 6.64
37 1 57 8 40 5.68
19 10 23 52 60 5.80
and so on.
Alternatively, a matrix of the indexes and an array of the values would be fine too.
Things that haven't worked for me:
data.frame(t(myList)) # just gives a data frame with a column of vectors and another of scalars
cbind(t(myList)) # same as above
do.call(rbind, myList) # intersperses vectors and scalars
I realize other people have similar problems,
eg. Convert list of vectors to data frame
but I can't quite find an example with this particular kind of vectors and scalars together.
myList[1,] is a list of vectors, so you can combine them into a matrix with do.call and rbind. myList[2,] is a list of single scores, so you can combine them into a vector with unlist:
cbind(as.data.frame(do.call(rbind, myList[1,])), Score=unlist(myList[2,]))
# V1 V2 V3 V4 V5 Score
# 1 55 13 61 38 24 6.643597
# 2 37 1 57 8 40 5.683361
# 3 19 10 23 52 60 5.804305
# 4 39 47 60 14 3 6.675544
# 5 57 8 53 6 2 5.671495
# 6 40 8 21 17 13 5.884460
# 7 52 21 22 55 54 6.016852
# 8 12 7 1 2 14 6.662999
# 9 41 46 21 30 6 6.672396
# 10 46 31 11 44 32 6.441743
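If a matrix of the indexes and a separate vector of scores is preferred, the same two pieces can be kept apart instead of being bound together. The sketch below uses a small made-up list-matrix (toy) with the same "reps"/"score" row layout as myList; the values are illustrative only.

```r
# Toy list-matrix with the same shape as the replicate() output:
# row "reps" holds index vectors, row "score" holds single numbers
toy <- matrix(list(c(1L, 2L), 0.5, c(3L, 4L), 1.5), nrow = 2,
              dimnames = list(c("reps", "score"), NULL))

# One row per replicate, one column per index
idx <- do.call(rbind, toy["reps", ])

# Plain numeric vector of scores, in the same order as the rows of idx
scores <- unlist(toy["score", ])

idx
scores
```

Selecting rows by name ("reps", "score") rather than by position is a small design choice that keeps the code readable if the row order ever changes.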