Loops with random sampling from a matrix and distance calculation in R

I have a list of nodes, and I need to randomly assign 'p' hubs to 'n' clients.
I have the following data, where the first row gives:
The total number of nodes.
The requested number of hubs.
The total supply capacity for each hub.
Each following line gives:
The node number (first column).
The "x" coordinate (second column).
The "y" coordinate (third column).
The node demand (fourth column).
Below is the raw data; after adding colnames() it would look something like this:
total_nodes hubs_required total_capacity
         50             5            120

node_number x_coordinate y_coordinate node_demand
          1            2           62           3
          2           80           25          14
          3           36           88           1
          4           57           23          14
        ...          ...          ...         ...
         50            1           58           2
The x and y values are provided so we can calculate the Euclidean distance.
nodes:
50 5 120
1 2 62 3
2 80 25 14
3 36 88 1
4 57 23 14
5 33 17 19
6 76 43 2
7 77 85 14
8 94 6 6
9 89 11 7
10 59 72 6
11 39 82 10
12 87 24 18
13 44 76 3
14 2 83 6
15 19 43 20
16 5 27 4
17 58 72 14
18 14 50 11
19 43 18 19
20 87 7 15
21 11 56 15
22 31 16 4
23 51 94 13
24 55 13 13
25 84 57 5
26 12 2 16
27 53 33 3
28 53 10 7
29 33 32 14
30 69 67 17
31 43 5 3
32 10 75 3
33 8 26 12
34 3 1 14
35 96 22 20
36 6 48 13
37 59 22 10
38 66 69 9
39 22 50 6
40 75 21 18
41 4 81 7
42 41 97 20
43 92 34 9
44 12 64 1
45 60 84 8
46 35 100 5
47 38 2 1
48 9 9 7
49 54 59 9
50 1 58 2
I extracted the information from the first line:
nodes <- as.matrix(read.table(data))
header <- colnames(nodes)
clean_header <- gsub('X', '', header)
requested_hubs <- as.numeric(clean_header[2])
max_supply_capacity <- as.numeric(clean_header[3])
I need to randomly select 5 nodes that will act as hubs:
set.seed(37)
node_to_hub <- nodes[sample(nrow(nodes), requested_hubs, replace = FALSE), ]
Then I need to randomly assign nodes to each hub, calculating the distance between the hub and each assigned node; when the max_supply_capacity (120) is exceeded, I move on to the next hub and repeat the process.
After the final iteration I need to return the cumulative sum of distances over all the hubs.
I need to repeat this whole process 100 times and return the min() value of the cumulative sums of distances.
This is where I'm completely stuck, since I'm not sure how to loop through a matrix, let alone select elements randomly while doing so.
I have the following elements:
capacity <- numeric()            # needs to stay <= 120
distance_sum <- numeric()
global_hub_distance <- numeric()
The formula for the (rounded) Euclidean distance would be as below, but I'm not sure how to reflect the random selection when assigning nodes.
distance <- round(sqrt((node_to_hub[i, 2] - nodes[i, 2])^2 + (node_to_hub[random, 3] - nodes[random, 3])^2))
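For reference, a corrected sketch of that formula, assuming the matrix columns are node number, x, y, and demand, with the hub row indexed by h and the client row by j (both index names hypothetical):
distance <- round(sqrt((node_to_hub[h, 2] - nodes[j, 2])^2 +
                       (node_to_hub[h, 3] - nodes[j, 3])^2))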
The idea for the loop I think I need is below, but as I mentioned, I don't know how to handle the random client selection or the distance calculation for the randomly chosen clients.
for(i in 1:100){
  node_to_hub
  for(h in 1:nrow(node_to_hub)){
    # Should I randomly sample the clients here???
    while(sum(capacity) < 120){
      node_demand <- nodes[**random**, 4]
      distance <- round(sqrt((node_to_hub[h, 2] - nodes[**random**, 2])^2 +
                             (node_to_hub[h, 3] - nodes[**random**, 3])^2))
      capacity <- c(capacity, node_demand)
      distance_sum <- c(distance_sum, distance)
    }
    global_hub_distance <- c(global_hub_distance, distance_sum)
    capacity <- 0
    distance_sum <- 0
  }
  min(global_hub_distance)
}

Not EXACTLY sure what you are looking for, but this code may help. It's not extremely fast: instead of using a while loop to stop after hitting your total_capacity, it just does a cumsum on the full node list and finds the place where you exceed 120.
nodes <- structure(list(node_number = 1:50,
x = c(2L, 80L, 36L, 57L, 33L, 76L, 77L, 94L,
89L, 59L, 39L, 87L, 44L, 2L, 19L, 5L,
58L, 14L, 43L, 87L, 11L, 31L, 51L, 55L,
84L, 12L, 53L, 53L, 33L, 69L, 43L, 10L,
8L, 3L, 96L, 6L, 59L, 66L, 22L, 75L, 4L,
41L, 92L, 12L, 60L, 35L, 38L, 9L, 54L, 1L),
y = c(62L, 25L, 88L, 23L, 17L, 43L, 85L, 6L, 11L,
72L, 82L, 24L, 76L, 83L, 43L, 27L, 72L, 50L,
18L, 7L, 56L, 16L, 94L, 13L, 57L, 2L, 33L, 10L,
32L, 67L, 5L, 75L, 26L, 1L, 22L, 48L, 22L, 69L,
50L, 21L, 81L, 97L, 34L, 64L, 84L, 100L, 2L, 9L, 59L, 58L),
node_demand = c(3L, 14L, 1L, 14L, 19L, 2L, 14L, 6L,
7L, 6L, 10L, 18L, 3L, 6L, 20L, 4L,
14L, 11L, 19L, 15L, 15L, 4L, 13L,
13L, 5L, 16L, 3L, 7L, 14L, 17L,
3L, 3L, 12L, 14L, 20L, 13L, 10L,
9L, 6L, 18L, 7L, 20L, 9L, 1L, 8L,
5L, 1L, 7L, 9L, 2L)),
.Names = c("node_number", "x", "y", "node_demand"),
class = "data.frame", row.names = c(NA, -50L))
total_nodes = nrow(nodes)
hubs_required = 5
total_capacity = 120
iterations <- 100
track_sums <- matrix(NA, nrow = iterations, ncol = hubs_required)
colnames(track_sums) <- paste0("demand_at_hub",1:hubs_required)
And then I prefer using a function for the distance; here A and B are two separate vectors, each of the form c(x, y).
euc.dist <- function(A, B) round(sqrt(sum((A - B) ^ 2))) # distances
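For example, the distance between node 1 at (2, 62) and node 2 at (80, 25) in the data above:
euc.dist(c(2, 62), c(80, 25))
# [1] 86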
The Loop:
for(i in 1:iterations){
  # random hub selection
  hubs <- nodes[sample(1:total_nodes, hubs_required, replace = FALSE), ]
  for(h in 1:hubs_required){
    # shuffle the nodes into a random order
    random_nodes <- nodes[sample(1:nrow(nodes), size = nrow(nodes), replace = FALSE), ]
    # cumulative sum their demand, find the first position that passes 120,
    # and subtract 1 to get the node before that
    last <- which(cumsum(random_nodes$node_demand) > total_capacity)[1] - 1
    # get the sum of all distances from this hub to those nodes (1 through last)
    all_distances <- apply(random_nodes[1:last, ], 1, function(rn) {
      euc.dist(A = hubs[h, c("x","y")],
               B = rn[c("x","y")])
    })
    track_sums[i, h] <- sum(all_distances)
  }
}
min(rowSums(track_sums))
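To see the cumsum cutoff trick in isolation, here it is on a toy demand vector (made up for illustration), where the first three nodes fit within the capacity of 120:
demand <- c(30, 50, 25, 40)
cumsum(demand)                      # 30 80 105 145
which(cumsum(demand) > 120)[1] - 1  # 3: nodes 1 through 3 fit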
EDIT
as a function:
hubnode <- function(nodes, hubs_required = 5, total_capacity = 120, iterations = 10){
  # initialize results matrices
  track_sums <- node_count <- matrix(NA, nrow = iterations, ncol = hubs_required)
  colnames(track_sums) <- paste0("demand_at_hub", 1:hubs_required)
  colnames(node_count) <- paste0("nodes_at_hub", 1:hubs_required)
  # user-defined distance function (only exists within the hubnode() function)
  euc.dist <- function(A, B) round(sqrt(sum((A - B) ^ 2)))
  for(i in 1:iterations){
    # random hub selection
    assigned_hubs <- sample(1:nrow(nodes), hubs_required, replace = FALSE)
    hubs <- nodes[assigned_hubs, ]
    assigned_nodes <- NULL
    for(h in 1:hubs_required){
      # shuffle the non-hub nodes into a random order
      assigned_nodes <- sample((1:nrow(nodes))[-assigned_hubs], replace = FALSE)
      random_nodes <- nodes[assigned_nodes, ]
      # cumulative sum their demand, find the first position that passes 120,
      # and subtract 1 to get the node before that
      last <- which(cumsum(random_nodes$node_demand) > total_capacity)[1] - 1
      # if the capacity is never exceeded, use all remaining nodes
      if(is.na(last)) last <- nrow(random_nodes)
      node_count[i, h] <- last
      # get the sum of all distances from this hub to those nodes (1 through last)
      all_distances <- apply(random_nodes[1:last, ], 1, function(rn) {
        euc.dist(A = hubs[h, c("x","y")],
                 B = rn[c("x","y")])
      })
      track_sums[i, h] <- sum(all_distances)
    }
  }
  return(list(track_sums = track_sums, node_count = node_count))
}
output <- hubnode(nodes, iterations = 100)
node_count <- output$node_count
track_sums <- output$track_sums
plot(rowSums(node_count), rowSums(track_sums),
     xlab = "Node Count", ylab = "Total Distance",
     main = paste("Result of", 100, "iterations"))
min(rowSums(track_sums))
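If you do want the explicit while loop from the original pseudocode, here is a minimal sketch of filling a single hub (the function and variable names are mine, not part of the answer above); it stops before adding the node that would push demand past total_capacity:
assign_to_hub <- function(hub, pool, total_capacity = 120) {
  ord <- sample(nrow(pool))  # visit candidate nodes in random order
  capacity <- 0; dist_sum <- 0; k <- 0
  while (k < nrow(pool) &&
         capacity + pool$node_demand[ord[k + 1]] <= total_capacity) {
    k <- k + 1
    node <- pool[ord[k], ]
    capacity <- capacity + node$node_demand
    dist_sum <- dist_sum + round(sqrt((hub$x - node$x)^2 + (hub$y - node$y)^2))
  }
  dist_sum
}
# e.g. total distance for the first sampled hub:
# assign_to_hub(hubs[1, ], nodes)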

Related

tidyverse: replace NA with values from another data frame under a condition

I have a first, incomplete dataset data_incom and a second one, data_to_com, containing the missing values of the first. Using mutate(UG = case_when(INSEE == "07185" ~ 6, etc.)) overwrites the "UG" column. How can I replace the NAs in the first dataset with the values from the second table using tidyverse tools?
Thank you!
data_incom <- structure(list(INSEE = c("07005", "07005", "07010", "07011",
"07011", "07012", "07019", "07025", "07026", "07032", "07033",
"07042", "07064", "07066", "07068", "07069", "07075", "07088",
"07096", "07099", "07101", "07101", "07105", "07105", "07107",
"07110", "07117", "07117", "07119", "07128", "07129", "07131",
"07144", "07153", "07154", "07159", "07161", "07161", "07168",
"07172", "07173", "07185", "07186", "07202", "07204", "07228",
"07232", "07240", "07261", "07265", "07273", "07279", "07284",
"07286", "07294", "07301", "07315", "07329", "07330", "07331",
"07338", "07338", "07347", "07187", "07265", "07334", "07262"
), UG = c(NA, NA, 2L, NA, NA, 10L, 13L, 28L, 26L, 15L, 21L, 19L,
11L, 16L, 8L, 6L, 26L, 25L, 11L, 18L, 21L, 21L, 26L, 26L, 24L,
25L, 25L, 25L, NA, 3L, 8L, 22L, 24L, NA, 28L, NA, 28L, 28L, 21L,
1L, 12L, NA, 15L, 24L, 7L, 1L, 24L, 9L, 9L, 2L, 18L, 19L, NA,
11L, 21L, 6L, NA, 24L, 18L, 28L, 8L, 8L, 3L, 24L, 2L, 20L, 24L
)), row.names = c(NA, -67L), class = "data.frame")
data_to_com <-structure(list(INSEE=c("07185", "07284", "07315", "07153", "07119", "07159", "070005"),
UG=c(6L,20L,24L,28L,26L,15L,17L)), row.names = c(NA,7L), class = "data.frame")
You can use the following solution. Some INSEE values in the first data set aren't present in the second data set; I just left those as NA values.
library(dplyr)
library(tidyr)
data_incom %>%
  filter(is.na(UG)) %>%
  rowwise() %>%
  mutate(UG = list(data_to_com$UG[grepl(INSEE, data_to_com$INSEE)])) %>%
  unnest(cols = c(UG)) -> data_com
data_com %>%
  bind_rows(data_incom %>%
              filter(!INSEE %in% data_com$INSEE)) %>%
  arrange(INSEE)
# A tibble: 67 x 2
INSEE UG
<chr> <int>
1 07005 NA
2 07005 NA
3 07010 2
4 07011 NA
5 07011 NA
6 07012 10
7 07019 13
8 07025 28
9 07026 26
10 07032 15
# ... with 57 more rows
You can use coalesce() in this kind of scenario:
Using left_join() will include all rows from data_incom.
Use coalesce() thereafter.
Further, use .keep = 'unused' in the mutate() call to retain only the wanted columns.
library(dplyr)
data_incom %>%
  left_join(data_to_com, by = 'INSEE') %>%
  mutate(UG = coalesce(UG.x, UG.y), .keep = 'unused')
INSEE UG
1 07005 NA
2 07005 NA
3 07010 2
4 07011 NA
5 07011 NA
6 07012 10
7 07019 13
8 07025 28
9 07026 26
10 07032 15
11 07033 21
12 07042 19
13 07064 11
14 07066 16
15 07068 8
16 07069 6
17 07075 26
18 07088 25
19 07096 11
20 07099 18
21 07101 21
22 07101 21
23 07105 26
24 07105 26
25 07107 24
26 07110 25
27 07117 25
28 07117 25
29 07119 26
30 07128 3
31 07129 8
32 07131 22
33 07144 24
34 07153 28
35 07154 28
36 07159 15
37 07161 28
38 07161 28
39 07168 21
40 07172 1
41 07173 12
42 07185 6
43 07186 15
44 07202 24
45 07204 7
46 07228 1
47 07232 24
48 07240 9
49 07261 9
50 07265 2
51 07273 18
52 07279 19
53 07284 20
54 07286 11
55 07294 21
56 07301 6
57 07315 24
58 07329 24
59 07330 18
60 07331 28
61 07338 8
62 07338 8
63 07347 3
64 07187 24
65 07265 2
66 07334 20
67 07262 24
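A more compact variant of the same idea, using match() for an exact key lookup instead of a join (a sketch; note that match() only fills rows whose INSEE matches exactly, so the "070005" entry in data_to_com matches nothing here):
library(dplyr)
data_incom %>%
  mutate(UG = coalesce(UG, data_to_com$UG[match(INSEE, data_to_com$INSEE)]))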

Change row name returns "duplicate 'row.names' are not allowed" in R

I've tried to change the row names from the format "data07_2470178_2" to "2470178" with the following code:
rownames(df) <-regmatches(rownames(df), gregexpr("(?<=_)[[:alnum:]]{7}", rownames(df), perl = TRUE))
But it returns the following error:
Error in `.rowNamesDF<-`(x, value = value) : duplicate 'row.names' are not allowed
The dataset briefly looks like this:
1 2 3 4
data143_2220020_1 24 87 3 32
data143_2220020_2 24 87 3 32
data105_2220058_1 26 91 3 36
data105_2220058_2 26 91 3 36
data134_2221056_2 13 40 3 17
data134_2221056_1 13 40 3 17
And I'd like my dataset to look like this; of each pair of original rows, only the one ending in "_2" should remain:
1 2 3 4
2220020 24 87 3 32
2220058 26 91 3 36
2221056 13 40 3 17
I really don't understand why this happens. Also, how can I change the row names correctly? Could anyone help? Thanks in advance!
If you want to remove rows based on rownames, you can use:
rn <- sub('.*_(\\d+)_.*', '\\1', rownames(df))
df1 <- df[!duplicated(rn), ]
rownames(df1) <- unique(rn)
df1
# 1 2 3 4
#2220020 24 87 3 32
#2220058 26 91 3 36
#2221056 13 40 3 17
However, unique(df) would automatically give you only the unique rows, and you can then change the rownames with the method above.
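Since you specifically want to keep the row ending in "_2" from each pair (rather than whichever happens to come first), a minimal sketch of that variant:
# keep only rows whose original name ends in "_2", then strip prefix and suffix
df2 <- df[grepl("_2$", rownames(df)), ]
rownames(df2) <- sub('.*_(\\d+)_.*', '\\1', rownames(df2))
df2
# 1 2 3 4
#2220020 24 87 3 32
#2220058 26 91 3 36
#2221056 13 40 3 17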
data
df <- structure(list(`1` = c(24L, 24L, 26L, 26L, 13L, 13L), `2` = c(87L,
87L, 91L, 91L, 40L, 40L), `3` = c(3L, 3L, 3L, 3L, 3L, 3L), `4` = c(32L,
32L, 36L, 36L, 17L, 17L)), class = "data.frame",
row.names = c("data143_2220020_1",
"data143_2220020_2", "data105_2220058_1", "data105_2220058_2",
"data134_2221056_2", "data134_2221056_1"))

Converting a list of vectors and numbers (from replicate) into a data frame

After running the replicate() function [a close relative of lapply()] on some data, I ended up with an output that looks like this:
myList <- structure(list(c(55L, 13L, 61L, 38L, 24L), 6.6435972422341, c(37L, 1L, 57L, 8L, 40L), 5.68336098665417, c(19L, 10L, 23L, 52L, 60L ),
5.80430476680636, c(39L, 47L, 60L, 14L, 3L), 6.67554407822367,
c(57L, 8L, 53L, 6L, 2L), 5.67149520387856, c(40L, 8L, 21L,
17L, 13L), 5.88446015238962, c(52L, 21L, 22L, 55L, 54L),
6.01685181395007, c(12L, 7L, 1L, 2L, 14L), 6.66299948053721,
c(41L, 46L, 21L, 30L, 6L), 6.67239635545512, c(46L, 31L,
11L, 44L, 32L), 6.44174324641076), .Dim = c(2L, 10L), .Dimnames = list(
c("reps", "score"), NULL))
In this case the vectors of integers are indexes that went into a function that I won't get into, and the scalar floats are scores.
I'd like a data frame that looks like:
Index 1 Index 2 Index 3 Index 4 Index 5 Score
55 13 61 38 24 6.64
37 1 57 8 40 5.68
19 10 23 52 60 5.80
and so on.
Alternatively, a matrix of the indexes and an array of the values would be fine too.
Things that haven't worked for me:
data.frame(t(myList)) # just gives a data frame with a column of vectors and another of scalars
cbind(t(myList)) # same as above
do.call(rbind, myList) # intersperses vectors and scalars
I realize other people have had similar problems, e.g. Convert list of vectors to data frame, but I can't quite find an example with this particular mix of vectors and scalars.
myList[1,] is a list of vectors, so you can combine them into a matrix with do.call and rbind. myList[2,] is a list of single scores, so you can combine them into a vector with unlist:
cbind(as.data.frame(do.call(rbind, myList[1,])), Score=unlist(myList[2,]))
# V1 V2 V3 V4 V5 Score
# 1 55 13 61 38 24 6.643597
# 2 37 1 57 8 40 5.683361
# 3 19 10 23 52 60 5.804305
# 4 39 47 60 14 3 6.675544
# 5 57 8 53 6 2 5.671495
# 6 40 8 21 17 13 5.884460
# 7 52 21 22 55 54 6.016852
# 8 12 7 1 2 14 6.662999
# 9 41 46 21 30 6 6.672396
# 10 46 31 11 44 32 6.441743
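If you want the "Index 1" ... "Index 5" headers from your example, you can rename the columns afterwards:
out <- cbind(as.data.frame(do.call(rbind, myList[1,])), Score = unlist(myList[2,]))
names(out)[1:5] <- paste("Index", 1:5)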

Ordering the x-axis in an R graph

I have a data.frame that looks like:
gvs order labels
1 -2.3321916 1 Adygei
2 -1.4996229 5 Basque
3 1.7958170 15 French
4 2.5543214 19 Italian
5 -2.7758460 33 Orcadian
6 -1.9659984 39 Russian
7 2.1239768 41 Sardinian
8 -1.8515908 47 Tuscan
9 -1.5597359 6 Bedouin
10 -1.2534511 14 Druze
11 -0.1625003 31 Mozabite
12 -1.0265275 35 Palestinian
13 -0.8519079 2 Balochi
14 -2.4279528 8 Brahui
15 -3.1717421 9 Burusho
16 -0.9258497 17 Hazara
17 -1.2207974 21 Kalash
18 -1.0325107 24 Makrani
19 -3.2102686 37 Pathan
20 -0.9377928 43 Sindhi
21 -1.7657017 48 Uygurf
22 -0.5058627 10 Cambodian
23 -0.7819299 12 Dai
24 -1.4095947 13 Daur
25 2.2810477 16 Han
26 -0.9007551 18 Hezhen
27 2.6614486 20 Japanese
28 -0.9441980 23 Lahu
29 -0.7237586 29 Miao
30 -0.9452944 30 Mongola
31 -1.2035258 32 Naxi
32 -0.7703779 34 Oroqen
33 -3.0895998 42 She
34 -0.7037952 45 Tu
35 -1.9311354 46 Tujia
36 -0.5423822 49 Xibo
37 -1.6244801 50 Yakut
38 -0.9049735 51 Yi
39 -2.6491331 11 Colombian
40 2.3706977 22 Karitiana
41 -2.7590587 26 Maya
42 -0.9614190 38 Pima
43 -1.6961014 44 Surui
44 -0.8449225 28 Melanesian
45 -1.1163019 36 Papuan
46 -0.9298674 3 BantuKenya
47 -2.8859587 4 BantuSouthAfrica
48 -1.4494841 7 BiakaPygmy
49 -0.7381369 25 Mandenka
50 -0.5644325 27 MbutiPygmy
51 -0.9195156 40 San
52 2.0949378 52 Yoruba
I would like to graph the column gvs along the x-axis in the order given by the column order, and have the label for each point along the x-axis taken from the column labels. Does anyone know how this is done? I want the graph to look like a less colorful version of the graphs in figure 5 of this paper: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004412
Based on your comments, it looks like (1) labels doesn't correspond to gvs and order, and (2) if I sort the first two columns by order, the data frame will be ordered properly. Please let me know if this is not correct.
Sort first two columns by order, leaving third column alone:
df[,c("gvs","order")] = df[order(df$order), c("gvs","order")]
Set the ordering of labels based on the current ordering of labels in the sample data frame:
df$labels = factor(df$labels, levels=df$labels)
Add a grouping variable for region. I did this by creating a new group each time the alphabetic ordering of labels went "backwards". The regions are just numbers here, but you can give them descriptive names if you want to use them:
df$group = c(0, cumsum(diff(match(substr(df$labels,1,1), LETTERS)) < 0))
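To see how that line works, here it is on a toy label vector (made up for illustration): a "backwards" jump in the first letters marks each group boundary:
labs <- c("Adygei", "Basque", "French", "Bedouin", "Druze", "Balochi")
c(0, cumsum(diff(match(substr(labs, 1, 1), LETTERS)) < 0))
# [1] 0 0 0 1 1 2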
Add fake p-values (since point size was based on p-value in the graph you linked to):
set.seed(595)
df$p.value = runif(nrow(df), 0, 0.5)
Plot the data, including a different color for each regional group, point size based on p-value, and black borders around points with p < 0.05. geom_line adds the regional means:
library(dplyr)
library(ggplot2)
ggplot(df, aes(labels, gvs, size=p.value, fill=factor(group))) +
  geom_line(data=df %>% group_by(group) %>% mutate(gvs=mean(gvs)),
            aes(group=group, colour=factor(group)), size=0.8, alpha=0.5) +
  geom_point(pch=21, stroke=1, aes(color=p.value<0.05)) +
  theme_bw() +
  theme(axis.text.x=element_text(angle=-90, hjust=0, vjust=0.5),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank()) +
  scale_size_continuous(name="p values", limits=c(0, 0.5), breaks=seq(0,1,0.1), range=c(4,1)) +
  scale_color_manual(values=c(hcl(seq(15,375,length.out=8),100,65)[1:7], NA, "black")) +
  labs(x="Language", fill="Region") +
  guides(colour=FALSE,
         size=guide_legend(reverse=TRUE, override.aes=list(color=NA, fill="grey50")),
         fill=guide_legend(reverse=TRUE, override.aes=list(color=NA, size=3)))
Read data frame:
df <- data.frame(gvs = c(-2.3321916, -1.4996229, 1.795817, 2.5543214, -2.775846, -1.9659984,
2.1239768, -1.8515908, -1.5597359, -1.2534511, -0.1625003, -1.0265275,
-0.8519079, -2.4279528, -3.1717421, -0.9258497, -1.2207974, -1.0325107,
-3.2102686, -0.9377928, -1.7657017, -0.5058627, -0.7819299, -1.4095947,
2.2810477, -0.9007551, 2.6614486, -0.944198, -0.7237586, -0.9452944,
-1.2035258, -0.7703779, -3.0895998, -0.7037952, -1.9311354, -0.5423822,
-1.6244801, -0.9049735, -2.6491331, 2.3706977, -2.7590587, -0.961419,
-1.6961014, -0.8449225, -1.1163019, -0.9298674, -2.8859587, -1.4494841,
-0.7381369, -0.5644325, -0.9195156, 2.0949378),
order = c(1L, 5L, 15L, 19L, 33L, 39L, 41L, 47L, 6L, 14L, 31L, 35L, 2L,
8L, 9L, 17L, 21L, 24L, 37L, 43L, 48L, 10L, 12L, 13L, 16L, 18L,
20L, 23L, 29L, 30L, 32L, 34L, 42L, 45L, 46L, 49L, 50L, 51L, 11L,
22L, 26L, 38L, 44L, 28L, 36L, 3L, 4L, 7L, 25L, 27L, 40L, 52L),
labels = c("Adygei", "Basque", "French", "Italian", "Orcadian", "Russian",
"Sardinian", "Tuscan", "Bedouin", "Druze", "Mozabite", "Palestinian",
"Balochi", "Brahui", "Burusho", "Hazara", "Kalash", "Makrani",
"Pathan", "Sindhi", "Uygurf", "Cambodian", "Dai", "Daur", "Han",
"Hezhen", "Japanese", "Lahu", "Miao", "Mongola", "Naxi", "Oroqen",
"She", "Tu", "Tujia", "Xibo", "Yakut", "Yi", "Colombian", "Karitiana",
"Maya", "Pima", "Surui", "Melanesian", "Papuan", "BantuKenya",
"BantuSouthAfrica", "BiakaPygmy", "Mandenka", "MbutiPygmy", "San",
"Yoruba"))
Order data
df.ordered <- df[ order(df$order) , ]
And here is some simple (ugly) sample plotting, which you can surely improve upon (maybe with ggplot):
plot(df.ordered$gvs, pch = 19)
axis(1, at=1:52, labels=df.ordered$labels, las=2)
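One small refinement (my suggestion, not part of the original answer): suppress the default numeric x-axis first, so the labels don't overplot it:
plot(df.ordered$gvs, pch = 19, xaxt = "n", xlab = "")
axis(1, at = 1:52, labels = df.ordered$labels, las = 2)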
Another option that doesn't rely on the sorting of the data frame is to use the limits parameter of a discrete scale (which, as a side benefit, allows more arbitrary ordering when plotting):
df <- read.csv('/path/to/file/df.csv')
xorder <- df[order(df$order), 'labels']
ggplot(df, aes(x=labels, y=gvs, size=gvs)) +
  geom_point() +
  scale_x_discrete(limits=xorder) +
  theme(axis.text.x=element_text(angle=90))

Can corr.test be used for any dataframe?

I am trying to get correlations and p-values between the variables in a dataframe (df1) using corr.test in the psych package. The variables in the dataframe are all integers and there are no NAs. But when I run corr.test(df1), I always get an error message:
Error in data.frame(lower = lower, r = r[lower.tri(r)], upper = upper, :
arguments imply differing number of rows: 0, 28
I tried to run the example (corr.test(sat.act)) in the psych package and there is no error.
I am new to R; can someone tell me what is wrong with the dataframe?
> head(df1)
S1.pre S2.pre S1.post S2.post V1.pre V2.pre V1.post V2.post
1 21 31 25 35 7 1 19 4
2 15 26 21 29 13 11 16 14
3 18 27 23 31 8 2 3 3
4 17 31 18 39 13 11 15 14
5 15 26 16 29 26 15 32 20
6 17 28 16 28 2 4 2 7
> dput(head(df1))
structure(list(S1.pre = c(21L, 15L, 18L, 17L, 15L, 17L), S2.pre = c(31L,
26L, 27L, 31L, 26L, 28L), S1.post = c(25L, 21L, 23L, 18L, 16L,
16L), S2.post = c(35L, 29L, 31L, 39L, 29L, 28L), V1.pre = c(7L,
13L, 8L, 13L, 26L, 2L), V2.pre = c(1L, 11L, 2L, 11L, 15L, 4L),
V1.post = c(19L, 16L, 3L, 15L, 32L, 2L), V2.post = c(4L,
14L, 3L, 14L, 20L, 7L)), .Names = c("S1.pre", "S2.pre", "S1.post",
"S2.post", "V1.pre", "V2.pre", "V1.post", "V2.post"), row.names = c(NA,
6L), class = "data.frame")
> sapply(df1, class)
S1.pre S2.pre S1.post S2.post V1.pre V2.pre V1.post V2.post
"integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer"
I contacted William Revelle, the author of the psych package, and here is what he said:
Mark,
Unfortunately you found a bug introduced in 1.4.3.
1.4.4 will go out to Cran this weekend.
In the meantime you can get the fix at http://personality-project.org/r (choose source from other repository if you are using a mac) or
http://personality-project.org/r/src/contrib and get the zip file if you are using a PC.
Otherwise, wait until next week.
Sorry about the problem.
It will still work as long as you have unequal number of subjects or some missing data.
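In the meantime, a base-R workaround sketch (not using the psych package) that builds the correlation matrix with cor() and a matching p-value matrix with pairwise cor.test():
r <- cor(df1)
p <- outer(seq_along(df1), seq_along(df1),
           Vectorize(function(i, j) cor.test(df1[[i]], df1[[j]])$p.value))
dimnames(p) <- list(names(df1), names(df1))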
