Can corr.test be used for any dataframe? - r

I am trying to find get correlations and p-values between variables in a dataframe (df1) using corr.test in the psych package. The variables in the dataframe are all integers and there is no NAs. But when I run the corr.test(df1), there is always a error message.
Error in data.frame(lower = lower, r = r[lower.tri(r)], upper = upper, :
arguments imply differing number of rows: 0, 28
I tried to run the example (corr.test(sat.act)) in the psych package and there is no error.
I am new to R, can someone tell me what is wrong with the dataframe.
> head(df1)
S1.pre S2.pre S1.post S2.post V1.pre V2.pre V1.post V2.post
1 21 31 25 35 7 1 19 4
2 15 26 21 29 13 11 16 14
3 18 27 23 31 8 2 3 3
4 17 31 18 39 13 11 15 14
5 15 26 16 29 26 15 32 20
6 17 28 16 28 2 4 2 7
> dput(head(df1))
structure(list(S1.pre = c(21L, 15L, 18L, 17L, 15L, 17L), S2.pre = c(31L,
26L, 27L, 31L, 26L, 28L), S1.post = c(25L, 21L, 23L, 18L, 16L,
16L), S2.post = c(35L, 29L, 31L, 39L, 29L, 28L), V1.pre = c(7L,
13L, 8L, 13L, 26L, 2L), V2.pre = c(1L, 11L, 2L, 11L, 15L, 4L),
V1.post = c(19L, 16L, 3L, 15L, 32L, 2L), V2.post = c(4L,
14L, 3L, 14L, 20L, 7L)), .Names = c("S1.pre", "S2.pre", "S1.post",
"S2.post", "V1.pre", "V2.pre", "V1.post", "V2.post"), row.names = c(NA,
6L), class = "data.frame")
> sapply(df1, class)
S1.pre S2.pre S1.post S2.post V1.pre V2.pre V1.post V2.post
"integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer"

I contacted William Revelle - author of the psych package and here is what he said:
Mark,
Unfotunately you found a bug introduced into 1.4.3.
1.4.4 will go out to Cran this weekend.
In the meantime you can get the fix at http://personality-project.org/r (choose source from other repository if you are using a mac) or
http://personality-project.org/r/src/contrib and get the zip file if you are using a PC.
Otherwise, wait until next week.
Sorry about the problem.
It will still work as long as you have unequal number of subjects or some missing data.

Related

R adding columns and data

I have a table with two columns A and B. I want to create a new table with two new columns added: X and Y. These two new columns are to contain data from column A, but every second row from column A. Correspondingly for column X, starting from the first value in column A and from the second value in column A for column Y.
So far, I have been doing it in Excel. But now I need it in R best function form so that I can easily reuse that code. I haven't done this in R yet, so I am asking for help.
Example data:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)), class = "data.frame", row.names = c(NA,
-10L))
Sample result:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L), X = c(2L,
NA, 5L, NA, 54L, NA, 34L, NA, 10L, NA), Y = c(NA, 7L, NA, 11L,
NA, 12L, NA, 14L, NA, 6L)), class = "data.frame", row.names = c(NA,
-10L))
It is not a super elegant solution, but it works:
exampleDF <- structure(list(A = c(2L, 7L, 5L, 11L, 54L,
12L, 34L, 14L, 10L, 6L),
B = c(3L, 5L, 1L, 21L, 67L,
32L, 19L, 24L, 44L, 37L)),
class = "data.frame", row.names = c(NA, -10L))
index <- seq(from = 1, to = nrow(exampleDF), by = 2)
exampleDF$X <- NA
exampleDF$X[index] <- exampleDF$A[index]
exampleDF$Y <- exampleDF$A
exampleDF$Y[index] <- NA
You could also make use of the row numbers and the modulo operator:
A simple ifelse way:
library(dplyr)
df |>
mutate(X = ifelse(row_number() %% 2 == 1, A, NA),
Y = ifelse(row_number() %% 2 == 0, A, NA))
Or using pivoting:
library(dplyr)
library(tidyr)
df |>
mutate(name = ifelse(row_number() %% 2 == 1, "X", "Y"),
value = A) |>
pivot_wider()
A function using the first approach could look like:
See comment
xy_fun <- function(data, A = A, X = X, Y = Y) {
data |>
mutate({{X}} := ifelse(row_number() %% 2 == 1, {{A}}, NA),
{{Y}} := ifelse(row_number() %% 2 == 0, {{A}}, NA))
}
xy_fun(df, # Your data
A, # The col to take values from
X, # The column name of the first new column
Y # The column name of the second new column
)
Output:
A B X Y
1 2 3 2 NA
2 7 5 NA 7
3 5 1 5 NA
4 11 21 NA 11
5 54 67 54 NA
6 12 32 NA 12
7 34 19 34 NA
8 14 24 NA 14
9 10 44 10 NA
10 6 37 NA 6
Data stored as df:
df <- structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L, 6L),
B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)
),
class = "data.frame",
row.names = c(NA, -10L)
)
I like the #harre approach:
Another approach with base R we could ->
Use R's recycling ability (of a shorter-vector to a longer-vector):
df$X <- df$A
df$Y <- df$B
df$X[c(FALSE, TRUE)] <- NA
df$Y[c(TRUE, FALSE)] <- NA
df
A B X Y
1 2 3 2 NA
2 7 5 NA 5
3 5 1 5 NA
4 11 21 NA 21
5 54 67 54 NA
6 12 32 NA 32
7 34 19 34 NA
8 14 24 NA 24
9 10 44 10 NA
10 6 37 NA 37

Finding columns that contain values based on another column

I have the following data frame:
Step 1 2 3
1 5 10 6
2 5 11 5
3 5 13 9
4 5 15 10
5 13 18 10
6 15 20 10
7 17 23 10
8 19 25 10
9 21 27 13
10 23 30 7
I would like to retrieve the columns that satisfy one of the following conditions: if step 1 = step 4 or step 4 = step 8. In this case, column 1 and 3 should be retrieved. Column 1 because the value at Step 1 = value at step 4 (i.e., 5), and for column 3, the value at step 4 = value at step 8 (i.e., 10).
I don't know how to do that in R. Can someone help me please?
You can get the column indices by the following code:
df[1, -1] == df[4, -1] | df[4, -1] == df[8, -1]
# X1 X2 X3
# 1 TRUE FALSE TRUE
# data
df <- structure(list(Step = 1:10, X1 = c(5L, 5L, 5L, 5L, 13L, 15L,
17L, 19L, 21L, 23L), X2 = c(10L, 11L, 13L, 15L, 18L, 20L, 23L,
25L, 27L, 30L), X3 = c(6L, 5L, 9L, 10L, 10L, 10L, 10L, 10L, 13L,
7L)), class = "data.frame", row.names = c(NA, -10L))

Loops with random sampling from a matrix and distance calculation

I got a list of nodes, and I need to randomly assign 'p' hubs to 'n' clients.
I got the following data, where the first row shows:
The total number of nodes.
The requested number of hubs.
The total supply capacity for each hub.
The following lines show:
The first column the node number.
The second column the "x" coordinate.
The third the "y" coordinate.
Below I will show the raw data, adding colnames() it would look something like this:
total_nodes hubs_required total_capacity
50 5 120
node number x_coordinate y_coordinate node_demand
1 2 62 3
2 80 25 14
3 36 88 1
4 57 23 14
. . . .
. . . .
. . . .
50 1 58 2
The x and y values are provided so we can calculate the Euclidean distance.
nodes:
50 5 120
1 2 62 3
2 80 25 14
3 36 88 1
4 57 23 14
5 33 17 19
6 76 43 2
7 77 85 14
8 94 6 6
9 89 11 7
10 59 72 6
11 39 82 10
12 87 24 18
13 44 76 3
14 2 83 6
15 19 43 20
16 5 27 4
17 58 72 14
18 14 50 11
19 43 18 19
20 87 7 15
21 11 56 15
22 31 16 4
23 51 94 13
24 55 13 13
25 84 57 5
26 12 2 16
27 53 33 3
28 53 10 7
29 33 32 14
30 69 67 17
31 43 5 3
32 10 75 3
33 8 26 12
34 3 1 14
35 96 22 20
36 6 48 13
37 59 22 10
38 66 69 9
39 22 50 6
40 75 21 18
41 4 81 7
42 41 97 20
43 92 34 9
44 12 64 1
45 60 84 8
46 35 100 5
47 38 2 1
48 9 9 7
49 54 59 9
50 1 58 2
I extracted the information from the first line.
nodes <- as.matrix(read.table(data))
header<-colnames(nodes)
clean_header <-gsub('X','',header)
requested_hubs <- as.numeric(clean_header[2])
max_supply_capacity <- as.numeric(clean_header[3])
I need to randomly select 5 nodes, that will act as hubs
set.seed(37)
node_to_hub <-nodes[sample(nrow(nodes),requested_hubs,replace = FALSE),]
Then randomly I need to assign nodes to each hub calculate the distances between the hub and each one of the nodes and when the max_supply_capacity(120) is exceeded select the following hub and repeat the process.
After the final iteration I need to return the cumulative sum of distances for all the hubs.
I need to repeat this process 100 times and return the min() value of the cumulative sum of distances.
This is where I'm completely stuck since I'm not sure how to loop through a matrix let alone when I have to select elements randomly.
I got the following elements:
capacity <- c(numeric()) # needs to be <= to 120
distance_sum <- c(numeric())
global_hub_distance <- c(numeric())
The formula for the euclidean distance (rounded) would be as below but I'm not sure how I can reflect the random selection when assigning nodes.
distance <-round(sqrt(((node_to_hub[i,2]-nodes[i,2]))^2+(node_to_hub[random,3]-nodes[random,3])^2))
The idea for the loop I think I need is below, but as I mentioned before I don't know how to deal with the sample client selection, and the distance calculation of the random clients.
for(i in 1:100){
node_to_hub
for(i in 1:nrow(node_to_hub){
#Should I randomly sample the clients here???
while(capacity < 120){
node_demand <- nodes[**random**,3]
distance <-round(sqrt(((node_to_hub[i,2]-nodes[i,2]))^2+(node_to_hub[**random**,3]-nodes[**random**,3])^2))
capacity <-c(capacity, node_demand)
distance_sum <- c(distance_sum,distance)
}
global_hub_distance <- c(global_hub_distance,distance_sum)
capacity <- 0
distance_sum <- 0
}
min(global_hub_distance)
}
Not EXACTLY sure what you are looking for but this code may be able to help you. It's not extremely fast, as instead of using a while to stop after hitting your total_capacity it just does a cumsum on the full node list and find the place where you exceed 120.
nodes <- structure(list(node_number = 1:50,
x = c(2L, 80L, 36L, 57L, 33L, 76L, 77L, 94L,
89L, 59L, 39L, 87L, 44L, 2L, 19L, 5L,
58L, 14L, 43L, 87L, 11L, 31L, 51L, 55L,
84L, 12L, 53L, 53L, 33L, 69L, 43L, 10L,
8L, 3L, 96L, 6L, 59L, 66L, 22L, 75L, 4L,
41L, 92L, 12L, 60L, 35L, 38L, 9L, 54L, 1L),
y = c(62L, 25L, 88L, 23L, 17L, 43L, 85L, 6L, 11L,
72L, 82L, 24L, 76L, 83L, 43L, 27L, 72L, 50L,
18L, 7L, 56L, 16L, 94L, 13L, 57L, 2L, 33L, 10L,
32L, 67L, 5L, 75L, 26L, 1L, 22L, 48L, 22L, 69L,
50L, 21L, 81L, 97L, 34L, 64L, 84L, 100L, 2L, 9L, 59L, 58L),
node_demand = c(3L, 14L, 1L, 14L, 19L, 2L, 14L, 6L,
7L, 6L, 10L, 18L, 3L, 6L, 20L, 4L,
14L, 11L, 19L, 15L, 15L, 4L, 13L,
13L, 5L, 16L, 3L, 7L, 14L, 17L,
3L, 3L, 12L, 14L, 20L, 13L, 10L,
9L, 6L, 18L, 7L, 20L, 9L, 1L, 8L,
5L, 1L, 7L, 9L, 2L)),
.Names = c("node_number", "x", "y", "node_demand"),
class = "data.frame", row.names = c(NA, -50L))
total_nodes = nrow(nodes)
hubs_required = 5
total_capacity = 120
iterations <- 100
track_sums <- matrix(NA, nrow = iterations, ncol = hubs_required)
colnames(track_sums) <- paste0("demand_at_hub",1:hubs_required)
And then I prefer using a function for distance, in this case A and B are 2 separate vectors with c(x,y) and c(x,y).
euc.dist <- function(A, B) round(sqrt(sum((A - B) ^ 2))) # distances
The Loop:
for(i in 1:iterations){
# random hub selection
hubs <- nodes[sample(1:total_nodes, hubs_required, replace = FALSE),]
for(h in 1:hubs_required){
# sample the nodes into a random order
random_nodes <- nodes[sample(1:nrow(nodes), size = nrow(nodes), replace = FALSE),]
# cumulative sum their demand, and get which number passes 120,
# and subtract 1 to get the node before that
last <- which(cumsum(random_nodes$node_demand) > total_capacity) [1] - 1
# get sum of all distances to those nodes (1 though the last)
all_distances <- apply(random_nodes[1:last,], 1, function(rn) {
euc.dist(A = hubs[h,c("x","y")],
B = rn[c("x","y")])
})
track_sums[i,h] <- sum(all_distances)
}
}
min(rowSums(track_sums))
EDIT
as a function:
hubnode <- function(nodes, hubs_required = 5, total_capacity = 120, iterations = 10){
# initialize results matrices
track_sums <- node_count <- matrix(NA, nrow = iterations, ncol = hubs_required)
colnames(track_sums) <- paste0("demand_at_hub",1:hubs_required)
colnames(node_count) <- paste0("nodes_at_hub",1:hubs_required)
# user defined distance function (only exists wihtin hubnode() function)
euc.dist <- function(A, B) round(sqrt(sum((A - B) ^ 2)))
for(i in 1:iterations){
# random hub selection
assigned_hubs <- sample(1:nrow(nodes), hubs_required, replace = FALSE)
hubs <- nodes[assigned_hubs,]
assigned_nodes <- NULL
for(h in 1:hubs_required){
# sample the nodes into a random order
assigned_nodes <- sample((1:nrow(nodes))[-assigned_hubs], replace = FALSE)
random_nodes <- nodes[assigned_nodes,]
# cumulative sum their demand, and get which number passes 120,
# and subtract 1 to get the node before that
last <- which(cumsum(random_nodes$node_demand) > total_capacity) [1] - 1
# if there are none
if(is.na(last)) last = nrow(random_nodes)
node_count[i,h] <- last
# get sum of all distances to those nodes (1 though the last)
all_distances <- apply(random_nodes[1:last,], 1, function(rn) {
euc.dist(A = hubs[h,c("x","y")],
B = rn[c("x","y")])
})
track_sums[i,h] <- sum(all_distances)
}
}
return(list(track_sums = track_sums, node_count = node_count))
}
output <- hubnode(nodes, iterations = 100)
node_count <- output$node_count
track_sums <- output$track_sums
plot(rowSums(node_count),
rowSums(track_sums), xlab = "Node Count", ylab = "Total Demand", main = paste("Result of", 100, "iterations"))
min(rowSums(track_sums))

Converting a list of vectors and numbers (from replicate) into a data frame

After running the replicate() function [a close relative of lapply()] on some data I ended up with an output that looks like this
myList <- structure(list(c(55L, 13L, 61L, 38L, 24L), 6.6435972422341, c(37L, 1L, 57L, 8L, 40L), 5.68336098665417, c(19L, 10L, 23L, 52L, 60L ),
5.80430476680636, c(39L, 47L, 60L, 14L, 3L), 6.67554407822367,
c(57L, 8L, 53L, 6L, 2L), 5.67149520387856, c(40L, 8L, 21L,
17L, 13L), 5.88446015238962, c(52L, 21L, 22L, 55L, 54L),
6.01685181395007, c(12L, 7L, 1L, 2L, 14L), 6.66299948053721,
c(41L, 46L, 21L, 30L, 6L), 6.67239635545512, c(46L, 31L,
11L, 44L, 32L), 6.44174324641076), .Dim = c(2L, 10L), .Dimnames = list(
c("reps", "score"), NULL))
In this case the vectors of integers are indexes that went into a function that I won't get into and the scalar-floats are scores.
I'd like a data frame that looks like
Index 1 Index 2 Index 3 Index 4 Index 5 Score
55 13 61 38 24 6.64
37 1 57 8 40 5.68
19 10 23 52 60 5.80
and so on.
Alternatively, a matrix of the indexes and an array of the values would be fine too.
Things that haven't worked for me.
data.frame(t(random.out)) # just gives a data frame with a column of vectors and another of scalars
cbind(t(random.out)) # same as above
do.call(rbind, random.out) # intersperses vectors and scalars
I realize other people have similar problems,
eg. Convert list of vectors to data frame
but I can't quite find an example with this particular kind of vectors and scalars together.
myList[1,] is a list of vectors, so you can combine them into a matrix with do.call and rbind. myList[2,] is a list of single scores, so you can combine them into a vector with unlist:
cbind(as.data.frame(do.call(rbind, myList[1,])), Score=unlist(myList[2,]))
# V1 V2 V3 V4 V5 Score
# 1 55 13 61 38 24 6.643597
# 2 37 1 57 8 40 5.683361
# 3 19 10 23 52 60 5.804305
# 4 39 47 60 14 3 6.675544
# 5 57 8 53 6 2 5.671495
# 6 40 8 21 17 13 5.884460
# 7 52 21 22 55 54 6.016852
# 8 12 7 1 2 14 6.662999
# 9 41 46 21 30 6 6.672396
# 10 46 31 11 44 32 6.441743

Identify periods of high counts in timeseries data

Using the example dataframe:
count.bouts <-structure(list(time.stamp = structure(c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L,
20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 28L, 29L, 30L, 31L,
32L, 33L, 34L, 35L, 36L, 37L), .Label = c("13:00:00", "13:00:10",
"13:00:20", "13:00:30", "13:00:40", "13:00:50", "13:01:00", "13:01:10",
"13:01:20", "13:01:30", "13:01:40", "13:01:50", "13:02:00", "13:02:10",
"13:02:20", "13:02:30", "13:02:40", "13:02:50", "13:03:00", "13:03:10",
"13:03:20", "13:03:30", "13:03:40", "13:03:50", "13:04:00", "13:04:10",
"13:04:20", "13:04:30", "13:04:40", "13:04:50", "13:05:00", "13:05:10",
"13:05:20", "13:05:30", "13:05:40", "13:05:50", "13:06:00"), class = "factor"),
count = c(5L, 11L, 16L, 19L, 15L, 11L, 8L, 5L, 2L, 6L, 12L,
15L, 20L, 12L, 6L, 2L, 18L, 25L, 26L, 15L, 13L, 6L, 5L, 4L,
8L, 9L, 16L, 26L, 29L, 55L, 21L, 6L, 9L, 28L, 16L, 19L, 26L,
5L)), .Names = c("time.stamp", "count"), class = "data.frame", row.names = c(NA,
-38L))
I wish to create a function that would identify bouts of high count activity that fulfils the following criteria:
Count data that is greater or equal to 10 for 1 minute or more
Within this period (or bout) of high counts, I would allow count data to drop to under 10 for a maximum of 20 seconds (within the bout)
Data that fulfils this criteria I would wish to be highlighted in the dataset by adding an extra column (called "1min+.bouts") to the dataframe. Then each bout would be identified with a number starting from 1 - i.e. the dataframe described above would have a series of 1s for the first bout (13:01:40 to 13:03:20) and then 2s for the second bout (13:04:20 to 13:05:50). 0s would be added to those rows with no bouts.
I hope that makes sense. If anyone could possible point me in the right direction re. packages or functions that would help me out, I should be most grateful.
This assumes that there are no NA values:
#which counts are >= 10
tmp <- count.bouts$count >= 10
#substitute FALSE with NA, so we can use na.approx for interpolation
tmp[!tmp] <- NA
library(zoo)
#fill gaps of up to two values
tmp <- na.approx(tmp, method = "constant", maxgap = 2, na.rm = FALSE)
#NA --> 0
tmp[is.na(tmp)] <- 0
#run lengths
tmp <- rle(tmp)
#we don't want run lengths shorter one minute
tmp$values[tmp$lengths < 6] <- 0
#number the run lengths we are interested in
tmp$values <- cumsum(tmp$values) * tmp$values
#inverse run length encoding
count.bouts$bout <- inverse.rle(tmp)
# time.stamp count bout
#1 13:00:00 5 0
#2 13:00:10 11 0
#3 13:00:20 16 0
#4 13:00:30 19 0
#5 13:00:40 15 0
#6 13:00:50 11 0
#7 13:01:00 8 0
#8 13:01:10 5 0
#9 13:01:20 2 0
#10 13:01:30 6 0
#11 13:01:40 12 1
#12 13:01:50 15 1
#13 13:02:00 20 1
#14 13:02:10 12 1
#15 13:02:20 6 1
#16 13:02:30 2 1
#17 13:02:40 18 1
#18 13:02:50 25 1
#19 13:03:00 26 1
#20 13:03:10 15 1
#21 13:03:20 13 1
#22 13:03:30 6 0
#23 13:03:40 5 0
#24 13:03:50 4 0
#25 13:04:00 8 0
#26 13:04:10 9 0
#27 13:04:20 16 2
#28 13:04:30 26 2
#29 13:04:30 29 2
#30 13:04:40 55 2
#31 13:04:50 21 2
#32 13:05:00 6 2
#33 13:05:10 9 2
#34 13:05:20 28 2
#35 13:05:30 16 2
#36 13:05:40 19 2
#37 13:05:50 26 2
#38 13:06:00 5 0

Resources