I have the following data structure:
iid<-c(rep("I1",5),rep("I2",5),rep("I3",5),rep("I4",5))
days<-rep(c(0,2,5,7,14),4)
estatus<-c(4,4,4,3,3,
5,4,4,4,3,
4,4,4,4,4,
5,4,4,3,2)
data<-as.data.frame(cbind(iid,days,estatus))
I'm interested in obtaining several outcomes, all related to changes in the variable "estatus".
First, I want to know which individuals (iid) have changed their status by 1 unit by day 5. I don't want to treat days as a factor: this is a simple example, but in the real dataset the days can differ between individuals, so I don't always have the same days.
The first outcome would look like this:
iid<-c("I1","I2","I3","I4")
res_5<-c(0,1,0,1)
results_1<-as.data.frame(cbind(iid,res_5))
I1 and I3 did not experience a change in their status of 1 unit.
The second outcome I'm interested in is the day on which each individual's status first changes by 1 unit. The outcome would look like:
iid<-c("I1","I2","I3","I4")
res_d<-c(7,2,NA,2)
results_2<-as.data.frame(cbind(iid,res_d))
I think that I got the first part of the problem, as I know how to aggregate by iid with tidyverse or dplyr. However, I don't know how to check if a certain row is 1, 2 or n units above or below the previous row.
Use by() to apply a function to each iid:
(i) find the index where days == 5 and compare the status there with the first element;
(ii) use diff() to compute the differences between consecutive elements of the vector, then look for a difference of 1 or -1.
iid <- c(rep("I1", 5), rep("I2", 5), rep("I3", 5), rep("I4", 5))
days <- rep(c(0, 2, 5, 7, 14), 4)
estatus <- c(
4, 4, 4, 3, 3,
5, 4, 4, 4, 3,
4, 4, 4, 4, 4,
5, 4, 4, 3, 2
)
data <- data.frame(iid = iid, days = days, estatus = estatus)
my_func1 <- function(x) {
ind5 <- which(x$days == 5)
d <- x$estatus[ind5] - x$estatus[1]
return((d == 1) | (d == -1))
}
by(data, data$iid, my_func1)
my_func2 <- function(x) {
d <- diff(x$estatus)
hasChangeOf1 <- (d == 1) | (d == -1)
return(x$days[which(hasChangeOf1)[1] + 1])
}
by(data, data$iid, my_func2)
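Since the question mentions that the real data may not contain a row with days == 5 for every individual, the first function could be made a little more tolerant. The following is only a sketch under that assumption: the helper name changed_by and its arguments cutoff and n are made up for illustration, and it compares the last observation on or before the cutoff day with the first observation.

```r
iid <- rep(c("I1", "I2", "I3", "I4"), each = 5)
days <- rep(c(0, 2, 5, 7, 14), 4)
estatus <- c(4, 4, 4, 3, 3,
             5, 4, 4, 4, 3,
             4, 4, 4, 4, 4,
             5, 4, 4, 3, 2)
data <- data.frame(iid = iid, days = days, estatus = estatus)

# Hypothetical helper: 1 if the status at the last observation on or before
# `cutoff` differs from the first observation by exactly `n` units, else 0
changed_by <- function(x, cutoff = 5, n = 1) {
  x <- x[order(x$days), ]
  early <- x$estatus[x$days <= cutoff]
  if (length(early) == 0) return(NA_integer_)  # no observation before the cutoff
  as.integer(abs(early[length(early)] - early[1]) == n)
}

res_5 <- sapply(split(data, data$iid), changed_by)
res_5
```

For the example data this gives 0, 1, 0, 1 for I1 to I4, which matches the first expected outcome in the question.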
I tried in this way, but it doesn't work, because I can't match the different cards in the function sum. How can I match the 4 cards?
suits <- c("Clubs", "Diamonds", "Hearts", "Spades")
cards <-c("Ace", 7:10, "Jack", "Queen", "King")
deck2 <- rep(cards, 4)
prob4cards <- function()
{
prob4cards <- sample(deck2, size= 5, replace = FALSE)
sum(prob4cards[,1] == prob4cards[,2] == prob4cards[,3]== prob4cards[,4])
}
The probability that 4 of the 5 cards drawn have the same rank can be estimated by simulation:
prob4cards <- function(deck) {
prob4cards <- sample(deck, size= 5, replace = FALSE)
any(table(prob4cards) == 4)
}
mean(replicate(10000, prob4cards(deck2)))
Increase the count in replicate to get more accurate results.
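As a sanity check on the simulation, the exact value can also be computed by counting hands. This sketch assumes the 32-card deck from the question (8 ranks, 4 cards per rank); the variable names are mine.

```r
n_ranks   <- 8                    # Ace, 7:10, Jack, Queen, King
per_rank  <- 4                    # one card of each rank per suit
deck_size <- n_ranks * per_rank   # 32 cards
hand_size <- 5

# Choose the rank that appears four times, take all four of its cards,
# then choose the fifth card from the remaining 28
exact <- n_ranks * choose(per_rank, 4) *
  choose(deck_size - per_rank, hand_size - 4) /
  choose(deck_size, hand_size)
exact
```

The result is 224 / 201376, roughly 0.0011. With only 10000 replicates you would expect about 11 hits, so the simulated estimate is quite noisy, which is why a larger count in replicate helps.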
I have data roughly in the following format. The real dataset is very large and is broken up into groups by the class and uniqueId variables. Each row is one location, given as an (x, y) pair.
df <-
data.frame(
x = c(1, 2, 3, 4, 5, 6, 8, 9, 10),
y = c(1, 2, 3, 4, 5, 6, 8, 9, 10),
class = c(0, 0, 0, 0, 0, 1, 0, 1, 0),
uniqueId = c("1-2-3", "1-2-3", "1-2-3", "1-2-4", "1-2-4", "1-2-4", "1-3-2", "1-3-2", "1-3-2"),
partialId = c("1.2", "1.2", "1.2", "1.2", "1.2", "1.2", "1.3", "1.3", "1.3")
)
The function I am using should go through the dataframe and calculate the smallest distance to another object within the same uniqueId but different class as the current row. To do this I've broken my data into chunks the following way.
indexes <-
df %>%
select(partialId) %>%
unique()
j <- 1
library(doParallel)
class_separation <- c()
cl <- makePSOCKcluster(24)
registerDoParallel(cl)
while(j <= nrow(indexes)) {
test <- df %>% filter(partialId == indexes$partialId[j])
n <- nrow(test)
vec <- numeric(n)
vec <- foreach(k = 1:n, .combine = 'c', .multicombine = F) %dopar% {
c(
min(
apply(
test[test$uniqueId == test$uniqueId[k] & test$class != test$class[k], c("x","y")],
1,
function(x) dist(rbind(c(test$x[k],test$y[k]), c(x[1], x[2])))
)
)
)
}
class_separation <- c(class_separation, vec)
j <- j + 1
}
endtime <- Sys.time()
stopwatch <- endtime - starttime
closeAllConnections()
registerDoSEQ()
gc()
df <- cbind(df, class_separation)
When handling single plays or small batches, this code seems to operate just fine. However, when handling the full dataset I am getting results that are obviously incorrect. I know there must be a flaw in how I am calculating these distances since there is very little chance the dist() function itself or %dopar% is at fault. I have changed to %do% and my results do not change.
As an example of the discrepancy, the following image shows the class_separation column from when the full data run is conducted vs when I feed it a small example. As you can see the results are wildly different, but I'm not sure why.
After a day of thinking about this, I found that the problem was in how I was passing my df to dist().
For example, if we intended to pass
dist(rbind(c(1, 1), c(6, 6)))
dist(rbind(c(1, 1), c(9, 9)))
What we actually pass is dist(rbind(c(1, 1), c(6, 6, 9, 9)))
This is obviously not what I want. I needed both distances, and then to select the minimum of them or add other conditions. The way I found to do this was the cdist() function from the rdist package.
foreach(i = 1:nrow(df), .combine = 'c', .multicombine = F, .packages = c('tidyverse',
'rdist')) %dopar% {
min(
cdist(
df[df$class != df$class[i] & df$uniqueId == df$uniqueId[i], ] %>% select(x, y),
df %>% select(x, y) %>% slice(i)
)
)
}
For our test data this returns the vector
Inf Inf Inf 2.828427 1.414214 1.414214 1.414214 1.414214 1.414214
Which is exactly what I needed. The first three entries have no class == 1 rows for their uniqueId, so they should return Inf. Row 4 is twice as far from row 6 as row 5 is, all three sharing the same uniqueId, and the point at (9, 9) is equally distant from the points at (8, 8) and (10, 10). I still need to test whether this solution is fast enough.
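For reference, the same vector can be reproduced on the test data without any extra packages. This is only a sequential sketch for checking results, not a replacement for the parallel version; the helper name nearest_other_class is made up, and it returns Inf explicitly when a row has no other-class row in its uniqueId.

```r
df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 8, 9, 10),
  y = c(1, 2, 3, 4, 5, 6, 8, 9, 10),
  class = c(0, 0, 0, 0, 0, 1, 0, 1, 0),
  uniqueId = c(rep("1-2-3", 3), rep("1-2-4", 3), rep("1-3-2", 3))
)

# Hypothetical helper: smallest Euclidean distance from row i to any row
# with the same uniqueId but a different class
nearest_other_class <- function(i, d) {
  other <- d$uniqueId == d$uniqueId[i] & d$class != d$class[i]
  if (!any(other)) return(Inf)  # nothing to compare against
  min(sqrt((d$x[other] - d$x[i])^2 + (d$y[other] - d$y[i])^2))
}

class_separation <- vapply(seq_len(nrow(df)), nearest_other_class,
                           numeric(1), d = df)
class_separation
```

Each distance is computed row against row, so a single point is never compared with a flattened vector of several points.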
I am working on an assignment for school. I need to transform the columns in a data frame using a for loop and the bcPower function from the car package. My data frame, named bb2.df, consists of 13 columns of baseball statistics for 337 players. The data is from:
http://ww2.amstat.org/publications/jse/datasets/baseball.dat.txt
I read the data in using:
bb.df <- read.fwf("baseball.dat.txt",widths=c(4,6,6,4,4,3,3,3,4,4,4,3,3,2,2,2,2,19))
And then I created a second data frame just for the numeric stats using:
bb2.df <- bb.df[,1:13]
library(car)
Then I unsuccessfully tried to build the for loop.
> bb2.df[[i]] <- bcPower(bb2.df[[i]],c)
> for (i in 1:ncol(bb2.df)) {
+ c <- coef(powerTransform(bb2.df[[i]]))
+ bb2.df[[i]] <- bcPower(bb2.df[[i]],c)
+ }
Error in bc1(out[, j], lambda[j]) :
First argument must be strictly positive.
The loop seems to transform the first three columns but stops.
What am I doing wrong?
This solution:
- tests whether a column appears to contain logical values and omits such columns from the transformation
- replaces zero values in the vectors with a small positive number, outside the range of the actual values
- stores the transformed values in a new data frame, retaining the column and row names
I have also tested all of the variables for normality before and after the transformation. I tried to find an interesting variable: one whose transformed version has a large p-value in the Shapiro test, and whose p-value also changed a lot. Finally, the interesting variable is scaled in both the original and transformed versions, and the two versions are overlaid on a density plot.
library(car); library(ggplot2); library(reshape2)
# see this link for column names and type hints
# http://ww2.amstat.org/publications/jse/datasets/baseball.txt
# add placeholder column for opening quotation mark
bb.df <-
read.fwf(
"http://ww2.amstat.org/publications/jse/datasets/baseball.dat.txt",
widths = c(4, 6, 6, 4, 4, 3, 3, 3, 4, 4, 4, 3, 3, 2, 2, 2, 2, 2, 17)
)
# remove placeholder column
bb.df <- bb.df[,-(ncol(bb.df) - 1)]
names(bb.df) <- make.names(
c(
'Salary', 'Batting average', 'OBP', 'runs', 'hits', 'doubles', 'triples',
'home runs', 'RBI', 'walks', 'strike-outs', 'stolen bases', 'errors',
"free agency eligibility", "free agent in 1991/2" ,
"arbitration eligibility", "arbitration in 1991/2", 'name'
)
)
# test for boolean/logical values... don't try to transform them
logicals.test <- apply(
bb.df,
MARGIN = 2,
FUN = function(one.col) {
asnumeric <- as.numeric(one.col)
aslogical <- as.logical(asnumeric)
renumeric <- as.numeric(aslogical)
matchflags <- renumeric == asnumeric
cant.be.logical <- any(!matchflags)
print(cant.be.logical)
}
)
logicals.test[is.na(logicals.test)] <- FALSE
probably.numeric <- bb.df[, logicals.test]
result <- apply(probably.numeric, MARGIN = 2, function(one.col)
{
# can't transform vectors containing non-positive values
# replace zeros with something small
non.zero <- one.col[one.col > 0]
small <- min(non.zero) / max(non.zero)
zeroless <- one.col
zeroless[zeroless == 0] <- small
c <- coef(powerTransform(zeroless))
transformation <- bcPower(zeroless, c)
return(transformation)
})
result <- as.data.frame(result)
row.names(result) <- bb.df$name
cols2test <- names(result)
normal.before <- sapply(cols2test, function(one.col) {
print(one.col)
temp <- shapiro.test(bb.df[, one.col])
return(temp$p.value)
})
normal.after <- sapply(cols2test, function(one.col) {
print(one.col)
temp <- shapiro.test(result[, one.col])
return(temp$p.value)
})
more.normal <- cbind.data.frame(normal.before, normal.after)
more.normal$more.normal <-
more.normal$normal.after / more.normal$normal.before
more.normal$interest <-
more.normal$normal.after * more.normal$more.normal
interesting <-
rownames(more.normal)[which.max(more.normal$interest)]
data2plot <-
cbind.data.frame(bb.df[, interesting], result[, interesting])
names(data2plot) <- c("original", "transformed")
data2plot <- scale(data2plot)
data2plot <- melt(data2plot)
names(data2plot) <- c("Var1", "dataset", interesting)
ggplot(data2plot, aes(x = data2plot[, 3], fill = dataset)) +
geom_density(alpha = 0.25) + xlab(interesting)
Original, incomplete answer:
I believe you're trying to do illegal power transformations (vectors including non-positive values, specifically zeros; vectors with no variance)
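To see why zeros are a problem, here is a minimal sketch of the standard Box-Cox formula written out by hand. The function box_cox is a made-up illustration, and I am assuming it matches what bcPower computes with its default settings; the point is only that log(x) and negative powers of x are not finite at zero, so the input has to be strictly positive.

```r
# Hypothetical hand-written Box-Cox transform, for illustration only
box_cox <- function(x, lambda) {
  if (any(x <= 0, na.rm = TRUE)) stop("x must be strictly positive")
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

transformed <- box_cox(c(1, 2, 4), lambda = 0.5)

# A vector containing a zero is rejected, much like the error in the question
msg <- tryCatch(box_cox(c(0, 1, 2), lambda = 0.5),
                error = function(e) conditionMessage(e))
msg
```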
The fact that you are copying bb.df into bb2.df and then overwriting is a sure sign that you should really be using apply.
This doesn't create a useful data frame, but it should get you started:
library(car)
bb.df <-
read.fwf(
"baseball.dat.txt",
widths = c(4, 6, 6, 4, 4, 3, 3, 3, 4, 4, 4, 3, 3, 2, 2, 2, 2, 19)
)
bb.df[bb.df == 0] <- NA
# skip last (text) col
for (i in 1:(ncol(bb.df) - 1)) {
print(i)
# use comma to indicate indexing by column
temp <- bb.df[, i]
temp[temp == 0] <- NA
temp <- temp[complete.cases(temp)]
if (length(unique(temp)) > 1) {
c <- coef(powerTransform(bb.df[, i]))
print(bcPower(bb.df[i], c))
} else {
print(paste0("column ", i, " is invariant"))
}
}
# apply solution
result <- apply(bb.df[,-ncol(bb.df)], MARGIN = 2, function(one.col)
{
temp <- one.col
temp[temp == 0] <- NA
temp <- temp[complete.cases(temp)]
if (length(unique(temp)) > 1) {
c <- coef(powerTransform(temp))
transformation <- bcPower(temp, c)
return(transformation)
} else
{
print("skipping invariant column")
return(NULL)
}
})
I wish to extract an increasing subsequence of a vector, starting from the first element. For example, from this vector:
a = c(2, 5, 4, 0, 1, 6, 8, 7)
...I'd like to return:
res = c(2, 5, 6, 8).
I thought I could use a loop, but I want to avoid it. Another attempt with sort:
a = c(2, 5, 4, 0, 1, 6, 8, 7)
ind = sort(a, index.return = TRUE)$ix
mat = (t(matrix(ind))[rep(1, length(ind)), ] - matrix(ind)[ , rep(1, length(ind))])
mat = ((mat*upper.tri(mat)) > 0) %*% rep(1, length(ind)) == (c(length(ind):1) - 1)
a[ind][mat]
Basically I sort the input vector and check if the indices verify the condition "no indices at the right hand side are lower" which means that there were no greater values beforehand.
But it seems a bit complicated and I wonder if there are easier/quicker solutions, or a pre-built function in R.
Thanks
One possibility would be to find the cumulative maxima of the vector, and then extract unique elements:
unique(cummax(a))
# [1] 2 5 6 8
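To make the intermediate step visible, and in case the positions of the kept elements are needed as well, the same idea can be sketched like this (the variable names are mine):

```r
a <- c(2, 5, 4, 0, 1, 6, 8, 7)

running_max <- cummax(a)     # 2 5 5 5 5 6 8 8
res <- unique(running_max)   # 2 5 6 8

# A new maximum appears exactly where the running maximum increases;
# the first element is always kept
idx <- which(c(TRUE, diff(running_max) > 0))
a[idx]
```

Here idx is 1, 2, 6, 7, and a[idx] gives the same values as unique(cummax(a)).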
The other answer is better, but I made this iterative function, which works as well. It repeatedly drops every element that is not greater than its predecessor until all consecutive differences are > 0.
increasing <- function (input_vec) {
while(!all(diff(input_vec) > 0)){
input_vec <- input_vec[c(1,diff(input_vec))>0]
}
input_vec
}