How to plot graphs through two loops - r

Though this problem has been 'solved' many times, it turns out there's always another problem.
Without the print function it runs with no errors, but with it I get the following:
Error in .subset2(x, i) : recursive indexing failed at level 2
I'm taking this to mean it doesn't like graphs being created in two layers of iteration? Changing the method to 'qplot(whatever:whatever)' hits exactly the same problem.
It's designed to print a graph for every pairing of the variables I'm looking at. There are too many of them to fit in a single picture, as with the pairs function, and I need to be able to see the actual variable names on the axes.
load("Transport_Survey.RData")
variables <- select(Transport, "InfOfReceievingWeather", "InfOfReceievingTraffic", "InfOfSeeingTraffic", "InfWeather.Ice", "InfWeather.Rain", "InfWeather.Wind", "InfWeather.Storm", "InfWeather.Snow", "InfWeather.Cold", "InfWeather.Warm", "InfWeather.DarkMorn", "InfWeather.DarkEve", "HomeParking", "WorkParking", "Disability", "Age", "CommuteFlexibility", "Gender", "PassionReduceCongest")
varnames <- list("InfOfReceivingWeather", "InfOfReceivingTraffic", "InfOfSeeingTraffic", "InfWeather.Ice", "InfWeather.Rain", "InfWeather.Wind", "InfWeather.Storm", "InfWeather.Snow", "InfWeather.Cold", "InfWeather.Warm", "InfWeather.DarkMorn", "InfWeather.DarkEve", "HomeParking", "WorkParking", "Disability", "Age", "CommuteFlexibility", "Gender", "PassionReduceCongest")
counterx = 1
countery = 1
for (a in variables) {
  for (b in variables) {
    print(ggplot(variables, mapping=aes(x=variables[[a]], y=variables[[b]],
                                        xlab=varnames[counterx], ylab=varnames[countery])) +
      geom_point())
    countery = countery+1
    counterx = counterx+1
  }
}
#variables2 <- select(Transport, one_of(InfOfReceivingWeather, InfOfReceivingTraffic, InfOfSeeingTraffic, InfWeather.Ice, InfWeather.Rain, InfWeather.Wind, InfWeather.Storm, InfWeather.Snow, InfWeather.Cold, InfWeather.Warm, InfWeather.DarkMorn, InfWeather.DarkEve, HomeParking, WorkParking, Disability, Age, CommuteFlexibility, Gender, PassionReduceCongest))
Here is a mini-data frame for reference, sampled from the columns I'm using:
structure(list(InfOfReceievingWeather = c(1, 1, 1, 1, 4), InfOfReceievingTraffic = c(1,
1, 1, 1, 4), InfOfSeeingTraffic = c(1, 1, 1, 1, 4), InfWeather.Ice = c(3,
1, 3, 5, 5), InfWeather.Rain = c(1, 1, 2, 2, 4), InfWeather.Wind = c(1,
1, 2, 2, 4), InfWeather.Storm = c(1, 1, 1, 2, 5), InfWeather.Snow = c(1,
1, 2, 5, 5), InfWeather.Cold = c(1, 1, 1, 2, 5), InfWeather.Warm = c(1,
1, 1, 1, 3), InfWeather.DarkMorn = c(1, 1, 1, 1, 1), InfWeather.DarkEve = c(1,
1, 1, 1, 1), HomeParking = c(1, 1, 3, 1, 1), WorkParking = c(1,
4, 4, 5, 4), Disability = c(1, 1, 1, 1, 1), Age = c(19, 45, 35,
40, 58), CommuteFlexibility = c(2, 1, 5, 1, 2), Gender = c(2,
2, 2, 2, 1), PassionReduceCongest = c(0, 0, 2, 0, 2)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))

You get an error in the assignment of your a and b. When you write for (a in variables), a becomes, in turn, the vector of values contained in each column of variables, not the column's name. Thus, in your aes mapping, when you call variables[[a]], you are effectively writing (for the first iteration of a in variables):
variables[[c(1, 1, 1, 1, 4)]] instead of variables[["InfOfReceievingWeather"]]. So it can't work.
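To see what recursive indexing means here, a minimal sketch with a small hypothetical data frame (df, p and q are illustrative names, not your data):
df <- data.frame(p = c(10, 20, 30), q = c(40, 50, 60))
df[["p"]]      # column lookup: c(10, 20, 30)
df[[c(1, 2)]]  # recursive indexing: df[[1]][[2]], i.e. 20
# df[[<a whole column of values>]] recurses several levels deep and fails,
# which is where the "recursive indexing failed" message comes from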
To get around this issue, you can choose between:
for (a in variables) {
  for (b in variables) {
    print(ggplot(variables, mapping=aes(x=a, y=b)) ...
or
for (a in 1:ncol(variables)) {
  for (b in 1:ncol(variables)) {
    print(ggplot(variables, mapping=aes(x=variables[[a]], y=variables[[b]])) ...
Although the first one seems simpler, I would prefer the second option, because it lets you recycle a and b as column indices to extract the column names of variables for xlab and ylab.
In the end, writing something like this should work:
for (a in 1:ncol(variables)) {
  for (b in 1:ncol(variables)) {
    print(ggplot(variables, mapping=aes(x=variables[[a]], y=variables[[b]])) +
      xlab(colnames(variables)[a]) +
      ylab(colnames(variables)[b]) +
      geom_point())
  }
}
Does that answer your question?

Related

"sample sizes in the longitudinal and event processes differ" in JointModel in r

I am trying to perform a joint model analysis with simulated data. I believe I have formatted the data properly, but I receive this error:
"Error in jointModel(lmeFitJ, coxFit, timeVar = "time.point") :
sample sizes in the longitudinal and event processes differ; maybe you forgot the cluster() argument."
I only see this mentioned in the source code for JM and in one brief and unresolved troubleshooting thread. Where have I messed up? Thank you for any help!
Minimal complete example with first 4 participants:
#required packages
library(readxl)  # library() loads one package at a time
library(nlme)
library(JM)
#long_data
long_data <- structure(list(particip.id = c(1, 1, 1, 1, 2, 2, 3, 4, 4, 4,
4), time.point = c(1, 2, 3, 4, 1, 2, 1, 1, 2, 3, 4), school4me = c("DPU",
"DPU", "DPU", "DPU", "DPU", "DPU", "DPU", "DPU", "DPU", "DPU",
"DPU"), hours.a = c(3, 3, 2, 3, 0, 0, 6, 10, 13, 16, 15), hours.b = c(4,
6, 0, 0, 0, 1, 3, 7, 15, 9, 10), enrolled = c(1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1), TimeQ = c(4, 4, 4, 4, 2.9369807105977, 2.9369807105977,
1.50240888306871, 4, 4, 4, 4)), row.names = c(NA, -11L), class = c("tbl_df",
"tbl", "data.frame"))
#short_data
short_data <- structure(list(particip.id = c(1, 2, 3, 4), time.point = c(3,
2, 3, 4), school4me = c("DPU", "DPU", "DPU", "DPU"), enrolled = c(0,
0, 0, 1), TimeQ = c(2.376576055, 1.152660467, 2.300307851, 4),
actual = c(1, 1, 1, 0)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
#Analysis
lmeFitJ <- lme(hours.a ~ time.point + time.point:school4me, data=long_data, random = ~time.point | particip.id)
coxFit <- coxph(Surv(TimeQ, actual) ~ school4me, data = short_data, x = TRUE)
fitJOINT <- jointModel(lmeFitJ, coxFit, timeVar = "time.point")
#analysis produces: "Error in jointModel(lmeFitJ, coxFit, timeVar = "time.point") : sample sizes in
#the longitudinal and event processes differ; maybe you forgot the cluster() argument."
In the source code you can find
if (is.null(survObject$model))
    stop("\nplease refit the Cox model including in the ",
         "call to coxph() the argument 'model = TRUE'.")
and
nT <- length(unique(idT))
if (LongFormat && is.null(survObject$model$cluster))
    stop("\nuse argument 'model = TRUE' and cluster() in coxph().")
Unfortunately, the longitudinal-process error occurs first, so you never see these messages:
("sample sizes in the longitudinal and event processes differ; ",
"maybe you forgot the cluster() argument.\n")
Try adding model = TRUE and cluster(particip.id) to your coxFit, i.e.
coxFit <- coxph(Surv(TimeQ, actual) ~ school4me + cluster(particip.id), data = short_data, x = TRUE, model = TRUE)
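Putting it together, the refit is just the original calls with those two arguments added:
coxFit <- coxph(Surv(TimeQ, actual) ~ school4me + cluster(particip.id),
                data = short_data, x = TRUE, model = TRUE)
fitJOINT <- jointModel(lmeFitJ, coxFit, timeVar = "time.point")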

R - run each time over one cell from one column over each cell in another column

I have a function whose input should be, each time, one cell from one column paired with each cell from another column.
I can do it with a loop; however, I'm looking to vectorize the process or otherwise make it faster. As it stands, it would take me days to finish.
Ideally it would use the tidyverse, but any help would be appreciated.
My loop looks like this:
results <- data.frame(
  pathSubject1 = as.character(),
  pathSubject2 = as.character())
i <- 1 #Counter first loop
j <- 1 #Counter second loop
#Loop over subject 1
for (i in 1:dim(df)[1]) { #Start of first loop
  #Loop over subject 2
  for (j in 1:dim(df)[1]) { #Start of second loop
    #calc my function for the subjects
    tempPercentSync <- myFunc(df$subject1[i], df$subject2[j])
    results <- rbind(
      results,
      data.frame(
        pathSubject1 = df$value[i],
        pathSubject2 = df$value[j],
        syncData = nest(tempPercentSync)))
  } #End second loop
} #End first loop
My example function:
myFunc <- function(x, y) {
  temp <- dplyr::inner_join(
    as.data.frame(x),
    as.data.frame(y),
    by = "Time")
  out <- as.data.frame(summary(temp))
}
Example of my dataset using dput:
structure(list(value = c("data/ExportECG/101_1_1_0/F010.feather",
"data/ExportECG/101_1_1_0/F020.feather"), ID = c(101, 101), run = c(1,
1), timeComing = c(1, 1), part = c(0, 0), paradigm = c("F010",
"F020"), group = c(1, 1), subject1 = list(structure(list(Time = c(0,
0.5, 1, 1.5, 2, 2.5), subject1 = c(9.73940345482368, 9.08451907157601,
8.42963468832833, 7.77475030508065, 7.11986592183298, 7.24395122629289
)), .Names = c("Time", "subject1"), row.names = c(NA, 6L), class = "data.frame"),
structure(list(Time = c(0, 0.5, 1, 1.5, 2, 2.5), subject1 = c(58.3471156751544,
75.9103303197856, 83.014068283342, 89.7923167579699, 88.6748903116088,
84.7651306939912)), .Names = c("Time", "subject1"), row.names = c(NA,
6L), class = "data.frame")), subject2 = list(structure(list(
Time = c(0, 0.5, 1, 1.5, 2, 2.5), subject2 = c(77.7776200371528,
77.4139420609906, 74.9760822165258, 75.3915183650012, 77.5672070195079,
80.7418145918357)), .Names = c("Time", "subject2"), row.names = c(NA,
6L), class = "data.frame"), structure(list(Time = c(0, 0.5, 1,
1.5, 2, 2.5), subject2 = c(101.133666720578, 105.010792226714,
107.01541987713, 104.471173834529, 97.5910271952943, 92.9840354003295
)), .Names = c("Time", "subject2"), row.names = c(NA, 6L), class = "data.frame"))), .Names = c("value",
"ID", "run", "timeComing", "part", "paradigm", "group", "subject1",
"subject2"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-2L))
The output should look like:
pathSubject1
1 data/ExportECG/101_1_1_0/F010.feather
2 data/ExportECG/101_1_1_0/F010.feather
3 data/ExportECG/101_1_1_0/F020.feather
4 data/ExportECG/101_1_1_0/F020.feather
pathSubject2
1 data/ExportECG/101_1_1_0/F010.feather
2 data/ExportECG/101_1_1_0/F020.feather
3 data/ExportECG/101_1_1_0/F010.feather
4 data/ExportECG/101_1_1_0/F020.feather
data
1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 20, 5, 17, 14, 8, 11, 21, 6, 19, 16, 10, 13, 22, 7, 18, 15, 9, 12
2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 21, 6, 17, 14, 8, 12, 22, 7, 19, 16, 10, 13, 20, 5, 18, 15, 9, 11
3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 20, 5, 17, 14, 8, 11, 21, 7, 19, 16, 10, 13, 22, 6, 18, 15, 9, 12
4 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 21, 6, 17, 14, 8, 12, 22, 7, 19, 16, 10, 13, 20, 5, 18, 15, 9, 11
Thank you!
I think you're looking for lapply (or a related function).
What's probably taking the most time is the rbind, because at each step in your loops the entire object results gets slightly larger, which means it gets fully copied. With lapply, all results are calculated first, and only then do you combine them with dplyr::bind_rows.
What you get is this:
results <- dplyr::bind_rows(lapply(1:dim(df)[1], function(i) {
  dplyr::bind_rows(lapply(1:dim(df)[1], function(j) {
    data.frame(pathSubject1 = df$value[i],
               pathSubject2 = df$value[j],
               syncData = tidyr::nest(myFunc(df$subject1[[i]], df$subject2[[j]])))
  }))
}))
Does that solve your problem?
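As a side note (my addition, not part of the answer above): the same double loop can also be flattened into a single pair table with base R's expand.grid and Map; the syncData column would be added exactly as in the lapply version:
# build every (i, j) pair once, then map over the pairs
idx <- expand.grid(i = seq_len(nrow(df)), j = seq_len(nrow(df)))
results <- dplyr::bind_rows(Map(function(i, j) {
  data.frame(pathSubject1 = df$value[i],
             pathSubject2 = df$value[j])
}, idx$i, idx$j))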
EDIT: speeding things up
I've edited it to use bind_rows instead of rbind_list, which is supposed to be faster. Furthermore, if you use [[i]] instead of [i] in the call to myFunc, you can drop the as.data.frame(x) there (and likewise for j and y).
Finally, you could optimize myFunc a bit by not assigning any intermediate results:
myFunc <- function(x, y) {
  as.data.frame(summary(dplyr::inner_join(x, y, by = "Time")))
}
But my gut feeling says these will be small differences. To gain more speedup we need to reduce the actual computations, and then it matters what your actual data is, and what you need for your results.
Some observations, based on your example:
Do we need separate data.frames? We compare all values in df$subject1 with those in df$subject2. In the example, first making one large data.frame for subject1 and another for subject2, with an extra label if needed, would speed up the join (see the sketch after these observations).
Why a join? Right now the summary of the join gives only information that we could have gotten without a join as well.
We join on Time, but in the example the timestamps for subject1 and subject2 are identical. A check whether they are the same, followed by simple copying, would be faster.
As the end result we have a data.frame with one column containing data.frames that hold the summary of the join. Is that the form you need? I think your code could be a lot faster if you only calculate the values you actually need. I haven't worked much with data.frames containing data.frames, but it could well be that bind_rows doesn't handle them efficiently. Maybe a simple list (as a column of your data.frame) would work better, as there's less overhead.
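A sketch of the first observation (my illustration, assuming every nested frame shares the Time column):
# stack the nested frames once, labelling which list element each row came from
big1 <- dplyr::bind_rows(df$subject1, .id = "id1")
big2 <- dplyr::bind_rows(df$subject2, .id = "id2")
# a single join on Time then yields every id1 x id2 pairing per time point
# (newer dplyr versions may warn about the many-to-many match)
joined <- dplyr::inner_join(big1, big2, by = "Time")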
Finally, it could be that you're unable to reveal more about your real data, or that it's too complicated.
In that case you could look around for profiling tools: functions that help show you where most time is being spent. Personally, I like the profvis tool.
Put print(profvis::profvis({ mycode }, interval=seconds)) around a block of code, and after it finishes execution you see which lines took the most time and which functions are called under the hood.
In the example code, almost all time is spent in the row binding and in making data.frames. But with real data, I expect other functions may be time-consuming.
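For instance, a minimal profvis sketch around the original nested loop could look like this (assuming the profvis package is installed; df and myFunc as defined in the question):
library(profvis)
profvis({
  results <- data.frame()
  for (i in 1:nrow(df)) {
    for (j in 1:nrow(df)) {
      tempPercentSync <- myFunc(df$subject1[[i]], df$subject2[[j]])
      results <- rbind(results,
                       data.frame(pathSubject1 = df$value[i],
                                  pathSubject2 = df$value[j]))
    }
  }
}, interval = 0.01)  # sampling interval in seconds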

if else in a loop in R

I want to create a variable region based on a series of similar variables zipid1 to zipid26. My current code is like this:
dat$region <- with(dat, ifelse(zipid1 == 1, 1,
                        ifelse(zipid2 == 1, 2,
                         ifelse(zipid3 == 1, 3,
                          ifelse(zipid4 == 1, 4,
                           5)))))
How can I write a loop to avoid typing from zipid1 to zipid26? Thanks!
We subset the 'zipid' columns, create a logical matrix by comparing with 1 (== 1), get the column index of the TRUE value with max.col (assuming there is only a single 1 in each row), and assign it to create 'region':
dat$region <- max.col(dat[paste0("zipid", 1:26)] == 1, "first")
Using a small reproducible example:
max.col(dat[paste0("zipid", 1:5)] == 1, "first")
data
dat <- data.frame(id = 1:5, zipid1 = c(1, 3, 2, 4, 5),
                  zipid2 = c(2, 1, 3, 5, 4), zipid3 = c(3, 2, 1, 5, 4),
                  zipid4 = c(4, 3, 6, 2, 1), zipid5 = c(5, 3, 8, 1, 4))
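For this small example the call picks, row by row, the column holding the 1 (worked out by eye from the data above: rows 1 to 5 have their 1 in zipid1, zipid2, zipid3, zipid5 and zipid4 respectively):
> max.col(dat[paste0("zipid", 1:5)] == 1, "first")
[1] 1 2 3 5 4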

Efficiently assigning a new column value based on multiple column conditions in R

I have a dataframe that contains information about many sellers' IDs and the periods in which they made a sale. I want to create a new column called Inactive, flagging sales after which the seller didn't sell again for the next 6 periods.
Here is the dput of a sample dataset:
structure(list(SellerID = c(1, 7, 4, 3, 1, 7, 4, 2, 5, 1, 2,
5, 7), Period = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 9, 9, 10, 10)), .Names = c("SellerID",
"Period"), row.names = c(NA, -13L), class = "data.frame")
Here is the dput of my ideal outcome (row 5 has an Inactive of 1 because, in that row, seller ID 1 made a sale at period 2 but his next sale was at period 9 [row 10]; he was therefore inactive for at least 6 periods, and we want to record that in order to predict when a seller will go inactive):
structure(list(SellerID = c(1, 7, 4, 3, 1, 7, 4, 2, 5, 1, 2,
5, 7), Period = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 9, 9, 10, 10), Inactive = c(0,
0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0)), .Names = c("SellerID",
"Period", "Inactive"), row.names = c(NA, -13L), class = "data.frame")
I tried solving this problem using a nested for-loop approach, but my dataset is very large (about 200,000 rows) and it would take a very long time to run. I also tried my approach on the sample dataset, but it doesn't seem to work. Here is my approach:
full.df$Inactive <- NA
for (i in 1:nrow(full.df)){
  # note: 'SellerID = ...' passes a named argument that subset() silently ignores,
  # so temp is the whole data frame (a '==' condition was probably intended)
  temp = subset(full.df, SellerID = unique(full.df$SellerID[i]))
  for(j in 1:(nrow(temp) - 1)){
    if(temp$Period[j+1] - temp$Period[j] < 6)
      temp$Inactive[j] <- 0
    else
      temp$Inactive[j] <- 1
  }
  full.df[rownames(full.df) %in% rownames(temp), ]$Inactive <- temp$Inactive
}
The output from the dummy dataset using my approach puts a 0 in every row of Inactive except the last row, which is NA. Here is the dput of the output I got:
structure(list(SellerID = c(1, 7, 4, 3, 1, 7, 4, 2, 5, 1, 2,
5, 7), Period = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 9, 9, 10, 10), Inactive = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA)), .Names = c("SellerID",
"Period", "Inactive"), row.names = c(NA, -13L), class = "data.frame")
I am assuming one thing here: the maximum value of the Period variable is 12.
Here is the logic: you order the data frame, then append 12 to the end of each seller's period list and take a difference. This also catches seller 3, who is inactive within the range of 7 periods.
df_s = df[with(df, order(SellerID, Period)),]
g = split(df_s$Period, df_s$SellerID)  # split the ordered frame so groups line up with df_s
l = lapply(g, function(x) c(x, 12))    # append the maximum period
j = lapply(l, diff)
u = unlist(j, use.names = FALSE)
df_s$ind = ifelse(u >= 7, 1, 0)
Using R --vanilla
# your input dataframe
d <- structure(list(SellerID = c(1, 7, 4, 3, 1, 7, 4, 2, 5, 1, 2,
5, 7), Period = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 9, 9, 10, 10)), .Names = c("SellerID",
"Period"), row.names = c(NA, -13L), class = "data.frame")
# your wanted output
o <- structure(list(SellerID = c(1, 7, 4, 3, 1, 7, 4, 2, 5, 1, 2,
5, 7), Period = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 9, 9, 10, 10), Inactive = c(0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0)), .Names = c("SellerID",
"Period", "Inactive"), row.names = c(NA, -13L), class = "data.frame")
# 6 steps solution, step by step using vanilla R
# step1. - add tmp key for final sorting
d$tmp.key <- seq_len(nrow(d))
# step 2. - split by individual seller id
d.tmp <- split(d,f=d$SellerID)
# step 3. - add inactive column to individual sellers
d.tmp <- lapply(d.tmp,
                function(x){
                  # Below, as.numeric is optional;
                  # it may stay logical as well.
                  # Also, sorting by Period (not used here)
                  # should be done (I am assuming it is sorted).
                  x$Inactive <- as.numeric(c(diff(x$Period) >= 6, FALSE))
                  x
                })
# step 4. - assemble again individual sellers back into one data.frame
d <- do.call(rbind,d.tmp)
# step 5. - sort to original order using temp.key
d <- d[order(d$tmp.key),c("SellerID","Period","Inactive")]
# step 6. - rename rows according the row order
rownames(d) <- NULL
# here I am just comparing with your wanted ideal
> identical(d,o)
[1] TRUE
For a data.frame with 1,000,000 rows and one seller, the runtime is roughly 1 second on an ordinary PC.
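For completeness, a tidyverse sketch of the same per-seller diff logic (my addition, not part of the answers above; note it reorders rows by seller, so keep a temporary key as in step 1 if you need the original order back):
library(dplyr)
# assuming d is the raw input data.frame defined at the top of this answer
d %>%
  group_by(SellerID) %>%
  arrange(Period, .by_group = TRUE) %>%
  mutate(Inactive = as.numeric(c(diff(Period) >= 6, FALSE))) %>%
  ungroup()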

ggplot complains about applying code to the data

I have a strange problem which I can't solve on my own.
This is what my original data looks like:
> dput(df[1:25])
structure(c(2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 3, 1, 2,
1, 1, 1, 1, 1, 2, 1, 1), .Names = c("AT1G01050", "AT1G01090",
"AT1G01320", "AT1G01470", "AT1G01560", "AT1G02150", "AT1G02560",
"AT1G02780", "AT1G02920", "AT1G03090", "AT1G03130", "AT1G03220",
"AT1G03230", "AT1G03330", "AT1G03630", "AT1G03680", "AT1G03870",
"AT1G04080", "AT1G04170", "AT1G04270", "AT1G04410", "AT1G04480",
"AT1G04690", "AT1G04710", "AT1G04810"))
I try to create a ggplot using the code below:
dat <- as.data.frame(df[1:25])
dat$factor <- 1
ggplot(dat, aes(x=factor, fill=factor(dat))) +
  geom_bar(binwidth=5) +
  coord_flip()
and as a result I got this error:
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
but...
when I tried to use the same code on the example data, which has exactly the same structure (at least in my opinion), it worked fine yesterday, but right now it does not... Both data sets are 'Named numeric':
> dput(data)
structure(c(2, 8, 3, 4, 1, 2, 3, 5, 4, 7, 6, 4, 1), .Names = c("Mark",
"Greg", "Sonya", "Monica", "Tiana", "Arra", "Armin", "Hera",
"Cyrus", "Pier", "Tina", "Hector", "Markus"))
I used your example data and it works fine for both. I believe this is what you are looking for:
dat <- as.data.frame(data)
dat$factor <- 1
ggplot(dat, aes(x=as.numeric(factor), fill=factor(dat[,1]))) +
  geom_bar(binwidth=5) +
  coord_flip()
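The likely root cause (my reading, not spelled out above): in the failing call, fill=factor(dat) applies factor() to the whole data frame, and a data frame is a list, which is exactly what sort.list() refuses to sort. A minimal sketch:
dat <- data.frame(a = c(2, 2, 1))
# factor(dat)     # Error in sort.list(y) : 'x' must be atomic for 'sort.list'
factor(dat[, 1])  # works: extract the atomic column first, as in dat[,1] above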
