How to select columns in R with colSums as condition? - r

I have looked for a similar question, but none of them is quite like this one.
I am trying to filter my data table with colSums. That is, if colSums gives a certain amount for a column (let's say under 5000), I want to include or exclude that column, and I want to repeat this with a loop or apply so that it covers the whole data table. This shouldn't be that hard, but I'm not sure what I'm doing wrong; maybe someone here can help.
Below is a representation of my data, along with my code. I used the dput function to represent the data.
I have tried many different versions of the code, but none of them has worked. I think this is the closest, but when I run the line below, it gives me this error message: "Error: expecting a one sided formula, a function, or a function name."
I have been using the dplyr package, but everything else should be base functions.
> dput(data999[1:2, ])
KER000_349094 = c(0.1806,
0.1806), KER000_349085 = c(0.1832, 0.1832), KER000_351771 = c(0.1858,
0.1858), KER000_60103549 = c(0.1034, 0.1034), KER000_391452 = c(0.0016,
0.0016), KER000_345696 = c(0.1718, 0.1718), KER000_342793 = c(0.189230769230769,
0.189230769230769), KER000_345615 = c(0.0165384615384615,
0.0165384615384615), KER000_344065 = c(0.0592307692307692,
0.0592307692307692), KER000_353687 = c(0.188076923076923,
0.188076923076923), KER000_340589 = c(2.44, 2.44), KER000_346489 = c(0,
0), KER000_348357 = c(0.16, 0.16), KER000_363845 = c(3.135,
3.135), KER000_60029018 = c(0.115, 0.115), KER000_341255 = c(0,
0)), row.names = 1:2, class = "data.frame")
jeejee = apply(data999, 2, function(x) select_if(colSums(x <= 5000)))

Copying my comment, since it seems to be the answer.
data999[,colSums(data999)<=5000]
to select all columns whose sum is <= 5000.
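A minimal sketch of that subsetting on made-up data (the toy data frame and the names keep, small_cols and large_cols are assumptions for illustration, not the asker's data). colSums already works on the whole data frame at once, so no loop or apply is needed; drop = FALSE keeps the result a data frame even if only one column survives.

```r
# Toy data standing in for data999 (values are invented for illustration)
data999 <- data.frame(a = c(1000, 2000), b = c(4000, 3000), c = c(0.1, 0.2))

# One logical per column: TRUE where the column sum is at most 5000
keep <- colSums(data999) <= 5000

# Include or exclude columns with the logical vector
small_cols <- data999[, keep, drop = FALSE]
large_cols <- data999[, !keep, drop = FALSE]

# A dplyr equivalent (dplyr >= 1.0.0) would be:
# dplyr::select(data999, where(~ sum(.x) <= 5000))
```

If the data frame can contain non-numeric columns, colSums will fail, so those columns would need to be dropped or handled separately first.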

Related

Rolling Sample standard deviation in R

I wanted to get the standard deviation of the 3 previous rows of the data, the present row and the 3 rows after.
This is my attempt:
mutate(ming_STDDEV_SAMP = zoo::rollapply(ming_f, list(c(-3:3)), sd, fill = 0)) %>%
Result
ming_f         ming_STDDEV_SAMP
4.235279667    0.222740262
4.265353       0.463348209
4.350810667    0.442607461
3.864739333    0.375839159
3.935632333    0.213821765
3.802632333    0.243294783
3.718387667    0.051625808
4.288542333    0.242010836
4.134689       0.198929941
3.799883667    0.112733475
This is what I expected:
ming_f         ming_STDDEV_SAMP
4.235279667    0.225532646
4.265353       0.212776157
4.350810667    0.23658801
3.864739333    0.253399417
3.935632333    0.26144862
3.802632333    0.246259684
3.718387667    0.20514358
4.288542333    0.208578409
4.134689       0.208615874
3.799883667    0.233948429
It doesn't match your output exactly, but perhaps this is what you need:
zoo::rollapply(quux$ming_f, 7, FUN=sd, partial=TRUE)
(It also works replacing 7 with list(-3:3).)
This expression isn't really different from your sample code, but the output is correct. Perhaps your original frame has a group_by still applied?
Data
quux <- structure(list(ming_f = c(4.235279667, 4.265353, 4.350810667, 3.864739333, 3.935632333, 3.802632333, 3.718387667, 4.288542333, 4.134689, 3.799883667), ming_STDDEV_SAMP = c(0.225532646, 0.212776157, 0.23658801, 0.253399417, 0.26144862, 0.246259684, 0.20514358, 0.208578409, 0.208615874, 0.233948429)), class = "data.frame", row.names = c(NA, -10L))
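To make the windowing explicit, here is a base-R sketch of a centered window of 3 rows before and 3 rows after, truncated at the edges, which is how I understand partial=TRUE to behave. roll_sd_centered is a hypothetical helper written for illustration, not a zoo function.

```r
# Hypothetical helper: sample sd over a centered window, shrinking at the edges
roll_sd_centered <- function(x, half_width = 3) {
  n <- length(x)
  vapply(seq_len(n), function(i) {
    # Clamp the window to the valid index range
    idx <- max(1, i - half_width):min(n, i + half_width)
    sd(x[idx])
  }, numeric(1))
}

ming_f <- c(4.235279667, 4.265353, 4.350810667, 3.864739333, 3.935632333,
            3.802632333, 3.718387667, 4.288542333, 4.134689, 3.799883667)
rolled <- roll_sd_centered(ming_f)
```

Only rows 4 to 7 have a full 7-value window here; the rows near either end use fewer values, which is one reason edge results can differ from output produced with a different edge rule.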

How to structure output to write to csv within a for statement in R?

I'm running a conditional logistic regression analysis on different individuals using a for statement in R. The code for this is pretty straightforward:
for(ID in unique(Hour168Fin$BAND)){
modelone = clogit(Hour168Fin$OBSERVED ~ Hour168Fin$LNSTEPLENG + Hour168Fin$PowCross + Shrub +
strata(Hour168Fin$STEPID), data=Hour168Fin, subset = which(ID==Hour168Fin$BAND))
I'm interested in very specific parts of the output, so I've structured the output to give me exactly the coefficients I need using this:
x1beta = as.numeric(summary(modelone)$coef[1,1])
x2beta = as.numeric(summary(modelone)$coef[2,1])
x3beta = as.numeric(summary(modelone)$coef[3,1])
x1SE = as.numeric(summary(modelone)$coef[1,3])
x2SE = as.numeric(summary(modelone)$coef[2,3])
x3SE = as.numeric(summary(modelone)$coef[3,3])
x1pvalue = as.numeric(summary(modelone)$coef[1,5])
x2pvalue = as.numeric(summary(modelone)$coef[2,5])
x3pvalue = as.numeric(summary(modelone)$coef[3,5])
modelAIC = AIC(modelone)
results = table(x1beta, x1SE, x1pvalue, x2beta, x2SE, x2pvalue, x2beta, x2SE, x2pvalue, modelAIC, rownames = ID)}
In R, I can see all the results in the format I'm looking for, but when I use this to get these results into a csv:
write.csv(results, file = "TrialOut.csv")
I'm only getting the results of 1 unique ID. I've tried embedding the write.csv statement in the for statement, and using it outside of it with the same results. Any suggestions? I'm really baffled because I can see the results in R but can't seem to get that to translate to a csv.
Thanks for your time!
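Try writing inside the loop and appending on each iteration. Note that write.csv ignores append = TRUE (it warns and overwrites the file), so use write.table with sep = "," instead:
for (...) {
# ...
# ...
# write the header only when the file does not exist yet
write.table(results, file = "someFile.csv", sep = ",", append = TRUE,
col.names = !file.exists("someFile.csv"), row.names = FALSE)
}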
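An alternative to appending is to collect one row per ID in a list and write the file once after the loop. The sketch below is only an illustration under stated assumptions: it uses lm on the built-in mtcars data instead of clogit so that it runs without extra packages, and fit_one is a hypothetical helper. Each row is built with data.frame() rather than table(), since table() cross-tabulates its arguments instead of storing them as columns.

```r
# Hypothetical helper: fit one model and return the pieces of interest as one row
fit_one <- function(d) {
  fit <- lm(mpg ~ wt, data = d)
  co <- summary(fit)$coefficients
  data.frame(beta = co["wt", "Estimate"],
             se = co["wt", "Std. Error"],
             pvalue = co["wt", "Pr(>|t|)"],
             aic = AIC(fit))
}

rows <- list()
for (id in unique(mtcars$cyl)) {
  d <- mtcars[mtcars$cyl == id, ]
  # Store each result under its ID instead of overwriting a single object
  rows[[as.character(id)]] <- cbind(id = id, fit_one(d))
}

# Stack the per-ID rows and write everything in one call
results <- do.call(rbind, rows)
out_file <- tempfile(fileext = ".csv")
write.csv(results, out_file, row.names = FALSE)
```

Writing once at the end avoids repeated headers and leftover rows from earlier runs, at the cost of keeping all results in memory until the loop finishes.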

theta.sparse error with lorDIF

I was wondering whether anyone can help me out.
I am trying to run a DIF analysis on my data but keep getting a theta.sparse error, which I am unsure how to fix. I would really appreciate any help that you can give me.
library(lordif)
dat<- read.csv2("OPSO.csv",header=TRUE)
datgender <- read.csv2("DATA.csv",header=TRUE)
group<-datgender$Gender
sink("outputDIFopso.txt")
gender.difopso <- lordif(dat, group, selection = NULL,
criterion = c("Chisqr", "R2", "Beta"),
pseudo.R2 = c("McFadden", "Nagelkerke", "CoxSnell"), alpha = 0.01,
beta.change = 0.1, R2.change = 0.02, maxIter = 10, minCell = 5,
minTheta = -4, maxTheta = 4, inc = 0.1, control = list(), model = "GRM",
anchor = NULL, MonteCarlo = FALSE, nr = 100)
print(gender.difopso)
summary(gender.difopso)
sink()
pdf("graphtestop.pdf")
plot(gender.difopso)
dev.off()
dev.off()
Error in lordif(dat, group, selection = NULL, criterion = c("Chisqr", :
object 'theta.sparse' not found
Thank you :)
You should check the output just before the error. It will probably say that no items were flagged for DIF. When that's the case, you should just run the mirt function and extract the theta and ipar objects as necessary.
The package author could add some case handling for when compare(flags, flags.matrix) is true. At the very least, a warning seems to be missing for the case where no items show DIF, analogous to the existing check:
if (ndif == ni) {
warning("all items got flagged for DIF - stopping\n")
}
There is no corresponding case handling for ndif == 0, even though compare(flags, flag.matrix) evaluates to TRUE.
The implication when all or none of the items have DIF is that you would get the same results (the same ICC plots, the same inference, etc.) by fitting mirt directly: on the combined sample when no items have DIF, or as separate mirt models for each group when all items have DIF. So bypassing lordif in those cases is a legitimate, time-saving workaround when it breaks down like this.
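The kind of case handling described above could look roughly like the following. This is a hypothetical sketch, not code from lordif: dif_flag_status and its return values are invented for illustration, and flags is assumed to be a logical vector with one entry per item.

```r
# Hypothetical helper (not part of lordif): classify a vector of DIF flags
dif_flag_status <- function(flags) {
  ndif <- sum(flags)   # number of items flagged for DIF
  ni <- length(flags)  # total number of items
  if (ndif == 0) {
    warning("no items got flagged for DIF - stopping\n")
    return("none")
  }
  if (ndif == ni) {
    warning("all items got flagged for DIF - stopping\n")
    return("all")
  }
  "some"
}
```

A caller could then fall back to fitting mirt directly whenever the status is "none" or "all".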

r - taking difference of two xyplots?

I have several xyplot objects that I have saved as .RDATA files. I am now interested in being able to look at their differences. I have tried things like
plot1-plot2
but this does not work (I get the "non-numeric argument to binary operator" error).
I would also be able to do this if I knew how to extract the timeseries data stored within the lattice xyplot object, but I have looked everywhere and can't figure out how to do this either.
Any suggestions?
EDIT:
just to make it perfectly clear what I mean for MrFlick, by "taking the difference of two plots" I mean plotting the elementwise difference of the timeseries from each plot, assuming it exists (i.e. assuming that the plots have the same domain). Graphically,
I might want to take the following two plots, stored as xyplot objects:
and end up with something that looks like this:
-Paul
Here is a little function I wrote to plot the difference of two xyplots:
library(data.table)
library(lattice)

# Plot the element-wise difference (plot1 - plot2) of two lattice xyplot objects
getDifferencePlot = function(plot1, plot2){
  data1 = plot1$panel.args
  data2 = plot2$panel.args
  len1 = length(data1)
  len2 = length(data2)
  if (len1 != len2)
    stop("plots do not have the same number of panels -- cannot take difference")
  # Build one table per panel, matching points on their shared x values
  panels = lapply(seq_len(len1), function(i){
    thing1 = data.table(x = data1[[i]]$x, y1 = data1[[i]]$y)
    thing2 = data.table(x = data2[[i]]$x, y2 = data2[[i]]$y)
    finalThing = merge(thing1, thing2, by = "x")
    finalThing[, segment := i]
    finalThing
  })
  plotData = rbindlist(panels)
  plotData[, difference := y1 - y2]
  # One panel: plain plot; several panels: condition on the panel index
  if (len1 == 1)
    diffPlot = xyplot(difference ~ x, plotData, type = "l", auto.key = TRUE)
  else
    diffPlot = xyplot(difference ~ x | factor(segment), plotData, type = "l", auto.key = TRUE)
  return(diffPlot)
}
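The function relies on the panel.args component of a lattice plot object, which also answers the question of how to get the data back out of a saved xyplot. A small sketch on made-up data (the data frame and variable names are assumptions for illustration):

```r
library(lattice)

# Build a single-panel xyplot from toy data
toy <- data.frame(x = 1:5, y = (1:5)^2)
p <- xyplot(y ~ x, toy, type = "l")

# panel.args is a list with one element per panel, each holding x and y
panel_data <- p$panel.args[[1]]
recovered <- data.frame(x = panel_data$x, y = panel_data$y)
```

For a plot with several panels, the same extraction would be repeated over each element of p$panel.args.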

Issues with formatting header in R prior to using plot() function

I have a data set that I've successfully read into R. It's a simple data.frame with ONE ROW of data (I'm not sure how many columns, but it's in the hundreds). It was read with column headers, but no row labels. So the data set looks something like this:
df=structure(list(X500000 = 0.0958904109589041, X1500000 = 0.10958904109589, X2500000 = 0.10958904109589, X3500000 = 0.164383561643836, X4500000 = 0.136986301369863, X5500000 = 0.205479452054795, X6500000 = 0.136986301369863, X7500000 = 0.0273972602739726, X8500000 = 0.0821917808219178, X9500000 = 0.178082191780822), .Names = c("X500000", "X1500000", "X2500000", "X3500000", "X4500000", "X5500000", "X6500000", "X7500000", "X8500000", "X9500000"), class = "data.frame", row.names = 79L)
Except that it is MUCH LARGER (I don't know if it matters, but it has around 300 columns going across). I'm trying to plot it so that the X##### labels are on the x axis, and the value of each data point is plotted on the y axis (like a scatter plot in Excel, or even a line graph). Doing just plot(df) gives me an extremely bizarre graph that makes no sense to me (a bunch of boxes, each with a dot right in the centre and no labels?).
I have a feeling it might work if I were to transform the data frame into a vector by removing the headings and then adding x-axis labels individually afterwards and doing a plot() on the vector, but if there is a way of avoiding that it would be great....
As explained in ?plot, x and y must be two numeric vectors of the same length:
df=structure(list(X500000 = 0.0958904109589041, X1500000 = 0.10958904109589, X2500000 = 0.10958904109589, X3500000 = 0.164383561643836, X4500000 = 0.136986301369863, X5500000 = 0.205479452054795, X6500000 = 0.136986301369863, X7500000 = 0.0273972602739726, X8500000 = 0.0821917808219178, X9500000 = 0.178082191780822), .Names = c("X500000", "X1500000", "X2500000", "X3500000", "X4500000", "X5500000", "X6500000", "X7500000", "X8500000", "X9500000"), class = "data.frame", row.names = 79L)
plot(x=as.numeric(substr(names(df),2,nchar(names(df)))), as.numeric(df), xlab="This is xlab", ylab="This is y")
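The same idea can be written in separate steps, which may be easier to read with around 300 columns. This is a sketch on a shortened, made-up version of the data; the axis labels are placeholders.

```r
# Shortened stand-in for the one-row data frame (values invented for illustration)
df <- data.frame(X500000 = 0.0959, X1500000 = 0.1096, X2500000 = 0.1644)

# Strip the leading "X" that R added to the numeric headers
x <- as.numeric(sub("^X", "", names(df)))

# Flatten the single row into a plain numeric vector
y <- unlist(df[1, ], use.names = FALSE)

# type = "b" draws both points and connecting lines
plot(x, y, type = "b", xlab = "Column label", ylab = "Value")
```

Reading the file with check.names = FALSE would avoid the X prefix in the first place, although the headers would still need converting with as.numeric.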
