Iterating an R Script as a function of sequential survey questions - r

The function below works perfectly for my purpose. The display is wonderful. Now my problem is I need to be able to do it again, many times, on other variables that fit other patterns.
In this example, I've output results for "q4a". I would like to be able to do the same for sequences of questions that follow patterns like q4 < a - z > or q < 4 - 10 >< a - z >, automagically.
Is there some way to iterate this such that the specified variable (in this case q4a) changes each time?
Here's my function:
require(reshape) # Using it for melt
require(foreign) # Using it for read.spss
d1 <- read.spss(...) ## Read in SPSS file
attach(d1,warn.conflicts=F) ## Attach SPSS data
q4a_08 <- d1[,grep("q4a_",colnames(d1))] ## Pull in everything matching q4a_X
q4a_08 <- melt(q4a_08) ## restructure data for post-hoc
detach(d1)
q4aaov <- aov(formula=value~variable,data=q4a) ## anova
Thanks in advance!

Not sure if this is what you are looking for, but to generate the list of questions:
> gsub('^', 'q', gsub(' ', '',
apply(expand.grid(1:10,letters),1,
function(r) paste(r, sep='', collapse='')
)))
[1] "q1a" "q2a" "q3a" "q4a" "q5a" "q6a" "q7a" "q8a" "q9a" "q10a"
[11] "q1b" "q2b" "q3b" "q4b" "q5b" "q6b" "q7b" "q8b" "q9b" "q10b"
[21] "q1c" "q2c" "q3c" "q4c" "q5c" "q6c" "q7c" "q8c" "q9c" "q10c"
[31] "q1d" "q2d" "q3d" "q4d" "q5d" "q6d" "q7d" "q8d" "q9d" "q10d"
[41] "q1e" "q2e" "q3e" "q4e" "q5e" "q6e" "q7e" "q8e" "q9e" "q10e"
[51] "q1f" "q2f" "q3f" "q4f" "q5f" "q6f" "q7f" "q8f" "q9f" "q10f"
[61] "q1g" "q2g" "q3g" "q4g" "q5g" "q6g" "q7g" "q8g" "q9g" "q10g"
[71] "q1h" "q2h" "q3h" "q4h" "q5h" "q6h" "q7h" "q8h" "q9h" "q10h"
[81] "q1i" "q2i" "q3i" "q4i" "q5i" "q6i" "q7i" "q8i" "q9i" "q10i"
[91] "q1j" "q2j" "q3j" "q4j" "q5j" "q6j" "q7j" "q8j" "q9j" "q10j"
...
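A shorter way to build the same vector, as a sketch: paste0() is vectorized, so it can be applied directly to the columns of expand.grid(). The column names n and l are just illustrative.

```r
# Same question names as above, in the same order (number varies fastest)
grid <- expand.grid(n = 1:10, l = letters)
questions <- paste0("q", grid$n, grid$l)
head(questions, 3)
length(questions)
```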
And then you turn your inner part of the analysis into a function that takes the question prefix as a parameter:
analyzeQuestion <- function (prefix)
{
q <- d1[,grep(prefix,colnames(d1))] ## Pull in every column matching the prefix, e.g. q4a_X
q <- melt(q) ## restructure data for post-hoc
qaov <- aov(formula=value~variable,data=q) ## anova on the melted data
return (LTukey(qaov,which="",conf.level=0.95)) ## Tukey's post-hoc
}
Note that your original code passes data=q4a to aov(), and I'm not sure where that 'q4a' variable is coming from. I have assumed the melted data frame was intended, so adjust that bit if not. But hopefully this helps.
To put the two together you can use sapply() to apply the analyzeQuestion function to each of the prefixes that we automagically generated.
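As a rough sketch of that last step, the example below runs a per-question analysis over several prefixes with lapply(). It makes several assumptions: the data frame is made up, base R's stack() and TukeyHSD() stand in for melt() and the LTukey() call, and the function takes the data as a second argument so that it is self-contained.

```r
# Hypothetical survey data: two questions (q4a, q4b), three items each
set.seed(1)
d1 <- data.frame(
  q4a_1 = rnorm(20), q4a_2 = rnorm(20), q4a_3 = rnorm(20),
  q4b_1 = rnorm(20), q4b_2 = rnorm(20), q4b_3 = rnorm(20)
)

analyzeQuestion <- function(prefix, data) {
  # Anchor the pattern so that "q4a" does not also match e.g. "q14a_1"
  cols <- grep(paste0("^", prefix, "_"), colnames(data))
  if (length(cols) < 2) return(NULL)   # question absent or nothing to compare
  long <- stack(data[, cols])          # long format with columns 'values' and 'ind'
  fit <- aov(values ~ ind, data = long)
  TukeyHSD(fit, conf.level = 0.95)
}

prefixes <- c("q4a", "q4b", "q4c")     # q4c is absent on purpose
results <- lapply(setNames(prefixes, prefixes), analyzeQuestion, data = d1)
results <- Filter(Negate(is.null), results)  # drop prefixes with no columns
names(results)
```

lapply() is used here rather than sapply() because each result is a whole object, so a named list is the natural container. results$q4a then holds the comparison for question q4a.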

I would recommend melting the entire dataset and then splitting variable into its component pieces. Then you can more easily use subset to look at (e.g.) just question four: subset(molten, q == 4).
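A minimal sketch of that idea, assuming column names of the form q<number><letter>_<item>. The data frame and the new column names (q, part, item) are made up for illustration, and base R's stack() is used in place of melt().

```r
# Hypothetical wide survey data
d1 <- data.frame(q4a_1 = 1:3, q4a_2 = 4:6, q5b_1 = 7:9)

molten <- stack(d1)                          # long format: 'values' and 'ind'
molten$variable <- as.character(molten$ind)

# Split the variable name into question number, letter and item
molten$q    <- as.integer(sub("^q([0-9]+).*$", "\\1", molten$variable))
molten$part <- sub("^q[0-9]+([a-z]).*$", "\\1", molten$variable)
molten$item <- as.integer(sub("^.*_([0-9]+)$", "\\1", molten$variable))

q4 <- subset(molten, q == 4)                 # note '==', not '='
nrow(q4)
```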

Related

Using for loop variable to access element in array yielding NA in R

I'm using a nested for loop to create a greedy algorithm in R.
z = 0
for (j in 1:length(t))
for (i in 1:(length(t) - j))
if ((t[j + i] - t[j]) >= 30)
{z <- c(z,j + i - 1)
j <- j + i - 1
break}
z
Where t is a vector such as:
[1] 12.01485 26.94091 33.32458 49.46742 65.07425 76.05700
[7] 87.11043 100.64116 111.72977 125.72649 139.46460 153.67292
[13] 171.46393 184.54244 201.20850 214.05093 224.16196 237.12485
[19] 251.51753 258.45865 273.95466 285.42704 299.01869 312.35587
[25] 326.26289 339.78724 353.81854 363.15847 378.89307 390.66134
[31] 402.22007 412.86049 424.23181 438.50462 448.88005 462.59917
[37] 473.65289 487.20678 499.80053 509.14141 526.03873 540.17209
[43] 550.69941 565.74602 576.06882 589.07297 598.53208 614.20677
[49] 627.44605 648.08346 665.49614 681.46445 691.01806 704.05762
[55] 714.09172 732.04124 745.90960 758.52628 769.80519 779.41537
[61] 788.35732 805.78547 818.75262 832.71196 844.97859 856.08608
[67] 865.72998 875.55945 887.20862 900.00000
The goal for the function is to find the indexes whose differences are as close to 30 as possible and save them in z.
For example, with the vector t provided, I would expect z to be [0, 2, 4, 6, 8, 10,...70]
The functionality is not my concern right now, as I am running into the error:
Error in if ((t[j + i] - t[j]) >= 30) { :
missing value where TRUE/FALSE needed
I'm new to R so I know I'm not utilizing the vectorization that R is known for. I simply want to have 'j' and 'i' as "counter variables" that I can use to access specific elements of vector t, but for a reason unknown to me, the if statement is not yielding a T/F value.
Any suggestions?
I know you want to learn how to use a for loop, but it is difficult to help because you did not provide a reproducible example. On the other hand, many functions in R are vectorized, so you can often avoid a for loop and achieve the same task more efficiently.
Based on the description in your post ("The goal for the function is to find the indexes whose differences are as close to 30 as possible and save them in z."), here is an example that addresses your question without a for loop.
z <- which.min(abs(diff(vec) - 30))
z
# [1] 49
vec[c(z, z + 1)]
# [1] 627.4461 648.0835
Based on the data you provided, the pair of consecutive numbers whose difference is closest to 30 starts at index 49. The two numbers are 627.4461 and 648.0835.
Data
vec <- c("12.01485 26.94091 33.32458 49.46742 65.07425 76.05700 87.11043
100.64116 111.72977 125.72649 139.46460 153.67292 171.46393
184.54244 201.20850 214.05093 224.16196 237.12485 251.51753
258.45865 273.95466 285.42704 299.01869 312.35587 326.26289
339.78724 353.81854 363.15847 378.89307 390.66134 402.22007
412.86049 424.23181 438.50462 448.88005 462.59917 473.65289
487.20678 499.80053 509.14141 526.03873 540.17209 550.69941
565.74602 576.06882 589.07297 598.53208 614.20677 627.44605
648.08346 665.49614 681.46445 691.01806 704.05762 714.09172
732.04124 745.90960 758.52628 769.80519 779.41537 788.35732
805.78547 818.75262 832.71196 844.97859 856.08608 865.72998
875.55945 887.20862 900.00000")
vec <- strsplit(vec, split = "\\s+")[[1]] # split on any whitespace, including the line breaks
vec <- as.numeric(grep("[0-9]+\\.[0-9]+", vec, value = TRUE))
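On the error itself: when j equals length(t), the inner range 1:(length(t) - j) is 1:0, which R evaluates to c(1, 0) rather than an empty sequence. The loop body then reads t[length(t) + 1], which is NA, and if() cannot test NA. Assigning to j inside a for loop also does not change which value the loop takes next. The sketch below shows the difference between 1:0 and seq_len(0), followed by one possible greedy rewrite with a while loop. The short vector tt is made up, and the selection rule (jump to the first later element at least 30 away) is an assumption about what was intended.

```r
# Hypothetical short vector standing in for t
tt <- c(12, 27, 33, 49, 65, 76, 87, 101)
n <- length(tt)

1:(n - n)        # c(1, 0): a for loop over this still runs twice
seq_len(n - n)   # integer(0): a for loop over this is skipped

picked <- integer(0)
j <- 1
while (j < n) {
  # positions after j whose value is at least 30 larger than tt[j]
  k <- which(tt[(j + 1):n] - tt[j] >= 30)
  if (length(k) == 0) break   # no such element left
  j <- j + k[1]               # jump to the first one
  picked <- c(picked, j)
}
picked
```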

Specify order of import for multiple tables in R

I'm trying to read in 360 data files in text format. I can do so using this code:
temp = list.files(pattern="*.txt")
myfiles = lapply(temp, read.table)
The problem I have is that the files are named as "DO_1, DO_2,...DO_360" and when I try to import the files into a list, they do not maintain this order. Instead I get DO_1, DO_10, etc. Is there a way to specify the order in which the files are imported and stored? I didn't see anything in the help pages for list.files or read.table. Any suggestions are greatly appreciated.
lapply will process the files in the order you have them stored in temp. So your goal is to sort them the way you actually think about them. Luckily there is the mixedsort function from the gtools package that does just the kind of sorting you're looking for. Here is a quick demo.
> library(gtools)
> vals <- paste("DO", 1:20, sep = "_")
> vals
[1] "DO_1" "DO_2" "DO_3" "DO_4" "DO_5" "DO_6" "DO_7" "DO_8" "DO_9"
[10] "DO_10" "DO_11" "DO_12" "DO_13" "DO_14" "DO_15" "DO_16" "DO_17" "DO_18"
[19] "DO_19" "DO_20"
> vals <- sample(vals)
> sort(vals) # doesn't give us what we want
[1] "DO_1" "DO_10" "DO_11" "DO_12" "DO_13" "DO_14" "DO_15" "DO_16" "DO_17"
[10] "DO_18" "DO_19" "DO_2" "DO_20" "DO_3" "DO_4" "DO_5" "DO_6" "DO_7"
[19] "DO_8" "DO_9"
> mixedsort(vals) # this is the sorting we're looking for.
[1] "DO_1" "DO_2" "DO_3" "DO_4" "DO_5" "DO_6" "DO_7" "DO_8" "DO_9"
[10] "DO_10" "DO_11" "DO_12" "DO_13" "DO_14" "DO_15" "DO_16" "DO_17" "DO_18"
[19] "DO_19" "DO_20"
So in your case you just want to do
library(gtools)
temp <- mixedsort(temp)
before your call to lapply that calls read.table.
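If you would rather not add a package, a base R sketch of the same idea is to extract the number from each file name and order by it. This assumes every name contains exactly one run of digits, as in the DO_<n>.txt pattern; the file names below are made up.

```r
# Hypothetical file names, in the order list.files() might return them
temp <- c("DO_1.txt", "DO_10.txt", "DO_2.txt", "DO_20.txt")

# Strip every non-digit character, then order by the numeric value
num <- as.integer(gsub("\\D", "", temp))
temp <- temp[order(num)]
temp

# myfiles <- lapply(temp, read.table)   # then read the files in this order
```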

Apply conditional selection to sequence of columns R

I use data from the NHANES periodontal dataset (https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/OHXPER_F.htm). After cleaning it to keep only the "pc" variables, I have a data frame, setPD, with 168 columns: 6 measurements (pcd, pcm, pcs, pcp, pcl, pca) around each of 28 teeth numbered from #02 to #31.
#names(setPD)
[1] "ohx02pcd" "ohx02pcm" "ohx02pcs" "ohx02pcp" "ohx02pcl" "ohx02pca" "ohx03pcd" "ohx03pcm" "ohx03pcs" "ohx03pcp" "ohx03pcl" "ohx03pca"
[13] "ohx04pcd" "ohx04pcm" "ohx04pcs" "ohx04pcp" "ohx04pcl" "ohx04pca" "ohx05pcd" "ohx05pcm" "ohx05pcs" "ohx05pcp" "ohx05pcl" "ohx05pca"
[25] "ohx06pcd" "ohx06pcm" "ohx06pcs" "ohx06pcp" "ohx06pcl" "ohx06pca" "ohx07pcd" "ohx07pcm" "ohx07pcs" "ohx07pcp" "ohx07pcl" "ohx07pca"
[37] "ohx08pcd" "ohx08pcm" "ohx08pcs" "ohx08pcp" "ohx08pcl" "ohx08pca" "ohx09pcd" "ohx09pcm" "ohx09pcs" "ohx09pcp" "ohx09pcl" "ohx09pca"
[49] "ohx10pcd" "ohx10pcm" "ohx10pcs" "ohx10pcp" "ohx10pcl" "ohx10pca" "ohx11pcd" "ohx11pcm" "ohx11pcs" "ohx11pcp" "ohx11pcl" "ohx11pca"
[61] "ohx12pcd" "ohx12pcm" "ohx12pcs" "ohx12pcp" "ohx12pcl" "ohx12pca" "ohx13pcd" "ohx13pcm" "ohx13pcs" "ohx13pcp" "ohx13pcl" "ohx13pca"
[73] "ohx14pcd" "ohx14pcm" "ohx14pcs" "ohx14pcp" "ohx14pcl" "ohx14pca" "ohx15pcd" "ohx15pcm" "ohx15pcs" "ohx15pcp" "ohx15pcl" "ohx15pca"
[85] "ohx18pcd" "ohx18pcm" "ohx18pcs" "ohx18pcp" "ohx18pcl" "ohx18pca" "ohx19pcd" "ohx19pcm" "ohx19pcs" "ohx19pcp" "ohx19pcl" "ohx19pca"
[97] "ohx20pcd" "ohx20pcm" "ohx20pcs" "ohx20pcp" "ohx20pcl" "ohx20pca" "ohx21pcd" "ohx21pcm" "ohx21pcs" "ohx21pcp" "ohx21pcl" "ohx21pca"
[109] "ohx22pcd" "ohx22pcm" "ohx22pcs" "ohx22pcp" "ohx22pcl" "ohx22pca" "ohx23pcd" "ohx23pcm" "ohx23pcs" "ohx23pcp" "ohx23pcl" "ohx23pca"
[121] "ohx24pcd" "ohx24pcm" "ohx24pcs" "ohx24pcp" "ohx24pcl" "ohx24pca" "ohx25pcd" "ohx25pcm" "ohx25pcs" "ohx25pcp" "ohx25pcl" "ohx25pca"
[133] "ohx26pcd" "ohx26pcm" "ohx26pcs" "ohx26pcp" "ohx26pcl" "ohx26pca" "ohx27pcd" "ohx27pcm" "ohx27pcs" "ohx27pcp" "ohx27pcl" "ohx27pca"
[145] "ohx28pcd" "ohx28pcm" "ohx28pcs" "ohx28pcp" "ohx28pcl" "ohx28pca" "ohx29pcd" "ohx29pcm" "ohx29pcs" "ohx29pcp" "ohx29pcl" "ohx29pca"
[157] "ohx30pcd" "ohx30pcm" "ohx30pcs" "ohx30pcp" "ohx30pcl" "ohx30pca" "ohx31pcd" "ohx31pcm" "ohx31pcs" "ohx31pcp" "ohx31pcl" "ohx31pca"
I am trying to apply a conditional selection in each group of six columns. This is:
transmute(setPD,PD02 = ifelse(setPD$ohx02pcd >5 |
setPD$ohx02pcm>5 |setPD$ohx02pcs >5|
setPD$ohx02pcp >5 | setPD$ohx02pcl >5 |
setPD$ohx02pca >5, 1, 0))
Then for the next tooth (03) I have to write again:
transmute(setPD,PD03 = ifelse(setPD$ohx03pcd >5 |
setPD$ohx03pcm>5|setPD$ohx03pcs >5|
setPD$ohx03pcp >5|setPD$ohx03pcl >5|
setPD$ohx03pca >5, 1, 0))
I tried to firstly do that conditional selection in a more efficient way, something like:
transmute(setPD,PD02 = ifelse(list(setPD$ohx02pcd:setPD$ohx02pcp) >5, 1, 0))
but it does not work.
Then I am looking for a way to write a loop that does that over each tooth without needing to write this 28 times!!
I thought of applying the select function of dplyr in a for loop but I don't know how to do that.
At the end I want to take all the new columns I made with transmute and say that if at least 2 of the 28 columns are 1, then I have disease, and if fewer than 2 are 1, then I have health. Any help would be appreciated.
Note: If you want to get the dataset, it is open access from the CDC:
https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/OHXPER_F.htm
First, it is useful to point out that a logical statement of the form "is A true OR is B true OR is C true" is equivalent to asking "is ANY of A, B, C true?" We can use this to simplify the statements setPD$ohx02pcd >5 | setPD$ohx02pcm>5 |setPD$ohx02pcs >5| ... and instead ask whether the value in any of these columns is larger than 5.
For example, let us focus on tooth number 02 first. To get all columns that concern this tooth, we can use grep to get a vector of column names. This can be achieved with
current_tooth <- grep("02", names(setPD), value = T)
Note that if there are any other columns in the data that contain the string 02, these columns will also show up. This does not appear to be the case in your data, but it is worthwhile pointing out here in case someone else uses it and this applies in other datasets.
Now, we can use these names to subset the dataframe. For instance,
setPD[,current_tooth]
will give you the corresponding columns. In each row, we want to check if any of the above mentioned conditions are true. Given a vector of logical statements, we can check if any of them is true with the function any. To go through a dataframe by row and apply a function, we can use apply, such as in
setPD$PD02 <-
apply(setPD[,grep("02", names(setPD), value = T)], 1, function(x) any(x>5))
Now, the above applies to one tooth only, namely 02. One way of doing it for all teeth is to create a vector with all tooth indicators and use this to loop over the above lines, replacing the "02" in the above grep call in each iteration and using assign or something similar to get the variable name right. A more elegant and more efficient way is to use the same principle on long data. Consider the following:
library(reshape2)
library(dplyr)
m <- melt(setPD, id.vars="SEQN")
m$num <- substr(m$variable, 4,5) # be careful here and check output!
# one row per respondent (SEQN) and tooth: is any of its measurements > 5?
m <- m %>% group_by(SEQN, num) %>% summarise(PD = any(value>5)) %>% ungroup()
m$num <- paste0("PD", m$num)
md <- dcast(m, SEQN ~ num, value.var = "PD")
setPD <- merge(setPD, md, by="SEQN")
This assumes that setPD still contains the respondent identifier SEQN. It melts your data first and creates a variable num that indicates your tooth. Again, make sure that this works. I have used the fact that in your data, the tooth numbers all appear in the 4th and 5th place in the character string. Make sure this is true, and adjust the code otherwise. Then, for each respondent and tooth, I create a variable PD which indicates whether any of the columns that contain the tooth identifier has a value larger than 5. Last but not least, I recast the data so that you have the values of PD02, PD03, etc. in columns again, before I merge this with the old dataset. The line with paste0 merely creates the variable names that you want to have.
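The loop-based alternative mentioned above can also be sketched in base R, including the final step from the question (at least two affected teeth means disease). The miniature data frame is made up, the status column name is hypothetical, and treating missing measurements as "not above 5" is an assumption you may want to change.

```r
# Hypothetical miniature of setPD: two teeth, two sites each
setPD <- data.frame(
  SEQN = 1:3,
  ohx02pcd = c(6, 2, NA), ohx02pcm = c(1, 3, 4),
  ohx03pcd = c(7, 2, 6),  ohx03pcm = c(2, 2, 2)
)

# Tooth numbers taken from characters 4-5 of the measurement column names
meas  <- grep("^ohx[0-9]{2}pc", names(setPD), value = TRUE)
teeth <- unique(substr(meas, 4, 5))

for (tooth in teeth) {
  cols <- grep(paste0("^ohx", tooth, "pc"), names(setPD), value = TRUE)
  # 1 if any site on this tooth exceeds 5, else 0; NAs count as "not above 5"
  setPD[[paste0("PD", tooth)]] <-
    as.integer(rowSums(setPD[cols] > 5, na.rm = TRUE) > 0)
}

# At least two affected teeth counts as disease
pd_cols <- paste0("PD", teeth)
setPD$status <- ifelse(rowSums(setPD[pd_cols]) >= 2, "disease", "health")
setPD[c("SEQN", pd_cols, "status")]
```

Anchoring the pattern with "^ohx" avoids the accidental matches that a bare "02" could produce in other datasets.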

Plot lists of data after assigning them in a function

I started using R for a course in Computational Fluid Dynamics, and in one of the first lessons we have to create a function that outputs two lists of data. So I wrote this function:
Green.Ampt=function(param){
  k=param[1]
  Psi=param[2]
  DTheta=param[3]
  h=param[4]
  F1=0.65
  vector.F2<-1:h
  vector.f<-1:h
  for(tempo in 1 : h){
    DeltaF=1
    while(DeltaF>0.01) {
      F2=k*tempo+Psi*DTheta*log(F1/(Psi*DTheta)+1)
      DeltaF=abs(F1-F2)
      F1=F2
    }
    vector.F2[tempo]=F2
    vector.f[tempo]= k*(Psi*DTheta/F2+1)
  }
  OUT<-list(vector.F2, vector.f)
  return(OUT)
}
I ran the function with Green.Ampt(c(0.65,16.7,0.34,10)) and received the following output in the console:
[[1]]
[1] 3.152985 4.745484 6.077012 7.284812 8.404389 9.469498
[7] 10.490538 11.474561 12.434380 13.371189
[[2]]
[1] 1.8205417 1.4277289 1.2573215 1.1566294 1.0891396 1.0397461
[7] 1.0018123 0.9716419 0.9468141 0.9260188
I want to give these two series of data names because I need to plot them, but I have not been successful.
Save the return value of your function as an object, which in this case will be a list. You can then extract the components of the list using [[ notation in a call to plot:
x <- Green.Ampt(c(0.65,16.7,0.34,10))
plot(x[[1]], x[[2]])
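If you want the two series to carry names, one option is to name the list elements and then use $ to extract them. In the sketch below, x is a stand-in for the value returned by Green.Ampt(), using the first three values from the output in the question; the names cumulative and rate are hypothetical.

```r
# Stand-in for x <- Green.Ampt(c(0.65,16.7,0.34,10))
x <- list(c(3.152985, 4.745484, 6.077012),
          c(1.8205417, 1.4277289, 1.2573215))

names(x) <- c("cumulative", "rate")   # hypothetical names
x$rate                                # same as x[[2]]

plot(x$cumulative, x$rate, type = "b",
     xlab = "cumulative", ylab = "rate")
```

The same names can be set inside the function by returning list(cumulative = vector.F2, rate = vector.f).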

Subset by function's variable using $variable

I am having trouble subsetting a list using a variable of my function.
rankhospital <- function(state,outcome,num = "best") {
#code here
e3<-dataframe(...,state.name,...)
if (num=="worst"){ return(worst(state,outcome))
}else if((num%in%b=="TRUE" & outcome=="heart attack")=="TRUE"){
sep<-split(e3,e3$state.name)
hosp.estado<-sep$state
hospital<-hosp.estado[num,1]
return(as.character(hospital))
I split my data frame by state (which is a variable of my function), but hosp.estado<-sep$state doesn't work. I have also tried as.data.frame.
The function call (rankhospital("NY"....) returns character(0).
When I replace sep$state with sep$"NY" directly in the code it works perfectly, so I guess the problem is that I can't use a function's variable to do this. Am I right? What could I use instead?
Thank you!!
If state is a variable in your function, you can refer to a column with the name given by state using: sep[state] or sep[[state]]. The first produces a data frame with one column named based on the value of state. The second produces an unnamed vector.
df=data.frame(NY=rnorm(10),CA=rnorm(10), IL=rnorm(10))
state="NY"
df[state]
# NY
# 1 -0.79533912
# 2 -0.05487747
# 3 0.25014132
# 4 0.61824329
# 5 -0.17262350
# 6 -2.22390027
# 7 -1.26361438
# 8 0.35872890
# 9 -0.01104548
# 10 -0.94064916
df[[state]]
# [1] -0.79533912 -0.05487747 0.25014132 0.61824329 -0.17262350 -2.22390027 -1.26361438 0.35872890 -0.01104548 -0.94064916
class(df[state])
# [1] "data.frame"
class(df[[state]])
# [1] "numeric"
It seems like you are trying to get the top hospital in a state. You don't want to split here (see the result of sep to see what I mean). Instead, use:
as.character(e3[e3$state.name==state, 1][num])
This hopefully does what you want.
You need sep[[state]] instead of sep$state to get the data frame out of your sep list, which matches the state parameter of your function. Like this:
e3 <- read.csv("https://raw.github.com/Hindol/data-analysis-coursera/master/HW3/hospital-data.csv")
state <- "WY"
num <- 1:5
sep<-split(e3,e3$State)
hosp.estado<-sep[[state]]
hospital<-hosp.estado[num,1]
as.character(hospital)
# [1] "530002" "530006" "530008" "530010" "530011"
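To see why sep$state fails, here is a small sketch with made-up data (the hospital names and two-state table are illustrative only). The $ operator looks for an element literally named "state", while [[ evaluates the variable first.

```r
# Hypothetical data standing in for the hospital table
e3 <- data.frame(
  hospital   = c("A", "B", "C", "D"),
  state.name = c("NY", "NY", "WY", "WY"),
  stringsAsFactors = FALSE
)
sep <- split(e3, e3$state.name)   # list with elements "NY" and "WY"
state <- "NY"

is.null(sep$state)       # TRUE: there is no element called "state"
sep[[state]]$hospital    # the NY rows: "A" "B"
```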
