Basic for loop in R - r

I am trying to write a for loop over a set of data frames that I want to write to Excel:
AU2<-intersect(A_CN_U$symbol,A_CP_U$symbol)
AD2<-intersect(A_CN_D$symbol,A_CP_D$symbol)
BU2<-intersect(B_CN_U$symbol,B_CP_U$symbol)
BD2<-intersect(B_CN_D$symbol,B_CP_D$symbol)
CU2<-intersect(C_CN_U$symbol,C_CP_U$symbol)
CD2<-intersect(C_CN_D$symbol,C_CP_D$symbol)
tot<- c(AU2, AD2, BU2, BD2, CU2, CD2)
for (i in tot){
print(i)
write_xlsx(i,"/Users/ABC/Desktop/Research/i.xlsx")
}
It returns this:
[1] "TMTC2"
Error in write_xlsx(i, "/Users/abkhan/Desktop/Research/Patel Meningioma/Smoker DEG/01RedoSubmissionMM/i.xlsx") :
Argument x must be a data frame or list of data frames
Outside my failed loop:
> AU2
[1] "TMTC2" "NPB" "GALNT6" "CDCA2" "ABTB1" "C12orf75" "GPR63" NA "ESR1" "NPAS2" "PLAGL1" "C11orf45" "SYNE1" "C16orf74"
[15] "S100A6" "LOXL4" "PLCL1" "KLHL29" "DTX4" "ITGB5" "BCAT1" "CDKN2B" "KANK4" "S1PR2" "AHR" "STAMBPL1" "TRPM3" "TMEM200A"
[29] "BASP1" "AQP5" "THBS2" "ADRA1B" "MGLL" "RIMBP2" "KCNN4" "PROCR" "MXRA5" "CAV1" "GALNT15" "RIMS1" "ELAVL4" "COL4A6"
[43] "FAM189A1" "AMH" "DPP4" "MEGF6" "JPH3" "POU5F1B" "EVA1A" "ABCC2" "PTGES" "CACNG8" "ALK" "VGLL3" "TGM2" "SLC9A2"
[57] "LVRN" "MEGF10" "LMO3" "PRPH" "ATP2B2" "SRPX2" "LUM" "SLC9A4" "CDKN2A" "LGR6" "ALPK2" "C6orf132" "FAP" "ANKRD1"
[71] "LTK" "ASPN" "SLC22A1" "PPL" "LYPD1" "GPR39" "DSC3" "SOX11" "NHLH2" "KRT14" "IGFL2" "GDF6"
Thanks!
Edit:
I realized that my original variables (AU2, AD2, etc.) were not data frames.
I converted them to data frames and renamed them to reflect my mood.
Now I get:
> tot<- c(Atrash, Btrash, Ctrash)
> for (i in tot){
+ print(i)
+ write_xlsx(i,"/Users/abkhan/Desktop/Research/Patel Meningioma/Smoker DEG/01RedoSubmissionMM/i.xlsx")
+ }
[1] "TMTC2" "NPB" "GALNT6" "CDCA2" "ABTB1" "C12orf75" "GPR63" NA "ESR1" "NPAS2" "PLAGL1" "C11orf45" "SYNE1" "C16orf74"
[15] "S100A6" "LOXL4" "PLCL1" "KLHL29" "DTX4" "ITGB5" "BCAT1" "CDKN2B" "KANK4" "S1PR2" "AHR" "STAMBPL1" "TRPM3" "TMEM200A"
[29] "BASP1" "AQP5" "THBS2" "ADRA1B" "MGLL" "RIMBP2" "KCNN4" "PROCR" "MXRA5" "CAV1" "GALNT15" "RIMS1" "ELAVL4" "COL4A6"
[43] "FAM189A1" "AMH" "DPP4" "MEGF6" "JPH3" "POU5F1B" "EVA1A" "ABCC2" "PTGES" "CACNG8" "ALK" "VGLL3" "TGM2" "SLC9A2"
[57] "LVRN" "MEGF10" "LMO3" "PRPH" "ATP2B2" "SRPX2" "LUM" "SLC9A4" "CDKN2A" "LGR6" "ALPK2" "C6orf132" "FAP" "ANKRD1"
[71] "LTK" "ASPN" "SLC22A1" "PPL" "LYPD1" "GPR39" "DSC3" "SOX11" "NHLH2" "KRT14" "IGFL2" "GDF6"
Error in write_xlsx(i, "/Users/ABC/Desktop/Research/i.xlsx") :
Argument x must be a data frame or list of data frames

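write_xlsx (from writexl) takes as its first argument a:
data frame or named list of data frames that will be sheets in the xlsx
So you are passing another type of data. c() flattens its arguments: combining character vectors gives one long character vector, so i is a single string, and combining data frames gives a list of their columns, so i is a single column. Note also that "i.xlsx" is a literal file name, so every iteration would write to the same file. You can create a list of your data.frames with list() and iterate over that list to handle each data.frame in turn. Here is an example: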
df1 <- data.frame(a = 1)
df2 <- data.frame(b = 2)
list_of_dfs <- list(df1,df2)
n <- length(list_of_dfs)
for (i in 1:n){
print(i)
write.csv(list_of_dfs[[i]],file = paste0(i,".csv"),row.names = FALSE)
}
list.files(pattern = "*.csv")
[1] "1.csv" "2.csv"
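The same idea can be sketched with a named list, so that each file name is built from the element's name instead of the literal "i". This is only an illustration under some assumptions: the two gene symbols are taken from the output above, the other values are placeholders, the output directory is a temporary one, and write.csv stands in for write_xlsx so the sketch runs without extra packages.

```r
# Wrap each character vector in a data frame before writing it out
AU2 <- c("TMTC2", "NPB")          # first two symbols from the question's output
AD2 <- c("GENE1", "GENE2")        # placeholder values for illustration only
dfs <- list(AU2 = data.frame(symbol = AU2),
            AD2 = data.frame(symbol = AD2))

out_dir <- tempdir()              # assumption: swap in your own folder
for (nm in names(dfs)) {
  # Build one path per element, e.g. <out_dir>/AU2.csv
  path <- file.path(out_dir, paste0(nm, ".csv"))
  write.csv(dfs[[nm]], file = path, row.names = FALSE)
}
list.files(out_dir, pattern = "\\.csv$")
```

With writexl installed, replacing the write.csv line with writexl::write_xlsx(dfs[[nm]], path) and using an ".xlsx" extension should give one workbook per data frame. Passing the whole named list to write_xlsx in a single call would instead give one workbook with one sheet per element, as the quoted documentation describes.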

Related

How can I copy and rename a bunch of variables at once?

I have created some variables. I would like to duplicate these so that they exist twice, once with the name you see below, and once with Ireland_ in front of their name, i.e.,
c_PFS_Folfox = 307.81 would become:
Ireland_c_PFS_Folfox = 307.81
I initially define these as follows:
# 1. Cost of treatment in this country
c_PFS_Folfox <- 307.81
c_PFS_Bevacizumab <- 2580.38
c_OS_Folfiri <- 326.02
administration_cost <- 365.00
# 2. Cost of treating the AE conditional on it occurring
c_AE1 <- 2835.89
c_AE2 <- 1458.80
c_AE3 <- 409.03
# 3. Willingness to pay threshold
n_wtp = 45000
Then I put them together to rename all at once:
kk <- data.frame(c_PFS_Folfox, c_PFS_Bevacizumab, c_OS_Folfiri, administration_cost, c_AE1, c_AE2, c_AE3, n_wtp)
colnames(kk) <- paste("Ireland", kk, sep="_")
kk
Ireland_307.81 Ireland_2580.38 Ireland_326.02 Ireland_365 Ireland_2835.89 Ireland_1458.8
1 307.8 2580 326 365 2836 1459
Ireland_409.03 Ireland_45000
1 409 45000
Obviously this isn't the output I intended. These also don't exist as new variables in the environment.
What can I do?
If we want to create objects with Ireland_ as prefix, either use
list2env(setNames(kk, paste0("Ireland_", names(kk))), .GlobalEnv)
Once we have created the objects in the global env, we may remove the original objects
> rm(list = names(kk))
> ls()
[1] "Ireland_administration_cost" "Ireland_c_AE1" "Ireland_c_AE2" "Ireland_c_AE3" "Ireland_c_OS_Folfiri"
[6] "Ireland_c_PFS_Bevacizumab" "Ireland_c_PFS_Folfox" "Ireland_n_wtp" "kk"
or with %=% from collapse
library(collapse)
paste("Ireland", colnames(kk), sep="_") %=% kk
Checking:
> Ireland_administration_cost
[1] 365
> Ireland_c_PFS_Folfox
[1] 307.81
First put all your variables in a vector, then use sapply to iterate the vector to assign the existing variables to a new variable with the prefix "Ireland_".
your_var <- c("c_PFS_Folfox", "c_PFS_Bevacizumab", "c_OS_Folfiri",
"administration_cost", "c_AE1", "c_AE2", "c_AE3", "n_wtp")
sapply(your_var, \(x) assign(paste0("Ireland_", x), get(x), envir = globalenv()))
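As a side note on why the original attempt produced names like Ireland_307.81: paste() was given the data frame kk itself, so it pasted the values, not the column names. A minimal sketch of the difference, using two of the variables from the question and a scratch environment (an assumption made here so the example does not touch the global environment):

```r
kk <- data.frame(c_PFS_Folfox = 307.81, n_wtp = 45000)

# Pasting the data frame uses its values; pasting colnames(kk) uses its names
from_values <- paste("Ireland", kk, sep = "_")
from_names  <- paste("Ireland", colnames(kk), sep = "_")

# Create the prefixed objects in a scratch environment instead of .GlobalEnv
scratch <- new.env()
list2env(setNames(as.list(kk), from_names), envir = scratch)
ls(scratch)
get("Ireland_c_PFS_Folfox", envir = scratch)
```

Using envir = .GlobalEnv in place of scratch would match the answers above.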

Using for loop variable to access element in array yielding NA in R

I'm using a nested for loop to create a greedy algorithm in R.
z = 0
for (j in 1:length(t))
for (i in 1:(length(t) - j))
if ((t[j + i] - t[j]) >= 30)
{z <- c(z,j + i - 1)
j <- j + i - 1
break}
z
Where t is a vector such as:
[1] 12.01485 26.94091 33.32458 49.46742 65.07425 76.05700
[7] 87.11043 100.64116 111.72977 125.72649 139.46460 153.67292
[13] 171.46393 184.54244 201.20850 214.05093 224.16196 237.12485
[19] 251.51753 258.45865 273.95466 285.42704 299.01869 312.35587
[25] 326.26289 339.78724 353.81854 363.15847 378.89307 390.66134
[31] 402.22007 412.86049 424.23181 438.50462 448.88005 462.59917
[37] 473.65289 487.20678 499.80053 509.14141 526.03873 540.17209
[43] 550.69941 565.74602 576.06882 589.07297 598.53208 614.20677
[49] 627.44605 648.08346 665.49614 681.46445 691.01806 704.05762
[55] 714.09172 732.04124 745.90960 758.52628 769.80519 779.41537
[61] 788.35732 805.78547 818.75262 832.71196 844.97859 856.08608
[67] 865.72998 875.55945 887.20862 900.00000
The goal for the function is to find the indexes whose differences are as close to 30 as possible and save them in z.
For example, with the vector t provided, I would expect z to be [0, 2, 4, 6, 8, 10,...70]
The functionality is not my concern right now, as I am running into the error:
Error in if ((t[j + i] - t[j]) >= 30) { :
missing value where TRUE/FALSE needed
I'm new to R so I know I'm not utilizing the vectorization that R is known for. I simply want to have 'j' and 'i' as "counter variables" that I can use to access specific elements of vector t, but for a reason unknown to me, the if statement is not yielding a T/F value.
Any suggestions?
I know you want to learn how to use a for loop, but it is difficult to help because you did not provide a reproducible example. That said, many functions in R are vectorized, so you can often avoid a for loop and achieve the same task more efficiently.
Based on the description in your post ("The goal for the function is to find the indexes whose differences are as close to 30 as possible and save them in z."), here is an example that addresses your question without a for loop.
z <- which.min(abs(diff(vec) - 30))
z
# [1] 49
vec[c(z, z + 1)]
# [1] 627.4461 648.0835
In the data you provided, the adjacent pair whose difference is closest to 30 starts at index 49: the two values are 627.4461 and 648.0835.
Data
vec <- c("12.01485 26.94091 33.32458 49.46742 65.07425 76.05700 87.11043
100.64116 111.72977 125.72649 139.46460 153.67292 171.46393
184.54244 201.20850 214.05093 224.16196 237.12485 251.51753
258.45865 273.95466 285.42704 299.01869 312.35587 326.26289
339.78724 353.81854 363.15847 378.89307 390.66134 402.22007
412.86049 424.23181 438.50462 448.88005 462.59917 473.65289
487.20678 499.80053 509.14141 526.03873 540.17209 550.69941
565.74602 576.06882 589.07297 598.53208 614.20677 627.44605
648.08346 665.49614 681.46445 691.01806 704.05762 714.09172
732.04124 745.90960 758.52628 769.80519 779.41537 788.35732
805.78547 818.75262 832.71196 844.97859 856.08608 865.72998
875.55945 887.20862 900.00000")
vec <- strsplit(vec, split = "\\s+")[[1]] # split on any whitespace, including the line breaks
vec <- as.numeric(grep("[0-9]+\\.[0-9]+", vec, value = TRUE))
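On the error itself: when j reaches length(t), the inner range 1:(length(t) - j) becomes 1:0, which in R is c(1, 0) and not an empty sequence. The loop body then reads t[j + 1], which is past the end of the vector and therefore NA, and if (NA) raises "missing value where TRUE/FALSE needed". Also, assigning to j inside a for loop does not change which value j takes on the next iteration. The sketch below shows both points and one possible while-loop version of the greedy pass. It uses a small made-up vector, and it records the first index at least 30 beyond the current one, which is only one reading of the intended algorithm.

```r
times <- c(12, 27, 33, 49, 65, 76, 87, 101)  # made-up values, not the asker's data

j <- length(times)
1:(length(times) - j)        # counts down: 1 0
seq_len(length(times) - j)   # empty, so a loop over it is skipped
times[j + 1]                 # out of range, gives NA

# Hypothetical helper: greedy pass with a while loop, so j can really jump ahead
greedy_idx <- function(x, gap = 30) {
  z <- integer(0)
  j <- 1
  while (j < length(x)) {
    # positions (relative to j) of later values at least `gap` away from x[j]
    hit <- which(x[(j + 1):length(x)] - x[j] >= gap)
    if (length(hit) == 0) break
    j <- j + hit[1]
    z <- c(z, j)
  }
  z
}
greedy_idx(times)
```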

Automate "ncvar_get"-reading of different variable names in NetCDF files

I have a NetCDF dataset with two climate scenarios (rcp and hist), each containing 25 files. Each file contains data for one of the variables "pr", "tas", "tasmax", or "tasmin". I wrote a for loop that opens each pair of hist and rcp files with nc_open, extracts the variable with ncvar_get, and computes mean(abs(hist - rcp)) to obtain the mean absolute distance between each pair of hist and rcp. The problem: ncvar_get requires the exact variable name of the current file, so I wrote an if/else block (see below) that is supposed to find the variable name of the current file and pass it to ncvar_get. Running the code, I get the following error:
[1] "vobjtovarid4: error #F: I could not find the requsted var (or dimvar) in the file!"
[1] "var (or dimvar) name: tas"
[1] "file name: /data/historical/tasmax_ICHEC-EC-EARTH_DMI-HIRHAM5_r3i1p1.nc"
Error in vobjtovarid4(nc, varid, verbose = verbose, allowdimvar = TRUE) : Variable not found
#Extract of the files in the hist list. Same file names in the rcp list, but different directory
> hist.files.cl <- list.files("/historical", full.names = TRUE)
> hist.files.cl
[1] "/historical/pr_CNRM-CERFACS-CNRM-CM5_ALADIN53_r1i1p1.nc"
[2] "/historical/pr_CNRM-CERFACS-CNRM-CM5_ALARO-0_r1i1p1.nc"
[3] "/historical/pr_ICHEC-EC-EARTH_HIRHAM5_r3i1p1.nc"
[4] "/historical/pr_ICHEC-EC-EARTH_RACMO22E_r12i1p1.nc"
[5] "/historical/pr_ICHEC-EC-EARTH_RCA4_r12i1p1.nc"
[6] "/historical/pr_MPI-M-MPI-ESM-LR_RCA4_r1i1p1.nc"
[7] "/historical/pr_MPI-M-MPI-ESM-LR_REMO2009_r1i1p1.nc"
[8] "/historical/pr_MPI-M-MPI-ESM-LR_REMO2009_r2i1p1.nc"
[9] "/historical/tas_CNRM-CERFACS-CNRM-CM5_CNRM-ALADIN53_r1i1p1.nc"
[10] "/historical/tas_CNRM-CERFACS-CNRM-CM5_RMIB-UGent-ALARO-0_r1i1p1.nc"
[11] "/historical/tas_ICHEC-EC-EARTH_DMI-HIRHAM5_r3i1p1.nc"
[12] "/historical/tas_ICHEC-EC-EARTH_KNMI-RACMO22E_r12i1p1.nc"
[13] "/historical/tas_ICHEC-EC-EARTH_SMHI-RCA4_r12i1p1.nc"
[14] "/historical/tas_MPI-M-MPI-ESM-LR_MPI-CSC-REMO2009_r1i1p1.nc"
[15] "/historical/tas_MPI-M-MPI-ESM-LR_MPI-CSC-REMO2009_r2i1p1.nc"
[16] "/historical/tasmax_ICHEC-EC-EARTH_DMI-HIRHAM5_r3i1p1.nc"
[17] "/historical/tasmax_ICHEC-EC-EARTH_KNMI-RACMO22E_r12i1p1.nc"
[18] "/historical/tasmax_ICHEC-EC-EARTH_SMHI-RCA4_r12i1p1.nc"
euc.distance <- list()
for(i in 1:length(hist.files.cl)) {
#Open ith file in list of hist files as well as in list of rcp files
hist.data <- nc_open(hist.files.cl[i])
rcp.data <- nc_open(rcp.files.cl[i])
if(grepl("pr", hist.data$filename)){
hist.var <- ncvar_get(hist.data, "pr")
rcp.var <- ncvar_get(rcp.data, "pr")
}else if (grepl("tas", hist.data$filename)){
hist.var <- ncvar_get(hist.data, "tas")
rcp.var <- ncvar_get(rcp.data, "tas")
}else if (grepl("tasmax", hist.data$filename)){
hist.var <- ncvar_get(hist.data, "tasmax")
rcp.var <- ncvar_get(rcp.data, "tasmax")
}else{
hist.var <- ncvar_get(hist.data, "tasmin")
rcp.var <- ncvar_get(rcp.data, "tasmin")
}
#Converting temperature variable from K to °C:
if(grepl("tas", hist.data$filename)){
hist.var <- hist.var-273.15
rcp.var <- rcp.var-273.15
}
#Find for the ith rcp file with dim=(1,1,360) in the ith hist file with dim=(385,373,360) the grid point with the best fitting distribution (each grid point consists of a distribution of 360 time steps).The calculation may contain errors...
euc.distance[[i]] <- apply(hist.var, c(1,2), function(x) mean(abs(rcp.var - x)))
min_values <- which(rank(euc.distance[[i]], ties.method='min') <= 10)
}
cath highlighted the probable cause of the error, but the proposed approach to extract the part of interest (the variable name) from the filename does not work. Before that, I tried to automate the extraction of the variable name by using stringr("filename",startposition, endposition) until I noticed that this makes no sense, because the variable names (pr, tas, tasmax, tasmin) have different lengths. What possibilities do you see for me?
Thank you a lot!
To expand on my comment: if you need to operate on each file, you could do it all at once by putting everything in a list.
So, first get the "keypart" for each file:
keyparts <- sub("^([a-z]+)_.+", "\\1", basename(hist.files.cl))
keyparts
# [1] "pr" "pr" "pr" "pr" "pr" "pr" "pr" "pr"
# [9] "tas" "tas" "tas" "tas" "tas" "tas" "tas" "tasmax"
#[17] "tasmax" "tasmax"
Then you can use lapply to do what you need for every file at once:
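my_res <- lapply(seq_along(keyparts),
  function(i){
    hist.data <- nc_open(hist.files.cl[i])
    rcp.data <- nc_open(rcp.files.cl[i])
    hist.var <- ncvar_get(hist.data, keyparts[i])
    rcp.var <- ncvar_get(rcp.data, keyparts[i])
    # Close the files once the data has been read
    nc_close(hist.data)
    nc_close(rcp.data)
    # Convert every temperature variable (tas, tasmax, tasmin) from K to °C,
    # as the loop in the question does
    if(startsWith(keyparts[i], "tas")){
      hist.var <- hist.var-273.15
      rcp.var <- rcp.var-273.15
    }
    euc.distance <- apply(hist.var, c(1,2), function(x) mean(abs(rcp.var - x)))
    # euc.distance is a plain matrix here, so rank it directly
    min.values <- which(rank(euc.distance, ties.method='min') <= 10)
    return(list(euc.distance=euc.distance, min.values=min.values))
  })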
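To see why the if/else chain in the question asks for "tas" in a tasmax file, note that grepl("tas", ...) is an unanchored substring match, so it is also TRUE for tasmax and tasmin file names and the later branches are never reached. A small sketch using three of the file names listed in the question (it runs without the NetCDF files themselves):

```r
files <- c("/historical/pr_ICHEC-EC-EARTH_HIRHAM5_r3i1p1.nc",
           "/historical/tas_ICHEC-EC-EARTH_DMI-HIRHAM5_r3i1p1.nc",
           "/historical/tasmax_ICHEC-EC-EARTH_DMI-HIRHAM5_r3i1p1.nc")

# Substring match: the tasmax file also matches "tas"
matches_tas <- grepl("tas", files)

# Take everything before the first underscore of the base name instead
var_names <- sub("_.*$", "", basename(files))

# One flag that covers tas, tasmax and tasmin for the Kelvin conversion
is_temperature <- startsWith(var_names, "tas")

data.frame(file = basename(files), var_names, matches_tas, is_temperature)
```

Extracting the name once and passing it to ncvar_get avoids having to order the branches from most to least specific.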

Read the VW raw scores from (CS)OAA

VowpalWabbit writes raw predictions from (CS)OAA model as a sequence of lines like this:
1:-2.31425 2:-3.98557 3:-3.97967 4:-2.63708 5:-3.18749 6:-2.43984 7:-4.99018 8:-3.49138 9:-3.07816 10:-6.15126 11:-6.01152 12:-5.76039 13:-5.13096 14:-5.18472 15:-5.37358 16:-5.24147 17:-5.21512 18:-5.67961 19:-4.62929 20:-4.61404 000db8cd6aef4e5fa459126d36e0fa1f-none
1:-2.65864 2:-3.33924 3:-2.8116 4:-1.83108 5:-2.05677 6:-1.29879 7:-6.7446 8:-3.05036 9:-2.82138 10:-5.19605 11:-4.5119 12:-5.28309 13:-4.35789 14:-4.76992 15:-4.16866 16:-4.6897 17:-3.76224 18:-4.13129 19:-4.4489 20:-4.32605 000e0e58a4cb4a218bbc6cae0b1af201-none
How do I read it into R?
Here is my code:
## load raw vw (CS)OAA scores
read.vw.oaa.scores <- function (myfile) {
v <- sapply(strsplit(readLines(myfile),' ',fixed=TRUE), function (r) {
m <- matrix(unlist(strsplit(head(r,-1),':',fixed=TRUE)),ncol=2,byrow=TRUE)
stopifnot(identical(1:nrow(m),as.integer(m[,1])))
c(tail(r,1),m[,2])
})
f <- as.data.frame(t(v),stringsAsFactors=FALSE)
names(f) <- c("id",head(names(f),-1))
for (n in tail(names(f),-1))
f[[n]] <- as.numeric(f[[n]])
f
}
Are there any obvious bugs/inefficiencies?
Is there a better way?
PS. This data format looks like CRS but it is not it.
See if the following works for you (it is probably really slow). It assumes every score is in index:value format and that raw is a character vector with one element per line.
raw = c("1:-2.31425 2:-3.98557 3:-3.97967 4:-2.63708 5:-3.18749 6:-2.43984 7:-4.99018 8:-3.49138 9:-3.07816 10:-6.15126 11:-6.01152 12:-5.76039 13:-5.13096 14:-5.18472 15:-5.37358 16:-5.24147 17:-5.21512 18:-5.67961 19:-4.62929 20:-4.61404 000db8cd6aef4e5fa459126d36e0fa1f-none",
"1:-2.65864 2:-3.33924 3:-2.8116 4:-1.83108 5:-2.05677 6:-1.29879 7:-6.7446 8:-3.05036 9:-2.82138 10:-5.19605 11:-4.5119 12:-5.28309 13:-4.35789 14:-4.76992 15:-4.16866 16:-4.6897 17:-3.76224 18:-4.13129 19:-4.4489 20:-4.32605 000e0e58a4cb4a218bbc6cae0b1af201-none")
Function to clean:
clean = function(t, n) {as.numeric(gsub("^[0-9]+:", "", unlist(strsplit(t, split=" "))[1:n]))}
lapply(raw, clean, n = 20)
[[1]]
[1] -2.31425 -3.98557 -3.97967 -2.63708 -3.18749 -2.43984 -4.99018 -3.49138 -3.07816 -6.15126 -6.01152 -5.76039
[13] -5.13096 -5.18472 -5.37358 -5.24147 -5.21512 -5.67961 -4.62929 -4.61404
[[2]]
[1] -2.65864 -3.33924 -2.81160 -1.83108 -2.05677 -1.29879 -6.74460 -3.05036 -2.82138 -5.19605 -4.51190 -5.28309
[13] -4.35789 -4.76992 -4.16866 -4.68970 -3.76224 -4.13129 -4.44890 -4.32605
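If the trailing id should be kept as well, the same splitting can be collected into a data frame. The following is only a sketch: the two lines are shortened to the first three scores of the lines above, and the column names class1, class2, ... are an assumption made for illustration.

```r
# Shortened versions of the two example lines (first three scores plus the id)
raw <- c("1:-2.31425 2:-3.98557 3:-3.97967 000db8cd6aef4e5fa459126d36e0fa1f-none",
         "1:-2.65864 2:-3.33924 3:-2.8116 000e0e58a4cb4a218bbc6cae0b1af201-none")

fields <- strsplit(raw, " ", fixed = TRUE)
n <- length(fields[[1]]) - 1          # number of scores per line

# Last field of each line is the id
ids <- vapply(fields, function(r) r[length(r)], character(1))

# Strip the "index:" prefix from the first n fields and convert to numeric
scores <- t(vapply(fields,
                   function(r) as.numeric(sub("^[0-9]+:", "", r[seq_len(n)])),
                   numeric(n)))
colnames(scores) <- paste0("class", seq_len(n))

vw_scores <- data.frame(id = ids, scores, stringsAsFactors = FALSE)
vw_scores
```

For a real file, raw would come from readLines(myfile). vapply is used instead of sapply so that a line with an unexpected number of fields fails loudly rather than silently changing the shape of the result.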

Iterating an R Script as a function of sequential survey questions

The function below works perfectly for my purpose. The display is wonderful. Now my problem is I need to be able to do it again, many times, on other variables that fit other patterns.
In this example, I've output results for "q4a", I would like to be able to do it for sequences of questions that follow patterns like: q4 < a - z > or q < 4 - 10 >< a - z >, automagically.
Is there some way to iterate this such that the specified variable (in this case q4a) changes each time?
Here's my function:
require(reshape) # Using it for melt
require(foreign) # Using it for read.spss
d1 <- read.spss(...) ## Read in SPSS file
attach(d1,warn.conflicts=F) ## Attach SPSS data
q4a_08 <- d1[,grep("q4a_",colnames(d1))] ## Pull in everything matching q4a_X
q4a_08 <- melt(q4a_08) ## restructure data for post-hoc
detach(d1)
q4aaov <- aov(formula=value~variable,data=q4a) ## anova
Thanks in advance!
Not sure if this is what you are looking for, but to generate the list of questions:
> gsub('^', 'q', gsub(' ', '',
apply(expand.grid(1:10,letters),1,
function(r) paste(r, sep='', collapse='')
)))
[1] "q1a" "q2a" "q3a" "q4a" "q5a" "q6a" "q7a" "q8a" "q9a" "q10a"
[11] "q1b" "q2b" "q3b" "q4b" "q5b" "q6b" "q7b" "q8b" "q9b" "q10b"
[21] "q1c" "q2c" "q3c" "q4c" "q5c" "q6c" "q7c" "q8c" "q9c" "q10c"
[31] "q1d" "q2d" "q3d" "q4d" "q5d" "q6d" "q7d" "q8d" "q9d" "q10d"
[41] "q1e" "q2e" "q3e" "q4e" "q5e" "q6e" "q7e" "q8e" "q9e" "q10e"
[51] "q1f" "q2f" "q3f" "q4f" "q5f" "q6f" "q7f" "q8f" "q9f" "q10f"
[61] "q1g" "q2g" "q3g" "q4g" "q5g" "q6g" "q7g" "q8g" "q9g" "q10g"
[71] "q1h" "q2h" "q3h" "q4h" "q5h" "q6h" "q7h" "q8h" "q9h" "q10h"
[81] "q1i" "q2i" "q3i" "q4i" "q5i" "q6i" "q7i" "q8i" "q9i" "q10i"
[91] "q1j" "q2j" "q3j" "q4j" "q5j" "q6j" "q7j" "q8j" "q9j" "q10j"
...
And then you turn your inner part of the analysis into a function that takes the question prefix as a parameter:
analyzeQuestion <- function (prefix)
{
q <- d1[,grep(paste0("^", prefix, "_"),colnames(d1))] ## Pull in everything matching <prefix>_X
q <- melt(q) ## restructure data for post-hoc
qaov <- aov(formula=value~variable,data=q) ## anova on the melted data
return (LTukey(qaov,which="",conf.level=0.95)) ## Tukey's post-hoc
}
I'm not sure where the 'q4a' object in your aov(..., data=q4a) call comes from. The function above assumes the melted data was intended, so adjust that bit if not. But hopefully this helps.
To put the two together you can use sapply() to apply the analyzeQuestion function to each of the prefixes that we automagically generated.
I would recommend melting the entire dataset and then splitting variable into its component pieces. Then you can more easily use subset to look at (e.g.) just question four: subset(molten, q == 4).
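Putting the pieces together could look roughly like the sketch below. It makes several assumptions: d1 is simulated here rather than read from SPSS, the columns follow the q4a_1, q4a_2 naming pattern, base R's stack() stands in for melt() so no extra package is needed, and the function returns the aov fit without the Tukey step.

```r
set.seed(1)
# Simulated stand-in for the survey data: two questions with two items each
d1 <- data.frame(q4a_1 = rnorm(20), q4a_2 = rnorm(20, mean = 1),
                 q4b_1 = rnorm(20), q4b_2 = rnorm(20))

# Question prefixes; paste0 is vectorized, so no apply() is needed
prefixes <- paste0("q", 4, c("a", "b"))

# Hypothetical helper: one-way ANOVA across the items of one question
analyze_question <- function(prefix, data) {
  cols <- data[, grep(paste0("^", prefix, "_"), colnames(data)), drop = FALSE]
  long <- stack(cols)               # long format with columns `values` and `ind`
  aov(values ~ ind, data = long)
}

# Naming the input keeps the results addressable by question
results <- lapply(setNames(prefixes, prefixes), analyze_question, data = d1)
summary(results[["q4a"]])
```

lapply is used rather than sapply because each result is a model object, so a plain named list is the natural container.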

Resources