Convert an "AsIs" class to data.frame in R

The data comes from the hdi package:
datasetname="riboflavin"
data(riboflavin, package = "hdi")
Y=as.numeric(riboflavin$y)-1
mydata=data.frame(Y,X)
#X now is 71*4088
str(X)
'AsIs' num [1:71, 1:4088] 8.49 7.64 8.09 7.89 6.81 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:71] "b_Fbat107PT24.CEL" "b_Fbat107PT30.CEL" "b_Fbat107PT48.CEL" "b_Fbat107PT52.CEL" ...
..$ : chr [1:4088] "AADK_at" "AAPA_at" "ABFA_at" "ABH_at" ...
#71*1
str(Y)
num [1:71] -7.64 -7.95 -8.93 -9.29 -8.31 ...
dim(mydata)
[1] 71 2
Why is dim(mydata) not 71*4089? How can I obtain a data.frame of (X, Y) with dimensions 71*4089?
Thanks

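num [1:71, 1:4088] means riboflavin$x is a matrix. Because it carries the "AsIs" class, data.frame() keeps it as a single matrix-valued column, which is why mydata has only 2 columns. cbind-ing the modified column y to the matrix x should solve the problem.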
res <- as.data.frame(cbind(Y=riboflavin$y - 1, X=riboflavin$x))
dim(res)
# [1] 71 4089
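The behaviour can be reproduced on a toy matrix instead of the riboflavin data. This is a minimal sketch: the 3 x 2 matrix and the name wide are made up for illustration, and unclass() is shown as one alternative way to drop the "AsIs" protection, not as the method used in the answer above.

```r
# Toy stand-in for riboflavin$x: wrapping a matrix in I() gives it class "AsIs"
X <- I(matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b"))))
Y <- c(10, 20, 30)

# data.frame() keeps an AsIs matrix as ONE matrix-valued column
dim(data.frame(Y, X))
# [1] 3 2

# Dropping the AsIs class first lets data.frame() split the matrix into columns
wide <- data.frame(Y, unclass(X))
dim(wide)
# [1] 3 3
```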

Related

Adding a suffix to names when storing results in a loop

I am making some plots in R in a for-loop and would like to store them using a name to describe the function being plotted, but also which data it came from.
So when I have a list of 2 data sets "x" and "y" and the loop has a structure like this:
x = matrix(
c(1,2,4,5,6,7),
nrow=3,
ncol=2)
y = matrix(
c(20,40,60,80,100,120),
nrow=3,
ncol=2)
data <- list(x,y)
for (i in data){
??? <- boxplot(i)
}
I would like ??? to be a name made of a prefix, a "_" separator, and the name of the data set. In this case the 2 plots would be called "plot_x" and "plot_y".
I tried some things with paste("plot", names(i), sep = "_"), but I'm not sure whether this is the right tool, or where and how to use it in this scenario.
We can create an empty list with the same length as 'data' and then fill it inside the for loop by looping over the indices of 'data':
out <- vector('list', length(data))
for(i in seq_along(data)) {
out[[i]] <- boxplot(data[[i]])
}
str(out)
#List of 2
# $ :List of 6
# ..$ stats: num [1:5, 1:2] 1 1.5 2 3 4 5 5.5 6 6.5 7
# ..$ n : num [1:2] 3 3
# ..$ conf : num [1:2, 1:2] 0.632 3.368 5.088 6.912
# ..$ out : num(0)
# ..$ group: num(0)
# ..$ names: chr [1:2] "1" "2"
# $ :List of 6
# ..$ stats: num [1:5, 1:2] 20 30 40 50 60 80 90 100 110 120
# ..$ n : num [1:2] 3 3
# ..$ conf : num [1:2, 1:2] 21.8 58.2 81.8 118.2
# ..$ out : num(0)
# ..$ group: num(0)
# ..$ names: chr [1:2] "1" "2"
If required, set the names of the list elements from the object names:
names(out) <- paste0("plot_", c("x", "y"))
It is better not to create multiple objects in the global environment. Instead, as shown above, place the objects in a list.
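If the input list is itself named, the result names can be derived from it rather than typed by hand. The following is only a sketch under that assumption: the list names x and y and the six-value matrices are illustrative, and plot = FALSE is used so that boxplot() returns its statistics without opening a graphics device.

```r
x <- matrix(c(1, 2, 4, 5, 6, 7), nrow = 3, ncol = 2)
y <- matrix(c(20, 40, 60, 80, 100, 120), nrow = 3, ncol = 2)

# A named list carries the data set names along with the data
data <- list(x = x, y = y)

# lapply() keeps the list names; plot = FALSE returns the statistics only
out <- lapply(data, boxplot, plot = FALSE)

# Prefix each name with "plot_"
names(out) <- paste("plot", names(data), sep = "_")
names(out)
# [1] "plot_x" "plot_y"
```

Individual results are then available as out$plot_x or out[["plot_y"]].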
akrun is right: you should try to avoid creating named objects in the global environment. But if you really have to, you can try this:
> y = matrix(c(20,40,60,80,100,120,140,160,180),ncol=1)
> .GlobalEnv[[paste0("plot_","y")]] <- boxplot(y)
> str(plot_y)
List of 6
$ stats: num [1:5, 1] 20 60 100 140 180
$ n : num 9
$ conf : num [1:2, 1] 57.9 142.1
$ out : num(0)
$ group: num(0)
$ names: chr "1"
You can read up on .GlobalEnv by typing ?.GlobalEnv at the R command prompt.

How to use lapply to remove columns with too many missing values in a list in R?

I have a list of data frames called ls.df.val.dcas. Each data frame has various columns, some with missing values (NA). I would like to apply lapply() to the list to remove the columns in which more than X% (e.g. 40%) of the values are NA. To show what the data frames in the list look like, here is an example:
$ SK_VALUES_IMV_EU28_INTRA :'data.frame': 74 obs. of 65 variables:
..$ PERIOD : Date[1:74], format: "2010-01-01" "2010-02-01" "2010-03-01" "2010-04-01" ...
..$ 2207 : num [1:74] 1078759 1850083 1872924 1038070 626471 ...
..$ 2208 : num [1:74] 3329179 7061890 1351550 1371469 1557605 ...
..$ 220710 : num [1:74] 1030704 1804495 1831958 972263 574855 ...
..$ 220720 : num [1:74] 48055 45588 40966 65807 51616 ...
..$ 220820 : num [1:74] 380843 1014933 71804 126348 138138 ...
..$ 220830 : num [1:74] 380007 459653 155033 205879 297446 ...
..$ 220840 : num [1:74] 41561 88449 31549 60768 117534 ...
..$ 220850 : num [1:74] 94483 340439 44949 32949 37550 ...
..$ 220860 : num [1:74] 371217 728521 143974 179311 254546 ...
..$ 220870 : num [1:74] 731231 1374532 228087 227772 230129 ...
..$ 22082014: num [1:74] NA 2531 1776 NA NA ...
$ RO_VALUES_IMV_EU28_EXTRA :'data.frame': 74 obs. of 44 variables:
..$ PERIOD : Date[1:74], format: "2010-01-01" "2010-02-01" "2010-03-01" "2010-04-01" ...
..$ 2207 : num [1:74] NA NA NA NA NA 5 NA NA NA NA ...
..$ 2208 : num [1:74] 312035 840540 315008 884357 100836 ...
..$ 220710 : num [1:74] NA NA NA NA NA 5 NA NA NA NA ...
..$ 220720 : num [1:74] NA NA NA NA NA NA NA NA NA NA ...
..$ 220820 : num [1:74] 3570 698 483 1087 1802 ...
My incomplete solution counts the non-missing values in each column of each data frame, converts the counts to a fraction of the number of rows, and then tries to keep only the columns whose fraction passes the threshold.
# Counting the number of non-NA values in each column
ls.Nan <- lapply(ls.df.val.dcas, function(x) colSums(!is.na(x)))
# Getting the dimensions of each data frame
ls.size <- lapply(ls.df.val.dcas, function(x) dim(x))
# We want the first element of the dimensions, which is the number of rows.
ls.percen <- mapply(function(x,y) x/y[1] , x=ls.Nan, y=ls.size)
# Keeping the columns whose fraction of non-NA values is at least the threshold NPI
mis.list <- sapply(ls.df.val.dcas, "]]" sapply(ls.percen, function(x) x >= NPI))
I get the following error from running the last line.
Error: unexpected symbol in "mis.list <- sapply(ls.df.val.dcas, "]]" sapply"
Ultimately I would also like to merge all of these steps into a single function and then use lapply once. But right now I am struggling to understand how indexing works when lapply is applied to a list of data frames. If anyone can demonstrate with an example how to use lapply at different levels of a list, that would be great: for instance, how the function should be written when you want to change an element of a list, a data frame within a list, or a column within a data frame of a list.
EDIT
Given the comment below about the missing comma after "]]", I corrected the code, but I still get an error:
> mis.list <- sapply(ls.df.val.dcas, "]]", sapply(ls.percen, function(x) x >= NPI))
Error in get(as.character(FUN), mode = "function", envir = envir) :
object ']]' of mode 'function' was not found
By the way, NPI is just the threshold for the percentage of NAs in a column. For instance, I have set NPI = 0.35.
Since I suspect the error is related to the structure of my data, I have added more information on the structure of ls.percen.
> str(ls.percen)
List of 69
$ AT_VALUES_IMV_EU28_EXTRA : Named num [1:59] 1 0.635 1 0.378 0.338 ...
..- attr(*, "names")= chr [1:59] "PERIOD" "2207" "2208" "220710" ...
$ AT_VALUES_IMV_EU28_INTRA : Named num [1:67] 1 0.986 0.986 0.986 0.986 ...
..- attr(*, "names")= chr [1:67] "PERIOD" "2207" "2208" "220710" ...
$ BE_VALUES_IMV_EU28_EXTRA : Named num [1:57] 1 1 1 1 0.365 ...
..- attr(*, "names")= chr [1:57] "PERIOD" "2207" "2208" "220710" ...
$ BE_VALUES_IMV_EU28_INTRA : Named num [1:69] 1 0.986 0.986 0.986 0.986 ...
..- attr(*, "names")= chr [1:69] "PERIOD" "2207" "2208" "220710" ...
This might be a simple typo (and not a problem with indexing): the message says you are missing a comma, so the call should perhaps be:
mis.list <- sapply( ls.df.val.dcas, "]]", sapply(ls.percen, function(x) x >= NPI))
We don't see a definition of 'NPI'. It might be simpler to merge the first two 'lapply' calls (and return the desired list of shortened data frames) with:
mis.lst <- lapply( ls.df.val.dcas,
function(x) x[ , colSums(!is.na(x))/nrow(x) > .40 ] )
You can use logical indexing in the "j" position for the two argument version of "[".
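The same idea can be wrapped in one function and applied with a single lapply() call. This is a sketch only, assuming the goal is to drop every column whose share of NA values exceeds a threshold; the helper name drop_sparse_cols, the argument max_na, and the toy data frames are hypothetical and not part of the question.

```r
# Hypothetical helper: keep the columns whose share of NA values is at most max_na
drop_sparse_cols <- function(df, max_na = 0.40) {
  keep <- colMeans(is.na(df)) <= max_na   # one TRUE/FALSE per column
  df[, keep, drop = FALSE]                # drop = FALSE keeps a data frame even with 1 column
}

# Toy list of data frames standing in for ls.df.val.dcas
dfs <- list(
  a = data.frame(id = 1:5, v1 = c(1, NA, NA, NA, 5), v2 = c(1, 2, 3, NA, 5)),
  b = data.frame(id = 1:5, v1 = c(NA, NA, NA, NA, NA))
)

# lapply() passes each data frame to the helper; max_na is forwarded as an extra argument
cleaned <- lapply(dfs, drop_sparse_cols, max_na = 0.40)
lapply(cleaned, names)
```

Inside the function, df is one element of the list, so ordinary data frame indexing applies; lapply() only handles the outer level.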

Rename subtitles of a list using a loop

I have to rename the elements of a list of matrices called l1. Each Name(n) object holds a character string. Here is my code:
names(l1)[1] <- Name1
names(l1)[2] <- Name2
names(l1)[3] <- Name3
names(l1)[4] <- Name4
## ...
names(l1)[43] <- Name43
As you can see, I have 43 sublists. Is there a way to do that with an automated loop such as for (i in 1:43)? I tried to write a loop, but I am a beginner and find it very hard for now.
Edit: I would like to rename the elements of my list without having to type 43 lines manually. Here are the first three elements of my list:
str(l1)
List of 43
$ XXX : num [1:640, 1:3] -0.83 -0.925 -0.623 -0.191 0.155 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:3] "EV_BICYCLE" "HW_DISTANCE" "NO_ASSETS"
$ XXX : num [1:640, 1:2] -0.159 0.485 -0.686 -0.245 -3.361 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:2] "HOME_OWN" "METRO_DISTANCE"
$ XXX : num [1:640, 1:3] -0.79 1.15 0.224 0.388 -1.571 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:3] "BICYCLE" "HOME_OWN_SC" "POP_SC"
That is to say, I would like to replace the 43 XXX with Name1, Name2, ... up to Name43.
Try
names(l1) <- unlist(mget(ls(pattern="^Nom_F")))
str(l1, list.len=2)
#List of 3
# $ Accessibility : int [1:5, 1:5] 10 10 3 9 7 6 8 2 7 8 ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : NULL
# .. ..$ : chr [1:5] "A" "B" "C" "D" ...
# $ Access : int [1:5, 1:5] 6 4 10 5 9 8 9 4 7 1 ...
#..- attr(*, "dimnames")=List of 2
# .. ..$ : NULL
# .. ..$ : chr [1:5] "A" "B" "C" "D" ...
Instead of creating separate objects, you could create a vector of real titles. For example
v1 <- LETTERS[1:3]
names(l1) <- v1
data
set.seed(42)
l1 <- setNames(lapply(1:3, function(x)
matrix(sample(1:10, 5*5, replace=TRUE), ncol=5,
dimnames=list(NULL, LETTERS[1:5]))), rep('XXX',3))
Nom_F1 <- "Accessibility"
Nom_F2 <- "Access"
Nom_F3 <- "Poverty_and_SC"
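One caveat when there are more than nine objects: ls() returns names in alphabetical order, so an object ending in 10 sorts before one ending in 2. A sketch of one way around that is to build the object names explicitly with paste0(); the toy list of 12 elements and the Title_ values below are made up for illustration.

```r
# Toy setup (hypothetical): a list of 12 matrices all named "XXX"
l1 <- setNames(lapply(1:12, function(i) matrix(i, 2, 2)), rep("XXX", 12))

# Hypothetical title objects Name1 ... Name12, each holding one character string
for (i in 1:12) assign(paste0("Name", i), paste0("Title_", i))

# Build the object names in numeric order instead of relying on ls() ordering
obj_names <- paste0("Name", seq_along(l1))

# mget() fetches the objects by name; unlist() turns them into a character vector
names(l1) <- unlist(mget(obj_names), use.names = FALSE)
names(l1)[1:3]
```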

Simulating stochastic integrals

I'm using the Sim.DiffProc package in R to simulate a Stratonovich stochastic integral. Using the following code I can simulate 5 paths of the stochastic integral from t=0 to t=5:
fun=expression(w)
strat=st.int(fun, type="str", M=5, lower=0, upper=5)
How can I get the values of the stochastic integral at t=5, given that the st.int() function doesn't return the values at the various t as output?
I'm not sure what you mean by t=5. The $X element is a matrix of time series:
> str(strat)
List of 8
$ X : mts [1:1001, 1:5] 0.0187 0.0177 0.0506 0.0357 0.0357 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:5] "X1" "X2" "X3" "X4" ...
..- attr(*, "tsp")= num [1:3] 0 5 200
..- attr(*, "class")= chr [1:3] "mts" "ts" "matrix"
$ fun : symbol w
$ type : chr "str"
$ subdivisions: int 1000
$ M : num 5
$ Dt : num 0.005
$ t0 : num 0
$ T : num 5
- attr(*, "class")= chr "st.int"
If the fifth row of the values matrix is what you mean, it would be:
> (strat$X[5 , ])
X1 X2 X3 X4 X5
0.0031517578 0.0161278426 0.0003616453 0.0097594992 0.0012617410
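If instead the value at time t = 5 is meant, note that the "tsp" attribute in the str() output above is 0 5 200: the series starts at 0, ends at 5, and has 200 observations per time unit, so the fifth row is t = 0.02 and t = 5 is the last of the 1001 rows. The sketch below uses a stand-in multivariate time series rather than Sim.DiffProc output (the random-walk paths are purely illustrative); with the real object, strat$X would take the place of X.

```r
# Stand-in for strat$X: 5 random-walk paths on t = 0, 0.005, ..., 5 (1001 rows)
set.seed(1)
paths <- apply(matrix(rnorm(1001 * 5, sd = sqrt(0.005)), ncol = 5), 2, cumsum)
X <- ts(paths, start = 0, frequency = 200)

# The time stamp attached to the last row
time(X)[nrow(X)]

# Values of all 5 paths at the final time
at_T <- X[nrow(X), ]
at_T
```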

Feature selection using the penalizedLDA package

I am trying to use the penalizedLDA package to run a penalized linear discriminant analysis in order to select the "most meaningful" variables. I have searched here and on other sites for help in accessing the output from the penalized model, to no avail.
My data comprises 400 variables and 44 groups. The code I used and the results I have so far:
yy.m<-as.matrix(yy) #Factors/groups
xx.m<-as.matrix(xx) #Variables
cv.out<-PenalizedLDA.cv(xx.m,yy.m,type="standard")
## apply the penalty
out <- PenalizedLDA(xx.m,yy.m,lambda=cv.out$bestlambda,K=cv.out$bestK)
To get the structure of the output from the analysis:
> str(out)
List of 10
$ discrim: num [1:401, 1:4] -0.0234 -0.0219 -0.0189 -0.0143 -0.0102 ...
$ xproj : num [1:100, 1:4] -8.31 -14.68 -11.07 -13.46 -26.2 ...
$ K : int 4
$ crits :List of 4
..$ : num [1:4] 2827 2827 2827 2827
..$ : num [1:4] 914 914 914 914
..$ : num [1:4] 162 162 162 162
..$ : num [1:4] 48.6 48.6 48.6 48.6
$ type : chr "standard"
$ lambda : num 0
$ lambda2: NULL
$ wcsd.x : Named num [1:401] 0.0379 0.0335 0.0292 0.0261 0.0217 ...
..- attr(*, "names")= chr [1:401] "R400" "R405" "R410" "R415" ...
$ x : num [1:100, 1:401] 0.147 0.144 0.145 0.141 0.129 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:401] "R400" "R405" "R410" "R415" ...
$ y : num [1:100, 1] 2 2 2 2 2 1 1 1 1 1 ...
- attr(*, "class")= chr "penlda"
I am interested in obtaining a list or matrix of the top 20 variables for feature selection, most likely based on the coefficients of the linear discriminants.
I realize I would have to sort the coefficients in descending order and match the variable names to them. So the output I would expect is something like this imaginary example:
V1 V2
R400 0.34
R1535 0.22...
Can anyone provide any pointers (not necessarily the R code)? Thanks in advance.
Your out$K is 4, and that means you have 4 discriminant vectors. If you want the top 20 variables according to, say, the 2nd vector, try this:
# get the data frame of variable names and coefficients
var.coef = data.frame(colnames(xx.m), out$discrim[,2])
# sort the 2nd column (the coefficients) in decreasing order, and only keep the top 20
var.coef.top = var.coef[order(var.coef[,2], decreasing = TRUE)[1:20], ]
var.coef.top is what you want.
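The code above ranks by the signed coefficient, so large negative coefficients end up at the bottom. If "top" is taken to mean largest in magnitude, which is an assumption about the intent rather than something stated in the question, the ordering can be done on the absolute value instead. A sketch with toy stand-ins for out$discrim and the variable names:

```r
set.seed(1)

# Toy stand-ins (hypothetical): 30 variables and 2 discriminant vectors
discrim <- matrix(rnorm(30 * 2), ncol = 2)
vars <- paste0("R", seq(400, by = 5, length.out = 30))

# Pair each variable name with its coefficient on the 2nd discriminant vector
var.coef <- data.frame(variable = vars, coef = discrim[, 2])

# Order by absolute coefficient, largest first, and keep the top 20
top <- var.coef[order(abs(var.coef$coef), decreasing = TRUE)[1:20], ]
head(top, 3)
```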
