R stack alternative - r

I am trying to write code that reads one column from each of many files and, depending on the values found there, prints the corresponding values from a different column. I have read the files in, but I am having trouble managing the resulting table. I would like to limit the table to just those two columns, because the full files are very large and cumbersome and the other columns are unnecessary. My attempt used this line:
tmp<-stack(lapply(inputFiles,function(x) x[,3]))
But ideally I would like to include two columns (3 and 1), not just one, so that I can use lines such as these:
search<-tmp[tmp$values < 100, "Target"]
write(search, file = "Five", ncolumns = 2)
But I am not sure how. I am almost certain that stack is not going to work for more than one column. I tried some different things, similar to this:
tmp<-stack(lapply(inputFiles,function(x) x[,3], x[,1]))
But of course that didn't work.
But I don't know where to look. Does anyone have any suggestions?

The taRifx package has a list method for stack that will do what you want. It stacks lists of data.frames.
Untested code:
library(taRifx)
tmp<-stack(lapply(inputFiles,function(x) x[,c(1,3)]))
But you didn't change anything! Why does this work?
lapply() returns a list. In your case, it returns a list where each element is a data.frame.
Base R does not have a special method for stacking lists. So when you call stack() on your list of data.frames, it calls stack.default, which doesn't work.
Loading the taRifx library loads a method of stack that deals specifically with lists of data.frames. So everything works fine since stack() now knows how to properly handle a list of data.frames.
Tested example:
dat <- replicate(10, data.frame(x=runif(2),y=rnorm(2)), simplify=FALSE)
str(dat)
stack(dat)
x y
1 0.42692948 0.32023455
2 0.75388820 0.24154125
3 0.64035957 1.96580059
4 0.47690790 -1.89772855
5 0.41668993 0.78083412
6 0.12643784 0.38029833
7 0.01656855 0.51225268
8 0.40653094 1.09408159
9 0.94236491 -0.13410923
10 0.05578115 1.12475364
11 0.75651062 -0.65441493
12 0.48210444 1.67325343
13 0.95348755 0.04828449
14 0.02315498 -0.28481193
15 0.27370762 0.43927826
16 0.83045889 0.75880763
17 0.40049367 0.06945058
18 0.86212662 1.49918712
19 0.97611629 0.13959291
20 0.29107186 0.64483646
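If installing a package is not an option, the same row-wise stacking can be sketched in base R with do.call(rbind, ...). This is an alternative to the taRifx method above, not what that package does internally. The inputFiles list and the column names Target and score below are hypothetical stand-ins for the asker's data.

```r
# Hypothetical stand-in for the asker's list of data frames read from files.
set.seed(1)
inputFiles <- replicate(
  3,
  data.frame(Target = sample(letters, 4), other = 1:4, score = runif(4, 0, 200)),
  simplify = FALSE
)

# Keep columns 1 and 3 of each data frame, then bind the pieces row-wise.
tmp <- do.call(rbind, lapply(inputFiles, function(x) x[, c(1, 3)]))

# Filter on one column and return the matching values of the other.
hits <- tmp[tmp$score < 100, "Target"]
```

Because each piece keeps its column names, the filter can refer to the columns by name rather than by the values/ind names that stack() produces for vectors.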

Related

R - Using Stringr to identify a string across hundreds of rows

I have a database where some people have multiple diagnoses. I posted a similar question in the past, but now have some more nuances I need to work through:
R- How to test multiple 100s of similar variables against a condition
I have this dataset (which was an import of a SAS file)
ID dx1 dx2 dx3 dx4 dx5 dx6 .... dx200
1 343 432 873 129 12 123 3445
2 34 12 44
3 12
4 34 56
Initially, I wanted to create a new variable that flags whether any of the "dxs" equals a certain number, without using hundreds of if statements. All the different variables have the same format (dx#). So I used the following code:
Ex:
dataset$highbloodpressure <- rowSums(screen[0:832] == "410") > 0
This worked great. However, there are many different codes for the same diagnosis. For example, a heart attack can be defined as:
410.1,
410.71,
410.62,
410.42,
...this goes on for 20 additional codes. BUT! They all start with 410.
I thought about using stringr (the variable is a string), to identify the common code components (410, for the example above), but am not sure how to use it in the context of rowsums.
If anyone has any suggestions for this, please let me know!
Thanks for all the help!
You can use the grepl() function, which returns TRUE if a pattern is found in a string. To check all columns at once, just collapse them into one character string per row:
df$dx.410 = NA
for(i in 1:dim(df)[1]){
if(grepl('410',paste(df[i,2:200],collapse=' '))){
df$dx.410[i]="Present"
}
}
This will loop through all rows, build one long character string containing all diagnoses for that case, and write "Present" in column dx.410 if any column contains a 410 diagnosis.
(The solution expects the data structure you have here with the dx-variables in columns 2 to 200. If there are some other columns, just adjust these numbers)
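Note that grepl('410', ...) matches "410" anywhere in the pasted string, so a code such as "1410" would also be flagged. A vectorized sketch that anchors the match at the start of each code and keeps the rowSums idea from the question is shown here. The small data frame, the dx column naming pattern, and the assumption that codes are written as "410" or "410.xx" are all illustrative.

```r
# Hypothetical miniature of the dataset: ID plus diagnosis columns stored as strings.
df <- data.frame(
  ID  = 1:4,
  dx1 = c("410.1", "34", "12", "1410"),
  dx2 = c("432", "410.71", NA, "56"),
  stringsAsFactors = FALSE
)

# Locate the diagnosis columns by name instead of by position.
dx_cols <- grep("^dx[0-9]+$", names(df))

# One logical column per dx variable: TRUE where the code is "410" or starts
# with "410."; grepl() returns FALSE for NA, so empty cells do not match.
starts_410 <- sapply(df[dx_cols], function(col) grepl("^410(\\.|$)", col))

# A row is flagged if at least one of its dx columns matched.
df$dx.410 <- rowSums(starts_410) > 0
```

The same pattern works with stringr::str_detect() if you prefer stringr, but base grepl() is enough here.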

R Refer to (part of) data frame using string in R

I have a large data set in which I have to search for specific codes depending on what I want. For example, chemotherapy is coded by ~40 codes that can appear in any of 40 columns (called diag1, diag2, etc.).
I am in the process of writing a function that produces plots depending on what I want to show. I thought it would be good to specify what I want to plot in a input data frame. Thus, for example, in case I only want to plot chemotherapy events for patients, I would have a data frame like this:
Dataframe name: Style
Name SearchIn codes PlotAs PlotColour
Chemo data[substr(names(data),1,4)=="diag"] 1,2,3,4,5,6 | red
I already have a function that searches for codes in specific parts of the data frame and flags the events of interest. What I cannot do, and need your help with, is referring to a data frame (Style$SearchIn[1]) using code stored in a data frame as above.
> Style$SearchIn[1]
[1] data[substr(names(data),1,4)=="diag"]
Levels: data[substr(names(data),1,4)=="diag"]
I thought perhaps get() would work, but I can't get it to work:
> get(Style$SearchIn[1])
Error in get(vars$SearchIn[1]) : invalid first argument
or
> get(as.character(Style$SearchIn[1]))
Error in get(as.character(Style$SearchIn[1])) :
object 'data[substr(names(data),1,5)=="TDIAG"]' not found
Obviously, running data[substr(names(data),1,5)=="TDIAG"] works.
Example:
library(survival)
ex <- data.frame(SearchIn="lung[substr(names(lung),1,2) == 'ph']")
lung[substr(names(lung),1,2) == 'ph'] #works
get(ex$SearchIn[1]) # does not work
It is not a good idea to store R code in strings and then try to eval them when needed; there are nearly always better solutions for dynamic logic, such as lambdas.
I would recommend using a list to store the plot specification, rather than a data.frame. This would allow you to include a function as one of the list's components which could take the input data and return a subset of it for plotting.
For example:
library(survival);
plotFromSpec <- function(data,spec) {
filteredData <- spec$filter(data);
## ... draw a plot from filteredData and other stuff in spec ...
};
spec <- list(
Name='Chemo',
filter=function(data) data[,substr(names(data),1,2)=='ph'],
Codes=c(1,2,3,4,5,6),
PlotAs='|',
PlotColour='red'
);
plotFromSpec(lung,spec);
If you want to store multiple specifications, you could create a list of lists.
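A list of lists along those lines might look like the following sketch. The helper make_prefix_filter, the spec names, and the toy data frame are hypothetical and only illustrate the structure.

```r
# Hypothetical helper: build a filter that keeps columns whose names start with a prefix.
make_prefix_filter <- function(prefix) {
  force(prefix)  # capture the prefix now, not lazily
  function(data) data[, startsWith(names(data), prefix), drop = FALSE]
}

# One inner list per plot specification.
specs <- list(
  chemo = list(Name = "Chemo", filter = make_prefix_filter("diag"),
               PlotAs = "|", PlotColour = "red"),
  other = list(Name = "Other", filter = make_prefix_filter("oth"),
               PlotAs = "o", PlotColour = "blue")
)

# Toy data standing in for the real data set.
dat <- data.frame(diag1 = 1:3, diag2 = 4:6, other = 7:9)

# Apply every specification's filter to the same data.
filtered <- lapply(specs, function(spec) spec$filter(dat))
```

Each specification carries its own filter function, so no code has to be stored as a string and evaluated later.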
Have you tried using quote()?
I'm not entirely sure what you want but maybe you could store the things you're trying to get() like
quote(data[substr(names(data),1,4)=="diag"])
and then use eval()
eval(quote(data[substr(names(data),1,4)=="diag"]), list(data=data))
For example,
dat <- data.frame("diag1"=1:10, "diag2"=1:10, "other"=1:10)
Style <- list(SearchIn=c(quote(data[substr(names(data),1,4)=="diag"]), quote("Other stuff")))
> head(eval(Style$SearchIn[[1]], list(data=dat)))
diag1 diag2
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6

R - split list by defined interval

I have a list of files whose length will always be a multiple of 12. This is a simplified sample:
files <- c("LC82210802013322LGN00_B1.TIF", "LC82210802013322LGN00_B10.TIF",
"LC82210802013322LGN00_B11.TIF", "LC82210802013322LGN00_B2.TIF",
"LC82210802013322LGN00_B3.TIF", "LC82210802013322LGN00_B4.TIF",
"LC82210802013322LGN00_B5.TIF", "LC82210802013322LGN00_B6.TIF",
"LC82210802013322LGN00_B7.TIF", "LC82210802013322LGN00_B8.TIF",
"LC82210802013322LGN00_B9.TIF", "LC82210802013322LGN00_BQA.TIF",
"LC82210802013354LGN00_B1.TIF", "LC82210802013354LGN00_B10.TIF",
"LC82210802013354LGN00_B11.TIF", "LC82210802013354LGN00_B2.TIF",
"LC82210802013354LGN00_B3.TIF", "LC82210802013354LGN00_B4.TIF",
"LC82210802013354LGN00_B5.TIF", "LC82210802013354LGN00_B6.TIF",
"LC82210802013354LGN00_B7.TIF", "LC82210802013354LGN00_B8.TIF",
"LC82210802013354LGN00_B9.TIF", "LC82210802013354LGN00_BQA.TIF",
"LC82210802014021LGN00_B1.TIF", "LC82210802014021LGN00_B10.TIF",
"LC82210802014021LGN00_B11.TIF", "LC82210802014021LGN00_B2.TIF",
"LC82210802014021LGN00_B3.TIF", "LC82210802014021LGN00_B4.TIF",
"LC82210802014021LGN00_B5.TIF", "LC82210802014021LGN00_B6.TIF",
"LC82210802014021LGN00_B7.TIF", "LC82210802014021LGN00_B8.TIF",
"LC82210802014021LGN00_B9.TIF", "LC82210802014021LGN00_BQA.TIF",
"LC82210802014037LGN00_B1.TIF", "LC82210802014037LGN00_B10.TIF",
"LC82210802014037LGN00_B11.TIF", "LC82210802014037LGN00_B2.TIF",
"LC82210802014037LGN00_B3.TIF", "LC82210802014037LGN00_B4.TIF",
"LC82210802014037LGN00_B5.TIF", "LC82210802014037LGN00_B6.TIF",
"LC82210802014037LGN00_B7.TIF", "LC82210802014037LGN00_B8.TIF",
"LC82210802014037LGN00_B9.TIF", "LC82210802014037LGN00_BQA.TIF",
"LC82210802014085LGN00_B1.TIF", "LC82210802014085LGN00_B10.TIF",
"LC82210802014085LGN00_B11.TIF", "LC82210802014085LGN00_B2.TIF",
"LC82210802014085LGN00_B3.TIF", "LC82210802014085LGN00_B4.TIF",
"LC82210802014085LGN00_B5.TIF", "LC82210802014085LGN00_B6.TIF",
"LC82210802014085LGN00_B7.TIF", "LC82210802014085LGN00_B8.TIF",
"LC82210802014085LGN00_B9.TIF", "LC82210802014085LGN00_BQA.TIF"
)
Those files are satellite images. There are always 12 files (or bands) for each single date. In this case, there are five groups (dates) with 12 files each, totalling 60 elements.
What I need to do is to split this list into groups of 12, ideally creating a new variable. Using the sample data provided above, the new variable would have five elements (corresponding to dates), each one containing 12 files:
new<-list()
length(new) <- length(files)/12
# CODE BELOW DOESN'T WORK. I JUST WANT TO SHOW WHAT I NEED TO DO
new[1] <- files[1:12]
new[2] <- files[13:24]
new[3] <- files[25:36]
new[4] <- files[37:48]
new[5] <- files[49:60]
How can I find a generic solution for this problem? Generic in the sense that the length of the original list of files will always be a multiple of 12, but not always 60 elements; sometimes it is 72, and sometimes 120.
Thanks in advance for any help.
After some more in-depth research I came up with the following solution:
new <- split(files, ceiling(seq_along(files)/12))
which works ok. Any better idea?
Thanks,
Thiago.
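One possible alternative, sketched here under the assumption that every file name is a scene identifier followed by "_B" and a band suffix, is to split on the scene identifier itself rather than on position. This does not depend on the files being sorted or on there being exactly 12 per date. The shortened files vector is only for illustration.

```r
# Shortened sample: two dates, two bands each.
files <- c("LC82210802013322LGN00_B1.TIF", "LC82210802013322LGN00_BQA.TIF",
           "LC82210802013354LGN00_B1.TIF", "LC82210802013354LGN00_BQA.TIF")

# Strip the trailing "_B<band>.TIF" part to recover the scene identifier.
scene <- sub("_B[^_]*$", "", files)

# Group the file names by scene; the result is a named list, one element per date.
by_scene <- split(files, scene)
```

The list names then identify each date directly, for example by_scene[["LC82210802013322LGN00"]].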

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt",header=TRUE)
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store the two columns in two separate vectors, area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to do this:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
With the first option generally being preferred (from what I've seen).
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
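As a small self-contained check of the column access options listed above, the following sketch reads a few rows of the sample data from a string (instead of the asker's foo.txt) and pulls out both columns. It also shows why tbl$first did not work: there is no column with that name.

```r
# A few rows of the sample file, inlined so the example runs without foo.txt.
csv_text <- "area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.82107587343e-05,0.0"

tbl <- read.csv(text = csv_text, header = TRUE)

# Columns are addressed by their header names, not by "first"/"second".
area   <- tbl$area
energy <- tbl[["energy"]]

# Asking for a column that does not exist returns NULL rather than an error.
missing_col <- tbl$first
```

Both area and energy are plain numeric vectors at this point, so wrapping them in c() is unnecessary.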

R eval list strings for data.frames and then concat the data.frames

I have the following situation where I'm pretty desperate.
paste("crossdata","$geno$'",1:4,"'$data",sep="")
generates 4 strings which look like that:
"crossdata$geno$'1'$data" "crossdata$geno$'2'$data" "crossdata$geno$'3'$data" "crossdata$geno$'4'$data"
I want to retrieve the data.frames these 4 strings refer to by evaluating each string, and then combine them via cbind. However, when I do something like this:
cbind(sapply(parse(text=paste("crossdata","$geno$'",i,"'$data",sep="")),eval))
that does not work. Can anybody help me out?
Thanks
datlist <- list(adat=data.frame(u=1:5,v=6:10),bdat=data.frame(x=11:15,y=16:20))
extdat <- c("datlist$adat","datlist$bdat")
do.call('cbind',lapply(extdat,function(i) eval(parse(text=i))))
u v x y
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
Of course this uses eval + parse, which usually means you are on the wrong track.
Using the combination of parse and eval is like saying that you know how to get from New York City to Boston and therefore making all your travel plans by going from your origin to New York, then to Boston, then to your destination. In some cases this may not be too bad, but it is a bit of a long detour if you are traveling from London to Paris.
You should first learn the relationship and difference between subsetting lists using $ and [[ (see ?'[[' for the documentation) and when it is, and more importantly, is not appropriate to use $. Once you understand that you should be able to find solutions that do not require parse and eval.
Your problem may be as simple as (untested since your example is not reproducible):
do.call( cbind, lapply( 1:4, function(x) crossdata[['geno']][[x]][['data']] ) )
or possibly
do.call(cbind, lapply(as.character(1:4), function(x) crossdata$geno[[x]]$data ) )
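Since the original example is not reproducible, here is a sketch with a mock crossdata object to show the [[ approach end to end. The structure (a geno list with elements named "1", "2", each holding a data element) is an assumption based on the strings in the question, and the column names m1 to m3 are made up.

```r
# Mock object shaped like the strings in the question suggest.
crossdata <- list(geno = list(
  "1" = list(data = data.frame(m1 = 1:3, m2 = 4:6)),
  "2" = list(data = data.frame(m3 = 7:9))
))

# Element names are character strings, so index with [[ and a character value.
chr <- names(crossdata$geno)

# Extract each data element and bind the pieces column-wise.
combined <- do.call(cbind, lapply(chr, function(x) crossdata$geno[[x]]$data))
```

Here [[x]] with a character x looks the element up by name, which is what $ cannot do when the name is held in a variable.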

Resources