R - split list by defined interval

I have a list of files whose length will always be a multiple of 12. This is a simplified sample:
files <- c("LC82210802013322LGN00_B1.TIF", "LC82210802013322LGN00_B10.TIF",
"LC82210802013322LGN00_B11.TIF", "LC82210802013322LGN00_B2.TIF",
"LC82210802013322LGN00_B3.TIF", "LC82210802013322LGN00_B4.TIF",
"LC82210802013322LGN00_B5.TIF", "LC82210802013322LGN00_B6.TIF",
"LC82210802013322LGN00_B7.TIF", "LC82210802013322LGN00_B8.TIF",
"LC82210802013322LGN00_B9.TIF", "LC82210802013322LGN00_BQA.TIF",
"LC82210802013354LGN00_B1.TIF", "LC82210802013354LGN00_B10.TIF",
"LC82210802013354LGN00_B11.TIF", "LC82210802013354LGN00_B2.TIF",
"LC82210802013354LGN00_B3.TIF", "LC82210802013354LGN00_B4.TIF",
"LC82210802013354LGN00_B5.TIF", "LC82210802013354LGN00_B6.TIF",
"LC82210802013354LGN00_B7.TIF", "LC82210802013354LGN00_B8.TIF",
"LC82210802013354LGN00_B9.TIF", "LC82210802013354LGN00_BQA.TIF",
"LC82210802014021LGN00_B1.TIF", "LC82210802014021LGN00_B10.TIF",
"LC82210802014021LGN00_B11.TIF", "LC82210802014021LGN00_B2.TIF",
"LC82210802014021LGN00_B3.TIF", "LC82210802014021LGN00_B4.TIF",
"LC82210802014021LGN00_B5.TIF", "LC82210802014021LGN00_B6.TIF",
"LC82210802014021LGN00_B7.TIF", "LC82210802014021LGN00_B8.TIF",
"LC82210802014021LGN00_B9.TIF", "LC82210802014021LGN00_BQA.TIF",
"LC82210802014037LGN00_B1.TIF", "LC82210802014037LGN00_B10.TIF",
"LC82210802014037LGN00_B11.TIF", "LC82210802014037LGN00_B2.TIF",
"LC82210802014037LGN00_B3.TIF", "LC82210802014037LGN00_B4.TIF",
"LC82210802014037LGN00_B5.TIF", "LC82210802014037LGN00_B6.TIF",
"LC82210802014037LGN00_B7.TIF", "LC82210802014037LGN00_B8.TIF",
"LC82210802014037LGN00_B9.TIF", "LC82210802014037LGN00_BQA.TIF",
"LC82210802014085LGN00_B1.TIF", "LC82210802014085LGN00_B10.TIF",
"LC82210802014085LGN00_B11.TIF", "LC82210802014085LGN00_B2.TIF",
"LC82210802014085LGN00_B3.TIF", "LC82210802014085LGN00_B4.TIF",
"LC82210802014085LGN00_B5.TIF", "LC82210802014085LGN00_B6.TIF",
"LC82210802014085LGN00_B7.TIF", "LC82210802014085LGN00_B8.TIF",
"LC82210802014085LGN00_B9.TIF", "LC82210802014085LGN00_BQA.TIF"
)
Those files are satellite images. There are always 12 files (or bands) for each single date. In this case, there are five groups (dates) with 12 files each, totalling 60 elements.
What I need to do is to split this list into groups of 12, ideally creating a new variable. Using the sample data provided above, the new variable would have five elements (corresponding to dates), each one containing 12 files:
new<-list()
length(new) <- length(files)/12
# CODE BELOW DOESN'T WORK. I JUST WANT TO SHOW WHAT I NEED TO DO
new[1] <- files[1:12]
new[2] <- files[13:24]
new[3] <- files[25:36]
new[4] <- files[37:48]
new[5] <- files[49:60]
How can I find a generic solution to this problem? Generic in the sense that the length of the original list of files will always be a multiple of 12, but not always 60 elements: sometimes 72, and sometimes 120.
Thanks in advance for any help.

After some more in-depth research I came up with the following solution:
new <- split(files, ceiling(seq_along(files)/12))
which works OK. Is there a better way?
Thanks,
Thiago.
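A minimal sketch of that split() idea wrapped in a reusable function, assuming only that the vector length is a multiple of the group size. The helper name split_every and the shortened file vector are made up for illustration. The second grouping, by the scene ID in front of "_B", assumes every file name follows the pattern shown in the question.

```r
# Hypothetical helper (not from the original post): split x into
# consecutive chunks of n elements each.
split_every <- function(x, n = 12) {
  stopifnot(length(x) %% n == 0)          # guard: length must be a multiple of n
  unname(split(x, ceiling(seq_along(x) / n)))
}

# Shortened stand-in for the real file list: two dates, 12 files each.
bands <- c("B1", "B10", "B11", paste0("B", 2:9), "BQA")
files <- c(paste0("LC82210802013322LGN00_", bands, ".TIF"),
           paste0("LC82210802013354LGN00_", bands, ".TIF"))

new <- split_every(files, 12)
length(new)    # number of groups
lengths(new)   # size of each group

# Alternative: group by the scene ID, so each element is named by its date.
by_scene <- split(files, sub("_B.*$", "", files))
names(by_scene)
```

Grouping by scene ID may be the safer choice if a band file could ever be missing, since it does not depend on the position of each file in the vector.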

Related

Subset trials from a list using a map function

I am trying to create another list to include all of the trials from 2 out of the 3 variables shown in the picture.
I am trying to learn how to map this.
So far I have:
d2 <- map(d1 ,`[` , c("time_100L_1", "vertical_100L_1"))
but this only brings in the first trial. I need all 14 trials for time and vertical; force sits in the middle of the list and should be left out.
Any suggestions? See the picture for the list.
map(d1, `[`, c(paste0("time_100L_", 1:14), paste0("vertical_100L_", 1:14)))
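The structure of d1 is only visible in the missing picture, so the following is a sketch under an assumption: d1 is a list of participants, each a named list of trial vectors. It uses base lapply() so it runs without purrr; map() accepts the same `[` and name-vector arguments. The participant names, the three-trial size, and the "force_100L_" prefix are made up for illustration.

```r
# Assumed structure (hypothetical): each participant is a named list holding
# time, force and vertical trials. Three trials here instead of 14.
n_trials <- 3
make_participant <- function() {
  nm <- c(paste0("time_100L_", seq_len(n_trials)),
          paste0("force_100L_", seq_len(n_trials)),
          paste0("vertical_100L_", seq_len(n_trials)))
  setNames(lapply(nm, function(i) runif(2)), nm)
}
d1 <- list(p1 = make_participant(), p2 = make_participant())

# Build the names to keep, then subset every participant with `[`.
keep <- c(paste0("time_100L_", seq_len(n_trials)),
          paste0("vertical_100L_", seq_len(n_trials)))
d2 <- lapply(d1, `[`, keep)
names(d2$p1)

# Alternative: select by name pattern instead of listing every trial.
d3 <- lapply(d1, function(p) p[grepl("^(time|vertical)_", names(p))])
```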

Checking for number of items in a string in R

I have a very large csv file (1.4 million rows). It is supposed to have 22 fields and 21 commas in each row. It was created by taking quarterly text files and compiling them into one large text file so that I could import into SQL. In the past, one field was not in the file. I don't have the time to go row by row and check for this.
In R, is there a way to verify that each row has 22 fields or 21 commas? Below is a small sample data set. The possibly missing field is the 0 in the 10th slot.
32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1
You can use the base R function count.fields to do this:
count.fields(tmp, sep=",")
[1] 22 22
The input for this function is the name of a file or a connection. Below, I supplied a textConnection. For large files, you would probably want to feed this into table:
table(count.fields(tmp, sep=","))
Note that this can also be used to count the number of rows in a file using length, similar to the output of wc -l in the *nix OSs.
data
tmp <- textConnection(
"32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1"
)
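To locate the offending rows rather than just tabulate the counts, a sketch along these lines may help. It writes two sample lines to a temporary file (the second line has the 10th field removed on purpose) and assumes no quoted fields contain commas.

```r
# Two sample rows; the second is deliberately missing its 10th field.
rows <- c(
  "32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1",
  "32,01,01,01,01,01,000001,123,456,132,345,456,456,789,235,256,88,5,1,2,1"
)
path <- tempfile(fileext = ".csv")   # stand-in for the real 1.4M-row file
writeLines(rows, path)

n_fields <- count.fields(path, sep = ",")
table(n_fields)                      # overview of field counts
bad_rows <- which(n_fields != 22)    # line numbers that need fixing
bad_rows
```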
Assuming df is your dataframe
apply(df, 1, length)
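This gives the length of each row of a data frame that has already been read in. Note that this always equals ncol(df), so it cannot reveal rows that had too few fields in the raw file.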

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt",header=TRUE)
producing:
> tbl
          area       energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to run:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
With the first option generally being preferred (from what I've seen).
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
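A short sketch of the column-access options, using the first three rows of the sample file read from an inline string instead of foo.txt. It also shows why the original attempt failed: the data frame has no columns named first or second, so those lookups return NULL. The prob() call is left out because its package is not named in the question.

```r
# Inline stand-in for foo.txt (first three rows of the sample data).
tbl <- read.csv(text = "area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10", header = TRUE)

# Three equivalent ways to pull a column out as a plain numeric vector.
area   <- tbl$area
energy <- tbl[["energy"]]
identical(area, tbl[, 1])

# The names used in the question do not exist in tbl, so this is NULL.
is.null(tbl$first)
```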

R stack alternative

I am trying to write code that reads one column from each of many files and, depending on the values found there, prints the corresponding values of a different column. I have read the files in, but I am having trouble managing the table. I would like to limit the table to just those two columns, because the full files are very large and cumbersome and the other columns are unnecessary. In my attempt to do so I had this line:
tmp<-stack(lapply(inputFiles,function(x) x[,3]))
But ideally I would like to include two columns (3 and 1), not just one, so that I may use a line, such as these ones:
search<-tmp[tmp$values < 100, "Target"]
write(search, file = "Five", ncolumns = 2)
But I am not sure how. I am almost certain that stack is not going to work for more than one column. I tried some different things, similar to this:
tmp<-stack(lapply(inputFiles,function(x) x[,3], x[,1]))
But of course that didn't work, and I don't know where to look. Does anyone have any suggestions?
The taRifx package has a list method for stack that will do what you want. It stacks lists of data.frames.
Untested code:
library(taRifx)
tmp<-stack(lapply(inputFiles,function(x) x[,c(1,3)]))
But you didn't change anything! Why does this work?
lapply() returns a list. In your case, it returns a list where each element is a data.frame.
Base R does not have a special method for stacking lists. So when you call stack() on your list of data.frames, it calls stack.default, which doesn't work.
Loading the taRifx library loads a method of stack that deals specifically with lists of data.frames. So everything works fine since stack() now knows how to properly handle a list of data.frames.
Tested example:
dat <- replicate(10, data.frame(x=runif(2),y=rnorm(2)), simplify=FALSE)
str(dat)
stack(dat)
x y
1 0.42692948 0.32023455
2 0.75388820 0.24154125
3 0.64035957 1.96580059
4 0.47690790 -1.89772855
5 0.41668993 0.78083412
6 0.12643784 0.38029833
7 0.01656855 0.51225268
8 0.40653094 1.09408159
9 0.94236491 -0.13410923
10 0.05578115 1.12475364
11 0.75651062 -0.65441493
12 0.48210444 1.67325343
13 0.95348755 0.04828449
14 0.02315498 -0.28481193
15 0.27370762 0.43927826
16 0.83045889 0.75880763
17 0.40049367 0.06945058
18 0.86212662 1.49918712
19 0.97611629 0.13959291
20 0.29107186 0.64483646
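If installing taRifx is not an option, a base R alternative (a different technique from the stack method above) is to row-bind the per-file subsets with do.call(rbind, ...). This is a sketch on made-up data: the column names Target, other and score, and the threshold of 100, are assumptions standing in for the real files.

```r
set.seed(1)
# Hypothetical stand-in for the list of data frames read from the files:
# column 1 is the identifier, column 3 holds the values to filter on.
inputFiles <- replicate(3,
  data.frame(Target = sample(letters, 4),
             other  = 1:4,
             score  = runif(4, 0, 200)),
  simplify = FALSE)

# Keep only columns 1 and 3 from each file, then stack them row-wise.
tmp <- do.call(rbind, lapply(inputFiles, function(x) x[, c(1, 3)]))

# Targets whose score is below the threshold.
hits <- tmp[tmp$score < 100, "Target"]
dim(tmp)
```

For many large files, repeated rbind can be slow; subsetting each data frame before binding, as done here, keeps the intermediate objects small.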

Pasting (or merging) two elements of a column together

I have two sources of clinical procedure billing information that I have added together (with rbind). In each row there is a CPT field and a CPT.description field that supplies a brief explanation. However, the descriptions are slightly different from the two sources. I want to be able to combine them. That way, if different words or abbreviations are used, then I can just do a string search to find what I am looking for.
So let's make up a simplified representation of a data table that I was able to generate.
cpt <- c(23456,23456,10000,44555,44555)
description <- c("tonsillectomy","tonsillectomy in >12 year old","brain transplant","castration","orchidectomy")
cpt.desc <- data.frame(cpt,description)
And here is what I want to get to.
cpt.wanted <- c(23456,10000,44555)
description.wanted <- c("tonsillectomy; tonsillectomy in >12 year old","brain transplant","castration; orchidectomy")
cpt.desc.wanted <- data.frame(cpt.wanted,description.wanted)
I have tried using functions such as unstack and then lapply(list,paste) but that is not pasting the elements of each list. I also tried reshape but there was no categorical variable to differentiate first or second version of description or even in some cases a third. The really annoying part is I had a similar problem a few months or years ago and someone helped me either on stackoverflow or on r-help and for the life of me I cannot find it.
So the underlying problem is this: imagine that I have a spreadsheet in front of me. I need to do a vertical merge (paste) of two or maybe even three description cells that have the same CPT code in the adjacent column.
What buzzwords should I have been using to search for a solution to this problem?
Thank you so much for your help.
sapply( sapply(unique(cpt), function(x) grep(x, cpt) ),
# creates sets of index vectors as a list
function(x) paste(description[x], collapse=";") )
# ... and this pastes each set of selected items from "description" vector
[1] "tonsillectomy;tonsillectomy in >12 year old"
[2] "brain transplant"
[3] "castration;orchidectomy"
Here is an approach that uses plyr.
library("plyr")
cpt.desc.wanted <- ddply(cpt.desc, .(cpt), summarise,
description.wanted = paste(unique(description), collapse="; "))
which gives
> cpt.desc.wanted
cpt description.wanted
1 10000 brain transplant
2 23456 tonsillectomy; tonsillectomy in >12 year old
3 44555 castration; orchidectomy
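The same grouping can be sketched in base R with aggregate(), in case plyr is not available. This is an alternative to the answers above, not a replacement for them. The name merged is made up, and stringsAsFactors = FALSE is set explicitly so the behaviour does not depend on the R version.

```r
cpt <- c(23456, 23456, 10000, 44555, 44555)
description <- c("tonsillectomy", "tonsillectomy in >12 year old",
                 "brain transplant", "castration", "orchidectomy")
cpt.desc <- data.frame(cpt, description, stringsAsFactors = FALSE)

# Collapse all distinct descriptions that share a CPT code into one string.
merged <- aggregate(description ~ cpt, data = cpt.desc,
                    FUN = function(d) paste(unique(d), collapse = "; "))
merged
```

Useful search terms for this kind of task are "collapse rows by group" or "concatenate strings by group".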
