Split data.frame into groups by column name

I'm new to R. I have a data frame with column names like these:
file_001 file_002 block_001 block_002 red_001 red_002 ... etc.
0.05 0.2 0.4 0.006 0.05 0.3
0.01 0.87 0.56 0.4 0.12 0.06
I want to split them into groups by the column name, to get a result like this:
group_file
file_001 file_002
0.05 0.2
0.01 0.87
group_block
block_001 block_002
0.4 0.006
0.56 0.4
group_red
red_001 red_002
0.05 0.3
0.12 0.06
... etc.
My file is huge, and there is no fixed number of groups. The grouping needs to be based only on the start of each column name.

In base R, you can use sub and split.default like this to return a list of data.frames:
myDfList <- split.default(dat, sub("_\\d+", "", names(dat)))
This returns:
myDfList
$block
block_001 block_002
1 0.40 0.006
2 0.56 0.400
$file
file_001 file_002
1 0.05 0.20
2 0.01 0.87
$red
red_001 red_002
1 0.05 0.30
2 0.12 0.06
split.default splits a data.frame column-wise according to its second argument. Here, we use sub with the regular expression "_\\d+" to remove the underscore and the digits following it, which yields the grouping values "block", "file", and "red".
As a side note, it is typically a good idea to keep these data.frames in a list and work with them through functions like lapply. See gregor's answer to this post for some motivating examples.
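For example, once the groups are in a list, the same operation can be applied to every group in one call. A minimal sketch using the myDfList from above (rowMeans is just a placeholder operation):
# compute the row means within each group of columns
groupMeans <- lapply(myDfList, rowMeans)
groupMeans$file
# [1] 0.125 0.440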

Thank you lmo. Your code didn't do quite what I wanted, but thanks to your guidance I came up with a solution.
So, to split the data frame into a list of groups by everything before the first underscore (the pattern "_.*" removes the underscore and everything after it, not just the trailing digits):
myDfList <- split.default(dat, sub(x = names(dat), pattern = "_.*", replacement = ""))
Hope it helps people in the future!

Related

Sum vector with numbers by dynamic intervals without looping

I have dynamic intervals in a data frame, generated by a percentage calculation on my data, like below:
Start Finish
0.00 0.86
0.87 0.89
0.90 0.98
0.99 1.00
I have a vector of about 3000 numbers, and I want to count how many of them fall into each interval without using a loop, because that is far too slow.
Numbers<-c(0.1,0.2,0.3,0.7,0.8,0.9,0.91,0.99)
Expected result in this case: 5, 0, 2, 1
You can use apply() to go through your Start/Finish data.frame row by row, check which numbers lie between the Start and Finish values, and sum up the logical vector returned by data.table's between() function.
Numbers <- c(0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 0.91, 0.99)
sf <- read.table(text = "Start Finish
                         0.00 0.86
                         0.87 0.89
                         0.90 0.98
                         0.99 1.00",
                 header = TRUE)
apply(sf, 1, function(x) {
  sum(data.table::between(Numbers, x[1], x[2]))
})
This will return:
5 0 2 1
We can use foverlaps from data.table (here df is the Start/Finish intervals, i.e. sf above):
library(data.table)
setDT(df)                                              # the intervals as a data.table
dfN <- data.table(Start = Numbers, Finish = Numbers)   # each number as a zero-width interval
setkeyv(df, names(df))
setkeyv(dfN, names(dfN))
# for each interval (xid), count the numbers (yid) that overlap it
foverlaps(df, dfN, which = TRUE, type = "any")[, sum(!is.na(yid)), xid]$V1
#[1] 5 0 2 1
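If you would rather avoid the data.table dependency, the same counts can be obtained in base R with a vectorised comparison per interval (a minimal sketch, assuming the Numbers vector and sf intervals defined above). This iterates over the handful of intervals rather than the ~3000 numbers, so it stays fast:
# count how many Numbers fall inside each [Start, Finish] interval
mapply(function(lo, hi) sum(Numbers >= lo & Numbers <= hi), sf$Start, sf$Finish)
# [1] 5 0 2 1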

How to retrieve columns in sequence from SQLite using R

I use sensor data covering wavelengths from 100-300 nm in steps of 1, i.e. 100, 101, 102, 103, ..., 300. I store the data in SQLite using R, in a table named data:
> data
obs 100 101 102 103 104 ... 300
1 0.1 0.1 0.9 0.1 0.2 0.5
2 0.8 1.0 0.9 0.0 1.0 0.4
3 0.7 0.8 0.3 0.8 0.5 0.2
4 0.7 0.1 0.2 0.4 0.7 0.6
5 0.9 0.4 0.6 0.6 0.6 0.4
6 0.7 0.1 0.6 0.7 0.9 0.9
I want to retrieve only every 4th column, starting at 100, i.e. 100, 104, 108, ...
I tried sqldf("select 100, 104, 108, ... from data"), but typing out all the columns does not seem efficient. Can someone help with an R solution? Thanks!
You can use paste() inside sqldf to make things like this easier. The basic idea would be:
sqldf(paste("select",
            paste0("`", seq(100, 300, 4), "`", collapse = ", "),
            "from data"))
Columns or tables with numeric names typically need to be surrounded with backticks. So that's why I've adjusted the statement to find `100`, instead of 100.
The full statement (simplified above) looks like this:
[1] "select `100`, `104`, `108`, `112`, `116`, `120`, `124`, `128`, `132`, `136`,
`140`, `144`, `148`, `152`, `156`, `160`, `164`, `168`, `172`, `176`,
`180`, `184`, `188`, `192`, `196`, `200`, `204`, `208`, `212`, `216`,
`220`, `224`, `228`, `232`, `236`, `240`, `244`, `248`, `252`, `256`,
`260`, `264`, `268`, `272`, `276`, `280`, `284`, `288`, `292`, `296`,
`300` from data"
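If you want to sanity-check the generated statement without touching the real database, it can be run against a small toy data frame (a hypothetical stand-in for the real data table; sqldf uploads it to a temporary SQLite database behind the scenes):
library(sqldf)
# toy table with one column per wavelength 100..300
data <- setNames(as.data.frame(matrix(runif(2 * 201), nrow = 2)), as.character(100:300))
res <- sqldf(paste("select",
                   paste0("`", seq(100, 300, 4), "`", collapse = ", "),
                   "from data"))
dim(res)  # expect 2 rows and 51 columns (100, 104, ..., 300)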
sqldf loads the gsubfn package which provides fn$ for dealing with string interpolation. fn$ can preface any function invocation so, for example, use fn$sqldf("... $var ...") and then $var is replaced with its value.
Note that select 100 selects the number 100 and not the column named 100 so we use select [100] instead.
cn <- toString(sprintf("[%d]", seq(100, 300, 4))) # "[100], [104], ..."
fn$sqldf("select $cn from data")
or if we want to create the SQL statement in a variable and then run it:
sql <- fn$identity("select $cn from data")
sqldf(sql)
Note that this is pretty easy to do in straight R as well:
data[paste(seq(100, 300, 4))]
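A quick way to try that straight-R version is on the same kind of toy data frame (hypothetical; your real table lives in SQLite):
data <- setNames(as.data.frame(matrix(runif(2 * 201), nrow = 2)), as.character(100:300))
wanted <- data[paste(seq(100, 300, 4))]  # every 4th wavelength column, selected by name
dim(wanted)  # 2 rows, 51 columns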

Creating a lookup based on two values

I have an Excel file that contains a matrix. Here is a screenshot of the matrix I want to use: https://www.flickr.com/photos/113328996#N07/23026818939/in/dateposted-public/
What I would like to do now is create some kind of lookup function. So when I have a pair such as:
Arsenal - Aston Villa
It should look up 114.6.
Of course I could create rows with all distances, like:
Arsenal - Aston Villa - 144.6
and perform a lookup on that, but my instinct tells me this is not the most efficient way.
Any feedback on how I can deal with the above most efficiently?
This lookup function is just the basic [ operator for data.frames and matrices in R.
Take this example data:
a <- cbind(c(0.1,0.5,0.25),c(0.2,0.3,0.65),c(0.7,0.2,0.1))
rownames(a) <- c("Lilo","Chops","Henmans")
colnames(a) <- c("Product A","Product B","Product C")
a
Product A Product B Product C
Lilo 0.10 0.20 0.7
Chops 0.50 0.30 0.2
Henmans 0.25 0.65 0.1
The lookup function is then:
a["Lilo","Product A"] # 0.1
a["Henmans","Product B"] # 0.65

Complex subsetting of dataframe

Consider the following dataframe:
df <- data.frame(Asset = c("A", "B", "C"), Historical = c(0.05,0.04,0.03), Forecast = c(0.04,0.02,NA))
# Asset Historical Forecast
#1 A 0.05 0.04
#2 B 0.04 0.02
#3 C 0.03 NA
as well as the variable x. x is set by the user at the beginning of the R script, and can take two values: either x = "Forecast" or x = "Historical".
If x = "Forecast", I would like to return the following: for each asset, if a forecast is available, return the appropriate number from the column "Forecast", otherwise, return the appropriate number from the column "Historical". As you can see below, both A and B have a forecast value which is returned below. C is missing a forecast value, so the historical value is returned.
Asset Return
1 A 0.04
2 B 0.02
3 C 0.03
If, however, x = "Historical", simply return the Historical column:
Asset Historical
1 A 0.05
2 B 0.04
3 C 0.03
I can't come up with an easy way of doing it, and brute force is very inefficient if you have a large number of rows. Any ideas?
Thanks!
First, pre-process your data:
df2 <- transform(df, Forecast = ifelse(!is.na(Forecast), Forecast, Historical))
Then extract the two columns of choice:
df2[c("Asset", x)]

How to summarize multiple files into one file based on an assigned rule?

I have ~100 files in the following format. Each file has its own file name, but all of the files are saved in the same directory. For example, filecd looks like this:
A B C D
ab 0.3 0.0 0.2 0.20
cd 0.7 0.0 0.3 0.77
ef 0.8 0.1 0.5 0.91
gh 0.3 0.5 0.6 0.78
fileabb is as follows:
A B C D
ab 0.3 0.9 1.0 0.20
gh 0.3 0.5 0.6 0.9
All these files have the same number of columns but different numbers of rows.
I want to summarize each file as one row (0 if every cell in a column is < 0.8; 1 if ANY cell in that column is >= 0.8), and save the summarized results in a separate CSV file as follows:
A B C D
filecd 1 0 0 1
fileabb 0 1 1 1
... (and so on, up to 100 files)
Rather than reading and processing each file by hand, can this be done efficiently in R? Could you give me some help on how to do so? Thanks.
For ease of discussion, I have added the following lines to create sample input data:
file1 <- data.frame(A=c(0.3, 0.7, 0.8, 0.3), B=c(0,0,0.1,0.5), C=c(0.2,0.3,0.5,0.6), D=c(0.2,0.77,0.91, 0.78))
file2 <- data.frame(A=c(0.3, 0.3), B=c(0.9,0.5), C=c(1,0.6), D=c(0.2,0.9))
Please kindly give me some more advice. Many thanks.
First make a vector of all the filenames.
filenames <- dir(your_data_dir)  # you may also need the pattern argument (the returned names are not full paths)
Then read the data into a list of data frames.
data_list <- lapply(filenames, function(fn) as.matrix(read.delim(fn)))
# maybe with other arguments passed to read.delim
Now calculate the summary.
summarised <- lapply(data_list, function(dfr)
{
  # TRUE if any value in the column is >= 0.8 (wrap in as.integer() if you want 1/0)
  apply(dfr, 2, function(col) any(col >= 0.8))
})
Convert this list into a matrix.
summary_matrix <- do.call(rbind, summarised)
Make the rownames match the file.
rownames(summary_matrix) <- filenames
Now write out to CSV.
write.csv(summary_matrix, "my_summary_matrix.csv")
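As a quick check of the summary step, the same logic can be run on the in-memory file1/file2 samples from the question (the unary + converts TRUE/FALSE to the requested 1/0):
data_list <- list(filecd = as.matrix(file1), fileabb = as.matrix(file2))
summarised <- lapply(data_list, function(dfr) +apply(dfr, 2, function(col) any(col >= 0.8)))
do.call(rbind, summarised)
#         A B C D
# filecd  1 0 0 1
# fileabb 0 1 1 1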
