I have a set of non-overlapping intervals, each with an ID. Given a number, I would like to assign it an ID depending on the interval which it belongs to (NA if no such interval exists).
intervals_id <- structure(list(ID = c(851, 852, 999), Lower = c(85101, 85201,
85301), Upper = c(85104, 85206, 85699)), .Names = c("ID", "Lower",
"Upper"), row.names = c(NA, -3L), class = "data.frame")
# ID Lower Upper
# 1 851 85101 85104
# 2 852 85201 85206
# 3 999 85301 85699
value <- c(15555, 85102, 85201, 85206, 85207, 85600, 86999)
I put together something using cut; it seems to work, but it feels messy. Any idea for a more elegant and straightforward solution?
intervals_id <- intervals_id[order(intervals_id$Lower),]
intervals_id$UpperP <- intervals_id$Upper + 0.01
position <- as.numeric(cut(value, breaks =
as.numeric(t(as.matrix(intervals_id[,c("Lower", "UpperP")]))), right = FALSE))
position[position %% 2 == 0] <- NA
position <- (position + 1) %/% 2
# desired result
data.frame(value, valueID = intervals_id$ID[position])
# value valueID
# 1 15555 NA
# 2 85102 851
# 3 85201 852
# 4 85206 852
# 5 85207 NA
# 6 85600 999
# 7 86999 NA
You can use the foverlaps() function from the data.table package. It finds overlaps between two sets of intervals.
First we need to create data.tables and set keys for them.
library(data.table)
# Using OPs data
setDT(intervals_id)
setkey(intervals_id, Lower, Upper)
# Create dummy intervals (same coordinate) and set key
valueDT <- data.table(start = value, end = value)
setkey(valueDT, start, end)
Next, apply the foverlaps() function:
foverlaps(valueDT, intervals_id)[, .(value = start, ID)]
Result:
# value ID
# 1: 15555 NA
# 2: 85102 851
# 3: 85201 852
# 4: 85206 852
# 5: 85207 NA
# 6: 85600 999
# 7: 86999 NA
PS: the full foverlaps() output looks like this:
ID Lower Upper start end
1: NA NA NA 15555 15555
2: 851 85101 85104 85102 85102
3: 852 85201 85206 85201 85201
4: 852 85201 85206 85206 85206
5: NA NA NA 85207 85207
6: 999 85301 85699 85600 85600
7: NA NA NA 86999 86999
If needed you can play around with the foverlaps() options; see the sketch after this list.
Use nomatch to filter out intervals without overlaps.
Use mult to report "all", "first" or "last" overlaps.
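For example, a minimal sketch reusing the objects created above (this assumes a data.table version where nomatch = NULL drops non-matches; older versions use nomatch = 0L):
# Drop values with no overlapping interval, keep only the first match per value
foverlaps(valueDT, intervals_id, nomatch = NULL, mult = "first")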
Another option is a data.table/base R hybrid using data.table::between:
sapply(value, function(i) {
  i1 <- intervals_id$ID[data.table::between(i, intervals_id$Lower, intervals_id$Upper)]
  if (length(i1) == 0) NA else i1
})
#[1]  NA 851 852 852  NA 999  NA
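For completeness, a vectorised base R sketch with findInterval() gives the same result, assuming intervals_id is sorted by Lower (as it is here):
# Find each value's candidate interval, then check the value actually lies inside it
idx <- findInterval(value, intervals_id$Lower)
ifelse(idx > 0 & value <= intervals_id$Upper[pmax(idx, 1)],
       intervals_id$ID[pmax(idx, 1)], NA)
#[1]  NA 851 852 852  NA 999  NA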
I want to create several columns with an ifelse() condition for multiple dataframes. In this case the dataframes are 3 time-series datasets for cryptocurrencies. Here is the code to download the 3 dataframes automatically:
library(tidyverse)
library(crypto)
crypto_chart <- crypto_prices()%>% select(-id, -symbol,-price_btc, -`24h_volume_usd`,-available_supply, -total_supply,-max_supply, -percent_change_1h, -percent_change_24h, -percent_change_7d, -last_updated)%>% slice(1:3)
list_cryptocurrencies <-crypto_chart$name
map(list_cryptocurrencies,
function(x) crypto_history(x, start_date = '20150101', end_date = '20190303')%>%
select(-slug, -symbol, -name, -`ranknow`))%>%
set_names(list_cryptocurrencies)%>%
list2env(envir = .GlobalEnv)
##Calculating return
map(mget(list_cryptocurrencies),
function(x) x %>% mutate(`return` = (close-open)/open * 100))%>%
list2env(mget(list_cryptocurrencies), envir = .GlobalEnv)
Now I want to detect positive overreactions (oR_pos) in the returns. I define an overreaction as a value (return) higher than the mean + 1 standard deviation. I want to do this also for 1.5 and 2 standard deviations. Here is my desired output for one cryptocurrency (Bitcoin):
> Bitcoin
date open close return oR_pos>1sd oR_pos>1.5sd oR_pos>2sd
1 2018-01-01 14112.2 13657.2 -3.2241607 NA NA NA
2 2018-01-02 13625.0 14982.1 9.9603670 9.960367 9.960367 9.960367
3 2018-01-03 14978.2 15201.0 1.4874952 NA NA NA
4 2018-01-04 15270.7 15599.2 2.1511784 NA NA NA
5 2018-01-05 15477.2 17429.5 12.6140387 12.614039 12.614039 12.614039
6 2018-01-06 17462.1 17527.0 0.3716621 NA NA NA
7 2018-01-07 17527.3 16477.6 -5.9889430 NA NA NA
8 2018-01-08 16476.2 15170.1 -7.9271919 NA NA NA
9 2018-01-09 15123.7 14595.4 -3.4931928 NA NA NA
10 2018-01-10 14588.5 14973.3 2.6376941 NA NA NA
11 2018-01-11 14968.2 13405.8 -10.4381288 NA NA NA
12 2018-01-12 13453.9 13980.6 3.9148500 3.914850 NA NA
Now I have 3 new columns with overreactions (oR_pos) which are > 1sd, 1.5sd and 2sd.
I've already tried this code:
oR_pos_function <- function(y) {
n <- seq(1, 2, 0.5)
y[paste0("oR_pos>", n, "sd")] <-lapply(n, function(x)
ifelse(x$return > mean(x$return)+ sd(x$return),x$return, NA))
y
}
map(mget(list_cryptocurrencies), oR_pos_function)%>%
set_names(list_cryptocurrencies)%>%
list2env(envir = .GlobalEnv)
But it doesn't work.
Can someone help me?
The following closely matches your intended function, adding the desired columns onto your crypto data, while allowing the desired sd thresholds to be passed in as a parameter for flexibility. As an aside, the solution below uses > as per the OP, but you may wish to consider movement in both +/- directions from the mean. With the solution below, that could be done by using instead:
col <- ifelse(returns > (r_mean+(r_sd*threshold)) |
returns < (r_mean-(r_sd*threshold)),
returns,NA)
The solution is as follows:
oR_pos_function <- function(returns,thresholds) {
r_mean <- mean(returns,na.rm=T)
r_sd <- sd(returns,na.rm=T)
cols <- lapply(thresholds,function(threshold) {
col <- ifelse(returns > (r_mean+(r_sd*threshold)),returns,NA)
return(col)
})
cols <- as.data.frame(cols)
names(cols) <- paste0("oR_pos>",thresholds,"sd")
return(cols)
}
new_cols <- oR_pos_function(returns=Bitcoin$return,thresholds=c(1,1.5,2))
Bitcoin <- cbind(Bitcoin,new_cols)
Results:
> head(Bitcoin[Bitcoin$date>="2018-01-01",])
date open high low close volume market close_ratio spread return oR_pos>1sd oR_pos>1.5sd oR_pos>2sd
1097 2018-01-01 14112.2 14112.2 13154.7 13657.2 10291200000 229119155396 0.5248042 957.5 -3.2241607 NA NA NA
1098 2018-01-02 13625.0 15444.6 13163.6 14982.1 16846600192 251377913955 0.7972381 2281.0 9.9603670 9.960367 9.960367 9.960367
1099 2018-01-03 14978.2 15572.8 14844.5 15201.0 16871900160 255080562912 0.4894961 728.3 1.4874952 NA NA NA
1100 2018-01-04 15270.7 15739.7 14522.2 15599.2 21783199744 261795321110 0.8845996 1217.5 2.1511784 NA NA NA
1101 2018-01-05 15477.2 17705.2 15202.8 17429.5 23840899072 292544135538 0.8898258 2502.4 12.6140387 12.614039 12.614039 12.614039
1102 2018-01-06 17462.1 17712.4 16764.6 17527.0 18314600448 294217423675 0.8043891 947.8 0.3716621 NA NA NA
>
An alternative, per the comments:
oR_pos_function <- function(coin_data,thresholds) {
returns <- coin_data$return
r_mean <- mean(returns,na.rm=T)
r_sd <- sd(returns,na.rm=T)
cols <- lapply(thresholds,function(threshold) {
col <- ifelse(returns > (r_mean+(r_sd*threshold)),returns,NA)
return(col)
})
cols <- as.data.frame(cols)
names(cols) <- paste0("oR_pos>",thresholds,"sd")
coin_data <- cbind(coin_data,cols)
return(coin_data)
}
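Usage of this alternative is then, for example (a sketch; it returns the coin data with the three new columns appended):
Bitcoin <- oR_pos_function(Bitcoin, thresholds = c(1, 1.5, 2))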
You can use dplyr::mutate to add any such fields:
library(dplyr)
Bitcoin %>%
mutate(oR_pos_1sd = ifelse(return > mean(return) + sd(return), return , NA),
oR_pos_1.5sd = ifelse(return > mean(return) + 1.5*sd(return), return , NA),
oR_pos_2sd = ifelse(return > mean(return) + 2*sd(return), return , NA))
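To add the columns to all three data frames at once, the same mutate() can be mapped over the list, reusing the list_cryptocurrencies pattern from the question (a sketch, assuming those objects exist in the global environment as created above):
library(purrr)
# Apply the same mutate() to every cryptocurrency data frame and write them back
map(mget(list_cryptocurrencies), function(x)
  x %>% mutate(`oR_pos>1sd` = ifelse(return > mean(return) + sd(return), return, NA),
               `oR_pos>1.5sd` = ifelse(return > mean(return) + 1.5*sd(return), return, NA),
               `oR_pos>2sd` = ifelse(return > mean(return) + 2*sd(return), return, NA))) %>%
  list2env(envir = .GlobalEnv)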
The data set is this:
badData <- list(c(296,310), c(330,335), c(350,565))
df <- data.frame(wavelength = seq(300,360,5.008667),
reflectance = seq(-1,-61,-5.008667))
df
wavelength reflectance
300.0000 -1.000000
305.0087 -6.008667
310.0173 -11.017334
315.0260 -16.026001
320.0347 -21.034668
325.0433 -26.043335
330.0520 -31.052002
335.0607 -36.060669
340.0693 -41.069336
345.0780 -46.078003
350.0867 -51.086670
355.0953 -56.095337
The original question was how to identify whether wavelength fell in any of the ranges given in badData.
The solution offered is this:
https://stackoverflow.com/a/52070363/1012249
My question is: using similar syntax, how does one identify which badData bin it falls into? Let's say badData were structured like this, with non-overlapping bins:
badData <- data.frame(bin=c('a','b','c'),start= c(296,330,350),end=c(310.01,335,565))
Here is an example using the fuzzyjoin package:
library(fuzzyjoin)
df %>%
fuzzy_left_join(badData, #join badData to df
by = c("wavelength" = "start", #variables to join by
"wavelength" = "end"),
match_fun=list(`>=`, `<=`)) #functions to use for each pair of variables, so "wavelength" >= "start" and "wavelength" <= "end" is the logic here
#output
wavelength reflectance bin start end
1 300.0000 -1.000000 a 296 310.01
2 305.0087 -6.008667 a 296 310.01
3 310.0173 -11.017334 <NA> NA NA
4 315.0260 -16.026001 <NA> NA NA
5 320.0347 -21.034668 <NA> NA NA
6 325.0433 -26.043335 <NA> NA NA
7 330.0520 -31.052002 b 330 335.00
8 335.0607 -36.060669 <NA> NA NA
9 340.0693 -41.069336 <NA> NA NA
10 345.0780 -46.078003 <NA> NA NA
11 350.0867 -51.086670 c 350 565.00
12 355.0953 -56.095337 c 350 565.00
You don't need a loop. You can simply use cut:
badData <- data.frame(bin=c('a','b','c'),start= c(296,330,350),end=c(310.01,335,565))
df <- data.frame(wavelength = seq(300,360,5.008667),
reflectance = seq(-1,-61,-5.008667))
df$bins <- cut(df$wavelength, t(badData[, c("start", "end")]),
labels = head(c(t(cbind(as.character(badData$bin), "good"))), -1))
# wavelength reflectance bins
#1 300.0000 -1.000000 a
#2 305.0087 -6.008667 a
#3 310.0173 -11.017334 good
#4 315.0260 -16.026001 good
#5 320.0347 -21.034668 good
#6 325.0433 -26.043335 good
#7 330.0520 -31.052002 b
#8 335.0607 -36.060669 good
#9 340.0693 -41.069336 good
#10 345.0780 -46.078003 good
#11 350.0867 -51.086670 c
#12 355.0953 -56.095337 c
You haven't said which side of the intervals should be open or closed, but this can be adjusted.
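For instance, if you wanted left-closed, right-open intervals, cut's right argument flips which side is closed; a sketch with the same breaks and labels:
df$bins <- cut(df$wavelength, t(badData[, c("start", "end")]),
               labels = head(c(t(cbind(as.character(badData$bin), "good"))), -1),
               right = FALSE)  # intervals become [start, end) instead of (start, end]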
I have data as follows:
PERMNO date DLSTCD
10 1983 NA
10 1985 250
10 1986 NA
10 1986 NA
10 1987 240
10 1987 NA
11 1984 NA
11 1984 NA
11 1985 NA
11 1987 NA
12 1984 240
I need to filter rows based on following criteria:
For each PERMNO, sort data by date
Parse through the sorted data and delete rows after a company gets delisted (i.e. DLSTCD != NA)
If the first row corresponds to company getting delisted, do not include any rows for that company
Based on these criteria, the following is my expected output:
PERMNO date DLSTCD
10 1983 NA
10 1985 250
11 1984 NA
11 1984 NA
11 1985 NA
11 1987 NA
I am using data.table in R to work with this data. The example above is an oversimplified version of my actual data, which contains around 3M rows corresponding to 30k PERMNOs.
I implemented three different methods for doing this, as can be seen here:
r-fiddle: http://www.r-fiddle.org/#/fiddle?id=4GapqSbX&version=3
Below I compare my implementations using a small dataset of 50k rows. Here are my results:
Time Comparison
system.time(dt <- filterbydelistingcode(dt)) # 39.962 seconds
system.time(dt <- filterbydelistcoderowindices(dt)) # 39.014 seconds
system.time(dt <- filterbydelistcodeinline(dt)) # 114.3 seconds
As you can see, all my implementations are extremely inefficient. Can someone please help me implement a much faster version of this? Thank you.
Edit: Here is a link to a sample dataset of 50k rows that I used for time comparison: https://ufile.io/q9d8u
Also, here is a customized read function for this data:
readdata = function(filename){
  data = read.csv(filename, header=TRUE, colClasses = c(date = "Date"))
  PRCABS = abs(data$PRC)             # absolute price
  mcap = PRCABS * data$SHROUT        # market cap = |price| * shares outstanding
  hpr = data$RET
  HPR = as.numeric(levels(hpr))[hpr] # convert factor returns to numeric
  HPR[HPR==""] = NA
  data = cbind(data, PRCABS, mcap, HPR)
  return(data)
}
data <- readdata('fewdata.csv')
dt <- as.data.table(data)
Here's an attempt in data.table:
dat[
dat[order(date),
{
# position of the first delisting row within the group (NA if none)
pos <- match(TRUE, !is.na(DLSTCD));
# keep rows up to and including the delisting row, unless the group starts
# with a delisting; keep all rows if the group never delists
(.I <= .I[pos] & pos != 1) | (is.na(pos))
},
by=PERMNO]
$V1]
# PERMNO date DLSTCD
#1: 10 1983 NA
#2: 10 1985 250
#3: 11 1984 NA
#4: 11 1984 NA
#5: 11 1985 NA
#6: 11 1987 NA
Testing it on 2.5 million rows, 400,000 with a delisting date:
set.seed(1)
dat <- data.frame(PERMNO=sample(1:22000,2.5e6,replace=TRUE), date=1:2.5e6)
dat$DLSTCD <- NA
dat$DLSTCD[sample(1:2.5e6, 400000)] <- 1
setDT(dat)
system.time({
dat[
dat[order(date),
{
pos <- match(TRUE, !is.na(DLSTCD));
(.I <= .I[pos] & pos != 1) | (is.na(pos))
},
by=PERMNO]
$V1]
})
# user system elapsed
# 0.74 0.00 0.76
Less than a second - not bad.
Building on @thelatemail's answer, here are two more variations on the same theme.
In both cases, setkey() first makes things easier to reason with:
setkey(dat,PERMNO,date) # sort by PERMNO, then by date within PERMNO
Option 1: stack the data you want (if any) from each group
system.time(
ans1 <- dat[, {
w = first(which(!is.na(DLSTCD)))
if (!length(w)) .SD
else if (w>1) .SD[seq_len(w)]
}, keyby=PERMNO]
)
user system elapsed
2.604 0.000 2.605
That's quite slow, because allocating and populating all the little bits of memory for each group's result, only for them to be stacked into one single result at the end, takes time and memory.
Option 2: (closer to the way you phrased the question) find the row numbers to delete, then delete them.
system.time({
todelete <- dat[, {
w = first(which(!is.na(DLSTCD)))
if (length(w)) .I[seq.int(from=if (w==1) 1 else w+1, to=.N)]
}, keyby=PERMNO]
ans2 <- dat[ -todelete$V1 ]
})
user system elapsed
0.160 0.000 0.159
That's faster because it only stacks row numbers to delete, followed by a single bulk operation to delete the required rows. Since it's grouping by the first column of the key, it uses the key to make the grouping faster (groups are contiguous in RAM).
More info can be found about ?.SD and ?.I on this manual page.
You can inspect and debug what is happening inside each group just by adding a call to browser() and having a look as follows.
> ans1 <- dat[, {
browser()
w = first(which(!is.na(DLSTCD)))
if (!length(w)) .SD
else if (w>1) .SD[seq_len(w)]
}, keyby=PERMNO]
Browse[1]> .SD # type .SD to look at it
date DLSTCD
1: 21679 NA
2: 46408 1
3: 68378 NA
4: 75362 NA
5: 77690 NA
---
111: 2396559 1
112: 2451629 NA
113: 2461958 NA
114: 2484403 NA
115: 2485217 NA
Browse[1]> w # doesn't exist yet because browser() before that line
Error: object 'w' not found
Browse[1]> w = first(which(!is.na(DLSTCD))) # copy and paste line
Browse[1]> w
[1] 2
Browse[1]> if (!length(w)) .SD else if (w>1) .SD[seq_len(w)]
date DLSTCD
1: 21679 NA
2: 46408 1
Browse[1]> # that is what is returned for this group
Browse[1]> n # or type n to step to next line
debug at #3: w = first(which(!is.na(DLSTCD)))
Browse[2]> help # for browser commands
Let's say you find a problem or bug with one particular PERMNO. You can make the call to browser() conditional, as follows.
> ans1 <- dat[, {
if (PERMNO==42) browser()
w = first(which(!is.na(DLSTCD)))
if (!length(w)) .SD
else if (w>1) .SD[seq_len(w)]
}, keyby=PERMNO]
Browse[1]> .SD
date DLSTCD
1: 31018 NA
2: 35803 1
3: 37494 NA
4: 50012 NA
5: 52459 NA
---
128: 2405818 NA
129: 2429995 NA
130: 2455519 NA
131: 2478605 1
132: 2497925 NA
Browse[1]>
I am analysing some data and need help.
Basically, I have a dataset that looks like this:
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
As can be seen, there's a switch column that alternates between 1 and 2 every 10 data points. I want to write code that does the following: when the switch column changes its value (from 1 to 2, or from 2 to 1), delete the first 5 rows of data after the switch (i.e. leaving the last 5 data points for all 4 variables), average the remaining data points for O2 and CO2, and put them in 2 new columns (avg.O2 and avg.CO2) before the next switch. Then repeat this process until the end.
It's quite easy to do manually on paper or in Excel, but my real dataset would comprise thousands of data points, and I would like to use R to do it automatically for me. Does anyone have any ideas that could help me?
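For reference, a compact dplyr sketch of that logic (an assumption: the runs of the switch column are numbered with rle(), and only the last rows of each run are kept):
library(dplyr)
# Number the switch runs, drop the first 5 rows of each run,
# then attach per-run means of the remaining rows
test.data %>%
  mutate(group = rep(seq_along(rle(switch)$lengths), rle(switch)$lengths)) %>%
  group_by(group) %>%
  slice(-(1:5)) %>%
  mutate(avg.O2 = mean(O2), avg.CO2 = mean(CO2)) %>%
  ungroup()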
Please find my edits, which should work for both regular and irregular switching:
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
CleanMachineData <- function(Data, SwitchData, UnreliableRows = 5){
# First, we can properly turn your switch column into a grouping column (1,2,1,2)->(1,2,3,4)
grouplength <- rle(Data[,"switch"])$lengths
# mapply lets us input vector arguments into typically one/first-element-only
# argument functions. Here we repeat each group number once per row of its group
groups <- mapply(rep, 1:length(grouplength), each = grouplength)
# if switching was irregular, mapply returns a list; if regular, a matrix.
# Convert either into a single vector representing the groups:
if (is.list(groups)) {
groups <- unlist(groups)
} else {
groups <- as.vector(groups)
}
Data$group <- groups
#
# vector of the first row of each new switch (except the starting 0)
switchRow <- c(0,which(abs(diff(SwitchData)) == 1))+1
# I use "as.vector" to turn the matrix output of mapply into a sequence of numbers.
# "ToRemove" will have all the row numbers to get rid of from your original data, except for what happens before (in this case) row 10
ToRemove <- c(1:UnreliableRows, as.vector(mapply(seq, switchRow, switchRow+(UnreliableRows)-1)))
# I concatenate the missing beginning (1,2,3,4,5) and theToRemove them with c() and then remove them from n with "-"
Keep <- seq(nrow(Data))[-c(1:UnreliableRows,ToRemove)]
# Create the new data, (in case you don't know: data[<ROW>,<COLUMN>])
newdat <- Data[-ToRemove,]
# print the results
newdat
}
dat <- CleanMachineData(test.data, test.data$switch, 5)
dat
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
19 2017-04-18 2 21.03252 0.07960098 2
20 2017-04-19 2 21.04032 0.07892145 2
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
39 2017-05-08 2 21.04136 0.07781525 4
40 2017-05-09 2 21.05375 0.07941123 4
aggregate(cbind(O2,CO2) ~ group, dat, mean)
group O2 CO2
1 1 21.04675 0.07812336
2 2 21.03497 0.07819329
3 3 21.03967 0.07834986
4 4 21.04166 0.07882221
# crazier, irregular switching
test.data2 <- test.data
test.data2$switch <- unlist(mapply(rep, 1:2, times = 1, each = c(10,8,10,5,3,10)))[1:20]
dat2 <- CleanMachineData(test.data2, test.data2$switch, 5)
dat2
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
24 2017-04-23 1 21.05658 0.07669662 3
25 2017-04-24 1 21.04452 0.07983165 3
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
# You can try a range of UnreliableRows values with the following
lapply(5:7, function(x) {
dat <- CleanMachineData(test.data2, test.data2$switch, x)
list(data = dat, means = aggregate(cbind(O2,CO2)~group, dat, mean))
})
Use
test.data[rep(c(FALSE, TRUE), each=5),]
to always select the last five rows from each group of 10 rows.
Then you can use aggregate:
d2 <- test.data[rep(c(FALSE, TRUE), each=5),]
aggregate(cbind(O2, CO2) ~ 1, data=d2, FUN=mean)
If you want the average for every 5-row group:
aggregate(cbind(O2, CO2) ~ gl(k=5, n=nrow(d2)/5L), data=d2, FUN=mean)
Here is a generalization for the situation of an arbitrary number of rows in test.data:
stay <- rep(c(FALSE, TRUE), each=5, length.out=nrow(test.data))
d2 <- test.data[stay,]
group <- gl(k=5, n=nrow(d2)/5L+1L, length=nrow(d2))
aggregate(cbind(O2, CO2) ~ group, data=d2, FUN=mean)
Here is a variant for mixing the data with the averages:
group <- gl(k=10, n=nrow(test.data)/10L+1L, length=nrow(test.data))
L <- split(test.data, group)
mySummary <- function(x) {
  if (nrow(x) <= 5) return(NULL)  # skip incomplete groups
  x <- x[-(1:5),]                 # drop the first 5 rows after the switch
  d.avg <- aggregate(cbind(O2, CO2) ~ 1, data=x, FUN=mean)  # means of the rest
  rbind(x, cbind(date=NA, switch=-1, d.avg))  # append the averages as a marker row
}
lapply(L, mySummary) # as list of dataframes
do.call(rbind, lapply(L, mySummary)) # as one dataframe
Here is my question: I have data with 3000 obs. and 5000 features. The 3000 obs. have numeric names like 100.1, 100.3, 100.5, 100.7. I changed the names into integer variables with segs <- as.integer(names); then I want to use segs as a factor and sum each of the 5000 features by it. There are 300 unique values of segs, so the final data frame is 300 by 5000. I know tapply could be used to get the sum by factor for one variable, but I would have to use a for loop to get all 5000 features summed. That is really time-consuming, so I want to know if there is a clean way in R to solve this, or whether there are packages for this kind of problem.
This is my dirty code, where df0 is the data and df is what I want:
df <- data.frame()
for (i in 1:(ncol(df0)-1)) {  # loop over the feature columns
temp <- tapply(df0[,i], df0$seq, sum)
df <- cbind(df,temp)
}
Thanks!
=====
Thanks, Roland; demo data is shown as follows:
set.seed(42)
df0 <- data.frame(
X = rnorm(100,10,10),
Y = rnorm(100),
Z = rnorm(100))
df0$seq <- as.integer(df0$X)
Try this...
set.seed(42)
df0 <- data.frame(
X = rnorm(100,10,10),
Y = rnorm(100),
Z = rnorm(100))
df0$seq <- as.integer(df0$X)
library(data.table)
dt = data.table(df0)
dt[,lapply(.SD, sum), by=seq ]
seq X Y Z
1: 23 164.8144774 1.293768670 -3.74807730
2: 4 8.9247301 1.909529066 -0.06277254
3: 13 40.2090180 -2.036599633 0.88836392
4: 16 147.8571697 -2.571487358 -1.35542918
5: 14 72.1640142 0.432493959 -1.49983832
6: 8 42.8498355 -0.582031919 -1.35989852
7: 25 75.9995653 0.896369560 -1.08024329
8: 9 27.5244048 0.833429855 -1.19363017
9: 30 30.1842371 0.188193035 -0.64574372
10: 32 32.8664539 0.108072728 2.03697217
11: -3 -7.5714175 -0.899304085 -1.27286230
12: 7 29.6254908 -0.929790177 2.75906514
---
27: 12 50.2535374 -0.620793351 -3.80900436
28: 24 24.4410126 -0.433169033 -0.02671746
29: -19 -19.9309008 -0.533492330 -1.01759612
30: 11 11.8523056 -1.071782384 0.96954501
31: 19 38.5407490 -0.751408534 -4.81312992
32: 0 -0.9642319 1.453325156 2.20977601
33: -1 -4.3685646 -0.834654913 -0.24624546
34: 18 18.2177311 -1.594588162 0.27369527
35: -4 -4.5921400 0.586487537 0.86256338
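For reference, base R's rowsum() computes these grouped column sums in one call as well; a minimal sketch on the same demo data:
# Grouped column sums without a loop, using base R
rowsum(df0[, c("X", "Y", "Z")], group = df0$seq)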