Split a dataframe with multiple delimiters in R

df1 <-
Gene GeneLocus
CPA1|1357 chr7:130020290-130027948:+
GUCY2D|3000 chr17:7905988-7923658:+
UBC|7316 chr12:125396194-125399577:-
C11orf95|65998 chr11:63527365-63536113:-
ANKMY2|57037 chr7:16639413-16685398:-
Expected output:
df2 <-
Gene.1 Gene.2 chr start end
CPA1 1357 7 130020290 130027948
GUCY2D 3000 17 7905988 7923658
UBC 7316 12 125396194 125399577
C11orf95 65998 11 63527365 63536113
ANKMY2 57037 7 16639413 16685398
I tried it this way:
install.packages("splitstackshape")
library(splitstackshape)
df1 <- cSplit(df1,"Gene", sep="|", direction="wide", fixed=T)
df1 <- cSplit(df1,"GeneLocus",sep=":",direction="wide", fixed=T)
df1 <- cSplit(df1,"GeneLocus_2",sep="-",direction="wide", fixed=T)
df1 <- data.frame(df1)
df1$GeneLocus_1 <- gsub("chr", "", df1$GeneLocus_1)
I would like to know whether there is a simpler alternative way to do this.

Here you go. Just ignore the warning; it does not affect the desired output. It actually reflects a useful side effect: the strand information (:+ or :-) is dropped.
library(tidyr)
library(dplyr)
df1 %>%
  separate(Gene, c("Gene.1", "Gene.2")) %>%
  separate(GeneLocus, c("chr", "start", "end")) %>%
  mutate(chr = sub("chr", "", chr))
Output:
Gene.1 Gene.2 chr start end
1 CPA1 1357 7 130020290 130027948
2 GUCY2D 3000 17 7905988 7923658
3 UBC 7316 12 125396194 125399577
4 C11orf95 65998 11 63527365 63536113
5 ANKMY2 57037 7 16639413 16685398
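If you would rather avoid the warning altogether and get numeric start/end columns in one pass, separate() also takes extra and convert arguments; a minimal sketch of that variant:
library(tidyr)
library(dplyr)
df1 %>%
  separate(Gene, c("Gene.1", "Gene.2"), convert = TRUE) %>%
  # extra = "drop" silently discards the trailing strand piece (:+ or :-);
  # convert = TRUE parses start/end as integers
  separate(GeneLocus, c("chr", "start", "end"), extra = "drop", convert = TRUE) %>%
  mutate(chr = sub("chr", "", chr))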

I would suggest something like the following approach:
Make a single delimiter in your "GeneLocus" column (and strip out the unnecessary parts while you're at it).
Split both columns at once. Note that cSplit "balances" the columns being split according to the number of output columns detected. Since the first column splits into only 2 columns while the second splits into 4, you would need to drop columns 3 and 4 (the padding added to the first split) from the result.
library(splitstackshape)
GLPat <- "^chr(\\d+):(\\d+)-(\\d+):([+-])$"
cSplit(as.data.table(mydf)[, GeneLocus := gsub(GLPat, "\\1|\\2|\\3|\\4", GeneLocus)],
       names(mydf), "|")[, 3:4 := NULL, with = FALSE][]
# Gene_1 Gene_2 GeneLocus_1 GeneLocus_2 GeneLocus_3 GeneLocus_4
# 1: CPA1 1357 7 130020290 130027948 +
# 2: GUCY2D 3000 17 7905988 7923658 +
# 3: UBC 7316 12 125396194 125399577 -
# 4: C11orf95 65998 11 63527365 63536113 -
# 5: ANKMY2 57037 7 16639413 16685398 -
Alternatively, you can try col_flatten from my "SOfun" package, with which you can do:
library(SOfun)
Pat <- "^chr(\\d+):(\\d+)-(\\d+):([+-])$"
Fun <- function(invec) strsplit(gsub(Pat, "\\1|\\2|\\3|\\4", invec), "|", TRUE)
col_flatten(as.data.table(mydf)[, lapply(.SD, Fun)], names(mydf), drop = TRUE)
# Gene_1 Gene_2 GeneLocus_1 GeneLocus_2 GeneLocus_3 GeneLocus_4
# 1: CPA1 1357 7 130020290 130027948 +
# 2: GUCY2D 3000 17 7905988 7923658 +
# 3: UBC 7316 12 125396194 125399577 -
# 4: C11orf95 65998 11 63527365 63536113 -
# 5: ANKMY2 57037 7 16639413 16685398 -
SOfun is only on GitHub, so you can install it with:
source("http://news.mrdwab.com/install_github.R")
install_github("mrdwab/SOfun")
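(If that installer script is ever unavailable, the remotes package provides the same functionality; the repository name below is taken from the call above.)
# install.packages("remotes")  # once, if needed
remotes::install_github("mrdwab/SOfun")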

How do I gsub the complete time string behind #

(This is my first question, so if I can improve anything about it, please let me know!)
I am analysing a large observational dataset. The start and stop times of each observation were recorded, so I was able to calculate the durations. But there is a note column which includes information on "pauses"/"breaks" or "out of sight" periods in which the animal was not seen. I would like to subtract those time periods from the total duration.
My problem is that one column can include several notes, not only pauses ("HH:MM-HH:MM") but also info on certain events (xy happened "#HH:MM").
I only want to look at time periods in the format HH:MM-HH:MM and want to exclude all event times labeled "#HH:MM". I've managed to drop all words and be left with only numbers, so it looks like this:
id <- c("3990", "3989", "3004")
timepoints <- c("#6:19,,7:16-7:23,7:25-7:43,#7:53,", "#6:19,,7:25-7:43,#7:53", "7:30-7:39,7:45-7:48,7:49-7:54")
df <- data.frame(id, timepoints)
I tried several ways with grep or gsub, trying to indicate either which parts to keep or which to leave out, but I failed. The closest I got was R dropping "#HH" but keeping ":MM". For this I used
gsub("#([[:digit:]]|[_])*", "", df$timepoints)
as found for a similar problem (just with words) here: remove all words that start with "#" from a string
The aim is to get (e.g.):
id    timepoints
3990  "7:16-7:23, 7:25-7:43"
or
id    timepoints
3990  "7:16-7:23", "7:25-7:43"
If possible separated by commas, or directly split into different columns, so I can extract the times and subtract them from my total observation time.
Any help would be greatly appreciated!
How about matching the strings you're interested in instead?
With base:
df$new_timepoints <- regmatches(df$timepoints, gregexpr("\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}", df$timepoints))
Output (with a list column):
id timepoints new_timepoints
1 3990 #6:19,,7:16-7:23,7:25-7:43,#7:53, 7:16-7:23, 7:25-7:43
2 3989 #6:19,,7:25-7:43,#7:53 7:25-7:43
3 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:30-7:39, 7:45-7:48, 7:49-7:54
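If you prefer plain comma-separated strings over a list column, one optional extra line flattens it:
# collapse each list element into a single comma-separated string
df$new_timepoints <- sapply(df$new_timepoints, paste, collapse = ", ")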
With tidyverse (in a long format for easy calculations!):
library(stringr)
library(dplyr)
library(tidyr)
df |>
  group_by(id) |>
  mutate(new_timepoints = str_extract_all(timepoints, "\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}")) |>
  unnest_longer(new_timepoints) |>
  ungroup()
Output:
# A tibble: 6 × 3
id timepoints new_timepoints
<chr> <chr> <chr>
1 3990 #6:19,,7:16-7:23,7:25-7:43,#7:53, 7:16-7:23
2 3990 #6:19,,7:16-7:23,7:25-7:43,#7:53, 7:25-7:43
3 3989 #6:19,,7:25-7:43,#7:53 7:25-7:43
4 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:30-7:39
5 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:45-7:48
6 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:49-7:54
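Since the end goal is subtracting the pauses from the total observation time, the long format feeds straight into a duration calculation. A sketch, assuming all times fall within a single day (strptime() attaches today's date, which cancels out in the subtraction):
library(dplyr)
library(stringr)
library(tidyr)
df |>
  mutate(new_timepoints = str_extract_all(timepoints, "\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}")) |>
  unnest_longer(new_timepoints) |>
  # split each interval into its start and end time
  separate(new_timepoints, c("start", "end"), sep = "-") |>
  mutate(pause_mins = as.numeric(difftime(strptime(end, "%H:%M"),
                                          strptime(start, "%H:%M"),
                                          units = "mins"))) |>
  group_by(id) |>
  summarise(total_pause_mins = sum(pause_mins))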
You can do something like this:
f <- function(x) {
  lapply(x, \(s) {
    s <- strsplit(s, ",")[[1]]
    s[grepl("^\\d", s)]
  })
}
and then apply that function to the timepoints column
library(tidyverse)
mutate(df %>% as_tibble(), timepoints = f(timepoints)) %>%
unnest(timepoints)
Output:
id timepoints
<chr> <chr>
1 3990 7:16-7:23
2 3990 7:25-7:43
3 3989 7:25-7:43
4 3004 7:30-7:39
5 3004 7:45-7:48
6 3004 7:49-7:54
You could also use unnest_wider() to get these as columns; for that I would adjust my f() to include the names of the timepoints:
f <- function(x) {
  lapply(x, \(s) {
    s <- strsplit(s, ",")[[1]]
    s <- s[grepl("^\\d", s)]
    setNames(s, paste0("tp", seq_along(s)))  # seq_along() is safe if s is empty
  })
}
library(tidyverse)
mutate(df %>% as_tibble(), timepoints = f(timepoints)) %>%
unnest_wider(timepoints)
Output:
id tp1 tp2 tp3
<chr> <chr> <chr> <chr>
1 3990 7:16-7:23 7:25-7:43 NA
2 3989 7:25-7:43 NA NA
3 3004 7:30-7:39 7:45-7:48 7:49-7:54
Setting up the data with the data.table package:
library(data.table)
id <- c("3990", "3989", "3004")
timepoints <- c("#6:19,,7:16-7:23,7:25-7:43,#7:53,", "#6:19,,7:25-7:43,#7:53", "7:30-7:39,7:45-7:48,7:49-7:54")
df <- data.table(id, timepoints)
Note that I saved it as a data.table.
Splitting the timepoints by comma and storing the value in the new_time column.
df[,new_time:=strsplit(timepoints, ",")]
Removing the string values that contain "#":
df[,new_time:=sapply(new_time, function(x) return(x[!grepl("[#]", x)]))]
Since the timepoints column can have multiple commas in a row, empty strings ("") exist; I remove them:
df[,new_time:=sapply(new_time, function(x) return(x[!stringi::stri_isempty(x)]))]
Now the new_time column looks like this:
df$new_time
[[1]]
[1] "7:16-7:23" "7:25-7:43"
[[2]]
[1] "7:25-7:43"
[[3]]
[1] "7:30-7:39" "7:45-7:48" "7:49-7:54"
If you want the new_time column to hold whole strings instead:
df[,new_time:=sapply(new_time, paste, collapse=", ")]
df$new_time
[1] "7:16-7:23, 7:25-7:43" "7:25-7:43" "7:30-7:39, 7:45-7:48, 7:49-7:54"
1) list Split by comma and then grep out the components with a dash. No packages are used. This gives a list of character vectors as the timepoints column.
df2 <- df
df2$timepoints <- lapply(strsplit(df$timepoints, ","),
                         grep, pattern = "-", value = TRUE)
df2
## id timepoints
## 1 3990 7:16-7:23, 7:25-7:43
## 2 3989 7:25-7:43
## 3 3004 7:30-7:39, 7:45-7:48, 7:49-7:54
str(df2)
## 'data.frame': 3 obs. of 2 variables:
## $ id : chr "3990" "3989" "3004"
## $ timepoints:List of 3
## ..$ : chr "7:16-7:23" "7:25-7:43"
## ..$ : chr "7:25-7:43"
## ..$ : chr "7:30-7:39" "7:45-7:48" "7:49-7:54"
2) character If you want a comma-separated character string in each row, add this:
transform(df2, timepoints = sapply(timepoints, paste, collapse = ","))
## id timepoints
## 1 3990 7:16-7:23,7:25-7:43
## 2 3989 7:25-7:43
## 3 3004 7:30-7:39,7:45-7:48,7:49-7:54
3) long form Or, if you prefer long form, use this:
long <- with(df2, stack(setNames(timepoints, id))[2:1])
names(long) <- names(df2)
long
## id timepoints
## 1 3990 7:16-7:23
## 2 3990 7:25-7:43
## 3 3989 7:25-7:43
## 4 3004 7:30-7:39
## 5 3004 7:45-7:48
## 6 3004 7:49-7:54
4) wide form Or a wide-form matrix:
nr <- nrow(long)
L <- transform(long, seq = ave(1:nr, id, FUN = seq_along))
tapply(L$timepoints, L[c("id", "seq")], c)
## seq
## id 1 2 3
## 3990 "7:16-7:23" "7:25-7:43" NA
## 3989 "7:25-7:43" NA NA
## 3004 "7:30-7:39" "7:45-7:48" "7:49-7:54"

Dynamically determine if a dataframe column exists and mutate if it does

I have code that pulls and processes data from a database based upon a client name. Some clients may have data that does not include a specific column name, e.g., last_name or first_name. For clients that do not use last_name or first_name, I don't care. For clients that do use either of those fields, I need to mutate() those columns with toupper() so that I can join on those standardized fields later in the ETL process.
Right now, I'm using a series of if() statements and some helper functions to look at the names of a dataframe, then mutate if the columns exist. I'm using if() statements because ifelse() is mostly vectorized and doesn't handle dataframes well.
library(dplyr)
set.seed(256)
b <- data.frame(id = sample(1:100, 5, FALSE),
                col_name = sample(1000:9999, 5, FALSE),
                another_col = sample(1000:9999, 5, FALSE))
d <- data.frame(id = sample(1:100, 5, FALSE),
                col_name = sample(1000:9999, 5, FALSE),
                last_name = sample(letters, 5, FALSE))
mutate_first_last <- function(df){
  mutate_first_name <- function(df){
    df %>%
      mutate(first_name = first_name %>% toupper())
  }
  mutate_last_name <- function(df){
    df %>%
      mutate(last_name = last_name %>% toupper())
  }
  n <- c("first_name", "last_name") %in% names(df)
  if (n[1] & n[2]) return(df %>% mutate_first_name() %>% mutate_last_name())
  if (n[1] & !n[2]) return(df %>% mutate_first_name())
  if (!n[1] & n[2]) return(df %>% mutate_last_name())
  if (!n[1] & !n[2]) return(df)
}
I get what I expect this way:
> b %>% mutate_first_last()
id col_name another_col
1 48 8318 6207
2 39 7155 7170
3 16 4486 4321
4 55 2521 8024
5 15 1412 4875
> d %>% mutate_first_last()
id col_name last_name
1 64 7438 A
2 43 4551 Q
3 48 7401 K
4 78 3682 Z
5 87 2554 J
but is this the best way to handle this kind of task? To dynamically look to see if a column name exists in a dataframe then mutate it if it does? It seems strange to have to have multiple if() statements in this function. Is there a more streamlined way to process these data?
You can use mutate_at with one_of, both from dplyr. This will mutate a column only if its name matches one of c("first_name", "last_name"). If there is no match, it generates a simple warning, which you can ignore or suppress.
library(dplyr)
d %>%
  mutate_at(vars(one_of(c("first_name", "last_name"))), toupper)
id col_name last_name
1 19 7461 V
2 52 9651 H
3 56 1901 P
4 13 7866 Z
5 25 9527 U
# example with no match
b %>%
  mutate_at(vars(one_of(c("first_name", "last_name"))), toupper)
id col_name another_col
1 34 9315 8686
2 26 5598 4124
3 17 3318 2182
4 32 1418 4369
5 49 4759 6680
Warning message:
Unknown variables: `first_name`, `last_name`
Here are the other ?select_helpers in dplyr:
These functions allow you to select variables based on their names.
starts_with(): starts with a prefix
ends_with(): ends with a suffix
contains(): contains a literal string
matches(): matches a regular expression
num_range(): a numerical range like x01, x02, x03.
one_of(): variables in character vector.
everything(): all variables.
Update dplyr 1.0.0
In dplyr 1.0, the scoped variants of mutate such as _at or _all were replaced by across().
In addition, the best tidyselect helper for this case is any_of, as it operates on the variables that exist and silently ignores those that don't (no warning message).
As a result, you can write the following:
# purrr syntax
d %>% mutate(across(any_of(c("first_name", "last_name")), ~toupper(.x)))
# function name syntax
d %>% mutate(across(any_of(c("first_name", "last_name")), toupper))
which both return the mutated column
id col_name last_name
1 19 4398 Q
2 72 1135 S
3 54 9767 V
4 60 4364 K
5 35 1564 X
while
b %>% mutate(across(any_of(c("first_name", "last_name")), toupper))
ignores the missing columns and thus returns (without a warning message):
id col_name another_col
1 42 7601 4482
2 22 1773 7072
3 47 2719 5884
4 1 9595 5945
5 81 8044 3927
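For completeness, the same idea needs no dplyr at all: intersect() finds whichever of the columns exist, and a plain lapply() uppercases them. A minimal base R sketch (the helper name upper_if_present is made up for illustration):
upper_if_present <- function(df, cols = c("first_name", "last_name")) {
  found <- intersect(cols, names(df))  # columns that actually exist
  if (length(found)) df[found] <- lapply(df[found], toupper)
  df
}
b <- upper_if_present(b)  # no matching columns: returned unchanged
d <- upper_if_present(d)  # last_name gets uppercased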

Filtering a dataset dependent on a value within a string

I am currently working with Google Analytics and R and have a query I hope someone can help me with.
I have exported my data from GA into R and have it in a dataframe ready for processing.
I want to create a for loop which goes through my data and sums a number of columns in my dataframe if one column contains a certain value.
For example, my dataframe looks like this
I have a list of IDs, the individual 3-digit numbers, which I can use in a for loop.
From my past experience with R, I have been able to filter the list so that I have
data[data$ID == 341,] -> datanew
and I have found some code that checks whether a certain string occurs within another string, producing a boolean:
grepl(value, chars)
Is there a way to link these together so that I have aggregation code similar to the below?
aggregate(cbind(users, conversion)~ID,data=datanew,FUN=sum) -> resultforID
Basically, take the data and, for every ID containing 341, add up the users and conversions.
I hope I have explained this the best way possible.
Thanks in advance
The data table has 3 columns: ID, Users, and Conversion, with Users and Conversion linked to the IDs.
Some IDs are on their own (e.g., 341), others are 341|246, and some have three numbers separated by the |.
# toy data
mydata = data.frame(ID = c("341|243", "341|243", "341|242", "341", "243",
                           "999", "111|341|222"),
                    Users = 10:16,
                    Conv = 5:11)
# ID Users Conv
# 1 341|243 10 5
# 2 341|243 11 6
# 3 341|242 12 7
# 4 341 13 8
# 5 243 14 9
# 6 999 15 10
# 7 111|341|222 16 11
# Are you looking for something like the below?
# I presume you just want to filter the IDs that contain 341.
library(dplyr)
mydata[grep("341", mydata$ID), ] %>%
  group_by(ID) %>%
  summarise_each(funs(sum))
# ID Users Conv
# 1 111|341|222 16 11
# 2 341 13 8
# 3 341|242 12 7
# 4 341|243 21 11
If I understand your question correctly, you may want to look at cSplit from my "splitstackshape" package.
Using #KFB's sample data (which is hopefully representative of your actual data), try:
library(splitstackshape)
cSplit(mydata, "ID", "|", "long")[, lapply(.SD, sum), by = ID]
# ID Users Conv
# 1: 341 62 37
# 2: 243 35 20
# 3: 242 12 7
# 4: 999 15 10
# 5: 111 16 11
# 6: 222 16 11
Alternatively, from the Hadleyverse, you can use "dplyr" and "tidyr" together, like this:
library(dplyr)
library(tidyr)
mydata %>%
  transform(ID = strsplit(as.character(ID), "|", fixed = TRUE)) %>%
  unnest(ID) %>%
  group_by(ID) %>%
  summarise_each(funs(sum))
# Source: local data frame [6 x 3]
#
# ID Users Conv
# 1 111 16 11
# 2 222 16 11
# 3 242 12 7
# 4 243 35 20
# 5 341 62 37
# 6 999 15 10
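In more recent tidyr/dplyr versions, separate_rows() collapses the transform()/unnest() pair into a single call, and across() replaces the now-deprecated summarise_each(). A hedged modern rewrite of the same pipeline:
library(dplyr)
library(tidyr)
mydata %>%
  mutate(ID = as.character(ID)) %>%  # ensure ID is character, not factor
  separate_rows(ID, sep = "\\|") %>% # one row per individual ID
  group_by(ID) %>%
  summarise(across(c(Users, Conv), sum))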
I think this should work:
library(dplyr)
sumdf <- yourdf %>%
  group_by(ID) %>%
  summarise_each(funs(sum))
I'm not clear about the structure of your ID column, but if you need to just get the numbers you could try this:
library(tidyr)
newdf <- separate(yourdf, ID, c('id1', 'id2'), sep = '\\|') %>%
  filter(id1 == 341) # optional if you just want one ID
Here are two answers: the first uses subset and the second uses grep with a string.
Initial run with subset:
x1 <- sample(1:4, 10, replace = TRUE)
x2 <- sample(10:40, 10)
x3 <- sample(10:40, 10)
dat <- as.data.frame(cbind(x1, x2, x3))
for (i in unique(dat$x1)) {
  dat1 <- subset(dat, subset = x1 == i)
  z <- aggregate(. ~ x1, data = dat1, FUN = sum)
  assign(paste0('x1', i), z)
}
With grep:
x1 <- sample(letters[1:3], 10, replace = TRUE)
x2 <- sample(10:40, 10)
x3 <- sample(10:40, 10)
dat <- data.frame(x1, x2, x3)  # data.frame(), not cbind(), so x2/x3 stay numeric
for (i in unique(dat$x1)) {
  dat1 <- dat[grep(i, dat$x1), ]
  z <- aggregate(. ~ x1, data = dat1, FUN = sum)
  assign(paste0('x1', i), z) # assigns separate objects named after the string
}

Correct way of vectorizing "lookup" function

I am looking for a fast and efficient way to solve the problem described below. Any help would be appreciated; thanks in advance!
I have a couple of very large CSV files that contain different information about the same objects, but in my final calculation I need all of the attributes from the different tables. I am trying to calculate the load of a large number of electrical substations. First, I have a list of unique electrical substations:
Unique_Substations <- data.frame(Name = c("SubA", "SubB", "SubC", "SubD"))
In another list I have information about the customers behind these substations;
Customer_Information <- data.frame(
Customer = 1001:1010,
SubSt_Nm = sample(unique(Unique_Substations$Name), 10, replace = TRUE),
HouseHoldType = sample(1:2, 10, replace = TRUE)
)
And in another list I have information about the, let's say, solar panels on these customers' roofs (for different years):
Solar_Panels <- data.frame(
Customer = sample(1001:1010, 10, replace = TRUE),
SolarPanelYear1 = sample(10:20, 10, replace = TRUE),
SolarPanelYear2 = sample(15:20, 10, replace = TRUE)
)
Now I want to see what the load is on each substation for each year. I have a household load and a solar panel load, normalised for each type of household or solar panel:
SolarLoad <- data.frame(Load = c(0, -10, -10, 5))
HouseHoldLoad <- data.frame(Type1 = c(1, 3, 5, 2), Type2 = c(3, 5, 6, 1))
So now I have to match up these lists;
ML_SubSt_Cust <- sapply(Unique_Substations$Name,
function(x) which(Customer_Information$SubSt_Nm %in% x == TRUE))
ML_Cust_SolarP <- sapply(Customer_Information$Customer,
function(x) which(Solar_Panels$Customer %in% x == TRUE))
(Here I use the which(xxx %in% x == TRUE) method because I need multiple matches, and match() only returns one match.)
And now we come to my big question (but probably not my only problem with this method). I want to calculate the maximum load on each substation for each year. To this end I first wrote a for loop over the Unique_Substations list, which is of course highly inefficient. After that I tried to speed it up using outer(), but I don't think I have properly vectorized my function. My maximum function looks as follows (I only wrote it out for the solar panel part to keep it simple):
GetMax <- function(i, Yr) {
  max(sum(Solar_Panels[unlist(ML_Cust_SolarP[ML_SubSt_Cust[[i]]],
                              use.names = FALSE), Yr]) * SolarLoad)
}
I'm sure this is not efficient at all but I have no clue how to do it in any other way.
To get my final results I use an outer() call:
Results <- outer(1:nrow(Unique_Substations), 1:2, Vectorize(GetMax))
In my example all of these data frames are much much larger (40000 rows each or so), so I really need some good optimization of the functions involved. I tried to think of ways to vectorize the function but I couldn't work it out. Any help would be appreciated.
EDIT:
Now that I fully understand the accepted answer, I have another problem. My actual Customer_Information is 188k rows long and my actual HouseHoldLoad is 53k rows long. Needless to say, this does not merge() very well. Is there another solution to this problem that does not require merge() or for loops that are too slow?
First: set.seed() when generating random data! I did set.seed(1000) before your code for these results.
I think a bit of merge-ing and dplyr can help here. First, we get the data into a better shape:
library(dplyr)
library(reshape2)
HouseHoldLoad <- melt(HouseHoldLoad, value.name = "Load") %>%
  select(HouseHoldType = variable, Load) %>%
  mutate(HouseHoldType = gsub("Type", "", HouseHoldType))
Solar_Panels <- melt(Solar_Panels, id.vars = "Customer",
                     value.name = "SPYearVal") %>%
  select(Customer, SolarPanelYear = variable, SPYearVal) %>%
  mutate(SolarPanelYear = gsub("SolarPanelYear", "", SolarPanelYear))
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
That gives us:
## Customer SubSt_Nm HouseHoldType SolarPanelYear SPYearVal
## 1 1001 SubB 1 1 16
## 2 1001 SubB 1 2 18
## 3 1001 SubB 1 2 16
## 4 1001 SubB 1 1 20
## 5 1002 SubD 2 1 16
## 6 1002 SubD 2 1 13
## 7 1002 SubD 2 2 20
## 8 1002 SubD 2 2 18
## 9 1003 SubA 1 2 15
## 10 1003 SubA 1 1 16
## 11 1005 SubC 2 2 19
## 12 1005 SubC 2 1 10
## 13 1006 SubA 1 1 15
## 14 1006 SubA 1 2 19
## 15 1007 SubC 1 1 17
## 16 1007 SubC 1 2 19
## 17 1009 SubA 1 1 10
## 18 1009 SubA 1 1 18
## 19 1009 SubA 1 2 18
## 20 1009 SubA 1 2 18
Now we just group and summarize:
dat %>%
  group_by(SubSt_Nm, SolarPanelYear) %>%
  summarise(mx = max(sum(SPYearVal) * SolarLoad))
## SubSt_Nm SolarPanelYear mx
## 1 SubA 1 295
## 2 SubA 2 350
## 3 SubB 1 180
## 4 SubB 2 170
## 5 SubC 1 135
## 6 SubC 2 190
## 7 SubD 1 145
## 8 SubD 2 190
If you use data.table vs data frames, it should be pretty speedy even with 40K entries.
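A sketch of what that data.table version could look like, reusing the melted Solar_Panels from above (the on= join syntax assumes data.table >= 1.9.6):
library(data.table)
setDT(Customer_Information)
setDT(Solar_Panels)
# inner join on Customer, then grouped aggregation, all in data.table
dat <- Solar_Panels[Customer_Information, on = "Customer", nomatch = 0]
dat[, .(mx = max(sum(SPYearVal) * SolarLoad$Load)),
    by = .(SubSt_Nm, SolarPanelYear)]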
UPDATE: For those who cannot install dplyr, this just uses reshape2 (hopefully that is installable):
library(reshape2)
HouseHoldLoad <- melt(HouseHoldLoad, value.name="Load")
colnames(HouseHoldLoad) <- c("HouseHoldType", "Load")
HouseHoldLoad$HouseHoldType <- gsub("Type", "", HouseHoldLoad$HouseHoldType)
Solar_Panels <- melt(Solar_Panels, id.vars="Customer", value.name="SPYearVal")
colnames(Solar_Panels) <- c("Customer", "SolarPanelYear", "SPYearVal")
Solar_Panels$SolarPanelYear <- gsub("SolarPanelYear", "", Solar_Panels$SolarPanelYear)
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
rbind(by(dat, list(dat$SubSt_Nm, dat$SolarPanelYear), function(x) {
mx <- max(sum(x$SPYearVal) * SolarLoad)
}))
## 1 2
## SubA 295 350
## SubB 180 170
## SubC 135 190
## SubD 145 190
If you really can't install even reshape2, then this works with just the base stats package:
colnames(HouseHoldLoad) <- c("Load.1", "Load.2")
HouseHoldLoad <- reshape(HouseHoldLoad, varying=c("Load.1", "Load.2"), direction="long", timevar="HouseHoldType")[1:2]
colnames(Solar_Panels) <- c("Customer", "SolarPanelYear.1", "SolarPanelYear.2")
Solar_Panels <- reshape(Solar_Panels, varying=c("SolarPanelYear.1", "SolarPanelYear.2"), direction="long", timevar="SolarPanelYear")[1:2]
colnames(Solar_Panels) <- c("Customer", "SPYearVal")
Solar_Panels$SolarPanelYear <- gsub("^[0-9]+\\.", "", rownames(Solar_Panels))
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
rbind(by(dat, list(dat$SubSt_Nm, dat$SolarPanelYear), function(x) {
mx <- max(sum(x$SPYearVal) * SolarLoad)
}))
## 1 2
## SubA 295 350
## SubB 180 170
## SubC 135 190
## SubD 145 190

Apply sum by integer factor after rounding in R

Here is my question: I have data with 3000 obs. and 5000 features; the 3000 obs. have numeric names like 100.1, 100.3, 100.5, 100.7. I converted the names into integer variables with segs <- as.integer(names), and I want to use segs as a factor to sum each of the 5000 features. There are 300 unique values of segs, so the final data frame is 300 by 5000. I know tapply can compute the sum by factor for one variable, but I would have to use a for loop over all 5000 features, which is really time-consuming. I want to know if there is a clean way in R to solve this, or whether there are packages for this kind of problem.
This is my dirty code; df0 is the data and df is what I want:
df <- data.frame()
for (i in 1:(ncol(df0) - 1)) {  # all feature columns (the last column is segs)
  temp <- tapply(df0[, i], df0$segs, sum)
  df <- cbind(df, temp)
}
Thanks!
=====
Thanks, Roland. Demo data is shown as follows:
set.seed(42)
df0 <- data.frame(
X = rnorm(100,10,10),
Y = rnorm(100),
Z = rnorm(100))
df0$seq <- as.integer(df0$X)
Try this...
set.seed(42)
df0 <- data.frame(
X = rnorm(100,10,10),
Y = rnorm(100),
Z = rnorm(100))
df0$seq <- as.integer(df0$X)
library(data.table)
dt = data.table(df0)
dt[,lapply(.SD, sum), by=seq ]
seq X Y Z
1: 23 164.8144774 1.293768670 -3.74807730
2: 4 8.9247301 1.909529066 -0.06277254
3: 13 40.2090180 -2.036599633 0.88836392
4: 16 147.8571697 -2.571487358 -1.35542918
5: 14 72.1640142 0.432493959 -1.49983832
6: 8 42.8498355 -0.582031919 -1.35989852
7: 25 75.9995653 0.896369560 -1.08024329
8: 9 27.5244048 0.833429855 -1.19363017
9: 30 30.1842371 0.188193035 -0.64574372
10: 32 32.8664539 0.108072728 2.03697217
11: -3 -7.5714175 -0.899304085 -1.27286230
12: 7 29.6254908 -0.929790177 2.75906514
---
27: 12 50.2535374 -0.620793351 -3.80900436
28: 24 24.4410126 -0.433169033 -0.02671746
29: -19 -19.9309008 -0.533492330 -1.01759612
30: 11 11.8523056 -1.071782384 0.96954501
31: 19 38.5407490 -0.751408534 -4.81312992
32: 0 -0.9642319 1.453325156 2.20977601
33: -1 -4.3685646 -0.834654913 -0.24624546
34: 18 18.2177311 -1.594588162 0.27369527
35: -4 -4.5921400 0.586487537 0.86256338
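For this exact pattern (summing every column within integer groups), base R's rowsum() is also worth knowing; a one-line alternative on the same demo data:
# sums X, Y and Z within each group defined by seq
rowsum(df0[, c("X", "Y", "Z")], group = df0$seq)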
