I am very new to programming with R, but I am trying to replace the column name by the dataframe name with a for loop. I have 25 dataframes with cryptocurrency time series data.
ls(pattern="USD")
[1] "ADA.USD" "BCH.USD" "BNB.USD" "BTC.USD" "BTG.USD" "DASH.USD" "DOGE.USD" "EOS.USD" "ETC.USD" "ETH.USD" "IOT.USD"
[12] "LINK.USD" "LTC.USD" "NEO.USD" "OMG.USD" "QTUM.USD" "TRX.USD" "USDT.USD" "WAVES.USD" "XEM.USD" "XLM.USD" "XMR.USD"
[23] "XRP.USD" "ZEC.USD" "ZRX.USD"
Every object is a dataframe which stands for a cryptocurrency expressed in USD. Every dataframe has 2 columns: Date and Close (closing price).
For example: the dataframe "BTC.USD" stands for Bitcoin in USD:
head(BTC.USD)
# A tibble: 6 x 2
Date Close
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
4 2016-01-03 431.
5 2016-01-04 433.
Now I want to replace the name of the second column ("Close") by the name of the dataframe ("BTC.USD")
For this case I used the following code:
colnames(BTC.USD)[2] <-deparse(substitute(BTC.USD))
And this code works as I imagined:
> head(BTC.USD)
# A tibble: 6 x 2
Date BTC.USD
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
Now I am trying to create a loop to change the second column name for all 25 dataframes of cryptocurrency data:
df_list <- ls(pattern="USD")
for(i in df_list){
aux <- get(i)
(colnames(aux)[2] =df_list)
assign(i,aux)
}
But the code does not work as I thought. Can someone help me figure out what step I am missing?
Thanks in advance!
You can use Map to assign the names, i.e.
Map(function(x, y) {names(x)[2] <- y; x}, l2, names(l2))
#$`a`
# v1 a
#1 3 8
#2 5 6
#3 2 7
#4 1 5
#5 4 4
#$b
# v1 b
#1 9 47
#2 18 48
#3 17 6
#4 5 25
#5 13 12
DATA
dput(l2)
list(a = structure(list(v1 = c(3L, 5L, 2L, 1L, 4L), v2 = c(8L,
6L, 7L, 5L, 4L)), class = "data.frame", row.names = c(NA, -5L
)), b = structure(list(v1 = c(9L, 18L, 17L, 5L, 13L), v2 = c(47L,
48L, 6L, 25L, 12L)), class = "data.frame", row.names = c(NA,
-5L)))
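Applied to the question's setup, the same idea might look like the sketch below. It assumes the workspace objects are collected into a named list with mget(); the two small data frames are made-up stand-ins for the real price tables, and df_names and renamed are illustrative names. The loop at the end shows the likely bug in the original attempt: the whole vector df_list was assigned as the column name instead of the current name i.

```r
# Hypothetical stand-ins for two of the 25 price tables
BTC.USD <- data.frame(Date = as.Date("2015-12-31") + 0:2, Close = c(430, 434, 434))
ETH.USD <- data.frame(Date = as.Date("2015-12-31") + 0:2, Close = c(0.93, 0.95, 0.94))

# Collect the data frames into a named list
df_names <- c("BTC.USD", "ETH.USD")   # ls(pattern = "USD") in the real session
df_list  <- mget(df_names)

# Map approach: rename the second column of each element to the element's name
renamed <- Map(function(x, y) { names(x)[2] <- y; x }, df_list, names(df_list))

# Loop approach: use the current name i, not the whole vector of names
for (i in df_names) {
  aux <- get(i)
  colnames(aux)[2] <- i
  assign(i, aux)
}
```

The Map version leaves the original objects untouched and returns a list, while the loop overwrites the objects in the workspace.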
I have 2 separate DFs, and I want to add 2 new columns to dat2 ('Avg_of_nonNA', and 'Cols' to track which columns it is using) based on the non-NA columns in dat1. I need to take a subset of dat2 because its matrix is dense whereas dat1 is sparse (so I can take advantage of the sparseness). The only way to match the columns is through the common elements in their names: (0-1,1-2,2-3,3-4) in my case. The rest of each column name is gibberish. This requires string splitting and matching, which causes many problems: I can't chain the steps together because each row has a different combination of columns to average (the dummy example is simplified). I do have a working solution, but it is painfully slow across my 1M+ rows.
I'm looking for a way to get rid of the for loop. Any suggestions? Here is the current solution:
for (z in 1:5) {
relevant_cols=dat1[z,] %>%
select_if(~!all(is.na(.))) %>%
names %>% strsplit(.,'_') %>% map(.,2) %>% unlist()
id=dat1[z,'ID']$`ID`
dat2[`ID`== id,`:=`(Avg_of_nonNA = (mean(as.numeric(.SD))),Cols=paste0(relevant_cols,collapse='/')), .SDcols=names(dat2) %like% paste0(relevant_cols,collapse='|')]
}
Data Below
> dat1
ID gjfkg_0-1_fkjdk_fjdkd jdfsje_1-2_fhks_ejfskj dfjs_2-3_vjskf_wqew gdlkrzc_3-4_rjrkj Avg_of_nonNA_otherDT
1: 1 2.23 1.37 NA NA 1.5
2: 2 1.98 NA NA 1.760 6.5
3: 3 NA 4.45 9.350 3.320 11.0
4: 4 NA NA 6.642 2.019 15.5
5: 5 NA 3.21 3.677 NA 18.5
> dat2
ID ewrwer_0-1_iopi_opop erewtt_1-2_rueiwu_vcvbc erewr_2-3_iirew_rewr mnmn_3-4_cxzxzc_gjd
1: 1 1 2 3 4
2: 2 5 6 7 8
3: 3 9 10 11 12
4: 4 13 14 15 16
5: 5 17 18 19 20
dput(dat1)
structure(list(ID = 1:5, `gjfkg_0-1_fkjdk_fjdkd` = c(2.23, 1.98,
NA, NA, NA), `jdfsje_1-2_fhks_ejfskj` = c(1.37, NA, 4.45, NA,
3.21), `dfjs_2-3_vjskf_wqew` = c(NA, NA, 9.35, 6.642, 3.677),
`gdlkrzc_3-4_rjrkj` = c(NA, 1.76, 3.32, 2.019, NA)), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
dput(dat2)
structure(list(ID = 1:5, `ewrwer_0-1_iopi_opop` = c(1L, 5L, 9L,
13L, 17L), `erewtt_1-2_rueiwu_vcvbc` = c(2L, 6L, 10L, 14L, 18L
), `erewr_2-3_iirew_rewr` = c(3L, 7L, 11L, 15L, 19L), `mnmn_3-4_cxzxzc_gjd` = c(4L,
8L, 12L, 16L, 20L)), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
Expected output:
Here is an option:
setDT(dat1)
setDT(dat2)
nm <- sapply(strsplit(names(dat1[, -"ID"]), "_"), `[[`, 2L)
dat2[, c("Avg_of_nonNA_otherDT", "Cols") := {
nas <- is.na(dat1[,-"ID"])
m <- col(nas)
m[] <- nm[m]
m[nas] <- ""
.(rowMeans(.SD * NA^nas, na.rm=TRUE),
gsub("\\s+", "/", trimws(do.call(paste, as.data.frame(m)))))
}, .SDcols=-"ID"]
output:
ID ewrwer_0-1_iopi_opop erewtt_1-2_rueiwu_vcvbc erewr_2-3_iirew_rewr mnmn_3-4_cxzxzc_gjd Avg_of_nonNA_otherDT Cols
1: 1 1 2 3 4 1.5 0-1/1-2
2: 2 5 6 7 8 6.5 0-1/3-4
3: 3 9 10 11 12 11.0 1-2/2-3/3-4
4: 4 13 14 15 16 15.5 2-3/3-4
5: 5 17 18 19 20 18.5 1-2/2-3
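The NA^nas step is the least obvious part of this answer, so here is a small standalone sketch of how it masks values. The matrices are made-up toy data, and mask, masked and row_avg are illustrative names.

```r
# TRUE marks cells that should be ignored
nas  <- matrix(c(FALSE, TRUE, TRUE, FALSE), nrow = 2)
vals <- matrix(1:4, nrow = 2)

# NA^FALSE is NA^0, which is 1; NA^TRUE is NA^1, which is NA
mask   <- NA^nas
masked <- vals * mask

# Row means over the cells that were not masked
row_avg <- rowMeans(masked, na.rm = TRUE)
```

Multiplying by the mask keeps a value where nas is FALSE and turns it into NA where nas is TRUE, so rowMeans with na.rm=TRUE averages only the columns that are non-NA in the other table.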
There are 3 parts to this problem:
1) I want to sum the values in columns b, c, d for any two adjacent rows which have the same values in columns (b, c, d).
2) I would like to keep the values in the other columns the same. (Some other columns (e.g. a) may contain character data.)
3) I would like to keep the changes by replacing the original values in columns b, c, d in the first row (of the 2 matching rows) with the new values (the sums) and deleting the second row (of the 2 matching rows).
Time a b c d id
1 2014/10/11 A 40 20 10 1
2 2014/10/12 A 40 20 10 2
3 2014/10/13 B 9 10 9 3
4 2014/10/14 D 16 5 12 4
5 2014/10/15 D 1 6 5 5
6 2014/10/16 B 20 7 8 6
7 2014/10/17 B 20 7 8 7
8 2014/10/18 A 11 9 5 8
9 2014/10/19 C 31 20 23 9
Expected outcome:
Time a b c d id
1 2014/10/11 A 80 40 20 1 *
3 2014/10/13 B 9 10 9 3
4 2014/10/14 D 16 5 12 4
5 2014/10/15 D 1 6 5 5
6 2014/10/16 B 40 14 16 6 *
8 2014/10/18 A 11 9 5 8
9 2014/10/19 C 31 20 23 9
id 1 and 2 combined to become id 1; id 6 and 7 combined to become id 6.
Thank you. Any contribution is greatly appreciated.
This uses dplyr functions along with data.table::rleid. To find adjacent rows with the same values in columns b, c and d, we paste those columns together and use rleid to create one group per run. For each group we sum the values in columns b, c and d and keep only the first row.
library(dplyr)
df %>%
mutate(temp_col = paste(b, c, d, sep = "-")) %>%
group_by(group = data.table::rleid(temp_col)) %>%
mutate_at(vars(b, c, d), sum) %>%
slice(1L) %>%
ungroup %>%
select(-temp_col, -group)
# Time a b c d id
# <fct> <fct> <int> <int> <int> <int>
#1 2014/10/11 A 80 40 20 1
#2 2014/10/13 B 9 10 9 3
#3 2014/10/14 D 16 5 12 4
#4 2014/10/15 D 1 6 5 5
#5 2014/10/16 B 40 14 16 6
#6 2014/10/18 A 11 9 5 8
#7 2014/10/19 C 31 20 23 9
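The grouping step can also be done without data.table. Below is a minimal base R sketch of the same run-based grouping, assuming a made-up five-row data frame (not the question's data); key, run_id, first and out are illustrative names.

```r
df <- data.frame(
  a = c("A", "A", "B", "B", "B"),
  b = c(40L, 40L, 9L, 20L, 20L),
  c = c(20L, 20L, 10L, 7L, 7L),
  stringsAsFactors = FALSE
)

# One key per row; a new run starts wherever the key differs from the previous row
key    <- paste(df$b, df$c, sep = "-")
run_id <- cumsum(c(TRUE, key[-1] != key[-length(key)]))

# Keep the first row of each run and replace b and c with the per-run sums
first <- !duplicated(run_id)
out   <- df[first, ]
out$b <- as.vector(tapply(df$b, run_id, sum))
out$c <- as.vector(tapply(df$c, run_id, sum))
```

As with rleid, a run of more than two identical adjacent rows is collapsed into a single row, so this only matches the "two adjacent rows" requirement when runs are at most two rows long.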
data
df <- structure(list(Time = structure(1:9, .Label = c("2014/10/11",
"2014/10/12", "2014/10/13", "2014/10/14", "2014/10/15", "2014/10/16",
"2014/10/17", "2014/10/18", "2014/10/19"), class = "factor"),
a = structure(c(1L, 1L, 2L, 4L, 4L, 2L, 2L, 1L, 3L), .Label = c("A",
"B", "C", "D"), class = "factor"), b = c(40L, 40L, 9L, 16L,
1L, 20L, 20L, 11L, 31L), c = c(20L, 20L, 10L, 5L, 6L, 7L,
7L, 9L, 20L), d = c(10L, 10L, 9L, 12L, 5L, 8L, 8L, 5L, 23L
), id = 1:9), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9"))
I have 3 columns a, b, c and I want to combine them into a new column based on the value of the column mode, as follows:
if mode = 1, data from a
if mode = 2, data from b
if mode = 3, data from c
example
mode a b c
1 2 3 4
1 5 53 14
3 2 31 24
2 12 13 44
1 20 30 40
Output
mode a b c combine
1 2 3 4 2
1 5 53 14 5
3 2 31 24 24
2 12 13 44 13
1 20 30 40 20
We can use row/column indexing to get the values from the dataset. Here, the row sequence (seq_len(nrow(df1))) and the column index ('mode') are combined with cbind into a two-column matrix, which extracts the corresponding values from the subset of the dataset (columns 2 to 4).
df1$combine <- df1[2:4][cbind(seq_len(nrow(df1)), df1$mode)]
df1$combine
#[1] 2 5 24 13 20
data
df1 <- structure(list(mode = c(1L, 1L, 3L, 2L, 1L), a = c(2L, 5L, 2L,
12L, 20L), b = c(3L, 53L, 31L, 13L, 30L), c = c(4L, 14L, 24L,
44L, 40L)), class = "data.frame", row.names = c(NA, -5L))
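To see what indexing with a two-column matrix does, here is a small sketch on a plain numeric matrix. The values are the first three rows of the example data; m, idx and picked are illustrative names.

```r
m <- matrix(c(2, 5, 2,     # column a
              3, 53, 31,   # column b
              4, 14, 24),  # column c
            ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
mode <- c(1, 1, 3)

# Each row of idx is one (row, column) pair to extract
idx    <- cbind(seq_len(nrow(m)), mode)
picked <- m[idx]
```

Each row of idx selects exactly one cell, so the result has one value per row of the data, which is what the combine column needs.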
Another solution in base R that works by converting "mode" to letters then extracting those values in the matching columns.
df1$combine <- diag(as.matrix(df1[, letters[df1$mode]]))
Also, two ways with dplyr. Nested if_else:
library(dplyr)
df1 %>%
mutate(combine =
if_else(mode == 1, a,
if_else(mode == 2, b, c)
)
)
And case_when():
df1 %>% mutate(combine =
case_when(mode == 1 ~ a, mode == 2 ~ b, mode == 3 ~ c)
)
I used the aggregate function in R to bring down my data entries from 90k to 1800.
a=test$ID
b=test$Date
c=test$Value
d=test$Value1
sumA=aggregate(c, by=list(Date=b,Id=a), FUN=sum)
sumB=aggregate(d, by=list(Date=b,Id=a), FUN=sum)
final[1]=sumA[1]; final[2]=sumA[2]
final[3]=sumA[3]/sumB[3]
Now I have data for 20 different dates in a month, with close to 90 different ids each day, so there are around 1800 entries in the final table.
My question is that I want to aggregate further and find the maximum value of final[3] for each date, so that I am left with just 20 values.
In simple terms -
There are 20 days.
Each day has 90 values for 90 ids.
I want to find the maximum of these 90 values for each day.
So in the end I would be left with just 20 values for 20 days.
The aggregate function is not working here when I use 'max' instead of sum.
Date ID Value Value1
1 A 20 10
1 A 25 5
1 B 50 5
1 B 50 5
1 C 25 25
1 C 35 5
2 A 30 10
2 A 25 45
2 B 40 10
2 B 40 30
This is the data.
Using the aggregate function I got the final table:
Date ID x
1 A 45/15=3
1 B 100/10=10
1 C 60/30=2
2 A 55/55=1
2 B 80/40=2
Now I want the maximum value for dates 1 and 2, that's it:
Date max- Value
1 10
2 2
This can be done in a single chained expression with data.table. A data.table inherits from data.frame, so it can be passed to functions that expect a data.frame, while adding its own syntax for grouping and aggregation inside [ ].
Step0: Converting data.frame to data.table:
library(data.table)
setDT(test)
setkey(test,Date,ID)
Step1: Do the computation
test[,sum(Value)/sum(Value1),by=key(test)][,max(V1),by=Date]
Here is the explanation of the step:
The first part creates what you call the final table in your question:
test[,sum(Value)/sum(Value1),by=key(test)]
# Date ID V1
# 1: 1 A 3
# 2: 1 B 10
# 3: 1 C 2
# 4: 2 A 1
# 5: 2 B 2
Now this is passed to the second item to do the max function by Date:
test[,sum(Value)/sum(Value1),by=key(test)][,max(V1),by=Date]
# Date V1
# 1: 1 10
# 2: 2 2
Hope this helps.
It's a very well documented package. You should read more about it.
Maybe this helps.
test <- structure(list(Date = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), ID = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B"),
Value = c(20L, 25L, 50L, 50L, 25L, 35L, 30L, 25L, 40L, 40L
), Value1 = c(10L, 5L, 5L, 5L, 25L, 5L, 10L, 45L, 10L, 30L
)), .Names = c("Date", "ID", "Value", "Value1"), class = "data.frame", row.names = c(NA,
-10L))
res1 <- aggregate(. ~ID+Date, data=test, FUN=sum)
res1 <- transform(res1, x=Value/Value1)
res1
# ID Date Value Value1 x
#1 A 1 45 15 3
#2 B 1 100 10 10
#3 C 1 60 30 2
#4 A 2 55 55 1
#5 B 2 80 40 2
aggregate(. ~Date, data=res1[,-c(1,3:4)], FUN=max)
# Date x
# 1 1 10
# 2 2 2
First, I ran aggregate with two grouping variables (ID and Date) on the two value columns, using the . ~ shorthand for "all other columns".
Then I created a new variable x, i.e. Value/Value1, with transform.
Finally, I ran aggregate again with one grouping variable (Date), after dropping every column except Date and x.
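A variation on the same two steps that names the columns in the formulas instead of dropping them by position. This is only a sketch on a made-up six-row subset of the data; sums and res are illustrative names.

```r
test <- data.frame(
  Date   = c(1, 1, 1, 1, 2, 2),
  ID     = c("A", "A", "B", "B", "A", "A"),
  Value  = c(20, 25, 50, 50, 30, 25),
  Value1 = c(10, 5, 5, 5, 10, 45)
)

# Step 1: sums per (ID, Date), then the ratio of the two sums
sums   <- aggregate(cbind(Value, Value1) ~ ID + Date, data = test, FUN = sum)
sums$x <- sums$Value / sums$Value1

# Step 2: maximum ratio per Date
res <- aggregate(x ~ Date, data = sums, FUN = max)
```

Writing x ~ Date avoids depending on column positions, which can shift if more columns are added to the data later.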
I have a rather large data frame. Here is a simplified example:
Group Element Value Note
1 AAA 11 Good
1 ABA 12 Good
1 AVA 13 Good
2 CBA 14 Good
2 FDA 14 Good
3 JHA 16 Good
3 AHF 16 Good
3 AKF 17 Good
Here it is as a dput:
dat <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L), Element = structure(c(1L,
2L, 5L, 6L, 7L, 8L, 3L, 4L), .Label = c("AAA", "ABA", "AHF",
"AKF", "AVA", "CBA", "FDA", "JHA"), class = "factor"), Value = c(11L,
12L, 13L, 14L, 14L, 16L, 16L, 17L), Note = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "Good", class = "factor")), .Names = c("Group",
"Element", "Value", "Note"), class = "data.frame", row.names = c(NA,
-8L))
I'm trying to separate it based on the group. so let's say
Group 1 will be a data frame:
Group Element Value Note
1 AAA 11 Good
1 ABA 12 Good
1 AVA 13 Good
Group 2:
2 CBA 14 Good
2 FDA 14 Good
and so on.
You can use split for this.
> dat
## Group Element Value Note
## 1 1 AAA 11 Good
## 2 1 ABA 12 Good
## 3 1 AVA 13 Good
## 4 2 CBA 14 Good
## 5 2 FDA 14 Good
## 6 3 JHA 16 Good
## 7 3 AHF 16 Good
## 8 3 AKF 17 Good
> x <- split(dat, dat$Group)
Then you can access each individual data frame by group number with x[[1]], x[[2]], etc.
For example, here is group 2:
> x[[2]] ## or x[2]
## Group Element Value Note
## 4 2 CBA 14 Good
## 5 2 FDA 14 Good
ADD: Since you asked about it in the comments, you can write each individual data frame to file with write.csv and lapply. The invisible wrapper is simply to suppress the output of lapply
> invisible(lapply(seq(x), function(i){
write.csv(x[[i]], file = paste0(i, ".csv"), row.names = FALSE)
}))
We can see that the files were created by looking at list.files
> list.files(pattern = "^[0-9].csv")
## [1] "1.csv" "2.csv" "3.csv"
And we can see the data frame of the third group with read.csv
> read.csv("3.csv")
## Group Element Value Note
## 1 3 JHA 16 Good
## 2 3 AHF 16 Good
## 3 3 AKF 17 Good
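One caveat worth sketching: the list returned by split is named after the group values, so positions only match group numbers when the groups happen to be 1, 2, 3, and so on. The example below assumes a made-up data frame with groups 1 and 3 only; by_name, by_pos and sub are illustrative names.

```r
dat <- data.frame(Group = c(1, 1, 3, 3, 3), Value = c(11, 12, 16, 16, 17))
x <- split(dat, dat$Group)

grp_names <- names(x)   # the group values as character strings
by_name   <- x[["3"]]   # data frame for group 3, looked up by name
by_pos    <- x[[2]]     # second element, which is also group 3 here
sub       <- x[2]       # single brackets return a one-element list
```

Looking groups up by name is the safer choice when group values have gaps. Note also that x[2] is not the same as x[[2]]: it keeps the list wrapper around the data frame.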
Obligatory plyr version (pretty much equivalent to Richard's, but I'll bet it's slower, too):
library(plyr)
groups <- dlply(dat, .(Group), function(x) { return(x) })
length(groups)
## [1] 3
groups$`1` # can also do groups[[1]]
## Group Element Value Note
## 1 1 AAA 11 Good
## 2 1 ABA 12 Good
## 3 1 AVA 13 Good
groups[[2]]
## Group Element Value Note
## 1 2 CBA 14 Good
## 2 2 FDA 14 Good