I'm trying to do the following:
Evaluate a POSIXct column in a data frame: if an observation falls on an even second, do nothing; if it falls on an odd second, add 1 second.
This is what I have so far:
df[,1] <- lapply(df[,1], function(x) ifelse(as.integer(x)%%2==0, yes = x, no = x+1))
At this point my computer is running out of memory, probably because of the yes = x in the ifelse function... But I'm not sure.
Is there a better way to approach this problem?
Note: There are around 150,000 obs.
Edit:
Here is a sample of my data:
Timestamp PSTC01 PSTC02 PSTC03 PSTC04 PSTC05 PSTC06 PSTC07 PSTC08 PSTC09 PSTC10 PSTC11
1 2013-09-02 23:56:02 0.225339 NA NA NA NA 0.222298 0.253884 NA 0.243435 NA NA
2 2013-09-02 23:56:32 0.220459 NA NA NA NA 0.220009 0.250797 NA 0.241659 NA NA
3 2013-09-02 23:57:02 0.218379 NA NA NA NA 0.216663 0.252008 NA 0.240208 NA NA
4 2013-09-02 23:57:32 0.218264 NA NA NA NA 0.215935 0.256784 NA 0.240165 NA NA
5 2013-09-02 23:58:02 0.222438 NA NA NA NA 0.220964 0.253382 NA 0.241622 NA NA
6 2013-09-02 23:58:32 0.222154 NA NA NA NA 0.222533 0.252455 NA 0.242187 NA NA
7 2013-09-02 23:59:02 0.223612 NA NA NA NA 0.226128 0.253611 NA 0.243376 NA NA
8 2013-09-02 23:59:32 0.221370 NA NA NA NA 0.225215 0.253617 NA 0.243793 NA NA
9 2013-09-06 00:00:01 0.268708 NA NA NA 0.207481 NA 0.277915 NA NA 0.241519 0.242069
10 2013-09-06 00:00:31 0.268708 NA NA NA 0.207481 NA 0.277915 NA NA 0.241519 0.242069
11 2013-09-06 00:01:01 0.265310 NA NA NA 0.218782 NA 0.269295 NA NA 0.236030 0.211069
12 2013-09-06 00:01:31 0.270377 NA NA NA 0.217813 NA 0.272507 NA NA 0.236714 0.219648
13 2013-09-06 00:02:01 0.271845 NA NA NA 0.213403 NA 0.271097 NA NA 0.236685 0.218460
df[1] <- df[1] + as.integer(df[[1]]) %% 2
will be much faster due to vectorized operations: as.integer(df[[1]]) %% 2 is 1 for odd seconds and 0 for even ones, so adding it increments exactly the odd timestamps. (By the way, you don't need to name yes explicitly in ifelse, but that doesn't affect speed.)
The above command adds one second to all odd values. If you want to do the opposite, i.e., add one second to even values, you just have to insert the logical not operator, !:
df[1] <- df[1] + !as.integer(df[[1]]) %% 2
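A quick sanity check of the first command on made-up timestamps (illustrative values, not from the question's data):
x <- as.POSIXct(c("2013-09-02 23:56:02", "2013-09-02 23:57:03"), tz = "UTC")
x + as.integer(x) %% 2
# [1] "2013-09-02 23:56:02 UTC" "2013-09-02 23:57:04 UTC"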
Try
odd <- as.integer(df[,1]) %% 2 == 1  # %% is not defined for POSIXct directly, so convert first
df[odd, 1] <- df[odd, 1] + 1
EDIT:
After reading your comment it looks like Sven's idea would work best. Try this (adapting Sven's approach):
df[,1] <- df[,1] + (as.integer(df[,1]) %% 2)
I am working with globally gridded data of annual maximum precipitation. I want to isolate the maximum value over land areas only, for each of my 145 years, by using a mask (so 145 maximum values based on all land areas). However, I am receiving only NA values when I apply the mask, and I cannot understand why (when the mask is not applied, the procedure below works just fine). Here is what I have done so far:
library(raster)
library(maptools)  # provides wrld_simpl

Model66 <- brick("MaxPrecNOAA-GFDLGFDL-ESM2Ghistorical.nc", var="onedaymax")
#Applying the mask to isolate land areas only:
data("wrld_simpl")
b <- wrld_simpl
land <- mask(Model66,b)
#To derive highest maximum value for each layer/year for land only (145 years = 145 maximum values)
Gmax <- sapply(unstack(land), function(r){max(values(r))})
Gmax
[1]   NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[40]  NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[79]  NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[118] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Why would this be happening? I isolated land only, and my plots correctly show that the mask worked, as only land has values on the plots for each layer/year (the idea is to take the highest value among these for each layer/year, as I attempted to do with the object Gmax). Again, when the mask is not applied, NAs don't show up, so I wonder if a small detail is causing this when the mask is used?
Any help with this would be greatly appreciated!
Thanks!
Try with:
Gmax <- sapply(unstack(land), function(r){max(values(r), na.rm=TRUE)})
After masking, every layer contains NA cells wherever there is ocean, and max() returns NA as soon as any value is NA; na.rm=TRUE removes the NAs before the maximum is computed.
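A quick illustration of the difference:
max(c(2, 5, NA))                # NA
max(c(2, 5, NA), na.rm = TRUE)  # 5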
I would like to subset my data frame by selecting columns via partial name matching, which works when I have a single "name" to match,
where the data frame is:
ABBA01A ABBA01B ABBA02A ABBA02B ACRU01A ACRU01B ACRU02A ACRU02B
1908 NA NA NA NA NA NA NA NA
1909 NA NA NA NA NA NA NA NA
1910 NA NA NA NA NA NA NA NA
1911 NA NA NA NA NA NA NA NA
1912 NA NA NA NA NA NA NA NA
1913 NA NA NA NA NA NA NA NA
library(stringr)
df[str_detect(names(df), "ABBA" )]
works, and returns:
ABBA01A ABBA01B ABBA02A ABBA02B
1908 NA NA NA NA
So, I would like to create a dataframe for each of my species:
Speciesnames = unique(substring(names(df), 1, 4))
Speciesnames
[1] "ABBA" "ACRU" "ARCU" "PIAB" "PIGL"
I have tried to make a loop and use [i] as the species name, but the str_detect function does not recognise it.
I would also like to add additional calculations in the loop:
for (i in seq_along(Speciesnames)){
  df = df[str_detect(names(df), pattern = [i])]
  print(df)
  # my function for the subsetted dataframe
}
thank you for your help!
Using your data you could do the following:
create a list to hold the data.frames to be created.
filter the data.frames and store in the list
give each data.frame the name of the species
bring all the data.frames to the global environment out of the list
library(dplyr)

Speciesnames <- unique(substring(names(df), 1, 4))
data <- vector("list", length(Speciesnames))
for(i in seq_along(Speciesnames)) {
  data[[i]] <- df %>% select(starts_with(Speciesnames[i]))
}
names(data) <- Speciesnames
list2env(data, envir = globalenv())
The end result after list2env is two data.frames called ABBA and ACRU, which you can then access. If further manipulation is needed, you might leave everything in the list and do it there.
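For instance, a minimal sketch of doing that inside the list (the colSums summary is just a placeholder operation, not from the question):
results <- lapply(data, function(d) colSums(!is.na(d)))  # non-NA count per column, per species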
An option is to use mapply with SIMPLIFY=FALSE to return a list of data frames, one per species. The base R function startsWith subsets columns whose names start with the species name.
# First find species by taking the unique first 4 characters of the column names
species <- unique(gsub("([A-Z]{4}).*", "\\1",names(df)))
# Loop over each species
listOfDFs <- mapply(function(x){
  df[, startsWith(names(df), x)]  # Return only columns starting with the species
}, species, SIMPLIFY=FALSE)
listOfDFs
# $ABBA
# ABBA01A ABBA01B ABBA02A ABBA02B
# 1908 NA NA NA NA
# 1909 NA NA NA NA
# 1910 NA NA NA NA
# 1911 NA NA NA NA
# 1912 NA NA NA NA
# 1913 NA NA NA NA
#
# $ACRU
# ACRU01A ACRU01B ACRU02A ACRU02B
# 1908 NA NA NA NA
# 1909 NA NA NA NA
# 1910 NA NA NA NA
# 1911 NA NA NA NA
# 1912 NA NA NA NA
# 1913 NA NA NA NA
Data:
df <- read.table(text =
"ABBA01A ABBA01B ABBA02A ABBA02B ACRU01A ACRU01B ACRU02A ACRU02B
1908 NA NA NA NA NA NA NA NA
1909 NA NA NA NA NA NA NA NA
1910 NA NA NA NA NA NA NA NA
1911 NA NA NA NA NA NA NA NA
1912 NA NA NA NA NA NA NA NA
1913 NA NA NA NA NA NA NA NA",
header = TRUE, stringsAsFactors = FALSE)
I think that you should select all matching columns first, and then subselect your data.frame.
patterns <- c("ABB", "CDC")
res <- lapply(patterns, function(x) grep(x, colnames(df), value=TRUE))
df[, unique(unlist(res))]
The res object is a list of the matched column names for each pattern.
The next step is to select the unique set of columns, unique(unlist(res)), and subset the data.frame with it.
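With the example data above, res would contain something like this ("CDC" matches no columns, so its entry is empty):
res
# [[1]]
# [1] "ABBA01A" "ABBA01B" "ABBA02A" "ABBA02B"
#
# [[2]]
# character(0)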
If you are writing production code, this is probably not the best approach.
Set every non-NA value that has a non-NA value to its left to NA.
Data
a <- c(3,2,3,NA,NA,1,NA,NA,2,1,4,NA)
[1] 3 2 3 NA NA 1 NA NA 2 1 4 NA
Desired Output
[1] 3 NA NA NA NA 1 NA NA 2 NA NA NA
My working but ugly solution:
library(magrittr)  # for %>%
IND <- !(is.na(a)) & data.table::rleidv(!(is.na(a))) %>% duplicated
a[IND]<- NA
a
There's gotta be a better solution ...
Alternatively, since diff(!is.na(a)) == 0 marks positions whose NA status matches that of the element to their left:
a[-1][diff(!is.na(a)) == 0] <- NA; a
# [1] 3 NA NA NA NA 1 NA NA 2 NA NA NA
OK for brevity...
a[!is.na(dplyr::lag(a))]<-NA
a
[1] 3 NA NA NA NA 1 NA NA 2 NA NA NA
You can do a simple ifelse statement where you add your vector to a lagged version of a. If the sum is NA (i.e. the value itself or its left neighbour is NA), the value remains the same; otherwise it becomes NA, i.e.
ifelse(is.na(a + dplyr::lag(a)), a, NA)
#[1] 3 NA NA NA NA 1 NA NA 2 NA NA NA
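For comparison, a base R version of the same lag idea (no packages needed; it shifts the NA test one position to the right):
a[c(FALSE, !is.na(head(a, -1)))] <- NA
a
# [1] 3 NA NA NA NA 1 NA NA 2 NA NA NA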
My code is as below:
library(dplyr)
library(tidyr)       # for spread
library(data.table)

Form_CharSizePorts2 <- function(main, size, var, wght, ret) {
  main.cln <- main %>%
    select(date, permno, exchcd, eval(parse(text=size)), eval(parse(text=var)), eval(parse(text=wght)), eval(parse(text=ret))) %>%
    data.table
  Bkpts.NYSE <- main.cln %>%
    filter(exchcd == 1) %>%
    group_by(date) %>%
    summarize(var.P70 = quantile(.[[var]], probs=.7, na.rm=TRUE),
              var.P30 = quantile(.[[var]], probs=.3, na.rm=TRUE),
              size.Med = quantile(.[[size]], probs=.5, na.rm=TRUE))
  main.rank <- main.cln %>%
    merge(Bkpts.NYSE, by="date", all.x=TRUE) %>%
    mutate(Size = ifelse(.[[size]]<size.Med, "Small", "Big"),
           Var = ifelse(.[[var]]<var.P30, "Low", ifelse(.[[var]]>var.P70, "High", "Neutral")),
           Port = paste(Size, Var, sep="."))
  Ret <- main.rank %>%
    group_by(date, Port) %>%
    summarize(ret.port = weighted.mean(.[[ret]], .[[wght]], na.rm=TRUE)) %>%
    spread(Port, ret.port) %>%
    mutate(Small = (Small.High + Small.Neutral + Small.Low)/3,
           Big = (Big.High + Big.Neutral + Big.Low)/3,
           SMB = Small - Big,
           High = (Small.High + Big.High)/2,
           Low = (Small.Low + Big.Low)/2,
           HML = High - Low)
  return(Ret)
}
Form_FF4Ports <- function(dt) {
  dt.cln <- dt %>%
    group_by(permno) %>%
    mutate(lag.ret.12t2 = lag(ret.12t2, 1))
  output <- dt.cln %>%
    group_by(date) %>%
    summarize(MyMkt = weighted.mean(retadj.1mn, w=port.weight, na.rm=TRUE)) %>%
    as.data.frame %>%
    merge(Form_CharSizePorts2(dt.cln, "lag.ME.Jun", "lag.BM.FF", "port.weight", "retadj.1mn"),
          by="date", all.x=TRUE) %>%
    transmute(date, MyMkt, MySMB=SMB, MySMBS=Small, MySMBB=Big, MyHML=HML, MyHMLH=High, MyHMLL=Low) %>%
    merge(Form_CharSizePorts2(dt.cln, "lag.ME.Jun", "lag.ret.12t2", "port.weight", "retadj.1mn"),
          by="date", all.x=TRUE) %>%
    transmute(date, MyMkt, MySMB, MySMBS, MySMBB, MyHML, MyHMLH, MyHMLL, MyUMD=HML, MyUMDU=High, MyUMDD=Low)
  return(output)
}
dt.myFF4.m <- Form_FF4Ports(data.both.FF.m)
Part of my data is as below:
date permno shrcd exchcd cfacpr cfacshr shrout prc vol retx retadj.1mn me port.weight datadate
1 Dec 1925 10006 10 1 7.412625 7.260000 600 109.00 NA NA NA 65.40000 NA <NA>
2 Dec 1925 10022 10 1 9.365437 9.365437 200 56.00 NA NA NA 11.20000 NA <NA>
3 Dec 1925 10030 10 1 9.969793 9.155520 156 150.00 NA NA NA 23.40000 NA <NA>
4 Dec 1925 10057 11 1 4.000000 4.000000 500 12.25 NA NA NA 6.12500 NA <NA>
5 Dec 1925 10073 10 1 0.200000 0.200000 138 17.50 NA NA NA 2.41500 NA <NA>
6 Dec 1925 10081 10 1 1.000000 1.000000 1192 9.00 NA NA NA 10.72800 NA <NA>
7 Dec 1925 10102 10 1 18.137865 18.000000 201 109.75 NA NA NA 22.05975 NA <NA>
8 Dec 1925 10110 10 1 1.010000 1.000000 500 10.50 NA NA NA 5.25000 NA <NA>
9 Dec 1925 10129 10 1 1.000000 1.000000 270 -132.00 NA NA NA 35.64000 NA <NA>
10 Dec 1925 10137 11 1 21.842743 20.920870 613 71.75 NA NA NA 43.98275 NA <NA>
comp.count at revt ib dvc BE OpProf GrProf Cflow Inv AstChg Davis.bkeq d.shares ret.12t2 ME.Dec ME.Jun BM.FF OpIB
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
7 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
8 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
9 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
10 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
GrIA CFP.FF BM.m CFP.m lag.ME.Jun lag.BM.FF lag.OpIB lag.AstChg
1 NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA
7 NA NA NA NA NA NA NA NA
8 NA NA NA NA NA NA NA NA
9 NA NA NA NA NA NA NA NA
10 NA NA NA NA NA NA NA NA
When I run the code, I get the error message: Error in eval(expr, envir, enclos) : object 'lag.ME.Jun' not found.
I guess the reason could be that I used the eval(parse(text = )) function here, and the environment is not set up correctly. However, other than this function, I am not sure which approach I should use when creating a universal purpose function suitable for data with different column names.
Specifically, I would like to know how I can use my function for different data frames without having to change the column names before I use them in my function.
Your issue is discussed, and solved, in the 'Programming with dplyr' vignette.
The bottom line is that instead of quoting lag.ME.Jun yourself by referring to it as the string "lag.ME.Jun", you should rely on enquo(lag.ME.Jun) and !!lag.ME.Jun. However, this means the bare column name has to be passed in the function call.
Your function at several other points also refers to variables that are not created in the function environment (e.g. exchcd, date), so R will currently throw errors on any dataset that does not contain these variables. In general, it is unwise for functions to rely on inputs that were not part of the function call.
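As a minimal sketch of the vignette's pattern, adapted to column names passed as strings the way your function already receives them (summarize_quantile is a hypothetical helper, not from your code; rlang::sym turns the string into a symbol that !! unquotes):
library(dplyr)
library(rlang)

# Hypothetical helper: a quantile of an arbitrary column named by a string,
# grouped by date (assumes the data has a date column, as yours does)
summarize_quantile <- function(dt, var, prob) {
  v <- sym(var)                # turn the string into a symbol
  dt %>%
    group_by(date) %>%
    summarize(q = quantile(!!v, probs = prob, na.rm = TRUE))  # unquote with !!
}
# e.g. summarize_quantile(main.cln, "lag.ME.Jun", 0.5)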
I am trying to get my head around how to use data.tables. It is not going well.
I have a large data.table with a bunch of returns and AUM. I subsetted that data.table into two data.tables, one with returns, and one with AUM. I now want to subset the returns data.table, to get only the returns from funds with AUM less than the 50th percentile.
To give you an idea, this is my code:
fundDetails <- data.table(read.table("Fund_Deets.csv", sep = ",", fill = TRUE, quote="\"", header=TRUE))
fundNAV <- data.table(read.table("NAV_AUM.csv", sep = ",", fill = TRUE, quote="\"", header=TRUE))
allFundDetails <- fundDetails[Currency == 'USD']
allFundNAV <- fundNAV[Fund.ID %in% allFundDetails$Fund.ID]
allFundAUM <- allFundNAV[Type == 'AUM', -c(1,3), with = FALSE]
allFundAUM <- setnames(data.table(t(sapply(allFundAUM[,-1, with = FALSE],as.numeric))), as.character(allFundAUM$Fund.ID))
allFundReturns <- allFundNAV[Type == 'Return', -c(1,3), with = FALSE]
allFundReturns <- setnames(data.table(t(sapply(allFundReturns[,-1, with = FALSE],as.numeric)/100)), as.character(allFundReturns$Fund.ID))
smallFundReturns <- data.table(sapply(allFundReturns, function(x) rep(NA, length(x))))
This produces the following three tables (smallFundReturns is obviously just NAs):
> allFundAUM[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA 1 27
2: NA NA NA NA NA NA 117 NA 1 27
3: NA NA NA NA NA NA 120 NA 1 27
4: NA NA NA NA NA NA 133 NA 1 27
5: NA NA NA NA NA NA 146 NA 1 29
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
> allFundReturns[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA 0.0188 -0.0116
2: NA NA NA NA NA NA -0.0315 NA -0.0120 0.0134
3: NA NA NA NA NA NA -0.0978 NA -0.0908 -0.0206
4: NA NA NA NA NA NA -0.0445 NA -0.0269 -0.0287
5: NA NA NA NA NA NA 0.0139 NA 0.0298 -0.0141
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
> smallFundReturns[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA NA NA
2: NA NA NA NA NA NA NA NA NA NA
3: NA NA NA NA NA NA NA NA NA NA
4: NA NA NA NA NA NA NA NA NA NA
5: NA NA NA NA NA NA NA NA NA NA
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
I am trying to subset using this for loop (using a for loop in an attempt to debug):
for (i in 1:nrow(allFundReturns)){
  theSubset <- as.vector(allFundReturns[i,] <= as.numeric(quantile(allFundAUM[i,], .5, na.rm = TRUE)))
  theSubset[is.na(theSubset)] <- FALSE
  theSubset <- colnames(allFundReturns)[theSubset]
  smallFundReturns[i,theSubset, with = FALSE] = allFundReturns[i,theSubset, with = FALSE]
}
This produces an error:
Error in `[<-.data.table`(`*tmp*`, i, theSubset, with = FALSE, value = list( :
unused argument (with = FALSE)
I tried removing the 'with' part, but this spits out a bunch of warnings:
> warnings()
Warning messages:
1: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
Supplied 3020 items to be assigned to 1 items of column '41526' (3019 unused)
2: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
Supplied 3020 items to be assigned to 1 items of column '45993' (3019 unused)
3: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
Supplied 3020 items to be assigned to 1 items of column '45994' (3019 unused)
4: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
I am confused about how to do this. Any ideas on how I can subset the second data.table by the subset of the first?
EDIT:
I tried the suggestion below:
smallFundReturns[i,(theSubset):=allFundReturns[i,(theSubset), with = FALSE], with = FALSE]
And I got these warnings():
> warnings()
Warning messages:
1: In `[.data.table`(smallFundReturns, i, `:=`((theSubset), ... :
Coerced 'double' RHS to 'logical' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 264 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'logical' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
2: In `[.data.table`(smallFundReturns, i, `:=`((theSubset), ... :
Coerced 'double' RHS to 'logical' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 264 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'logical' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
3: In `[.data.table`(smallFundReturns, i, `:=`((theSubset), ... :
And the code produced this, with 'TRUE' everywhere I would expect a number:
> smallFundReturns[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA TRUE TRUE
2: NA NA NA NA NA NA NA NA NA NA
3: NA NA NA NA NA NA NA NA NA NA
4: NA NA NA NA NA NA NA NA NA NA
5: NA NA NA NA NA NA NA NA NA NA
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
EDIT 2:
I figured out the issue. Apparently, this line:
smallFundReturns <- data.table(sapply(allFundReturns, function(x) rep(NA, length(x))))
created the table as being logical. I changed it to this line:
smallFundReturns <- data.table(sapply(allFundReturns, function(x) as.numeric(rep(NA, length(x)))))
And everything worked after @HubertL's fix. Thanks!!
You have to write it like this:
smallFundReturns[i,(theSubset):=allFundReturns[i,(theSubset), with = FALSE], with = FALSE]
Suggestions for improvement:
Try reading data with fread instead of read.table if possible. It's much faster, and the result is a data.table rather than a data.frame.
When you tack ", with=FALSE" onto "data.table operations", you actually force R to use the much slower data.frame operations instead of the blazingly fast data.table methods.
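For illustration, a standalone sketch of the := assignment together with the NA_real_ fix from your second edit (the toy table is made up, not from your data):
library(data.table)

# fread("NAV_AUM.csv") would replace the read.table call above
src <- data.table(a = c(0.1, 0.2), b = c(0.3, 0.4))
# Initialize with NA_real_, not plain NA, so that := of doubles is not coerced to logical
out <- src[, lapply(.SD, function(x) rep(NA_real_, length(x)))]
cols <- "a"
out[1, (cols) := src[1, cols, with = FALSE]]  # assign by reference, no coercion
out
#      a  b
# 1: 0.1 NA
# 2:  NA NA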
Have fun