Reading/writing a data frame to Google Sheets using pygsheets

What is the correct program flow to write different-sized data frames to the same worksheet, while ensuring that only the most recently written data values are visible?
Here was my original sequence:
gc = pygsheets.authorize(outh_file=oauth_file)
sh = gc.open(sheet_name)
wks = sh.worksheet_by_title(wks_name)
wks.set_dataframe(df, (1, 1))
The problem with the above sequence: if the 1st write was 3800 rows x 12 cols and the 2nd write was 2400 rows x 12 cols, the wks would still show stale data from the prior write in rows beyond 2400.
My 2nd solution (basically a hack just to get it to work for me):
gc = pygsheets.authorize(outh_file=oauth_file)
sh = gc.open(spreadsheet_name)
wks = sh.worksheet_by_title(sheet_name)
sh.del_worksheet(wks)
sh.add_worksheet(sheet_name, rows=len(df) + 1, cols=len(df.columns))
wks = sh.worksheet_by_title(sheet_name)
wks.set_dataframe(df, (1, 1))
The above sequence basically does what I want, but I do not like having to delete the wks (I lose all my manual formatting). I know there must be a correct way to accomplish this, but I do not know the pygsheets API very well.
Would a more advanced pygsheets user please advise on the proper program flow and methods to use?
TIA,
--Rj

fit=True will basically resize the sheet to fit your data frame. So if you want to keep the sheet at the same size, you can clear the sheet before the next write; it would be easier than your second solution. Also, if you just want to clear the range you had written earlier, you can pass a range to the clear function.
wks.set_dataframe(df, (1, 1))  # first write
wks.clear()                    # wipe old values before the next, smaller write
wks.set_dataframe(df, (1, 1))  # second write; no stale rows remain
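Putting the options from this answer together, a minimal sketch (fit comes straight from the answer above; the tuple arguments to clear and the old_rows bookkeeping are my assumptions):

# Option 1: let pygsheets resize the worksheet to exactly fit df
wks.set_dataframe(df, (1, 1), fit=True)

# Option 2: keep the sheet size; wipe all old values before rewriting
wks.clear()
wks.set_dataframe(df, (1, 1))

# Option 3: wipe only the previously written range (old_rows is assumed
# to have been saved from the prior write)
wks.clear(start=(1, 1), end=(old_rows + 1, len(df.columns)))
wks.set_dataframe(df, (1, 1))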

Related

How to create a "dynamic" column in R?

I'm coding a portfolio analysis tool based on back-tests. In short, I want to add a column that starts at the initial capital plus the result of the first trade, and then have each subsequent value updated by the % change of each trade, but I haven't sorted out a way to put that logic into code. The following code is a simplified example.
profit <- c(10, 15, -5, -6, 20)
change <- profit / 1000
balance <- c(1010, 1025, 1020, 1014, 1034)
data <- data.frame(profit, change, balance)
So far, the only way I can think of is to create a separate vector that increases or decreases based on the change column, but I'm not sure how to make it take the previous value into account. Doing balance = start_capital * (1 + change) gives the proportional increase relative to the same initial value every time, not relative to the previous balance plus the change of the new trade (I hope I explained myself).
Thanks,
Fernando.
EDIT
I have the correct change value in the actual program, as each back-test updates the balance with the result of each new trade, so the change column in the real data is correct. But my code combines several back-tests, and since the balance update is per separate back-test rather than for the combination, it is not usable when combining everything; that's why I added the change column.
If you want to do this via the change column, we can use Reduce:
start_capital <- 1000
Reduce(function(x, y) x + x*y, data$change, init = start_capital, accumulate = TRUE)[-1]
#[1] 1010.000 1025.150 1020.024 1013.904 1034.182
Reduce with accumulate = TRUE returns the cumulative results, feeding the output of each iteration in as the input to the next one.
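Equivalently, since each step just multiplies the running balance by (1 + change), the same numbers fall out of a cumulative product:

start_capital * cumprod(1 + data$change)
#[1] 1010.000 1025.150 1020.024 1013.904 1034.182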

Extracting data from lower layers in a Rasterbrick

So I'm extracting data from a rasterbrick I made using the method from this question: How to extract data from a RasterBrick?
In addition to obtaining the data from the layer given by the date, I want to extract the data from months prior. My best guess is to do something like this:
sapply(1:nrow(pts), function(i){extract(b, cbind(pts$x[i],pts$y[i]), layer=pts$layerindex[i-1], nl=1)})
The extraction should look at layerindex - 1, which should then give the data for one month earlier; a point with layerindex = 5 should look at layer 5 - 1 = 4.
However, it doesn't do this and seems to give either some random number or a duplicate from months prior. What would be the correct way to go about this?
Your code is taking the value from the layer of the previous point, not the previous layer.
To see that, imagine we are looking at the point in row 2 (i = 2). Your code indicates the layer with pts$layerindex[i-1], which is pts$layerindex[1]; in other words, the layer of the point in row 1.
The fix is easy enough. For clarity I will write the function separately:
foo <- function(i) extract(b, cbind(pts$x[i], pts$y[i]), layer = pts$layerindex[i] - 1, nl = 1)
sapply(1:nrow(pts), foo)
I have not tested it, but this should be all.
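If you want to convince yourself of the indexing, here is an untested toy sketch (the brick, points, and layerindex values are all made up):

library(raster)

# 3-layer brick on a unit extent; layer k is filled with the value k
b <- brick(nrows = 10, ncols = 10, nl = 3, xmn = 0, xmx = 1, ymn = 0, ymx = 1)
b <- setValues(b, matrix(rep(1:3, each = 100), ncol = 3))

# two points, each tagged with its own layer index
pts <- data.frame(x = c(0.25, 0.75), y = c(0.75, 0.25), layerindex = c(3, 2))

foo <- function(i) extract(b, cbind(pts$x[i], pts$y[i]),
                           layer = pts$layerindex[i] - 1, nl = 1)
sapply(1:nrow(pts), foo)  # previous layers' values: 2 and 1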

Divide column values within a vector

I'm not sure if my title properly expresses what I'm asking; hopefully it will make sense once I'm done writing. Firstly, I just started learning R, so I am a newbie. I've been reading through tutorial series and PDFs I've found online.
I'm working on a data set and I created a data frame of just the year 2001 and the DAM value BON. Here's a picture.
What I want to do now is create a matrix with 3 columns: Coho Adults, Coho Jacks and the third column the ratio of Coho Jacks to Adults. This is what I'm having trouble with. The ratio between Coho Jacks to Adults.
If I do a line of code like this I get a normal output.
(cohoPassage <- matrix(fishPassage1995BON[c(5, 6, 7)], ncol = 3))
The values are 259756, 6780, 114934.
I'm figuring that in order to get the ratio, I should divide column 5 by column 6. So basically 259756 / 6780 = 38.31.
I've tried many things like:
(cohoPassage <- matrix(fishPassage1995BON[c(5,6, 5/6)], ncol = 3))
This just outputs the value of the fifth column instead of dividing, for some reason.
I've tried this:
matrix(fishPassage1995BON[c(5,6)],fishPassage1995BON[,5]/fishPassage1995BON[,6], ncol = 3)
Which gives me an incorrect output
I decided to break down the problem and divide the fifth and sixth columns separately, and it gave the correct ratio.
If I create a matrix like this
matrix(fishPassage1995BON[,5]/fishPassage1995BON[,6])
It outputs the correct ratio of 38.31209. But when I try to combine everything, I just keep getting errors.
What can I do? Any help would be appreciated. Thank you.
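A likely culprit: inside c(5, 6, 5/6) the division is evaluated before any indexing happens, giving the index vector c(5, 6, 0.8333), and a fractional index truncates toward zero, so no third column is selected at all. A sketch of one way to do it, computing the ratio outside the index (column positions taken from the question):

# column 5 = Coho Adults, column 6 = Coho Jacks (positions from the question)
adults <- fishPassage1995BON[, 5]
jacks  <- fishPassage1995BON[, 6]
(cohoPassage <- cbind(adults, jacks, ratio = adults / jacks))
# the ratio column gives 259756 / 6780 = 38.31209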

Excel or R: writing code to automate filtering of non-oscillatory changes in data

I am new to coding and need direction to turn my method into code.
In my lab I am working on a time-series project to discover which genes in a cell naturally change over the organism's cell cycle. I have a tabular data set with numerical values (originally 10 columns, 27,000 rows). To analyze whether a gene is cycling over the data set, I divided the values of one time point (or column) by each subsequent time point (or column), and continued that across the data set. (The top section of the picture is an example of the spreadsheet with a numerical value at each time point; the bottom section is an example of what the time comparisons looked like across the data.)
I then imposed an advanced filter with multiple AND/OR criteria that followed this logic (source: Jeeped):
WHERE (column A >= 2.0 AND column B <= 0.5)
OR (column A >= 2.0 AND column C <= 0.5)
OR (column A >= 2.0 AND column D <= 0.5)
OR (column A >= 2.0 AND column E <= 0.5)
(etc ...)
From there, I slid the advanced filter across the entire data set (in the photograph, A on the left is an example of the original filter, and B shows the filter sliding across the data).
The filters produced multiple sheets of genes that fit my criteria. To figure out how many unique genes met the criteria, I merged Column A (Gene_IDs) of all the sheets and removed duplicates to produce a list of unique gene IDs.
The process took me nearly 3 hours due to the size of each spreadsheet (37 columns, 27,000 rows before filtering). Can this process be expedited? And if so, can someone point me in the right direction or help me create the code to do so?
Thank you for your time, and if you need any clarification please don't hesitate to ask.
There are a few ways to do this in R, but a common and easy one is vectorized logical operators: | puts an "OR" between corresponding elements of two logical vectors, row by row, so if any of the tests is true for a row it returns TRUE for that row. (Beware that any() would collapse all rows into a single TRUE/FALSE, which is not what you want here.) You can then combine the ORed columns with an AND for the column A test. There are probably other ways to abstract this as well, but this should get you started:
df <- data.frame(
  a = 1:100,
  b = 1:100,
  c = 51:150,
  d = 101:200,
  value = rep("a", 100)
)
# row-wise test: a > 2 AND (b > 5 OR c > 5 OR d > 5)
df[df$a > 2 & (df$b > 5 | df$c > 5 | df$d > 5), "value"] <- "Test Passed!"
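Scaled up to the layout in the question, the same row-wise test can be written with rowSums and slid across the columns in a loop; a sketch, assuming a data frame dat with a Gene_ID column and numeric ratio columns (all names here are illustrative):

ratio_cols <- setdiff(names(dat), "Gene_ID")

# one position of the sliding filter: reference column >= 2
# AND any subsequent column <= 0.5
hit <- function(ref) dat[[ratio_cols[ref]]] >= 2 &
  rowSums(dat[ratio_cols[-seq_len(ref)]] <= 0.5) > 0

# slide the reference column across the sheet, OR the results together,
# then de-duplicate the gene IDs
keep <- Reduce(`|`, lapply(seq_len(length(ratio_cols) - 1), hit))
unique(dat$Gene_ID[keep])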

Calculate percentage over time on very large data frames

I'm new to R, and my problem is that I know what I need to do, just not how to do it in R. I have a very large data frame from a web services load test, ~20M observations. It has the following variables:
epochtime, uri, cache (hit or miss)
I'm thinking I need to do a couple of things. I need to subset my data frame to the top 50 distinct URIs, then for each observation in each subset calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time by URI.
I have read, and am still reading, various posts here on this topic, but R is pretty new to me and I have a deadline. I'd appreciate any help I can get.
EDIT:
I can't provide exact data, but it looks like this; it's at least 20M observations I'm retrieving from a Mongo database. Time is epoch and we're recording many thousands per second, so time has a lot of dupes; that's expected. There could be more than 50 URIs; I only care about the top 50. The end result would be a line plot over time of % TCP_HIT relative to the total occurrences, by URI. Hope that's clearer.
time uri action
1355683900 /some/uri TCP_HIT
1355683900 /some/other/uri TCP_HIT
1355683905 /some/other/uri TCP_MISS
1355683906 /some/uri TCP_MISS
You are looking for the aggregate function.
Call your data frame u:
> u
time uri action
1 1355683900 /some/uri TCP_HIT
2 1355683900 /some/other/uri TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906 /some/uri TCP_MISS
Here is the hit ratio for a subset (using the factor-level order TCP_HIT = 1, TCP_MISS = 2, since alphabetical order is used by default), in ten-second intervals:
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
                               FUN = function(x) sum((2 - as.numeric(x)) / length(x)))
Now use lapply to get the final result:
lapply(seq_along(levels(u$uri)),
function(l) list(uri=levels(u$uri)[l],
hits=ratio(u[as.numeric(u$uri) == l,])))
[[1]]
[[1]]$uri
[1] "/some/other/uri"
[[1]]$hits
u$time%/%10 u$action
1 135568390 0.5
[[2]]
[[2]]$uri
[1] "/some/uri"
[[2]]$hits
u$time%/%10 u$action
1 135568390 0.5
Or otherwise filter the data frame by URI before computing the ratio.
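For example, using the ratio function defined above:

ratio(u[u$uri == "/some/uri", ])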
@MatthewLundberg's code is the right idea. Specifically, you want something that utilizes the split-apply-combine strategy.
Given the size of your data, though, I'd take a look at the data.table package.
You can see why visually here: data.table is just faster.
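As a sketch, the same hit-percentage computation in data.table might look like this (column names come from the example above; top50, bucket, and hit_pct are illustrative):

library(data.table)

dt <- as.data.table(u)

# the 50 most frequent URIs
top50 <- dt[, .N, by = uri][order(-N)][1:50, uri]

# % TCP_HIT per URI in ten-second buckets
res <- dt[uri %in% top50,
          .(hit_pct = mean(action == "TCP_HIT")),
          by = .(uri, bucket = time %/% 10)]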
Thought it would be useful to share my solution to the plotting part of the problem.
My R "noobness" may shine here, but this is what I came up with. It makes a basic line plot. It's plotting the actual value; I haven't done any conversions.
for (i in 1:length(h)) {
  name <- unlist(h[[i]][1])                           # URI for this element
  dftemp <- as.data.frame(do.call(rbind, h[[i]][2]))  # time/ratio pairs
  names(dftemp) <- c("time", "cache")
  plot(dftemp$time, dftemp$cache, type = "o")         # basic line plot
  title(main = name)
}
