Dataframe modification consisting of multiple steps - R

I have these two datasets that I am trying to use for linear regression. One contains daily average values (independent variables) measured from a weather station.
date ST5_mean ST1_mean ST0_mean ST10_mean Snowheight Precipitation
1 2014-10-08 11.136713 10.980278 11.333995 11.622550 0.23680556 118
2 2014-10-09 9.255580 8.727486 8.796319 11.635243 0.00000000 124
3 2014-10-10 10.297521 9.441427 9.376736 12.879920 0.00000000 108
4 2014-10-11 9.080031 9.172347 9.389281 9.372538 0.01041667 152
5 2014-10-12 10.059455 9.428875 9.392774 11.866694 0.00000000 425
.
.
.
242 2015-06-06 12.946955 11.979896 11.50326 14.060399 0.00000000 470
243 2015-06-07 12.918128 11.737031 11.17246 13.691757 0.00000000 407
244 2015-06-08 12.214410 11.779344 11.50781 12.370771 0.00000000 100
245 2015-06-09 11.271517 10.942083 10.79751 11.324122 0.00000000 19
246 2015-06-10 8.597696 9.730661 10.20789 8.181455 0.01180556 481
The second one is basically logger data (dependent variable), which may have several measurements per day or none (see the logger data table jpeg). I need to modify the logger data so that it is consistent with the station data and I can run a regression on the two, which means there should be one row per day. Logger measurements (the "Distance" column) that happened on the same day need to be summed up so that a single value per day is obtained; for example, if there are 3 measurements on 1.2.2014, there should be a value of 2.355 (3 x 0.785). Additionally, I need to create a row for every day of the period to match the sample size of the station data. A day for which the logger has no measurements should have a value of 0. I need to perform these modifications for numerous datasets, so I need to figure out code that does this in an automatic/semi-automatic manner; manually adding data would be absurd as the datasets have up to a few thousand rows. Unfortunately, I couldn't come up with anything meaningful over the last few days. Any help is appreciated.
I hope I managed to explain the problem here. Let me know if you would need more clarification. Thanks in advance!
P.S. I managed the first part, where I aggregate by date and obtain the daily sums; however, I am still stuck at creating a row for every day in the given time period and assigning 0 to the "distance" variable. This is what I have so far.
startTime <- as.Date("2014-10-08")
endTime <- as.Date("2015-06-10")
start_end <- c(startTime,endTime)
startTime <- as.Date("2014-10-08")
logger1 <- read.csv("124106_106.csv",header=TRUE, sep=",")
logger1$date <- as.Date(logger1$Date, "%d.%m.%Y")
logger1_sum <- aggregate (logger1$Distance, by = list(logger1$date), FUN = sum, na.rm=TRUE)"
names (logger1_sum) <- c("date", "distance")
head(logger1_sum, 5)
date distance
1 2014-10-02 1.570
2 2014-10-03 3.140
3 2014-10-08 3.925
4 2014-10-23 9.420
5 2014-10-24 3.925
tail(logger1_sum, 5)
date distance
45 2015-05-26 1.570
46 2015-05-27 1.570
47 2015-05-28 1.570
48 2015-06-10 0.785
49 2015-07-06 1.570

I think this should do the job. I use the data.table package, which makes joins easy and fast.
For brevity, I do not report your data, so the code starts as if the logger and station data.frames are already in the environment. The code does the following: it sums the columns Distance and AccuDist (assuming those are the two columns that matter) by the column date, which is the one already formatted as class Date.
Then, I set the merging keys with the function setkey(). If you want to read more about how joins work and how to perform them with data.table, please refer to this link. If you instead want to know more about data.table in general, you can refer to the official website here.
I then define the data.table final, which comes out of a right outer join. This way, I retain all the observations (i.e., rows) in the object station.
library(data.table)
# this converts the two data.frame in data.table by reference
setDT(logger)
setDT(station)
# sum Distance by date
logger_summed <- logger[ , .(sum_Distance = sum(Distance),
                             sum_AccuDist = sum(AccuDist)), by = date]
> head(logger_summed)
## date sum_Distance sum_AccuDist
## 1: 2014-10-02 1.570 2.355
## 2: 2014-10-03 3.140 14.130
## 3: 2014-10-08 3.925 35.325
## 4: 2014-10-23 9.420 164.850
## 5: 2014-10-24 3.925 102.050
## 6: 2014-10-25 2.355 70.650
setkey( logger_summed, date )
setkey( station, date )
final <- logger_summed[ station ]
final[ is.na(sum_Distance), `:=` ( sum_Distance = 0, sum_AccuDist = 0) ]
> final
## date sum_Distance sum_AccuDist ST5_mean ST1_mean ST0_mean ST10_mean Snowheight Precipitation
## 1: 2014-10-08 3.925 35.325 11.136713 10.980278 11.333995 11.622550 0.23680556 118
## 2: 2014-10-09 0.000 0.000 9.255580 8.727486 8.796319 11.635243 0.00000000 124
## 3: 2014-10-10 0.000 0.000 10.297521 9.441427 9.376736 12.879920 0.00000000 108
## 4: 2014-10-11 0.000 0.000 9.080031 9.172347 9.389281 9.372538 0.01041667 152
## 5: 2014-10-12 0.000 0.000 10.059455 9.428875 9.392774 11.866694 0.00000000 425
## ---
## 242: 2015-06-06 0.000 0.000 12.946955 11.979896 11.503257 14.060399 0.00000000 470
## 243: 2015-06-07 0.000 0.000 12.918128 11.737031 11.172462 13.691757 0.00000000 407
## 244: 2015-06-08 0.000 0.000 12.214410 11.779344 11.507812 12.370771 0.00000000 100
## 245: 2015-06-09 0.000 0.000 11.271517 10.942083 10.797510 11.324122 0.00000000 19
## 246: 2015-06-10 0.785 115.395 8.597696 9.730661 10.207893 8.181455 0.01180556 481
Does this help?
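If you prefer to stay in base R, a minimal sketch of the same idea (assuming the logger1_sum data frame from the question, with columns date and distance) is to build a complete daily date sequence and left-join the daily sums onto it:
# one row per day over the station period
all_days <- data.frame(date = seq(as.Date("2014-10-08"), as.Date("2015-06-10"), by = "day"))
# keep every day; days without logger measurements come back as NA ...
logger_full <- merge(all_days, logger1_sum, by = "date", all.x = TRUE)
# ... which we then replace with 0
logger_full$distance[is.na(logger_full$distance)] <- 0
logger_full then has exactly one row per day and can be merged with the station data by date.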

Related

Subsetting only positive values of specific column in a list

I have the following code to get an options data list and create a new list containing only the puts data (only_puts_list):
library(quantmod)
Symbols <- c("AA", "AAL", "AAOI", "ABBV", "ABC", "ABNB")
Options.20221111 <- lapply(Symbols, getOptionChain)
names(Options.20221111) <- Symbols
only_puts_list <- lapply(Options.20221111, function(x) x$puts)
I'd now like to subset only_puts_list and create a new list (i.e., new_list1) containing only the data that has a positive value in the column ChgPct of only_puts_list.
I guess lapply should work, but how do I apply it to keep only the positive values of the specific column ChgPct?
We could use subset after looping over the list with lapply:
new_list1 <- lapply(only_puts_list, subset, subset = ChgPct > 0)
If we check the output, most of the list elements returned have 0 rows, as there were no positive observations in 'ChgPct'. We could use Filter to keep only those elements that have any rows:
new_list1_sub <- Filter(nrow, new_list1)
-output
new_list1_sub
$ABBV
ContractID ConractSize Currency Expiration Strike Last Chg ChgPct Bid Ask Vol OI LastTradeTime IV
31 ABBV221202P00155000 REGULAR USD 2022-12-02 155.0 0.66 0.1100000 20.00000 0.56 0.66 70 480 2022-11-29 13:10:43 0.2690503
32 ABBV221202P00157500 REGULAR USD 2022-12-02 157.5 1.49 0.2400000 19.20000 1.41 1.51 544 383 2022-11-29 13:17:43 0.2627027
33 ABBV221202P00160000 REGULAR USD 2022-12-02 160.0 3.05 0.4300001 16.41222 2.79 2.99 34 308 2022-11-29 12:07:54 0.2692944
34 ABBV221202P00162500 REGULAR USD 2022-12-02 162.5 4.95 1.6499999 50.00000 4.80 5.05 6 28 2022-11-29 13:26:10 0.3017648
ITM
31 FALSE
32 FALSE
33 TRUE
34 TRUE
$ABC
ContractID ConractSize Currency Expiration Strike Last Chg ChgPct Bid Ask Vol OI LastTradeTime IV ITM
18 ABC221202P00165000 REGULAR USD 2022-12-02 165 1.05 0.1999999 23.5294 0.6 0.8 3 111 2022-11-29 09:51:47 0.2710034 FALSE
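A minimal base-R sketch that combines both steps (row subsetting and dropping empty elements) in a single pass, assuming the only_puts_list structure above:
# keep only rows with a positive ChgPct, then drop list elements that end up empty
new_list1_sub <- Filter(
  function(x) nrow(x) > 0,
  lapply(only_puts_list, function(x) x[!is.na(x$ChgPct) & x$ChgPct > 0, , drop = FALSE])
)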

R - [DESeq2] - How to use TMM-normalized counts (from edgeR) as input for DESeq2?

I have several RNA-seq samples from different experimental conditions. After sequencing and alignment to the reference genome, I merged the raw counts to get a data frame that looks like this:
> df_merge
T0 DJ21 DJ24 DJ29 DJ32 Rec2 Rec6 Rec9
G10 421 200 350 288 284 198 314 165
G1000 17208 10608 11720 11421 10142 10768 10331 6121
G10000 37 16 19 21 28 12 9 4
G10002 45 13 44 27 12 35 74 14
G10003 136 79 162 429 184 112 192 162
G10004 54 162 73 169 102 300 429 180
G10006 1 0 1 0 0 0 0 0
G10007 3 4 7 2 1 1 1 0
G1001 9030 8366 10608 13604 9808 10654 11663 7985
... ... ... ... ... ... ... ... ...
I use edgeR to perform TMM normalization, which is the normalization method I want to use and which is not available in DESeq2. For that I use the following script:
## Normalisation by the TMM method (Trimmed Mean of M-values)
library(edgeR)
dge <- DGEList(df_merge)                      # DGEList object created from the count data
dge2 <- calcNormFactors(dge, method = "TMM")  # TMM normalization: calculate the norm factors
I then obtain the following normalization factors:
> dge2$samples
group lib.size norm.factors
T0 1 129884277 1.1108130
DJ21 1 110429304 0.9453988
DJ24 1 126410256 1.0297216
DJ29 1 123008035 1.0553169
DJ32 1 118968544 0.9927826
Rec2 1 119000510 0.9465131
Rec6 1 114775318 1.0053686
Rec9 1 90693946 0.9275454
I normalize the raw counts with the normalization factors:
# Normalized pseudo-counts are obtained with the function cpm and stored in a data frame:
library(reshape2)  # for melt()
pseudo_TMM <- log2(cpm(dge2) + 1)
df_TMM <- melt(pseudo_TMM, id = rownames(raw_counts_wn))
names(df_TMM)[1:2] <- c("id", "sample")
df_TMM$method <- rep("TMM", nrow(df_TMM))
And I get TMM normalized counts, in a new dataframe:
> pseudo_TMM
T0 DJ21 DJ24 DJ29 DJ32 Rec2 Rec6 Rec9
G10 1.970115581 1.54384913 1.88316953 1.68642670 1.76745996 1.46356074 1.89575666 1.56628879
G1000 6.910138402 6.68101996 6.50839579 6.47542172 6.44077248 6.59395683 6.50032388 6.20481983
G10000 0.329354263 0.20571418 0.19656414 0.21632677 0.30692404 0.14605339 0.10835095 0.06701850
G10002 0.391657436 0.16931112 0.42010652 0.27261134 0.13960084 0.39037793 0.71483462 0.22209164
G10003 0.958011321 0.81287356 1.16642722 2.10593537 1.35494357 0.99592405 1.41354030 1.54881003
G10004 0.458675608 1.35147467 0.64230087 1.20281148 0.89809414 1.87320592 2.23810756 1.65064058
G10006 0.009964976 0.00000000 0.01104103 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
G10007 0.029690785 0.05424318 0.07556948 0.02205789 0.01216343 0.01275200 0.01244875 0.00000000
G1001 5.990679797 6.34224022 6.36623615 6.72515956 6.39302663 6.57876150 6.67346174 6.58377191
... ... ... ... ... ... ... ... ...
And this is where it gets complicated. Usually I do my DGE analysis with DESeq2 using the DESeqDataSetFromHTSeqCount() and DESeq() functions, where DESeq() itself runs an RLE normalization. Now I would like to use DESeq2 directly to do the DGE analysis on my already-normalized data. I saw that a DESeqDataSet object can be created from a matrix with the DESeqDataSetFromMatrix() function.
If someone has already succeeded in using DESeq2 with data from TMM normalization, I would appreciate some advice.
I remembered seeing something about how the norm factors must be converted to the appropriate size factors for DESeq2, and I found this thread on Bioconductor:
https://support.bioconductor.org/p/p133964/
It was suggested to read the following in order to get a better understanding of the conversion necessary:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0157022
Essentially in the supplementary info, they give the following code snippet for the conversion:
tmm <- calcNormFactors(geneCount, method = "TMM")
N <- colSums(geneCount)  # vector of library sizes
tmm.counts <- N * tmm / exp(mean(log(N * tmm)))
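For completeness, here is a minimal sketch (not from the linked supplement) of how the converted factors might then be handed to DESeq2; note that DESeq2 expects raw counts plus size factors, not pre-normalized counts, and the colData/design below are placeholders rather than your actual experimental design:
library(DESeq2)
# hypothetical sample table; replace condition with your real design (which needs replicates per condition)
coldata <- data.frame(condition = factor(colnames(df_merge)),
                      row.names = colnames(df_merge))
dds <- DESeqDataSetFromMatrix(countData = as.matrix(df_merge),
                              colData = coldata,
                              design = ~ condition)
# supply the TMM-derived factors instead of letting DESeq2 estimate its own RLE size factors
# (despite the name, tmm.counts from the snippet above holds size factors, not counts)
sizeFactors(dds) <- tmm.counts
# then run the usual workflow (the placeholder design above has no replicates; use your real one here)
dds <- DESeq(dds)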
Cheers

How to bind rows of two data frames so that rows with the same value in the ID column stay next to each other

I have two data frames that look like this:
data.one:
cust_id stats AIRLINE AUTO RENTAL CLOTHING STORES
1: 495 SUM 45493.42 3103.90 20927.56
2: 692 SUM 39954.78 0.00 20479.60
3: 728 SUM 25813.03 3504.74 5924.71
4: 1794 SUM 0.00 0.00 0.00
5: 3060 SUM 0.00 0.00 7146.31
data.two:
cust_id stats AIRLINE AUTO RENTAL CLOTHING STORES
1: 495 MAX 4950.00 1000.00 3140
2: 692 MAX 6479.71 0.00 1880
3: 728 MAX 5642.68 1752.37 1395
4: 1794 MAX 0.00 0.00 0
5: 3060 MAX 0.00 0.00 1338
I want to bind them together (row-wise) such that the resulting data frame will look like this:
cust_id stats AIRLINE AUTO RENTAL CLOTHING STORES
1: 495 SUM 45493.42 3103.90 20927.56
2: 495 MAX 4950.00 1000.00 3140
3: 692 SUM 39954.78 0.00 20479.60
4: 692 MAX 6479.71 0.00 1880
5: 728 SUM 25813.03 3504.74 5924.71
6: 728 MAX 5642.68 1752.37 1395
.
.
.
meaning that rows with the same cust_id from both data frames stay next to each other in the combined data frame.
Thank you for your time in advance.
Maybe the arrange function in dplyr would be useful:
library(dplyr)

custid <- c(111, 222, 333)
otherVar <- c(1, 2, 3)
df1 <- data.frame(custid, otherVar)

custid <- c(222, 333, 444)
otherVar <- c(2, 3, 4)
df2 <- data.frame(custid, otherVar)

df <- df1 %>%
  bind_rows(df2) %>%
  arrange(custid)
Just bind them together using rbind and then sort the data frame using order:
mydata <- rbind(data1, data2)
mydata[order(mydata$cust_id), ]
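This keeps the SUM row ahead of the MAX row within each cust_id as long as the sort preserves the original rbind order among ties. If you prefer to make that ordering explicit rather than rely on tie-breaking behaviour, a small sketch (assuming the stats column holds "SUM"/"MAX" as shown):
mydata <- rbind(data1, data2)
# order by cust_id first, then force SUM rows before MAX rows
mydata[order(mydata$cust_id, match(mydata$stats, c("SUM", "MAX"))), ]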

list unique values for each column in a data frame

Suppose you have a very large input file in CSV format, and you want to know the distinct values that occur in each column. How would you do that?
ex.
column1 column2 column3 column4
----------------------------------------
value11 value12 value13 value14
value21 value22 value23 value24
...
valueN1 valueN2 valueN3 valueN4
So I want my output to be something like:
column1 has these values: value11, value21, ..., valueN1, but I don't need to see repeat occurrences of the same value. I need this just to get an idea of what my data is all about.
Let dat be your data frame after reading in the CSV file; you can do
ulst <- lapply(dat, unique)
If you further want to know the number of unique values for each column, do
k <- lengths(ulst)
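A quick illustration on a built-in data set (iris here is just a stand-in for dat):
ulst <- lapply(iris, unique)
ulst$Species
# [1] setosa     versicolor virginica
# Levels: setosa versicolor virginica
lengths(ulst)
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
#           35           23           43           22            3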
I find the describe() function from the Hmisc package very handy for getting an overview of a dataset, e.g.,
Hmisc::describe(chickwts)
chickwts
2 Variables 71 Observations
----------------------------------------------------------------------------------------------------------------
weight
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90
71 0 66 1 261.3 90.26 140.5 153.0 204.5 258.0 323.5 359.0
.95
385.0
lowest : 108 124 136 140 141, highest: 380 390 392 404 423
----------------------------------------------------------------------------------------------------------------
feed
n missing distinct
71 0 6
Value casein horsebean linseed meatmeal soybean sunflower
Frequency 12 10 12 11 14 12
Proportion 0.169 0.141 0.169 0.155 0.197 0.169
----------------------------------------------------------------------------------------------------------------

Find the non zero values and frequency of those values in R

I have a dataset with two variables: date/time and flow. The flow is intermittent: at times there is zero flow, then suddenly the flow starts, there are non-zero values for some time, and then the flow is zero again. I want to understand when the non-zero values occur and how long each non-zero flow lasts. I have attached a sample dataset at this location: https://www.dropbox.com/s/ef1411dq4gyg0cm/sampledataflow.csv
The data is 1 minute data.
I was able to import the data into R as follows:
flow <- read.csv("sampledataflow.csv")
summary(flow)
names(flow) <- c("Date","discharge")
flow$Date <- strptime(flow$Date, format="%m/%d/%Y %H:%M")
sapply(flow,class)
plot(flow$Date, flow$discharge,type="l")
I made a plot to see the distribution but couldn't figure out where to start to get the frequency of the non-zero values. I would like to see an output table as follows:
Date Duration in Minutes
Please let me know if I am not clear here. Thanks.
Additional Info:
I think we need to find the first non-zero value and then count how many non-zero values occur in a row before the flow reaches zero again. What I want to understand is the flow release durations. For example, in one day there might be multiple releases, and I want to note at what time each release started and how long it continued before coming back to zero. I hope this explains the problem a little better.
The first point is that you have a lot of NAs in your data, in case you want to look into that.
If I understand correctly, you require the count of continuous 0's followed by continuous non-zeros, zeros, non-zeros, etc. for each date.
This can be achieved with rle, of course, as also mentioned by @mnel in the comments. But there are quite a few catches.
First, I'll set up the data with non-NA entries:
flow <- read.csv("~/Downloads/sampledataflow.csv")
names(flow) <- c("Date","discharge")
flow <- flow[1:33119, ] # remove NA entries
# format Date to POSIXct to play nice with data.table
flow$Date <- as.POSIXct(flow$Date, format="%m/%d/%Y %H:%M")
Next, I'll create a Date column:
flow$g1 <- as.Date(flow$Date)
Finally, I prefer using data.table. So here's a solution using it.
# load package, get data as data.table and set key
require(data.table)
flow.dt <- data.table(flow)
# set the key to both "Date" and "g1" (even though we'll use just g1)
# to make sure that the order of the rows is not changed during the sort
setkey(flow.dt, "Date", "g1")
# group by g1, convert discharge to TRUE/FALSE by equating to 0, and get rle lengths
out <- flow.dt[, list(duration = rle(discharge == 0)$lengths,
                      val = rle(discharge == 0)$values + 1), by = g1][val == 2, val := 0]
> out # just to show a few first and last entries
# g1 duration val
# 1: 2010-05-31 120 0
# 2: 2010-06-01 722 0
# 3: 2010-06-01 138 1
# 4: 2010-06-01 32 0
# 5: 2010-06-01 79 1
# ---
# 98: 2010-06-22 291 1
# 99: 2010-06-22 423 0
# 100: 2010-06-23 664 0
# 101: 2010-06-23 278 1
# 102: 2010-06-23 379 0
So, for example, for 2010-06-01, there are 722 0's followed by 138 non-zeros, followed by 32 0's followed by 79 non-zeros and so on...
I looked at a small sample of the first two days:
> do.call( cbind, tapply(flow$discharge, as.Date(flow$Date), function(x) table(x > 0) ) )
2010-06-01 2010-06-02
FALSE 1223 911
TRUE 217 529 # these are the cumulative daily durations of positive flow.
You may want this transposed, in which case the t() function should succeed. Or you could use rbind.
If you just wanted the number of flow-positive minutes, this would also work:
tapply(flow$discharge, as.Date(flow$Date), function(x) sum(x > 0, na.rm=TRUE) )
#--------
2010-06-01 2010-06-02 2010-06-03 2010-06-04 2010-06-05 2010-06-06 2010-06-07 2010-06-08
217 529 417 463 0 0 263 220
2010-06-09 2010-06-10 2010-06-11 2010-06-12 2010-06-13 2010-06-14 2010-06-15 2010-06-16
244 219 287 234 31 245 311 324
2010-06-17 2010-06-18 2010-06-19 2010-06-20 2010-06-21 2010-06-22 2010-06-23 2010-06-24
299 305 124 129 295 296 278 0
To get the lengths of intervals with discharge values greater than zero:
tapply(flow$discharge, as.Date(flow$Date), function(x) rle(x>0)$lengths[rle(x>0)$values] )
#--------
$`2010-06-01`
[1] 138 79
$`2010-06-02`
[1] 95 195 239
$`2010-06-03`
[1] 57 360
$`2010-06-04`
[1] 6 457
$`2010-06-05`
integer(0)
$`2010-06-06`
integer(0)
... Snipped output
If you want to look at the distribution of these durations, you will need to unlist that result; a short sketch of that follows the next code block. (And remember that the durations which were split at midnight may have influenced the counts and durations.) If you just wanted durations without dates, then use this:
flowrle <- rle(flow$discharge>0)
flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
#----------
[1] 138 79 95 195 296 360 6 457 263 17 203 79 80 85 30 189 17 270 127 107 31 1
[23] 2 1 241 311 229 13 82 299 305 3 121 129 295 3 2 291 278
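Following up on the note about looking at the distribution: a minimal sketch, reusing flowrle from above (the breaks value and labels are just illustrative):
# durations of all positive-flow runs, ignoring dates
durations <- flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
summary(durations)
hist(durations, breaks = 20,
     main = "Durations of positive-flow runs", xlab = "Duration (minutes)")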
