Creating plots after applying functions to split data in R

Thanks in advance for your time reading and answering this.
I have a data frame (15264 x 3), the head of which is:
head(actData)
steps date interval
289 0 2012-10-02 0
290 0 2012-10-02 5
291 0 2012-10-02 10
292 0 2012-10-02 15
293 0 2012-10-02 20
294 0 2012-10-02 25
The "date" variable is a factor with 53 levels; I want to split the data by date, calculate the mean of the steps per date, and then create a plot of interval vs. the mean of steps.
What I have done:
library(plyr)
mn <- ddply(actData, c("date"), function(x) apply(x[1], 2, mean)) # mean of steps per day (length 53)
splt <- split(actData, actData$date) # split the data by date (it should divide the data into 53 parts)
Now I have two objects, each of length 53, but when I try plotting them I get an error about their lengths differing:
plot(splt$interval, mn[,2], type="l")
Error in xy.coords(x, y, xlabel, ylabel, log) : 'x' and 'y' lengths differ
When I check the length of splt$interval, it gives me 0!
I've also visited "How to split a data frame by rows, and then process the blocks?", "Split data based on column values and create scatter plot." and so on, with a lot of good suggestions, but none of them addresses my question.
Sorry if my question is a little stupid; I am not an expert in R :)
I am using Windows 7 and RStudio (R 3.0.1).
Thanks.
EDIT:
head(splt, 2)
$`2012-10-01`
[1] steps date interval
<0 rows> (or 0-length row.names)
$`2012-10-02`
steps date interval
289 0 2012-10-02 0
290 0 2012-10-02 5
291 0 2012-10-02 10
292 0 2012-10-02 15
head(mn)
date steps
1 2012-10-02 0.43750
2 2012-10-03 39.41667
3 2012-10-04 42.06944
4 2012-10-05 46.15972
5 2012-10-06 53.54167
6 2012-10-07 38.24653

I want to split the data based on date, calculate the mean of the steps/date and then create a plot for interval vs. steps' mean;
After step 2, you will have a matrix like this:
mean(steps) date
289 0.23 2012-10-02
290 0.42 2012-10-03
291 0.31 2012-10-04
You want to plot this against "the intervals", but there are multiple intervals per 'date'. What exactly are you trying to plot as x vs. y?
The mean steps per date?
The mean steps vs mean intervals (i.e. an x-y point per date)?
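If the intent is the average number of steps in each 5-minute interval, taken across all days (just a guess at the goal, with actData as shown above), a minimal sketch would be:
# average steps per interval across all days, then plot interval vs. that mean
intMeans <- aggregate(steps ~ interval, data = actData, FUN = mean)
plot(intMeans$interval, intMeans$steps, type = "l",
     xlab = "interval", ylab = "mean steps")
Note also that split() returns a list of data frames keyed by date, so splt$interval does not exist (hence length 0); one day's intervals would be, e.g., splt[["2012-10-02"]]$interval.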

Related

Calculate the length of an interval if data are equal to zero

I have a dataframe with time points and the corresponding measure of activity in different subjects. Each time point is a 5-minute interval.
time Subject1 Subject2
06:03:00 6,682129 8,127075
06:08:00 3,612061 20,58838
06:13:00 0 0
06:18:00 0,9030762 0
06:23:00 0 0
06:28:00 0 0
06:33:00 0 0
06:38:00 0 7,404663
06:43:00 0 11,55835
...
I would like to calculate the length of each run of intervals with zero activity, as in the example below:
            Subject1  Subject2
Interval_1         1         5
Interval_2         5
I have the impression that I should solve this using loops and conditions, but as I am not very experienced with loops, I do not know where to start. Do you have any ideas on how to solve this? Any help is really appreciated!
You can use rle() to find runs of consecutive values and the length of the runs. We need to filter the results to only runs where the value is 0:
result = lapply(df[-1], \(x) with(rle(x), lengths[values == 0]))
result
# $Subject1
# [1] 1 5
#
# $Subject2
# [1] 5
As different subjects can have different numbers of 0-runs, the results make more sense in a list than a rectangular data frame.
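For intuition, this is what rle() returns on a toy vector (made-up values mirroring Subject1's pattern):
x <- c(6.68, 3.61, 0, 0.90, 0, 0, 0, 0, 0, 7.40)
rle(x)
# Run Length Encoding
#   lengths: int [1:6] 1 1 1 1 5 1
#   values : num [1:6] 6.68 3.61 0 0.9 0 7.4
with(rle(x), lengths[values == 0])
# [1] 1 5
This assumes the comma-decimal columns shown in the question have already been read in as numeric (e.g. with dec = "," in read.table).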

Is there an existing function in R that sorts a continuous variable into an EQUAL number of observations per group? [closed]

I have a 2319-row data frame df; I would like to sort the continuous variable var and split it into a specified number of groups with an equal (or as close as possible) number of observations per group. I have seen a similar post where cut2() from Hmisc was recommended, but it does not always provide an equal number of observations per group. For example, what I have using cut2():
df$Group <- as.numeric(cut2(df$var, g = 10))
var Group
1415 1
1004 1
1285 1
2099 2
2119 2
2427 4
...
table(df$Group)
1 2 3 4 5 6 7 8 9 10
232 232 241 223 233 246 219 243 226 224
Has anyone used or written something that does not rely on the underlying distribution of the variable (e.g. var), but rather on the number of observations in the data and the number of groups specified? I do have non-unique values.
What I want is a more equal number of observations, for example:
table(df$Group)
1 2 3 4 5 6 7 8 9 10
232 232 231 233 231 233 232 231 231 233
cut/cut2 and other functions depend on the distribution of the data to create groups. If you want a more or less equal number of observations per group, one option is to use rep (an alternative with ntile() is sketched after the code).
library(dplyr)
n <- 10
df %>%
  arrange(var) %>%
  # rep() labels the sorted rows 1..n in blocks of floor(nrow/n) rows; the few
  # leftover rows at the end get recycled low labels, which cummax() folds
  # into the last group
  mutate(Group = cummax(rep(seq_len(n), each = n() / n, length.out = n())))
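An alternative not mentioned in the answer is dplyr::ntile(), which ranks the values and assigns them to n groups whose sizes differ by at most one:
library(dplyr)
# each row is assigned to one of 10 groups by the rank of var;
# group sizes differ by at most one observation
df$Group <- ntile(df$var, 10)
table(df$Group)
Like the rep() approach, ntile() breaks ties by row order, so identical var values can end up in different groups.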

Add Elements of Data Frame to Another Data Frame Based on Condition R

I have two data frames that showcase results of an analysis from one month and then the subsequent month.
Here is a smaller version of the data:
Jan19=data.frame(Group=c(589,630,523,581,689),Count=c(191,84,77,73,57))
Dec18=data.frame(Group=c(589,630,523,478,602),Count=c(100,90,50,6,0))
Jan19
Group Count
1 589 191
2 630 84
3 523 77
4 581 73
5 689 57
Dec18
Group Count
1 589 100
2 630 90
3 523 50
4 478 6
5 602 0
Jan19 only has counts >0. Dec18 is the dataset with results from the previous month. Dec18 has counts >=0 for each group. I have been referencing the full Dec18 dataset for counts = 0 and manually entering them into the full Jan19 dataset. I want to rid myself of the manual part of this exercise and just be able to append the groups with counts = 0 to the end of the Jan19 dataset.
That led me to the following code to perform what I described above:
GData=rbind(Jan19,Dec18)
GData=GData[!duplicated(GData$Group),]
While this code resulted in the correct dimensions, it does not choose the correct duplicate to remove. Within the appended dataset, it treats the Jan19 results >0 as the duplicates and removes those. This is the result:
Gdata
Group Count
1 589 191
2 630 84
3 523 77
4 581 73
5 689 57
9 478 6
10 602 0
Essentially, I wanted that 6 to show up as a 0. So that led me to the following line of code, where I wanted to set a condition: if the newly appended data (Dec18) has a Group duplicated in the newer data (Jan19), then that corresponding Count should be 0; otherwise, the value of Count from the Jan19 dataset should hold.
Gdata=ifelse(Dec18$Group %in% Jan19$Group==FALSE, Gdata$Count==0,Jan19$Count)
This is resulting in errors and I'm not sure how to modify it to achieve my desired result. Any help would be appreciated!
Your rbind/deduplication approach is a good one; you just need the Dec18 data you rbind on to have its Count column set to 0:
Gdata = rbind(Jan19, transform(Dec18, Count = 0))
Gdata[!duplicated(Gdata$Group), ]
# Group Count
# 1 589 191
# 2 630 84
# 3 523 77
# 4 581 73
# 5 689 57
# 9 478 0
# 10 602 0
While this code resulted in the correct dimensions, it does not choose the correct duplicate to remove. Within the appended dataset, it treats the Jan19 results >0 as the duplicates and removes those. This is the result:
This is incorrect. !duplicated() will keep the first occurrence and remove later occurrences. None of the Jan19 data is removed; we can see that the first 5 rows of Gdata are exactly the 5 rows of Jan19. The only issue was that the non-duplicated rows from Dec18 did not all have counts of 0. We fix this with the transform().
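A quick illustration of that behaviour of duplicated(), on a toy vector rather than the data:
duplicated(c(589, 630, 589))
# [1] FALSE FALSE  TRUE    # only the later occurrence is flagged as a duplicate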
There are plenty of other ways to do this: with a join using the merge function (a sketch of that version follows below); by only rbinding on the non-duplicated groups, as d.b suggests, rbind(Jan19, transform(Dec18, Count = 0)[!Dec18$Group %in% Jan19$Group, ]); and others. We could also make your ifelse approach work like this:
Gdata = rbind(Jan19, Dec18)
Gdata$Count = ifelse(!Gdata$Group %in% Jan19$Group, 0, Gdata$Count)   # index Gdata$Group, not Dec18$Group, so the lengths match
# an alternative to ifelse, a little cleaner
Gdata = rbind(Jan19, Dec18)
Gdata$Count[!Gdata$Group %in% Jan19$Group] = 0
# either way, the !duplicated(Gdata$Group) step from above still removes the duplicated rows
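For completeness, a rough sketch of the join-based version mentioned above (not code from the original answer), using base R's merge():
# full join on Group; keep the Jan19 count where it exists, otherwise use 0
merged <- merge(Jan19, Dec18, by = "Group", all = TRUE, suffixes = c(".jan", ".dec"))
merged$Count <- ifelse(is.na(merged$Count.jan), 0, merged$Count.jan)
Gdata <- merged[, c("Group", "Count")]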
Use whatever makes the most sense to you.

Detection of time-series outliers

I'm working on a university forecasting project. I have a huge database of demand between city pairs, but I know that this dataset is contaminated and I do not know which data points are obscured. The dataset is a panel that follows demand between city pairs on a monthly basis. Below is part of the data I am working with.
CAI.JED CAI.RUH ADD.DXB CAI.IST ALG.IST
2013-01-01 19196 14777 16 1413 12
2013-02-01 19913 8 18203 1026 5
2013-03-01 34242 11751 17836 985 1
2013-04-01 23481 12000 13479 948 27
2013-05-01 24428 16046 16391 954 9
2013-06-01 31791 23479 16571 1 4
2013-07-01 33716 20090 11323 0 5724
2013-08-01 35553 2 11121 0 0
2013-09-01 18746 13423 12119 0 26
2013-10-01 10 12223 10239 0 0
2013-11-01 19 20234 14231 5 2
2013-12-01 15198 1 12132 10 5
The dataset is a combination of two datasets. The people who provided the data told me that in some months only one of the two datasets is working; however, it is not known which dataset is available in which months.
Now comes my question: for the next part of the project, I need to get annual demand numbers. However, as I know that the figures are obscured, I would like to remove outliers. What techniques are available in R to do this?
As the data is in time-series format, I tried to use the tsoutliers package (see http://cran.r-project.org/web/packages/tsoutliers/tsoutliers.pdf). However, I could not get this working. Also, I tried the suggestions from https://stats.stackexchange.com/questions/104882/detecting-outliers-in-time-series-ls-ao-tc-using-tsoutliers-package-in-r-how/104946#104946 , but it didn't work.
After knowing what the outliers are, I would like to either replace them (e.g. with the mean for that route), or if too many points are missing, I would like to reject the entire route from the dataset.
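(For reference, a basic call to the tsoutliers package on a single route looks roughly like the sketch below; the object name demand, the column, the start date, and the frequency are assumptions based on the sample shown above.)
library(tsoutliers)
y <- ts(demand[, "CAI.JED"], start = c(2013, 1), frequency = 12)
fit <- tso(y, types = c("AO", "LS", "TC"))   # additive outliers, level shifts, temporary changes
fit$outliers                                  # which months were flagged, and the estimated effects
plot(fit)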
I prefer a density-based clustering algorithm such as DBSCAN. If you tune the epsilon and minimum-samples parameters, you can filter outliers very specifically, using a plot to visualize the result (points labelled as noise, -1 in some implementations, are the outliers).
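A minimal sketch of that idea in R with the dbscan package (eps and minPts below are placeholders that would need tuning, and demand is assumed to be the numeric demand matrix shown above):
library(dbscan)
x <- scale(as.matrix(demand))                  # one row per month, one column per route
db <- dbscan(x, eps = 1.5, minPts = 4)         # eps / minPts control how aggressively points are flagged
outlier_months <- which(db$cluster == 0)       # in this package, noise points get cluster 0 (not -1)
plot(x[, "CAI.JED"], col = ifelse(db$cluster == 0, "red", "black"),
     ylab = "scaled demand, CAI.JED")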

Using data table to run 100,000 Fisher's Exact Tests is slower than apply

Good morning,
I'm trying to use R to run 100,000 Fisher's exact tests on simulated genetic data very quickly, preferably in under 30 seconds (since I need to permute case-control labels and iterate the process 1,000 times, so it runs overnight).
I tried using data.table on melted, tidy data, which contains about 200,000,000 rows and four columns (subject ID, disease status, position, and 'value' [the number of wild-type alleles, a three-level factor]). The function groups by position, then performs a Fisher's exact test of value against disease.
> head(casecontrol3)
ident disease position value
1: 1 0 36044 2
2: 2 0 36044 2
3: 3 0 36044 1
4: 4 0 36044 1
5: 5 0 36044 2
6: 6 0 36044 1
> setkey(casecontrol3,position)
> system.time(casecontrol4 <- casecontrol3[,list(p=fisher.test(value,
+ factor(disease))$p.value), by=position])
user system elapsed
215.430 11.878 229.148
> head(casecontrol4)
position p
1: 36044 6.263228e-40
2: 36495 1.155289e-68
3: 38411 7.842216e-19
4: 41083 1.272841e-69
5: 41866 2.264452e-09
6: 41894 9.833324e-10
However, it's really slow in comparison to using a simple apply function on flattened, messy case-control tables (100,000 rows; the columns contain the disease status and number of wild-type alleles, so the apply function first converts each row into a 2x3 case-control table and uses the matrix interface of fisher.test). It takes about 20 seconds of running time to convert the data from its previous (unmelted) form into this form (not shown).
> head(cctab)
control_aa control_aA control_AA case_aa case_aA case_AA
[1,] 291 501 208 521 432 47
[2,] 213 518 269 23 392 585
[3,] 170 499 331 215 628 157
[4,] 657 308 35 269 619 112
[5,] 439 463 98 348 597 55
[6,] 410 480 110 323 616 61
> myfisher <- function(row){
+ contab <- matrix(as.integer(row),nrow=2,byrow=TRUE)
+ pval <- fisher.test(contab)$p.value
+ return(pval)
+ }
> system.time(tab <- apply(cctab,1,"myfisher"))
user system elapsed
28.846 10.989 40.173
> head(tab)
[1] 6.263228e-40 1.155289e-68 7.842216e-19 1.272841e-69 2.264452e-09 9.833324e-10
As you can see, using apply is much faster than data.table, which really surprises me. And the results are exactly the same:
> identical(casecontrol4$p,tab)
[1] TRUE
Does anyone who is an expert at using data.table know how I could speed up my code with it? Or is the data just too big for me to use it in the melted form (which rules out using data.table, dplyr, etc)? Note that I haven't tried dplyr, as I've heard that data.table is faster for big data sets like this.
Thanks.
I would suggest another route: adding an HPC element to your approach.
You can use multiple CPU or GPU cores, scale up a free cluster of computers on AWS EC2, connect to AWS EMR, or use any of a plethora of great HPC tools to facilitate your existing code.
Check out the CRAN HPC Task View and this tutorial.
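As a small sketch of the multiple-cores idea with base R's parallel package (cctab and myfisher are taken from the question; the worker count is an assumption):
library(parallel)
cl <- makeCluster(detectCores() - 1)       # leave one core free for the OS
tab <- parApply(cl, cctab, 1, myfisher)    # same computation as apply(), spread across the workers
stopCluster(cl)
Each fisher.test call is independent, so the 100,000 tests split cleanly across cores.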
