I am struggling with manipulating time series data. The first column of the dataset contains the time points of data collection; the second column onwards contains data from different studies. I have several hundred studies; as an example, I have included sample data for 5 studies. I want to stack the dataset vertically, with time and data points for each study. The example dataset looks like the data below:
TIME Study1 Study2 Study3 Study4 Study5
0.00 52.12 53.66 52.03 50.36 51.34
90.00 49.49 51.71 49.49 48.48 50.19
180.00 47.00 49.83 47.07 46.67 49.05
270.00 44.63 48.02 44.77 44.93 47.95
360.00 42.38 46.28 42.59 43.25 46.87
450.00 40.24 44.60 40.50 41.64 45.81
540.00 38.21 42.98 38.53 40.08 44.78
I am looking for an output in the form of:
TIME Study ID
0 52.12 1
90 49.49 1
180 47 1
270 44.63 1
360 42.38 1
450 40.24 1
540 38.21 1
0 53.66 2
90 51.71 2
180 49.83 2
270 48.02 2
360 46.28 2
450 44.6 2
540 42.98 2
0 52.03 3
90 49.49 3
180 47.07 3
270 44.77 3
...
This is a classic 'wide to long' dataset manipulation. Below, I show the use of the base function reshape() on your data:
d.l <- reshape(d, varying=list(c("Study1","Study2","Study3","Study4","Study5")),
               v.names="Y", idvar="TIME", times=1:5, timevar="Study",
               direction="long")
d.l <- d.l[,c(2,1,3)]   # reorder the columns to Study, TIME, Y
rownames(d.l) <- NULL   # drop the composite row names reshape() creates
d.l
# Study TIME Y
# 1 1 0 52.12
# 2 1 90 49.49
# 3 1 180 47.00
# 4 1 270 44.63
# 5 1 360 42.38
# 6 1 450 40.24
# 7 1 540 38.21
# 8 2 0 53.66
# 9 2 90 51.71
# 10 2 180 49.83
# 11 2 270 48.02
# 12 2 360 46.28
# 13 2 450 44.60
# 14 2 540 42.98
# 15 3 0 52.03
# 16 3 90 49.49
# 17 3 180 47.07
# ...
However, there are many ways to do this in R: the most basic reference on SO (of which this is probably a duplicate) is Reshaping data.frame from wide to long format, but there are many other relevant threads (see this search: [r] wide to long). Beyond using reshape, @lmo's method can be used, as well as methods based on the reshape2, tidyr, and data.table packages (presumably among others).
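For instance, here is a tidyr sketch (assuming the wide data frame is named d, as above; pivot_longer supersedes the older gather):
library(tidyr)
library(dplyr)

# Stack the Study columns into name/value pairs, then turn
# "Study1".."Study5" into the integer IDs 1..5.
d.l2 <- d %>%
  pivot_longer(starts_with("Study"), names_to = "Study", values_to = "Y") %>%
  mutate(Study = as.integer(sub("Study", "", Study))) %>%
  arrange(Study, TIME)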
Here is one method using cbind and stack:
longdf <- cbind(df$TIME, stack(df[,-1]))
names(longdf) <- c("TIME", "Study", "id")
This returns
longdf
TIME Study id
1 0 52.12 Study1
2 90 49.49 Study1
3 180 47.00 Study1
4 270 44.63 Study1
5 360 42.38 Study1
6 450 40.24 Study1
7 540 38.21 Study1
8 0 53.66 Study2
9 90 51.71 Study2
...
If you want to change id to integers as in your example, use
longdf$id <- as.integer(longdf$id)
(stack() stores the column names in a factor, so as.integer() returns the underlying level codes, 1 to 5 here).
I'm pretty fresh to R (like 2 days old). I have a set of data that is a time series sampled every 200 ms over a few hours. Here's the
head(dat):
Date Time MSec Sample Pat1 Pat2 Pat3
1 8/7/~ 14:34 411 0 100 13 13
2 8/7/~ 14:34 615 1 13 13 143
3 8/7/~ 14:34 814 2 13 13 13
4 8/7/~ 14:34 12 3 130 13 13
5 8/7/~ 14:34 216 4 13 13 130
6 8/7/~ 14:34 417 5 139 13 13
It goes on for 2 hours, so several thousand points, and over several hundred patients. The value 13 is our baseline, and what we are interested in is spikes in activity over, say, 100. I have been trying to create a new column for each patient column marking every time a signal is over 100. I've worked out the following code:
dat$Pat1exc <- as.numeric(dat$Pat1 >=100)
This works and gives me the new column and my data looks like below:
Date Time MSec Sample Pat1 Pat2 Pat3 Pat1exc
1 8/7/~ 14:34 411 0 100 13 13 1
2 8/7/~ 14:34 615 1 13 13 143 0
3 8/7/~ 14:34 814 2 13 13 13 0
4 8/7/~ 14:34 12 3 130 13 13 1
5 8/7/~ 14:34 216 4 13 13 130 0
6 8/7/~ 14:34 417 5 139 13 13 1
This is exactly what I want, but I don't know how to iterate through each column to create Pat2exc, Pat3exc, etc. I figured I could use sapply or vapply after I create a function. However, I can't get the function to work.
excite <- function(x, y) {y <- as.numeric(x >=100)}
excite(x=dat$Pat2, y=dat$Pat2exc)
This gives me no errors, but it doesn't modify the dat data frame. Essentially, in the end I just want to sum up all the excited columns (>= 100). If there is an easier way to count the samples over 100 for each patient, then I'd be happy to learn how to do that as well.
Sorry if this is unclear. Thanks in advance.
P.S.: I am also looking for a good way to combine the Time and Msec columns.
Edit: Added in unabbreviated data:
Date Time Msecs
8/7/2018 14:34:07 411
8/7/2018 14:34:07 615
8/7/2018 14:34:07 814
8/7/2018 14:34:08 12
8/7/2018 14:34:08 216
8/7/2018 14:34:08 417
8/7/2018 14:34:08 619
8/7/2018 14:34:08 816
8/7/2018 14:34:09 15
We can use mutate_at from dplyr to create the binary variables and mutate + rowSums to add them all up:
library(dplyr)
df %>%
  mutate_at(vars(starts_with("Pat")), funs(exc = (. >= 100)*1)) %>%
  mutate(exc_total = rowSums(.[grepl('_exc', names(.))]))
Result:
Date Time MSec Sample Pat1 Pat2 Pat3 Pat1_exc Pat2_exc Pat3_exc exc_total
1 8/7/~ 14:34 411 0 100 13 13 1 0 0 1
2 8/7/~ 14:34 615 1 13 13 143 0 0 1 1
3 8/7/~ 14:34 814 2 13 13 13 0 0 0 0
4 8/7/~ 14:34 12 3 130 13 13 1 0 0 1
5 8/7/~ 14:34 216 4 13 13 130 0 0 1 1
6 8/7/~ 14:34 417 5 139 13 13 1 0 0 1
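As for why excite() had no visible effect: R functions operate on copies of their arguments, so assigning to y inside the function changes nothing outside it; you would need to assign the returned value, e.g. dat$Pat2exc <- excite(dat$Pat2). If you prefer base R, the same idea can be sketched with lapply over the patient columns (assuming the data frame is named dat as in the question); colSums then counts the samples over the threshold for each patient:
# A base R sketch (assumes the patient columns are the only ones starting with "Pat")
pat_cols <- grep("^Pat", names(dat), value = TRUE)

# One binary column per patient, marking samples >= 100
dat[paste0(pat_cols, "exc")] <- lapply(dat[pat_cols], function(x) as.numeric(x >= 100))

# Number of samples >= 100 for each patient
colSums(dat[paste0(pat_cols, "exc")])
For the P.S. about combining the time columns, a hedged sketch (assuming Date is month/day/year, Time includes seconds as in the unabbreviated data, and MSec is milliseconds within the second):
dat$DateTime <- as.POSIXct(paste(dat$Date, dat$Time),
                           format = "%m/%d/%Y %H:%M:%S") + dat$MSec/1000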
I would like to calculate and plot changing numbers of differently colored animals over time using dplyr and ggplot2.
I have observations of different animals on random dates, and so I would first like to group those observations into 4-day brackets and then calculate the mean color for each 4-day bracket. I created the column Bracket.mean with mock results for the first few rows just to show what I have in mind. I would like to add those means to the same data frame (as opposed to creating a new data frame or vectors) for later analysis and plotting, if possible.
And for the plot I'm hoping to show the bracket means with some measure of variance around them (SD or boxplots), as well as the daily observations over time (perhaps as a faded overlay in the background).
Below is part of the dataset I'm using (with a made-up 'Bracket.mean' column I'm hoping to calculate). 'Count' is the number of animals of a specific 'Color' on a given 'Date'.
Date Julian Count Color Bracket.mean
4/19/16 110 1 50 mean of 4/19-4/22
4/19/16 110 1 50 mean of 4/19-4/22
4/19/16 110 1 100 mean of 4/19-4/22
4/20/16 111 4 50 mean of 4/19-4/22
4/20/16 111 1 0 mean of 4/19-4/22
4/20/16 111 2 100 mean of 4/19-4/22
4/20/16 111 1 50 mean of 4/19-4/22
4/20/16 111 2 100 mean of 4/19-4/22
4/21/16 112 1 100 mean of 4/19-4/22
4/21/16 112 2 50 mean of 4/19-4/22
4/21/16 112 4 50 mean of 4/19-4/22
4/21/16 112 1 100 mean of 4/19-4/22
4/21/16 112 2 50 mean of 4/19-4/22
4/21/16 112 1 0 mean of 4/19-4/22
4/22/16 113 2 0 mean of 4/19-4/22
4/22/16 113 4 50 mean of 4/23-4/26
4/23/16 114 6 0 mean of 4/23-4/26
4/23/16 114 1 50 mean of 4/23-4/26
4/24/16 115 2 0 mean of 4/23-4/26
4/26/16 117 5 0 mean of 4/23-4/26
4/30/16 121 1 50
5/2/16 123 1 NA
5/2/16 123 1 50
5/7/16 128 2 0
5/7/16 128 3 0
5/7/16 128 3 0
5/8/16 129 4 0
5/8/16 129 1 0
5/10/16 131 1 50
5/10/16 131 4 50
5/12/16 133 1 0
5/13/16 134 1 50
5/14/16 135 1 0
5/14/16 135 2 50
5/14/16 135 2 0
5/14/16 135 1 0
5/17/16 138 1 0
5/17/16 138 2 0
5/23/16 144 1 0
5/24/16 145 4 0
5/24/16 145 1 0
5/24/16 145 1 0
5/27/16 148 3 NA
5/27/16 148 1 0
5/27/16 148 1 50
Any help would be greatly appreciated. Thanks very much in advance!
Something like this should get you started.
library(dplyr)
df <- df %>% mutate(Date = as.Date(Date, format='%m/%d/%y'),
                    Start = as.Date(cut(Date, breaks = seq(min(Date), max(Date)+4, by=4)))) %>%
  mutate(End = Start + 3) %>%
  group_by(Start, End) %>%
  summarise(meanColor = mean(Color, na.rm=TRUE),
            sdColor = sd(Color, na.rm=TRUE))
df
#Source: local data frame [10 x 4]
#Groups: Start [?]
# Start End meanColor sdColor
# <date> <date> <dbl> <dbl>
#1 2016-04-19 2016-04-22 56.25000 35.93976
#2 2016-04-23 2016-04-26 12.50000 25.00000
#3 2016-04-27 2016-04-30 50.00000 NA
#4 2016-05-01 2016-05-04 50.00000 NA
#5 2016-05-05 2016-05-08 0.00000 0.00000
#6 2016-05-09 2016-05-12 33.33333 28.86751
#7 2016-05-13 2016-05-16 20.00000 27.38613
#8 2016-05-17 2016-05-20 0.00000 0.00000
#9 2016-05-21 2016-05-24 0.00000 0.00000
#10 2016-05-25 2016-05-28 25.00000 35.35534
Then plot using:
library(ggplot2)
ggplot(df) + geom_line(aes(Start, meanColor))
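To get closer to the plot described in the question (bracket means with a measure of variance, and the daily observations faded in the background), one option is geom_errorbar plus a faded point layer. A sketch, assuming the raw observations were kept in a copy called obs (taken before the summarising step above, with Date already parsed as a Date):
ggplot() +
  # daily observations, faded in the background
  geom_point(data = obs, aes(Date, Color), alpha = 0.2, na.rm = TRUE) +
  # bracket means +/- 1 SD (bars are dropped where sdColor is NA)
  geom_errorbar(data = df, aes(x = Start, ymin = meanColor - sdColor,
                               ymax = meanColor + sdColor), width = 1) +
  geom_point(data = df, aes(Start, meanColor), size = 2)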
Let's assume I ran a random forest model and got the variable importance info as below:
set.seed(121)
ImpMeasure<-data.frame(mod.varImp$importance)
ImpMeasure$Vars<-row.names(ImpMeasure)
ImpMeasure.df<-ImpMeasure[order(-ImpMeasure$Overall),]
row.names(ImpMeasure.df)<-NULL
class(ImpMeasure.df)
ImpMeasure.df<-ImpMeasure.df[,c(2,1)] # so now we have the importance variable info in a data frame
ImpMeasure.df
Vars Overall
1 num_voted_users 100.000000
2 num_critic_for_reviews 58.961441
3 num_user_for_reviews 56.500707
4 movie_facebook_likes 50.680318
5 cast_total_facebook_likes 30.012205
6 gross 27.652559
7 actor_3_facebook_likes 24.094213
8 actor_2_facebook_likes 19.633290
9 imdb_score 16.063007
10 actor_1_facebook_likes 15.848972
11 duration 11.886036
12 budget 11.853066
13 title_year 7.804387
14 director_facebook_likes 7.318787
15 facenumber_in_poster 1.868376
16 aspect_ratio 0.000000
Now, if I decide that I want only the top 5 variables for further analysis, I do this:
library(dplyr)
top.var<-ImpMeasure.df[1:5,] %>% select(Vars)
top.var
Vars
1 num_voted_users
2 num_critic_for_reviews
3 num_user_for_reviews
4 movie_facebook_likes
5 cast_total_facebook_likes
How can I use this info to select these variables from the original dataset (given below) without spelling out the actual variable names, but instead using the output of top.var? How do I use dplyr's select function for this?
My original dataset is like this:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes
1 723 178 0 855
2 302 169 563 1000
3 602 148 0 161
4 813 164 22000 23000
5 255 95 131 782
6 462 132 475 530
actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes
1 1000 760505847 886204 4834
2 40000 309404152 471220 48350
3 11000 200074175 275868 11700
4 27000 448130642 1144337 106759
5 131 228830 8 143
6 640 73058679 212204 1873
facenumber_in_poster num_user_for_reviews budget title_year
1 0 3054 237000000 2009
2 0 1238 300000000 2007
3 1 994 245000000 2015
4 0 2701 250000000 2012
5 0 97 26000000 2002
6 1 738 263700000 2012
actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes cluster
1 936 7.9 1.78 33000 2
2 5000 7.1 2.35 0 3
3 393 6.8 2.35 85000 2
4 23000 8.5 2.35 164000 3
5 12 7.1 1.85 0 1
6 632 6.6 2.35 24000 2
movies.imp <- moviesdf.cluster %>% select(one_of(top.var$Vars), cluster)
head(movies.imp)
## num_voted_users num_user_for_reviews num_critic_for_reviews
## 1 886204 3054 723
## 2 471220 1238 302
## 3 275868 994 602
## 4 1144337 2701 813
## 5 8 127 37
## 6 212204 738 462
## movie_facebook_likes cast_total_facebook_likes cluster
## 1 33000 4834 1
## 2 0 48350 1
## 3 85000 11700 1
## 4 164000 106759 1
## 5 0 143 2
## 6 24000 1873 1
That's done!
Hadley provided the answer to that here:
select_(df, .dots = top.var)
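Note that top.var is a one-column data frame, so you may need to pull the names out as a character vector first. Also, select_ has since been deprecated; in current dplyr the same idea would be written as follows (a sketch, reusing the objects above):
library(dplyr)

top.vars <- top.var$Vars                # character vector of variable names
movies.imp <- moviesdf.cluster %>%
  select(all_of(top.vars), cluster)     # all_of() errors if a name is missing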
I'm struggling with a space-time problem. Help is very much appreciated! First, here's what I'm looking for:
I've got a data frame of roe deer fixes (x,y coordinates) taken at an irregular, approximately five-minute interval.
I want to find all fixes within a certain radius (let's say 15 meters) of the current position of the roe deer, with the condition that they are consecutive for about 30 minutes (so six fixes). Obviously the number of fixes can vary along the trajectory. I want to write the number of fixes into a new column and mark in another column (0=no, 1=yes) whether the condition is met.
If the conditions are met, I would like to calculate the centre of gravity (in x,y coordinates) of said point cloud and write it to this or another data frame.
Following another user's question (Listing number of obervations by location), I was able to find the fixes within the radius, but I couldn't figure out a way to ensure that they are consecutive.
Here are some rows of my data frame (in reality there are more than 10,000 rows; because I subsetted this data frame, the IDs do not start at one):
FID No CollarID Date Time Latitude__ Longitude_ Height__m_ DOP FixType
1 0 667 7024 2013-10-22 06:01:49 47.26859 8.570701 609.94 10.6 GPS-3D
2 1 668 7024 2013-10-22 06:06:04 47.26861 8.570634 612.31 10.4 GPS-3D
3 2 669 7024 2013-10-22 06:11:07 47.26871 8.570402 609.43 9.8 GPS-3D
4 3 670 7024 2013-10-22 06:16:14 47.26857 8.570796 665.40 4.4 val. GPS-3D
5 4 671 7024 2013-10-22 06:20:36 47.26855 8.570582 653.65 4.6 val. GPS-3D
6 5 672 7024 2013-10-22 06:25:50 47.26850 8.570834 659.03 4.8 val. GPS-3D
7 6 673 7024 2013-10-23 06:00:53 47.27017 8.569882 654.86 3.6 val. GPS-3D
8 7 700 7024 2013-10-26 12:00:18 47.26904 8.569596 651.88 3.8 val. GPS-3D
9 8 701 7024 2013-10-26 12:05:41 47.26899 8.569640 652.76 3.8 val. GPS-3D
10 9 702 7024 2013-10-26 12:10:40 47.26898 8.569534 650.42 4.6 val. GPS-3D
11 10 703 7024 2013-10-26 12:16:17 47.26896 8.569606 653.77 11.4 GPS-3D
12 11 704 7024 2013-10-26 12:20:18 47.26903 8.569792 702.49 9.8 val. GPS-3D
13 12 705 7024 2013-10-26 12:25:47 47.26901 8.569579 670.12 2.4 val. GPS-3D
14 13 706 7024 2013-10-26 12:30:18 47.26900 8.569477 685.65 2.0 val. GPS-3D
15 14 707 7024 2013-10-26 12:35:23 47.26885 8.569400 685.15 6.2 val. GPS-3D
Temp___C_ X Y ID Trajectory distance speed timelag timevalid
1 19 685667.7 235916.0 RE01 RE01 5.420858 0.021258268 4.250000 1
2 20 685662.6 235917.8 RE01 RE01 21.276251 0.070218649 5.050000 1
3 20 685644.9 235929.5 RE01 RE01 34.070730 0.110979577 5.116667 1
4 20 685675.0 235913.5 RE01 RE01 16.335573 0.062349516 4.366667 1
5 20 685658.8 235911.3 RE01 RE01 19.896906 0.063365941 5.233333 1
6 20 685677.9 235905.7 RE01 RE01 199.248728 0.002346781 1415.050000 0
7 22 685603.2 236090.4 RE01 RE01 126.831124 0.000451734 4679.416667 0
8 22 685583.4 235965.1 RE01 RE01 6.330467 0.019598970 5.383333 1
9 22 685586.8 235959.8 RE01 RE01 8.270701 0.027661208 4.983333 1
10 23 685578.8 235957.8 RE01 RE01 5.888147 0.017472246 5.616667 1
11 22 685584.3 235955.7 RE01 RE01 16.040998 0.066560158 4.016667 1
12 23 685598.3 235963.6 RE01 RE01 16.205330 0.049256322 5.483333 1
13 23 685582.2 235961.6 RE01 RE01 7.742184 0.028568946 4.516667 1
14 23 685574.5 235960.9 RE01 RE01 18.129019 0.059439406 5.083333 1
15 23 685568.8 235943.7 RE01 RE01 15.760165 0.051672673 5.083333 1
Date_text Time_text DateTime Flucht FluchtALL
1 22.10.2013 06:01:49 22.10.2013 06:01:49 0 0
2 22.10.2013 06:06:04 22.10.2013 06:06:04 0 0
3 22.10.2013 06:11:07 22.10.2013 06:11:07 0 0
4 22.10.2013 06:16:14 22.10.2013 06:16:14 0 0
5 22.10.2013 06:20:36 22.10.2013 06:20:36 0 0
6 22.10.2013 06:25:50 22.10.2013 06:25:50 0 0
7 23.10.2013 06:00:53 23.10.2013 06:00:53 0 0
8 26.10.2013 12:00:18 26.10.2013 12:00:18 0 0
9 26.10.2013 12:05:41 26.10.2013 12:05:41 0 0
10 26.10.2013 12:10:40 26.10.2013 12:10:40 0 0
11 26.10.2013 12:16:17 26.10.2013 12:16:17 0 0
12 26.10.2013 12:20:18 26.10.2013 12:20:18 0 0
13 26.10.2013 12:25:47 26.10.2013 12:25:47 0 0
14 26.10.2013 12:30:18 26.10.2013 12:30:18 0 0
15 26.10.2013 12:35:23 26.10.2013 12:35:23 0 0
And here's the code I've got so far:
for (i in seq(nrow(df))) {
  # circle's centre
  xcentre <- df[i,'X']
  ycentre <- df[i,'Y']
  # count how many fixes lie within 15 m of the above centre;
  # the noofclosepoints column will contain this value
  df[i,'noofclosepoints'] <- sum(
    (df[,'X'] - xcentre)^2 +
    (df[,'Y'] - ycentre)^2
    <= 15^2
  ) - 1
  cat(i, ': ')
  # this prints the TRUE/FALSE vector showing which rows are within the radius
  cat((df[,'X'] - xcentre)^2 +
      (df[,'Y'] - ycentre)^2
      <= 15^2)
  cat('\n')
}
I tried to test the consecutive condition on the TRUE/FALSE vector printed by cat(), but cat() only prints, so I couldn't access the results to process them any further.
I know this is a mile-long question; I would be very glad if someone could help me with this or part of this problem. I will be thankful till the end of my life :). By the way, you may have noticed that I'm an unfortunate R beginner. Many thanks!
Here's how: you can apply a rolling function to your time series with a window of 7 (the current fix plus the six preceding ones).
library(xts)
## Read your time series using the handy `read.zoo`
## Note the use of the index here
dx <-
read.zoo(text="Date Time DOP X Y noofclosepoints
4705 09.07.2014 11:05:33 3.4 686926.8 231039.3 14
4706 09.07.2014 11:10:53 3.2 686930.5 231042.5 14
4707 09.07.2014 11:16:29 15.8 686935.2 231035.5 14
4708 09.07.2014 11:20:08 5.2 686932.9 231035.6 14
4709 09.07.2014 11:25:17 4.8 686933.8 231038.6 14
4710 09.07.2014 11:30:16 2.2 686938.0 231037.0 15
4711 09.07.2014 11:35:13 2.0 686930.9 231035.8 14
4712 09.07.2014 11:40:09 2.0 686930.6 231035.7 14
4713 09.07.2014 11:45:25 3.4 686907.2 231046.8 0
4714 09.07.2014 11:50:25 3.2 686936.1 231037.1 14",
index = 1:2,format="%d.%m.%Y %H:%M:%S",tz="")
## Apply the rolling function to all columns:
## for each window, compute the distance between the current
## point and the others, then return the number of points within the radius
as.xts(rollapplyr(dx,7,function(x) {
curr <- tail(x,1)
others <- head(x,-1)
dist <- sqrt((others[,"X"]-curr[,"X"])^2 + (others[,"Y"]-curr[,"Y"])^2 )
sum(dist<15)
},by.column=FALSE))
# [,1]
# 2014-07-09 11:35:13 6
# 2014-07-09 11:40:09 6
# 2014-07-09 11:45:25 0
# 2014-07-09 11:50:25 5
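The question also asks for the centre of gravity of the point cloud when the condition is met. A sketch extending the same rolling window (reusing dx from above): it returns the mean X/Y of the window when all six preceding fixes are within 15 m of the current one, and NA otherwise.
# For each window of 7 fixes (current + 6 previous), return the mean
# coordinates of the window if the consecutive-fixes condition holds.
cog <- rollapplyr(dx, 7, function(x) {
  curr   <- tail(x, 1)
  others <- head(x, -1)
  dist   <- sqrt((others[,"X"] - curr[,"X"])^2 + (others[,"Y"] - curr[,"Y"])^2)
  if (sum(dist < 15) == 6) c(X = mean(x[,"X"]), Y = mean(x[,"Y"]))
  else c(X = NA_real_, Y = NA_real_)
}, by.column = FALSE)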
I have data like this:
> bbT11
range X0 X1 total BR GDis BDis WOE IV Index
1 (1,23] 5718 194 5912 0.03281461 12.291488 8.009909 0.42822753 1.83348973 1.534535
2 (23,26] 5249 330 5579 0.05915039 11.283319 13.625103 -0.18858848 0.44163352 1.207544
3 (26,28] 3105 209 3314 0.06306578 6.674549 8.629232 -0.25685394 0.50206815 1.292856
4 (28,33] 6277 416 6693 0.06215449 13.493121 17.175888 -0.24132650 0.88874916 1.272937
5 (33,37] 4443 239 4682 0.05104656 9.550731 9.867878 -0.03266713 0.01036028 1.033207
6 (37,41] 4277 237 4514 0.05250332 9.193895 9.785301 -0.06234172 0.03686928 1.064326
7 (41,46] 4904 265 5169 0.05126717 10.541702 10.941371 -0.03721203 0.01487247 1.037913
8 (46,51] 4582 230 4812 0.04779717 9.849527 9.496284 0.03652287 0.01290145 1.037198
9 (51,57] 4039 197 4236 0.04650614 8.682287 8.133774 0.06526000 0.03579599 1.067437
10 (57,76] 3926 105 4031 0.02604813 8.439381 4.335260 0.66612734 2.73386708 1.946684
I need to add an additional column "Bin" that shows numbers from 1 to 10 based on the rank of the BR column, so for example the 10th row (smallest BR) becomes 1, the first row becomes 2, etc.
Any help would be appreciated.
A very straightforward way is to use one of the rank functions from "dplyr" (e.g. dense_rank, min_rank). Here, I've actually just used rank from base R. I've deleted some columns below purely for presentation purposes.
library(dplyr)
mydf %>% mutate(bin = rank(BR))
# range X0 X1 total BR ... Index bin
# 1 (1,23] 5718 194 5912 0.03281461 ... 1.534535 2
# 2 (23,26] 5249 330 5579 0.05915039 ... 1.207544 8
# 3 (26,28] 3105 209 3314 0.06306578 ... 1.292856 10
# 4 (28,33] 6277 416 6693 0.06215449 ... 1.272937 9
# 5 (33,37] 4443 239 4682 0.05104656 ... 1.033207 5
# 6 (37,41] 4277 237 4514 0.05250332 ... 1.064326 7
# 7 (41,46] 4904 265 5169 0.05126717 ... 1.037913 6
# 8 (46,51] 4582 230 4812 0.04779717 ... 1.037198 4
# 9 (51,57] 4039 197 4236 0.04650614 ... 1.067437 3
# 10 (57,76] 3926 105 4031 0.02604813 ... 1.946684 1
If you just want to reorder the rows, use arrange instead:
mydf %>% arrange(BR)
A base R alternative that writes the bin numbers directly into the original data frame:
bbT11$Bin[order(bbT11$BR)] <- 1:nrow(bbT11)
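For reference, dplyr's min_rank mentioned above gives the same result here; a sketch (rank and min_rank differ only in how they handle ties):
library(dplyr)
mydf %>% mutate(Bin = min_rank(BR))  # identical to rank() when there are no ties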