R: Doing calculations on multiple factors/levels (Dummy variables)

I have two equally long matching vectors of time series data: Price (x) and hour (h). Hour goes from 0-23. My hour variable is my dummy variable (or factor/level variable I guess it is called in R).
Right now I've defined 24 different dummy variables, one per hour, and I type out each one by hand. So to generate 24 plots or calculate 24 means, for example, I would type:
plot.ts(hour1) # and so on for all 24.
I would like to do this for all 24 variables as easily as possible, so I can run a lot of different calculations. For example, how could I compute the mean for all 24 dummy variables without writing 24 lines of code and changing the variable name in each one?
EDIT: Sorry, thought it was clear with the two vectors. Example:
Price Hour
8 0
12 1
14 2
16 3
18 4
20 5
22 6
24 7
26 8
28 9
24 10
26 11
23 12
23 13
23 14
14 15
19 16
25 17
26 18
28 19
30 20
33 21
24 22
10 23
14 0
12 1
13 2
x etc.

It is not clear how your data are stored since you don't give a reproducible example. I assume you have a separate variable for each hour (hour1, hour2, and so on).
Generally, it is better to put your hourxx variables in a list to perform calculations.
For example, this will compute the mean for all hours:
lapply(lapply(ls(pattern='hour.*'),get),mean)
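A sketch of the same list idea, assuming the vectors really are named hour1, hour2, ... in the workspace (three made-up vectors stand in for the 24 here). mget() collects the objects into a named list in one step, and sapply() returns the means as a named vector instead of a list:

```r
# Hypothetical stand-ins for the 24 hourly vectors
hour1 <- c(8, 14)
hour2 <- c(12, 12)
hour3 <- c(14, 13)

# Collect every object whose name is "hour" followed by digits into a named list
hours <- mget(ls(pattern = "^hour[0-9]+$"))

# One mean per vector, returned as a named numeric vector
hour_means <- sapply(hours, mean)
hour_means
```

Once the vectors sit in a list, any other function (plot.ts, sd, summary) can be swapped in for mean.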
EDIT after OP clarification:
You should create a new variable to distinguish between successive 0-23 hour cycles (the id increases each time Hour wraps from 23 back to 0). Something like:
dat <- data.frame(Price=rnorm(24*5),Hour=rep(0:23,5))
dat$id <- cumsum(c(0,diff(dat$Hour)==-23))
Then, using the plyr package for example, you can compute the mean by id:
library(plyr)
ddply(dat,.(id),summarise,mPrice=mean(Price))
id mPrice
1 0 0.2999602
2 1 -0.2201148
3 2 0.2400192
4 3 -0.2087594
5 4 0.1666915
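If the aim is one mean per hour of the day rather than one per cycle, grouping on Hour directly may be enough. A minimal base-R sketch, using only the rows for hours 0 to 2 from the question's example (the data frame name prices is made up for illustration):

```r
# Hours 0-2 of the two cycles shown in the question
prices <- data.frame(
  Price = c(8, 12, 14, 14, 12, 13),
  Hour  = c(0, 1, 2, 0, 1, 2)
)

# Mean price for each hour, as a vector named by hour
hourly_mean <- tapply(prices$Price, prices$Hour, mean)
hourly_mean

# The same result as a data frame
aggregate(Price ~ Hour, data = prices, FUN = mean)

# A list with one price vector per hour, e.g. for plotting each hour in turn
by_hour <- split(prices$Price, prices$Hour)
```

With the full data, the same three calls would give 24 groups without defining 24 separate variables.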

Related

Show only even numbers from a data set

I am trying to extract only the even numbers from the "cars" data set.
I know I need to create a new function.
I have come this far:
Is.even = function(x) x %% 2 == 0
When I enter in:
Is.even(cars[1])
It gives me back a logical response. I want to only display the actual even numbers in integer form and hide the odd numbers.
What am I doing wrong?
Apart from @neilfws' suggestion, if you pass your values as a vector you can also use Filter:
Filter(Is.even, cars[, 1])
#[1] 4 4 8 10 10 10 12 12 12 12 14 14 14 14 16 16 18 18 18 18 20 20 20 20 20 22 24 24 24 24
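The call Is.even(cars[1]) only prints TRUE/FALSE because it returns the result of the test and never uses it to subset anything. As a sketch, the same predicate can index the speed column (column 1 of the built-in cars data set) directly:

```r
Is.even <- function(x) x %% 2 == 0

speed <- cars$speed                   # first column of cars, as a plain vector
even_speed <- speed[Is.even(speed)]   # keep only the elements where the test is TRUE
even_speed
```

This should give the same values as the Filter call; logical indexing is simply the more common idiom for vectors.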

R: counting amount of patterns of numbers [closed]

I'm fairly new here and also fairly new to R so apologies if anything is unclear.
Basically, I have a csv table of numbers for each person, 1 number for each week for 38 weeks.
For example, Anthony has number 6 in week 1, 12 in week 2 and so on, these numbers are fairly random and range from 1-20.
I have taken the numbers from the table and saved them into a string, hence Anthony's string when printed would look like
"6 12 18 7 17 4 16 11 20 15 3 5 19 10 8 9 1 14 13 19 11 16 18 4 17 7 6 12 14 1 10 13 20 15 3 5 8 9"
What I'm trying to do with this is find/count the amount of times a number between 1 and 10 occurs in groups of 3 consecutively and then groups of 4 consecutively and possibly 5.
For example, in this string 8, 9 and 1 occur consecutively and then 3, 5, 8 and 9 occur consecutively, meaning the amount of occurrences is 2.
I've tried using str_count from the stringr package and also tried a few different functions located here - Count the number of overlapping substrings within a string
I can't seem to find a method/function to get this to output what I want (a simple count of the number of occurrences).
If anyone could provide any insight/help it would be greatly appreciated.
It would be easier to keep these as numbers. Here I use scan() to turn your string into a numeric vector, compare it with 10 to get a logical vector indicating whether each number is less than 10, and then call rle() on that to calculate run lengths.
x <- "6 12 18 7 17 4 16 11 20 15 3 5 19 10 8 9 1 14 13 19 11 16 18 4 17 7 6 12 14 1 10 13 20 15 3 5 8 9"
rr <- rle(scan(text=x)<10)
Now I can mangle this into a data.frame and see which runs were longer than 2
subset(as.data.frame(unclass(rr)), values & lengths>2)
# lengths values
# 9 3 TRUE
# 17 4 TRUE
So we can see that we had a run of 3 and a run of 4.
I could clean this up by defining a function to turn the rle into a data.frame more easily and track the starting indexes
as.data.frame.rle <- function(x, ...) {
  # use the argument x (not the global rr) so the method works for any rle object
  data.frame(unclass(x), start=head(cumsum(c(0,x$lengths))+1,-1))
}
and can then run
subset(as.data.frame(rle(scan(text=x)<10)), values & lengths>2)
# lengths values start
# 9 3 TRUE 15
# 17 4 TRUE 35
so we can see those runs start at positions 15 and 35.
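Since the question asks for a simple count, the rle result can be reduced to a single number. A minimal sketch, keeping the answer's reading of "between 1 and 10" as strictly less than 10 (the variable names small and n_runs are made up here):

```r
x <- "6 12 18 7 17 4 16 11 20 15 3 5 19 10 8 9 1 14 13 19 11 16 18 4 17 7 6 12 14 1 10 13 20 15 3 5 8 9"

# TRUE where the number is below 10
small <- scan(text = x, quiet = TRUE) < 10
rr <- rle(small)

# Number of runs of "small" numbers that are at least 3 long
n_runs <- sum(rr$values & rr$lengths >= 3)
n_runs

# Breakdown of those runs by length
table(rr$lengths[rr$values & rr$lengths >= 3])
```

Changing the threshold 3 to 4 or 5 counts only the longer groups.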

How to make data in a single column (long) with multiple, nested group categories wide

I've got a mess of data and am trying to efficiently wrangle it into shape. Here's a simplified short sample of the general format of my data.frame right now. The main difference is that I have a few more data labels like Label1 for my sampling units - each has a set of data similar to the data.frame I'm including but in my situation they are all in the same data.frame. I don't think that will complicate the reformatting so I've just included the single sampling unit of mock data here. StatsType levels Ave, Max, and Min are effectively nested within MeasureType.
tastycheez<-data.frame(
Day=rep((1:3),9),
StatsType=rep(c(rep("Ave",3),rep("Max",3),rep("Min",3)),3),
MeasureType=rep(c("Temp","H2O","Tastiness"),each=9),
Data_values=1:27,
Label1=rep("SamplingU1",27))
Ultimately, I would like a data frame where for each sampling unit and each Day there are columns holding the Data_values for my categories, like this:
Day Label1 Ave.Temp Ave.H2O Ave.Tastiness Max.Temp ...
1 SamplingU1 1 10 19 4 ...
2 SamplingU1 2 11 20 5 ...
I think some combination of functions from reshape,dplyr,tidyr, and/or data.table could do the job but I can't figure out how to code it. Here's what I've tried:
First, I spread the tastycheez (yum!), and that got me partway:
test<-spread(tastycheez,StatsType,Data_values)
Now I'm trying to spread it again or to cast, but with no luck:
test2<-spread(test,MeasureType,(Ave,Max,Min))
test2 <- recast(Day ~ MeasureType+c(Ave,Max,Min), data=test)
(I also tried melting the tastycheez but the results were a sticky, gooey mess and my tongue got burnt. that doesn't seem to be the right function for this.)
If you hate my puns please excuse them, I really can't figure this out!
Here are a couple related questions:
Combining two subgroups of data in the same dataframe
How can I spread repeated measures of multiple variables into wide format?
With reshape2, you could use dcast:
library(reshape2)
dcast(tastycheez,
      Day + Label1 ~ paste(StatsType, MeasureType, sep="."),
      value.var = "Data_values")
which gives
Day Label1 Ave.H2O Ave.Tastiness Ave.Temp Max.H2O Max.Tastiness Max.Temp Min.H2O Min.Tastiness Min.Temp
1 1 SamplingU1 10 19 1 13 22 4 16 25 7
2 2 SamplingU1 11 20 2 14 23 5 17 26 8
3 3 SamplingU1 12 21 3 15 24 6 18 27 9
With tidyr (stealing @DavidArenburg's comment), here's the tidyr way:
library(tidyr)
tastycheez %>%
  unite(temp, StatsType, MeasureType, sep = ".") %>%
  spread(temp, Data_values)
which gives
Day Label1 Ave.H2O Ave.Tastiness Ave.Temp Max.H2O Max.Tastiness Max.Temp Min.H2O Min.Tastiness Min.Temp
1 1 SamplingU1 10 19 1 13 22 4 16 25 7
2 2 SamplingU1 11 20 2 14 23 5 17 26 8
3 3 SamplingU1 12 21 3 15 24 6 18 27 9
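In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), which can combine two key columns without a separate unite() step. A sketch of that variant, assuming a recent tidyr is installed (the result name wide is made up here):

```r
library(tidyr)

tastycheez <- data.frame(
  Day = rep(1:3, 9),
  StatsType = rep(c(rep("Ave", 3), rep("Max", 3), rep("Min", 3)), 3),
  MeasureType = rep(c("Temp", "H2O", "Tastiness"), each = 9),
  Data_values = 1:27,
  Label1 = rep("SamplingU1", 27)
)

# One row per Day/Label1; new column names are built as StatsType.MeasureType
wide <- pivot_wider(
  tastycheez,
  names_from  = c(StatsType, MeasureType),
  values_from = Data_values,
  names_sep   = "."
)
wide
```

The columns come out in order of first appearance rather than alphabetically, so the layout can differ from the dcast and spread output even though the values are the same.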

Creating a numerical variable order

I have a set of data with 3 columns: index column (with no name), colour, colour of seed, and germination time.
How do I create a numerical variable called 'order' with values 1 to 22 (the number of data sets)?
I don't know if I get you right, but simplest way would be:
> order <- c(1:22)
> order
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Now, if you run:
class(order)
you will get:
[1] "integer"
You can still access each element of order by its index, for example in a loop:
for(i in seq_along(order)){
  print(order[i])
}
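If the variable is meant to be a column of the existing data set rather than a free-standing vector, it can be derived from the number of rows. A sketch with a made-up 22-row data frame called seeds (the real column names are not given in the question):

```r
# Hypothetical stand-in for the 22-row data set
seeds <- data.frame(colour = rep(c("red", "white"), 11))

# Add a numeric column running from 1 to the number of rows
seeds$order <- seq_len(nrow(seeds))
head(seeds)
```

Using seq_len(nrow(...)) avoids hard-coding 22, so the column stays correct if rows are added later and the line is rerun.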

How to reorder a column in a data frame to be the last column

I have a data frame where columns are constantly being added to it. I also have a total column that I would like to stay at the end. I think I must have skipped over some really basic command somewhere but cannot seem to find the answer anywhere. Anyway, here is some sample data:
x=1:10
y=21:30
z=data.frame(x,y)
z$total=z$x+z$y
z$w=11:20
z$total=z$x+z$y+z$w
When I type z I get this:
x y total w
1 1 21 33 11
2 2 22 36 12
3 3 23 39 13
4 4 24 42 14
5 5 25 45 15
6 6 26 48 16
7 7 27 51 17
8 8 28 54 18
9 9 29 57 19
10 10 30 60 20
Note how the total column comes before the w, and obviously any subsequent columns. Is there a way I can force it to be the last column? I am guessing that I would have to use ncol(z) somehow. Or maybe not.
You can reorder your columns as follows:
z <- z[,c('x','y','w','total')]
To do this programmatically, after you're done adding your columns, you can retrieve their names like so:
nms <- colnames(z)
Then you can grab the ones that aren't 'total' like so:
nms[nms!='total']
Combined with the above:
z <- z[, c(nms[nms!='total'],'total')]
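The same reordering can be wrapped in a small helper so it can be rerun after every batch of new columns. A sketch using setdiff(); the function name move_last is made up for illustration:

```r
z <- data.frame(x = 1:10, y = 21:30)
z$total <- z$x + z$y
z$w <- 11:20
z$total <- z$x + z$y + z$w

# Hypothetical helper: put column `col` at the end, keep the others in order
move_last <- function(df, col) {
  df[, c(setdiff(names(df), col), col), drop = FALSE]
}

z <- move_last(z, "total")
names(z)
```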
You have a logic issue here. Whenever you add to a data.frame, it grows to the right.
Easiest fix: keep total a vector until you are done, and only then append it. It will then be the rightmost column.
(For critical applications, you would of course determine your width k beforehand, allocate k+1 columns and just index the last one for totals.)
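A sketch of the "append the total last" idea, assuming every column is numeric and should be included in the sum; rowSums() is one way to compute it, not necessarily what the answer had in mind:

```r
z <- data.frame(x = 1:10, y = 21:30)
z$w <- 11:20             # add as many columns as needed first

# Compute the total only once all other columns exist, so it lands on the right
z$total <- rowSums(z)
names(z)
```

If more columns arrive later, drop total (z$total <- NULL), add them, and recompute it.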
