Creating Mean Function for Subset in R

I'm trying to create a function that will take a few parameters and return the total average hourly return. My data set looks like this:
Location Time units
1 Columbus 3:35 12
2 Columbus 3:58 199
3 Chicago 6:10 -45
4 Chicago 6:19 87
5 Detroit 12:05 -200
6 Detroit 0:32 11
What I would like returned would be
Location Time units unitsph
Columbus 7:33 211 27.9
Chicago 12:29 42 3.4
Detroit 12:37 -189 -15.1
while also retaining the other columns;
basically I want total units produced and units per hour.
I tried out
thing <- time %>% group_by(Location) %>% summarize(sum(units))
which returned locations and total units but not units per hour. Then I moved to
thing <- time %>% group_by(Location) %>% summarize(sum(units)) %>% summarize(sum(Time))
which returned
Error in eval(expr, envir, enclos) : object 'Time' not found
I also tried mutate but to no effect:
fin <- mutate(time, as.numeric(sum(Time))/as.numeric(sum(units)))
Error in Summary.factor(c(118L, 131L, 174L, 178L, 57L), na.rm = FALSE) :
‘sum’ not meaningful for factors
Any help here much appreciated. I also have a few other columns that I'd like to retain (they're geocodes for the locations etc), but didn't list those here. If that's important I can add back in.

Your Time is a string object. You can use
library(dplyr)

data <- data.frame(loc = c("C","C","D","D"),
                   time = c("1:22","1:23","1:24","1:25"),
                   u = c(1,2,3,4))
basetime <- strptime("00:00", "%H:%M")
data$in.hours <- as.double(strptime(data$time, "%H:%M") - basetime)
thing <- data %>% group_by(loc) %>% summarize(sum(u), sum(in.hours))
The conversion into hours is not exactly beautiful: it first turns the time into a POSIXlt object in order to convert the difference to a double. But it works.
The converted data
loc time u in.hours
1 C 1:22 1 1.366667
2 C 1:23 2 1.383333
3 D 1:24 3 1.400000
4 D 1:25 4 1.416667
so 1.366667 means 1 h plus 0.366667 h = 22 min, i.e. 1:22.
The final result is then
loc sum(u) sum(in.hours)
(fctr) (dbl) (dbl)
1 C 3 2.750000
2 D 7 2.816667
hence for C you have 2 hours and 0.75 * 60 = 45 minutes.
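This stops at the two sums; to get all the way to units per hour, divide one aggregate by the other. A minimal sketch building on the same toy data (the names total_u, total_hours and unitsph are mine, not from the question):
library(dplyr)

thing <- data %>%
  group_by(loc) %>%
  summarize(total_u = sum(u),
            total_hours = sum(in.hours)) %>%
  mutate(unitsph = total_u / total_hours)
For C this gives 3 / 2.75, roughly 1.09 units per hour.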

I ended up taking part of what @CAFEBABE recommended and modifying it.
I used
mutated_time <- time %>%
  group_by(Location) %>%
  summarize(play = sum(as.numeric(Time) / 60),
            unitsph = sum(units))
and that plus
selektor <- as.data.frame(select(distinct(mutated_time), Location, unitsph))
got me where I wanted to go. Thank you all for the many helpful comments.

How to add rows to dataframe R with rbind

I know this is a classic question and there are also similar ones in the archive, but I feel like the answers did not really apply to this case. Basically, I want to take one dataframe (covid cases in Berlin per district), calculate the sums of the columns, and create a new dataframe with one column for the name of the district and another for the total number of cases. So I wrote
covid_bln <- read.csv('https://www.berlin.de/lageso/gesundheit/infektionsepidemiologie-infektionsschutz/corona/tabelle-bezirke-gesamtuebersicht/index.php/index/all.csv?q=', sep=';')
c_tot<-data.frame('district'=c(), 'number'=c())
for (n in colnames(covid_bln[3:14])){
  x <- data.frame('district' = c(n), 'number' = c(sum(covid_bln$n)))
  c_tot <- rbind(c_tot, x)
  next
}
print(c_tot)
This works properly for the names, but the number column contains the count for the 8th district repeated for all the districts. If you have any suggestion, even involving the use of other functions, it would be great. Thank you
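A quick note on why the loop misbehaves: covid_bln$n does not use the loop variable n. The $ operator on a data frame does partial name matching, so $n most likely resolves to the only column whose name starts with "n" (neukoelln, the 8th district), which is why that district's total is repeated. Indexing with [[ ]] uses the string stored in n instead; a minimal fix of the original loop would be:
for (n in colnames(covid_bln[3:14])){
  x <- data.frame('district' = n, 'number' = sum(covid_bln[[n]]))
  c_tot <- rbind(c_tot, x)
}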
Here's a base R solution:
number <- colSums(covid_bln[3:14])
district <- names(covid_bln[3:14])
c_tot <- cbind.data.frame(district, number)
# If you don't want rownames:
rownames(c_tot) <- NULL
This gives us:
district number
1 mitte 16030
2 friedrichshain_kreuzberg 10679
3 pankow 10849
4 charlottenburg_wilmersdorf 10664
5 spandau 9450
6 steglitz_zehlendorf 9218
7 tempelhof_schoeneberg 12624
8 neukoelln 14922
9 treptow_koepenick 6760
10 marzahn_hellersdorf 6960
11 lichtenberg 7601
12 reinickendorf 9752
I want to provide a solution using the tidyverse.
The final result is ordered alphabetically by district.
library(dplyr)
library(tidyr)

c_tot <- covid_bln %>%
  select(mitte:reinickendorf) %>%
  gather(district, number, mitte:reinickendorf) %>%
  group_by(district) %>%
  summarise(number = sum(number))
The result is
# A tibble: 12 x 2
district number
* <chr> <int>
1 charlottenburg_wilmersdorf 10736
2 friedrichshain_kreuzberg 10698
3 lichtenberg 7644
4 marzahn_hellersdorf 7000
5 mitte 16064
6 neukoelln 14982
7 pankow 10885
8 reinickendorf 9784
9 spandau 9486
10 steglitz_zehlendorf 9236
11 tempelhof_schoeneberg 12656
12 treptow_koepenick 6788
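As an aside, gather() has since been superseded in tidyr; with a current tidyverse the same pipeline would look roughly like this, with pivot_longer() in place of gather() (same column names assumed):
library(dplyr)
library(tidyr)

c_tot <- covid_bln %>%
  pivot_longer(mitte:reinickendorf,
               names_to = "district", values_to = "number") %>%
  group_by(district) %>%
  summarise(number = sum(number))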

I need to find the mean for the data with cells without values

I need to find the average prices for all the different weeks, and make a ggplot to show how the price develops during the year.
When you compute the mean, how do the empty cells affect it?
I have tried several things, including the melt() function, so that I only have 3 variables. The variables are factors, and I want to find the mean of value.
Company variable value
ns Price week 24 1749
ns Price week 24
ns Price week 24 1599
ns Price week 24
ns Price week 24
ns Price week 24 359
ns Price week 24 460
I have more than 300K obs, and would love to have a small data.frame with only the Company and the mean price for each week. Right now I have all observations for each week, and I need the weekly means to use ggplot.
When I use the following code
dat %>% mutate(means = mean(value), na.rm = TRUE)
I got a warning message saying the argument is not numeric or logical: returning NA.
I am looking forward to getting your help!
Clean code from PavoDive's comment
dt[!is.na(value), mean(value), by = .(Price, Week)]
and even better
dt[ , mean(value, na.rm = TRUE), by = .(Price, Week)]
Original:
This works using data.table. The first part filters out rows that don't have a number in value. Next we say we want the average of the value column. Finally, the by defines how to group the rows.
Code:
dt[value > 0 | value < 1, .(MeanValues = mean(value)), by = c("Price", "Week")][]
Input:
library(data.table)

dt <- data.table(Price = c("A","B","B","A","A","B","B","A"),
                 Week  = c(1,2,1,1,2,2,1,2),
                 value = c(3,7,2,NA,1,46,1,NA))
Price Week value
1: A 1 3
2: B 2 7
3: B 1 2
4: A 1 NA
5: A 2 1
6: B 2 46
7: B 1 1
8: A 2 NA
Output:
   Price Week MeanValues
1:     A    1        3.0
2:     B    2       26.5
3:     B    1        1.5
4:     A    2        1.0
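Since the question started from dplyr, here is a rough dplyr equivalent of the same aggregation (assuming the dt above with columns Price, Week and value):
library(dplyr)

dt %>%
  group_by(Price, Week) %>%
  summarise(MeanValues = mean(value, na.rm = TRUE))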

Subset data frame by ID but within 7 days

I have a data frame with two variables, ID and arrival. Here is the head of my data frame:
head(sun_2)
Source: local data frame [6 x 2]
ID arrival
(chr) (date)
1 027506905 01.01.15
2 042363988 01.01.15
3 026050529 01.01.15
4 028375072 01.01.15
5 055384859 01.01.15
6 026934233 01.01.15
How could I subset the data by ID to arrivals within 7 days?
So like a lot of the other folks were saying, without more information (for example, what the original observation looks like) we can't get at exactly what your issue is without making some assumptions.
I assume that you have a column of data that indicates the original date, and that these columns are formatted with as.Date.
# generate data
Data <- data.frame(
  ID = as.character(1394:2394),
  arrival = sample(seq(as.Date('2015/01/01'), as.Date('2016/01/01'), by = 'day'),
                   1001, replace = TRUE))

# make the "original observation" variable
delta_times <- sample(3:10, 1001, replace = TRUE)
Data$First <- Data$arrival - delta_times
This gives me a data set that looks like this:
ID arrival First
1 1394 2015-11-06 2015-10-28
2 1395 2015-08-04 2015-07-26
3 1396 2015-04-19 2015-04-16
4 1397 2015-05-13 2015-05-03
5 1398 2015-07-18 2015-07-11
6 1399 2015-01-08 2015-01-03
If that is the case then the solution is to use difftime, like so:
# now we need to make a subsetting variable
Data$diff_times <- difftime(Data$arrival, Data$First, units = "days")

within_7 <- subset(Data, diff_times <= 7)
max(within_7$diff_times)
Time difference of 7 days
It's a bit difficult to be sure given the information you've provided, but I think you could do it like this:
library(dplyr)
dt %>% group_by(ID) %>% filter(arrival < min(arrival) + 7)
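Note that in the question's printout arrival looks like a character or factor ("01.01.15"), so it needs converting before date arithmetic will work. A sketch against the question's sun_2, with the format string guessed from the printout:
library(dplyr)

sun_2 %>%
  mutate(arrival = as.Date(as.character(arrival), format = "%d.%m.%y")) %>%
  group_by(ID) %>%
  filter(arrival < min(arrival) + 7)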

How to re-order data in R, and create a new variable for the data?

I have been working with the CDC FluView dataset, retrieved by this code:
library(cdcfluview)
library(ggplot2)
usflu <- get_flu_data("national", "ilinet", years=1998:2015)
What I am trying to do is create a new week variable, call it "week_new", so that the WEEK variable from this dataset is reordered. I want to reorder it so that the first week is week number 30 of each year. For example, in 1998, instead of week 1 corresponding to the first week of that year, I would like week 30 to correspond to the first week, with every subsequent year on the same scale. I am also trying to create another new variable called "season", which simply puts each week into its corresponding flu season, say "1998-1999" for week 30 of 1998 through 1999, and so on.
I believe this involves a for loop and conditional statements, but I am not familiar with how to use these in R. I am new to programming and am learning Java and R at the same time, and have only worked with loops in Java so far.
Here is what I have tried so far, I think it's supposed to be something like this:
wk_num <- 1
for(i in nrow(usflu)){
  if(week == 31){
    wk_num <- 1
    wk_new[i] <- wk_num
    wk_num <- wk_num + 1
  }
  if(week < 53){
    season[i] <- paste(Yr[i], '-', Yr[i] + 1)
  }
  else{
  }
}
Any help is greatly appreciated and hopefully what I am asking makes sense. I am hoping to understand re-ordering for the future as I believe it will be an important tool for me to have at my disposal for coding in R.
Here's one way to accomplish this with the packages dplyr and tidyr:
library(dplyr)
library(tidyr)

usflu_df <- tbl_df(usflu)
usflu_df %>%
  complete(YEAR, WEEK) %>%
  filter(!(YEAR == 1998 & WEEK < 30)) %>%
  mutate(season = cumsum(WEEK == 30),
         season_nm = paste(1997 + season, 1998 + season, sep = "-")) %>%
  group_by(season) %>%
  mutate(new_wk = seq_along(season)) %>%
  select(YEAR, WEEK, new_wk, season, season_nm)
# YEAR WEEK new_wk season season_nm
# (int) (int) (int) (int) (chr)
# 1 1998 30 1 1 1998-1999
# 2 1998 31 2 1 1998-1999
# 3 1998 32 3 1 1998-1999
# 4 1998 33 4 1 1998-1999
# 5 1998 34 5 1 1998-1999
# 6 1998 35 6 1 1998-1999
# 7 1998 36 7 1 1998-1999
# 8 1998 37 8 1 1998-1999
# 9 1998 38 9 1 1998-1999
# 10 1998 39 10 1 1998-1999
Talking through this...
First, use tidyr::complete to turn implicit missing values into explicit missing values; the original data pulled back did not have all of the weeks for 1998. Next, filter out the irrelevant records from 1998, that is, anything before week 30, to make our lives easier. We then create two new variables, season and season_nm, via cumsum and a simple paste. season simply increments whenever it sees WEEK == 30, which is more robust than week arithmetic because some years have 53 weeks. We then group_by season so that we can seq_along season to create the new_wk variable.
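If the complete() step is not needed, the season labels alone can also be computed arithmetically; a rough base R sketch (season_start is my own name, and it assumes usflu has YEAR and WEEK columns):
# weeks before 30 belong to the season that started the previous year
usflu$season_start <- usflu$YEAR - (usflu$WEEK < 30)
usflu$season_nm <- paste(usflu$season_start, usflu$season_start + 1, sep = "-")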

How do you compare data from two experiments

I am often trying to measure percentage changes under two distinct scenarios/tests/periods.
An example dataset:
library(dplyr)
set.seed(11)
toy_dat <- data.frame(state = sample(state.name, 3, replace = FALSE),
                      experiment = c('control', 'measure'),
                      accuracy = sample(30:50, size = 6, replace = TRUE),
                      speed = sample(21:39, size = 6, replace = TRUE)) %>%
  arrange(state)
state experiment accuracy speed
1 Alabama measure 31 24
2 Alabama control 36 37
3 Indiana control 30 23
4 Indiana measure 31 38
5 Missouri control 50 29
6 Missouri measure 48 34
I then resort to writing something horrible like this:
result <- toy_dat %>% group_by(state) %>% arrange(experiment) %>%
  summarise(acc_delta = (accuracy[2] - accuracy[1]) / accuracy[1],
            speed_delta = (speed[2] - speed[1]) / speed[1])
However, the above solution does not scale at all once the number of measurables begins to grow. In addition, the code is very fragile in terms of the ordering.
I am very new to R. I was hoping that this is a common enough pattern that there are well-known (smarter) solutions to the problem.
I would greatly appreciate any help/pointers.
Just create your own custom function and use summarise_each in order to apply it to all the measurements at once (it doesn't matter how many measurements you have):
delta_fun <- function(x) diff(x)/x[1L]
toy_dat %>%
  group_by(state) %>%
  arrange(experiment) %>%
  summarise_each(funs(delta_fun), -experiment)
# Source: local data frame [3 x 3]
#
# state accuracy speed
# 1 Alabama -0.13888889 -0.3513514
# 2 Indiana 0.03333333 0.6521739
# 3 Missouri -0.04000000 0.1724138
Since you mentioned that you are new to R, here's another awesome package you can use in order to achieve the same effect:
library(data.table)
setDT(toy_dat)[order(experiment),
               lapply(.SD, delta_fun),
               .SDcols = -"experiment",
               by = state]
# state accuracy speed
# 1: Alabama -0.13888889 -0.3513514
# 2: Indiana 0.03333333 0.6521739
# 3: Missouri -0.04000000 0.1724138
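One caveat: summarise_each() and funs() have since been deprecated in dplyr. With a current dplyr the same idea is written with across(); a sketch reusing delta_fun from above:
library(dplyr)

toy_dat %>%
  group_by(state) %>%
  arrange(experiment, .by_group = TRUE) %>%
  summarise(across(c(accuracy, speed), delta_fun))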
