ddply type functionality on multiple dataframes - r

I have two dataframes that are structured as follows:
Dataframe A:
id sqft traf month
1 1030 16 35 1
1 1030 15 32 2
2 1027 1 31 1
2 1027 2 31 2
Dataframe B:
id price frequency month day
1 1030 8 196 1 1
2 1030 9 101 1 15
3 1030 10 156 1 30
4 1030 3 137 2 1
5 1030 7 190 2 15
6 1027 6 188 1 1
7 1027 1 198 1 15
8 1027 2 123 1 30
9 1027 4 185 2 1
10 1027 5 122 2 15
I want to output certain types of summary statistics (centered around each unique ID) from both of these dataframes. This would be easy with ddply if, say, I wanted the mean price for each ID for each month (split by id and month) from Dataframe B, or if I wanted the average ratio of sqft to traf for each id (split by id).
But what would be a potential solution if I wanted to make combined variables from both dataframes? For instance, how would I get the average price for each id/month (Dataframe B) divided by sqft for each id/month?
The dataframes are measured at different frequencies, which makes combining them difficult. The only solution I've found so far is to ddply the first dataframe to extract average sqft/id/month and then pass that value into a second ddply call on the second dataframe.
Is there a more efficient/less convoluted way to do this? I would be splitting both dataframes on the same variables (id and month).
Thanks in advance for any suggestions!

In the case of the sample data, you could merge the two data sets like this (by specifying all.y = TRUE you can make sure that all rows of dfb are kept and, in this case, corresponding entries of dfa are repeated accordingly)
dfall <- merge(dfa, dfb, by = c("id", "month"), all.y=TRUE)
# id month sqft traf price frequency day
#1 1027 1 1 31 6 188 1
#2 1027 1 1 31 1 198 15
#3 1027 1 1 31 2 123 30
#4 1027 2 2 31 4 185 1
#5 1027 2 2 31 5 122 15
#6 1030 1 16 35 8 196 1
#7 1030 1 16 35 9 101 15
#8 1030 1 16 35 10 156 30
#9 1030 2 15 32 3 137 1
#10 1030 2 15 32 7 190 15
Then, you can use ddply as usual:
library(plyr)
ddply(dfall, .(id, month), mutate, newcol = mean(price)/sqft)
# id month sqft traf price frequency day newcol
#1 1027 1 1 31 6 188 1 3.0000000
#2 1027 1 1 31 1 198 15 3.0000000
#3 1027 1 1 31 2 123 30 3.0000000
#4 1027 2 2 31 4 185 1 2.2500000
#5 1027 2 2 31 5 122 15 2.2500000
#6 1030 1 16 35 8 196 1 0.5625000
#7 1030 1 16 35 9 101 15 0.5625000
#8 1030 1 16 35 10 156 30 0.5625000
#9 1030 2 15 32 3 137 1 0.3333333
#10 1030 2 15 32 7 190 15 0.3333333
Edit: if you're looking for better performance, consider using dplyr instead of plyr. The equivalent dplyr code (including the merge) is:
library(dplyr)
dfall <- dfb %>%
left_join(., dfa, by = c("id", "month")) %>%
group_by(id, month) %>%
dplyr::mutate(newcol = mean(price)/sqft) # I added dplyr:: to avoid confusion with plyr::mutate
Of course, you could also check out data.table which is also very efficient.
AFAIK ddply is not designed to be used with different data frames at the same time.
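As a rough illustration of the data.table route mentioned above, the following sketch rebuilds the sample data and applies the same merge-then-group logic. The object names dfa, dfb and dfall follow the answer; the column name newcol and the use of all.x = TRUE to keep every row of dfb are assumptions carried over from the plyr example, and no performance claim is implied.

```r
library(data.table)

# sample data from the question
dfa <- data.table(id    = c(1030, 1030, 1027, 1027),
                  sqft  = c(16, 15, 1, 2),
                  traf  = c(35, 32, 31, 31),
                  month = c(1, 2, 1, 2))
dfb <- data.table(id        = rep(c(1030, 1027), each = 5),
                  price     = c(8, 9, 10, 3, 7, 6, 1, 2, 4, 5),
                  frequency = c(196, 101, 156, 137, 190, 188, 198, 123, 185, 122),
                  month     = rep(c(1, 1, 1, 2, 2), times = 2),
                  day       = rep(c(1, 15, 30, 1, 15), times = 2))

# keep every row of dfb and attach the matching dfa columns
dfall <- merge(dfb, dfa, by = c("id", "month"), all.x = TRUE)

# mean price per id/month divided by sqft, added by reference
dfall[, newcol := mean(price) / sqft, by = .(id, month)]
dfall
```

Because := adds the column by reference, dfall keeps all ten rows, as in the mutate versions above.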

dplyr does well here. This code merges the data frames, gets price and sqft means by unique id/month combination, then creates a new variable pricePerSqft.
require(dplyr)
dfa %>%
left_join(dfb, by = c("id", "month")) %>%
group_by(id, month) %>%
summarize(
avgPrice = mean(price),
avgSqft = mean(sqft)) %>%
mutate(pricePerSqft = round(avgPrice / avgSqft, 2))
Here's the result:
id month avgPrice avgSqft pricePerSqft
1 1027 1 3.0 1 3.00
2 1027 2 4.5 2 2.25
3 1030 1 9.0 16 0.56
4 1030 2 5.0 15 0.33

Related

r group by date difference with respect to first date

I have a dataset that looks like this.
Id Date1 Cars
1 2007-04-05 72
2 2014-01-07 12
2 2018-07-09 10
2 2018-07-09 13
3 2005-11-19 22
3 2005-11-23 13
4 2010-06-17 38
4 2010-09-23 57
4 2010-09-23 41
4 2010-10-04 17
What I would like to do is for each Id get the date difference with respect to the 1st Date (Earliest) date for that Id. For each Id, (EarliestDate - 2nd Earliest Date), (EarliestDate - 3rd Earliest Date), (Earliest Date - 4th Earliest Date) ... so on.
I would end up with a dataset like this
Id Date1 Cars Diff
1 2007-04-05 72 NA
2 2014-01-07 12 NA
2 2018-07-09 10 1644 = (2018-07-09 - 2014-01-07)
2 2018-07-09 13 1644 = (2018-07-09 - 2014-01-07)
3 2005-11-19 22 NA
3 2005-11-23 13 4 = (2005-11-23 - 2005-11-19)
4 2010-06-17 38 NA
4 2010-09-23 57 98 = (2010-09-23 - 2010-06-17)
4 2010-09-23 41 98 = (2010-09-23 - 2010-06-17)
4 2010-10-04 17 109 = (2010-10-04 - 2010-06-17)
I am unclear on how to accomplish this. Any help would be much appreciated. Thanks
Change Date1 to date class.
df$Date1 = as.Date(df$Date1)
You can subtract with the first value in each Id. This can be done using dplyr.
library(dplyr)
df %>% group_by(Id) %>% mutate(Diff = as.integer(Date1 - first(Date1)))
# Id Date1 Cars Diff
# <int> <date> <int> <int>
# 1 1 2007-04-05 72 0
# 2 2 2014-01-07 12 0
# 3 2 2018-07-09 10 1644
# 4 2 2018-07-09 13 1644
# 5 3 2005-11-19 22 0
# 6 3 2005-11-23 13 4
# 7 4 2010-06-17 38 0
# 8 4 2010-09-23 57 98
# 9 4 2010-09-23 41 98
#10 4 2010-10-04 17 109
Using data.table:
library(data.table)
setDT(df)[, Diff := as.integer(Date1 - first(Date1)), by = Id]
Or in base R:
df$Diff <- with(df, ave(as.integer(Date1), Id, FUN = function(x) x - x[1]))
Replace the 0s with NA if you want the output exactly as shown in the question.
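The last step, turning the zeros into NA, could look like the base R sketch below. It rebuilds the sample data from the question and assumes the rows are already sorted by date within each Id, so that x[1] is the earliest date. Note that it blanks every zero, so a later row sharing the earliest date would also become NA.

```r
# sample data from the question
df <- data.frame(
  Id    = c(1, 2, 2, 2, 3, 3, 4, 4, 4, 4),
  Date1 = as.Date(c("2007-04-05", "2014-01-07", "2018-07-09", "2018-07-09",
                    "2005-11-19", "2005-11-23", "2010-06-17", "2010-09-23",
                    "2010-09-23", "2010-10-04")),
  Cars  = c(72, 12, 10, 13, 22, 13, 38, 57, 41, 17)
)

# days since the first date within each Id
df$Diff <- with(df, ave(as.integer(Date1), Id, FUN = function(x) x - x[1]))

# a difference of 0 marks the earliest date; show it as NA instead
df$Diff[df$Diff == 0] <- NA
df
```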

Replace value with the mean based on two classes

I have a dataset with 2 calendar variables (Week & Hour) and 1 Amount variable:
Week Hour Amount
35 1 367
35 2 912
36 1 813
36 2 482
37 1 112
37 2 155
35 1 182
35 2 912
36 1 551
36 2 928
37 1 125
37 2 676
I wish to replace each value of Amount with the mean over all observations with the same Week/Hour pair. For instance, here there are 2 obs. for (Week=35, Hour=1), with Amount values of 367 and 182. Hence, for this example, the 2 rows with (Week=35, Hour=1) should have the Amount replaced with mean(c(367,182)). The final output should be:
Week Hour Amount
35 1 274.5
35 2 912.0
36 1 682.0
36 2 705.0
37 1 118.5
37 2 415.5
35 1 274.5
35 2 912.0
36 1 682.0
36 2 705.0
37 1 118.5
37 2 415.5
I have the following code that solves this issue. However, for the complete dataset with thousands of rows, it is very slow. Is there any way to automatically replace the values with these paired means?
dataset = data.frame(Week=c(35,35,36,36,37,37,35,35,36,36,37,37),
Hour = c(1,2,1,2,1,2,1,2,1,2,1,2),
Amount = c(367,912,813,482,112,155,182,912,551,928,125,676))
means <- reshape2::dcast(dataset, Week~Hour, value.var="Amount", mean) # value.var must name an existing column (Amount)
for (i in 1:nrow(dataset)) {
print(i)
dataset$Amount[i] <- means[means$Week==dataset$Week[i],which(colnames(means)==dataset$Hour[i])]
}
Possible solution with dplyr:
dataset %>%
group_by(Week, Hour) %>%
summarise(mean_amount = mean(Amount))
You group by Week and Hour and calculate the mean based on this condition.
EDIT
To keep the original structure (number of rows) alter the code to
dataset %>%
group_by(Week, Hour) %>%
mutate(Amount = mean(Amount))
If the idea is just to get the mean Amount by Week and Hour, this would work:
aggregate(Amount ~ ., dataset, mean)
Week Hour Amount
1 35 1 274.5
2 36 1 682.0
3 37 1 118.5
4 35 2 912.0
5 36 2 705.0
6 37 2 415.5
EDIT:
If, however, the idea is to put the averages back into the dataset, then this should work:
x <- aggregate(Amount ~ ., dataset, mean)
dataset$Amount <- x$Amount[match(apply(dataset[,1:2], 1, paste0, collapse = " "),
apply(x[,1:2], 1, paste0, collapse = " "))]
dataset
Week Hour Amount
1 35 1 274.5
2 35 2 912.0
3 36 1 682.0
4 36 2 705.0
5 37 1 118.5
6 37 2 415.5
7 35 1 274.5
8 35 2 912.0
9 36 1 682.0
10 36 2 705.0
11 37 1 118.5
12 37 2 415.5
Explanation:
This pastes the first two columns of each row together into a string (using apply), both for the means dataframe x and for dataset. It then uses match on these strings to assign each mean to the corresponding rows in dataset.
EDIT 2:
Alternatively, you can use interaction together with match for this transformation:
dataset$Amount <- x$Amount[match(interaction(dataset[,1:2]), interaction(x[,1:2]))]
# note: %in% is not a substitute for match() here; it returns a logical mask over the rows of x,
# so the means would no longer line up with the rows of dataset
Base R solution:
dataset$Amount <- with(dataset, ave(Amount, Week, Hour, FUN = mean))
Data:
dataset = data.frame(Week=c(35,35,36,36,37,37,35,35,36,36,37,37),
Hour = c(1,2,1,2,1,2,1,2,1,2,1,2),
Amount = c(367,912,813,482,112,155,182,912,551,928,125,676))
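As a small sanity check on the base R routes above, the sketch below computes the per-row group means twice, once with ave and once with aggregate plus match on a pasted Week/Hour key, and confirms that both agree. The helper names by_ave and by_match are illustrative only.

```r
dataset <- data.frame(Week   = c(35, 35, 36, 36, 37, 37, 35, 35, 36, 36, 37, 37),
                      Hour   = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
                      Amount = c(367, 912, 813, 482, 112, 155, 182, 912, 551, 928, 125, 676))

# group mean for every row, returned in the original row order
by_ave <- with(dataset, ave(Amount, Week, Hour, FUN = mean))

# same result via a table of means and a lookup on a combined key
x <- aggregate(Amount ~ Week + Hour, dataset, mean)
by_match <- x$Amount[match(paste(dataset$Week, dataset$Hour),
                           paste(x$Week, x$Hour))]

stopifnot(isTRUE(all.equal(by_ave, by_match)))
dataset$Amount <- by_ave
dataset
```

match returns, for each row of dataset, the position of its key in x, which is what keeps the means aligned with the original rows.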

How to create a data frame with all ordinal variables as columns and with frequencies of specific event

I have an ordinal data frame which has answers in the survey format. I want to convert each factor into a possible column so as to get them by frequencies of a specific event.
I have tried lapply, dplyr to get frequencies but failed
as.data.frame(apply(mtfinal, 2, table))
and
mtfinalf<-mtfinal %>%
group_by(q28) %>%
summarise(freq=n())
Expected Results in the form of data.frame
Frequency table with respect to q28's factors
q28 sex1 sex2 race1 race2 race3 race4 race5 race6 race7 age1 age2
2 0
3 0
4 23
5 21
Actual Results
$age
1 2 3 4 5 6 7
6 2 184 520 507 393 170
$sex
1 2
1239 543
$grade
1 2 3 4
561 519 425 277
$race7
1 2 3 4 5 6
179 21 27 140 17 1307
7
91
$q8
1 2 3 4 5
127 259 356 501 539
$q9
1 2 3 4 5
993 224 279 86 200
$q28
2 3 4 5
1034 533 94 121
This will give you a count of each unique combination of q28, age, race7 and sex. What you are asking is impossible since there would be overlaps between levels of sex, race and age.
library(dplyr)
mtfinalf<-mtfinal %>%
group_by(q28,age,race7,sex) %>% # race7 is the column name shown in the question's output
tally()
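If the goal is instead a set of separate two-way counts placed side by side (each variable cross-tabulated against q28 on its own, rather than joint combinations), a base R sketch along the following lines may come close to the layout in the question. The toy data frame is hypothetical and only stands in for mtfinal, whose contents are not shown; vars, tabs and result are illustrative names.

```r
# hypothetical toy data standing in for mtfinal
mtfinal <- data.frame(q28 = c(2, 2, 3, 4, 5, 5),
                      sex = c(1, 2, 1, 1, 2, 2),
                      age = c(3, 4, 4, 5, 3, 3))

vars <- c("sex", "age")  # variables to cross-tabulate against q28

tabs <- lapply(vars, function(v) {
  tab <- table(mtfinal$q28, mtfinal[[v]])     # rows: q28 levels, columns: levels of v
  colnames(tab) <- paste0(v, colnames(tab))   # e.g. sex1, sex2
  as.data.frame.matrix(tab)                   # wide data frame of counts
})

# one row per q28 level, one column per variable level
result <- cbind(q28 = rownames(tabs[[1]]), do.call(cbind, tabs))
result
```

Each block of columns sums to the number of respondents in that q28 row, so the same respondent is counted once per variable; that is the overlap the answer refers to.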

Dividing proportionally row values based on common identifier and specific column in a data frame

After a merging process, I got a data frame that looks like:
df <- data.frame(trip=c(315,328,422,422,458,652,652,652,699),
catch_kg=c(10,8,12,2,26,4,18,14,11),
age_1=c(0,0,0,0,0,0,0,0,0),
age_2=c(2,1,7.5,7.5,8,11,11,11,13),
id=c(1,2,3,3,4,5,5,5,6))
trip catch_kg age_1 age_2 id
315 10 0 2 1
328 8 0 1 2
422 12 0 7.5 3
422 2 0 7.5 3
458 26 0 8 4
652 4 0 11 5
652 18 0 11 5
652 14 0 11 5
699 11 0 13 6
where trip represents the fishing trip, catch_kg the amount of caught fish (in kg), age_1 & age_2 are the numbers of individuals in each trip per age group, and id represents the haul identity in each trip.
In some fishing trips I have more than 1 haul - this can be accessed in the id column, where trips with more than 1 haul have the same id number. For instance: trip number 422 has two hauls (id=3).
At the moment, for a trip with more than 1 haul, the number of individuals within each age group is divided equally among the hauls of that trip. For example, in trip 422 I have a total of 15 individuals, but since there are 2 hauls, this number was divided by 2, leading to 7.5 individuals per haul.
What I would like, however, is to compute the number of individuals within each age group as a proportion of the total catch in each haul group.
Thus, at the end I would like to have a data frame that looks like:
trip catch_kg age_1 age_2 id
315 10 0 2 1
328 8 0 1 2
422 12 0 13 3
422 2 0 2 3
458 26 0 8 4
652 4 0 4 5
652 18 0 16 5
652 14 0 13 5
699 11 0 13 6
This is basically a rule of three calculation, where for trip 422 (2 hauls), for instance, I would have the following calculation:
haul1: 12*(7.5 + 7.5)/(12 + 2) = 13 individuals
haul2: 2*(7.5 + 7.5)/(12 + 2) = 2 individuals
Is there an easy way to compute these calculations?
Any help would be much appreciated.
-M
You could use dplyr to help with this
library(dplyr)
df %>% group_by(trip) %>%
mutate(age_2=catch_kg/sum(catch_kg)*sum(age_2))
# trip catch_kg age_1 age_2 id
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 315 10 0 2.000000 1
# 2 328 8 0 1.000000 2
# 3 422 12 0 12.857143 3
# 4 422 2 0 2.142857 3
# 5 458 26 0 8.000000 4
# 6 652 4 0 3.666667 5
# 7 652 18 0 16.500000 5
# 8 652 14 0 12.833333 5
# 9 699 11 0 13.000000 6
Not sure exactly what rounding rule you were using to get to integer counts of individuals, but you'd likely run into trouble with parts not adding up to wholes in more complicated scenarios.
Another solution using data.table:
library(data.table)
setDT(df)
df[, age_2 := catch_kg * sum(age_2) / sum(catch_kg), trip]
# trip catch_kg age_1 age_2 id
#1: 315 10 0 2.000000 1
#2: 328 8 0 1.000000 2
#3: 422 12 0 12.857143 3
#4: 422 2 0 2.142857 3
#5: 458 26 0 8.000000 4
#6: 652 4 0 3.666667 5
#7: 652 18 0 16.500000 5
#8: 652 14 0 12.833333 5
#9: 699 11 0 13.000000 6
If you want you can round age_2 with round(): age_2 := round(catch_kg * sum(age_2) / sum(catch_kg))
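For completeness, the same proportional split can be sketched in base R with ave, without extra packages. The intermediate names trip_age and trip_catch are illustrative, and the result is left unrounded.

```r
df <- data.frame(trip     = c(315, 328, 422, 422, 458, 652, 652, 652, 699),
                 catch_kg = c(10, 8, 12, 2, 26, 4, 18, 14, 11),
                 age_1    = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
                 age_2    = c(2, 1, 7.5, 7.5, 8, 11, 11, 11, 13),
                 id       = c(1, 2, 3, 3, 4, 5, 5, 5, 6))

# per-trip totals, repeated on every row of the trip
trip_age   <- ave(df$age_2, df$trip, FUN = sum)
trip_catch <- ave(df$catch_kg, df$trip, FUN = sum)

# each haul gets a share of the trip's individuals proportional to its catch
df$age_2 <- df$catch_kg * trip_age / trip_catch
df
```

The per-trip totals of age_2 are unchanged by this step (15 for trip 422, 33 for trip 652); only their distribution across hauls changes.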

Rank function to rank multiple variables in R

I am trying to rank multiple numeric variables ( around 700+ variables) in the data and am not sure exactly how to do this as I am still pretty new to using R.
I do not want to overwrite the ranked values in the same variable and hence need to create a new rank variable for each of these numeric variables.
From reading the posts, I believe assign and transform function along with rank maybe able to solve this. I tried implementing as below ( sample data and code) and am struggling to get it to work.
In addition to the variables xcount, xvisit and ysales, the output dataset needs to contain the variables xcount_rank, xvisit_rank and ysales_rank holding the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it creates the rank values as intervals such as (101, 230], (230, 450] etc., whereas I would like the rank variable to be populated with 1, 2, etc., up to 10 categories as per the splits I did. Is there any way to achieve this?
input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
Would like to see the records in the group they would fall under if I try to rank the interval values.
Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
input[nm1] <- mutate_each(input[2:4],funs(rank(., ties.method="first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2
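mutate_each() and funs() are deprecated in current dplyr releases, so the calls above may emit warnings. A dependency-free alternative is sketched below in base R; it reuses the nm1 naming from the answer and the second sample data set from the question, and is meant as an equivalent illustration rather than the answer's own code.

```r
input <- read.table(header = FALSE, text = "101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id", "xcount", "xvisit", "ysales")

nm1 <- paste("rank", names(input)[2:4], sep = "_")

# labels = FALSE makes cut() return the bin number instead of an interval label
input[nm1] <- lapply(input[2:4], function(x) cut(x, breaks = 3, labels = FALSE))
input
```

Swapping the anonymous function for function(x) rank(x, ties.method = "first") gives plain ranks instead of bin numbers.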
