Using the dataset 'cars' in R, I would like to add a new column that contains the average of 'dist' for each group of rows sharing the same 'speed', i.e. with 'speed' used as a grouping variable.
So first I need 19 groups reflecting the unique speeds in cars$speed:
4 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25
For each of these 19 groups I would like to know the average dist, but only if at least one of the entries in the group meets a criterion (e.g. at least one dist value is above 20).
With the cars-dataset I would get something like this back for the cars with speed 4 to 12:
speed dist avr_dist_if_one_speed_is_above20
4 2 none
4 10 none
7 4 13
7 22 13
8 16 none
9 10 none
10 18 26
10 26 26
10 34 26
11 17 22.5
11 28 22.5
12 14 21.5
12 20 21.5
12 24 21.5
12 28 21.5
...
Since the 2 cars with speed 4 both have a dist below 20, I do not get an average for these two entries. For the cars with speed 7 I get an average dist of 13, since at least one car with speed 7 has a dist above 20.
For the cars with speed 8 and 9 I do not get an average, as both of these cars have a dist below 20. The cars with speed 10 should return an average of 26,
since two of the cars with speed 10 have a dist above 20.
For cars with speed 11 I get 22.5, and for cars with speed 12 I get 21.5.
The R code should calculate an average dist for all the remaining speed categories, as they all include cars with dist > 20.
This will do what you are looking for if I understand your question right.
library(dplyr)
cars %>%
group_by(speed) %>%
summarise(n = n(), avg_dist = ifelse(any(dist > 20), mean(dist, na.rm = TRUE), NA))
Try this:
library(dplyr)
cars %>%
group_by(speed) %>%
mutate(avr_dist_if_one_speed_is_above20 = ifelse(any(dist > 20), mean(dist), NA)) %>%
ungroup()
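A base R alternative (a sketch of the same logic): ave() applies a function within each speed group and recycles the result back to every row:
# ave() computes the group-wise value and repeats it for each row of the group
cars$avr_dist_if_one_speed_is_above20 <- ave(
  cars$dist, cars$speed,
  FUN = function(d) if (any(d > 20)) mean(d) else NA_real_
)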
I have a large dataset ~1M rows with, among others, a column that has a score for each customer record. The score is between 0 and 100.
What I'm trying to do is efficiently map the score to a rating using a rating table. Each customer receives a rating between 1 and 15 based on the customer's score.
library(dplyr)  # also provides tibble()
library(purrr)  # map()
library(tidyr)  # unnest()

# Generate Example Customer Data
set.seed(1)
n_customers <- 10
customer_df <-
tibble(id = c(1:n_customers),
score = sample(50:80, n_customers, replace = TRUE))
# Rating Map
rating_map <- tibble(
max = c(
47.0,
53.0,
57.0,
60.5,
63.0,
65.5,
67.3,
69.7,
71.7,
74.0,
76.3,
79.0,
82.5,
85.5,
100.00
),
rating = c(15:1)
)
The best code that I've come up with to map the rating table onto the customer score data is as follows.
customer_df <-
customer_df %>%
mutate(rating = map(.x = score,
.f = ~max(select(filter(rating_map, .x < max),rating))
)
) %>%
unnest(rating)
The problem I'm having is that while it works, it is extremely inefficient. If you set n_customers to 100k in the above code, you can get a sense of how long it takes to run.
customer_df
# A tibble: 10 x 3
id score rating
<int> <int> <int>
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
I need to speed up the code because it's currently taking over an hour to run. I've identified the inefficiency in the code to be my use of the purrr::map() function. So my question is how I could replicate the above results without using the map() function?
Thanks!
customer_df$rating <- length(rating_map$max) -
  cut(customer_df$score, breaks = rating_map$max, labels = FALSE, right = FALSE)
This produces the same output and is much faster. It takes 1/20th of a second on 1M rows, which sounds like >72,000x speedup.
It seems like this is a good use case for the base R cut function, which assigns values to a set of intervals you provide.
cut divides the range of x into intervals and codes the values in x
according to which interval they fall. The leftmost interval
corresponds to level one, the next leftmost to level two and so on.
In this case you want the lowest rating for the highest score, hence the subtraction of the cut term from the length of the breaks.
EDIT -- added right = FALSE because you want the intervals to be closed on the left and open on the right. Now matches your output exactly; previously had different results when the value matched a break.
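For a quick feel of how the interval coding works, here is a toy sketch (not your data):
# right = FALSE makes intervals closed on the left, open on the right
cut(c(50, 53, 74), breaks = c(47, 53, 57, 100), labels = FALSE, right = FALSE)
# [1] 1 2 3   (53 lands in [53, 57), not (47, 53])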
We could do a non-equi join
library(data.table)
setDT(rating_map)[customer_df, on = .(max > score), mult = "first"]
-output
max rating id
<int> <int> <int>
1: 74 5 1
2: 53 13 2
3: 56 13 3
4: 50 14 4
5: 51 14 5
6: 78 4 6
7: 72 6 7
8: 60 12 8
9: 63 10 9
10: 67 9 10
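Note that the non-equi join reports the customer's score under the name max. If you want the original column layout back, a j expression should do it (a sketch; i. and x. prefixes pick columns from customer_df and rating_map respectively):
setDT(rating_map)[customer_df,
                  .(id = i.id, score = i.score, rating = x.rating),
                  on = .(max > score), mult = "first"]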
Or another option in base R is with findInterval
customer_df$rating <- nrow(rating_map) -
findInterval(customer_df$score, rating_map$max)
-output
> customer_df
id score rating
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
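Why this works: findInterval(x, vec) returns, for each score, how many breakpoints are less than or equal to it, and since the ratings run from 15 down to 1 as max increases, subtracting that count from nrow(rating_map) recovers the rating. For example:
findInterval(c(50, 74), rating_map$max)
# [1]  1 10   -> ratings 15 - 1 = 14 and 15 - 10 = 5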
I have a dataset of 22 point coordinates (points represent landmarks on photo of fish-lateral view).
I would like to measure 24 specific distances between these points, for example the distance between points 1 and 5, and so on.
And I would like to make a loop from it, since it will always be the same set of 24 distances, and I have 2000 such lists of coordinates to process.
I tried the dist() function (see below), but it gave me all possible pairwise distances between all points.
getwd()
setwd("C:/Users/jakub/merania")
LCmeasure <- read.csv("LC_meranie2.csv", sep = ";", dec = ",", header = T)
LCmeasure
head(LCmeasure)
names(LCmeasure)
> LCmeasure
point x y
1 1 1724.00000 1747.00000
2 2 1864.00000 1637.00000
3 3 1862.00000 1760.00000
4 4 2004.00000 1757.00000
5 5 2077.00000 1533.00000
6 6 2134.00000 1933.00000
7 7 2293.00000 1699.00000
8 8 2282.00000 1588.00000
9 9 2728.00000 1576.00000
10 10 2922.00000 1440.00000
11 11 3018.00000 1990.00000
12 12 3282.00000 1927.00000
13 13 3435.00000 1462.00000
14 14 3629.00000 1548.00000
15 15 3948.00000 1826.00000
16 16 3935.00000 1571.00000
17 17 4463.00000 1700.00000
18 18 4661.00000 1978.00000
19 19 4671.00000 1445.00000
20 20 4101.00000 1699.00000
21 21 2203.00000 2806.00000
22 22 4772.00000 2788.00000
df <- data.frame(LCmeasure)
df
library(tidyverse)
dist(df[,-1])
Points <- data.frame(p1=c(1,1,1,3,4,5,1,1,1,7,10,10,11,12,12,14,15,11,13,7,20,20,20,1),p2=c(8,2,3,4,8,6,11,10,13,10,13,11,13,13,20,20,16,12,14,9,18,17,19,20))
Points
Dists <- Points %>% rowwise() %>% mutate(dist = as.numeric(dist(filter(LCmeasure, point %in% c(p1, p2))[, c("x", "y")])))
Dists
Now I need to tell R to measure only those specific 24 distances, for example between points 1 and 5, then between points 2 and 10, and so on.
And to make a loop from it (it will always be the same set of 24 distances).
Here is my solution to your problem:
Generate a new dataframe with your desired pairs of points and then use dplyr to generate distances based on those points:
library(tidyverse)
Points <- data.frame(p1=c(1,2,4,5,6),p2=c(5,10,14,15,17))
Dists <- Points %>% rowwise() %>% mutate(dist = as.numeric(dist(filter(LCmeasure, point %in% c(p1, p2))[, c("x", "y")])))
> Dists
     p1    p2  dist
  <dbl> <dbl> <dbl>
1     1     5  413.
2     2    10 1076.
3     4    14 1638.
4     5    15 1894.
5     6    17 2341.
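To repeat this over all 2000 coordinate files, you can wrap the same computation in a loop over the file names. A sketch, assuming every file has the same point/x/y layout as LC_meranie2.csv; the folder path and file pattern are placeholders, so adjust them to your actual naming:
library(tidyverse)
# The 24 fixed landmark pairs (from the question)
Points <- data.frame(
  p1 = c(1,1,1,3,4,5,1,1,1,7,10,10,11,12,12,14,15,11,13,7,20,20,20,1),
  p2 = c(8,2,3,4,8,6,11,10,13,10,13,11,13,13,20,20,16,12,14,9,18,17,19,20))
files <- list.files("C:/Users/jakub/merania", pattern = "\\.csv$", full.names = TRUE)
all_dists <- lapply(files, function(f) {
  coords <- read.csv(f, sep = ";", dec = ",", header = TRUE)
  Points %>%
    rowwise() %>%
    mutate(dist = as.numeric(dist(filter(coords, point %in% c(p1, p2))[, c("x", "y")])),
           file = basename(f))
})
result <- bind_rows(all_dists)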
1st DF:
t.d
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
names(t.d) <- c("ID","A","B","C")
t.d$FinalTime <- c("7/30/2009 08:18:35","9/30/2009 19:18:35","11/30/2009 21:18:35","13/30/2009 20:18:35","15/30/2009 04:18:35")
t.d$InitTime <- c("6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35")
>t.d
ID A B C FinalTime InitTime
1 1 6 11 16 7/30/2009 08:18:35 6/30/2009 9:18:35
2 2 7 12 17 9/30/2009 19:18:35 6/30/2009 9:18:35
3 3 8 13 18 11/30/2009 21:18:35 6/30/2009 9:18:35
4 4 9 14 19 13/30/2009 20:18:35 6/30/2009 9:18:35
5 5 10 15 20 15/30/2009 04:18:35 6/30/2009 9:18:35
2nd DF:
> s.d
F D E Time
1 10 19 28 6/30/2009 08:18:35
2 11 20 29 8/30/2009 19:18:35
3 12 21 30 9/30/2009 21:18:35
4 13 22 31 01/30/2009 20:18:35
5 14 23 32 10/30/2009 04:18:35
6 15 24 33 11/30/2009 04:18:35
7 16 25 34 12/30/2009 04:18:35
8 17 26 35 13/30/2009 04:18:35
9 18 27 36 15/30/2009 04:18:35
Required output:
From DF "t.d" I have to calculate the time interval for each row between "FinalTime" and "InitTime" (InitTime will always be earlier than FinalTime).
Another DF "temp" has to be formed from "s.d", containing only the data within the above time interval; then the most recent values of "F", "D", "E" have to be taken and attached to the ith row of "t.d" from which the time interval was calculated.
Also we have to see if the newly formed DF "temp" has the following conditions true:
here 'j' represents value for each row:
if((temp$F[j] < 35.5) + (temp$D[j] >= 100) >= 1){
  temp$Flag[j] <- 1
} else {
  temp$Flag[j] <- 0
}
Originally I have 3 million rows in the dataframe and 20 columns in each DF.
I have solved the above problem using "for loop" but it obviously takes 2 to 3 days as there are a lot of rows.
(Also if I have to add new columns to the resultant DF if multiple conditions get satisfied on each row?)
Can anybody suggest a different technique? Like using apply functions?
My suggestion is:
use lapply over the row indices
handle your if branches inside the function
return either your data frame or NULL
combine everything with rbind
By replacing lapply with mclapply from the 'parallel' package, your code gets executed in parallel.
resultList <- lapply(1:nrow(t.d), function(i){
  # do stuff
  if(condition){
    return(df)
  } else {
    return(NULL)
  }
})
resultDF <- do.call(rbind, resultList)
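To make that concrete, here is a minimal sketch of the whole pattern. It assumes the timestamps parse with as.POSIXct using format "%m/%d/%Y %H:%M:%S" and that the flag is evaluated on the most recent matching row of temp; both are assumptions, so adjust to your real data. Swap lapply for mclapply to run it in parallel:
# Parse the timestamps once, up front (assumed month/day/year format)
t.d$FinalTime <- as.POSIXct(t.d$FinalTime, format = "%m/%d/%Y %H:%M:%S")
t.d$InitTime <- as.POSIXct(t.d$InitTime, format = "%m/%d/%Y %H:%M:%S")
s.d$Time <- as.POSIXct(s.d$Time, format = "%m/%d/%Y %H:%M:%S")

resultList <- lapply(1:nrow(t.d), function(i){
  # rows of s.d falling inside the i-th interval
  temp <- s.d[s.d$Time >= t.d$InitTime[i] & s.d$Time <= t.d$FinalTime[i], ]
  if(nrow(temp) > 0){
    # most recent values of F, D, E within the interval
    recent <- temp[which.max(temp$Time), c("F", "D", "E")]
    Flag <- as.integer((recent$F < 35.5) + (recent$D >= 100) >= 1)
    return(cbind(t.d[i, ], recent, Flag = Flag))
  } else {
    return(NULL)
  }
})
resultDF <- do.call(rbind, resultList)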
I've got a mess of data and am trying to efficiently wrangle it into shape. Here's a simplified short sample of the general format of my data.frame right now. The main difference is that I have a few more data labels like Label1 for my sampling units - each has a set of data similar to the data.frame I'm including but in my situation they are all in the same data.frame. I don't think that will complicate the reformatting so I've just included the single sampling unit of mock data here. StatsType levels Ave, Max, and Min are effectively nested within MeasureType.
tastycheez<-data.frame(
Day=rep((1:3),9),
StatsType=rep(c(rep("Ave",3),rep("Max",3),rep("Min",3)),3),
MeasureType=rep(c("Temp","H2O","Tastiness"),each=9),
Data_values=1:27,
Label1=rep("SamplingU1",27))
Ultimately, I would like a data frame where for each sampling unit and each Day there are columns holding the Data_values for my categories, like this:
Day Label1 Ave.Temp Ave.H2O Ave.Tastiness Max.Temp ...
1 SamplingU1 1 10 19 4 ...
2 SamplingU1 2 11 20 5 ...
I think some combination of functions from reshape2, dplyr, tidyr, and/or data.table could do the job, but I can't figure out how to code it. Here's what I've tried:
First, I spread the tastycheez (yum!), and that got me partway:
test<-spread(tastycheez,StatsType,Data_values)
Now I'm trying to spread it again or to cast, but with no luck:
test2<-spread(test,MeasureType,(Ave,Max,Min))
test2 <- recast(Day ~ MeasureType+c(Ave,Max,Min), data=test)
(I also tried melting the tastycheez but the results were a sticky, gooey mess and my tongue got burnt. that doesn't seem to be the right function for this.)
If you hate my puns please excuse them, I really can't figure this out!
Here are a couple related questions:
Combining two subgroups of data in the same dataframe
How can I spread repeated measures of multiple variables into wide format?
reshape2

You could use dcast from reshape2:
library(reshape2)
dcast(tastycheez,
Day + Label1 ~ paste(StatsType, MeasureType, sep="."),
value.var = "Data_values")
which gives
Day Label1 Ave.H2O Ave.Tastiness Ave.Temp Max.H2O Max.Tastiness Max.Temp Min.H2O Min.Tastiness Min.Temp
1 1 SamplingU1 10 19 1 13 22 4 16 25 7
2 2 SamplingU1 11 20 2 14 23 5 17 26 8
3 3 SamplingU1 12 21 3 15 24 6 18 27 9
tidyr

Stealing @DavidArenburg's comment, here's the tidyr way:
library(tidyr)
tastycheez %>%
unite(temp, StatsType, MeasureType, sep = ".") %>%
spread(temp, Data_values)
which gives
Day Label1 Ave.H2O Ave.Tastiness Ave.Temp Max.H2O Max.Tastiness Max.Temp Min.H2O Min.Tastiness Min.Temp
1 1 SamplingU1 10 19 1 13 22 4 16 25 7
2 2 SamplingU1 11 20 2 14 23 5 17 26 8
3 3 SamplingU1 12 21 3 15 24 6 18 27 9
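On current tidyr (>= 1.0.0), pivot_wider() supersedes spread() and can take both naming columns at once, folding the unite step in; a sketch:
library(tidyr)
tastycheez %>%
  pivot_wider(names_from = c(StatsType, MeasureType),
              values_from = Data_values,
              names_sep = ".")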
I have data that illustrates hurricane tracks crossing through a series of "gates". How would I code it to output the GateID, and the count of times that each GateID occurs in the total data frame?
track_id day hour month year rate gate_id pres_inter vmax_inter
9 10 0 7 1 9.6451E-06 2 97809 23.545
9 10 0 7 1 9.6451E-06 17 100170 13.843
10 3 6 7 1 9.6451E-06 2 96662 31.568
13 22 12 8 1 9.6451E-06 1 94449 48.466
13 22 12 8 1 9.6451E-06 17 96749 30.55
16 13 0 8 1 9.6451E-06 4 98702 19.205
16 13 0 8 1 9.6451E-06 16 98585 18.143
19 27 6 9 1 9.6451E-06 9 98838 20.053
header <- read.table(fname_in, nrows=1)
track <- read.table(fname_in, sep=',', skip=1)
colnames(track) <- c("ID", "day", "month", "year", "hour", "rate", "gate_id", "pres_inter", "vmax_inter")
I think I would like to count the occurrence of each gate_id, and also perhaps output the maximum wind per gate (vmax_inter), etc....
Totally reading your mind, since you provide nothing concrete to go on. But if GateID is one of your data frame columns, you can get the count for each unique GateID along with other parameters using count from package plyr.
install.packages("plyr")
library("plyr")
count(mydf, vars = "GateID")
See ?count after installing for further details.
For the 2nd part of your question, see ?aggregate and consider the formula interface. For example,
aggregate(vmax_inter ~ gate_id, data = mydf, FUN = max)
or something similar. By the way, you can combine your two read.table steps with read.csv.
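If you want both in one pass, the count per gate and the maximum wind per gate, plyr's ddply with summarise can do it; a sketch using the column names from your read.table call:
library(plyr)
ddply(track, "gate_id", summarise,
      count = length(gate_id),
      max_vmax = max(vmax_inter))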