How to subset rows group-wise in R?

My question title is probably not appropriate, sorry for that. I have a CSV file named "table_parameter"; please download it from here. The data look like this:
X time Avg.PM10 sill range nugget
1 1 2012030101 52.269231 0.11054330 45574.072 0.037261216
2 2 2012030102 55.314286 0.20250974 87306.391 0.048315377
3 3 2012030103 56.038095 0.17711558 56806.827 0.034956709
4 4 2012030104 55.904762 0.16466350 104767.669 0.030752835
5 5 2012030105 57.123810 0.23638953 87306.391 0.037308364
6 6 2012030106 58.542857 0.24130317 87306.391 0.042108754
7 7 2012030107 60.066667 0.20362439 87306.391 0.037353980
8 8 2012030108 63.790476 0.19417801 87306.391 0.034144464
.
.
.
In my dataframe there is a variable named time containing hourly values from 01 March 2012 to 07 March 2012 in numeric form. For example, 01 March 2012, 1.00 a.m. is written as 2012030101, and so on.
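As a side note, this encoding can be unpacked with integer arithmetic (a quick sketch, assuming every value follows the YYYYMMDDHH layout):
t <- 2012030101
hour  <- t %% 100              # 1
day   <- t %/% 100 %% 100      # 1
month <- t %/% 10000 %% 100    # 3
year  <- t %/% 1000000         # 2012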
I want to subset this dataframe time-wise: I want a dataframe containing only the morning hours of each of the 7 days, where morning means 1.00 a.m. to 5.00 a.m. That is, I want a dataframe with all the values belonging to 2012030101 to 2012030105, 2012030201 to 2012030205, ..., 2012030701 to 2012030705. In other words, I want a dataframe like the one below:
time Avg.PM10 sill range nugget
1 49 49 2012030301 17.371429 0.7154449 48239.54 0.17163448
2 50 50 2012030302 17.811321 1.1201199 117603.55 0.12425337
3 51 51 2012030303 17.094340 0.5799705 55103.16 0.12061258
4 52 52 2012030304 16.679245 0.8486774 86725.77 0.15210005
5 53 53 2012030305 16.885714 1.2408621 154677.61 0.09743375
6 73 73 2012030401 21.619048 0.4417369 104767.67 0.08567888
7 74 74 2012030402 20.485714 2.0271124 215474.54 0.06340464
8 75 75 2012030403 20.552381 0.4509354 104767.67 0.06319812
9 76 76 2012030404 20.104762 0.4438798 104767.67 0.05639840
10 77 77 2012030405 20.133333 0.5050201 104767.67 0.09037341
.
.
.
To do this I wrote the following code:
table <- read.csv("table_parameter.csv")
table
table_morning <- subset(table, time %in% c(2012030101:2012030105,
                                           2012030201:2012030205,
                                           2012030301:2012030305,
                                           2012030401:2012030405,
                                           2012030501:2012030505,
                                           2012030601:2012030605,
                                           2012030701:2012030705) & Avg.PM10 <= 30)
table_morning
But this code is not efficient. As you can see, I wrote out every hour value to subset! If I want to do the same for 90 days, this becomes very tedious. So, how can I do this subsetting efficiently? If you have any further questions, please let me know.

You could use substring, like below:
table_morning <- subset(table, substring(time, 9, 10) %in% c("01", "02", "03", "04", "05") & Avg.PM10 <= 30)
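Since time is numeric, the same filter can be written without the string conversion, using modulo arithmetic (a sketch, under the assumption that the last two digits are always the hour):
hrs <- table$time %% 100   # 2012030101 %% 100 gives 1, i.e. the hour
table_morning <- subset(table, hrs >= 1 & hrs <= 5 & Avg.PM10 <= 30)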

I would extract the hour from the time and then filter accordingly.
For example:
library(dplyr)
data_orpheus = read.csv('table_parameter.csv')
data_orpheus$hour = as.numeric(substr(as.character(data_orpheus$time),9,10))
data_morning = data_orpheus %>% filter(hour >= 1 & hour <= 5)
The dplyr pipe %>% is not necessary; you could filter in base R with data_morning = data_orpheus[with(data_orpheus, hour >= 1 & hour <= 5), ] (note the trailing comma, which selects rows).
Update
I am still learning dplyr, so here is a beautiful one-liner that does it all:
data_morning = read.csv('table_parameter.csv') %>%     # Read CSV
  mutate(hours = as.numeric(substr(time, 9, 10))) %>%  # Extract hours
  filter(hours >= 1 & hours <= 5) %>%                  # Keep only mornings
  select(-hours)                                       # Drop hours, if not needed
head(data_morning)
X time Avg.PM10 sill range nugget
1 1 2012030101 52.26923 0.1105433 45574.07 0.03726122
2 2 2012030102 55.31429 0.2025097 87306.39 0.04831538
3 3 2012030103 56.03810 0.1771156 56806.83 0.03495671
4 4 2012030104 55.90476 0.1646635 104767.67 0.03075283
5 5 2012030105 57.12381 0.2363895 87306.39 0.03730836
6 25 2012030201 67.10476 0.1434977 72755.33 0.03003781

Thanks a lot for the other answers. My improvised answer, for my future advantage:
table <- read.csv("table_parameter.csv")
times <- as.numeric(substr(table$time, 9, 10))
table_morning <- subset(table, times >= 1 & times <= 5 & Avg.PM10 <= 30)
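If the same window is needed repeatedly (say, for 90 days of data), this can be wrapped in a small helper; the function name and defaults here are just illustrative:
subset_hours <- function(df, from = 1, to = 5, pm10_max = 30) {
  hrs <- as.numeric(substr(df$time, 9, 10))
  subset(df, hrs >= from & hrs <= to & Avg.PM10 <= pm10_max)
}
table_morning <- subset_hours(table)  # same result as above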

Related

R dplyr: How do I apply a less than / greater than mapping table across a large dataset efficiently?

I have a large dataset (~1M rows) with, among others, a column that has a score for each customer record. The score is between 0 and 100.
What I'm trying to do is efficiently map the score to a rating using a rating table. Each customer receives a rating between 1 and 15 based on the customer's score.
# Generate example customer data
library(tidyverse)  # tibble, dplyr, purrr and tidyr are all used below
set.seed(1)
n_customers <- 10
customer_df <-
  tibble(id = c(1:n_customers),
         score = sample(50:80, n_customers, replace = TRUE))
# Rating map
rating_map <- tibble(
  max = c(47.0, 53.0, 57.0, 60.5, 63.0, 65.5, 67.3, 69.7,
          71.7, 74.0, 76.3, 79.0, 82.5, 85.5, 100.0),
  rating = c(15:1)
)
The best code that I've come up with to map the rating table onto the customer score data is as follows.
customer_df <-
  customer_df %>%
  mutate(rating = map(.x = score,
                      .f = ~ max(select(filter(rating_map, .x < max), rating)))) %>%
  unnest(rating)
The problem I'm having is that while it works, it is extremely inefficient. If you set n_customers <- 100000 in the above code, you can get a sense of how long it takes to run.
customer_df
# A tibble: 10 x 3
id score rating
<int> <int> <int>
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
I need to speed up the code because it's currently taking over an hour to run. I've identified the inefficiency in the code to be my use of the purrr::map() function. So my question is how I could replicate the above results without using the map() function?
Thanks!
customer_df$rating <- length(rating_map$max) -
  cut(customer_df$score, breaks = rating_map$max, labels = FALSE, right = FALSE)
This produces the same output and is much faster. It takes 1/20th of a second on 1M rows, which sounds like a >72,000x speedup.
It seems like this is a good use case for the base R cut function, which assigns values to a set of intervals you provide.
cut divides the range of x into intervals and codes the values in x
according to which interval they fall. The leftmost interval
corresponds to level one, the next leftmost to level two and so on.
In this case you want the lowest rating for the highest score, hence the subtraction of the cut term from the length of the breaks.
EDIT -- added right = FALSE because you want the intervals to be closed on the left and open on the right. Now matches your output exactly; previously had different results when the value matched a break.
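To make the reversal concrete, here is a toy version with made-up breaks and scores (three intervals instead of fourteen):
breaks <- c(47, 53, 57, 100)   # intervals [47,53), [53,57), [57,100)
scores <- c(48, 53, 90)
length(breaks) - cut(scores, breaks = breaks, labels = FALSE, right = FALSE)
# [1] 3 2 1   (the leftmost interval gets the largest rating)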
We could do a non-equi join
library(data.table)
setDT(rating_map)[customer_df, on = .(max > score), mult = "first"]
-output
max rating id
<int> <int> <int>
1: 74 5 1
2: 53 13 2
3: 56 13 3
4: 50 14 4
5: 51 14 5
6: 78 4 6
7: 72 6 7
8: 60 12 8
9: 63 10 9
10: 67 9 10
Or another option in base R is with findInterval
customer_df$rating <- nrow(rating_map) -
findInterval(customer_df$score, rating_map$max)
-output
> customer_df
id score rating
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
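findInterval(score, rating_map$max) counts how many break points lie at or below each score, so subtracting it from nrow(rating_map) reverses the order exactly as in the cut answer. A one-value sanity check:
findInterval(74, rating_map$max)
# [1] 10   # so rating = 15 - 10 = 5, matching the table above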

if statement with ddply function

I am trying to use the if statement with ddply but am having issues with the if statement.
An example dataset is:
data <- data.frame(Gear = c(rep("S", 10), rep("C", 10)),
                   TowSurvey = c(0, 0, 1, 1, 0, 1, 1, 1, 1, 0),
                   TowCom = c(0, 1, 1, 1, 0, 1, 1, 1, 1, 0),
                   StationID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                   Totwght = c(2, 8, 6, 4, 12, 9, 56, 7, 89, 10),
                   Totexpwght = c(5, 8, 12, 45, 89, 56, 23, 78, 56, 41),
                   Expnum = c(1, 5, 6, 98, 45, 2, 6, 3, 7, 45),
                   Exp = c(56, 25, 85, 74, 1, 23, 56, 45, 89, 75))
My first try was
library(plyr)
if(data$Gear == "S" & data$TowSurvey == 1 | data$Gear == "C" & data$TowCom == 1){
  datad <- ddply(data, .(StationID, Gear), summarize,
                 Totwghtpertow = sum(Totwght),
                 Totexppertow = sum(Totexpwght),
                 Totnum = sum(Expnum),
                 Totexpnum = sum(Exp))}
print(datad)
But the records that don't meet the if statement criteria are still included in datad.
Then I found this post: Aggregate (count) rows that match a condition, group by unique values.
So my second attempt based on the answer from the post was
datad<-ddply(data, .(StationID,Gear), summarize,Totwghtpertow=sum(Totwght[Gear=="S" & TowSurvey== 1 | Gear=="C" & TowCom== 1]))
I only tried it with one column as a test and am getting the same results. Any help in figuring this out would be appreciated.
Thanks
If you run your first attempt you should actually get an error message, since if() can only evaluate a logical vector of length 1.
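A quick sketch of that failure:
cond <- data$Gear == "S" & data$TowSurvey == 1
length(cond)      # 20, one element per row, not 1
if (cond) "kept"  # warning in old R (only the first element is used); error in R >= 4.2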
You really don't need an if statement here. Subsetting your data will do just fine.
data_sub <- subset(data, (data$Gear=="S" & data$TowSurvey== 1) | (data$Gear=="C" & data$TowCom== 1))
You can run your ddply statement using data_sub rather than data.
Or if you're going to be using it a lot, you can wrap it in a function:
datad_func <- function(data){
  data_sub <- subset(data, (data$Gear == "S" & data$TowSurvey == 1) | (data$Gear == "C" & data$TowCom == 1))
  datad <- ddply(data_sub, .(StationID, Gear), summarize,
                 Totwghtpertow = sum(Totwght),
                 Totexppertow = sum(Totexpwght),
                 Totnum = sum(Expnum),
                 Totexpnum = sum(Exp))
  rm('data_sub')
  print(datad)
}
datad_func(data)
StationID Gear Totwghtpertow Totexppertow Totnum Totexpnum
1 2 C 8 8 5 25
2 3 C 6 12 6 85
3 3 S 6 12 6 85
4 4 C 4 45 98 74
5 4 S 4 45 98 74
6 6 C 9 56 2 23
7 6 S 9 56 2 23
8 7 C 56 23 6 56
9 7 S 56 23 6 56
10 8 C 7 78 3 45
11 8 S 7 78 3 45
12 9 C 89 56 7 89
13 9 S 89 56 7 89
plyr is not so good at subsetting inside the function, so you can do it before or after, as @scribbles said.
You could also try dplyr and pipe them together:
library(dplyr)
data %>%
  filter((Gear == "S" & TowSurvey == 1) | (Gear == "C" & TowCom == 1)) %>%
  group_by(StationID, Gear) %>%
  summarise_each(funs(sum), Totwght, Totexpwght, Expnum, Exp)
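As a side note, summarise_each() has since been deprecated; with dplyr >= 1.0 the same summary can be written with across() (a sketch under that version assumption):
data %>%
  filter((Gear == "S" & TowSurvey == 1) | (Gear == "C" & TowCom == 1)) %>%
  group_by(StationID, Gear) %>%
  summarise(across(c(Totwght, Totexpwght, Expnum, Exp), sum), .groups = "drop")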

Tidying Time Intervals for Plotting Histogram in R

I'm doing some cluster analysis on the MLTobs data from the LifeTables package and have come across a tricky problem with the Year variable in the mlt.mx.info dataframe. Year contains the period over which the life table was taken, as intervals. Here's a table of the data:
1751-1754 1755-1759 1760-1764 1765-1769 1770-1774 1775-1779 1780-1784 1785-1789 1790-1794
1 1 1 1 1 1 1 1 1
1795-1799 1800-1804 1805-1809 1810-1814 1815-1819 1816-1819 1820-1824 1825-1829 1830-1834
1 1 1 1 1 2 3 3 3
1835-1839 1838-1839 1840-1844 1841-1844 1845-1849 1846-1849 1850-1854 1855-1859 1860-1864
4 1 5 3 8 1 10 11 11
1865-1869 1870-1874 1872-1874 1875-1879 1876-1879 1878-1879 1880-1884 1885-1889 1890-1894
11 11 1 12 2 1 15 15 15
1895-1899 1900-1904 1905-1909 1908-1909 1910-1914 1915-1919 1920-1924 1921-1924 1922-1924
15 15 15 1 16 16 16 2 1
1925-1929 1930-1934 1933-1934 1935-1939 1937-1939 1940-1944 1945-1949 1947-1949 1948-1949
19 19 1 20 1 22 22 3 1
1950-1954 1955-1959 1956-1959 1958-1959 1960-1964 1965-1969 1970-1974 1975-1979 1980-1984
30 30 2 1 40 40 41 41 41
1983-1984 1985-1989 1990-1994 1991-1994 1992-1994 1995-1999 2000-2003 2000-2004 2005-2006
1 42 42 1 1 44 3 41 22
2005-2007
14
As you can see, some of the intervals sit within other intervals; thankfully none of them partially overlap. I want to simplify the intervals so that intervals such as 1992-1994 and 1991-1994 all go into 1990-1994.
An idea might be to take the modulo of each interval's years and sort them into their new intervals that way, but I'm unsure how to do this with the interval data type. If anyone has any ideas I'd really appreciate the help. Ultimately I want to create a histogram or barplot to illustrate this nicely.
If I understand your problem, you'll want something like this:
bottom <- seq(1750, 2010, 5)
library(dplyr)
new_df <- mlt.mx.info %>%
  arrange(Year) %>%
  mutate(year2 = as.numeric(substr(Year, 6, 9))) %>%
  mutate(new_year = paste0(bottom[findInterval(year2, bottom)], "-",
                           bottom[findInterval(year2, bottom) + 1] - 1))
View(new_df)
So what this does is create bins and output a new column (new_year) labeling the five-year bin each interval falls into. Everything from 1750-1754 will correspond to a new value of "1750-1754" (in string form; the original is an integer type, and I'm not sure how to keep it as an interval). Does this do what you want? Double-check the results, but it looks right to me.
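The binning step is easier to see in isolation; a standalone sketch with made-up end years:
bottom <- seq(1750, 2010, 5)
yrs <- c(1992, 1991, 1838)   # end years parsed from strings like "1990-1994"
paste0(bottom[findInterval(yrs, bottom)], "-",
       bottom[findInterval(yrs, bottom) + 1] - 1)
# [1] "1990-1994" "1990-1994" "1835-1839"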

select records according to the difference between records R

I hope someone can suggest something for this "problem", because I really don't know how to proceed...
Well, my data are like this
data <- data.frame(site = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                   time = c(100, 180, 245, 5, 55, 130, 70, 120, 160))
where time is in minutes.
I want to select only the records, for each site, for which the difference is more than 60, so the output should look like this:
out<-data[c(1:4,6,7,9),]
What I have tried so far: to get the differences I use this:
difference<-stack(tapply(data$time,data$site,diff))
but then I have no idea how to pick out the records which satisfy my condition...
If there is already a similar question, although I've searched for a while, I apologize for this.
To make things clear, as the definition of difference was probably ambiguous: I need to select all the records (for each site) which are separated by at least 60 minutes, not only those that are strictly consecutive in time.
Specifically,
> out
  site time
1    A  100 # included because difference between records 2 and 1 is >60
2    A  180 # included because difference between records 3 and 2 is >60
3    A  245 # included because separated by 60 minutes from record 2
4    B    5 # included because difference between records 6 and 4 is >60
6    B  130 # included because separated by 60 minutes from record 4
7    C   70 # included because difference between records 9 and 7 is >60
9    C  160 # included because separated by 60 minutes from record 7
Maybe, to solve the "problem", it could be useful to consider the results of the difference, something like this:
> difference
  values ind
1     80   A # include records 1 and 2
2     65   A # include records 2 and 3
3     50   B # include only record 4
4     75   B # include record 6 because there are (50+75) > 60 min from record 4
5     50   C # include only record 7
6     40   C # include record 9 because there are (50+40) > 60 min from record 7
Thanks for the help.
data[ave(data$time, data$site, FUN = function(x){c(61, diff(x)) > 60}) == 1, ]
# site time
# 1 A 100
# 2 A 180
# 3 A 245
# 4 B 5
# 6 B 130
# 7 C 70
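A note on the trailing == 1: ave() coerces the logical result of FUN back to the numeric mode of its input, so the kept rows come out as 1s. A minimal illustration with a single group:
ave(c(10, 80, 85), c("a", "a", "a"), FUN = function(x) c(61, diff(x)) > 60)
# [1] 1 1 0   # first row always kept; then gaps of 70 and 5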
Edit following updated question:
keep <- as.logical(ave(data$time, data$site, FUN = function(x){
c(TRUE, cumsum(diff(x)) > 60)
}))
data[keep, ]
# site time
# 1 A 100
# 2 A 180
# 3 A 245
# 4 B 5
# 6 B 130
# 7 C 70
# 9 C 160
#Calculate the differences
data$diff <- unlist(by(data$time, data$site,function(x)c(NA,diff(x))))
#subset data
data[is.na(data$diff) | data$diff > 60,]
Using plyr:
ddply(data, .(site), function(x) x[c(TRUE, diff(x$time) > 60), ])
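For comparison, the same consecutive-difference filter in dplyr (a sketch, assuming dplyr is available; like the diff-based answers above, it only compares each row with the previous one):
library(dplyr)
data %>%
  group_by(site) %>%
  filter(is.na(lag(time)) | time - lag(time) > 60) %>%
  ungroup()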

Grouping R variables based on sub-groups

I have data formatted as
PERSON_A PERSON_B MEET LEAVE
That describes, basically, when PERSON_A met PERSON_B at time MEET, and they said "bye" to each other at moment LEAVE. The time is expressed in seconds, and there is a small part of the data at http://pastie.org/2825794 (simple.dat).
What I need is to count the number of meetings, grouping by day. At the moment I have code that works, but its appearance is not beautiful. Anyway, I'd like help transforming it into code that reflects the grouping I'm trying to do, e.g. using ddply, etc. My main aim is to learn from this case. There are probably many violations of good R practice in this code.
library(plyr)
data = read.table("simple.dat", stringsAsFactors = FALSE)
names(data) = c('PERSON_A', 'PERSON_B', 'MEET', 'LEAVE')
attach(data)
min_interval = min(MEET)
max_interval = max(LEAVE)
interval = max_interval - min_interval
day = 86400
number_of_days = floor(interval / day)
g = data.frame(MEETINGS = c(0:number_of_days))  # just to store the result
g[, 1] = 0
start_offset = min_interval  # start of the first day
for (interval in c(0:number_of_days)) {
  end_offset = start_offset + day
  meetings = (length(data[data$MEET >= start_offset & data$LEAVE <= end_offset, ]$PERSON_A) +
              length(data[data$MEET >= start_offset & data$LEAVE <= end_offset, ]$PERSON_B))
  g[interval + 1, ] = meetings
  start_offset = end_offset  # start the next day
}
g
This code iterates over the days (intervals of 86400 seconds) and stores the number of meetings in the dataframe g. The correct output (shown below) of this code when executed on the linked dataset gives, for each line (day), the number of meetings.
MEETINGS
1 38
2 10
3 16
4 18
5 24
6 6
7 4
8 10
9 28
10 14
11 22
12 2
13 .. 44 0 # I simplified the output here
45 2
Anyway, I know that I could use ddply to get the number of meetings for each pair of nodes:
contacts <- ddply(data, .(PERSON_A, PERSON_B), summarise,
                  CONTACTS = length(c(PERSON_A, PERSON_B)) / 2)
but there is a huge hill for me between this and the result I need.
As an end note, I read How to make a great R reproducible example? and tried my best :)
Thanks,
try this:
> d2 <- transform(data, m = floor(MEET/86400) + 1, l = floor(LEAVE/86400) + 1)
> d3 <- subset(d2, m == l)
> table(d3$m) * 2
1 2 3 4 5 6 7 8 9 10 11 12 45
38 10 16 18 24 6 4 10 28 14 22 2 2
floor(x/(60*60*24)) is a quick way to convert seconds into days.
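A quick check of that conversion on a few sample second counts (values made up):
secs <- c(0, 86399, 86400, 172800)
floor(secs / 86400) + 1   # day index, counting from day 1
# [1] 1 1 2 3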
