I need to repeat the following calculation many times for different elements. I am very new to R and the only way I know how to do it is by copy-pasting.
I need to calculate the proportion that each program represents within its area, excluding the area's "Undecided" entries. I also need the proportions for each area stored in a separate list or vector for further calculations.
df2 = data.frame (area=rep(c("Eng", "Hum"),each=3), program=c("Chem", "Mech", "Undecided","Hist", "Law", "Undecided"))
df2
area program
1 Eng Chem
2 Eng Mech
3 Eng Undecided
4 Hum Hist
5 Hum Law
6 Hum Undecided
p.Mech = with(df2, sum(program=="Mech" & area=="Eng") / (sum(area=="Eng") - sum(program=="Undecided" & area=="Eng")))
p.Chem = with(df2, sum(program=="Chem" & area=="Eng") / (sum(area=="Eng") - sum(program=="Undecided" & area=="Eng")))
p.Hist = with(df2, sum(program=="Hist" & area=="Hum") / (sum(area=="Hum") - sum(program=="Undecided" & area=="Hum")))
p.Law = with(df2, sum(program=="Law" & area=="Hum") / (sum(area=="Hum") - sum(program=="Undecided" & area=="Hum")))
In my real data I have 9 areas and about 5 programs for each area.
This is my first post ever on stack.exchange so sorry if the question is too dumb or doesn't belong here. Hope anyone can help.
I'd probably just do the following. Remove undecided entries, tally up the results by program and area, and calculate proportions by area.
df2= data.frame (area=rep(c("Eng", "Hum"),each=3), program=c("Chem", "Mech", "Undecided","Hist", "Law", "Undecided"))
library(data.table)
library(magrittr)
dt2 <- as.data.table(df2) # just converting to a data.table
dt2 %>%
.[program != "Undecided"] %>%
.[, .N, keyby = .(area, program)] %>%
.[, P := N / sum(N), keyby = "area"] %>%
.[] # just for displaying
#> area program N P
#> 1: Eng Chem 1 0.5
#> 2: Eng Mech 1 0.5
#> 3: Hum Hist 1 0.5
#> 4: Hum Law 1 0.5
Created on 2019-03-12 by the reprex package (v0.2.1)
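The question also asks for the proportions of each area to be stored in a separate list. Here is a minimal base R sketch of that step, assuming the same df2 as above; the name props is made up for illustration.

```r
df2 <- data.frame(area = rep(c("Eng", "Hum"), each = 3),
                  program = c("Chem", "Mech", "Undecided", "Hist", "Law", "Undecided"),
                  stringsAsFactors = FALSE)

# Drop the "Undecided" rows first so they do not enter the denominators
decided <- df2[df2$program != "Undecided", ]

# One table of proportions per area, collected in a named list
# (props is a hypothetical name, not something from the question)
props <- lapply(split(decided$program, decided$area),
                function(p) prop.table(table(p)))

props$Eng            # proportions of Chem and Mech within Eng
props$Hum[["Law"]]   # a single proportion as a plain number
```

Each list element can then be used on its own in later calculations, which avoids one hand-written variable per program.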
I have a data set that looks something like this:
Date Remaining Volume ID
1990-01-01 0 1000 1
1990-01-01 1 2000 2
1990-01-01 1 5000 3
1990-02-01 0 200 4
1990-03-01 1 4000 5
1990-03-01 0 3000 6
I filter the data according to a series of conditional statements and assign the binary flag variable to the data.table. A value of 0 means that the particular row entry doesn't meet the defined requirements and will subsequently be excluded; 1-flagged rows remain in the data.table. The key is ID and is unique for each row.
I would like to show two relationships.
(1) A stacked normalized/percentage bar chart over the monthly time series to show the percentage of entries remaining/being excluded in the data set for each month,
e.g. Jan 1990 --> 2/3 values remaining --> 66.7% of entries remain vs. 33.3% are excluded
(2) A stacked normalized/percentage bar chart showing the normalized percentage of volume remaining/being excluded by the filtering operation for each month,
e.g. Jan 1990 --> 2k + 5k out of 8k remaining --> 87.5% of volume remains vs. 12.5% is excluded
I have tried various things so far, e.g. computing the number of occurrences of each flag value per month and the sum of the corresponding "bucket" (0/1) volume, but all my attempts have failed.
# dt_1 is the original data.table
id.vec <- dt_1[ , id]
dt_2 <- dt_1
# dt_1 is filtered subsequently
id_remaining.vec <- dt_1[ , id]
dt_2 <- dt_2[id.vec %in% id_remaining.vec, REMAIN := 1]
dt_2 <- dt_2[id.vec %notin% id_remaining.vec, REMAIN := 0]
dt_2 <- dt_2[REMAIN == 1 , N_REMAIN := .N]
dt_2 <- dt_2[REMAIN == 1 , N_REMAIN_MON := .N]
# Tried the code below to no avail
ggplot(data = dt_2, aes(x = Date, y = REMAIN, color = REMAIN, fill = REMAIN)) +
geom_bar(position = "fill", stat = "identity")
Usually, I find ggplot grammar very intuitive, but I guess I am overlooking something here, or maybe the data set is not in the right format.
Any pointer or idea highly appreciated!
Here's how I'd do it with dplyr:
library(dplyr)
dt_2 %>%
mutate(Remaining = as.character(Remaining)) %>% # just to make the charts use scale_fill_discrete by default
group_by(Date, Remaining) %>%
summarize(entries = n(),
volume = sum(Volume)) %>%
mutate(share_entries = entries / sum(entries),
share_volume = volume / sum(volume)) %>%
ungroup() -> dt_2_summary
> dt_2_summary
# A tibble: 5 x 6
Date Remaining entries volume share_entries share_volume
<chr> <chr> <int> <int> <dbl> <dbl>
1 1990-01-01 0 1 1000 0.333 0.125
2 1990-01-01 1 2 7000 0.667 0.875
3 1990-02-01 0 1 200 1 1
4 1990-03-01 0 1 3000 0.5 0.429
5 1990-03-01 1 1 4000 0.5 0.571
Then to chart:
library(ggplot2)
dt_2_summary %>%
ggplot(aes(Date, share_entries, fill = Remaining)) +
geom_col()
dt_2_summary %>%
ggplot(aes(Date, share_volume, fill = Remaining)) +
geom_col()
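For comparison, the same shares can be computed without any packages. This is only a sketch in base R, assuming the six example rows from the question; the data frame is rebuilt here so the block runs on its own, and the object names are illustrative.

```r
dt_1 <- data.frame(
  Date = c("1990-01-01", "1990-01-01", "1990-01-01",
           "1990-02-01", "1990-03-01", "1990-03-01"),
  Remaining = c(0, 1, 1, 0, 1, 0),
  Volume = c(1000, 2000, 5000, 200, 4000, 3000),
  ID = 1:6
)

# Rows = flag value, columns = month
counts <- xtabs(~ Remaining + Date, data = dt_1)        # number of entries
volume <- xtabs(Volume ~ Remaining + Date, data = dt_1) # summed volume

# margin = 2 normalizes within each column, i.e. within each month
share_entries <- prop.table(counts, margin = 2)
share_volume <- prop.table(volume, margin = 2)

# A matrix passed to barplot() is drawn as stacked bars, one bar per column
barplot(share_entries, legend.text = TRUE, ylab = "share of entries")
barplot(share_volume, legend.text = TRUE, ylab = "share of volume")
```

Because every column sums to 1 after prop.table(), the stacked bars all have the same height, which is what a normalized bar chart needs.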
Just as an appendix to Jon's great solution.
I had a large project with more than 25 libraries loaded, and while the proposed code seemed to run, it only worked for share_entries and not for share_volume. The output of dt_2_summary was wrong: the share_entries column was apparently computed against the total number of entries rather than within each group, and the share_volume column showed only NAs.
After hours of troubleshooting, I identified the culprit as the plyr package, which masked some of the dplyr functions. I therefore had to specify explicitly which package's version of each function I wanted to use.
The code below did the trick for me.
library(plyr) # the culprit
library(dplyr)
dt_2 %>%
dplyr::mutate(Remaining = as.character(Remaining)) %>%
group_by(Date, Remaining) %>%
dplyr::summarize(entries = n(),
volume = sum(Volume)) %>%
dplyr::mutate(share_entries = entries / sum(entries),
share_volume = volume / sum(volume)) %>%
ungroup() -> dt_2_summary
Thanks again, Jon, for your wonderful solution!
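A quick way to check for this kind of masking is to ask R where a name is defined. The sketch below uses only functions that ship with R; it is shown on filter, which exists in the stats package, because that works in any session. What it prints for summarize and mutate depends on which packages are attached.

```r
# find() lists every attached package or environment that defines a name,
# in search-path order; a bare call uses the first entry.
find("filter")

# The same check for the verbs used above. With plyr and dplyr both attached,
# each of these would list two packages; in a fresh session they list none.
lapply(c("summarize", "mutate"), find)

# Calling through the namespace with :: sidesteps the search path entirely
stats::filter(1:5, rep(1 / 3, 3))
```

If the first entry returned by find() is not the package you expect, prefix the call with the package name, as in the dplyr:: calls above.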
I am trying to calculate family sizes from a data frame that also records two types of events: family members who died, and those who left the family. I would like to take both into account in order to compute the actual family size.
Here is a reproducible example of my problem, with 3 families only:
family <- factor(rep(c("001","002","003"), c(10,8,15)), levels=c("001","002","003"), labels=c("001","002","003"), ordered=TRUE)
dead <- c(0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0)
left <- c(0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0)
DF <- data.frame(family, dead, left) ; DF
I could count N = total family members (in each family) in a second dataframe DF2, by simply using table()
DF2 <- with(DF, data.frame(table(family)))
colnames(DF2)[2] <- "N" ; DF2
family N
1 001 10
2 002 8
3 003 15
But I cannot find a proper way to get the actual number of people (for example, by creating a new variable N2 in DF2), calculated by subtracting from N the number of members who died or left the family. I suppose I have to relate the two data frames DF and DF2 in some way. I have looked for related questions on this site but could not find the right answer.
If anyone has a good idea, it would be great!
Thank you in advance.
Deni
Logic: first group_by(family), then calculate two numbers for each group: (i) the total number of observations, and (ii) that total minus (sum(dead) + sum(left)).
In dplyr, n() gives the total number of observations in each group.
In data.table, .N does the same job.
library(dplyr)
DF %>% group_by(family) %>% summarise( total = n(), current = n()-sum(dead,left, na.rm = TRUE))
# family total current
# (fctr) (int) (dbl)
#1 001 10 6
#2 002 8 4
#3 003 15 7
library(data.table)
# setDT() converts a data.frame to a data.table by reference; if DF is already a data.table, just use DF.
setDT(DF)[, .(total = .N, current = .N - sum(dead, left, na.rm = TRUE)), by = family]
# family total current
#1: 001 10 6
#2: 002 8 4
#3: 003 15 7
Here is a base R option
do.call(data.frame, aggregate(dl~family, transform(DF, dl = dead + left),
FUN = function(x) c(total=length(x), current=length(x) - sum(x))))
Or a modified version is
transform(aggregate(. ~ family, transform(DF, total = 1,
current = dead + left)[c(1,4:5)], FUN = sum), current = total - current)
# family total current
#1 001 10 6
#2 002 8 4
#3 003 15 7
I finally found another solution that works fine (from another post) and computes everything from the original DF table. It uses the ddply function from the plyr package:
library(plyr)
DF <- ddply(DF,.(family),transform,total=length(family))
DF <- ddply(DF,.(family),transform,actual=length(family)-sum(dead=="1")-sum(left=="1"))
DF
Thanks a lot to everyone who helped ! Deni
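For the variant the question originally asked about, adding a column N2 to the existing DF2, a minimal base R sketch could look like this. It rebuilds DF and DF2 from the question so that it runs on its own; gone is a made-up helper name.

```r
family <- factor(rep(c("001", "002", "003"), c(10, 8, 15)))
dead <- c(0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0)
left <- c(0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0)
DF <- data.frame(family, dead, left)

DF2 <- with(DF, data.frame(table(family)))
colnames(DF2)[2] <- "N"

# Members gone (dead or left) per family, as a vector named by family
gone <- tapply(DF$dead + DF$left, DF$family, sum)

# Look up each family's total by name so the row order of DF2 does not matter
DF2$N2 <- DF2$N - as.vector(gone[as.character(DF2$family)])
DF2
```

Like the answers above, this assumes that nobody is flagged as both dead and left; such a person would be subtracted twice.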
After performing a survey on perceived problems per neighborhood I get this dataframe. Since the survey had different options to choose from + an open one, the results on the open question are frequently irrelevant (see below):
library(dplyr)
library(splitstackshape)
df = read.csv("http://pastebin.com/raw.php?i=tQKHWMvL")
# Splitting multiple answers into different rows.
df = cSplit(df, "Problems", ",", direction = "long")
df = df %>%
group_by(Problems) %>%
summarise(Total = n()) %>%
mutate(freq = Total/sum(Total)*100) %>%
arrange(desc(freq))
Resulting in this data frame:
> df
Source: local data table [34 x 3]
Problems Total freq
1 Hurtos o robos sin violencia 245 25.6008359
2 Drogas 232 24.2424242
3 Peleas callejeras 162 16.9278997
4 Ningún problema 149 15.5694880
5 Agresiones 66 6.8965517
6 Robos con violencia 62 6.4785789
7 Quema contenedores 6 0.6269592
8 Ruidos 5 0.5224660
9 NS/NC 4 0.4179728
10 Desempleo 2 0.2089864
.. ... ... ...
>
As you can see, the results after row 9 are mostly irrelevant (only one or two respondents per option), so I'd like to group them into a single option (such as "others") without losing their relation to the neighborhood (that's why I can't simply rename the values now). Any suggestions?
The splitstackshape package imports the data.table package (so you don't even need to load it with library) and assigns a data.table class to your data set, so I would simply proceed with data.table syntax from there, especially because nothing beats data.table when it comes to assignments in a subset.
In other words, instead of this long dplyr pipeline, you can simply do
df[, freq := .N / nrow(df) * 100 , by = Problems]
df[freq < 6, Problems := "OTHER"]
And you're good to go.
You can check the new summary table using
df[, .(freq = .N/nrow(df) * 100), by = Problems][order(-freq)]
# 1: Hurtos o robos sin violencia 25.600836
# 2: Drogas 24.242424
# 3: Peleas callejeras 16.927900
# 4: Ningún problema 15.569488
# 5: Agresiones 6.896552
# 6: Robos con violencia 6.478579
# 7: OTHER 4.284222
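The same idea can be sketched in base R on a small made-up data set. The rows, the Neighborhood values and the 15% threshold below are invented for illustration and are not the survey data. Each row keeps its neighborhood, and only the rare labels are overwritten.

```r
# Hypothetical long-format survey data: one row per answer
survey <- data.frame(
  Neighborhood = c("A", "A", "B", "B", "C", "C", "A", "B", "C", "A"),
  Problems = c("Drogas", "Hurtos", "Drogas", "Hurtos", "Drogas",
               "Hurtos", "Drogas", "Hurtos", "Ruidos", "Desempleo"),
  stringsAsFactors = FALSE
)

threshold <- 15  # percent; an arbitrary cut-off chosen for this example

# Share of each answer over all rows, in percent
share <- table(survey$Problems) / nrow(survey) * 100

# Labels whose share falls below the cut-off
rare <- names(share)[share < threshold]

# Overwrite only the rare labels; the Neighborhood column is untouched
survey$Problems[survey$Problems %in% rare] <- "OTHER"

table(survey$Neighborhood, survey$Problems)
```

Because the replacement happens row by row, any later cross-tabulation by neighborhood still counts the lumped answers under "OTHER".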
How do I get a histogram-like summary of interval data in R?
My MWE data has four intervals.
interval range
Int1 2-7
Int2 10-14
Int3 12-18
Int4 25-28
I want a histogram-like function which counts how the intervals Int1-Int4 span a range split across fixed-size bins.
The function output should look like this:
bin count which
[0-4] 1 Int1
[5-9] 1 Int1
[10-14] 2 Int2 and Int3
[15-19] 1 Int3
[20-24] 0 None
[25-29] 1 Int4
Here the range is [minfloor(Int1, Int2, Int3, Int4), maxceil(Int1, Int2, Int3, Int4)) = [0,30), i.e. the smallest start rounded down and the largest end rounded up to a multiple of the bin size, and there are six bins of size 5.
I would greatly appreciate any pointers to R packages or functions that implement the functionality I want.
Update:
So far, I have a solution from the IRanges package which uses a fast data structure called NCList, which is faster than Interval Search Trees according to users.
> library(IRanges)
> subject <- IRanges(c(2,10,12,25), c(7,14,18,28))
> query <- IRanges(c(0,5,10,15,20,25), c(4,9,14,19,24,29))
> countOverlaps(query, subject)
[1] 1 1 2 1 0 1
But I am still unable to get which ranges overlap each bin. I will update if I get through.
With IRanges, you should use findOverlaps or mergeByOverlaps instead of countOverlaps. By default, though, this does not return the bins that have no matches.
I'll leave that to you. Instead, I'll show an alternative method using foverlaps() from the data.table package:
require(data.table)
subject <- data.table(interval = paste("int", 1:4, sep=""),
start = c(2,10,12,25),
end = c(7,14,18,28))
query <- data.table(start = c(0,5,10,15,20,25),
end = c(4,9,14,19,24,29))
setkey(subject, start, end)
ans = foverlaps(query, subject, type="any")
ans[, .(count = sum(!is.na(start)),
which = paste(interval, collapse=", ")),
by = .(i.start, i.end)]
# i.start i.end count which
# 1: 0 4 1 int1
# 2: 5 9 1 int1
# 3: 10 14 2 int2, int3
# 4: 15 19 1 int3
# 5: 20 24 0 NA
# 6: 25 29 1 int4
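For small inputs the same table can also be built in plain base R. The sketch below is a brute-force comparison of every bin against every interval, not an interval-tree method, so it is only meant to make the overlap rule explicit. It assumes integer, inclusive end points as in the example; the object names are illustrative.

```r
intervals <- data.frame(name = paste0("Int", 1:4),
                        start = c(2, 10, 12, 25),
                        end = c(7, 14, 18, 28),
                        stringsAsFactors = FALSE)
bin_size <- 5

# Round the overall range outwards to multiples of the bin size.
# The +1 is there because the end points are inclusive integers.
lo <- floor(min(intervals$start) / bin_size) * bin_size
hi <- ceiling((max(intervals$end) + 1) / bin_size) * bin_size
bin_start <- seq(lo, hi - bin_size, by = bin_size)
bin_end <- bin_start + bin_size - 1

# Two closed ranges overlap when each one starts before the other ends
hits <- lapply(seq_along(bin_start), function(i) {
  intervals$name[intervals$start <= bin_end[i] & intervals$end >= bin_start[i]]
})

result <- data.frame(
  bin = sprintf("[%.0f-%.0f]", bin_start, bin_end),
  count = lengths(hits),
  which = vapply(hits, function(h) {
    if (length(h)) paste(h, collapse = ", ") else "None"
  }, character(1)),
  stringsAsFactors = FALSE
)
result
```

This does one comparison per bin and interval pair, so for large inputs the foverlaps() or IRanges approaches above are the better fit.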
I would like to find the monthly usage of every aircraft (based on tailnum).
Let's say this is required for some kind of maintenance activity that needs to be done after x number of trips.
As of now I am doing it like this:
library(nycflights13)
library(dplyr)
N14228 <- filter(flights,tailnum=="N14228")
by_month <- group_by(N14228 ,month)
usage <- summarise(by_month,freq = n())
freq_by_months<- arrange(usage, desc(freq))
This has to be done for all aircraft, and the approach above won't scale because there are 4044 distinct tailnums.
I went through the dplyr vignette and found an example that comes very close to this, but it is aimed at finding overall delays, as shown below:
flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
Apart from this, I tried using aggregate and apply but couldn't get the desired results.
Check out the data.table package.
library(data.table)
flt <- data.table(flights)
flt[, .N, by = c("tailnum", "month")]
tailnum month N
1: N14228 1 15
2: N24211 1 14
3: N619AA 1 1
4: N804JB 1 29
5: N668DN 1 4
---
37984: N225WN 9 1
37985: N528AS 9 1
37986: N3KRAA 9 1
37987: N841MH 9 1
37988: N924FJ 9 1
Here, .N is the number of rows in each group, i.e. the number of flights for each combination of tailnum and month.
Not sure if this is exactly what you're looking for, but regardless, for these kinds of counts, it's hard to beat data.table for execution speed and syntactical simplicity.
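Since the question mentions trying aggregate without success, here is a minimal base R sketch of the same count. It runs on a few made-up rows rather than on the flights data, and the helper column n is an assumption added for the example.

```r
# Hypothetical stand-in for flights: one row per trip
trips <- data.frame(
  tailnum = c("N14228", "N14228", "N14228", "N24211", "N24211", "N619AA"),
  month = c(1, 1, 2, 1, 3, 3),
  stringsAsFactors = FALSE
)

# Give every trip a weight of 1, then sum the weights per tailnum and month
trips$n <- 1L
usage <- aggregate(n ~ tailnum + month, data = trips, FUN = sum)

# One row per aircraft and month, sorted for readability
usage[order(usage$tailnum, usage$month), ]
```

Note that the formula interface of aggregate drops rows with missing values by default, so flights without a recorded tailnum would not be counted.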