Combining irrelevant/similar observations into one (others) - r

After running a survey on perceived problems per neighborhood, I get the data frame below. Since the survey had fixed options to choose from plus an open one, the answers to the open question are frequently irrelevant (see below):
library(dplyr)
library(splitstackshape)
df = read.csv("http://pastebin.com/raw.php?i=tQKHWMvL")
# Splitting multiple answers into different rows.
df = cSplit(df, "Problems", ",", direction = "long")
df = df %>%
  group_by(Problems) %>%
  summarise(Total = n()) %>%
  mutate(freq = Total / sum(Total) * 100) %>%
  arrange(desc(freq))
Resulting in this data frame:
> df
Source: local data table [34 x 3]
Problems Total freq
1 Hurtos o robos sin violencia 245 25.6008359
2 Drogas 232 24.2424242
3 Peleas callejeras 162 16.9278997
4 Ningún problema 149 15.5694880
5 Agresiones 66 6.8965517
6 Robos con violencia 62 6.4785789
7 Quema contenedores 6 0.6269592
8 Ruidos 5 0.5224660
9 NS/NC 4 0.4179728
10 Desempleo 2 0.2089864
.. ... ... ...
>
As you can see, the results after row 9 are mostly irrelevant (only one or two respondents per option), so I'd like them grouped into a single option (such as "others") without losing their relation to the neighborhood (that's why I can't just rename the values now). Any suggestions?

The splitstackshape package imports data.table (so you don't even need to attach it yourself) and assigns the data.table class to your data set, so I would simply proceed with data.table syntax from there, especially because nothing beats data.table when it comes to assignment in a subset.
In other words, instead of the long dplyr pipe, you can simply do
df[, freq := .N / nrow(df) * 100, by = Problems]
df[freq < 6, Problems := "OTHER"]
And you're good to go.
You can check the new summary table using
df[, .(freq = .N/nrow(df) * 100), by = Problems][order(-freq)]
# 1: Hurtos o robos sin violencia 25.600836
# 2: Drogas 24.242424
# 3: Peleas callejeras 16.927900
# 4: Ningún problema 15.569488
# 5: Agresiones 6.896552
# 6: Robos con violencia 6.478579
# 7: OTHER 4.284222
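If you also want the regrouped totals broken down per neighborhood, the same syntax extends naturally. The line below is only a sketch: "Neighborhood" is a placeholder for whatever the survey's area column is actually called.
df[, .(Total = .N), by = .(Neighborhood, Problems)][order(Neighborhood, -Total)]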

Related

Use part of row data for new columns in R

I have a very large df with a column that contains the file directory for each row's data.
Example: D:Mouse_2174/experiment/13/trialsummary.txt.1
I would like to create 2 new columns, one with only the mouse ID (2174) and one with the session number (13). There will be different IDs and session numbers based on the row.
I've used sub as recommended here (match part of names in data.frame to new column), but can only get the subject column to say "D:Mouse_2174". I've added an additional line and can get it down to "D:Mous2174".
Is there a way to eliminate all chars before _ and after / to obtain mouse ID?
For session number, I'm not quite as sure what to do with multiple / in the directory name.
percent_correct_list$mouse_id <- sub("/.+", "", percent_correct_list$rn)
#gives me D:Mouse_2174
percent_correct_list$mouse_id <- sub("+._", "", percent_correct_list$mouse_id)
#gives me D:Mous2174
Here is sample code for the directories:
df <- data.frame(
rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
"D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
"D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
"D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")
)
What I want:
rn    id    session
D:..  2174  9
D:..  2181  33
D:..  2183  107
D:..  2185  87
Maybe there's some way to do this earlier in the process too (like when I import all the data into a df using lapply), but this is good as well.
This certainly isn't an elegant solution, and it only works if your ID and session are always numbers...
library(dplyr)
library(tidyr)
df <- data.frame(
  rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
         "D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
         "D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
         "D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")) %>%
  # Extract all runs of digits from the string (a list column)
  mutate(allnums = regmatches(rn, gregexpr("[[:digit:]]+", rn))) %>%
  # Separate them into one column per number
  separate(allnums, into = c("id", "session", "idk"), sep = "\\,") %>%
  # Strip the list-syntax leftovers and convert each piece individually
  mutate(id = as.numeric(gsub("\\D", "", id)),
         session = as.numeric(gsub("\\D", "", session))) %>%
  select(-idk)
Output:
1 D:Mouse_2174/iti_intervals/9/trialsummary.txt.1 2174 9
2 D:Mouse_2181/iti_intervals/33/trialsummary.txt.1 2181 33
3 D:Mouse_2183/iti_intervals/107/trialsummary.txt.2 2183 107
4 D:Mouse_2185/iti_intervals/87/trialsummary.txt.1 2185 87
Here's a somewhat long-winded solution, using tidyr::separate. Perhaps there is something more concise/elegant.
It does assume that all values of rn take the same format.
library(dplyr)
library(tidyr)
new_df <- df %>%
  # separate on / into 4 new columns
  separate(rn, into = paste0("item", 1:4), sep = "/", remove = FALSE) %>%
  # remove unwanted columns
  select(-item2, -item4) %>%
  # separate again on _ into 2 new columns
  separate(item1, sep = "_", into = c("prefix", "id")) %>%
  # retain and rename desired columns
  select(rn, id, session = item3)
Result:
rn id session
1 D:Mouse_2174/iti_intervals/9/trialsummary.txt.1 2174 9
2 D:Mouse_2181/iti_intervals/33/trialsummary.txt.1 2181 33
3 D:Mouse_2183/iti_intervals/107/trialsummary.txt.2 2183 107
4 D:Mouse_2185/iti_intervals/87/trialsummary.txt.1 2185 87
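If you prefer to skip the intermediate columns altogether, a pair of sub() calls with backreferences can pull each piece out directly. This is just a sketch; it assumes every rn follows the D:Mouse_<id>/<folder>/<session>/... pattern with a single underscore:
df$id <- sub("^.*_(\\d+)/.*$", "\\1", df$rn)
df$session <- sub("^[^/]+/[^/]+/(\\d+)/.*$", "\\1", df$rn)
# wrap in as.numeric() if numeric columns are needed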

How can I iterate a function over specific columns of a series of dataframes where I can set the order?

I work for an insurance company and I am trying to improve something that I built. I have about 150 data frames that look like this:
library(data.table)
dt_Premium<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Base_Premium_Fire= c(45,55,105,92),
Base_Premium_Water= c(20,21,24,29),
Base_Premium_Theft= c(3,5,6,7))
dt_Discount_Factors<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Discount_Factor_Fire= c(.9,.95,.99,.97),
Discount_Factor_Water= c(.8,.85,.9,.96),
Discount_Factor_Theft= c(1,1,1,1))
dt_Territory_Factors<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Territory_Factor_Fire= c(1.9,1.2,.91,1.03),
Territory_Factor_Water= c(1.03,1.3,1.25,1.01),
Territory_Factor_Theft= c(1,1.5,1,.5))
dt_Fixed_Expense<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Fixed_Expense_Fire= c(5,5,5,5),
Fixed_Expense_Water= c(7,7,7,7),
Fixed_Expense_Theft= c(9,9,9,9))
I take the base premium and then I multiply by factors, and then add a fixed expense at the very end. My code is currently something like:
dt_Final_Premium<-cbind(dt_Premium[,1],dt_Premium[,2:4]*
dt_Discount_Factors[,2:4]*
dt_Territory_Factors[,2:4]+
dt_Fixed_Expense[,2:4])
What I hate about this:
- The 2:4 stuff (I would like to be able to use a named range)
- The typing is monstrous considering all of the tables and policies I actually have
- It is very confusing for anybody except me (the author) to understand and edit/adjust the code
- I would like to be able to have each rating step as part of a list, and then just iterate over that list (or a similar process)
- Ideally I would be able to get the values at each step. For example:
step2_answer<-cbind(dt_Premium[,1],dt_Premium[,2:4]*
dt_Discount_Factors[,2:4])
There just has to be a way where I can take a data frame/data table and then just multiply or add to the next data frame/data table in the series. Thanks for taking a look at this!
How about something like this using dplyr?
Here I am using the same calculation you mentioned, but row-wise with dplyr's mutate function, which makes the step-by-step logic clear and easy for anyone to follow.
library(data.table)
library(dplyr)
dt_Premium <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Base_Premium_Fire= c(45,55,105,92),
Base_Premium_Water= c(20,21,24,29),
Base_Premium_Theft= c(3,5,6,7))
dt_Discount_Factors <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Discount_Factor_Fire= c(.9,.95,.99,.97),
Discount_Factor_Water= c(.8,.85,.9,.96),
Discount_Factor_Theft= c(1,1,1,1))
dt_Territory_Factors <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Territory_Factor_Fire= c(1.9,1.2,.91,1.03),
Territory_Factor_Water= c(1.03,1.3,1.25,1.01),
Territory_Factor_Theft= c(1,1.5,1,.5))
dt_Fixed_Expense <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Fixed_Expense_Fire= c(5,5,5,5),
Fixed_Expense_Water= c(7,7,7,7),
Fixed_Expense_Theft= c(9,9,9,9))
dt_Final_Premium <- cbind(dt_Premium[,1],dt_Premium[,2:4]*
dt_Discount_Factors[,2:4]*
dt_Territory_Factors[,2:4]+
dt_Fixed_Expense[,2:4])
new_dt_final_premium <- dt_Premium %>%
  # Join all the tables together
  left_join(dt_Discount_Factors, by = "Policy") %>%
  left_join(dt_Territory_Factors, by = "Policy") %>%
  left_join(dt_Fixed_Expense, by = "Policy") %>%
  # Perform the required calculation per peril
  mutate(
    Base_Premium_Fire =
      Base_Premium_Fire * Discount_Factor_Fire * Territory_Factor_Fire + Fixed_Expense_Fire,
    Base_Premium_Water =
      Base_Premium_Water * Discount_Factor_Water * Territory_Factor_Water + Fixed_Expense_Water,
    Base_Premium_Theft =
      Base_Premium_Theft * Discount_Factor_Theft * Territory_Factor_Theft + Fixed_Expense_Theft) %>%
  select(Policy, Base_Premium_Fire, Base_Premium_Water, Base_Premium_Theft)
Since your columns have clean, consistent names, some pivoting may do the work:
library(tidyverse) #to be run after library(data.table)
dt_Premium %>%
  left_join(dt_Discount_Factors, by = "Policy") %>%
  left_join(dt_Territory_Factors, by = "Policy") %>%
  left_join(dt_Fixed_Expense, by = "Policy") %>%
  pivot_longer(cols = -Policy) %>%
  separate(name, into = c("name", "object"), sep = "_.*_") %>%
  pivot_wider() %>%
  mutate(total = Base * Discount * Territory + Fixed) %>% # or calculate the value for a specific step
  select(Policy, object, total) %>%
  pivot_wider(names_from = "object", values_from = "total")
After joining all the tables, you can pivot to a long format, turning columns into rows. There you can separate the name into the real name (Base, Discount, Fixed, ...) and the object (Fire, Water, ...) and return to the wide format. The tricky part is getting a good regular expression, as your names use the underscore twice. Mine can be vastly improved but will do the work for now.
After this, you can calculate whatever you want, select only the result, and pivot to wide one last time. If you want to keep all the intermediate results, you may tweak this last pivot with prefixes.
Pivoting is quite a gymnastics, but it has proven very effective once you get used to it.
As you have a lot of tables, if you can get them as a list, you can also use purrr::reduce to join them all at once and simplify the first lines of code:
list(dt_Premium, dt_Discount_Factors, dt_Territory_Factors, dt_Fixed_Expense) %>%
  reduce(left_join, by = "Policy") %>%
  pivot_longer(cols = -Policy) %>%
  separate(name, into = c("name", "object"), sep = "_.*_") %>%
  pivot_wider() %>%
  mutate(total = Base * Discount * Territory + Fixed) %>% # or calculate the value for a specific step
  select(Policy, object, total) %>%
  pivot_wider(names_from = "object", values_from = "total")
Another option is to reorganize the data by converting it into a long format, merging, and then performing the calculations:
DT <- Reduce(merge, lapply(dtList, function(d) {
  vn <- sub('_([^_]*)$', '', names(d)[2L]) # see reference [1]
  melt(d, id.vars = "Policy", value.name = vn)[,
    variable := gsub("(.*)_(.*)_(.*)", "\\3", variable)]
}))
DT
DT[, disc_prem := Base_Premium * Discount_Factor][,
disc_prem_loc := disc_prem * Territory_Factor][,
Final_Premium := disc_prem_loc + Fixed_Expense]
output:
Policy variable Base_Premium Discount_Factor Territory_Factor Fixed_Expense disc_prem disc_prem_loc Final_Premium
1: Pol123 Fire 45 0.90 1.90 5 40.50 76.9500 81.9500
2: Pol123 Theft 3 1.00 1.00 9 3.00 3.0000 12.0000
3: Pol123 Water 20 0.80 1.03 7 16.00 16.4800 23.4800
4: Pol333 Fire 55 0.95 1.20 5 52.25 62.7000 67.7000
5: Pol333 Theft 5 1.00 1.50 9 5.00 7.5000 16.5000
6: Pol333 Water 21 0.85 1.30 7 17.85 23.2050 30.2050
7: Pol555 Fire 105 0.99 0.91 5 103.95 94.5945 99.5945
8: Pol555 Theft 6 1.00 1.00 9 6.00 6.0000 15.0000
9: Pol555 Water 24 0.90 1.25 7 21.60 27.0000 34.0000
10: Pol999 Fire 92 0.97 1.03 5 89.24 91.9172 96.9172
11: Pol999 Theft 7 1.00 0.50 9 7.00 3.5000 12.5000
12: Pol999 Water 29 0.96 1.01 7 27.84 28.1184 35.1184
data:
dtList <- list(dt_Premium, dt_Discount_Factors, dt_Territory_Factors, dt_Fixed_Expense)
Reference:
[1] regex-return-all-before-the-second-occurrence
I am guessing that reading some of the data.table vignettes would help you tighten up the syntax and make it more terse. Some of us think terse = 'more readable' in numeric programming; others think that represents some level of insanity:
vignette(package = "data.table")
Understanding Map, Reduce, mget and other functional notation in R and data.table may help. Here are some things I have done from a data.table mindset:
Dropping columns might be more terse by using 'i' to exclude a vector of cols:
dt[is.na(dt)] <- 0 # replace NA with 0
drop_col_list <- c('dropcol1', 'dropcol2', 'dropcol3') # cols to drop
# dt <- dt[!drop_col_list, sapply(dt, as.numeric)] # make selected dt cols numeric type
dt[!drop_col_list, SumCol := Reduce(`+`, dt)] # adds a Sum col with 'functional programming' iteration
The lapply(.SD, func) format is very powerful:
fsum <- function(x) {sum(x,na.rm=TRUE)}
dt[, lapply(.SD, fsum), .SDcols = c("col1", "col2", "col3", "col4")]
# or
dt[!drop_col_list,lapply(.SD,fsum)]
This shows applying data.table's assignment by reference (':=') together with mget to create columns derived from operations on two data.tables with functional programming. The data.tables may need to have the same nrow():
nm1 <- names(dt1)[1:4]
nm2 <- names(dt2)[1:4]
dt[, SumCol := Reduce(`+`, Map(`*`, mget(nm1), mget(nm2)))]
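To make that concrete with the premium tables from this question, here is a sketch; it assumes you first join on Policy so both column sets live in one data.table:
dt <- dt_Premium[dt_Discount_Factors, on = "Policy"] # join so both column sets are in one table
nm1 <- grep("^Base_Premium", names(dt), value = TRUE)
nm2 <- grep("^Discount_Factor", names(dt), value = TRUE)
dt[, SumCol := Reduce(`+`, Map(`*`, mget(nm1), mget(nm2)))] # sums premium * factor across perils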
The loop below isn't really data.table-esque programming, but it outputs a data.table. It probably isn't as fast as more idiomatic data.table syntax:
seqXpi <- function(x) {x * pi}
seqXexp <- function(x) {x * exp(1)}
l <- NULL
for (x in seq(1, 10, 1)) l <- as.data.table(rbind(l, cbind(seq = x, seqXpi = seqXpi(x), seqXexp = seqXexp(x))))
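Coming back to the original wish to have each rating step as part of a list and iterate over it, here is a minimal Reduce() sketch of that idea. It assumes all tables share the Policy column, have identical row order, and keep their peril columns in positions 2:4:
library(data.table)
# One multiplicative rating step: multiply the running premium by a factor table.
apply_factor <- function(prem, fac) cbind(prem[, 1], prem[, 2:4] * fac[, 2:4])
steps <- list(dt_Discount_Factors, dt_Territory_Factors)
rated <- Reduce(apply_factor, steps, init = dt_Premium, accumulate = TRUE)
# rated[[2]] is the premium after discounts (the "step 2" answer),
# rated[[3]] after territory factors; the fixed expense is added last:
dt_Final_Premium <- cbind(rated[[3]][, 1], rated[[3]][, 2:4] + dt_Fixed_Expense[, 2:4])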

Calculate the mean per subject and repeat the value for each subject's row

This is the first time I have asked a question on Stack Overflow. I have tried searching for the answer, but I cannot find exactly what I am looking for. I hope someone can help.
I have a huge data set of 20416 observations. Basically, I have 83 subjects, and for each subject I have several observations. However, the number of observations per subject is not the same (e.g. subject 1 has 256 observations while subject 2 has only 64).
I want to add an extra column containing the mean of the observations for each subject (the observations are reading times (RT)).
I tried with the aggregate function:
aggregate(RT ~ su, data, mean)
This formula returns the correct mean per subject. But then I cannot simply do the following:
data$mean <- aggregate(RT ~ su, data, mean)
as R returns this error:
Error in `$<-.data.frame`(`*tmp*`, "mean", value = list(su = 1:83, RT
= c(378.1328125, ... : replacement has 83 rows, data has 20416
I understand that the formula lacks something specifying that the mean for each subject has to be repeated for all of that subject's rows (e.g. if subject 1 has 256 rows, the mean for subject 1 has to be repeated for 256 rows; if subject 2 has 64 rows, the mean for subject 2 has to be repeated for 64 rows, and so forth).
How can I achieve this in R?
The data.table syntax lends itself well to this kind of problem:
Dt[, Mean := mean(Value), by = "ID"][]
# ID Value Mean
# 1: a 0.05881156 0.004426491
# 2: a -0.04995858 0.004426491
# 3: b 0.64054432 0.038809830
# 4: b -0.56292466 0.038809830
# 5: c 0.44254622 0.099747707
# 6: c -0.10771992 0.099747707
# 7: c -0.03558318 0.099747707
# 8: d 0.56727423 0.532377247
# 9: d -0.60962095 0.532377247
# 10: d 1.13808538 0.532377247
# 11: d 1.03377033 0.532377247
# 12: e 1.38789640 0.568760936
# 13: e -0.57420308 0.568760936
# 14: e 0.89258949 0.568760936
As we are applying a grouped operation (by = "ID"), data.table will automatically replicate each group's mean(Value) the appropriate number of times (avoiding the error you ran into above).
Data:
Dt <- data.table::data.table(
ID = sample(letters[1:5], size = 14, replace = TRUE),
Value = rnorm(14))[order(ID)]
Staying in base R, ave is intended for exactly this use:
data$mean = with(data, ave(x = RT, su, FUN = mean))
Simply merge your aggregated means back into the full data frame, joining by the subject:
aggdf <- aggregate(RT ~ su, data, mean)
names(aggdf)[2] <- "MeanOfRT"
data <- merge(data, aggdf, by = "su")
Another compelling way of handling this without generating extra data objects is group_by from the dplyr package:
library(dplyr)
# Generating some data
data <- data.table::data.table(
  su = sample(letters[1:5], size = 14, replace = TRUE),
  RT = rnorm(14))[order(su)]
# Performing the grouped mutate
> data %>% group_by(su) %>%
+ mutate(Mean = mean(RT)) %>%
+ ungroup()
Source: local data table [14 x 3]
su RT Mean
1 a -1.62841746 0.2096967
2 a 0.07286149 0.2096967
3 a 0.02429030 0.2096967
4 a 0.98882343 0.2096967
5 a 0.95407214 0.2096967
6 a 1.18823435 0.2096967
7 a -0.13198711 0.2096967
8 b -0.34897914 0.1469982
9 b 0.64297557 0.1469982
10 c -0.58995261 -0.5899526
11 d -0.95995198 0.3067978
12 d 1.57354754 0.3067978
13 e 0.43071258 0.2462978
14 e 0.06188307 0.2462978

Create new index / re-index in dplyr [duplicate]

This question already has answers here:
How to number/label data-table by group-number from group_by?
(6 answers)
Closed 6 years ago.
I am using a dplyr table in R. Typical fields would be a primary key, an id number identifying a group, a date field, and some values. I did some manipulation that throws out a bunch of data in some preliminary steps.
In order to do the next step of my analysis (in MC Stan), it'll be easier if both the date and the group id fields are integer indices. So basically, I need to re-index them as integers between 1 and the total number of distinct elements (about 750 for group_id and about 250 for date_id; the group_id is already an integer, but the date is not). This is relatively straightforward to do after exporting to a data frame, but I was curious whether it is possible in dplyr.
My attempt at creating a new date_val (called date_val_new) is below. Per the discussion in the comments, I have included some fake data. I purposefully made the group and date values not start at 1, but I didn't make the date an actual date. I made the data unbalanced, removing some values to illustrate the issue. The dplyr command below re-starts the index at 1 for each new group, regardless of the date_val. So every group starts at 1, even if the date is different.
library(dplyr)

df1 <- data.frame(id = 1:40,
                  group_id = (10 + rep(1:10, each = 4)),
                  date_val = (20 + rep(rep(1:4), 10)),
                  val = runif(40))
for (i in c(5, 17, 33)) {
  df1 <- df1[!df1$id == i, ]
}
df_new <- df1 %>%
  group_by(group_id) %>%
  arrange(date_val) %>%
  mutate(date_val_new = row_number(group_id)) %>%
  ungroup()
Here is the base R method (match() does the work; the pipe is just for convenience):
df1 %>% mutate(date_val_new = match(date_val, unique(date_val)))
Or with a data.table, df1[, date_val_new := .GRP, by=date_val].
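The same idea extends to re-indexing both fields in one pass. A quick sketch (match() numbers values by first appearance; the sort() makes the index follow the natural order of the values instead):
df1 %>% mutate(group_id_new = match(group_id, sort(unique(group_id))),
               date_val_new = match(date_val, sort(unique(date_val))))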
Use group_indices_() to generate a unique id for each group:
df1 %>% mutate(date_val_new = group_indices_(., .dots = "date_val"))
Update
Since group_indices() does not handle class tbl_postgres, you could try dense_rank()
copy_to(my_db, df1, name = "df1")
tbl(my_db, "df1") %>%
mutate(date_val_new = dense_rank(date_val))
Or build a custom query using sql()
tbl(my_db, sql("SELECT *,
DENSE_RANK() OVER (ORDER BY date_val) AS DATE_VAL_NEW
FROM df1"))
Alternatively, I think you can try getanID() from the splitstackshape package.
library(splitstackshape)
getanID(df1, "group_id")[]
# id group_id date_val val .id
# 1: 1 11 21 0.01857242 1
# 2: 2 11 22 0.57124557 2
# 3: 3 11 23 0.54318903 3
# 4: 4 11 24 0.59555088 4
# 5: 6 12 22 0.63045007 1
# 6: 7 12 23 0.74571297 2
# 7: 8 12 24 0.88215668 3

Find monthly plane/aircraft usage from the nycflights13 data set

I would like to find the monthly usage of all the aircraft (based on tailnum).
Let's say this is required for some kind of maintenance activity that needs to be done after x number of trips.
As of now I am doing it like below:
library(nycflights13)
library(dplyr)
N14228 <- filter(flights, tailnum == "N14228")
by_month <- group_by(N14228, month)
usage <- summarise(by_month, freq = n())
freq_by_months <- arrange(usage, desc(freq))
This has to be done for all aircraft, and the above approach won't work for that, as there are 4044 distinct tailnums.
I went through the dplyr vignette and found an example that comes very close to this, but it is aimed at finding overall delays, as shown below:
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
Apart from this, I tried using aggregate and apply but couldn't get the desired results.
Check out the data.table package.
library(data.table)
flt <- data.table(flights)
flt[, .N, by = c("tailnum", "month")]
tailnum month N
1: N14228 1 15
2: N24211 1 14
3: N619AA 1 1
4: N804JB 1 29
5: N668DN 1 4
---
37984: N225WN 9 1
37985: N528AS 9 1
37986: N3KRAA 9 1
37987: N841MH 9 1
37988: N924FJ 9 1
Here, .N counts the occurrences within each group.
Not sure if this is exactly what you're looking for, but regardless, for these kinds of counts, it's hard to beat data.table for execution speed and syntactical simplicity.
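For comparison, the same count can stay in dplyr, which the question started from. A short sketch using count(), which is shorthand for group_by() followed by summarise(n()):
library(dplyr)
library(nycflights13)
flights %>%
  count(tailnum, month) %>%
  arrange(desc(n))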
