I am trying to calculate family sizes from a data frame that also records two types of events: family members who died, and those who left the family. I would like to take both into account in order to compute the actual family size.
Here is a reproducible example of my problem, with only 3 families:
family <- factor(rep(c("001","002","003"), c(10,8,15)), levels=c("001","002","003"), labels=c("001","002","003"), ordered=TRUE)
dead <- c(0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0)
left <- c(0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0)
DF <- data.frame(family, dead, left) ; DF
I could count N = total family members (in each family) in a second data frame DF2, simply by using table():
DF2 <- with(DF, data.frame(table(family)))
colnames(DF2)[2] <- "N" ; DF2
family N
1 001 10
2 002 8
3 003 15
But I cannot find a proper way to get the actual number of people (for example, by creating a new variable N2 in DF2), calculated by subtracting from N the number of members who died or left the family. I suppose I have to relate the two data frames DF and DF2 in some way. I have looked for related questions on this site but could not find the right answer...
If anyone has a good idea, it would be great !
Thank you in advance..
Deni
Logic: first we group_by(family), then calculate two numbers: (i) the total number of observations in each group, and (ii) that total minus (sum(dead) + sum(left)).
In the dplyr package, n() gives the total number of observations in each group.
In data.table, .N does the same job.
library(dplyr)
DF %>% group_by(family) %>% summarise( total = n(), current = n()-sum(dead,left, na.rm = TRUE))
# family total current
# (fctr) (int) (dbl)
#1 001 10 6
#2 002 8 4
#3 003 15 7
library(data.table)
# setDT() converts a data.frame to a data.table by reference; if DF is already a data.table, use DF directly.
setDT(DF)[, .(total = .N, current = .N - sum(dead, left, na.rm = TRUE)), by = family]
# family total current
#1: 001 10 6
#2: 002 8 4
#3: 003 15 7
Here is a base R option
do.call(data.frame, aggregate(dl~family, transform(DF, dl = dead + left),
FUN = function(x) c(total=length(x), current=length(x) - sum(x))))
Or a modified version is
transform(aggregate(. ~ family, transform(DF, total = 1,
current = dead + left)[c(1,4:5)], FUN = sum), current = total - current)
# family total current
#1 001 10 6
#2 002 8 4
#3 003 15 7
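If the goal is to keep DF2 and add N2 to it, as the question asks, a minimal base R sketch could look like the following. It rebuilds DF and DF2 from the question; `gone` is a helper name introduced here for illustration, and the sketch assumes that no member is flagged as both dead and left (otherwise that person would be subtracted twice).

```r
# Data from the question
family <- factor(rep(c("001", "002", "003"), c(10, 8, 15)))
dead <- c(0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0)
left <- c(0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0)
DF <- data.frame(family, dead, left)

# N = members per family, as in the question
DF2 <- with(DF, data.frame(table(family)))
colnames(DF2)[2] <- "N"

# Per-family count of members who died or left
# ('gone' is a hypothetical helper name, not from the original post)
gone <- tapply(DF$dead + DF$left, DF$family, sum)

# Look up each family's count by name, so the row order of DF2 does not matter
DF2$N2 <- DF2$N - as.vector(gone[as.character(DF2$family)])
DF2
# N2 is 6, 4, 7 for families 001, 002, 003
```

If one person could be both dead and left, counting with `tapply(DF$dead == 1 | DF$left == 1, DF$family, sum)` instead would subtract that person only once.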
I finally found another approach that works fine (from another post) and computes everything from the original DF table. It uses the ddply function from the plyr package:
library(plyr)
DF <- ddply(DF,.(family),transform,total=length(family))
DF <- ddply(DF,.(family),transform,actual=length(family)-sum(dead=="1")-sum(left=="1"))
DF
Thanks a lot to everyone who helped ! Deni
Hello coding community
I have a two-part question that is half answered:
1. transpose, aka melt, the data frame to my liking - done
2. add rows of data based on the values found in the "removed" column, a column created in the transposing step - stuck here
df<- read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t")
df_transformed<-tidyr::gather(df, day, removed, -(1:2), na.rm = TRUE) # melted data
In my example here (df), I have an experiment run over 8 days. On certain days I remove data points, and I am only interested in those days (which is why I added na.rm = TRUE in the transposing step). Sometimes I remove 1 data point, sometimes 4 (but this could be any number, really).
I would like the removed data points to be called "individuals" and to be numbered in chronological order. Therefore, I first need to add a column called "individual":
df_transformed$individual <- ""
I would like to fill in the "individual" column based on the values in the "removed" column.
Example: cage 2 had only 1 data point removed, on day_8, so I would like a 1 in the "individual" column. Cage 4, on the other hand, had data points removed on day_5 (1 data point) and day_7 (3 data points), for a total of 4 data points, i.e. 4 "individuals". For cage 4, the day_5 row should therefore get a 1 in the "individual" column, and day_7 should become 3 rows that continue the count with 2, 3, 4. If day_8 had 3 more data points removed, the count would continue with 5, 6, 7.
My desired result for my example data set today would be this:
desired_results <- read.table("https://pastebin.com/raw/r7QrC0y3", header=T, sep="\t") # 68 total rows of data
Interesting piece of information: The total number of rows in my final data set should equal the sum of all removed data points:
sum(df_transformed$removed) # 68
Thank you StackOverflow community. Looking forward to seeing the results.
We can use complete to create a sequence from 1 to each individual grouped by cage and day. We then fill the NA values in columns experiment and removed.
library(dplyr)
library(tidyr)
df_transformed %>%
mutate(individual = removed) %>%
group_by(cage, day) %>%
complete(individual = seq_len(individual)) %>%
fill(experiment, removed, .direction = "up")
# cage day individual experiment removed
#1 2 day_8 1 sugar 1
#2 3 day_5 1 sugar 1
#3 4 day_5 1 sugar 3
#4 4 day_5 2 sugar 3
#5 4 day_5 3 sugar 3
#6 4 day_7 1 sugar 1
#7 7 day_7 1 sugar 1
#8 7 day_8 1 sugar 1
#9 8 day_5 1 sugar 2
#10 8 day_5 2 sugar 2
# … with 58 more rows
To number individuals consecutively within each cage (continuing the count across days), we can do:
df_transformed %>%
mutate(individual = removed) %>%
group_by(cage, day) %>%
complete(individual = seq_len(individual)) %>%
group_by(cage) %>%
mutate(individual = row_number()) %>%
fill(experiment, removed, .direction = "up")
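For comparison, the same expansion can be sketched in base R with rep() and ave(). The toy data frame below is hypothetical (it only mimics the shape of df_transformed, since the pastebin data is not reproduced here), and the sketch assumes the rows are already sorted chronologically within each cage.

```r
# Toy data (hypothetical values), in long format and sorted by cage and day
toy <- data.frame(
  cage = c(2, 4, 4),
  experiment = "sugar",
  day = c("day_8", "day_5", "day_7"),
  removed = c(1, 1, 3)
)

# Repeat each row 'removed' times
expanded <- toy[rep(seq_len(nrow(toy)), times = toy$removed), ]
rownames(expanded) <- NULL

# Running count of individuals within each cage, carried across days
expanded$individual <- ave(seq_len(nrow(expanded)), expanded$cage, FUN = seq_along)
expanded
# cage 2 gets individual 1; cage 4 gets 1 (day_5) and then 2, 3, 4 (day_7)
```

The number of rows equals sum(toy$removed), which matches the check described in the question. Note that sorting on the text labels would put "day_10" before "day_5", so with more than 9 days the day number would need to be extracted first.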
I think the following bit of code does what you need:
library(tidyverse)
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE))
I have used the pipe operator (%>%), which enables cleaner syntax. I have also used the newer pivot_longer function instead of gather. Grouping by cage and then summing the removed column with summarize gives the number of individuals removed per cage.
I checked the sum of all the individuals and it seems to work:
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE)) %>%
pull(individual) %>%
sum()
#> [1] 68
The result is slightly different from your desired result. I am not 100% sure your desired result is actually correct... From your question, I understand that cage 4 should have 4 individuals, but in your desired_results it appears 4 times, with values 1, 2, 3 and 4. The code above generates a data frame where each cage appears in a single row.
I am having trouble using combinations of ddply and merge to aggregate some variables. The data frame that I am using is really large, so I am putting an example below:
data_sample <- cbind.data.frame(c(123,123,123,321,321,134,145,000),
c('j', 'f','j','f','f','o','j','f'),
c(seq(110,180, by = 10)))
colnames(data_sample) <- c('Person','Expense_Type','Expense_Value')
I want to calculate, for each person, the percentage of the value of expense of type j on the person's total expense.
data_sample2 <- ddply(data_sample, c('Person'), transform, total = sum(Expense_Value))
data_sample2 <- ddply(data_sample2, c('Person','Expense_Type'), transform, empresa = sum(Expense_Value))
This is what I've done to list total expenses by type, but the problem is that not all individuals have expenses of type j, so their percentage should be 0, and I don't know how to keep only one line per person with the percentage of total expenses that are of type j.
I might have not made myself clear.
Thank you!
We can use the by function:
by(data_sample, data_sample$Person, FUN = function(dat){
sum(dat[dat$Expense_Type == 'j',]$Expense_Value) / sum(dat$Expense_Value)
})
We could also make use of the dplyr package:
library(dplyr)
data_sample %>%
group_by(Person) %>%
summarise(Percent_J = sum(ifelse(Expense_Type == 'j', Expense_Value, 0)) / sum(Expense_Value))
# A tibble: 5 × 2
Person Percent_J
<dbl> <dbl>
1 0 0.0000000
2 123 0.6666667
3 134 0.0000000
4 145 1.0000000
5 321 0.0000000
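A base R sketch of the same calculation, returning one row per person, might look like this. The names `total`, `j_only` and `result` are introduced here for illustration, and the share is multiplied by 100 on the assumption that an actual percentage is wanted rather than a fraction.

```r
# Sample data from the question (000 is simply the number 0)
data_sample <- data.frame(
  Person = c(123, 123, 123, 321, 321, 134, 145, 0),
  Expense_Type = c("j", "f", "j", "f", "f", "o", "j", "f"),
  Expense_Value = seq(110, 180, by = 10)
)

# Total expense per person
total <- tapply(data_sample$Expense_Value, data_sample$Person, sum)

# Type-j expense per person; rows of other types contribute 0,
# so people without any j expense still appear, with a sum of 0
j_only <- tapply(data_sample$Expense_Value * (data_sample$Expense_Type == "j"),
                 data_sample$Person, sum)

# Both tapply() calls group by the same variable, so the results are aligned
result <- data.frame(
  Person = as.numeric(names(total)),
  Percent_J = 100 * as.vector(j_only / total)
)
result
```

Multiplying by the logical `Expense_Type == "j"` plays the same role as the ifelse() in the dplyr version: it zeroes out the other expense types instead of dropping their rows.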
This is my first time asking a question on Stack Overflow. I have tried searching for the answer but I cannot find exactly what I am looking for. I hope someone can help.
I have a large data set of 20416 observations. Basically, I have 83 subjects, and for each subject I have several observations. However, the number of observations per subject is not the same (e.g. subject 1 has 256 observations, while subject 2 has only 64 observations).
I want to add an extra column containing the mean of the observations for each subject (the observations are reading times (RT)).
I tried with the aggregate function:
aggregate (RT ~ su, data, mean)
This formula returns the correct mean per subject. But then I cannot simply do the following:
data$mean <- aggregate (RT ~ su, data, mean)
as R returns this error:
Error in $<-.data.frame(tmp, "mean", value = list(su = 1:83, RT
= c(378.1328125, : replacement has 83 rows, data has 20416
I understand that the formula lacks a command specifying that the mean for each subject has to be repeated for all the subject's rows (e.g. if subject 1 has 256 rows, the mean for subject 1 has to be repeated for 256 rows, if subject 2 has 64 rows, the mean for subject 2 has to be repeated for 64 rows and so forth).
How can I achieve this in R?
The data.table syntax lends itself well to this kind of problem:
Dt[, Mean := mean(Value), by = "ID"][]
# ID Value Mean
# 1: a 0.05881156 0.004426491
# 2: a -0.04995858 0.004426491
# 3: b 0.64054432 0.038809830
# 4: b -0.56292466 0.038809830
# 5: c 0.44254622 0.099747707
# 6: c -0.10771992 0.099747707
# 7: c -0.03558318 0.099747707
# 8: d 0.56727423 0.532377247
# 9: d -0.60962095 0.532377247
# 10: d 1.13808538 0.532377247
# 11: d 1.03377033 0.532377247
# 12: e 1.38789640 0.568760936
# 13: e -0.57420308 0.568760936
# 14: e 0.89258949 0.568760936
As we are applying a grouped operation (by = "ID"), data.table will automatically replicate each group's mean(Value) the appropriate number of times (avoiding the error you ran into above).
Data:
Dt <- data.table::data.table(
ID = sample(letters[1:5], size = 14, replace = TRUE),
Value = rnorm(14))[order(ID)]
Staying in Base R, ave is intended for this use:
data$mean = with(data, ave(x = RT, su, FUN = mean))
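To see why ave() avoids the "replacement has 83 rows" error, here is a small sketch on made-up data (the values are hypothetical; only the column names su and RT follow the question). ave() returns a vector as long as its input, with each group's mean repeated on every row of that group.

```r
# Hypothetical toy data: subject 1 has 3 observations, subject 2 has 2
data <- data.frame(su = c(1, 1, 1, 2, 2),
                   RT = c(300, 400, 500, 350, 550))

# One value per row, not one per subject
data$mean <- ave(data$RT, data$su, FUN = mean)
data
# mean is 400 on the three rows of subject 1 and 450 on the two rows of subject 2
```

aggregate(), by contrast, returns one row per subject, which is why its result cannot be assigned directly as a column.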
Simply merge your aggregated means with the full data frame, joined by subject:
aggdf <- aggregate (RT ~ su, data, mean)
names(aggdf)[2] <- "MeanOfRT"
data <- merge(data, aggdf, by="su")
Another compelling way of handling this without generating extra data objects is by using group_by of dplyr package:
# Generating some data
data <- data.table::data.table(
su = sample(letters[1:5], size = 14, replace = TRUE),
RT = rnorm(14))[order(su)]
# Performing the grouped mean (requires dplyr)
library(dplyr)
data %>% group_by(su) %>%
  mutate(Mean = mean(RT)) %>%
  ungroup()
Source: local data table [14 x 3]
su RT Mean
1 a -1.62841746 0.2096967
2 a 0.07286149 0.2096967
3 a 0.02429030 0.2096967
4 a 0.98882343 0.2096967
5 a 0.95407214 0.2096967
6 a 1.18823435 0.2096967
7 a -0.13198711 0.2096967
8 b -0.34897914 0.1469982
9 b 0.64297557 0.1469982
10 c -0.58995261 -0.5899526
11 d -0.95995198 0.3067978
12 d 1.57354754 0.3067978
13 e 0.43071258 0.2462978
14 e 0.06188307 0.2462978
I am using a dplyr table in R. Typical fields would be a primary key, an id number identifying a group, a date field, and some values. I did some manipulation that throws out a bunch of data in some preliminary steps.
In order to do the next step of my analysis (in MC Stan), it'll be easier if both the date and the group id fields are integer indices. So basically, I need to re-index them as integers between 1 and the total number of distinct elements (about 750 for group_id and about 250 for date_id; the group_id is already an integer, but the date is not). This is relatively straightforward to do after exporting it to a data frame, but I was curious whether it is possible in dplyr.
My attempt at creating a new date_val (called date_val_new) is below. Per the discussion in the comments I have some fake data. I purposefully made the group and date values not be 1 to whatever, but I didn't make the date an actual date. I made the data unbalanced, removing some values to illustrate the issue. The dplyr command re-starts the index at 1 for each new group, regardless of what date_val it is. So every group starts at 1, even if the date is different.
df1 <- data.frame(id = 1:40,
group_id = (10 + rep(1:10, each = 4)),
date_val = (20 + rep(rep(1:4), 10)),
val = runif(40))
for (i in c(5, 17, 33))
{
df1 <- df1[!df1$id == i, ]
}
df_new <- df1 %>%
group_by(group_id) %>%
arrange(date_val) %>%
mutate(date_val_new=row_number(group_id)) %>%
ungroup()
This uses base R's match() and unique() inside mutate():
df1 %>% mutate(date_val_new = match(date_val, unique(date_val)))
Or with a data.table, df1[, date_val_new := .GRP, by=date_val].
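A self-contained base R sketch of the re-indexing is shown below. It rebuilds df1 from the question without the random val column. One caveat worth hedging: match(x, unique(x)) numbers values in order of first appearance, so the sketch sorts the unique values first to get indices that follow the order of the dates; as.integer(factor(x)) is shown as an equivalent for group_id.

```r
# Rebuild the example data from the question (val column omitted)
df1 <- data.frame(id = 1:40,
                  group_id = 10 + rep(1:10, each = 4),
                  date_val = 20 + rep(1:4, 10))
df1 <- df1[!df1$id %in% c(5, 17, 33), ]  # make the data unbalanced

# Dense integer index for dates: 1 for the smallest date, 2 for the next, ...
df1$date_val_new <- match(df1$date_val, sort(unique(df1$date_val)))

# Same idea for groups, via factor levels (which are sorted by default)
df1$group_id_new <- as.integer(factor(df1$group_id))

head(df1)
```

Because the index is computed over the whole column rather than within groups, a given date gets the same index in every group, which is what the grouped row_number() attempt in the question does not do.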
Use group_indices_() to generate a unique id for each group:
df1 %>% mutate(date_val_new = group_indices_(., .dots = "date_val"))
Update
Since group_indices() does not handle class tbl_postgres, you could try dense_rank()
copy_to(my_db, df1, name = "df1")
tbl(my_db, "df1") %>%
mutate(date_val_new = dense_rank(date_val))
Or build a custom query using sql()
tbl(my_db, sql("SELECT *,
DENSE_RANK() OVER (ORDER BY date_val) AS DATE_VAL_NEW
FROM df1"))
Alternatively, I think you can try getanID() from the splitstackshape package.
library(splitstackshape)
getanID(df1, "group_id")[]
# id group_id date_val val .id
# 1: 1 11 21 0.01857242 1
# 2: 2 11 22 0.57124557 2
# 3: 3 11 23 0.54318903 3
# 4: 4 11 24 0.59555088 4
# 5: 6 12 22 0.63045007 1
# 6: 7 12 23 0.74571297 2
# 7: 8 12 24 0.88215668 3
I would like to know if there is a simple way to achieve what I describe below using ddply. My data frame describes an experiment with two conditions. Participants had to select between options A and B, and we recorded how long they took to decide, and whether their responses were accurate or not.
I use ddply to create averages by condition. The column nAccurate summarizes the number of accurate responses in each condition. I also want to know how much time they took to decide and express it in the column RT. However, I want to calculate average response times only when participants got the response right (i.e. Accuracy==1). Currently, the code below can only calculate average reaction times for all responses (accurate and inaccurate ones). Is there a simple way to modify it to get average response times computed only in accurate trials?
See sample code below and thanks!
library(plyr)
# Create sample data frame.
Condition = c(rep(1,6), rep(2,6)) #two conditions
Response = c("A","A","A","A","B","A","B","B","B","B","A","A") #whether option "A" or "B" was selected
Accuracy = rep(c(1,1,0),4) #whether the response was accurate or not
RT = c(110,133,121,122,145,166,178,433,300,340,250,674) #response times
df = data.frame(Condition,Response, Accuracy,RT)
head(df)
Condition Response Accuracy RT
1 1 A 1 110
2 1 A 1 133
3 1 A 0 121
4 1 A 1 122
5 1 B 1 145
6 1 A 0 166
# Calculate averages.
avg <- ddply(df, .(Condition), summarise,
N = length(Response),
nAccurate = sum(Accuracy),
RT = mean(RT))
# The problem: response times are calculated over all trials. I would like
# to calculate mean response times *for accurate responses only*.
avg
Condition N nAccurate RT
1 6 4 132.8333
2 6 4 362.5000
With plyr, you can do it as follows:
ddply(df,
.(Condition), summarise,
N = length(Response),
nAccurate = sum(Accuracy),
RT = mean(RT[Accuracy==1]))
This gives:
Condition N nAccurate RT
1: 1 6 4 127.50
2: 2 6 4 300.25
If you use data.table, then this is an alternative way:
library(data.table)
setDT(df)[, .(N = .N,
nAccurate = sum(Accuracy),
RT = mean(RT[Accuracy==1])),
by = Condition]
Using the dplyr package:
library(dplyr)
df %>%
group_by(Condition) %>%
summarise(N = n(),
nAccurate = sum(Accuracy),
RT = mean(RT[Accuracy == 1]))
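For completeness, a base R sketch of the same idea filters to accurate trials first and then aggregates. It rebuilds the sample data from the question (without the Response column); `acc_rt` is a name introduced here for illustration.

```r
# Sample data from the question
Condition <- c(rep(1, 6), rep(2, 6))
Accuracy <- rep(c(1, 1, 0), 4)
RT <- c(110, 133, 121, 122, 145, 166, 178, 433, 300, 340, 250, 674)
df <- data.frame(Condition, Accuracy, RT)

# Keep accurate trials only, then average RT per condition
acc_rt <- aggregate(RT ~ Condition, data = df[df$Accuracy == 1, ], FUN = mean)
acc_rt
# RT is 127.50 for condition 1 and 300.25 for condition 2
```

One difference from the versions above: a condition with no accurate trials would be missing from this result entirely, whereas mean(RT[Accuracy == 1]) inside a grouped summary keeps the row and returns NaN.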