Obtaining mean of dataframe by value groups in another dataframe - r

Once again I consult your wisdom.
I have 2 dataframes of the form:
**data1sample**
ID value
water 3
water 5
fire 1
fire 3
fire 2
air 1
**data2controls**
ID value
water 1
fire 3
air 5
I want to use the values in my control dataframe (data2controls) and know their corresponding percentile in the sample distribution (data1sample). I have to classify each sample by their ID (meaning, get fire control against fire sample, and water against water, etc), but I haven't been able to do so.
I am using the command:
mean(data1sample[data1sample$ID == data2controls$ID,] <= data2controls$value)
but I get the error
In Ops.factor(left, right) : ‘<=’ not meaningful for factors
What I am after is basically the percentile of the value in dataframe2 calculated based on the samples of dataframe1 (I am trying to obtain the percentile as in percentile = mean(data1sample$value(by ID) <= dataframe2$value))
So something like this:
**data2controls**
ID value percentile(based on data1 sample values)
water 1 .30
fire 3 .14
air 5 .1
Please disregard the percentile values, they're just made up to show desired output.
I'd love if someone could give me a hand! Thanks!!

Its hard to answer without the desired output, but I will try to guess it here:
library(dplyr)
data1sample <- data.frame(ID = c("water", "water", "fire", "fire", "fire", "air"), value = c(3,5,1,3,2,1))
data2sample <- data.frame(ID = c("water", "fire", "air"), value = c(1,3,5))
by_ID <- data1sample %>% group_by(ID) %>% summarise(control = mean(value))
data2sample %>% inner_join(by_ID)
#> Joining, by = "ID"
#> ID value control
#> 1 water 1 4
#> 2 fire 3 2
#> 3 air 5 1

This gives the result I think you're after?
for(i in d2$ID){
x <- mean(d1[d1$ID == i & d1$value <= d2[d2$ID == i, 'value'], 'value'])
print(x)
}
Based on the data you provided it returns NaN for water because there are no 'water's that meet your criterion, and so div by 0

Related

Assign variables in groups based on fractions and several conditions

I've tried for several days on something I think should be rather simple, with no luck. Hope someone can help me!
I have a data frame called "test" with the following variables: "Firm", "Year", "Firm_size" and "Expenditures".
I want to assign firms to size groups by year and then display the mean, median, std.dev and N of expenditures for these groups in a table (e.g. stargazer). So the first size group (top 10% largest firms) should show the mean, median ++ of expenditures for the 10% largest firms each year.
The size groups should be,
The 10% largest firms
The firms that are between 10-25% largest
The firms that are between 25-50% largest
The firms that are between 50-75% largest
The firms that are between 75-90% largest
The 10% smallest firms
This is what I have tried:
test<-arrange(test, -Firm_size)
test$Variable = 0
test[1:min(5715, nrow(test)),]$Variable <- "Expenditures, 0% size <10%"
test[5715:min(14288, nrow(test)),]$Variable <- "Expenditures, 10% size <25%"
test[14288:min(28577, nrow(test)),]$Variable <- "Expenditures, 25% size <50%"
--> And so on
library(dplyr)
testtest = test%>%
group_by(Variable)%>%
dplyr::summarise(
Mean=mean(Expenditures),
Median=median(Expenditures),
Std.dev=sd(Expenditures),
N=n()
)
stargazer(testtest, type = "text", title = "Expenditures firms", digits = 1, summary = FALSE)
As shown over, I dont know how I could use fractions/group by percentage. I have therefore tried to assign firms in groups based on their rows after having arranged Firm_size to descending. The problem with doing so is that I dont take year in to consideration which I need to, and it is a lot of work to do this for each year (20 in total).
My intention was to make a new variable which gives each size group a name. E.g. top 10% largest firms each year should get a variable with the name "Expenditures, 0% size <10%"
Further I make a new dataframe "testtest" where I calculate the different measures, before using the stargazer to present it. This works.
!!EDIT!!
Hi again,
Now I get the error "List object cannot be coerced to type double" when running the code on a new dataset (but it is the same variables as before).
The mutate-step I'm referring to is the "mutate(gs = cut ++" after "rowwise()" in the solution you provided.
enter image description here
The_code
The_error
You can create the quantiles as a nested variable (size_groups), and then use cut() to create the group sizes (gs). Then group by Year and gs to summarize the indicators you want.
test %>%
group_by(Year) %>%
mutate(size_groups = list(quantile(Firm_size, probs=c(.1,.25,.5,.75,.9)))) %>%
rowwise() %>%
mutate(gs = cut(
Firm_size,c(-Inf, size_groups, Inf),
labels = c("Lowest 10%","10%-25%","25%-50%","50%-75%","75%-90%","Highest 10%"))) %>%
group_by(Year, gs) %>%
summarize(across(Expenditures,.fns = list(mean,median,sd,length)), .groups="drop") %>%
rename_all(~c("Year", "Group_Size", "Mean_Exp", "Med_Exp", "SD_Exp","N_Firms"))
Output:
# A tibble: 126 x 6
Year Group_Size Mean_Exp Med_Exp SD_Exp N_Firms
<int> <fct> <dbl> <dbl> <dbl> <int>
1 2000 Lowest 10% 20885. 21363. 3710. 3
2 2000 10%-25% 68127. 69497. 19045. 4
3 2000 25%-50% 42035. 35371. 30335. 6
4 2000 50%-75% 36089. 29802. 17724. 6
5 2000 75%-90% 53319. 54914. 19865. 4
6 2000 Highest 10% 57756. 49941. 34162. 3
7 2001 Lowest 10% 55945. 47359. 28283. 3
8 2001 10%-25% 61825. 70067. 21777. 4
9 2001 25%-50% 65088. 76340. 29960. 6
10 2001 50%-75% 57444. 53495. 32458. 6
# ... with 116 more rows
If you wanted to have an additional column with the yearly mean, you can remove the .groups="drop" from the summarize(across()) line, and then add this final line to the pipeline:
mutate(YrMean = sum(Mean_Exp*N_Firms/sum(N_Firms)))
Note that this is correctly weighted by the number of Firms in each Group_size, and thus returns the equivalent of doing this with the original data
test %>% group_by(Year) %>% summarize(mean(Expenditures))
Input Data:
set.seed(123)
test = data.frame(
Firm = replicate(2000, sample(letters,1)),
Year = sample(2000:2020, 2000, replace=T),
Firm_size= ceiling(runif(2000,2000,5000)),
Expenditures = runif(2000, 10000,100000)
) %>% group_by(Firm,Year) %>% slice_head(n=1)

R - Selecting the values of a column based on the values of another row

I have a dataframe df that looks like this
City TreeID Age Diameter
City_1 X 1 6
City_1 Y 2 5
City_2 Y 3 5
City_3 X 4 10
I have a variable nominal that can be "TreeId", "Age" or Diameter and another variable city that has the name of a city stored in string format.
I want to be able to pick up only the values of the column nominal for the correct city name
An example : city = "City_1" and nominal = "Age", then I should only pick up the values 1 and 2
I searched on here but nothing I could find was adapted to my case because I use variables so I don't know in advance which column I want to choose. I am lost, any help is appreciated. Thanks
Here's a function to do that.
return_value <- function(data, city, nominal) {
data[data$City == city, nominal]
}
return_value(df, 'City_1', 'Age')
#[1] 1 2
return_value(df, 'City_2', 'Diameter')
#[1] 5
You coul use pull:
library(dplyr)
df %>%
filter(City == "City_1") %>%
pull(Age)
df %>%
filter(City == "City_1") %>%
pull(Diameter)
[1] 1 2
[1] 6 5

R - Dplyr - How to mutate rows

I found that dplyr is speedy and simple for aggregate and summarise data. But I can't find out how to solve the following problem with dplyr.
Given these data frames:
df_2017 <- data.frame(expand.grid(1:195,1:65,1:39),
value = sample(1:1000000,(195*65*39)),
period = rep("2017",(195*65*39)),
stringsAsFactors = F)
df_2017 <- df_2017[sample(1:(195*65*39),450000),]
names(df_2017) <- c("company", "product", "acc_concept", "value", "period")
df_2017$company <- as.character(df_2017$company)
df_2017$product <- as.character(df_2017$product)
df_2017$acc_concept <- as.character(df_2017$acc_concept)
df_2017$value <- as.numeric(df_2017$value)
ratio_df <- data.frame(concept=c("numerator","numerator","numerator","denom", "denom", "denom","name"),
ratio1=c("1","","","4","","","Sales over Assets"),
ratio2=c("1","","","5","6","","Sales over Expenses A + B"), stringsAsFactors = F)
where the columns in df_2017 are:
company = This is a categorical variable with companies from 1 to 195
product = This is a categorical, with home apliance products from 1 to 65. For example, 1 could be equal to irons, 2 to television, etc
acc_concept = This is a categorical variable with accounting concepts from 1 to 39. For example, 1 would be equal to "Sales", 2 to "Total Expenses", 3 to Returns", 4 to "Assets, etc
value = This is a numeric variable, with USD from 1 to 100.000.000
period = Categorical variable. Always 2017
As the expand.grid implies, the combinations of company - product - acc_concept are never duplicated, but, It could happen that certain subjects have not every company - product - acc_concept combinations. That's why the code line "df_2017 <- df_2017[sample(1:195*65*39),450000),]", and that's why the output could turn out into NA (see below).
And where the columns in ratio_df are:
Concept = which acc_concept corresponds to the numerator, which one to
denominator, and which is name of the ratio
ratio1 = acc_concept and name for ratio1
ratio2 = acc_concept and name for ratio2
I want to calculate 2 ratios (ratio_df) between acc_concept, for each product within each company.
For example:
I take the first ratio "acc_concepts" and "name" from ratio_df:
num_acc_concept <- ratio_df[ratio_df$concept == "numerator", 2]
denom_acc_concept <- ratio_df[ratio_df$concept == "denom", 2]
ratio_name <- ratio_df[ratio_df$concept == "name", 2]
Then I calculate the ratio for one product of one company, just to show you want i want to do:
ratio1_value <- sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% num_acc_concept, 4]) / sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% denom_acc_concept, 4])
Output:
output <- data.frame(Company="1", Product="1", desc_ratio=ratio_name, ratio_value = ratio1_value, stringsAsFactors = F)
As I said before I want to do this for each product within each company
The output data.frame could be something like this (ratios aren't the true ones because I haven't done the calculations yet):
company product desc_ratio ratio_value
1 1 Sales over Assets 0.9303675
1 3 Sales over Assets 1.30
1 7 Sales over Assets Nan
1 1 Sales over Expenses A + B Inf
1 2 Sales over Expenses A + B 2.32
1 3 Sales over Expenses A + B NA
2
3
and so on...
NaN when ratio is 0 / 0
Inf when ratio is number / 0
NA when there is no data for certain company and product.
I hope I have made myself clear...
Is there any way to solve this row problem with dplyr? Should I cast the df_2017?

Subset rows for each group based on a character in a column and order of occurrence in a data frame

I have a data similar to this.
B <- data.frame(State = c(rep("Arizona", 8), rep("California", 8), rep("Texas", 8)),
Account = rep(c("Balance", "Balance", "In the Bimester", "In the Bimester", "Expenses",
"Expenses", "In the Bimester", "In the Bimester"), 3), Value = runif(24))
You can see that Account has 4 occurrences of the element "In the Bimester", two "chunks" of two elements for each state, "Expenses" in between them.
The order here matters because the first chunk is not referring to the same thing as the second chunk.
My data is actually more complex, It has a 4th variable, indicating what each row of Account means. The number of its elements for each Account element (factor per se) can change. For example, In some state, the first "chunk" of "In the Bimester" can have 6 rows and the second, 7; but, I cannot differentiate by this 4th variable.
Desired: I'd like to subset my data, spliting those two "In the Bimester" by each state, subsetting only the rows of the first "chunks" by each state or the second "chunks".
I have a solution using data.table package, but I'm finding it kind of poor. any thoughts?
library(data.table)
B <- as.data.table(B)
B <- B[, .(Account, Value, index = 1:.N), by = .(State)]
x <- B[Account == "Expenses", .(min_ind = min(index)), by = .(State)]
B <- merge(B, x, by = "State")
B <- B[index < min_ind & Account == "In the Bimester", .(Value), by = .(State)]
You can use dplyr package:
library(dplyr)
B %>% mutate(helper = data.table::rleid(Account)) %>%
filter(Account == "In the Bimester") %>%
group_by(State) %>% filter(helper == min(helper)) %>% select(-helper)
# # A tibble: 6 x 3
# # Groups: State [3]
# State Account Value
# <fctr> <fctr> <dbl>
# 1 Arizona In the Bimester 0.17730148
# 2 Arizona In the Bimester 0.05695585
# 3 California In the Bimester 0.29089678
# 4 California In the Bimester 0.86952723
# 5 Texas In the Bimester 0.54076144
# 6 Texas In the Bimester 0.59168138
If instead of min you use max you'll get the last occurrences of "In the Bimester" for each State. You can also exclude Account column by changing the last pipe to select(-helper,-Account).
p.s. If you don't want to use rleid from data.table and just use dplyr functions take a look at this thread.

How to subset data for a specific column with ddply?

I would like to know if there is a simple way to achieve what I describe below using ddply. My data frame describes an experiment with two conditions. Participants had to select between options A and B, and we recorded how long they took to decide, and whether their responses were accurate or not.
I use ddply to create averages by condition. The column nAccurate summarizes the number of accurate responses in each condition. I also want to know how much time they took to decide and express it in the column RT. However, I want to calculate average response times only when participants got the response right (i.e. Accuracy==1). Currently, the code below can only calculate average reaction times for all responses (accurate and inaccurate ones). Is there a simple way to modify it to get average response times computed only in accurate trials?
See sample code below and thanks!
library(plyr)
# Create sample data frame.
Condition = c(rep(1,6), rep(2,6)) #two conditions
Response = c("A","A","A","A","B","A","B","B","B","B","A","A") #whether option "A" or "B" was selected
Accuracy = rep(c(1,1,0),4) #whether the response was accurate or not
RT = c(110,133,121,122,145,166,178,433,300,340,250,674) #response times
df = data.frame(Condition,Response, Accuracy,RT)
head(df)
Condition Response Accuracy RT
1 1 A 1 110
2 1 A 1 133
3 1 A 0 121
4 1 A 1 122
5 1 B 1 145
6 1 A 0 166
# Calculate averages.
avg <- ddply(df, .(Condition), summarise,
N = length(Response),
nAccurate = sum(Accuracy),
RT = mean(RT))
# The problem: response times are calculated over all trials. I would like
# to calculate mean response times *for accurate responses only*.
avg
Condition N nAccurate RT
1 6 4 132.8333
2 6 4 362.5000
With plyr, you can do it as follows:
ddply(df,
.(Condition), summarise,
N = length(Response),
nAccurate = sum(Accuracy),
RT = mean(RT[Accuracy==1]))
this gives:
Condition N nAccurate RT
1: 1 6 4 127.50
2: 2 6 4 300.25
If you use data.table, then this is an alternative way:
library(data.table)
setDT(df)[, .(N = .N,
nAccurate = sum(Accuracy),
RT = mean(RT[Accuracy==1])),
by = Condition]
Using dplyr package:
library(dplyr)
df %>%
group_by(Condition) %>%
summarise(N = n(),
nAccurate = sum(Accuracy),
RT = mean(RT[Accuracy == 1]))

Resources