How to calculate the mean average of 2 different rows - math

country Value
GBR 10
USA 30
GBR 20
USA 40
This is just a quick question which I was hoping someone could help me sort out, as I am new to coding. How would I find the mean (average) of the values which occur in GBR and, separately, of the values which occur in USA? Thanks :)

I am assuming you are using a pandas DataFrame in Python (please update the language and data structure in the question tags).
For getting group-wise means you can use groupby -
import pandas as pd
#df contains the dataset!!
df.groupby('country')['Value'].mean() #Grouped by country, get mean of column value
country
GBR    15.0
USA    35.0
Name: Value, dtype: float64
You can read more about this on this article I wrote on kaggle
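For a version that runs on its own, the sketch below rebuilds the question's four rows and applies the same groupby. The frame name `df` follows the answer; the variable name `means` is only chosen for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "country": ["GBR", "USA", "GBR", "USA"],
    "Value": [10, 30, 20, 40],
})

# Group rows by country, then average the Value column within each group.
means = df.groupby("country")["Value"].mean()
print(means)
```

The result is a Series indexed by country, so a single group's mean can be read with `means["GBR"]`.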

Related

Mirror a dataset with future days for forecasting

I guess this is pretty basic, but I'm struggling to find a way to do it and can't find an answer online either. I'm trying to create a dataframe with future dates, but those dates should be duplicated for each combination of 2 other variables
so I should have
Dates | Channel | Product
Channel can take 4 values and Product 7 values, and I need to create dates for the 45 days after the last day in my current df. Therefore I have 28 combinations per day, and my new df should have 1260 rows (45 * 7 * 4)
as the sample below
I know about this function
Dates =seq(max(train$Date), by="day", length.out=45)
However, this creates a single vector and does not duplicate the dates for each combination. Is there any way I can adapt this?
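One way to think about it is as a Cartesian product of dates, channels and products. The sketch below shows that idea in Python with pandas rather than R. The last date and the channel and product labels are hypothetical placeholders, and it assumes the 45 days start on the day after the last observed date.

```python
import pandas as pd

# Hypothetical stand-in for the last date in the existing data (max(train$Date)).
last_date = pd.Timestamp("2020-01-31")
# Hypothetical labels: 4 channels and 7 products, as described in the question.
channels = [f"channel_{i}" for i in range(1, 5)]
products = [f"product_{i}" for i in range(1, 8)]

# 45 consecutive days, starting the day after the last observed date.
future_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=45, freq="D")

# Cartesian product of dates x channels x products, flattened into a frame.
future = pd.MultiIndex.from_product(
    [future_dates, channels, products],
    names=["Dates", "Channel", "Product"],
).to_frame(index=False)

print(future.shape)
```

Each date appears once per channel and product pair, which gives the 28 rows per day the question asks for.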

Compare string to column in different dataframe r

I have one dataframe df_EU that is composed of scientists working in the EU in the following format:
Author ID Country Year
A 12345 UK 2011
B 13254 Germany 2018
C 54952 Belgium 2005
D 58774 UK 2009
E 88569 Italy 2015
...
Then, I have another dataframe that contains scientists from the US df_US in the same format. Now, what I am trying to do is to add a new column for the US dataframe in which I compare each ID in the US dataframe with all the IDs in the EU dataframe. Each time there is a match, I want a 1 to appear in the new column, for each ID that is not in the EU set, a 0.
So far, I am fairly certain that my solution should contain mapply, and I deduced from this question that I can "load" the values for the ID numbers using:
mapply(function(i, j) length(grep(i, j)), df_EU$ID, df_US$ID)
I am, however, quite lost on how to proceed from here. I have never really worked with functions, and would therefore greatly appreciate your help! Thank you very much.
Another problem is that the scientists might appear multiple times per dataframe, as they are not listed by their unique names but by publications that have appeared in the respective region.
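Here, we can use regex_left_join from the fuzzyjoin package:
library(dplyr) # provides %>%, distinct() and mutate()
library(fuzzyjoin)
df_US <- regex_left_join(df_US, df_EU %>%
distinct(ID), by = 'ID') %>% # unique EU IDs, so repeated authors do not duplicate rows
mutate(EU_migration = as.integer(!is.na(ID.y))) # 1 if the ID matched an EU ID, else 0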
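If the IDs only need to match exactly, a plain membership test avoids the regex join altogether. The sketch below shows that alternative in Python with pandas. It is not the fuzzyjoin approach from the answer, and the two small frames are hypothetical stand-ins for df_EU and df_US.

```python
import pandas as pd

# Hypothetical miniature versions of df_EU and df_US; IDs may repeat
# because each row is a publication, not a unique author.
df_EU = pd.DataFrame({"Author": ["A", "B", "D", "A"], "ID": [12345, 13254, 58774, 12345]})
df_US = pd.DataFrame({"Author": ["A", "F", "D"], "ID": [12345, 99999, 58774]})

# 1 where the US ID also occurs anywhere among the EU IDs, 0 otherwise.
df_US["EU_migration"] = df_US["ID"].isin(df_EU["ID"]).astype(int)

print(df_US)
```

Because `isin` only tests membership, repeated IDs on either side do not add or remove rows in df_US.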

Stata-related graphic enquiry

I have a very basic question about Stata. I have a repeated cross section of individuals from year 1 to year 20. For each individual, by year, I have a year-specific variable, for instance GDP per capita in the country. This variable is defined for each individual for each year, across years. I therefore have 20 unique data points for this variable. I want to plot this variable as a function of time (say in a two-way plot). The twoway command does not work because I have far more than 20 points for these 20 values: each value is repeated for all n people in the cross section in that year. How can I create a separate variable that extracts only the distinct values from the variable in its current form?
With a simple example of your data you could have saved yourself and others time. As it stands, your question is difficult to understand. As already pointed out, it lacks both code and example data. Please rewrite so others can easily find and use whatever is posted here.
My interpretation is you have panel data. The variable gdp is year-specific (in every panel the information is duplicated), but you'd like to graph it against time. Just tag one instance, and draw a graph conditional on that. An example:
clear
set more off
// not 20 years, but 3
input ///
id year gdp
1 1990 78
1 1991 90
1 1992 98
2 1990 78
2 1991 90
2 1992 98
end
egen tograph = tag(year)
twoway line gdp year if tograph
or
twoway line gdp year if id == 1
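The tagging step above amounts to keeping one row per year. For readers who want to check the logic outside Stata, here is a rough Python/pandas sketch of the same idea on the same toy data. The names `panel` and `by_year` are made up for illustration, and plotting is left out.

```python
import pandas as pd

# Same toy data as the Stata example above.
panel = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "year": [1990, 1991, 1992, 1990, 1991, 1992],
    "gdp": [78, 90, 98, 78, 90, 98],
})

# Keep the first row seen for each year, analogous to egen tag(year).
by_year = panel.drop_duplicates(subset="year")[["year", "gdp"]].sort_values("year")

print(by_year)
```

This only works because gdp is identical for every individual within a year. If it were not, which row survives would depend on row order.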
This is a perfect case of panel data:
First set the panel. The command to set the panel in your case is the following:
xtset id year
Then you can plot with the xtline command:
xtline gdp , t(year) i(id)
The above command will plot individual graphs for each id over year. To get one graph for all for comparison, use the following command:
xtline gdp , overlay t(year) i(id)

R - get sum from one column based on categories in another column

I am new to R and trying to learn on my own. I have data in csv format with 1,048,575 rows and 73 columns. I am looking at three columns: year, country, aid_amount. I want to get the sum of aid_amount by country for i) all years, and ii) the years 1991-2010. I tried the following for all years, but the result I get is different from the one I get when I sort and sum in Excel. What is wrong here? Also, what change should I make for ii) the years 1991-2010? Thanks.
aiddata <- read.csv("aiddata_research.csv")
sum_by_country <- tapply(aiddata$aid_amount, aiddata$country, sum, na.rm=TRUE) # There are missing data on aid_amount
write.csv(sum_by_country, "sum_by_country.csv")
I have also tried the following instead of tapply:
sum_by_country <- aggregate(aid_amount ~ country, data = aiddata, sum)
The first few rows for a few columns look like this:
aiddata_id year country aid_amount
23229017 2004 Bangladesh 685899.2666
14582630 2000 Bilateral, unspecified 15772.77174
28085216 2006 Bilateral, unspecified 38926.82898
28702455 2006 Bilateral, unspecified 12633.85659
29928104 2006 Cambodia 955412.9884
27783934 2006 Cambodia 11773.77268
37418683 2008 Guatemala 40150.7331
94726192 2010 Guatemala 151206.3096
You could use data.table for the big dataset. To get the sum of aid_amount for each country by year (na.rm=TRUE is needed because aid_amount has missing values):
library(data.table)
setkey(setDT(aiddata), country,year)[,
list(aid_amount=sum(aid_amount, na.rm=TRUE)), by=list(country, year)]
To get the sum of aid_amount for each country:
setkey(setDT(aiddata), country)[,
list(aid_amount=sum(aid_amount, na.rm=TRUE)), by=list(country)]
yy=aggregate(df$Column1,by=list(df$Column2),FUN=sum)
Column1 is the column to sum; Column2 holds the categories to sum over.
To find which category has the largest sum, use the code below (it returns the row index in yy):
which.max(yy$x)
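Neither answer covers part ii) of the question, the 1991-2010 range. The sketch below shows both sums in Python with pandas rather than R. It uses a few rows from the question's sample with amounts rounded, plus one hypothetical 1985 row and one hypothetical missing amount so that the year filter and the missing-value handling are actually exercised.

```python
import pandas as pd

# Rows based on the question's sample (amounts rounded). The 1985 row and the
# NaN row are hypothetical additions for illustration.
aiddata = pd.DataFrame({
    "year": [2004, 2006, 2006, 2008, 2010, 1985, 2006],
    "country": ["Bangladesh", "Cambodia", "Cambodia", "Guatemala",
                "Guatemala", "Guatemala", "Cambodia"],
    "aid_amount": [685899.27, 955412.99, 11773.77, 40150.73,
                   151206.31, 1000.0, float("nan")],
})

# (i) Sum over all years; pandas skips NaN by default in sum().
total_all = aiddata.groupby("country")["aid_amount"].sum()

# (ii) Restrict to 1991-2010 (inclusive on both ends) before grouping.
in_range = aiddata[aiddata["year"].between(1991, 2010)]
total_1991_2010 = in_range.groupby("country")["aid_amount"].sum()

print(total_all)
print(total_1991_2010)
```

Filtering before grouping keeps the two steps independent, so the same groupby line serves both cases.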

Comparing value with previous one in R to tabulate spending?

I asked a very general version of this question a while ago. I thought I would have enough programming background to make the jump from the answer to create my function, but turns out I was wrong. This is my first time using R, and I'm having some trouble.
Given the following dataset:
Amount_Bought CustomerID
12 28
18 28
2 6
9 6
10 6
I want to create a column called "average spending" which tabulates the average spending of each customer based on their ID. There are about 1000 entries in the data, with a varying number of purchases per customer.
For example, for customerID 28, I would want average spending to be (12 + 18)/2 = 15
So, something like this:
Amount_Bought CustomerID Average_Spending
12 28
18 28 15
2 6
9 6
10 6 7
How would I go about doing this?
Thank you
How about:
library(plyr)
sumdat <- ddply(my_data,"CustomerID",summarise,
avg_spending = mean(Amount_Bought))
merge(my_data,sumdat)
(There are a variety of ways to aggregate data in this way in R: ave, aggregate in base R, dplyr package, data.table package ... there are lots of questions on SO comparing efficiency etc. of these various approaches, e.g. Joining aggregated values back to the original data frame )
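For comparison, the same aggregate-then-join-back pattern can be sketched in Python with pandas, where a grouped transform does both steps at once. This is an illustration on the question's five rows, not a translation of the plyr code. The column name Average_Spending comes from the question's desired output.

```python
import pandas as pd

# Sample data copied from the question.
my_data = pd.DataFrame({
    "Amount_Bought": [12, 18, 2, 9, 10],
    "CustomerID": [28, 28, 6, 6, 6],
})

# transform("mean") returns one value per original row, aligned to the
# original index, so the group mean lands next to every purchase.
my_data["Average_Spending"] = (
    my_data.groupby("CustomerID")["Amount_Bought"].transform("mean")
)

print(my_data)
```

As with the merge in the answer above, the average is filled in on every row of a customer, not only on the last row as in the question's sample table.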
