Functions by groups in another column in R [duplicate] - r

This question already has answers here:
Adding a column of means by group to original data [duplicate]
(4 answers)
How to sum a variable by group
(18 answers)
Closed 2 years ago.
I have 2 questions regarding groups in a dataframe in R.
Imagine I have a dataframe (df) like this
| CONT | COUNTRY | GDP | AVG_GDP |
|------|---------|-----|---------|
| AF | EGYPT | 3 | 2 |
| AF | SUDAN | 2 | 2 |
| AF | ZAMBIA | 1 | 2 |
| AM | CANADA | 4 | 5 |
| AM | MEXICO | 2 | 5 |
| AM | USA | 9 | 5 |
| EU | FRANCE | 5 | 4 |
| EU | ITALY | 4 | 4 |
| EU | SPAIN | 3 | 4 |
How can I calculate the average GDP by continent and put it in the AVG_GDP column, so that it looks like the table above?
The second question is how I can sum the GDP by continent so that it looks like this:
| CONT | SUM_GDP |
|------|---------|
| AF | 6 |
| AM | 15 |
| EU | 12 |
For this last question, I think that in base R the table could be obtained with something like sum_df <- aggregate(df$GDP, by=list(df$CONT), FUN=sum), but maybe there is a better way to get it into a new dataframe.
Thank you in advance
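Both results can be sketched in base R. This is a minimal example that rebuilds the sample data from the tables above; the name sum_df is only an illustration and not part of the question. ave() returns one value per original row, so it suits the AVG_GDP column, while aggregate() returns one row per group.

```r
# Sample data copied from the question
df <- data.frame(
  CONT    = rep(c("AF", "AM", "EU"), each = 3),
  COUNTRY = c("EGYPT", "SUDAN", "ZAMBIA", "CANADA", "MEXICO", "USA",
              "FRANCE", "ITALY", "SPAIN"),
  GDP     = c(3, 2, 1, 4, 2, 9, 5, 4, 3),
  stringsAsFactors = FALSE
)

# Group mean repeated on every row of the group
df$AVG_GDP <- ave(df$GDP, df$CONT, FUN = mean)

# One row per continent with the summed GDP, in a new data frame
sum_df <- aggregate(GDP ~ CONT, data = df, FUN = sum)
names(sum_df)[2] <- "SUM_GDP"

print(df)
print(sum_df)
```

With dplyr the same two steps would usually be written with group_by() followed by mutate() or summarise().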

Related

Is there a way in R to create a column based on order of multiple values in one another column in dataframe? [duplicate]

This question already has answers here:
Aggregating all unique values of each column of data frame
(2 answers)
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 1 year ago.
I would like to create a column in my R data frame based on the order in which multiple values occur in one column.
For example, my data frame has an id column and an item column, and the values in the order column are what I would like to add. Is there a way to tell R to look at the order in which values appear in the item column, so that it produces "ABCD" or "ADCB" (or any other order) as the cell value in the third column?
| id | item | order |
|----|------|-------|
| 11 | A | ABCD |
| 11 | A | ABCD |
| 11 | B | ABCD |
| 11 | B | ABCD |
| 11 | C | ABCD |
| 11 | C | ABCD |
| 11 | D | ABCD |
| 11 | D | ABCD |
| 12 | A | ADCB |
| 12 | A | ADCB |
| 12 | D | ADCB |
| 12 | D | ADCB |
| 12 | C | ADCB |
| 12 | C | ADCB |
| 12 | B | ADCB |
| 12 | B | ADCB |
...
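A minimal base R sketch of the idea, assuming item is a character column and that "order" means the order of first appearance within each id. The data are rebuilt from the table above.

```r
# Sample data copied from the question (without the order column)
df <- data.frame(
  id   = rep(c(11, 12), each = 8),
  item = c("A", "A", "B", "B", "C", "C", "D", "D",
           "A", "A", "D", "D", "C", "C", "B", "B"),
  stringsAsFactors = FALSE
)

# unique() keeps values in order of first appearance;
# ave() writes the pasted string back onto every row of the group
df$order <- ave(df$item, df$id,
                FUN = function(x) paste(unique(x), collapse = ""))

print(df)
```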

How to add a column from one dataframe to another dataframe when two other columns match [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 3 years ago.
I have two datasets, db1 and db2, like the following ones:
db1
+---------+-------+-------+------+------+-----------------+
| Authors| IDs | Title | Year | ISSN | Other columns...|
+---------+-------+-------+------+------+-----------------+
| Abad J.| 16400 | 1 | 2014 |14589 | |
| Ares K.| 70058 | 2 | 2012 |15874 | |
| Anto E.| 71030 | 3 | 2011 |16999 | |
| A Banul| 57196 | 1 | 2011 |21546 | |
| A Berat| 56372 | 2 | 2011 |12554 | |
+---------+-------+-------+------+------+-----------------+
and
db2
+---------+-------+-------+------+------+-------+---------------------------+
| Authors| IDs | Title | Year | ISSN | IF | Other different columns...|
+---------+-------+-------+------+------+-------+---------------------------+
| Abad J.| 16400 | 1 | 2013 |14589 | 2,3 | |
| Ares K.| 70058 | 2 | 2012 |15874 | 3,3 | |
| Anto E.| 71030 | 3 | 2011 |14587 | 1,2 | |
| A Banul| 57196 | 1 | 2011 |21546 | 7,8 | |
| A Berat| 56372 | 2 | 2011 |75846 | 4,5 | |
+---------+-------+-------+------+------+-------+---------------------------+
Basically, what I want is to add the column IF from db2 to db1 wherever the two columns Year and ISSN have the same values. So what I want to achieve in my example is the following output:
db1
+---------+-------+-------+------+------+-------+----------------+
| Authors| IDs | Title | Year | ISSN | IF |Other columns...|
+---------+-------+-------+------+------+-------+----------------+
| Abad J.| 16400 | 1 | 2014 |14589 | NA | |
| Ares K.| 70058 | 2 | 2012 |15874 | 3,3 | |
| Anto E.| 71030 | 3 | 2011 |16999 | NA | |
| A Banul| 57196 | 1 | 2011 |21546 | 7,8 | |
| A Berat| 56372 | 2 | 2011 |12554 | NA | |
+---------+-------+-------+------+------+-------+----------------+
I have tried merge but, since the two datasets also have different columns, I obtain a very big dataset.
What I want is something like the function match, but with more than one condition applied at the same time.
Any ideas?
library(dplyr)
left_join(db1, select(db2, Year, ISSN, IF), by = c("Year", "ISSN"))
This should work provided db1 does not already contain a column named IF. Selecting only Year, ISSN and IF from db2 keeps its other columns out of the result.
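For the "match with more than one condition" idea from the question, one base R sketch is to paste the key columns together and match on the combined key. The data below are a cut-down version of the tables above (only the columns needed for the illustration), and IF is kept as text because of the decimal commas.

```r
# Cut-down versions of db1 and db2 from the question
db1 <- data.frame(
  Authors = c("Abad J.", "Ares K.", "Anto E.", "A Banul", "A Berat"),
  Year    = c(2014, 2012, 2011, 2011, 2011),
  ISSN    = c(14589, 15874, 16999, 21546, 12554),
  stringsAsFactors = FALSE
)
db2 <- data.frame(
  Year = c(2013, 2012, 2011, 2011, 2011),
  ISSN = c(14589, 15874, 14587, 21546, 75846),
  IF   = c("2,3", "3,3", "1,2", "7,8", "4,5"),
  stringsAsFactors = FALSE
)

# Build one key per row from both columns, then look it up in db2
key1 <- paste(db1$Year, db1$ISSN)
key2 <- paste(db2$Year, db2$ISSN)
db1$IF <- db2$IF[match(key1, key2)]   # NA where no (Year, ISSN) pair matches

print(db1)
```

Unlike merge(), this keeps db1 in its original row order and only ever adds the one column. If db2 contains the same (Year, ISSN) pair more than once, match() picks the first occurrence.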

Convert decimal date to year and week number [duplicate]

This question already has answers here:
Converting date in Year.decimal form in R
(2 answers)
Closed 3 years ago.
I am running an ARIMA model with the forecast library; the output of this model looks something like this:
+----------+----------------+------------+----------+-----------+----------+
| | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
+----------+----------------+------------+----------+-----------+----------+
| 2016.261 | 335.0697 | 267.368566 | 402.7707 | 231.52977 | 438.6095 |
| 2016.281 | 346.7667 | 234.935713 | 458.5978 | 175.73594 | 517.7975 |
| 2016.300 | 296.3013 | 174.495528 | 418.1070 | 110.01547 | 482.5870 |
| 2016.319 | 379.0095 | 255.265230 | 502.7537 | 189.75899 | 568.2600 |
+----------+----------------+------------+----------+-----------+----------+
What I would like to do is convert the decimal date (for example 2016.261) by adding two columns, one with the year and the other with the week number, to get something like this:
+----------+---------+------+----------------+------------+----------+-----------+----------+
| | year | week | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
+----------+---------+------+----------------+------------+----------+-----------+----------+
| 2016.261 | 20.. | n1 | 335.0697 | 267.368566 | 402.7707 | 231.52977 | 438.6095 |
| 2016.281 | 20.. | n1 | 346.7667 | 234.935713 | 458.5978 | 175.73594 | 517.7975 |
| 2016.300 | 20.. | n3 | 296.3013 | 174.495528 | 418.1070 | 110.01547 | 482.5870 |
| 2016.319 | 20.. | n4 | 379.0095 | 255.265230 | 502.7537 | 189.75899 | 568.2600 |
+----------+---------+------+----------------+------------+----------+-----------+----------+
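Well, with a dataframe like this, for example:
df1 <- data.frame(x = c(2016.01, 2016.32, 2016.261, 2016.281, 2016.300, 2016.319))
yr <- floor(df1$x)
jan1 <- as.Date(paste0(yr, "-01-01"))
# number of days in each year (365 or 366)
ndays <- as.numeric(as.Date(paste0(yr + 1, "-01-01")) - jan1)
# the decimal part is a fraction of the year, not a day number
df1$date <- jan1 + floor((df1$x - yr) * ndays)
df1$year <- format(df1$date, "%Y")
df1$week <- format(df1$date, "%W")
df1
# x date year week
# 1 2016.010 2016-01-04 2016 01
# 2 2016.320 2016-04-27 2016 17
# 3 2016.261 2016-04-05 2016 14
# 4 2016.281 2016-04-12 2016 15
# 5 2016.300 2016-04-19 2016 16
# 6 2016.319 2016-04-26 2016 17
NB: I added the first two values just to check that the dates come out right, and instead of df1 you can use your own dataframe. The decimal part is treated as a fraction of the year. Parsing it with format="%Y.%j" would read 2016.261 as day 261, and 2016.300 as day 3, because as.character() drops the trailing zeros.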

Data imputation for empty subsetted dataframes in R

I'm trying to build a function in R that subsets my raw dataframe according to some specifications and then converts the subsetted dataframe into a proportion table.
Unfortunately, some of these subsets are empty, because I have no data for some particular specifications; hence no proportion table can be calculated. So, what I would like to do is take the closest time step for which I have a non-empty subset and use it as the input in place of the empty one.
Here are some details about my dataframe and function:
My raw dataframe looks roughly as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.24 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refer to a particular time (year and quarter) and area for which a number of individuals were measured (no_individuals). For example, the first row says that in the first quarter of 2005 in area 24 I had 8 individuals belonging to a length class (lenCls) of 380 mm with age=3. It is worth mentioning that for a particular year, quarter and area combination I can have different length classes and ages (thus, multiple rows)!
So what I basically want to do is subset the raw dataframe for a particular year, quarter and area combination, and from that subset calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # repeat each row once per measured individual
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  # age-by-length counts, then proportions within each row
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  # draw the plot first, since nothing after return() is executed
  if(alkplot==TRUE){
    alkPlot(key,"area",xlab="Age")
  }
  return(key)
}
From the example dataset above, one can see that for year=2005 & quarter=3 & area=21 I do not have any measured individuals. Yet, for the same area AND year I have data for quarters 1 and 2. The most reasonable assumption would be to take the subsetted dataframe from the closest time step (here quarter 2 with the same area and year) and replace the NAs in the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that in some cases I do not have data for a whole year! In the example above, one can see this for area 24 in 2007. In this case I cannot borrow the information from the nearest quarter and need to borrow from the previous year instead. This means that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I haven't made any progress.
So, any help here would be very much appreciated.
Here is my LAK function, which I'm trying to update:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # In case of empty dataset (no rows, or only NA placeholder rows)
  if(nrow(sALK)==0 || all(is.na(sALK$no_individuals))){
    warning("Empty subset combination; data will be subsetted based on the nearest timestep combination")
    # FIXME: INCLUDE IMPUTATION RULES HERE
  }
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  # draw the plot first, since nothing after return() is executed
  if(alkplot==TRUE){
    alkPlot(key,"area",xlab="Age")
  }
  return(key)
}
So, I finally came up with a partial solution to my problem and will include my function here in case it is of interest to someone:
LAK <- function(df, Year="2005", Quarter="1", Area="22",alkplot=T){
  require(FSA)
  # subset alk by year, quarter, area and species
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  print(sALK)
  if(nrow(sALK)==1){
    warning("Empty subset combination; data has been subsetted to the nearest input combination")
    syear <- unique(as.numeric(as.character(sALK$year)))
    sarea <- unique(as.numeric(as.character(sALK$area)))
    sALK2 <- subset(df, year==syear & area==sarea)
    vals <- as.data.frame(table(sALK2$comb_index))
    colnames(vals)[1] <- "comb_index"
    idx <- which(vals$Freq>1)
    quarterId <- as.numeric(as.character(vals[idx,"comb_index"]))
    imput <- subset(df,year==syear & area==sarea & comb_index==quarterId)
    dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_at_length_age), 1:ncol(imput)]
    raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
    key2 <- round(prop.table(raw2, margin=1), 3)
    print(key2)
    if(alkplot==TRUE){
      alkPlot(key2,"area",xlab="Age")
    }
  } else {
    dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_at_length_age), 1:ncol(sALK)]
    raw <- t(table(dfexp$lenCls, dfexp$age))
    key <- round(prop.table(raw, margin=1), 3)
    print(key)
    if(alkplot==TRUE){
      alkPlot(key,"area",xlab="Age")
    }
  }
}
This solves my problem when I have data for at least one quarter of a particular year and area combination. Yet, I'm still struggling to figure out how to handle the case where I have no data at all for a particular year and area combination. In this case I need to borrow data from the closest year that contains data for all the quarters of the same area.
For the example above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on.
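The borrowing rules described above could be sketched roughly as follows. This is only an illustration under some assumptions: nearest_subset is a hypothetical helper name, the column names follow the sample table (year, quarter, area, no_individuals), the arguments are passed as numbers rather than strings, and ties between equally close quarters or years go to the first one found. It does not check that the donor year has data for all quarters.

```r
# Hypothetical helper: pick the rows to use for a (Year, Quarter, Area) request
nearest_subset <- function(df, Year, Quarter, Area) {
  # only rows of this area that actually carry measurements
  has_data <- df[df$area == Area & !is.na(df$no_individuals), ]

  # 1) exact year/quarter match
  hit <- has_data[has_data$year == Year & has_data$quarter == Quarter, ]
  if (nrow(hit) > 0) return(hit)

  # 2) nearest quarter within the same year
  same_year <- has_data[has_data$year == Year, ]
  if (nrow(same_year) > 0) {
    q <- unique(same_year$quarter)
    best_q <- q[which.min(abs(q - Quarter))]
    return(same_year[same_year$quarter == best_q, ])
  }

  # 3) same quarter in the closest other year
  same_q <- has_data[has_data$quarter == Quarter, ]
  if (nrow(same_q) > 0) {
    y <- unique(same_q$year)
    best_y <- y[which.min(abs(y - Year))]
    return(same_q[same_q$year == best_y, ])
  }

  has_data[0, ]  # nothing to borrow from
}

# A few rows in the shape of the sample table
df <- data.frame(
  year           = c(2005, 2005, 2005, 2005, 2006, 2007, 2007),
  quarter        = c(1, 2, 3, 1, 1, 1, 2),
  area           = c(21, 21, 21, 24, 24, 24, 24),
  no_individuals = c(25, 2, NA, 8, 1, NA, NA),
  lenCls         = c(400, 620, NA, 380, 640, NA, NA),
  age            = c(2, 5, NA, 3, 5, NA, NA)
)

a <- nearest_subset(df, 2005, 3, 21)  # borrows quarter 2 of 2005
b <- nearest_subset(df, 2007, 1, 24)  # borrows quarter 1 of 2006
print(a)
print(b)
```

The returned rows could then go through the same rep()/table()/prop.table() steps as in the LAK function.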
I don't know if you have ever come across MICE (the mice package), but it is a pretty cool and comprehensive tool for imputing missing values. It also lets you see how the imputed data are distributed, so that you can choose the method best suited to your problem. Check this brief explanation and the original package description.

How to get a query result into a key value form in HiveQL

I have tried different things, but none succeeded. I have the following issue and would be very grateful if someone could help me.
I get the data from a view as several billion records, for different measures:
A)
| s_c_m1 | s_c_m2 | s_c_m3 | s_c_m4 | s_p_m1 | s_p_m2 | s_p_m3 | s_p_m4 |
|--------+--------+--------+--------+--------+--------+--------+--------|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|--------+--------+--------+--------+--------+--------+--------+--------|
Then I need to aggregate it by each measure. So far so good; I got this figured out.
B)
| s_c_m1 | s_c_m2 | s_c_m3 | s_c_m4 | s_p_m1 | s_p_m2 | s_p_m3 | s_p_m4 |
|--------+--------+--------+--------+--------+--------+--------+--------|
| 3 | 6 | 9 | 12 | 15 | 18 | 21 | 24 |
|--------+--------+--------+--------+--------+--------+--------+--------|
Then I need to get the data in the following form. I need to turn it into a key-value form.
C)
| measure | c | p |
|---------+----+----|
| m1 | 3 | 15 |
| m2 | 6 | 18 |
| m3 | 9 | 21 |
| m4 | 12 | 24 |
|---------+----+----|
The first 4 columns of B) would form column c in C), and the last 4 columns would form column p.
Is there an elegant way that is easy to maintain? The perfect solution would be one where, if another measure were introduced in A) and B), no modification would be required and it would automatically pick up the difference.
I know how to get this done in SQL Server and Postgres, but here I am lacking the experience.
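I think you should use a map for this: build a map keyed by measure name from the aggregated columns and turn it into rows with LATERAL VIEW explode().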
