With a data frame df1 like below
+-----------------------------------------+
|reg |make |model |year|abs |gears|fm|
+-----------------------------------------+
|ax1234|Toyota|Corolla|1999|true |6 |0 |
|ax1235|Toyota|Corolla|1999|false|5 |0 |
|ax1236|Toyota|Corolla|1992|false|4 |NA|
|ax1237|Toyota|Camry |2001|true |7 |1 |
|ax1238|Honda |Civic |1994|true |5 |NA|
|ax1239|Honda |Civic |2000|false|6 |0 |
|ax1240|Honda |Accord |1992|false|4 |NA|
|ax1241|Nissan|Sunny |2001|true |6 |0 |
|ax1242| | |1998|false|6 |0 |
|ax1243|NA |NA |1992|false|4 |NA|
+-----------------------------------------+
When aggregating as below, I want to preserve makes that are NA. How can I achieve this? It is fine if the empty make and NA are combined into one group.
> aggregate(reg ~ make, df1, length)
make reg
1 1
2 Honda 3
3 Nissan 1
4 Toyota 4
We can use dplyr, which gives the NA count as well:
library(dplyr)
df1 %>%
group_by(make) %>%
summarise(reg = n())
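For completeness, the same count can be obtained in base R. A minimal sketch, assuming a cut-down df1 with just the make and reg columns: table() keeps NA when asked, and aggregate() keeps it once NA is made an explicit factor level with addNA().

```r
# Cut-down version of df1 (only the columns relevant to the count)
df1 <- data.frame(
  make = c("Toyota", "Toyota", "Toyota", "Toyota", "Honda", "Honda",
           "Honda", "Nissan", "", NA),
  reg  = paste0("ax", 1234:1243),
  stringsAsFactors = FALSE
)

# table() can report NA as its own group
table(df1$make, useNA = "ifany")

# aggregate() drops NA groups by default, but keeps them once NA
# is an explicit factor level
df1$make <- addNA(factor(df1$make))
aggregate(reg ~ make, df1, length)   # now includes an <NA> group
```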
Apologies if this question has already been asked (I looked, but the other Q&As didn't help). I'm trying to create a new df from an existing df that calculates total annual values from multiple columns, so that each row is one year. The df that I have looks something like this:
| year | v1 | v2 | v3
| -------- | -------------- |-----|-----
| 2021 | 5 |4 |6
| 2021 | 10 |5 |3
| 2020 | 7 |8 |5
| 2020 | 3 |3 |4
| 2019 | 3 |1 |8
and what I'm trying to get is a new df that looks like:
| year | v1 | v2 | v3
| -------- | -------------- |-----|-----
| 2021 | 15 |9 |9
| 2020 | 10 |11 |9
| 2019 | 3 |1 |8
I tried something like this without any luck:
newdf<- df %>%
group_by(year) %>% # create the groups
summarise(v1 = sum(v1))
Please let me know if I should include other information :) Thank you v.much in advance for your help!
I use the across()/where() construction for this kind of task.
This code selects all numeric columns and sums each of them:
newdf<- df %>%
group_by(year) %>% # create the groups
summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
Base R should do the trick:
newdf <- aggregate(cbind(v1, v2, v3) ~ year, data = df, FUN = sum)
Hi everyone!
I will try to explain my problem as clearly as I can. I hope you can help me:
I have a data frame, let's call it DF1, that looks like the next one:
|Symbol | Date | Volume | Price|
|-------|-----------|--------|------|
|A |2014-01-01 | 0 | 4 |
|A |2014-01-02 | 7 | 7 |
|A |2014-01-03 | 8 | 9 |
|A |2014-01-04 | 1 | 5 |
|B |2014-01-01 |45 | 6 |
|B |2014-01-02 |0 | 11 |
|B |2014-01-03 |34 | 8 |
|B |2014-01-04 |45 | 5 |
|C |2014-01-01 |4 | 6 |
|C |2014-01-02 |0 | 5 |
|C |2014-01-03 |14 | 25 |
|D |2014-01-01 |31 | 4 |
|D |2014-01-02 |7 | 6 |
|D |2014-01-03 |18 | 3 |
|D |2014-01-04 |15 | 7 |
|E |2014-01-01 |13 | 8 |
|E |2014-01-02 |0 | 9 |
Having this dataframe, I create a new dataframe, RM, through the following lines of code:
RM <- DF1 %>% group_by(Date) %>%
mutate(weight = Volume/sum(Volume),
R_i = weight*(log(Price)-log(lag(Price)))) %>%
summarise(RM = sum(R_i, na.rm = TRUE))
And from RM, I select only the dates that are of my interest :
RM_reg <- subset(RM, Date >= "2014-03-05" & Date <= "2014-09-03")
Finally, RM_reg looks like this:
| Date | RM |
|2014-03-05 | 0 |
|2014-03-06 | 7 |
|2014-03-07 | 8 |
|2014-03-08 | 1 |
|2014-03-09 | 45 |
|2014-03-10 | 0 |
|2014-03-11 | 34 |
|2014-03-12 | 45 |
|2014-03-13 | 4 |
|2014-03-14 | 0 |
|2014-03-15 | 14 |
|2014-03-16 | 31 |
It should be noted that the values in the RM column of RM_reg are not the actual values, only examples. Starting from my original dataframe, RM_reg has 125 rows.
Then, from dataframe DF1, I extract the rows for which the Symbol column is equal to "A" through the following code:
DF_A <- DF1 %>%
filter(Symbol == "A")
And I add a column of returns to the dataframe DF_A, through the following code:
RA <- DF_A %>% group_by(Symbol)%>%
mutate(Ret_i = log(Price) - lag(log(Price)))
I eliminate the first row, which is NA:
AR <- na.omit(RA)
And from AR, I select only the dates that are of my interest :
AR_reg <- subset(AR, Date >= "2014-03-05" & Date <= "2014-09-03")
AR_reg looks like this:
|Symbol | Date | volume |price | Ret_i |
|--------------------------------------------|
|A |2014-03-05 | 1 | 5 | 2 |
|A |2014-03-06 | 3 | 8 | 3 |
|A |2014-03-07 | 7 | 4 | 4 |
|A |2014-03-08 |3 | 6 | 5 |
|A |2014-03-09 |34 | 7 | 1 |
|A |2014-03-10 |45 | 34 | 4 |
|A |2014-03-11 |4 | 5 | 3 |
|A |2014-03-12 |9 | 7 | 5 |
|A |2014-03-13 |8 | 6 | 6 |
|A |2014-03-14 |4 | 4 | 1 |
|A |2014-03-15 |0 | 7 | 4 |
|A |2014-03-16 |4 | 7 | 7 |
It should be noted that the values in the Ret_i column of AR_reg are not the actual values, only examples. Starting from my original dataframe, AR_reg also has 125 rows.
Finally, because RM_reg and AR_reg have the same number of rows, I can regress the Ret_i column of AR_reg on the RM column of RM_reg through the following code:
mod <- lm(AR_reg$Ret_i ~ RM_reg$RM)
What I need to do is to do the same as described above for all the Symbols in the dataframe DF1, in this case for, "B", "C", "D", "E". The problem is that we do not have the same amount of entries, or the same amount of rows corresponding to all Symbols, and this is a necessary condition to be able to do the regression. To do the regression I need to have 125 observations of returns for each Symbol.
What I have thought is to eliminate the Symbols for which the AR_reg-like dataframe does not have 125 rows, but the truth is that I do not know how to do this. I suppose a function is needed, but this is a topic I have not yet mastered.
Thank you very much for reading. I hope I have explained myself clearly; any help or suggestion will be much appreciated.
Join DF1 with RM by Date, keep only the data between the specified dates, calculate Ret_i for each Symbol, drop NA values, and create a list of models.
The complete code would look like:
library(dplyr)
DF1$Date <- as.Date(DF1$Date)
RM <- DF1 %>%
group_by(Date) %>%
mutate(weight = Volume/sum(Volume),
R_i = weight*(log(Price)-log(lag(Price)))) %>%
summarise(RM = sum(R_i, na.rm = TRUE))
result <- DF1 %>%
left_join(RM, by = 'Date') %>%
filter(between(Date, as.Date("2014-03-05"), as.Date("2014-09-03"))) %>%
group_by(Symbol) %>%
mutate(Ret_i = log(Price) - lag(log(Price))) %>%
na.omit() %>%
summarise(model = list(lm(Ret_i~RM)))
result
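Each element of the resulting model list-column is an ordinary lm object, so the fits can be inspected one by one. A self-contained sketch with made-up numbers (the Symbol/RM/Ret_i names follow the question; the values are hypothetical):

```r
library(dplyr)

# Toy stand-in for the joined data: two symbols, five observations each
set.seed(1)
df <- data.frame(
  Symbol = rep(c("A", "B"), each = 5),
  RM     = rnorm(10),
  Ret_i  = rnorm(10)
)

# One lm per Symbol, stored in a list-column
result <- df %>%
  group_by(Symbol) %>%
  summarise(model = list(lm(Ret_i ~ RM)))

# Pull out the coefficients of each per-Symbol model
coefs <- lapply(result$model, coef)
names(coefs) <- result$Symbol
coefs
```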
It has been a while since I've used R, so apologies for this very simple question. I probably wouldn't be asking it if I knew the correct way to find the question with the right keywords on a search engine.
I have a really long data table with a few columns, two of which are the subject for this question. The data table looks something like:
Row|Q|Year
1 |1|1990
2 |3|1995
3 |3|1991
4 |2|1990
5 |1|1990
6 |1|1990
7 |1|1992
8 |2|1991
9 |1|1990
10 |1|1990
11 |4|1991
12 |1|1992
13 |2|1995
14 |1|1993
15 |1|1990
....etc
I want to make a table that counts all the Year records for a given Q, like below:
Q|1990|1991|1992|1993|1994|1995
1|5 |2 |2 |1 |3 |2
2|6 |6 |2 |1 |4 |1
3|2 |1 |4 |5 |6 |1
4|4 |3 |1 |2 |7 |6
...etc.
Or a way to extract that information one by one:
Q=1, Year=1990: 5
I don't want to do it with a fancy package or with for/while loops. If I remember correctly, it should be something really simple in base R, like
table(data$Q,data$Year)
or
count(data$Q[Q==1,], data$Year[Year==1990,])
Though obviously that doesn't work, because the length of X is different from the length of Y.
Again sorry, a really dumb basic R question
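For what it's worth, table() is indeed all that's needed. A sketch with a toy data frame mirroring the fifteen rows shown above:

```r
# Toy data frame matching the fifteen example rows
data <- data.frame(
  Q    = c(1, 3, 3, 2, 1, 1, 1, 2, 1, 1, 4, 1, 2, 1, 1),
  Year = c(1990, 1995, 1991, 1990, 1990, 1990, 1992, 1991,
           1990, 1990, 1991, 1992, 1995, 1993, 1990)
)

tab <- table(data$Q, data$Year)
tab                 # full Q-by-Year contingency table
tab["1", "1990"]    # a single cell: count of Q = 1 in 1990
```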
I have a dataset and I want to generate the row position by group. For example
library(data.table)
data<-data.table(Position=c(1,2,3,4,5,6,7,8,9,10),
Category=c("M","M","M","M","F","F","F","M","M","F"))
I group by Category and want to create a column that is the row position within each group. Something like below, or with data.table:
dataByGroup %>% group_by(Category) %>% mutate(positionInCategory = 1:nrow(Category))
I am unable to work out how to achieve this.
Desired output:
| Position|Category | positionInCategory|
|--------:|:--------|------------------:|
| 1|M | 1|
| 2|M | 2|
| 3|M | 3|
| 4|M | 4|
| 5|F | 1|
| 6|F | 2|
| 7|F | 3|
| 8|M | 5|
| 9|M | 6|
| 10|F | 4|
Try the following:
library(data.table)
library(dplyr)
data<-data.table(Position=c(1,2,3,4,5,6,7,8,9,10),
Category=c("M","M","M","M","F","F","F","M","M","F"))
cleanData <- data %>%
group_by(Category) %>%
mutate(positionInCategory = 1:n())
Try
data[, new := rowid(Category)]
# or, if you're using 1.9.6 or older
data[, new := 1:.N, by=Category]
Position Category new
1: 1 M 1
2: 2 M 2
3: 3 M 3
4: 4 M 4
5: 5 F 1
6: 6 F 2
7: 7 F 3
8: 8 M 5
9: 9 M 6
10: 10 F 4
To use rowid, you need data.table 1.9.8 or later (at the time of writing it was only in the devel version).
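If neither package is available, base R's ave() can also number rows within groups. A sketch on the same toy data:

```r
data <- data.frame(
  Position = 1:10,
  Category = c("M", "M", "M", "M", "F", "F", "F", "M", "M", "F")
)

# For each Category, replace the entries with their within-group sequence
data$positionInCategory <- ave(seq_along(data$Category),
                               data$Category, FUN = seq_along)
data
```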
I am trying to determine the best way to collapse a set of rows in a dataframe down to one row, based on a two-column selection.
For example:
| State |Fatalities|Injuries|Dmg |Dmg.1|year|eventType| PropExp | CropExp |
| ------|----------|--------|------|---|----|---------|---------|---------|
| WA | 1 | 100 |25.00 |0 |1971|HAIL |1000000 |1 | 0 |
| WA | 6 | 200 |25.00 |0 |1972|TORNADO |1000000 |1 | 1 |
| WA | 2 | 300 |25.00 |0 |1973|SNOW |1000000 |1 | 0 |
| WA | 6 | 900 |65.00 |0 |1973|TORNADO |1000000 |1 | 1 |
| WA | 4 | 300 |25.00 |0 |1973|TORNADO |1000000 |1 | 0 |
| WA | 0 | 900 |65.00 |0 |1972|TORNADO |1000000 |1 | 1 |
The intent is to merge and add the rows from each year by event type, thus collapsing the two 1973 TORNADO rows into one row by summing the selected columns and creating a new merged data row:
| State |Fatalities|Injuries|Dmg |Dmg.1|year|eventType| PropExp | CropExp |
| ------|----------|--------|------|---|----|---------|---------|---------|
| WA | 1 | 100 |25.00 |0 |1971|HAIL |1000000 |1 | 0 |
| WA | 6 | 200 |25.00 |0 |1972|TORNADO |1000000 |1 | 1 |
| WA | 2 | 300 |25.00 |0 |1973|SNOW |1000000 |1 | 0 |
| WA | 10 | 1200 |90.00 |0 |1973|TORNADO |1000000 |1 | 1 |
| WA | 0 | 900 |65.00 |0 |1972|TORNADO |1000000 |1 | 1 |
Note: I am using an example data set from the Coursera Reproducible Research course - that I have previously completed. The question is not related to the course etc - I am trying to collapse a different data set and I would love to know how to this is done cleanly in R.
What am I missing?
All help gratefully received and appreciated.
Besides the dplyr solution @Khashaa mentioned in the comments, you can of course also do this in base R:
aggregate(cbind(Fatalities, Injuries, Dmg, Dmg.1, PropExp, CropExp) ~ State + year + eventType, data = df, sum)
this gives:
State year eventType Fatalities Injuries Dmg Dmg.1 PropExp CropExp
1 WA 1971 HAIL 1 100 25 0 1000000 1
2 WA 1973 SNOW 2 300 25 0 1000000 1
3 WA 1972 TORNADO 6 1100 90 0 2000000 2
4 WA 1973 TORNADO 10 1200 90 0 2000000 2
Or as an alternative you could also use data.table:
library(data.table)
setDT(df)[, lapply(.SD, sum), by=.(State, year, eventType)]
which gives you the same result.
The disadvantage of both the data.table solution and the dplyr solution with summarise_each is that all the columns are summed. Judging from your desired output, you don't want PropExp & CropExp summed. You can achieve that by indicating the columns to be summarized with .SDcols in data.table, or by deselecting the columns in dplyr:
# data.table
setDT(df)[, lapply(.SD, sum),
by=.(State, year, eventType),
.SDcols=c("Fatalities","Injuries","Dmg")]
# dplyr
df %>% group_by(State, year, eventType) %>%
summarise_each(funs(sum), -PropExp, -CropExp, -Dmg.1)
which both result in:
State year eventType Fatalities Injuries Dmg
1: WA 1971 HAIL 1 100 25
2: WA 1972 TORNADO 6 1100 90
3: WA 1973 SNOW 2 300 25
4: WA 1973 TORNADO 10 1200 90
Of course you can still do this in base R as well:
aggregate(cbind(Fatalities, Injuries, Dmg) ~ State + year + eventType, data = df, sum)
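Note that summarise_each() and funs() have since been deprecated in dplyr; in current versions the same column selection is written with across(). A sketch on a toy frame mirroring the question's columns:

```r
library(dplyr)

# Toy frame mirroring the question's columns (values from the example table)
df <- data.frame(
  State      = "WA",
  Fatalities = c(1, 6, 2, 6, 4, 0),
  Injuries   = c(100, 200, 300, 900, 300, 900),
  Dmg        = c(25, 25, 25, 65, 25, 65),
  year       = c(1971, 1972, 1973, 1973, 1973, 1972),
  eventType  = c("HAIL", "TORNADO", "SNOW", "TORNADO", "TORNADO", "TORNADO")
)

# Modern equivalent of summarise_each(funs(sum), ...): sum only the
# chosen columns per State/year/eventType group
out <- df %>%
  group_by(State, year, eventType) %>%
  summarise(across(c(Fatalities, Injuries, Dmg), sum), .groups = "drop")
out
```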
Beyond what has been offered, a cleaner way of doing this is to use plyr's ddply function:
library(plyr)
ddply(df, .(year, eventType), numcolwise(sum))