To give a small working example, suppose I have the following data frame:
library(dplyr)
country <- rep(c("A", "B", "C"), each = 6)
year <- rep(c(1,2,3), each = 2, times = 3)
categ <- rep(c(0,1), times = 9)
pop <- rep(c(NA, runif(n=8)), each=2)
money <- runif(18)+100
df <- data.frame(Country = country,
                 Year = year,
                 Category = categ,
                 Population = pop,
                 Money = money)
Now, the data I'm actually working with has many more repetitions: for every country, year, and category there are many repeated rows, corresponding to various sources of money, and I want to sum these all together. For now, though, it's enough to have one row for each country, year, and category, and to trivially apply the sum() function to each row. This still exhibits the behavior I'm trying to get rid of.
Notice that for country A in year 1, the population listed is NA. Therefore when I run
aggregate(Money ~ Country+Year+Category+Population, df, sum)
the resulting data frame has dropped the rows corresponding to country A and year 1. I'm only using the ...+Population... bit of code because I want the output data frame to retain this column.
I'm wondering how to make the aggregate() function not drop things that have NAs in the columns by which the grouping occurs--it'd be nice if, for instance, the NAs themselves could be treated as values to group by.
My attempts: I tried turning the Population column into a factor, but that didn't change the behavior. I read about the na.action argument, but neither na.action = NULL nor na.action = na.pass changed the behavior. I thought about turning all the NAs into 0s, and I can't think of what that would hurt, but it feels like a hack that might bite me later on--not sure. I'm also not sure how I would do it: when I wrote a function with an if (is.na(x)) test in it, the test wasn't applied in a vectorized way, and R warned that the condition has length > 1 and only the first element would be used. I thought about using lapply() on the column, coercing the result back to a vector, and sticking that in the column, but that also seems hacky and needlessly round-about.
The solution here seemed to be about keeping the NA values out of the data frame in the first place, which I can't do: Aggregate raster in R with NA values
Since you have already loaded dplyr before your data, you can use the dplyr::summarise function. Unlike aggregate(), group_by() treats NA as a value to group by, so those rows are kept.
library(dplyr)
df %>%
  group_by(Country, Year, Category, Population) %>%
  summarise(Money = sum(Money))
# # A tibble: 18 x 5
# # Groups: Country, Year, Category [?]
# Country Year Category Population Money
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 A 1.00 0 NA 101
# 2 A 1.00 1.00 NA 100
# 3 A 2.00 0 0.482 101
# 4 A 2.00 1.00 0.482 101
# 5 A 3.00 0 0.600 101
# 6 A 3.00 1.00 0.600 101
# 7 B 1.00 0 0.494 101
# 8 B 1.00 1.00 0.494 101
# 9 B 2.00 0 0.186 100
# 10 B 2.00 1.00 0.186 100
# 11 B 3.00 0 0.827 101
# 12 B 3.00 1.00 0.827 101
# 13 C 1.00 0 0.668 100
# 14 C 1.00 1.00 0.668 101
# 15 C 2.00 0 0.794 100
# 16 C 2.00 1.00 0.794 100
# 17 C 3.00 0 0.108 100
# 18 C 3.00 1.00 0.108 100
Note: The OP's sample data doesn't have multiple rows for the same groups, so the number of summarised rows is the same as the number of input rows.
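If you'd rather stay in base R, one workaround (a sketch, not the only option) is to make NA an explicit factor level with addNA(), so that aggregate() no longer sees missing values in the grouping column:
# addNA() turns NA into a real level of Population; is.na() is then FALSE
# for those rows, so aggregate()'s default na.omit no longer drops them
df$Population <- addNA(factor(df$Population))
aggregate(Money ~ Country + Year + Category + Population, df, sum)
# the vectorized df$Population[is.na(df$Population)] <- 0 would also work
# (no if() needed), at the cost of rewriting the data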
Following the interaction in the previous post, Reshaping database using reshape package, I'm creating this one to ask another question.
Briefly: I have a data set in which some rows are replicated for the Id column, and I would like to reshape it to wide format. The following code shows a small example of my data:
test <- data.frame(Id = c(1, 1, 2, 3),
                   St = c(20, 80, 80, 20),
                   gap = seq(0.02, 0.08, by = 0.02),
                   gip = c(0.23, 0.60, 0.86, 2.09),
                   gat = c(0.0107, 0.989, 0.337, 0.663))
I would like the final data set to look like the figure I attached: one row for each Id value, with the repeated values spread across separate columns.
Can you give me any suggestions?
You can use dcast from data.table. This function allows you to spread multiple value variables at once.
library(data.table)
setDT(test) # convert test to a data.table
test1 <- dcast(test, Id ~ rowid(Id),
               value.var = c('St', 'gap', 'gip', 'gat'), fill = 0)
test1
# Id St_1 St_2 gap_1 gap_2 gip_1 gip_2 gat_1 gat_2
#1: 1 20 80 0.02 0.04 0.23 0.6 0.0107 0.989
#2: 2 80 0 0.06 0.00 0.86 0.0 0.3370 0.000
#3: 3 20 0 0.08 0.00 2.09 0.0 0.6630 0.000
If you want to continue with a data.frame, call setDF(test1) at the end.
A dplyr/tidyr alternative is to first gather to long format, group_by Id and key, create a sequential row identifier for each group (new_key), and finally spread back to wide form.
library(dplyr)
library(tidyr)
test %>%
  gather(key, value, -Id) %>%
  group_by(Id, key) %>%
  mutate(new_key = paste0(key, row_number())) %>%
  ungroup() %>%
  select(-key) %>%
  spread(new_key, value, fill = 0)
# Id gap1 gap2 gat1 gat2 gip1 gip2 St1 St2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.02 0.04 0.0107 0.989 0.23 0.6 20 80
#2 2 0.06 0 0.337 0 0.86 0 80 0
#3 3 0.08 0 0.663 0 2.09 0 20 0
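If you have tidyr 1.1 or later, the superseded gather/spread pair can be replaced with a single pivot_wider call; a sketch, assuming the same test data:
library(dplyr)
library(tidyr)
test %>%
  group_by(Id) %>%
  mutate(rn = row_number()) %>%   # sequential id within each Id, like rowid(Id)
  ungroup() %>%
  pivot_wider(names_from = rn,
              values_from = c(St, gap, gip, gat),
              values_fill = 0)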
I'm new to R and I'm struggling with this df, which looks like this:
Date Group Factor1 Factor2 Spread
2019-04-01 a 1.01 1.011 0.01
2019-04-02 a 1.02 1.012 0.02
2019-04-03 a 1.03 1.013 0.03
2019-04-01 b 1.005 1.004 0.01
2019-04-02 b 1.0051 1.0041 0.02
2019-04-03 b 1.0052 1.0042 0.03
I would like to check the group in each row: if the row is in group "a", compute Factor1 / Factor1 (1-day lag) * Factor2 + Spread; if the group is not "a", do the same but without adding the spread.
Since you are conditioning on the group, this is a good example of by (base R), dplyr::group_by, or data.table's x[,,by=].
The equation is effectively the same in all three, capitalizing on the fact that (Group[1] == "a") will be coerced from logical to numeric when multiplied by a number; since FALSE translates to 0, this effectively disables adding Spread for the other groups.
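The question only shows printed output, so here is a hedged reconstruction of the data, assumed to be named x as in the code below:
x <- data.frame(Date = rep(c("2019-04-01", "2019-04-02", "2019-04-03"), times = 2),
                Group = rep(c("a", "b"), each = 3),
                Factor1 = c(1.01, 1.02, 1.03, 1.005, 1.0051, 1.0052),
                Factor2 = c(1.011, 1.012, 1.013, 1.004, 1.0041, 1.0042),
                Spread = rep(c(0.01, 0.02, 0.03), times = 2),
                stringsAsFactors = FALSE)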
Base
I use within here to make the internals a little more readable, but this is not a requirement (without it you'd need to prefix all of the variable names with y$).
The lagging can be done using dplyr::lag (even if you don't use the rest of the package for this) or many other techniques. I don't find stats::lag to be the most intuitive in applications like this, but I'm sure somebody will suggest a way to incorporate it into an answer. The use of c(NA, ...) ensures that we don't bring in a different group's data or impute data we don't have, since we have no value to bring in on the first row of a group. Finally, head(..., n = 1) returns the first element of a vector/list, while head(..., n = -1) (negative) returns all but the last.
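To see the lag construction in isolation (a quick sketch):
v <- c(10, 20, 30)
c(NA, head(v, n = -1))   # NA 10 20: each value shifted down one position
dplyr::lag(v)            # same result
Applying that within each group: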
newx <- by(x, x$Group, function(y) {
  within(y, {
    NewVal = Factor2 * Factor1 / c(NA, head(Factor1, n = -1)) + (Group[1] == "a") * Spread
  })
})
newx
# x$Group: a
# Date Group Factor1 Factor2 Spread NewVal
# 1 2019-04-01 a 1.01 1.011 0.01 NA
# 2 2019-04-02 a 1.02 1.012 0.02 1.042020
# 3 2019-04-03 a 1.03 1.013 0.03 1.052931
# -------------------------------------------------------
# x$Group: b
# Date Group Factor1 Factor2 Spread NewVal
# 4 2019-04-01 b 1.0050 1.0040 0.01 NA
# 5 2019-04-02 b 1.0051 1.0041 0.02 1.0042
# 6 2019-04-03 b 1.0052 1.0042 0.03 1.0043
This is really just a list with some fancy by-specific formatting, so you can treat it as such and combine the pieces in an efficient base-R way:
do.call("rbind.data.frame", c(newx, stringsAsFactors = FALSE))
# Date Group Factor1 Factor2 Spread NewVal
# a.1 2019-04-01 a 1.0100 1.0110 0.01 NA
# a.2 2019-04-02 a 1.0200 1.0120 0.02 1.042020
# a.3 2019-04-03 a 1.0300 1.0130 0.03 1.052931
# b.4 2019-04-01 b 1.0050 1.0040 0.01 NA
# b.5 2019-04-02 b 1.0051 1.0041 0.02 1.004200
# b.6 2019-04-03 b 1.0052 1.0042 0.03 1.004300
dplyr
Many find the tidyverse family of packages intuitive to read.
library(dplyr)
x %>%
  group_by(Group) %>%
  mutate(NewVal = Factor2 * Factor1 / lag(Factor1) + (Group[1] == "a") * Spread) %>%
  ungroup()
# # A tibble: 6 x 6
# Date Group Factor1 Factor2 Spread NewVal
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2019-04-01 a 1.01 1.01 0.01 NA
# 2 2019-04-02 a 1.02 1.01 0.02 1.04
# 3 2019-04-03 a 1.03 1.01 0.03 1.05
# 4 2019-04-01 b 1.00 1.00 0.01 NA
# 5 2019-04-02 b 1.01 1.00 0.02 1.00
# 6 2019-04-03 b 1.01 1.00 0.03 1.00
data.table
On a different note, many find data.table better because of efficiencies gained from in-place modification (most of R's operations are copy-on-write, meaning some operations re-copy the object or a portion of it with each change).
library(data.table)
X <- as.data.table(x)
X[, NewVal := Factor2 * Factor1 / shift(Factor1) + (Group[1] == "a") * Spread, by = "Group"]
X
# Date Group Factor1 Factor2 Spread NewVal
# 1: 2019-04-01 a 1.0100 1.0110 0.01 NA
# 2: 2019-04-02 a 1.0200 1.0120 0.02 1.042020
# 3: 2019-04-03 a 1.0300 1.0130 0.03 1.052931
# 4: 2019-04-01 b 1.0050 1.0040 0.01 NA
# 5: 2019-04-02 b 1.0051 1.0041 0.02 1.004200
# 6: 2019-04-03 b 1.0052 1.0042 0.03 1.004300
The "in-place" part is evident on the second line here, where it appears as if the [ operation should just return a subset or something of the data ... but in this case using := causes the columns to be created (or changed) in-place.
Say I have a dataframe called RaM that holds cumulative return values. In this case it is literally a single row of cumulative returns with column headers, but I would like the logic to apply to dataframes with more than one row as well.
Say I want to sort by the max cumulative return value of each column, or even the average, or the sum of each column.
The columns would then be re-ordered so that, say, the max cumulative returns of the columns are compared and the column with the highest return becomes the first column, with the minimum last.
Then say I want to take either the top 10 (the first 10 columns after rearranging) or even the top 10%.
I know how to derive the column averages, but I don't know how to do the remaining operations effectively. There is an order() function, but when I used it, it stripped my column names, which I need. I could then easily cut the first, say, 10 columns, but is there a way that preserves the names? I don't think I can easily extract the names from the unordered original dataframe and apply them to my aggregate-sorted one. My goal is to extract the column names of the top n columns of RaM, ranked by some column aggregate function applied over the entire dataframe.
something like
top10 <- getTop10ColumnNames(colSums(RaM))
that would then output a dataframe of the top 10 columns in terms of their sum from RaM
Here's output off RaM
> head(RaM,2)
ABMD ACAD ALGN ALNY ANIP ASCMA AVGO CALD CLVS CORT
2013-01-31 0.03794643 0.296774194 0.13009009 0.32219178 0.13008130 0.02857604 0.13014640 -0.07929515 0.23375000 0.5174825
2013-02-28 0.14982079 0.006633499 0.00255102 -0.01823456 -0.05755396 0.07659708 -0.04333138 0.04066986 -0.04457953 -0.2465438
CPST EA EGY EXEL FCSC FOLD GNC GTT HEAR HK HZNP
2013-01-31 -0.05269663 0.08333333 -0.01849711 0.01969365 0 0.4179104 0.07992677 0.250000000 0.2017417 0.10404624 -0.085836910
2013-02-28 0.15051595 0.11443102 -0.04475854 -0.02145923 0 -0.2947368 0.14079036 0.002857143 0.4239130 -0.07068063 -0.009389671
ICON IMI IMMU INFI INSY KEG LGND LQDT MCF MU
2013-01-31 0.07750896 0.05393258 -0.01027397 -0.01571429 -0.05806459 0.16978417 -0.03085824 -0.22001958 0.01345609 0.1924290
2013-02-28 -0.01746362 0.03091684 -0.20415225 0.19854862 0.36849503 0.05535055 0.02189055 0.06840289 -0.09713487 0.1078042
NBIX NFLX NVDA OREX PFPT PQ PRTA PTX RAS REXX RTRX
2013-01-31 0.2112299 0.7846467 0.00000000 0.08950306 0.06823721 0.03838384 -0.1800819 0.04387097 0.23852335 0.008448541 0.34328358
2013-02-28 0.1677704 0.1382251 0.03888981 0.04020979 0.06311787 -0.25291829 0.0266223 -0.26328801 0.05079882 0.026656512 -0.02222222
SDRL SHOS SSI STMP TAL TREE TSLA TTWO UVE VICL
2013-01-31 0.07826093 0.2023956 -0.07788381 0.07103175 -0.14166875 -0.030504714 0.10746974 0.1053588 0.0365299 0.2302405
2013-02-28 -0.07585546 0.1384419 0.08052150 -0.09633197 0.08009728 -0.002860412 -0.07144761 0.2029581 -0.0330408 -0.1061453
VSI VVUS WLB
2013-01-31 0.06485356 -0.0976155 0.07494647
2013-02-28 -0.13965291 -0.1156069 0.04581673
Here's one way, using the first section of your sample data to illustrate. You can gather up all the columns so that summary calculations are easier, calculate all the by-group summaries you want, and then sort with arrange. Here I ordered with the highest sums first, but you could use whatever order you want.
library(tidyverse)
ram <- read_table2(
"ABMD ACAD ALGN ALNY ANIP ASCMA AVGO CALD CLVS CORT
0.03794643 0.296774194 0.13009009 0.32219178 0.13008130 0.02857604 0.13014640 -0.07929515 0.23375000 0.5174825
0.14982079 0.006633499 0.00255102 -0.01823456 -0.05755396 0.07659708 -0.04333138 0.04066986 -0.04457953 -0.2465438"
)
summary <- ram %>%
  gather(colname, value) %>%
  group_by(colname) %>%
  summarise_at(.vars = vars(value), .funs = funs(mean = mean, sum = sum, max = max)) %>%
  arrange(desc(sum))
summary
#> # A tibble: 10 x 4
#> colname mean sum max
#> <chr> <dbl> <dbl> <dbl>
#> 1 ALNY 0.152 0.304 0.322
#> 2 ACAD 0.152 0.303 0.297
#> 3 CORT 0.135 0.271 0.517
#> 4 CLVS 0.0946 0.189 0.234
#> 5 ABMD 0.0939 0.188 0.150
#> 6 ALGN 0.0663 0.133 0.130
#> 7 ASCMA 0.0526 0.105 0.0766
#> 8 AVGO 0.0434 0.0868 0.130
#> 9 ANIP 0.0363 0.0725 0.130
#> 10 CALD -0.0193 -0.0386 0.0407
If you then want to reorder your original data frame, you can get the order from this summary output and index with it:
ram[summary$colname]
#> # A tibble: 2 x 10
#> ALNY ACAD CORT CLVS ABMD ALGN ASCMA AVGO ANIP
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.322 0.297 0.517 0.234 0.0379 0.130 0.0286 0.130 0.130
#> 2 -0.0182 0.00663 -0.247 -0.0446 0.150 0.00255 0.0766 -0.0433 -0.0576
#> # ... with 1 more variable: CALD <dbl>
Created on 2018-08-01 by the reprex package (v0.2.0).
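As a base-R footnote on the question's hypothetical getTop10ColumnNames(): sort() keeps names, unlike order(), which returns bare integer positions, so a small helper could look like this sketch (topN and its arguments are invented names):
topN <- function(df, n = 10, agg = colSums) {
  # aggregate each column, sort the named result descending, keep the top-n names
  names(sort(agg(df), decreasing = TRUE))[seq_len(min(n, ncol(df)))]
}
top10 <- topN(RaM, 10)   # character vector of the top-10 column names
RaM[top10]               # select/reorder those columns, names preserved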
I am trying to summarise a dataframe by grouping on the label column. I want to obtain means based on the following conditions:
- if all numbers are NA - then I want to return NA
- if mean of all the numbers is 1 or lower - I want to return 1
- if mean of all the numbers is higher than 1 - I want a mean of the values in the group that are greater than 1
- all the rest should be 100.
I managed to find the answer and now my code runs well: is.na() should be used instead of == NA in the first ifelse() statement, and that was the issue.
label <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7)
sev <- c(NA,NA,NA,NA,1,0,1,1,1,NA,1,2,2,4,5,1,0,1,1,4,5)
Data2 <- data.frame(label,sev)
d <- Data2 %>%
  group_by(label) %>%
  summarize(sevmean = ifelse(is.na(mean(sev, na.rm = TRUE)), NA,
                             ifelse(mean(sev, na.rm = TRUE) <= 1, 1,
                                    ifelse(mean(sev, na.rm = TRUE) > 1,
                                           mean(sev[sev > 1], na.rm = TRUE), 100))))
Your first condition is the issue here. If we remove the nested ifelse calls and keep only the first one, we get the same all-NA output:
Data2 %>%
  group_by(label) %>%
  summarise(sevmean = ifelse(mean(sev, na.rm = TRUE) == NaN, NA, 1))
# label sevmean
# <dbl> <lgl>
#1 1.00 NA
#2 2.00 NA
#3 3.00 NA
#4 4.00 NA
#5 5.00 NA
#6 6.00 NA
#7 7.00 NA
I am not sure why you are checking for NaN, but if you want to do that, check it with is.nan() instead of ==:
Data2 %>%
  group_by(label) %>%
  summarize(sevmean = ifelse(is.nan(mean(sev, na.rm = TRUE)), NA,
                             ifelse(mean(sev, na.rm = TRUE) <= 1, 1,
                                    ifelse(mean(sev, na.rm = TRUE) > 1,
                                           mean(sev[sev > 1], na.rm = TRUE), 100))))
# label sevmean
# <dbl> <dbl>
#1 1.00 NA
#2 2.00 1.00
#3 3.00 1.00
#4 4.00 2.00
#5 5.00 3.67
#6 6.00 1.00
#7 7.00 4.50
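For reference, the reason == can never work here is that comparisons against NA or NaN propagate missingness instead of returning TRUE:
NaN == NaN    # NA, not TRUE
NA == NA      # NA
is.nan(NaN)   # TRUE
is.na(NaN)    # TRUE: NaN counts as missing, which is why is.na() also works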
I am trying to learn R, and I have a data frame which contains 68 continuous and categorical variables. There are two variables, x and lnx, with which I need help. x contains a large number of 0s and NAs, and in those rows lnx shows NA. I want to write code that takes log(x + 1) so that the NAs in lnx are replaced with 0 wherever the corresponding x is 0 (if x == 0, I want lnx == 0; if x is NA, I want lnx to stay NA). The data frame looks something like this -
a b c d e f x lnx
AB1001 1.00 3.00 67.00 13.90 2.63 1776.7 7.48
AB1002 0.00 2.00 72.00 38.70 3.66 0.00 NA
AB1003 1.00 3.00 48.00 4.15 1.42 1917 7.56
AB1004 0.00 1.00 70.00 34.80 3.55 NA NA
AB1005 1.00 1.00 34.00 3.45 1.24 3165.45 8.06
AB1006 1.00 1.00 14.00 7.30 1.99 NA NA
AB1007 0.00 3.00 53.00 11.20 2.42 0.00 NA
I tried writing the following code -
data.frame$lnx[is.na(data.frame$lnx)] <- log(data.frame$x +1)
but I get the following warning message, and the output is wrong:
number of items to replace is not a multiple of replacement length
Can someone guide me, please?
Thanks.
In R you can select rows using conditionals and assign values directly. In your example you could do this:
df[which(is.na(df$lnx) & df$x == 0), 'lnx'] <- 0
Here's what this does:
is.na(df$lnx) returns a logical vector the length of df$lnx telling, for each row, whether lnx is NA. df$x == 0 does the same kind of check for x == 0 (note that it yields NA, not FALSE, in rows where x itself is NA). By using the & operator, we combine those vectors into one that contains TRUE only for rows where both conditions are TRUE; wrapping the result in which() converts it to row positions and drops the NAs, which would otherwise break a subscripted assignment on a data frame.
We then use the bracket notation to select the lnx column of those rows in df and then insert the value 0 into those cells using <-
The specific warning you're getting is because log(data.frame$x + 1) and df$lnx[is.na(df$lnx)] are different lengths: log(data.frame$x + 1) produces a vector whose length is the number of rows of your data frame, while the length of df$lnx[is.na(df$lnx)] is the number of rows that have NA in lnx.
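A minimal illustration of that recycling warning, with made-up vectors:
y <- 1:5
y[c(TRUE, FALSE, TRUE, FALSE, TRUE)] <- c(10, 20)  # 3 slots, 2 values
# Warning: number of items to replace is not a multiple of replacement length
y
# [1] 10  2 20  4 10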
Using a dplyr solution:
library(dplyr)
df %>%
  mutate(lnx = case_when(
    x == 0 ~ 0,
    is.na(x) ~ NA_real_,
    TRUE ~ lnx))   # keep the existing lnx everywhere else
This yields for your example:
# A tibble: 7 x 8
a b c d e f x lnx
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AB1001 1. 3. 67. 13.9 2.63 1777. 7.48
2 AB1002 0. 2. 72. 38.7 3.66 0. 0.
3 AB1003 1. 3. 48. 4.15 1.42 1917. 7.56
4 AB1004 0. 1. 70. 34.8 3.55 NA NA
5 AB1005 1. 1. 34. 3.45 1.24 3165. 8.06
6 AB1006 1. 1. 14. 7.30 1.99 NA NA
7 AB1007 0. 3. 53. 11.2 2.42 0. 0.