Average of multiple rows by name or row number in R - r

A dataset in R looks like below:
LD.D LD.L LD.P
Y.1992.a1 67.89552605 33.21192862 90.7750688
Y.1992.a2 227.1370541 79.67211036 154.5165077
Y.1992.a3 94.5326718 24.72816922 151.665545
Y.1992.a4 106.8793485 56.07635245 100.6711004
Y.1992.a5 97.41402289 46.93434073 100.8787496
Y.1993.a1 150.045093 19.64290196 27.81953228
Y.1993.a2 106.5888189 21.38886866 84.82532249
Y.1993.a3 110.7493543 25.41765759 70.02222315
Y.1993.a4 237.1246502 16.43006029 75.17407065
Y.1993.a5 234.5403261 16.93082727 49.01639754
Y.1994.a1 94.5326718 24.72816922 151.665545
Y.1994.a2 106.8793485 56.07635245 100.6711004
Y.1994.a3 97.41402289 46.93434073 100.8787496
Y.1994.a4 150.045093 19.64290196 27.81953228
Y.1994.a5 106.5888189 21.38886866 84.82532249
For each year I have five replicates. How can I get the average for each single year (e.g., 1992, 1993 and 1994)?

You could do this using either base R or with specialized packages such as dplyr or data.table (more efficient when the dataset is really big).
df$Year <- gsub("^.\\.(\\d+)\\..*", "\\1", row.names(df)) #extracted the year alone from the row names and created a column `Year` in the dataset
library(dplyr)
df %>%
group_by(Year) %>% #grouped by Year variable
summarise_each(funs(mean=mean(., na.rm=TRUE))) # summarise_each applies the given function (here mean) to every non-grouping column, or to a subset of columns if one is specified
# Source: local data frame [3 x 4]
# Year LD.D LD.L LD.P
#1 1992 118.7717 48.12458 119.70139
#2 1993 167.8096 19.96206 61.37151
#3 1994 111.0920 33.75413 93.17205
Using data.table. Convert to data.table using setDT and use lapply on a Subset of Data.table (.SD) columns and get the mean. Use by to specify the grouping variable Year.
library(data.table)
setDT(df)[, lapply(.SD, mean, na.rm=TRUE), by=Year]
# Year LD.D LD.L LD.P
#1: 1992 118.7717 48.12458 119.70139
#2: 1993 167.8096 19.96206 61.37151
#3: 1994 111.0920 33.75413 93.17205
Or use base R. There are several options: aggregate, by, split, etc. Here is one with by. A regex with a lookbehind extracts the year from the row names. In this case the Y prefix is kept as well, since it doesn't affect the results.
Year <- gsub("(?<=[0-9])\\..*$", "", row.names(df), perl=TRUE)
do.call(`rbind`,by(df, Year, FUN= colMeans, na.rm=TRUE))
# LD.D LD.L LD.P
#Y.1992 118.7717 48.12458 119.70139
#Y.1993 167.8096 19.96206 61.37151
#Y.1994 111.0920 33.75413 93.17205
data
df <- structure(list(LD.D = c(67.89552605, 227.1370541, 94.5326718,
106.8793485, 97.41402289, 150.045093, 106.5888189, 110.7493543,
237.1246502, 234.5403261, 94.5326718, 106.8793485, 97.41402289,
150.045093, 106.5888189), LD.L = c(33.21192862, 79.67211036,
24.72816922, 56.07635245, 46.93434073, 19.64290196, 21.38886866,
25.41765759, 16.43006029, 16.93082727, 24.72816922, 56.07635245,
46.93434073, 19.64290196, 21.38886866), LD.P = c(90.7750688,
154.5165077, 151.665545, 100.6711004, 100.8787496, 27.81953228,
84.82532249, 70.02222315, 75.17407065, 49.01639754, 151.665545,
100.6711004, 100.8787496, 27.81953228, 84.82532249)), .Names = c("LD.D",
"LD.L", "LD.P"), class = "data.frame", row.names = c("Y.1992.a1",
"Y.1992.a2", "Y.1992.a3", "Y.1992.a4", "Y.1992.a5", "Y.1993.a1",
"Y.1993.a2", "Y.1993.a3", "Y.1993.a4", "Y.1993.a5", "Y.1994.a1",
"Y.1994.a2", "Y.1994.a3", "Y.1994.a4", "Y.1994.a5"))

Try aggregate where DF is the data frame:
aggregate(DF, list(Year = gsub("^Y.|.[^.]*$", "", rownames(DF))), mean)
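A minimal self-contained sketch of the same aggregate idea, assuming the row names follow the Y.<year>.<replicate> pattern. The toy values and the names toy and year are made up for illustration. The dots in the pattern are escaped here so that they only match literal dots.

```r
# Toy data (hypothetical values); row names follow "Y.<year>.<replicate>"
toy <- data.frame(
  LD.D = c(1, 3, 10, 20),
  LD.L = c(2, 4, 30, 50),
  row.names = c("Y.1992.a1", "Y.1992.a2", "Y.1993.a1", "Y.1993.a2")
)

# Strip the leading "Y." and the trailing ".<replicate>" to keep only the year
year <- gsub("^Y\\.|\\.[^.]*$", "", rownames(toy))

# Average every column within each year
res <- aggregate(toy, by = list(Year = year), FUN = mean)
print(res)
```

The grouping vector is passed separately through by, so the data frame itself does not need a Year column.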

Related

Carrying out a simple dataframe subset with dplyr

Consider the following dataframe slice:
df = data.frame(locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
row.names = c("a091", "b231", "a234", "d154"))
df
locations score
a091 argentina 1
b231 brazil 2
a234 argentina 3
d154 denmark 4
sorted = c("a234","d154","a091") #in my real task these strings are provided from an exogenous function
df2 = df[sorted,] #quick and simple subset using rownames
EDIT: Here I'm trying to subset AND order the data according to sorted - sorry that was not clear before. So the output, importantly, is:
locations score
a234 argentina 3
d154 denmark 4
a091 argentina 1
And not as you would get from a simple subset operation:
locations score
a091 argentina 1
a234 argentina 3
d154 denmark 4
I'd like to do the exactly same thing in dplyr. Here is an inelegant hack:
require(dplyr)
dt = as_tibble(df)
rownames(dt) = rownames(df)
Warning message:
Setting row names on a tibble is deprecated.
dt2 = dt[sorted,]
I'd like to do it properly, where the rownames are an index in the data table:
dt_proper = as_tibble(x = df,rownames = "index")
dt_proper2 = dt_proper %>% ?some_function(index, sorted)? #what would this be?
dt_proper2
# A tibble: 3 x 3
index locations score
<chr> <fct> <int>
1 a234 argentina 3
2 d154 denmark 4
3 a091 argentina 1
But I can't for the life of me figure out how to do this using filter or some other dplyr function, and without some convoluted conversion to factor, re-order factor levels, etc.
Hi,
you can use mutate to copy the row names of your data frame into an index column, filter on the vector "sorted", and then order the rows according to "sorted":
df2 <- df %>% mutate(index=row.names(.)) %>% filter(index %in% sorted)
df2 <- df2[order(match(df2[,"index"], sorted)), ] # the trailing comma selects rows; without it the columns would be indexed
I think I've figured it out:
dt_proper2 = dt_proper[match(sorted,dt_proper$index),]
This seems to be the shortest equivalent of what df[sorted,] does.
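A small base R sketch of why match works here, using the data frame from the question. The helper names df_idx and pos are only illustrative.

```r
df <- data.frame(
  locations = c("argentina", "brazil", "argentina", "denmark"),
  score = 1:4,
  row.names = c("a091", "b231", "a234", "d154"),
  stringsAsFactors = FALSE
)
sorted <- c("a234", "d154", "a091")

# Move the row names into an ordinary column, as a tibble would require
df_idx <- df
df_idx$index <- rownames(df)
rownames(df_idx) <- NULL

# match() gives, for each element of `sorted`, its position in `index`,
# so indexing with it subsets and reorders in a single step
pos <- match(sorted, df_idx$index)
res <- df_idx[pos, ]
print(res)
```

Note that an id in sorted that is missing from index yields NA in pos and therefore a row of NAs, so it may be worth checking anyNA(pos) first.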
As far as I know, functions in the tidyverse (dplyr, tibble, etc.) are built around the concept that rows only contain attributes (columns) and have no row names / labels / indexes. So in order to sort the rows, you have to introduce a new column containing the rank of each row.
The way I would do it is to create another tibble containing your "sorting information" (sorting attribute, rank) and inner join it to your original tibble. Then I could order the rows by rank.
library(tidyverse)
# note that I've changed the third column's name to avoid confusion
df = tibble(
locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
custom_id = c("a091", "b231", "a234", "d154")
)
sorted_ids = c("a234","d154","a091")
sorting_info = tibble(
custom_id = sorted_ids,
rank = 1:length(sorted_ids)
)
ordered_ids = df %>%
inner_join(sorting_info) %>%
arrange(rank) %>%
select(-rank)

how to group identical instances in r into one and at the same time, generate frequency and average stats?

I'm at the last stage of cleaning/organizing data and would appreciate suggestions for this step. I'm new to R and don't understand fully how dataframes or other data types work. (I'm trying to learn but have a project due so need a quick solution). I've imported the data from a CSV file.
I want to group instances with the same (date, ID1, ID2, ID3). I want the average of all stats in the output and also a new column with the number of instances grouped.
Note: ID3 contains <NA> values. I'd like to rename these to "na" before grouping.
I've tried solutions
tdata$ID3[is.na(tdata$ID3)] <- "NA"
tdata[["ID3"]][is.na(tdata[["ID3"]])] <- "NA"
But get Error:
In `[<-.factor`(`*tmp*`, is.na(tdata[["ID3"]]), value = c(3L, 3L, :
invalid factor level, NA generated
The data is:
date ID1 ID2 ID3 stat1 stat2 stat.3
1 12-03-07 abc123 wxy456 pqr123 10 20 30
2 12-03-07 abc123 wxy456 pqr123 20 40 60
3 10-04-07 bcd456 wxy456 hgf356 10 20 40
4 12-03-07 abc123 wxy456 pqr123 30 60 90
5 5-09-07 spa234 int345 <NA> 40 50 70
Desired Output
date ID1, ID2, ID3, n, stat1, stat2, stat 3
12-03-07 abc123, wxy456, pqr457, 3, 20, 40, 60
10-04-07 bcd456, wxy456, hgf356, 1, 10, 20, 40
05-09-07 spa234, int345, big234, 1 , 40, 50, 70
I tried this solution: How to merge multiple data.frames and sum and average columns at the same time in R
But I was not successful merging the columns which have to be grouped and tested for similarity.
DF <- merge(tdata$date, tdata$ID1, tdata$ID2, tdata$ID3, by = "Name", all = T)
Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns
Finally, how do I generate the n column? Perhaps insert a column of 1s and take the sum of that column while summarizing?
We can do this with dplyr. After grouping by the 'ID' columns, add 'date' and 'n' to the grouping variables as well, and get the mean of the 'stat' columns.
library(dplyr)
df1 %>%
group_by(ID1, ID2, ID3) %>%
group_by(date = first(date), n =n(), add=TRUE) %>%
summarise_at(vars(matches("stat")), mean)
NOTE: To change the 'NA' to 'big234', we can convert 'ID3' to character class and replace the value before doing the above operation
df1$ID3 <- as.character(df1$ID3)
df1$ID3[is.na(df1$ID3)] <- "big234"
While I find the dplyr solution proposed by akrun very intuitive to use, there is also a nice data.table solution.
Like akrun, I assume that the NA value has been converted to "big234" to get the desired result.
library(data.table)
# convert data.frame to data.table
data <- data.table(df1)
# return the desired output
data[, c(.N, lapply(.SD, mean)),
by = list(date, ID1,ID2, ID3)]
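The idea from the question (a column of 1s that is summed per group) can also be sketched in base R. This is only an illustration: the data frame is retyped from the question, the NA is relabelled "na" as the asker suggested, and keys, means and count are hypothetical helper names.

```r
tdata <- data.frame(
  date   = c("12-03-07", "12-03-07", "10-04-07", "12-03-07", "5-09-07"),
  ID1    = c("abc123", "abc123", "bcd456", "abc123", "spa234"),
  ID2    = c("wxy456", "wxy456", "wxy456", "wxy456", "int345"),
  ID3    = c("pqr123", "pqr123", "hgf356", "pqr123", NA),
  stat1  = c(10, 20, 10, 30, 40),
  stat2  = c(20, 40, 20, 60, 50),
  stat.3 = c(30, 60, 40, 90, 70),
  stringsAsFactors = FALSE  # keep ID3 as character so NA can be replaced
)

# aggregate() omits rows whose grouping key is NA, so label them first
tdata$ID3[is.na(tdata$ID3)] <- "na"

# A column of ones turns the group count into a simple sum
tdata$n <- 1

keys  <- tdata[c("date", "ID1", "ID2", "ID3")]
means <- aggregate(tdata[c("stat1", "stat2", "stat.3")], by = keys, FUN = mean)
count <- aggregate(tdata["n"], by = keys, FUN = sum)

# Both results share the four key columns, so merge() lines them up
res <- merge(count, means)
print(res)
```

Reading the column as character (stringsAsFactors = FALSE) also avoids the "invalid factor level" warning from the question, which appears when a new label is assigned into a factor.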

find max value in a data.frame by group and show its date as year-month-day

Here my dataframe:
df = read.csv(text = '"Date","Value","ID","WY"
1975-02-01,-1.16543693088,"Tweed",1975
1975-03-01,-1.05372283483,"Tweed",1975
1975-04-01,-1.06632370439,"Tweed",1975
1975-05-01,-1.18903485356,"Tweed",1975
1992-05-01,-1.04737467143,"Ouse",1992
1992-06-01,-1.4058281451,"Ouse",1992
1992-07-01,-1.13608647243,"Ouse",1992
1992-08-01,-0.802566581309,"Ouse",1992
1992-09-01,-0.551433852821,"Ouse",1992
1992-10-01,-0.625997598552,"Ouse",1993
1992-11-01,-0.483559758609,"Ouse",1993
1992-12-01,-0.792013395632,"Ouse",1993
1993-01-01,-0.754618121962,"Ouse",1993
1993-02-01,-1.2504282139,"Ouse",1993
1996-01-01,-0.945410385985,"Trent",1996
1996-02-01,-0.84249575782,"Trent",1996
1996-03-01,-1.10332425045,"Trent",1996
1996-04-01,-1.22634133042,"Trent",1996
1996-05-01,-1.2335181635,"Trent",1996
1996-06-01,-1.23451130358,"Trent",1996
1996-07-01,-1.25902677738,"Trent",1996
1996-08-01,-1.13068733413,"Trent",1996', header = TRUE)
I need to find the annual maximum value for each ID and WY group.
The following code does the trick very easily, but its output only shows the year of each annual maximum, whereas I am also interested in the corresponding month and day:
df_AMAX = aggregate(df$Value, by = list(df$WY, df$ID), max)
colnames(df_AMAX) = c('Date', 'ID', 'Value')
print(df_AMAX)
Date ID Value
1 1992 Ouse -0.5514339
2 1993 Ouse -0.4835598
3 1996 Trent -0.8424958
4 1975 Tweed -1.0537228
My output should be:
Date ID Value
1 1992-09-01 Ouse -0.5514339
2 1992-11-01 Ouse -0.4835598
3 1996-02-01 Trent -0.8424958
4 1975-03-01 Tweed -1.0537228
It should be a silly thing but please let me know if you have any suggestion.
Thanks
Use subset with ave. Note that the function passed to ave returns a logical but ave will coerce it to the class of Value so we use !! to make it logical again. No packages are used.
mx_all <- function(x) if (length(x)) x == max(x)
subset(df, !!ave(Value, ID, WY, FUN = mx_all))
or
mx_first <- function(x) if (length(x)) seq_along(x) == which.max(x)
subset(df, !!ave(Value, ID, WY, FUN = mx_first))
These give the same answer for the sample input, and will always agree when each group has a unique maximum. If a group has multiple maxima, the first version returns all of them and the second returns only the first.
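The difference only shows up with ties, so here is a small sketch on made-up data in which group "a" reaches its maximum twice. The toy data frame is hypothetical; the two helper functions are the ones defined above.

```r
# Toy data (hypothetical): group "a" has its maximum value twice
toy <- data.frame(
  ID    = c("a", "a", "a", "b", "b"),
  Value = c(1, 5, 5, 2, 3)
)

mx_all   <- function(x) if (length(x)) x == max(x)
mx_first <- function(x) if (length(x)) seq_along(x) == which.max(x)

# ave() hands back 0/1 in the type of Value, so convert it to logical again
all_max   <- toy[as.logical(ave(toy$Value, toy$ID, FUN = mx_all)), ]
first_max <- toy[as.logical(ave(toy$Value, toy$ID, FUN = mx_first)), ]

print(all_max)    # rows 2, 3 and 5
print(first_max)  # rows 2 and 5
```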
There is of course a dplyr solution, too:
library(dplyr)
df %>%
group_by(WY, ID) %>%
summarise(
Date = Date[which.max(Value)], # pick the date first, while Value is still the full column
Value = max(Value)) %>%
ungroup() %>%
select(Date, ID, Value)
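Another base R route, sketched under the assumption that it is acceptable to join the group maxima back onto the original rows. The data below is a shortened copy of the question's data, and amax is a hypothetical name.

```r
df <- data.frame(
  Date  = c("1992-08-01", "1992-09-01", "1992-10-01",
            "1992-11-01", "1996-01-01", "1996-02-01"),
  Value = c(-0.802566581309, -0.551433852821, -0.625997598552,
            -0.483559758609, -0.945410385985, -0.84249575782),
  ID    = c("Ouse", "Ouse", "Ouse", "Ouse", "Trent", "Trent"),
  WY    = c(1992, 1992, 1993, 1993, 1996, 1996),
  stringsAsFactors = FALSE
)

# Step 1: one row per (ID, WY) holding the group maximum
amax <- aggregate(Value ~ ID + WY, data = df, FUN = max)

# Step 2: merge() joins on the shared columns (ID, WY, Value),
# which recovers the Date of the row that holds each maximum
res <- merge(amax, df)[c("Date", "ID", "Value")]
print(res)
```

If a group contains tied maxima, this join returns every tied row rather than just one.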

Multiply values from numeric vector by specific values from data frame

I have a numeric vector:
> dput(vec_exp)
structure(c(12.344902729712, 6.54357482855349, 17.1939193108764,
23.1029632631654, 8.91495023159554, 14.3259091357051, 18.0494234749187,
2.92524638658168, 5.10306474037357, 2.66645609602021), .Names = c("Arthur_1",
"Mark_1", "Mark_2", "Mark_3", "Stephen_1", "Stephen_2",
"Stephen_3", "Rafael_1", "Marcus_1", "Georg_1"))
and then I have a data frame like the one below:
Name Nr Numb
1 Rafael 20.8337 20833.7
2 Joseph 25.1682 25168.2
3 Stephen 40.5880 40588.0
4 Leon 198.7730 198773.0
5 Thierry 16.5430 16543.0
6 Marcus 31.6600 31660.0
7 Lucas 39.6700 39670.0
8 Georg 194.9410 194941.0
9 Mark 60.1020 60102.0
10 Chris 56.0578 56057.8
I would like to multiply the numbers in numeric vector by the numbers from the column Nr in this data frame. Of course it is important to multiply the values by the name. It means that Mark_1 from numeric vector should be multiplied by the Nr = 60.1020, same for Mark_2, and Stephen_3 by 40.5880, etc.
Can someone recommend an easy solution?
You could use match to match the names after extracting only the first part of the names of vec_exp, i.e. extract Mark from Mark_1 etc.
vec_exp * df$Nr[match(sub("^([^_]+).*", "\\1", names(vec_exp)), df$Name)]
# Arthur_1 Mark_1 Mark_2 Mark_3 Stephen_1 Stephen_2 Stephen_3 Rafael_1 Marcus_1 Georg_1
# NA 393.28193 1033.38894 1388.53430 361.84000 581.46000 732.59000 60.94371 161.56303 519.80162
Arthur is NA because there's no match in the data.frame.
If you want to keep those entries without a match in the data as they were before, you could do it like this:
i <- match(sub("^([^_]+).*", "\\1", names(vec_exp)), df$Name)
vec_exp[!is.na(i)] <- vec_exp[!is.na(i)] * df$Nr[na.omit(i)]
This first computes the matches and then only multiplies those if they are not NA.
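The same lookup can be written with a named vector used as a lookup table. This is a sketch on a shortened, rounded version of the data; nr_by_name, key and factor_for are hypothetical names.

```r
# Shortened, rounded stand-ins for the data in the question
vec_exp <- c(Arthur_1 = 12.3, Mark_1 = 6.5, Mark_2 = 17.2, Stephen_1 = 8.9)
df <- data.frame(Name = c("Stephen", "Mark"), Nr = c(40.588, 60.102),
                 stringsAsFactors = FALSE)

# Named vector acting as a lookup table: Name -> Nr
nr_by_name <- setNames(df$Nr, df$Name)

# Key of each element: everything before the first underscore
key <- sub("_.*", "", names(vec_exp))

# Indexing by name returns NA for keys that are not in the table
factor_for <- unname(nr_by_name[key])
res <- vec_exp * factor_for
print(res)
```

As with match, names without an entry in the data frame (Arthur here) end up as NA.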
We can use base R methods. Convert the vector to a data.frame with stack, create a 'Name' column by removing the substring from 'ind' and merge with the data.frame ('df1'). Then, we can multiply the 'Nr' and the 'values' column.
d1 <- merge(df1, transform(stack(vec_exp), Name = sub("_.*", "", ind)), all.y=TRUE)
d1$Nr*d1$values
Or with dplyr, where it is much easier to understand.
library(dplyr)
library(tidyr)
stack(vec_exp) %>%
separate(ind, into = c("Name", "ind")) %>%
left_join(., df1, by = "Name") %>%
mutate(res = values*Nr) %>%
.$res
#[1] NA 393.28193 1033.38894 1388.53430 361.84000
#[6] 581.46000 732.59000 60.94371 161.56303 519.80162

Refer particular value in `dplyr::mutate()`

I have the following code:
library(dplyr)
library(quantmod)
# inflation data
getSymbols("CPIAUCSL", src='FRED')
avg.cpi <- apply.yearly(CPIAUCSL, mean)
cf <- avg.cpi/as.numeric(avg.cpi['1991']) # using 1991 as the base year
cf <- as.data.frame(cf)
cf$year <- rownames(cf)
cf <- tail(cf, 25)
rownames(cf) <- NULL
cf$year <- lapply(cf$year, function(x) as.numeric(head(unlist(strsplit(x, "-")), 1)))
rm(CPIAUCSL)
# end of inflation data get
tmp <- data.frame(year=c(rep(1991,2), rep(1992,2)), price=c(12.03, 12.98, 14.05, 14.58))
tmp %>% mutate(infl.price = price / cf[cf$year == year, ]$CPIAUCSL)
I want to get the following result:
year price
1991 12.03
1991 12.98
1992 13.64
1992 14.16
But I'm getting an error:
Warning message:
In cf$year == tmp$year :
longer object length is not a multiple of shorter object length
And with %in% it produces an incorrect result.
I think it might be easier to join the CPIAUCSL column in cf into tmp before you try to mutate:
cf$year = as.numeric(cf$year)
tmp = tmp %>% inner_join(cf, by = "year") %>% mutate(infl.price = price / CPIAUCSL)
Your cf$year column is a list, which is unfriendly to work with. It would have been nicer to have
cf$year <- sapply(cf$year, function(x) as.numeric(head(unlist(strsplit(x, "-")), 1)))
which at least returns a simple vector.
Additionally, the subsetting operator [] is not properly vectorized for this type of operation. The mutate() function does not iterate over rows; it operates on entire columns at a time. When you do
cf[cf$year == year, ]$CPIAUCSL
there is not just one year value; mutate is trying to handle them all at once.
You'd be better off doing a proper merge with your data and then do the mutate. This will basically do the same thing as your pseudo-merge that you were trying to do in your version.
You can do
tmp %>% left_join(cf) %>%
mutate(infl.price = price / CPIAUCSL) %>%
select(-CPIAUCSL)
to get
year price infl.price
1 1991 12.03 12.03000
2 1991 12.98 12.98000
3 1992 14.05 13.63527
4 1992 14.58 14.14962
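For comparison, a per-row lookup can also be done in base R with match. The deflator values below are invented for illustration (the real ones come from the CPI series), and idx is a hypothetical name.

```r
# Hypothetical deflators; the real values come from the CPIAUCSL series
cf  <- data.frame(year = c(1991, 1992, 1993), CPIAUCSL = c(1.00, 1.03, 1.06))
tmp <- data.frame(year  = c(1991, 1991, 1992, 1992),
                  price = c(12.03, 12.98, 14.05, 14.58))

# cf$year == tmp$year would compare a length-3 and a length-4 vector
# element by element (with recycling), which is not a per-row lookup.
# match() returns the row of cf that belongs to every row of tmp instead.
idx <- match(tmp$year, cf$year)
tmp$infl.price <- tmp$price / cf$CPIAUCSL[idx]
print(tmp)
```

This does the same job as the join: every row of tmp is paired with the deflator of its own year before the division.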
