Average for column value across multiple datasets in R [duplicate]

I am new to R and need help with this. I have 3 data sets from 3 different years. They have the same columns, with different values for each year. I want to find the average of the column values across the three years, based on the Name field. To be specific:
Assume the first data set is:
Name Age Height Weight
A 4 20 20
B 5 22 22
C 8 25 21
D 10 25 23
second data set
Name Age Height Weight
A 5 22 25
B 6 23 26
Third data set
Name Age Height Weight
A 6 24 24
B 7 24 27
C 10 27 28
I want to find the average height for "A" across the three data sets.

We can place the data frames in a list, row-bind them with rbindlist, then group by 'Name' and take the mean of each remaining column:
library(data.table)
rbindlist(list(df1, df2, df3))[, lapply(.SD, mean), by = Name]
Or with dplyr:
library(dplyr)
bind_rows(df1, df2, df3) %>%
  group_by(Name) %>%
  # mean of every non-grouping column
  summarise(across(everything(), mean))
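The same idea can also be sketched in base R, without extra packages. The data frames df1, df2 and df3 below are rebuilt from the tables in the question; the names combined and avg are illustrative assumptions, not part of the original answer.

```r
# Rebuild the three yearly data sets from the question
df1 <- data.frame(Name = c("A", "B", "C", "D"),
                  Age = c(4, 5, 8, 10),
                  Height = c(20, 22, 25, 25),
                  Weight = c(20, 22, 21, 23))
df2 <- data.frame(Name = c("A", "B"),
                  Age = c(5, 6),
                  Height = c(22, 23),
                  Weight = c(25, 26))
df3 <- data.frame(Name = c("A", "B", "C"),
                  Age = c(6, 7, 10),
                  Height = c(24, 24, 27),
                  Weight = c(24, 27, 28))

# Stack the rows of all three data frames
combined <- do.call(rbind, list(df1, df2, df3))

# Mean of every other column, computed separately for each Name
avg <- aggregate(. ~ Name, data = combined, FUN = mean)

# Average height for "A" across the three data sets
avg$Height[avg$Name == "A"]
```

A name that is missing from some years (such as "D") is averaged over the years in which it does appear.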

Related

Apply function to Dataframe based on Multi-level Grouping in R [duplicate]

I am trying to apply a function to a dataframe to add a column which calculates the percentile rank for each record based on Weather Station ID (WSID) and Season Grouping.
## temperatures data frame:
WSID Season Date Temperature
20 Summer 24/01/2020 18
12 Summer 25/01/2020 20
20 Summer 26/01/2020 25
12 Summer 27/01/2020 17
20 Winter 18/10/2020 15
12 Winter 19/10/2020 12
12 Winter 20/10/2020 13
12 Winter 21/10/2020 14
## Code tried:
perc.rank <- function(x) trunc(rank(x))/length(x)
rank.perc = function(mdf) {
combined1 = mdf %>%
mutate(percentile = perc.rank(Temperature))
}
temperatures = temperatures %>%
split(.$WSID) %>%
map_dfr(~rank.perc(.))
## Expected Output :
WSID Season Date Temperature Percentile
20 Summer 24/01/2020 18 0.333
12 Summer 25/01/2020 20 0.444
20 Summer 26/01/2020 25 0.666
12 Summer 27/01/2020 17 0.333
20 Winter 18/10/2020 15
12 Winter 19/10/2020 12
12 Winter 20/10/2020 13
12 Winter 21/10/2020 14
Is there an elegant way to do this using functions such as group_modify, group_split, map and/or split?
I expect there should be, for example for cases with three or more grouping levels.
The code works when I split the data by WSID, but I can't get any further when I want to group by both WSID and Season.
(Filled in Percentile values were calculated from Excel percentile rank function)
You can use the function directly with group_by instead of splitting; the wrapper function rank.perc is unnecessary.
library(dplyr)
perc.rank <- function(x) trunc(rank(x))/length(x)
temperatures %>%
  group_by(WSID) %>%
  mutate(percentile = perc.rank(Temperature))
More grouping variables can be added to group_by later, e.g. group_by(WSID, Season).
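As a minimal sketch of the two-level grouping, the snippet below rebuilds the temperatures data frame from the question and groups by both WSID and Season. The name result is an assumption for illustration, and this rank-based definition is not guaranteed to reproduce the Excel values quoted in the question.

```r
library(dplyr)

# Example data rebuilt from the question
temperatures <- data.frame(
  WSID = c(20, 12, 20, 12, 20, 12, 12, 12),
  Season = c(rep("Summer", 4), rep("Winter", 4)),
  Date = c("24/01/2020", "25/01/2020", "26/01/2020", "27/01/2020",
           "18/10/2020", "19/10/2020", "20/10/2020", "21/10/2020"),
  Temperature = c(18, 20, 25, 17, 15, 12, 13, 14)
)

# Percentile rank within a vector: rank divided by group size
perc.rank <- function(x) trunc(rank(x)) / length(x)

# Rank each temperature within its station and season
result <- temperatures %>%
  group_by(WSID, Season) %>%
  mutate(percentile = perc.rank(Temperature)) %>%
  ungroup()

result
```

A third grouping level would only need one more column name inside group_by.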

How to get unique values from table() function in R

I have a data frame with 31 columns. In the Year column (named "Anos"), years are repeated across rows, and table(df$Anos) gives me the frequency of each year. I need only the years with 12 observations (12 months).
Example:
freq_years <- table(df$Anos)
freq_years
Result:
2009 2010 2011 2012 2013 2014 2015 2017 2018 2019 2020
  10   12   12    3   11    6    8   12   12   12    5
How can I automatically store only the years with freq = 12 in a new variable (for example 2010, 2011, 2017, 2018, 2019)?
Here is a tidyverse version. Depending on how you use the other 30 columns in your data frame, keeping the filtered counts as df2 might be useful.
# install.packages("dplyr")  # run once if dplyr is not installed
library(dplyr)  # also provides the %>% pipe
#create example dataset
df <- data.frame("Anos" = c(rep(2009,10),
rep(2010,12),
rep(2011,12),
rep(2012,3),
rep(2013,11),
rep(2014,6),
rep(2015,8),
rep(2016,12),
rep(2017,12)))
head(df)
# count the rows for each year and keep the years with exactly 12
df2 <- df %>% count(Anos) %>% filter(n == 12)
head(df2)
# vector of years that have exactly 12 rows
variable <- df2$Anos
variable
Alternatively, in base R we can build a logical vector and use it to subset the names of the table output:
names(freq_years)[freq_years == 12]
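Note that names() returns character strings. A small sketch of the full round trip, using counts copied from the table in the question (the vectors years and counts and the name years_12 are assumptions for illustration):

```r
# Rebuild a year column with the frequencies shown in the question
years  <- c(2009, 2010, 2011, 2012, 2013, 2014, 2015, 2017, 2018, 2019, 2020)
counts <- c(10, 12, 12, 3, 11, 6, 8, 12, 12, 12, 5)
df <- data.frame(Anos = rep(years, times = counts))

freq_years <- table(df$Anos)

# Keep the names of the table cells whose count is 12,
# then convert them from character back to numeric
years_12 <- as.numeric(names(freq_years)[freq_years == 12])
years_12

# Optionally keep only the rows of df that belong to those years
df_complete <- df[df$Anos %in% years_12, , drop = FALSE]
```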

I need to classify by categories without mixing the data of different columns [duplicate]

I have the following dataset:
Year Company Product Sales
2017 X A 10
2017 Y A 20
2017 Z B 20
2017 X B 10
2018 X B 20
2018 Y B 30
2018 X A 10
2018 Z A 10
I want to obtain the following summary:
Year Product Sales
2017 A 30
B 30
2018 A 20
B 50
and also the following summary:
Year Company Sales
2017 X 20
Y 20
Z 20
2018 X 30
Y 30
Z 10
Is there any way to do it without using loops?
I know I could do something with the function aggregate, but I don't know how to proceed with it without mixing the data of company, product and year. For example, I get the total sales of product A and B, but it's mixing the sales of both years instead of giving A and B in 2017, and separated in 2018.
Do you have any suggestions?
Assuming your data is a pandas DataFrame called df (this answer uses Python, not R):
df1 = df.groupby(['Year', 'Product'])['Sales'].sum()
df2 = df.groupby(['Year', 'Company'])['Sales'].sum()
This should give you the two summaries without mixing years, products and companies.
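Since the question mentions the aggregate function, here is a minimal base R sketch of the same grouping. The data frame sales is rebuilt from the table in the question; the names sales, by_product and by_company are assumptions for illustration.

```r
# Example data rebuilt from the question
sales <- data.frame(
  Year = c(2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018),
  Company = c("X", "Y", "Z", "X", "X", "Y", "X", "Z"),
  Product = c("A", "A", "B", "B", "B", "B", "A", "A"),
  Sales = c(10, 20, 20, 10, 20, 30, 10, 10)
)

# Putting both Year and Product on the right-hand side of the formula
# keeps the years separate while summing over companies
by_product <- aggregate(Sales ~ Year + Product, data = sales, FUN = sum)

# Same pattern, summing over products for each Year and Company
by_company <- aggregate(Sales ~ Year + Company, data = sales, FUN = sum)

by_product
by_company
```

Listing every grouping column in the formula is what prevents the totals of different years from being merged.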

Using reshape where there are multiple values at each time point [duplicate]

I'm trying to reshape a longitudinal dataset containing visual measurements for the left and right eyes of several individuals over a one year period. I need to end up with a data.frame() with the headings 'patient','month','re','le' (where 're' means 'right eye' and 'le' means 'left eye')
My data are currently in the format:
'patient','re_month1','le_month1','re_month2','le_month2'....'le_month12'
I know I could use the reshape() function to sort the data if I only had one piece of data per time point. If I were just working with 'patient','month1','month2' etc, I could use the following:
reshape(dframe,idvar = 'patient',v.names = 'vision',
varying = 2:13,direction = "long")
...But how do I do this when there are two pieces of data (or more) at each time point?
We can use melt from data.table and specify the measure columns with patterns(), which accepts several regular expressions; each pattern selects the columns that are stacked into one output value column.
library(data.table)
melt(setDT(dframe), id.var="patient",
measure = patterns("^re_", "^le_"))
# patient variable value1 value2
#1: 1 1 20 21
#2: 2 1 25 18
#3: 3 1 23 22
#4: 1 2 18 29
#5: 2 2 22 19
#6: 3 2 25 24
data
dframe <- data.frame(patient = 1:3,
                     re_month1 = c(20, 25, 23), le_month1 = c(21, 18, 22),
                     re_month2 = c(18, 22, 25), le_month2 = c(29, 19, 24))
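For comparison, the base R reshape() function from the question can also handle two measurements per time point if varying is given as a list with one vector of column names per output column. This is a sketch on the two-month example data above; the output name long is an assumption for illustration.

```r
# Example data with two months of right-eye and left-eye measurements
dframe <- data.frame(patient = 1:3,
                     re_month1 = c(20, 25, 23), le_month1 = c(21, 18, 22),
                     re_month2 = c(18, 22, 25), le_month2 = c(29, 19, 24))

long <- reshape(dframe,
                idvar = "patient",
                # one list element per output column, ordered by month
                varying = list(c("re_month1", "re_month2"),
                               c("le_month1", "le_month2")),
                v.names = c("re", "le"),
                timevar = "month",
                times = 1:2,
                direction = "long")

# Drop the generated row names such as "1.1", "2.1"
row.names(long) <- NULL
long
```

For the full data set, each vector inside varying would list all twelve monthly columns, with times = 1:12.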

Turning one row into multiple rows in r [duplicate]

In R, I have data where each person has multiple session dates and the scores on some tests, but this is all in one row. I would like to change it so that I have multiple rows with the person's info, each with only one of the session dates and its corresponding test scores, and do this for every person. Also, each person may have completed a different number of sessions.
Ex:
ID Name Session1Date Score Score Session2Date Score Score
23 sjfd 20150904 2 3 20150908 5 7
28 addf 20150905 3 4 20150910 6 8
To:
ID Name SessionDate Score Score
23 sjfd 20150904 2 3
23 sjfd 20150908 5 7
28 addf 20150905 3 4
28 addf 20150910 6 8
You can use melt from data.table v1.9.5 or later (the devel version at the time of writing). Its 'measure' argument can take several groups of columns via patterns().
library(data.table)#v1.9.5+
melt(setDT(df1), measure = patterns("Date$", "Score(\\.2)*$", "Score\\.[13]"))
# ID Name variable value1 value2 value3
#1: 23 sjfd 1 20150904 2 3
#2: 28 addf 1 20150905 3 4
#3: 23 sjfd 2 20150908 5 7
#4: 28 addf 2 20150910 6 8
Or using reshape from base R, we can specify the direction as 'long' and varying as a list of column index
res <- reshape(df1, idvar=c('ID', 'Name'), varying=list(c(3,6), c(4,7),
c(5,8)), direction='long')
res
# ID Name time Session1Date Score Score.1
#23.sjfd.1 23 sjfd 1 20150904 2 3
#28.addf.1 28 addf 1 20150905 3 4
#23.sjfd.2 23 sjfd 2 20150908 5 7
#28.addf.2 28 addf 2 20150910 6 8
If needed, the rownames can be changed
row.names(res) <- NULL
Update
If the columns follow a specific order i.e. 3rd grouped with 6th, 4th with 7th, 5th with 8th, we can create a matrix of column index and then split to get the list for the varying argument in reshape.
m1 <- matrix(3:8,ncol=2)
lst <- split(m1, row(m1))
reshape(df1, idvar=c('ID', 'Name'), varying=lst, direction='long')
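A self-contained sketch of the reshape call on the example data follows. It assumes the duplicated Score headers were made unique on import (Score, Score.1, Score.2, Score.3), and it adds v.names and timevar so the output columns get readable names; SessionDate, Score1, Score2, Session and long are illustrative assumptions.

```r
# Example data rebuilt from the question, with unique column names
df1 <- data.frame(ID = c(23, 28),
                  Name = c("sjfd", "addf"),
                  Session1Date = c(20150904, 20150905),
                  Score = c(2, 3),
                  Score.1 = c(3, 4),
                  Session2Date = c(20150908, 20150910),
                  Score.2 = c(5, 6),
                  Score.3 = c(7, 8))

long <- reshape(df1,
                idvar = c("ID", "Name"),
                # columns 3 and 6 are the dates, 4 and 7 the first score,
                # 5 and 8 the second score
                varying = list(c(3, 6), c(4, 7), c(5, 8)),
                v.names = c("SessionDate", "Score1", "Score2"),
                timevar = "Session",
                direction = "long")

row.names(long) <- NULL
long
```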
If your data frame is named data, you can also stack the two session blocks directly:
data1 <- data[1:5]
data2 <- data[c(1,2,6,7,8)]
names(data2) <- names(data1)  # rbind needs matching column names
newdata <- rbind(data1, data2)
This works for the example you've given, where every person has exactly two sessions.
