I have the following data frame
ID <- c(1,1,2,3,4,5,6)
Value1 <- c(20,50,30,10,15,10,NA)
Value2 <- c(40,33,84,NA,20,1,NA)
Value3 <- c(60,40,60,10,25,NA,NA)
Grade1 <- c(20,50,30,10,15,10,NA)
Grade2 <- c(40,33,84,NA,20,1,NA)
DF <- data.frame(ID,Value1,Value2,Value3,Grade1,Grade2)
ID Value1 Value2 Value3 Grade1 Grade2
1 1 20 40 60 20 40
2 1 50 33 40 50 33
3 2 30 84 60 30 84
4 3 10 NA 10 10 NA
5 4 15 20 25 15 20
6 5 10 1 NA 10 1
7 6 NA NA NA NA NA
I would like to group by ID, select the columns whose names contain the string "Value", and get the mean of these columns with NA values excluded.
Here is an example of the desired output
ID mean(Value)
1 41
2 58
3 10
....
In my attempt to solve this challenge, I wrote the following code
library(tidyverse)
DF %>% group_by (ID) %>% select(contains("Value")) %>% summarise(mean(.,na.rm = TRUE))
The code groups the data by ID, selects the columns whose names contain "Value", and attempts to summarise the selected columns with the mean function. When I run my code, I get the following output
> DF %>% group_by (ID) %>% select(contains("Value")) %>% summarise(mean(.))
Adding missing grouping variables: `ID`
# A tibble: 6 x 2
ID `mean(.)`
<dbl> <dbl>
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 6 NA
I would appreciate your help in this matter.
You should try using pivot_longer to get your data from wide to long form. See the tidyr article on pivot_longer and pivot_wider (https://tidyr.tidyverse.org/articles/pivot.html).
library(tidyverse)
ID <- c(1,2,3,4,5,6)
Value1 <- c(50,30,10,15,10,NA)
Value2 <- c(33,84,NA,20,1,NA)
Value3 <- c(40,60,10,25,NA,NA)
DF <- data.frame(ID,Value1,Value2,Value3)
DF %>% pivot_longer(-ID) %>%
group_by(ID) %>% summarise(mean=mean(value,na.rm=TRUE))
Output here
ID mean
<dbl> <dbl>
1 1 41
2 2 58
3 3 10
4 4 20
5 5 5.5
6 6 NaN
Without dplyr or any other package, rowMeans does the job when each ID occupies a single row:
DF$mean <- rowMeans(DF[, 2:4], na.rm = TRUE)
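If an ID can span several rows, as in the question's original data (ID 1 appears twice), row means are not group means. Here is a minimal base R sketch, assuming the goal is to pool every "Value" cell of an ID before averaging. The names value_cols, group_means and result are made up for illustration.

```r
# Data from the question (ID 1 occurs on two rows)
ID <- c(1, 1, 2, 3, 4, 5, 6)
Value1 <- c(20, 50, 30, 10, 15, 10, NA)
Value2 <- c(40, 33, 84, NA, 20, 1, NA)
Value3 <- c(60, 40, 60, 10, 25, NA, NA)
Grade1 <- c(20, 50, 30, 10, 15, 10, NA)
Grade2 <- c(40, 33, 84, NA, 20, 1, NA)
DF <- data.frame(ID, Value1, Value2, Value3, Grade1, Grade2)

# Columns whose names contain "Value"
value_cols <- grep("Value", names(DF), value = TRUE)

# Split the rows by ID, flatten each group's "Value" cells, average without NA
group_means <- sapply(
  split(DF[value_cols], DF$ID),
  function(g) mean(unlist(g), na.rm = TRUE)
)

result <- data.frame(
  ID = as.numeric(names(group_means)),
  mean_value = unname(group_means)
)
result
```

With this data the pooled mean for ID 1 is 40.5 (six values), and ID 6 gives NaN because it has no non-missing values.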
I have a dataframe like the following in R:
df <- matrix(c('A','A','A','A','B','B','B','B','C','C','C','C',4,6,8,2,2,7,2,8,9,1,2,5),ncol=2)
For each row of this dataframe, I want to include the total value for its class (A, B, C) so that the dataframe will look like this:
df <- matrix(c('A','A','A','A','B','B','B','B','C','C','C','C',4,6,8,2,2,7,2,8,9,1,2,5,20,20,20,20,19,19,19,19,17,17,17,17),ncol=3)
What's a way I could accomplish this?
Thanks in advance for your help.
Using base R:
df <- data.frame(df)
df$Total <- ave(as.numeric(df$X2), df$X1, FUN=sum)
A dplyr solution would be this:
library(dplyr)
data.frame(df) %>%
group_by(X1) %>%
mutate(Sum = sum(as.numeric(X2)))
# A tibble: 12 × 3
# Groups: X1 [3]
X1 X2 Sum
<chr> <chr> <dbl>
1 A 4 20
2 A 6 20
3 A 8 20
4 A 2 20
5 B 2 19
6 B 7 19
7 B 2 19
8 B 8 19
9 C 9 17
10 C 1 17
11 C 2 17
12 C 5 17
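Both answers need as.numeric because a matrix holds a single type, so mixing letters and numbers turns every cell into a character string. A small sketch, assuming you are free to build the data as a data frame from the start; the column names class, value and total are illustrative.

```r
# Same values as in the question, stored with proper column types
df <- data.frame(
  class = rep(c("A", "B", "C"), each = 4),
  value = c(4, 6, 8, 2, 2, 7, 2, 8, 9, 1, 2, 5)
)

# ave() applies sum within each class and repeats the result on every row
df$total <- ave(df$value, df$class, FUN = sum)
df
```

Because value is already numeric, no conversion step is required before summing.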
My apologies for a poorly formulated question. I am new to R, to programming, and to posting questions.
I am working with panel data. I have two context-varying variables: cat (a category ranging from 1 to 4, where each individual has gambled in 3 of the 4 possible places) and d.stake (the amount of money staked in a given category). cat and d.stake are nested within the individual (id), which is a context-independent variable.
I wish to compute difference scores between the amounts staked in the different categories.
I have created four variables: two lag variables (ldstake and ldstake2) and two difference scores (diff1 = d.stake - ldstake; diff2 = d.stake - ldstake2), using the code
df.3$ldstake <- c(NA, df.3$d.stake[-nrow(df.3)])
df.3$ldstake[which(!duplicated(df.3$id))] <- NA
df.3$ldstake2 <- c(NA, df.3$ldstake[-nrow(df.3)])
df.3$ldstake2[which(!duplicated(df.3$id))] <- NA
df.3 <- df.3 %>%
mutate(diff1 = d.stake - ldstake,
diff2 = d.stake - ldstake2)
This gives me the following dataframe:
id cat d.stake ldstake ldstake2 diff1 diff2
1 1 50 NA NA NA NA
1 2 60 50 NA 10 NA
1 3 55 60 50 -5 5
2 1 34 NA NA NA NA
2 2 74 34 NA 40 NA
2 4 12 74 34 -62 22
However, I wish to replace the first row of diff1 (the NA) for each individual with the third row of diff2 from each individual (See example below).
id cat d.stake ldstake ldstake2 diff1 diff2
1 1 50 NA NA !5! NA
1 2 60 50 NA 10 NA
1 3 55 60 50 -5 !5!
2 1 34 NA NA *22* NA
2 2 74 34 NA 40 NA
2 4 12 74 34 -62 *22*
Is this possible? I would be grateful for a script that replaces the first NA value of diff1 with the value of diff2 from the individual's last observation (the third one here). Furthermore, if there is a script that would do this automatically (that is, create the difference scores cat2-1, cat3-2 and cat3-1), I would be grateful for any help.
All the best,
Tony
Here is one possibility based on something else I had been working on this past week.
library(tidyverse)
df_wide <- df %>%
pivot_wider(id_cols = id, names_from = cat, values_from = d.stake) %>%
as.data.frame(.)
data.frame(id = df_wide$id, combn(df_wide[-1], 2, function(x) x[,1]-x[,2])) %>%
setNames(c("id", apply(combn(names(df_wide[-1]), 2), 2, paste0, collapse = "-"))) %>%
pivot_longer(cols = -id, names_to = "cats", values_to = "diff") %>%
drop_na()
Output
# A tibble: 6 x 3
id cats diff
<dbl> <chr> <dbl>
1 1 1-2 -10
2 1 1-3 -5
3 1 2-3 5
4 2 1-2 -40
5 2 1-4 22
6 2 2-4 62
Data
df <- data.frame(
id = c(1,1,1,2,2,2),
cat = c(1,2,3,1,2,4),
d.stake = c(50,60,55,34,74,12)
)
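For the first part of the question (filling each individual's first diff1 with that individual's last diff2), a base R sketch follows. It assumes the rows are sorted so that each id forms one contiguous block. lag_within is a hypothetical helper written for this example, not a function from any package.

```r
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  cat = c(1, 2, 3, 1, 2, 4),
  d.stake = c(50, 60, 55, 34, 74, 12)
)

# Hypothetical helper: shift x down by n positions within each group g,
# padding the start of every group with NA
lag_within <- function(x, g, n = 1) {
  ave(x, g, FUN = function(v) {
    k <- min(n, length(v))
    c(rep(NA, k), head(v, length(v) - k))
  })
}

df$diff1 <- df$d.stake - lag_within(df$d.stake, df$id, 1)
df$diff2 <- df$d.stake - lag_within(df$d.stake, df$id, 2)

# First and last row of each id (relies on contiguous id blocks)
first_row <- !duplicated(df$id)
last_row  <- !duplicated(df$id, fromLast = TRUE)

# Copy each individual's last diff2 into that individual's first diff1
df$diff1[first_row] <- df$diff2[last_row]
df
```

With the sample data, d.stake minus the second lag for id 2 is 12 - 34 = -22, so -22 is the value copied into that individual's first row. An individual with fewer than three observations keeps NA there, because its last diff2 is NA.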
I have a data frame like so:
ID = c(1,1,1,2,2,2,3,3,3,4,4,4,4)
VAR_1 = c(2,4,6,1,7,9,4,4,3,1,7,4,0)
VAR_2 = c(NA,NA,NA,NA,NA,20190101,20190101,20190101,NA,20190101,NA,NA,NA)
df2 = data.frame(ID,VAR_1,VAR_2)
I would like to keep all the rows of a group (ID) ONLY if the first observation of VAR_2 in that group has a value. In this simple case, the new subset should be all the rows from IDs 3 and 4.
To represent this better:
df                      df_subset
ID VAR_1 VAR_2          ID VAR_1 VAR_2
1  2     NA             3  4     20190101
1  4     NA             3  4     20190101
1  6     NA             3  3     NA
2  1     NA             4  1     20190101
2  7     NA             4  7     NA
2  9     20190101       4  4     NA
3  4     20190101       4  0     NA
3  4     20190101
3  3     NA
4  1     20190101
4  7     NA
4  4     NA
4  0     NA
I managed to do this in several steps (I subset the original data to the first observation of each group, assign VAR_1 a special value, merge back, and finally filter by the special value), but I would like to know if there's a simpler, more elegant and probably more efficient way. I don't need VAR_1, so it can be changed if that allows a faster solution.
Any help would be appreciated.
Using dplyr, we can group_by ID and select groups only if first value in each group is non-NA.
library(dplyr)
df2 %>%
group_by(ID) %>%
filter(!is.na(VAR_2[1L]))
# ID VAR_1 VAR_2
# <dbl> <dbl> <dbl>
#1 3 4 20190101
#2 3 4 20190101
#3 3 3 NA
#4 4 1 20190101
#5 4 7 NA
#6 4 4 NA
#7 4 0 NA
Some variations for extracting the first value could be (thanks to @tmfmnk)
df2 %>% group_by(ID) %>% filter(!is.na(first(VAR_2)))
OR
df2 %>% group_by(ID) %>% filter(!is.na(nth(VAR_2, 1)))
Same using base R ave
df2[with(df2, ave(!is.na(VAR_2), ID, FUN = function(x) x[1L])), ]
or a slightly more complicated one with split and subset
subset(df2, ID %in% names(na.omit(sapply(split(df2$VAR_2, df2$ID), head, 1))))
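The same idea can be spelled out step by step in base R, which may be easier to read than the one-liners above. This is only a sketch. It assumes the data is already in the order that defines the "first" observation of each group, and the names first_rows, keep_ids and df_subset are illustrative.

```r
ID <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4)
VAR_1 <- c(2, 4, 6, 1, 7, 9, 4, 4, 3, 1, 7, 4, 0)
VAR_2 <- c(NA, NA, NA, NA, NA, 20190101, 20190101, 20190101, NA,
           20190101, NA, NA, NA)
df2 <- data.frame(ID, VAR_1, VAR_2)

# First row of every ID (duplicated() flags all later occurrences)
first_rows <- df2[!duplicated(df2$ID), ]

# IDs whose first VAR_2 is not missing
keep_ids <- first_rows$ID[!is.na(first_rows$VAR_2)]

# Keep every row that belongs to one of those IDs
df_subset <- df2[df2$ID %in% keep_ids, ]
df_subset
```

ID 2 is dropped even though it has a value later on, because only its first observation is inspected.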
Here is a simple example:
> df <- data.frame(sn=rep(c("a","b"), 3), t=c(10,10,20,20,25,25), r=c(7,8,10,15,11,17))
> df
sn t r
1 a 10 7
2 b 10 8
3 a 20 10
4 b 20 15
5 a 25 11
6 b 25 17
Expected result is
sn t r
1 a 20 3
2 a 25 1
3 b 20 7
4 b 25 2
I want to group by a specific column ("sn"), leave some columns unchanged ("t" for this example), and apply diff() to remaining columns ("r" for this example).
I explored "dplyr" package to try something like:
df1 %>% group_by(sn) %>% do( ... diff(r)...)
but couldn't figure out the correct code.
Can anyone recommend a clean way to get the expected result?
You can do it like this (I don't use diff directly because it returns n-1 values):
library(dplyr)
df %>% arrange(sn) %>% group_by(sn) %>% mutate(r = r-lag(r)) %>% slice(2:n())
#### sn t r
#### <fctr> <dbl> <dbl>
#### 1 a 20 3
#### 2 a 25 1
#### 3 b 20 7
#### 4 b 25 2
The slice function removes the NA row that the differencing creates at the beginning of each group. One could also use na.omit instead, but it might unintentionally remove other rows that contain NA in a different column.
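To illustrate the na.omit caveat, here is a base R sketch on a modified copy of the question's data in which one t value has been set to NA (that change is an assumption made purely for the demonstration). It compares dropping rows by the differenced column only against na.omit on the whole data frame.

```r
df <- data.frame(
  sn = rep(c("a", "b"), 3),
  t  = c(10, 10, NA, 20, 25, 25),  # third t value made missing for illustration
  r  = c(7, 8, 10, 15, 11, 17)
)

# Order by group, then difference r within each group, padding with NA
df <- df[order(df$sn), ]
df$r <- ave(df$r, df$sn, FUN = function(v) c(NA, diff(v)))

# Drop only the rows where the differenced column is NA
by_filter <- df[!is.na(df$r), ]

# na.omit drops any row with an NA in any column
by_omit <- na.omit(df)

nrow(by_filter)
nrow(by_omit)
```

Filtering on r keeps four rows, while na.omit keeps three because it also discards the row whose t is missing.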
We can also use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), set the key to 'sn' (which orders the data by 'sn'), then, grouped by 'sn', subtract the lag of 'r' from 'r' (shift does this in data.table) and remove the NA rows with na.omit.
library(data.table)
na.omit(setDT(df, key = "sn")[, r := r-shift(r) , sn])
# sn t r
#1: a 20 3
#2: a 25 1
#3: b 20 7
#4: b 25 2
Or, if we are using diff, we need to make sure the lengths match, because the diff output is one element shorter than the original vector. So we can pad it with NA and remove those rows later with filter
library(dplyr)
df %>%
arrange(sn) %>%
group_by(sn) %>%
mutate(r = c(NA, diff(r))) %>%
filter(!is.na(r))
# sn t r
# <fctr> <dbl> <dbl>
#1 a 20 3
#2 a 25 1
#3 b 20 7
#4 b 25 2
I'm a newcomer to dplyr and have the following question. My data.frame has one column serving as a grouping variable. Some rows don't belong to a group, in which case the grouping column is NA.
I need to add some columns to the data.frame using the dplyr function mutate. I'd prefer that dplyr ignore all rows where the grouping column is NA. I'll illustrate with an example:
library(dplyr)
set.seed(2)
# Setting up some dummy data
df <- data.frame(
Group = factor(c(rep("A",3),rep(NA,3),rep("B",5),rep(NA,2))),
Value = abs(as.integer(rnorm(13)*10))
)
# Using mutate to calculate differences between values within the rows of a group
df <- df %>%
group_by(Group) %>%
mutate(Diff = Value-lead(Value))
df
# Source: local data frame [13 x 3]
# Groups: Group [3]
#
# Group Value Diff
# (fctr) (int) (int)
# 1 A 8 7
# 2 A 1 -14
# 3 A 15 NA
# 4 NA 11 11
# 5 NA 0 -1
# 6 NA 1 -8
# 7 B 7 5
# 8 B 2 -17
# 9 B 19 18
# 10 B 1 -3
# 11 B 4 NA
# 12 NA 9 6
# 13 NA 3 NA
Calculating the differences between rows without a group makes no sense and corrupts the data. I need to discard these values and have done so like this:
df$Diff[is.na(df$Group)] <- NA
Is there a way to include the above command in the dplyr chain using %>%? Something along the lines of:
df <- df %>%
group_by(Group) %>%
mutate(Diff = Value-lead(Value)) %>%
filter(!is.na(Group))
But where the rows without a group are not removed altogether? Or even better, is there a way to make dplyr ignore rows without a group?
The desired outcome would be:
# Source: local data frame [13 x 3]
# Groups: Group [3]
#
# Group Value Diff
# (fctr) (int) (int)
# 1 A 8 7
# 2 A 1 -14
# 3 A 15 NA
# 4 NA 11 NA
# 5 NA 0 NA
# 6 NA 1 NA
# 7 B 7 5
# 8 B 2 -17
# 9 B 19 18
# 10 B 1 -3
# 11 B 4 NA
# 12 NA 9 NA
# 13 NA 3 NA
Simply use an ifelse condition for the variable that you are trying to create:
library(dplyr)
set.seed(2)
df = data.frame(
Group = factor(c(rep("A",3), rep(NA,3), rep("B",5), rep(NA,2))),
Value = abs(as.integer(rnorm(13)*10))
) %>%
group_by(Group) %>%
mutate(Diff = ifelse(is.na(Group), as.integer(NA), Value-lead(Value)))
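For comparison, the same result can be sketched in base R by computing the differences only on the rows that have a group. The Value column below is typed in from the printed output in the question rather than regenerated with rnorm, and lead_diff and has_group are illustrative names.

```r
df <- data.frame(
  Group = factor(c(rep("A", 3), rep(NA, 3), rep("B", 5), rep(NA, 2))),
  Value = c(8L, 1L, 15L, 11L, 0L, 1L, 7L, 2L, 19L, 1L, 4L, 9L, 3L)
)

# Value minus the next value in the same vector (NA for the last element)
lead_diff <- function(v) v - c(v[-1], NA)

# Start with NA everywhere, then fill in only the rows that have a group
has_group <- !is.na(df$Group)
df$Diff <- NA_integer_
df$Diff[has_group] <- ave(df$Value[has_group], df$Group[has_group],
                          FUN = lead_diff)
df
```

Rows without a group are never passed to the differencing step, so their Diff stays NA while all 13 rows are kept.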