I would like to get the values from multiple columns and rows of a data frame, do something with those values (e.g., calculate a mean), then add the results to a new column in a modified version of the original data frame. The columns and rows are selected based on values in other columns. I've gotten this working in dplyr, but only when I "hard code" the column names that are used by the function.
Example code:
'''
# create a test data frame
# create a test data frame
tdf <- data.frame(A = c('a','a','b','b','b'), B = c('d','d','e','e','f'),
                  L1 = c(1, 2, 3, 4, 5), L2 = c(11, 12, 13, NA, 15),
                  L3 = c(NA, 22, 23, NA, 25), stringsAsFactors = FALSE)
'''
which gives
A B L1 L2 L3
1 a d 1 11 NA
2 a d 2 12 22
3 b e 3 13 23
4 b e 4 NA NA
5 b f 5 15 25
In this case, I would like to calculate the mean of the values in the L* columns (L1,L2,L3) that have the same values in columns A and B. For example, using the values in A and B, I select rows 1 & 2 and calculate a mean using (1, 11, 2, 12, 22), then rows 3 & 4 (3, 13, 23, 4), and finally row 5 (5, 15, 25).
I can do this using either plyr's ddply or dplyr (both work):
'''
library(plyr)
ddply(tdf, .(A, B), summarize, mean_L = mean(c(L1, L2, L3), na.rm = TRUE))
# or, with dplyr
library(dplyr)
tdf %>% group_by(A, B) %>% summarize(mean_L = mean(c(L1, L2, L3), na.rm = TRUE))
'''
which gives what I want:
A B mean_L
1 a d 9.60
2 b e 10.75
3 b f 15.00
However, my issue is that the number of "L" columns is dynamic among different data sets. In some cases I may have 10 total columns (L1, L2, ... L10) or 100+ columns. The columns I use for the selection criteria (in this case A and B) will always be the same, so I can "hard code" those, but I'm having difficulty specifying the columns in the "mean" function.
dplyr has a way of dynamically generating the "group by" variables, but that does not seem to work within the function component of the summarize. For example, I can do this:
'''
b <- names(tdf)[1:2]
dots <- lapply(b, as.symbol)
tdf %>% group_by(.dots=dots) %>% summarize(mean_L=mean(c(L1,L2,L3), na.rm=TRUE))
'''
but I can't do the same inside the mean function. The closest I have come to getting it to work is:
'''
b='L1'
tdf %>% group_by(A,B) %>% summarize(mean_L=mean(.data[[b]], na.rm=TRUE))
'''
but this only works for specifying a single column. If I try b='L1,L2,L3', it seems dplyr uses the literal "L1,L2,L3" as a column name and not as a list.
This doesn't seem to be a complicated problem, but I would like help finding the solution, either in dplyr or some other way.
Many thanks!
tdf %>%
  group_by_at(1:2) %>%
  summarise(mean_L = mean(c_across(starts_with("L")), na.rm = TRUE)) %>%
  ungroup()
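If the columns need to come from a character vector (as with the b = 'L1,L2,L3' attempt in the question), the same pattern should work with all_of(); a minimal sketch, assuming dplyr >= 1.0:
b <- c("L1", "L2", "L3")   # or: b <- grep("^L", names(tdf), value = TRUE)

tdf %>%
  group_by(A, B) %>%
  summarise(mean_L = mean(c_across(all_of(b)), na.rm = TRUE)) %>%
  ungroup()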
No matter how many L columns you have, you can always transform your data set into long format and group by your variables:
library(tidyverse)

tdf %>%
  pivot_longer(!c(A, B)) %>%
  group_by(A, B) %>%
  summarise(L_mean = mean(value, na.rm = TRUE))
# A tibble: 3 × 3
# Groups: A [2]
A B L_mean
<chr> <chr> <dbl>
1 a d 9.6
2 b e 10.8
3 b f 15
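The question's first paragraph also asks for the results to be added as a new column of a modified copy of the original data frame. One way to do that (a sketch, assuming dplyr >= 1.0; group_means and tdf_with_means are made-up names) is to join the per-group summary back on A and B:
library(dplyr)

# per-group means, computed as in the answers above
group_means <- tdf %>%
  group_by(A, B) %>%
  summarise(mean_L = mean(c_across(starts_with("L")), na.rm = TRUE), .groups = "drop")

# every original row gets the mean of its (A, B) group
tdf_with_means <- tdf %>%
  left_join(group_means, by = c("A", "B"))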
I want to count variables that appear in comma-separated entries.
I used this code:
library(dplyr)
mydata2 <- mydata %>% group_by(sub) %>% summarise(n = n())
mydata2
But in this code, "A" and "A,B" are counted as different values.
Instead, I want to count each variable separately, so that from the data below I get A = 3, B = 3, and C = 2.
How can I write code for this?
Here is my data.
num sub
1 A
2 A,B
3 C
4 A
5 B
6 B,C
library(dplyr)
library(tidyr)

mydata %>%
  separate_rows(sub) %>%
  count(sub)
We can use base R with strsplit and table:
table(unlist(strsplit(mydata$sub, ",")))
# A B C
# 3 3 2
I am trying to create a simple function to sum some variables in a nested data set.
Here is a much simpler example:
df <- data.frame(ID  = c(1,1,1,1,2,3,3,4,4,4,5,6,7,7,7,7,7,7,7,7),
                 var = c("A","B","C","D","B","A","D","A","C","D","D","D","A","D","A","A","A","B","B","B"),
                 N   = c(50,50,50,50,298,156,156,85,85,85,278,301,98,98,98,98,98,98,98,98),
                 stringsAsFactors = TRUE)  # var as a factor; levels() in the function below relies on this
Think of this as a data frame containing results of 7 different studies. Each study has investigated one or more variables (A, B, C, D). The columns mean:
ID = The ID of a respective study.
var = The respective variable measured in each study. Some studies have measured only one variable (e.g., ID=2, which only contained B), some measured several.
N = The sample size of each study. That is, each ID has a sample size.
I would like to create a function that summarizes three things:
k = how many studies measured each variable (e.g., "A")
m = how often each variable was measured (regardless of whether some studies measured a variable more than once)--a simple frequency.
N = the sample size per variable--but counted only once per study. That is, no duplicates per study ID are allowed.
My current version (I am a real noob, so please forgive the form) produces exactly what I want:
model km N
1 A 4 (7) 389
2 B 3 (5) 446
3 C 2 (2) 135
4 D 6 (6) 968
For instance, variable A was measured 7 times, but only by 4 studies (i.e., study #7 measured it several times). The (non-redundant) sample size was N=389 (not counting the repeated measures of study #7 more than once).
(Note: The parentheses in the table are helpful as I intend to copy the results into a document)
Here is the current version of the code. The problems begin with the part containing the pipes
kmn <- function(data, x, ID, N) {
  m <- table(data[[x]])                        # how often each variable was measured
  k <- apply(table(data[[x]], data[[ID]]), 1,  # number of studies per variable
             function(x) length(x[x > 0]))
  model <- levels(data[[x]])
  km <- cbind(k, m)
  colnames(km) <- c("k", "m")
  km <- paste0(k, " (", m, ")")                # "k (m)" labels
  smpsize <- data %>%                          # sample size, counted once per study
    group_by(data[[x]]) %>%
    summarise(N = sum(N[!duplicated(ID)])) %>%
    select(N)
  cbind(model, km, smpsize)
}
kmn(data=df, x="var", ID = "ID", N="N")
The above code works, but only if the data frame really contains a variable called N (it fails with a different variable name). I guess the data %>% pipe makes R look for N inside the data frame rather than treating the N in sum(N...) as a reference to the function argument.
I can guess that this looks horrible for someone with some idea :)
Thank you for any ideas
Holger
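A minimal sketch of one way around the scoping problem described above: pass the column names as strings and refer to them through dplyr's .data pronoun, so the function body does not depend on the data frame literally containing columns called var, ID, or N. This assumes dplyr >= 1.0; kmn2 and the argument names x_col, id_col, n_col are made up (and chosen so they cannot clash with column names inside summarise):
library(dplyr)

kmn2 <- function(data, x_col, id_col, n_col) {
  data %>%
    group_by(model = .data[[x_col]]) %>%
    summarise(
      km = paste0(n_distinct(.data[[id_col]]), " (", n(), ")"),  # "k (m)"
      N  = sum(.data[[n_col]][!duplicated(.data[[id_col]])])     # N counted once per study
    )
}

kmn2(df, x_col = "var", id_col = "ID", n_col = "N")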
First, remove duplicates using the unique function and sum N by var.
Second, take df and group by var; n() gives the count and n_distinct(ID) the number of unique IDs; then join with the data frame stats_N:
library(dplyr)

stats_N <- df %>%
  select(ID, var, N) %>%
  unique() %>%
  group_by(var) %>%
  summarise(N = sum(N))

df %>%
  group_by(var) %>%
  summarise(n = n(), km = n_distinct(ID)) %>%
  left_join(stats_N)
# A tibble: 4 x 4
# var n km N
# <fct> <int> <int> <dbl>
#1 A 7 4 389
#2 B 5 3 446
#3 C 2 2 135
#4 D 6 6 968
In addition to @fmarm's answer, it can also be done without a join: group by 'var', get the number of distinct elements in 'ID' (n_distinct), the number of rows (n()), and the sum of the non-duplicated 'N' values.
library(dplyr)

df %>%
  group_by(model = var) %>%
  summarise(km = sprintf("%d (%d)", n_distinct(ID), n()),
            N = sum(N[!duplicated(N)]))
# A tibble: 4 x 3
# model km N
# <fct> <chr> <dbl>
#1 A 4 (7) 389
#2 B 3 (5) 446
#3 C 2 (2) 135
#4 D 6 (6) 968
I am new to R so any help is greatly appreciated!
I have a data frame of 278,800 observations for each of my 10 variables, and I am trying to create an 11th variable that sums every 200 observations (or rows) of a specific variable/column (rows 1:200, then 201:400, then 401:600, etc.), similar to the OFFSET function in Excel.
I have tried subsetting my data to just the variable of interest with the aim of adding a new variable that sums every 200 rows, but I cannot figure it out. I understand my new "variable" will contain 1,394 data points (278,800/200). I have tried the rollapply function; however, the output does not sum in blocks of 200, it sums 1:200, 2:201, 3:202, etc.
Thanks,
E
rollapply has a by= argument for that. Here is a smaller example using n = 3 instead of n = 200. Note that 1+2+3=6, 4+5+6=15, 7+8+9=24 and 10+11+12=33.
# test data
DF <- data.frame(x = 1:12)
library(zoo)
n <- 3
rollapply(DF$x, n, sum, by = n)
## [1] 6 15 24 33
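Applied at the scale of the question (a sketch; the random vector x just stands in for the column of interest), the same call with n = 200 returns the 1,394 block sums:
library(zoo)

x <- rnorm(278800)   # placeholder for the real column
n <- 200
block_sums <- rollapply(x, n, sum, by = n)
length(block_sums)   # 1394 blocks (278800 / 200)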
First let's generate some data and get a label for each group:
library(tidyverse)
df <- rnorm(1000) %>%
  as_tibble() %>%
  mutate(grp = floor(1 + (row_number() - 1) / 200))
> df
# A tibble: 1,000 x 2
value grp
<dbl> <dbl>
1 -1.06 1
2 0.668 1
3 -2.02 1
4 1.21 1
...
1000 0.78 5
This creates 1000 random N(0,1) variables, turns it into a data frame, and then adds an incrementing numeric label for each group of 200.
df %>%
  group_by(grp) %>%
  summarize(grp_sum = sum(value))
# A tibble: 5 x 2
grp grp_sum
<dbl> <dbl>
1 1 9.63
2 2 -12.8
3 3 -18.8
4 4 -8.93
5 5 -25.9
Then we just need to do a group-by operation on the second column and sum the values. You can use the pull() operation to get a vector of the results:
df %>%
  group_by(grp) %>%
  summarize(grp_sum = sum(value)) %>%
  pull(grp_sum)
[1] 9.62529 -12.75193 -18.81967 -8.93466 -25.90523
I created a vector with 278800 observations (a):
a <- rnorm(278800)

b <- NULL  # initializing the column of interest
j <- 1
for (i in seq(1, length(a), by = 200)) {
  b[j] <- sum(a[i:(i + 199)])  # sum of each block of 200; b is your column of interest
  j <- j + 1
}
View(b)
In dplyr, I'm looking for a way to group by unique keys (for the problem at hand, by unique row numbers). Given a data frame such as the one below:
df <- data.frame(A = rep(1:5, each = 2), B = rnorm(10, 3, 3), C= runif(10, 1.5, 4.5))
#> A B C
#> 1 1 -4.6399372 1.622857
#> 2 1 0.9933197 4.256062
#> 3 2 4.1381981 3.522439
#> 4 2 4.6943698 4.260124
#> 5 3 5.7183797 3.877568
#> 6 3 -3.6183500 2.236473
#> 7 4 -2.5711393 4.373780
#> 8 4 5.9092908 2.125349
#> 9 5 6.1531930 4.472758
#> 10 5 -1.9750869 1.516432
I would like to replace the three rows df[4:6, ] with a single row containing their column means. Thus the result would have only 8 rows in total after grouping and collapsing. Normally, I would work it out in the following manner:
df %>%
  group_by(rownumber = c(1:3, rep(4, each = 3), 7:10)) %>%
  summarise_all(.funs = mean)
But I find the code overly explicit, in that each slice of the index has to be provided.
There must be more efficient/succinct ways to achieve the same feat. Thanks to anyone who can offer insights. Also, although the tidyverse community seems to avoid row names, for now I'd like to have proper row numbering here.
One option would be to replace those elements with a specific value so that we can avoid the rep and the later concatenation step:
df %>%
  group_by(grp = replace(row_number(), 4:6, 4)) %>%
  summarise_all(mean)
I'm quite new to R and this is the first time I dare to ask a question here.
I'm working with a dataset with Likert scales and I want to compute row sums over different groups of columns that share the first string in their name.
Below I constructed a data frame of only 2 rows to illustrate the approach I followed, though I would like feedback on a more efficient way of doing it.
df <- as.data.frame(rbind(rep(sample(1:5), 4), rep(sample(1:5), 4)))
var.names <- c("emp_1","emp_2","emp_3","emp_4","sat_1","sat_2",
               "sat_3","res_1","res_2","res_3","res_4","com_1",
               "com_2","com_3","com_4","com_5","cap_1","cap_2",
               "cap_3","cap_4")
names(df) <- var.names
So what I did was use the grep function to sum across the columns whose names start with a certain string and store the result in a new variable. But I have to write a new line of code for each variable.
df$emp_t <- rowSums(df[, grep("\\bemp.", names(df))])
df$sat_t <- rowSums(df[, grep("\\bsat.", names(df))])
df$res_t <- rowSums(df[, grep("\\bres.", names(df))])
df$com_t <- rowSums(df[, grep("\\bcom.", names(df))])
df$cap_t <- rowSums(df[, grep("\\bcap.", names(df))])
But there are a lot more variables in the dataset, and I would like to know if there is a way to do this with only one line of code. For example, some way to group the variables that start with the same string together and then apply the row sum function.
Thanks in advance!
One possible solution is to transpose df and calculate sums for the correct columns using base R rowsum function (using set.seed(123))
cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))
# emp_1 emp_2 emp_3 emp_4 sat_1 sat_2 sat_3 res_1 res_2 res_3 res_4 com_1 com_2 com_3 com_4 com_5 cap_1 cap_2 cap_3 cap_4 cap_t
# 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 13
# 2 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 14
# com_t emp_t res_t sat_t
# 1 15 14 11 7
# 2 15 10 12 9
Agree with MrFlick that you may want to put your data in long format (see reshape2, tidyr), but to answer your question:
cbind(
  df,
  sapply(split.default(df, sub("_.*$", "_t", names(df))), rowSums)
)
Will do the trick
You'll be better off in the long run if you put your data into tidy format. The problem is that the data is in a wide rather than a long format. And the variable names, e.g., emp_1, are actually two separate pieces of data: the class of the person, and the person's ID number (or something like that). Here is a solution to your problem with dplyr and tidyr.
library(dplyr)
library(tidyr)

df %>%
  gather(key, value) %>%
  extract(key, c("class", "id"), "([[:alnum:]]+)_([[:alnum:]]+)") %>%
  group_by(class) %>%
  summarize(class_sum = sum(value))
First we convert the data frame from wide to long format with gather(). Then we split the values emp_1 into separate columns class and id with extract(). Finally we group by the class and sum the values in each class. Result:
Source: local data frame [5 x 2]
class class_sum
1 cap 26
2 com 30
3 emp 23
4 res 22
5 sat 19
Another potential solution is to use dplyr's rowwise() function: https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/
df %>%
  rowwise() %>%
  mutate(emp_sum = sum(c_across(starts_with("emp"))),
         sat_sum = sum(c_across(starts_with("sat"))),
         res_sum = sum(c_across(starts_with("res"))),
         com_sum = sum(c_across(starts_with("com"))),
         cap_sum = sum(c_across(starts_with("cap"))))
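If there are many prefixes, a sketch of a less repetitive variant (base R; prefixes, p, and the _sum suffix are my own choices) derives the prefix list from the column names and loops over it:
# everything before the first underscore is treated as the prefix
prefixes <- unique(sub("_.*$", "", var.names))

# add one row-sum column per prefix, matching only the original item columns
# (names of the form prefix_number, so previously added total columns are ignored)
for (p in prefixes) {
  item_cols <- grepl(paste0("^", p, "_[0-9]+$"), names(df))
  df[[paste0(p, "_sum")]] <- rowSums(df[item_cols])
}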