I am extremely new to R and programming, so I don't even know how to describe my question clearly; please excuse me for using an example to explain what I mean:
Say I have a data frame with 2 columns: the first holds 10 different countries, the second a happiness rating (0-10). The country column can contain many repeats, e.g.:
Column titles: Country Happiness
1st Column content: A,C,A,B,B,B,C,A,D,D....
2nd Column content: 10,9,3,4,4,5,6,9,10,6...
What I want to achieve is: get mean/median/mode for country A B C D respectively. So far using describe() function I can only get the MMM for all the numbers, rather than by country.
I wonder if there is a function to achieve this directly, or should I create subsets of each country first? How should I do it?
Many thanks.
You can do this best with dplyr but first you will have to write a function for the mode:
getmode <- function(v) {
uniqv <- unique(v[!is.na(v)])
uniqv[which.max(table(match(v, uniqv)))]
}
Now you can group_by the grouping variable Country and use summarise to calculate the statistics:
library(dplyr)
df %>%
group_by(Country) %>%
summarise(Mean = mean(Happiness),
Median = median(Happiness),
Mode = getmode(Happiness))
Result:
# A tibble: 4 x 4
Country Mean Median Mode
* <chr> <dbl> <dbl> <int>
1 A 2.5 2.5 2
2 B 2 2 2
3 C 3 3 3
4 D 3.5 3.5 5
Data:
set.seed(12)
df <- data.frame(
Country = sample(LETTERS[1:4], 10, replace = TRUE),
Happiness = sample(1:5, 10, replace = TRUE)
)
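If you would rather not load dplyr, here is a minimal base R sketch of the same grouped summary. It uses the ten example values from the question; `happy`, `get_mode` and `by_country` are names made up for this illustration, and on ties the mode helper simply returns the value that appears first.

```r
# Example data: the ten values listed in the question
happy <- data.frame(
  Country   = c("A", "C", "A", "B", "B", "B", "C", "A", "D", "D"),
  Happiness = c(10, 9, 3, 4, 4, 5, 6, 9, 10, 6)
)

# Mode helper (hypothetical name); on ties it returns the value seen first
get_mode <- function(v) {
  v <- v[!is.na(v)]
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}

# Split Happiness by Country and compute the three statistics per group
by_country <- t(sapply(
  split(happy$Happiness, happy$Country),
  function(h) c(Mean = mean(h), Median = median(h), Mode = get_mode(h))
))
by_country
```

The result is a matrix with one row per country and the columns Mean, Median and Mode, so no manual subsetting per country is needed.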
I am working on a large dataset and need to calculate a single value in R. I believe cumsum and a cumulative product would work, but I don't know how.
county_id <- c(1,1,1,1,2,2,2,3,3)
res <- c(2,3,2,4,2,4,3,3,2)
I need a function that gives me one value for every county_id, and then the total across all counties.
Example, for county_id=1 the total for res is calculated manually as
2(3+2+4)+3(2+4)+2(4)
for county_id=2 the total for res is calculated manually as
2(4+3)+4(3)
for county_id=3 the total for res is calculated manually as
3(2)
Then it sums all this into a single variable
44+26+6=76
NB my county_id run from 1:47 and each county_id could have up to 200 res
Thank you
You can use aggregate with cumsum like:
x <- aggregate(res, list(county_id)
, function(x) sum(rev(cumsum(rev(x[-1])))*x[-length(x)]))
#Group.1 x
#1 1 44
#2 2 26
#3 3 6
sum(x[,2])
#[1] 76
You can sum the product of the pairwise combinations:
library(dplyr)
dat <- data.frame(county_id, res)  # data frame built from the question's vectors
dat %>%
group_by(county_id) %>%
summarise(x = sum(combn(res, 2, FUN = prod)))
# A tibble: 3 x 2
county_id x
<dbl> <dbl>
1 1 44
2 2 26
3 3 6
Base R:
aggregate(res ~ county_id, dat, FUN = function(x) sum(combn(x, 2, FUN = prod)))
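Since each county can have up to 200 values, it may be worth noting that the sum of all pairwise products can also be computed without enumerating the pairs, using the identity sum over i < j of x[i] * x[j] = ((sum of x)^2 - sum of x^2) / 2. The sketch below is only an illustration; `pair_sum`, `per_county` and `total` are made-up names.

```r
county_id <- c(1, 1, 1, 1, 2, 2, 2, 3, 3)
res <- c(2, 3, 2, 4, 2, 4, 3, 3, 2)

# Sum over all pairs i < j of x[i] * x[j], via ((sum x)^2 - sum x^2) / 2
pair_sum <- function(x) (sum(x)^2 - sum(x^2)) / 2

# One value per county, then the grand total
per_county <- tapply(res, county_id, pair_sum)
per_county
total <- sum(per_county)
total
```

For county 1 this gives (11^2 - 33) / 2 = 44, matching the manual calculation. A county with a single value yields 0 here, whereas combn(x, 2) treats a single positive number n as the sequence 1:n, so that case may need extra care with the combn approach.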
Here is one way to do this using tidyverse functions.
For each county_id we multiply the current res value by the sum of the res values after it.
library(dplyr)
library(purrr)
df1 <- df %>%
group_by(county_id) %>%
summarise(result = sum(map_dbl(row_number(),
~res[.x] * sum(res[(.x + 1):n()])), na.rm = TRUE))
df1
# county_id result
# <dbl> <dbl>
#1 1 44
#2 2 26
#3 3 6
To get total sum you can then do :
sum(df1$result)
#[1] 76
data
county_id <- c(1,1,1,1,2,2,2,3,3)
res <- c(2,3,2,4,2,4,3,3,2)
df <- data.frame(county_id, res)
Another option is to use SPSS syntax:
* Count the number of variables with valid responses.
count x1=var1 to var4(1 thr hi).
execute.
* First declare a variable that will hold the cumulative sum.
* Declare the variables as a vector.
* Then loop twice: the 1st loop runs from the 1st variable to the number of variables with data (x1).
* The 2nd loop runs from the 1st variable to the variable before the 1st loop's current one, for all variables with data.
* Lastly, accumulate the sum according to the formula.
* This syntax can be replicated in other software.
compute index1=0.
vector x=var1 to var4.
loop #i=1 to x1.
loop #j=1 to #i-1 if not missing(x(#i)).
compute index1=index1+(x(#j)*sum(x(#i))).
end loop.
end loop.
execute.
I am trying to write a simple function that summarizes some variables in a nested data set.
Here is a much simplified example:
df <- data.frame(ID=c(1,1,1,1,2,3,3,4,4,4,5,6,7,7,7,7,7,7,7,7),
var=c("A","B","C","D","B","A","D","A","C","D","D","D","A","D","A","A","A","B","B","B"),
N=c(50,50,50,50,298,156,156,85,85,85,278,301,98,98,98,98,98,98,98,98))
Think of this as a data frame containing the results of 7 different studies. Each study has investigated one or more variables (A, B, C, D). The columns mean:
ID = the ID of the respective study.
var = the variable measured in each study. Some studies measured only one variable (e.g., ID=2, which only contained B), some several.
N = the sample size of each study. That is, each ID has one sample size.
I would like to create a function that summarizes three things:
k = how many studies measured each variable (e.g., "A")
m = how often each variable was measured (regardless whether some studies measured a variable more than once)--a simple frequency.
N = the sample size per variable--but only once per study. That is, no duplications per study ID are allowed.
My current version (I am a real noob, so please forgive the form), results in exactly what I want:
model km N
1 A 4 (7) 389
2 B 3 (5) 446
3 C 2 (2) 135
4 D 6 (6) 968
For instance, variable A was measured 7 times, but only by 4 studies (i.e., study #7 measured it several times). The (non-redundant) sample size was N=389 (counting the sample of study #7 only once).
(Note: The parentheses in the table are helpful as I intend to copy the results into a document)
Here is the current version of the code. The problems begin with the part containing the pipes
kmn <- function(data, x, ID, N) {
m <-table(data[[x]])
k <-apply(table(data[[x]],data[[ID]]), 1, function(x) length(x[x>0]) )
model <- levels(data[[x]])
km <- cbind(k,m)
colnames(km)<-c("k","m")
km <- paste0(k," (",m,")")
smpsize <- data %>%
group_by(data[[x]]) %>%
summarise(N = sum(N[!duplicated(ID)])) %>%
select(N)
cbind(model,km,smpsize)
}
kmn(data=df, x="var", ID = "ID", N="N")
The above code works but only if the df-dataframe really contains the N-variable (but not with a different variable name). I guess the "data %>%" prompts R to look into the dataframe and not to use the "sum(N..." part as reference to the call.
I can guess that this looks horrible for someone with some idea :)
Thank you for any ideas
Holger
First, remove duplicates with the unique function and sum N by var.
Second, take df and group by var: n() gives the number of rows and n_distinct(ID) the number of unique IDs. Then join the result with the stats_N data frame.
library(dplyr)
stats_N <- df %>%
select(ID,var,N) %>%
unique() %>%
group_by(var) %>%
summarise(N=sum(N))
df %>%
group_by(var) %>%
summarise(n=n(),km=n_distinct(ID)) %>%
left_join(stats_N, by = "var")
# A tibble: 4 x 4
# var n km N
# <fct> <int> <int> <dbl>
#1 A 7 4 389
#2 B 5 3 446
#3 C 2 2 135
#4 D 6 6 968
In addition to @fmarm's answer, this can also be done without a join: group by 'var', then get the number of distinct elements in 'ID' (n_distinct), the number of rows (n()), and the sum of 'N' counted once per 'ID'.
library(dplyr)
df %>%
  group_by(model = var) %>%
  summarise(km = sprintf("%d (%d)", n_distinct(ID), n()),
            N = sum(N[!duplicated(ID)]))  # one N per study, even if two studies share the same N
# A tibble: 4 x 3
# model km N
# <fct> <chr> <dbl>
#1 A 4 (7) 389
#2 B 3 (5) 446
#3 C 2 (2) 135
#4 D 6 (6) 968
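Neither answer wraps the calculation in a function that takes the column names as strings, which was the original sticking point. Below is a minimal base R sketch of such a function. The argument names `x`, `id` and `n`, and the deliberately renamed columns in `studies`, are assumptions made for this illustration, and it assumes the sample size is constant within a study.

```r
# Same values as in the question, but with different column names on purpose
studies <- data.frame(
  study   = c(1, 1, 1, 1, 2, 3, 3, 4, 4, 4, 5, 6, 7, 7, 7, 7, 7, 7, 7, 7),
  measure = c("A", "B", "C", "D", "B", "A", "D", "A", "C", "D",
              "D", "D", "A", "D", "A", "A", "A", "B", "B", "B"),
  size    = c(50, 50, 50, 50, 298, 156, 156, 85, 85, 85,
              278, 301, 98, 98, 98, 98, 98, 98, 98, 98)
)

kmn <- function(data, x, id, n) {
  # One row per study/variable pair, so each study counts once per variable
  per_study <- unique(data[c(id, x, n)])
  m_tab <- table(data[[x]])                    # m: all measurements
  k <- as.vector(table(per_study[[x]]))        # k: studies per variable
  m <- as.vector(m_tab)
  N <- as.vector(tapply(per_study[[n]], per_study[[x]], sum))
  data.frame(model = names(m_tab),
             km = sprintf("%d (%d)", k, m),
             N = N)
}

result <- kmn(studies, x = "measure", id = "study", n = "size")
result
```

Because the columns are looked up with `[[`, the function works whatever the columns are called. If you prefer to stay inside a dplyr pipeline, the `.data[[x]]` pronoun is the usual way to refer to a column by a string name.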
I'm trying to transfer some work previously done in Excel into R. All I need to do is transform two basic count_if formulae into readable R script. In Excel, I would use three tables and calculate across those using 'point-and-click' methods, but now I'm lost in how I should address it in R.
My original dataframes are large, so for this question I've posted sample dataframes:
OperatorData <- data.frame(
Operator = c("A","B","C"),
Locations = c(850, 575, 2175)
)
AreaData <- data.frame(
Area = c("Torbay","Torquay","Tooting","Torrington","Taunton","Torpley"),
SumLocations = c(1000,500,500,250,600,750)
)
OperatorAreaData <- data.frame(
Operator = c("A","A","A","B","B","B","C","C","C","C","C"),
Area = c("Torbay","Tooting","Taunton",
"Torbay","Taunton","Torrington",
"Tooting","Torpley","Torquay","Torbay","Torrington"),
Locations = c(250,400,200,
100,400,75,
100,750,500,650,175)
)
What I'm trying to do is add two new columns to the OperatorData dataframe: one with the count of areas that operator operates in, and another with the count of those areas in which the operator owns more than 50% of locations.
So the new resulting dataframe would look like this
Operator Locations AreaCount Own_GE_50percent
A 850 3 1
B 575 3 1
C 2175 5 4
So far, I've managed to calculate the first column using the table function and then appending:
OpAreaCount <- data.frame(table(OperatorAreaData$Operator))
names(OpAreaCount)[2] <- "AreaCount"
OperatorData$"AreaCount" <- cbind(OpAreaCount$AreaCount)
This is fairly straightforward, but I'm stuck in how to calculate the second column calculation with the condition of 50%.
library(dplyr)
OperatorAreaData %>%
inner_join(AreaData, by="Area") %>%
group_by(Operator) %>%
summarise(AreaCount = n_distinct(Area),
Own_GE_50percent = sum(Locations > (SumLocations/2)))
# # A tibble: 3 x 3
# Operator AreaCount Own_GE_50percent
# <fct> <int> <int>
# 1 A 3 1
# 2 B 3 1
# 3 C 5 4
You can use AreaCount = n() if you're sure you have unique Area values for each Operator.
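The summary above does not yet include the Locations column from OperatorData. A minimal base R sketch that computes both counts and attaches them to OperatorData could look like this; `joined`, `majority`, `area_count`, `majority_count` and `key` are names invented for the illustration.

```r
OperatorData <- data.frame(
  Operator  = c("A", "B", "C"),
  Locations = c(850, 575, 2175)
)
AreaData <- data.frame(
  Area = c("Torbay", "Torquay", "Tooting", "Torrington", "Taunton", "Torpley"),
  SumLocations = c(1000, 500, 500, 250, 600, 750)
)
OperatorAreaData <- data.frame(
  Operator = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C"),
  Area = c("Torbay", "Tooting", "Taunton",
           "Torbay", "Taunton", "Torrington",
           "Tooting", "Torpley", "Torquay", "Torbay", "Torrington"),
  Locations = c(250, 400, 200, 100, 400, 75, 100, 750, 500, 650, 175)
)

# Attach each area's total to the operator/area rows
joined <- merge(OperatorAreaData, AreaData, by = "Area")
# TRUE where the operator owns more than half of the area's locations
joined$majority <- joined$Locations > joined$SumLocations / 2

# Per-operator counts
area_count <- tapply(as.character(joined$Area), joined$Operator,
                     function(a) length(unique(a)))
majority_count <- tapply(joined$majority, joined$Operator, sum)

# Look the counts up by operator name rather than relying on row order
key <- as.character(OperatorData$Operator)
OperatorData$AreaCount <- as.vector(area_count[key])
OperatorData$Own_GE_50percent <- as.vector(majority_count[key])
OperatorData
```

Note that the comparison is strictly "more than 50%", as in the question text; use >= instead if exactly half should also count.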
I am new to R so any help is greatly appreciated!
I have a data frame with 278,800 observations for each of my 10 variables. I am trying to create an 11th variable that sums every 200 observations (rows) of a specific column (the sums of rows 1:200, 201:400, 401:600, etc.), similar to the OFFSET function in Excel.
I have tried subsetting my data to just the variable of interest, with the aim of adding a new variable that sums every 200 rows, but I cannot figure it out. I understand my new "variable" will produce 1,394 data points (278,800/200). I have tried the rollapply function; however, the output does not sum in blocks of 200: it sums 1:200, 2:201, 3:202, etc.
Thanks,
E
rollapply has a by= argument for that. Here is a smaller example using n = 3 instead of n = 200. Note that 1+2+3=6, 4+5+6=15, 7+8+9=24 and 10+11+12=33.
# test data
DF <- data.frame(x = 1:12)
library(zoo)
n <- 3
rollapply(DF$x, n, sum, by = n)
## [1] 6 15 24 33
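If you want to avoid the zoo dependency, the same non-overlapping block sums can be sketched in base R by building a block index with integer division. The names `block`, `block_sums` and `block_sums2` are just for illustration.

```r
x <- 1:12
n <- 3

# Rows 1..n get block 0, rows n+1..2n get block 1, and so on
block <- (seq_along(x) - 1) %/% n
block_sums <- as.vector(tapply(x, block, sum))
block_sums

# When length(x) is an exact multiple of n, reshaping also works:
# each matrix column holds one block of n consecutive values
block_sums2 <- colSums(matrix(x, nrow = n))
block_sums2
```

With n = 200 and 278,800 rows, either version should return 1,394 sums.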
First let's generate some data and get a label for each group:
library(tidyverse)
df <-
rnorm(1000) %>%
as_tibble() %>%
mutate(grp = floor(1 + (row_number() - 1) / 200))
> df
# A tibble: 1,000 x 2
value grp
<dbl> <dbl>
1 -1.06 1
2 0.668 1
3 -2.02 1
4 1.21 1
...
1000 0.78 5
This creates 1000 random N(0,1) variables, turns it into a data frame, and then adds an incrementing numeric label for each group of 200.
df %>%
group_by(grp) %>%
summarize(grp_sum = sum(value))
# A tibble: 5 x 2
grp grp_sum
<dbl> <dbl>
1 1 9.63
2 2 -12.8
3 3 -18.8
4 4 -8.93
5 5 -25.9
The code above groups by the label column and sums the values within each group. To get the results as a vector, add pull():
df %>%
group_by(grp) %>%
summarize(grp_sum = sum(value)) %>%
pull(grp_sum)
[1] 9.62529 -12.75193 -18.81967 -8.93466 -25.90523
I created a vector with 278800 observations (a):
a <- rnorm(278800)
b <- numeric(length(a) / 200)  # preallocate the column of interest
j <- 1
for (i in seq(1, length(a), by = 200)) {
  # parentheses are needed: i:i+199 parses as (i:i) + 199, a single index
  b[j] <- sum(a[i:(i + 199)])
  j <- j + 1
}
View(b)
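A point that is easy to get wrong when indexing blocks like this: in R the `:` operator binds more tightly than `+`, so the upper bound of a range has to be parenthesised. A tiny illustration, with made-up names `single` and `block`:

```r
i <- 1

# Parsed as (i:i) + 199, which is the single index 200
single <- i:i + 199
single

# The 200 consecutive indices from i to i + 199
block <- i:(i + 199)
length(block)
```

Without the parentheses, each "block sum" would silently be just one element of the vector.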
I just started working with R for my master's thesis, and up to now all my calculations have worked out, as I read a lot of questions and answers here (it's a lot of trial and error, but that's OK).
Now I need to write more sophisticated code and I can't find a way to do this.
This is the situation: I have multiple sub-data-sets with a lot of entries, but they are all structured in the same way. In one of them (50000 entries) I want to change only one value in every row. The new value should be the existing entry plus a few values from another sub-data-set (140000 entries) where the 'ID' variable is the same.
As this is the third day I'm trying to solve this, I have already tried for and apply, but both run for hours (cancelled after three hours).
Here is an example of one of my attempts (with for):
for (i in 1:50000) {
Entry_ID <- Sub02[i,4]
SUM_Entries <- sum(Sub03$Source==Entry_ID)
Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # The Entry_ID/Source is a character
Value1 <- as.numeric(Entries_w_ID$VAL1)
SUM_Value1 <- sum(Value1)
Value2 <- as.numeric(Entries_w_ID$VAL2)
SUM_Value2 <- sum(Value2)
OLD_Val1 <- Sub02[i,13]
OLD_Val <- as.numeric(OLD_Val1)
NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val
Sub02[i,13] <- NEW_Val
}
I know this might be silly code, but that's the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get on with my thesis.
Thank you!
Here's an example of my data-structure:
Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9
The output I expect is an updated value 'VAL9' in every row.
From what I understood so far, you need 2 things:
1. sum up some values in one dataset
2. add them to another dataset, using an ID variable
Besides what @yoland already contributed, I would suggest breaking it down into two separate tasks. Consider these two datasets:
a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
# x id
# 1 1 a
# 2 2 b
b = data.frame(values = as.character(1:4), otherid = letters[1:2],
stringsAsFactors = FALSE)
sapply(b, class)
# values otherid
# "character" "character"
Values is character now, we need to convert it to numeric:
b$values = as.numeric(b$values)
sapply(b, class)
# values otherid
# "numeric" "character"
Then sum up the values in b (grouped by otherid):
library(dplyr)
b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
# otherid sum_values
# <chr> <dbl>
# 1 a 4
# 2 b 6
Then join it with a - note that identifiers are specified in c():
ab = left_join(a, b, by = c("id" = "otherid"))
ab
# x id sum_values
# 1 1 a 4
# 2 2 b 6
We can then add the result of the sum from b to the variable x in a:
ab$total = ab$x + ab$sum_values
ab
# x id sum_values total
# 1 1 a 4 5
# 2 2 b 6 8
(Updated.)
From what I understand, you want to create a new variable that uses information from two different data sets indexed by the same ID. The easiest way to do this is probably to join the data sets together (if you need to save memory, just join the columns you need). I found dplyr's join functions very handy for these cases (explained neatly here). Once you have joined the data sets into one, it should be easy to create the new columns you need, e.g.: df$new <- df$old1 + df$old2
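To make the aggregate-then-join idea concrete, here is a minimal base R sketch on toy data shaped like the question's. The lower-case names `sub02` and `sub03`, the tiny values, and the assumption that Source must match ID exactly (rather than via grepl) are all mine, not taken from the question.

```r
# Toy stand-ins for Sub02 (one row per ID) and Sub03 (many rows per Source)
sub02 <- data.frame(ID = c("a", "b", "c"), VAL9 = c(8, 11, 9),
                    stringsAsFactors = FALSE)
sub03 <- data.frame(Source = c("a", "a", "b"),
                    VAL1 = c(4, 6, 5), VAL2 = c(1, 7, 8),
                    stringsAsFactors = FALSE)

# Step 1: per Source, count the entries and total VAL1 and VAL2 in one pass
sub03$n <- 1
per_source <- aggregate(cbind(n, VAL1, VAL2) ~ Source, data = sub03, FUN = sum)
per_source$extra <- per_source$n + per_source$VAL1 + per_source$VAL2

# Step 2: look up each ID's total; IDs without any entries contribute 0
extra <- per_source$extra[match(sub02$ID, per_source$Source)]
extra[is.na(extra)] <- 0
sub02$VAL9 <- sub02$VAL9 + extra
sub02
```

This replaces the 50000 subset() calls with a single aggregation and a single lookup, which is usually where the hours of runtime go in a row-by-row loop.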