how to calculate mean based on conditions in for loop in r - r

I have what I think is a simple question but I can't figure it out! I have a data frame with multiple columns. Here's a general example:
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
test.df
I would like for R to calculate average activity based on the age of the colony in the data frame. Specifically, I want it to only calculate the average activity of the colonies that are the same age or older than the colony in that row, not including the activity of the colony in that row. For example, colony 29683 is 21 years old. I want the average activity of colonies older than 21 for this row of my data. That would include colony 25077 and colony 4865; and the mean would be (45+33)/2 = 39. I want R to do this for each row of the data by identifying the age of the colony in the current row, then identifying the colonies that are older than that colony, and then averaging the activity of those colonies.
I've tried doing this in a for loop in R. Here's the code I used:
test.avg = vector("numeric",nrow(test.df))`
for (i in 1:10){
test.avg[i] <- mean(subset(test.df$activity,test.df$age >= age[i])[-i])
}
R returns a list of values where half of them are correct and the the other half are not (I'm not even sure how it calculated those incorrect numbers..). The numbers that are correct are also out of order compared to how they're listed in the dataframe. It's clearly able to do the right thing for some iterations of the loop but not all. If anyone could help me out with my code, I would greatly appreciate it!

colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
library(tidyverse)
test.df %>%
mutate(result = map_dbl(age, ~mean(activity[age > .x])))
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
# base
test.df$result <- with(test.df, sapply(age, FUN = function(x) mean(activity[age > x])))
test.df
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
Created on 2021-03-22 by the reprex package (v1.0.0)

The issue in your solution is that the index would apply to the original data.frame, yet you subset that and so it does not match anymore.
Try something like this: First find minimum age, then exclude current index and calculate average activity of cases with age >= pre-calculated minimum age.
for (i in 1:10){
test.avg[i] <- {amin=age[i]; mean(subset(test.df[-i,], age >= amin)$activity)}
}

You can use map_df :
library(tidyverse)
test.df %>%
mutate(map_df(1:nrow(test.df), ~
test.df %>%
filter(age >= test.df$age[.x]) %>%
summarise(av_acti= mean(activity))))

Related

How to create a loop code from big dataframe in R?

I have a data series of daily snow depth values over a 60 year period. I would like to see the number of days with a snow depth higher than 30 cm for each season, for example from July 1980 to June 1981. What does the code for this have to look like? I know how I could calculate the daily values higher than 30 cm per season individually, but not how a code could calculate all seasons.
I have uploaded my dataframe on wetransfer: Dataframe
Thank you so much for your help in advance.
Pernilla
Something like this would work
library(dplyr)
library(lubridate)
df<-read.csv('BayrischerWald_Brennes_SH_daily_merged.txt', sep=';')
df_season <-df %>%
mutate(season=(Day %>% ymd() - days(181)) %>% floor_date("year") %>% year())
df_group_by_season <- df_season %>%
filter(!is.na(SHincm)) %>%
group_by(season) %>%
summarize(days_above_30=sum(SHincm>30)) %>%
ungroup()
df_group_by_season
#> # A tibble: 61 × 2
#> season days_above_30
#> <dbl> <int>
#> 1 1961 1
#> 2 1962 0
#> 3 1963 0
#> 4 1964 0
#> 5 1965 0
#> 6 1966 0
#> 7 1967 129
#> 8 1968 60
#> 9 1969 107
#> 10 1970 43
#> # … with 51 more rows
Created on 2022-01-15 by the reprex package (v2.0.1)
Here is an approach using the aggregate() function. After reading the data, convert the Date field to a date object and get rid of the rows with missing values for the date:
snow <- read.table("BayrischerWald_Brennes_SH_daily_merged.txt", header=TRUE, sep=";")
snow$Day <- as.Date(snow$Day)
str(snow)
# 'data.frame': 51606 obs. of 2 variables:
# $ Day : Date, format: "1961-11-01" "1961-11-02" "1961-11-03" "1961-11-04" ...
# $ SHincm: int 0 0 0 0 2 9 19 22 15 5 ...
snow <- snow[!is.na(snow$Day), ]
str(snow)
# 'data.frame': 21886 obs. of 2 variables:
# $ Day : Date, format: "1961-11-01" "1961-11-02" "1961-11-03" "1961-11-04" ...
# $ SHincm: int 0 0 0 0 2 9 19 22 15 5 ...
Notice more than half of your data has missing values for the date. Now we need to divide the data by ski season:
brks <- as.Date(paste(1961:2022, "07-01", sep="-"))
lbls <- paste(1961:2021, 1962:2022, sep="/")
snow$Season <- cut(snow$Day, breaks=brks, labels=lbls)
Now we use aggregate() to get the number of days with over 30 inches of snow:
days30cm <- aggregate(SHincm~Season, snow, subset=snow$SHincm > 30, length)
colnames(days30cm)[2] <- "Over30cm"
head(days30cm, 10)
# Season Over30cm
# 1 1961/1962 1
# 2 1967/1968 129
# 3 1968/1969 60
# 4 1969/1970 107
# 5 1970/1971 43
# 6 1972/1973 101
# 7 1973/1974 119
# 8 1974/1975 188
# 9 1975/1976 126
# 10 1976/1977 112
In addition, you can get other statistics such as the maximum snow of the season or the total cm of snow:
maxsnow <- aggregate(SHincm~Season, snow, max)
totalsnow <- aggregate(SHincm~Season, snow, sum)

Struggling to Create a Pivot Table in R

I am very, very new to any type of coding language. I am used to Pivot tables in Excel, and trying to replicate a pivot I have done in Excel in R. I have spent a long time searching the internet/ YouTube, but I just can't get it to work.
I am looking to produce a table in which I the left hand side column shows a number of locations, and across the top of the table it shows different pages that have been viewed. I want to show in the table the number of views per location which each of these pages.
The data frame 'specificreports' shows all views over the past year for different pages on an online platform. I want to filter for the month of October, and then pivot the different Employee Teams against the number of views for different pages.
specificreports <- readxl::read_excel("Multi-Tab File - Dashboard
Usage.xlsx", sheet = "Specific Reports")
specificreportsLocal <- tbl_df(specificreports)
specificreportsLocal %>% filter(Month == "October") %>%
group_by("Employee Team") %>%
This bit works, in that it groups the different team names and filters entries for the month of October. After this I have tried using the summarise function to summarise the number of hits but can't get it to work at all. I keep getting errors regarding data type. I keep getting confused because solutions I look up keep using different packages.
I would appreciate any help, using the simplest way of doing this as I am a total newbie!
Thanks in advance,
Holly
let's see if I can help a bit. It's hard to know what your data looks like from the info you gave us. So I'm going to guess and make some fake data for us to play with. It's worth noting that having field names with spaces in them is going to make your life really hard. You should start by renaming your fields to something more manageable. Since I'm just making data up, I'll give my fields names without spaces:
library(tidyverse)
## this makes some fake data
## a data frame with 3 fields: month, team, value
n <- 100
specificreportsLocal <-
data.frame(
month = sample(1:12, size = n, replace = TRUE),
team = letters[1:5],
value = sample(1:100, size = n, replace = TRUE)
)
That's just a data frame called specificreportsLocal with three fields: month, team, value
Let's do some things with it:
# This will give us total values by team when month = 10
specificreportsLocal %>%
filter(month == 10) %>%
group_by(team) %>%
summarize(total_value = sum(value))
#> # A tibble: 4 x 2
#> team total_value
#> <fct> <int>
#> 1 a 119
#> 2 b 172
#> 3 c 67
#> 4 d 229
I think that's sort of like what you already did, except I added the summarize to show how it works.
Now let's use all months and reshape it from 'long' to 'wide'
# if I want to see all months I leave out the filter and
# add a group_by month
specificreportsLocal %>%
group_by(team, month) %>%
summarize(total_value = sum(value)) %>%
head(5) # this just shows the first 5 values
#> # A tibble: 5 x 3
#> # Groups: team [1]
#> team month total_value
#> <fct> <int> <int>
#> 1 a 1 17
#> 2 a 2 46
#> 3 a 3 91
#> 4 a 4 69
#> 5 a 5 83
# to make this 'long' data 'wide', we can use the `spread` function
specificreportsLocal %>%
group_by(team, month) %>%
summarize(total_value = sum(value)) %>%
spread(team, total_value)
#> # A tibble: 12 x 6
#> month a b c d e
#> <int> <int> <int> <int> <int> <int>
#> 1 1 17 122 136 NA 167
#> 2 2 46 104 158 94 197
#> 3 3 91 NA NA NA 11
#> 4 4 69 120 159 76 98
#> 5 5 83 186 158 19 208
#> 6 6 103 NA 118 105 84
#> 7 7 NA NA 73 127 107
#> 8 8 NA 130 NA 166 99
#> 9 9 125 72 118 135 71
#> 10 10 119 172 67 229 NA
#> 11 11 107 81 NA 131 49
#> 12 12 174 87 39 NA 41
Created on 2018-12-01 by the reprex package (v0.2.1)
Now I'm not really sure if that's what you want. So feel free to make a comment on this answer if you need any of this clarified.
Welcome to Stack Overflow!
I'm not sure I correctly understand your need without a data sample, but this may work for you:
library(rpivotTable)
specificreportsLocal %>% filter(Month == "October")
rpivotTable(specificreportsLocal, rows="Employee Team", cols="page", vals="views", aggregatorName = "Sum")
Otherwise, if you do not need it interactive (as the Pivot Tables in Excel), this may work as well:
specificreportsLocal %>% filter(Month == "October") %>%
group_by_at(c("Employee Team", "page")) %>%
summarise(nr_views = sum(views, na.rm=TRUE))

Extend data frame column with inflation in R

I'm trying to extend some code to be able to:
1) read in a vector of prices
2) left join that vector of prices to a data frame of years (or years and months)
3) append/fill the prices for missing years with interpolated data based on the last year of available prices plus a specified inflation rate. Consider an example like this one:
prices <- data.frame(year=2018:2022,
wti=c(75,80,90,NA,NA),
brent=c(80,85,94,93,NA))
What I need is something that will fill the missing rows of each column with the last price plus inflation (suppose 2%). I can do this in a pretty brute force way as:
i_rate<-0.02
for(i in c(1:nrow(prices))){
if(is.na(prices$wti[i]))
prices$wti[i]<-prices$wti[i-1]*(1+i_rate)
if(is.na(prices$brent[i]))
prices$brent[i]<-prices$brent[i-1]*(1+i_rate)
}
It seems to me there should be a way to do this using some combination of apply() and/or fill() but I can't seem to make it work.
Any help would be much appreciated.
As noted by #camille, the problem with dplyr::lag is that it doesn't work here with consecutive NAs because it uses the "original" ith element of a vector instead of the "revised" ith element. We'd have to first create a version of lag that will do this by creating a new function:
impute_inflation <- function(x, rate) {
output <- x
y <- rep(NA, length = length(x)) #Creating an empty vector to fill in with the loop. This makes R faster to run for vectors with a large number of elements.
for (i in seq_len(length(output))) {
if (i == 1) {
y[i] <- output[i] #To avoid an error attempting to use the 0th element.
} else {
y[i] <- output[i - 1]
}
if (is.na(output[i])) {
output[i] <- y[i] * (1 + rate)
} else {
output[i]
}
}
output
}
Then it's a pinch to apply this across a bunch of variables with dplyr::mutate_at():
library(dplyr)
mutate_at(prices, vars(wti, brent), impute_inflation, 0.02)
year wti brent
1 2018 75.000 80.00
2 2019 80.000 85.00
3 2020 90.000 94.00
4 2021 91.800 93.00
5 2022 93.636 94.86
You can use dplyr::lag to get the previous value in a given column. Your lagged values look like this:
library(dplyr)
inflation_factor <- 1.02
prices <- data_frame(year=2018:2022,
wti=c(75,80,90,NA,NA),
brent=c(80,85,94,93,NA)) %>%
mutate_at(vars(wti, brent), as.numeric)
prices %>%
mutate(prev_wti = lag(wti))
#> # A tibble: 5 x 4
#> year wti brent prev_wti
#> <int> <dbl> <dbl> <dbl>
#> 1 2018 75 80 NA
#> 2 2019 80 85 75
#> 3 2020 90 94 80
#> 4 2021 NA 93 90
#> 5 2022 NA NA NA
When a value is NA, multiply the lagged value by the inflation factor. As you can see, that doesn't handle consecutive NAs, however.
prices %>%
mutate(wti = ifelse(is.na(wti), lag(wti) * inflation_factor, wti),
brent = ifelse(is.na(brent), lag(brent) * inflation_factor, brent))
#> # A tibble: 5 x 3
#> year wti brent
#> <int> <dbl> <dbl>
#> 1 2018 75 80
#> 2 2019 80 85
#> 3 2020 90 94
#> 4 2021 91.8 93
#> 5 2022 NA 94.9
Or to scale this and avoid doing the same multiplication over and over, gather the data into a long format, get lags within each group (wti, brent, or any others you may have), and adjust values as needed. Then you can spread back to the original shape:
prices %>%
tidyr::gather(key = key, value = value, wti, brent) %>%
group_by(key) %>%
mutate(value = ifelse(is.na(value), lag(value) * inflation_factor, value)) %>%
tidyr::spread(key = key, value = value)
#> # A tibble: 5 x 3
#> year brent wti
#> <int> <dbl> <dbl>
#> 1 2018 80 75
#> 2 2019 85 80
#> 3 2020 94 90
#> 4 2021 93 91.8
#> 5 2022 94.9 NA
Created on 2018-07-12 by the reprex package (v0.2.0).

From panel data to cross-sectional data using averages

I am very new to R so I am not sure how basic my question is, but I am stuck at the following point.
I have data that has a panel structure, similar to this
Country Year Outcome Country-characteristic
A 1990 10 40
A 1991 12 40
A 1992 14 40
B 1991 10 60
B 1992 12 60
For some reason I need to put this in a cross-sectional structure such I get averages over all years for each country, that is in the end, it should look like,
Country Outcome Country-Characteristic
A 12 40
B 11 60
Has anybody faced a similar problem? I was playing with lapply(table$country, table$outcome, mean) but that did not work as I wanted it.
Two tips: 1- When you ask a question, you should provide a reproducible example for the data too (as I did with read.table below). 2- It's not a good idea to use "-" in column names. You should use "_" instead.
You can get a summary using the dplyr package:
df1 <- read.table(text="Country Year Outcome Countrycharacteristic
A 1990 10 40
A 1991 12 40
A 1992 14 40
B 1991 10 60
B 1992 12 60", header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
group_by(Country) %>%
summarize(Outcome=mean(Outcome),Countrycharacteristic=mean(Countrycharacteristic))
# A tibble: 2 x 3
Country Outcome Countrycharacteristic
<chr> <dbl> <dbl>
1 A 12 40
2 B 11 60
We can do this in base R with aggregate
aggregate(.~Country, df1[-2], mean)
# Country Outcome Countrycharacteristic
#1 A 12 40
#2 B 11 60

Aggregating by subsets in dplyr

I have a dataset with a million records that I need to aggregate after first subsetting the data. It is difficult to provide a good reproducible sample because in this case, the sample size would be rather large - but I will try anyway.
A random sample of the data that I am working with looks like this:
> df
auto_id user_id month
164537 7124 240249 10
151635 7358 226423 9
117288 7376 172463 9
177119 6085 199194 11
128904 7110 141608 9
157194 7143 241964 9
71303 6090 141646 7
72480 6808 175910 7
108705 6602 213098 8
97889 7379 185516 8
184906 6405 212580 12
37242 6057 197905 8
157284 6548 162928 9
17910 6885 194180 10
70660 7162 161827 7
8593 7375 207061 8
28712 6311 176373 10
144194 7324 142715 9
73106 7196 176153 7
67065 7392 171039 7
77954 7116 161489 7
59842 7107 162637 7
101819 5994 182973 9
183546 6427 142029 12
102881 6477 188129 8
In every month, there many users who are the same, and first we should subset by month and make a frequency table of the users and the amount of trips taken (unfortunately, in the random sample above there is only one trip per user, but in the larger dataset, this is not the case):
full_data <- full_data[full_data$month == 7,]
users <- as.data.frame(table(full_data$user_id))
head(users)
Var1 Freq
1 100231 10
2 100744 17
3 111281 1
4 111814 2
5 113716 3
6 117493 3
As we can see, in the full data set, in month of July (month = 7), users have taken multiple trips. Now the important part - which is to subset only the top 10% of these users (the top 10% in terms of Freq)
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
Now the new dataframe - topten - can be summed and we get the amount of trips taken by the top ten percent of users
sum(topten$Freq)
[1] 12147
In the end the output should look like this
> output
month trips
1 7 12147
2 8 ...
3 9 ...
4 10 ...
5 11 ...
6 12 ...
Is there a way to automate this process using dplyr - I mean specifically the subsetting by the top ten percent ? I have tried
output <- full_data %>%
+ group_by(month) %>%
+ summarise(n = n())
But this only aggregates total trips by month. Could someone suggest a way to integrate this part into the query in dplyr ? :
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
The code below counts the number of rows for each user_id in each month, and then selects the 10% of users with the most rows in each month and sums them. Let me know if it solves your problem.
library(dplyr)
full_data %>% group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9) %>%
summarise(n_trips = sum(n))
UPDATE: Following up on your comment, let's do a check with some fake data. Below we have 30 different values of user_id and 10,000 total rows. I've also used the prob argument so that the probability of a user_id being selected is proportional to its value (i.e., user_id 1 is the least likely to be chosen and user_id 30 is the most likely to be chosen).
set.seed(3)
full_data = data.frame(user_id=sample(1:30,10000, replace=TRUE, prob=1:30),
month=sample(1:12, 10000, replace=TRUE))
Let's look as the number of rows for each user_id for month==1. The code below counts the number of rows for each user_id and sorts from most to least common. Note that the three most common values of user_id (28,29,26) comprise 171 rows (60+57+54). Since there are 30 different values of user_id the top three users represent the top 10% of users:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
arrange(desc(n)) %>% as.data.frame
month user_id n
1 1 28 60
2 1 29 57
3 1 26 54
4 1 30 53
5 1 27 49
6 1 22 43
7 1 21 41
8 1 20 40
9 1 23 40
10 1 24 38
11 1 25 38
12 1 19 37
13 1 18 33
14 1 16 28
15 1 15 27
16 1 17 27
17 1 14 26
18 1 9 20
19 1 12 20
20 1 13 20
21 1 10 17
22 1 11 17
23 1 6 15
24 1 7 13
25 1 8 13
26 1 4 9
27 1 5 7
28 1 2 3
29 1 3 2
30 1 1 1
So now let's take the next step and select the top 10% of users. To answer the question in your comment, filter(percent_rank(n) >= 0.9) keeps only the top 10% of user_id, based on the value of n (which is the number of rows for each user_id). percent_rank is on of several ranking functions in dplyr that have different ways of dealing with ties (which may be the reason you're not getting the results you expect). See ?percent_rank for details:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9)
month user_id n
1 1 26 54
2 1 28 60
3 1 29 57
And the sum of n (the total number of trips for the top 10%) is:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9) %>%
summarise(n_trips = sum(n))
month n_trips
1 1 171
So it looks like the code does what we'd naively expect, but maybe the issue is related to how ties are dealt with. Let me know if you're still getting anomalous results in your real data or if I've misunderstood what you're trying to accomplish.

Resources