So let's say I have a data table that consists of monthly stock returns:
Company  Year  return  next years return
1        1     5
1        2     6
1        3     2
1        4     4
For a large dataset with multiple companies and years, how can I get a new column that contains next year's return? For example, the first row would hold the second year's return of 6%, and so on. In Excel I could simply use INDEX/MATCH, but I have no idea how it's done in R. The reason for not using Excel is that it takes over 20 hours to compute all the functions, as INDEX/MATCH is extremely slow. The code needs to do this for all companies, so it has to find the correct company for the correct year and then put the value into the new column.
You could group by the company and use lead() to get the next value:
library(dplyr)
df <- data.frame(
company = c(1L, 1L, 1L, 1L, 2L, 2L),
year = c(1L, 2L, 3L, 4L, 1L, 2L),
return_ = c(5L, 6L, 2L, 4L, 2L, 4L))
df
#> company year return_
#> 1 1 1 5
#> 2 1 2 6
#> 3 1 3 2
#> 4 1 4 4
#> 5 2 1 2
#> 6 2 2 4
df %>% group_by(company) %>%
mutate(next.years.return = lead(return_, order_by = year))
#> # A tibble: 6 × 4
#> # Groups: company [2]
#> company year return_ next.years.return
#> <int> <int> <int> <int>
#> 1 1 1 5 6
#> 2 1 2 6 2
#> 3 1 3 2 4
#> 4 1 4 4 NA
#> 5 2 1 2 4
#> 6 2 2 4 NA
Created on 2023-02-10 with reprex v2.0.2
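For readers coming from Excel, the same lookup can also be sketched in base R with match(), which plays roughly the role of INDEX/MATCH. This is a minimal sketch and not part of the answer above: it assumes each (company, year) pair occurs at most once, and the helper names key and want are made up for illustration. Unlike lead(), it gives NA when the following year is missing instead of taking whichever row comes next.

```r
df <- data.frame(
  company = c(1L, 1L, 1L, 1L, 2L, 2L),
  year    = c(1L, 2L, 3L, 4L, 1L, 2L),
  return_ = c(5L, 6L, 2L, 4L, 2L, 4L))

# key:  identifies each existing row by company and year
# want: the company/year combination whose return we are looking for
key  <- paste(df$company, df$year)
want <- paste(df$company, df$year + 1L)

# match() gives the row position of each wanted key (NA if absent),
# and indexing return_ with it pulls the value, like INDEX(MATCH()) in Excel
df$next.years.return <- df$return_[match(want, key)]
df
```

match() is vectorised, so this should stay fast on large tables, but the dplyr version is likely easier to read.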
To get the next year's return only if the following row really is the next year:
library(dplyr)
df %>%
group_by(Company) %>%
arrange(Company, Year) %>%
mutate("next years return" =
if_else(lead(Year) - Year == 1, lead(`return`), NA)) %>%
ungroup()
# A tibble: 8 × 4
Company Year return `next years return`
<dbl> <dbl> <int> <int>
1 1 1 5 NA
2 1 3 2 4
3 1 4 4 6
4 1 5 6 NA
5 2 1 5 6
6 2 2 6 2
7 2 3 2 4
8 2 4 4 NA
Data
df <- structure(list(Company = c(1, 1, 1, 1, 2, 2, 2, 2), Year = c(1,
5, 3, 4, 4, 3, 2, 1), return = c(5L, 6L, 2L, 4L, 4L, 2L, 6L,
5L)), row.names = c("1", "2", "3", "4", "41", "31", "21", "11"
), class = "data.frame")
I have a dataset for determining interrater reliability, and I am trying to restructure it so that each rater gets its own column. Here is my data.
Subject Rater Item_1 Item_2
AB 1 6 4
AB 2 5 5
CD 1 4 5
CD 2 6 5
EF 1 4 4
EF 2 7 5
I want to restructure it so that it looks like this:
Subject Item Rater_1 Rater_2
AB 1 6 5
AB 2 4 5
CD 1 4 6
CD 2 5 5
EF 1 4 7
EF 2 4 5
I've tried pivot_longer but am unable to separate "rater" into two columns. Any ideas?
Get the data in long format and use a different key to get it in wide format again.
library(dplyr)
library(tidyr)
# Thanks to @Dan Adams for the `NA` trick.
df %>%
pivot_longer(cols = starts_with('Item'),
names_to = c(NA, 'Item'),
names_sep = "_") %>%
pivot_wider(names_from = Rater, values_from = value, names_prefix = "Rater_")
# Subject Item Rater_1 Rater_2
# <chr> <chr> <int> <int>
#1 AB 1 6 5
#2 AB 2 4 5
#3 CD 1 4 6
#4 CD 2 5 5
#5 EF 1 4 7
#6 EF 2 4 5
data
df <- structure(list(Subject = c("AB", "AB", "CD", "CD", "EF", "EF"
), Rater = c(1L, 2L, 1L, 2L, 1L, 2L), Item_1 = c(6L, 5L, 4L,
6L, 4L, 7L), Item_2 = c(4L, 5L, 5L, 5L, 4L, 5L)),
class = "data.frame", row.names = c(NA, -6L))
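To see what "a different key" means, it can help to run the two steps separately and look at the intermediate long table. A minimal sketch, assuming the data above rebuilt with data.frame(); the intermediate name long and the use of names_prefix (instead of the NA trick) are choices made here for illustration:

```r
library(tidyr)

df <- data.frame(
  Subject = c("AB", "AB", "CD", "CD", "EF", "EF"),
  Rater   = c(1L, 2L, 1L, 2L, 1L, 2L),
  Item_1  = c(6L, 5L, 4L, 6L, 4L, 7L),
  Item_2  = c(4L, 5L, 5L, 5L, 4L, 5L))

# Step 1: one row per Subject/Rater/Item; "Item_" is stripped from the names
long <- pivot_longer(df, cols = starts_with("Item"),
                     names_to = "Item", names_prefix = "Item_",
                     values_to = "value")

# Step 2: spread again, this time using Rater (not Item) as the column key
wide <- pivot_wider(long, names_from = Rater, values_from = value,
                    names_prefix = "Rater_")
wide
```

Here long has 12 rows (3 subjects x 2 raters x 2 items), and wide has one row per Subject/Item pair.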
Here is a base R solution. You are really just transposing the data by group in this particular case.
Map(\(s) {
x <- subset(df, df$Subject == s)
x[,c("Item_1", "Item_2")] <- t(x[,c("Item_1", "Item_2")])
colnames(x) <- c("Subject", "Item", "Rater_1", "Rater_2")
x
}, unique(df$Subject)) |>
do.call(what = rbind)
#> # A tibble: 6 x 4
#> Subject Item Rater_1 Rater_2
#> * <chr> <dbl> <dbl> <dbl>
#> 1 AB 1 6 5
#> 2 AB 2 4 5
#> 3 CD 1 4 6
#> 4 CD 2 5 5
#> 5 EF 1 4 7
#> 6 EF 2 4 5
I would like to cut rows from my data frame by group (column "Group"), keeping only as many rows per group as the number assigned in the column "Count".
Data looks like this
Group Count Result Result 2
<chr> <dbl> <dbl> <dbl>
1 Ane 3 5 NA
2 Ane 3 6 5
3 Ane 3 4 5
4 Ane 3 8 5
5 Ane 3 7 8
6 John 2 9 NA
7 John 2 2 NA
8 John 2 4 2
9 John 2 3 2
Expected results
Group Count Result Result 2
<chr> <dbl> <dbl> <dbl>
1 Ane 3 5 NA
2 Ane 3 6 5
3 Ane 3 4 5
6 John 2 9 NA
7 John 2 2 NA
Thanks!
We may use slice on the first value of 'Count' after grouping by 'Group'
library(dplyr)
df1 %>%
group_by(Group) %>%
slice(seq_len(first(Count))) %>%
ungroup
-output
# A tibble: 5 × 4
Group Count Result Result2
<chr> <int> <int> <int>
1 Ane 3 5 NA
2 Ane 3 6 5
3 Ane 3 4 5
4 John 2 9 NA
5 John 2 2 NA
Or use filter with row_number() to create a logical vector
df1 %>%
group_by(Group) %>%
filter(row_number() <= Count) %>%
ungroup
data
df1 <- structure(list(Group = c("Ane", "Ane", "Ane", "Ane", "Ane", "John",
"John", "John", "John"), Count = c(3L, 3L, 3L, 3L, 3L, 2L, 2L,
2L, 2L), Result = c(5L, 6L, 4L, 8L, 7L, 9L, 2L, 4L, 3L), Result2 = c(NA,
5L, 5L, 5L, 8L, NA, NA, 2L, 2L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))
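If dplyr is not available, the same idea as the filter(row_number() <= Count) version can be sketched in base R with ave(). This is only an illustration, assuming the data above; the names pos and res are made up here:

```r
df1 <- data.frame(
  Group   = c("Ane", "Ane", "Ane", "Ane", "Ane", "John", "John", "John", "John"),
  Count   = c(3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L),
  Result  = c(5L, 6L, 4L, 8L, 7L, 9L, 2L, 4L, 3L),
  Result2 = c(NA, 5L, 5L, 5L, 8L, NA, NA, 2L, 2L))

# pos: position of each row within its group (1, 2, 3, ... per Group)
pos <- ave(seq_along(df1$Group), df1$Group, FUN = seq_along)

# keep a row only while its within-group position does not exceed Count
res <- df1[pos <= df1$Count, ]
res
```

The original row names (1, 2, 3, 6, 7) are kept, which matches the expected result in the question.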
I have a data frame where I have aggregated the total activity per member for 15 different months (as ordered factors). Months/levels in which a member had no activity are simply skipped, as there are no rows for them in the original data.
The data looks like this:
MemberID MonthYr freq
1 04-2014 2
1 05-2014 3
1 07-2014 2
1 08-2014 5
2 04-2014 3
2 05-2014 3
3 06-2014 6
3 07-2014 4
3 11-2014 2
3 12-2014 3
I want to insert new rows in between the active months, so that the missing months show a frequency of 0.
Like this:
MemberID MonthYr freq
1 04-2014 2
1 05-2014 3
1 06-2014 0
1 07-2014 2
1 08-2014 5
2 04-2014 3
2 05-2014 3
3 06-2014 6
3 07-2014 4
3 08-2014 0
3 09-2014 0
3 10-2014 0
3 11-2014 2
3 12-2014 3
However, not every member joined at the same time, so the 0's should only appear between the min and max MonthYr of each member.
We can use complete to do this. Convert 'MonthYr' to Date class, group by 'MemberID', and use complete to expand 'MonthYr' from the min to the max date in steps of one month, filling 'freq' with 0. If needed, convert 'MonthYr' back to its original format afterwards.
library(dplyr)
library(tidyr)
library(zoo)
df1 %>%
mutate(MonthYr = as.Date(as.yearmon(MonthYr, "%m-%Y"))) %>%
group_by(MemberID) %>%
complete(MonthYr = seq(min(MonthYr), max(MonthYr), by = '1 month'),
fill = list(freq = 0)) %>%
mutate(MonthYr = format(MonthYr, "%m-%Y"))
# A tibble: 14 x 3
# Groups: MemberID [3]
# MemberID MonthYr freq
# <int> <chr> <dbl>
# 1 1 04-2014 2
# 2 1 05-2014 3
# 3 1 06-2014 0
# 4 1 07-2014 2
# 5 1 08-2014 5
# 6 2 04-2014 3
# 7 2 05-2014 3
# 8 3 06-2014 6
# 9 3 07-2014 4
#10 3 08-2014 0
#11 3 09-2014 0
#12 3 10-2014 0
#13 3 11-2014 2
#14 3 12-2014 3
data
df1 <- structure(list(MemberID = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L,
3L), MonthYr = c("04-2014", "05-2014", "07-2014", "08-2014",
"04-2014", "05-2014", "06-2014", "07-2014", "11-2014", "12-2014"
), freq = c(2L, 3L, 2L, 5L, 3L, 3L, 6L, 4L, 2L, 3L)),
class = "data.frame", row.names = c(NA,
-10L))
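For comparison, the same steps (parse to Date, build a month sequence per member, fill the gaps with 0) can be sketched without tidyr or zoo. This is a rough base R illustration, assuming the data above; the names date, grid and out are made up here, and the day "01" is prepended only so that as.Date() can parse the month:

```r
df1 <- data.frame(
  MemberID = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  MonthYr  = c("04-2014", "05-2014", "07-2014", "08-2014", "04-2014",
               "05-2014", "06-2014", "07-2014", "11-2014", "12-2014"),
  freq     = c(2L, 3L, 2L, 5L, 3L, 3L, 6L, 4L, 2L, 3L))

# Parse "mm-yyyy" as the first day of that month
df1$date <- as.Date(paste0("01-", df1$MonthYr), format = "%d-%m-%Y")

# One row per member and month, from that member's first to last active month
grid <- do.call(rbind, lapply(split(df1, df1$MemberID), function(d) {
  data.frame(MemberID = d$MemberID[1],
             date = seq(min(d$date), max(d$date), by = "month"))
}))

# Left join the observed counts onto the full grid, then fill the gaps with 0
out <- merge(grid, df1[c("MemberID", "date", "freq")],
             by = c("MemberID", "date"), all.x = TRUE)
out <- out[order(out$MemberID, out$date), ]
out$freq[is.na(out$freq)] <- 0L
out$MonthYr <- format(out$date, "%m-%Y")
out[c("MemberID", "MonthYr", "freq")]
```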
I want to find the total utility of each person in each household. SAMPN is the household index and PERNO is the person index.
There are 2 utilities for each person, utility1 and utility2. For each person, I want to add utility1 of that person to utility2 of the other persons in the same household.
SAMPN PERNO utility1 utility2
1 1 3 4
1 2 4 5
1 3 6 8
2 1 1 2
2 2 2 3
output
SAMPN PERNO utility1 utility2 HH-utility
1 1 3 4 3+5+8=16
1 2 4 5 4+4+8=16
1 3 6 8 6+4+5=15
2 1 1 2 1+3=4
2 2 2 3 2+2=4
One option, after grouping by 'SAMPN', is to take the group sum of 'utility2', subtract each row's own 'utility2' from it to get the sum over the other household members, and add 'utility1' to the result
library(dplyr)
df1 %>%
group_by(SAMPN) %>%
mutate(HHutility = sum(utility2) - utility2 + utility1)
# A tibble: 5 x 5
# Groups: SAMPN [2]
# SAMPN PERNO utility1 utility2 HHutility
# <int> <int> <int> <int> <int>
#1 1 1 3 4 16
#2 1 2 4 5 16
#3 1 3 6 8 15
#4 2 1 1 2 4
#5 2 2 2 3 4
Or with base R
transform(df1, HHutility = utility1 + ave(utility2, SAMPN, FUN = sum) - utility2)
data
df1 <- structure(list(SAMPN = c(1L, 1L, 1L, 2L, 2L), PERNO = c(1L, 2L,
3L, 1L, 2L), utility1 = c(3L, 4L, 6L, 1L, 2L), utility2 = c(4L,
5L, 8L, 2L, 3L)), class = "data.frame", row.names = c(NA, -5L
))
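The shortcut relies on the identity "sum over the others = sum over everyone minus own value". A small sanity check of that identity against a literal row-by-row computation might look like this (a sketch assuming the data above; brute and fast are names made up for the comparison):

```r
df1 <- data.frame(
  SAMPN    = c(1L, 1L, 1L, 2L, 2L),
  PERNO    = c(1L, 2L, 3L, 1L, 2L),
  utility1 = c(3L, 4L, 6L, 1L, 2L),
  utility2 = c(4L, 5L, 8L, 2L, 3L))

# Literal definition: own utility1 plus utility2 of everyone else
# in the same household
brute <- sapply(seq_len(nrow(df1)), function(i) {
  others <- df1$SAMPN == df1$SAMPN[i] & df1$PERNO != df1$PERNO[i]
  df1$utility1[i] + sum(df1$utility2[others])
})

# Shortcut: household total of utility2 minus own utility2, plus utility1
fast <- df1$utility1 + ave(df1$utility2, df1$SAMPN, FUN = sum) - df1$utility2

all(brute == fast)
```

The row-by-row version is quadratic within each household, so the grouped sum is the one to use on real data.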
I have a data set like
id age edu blood
1 30-39 Primary 5.5
1 20-29 Secondary 8.7
1 30-39 Primary 10
2 30-39 Primary 11
2 20-29 Secondary 10
2 20-29 Secondary 9
I want id wise output like this:
id age30_39count age20_29count edu_pri_count edu_sec_count blood_median
1 2 1 2 1 8.7
2 1 2 1 2 10
I have tried R code:
library(dplyr)
library(tidyr)
ddply(dat, "id", spread, age, age, edu, edu, blood, blood_median=median(blood))
But it does not show the desired result. Could anybody help?
You mean like this?
> library(dplyr)
> library(tidyr)
> group_by(df,id,age) %>% gather(variable,value,age,edu) %>%
unite(tag,variable,value) %>%
mutate(medblood=median(blood)) %>%
spread(tag,id) %>% select(-blood) %>%
select(-medblood,medblood)
# A tibble: 6 x 5
`age_20-29` `age_30-39` edu_Primary edu_Secondary medblood
<int> <int> <int> <int> <dbl>
1 NA 1 1 NA 8.70
2 1 NA NA 1 8.70
3 2 NA NA 2 10.0
4 NA 1 1 NA 8.70
5 2 NA NA 2 10.0
6 NA 2 2 NA 10.0
That last select(-medblood,medblood) moves the median blood column to the far right. You might possibly be wanting to do this though:
> group_by(df,id,age) %>% gather(variable,value,age,edu) %>%
unite(tag,variable,value) %>%
mutate(medblood=median(blood)) %>%
count(medblood,id,tag) %>% spread(tag,n)
# A tibble: 2 x 6
# Groups: id [2]
id medblood `age_20-29` `age_30-39` edu_Primary edu_Secondary
<int> <dbl> <int> <int> <int> <int>
1 1 8.70 1 2 2 1
2 2 10.0 2 1 1 2
Here is the dput of the data df used for this example:
> dput(df)
structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), age = structure(c(2L,
1L, 2L, 2L, 1L, 1L), .Label = c("20-29", "30-39"), class = "factor"),
edu = structure(c(1L, 2L, 1L, 1L, 2L, 2L), .Label = c("Primary",
"Secondary"), class = "factor"), blood = c(5.5, 8.7, 10,
11, 10, 9)), .Names = c("id", "age", "edu", "blood"), class = "data.frame", row.names = c(NA,
-6L))
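With a current dplyr, the table the question asks for can probably be written more directly as a grouped summarise(), without reshaping at all. A sketch under assumptions: the output column names are copied from the question's desired result, the data is rebuilt with plain character columns instead of factors, and the category labels are hard-coded, so this does not generalise to unknown levels the way the gather/spread version does.

```r
library(dplyr)

df <- data.frame(
  id    = c(1L, 1L, 1L, 2L, 2L, 2L),
  age   = c("30-39", "20-29", "30-39", "30-39", "20-29", "20-29"),
  edu   = c("Primary", "Secondary", "Primary", "Primary", "Secondary", "Secondary"),
  blood = c(5.5, 8.7, 10, 11, 10, 9))

# sum() over a logical vector counts the TRUE values, i.e. the matching rows
res <- df %>%
  group_by(id) %>%
  summarise(
    age30_39count = sum(age == "30-39"),
    age20_29count = sum(age == "20-29"),
    edu_pri_count = sum(edu == "Primary"),
    edu_sec_count = sum(edu == "Secondary"),
    blood_median  = median(blood))
res
```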