Fill a column with values from other columns based on a condition - r

I would like to fill a column (KE) with values from other columns (K2007/K2008/K2009) based on a condition (Year). For example, if "Year" is 2007, KE would be 1.
Year <- c(2007,2008,2009)
K2007 <- c(1,2,3)
K2008 <- c(4,5,6)
K2009 <- c(7,8,9)
KE <- c(1,5,9)
Thanks in advance,
---2022/06/24 update----
I'm sorry, the data have an additional constraint.
I want to assign values to KE conditional on two column values in addition to "year."
Year <- c(2007,2008,2009)
X <- c(10, 20, 30)
Y <- c(40, 50, 60)
K2007 <- c(1,2,3)
K2008 <- c(4,5,6)
K2009 <- c(7,8,9)
KE <- c(1,5,9)
In this case, when Year = 2007, X = 10 and Y = 40, KE will be 1.
I got the desired result with the code below.
test <- ddd %>%
mutate(KE = case_when(
Year == 2007 ~ ke_2007,
Year == 2008 ~ ke_2008,
Year == 2009 ~ ke_2009,
TRUE ~ NA_real_))
Thanks,

If you don't need to keep the original columns you could pivot the data to a tidier long format, remove the "K" from the original column names, and only keep rows in which Year is the same as the old column:
library(tidyr)
library(dplyr)
dat |>
pivot_longer(K2007:K2009, values_to = "KE") |>
mutate(name = sub('.', '', name) |> as.double()) |>
filter(Year == name) |>
select(-name)
#> # A tibble: 3 x 2
#> Year KE
#> <dbl> <dbl>
#> 1 2007 1
#> 2 2008 5
#> 3 2009 9
Created on 2022-06-23 by the reprex package (v2.0.1)

Try this
df <- data.frame(K2007 , K2008 , K2009)
KE <- sapply(seq_along(Year) ,
\(x) df[ x,grep(Year[x] , names(df))])
KE
#[1] 1 5 9
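The row-by-row lookup above can also be written without an explicit loop. This is a minimal base R sketch, assuming every value of Year has a matching column named "K" followed by the year; the name col_idx is made up for illustration.

```r
Year  <- c(2007, 2008, 2009)
K2007 <- c(1, 2, 3)
K2008 <- c(4, 5, 6)
K2009 <- c(7, 8, 9)
df <- data.frame(Year, K2007, K2008, K2009)

# Position of the column "K<Year>" for every row
col_idx <- match(paste0("K", df$Year), names(df))

# Index the data as a matrix with (row, column) pairs: row i, column col_idx[i]
df$KE <- as.matrix(df)[cbind(seq_len(nrow(df)), col_idx)]
df$KE
```

The two-column index matrix picks exactly one cell per row, so the original K columns stay in place and KE is simply appended.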

library(tidyverse)
tbl <- tibble(
year = c(2007,2008,2009),
k2007 = c(1,2,3),
k2008 = c(4,5,6),
k2009 = c(7,8,9))
tbl %>%
pivot_longer(-year, values_to = 'ke') %>%
filter(name == str_c('k', year)) %>%
select(-name) %>%
left_join(tbl, ., 'year')
# A tibble: 3 x 5
year k2007 k2008 k2009 ke
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2007 1 4 7 1
2 2008 2 5 8 5
3 2009 3 6 9 9

Related

How to eliminate a maximum value in a for loop to fulfil a certain criterion?

I have this data frame:
age <- c(1,2,3,4,5,6,7,8,9,10)
gender <- c("M","F")
df <- data.frame(age, gender)
Creating this:
average <- df %>% group_by(gender) %>% summarise(average_age = mean(age))
I obtain:
gender average_age
F 6
M 5
How can I iterate over the ages of females, removing the maximum value each time, until average_age < 4?
In reality I have a much bigger dataset and cannot do this manually.
Instead of iterating you can group by the gender and then sort by gender and age ascending. You can use the cummean() function to find the point where the average age for females >= 4 and then filter out all of the female records from that point on.
age <- c(1,2,3,4,5,6,7,8,9,10)
gender <- c("M","F")
df <- data.frame(age, gender)
library(dplyr)
df %>%
dplyr::group_by(gender) %>%
dplyr::arrange(gender,age) %>%
dplyr::mutate(CumAvgAge = cummean(age)) %>%
dplyr::filter(!(gender == "F" &
CumAvgAge >= 4)) %>%
dplyr::select(-CumAvgAge) %>%
dplyr::summarise(average_age = mean(age))
# A tibble: 2 x 2
gender average_age
<fct> <dbl>
1 F 3
2 M 5
If you ultimately want the list of individual ages you can remove the last call to summarise():
df %>%
dplyr::group_by(gender) %>%
dplyr::arrange(gender,age) %>%
dplyr::mutate(CumAvgAge = cummean(age)) %>%
dplyr::filter(!(gender == "F" &
CumAvgAge >= 4)) %>%
dplyr::select(-CumAvgAge)
# A tibble: 7 x 2
# Groups: gender [2]
age gender
<dbl> <fct>
1 2 F
2 4 F
3 1 M
4 3 M
5 5 M
6 7 M
7 9 M
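To see why the cumulative mean finds the cut-off, here is a small base R sketch on the female ages from the example alone. It assumes the ages are sorted in ascending order, so the running mean never decreases; the names cum_avg and kept are made up for illustration.

```r
# Female ages from the example (gender is recycled, so F gets the even ages)
age <- sort(c(2, 4, 6, 8, 10))

# Running mean after each additional age: 2 3 4 5 6
cum_avg <- cumsum(age) / seq_along(age)

# Keep ages only while the running mean is still below 4
kept <- age[cum_avg < 4]
kept
mean(kept)
```

Dropping every age from the first point where the running mean reaches 4 is equivalent to repeatedly removing the maximum until the mean falls below 4.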
A simpler answer with dplyr is to add a filter() function:
library(dplyr)
average <- df %>% group_by(gender) %>%
filter((gender == "F" & age <= 4) | gender == "M") %>%
summarise(average_age = mean(age))
average
...and the output:
average
# A tibble: 2 x 2
gender average_age
<fct> <dbl>
1 F 3
2 M 5

Reshape dataframe so that matching family members have their own column

I have a dataframe...
df <- tibble(
id = 1:5,
family = c("a","a","b","b","c"),
twin = c(1,2,1,2,1),
datacol1 = 11:15,
datacol2 = 21:25
)
For every twin pair (members of the same family) I need to have a second 'datacol' with the other twins' data. This should only happen for matching twins, so the 5th row (from family "c") should have duplicate columns that are empty.
Ideally, by the end the data would look like the following...
df <- tibble(
id = 1:5,
family = c("a","a","b","b","c"),
twin = c(1,2,1,2,1),
datacol1 = 11:15,
datacol1.b = c(12,11,14,13,NA),
datacol2 = 21:25,
datacol2.b = c(22,21,24,23,NA)
)
I have added an image to help illustrate what I am trying to get to.
I would like to be able to do this for all columns or for selected columns and preferably using tidyverse.
We can also use mutate_at
library(dplyr)
df %>%
group_by(family) %>%
mutate_at(vars(starts_with('datacol')), list(`2` =
~if(n() == 1) NA_integer_ else rev(.)))
# A tibble: 5 x 7
# Groups: family [3]
# id family twin datacol1 datacol2 datacol1_2 datacol2_2
# <int> <chr> <dbl> <int> <int> <int> <int>
#1 1 a 1 11 21 12 22
#2 2 a 2 12 22 11 21
#3 3 b 1 13 23 14 24
#4 4 b 2 14 24 13 23
#5 5 c 1 15 25 NA NA
cols = c("datacol1", "datacol2")
df %>%
group_by(family) %>%
mutate_at(vars(cols), function(x){
if (n() == 2){
rev(x)
} else {
NA
}
}) %>%
ungroup() %>%
select(cols) %>%
rename_all(~ paste0(., ".b")) %>%  # formula lambda instead of the deprecated funs()
cbind(df, .)
Base R
cols = c("datacol1", "datacol2")
do.call(rbind, lapply(split(df, df$family), function(x){
cbind(x, setNames(lapply(x[cols], function(y) {
if (length(y) == 2) {
rev(y)
} else {
NA
}}),
paste0(cols, ".b")))
}))
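A shorter base R variant is possible with ave(), which applies a function within groups and returns a vector in the original row order. This is only a sketch, assuming each family has either one or two members; swap_pair is a hypothetical helper name.

```r
df <- data.frame(
  id = 1:5,
  family = c("a", "a", "b", "b", "c"),
  twin = c(1, 2, 1, 2, 1),
  datacol1 = 11:15,
  datacol2 = 21:25
)
cols <- c("datacol1", "datacol2")

# Reverse the values inside a complete pair, otherwise return NA
swap_pair <- function(x) if (length(x) == 2) rev(x) else rep(NA, length(x))

# Add one ".b" column per data column, computed within each family
for (col in cols) {
  df[[paste0(col, ".b")]] <- ave(df[[col]], df$family, FUN = swap_pair)
}
df
```

The new columns are appended at the end rather than interleaved with the originals, so reorder them afterwards if the exact layout matters.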

Grouping data starting with specific number in R

I am sorry if the title is incomprehensible. I have data as shown below; the prefixes 1, 2, 3, ... in the row names are months of various years. I want to gather the months separately for a and l.
a l
1-2006 3.498939 0.8523857
1-2007 14.801777 0.2457656
1-2008 6.893728 0.5381691
2-2006 2.090962 0.6764694
2-2007 9.192913 0.8740950
2-2016 5.059505 1.1761113
The structure of the data is:
data<-structure(list(a = c(3.49893890760882, 14.8017770056402, 6.89372828391484,
2.0909624091048, 9.19291324208917, 5.05950526612261, 13.1570625271881,
14.9570662205959, 7.72453112976811, 12.9331892673657
), l = c(0.852385662732809,
0.245765570168399, 0.538169092055646, 0.676469362818052, 0.874095005203713,
1.17611132212132, 0.76857056091243, 0.622533767341579, 0.9562200838363,
1.10064589903771, 0.85863722854391
)), class = "data.frame", row.names = c("1-2006",
"1-2007", "1-2008",
"2-2006", "2-2007",
"2-2016",
"3-2015", "3-2016", "3-2017", "3-2018"
))
For example, I want to gather all January data (1-2005, 1-2006, ...) and all March data (3-2012, 3-2015, ...) for a and also for l, like this:
january_a
1-2006 3.498939
1-2007 14.801777
1-2008 6.893728
january_l
1-2006 0.8523857
1-2007 0.2457656
1-2008 0.5381691
march_a
3-2012 9.192913
3-2015 5.059505
march_l
3-2012 0.8740950
3-2015 1.1761113
You could add a column which contains only the numerical prefix, and then split on that:
data$prefix <- sub("^(\\d+).*$", "\\1", row.names(data))
data_a <- split(data[,"a"], data$prefix)
data_a
$`1`
[1] 3.498939 14.801777 6.893728
$`2`
[1] 2.090962 9.192913 5.059505
Data:
data <- data.frame(a=c(3.498939, 14.801777, 6.893728, 2.090962, 9.192913, 5.059505),
l=c(0.8523857, 0.2457656, 0.5381691, 0.6764694, 0.8740950, 1.1761113))
row.names(data) <- c("1-2006", "1-2007", "1-2008", "2-2006", "2-2007", "2-2016")
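The same prefix can be used to split the whole data frame rather than a single column, which keeps the row names attached to each piece. A minimal sketch on the sample data above; month and by_month are names chosen for illustration.

```r
data <- data.frame(a = c(3.498939, 14.801777, 6.893728, 2.090962, 9.192913, 5.059505),
                   l = c(0.8523857, 0.2457656, 0.5381691, 0.6764694, 0.8740950, 1.1761113))
row.names(data) <- c("1-2006", "1-2007", "1-2008", "2-2006", "2-2007", "2-2016")

# Month number = everything before the first "-" in the row name
month <- sub("-.*$", "", row.names(data))

# One data frame per month, row names preserved
by_month <- split(data, month)

by_month[["1"]]["a"]   # January values of a
by_month[["2"]]["l"]   # February values of l
```

Indexing with a single bracket and a column name returns a one-column data frame, so the "month-year" labels remain visible next to the values.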
This is another variation that you can try using tidyverse which returns a list of dataframes, where every element has a combination of month and "a" or "l".
library(tidyverse)
data %>%
rownames_to_column('date') %>%
pivot_longer(cols = -date) %>%
separate(date, c('month', 'year'), sep = "-", remove = FALSE) %>%
group_split(month, name)
#[[1]]
# A tibble: 3 x 5
# date month year name value
# <chr> <chr> <chr> <chr> <dbl>
#1 1-2006 1 2006 a 3.50
#2 1-2007 1 2007 a 14.8
#3 1-2008 1 2008 a 6.89
#[[2]]
# A tibble: 3 x 5
# date month year name value
# <chr> <chr> <chr> <chr> <dbl>
#1 1-2006 1 2006 l 0.852
#2 1-2007 1 2007 l 0.246
#3 1-2008 1 2008 l 0.538
#...
#...
This has some additional columns to uniquely identify values which you can remove if not needed.
Another option is group_split
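library(purrr)
library(dplyr)
library(stringr)
library(tibble)  # provides rownames_to_column()
data %>%
rownames_to_column('rn') %>%
select(rn, a) %>%
# current dplyr spells this argument .keep (older releases used keep)
group_split(rn = str_remove(rn, '-.*'), .keep = FALSE) %>%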
map(flatten_dbl)
#[[1]]
#[1] 3.498939 14.801777 6.893728
#[[2]]
#[1] 2.090962 9.192913 5.059505
Data:
data <- data.frame(a=c(3.498939, 14.801777, 6.893728, 2.090962, 9.192913, 5.059505),
l=c(0.8523857, 0.2457656, 0.5381691, 0.6764694, 0.8740950, 1.1761113))
row.names(data) <- c("1-2006", "1-2007", "1-2008", "2-2006", "2-2007", "2-2016")

"Rolling" Regression in R

Say I want to run regressions per group whereby I want to use the last 5 year data as input for that regression. Then, for each next year, I would like to "shift" the input for that regression by one year (i.e., 4 observations).
From those regressions I want to extract both the R2 and the fitted values/residuals, which I then need in subsequent regressions that follow similar notions.
I have some code working using loops, but it is neither elegant nor efficient for large datasets. I assume there must be a nice plyr way to resolve this.
# libraries #
library(dplyr)
library(broom)
# reproducible data #
df <- tibble(ID = as.factor(rep(c(1, 2), each = 40)),
YEAR = rep(rep(c(2001:2010), each = 4), 2),
QTR = rep(c(1:4), 20),
DV = rnorm(80),
IV = DV * rnorm(80))
# output vector #
output = tibble(ID = NA,
YEAR = NA,
R2 = NA)
# loop #
k = 1
for (i in levels(df$ID)){
n_row = df %>%
arrange(ID) %>%
filter(ID == i) %>%
nrow()
for (j in seq(1, (n_row - 19), by = 4)){
output[k, 1] = i
output[k, 2] = df %>%
filter(ID == i) %>%
slice((j + 19)) %>%
select(YEAR) %>%
unlist()
output[k, 3] = df %>%
filter(ID == i) %>%
slice(j:(j + 19)) %>%
do(model = lm(DV ~ IV, data = .)) %>%
glance(model) %>%
ungroup() %>%
select(r.squared) %>%
ungroup()
k = k + 1
}
}
Define a function which returns the year and R squared given a subset of rows of df (without ID) and then use rollapply with it.
library(dplyr)
library(zoo)
R2 <- function(x) {
x <- as.data.frame(x)
c(YEAR = tail(x$YEAR, 1), R2 = summary(lm(DV ~ IV, x))$r.squared)
}
df %>%
group_by(ID) %>%
do(data.frame(rollapply(.[-1], 20, by = 4, R2, by.column = FALSE))) %>%
ungroup
giving:
# A tibble: 12 x 3
ID YEAR R2
<fct> <dbl> <dbl>
1 1 2005 0.0133
2 1 2006 0.130
3 1 2007 0.0476
4 1 2008 0.0116
5 1 2009 0.00337
6 1 2010 0.00570
7 2 2005 0.0481
8 2 2006 0.00527
9 2 2007 0.0158
10 2 2008 0.0303
11 2 2009 0.235
12 2 2010 0.116
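For readers who want to see the window arithmetic spelled out, the following base R sketch does the same thing without zoo. It is an illustration under stated assumptions, not the answer's method: the seed is arbitrary, roll_r2 is a hypothetical helper, and each ID is assumed to be sorted by year and quarter with four rows per year.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(ID = factor(rep(c(1, 2), each = 40)),
                 YEAR = rep(rep(2001:2010, each = 4), 2),
                 QTR = rep(1:4, 20),
                 DV = rnorm(80))
df$IV <- df$DV * rnorm(80)

# R squared for every 20-row window (5 years), moving forward 4 rows (1 year) at a time
roll_r2 <- function(d, width = 20, step = 4) {
  starts <- seq(1, nrow(d) - width + 1, by = step)
  do.call(rbind, lapply(starts, function(s) {
    w <- d[s:(s + width - 1), ]
    fit <- lm(DV ~ IV, data = w)
    data.frame(ID = w$ID[1],
               YEAR = w$YEAR[width],          # last year in the window
               R2 = summary(fit)$r.squared)
  }))
}

# Apply the helper to each ID separately and stack the results
out <- do.call(rbind, lapply(split(df, df$ID), roll_r2))
out
```

Because fit is available inside the window function, fitted(fit) or resid(fit) could be collected in the same place if the fitted values or residuals are needed later.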

Finding the first non-zero year in data frame for multiple variables using tidyverse

I have the following data:
library(tidyverse)
set.seed(1)
test <- data.frame(id = c(rep(1, 3), rep(2, 4), rep(3, 5)),
Year = 2000 + c(1,3,5,2,3,5,6,1,2,3,4,5),
var1 = sample(0:2, replace = TRUE, size = 12, prob = c(0.6, 0.3, 0.1)),
var2 = sample(0:2, replace = TRUE, size = 12, prob = c(0.6, 0.3, 0.1)))
I need to find the first year in which each variable (var1 and var2) is non-zero within each id group.
I know how to find the row number of the first non-zero row:
temp <- function(a) ifelse(length(head(which(a>0),1))==0,0,head(which(a>0),1))
test2 <- test %>% group_by(id) %>%
mutate_at(vars(var1:var2),funs(temp)) %>%
filter(row_number()==1) %>% select (-year)
id var1 var2
1 1 0 1
2 2 1 2
3 3 1 1
However, I am not sure how to match the row number back to the Year variable so that I know exactly when var1 and var2 turned non-zero, instead of only having the row numbers.
This is what I want:
id var1 var2
1 1 0 2001
2 2 2002 2003
3 3 2001 2001
We may do the following:
test %>% group_by(id) %>% summarise_at(vars(var1:var2), funs(Year[. > 0][1]))
# A tibble: 3 x 3
# id var1 var2
# <dbl> <dbl> <dbl>
# 1 1 NA 2001
# 2 2 2002 2003
# 3 3 2001 2001
That is, . > 0 gives a logical vector with TRUE whenever a value is positive, then we select all the corresponding years, and lastly pick only the first one.
That's very similar to your approach. Notice that due to using summarise I no longer need filter(row_number()==1) %>% select (-year). Also, my function corresponding to temp is more concise.
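The indexing idiom can be checked on plain vectors outside of dplyr. A small sketch with made-up values; first_nonzero_year is a hypothetical helper name.

```r
# First year in which x is positive, or NA if x is never positive
first_nonzero_year <- function(x, year) year[x > 0][1]

Year <- c(2002, 2003, 2005, 2006)
var1 <- c(0, 1, 0, 2)
all_zero <- c(0, 0, 0, 0)

Year[var1 > 0]                       # years with a positive value
first_nonzero_year(var1, Year)       # first of those years
first_nonzero_year(all_zero, Year)   # no positive value at all
```

Taking element 1 of an empty vector yields NA in R, which is why groups without any non-zero value show NA rather than raising an error.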
A slightly different approach gathering everything into a big long file first:
test %>%
gather(var, value, var1:var2) %>%
filter(value != 0) %>%
group_by(id, var) %>%
summarise(Year = min(Year)) %>%
spread(var, Year)
## A tibble: 3 x 3
## Groups: id [3]
# id var1 var2
#* <dbl> <dbl> <dbl>
#1 1.00 NA 2001
#2 2.00 2002 2003
#3 3.00 2001 2001
And a base R version for fun:
tmp <- cbind(test[c("id", "Year")], stack(test[c("var1","var2")]))
tmp <- tmp[tmp$values != 0,]
tmp <- aggregate(Year ~ id + ind, data=tmp, FUN=min)
reshape(tmp[c("id","ind","Year")], idvar="id", timevar="ind", direction="wide")

Resources