I´m trying to rename two types of variables in R using tidyverse/dplyr. The first type "var_a_year", I want to rename it as "sample_year". The second type of variable "var_b_7", I want to rename it as "index_year".
The second variable, "var_b" starts on the number 7 for the first year "2004". And increases by 2 for each year. So for year 2005, the second type variable is called "var_b_9" as shown.
I would like to use a loop so I can make this faster instead of writting a line for each year.
Many thanks in advance!
df <- df %>%
rename(
sample_2004 = var_a_2004, index_2004 = var_b_7,
sample_2005 = var_a_2005, index_2005 = var_b_9,
sample_2006 = var_a_2006, index_2006 = var_b_11,
sample_2007 = var_a_2007, index_2007 = var_b_13,
...
sample_2020 = var_a_2020, index_2020 = var_b_39)
There's no need to use a loop. rename_with will do the trick:
df <- tibble(var_a_2004=NA, var_b_7=NA, var_a_2005=NA, var_b_8=NA)
renameA <- function(x) {
return(paste0("sample_", stringr::str_sub(x, -4)))
}
df %>% rename_with(renameA, starts_with("var_a"))
Gives
# A tibble: 1 x 4
sample_2004 var_b_7 sample_2005 var_b_8
<lgl> <lgl> <lgl> <lgl>
1 NA NA NA NA
I'll leave you to work out how to code the corresponding function for your var_b_XXXX columns.
In addition to the answer of Limey:
#sample data
df <- structure(list(var_a_2004 = NA, var_b_7 = NA, var_a_2005 = NA,
var_b_9 = NA), row.names = c(NA, -1L), class = "data.frame")
#load data.table package
library(data.table)
#set df to data.table
dt <- as.data.table(df)
#convert var_a in columnnames to sample_
colnames(dt) <- gsub("var_a_", "sample_", colnames(dt))
#use a loop to replace var_b to index_
for(i in 2004:2005){
year <- i
nr <- 2* i -4001
setnames(dt, old = paste0("var_b_", nr), new = paste0("index_", year))
}
This function now works for the years 2004:2005 to match the sample data. You can change it to 2004:2020 for your dataset.
Related
I have a data frame with a name column and a ticker column, the ticker column trim with ";" if a name has more than 1 ticker.
A glimpse from this data.frame below:
df.info is the name of the dataframe
Rows: 525
Columns: 2
$ name.company <chr> "521 PARTICIPAÇOES S.A. - EM LIQUIDAÇÃO EXTRAJUDICIAL", "524 PARTICIPAÇOES SA", "AAJR SECURITIZADORA DE CRÉDITO...
$ tickers <chr> NA, "QVQP3B", NA, "ADHM3", "TIET11;TIET3;TIET4", "AFLT3", NA, "ALEF3B", "RPAD3;RPAD5;RPAD6", NA, "ALSO3", "ALPA...
And I want a dataframe that has 2 columns, ticker and name.company but without the trim pattern ";".
e.g.:
name ticker
tiete tiet11
tiete tiet3
tiete tiet4
and so it goes.. I solved it using the by() function but I have no clue how to solve it using the tidyverse/purrr packages.
Solution without tidyverse
get.ticker.df <- function(df.in)
{
# Gets ticker string and organizes it in another data_frame
temp.split <- str_split(df.in$tickers, ';')[[1]]
temp.df <- tibble(name.company = df.in$name.company,
ticker = temp.split)
}
my.l <- by(data = df.info,
INDICES = df.info$name.company,
FUN = get.ticker.df)
df.tickers <- bind_rows(my.l)
I don't know the equivalent of this by() function in tidyverse.
Edit - Added initial frame and the ideal result dataframe, to make it clear.
tibble_start <- tibble( name.company = c("AES TIETE", "AMBEV"),
ticker = c("TIET11;TIET3;TIET4", "ABEV3;ABEV4"))
tibble_ideal <- tibble( name.company = c( rep("AES TIETE", 3), rep("AMBEV",2)),
ticker = c("TIET11","TIET3","TIET4","ABEV3","ABEV4"))
Thanks in advance!
We can use separate_rows
library(dplyr)
library(tidyr)
df1 %>%
separate_rows(tickers)
I have data that looks like this:
ID FACTOR_VAR INT_VAR
1 CAT 1
1 DOG 0
I want to aggregate by ID such that the resulting dataframe contains the entire row that satisfies my aggregate condition. So if I aggregate by the max of INT_VAR, I want to return the whole first row:
ID FACTOR_VAR INT_VAR
1 CAT 1
The following will not work because FACTOR_VAR is a factor:
new_data <- aggregate(data[,c("ID", "FACTOR_VAR", "INT_VAR")], by=list(data$ID), fun=max)
How can I do this? I know dplyr has a group by function, but unfortunately I am working on a computer for which downloading packages takes a long time. So I'm looking for a way to do this with just vanilla R.
If you want to keep all the columns, use ave instead :
subset(df, as.logical(ave(INT_VAR, ID, FUN = function(x) x == max(x))))
You can use aggregate for this. If you want to retain all the columns, merge can be used with it.
merge(aggregate(INT_VAR ~ ID, data = df, max), df, all.x = T)
# ID INT_VAR FACTOR_VAR
#1 1 1 CAT
data
df <- structure(list(ID = c(1L, 1L), FACTOR_VAR = structure(1:2, .Label = c("CAT", "DOG"), class = "factor"), INT_VAR = 1:0), class = "data.frame", row.names = c(NA,-2L))
We can do this in dplyr
library(dplyr)
df %>%
group_by(ID)
filter(INT_VAR == max(INT_VAR))
Or using data.table
library(data.table)
setDT(df)[, .SD[INT_VAR == max(INT_VAR)], by = ID]
I have a rather wide dataset to read in with over 1000 missing values at the top, but all the variable names follow the same pattern. Is there a way to use starts_with() to force certain variables to be parsed correctly?
MWE:
library(tidyverse)
library(readr)
mwe.csv <- data.frame(id = c("a", "b"), #not where I actually get the data from
amount1 = c(NA, 20),
currency1 = c(NA, "USD")
)
mwe <- readr::read_csv("mwe.csv", guess_max = 1) #guess_max() for example purposes
I'd like to be able do
mwe<- read_csv("mwe.csv", guess.max = 1
col_types = cols(starts_with("amount") = "d",
starts_with("currency") = "c"))
)
> mwe
# A tibble: 2 x 3
id amount currency
<chr> <dbl> <chr>
1 a NA NA
2 b 20 USD
But I get the error "unexpected '=' in: read_csv". Any thoughts? I cannot hard code it because the number of columns will change regularly, but the pattern (amountN) will be constant. There will also be other columns that are not id or amount/currency. I would prefer not to increase the guess.max() option for speed purposes.
The answer is to cheat!
mwe <- read_csv("mwe.csv", n_max = 0) # only need the col_names
cnames <- attr(mwe, "spec") # grab the col_names
ctype <- rep("?", ncol(mwe)) # create the col_parser abbr -- all guesses
currency <- grepl("currency", names(cnames$col)) # which ones are currency?
# or use base::startsWith(names(cnames$col), "currency")
ctype[currency] <- "c" # do not guess on currency ones, use character
# repeat lines 4 & 5 as needed
mwe <- read_csv("mwe.csv", col_types = paste(ctype, collapse = ""))
I have a dataframe where I want to replace the variables
age_1 with values of variable age1_corr_1 if age1_corr_1 is not NA
age_2 with values of variable age1_corr_2 if age1_corr_2 is not NA, ...,
age_n with values of variable age1_corr_n if age1_corr_n is not NA.
Then I'd like to delete the variables age1_corr_1, age1_corr_2, ..., age1_corr_n. I have figured out how to do the first part (change the values) in a loop but couldn't figure out how to delete the variables after. Any suggestion?
Sample data
y <- data.frame("age_1" = c(5,1,1,10), "age1_corr_1" = c(1,NA,NA,0), "age_2" = c(1,2,3,4), "age1_corr_2" = c(NA, NA, 10, 9),
"age_3" = c(4,3,2,5), "age1_corr_3" = c(NA,NA,NA,6), "age_4" = c(1,4,2,7), "age1_corr_4" = c(NA, NA, NA,NA))
The code that will change values of age_n based on age1_corr_n
for(i in 1:4){
cname1 <- paste0("age_",i)
cname2 <- paste0("age1_corr_",i)
y[,cname1] <- ifelse(!is.na(y[,cname2]), y[,cname2], y[,cname1])
}
The output I'd like to have is
age_1 age_2 age_3 age_4
1 1 1 4 1
2 1 2 3 4
3 1 10 2 2
4 0 9 6 7
You have several options if there is a pattern to the columns you want to remove (or conversely, the ones you want to keep).
Here's the data you provided:
y <- data.frame("age_1" = c(5,1,1,10), "age1_corr_1" = c(1,NA,NA,0), "age_2" = c(1,2,3,4), "age1_corr_2" = c(NA, NA, 10, 9),
"age_3" = c(4,3,2,5), "age1_corr_3" = c(NA,NA,NA,6), "age_4" = c(1,4,2,7), "age1_corr_4" = c(NA, NA, NA,NA))
Here's a dplyr example of how to get only those columns that follow the pattern age_N, where N is 1, 2, 3, or 4:
library(dplyr)
x <- select(y, paste("age", 1:4, sep = "_"))
Alternatively, you could choose the pattern for the columns you DON'T want:
x <- select(y, -grep("_corr_", current_vars()))
This uses the following strategy:
* you can select for everything BUT a column or set of columns by adding a minus sign first.
* current_vars() is a helper function in dplyr that evaluates to all the variable names for the data (here, y)
Do the real work with dplyr::coalesce() (description: "Given a set of vectors, coalesce() finds the first non-missing value at each position."). Then drop the columns with dplyr::select(), using a negative sign in front of the columns you don't need anymore.
library(magrittr)
y %>%
dplyr::mutate(
age1_corr_4 = as.numeric(age1_corr_4), # Delete this line if it's already a numeric/floating data type.
age_1 = dplyr::coalesce(age1_corr_1, age_1),
age_2 = dplyr::coalesce(age1_corr_2, age_2),
age_3 = dplyr::coalesce(age1_corr_3, age_3),
age_4 = dplyr::coalesce(age1_corr_4, age_4)
) %>%
dplyr::select(
-age1_corr_1, -age1_corr_2, -age1_corr_3, -age1_corr_4
)
Produces
age_1 age_2 age_3 age_4
1 1 1 4 1
2 1 2 3 4
3 1 10 2 2
4 0 9 6 7
Edit: I apologize, I focused on the coalesce part of the task and ignored the n part of the task.
Here are two other approaches that can handle an arbitrary number of columns. For this specific example dataset, make sure that the 4th column is correctly represented as a float with y$age1_corr_4 <- as.numeric(y$age1_corr_4)).
Like Dan Hall's response, one approach keeps the columns you want...
library(magrittr)
coalesce_corr1 <- function( index ) {
name_age <- paste0("age_" , index)
name_corr <- paste0("age1_corr_", index)
y %>%
dplyr::mutate(
!!name_age := dplyr::coalesce(.data[[name_corr]], .data[[name_age]])
) %>%
dplyr::select(!!name_age)
}
1:4 %>%
purrr::map(coalesce_corr) %>%
dplyr::bind_cols()
...and the other drops the columns you don't want.
z <- y
coalesce_corr2 <- function( index ) {
name_age <- paste0( "age_" , index)
name_corr <- paste0( "age1_corr_", index)
z <<- z %>%
dplyr::mutate(
!!name_age := dplyr::coalesce(.data[[!!name_corr]], .data[[!!name_age]])
)
z[[name_corr]] <<- NULL
}
1:4 %>%
purrr::walk(coalesce_corr2)
z
I wish this last one didn't require a global variable (that uses <<-), and for this reason, I actually recommend Dan's approaches, but I wanted to try out quosures for output variables.
So I have three data sets that I need to merge. These contain school data and read/math scores for grades 4 and 5. One of them is a long form data set that has a lot of missingness in some variables (yes, I do need the data in long form) and the other two have the full missing data in wide form. All of these data frames contain a column that has an unique ID number for each individual in the database.
Here is a full reproducible example that generates a small example of the types of data.frames I am working with... The three data frames that I need to use are the following: school_lf, school4 and school5. school_lf has the long form data with NAs and school4 and school5 are the dfs I need to use to populate the NA's in this long form data (by id and grade)
set.seed(890)
school <- NULL
school$id <-sample(102938:999999, 100)
school$selected <-sample(0:1, 100, replace = T)
school$math4 <- sample(400:500, 100)
school$math5 <- sample(400:500, 100)
school$read4 <- sample(400:500, 100)
school$read5 <- sample(400:500, 100)
school <- as.data.frame(school)
# Delete observations at random from the school df
indm4 <- which(school$math4 %in% sample(school$math4, 25))
school$math4[indm4] <- NA
indm5 <- which(school$math5 %in% sample(school$math5, 50))
school$math5[indm5] <- NA
indr4 <- which(school$read4 %in% sample(school$read4, 70))
school$read4[indr4] <- NA
indr5 <- which(school$read5 %in% sample(school$read5, 81))
school$read5[indr5] <- NA
# Separate Read and Math
read <- as.data.frame(subset(school, select = -c(math4, math5)))
math <- as.data.frame(subset(school, select = -c(read4, read5)))
# Now turn this into long form data...
clr <- melt(read, id.vars = c("id", "selected"), variable.name = "variable", value.name = "readscore")
clm <- melt(math, id.vars = c("id", "selected"), value.name = "mathscore")
# Clean up the grades for each of these...
clr$grade <- ifelse(clr$variable == "read4", 4,
ifelse(clr$variable == "read5", 5, NA))
clm$grade <- ifelse(clm$variable == "math4", 4,
ifelse(clm$variable == "math5", 5, NA))
# Put all these in one df
school_lf <-cbind(clm, clr$readscore)
school_lf$readscore <- school_lf$`clr$readscore` # renames
school_lf$`clr$readscore` <- NULL # deletes
school_lf$variable <- NULL # deletes
###############
# Generate the 2 data frames with IDs that have the full data
set.seed(890)
school4 <- NULL
school4$id <-sample(102938:999999, 100)
school4$selected <-sample(0:1, 100, replace = T)
school4$math4 <- sample(400:500, 100)
school4$read4 <- sample(400:500, 100)
school4$grade <- 4
school4 <- as.data.frame(school4)
set.seed(890)
school5 <- NULL
school5$id <-sample(102938:999999, 100)
school5$selected <-sample(0:1, 100, replace = T)
school5$math5 <- sample(400:500, 100)
school5$read5 <- sample(400:500, 100)
school5$grade <- 5
school5 <- as.data.frame(school5)
I need to merge the wide-form data into the long-form data to replace the NAs with the actual values. I have tried the code below, but it introduces several columns instead of merging the read scores and the math scores where there's NA's. I simply need one column with the read scores and one with the math scores, instead of six separate columns (read.x, read.y, math.x, math.y, mathscore and readscore).
sch <- merge(school_lf, school4, by = c("id", "grade", "selected"), all = T)
sch <- merge(sch, school5, by = c("id", "grade", "selected"), all = T)
Any help is highly appreciated! I've been trying to solve this for hours now and haven't made any progress (so figured I'd ask here)
You can use the coalesce function from dplyr. If a value in the first vector is NA, it will see if the value at the same position in the second vector is not NA and select it. If again NA, it goes to the third.
library(dplyr)
sch %>% mutate(mathscore = coalesce(mathscore, math4, math5)) %>%
mutate(readscore = coalesce(readscore, read4, read5)) %>%
select(id:readscore)
EDIT: I just tried to do this approach on my actual data and it does not work because the replacement data also has some NAs and, as a result, the dfs I try to do coalesce with have differing number of rows... Back to square one.
I was able to figure this out with the following code (albeit it's not the most elegant or straight-forward ,and #Edwin's response helped point me in the right direction. Any suggestions on how to make this code more elegant and efficient are more than welcome!
# Idea: put both in long form and stack on top of one another... then merge like that!
sch4r <- as.data.frame(subset(school4, select = -c(mathscore)))
sch4m <- as.data.frame(subset(school4, select = -c(readscore)))
sch5r <- as.data.frame(subset(school5, select = -c(mathscore)))
sch5m <- as.data.frame(subset(school5, select = -c(readscore)))
# Put these in LF
sch4r_lf <- melt(sch4r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch4m_lf <- melt(sch4m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
sch5r_lf <- melt(sch5r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch5m_lf <- melt(sch5m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
# Combine in one DF
sch_full_4 <-cbind(sch4r_lf, sch4m_lf$mathscore)
sch_full_4$mathscore <- sch_full_4$`sch4m_lf$mathscore`
sch_full_4$`sch4m_lf$mathscore` <- NULL # deletes
sch_full_4$variable <- NULL
sch_full_5 <- cbind(sch5r_lf, sch5m$mathscore)
sch_full_5$mathscore <- sch_full_5$`sch5m$mathscore`
sch_full_5$`sch5m$mathscore` <- NULL
sch_full_5$variable <- NULL
# Stack together
sch_full <- rbind(sch_full_4,sch_full_5)
sch_full$selected <- NULL # delete this column...
# MERGE together
final_school_math <- mutate(school_lf, mathscore = coalesce(school_lf$mathscore, sch_full$mathscore))
final_school_read <- mutate(school_lf, readscore = coalesce(school_lf$readscore, sch_full$readscore))
final_df <- cbind(final_school_math, final_school_read$readscore)
final_df$readscore <- final_df$`final_school_read$readscore`
final_df$`final_school_read$readscore` <- NULL