R: Changing the format of columns in a list of dataframes using functional programming

The situation is the following: I have a list of dataframes, and for each dataframe I have a list of columns whose format I need to change. Setup:
df1 <- data.frame(a = c("2020-03-02", "2020-12-22", "2020-07-03"), b = c(4, 5, 6), c = c("2020-03-13", "2019-11-03", "2011-05-02"))
df2 <- data.frame(d = c(1, 2, 3), e = c("2020-05-21", "2014-08-31", "1999-01-21"), f = c(7, 8, 9))
datasets <- list("first" = df1, "second" = df2)
dates <- list("first" = c("a", "c"), "second" = c("e"))
One could do this by looping over the list of dataframes and, for each dataframe, looping over the columns one wants to change and reassigning them in place. Something like this:
for (i in names(datasets)) {
  # dates[[i]] is the character vector of column names for dataframe i
  for (j in dates[[i]]) {
    # assign back into the list so that the change persists
    datasets[[i]][[j]] <- as.Date(datasets[[i]][[j]])
  }
}
This is ugly, so I wanted to try to do the same using purrr. I thought this would be a good idea:
library(purrr)
walk2(datasets, dates, ~ walk(.x[.y], ~ {.x <- as.Date(.x)}))
But the datasets remain unperturbed after this operation. Why?

Here is a solution that uses purrr and dplyr:
library(purrr)
library(dplyr)
datasets <- datasets %>%
  imap(~ {
    .x %>%
      mutate_at(vars(dates[[.y]]), as.Date)
  })
str(datasets)
#List of 2
#$ first :'data.frame': 3 obs. of 3 variables:
# ..$ a: Date[1:3], format: "2020-03-02" "2020-12-22" "2020-07-03"
# ..$ b: num [1:3] 4 5 6
# ..$ c: Date[1:3], format: "2020-03-13" "2019-11-03" "2011-05-02"
#$ second:'data.frame': 3 obs. of 3 variables:
# ..$ d: num [1:3] 1 2 3
# ..$ e: Date[1:3], format: "2020-05-21" "2014-08-31" "1999-01-21"
# ..$ f: num [1:3] 7 8 9
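As for the "why": walk() and walk2() are called for their side effects and return their input unchanged, and `.x <- as.Date(.x)` only rebinds a local variable inside the anonymous function, so nothing is ever written back to datasets. The same fix can be sketched in base R without any packages. This is a minimal sketch, assuming dates has an entry for every name in datasets; convert_dates is a hypothetical helper name, not part of any package.

```r
df1 <- data.frame(a = c("2020-03-02", "2020-12-22", "2020-07-03"),
                  b = c(4, 5, 6),
                  c = c("2020-03-13", "2019-11-03", "2011-05-02"))
df2 <- data.frame(d = c(1, 2, 3),
                  e = c("2020-05-21", "2014-08-31", "1999-01-21"),
                  f = c(7, 8, 9))
datasets <- list(first = df1, second = df2)
dates <- list(first = c("a", "c"), second = "e")

# Hypothetical helper: convert the named columns and return the whole data frame
convert_dates <- function(df, cols) {
  df[cols] <- lapply(df[cols], as.Date)
  df
}

# Align dates to datasets by name, then save the result back to datasets
datasets <- Map(convert_dates, datasets, dates[names(datasets)])
str(datasets)
```

The two essential points are that the function returns the modified data frame and that the result of Map() is assigned back to datasets.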

Related

Adding a list element for an ID based on a specific number

I have a list of elements that hold values in them. I would like to write an if statement so that, if there isn't a specific number of elements for a specific ID (e.g., A, B, C), the appropriate number of elements is added, each with NA as its value. I would like the output to look something like expected_ID. Is there an efficient way of doing this?
library(lubridate)
library(tidyverse)
library(purrr)
date <- rep_len(seq(dmy("01-01-2011"), dmy("31-07-2011"), by = "days"), 200)
ID <- rep(c("A","B", "C"), 200)
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
df$Month <- month(df$date)
int1 <- df %>%
# arrange(ID) %>% # skipped for readability of result
mutate(new = floor_date(date, '10 day')) %>%
mutate(new = if_else(day(new) == 31, new - days(10), new)) %>%
group_by(ID, new) %>%
filter(Month == "1") %>%
group_split()
names(int1) <- sapply(int1, function(x) paste(x$ID[1]))
int1 <- int1[-c(6, 8, 9)]
expected_ID <- list(int1[[1]], int1[[2]], int1[[3]], int1[[4]], int1[[5]], NA, int1[[6]], NA, NA)
names(expected_ID) <- c(rep("A", 3), rep("B", 3), rep("C", 3))
It's not usually desirable to create lists with repeated names, and it would be better to store these data in a hierarchical structure. Building that hierarchy is a necessary intermediate step to reach your intended output in any case; once that is done, we can flatten the data back into the format you've specified. Comments are in the code block below.
# split the list into a list of nested lists
lst <- split(int1, names(int1))
# fill each inner list to the desired length
# the use of pmax() ensures that rep() will not be sent an invalid negative value
# '3' here is your desired list length
filled_lst <- lapply(lst, \(x) list(x, rep(list(NA), pmax(0, 3 - length(x)))))
# convert to desired flattened output format
flat_lst <- unlist(unlist(filled_lst, recursive = F), recursive = F)
names(flat_lst) <- sub('(.).*', '\\1', names(flat_lst))
For posterity, here is my original answer, which worked on the example in which ID was a list of vectors.
# split the list into a list of nested lists
lst <- split(ID, names(ID))
str(lst)
List of 3
$ A:List of 3
..$ A: int [1:3] 1 2 3
..$ A: int [1:3] 4 5 6
..$ A: int [1:3] 7 8 9
$ B:List of 2
..$ B: int [1:2] 1 2
..$ B: int [1:2] 3 4
$ C:List of 1
..$ C: int [1:3] 1 2 3
# fill each nested list to the desired length
# the use of pmax() ensures that rep() will not be sent an invalid negative value
# '3' here is your desired list length
filled_lst <- lapply(lst, \(x) c(x, rep(NA, pmax(0, 3 - length(x)))))
str(filled_lst)
List of 3
$ A:List of 3
..$ A: int [1:3] 1 2 3
..$ A: int [1:3] 4 5 6
..$ A: int [1:3] 7 8 9
$ B:List of 3
..$ B: int [1:2] 1 2
..$ B: int [1:2] 3 4
..$ : logi NA
$ C:List of 3
..$ C: int [1:3] 1 2 3
..$ : logi NA
..$ : logi NA
# convert to desired flattened output format
flat_lst <- unlist(filled_lst, recursive = F)
names(flat_lst) <- gsub('\\d|.\\.+', '', names(flat_lst))
str(flat_lst)
List of 9
$ A: int [1:3] 1 2 3
$ A: int [1:3] 4 5 6
$ A: int [1:3] 7 8 9
$ B: int [1:2] 1 2
$ B: int [1:2] 3 4
$ B: logi NA
$ C: int [1:3] 1 2 3
$ C: logi NA
$ C: logi NA

Creating an empty dataframe in R with column names stored in two separate lists

I have two separate lists containing column names of a new dataframe df to be created.
fixed <- list("a", "b")
variable <- list("a1", "b1", "c1")
How do I proceed so that the column names of df appear in the order a, b, a1, b1, c1?
Probably the simplest way is to concatenate both lists, unlist the result, and subset the data:
df[unlist(c(fixed, variable))]
If the lists contain elements that are not column names in 'df', use intersect:
df[intersect(unlist(c(fixed, variable)), names(df))]
  a a1 c1
1 7  8  1
2 3  1  5
3 8  5  4
4 7  5  6
5 2  5  6
If the goal is an empty data.frame with those column names, we could do:
v1 <- unlist(c(fixed, variable))
df <- as.data.frame(matrix(numeric(), nrow = 0,
ncol = length(v1), dimnames = list(NULL, v1)))
str(df)
'data.frame': 0 obs. of 5 variables:
$ a : num
$ b : num
$ a1: num
$ b1: num
$ c1: num
Or another option is
df <- data.frame(setNames(rep(list(0), length(v1)), v1))[0,]
> str(df)
'data.frame': 0 obs. of 5 variables:
$ a : num
$ b : num
$ a1: num
$ b1: num
$ c1: num
data
v1 <- c('a', 'd2', 'c', 'a1', 'd1', 'c1', 'e1')
set.seed(24)
df <- as.data.frame(matrix(sample(1:9, 5 * length(v1),
replace = TRUE), ncol = length(v1), dimnames = list(NULL, v1)))

Convert columns in a list from character to numeric

I have a list with two data sets and I would like to convert each of the columns from character to numeric.
[[1]]
b m
2 12194.0968074593 703.359790781974
[[2]]
b m
2 49.2080763267713 30.9186232579308
> str(tidy_linear_regression)
List of 2
$ :'data.frame': 1 obs. of 2 variables:
..$ b: chr "12194.0968074593"
..$ m: chr "703.359790781974"
$ :'data.frame': 1 obs. of 2 variables:
..$ b: chr "49.2080763267713"
..$ m: chr "30.9186232579308"
I cannot come up with code that leaves me with a list.
I tried the following code and the result is always a data.frame:
tidy_linear_regression_new <-
lapply(tidy_linear_regression,
function(x) as.numeric(as.character(x)))
tidy_linear_regression_new<-
sapply(tidy_linear_regression,
as.character)
As you have multiple columns in each dataframe, you need an lapply inside an lapply:
tidy_linear_regression <- lapply(tidy_linear_regression, function(x) {
x[] <- lapply(x, as.numeric)
x
})
Alternatively, we may use the tidyverse:
library(purrr)
library(dplyr)
tidy_linear_regression <- map(tidy_linear_regression, ~ .x %>%
  mutate(across(everything(), as.numeric)))
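For a quick check of the nested lapply approach, here is a self-contained base R sketch. The input is rebuilt by hand to mirror the str() output in the question, and the name converted is only used for illustration.

```r
# Input rebuilt by hand from the str() output shown in the question
tidy_linear_regression <- list(
  data.frame(b = "12194.0968074593", m = "703.359790781974",
             stringsAsFactors = FALSE),
  data.frame(b = "49.2080763267713", m = "30.9186232579308",
             stringsAsFactors = FALSE)
)

converted <- lapply(tidy_linear_regression, function(x) {
  # x[] <- keeps x a data.frame while replacing every column
  x[] <- lapply(x, as.numeric)
  x
})

str(converted)
```

The `x[] <-` assignment is what preserves the data frame structure; assigning to plain `x` would replace it with a bare list of columns.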

Force dplyr not to drop attributes - is it possible?

Consider the simple example:
library(dplyr)
dat <- data.frame( a = 1, b = 2 )
attr(dat, "myattr") <- "xyz"
dat %>% mutate(c = 3) %>% str()
## 'data.frame': 1 obs. of 3 variables:
## $ a: num 1
## $ b: num 2
## $ c: num 3
So dplyr drops the attribute. Is it possible to force it not to drop it?
More generally: is it possible to force R not to drop attributes when changing an object's class?
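One possible workaround, sketched here in base R, is to save the attribute before the transformation and reattach it afterwards. This does not change how dplyr behaves; it only copies the attribute across. keep_attr is a hypothetical helper written for this illustration, and cbind() stands in for whatever transformation is applied.

```r
# Hypothetical helper: apply f to data, then restore one named attribute
keep_attr <- function(data, which, f, ...) {
  saved <- attr(data, which, exact = TRUE)
  out <- f(data, ...)
  attr(out, which) <- saved
  out
}

dat <- data.frame(a = 1, b = 2)
attr(dat, "myattr") <- "xyz"

# cbind() is only a stand-in for the real transformation
res <- keep_attr(dat, "myattr", function(d) cbind(d, c = 3))
attr(res, "myattr")
```

The same wrapper could be used around a dplyr pipeline by passing it as f, assuming the pipeline returns a single object.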

Padding Zeros to one column in all data frames in a list

I have a list of four data frames. Each data frame has the same first column, person.id (a unique key in each data frame), which I want to pad with leading zeros.
ISSUE:
The code runs but outputs to the Console and doesn't change the actual data frames in the list.
EXAMPLE DATA:
df1 <- data.frame(person.id = 3200:3214, letter = letters[1:15])
df2 <- data.frame(person.id = 4100:4114, letter = letters[8:22])
df3 <- data.frame(person.id = 4300:4314, letter = letters[10:24])
df4 <- data.frame(person.id = 5500:5514, letter = letters[5:19])
dataList <- list(df1, df2, df3, df4)
lapply(dataList, function(i){
i$person.id <- str_pad(i$person.id, 6, pad = "0")
})
# Console output pads the zeros (not expected):
[[1]]
[1] "003200" "003201" "003202" "003203" "003204" "003205" "003206" "003207" "003208"
[10] "003209" "003210" "003211" "003212" "003213" "003214"
# Data Frames in list return with no change:
> dataList[[1]]$person.id
[1] 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214
How do I apply the change to every column named person.id in every data frame in my list?
What I want is padded zeros in every data frame in my list:
> dataList[[1]]$person.id
[1] 003200 003201 003202 003203 003204 003205 003206 003207 003208
[10] 003209 003210 003211 003212 003213 003214
The function you lapply needs to return the full data frame. The function you used just returns the result of the assignment, which is only the values for the column, not the entire data frame. You also need to save the result. Here we use transform as the function as it modifies a data frame, and use the person.id argument to modify the person.id column (see ?transform):
library(stringr)
df.pad <- lapply(dataList, transform, person.id = str_pad(person.id, 6, pad = "0"))
Then df.pad[[1]] produces:
[[1]]
person.id letter
1 003200 a
2 003201 b
3 003202 c
4 003203 d
5 003204 e
6 003205 f
7 003206 g
8 003207 h
9 003208 i
10 003209 j
11 003210 k
12 003211 l
13 003212 m
14 003213 n
15 003214 o
You need to return the data frame because R is not an assign-by-reference language. Your assignments to i in lapply just modify the local copy of i, not the data frames in dataList in the global environment. If you want dataList to be modified you can substitute dataList for df.pad in the above expression, which will result in dataList being overwritten with a new version of it containing the modified data frames.
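The difference between discarding and saving the result can be seen in a small base R sketch. It uses sprintf() instead of str_pad() so that it runs without extra packages, which works here because person.id is an integer column; pad_id is a hypothetical helper name.

```r
dataList <- list(
  data.frame(person.id = 3200:3202, letter = letters[1:3]),
  data.frame(person.id = 4100:4102, letter = letters[8:10])
)

# Hypothetical helper: pad the ID column and return the whole data frame
pad_id <- function(df) {
  df$person.id <- sprintf("%06d", df$person.id)
  df
}

# Result discarded: the data frames inside dataList are untouched
invisible(lapply(dataList, pad_id))
unchanged <- dataList[[1]]$person.id

# Result saved: dataList now holds the modified copies
dataList <- lapply(dataList, pad_id)
padded <- dataList[[1]]$person.id
```

After the first call, unchanged still holds the original integers; only after the reassignment does dataList contain the padded character IDs.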
You made the assignment to a column, but you a) did not return the dataframes and b) did not assign the results to a name. (Welcome to functional programming. Running a function on an object does not change the original object.) All you got back were the padded values. Returning the dataframe and saving the result fixes both problems:
df1 <- data.frame(person.id = 3200:3214, letter = letters[1:15])
df2 <- data.frame(person.id = 4100:4114, letter = letters[8:22])
df3 <- data.frame(person.id = 4300:4314, letter = letters[10:24])
df4 <- data.frame(person.id = 5500:5514, letter = letters[5:19])
dataList <- list(df1, df2, df3, df4)
library(stringr)
newList <- lapply(dataList, function(i){
i$person.id <- str_pad(i$person.id, 6, pad = "0"); return(i)
})
> str(newList)
List of 4
$ :'data.frame': 15 obs. of 2 variables:
..$ person.id: chr [1:15] "003200" "003201" "003202" "003203" ...
..$ letter : Factor w/ 15 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ :'data.frame': 15 obs. of 2 variables:
..$ person.id: chr [1:15] "004100" "004101" "004102" "004103" ...
..$ letter : Factor w/ 15 levels "h","i","j","k",..: 1 2 3 4 5 6 7 8 9 10 ...
$ :'data.frame': 15 obs. of 2 variables:
..$ person.id: chr [1:15] "004300" "004301" "004302" "004303" ...
..$ letter : Factor w/ 15 levels "j","k","l","m",..: 1 2 3 4 5 6 7 8 9 10 ...
$ :'data.frame': 15 obs. of 2 variables:
..$ person.id: chr [1:15] "005500" "005501" "005502" "005503" ...
..$ letter : Factor w/ 15 levels "e","f","g","h",..: 1 2 3 4 5 6 7 8 9 10 ...
The pad function in the package qdapTools can do this:
df1 <- data.frame(person.id = 3200:3214, letter = letters[1:15])
df2 <- data.frame(person.id = 4100:4114, letter = letters[8:22])
df3 <- data.frame(person.id = 4300:4314, letter = letters[10:24])
df4 <- data.frame(person.id = 5500:5514, letter = letters[5:19])
dataList <- list(df1, df2, df3, df4)
library(qdapTools)
dataList <- lapply(dataList, function(x) {x[["person.id"]] <- pad(x[["person.id"]], 6); x})
