I need to write a function that creates objects similar to tibble::tibble(). The function should take a parameter x, a list() that contains the columns (there is no need to check the type or length of the columns).
I can’t use tibble::tibble() or similar functions like data.frame(), data.table::data.table().
I am currently stuck because I can't figure out how to bind two columns of different types.
I tried collecting the types of the columns to display them in a row and then somehow putting the columns together, but I still haven't figured out how.
My code looks like the following:
library(tibble) # loaded but not used
my_tibble <- function(x) {
ncol <- as.integer(length(x))
nrow <- length(x[[1]])
y <- list()
i <- 1
while (i <= length(x)) {
y <- append(y, typeof(x[[i]]), length(y))
i = i + 1
}
}
An example output should look like this:
my_tb <- my_tibble(list(x=1:3, y=letters[1:3]))
my_tb
# A tibble: 3 x 2
      x y
  <int> <chr>
1     1 a
2     2 b
3     3 c
A tibble and a data.frame are essentially just lists with at least one additional attribute ("row.names") and a special printing method. So here is a very simple implementation of a tibble:
my_tbl <- function(...) {
out <- list(...)
class(out) <- c("my_tbl", "data.frame")
attr(out, "row.names") <- seq_along(out[[1]])
return(out)
}
print.my_tbl <- function(x, ...) {
  cat("my tibble: ", nrow(x), " x ", ncol(x), "\n")
  cat(colnames(x), "\n")
  cat(paste0("<", sapply(x, class), ">"), "\n")
  # print the r-th element of every column, one row per line
  for (r in seq_along(x[[1]])) {
    cat(unlist(lapply(x, `[`, r)), "\n")
  }
  invisible(x)
}
So now you can use the new function:
my_tbl(x=1:3, y=letters[1:3])
#> my tibble: 3 x 2
#> x y
#> <integer> <character>
#> 1 a
#> 2 b
#> 3 c
Passing a single list also runs, but the list is treated as one unnamed list column with one row per element, not as separate columns:
my_tbl(list(x=1:3, y=letters[1:3]))
#> my tibble: 2 x 1
#>
#> <list>
#> 1 2 3
#> a b c
Created on 2022-11-03 with reprex v2.0.2
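The question asks for a constructor that takes the columns as a single list x, whereas my_tbl() above takes them through `...`. A minimal sketch of a list-based variant follows. The names my_tibble and col_types are hypothetical, and (as the question allows) no checks on column types or lengths are made.

```r
# Hypothetical list-based constructor: `x` is a named list of columns.
my_tibble <- function(x) {
  stopifnot(is.list(x), length(x) > 0)
  class(x) <- c("my_tbl", "data.frame")
  # one row name per element of the first column
  attr(x, "row.names") <- seq_along(x[[1]])
  x
}

# Hypothetical helper: the storage type of every column, for a header line
col_types <- function(x) {
  vapply(unclass(x), typeof, character(1))
}

tb <- my_tibble(list(x = 1:3, y = letters[1:3]))
dim(tb)        # rows and columns, taken from the row.names attribute
col_types(tb)  # named character vector of column types
```

Columns of different types can sit side by side because a list places no constraint on the type of its elements; nothing has to be "bound" together.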
Related
I am changing the values of a column of a data frame. Then I save the file, supposedly with the changes, but the changes are not in it. What am I missing? Thanks,
test <- data.frame(name_s = c("x","y","z"), number_s = c(1,2,3))
lapply(1:length(test$number_s), function(x) {
test$number_s[x] <- test$number_s[[x]] + 1
})
write.csv(test,paste0("test ",format(Sys.time(),"%Y%m%d"),".csv"),
row.names = F)
That was oversimplified; the real code is this:
date_format_1 = "[0-9]-[:alpha:][:alpha:][:alpha:]"
date_format_2 = "[:alpha:][:alpha:][:alpha:]-[0-9][0-9]"
test <- data.frame(name_s = c("v","w","x","y","z"), event_text = c("Aug-89","7-May","9-Jun","4-Dec-2021","Feb-99"))
lapply(1:length(test$event_text), function(x) {
if (str_detect(test$event_text[[x]], paste0("\\b",date_format_1,"\\b")) == T){
test$event_text[x] <- paste0(str_sub(test$event_text[[x]],1,1), "/F",
which(month.abb %in% str_sub(
test$event_text[[x]], 3,5)))
} else if(str_detect(test$event_text[[x]], paste0("\\b", date_format_2,"\\b"))
== T) {
test$event_text[[x]] = paste0(which(month.abb %in% str_sub(
test$event_text[x],1,3)),"/F",str_sub(test$event_text[[x]],-2))
} else {
test$event_text[x] <- test$event_text[[x]]
}
})
write.csv(test,paste0("test ",format(Sys.time(),"%Y%m%d"),".csv"),
row.names = F)
Below I have written two calls to lapply that fix the issue you were having. The problem stems from R's scoping rules: the assignment inside the anonymous function modifies a local copy of test, so the value changes within the function but the result is never returned to, or stored in, the calling environment. To demonstrate this, I print the data frame after each of the lapply() calls below.
We can fix this in two ways. The first, more correct version lets the apply function work on the vector directly, adding one to each value and returning x + 1, and then assigns the result back to the column. (Note that I have skipped the curly braces; the function returns the value of the last expression evaluated, in this case x + 1. Alternatively you could write function(x) {return(x+1)} in that argument.)
An alternative approach that will run slower but still uses the indexing method is super-assignment. <<- assigns to the variable in an enclosing environment (here the global environment, where test lives) rather than in the local scope of the function. (Note that this code is run sequentially, so this call adds 1 to the data frame for the second time, as the output below shows.)
test <- data.frame(name_s = c("x","y","z"), number_s = c(1,2,3))
# Original Behaviour, doesn't work due to scoping issues
lapply(1:length(test$number_s), function(x) {
test$number_s[x] <- test$number_s[[x]] + 1
})
#> [[1]]
#> [1] 2
#>
#> [[2]]
#> [1] 3
#>
#> [[3]]
#> [1] 4
print(test)
#> name_s number_s
#> 1 x 1
#> 2 y 2
#> 3 z 3
# function that is syntactically and functionally correct
# instead of modifying the vector in the function scope the function returns the
# mutated values which we then assign to the data frame's column
# (sapply returns a plain numeric vector; lapply would store a list column,
# which write.csv cannot write)
test$number_s <- sapply(test$number_s, function(x) x + 1)
print(test)
#> name_s number_s
#> 1 x 2
#> 2 y 3
#> 3 z 4
# function that is syntactically odd but functionally correct
# the function affects the values in the global scope, this works but is slower
# and is not best practice as it would be difficult to read
lapply(1:length(test$number_s), function(x) {
test$number_s[x] <<- test$number_s[[x]] + 1
})
#> [[1]]
#> [1] 3
#>
#> [[2]]
#> [1] 4
#>
#> [[3]]
#> [1] 5
print(test)
#> name_s number_s
#> 1 x 3
#> 2 y 4
#> 3 z 5
Created on 2021-07-23 by the reprex package (v2.0.0)
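The same "return the value and assign it back" pattern carries over to the real date-string problem. The following is only a sketch in base R, under stated assumptions: convert_event is a hypothetical helper, the patterns are anchored to the whole string (stricter than the \\b patterns in the question), and the sample data is shortened.

```r
test <- data.frame(
  name_s = c("v", "w", "x"),
  event_text = c("Aug-89", "7-May", "no date"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: converts one string and returns the result
convert_event <- function(s) {
  if (grepl("^[0-9]-[[:alpha:]]{3}$", s)) {
    # day-month, e.g. "7-May": day, then month number
    paste0(substr(s, 1, 1), "/F", match(substr(s, 3, 5), month.abb))
  } else if (grepl("^[[:alpha:]]{3}-[0-9]{2}$", s)) {
    # month-year, e.g. "Aug-89": month number, then two-digit year
    paste0(match(substr(s, 1, 3), month.abb), "/F", substr(s, 5, 6))
  } else {
    s  # leave anything else untouched
  }
}

# vapply returns a character vector, which is assigned back to the column
test$event_text <- vapply(test$event_text, convert_event, character(1),
                          USE.NAMES = FALSE)
test$event_text
```

Because the helper returns its result instead of assigning inside the function, no <<- is needed and the column stays a plain character vector that write.csv can handle.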
I have several lists with the same structure (same list elements, different values), and I want to bind the lists using dplyr::bind_rows(). I encounter an error because one of the list elements is a function.
Example
lst_a <- list(what_it_does = "adds 1", func = function (x) x + 1)
lst_b <- list(what_it_does = "subtracts 2", func = function(x) x - 2)
lst_c <- list(what_it_does = "takes a sqrt", func = function (x) sqrt(x))
lst_d <- list(what_it_does = "multiplies by 5", func = function (x) x * 5)
Desired output
Combine the lists to one table (tibble)
# A tibble: 4 x 3
name what_it_does func
<chr> <chr> <list>
lst_a adds 1 <fn>
lst_b subtracts 2 <fn>
lst_c takes a sqrt <fn>
lst_d multiplies by 5 <fn>
My attempt
Initially I tried:
library(dplyr)
bind_rows(lst_a, lst_b, lst_c, lst_d)
Error: Input must be a vector, not a function.
Then I thought maybe I should be wrapping the func element in each list inside its own list(), which works:
lst_a_wrapped <- list(what_it_does = "adds 1", func = list(function (x) x + 1))
lst_b_wrapped <- list(what_it_does = "subtracts 2", func = list(function(x) x - 2))
lst_c_wrapped <- list(what_it_does = "takes a sqrt", func = list(function (x) sqrt(x)))
lst_d_wrapped <- list(what_it_does = "multiplies by 5", func = list(function (x) x * 5))
bind_rows(lst_a_wrapped, lst_b_wrapped, lst_c_wrapped, lst_d_wrapped)
## # A tibble: 4 x 2
## what_it_does func
## <chr> <list>
## 1 adds 1 <fn>
## 2 subtracts 2 <fn>
## 3 takes a sqrt <fn>
## 4 multiplies by 5 <fn>
My problem
Consider that I may have many lists to be tabulated, so manually wrapping each func element in each list with an inner list() is inelegant. Is there a way to go directly (meaning, programmatically) from {lst_a, lst_b, lst_c, lst_d} to the desired output? As far as I could find, bind_rows() doesn't have extra arguments to tweak the input, so the solution has to lie elsewhere.
Bonus. Although it's not too important, I'd be happy to also learn whether there's a way to achieve a name column as in the desired output, derived from the lists object names, as part of the process.
You can add the list around the function in a loop like:
x <- mget(c("lst_a", "lst_b", "lst_c", "lst_d"))
y <- lapply(x, function(x) lapply(x, function(y) if(is.function(y)) list(y) else y))
dplyr::bind_rows(y)
## A tibble: 4 x 2
# what_it_does func
# <chr> <list>
#1 adds 1 <fn>
#2 subtracts 2 <fn>
#3 takes a sqrt <fn>
#4 multiplies by 5 <fn>
If you have many lists with a common prefix such as "lst_", you may want to consider using the ls() function to collect them.
all_lsts <- mget(ls(pattern = "^lst_"))
all_lsts_fixed <- lapply(all_lsts, function(x) lapply(x, function(y) if(is.function(y)) list(y) else y))
dplyr::bind_rows(all_lsts_fixed)
# # A tibble: 4 x 2
# what_it_does func
# <chr> <list>
# 1 adds 1 <fn>
# 2 subtracts 2 <fn>
# 3 takes a sqrt <fn>
# 4 multiplies by 5 <fn>
I am creating a new vctrs S3 class using new_list_of() but I can't find a way to control the printing of this class when used in tibbles. Here is a minimal example using a toy 'fruit_bowl' class. Ideally I'd like the columns to show the output of obj_print_data.fruit_bowl(), i.e. "2 types of fruit" and "1 type of fruit", but it seems the only thing that will print is the length of the vector.
library(vctrs)
library(tibble)
#>
#> Attaching package: 'tibble'
#> The following object is masked from 'package:vctrs':
#>
#> data_frame
# Fruit bowl constructor function that uses `new_list_of`
fruit_bowl <- function(...) {
x <- list(...)
x <- lapply(x, vec_cast, character())
new_list_of(x, ptype = character(), class = "fruit_bowl")
}
# Set the ptypes for nice printing
vec_ptype_full.fruit_bowl <- function(x, ...) "fruit_bowl"
vec_ptype_abbr.fruit_bowl <- function(x, ...) "frt_bwl"
# Formatting for fruit bowls
format.fruit_bowl <- function(x, ...) {
format_fruits <- function(x) {
n <- length(unique(x))
sprintf("%d type%s of fruit", n, if (n == 1) "" else "s")
}
vapply(x, format_fruits, character(1))
}
# Printing for fruit bowls - use the 'format' function
obj_print_data.fruit_bowl <- function(x, ...) {
if (length(x) == 0) {
return()
}
print(format(x))
}
# Printing works nicely in isolation
fruit_bowl(c("banana", "apple"), "pear")
#> <fruit_bowl[2]>
#> [1] "2 types of fruit" "1 type of fruit"
# ...But not within tibbles
tibble(fruit_bowls = fruit_bowl(c("banana", "apple"), "pear"))
#> # A tibble: 2 x 1
#> fruit_bowls
#> <frt_bwl>
#> 1 [2]
#> 2 [1]
Created on 2021-04-10 by the reprex package (v0.3.0)
I jumped the gun and posted this question without fully reading the documentation. Here is the correct approach as outlined here
pillar_shaft.fruit_bowl <- function(x, ...) {
pillar::new_pillar_shaft_simple(format(x))
}
tibble(fruit_bowls = fruit_bowl(c("banana", "apple"), "pear"))
#> # A tibble: 2 x 1
#> fruit_bowls
#> <frt_bwl>
#> 1 2 types of fruit
#> 2 1 type of fruit
I have 25 data sets, each structured the same way. Each contains many rows and 7 columns. Column 6 contains data that should be numeric but is not, because the numbers contain commas, i.e. 100000 is written as 100,000.
I can manually resolve this in each data set by removing the comma and then specifying that the data is numerical using the following code
df$column_6 <- gsub("[,]" , "", df$column_6)
df$column_6 <- as.numerical(df$column_6)
However, as there are 25 data sets, I would like to loop through them, but I have been unable to get this to work.
Additionally, because column 6 has a different name in each data set, I would prefer to specify column 6 without using its name, like below
df[6] <- gsub("[,]" , "", df[6])
however this doesn't seem to work.
My code is as follows
list_of_dfs = c(df1, df2, ..... , df25)
for (i in list_of_dfs) {
i[6] <- gsub("[,]" , "", i[6])
i[6] <- as.numerical(i[6])
}
Does anyone have any advice on how to do this?
Your code is close, but has a few problems:
the result never gets assigned back to the list.
as.numerical is a typo, it needs to be as.numeric
i[6] doesn't work because you need to specify that it's the 6th column you want: i[, 6]. See here for details on [ vs [[.
c(df1, df2) doesn't actually create a list of data frames
Try this instead:
## this is bad, it will make a single list of columns, not of data frames
# list_of_dfs = c(df1, df2, ..... , df25)
# use this instead
list_of_dfs = list(df1, df2, ..... , df25)
# or this
list_of_dfs = mget(ls(pattern = "df"))
for (i in seq_along(list_of_dfs)) {
list_of_dfs[[i]][, 6] <- as.numeric(gsub("[,]" , "", list_of_dfs[[i]][, 6]))
}
We can do a bit better. gsub uses pattern-matching regular expressions by default; using the fixed = TRUE argument instead will be quite a bit faster:
for (i in seq_along(list_of_dfs)) {
list_of_dfs[[i]][, 6] <- as.numeric(gsub(",", "", list_of_dfs[[i]][, 6], fixed = TRUE))
}
And we could use lapply instead of a for loop for slightly shorter code:
list_of_dfs <- lapply(list_of_dfs, function(x) {
  x[, 6] <- as.numeric(gsub(",", "", x[, 6], fixed = TRUE))
  return(x)
})
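To make the pattern concrete, here is a small self-contained sketch with toy data. The data frames, the column position (2 instead of 6) and the helper name strip_commas are all made up for illustration.

```r
# Toy data frames (hypothetical); the comma-formatted column is in position 2
# and has a different name in each data frame
df1 <- data.frame(id = 1:2, amount = c("10,000", "1,250"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = 3:4, total = c("100,000", "7"),
                  stringsAsFactors = FALSE)
list_of_dfs <- list(df1 = df1, df2 = df2)

# Hypothetical helper: strip commas from one column, chosen by position
strip_commas <- function(d, col) {
  d[[col]] <- as.numeric(gsub(",", "", d[[col]], fixed = TRUE))
  d  # return the whole data frame, not just the cleaned column
}

# Assign the result back, otherwise the cleaned copies are discarded
list_of_dfs <- lapply(list_of_dfs, strip_commas, col = 2)
str(list_of_dfs$df1)
```

The sketch uses [[col]] rather than [, col] because [[ extracts the column as a vector for both data frames and tibbles, while [, col] on a tibble returns a one-column tibble.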
Try this out. You put all dataframes in a list, then you make the column numeric. Instead of gsub I use readr::parse_number. I'll also include a practice set for illustration.
library(tidyverse)
# tibble() replaces the deprecated data_frame()
df1 <- tibble(id = rep(1,3), num = c("10,000", "11,000", "12,000"))
df2 <- tibble(id = rep(2,3), num = c("13,000", "14,000", "15,000"))
df3 <- tibble(id = rep(3,3), num = c("16,000", "17,000", "18,000"))
list(df1, df2, df3) %>% map(~mutate(.x, num = parse_number(num)))
#> [[1]]
#> # A tibble: 3 x 2
#> id num
#> <dbl> <dbl>
#> 1 1 10000
#> 2 1 11000
#> 3 1 12000
#>
#> [[2]]
#> # A tibble: 3 x 2
#> id num
#> <dbl> <dbl>
#> 1 2 13000
#> 2 2 14000
#> 3 2 15000
#>
#> [[3]]
#> # A tibble: 3 x 2
#> id num
#> <dbl> <dbl>
#> 1 3 16000
#> 2 3 17000
#> 3 3 18000
Created on 2018-09-20 by the reprex package (v0.2.0).
Part of the answer has been sourced from here: Looping through list of data frames in R
In your case, you can do the following:
list_of_dfs = list(df1, df2, ..... , df25)
# return the modified data frame from the function and assign the result back
list_of_dfs <- lapply(list_of_dfs, function(x) { x[, 6] <- as.integer(gsub("," , "", x[, 6])); x })
The data.table way
library(data.table)
test<-data.table(col1=c('100,00','100','100,000'),col2=c('90','80,00','60'))
col1 col2
100,00 90
100 80,00
100,000 60
Your list of data tables:
testList<-list(test,test)
Assume you want to correct col2 in this case, but want to refer to it by index:
removeNonnumeric<-function(x){return(as.numeric(gsub(',','',x)))}
data<-function(x){return(x[,lapply(.SD,removeNonnumeric),.SDcols=names(x)[2],by=col1])}
removeNonnumeric removes the "," from the column values, and data takes a data table and calls removeNonnumeric on the selected column. Combining the two in an lapply over testList gives a list of data tables:
lapply(testList,data)
I want to remove duplicate rows from a dataframe, for specific columns only. That can be obtained with distinct:
data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5,4,5,5))
filtered_data <- data %>% distinct(a, b, .keep_all = T)
dim(filtered_data)
# [1] 3 3
This is (almost) what I need. My problem is that the column names I need to use with distinct will change. So I have a character vector gen that contains the names of the columns I want to use with the distinct function. They need to be unquoted to be useful in the pipe. I found suggestions to use as.name() or eval(parse()). This, however, gives me a different result:
gen <- c("a", "b")
filtered_data <- data %>% distinct(eval(parse(text = gen)), .keep_all = T)
dim(filtered_data)
# [1] 2 4
The eval seems to do something funny with the amount of times the data is filtered. (and, adds an extra column. I could live with that, though...) So, how to obtain a similar result, as if I had used a,b, but by using a variable instead?
additional information
I actually obtain gen by reading the column names of a data frame: gen <- colnames(data)[1:2]. The solution suggested by @gymbrane would be perfect if I had a way to transform gen into c(a, b). The whole point is to avoid hardcoding the column names. I tried things like gen <- noquote(gen), which does not give an error in the rm_dup_rows function suggested below, but it does give a different result, with the same sort of repeated filtering as I started with...
fixed
I think I got it working. It might be inelegant, and I'm not sure if every step is necessary for the result, but it seems to work by combining the function provided by @gymbrane below with ensym and quos in a for loop while adding to a list in GlobalEnv (edit: GlobalEnv isn't necessary):
unquote_string <- function(string) {
out <- list()
i <- 1
for (s in string) {
t <- ensym(s)
out[i] <-dplyr::quos(!!t)
i <- i+1
}
return(out)
}
gen_quo <- unquote_string(gen)
filtered_data <- rm_dup_rows(data, gen_quo)
dim(filtered_data)
# [1] 3 3
How about creating a function and using quosures? Perhaps something like this is what you are looking for...
rm_dup_rows <- function(data, ...){
vars = dplyr::quos(...)
data %>% distinct(!!! vars, .keep_all = T)
}
I believe this returns what you are asking for
rm_dup_rows(data = data, a, b)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
2 3 5
2 4 5
rm_dup_rows(data, b, z)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 4 5
Additional
You could modify rm_dup_rows just slightly and construct your vector with quos. Something like this...
rm_dup_rows <- function(data, vars){
data %>% distinct(!!! vars, .keep_all = T)
}
# quos your column name vector
gen <- quos(a,z)
rm_dup_rows(data, gen)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 3 5
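As a cross-check that avoids quoting altogether, the same de-duplication can be sketched in base R with duplicated(), which accepts a character vector of column names directly. This swaps in a different technique from the quosure approach above; it is a sketch on the question's data, not a dplyr solution.

```r
data <- data.frame(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5, 4, 5, 5))

# Column names held in a character vector, as in the question
gen <- colnames(data)[1:2]

# Keep the first row of every distinct combination of the `gen` columns;
# all other columns are carried along, like .keep_all = TRUE
filtered_data <- data[!duplicated(data[gen]), ]
dim(filtered_data)
```

This also suggests why eval(parse(text = gen)) misbehaves: parse() returns one expression per string, and eval() on several expressions returns only the value of the last one, so distinct() most likely sees just column b. Within dplyr, splicing rlang::syms(gen) into distinct() with !!! should give the same rows as the base R version.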