I'm trying to create a variable using an if-statement. I want to check whether the variable "st" exists in each dataframe in the list of dataframes "dflist", and if it doesn't exist, I want to create it. I tried to do it like this (however, it doesn't work):
#making list of dataframes, and reading them into r
mylist = list.files(pattern="*.dta")
dflist <- lapply(mylist, read.dta13)
# if "st" exists in every dataframe in dflist, return "yes", else if it doesn't exist in a particular dataframe, create variable "st" in those dataframes
if(exists(st, dflist)){
"yes"
} else{
st <- c("total")
dflist$st <- st
}
We can use lapply to loop over the list and create a column in the 'data.frame' if 'st' is not there.
dflist1 <- lapply(dflist, function(x) if(!exists("st", x))
transform(x, st = "total") else x)
data
dflist <- list(data.frame(v1 = 1:5), data.frame(st = 1:6))
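As a rough alternative sketch of the same idea, the check can also be written with `%in%` on the column names, which some readers find easier to follow than `exists()`. The helper name `add_st` is made up for this illustration; the sample data is the one shown above.

```r
# Sample data from the answer: only the second data frame has 'st'
dflist <- list(data.frame(v1 = 1:5), data.frame(st = 1:6))

# Hypothetical helper: add 'st' only when the column is missing
add_st <- function(df) {
  if (!"st" %in% names(df)) {
    df$st <- "total"
  }
  df  # always return the (possibly modified) data frame
}

dflist1 <- lapply(dflist, add_st)
lapply(dflist1, names)
```

Returning `df` on both branches matters here: `lapply` collects whatever the function returns, so a branch that only returned "yes" would replace that data frame with a string.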
Related
I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
a <- rows_above_th$id # obtain the ids of the rows above the threshold
ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don't get why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I suspect the errors are in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply, you want to avoid trying to change the values of global variables. Instead, you should return the values you want. For example, generate the vector of IDs to remove for each item in the list, collected as a list:
ids_to_remove <- lapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
rows_above_th$id # obtain the ids of the rows above the threshold
})
And then you can use that list with your data list and mapply to iterate the two lists together
mapply(function(data, ids) {
data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
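If the single flat vector from the question, c(9, 10, 9, 10), is still wanted, a minimal sketch is to return the ids from the function and flatten the result afterwards. It reuses the question's df1 and df2; the names `ids_flat` and `filtered` are made up for this illustration. The first part only demonstrates why the original `append` had no visible effect: the assignment inside the anonymous function creates a local copy.

```r
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])  # note: seq(11:20) yields 1:10
mylist <- list(df1, df2)

# The question's approach: this assignment only changes a local copy
ids_to_remove <- c()
invisible(lapply(mylist, function(df) {
  ids_to_remove <- append(ids_to_remove, df$id[df$id > 8])
}))
length(ids_to_remove)  # the outer vector is still empty

# Return the ids instead, then flatten the list into one vector
ids_flat <- unlist(lapply(mylist, function(df) df$id[df$id > 8]))

# Drop every row whose id appears anywhere in the flat vector
filtered <- lapply(mylist, function(df) df[!df$id %in% ids_flat, ])
```

Note the difference in behaviour: the mapply/Map versions above remove from each data frame only its own ids, while this flat-vector variant removes any id collected from any data frame.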
I have example data as follows:
# list of data frames:
l = list(a=mtcars, b=mtcars, c=mtcars)
I would like to replace the list names, if they exist in the vector list_names_available_for_name_change with new_list_names.
list_names_available_for_name_change <- c("a", "c")
new_list_names <- c("android", "circus")
I thought of doing something like:
names(l)[names(l) == "a"] <- "android"
But I would like to do this for the entire list. Something like:
names(l)[names(l) == list_names_available_for_name_change ] <- new_list_names
How should I write the syntax to achieve this?
Desired output:
# list of data frames:
l = list(android=mtcars, b=mtcars, circus=mtcars)
In base R, use match to find the positions of the list's names within the subset of names available for change, use those positions to pick the corresponding 'new_list_names', and assign the non-missing results back to the names of the list
nm1 <- new_list_names[match(names(l), list_names_available_for_name_change)]
i1 <- !is.na(nm1)
names(l)[i1] <- nm1[i1]
-output
names(l)
[1] "android" "b" "circus"
Or with mapvalues
names(l) <- plyr::mapvalues(names(l),
list_names_available_for_name_change, new_list_names)
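The attempt with `==` in the question compares the two vectors element by element (recycling the shorter one), which is why it does not behave like a lookup. Another way to sketch the lookup idea in base R is a named vector that maps old names to new ones. The name `lookup` is made up for this illustration.

```r
l <- list(a = mtcars, b = mtcars, c = mtcars)
list_names_available_for_name_change <- c("a", "c")
new_list_names <- c("android", "circus")

# Named vector: names are the old list names, values are the replacements
lookup <- setNames(new_list_names, list_names_available_for_name_change)

# Replace only the names that have an entry in the lookup
hit <- names(l) %in% names(lookup)
names(l)[hit] <- unname(lookup[names(l)[hit]])
names(l)
```

Names without an entry in the lookup (here "b") are left untouched.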
So, I have a list of strings named control_for. I have a data frame sampleTable in which some of the columns are named after strings from the control_for list. And I have a third object dge_obj (a DGEList object) to which I want to append those columns. What I want to do is use lapply to loop through the control_for list and, for each string, find the column in sampleTable with the same name and add that column (as a factor) to the DGEList object. For example, doing it manually with just one string looks like this, and it works:
group <- as.factor(sampleTable[,3])
dge_obj$samples$group <- group
And I tried something like this:
lapply(control_for, function(x) {
x <- as.factor(sampleTable[, x])
dge_obj$samples$x <- x
}
Which doesn't work. I guess the problem is that R can't recognize addressing columns like this. Can someone help?
Here are two base R ways of doing it. The data set is the example of help("DGEList") and a mock up data.frame sampleTable.
Define a vector common_vars of the table's names in control_for. Then create the new columns.
library(edgeR)
sampleTable <- data.frame(a = 1:4, b = 5:8, no = letters[21:24])
control_for <- c("a", "b")
common_vars <- intersect(control_for, names(sampleTable))
1. for loop
for(x in common_vars){
y <- sampleTable[[x]]
dge_obj$samples[[x]] <- factor(y)
}
2. *apply loop.
# lapply keeps each column a factor; sapply would simplify the result to a character matrix
tmp <- lapply(sampleTable[common_vars], factor)
dge_obj$samples <- cbind(dge_obj$samples, tmp)
This code can be rewritten as a one-liner.
Data
set.seed(2021)
y <- matrix(rnbinom(10000,mu=5,size=2),ncol=4)
dge_obj <- DGEList(counts=y, group=rep(1:2,each=2))
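For readers without edgeR installed, the for-loop idea can be tried on a stand-in object. This is only a sketch: the plain list below is an assumption that mimics the `$samples` data frame of a DGEList, and the helper name `add_covariates` is made up. It also shows the two points the question tripped over: `$x` looks for a column literally named "x", whereas `[[x]]` uses the string stored in `x`, and the modified object has to be returned and reassigned.

```r
# Stand-in for a DGEList (assumption): a list holding a 'samples' data frame
dge_obj <- list(samples = data.frame(group = factor(c(1, 1, 2, 2))))
sampleTable <- data.frame(a = 1:4, b = 5:8, no = letters[21:24])
control_for <- c("a", "b", "missing_col")  # one name deliberately absent

# Hypothetical helper: copy the requested columns over as factors
add_covariates <- function(obj, tbl, vars) {
  for (v in intersect(vars, names(tbl))) {  # skip names not in the table
    obj$samples[[v]] <- factor(tbl[[v]])    # [[v]] uses the value of v
  }
  obj  # return the modified copy; the caller must reassign it
}

dge_obj <- add_covariates(dge_obj, sampleTable, control_for)
str(dge_obj$samples)
```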
Is there a way I can get the data info from the global environment into a summary table?
For example, I have a lot of data sets named TXXX in my global environment, like
I would like a table that looks like this
Is it possible to also get the list of variables for each data set programmatically?
It would look like this:
Is there any way I can do that by programming? Thanks.
We can use mget to get all the objects whose names start with 'T' followed by a 3-digit number into a list, then loop over the list to get the number of rows ('Obs') and the number of columns ('Variable'), and rbind the list elements after creating the column 'Data' from the names of the list
lst1 <- lapply(mget(ls(pattern = "^T\\d{3}$")),
function(x) data.frame(Obs = nrow(x),
Variable = ncol(x)))
out <- do.call(rbind, Map(cbind, Data = names(lst1), lst1))
row.names(out) <- NULL
If we need the column names, we could use rowr to cbind the column names when the lengths are not the same
lst1 <- lapply(mget(ls(pattern = "^T\\d{3}$")), names)
library(versions)
available.versions('rowr') # // check for available version. Not in CRAN
install.versions('rowr', '1.1.2') # // install a version
library(rowr) # // load the package
do.call(cbind.fill, c(lst1, fill = NA))
Or without installing rowr
mx <- max(lengths(lst1))
do.call(cbind, lapply(lst1, `length<-`, mx))
Or using tidyverse
library(dplyr)
library(purrr)
mget(ls(pattern = '^T\\d{3}$')) %>%
map_dfr(~ tibble(Obs = nrow(.x), Variable = ncol(.x)), .id = 'Data')
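To try the approach without touching the global environment, here is a small self-contained sketch in base R. The data sets T001 and T002, the environment `env`, and the result names `summary_tbl` and `vars` are all made up for this illustration; with real data in the global environment, the `envir` arguments can simply be dropped.

```r
# Toy data in a separate environment (assumption: stands in for the global one)
env <- new.env()
assign("T001", data.frame(a = 1:3, b = 4:6), envir = env)
assign("T002", data.frame(x = 1:5), envir = env)
assign("other", 1, envir = env)  # does not match the pattern, so it is ignored

# Collect every object named 'T' plus exactly three digits
objs <- mget(ls(env, pattern = "^T\\d{3}$"), envir = env)

# One row per data set: name, number of rows, number of columns
summary_tbl <- data.frame(
  Data     = names(objs),
  Obs      = unname(vapply(objs, nrow, integer(1))),
  Variable = unname(vapply(objs, ncol, integer(1)))
)

# Variable list per data set, collapsed into one string each
vars <- vapply(objs, function(d) paste(names(d), collapse = ", "), character(1))
```

Collapsing the variable names into one string per data set avoids the padding needed when the data sets have different numbers of columns.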
I have several dataframes and want to pass them as parameters to a function for processing. Let's say there are 4 dataframes and I want to rename the first column of each to 'ROWNUM'.
df1 = data.frame(c(1:10),sample(1:100,10))
df2 = data.frame(c(1:10),sample(1:100,10))
df3 = data.frame(c(1:10),sample(1:100,10))
df4 = data.frame(c(1:10),sample(1:100,10))
function(df) colnames(df)[1] = 'ROWNUM'
My objective is to rename them all in one shot rather than passing them one by one.
Thanks.
We can use lapply after keeping the datasets in a list
nm1 <- ls(pattern="^df\\d+$") # anchored so that only names like df1, df2, ... match
lst <- lapply(mget(nm1), function(x) {
colnames(x)[1] <- 'ROWNUM'
x})
It is better to keep the datasets in a list, but if we need to update the original datasets
list2env(lst, envir=.GlobalEnv)
Or we use assign
for(j in seq_along(nm1)){
assign(nm1[j], `names<-`(get(nm1[j]),
c("ROWNUM", names(get(nm1[j]))[-1])))
}
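The lapply plus list2env route can be tried safely in a scratch environment before running it against the global one. In this sketch the environment `env`, the toy data frames, and the name `renamed` are assumptions for illustration only.

```r
# Scratch environment standing in for .GlobalEnv (assumption)
env <- new.env()
env$df1 <- data.frame(id = 1:3, v = c(10, 20, 30))
env$df2 <- data.frame(id = 1:3, v = c(5, 6, 7))

# Names of the data frames to process
nm <- ls(env, pattern = "^df\\d+$")

# Rename the first column in each and return the modified copy
renamed <- lapply(mget(nm, envir = env), function(d) {
  names(d)[1] <- "ROWNUM"
  d
})

# Write the renamed copies back over the originals
invisible(list2env(renamed, envir = env))
names(env$df1)
```

The anonymous function must end with `d`; without it, lapply would collect the value of the assignment (the string "ROWNUM") instead of the data frame.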