I would like to delete only some data frames from the environment. Since I have more than 400 data frames there, I would like to select them based on their number of columns.
I tried the code below, but it does not work.
# Select data tables
dfs <- ls()
dfs <- dfs[grepl(".csv", dfs)]
myfilenames <- list.files(inputdir, pattern = ".csv$")
storage <- data.frame(matrix(nrow=length(myfilenames), ncol=1))
# Filter for 25 columns
for (i in seq_along(dfs)) {
  dat <- get(dfs[i])
  if (ncol(dat) != 25) {
    storage[i, 1] <- dfs[i]
  }
}
# Delete
todelete <- storage[,1]
rm(todelete)
Expected result: only data frames with 25 columns remain in the global environment.
Actual result: nothing is removed.
Thank you very much for your help!
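For what it's worth, the immediate bug is rm(todelete): it removes the variable todelete itself, not the objects whose names it holds; rm(list = todelete) is needed, and storage also keeps NA for every frame you retain, so the vector would need na.omit() first. Below is a minimal sketch of the whole task, assuming everything you want to test lives in the global environment:
# collect the names of all data frames in the global environment
all_objs <- ls(envir = .GlobalEnv)
dfs <- all_objs[vapply(all_objs, function(nm) is.data.frame(get(nm)), logical(1))]
# names of the frames that do NOT have exactly 25 columns
todelete <- dfs[vapply(dfs, function(nm) ncol(get(nm)) != 25, logical(1))]
rm(list = todelete)  # rm(list = ...) expects a character vector of object names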
I have a list of 181 data frames, and I want to extract the 2nd column of each, save it in a CSV file, and label it; the labels for those 181 dfs are 0, 1, 2, 3, 4, 5, 6.
The problem is that each df has a different length, and I don't know whether that is even doable in R!
This is an inefficient but easily coded solution (and efficiency doesn't matter when all you need to do is output a short CSV file). It writes one line per data frame, assuming the data frames are collected in a list l.df.
#
# Prepare for output and input.
#
fn <- "temp.csv"
if(is.null(names(l.df))) names(l.df) <- 1:length(l.df)
#
# Loop over the data frames.
#
append <- FALSE
for (s in names(l.df)) {
  #
  # Create a one-row data frame for this column.
  #
  X <- data.frame(ID=s, as.list(l.df[[s]][[2]]))
  #
  # Append it to the output.
  #
  write.table(X, file=fn, sep=",", row.names=FALSE, col.names=FALSE, append=append)
  append <- TRUE
}
For example, we may prepare a set of data frames with random entries:
set.seed(17)
l.df <- lapply(1+rpois(181, 5), function(n) data.frame(X=1:n, Y=round(rnorm(n),2)))
The output file looks like this:
"1",0.37,1.61,0.02,0.51
"2",1.07,0.13,-0.55,0.34,2.24,0.41,0.26,0.13,-0.48,0.07,0.54
... (177 lines omitted)
"180",0.58,-1.5,1.85,-1.02
"181",-0.59,0.12,-0.38,-0.35,1.22,-0.63,0.81
There are many ways to solve this; I'll just propose the simplest one with base R, looping 🙌 (otherwise, work with the tidyverse).
The issue of differing df lengths (in terms of rows) can be solved by padding with NAs at the end.
I assume this is your setup:
# Your list of data frames
yourlistofdataframes <- list()
for (i in 1:181) { # R list indices run from 1 (in Python they start at 0)
  nrowofdf <- sample(1:100, 1) # random number of rows between 1 and 100
  yourlistofdataframes[[i]] <- data.frame(cbind(rep(paste0("df",i,"|column1"), nrowofdf),
                                                rep(paste0("df",i,"|column2"), nrowofdf),
                                                rep(paste0("df",i,"|column3"), nrowofdf)))
}
names(yourlistofdataframes) <- 0:180 # labeling the 181 data frames
Then this is your solution:
newlist <- list()
for (i in seq_along(yourlistofdataframes)) {
  newlist[[i]] <- unlist(yourlistofdataframes[[i]][2])
}
names(newlist) <- 0:180 # give them the names you wanted
newlist <- lapply(newlist, `length<-`, max(lengths(newlist))) # add NA's to make them equal length
# bind back to data.frame & save as csv
newdf <- data.frame(newlist) # if you want the data in 181 columns in your final df
newdft <- t(newdf) # if you want the data in 181 rows instead
write.csv(newdf, "mycsv.csv") # or write.csv(newdft, "mycsv.csv") for the row layout
Feedback on your question: if you want to ask for coding advice, post some representation of your data, so that people don't have to guess what your data looks like or build their own.
I have two large data frames (500k rows) from two separate sources without a key. Instead of merging on a key, I want to merge the two data frames by matching other columns, such as age and amount. It is not a perfect match between the two data frames, so some values will not match; I will simply remove those later.
The data could look something like this: say XXX1 and YYY3 match on age and amount. I then want to create a table matching Key 1 and Key 2, like:
[Key 1] [Key 2]
XXX1    YYY3
XXX2    N/A
XXX3    N/A
I know how to do this in Excel, but due to the large amount of data it simply crashes. I want to focus on R, but for what it is worth, this is how I built it in Excel (the idea is to first do a VLOOKUP, and then use INDEX as a VLOOKUP to get the second match if the first one does not match both criteria):
=IF(P2=0;IFNA(VLOOKUP(L2;B:C;2;FALSE);VLOOKUP(L2;G:H;2;FALSE));IF(O2=Q2;INDEX($A$2:$A$378300;SMALL(IF($L2=$B$2:$B378300;ROW($B$2:$B$378300)-ROW($B$2)+1);2));0))
And this is the approach I tried in R:
for (i in 1:nrow(df_1)) {
  for (j in 1:nrow(df_2)) {
    if (df_1$pc_age[i] == df_2$pp_age[j] && (df_1$amount[i] %in% c(df_2$amount1[j], df_2$amount2[j], df_2$amount3[j]))) {
      df_1$Key1[i] <- df_2$Key2[j]
    } else {
      df_1$Key1[i] <- NA
    }
  }
}
The problem is that this takes way, way too long. Is there a more efficient way to match this data as well as possible?
Thanks!
Create a dummy key column in both data frames, such as (I can show you for df1):
df1$key1 <- paste0("X_", seq_len(nrow(df1)))  # vectorized; a loop assigning df1$key1 each pass would overwrite the whole column
Similarly for df2 with Y_1...Y_n, and then join both data frames using merge on the columns age and amount.
Concatenate key1 and key2 into a new column in the merged data frame and you will directly get your desired data frame (a rough sketch below).
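A minimal sketch of those steps, assuming (as this answer does) that both frames share single age and amount columns to match on; the column names are assumptions mirroring the question:
# key1/key2 are the dummy key columns built as described above
df1$key1 <- paste0("X_", seq_len(nrow(df1)))
df2$key2 <- paste0("Y_", seq_len(nrow(df2)))
# all.x = TRUE keeps unmatched df1 rows, which end up with NA in key2
merged <- merge(df1, df2, by = c("age", "amount"), all.x = TRUE)
result <- merged[, c("key1", "key2")]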
Could the following code work for you?
# create random data
set.seed(123)
df1 <- data.frame(
  key_1 = as.factor(paste("xxx", 1:100, sep="_")),
  age = sample(1:100, 100, replace=TRUE),
  amount = sample(1:200, 100))
df2 <- data.frame(
  key_2 = paste("yyy", 1:500, sep="_"),
  age = sample(1:100, 500, replace=TRUE),
  amount_1 = sample(1:200, 500, replace=TRUE),
  amount_2 = sample(1:200, 500, replace=TRUE),
  amount_3 = sample(1:200, 500, replace=TRUE))
# ensure at least three matching rows
df2[10,2:3] <- df1[1,2:3]
df2[20,c(2,4)] <- df1[2,2:3]
df2[30,c(2,5)] <- df1[3,2:3]
# define comparison with df2
comp2df2 <- function(x) {
  ageComp <- df2$age == as.numeric(x[2])
  if (!any(ageComp)) {
    return(NaN)
  }
  amountComp <- apply(df2, 1, function(a) as.numeric(x[3]) %in% as.numeric(a[3:5]))
  if (!any(amountComp)) {
    return(NaN)
  }
  matchIdx <- ageComp & amountComp
  if (sum(matchIdx) > 1) {
    warning("multiple matches detected; first match is taken\n")
  }
  return(which(matchIdx)[1])
}
# run match
matchIdx <- apply(df1,1,comp2df2)
# merge
df_new <- cbind(df1[!is.na(matchIdx),],df2[matchIdx[!is.na(matchIdx)],])
I didn't have time to test it on really big data, but I'd guess this should be faster than your two for loops....
To further speed things up, you could delete the
if (sum(matchIdx) > 1) {
  warning("multiple matches detected; first match is taken\n")
}
lines if you are not worried about one row matching several others.
I've got multiple data.frames (26) in a list. The dfs have the same structure, but I would like to work with and export only two particular columns. I can export all the dfs to individual dfs:
for (i in filelist) {
  list2env(setNames(filelist, paste0("names(filelist[[i]])",
                                     seq_along(filelist))), envir = parent.frame())
}
I can delete a column from all the dfs
for(i in seq_along(filelist)){filelist[[i]]$V5 = NULL}
but I cannot export the other columns individually. From a single data.frame it simply works:
token_out_mk_totatyafiak_02.txt = out_mk_totatyafiak_02.txt["V2"]
type_out_mk_totatyafiak_02.txt = out_mk_totatyafiak_02.txt["V1"]
When I tried these:
for (i in seq_along(filelist)) { n[[i]] <- filelist[[i]]$V2 }
for (i in seq_along(filelist)) {
  sapply(filelist, function(x) n <- filelist[[i]]$V2)
}
the most I achieved was that n held, in all 26 slots, the second column of the last df.
The V2 column looks like:
V2
1 az
2 a
3 fekete
4 folt
(and so on; these are Hungarian short stories...)
Depending on your desired results, you have several options.
If you want a new list in which each data frame contains only one specific column:
new_filelist <- lapply(filelist, function(df) {
  df["V2"]
})
If you want to export one specific column of every data frame to a separate file (in this case, .txt files), the data frames in your list need to be named. In case they are not, you can replace names(filelist) with seq_along(filelist).
lapply(names(filelist), function(df) {
  df_filename <- paste0(df, ".txt")
  write.table(filelist[[df]]["V2"], df_filename)
})
If you want to assign one specific column of each data frame to new objects in your environment:
Again, this requires your data frames to be named.
lapply(names(filelist), function(df) {
  assign(df, filelist[[df]]["V2"], envir = .GlobalEnv)
})
I am trying to read over 200 CSV files, each with multiple rows and columns of numbers. It makes most sense to read each one as a separate data frame.
Ideally, I'd like to give them meaningful names. So the data frame of store 1, room 1 would be named store.1.room.1, the next store.1.room.2, and so on up to store.100.room.1, store.100.room.2, etc.
I can read each file into a specified data frame. For example:
store.1.room.1 <- read.csv(filepath,...)
But how do I dynamically create the data frame names using a for loop?
For example:
for (i in 1:100) {
  for (j in 1:2) {
    store.i.room.j <- read.csv(filepath...)
  }
}
Alternatively, is there another approach that I should consider instead of having each csv file as a separate data frame?
Thanks
You can create your data frames using read.csv as you have above, but store them in a list, then give a name to each item (i.e., each data frame) in the list:
# initialize an empty list
my_list <- list()
for (i in 1:100) {
  for (j in 1:2) {
    df <- read.csv(filename...)
    df_name <- paste("store", i, "room", j, sep=".")  # sep="." so the names match store.i.room.j
    my_list[[df_name]] <- df
  }
}
# now you can access any data frame you wish, e.g. my_list$store.1.room.1 or my_list[["store.1.room.1"]]
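For the elided read.csv(filename...) part, the path can be built inside the loop in the same spirit; the naming scheme below is purely hypothetical, so adjust it to your actual files:
# hypothetical file naming scheme -- substitute your real pattern
filename <- sprintf("store-%d-room-%d.csv", i, j)
df <- read.csv(filename)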
I'm not sure whether I am answering your question, but you would never want to store those CSV files into separate data frames. What I would do in your case is this:
set <- data.frame()
for (i in 1:100) {
  ## calculate filename here
  current.csv <- read.csv(filename)
  current.csv <- cbind(current.csv, index = i)
  set <- rbind(set, current.csv)
}
An additional index column identifies which csv file each measurement came from.
EDIT:
This is useful for applying tapply to particular vectors of your data.frame (see the sketch after the next snippet). Also, in case you'd like to keep the measurements of only one csv (let's say the one indexed by 5), you can enter:
single.data.frame <- set[set$index == 5, ]
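As an example of the tapply idea mentioned above, a per-file summary could look like this (value is an assumed measurement column in your CSVs):
# mean of a hypothetical "value" column, computed separately per source file
tapply(set$value, set$index, mean)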
I have a list object that contains nested lists, each of which includes a data frame. The code below simulates my data structure:
## simulate my data structure -- list of data frames
mylist <- list()
for (i in 1:5) {
  tmp <- list(data = data.frame(x = sample(1:5, replace=T), y = sample(6:10, replace=T)))
  mylist <- c(mylist, tmp)
}
I am looking to row-bind all of my data frames in order to create one master data frame. Currently I use a for loop for this:
## goal: better way to combine row bind data frames
## I like rbind.fill because sometimes my data are not as clean as desired
library(plyr)
df <- data.frame(stringsAsFactors=F)
for (i in 1:length(mylist)) {
  tmp <- mylist[i]$data
  df <- rbind.fill(df, tmp)
}
In reality, my master list is quite large - length of 3700, not 5 - so my for loop is quite slow.
Is there a faster way to complete the same task?
With plyr this is a one-liner:
ldply(mylist, data.frame)
# if you don't need the id column,
ldply(mylist, data.frame)[,-1]
# If you want a progress bar for the larger operation, add .progress
ldply(mylist, data.frame, .progress = 'text')
# See ?create_progress_bar for more options.
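Since rbind.fill is already in play, another option (just a sketch, untested at length 3700) is to pass the whole list to it in a single do.call, which avoids re-copying the growing df on every iteration:
library(plyr)
# unname() drops the repeated "data" element names so they are not passed
# as argument names; do.call() then runs rbind.fill(df1, df2, ...) in one pass.
df <- do.call(rbind.fill, unname(mylist))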