Suppose I have a list object like this:
set.seed(123)
df <- data.frame(x = rnorm(5), y = rbinom(5,2,0.5))
rownames(df) <- LETTERS[1:5]
ls <- list(df1 = df, df2 = df, df3 = df)
My question is how to quickly check whether the row names are identical across the three elements (data frames) in ls.
You can try
all(sapply(ls, rownames) == rownames(ls[[1]]))
To check only the name of the ith row, you can modify this to
all(sapply(ls, rownames)[i, ] == rownames(ls[[1]])[i])
You can get a list of row names with:
Map(rownames, ls)
so you can check that all the data frames have the same row names by testing that this list contains only one unique row-names vector:
length(unique(Map(rownames, ls))) == 1
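The `==` comparison above relies on every data frame having the same number of rows; otherwise `sapply` returns a list and the comparison no longer means what it appears to. A minimal sketch of a stricter check, assuming a hypothetical helper named `same_rownames` (not part of the original answers), compares each element against the first with `identical`:

```r
# Hypothetical helper: TRUE when every data frame in `lst` has exactly the
# same row names (same values, same order, same length) as the first one.
same_rownames <- function(lst) {
  ref <- rownames(lst[[1]])
  all(vapply(lst, function(d) identical(rownames(d), ref), logical(1)))
}

df <- data.frame(x = 1:5, y = 6:10)
rownames(df) <- LETTERS[1:5]
dfs <- list(df1 = df, df2 = df, df3 = df)
all_same <- same_rownames(dfs)

# Changing a single row name in one element makes the check fail.
rownames(dfs$df3)[5] <- "Z"
after_change <- same_rownames(dfs)
```

Because `identical` compares whole vectors, this also reports a mismatch when one data frame has more or fewer rows than the others.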
I'm fairly new to R and I was wondering if someone could help me?
I have a list of identical data frames (df1, df2, ..., df9) and I'm trying to rename one of the columns, 'value', in all the data frames to be 'value_dataframename'- the renamed column in all 9 data frames should be value_df1 in df1, value_df2 in df2, ..., value_df9 in df9.
Any help would be much appreciated!
The code below does what you want, using an example list (auto.list). Run it to check.
To use it for your own list:
skip the code up to the your.list <- ... line,
save your list as the your.list object,
assign your column name ("value") to term.
auto.list <- list()
for (i in seq_len(10)) {
auto.list[[i]] <- data.frame("a" = 1:i, "value" = sample(letters, i))
names(auto.list)[i] <- paste0("df", i)
}
your.list <- auto.list # assign to your.list your own list
term <- "value" # assign your own "value"
for (i in seq_along(your.list)) {
colnames(your.list[[i]])[colnames(your.list[[i]]) == term] <- paste0(term, "_", names(your.list)[i])
}
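The same renaming can be written without an explicit loop. This is only a sketch: the helper name `rename_by_list_name` and the small `example.list` are made up for illustration, and it assumes the list is named (df1, df2, ...) as in the question.

```r
# Hypothetical helper: rename column `term` to `term_<list name>` in each
# data frame of a named list.
rename_by_list_name <- function(lst, term) {
  Map(function(d, nm) {
    names(d)[names(d) == term] <- paste0(term, "_", nm)
    d
  }, lst, names(lst))
}

example.list <- list(df1 = data.frame(a = 1:2, value = c("x", "y")),
                     df2 = data.frame(a = 1:3, value = c("p", "q", "r")))
renamed <- rename_by_list_name(example.list, "value")
names(renamed$df1)
names(renamed$df2)
```

`Map` passes each data frame together with its list name, so no index bookkeeping is needed, and the original list is left unchanged.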
Try this out:
## these two are my sample data frames for this example
df_1 <- data.frame(first = rbinom(10,size = 2,prob = 0.3), second = rnorm(10))
df_2 <- data.frame(first = rbinom(10,size = 2,prob = 0.3), second = rnorm(10))
# R stores data frames as lists, so you can retrieve the names of all your data frames like this:
all_df_names = ls.str(mode = "list")
# to check: all_df_names[1] - the first element - will give you "df_1", which is the name of the first data frame
# be careful though - 'ls.str(mode = "list")' will pick ALL the lists currently in your environment
# if you don't want to use this ls method, it might be wiser to manually create a variable 'all_df_names' and put all your data frame names there yourself.
# rename
for(i in seq_along(all_df_names)) {
# get the actual content via its variable name, and store it in a temporary variable 'x'
x = get(all_df_names[i])
# rename the column you want
names(x)[2] = paste0(names(x)[2], "_", i) # this will replace the column with the previous name plus a '_' and the current iteration
# resave that dataframe, with the new content
assign(all_df_names[i], x)
}
# to remove variables we no longer need when done:
# rm(x, i)
# confirm
# names(df_1) = "first" "second_1"
# names(df_2) = "first" "second_2"
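As the comments above warn, picking up every list in the environment can catch objects you did not intend. One possible way around that, sketched here under the assumption that the data frames share a "df_" name prefix, is to collect them into a single list with `mget` and keep only real data frames:

```r
df_1 <- data.frame(first = 1:3, second = c(0.1, 0.2, 0.3))
df_2 <- data.frame(first = 4:6, second = c(0.4, 0.5, 0.6))

# Look up objects in the current environment whose names start with "df_"
# (the prefix is an assumption about the naming scheme), then drop anything
# that is not a data frame.
env <- environment()
candidates <- mget(ls(env, pattern = "^df_"), envir = env)
df_list <- Filter(is.data.frame, candidates)

# Rename the second column of each data frame using its object name.
for (nm in names(df_list)) {
  names(df_list[[nm]])[2] <- paste0(names(df_list[[nm]])[2], "_", nm)
}
names(df_list$df_1)
names(df_list$df_2)
```

Working on a list keeps the renamed copies in one place and avoids `get`/`assign`; the original `df_1` and `df_2` objects are not modified.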
I am writing a function to process data from a huge dataframe (row by row) which always has the same column names. So I want to pass the dataframe itself as an argument to a function that reads out the information I need from the individual rows. However, when I try to use it as an argument, I can't read the information from it for some reason.
Dataframe:
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
My code:
List <- do.call(list, Map(function(DT) {
DT <- as.data.frame(DT)
aa <- as.numeric(strsplit(DT$Age, ","))
mean.aa <- mean(aa)
},
DF))
Trying this, I get a list with the column names, but all values are NULL.
Expected output :
My expected output is a list with length equal to the number of rows in the data frame. Under each list index there should be another list with the age of the corresponding row (and also other stuff from the same row of the data table, later).
DF <- apply(data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"), "mean.aa" = c(179.7143, 100.8571)), 1, as.list)
What am I doing wrong?
Here is one way:
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
apply(DF, 1, function(row){
aa <- as.numeric(strsplit(row["Age"], ",")[[1]])
row["mean.aa"] <- mean(aa)
as.list(row)
})
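Note that `apply` first converts the data frame to a character matrix, so every value in the resulting lists (including SN and mean.aa) is a string. If the original column types matter, a sketch along the following lines iterates over row indices instead; the variable names `rows` and `ages` are just illustrative.

```r
DF <- data.frame(Name = c("A", "B"), SN = 1:2,
                 Age = c("21,34,456,567,23,123,34",
                         "15,345,567,3,23,45,67,76,34,34,55,67,78,3"),
                 stringsAsFactors = FALSE)

# One list per row; each column keeps its own type.
rows <- lapply(seq_len(nrow(DF)), function(i) {
  row <- as.list(DF[i, ])
  ages <- as.numeric(strsplit(row$Age, ",")[[1]])
  row$mean.aa <- mean(ages)   # stored as a number, not a string
  row
})

str(rows[[1]])
```

Here SN stays an integer and mean.aa stays numeric, which is convenient if more computations on the same row are added later.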
I already have a list of data frames (mylist) and need to switch the first and second column for all the data frames in the list.
Test Data Frame in List
[reads] [phylum]
1 phylum1
2 phylum2
3 phylum3
Into....
[phylum] [reads]
phylum1 1
phylum2 2
phylum3 3
I know I need to use lapply, but not sure what to input for the FUN=
mylist <- lapply(mylist, FUN = mylist[ ,c("phylum", "reads")])
This errors, saying "incorrect number of dimensions".
Sorry if this is a simple question and thanks in advance for your help!
-Brand new R user
The FUN argument expects a function that lapply can apply to every element in the list. You are passing mylist[ ,c("phylum", "reads")], which is not a function.
# sample data
df1 <- data.frame(reads = sample(10,4), phylum = sample(10,4))
df2 <- data.frame(reads = sample(10,4), phylum = sample(10,4))
df3 <- data.frame(reads = sample(10,4), phylum = sample(10,4))
df4 <- data.frame(reads = sample(10,4), phylum = sample(10,4))
ldf <- list(df1,df2,df3,df4)
ldf_re <- lapply(ldf, FUN = function(X){X[c('phylum', 'reads')]})
In the last line, lapply iterates through all the data frames, passing each one as the X argument to the function defined in FUN. The resulting data frames, with their columns rearranged, are stored in the list ldf_re.
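If the column names are not known in advance, the swap can also be done by position. This is a minimal sketch with a hypothetical helper `swap_first_two`; it assumes every data frame has at least two columns.

```r
# Hypothetical helper: swap columns 1 and 2, leaving any further columns
# in their original order.
swap_first_two <- function(d) {
  d[, c(2, 1, seq_len(ncol(d))[-(1:2)]), drop = FALSE]
}

mylist <- list(
  data.frame(reads = 1:3, phylum = c("phylum1", "phylum2", "phylum3")),
  data.frame(reads = 4:5, phylum = c("phylum4", "phylum5"), extra = c(TRUE, FALSE))
)
swapped <- lapply(mylist, swap_first_two)
names(swapped[[1]])
names(swapped[[2]])
```

Passing the function by name works because `lapply` only needs something it can call on each element.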
I implemented some code from a previous question:
Lapply to Add Columns to Each Dataframe in a List
Using the method above, I receive corrupt data. While I cannot provide actual data, I am wondering if additional arguments need to be implemented to prevent shuffling.
Basically, this:
library(data.table)
df1 <- data.frame(x = runif(3), y = runif(3))
df2 <- data.frame(x = runif(3), y = runif(3))
dfs <- list(df1, df2)
years <- list(2013, 2014)
a<-Map(cbind, dfs, year = years)
final<-rbindlist(a)
But when applied to a list of thousands of data frames, it gives incorrect results. Assume that some data frames, say a df 1.5 somewhere between the two data frames above, are empty. Would that affect the order in which Map binds the years to the dfs? Essentially, I have an output in which some data belong to different years than the ones Map attached to them. I tested the length and order of the years list and compared them to the year column in final. They are identical. Any thoughts?
We create a logical index based on the length of each element in 'dfs', use that to subset both 'dfs' and 'years', and then do the cbind with Map:
i1 <- sapply(dfs, length)>1
Or, to make it more stringent:
i1 <- sapply(dfs, function(x) is.data.frame(x) & !is.null(x) & length(x) >0 )
a <- Map(cbind, dfs[i1], year = years[i1])
and then do the rbindlist with fill = TRUE in case the number of columns is not the same in all the data.frames in the list.
rbindlist(a, fill = TRUE)
data
dfs[[3]] <- list(NULL)
dfs[[4]] <- data.frame()
years <- 2013:2016
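The filtering idea can be tried end to end on a small made-up input. The sketch below is not the code from the answer: it tests for "is a data frame with at least one row" rather than using `length`, and it uses base R's `do.call(rbind, ...)` as a stand-in for `rbindlist` so that it runs without data.table (all kept data frames here share the same columns).

```r
# Made-up input: two usable data frames with a NULL and an empty
# data frame in between.
dfs <- list(data.frame(x = 1:2, y = 3:4),
            NULL,
            data.frame(),
            data.frame(x = 5:6, y = 7:8))
years <- c(2013, 2014, 2015, 2016)

# Keep only elements that are data frames with at least one row, and
# apply the same index to years so the two stay aligned.
keep <- vapply(dfs, function(d) is.data.frame(d) && nrow(d) > 0, logical(1))
a <- Map(cbind, dfs[keep], year = years[keep])

# Base R stand-in for rbindlist(a)
final <- do.call(rbind, a)
final
```

Subsetting `dfs` and `years` with the same logical vector is what guarantees that each remaining data frame is paired with its own year.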
Use the idcol argument to rbindlist and add the year column afterwards:
res = rbindlist(dfs, idcol=TRUE)
res[.(.id = 1:2, year = 2013:2014), on=".id", year := i.year]
X[i, on=cols, z := i.z] merges X with i on cols and then copies z from i to X.
I have a dataset where I only want to loop through certain columns in a dataframe one at a time to create a graph. The structure of my dataframe consists of data that I parsed from a larger dataset into a vector containing multiple dataframes.
I want to call one column from one dataframe in the vector. I want to loop on the dataframes to call each column.
See example below:
d1 <- data.frame(y1=c(1,2,3),y2=c(4,5,6))
d2 <- data.frame(y1=c(3,2,1),y2=c(6,5,4))
my.list <- list(d1, d2)
All I have to work with is my.list
How would I do this?
You can use lapply to plot each of the individual data frames in your list. For example,
d1 <- data.frame(y1=c(1,2,3),y2=c(4,5,6),y3=c(7,8,9))
d2 <- data.frame(y1=c(3,2,1),y2=c(6,5,4),y3=c(11,12,13))
mylist <- list(d1, d2)
par(mfrow=c(2,1))
# lapply on a subset of columns
lapply(mylist, function(x) plot(x$y2, x$y3))
You don't need a for loop to get their data points. You can refer to the columns by name.
# a toy dataframe
d <- data.frame(A = 1:20, B = sample(c(FALSE, TRUE), 20, replace = TRUE),
C = LETTERS[1:20], D = rnorm(20, 0, 1))
col_names <- c("A", "B", "D") # names of columns I want to get
d[,col_names] # returns a dataset with the values of the columns you want
Here is a solution to your problem using a for loop:
# a toy dataframe
mylist <- list(dat1 = data.frame(A = 1:20, B = LETTERS[1:20]),
dat2 = data.frame(A = 21:40, B = LETTERS[1:20]),
dat3 = data.frame(A = 41:60, B = LETTERS[1:20]))
col_names <- c("A") # name of columns I want to get
for (i in 1:length(mylist)){
# you can do whatever you want with what is returned;
# here I am just printing them out
print(names(mylist)[i]) # name of the data frame
print(mylist[[i]][,col_names]) # values in Column A
}
I think the simplest answer to your question is to use double brackets.
for (i in 1:length(my.list)) {
print(my.list[[i]]$column)
}
That works assuming all of the data frames in your list have the same column names. You could also refer to the column by its position in the data frame if you wanted.
Yes, lapply can be more elegant, but in some situations a for loop makes more sense.
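To tie these answers back to the my.list from the question, here is a small sketch that visits selected columns of every data frame one at a time. The vector `wanted` is an assumption about which columns are of interest, and the `cat` call is a placeholder for whatever plotting or processing is needed.

```r
d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))
my.list <- list(d1, d2)

wanted <- c("y1", "y2")  # columns to visit (assumed for this example)

# Outer loop over data frames, inner loop over the chosen columns.
for (i in seq_along(my.list)) {
  for (col in wanted) {
    values <- my.list[[i]][[col]]  # one column as a plain vector
    cat("data frame", i, "column", col, ":", values, "\n")
  }
}

# The lapply equivalent for a single column: one vector per data frame.
y2_by_df <- lapply(my.list, function(d) d[["y2"]])
```

Using `[[col]]` with a character name, rather than `$`, lets the column name come from a variable.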