I am cleaning several Excel files in R. Unfortunately, they have unequal dimensions (both rows and columns). Currently I am storing each Excel sheet as a data frame in a list. I know how to print the 4th row of the first data frame in a list by issuing this command:
df.list1[[1]][4,]
Or a range of rows like this:
df.list1[[1]][1:10,]
My question is: How do I print a particular row for every data frame in the list? In other words:
df.list1[[i]][4,]
df.list1 has 30 data frames in it, but my other df.lists have over 140 data frames whose rows I want to extract. I'd like to be able to store particular locations across several data frames in a new list. I'm thinking the solution might involve lapply.
Furthermore, is there a way to extract rows in every data frame in a list based on a condition? For example, for all 30 data frames in the list df.list1, extract the row if the value is equal to "Apartment" or some other string of characters.
Appreciate your help, please let me know if I can help clarify my problem.
You could also just directly lapply the extraction function @Justin suggests, e.g.:
# example data of a list containing 10 data frames:
test <- replicate(10,data.frame(a=1:10),simplify=FALSE)
# extract the fourth row of each one - setting drop=FALSE means you get a
# data frame returned even if only one vector/column needs to be returned.
lapply(test,"[",4,,drop=FALSE)
The format is:
lapply(listname,"[",rows.to.return,cols.to.return,drop=FALSE)
# the example returns the fourth row only from each data frame
#[[1]]
# a
#4 4
#
#[[2]]
# a
#4 4
# etc...
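As a small sketch of the general format (the data frames and column names below are made up for illustration), rows and columns can be selected together, and drop=FALSE matters most when only a single column is requested:

```r
# Hypothetical example data: three data frames with two columns each.
test2 <- replicate(3, data.frame(a = 1:10, b = letters[1:10]), simplify = FALSE)

# Rows 2 and 3, columns "a" and "b", from every data frame.
sub <- lapply(test2, "[", 2:3, c("a", "b"), drop = FALSE)

# Row 4 of column "a" only: with drop = FALSE the result is still a data frame.
one_cell_df <- lapply(test2, "[", 4, "a", drop = FALSE)

# Without drop = FALSE the same call collapses to a plain vector.
one_cell_vec <- lapply(test2, "[", 4, "a")
```

Keeping drop=FALSE everywhere means every list element has the same type, which tends to make later steps simpler.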
To generalise this to an extraction based on a condition, you have to change it slightly. The example below extracts all rows where a is greater than 4 in each data.frame. In this case, an anonymous function is probably the clearest method, e.g.:
lapply(test, function(x) with(x,x[a>4,,drop=FALSE]) )
#[[1]]
# a
#5 5
#6 6
#7 7
#8 8
#9 9
#10 10
# etc...
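The same pattern should carry over to the string condition in the question. This is only a sketch: the column names type and rent and the example data are assumptions, since the real sheets are not shown.

```r
# Hypothetical example data: two data frames with a character column `type`.
df.list1 <- list(
  data.frame(type = c("Apartment", "House", "Apartment"),
             rent = c(900, 1500, 1100), stringsAsFactors = FALSE),
  data.frame(type = c("House", "Condo"),
             rent = c(1400, 1200), stringsAsFactors = FALSE)
)

# Keep rows whose `type` is exactly "Apartment".
# which() drops NA comparisons instead of returning all-NA rows.
apartments <- lapply(df.list1, function(x) {
  x[which(x$type == "Apartment"), , drop = FALSE]
})

# Variant: keep rows where ANY column equals "Apartment".
any_match <- lapply(df.list1, function(x) {
  hit <- rowSums(x == "Apartment", na.rm = TRUE) > 0
  x[hit, , drop = FALSE]
})
```

Data frames with no matching rows come back as zero-row data frames, so the result keeps one element per input data frame.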
There is no need for a wrapper function; just use lapply and pass it a blank argument at the end (to represent the columns):
lapply(df.list, `[`, 4, )
This also works with any type of row argument that you would normally use in myDF[ . , ], e.g.: lapply(df.list, `[`, c(2, 4:6), ).
I would suggest that if you are going to use a wrapper function, have it work more like [ does, e.g.:
Grab(df.list, 2:3, 1:5) would select the second and third rows and the first through fifth columns of every data.frame, and
Grab(df.list, 2:3) would select the second and third rows of all columns.
Grab <- function(ll, rows, cols) {
  if (missing(cols))
    lapply(ll, `[`, rows, )
  else
    lapply(ll, `[`, rows, cols)
}
Grab(df.list, 2:3)
My suggestion is to write a function that does what you want on a single data frame:
myfun <- function(dat) {
return(dat[4, , drop=FALSE])
}
If you want to return a vector instead of a data.frame, just do: return(dat[4, ]) instead. Then use lapply to apply that function to each element of your list:
lapply(df.list1, myfun)
With that technique, you can easily come up with ways to extend myfun to more complex functions...
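One possible extension, given that the data frames have unequal dimensions: indexing a row that does not exist returns a row of NAs rather than an error. The sketch below (row_or_null is a hypothetical helper name, and the example list is made up) returns NULL for data frames that are too short, so those can be filtered out afterwards.

```r
# Return row i as a one-row data frame, or NULL when the data frame is too short.
row_or_null <- function(dat, i) {
  if (i > nrow(dat)) return(NULL)
  dat[i, , drop = FALSE]
}

# Hypothetical list with data frames of unequal size.
df.list1 <- list(data.frame(a = 1:10),
                 data.frame(a = 1:2, b = c("x", "y")))

# One result per data frame; NULL marks the ones without a fourth row.
fourth <- lapply(df.list1, row_or_null, i = 4)

# Drop the NULL entries if only the existing rows are wanted.
fourth_present <- Filter(Negate(is.null), fourth)
```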
For example, you have a .csv file called hw1_data.csv and you want to retrieve the 47th row. Here is how to do that:
x<-read.csv("hw1_data.csv")
x[47,]
If it is a text file you can use read.table.
Related
Say I have 10 dataframes. I would like to check if all have same column names irrespective of their cases.
I can do this in multiple steps, but I was wondering if there is a shortcut way to do this?
We place the datasets in a list, loop over the list with lapply, get the column names, convert them to a single case, take the unique sets of names, and check whether the length is 1:
length(unique(lapply(lst1, function(x) sort(toupper(names(x)))))) == 1
#[1] TRUE
data
lst1 <- list(mtcars, mtcars, mtcars)
You can use Reduce + intersect to get all the common column names in the list of dataframes and compare them with the names of any single dataframe in the list. Note that this compares the names as they are; wrap names in toupper or tolower to ignore case.
all(sort(Reduce(intersect, lapply(list_df, names))) == sort(names(list_df[[1]])))
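A case-insensitive check can also be written as a small helper. This is a sketch rather than the approach of either answer: same_names is a hypothetical function name, and it compares every data frame's sorted, lower-cased names against those of the first one, so a data frame with extra columns is reported as different.

```r
# TRUE when all data frames have the same column names, ignoring case and order.
same_names <- function(dfs) {
  keys <- lapply(dfs, function(x) sort(tolower(names(x))))
  all(vapply(keys[-1], identical, logical(1), keys[[1]]))
}

# Hypothetical example data.
d1 <- data.frame(ID = 1, Val = 2)
d2 <- data.frame(val = 3, id = 4)
d3 <- data.frame(id = 5, val = 6, extra = 7)

ok  <- same_names(list(d1, d2))      # same names apart from case and order
bad <- same_names(list(d1, d2, d3))  # d3 has an additional column
```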
T12 is a data frame with 22 columns (but I only want columns 2 to 8) and about one million entries.
Some of the entries in the first column are NA. Every time there is an NA in the first column, complete.cases deletes the whole row. Everything works well.
I have a lot more data frames and I don't want to write the whole code again for every data frame.
I would like to have something like this function and pass T12, T13, T14, T15 and so on as x.
Could you help me?
split <- function (x){
x <- x[,2:8]
x <- x[complete.cases(x[ ,1]),]
}
If you have dataframes named "T12", "T13" etc., you can use the pattern "T" followed by a number to capture the names of all such dataframes in a character vector using ls.
Using mget you can fetch the dataframes named in that character vector as a named list.
You can then use lapply to apply the split function to each element of the list.
new_data <- lapply(mget(ls(pattern = 'T\\d+')), split)
new_data is a list of dataframes. If you want these changes reflected in the original dataframes, use list2env.
list2env(new_data, .GlobalEnv)
PS - split is a default function in R, so it is better to give some different name to your function.
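A renamed version of the function might look like the sketch below. clean_cols is a hypothetical name, and the small T12 and T13 data frames are made up so the example runs on its own; the list is built by hand here, but a named list returned by mget should behave the same way.

```r
# Keep columns 2 to 8, then drop rows whose first remaining column is NA.
clean_cols <- function(x) {
  x <- x[, 2:8]
  x[complete.cases(x[, 1]), ]
}

# Hypothetical stand-ins for the real data frames (3 rows, 8 columns each).
T12 <- data.frame(matrix(1:24, nrow = 3, ncol = 8))
T13 <- data.frame(matrix(101:124, nrow = 3, ncol = 8))
T12[2, 2] <- NA  # an NA in original column 2, i.e. the first kept column

# Apply the same cleaning step to every data frame in a named list.
new_data <- lapply(list(T12 = T12, T13 = T13), clean_cols)
```

Returning the subset as the last expression, rather than ending on an assignment, also makes the function print its result when called directly.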
I'm extremely new to R and could use some help.
I'm trying to split a data frame into a list of data frames that consists of every possible pair of column 1 with each subsequent column.
For example, given the following data:
df <- data.frame ("Time" = c("Mon","TUE", "WED"), VarA = c(2,5,6), VarB = c(24,46,14))
I'd like to end up with two data frames contained within a list. The first would be columns "Time" and "VarA", and the second would be columns "Time" and "VarB".
Ideally the function that creates this list would be scalable for hundreds of time-variable pairs. The end goal is to have a list of data frames so that I can use lapply to run various calculations on the data.
I think I can use split.data.frame to turn my original data frame into a list of specific subsets of the original data frame, but I'm having trouble getting the arguments right.
You want each column to be an individual data frame?
lapply(2:ncol(df), function (j) df[c(1, j)])
A solution with split does no good here. If you want to split off every single column, the work that split does is just overhead. Learn more about split from "What is the algorithm behind R core's `split` function?"
If you have difficulty understanding the code, do it in two steps.
# define a function
f <- function (j) df[c(1, j)]
# try the function to see what it does
f(2)
f(3)
# use a lapply loop
result <- lapply(2:ncol(df), f)
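With hundreds of variables it may help to name the list elements after the variable they hold. A sketch of that, using the df from the question and selecting columns by name instead of by position:

```r
df <- data.frame(Time = c("Mon", "TUE", "WED"),
                 VarA = c(2, 5, 6),
                 VarB = c(24, 46, 14))

# Every column except the first is paired with "Time".
vars <- names(df)[-1]

# One data frame per variable, stored under the variable's name.
result <- setNames(lapply(vars, function(v) df[c("Time", v)]), vars)

# Elements can then be accessed by name, e.g. result$VarB or result[["VarB"]].
```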
I'm lost with the following object in R:
# create a list of filenames
files <- list.files("directory", full.names = TRUE)
# read all files as csv
data <- lapply(files, function(x) (data.frame(read.csv(x))))
That's fine, but I have no idea what the type of data is and how to get my hands on it. Let's have a look:
data[1]
[[1]]
Date value1 value2 ID
1 2003-01-01 NA NA 1
2 2003-01-02 NA NA 1
...
OK, that looks like a data frame (that's also what I intended when I did data.frame(read.csv(x))). I wanted a list of data frames. Unfortunately, when I ask
typeof(data[1])
[1] "list"
R claims data[1] to be a list. Why? I have now figured out that data[[1]] gives access to the data.frame as intended. But I could not figure out how to apply operations to the data frames packed in data. For instance, I would like to keep only those elements of data whose data frame has more than 100 rows. I tried
lapply(data, Filter, f = function(x) (nrow(data.frame(x))>100))
but this just gives back a list of the same length as data which contains for instance
[[1]]
data frame with 0 columns and 1461 rows
Basically I have three questions:
Why do I get a list of lists instead of a list of data frames?
Could I convert this list of lists into, let's say, a vector of data frames?
How could I subset the list in the way described above (for instance, get all frames with more than 1000 rows)?
Ad1:
This is pretty basic stuff: the [ operator does not select a single element from a list - it returns a subset of a list. For a single element use [[ or $. So the answer to the first question is: you do get a list of data frames.
Ad2: You can't have vectors of data frames.
Ad3: lapply needs a FUN argument. But even if used correctly, lapply will produce some output element for every input element in a list. For filtering use Filter, in your case: Filter(function(x) nrow(x) > 100, data)
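The three points can be checked on a small example. The sketch below uses built-in datasets in place of the CSV files, and the list name frames is only illustrative:

```r
# Stand-in for the list of data frames read from files.
frames <- list(mtcars, iris, head(cars, 10))

# `[` returns a sub-list, `[[` returns the element itself.
class(frames[1])    # "list"
class(frames[[1]])  # "data.frame"

# Keep only the data frames with more than 100 rows.
big <- Filter(function(x) nrow(x) > 100, frames)

# Equivalent logical indexing on the list.
big2 <- frames[vapply(frames, nrow, integer(1)) > 100]
```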
How would I go about taking elements of a list and making them into dataframes, with each dataframe name consistent with the list element name?
Ex:
exlist <- list(west=c(2,3,4), north=c(2,5,6), east=c(2,4,7))
Where I'm tripping up is in the actual naming of the unique dataframes -- I can't figure out how to do this with a for() loop or with lapply:
for(i in exlist) {
i <- data.frame(exlist$i)
}
gives me an empty dataframe called i, whereas I'd expect three dataframes to be made (one called west, another called north, and another called east)
When I use lapply syntax and call the individual list element name, I get empty dataframes:
lapply(exlist, function(list) i <- data.frame(list["i"]))
yields
data frame with 0 columns and 0 rows
> $west
list..i..
1 NA
$north
list..i..
1 NA
$east
list..i..
1 NA
If you want to convert your list elements to data.frames, you can try either
lapply(exlist, as.data.frame)
Or (as suggested by @Richard), depending on your desired output:
lapply(exlist, as.data.frame.list)
It is always recommended to keep multiple data frames in a list rather than polluting your global environment, but if you insist on doing this, you could use list2env (don't do this), such as:
list2env(lapply(exlist, as.data.frame.list), .GlobalEnv)
This should create the three objects you want:
df.names <- "value" ## vector with column names here
for (i in names(exlist)) assign(i, setNames(data.frame(exlist[[i]]), df.names))
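If the data frames can stay in a list, a sketch like the following gives each one a named column without creating objects in the global environment. The column name value is taken from df.names above; the list name dfs is just illustrative.

```r
exlist <- list(west = c(2, 3, 4), north = c(2, 5, 6), east = c(2, 4, 7))

# One single-column data frame per list element; the list names are kept.
dfs <- lapply(exlist, function(v) data.frame(value = v))

# Access by name instead of relying on separate global objects.
west_df <- dfs$west
```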