Make dataframe from a list of lists in R - r

I would like to make a dataframe from a list of n. Each list contains 3 different list inside. I am only intrested in 1 list of those 3 list inside. The list I am intrested in is a data.frame with 12 obs of 12 variables.
My imput tmp in my lapply function is a list of n with each 5 observations.
2 of those observations are the Latitude and Longitude. This is how my lapply function looks like:
DF_Google_Places<- lapply(tmp, function(tmp){
Latitude<-tmp$Latitude
Longitude<-tmp$Longitude
LatLon<- paste(Latitude,Longitude, sep=",")
res<-GET(paste("https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=",LatLon,"&radius=200&types=food&key=AIzaSyDS6usHdhdoGIj0ILtXJKCjuj7FBmDEzpM", sep=""))
jsonAnsw<-content(res,"text")
myDataframe<- jsonlite::fromJSON(content(res,"text"))
})
My question is: how do I get this list of 12 obs of 12 variables into a dataframe from a list of n ?
Could anyone help me out?, Thanks

I'm just posting my comment as an answer so I can show output to show you the idea:
x <- list(a=list(b=1,c=2),d=list(b=3,c=4))
So x is a nested list structure, in this case with consistent naming / structure one level down.
> x
$a
$a$b
[1] 1
$a$c
[1] 2
$d
$d$b
[1] 3
$d$c
[1] 4
Now we'll use do.call to build the data.frame. We need to pass it a named list of arguments, so we'll use list(sapply to get the named list. We'll walk the higher level of the list by position, and the inner level by name since the names are consistent across sub-lists at the inner level. Note here that the key idea is essentially to reverse what would be the intuitive way of indexing; since I want to pull observations at the second level from across observations at the first level, the inner call to sapply traverses multiple values of level one for each value of the name at level two.
y <- do.call(data.frame,
list(sapply(names(x[[1]]),
function(t) sapply(1:length(x),
function(j) x[[j]][[t]]))))
> y
b c
1 1 2
2 3 4
Try breaking apart the command to see what each step does. If there is any consistency in your sub-list structure, you should be able to adapt this approach to walk that structure in the right order and fold the data you need.
On a large dataset, this would not be efficient, but for 12x12 it should be fine.

Related

Split a list of elements into two unique lists (and get all combinations) in R

I have a list of elements (my real list has 11 elements, this is just an example):
x <- c(1, 2, 3)
and want to split them into two lists (using all entries) but I want to get all possible combinations of that list to be returned e.g.:
(1,2)(3) & (1)(2,3) & (2)(1,3)
Does anyone know an efficient way to do this for a more complex list?
Thanks in advance for your help!
List with 3 elements:
vec <- 1:3
Note that for each element we have two possibilities: it is either in 1st split or in 2nd split. So we define a matrix of all possible splits (in rows) using expand.grid which produces all possible combinations:
groups <- as.matrix(expand.grid(rep(list(1:2), length(vec))))
However This will treat scenarios where the groups are flipped as different splits. Also will include scenarios where all the observations are in the same group (but there will only be 2 of them).
If you want to remove them we need to remove the lines from groups matrix that only have one group (2 such lines) and all the lines that split the vector in the same way, only switching the groups.
One-group entries are on top and bottom so removing them is easy:
groups <- groups[-c(1, nrow(groups)),]
Duplicated entries are a bit trickier. But note that we can get rid fo them by removing all the rows where the first group is 2. In effect this will make a requirement that the first element is always assigned to group 1.
groups <- groups[groups[,1]==1,]
Then the job is to split the list we have using each of the rows in the groups matrix. For that we use Map to call split() function on our list vec and each row of groups matrix:
splits <- Map(split, list(vec), split(groups, row(groups)))
> splits
[[1]]
[[1]]$`1`
[1] 1 3
[[1]]$`2`
[1] 2
[[2]]
[[2]]$`1`
[1] 1 2
[[2]]$`2`
[1] 3
[[3]]
[[3]]$`1`
[1] 1
[[3]]$`2`
[1] 2 3

Adding a new column in R based on maximum occurrence of words from a CSV

I am working with two CSV files. They are formatted like this:
File 1
able,2
gobble,3
highway,3
test,6
zoo,10
File 2
able,6
gobble,10
highway,3
speed,7
test,8
upper,3
zoo,10
In my program I want to do the following:
Create a keyword list by combining the values from two CSV files and keeping only unique keywords
Compare that keyword list to each individual CSV file to determine the maximum number of occurences of a given keyword, then append that information to the keyword list.
The first step I have done already.
I am getting confused by R reading things as vectors/factors/data frames etc...and "coercion to lists". For example in my files given above, the maximum occurrence for the word "gobble" should be 10 (its value is 3 in file 1 and 10 in file 2)
So basically two things need to happen. First, I need to create a column in "keywords" that holds information about the maximum number of occurrences of a word from the CSV files. Second, I need to populate that column with the maximum value.
Here is my code:
# Read in individual data sets
keywordset1=as.character(read.csv("set1.csv",header=FALSE,sep=",")$V1)
keywordset2=as.character(read.csv("set2.csv",header=FALSE,sep=",")$V1)
exclude_list=as.character(read.csv("exclude.csv",header=FALSE,sep=",")$V1)
# Sort, capitalize, and keep unique values from the two keyword sets
keywords <- sapply(unique(sort(c(keywordset1, keywordset2))), toupper)
# Keep keywords greater than 2 characters in length (basically exclude in at etc...)
keywords <- keywords[nchar(keywords) > 2]
# Keep keywords that are not in the exclude list
keywords <- setdiff(keywords, sapply(exclude_list, toupper))
# HERE IS WHERE I NEED HELP
# Compare the read keyword list to the master keyword list
# and keep the frequency column
key1=read.csv("set1.csv",header=FALSE,sep=",")
key1$V1=sapply(key1[[1]], toupper)
keywords$V2=key1[which(keywords[[1]] %in% key1$V1),2]
return(keywords)
The reason that your last commmand fails is that you try to use the $ operator on a vector. It only works on lists or data frames (which are a special case of lists).
A remark regarding toupper (and many other functions in R): it works on vectors, such that you don't need to use sapply. toupper(c(keywordset1, keywordset2)) is perfectly fine.
But I would like to propose an entirely different solution to your problem. First, I create the data as follows:
keywords1 <- read.table(text="able,2
gobble,3
highway,3
test,6
zoo,10",sep=",",stringsAsFactors=FALSE)
keywords2 <- read.table(text="gobble,10
highway,3
speed,7
test,8
upper,3
zoo,10",sep=",",stringsAsFactors=FALSE)
Note that I use stringsAsFactors=FALSE. This prevents read.table from converting characters to factors, such that there is no need to call as.character later.
The next steps are to capitalize the keyword columns in both tables. At the same time, I put both tables in a list. This is often a good way to simplify calculations in R, because you can use lapply to apply a function on all the list elements. Then I put both tables into a single table.
keyword_list <- lapply(list(keywords1,keywords2),function(kw)
transform(kw,V1=toupper(V1)))
keywords_all <- do.call(rbind,keyword_list)
The next step is to sort the data frame in decreasing order by the number in the second column:
keywords_sorted <- keywords_all[order(keywords_all$V2,decreasing=TRUE),]
keywords_sorted looks as follows:
V1 V2
5 ZOO 10
6 GOBBLE 10
11 ZOO 10
9 TEST 8
8 SPEED 7
4 TEST 6
2 GOBBLE 3
3 HIGHWAY 3
7 HIGHWAY 3
10 UPPER 3
1 ABLE 2
As you notice, some keywords appear only once and for those that appear twice, the first appearance is the one you want to keep. There is a function in R that can be used to extract exactly these elements: duplicated() (run ?duplicated to learn more). Basically, the function returns TRUE, if an element appears for the at least second time in a vector. These are the elements you don't want. To convert TRUE to FALSE (and vice versa), you use the operator !. So the following gives your desired result:
keep <- !duplicated(keywords_sorted$V1)
keywords_max <- keywords_sorted[keep,]
V1 V2
5 ZOO 10
6 GOBBLE 10
9 TEST 8
8 SPEED 7
3 HIGHWAY 3
10 UPPER 3
1 ABLE 2

Calling on a column from a data frame within a data frame

I have a list of data frame (lets call that "data") that I have generated which goes something like this:
$"something.csv"
x y z
1 1 1 1
2 2 2 2
3 3 3 3
$"something else.csv"
x y z
1 1 1 1
2 2 2 2
3 3 3 3
I would like to output from the table "something.csv" one number within column x.
So far I have used:
data$"something.csv"$x[2]
This coding works and I am happy that it does, but my problem is that I want to automate this process and so i have put all the table titles into a list (filename) which goes:
[1] "something.csv", "something else.csv"
So i made a for loop which should allow me to do so but when I put in:
data$filename[1]$x[2]
it gives me back NULL.
When i print filename[1], I get [1] "something.csv" and if I type
data$"something.csv"$x[2]
I get the result I want. so if filename[1] = "something.csv", why does it not give me the same results?
I just want my code to out put the second row of column x and automate by using filename[i] in a for loop.
The way you have tried to approach the problem tries to find a column 'filename[1]' from the list, but it is not found. Hence, the NULL gets returned.
You need to use square brackets, and subset the object data. Here's an example:
# Generate data
data<-vector("list", 2)
names(data)<-c("something.csv", "something else.csv")
data[[1]]<-data.frame(x=1:3, y=1:3, z=1:3)
data[[2]]<-data.frame(x=1:3, y=1:3, z=1:3)
filename<-names(l)
# Subset the data
# The first data frame, notice the square brackets for subsetting lists!
data[[filename[1]]]
# column x
data[[filename[1]]]$x
# Second observation of x
data[[filename[1]]]$x[2]
The above uses for subsetting the names of the objects in the list. You can also use the number-based subsetting suggested by #Jeremy.
you can also use [ and [[ to call data$"something.csv"$x[2] try
data[[1]][2,1]
where [[1]] is the first list element and [2,1] is the data frame reference element

Plyr for simple group-by

I'm getting data from a MySQL table that has 2 columns (idDoc, tag) describing that the document has a given tag. When I use the data frame with
ddply(tags,1)
My objective is to group tags by id, so say I do the following steps
> x=c(1,1,2,2)
> y=c(4,5,6,7)
> data.frame(x,y)
x y
1 1 4
2 1 5
3 2 6
4 2 7
My desired output would be perhaps a list of lists (or whatever other result) that would get
1 -> c(4,5)
2 -> c(6,7)
Regards
This is kind of a shot in the dark, since when you say you want an 'association', that doesn't really precisely describe any particular R data structure, so it's unclear what form you want the output to take.
But one base R possibility would be to simply use split:
split(tags$tag, tags$idDoc)
which should returned a named list where the names come from idDoc and each list element is the tags associated with that idDoc value. There will be duplicates, though. So maybe this would work better:
tapply(tags$tag,tags$idDoc,FUN = unique)
which should return a list of unique tags for each idDoc.
(Edited: No need for the anonymous function; only need to pass unique).

Trying to use user-defined function to populate new column in dataframe. What is going wrong?

Super short version: I'm trying to use a user-defined function to populate a new column in a dataframe with the command:
TestDF$ELN<-EmployeeLocationNumber(TestDF$Location)
However, when I run the command, it seems to just apply EmployeeLocationNumber to the first row's value of Location rather than using each row's value to determine the new column's value for that row individually.
Please note: I'm trying to understand R, not just perform this particular task. I was actually able to get the output I was looking for using the Apply() function, but that's irrelevant. My understanding is that the above line should work on a row-by-row basis, but it isn't.
Here are the specifics for testing:
TestDF<-data.frame(Employee=c(1,1,1,1,2,2,3,3,3),
Month=c(1,5,6,11,4,10,1,5,10),
Location=c(1,5,6,7,10,3,4,2,8))
This testDF keeps track of where each of 3 employees was over the course of the year among several locations.
(You can think of "Location" as unique to each Employee...it is eseentially a unique ID for that row.)
The the function EmployeeLocationNumber takes a location and outputs a number indicating the order that employee visited that location. For example EmployeeLocationNumber(8) = 2 because it was the second location visited by the employee who visited it.
EmployeeLocationNumber <- function(Site){
CurrentEmployee <- subset(TestDF,Location==Site,select=Employee, drop = TRUE)[[1]]
LocationDate<- subset(TestDF,Location==Site,select=Month, drop = TRUE)[[1]]
LocationNumber <- length(subset(TestDF,Employee==CurrentEmployee & Month<=LocationDate,select=Month)[[1]])
return(LocationNumber)
}
I realize I probably could have packed all of that into a single subset command, but I didn't know how referencing worked when you used subset commands inside other subset commands.
So, keeping in mind that I'm really trying to understand how to work in R, I have a few questions:
Why won't TestDF$ELN<-EmployeeLocationNumber(TestDF$Location) work row-by-row like other assignment statements do?
Is there an easier way to reference a particular value in a dataframe based on the value of another one? Perhaps one that does not return a dataframe/list that then must be flattened and extracted from?
I'm sure the function I'm using is laughably un-R-like...what should I have done to essentially emulate an INNER Join type query?
Using logical indexing, the condensed one-liner replacement for your function is:
EmployeeLocationNumber <- function(Site){
with(TestDF[do.call(order, TestDF), ], which(Location[Employee==Employee[which(Location==Site)]] == Site))
}
Of course this isn't the most readable way, but it demonstrates the principles of logical indexing and which() in R. Then, like others have said, just wrap it up with a vectorized *ply function to apply this across your dataset.
A) TestDF$Location is a vector. Your function is not set up to return a vector, so giving it a vector will probably fail.
B) In what sense is Location:8 the "second location visited"?
C) If you want within group ordering then you need to pass you dataframe split up by employee to a funciton that calculates a result.
D) Conditional access of a data.frame typically involves logical indexing and or the use of which()
If you just want the sequence of visits by employee try this:
(Changed first argument to Month since that is what determines the sequence of locations)
with(TestDF, ave(Location, Employee, FUN=seq))
[1] 1 2 3 4 2 1 2 1 3
TestDF$LocOrder <- with(TestDF, ave(Month, Employee, FUN=seq))
If you wanted the second location for EE:3 it would be:
subset(TestDF, LocOrder==2 & Employee==3, select= Location)
# Location
# 8 2
The vectorized nature of R (aka row-by-row) works not by repeatedly calling the function with each next value of the arguments, but by passing the entire vector at once and operating on all of it at one time. But in EmployeeLocationNumber, you only return a single value, so that value gets repeated for the entire data set.
Also, your example for EmployeeLocationNumber does not match your description.
> EmployeeLocationNumber(8)
[1] 3
Now, one way to vectorize a function in the manner you are thinking (repeated calls for each value) is to pass it through Vectorize()
TestDF$ELN<-Vectorize(EmployeeLocationNumber)(TestDF$Location)
which gives
> TestDF
Employee Month Location ELN
1 1 1 1 1
2 1 5 5 2
3 1 6 6 3
4 1 11 7 4
5 2 4 10 1
6 2 10 3 2
7 3 1 4 1
8 3 5 2 2
9 3 10 8 3
As to your other questions, I would just write it as
TestDF$ELN<-ave(TestDF$Month, TestDF$Employee, FUN=rank)
The logic is take the months, looking at groups of the months by employee separately, and give me the rank order of the months (where they fall in order).
Your EmployeeLocationNumber function takes a vector in and returns a single value.
The assignment to create a new data.frame column therefore just gets a single value:
EmployeeLocationNumber(TestDF$Location) # returns 1
TestDF$ELN<-1 # Creates a new column with the single value 1 everywhere
Assignment doesn't do any magic like that. It takes a value and puts it somewhere. In this case the value 1. If the value was a vector of the same length as the number of rows, it would work as you wanted.
I'll get back to you on that :)
Dito.
Update: I finally worked out some code to do it, but by then #DWin has a much better solution :(
TestDF$ELN <- unlist(lapply(split(TestDF, TestDF$Employee), function(x) rank(x$Month)))
...I guess the ave function does pretty much what the code above does. But for the record:
First I split the data.frame into sub-frames, one per employee. Then I rank the months (just in case your months are not in order). You could use order too, but rank can handle ties better. Finally I combine all the results into a vector and put it into the new column ELN.
Update again Regarding question 2, "What is the best way to reference a value in a dataframe?":
This depends a bit on the specific problem, but if you have a value, say Employee=3 and want to find all rows in the data.frame that matches that, then simply:
TestDF$Employee == 3 # Returns logical vector with TRUE for all rows with Employee == 3
which(TestDF$Employee == 3) # Returns a vector of indices instead
TestDF[which(TestDF$Employee == 3), ] # Subsets the data.frame on Employee == 3

Resources