Unnesting a list inside of a list, inside of a list, inside of a list... while preserving ID in R

I imported a JSON file with the structure below:
link
I would like to transform it into a dataframe with 3 columns: ID, group_name, date_joined,
where ID is the element number from the "data" list.
It should look like this:
ID group_name date_joined
1 aaa dttm
1 bbb dttm
1 ccc dttm
1 ddd dttm
2 eee dttm
2 aaa dttm
2 bbb dttm
2 fff dttm
2 ggg dttm
3 bbb dttm
3 ccc dttm
3 ggg dttm
3 mmm dttm
Using the code below a few times, I get a dataframe with just 2 columns: group_name and date_joined.
train2 <- do.call("rbind", train2)
sample file link

The following should work:
library(jsonlite)
train2 <- fromJSON("sample.json")
train2 <- train2[[1]]$groups$data
df <- data.frame(
  # repeat each element's index once per group it contains
  ID = unlist(lapply(seq_along(train2), function(x) rep.int(x, length(train2[[x]]$group_name)))),
  group_name = unlist(lapply(seq_along(train2), function(x) train2[[x]]$group_name)),
  date_joined = unlist(lapply(seq_along(train2), function(x) train2[[x]]$date_joined))
)
output:
> df
ID group_name date_joined
1 1 Let's excercise together and lose a few kilo quicker - everyone is welcome! (Piastow) 2008-09-05 09:55:18.730066
2 1 Strongman competition 2008-05-22 21:25:22.572365
3 1 Fast food 4 life 2012-02-02 05:26:01.293628
4 1 alternative medicine - Hypnosis and bioenergotheraphy 2008-07-05 05:47:12.254848
5 2 Tom Cruise group 2009-06-14 16:48:28.606142
6 2 Babysitters (Sokoka) 2010-09-25 03:21:01.944684
7 2 Work abroad - join to find well paid work and enjoy the experience (Sokoka) 2010-09-21 23:44:39.499240
8 2 Tennis, Squash, Badminton, table tennis - looking for sparring partner (Sokoka) 2007-10-09 17:15:13.896508
9 2 Lost&Found (Sokoka) 2007-01-03 04:49:01.499555
10 3 Polish wildlife - best places 2007-07-29 18:15:49.603727
11 3 Politics and politicians 2010-10-03 21:00:27.154597
12 3 Pizza ! Best recipes 2010-08-25 22:26:48.331266
13 3 Animal rights group - join us if you care! 2010-11-02 12:41:37.753989
14 4 The Aspiring Writer 2009-09-08 15:49:57.132171
15 4 Nutrition & food advices 2010-12-02 18:19:30.887307
16 4 Game of thrones 2009-09-18 10:00:16.190795
17 5 The ultimate house and electro group 2008-01-02 14:57:39.269135
18 5 Pirates of the Carribean 2012-03-05 03:28:37.972484
19 5 Musicians Available Poland (Osieczna) 2009-12-21 13:48:10.887986
20 5 Housekeeping - looking for a housekeeper ? Join the group! (Osieczna) 2008-10-28 23:22:26.159789
21 5 Rooms for rent (Osieczna) 2012-08-09 12:14:34.190438
22 5 Counter strike - global ladderboard 2008-11-28 03:33:43.272435
23 5 Nutrition & food advices 2011-02-08 19:38:58.932003
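If jsonlite has already simplified each element of train2 into a small data frame (it often does by default), the same table can be built in a single call; a minimal sketch under that assumption:
library(data.table)
df <- rbindlist(train2, idcol = "ID")  # idcol records which "data" element each row came from
Here ID ends up as the element number in "data", just as in the lapply() construction above.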

Related

R merging two dataframes, but only select certain year rows from dataframe 2

I've got two dataframes, one covering 2016-2020 and one covering 2015-2020. I would like to extract the 2015 lines from dataframe2 and insert them into dataframe1.
Dataframe1 has date, hits, keyword (the same as dataframe2).
The merged lines must be matched by keyword, so 2015 "food" from dataframe2 must be inserted before 01.01.2016 "food" in dataframe1.
Ex: Dataframe1:
*date hits keyword*
2016-01-01 10 food
2016-31-01 5 food
2017-31-01 5 food
2018-31-01 5 food
2018-31-01 5 food
2016-01-01 55 drink
2016-22-01 1 drink
2017-31-05 2 drink
2018-31-01 1 drink
So I want all lines in 2015 containing food to be inserted above 2016 food in dataframe1. And the same with drink. All drink 2015 from dataframe2 must be inserted before 2016 drink in dataframe1.
End result:
*date hits keyword*
**2015-31-01 5 food**
2016-01-01 10 food
2016-31-01 5 food
2017-31-01 5 food
2018-31-01 5 food
2018-31-01 5 food
**2015-31-01 7 drink**
2016-01-01 55 drink
2016-22-01 1 drink
2017-31-05 2 drink
2018-31-01 1 drink
Three basic data frame operations:
Filter Dataframe2 to only include the rows we want.
Dataframe2[grepl("^2015", Dataframe2$date),]
# date x y
# 2 2015-31-01 5 food
# 3 2015-31-01 5 food
# 4 2015-31-01 5 food
Combine row-wise using rbind.
Dataframe1 <- rbind(Dataframe2[grepl("^2015", Dataframe2$date),], Dataframe1)
Dataframe1
# date x y
# 2 2015-31-01 5 food
# 3 2015-31-01 5 food
# 4 2015-31-01 5 food
# 1 2016-01-01 10 food
# 21 2016-31-01 5 food
# 31 2017-31-01 5 food
# 41 2018-31-01 5 food
# 5 2018-31-01 5 food
# 6 2016-01-01 55 drink
# 7 2016-22-01 1 drink
# 8 2017-31-05 2 drink
# 9 2018-31-01 1 drink
Sort the resulting data.
Dataframe1[order(Dataframe1$date),]
# date x y
# 2 2015-31-01 5 food
# 3 2015-31-01 5 food
# 4 2015-31-01 5 food
# 1 2016-01-01 10 food
# 6 2016-01-01 55 drink
# 7 2016-22-01 1 drink
# 21 2016-31-01 5 food
# 31 2017-31-01 5 food
# 8 2017-31-05 2 drink
# 41 2018-31-01 5 food
# 5 2018-31-01 5 food
# 9 2018-31-01 1 drink
I should note that these all use string date values, which will not sort correctly: they sort lexicographically, not numerically. Realize that
20 > 3
# [1] TRUE
"20" > "3"
# [1] FALSE
To do this right, the columns should be proper Date-class columns:
# starting with a fresh `Dataframe1`
Dataframe1$date <- as.Date(Dataframe1$date, format = "%Y-%d-%m")
Dataframe2$date <- as.Date(Dataframe2$date, format = "%Y-%d-%m")
## Filter
lims <- as.Date(c("2015-01-01", "2015-31-12"), format = "%Y-%d-%m")
Dataframe2[ lims[1] <= Dataframe2$date & Dataframe2$date <= lims[2], ] # for demo
## Combine
Dataframe1 <- rbind(Dataframe2[ lims[1] <= Dataframe2$date & Dataframe2$date <= lims[2], ], Dataframe1)
## Order
Dataframe1[order(Dataframe1$date),]
Note that R will always display dates in year-month-day order when they are stored in a Date-class object. If you want them displayed some other way, I suggest you do that only at report-generation time (using format(Dataframe1$date, format = "..."); see ?strptime for format hints).
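For completeness, the same filter/combine/sort pipeline can also be written with dplyr; a sketch, assuming the two data frames are named as above and the date columns have already been converted with as.Date():
library(dplyr)
Dataframe1 <- bind_rows(
  filter(Dataframe2, format(date, "%Y") == "2015"),  # keep only 2015 rows from Dataframe2
  Dataframe1
)
Dataframe1 <- arrange(Dataframe1, date)              # sort the combined frame by date
As with the base-R version, the final ordering is purely by date, which interleaves the keywords.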

Getting Data in a single row into multiple rows

I have data showing which people work in which groups. When I ask the leader of each group, in a survey, to list those who work for them, I get a single row containing all of the team members. What I need is to reshape the data into multiple rows, one row per member, with their group information.
I don't know where to start.
This is what my data frame looks like:
LeaderName <- c('John','Jane','Louis','Carl')
Group <- c('3','1','4','2')
Member1 <- c('Lucy','Stephanie','Chris','Leslie')
Member1ID <- c('1','2','3','4')
Member2 <- c('Earl','Carlos','Devon','Francis')
Member2ID <- c('5','6','7','8')
Member3 <- c('Luther','Peter','','Severus')
Member3ID <- c('9','10','','11')
GroupInfo <- data.frame(LeaderName, Group, Member1, Member1ID, Member2 ,Member2ID, Member3, Member3ID)
This is what I would like it to show with a certain code
LeaderName_ <- c('John','Jane','Louis','Carl','John','Jane','Louis','Carl','John','Jane','','Carl')
Group_ <- c('3','1','4','2','3','1','4','2','3','1','','2')
Member <- c('Lucy','Stephanie','Chris','Leslie','Earl','Carlos','Devon','Francis','Luther','Peter','','Severus')
MemberID <- c('1','2','3','4','5','6','7','8','9','10','','11')
ActualGroupInfor <- data.frame(LeaderName_,Group_,Member,MemberID)
An option would be melt from data.table, specifying the column-name patterns in the measure argument:
library(data.table)
melt(setDT(GroupInfo),
     measure = patterns("^Member\\d+$", "^Member\\d+ID$"),
     value.name = c("Member", "MemberID"))[, variable := NULL][]
# LeaderName Group Member MemberID
# 1: John 3 Lucy 1
# 2: Jane 1 Stephanie 2
# 3: Louis 4 Chris 3
# 4: Carl 2 Leslie 4
# 5: John 3 Earl 5
# 6: Jane 1 Carlos 6
# 7: Louis 4 Devon 7
# 8: Carl 2 Francis 8
# 9: John 3 Luther 9
#10: Jane 1 Peter 10
#11: Louis 4
#12: Carl 2 Severus 11
Here is a solution in base R:
reshape(
  data = GroupInfo,
  idvar = c("LeaderName", "Group"),
  varying = list(
    Member   = grep("^Member[0-9]$",  names(GroupInfo)),
    MemberID = grep("^Member[0-9]ID", names(GroupInfo))
  ),
  direction = "long",
  v.names = c("Member", "MemberID"),
  sep = "_"
)[, -3]
#> LeaderName Group Member MemberID
#> John.3.1 John 3 Lucy 1
#> Jane.1.1 Jane 1 Stephanie 2
#> Louis.4.1 Louis 4 Chris 3
#> Carl.2.1 Carl 2 Leslie 4
#> John.3.2 John 3 Earl 5
#> Jane.1.2 Jane 1 Carlos 6
#> Louis.4.2 Louis 4 Devon 7
#> Carl.2.2 Carl 2 Francis 8
#> John.3.3 John 3 Luther 9
#> Jane.1.3 Jane 1 Peter 10
#> Louis.4.3 Louis 4
#> Carl.2.3 Carl 2 Severus 11
Created on 2019-05-23 by the reprex package (v0.2.1)
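If neither data.table nor reshape() appeals, the member/ID column pairs can also be stacked by hand; a base-R sketch using only the GroupInfo frame from the question (the long_groups name is just for illustration):
member_cols   <- grep("^Member[0-9]+$",   names(GroupInfo), value = TRUE)
memberid_cols <- grep("^Member[0-9]+ID$", names(GroupInfo), value = TRUE)
long_groups <- do.call(rbind, Map(function(m, id) {
  # one block per Member/MemberID pair, stacked row-wise
  data.frame(LeaderName = GroupInfo$LeaderName,
             Group      = GroupInfo$Group,
             Member     = GroupInfo[[m]],
             MemberID   = GroupInfo[[id]])
}, member_cols, memberid_cols))
long_groups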

Best Way to Subset a Large Dataframe into a List of Variables?

I currently have a data frame of ~83,000 rows (13 columns) containing crime data from the years 2000-2012; each row is a crime and includes the zip code (so the zip code xxxxx can be found in 2001, 2003, and 2007, as an example).
Here is an example of my data:
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
2000 1 99502 1 3 5 2 9479
2009 2 99502 2 3 4 3 3220
2000 1 11111 1 3 5 2 3479
2004 2 11111 2 3 4 3 1020
Right now, I am able to assign a global variable to each of my zip codes (I am using RStudio, and the resulting list of objects is very long and has significantly slowed the program).
Here is how I have assigned global variables to all of my zip codes:
for (n in all.data$Zip) {
  x <- subset(all.data, n == all.data$Zip)        # subsets the data
  u <- x[1, 3]                                    # gets the zip code value
  assign(paste0("Zip", u), x, envir = .GlobalEnv) # assigns it to the global environment
  # need something here, MasterList <<- ?
}
I would like to contain all of these variables in a list. For example, if all my zip code variables were stored in list "MasterList":
MasterList["Zip11111"]
would yield the data frame:
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
2000 1 11111 1 3 5 2 3479
2004 2 11111 2 3 4 3 1020
Is this possible? What would be an alternative/faster/better way to do such? I was hoping that storing these variables in a list would be more efficient.
Bonus points: I know in my for loop I am reassigning variables that already exist to the exact same thing, wasting processing time. Any quick line I could add to speed this up?
Thanks in advance for your help!
With only base R:
dat <- read.table(text = "Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
2000 1 99502 1 3 5 2 9479
2009 2 99502 2 3 4 3 3220
2000 1 11111 1 3 5 2 3479
2004 2 11111 2 3 4 3 1020", header = TRUE, sep = "")
> dats <- split(dat,dat$Zip)
> dats
$`11111`
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
3 2000 1 11111 1 3 5 2 3479
4 2004 2 11111 2 3 4 3 1020
$`99502`
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
1 2000 1 99502 1 3 5 2 9479
2 2009 2 99502 2 3 4 3 3220
> names(dats) <- paste0('Zip',names(dats))
> dats
$Zip11111
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
3 2000 1 11111 1 3 5 2 3479
4 2004 2 11111 2 3 4 3 1020
$Zip99502
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
1 2000 1 99502 1 3 5 2 9479
2 2009 2 99502 2 3 4 3 3220
You could change for (n in all.data$Zip) to for (n in unique(all.data$Zip)); that would cut down on redundancy. Why not create a list before the loop, MasterList <- list(), and then add to it with
MasterList[[paste0("Zip", n)]] <- x
Yes, I used n for the zip-code value because n is assigned each value in the vector you iterate over (in your case all.data$Zip, in mine unique(all.data$Zip)).
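Putting those two suggestions together, the loop might look something like this (a sketch, assuming all.data is structured as shown in the question):
MasterList <- list()
for (n in unique(all.data$Zip)) {
  MasterList[[paste0("Zip", n)]] <- subset(all.data, Zip == n)  # one data frame per zip code
}
MasterList[["Zip11111"]]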
Probably the easiest way to make your list is with plyr's dlply() function, like so:
> set.seed(2)
> dat <- data.frame(zip=as.factor(sample(11111:22222,1000,replace=T)),var1=rnorm(1000),var2=rnorm(1000))
> head(dat)
zip var1 var2
1 13165 -0.4597894 -0.84724423
2 18915 0.6179261 0.07042928
3 17481 -0.7204224 1.58119491
4 12978 -0.5835119 0.02059799
5 21598 0.2163245 -0.12337051
6 21594 1.2449912 -1.25737890
> library(plyr)
> MasterList <- dlply(dat,.(zip))
> MasterList[["13165"]]
zip var1 var2
1 13165 -0.4597894 -0.8472442
However, it sounds like speed is your motivation here, and if so you'd probably be much better off not storing the data in a separate list object at all, and instead converting your data frame to a data.table:
> library(data.table)
> dat.dt <- data.table(dat)
> dat.dt[zip==13165]
zip var1 var2
1: 13165 -0.4597894 -0.8472442

R finding date intervals by ID

I have the following table, which comprises some key columns: customer ID | order ID | product ID | Quantity | Amount | Order Date.
All this data is in LONG format, in that you get multiple line items for the one Customer ID.
I can get the first and last date using R's date-difference functions, but after converting the file to WIDE format using plyr I still end up with the same problem of multiple orders per customer, just fewer rows and more columns.
Is there an R function that extends the date difference to work out the time interval between purchases by Customer ID? That is, the time between order 1 and 2, order 2 and 3, and so on, assuming these orders exist.
CID Order.Date Order.DateMY Order.No_ Amount Quantity Category.Name Locality
1 26/02/13 Feb-13 zzzzz 1 r MOSMAN
1 26/05/13 May-13 qqqqq 1 x CHULLORA
1 28/05/13 May-13 wwwww 1 r MOSMAN
1 28/05/13 May-13 wwwww 1 x MOSMAN
2 19/08/13 Aug-13 wwwwww 1 o OAKLEIGH SOUTH
3 3/01/13 Jan-13 wwwwww 1 x CURRENCY CREEK
4 28/08/13 Aug-13 eeeeeee 1 t BRISBANE
4 10/09/13 Sep-13 rrrrrrrrr 1 y BRISBANE
4 25/09/13 Sep-13 tttttttt 2 e BRISBANE
It is not clear what you want to do since you don't give the expected result, but I guess you want the intervals between consecutive orders.
library(data.table)
DT <- as.data.table(DF)
DT[, list(Order.Date,
          diff = c(0, diff(sort(as.Date(Order.Date, '%d/%m/%y'))))), CID]
CID Order.Date diff
1: 1 26/02/13 0
2: 1 26/05/13 89
3: 1 28/05/13 2
4: 1 28/05/13 0
5: 2 19/08/13 0
6: 3 3/01/13 0
7: 4 28/08/13 0
8: 4 10/09/13 13
9: 4 25/09/13 15
Split the data frame and find the intervals for each Customer ID.
df <- data.frame(
  customerID = as.factor(c(rep("A", 3), rep("B", 4))),
  OrderDate = as.Date(c("2013-07-01", "2013-07-02", "2013-07-03",
                        "2013-06-01", "2013-06-02", "2013-06-03", "2013-07-01"))
)
dfs <- split(df, df$customerID)
lapply(dfs, function(x) {
  diff(x$OrderDate)
})
Or use plyr
library(plyr)
dfs <- dlply(df,.(customerID),function(x)return(diff(x$OrderDate)))
I know this question is very old, but I just figured out another way to do it and wanted to record it:
> library(dplyr)
> library(lubridate)
> df %>% group_by(customerID) %>%
mutate(SinceLast=(interval(ymd(lag(OrderDate)),ymd(OrderDate)))/86400)
# A tibble: 7 x 3
# Groups: customerID [2]
customerID OrderDate SinceLast
<fct> <date> <dbl>
1 A 2013-07-01 NA
2 A 2013-07-02 1.
3 A 2013-07-03 1.
4 B 2013-06-01 NA
5 B 2013-06-02 1.
6 B 2013-06-03 1.
7 B 2013-07-01 28.
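If OrderDate is already a Date column (as in the df built above), a similar result can be had without lubridate; a sketch along the same lines:
library(dplyr)
df %>%
  group_by(customerID) %>%
  arrange(OrderDate, .by_group = TRUE) %>%   # order each customer's purchases by date
  mutate(SinceLast = as.numeric(difftime(OrderDate, lag(OrderDate), units = "days")))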

How can I derive a variable in R showing the number of observations that have the same value recorded at earlier dates?

I am using R and I have a data frame containing info about the applications made by individuals for a grant. Individuals can apply for a grant as many times as they like. I want to derive a new variable that tells me how many applications each individual has made up to and including the date of the application represented by each record.
At the moment my data looks like this:
app number date app made applicant
1 2012-08-01 John
2 2012-08-02 John
3 2012-08-02 Jane
4 2012-08-04 John
5 2012-08-08 Alice
6 2012-08-09 Alice
7 2012-08-09 Jane
And I would like to add a further variable so my data frame looks like this:
app number date app made applicant applications by applicant to date
1 2012-08-01 John 1
2 2012-08-02 John 2
3 2012-08-02 Jane 1
4 2012-08-04 John 3
5 2012-08-08 Alice 1
6 2012-08-09 Alice 2
7 2012-08-09 Jane 2
I'm new to R and I'm really struggling to work out how to do this. The closest I am able to get is something like the answer in this question:
How do I count the number of observations at given intervals in R?
But I can't work out how to do this based on the date in each record rather than on pre-set intervals.
Here's a less elegant way than @Justin's:
A <- read.table(text='"app number" "date app made" "applicant"
1 2012-08-01 John
2 2012-08-02 John
3 2012-08-02 Jane
4 2012-08-04 John
5 2012-08-08 Alice
6 2012-08-09 Alice
7 2012-08-09 Jane',header=TRUE)
# order by applicant name
A <- A[order(A$applicant), ]
# get vector you're looking for
A$app2date <- unlist(sapply(unique(A$applicant), function(x, appl) {
  seq(sum(appl == x))
}, appl = A$applicant))
# back in original order:
A <- A[order(A$"app.number"), ]
You can use plyr for this. If your data is in a data.frame dat, I would add a column called count, then use cumsum:
library(plyr)
dat <- structure(list(number = 1:7, date = c("2012-08-01", "2012-08-02",
"2012-08-02", "2012-08-04", "2012-08-08", "2012-08-09", "2012-08-09"
), name = c("John", "John", "Jane", "John", "Alice", "Alice",
"Jane")), .Names = c("number", "date", "name"), row.names = c(NA,
-7L), class = "data.frame")
dat$count <- 1
ddply(dat, .(name), transform, count=cumsum(count))
number date name count
1 5 2012-08-08 Alice 1
2 6 2012-08-09 Alice 2
3 3 2012-08-02 Jane 1
4 7 2012-08-09 Jane 2
5 1 2012-08-01 John 1
6 2 2012-08-02 John 2
7 4 2012-08-04 John 3
I assumed your dates were already sorted; however, you might want to explicitly sort them anyway before you do your "counting":
dat <- dat[order(dat$date),]
As per the comment, this can be simplified if you understand (which I didn't!) the way transform works:
ddply(dat, .(name), transform, count=order(date))
number date name count
1 5 2012-08-08 Alice 1
2 6 2012-08-09 Alice 2
3 3 2012-08-02 Jane 1
4 7 2012-08-09 Jane 2
5 1 2012-08-01 John 1
6 2 2012-08-02 John 2
7 4 2012-08-04 John 3
Here is a one-line approach using the ave function. This version does not require reordering the data; it leaves the data in its original order:
A$applications <- ave(A$app.number, A$applicant, FUN=seq_along)
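For readers on newer tooling, the same running count can be expressed with dplyr; a sketch, assuming the dat data frame from the plyr answer:
library(dplyr)
dat %>%
  arrange(date) %>%               # ensure rows are in chronological order
  group_by(name) %>%
  mutate(count = row_number()) %>%  # running count of applications per applicant
  ungroup()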
