Heterogeneous list to dataframe in R

I have a list (from JSON parsed with rjson) which defines, in a rather odd way, the values of an element for given x and y values (as if to draw a graph line). What I am looking for is to produce a data.frame which contains only the values, with the possible values of x and y as the row names and column headers.
In other words, I am trying to obtain this kind of data frame:
     y1   y2
x1  v11  v12
x2  v21  v22
And the list I have as input looks like:
[
[
x1,
[
[y1, v11],
[y2, v12]
]
],
[
x2,
[
[y1, v21],
[y2, v22]
]
]
]
It is a list of lists; each inner list contains two elements: an x value and a list of lists. Each of those innermost lists has two elements, a y value and a v value, where v is the value of the object for its parent's x and for the y paired with it.
I hope my explanation is not too confusing. I have made some unsuccessful attempts with ldply or matrix(unlist(... I am looking for a way to transform my list without having to pick out each value one by one in a double for loop.
Thanks for reading and for any help you can provide.
[EDIT]
Here is the dput of my data:
list(list(20, list(c(1, 224), c(3, 330), c(5, 436), c(10, 701
), c(20, 1231), c(30, 1231))), list(10, list(c(1, 154), c(3,
207), c(5, 366), c(10, 631), c(20, 631), c(30, 631))), list(5,
list(c(1, 119), c(3, 225), c(5, 331), c(10, 331), c(20, 331
), c(30, 331))), list(1, list(c(1, 91), c(3, 91), c(5, 91
), c(10, 91), c(20, 91), c(30, 91))))
In this example, 20, 10, 5, 1 are supposed to be the future x values of the data frame and 1, 3, 5, 10, 20, 30 the future y values. The rest are the values of the object.

The package jsonlite is able to simplify data structures when converting from JSON to R. I am not sure whether rjson offers something similar. Here I use that to round-trip from R to JSON and back, which gives me, for each x, a matrix of the y_i and v_ij pairs:
foo <- list(list(20, list(c(1, 224), c(3, 330), c(5, 436), c(10, 701), c(20, 1231), c(30, 1231))),
list(10, list(c(1, 154), c(3, 207), c(5, 366), c(10, 631), c(20, 631), c(30, 631))),
list(5, list(c(1, 119), c(3, 225), c(5, 331), c(10, 331), c(20, 331), c(30, 331))),
list(1, list(c(1, 91), c(3, 91), c(5, 91), c(10, 91), c(20, 91), c(30, 91))))
bar <- jsonlite::fromJSON(jsonlite::toJSON(foo))               # round trip; the inner pairs come back as matrices
baz <- Reduce(rbind, lapply(bar, function(x) t(x[[2]])[2, ]))  # keep only the v values for each x
colnames(baz) <- bar[[1]][[2]][, 1]                            # the y values
rownames(baz) <- unlist(lapply(bar, function(x) x[[1]]))       # the x values
baz
#> 1 3 5 10 20 30
#> 20 224 330 436 701 1231 1231
#> 10 154 207 366 631 631 631
#> 5 119 225 331 331 331 331
#> 1 91 91 91 91 91 91
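If you prefer to avoid the JSON round trip and work directly on the list rjson gives you, a base R version of the same reshaping could look like this (just a sketch against the dput structure above):
vals <- t(sapply(foo, function(el) sapply(el[[2]], `[`, 2)))  # one row per x, keeping only the v values
rownames(vals) <- sapply(foo, `[[`, 1)                        # the x values
colnames(vals) <- sapply(foo[[1]][[2]], `[`, 1)               # the y values (taken from the first x)
as.data.frame(vals)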

Related

How to loop over multiple groups and create radar plots in R

I have the following dataframe:
group   Class   Maths   Science
Name1       7      74        78
Name2       7      80        91
Name3       6      69        80
I want to create separate radar plots of the variables Maths and Science for each class using R. E.g., for the above dataframe, two radar plots should be created, one for class 7 and one for class 6.
nrange <- 2
class <- c(7,6)
for (i in nrange){
plot <- ggradar::ggradar(df[i,2:3], values.radar = c(0, 50, 100), group.line.width = 1,
group.point.size = 2, legend.position = "bottom", plot.title=class[i])
}
plot
I am using the above code. However, it only creates the plot for the last row. Please help me with this issue.
Thanks a lot in advance!
You were almost there, but there were two little problems.
The for statement evaluates to for(i in 2), which means it only uses i = 2. You can fix this by using for(i in 1:nrange).
You were overwriting plot each time through the loop. If you make plot a list and save each graph as a separate element in the list, then it should work.
mydat <- tibble::tribble(
~group, ~Class, ~Maths, ~Science,
"Name1", 7, 74, 78,
"Name2", 7, 80, 91,
"Name3", 6, 69, 80)
plots <- list()
nrange <- 2
class <- c(7,6)
for (i in 1:3){
plots[[i]] <- ggradar::ggradar(mydat[i,2:4], values.radar = c(0, 50, 100),
grid.max = 100, group.line.width = 1,
group.point.size = 2, legend.position = "bottom", plot.title=mydat$Class[i])
}
plots
#> [[1]]
#>
#> [[2]]
#>
#> [[3]]
Created on 2023-02-03 by the reprex package (v2.0.1)
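If you would rather view the saved plots together instead of printing them one at a time, one option (my own suggestion, not part of the original answer; it assumes the patchwork package is installed) is to lay the list out in a grid:
library(patchwork)
wrap_plots(plots, ncol = 2)  # arrange the list of ggradar plots in a two-column grid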
Putting Together with facet_wrap()
library(dplyr)
library(ggplot2)
mydat <- tibble::tribble(
~group, ~Class, ~Maths, ~Science,
"Name1", 7, 74, 78,
"Name2", 7, 80, 91,
"Name3", 6, 69, 80)
mydat <- mydat %>%
mutate(gp = paste(group, Class, sep=": ")) %>%
select(gp, Maths, Science)
ggradar::ggradar(mydat, values.radar = c(0, 50, 100),
grid.max = 100, group.line.width = 1,
group.point.size = 2, legend.position = "bottom") +
facet_wrap(~gp)
Created on 2023-02-06 by the reprex package (v2.0.1)

Predict down years/rows based on previous years/rows, with lmer and dplyr

I have a linear mixed effects model that determines change in grass based on both the previous year's grass and several environmental variables (and their interactions) at distinct sites over time.
Using this mixed effects model and established, projected future environmental variables, I want to predict change in grass density. Each year's prediction thus depends on the previous year's density, located on the row above it in my dataframe. We begin with a real value from the present year, and then predict into the future.
library(tidyverse); library(lme4)
#data we have from the past, where each site has annual ChlA/Sal/Temp as well as grass density. our formula, change.mod, predicts grass.change, based on these env variables AND last year's grass coverage (grass.y1)
ThePast = tibble(
year = c(2017, 2018, 2019, 2020, 2021, 2017, 2018, 2019, 2020, 2021,2017, 2018, 2019, 2020, 2021),
site = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C"),
ChlA = c(50, 210, 190, 101, 45, 20, 20, 80, 5, 40, 25, 12, 11, 5, 20),
Sal= c(1, 4, 5, 0.1, 10, 18, 14, 17, 10, 21, 30, 28, 25, 20, 22),
Temp = c(28, 21, 24, 25, 22, 19, 20, 17, 18, 15, 18, 16, 19, 20, 20),
grass = c(.5, .3, .1, .4, .1, .25, .33, .43, .44, .08, .75, .54, .69, .4, .6)) %>%
group_by(site) %>%
mutate(grass.y1 = lag(grass, order_by = year)) %>% #last year's grass
mutate(grass.change = grass - grass.y1) %>% #calculate change
ungroup()
#the ME model
change.mod = lmer(grass.change ~ grass.y1 + log10(ChlA) + log10(Sal) + grass.y1:log10(Temp) + grass.y1:log10(Sal) + (1|site), data = ThePast)
#Future environmental data per site per year, to be used to predict grass.
TheDistantFuture <- tibble(
year = c(2022, 2022, 2022, 2023, 2023, 2023, 2024, 2024, 2024),
site = c( "A", "B", "C","A", "B", "C", "A", "B", "C"),
ChlA = c(40, 200, 10, 95, 10, 4, 149, 10, 15),
Sal= c(12, 11, 15, 16, 21, 32, 21, 21, 22),
Temp = c(24, 22, 26, 28, 29, 32, 31, 20, 18))
#The final dataframe should look like this, where both of the grass columns are predicted out into the future. could have the grass.y1 column in here if we wanted
PredictedFuture <- tibble(
year = c(2022, 2022, 2022, 2023, 2023, 2023, 2024, 2024, 2024),
site = c( "A", "B", "C","A", "B", "C", "A", "B", "C"),
ChlA = c(40, 200, 10, 95, 10, 4, 149, 10, 15),
Sal= c(12, 11, 15, 16, 21, 32, 21, 21, 22),
Temp = c(24, 22, 26, 28, 29, 32, 31, 20, 18),
grass = c(0.237, 0.335, 0.457, 0.700, 0.151, 0.361, 0.176, 0.380, 0.684),
grass.change = c(0.1368, 0.2550, -0.1425, -0.1669, -0.18368, -0.0962, 0.106, 0.229, 0.323 ))
Right now, I can generate the next year's (2022) correct predictions using group_by() and predict(), referencing last year's grass density with a lag function.
#How do we get to PredictedFuture?? Here is what I'm trying:
FutureIsNow = ThePast %>%
filter(year == 2021) %>% #take last year of real data to have baseline starting grass density
bind_rows(TheDistantFuture) %>% #bind future data
arrange(site, year) %>% #arrange by site then year
group_by(site) %>% #maybe this should be rowwise?
mutate(grass.change = predict(change.mod, newdata = data.frame(
grass.y1 = lag(grass, n = 1, order_by = year),
ChlA = ChlA, Sal = Sal, Temp = Temp, site = site))) %>% #this correctly predicts 2022 grass change
mutate(grass = grass.change + lag(grass, n = 1)) #this also works to calculate grass in 2022
This df looks like this:
> FutureIsNow
# A tibble: 12 × 7
# Groups: site [3]
year site ChlA Sal Temp grass grass.change
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021 A 45 10 22 NA NA
2 2022 A 40 12 24 0.237 0.137
3 2023 A 95 16 28 NA NA
4 2024 A 149 21 31 NA NA
5 2021 B 40 21 15 NA NA
6 2022 B 200 11 22 0.335 0.255
7 2023 B 10 21 29 NA NA
8 2024 B 10 21 20 NA NA
9 2021 C 20 22 20 NA NA
10 2022 C 10 15 26 0.457 -0.143
11 2023 C 4 32 32 NA NA
12 2024 C 15 22 18 NA NA
Close, but not really repeatable...
Any ideas for predicting grass change for 2023, 2024, down the rows? I prefer working in tidyverse, though it may be possible to solve this more easily with nested for loops. Potential solutions include a rowwise data structure, or maybe to nest_by(station), but I don't know how to then reference the grass.y1 column. Maybe the solution could be via a rolling prediction with rollify, but I am not sure!
Thank you in advance for your help! Long time reader, first time asker!
So, let's go with a simpler example for a reprex to show how purrr::accumulate2() can work for you here.
Let's set up a discrete-time population model where there is also some covariate that affects the population at each time step:
$N_t = 1.5N_{t-1} + C$
Simple! Heck, we can even use accumulate2 to simulate a population, and then add some noise.
library(tidyverse)
# ok, let's make a population from a simple discrete time growth model
# but, with a covariate!
covariate <- runif(5, 5, 10)
# use accumulate2 with the covariate to generate a population timeseries
# note: in an accumulate2() lambda, ..1 is the accumulated value, ..2 the element
# of the first sequence, and ..3 the element of the second sequence (the covariate)
pop <- accumulate2(1:5, covariate, ~ ..1 * 1.5 + ..3, .init = 0) %>% unlist()
pop <- pop[-1]
pop_obs <- rnorm(5, pop, 1) #add some noise
Great! Now, turn it into data and fit a model
# the data ####
dat <- tibble(
time = 1:5,
covariate = covariate,
pop_obs = pop_obs,
lag_pop = lag(pop_obs)
)
# the model ####
mod <- lm(pop_obs ~ covariate + lag_pop, data = dat)
# does this look reasonable?
coef(mod)
My coefficients looked reasonable, but set a seed and see for yourself!
Now we will need some data to simulate over: new covariates. But we will also need to incorporate the lag.
# now, simulation data ####
simdat <- tibble(
time = 6:10,
covariate = runif(5, 15,20),
lag_pop = dat$pop_obs[5] #the last lagged value!
)
Great! To make this work, we'll need a function that takes arguments of the lagged value and covariate and runs a prediction. Note, here the second argument is just a numeric. But, you could pass an element of a list - a row of a data frame, if you will. This might be accomplished later with some rowwise nesting or somesuch. For you to work out!
# OK, now we need to get predictions for pop at each step in time! ####
sim_pred <- function(lag_pop, covariate){
newdat <- tibble(covariate = covariate,
lag_pop = lag_pop)
predict(mod, newdata = newdat)
}
With this in hand, we can simulate forward using lag_pop to generate a new population. Note, we'll need to use .init so our first prediction is seeded with the last observed value, and then strip that initial seed off the result (that is what the `[`(-1) below does).
# and let her rip!
# note, we have to init with the first value and
# for multiple covariates, make a rowwise list -
# each element of the list is
# one row of the data and the sim_pred function takes it apart
simdat %>%
mutate(pop = accumulate2(lag_pop,
covariate,
~ sim_pred(..1, ..3),  # ..1 = previous prediction, ..3 = this row's covariate
.init = lag_pop[1]) %>% `[`(-1) %>% unlist())
That should do!
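To bring this back to the grass question: the same idea can be applied per site, feeding each year's predicted grass back in as grass.y1 for the next year. This is only a hedged sketch of one way to wire it up (predict_one_site() is a helper introduced here, not part of the answer above), assuming ThePast, TheDistantFuture and change.mod are defined as in the question:
library(tidyverse)
library(lme4)

# hypothetical helper: walk down one site's future rows, oldest year first,
# using purrr::accumulate() so each prediction becomes the next grass.y1
predict_one_site <- function(df_future, grass_start) {
  rows <- split(df_future, seq_len(nrow(df_future)))    # one-row pieces, in year order
  grass_pred <- accumulate(rows, function(prev, row) {
    prev + predict(change.mod, newdata = mutate(row, grass.y1 = prev))
  }, .init = grass_start)
  grass_pred <- unname(unlist(grass_pred))[-1]          # drop the seed (the observed 2021 value)
  df_future %>%
    mutate(grass = grass_pred,
           grass.change = grass - lag(grass, default = grass_start))
}

PredictedFuture <- TheDistantFuture %>%
  arrange(site, year) %>%
  group_split(site) %>%
  map_dfr(function(df) {
    start <- ThePast %>% filter(site == df$site[1], year == 2021) %>% pull(grass)
    predict_one_site(df, grass_start = start)
  })
The accumulate() call plays the same role as accumulate2() above; the covariates simply travel inside the one-row data frames instead of as a second sequence.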

Returning values from a column based on the last value of another column

I have a dataset like this:
data <- data.frame(Time = c(1,4,6,9,11,13,16, 25, 32, 65),
A = c(10, NA, 13, 2, 32, 19, 32, 34, 93, 12),
B = c(1, 99, 32, 31, 12, 13, NA, 13, NA, NA),
C = c(2, 32, NA, NA, NA, NA, NA, NA, NA, NA))
What I want to retrieve are the values in Time that correspond to the last non-NA value in A, B, and C.
For example, the last numerical values for A, B, and C are 12, 13, and 32 respectively.
So, the Time values that correspond are 65, 25, and 4.
I've tried something like data[which(data$Time== max(data$A)), ], but this doesn't work.
We can multiply the row index with the logical matrix, and get the colMaxs (from matrixStats) to subset the 'Time' column
library(matrixStats)
data$Time[colMaxs((!is.na(data[-1])) * row(data[-1]))]
#[1] 65 25 4
Or using base R, we get the index with which/arr.ind, get the max index using a group by operation (tapply) and use that to extract the 'Time' value
m1 <- which(!is.na(data[-1]), arr.ind = TRUE)
data$Time[tapply(m1[,1], m1[,2], FUN = max)]
#[1] 65 25 4
Or with summarise/across in the devel version of dplyr
library(dplyr)
data %>%
summarise(across(A:C, ~ tail(Time[!is.na(.)], 1)))
# A B C
#1 65 25 4
Or using summarise_at with the current version of dplyr
data %>%
summarise_at(vars(A:C), ~ tail(Time[!is.na(.)], 1))
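For completeness, a compact base R one-liner (my own addition, not from the answer above) that picks, for each value column, the Time at the last non-NA position:
sapply(data[-1], function(x) data$Time[max(which(!is.na(x)))])
# A  B  C
#65 25  4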

How to multiply multiple df columns

I have a df with a number of columns.
I want to multiply each of the columns by a fixed constant.
I am looking for the best possible strategy to achieve this using purrr (I am still trying to get my head around lambdas, etc.).
library(tidyverse)
library(lubridate)
df1 <- data.frame(
date = ymd(c("2019-02-01", "2019-02-02", "2019-02-03", "2019-02-04",
"2019-02-05")),
x = c(1, 2, 3, 4, 5),
y = c(2, 3, 4, 5, 6),
z = c(3, 4, 5, 6, 7)
)
The constants to multiply each of the columns by are as follows:
c(10, 20, 30)
This is the output I expect:
data.frame(
date = ymd(c("2019-02-01", "2019-02-02", "2019-02-03", "2019-02-04",
"2019-02-05")),
x = c(10, 20, 30, 40, 50),
y = c(40, 60, 80, 100, 120),
z = c(90, 120, 150, 180, 210)
)
We can use map2 from purrr (part of the tidyverse) to achieve this.
df1[2:4] <- map2(df1[2:4], c(10, 20, 30), ~.x * .y)
df1
# date x y z
# 1 2019-02-01 10 40 90
# 2 2019-02-02 20 60 120
# 3 2019-02-03 30 80 150
# 4 2019-02-04 40 100 180
# 5 2019-02-05 50 120 210
The base R equivalent is mapply.
df1[2:4] <- mapply(FUN = function(x, y) x * y, df1[2:4], c(10, 20, 30), SIMPLIFY = FALSE)
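Another base R option (my own suggestion, not part of the original answer) is sweep(), which spreads a vector of constants across the columns without writing an anonymous function:
df1[2:4] <- sweep(df1[2:4], 2, c(10, 20, 30), `*`)  # multiply columns x, y, z by 10, 20, 30 respectively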

Lookup value in one dataframe based on column name stored as a value in another dataframe

Please see the reproducible (cut and paste) example below. The actual data set has over 4000 serial observations on 11000 people. I need to create columns A, B, C, etc. showing the NUMBER from the "Drug" variables (X, Y, Z, etc.) that corresponds to the first occurrence of a particular value of a "Disease" variable. The numbers refer to actions that were taken with particular drugs (start, stop, increase dose, etc.). The "disease" variable refers to whether the disease flared or not, in a disease that has many stages including flares and remissions.
For example:
Animal <- c("aardvark", "1", "cheetah", "dromedary", "eel", "1", "bison", "cheetah", "dromedary",
"eel")
Plant <- c("apple_tree", "blossom", "cactus", "1", "bronze", "apple_tree", "bronze", "cactus",
"dragonplant", "1")
Mineral <- c("amber", "bronze", "1", "bronze", "emerald", "1", "bronze", "bronze", "diamond",
"emerald")
Bacteria <- c("acinetobacter", "1", "1", "d-strep", "bronze", "acinetobacter", "bacillus",
"chlamydia", "bronze", "enterobacter" )
AnimalDrugA <- c(1, 11, 12, 13, 14, 15, 16, 17, 18, 19)
AnimalDrugB <- c(20, 1, 22, 23, 24, 25, 26, 27, 28, 29)
PlantDrugA <- c(301, 302, 1, 304, 305, 306, 307, 308, 309, 310)
PlantDrugB <- c(401, 402, 1, 404, 405, 406, 407, 408, 409, 410)
MineralDrugA <- c(1, 2, 3, 4, 1, 6, 7, 8, 9, 10)
MineralDrugB <- c(11, 12, 13, 1, 15, 16, 17, 18, 19, 20)
BacteriaDrugA <- c(1, 2, 3, 4, 5, 6 , 7, 8, 9, 1)
BacteriaDrugB <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
dummy_id <- c(1001, 2002, 3003, 4004, 5005, 6006, 7007, 8008, 9009, 10101)
Elements <- data.frame(dummy_id, Animal, Plant, Mineral, Bacteria, AnimalDrugA, AnimalDrugB,
PlantDrugA, PlantDrugB, MineralDrugA, MineralDrugB, BacteriaDrugA, BacteriaDrugB)
ds <- Elements[,order(names(Elements))]
ds #Got it in alphabetical order... The real data set will be re-ordered chronologically
#Now I want the first occurrence of the word "bronze" for each id
# for each subject 1 through 10. (That is, "bronze" corresponds to start of disease flare.)
first.bronze <- colnames(ds)[apply(ds,1,match,x="bronze")]
first.bronze
#Now, I want to find the number in the DrugA, DrugB variable that corresponds to the first
#occurrence of bronze.
#Using the alphabetically ordered data set, the answer should be:
#dummy_id DrugA DrugB
#1... NA NA
#2... 2 12
#3... NA NA
#4... 4 1
#5... 5 6
#6... NA NA
#7... 7 17
#8... 8 18
#9... 9 2
#10... NA NA
#Note that all first occurrences of "bronze"
# are in Mineral or Bacteria.
#As a first step, join first.bronze to the ds
ds$first.bronze <- first.bronze
ds
#Make a new ds where those who have an NA for first.bronze are excluded:
ds2 <- ds[complete.cases(ds$first.bronze),]
ds2
# Create a template data frame
out <- data.frame(matrix(nr = 1, nc = 3))
colnames(out) <- c("Form Number", "DrugA", "DrugB") # Gives correct column names
out
#Then grow the data frame...yes I realize potential slowness of computation
test <- for(i in ds2$first.bronze){
data <- rbind(colnames(ds2)[grep(i, names(ds2), ignore.case = FALSE, fixed = TRUE)])
colnames(data) <- c("Form Number", "DrugA", "DrugB") # Gives correct column names
out <- rbind(out, data)
}
out
#Then delete the first row of NAs
out <- na.omit(out)
out
#Then add the appropriate dummy_ids
dummy_id <- ds2$dummy_id
out_with_ids <- as.data.frame(cbind(dummy_id, out))
out_with_ids
Now I am stuck. I have the name of the column from ds2 listed as a value of DrugA/DrugB in the out_with_ids dataset. I have searched through Stack Overflow thoroughly, but solutions based on match, merge, replace, and the data.table package don't seem to work.
Thank you!
I think the problem here is the data format. May I suggest you store it in a "long" table, like this:
library(data.table)
dt <- data.table(dummy_id = rep(dummy_id, 4),
type = rep(c("Animal", "Bacteria", "Mineral", "Plant"), each = 10),
name = c(Animal, Bacteria, Mineral, Plant),
drugA = c(AnimalDrugA, BacteriaDrugA, MineralDrugA, PlantDrugA),
drugB = c(AnimalDrugB, BacteriaDrugB, MineralDrugB, PlantDrugB))
Then it is much easier to filter and do other operations. For example,
dt[name == "bronze"][order(dummy_id)]
Frankly I'm not sure I understand what you want to achieve in the end.
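That said, if the goal is the DrugA/DrugB lookup in the question's expected output, a hedged sketch continuing from dt could look like the following. It assumes the row order of dt (Animal, then Bacteria, Mineral, Plant) matches the alphabetical column order the question uses to define the "first" occurrence of "bronze":
# first "bronze" per id: take the first matching row within each dummy_id
first_bronze <- dt[name == "bronze", .(DrugA = drugA[1], DrugB = drugB[1]), by = dummy_id]
# join back onto all ids so subjects with no "bronze" keep NA, as in the expected output
first_bronze[data.table(dummy_id = dummy_id), on = "dummy_id"]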
