Casting data correctly in R using the grep function

I'm trying to reshape my data based on the value in a particular column (i.e. "Up" and "Down"). The Up and Down rows are not in the same order throughout the data frame, so I'm having difficulty "casting" the data into the right shape.
I've tried using the cast function to reshape the data, but I can't get consistent (i.e. accurate) results.
This is my input:
input = structure(list(X = 1:6, Report = c("Sales.csv", "Sales.csv",
"Sales.csv", "Sales.csv", "Sales.csv", "Sales.csv"), Shock = c("Currencies.USD_Up",
"Currencies.USD_Down", "Currencies.AUD_Up", "Currencies.AUD_Down",
"Currencies.EUR_Down", "Currencies.EUR_Up"), Result = c(-519375.9816,
-7388851.423, -42950.77683, -667.367063, -12819532.15, -138054.0061
), FX = c("USD", "USD", "AUD", "AUD", "EUR", "EUR")), class = "data.frame", row.names = c(NA,
-6L))
and this is my preferred output:
output = structure(list(X = 1:3, Report = c("Sales.csv", "Sales.csv",
"Sales.csv"), Shock = c("Currencies.USD", "Currencies.AUD", "Currencies.EUR"
), Currency = c("USD", "AUD", "EUR"), Up = c(-519375.9816, -42950.77683,
-138054.0061), Down = c(-7388851.423, -667.367063, -12819532.15
)), class = "data.frame", row.names = c(NA, -3L))
Because the EUR data in the input is in a different order, I can't seem to make the data shape correctly. I've tried using the grep function to order this, but I can't make this work. Can anyone suggest a better way?

This is a tidyverse approach to do it:
library(dplyr)
library(tidyr)
library(tibble)
input %>%
as_tibble() %>%
separate(Shock, c("Shock", "tmp"), sep = "_") %>%
rename(Currency = FX) %>%
select(-X) %>%
spread(tmp, Result) %>%
mutate(X = row_number()) %>%
select(X, Report, Shock, Currency, Up, Down)
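For readers on a recent tidyr, where spread() is superseded, the same reshape can be sketched with pivot_wider(). This is only an illustration under stated assumptions: the column name Direction and the object name wide are made up here, and the input is rebuilt with data.frame() using the values from the question.

```r
library(dplyr)
library(tidyr)

# Input rebuilt from the question's dput output
input <- data.frame(
  X = 1:6,
  Report = "Sales.csv",
  Shock = c("Currencies.USD_Up", "Currencies.USD_Down", "Currencies.AUD_Up",
            "Currencies.AUD_Down", "Currencies.EUR_Down", "Currencies.EUR_Up"),
  Result = c(-519375.9816, -7388851.423, -42950.77683,
             -667.367063, -12819532.15, -138054.0061),
  FX = c("USD", "USD", "AUD", "AUD", "EUR", "EUR")
)

wide <- input %>%
  select(-X) %>%                                   # X would keep each row unique
  separate(Shock, into = c("Shock", "Direction"), sep = "_") %>%
  rename(Currency = FX) %>%
  # One column per direction; matching is by value, so row order is irrelevant
  pivot_wider(names_from = Direction, values_from = Result) %>%
  mutate(X = row_number()) %>%
  select(X, Report, Shock, Currency, Up, Down)

print(wide)
```

Because the Up and Down values are matched on Report, Shock and Currency rather than on position, the EUR rows land in the right columns even though they appear in reverse order in the input.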

Related

use dplyr to get list items from dataframe in R

I have a dataframe being returned from Microsoft365R:
SKA_student <- structure(list(name = "Computing SKA 2021-22.xlsx", size = 22266L,
lastModifiedBy =
structure(list(user =
structure(list(email = "my#email.com",
id = "8ae50289-d7af-4779-91dc-e4638421f422",
displayName = "Name, My"), class = "data.frame", row.names = c(NA, -1L))),
class = "data.frame", row.names = c(NA, -1L)),
fileSystemInfo = structure(list(
createdDateTime = "2021-09-08T16:03:38Z",
lastModifiedDateTime = "2021-09-16T00:09:04Z"), class = "data.frame", row.names = c(NA,-1L))), row.names = c(NA, -1L), class = "data.frame")
I can return all the lastModifiedBy data through:
SKA_student %>% select(lastModifiedBy)
lastModifiedBy.user.email lastModifiedBy.user.id lastModifiedBy.user.displayName
1 my#email.com 8ae50289-d7af-4779-91dc-e4638421f422 Name, My
But if I want a specific item in the lastModifiedBy list, it doesn't work, e.g.:
SKA_student %>% select(lastModifiedBy.user.email)
Error: Can't subset columns that don't exist.
x Column `lastModifiedBy.user.email` doesn't exist.
I can get this working through base, but would really like a dplyr answer

This function allows you to flatten all the list columns (I found this ages ago on SO but can't find the original post for credit)
SO_flat_cols <- function(data) {
ListCols <- sapply(data, is.list)
cbind(data[!ListCols], t(apply(data[ListCols], 1, unlist)))
}
Then you can select as you like.
SO_flat_cols (SKA_student) %>%
select(lastModifiedBy.user.email)
Alternatively you can get to the end by recursively pulling the lists
SKA_student %>%
pull(lastModifiedBy) %>%
pull(user) %>%
select(email)
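The select() call in the question fails because lastModifiedBy is a single column that happens to hold a data frame; the dotted names are only how print() displays it. A minimal sketch of one more dplyr route, assuming a cut-down copy of the question's data (only name, email and displayName are kept, and the object name emails is made up here), is to reach into the nested data frame with $ inside mutate():

```r
library(dplyr)

# Cut-down copy of the question's structure: `lastModifiedBy` is a data frame
# column that itself holds a data frame column called `user`.
user <- data.frame(email = "my#email.com", displayName = "Name, My")
modified <- structure(list(user = user),
                      class = "data.frame", row.names = c(NA, -1L))
SKA_student <- structure(
  list(name = "Computing SKA 2021-22.xlsx", lastModifiedBy = modified),
  class = "data.frame", row.names = c(NA, -1L)
)

emails <- SKA_student %>%
  # Inside mutate(), lastModifiedBy is the nested data frame, so $ works on it
  mutate(email = lastModifiedBy$user$email) %>%
  select(name, email)

print(emails)
```

This keeps the top-level columns available alongside the extracted value, which the pull() chain does not.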

You could use
library(dplyr)
library(tidyr)
SKA_student %>%
unnest_wider(lastModifiedBy) %>%
select(email)
This returns
# A tibble: 1 x 1
email
<chr>
1 my#email.com

Removing rows between two ID-values in panel data set

I have a panel data set with the following columns: "ID", "Year", "Poverty rate", "Health services".
I have data from 2011-2013, and the table is ordered after the value of ID, looking something like this:
merged_data_frame = structure(list(ID = c(1001,1001,1001,2001,2001,2001,2002,2002,2002,2003,2003,2003,3001,3001,3001),
Year = c(2011,2012,2013,2011,2012,2013,2011,2012,2013,2011,2012,2013,2011,2012,2013),
Poverty_rate = c(0.5,0.4,0.3,0.45,0.1,0.35,0.55,0.55,0.55,0.6,0.7,0.8,0.1,0.11,0.1 )), row.names = c(1:15), class = "data.frame")
How do I remove the rows with an ID between 2001 and 2003? My actual dataset has more than 5000 values, so I need something that removes everything between 2001 and 2xxx.
I managed to remove one value at a time, but that is not an option given the size of the data set:
new_data_frame<-subset(merged_data_frame, merged_data_frame$ID!=20013)

Try this, using filter(!between(ID, 2001, 2003)). Note that !ID %in% c(2001, 2003) would drop only those two IDs and keep 2002.
library(dplyr)
merged_data_frame = structure(list(ID = c(1001,1001,1001,2001,2001,2001,2002,2002,2002,2003,2003,2003,3001,3001,3001),
Year = c(2011,2012,2013,2011,2012,2013,2011,2012,2013,2011,2012,2013,2011,2012,2013),
Poverty_rate = c(0.5,0.4,0.3,0.45,0.1,0.35,0.55,0.55,0.55,0.6,0.7,0.8,0.1,0.11,0.1 )), row.names = c(1:15), class = "data.frame")
# Keep only the rows whose ID lies outside the closed range 2001 to 2003
df = merged_data_frame %>%
filter(!between(ID, 2001, 2003))
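Since the question started from subset(), here is a hedged base R sketch of removing a whole range of IDs rather than listed values. The bounds lower and upper are illustrative names, and the data frame is rebuilt from the question's values with rep() for brevity.

```r
# Panel data rebuilt from the question: 5 IDs observed over 3 years each
merged_data_frame <- data.frame(
  ID = rep(c(1001, 2001, 2002, 2003, 3001), each = 3),
  Year = rep(2011:2013, times = 5),
  Poverty_rate = c(0.5, 0.4, 0.3, 0.45, 0.1, 0.35, 0.55, 0.55,
                   0.55, 0.6, 0.7, 0.8, 0.1, 0.11, 0.1)
)

lower <- 2001  # first ID to drop (inclusive)
upper <- 2003  # last ID to drop (inclusive)

# Keep a row only if its ID falls outside [lower, upper]
new_data_frame <- subset(merged_data_frame, ID < lower | ID > upper)

print(new_data_frame)
```

A range comparison like this does not need the IDs in between to be known in advance, which matters when the real data has thousands of them.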

How to select one value of a data.frame within a list column with R?

I have a data.frame that contains a list-type column. Each list element contains a 1x3 data.frame. I only want one value from each element. This will flatten my data.frame so I can write it out as a csv.
How do I select one item from the nested data.frame (see the 2nd column)?
Here's the nested col. I'd provide the data but cannot flatten to write_csv.
result of dput:
structure(list(id = c("1386707", "1386700", "1386462", "1386340",
"1386246", "1386300"), fields.created = c("2020-05-07T02:09:27.000-0700",
"2020-05-07T01:20:11.000-0700", "2020-05-06T21:38:14.000-0700",
"2020-05-06T07:19:44.000-0700", "2020-05-06T06:11:43.000-0700",
"2020-05-06T02:26:44.000-0700"), fields.customfield_10303 = c(NA,
NA, 3, 3, NA, NA), fields.customfield_28100 = list(NULL, structure(list(
self = ".../rest/api/2/customFieldOption/76412",
value = "New Feature", id = "76412"), .Names = c("self",
"value", "id"), class = "data.frame", row.names = 1L), structure(list(
self = ".../rest/api/2/customFieldOption/76414",
value = "Technical Debt", id = "76414"), .Names = c("self",
"value", "id"), class = "data.frame", row.names = 1L), NULL,
structure(list(self = ".../rest/api/2/customFieldOption/76411",
value = "Maintenance", id = "76411"), .Names = c("self",
"value", "id"), class = "data.frame", row.names = 1L), structure(list(
self = ".../rest/api/2/customFieldOption/76412",
value = "New Feature", id = "76412"), .Names = c("self",
"value", "id"), class = "data.frame", row.names = 1L))), row.names = c(NA,
6L), class = "data.frame", .Names = c("id", "fields.created",
"fields.customfield_10303", "fields.customfield_28100"))

I found a way to do this.
First, instead of changing the data, I added a column with mutate. Then, directly selected the same column from all nested lists. Then, I converted the list column into a vector. Finally, I cleaned it up by removing the other columns.
It seems to work. I don't know yet how it will handle multiple rows within the nested df.
dat <- sample_dat %>%
mutate(cats = sapply(nested_col, `[[`, 2)) %>%
mutate(categories = sapply(cats, toString)) %>%
select(-nested_col, -cats)
Related
How to directly select the same column from all nested lists within a list?
r-convert list column into character vector where lists are characters
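As a rough base R sketch of the same idea, the NULL entries and the open question about multiple rows can be handled explicitly with vapply(). The data below is a three-row stand-in for the question's dput output, and the column names nested_col and category are assumptions carried over from the answer rather than the real field names.

```r
# Small stand-in for the question's data: a list column whose elements are
# either NULL or a one-row data.frame with self / value / id
dat <- data.frame(id = c("1386707", "1386700", "1386462"))
dat$nested_col <- list(
  NULL,
  data.frame(self = ".../76412", value = "New Feature", id = "76412"),
  data.frame(self = ".../76414", value = "Technical Debt", id = "76414")
)

# Pull `value` out of each element; NULL becomes NA, and several rows in one
# nested data.frame would be collapsed into a single comma-separated string
dat$category <- vapply(
  dat$nested_col,
  function(x) if (is.null(x)) NA_character_ else toString(x$value),
  character(1)
)

# Drop the list column so the result is flat enough for write.csv()
dat$nested_col <- NULL
print(dat)
```

vapply() with character(1) fails loudly if any element does not reduce to a single string, which is a useful check before writing the file.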

library(dplyr)
library(tidyr)
df <- tibble(Group=c("A","A","B","C","D","D"),
Batman=1:6,
Superman=c("red","blue","orange","red","blue","red"))
nested <- df %>%
nest(data=-Group)
unnested <- nested %>%
unnest(data)
Nesting and unnesting data with tidyr
library(purrr)
nested %>%
mutate(data=map(data,~select(.x,2))) %>%
unnest(data)
select with purrr, but lapply as you've done is fine, it's just for aesthetics ;)

Create a multiline plot from a dataset with time on one axis and genes on the other

I have a dataset with mean gene counts for each decade as shown below:
structure(list(decade_0 = c(92.500989948184, 2788.27384875413,
28.6937227408861, 1988.03831525414, 1476.83143096418), decade_1 = c(83.4606306426572,
537.725421951383, 10.2747132062782, 235.380422949258, 685.043600629146
), decade_2 = c(188.414375201462, 2091.84249935145, 17.080858894829,
649.55107199935, 1805.3484565514), decade_3 = c(43.3316024314987,
141.64396529835, 2.77851259926935, 94.7748265692319, 413.248354335235
), decade_4 = c(54.4891626582901, 451.076574268175, 12.4298374245007,
346.102609621018, 769.215535857077), decade_5 = c(85.5621750431284,
131.822699578988, 13.3130607062134, 151.002200923853, 387.727911723968
), decade_6 = c(112.860998806804, 4844.59668489898, 19.7317645111144,
2084.76584309876, 766.375852567831), decade_7 = c(73.2198969730458,
566.042952305845, 3.2457873699886, 311.853982701609, 768.801733767044
), decade_8 = c(91.8161648275608, 115.161700090147, 10.7289451320065,
181.747670625714, 549.21661120626), decade_9 = c(123.31045087146,
648.23694540667, 17.7690326882018, 430.301803845829, 677.187054208271
)), row.names = c("ANK1", "NTN4", "PTPRH", "JAG1", "PLAT"), class = "data.frame")
I would like to plot a line graph with the changes in counts over time for each of >30 genes as shown here in excel.
To do this with ggplot I have to convert it to col1: decade, col2: gene, col3: counts.
My question is, either how to convert my table into this ggplot friendly table, or if there is a better way to produce the plot with a different tool?
Thanks!

One possibility: transpose your data frame, convert rownames to columns, then gather ("make long"). Plotting is then easy.
library(tidyverse)
mydat <- structure(list(decade_0 = c(92.500989948184, 2788.27384875413,
28.6937227408861, 1988.03831525414, 1476.83143096418), decade_1 = c(83.4606306426572,
537.725421951383, 10.2747132062782, 235.380422949258, 685.043600629146
), decade_2 = c(188.414375201462, 2091.84249935145, 17.080858894829,
649.55107199935, 1805.3484565514), decade_3 = c(43.3316024314987,
141.64396529835, 2.77851259926935, 94.7748265692319, 413.248354335235
), decade_4 = c(54.4891626582901, 451.076574268175, 12.4298374245007,
346.102609621018, 769.215535857077), decade_5 = c(85.5621750431284,
131.822699578988, 13.3130607062134, 151.002200923853, 387.727911723968
), decade_6 = c(112.860998806804, 4844.59668489898, 19.7317645111144,
2084.76584309876, 766.375852567831), decade_7 = c(73.2198969730458,
566.042952305845, 3.2457873699886, 311.853982701609, 768.801733767044
), decade_8 = c(91.8161648275608, 115.161700090147, 10.7289451320065,
181.747670625714, 549.21661120626), decade_9 = c(123.31045087146,
648.23694540667, 17.7690326882018, 430.301803845829, 677.187054208271
)), row.names = c("ANK1", "NTN4", "PTPRH", "JAG1", "PLAT"), class = "data.frame")
newdat <- mydat %>% t() %>% as.data.frame() %>% tibble::rownames_to_column('decade') %>%
pivot_longer(-decade, names_to = 'gene', values_to = 'count')
ggplot(newdat) + geom_line(aes(decade, count, color = gene, group = gene))
Created on 2020-02-14 by the reprex package (v0.3.0)
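A variation that skips the transpose is sketched below: move the gene names from the row names into a column first, then pivot the decade columns. It assumes a two-gene, three-decade excerpt of the question's data with the counts rounded for brevity, and it converts the decade labels to integers so the x axis is numeric; the names long and p are made up here.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)

# Excerpt of the question's table (values rounded for brevity)
mydat <- data.frame(
  decade_0 = c(92.5, 2788.3),
  decade_1 = c(83.5, 537.7),
  decade_2 = c(188.4, 2091.8),
  row.names = c("ANK1", "NTN4")
)

long <- mydat %>%
  rownames_to_column("gene") %>%                       # col: gene
  pivot_longer(-gene, names_to = "decade", values_to = "count") %>%
  mutate(decade = as.integer(sub("decade_", "", decade)))  # "decade_3" -> 3

# One line per gene; colour doubles as the grouping variable
p <- ggplot(long, aes(decade, count, color = gene)) +
  geom_line()
print(p)
```

With more than 30 genes the legend gets crowded, so facetting or highlighting a few genes may read better than one colour per line.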

R - count of items in line chart: match DateTime to count of items

I have a dataframe with the following structure:
df <- structure(list(Name = structure(1:9, .Label = c("task 1", "task 2",
"task 3", "task 4", "task 5", "task 6", "task 7", "task 8", "task 9"
), class = "factor"), Start = structure(c(1479799800, 1479800100,
1479800400, 1479800700, 1479801000, 1479801300, 1479801600, 1479801900,
1479802200), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1479801072,
1479800892, 1479801492, 1479802092, 1479802692, 1479803292, 1479803892,
1479804492, 1479805092), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("Name",
"Start", "End"), row.names = c(NA, -9L), class = "data.frame")
Now I want to count the items in column "Name" over time. Each has a start and an end datetime, both formatted as POSIXct.
With help of this solution here on SO I was able to do so (or at least I think I was) with following code:
library(data.table)
setDT(df)
dates = seq(min(df$Start), max(df$End), by = "min")
lookup = data.table(Start = dates, End = dates, key = c("Start", "End"))
ans = foverlaps(df, lookup, type = "any", which = TRUE)
library(ggplot2)
ggplot(ans[, .N, by = yid], aes(x = yid, y = N)) + geom_line()
Problem now:
How do I match my DateTime-scale to those integer values on the x-axis? Or is there a faster and better solution to solve my problem?
I tried to use x = as.POSIXct(yid, format = "%Y-%m-%dT%H:%M:%S", origin = min(df$Start)) within the aes of the ggplot(). But that didn't work.
EDIT:
When using the solution for this problem, I face another one. Minutes with no count are drawn at the count of the latest countable item in the plot. This is why we have to merge (left join) the table with the counts (ans) again with a complete sequence of all datetimes and put a 0 for every NA. That way we get explicit values for every necessary data point.
Like this:
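# The part we use to count and match the right times (needs dplyr and lubridate)
library(dplyr)
library(lubridate)
df1 <- ans[, .N, by = yid] %>%
# yid is a 1-based index into dates, so the offset is yid - 1 minutes
mutate(time = min(df$Start) + minutes(yid - 1))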
# The part where we use the sequence from the beginning for a LEFT JOIN with the counting dataframe
df2 <- data.frame(time = dates)
dt <- merge(x = df2, y = df1, by = "time", all.x = TRUE)
dt[is.na(dt)] <- 0

In the tidyverse framework, this is a slightly different task:
Generate the same dates variable you have.
Construct a data frame with all tasks and all times (a Cartesian join).
Filter out the rows whose time is not in the interval for each task.
Add up the tasks that remain for each minute.
Plot.
That looks something like this:
library(tidyverse)
library(lubridate)
dates = seq(min(df$Start), max(df$End), by = "min")
df %>%
mutate(key = 1) %>%
left_join(tibble(key = 1, times = dates)) %>%
mutate(include = times %within% interval(Start, End)) %>%
filter(include) %>%
group_by(times) %>%
summarise(count = n()) %>%
ggplot(aes(times, count)) +
geom_line()
#> Joining, by = "key"
If you need it to be faster, it will almost certainly be faster using your original data.table code.
Consider this.
library(data.table)
setDT(df)
dates = seq(min(df$Start), max(df$End), by = "min")
lookup = data.table(Start = dates, End = dates, key = c("Start", "End"))
ans = foverlaps(df, lookup, type = "any", which = TRUE)
ans[, .N, by = yid] %>%
# yid is a 1-based index into dates, so the offset is yid - 1 minutes
mutate(time = min(df$Start) + minutes(yid - 1)) %>%
ggplot(aes(time, N)) +
geom_line()
Now we use data.table to calculate the overlap, and then index time off the starting minute. Once we add a new column with the times, we can plot.
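As a cross-check on either approach, the count per minute can also be computed directly in base R. This is only a sketch, not the method used above: it takes the first three tasks from the question (with the time zone fixed to UTC as an assumption) and counts, for each minute, how many tasks have started and not yet ended. Minutes with no running task come out as 0 without a separate join.

```r
# First three tasks from the question; tz = "UTC" is an assumption
to_time <- function(x) as.POSIXct(x, origin = "1970-01-01", tz = "UTC")
df <- data.frame(
  Name  = paste("task", 1:3),
  Start = to_time(c(1479799800, 1479800100, 1479800400)),
  End   = to_time(c(1479801072, 1479800892, 1479801492))
)

dates <- seq(min(df$Start), max(df$End), by = "min")

# For every minute, count the tasks whose [Start, End] interval contains it.
# Indexing with seq_along keeps the POSIXct class of each element of dates.
counts <- data.frame(
  time = dates,
  N = vapply(
    seq_along(dates),
    function(i) sum(df$Start <= dates[i] & df$End >= dates[i]),
    integer(1)
  )
)

head(counts)
```

This is quadratic in the worst case (minutes times tasks), so for large inputs foverlaps() remains the better choice; the value of the sketch is that time and N are aligned by construction.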
