The main dataframe has a column "passings", which is its only nested variable: each cell holds a dataframe (see the example of a nested cell). The number of rows in the nested dataframes varies, but the columns are always the same: "date" and "title". I need to take the date from the row where title is "Закон прийнято" ("Law passed" in English) and put it in the main dataframe as a new variable.
I'm a newbie in coding.
I will appreciate your help!
[Image: the main dataframe]
[Image: an example of a dataframe within a nested cell]
Here is an option: loop over the 'passings' list column with map (based on the image, it is a list of two-column data.frames), filter each nested data.frame to the rows where 'title' is "Закон прийнято" (assuming exactly one such row per data.frame, since map_chr needs a single value per element), and pull the 'date' column to create a new 'date' column in the original dataset.
library(dplyr)
library(purrr)
df1 %>%
  mutate(date = map_chr(passings, ~ .x %>%
    filter(title == "Закон прийнято") %>%
    pull(date)))
# id passed passings date
#1 54949 TRUE 2015-06-10, 2015-06-08, abcb, Закон прийнято 2015-06-08
#2 55009 TRUE 2015-06-10, 2015-09-08, bcb, Закон прийнято 2015-09-08
NOTE: It works as expected.
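If some nested data frames might contain no matching row (or several), map_chr would stop with an error. The following is a minimal base R sketch of a more forgiving variant, not part of the original answer: it keeps the first match and returns NA otherwise. The third row of the example data is hypothetical and added only to show the no-match case.

```r
# Example data; the third nested data frame is a hypothetical no-match case
df1 <- data.frame(id = c(54949, 55009, 55010))
df1$passings <- list(
  data.frame(date = c("2015-06-10", "2015-06-08"),
             title = c("abcb", "Закон прийнято"), stringsAsFactors = FALSE),
  data.frame(date = c("2015-06-10", "2015-09-08"),
             title = c("bcb", "Закон прийнято"), stringsAsFactors = FALSE),
  data.frame(date = "2015-07-01", title = "abcb", stringsAsFactors = FALSE)
)

target <- "Закон прийнято"

# For each nested data frame, take the first matching date or NA if none
df1$date <- vapply(df1$passings, function(p) {
  hits <- p$date[p$title == target]
  if (length(hits) > 0) hits[1] else NA_character_
}, character(1))
```

Taking the first match is an arbitrary choice here; if several matching rows are possible, you may prefer the latest date instead.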
data
df1 <- structure(list(id = c(54949, 55009), passed = c(TRUE, TRUE),
passings = list(structure(list(date = c("2015-06-10", "2015-06-08"
), title = c("abcb", "Закон прийнято")), class = "data.frame", row.names = c(NA,
-2L)), structure(list(date = c("2015-06-10", "2015-09-08"
), title = c("bcb", "Закон прийнято")), class = "data.frame", row.names = c(NA,
-2L)))), row.names = c(NA, -2L), class = "data.frame")
Related
I've got a table such as this:
structure(list(Suggested.Symbol = c("CCT4", "DHRS2", "PMS2",
"FARSB", "RPL31", "ASNS"), p_onset = c(0.9378, 0.5983, 7.674e-10,
0.09781, 0.5495, 0.7841), p_dc14 = c(0.3975, 0.3707, 6.117e-17,
0.2975, 0.4443, 0.7661), p_tfc6 = c(0.2078, 0.896, 7.388e-19,
0.5896, 0.3043, 0.6696), p_tms30 = c(0.5724, 0.3409, 4.594e-13,
0.2403, 0.1357, 0.3422)), row.names = c(NA, 6L), class = "data.frame")
I'd like to create a new column called 'summary'. In it, row by row, I'd like to return the names of the columns whose values are < 0.05, comma separated. Is that possible?
We can loop over the rows with apply: for each row, create a logical vector marking the values below 0.05, use it to subset the column names, and paste them together with toString.
df1$summary <- apply(df1[-1], 1, \(x) toString(names(x)[x < 0.05]))
I have been comparing two data frames in R using a package called daff and this is the final table I get:
dput(df)
structure(list(v1 = c("Silva->Silva/Mark", "Brandon->Brandon/Livo", "Mango->Mango or Apple"),
v2 = c("James->James=Jacy","NA->Na/Jane", "Egg->Egg and Orange")),
class = "data.frame", row.names = c(NA, -3L))
The cells contain -> (an arrow) to show that the value was modified from the previous data frame's value to the current one. From here I had to split each column on the -> separator so that I have an old column and a new, changed column, with the suffixes _Old and _New added to the new column names. I used this code and got the output below:
library(data.table)
library(magrittr)  # for %>%
setDT(df)
df1 <- lapply(names(df), function(x) {
  mDT <- df[, tstrsplit(get(x), " *-> *")]
  if (ncol(mDT) == 2L) setnames(mDT, paste0(x, c("_Old", "_New")))
}) %>% as.data.table()
OUTPUT
dput(df)
structure(list(v1_Old = c("Silva", "Brandon", "Mango"),
v1_New = c("Silva/Mark", "Brandon/Livo", "Mango or Apple"),
v2_Old = c("James","NA", "Egg"),
v2_New = c("James=Jacy","Na/Jane", "Egg and Orange")),
class = "data.frame", row.names = c(NA, -3L))
My next step is to compare each pair of _Old and _New columns to identify what was modified, then store the difference in new columns called diff_v1 and diff_v2. I did this with the code below (I have to do it manually, writing a separate line for each pair, which is tedious with over 20 separated columns):
df$diff_v1 <- mapply(function(x, y) paste(setdiff(y, x), collapse = '| '), strsplit(df$v1_Old, '\\||, | | -| \\+'), strsplit(df$v1_New, '\\||, | | -| \\+'))
df$diff_v2 <- mapply(function(x, y) paste(setdiff(y, x), collapse = '| '), strsplit(df$v2_Old, '\\||, | | -| \\+'), strsplit(df$v2_New, '\\||, | | -| \\+'))
OUTPUT
dput(df)
structure(list(v1_Old = c("Silva", "Brandon", "Mango"),
v1_New = c("Silva/Mark", "Brandon/Livo", "Mango or Apple"),
diff_v1 = c("/Mark", "/Livo", "or Apple"),
v2_Old = c("James","NA", "Egg"),
v2_New = c("James=Jacy","Na/Jane", "Egg and Orange"),
diff_v2 = c("=Jacy","/Jane", "and Orange")),
class = "data.frame", row.names = c(NA, -3L))
My question is: can I loop through the columns with _Old and _New suffixes and create the new columns diff_v1 and diff_v2 without writing the code line by line? I have many columns, and they change depending on which data frames I am comparing. I want to know how to automatically identify the columns with the _Old and _New suffixes, compare each pair, and create the new column after the two.
Currently I have to go to the data frame, check which columns have old and new versions, and then manually change the code that splits them and creates the diff column.
We can identify the "Old" and "New" columns by name using grep. str_remove is vectorized over both string and pattern, so it removes from each "New" value the part that matches the corresponding "Old" value, and the results become the new columns. Note that str_remove treats the pattern as a regular expression.
old_cols <- grep("Old$", names(df), value = TRUE)
new_cols <- grep("New$", names(df), value = TRUE)
df[sub("New$", "diff", new_cols)] <- Map(stringr::str_remove,
                                         df[new_cols], df[old_cols])
To get the names in order, we can do
df <- df[order(sub("_.*", "", names(df)))]
df
# v1_Old v1_New v1_diff v2_Old v2_New v2_diff
#1 Silva Silva/Mark /Mark James James=Jacy =Jacy
#2 Brandon Brandon/Livo /Livo NA Na/Jane Na/Jane
#3 Mango Mango or Apple or Apple Egg Egg and Orange and Orange
Using tidyverse, we can do
library(tidyverse)
df %>%
  bind_cols(map2(df %>% select(ends_with("New")),
                 df %>% select(ends_with("Old")), stringr::str_remove))
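Because str_remove interprets the old value as a regular expression, old values containing characters such as "+" or "(" could behave unexpectedly. As a hedged alternative, here is a small base R sketch (not the answerer's code) that pairs columns by their shared prefix and removes the old value as a literal string with sub(..., fixed = TRUE). It assumes every prefix has both an _Old and a _New column.

```r
df <- data.frame(
  v1_Old = c("Silva", "Brandon", "Mango"),
  v1_New = c("Silva/Mark", "Brandon/Livo", "Mango or Apple"),
  v2_Old = c("James", "NA", "Egg"),
  v2_New = c("James=Jacy", "Na/Jane", "Egg and Orange"),
  stringsAsFactors = FALSE
)

# Prefixes such as "v1", "v2", derived from the *_Old column names
prefixes <- sub("_Old$", "", grep("_Old$", names(df), value = TRUE))

for (p in prefixes) {
  old <- df[[paste0(p, "_Old")]]
  new <- df[[paste0(p, "_New")]]
  # Remove the first literal occurrence of the old value from the new value
  df[[paste0(p, "_diff")]] <- mapply(
    function(n, o) sub(o, "", n, fixed = TRUE),
    new, old, USE.NAMES = FALSE
  )
}
```

The remainder keeps any leading whitespace (for example " or Apple"); wrap the result in trimws() if that is not wanted.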
I'm having an issue with separating rows in a dataframe that I'm working with.
In my dataframe, there's a column called officialIndices that I want to separate the rows by. This column stores a list of numbers that act as indexes to indicate which rows have the same data. For example, indices 2:3 means that rows 2:3 have the same data.
Here is the code that I am working with.
library(jsonlite)
library(tidyr)
offices_list <- data_google$offices
offices_JSON <- toJSON(offices_list)
offices_from_JSON <-
  separate_rows(fromJSON(offices_JSON), officialIndices, convert = TRUE)
This is what my offices_list frame looks like
This is what it looks like after I try to separate the rows
My code works fine for indices like 2:3, where the difference is 1. However, for indices like 7:10 it separates the rows into 7 and 10 instead of 7, 8, 9, 10, which is what I want. How can I get my code to separate the rows like this?
Output of dput(head(offices_list))
structure(list(position = c("President of the United States",
"Vice-President of the United States", "United States Senate",
"Governor", "Mayor", "Auditor"), divisionId = c("ocd-division/country:us",
"ocd-division/country:us", "ocd-division/country:us/state:or",
"ocd-division/country:us/state:or", "ocd-division/country:us/state:or/place:portland",
"ocd-division/country:us/state:or/place:portland"), levels = list(
"country", "country", "country", "administrativeArea1", NULL,
NULL), roles = list(c("headOfState", "headOfGovernment"),
"deputyHeadOfGovernment", "legislatorUpperBody", "headOfGovernment",
NULL, NULL), officialIndices = list(0L, 1L, 2:3, 4L, 5L,
6L)), row.names = c(NA, 6L), class = "data.frame")
This should work. I expect it to work for further rows too, since I tested it on ranges spanning more than two indices in officialIndices.
First I extract the start and end indices and use their difference to determine how many rows are needed. Then tidyr::uncount() adds that many copies of the row.
library(dplyr); library(tidyr)
data_sep <- data %>%
  separate(officialIndices, into = c("start", "end"), sep = ":") %>%
  # Use 1 row, and more if "end" is defined and larger than "start"
  mutate(rows = 1 + if_else(is.na(end), 0, as.numeric(end) - as.numeric(start))) %>%
  uncount(rows)
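The approach above duplicates each row the right number of times. If each copy should also carry its own index value (7, 8, 9, 10), one option is to expand every range with seq(). This is a minimal base R sketch under the assumption that officialIndices is stored as text such as "4" or "7:10"; the example rows are hypothetical.

```r
# Hypothetical data: single indices as "4", ranges as "start:end"
offices <- data.frame(
  position = c("Governor", "Council"),
  officialIndices = c("4", "7:10"),
  stringsAsFactors = FALSE
)

# Split each cell on ":" and expand to every integer from start to end
bounds <- strsplit(offices$officialIndices, ":", fixed = TRUE)
idx <- lapply(bounds, function(b) {
  b <- as.integer(b)
  seq(b[1], b[length(b)])
})

# Repeat each row once per index and attach the expanded indices
expanded <- data.frame(
  position = rep(offices$position, lengths(idx)),
  officialIndices = unlist(idx),
  stringsAsFactors = FALSE
)
```

Here expanded has one row for "Governor" and four rows for "Council", one per index from 7 to 10.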
I've sliced text into columns in R before, but I'm having trouble slicing inside a single column. Let's say I have this data:
ZipCode <- c("90042 34.11332407100048 -118.19142869099971",
             "90040 33.99649121800047 -118.15148940099971",
             "90007 34.02833141800045 -118.28507659499968")
I want to extract just the zip code and place it in its own column. The latitude and longitude will also need to go into separate columns.
Do I use grep?
We can use tidyverse
library(tidyverse)
separate(dat, ZipCode, into = c('value', 'lat', 'lon'), sep= ' ')
# A tibble: 3 x 3
# value lat lon
#* <chr> <chr> <chr>
#1 90042 34.11332407100048 -118.19142869099971
#2 90040 33.99649121800047 -118.15148940099971
#3 90007 34.02833141800045 -118.28507659499968
data
dat <- structure(list(ZipCode = c("90042 34.11332407100048 -118.19142869099971",
"90040 33.99649121800047 -118.15148940099971", "90007 34.02833141800045 -118.28507659499968"
)), .Names = "ZipCode", class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -3L))
I have a dataframe with two odd variables. For one variable, each cell stores a list whose content is simply a vector of two numbers. For the other variable, each cell stores a three-dimensional array (even though only two dimensions are necessary) of 8 numbers.
I want to simplify the dataset by breaking out the odd variables into separate variables. I figured out how to break all the data out using a for loop, but this is very slow. I know apply is supposed to be generally quicker, but I can't figure out how to translate this to apply. Is it possible, or is there a better way to do this?
for (i in 1:nrow(df)) {
  if (length(df$coordinates.coordinates[[i]]) > 0) {
    df[i, "coordinates.lon"] <- df$coordinates.coordinates[[i]][1]
    df[i, "coordinates.lat"] <- df$coordinates.coordinates[[i]][2]
  }
  if (length(df$place.bounding_box.coordinates[[i]]) > 0) {
    df[i, "place.bounding_box.a.lon"] <- df$place.bounding_box.coordinates[[i]][1, 1, 1]
    df[i, "place.bounding_box.b.lon"] <- df$place.bounding_box.coordinates[[i]][1, 2, 1]
    df[i, "place.bounding_box.c.lon"] <- df$place.bounding_box.coordinates[[i]][1, 3, 1]
    df[i, "place.bounding_box.d.lon"] <- df$place.bounding_box.coordinates[[i]][1, 4, 1]
    df[i, "place.bounding_box.a.lat"] <- df$place.bounding_box.coordinates[[i]][1, 1, 2]
    df[i, "place.bounding_box.b.lat"] <- df$place.bounding_box.coordinates[[i]][1, 2, 2]
    df[i, "place.bounding_box.c.lat"] <- df$place.bounding_box.coordinates[[i]][1, 3, 2]
    df[i, "place.bounding_box.d.lat"] <- df$place.bounding_box.coordinates[[i]][1, 4, 2]
  }
}
EDIT
Here is an example dataframe with one case (via dput)
structure(list(coordinates.coordinates = list(c(112.088477, -7.227974
)), place.bounding_box.coordinates = list(structure(c(112.044456,
112.044456, 112.143242, 112.143242, -7.263067, -7.134563, -7.134563,
-7.263067), .Dim = c(1L, 4L, 2L)))), .Names = c("coordinates.coordinates",
"place.bounding_box.coordinates"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -1L))
In case it helps, this is the data format that gets out when you try to read Twitter stream data using jsonlite's stream_in function (with flatten=TRUE)
library(dplyr)
# Small example data; tibble() replaces the deprecated data_frame()
df = tibble(
  coordinates.coordinates =
    list(c(0, 1), c(2, 3)),
  place.bounding_box.coordinates =
    list(array(0, dim = c(1, 4, 2)),
         array(1, dim = c(1, 4, 2))))
# For each row, build a one-row table from the two list cells
df %>%
  rowwise() %>%
  do(with(., tibble(
    longitude = coordinates.coordinates[1],
    latitude = coordinates.coordinates[2]) %>% bind_cols(
      place.bounding_box.coordinates %>%
        as.data.frame %>%
        setNames(c(
          "place.bounding_box.a.lon",
          "place.bounding_box.b.lon",
          "place.bounding_box.c.lon",
          "place.bounding_box.d.lon",
          "place.bounding_box.a.lat",
          "place.bounding_box.b.lat",
          "place.bounding_box.c.lat",
          "place.bounding_box.d.lat")))))
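To address the empty cells that the question's for loop guards against, here is a hedged base R sketch using vapply. It flattens each cell to a fixed-length numeric vector and pads empty cells with NA. The helper name flatten_cell and the second (empty) example row are hypothetical; the first row uses the values from the question's dput output.

```r
# First element from the question's dput(); second is a hypothetical empty cell
coords <- list(c(112.088477, -7.227974), NULL)
boxes <- list(
  array(c(112.044456, 112.044456, 112.143242, 112.143242,
          -7.263067, -7.134563, -7.134563, -7.263067), dim = c(1, 4, 2)),
  NULL
)

# Hypothetical helper: flatten a cell to n numbers, or n NAs if it is empty
flatten_cell <- function(x, n) {
  if (length(x) > 0) as.numeric(x) else rep(NA_real_, n)
}

# vapply returns one column per cell, so transpose to one row per cell
lonlat <- t(vapply(coords, flatten_cell, numeric(2), n = 2))
colnames(lonlat) <- c("coordinates.lon", "coordinates.lat")

# A 1 x 4 x 2 array flattens in the order a..d longitude, then a..d latitude
bbox <- t(vapply(boxes, flatten_cell, numeric(8), n = 8))
colnames(bbox) <- paste0("place.bounding_box.",
                         rep(c("a", "b", "c", "d"), times = 2),
                         rep(c(".lon", ".lat"), each = 4))

flat <- data.frame(lonlat, bbox)
```

The resulting flat data frame can then be bound to the remaining columns of the original data with cbind().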