R: add a new column to dataframes from a function

I have many tibbles similar to this:
dftest_tw <- structure(list(text = c("RT #BitMEXdotcom: A new high: US$500M turnover in the last 24 hours, over 80% of it on $XBTUSD. Congrats to the team and thank you to our u…",
"RT #Crowd_indicator: Thank you for this nice video, #Nicholas_Merten",
"RT #Crowd_indicator: Review of #Cindicator by DataDash: t.co/D0da3u5y3V"
), Tweet.id = c("896858423521837057", "896858275689398272", "896858135314538497"
), created.date = structure(c(17391, 17391, 17391), class = "Date"),
created.week = c(33, 33, 33)), .Names = c("text", "Tweet.id",
"created.date", "created.week"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
For testing, we add another one:
dftest2_tw <- dftest_tw
I keep the names of my data frames in a character vector:
myUserList <- ls(,pattern = "_tw")
What I am looking to do, inside a function, is:
1- add a new column named Twitter.name
2- fill the column with the df name.
The following code works for each df taken one by one:
dftest_tw %>% rowwise() %>% mutate(Twitter.name = myUserList[1])
The desired result is this:
MyRes <- structure(list(text = c("RT #BitMEXdotcom: A new high: US$500M turnover in the last 24 hours, over 80% of it on $XBTUSD. Congrats to the team and thank you to our u…",
"RT #Crowd_indicator: Thank you for this nice video, #Nicholas_Merten",
"RT #Crowd_indicator: Review of #Cindicator by DataDash: t.co/D0da3u5y3V"
), Tweet.id = c("896858423521837057", "896858275689398272", "896858135314538497"
), created.date = structure(c(17391, 17391, 17391), class = "Date"),
created.week = c(33, 33, 33), retweet = c(0, 0, 0), custom = c(0,
0, 0), Twitter.name = c("dftest_tw", "dftest_tw", "dftest_tw"
)), .Names = c("text", "Tweet.id", "created.date", "created.week",
"retweet", "custom", "Twitter.name"), class = c("rowwise_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -3L))
When it comes to writing a function that can then be applied to all my df (more than 100), I can't get it to work. Any help would be appreciated.

We can use tidyverse options. Get the values of the objects named in 'myUserList' with mget, then use map2 from purrr to create the new column 'Twitter.name' in each dataset of the list from the corresponding string element of 'myUserList'.
library(tidyverse)
lst <- mget(myUserList) %>%
map2(myUserList, ~mutate(.data = .x, Twitter.name = .y))
If we need to modify the objects in the global environment, use list2env
list2env(lst, envir = .GlobalEnv)
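The same pairing of each data frame with its own name can be sketched in base R with Map. This is only an illustration under assumptions: the two small data frames and the name add_name are hypothetical stand-ins, and the list is built by hand rather than with mget so that the example is self-contained.

```r
# Hypothetical stand-ins for the "_tw" tibbles
a_tw <- data.frame(x = 1:2)
b_tw <- data.frame(x = 3:4)
tw_list <- list(a_tw = a_tw, b_tw = b_tw)

# Hypothetical helper: add the data frame's own name as a column
add_name <- function(d, nm) {
  d$Twitter.name <- nm
  d
}

# Map walks the list and its names in parallel and keeps the names
named <- Map(add_name, tw_list, names(tw_list))
named$b_tw$Twitter.name
```

With the real objects, the hand-built list would be replaced by mget(myUserList), as in the answer above.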

Related

merging outputs from a loop

I have two datasets, named E and eF respectively.
E <- structure(list(Inception_Date = structure(c(962323200, 962323200,
810950400, 988675200, 1042502400, 1536624000), tzone = "UTC", class =
c("POSIXct","POSIXt")), Name = c("Calvert Social Index B",
"Calvert US Large Cap Core Rspnb Idx A",
"Green Century Equity Individual Investor", "Praxis Value Index A",
"Vanguard FTSE Social Index I", "Amundi IS Amundi MSCI USA SRI ETF DR")),
row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
eF <- structure(list(Inception_Date = structure(c(760233600, 519868800,
1380067200, 1101772800, 1325203200, 628473600, 1325203200, 1123804800
), tzone = "UTC", class = c("POSIXct", "POSIXt")), Name = c("Amana Growth Investor",
"Amana Income Investor", "Amana Income Institutional",
"American Century Sustainable Equity A",
"Ariel Appreciation Institutional", "Ariel Appreciation Investor",
"Ariel Focus Institutional", "Baywood Socially Responsible Invs"
)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))
I applied the following code to the data E and eF.
for (k in 1:nrow(E)) {
F_temp <- eF;
G_temp <- F_temp %>% filter(abs(F_temp$Inception_Date-
E$Inception_Date[k]) <= 1500);
print(G_temp)}
After the loop, "G_temp" in the "Global Environment" shows 0 obs. of 2 variables only (which must be the result of the last iteration of the loop). How can I make a .csv file that contains all the "G_temp" results merged together, with duplicates removed?
Thanks
Using your exact filter criteria, would this do it?
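library(dplyr)
# start from an empty data frame with the same columns as eF
G_temp <- data.frame(Inception_Date = as.POSIXct(character()),
Name = character())
for (k in 1:nrow(E)) {
# rows of eF that pass the filter for the k-th date in E
G_temp_int <- eF %>%
filter(abs(eF$Inception_Date - E$Inception_Date[k]) <= 1500)
# accumulate the matches instead of overwriting them
G_temp <- bind_rows(G_temp, G_temp_int)
}
# drop duplicate rows, then write the result
G_temp <- G_temp %>%
distinct(Inception_Date, Name)
write.csv(G_temp, "G_temp.csv")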
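One thing worth checking in the filter is the unit of the date difference: subtracting two POSIXct values gives a difftime whose unit R picks automatically, so comparing it to 1500 depends on that choice. The sketch below fixes the unit explicitly. It assumes the threshold was meant in days, and E_toy, eF_toy and max_days are hypothetical names and values used only for illustration.

```r
library(dplyr)

# Toy stand-ins for E and eF (hypothetical values)
E_toy <- tibble(
  Inception_Date = as.POSIXct(c("2000-06-30", "2003-01-14"), tz = "UTC")
)
eF_toy <- tibble(
  Inception_Date = as.POSIXct(c("1999-01-01", "2004-12-01", "2012-01-01"), tz = "UTC"),
  Name = c("Fund A", "Fund B", "Fund C")
)

max_days <- 1500  # assumption: the threshold is a number of days

# One filtered data frame per row of E_toy
matches <- lapply(seq_len(nrow(E_toy)), function(k) {
  gap <- abs(as.numeric(difftime(eF_toy$Inception_Date,
                                 E_toy$Inception_Date[k],
                                 units = "days")))
  eF_toy[gap <= max_days, ]
})

# Stack the pieces and drop rows that matched more than once
G_all <- distinct(bind_rows(matches))
G_all
```

The combined result could then be written out with write.csv(G_all, "G_temp.csv", row.names = FALSE).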

Match strings from main df with those in reference df - if found, add all cols from that row of the reference df into main df

I have a "main_df" along the lines of this:
structure(list(study_id = c("02ipnnqgeovkrxz", "02ipnnqgeovkrxz",
"02ipnnqgeovkrxz", "02ipnnqgeovkrxz", "02ipnnqgeovkrxz", "02ipnnqgeovkrxz"
), question = c("3eEVJgaAP6c9FPL", "b8GLxGjZKtstCQZ", "40iyFKjeMEFGI2V",
"6eZGejSZ1oTZYLb", "3pXAUvZH8GGuryd", "0kYkUAHe4iODUl7"), study_rt = c("1.353",
"0.714", "0.68", "0.695", "0.696", "0.656"), study_response = c("picture",
"picture", "picture", "picture", "picture", "picture")), row.names = c(NA,
-6L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), reshapeWide = list(
v.names = NULL, timevar = "index", idvar = c("study_id",
"question"), times = c("rt", "response"), varying = structure(c("response.rt",
"response.response"), .Dim = 1:2)), groups = structure(list(
study_id = "02ipnnqgeovkrxz", .rows = list(1:6)), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
and a reference df along the lines of this:
structure(list(stim = c("ashtray_word", "bell_word", "blouse_word",
"boot_word", "bottle_word", "bread_word"), url = c("eW1BRoUDV4BKQMl",
"5zKTGwHlwlzpssB", "55SVfoQudZJNCFT", "bOORR1zuKYSnAe9", "6RrOQfDZim81pHv",
"1F97ouH0HrwQOgZ"), study_list = c("A", "A", "A", "A", "A", "A"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
Each value in the 'question' column of the main df can be found in the 'url' column of the reference df. I want to match these values, and add all columns from that row of the reference df to my main df. The output will look like this:
structure(list(study_id = c("02ipnnqgeovkrxz", "02ipnnqgeovkrxz",
"02ipnnqgeovkrxz", "02ipnnqgeovkrxz", "02ipnnqgeovkrxz", "02ipnnqgeovkrxz"
), question = c("3eEVJgaAP6c9FPL", "b8GLxGjZKtstCQZ", "40iyFKjeMEFGI2V",
"6eZGejSZ1oTZYLb", "3pXAUvZH8GGuryd", "0kYkUAHe4iODUl7"), study_rt = c("1.353",
"0.714", "0.68", "0.695", "0.696", "0.656"), study_response = c("picture",
"picture", "picture", "picture", "picture", "picture"), stim = c("chisel_picture",
"raccoon_picture", "apple_picture", "belt_picture", "bicycle_picture",
"cake_picture"), url = c("3eEVJgaAP6c9FPL", "b8GLxGjZKtstCQZ",
"40iyFKjeMEFGI2V", "6eZGejSZ1oTZYLb", "3pXAUvZH8GGuryd", "0kYkUAHe4iODUl7"
), study_list = c("B FILLER", "B FILLER", "B", "B", "B", "B")), row.names = c(NA,
-6L), groups = structure(list(study_id = "02ipnnqgeovkrxz", .rows = list(
1:6)), row.names = c(NA, -1L), class = c("tbl_df", "tbl",
"data.frame"), .drop = TRUE), class = c("grouped_df", "tbl_df", "tbl", "data.frame"))
This will allow me to see the 'sensible' item names (e.g. "chisel_picture") that subjects were responding to, as opposed to the nonsensical code names I have now (e.g. "3eEVJgaAP6c9FPL"). The same items appear over and over again in the 'question' column (as different subjects saw the same items), and I need to preserve these repeats.
I have successfully managed this using a for loop...but it's super slow! A tidyverse solution would be amazing!
My awful for loop (study_data = main df / image_urls = reference df):
all_study_stim_items <- study_data$question # List all values in 'question' column.
matched_items <- tibble() # Create empty tibble to store results of for loop.
for (i in all_study_stim_items) {
temp <- image_urls %>%
filter(url == i) %>%
select(stim, url, study_list)
matched_items <- bind_rows(matched_items, temp) } # Continuously overwrite tibble with each match.
# I then join this with the main df.
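A join is one loop-free way to do this kind of lookup. The following is a minimal sketch rather than a tested solution for the real data: main_toy and ref_toy are hypothetical stand-ins for the main and reference data frames, and it assumes every 'question' value has a match in 'url', as described above.

```r
library(dplyr)

# Hypothetical stand-ins for the main and reference data frames
main_toy <- tibble(
  study_id = c("s1", "s1", "s2"),
  question = c("u1", "u2", "u1")
)
ref_toy <- tibble(
  stim = c("chisel_picture", "apple_picture"),
  url = c("u1", "u2"),
  study_list = c("B", "B")
)

joined <- main_toy %>%
  # one output row per row of main_toy, so repeated items are preserved
  left_join(ref_toy, by = c("question" = "url")) %>%
  # the join keeps only the key from main_toy, so copy it back as url
  mutate(url = question)
joined
```

If some 'question' values had no match, the copied url column would still be filled in for those rows, so that step only makes sense under the full-match assumption.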

How to plot layers of tuples on same plot in R?

I am trying to plot time against NDVI for each region on the same plot. I think I have to convert the date column from character to a date type and then plot each layer, but I cannot figure out how to do this. Any thoughts?
list(structure(list(observation = 1L, HRpcode = NA_character_,
timeseries = NA_character_), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")), structure(list(observation = 1:6, time = c("2014-01-01",
"2014-02-01", "2014-03-01", "2014-04-01", "2014-05-01", "2014-06-01"
), ` NDVI` = c("0.3793765496776215", "0.21686891782421552", "0.3785652933528299",
"0.41027240624704164", "0.4035578030242673", "0.341299793064468"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
)), structure(list(observation = 1:6, time = c("2014-01-01",
"2014-02-01", "2014-03-01", "2014-04-01", "2014-05-01", "2014-06-01"
), ` NDVI` = c("0.4071076986818826", "0.09090719657570319", "0.35214166081795284",
"0.4444311032927228", "0.5220702877666005", "0.5732370503295022"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
)), structure(list(observation = 1:6, time = c("2014-01-01",
"2014-02-01", "2014-03-01", "2014-04-01", "2014-05-01", "2014-06-01"
), ` NDVI` = c("0.3412131556625801", "0.18815996897460135", "0.5218904976415136",
"0.6970128777711452", "0.7229657162729096", "0.535967435470161"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
)))
First we need to clean your data. The first element in this list is empty
df = df[-1]
Now we need to make a data.frame
df = do.call(rbind, df)
I am going to add a region variable, rename ' NDVI' to remove the leading space, convert ndvi to a numeric vector, and convert time to a Date object
library(dplyr)
df = df %>%
mutate(region = factor(rep(1:3, rep(6, 3)))) %>%
rename(ndvi = ' NDVI') %>%
mutate(ndvi = as.numeric(ndvi)) %>%
mutate(time = as.Date(time))
Now we can use ggplot2 to plot the data by region
library(ggplot2)
g = df %>%
ggplot(aes(x = time, y = ndvi, col = region)) +
geom_line()
g
Which gives a line plot of ndvi against time, with one colored line per region (image not included).
Here's an approach with lubridate to handle dates and dplyr to make the binding of the data.frames easier to understand.
Note that the group names are taken from the names of the list, and since those don't exist in the data you provided, we have to set them in advance.
library(lubridate)
library(ggplot2)
library(dplyr)
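# assumes `data` is the list with its empty first element already dropped
names(data) <- 1:3
data <- bind_rows(data, .id = "group")
data$time <- ymd(data$time)
# rename with dplyr, since setnames() belongs to data.table, which is not loaded
data <- rename(data, NDVI = ` NDVI`)
data$NDVI <- as.numeric(data$NDVI)
# the id column created by bind_rows is "group" (lower case)
ggplot(data, aes(x = time, y = NDVI, color = group)) + geom_line()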

iterate over certain elements of a list, not a data.frame

I am trying to modify certain items of the render.data list based on a criterion (the name starts with "rr_esp").
library(tidyr)
library(dplyr)
library(purrr)
per <- 2015:2019
render.data <- list(
emision = structure(
list(
AÑO = c(2017, 2018, 2019),
TRABAJADORESMES_r = c(58147, 57937, 24818),
MASA_r = c(3439195127, 4091347036.2, 2441068565.77),
TRABAJADORESMESsinDOM = c(58147L, 57928L, 24818L),
MESES = c(12, 12, 5)
),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -3L)
),
siniestros = structure(
list(
AÑO = c(2017, 2018, 2019),
N = c(388L, 327L, 115L),
GR_66 = c(64, 53, 15),
JU = c(41L, 5L, 0L),
JN = c(20, 19, 6),
PORINC_66s = c(437.22, 293.73, 82.12),
EDADs = c(15142L, 12886L, 4712L),
SALARIOs = c(13707950.67, 15151144.7, 4800075.4)
),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -3L)
),
rr_esp1 = structure(
list(
AÑO = c(2017, 2018, 2019),
MESES = c(12, 12, 5),
TRAB_PROM = c(4845.58, 4828.08, 4963.60),
PORINC = c(6.83, 5.54, 5.47),
SALARIO = c(35329.76, 46333.77, 41739.78),
EDAD = c(39.02, 39.40, 40.97)
),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -3L)
),
rr_esp7 = structure(
list(
AÑO = c(2017, 2018, 2019),
JUI_LIQ = c(1539624.21, 318726, 0),
JUI_RVA = c(24434809.51, 2292925.89, 0),
JUI_IBNR = c(0, 25284030.0174036, 22434092.26),
JUI_ULT = c(25974433.72, 27895681.90, 22434092.26),
CM_JUICIO = c(1505898.34, 1806002.14, 1557923.07)
),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -3L)
)
)
When I apply a loop over the elements, they lose their original item names.
I also don't know a better way to iterate over a subset of list elements and assign them new values. I searched, but the solutions I found are for data.frames rather than lists.
render.data <- invisible(lapply(seq_along(render.data), function(i){
if(startsWith(names(render.data)[i], prefix = "rr_esp")){
render.data[[i]] %>%
complete(`AÑO` = per) %>%
gather(
key = "metrica", value = "valor", -`AÑO`
) %>%
mutate(# orden de las metricas
metrica = factor(metrica, levels = unique(metrica))
) %>%
spread(
key = `AÑO`, value = "valor"
)} else{
render.data[[i]]
}
setNames(render.data[[i]], names(render.data)[i])
}))
This seems like a case where a for loop is much clearer than an lapply. The main advantages of lapply are (a) that it pre-allocates a data structure for the result and (b) has simple syntax to apply a simple function. You already have a data structure for the result, and your function is complex. I don't know what your expected output is, but I would try this:
# find elements to modify
rr_elements = which(startsWith(names(render.data), prefix = "rr_esp"))
# modify in for loop
for (i in rr_elements) {
render.data[[i]] = render.data[[i]] %>%
complete(`AÑO` = per) %>%
gather(key = "metrica", value = "valor",-`AÑO`) %>%
mutate(# orden de las metricas
metrica = factor(metrica, levels = unique(metrica))) %>%
spread(key = `AÑO`, value = "valor")
}
If you want to make this code more re-usable, create a function for the operation on one data frame, and then you can use it easily with for or lapply. In general, I'd say that picking the data frames on which to use the function is better done externally than internally. (That is, I don't like how you have an if() statement checking the name inside the function. Do this logic outside the function, and only give the function the data you want it to use.)
foo = function(data) {
data %>%
complete(`AÑO` = per) %>%
gather(key = "metrica", value = "valor",-`AÑO`) %>%
mutate(# orden de las metricas
metrica = factor(metrica, levels = unique(metrica))) %>%
spread(key = `AÑO`, value = "valor")
}
# now the for loop or lapply is simple:
rr_elements = which(startsWith(names(render.data), prefix = "rr_esp"))
# for loop version
for (i in rr_elements) {
render.data[[i]] = foo(render.data[[i]])
}
# lapply version
render.data[rr_elements] = lapply(render.data[rr_elements], foo)
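The lost names come from iterating over seq_along(), which hands lapply a plain integer vector. A small sketch with a toy list shows the difference from sub-assignment; toy and double_it are hypothetical stand-ins for render.data and foo.

```r
# Hypothetical stand-in for render.data
toy <- list(emision = 1:3, rr_esp1 = 4:6, rr_esp7 = 7:9)

# Hypothetical stand-in for foo()
double_it <- function(x) x * 2

# Iterating over indices returns an unnamed list
by_index <- lapply(seq_along(toy), function(i) toy[[i]])
names(by_index)

# Sub-assignment changes only the selected elements and keeps every name
target <- startsWith(names(toy), "rr_esp")
toy[target] <- lapply(toy[target], double_it)
names(toy)
```

Elements that do not match the prefix, such as emision here, are left untouched.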

R: drop columns from tibbles inside a function

This is a follow-up to this topic. Here are my 3 tibbles:
dftest_tw <- structure(list(text = c("RT #BitMEXdotcom: A new high: US$500M turnover in the last 24 hours, over 80% of it on $XBTUSD. Congrats to the team and thank you to our u…",
"RT #Crowd_indicator: Thank you for this nice video, #Nicholas_Merten",
"RT #Crowd_indicator: Review of #Cindicator by DataDash: t.co/D0da3u5y3V"
), Tweet.id = c("896858423521837057", "896858275689398272", "896858135314538497"
), created.date = structure(c(17391, 17391, 17391), class = "Date"),
created.week = c(33, 33, 33)), .Names = c("text", "Tweet.id",
"created.date", "created.week"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
dftest1_tw <- dftest_tw
dftest2_tw <- dftest_tw
myUserList <- ls(,pattern = "_tw")
Following yesterday's topic, I get the wanted result when running this:
library(tidyverse)
lst <- mget(myUserList) %>%
map2(myUserList, ~mutate(.data = .x, Twitter.name = .y))
list2env(lst, envir = .GlobalEnv)
I need to drop a few columns from each df. This does the job when run on one df:
select_(dftest_tw, quote(-text), quote(-Tweet.id), quote(-created.date))
It seems like I have a serious problem when it comes to applying code to each member of a list. I can't find a way to apply it to all df, whether using lapply or writing a function:
MySelect <- function(x){
select_(x, quote(-text), quote(-Tweet.id), quote(-created.date))
x
}
for(var in myUserList){MySelect(get(var))}
Thank you for your help.
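Two things seem to keep the function above from having any effect: it returns the untouched x instead of the result of the select call, and the for loop discards whatever the function returns. A minimal sketch of one way around both follows. The toy tibbles and the name drop_cols are hypothetical, and select() with negated column names is used in place of select_().

```r
library(dplyr)

# Hypothetical stand-ins for the "_tw" tibbles
a_tw <- tibble(text = "t1", Tweet.id = "1",
               created.date = as.Date("2017-08-13"), created.week = 33)
b_tw <- tibble(text = "t2", Tweet.id = "2",
               created.date = as.Date("2017-08-14"), created.week = 34)
tw_list <- list(a_tw = a_tw, b_tw = b_tw)

# Return the result of select() itself rather than the original x
drop_cols <- function(x) {
  select(x, -text, -Tweet.id, -created.date)
}

# Keep what the function returns, one trimmed tibble per list element
trimmed <- lapply(tw_list, drop_cols)
names(trimmed$a_tw)
```

With the real objects, the list could come from mget(myUserList), and list2env(trimmed, envir = .GlobalEnv) would write the trimmed tibbles back over the originals.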
