I am having trouble merging datasets. Since I will be merging many of them together, I need a way to automate getting past the following error:
"Error: Can't combine `C:/Users/gabri/AppData/Local/Cache/R/noaa_lcd/2006_72038163885.csv$HourlyWetBulbTemperature` <double> and `C:/Users/gabri/AppData/Local/Cache/R/noaa_lcd/2009_72038163885.csv$HourlyWetBulbTemperature` <character>."
I have examined the data and see that in one of the files some of the NAs are marked by *, so I know why the problem occurs. I would like to add a command that converts everything to either all character or all numeric so that I can merge, but when I try adding as.character I receive this error:
Error: Names repair functions can't return `NA` values
Here is the relevant code I am trying to run, which produces the error:
library(rnoaa)
library(tidyverse)
library(fs)
super_big_df <- map_df(my_files, read_csv, col_select = c(1,2,21,32,80), col_types = "cTddd", .id = "file")
Here is the output of dput() for the relevant columns of the two datasets:
structure(list(STATION = c(72038163885, 72038163885, 72038163885
), DATE = structure(c(1230768000, 1230769200, 1230770400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), HourlyWetBulbTemperature = c("*", "38", "37"), DailyAverageWetBulbTemperature = c(NA,
NA, NA), MonthlyWetBulb = c(NA, NA, NA)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
structure(list(STATION = c(72038163885, 72038163885, 72038163885
), DATE = structure(c(1146459600, 1146460800, 1146462000), tzone = "UTC", class = c("POSIXct",
"POSIXt")), HourlyWetBulbTemperature = c(NA_real_, NA_real_,
NA_real_), DailyAverageWetBulbTemperature = c(72, NA, NA), MonthlyWetBulb = c(NA,
NA, NA)), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
In sum, I am wondering whether there is an argument I can pass to map_df() that converts everything to the same type (either character or numeric) so that the rest of the command still runs.
Untested, but the best way forward, as @GregorThomas suggested, is to read the data in properly the first time. In this case, that likely means something like:
super_big_df <- map_df(
my_files, read_csv, na = c("", "NA", "*"),
col_select = c(1,2,21,32,80), col_types = "cTddd",
.id = "file")
If you need to fix it after the fact, then you'll need to read the files into a list of frames, perhaps by changing map_df to map:
super_big_df <- map(
  my_files, read_csv, na = c("", "NA", "*"),
  col_select = c(1, 2, 21, 32, 80), col_types = "cTddd")
# purrr::map() has no `.id` argument, so supply it to bind_rows() instead
bind_rows(super_big_df, .id = "file")
# Error: Can't combine `..1$HourlyWetBulbTemperature` <character> and `..2$HourlyWetBulbTemperature` <double>.
and then something like
library(dplyr) # in case you did not already have it loaded
purrr::map(
  super_big_df,
  ~ mutate(., HourlyWetBulbTemperature = suppressWarnings(as.numeric(HourlyWetBulbTemperature)))
) %>%
  bind_rows()
# # A tibble: 6 x 5
# STATION DATE HourlyWetBulbTemperature DailyAverageWetBulbTemperature MonthlyWetBulb
# <dbl> <dttm> <dbl> <dbl> <lgl>
# 1 72038163885 2009-01-01 00:00:00 NA NA NA
# 2 72038163885 2009-01-01 00:20:00 38 NA NA
# 3 72038163885 2009-01-01 00:40:00 37 NA NA
# 4 72038163885 2006-05-01 05:00:00 NA 72 NA
# 5 72038163885 2006-05-01 05:20:00 NA NA NA
# 6 72038163885 2006-05-01 05:40:00 NA NA NA
The suppressWarnings() here is because we know there is a non-number ("*") somewhere in that column. For the one frame that contains it, this fixes the column; for the other frames, it is a no-op since the column is already numeric.
Note that I hard-coded the name here since we know what it is ahead of time. If there are more columns that need repairing (i.e., you get more errors after fixing this one), then it might be advantageous to go with a more dynamic/programmatic approach (not yet covered here).
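For that dynamic approach, one hedged sketch: assume every character column other than known id columns should really be numeric, and coerce them all at once (fix_all_numeric is a made-up helper name, not from the question):

```r
library(dplyr)
library(purrr)

# Hypothetical helper: coerce every character column (except known id
# columns) to numeric; "*" and other non-numbers become NA with a
# warning, which we suppress on purpose.
fix_all_numeric <- function(df, keep = c("file", "STATION")) {
  mutate(df, across(
    where(is.character) & !any_of(keep),
    ~ suppressWarnings(as.numeric(.x))
  ))
}

# super_big_df here would be the list-of-frames from map():
# bind_rows(map(super_big_df, fix_all_numeric))
```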
Data
super_big_df <- list(
structure(list(STATION = c(72038163885, 72038163885, 72038163885), DATE = structure(c(1230768000, 1230769200, 1230770400), tzone = "UTC", class = c("POSIXct", "POSIXt")), HourlyWetBulbTemperature = c("*", "38", "37"), DailyAverageWetBulbTemperature = c(NA, NA, NA), MonthlyWetBulb = c(NA, NA, NA)), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame")),
structure(list(STATION = c(72038163885, 72038163885, 72038163885), DATE = structure(c(1146459600, 1146460800, 1146462000), tzone = "UTC", class = c("POSIXct", "POSIXt")), HourlyWetBulbTemperature = c(NA_real_, NA_real_, NA_real_), DailyAverageWetBulbTemperature = c(72, NA, NA), MonthlyWetBulb = c(NA, NA, NA)), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
)
Related
I have long historical data in this format (unbalanced). While there is a lag until the data are released (the next business day), I would like to record the date as of the day it happened. I tried to use dplyr as follows:
dataframe <- dataframe %>% group_by(date) %>% mutate(cob = lag(date, n = 1))
However, it just produces the same result as:
lag(date,1)
date      name  value
2023/1/2  a     X
2023/1/2  b     X
2023/1/2  c     X
2023/1/3  a     X
2023/1/3  b     X
2023/1/4  a     X
2023/1/4  b     X
2023/1/5  a     X
2023/1/5  b     X
2023/1/5  c     X
I thought about:
dataframe <- dataframe %>% group_by(name) %>% mutate(cob = lag(date, n = 1))
but it produces NA when there is no observation for a certain sample.
mutate(cob = date - 1)
does not account for business days.
I just would like to slide all the dates in dataframe$date by 1 business day.
I attached part of the actual data (historical prices of Japanese treasury bills).
structure(list(date = c("2002-08-06", "2002-08-06", "2002-08-07",
"2002-08-07", "2002-08-09", "2002-08-09"), code = c(2870075L,
3000075L, 2870075L, 3000075L, 2870075L, 3000075L), due_date = c("2002-08-20",
"2002-09-10", "2002-08-20", "2002-09-10", "2002-08-20", "2002-09-10"
), ave_price = c(99.99, 99.99, 99.99, 99.99, 99.99, 99.99)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(
date = c("2002-08-06", "2002-08-07", "2002-08-09"), .rows = structure(list(
1:2, 3:4, 5:6), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))
The expected outcome is as follows:
structure(list(date = c("2002-08-06", "2002-08-06", "2002-08-07",
"2002-08-07", "2002-08-09", "2002-08-09"), code = c(2870075L,
3000075L, 2870075L, 3000075L, 2870075L, 3000075L), due_date = c("2002-08-20",
"2002-09-10", "2002-08-20", "2002-09-10", "2002-08-20", "2002-09-10"
), ave_price = c(99.99, 99.99, 99.99, 99.99, 99.99, 99.99), cob = c(NA,
NA, "2002-08-06", "2002-08-06", "2002-08-07", "2002-08-07")), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(
date = c("2002-08-06", "2002-08-07", "2002-08-09"), .rows = structure(list(
1:2, 3:4, 5:6), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), .drop = TRUE))
Thank you very much in advance.
If I understand correctly, you want the previous date recorded in your date column as cob. So, your Aug 9 rows would have the previously recorded date of Aug 7 in your cob column.
If so, you could try the following. First, since your example data above are grouped, I started with ungroup(). You can get a vector of unique or distinct dates and take the lag (previous date) of those. In this case, the dates Aug 6, 7, and 9 will have cob set to NA, Aug 6, and Aug 7, respectively.
Then, you can join back to the original data with right_join(). The final select() keeps the columns in the desired order.
I left date alone (it is currently a character value, not a Date).
library(tidyverse)
df %>%
  ungroup() %>%
  distinct(date) %>%
  mutate(cob = lag(date)) %>%
  right_join(df, by = "date") %>%
  select(date, code, due_date, ave_price, cob)
Output
date code due_date ave_price cob
<chr> <int> <chr> <dbl> <chr>
1 2002-08-06 2870075 2002-08-20 100. NA
2 2002-08-06 3000075 2002-09-10 100. NA
3 2002-08-07 2870075 2002-08-20 100. 2002-08-06
4 2002-08-07 3000075 2002-09-10 100. 2002-08-06
5 2002-08-09 2870075 2002-08-20 100. 2002-08-07
6 2002-08-09 3000075 2002-09-10 100. 2002-08-07
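If you also want date stored as a proper Date class rather than character, the same pattern works after one as.Date() conversion. A small self-contained sketch (dat is stand-in data, assuming ISO-formatted date strings):

```r
library(dplyr)

# Stand-in data in the same shape as the question's date column
dat <- tibble(
  date      = as.Date(c("2002-08-06", "2002-08-06", "2002-08-07", "2002-08-09")),
  ave_price = c(99.99, 99.99, 99.99, 99.99)
)

res <- dat %>%
  distinct(date) %>%
  mutate(cob = lag(date)) %>%   # cob stays Date class
  right_join(dat, by = "date")
```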
I have a table that consists only of columns of type Date. The data describe the shopping behavior of customers on a website. Each column holds the first time an event is triggered by a customer (NULL if the event never occurs). One of the columns is the purchase event.
Here's an MRE for the starting state of the database:
structure(list(purchase = structure(c(NA, NA, 10729, NA, 10737
), class = "Date"), action_A = structure(c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), action_B = structure(c(NA,
NA, 10713, NA, 10613), class = "Date"), action_C = structure(c(10707,
10729, 10739, NA, NA), class = "Date")), row.names = c(NA, -5L
), class = c("tbl_df", "tbl", "data.frame"))
I want to update the table so that, for each row, all cells whose event did not occur within the 30 days prior to the purchase are replaced with NULL. However, if the purchase is NULL, I'd like to keep the dates of the other events.
So after my envisioned transformation, the above table should look like the following:
structure(list(purchase = structure(c(NA, NA, 10729, NA, 10737
), class = "Date"), action_A = structure(c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = "Date"), action_B = structure(c(NA,
NA, 10713, NA, NA), class = "Date"), action_C = structure(c(10707,
10729, NA, NA, NA), class = "Date")), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
I have yet to be able to achieve this transformation, and would appreciate the help!
Finally, I'd like to transform the above table into a binary format. I've achieved this via the code below; however, I'd like to know if there is a simpler way.
df_c <- df_b %>%
is.na() %>%
magrittr::not() %>%
data.frame()
df_c <- df_c * 1
I assume that by saying "replaced by NULL" you actually mean "replaced by NA".
I also assume that the first structure in your question is df_a.
df_b <- df_a %>%
  mutate(across(starts_with("action"),
                ~ if_else(purchase - . > 30, as.Date(NA), .)))
mutate(across(cols, func)) applies func to all selected cols.
The real trick here is to use if_else() and cast NA to the Date class; otherwise the dates would be converted to numeric vectors.
Result:
# A tibble: 5 × 4
  purchase   action_A action_B   action_C
  <date>     <date>   <date>     <date>
1 NA         NA       NA         NA
2 NA         NA       NA         NA
3 1999-05-18 NA       1999-05-02 1999-05-28
4 NA         NA       NA         NA
5 1999-05-26 NA       NA         NA
One problem remains as a homework exercise: how do you modify the if_else() so that you keep the action when purchase is NA? (This should now be very simple!) I did not include that on purpose because you omitted it from the question.
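As for the final part of the question (the binary table): a shorter equivalent of the is.na()/not()/* 1 pipeline could be the following sketch (assuming dplyr >= 1.0 for across(); df_b here is a two-column stand-in):

```r
library(dplyr)

# Stand-in for the intermediate table of Date columns
df_b <- tibble(
  purchase = as.Date(c(NA, "1999-05-18")),
  action_B = as.Date(c("1999-05-02", NA))
)

df_c <- df_b %>%
  mutate(across(everything(), ~ as.integer(!is.na(.x))))
# every Date column becomes 1 (date present) / 0 (NA)
```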
I want to know how to use the Savitzky-Golay filter in R to fill gaps and smooth my data.
This is my code, and it returns only NA:
library(signal)
sg <- sgolay(p=1, n=3, m=0)
PLOT1500$sg <- filter(sg, PLOT1500$evi21500)
PLOT1500$sg
[1] NA NA NA NA NA NA NA NA NA NA
My data sample is as follows:
structure(list(system = structure(c(1459641600, 1459728000, 1459814400,
1459900800, 1459987200, 1460073600, 1460160000, 1460246400, 1460332800,
1460419200), tzone = "UTC", class = c("POSIXct", "POSIXt")),
evi21500 = c(0.329, 0.328, NA, NA, NA, NA, NA, NA, NA, NA
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
I want to know how to solve the NA problem, fill the gap, and smooth the data.
I've had this problem before; it happens if you load the signal package before dplyr. Try loading signal later. The second issue is that the filter length should be an odd number smaller than the number of data points.
library(dplyr)
library(signal)
PLOT1500<-structure(list(`system:time_start` = structure(c(1451606400,
1451692800, 1451952000, 1452038400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), evi1500 = c(0.437, NA, NA, 0.486), evi21500 = c(0.408,
0.434, 0.434, 0.423), kndvi1500 = c(0.429, 0.532, 0.532, 0.525
), ndvi1500 = c(0.724, 0.773, 0.773, 0.788), nirv1500 = c(0.172,
0.187, 0.187, 0.182), evi2500 = c(0.611, NA, NA, 0.579), evi22500 = c(0.576,
0.426, 0.426, 0.539), kndvi2500 = c(0.417, NA, NA, 0.443), ndvi2500 = c(0.781,
0.757, 0.757, 0.825), nirv2500 = c(0.286, 0.182, 0.182, 0.254
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
sg <- sgolay(p=1, n=3, m=0)
PLOT1500$sg <- filter(sg, PLOT1500$evi21500)
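An alternative to reordering the library() calls is to call the filter by namespace. And since a Savitzky-Golay filter propagates NA, interior gaps can be interpolated first; zoo::na.approx below is one option for that, my suggestion rather than something from the question:

```r
library(signal)
library(zoo)

x <- c(0.437, NA, NA, 0.486, 0.472, 0.455, 0.449)
x_filled <- zoo::na.approx(x, na.rm = FALSE)  # linear interpolation of interior NAs

sg <- sgolay(p = 1, n = 3, m = 0)             # n (filter length) must be odd and < length(x)
smoothed <- signal::filter(sg, x_filled)      # namespaced call: immune to dplyr::filter masking
```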
I have two data frames: users and events.
Both data frames contain a field that links events to users.
How can I create a for loop in which every user's unique ID is matched against events of a particular type, and the number of occurrences is stored in a new column within users (users$conversation_started, users$conversation_missed, etc.)?
In short, it is a conditional for loop.
So far I have this but it is wrong:
for(i in users$id){
users$conversation_started <- nrow(event[event$type = "conversation-started"])
}
An example of how to do this would be ideal.
The idea is:
for(each user)
find the matching user ID in events
count the number of event types == "conversation-started"
assign count value to user$conversation_started
end for
Important note:
The type field can contain one of five values, so I will need to be able to filter effectively on each type for each associate:
> events$type %>% table %>% as.matrix
[,1]
conversation-accepted 3120
conversation-already-accepted 19673
conversation-declined 27
conversation-missed 831
conversation-request 23427
Data frames (note that these are reduced versions, as confidential information has been removed):
users <- structure(list(`_id` = c("JTuXhdI4Ai", "iGIeCEXyVE", "6XFtOJh0bD",
"mNN986oQv9", "9NI71KBMX9", "x1jH7t0Cmy"), language = c("en",
"en", "en", "en", "en", "en"), registering = c(TRUE, TRUE, FALSE,
FALSE, FALSE, NA), `_created_at` = structure(c(1485995043.131,
1488898839.838, 1480461193.146, 1481407887.979, 1489942757.189,
1491311381.916), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
`_updated_at` = structure(c(1521039527.236, 1488898864.834,
1527618624.877, 1481407959.116, 1490043838.561, 1491320333.09
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), lastOnlineTimestamp = c(1521039526.90314,
NA, 1480461472, 1481407959, 1490043838, NA), isAgent = c(FALSE,
NA, FALSE, FALSE, FALSE, NA), lastAvailableTime = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), class = c("POSIXct",
"POSIXt"), tzone = ""), available = c(NA, NA, NA, NA, NA,
NA), busy = c(NA, NA, NA, NA, NA, NA), joinedTeam = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), class = c("POSIXct",
"POSIXt"), tzone = ""), timezone = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
)), row.names = c("list.1", "list.2", "list.3", "list.4",
"list.5", "list.6"), class = "data.frame")
and
events <- structure(list(`_id` = c("JKY8ZwkM1S", "CG7Xj8dAsA", "pUkFFxoahy",
"yJVJ34rUCl", "XxXelkIFh7", "GCOsENVSz6"), expirationTime = structure(c(1527261147.873,
NA, 1527262121.332, NA, 1527263411.619, 1527263411.619), class = c("POSIXct",
"POSIXt"), tzone = ""), partId = c("d22bfddc-cd51-489f-aec8-5ab9225c0dd5",
"d22bfddc-cd51-489f-aec8-5ab9225c0dd5", "cf4356da-b63e-4e4d-8e7b-fb63035801d8",
"cf4356da-b63e-4e4d-8e7b-fb63035801d8", "a720185e-c300-47c0-b30d-64e1f272d482",
"a720185e-c300-47c0-b30d-64e1f272d482"), type = c("conversation-request",
"conversation-accepted", "conversation-request", "conversation-accepted",
"conversation-request", "conversation-request"), `_p_conversation` = c("Conversation$6nSaLeWqs7",
"Conversation$6nSaLeWqs7", "Conversation$6nSaLeWqs7", "Conversation$6nSaLeWqs7",
"Conversation$bDuAYSZgen", "Conversation$bDuAYSZgen"), `_p_merchant` = c("Merchant$0A2UYADe5x",
"Merchant$0A2UYADe5x", "Merchant$0A2UYADe5x", "Merchant$0A2UYADe5x",
"Merchant$0A2UYADe5x", "Merchant$0A2UYADe5x"), `_p_associate` = c("D9ihQOWrXC",
"D9ihQOWrXC", "D9ihQOWrXC", "D9ihQOWrXC", "D9ihQOWrXC", "D9ihQOWrXC"
), `_wperm` = list(list(), list(), list(), list(), list(), list()),
`_rperm` = list("*", "*", "*", "*", "*", "*"), `_created_at` = structure(c(1527264657.998,
1527264662.043, 1527265661.846, 1527265669.435, 1527266922.056,
1527266922.059), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
`_updated_at` = structure(c(1527264657.998, 1527264662.043,
1527265661.846, 1527265669.435, 1527266922.056, 1527266922.059
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), read = c(TRUE,
NA, TRUE, NA, NA, NA), data.customerName = c("Shopper 109339",
NA, "Shopper 109339", NA, "Shopper 109364", "Shopper 109364"
), data.departmentName = c("Personal advisors", NA, "Personal advisors",
NA, "Personal advisors", "Personal advisors"), data.recurring = c(FALSE,
NA, TRUE, NA, FALSE, FALSE), data.new = c(TRUE, NA, FALSE,
NA, TRUE, TRUE), data.missed = c(0L, NA, 0L, NA, 0L, 0L),
data.customerId = c("84uOFRLmLd", "84uOFRLmLd", "84uOFRLmLd",
"84uOFRLmLd", "5Dw4iax3Tj", "5Dw4iax3Tj"), data.claimingTime = c(NA,
4L, NA, 7L, NA, NA), data.lead = c(NA, NA, FALSE, NA, NA,
NA), data.maxMissed = c(NA, NA, NA, NA, NA, NA), data.associateName = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), data.maxDecline = c(NA, NA, NA, NA, NA, NA
), data.goUnavailable = c(NA, NA, NA, NA, NA, NA)), row.names = c("list.1",
"list.2", "list.3", "list.4", "list.5", "list.6"), class = "data.frame")
Update: 21st September 2018
This solution now results in an NA-only data frame being produced at the end of the function. When the result is written to a .csv, every value is NA (which Excel naturally displays as blanks).
My data source has not changed, nor has my script.
What might be causing this?
My guess is that this is an unforeseen case in which there were 0 hits for each step; if so, is there a way to record 0 in those cases rather than NA/blank values?
Is there a way to avoid this?
New solution based on the provided data.
Note: As your data had no overlap in _id, I changed the events$_id to be the same as in users.
Simplified example data:
users <- structure(list(`_id` = structure(c(4L, 3L, 1L, 5L, 2L, 6L),
.Label = c("6XFtOJh0bD", "9NI71KBMX9", "iGIeCEXyVE",
"JTuXhdI4Ai", "mNN986oQv9", "x1jH7t0Cmy"),
class = "factor")), .Names = "_id",
row.names = c(NA, -6L), class = "data.frame")
events <- structure(list(`_id` = c("JKY8ZwkM1S", "CG7Xj8dAsA", "pUkFFxoahy",
"yJVJ34rUCl", "XxXelkIFh7", "GCOsENVSz6"),
type = c("conversation-request", "conversation-accepted",
"conversation-request", "conversation-accepted",
"conversation-request", "conversation-request")),
.Names = c("_id", "type"), class = "data.frame",
row.names = c("list.1", "list.2", "list.3", "list.4", "list.5", "list.6"))
events$`_id` <- users$`_id`
> users
_id
1 JTuXhdI4Ai
2 iGIeCEXyVE
3 6XFtOJh0bD
4 mNN986oQv9
5 9NI71KBMX9
6 x1jH7t0Cmy
> events
_id type
list.1 JTuXhdI4Ai conversation-request
list.2 iGIeCEXyVE conversation-accepted
list.3 6XFtOJh0bD conversation-request
list.4 mNN986oQv9 conversation-accepted
list.5 9NI71KBMX9 conversation-request
list.6 x1jH7t0Cmy conversation-request
We can use the same approach I suggested before, just enhanced a bit.
First we loop over unique(events$type) to store a table() of every type of event per id in a list:
test <- lapply(unique(events$type), function(x) table(events$`_id`, events$type == x))
Then we store the specific type as the name of the respective table in the list:
names(test) <- unique(events$type)
Now we use a simple for loop to match() users$_id against the rownames of each table and store the counts in a new variable named after the event type:
for(i in names(test)){
users[, i] <- test[[i]][, 2][match(users$`_id`, rownames(test[[i]]))]
}
Result:
> users
_id conversation-request conversation-accepted
1 JTuXhdI4Ai 1 0
2 iGIeCEXyVE 0 1
3 6XFtOJh0bD 1 0
4 mNN986oQv9 0 1
5 9NI71KBMX9 1 0
6 x1jH7t0Cmy 1 0
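For completeness, the same counts can be computed without an explicit loop. A sketch with dplyr/tidyr (assuming tidyr >= 1.0 for pivot_wider(); the replace_na() step gives users with no events a 0 instead of NA, which also addresses the NA-vs-0 issue from the update):

```r
library(dplyr)
library(tidyr)

# Tiny stand-in data; u3 has no events at all
users  <- tibble(`_id` = c("u1", "u2", "u3"))
events <- tibble(`_id` = c("u1", "u1", "u2"),
                 type  = c("conversation-request", "conversation-accepted",
                           "conversation-request"))

res <- events %>%
  count(`_id`, type) %>%                                          # events per user and type
  pivot_wider(names_from = type, values_from = n, values_fill = 0L) %>%
  right_join(users, by = "_id") %>%                               # keep every user
  mutate(across(-`_id`, ~ replace_na(.x, 0L)))                    # no events -> 0, not NA
```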
Hope this helps!
I am binding a number of data frames and have noticed that I get weird values in one of the bindings. The datetime in the second data frame is shifted after binding: it is one hour earlier than in the original.
kk <- structure(list(date = structure(c(1499133600, 1499137200, 1499140800,
1499144400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
temp = c(14.7, 14.6, 14.3, 14.2)), .Names = c("date", "temp"
), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
ff <- structure(list(date = structure(c(1499144400, 1499148000, 1499151600,
1499155200), class = c("POSIXct", "POSIXt"), tzone = ""), temp = 14:17), .Names = c("date",
"temp"), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
Calling functions from different packages gives me the same result:
dplyr::bind_rows(kk, ff)
data.table::rbindlist(list(kk, ff))
rbind(kk, ff)
I do not understand what is going on. Could it have something to do with the date format?
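A likely explanation, sketched below with stand-in values from the dput output: kk carries tzone = "UTC" while ff carries tzone = "" (local time). The combined column can only carry one tzone attribute, so one frame's rows print shifted even though the underlying instants are unchanged. Relabeling the display zone before binding avoids the surprise:

```r
library(dplyr)

kk <- tibble(date = as.POSIXct(c(1499133600, 1499137200), origin = "1970-01-01", tz = "UTC"))
ff <- tibble(date = as.POSIXct(c(1499144400, 1499148000), origin = "1970-01-01", tz = ""))

attr(ff$date, "tzone") <- "UTC"  # changes display only; the stored instants stay the same
out <- bind_rows(kk, ff)
```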