Converting int into date? - r

I have been working with a dataset that comes with a date column. When I run typeof(headlineDat$Date) I get a type integer.
I've tried a few things I found on Google, but none of them seemed to work. For example, I tried running this piece of code:
as.POSIXct(strptime(headlineDat$Time.read,format= "%Y-%m-%d"))
My aim is to have the same format as the year column below. The reason why I want to do this is that I want to be able to create a unique identifier so I can easily match dates when I merge the two data frames.
Any help on this would be greatly appreciated !
This is my dput output:
dput(droplevels(headlineDat[1:5, ]))
structure(list(Date = structure(c(1L, 3L, 3L, 2L, 4L), .Label = c("2018-04-26T11:31:02+00:00",
"2018-05-02T21:10:20+00:00", "2018-05-03T15:30:59+00:00", "2018-05-03T18:00:39+00:00"
), class = "factor"), Headline = structure(c(5L, 2L, 4L, 3L,
1L), .Label = c("Bitcoin Futures Trading Questioned By Chinese National Media",
"Daily Volatility Decline? Bitcoin Has Seen $1K Range 43 Times In 2018",
"Reddit to Relaunch Bitcoin Payments (And Add More Cryptos)",
"Sell In May and Go Away? Not for Bitcoin Bulls", "Square Books Small Profit for First Quarter of Bitcoin Sales"
), class = "factor")), row.names = c(NA, 5L), class = "data.frame")

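Your strings are already in a standard format (ISO 8601, starting with YYYY-MM-DD), so as.Date does the conversion just fine. typeof reports integer only because the column is a factor, which R stores as integer codes.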
headlineDat$Date = as.Date(headlineDat$Date)
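As a minimal sketch of why this works: as.Date reads the leading YYYY-MM-DD part of each string and ignores the rest. The vector `dates` below is a stand-in built from values in the question's dput output, not the full dataset. If the time of day is also needed, as.POSIXct with an explicit format is one option; that part assumes every offset is +00:00, as in the sample.

```r
# Stand-in for headlineDat$Date: a factor of ISO 8601 timestamps
dates <- factor(c("2018-04-26T11:31:02+00:00",
                  "2018-05-03T15:30:59+00:00",
                  "2018-05-02T21:10:20+00:00"))

typeof(dates)  # factors are stored as integer codes
class(dates)   # but the class is "factor"

# as.Date() converts the factor labels, not the integer codes
day_only <- as.Date(dates)
day_only

# Keeping the time as well; assumes all offsets are +00:00 (UTC)
date_time <- as.POSIXct(as.character(dates),
                        format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
date_time
```

Since the goal is a merge key, `day_only` can be compared or merged directly against another Date column.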

Related

Using apply family and multiple functions on lists in R

I have a follow-up question about my answer to this question:
Matching vertex attributes across a list of edgelists R
My solution was to use for loops, but we should always try to optimize (vectorize) when we can.
What I'm trying to understand is how I would vectorize the solution I made in the post.
My solution was
for(i in 1:length(graph_list)){
graph_list[[i]]=set_vertex_attr(graph_list[[i]],"gender", value=attribute_df$gender[match(V(graph_list[[i]])$name, attribute_df$names)])
}
Ideally we could vectorize this with lapply but I'm having some trouble conceiving how to do that. Here's what I've got
graph_lists_new=lapply(graph_list, set_vertex_attr, value=attribute_df$gender[match(V(??????????)$name, attribute_df$names)]))
What I'm unclear about is what I'd put in the part with the ??????. The thing inside the V() function should be each item in the list, but what I don't get is what I'd put inside when I'm using lapply.
All data can be found in the link I posted, but here's the data anyway
attribute_df<- structure(list(names = structure(c(6L, 7L, 5L, 2L, 1L, 8L, 3L,
4L), .Label = c("Andy", "Angela", "Eric", "Jamie", "Jeff", "Jim",
"Pam", "Tim"), class = "factor"), gender = structure(c(3L, 2L,
3L, 2L, 3L, 1L, 1L, 2L), .Label = c("", "F", "M"), class = "factor"),
happiness = c(8, 9, 4.5, 5.7, 5, 6, 7, 8)), class = "data.frame", row.names = c(NA,
-8L))
edgelist<-list(structure(list(nominator1 = structure(c(3L, 4L, 1L, 2L), .Label = c("Angela",
"Jeff", "Jim", "Pam"), class = "factor"), nominee1 = structure(c(1L,
2L, 3L, 2L), .Label = c("Andy", "Angela", "Jeff"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L)), structure(list(nominator2 = structure(c(4L, 1L, 2L, 3L
), .Label = c("Eric", "Jamie", "Oscar", "Tim"), class = "factor"),
nominee2 = structure(c(1L, 3L, 2L, 3L), .Label = c("Eric",
"Oscar", "Tim"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L)))
graph_list<- lapply(edgelist, graph_from_data_frame)
Since you need to use graph_list[[i]] multiple times in your call, to use lapply you need to write a custom function, such as this anonymous function. (It's the same code as your loop, I just wrapped it in function(x) and replaced all instances of graph_list[[i]] with x.)
graph_list = lapply(graph_list, function(x) {
  set_vertex_attr(x, "gender", value = attribute_df$gender[match(V(x)$name, attribute_df$names)])
})
(Note that I didn't test this, but it should work unless I made a typo.)
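The same loop-to-lapply rewrite can be sketched without igraph. This is only an analogue, assuming a list of plain data frames instead of graphs; `lookup` and `df_list` are made-up names for illustration. The pattern is identical: wrap the loop body in function(x), replace every `the_list[[i]]` with `x`, and return the modified object.

```r
# Hypothetical lookup table, similar in spirit to attribute_df
lookup <- data.frame(names = c("Jim", "Pam", "Jeff"),
                     gender = c("M", "F", "M"),
                     stringsAsFactors = FALSE)

# Hypothetical list of data frames standing in for graph_list
df_list <- list(
  data.frame(name = c("Pam", "Jim"), stringsAsFactors = FALSE),
  data.frame(name = c("Jeff", "Oscar"), stringsAsFactors = FALSE)
)

# for-loop version: modifies each element in place
loop_result <- df_list
for (i in seq_along(loop_result)) {
  loop_result[[i]]$gender <-
    lookup$gender[match(loop_result[[i]]$name, lookup$names)]
}

# lapply version: same body, with x in place of loop_result[[i]]
lapply_result <- lapply(df_list, function(x) {
  x$gender <- lookup$gender[match(x$name, lookup$names)]
  x  # return the modified element
})

identical(loop_result, lapply_result)
```

Names with no match in the lookup table (here "Oscar") simply get NA, as they would with match() in the graph version.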
lapply isn't vectorization; it's just "loop hiding". In this case, I think your for loop is a nicer way to do things than lapply. Especially since you are modifying existing objects, your simple for loop will probably be more efficient than an lapply solution, as well as more readable.
When we talk about vectorization for efficiency, we almost always mean atomic vectors, not lists. (It's vectorization, after all, not listization.) The reason to use lapply and related functions (sapply, vapply, Map, most of the purrr package) isn't computer efficiency, it's readability, and human-efficiency to write.
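To make the distinction concrete, here is a small sketch on an atomic vector (the vector `x` is made up for illustration). Only the first version is vectorized in the usual sense; the other two loop over elements in R code, one visibly and one hidden inside sapply.

```r
x <- c(1.5, 2, 3.25, 4)

# Truly vectorized: one call, the loop happens in compiled code
squared_vec <- x^2

# "Loop hiding": sapply calls the R function once per element
squared_apply <- sapply(x, function(v) v^2)

# Explicit loop with a preallocated result
squared_for <- numeric(length(x))
for (i in seq_along(x)) squared_for[i] <- x[i]^2

# All three give the same answer; they differ in how the work is done
identical(squared_vec, squared_apply)
identical(squared_vec, squared_for)
```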
Let's say you have a list of data frames, my_list = list(iris, mtcars, CO2). If you want to get the number of rows of each data frame in the list and store it in a variable, you could use sapply or a for loop:
# easy to write, easy to read
rows_apply = sapply(my_list, nrow)
# annoying to read and write
rows_for = integer(length(my_list))
for (i in seq_along(my_list)) rows_for[i] = nrow(my_list[[i]])
But the more complex your task gets, the more readable a for loop becomes compared to alternatives like these. In your case, I'd prefer the for loop.
For more reading on this, see the old question Is apply more than syntactic sugar?. Since those answers were written, R has been upgraded to include a just-in-time compiler, which further speeds up for loops relative to apply. In the nearly 10-year-old answers there, you'll see that sometimes *apply is slightly faster than a for loop. Since the JIT compiler, I think you'll find the opposite: most of the time a for loop is slightly faster than *apply.
But in both of those cases, unless you're doing something absolutely trivial inside the for/apply, whatever you do inside for/apply will dominate the timings.

The value of dataframe change failure

When I extract the metadata from an IR site, I find that a value in the data frame cannot be overwritten.
In the metadata I extract, the attribute named "Related URLs" has the value "查看原文" (meaning "view the source"), which needs to be replaced by the real link from the web page.
> dput(imeta_dc)
structure(list(itemDisplayTable = structure(c(5L, 8L, 6L, 4L,
3L, 7L, 1L, 1L, 12L, 9L, 13L, 11L, 2L, 10L), .Names = c("Title",
"Author", "Source", "Issued Date", "Volume", "Corresponding Author",
"Abstract", "English Abstract", "Indexed Type", "Related URLs",
"Language", "Content Type", "URI", "专题"), .Label = c(" In the current data-intensive era, the traditional hands-on method of conducting scientific research by exploring related publications to generate a testable hypothesis is well on its way of becoming obsolete within just a year or two. Analyzing the literature and data to automatically generate a hypothesis might become the de facto approach to inform the core research efforts of those trying to master the exponentially rapid expansion of publications and datasets. Here, viewpoints are provided and discussed to help the understanding of challenges of data-driven discovery.",
"[http://ir.las.ac.cn/handle/12502/8904] ", "1, Issue:4, Pages:1-9",
"2016-11-03 ", "Data-driven Discovery: A New Era of Exploiting the Literature and Data",
"Journal of Data and Information Science ", "Ying Ding (E-mail:dingying#indiana.edu) ",
"Ying Ding; Kyle Stirling ", "查看原文 ", "期刊论文", "期刊论文 ",
"其他 ", "英语 "), class = "factor")), .Names = "itemDisplayTable", row.names = c("Title",
"Author", "Source", "Issued Date", "Volume", "Corresponding Author",
"Abstract", "English Abstract", "Indexed Type", "Related URLs",
"Language", "Content Type", "URI", "专题"), class = "data.frame")
I tried to use the row and column names to locate the value of "Related URLs" and change it with these statements:
meta_ru <- "http://www.jdis.org"
imeta_dc[c("Related URLs"), c("itemDisplayTable")] <- meta_ru
I use row names instead of row numbers because the metadata records have different lengths and different attribute orders, so this is the only way to locate an attribute reliably. Furthermore, when I do this, no error or warning occurs, but the value is not written; the cell becomes blank instead. How can I avoid this problem?
There is one problem with your dataset: the column itemDisplayTable is a factor. You need to convert it to character first, then select the row by name and assign the new value, as below.
imeta_dc$itemDisplayTable <- as.character(imeta_dc$itemDisplayTable)
meta_ru <- "http://www.jdis.org"
imeta_dc[rownames(imeta_dc) == "Related URLs", "itemDisplayTable"] <- meta_ru
View(imeta_dc)
After this, Related URLs is no longer empty; it contains "http://www.jdis.org" in the final output.
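A minimal sketch of what goes wrong, using a made-up two-row data frame (`meta` is hypothetical, not the asker's data): assigning a string that is not an existing level of a factor yields NA, which shows up as a blank cell. In a plain R session this normally comes with an "invalid factor level, NA generated" warning, which is suppressed here only to keep the output quiet.

```r
# Hypothetical miniature of the metadata table, with a factor column
meta <- data.frame(value = factor(c("Some title", "placeholder")),
                   row.names = c("Title", "Related URLs"))

# Assigning a new string to a factor cell: not a known level, so NA
broken <- meta
suppressWarnings(
  broken["Related URLs", "value"] <- "http://www.jdis.org"
)
is.na(broken["Related URLs", "value"])

# Converting to character first makes the assignment behave as expected
fixed <- meta
fixed$value <- as.character(fixed$value)
fixed["Related URLs", "value"] <- "http://www.jdis.org"
fixed["Related URLs", "value"]
```

Another way to avoid the issue is to create the data frame with stringsAsFactors = FALSE in the first place, where the extraction code allows it.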

read.table from write.table in R

I'm trying to do a qdap::multigsub in order to fix some typos, misspelled names, variant expressions and some other "aberrations" in a list of climatic event types (yes, it's NOAA's storm data set from an assignment in a Coursera class on reproducible research; this fixing is neither required nor expected in the assignment, it's just me trying my best!).
So I have events named "flash flood", "flash flooding", "flash floods" and the like, and I'd like to group them all in a level called "flash flood". So what I did first was:
expr <- c("^flash.*floo.*","thun.*")
repl <- c("flash flood","thunderstorm")
Length of each vector is 51 and this is a knitr assignment, so in order to keep it readable (margin column=80), I had to go with something like
expr <- c(expr,"new_expr_1","new_expr_2")
repl <- c(repl,"new_repl_1","new_repl_2") # repeated many, many times
This makes the code rather messy. Of course, I have the complete expr and repl vectors, so I would like to have each pair of corresponding values (expr and repl) on one row, so that readers of the code have an easy time (that's why dput won't work here: it doesn't align each pair of values).
I tried this:
a <- data.frame(expr=expr,repl=repl)
print(a,rownames=FALSE)
# copying the output, and then
b <- read.table(header=TRUE,text="paste_text_here")
but it failed (I think because print throws the output without quotation marks and there are some two-word expr or repl). I also tried
write.table(a,rownames=FALSE)
# copying the output, and then
b <- read.table(header=TRUE,text="paste_text_here")
but it doesn't work either (I think because write.table outputs each item between quotes, and read.table finds too many quotation marks to handle).
I'd like to have in my Rmarkdown file something like this:
exprRepl <- read.table(header=TRUE,text="expr repl
expr_1 repl_1
expr_2 repl_2")
How can I achieve this from the data I have now?
dput of the first 5 rows of data frame follow:
> dput(a[1:5,])
structure(list(expr = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("^BLIZZARD.*",
"^FLASH.*FLOOD.*", "^HAIL.*", "^HEAVY.*RAIN.*", "^HURRICANE.*"
), class = "factor"), repl = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("BLIZZARD",
"FLASH FLOOD", "HAIL", "HEAVY RAIN", "HURRICANE"), class = "factor")), .Names = c("expr",
"repl"), row.names = c(NA, 5L), class = "data.frame")
If there's any other approach to replace the wrong/variant names, I'd be very happy to hear about it and give it a try!
One solution is to use single quotes ' around the pasted text (this works as long as there are no ' characters in your data):
d <- structure(list(expr = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("^BLIZZARD.*",
"^FLASH.*FLOOD.*", "^HAIL.*", "^HEAVY.*RAIN.*", "^HURRICANE.*"
), class = "factor"), repl = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("BLIZZARD",
"FLASH FLOOD", "HAIL", "HEAVY RAIN", "HURRICANE"), class = "factor")), .Names = c("expr",
"repl"), row.names = c(NA, 5L), class = "data.frame")
write.table(d, row.names=FALSE)
# copy paste output of write.table in text field below:
read.table(header = TRUE, text='"expr" "repl"
"^HURRICANE.*" "HURRICANE"
"^BLIZZARD.*" "BLIZZARD"
"^FLASH.*FLOOD.*" "FLASH FLOOD"
"^HAIL.*" "HAIL"
"^HEAVY.*RAIN.*" "HEAVY RAIN"')
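To check that the round trip really preserves two-word values, here is a small sketch that skips the copy-paste step by capturing the write.table output as a string. The two-row data frame `a_small` is a stand-in for the real 51-row table, and the capture.output step only simulates the manual copy and paste.

```r
# Stand-in for the expr/repl table from the question
a_small <- data.frame(expr = c("^FLASH.*FLOOD.*", "^HEAVY.*RAIN.*"),
                      repl = c("FLASH FLOOD", "HEAVY RAIN"),
                      stringsAsFactors = FALSE)

# What you would copy from the console: one quoted pair per line
table_text <- paste(capture.output(write.table(a_small, row.names = FALSE)),
                    collapse = "\n")
cat(table_text, "\n")

# What you would paste into the Rmarkdown chunk, read back in
b_small <- read.table(header = TRUE, text = table_text,
                      stringsAsFactors = FALSE)

# The quoted fields keep "FLASH FLOOD" together as one value
all.equal(a_small, b_small)
```

Because each field is wrapped in double quotes, the space inside "FLASH FLOOD" is not treated as a column separator.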

Error in charToDate(x) : character string is not in a standard unambiguous format

I am writing because I have nowhere else to go to get an answer. I am trying to shrink my existing table a bit. It has the following form:
Živilec; Proizvodnja; Kariera d.o.o.; 18.11.2014 hh.mm.ss; Ljubljana
Živilec; Prehrambena industrija; Kariera d.o.o.; 18.11.2014 hh.mm.ss; Ljubljana
Vodja; Strojništvo; Adecco; 18.11.2014 hh.mm.ss; Maribor
Vodja; Tehnične storitve; Adecco; 18.11.2014 hh.mm.ss; Maribor
Vodja; Elektrotehnika; Adecco; 18.11.2014 hh.mm.ss; Celje
The dates are actually stored as 18.11.2014 8:35:59, but I don't need the time, just the date.
What I would like to get is this:
Živilec; Proizvodnja,Preh. industrija; Kariera d.o.o.; 18.11.2014; Ljubljana
Vodja; Stroj.,Teh. stor., Elektro.; Adecco; 18.11.2014; Maribor, Celje
I have tried to get this with the following R code:
matrik<-matrix(0,600,30)
for (i in 1:dim(a)[1]){
if (is.element(a[i,3],matrik[,15])==TRUE & is.element(a[i,1],matrik[,1])==TRUE){
katero<-which(a[i,1]==matrik[,1])
kdo<-which(a[i,15]==matrik[,15])
kje<-min(intersect(kdo,katero))
if (kje!=0){
prosto<-min(which(matrik[kje,2:14]==0))
matrik[kje,prosto]<-as.character(a[i,2])
prosti<-min(which(matrik[kje,17:30]==0))
matrik[kje,prosti]<-as.character(a[i,5])
}
if (kje==0){
povrsti<-min(which(matrik[,1]==0))
matrik[povrsti,1]<-as.character(a[i,1])
prosto<-min(which(matrik[povrsti,2:14]==0))+1
matrik[povrsti,prosto]<-as.character(a[i,2])
matrik[povrsti,15]<-as.character(a[i,3])
matrik[povrsti,16]<-as.character(a[i,4])
prosti<-min(which(matrik[povrsti,17:30]==0))+1
matrik[povrsti,prosti]<-as.character(a[i,5])
}
}
else {
povrsti<-min(which(matrik[,1]==0))
matrik[povrsti,1]<-as.character(a[i,1])
prosto<-min(which(matrik[povrsti,2:14]==0))+1
matrik[povrsti,prosto]<-as.character(a[i,2])
matrik[povrsti,15]<-as.character(a[i,3])
matrik[povrsti,16]<-as.character(a[i,4])
prosti<-min(which(matrik[povrsti,17:30]==0))+16
matrik[povrsti,prosti]<-as.character(a[i,5])
}
}
Basically, I create a new matrix to store the values. Because I cannot store several categories (such as teh. storitve, strojništvo, elektro) in one cell and only two values in another cell of the same column, I decided to find the maximum number of categories and allocate that many cells. If this problem can be solved another way, please let me know. After creating a zero matrix, I check whether the first element ("Živilec") and the third element ("Kariera d.o.o.") match an existing row. If they do, I want to add values only to the second and fifth (last) columns. If not, I add a new row to the matrix with all the values from the table. When I run this code I get the error:
Error in charToDate(x) :
character string is not in a standard unambiguous format
What to do? Any solutions?
Thank you for your time.
In order to parse the dates, you can do it like below:
library(lubridate)
x <- c("18.11.2014 8:35:59")
as.Date(dmy_hms(x))
Otherwise, you should give the community some sample data...use
dput(your_data)
people will show you the way in no time.
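For reference, the same date conversion can be sketched in base R without lubridate. The error in the question is what as.Date raises when it gets a string like this without a format; supplying the day.month.year format explicitly avoids it. The string below is the example value from the question.

```r
x <- "18.11.2014 8:35:59"

# Without a format, as.Date() only tries year-first layouts and fails;
# tryCatch() captures the error message instead of stopping
res <- tryCatch(as.Date(x), error = function(e) conditionMessage(e))
res

# With an explicit format the date part is parsed and the time is ignored
d <- as.Date(x, format = "%d.%m.%Y")
d

# Back to the original day.month.year layout, now without the time
format(d, "%d.%m.%Y")
```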
UPDATE
Here is a solution:
Load some useful libraries...
library(stringr)
library(dplyr)
your data...
toy_data <-
structure(list(V1 = structure(c(2L, 2L, 1L, 1L, 1L), .Label = c("Vodja",
"Živilec"), class = "factor"), V2 = structure(c(5L, 4L, 2L, 3L,
1L), .Label = c(" Elektrotehnika", " Strojništvo",
" Tehnične storitve", " Prehrambena industrija", " Proizvodnja"
), class = "factor"), V3 = structure(c(2L, 5L, 1L, 4L, 3L), .Label = c(" Adecco",
" Kariera d.o.o.", " Adecco", " Adecco",
" Kariera d.o.o."), class = "factor"), V4 = structure(c(2L, 2L,
1L, 1L, 1L), .Label = c(" 18.11.2014", " 18.11.2014"
), class = "factor"), V5 = structure(c(2L, 2L, 3L, 3L, 1L), .Label = c(" Celje",
" Ljubljana", " Maribor"), class = "factor")), .Names = c("V1",
"V2", "V3", "V4", "V5"), class = "data.frame", row.names = c(NA,
-5L))
a useful function...
my_str_c <- function(x){str_c(unique(x), collapse = ";")}
and the code for your desired output...
toy_data %>%
mutate_each(funs(str_trim)) %>%
group_by(V1) %>%
summarise_each(funs(my_str_c))
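Note that mutate_each, summarise_each and funs have been deprecated in later dplyr releases. As an alternative that does not depend on the dplyr version, here is a base R sketch of the same idea using aggregate. The data frame `jobs` is retyped from the question's sample rows, and `collapse_unique` is a helper name chosen here for illustration.

```r
# Sample rows from the question, with the stray leading spaces kept
jobs <- data.frame(
  V1 = c("Živilec", "Živilec", "Vodja", "Vodja", "Vodja"),
  V2 = c(" Proizvodnja", " Prehrambena industrija", " Strojništvo",
         " Tehnične storitve", " Elektrotehnika"),
  V3 = c(" Kariera d.o.o.", " Kariera d.o.o.", " Adecco", " Adecco", " Adecco"),
  V4 = rep(" 18.11.2014", 5),
  V5 = c(" Ljubljana", " Ljubljana", " Maribor", " Maribor", " Celje"),
  stringsAsFactors = FALSE
)

# Trim whitespace in every column (same role as str_trim above)
jobs[] <- lapply(jobs, trimws)

# Collapse the distinct values of a group into one string
collapse_unique <- function(x) paste(unique(x), collapse = ";")

# One row per V1, with the other columns collapsed within each group
result <- aggregate(jobs[c("V2", "V3", "V4", "V5")],
                    by = list(V1 = jobs$V1),
                    FUN = collapse_unique)
result
```

Grouping by both V1 and V3 (add `V3 = jobs$V3` to the `by` list and drop V3 from the aggregated columns) would match the question's check on the first and third fields more closely.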

How to add values for a column after scraping the table from a website

I am trying to get the total deaths from Ebola from the List of Ebola outbreaks and can't seem to find my mistake. I would appreciate some help. The website link is http://en.wikipedia.org/wiki/List_of_Ebola_outbreaks
I have used the following code:
url1 <-'http://en.wikipedia.org/wiki/List_of_Ebola_outbreaks'
df1<- readHTMLTable(url1)[[2]]
df1$"Human death"
but when I try to add up the values using the sum function, it gives the following error:
Error in Summary.factor(c(5L, 12L, 1L, 2L, 9L, 1L, 1L, 1L, 1L, 14L, 1L, :
sum not meaningful for factors
Can someone please help me figure this out?
You are reading the table with R's defaults, which convert character columns to factors. You can pass stringsAsFactors = FALSE to readHTMLTable, which forwards it to data.frame. The table also uses commas as thousands separators, which you need to remove before converting:
library(XML)
url1 <-'http://en.wikipedia.org/wiki/List_of_Ebola_outbreaks'
df1<- readHTMLTable(url1, which = 2, stringsAsFactors = FALSE)
df1$"Human death"
mySum <- sum(as.integer(gsub(",", "", df1$"Human death")))
> mySum
[1] 6910
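The two pitfalls can be reproduced offline with a toy column. The numbers below are made up for illustration and are not the Wikipedia figures. Note that calling as.integer directly on a factor returns the internal level codes rather than the displayed numbers, so the values have to go through character (and lose their commas) first.

```r
# Made-up values; one uses a comma as thousands separator
deaths <- factor(c("280", "151", "1,590", "0"))

# sum(deaths) would fail with "sum not meaningful for factors".
# as.integer() on a factor gives level codes, not the printed numbers:
codes <- as.integer(deaths)
codes

# Convert to character, strip the commas, then to integer
clean <- as.integer(gsub(",", "", as.character(deaths)))
total <- sum(clean)
total
```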
