read.table from write.table in R

I'm trying to do a qdap::multigsub in order to fix some typos, misspelled names, variant expressions and some other "aberrations" in a list of climatic event types (yes, it's the NOAA's data set on storms that belongs to an assignment in a coursera class on reproducible research; although this fixing is neither required nor expected in the assignment: it's me trying my best!).
So I have events named "flash flood", "flash flooding", "flash floods" and the like, and I'd like to group them all in a level called "flash flood". So what I did first was:
expr <- c("^flash.*floo.*","thun.*")
repl <- c("flash flood","thunderstorm")
Length of each vector is 51 and this is a knitr assignment, so in order to keep it readable (margin column=80), I had to go with something like
expr <- c(expr,"new_expr_1","new_expr_2")
repl <- c(repl,"new_repl_1","new_repl_2") # repeated many, many times
Which makes the code kind of messy. Of course, I have the complete expr and repl vectors, so I would like to have each pair (expr and repl) of corresponding values in a row, so the reader of the code would have an easy time (that's why dput won't work here: it doesn't align each pair of values).
I tried this:
a <- data.frame(expr=expr,repl=repl)
print(a,row.names=FALSE)
# copying the output, and then
b <- read.table(header=TRUE,text="paste_text_here")
but it failed (I think because print throws the output without quotation marks and there are some two-word expr or repl). I also tried
write.table(a,row.names=FALSE)
# copying the output, and then
b <- read.table(header=TRUE,text="paste_text_here")
but it doesn't work either (I think because write.table outputs each item between quotes, and read.table finds too many quotation marks to handle).
I'd like to have in my Rmarkdown file something like this:
exprRepl <- read.table(header=TRUE,text="expr repl
expr_1 repl_1
expr_2 repl_2")
How can I achieve this from the data I have now?
dput of the first 5 rows of data frame follow:
> dput(a[1:5,])
structure(list(expr = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("^BLIZZARD.*",
"^FLASH.*FLOOD.*", "^HAIL.*", "^HEAVY.*RAIN.*", "^HURRICANE.*"
), class = "factor"), repl = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("BLIZZARD",
"FLASH FLOOD", "HAIL", "HEAVY RAIN", "HURRICANE"), class = "factor")), .Names = c("expr",
"repl"), row.names = c(NA, 5L), class = "data.frame")
If there's any other approach to replace the wrong/variant names, I'd be very happy to hear from it and give it a try!

One solution is to use a single quote ' around the pasted text (this works as long as there are no ' in your data):
d <- structure(list(expr = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("^BLIZZARD.*",
"^FLASH.*FLOOD.*", "^HAIL.*", "^HEAVY.*RAIN.*", "^HURRICANE.*"
), class = "factor"), repl = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("BLIZZARD",
"FLASH FLOOD", "HAIL", "HEAVY RAIN", "HURRICANE"), class = "factor")), .Names = c("expr",
"repl"), row.names = c(NA, 5L), class = "data.frame")
write.table(d, row.names=FALSE)
# copy paste output of write.table in text field below:
read.table(header = TRUE, text='"expr" "repl"
"^HURRICANE.*" "HURRICANE"
"^BLIZZARD.*" "BLIZZARD"
"^FLASH.*FLOOD.*" "FLASH FLOOD"
"^HAIL.*" "HAIL"
"^HEAVY.*RAIN.*" "HEAVY RAIN"')
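To check that this round trip is lossless without copying by hand, the output of write.table can be captured and passed straight back to read.table. This is only a minimal sketch: the two-row table and the events vector are illustrative stand-ins, stringsAsFactors = FALSE is an assumption added so the columns stay character, and the base gsub loop is a stand-in for qdap::multigsub rather than its actual implementation.

```r
# Two-row stand-in for the real expr/repl table (illustrative data only)
d <- data.frame(expr = c("^HURRICANE.*", "^FLASH.*FLOOD.*"),
                repl = c("HURRICANE", "FLASH FLOOD"),
                stringsAsFactors = FALSE)

# Capture what write.table would print; every field is double-quoted
txt <- capture.output(write.table(d, row.names = FALSE))

# Read it back; quoted fields keep their embedded spaces
b <- read.table(header = TRUE, text = txt, stringsAsFactors = FALSE)

# Apply each pattern in turn with base gsub (stand-in for qdap::multigsub)
events <- c("HURRICANE ERIN", "FLASH FLOODING", "HAIL")
for (i in seq_len(nrow(b))) events <- gsub(b$expr[i], b$repl[i], events)
print(events)
```

Pasting the captured lines between single quotes, as above, then gives one aligned expr/repl pair per row in the Rmarkdown source.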

Related

Converting int into date?

I have been working with a dataset that comes with a date column. When I run typeof(headlineDat$Date) I get a type integer.
I've tried pasting in a few things I found off google but none have seemed to work. I've tried running this piece of code
as.POSIXct(strptime(headlineDat$Time.read,format= "%Y-%m-%d"))
My aim is to have the same format as the year column below. The reason why I want to do this is that I want to be able to create a unique identifier so I can easily match dates when I merge the two data frames.
Any help on this would be greatly appreciated !
This is my dput output:
dput(droplevels(headlineDat[1:5, ]))
structure(list(Date = structure(c(1L, 3L, 3L, 2L, 4L), .Label = c("2018-04-26T11:31:02+00:00",
"2018-05-02T21:10:20+00:00", "2018-05-03T15:30:59+00:00", "2018-05-03T18:00:39+00:00"
), class = "factor"), Headline = structure(c(5L, 2L, 4L, 3L,
1L), .Label = c("Bitcoin Futures Trading Questioned By Chinese National Media",
"Daily Volatility Decline? Bitcoin Has Seen $1K Range 43 Times In 2018",
"Reddit to Relaunch Bitcoin Payments (And Add More Cryptos)",
"Sell In May and Go Away? Not for Bitcoin Bulls", "Square Books Small Profit for First Quarter of Bitcoin Sales"
), class = "factor")), row.names = c(NA, 5L), class = "data.frame")
You are starting with a standard format, so as.Date does the conversion just fine.
headlineDat$Date = as.Date(headlineDat$Date)
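A small sketch of why this works, using two of the timestamps from the dput output: as.Date reads the leading year-month-day part of each string and ignores whatever follows. The POSIXct line is a hypothetical variation for keeping the time of day, and it assumes UTC because every offset in the sample is +00:00.

```r
# Timestamps copied from the dput output, stored as a factor
x <- factor(c("2018-04-26T11:31:02+00:00", "2018-05-03T15:30:59+00:00"))

# as.Date() parses the leading "%Y-%m-%d" part and ignores the rest
dates <- as.Date(x)
print(dates)

# Hypothetical variation: keep the time of day as well (assumes UTC)
stamps <- as.POSIXct(as.character(x),
                     format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
print(stamps)
```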

Using apply family and multiple functions on lists in R

I have a question following my answer to this question:
Matching vertex attributes across a list of edgelists R
My solution was to use for loops, but we should always try to optimize(vectorize) when we can.
What I'm trying to understand is how I would vectorize the solution I made in the post.
My solution was
for(i in 1:length(graph_list)){
graph_list[[i]]=set_vertex_attr(graph_list[[i]],"gender", value=attribute_df$gender[match(V(graph_list[[i]])$name, attribute_df$names)])
}
Ideally we could vectorize this with lapply but I'm having some trouble conceiving how to do that. Here's what I've got
graph_lists_new=lapply(graph_list, set_vertex_attr, value=attribute_df$gender[match(V(??????????)$name, attribute_df$names)])
What I'm unclear about is what I'd put in the part with the ??????. The thing inside the V() function should be each item in the list, but what I don't get is what I'd put inside when I'm using lapply.
All data can be found in the link I posted, but here's the data anyway
attribute_df<- structure(list(names = structure(c(6L, 7L, 5L, 2L, 1L, 8L, 3L,
4L), .Label = c("Andy", "Angela", "Eric", "Jamie", "Jeff", "Jim",
"Pam", "Tim"), class = "factor"), gender = structure(c(3L, 2L,
3L, 2L, 3L, 1L, 1L, 2L), .Label = c("", "F", "M"), class = "factor"),
happiness = c(8, 9, 4.5, 5.7, 5, 6, 7, 8)), class = "data.frame", row.names = c(NA,
-8L))
edgelist<-list(structure(list(nominator1 = structure(c(3L, 4L, 1L, 2L), .Label = c("Angela",
"Jeff", "Jim", "Pam"), class = "factor"), nominee1 = structure(c(1L,
2L, 3L, 2L), .Label = c("Andy", "Angela", "Jeff"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L)), structure(list(nominator2 = structure(c(4L, 1L, 2L, 3L
), .Label = c("Eric", "Jamie", "Oscar", "Tim"), class = "factor"),
nominee2 = structure(c(1L, 3L, 2L, 3L), .Label = c("Eric",
"Oscar", "Tim"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L)))
library(igraph) # provides graph_from_data_frame, V and set_vertex_attr
graph_list<- lapply(edgelist, graph_from_data_frame)
Since you need to use graph_list[[i]] multiple times in your call, to use lapply you need to write a custom function, such as this anonymous function. (It's the same code as your loop, I just wrapped it in function(x) and replaced all instances of graph_list[[i]] with x.)
graph_list = lapply(graph_list, function(x)
  set_vertex_attr(x, "gender", value = attribute_df$gender[match(V(x)$name, attribute_df$names)])
)
(Note that I didn't test this, but it should work unless I made a typo.)
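The same wrap-it-in-function(x) pattern can be tried without igraph. The sketch below is a base R analogue, not the igraph code itself: people_list and add_gender are hypothetical names, and plain data frames stand in for the graphs. It shows the lapply version and the for loop producing the same result.

```r
# Lookup table, loosely modelled on attribute_df (illustrative values)
attribute_df <- data.frame(names = c("Jim", "Pam", "Jeff"),
                           gender = c("M", "F", "M"),
                           stringsAsFactors = FALSE)

# A list of data frames standing in for the list of graphs
people_list <- list(
  data.frame(name = c("Pam", "Jim"), stringsAsFactors = FALSE),
  data.frame(name = c("Jeff", "Oscar"), stringsAsFactors = FALSE)
)

# The loop body, written once as a function of a single list element
add_gender <- function(x) {
  x$gender <- attribute_df$gender[match(x$name, attribute_df$names)]
  x
}

# lapply version
via_lapply <- lapply(people_list, add_gender)

# Equivalent for loop
via_for <- people_list
for (i in seq_along(via_for)) via_for[[i]] <- add_gender(via_for[[i]])

print(identical(via_lapply, via_for))
```

Names with no match in the lookup table (here "Oscar") get NA, which is also what match does in the igraph version.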
lapply isn't vectorization---it's just "loop hiding". In this case, I think your for loop is a nicer way to do things than lapply. Especially since you are modifying existing objects, your simple for loop will probably be more efficient than an lapply solution, as well as more readable.
When we talk about vectorization for efficiency, we almost always mean atomic vectors, not lists. (It's vectorization, after all, not listization.) The reason to use lapply and related functions (sapply, vapply, Map, most of the purrr package) isn't computer efficiency, it's readability, and human-efficiency to write.
Let's say you have a list of data frames, my_list = list(iris, mtcars, CO2). If you want to get the number of rows for each of the data frames in the list and store it in a variable, we could use sapply or a for loop:
# easy to write, easy to read
rows_apply = sapply(my_list, nrow)
# annoying to read and write
rows_for = integer(length(my_list))
for (i in seq_along(my_list)) rows_for[i] = nrow(my_list[[i]])
But the more complex your task gets, the more readable a for loop becomes compared to alternatives like these. In your case, I'd prefer the for loop.
For more reading on this, see the old question Is apply more than syntactic sugar?. Since those answers were written, R has been upgraded to include a just-in-time compiler, which further speeds up for loops relative to apply. In the nearly 10-year-old answers there, you'll see that sometimes *apply is slightly faster than a for loop. Since the JIT compiler, I think you'll find the opposite: most of the time a for loop is slightly faster than *apply.
But in both of those cases, unless you're doing something absolutely trivial inside the for/apply, whatever you do inside for/apply will dominate the timings.

R error : arguments imply differing number of rows

So I am trying to operate a function over a few columns of a data frame, using a for loop.
z <- function(x) gsub("[^\\.\\d]", "", x, perl = TRUE)
data <- cbind(data[1:2], for(i in seq(3, 9)) {z(data[[i]])})
I keep running into the error as mentioned in the subject
arguments imply differing number of rows
The number of rows in all my columns are same.
I tried to use lapply for this, but though it works, it converts the column types over which I apply the function to factor. The columns are numerical values, but are originally read as characters from the file (they are stored as such). So when I try to convert to numbers after using lapply, I get number of levels as output (like, 1,2,3...)
Any suggestions, using either the for loop, or lapply are welcome. Thanks in advance.
> dput(head(data,3))
structure(list(MCF.Channel.Grouping = structure(c(6L, 6L, 6L), .Label = c("(Other)",
"Direct", "Display", "Email", "Organic Search", "Paid Search",
"Referral", "Social Network"), class = "factor"), Device.Category = structure(c(2L,
1L, 3L), .Label = c("desktop", "mobile", "tablet"), class = "factor"),
Spend = c("A$503,172.17", "A$375,940.43", "A$92,560.94"),
Clicks = c("1,545,416", "1,037,740", "291,314"), Impressions = c("7,328,657",
"3,787,612", "1,178,508"), Data.Driven.Conversions = c("1,697,814.32",
"1,540,810.43", "430,738.63"), Data.Driven.CPA = c("A$0.30",
"A$0.24", "A$0.21"), Data.Driven.Conversion.Value = c("A$12,815,842.66",
"A$13,883,073.58", "A$3,804,800.15"), Data.Driven.ROAS = c("2547.01%",
"3692.89%", "4110.59%")), .Names = c("MCF.Channel.Grouping",
"Device.Category", "Spend", "Clicks", "Impressions", "Data.Driven.Conversions",
"Data.Driven.CPA", "Data.Driven.Conversion.Value", "Data.Driven.ROAS"
), row.names = c(NA, 3L), class = "data.frame")
We can use
data[-(1:2)] <- lapply(data[-(1:2)], z)
The function is run on columns that are not the first or second. The output is assigned to the same subset in the data.
The original method did not work because a for loop does not return the values computed in its body; the loop itself evaluates to NULL. Check by trying to save it as a variable:
x <- for(i in seq(3, 9)) {z(data[[i]])}
x
NULL
Even though we assigned the loop to a variable, nothing was captured: the loop ran and then discarded the results. To make a loop work, assign the values inside it:
for ( i in 3:9) data[,i] <- z(data[,i])
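Since the goal is numeric columns, the cleaning and the conversion can be done in one pass. This is a minimal sketch on a two-row stand-in for the data (values taken from the dput output); wrapping z in as.numeric is an addition here, not part of the answer above.

```r
# Same cleaning function as in the question: keep only digits and dots
z <- function(x) gsub("[^\\.\\d]", "", x, perl = TRUE)

# Two-row stand-in with one label column and two character "number" columns
data <- data.frame(Device.Category = c("mobile", "desktop"),
                   Spend = c("A$503,172.17", "A$375,940.43"),
                   Clicks = c("1,545,416", "1,037,740"),
                   stringsAsFactors = FALSE)

# Clean and convert every column except the first
data[-1] <- lapply(data[-1], function(col) as.numeric(z(col)))
str(data)
```

If a column has already become a factor, as.numeric(as.character(f)) recovers the values; as.numeric(f) alone returns the level codes (1, 2, 3, ...) described in the question.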

Error in charToDate(x) : character string is not in a standard unambiguous format

I am writing because I have nowhere else to go to get an answer. I am trying to shrink my existing table a bit. It has the following form:
Živilec; Proizvodnja; Kariera d.o.o.; 18.11.2014 hh.mm.ss; Ljubljana
Živilec; Prehrambena industrija; Kariera d.o.o.; 18.11.2014 hh.mm.ss; Ljubljana
Vodja; Strojništvo; Adecco; 18.11.2014 hh.mm.ss; Maribor
Vodja; Tehnične storitve; Adecco; 18.11.2014 hh.mm.ss; Maribor
Vodja; Elektrotehnika; Adecco; 18.11.2014 hh.mm.ss; Celje
The dates are actually inserted as 18.11.2014 8:35:59, but I don't need the time, just the date.
And what I wish to get to is this:
Živilec; Proizvodnja,Preh. industrija; Kariera d.o.o.; 18.11.2014; Ljubljana
Vodja; Stroj.,Teh. stor., Elektro.; Adecco; 18.11.2014; Maribor, Celje
I have tried getting this with the help of this R code:
matrik<-matrix(0,600,30)
for (i in 1:dim(a)[1]){
  if (is.element(a[i,3],matrik[,15])==TRUE & is.element(a[i,1],matrik[,1])==TRUE){
    katero<-which(a[i,1]==matrik[,1])
    kdo<-which(a[i,15]==matrik[,15])
    kje<-min(intersect(kdo,katero))
    if (kje!=0){
      prosto<-min(which(matrik[kje,2:14]==0))
      matrik[kje,prosto]<-as.character(a[i,2])
      prosti<-min(which(matrik[kje,17:30]==0))
      matrik[kje,prosti]<-as.character(a[i,5])
    }
    if (kje==0){
      povrsti<-min(which(matrik[,1]==0))
      matrik[povrsti,1]<-as.character(a[i,1])
      prosto<-min(which(matrik[povrsti,2:14]==0))+1
      matrik[povrsti,prosto]<-as.character(a[i,2])
      matrik[povrsti,15]<-as.character(a[i,3])
      matrik[povrsti,16]<-as.character(a[i,4])
      prosti<-min(which(matrik[povrsti,17:30]==0))+1
      matrik[povrsti,prosti]<-as.character(a[i,5])
    }
  }
  else {
    povrsti<-min(which(matrik[,1]==0))
    matrik[povrsti,1]<-as.character(a[i,1])
    prosto<-min(which(matrik[povrsti,2:14]==0))+1
    matrik[povrsti,prosto]<-as.character(a[i,2])
    matrik[povrsti,15]<-as.character(a[i,3])
    matrik[povrsti,16]<-as.character(a[i,4])
    prosti<-min(which(matrik[povrsti,17:30]==0))+16
    matrik[povrsti,prosti]<-as.character(a[i,5])
  }
}
Basically, I create a new matrix in which to store the values. Because I cannot store several categories (like teh. storitve, strojništvo, elektro) in one cell and just two values in another cell of the same column, I decided to look at the maximum number of categories and make that many cells. If this problem can be solved another way, please let me know as well. After making a zero matrix, I check whether the first element (so "Živilec") and the third element (so "Kariera d.o.o.") are already present in the matrix; if so, I would like to just add values to the second and fifth (last) columns. If not, I must add a new row to the matrix with all the values from the table. When I run this code I get the error:
Error in charToDate(x) :
character string is not in a standard unambiguous format
What to do? Any solutions?
Thank you for your time.
In order to parse the dates, you can do it like below:
library(lubridate)
x <- c("18.11.2014 8:35:59")
as.Date(dmy_hms(x))
Otherwise, you should give the community some sample data. Use
dput(your_data)
and people will show you the way in no time.
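The date parsing shown earlier in this answer can also be sketched in base R by passing an explicit format. Without a format, as.Date only tries year-first layouts on a character string, which is where the "not in a standard unambiguous format" error comes from. The sample string is the one given in the question.

```r
x <- "18.11.2014 8:35:59"

# Day.month.year given explicitly; the trailing time is ignored
d <- as.Date(x, format = "%d.%m.%Y")
print(d)

# Without a format the string cannot be matched, so as.Date() stops
# with an error; tryCatch keeps the message instead of aborting
msg <- tryCatch(as.Date(x), error = function(e) conditionMessage(e))
print(msg)
```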
UPDATE
Here is a solution:
Load some useful libraries...
library(stringr)
library(dplyr)
your data...
toy_data <-
structure(list(V1 = structure(c(2L, 2L, 1L, 1L, 1L), .Label = c("Vodja",
"Živilec"), class = "factor"), V2 = structure(c(5L, 4L, 2L, 3L,
1L), .Label = c(" Elektrotehnika", " Strojništvo",
" Tehnične storitve", " Prehrambena industrija", " Proizvodnja"
), class = "factor"), V3 = structure(c(2L, 5L, 1L, 4L, 3L), .Label = c(" Adecco",
" Kariera d.o.o.", " Adecco", " Adecco",
" Kariera d.o.o."), class = "factor"), V4 = structure(c(2L, 2L,
1L, 1L, 1L), .Label = c(" 18.11.2014", " 18.11.2014"
), class = "factor"), V5 = structure(c(2L, 2L, 3L, 3L, 1L), .Label = c(" Celje",
" Ljubljana", " Maribor"), class = "factor")), .Names = c("V1",
"V2", "V3", "V4", "V5"), class = "data.frame", row.names = c(NA,
-5L))
a useful function...
my_str_c <- function(x){str_c(unique(x), collapse = ";")}
and the code for your desired output...
toy_data %>%
  mutate_each(funs(str_trim)) %>%
  group_by(V1) %>%
  summarise_each(funs(my_str_c))
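For readers without dplyr, or on a version where mutate_each and funs are deprecated, the same collapse can be sketched in base R with aggregate. This is an alternative technique rather than the answer's method: toy and collapse_unique are hypothetical names, the rows are retyped from the question, and grouping by both V1 and V3 is an assumption based on the desired output.

```r
# Rows retyped from the question, with the stray leading spaces kept
toy <- data.frame(
  V1 = c("Živilec", "Živilec", "Vodja", "Vodja", "Vodja"),
  V2 = c(" Proizvodnja", " Prehrambena industrija", " Strojništvo",
         " Tehnične storitve", " Elektrotehnika"),
  V3 = c(" Kariera d.o.o.", " Kariera d.o.o.", " Adecco", " Adecco", " Adecco"),
  V4 = rep(" 18.11.2014", 5),
  V5 = c(" Ljubljana", " Ljubljana", " Maribor", " Maribor", " Celje"),
  stringsAsFactors = FALSE
)

# Trim whitespace in every column, keeping the data frame structure
toy[] <- lapply(toy, trimws)

# Collapse the distinct values of a group into one string
collapse_unique <- function(x) paste(unique(x), collapse = ", ")

# One row per (V1, V3) pair; the other columns are collapsed
res <- aggregate(toy[c("V2", "V4", "V5")],
                 by = toy[c("V1", "V3")],
                 FUN = collapse_unique)
print(res)
```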

Include a text representation of an object (like dput) in a function call for reproducible research

I have created a shiny app in which a user can load a file and use the object as a function argument. I also print the code to run the function locally (so that I or anyone else could copy and paste to reproduce the result).
What I would like to do is to be able to use something like dput but to save the text representation of the loaded object to an object rather than the console. dput outputs to the console, but simply returns a copy of its first argument. I can use deparse, but it splits the result across several strings when the text is longer than width.cutoff (default 60 and max 500).
The following hacky reproducible example illustrates. In it I use image as the example function. In my case I have other functions with more arguments.
#create example matrices
m2 <- matrix(1:4,2,2)
m4 <- matrix(1:4,4,4)
#this is what I want to recreate
image(z=m2,col=rainbow(4))
image(z=m4,col=rainbow(4))
#convert the matrices to their text representation
txtm2 <- deparse(m2)
txtm4 <- deparse(m4)
#create a list of arguments
lArgs2 <- list( z=txtm2, col=rainbow(4) )
lArgs4 <- list( z=txtm4, col=rainbow(4) )
#construct arguments list
vArgs2 <- paste0(names(lArgs2),"=",lArgs2,", ")
vArgs4 <- paste0(names(lArgs4),"=",lArgs4,", ")
#remove final comma and space
vArgs2[length(vArgs2)] <- substr(vArgs2[length(vArgs2)],0,nchar(vArgs2[length(vArgs2)])-2)
vArgs4[length(vArgs4)] <- substr(vArgs4[length(vArgs4)],0,nchar(vArgs4[length(vArgs4)])-2)
#create the text function call
cat("image(",vArgs2,")")
cat("image(",vArgs4,")")
#the 1st one when pasted works
image( z=structure(1:4, .Dim = c(2L, 2L)), col=c("#FF0000FF", "#80FF00FF", "#00FFFFFF", "#8000FFFF") )
#the 2nd one gives an error because the object has been split across multiple lines
image( z=c("structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, ", "2L, 3L, 4L), .Dim = c(4L, 4L))"), col=c("#FF0000FF", "#80FF00FF", "#00FFFFFF", "#8000FFFF") )
#In an ideal world I would also like it to work when I did this, but maybe that's asking too much
image(z=txtm2,col=rainbow(4))
I realise that the way I construct the function call is a hack, but when I looked at it a while ago I couldn't find a better way of doing it. Open to any suggestions. Thanks.
You can do something like:
## an object that you want to recreate
m2 <- matrix(1:4,2,2)
## use capture.output to save structure as a string in a variable
xx <- capture.output(dput(m2))
## recreate the object
m2_ <- eval(parse(text=xx))
image(z=m2_,col=rainbow(4))
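To get a single pasteable call even for larger objects, the pieces returned by deparse can be joined into one string before building the call text. This is a sketch under assumptions: to_text is a hypothetical helper name, and the call is assembled with sprintf rather than with the paste0 approach from the question.

```r
# Hypothetical helper: one-string text representation of any object
to_text <- function(x) paste(deparse(x, width.cutoff = 500L), collapse = " ")

# The matrix whose text was split across two strings in the question
m4 <- matrix(1:4, 4, 4)

# Build the call as a single string that can be shown to the user
call_txt <- sprintf("image(z = %s, col = %s)",
                    to_text(m4), to_text(rainbow(4)))
cat(call_txt, "\n")

# The text representation evaluates back to the original object
m4_again <- eval(parse(text = to_text(m4)))
print(identical(m4, m4_again))
```

Joining with collapse matters because deparse can still return several strings for objects whose text is longer than 500 characters.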
