Changing a value in a data frame fails - r

When I extract the metadata from an IR site, I find that a value in the data frame cannot be overwritten.
In the metadata I extract, the attribute named "Related URLs" has the value "查看原文" (which means "view the source"), and this needs to be replaced by the real link from the webpage.
> dput(imeta_dc)
structure(list(itemDisplayTable = structure(c(5L, 8L, 6L, 4L,
3L, 7L, 1L, 1L, 12L, 9L, 13L, 11L, 2L, 10L), .Names = c("Title",
"Author", "Source", "Issued Date", "Volume", "Corresponding Author",
"Abstract", "English Abstract", "Indexed Type", "Related URLs",
"Language", "Content Type", "URI", "专题"), .Label = c(" In the current data-intensive era, the traditional hands-on method of conducting scientific research by exploring related publications to generate a testable hypothesis is well on its way of becoming obsolete within just a year or two. Analyzing the literature and data to automatically generate a hypothesis might become the de facto approach to inform the core research efforts of those trying to master the exponentially rapid expansion of publications and datasets. Here, viewpoints are provided and discussed to help the understanding of challenges of data-driven discovery.",
"[http://ir.las.ac.cn/handle/12502/8904] ", "1, Issue:4, Pages:1-9",
"2016-11-03 ", "Data-driven Discovery: A New Era of Exploiting the Literature and Data",
"Journal of Data and Information Science ", "Ying Ding (E-mail:dingying#indiana.edu) ",
"Ying Ding; Kyle Stirling ", "查看原文 ", "期刊论文", "期刊论文 ",
"其他 ", "英语 "), class = "factor")), .Names = "itemDisplayTable", row.names = c("Title",
"Author", "Source", "Issued Date", "Volume", "Corresponding Author",
"Abstract", "English Abstract", "Indexed Type", "Related URLs",
"Language", "Content Type", "URI", "专题"), class = "data.frame")
I tried to use the row name and column name to locate the value of "Related URLs" and change it with these statements:
meta_ru <- "http://www.jdis.org"
imeta_dc[c("Related URLs"), c("itemDisplayTable")] <- meta_ru
I use row names instead of row numbers because the metadata records differ in length and in the order of their attributes, so this is the only way to locate one attribute reliably. Furthermore, when I do this, no error or warning occurs, but the value is not written; the cell becomes blank instead. What should I do to avoid this problem?

There is one problem with your dataset: the column itemDisplayTable is a factor. You need to convert it to character first, and then you can use the row names to assign the new value, as below (df stands for your data frame, imeta_dc).
df$itemDisplayTable <- as.character(df$itemDisplayTable)
meta_ru <- c("http://www.jdis.org")
df[rownames(df) %in% "Related URLs", "itemDisplayTable"] <- meta_ru
View(df)
In the final output, the "Related URLs" row is no longer empty; it now contains "http://www.jdis.org".
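As a minimal sketch of why the conversion matters, the snippet below uses a small stand-in data frame (the name imeta and its two rows are made up for illustration, not the real metadata). It shows that assigning a string that is not an existing factor level yields NA, and that the same assignment succeeds once the column is character.

```r
# Toy stand-in for imeta_dc (values are illustrative, not the real metadata)
imeta <- data.frame(
  itemDisplayTable = c("Some title", "view source "),
  row.names = c("Title", "Related URLs"),
  stringsAsFactors = TRUE
)
meta_ru <- "http://www.jdis.org"

# On a factor column, a value that is not an existing level becomes NA
# (R normally also emits an "invalid factor level" warning here)
broken <- imeta
suppressWarnings(broken["Related URLs", "itemDisplayTable"] <- meta_ru)
is.na(broken["Related URLs", "itemDisplayTable"])

# Convert to character first, then assign by row name
imeta$itemDisplayTable <- as.character(imeta$itemDisplayTable)
imeta["Related URLs", "itemDisplayTable"] <- meta_ru
imeta["Related URLs", "itemDisplayTable"]
```

Reading the data with stringsAsFactors = FALSE in the first place would avoid the conversion step altogether.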

Related

Converting int into date?

I have been working with a dataset that comes with a date column. When I run typeof(headlineDat$Date), I get "integer".
I've tried a few things I found on Google, but none of them seemed to work. For example, I tried running this piece of code:
as.POSIXct(strptime(headlineDat$Time.read,format= "%Y-%m-%d"))
My aim is to have the same format as the year column below. I want to do this so that I can create a unique identifier and easily match dates when I merge the two data frames.
Any help on this would be greatly appreciated!
This is my dput output:
dput(droplevels(headlineDat[1:5, ]))
structure(list(Date = structure(c(1L, 3L, 3L, 2L, 4L), .Label = c("2018-04-26T11:31:02+00:00",
"2018-05-02T21:10:20+00:00", "2018-05-03T15:30:59+00:00", "2018-05-03T18:00:39+00:00"
), class = "factor"), Headline = structure(c(5L, 2L, 4L, 3L,
1L), .Label = c("Bitcoin Futures Trading Questioned By Chinese National Media",
"Daily Volatility Decline? Bitcoin Has Seen $1K Range 43 Times In 2018",
"Reddit to Relaunch Bitcoin Payments (And Add More Cryptos)",
"Sell In May and Go Away? Not for Bitcoin Bulls", "Square Books Small Profit for First Quarter of Bitcoin Sales"
), class = "factor")), row.names = c(NA, 5L), class = "data.frame")
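typeof() reports integer because the column is a factor, which is stored as integer codes. The labels start with a standard date format (YYYY-MM-DD), so as.Date does the conversion just fine.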
headlineDat$Date = as.Date(headlineDat$Date)
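A small sketch of the conversion on two of the values from the dput output. The column names Day and Stamp are hypothetical, added here only to show both results side by side. The second step, which keeps the time of day via as.POSIXct, goes beyond what was asked and assumes the timestamps are all in UTC (the "+00:00" suffix is simply ignored by the format string).

```r
# Two of the timestamps from the question, stored as a factor
headlineDat <- data.frame(
  Date = factor(c("2018-04-26T11:31:02+00:00", "2018-05-03T15:30:59+00:00"))
)

# Date only: the text after the YYYY-MM-DD part is ignored
headlineDat$Day <- as.Date(headlineDat$Date)

# Date and time: parse the character labels with an explicit format
headlineDat$Stamp <- as.POSIXct(as.character(headlineDat$Date),
                                format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

str(headlineDat)
```

For merging two data frames on the day, the Date-class column is usually enough.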

R error : arguments imply differing number of rows

So I am trying to operate a function over a few columns of a data frame, using a for loop.
z <- function(x) gsub("[^\\.\\d]", "", x, perl = TRUE)
data <- cbind(data[1:2], for(i in seq(3, 9)) {z(data[[i]])})
I keep running into the error mentioned in the title:
arguments imply differing number of rows
All my columns have the same number of rows.
I tried to use lapply for this. It works, but it converts the columns I apply the function to into factors. The columns hold numerical values, but they are read from the file as characters (they are stored as such). So when I try to convert them to numbers after using lapply, I get the level codes as output (1, 2, 3, ...).
Any suggestions using either the for loop or lapply are welcome. Thanks in advance.
> dput(head(data,3))
structure(list(MCF.Channel.Grouping = structure(c(6L, 6L, 6L), .Label = c("(Other)",
"Direct", "Display", "Email", "Organic Search", "Paid Search",
"Referral", "Social Network"), class = "factor"), Device.Category = structure(c(2L,
1L, 3L), .Label = c("desktop", "mobile", "tablet"), class = "factor"),
Spend = c("A$503,172.17", "A$375,940.43", "A$92,560.94"),
Clicks = c("1,545,416", "1,037,740", "291,314"), Impressions = c("7,328,657",
"3,787,612", "1,178,508"), Data.Driven.Conversions = c("1,697,814.32",
"1,540,810.43", "430,738.63"), Data.Driven.CPA = c("A$0.30",
"A$0.24", "A$0.21"), Data.Driven.Conversion.Value = c("A$12,815,842.66",
"A$13,883,073.58", "A$3,804,800.15"), Data.Driven.ROAS = c("2547.01%",
"3692.89%", "4110.59%")), .Names = c("MCF.Channel.Grouping",
"Device.Category", "Spend", "Clicks", "Impressions", "Data.Driven.Conversions",
"Data.Driven.CPA", "Data.Driven.Conversion.Value", "Data.Driven.ROAS"
), row.names = c(NA, 3L), class = "data.frame")
We can use
data[-(1:2)] <- lapply(data[-(1:2)], z)
The function is run on every column except the first and second, and the output is assigned back to the same subset of the data.
The original method did not work because a for loop does not return its results. Check by trying to save it as a variable:
x <- for(i in seq(3, 9)) {z(data[[i]])}
x
NULL
Even though we assigned the loop to a variable, nothing was captured: a for loop always returns NULL, so the value computed in each iteration is discarded. To make a loop work, assign the values inside it:
for ( i in 3:9) data[,i] <- z(data[,i])
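Since the goal was numeric columns, here is a sketch that combines the cleaning function with as.numeric in one lapply call. It uses a cut-down, made-up version of the data (three columns instead of nine) and assumes the columns are character, not factor; a factor column would need as.character() first.

```r
# Strip everything except digits and the decimal point
z <- function(x) gsub("[^\\.\\d]", "", x, perl = TRUE)

# Reduced example data, modelled on the dput output above
data <- data.frame(
  Device.Category = c("mobile", "desktop"),
  Spend  = c("A$503,172.17", "A$375,940.43"),
  Clicks = c("1,545,416", "1,037,740"),
  stringsAsFactors = FALSE
)

# Clean and convert every column except the first
data[-1] <- lapply(data[-1], function(col) as.numeric(z(col)))

str(data)
```

Assigning into data[-1] keeps the result a data frame, so no cbind is needed.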

read.table from write.table in R

I'm trying to do a qdap::multigsub in order to fix some typos, misspelled names, variant expressions and some other "aberrations" in a list of climatic event types (yes, it's the NOAA's data set on storms that belongs to an assignment in a coursera class on reproducible research; although this fixing is neither required nor expected in the assignment: it's me trying my best!).
So I have events named "flash flood", "flash flooding", "flash floods" and the like, and I'd like to group them all in a level called "flash flood". So what I did first was:
expr <- c("^flash.*floo.*","thun.*")
repl <- c("flash flood","thunderstorm")
Length of each vector is 51 and this is a knitr assignment, so in order to keep it readable (margin column=80), I had to go with something like
expr <- c(expr,"new_expr_1","new_expr_2")
repl <- c(repl,"new_repl_1","new_repl_2") # repeated many, many times
This makes the code kind of messy. Of course, I have the complete expr and repl vectors, so I would like to have each pair of corresponding values (expr and repl) in one row, so that the reader of the code has an easy time (that's why dput won't work here: it doesn't align each pair of values).
I tried this:
a <- data.frame(expr=expr,repl=repl)
print(a, row.names=FALSE)
# copying the output, and then
b <- read.table(header=TRUE,text="paste_text_here")
but it failed (I think because print throws the output without quotation marks and there are some two-word expr or repl). I also tried
write.table(a, row.names=FALSE)
# copying the output, and then
b <- read.table(header=TRUE,text="paste_text_here")
but it doesn't work either (I think because write.table outputs each item between quotes, and read.table finds too many quotation marks to handle).
I'd like to have in my Rmarkdown file something like this:
exprRepl <- read.table(header=TRUE,text="expr repl
expr_1 repl_1
expr_2 repl_2")
How can I achieve this from the data I have now?
dput of the first 5 rows of data frame follow:
> dput(a[1:5,])
structure(list(expr = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("^BLIZZARD.*",
"^FLASH.*FLOOD.*", "^HAIL.*", "^HEAVY.*RAIN.*", "^HURRICANE.*"
), class = "factor"), repl = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("BLIZZARD",
"FLASH FLOOD", "HAIL", "HEAVY RAIN", "HURRICANE"), class = "factor")), .Names = c("expr",
"repl"), row.names = c(NA, 5L), class = "data.frame")
If there's any other approach to replace the wrong/variant names, I'd be very happy to hear about it and give it a try!
One solution is to use a single quote ' around the pasted text (this works as long as there are no ' characters in your data):
d <- structure(list(expr = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("^BLIZZARD.*",
"^FLASH.*FLOOD.*", "^HAIL.*", "^HEAVY.*RAIN.*", "^HURRICANE.*"
), class = "factor"), repl = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("BLIZZARD",
"FLASH FLOOD", "HAIL", "HEAVY RAIN", "HURRICANE"), class = "factor")), .Names = c("expr",
"repl"), row.names = c(NA, 5L), class = "data.frame")
write.table(d, row.names=FALSE)
# copy paste output of write.table in text field below:
read.table(header = TRUE, text='"expr" "repl"
"^HURRICANE.*" "HURRICANE"
"^BLIZZARD.*" "BLIZZARD"
"^FLASH.*FLOOD.*" "FLASH FLOOD"
"^HAIL.*" "HAIL"
"^HEAVY.*RAIN.*" "HEAVY RAIN"')
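An alternative sketch, not from the answer above: if the patterns contain no commas, the table can be written as comma-separated text, which avoids quoting entirely and still lets the pairs line up in columns. The events vector and the loop that applies the replacements are illustrative assumptions; base sub() is used here in place of qdap::multigsub.

```r
# Comma-separated lookup table; strip.white removes the alignment padding
exprRepl <- read.table(header = TRUE, sep = ",", strip.white = TRUE,
                       stringsAsFactors = FALSE, text = "expr, repl
^FLASH.*FLOOD.*,   FLASH FLOOD
^HEAVY.*RAIN.*,    HEAVY RAIN
^HAIL.*,           HAIL
")

# Hypothetical event names to clean up
events <- c("FLASH FLOODING", "HEAVY RAINS", "HAIL 075", "TORNADO")

# Apply each pattern/replacement pair in turn
for (i in seq_len(nrow(exprRepl))) {
  events <- sub(exprRepl$expr[i], exprRepl$repl[i], events)
}
events
```

Note that a pattern containing a backslash would need it doubled inside the text string.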

how to return rows with a keyword within a string contained in a cell in r

I thought that this would be a simple one line of code, but the solution to my challenge is eluding me. I am betting that my limited experience with the domain of R programming might be the source.
Data Set
df <- structure(list(Key_MXZ = c(1731025L, 1731022L, 1731010L, 1730996L,
1722128L, 1722125L, 1722124L, 1722123L, 1722121L, 1722116L, 1722111L,
1722109L), Key_Event = c(1642965L, 1642962L, 1647418L, 1642936L,
1634904L, 1537090L, 1537090L, 1616520L, 1634897L, 1634892L, 1634887L,
1634885L), Number_Call = structure(c(11L, 9L, 10L, 12L, 1L, 3L,
2L, 4L, 5L, 6L, 8L, 7L), .Label = c("3004209178-2010-04468",
"3004209178-2010-04469", "3004209178-2010-04470", "3004209178-2010-04471",
"3004209178-2010-04472", "3004209178-2010-04475", "3004209178-2010-04477",
"3004209178-2010-04478", "3004209178-2010-04842", "3004209178-2010-04850",
"I wish to return this row with the header", "Maybe this row will work too"
), class = "factor")), .Names = c("Key_MXZ", "Key_Event", "Number_Call"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6", "7", "8", "9", "10", "11", "12"))
In the last column I have placed two strings among other data types that would be used to identify the rows for a new dataframe -- using the phrase "this row". The end result might look like:
Key_MXZ|Key_Event|Number_Call
1|1731025|1642965|I wish to return this row with the header
4|1730996|1642936|Maybe this row will work too
I have tried the following variations of code, and others not shown, to break through, with little success.
txt <- c("this row")
table1 <- df[grep(txt,df),]
table2 <- df[pmatch(txt,df),]
df[,3]<-is.logical(df[,3])
table3 <- subset(df,grep(txt,df[,3]))
Any ideas on this challenge?
Quite similar to DMT's answer. The code below uses the data.table approach, which is fast if you have millions of rows:
library(data.table)
setDT(df); setkey(df, Number_Call)
df[grep("this row", Number_Call, ignore.case = TRUE)]
Key_MXZ Key_Event Number_Call
1: 1731025 1642965 I wish to return this row with the header
2: 1730996 1642936 Maybe this row will work too
Here is an approach that uses qdap's Search function. It's a wrapper for agrep so it can do fuzzy matching and the degree of fuzziness can be set:
library(qdap)
Search(df, "this row", 3)
## Key_MXZ Key_Event Number_Call
## 1 1731025 1642965 I wish to return this row with the header
## 4 1730996 1642936 Maybe this row will work too
Go with:
df[grep("this row", df$Number_Call, fixed=TRUE),]
# Key_MXZ Key_Event Number_Call
#1 1731025 1642965 I wish to return this row with the header
#4 1730996 1642936 Maybe this row will work too
You just needed to reference the actual column you wanted grep to match against.
fixed=TRUE treats the pattern as a literal string rather than a regular expression, and grep returns the indices of the elements of the vector that match. If your match is a bit more nuanced, you can drop fixed=TRUE and replace "this row" with a regular expression.
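A closely related sketch uses grepl, which returns a logical vector instead of indices and therefore combines easily with other conditions. The three-row data frame is a shortened, made-up version of the one in the question.

```r
# Shortened version of the example data
df <- data.frame(
  Key_MXZ = c(1731025L, 1731022L, 1730996L),
  Number_Call = c("I wish to return this row with the header",
                  "3004209178-2010-04842",
                  "Maybe this row will work too"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose Number_Call contains the literal text
keep <- grepl("this row", df$Number_Call, fixed = TRUE)
hits <- df[keep, ]
hits
```

The original row names are kept in the result, so the matching rows can still be traced back to df.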

Unrecognized column name after loading data table with fread

I load a data table with fread the way I always do. The file has ~2M records and is tab delimited.
The load is successful. I can print the head of the table and the column names, so far so good.
But then either changing the name of the first column or setting it as a key fails, complaining that it cannot find the old column name. I am sure there is no typo in the column name and no leading or trailing space; I tried many times with copy/paste and retyping. I can apparently change the name of any other column.
The first column is long integer id's, so I had to load the bit64 package to get rid of a warning in 'fread', but it did not seem to help. Is it a clue?
Does anyone have any idea what could cause such a symptom? How to debug?
I use R 3.1.0 on Windows 64, latest version of all packages.
Edit: more details
The data load command:
txnData <- fread(txnInDataPathFileName, header=TRUE, sep="\t", na.strings="NA")
The column names:
colnames(txnData)
[1] "txn_ext_id" "txn_desc" "txn_type_id" "site_id" "date_id" "device_id" "cust_id"
[8] "empl_id" "txn_start_time" "txn_end_time" "total_sales" "total_units" "gross_margin"
The rename column that fails (and so does setkey):
setnames(txnData, "txn_ext_id", "txnId")
Error in setnames(txnData, "txn_ext_id", "txnId") :
Items of 'old' not found in column names: txn_ext_id
And finally the requested dput command:
dput(head(txnData))
structure(list(`txn_ext_id` = structure(c(4.88536962440272e-311,
1.10971996159584e-311, 9.9460266389845e-312, 1.0227644072435e-311,
1.10329710699982e-311, 1.01930594588518e-311), class = "integer64"),
txn_desc = c("checkout transaction", "checkout transaction",
"checkout transaction", "checkout transaction", "checkout transaction",
"checkout transaction"), txn_type_id = c(0L, 0L, 0L, 0L,
0L, 0L), site_id = c(982L, 982L, 982L, 982L, 982L, 982L),
date_id = c("2012-12-24", "2013-11-27", "2013-04-08", "2013-06-04",
"2013-11-14", "2013-05-28"), device_id = c(8L, 7L, 8L, 53L,
8L, 5L), cust_id = structure(c(2.02600292130833e-313, 2.02572944866119e-313,
2.02583815970388e-313, 2.02580527009968e-313, 2.02568405005593e-313,
2.02736582767668e-313), class = "integer64"), empl_id = c("?",
"?", "?", "?", "?", "?"), txn_start_time = c("2012-12-24T08:35:56",
"2013-11-27T12:43:30", "2013-04-08T11:48:29", "2013-06-04T15:27:47",
"2013-11-14T12:57:38", "2013-05-28T11:03:21"), txn_end_time = c("2012-12-24T08:38:00",
"2013-11-27T12:47:00", "2013-04-08T11:49:00", "2013-06-04T15:35:00",
"2013-11-14T13:00:00", "2013-05-28T11:05:00"), total_sales = c(48.86,
69.7, 8.53, 33.46, 39.19, 35.56), total_units = c(12L, 44L,
3L, 4L, 14L, 17L), gross_margin = c(0, 0, 0, 0, 0, 0)), .Names = c("txn_ext_id",
"txn_desc", "txn_type_id", "site_id", "date_id", "device_id",
"cust_id", "empl_id", "txn_start_time", "txn_end_time", "total_sales",
"total_units", "gross_margin"), class = c("data.table", "data.frame"
), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x00000000002c0788>)
The hidden character was the Byte Order Mark (BOM), displayed as "ï»¿" when you get a chance of seeing it. You can in principle see it in editors set to ANSI display mode, although I could not in Notepad++! In R, printing the head of the data table does show it in RStudio, but not in Eclipse StatET, which I use by default; that explains why I did not notice it immediately.
See the following links re. how to get rid of the BOM character: SO1, SO2, yiiframework.
I loaded my file in Notepad++, Encoding -> Convert to UTF-8 without BOM, saved, and this BOM character disappeared, all went fine.
A pure R solution to this problem, without touching the file, is to include the BOM character as a prefix of the old name in the rename command: setnames(dataTable, "ï»¿firstColumnName", "firstColumnName"). This worked in RStudio and I suppose it would work in the R console as well. However, it does not work in Eclipse-StatET, where the BOM character remains hidden while still breaking data table accesses: the first column is not accessible with or without the BOM prefix in the name, and setnames fails either way.
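A sketch of a way to drop the prefix without typing the invisible character at all: match the three BOM bytes (EF BB BF) at the start of each name and remove them. The helper name strip_bom is hypothetical, and the example uses a plain character vector of column names rather than a real fread result; it assumes the BOM bytes were kept unchanged in the first column name.

```r
# Hypothetical helper: remove a leading UTF-8 BOM, matching on raw bytes
strip_bom <- function(x) sub("^\xef\xbb\xbf", "", x, useBytes = TRUE)

# Column names as they might come back from a file that starts with a BOM
cols <- c("\xef\xbb\xbftxn_ext_id", "txn_desc", "txn_type_id")

clean <- strip_bom(cols)
clean
```

With a data.table, the cleaned names could then be applied with setnames(txnData, strip_bom(names(txnData))).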
