So I'm diving into yet another language (R), and need to be able to look at individual items in a dataframe(?). I've tried a number of ways to access this, but so far am confused by what R wants me to do to get this out. Current code:
empStatistics <- read.csv("C:/temp/empstats.csv", header = TRUE, row.names = NULL, encoding = "UTF-8", sep = ",", dec = ".", quote = "\"", comment.char = "")
attach(empStatistics)
library(svDialogs)
Search_Item <- dlgInput("Enter a Category", "")$res
if (!length(Search_Item)) {
cat("You didn't pick anything!?")
} else {
Category <- empStatistics[Search_Item]
}
Employee_Name <- dlgInput("Enter a Player", "")$res
if (!length(Employee_Name)) {
cat("No Person Selected!\n")
} else {
cat(empStatistics[Employee_Name, Search_Item])
}
and the sample of my csv file:
Name,Age,Salary,Department
Frank,25,40000,IT
Joe,24,40000,Sales
Mary,34,56000,HR
June,39,70000,CEO
Charles,60,120000,Janitor
From the languages I'm used to, I would have expected the brackets to work, but that obviously isn't the case here, so I tried looking for other solutions including separating each variable into its own brackets, trying to figure out how to use subset() (failed there, not sure it is applicable), tried to find out how to get the column and row indexes, and a few other things I'm not sure I can describe.
How can I enter values into variables, and then use that to get the individual pieces of data (ex, enter "Frank" for the name and "Age" for the search item and get back 25 or "June" for the name and "Department" for the search item to get back "CEO")?
If you would like to access it like that, you can do:
Search_Item <- "Salary"
Employee_Name <- "Frank"
empStatistics <- read.csv("empstats.csv",header = TRUE, row.names = 1)
empStatistics[Employee_Name,Search_Item]
[1] 40000
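A data.frame has no separate index column; rows are looked up by row name, which is why row.names = 1 above turns the Name column into the row names. The other thing you can try is to keep Name as a regular column and find the row with match():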
empStatistics <- read.csv("empstats.csv",header = TRUE)
empStatistics[match(Employee_Name,empStatistics$Name),Search_Item]
[1] 40000
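A minimal sketch of the match() approach wrapped in a function, assuming the sample data from the question. The data frame is built inline rather than read from the CSV, and lookup_value is a hypothetical helper name, not something from base R:

```r
# Sample data from the question, built inline instead of read from a file
empStatistics <- data.frame(
  Name = c("Frank", "Joe", "Mary", "June", "Charles"),
  Age = c(25, 24, 34, 39, 60),
  Salary = c(40000, 40000, 56000, 70000, 120000),
  Department = c("IT", "Sales", "HR", "CEO", "Janitor"),
  stringsAsFactors = FALSE
)

# lookup_value is a hypothetical helper: find the row by Name,
# then pick the requested column by its name
lookup_value <- function(df, name, item) {
  if (!item %in% names(df)) stop("Unknown category: ", item)
  row <- match(name, df$Name)
  if (is.na(row)) stop("Unknown name: ", name)
  df[row, item]
}

lookup_value(empStatistics, "Frank", "Age")
lookup_value(empStatistics, "June", "Department")
```

The two stop() checks are there because user input from a dialog can easily contain a typo; without them a mistyped name would quietly return NA.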
I have written a function that "cleans up" taxonomic data from NGS taxonomic files. The problem is that I am unable to replace NA cells with a string like "undefined". I know that it has something to do with variables being made into factors and not characters (Warning message: In `...` : invalid factor level, NA generated), however even when importing data with stringsAsFactors = FALSE I still get this error in some cells.
Here is how I import the data:
raw_data_1 <- taxon_import(read.delim("taxonomy_site_1/*/*/*/taxonomy.tsv", stringsAsFactors = FALSE))
The taxon_import function is used to split the taxa and assign variable names:
taxon_import <- function(data) {
data <- as.data.frame(str_split_fixed(data$Taxon, ";", 7))
colnames(data) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
return(data)
}
Now the following function is used to "clean" the data, and this is where I would like to replace certain strings with "Undefined". However, I keep getting the warning: In `[<-.factor`(`*tmp*`, thisvar, value = "Undefined") : invalid factor level, NA generated
Here follows the data_cleanup function:
data_cleanup <- function(data) {
strip_1 = list("D_0__", "D_1__", "D_2__", "D_3__", "D_4__", "D_5__", "D_6__")
for (i in strip_1) {
data <- as.data.frame(sapply(data, gsub, pattern = i, replacement = ""))
}
data[data==""] <- "Undefined"
strip_2 = list("__", "unidentified", "Ambiguous_taxa", "uncultured", "Unknown", "uncultured .*", "Unassigned .*", "wastewater Unassigned", "metagenome")
for (j in strip_2) {
data <- as.data.frame(sapply(data, gsub, pattern = j, replacement = "Undefined"))
}
return(data)
}
The function is simply applied like: test <- data_cleanup(raw_data_1)
I am appending the data from a cloud, since it is very lengthy data. Here is the link to a data file https://drive.google.com/open?id=1GBkV_sp3A0M6uvrx4gm9Woaan7QinNCn
I hope you will forgive my ignorance, however I tried many solutions before posting here.
We start by using the tidyverse library. Let me give a twist to your question, as it's about replacing NAs, but I think with this code you should avoid that problem.
As I read your code, you erase the strings "D_0__", "D_1__", ... from the observation strings. Then you replace the strings "Ambiguous_taxa", "unidentified", ... with the string "Undefined".
Based on your data, I replaced the loops with regular expressions, which makes it a little easier to clean your data:
library(tidyverse)
taxon_import <- function(data) {
data <- as.data.frame(str_split_fixed(data$Taxon, ";", 7))
colnames(data) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
return(data)
}
raw_data_1 <- taxon_import(read.delim("taxonomy.tsv", stringsAsFactors = FALSE))
raw_data_1 <- data.frame(lapply(raw_data_1,as.character),stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(raw_data_1,function(x) sub("^D_[0-6]__","",x)), stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(depured,function(x) sub("__|unidentified|Ambiguous_taxa|uncultured","Undefined",x)), stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(depured,function(x) sub("Unknown|uncultured\\s.\\*|Unassigned\\s.\\*","Undefined",x)), stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(depured,function(x) sub("wastewater\\sUnassigned|metagenome","Undefined",x)), stringsAsFactors = FALSE)
depured[depured ==""] <- "Undefined"
Let me explain my code. First, I have read on many websites that it's better to avoid explicit loops such as "for". So how do you replace text that starts with "D_0__"?
The answer is regex (regular expression). It seems complicated at first but with practice it'll be helpful. See this expression:
"^D_[0-6]__"
It means: match "D_" at the start of the string, followed by a digit between 0 and 6, followed by "__".
Aha. So you can use the function sub
sub("^D_[0-6]__","",string)
which reads: replace the first match of the regular expression in the string with the empty string "".
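A small sketch of what that call does, assuming a made-up vector taxa (the values are illustrative, not taken from the linked file):

```r
# Hypothetical sample values; only the first two start with a D_n__ prefix
taxa <- c("D_0__Bacteria", "D_1__Proteobacteria", "Archaea_D_2__x", "")

# "^" anchors the match to the start, so a prefix in the middle is left alone
stripped <- sub("^D_[0-6]__", "", taxa)
stripped
```

Because of the "^" anchor, the third element keeps its embedded "D_2__", and the empty string stays empty.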
Now you see another regex:
"__|unidentified|Ambiguous_taxa|uncultured"
It means: select the string "__" or "unidentified" or "Ambiguous_taxa" ...
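Be careful with this regex:
"Unknown|uncultured\\s.\\*|Unassigned\\s.\\*"
It matches "Unknown", or "uncultured" followed by a whitespace character, any single character and a literal asterisk, or the same for "Unassigned".
In a regex, whitespace is represented by \s and a literal asterisk by \*; inside an R string each backslash has to be doubled. If the original "uncultured .*" was meant as a regex (that is, "uncultured " followed by anything), the pattern would be "uncultured\\s.*" instead.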
Now what about the as.data.frame function? Every time I use it I have to set stringsAsFactors = FALSE, because otherwise the function converts character columns to factors (this was the default before R 4.0.0).
With this code no NAs are created.
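To illustrate why the factor conversion matters, here is a minimal sketch with a made-up three-element factor. Assigning a value that is not one of the factor's levels triggers the warning from the question, while the same assignment on a character vector just works:

```r
# Hypothetical column that was converted to a factor
f <- factor(c("Bacteria", "", "Archaea"))

# Capture the warning instead of printing it; "Undefined" is not a level of f
f_bad <- f
msg <- tryCatch({
  f_bad[f_bad == ""] <- "Undefined"
  "no warning"
}, warning = function(w) conditionMessage(w))
msg

# The same replacement on a character vector succeeds
s <- as.character(f)
s[s == ""] <- "Undefined"
s
```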
Hope it helps, please don't hesitate to ask if needed.
Regards,
Alexis
I'm trying to figure out why, when I run this code, all the information from the columns is being written to the first file only. What I want is for only the rows belonging to each unique MO number to be written out. I believe the problem is in the third line, but I am not sure how to divide the data by each unique number.
Thanks for the help,
for (i in 1:nrow(MOs_InterestDF1)) {
MO = MOs_InterestDF1[i,1]
df = MOs_Interest[MOs_Interest$MO_NUMBER == MO, c("ITEM_NUMBER", "OPER_NO", "OPER_DESC", "STDRUNHRS", "ACTRUNHRS","Difference", "Sum")]
submit.df <- data.frame(df)
filename = paste("Variance", "Report",MO, ".csv", sep="")
write.csv(submit.df, file = filename, row.names = FALSE)}
If you are trying to write out a separate csv for each unique MO number, then something like this may work to accomplish that.
unique.mos <- unique(MOs_Interest$MO_NUMBER)
for (mo in unique.mos){
submit.df <- MOs_Interest[MOs_Interest$MO_NUMBER == mo, c("ITEM_NUMBER", "OPER_NO", "OPER_DESC", "STDRUNHRS", "ACTRUNHRS","Difference", "Sum")]
filename <- paste("Variance", "Report", mo, ".csv", sep="")
write.csv(submit.df, file = filename, row.names = FALSE)
}
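The same idea can be sketched with split(), which cuts the data frame into one piece per MO number. This is only an illustration: the MOs_Interest data below is made up with fewer columns than the real one, and the files go to a temporary directory (out_dir is a hypothetical name):

```r
# Made-up stand-in for MOs_Interest with only a few of the real columns
MOs_Interest <- data.frame(
  MO_NUMBER = c(101, 101, 102, 103, 103),
  ITEM_NUMBER = c("A", "B", "C", "D", "E"),
  STDRUNHRS = c(1.0, 2.0, 1.5, 3.0, 0.5),
  stringsAsFactors = FALSE
)

# Write into a temporary directory so the sketch leaves no files behind
out_dir <- file.path(tempdir(), "variance_reports")
dir.create(out_dir, showWarnings = FALSE)

# split() returns a named list with one data frame per MO number
groups <- split(MOs_Interest, MOs_Interest$MO_NUMBER)
written <- character(0)
for (mo in names(groups)) {
  filename <- file.path(out_dir, paste0("VarianceReport", mo, ".csv"))
  write.csv(groups[[mo]], file = filename, row.names = FALSE)
  written <- c(written, filename)
}
basename(written)
```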
It's hard to answer fully without example data (what are the columns of MOs_InterestDF1?) but I think your issue is in the df line. Are you trying to subset the dataframe to only the data matching the MO? If so, try which as in df = MOs_Interest[which(MOs_Interest$MO_NUMBER == MO),].
I wasn't sure if you actually had two separate dfs (MOs_Interest and MOs_InterestDF1); if not, make sure the df line points to the correct data frame.
I tried to create some simplified sample data:
MOs_InterestDF1 <- data.frame("MO_NUMBER" = c(1,2,3), "Item_No" = c(142,423,214), "Desc" = c("Plate","Book","Table"))
for (i in 1:nrow(MOs_InterestDF1)) {
MO = MOs_InterestDF1[i,1]
mydf = data.frame(MOs_InterestDF1[which(MOs_InterestDF1$MO_NUMBER == MO),])
filename = paste("This is number ",MO,".csv", sep="")
write.csv(mydf, file = filename, row.names=FALSE)
}
This produced three different csv files, each with exactly one row of data. For example, "This is number 1.csv" had the following data:
MO_NUMBER Item_No Desc
1 142 Plate
Working in R, I'm trying to find an efficient way to search through a file of texts and remove or replace all instances of proper names (e.g., Thomas). I assume there is something available to do this, but I have been unable to locate it.
So, in this example the words "Susan" and "Bob" would be removed. This is a simplified example; in reality I would want this to apply to hundreds of documents and therefore a fairly large list of names.
texts <- as.data.frame (rbind (
'This text stuff if quite interesting',
'Where are all the names said Susan',
'Bob wondered what happened to all the proper nouns'
))
names(texts) [1] <- "text"
Here's one approach based upon a data set of first names:
install.packages("gender")
library(gender)
install_genderdata_package()
sets <- data(package = "genderdata")$results[,"Item"]
data(list = sets, package = "genderdata")
stopwords <- unique(kantrowitz$name)
texts <- as.data.frame (rbind (
'This text stuff if quite interesting',
'Where are all the names said Susan',
'Bob wondered what happened to all the proper nouns'
))
removeWords <- function(txt, words, n = 30000L) {
l <- cumsum(nchar(words)+c(0, rep(1, length(words)-1)))
groups <- cut(l, breaks = seq(1,ceiling(tail(l, 1)/n)*n+1, by = n))
regexes <- sapply(split(words, groups), function(words) sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), collapse = "|")))
for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE, ignore.case = TRUE)
return(txt)
}
removeWords(texts[,1], stopwords)
# [1] "This text stuff if quite interesting"
# [2] "Where are all the names said "
# [3] " wondered what happened to all the proper nouns"
It may need some tuning for your specific data set.
Another approach could be based upon part-of-speech tagging.
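For a short name list, the regex idea reduces to a single pattern. A minimal sketch, assuming a hand-written vector name_list (hypothetical, standing in for the large data set) and names that contain no regex metacharacters:

```r
texts <- c(
  "This text stuff if quite interesting",
  "Where are all the names said Susan",
  "Bob wondered what happened to all the proper nouns"
)

# Hypothetical short list standing in for a full first-name data set
name_list <- c("Susan", "Bob")

# One alternation wrapped in word boundaries: \b(Susan|Bob)\b
pattern <- paste0("\\b(", paste(name_list, collapse = "|"), ")\\b")
cleaned <- gsub(pattern, "", texts, perl = TRUE)

# Collapse the whitespace left behind by the removed names
cleaned <- trimws(gsub("\\s+", " ", cleaned))
cleaned
```

This version is case sensitive, so a lowercase "bob" used as a verb would be kept. That is a design choice for the sketch, not something the original answer does.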
I am trying to write an R script to get a better overview of my CSV bank data.
My goal is to group all my costs into different categories.
For instance, I want McDonalds and Burger King to go into “resturantsCosts”.
Food market costs from Kaisers, Lidl, and Rewe should go to “foodCompaniesCosts”.
Subscription costs from Vattenfall, Gasag, and Vodaphone should go to “subscriptionCosts”.
My difficulty right now is processing the information.
Here are some example entries from my CSV file:
"01554 MCDONALDS", "REWE251", "11379 BURGER KING ALEX BHF", "KAISERS TENGELMANN 82139*DE", "KAISERS TENGELMANN 82124*DE"
My idea was to split each entry into a list of words, remove all numbers, and convert all letters to lowercase.
For instance, “KAISERS TENGELMANN 82124*DE” would become:
"kaisers" "tengelmann" "*de"
My idea was then to match the result against different premade lists to see whether one of the words is in there. For example, the foodCompanies list contains the following words: "kaisers", "lidl", "rewe".
Because the foodCompanies list contains the word “kaisers” and the entry contains the word “kaisers”, there would be a match. However, I am having difficulties getting it to work.
Could somebody help me?
EDIT: The problem is not to read data. The problem is to process data. I can read all the companies and costs and they are stored in "company" and "costs". It is the following that doesn't work correctly:
temp <- tolower( trimws( gsub('[[:digit:]]+', '', company[i]) ) )
temp <- strsplit(temp, " ")
For instance, set the variable company to "KAISERS TENGELMANN 82139*DE". Then I get the following result:
"c(\"kaisers\", \"tengelmann\", \"*de\")"
Here is my full code:
mydata = read.csv2("mydata.csv", header = TRUE, sep = ";", quote = "\"",
dec = ",", fill = TRUE, comment.char = "")
company = mydata[[6]]
costs = mydata[[9]]
foodCompanies = c("kaisers", "lidl", "rewe")
resturants = c("burger king", "mcdonalds")
subscriptions = c("vattenfall", "gasag", "vodaphone")
foodCompaniesCosts = c()
resturantsCosts = c()
subscriptionCosts = c()
for (i in 1:length(company)){
temp <- tolower( trimws( gsub('[[:digit:]]+', '', company[i]) ) )
temp <- strsplit(temp, " ")
if(any ( temp %in% foodCompanies ) == TRUE) {foodCompaniesCosts <- c(foodCompaniesCosts, costs[i])
} else if(any ( temp %in% resturants ) == TRUE) {resturantsCosts <- c(resturantsCosts, costs[i])
} else if(any ( temp %in% subscriptions ) == TRUE) {subscriptionCosts <- c(subscriptionCosts, costs[i])
}
}
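strsplit() returns a list, so temp %in% foodCompanies compares the whole deparsed vector (the "c(\"kaisers\", ...)" string you saw) rather than the individual words. In your for loop, flatten temp to a character vector before your if statements begin. Specifically, add the line temp <- unlist(temp).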
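A minimal sketch of the categorisation step, assuming the keyword vectors from the question. Since a multi-word keyword such as "burger king" can never equal a single-word token, this sketch searches the cleaned entry for each keyword instead of splitting it. categorize is a hypothetical helper and the costs are made-up numbers:

```r
foodCompanies <- c("kaisers", "lidl", "rewe")
resturants <- c("burger king", "mcdonalds")
subscriptions <- c("vattenfall", "gasag", "vodaphone")

# categorize is a hypothetical helper: strip digits, lowercase,
# then look for each keyword as a plain substring of the entry
categorize <- function(entry) {
  cleaned <- tolower(trimws(gsub("[[:digit:]]+", "", entry)))
  contains_any <- function(keywords) {
    any(vapply(keywords, grepl, logical(1), x = cleaned, fixed = TRUE))
  }
  if (contains_any(foodCompanies)) {
    "foodCompanies"
  } else if (contains_any(resturants)) {
    "resturants"
  } else if (contains_any(subscriptions)) {
    "subscriptions"
  } else {
    NA_character_
  }
}

company <- c("01554 MCDONALDS", "REWE251", "11379 BURGER KING ALEX BHF",
             "KAISERS TENGELMANN 82139*DE")
costs <- c(7.5, 23.1, 9.9, 41.0)  # made-up amounts

category <- vapply(company, categorize, character(1), USE.NAMES = FALSE)
totals <- tapply(costs, category, sum)
totals
```

Summing with tapply() gives one total per category and avoids growing three separate vectors inside a loop.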
I am trying to use the following code to get the occupation information from George Clooney's Wikipedia page. Eventually I would like to have a loop that gets data on various personalities' occupations.
However, I get the following problem running the below code:
Error in if (symbol != "role") symbol = NULL : argument is of length zero
I am not sure why this keeps on coming up.
library(XML)
library(plyr)
url = 'http://en.wikipedia.org/wiki/George_Clooney'
# don't forget to parse the HTML, doh!
doc = htmlParse(url)
# get every link in a table cell:
links = getNodeSet(doc, '//table/tr/td')
# make a data.frame for each node with non-blank text, link, and 'title' attribute:
df = ldply(links, function(x) {
text = xmlValue(x)
if (text=='') text=NULL
symbol = xmlGetAttr(x, 'class')
if (symbol!='role') symbol=NULL
if(!is.null(text) & !is.null(symbol))
data.frame(symbol, text) } )
As @gsee mentioned, you need to check that symbol isn't NULL before you check its value. Here's a minor update to your code that works (at least for George).
df = ldply(
links,
function(x)
{
text = xmlValue(x)
if (!nzchar(text)) text = NULL
symbol = xmlGetAttr(x, 'class')
if (!is.null(symbol) && symbol != 'role') symbol = NULL
if (!is.null(text) && !is.null(symbol))
data.frame(symbol, text)
}
)
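The reason for the error can be reproduced without any HTML parsing. A minimal sketch, where is_role is a hypothetical helper and the NULL stands for what xmlGetAttr() returns when a node has no class attribute:

```r
# is_role is a hypothetical helper; && short-circuits, so the comparison
# is never evaluated when symbol is NULL
is_role <- function(symbol) {
  !is.null(symbol) && symbol == "role"
}

is_role(NULL)
is_role("role")
is_role("infobox")

# The unguarded comparison has length zero, which is what if() rejects
length(NULL != "role")
```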
Use col.names = my_column_names in kable(), where my_column_names is a character vector of the names you want. It worked for me!