Processing a CSV file in R

I am trying to write an R script to get a better overview of my CSV bank data.
My goal is to group all my costs into different categories.
For instance, I want McDonalds and Burger King to go into "resturantsCosts".
Food market costs from Kaisers, Lidl, and Rewe shall go into "foodCompaniesCosts".
Subscription costs from Vattenfall, Gasag, and Vodaphone shall go into "subscriptionCosts".
My difficulty right now is processing the information.
Here are some example entries from my CSV file:
"01554 MCDONALDS", "REWE251", "11379 BURGER KING ALEX BHF", "KAISERS TENGELMANN 82139*DE", "KAISERS TENGELMANN 82124*DE"
My idea was to split each entry into a list of words, remove all numbers, and convert all letters to lowercase.
For instance, "KAISERS TENGELMANN 82124*DE" would become:
"kaisers" "tengelmann" "*de"
My idea was then to match the result against different premade lists to see whether one of the words is in there. For example, the foodCompanies list contains the following words: "kaisers", "lidl", "rewe".
Because the foodCompanies list contains the word "kaisers" and the entry contains the word "kaisers", there would be a match. However, I am having difficulties getting it to work.
Could somebody help me?
EDIT: The problem is not reading the data; it is processing it. I can read all the companies and costs, and they are stored in "company" and "costs". It is the following that doesn't work correctly:
temp <- tolower( trimws( gsub('[[:digit:]]+', '', company[i]) ) )
temp <- strsplit(temp, " ")
For instance, set "KAISERS TENGELMANN 82139*DE" as the variable company. Then I get the following result:
"c(\"kaisers\", \"tengelmann\", \"*de\")"
Here is my full code:
mydata = read.csv2("mydata.csv", header = TRUE, sep = ";", quote = "\"",
                   dec = ",", fill = TRUE, comment.char = "")
company = mydata[[6]]
costs = mydata[[9]]
foodCompanies = c("kaisers", "lidl", "rewe")
resturants = c("burger king", "mcdonalds")
subscriptions = c("vattenfall", "gasag", "vodaphone")
foodCompaniesCosts = c()
resturantsCosts = c()
subscriptionCosts = c()
for (i in 1:length(company)) {
  temp <- tolower( trimws( gsub('[[:digit:]]+', '', company[i]) ) )
  temp <- strsplit(temp, " ")
  if (any(temp %in% foodCompanies)) {
    foodCompaniesCosts <- c(foodCompaniesCosts, costs[i])
  } else if (any(temp %in% resturants)) {
    resturantsCosts <- c(resturantsCosts, costs[i])
  } else if (any(temp %in% subscriptions)) {
    subscriptionCosts <- c(subscriptionCosts, costs[i])
  }
}

In your for loop, the problem is that strsplit() returns a list: temp becomes a one-element list holding the vector of words, and %in% deparses that element into the single string "c(\"kaisers\", \"tengelmann\", \"*de\")" (exactly the output you are seeing), which never matches anything. Extract the vector before your if statements begin, e.g. temp <- strsplit(temp, " ")[[1]], or equivalently temp <- unlist(strsplit(temp, " ")).
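A related pitfall, separate from the list issue: after splitting on spaces, a two-word name such as "burger king" becomes two tokens and can never equal a single list entry, so it will never match. A minimal sketch that sidesteps both problems by substring-matching the whole lowercased entry instead of splitting it (my alternative, reusing the question's variable names; not the approach above):

matchesCategory <- function(entry, keywords) {
  # TRUE if any keyword occurs anywhere in the lowercased entry
  any(sapply(keywords, grepl, x = tolower(entry), fixed = TRUE))
}

for (i in seq_along(company)) {
  if (matchesCategory(company[i], foodCompanies)) {
    foodCompaniesCosts <- c(foodCompaniesCosts, costs[i])
  } else if (matchesCategory(company[i], resturants)) {
    resturantsCosts <- c(resturantsCosts, costs[i])
  } else if (matchesCategory(company[i], subscriptions)) {
    subscriptionCosts <- c(subscriptionCosts, costs[i])
  }
}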

Related

Problems extracting metadata from NCBI in R

I am trying to extract some information (metadata) from GenBank using the R package "rentrez" and the example I found here: https://ajrominger.github.io/2018/05/21/gettingDNA.html. Specifically, for a particular group of organisms, I search for all records that have geographical coordinates and then want to extract data about the accession number, taxon, sequenced locus, country, lat_long, and collection date. As an output, I want a csv file with the data for each record in a separate row.

It seems that the code below can do the job, but at some point rows get muddled, with data from different records overlapping the neighbouring rows. For example, of the 157 records that rentrez retrieves from NCBI, 109 records in the file look like what I want to achieve, but the rest is a total mess. I would greatly appreciate any advice on how to fix the issue, because I am a total newbie with R and figuring out each step takes a lot of time.
setwd("C:/R-Works")
library('XML')
library('rentrez')

argasid <- entrez_search(db = "nuccore", term = "Argasidae[Organism] AND [lat]",
                         use_history = TRUE, retmax = 15000)
x <- entrez_fetch(db = "nuccore", id = argasid$ids, rettype = "native",
                  retmode = "xml", parsed = TRUE)
x <- xmlToList(x)

cleanEntrez <- function(x) {
  basePath <- 'Seq-entry_seq.Bioseq'
  c(
    genbank = as.character(x[paste(basePath,
                                   'Bioseq_id', 'Seq-id', 'Seq-id_genbank',
                                   'Textseq-id', 'Textseq-id_accession',
                                   sep = '.')]),
    taxon = as.character(x[paste(basePath,
                                 'Bioseq_descr', 'Seq-descr', 'Seqdesc',
                                 'Seqdesc_source', 'BioSource', 'BioSource_org',
                                 'Org-ref', 'Org-ref_taxname',
                                 sep = '.')]),
    bseqdesc_title = as.character(x[paste(basePath,
                                          'Bioseq_descr', 'Seq-descr', 'Seqdesc',
                                          'Seqdesc_title',
                                          sep = '.')]),
    lat_lon = as.character(x[grep('lat-lon', x) + 1]),
    geo_description = as.character(x[grep('country', x) + 1]),
    coll_date = as.character(x[grep('collection-date', x) + 1])
  )
}

getGenbankMeta <- function(ids) {
  allRec <- entrez_fetch(db = 'nuccore', id = ids,
                         rettype = 'native', retmode = 'xml',
                         parsed = TRUE)
  allRec <- xmlToList(allRec)[[1]]
  o <- lapply(allRec, function(x) {
    cleanEntrez(unlist(x))
  })
  temp <- array(unlist(o), dim = c(length(o[[1]]), length(ids)))
  seqVec <- temp[nrow(temp), ]
  seqDF <- as.data.frame(t(temp[-nrow(temp), ]))
  names(seqDF) <- names(o[[1]])[-nrow(temp)]
  return(list(seq = seqVec, data = seqDF))
}

write.csv(getGenbankMeta(argasid$ids), 'argasid_georef.csv')
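A hedged diagnosis (mine, not from the thread): positional lookups like grep('lat-lon', x) + 1 return character(0) for records that lack that field, so cleanEntrez() produces vectors of different lengths for different records, and array(unlist(o), ...) then silently shifts values between columns. A quick check along those lines, run inside getGenbankMeta() after o is built:

# Records whose cleaned vector is not the expected length are the
# ones that misalign the array columns
lens <- sapply(o, length)
table(lens)

# Hypothetical helper: pad a missing field to NA so every record
# keeps the same length
first_or_na <- function(v) if (length(v) == 0) NA_character_ else v[1]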

How to access a single item in an R data frame?

So I'm diving into yet another language (R), and need to be able to look at individual items in a data frame. I've tried a number of ways to access this, but so far am confused by what R wants me to do to get this out. Current code:
empStatistics <- read.csv("C:/temp/empstats.csv", header = TRUE, row.names = NULL,
                          encoding = "UTF-8", sep = ",", dec = ".", quote = "\"",
                          comment.char = "")
attach(empStatistics)
library(svDialogs)

Search_Item <- dlgInput("Enter a Category", "")$res
if (!length(Search_Item)) {
  cat("You didn't pick anything!?")
} else {
  Category <- empStatistics[Search_Item]
}

Employee_Name <- dlgInput("Enter a Player", "")$res
if (!length(Employee_Name)) {
  cat("No Person Selected!\n")
} else {
  cat(empStatistics[Employee_Name, Search_Item])
}
and the sample of my csv file:
Name,Age,Salary,Department
Frank,25,40000,IT
Joe,24,40000,Sales
Mary,34,56000,HR
June,39,70000,CEO
Charles,60,120000,Janitor
From the languages I'm used to, I would have expected the brackets to work, but that obviously isn't the case here, so I tried looking for other solutions, including separating each variable into its own brackets, trying to figure out how to use subset() (failed there; not sure it is applicable), trying to find the column and row indexes, and a few other things I'm not sure I can describe.
How can I enter values into variables and then use them to get individual pieces of data (e.g., enter "Frank" for the name and "Age" for the search item and get back 25, or "June" for the name and "Department" for the search item and get back "CEO")?
If you would like to access it like that, you can do:
Search_Item <- "Salary"
Employee_Name <- "Frank"
empStatistics <- read.csv("empstats.csv", header = TRUE, row.names = 1)
empStatistics[Employee_Name, Search_Item]
# [1] 40000
R's data.frame doesn't keep an index on column values; unless you read the names in as row names (row.names = 1 above), you have to look the row up yourself. The other thing you can try is:
empStatistics <- read.csv("empstats.csv", header = TRUE)
empStatistics[match(Employee_Name, empStatistics$Name), Search_Item]
# [1] 40000
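One caveat worth adding (my note, not part of the original answer): match() returns NA when the name isn't found, and indexing with NA gives an unhelpful NA result, so a small guard makes the dialog-driven version friendlier:

idx <- match(Employee_Name, empStatistics$Name)
if (is.na(idx)) {
  cat("No employee named", Employee_Name, "\n")
} else {
  print(empStatistics[idx, Search_Item])
}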

Need help writing data from a table in R for unique values using a loop

Trying to figure out why, when I run this code, all the information from the columns is being written to the first file only. What I want is for only the data unique to each MO number to be written out. I believe the problem is in the third line, but I am not sure how to divide the data by each unique number.
Thanks for the help,
for (i in 1:nrow(MOs_InterestDF1)) {
  MO = MOs_InterestDF1[i, 1]
  df = MOs_Interest[MOs_Interest$MO_NUMBER == MO,
                    c("ITEM_NUMBER", "OPER_NO", "OPER_DESC", "STDRUNHRS",
                      "ACTRUNHRS", "Difference", "Sum")]
  submit.df <- data.frame(df)
  filename = paste("Variance", "Report", MO, ".csv", sep = "")
  write.csv(submit.df, file = filename, row.names = FALSE)
}
If you are trying to write out a separate csv for each unique MO number, then something like this may work to accomplish that.
unique.mos <- unique(MOs_Interest$MO_NUMBER)
for (mo in unique.mos) {
  submit.df <- MOs_Interest[MOs_Interest$MO_NUMBER == mo,
                            c("ITEM_NUMBER", "OPER_NO", "OPER_DESC", "STDRUNHRS",
                              "ACTRUNHRS", "Difference", "Sum")]
  filename <- paste("Variance", "Report", mo, ".csv", sep = "")
  write.csv(submit.df, file = filename, row.names = FALSE)
}
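An equivalent loop-free sketch using split() (my variant, not from either answer; column names as in the question):

cols <- c("ITEM_NUMBER", "OPER_NO", "OPER_DESC", "STDRUNHRS",
          "ACTRUNHRS", "Difference", "Sum")
# Split once by MO number, then write each piece to its own file
by_mo <- split(MOs_Interest[, cols], MOs_Interest$MO_NUMBER)
for (mo in names(by_mo)) {
  write.csv(by_mo[[mo]], paste0("VarianceReport", mo, ".csv"), row.names = FALSE)
}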
It's hard to answer fully without example data (what are the columns of MOs_InterestDF1?) but I think your issue is in the df line. Are you trying to subset the dataframe to only the data matching the MO? If so, try which as in df = MOs_Interest[which(MOs_Interest$MO_NUMBER == MO),].
I wasn't sure if you actually had two separate dfs (MOs_Interest and MOs_InterestDF1); if not, make sure the df line points to the correct data frame.
I tried to create some simplified sample data:
MOs_InterestDF1 <- data.frame("MO_NUMBER" = c(1, 2, 3),
                              "Item_No" = c(142, 423, 214),
                              "Desc" = c("Plate", "Book", "Table"))
for (i in 1:nrow(MOs_InterestDF1)) {
  MO = MOs_InterestDF1[i, 1]
  mydf = data.frame(MOs_InterestDF1[which(MOs_InterestDF1$MO_NUMBER == MO), ])
  filename = paste("This is number ", MO, ".csv", sep = "")
  write.csv(mydf, file = filename, row.names = FALSE)
}
This outputs three different csv files, each with exactly one row of data. For example, "This is number 1.csv" had the following data:
MO_NUMBER Item_No Desc
1 142 Plate

Efficient way to remove all proper names from corpus

Working in R, I'm trying to find an efficient way to search through a file of texts and remove or replace all instances of proper names (e.g., Thomas). I assume there is something available to do this but have been unable to locate it.
So, in this example the words "Susan" and "Bob" would be removed. This is a simplified example; in reality I would want this to apply to hundreds of documents, and therefore a fairly large list of names.
texts <- as.data.frame(rbind(
  'This text stuff if quite interesting',
  'Where are all the names said Susan',
  'Bob wondered what happened to all the proper nouns'
))
names(texts)[1] <- "text"
Here's one approach, based upon a data set of first names:
install.packages("gender")
library(gender)
install_genderdata_package()
sets <- data(package = "genderdata")$results[,"Item"]
data(list = sets, package = "genderdata")
stopwords <- unique(kantrowitz$name)
texts <- as.data.frame(rbind(
  'This text stuff if quite interesting',
  'Where are all the names said Susan',
  'Bob wondered what happened to all the proper nouns'
))

removeWords <- function(txt, words, n = 30000L) {
  l <- cumsum(nchar(words) + c(0, rep(1, length(words) - 1)))
  groups <- cut(l, breaks = seq(1, ceiling(tail(l, 1) / n) * n + 1, by = n))
  regexes <- sapply(split(words, groups),
                    function(words) sprintf("(*UCP)\\b(%s)\\b",
                                            paste(sort(words, decreasing = TRUE),
                                                  collapse = "|")))
  for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE, ignore.case = TRUE)
  return(txt)
}
removeWords(texts[,1], stopwords)
# [1] "This text stuff if quite interesting"
# [2] "Where are all the names said "
# [3] " wondered what happened to all the proper nouns"
It may need some tuning for your specific data set.
Another approach could be based upon part-of-speech tagging.
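A minimal sketch of that part-of-speech route, assuming the udpipe package (my choice of tagger; the answer doesn't name one): tokens tagged PROPN are dropped and the rest are stitched back together.

library(udpipe)
# One-time download of an English model, then load it
m <- udpipe_download_model(language = "english")
model <- udpipe_load_model(m$file_model)

tagged <- as.data.frame(udpipe_annotate(model, x = as.character(texts[, 1])))
# Keep everything except proper nouns (upos == "PROPN"), then rebuild each text
kept <- subset(tagged, upos != "PROPN")
tapply(kept$token, kept$doc_id, paste, collapse = " ")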

R code hangs in between with large data?

I am dealing with a database with around 500,000+ records (5 lakh). I want to count the words in the data.
This is my code
library(tm)
library(RPostgreSQL)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, user = "postgres", password = "root",
                 dbname = "pharma", host = "localhost", port = 5432)
query <- "select data->'PubmedArticleSet'->'PubmedArticle'->'MedlineCitation'->'Article'->'Journal'->>'Title' from searchresult where id BETWEEN 1 AND (select max(id) from searchresult)"
der <- dbGetQuery(con, query)

der <- VectorSource(der)
der <- Corpus(der)
der <- tolower(der)
wordlist <- strsplit(der, "\\W+", perl = TRUE)
wordvector <- unlist(wordlist)
freqlist <- table(wordvector)
sortedfreqlist <- sort(freqlist, decreasing = TRUE)
sortedtable <- paste(names(sortedfreqlist), sortedfreqlist, sep = "\t")
cat("Word\tFrequency", sortedtable, file = choose.files(), sep = "\n")
But the code hangs and stops at wordlist <- strsplit(der, "\\W+", perl = TRUE). Can someone please help me with this?
Is this because of the huge data?
Try replacing
wordlist<-strsplit(der, "\\W+", perl=TRUE)
with
word_vector = scan(text = as.character(der[1]),
                   what = "character", quote = "", quiet = TRUE)
sorted_word_table = sort(table(word_vector))
There are a few funny things going on in your code (i.e., you make a Corpus and then call tolower() on the whole thing, which turns it into a character vector), but this should get you going.
The first bit splits your text up into words. You might also want to remove punctuation before you do this, though, using der = removePunctuation(der[1]). The second bit makes a table of the word frequencies.
If the second bit is slow, you could use the data.table package and the following function (based on this answer) instead of calling table():
t_dt <- function(x, key = TRUE) {
  # creates a 1-d frequency table for x
  library(data.table)
  dt <- data.table(x)
  if (key) setkey(dt, x)
  tab <- dt[, list(freq = .N), by = x]
  out <- tab$freq
  names(out) <- tab$x
  out
}
sorted_word_table = sort(t_dt(word_vector))
