I have a data set that shows engineer visits and the type of job attended.
Engineer's Visit Table:
OrderNum  Engineer  Job Type  Date
1         John      Install   01/04/15
2         Phil      Remove    02/04/15
3         George    Install   01/04/15
4         George    Replace   02/04/15
5         George    Replace   03/04/15
6         John      Install   01/04/15
7         John      Install   01/04/15
8         John      Replace   02/04/15
9         John      Remove    02/04/15
For the example table above, I would like to show the following for each engineer (using John as an example):
His predominant job type was "Install";
The total number of jobs he attended was 5;
He worked for 3 days;
Meaning he attended 1.67 jobs per day.
I was attempting to add this to the load script using various additional columns, but I'm having trouble getting an aggr/count statement to work.
Is this a reasonable approach or am I going about it the wrong way?
Thanks.
You definitely don't want to do it in the script, because then you would have to guess every combination of selections your users might make beforehand and create aggregations for each case. In the front end it is fairly trivial, except for the first measure. To illustrate the problem I added two more orders for Phil (an Install and a Replace) so that he has one of each.
Here is the first draft I made:
Now the problem is that '-' for Phil. The mode() function is working as designed there but I bet nobody wants to see that the job they perform most often is nothing.
I tried a few things but this is as close as I got to something useful:
The expression I used is
`if(isnull(mode([Job Type])),concat(DISTINCT [Job Type],','),mode([Job Type]))`
but it also isn't as good as it could be: the engineers with no clear mode now just get a list of every job type they've done, rather than only the job types tied for most frequent (at least it now looks like they are working). I am, however, stumped as to how to get it to do exactly what I want.
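To spell out the tie handling I'm after, here is the logic sketched in R rather than as a front-end expression (the counts are made up so that one engineer has a two-way tie): for each engineer, keep every job type whose count equals the maximum count.

# made-up data: John clearly does Install most; Phil is tied between Install and Replace
jobs <- data.frame(
  Engineer = c("John","John","John","John","John","Phil","Phil","Phil","Phil","Phil"),
  JobType  = c("Install","Install","Install","Replace","Remove",
               "Install","Install","Replace","Replace","Remove")
)
joint_modes <- function(x) {
  counts <- table(x)
  paste(names(counts)[counts == max(counts)], collapse = ",")   # all values tied for the max
}
tapply(jobs$JobType, jobs$Engineer, joint_modes)
# John -> "Install", Phil -> "Install,Replace"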
I am trying to run sentiment analysis in R using the "sentimentr" package. I fed in a list of comments and in the output got element_id, sentence_id, word_count, and sentiment. Comments with long phrases are getting split into single sentences. I want to know the logic the package uses to do that.
I have 4 main categories for my comments: Food, Atmosphere, Price and Service. I have also set up bigrams for those themes, and I am trying to split sentences based on themes.
install.packages("sentimentr")
library(sentimentr)

# read in the comments and score sentiment sentence by sentence
data <- read.csv("Comments.csv")
data_new <- as.matrix(data)
scores <- sentiment(data_new)
# scores
write.csv(scores, "results.csv")
For example, "We had a large party of about 25, so some issues were understandable. But the servers seemed totally overwhelmed. There are so many issues I cannot even begin to explain. Simply stated food took over an hour to be served, it was overcooked when it arrived, my son had a steak that was charred, manager came to table said they were now out of steak, I could go on and on. We were very disappointed" got split up into 5 sentences:
1) We had a large party of about 25, so some issues were understandable
2) But the servers seemed totally overwhelmed.
3) There are so many issues I cannot even begin to explain.
4) Simply stated food took over an hour to be served, it was overcooked when it arrived, my son had a steak that was charred, manager came to table said they were now out of steak, I could go on and on.
5) We were very disappointed
I want to know whether there is any semantic logic behind the splitting, or whether it's just based on full stops.
It uses textshape::split_sentence(); see https://github.com/trinker/sentimentr/blob/e70f218602b7ba0a3f9226fb0781e9dae28ae3bf/R/get_sentences.R#L32
A bit of searching found the logic is here:
https://github.com/trinker/textshape/blob/13308ed9eb1c31709294e0c2cbdb22cc2cac93ac/R/split_sentence.R#L148
I.e. yes, it is splitting on ?, . and !, but then it uses a bunch of regexes to look for exceptions, such as "No.7" and "Philip K. Dick".
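A quick way to see that behaviour directly (assuming sentimentr is installed; the sentence below is made up):

library(sentimentr)

txt <- "Dr. Smith arrived at No. 7 around 5 p.m. The food was cold! Was it worth it?"
get_sentences(txt)
# "Dr.", "No. 7" and "p.m." should be protected by the exception regexes,
# so only the real ., ! and ? boundaries produce new sentences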
I'm looking to perform the equivalent of a COUNTIF on a data set similar to the one below. I found something similar here, but I'm not sure how to translate it into Enterprise Guide. I would like to create several new columns that count how many date occurrences there are for each primary key by year, so for example I would like this:
PrimKey Date
1 5/4/2014
2 3/1/2013
1 10/1/2014
3 9/10/2014
To be this:
PrimKey 2014 2013
1 2 0
2 0 1
3 1 0
I was hoping to use the advanced expression for calculated fields option in Query Builder, but if there is a better way I am completely open.
Here is what I tried (and failed):
CASE
WHEN Date(t1.DATE) BETWEEN Date(1/1/2014) and Date(12/31/2014)
THEN (COUNT(t1.DATE))
END
But that ended up just counting the total date occurrences without regard to my between statement.
Assuming you're using Query Builder you can use something like the following:
I don't think you need the CASE statement; instead, use the YEAR() function to calculate the year and test whether it's equal to 2014/2013. The test for equality returns 1/0, which can be summed to get the total per group. Make sure to include PrimKey in the GROUP BY section of Query Builder.
sum(year(t1.date)=2014) as Y2014,
sum(year(t1.date)=2013) as Y2013,
I don't like this type of solution because it's not dynamic, i.e. if your years change you have to change your code, and there's nothing in the code to return an error if that happens either. A better solution is to do a Summary Task by Year/PrimKey and then use a Transpose Task to get the data into the structure you want.
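For comparison, here is the same count-per-PrimKey-per-year pivot sketched in R (illustration of the dynamic idea only; in Enterprise Guide the equivalent is the Summary and Transpose tasks mentioned above):

df <- data.frame(
  PrimKey = c(1, 2, 1, 3),
  Date    = as.Date(c("2014-05-04", "2013-03-01", "2014-10-01", "2014-09-10"))
)
# cross-tabulate key by year; any new year simply becomes a new column
table(df$PrimKey, format(df$Date, "%Y"))
#     2013 2014
#   1    0    2
#   2    1    0
#   3    0    1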
I have a huge list of text files (50,000+) that contain normal sentences. Some of these sentences have words that have merged together because some of the line endings have been joined together. How do I go about unmerging these words in R?
The only suggestion I could find was here, and I kind of attempted something from here, but both suggestions require big matrices, which I can't use because I either run out of memory or RStudio crashes :( Can someone help please? Here's an example of a text file I'm using (there are 50,000+ more where this came from):
Mad cow disease, BSE, or bovine spongiform encephalopathy, has cost the country dear.
More than 170,000 cattle in England, Scotland and Wales have contracted BSE since 1988.
More than a million unwanted calves have been slaughtered, and more than two and a quarter million older cattle killed, their remains dumped in case they might be harbouring the infection.
In May, one of the biggest cattle markets, at Banbury in Oxfordshire, closed down. Avictim at least in part, of this bizarre crisis.
The total cost of BSE to the taxpayer is set to top £4 billion.
EDIT: for example:
"It had been cushioned by subsidies, living in an unreal world. Many farmers didn't think aboutwhat happened beyond the farm gate, because there were always people willing to buy what they produced."
See the 'aboutwhat' part. That happens in about 1 in every 100 or so articles (not this actual article; I just made the above up as an example). The words have been joined together somehow (I think when I read in some articles, some of them have missing spaces, or my notepad reader joins the end of one line onto the next).
EDIT 2: here's the error I get when I use a variation of what they have here, replacing the created lists with read-in lists:
Error: assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 627
I've never seen that error before. It does come up here and here, but there's no solution to it in either place :(
Based on your comments, I'd use an environment which is basically a hashtable in R. Start by building a hash of all known words:
words <- new.env(hash=TRUE)
for (w in c("hello","world","this","is","a","test")) words[[tolower(w)]] <- TRUE
(you'd actually want to use the contents of /usr/share/dict/words or similar; see the sketch after the example below), then we define a function that does what you described:
dosplit <- function (w) {
  # only try to split words that aren't already in the dictionary
  if (is.null(words[[tolower(w)]])) {
    n <- nchar(w)
    # try every split point, left to right
    for (i in 1:(n-1)) {
      a <- substr(w, 1, i)
      b <- substr(w, i+1, n)
      # accept the first split where both halves are known words
      if (!is.null(words[[tolower(a)]]) && !is.null(words[[tolower(b)]]))
        return(c(a, b))
    }
  }
  # otherwise return the word unchanged
  w
}
then we can test it:
test <- 'hello world, this isa test'
ll <- lapply(strsplit(test,'[ \t]')[[1]], dosplit)
and if you want it back as a single space-separated string:
do.call(paste, as.list(unlist(ll,use.names=FALSE)))
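As mentioned above, for real text you would fill the environment from a proper word list instead of the toy vector; a minimal sketch, assuming a /usr/share/dict/words style file is available on your system:

dict <- readLines("/usr/share/dict/words")   # one word per line
words <- new.env(hash=TRUE)
for (w in dict) words[[tolower(w)]] <- TRUE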
Note that this is going to be slow for large amounts of text; R isn't really built for this sort of thing. I'd personally use Python for this sort of task, and a compiled language if it got much larger.
I was wondering what the most efficient way is to get the available articles for a given NNTP group. The method I have implemented works as follows:
(i) Select the group:
GROUP group.name.subname
(ii) Get a list of article numbers from the group (pushed back into a vector 'codes'):
LISTGROUP
(iii) Loop over codes and grab articles (e.g. headers)
for code in codes do
HEAD code
end
However, this doesn't scale well with large groups with many article codes.
In RFC 3977, the GROUP command is indicated as also returning the 'low' and 'high' article numbers. For example,
[C] GROUP misc.test
[S] 211 1234 3000234 3002322 misc.test
where 3000234 and 3002322 are the low and high numbers. I'm therefore thinking of using these rather than initially pushing back all article codes. But can these numbers be relied upon? Is 3000234 definitely the first article id in the above-selected group, and is 3002322 definitely the last, or are they just estimates?
Many thanks,
Ben
It turns out I was thinking about this all wrong. All I need to do is
(i) set the group using GROUP
(ii) execute the NEXT command followed by HEAD for however many headers I want (up to count):
for c : count do
    articleId <-- NEXT
    HEAD articleId
end
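In the notation used earlier, the exchange looks roughly like this (article numbers and message-ids are made up; 223 and 221 are the response codes RFC 3977 defines for NEXT and HEAD):
[C] GROUP misc.test
[S] 211 1234 3000234 3002322 misc.test
[C] NEXT
[S] 223 3000237 <45223423@example.com> retrieved
[C] HEAD 3000237
[S] 221 3000237 <45223423@example.com>
[S] (header lines follow, terminated by a line containing a single ".")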
EDIT: I'm sure there must be a better way but until anyone suggests otherwise I'll assume this way to be the most effective. Cheers.
I have a dataset with individuals' names, addresses, phone numbers, etc. Some individuals appear multiple times, with slightly varying names and/or addresses and/or phone numbers. A snippet of the fake data is shown below:
first last address phone
Jimmy Bamboo P.O. Box 1190 xxx-xx-xx00
Jimmy W. Bamboo P.O. Box 1190 xxx-xx-xx22
James West Bamboo P.O. Box 219 xxx-66-xxxx
... and so on. Sometimes E. is spelled out as East and St. as Street; at other times they are not.
What I need to do is run through almost 120,000 rows of data to identify each unique individual based on their names, addresses, and phone numbers. Does anyone have a clue as to how this might be done without manually running through each record, one at a time? The more I stare at it, the more I think it's impossible without making some judgment calls, i.e. saying that if at least two or three fields match, treat the records as the same individual.
thanks!!
Ani
As I mentioned in the comments, this is not trivial. You have to decide the trade-off of programmer time/solution complexity with results. You will not achieve 100% results. You can only approach it, and the time and complexity cost will increase the closer to 100% you get. Start with an easy solution (exact matches), and see what issue most commonly causes the missed matches. Implement a fuzzy solution to address that. Rinse and repeat.
There are several tools you can use (we use them all).
1) distance matching, like Damerau-Levenshtein. You can use this for names, addresses and other things. It handles errors like transpositions, minor spelling mistakes, omitted characters, etc.
2) phonetic word matching - Soundex is not good; there are other, more advanced ones. We ended up writing our own to handle the mix of ethnicities we commonly encounter.
3) nickname lookups - many nicknames will not get caught by either phonetic or distance matching, names like Fanny for Frances. There are many nicknames like that. You can build a lookup of nicknames to regular names (a toy sketch follows below). Consider, though, variations like Jennifer -> Jen, Jenny, Jennie, Jenee, etc.
Names can be tough. Creative spelling of names seems to be a current fad. For instance, our database has over 30 spelling variations of the name Kaitlynn, and they are all spellings of actual names. This makes nickname matching tough when you're trying to match Katy to any of those.
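A toy version of such a nickname lookup in R (the pairs are just examples, not a real nickname table):

nicknames <- c(fanny = "Frances", jen = "Jennifer", jenny = "Jennifer",
               jennie = "Jennifer", katy = "Kaitlynn")
to_canonical <- function(n) {
  hit <- nicknames[tolower(n)]
  unname(ifelse(is.na(hit), n, hit))   # fall back to the original name when there is no match
}
to_canonical(c("Fanny", "Katy", "George"))
# "Frances" "Kaitlynn" "George"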
Here are some other answers I've given on similar topics here on Stack Overflow:
Processing of mongolian names
How to solve Dilemma of storing human names in MySQL and keep both discriminability and a search for similar names?
MySQL Mixing Damerau–Levenshtein Fuzzy with Like Wildcard
You can calculate the pairwise matrix of Levenshtein distances.
See this recent post for more info: http://www.markvanderloo.eu/yaRb/2013/02/26/the-stringdist-package/
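A minimal base-R example of that pairwise matrix on the sample names above (adist() computes Levenshtein distances; the stringdist package linked to offers more methods, such as Damerau-Levenshtein):

people <- c("Jimmy Bamboo", "Jimmy W. Bamboo", "James West Bamboo")
d <- adist(people)                    # pairwise edit distances
dimnames(d) <- list(people, people)
d
# small distances relative to the string lengths flag likely duplicates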