I'm a Pig Latin beginner and I need to accomplish a task that identifies troll posts. A troll post is scored by the ratio #likes/#replies of the post, so for each post it is necessary to determine (1) its direct replies and (2), recursively, all replies to each of those replies.
The task states that either Map Reduce, Pig Latin or Hive can be used. Since I do not know how to achieve recursion in pure Pig Latin, I solved it using embedded Pig, where I use Java for the recursion part. So my question is: is it possible at all to realize such a recursive task using only Pig Latin? And if so, can anyone show me a really small example applying recursion?
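For reference, here is a minimal sketch of the embedded-Pig pattern I mean: a Java driver that re-registers a JOIN in a loop, so each pass walks one reply level deeper. The file name, aliases and depth bound are hypothetical, and it unrolls the hierarchy to a fixed depth rather than truly recursing:
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class ReplyClosureSketch {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL); // or ExecType.MAPREDUCE
        pig.registerQuery("edges = LOAD 'sn.nt' AS (s:chararray, p:chararray, o:chararray);");
        pig.registerQuery("replies = FILTER edges BY p == 'sioc:container_of';");
        // the closure starts as the direct replies; each pass adds one more level
        pig.registerQuery("closure = FOREACH replies GENERATE s AS post, o AS reply;");
        int maxDepth = 5; // assumption: the reply hierarchy is no deeper than this
        for (int depth = 1; depth < maxDepth; depth++) {
            pig.registerQuery("step = JOIN closure BY reply, replies BY s;");
            pig.registerQuery("deeper = FOREACH step GENERATE closure::post AS post, replies::o AS reply;");
            pig.registerQuery("unioned = UNION closure, deeper;");
            pig.registerQuery("closure = DISTINCT unioned;");
        }
        pig.store("closure", "post_reply_closure"); // (post, reply) pairs at all depths
    }
}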
The test input is a small static social network containing users with posts, likes and so on. Every line is a subject-predicate-object triple, e.g. Bill-likes-anyPost. Each post can be in such a relation with a reply, and that reply can in turn be in a relation with another reply (meaning someone replied to the reply). My Pig Latin code below tries to output the posts with their ratio; the problem is that I do not use recursion to collect all the replies of each reply.
REGISTER RDFStorage.jar ;
indata = LOAD '$input_file' USING RDFStorage() AS
(s:chararray,p:chararray,o:chararray) ;
likes = FILTER indata BY p == 'sib:like';
likes_group = GROUP likes BY o;
likes_grouped_count = FOREACH likes_group GENERATE group AS object,
COUNT(likes) AS amount;
comments = FILTER indata BY STARTSWITH(s,'sibpo:') AND p == 'sioc:container_of';
comments_grouped = GROUP comments BY s;
comments_grouped_count = FOREACH comments_grouped GENERATE group AS subject,
COUNT(comments) AS amount;
--GET creation date for all posts
posts_dates_t = FILTER indata BY STARTSWITH(s,'sibpo:') AND p == 'dc:created';
posts_dates = FOREACH posts_dates_t GENERATE s, p, REGEX_EXTRACT(o, '\\"([0-9]{4}-[0-9]{2}-[0-9]{2})T', 1) AS o;
--Get creation date for all comments
comments_dates_t = FILTER indata BY STARTSWITH(s,'sibc:') AND p == 'dc:created';
comments_dates = FOREACH comments_dates_t GENERATE s, p, REGEX_EXTRACT(o, '\\"([0-9]{4}-[0-9]{2}-[0-9]{2})T', 1) AS o;
--Associate each comment to its corresponding post that has a creation date
posts_comments = JOIN posts_dates BY s, comments BY s;
--Join to get creation dates for all comments
posts_comments_with_dates = JOIN posts_comments BY comments::o, comments_dates BY s;
--calculate the days between a post and each one of its comments
all_dates = FOREACH posts_comments_with_dates GENERATE $0 AS post, ABS(DaysBetween(ToDate($2), ToDate($8))) AS lifetime_1;
--GROUP by post
all_dates_grouped = GROUP all_dates BY post;
--Get the life time of a post, which is the maximum difference between a post and its comments
posts_lifetime = FOREACH all_dates_grouped GENERATE group AS post, MAX(all_dates.lifetime_1) AS lifetime_2;
combined = JOIN likes_grouped_count BY object, comments_grouped_count BY subject;
combined_dates= JOIN combined BY comments_grouped_count::subject, posts_lifetime BY post;
combined_ratio = FOREACH combined_dates GENERATE likes_grouped_count::object AS post, (float)likes_grouped_count::amount/(float)comments_grouped_count::amount AS ratio, posts_lifetime::lifetime_2 AS lifetime;
--Sort posts by ratio first, then by lifetime, ascending
combined_ratio_sorted = ORDER combined_ratio BY ratio ASC, lifetime ASC;
outdata = LIMIT combined_ratio_sorted 50;
STORE outdata INTO '$output_file' USING PigStorage(',') ;
Thanks for reading and for your time.
This is a stripped-down version of my code (a server-side script), but what it should be doing is:
find records where the "uniqueid" field is equal to matchid
return 0 if there are fewer than two such records
print the region of each record if there are two or more
return the number of records
function copyFile(matchid){
var fileName = getProp('projectName')+" "+row[0];
var query = app.models.Files.newQuery();
query.filters.uniqueid._equals = matchid;
records = query.run();
var len = records.length;
if (len < 2) return 0;
console.log(row[2]+" - "+len);
for (var i=0; i<len;i++){
console.log("Loop "+i);
var r = records[i];
console.log(r.region);
}
return records.length;
}
Strangely, it can only get at the region (or any of the other data) for the FIRST record (records[0]); for the others it says undefined. This is extremely confusing and frustrating. To reiterate: it passes the len < 2 check, so there are more records in the set returned by the query; they just seem to be undefined when I access them via records[i].
Note: uniqueid is not actually a unique field; the name comes from something else, sorry for the confusion.
Question: WHY can't I get at records[1], records[2], ...?
This was a ridiculous problem and I don't entirely understand the solution. Changing "records" to "recs" entirely fixes my problem. Why does records[0] work while records[1] does not, yet recs[0] and recs[1] both work? I believe "records" has a special meaning in this context and points at something else regardless of assignment.
I am trying to create a map that holds an activity and the total duration of that activity, given that the same activity appears multiple times with different durations.
Normally, I would have solved it like this:
Map<String, Duration> result2 = new HashMap<String, Duration>();
for (MonitoredData m : lista) {
    if (result2.containsKey(m.getActivity())) {
        result2.replace(m.getActivity(), result2.get(m.getActivity()).plus(m.getDuration()));
    } else {
        result2.put(m.getActivity(), m.getDuration());
    }
}
But I am trying to do this with a stream, but I can't figure out how to put the sum in there.
Function<Duration, Duration> totalDuration = x -> x.plus(x);
Map<String, Duration> result2 = lista.stream().collect(
Collectors.groupingBy(MonitoredData::getActivity,
Collectors.groupingBy(totalDuration.apply(), Collectors.counting()))
);
I tried in various ways to group them, to map them directly, or to sum them directly in the brackets, but I'm stuck.
Use the 3-argument version of the toMap collector:
import static java.util.stream.Collectors.toMap;
Map<String,Duration> result = lista.stream()
.collect(toMap(MonitoredData::getActivity, MonitoredData::getDuration, Duration::plus));
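If you'd rather keep the groupingBy shape of your attempt, a reducing collector also works; this is just a sketch, assuming the same MonitoredData accessors as above:
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.reducing;

Map<String, Duration> result2 = lista.stream()
        .collect(groupingBy(MonitoredData::getActivity,
                reducing(Duration.ZERO, MonitoredData::getDuration, Duration::plus)));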
Also, note that the Map interface got some nice additions in Java 8. One of them is merge. With that, even your iterative for loop can be rewritten to be much cleaner:
for (MonitoredData m : lista) {
    result.merge(m.getActivity(), m.getDuration(), Duration::plus);
}
I'm currently having trouble figuring out how to use Java 8 streams.
I'm trying to go from lista_dottori (a Map<Integer, Doctor>) to a new map patientsPerSp that maps every medical specialization (method getSpecialization) to the number of patients who have a doctor with that specialization (method getPatients in class Doctor returns a List of that doctor's patients). I can't understand how to use the counting method for this purpose, and I can't seem to find any examples or explanations for this kind of problem.
Here is what I've written; it gives me an error in the counting section:
public Collection<String> countPatientsPerSpecialization(){
patientsPerSp=
lista_dottori.values().stream()
.map(Doctor::getSpecialization)
.collect(groupingBy(Doctor::getSpecialization, counting(Doctor::getPatients.size())))
;
}
It seems that you want to sum the sizes of the patients lists. This can be done with the summingInt() collector, not counting() (which just counts occurrences; doctors, in this case). The mapping step also seems unnecessary here. So you could write:
patientsPerSp = lista_dottori.values().stream()
.collect(groupingBy(Doctor::getSpecialization,
summingInt(doctor -> doctor.getPatients().size())));
Note that the results will be incorrect if several doctors have the same patient (this way it would be counted several times). If that's possible in your case, it would probably be better to build a set of patients:
patientsPerSp = lista_dottori.values().stream()
        .collect(groupingBy(Doctor::getSpecialization,
                flatMapping(d -> d.getPatients().stream(), toSet()))); // flatMapping is Java 9+; plain mapping would give a set of lists
This way you will have a map from specialization to the set of patients, and the size of that set is the count you want. If you just need the count without the set, you can add a final step via collectingAndThen():
patientsPerSp = lista_dottori.values().stream()
        .collect(groupingBy(Doctor::getSpecialization,
                collectingAndThen(
                        flatMapping(d -> d.getPatients().stream(), toSet()), // Java 9+
                        Set::size)));
I solved the problem by avoiding streams. This is the solution I used:
public Collection<String> countPatientsPerSpecialization(){
    int numSpec = 0;
    Map<String, Integer> spec = new HashMap<>();
    for (Doctor d : lista_dottori.values()) {
        if (!spec.containsKey(d.getSpecialization())) {
            spec.put(d.getSpecialization(), d.getPatients().size());
            numSpec++;
        } else { // i.e., if the specialization is already present
            // accumulate instead of overwriting, otherwise earlier counts are lost
            spec.replace(d.getSpecialization(), spec.get(d.getSpecialization()) + d.getPatients().size());
        }
    }
    patientsPerSp.add(spec.keySet() + ": " + spec.values());
    for (String s : patientsPerSp)
        System.out.println(s);
    return patientsPerSp;
}
I couldn't manage to solve it using your solutions, although they were very well explained, sorry.
Thank you anyway for taking the time to answer
Map<String, Integer> patientsPerSpecialization =
doctors.values()
.stream()
.collect(Collectors.groupingBy(Doctor::getSpecialization,
Collectors.summingInt(Doctor::nberOfAssignedPatients)));
I am fairly new to Scalding and I am trying to write a Scalding program that takes 2 datasets as input:
1) book_id_title ('id, 'title): contains the mapping between book ID and book title; both are strings.
2) book_sim ('id1, 'id2, 'sim): contains the similarity between pairs of books, identified by their IDs.
The goal of the Scalding program is to replace each (id1, id2) in book_sim with the respective titles by looking them up in the book_id_title table. However, I am not able to retrieve the title. I would appreciate it if someone could help with the getTitle() function below.
My scalding code is as follows:
// read in the mapping between book id and title from a csv file
val book_id_title =
Csv(book_file, fields=book_format)
.read
.project('id,'title)
// read in the similarity data from a csv file and map the ids to the titles
// by calling getTitle function
val result =
book_sim
.map(('id1, 'id2)->('title1, 'title2)) {
pair:(String,String)=> (getTitle(pair._1), getTitle(pair._2))
}
.write(out)
// function that searches for the id and retrieves the title
def getTitle(search_id: String) = {
val btitle =
book_id_title
.filter('id){id:String => id == search_id} // extract row matching the id
.project('title) // get the title
}
Thanks
Hadoop is a batch processing system and there is no way to look up data by index. Instead, you need to join book_id_title and book_sim by id, probably twice: once for the left id and once for the right id. Something like:
book_sim.joinWithSmaller('id1->id, book_id_title).joinWithSmaller('id2->id, book_id_title)
I am not very familiar with the fields-based API, so consider the above pseudocode. You also need to add the appropriate projections. Hopefully it still gives you the idea.
I'm trying to compare Calendars with JPA2. The query looks somewhat like this:
TypedQuery<X> q = em.createQuery("select r from Record r where r.calendar= :calendar", X.class);
Calendar c = foo(); // setting fields and stuff
q.setParameter("calendar", c);
This, however, compares date + time. I want to know whether the date (MM/DD/YYYY) is equal and I do not care about the time. Is there a nice way to do that in JPA2, or do I have to create a native query?
I tried setting HH:MM:SS (and below) to zero before saving it in the DB, but I don't know if this is very wise, considering time zones, daylight saving time and such.
q.setParameter("calendar", c, TemporalType.DATE)
You can pass the TemporalType.DATE to setParameter method to truncate the date+time.
There is no mention of date/time functions allowing this in the JPQL spec, but you can always cheat and do
select r from Record r where r.calendar >= :theDayAtZeroOClock and r.calendar < :theDayAfterAtZeroOClock
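For instance, the two bound parameters could be built from a Calendar like this (a sketch reusing q and foo() from the question; adjust to however you obtain the day):
Calendar start = foo();                  // the day in question
start.set(Calendar.HOUR_OF_DAY, 0);      // zero out the time part
start.set(Calendar.MINUTE, 0);
start.set(Calendar.SECOND, 0);
start.set(Calendar.MILLISECOND, 0);
Calendar end = (Calendar) start.clone();
end.add(Calendar.DAY_OF_MONTH, 1);       // next midnight, used as the exclusive upper bound

q.setParameter("theDayAtZeroOClock", start, TemporalType.TIMESTAMP);
q.setParameter("theDayAfterAtZeroOClock", end, TemporalType.TIMESTAMP);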
MySQL- and H2-compatible comparison of dates ignoring the time part:
@Query("SELECT DISTINCT s " +
       "FROM Session s " +
       "JOIN s.movie m " +
       "WHERE m.id = :movie AND CAST(s.time AS date) = CAST(:date AS date) " +
       "ORDER BY s.time")
List<Session> getByMovieAndDate(@Param("movie") Long movie, @Param("date") LocalDateTime date);
When using an Oracle database, you can use the trunc function in your JPQL query, e.g.:
TypedQuery<X> q = em.createQuery("select r from Record r where trunc(r.calendar) = trunc(:calendar)", X.class);
See also https://cirovladimir.wordpress.com/2015/05/18/jpa-trunc-date-in-jpql-query-oracle/
I had to use date_trunc in the where clause:
TypedQuery<X> q = em.createQuery("select r from Record r where date_trunc('day', r.calendar) = :calendar", X.class);
Calendar c = foo(); // setting fields and stuff
q.setParameter("calendar", c, TemporalType.DATE);
In Hibernate 6 and above, you can use date_trunc(text, timestamp) to truncate the timestamp more precisely; for example, date_trunc('hour', timestamp) truncates to the hour (no minutes and no seconds).