Second and third function in sparkR - r

In sparkR I have a DataFrame data. It contains user, game.
user contains the users and game contains the name of a game a user has played. There are only 14 games, namely 1,2,...,14.
Printing the data gives this output:
user game
3521 3
52 14
865 4
52 3
I want to find the first game a given user played. For example, user 52 plays games 14, 3, 3, 5, 10, so game 14 is the first game this user played.
In sparkR I do it this way
su <- groupBy(data, data$user)
sus <- agg(su, FirstPlayed= first(data$game))
# Making it local
local_sus <- collect(sus)
Here I get the correct result because I can use the first function in SparkR.
I want to find the second and third game a user has played, but I can't do that because SparkR doesn't have a "second" function.
How should one solve this? Maybe I should use the except function to delete the first element?
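The logic being asked for (take the n-th row per user rather than only the first) can be sketched outside SparkR. The following is a minimal plain-Python illustration, not SparkR API: the function name nth_game_per_user is hypothetical, and it assumes the rows arrive in play order, which a distributed DataFrame does not guarantee unless there is a column to order by.

```python
from collections import defaultdict

def nth_game_per_user(rows, n):
    """Return {user: n-th game played (1-based)} for users with at least n rows.

    `rows` is an iterable of (user, game) pairs, assumed to be in play order.
    """
    games = defaultdict(list)
    for user, game in rows:
        games[user].append(game)
    # Users with fewer than n games are left out rather than given a placeholder.
    return {user: played[n - 1] for user, played in games.items() if len(played) >= n}

# Rows from the question: user 52 plays games 14, 3, 3, 5, 10.
rows = [(3521, 3), (52, 14), (865, 4), (52, 3), (52, 3), (52, 5), (52, 10)]
first = nth_game_per_user(rows, 1)
second = nth_game_per_user(rows, 2)
third = nth_game_per_user(rows, 3)
```

Users with only one recorded game (3521 and 865 here) simply do not appear in the second and third results.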


How to generate a new column and fill the values if the particular columns values in CSV A and CSV B match?

I am working in R with 2 CSV files containing the data below.
My program should act wherever the values of Shipping_ID (CSV_A) and Customer_ID (CSV_B) match. CSV_A should get an extra column named 'New Column', filled with the 'Previous Status' values from CSV_B.
Finally, CSV_A is to be exported as CSV_A to a different location on my system.
The issue with my current script:
The script below only matches the columns and gives me the results as a list.
Could someone help me with how to do this in R?
CSV_A:
S.No.  Shipping_ID  Current Status
1      50           Shipped
2      30           Shipped but not delivered
3      40           In progress
4      10           Shipped
5      20           Not Shipped
CSV_B:
S.No.  Customer_ID  Previous Status
1      10           Shipping in progress
2      20           Shipping in progress
3      30           Shipped
Expected Result as CSV_A
S.No.  Shipping_ID  Current Status             New Column
1      50           shipped
2      30           Shipped but not delivered  Shipped
3      40           in progress
4      10           Shipped                    Shipping in progress
5      20           Not Shipped                Shipping in progress
R script
CSV_A <- 'C:/Users/Userid/Desktop/csv/CSV_A.csv'
CSV_B <- 'C:/Users/Userid/Desktop/csv/CSV_B.csv'
CSV_A<-read.csv(CSV_A )
CSV_A$Shipping_ID<- CSV_B$Customer_ID[match(CSV_A$Shipping_ID, CSV_B$Customer_ID)]
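Try changing the second column name of CSV_B so that it matches CSV_A:
colnames(CSV_B)[2] <- "Shipping_ID"
Then join the data frames (left_join comes from dplyr, and by takes the column name as a string):
library(dplyr)
left_join(CSV_A, CSV_B, by = "Shipping_ID")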
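What a left join does to these two tables can be shown with a small language-neutral sketch. This is plain Python using the rows from the question, not the R solution itself; the variable names are illustrative, and unmatched rows are filled with an empty string here as an assumption (a dplyr left join would give NA instead).

```python
# Rows of CSV_A and CSV_B as given in the question.
csv_a = [
    {"Shipping_ID": 50, "Current Status": "Shipped"},
    {"Shipping_ID": 30, "Current Status": "Shipped but not delivered"},
    {"Shipping_ID": 40, "Current Status": "In progress"},
    {"Shipping_ID": 10, "Current Status": "Shipped"},
    {"Shipping_ID": 20, "Current Status": "Not Shipped"},
]
csv_b = [
    {"Customer_ID": 10, "Previous Status": "Shipping in progress"},
    {"Customer_ID": 20, "Previous Status": "Shipping in progress"},
    {"Customer_ID": 30, "Previous Status": "Shipped"},
]

# Build a lookup keyed on Customer_ID, then attach the matching value to each CSV_A row.
previous_by_id = {row["Customer_ID"]: row["Previous Status"] for row in csv_b}
result = [
    {**row, "New Column": previous_by_id.get(row["Shipping_ID"], "")}
    for row in csv_a
]
```

Every CSV_A row is kept, in its original order, whether or not a match exists in CSV_B.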

How to find duplicate using Lat Long data and make it a Unique Identifier in big dataset

My dataset looks something like this. Note that the data below is hypothetical.
Objective: Sales employee has to go to a particular location and verify the houses/Stores/buildings and device captures below mentioned information
ABC Stores
Bay Area
Wuhan Masks
Santa Fe
Twitter Cafe
Middle East
Santa Cruz
Silver Gym
Worli Sea Link
56.564311, 78.909087
CK Clothings
90 th Street
34.445887, 12.887654
#1 Unique identifier for finding duplicates: check Sr. No. 1 and 4, which are basically the same.
In this dummy dataset all the columns can be manipulated for the same store/house/building outlet:
a) Since the name is entered manually, the name of the same house/store can be changed when it is entered in the system, so multiple visits can happen.
b) The mobile number can also be manipulated; a different number can be associated with the same outlet.
c) The lat-long captured by the agent's device can also be fudged by moving closer to or near the building.
How can I make the lat-long data the unique identifier for finding duplicates in the huge dataset, keeping point c) above in mind?
Deploying QR codes is also not very helpful, as these can be tweaked too.
The goal is to stop fraudulent practice by employees (the same employee can visit the same store/outlet again, or a different employee can visit it again, to increase the visit count).
Right now I can only think of using the lat-long column to make a UID; please feel free to suggest anything else that could be used.
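One possible way to handle point c) is to treat readings within some tolerance of each other as the same place, instead of requiring exact coordinate matches. The sketch below is a minimal illustration under stated assumptions: the 50 m radius, the function names, and the third test point are all hypothetical, and the greedy pass compares every point against every anchor, so a huge dataset would need a spatial index or grid bucketing on top of it.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/long points."""
    earth_radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))

def assign_location_ids(points, radius_m=50.0):
    """Give each point the id of the first earlier anchor within radius_m, else a new id."""
    anchors = []  # first reading seen for each location id
    ids = []
    for lat, lon in points:
        for idx, (alat, alon) in enumerate(anchors):
            if haversine_m(lat, lon, alat, alon) <= radius_m:
                ids.append(idx)
                break
        else:
            anchors.append((lat, lon))
            ids.append(len(anchors) - 1)
    return ids

points = [
    (56.564311, 78.909087),  # from the question
    (34.445887, 12.887654),  # from the question
    (56.564400, 78.909100),  # hypothetical re-visit a few metres from the first point
]
location_ids = assign_location_ids(points)
```

Rows sharing a location id are candidates for duplicate visits; the id is only as reliable as the chosen radius, so it is better used to flag rows for review than as a strict key.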

To sort a specific column in a DataFrame in SparkR

In SparkR I have a DataFrame data. It contains time, game and id.
Printing it gives id = 1 4 1 1 215 985 ..., game = 1 5 1 10 ... and time = 2012-2-1, 2013-9-9, ...
Now game contains a gametype which is numbers from 1 to 10.
For a given gametype I want to find the minimum time, meaning the first time this game has been played. For gametype 1 I do this
data1 <- filter(data, data$game == 1)
This new data contains all data for gametype 1. To find the minimum time I do this
g <- groupBy(data1, game$time)
first(arrange(g, desc(g$time)))
but this can't run in sparkR. It says "object of type S4 is not subsettable".
Game 1 has been played 2012-01-02, 2013-05-04, 2011-01-04,... I would like to find the minimum-time.
If all you want is the minimum time, sorting the whole data set doesn't make sense. You can simply use min:
agg(df, min(df$time))
or for each type of game:
groupBy(df, df$game) %>% agg(min(df$time))
By typing
arrange(game, game$time)
I get all of the time sorted. By taking first function I get the first entry. If I want the last entry I simply type this
first(arrange(game, desc(game$time)))
Just to clarify, because this is something I keep running into: the error you were getting is probably because you also loaded dplyr into your environment. If you had used SparkR::first(SparkR::arrange(g, SparkR::desc(g$time))), things would probably have been fine (although the query could obviously have been more efficient).

Excel: Select data for graph

To put it simply, I have three columns in Excel like the ones below:
Vehicle x y
1 10 10
1 15 12
1 12 9
2 8 7
2 11 6
3 7 12
x and y are the coordinates of customers assigned to the corresponding vehicle. This file is the output of a program I run in advance. The list will always be sorted by vehicle, but the number of customers assigned to vehicle "k" may change from one experiment to the next.
I would like to plot a graph containing 3 series, one for each vehicle, where the customers of each vehicle would appear (as dots in 2D based on their x- and y- values) in different color.
In my real file, I have 12 vehicles and 3200 customers, and the ranges change from one experiment to the next, so I would like to automate the process, i.e copy-paste the list on my excel and see the graph appear automatically (if this is possible).
Thanks in advance for your time and effort.
EDIT: There is a similar post here: Use formulas to select chart data, but it requires the use of VB. Moreover, I am not sure whether it was actually answered.

%Rpush >> lists of complex objects (e.g. pandas DataFrames in IPython Notebook)

Once again, I am having a great time with Notebook and the emerging rmagic infrastructure, but I have another question about the bridge between the two. Currently I am attempting to pass several subsets of a pandas DataFrame to R for visualization with ggplot2. Just to be clear upfront, I know that I could pass the entire DataFrame and perform additional subsetting in R. My preference, however, is to leverage the data management capability of Python and the subset-wise operations I am performing are just easier and faster using pandas than the equivalent operations in R. So for the sake of efficiency and morbid curiosity...
I have been trying to figure out if there is a way to push several objects at once. The wrinkle is that sometimes I don't know in advance how many items will need to be pushed. To retain flexibility, I have been populating dictionaries with DataFrames throughout the front end of the script. The following code provides a reasonable facsimile of what I am working through (I have not converted via com.convert_to_r_dataframe for simplicity, but my real code does take this step):
import pandas as pd
from pandas import DataFrame
%load_ext rmagic
for name in d_dict.keys():
    exec '%s=d_dict[name]' % name
%Rpush n1
As can be seen, I can assign a static name and push the DataFrame into the R namespace individually (as well as in a 'list' >> %Rpush n1 n2). What I cannot do is something like the following:
for name in d_dict.keys():
    %Rpush d_dict[name]
That snippet raises an exception >> KeyError: u'd_dict[name]'. I also tried to deposit the dynamically named DataFrames in a list, but the list references end up pointing to the data rather than the object reference:
for name in d_dict.keys():
    exec '%s=d_dict[name]' % name
    exec 'df_list.append(%s)' % name
print df_list
for df in df_list:
    %Rpush df
[ 0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15,
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19]
%Rpush did not throw an exception when I looped through the list's contents, but the DataFrames could not be found in the R namespace. I have not been able to find much discussion of this topic beyond talk about the conversion of lists to R vectors. Any help would be greatly appreciated!
Rmagic's push uses the name that you give it both to look up the Python variable, and to name the R variable it creates. So it needs a valid name, not just any expression, on both sides.
There's a trick you can do to get the name from a Python variable:
name = 'd1'
%Rpush {name}
# equivalent to %Rpush d1
But if you want to do more advanced things, it's best to get hold of the r object and use that to put your objects in. Rmagic is just a convenience wrapper over rpy2, which is a full API. So you can do:
from rpy2.robjects import r
r.assign('a', 1)
You can mix and match which interface you use - rmagic and rpy2 are talking to the same instance of R.
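Applied to the dictionary from the question, the r.assign approach could look roughly like the loop below. This is a sketch rather than tested rmagic code: push_all is a hypothetical helper, and a plain dict stands in for the R namespace so the example runs without rpy2 installed. Real DataFrames would still need converting to R objects first, as the question already does.

```python
def push_all(frames, assign):
    """Call assign(name, obj) for every entry in `frames`; return the names pushed."""
    pushed = []
    for name, obj in frames.items():
        assign(name, obj)
        pushed.append(name)
    return pushed

# Stand-in for the R namespace (an assumption, so the sketch runs on its own).
fake_r_namespace = {}
names = push_all({"n1": [1, 2], "n2": [3, 4]}, fake_r_namespace.__setitem__)

# With rpy2 available, the same helper would presumably be driven as:
#   from rpy2.robjects import r
#   push_all(d_dict, r.assign)
```

Because the name is passed as a string, the number of objects does not need to be known in advance and no exec is needed.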
