Second and third function in SparkR

In SparkR I have a DataFrame data. It contains two columns, user and game.
user contains the users and game contains the name of the game a user has played. There are only 14 games, namely 1, 2, ..., 14.
So
head(data)
gives this output
user game
3521 3
52 14
865 4
52 3
I want to find the first game a given user has played. For example, user 52 plays games 14, 3, 3, 5, 10, and here game 14 is the first game this user played.
In SparkR I do it this way:
su <- groupBy(data, data$user)
sus <- agg(su, FirstPlayed= first(data$game))
# Making it local
local_sus <- collect(sus)
Here I get the correct result because I can use the first function in SparkR.
I want to find the 'second' and 'third' game a user has played, but I can't do that because SparkR doesn't have a "second" function.
How should one solve this, then? Maybe I should use the except function to delete the first element?
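One possible direction (not from the original post): SparkR's window functions can rank the games per user, and the second or third ranked row can then be filtered out. Note that "second" is only well defined if some column gives the playing order; the time column below is an assumption, since the example data only has user and game. A rough sketch for Spark 2.x:
# Sketch only: data$time is a hypothetical ordering column
ws <- orderBy(windowPartitionBy(data$user), data$time)
ranked <- withColumn(data, "n", over(row_number(), ws))
secondPlayed <- filter(ranked, ranked$n == 2)   # second game per user
thirdPlayed  <- filter(ranked, ranked$n == 3)   # third game per user
local_second <- collect(secondPlayed)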

Related

How to generate a new column and fill the values if the particular columns values in CSV A and CSV B match?

I am working in R, where there are 2 CSV files containing the data below.
My program should work if the values of Shipping_ID (CSV_A) and Customer_ID (CSV_B) match. CSV_A should get an extra column named 'New Column', filled with the 'Previous Status' values from CSV_B.
Finally, CSV_A is to be exported as CSV_A to a different location on my system.
The issue with my current script:
The script below only matches the columns and gives me the result as a list.
Could someone help me with how to do this in R?
CSV_A:
S.No. Shipping_ID Current Status
1 50 Shipped
2 30 Shipped but not delivered
3 40 In progress
4 10 Shipped
5 20 Not Shipped
CSV_B:
S.No. Customer_ID Previous Status
1 10 Shipping in progress
2 20 Shipping in progress
3 30 Shipped
Expected Result as CSV_A
S. No. Shipping_ID Current Status New Column
1 50 shipped
2 30 Shipped but not delivered Shipped
3 40 in progress
4 10 Shipped Shipping in progress
5 20 Not Shipped Shipping in progress
R Script
library(SASxport)
CSV_A <- 'C:/Users/Userid/Desktop/csv/CSV_A.csv'
CSV_B <- 'C:/Users/Userid/Desktop/csv/CSV_B.csv'
library(tidyverse)
CSV_A<-read.csv(CSV_A )
CSV_A
CSV_B<-read.csv(CSV_B)
CSV_B
CSV_A$Shipping_ID<- CSV_B$Customer_ID[match(CSV_A$Shipping_ID, CSV_B$Customer_ID)]
Try changing the second column name of CSV_B
colnames(CSV_B)[2] <- "Shipping_ID"
Then join the data frames:
library(dplyr)
left_join(CSV_A, CSV_B, by = "Shipping_ID")
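A possible end-to-end sketch of the same idea (not part of the answer above): the column names assume read.csv's default check.names, which turns 'Previous Status' into Previous.Status, and the output path is hypothetical.
library(dplyr)
colnames(CSV_B)[2] <- "Shipping_ID"
# carry over only the status column so CSV_B's own S.No. is not duplicated
csv_b_status <- CSV_B[, c("Shipping_ID", "Previous.Status")]
result <- left_join(CSV_A, csv_b_status, by = "Shipping_ID")
names(result)[names(result) == "Previous.Status"] <- "New Column"
# export to a different location (path is hypothetical)
write.csv(result, "C:/Users/Userid/Desktop/csv/out/CSV_A.csv", row.names = FALSE)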

How to find duplicate using Lat Long data and make it a Unique Identifier in big dataset

My dataset looks something like this (note: the dataset below is hypothetical).
Objective: a sales employee has to go to a particular location and verify the houses/stores/buildings, and the device captures the information below.
Sr.No.  Store_Name    Phone-No.  Agent_id  Area            Lat-Long
1       ABC Stores    89099090   121       Bay Area        23.909090,89.878798
2       Wuhan Masks   45453434   122       Santa Fe        24.452134,78.123243
3       Twitter Cafe  67556090   123       Middle East     11.889766,23.334483
4       abc           33445569   121       Santa Cruz      23.345678,89.234213
5       Silver Gym    11004110   234       Worli Sea Link  56.564311, 78.909087
6       CK Clothings  00908876   223       90 th Street    34.445887, 12.887654
Facts:
#1 A unique identifier is needed for finding duplicates - check Sr.No. 1 and 4, which are basically the same store.
In this dummy dataset all the columns can be manipulated, i.e. for the same store/house/building-outlet:
a) Since the name is entered manually, the same house/store can be entered in the system under different names, and multiple visits can happen.
b) The mobile number can also be manipulated; a different number can be associated with the same outlet.
c) The device with which the agent captures the lat-long info can also be fudged, by moving closer to or near the building.
Problem:
How can the lat-long data be made into a unique identifier for finding duplicates in the huge dataset, keeping point c) above in mind?
Deploying QR codes is also not very helpful, as these can also be tweaked.
The aim is to stop the fraudulent practice by an employee (the same employee can visit the same store/outlet again, or a different employee can also visit the same store/outlet again, to increase the visit count).
Right now I can only think of the Lat-Long column for making a UID; please feel free to suggest if anything else can be used.
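Not from the question, but one rough direction in R: exact lat-long values will rarely match, so nearby captures can be bucketed by rounding the coordinates (or, more robustly, clustered with a distance threshold, e.g. using geosphere::distHaversine) and the bucket used as the location UID. A minimal sketch, assuming a data.frame stores with the Lat-Long column already split into numeric lat and long columns:
# Round to ~3 decimal places (roughly 100 m) so nearby captures share a key
stores$loc_uid <- paste(round(stores$lat, 3), round(stores$long, 3), sep = "_")
# Rows sharing a loc_uid are candidate duplicate visits to the same outlet
dups <- stores[duplicated(stores$loc_uid) | duplicated(stores$loc_uid, fromLast = TRUE), ]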

To sort a specific column in a DataFrame in SparkR

In SparkR I have a DataFrame data. It contains time, game and id.
head(data)
then gives id = 1, 4, 1, 1, 215, 985, ..., game = 1, 5, 1, 10, ... and time = 2012-2-1, 2013-9-9, ...
Now game contains a gametype which is numbers from 1 to 10.
For a given gametype I want to find the minimum time, meaning the first time this game has been played. For gametype 1 I do this
data1 <- filter(data, data$game == 1)
This new data contains all data for gametype 1. To find the minimum time I do this
g <- groupBy(data1, game$time)
first(arrange(g, desc(g$time)))
but this doesn't run in SparkR. It says "object of type 'S4' is not subsettable".
Game 1 has been played on 2012-01-02, 2013-05-04, 2011-01-04, ... and I would like to find the minimum time.
If all you want is a minimum time, sorting the whole data set doesn't make sense. You can simply use min:
agg(df, min(df$time))
or for each type of game:
groupBy(df, df$game) %>% agg(min(df$time))
By typing
arrange(game, game$time)
I get all of the times sorted. By using the first function I get the first entry. If I want the last entry I simply type this:
first(arrange(game, desc(game$time)))
Just to clarify, because this is something I keep running into: the error you were getting is probably because you also imported dplyr into your environment. If you had used SparkR::first(SparkR::arrange(g, SparkR::desc(g$time))), things would probably have been fine (although obviously the query could have been more efficient).

Excel: Select data for graph

To put it simply, I have three columns in Excel like the ones below:
Vehicle x y
1 10 10
1 15 12
1 12 9
2 8 7
2 11 6
3 7 12
x and y are the coordinates of customers assigned to the corresponding vehicle. This file is the output of a program I run in advance. The list will always be sorted by vehicle, but the number of customers assigned to vehicle "k" may change from one experiment to the next.
I would like to plot a graph containing 3 series, one for each vehicle, where the customers of each vehicle would appear (as dots in 2D based on their x- and y- values) in different color.
In my real file, I have 12 vehicles and 3200 customers, and the ranges change from one experiment to the next, so I would like to automate the process, i.e. copy-paste the list into my Excel sheet and see the graph appear automatically (if this is possible).
Thanks in advance for your time and effort.
EDIT: There is a similar post here: Use formulas to select chart data, but it requires the use of VB. Moreover, I am not sure whether it has indeed been answered.
You should try this free online tool: www.cloudyexcel.com/excel-to-graph/

%Rpush >> lists of complex objects (e.g. pandas DataFrames in IPython Notebook)

Once again, I am having a great time with Notebook and the emerging rmagic infrastructure, but I have another question about the bridge between the two. Currently I am attempting to pass several subsets of a pandas DataFrame to R for visualization with ggplot2. Just to be clear upfront, I know that I could pass the entire DataFrame and perform additional subsetting in R. My preference, however, is to leverage the data management capability of Python and the subset-wise operations I am performing are just easier and faster using pandas than the equivalent operations in R. So for the sake of efficiency and morbid curiosity...
I have been trying to figure out if there is a way to push several objects at once. The wrinkle is that sometimes I don't know in advance how many items will need to be pushed. To retain flexibility, I have been populating dictionaries with DataFrames throughout the front end of the script. The following code provides a reasonable facsimile of what I am working through (I have not converted via com.convert_to_r_dataframe for simplicity, but my real code does take this step):
import numpy as np
import pandas as pd
from pandas import DataFrame
%load_ext rmagic
d1 = DataFrame(np.arange(16).reshape(4, 4))
d2 = DataFrame(np.arange(20).reshape(5, 4))
d_list = [d1, d2]
names = ['n1', 'n2']
d_dict = dict(zip(names, d_list))
for name in d_dict.keys():
    exec '%s=d_dict[name]' % name
%Rpush n1
As can be seen, I can assign a static name and push the DataFrame into the R namespace individually (as well as in a 'list' >> %Rpush n1 n2). What I cannot do is something like the following:
for name in d_dict.keys():
    %Rpush d_dict[name]
That snippet raises an exception >> KeyError: u'd_dict[name]'. I also tried to deposit the dynamically named DataFrames in a list, but the list references end up pointing to the data rather than to the object references:
df_list = []
for name in d_dict.keys():
    exec '%s=d_dict[name]' % name
    exec 'df_list.append(%s)' % name
print df_list
for df in df_list:
    %Rpush df
[ 0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15,
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19]
%Rpush did not throw an exception when I looped through the list's contents, but the DataFrames could not be found in the R namespace. I have not been able to find much discussion of this topic beyond talk about the conversion of lists to R vectors. Any help would be greatly appreciated!
Rmagic's push uses the name that you give it both to look up the Python variable, and to name the R variable it creates. So it needs a valid name, not just any expression, on both sides.
There's a trick you can do to get the name from a Python variable:
d1=DataFrame(np.arange(16).reshape(4,4))
name = 'd1'
%Rpush {name}
# equivalent to %Rpush d1
But if you want to do more advanced things, it's best to get hold of the r object and use that to put your objects in. Rmagic is just a convenience wrapper over rpy2, which is a full API. So you can do:
from rpy2.robjects import r
r.assign('a', 1)
You can mix and match which interface you use - rmagic and rpy2 are talking to the same instance of R.
