Recommender Split Returning Empty Dataset - azure-machine-learning-studio

I'm using a "Split Data" module set to recommender split to split data for training and testing a matchbox recommender. The input data is a valid user-item-rating tuple (for example, 575978 - 157381 - 3) and I've left the parameters for the recommender split as default (0s for everything), besides changing it to a .75 and .25 split. However, when this module finishes, it returns the complete, unsplit dataset for dataset1 and a completely empty (but labelled) dataset for dataset2. This also happens when doing a stratified split using the "Split Rows" mode. Any idea what's going on?
Thanks.
Edit: Including a sample of my data.
UserID ItemID Rating
835793 165937 3
154738 11214 3
938459 748288 3
819375 789768 6
738571 98987 3
847509 153777 3
991757 124458 3
968685 288070 2
236349 8337 3
127299 545885 3

Figured it out. In my "Remove Duplicate Rows" module further up the chain, I was removing duplicates by UserID only instead of by UserID and ItemID. This still left quite a few rows, but I'm assuming it messed with the stratification.
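For anyone hitting the same thing, here is a small R sketch of the difference between the two deduplication keys. The data frame is made up for illustration (it is not the original dataset), and this only shows the row-level effect; it does not reproduce the Split Data module itself.

```r
# Hypothetical ratings: one user with several items, plus one exact duplicate.
ratings <- data.frame(
  UserID = c(835793, 835793, 835793, 154738),
  ItemID = c(165937, 165937, 11214, 11214),
  Rating = c(3, 3, 6, 3)
)

# Deduplicating on UserID alone keeps a single rating per user.
by_user <- ratings[!duplicated(ratings$UserID), ]

# Deduplicating on the (UserID, ItemID) pair only drops repeated pairs.
by_pair <- ratings[!duplicated(ratings[c("UserID", "ItemID")]), ]

nrow(by_user)
nrow(by_pair)
```

With one row per user, there is nothing left to divide within a user, which would be consistent with everything ending up in the first output.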

Related

is there an R function for merging duplicates to the same row?

I am conducting research on SARS-CoV-2 tests on healthcare workers. Some workers were tested multiple times (they are identified by employee number). I would therefore like to have new columns where the second/third test value (numeric) and the date of the test are listed for the same healthcare worker. However, I have no idea how to approach this. I'd guess you could group the duplicates by employee number and use some sort of mutate() function?
All tips are appreciated!
Maybe you could use the dcast function from library(data.table).
Let's assume you have the following data table:
library(data.table)
cov_test <- data.table(worker_id = c(1, 1, 2), test_count = c(1, 2, 1), test_result = c("negative", "positive", "negative"))
worker_id  test_count  test_result
1          test 1      negative
1          test 2      positive
2          test 1      negative
Using the following code, you get the following table:
dcast(data = cov_test, ...~test_count, value.var="test_result")
worker_id  test1     test2
1          negative  positive
2          negative  NA
The question is whether you have a column that describes the current test number for a person. If not, you would have to extract this information from the date column.
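If there is no such column, one way to derive it from the dates might look like the sketch below. The column name test_date and the dates are assumptions made up for illustration; the idea is just to number each worker's tests in date order and then cast both the result and the date.

```r
library(data.table)

# Hypothetical data: no test counter, only a date per test.
cov_test <- data.table(
  worker_id   = c(1, 1, 2),
  test_date   = as.Date(c("2020-04-01", "2020-05-10", "2020-04-03")),
  test_result = c("negative", "positive", "negative")
)

# Number each worker's tests in chronological order.
setorder(cov_test, worker_id, test_date)
cov_test[, test_count := seq_len(.N), by = worker_id]

# One row per worker, with one result column and one date column per test.
wide <- dcast(cov_test, worker_id ~ test_count,
              value.var = c("test_result", "test_date"))
print(wide)
```

Workers with fewer tests than the maximum get NA in the unused columns.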

RMYSQL Writetable error

I have the following R dataframe
Sl NO Name Marks
1 A 15
2 B 20
3 C 25
I have a mysql table as follows. (Score.table)
No CandidateName Score
1 AA 1
2 BB 2
3 CC 3
I have written my dataframe to Score.table using this code
library(RMySQL)
username='username'
password='userpass'
dbname='cdb'
hostname='***.***.***.***'
cdbconn = dbConnect(MySQL(), user=username, password=password,
dbname=dbname, host=hostname)
Next we write the dataframe to the table as follows
score.table<-'score.table'
dbWriteTable(cdbconn, score.table, dataframe, append=FALSE, overwrite=TRUE)
The code runs and I get TRUE as the output.
However, when I check the SQL table, the new values haven't overwritten the existing values.
Could someone help me with this? The code runs without error. I have reinstalled the RMySQL package and rerun the code, and the results are the same.
That updates are not happening indicates that the RMySQL package cannot successfully map any of the rows from your data frame to already existing records in the table. So this would imply that your call to dbWriteTable has a problem. Two potential problems I see are that you did not assign values for field.types or row.names. Consider making the following call:
score.table <- 'score.table'
dbWriteTable(cdbconn, score.table, dataframe,
field.types=list(`Sl NO`="int", Name="varchar(55)", Marks="int"),
row.names=FALSE)
If you omit field.types, the package will try to infer the types. I am not an expert with this package, so I don't know how robust this inference is, but you would most likely want to specify explicit types for complex update queries.
The bigger problem might actually be not specifying a value for row.names. It can default to TRUE, in which case the package will actually send an extra column during the update. This can cause problems, for example if your target table has three columns, and the data frame also has three columns, then you are trying to update with four columns.
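The extra-column effect can be seen without a MySQL server. The sketch below swaps in an in-memory SQLite database through the DBI and RSQLite packages, which is an assumption for illustration only; the table names and data frame are made up, and RMySQL's defaults may differ from RSQLite's.

```r
library(DBI)

# In-memory SQLite database used as a stand-in for the MySQL server.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

scores <- data.frame(No = 1:3,
                     CandidateName = c("A", "B", "C"),
                     Score = c(15, 20, 25))

# Same data frame written twice, differing only in row.names.
dbWriteTable(con, "with_rownames", scores, row.names = TRUE)
dbWriteTable(con, "without_rownames", scores, row.names = FALSE)

# The first table carries one more column than the data frame has.
fields_with <- dbListFields(con, "with_rownames")
fields_without <- dbListFields(con, "without_rownames")
print(fields_with)
print(fields_without)

dbDisconnect(con)
```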

Merge columns with the same name R

I'm fairly new to R. I'm working with a data set that is incredibly redundant with a lot of columns (~400). There are several duplicate column names, however the data is not duplicate, so I need to sum the columns when collapsing them.
The columns all have a similar name that allows easy identification, so I'm hoping I can use that to my advantage.
I attempted to perform the following:
ColNames <- unique(colnames(df))
CombinedDf <- data.frame(sapply(ColNames, function(i) rowSums(df[, ColNames == i, drop = FALSE])))
This works if I sum over the range of columns that only contain integers, but the issue is that other columns have strings and such in them, so rowSums throws a fit.
Assuming that the identifier is "XXX", how can I aggregate all the columns that are of the same name leaving the other columns as is?
Thank you for your time.
Edit: Sample data has been asked for, I cannot give the exact data as it is sensitive, but I will give an example:
Name COL1XXX COL2XXX COL1XXX COL3XXX COL2XXX Type
Henry 5 15 25 31 1 Orange
Tom 8 16 12 4 3 Green
Should return
Name COL1XXX COL2XXX COL3XXX Type
Henry 30 16 31 Orange
Tom 20 19 4 Green
I'm not really sure, but you may try transposing the data and then aggregating by unique names.
t_df=as.data.frame(t(df))
new_df=aggregate(t_df, by=list(rownames(t_df)),sum)
Again, without sample data I'm unsure if it'll work, but based on what you said, that might work.
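An alternative that avoids transposing (which would turn mixed columns into text) is to walk over the unique column names and sum only the ones carrying the identifier. This is a base R sketch built on the example data from the question, assuming the "XXX" columns are numeric and that non-"XXX" names are not duplicated.

```r
# Example data from the question; check.names = FALSE keeps duplicate names.
df <- data.frame(
  Name = c("Henry", "Tom"),
  COL1XXX = c(5, 8), COL2XXX = c(15, 16), COL1XXX = c(25, 12),
  COL3XXX = c(31, 4), COL2XXX = c(1, 3),
  Type = c("Orange", "Green"),
  check.names = FALSE
)

col_names <- unique(names(df))

combined <- lapply(col_names, function(nm) {
  cols <- df[, names(df) == nm, drop = FALSE]
  if (grepl("XXX", nm, fixed = TRUE)) {
    unname(rowSums(cols))  # sum all columns sharing this name
  } else {
    cols[[1]]              # leave other columns as they are
  }
})
names(combined) <- col_names
combined <- as.data.frame(combined, check.names = FALSE,
                          stringsAsFactors = FALSE)
print(combined)
```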

R: Subsetting rows by group based on time difference

I have the following data frame:
group_id date_show date_med
1 1976-02-07 1971-04-14
1 1976-02-09 1976-12-11
1 2011-03-02 1970-03-22
2 1993-08-04 1997-06-13
2 2008-07-25 2006-09-01
2 2009-06-18 2005-11-12
3 2009-06-18 1999-11-03
I want to subset my data frame in such a way that the new data frame only shows the rows in which the values of date_show are further than 10 days apart but this condition should only be applied per group. I.e. if the values in the date_show column are less than 10 days apart but the group_ids are different, I need to keep both entries. What I want my result to look like based on the above table is:
group_id date_show date_med
1 1976-02-07 1971-04-14
1 2011-03-02 1970-03-22
2 1993-08-04 1997-06-13
2 2008-07-25 2006-09-01
2 2009-06-18 2005-11-12
3 2009-06-18 1999-11-03
Which row gets deleted isn't important, because the reason I'm subsetting in the first place is to count the rows I am left with after applying this criterion.
I've tried playing around with the diff function but I'm not sure how to go about it in the simplest possible way because this problem is already within another sapply function so I'm trying to avoid any kind of additional loop (in this case by group_id).
The df I'm working with has around 100 000 rows. Ideally, I would like to do this with base R because I have no rights to install any additional packages on the machine I'm working on but if this is not possible (or if solving this with an additional package would be significantly better), I can try and ask my admin to install it.
Any tips would be appreciated!
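One possible base R approach, sketched on the example data above: sort by group and date, compute the gap to the previous row within each group with ave() and diff(), and keep the rows whose gap exceeds 10 days. This assumes "apart" means the distance to the immediately preceding date in the same group; a chain of closely spaced dates would then keep only its first row, which may or may not be what is wanted.

```r
df <- data.frame(
  group_id  = c(1, 1, 1, 2, 2, 2, 3),
  date_show = as.Date(c("1976-02-07", "1976-02-09", "2011-03-02",
                        "1993-08-04", "2008-07-25", "2009-06-18",
                        "2009-06-18")),
  date_med  = as.Date(c("1971-04-14", "1976-12-11", "1970-03-22",
                        "1997-06-13", "2006-09-01", "2005-11-12",
                        "1999-11-03"))
)

# Sort so that diff() compares neighbouring dates within a group.
df <- df[order(df$group_id, df$date_show), ]

# Gap in days to the previous row of the same group; the first row of
# each group gets Inf so that it is always kept.
gap <- ave(as.numeric(df$date_show), df$group_id,
           FUN = function(d) c(Inf, diff(d)))

kept <- df[gap > 10, ]
nrow(kept)
```

ave() is vectorised over the groups, so no explicit loop over group_id is needed.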

R: Data transfer between two lists (source list smaller than target list)

I searched, but I couldn't find a similar question, so I apologize if I may have missed it.
My problem is actually pretty simple. I have two lists, a large one and a smaller one. The smaller one consists of the averages of the data in the large list (ten lines have been aggregated to form the small list, so it has one tenth the size of the larger one). All I want now is to add a new column to the large list (which is no problem) and show the averages next to the original data. I am aware that I will see each average ten times, but that's fine.
I tried to solve this "problem" with simple list comparisons, e.g. (the relevant averages, as well as the original data have identical identifiers in the first column):
Large_List$Average_column[ Large_List$identifier == Small_List$identifier ] <- Small_List$Average[ Large_List$identifier == Small_List$identifier ];
Yet for some reason, it doesn't work. Probably because the target vector is larger than the source vector. I really tried a lot, and the only thing that seems to work is a loop structure. But that is no option because my list is way too large... I am sure there must be a smart solution to this simple issue.
UPDATE & SPECIFICATION
Thank you for your suggestions. But it seems I need to be more specific. The problem is that in most, but not all, cases the average is formed from ten consecutive data points. Fewer may be used because of holes in the sample. Therefore, simple replication will unfortunately not do the job.
Here’s an example (1_Ident is the minute identifier, 10_Ident being the ten minute identifier) :
Original_List:
1_Ident  | 10_Ident | Minute_value |
July1-0  | July1-0d | 1
July1-2  | July1-0d | 1
(..)
July1-10 | July1-0d | 1
July1-11 | July1-1d | 1
July1-12 | July1-1d | 2
July1-21 | July1-2d | 3
July1-31 | July1-3d | 2
Resulting Small_list:
10_Ident|Minute_average|
July1-0d| 1
July1-1d| 1.5
July1-2d| 3
July1-3d| 2
Desired outcome:
Large_List:
1_Ident  | 10_Ident | Minute_value | Minute_average |
July1-0  | July1-0d | 1            | 1
July1-2  | July1-0d | 1            | 1
(..)
July1-10 | July1-0d | 1            | 1
July1-11 | July1-1d | 1            | 1.5
July1-12 | July1-1d | 2            | 1.5
July1-21 | July1-2d | 3            | 3
July1-31 | July1-3d | 2            | 2
I think the main problem is that the Small_list$Minute_average vector is not the same size as the Large_list$Minute_value vector. As said, one could compare the two lists line by line, doing a loop, but the size of the tables is >1M lines, so that won't work.
What I want to do is basically the following:
1) Look in Large_List$10_Ident and compare it with Small_List$10_Ident
2) Where the values match, transfer the corresponding Small_List$Minute_average value to Large_List$Minute_average
Thanks!
You could use match or merge to do that but why not just calculate the averages off the groupings?
Large_List$Average_column <- ave(Large_List$col_to_be_avgd,
Large_List$group_var,
FUN=mean, na.rm=TRUE)
The merge code might look like
merge(Large_List, Small_List[c("identifier", "Average")], by = "identifier", all.x = TRUE)
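The match variant might look like the sketch below. The data frames are small made-up stand-ins using the identifier and Average column names from the question's attempt. match() returns, for each row of the large table, the position of its identifier in the small table, so the lookup stays vectorised and the differing lengths are not a problem.

```r
# Hypothetical stand-ins for the two tables.
Large_List <- data.frame(
  identifier   = c("July1-0d", "July1-0d", "July1-1d",
                   "July1-1d", "July1-2d", "July1-3d"),
  Minute_value = c(1, 1, 1, 2, 3, 2),
  stringsAsFactors = FALSE
)
Small_List <- data.frame(
  identifier = c("July1-0d", "July1-1d", "July1-2d", "July1-3d"),
  Average    = c(1, 1.5, 3, 2),
  stringsAsFactors = FALSE
)

# Position of each large-table identifier within the small table
# (NA where there is no matching identifier).
idx <- match(Large_List$identifier, Small_List$identifier)

# Index the averages by those positions to get one value per large row.
Large_List$Average_column <- Small_List$Average[idx]
print(Large_List)
```

Unlike merge, this keeps the original row order of the large table.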
