Dipping my toes into R after using Excel for many years, and I have a question. I'm thoroughly impressed with how much faster R is: it used to take Excel over an hour to run 10,000 simulations, and R did 25,000 of the same sim in 4 minutes. Awesome.
This is fantasy football related, as I am trying to create a lineup optimizer in R, and I found the Rglpk package to be a good option. Multiple other questions on SO helped me get to where I am today; however, I have hit a roadblock. Here are some of those topics:
Fantasy football linear programming in R with RGLPK
Rglpk - Fantasy Football Lineup Optimiser - Rbind of For Loop Output
Rglpk - Fantasy Football Lineup Optimiser - Forcing the Inclusion of a Player
Here is my stock optimizer:
# stock optimal lineups solver
name <- myData$Name
pos <- myData$Pos
pts <- myData$Projection
cost <- myData$Salary
team <- myData$Team
opp <- myData$Opp
num.players <- length(name)
f <- pts                           # objective: projected points
var.types <- rep("B", num.players) # one binary pick/no-pick variable per player
# one constraint row per position, plus the salary row
A <- rbind(as.numeric(pos=="QB")
, as.numeric(pos=="RB")
, as.numeric(pos=="WR")
, as.numeric(pos=="TE")
, as.numeric(pos=="K")
, as.numeric(pos=="D")
,cost)
dir <- c("=="
,"=="
,"=="
,"=="
,"=="
,"=="
,"<=")
# roster: 1 QB, 2 RB, 3 WR, 1 TE, 1 K, 1 D, $60,000 salary cap
b <- c(1
, 2
, 3
, 1
, 1
, 1
, 60000)
library(Rglpk)
sol <- Rglpk_solve_LP(obj = f
, mat = A
, dir = dir
, rhs = b
, types = var.types
, max=TRUE)
myData[sol$solution == 1,]
sprintf('Cost is:$%i', sum(cost[sol$solution > 0]))
sprintf('Projected Points is: %f', sol$optimum)
Here is a link to the data I'm using.
https://www.dropbox.com/s/d5m8jjnq32f0cpe/Week6NFLProjections.csv?dl=0
I'm also to the point where I can loop the code to create multiple lineups by constraining the objective to be at most the previous score minus 0.01. As a side note, this process slows down significantly as it keeps going (say, by lineup #50). Is this normal, and is there a more efficient way to loop this?
My real question is how I can add some more extensive constraints. In fantasy football it is useful to "pair" players from the same team, and I can't figure out how to put that into the constraints.
For a simple pairing example, how could I add a constraint so that my "optimal lineup" has the D and K from the same team? I have actually been able to work around this by combining the D+K in the CSV file, but I am interested in how I would code it in R.
A more complex pairing scenario would be to have my QB and just one of the (3) WR/(1) TE on the same team.
Another would be to make sure none of my offensive players is playing against my own defense.
Any help would be greatly appreciated. Can't seem to find an answer to this anywhere.
Try doing something similar to this; you'll just need to modify it to suit your situation. I've taken this directly from my own code. Basically, I input the players I want and create a separate data frame with them, then optimise the leftover positions and rbind the two together to create the final lineup. This loops through and gives as many lineups as the user wants.
library(Rglpk)

Inclusions <- readline("Enter players to include into optimal lineups: ")
Inclusions <- as.character(unlist(strsplit(Inclusions, ",")))
Inclusions_table <- Data[Data$Player.Name %in% Inclusions, ]
Inclusions_no <- nrow(Inclusions_table)
Data <- Data[!Data$Player.Name %in% Inclusions, ]
Lineup_no <- as.numeric(readline("How many lineups to be generated?: ")) # readline returns a string
num.players <- length(Data$Player.Name)
obj <- Data$fpts
var.types <- rep("B", num.players)
subscore <- 1000 # any value above the best possible score
Lineups <- list()
for(i in 1:Lineup_no)
{
const.mat <- rbind(as.numeric(Data$Position == "QB"),  # num QB
                   as.numeric(Data$Position == "RB"),  # min RB
                   as.numeric(Data$Position == "RB"),  # max RB
                   as.numeric(Data$Position == "WR"),  # min WR
                   as.numeric(Data$Position == "WR"),  # max WR
                   as.numeric(Data$Position == "TE"),  # min TE
                   as.numeric(Data$Position == "TE"),  # max TE
                   as.numeric(Data$Position %in% c("RB", "WR", "TE")), # num RB/WR/TE
                   as.numeric(Data$Position == "DEF"), # num DEF
                   Data$Salary,                        # salary cap
                   Data$fpts)                          # total score, kept below the previous lineup's
direction <- c("==",
">=",
"<=",
">=",
"<=",
">=",
"<=",
"==",
"==",
"<=","<")
opt_var <- subscore - 0.01
rhs <- c(1 - sum(Inclusions_table$Position == "QB"),          # QBs still needed
         max(0, 2 - sum(Inclusions_table$Position == "RB")),  # min RB
         4 - sum(Inclusions_table$Position == "RB"),          # max RB
         max(0, 2 - sum(Inclusions_table$Position == "WR")),  # min WR
         4 - sum(Inclusions_table$Position == "WR"),          # max WR
         max(0, 1 - sum(Inclusions_table$Position == "TE")),  # min TE
         2 - sum(Inclusions_table$Position == "TE"),          # max TE
         7 - sum(Inclusions_table$Position %in% c("RB", "WR", "TE")), # flex total
         1 - sum(Inclusions_table$Position == "DEF"),         # DEFs still needed
         100000 - sum(Inclusions_table$Salary),               # remaining salary
         opt_var)                                             # stay below last score
sol <- Rglpk_solve_LP(obj = obj, mat = const.mat, dir = direction, rhs = rhs,
                      types = var.types, max = TRUE)
Lineup<-data.frame(Data[sol$solution==1,])
subscore<-sum(Lineup$fpts)
Lineup<-rbind(Lineup,Inclusions_table)
Lineup<-Lineup[order(Lineup$Position),]
Salary<-sum(Lineup$Salary)
Score<-sum(Lineup$fpts)
print(Lineup)
print(Salary)
print(Score)
Lineups[[i]]<-Lineup
}
Data is my data set and looks like this for reference:
Position Player.Name Team Opponent Salary PPG fpts Pos_Rank upper lower Off_Snaps Pct_Off
1056 TE A.J. Derby Patriots Bills 5000 0 0.0000 82 0 0 NA <NA>
462 RB Aaron Ripkowski Packers Falcons 6000 1.8 1.3116 75 1.8852 0.01 22 25%
78 QB Aaron Rodgers Packers Falcons 19350 20.6 18.4292 1 19.9689 17.2 87 100%
1466 WR Adam Humphries Buccaneers Raiders 7650 8.1 9.4808 46 11.2125 7.5664 38 51%
1808 WR Albert Wilson Chiefs Colts 5000 4.3 5.6673 74 6.2438 4.78 11 21%
1252 WR Aldrick Robinson Falcons Packers 5000 3.8 2.9114 96 3.2836 2.0152 10 15%
636 RB Alex Collins Seahawks Saints 6000 2.7 1.5992 69 2.1513 0.41 1 2%
Hopefully you can modify this example to suit you.
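As for the pairing constraints in the question: they can be encoded directly as extra rows of the constraint matrix rather than by editing the CSV. Below is an untested sketch against the question's stock solver; it assumes the pos, team, opp, A, dir and b objects defined there, and that the Team and Opp columns use the same team codes. Each block adds one row per NFL team.
teams <- unique(team)

# 1) Same-team D and K: for every team, (# of its Ds picked) minus
#    (# of its Ks picked) must equal 0. Combined with the existing
#    "exactly 1 D" and "exactly 1 K" rows, this forces them to pair up.
dk.rows <- t(sapply(teams, function(tm)
  as.numeric(pos == "D" & team == tm) - as.numeric(pos == "K" & team == tm)))
A   <- rbind(A, dk.rows)
dir <- c(dir, rep("==", length(teams)))
b   <- c(b, rep(0, length(teams)))

# 2) QB stacked with a pass catcher: if a team's QB is picked, at least
#    one WR/TE from that team must be picked too. (Requiring exactly one
#    would take a matching "<=" row with a big-M term.)
qb.rows <- t(sapply(teams, function(tm)
  as.numeric(pos %in% c("WR", "TE") & team == tm) -
    as.numeric(pos == "QB" & team == tm)))
A   <- rbind(A, qb.rows)
dir <- c(dir, rep(">=", length(teams)))
b   <- c(b, rep(0, length(teams)))

# 3) No offensive player facing my own D: if a team's D is picked, the
#    big-M coefficient of 8 forbids picking anyone whose opponent is
#    that team (a 9-man lineup holds at most 8 non-D players).
vs.rows <- t(sapply(teams, function(tm)
  as.numeric(pos != "D" & opp == tm) +
    8 * as.numeric(pos == "D" & team == tm)))
A   <- rbind(A, vs.rows)
dir <- c(dir, rep("<=", length(teams)))
b   <- c(b, rep(8, length(teams)))
After appending these rows, the same Rglpk_solve_LP call should work unchanged, since it just sees a taller constraint matrix.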
I have an Excel file with data for 2 DBs and multiple measures (Msr) for each. There is classic ratio data, Num/Denom = Rate, for each. Can anybody suggest what visualization I can use in R to graphically find big differences (let's say 10%+) in each measure between the Test and X1 databases?
So we compare Denom, Num, and Rate between lines 1 and 2, then between lines 3 and 4, then 5 and 6, etc.
I tried to do this in Excel but read that R could be much better for this purpose. So far most of the paired visualizations I can find use scatter plots; I need something more traditional. For example, in my sample we could flag the X1 SRB rate as low.
In my example I have 3 measures; in reality there could be 30. Thanks much for the info.
M
db <- c('test','x1','test','x1','test','x1')
msr <- c('BCS','BCS','CCS','CCS','SRB','SRB')
denom <- c(11848,11049,35836,38458,54160,56387)
num <- c(5255,6376,16908,18124,26253,15000)
rate <- c(44.35,57.71,47.18,47.13,48.47,26.6)
df <- data.frame(db,msr,denom,num,rate)
df
db msr denom num rate
1 test BCS 11848 5255 44.35
2 x1 BCS 11049 6376 57.71
3 test CCS 35836 16908 47.18
4 x1 CCS 38458 18124 47.13
5 test SRB 54160 26253 48.47
6 x1 SRB 56387 15000 26.60
If I understood correctly, this should do what you want. I reshaped the data so you have one row per msr with separate columns for each db. I used data.table for its performance.
library(data.table)
db <- c('test','x1','test','x1','test','x1')
msr <- c('BCS','BCS','CCS','CCS','SRB','SRB')
denom <- c(11848,11049,35836,38458,54160,56387)
num <- c(5255,6376,16908,18124,26253,15000)
rate <- c(44.35,57.71,47.18,47.13,48.47,26.6)
df <- data.frame(db,msr,denom,num,rate)
#set as a data.table
setDT(df)
#cast into one row per MSR - fill in with the "rate" variable
out <- dcast(df, msr ~ db, value.var = "rate")
#Compute difference
out[, test_x1_diff := test - x1]
#filter out diff >= 10
out[abs(test_x1_diff) >= 10]
#> msr test x1 test_x1_diff
#> 1: BCS 44.35 57.71 -13.36
#> 2: SRB 48.47 26.60 21.87
Created on 2019-01-11 by the reprex package (v0.2.1)
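If you also want the graphical view the question asks for, and not just the filtered table, a simple dodged bar chart makes the big gaps easy to spot. Here is a minimal sketch with ggplot2 (using the df built above; not part of the original answer):
library(ggplot2)

# One bar per db within each msr: the test-vs-x1 gaps for BCS and SRB
# stand out immediately, while the CCS bars are nearly equal
ggplot(df, aes(x = msr, y = rate, fill = db)) +
  geom_col(position = "dodge") +
  labs(x = "Measure", y = "Rate (%)", fill = "Database")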
I have two datasets, A and B, with 8 columns each. Dataset A has 942 rows and Dataset B has 5079 rows. I have to compare Dataset A and Dataset B and do fuzzy matching. If any row in Dataset B matches, I have to mark "Matched" in Dataset A in an additional column.
I'm relatively new to R and not sure how to optimize R code with lapply, mapply, or sapply instead of a for loop.
Following is my code
##############################
# Install Necessary Packages #
##############################
#install.packages("openxlsx")
#install.packages("stringdist")
#install.packages("XLConnect")
##############################
# Load Packages #
##############################
library(openxlsx)
library(stringdist)
library(XLConnect)
cmd_newleads <- read.xlsx("Src/CMD - New Leads to Load.xlsx", sheet = "Top Leads Full Data", startRow = 1, colNames = TRUE)
cmd_newleads[c("Lead_Match","Opportunity_Match")] <- ""
c4c_leads <- read.xlsx("Src/C4C - Leads.xlsx", sheet = "Leads", startRow = 1, colNames = TRUE)
#c4c_opportunities <- read.xlsx("Src/C4C - Opportunities Data 6-24-16.xlsx", sheet = "Export 06-24-2016 04.55.46 PM", startRow = 1, colNames = TRUE)
cmd_newleads_selcols <- cmd_newleads[,c("project_name","project_address","project_city","project_state_province_region_code","project_postalcode","project_country","project_sector","project_type")]
cmd_newleads_selcols[is.na(cmd_newleads_selcols)] <- ""
#rownames(cmd_newleads_selcols)
c4cleads_selcols <- c4c_leads[,c("Lead","Address1.(Lead)","City.(Lead)","Region.(Lead)","Postal.Code.(Lead)","Country.(Lead)","Sector.(Lead)","Type.(Lead)")]
c4cleads_selcols[is.na(c4cleads_selcols)] <- ""
#cmd_c4copportunities_selcols <- c4c_opportunities[,c("project_name","project_address","project_city","project_state_province_region_code","project_postalcode","project_country","project_sector","project_type")]
rcount_cmdnewleads <- nrow(cmd_newleads)
rcount_c4cleads <- nrow(c4c_leads)
#rcount_c4copportunities <- nrow(c4c_opportunities)
for(i in 1:rcount_cmdnewleads)
{
cmd_project_name <- cmd_newleads_selcols[i,1]
cmd_project_address <- cmd_newleads_selcols[i,2]
cmd_project_city <- cmd_newleads_selcols[i,3]
cmd_project_region_code <- cmd_newleads_selcols[i,4]
cmd_project_postalcode <- cmd_newleads_selcols[i,5]
cmd_project_country <- cmd_newleads_selcols[i,6]
cmd_project_sector <- cmd_newleads_selcols[i,7]
cmd_project_type <- cmd_newleads_selcols[i,8]
for(j in 1:rcount_c4cleads)
{
c4cleads_project_name <- c4cleads_selcols[j,1]
c4cleads_project_address <- c4cleads_selcols[j,2]
c4cleads_project_city <- c4cleads_selcols[j,3]
c4cleads_project_region_code <- c4cleads_selcols[j,4]
c4cleads_project_postalcode <- c4cleads_selcols[j,5]
c4cleads_project_country <- c4cleads_selcols[j,6]
c4cleads_project_sector <- c4cleads_selcols[j,7]
c4cleads_project_type <- c4cleads_selcols[j,8]
project_percent <- stringsim(cmd_project_name,c4cleads_project_name, method="dl", p=0.1)
address_percent <- stringsim(cmd_project_address,c4cleads_project_address, method="dl", p=0.1)
city_percent <- stringsim(cmd_project_city,c4cleads_project_city, method="dl", p=0.1)
region_percent <- stringsim(cmd_project_region_code,c4cleads_project_region_code, method="dl", p=0.1)
postalcode_percent <- stringsim(cmd_project_postalcode,c4cleads_project_postalcode, method="dl", p=0.1)
country_percent <- stringsim(cmd_project_country,c4cleads_project_country, method="dl", p=0.1)
sector_percent <- stringsim(cmd_project_sector,c4cleads_project_sector, method="dl", p=0.1)
type_percent <- stringsim(cmd_project_type,c4cleads_project_type, method="dl", p=0.1)
if(project_percent > 0.833 && address_percent > 0.833 && city_percent > 0.833 &&
   region_percent > 0.833 && postalcode_percent > 0.833 && country_percent > 0.833 &&
   sector_percent > 0.833 && type_percent > 0.833)
{
cmd_newleads[i,51] <- c4c_leads[j, "Lead.ID"]
break # stop scanning once a match is found, so later non-matches don't overwrite it
}
else
{
cmd_newleads[i,51] <- "New Lead"
}
}
}
Sample data for cmd_newleads_selcols and c4cleads_selcols respectively
project_name project_address project_city
1 Wynn Mystic Casino & Hotel 22 Chemical Ln Everett
2 Northpoint Complex Development East Street Cambridge
3 Northpoint Complex Development East Street Cambridge
4 Northpoint Complex Development East Street Cambridge
5 Northpoint Complex Development East Street Cambridge
6 Northpoint Complex Development East Street Cambridge
project_state_province_region_code project_postalcode
1 MA 02149
2 MA 02138
3 MA 02138
4 MA 02138
5 MA 02138
6 MA 02138
project_country project_sector project_type
1 United States of America Hospitality New Building
2 United States of America Apartments New Building
3 United States of America Apartments New Building
4 United States of America Apartments New Building
5 United States of America Apartments New Building
6 United States of America Apartments New Building
Lead Address1.(Lead) City.(Lead) Region.(Lead) Postal.Code.(Lead) Country.(Lead)
1 1 Hotel Brooklyn Bridge Park Old Fulton St & Furman St Brooklyn New York 11201 United States
2 10 Trinity Square Hotel 10 Trinity Square London # EC3P United Kingdom
3 100 Stewart 1900 1st Avenue Seattle Washington 98101 United States
4 1136 S Wabash # # # # Not assigned
5 115-129 37th Street 115-129 37th Street Union CIty New Jersey # United States
6 1418 W Addison 1418 w Addison Chicago # 60613 Not assigned
Sector.(Lead) Type.(Lead)
1 Hospitality New Building
2 Hospitality Brand Conversion
3 Hospitality New Building
4 High Rise Residential New Building
5 Developer New Building
6 High Rise Residential New Building
If you are experiencing efficiency problems, it's not because you are using a for loop. The main issue is that you are doing a lot of work for every possible combination of rows in your two data sets. Using more efficient language features might speed things up a bit, but it wouldn't change the fact that you're doing a lot of unnecessary computation.
One of the best ways to increase efficiency in data matching problems is to rule out obvious non-matches to cut down on unnecessary computations. For example, you could change your inner loop to first check some key condition; if the score is low (i.e. it's obviously a non-match) you don't need to compute similarity scores for the rest of the attributes.
For example:
for(i in 1:rcount_cmdnewleads)
{
cmd_project_name <- cmd_newleads_selcols[i,1]
...
for(j in 1:rcount_c4cleads)
{
c4cleads_project_name <- c4cleads_selcols[j,1]
project_percent <- stringsim(cmd_project_name,c4cleads_project_name, method="dl", p=0.1)
if (project_percent <= 0.833) {
# you already know that this is a non-match, so go to the next one
next
} else {
# check the rest of the values!
...
}
}
}
I'm not familiar with the R RecordLinkage package, but the Python recordlinkage package has tools for ruling out obvious non-matches early in the process to increase efficiency. Consider checking out this tutorial to learn more about speeding up record linkage by ruling out obvious non-matches.
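In R you can push the same idea further and vectorise the first-pass check entirely, since stringsim accepts vectors. A hedged sketch (untested against the real files; it assumes the two selcols data frames from the question):
library(stringdist)

# Similarity of every cmd project name against every c4c lead name at once
name_sim <- outer(cmd_newleads_selcols$project_name,
                  c4cleads_selcols$Lead,
                  stringsim, method = "dl", p = 0.1)

# Only these (i, j) pairs survive the name check and need the remaining
# seven attribute comparisons
candidates <- which(name_sim > 0.833, arr.ind = TRUE)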
You might want to look at the package RecordLinkage, which supports phonetic matching, probabilistic record linkage, and machine-learning approaches.
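For example, something along these lines; a rough sketch of the usual RecordLinkage workflow, not tested on this data, with blocking on column 6 (country) to keep the comparison space small:
library(RecordLinkage)

# Compare the two tables, blocking on country and using string
# similarity rather than exact equality for the remaining fields
pairs <- compare.linkage(cmd_newleads_selcols, c4cleads_selcols,
                         blockfld = 6, strcmp = TRUE)
pairs <- epiWeights(pairs)   # weight each candidate pair
summary(epiClassify(pairs, threshold.upper = 0.7))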
I'm moderately experienced using R, but I'm just starting to learn to write functions to automate tasks. I'm currently working on a project to run sentiment analysis and topic models of speeches from the five remaining presidential candidates and have run into a snag.
I wrote a function to do a sentence-by-sentence analysis of positive and negative sentiments, giving each sentence a score. Miraculously, it worked and gave me a dataframe with scores for each sentence.
score text
1 1 iowa, thank you.
2 2 thanks to all of you here tonight for your patriotism, for your love of country and for doing what too few americans today are doing.
3 0 you are not standing on the sidelines complaining.
4 1 you are not turning your backs on the political process.
5 2 you are standing up and fighting back.
So what I'm trying to do now is create a function that takes the scores, figures out what percentage of the total is represented by the count of each score, and then plots it using plotly. Here is the function I've written:
scoreFun <- function(x){{
tbl <- table(x)
res <- cbind(tbl,round(prop.table(tbl)*100,2))
colnames(res) <- c('Score', 'Count','Percentage')
return(res)
}
percent = data.frame(Score=rownames, Count=Count, Percentage=Percentage)
return(percent)
}
Which returns this:
saPct <- scoreFun(sanders.scores$score)
saPct
Count Percentage
-6 1 0.44
-5 1 0.44
-4 6 2.64
-3 13 5.73
-2 20 8.81
-1 42 18.50
0 72 31.72
1 34 14.98
2 18 7.93
3 9 3.96
4 6 2.64
5 2 0.88
6 1 0.44
9 1 0.44
11 1 0.44
What I had hoped it would return is a dataframe in which what ended up as the rownames is a variable called Score, followed by two columns called Count and Percentage, respectively. Then I want to plot the Score on the x-axis and Percentage on the y-axis using this code:
d <- subplot(
plot_ly(clPct, x = rownames, y=Percentage, xaxis="x1", yaxis="y1"),
plot_ly(saPct, x = rownames, y=Percentage, xaxis="x2", yaxis="y2"),
margin = 0.05,
nrows=2
) %>% layout(d, xaxis=list(title="", range=c(-15, 15)),
xaxis2=list(title="Score", range=c(-15,15)),
yaxis=list(title="Clinton", range=c(0,50)),
yaxis2=list(title="Sanders", range=c(0,50)),showlegend = FALSE)
d
I'm pretty certain I've made some obvious mistakes in my function and my plot_ly code, because clearly it's not returning the dataframe I want, and it leads to the error Error in list2env(data) : first argument must be a named list when I run the plotly code. Again, though, I'm not very experienced at writing functions, and I haven't found a similar issue when I Google, so I don't know how to fix this.
Any advice would be most welcome. Thanks!
@MLavoie, this code from the question I referenced in my comment did the trick. Many thanks!
scoreFun <- function(x){
tbl <- data.frame(table(x))
colnames(tbl) <- c("Score", "Count")
tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
return(tbl)
}
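One follow-up for the plotting step (a hedged sketch, not part of the original fix): because the function builds on table(), Score comes back as a factor and needs converting before it can serve as a numeric x-axis, and newer plotly versions want the ~ formula syntax:
library(plotly)

saPct <- scoreFun(sanders.scores$score)
saPct$Score <- as.numeric(as.character(saPct$Score)) # factor -> numeric
plot_ly(saPct, x = ~Score, y = ~Percentage, type = "bar")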
I am trying to fill in a large matrix (55,920,484 elements) in R that will eventually be symmetric (so I am actually only performing calculations for half of the matrix). The resulting values matrix is a square matrix with the same row and column names. Each value in the matrix is the result of comparing unique lists and counting the number of intersections. This data comes from a larger dataframe (427.5 MB). Here is my fastest solution so far; I am trying to get rid of the loops, which I know are slow:
for(i in 1:length(rownames(values))){
for(j in i:length(colnames(values))){
A = data[data$Stock==rownames(values)[i],"Fund"]
B = data[data$Stock==colnames(values)[j],"Fund"]
values[i, j] = length(intersect(A, B))
}
}
I have tried several other approaches such as using a database with an SQL connection, using a sparse matrix with 0s and 1s, and using the sqldf package in R.
Here is the structure of my data:
head(data)
Fund Stock Type Shares.Held Maket.Value X..of.Portfolio Rank Change.in.Shares X..Change X..Ownership
1 12 WEST CAPITAL MANAGEMENT LP GRUB CALL 500000 12100000 0.0173 12 500000 New N/A
2 12 WEST CAPITAL MANAGEMENT LP FIVE SH 214521 6886000 0.0099 15 214521 New 0
3 12 WEST CAPITAL MANAGEMENT LP SHAK SH 314114 12439000 0.0178 11 307114 4387 1
4 12 WEST CAPITAL MANAGEMENT LP FRSH SH 324120 3650000 0.0053 16 -175880 -35 2
5 12 WEST CAPITAL MANAGEMENT LP ATRA SH 393700 10398000 0.0149 14 162003 69 1
6 12 WEST CAPITAL MANAGEMENT LP ALNY SH 651000 61285000 0.0875 4 No Change 0 1
I see three problems, in order of increasing importance:
(1) You call rownames(values) and colnames(values) many times, instead of calling them just once outside of the loops. It may or may not help.
(2) You calculate A = data[data$Stock==rownames(values)[i],"Fund"] inside the innermost loop, while you should calculate it outside of that loop (it only depends on i).
(3) Most important: Your code uses only two columns of your table: Fund and Stock. I see that in your data there are many rows with the same Fund and Stock. You should eliminate this redundancy. Maybe you want to create data1 = data[,c("Fund","Stock")] and eliminate redundant rows in data1 (without a loop):
data1 = data1[order(data1[,"Fund"], data1[,"Stock"]),] # sort rows so duplicates are adjacent
len = nrow(data1)
good = c(TRUE, data1[-len,1] != data1[-1,1] | data1[-len,2] != data1[-1,2])
data1 = data1[good,] # keep one row per (Fund, Stock) pair; unique(data1) would also work
(I did not test the code above)
Maybe you want to go further and create a list which, for each fund, specifies what stocks it contains, without redundancies.
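That list can be built in one vectorised step; a one-line sketch, assuming the de-duplicated data1 from above:
# One element per fund, each holding that fund's stocks; no loop needed
stocks.by.fund <- split(data1$Stock, data1$Fund)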
PS: you can still create the list which, for each stock, specifies what funds have it:
rv = rownames(values)
len = length(rv)
fund.list = list()
for (i in 1:len)
fund.list[[i]] = data[data$Stock==rv[i],"Fund"]
for (i in 1:len) {
A = fund.list[[i]]
for (j in i:len) {
values[i, j] = length(intersect(A, fund.list[[j]]))
}
}
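For completeness, the loops can be avoided entirely with a little matrix algebra; a hedged sketch, assuming the de-duplicated data1 from above and enough RAM for the dense result. Build a 0/1 stock-by-fund incidence matrix; its tcrossprod counts, for every pair of stocks, how many funds they share:
# 0/1 incidence matrix: one row per stock, one column per fund
inc <- (table(data1$Stock, data1$Fund) > 0) * 1L

# values[i, j] = number of funds holding both stock i and stock j;
# the diagonal holds each stock's total fund count
values <- tcrossprod(inc)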
I was able to successfully run an RF model using some R code I was given. It is below, and it includes a snippet of my data too.
The only problem is that, the way the code is written, it only outputs a vector of probabilities and no data from the original test data set called "testset". So now I am trying to figure out how to output my probabilities along with the original data frame, because I couldn't find a solution online. In other words, I want the probabilities to be another column in the data set, right after my FLSAStat column, so I can then output all of it together to a CSV file.
Here's what I have:
#####################################################
# 1. SETUP DATA
#####################################################
mydata <- read.csv("train_test.csv", header=TRUE)
colnames(testset)
[1] "train" "Target" "ApptCode" "Directorate" "New_Discipline" "Series" "Adjusted.Age"
[8] "Adj.Service" "Adj.Age.Service" "HiEducLv" "Gender" "RetCd" "FLSAStat"
> head(testset)
train Target ApptCode Directorate New_Discipline Series Adjusted.Age Adj.Service Adj.Age.Service HiEducLv Gender
5909 0 NA IN Business Math Computer Science IT PSTS 54.44 10 64.44 Bachelor Male
5910 0 NA IN Computation Math Computer Science IT PSTS 51.51 15 66.51 Bachelor Male
5911 0 NA IN Physical and Life Sciences Physics PSTS 40.45 5 45.45 PHD Male
5912 0 NA IN Weapons and Complex Integ Physics PSTS 62.21 35 97.21 PHD Male
5913 0 NA IN Weapons and Complex Integ Physics PSTS 45.65 15 60.65 PHD Male
5914 0 NA FX Physical and Life Sciences Physics PSTS 36.13 5 41.12 PHD Male
RetCd FLSAStat
5909 TCP2 E
5910 TCP2 E
5911 TCP2 E
5912 TCP2 E
5913 TCP1 E
5914 TCP2 E
#create train and test sets
trainset = mydata[mydata$train == 1,]
testset = mydata[mydata$train == 0,]
#eliminate unwanted columns from train set
trainset$train = NULL
#####################################################
# 2. set the formula
#####################################################
theTarget <- "Target"
theFormula <- as.formula(paste("as.factor(",theTarget, ") ~ . "))
theFormula1 <- as.formula(paste(theTarget," ~ . "))
trainTarget = trainset[,which(names(trainset)==theTarget)]
testTarget = testset[,which(names(testset)==theTarget)]
#####################################################
# Random Forest
#####################################################
library(randomForest)
what <- "Random Forest"
FOREST_model <- randomForest(theFormula, data=trainset, ntree=500)
train_pred <- predict(FOREST_model, trainset, type="prob")[,2]
test_pred <- predict(FOREST_model, testset, type="prob")[,2]
display_results()
testID <- testset$case_id
predictions <- test_pred
submit_file = cbind(testID,predictions)
write.csv(submit_file, file="RANDOM4.csv", row.names = FALSE)
I think the problem is that I am lacking an additional line of code that merges the predictions vector back into testset. I'm guessing this would go somewhere before the third-to-last line of code.
Just add the predictions as a column on your dataframe like so; predict returns one probability per row of testset, in the same order, so no explicit merge is needed:
testset$Predictions <- test_pred
write.csv(testset, file="RANDOM4.csv", row.names = FALSE)
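And if the new column really needs to sit immediately after FLSAStat rather than at the end, a small reordering sketch (with this particular testset, FLSAStat is already the last column, so the result is the same):
# Move Predictions so it follows FLSAStat
other <- setdiff(names(testset), "Predictions")
testset <- testset[, append(other, "Predictions",
                            after = which(other == "FLSAStat"))]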