Newbie working on Horse Racing Database using R

I'm new to the group and to the R language.
I've written some code (below) that achieves the desired result.
However, I'm aware that I'm repeating the same lines of code, which would surely be written more efficiently with a for loop.
Also, there will be races with large numbers of horses, so I really need a for loop that runs through each horse,
i.e. num_runners = NROW(my_new_data).
my_new_data holds data on horses' previous races.
DaH is a numeric rating attached to each of a horse's previous runs, with DaH1 being the most recent and DaH6 six races back.
Code, a character, signifies the type of race the horse competed in, e.g. Flat, Fences.
I have played with for loops, i.e. for(i in 1:6), without success.
Since I am assigning to a new horse each time I would hope something such as the following would work:
horse(i) = c(my_new_data$DaH1[i],my_new_data$DaH2[i],my_new_data$DaH3[i],my_new_data$DaH4[i],my_new_data$DaH5[i],my_new_data$DaH6[i])
But I know that horse(i) is not allowed.
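The closest legal alternative to horse(i) I can think of is a list indexed with double brackets, something along these lines (an untested sketch on my part):
# untested sketch: build one vector of ratings per horse and keep them in a list
num_runners <- NROW(my_new_data)
horses <- vector("list", num_runners)
for (i in 1:num_runners) {
  horses[[i]] <- c(my_new_data$DaH1[i], my_new_data$DaH2[i], my_new_data$DaH3[i],
                   my_new_data$DaH4[i], my_new_data$DaH5[i], my_new_data$DaH6[i])
}
but I haven't managed to get from there to the racetest data frame I need.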
Or would my best strategy be to pre-define a data frame with 6 rows and 6 columns
and use two for loops to populate it by [row, column]? Something like:
final_data[i,j]
Here is the code I am presently using which creates the dataframe racetest:
horse1 = c(my_new_data$DaH1[1],my_new_data$DaH2[1],my_new_data$DaH3[1],my_new_data$DaH4[1],my_new_data$DaH5[1],my_new_data$DaH6[1])
horse2 = c(my_new_data$DaH1[2],my_new_data$DaH2[2],my_new_data$DaH3[2],my_new_data$DaH4[2],my_new_data$DaH5[2],my_new_data$DaH6[2])
horse3 = c(my_new_data$DaH1[3],my_new_data$DaH2[3],my_new_data$DaH3[3],my_new_data$DaH4[3],my_new_data$DaH5[3],my_new_data$DaH6[3])
horse4 = c(my_new_data$DaH1[4],my_new_data$DaH2[4],my_new_data$DaH3[4],my_new_data$DaH4[4],my_new_data$DaH5[4],my_new_data$DaH6[4])
horse5 = c(my_new_data$DaH1[5],my_new_data$DaH2[5],my_new_data$DaH3[5],my_new_data$DaH4[5],my_new_data$DaH5[5],my_new_data$DaH6[5])
horse6 = c(my_new_data$DaH1[6],my_new_data$DaH2[6],my_new_data$DaH3[6],my_new_data$DaH4[6],my_new_data$DaH5[6],my_new_data$DaH6[6])
horse1.code = c(my_new_data$Code1[1],my_new_data$Code2[1],my_new_data$Code3[1],my_new_data$Code4[1],my_new_data$Code5[1],my_new_data$Code6[1])
horse2.code = c(my_new_data$Code1[2],my_new_data$Code2[2],my_new_data$Code3[2],my_new_data$Code4[2],my_new_data$Code5[2],my_new_data$Code6[2])
horse3.code = c(my_new_data$Code1[3],my_new_data$Code2[3],my_new_data$Code3[3],my_new_data$Code4[3],my_new_data$Code5[3],my_new_data$Code6[3])
horse4.code = c(my_new_data$Code1[4],my_new_data$Code2[4],my_new_data$Code3[4],my_new_data$Code4[4],my_new_data$Code5[4],my_new_data$Code6[4])
horse5.code = c(my_new_data$Code1[5],my_new_data$Code2[5],my_new_data$Code3[5],my_new_data$Code4[5],my_new_data$Code5[5],my_new_data$Code6[5])
horse6.code = c(my_new_data$Code1[6],my_new_data$Code2[6],my_new_data$Code3[6],my_new_data$Code4[6],my_new_data$Code5[6],my_new_data$Code6[6])
racetest = data.frame(horse1,horse1.code,horse2,horse2.code, horse3, horse3.code,
horse4,horse4.code,horse5,horse5.code, horse6, horse6.code)
Thanks in advance for any help that can be offered!
Graham

Using loops in R is usually not the best approach. Still, I will give you something that might work.
There are two possible approaches I see here; I will address the simpler one:
If the columns are ordered such that columns 1:6 are named DaH1 to DaH6 and columns 7:12 are Code1 to Code6, then in this case:
library(magrittr)
temp <- cbind(my_new_data[, 1:6] %>% t,
              my_new_data[, 7:12] %>% t)
Odd = seq(1, 12, 2)
my_new_data[, Odd] = temp[, 1:6]
my_new_data[, -Odd] = temp[, 7:12]
#cleanup
rm(temp,Odd)
my_new_data should now contain your desired output. Before you run this, make sure your data is backed up inside another object as this is untested code.

Actually, we want to reshape the data from one wide format into a different wide format. But first let's look at your desired for loop approach, to understand what's going on.
Using a loop
For the loop we'll need two index variables, i and j.
## initialize matrix with dimnames
racetest <- matrix(NA, 3, 6,
                   dimnames=list(c("DaH1", "DaH2", "DaH3"),
                                 c("horse1", "horse1.code", "horse2", "horse2.code",
                                   "horse3", "horse3.code")))
## loop
for (i in 0:2) {
  for (j in 1:3) {
    racetest[j, 1:2+2*i] <- unlist(my_new_data[i+1, c(j, j+3)])
  }
}
#      horse1 horse1.code horse2 horse2.code horse3 horse3.code
# DaH1      1           1      2           2      3           3
# DaH2      1           1      2           2      3           3
# DaH3      1           1      2           2      3           3
For loops are often discouraged in R because they can be slow and don't use the vectorized features of the language. Moreover, they can be tricky to program.
Transposing column sets
We could also take a different approach. Actually we want to transpose the DaH* and Code* column sets (identifiable using grep) and bring the resulting columns into the appropriate order by sorting on the last character of their names (extracted with substring and nchar).
rownames(my_new_data) <- paste0("horse.", seq(nrow(my_new_data)))
rr <- data.frame(DaH=t(my_new_data[, grep("DaH", names(my_new_data))]),
Code=t(my_new_data[, grep("Code", names(my_new_data))]))
rr <- rr[order(substring(names(rr), nchar(names(rr))))]
rr
#      DaH.horse.1 Code.horse.1 DaH.horse.2 Code.horse.2 DaH.horse.3 Code.horse.3
# DaH1           1            1           2            2           3            3
# DaH2           1            1           2            2           3            3
# DaH3           1            1           2            2           3            3
Reshaping data
Last but not least, we actually want to reshape the data. For this we give the data set an ID variable.
my_new_data <- transform(my_new_data, horse=1:nrow(my_new_data))
First, we reshape the data into "long" format, using the new ID variable horse and putting the two varying column sets into a list.
rr1 <- reshape(my_new_data, idvar="horse", varying=list(1:3, 4:6), direction="long", sep="",
v.names=c("DaH", "Code"))
rr1
#     horse time DaH Code
# 1.1     1    1   1    1
# 2.1     2    1   2    2
# 3.1     3    1   3    3
# 1.2     1    2   1    1
# 2.2     2    2   2    2
# 3.2     3    2   3    3
# 1.3     1    3   1    1
# 2.3     2    3   2    2
# 3.3     3    3   3    3
Then, in order to get the desired wide format, what we want is to swap idvar and timevar, where our new idvar is "time" and our new timevar is "horse".
reshape(rr1, timevar="horse", idvar="time", direction= "wide")
#     time DaH.1 Code.1 DaH.2 Code.2 DaH.3 Code.3
# 1.1    1     1      1     2      2     3      3
# 1.2    2     1      1     2      2     3      3
# 1.3    3     1      1     2      2     3      3
Benchmark
The benchmark reveals that, of these three approaches, transposing the matrices is the fastest, while the for loop is by far the slowest.
# Unit: microseconds
#       expr      min        lq      mean   median        uq       max neval cld
#    forloop 7191.038 7373.5890 8381.8036 7576.678 7980.4320 46677.324   100   c
#  transpose  620.748  656.0845  707.7248  692.953  733.1365   944.773   100  a
#    reshape 2791.710 2858.6830 3013.8372 2958.825 3118.4125  3871.960   100  b
Toy data:
my_new_data <- data.frame(DaH1=1:3, DaH2=1:3, DaH3=1:3, Code1=1:3, Code2=1:3, Code3=1:3)
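For reference, here is a sketch of how such timings could be produced with the microbenchmark package, wrapping the three approaches shown above into functions; the wrapper names (forloop, transpose, reshape2wide) are placeholders for this sketch and are not part of the original answer. It uses the toy data my_new_data defined just above.
library(microbenchmark)

forloop <- function(d) {
  racetest <- matrix(NA, 3, 6,
                     dimnames=list(c("DaH1", "DaH2", "DaH3"),
                                   c("horse1", "horse1.code", "horse2", "horse2.code",
                                     "horse3", "horse3.code")))
  for (i in 0:2) for (j in 1:3) racetest[j, 1:2+2*i] <- unlist(d[i+1, c(j, j+3)])
  racetest
}

transpose <- function(d) {
  rownames(d) <- paste0("horse.", seq(nrow(d)))
  rr <- data.frame(DaH=t(d[, grep("DaH", names(d))]),
                   Code=t(d[, grep("Code", names(d))]))
  rr[order(substring(names(rr), nchar(names(rr))))]
}

reshape2wide <- function(d) {
  d <- transform(d, horse=1:nrow(d))
  rr1 <- reshape(d, idvar="horse", varying=list(1:3, 4:6), direction="long",
                 sep="", v.names=c("DaH", "Code"))
  reshape(rr1, timevar="horse", idvar="time", direction="wide")
}

microbenchmark(forloop   = forloop(my_new_data),
               transpose = transpose(my_new_data),
               reshape   = reshape2wide(my_new_data))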

Related

Combine rows of data frame in R using colMeans?

I'm impressed by the number of "how to combine rows/columns" threads, but even more by the fact that none of them was particularly helpful, or at least not applicable to my issue.
My data look like this:
MyData<-data.frame("id" = c("a","a","b"),
"value1_1990" = c(5,NA,1),
"value2_1990" = c(5,NA,2),
"value1_2000" = c(2,1,1),
"value2_2000" = c(2,1,2),
"value1_2010" = c(NA,9,1),
"value2_2010" = c(NA,9,2))
What I want to do is to combine the two rows where id=="a" for columns MyData[,(2:7)] using base R's colMeans.
What it looks like:
  id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1  a           5           5           2           2          NA          NA
2  a          NA          NA           1           1           9           9
3  b           1           2           1           2           1           2
What I need:
  id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1  a           5           5         1.5         1.5           9           9
2  b           1           2           1           2           1           2
What I tried (among numerous other things):
MyData[nrow(MyData)+1, 2:7] = colMeans(MyData[which(MyData$id=="a"),(2:7)],na.rm=T) # to combine values from rows where id=="a"
MyData$id<-ifelse(is.na(MyData$id),"NewRow",MyData$id) # to replace "<NA>" in the id-column of the newly created row by "NewRow".
This works, except for the fact that...
...it turns all other existing id's into numeric values (and I don't want to let the second line of code -- the ifelse-statement -- touch any of the existing id's, which is why I wrote else==MyData$id).
...this is not particularly fancy code. Is there a one-line solution that does the trick? I saw other approaches using aggregate(), but that didn't work for me.
You can try using dplyr:
library(dplyr)
Possible solution:
MyData %>% group_by(id) %>% summarise_all(funs(mean(., na.rm = TRUE)))
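On the example data this should reproduce the desired result; the expected output is written out below for convenience (not copied from the original answer, and the exact print format depends on your dplyr version):
# # A tibble: 2 x 7
#   id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
# 1  a           5           5         1.5         1.5           9           9
# 2  b           1           2           1           2           1           2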

R: change one value every row in big dataframe

I just started working with R for my master's thesis, and up to now all my calculations have worked out, as I read a lot of questions and answers here (it's a lot of trial and error, but that's OK).
Now I need to write some more sophisticated code and I can't find a way to do it.
This is the situation: I have multiple sub-data-sets with a lot of entries, but they are all structured in the same way. In one of them (50,000 entries) I want to change just one value in every row. The new value should be the existing entry plus a few values from another sub-data-set (140,000 entries) where the 'ID' variable is the same.
As this is the third day I've been trying to solve this, I have already found and tested for and apply, but both run for hours (cancelled after three hours).
Here is an example of one of my attempts (with for):
for (i in 1:50000) {
  Entry_ID <- Sub02[i,4]
  SUM_Entries <- sum(Sub03$Source==Entry_ID)
  Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # The Entry_ID/Source is a character
  Value1 <- as.numeric(Entries_w_ID$VAL1)
  SUM_Value1 <- sum(Value1)
  Value2 <- as.numeric(Entries_w_ID$VAL2)
  SUM_Value2 <- sum(Value2)
  OLD_Val1 <- Sub02[i,13]
  OLD_Val <- as.numeric(OLD_Val1)
  NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val
  Sub02[i,13] <- NEW_Val
}
I know this might be silly code, but that's the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get along with my thesis.
Thank you!
Here's an example of my data-structure:
Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9
The output I expect is an updated value 'VAL9' in every row.
From what I understood so far, you need 2 things:
sum up some values in one dataset
add them to another dataset, using an ID variable
Besides what @yoland already contributed, I would suggest breaking it down into two separate tasks. Consider these two datasets:
a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
# x id
# 1 1 a
# 2 2 b
b = data.frame(values = as.character(1:4), otherid = letters[1:2],
stringsAsFactors = FALSE)
sapply(b, class)
# values otherid
# "character" "character"
values is character now, so we need to convert it to numeric:
b$values = as.numeric(b$values)
sapply(b, class)
# values otherid
# "numeric" "character"
Then sum up the values in b (grouped by otherid):
library(dplyr)
b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
# otherid sum_values
# <chr> <dbl>
# 1 a 4
# 2 b 6
Then join it with a - note that identifiers are specified in c():
ab = left_join(a, b, by = c("id" = "otherid"))
ab
# x id sum_values
# 1 1 a 4
# 2 2 b 6
We can then add the result of the sum from b to the variable x in a:
ab$total = ab$x + ab$sum_values
ab
# x id sum_values total
# 1 1 a 4 5
# 2 2 b 6 8
From what I understand, you want to create a new variable that uses information from two different data sets indexed by the same ID. The easiest way to do this is probably to join the data sets together (if you need to save memory, just join the columns you need). I found dplyr's join functions very handy for these cases (explained neatly here). Once you have joined the data sets into one, it should be easy to create the new columns you need, e.g.: df$new <- df$old1 + df$old2
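To connect this to the data in the question, a rough, untested sketch of the join-then-mutate version might look like the following. The column names ID, VAL9, Source, VAL1 and VAL2 are taken from the question's example; note this uses an exact match on Source, whereas the original loop mixed exact matching with grepl, so the matching rule may need adjusting:
library(dplyr)

sums <- Sub03 %>%
  group_by(Source) %>%
  summarise(n_entries = n(),                    # how many Sub03 rows share this Source
            sum_val1  = sum(as.numeric(VAL1)),
            sum_val2  = sum(as.numeric(VAL2)))

Sub02 <- Sub02 %>%
  left_join(sums, by = c("ID" = "Source")) %>%
  mutate(VAL9 = as.numeric(VAL9) +
           coalesce(n_entries, 0L) + coalesce(sum_val1, 0) + coalesce(sum_val2, 0)) %>%
  select(-n_entries, -sum_val1, -sum_val2)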

Use value in new variable name

I am trying to build a for loop which will step through each site, for that site calculate frequencies of a response, and put those results in a new data frame. Then after the loop I want to be able to combine all of the site data frames so it will look something like:
Site Genus Freq
1 A 50
1 B 30
1 C 20
2 A 70
2 B 10
2 C 20
But to do this I need my names (of vectors, data frames) to change each time through the loop. I think I can do this using the SiteNum variable, but how do I insert it into new variable names? The way I tried (below) treats it like part of the string and doesn't insert the value into the name.
I feel like what I want to use is a placeholder %, but I don't know how to do that with variable names.
SiteNum <- 1
for (Site in CoralSites){
  Csub_SiteNum <- subset(dfrmC, Site==CoralSites[SiteNum])
  CGrfreq_SiteNum <- numeric(length(CoralGenera))
  for (Genus in CoralGenera){
    CGrfreq_SiteNum[GenusNum] <- mean(dfrmC$Genus == CoralGenera[GenusNum])*100
    GenusNum <- GenusNum + 1
  }
  names(CGrfreq_SiteNum) <- c(CoralGenera)
  Site_SiteNum <- c(Site)
  CG_SiteNum <- data.frame(CoralGenera,CGrfreq_SiteNum,Site_SiteNum)
  SiteNum <- SiteNum + 1
}
Your question as stated asks how you can create a bunch of variables, e.g. CGrfreq_1, CGrfreq_2, ..., where the name of the variable indicates the site number that it corresponds to (1, 2, ...). While you can do such a thing with functions like assign, it is not good practice for a few reasons:
It makes the code that generates the variables more complicated, because it will be littered with calls to assign, get and paste0 (as sketched just after this list).
It makes your data more difficult to manipulate afterwards -- you'll need to (either manually or programmatically) identify all the variables of a certain type, grab their values with get or mget, and then do something with them.
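To make that concrete, here is a minimal sketch of what that discouraged assign/get pattern would look like, using the objects from your question (illustration only, not a recommendation):
# Illustration only: creates CGrfreq_1, CGrfreq_2, ... in the global environment
for (SiteNum in seq_along(CoralSites)) {
  Csub <- subset(dfrmC, Site == CoralSites[SiteNum])
  assign(paste0("CGrfreq_", SiteNum),
         sapply(CoralGenera, function(g) mean(Csub$Genus == g) * 100))
}
# ...and later you have to collect them all again with mget()
all_freqs <- mget(paste0("CGrfreq_", seq_along(CoralSites)))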
Instead, you'll find it easier to work with other R functions that perform the aggregation for you. In this case you're looking to generate, for each Site/Genus pairing, the percentage of data points at that site with that genus value. This can be done in a few lines of code with the aggregate function:
# Sample data:
(dat <- data.frame(Site=c(rep(1, 5), rep(2, 5)), Genus=c(rep("A", 3), rep("B", 6), "A")))
# Site Genus
# 1 1 A
# 2 1 A
# 3 1 A
# 4 1 B
# 5 1 B
# 6 2 B
# 7 2 B
# 8 2 B
# 9 2 B
# 10 2 A
# Calculate frequencies
dat$Freq <- 1
res <- aggregate(Freq~Genus+Site, data=dat, sum)
res$Freq <- 100 * res$Freq / table(dat$Site)[as.character(res$Site)]
res
# Genus Site Freq
# 1 A 1 60
# 2 B 1 40
# 3 A 2 20
# 4 B 2 80
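For comparison (my own addition, not part of the original answer), a dplyr version of the same frequency calculation could look like this:
library(dplyr)

res_dplyr <- dat %>%
  count(Site, Genus) %>%              # rows per Site/Genus pair, stored in column n
  group_by(Site) %>%
  mutate(Freq = 100 * n / sum(n)) %>% # convert counts to percentages within each site
  ungroup() %>%
  select(Site, Genus, Freq)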

Programmatically generating a list of columns to be assigned to data.table with `:=` syntax

In data.table, I can generate a list of new columns that are immediately assigned to the table using the `:=` syntax, like so:
x <- data.table(x1=1:5, x2=1:5)
x[, `:=` (x3=x1+2, x4=x2*3)]
Alternatively, I could have done the following:
x[, c("x3","x4") := list(x1+2, x2*3)]
I would like to do something like the first method, but have the right hand side of the assignment statement be built up automatically using a custom function. For example, suppose I want a function that will accept a set of column names, then generate new columns that are the mean of the given columns, with each new column name being the original column name plus some suffix. For example,
x[, `:=` MEAN(x1,x2)]
would yield the same result as
x[, `:=` (x1_mean=mean(x1), x2_mean=mean(x2))]
Is this possible in data.table? I realize this is possible if I'm willing to pass in a list of column names like in the c("x3","x4") := ... example, but I want to avoid this so I don't have to write as much code.
Just refer to the function by name:
myfun <- "mean"
x[,paste(names(x),myfun,sep="_"):=lapply(.SD,myfun)]
# x1 x2 x1_mean x2_mean
# 1: 1 1 3 3
# 2: 2 2 3 3
# 3: 3 3 3 3
# 4: 4 4 3 3
# 5: 5 5 3 3
Customization is straightforward:
divby2 <- function(x) x/2 # custom function
myfun <- "divby2"
mycols <- "x1" # custom columns
x[,paste(mycols,myfun,sep="_"):=lapply(.SD,myfun),.SDcols=mycols]
# x1 x2 x1_mean x2_mean x1_divby2
# 1: 1 1 3 3 0.5
# 2: 2 2 3 3 1.0
# 3: 3 3 3 3 1.5
# 4: 4 4 3 3 2.0
# 5: 5 5 3 3 2.5
We may some day have syntax like paste(.SDcols,myfun,sep="_"):=lapply(.SD,myfun), but .SDcols on the left-hand side is not supported currently.
Making a function. If you want a function to do this, there's
add_myfun <- function(DT, myfun, mycols){
  DT[, paste(mycols, myfun, sep="_") := lapply(.SD, myfun), .SDcols=mycols]
}
add_myfun(x,"median","x2")
Can a function be written that will work inside j of DT[i,j]? Maybe. But I think it's not a good idea.
Can you be sure your function will be robust to all the other uses of j, like by?
Can your function take advantage of data.table's optimization (e.g., of mean)?
Will anyone else be able to read your code?
Using [ can be slow. If you're doing this for many columns, you might be better off initializing the new columns and assigning with set.
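A rough sketch of what that set-based version could look like (my illustration of the idea, not code from the original answer):
library(data.table)

x <- data.table(x1 = 1:5, x2 = 1:5)
myfun <- "mean"
mycols <- c("x1", "x2")
newcols <- paste(mycols, myfun, sep = "_")

# set() adds each new column by reference, avoiding repeated [.data.table overhead
for (k in seq_along(mycols)) {
  set(x, j = newcols[k], value = get(myfun)(x[[mycols[k]]]))
}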

Reshape data frame from wide to long with re-occuring column names in R

I'm trying to convert a data frame from wide to long format using the melt function. The challenge is that I have multiple columns that are labeled the same. When I use the melt function, it drops the values from the repeated column. I have read similar questions and it was advised to use the reshape function, however I was not able to get it to work.
To reproduce my starting data frame:
conversion.id<-c("1", "2", "3")
interaction.num<-c("1","1","1")
interaction.num2<-c("2","2","2")
conversion.id<-as.data.frame(conversion.id)
interaction.num<-as.data.frame(interaction.num)
interaction.num2<-as.data.frame(interaction.num2)
conversion<-c(rep("1",3))
conversion<-as.data.frame(conversion)
df<-cbind(conversion.id,interaction.num, interaction.num2, conversion)
names(df)[3]<-"interaction.num"
The data frame looks like the following:
  conversion.id interaction.num interaction.num conversion
1             1               1               2          1
2             2               1               2          1
3             3               1               2          1
When I run the following melt function:
melt.df<-melt(df,id="conversion.id")
It drops the interaction.num == 2 column and looks something like this:
The data frame I want is the following:
  conversion.id conversion interaction.num
1             1          1               1
2             2          1               1
3             3          1               1
4             1          1               2
5             2          1               2
6             3          1               2
I saw the following post, but I'm not too familiar with the reshape function and wasn't able to get it to work.
How to reshape a dataframe with "reoccurring" columns?
And to add a layer of complexity, I'm looking for a method that is efficient. I need to perform this on a data frame of around 1M rows with many columns labeled the same.
Any advice would be greatly appreciated!
Here is a solution using tidyr instead of reshape2. One of the advantages is the gather_ function, which takes character vectors as inputs. So, first we can replace all the "problematic" variable names with unique names (by adding numbers to the end of each name) and then we can gather (the equivalent of melt) these specific variables. The unique names of the variables are stored in a temporary variable called "prob_var_name", which I removed at the end.
library(tidyr)
library(dplyr)
library(magrittr) # needed for equals() below
var_name <- "interaction.num"
problem_var <- df %>%
  names %>%
  equals(var_name) %>%
  which
replaced_names <- mapply(paste0,names(df)[problem_var],seq_along(problem_var))
names(df)[problem_var] <- replaced_names
df %>%
  gather_("prob_var_name", var_name, replaced_names) %>%
  select(-prob_var_name)
conversion.id conversion interaction.num
1 1 1 1
2 2 1 1
3 3 1 1
4 1 1 2
5 2 1 2
6 3 1 2
Thanks to the quoting ability of gather_, you could wrap all this into a function and set var_name to a variable. Then maybe you could use it on all of your duplicated variables?
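A sketch of what such a wrapper might look like (my own untested illustration of that suggestion, using the libraries loaded above):
# Untested sketch: melt all columns sharing `var_name` into a single value column
gather_dups <- function(df, var_name) {
  problem_var <- which(names(df) == var_name)
  names(df)[problem_var] <- paste0(names(df)[problem_var], seq_along(problem_var))
  df %>%
    gather_("prob_var_name", var_name, names(df)[problem_var]) %>%
    select(-prob_var_name)
}
gather_dups(df, "interaction.num") # applied to the original df, before the renaming above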
Here's a solution using data.table. You just have to provide the index instead of names.
require(data.table)
require(reshape2)
ans <- melt(setDT(df), measure=2:3,
value.name="interaction.num")[, variable := NULL]
# conversion.id conversion interaction.num
# 1: 1 1 1
# 2: 2 1 1
# 3: 3 1 1
# 4: 1 1 2
# 5: 2 1 2
# 6: 3 1 2
You can get the indices 2:3 by doing grep("interaction.num", names(df)).
Here's an approach in base R that should work for you:
x <- grep("interaction.num", names(df)) ## as suggested by Arun
## Make more friendly names for reshape
names(df)[x] <- paste(names(df)[x], seq_along(x), sep = "_")
## Reshape
reshape(df, direction = "long",
idvar=c("conversion.id", "conversion"),
varying = x, sep = "_")
# conversion.id conversion time interaction.num
# 1.1.1 1 1 1 1
# 2.1.1 2 1 1 1
# 3.1.1 3 1 1 1
# 1.1.2 1 1 2 2
# 2.1.2 2 1 2 2
# 3.1.2 3 1 2 2
Another possibility is stack instead of reshape:
x <- grep("interaction.num", names(df)) ## as suggested by Arun
cbind(df[-x], stack(lapply(df[x], as.character)))
The lapply(df[x], as.character) may not be necessary depending on if your values are actually numeric or not. The way you created this sample data, they were factors.
