Choose the right analysis - r

I have a dataset in R that contains two groups, good and bad. The group good contains users that have a long lifetime, and the users in bad have a short lifetime.
So good contains game_id and game_played. For example, good$game_id==1 (game 1) has been played for good$game_played==12.5 hours.
I want to investigate whether there is a difference between good and bad, and see which game_ids account for that difference.
I have 20 game_ids, so I don't need Principal Component Analysis to reduce their number. How should one set up an analysis to see whether some game_ids make the difference between good and bad?
In R, the output for good looks like this:
game_id game_played
6 18.3
14 2.1
4 0.6
1 1.0
2 1.4
3 0.1
5 0.4
7 1.2
8 1.2
9 3.1
10 1.7
11 11.6
12 0.2
13 5.4
15 4.3
16 12.4
17 8.2
18 7.0
19 3.4
20 4.6
where game_id is the name of the game and game_played is the number of hours the game has been played in the data good. For bad we have a similar output with different values.


How to pause and restart a loop when specific conditions are met?

This is what my data looks like:
ID <- c(rep("1.1", 10), rep("1.2", 7))
behaviour <- c("stand", "still", "eat", "lie", "lick", "still", "rum",
"still", "stand", "walk", "stand", "rum", "walk", "still",
"lie", "still", "stand")
df_behav <- data.frame(ID, behaviour)
ID behaviour
1 1.1 stand
2 1.1 still
3 1.1 eat
4 1.1 lie
5 1.1 lick
6 1.1 still
7 1.1 rum
8 1.1 still
9 1.1 stand
10 1.1 walk
11 1.2 stand
12 1.2 rum
13 1.2 walk
14 1.2 still
15 1.2 lie
16 1.2 still
17 1.2 stand
I would like to build a loop which deletes "still" rows based on the following conditions:
I would like to run the loop for all events with the same "ID".
Within each ID, I would like the loop to start with the first "stand", pause if "lie" is reached, and restart as soon as there is a "stand" again, and so on.
Of course, there are more than 2 IDs in my actual data frame, which is why I am looking for an automated approach.
What my data should look like:
ID behaviour
1 1.1 stand
2 1.1 still
3 1.1 eat
4 1.1 lie
5 1.1 lick
6 1.1 rum
7 1.1 stand
8 1.1 walk
9 1.2 stand
10 1.2 rum
11 1.2 walk
12 1.2 still
13 1.2 lie
14 1.2 stand
I am quite new to R, so what I have done so far is the 'structure' around the actual code. I am not sure if a for-loop is the right approach. That's why I appreciate any kind of helpful comment or solution on my problem.
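mylist_3 <- list()
for (i in seq_along(unique(df_behav$ID))) {
  # rows belonging to the i-th ID (filter() comes from dplyr)
  animal1 <- dplyr::filter(df_behav, ID == unique(df_behav$ID)[i])
  # here I would like to put the actual code
  animal9 <- as.character(unique(df_behav$ID)[i])[1]
  mylist_3[[animal9]] <- animal1
}
# bind the per-ID data frames back together (ldply() comes from plyr)
df_behav <- plyr::ldply(mylist_3, data.frame, .id = NULL)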
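One possible reading of the desired output above is that "still" rows are kept while the animal is standing and dropped from a "lie" until the next "stand". A minimal base R sketch under that assumption follows. The names lying, keep and result are only illustrative, and the rows are assumed to be stored in order within each ID, as in the example.

```r
ID <- c(rep("1.1", 10), rep("1.2", 7))
behaviour <- c("stand", "still", "eat", "lie", "lick", "still", "rum",
               "still", "stand", "walk", "stand", "rum", "walk", "still",
               "lie", "still", "stand")
df_behav <- data.frame(ID, behaviour, stringsAsFactors = FALSE)

keep <- rep(TRUE, nrow(df_behav))  # TRUE = row survives
lying <- FALSE                     # state: are we between "lie" and the next "stand"?
for (i in seq_len(nrow(df_behav))) {
  # reset the state whenever a new ID starts
  if (i == 1 || df_behav$ID[i] != df_behav$ID[i - 1]) lying <- FALSE
  if (df_behav$behaviour[i] == "stand") lying <- FALSE
  if (df_behav$behaviour[i] == "lie") lying <- TRUE
  # assumption: only "still" rows seen while lying are dropped
  if (lying && df_behav$behaviour[i] == "still") keep[i] <- FALSE
}
result <- df_behav[keep, ]
rownames(result) <- NULL
result
```

On the example data this leaves 14 rows. If "still" rows before the first "stand" of an ID should be treated differently, the initial value of lying is the place to change.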

R: Creating an index vector

I need some help with R coding here.
The data set Glass consists of 214 rows of data in which each row corresponds to a glass sample. Each row consists of 10 columns. When viewed as a classification problem, column 10 (Type) specifies the class of each observation/instance. The remaining columns are attributes that might be used to infer column 10. Here is an example of the first row:
RI Na Mg Al Si K Ca Ba Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0.0 0.0 1
First, I cast column 10 so that it is interpreted by R as a factor instead of an integer value.
Now I need to create a vector with indices for all observations (it must have the values 1-214). This is needed to create training data for Naive Bayes. I know how to create a vector with 214 values, but not one that has specific indices for observations from a data frame. Thanks.
I'm not totally sure that I get what you're trying to do... So please forgive me if my solution isn't helpful. If your df's name is 'df', just use the dplyr package for reordering your columns and write
library(dplyr)
df['index'] <- 1:214
df <- df %>% select(index,everything())
Here's an example. So that I can post full dataframes, my dataframes will only have 10 rows...
Let's say my dataframe is:
df <- data.frame(col1 = c(2.3,6.3,9.2,1.7,5.0,8.5,7.9,3.5,2.2,11.5),
col2 = c(1.5,2.8,1.7,3.5,6.0,9.0,12.0,18.0,20.0,25.0))
So it looks like
col1 col2
1 2.3 1.5
2 6.3 2.8
3 9.2 1.7
4 1.7 3.5
5 5.0 6.0
6 8.5 9.0
7 7.9 12.0
8 3.5 18.0
9 2.2 20.0
10 11.5 25.0
If I want to add another column that just is 1,2,3,4,5,6,7,8,9,10... and I'll call it 'index' ...I could do this:
library(dplyr)
df['index'] <- 1:10
df <- df %>% select(index, everything())
That will give me
index col1 col2
1 1 2.3 1.5
2 2 6.3 2.8
3 3 9.2 1.7
4 4 1.7 3.5
5 5 5.0 6.0
6 6 8.5 9.0
7 7 7.9 12.0
8 8 3.5 18.0
9 9 2.2 20.0
10 10 11.5 25.0
Hope this will help
A base R alternative that does not hard-code the number of rows:
df$ind <- seq.int(nrow(df))
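Since the index vector is meant for building training data, here is a rough sketch of how such a vector might be used to split rows into training and test sets. The built-in iris data stands in for Glass, and the 70/30 split and the names idx, train_idx and test_idx are assumptions made for illustration only.

```r
set.seed(1)                      # make the random split reproducible
df <- iris                       # stand-in for the Glass data frame
idx <- seq_len(nrow(df))         # index vector 1..nrow(df)

# assumption: a 70/30 split; any other proportion works the same way
train_idx <- sample(idx, size = floor(0.7 * length(idx)))
test_idx <- setdiff(idx, train_idx)

train <- df[train_idx, ]         # rows used to fit the classifier
test <- df[test_idx, ]           # held-out rows for evaluation
```

For Glass, seq_len(nrow(df)) would yield 1 to 214 without typing the row count by hand.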

Separate Condition based coloring of different columns in bar-plot in R

I have the following data for a particular candidate's assessment against benchmark data (originally linked as a CSV file):
Competencies Desired Score
DRIVE 6 6.72
CUST ORIENTATION 6 6.58
INNOVATION 6 6.43
TEAM WORK 5 6.88
ANALYTICAL THINKING 6 7
LEADERSHIP 3 6.42
ASSERTIVENESS 3 6.73
PROBLEM SOLVING 4 6.73
IMPLEMENTATION & EXECUTION 6 6.85
WORKING KNOWLEDGE 6 5
BU KNOWLEDGE 3 4.58
FU KNOWLEDGE 3 4.7
KNOWLEDGE OF THE BUSINESS ENVIRONMENT 4 4.72
I have to make a horizontal bar plot like this:
http://i.stack.imgur.com/jleQf.png
where the color of the benchmark (Desired) column is fixed as blue and the color of the Score column is based on the following condition:
cols <- ifelse(import1$Score>import1$Desired,"green",
ifelse(import1$Score>=(0.96*import1$Desired) & import1$Score<import1$Desired,
"yellow", "red"))
How do I do it in bar plot so barplot takes the colors as mentioned below (I have manually input it):
barplot(t(as.matrix(import1[,1:2])),horiz=TRUE,
col=c("blue","green","blue", "green","blue", "green","blue",
"green","blue","green","blue", "green","blue","green" ,"blue",
"green","blue", "green","blue", "red","blue","green","blue", "green","blue", "green" ),cex.names=0.5,las=1,cex.axis=0.6,beside=TRUE,border=NA)
I want a condition-based approach for col.
Edit:
Additional Query
If I have multiple scores for the same set of competencies, how do I make a separate chart for each score using the condition above?
Competencies Desired Score 1 Score 2 Score 3
DRIVE 6 6.72 5.2 6.6
CUST ORIENTATION 6 6.58 6 7.6
INNOVATION 6 6.43 4.2 7
TEAM WORK 5 6.88 5.4 7.8
ANALYTICAL THINKING 6 7 4.6 7
LEADERSHIP 3 6.42 5.8 7
ASSERTIVENESS 3 6.73 4.8 6.4
PROBLEM SOLVING 4 6.73 6 6.6
IMPLEMENTATION & EXECUTION 6 6.85 6.2 6
WORKING KNOWLEDGE 6 5 3.6 5.4
BU KNOWLEDGE 3 4.58 3.8 4.4
FU KNOWLEDGE 3 4.7 4 4.6
KNOWLEDGE OF THE BUSINESS ENVIRONMENT 4 4.72 4 4.8
Since you already have the colors you want in cols you can create the vector you want by first generating enough "blue" values like this:
blues <- rep("blue",length(cols))
and then combine this vector with the other vector to get the colors you want using:
colors <- as.vector(rbind(blues,cols))
after which your plot can be created by:
barplot(t(as.matrix(import1[,2:3])),horiz=TRUE,
col=colors,cex.names=0.5,las=1,cex.axis=0.6,beside=TRUE,border=NA)
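The interleaving works because rbind stacks the two vectors as rows of a matrix and as.vector then reads that matrix column by column. A tiny illustration, using a made-up three-element cols vector:

```r
cols <- c("green", "red", "yellow")   # hypothetical score colors
blues <- rep("blue", length(cols))    # one "blue" per score color

m <- rbind(blues, cols)               # 2 x 3 matrix: row 1 blues, row 2 cols
colors <- as.vector(m)                # column-major read-out interleaves the rows
colors
# [1] "blue"   "green"  "blue"   "red"    "blue"   "yellow"
```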
Edit:
Combining all of this into a function looks like this:
calcCols <- function(df,score) {
cols <- ifelse(df[,score]>df$Desired,"green",
ifelse(df[,score]>=(0.96*df$Desired) & df[,score]<df$Desired,
"yellow", "red"))
blues <- rep("blue",length(cols))
as.vector(rbind(blues,cols))
}
After which you can create the three charts you want with the new data, plotting Desired next to the matching score column each time:
barplot(t(as.matrix(import1[,c("Desired","Score.1")])),horiz=TRUE,
col=calcCols(import1,"Score.1"),cex.names=0.5,las=1,cex.axis=0.6,beside=TRUE,border=NA)
barplot(t(as.matrix(import1[,c("Desired","Score.2")])),horiz=TRUE,
col=calcCols(import1,"Score.2"),cex.names=0.5,las=1,cex.axis=0.6,beside=TRUE,border=NA)
barplot(t(as.matrix(import1[,c("Desired","Score.3")])),horiz=TRUE,
col=calcCols(import1,"Score.3"),cex.names=0.5,las=1,cex.axis=0.6,beside=TRUE,border=NA)
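To see how the thresholds in calcCols behave, the function can be run on a tiny data frame. The demo values below are hypothetical and chosen to hit each branch. Note that, with the condition as written, a score exactly equal to Desired is neither greater than Desired nor strictly below it, so it comes out as "red".

```r
calcCols <- function(df, score) {
  cols <- ifelse(df[, score] > df$Desired, "green",
          ifelse(df[, score] >= (0.96 * df$Desired) & df[, score] < df$Desired,
                 "yellow", "red"))
  blues <- rep("blue", length(cols))
  as.vector(rbind(blues, cols))
}

# hypothetical rows: above, just below, equal to, and well below Desired
demo <- data.frame(Desired = c(6, 6, 6, 6),
                   Score.1 = c(6.72, 5.8, 6, 5))
demo_cols <- calcCols(demo, "Score.1")
demo_cols
# [1] "blue"   "green"  "blue"   "yellow" "blue"   "red"    "blue"   "red"
```

If equality should count as meeting the benchmark, changing the first comparison to >= would be one way to do it.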

Computing a "rightmost" moving average?

I would like to compute a moving average (ma) over some time series data but I would like the ma to consider the order n starting from the rightmost of my series so my last ma value corresponds to the ma of the last n values of my series. The desired function rightmost_ma would produce this output:
data <- seq(1,10)
> data
[1] 1 2 3 4 5 6 7 8 9 10
rightmost_ma(data, n=2)
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
I was reviewing the different ma possibilities e.g. package forecast and could not find how to cover this use case. Note that the critical requirement for me is to have valid non NA ma values for the last elements of the series or in other words I want my ma to produce valid results without "looking into the future".
Take a look at the rollmean function from the zoo package:
> library(zoo)
> rollmean(zoo(1:10), 2, align ="right", fill=NA)
1 2 3 4 5 6 7 8 9 10
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
You can also use rollapply:
> rollapply(zoo(1:10), width=2, FUN=mean, align = "right", fill=NA)
1 2 3 4 5 6 7 8 9 10
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
I think using stats::filter is less complicated, and might have better performance (though zoo is well written).
This:
stats::filter(1:10, c(1,1)/2, sides=1)
gives:
Time Series:
Start = 1
End = 10
Frequency = 1
[1] NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
If you don't want the result to be a ts object, use as.vector on the result.
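The stats::filter approach can be wrapped into a function with the rightmost_ma signature from the question. This is a minimal sketch assuming equal weights of 1/n over the last n values:

```r
rightmost_ma <- function(x, n) {
  # sides = 1 uses only the current and past values, so nothing "from the future"
  # is needed; the first n - 1 results are NA
  as.vector(stats::filter(x, rep(1 / n, n), sides = 1))
}

ma <- rightmost_ma(seq(1, 10), n = 2)
ma
# [1]  NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
```

Calling stats::filter with the namespace prefix avoids picking up dplyr::filter if dplyr happens to be loaded.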

Manipulating a data frame with contents from a different data frame similar to a SQL join

Say I have a data frame with the contents:
Trial Person Time
1 John 1.2
2 John 1.3
3 John 1.1
1 Bill 2.3
2 Bill 2.5
3 Bill 2.7
and another data frame with the contents:
Person Offset
John 0.5
Bill 1.0
and I want to modify the original frame based on the appropriate value from the second. I could do this easily in any other language or in SQL, and I'm sure I could manage using for loops and whatnot, but with everything else I see in R, I'm guessing it has special syntax to do this as a one-liner. If so, how? And if not, could you show how it could be done using loops? I haven't actually got around to learning looping in R yet, since it has amazing ways to simply extract and manipulate whatever values.
For reference, the output would be:
Trial Person Time
1 John 0.7
2 John 0.8
3 John 0.6
1 Bill 1.3
2 Bill 1.5
3 Bill 1.7
There are many possibilities. Here is a simple one using merge() and a simple column-wise subtraction in the enlarged data.frame:
R> DF1 <- data.frame(trial=rep(1:3,2),
+                   Person=rep(c("John","Bill"), each=3),
+                   Time=c(1.2,1.3,1.1,2.3,2.5,2.7))
R> DF2 <- data.frame(Person=c("John","Bill"), Offset=c(0.5,1.0))
R> DF <- merge(DF1, DF2)
R> DF
Person trial Time Offset
1 Bill 1 2.3 1.0
2 Bill 2 2.5 1.0
3 Bill 3 2.7 1.0
4 John 1 1.2 0.5
5 John 2 1.3 0.5
6 John 3 1.1 0.5
R> DF$NewTime <- DF$Time - DF$Offset
R> DF
Person trial Time Offset NewTime
1 Bill 1 2.3 1.0 1.3
2 Bill 2 2.5 1.0 1.5
3 Bill 3 2.7 1.0 1.7
4 John 1 1.2 0.5 0.7
5 John 2 1.3 0.5 0.8
6 John 3 1.1 0.5 0.6
R>
One-liner, where d1 and d2 are the two data frames from the question:
transform(merge(d1,d2), Time=Time - Offset, Offset=NULL)
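merge() returns the rows sorted by the key column, so the original row order is not preserved. If that order matters, a lookup with match() is a common alternative. A small sketch, with d1 and d2 rebuilt from the question's tables and lookup as an illustrative name:

```r
d1 <- data.frame(Trial = rep(1:3, 2),
                 Person = rep(c("John", "Bill"), each = 3),
                 Time = c(1.2, 1.3, 1.1, 2.3, 2.5, 2.7))
d2 <- data.frame(Person = c("John", "Bill"), Offset = c(0.5, 1.0))

# for each row of d1, the position of the same Person in d2
lookup <- match(d1$Person, d2$Person)

# subtract the matching offset in place; rows stay in their original order
d1$Time <- d1$Time - d2$Offset[lookup]
d1
```

A Person missing from d2 would give NA in lookup and therefore NA in Time, which makes unmatched rows easy to spot.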
