Pair-wise manipulating rows in a data.frame in R

I have data on several thousand US basketball players over multiple years.
Each basketball player has a unique ID. For each player it is known which team they play for and at which position in a given year, much like the mock data df below:
df <- data.frame(id = c(rep(1:4, times = 2), 1),
                 year = c(1, 1, 2, 2, 3, 4, 4, 4, 5),
                 team = c(1, 2, 3, 4, 2, 2, 4, 4, 2),
                 position = c(1, 2, 3, 4, 1, 1, 4, 4, 4))
> df
  id year team position
1  1    1    1        1
2  2    1    2        2
3  3    2    3        3
4  4    2    4        4
5  1    3    2        1
6  2    4    2        1
7  3    4    4        4
8  4    4    4        4
9  1    5    2        4
What is an efficient way to manipulate df into new_df below?
> new_df
  id move time position.1 position.2 year.1 year.2
1  1    0    2          1          1      1      3
2  2    1    3          2          1      1      4
3  3    0    2          3          4      2      4
4  4    1    2          4          4      2      4
5  1    0    2          1          4      3      5
In new_df the first occurrence of each basketball player is compared to the second occurrence, recording whether the player switched teams (move) and how long it took the player to make the switch (time).
Note:
In the real data some basketball players occur more than twice and can play for multiple teams and on multiple positions.
In such a case a new row in new_df is added that compares each additional occurrence of a player with only the previous occurrence.
Edit: I don't think this is a simple reshape exercise, for the reasons mentioned in the previous two sentences. To clarify this, I've added an additional occurrence of player ID 1 to the mock data.
Any help is most welcome and appreciated!

# occurrence index (assumes every id appears the same number of times, in occurrence order)
s <- table(df$id)
df$time <- rep(1:max(s), each = length(s))
# reshape to wide: one set of year/team/position columns per occurrence
df1 <- reshape(df, idvar = "id", dir = "wide")
# move = 1 when the two team entries match; time = years between occurrences
transform(df1, move = +(team.1 == team.2), time = year.2 - year.1)
  id year.1 team.1 position.1 year.2 team.2 position.2 move time
1  1      1      1          1      3      2          1    0    2
2  2      1      2          2      4      2          1    1    3
3  3      2      3          3      4      4          4    0    2
4  4      2      4          4      4      4          4    1    2

The code below should get you to the point where the data is transposed.
You'll still have to create the move and time variables yourself; a sketch of that step follows the code.
df <- data.frame(id = rep(1:4, times = 2),
                 year = c(1, 1, 2, 2, 3, 4, 4, 4),
                 team = c(1, 2, 3, 4, 2, 2, 4, 4),
                 position = c(1, 2, 3, 4, 1, 1, 4, 4))
library(reshape2)
library(data.table)
setDT(df)  # convert to data.table
df[, rno := rank(year, ties.method = "min"), by = .(id)]  # gives the occurrence number per id
# creating the transposed (wide) dataset
Dcast_DT <- dcast(df, id ~ rno, value.var = c("year", "team", "position"))
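Not part of the original answer, but a possible continuation (a sketch that assumes each id occurs exactly twice, as in this answer's mock data): derive time and move from the wide columns, which data.table's dcast names value.var_occurrence (year_1, year_2, team_1, ...).
# sketch only: gap between the two occurrences, and whether the two team entries match
Dcast_DT[, time := year_2 - year_1]
Dcast_DT[, move := as.integer(team_1 == team_2)]  # 1 when both occurrences list the same team, as in new_df above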

This piece of code did the trick, using data.table
library(data.table)
# transform to data.table
dt <- as.data.table(df)
# sort on year
setorder(dt, year, na.last = TRUE)
# names of the new columns
new_cols <- c("time", "move", "prev_team", "prev_year", "prev_position")
# set up the new variables: compare each occurrence with the previous one per id
dt[, (new_cols) := list(year - shift(year), team != shift(team),
                        shift(team), shift(year), shift(position)), by = id]
# keep only repeated occurrences (rows that have a previous occurrence)
dt <- dt[!is.na(time)]
# outcome
dt
   id year team position time  move prev_team prev_year prev_position
1:  1    3    2        1    2  TRUE         1         1             1
2:  2    4    2        1    3 FALSE         2         1             2
3:  3    4    4        4    2  TRUE         3         2             3
4:  4    4    4        4    2 FALSE         4         2             4
5:  1    5    2        4    2 FALSE         2         3             1


Adding NA's where data is missing [duplicate]

This question already has an answer here:
Insert missing time rows into a dataframe
(1 answer)
Closed 5 years ago.
I have a dataset that look like the following
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
data.frame(id,cycle,value)
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 3 8
9 4 2 9
So there is a variable id that identifies the sample, a variable cycle that identifies the timepoint, and a variable value that holds the value at that timepoint.
As you can see, sample 3 has no cycle 2 data and sample 4 is missing cycle 1 and 3 data. What I want to know is whether there is a way, without a loop, to insert NA wherever data is missing. I would like my dataset to look like the following:
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
I am able to solve this problem with a lot of loops and if statements but the code is extremely long and cumbersome (I have many more columns in my real dataset).
Also, the number of samples I have is very large so I need something that is generalizable.
Using merge and expand.grid, we can come up with a solution. expand.grid creates a data.frame with all combinations of the supplied vectors (so you'd supply it with the id and cycle variables). By merging it to your original data (and using all.x = T, which is like a left join in SQL), the rows missing from dat are filled in with NA.
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
dat <- data.frame(id,cycle,value)
grid_dat <- expand.grid(id = 1:4,
                        cycle = 1:3)
# or you could do (HT #jogo):
# grid_dat <- expand.grid(id = unique(dat$id),
#                         cycle = unique(dat$cycle))
merge(x = grid_dat, y = dat, by = c('id','cycle'), all.x = T)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
A solution based on the package tidyverse.
library(tidyverse)
# Create example data frame
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4)
cycle <- c(1, 2, 3, 1, 2, 3, 1, 3, 2)
value <- 1:9
dt <- data.frame(id, cycle, value)
# Complete the combination between id and cycle
dt2 <- dt %>% complete(id, cycle)
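Not part of the original answer: if the newly created rows should carry something other than NA, tidyr's complete() also accepts a fill argument, e.g. (a minimal sketch):
# fill the value column of the added rows with 0 instead of NA
dt3 <- dt %>% complete(id, cycle, fill = list(value = 0))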
Here is a solution with data.table doing a cross join:
library("data.table")
d <- data.table(id = c(1,1,1,2,2,2,3,3,4), cycle = c(1,2,3,1,2,3,1,3,2), value = 1:9)
d[CJ(id=id, cycle=cycle, unique=TRUE), on=.(id,cycle)]

Deleting incomplete cases across multiple rows in RStudio

Say I have a longitudinal data set as below
ID <- c(1, 1, 2, 2, 3, 3, 4, 4)
time <- c(1, 2, 1, 2, 1, 2, 1, 2)
value <- c(7, 5, 9, 2, NA, 3, 7, NA)
mydata <- data.frame(ID, time, value)
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
In this data set, we have 4 cases with data at two time points (let's say pre and post treatment).
Something I want to do is set criteria to delete any case that is not complete for both time points. In this example, I would want to delete ID 3 (who is missing time point 1) and ID 4 (who is missing time point 2), like below:
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
I am not having much luck; I've tried variants of complete.cases() and which() to no avail.
I'm still new to R and would be hugely appreciative if anyone could help me out.
Edit: Thank you Ronak for answering my question. Looking again at my real data, I have encountered a second problem. My actual data looks more like the following:
ID <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 8)
time <- c(1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1)
value <- c(7, 5, 9, 2, NA, 3, 7, NA, 8, 9, 7, 6)
mydata <- data.frame(ID, time, value)
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
9 5 1 8
10 6 1 9
11 7 1 7
12 8 1 6
Here I would also want to remove IDs 5, 6, 7 and 8, which have an entry for time 1 but not time 2. Hopefully this makes sense.
Thanks a heap
If you switch your data to wide format (where each time point is represented as its own column), then you can use na.omit. Using dplyr and tidyr functions:
library(dplyr)
mydata <- mydata %>%
  tidyr::spread(key = time, value = value) %>%     # reformat to wide
  na.omit() %>%                                    # delete cases with missingness on any variable (i.e. any time point)
  tidyr::gather(key = "time", value = "value", -ID)  # put it back in long format
> mydata
ID time value
1 1 1 7
2 2 1 9
3 1 2 5
4 2 2 2
Note that this will work (it will keep only cases with complete data for both time 1 and time 2) even when you have a time point missing without an explicit NA present in the data, like this:
> mydata
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
9 5 1 8
10 6 1 9
11 7 1 7
12 8 1 6
You can do this easily with sqldf.
library(sqldf)
sqldf('select *
       from (select ID, count(*) as cnt
             from mydata
             where value is not null
             group by ID
             having cnt > 1) t1
       inner join mydata t2 on t1.ID = t2.ID')
You select the IDs that have a count greater than 1 and no NA among their values, and then join back with the original data.
@Ronak already provided:
mydata[!mydata$ID %in% mydata$ID[is.na(mydata$value)], ]
For the second part, you can just group over each id and filter on their frequency
k2 <- data.frame(table(mydata$ID))
k2$Var1[k2$Freq > 1]
and then do something like
mydata[mydata$ID %in% k2$Var1[k2$Freq > 1],]
See the updated answer
# Eliminates ID cases with NA
mydata = mydata[!mydata$ID %in% mydata[!complete.cases(mydata) ,]$ID, ]
library(plyr)
# counts all the IDs
cnt = count(mydata, "ID")
# Eliminates any ID that doesn't have 2 observations
mydata[mydata$ID %in% cnt[cnt$freq == 2, ]$ID, ]
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2

Match group assignments between columns

I am trying to check the accuracy rate of a clustering algorithm, with a dataframe that looks like the one here. The orig.gp refers to the original grouping, which is the "correct" group assignment. The new.gp refers to the grouping assigned by the clustering algorithm.
df <- data.frame(id = 1:9,
                 orig.gp = rep(1:3, each = 3),
                 new.gp = c(2, 2, 3, 3, 3, 1, 1, 1, 1))
df
# id orig.gp new.gp
# 1 1 1 2
# 2 2 1 2
# 3 3 1 3
# 4 4 2 3
# 5 5 2 3
# 6 6 2 1
# 7 7 3 1
# 8 8 3 1
# 9 9 3 1
What I am trying to determine is whether the same ids are assigned the same grouping as the orig.gp. The group number itself is not that important, as the number is arbitrary. Ideally, I would like to achieve something like this:
# orig.gp new.gp correct
# 1 1 2 yes
# 2 1 2 yes
# 3 1 3 no
# 4 2 3 yes
# 5 2 3 yes
# 6 2 1 no
# 7 3 1 yes
# 8 3 1 yes
# 9 3 1 yes
To illustrate, in the original grouping, group 1 consists of ids 1, 2, 3; group 2 consists of ids 4, 5, 6; group 3 consists of 7, 8, 9. In the new grouping, ids 1, 2 are correctly assigned into the same group, thus the "yes" in the correct column. I would like to determine whether the same ids are assigned into the same groups as the original groupings.
Any suggestions would be appreciated!
The way I understand your problem, it is basically one of recoding. Namely, you want to identify observations that fall on the diagonal of a crosstabulation of new.gp and orig.gp, but the values of new.gp are mislabeled.
What I propose here is basically recoding the values of new.gp based on a simple crosstabulation (see tab below). The recoding is done by taking the modal value of orig.gp for each possible value of new.gp and assuming that this mode is the correct value label. I then use recode from car to perform the recoding.
library("car")
tab <- with(df, table(new.gp, orig.gp))
tab
## orig.gp
## new.gp 1 2 3
## 1 0 1 3
## 2 2 0 0
## 3 1 2 0
df$recoded <- recode(df$new.gp,
                     paste(rownames(tab), colnames(tab)[max.col(tab)],
                           sep = '=', collapse = ';'))
df$correct <- ifelse(df$orig.gp == df$recoded, "yes", "no")
The result:
> df
orig.gp new.gp recoded correct
1 1 2 1 yes
2 1 2 1 yes
3 1 3 2 no
4 2 3 2 yes
5 2 3 2 yes
6 2 1 3 no
7 3 1 3 yes
8 3 1 3 yes
9 3 1 3 yes

Add a column for counting unique tuples in the data frame [duplicate]

This question already has answers here:
How to get frequencies then add it as a variable in an array?
(3 answers)
Closed 8 years ago.
Suppose I have the following data frame:
userID <- c(1, 1, 3, 5, 3, 5)
A <- c(2, 3, 2, 1, 2, 1)
B <- c(2, 3, 1, 0, 1, 0)
df <- data.frame(userID, A, B)
df
# userID A B
# 1 1 2 2
# 2 1 3 3
# 3 3 2 1
# 4 5 1 0
# 5 3 2 1
# 6 5 1 0
I would like to create a data frame with the same columns but with an added final column that counts how many times each unique tuple / combination of the other columns occurs. The output should look like the following:
userID A B count
1 2 2 1
1 3 3 1
3 2 1 2
5 1 0 2
The meaning is that the tuple / combination (1, 2, 2) occurs once, so count=1, while the tuple (3, 2, 1) occurs twice, so count=2. I would prefer not to use any external packages.
1) aggregate
ag <- aggregate(count ~ ., cbind(count = 1, df), length)
ag[do.call("order", ag), ] # sort the rows
giving:
userID A B count
3 1 2 2 1
4 1 3 3 1
2 3 2 1 2
1 5 1 0 2
The last line of code which sorts the rows could be omitted if the order of the rows is unimportant.
The remaining solutions use the indicated packages:
2) sqldf
library(sqldf)
Names <- toString(names(df))
fn$sqldf("select *, count(*) count from df group by $Names order by $Names")
giving:
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
4 5 1 0 2
The order by clause could be omitted if the order is unimportant.
3) dplyr
library(dplyr)
# Note: regroup() is from an older dplyr release and is now defunct;
# with current dplyr, group_by(across(everything())) does the same job.
df %>% regroup(as.list(names(df))) %>% summarise(count = n())
giving:
Source: local data frame [4 x 4]
Groups: userID, A
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
4 5 1 0 2
4) data.table
library(data.table)
data.table(df)[, list(count = .N), by = names(df)]
giving:
userID A B count
1: 1 2 2 1
2: 1 3 3 1
3: 3 2 1 2
4: 5 1 0 2
ADDED additional solutions. Also some small improvements.
Here's a fairly straightforward way (ave to the rescue!):
unique(cbind(df,
             count = ave(rep(1, nrow(df)),
                         do.call(paste, df),
                         FUN = length)))
# userID A B count
# 1 1 2 2 1
# 2 1 3 3 1
# 3 3 2 1 2
# 4 5 1 0 2
Here's a variation of the above:
unique(within(df, {
  counter <- rep(1, nrow(df))
  count <- ave(counter, df, FUN = length)
  rm(counter)
}))
# userID A B count
# 1 1 2 2 1
# 2 1 3 3 1
# 3 3 2 1 2
# 4 5 1 0 2
userID <- c(1, 1, 3, 5, 3, 5)
A <- c(2, 3, 2, 1, 2, 1)
B <- c(2, 3, 1, 0, 1, 0)
df <- data.frame(userID, A, B)
Make a quick factor of the tuples:
df$AB <- as.factor(paste(df$userID, df$A, df$B, sep = ""))  # note: sep = "" assumes single-digit values; use a real separator otherwise
No external packages: just take advantage of summary(), store it as a data frame, and then merge the counts onto the original data:
df2 <- as.data.frame(summary(df$AB))
df2 <- data.frame(x=row.names(df2), y=df2[1])
names(df2) <- c("AB", "count")
df <- merge(df, df2, by="AB", all.x=TRUE)
df$AB <- NULL
Almost final output, just has dupes:
df
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
4 3 2 1 2
5 5 1 0 2
6 5 1 0 2
Lastly, clean up dupes:
df <- df[!duplicated(df), ]
Here you go:
df
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
5 5 1 0 2
It's been a while since I've done this without sql or plyr; if you can use dplyr or another package later on, do it. Bioconductor has a lot of great sequencing packages if it starts to get more complex.
Hope this helps.
This should do the trick, even if it is a little ugly (note that splitting on "" assumes every value is a single character):
vec <- table(apply(df,1,paste,collapse=""))
df2 <- data.frame(do.call(rbind,strsplit(names(vec),"")))
names(df2) <- names(df)
df2$count <- vec
# userID A B count
#1 1 2 2 1
#2 1 3 3 1
#3 3 2 1 2
#4 5 1 0 2

Cumulative count of each value [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 2 years ago.
I want to create a cumulative counter of the number of times each value appears.
e.g. say I have the column:
id
1
2
3
2
2
1
2
3
This would become:
id count
1 1
2 1
3 1
2 2
2 3
1 2
2 4
3 2
etc...
The ave function computes a function by group.
> id <- c(1,2,3,2,2,1,2,3)
> data.frame(id,count=ave(id==id, id, FUN=cumsum))
id count
1 1 1
2 2 1
3 3 1
4 2 2
5 2 3
6 1 2
7 2 4
8 3 2
I use id==id to create a vector of all TRUE values, which get converted to numeric when passed to cumsum. You could replace id==id with rep(1,length(id)).
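For example, the rep() variant mentioned above gives the same result:
id <- c(1, 2, 3, 2, 2, 1, 2, 3)
# a constant 1 per row, cumulatively summed within each id
data.frame(id, count = ave(rep(1, length(id)), id, FUN = cumsum))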
Here is a way to get the counts:
id <- c(1,2,3,2,2,1,2,3)
sapply(1:length(id),function(i)sum(id[i]==id[1:i]))
Which gives you:
[1] 1 1 1 2 3 2 4 2
The dplyr way:
library(dplyr)
foo <- data.frame(id=c(1, 2, 3, 2, 2, 1, 2, 3))
foo <- foo %>% group_by(id) %>% mutate(count=row_number())
foo
# A tibble: 8 x 2
# Groups: id [3]
id count
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 2 2
5 2 3
6 1 2
7 2 4
8 3 2
That ends up grouped by id. If you want it not grouped, add %>% ungroup().
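For example, the same pipeline with the ungroup step appended (a small sketch of the suggestion above):
foo <- data.frame(id = c(1, 2, 3, 2, 2, 1, 2, 3))
foo <- foo %>%
  group_by(id) %>%
  mutate(count = row_number()) %>%
  ungroup()   # drop the grouping once the counter is in place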
For completeness, adding a data.table way:
library(data.table)
DT <- data.table(id = c(1, 2, 3, 2, 2, 1, 2, 3))
DT[, count := seq(.N), by = id][]
Output:
id count
1: 1 1
2: 2 1
3: 3 1
4: 2 2
5: 2 3
6: 1 2
7: 2 4
8: 3 2
The dataframe I had was too large and the accepted answer kept crashing. This worked for me:
library(plyr)
df$ones <- 1
df <- ddply(df, .(id), transform, cumulative_count = cumsum(ones))
df$ones <- NULL
Function to get the cumulative count of any vector, including a non-numeric one:
cumcount <- function(x){
  # running count of how many times each element has appeared so far
  cumcount <- numeric(length(x))
  names(cumcount) <- x
  for(i in seq_along(x)){
    cumcount[i] <- sum(x[1:i] == x[i])
  }
  return(cumcount)
}
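A small usage example on a character vector (not part of the original answer):
cumcount(c("a", "b", "a", "a", "b"))
# a b a a b
# 1 1 2 3 2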
