create new column based on values on previous rows - r

I hope somebody can help me.
I have data like this:
subject choice
1 3
2 3
3 1
4 4
5 3
6 2
7 2
8 3
Now I want to create a new column based on the value of the 'choice' column. If the value in the choice column is new (has never occurred before), the value in the new column will be 'No'; otherwise, if the value has already occurred in a previous row, the value in the new column will be 'Soc'. The new table will look like this:
subject choice newcolumn
1 3 No
2 3 Soc
3 1 No
4 4 No
5 3 Soc
6 2 No
7 2 Soc
8 3 Soc
can somebody help me? thanks in advance

Using example data
DF <- data.frame(subject = 1:8, choice = c(3, 3, 1, 4, 3, 2, 2, 3))
I would do
DF <- transform(DF, newcolumn = c("No","Soc")[duplicated(choice) + 1])
giving
subject choice newcolumn
1 1 3 No
2 2 3 Soc
3 3 1 No
4 4 4 No
5 5 3 Soc
6 6 2 No
7 7 2 Soc
8 8 3 Soc
Without transform() this would be
DF$newcolumn <- c("No","Soc")[duplicated(DF$choice) + 1]
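To see why the indexing trick above works, here is a minimal sketch that breaks it into steps. The intermediate names dup and idx are only for illustration and are not part of the original answer.

```r
choice <- c(3, 3, 1, 4, 3, 2, 2, 3)

# duplicated() is TRUE for every value that has already appeared earlier
dup <- duplicated(choice)

# Adding 1 coerces FALSE/TRUE to 0/1, giving the positions 1 and 2
idx <- dup + 1

# Position 1 picks "No", position 2 picks "Soc"
newcolumn <- c("No", "Soc")[idx]
newcolumn
```

The first element of the lookup vector is used for first occurrences and the second for repeats, so swapping the two labels would invert the result.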

Another option using duplicated and ifelse:
transform(DF, newcolumn = ifelse(!duplicated(choice),'No','Soc'))
## subject choice newcolumn
## 1 1 3 No
## 2 2 3 Soc
## 3 3 1 No
## 4 4 4 No
## 5 5 3 Soc
## 6 6 2 No
## 7 7 2 Soc
## 8 8 3 Soc

There are a bunch of ways to do this, but using bracket subsetting will teach you some useful things about R:
# Make your example reproducible
subject <- 1:8
choice <- c(3, 3, 1, 4, 3, 2, 2, 3)
d <- data.frame(subject, choice)
# Create a new column, set all the values to "No"
d$newColumn <- "No"
# Set those values for which choice is duplicated to "Soc"
d$newColumn[duplicated(d$choice)] <- "Soc"

Related

Pair-wise manipulating rows in data.frame

I have data on several thousand US basketball players over multiple years.
Each basketball player has a unique ID. It is known for what team and on which position they play in a given year, much like the mock data df below:
df <- data.frame(id = c(rep(1:4, times=2), 1),
year = c(1, 1, 2, 2, 3, 4, 4, 4,5),
team = c(1,2,3,4, 2,2,4,4,2),
position = c(1,2,3,4,1,1,4,4,4))
> df
id year team position
1 1 1 1 1
2 2 1 2 2
3 3 2 3 3
4 4 2 4 4
5 1 3 2 1
6 2 4 2 1
7 3 4 4 4
8 4 4 4 4
9 1 5 2 4
What is an efficient way to manipulate df into new_df below?
> new_df
id move time position.1 position.2 year.1 year.2
1 1 0 2 1 1 1 3
2 2 1 3 2 1 1 4
3 3 0 2 3 4 2 4
4 4 1 2 4 4 2 4
5 1 0 2 1 4 3 5
In new_df the first occurrence of each basketball player is compared to the second occurrence, recording whether the player switched teams and how long it took the player to make the switch.
Note:
In the real data some basketball players occur more than twice and can play for multiple teams and on multiple positions.
In such a case a new row in new_df is added that compares each additional occurrence of a player with only the previous occurrence.
Edit: I think this is not just a simple reshape exercise, for the reasons mentioned in the previous two sentences. To clarify this, I've added an additional occurrence of player ID 1 to the mock data.
Any help is most welcome and appreciated!
s=table(df$id)
df$time=rep(1:max(s),each=length(s))
df1 = reshape(df,idvar = "id",dir="wide")
transform(df1, move=+(team.1==team.2),time=year.2-year.1)
id year.1 team.1 position.1 year.2 team.2 position.2 move time
1 1 1 1 1 3 2 1 0 2
2 2 1 2 2 4 2 1 1 3
3 3 2 3 3 4 4 4 0 2
4 4 2 4 4 4 4 4 1 2
The code below should get you to the point where the data is transposed.
You'll still have to create the move and time variables.
df <- data.frame(id = rep(1:4, times=2),
year = c(1, 1, 2, 2, 3, 4, 4, 4),
team = c(1, 2, 3, 4, 2, 2, 4, 4),
position = c(1, 2, 3, 4, 1, 1, 4, 4))
library(reshape2)
library(data.table)
setDT(df) #convert to data.table
df[,rno:=rank(year,ties.method="min"),by=.(id)] #gives the occurrence number within each id
#creating the transposed dataset
Dcast_DT<-dcast(df,id~rno,value.var = c("year","team","position"))
This piece of code did the trick, using data.table:
library(data.table)
#transform to data.table
dtt <- as.data.table(df)
#sort on year
setorder(dtt, year, na.last=TRUE)
#indicate the names of the new columns
new_cols= c("time", "move", "prev_team", "prev_year", "prev_position")
#set up the new variables
dtt[ , (new_cols) := list(year - shift(year),team!= shift(team), shift(team), shift(year), shift(position)), by = id]
# select only repeating occurrences
dtt <- dtt[!is.na(dtt$time),]
#outcome
dtt
id year team position time move prev_team prev_year prev_position
1: 1 3 2 1 2 TRUE 1 1 1
2: 2 4 2 1 3 FALSE 2 1 2
3: 3 4 4 4 2 TRUE 3 2 3
4: 4 4 4 4 2 FALSE 4 2 4
5: 1 5 2 4 2 FALSE 2 3 1
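For readers without data.table, the same "compare each occurrence with the previous one" idea can be sketched in base R with ave(). This is only an illustration under stated assumptions: the helper lag1 is a hypothetical name, and move follows the convention of the data.table answer above (TRUE when the team changed), which is the opposite of the 0/1 coding in the question's new_df.

```r
df <- data.frame(id = c(rep(1:4, times = 2), 1),
                 year = c(1, 1, 2, 2, 3, 4, 4, 4, 5),
                 team = c(1, 2, 3, 4, 2, 2, 4, 4, 2),
                 position = c(1, 2, 3, 4, 1, 1, 4, 4, 4))

# Sort so that each player's rows are in chronological order
df <- df[order(df$id, df$year), ]

# Hypothetical helper: shift a vector down by one, padding with NA
lag1 <- function(x) c(NA, head(x, -1))

# Previous year and team within each player
df$prev_year <- ave(df$year, df$id, FUN = lag1)
df$prev_team <- ave(df$team, df$id, FUN = lag1)

df$time <- df$year - df$prev_year
df$move <- df$team != df$prev_team

# Keep only repeat occurrences (the first row per player has no predecessor)
res <- df[!is.na(df$time), ]
res
```

Because every row is compared only with its immediate predecessor, players with three or more occurrences produce one output row per additional occurrence.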

Adding NA's where data is missing [duplicate]

I have a dataset that look like the following
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
data.frame(id,cycle,value)
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 3 8
9 4 2 9
so basically there is a variable called id that identifies the sample, a variable called cycle which identifies the timepoint, and a variable called value that identifies the value at that timepoint.
As you see, sample 3 does not have cycle 2 data and sample 4 is missing cycle 1 and 3 data. What I want to know is whether there is a way to run a command, without a loop, that places NAs where there is no data. So I would like my dataset to look like the following:
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
I am able to solve this problem with a lot of loops and if statements but the code is extremely long and cumbersome (I have many more columns in my real dataset).
Also, the number of samples I have is very large so I need something that is generalizable.
Using merge and expand.grid, we can come up with a solution. expand.grid creates a data.frame with all combinations of the supplied vectors (so you'd supply it with the id and cycle variables). By merging to your original data (and using all.x = T, which is like a left join in SQL), we can fill in those rows with missing data in dat with NA.
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
dat <- data.frame(id,cycle,value)
grid_dat <- expand.grid(id = 1:4,
cycle = 1:3)
# or you could do (HT #jogo):
# grid_dat <- expand.grid(id = unique(dat$id),
# cycle = unique(dat$cycle))
merge(x = grid_dat, y = dat, by = c('id','cycle'), all.x = T)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
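To make the two steps easier to follow, here is a smaller sketch of the same expand.grid plus merge approach. The toy data frame and the names grid and filled are made up for illustration only.

```r
# Toy data: id 2 is missing cycle 2
dat <- data.frame(id = c(1, 1, 2), cycle = c(1, 2, 1), value = c(10, 20, 30))

# Step 1: every combination of the observed ids and cycles (2 x 2 = 4 rows)
grid <- expand.grid(id = unique(dat$id), cycle = unique(dat$cycle))

# Step 2: left join; combinations absent from dat get NA in value
filled <- merge(grid, dat, by = c("id", "cycle"), all.x = TRUE)
filled
```

Any additional columns in dat would be carried along by the merge and filled with NA in the same way.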
A solution based on the tidyverse (complete() comes from the tidyr package).
library(tidyverse)
# Create example data frame
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4)
cycle <- c(1, 2, 3, 1, 2, 3, 1, 3, 2)
value <- 1:9
dt <- data.frame(id, cycle, value)
# Complete the combination between id and cycle
dt2 <- dt %>% complete(id, cycle)
Here is a solution with data.table doing a cross join:
library("data.table")
d <- data.table(id = c(1,1,1,2,2,2,3,3,4), cycle = c(1,2,3,1,2,3,1,3,2), value = 1:9)
d[CJ(id=id, cycle=cycle, unique=TRUE), on=.(id,cycle)]

Double match in r

I have a huge data set in r with one row per individual. One of my columns shows a family identifier (note, sex==1, male, sex==2, female).
ind sex income hw family.id
1 1 10 6 fam.1
2 2 8 7 fam.1
3 2 15 8 fam.2
4 1 7 4 fam.3
5 2 9 5 fam.3
How can I do a "double matching" so I can match couples in the data set on the variables I am interested in? For example, individual 2 (female), married to individual 1 (male), should receive an entry in a new column with his income (same goes for hw):
ind sex income hw family.id income.male hw.male
1 1 10 6 fam.1 10 6
2 2 8 7 fam.1 8 6
3 2 15 8 fam.2 - -
4 1 7 4 fam.3 7 7
5 2 9 5 fam.3 9 7
I've said "double matching" in the title because I don't need to match only the family.ID, but I need to find a male that matches this fam.id. The reason I am doing this is because later all males will be dropped from the data set and I will remain only with rows for females.
I am sorry I can't show any code I've worked on. I've tried many approaches using match, ifelse, lapply and even unlist, but it is not worth adding them here as unfortunately I couldn't make them work.
Does anyone have a clue? We can work with either data.frames or data.tables.
You should go with the data.table package. Here is an example:
library(data.table)
dt <- data.table(ind = c(1, 2, 3, 4, 5), sex =c(1, 2, 2, 1, 2), income = c(10, 8, 15, 7, 9), hw = c(6, 7, 8, 4, 5), family.id = c('fam.1', 'fam.1', 'fam.2', 'fam.3', 'fam.3'))
setkeyv(dt, 'family.id')
dt2 <- dt[dt[sex == 1, list(family.id, income, hw)]]
It will take income and hw of males (dt[sex == 1, list(family.id, income, hw)]) and match all individuals on family.id. As a result you obtain:
ind sex income hw family.id i.income i.hw
1: 1 1 10 6 fam.1 10 6
2: 2 2 8 7 fam.1 10 6
3: 4 1 7 4 fam.3 7 4
4: 5 2 9 5 fam.3 7 4
columns with prefix i. containing values of males for every family. Note that if no male is present you will not receive any row. If you still need this you can do:
dt2 <- merge(dt, dt[sex == 1, list(family.id, income, hw)], by = 'family.id', suffixes = c('', '.i'), all = TRUE)
to receive
family.id ind sex income hw income.i hw.i
1: fam.1 1 1 10 6 10 6
2: fam.1 2 2 8 7 10 6
3: fam.2 3 2 15 8 NA NA
4: fam.3 4 1 7 4 7 4
5: fam.3 5 2 9 5 7 4
Later when you need to drop male data you do:
dt2[sex == 2]
Let's assume that the dataframe is named 'dat'. You can merge the males and females by family.id with the merge function. Your proposed answer didn't make sense to me or to the other commenters, but you can reassign "income" or "hw" within this new object.
> merge( dat[ dat$sex==1, ], dat[dat$sex==2,] , by="family.id")
family.id ind.x sex.x income.x hw.x ind.y sex.y income.y hw.y
1 fam.1 1 1 10 6 2 2 8 7
2 fam.3 4 1 7 4 5 2 9 5
To follow up on my comment:
require(data.table)
dt[dt[sex == 1L], c("i.m", "hw.m") := .(i.income, i.hw), on="family.id"][]
Extract the row indices where sex == 'male' for each family.id and add two columns by reference with the corresponding income and hw values.
where dt is:
dt = fread('ind sex income hw family.id
1 1 10 6 fam.1
2 2 8 7 fam.1
3 2 15 8 fam.2
4 1 7 4 fam.3
5 2 9 5 fam.3')
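Since the question mentions trying match, here is a minimal base R sketch of the same lookup. It assumes at most one male per family.id (match returns the first hit only); the names dat, males, idx and females are illustrative.

```r
dat <- data.frame(ind = 1:5,
                  sex = c(1, 2, 2, 1, 2),
                  income = c(10, 8, 15, 7, 9),
                  hw = c(6, 7, 8, 4, 5),
                  family.id = c("fam.1", "fam.1", "fam.2", "fam.3", "fam.3"))

# Rows for males only (sex == 1)
males <- dat[dat$sex == 1, ]

# For every individual, the row of the male in the same family (NA if none)
idx <- match(dat$family.id, males$family.id)

dat$income.male <- males$income[idx]
dat$hw.male <- males$hw[idx]

# Later, keep only the females
females <- dat[dat$sex == 2, ]
females
```

Families without a male, such as fam.2 here, end up with NA in the new columns rather than being dropped.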

Gather ragged data frame into key-value columns

I recently discovered how to create ragged data frames using the I function, but am having a hard time integrating them with tidyr, ggplot2 and the rest of the Hadleyverse. More specifically, how do you gather a column containing named vectors into key-value columns?
Suppose I create a data frame like this
make.vector <- function(length.out){
x <- sample(9, length.out)
names(x) <- switch(length.out,
"Alice",
c("Bob", "Charlie"),
c("Dave", "Erin", "Frank"),
c("Gwen", "Harold", "Inez", "James"))
x
}
mydf <- data.frame(Game = gl(3, 3, labels=LETTERS[1:3]),
Set = rep(1:3, 3),
Score = I(lapply(rep(2:4, each=3), make.vector)))
producing
> print(mydf)
Game Set Score
1 A 1 8, 3
2 A 2 2, 8
3 A 3 3, 8
4 B 1 1, 5, 4
5 B 2 2, 3, 5
6 B 3 2, 8, 5
7 C 1 7, 2, 3, 4
8 C 2 1, 6, 3, 7
9 C 3 6, 9, 3, 7
The data frame can be manipulated with dplyr and tidyr in a straightforward manner as long as the results are of the expected length.
mydf %>%
mutate(nPlayers = sapply(Score, length))
mydf %>%
group_by(Game) %>%
summarize(TotalScore = list(Reduce("+", Score)))
However, I cannot figure out how to create multiple rows of result for each original row. Suppose I want to create the following data frame by manipulating mydf:
Game Set Player Score
1 A 1 Bob 8
2 A 1 Charlie 3
3 A 2 Bob 2
4 A 2 Charlie 8
5 A 3 Bob 3
6 A 3 Charlie 8
7 B 1 Dave 1
8 B 1 Erin 5
9 B 1 Frank 4
10 B 2 Dave 2
...
The only tool I know for doing so would be the gather function of the tidyr package, but it doesn't seem to play very well with non-atomic data.
mydf %>%
mutate(Player = lapply(Score, names)) %>%
gather(P = Player, S = Score)
I guess I could hack together a solution (as done in similar previous questions [1][2]),
cbind(
mydf[rep(1:nrow(mydf), sapply(mydf$Score, length)),
c("Game", "Set")],
data.frame(
Player = unlist(lapply(mydf$Score, names)),
Score = unlist(mydf$Score)
)
)
but I have a feeling I will have a hard time digesting it if I look back at the code next week. Is there an "official" or at least smarter way to do this? Otherwise I'll make a general function for it and add it to my personal library.
Update
In the light of David's answer below I figured out that the same result can be achieved with dplyr too.
mydf %>%
group_by(Game, Set) %>%
do(with(., data.frame(Player = names(unlist(Score)),
Score = unlist(Score))))
# Game Set Player Score
# 1 A 1 Bob 8
# 2 A 1 Charlie 6
# 3 A 2 Bob 7
# 4 A 2 Charlie 6
# 5 A 3 Bob 5
# 6 A 3 Charlie 8
# 7 B 1 Dave 1
# 8 B 1 Erin 9
# 9 B 1 Frank 3
# 10 B 2 Dave 8
# .. ... ... ... ...
# Warning message:
# In rbind_all(out[[1]]) : Unequal factor levels: coercing to character
I would try unlisting by group using data.table. You can run this only once per each group while storing it in a temporary variable using curly brackets (as you would do within a function) within the jth expression
library(data.table)
setDT(mydf)[, {
temp <- unlist(Score)
.(Player = names(temp), Score = temp)
}, by = .(Game, Set)]
# Game Set Player Score
# 1: A 1 Bob 2
# 2: A 1 Charlie 9
# 3: A 2 Bob 6
# 4: A 2 Charlie 3
# 5: A 3 Bob 2
# 6: A 3 Charlie 8
# 7: B 1 Dave 1
# 8: B 1 Erin 6
# 9: B 1 Frank 5
# 10: B 2 Dave 3
#...
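If no package is available, the cbind approach from the question can be wrapped in a small general function, as the question suggests. The sketch below is only one possible shape for it: unnest_named and its arguments are hypothetical names, and the toy data uses fixed scores instead of sample() so the result is reproducible.

```r
# Hypothetical helper: expand a list column of named vectors into long format
unnest_named <- function(df, col, key) {
  vals <- unclass(df[[col]])   # drop the AsIs class added by I()
  n <- lengths(vals)           # number of elements per original row
  # Repeat each original row once per element, without the list column
  out <- df[rep(seq_len(nrow(df)), n), setdiff(names(df), col), drop = FALSE]
  out[[key]] <- unlist(lapply(vals, names))
  out[[col]] <- unlist(vals, use.names = FALSE)
  rownames(out) <- NULL
  out
}

toy <- data.frame(Game = c("A", "B"), Set = c(1, 1))
toy$Score <- I(list(c(Bob = 8, Charlie = 3), c(Dave = 1, Erin = 5, Frank = 4)))

long <- unnest_named(toy, "Score", "Player")
long
```

The function assumes every vector in the list column is fully named; unnamed elements would leave the key column shorter than the value column and the assignment would fail.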

Match group assignments between columns

I am trying to check the accuracy rate of a clustering algorithm, with a dataframe that looks like the one here. The orig.gp refers to the original grouping, which is the "correct" group assignment. The new.gp refers to the grouping assigned by the clustering algorithm.
df <- data.frame(id = 1:9,
orig.gp = c(rep(1:3, each = 3)),
new.gp = c(2, 2, 3, 3, 3, 1, 1, 1, 1) )
df
# id orig.gp new.gp
# 1 1 1 2
# 2 2 1 2
# 3 3 1 3
# 4 4 2 3
# 5 5 2 3
# 6 6 2 1
# 7 7 3 1
# 8 8 3 1
# 9 9 3 1
What I am trying to determine is whether the same ids are assigned the same grouping as the orig.gp. The group number itself is not that important, as the number is arbitrary. Ideally, I would like to achieve something like this:
# orig.gp new.gp correct
# 1 1 2 yes
# 2 1 2 yes
# 3 1 3 no
# 4 2 3 yes
# 5 2 3 yes
# 6 2 1 no
# 7 3 1 yes
# 8 3 1 yes
# 9 3 1 yes
To illustrate, in the original grouping, group 1 consists of ids 1, 2, 3; group 2 consists of ids 4, 5, 6; group 3 consists of 7, 8, 9. In the new grouping, ids 1, 2 are correctly assigned into the same group, thus the "yes" in the correct column. I would like to determine whether the same ids are assigned into the same groups as the original groupings.
Any suggestions would be appreciated!
The way I understand your problem, it is basically one of recoding. Namely, you want to identify observations that fall on the diagonal of a crosstabulation of new.gp and orig.gp, but the values of new.gp are mislabeled.
What I propose here is basically recoding the values of new.gp based on a simple crosstabulation (see tab below). The recoding is done by taking the modal value of orig.gp for each possible value of new.gp and assuming that this mode is the correct value label. I then use recode from car to perform the recoding.
library("car")
tab <- with(df, table(new.gp, orig.gp))
tab
## orig.gp
## new.gp 1 2 3
## 1 0 1 3
## 2 2 0 0
## 3 1 2 0
df$recoded <- recode(df$new.gp, paste(rownames(tab),colnames(tab)[max.col(tab)],sep='=',collapse=';'))
df$correct <- ifelse(df$orig.gp == df$recoded, "yes", "no")
The result:
> df
orig.gp new.gp recoded correct
1 1 2 1 yes
2 1 2 1 yes
3 1 3 2 no
4 2 3 2 yes
5 2 3 2 yes
6 2 1 3 no
7 3 1 3 yes
8 3 1 3 yes
9 3 1 3 yes
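The modal recoding can also be sketched without the car package, using a named lookup vector. This is an illustration, not the answer's own code: the name modal is made up, and ties.method = "first" is chosen here because max.col breaks ties at random by default, which would make the result non-reproducible when two original groups are equally frequent.

```r
df <- data.frame(id = 1:9,
                 orig.gp = rep(1:3, each = 3),
                 new.gp = c(2, 2, 3, 3, 3, 1, 1, 1, 1))

tab <- with(df, table(new.gp, orig.gp))

# For each new.gp (row), the orig.gp (column) with the highest count
modal <- colnames(tab)[max.col(tab, ties.method = "first")]
names(modal) <- rownames(tab)

# Look up the modal original group for every observation
df$recoded <- as.integer(modal[as.character(df$new.gp)])
df$correct <- ifelse(df$orig.gp == df$recoded, "yes", "no")
df
```

Note that nothing in this approach prevents two new groups from being mapped to the same original group; that case would need separate handling.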

Resources