Double match in R

I have a huge data set in R with one row per individual. One of my columns shows a family identifier (note: sex == 1 is male, sex == 2 is female).
ind sex income hw family.id
1 1 10 6 fam.1
2 2 8 7 fam.1
3 2 15 8 fam.2
4 1 7 4 fam.3
5 2 9 5 fam.3
How can I do a "double matching" so I can match couples in the data set for many of the variables that I am interested? For example, let's say individual 2, female, married with individual 1, male, should receive an entry in a new column with his income (same goes for hw):
ind sex income hw family.id income.male hw.male
1 1 10 6 fam.1 10 6
2 2 8 7 fam.1 8 6
3 2 15 8 fam.2 - -
4 1 7 4 fam.3 7 7
5 2 9 5 fam.3 9 7
I've said "double matching" in the title because I don't need to match only the family.ID, but I need to find a male that matches this fam.id. The reason I am doing this is because later all males will be dropped from the data set and I will remain only with rows for females.
I am sorry I can't show any code I've worked on. I've tried many approaches using match, ifelse, lapply and even unlist, but it isn't worth adding them here since unfortunately I can't make any of them work.
Does anyone have a clue? I can work with either data.frames or data.tables.

You should go with the data.table package. Here is an example:
library(data.table)
dt <- data.table(ind = c(1, 2, 3, 4, 5),
                 sex = c(1, 2, 2, 1, 2),
                 income = c(10, 8, 15, 7, 9),
                 hw = c(6, 7, 8, 4, 5),
                 family.id = c('fam.1', 'fam.1', 'fam.2', 'fam.3', 'fam.3'))
setkeyv(dt, 'family.id')
dt2 <- dt[dt[sex == 1, list(family.id, income, hw)]]
It takes the income and hw of the males (dt[sex == 1, list(family.id, income, hw)]) and matches all individuals on family.id. As a result you obtain:
ind sex income hw family.id i.income i.hw
1: 1 1 10 6 fam.1 10 6
2: 2 2 8 7 fam.1 10 6
3: 4 1 7 4 fam.3 7 4
4: 5 2 9 5 fam.3 7 4
The columns with the i. prefix contain the male's values for every family. Note that if no male is present in a family you will not get any row for it. If you still need those rows you can do:
dt2 <- merge(dt, dt[sex == 1, list(family.id, income, hw)], by = 'family.id', suffixes = c('', '.i'), all = TRUE)
to receive
family.id ind sex income hw income.i hw.i
1: fam.1 1 1 10 6 10 6
2: fam.1 2 2 8 7 10 6
3: fam.2 3 2 15 8 NA NA
4: fam.3 4 1 7 4 7 4
5: fam.3 5 2 9 5 7 4
Later, when you need to drop the male rows, you do:
dt2[sex == 2]

Let's assume that the dataframe is named 'dat'. You can merge the males and females by family.id with the merge function. Your proposed answer didn't make sense to me or to the other commenters, but you can reassign "income" or "hw" within this new object.
> merge( dat[ dat$sex==1, ], dat[dat$sex==2,] , by="family.id")
family.id ind.x sex.x income.x hw.x ind.y sex.y income.y hw.y
1 fam.1 1 1 10 6 2 2 8 7
2 fam.3 4 1 7 4 5 2 9 5
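To make that reassignment concrete, here is a minimal sketch (assuming merge's default .x/.y suffixes, where .x comes from the male rows and .y from the female rows, as in the output above):
m <- merge(dat[dat$sex == 1, ], dat[dat$sex == 2, ], by = "family.id")
# carry the male values over under new names, then keep the female-side columns
m$income.male <- m$income.x
m$hw.male <- m$hw.x
females <- m[, c("family.id", "ind.y", "sex.y", "income.y", "hw.y", "income.male", "hw.male")]
Note that, like the merge output above, this drops families with no male (fam.2); use all.y = TRUE in the merge if you need to keep them.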

To follow up on my comment:
require(data.table)
dt[dt[sex == 1L], c("i.m", "hw.m") := .(i.income, i.hw), on="family.id"][]
Extract the rows where sex == 1 (male) for each family.id and add two columns by reference with the corresponding income and hw values.
where dt is:
dt = fread('ind sex income hw family.id
1 1 10 6 fam.1
2 2 8 7 fam.1
3 2 15 8 fam.2
4 1 7 4 fam.3
5 2 9 5 fam.3')
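After that update the two new columns live in dt by reference, so dropping the males is just a subset; the result should look along these lines (column names as chosen in the j expression above):
dt[sex == 2]
#    ind sex income hw family.id i.m hw.m
# 1:   2   2      8  7     fam.1  10    6
# 2:   3   2     15  8     fam.2  NA   NA
# 3:   5   2      9  5     fam.3   7    4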

Related

Pair-wise manipulating rows in data.frame

I have data on several thousand US basketball players over multiple years.
Each basketball player has a unique ID. It is known which team they play for and at which position in a given year, much like the mock data df below:
df <- data.frame(id = c(rep(1:4, times = 2), 1),
                 year = c(1, 1, 2, 2, 3, 4, 4, 4, 5),
                 team = c(1, 2, 3, 4, 2, 2, 4, 4, 2),
                 position = c(1, 2, 3, 4, 1, 1, 4, 4, 4))
> df
id year team position
1 1 1 1 1
2 2 1 2 2
3 3 2 3 3
4 4 2 4 4
5 1 3 2 1
6 2 4 2 1
7 3 4 4 4
8 4 4 4 4
9 1 5 2 4
What is an efficient way to manipulate df into new_df below?
> new_df
id move time position.1 position.2 year.1 year.2
1 1 0 2 1 1 1 3
2 2 1 3 2 1 1 4
3 3 0 2 3 4 2 4
4 4 1 2 4 4 2 4
5 1 0 2 1 4 3 5
In new_df the first occurrence of each basketball player is compared to the second occurrence, and it is recorded whether the player switched teams and how long it took the player to make the switch.
Note:
In the real data some basketball players occur more than twice and can play for multiple teams and on multiple positions.
In such a case a new row in new_df is added that compares each additional occurrence of a player with only the previous occurrence.
Edit: I don't think this is just a simple reshape exercise, for the reasons mentioned in the previous two sentences. To clarify this, I've added an additional occurrence of player ID 1 to the mock data.
Any help is most welcome and appreciated!
# occurrence number for each row (this works for the original 8-row df,
# where each id appears exactly twice and the rows come in two blocks)
s = table(df$id)
df$time = rep(1:max(s), each = length(s))
# reshape to wide format, one row per id
df1 = reshape(df, idvar = "id", dir = "wide")
transform(df1, move = +(team.1 == team.2), time = year.2 - year.1)
id year.1 team.1 position.1 year.2 team.2 position.2 move time
1 1 1 1 1 3 2 1 0 2
2 2 1 2 2 4 2 1 1 3
3 3 2 3 3 4 4 4 0 2
4 4 2 4 4 4 4 4 1 2
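Note that the rep() trick above assumes every id occurs the same number of times, which breaks on the edited 9-row df (id 1 occurs three times). A more general occurrence counter, as an assumption on top of this answer, would be:
df$time = ave(seq_along(df$id), df$id, FUN = seq_along)  # occurrence number within each id
reshape() will then produce year.1 through year.3 (and likewise for team and position), and the transform() step would need to be repeated for each consecutive pair of occurrences.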
The code below should help you get to the point where the data is transposed; you'll have to create the move and time variables yourself (a rough sketch of that step follows the dcast below).
df <- data.frame(id = rep(1:4, times = 2),
                 year = c(1, 1, 2, 2, 3, 4, 4, 4),
                 team = c(1, 2, 3, 4, 2, 2, 4, 4),
                 position = c(1, 2, 3, 4, 1, 1, 4, 4))
library(reshape2)
library(data.table)
setDT(df)  # convert to data.table
df[, rno := rank(year, ties.method = "min"), by = .(id)]  # gives the occurrence number within each id
# creating the transposed dataset
Dcast_DT <- dcast(df, id ~ rno, value.var = c("year", "team", "position"))
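From there, a rough sketch of the two remaining variables (assuming data.table's default column naming year_1, year_2, team_1, team_2 for a dcast with multiple value.var, and the move convention used in the expected new_df):
Dcast_DT[, time := year_2 - year_1]
Dcast_DT[, move := +(team_1 == team_2)]  # 1 when the team is unchanged, matching new_df above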
This piece of code did the trick, using data.table
# transform to data.table
dtt <- as.data.table(df)
# sort on year
setorder(dtt, year, na.last = TRUE)
# indicate the names of the new columns
new_cols = c("time", "move", "prev_team", "prev_year", "prev_position")
# set up the new variables
dtt[, (new_cols) := list(year - shift(year), team != shift(team), shift(team), shift(year), shift(position)), by = id]
# select only repeating occurrences
dtt <- dtt[!is.na(time)]
# outcome
dtt
id year team position time move prev_team prev_year prev_position
1: 1 3 2 1 2 TRUE 1 1 1
2: 2 4 2 1 3 FALSE 2 1 2
3: 3 4 4 4 2 TRUE 3 2 3
4: 4 4 4 4 2 FALSE 4 2 4
5: 1 5 2 4 2 FALSE 2 3 1

Adding NA's where data is missing [duplicate]

I have a dataset that looks like the following:
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
data.frame(id,cycle,value)
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 3 8
9 4 2 9
So basically there is a variable called id that identifies the sample, a variable called cycle that identifies the timepoint, and a variable called value that gives the value at that timepoint.
As you can see, sample 3 does not have cycle 2 data and sample 4 is missing cycle 1 and cycle 3 data. What I want to know is: is there a way, without a loop, to run a command that places NAs where there is no data? I would like my dataset to look like the following:
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
I am able to solve this problem with a lot of loops and if statements, but the code is extremely long and cumbersome (I have many more columns in my real dataset).
Also, the number of samples I have is very large, so I need something that is generalizable.
Using merge and expand.grid, we can come up with a solution. expand.grid creates a data.frame with all combinations of the supplied vectors (so you'd supply it with the id and cycle variables). By merging it to your original data (and using all.x = TRUE, which is like a left join in SQL), we fill the rows that are missing from dat with NA.
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
dat <- data.frame(id,cycle,value)
grid_dat <- expand.grid(id = 1:4,
                        cycle = 1:3)
# or you could do (HT #jogo):
# grid_dat <- expand.grid(id = unique(dat$id),
#                         cycle = unique(dat$cycle))
merge(x = grid_dat, y = dat, by = c('id','cycle'), all.x = TRUE)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
A solution based on the tidyverse package; complete() comes from tidyr.
library(tidyverse)
# Create example data frame
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4)
cycle <- c(1, 2, 3, 1, 2, 3, 1, 3, 2)
value <- 1:9
dt <- data.frame(id, cycle, value)
# Complete the combination between id and cycle
dt2 <- dt %>% complete(id, cycle)
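For this example, dt2 should then hold the full id-by-cycle grid, matching the desired output shown in the question:
dt2
# 12 rows: id 1-4 crossed with cycle 1-3, with value = NA for the missing
# combinations (id 3 / cycle 2, and id 4 / cycles 1 and 3)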
Here is a solution with data.table doing a cross join:
library("data.table")
d <- data.table(id = c(1,1,1,2,2,2,3,3,4), cycle = c(1,2,3,1,2,3,1,3,2), value = 1:9)
d[CJ(id=id, cycle=cycle, unique=TRUE), on=.(id,cycle)]

R Compare Columns across Dataframes to Match Values

I have two dataframes, looking at houses (n=6) and certain dates (n=22).
ORIGINAL is the original dataset. It contains 38 observations on 5 variables. Not all houses have all the dates listed, and vice versa, leading to errors in calculations with different length variables.
SAMPLE is a new empty dataset. It contains 132 (6 x 22) observations on the same 5 variables. Now there is an observation for every household for every date.
House Day Mongoose Fruit Elephant
A 1 40 7 0.6
A 6 32 12 4.2
B 2 50 3 4.0
B 4 51 4 8.6
B 6 8 7 12.1
C 2 12 8 13.0
I am trying to fill in the rest of SAMPLE by asking R to compare HouseID and Date between the two dataframes; if they match, the rest of the variables (mongoose, fruit, elephant) should be copied over for that observation.
I tried this to no avail...
for(i in 1:nrow(original))
{
  if ((sample$Day == original$Day) && (sample$House == original$House))
  {
    sample$Mongoose[i] <- original$Mongoose[i]
    sample$Fruit[i] <- original$Fruit[i]
    sample$Elephant[i] <- original$Elephant[i]
  }
}
This is the result: I get the following three warnings in sequence:
In sample$Day == test$Day : longer object length is not a multiple of shorter object length
In is.na(e1) | is.na(e2) : longer object length is not a multiple of shorter object length
In ==.default(sample$House, test$House) : longer object length is not a multiple of shorter object length
The data DOES copy over, but incorrectly. All the values get transferred to the A house and sequential date, rather than the appropriate house and date.
I.e., it looks like this
House Day Mongoose Fruit Elephant
A 1 40 7 0.6
A 2 50 3 4.0
A 3 51 4 8.6
A 4 8 7 12.1
A 5 12 8 13.0
A 6 32 12 4.2
B 1
B 2
B 3 [...]
When it should (in essence) look like this:
House Day Mongoose Fruit Elephant
A 1 40 7 0.6
A 2
A 3
A 4
A 5
A 6 32 12 4.2 [rest of A houses have no data]
B 1
B 2 50 3 4.0
B 3
B 4 51 4 8.6
B 5
B 6 8 7 12.1 [rest of B houses have no data]
C 1
C 2 12 8 13.0
Please advise; I will eventually have to extend this technique to look at a sample dataset with 198K entries, and a test dataset with 115K.
Thanks!
Sounds to me like this should work:
merge(sample, original, by = c("House", "Day"), all.x = TRUE)
But hard to tell without a reproducible example. You may also want to look into dplyr::left_join(). That is, assuming your data looks like the following:
sample <- data.frame(House = rep(c("A", "B", "C"), each = 6),
                     Day = rep(1:6, 3))
# head(sample)
# House Day
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 A 5
# 6 A 6
original <- data.frame(House = c("A", "A", "B", "B", "C"),
                       Day = c(1, 6, 2, 4, 2),
                       Mongoose = c(40, 32, 50, 51, 8),
                       Fruit = c(7, 12, 3, 4, 8),
                       Elephant = c(0.6, 4.2, 4.0, 8.6, 12.1))
# head(original)
# House Day Mongoose Fruit Elephant
# 1 A 1 40 7 0.6
# 2 A 6 32 12 4.2
# 3 B 2 50 3 4.0
# 4 B 4 51 4 8.6
# 5 C 2 8 8 12.1
We obtain:
# head(merge(sample, original, by = c("House", "Day"), all.x = TRUE))
# House Day Mongoose Fruit Elephant
# 1 A 1 40 7 0.6
# 2 A 2 NA NA NA
# 3 A 3 NA NA NA
# 4 A 4 NA NA NA
# 5 A 5 NA NA NA
# 6 A 6 32 12 4.2
It could be a small tweak; look at this line of your original code:
if ((sample$Day == original$Day) && (sample$House == original$House))
See if you can change it to this:
if ((sample$Day[i] == original$Day[i]) && (sample$House[i] == original$House[i]))
Because: you are using a for loop with an i variable, which you use very well in lines such as sample$Mongoose[i] <- original$Mongoose[i], but in your example the if statement does not actually make use of the i variable. So we revise it to use i, so that it compares that specific row's sample$Day with that row's original$Day, and the same for sample$House versus original$House.
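If the loop still gives unexpected results (sample and original have different numbers of rows, so row i in one need not line up with row i in the other), a vectorised match() on a combined House-Day key is one alternative; this is a sketch of that alternative, not a fix of the loop itself:
key_sample <- paste(sample$House, sample$Day)
key_original <- paste(original$House, original$Day)
idx <- match(key_sample, key_original)  # NA where a house/day pair has no observation
sample$Mongoose <- original$Mongoose[idx]
sample$Fruit <- original$Fruit[idx]
sample$Elephant <- original$Elephant[idx]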

aggregate dataframe subsets in R

I have the dataframe ds
CountyID ZipCode Value1 Value2 Value3 ... Value25
1 1 0 etc etc etc
2 1 3
3 1 0
4 1 1
5 2 2
6 3 3
7 4 7
8 4 2
9 5 1
10 6 0
and would like to aggregate based on ds$ZipCode and set ds$CountyID equal to the primary county based on the highest ds$Value1. For the above example, it would look like this:
CountyID ZipCode Value1 Value2 Value3 ... Value25
2 1 4 etc etc etc
5 2 2
6 3 3
7 4 9
9 5 1
10 6 0
All the ValueX columns are the sum of that column grouped by ZipCode.
I've tried a bunch of different strategies over the last couple days, but none of them work. The best I've come up with is
#initialize the dataframe
ds_temp = data.frame()
#loop through each subset based on unique zipcodes
for (zip in unique(ds$ZipCode)) {
sub <- subset(ds, ds$ZipCode == zip)
len <- length(sub)
maxIndex <- which.max(sub$Value1)
#do the aggregation
row <- aggregate(sub[3:27], FUN=sum, by=list(
CountyID = rep(sub$CountyID[maxIndex], len),
ZipCode = sub$ZipCode))
rbind(ds_temp, row)
}
ds <- ds_temp
I haven't been able to test this on the real data, but with dummy datasets (such as the one above) I keep getting the error "arguments must have the same length". I've messed around with rep() and fixed vectors (e.g. c(1,2,3,4)), but no matter what I do, the error persists. I also occasionally get an error to the effect of
cannot subset data of type 'closure'.
Any ideas? I've also tried messing around with data.frame(), ddply(), data.table(), dcast(), etc.
You can try this:
data.frame(aggregate(df[,3:27], by=list(df$ZipCode), sum),
           CountyID = unlist(lapply(split(df, df$ZipCode),
                                    function(x) x$CountyID[which.max(x$Value1)])))
Fully reproducible sample data:
df<-read.table(text="
CountyID ZipCode Value1
1 1 0
2 1 3
3 1 0
4 1 1
5 2 2
6 3 3
7 4 7
8 4 2
9 5 1
10 6 0", header=TRUE)
data.frame(aggregate(df[,3], by=list(df$ZipCode), sum),
           CountyID = unlist(lapply(split(df, df$ZipCode),
                                    function(x) x$CountyID[which.max(x$Value1)])))
# Group.1 x CountyID
#1 1 4 2
#2 2 2 5
#3 3 3 6
#4 4 9 7
#5 5 1 9
#6 6 0 10
In response to your comment on Frank's answer, you can preserve the column names by using the formula method in aggregate. Using Frank's data df, this would be
> cbind(aggregate(Value1 ~ ZipCode, df, sum),
        CountyID = sapply(split(df, df$ZipCode), function(x) {
          with(x, CountyID[Value1 == max(Value1)]) }))
# ZipCode Value1 CountyID
# 1 1 4 2
# 2 2 2 5
# 3 3 3 6
# 4 4 9 7
# 5 5 1 9
# 6 6 0 10
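For completeness, since the question mentions trying data.table: the same summary can be sketched there as well (selecting the Value1 through Value25 columns via grep is an assumption about the real column names):
library(data.table)
dt <- as.data.table(df)
value_cols <- grep("^Value", names(dt), value = TRUE)
# per ZipCode: CountyID of the row with the highest Value1, plus column sums
dt[, c(list(CountyID = CountyID[which.max(Value1)]), lapply(.SD, sum)),
   by = ZipCode, .SDcols = value_cols]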

create new column based on values in previous rows

I hope somebody can help me.
I have a data like this:
subject choice
1 3
2 3
3 1
4 4
5 3
6 2
7 2
8 3
Now I want to create a new column based on the value of the choice column. If the value in the choice column is new (has never occurred before), the value in the new column will be 'No'; otherwise, if the value has already occurred in a previous row, the value in the new column will be 'Soc'. The new table will look like this:
subject choice newcolumn
1 3 No
2 3 Soc
3 1 No
4 4 No
5 3 Soc
6 2 No
7 2 Soc
8 3 Soc
Can somebody help me? Thanks in advance.
Using example data
DF <- data.frame(subject = 1:8, choice = c(3, 3, 1, 4, 3, 2, 2, 3))
I would do
DF <- transform(DF, newcolumn = c("No","Soc")[duplicated(choice) + 1])
giving
subject choice newcolumn
1 1 3 No
2 2 3 Soc
3 3 1 No
4 4 4 No
5 5 3 Soc
6 6 2 No
7 7 2 Soc
8 8 3 Soc
Without transform() this would be
DF$newcolumn <- c("No","Soc")[duplicated(DF$choice) + 1])
Another option using duplicated and ifelse:
transform(DF, newcolumn = ifelse(!duplicated(choice),'No','Soc'))
## subject choice newcolumn
## 1 1 3 No
## 2 2 3 Soc
## 3 3 1 No
## 4 4 4 No
## 5 5 3 Soc
## 6 6 2 No
## 7 7 2 Soc
## 8 8 3 Soc
There are a bunch of ways to do this, but using bracket subsetting will teach you some useful things about R:
# Make your example reproducible
subject <- 1:8
choice <- c(3, 3, 1, 4, 3, 2, 2, 3)
d <- data.frame(subject, choice)
# Create a new column, set all the values to "No"
d$newColumn <- "No"
# Set those values for which choice is duplicated to "Soc"
d$newColumn[duplicated(d$choice)] <- "Soc"
