Loop over multiple columns in a merge in R

I am trying to loop the merging of two dataframes over multiple columns, but I'm having trouble with the code and haven't been able to find any answers on SO. Here are some example data frames:
box <- c(5,7,2)
year <- c(1999,1999,1999)
rep5 <- c(5,5,5)
rep7 <- c(7,7,7)
rep2 <- c(2,2,2)
df1 <- data.frame(box,year,rep5,rep7,rep2)
box1 <- c(5,5,5,5,7,7,7,7,2,2,2,2)
box2 <- c(5,7,2,5,5,7,2,4,5,7,2,9)
year2 <- c(1999,1999,1999,2000,1999,1999,1999,1999,1999,1999,1999,1999)
distance <- c(0,100,200,0,100,0,300,200,200,300,0,300)
df2 <- data.frame(box1,box2,year2,distance)
df1
box year rep5 rep7 rep2
1 5 1999 5 7 2
2 7 1999 5 7 2
3 2 1999 5 7 2
df2
box1 box2 year2 distance
1 5 5 1999 0
2 5 7 1999 100
3 5 2 1999 200
4 5 5 2000 0
5 7 5 1999 100
6 7 7 1999 0
7 7 2 1999 300
8 7 4 1999 200
9 2 5 1999 200
10 2 7 1999 300
11 2 2 1999 0
12 2 9 1999 300
What I am trying to do is get the distance information from df2 into df1, with df1 year matched to df2 year, df1 box matched to df2 box1, and df1 rep[i] matched to df2 box2. I can do this for a single df1 rep[i] column as follows:
merge(df1, df2, by.x=c("box", "rep5", "year"), by.y=c("box1", "box2", "year2"), all.x = TRUE)
This gives the desired output:
box rep5 year rep7 rep2 distance
1 2 5 1999 7 2 200
2 5 5 1999 7 2 0
3 7 5 1999 7 2 100
However, in order to save doing this for each rep[i] column individually (I have a lot of these columns in the real data set), I'd like to be able to loop over those columns. Here is the code I have tried to do that:
reps <- c(df1$rep7, df1$rep2)
df3 <- for (i in reps) {merge(df1, df2, by.x=c("box", i, "year"), by.y=c("box1", "box2", "year2"), all.x = TRUE)}
df3
When I run that code, I get the error "Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column." I also tried defining
reps <- c("rep7", "rep2")
When I run the same code using that definition, I get the result that df3 is NULL.
The output that I want (with the distance column renamed for clarity) is:
box year rep5 rep7 rep2 dist5 dist7 dist2
1 2 1999 5 7 2 200 300 0
2 5 1999 5 7 2 0 100 200
3 7 1999 5 7 2 100 0 300
What am I doing wrong? Any help you can give me would be very much appreciated!

My R life became so much easier when I learned about the libraries dplyr and tidyr, and the concept of tidy data sets. What you're trying to do above can be expressed as a pivot, and is pretty easy to do with dplyr and tidyr.
I'm assuming that what you really want is to turn df2:
box1 box2 year2 distance
1 5 5 1999 0
2 5 7 1999 100
3 5 2 1999 200
4 5 5 2000 0
5 7 5 1999 100
6 7 7 1999 0
7 7 2 1999 300
8 7 4 1999 200
9 2 5 1999 200
10 2 7 1999 300
11 2 2 1999 0
12 2 9 1999 300
into your output, with all those strange repetitions removed:
box year dist5 dist7 dist2
1 2 1999 200 300 0
2 5 1999 0 100 200
3 7 1999 100 0 300
So you should pivot box2 into columns, with distance as the value. Using dplyr and tidyr:
library(tidyr)
box1 <- c(5,5,5,5,7,7,7,7,2,2,2,2)
box2 <- c(5,7,2,5,5,7,2,4,5,7,2,9)
year2 <- c(1999,1999,1999,2000,1999,1999,1999,1999,1999,1999,1999,1999)
distance <- c(0,100,200,0,100,0,300,200,200,300,0,300)
df2 <- data.frame(box1,box2,year2,distance)
# reshape it as desired
spread(df2, box2, distance,fill=0)
#Source: local data frame [4 x 7]
# box1 year2 2 4 5 7 9
#1 2 1999 0 0 200 300 300
#2 5 1999 200 0 0 100 0
#3 5 2000 0 0 0 0 0
#4 7 1999 300 200 100 0 0
My recommendation: learn to use dplyr and tidyr. It makes life so, so much easier.
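For readers who want to keep the merge-based approach from the question, here is a minimal base R sketch. It relies on two observations: a for loop in R evaluates to NULL, so its result cannot be assigned to df3, and by.x needs column names (such as "rep7") rather than column values. The renaming of distance to dist5, dist7 and dist2 follows the desired output in the question; the variable names df3 and r are just illustrative choices.

```r
# Example data from the question
df1 <- data.frame(box = c(5, 7, 2), year = 1999, rep5 = 5, rep7 = 7, rep2 = 2)
df2 <- data.frame(
  box1     = c(5, 5, 5, 5, 7, 7, 7, 7, 2, 2, 2, 2),
  box2     = c(5, 7, 2, 5, 5, 7, 2, 4, 5, 7, 2, 9),
  year2    = c(1999, 1999, 1999, 2000, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999),
  distance = c(0, 100, 200, 0, 100, 0, 300, 200, 200, 300, 0, 300)
)

# Names (not values) of the columns to loop over
reps <- grep("^rep", names(df1), value = TRUE)

# Accumulate the result inside the loop; the loop itself returns NULL
df3 <- df1
for (r in reps) {
  df3 <- merge(df3, df2,
               by.x = c("box", r, "year"),
               by.y = c("box1", "box2", "year2"),
               all.x = TRUE)
  # Rename the freshly merged column so the next merge does not clash with it
  names(df3)[names(df3) == "distance"] <- sub("^rep", "dist", r)
}
df3
```

Each merge moves its key columns to the front, so the column order of df3 differs from the desired output even though the values agree; reorder the columns afterwards if that matters.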

Related

Counting the number of changes of a categorical variable during repeated measurements within a category

I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. The study follows a large group of people for 25 years and records changes in 'region' (a categorical variable, 1-4), 'urban' (dummy), 'wage' and 'educ'.
How do I count the total number of times 'region' or 'urban' has changed (e.g. from region 1 to region 3, or from urban 0 to 1) during the 25-year observation period within each subject? I also have some NAs in the data, which should be ignored.
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get you closer to what you need.
First, group by i. Then create a column that holds a 1 for each change in region. This compares the current value of region with the previous value (using lag). Note that if the previous value is NA (as it is for the first row of a given i), it is counted as no change.
The same approach is taken for urban. Then summarize, totaling up all the changes for each i. I left in these temporary variables so you can check whether you are getting the desired results.
Edit: If you wish to remove rows that have NA for region or urban, you can add drop_na first.
library(dplyr)
library(tidyr)
df_tot <- df %>%
  drop_na(region, urban) %>%
  group_by(i) %>%
  mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  summarize(tot_region = sum(reg_change),
            tot_urban = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot as above.)
colSums(df_tot[-1])
tot_region tot_urban
6 2
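The same counting can be sketched in base R with diff and tapply. This is only an illustration on a cut-down copy of the rows shown in the question; the helper name count_changes is hypothetical. Unlike the drop_na step above, it drops NAs separately for each variable rather than removing a whole row when either value is missing.

```r
# Cut-down copy of the example rows (only the columns needed here)
df <- data.frame(
  i      = c(rep(1, 4), rep(4, 17), rep(45, 4)),
  region = c(1, 1, 2, 2,
             1, NA, 4, 4, rep(1, 11), NA, 4,
             3, 3, 4, 1),
  urban  = c(1, 1, 1, 1,
             1, NA, rep(1, 13), NA, 1,
             1, 1, 2, 1)
)

# Hypothetical helper: number of times consecutive non-missing values differ
count_changes <- function(x) {
  x <- x[!is.na(x)]      # ignore missing observations
  sum(diff(x) != 0)      # every non-zero step is one change
}

change_reg   <- tapply(df$region, df$i, count_changes)
change_urban <- tapply(df$urban,  df$i, count_changes)

result <- data.frame(i            = as.numeric(names(change_reg)),
                     change_reg   = as.vector(change_reg),
                     change_urban = as.vector(change_urban))
result

# Grand totals across all individuals
colSums(result[-1])
```

The rows are assumed to be sorted by year within each i, as they are in the question.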

Recreating a dataframe by using conditions from two different columns

I have a massive dataframe that looks like this:
df = data.frame(year = c(rep(1998,5),rep(1999,5)),
                loc = c(10,rep(14,4),rep(10,2),rep(14,3)),
                sitA = c(rep(0,3),1,1,0,1,0,1,1),
                sitB = c(1,0,1,0,1,rep(0,4),1),
                n = c(2,13,2,9,4,7,2,7,7,4))
df
year loc sitA sitB n
1 1998 10 0 1 2
2 1998 14 0 0 13
3 1998 14 0 1 2
4 1998 14 1 0 9
5 1998 14 1 1 4
6 1999 10 0 0 7
7 1999 10 1 0 2
8 1999 14 0 0 7
9 1999 14 1 0 7
10 1999 14 1 1 4
As you can see, there are years, localities, two different situations (denoted sitA and sitB) and finally the counts of these records (column n).
I want to create a new data frame with one row per year and locality, where the counts for each combination of situation A and B are stored in separate columns, as in the desired output below:
df.new
year loc sitB.0.sitA.0 sitB.0.sitA.1 sitB.1.sitA.0 sitB.1.sitA.1
1 1998 10 0 0 2 0
2 1998 14 13 9 2 4
3 1999 10 7 2 0 0
4 1999 14 7 7 0 4
The tricky part, as you can see, is that the original dataframe doesn't include all of the conditions. It only has the ones where the count is above 0. So the new dataframe should have "0" for the conditions that are missing from the original dataframe. Because of this, well-known functions such as melt (reshape) or aggregate failed to solve my issue. A little help would be appreciated.
A tidyverse method: we first prepend the column names to the values of the sit.. columns. Then we unite them into one column and finally spread the values.
library(tidyverse)
df[3:4] <- lapply(names(df)[3:4], function(x) paste(x, df[, x], sep = "."))
df %>%
unite(key, sitA, sitB, sep = ".") %>%
spread(key, n, fill = 0)
# year loc sitA.0.sitB.0 sitA.0.sitB.1 sitA.1.sitB.0 sitA.1.sitB.1
#1 1998 10 0 2 0 0
#2 1998 14 13 2 9 4
#3 1999 10 7 0 2 0
#4 1999 14 7 0 7 4
If the position of the columns is not fixed you can use grep first
cols <- grep("^sit", names(df))
df[cols] <- lapply(names(df)[cols], function(x) paste(x, df[, x], sep = "."))
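A base R alternative can be sketched with xtabs, which sums n for every combination of its grouping variables and reports 0 for combinations that never occur. The helper columns key and group below are illustrative names, not part of the original data.

```r
df <- data.frame(year = c(rep(1998, 5), rep(1999, 5)),
                 loc  = c(10, rep(14, 4), rep(10, 2), rep(14, 3)),
                 sitA = c(rep(0, 3), 1, 1, 0, 1, 0, 1, 1),
                 sitB = c(1, 0, 1, 0, 1, rep(0, 4), 1),
                 n    = c(2, 13, 2, 9, 4, 7, 2, 7, 7, 4))

# Illustrative helper columns: one label per sitA/sitB combination,
# and one label per year/loc pair
df$key   <- paste0("sitA.", df$sitA, ".sitB.", df$sitB)
df$group <- paste(df$year, df$loc)

# Cross-tabulate, summing n; absent combinations come out as 0
counts <- xtabs(n ~ group + key, data = df)

# Turn the 2-d table into a data frame with one row per year/loc pair
wide <- as.data.frame.matrix(counts)
wide
```

The row names of wide hold the "year loc" labels; split them back into two columns if separate year and loc columns are needed.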

Remove both rows that duplicate in R

I'm trying to remove all rows that have a duplicated value. In the example, I want to remove both rows that have a 2 and all three rows that have a 6 in the x column. I have tried xy[!duplicated(xy$x), ]; however, this still keeps the first row of each duplicated group, whereas I do not want any of them.
x <- c(1,2,2,4,5,6,6,6)
y <- c(1888,1999,2000,2001,2004,2005,2010,2011)
xy <- as.data.frame(cbind(x,y))
xy
x y
1 1 1888
2 2 1999
3 2 2000
4 4 2001
5 5 2004
6 6 2005
7 6 2010
8 6 2011
What I want is
x y
1 1888
4 2001
5 2004
Any help is appreciated. I need to avoid specifying the value to get rid of since I am dealing with a dataframe with thousands of records.
We can do
xy[! xy$x %in% unique(xy[duplicated(xy$x), "x"]), ]
# x y
#1 1 1888
#4 4 2001
#5 5 2004
as
unique(xy[duplicated(xy$x), "x"])
gives the values of x that are duplicated. Then we can just filter those out.
You can count and include only the singletons
xy[1==ave(xy$x,xy$x,FUN=length),]
x y
1 1 1888
4 4 2001
5 5 2004
Or like this:
xy[xy$x %in% names(which(table(xy$x)==1)),]
x y
1 1 1888
4 4 2001
5 5 2004
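Another common idiom, sketched here on the question's data, runs duplicated in both directions. duplicated(x) flags every occurrence after the first, and duplicated(x, fromLast = TRUE) flags every occurrence before the last, so their union marks every row whose value appears more than once.

```r
x <- c(1, 2, 2, 4, 5, 6, 6, 6)
y <- c(1888, 1999, 2000, 2001, 2004, 2005, 2010, 2011)
xy <- data.frame(x, y)

# TRUE for every row whose x value occurs more than once
dup_any <- duplicated(xy$x) | duplicated(xy$x, fromLast = TRUE)

# Keep only the rows with a unique x
res <- xy[!dup_any, ]
res
```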

fill the time gap in data frame in r

I have a data set including the following info:
id class year n
25 A63 2006 3
25 F16 2006 1
39 0901 2001 1
39 0903 2001 3
39 0903 2003 2
39 1901 2003 1
...
There are about 100k different ids and more than 300 classes. The year varies from 1998 to 2007.
What I want to do is fill the time gap with n=0, by id and class, for every year after an id and class first appear.
Then I want to calculate the sum of n and the quantity of classes.
For example, the 6 lines of data above should expand to the following table:
id class year n sum Qc Qs
25 A63 2006 3 3 2 2
25 F16 2006 1 1 2 2
25 A63 2007 0 3 0 2
25 F16 2007 0 1 0 2
39 0901 2001 1 1 2 2
39 0903 2001 3 3 2 2
39 0901 2002 0 1 0 2
39 0903 2002 0 3 0 2
39 0901 2003 0 1 2 3
39 0903 2003 2 5 2 3
39 1901 2003 1 1 2 3
39 0901 2004 0 1 0 3
39 0903 2004 0 5 0 3
39 1901 2004 0 1 0 3
...
39 0901 2007 0 1 0 3
39 0903 2007 0 5 0 3
39 1901 2007 0 1 0 3
I can solve it with an ugly for loop, but it takes one hour to get the result. Is there a better way to do this, such as vectorizing or using data.table?
Using dplyr you could try:
library(dplyr)
df %>% group_by(class, id) %>% arrange(year) %>%
  do(merge(data.frame(year = c(.$year[1]:2007),
                      id = rep(.$id[1], 2007 - .$year[1] + 1),
                      class = rep(.$class[1], 2007 - .$year[1] + 1)),
           ., all.x = TRUE))
It groups the data by class and id, and merges each group to a dataframe containing all the years with the id and class of that group.
Edit: if you want to do this only after a certain id you could do:
as.data.frame(rbind(df[df$id <= 25, ],
                    df %>% filter(id > 25) %>% group_by(class, id) %>% arrange(year) %>%
                      do(merge(data.frame(year = c(.$year[1]:2007),
                                          id = rep(.$id[1], 2007 - .$year[1] + 1),
                                          class = rep(.$class[1], 2007 - .$year[1] + 1)),
                               ., all.x = TRUE))))
Use expand.grid to get the cartesian product of class and year.
Then merge your current data frame to this new one. Then do the classic subset replacement.
df <- data.frame(class = as.factor(c("A63","F16","0901","0903","0903","1901")),
year = c(2006,2006,2001,2001,2003,2003),
n=c(3,1,1,3,2,1))
df2 <- expand.grid(class = levels(df$class),
                   year = 1998:2007)  # the question says the years run from 1998 to 2007
df2 <- merge(df2,df, all.x=TRUE)
df2$n[is.na(df2$n)] <- 0
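The gap filling can also be sketched in base R so that each id/class pair only starts in the year it first appears, as in the expected table. This is an illustration on the six example rows; last_year, first, grid and filled are names chosen here, and the sum column is computed as a running total because that is what the expected output appears to show. The Qc and Qs columns are not attempted.

```r
df <- data.frame(
  id    = c(25, 25, 39, 39, 39, 39),
  class = c("A63", "F16", "0901", "0903", "0903", "1901"),
  year  = c(2006, 2006, 2001, 2001, 2003, 2003),
  n     = c(3, 1, 1, 3, 2, 1),
  stringsAsFactors = FALSE
)
last_year <- 2007  # assumed end of the observation window

# First observed year for every id/class pair
first <- aggregate(year ~ id + class, data = df, FUN = min)

# One row per id/class/year from the first observation up to last_year
grid <- do.call(rbind, lapply(seq_len(nrow(first)), function(k) {
  data.frame(id    = first$id[k],
             class = first$class[k],
             year  = first$year[k]:last_year,
             stringsAsFactors = FALSE)
}))

# Attach the observed counts and fill the gaps with 0
filled <- merge(grid, df, all.x = TRUE)
filled$n[is.na(filled$n)] <- 0

# Sort by year within each pair, then take a running total of n
filled <- filled[order(filled$id, filled$year, filled$class), ]
filled$sum <- ave(filled$n, filled$id, filled$class, FUN = cumsum)
filled
```

With about 100k ids, building grid row by row may itself be slow, so treat this as a statement of the logic rather than a tuned solution.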

R Example - ddply, ave, and merge

I have written some code, and it would be great if you could suggest a better way of doing what I am trying to do. The data dt is as follows:
SIC FYEAR AU AT
1 1 2003 6 212.748
2 1 2003 5 3987.884
3 1 2003 4 100.835
4 1 2003 4 1706.719
5 1 2003 5 9.159
6 1 2003 7 60.069
7 1 2003 5 100.696
8 1 2003 4 113.865
9 1 2003 6 431.552
10 1 2003 7 309.109 ...
My task is to create a new column: for a given SIC and FYEAR, the AU with the highest percentage of AT gets the value 1 if the difference between the highest and second-highest percentage exceeds 10%, and 0 otherwise. Here is my attempt:
a <- ddply(dt,.(SIC,FYEAR),function(x){ddply(x,.(AU),function(x) sum(x$AT))});
SIC FYEAR AU V1
1 1 2003 4 3412.619
2 1 2003 5 13626.241
3 1 2003 6 644.300
4 1 2003 7 1478.633
5 1 2003 9 0.003
6 1 2004 4 3976.242
7 1 2004 5 9383.516
8 1 2004 6 457.023
9 1 2004 7 456.167
10 1 2004 9 238.282
where V1 represents the sum of AT over all rows for a given AU within a given SIC and FYEAR. Next I do:
a$V1 <- ave(a$V1, a$SIC, a$FYEAR, FUN = function(x) x/sum(x));
SIC FYEAR AU V1
1 1 2003 4 1.780949e-01
2 1 2003 5 7.111150e-01
3 1 2003 6 3.362420e-02
4 1 2003 7 7.716568e-02
5 1 2003 9 1.565615e-07
6 1 2004 4 2.740114e-01
7 1 2004 5 6.466382e-01
8 1 2004 6 3.149444e-02
9 1 2004 7 3.143545e-02
10 1 2004 9 1.642052e-02
The column V1 now holds each AU's share of total AT for a given SIC and FYEAR. Next,
a$V2 <- ave(a$V1, a$SIC, a$FYEAR, FUN = function(x) {t<-((sort(x, TRUE))[2]);
ifelse((x-t)> 0.1,1,0)});
SIC FYEAR AU V1 V2
1 1 2003 4 1.780949e-01 0
2 1 2003 5 7.111150e-01 1
3 1 2003 6 3.362420e-02 0
4 1 2003 7 7.716568e-02 0
5 1 2003 9 1.565615e-07 0
6 1 2004 4 2.740114e-01 0
7 1 2004 5 6.466382e-01 1
8 1 2004 6 3.149444e-02 0
9 1 2004 7 3.143545e-02 0
10 1 2004 9 1.642052e-02 0
The AU that has the highest percentage contribution to AT for a given SIC and FYEAR gets 1 if its difference from the second highest is greater than 10%; otherwise it gets 0.
Then I merge the result with original data dt.
dt <- merge(dt,a,by=c("SIC","FYEAR","AU"));
SIC FYEAR AU AT V1 V2
1 1 2003 4 1706.719 1.780949e-01 0
2 1 2003 4 100.835 1.780949e-01 0
3 1 2003 4 113.865 1.780949e-01 0
4 1 2003 4 1491.200 1.780949e-01 0
5 1 2003 5 3987.884 7.111150e-01 1
6 1 2003 5 100.696 7.111150e-01 1
7 1 2003 5 67.502 7.111150e-01 1
8 1 2003 5 9461.000 7.111150e-01 1
9 1 2003 5 9.159 7.111150e-01 1
10 1 2003 6 212.748 3.362420e-02 0
What I did is very cumbersome. Is there a better way to do the same stuff? Thanks.
I'm not sure if the deleted answer was the same as this, but you can effectively do it in a couple of lines.
# Simulate data
set.seed(1)
n<-1000
dt<-data.frame(SIC=sample(1:10,n,replace=TRUE),FYEAR=sample(2003:2007,n,replace=TRUE),
AU=sample(1:7,n,replace=TRUE),AT=abs(rnorm(n)))
# Calculate proportion.
dt$prop<-ave(dt$AT,dt$SIC,dt$FYEAR,FUN=prop.table)
# Find AU with max proportion.
dt$au.with.max.prop<-
ave(dt,dt$SIC,dt$FYEAR,FUN=function(x)x$AU[x$prop==max(x$prop)])[,1]
It is all in base, and avoids merge so it won't be that slow.
Here's a version using data.table:
require(data.table)
DT <- data.table(your_data_frame)
setkey(DT, SIC, FYEAR, AU)
DT[setkey(DT[, sum(AT), by=key(DT)][, V1 := V1/sum(V1),
by=list(SIC, FYEAR)])[, V2 := (V1 - V1[.N-1] > 0.1) * 1,
by=list(SIC, FYEAR)]]
The part DT[, sum(AT), by=key(DT)][, V1 := V1/sum(V1), by=list(SIC, FYEAR)] first sums AT by all three key columns and then, by reference, replaces V1 with V1/sum(V1) within each SIC and FYEAR. The setkey wrapping this code orders all four columns. Therefore, the last but one value will always be the second highest value (provided there are no duplicated values). Using this, we can create V2 by reference with [, V2 := (V1 - V1[.N-1] > 0.1) * 1, by=list(SIC, FYEAR)]. Once we have this, we can perform a join using DT[.].
Hope this helps.
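For comparison, the steps described in the question (sum AT per AU, convert to shares, flag a lead of more than 10 percentage points over the second-highest share) can be sketched in base R without plyr. The small data frame below is made up for illustration and is not the asker's data; the column names V1 and V2 follow the question.

```r
# Made-up example data, not the data from the question
dt <- data.frame(
  SIC   = 1,
  FYEAR = c(2003, 2003, 2003, 2003, 2003, 2004, 2004, 2004),
  AU    = c(4, 4, 5, 5, 6, 4, 5, 5),
  AT    = c(100, 50, 600, 200, 50, 480, 300, 220)
)

# Total AT for every SIC/FYEAR/AU combination
a <- aggregate(AT ~ SIC + FYEAR + AU, data = dt, FUN = sum)

# Share of each AU within its SIC/FYEAR group
a$V1 <- ave(a$AT, a$SIC, a$FYEAR, FUN = prop.table)

# 1 if the share beats the second-highest share by more than 0.1, else 0
# (a group with a single AU has no second-highest share and yields NA)
a$V2 <- ave(a$V1, a$SIC, a$FYEAR, FUN = function(x) {
  second <- sort(x, decreasing = TRUE)[2]
  as.numeric(x - second > 0.1)
})

# Drop the summed AT so it does not clash with the row-level AT, then merge back
a$AT <- NULL
dt <- merge(dt, a, by = c("SIC", "FYEAR", "AU"))
dt
```

In 2003 the shares are 0.15, 0.80 and 0.05, so only AU 5 is flagged; in 2004 the shares are 0.48 and 0.52, a gap below the threshold, so no AU is flagged.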
