I'm trying to remove all rows that have a duplicated value. In the example below, I want to remove both rows that have a 2 and all three rows that have a 6 in the x column. I have tried xy[!duplicated(xy$x), ], but this still keeps the first row of each duplicated group, whereas I want none of them.
x <- c(1,2,2,4,5,6,6,6)
y <- c(1888,1999,2000,2001,2004,2005,2010,2011)
xy <- as.data.frame(cbind(x,y))
xy
x y
1 1 1888
2 2 1999
3 2 2000
4 4 2001
5 5 2004
6 6 2005
7 6 2010
8 6 2011
What I want is
x y
1 1888
4 2001
5 2004
Any help is appreciated. I need to avoid specifying the value to get rid of since I am dealing with a dataframe with thousands of records.
We can do
xy[! xy$x %in% unique(xy[duplicated(xy$x), "x"]), ]
# x y
#1 1 1888
#4 4 2001
#5 5 2004
as
unique(xy[duplicated(xy$x), "x"])
gives the values of x that are duplicated. Then we can just filter those out.
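The same filter can also be written with two calls to duplicated(), one scanning from each end, so that every member of a repeated group is flagged. A minimal sketch on the question's data (the name is_dup is only illustrative):

```r
x <- c(1, 2, 2, 4, 5, 6, 6, 6)
y <- c(1888, 1999, 2000, 2001, 2004, 2005, 2010, 2011)
xy <- data.frame(x, y)

# duplicated() marks every occurrence after the first; with fromLast = TRUE
# it marks every occurrence before the last. Their union marks all rows
# whose x value appears more than once.
is_dup <- duplicated(xy$x) | duplicated(xy$x, fromLast = TRUE)
result <- xy[!is_dup, ]
result
```

This avoids building the intermediate vector of duplicated values, which may matter little for a few thousand rows but keeps the expression short.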
You can count and include only the singletons
xy[1==ave(xy$x,xy$x,FUN=length),]
x y
1 1 1888
4 4 2001
5 5 2004
Or like this:
xy[xy$x %in% names(which(table(xy$x)==1)),]
x y
1 1 1888
4 4 2001
5 5 2004
I'm trying to add the values from one column of a dataframe to another dataframe by matching values from one column or another.
For example:
I have 2 dataframes of different lengths, and df2 does not have all the pairs listed in df1:
df1
  Year Territory Pair_ID
1 1999       BGD     1 5
2 2000       TAR     6 2
3 2001       JAM     3 7
4 2002       TER     9 2
df2
  ID1 ID2 pair pair1 type detail
1   1   5  1 5   5 1   PO    N/A
2   2   6  2 6   6 2   SB    N/A
3   3   7  3 7   7 3   PO    N/A
4   4   8  4 8   8 4   SB    N/A
5   4   3  4 3   3 4   SB    N/A
I want this:
  Year Territory Pair_ID type
1 1999       BGD     1 5   PO
2 2000       TAR     6 2   SB
3 2001       JAM     3 7   PO
4 2002       TER     9 2  N/A
I don't want to completely merge the 2 dataframes. I just want to add the "type" column from df2 to df1 by matching the "Pair_ID" column from df1 to either the "pair" column or the "pair1" column in df2. I would also like it to fill in "N/A" for pairs that are not found in df2.
I could not find anything that addresses this specific problem.
I've tried this:
df1$type <- df2$type[match(df1$Pair_ID, c(df2$pair,df2$pair1))]
But it only matches with the "pair" column and ignores the "pair1" column.
Good case for sqldf:
library(sqldf)
sqldf("select df1.Year
,df1.Territory
,df1.Pair_ID
,df2.type
from df1
left join df2
on df1.Pair_ID = df2.pair
or df1.Pair_ID = df2.pair1
")
Results
  Year Territory Pair_ID type
1 1999       BGD     1 5   PO
2 2000       TAR     6 2   SB
3 2001       JAM     3 7   PO
4 2002       TER     9 2 <NA>
Try something like
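# look up each Pair_ID in both pair columns of df2
typeA <- df2$type[match(df1$Pair_ID, df2$pair)]
typeB <- df2$type[match(df1$Pair_ID, df2$pair1)]
# prefer the match on pair, fall back to the match on pair1
df1$type <- ifelse(is.na(typeA), typeB, typeA)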
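As for why the attempt in the question seems to ignore pair1: match() against c(df2$pair, df2$pair1) returns positions in the concatenated vector, so a hit in pair1 lands at a position greater than nrow(df2), and indexing df2$type there gives NA. A minimal sketch, with the data rebuilt by hand from the tables in the question (treating the pair columns as character strings is an assumption about the real data):

```r
# Data retyped from the question's tables; column types are assumed
df1 <- data.frame(Year = 1999:2002,
                  Territory = c("BGD", "TAR", "JAM", "TER"),
                  Pair_ID = c("1 5", "6 2", "3 7", "9 2"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(pair  = c("1 5", "2 6", "3 7", "4 8", "4 3"),
                  pair1 = c("5 1", "6 2", "7 3", "8 4", "3 4"),
                  type  = c("PO", "SB", "PO", "SB", "SB"),
                  stringsAsFactors = FALSE)

# Positions refer to the concatenated vector, i.e. 1 .. 2 * nrow(df2)
pos <- match(df1$Pair_ID, c(df2$pair, df2$pair1))
pos

# Wrap the positions back into 1 .. nrow(df2) before indexing df2$type;
# NA positions (no match in either column) stay NA
df1$type <- df2$type[(pos - 1) %% nrow(df2) + 1]
df1
```

The two-step version with typeA and typeB is easier to read; the wrapped index is only meant to show where the NA values came from.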
I keep getting a 'subscript out of bounds' error when I try to populate a matrix using a for loop that I have scripted below. My data are a large csv file that look similar to the following dummy dataset:
Sample k3 Year
1 B92028UUU 1 1990
2 B93001UUU 1 1993
3 B93005UUU 1 1993
4 B93006UUU 1 1993
5 B93010UUU 1 1993
6 B93011UUU 1 1994
7 B93022UUU 1 1994
8 B93035UUU 1 2014
9 B93036UUU 1 2014
10 B95015UUU 2 2013
11 B95016UUU 2 2013
12 B98027UUU 2 1990
13 B05005FUS 2 1990
14 B06006FIS 2 2001
15 B06010MUS 2 2001
16 B05023FUN 2 2001
17 B05024FUN 3 2001
18 B05025FIN 3 2001
19 B05034MMN 3 2002
20 B05037MMS 3 1996
21 B05041MUN 3 1996
22 B06047FUS 3 2007
23 B05048MUS 3 2000
24 B06059FUS 3 2000
25 B05063MUN 3 2000
My script is as follows:
Year.Matrix = matrix(1:75,nrow=25,byrow=T)
colnames(Year.Matrix)=c("Group 1","Group 2","Group 3")
rownames(Year.Matrix)=1990:2014
for(i in 1:3){
x=subset(data2,k3==i)
for(j in 1990:2014){
y=subset(x,Year==j)
z=nrow(y)
Year.Matrix[j,i]=z
}
}
Not sure why I am getting the error message but from other posts I gather that the issue arises when I try to populate my matrix, and perhaps because I do not have an entry for each year from each of my k3 levels?
Any commentary would be helpful!
No need to use a loop here. You are just computing length by year and k3 columns:
library(data.table)
setDT(dat)[,.N,"Year,k3"]
Year k3 N
1: 1990 1 1
2: 1993 1 4
3: 1994 1 2
4: 2014 1 2
5: 2013 2 2
6: 1990 2 2
7: 2001 2 3
8: 2001 3 2
9: 2002 3 1
10: 1996 3 2
11: 2007 3 1
12: 2000 3 3
You can also do this with dplyr:
library(dplyr)
dat %>%
group_by(Year, k3) %>%
summarize(N=n())
Not sure what you are trying to do, but as Hubert L said, the row index you use to populate Year.Matrix has to run over 1, 2, 3, ..., 25, because the matrix has only 25 rows. Since you loop with (j in 1990:2014), j takes the values 1990, 1991, 1992, ..., 2014, which are far outside that range.
To fix this, offset your row index as shown further below. Here is your for loop with print statements added:
for(i in 1:3){
print(i)
x=subset(data2,k3==i)
for(j in 1990:2014){
print(j)
y=subset(x,Year==j)
z=nrow(y)
Year.Matrix[j,i]=z
}
}
Keep using print statements to debug your code. Running this loop immediately shows that you are about to index Year.Matrix[1990,1], which throws the out-of-bounds error.
Fix the for loop by offsetting the index:
for(i in 1:3){
print(i)
x=subset(data2,k3==i)
for(j in 1990:2014){
print(j)
y=subset(x,Year==j)
z=nrow(y)
Year.Matrix[j-1990+1,i]=z
}
}
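Since Year.Matrix already has the years as row names, the counts can also be produced without any index arithmetic. A minimal sketch, assuming a small made-up data2 with the same k3 and Year columns as in the question; it uses table() on factors with fixed levels (a different technique from the loop above), so that years with no samples still get a row of zeros:

```r
# A few rows in the shape of the question's data (values are illustrative)
data2 <- data.frame(k3   = c(1, 1, 1, 2, 2, 3),
                    Year = c(1990, 1993, 1993, 2013, 1990, 2001))

# Fixing the factor levels keeps one row per year and one column per group,
# including combinations that never occur in the data
counts <- table(factor(data2$Year, levels = 1990:2014),
                factor(data2$k3, levels = 1:3))

# Copy the counts into a plain matrix with the question's dimnames
Year.Matrix <- matrix(counts, nrow = 25,
                      dimnames = list(1990:2014, paste("Group", 1:3)))

# Rows can now be addressed by year, as a character row name
Year.Matrix["1993", "Group 1"]
```

Inside a loop, the same idea works with Year.Matrix[as.character(j), i], which looks the row up by name instead of by position.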
I am trying to loop the merging of two dataframes over multiple columns, but I'm having trouble with the code and haven't been able to find any answers on SO. Here are some example data frames:
box <- c(5,7,2)
year <- c(1999,1999,1999)
rep5 <- c(5,5,5)
rep7 <- c(7,7,7)
rep2 <- c(2,2,2)
df1 <- data.frame(box,year,rep5,rep7,rep2)
box1 <- c(5,5,5,5,7,7,7,7,2,2,2,2)
box2 <- c(5,7,2,5,5,7,2,4,5,7,2,9)
year2 <- c(1999,1999,1999,2000,1999,1999,1999,1999,1999,1999,1999,1999)
distance <- c(0,100,200,0,100,0,300,200,200,300,0,300)
df2 <- data.frame(box1,box2,year2,distance)
df1
box year rep5 rep7 rep2
1 5 1999 5 7 2
2 7 1999 5 7 2
3 2 1999 5 7 2
df2
box1 box2 year2 distance
1 5 5 1999 0
2 5 7 1999 100
3 5 2 1999 200
4 5 5 2000 0
5 7 5 1999 100
6 7 7 1999 0
7 7 2 1999 300
8 7 4 1999 200
9 2 5 1999 200
10 2 7 1999 300
11 2 2 1999 0
12 2 9 1999 300
What I am trying to do is get the distance information from df2 into df1, with df1 year matched to df2 year, df1 box matched to df2 box1, and df1 rep[i] matched to df2 box2. I can do this for a single df1 rep[i] column as follows:
merge(df1, df2, by.x=c("box", "rep5", "year"), by.y=c("box1", "box2", "year2"), all.x = TRUE)
this gives the desired output:
box rep5 year rep7 rep2 distance
1 2 5 1999 7 2 200
2 5 5 1999 7 2 0
3 7 5 1999 7 2 100
However, in order to save doing this for each rep[i] column individually (I have a lot of these columns in the real data set), I'd like to be able to loop over those columns. Here is the code I have tried to do that:
reps <- c(df1$rep7, df1$rep2)
df3 <- for (i in reps) {merge(df1, df2, by.x=c("box", i, "year"), by.y=c("box1", "box2", "year2"), all.x = TRUE)}
df3
When I run that code, I get the error "Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column." I also tried defining
reps <- c("rep7", "rep2")
When I run the same code using that definition, I get the result that df3 is NULL.
The output that I want (with the distance column renamed for clarity) is:
box year rep5 rep7 rep2 dist5 dist7 dist2
1 2 1999 5 7 2 200 300 0
2 5 1999 5 7 2 0 100 200
3 7 1999 5 7 2 100 0 300
What am I doing wrong? Any help you can give me would be very much appreciated!
My R life became so much easier when I learned about the libraries dplyr and tidyr, and the concept of tidy data sets. What you're trying to do above can be expressed as a pivot, and is pretty easy to do with dplyr and tidyr.
I'm assuming what you really want, is to turn df2:
box1 box2 year2 distance
1 5 5 1999 0
2 5 7 1999 100
3 5 2 1999 200
4 5 5 2000 0
5 7 5 1999 100
6 7 7 1999 0
7 7 2 1999 300
8 7 4 1999 200
9 2 5 1999 200
10 2 7 1999 300
11 2 2 1999 0
12 2 9 1999 300
into your output, with all those strange repetitions removed:
box year dist5 dist7 dist2
1 2 1999 200 300 0
2 5 1999 0 100 200
3 7 1999 100 0 300
So you should pivot box2 into columns, with distance as the value, using tidyr:
library(tidyr)
box1 <- c(5,5,5,5,7,7,7,7,2,2,2,2)
box2 <- c(5,7,2,5,5,7,2,4,5,7,2,9)
year2 <- c(1999,1999,1999,2000,1999,1999,1999,1999,1999,1999,1999,1999)
distance <- c(0,100,200,0,100,0,300,200,200,300,0,300)
df2 <- data.frame(box1,box2,year2,distance)
# reshape it as desired
spread(df2, box2, distance,fill=0)
#Source: local data frame [4 x 7]
# box1 year2 2 4 5 7 9
#1 2 1999 0 0 200 300 300
#2 5 1999 200 0 0 100 0
#3 5 2000 0 0 0 0 0
#4 7 1999 300 200 100 0 0
My recommendation: learn to use dplyr and tidyr. It makes life so, so much easier.
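On the loop in the question itself: a for loop in R always returns NULL, so df3 <- for (...) {...} can never hold the merged result, and by.x needs column names rather than column values. If you want to keep df1's layout, one way is to fill one new column per iteration. A minimal sketch on the question's data, using match() on pasted keys instead of merge() (a swapped-in technique; the names key1, key2 and rep_col are only illustrative):

```r
df1 <- data.frame(box = c(5, 7, 2), year = 1999,
                  rep5 = 5, rep7 = 7, rep2 = 2)
df2 <- data.frame(
  box1     = c(5, 5, 5, 5, 7, 7, 7, 7, 2, 2, 2, 2),
  box2     = c(5, 7, 2, 5, 5, 7, 2, 4, 5, 7, 2, 9),
  year2    = c(1999, 1999, 1999, 2000, 1999, 1999, 1999, 1999,
               1999, 1999, 1999, 1999),
  distance = c(0, 100, 200, 0, 100, 0, 300, 200, 200, 300, 0, 300))

# One lookup key per df2 row: box1, box2 and year2 pasted together
key2 <- paste(df2$box1, df2$box2, df2$year2)

df3 <- df1
for (rep_col in c("rep5", "rep7", "rep2")) {
  # The matching key for each df1 row, built from the current rep column
  key1 <- paste(df1$box, df1[[rep_col]], df1$year)
  # sub() turns "rep5" into "dist5" for the new column name
  df3[[sub("rep", "dist", rep_col)]] <- df2$distance[match(key1, key2)]
}
df3
```

Combinations that do not occur in df2 come out as NA rather than 0, which keeps a missing distance distinguishable from a distance of zero.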
Sample data
x <- data.frame(id=c(1,1,1,2,2,7,7,7,7),dna=c(232,424,5345,45345,45,345,4543,345345,4545))
y <- data.frame(id=c(1,1,1,2,2,7,7,7,7),year=c(2001,2002,2003,2005,2006,2000,2001,2002,2003))
merge(x,y,by="id") doesn't give a good solution, because it produces duplicates.
For the sample data above, a simple cbind(x,y) works, and this is what I'm after: just pairing each year with the corresponding id.
The problem arises when the two data.frames do not match, so that the data.frame containing the variable year is shorter. Something like this:
x <- data.frame(id=c(1,1,1,2,2,7,7,7,7),dna=c(232,424,5345,45345,45,345,4543,345345,4545))
y <- data.frame(id=c(1,1,1,2,2,7,7,7),year=c(2001,2002,2003,2005,2006,2000,2001,2002))
So I need to pair the two data.frames; the unmatched rows of data.frame x could get NAs, so that I can then remove those rows.
Desired output for the shorter sample data:
id year dna
1 1 2001 232
2 1 2002 424
3 1 2003 5345
4 2 2005 45345
5 2 2006 45
6 7 2000 345
7 7 2001 4543
8 7 2002 345345
You should add a record number to each id so you can work with merge:
x <- transform(x, rec = ave(id, id, FUN = seq_along))
y <- transform(y, rec = ave(id, id, FUN = seq_along))
merge(x, y, c("id", "rec"))
# id rec dna year
# 1 1 1 232 2001
# 2 1 2 424 2002
# 3 1 3 5345 2003
# 4 2 1 45345 2005
# 5 2 2 45 2006
# 6 7 1 345 2000
# 7 7 2 4543 2001
# 8 7 3 345345 2002
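To see why the record number is needed, it can help to compare the row counts of the two merges. A small sketch on the shorter sample data from the question (the names n_by_id and paired are only illustrative):

```r
x <- data.frame(id  = c(1, 1, 1, 2, 2, 7, 7, 7, 7),
                dna = c(232, 424, 5345, 45345, 45, 345, 4543, 345345, 4545))
y <- data.frame(id   = c(1, 1, 1, 2, 2, 7, 7, 7),
                year = c(2001, 2002, 2003, 2005, 2006, 2000, 2001, 2002))

# Merging on id alone pairs every x row with every y row of the same id
n_by_id <- nrow(merge(x, y, by = "id"))
n_by_id

# Number the rows within each id, then merge on id and that running number
x$rec <- ave(x$id, x$id, FUN = seq_along)
y$rec <- ave(y$id, y$id, FUN = seq_along)
paired <- merge(x, y, by = c("id", "rec"))
nrow(paired)
```

The fourth row of id 7 in x has no partner in y, so the default inner merge drops it; adding all.x = TRUE would keep it with an NA year instead.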
I have written some code and would appreciate suggestions for a better way to do what I am trying to do. The data dt is as follows:
SIC FYEAR AU AT
1 1 2003 6 212.748
2 1 2003 5 3987.884
3 1 2003 4 100.835
4 1 2003 4 1706.719
5 1 2003 5 9.159
6 1 2003 7 60.069
7 1 2003 5 100.696
8 1 2003 4 113.865
9 1 2003 6 431.552
10 1 2003 7 309.109 ...
My task is to create a new column: within each SIC and FYEAR, the AU with the highest percentage of AT gets the value 1 if the difference between the highest and the second-highest percentage is greater than 10%, and every other AU gets 0. Here is my attempt.
a <- ddply(dt,.(SIC,FYEAR),function(x){ddply(x,.(AU),function(x) sum(x$AT))});
SIC FYEAR AU V1
1 1 2003 4 3412.619
2 1 2003 5 13626.241
3 1 2003 6 644.300
4 1 2003 7 1478.633
5 1 2003 9 0.003
6 1 2004 4 3976.242
7 1 2004 5 9383.516
8 1 2004 6 457.023
9 1 2004 7 456.167
10 1 2004 9 238.282
where V1 represents the sum of AT over all the rows for a given AU within a given SIC and FYEAR. Next I do:
a$V1 <- ave(a$V1, a$SIC, a$FYEAR, FUN = function(x) x/sum(x));
SIC FYEAR AU V1
1 1 2003 4 1.780949e-01
2 1 2003 5 7.111150e-01
3 1 2003 6 3.362420e-02
4 1 2003 7 7.716568e-02
5 1 2003 9 1.565615e-07
6 1 2004 4 2.740114e-01
7 1 2004 5 6.466382e-01
8 1 2004 6 3.149444e-02
9 1 2004 7 3.143545e-02
10 1 2004 9 1.642052e-02
The column V1 now represents the percentage value for each AU for AT contribution for a given SIC, and FYEAR. Next,
a$V2 <- ave(a$V1, a$SIC, a$FYEAR, FUN = function(x) {t<-((sort(x, TRUE))[2]);
ifelse((x-t)> 0.1,1,0)});
SIC FYEAR AU V1 V2
1 1 2003 4 1.780949e-01 0
2 1 2003 5 7.111150e-01 1
3 1 2003 6 3.362420e-02 0
4 1 2003 7 7.716568e-02 0
5 1 2003 9 1.565615e-07 0
6 1 2004 4 2.740114e-01 0
7 1 2004 5 6.466382e-01 1
8 1 2004 6 3.149444e-02 0
9 1 2004 7 3.143545e-02 0
10 1 2004 9 1.642052e-02 0
Within each SIC and FYEAR, the AU with the highest percentage contribution to AT gets a 1 if it exceeds the second-highest contribution by more than 10%; every other AU gets 0.
Then I merge the result with original data dt.
dt <- merge(dt,a,by=c("SIC","FYEAR","AU"));
SIC FYEAR AU AT V1 V2
1 1 2003 4 1706.719 1.780949e-01 0
2 1 2003 4 100.835 1.780949e-01 0
3 1 2003 4 113.865 1.780949e-01 0
4 1 2003 4 1491.200 1.780949e-01 0
5 1 2003 5 3987.884 7.111150e-01 1
6 1 2003 5 100.696 7.111150e-01 1
7 1 2003 5 67.502 7.111150e-01 1
8 1 2003 5 9461.000 7.111150e-01 1
9 1 2003 5 9.159 7.111150e-01 1
10 1 2003 6 212.748 3.362420e-02 0
What I did is very cumbersome. Is there a better way to do the same stuff? Thanks.
I'm not sure if the deleted answer was the same as this, but you can effectively do it in a couple of lines.
# Simulate data
set.seed(1)
n<-1000
dt<-data.frame(SIC=sample(1:10,n,replace=TRUE),FYEAR=sample(2003:2007,n,replace=TRUE),
AU=sample(1:7,n,replace=TRUE),AT=abs(rnorm(n)))
# Calculate proportion.
dt$prop<-ave(dt$AT,dt$SIC,dt$FYEAR,FUN=prop.table)
# Find AU with max proportion.
dt$au.with.max.prop<-
ave(dt,dt$SIC,dt$FYEAR,FUN=function(x)x$AU[x$prop==max(x$prop)])[,1]
It is all in base, and avoids merge so it won't be that slow.
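For comparison with the steps in the question (sum AT per AU first, then compare shares), here is a base-only sketch that follows them one by one. It runs on a small made-up dt, and the 0.1 threshold is taken from the question; the names a and res are only illustrative:

```r
# Small made-up data in the shape of the question's dt
dt <- data.frame(SIC   = 1,
                 FYEAR = c(2003, 2003, 2003, 2003, 2004, 2004, 2004),
                 AU    = c(4, 5, 5, 6, 4, 5, 6),
                 AT    = c(10, 40, 30, 20, 45, 50, 5))

# Total AT per (SIC, FYEAR, AU)
a <- aggregate(AT ~ SIC + FYEAR + AU, data = dt, FUN = sum)

# Share of each AU within its (SIC, FYEAR) group
a$V1 <- ave(a$AT, a$SIC, a$FYEAR, FUN = prop.table)

# 1 if the share beats the group's second-highest share by more than 0.1
a$V2 <- ave(a$V1, a$SIC, a$FYEAR, FUN = function(v)
  as.numeric(v - sort(v, decreasing = TRUE)[2] > 0.1))

# Attach V1 and V2 to the original rows
res <- merge(dt, a[, c("SIC", "FYEAR", "AU", "V1", "V2")],
             by = c("SIC", "FYEAR", "AU"))
res
```

In this toy data only AU 5 in 2003 is flagged: its share is 0.7 against a runner-up of 0.2, while in 2004 the top two shares are 0.5 and 0.45. A group with a single AU would get NA, because there is no second-highest share to compare with.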
Here's a version using data.table:
require(data.table)
DT <- data.table(your_data_frame)
setkey(DT, SIC, FYEAR, AU)
agg <- DT[, sum(AT), by=key(DT)][, V1 := V1/sum(V1), by=list(SIC, FYEAR)]
setorder(agg, SIC, FYEAR, V1)
agg[, V2 := (V1 - V1[.N-1] > 0.1) * 1, by=list(SIC, FYEAR)]
DT[agg, on=c("SIC", "FYEAR", "AU")]
The part DT[, sum(AT), by=key(DT)][, V1 := V1/sum(V1), by=list(SIC, FYEAR)] first sums AT by all three columns and then replaces V1 with V1/sum(V1) within each SIC, FYEAR group by reference. setorder then sorts the result by SIC, FYEAR and V1, so within each group the last-but-one value is always the second highest (under the condition that every group has at least two AUs). Calling setkey without column names would not achieve this: it keys on SIC, FYEAR, AU, V1 in that order, which sorts each group by AU rather than by V1. Using the sorted order, we can create V2 by reference with [, V2 := (V1 - V1[.N-1] > 0.1) * 1, by=list(SIC, FYEAR)]. Once we have this, we join the result back to the original rows with DT[agg, on=c("SIC", "FYEAR", "AU")].
Hope this helps.