I keep getting a 'subscript out of bounds' error when I try to populate a matrix using a for loop that I have scripted below. My data are a large csv file that look similar to the following dummy dataset:
Sample k3 Year
1 B92028UUU 1 1990
2 B93001UUU 1 1993
3 B93005UUU 1 1993
4 B93006UUU 1 1993
5 B93010UUU 1 1993
6 B93011UUU 1 1994
7 B93022UUU 1 1994
8 B93035UUU 1 2014
9 B93036UUU 1 2014
10 B95015UUU 2 2013
11 B95016UUU 2 2013
12 B98027UUU 2 1990
13 B05005FUS 2 1990
14 B06006FIS 2 2001
15 B06010MUS 2 2001
16 B05023FUN 2 2001
17 B05024FUN 3 2001
18 B05025FIN 3 2001
19 B05034MMN 3 2002
20 B05037MMS 3 1996
21 B05041MUN 3 1996
22 B06047FUS 3 2007
23 B05048MUS 3 2000
24 B06059FUS 3 2000
25 B05063MUN 3 2000
My script is as follows:
Year.Matrix = matrix(1:75,nrow=25,byrow=T)
colnames(Year.Matrix)=c("Group 1","Group 2","Group 3")
rownames(Year.Matrix)=1990:2014
for(i in 1:3){
x=subset(data2,k3==i)
for(j in 1990:2014){
y=subset(x,Year==j)
z=nrow(y)
Year.Matrix[j,i]=z
}
}
Not sure why I am getting the error message but from other posts I gather that the issue arises when I try to populate my matrix, and perhaps because I do not have an entry for each year from each of my k3 levels?
Any commentary would be helpful!
No need to use a loop here. You are just counting rows for each combination of Year and k3:
library(data.table)
setDT(dat)[,.N,"Year,k3"]
Year k3 N
1: 1990 1 1
2: 1993 1 4
3: 1994 1 2
4: 2014 1 2
5: 2013 2 2
6: 1990 2 2
7: 2001 2 3
8: 2001 3 2
9: 2002 3 1
10: 1996 3 2
11: 2007 3 1
12: 2000 3 3
You can also do this with dplyr:
library(dplyr)
dat %>%
group_by(Year, k3) %>%
summarize(N=n())
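If the goal is the full 25 x 3 matrix from the question, including years with no rows, base R's table() on factors with explicit levels may be enough. This is a minimal sketch, assuming the data frame is called data2 as in the question. Only a handful of rows from the dummy data are reproduced here.

```r
# A few rows of the dummy data from the question (k3 and Year only)
data2 <- data.frame(
  k3   = c(1, 1, 1, 2, 2, 3, 3),
  Year = c(1990, 1993, 1993, 2013, 1990, 2001, 2001)
)

# Explicit factor levels force one row per year and one column per group,
# so combinations that never occur are counted as 0
Year.Matrix <- table(
  factor(data2$Year, levels = 1990:2014),
  factor(data2$k3, levels = 1:3)
)

Year.Matrix["1993", "1"]  # count for Year 1993 in group 1
```

The result is indexed by the year as a character string, which avoids any confusion between a year value and a row position.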
Not sure what you are trying to do, but as Hubert L said, the row index you use while populating Year.Matrix must be an integer position such as 1, 2, 3, ... Because you loop with (j in 1990:2014), j takes the values 1990, 1991, 1992, ..., 2014, which are far beyond the 25 rows of the matrix.
To fix this, offset your row index as shown further down. Your for loop, with print statements added:
for(i in 1:3){
print(i)
x=subset(data2,k3==i)
for(j in 1990:2014){
print(j)
y=subset(x,Year==j)
z=nrow(y)
Year.Matrix[j,i]=z
}
}
Keep using print statements to debug your code. Running this loop immediately shows that you are about to index Year.Matrix[1990,1], which throws the 'subscript out of bounds' error.
Fix this for loop by offsetting the index as:
for(i in 1:3){
print(i)
x=subset(data2,k3==i)
for(j in 1990:2014){
print(j)
y=subset(x,Year==j)
z=nrow(y)
Year.Matrix[j-1990+1,i]=z
}
}
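Since the matrix in the question has the years as row names, another option is to index by name instead of computing an offset. This is a sketch under the assumption that data2 has the columns k3 and Year. The few data rows are copied from the dummy dataset, and the matrix is initialised with zeros rather than 1:75.

```r
# A few rows of the dummy data from the question
data2 <- data.frame(
  k3   = c(1, 1, 1, 2, 2, 3, 3),
  Year = c(1990, 1993, 1993, 2013, 1990, 2001, 2001)
)

Year.Matrix <- matrix(
  0, nrow = 25, ncol = 3,
  dimnames = list(as.character(1990:2014),
                  c("Group 1", "Group 2", "Group 3"))
)

for (i in 1:3) {
  for (j in 1990:2014) {
    # A character index is matched against the row names, whereas the number
    # 1990 would be read as "row 1990", which does not exist
    Year.Matrix[as.character(j), i] <- sum(data2$k3 == i & data2$Year == j)
  }
}

Year.Matrix["1993", "Group 1"]
```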
I'm trying to remove all rows that have a duplicate value. Hence, in the example I want to remove both rows that have a 2 and the three rows that have 6 under the x column. I have tried df[!duplicated(xy$x), ] however this still gives me the first row that duplicates, where I do not want either row.
x <- c(1,2,2,4,5,6,6,6)
y <- c(1888,1999,2000,2001,2004,2005,2010,2011)
xy <- as.data.frame(cbind(x,y))
xy
x y
1 1 1888
2 2 1999
3 2 2000
4 4 2001
5 5 2004
6 6 2005
7 6 2010
8 6 2011
What I want is
x y
1 1888
4 2001
5 2004
Any help is appreciated. I need to avoid specifying the value to get rid of since I am dealing with a dataframe with thousands of records.
We can do:
xy[! xy$x %in% unique(xy[duplicated(xy$x), "x"]), ]
# x y
#1 1 1888
#4 4 2001
#5 5 2004
This works because
unique(xy[duplicated(xy$x), "x"])
gives the values of x that are duplicated. Then we can just filter those out.
You can count and include only the singletons
xy[1==ave(xy$x,xy$x,FUN=length),]
x y
1 1 1888
4 4 2001
5 5 2004
Or like this:
xy[xy$x %in% names(which(table(xy$x)==1)),]
x y
1 1 1888
4 4 2001
5 5 2004
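Another common base R idiom is to call duplicated() twice, once from each end, so that every member of a duplicated group is flagged. A minimal sketch using the xy data from the question (the names dup and singles are only for illustration):

```r
x <- c(1, 2, 2, 4, 5, 6, 6, 6)
y <- c(1888, 1999, 2000, 2001, 2004, 2005, 2010, 2011)
xy <- data.frame(x, y)

# duplicated() marks every occurrence after the first; scanning again with
# fromLast = TRUE marks every occurrence before the last. Together they
# flag all rows whose x value appears more than once.
dup <- duplicated(xy$x) | duplicated(xy$x, fromLast = TRUE)

singles <- xy[!dup, ]
singles
```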
I am trying to loop the merging of two dataframes over multiple columns, but I'm having trouble with the code and haven't been able to find any answers on SO. Here are some example data frames:
box <- c(5,7,2)
year <- c(1999,1999,1999)
rep5 <- c(5,5,5)
rep7 <- c(7,7,7)
rep2 <- c(2,2,2)
df1 <- data.frame(box,year,rep5,rep7,rep2)
box1 <- c(5,5,5,5,7,7,7,7,2,2,2,2)
box2 <- c(5,7,2,5,5,7,2,4,5,7,2,9)
year2 <- c(1999,1999,1999,2000,1999,1999,1999,1999,1999,1999,1999,1999)
distance <- c(0,100,200,0,100,0,300,200,200,300,0,300)
df2 <- data.frame(box1,box2,year2,distance)
df1
box year rep5 rep7 rep2
1 5 1999 5 7 2
2 7 1999 5 7 2
3 2 1999 5 7 2
df2
box1 box2 year2 distance
1 5 5 1999 0
2 5 7 1999 100
3 5 2 1999 200
4 5 5 2000 0
5 7 5 1999 100
6 7 7 1999 0
7 7 2 1999 300
8 7 4 1999 200
9 2 5 1999 200
10 2 7 1999 300
11 2 2 1999 0
12 2 9 1999 300
What I am trying to do is get the distance information from df2 into df1, with df1 year matched to df2 year, df1 box matched to df2 box1, and df1 rep[i] matched to df2 box2. I can do this for a single df1 rep[i] column as follows:
merge(df1, df2, by.x=c("box", "rep5", "year"), by.y=c("box1", "box2", "year2"), all.x = TRUE)
this gives the desired output:
box rep5 year rep7 rep2 distance
1 2 5 1999 7 2 200
2 5 5 1999 7 2 0
3 7 5 1999 7 2 100
However, in order to save doing this for each rep[i] column individually (I have a lot of these columns in the real data set), I'd like to be able to loop over those columns. Here is the code I have tried to do that:
reps <- c(df1$rep7, df1$rep2)
df3 <- for (i in reps) {merge(df1, df2, by.x=c("box", i, "year"), by.y=c("box1", "box2", "year2"), all.x = TRUE)}
df3
When I run that code, I get the error "Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column." I also tried defining
reps <- c("rep7", "rep2")
When I run the same code using that definition, I get the result that df3 is NULL.
The output that I want (with the distance column renamed for clarity) is:
box year rep5 rep7 rep2 dist5 dist7 dist2
1 2 1999 5 7 2 200 300 0
2 5 1999 5 7 2 0 100 200
3 7 1999 5 7 2 100 0 300
What am I doing wrong? Any help you can give me would be very much appreciated!
My R life became so much easier when I learned about the libraries dplyr and tidyr, and the concept of tidy data sets. What you're trying to do above can be expressed as a pivot, and is pretty easy to do with dplyr and tidyr.
I'm assuming what you really want, is to turn df2:
box1 box2 year2 distance
1 5 5 1999 0
2 5 7 1999 100
3 5 2 1999 200
4 5 5 2000 0
5 7 5 1999 100
6 7 7 1999 0
7 7 2 1999 300
8 7 4 1999 200
9 2 5 1999 200
10 2 7 1999 300
11 2 2 1999 0
12 2 9 1999 300
into your output, with all those strange repetitions removed:
box year dist5 dist7 dist2
1 2 1999 200 300 0
2 5 1999 0 100 200
3 7 1999 100 0 300
So you should pivot box2 into columns, with your distance as the value. Using tidyr:
library(tidyr)
box1 <- c(5,5,5,5,7,7,7,7,2,2,2,2)
box2 <- c(5,7,2,5,5,7,2,4,5,7,2,9)
year2 <- c(1999,1999,1999,2000,1999,1999,1999,1999,1999,1999,1999,1999)
distance <- c(0,100,200,0,100,0,300,200,200,300,0,300)
df2 <- data.frame(box1,box2,year2,distance)
# reshape it as desired
spread(df2, box2, distance,fill=0)
#Source: local data frame [4 x 7]
# box1 year2 2 4 5 7 9
#1 2 1999 0 0 200 300 300
#2 5 1999 200 0 0 100 0
#3 5 2000 0 0 0 0 0
#4 7 1999 300 200 100 0 0
My recommendation: learn to use dplyr and tidyr. It makes life so, so much easier.
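For reference, the loop from the question can also be made to work in base R. Two things appear to go wrong in the original attempt: by.x needs column names (so reps must be a character vector such as "rep7", not the column values), and a for loop itself always returns NULL, so the result has to be assigned inside the loop. The following is a sketch of one way to do it. The dist5/dist7/dist2 names follow the desired output in the question, and df3 is built up one merge at a time.

```r
df1 <- data.frame(box = c(5, 7, 2), year = c(1999, 1999, 1999),
                  rep5 = c(5, 5, 5), rep7 = c(7, 7, 7), rep2 = c(2, 2, 2))
df2 <- data.frame(
  box1     = c(5, 5, 5, 5, 7, 7, 7, 7, 2, 2, 2, 2),
  box2     = c(5, 7, 2, 5, 5, 7, 2, 4, 5, 7, 2, 9),
  year2    = c(1999, 1999, 1999, 2000, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999),
  distance = c(0, 100, 200, 0, 100, 0, 300, 200, 200, 300, 0, 300)
)

reps <- c("rep5", "rep7", "rep2")  # column names, not column values
df3 <- df1
for (r in reps) {
  # Merge on box, the current rep column, and year; keep every row of df3
  df3 <- merge(df3, df2,
               by.x = c("box", r, "year"),
               by.y = c("box1", "box2", "year2"),
               all.x = TRUE)
  # Rename the new column so the next merge does not clash with it
  names(df3)[names(df3) == "distance"] <- sub("^rep", "dist", r)
}

# Restore the column order used in the question
df3 <- df3[, c("box", "year", reps, sub("^rep", "dist", reps))]
df3
```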
I'm trying to calculate several new variables in my dataframe. Take initial values for example:
Say I have:
Dataset <- data.frame(time=rep(c(1990:1992),2),
geo=c(rep("AT",3),rep("DE",3)),var1=c(1:6), var2=c(7:12))
time geo var1 var2
1 1990 AT 1 7
2 1991 AT 2 8
3 1992 AT 3 9
4 1990 DE 4 10
5 1991 DE 5 11
6 1992 DE 6 12
And I want:
time geo var1 var2 var1_1990 var1_1991 var2_1990 var2_1991
1 1990 AT 1 7 1 2 7 8
2 1991 AT 2 8 1 2 7 8
3 1992 AT 3 9 1 2 7 8
4 1990 DE 4 10 4 5 10 11
5 1991 DE 5 11 4 5 10 11
6 1992 DE 6 12 4 5 10 11
So both time and the variable are changing for the new variables. Here is my attempt:
initialyears <- c(1990,1991)
initialvars <- c("var1", "var2")
# ideally, I want code where I only have to change these two vectors
# and where it's possible to change their dimensions
for (i in initialyears){
lapply(initialvars,function(x){
rep(Dataset[time==i,x],each=length(unique(Dataset$time)))
})}
Which runs without error but yields nothing. I would like to assign the variable names in the example (eg. "var1_1990") and immediately make the new variables part of the dataframe. I would also like to avoid the for loop but I don't know how to wrap two lapply's around this function. Should I rather have the function use two arguments? Is the problem that the apply function does not carry the results into my environment? I've been stuck here for a while so I would be grateful for any help!
p.s.: I have the solution to do this combination by combination without apply and the likes but I'm trying to get away from copy and paste:
Dataset$var1_1990 <- c(rep(Dataset$var1[which(Dataset$time==1990)],
each=length(unique(Dataset$time))))
This can be done with subset(), reshape(), and merge():
merge(Dataset,reshape(subset(Dataset,time%in%c(1990,1991)),direction='wide',idvar='geo',sep='_'))
## geo time var1 var2 var1_1990 var2_1990 var1_1991 var2_1991
## 1 AT 1990 1 7 1 7 2 8
## 2 AT 1991 2 8 1 7 2 8
## 3 AT 1992 3 9 1 7 2 8
## 4 DE 1990 4 10 4 10 5 11
## 5 DE 1991 5 11 4 10 5 11
## 6 DE 1992 6 12 4 10 5 11
The column order isn't exactly what you have in your question, but you can fix that up after-the-fact with an index operation, if necessary.
Here's a data.table method:
require(data.table)
dt <- as.data.table(Dataset)
in_cols = c("var1", "var2")
out_cols = do.call("paste", c(CJ(in_cols, unique(dt$time)), sep="_"))
dt[, (out_cols) := unlist(lapply(.SD, as.list), FALSE), by=geo, .SDcols=in_cols]
# time geo var1 var2 var1_1990 var1_1991 var1_1992 var2_1990 var2_1991 var2_1992
# 1: 1990 AT 1 7 1 2 3 7 8 9
# 2: 1991 AT 2 8 1 2 3 7 8 9
# 3: 1992 AT 3 9 1 2 3 7 8 9
# 4: 1990 DE 4 10 4 5 6 10 11 12
# 5: 1991 DE 5 11 4 5 6 10 11 12
# 6: 1992 DE 6 12 4 5 6 10 11 12
This assumes that the time variable is identical (and in the same order) for each geo value.
With dplyr and tidyr and using a custom function try the following:
Data
Dataset <- data.frame(time=rep(c(1990:1992),2),
geo=c(rep("AT",3),rep("DE",3)),var1=c(1:6), var2=c(7:12))
Code
library(dplyr); library(tidyr)
intitialyears <- c(1990,1991)
intitialvars <- c("var1", "var2")
#create this function
myTranForm <- function(dataSet, varName, years){
temp <- dataSet %>% select(time, geo, eval(parse(text=varName))) %>%
filter(time %in% years) %>% mutate(time=paste(varName, time, sep="_"))
names(temp)[names(temp) %in% varName] <- "someRandomStringForVariableName"
temp <- temp %>% spread(time, someRandomStringForVariableName)
return(temp)
}
#Then lapply on intitialvars using the custom function
DatasetList <- lapply(intitialvars, function(x) myTranForm(Dataset, x, intitialyears))
#and loop over the data frames in the list
for(i in 1:length(intitialvars)){
Dataset <- left_join(Dataset, DatasetList[[i]])
}
Dataset
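A plain base R loop also does the job if only the two vectors of years and variable names should need editing. This is a minimal sketch. It uses match() on geo to look up each group's initial value, so it does not assume that rows are ordered the same way within every geo. The helper name init is illustrative.

```r
Dataset <- data.frame(time = rep(1990:1992, 2),
                      geo  = c(rep("AT", 3), rep("DE", 3)),
                      var1 = 1:6, var2 = 7:12)

initialyears <- c(1990, 1991)
initialvars  <- c("var1", "var2")

for (v in initialvars) {
  for (y in initialyears) {
    # Rows holding the value of v in year y, one per geo
    init <- Dataset[Dataset$time == y, c("geo", v)]
    # Look up each row's geo in init and assign the value to a new column
    # named e.g. "var1_1990"
    Dataset[[paste(v, y, sep = "_")]] <- init[[v]][match(Dataset$geo, init$geo)]
  }
}

Dataset
```

Assigning with `Dataset[[name]] <- ...` inside the loop is what makes the new columns part of the data frame. The lapply() in the question computes values but never stores them.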
I have written some code. It would be great if you could suggest a better way of doing what I am trying to do. The dt is given as follows:
SIC FYEAR AU AT
1 1 2003 6 212.748
2 1 2003 5 3987.884
3 1 2003 4 100.835
4 1 2003 4 1706.719
5 1 2003 5 9.159
6 1 2003 7 60.069
7 1 2003 5 100.696
8 1 2003 4 113.865
9 1 2003 6 431.552
10 1 2003 7 309.109 ...
My job is to create a new column: for a given SIC and FYEAR, the AU that has the highest percentage of AT gets the value 1 if the difference between the highest and second-highest percentages is large enough (greater than 10%), and every other AU gets 0. Here is my attempt to do this.
a <- ddply(dt,.(SIC,FYEAR),function(x){ddply(x,.(AU),function(x) sum(x$AT))});
SIC FYEAR AU V1
1 1 2003 4 3412.619
2 1 2003 5 13626.241
3 1 2003 6 644.300
4 1 2003 7 1478.633
5 1 2003 9 0.003
6 1 2004 4 3976.242
7 1 2004 5 9383.516
8 1 2004 6 457.023
9 1 2004 7 456.167
10 1 2004 9 238.282
where V1 represents the sum of AT over all the rows for a given AU within a given SIC and FYEAR. Next I do:
a$V1 <- ave(a$V1, a$SIC, a$FYEAR, FUN = function(x) x/sum(x));
SIC FYEAR AU V1
1 1 2003 4 1.780949e-01
2 1 2003 5 7.111150e-01
3 1 2003 6 3.362420e-02
4 1 2003 7 7.716568e-02
5 1 2003 9 1.565615e-07
6 1 2004 4 2.740114e-01
7 1 2004 5 6.466382e-01
8 1 2004 6 3.149444e-02
9 1 2004 7 3.143545e-02
10 1 2004 9 1.642052e-02
The column V1 now represents the percentage value for each AU for AT contribution for a given SIC, and FYEAR. Next,
a$V2 <- ave(a$V1, a$SIC, a$FYEAR, FUN = function(x) {t<-((sort(x, TRUE))[2]);
ifelse((x-t)> 0.1,1,0)});
SIC FYEAR AU V1 V2
1 1 2003 4 1.780949e-01 0
2 1 2003 5 7.111150e-01 1
3 1 2003 6 3.362420e-02 0
4 1 2003 7 7.716568e-02 0
5 1 2003 9 1.565615e-07 0
6 1 2004 4 2.740114e-01 0
7 1 2004 5 6.466382e-01 1
8 1 2004 6 3.149444e-02 0
9 1 2004 7 3.143545e-02 0
10 1 2004 9 1.642052e-02 0
For a given SIC and FYEAR, the AU that has the highest percentage contribution to AT gets 1 if its lead over the second-highest is greater than 10%; otherwise it gets 0.
Then I merge the result with original data dt.
dt <- merge(dt,a,key=c("SIC","FYEAR","AU"));
SIC FYEAR AU AT V1 V2
1 1 2003 4 1706.719 1.780949e-01 0
2 1 2003 4 100.835 1.780949e-01 0
3 1 2003 4 113.865 1.780949e-01 0
4 1 2003 4 1491.200 1.780949e-01 0
5 1 2003 5 3987.884 7.111150e-01 1
6 1 2003 5 100.696 7.111150e-01 1
7 1 2003 5 67.502 7.111150e-01 1
8 1 2003 5 9461.000 7.111150e-01 1
9 1 2003 5 9.159 7.111150e-01 1
10 1 2003 6 212.748 3.362420e-02 0
What I did is very cumbersome. Is there a better way to do the same stuff? Thanks.
I'm not sure if the deleted answer was the same as this, but you can effectively do it in a couple of lines.
# Simulate data
set.seed(1)
n<-1000
dt<-data.frame(SIC=sample(1:10,n,replace=TRUE),FYEAR=sample(2003:2007,n,replace=TRUE),
AU=sample(1:7,n,replace=TRUE),AT=abs(rnorm(n)))
# Calculate proportion.
dt$prop<-ave(dt$AT,dt$SIC,dt$FYEAR,FUN=prop.table)
# Find AU with max proportion.
dt$au.with.max.prop<-
ave(dt,dt$SIC,dt$FYEAR,FUN=function(x)x$AU[x$prop==max(x$prop)])[,1]
It is all in base, and avoids merge so it won't be that slow.
Here's a version using data.table:
require(data.table)
DT <- data.table(your_data_frame)
setkey(DT, SIC, FYEAR, AU)
DT[setkey(DT[, sum(AT), by=key(DT)][, V1 := V1/sum(V1),
by=list(SIC, FYEAR)])[, V2 := (V1 - V1[.N-1] > 0.1) * 1,
by=list(SIC, FYEAR)]]
The part DT[, sum(AT), by=key(DT)][, V1 := V1/sum(V1), by=list(SIC, FYEAR)] first sums AT by all three key columns and then, by reference, replaces V1 with V1/sum(V1) within each SIC, FYEAR group. The setkey wrapping this code orders by all four columns. Therefore, within each group, the last-but-one value will always be the second-highest value (under the condition that there are no duplicated values). Using this, we can create V2 by reference with [, V2 := (V1 - V1[.N-1] > 0.1) * 1, by=list(SIC, FYEAR)]. Once we have this, we can perform a join by using DT[.].
Hope this helps.
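The three steps from the question (sum AT per AU, convert to shares, flag the leader) can also be written with aggregate() and ave() alone. This is a sketch on made-up numbers, not the asker's data. The column names share and V2 are illustrative, and the 0.1 threshold is taken from the question's own code.

```r
# Small made-up example: one SIC/FYEAR group with three AUs
dt <- data.frame(SIC = 1, FYEAR = 2003,
                 AU = c(4, 5, 5, 6),
                 AT = c(10, 50, 30, 10))

# Step 1: total AT per SIC, FYEAR and AU
a <- aggregate(AT ~ SIC + FYEAR + AU, data = dt, FUN = sum)

# Step 2: each AU's share of AT within its SIC/FYEAR group
a$share <- ave(a$AT, a$SIC, a$FYEAR, FUN = function(x) x / sum(x))

# Step 3: 1 if the share beats the second-highest share by more than 0.1
a$V2 <- ave(a$share, a$SIC, a$FYEAR, FUN = function(s) {
  second <- sort(s, decreasing = TRUE)[2]  # NA if the group has a single AU
  as.numeric(s - second > 0.1)
})

a
```

Groups with only one AU end up with NA in V2 here, because there is no second-highest share to compare against. How those should be treated is a decision the question does not settle.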
I have a data frame like this:
FisherID Year Month VesselID
1 2000 1 56
1 2000 1 81
1 2000 2 81
1 2000 3 81
1 2000 4 81
1 2000 5 81
1 2000 6 81
1 2000 7 81
1 2000 8 81
1 2000 9 81
1 2000 10 81
1 2001 1 56
1 2001 2 56
1 2001 3 81
1 2001 4 56
1 2001 5 56
1 2001 6 56
1 2001 7 56
1 2002 3 81
1 2002 4 81
1 2002 5 81
1 2002 6 81
1 2002 7 81
...and I need the number of times that the VesselID changes per year, so the output that I want is:
FisherID Year DiffVesselUsed
1 2000 1
1 2001 2
1 2002 0
I tried to get that using aggregate():
aggregate(VesselID, by=list(FisherID,Year,Month ), length)
but what I got was:
FisherID Year DiffVesselUsed
1 2000 2
1 2001 1
1 2002 1
because aggregate() counted those different vessels only when they appeared in the same month. I have tried different ways to aggregate without success. Any help will be deeply appreciated. Cheers, Rafael
First, a question: your expected output doesn't seem to reflect what you ask for. You ask for the number of times an ID changes per year, but your expected output seems to indicate that you want to know how many unique VesselIDs are observed per year. For example, in 2000, the ID changes once, and in 2001 the ID changes twice. In both years, two unique IDs are observed.
To count unique vessels: if you're looking for a statistic by FisherID and Year, then there's no reason to group by Month as well. Instead, you should look at the unique values of VesselID for each combination of FisherID and Year.
aggregate(VesselID, by = list(FisherID, Year), function(x) length(unique(x)))
# Group.1 Group.2 x
# 1 1 2000 2
# 2 1 2001 2
# 3 1 2002 1
If you really want the number of times ID changes, use the rle function.
aggregate(VesselID, by = list(FisherID, Year),
function(x) length(rle(x)$values) - 1)
# Group.1 Group.2 x
# 1 1 2000 1
# 2 1 2001 2
# 3 1 2002 0
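To see why rle() counts changes, it may help to run it on the data from the question. rle() collapses consecutive equal values into runs, so the number of changes is the number of runs minus one. The sketch below rebuilds the question's rows in a data frame called vessels (a name chosen here for illustration) and uses the formula interface of aggregate().

```r
# The 23 rows from the question, in their original order
vessels <- data.frame(
  FisherID = 1,
  Year     = rep(c(2000, 2001, 2002), times = c(11, 7, 5)),
  VesselID = c(56, rep(81, 10),              # 2000
               56, 56, 81, 56, 56, 56, 56,   # 2001
               rep(81, 5))                   # 2002
)

# Runs for 2001: 56 56 | 81 | 56 56 56 56 -> 3 runs, hence 2 changes
runs_2001 <- rle(vessels$VesselID[vessels$Year == 2001])
length(runs_2001$values) - 1

# The same count for every FisherID/Year combination
changes <- aggregate(VesselID ~ FisherID + Year, data = vessels,
                     FUN = function(v) length(rle(v)$values) - 1)
changes
```

This only gives meaningful counts if the rows are already in chronological order within each year, as they are in the question's data.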