I have a named vector filled with zeros:
toy1 <- rep(0, length(37:45))
names(toy1) <- 37:45
I want to populate the vector with count data from a data frame:
size count
37 1.181
38 0.421
39 0.054
40 0.005
41 0.031
42 0.582
45 0.024
I need help matching each size value to the corresponding vector name and then inserting that row's count value at that position.
Might be as simple as:
toy1[ as.character(dat$size) ] <- dat$count
toy1
# 37 38 39 40 41 42 43 44 45
#1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.024
R's assignment indexing accepts character values. If you had instead indexed with the raw numeric column:
toy1[ dat$size ] <- dat$count
You would have gotten (as did I initially):
> toy1
37 38 39 40 41 42 43 44 45
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1.181 0.421
0.054 0.005 0.031 0.582 NA NA 0.024
That occurred because the sizes were treated as numeric positions, and the vector was silently extended to accommodate indices up to 45.
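The difference between the two indexing modes can be seen in a tiny sketch (the vector v here is illustrative, not from the question):

```r
# Character vs. numeric indexing in assignments.
v <- setNames(rep(0, 3), c("10", "11", "12"))

v["11"] <- 5    # character index: matched against names(v); length stays 3
v[11]   <- 9    # numeric index: position 11; v is silently extended with NAs

length(v)       # 11
v[["11"]]       # 5  (still at position 2, found by name)
v[[4]]          # NA (one of the padding positions)
```

The as.character() call in the answer above forces the first, name-based behaviour.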
With a version of the data frame that contained a size outside the range 37:45, using match with nomatch = 0 produced a warning, but I still got the expected results:
toy1[ match(as.character(dat$size), names(toy1), nomatch = 0) ] <- dat$count
#------------
Warning message:
In toy1[match(as.character(dat$size), names(toy1), nomatch = 0)] <- dat$count :
number of items to replace is not a multiple of replacement length
> toy1
37 38 39 40 41 42 43 44 45
1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.000
The match function is at the core of merge, but this direct application should be much faster than merging two data frames.
Let's say your data frame is df. Then you can simply update the entries of toy1 for the sizes present in your data frame:
toy1[as.character(df$size)] <- df$count
Edit: To check for a match before updating, compute m, the matched indices into the size column of df:
m <- match(names(toy1), as.character(df$size))
Then the positions of toy1 that have a match can be updated as below:
toy1[which(!is.na(m))] <- df$count[m[!is.na(m)]]
PS: A more efficient way would be to define toy1 as a data frame and perform an outer join on the size column.
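That outer-join idea can be sketched in base R with merge (the data frame below just re-creates the question's data; the names toy_df and joined are illustrative):

```r
# Sketch: treat the named vector's names as a size column and outer-join on it.
df <- data.frame(size = c(37:42, 45),
                 count = c(1.181, 0.421, 0.054, 0.005, 0.031, 0.582, 0.024))
toy_df <- data.frame(size = 37:45)

joined <- merge(toy_df, df, by = "size", all.x = TRUE)  # left outer join
joined$count[is.na(joined$count)] <- 0                  # unmatched sizes get 0
joined$count
# [1] 1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.024
```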
First, let's get the data loaded in.
toy1<- rep(0, length(37:45))
names(toy1) <- 37:45
df = read.table(text="37 1.181
38 0.421
39 0.054
40 0.005
41 0.031
42 0.582
45 0.024")
names(df) = c("size","count")
Now, I present a really ugly solution. We only update toy1 where a name of toy1 appears in df$size, and we fetch df$count by finding the index of the match in df, using sapply to get a vector of indices back. On both sides we only look at places where names(toy1) appear in df$size.
matched <- names(toy1) %in% df$size
toy1[matched] <- df$count[sapply(names(toy1)[matched], function(x) which(x == df$size))]
But, this isn't very elegant. Instead, you could turn toy1 into a data.frame.
toydf = data.frame(toy1 = toy1,name = names(toy1),stringsAsFactors = FALSE)
Now, we can use merge to get the values.
updated = merge(toydf,df,by.x = "name",by.y="size",all.x=T)
This returns a 3 column data.frame. You can then extract the count column from this, replace NA with 0 and you're done.
updated$count[is.na(updated$count)] = 0
updated$count
#> [1] 1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.024
I think this is a split-apply-combine problem, but with a time-series twist. My data consists of irregular counts, and I need to perform some summary statistics on each group of counts. Here is the data for your console:
library(xts)
date <- as.Date(c("2010-11-18", "2010-11-19", "2010-11-26", "2010-12-03", "2010-12-10",
"2010-12-17", "2010-12-24", "2010-12-31", "2011-01-07", "2011-01-14",
"2011-01-21", "2011-01-28", "2011-02-04", "2011-02-11", "2011-02-18",
"2011-02-25", "2011-03-04", "2011-03-11", "2011-03-18", "2011-03-25",
"2011-03-26", "2011-03-27"))
returns <- c(0.002,0.000,-0.009,0.030, 0.013,0.003,0.010,0.001,0.011,0.017,
-0.008,-0.005,0.027,0.014,0.010,-0.017,0.001,-0.013,0.027,-0.019,
0.000,0.001)
count <- c(NA,NA,1,1,2,2,3,4,5,6,7,7,7,7,7,NA,NA,NA,1,2,NA,NA)
maxCount <- c(NA,NA,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,
0.030,0.030,0.030,0.030,NA,NA,NA,0.027,0.027,NA,NA)
sumCount <- c(NA,NA,0.000,0.030,0.042,0.045,0.056,0.056,0.067,0.084,0.077,
0.071,0.098,0.112,0.123,NA,NA,NA,0.000,-0.019,NA,NA)
xtsData <- xts(cbind(returns,count,maxCount,sumCount),date)
I have no idea how to construct the max and cumSum columns, especially since each count series is of an irregular length. Since I won't always know the start and end points of a count series, I'm lost at trying to figure out the index of these groups. Thanks for your help!
UPDATE: here is my for loop attempting to calculate cumSum. It doesn't compute the cumulative sum yet, just the returns needed; I'm still unsure how to apply functions to these ranges!
xtsData <- cbind(xtsData,mySumCount=NA)
# find groups of returns
for (i in 1:nrow(xtsData)) {
  if (!is.na(xtsData[i, "count"])) {
    xtsData[i, "mySumCount"] <- xtsData[i, "returns"]
  } else {
    xtsData[i, "mySumCount"] <- NA
  }
}
UPDATE 2: thank you commenters!
# report returns when not NA count
x1 <- xtsData[!is.na(xtsData$count),"returns"]
# cum sum is close, but still need to exclude the first element
# -0.009 in the first series of counts and .027 in the second series of counts
x2 <- cumsum(xtsData[!is.na(xtsData$count),"returns"])
# this output is not accurate because 0.03 is displayed down the entire column,
# not just during periods when count != NA. Is this just a rounding error?
x3 <- max(xtsData[!is.na(xtsData$count),"returns"])
SOLUTION:
# function to lag a vector by k positions, padding the front with zeros
lagpad <- function(x, k) {
  c(rep(0, k), x)[1:length(x)]
}
# group the counts
x1 <- na.omit(transform(xtsData, g = cumsum(c(0, diff(!is.na(count)) == 1))))
# cumulative sum of each count series
z1 <- transform(x1, cumsumRet = ave(returns, g, FUN = function(x) cumsum(replace(x, 1, 0))))
# max of each count series
z2 <- transform(x1, maxRet = ave(returns, g, FUN = function(x) max(lagpad(x, 1))))
merge(xtsData,z1$cumsumRet,z2$maxRet)
The code shown is not consistent with the output in the image, and no explanation is provided, so it's not clear exactly what manipulations were wanted; however, the question did mention that the main problem is distinguishing the groups, so we will address that.
To do that we compute a new column g whose rows contain 1 for the first group, 2 for the second and so on. We also remove the NA rows since the g column is sufficient to distinguish groups.
The following code computes a vector the same length as count. First, each NA position is set to FALSE and each non-NA position to TRUE. Differencing that vector (which implicitly converts FALSE/TRUE to 0/1) yields a 1 at each position where a group starts; converting the differences to a logical vector gives TRUE for each 1 and FALSE otherwise. Since the first component has no prior position to difference against, we prepend a 0, which implicitly converts the TRUEs and FALSEs back to 1s and 0s. Taking the cumsum then fills the first group with 1, the second with 2, and so on. Finally we omit the NA rows:
x <- na.omit(transform(x, g = cumsum(c(0, diff(!is.na(count)) == 1))))
giving:
> x
returns count maxCount sumCount g
2010-11-26 -0.009 1 0.030 0.000 1
2010-12-03 0.030 1 0.030 0.030 1
2010-12-10 0.013 2 0.030 0.042 1
2010-12-17 0.003 2 0.030 0.045 1
2010-12-24 0.010 3 0.030 0.056 1
2010-12-31 0.001 4 0.030 0.056 1
2011-01-07 0.011 5 0.030 0.067 1
2011-01-14 0.017 6 0.030 0.084 1
2011-01-21 -0.008 7 0.030 0.077 1
2011-01-28 -0.005 7 0.030 0.071 1
2011-02-04 0.027 7 0.030 0.098 1
2011-02-11 0.014 7 0.030 0.112 1
2011-02-18 0.010 7 0.030 0.123 1
2011-03-18 0.027 1 0.027 0.000 2
2011-03-25 -0.019 2 0.027 -0.019 2
attr(,"na.action")
2010-11-18 2010-11-19 2011-02-25 2011-03-04 2011-03-11 2011-03-26 2011-03-27
1 2 16 17 18 21 22
attr(,"class")
[1] "omit"
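The grouping steps just described can be pulled apart on the question's count vector:

```r
# Step-by-step sketch of the group-id computation.
count <- c(NA, NA, 1, 1, 2, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, NA, NA, NA, 1, 2, NA, NA)

present <- !is.na(count)        # TRUE where a count exists
starts  <- diff(present) == 1   # TRUE at each FALSE -> TRUE transition
g <- cumsum(c(0, starts))       # running group id; 0 on the leading NA rows

g          # 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
g[present] # 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
```

After na.omit, only the g[present] labels remain, which is exactly what ave groups by.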
You can now use ave to perform any calculations you like. For example to take cumulative sums of returns by group:
transform(x, cumsumRet = ave(returns, g, FUN = cumsum))
Replace cumsum with any other function that is suitable for use with ave.
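For instance, here is a sketch of a per-group max and a cumulative sum that zeroes out each group's first return; the tiny g and returns vectors are just a few grouped rows for illustration, not the full data:

```r
# ave() recycles each group's result back onto that group's rows.
g       <- c(1, 1, 1, 2, 2)
returns <- c(-0.009, 0.030, 0.013, 0.027, -0.019)

ave(returns, g, FUN = max)
# 0.030 0.030 0.030 0.027 0.027

ave(returns, g, FUN = function(x) cumsum(replace(x, 1, 0)))
# 0.000 0.030 0.043 0.000 -0.019
```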
Ah, so "count" are the groups and you want the cumsum per group and the max per group. I think in data.table, so here is how I would do it.
library(xts)
library(data.table)
date <- as.Date(c("2010-11-18", "2010-11-19", "2010-11-26", "2010-12-03", "2010-12-10",
"2010-12-17", "2010-12-24", "2010-12-31", "2011-01-07", "2011-01-14",
"2011-01-21", "2011-01-28", "2011-02-04", "2011-02-11", "2011-02-18",
"2011-02-25", "2011-03-04", "2011-03-11", "2011-03-18", "2011-03-25",
"2011-03-26", "2011-03-27"))
returns <- c(0.002,0.000,-0.009,0.030, 0.013,0.003,0.010,0.001,0.011,0.017,
-0.008,-0.005,0.027,0.014,0.010,-0.017,0.001,-0.013,0.027,-0.019,
0.000,0.001)
count <- c(NA,NA,1,1,2,2,3,4,5,6,7,7,7,7,7,NA,NA,NA,1,2,NA,NA)
maxCount <- c(NA,NA,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,
0.030,0.030,0.030,0.030,NA,NA,NA,0.027,0.027,NA,NA)
sumCount <- c(NA,NA,0.000,0.030,0.042,0.045,0.056,0.056,0.067,0.084,0.077,
0.071,0.098,0.112,0.123,NA,NA,NA,0.000,-0.019,NA,NA)
DT <- data.table(date, returns, count)
DT[!is.na(count),max:=max(returns),by=count]
DT[!is.na(count),cumSum:= cumsum(returns),by=count]
# if you need an xts object at the end:
xtsData <- xts(cbind(DT$returns, DT$count, DT$max, DT$cumSum), DT$date)
I have a matrix that has 3 columns and 47,772 rows. Within the rows there are 64 parameters.
Currently the data frame looks like:
SAMPLE_DATE PARAMETER RESULT
8/2/1954 Alkalinity, total as CaCO3(mg/L) 112.5
8/2/1954 Depth, Secchi disk depth(m) 2.44
8/2/1954 Nutrient-nitrogen as N(mg/L) 0.87
8/2/1954 Phosphorus as P(mg/L) 0.001
8/2/1954 Sulfate as SO4(mg/L) 11
3/7/1962 Alkalinity, total as CaCO3(mg/L) 140
3/7/1962 Alkalinity, total as CaCO3(mg/L) 320
3/7/1962 Alkalinity, total as CaCO3(mg/L) 130
3/7/1962 Ammonia-nitrogen as N(mg/L) 0.02
3/7/1962 Ammonia-nitrogen as N(mg/L) 0.26
3/7/1962 Ammonia-nitrogen as N(mg/L) 0.02
3/7/1962 Apparent color(PCU) 10
3/7/1962 Apparent color(PCU) 10
....
and I want to transform it into something that looks like:
Date Alkalinity, total as CaCO3(mg/L) Depth, Secchi disk depth(m).....etc
8/2/1954 112.5 2.44 ..... etc
note: not every date has every parameter
Any ideas?
Here's one approach. I've added a "time" variable since there are duplicated "SAMPLE_DATE" + "PARAMETER" combinations.
library(reshape2) # for dcast
library(splitstackshape) # for getanID
x2 <- getanID(x, id.vars = c("SAMPLE_DATE", "PARAMETER"))
dcast(x2, .id + SAMPLE_DATE ~ PARAMETER, value.var = "RESULT")
# .id SAMPLE_DATE Alkalinity, total as CaCO3(mg/L) Ammonia-nitrogen as N(mg/L)
# 1 1 3/7/1962 140.0 0.02
# 2 1 8/2/1954 112.5 NA
# 3 2 3/7/1962 320.0 0.26
# 4 3 3/7/1962 130.0 0.02
# Apparent color(PCU) Depth, Secchi disk depth(m) Nutrient-nitrogen as N(mg/L)
# 1 10 NA NA
# 2 NA 2.44 0.87
# 3 10 NA NA
# 4 NA NA NA
# Phosphorus as P(mg/L) Sulfate as SO4(mg/L)
# 1 NA NA
# 2 0.001 11
# 3 NA NA
# 4 NA NA
As above, but with the "data.table" package:
library(data.table)
packageVersion("data.table")
# [1] ‘1.8.11’
DT <- data.table(x)
DT[, .id := sequence(.N), by = list(SAMPLE_DATE, PARAMETER)]
dcast.data.table(DT, .id + SAMPLE_DATE ~ PARAMETER, value.var="RESULT")
If you don't want separate rows for duplicated combinations, you will have to aggregate the data in some way first.
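A base-R sketch of that aggregation step, taking the mean of duplicated combinations before widening (the mini data frame below is illustrative, using a few of the question's rows with shortened parameter names):

```r
# Collapse duplicated SAMPLE_DATE + PARAMETER rows to their mean, then widen.
x <- data.frame(
  SAMPLE_DATE = c("8/2/1954", "3/7/1962", "3/7/1962", "3/7/1962"),
  PARAMETER   = c("Alkalinity", "Alkalinity", "Alkalinity", "Apparent color"),
  RESULT      = c(112.5, 140, 320, 10),
  stringsAsFactors = FALSE
)

agg <- aggregate(RESULT ~ SAMPLE_DATE + PARAMETER, data = x, FUN = mean)
# the duplicated 3/7/1962 Alkalinity rows collapse to mean(c(140, 320)) = 230

wide <- reshape(agg, idvar = "SAMPLE_DATE", timevar = "PARAMETER",
                direction = "wide")
wide   # one row per date; missing combinations become NA
```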
This will be (sort of) a contingency table (zeros where there is no value, and the sum of values where category combinations overlap):
xtabs(RESULT~ SAMPLE_DATE+PARAMETER, data=dat)
PARAMETER
SAMPLE_DATE Alkalinity, total as CaCO3(mg/L) Ammonia-nitrogen as N(mg/L)
3/7/1962 590.000 0.300
8/2/1954 112.500 0.000
PARAMETER
SAMPLE_DATE Apparent color(PCU) Depth, Secchi disk depth(m)
3/7/1962 20.000 0.000
8/2/1954 0.000 2.440
PARAMETER
SAMPLE_DATE Nutrient-nitrogen as N(mg/L) Phosphorus as P(mg/L)
3/7/1962 0.000 0.000
8/2/1954 0.870 0.001
PARAMETER
SAMPLE_DATE Sulfate as SO4(mg/L)
3/7/1962 0.000
8/2/1954 11.000
If you want something other than the sum() of repeated categories, then tapply can deliver it, e.g. with mean as the target function:
with( dat, tapply(RESULT, list( SAMPLE_DATE, PARAMETER), FUN=mean, na.rm=TRUE))
Alkalinity, total as CaCO3(mg/L) Ammonia-nitrogen as N(mg/L) Apparent color(PCU) Depth, Secchi disk depth(m)
3/7/1962 196.6667 0.1 10 NA
8/2/1954 112.5000 NA NA 2.44
Nutrient-nitrogen as N(mg/L) Phosphorus as P(mg/L) Sulfate as SO4(mg/L)
3/7/1962 NA NA NA
8/2/1954 0.87 0.001 11