How to extend a hash with multiple values in R

So I understand that in R, a hash() is similar to a dictionary. I would like to extract specific values from my dataframe and put them into a hash.
The componentindex column is where I have my keys, and the cluster.index + UniqueFileSourcesCount columns contain my values. So for the same key I have multiple values, e.g. hash {91: [1,15], [22,99], etc.}.
So I would like to create a hash that contains each key with its multiple values, but I'm not sure how to do that.
mini_df <- head(df, 10)                   # using a small df
compID <- unique(mini_df$componentindex)  # list with unique keys
h1 <- hash()
for (i in 1:length(mini_df)) {
  if (compID == mini_df[i, "componentindex"]) {
    h1 <- hash(mini_df[i, "componentindex"],
               c(mini_df[i, "cluster.index"], mini_df[i, "UniqueFileSourcesCount"]))
  }
  # h2 <- append(h2, h1)
}
If I print h1, I end up having only the last value:
<hash> containing 1 key-value pair(s).
91 : 42 5
Which I understand, since I don't append to this hash but overwrite it. I'm not sure how to append to or extend hashes in R, and I have not been able to find a solution yet.
mini_df:
UniqueFileSourcesCount cluster.index componentindex
1 15 1 91
2 15 10 -1
3 99 22 91
4 63 23 1675
5 12 25 91
6 6 27 91
7 50 37 91
8 5 42 91
9 2 43 -1
10 2 69 -1
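A minimal sketch of one way to do this, assuming the hash package is loaded and mini_df has the columns shown above: keep one key per componentindex whose value is a list of c(cluster.index, UniqueFileSourcesCount) pairs, and append to that list whenever the key already exists.

library(hash)

h1 <- hash()
for (i in seq_len(nrow(mini_df))) {
  key <- as.character(mini_df[i, "componentindex"])
  val <- c(mini_df[i, "cluster.index"], mini_df[i, "UniqueFileSourcesCount"])
  if (has.key(key, h1)) {
    h1[[key]] <- c(h1[[key]], list(val))  # append to the existing list of pairs
  } else {
    h1[[key]] <- list(val)                # first pair for this key
  }
}

The explicit loop is only for clarity; the same grouping could also be built without one, e.g. by split()-ting the two value columns on componentindex and copying the resulting named list into the hash.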

Related

Summing values after every third position in data frame in R

I am new to R. I have a data frame like the following:
df <- data.frame(
  Id    = c("Entry_1","Entry_1","Entry_1","Entry_2","Entry_2","Entry_2","Entry_3","Entry_4","Entry_4","Entry_4","Entry_4"),
  Start = c(20,20,20,37,37,37,68,10,10,10,10),
  End   = c(50,50,50,78,78,78,200,94,94,94,94),
  Pos   = c(14,34,21,50,18,70,101,35,2,56,67),
  Hits  = c(12,34,17,89,45,87,1,5,6,3,26)
)
Id Start End Pos Hits
Entry_1 20 50 14 12
Entry_1 20 50 34 34
Entry_1 20 50 21 17
Entry_2 37 78 50 89
Entry_2 37 78 18 45
Entry_2 37 78 70 87
Entry_3 68 200 101 1
Entry_4 10 94 35 5
Entry_4 10 94 2 6
Entry_4 10 94 56 3
Entry_4 10 94 67 26
For each entry I would like to iterate over the data.frame in 3 different modes. For example, for Entry_1, mode_1 = seq(20,50,3), mode_2 = seq(21,50,3) and mode_3 = seq(22,50,3). I would like to sum all the values in column "Hits" whose corresponding values in column "Pos" fall in mode_1, mode_2 or mode_3, and generate a data.frame like the following:
Id Mode_1 Mode_2 Mode_3
Entry_1 0 17 34
Entry_2 87 89 0
Entry_3 1 0 0
Entry_4 26 8 0
I tried the following code:
mode_1 = 0
mode_2 = 0
mode_3 = 0
mode_1_sum = 0
mode_2_sum = 0
mode_3_sum = 0
for (i in dim(df)[1])
{
  if (df$Pos[i] %in% seq(df$Start[i], df$End[i], 3))
  {
    mode_1_sum = mode_1_sum + df$Hits[i]
    print(mode_1_sum)
  }
  mode_1 = mode_1_sum + counts
  print(mode_1)
  ifelse(df$Pos[i] %in% seq(df$Start[i] + 1, df$End[i], 3))
  {
    mode_2_sum = mode_2_sum + df$Hits[i]
    print(mode_2_sum)
  }
  mode_2_sum = mode_2_sum + counts
  print(mode_2)
  ifelse(df$Pos[i] %in% seq(df$Start[i] + 2, df$End[i], 3))
  {
    mode_3_sum = mode_3_sum + df$Hits[i]
    print(mode_3_sum)
  }
  mode_3_sum = mode_3_sum + counts
  print(mode_3_sum)
}
But the above code only prints 26. Can anyone guide me on how to generate my desired output, please? I can provide more details if needed. Thanks in advance.
It's not an elegant solution, but it works.
m <- 3  # Number of modes you want
foo <- ((df$Pos - df$Start) %% m + 1) * (df$Start < df$Pos) * (df$End > df$Pos)
tab <- matrix(0, nrow(df), m)
for (i in 1:m) tab[foo == i, i] <- df$Hits[foo == i]
aggregate(tab, list(df$Id), FUN = sum)
# Group.1 V1 V2 V3
# 1 Entry_1 0 17 34
# 2 Entry_2 87 89 0
# 3 Entry_3 1 0 0
# 4 Entry_4 26 8 0
-- EXPLANATION --
First, we find the indices of df$Pos that are both bigger than df$Start and smaller than df$End. These comparisons return 1 if TRUE and 0 if FALSE. Next, we take the difference between df$Pos and df$Start, take it mod 3 (which gives a vector of 0s, 1s and 2s), and then add 1 to get the right mode. We multiply these two things together, so that the values that fall within the interval keep the right mode, and the values that fall outside the interval become 0.
Next, we create an empty matrix that will contain the values. Then, we use a for-loop to fill in the matrix. Finally, we aggregate the matrix.
I tried looking for a quicker solution, but the main problem I cannot work around is the varying intervals for each row.
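For readers who prefer a grouped-summary phrasing, here is a sketch of the same computation with dplyr; it assumes df as constructed above and counts a Pos equal to Start or End as inside the interval.

library(dplyr)

df %>%
  mutate(mode = ifelse(Pos >= Start & Pos <= End, (Pos - Start) %% 3 + 1, NA)) %>%
  group_by(Id) %>%
  summarise(Mode_1 = sum(Hits[mode == 1], na.rm = TRUE),
            Mode_2 = sum(Hits[mode == 2], na.rm = TRUE),
            Mode_3 = sum(Hits[mode == 3], na.rm = TRUE))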

Looping through rows, creating and reusing multiple variables

I am building a streambed hydrology calculator in R using multiple tables from an Access database. I am having trouble automating and calculating the same set of indices for multiple sites. The following sample dataset describes my data structure:
> Thalweg
StationID AB0 AB1 AB2 AB3 AB4 AB5 BC1 BC2 BC3 BC4 Xdep_Vdep
1 1AAUA017.60 47 45 44 55 54 6 15 39 15 11 18.29
2 1AXKR000.77 30 27 24 19 20 18 9 12 21 13 6.46
3 2-BGU005.95 52 67 62 42 28 25 23 26 11 19 20.18
4 2-BLG011.41 66 85 77 83 63 35 10 70 95 90 67.64
5 2-CSR003.94 29 35 46 14 19 14 13 13 21 48 6.74
where each column represents certain field-measured parameters (i.e. depth of a reach section) and each row represents a different site.
I have successfully used the apply functions to simultaneously calculate simple functions on multiple rows:
> Xdepth <- apply(Thalweg[, 2:11], 1, mean) # Mean Depth
> Xdepth
1 2 3 4 5
33.1 19.3 35.5 67.4 25.2
and then appended the results back to the proper station in a dataframe.
However, I am struggling when I want to calculate and save variables that are subsequently used for further calculations. I cannot seem to loop or apply the same function to multiple columns on a single row and complete the same calculations over the next row without mixing variables and data.
I want to do:
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + other_variables), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + other_variables), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + other_variables), Thalweg$AB3)
# etc.
Depth_AB0 <- (Thalweg$AB0 - Residual_AB0)
Depth_AB1 <- (Thalweg$AB1 - Residual_AB1)
Depth_AB2 <- (Thalweg$AB2 - Residual_AB2)
# etc.
I have tried and subsequently failed at for loops such as:
for (i in nrow(Thalweg)) {
  Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
  Residual_AB1 <- min((Residual_AB0 + Stacks_Equation), Thalweg$AB1)
  Residual_AB2 <- min((Residual_AB1 + Stacks_Equation), Thalweg$AB2)
  Residual_AB3 <- min((Residual_AB2 + Stacks_Equation), Thalweg$AB3)
  Residuals <- data.frame(Thalweg$StationID, Residual_AB0, Residual_AB1, Residual_AB2, Residual_AB3)
}
Is there a better way to approach looping through multiple lines of data when I need unique variables saved for each specific row that I am currently calculating? Thank you for any suggestions.
Your exact problem is still a mystery to me, but it looks like you want a double for loop:
for (i in 1:nrow(thalweg)) {
  residual <- thalweg[i, "Xdep_Vdep"]
  for (j in 2:11) {
    residual <- min(residual, thalweg[i, j])
  }
}
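If the goal is to keep one residual per depth column and per site (following the recurrence sketched in the question), a rough version that stores its results could look like the following. It assumes the column layout shown above; Stacks_Equation is a hypothetical placeholder for the adjustment term the question mentions.

depth_cols <- c("AB0", "AB1", "AB2", "AB3", "AB4", "AB5", "BC1", "BC2", "BC3", "BC4")
Stacks_Equation <- 0  # hypothetical placeholder for the question's adjustment term

residuals <- matrix(NA_real_, nrow = nrow(Thalweg), ncol = length(depth_cols),
                    dimnames = list(NULL, paste0("Residual_", depth_cols)))
for (i in seq_len(nrow(Thalweg))) {
  res <- min(Thalweg$Xdep_Vdep[i], Thalweg[i, depth_cols[1]])  # Residual_AB0
  residuals[i, 1] <- res
  for (j in 2:length(depth_cols)) {
    res <- min(res + Stacks_Equation, Thalweg[i, depth_cols[j]])
    residuals[i, j] <- res
  }
}
Residuals <- data.frame(StationID = Thalweg$StationID, residuals)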

Print only the name of variable with the class factor

I have a data set in which some columns are factors and some are numerical/integer. What should the command be to return only the names of the factor-class columns?
aa bb cc dd
1 12 P 43
4 23 Q 78
8 34 Q 89
9 86 P 78
7 67 P 98
9 76 Q 74
So, now if I want to print only the name of the variable with class factor, i.e. cc, what should my R command be?
Thanks in advance
You can do:
names(Filter(is.factor, data))
This alternative is a little longer but might use less memory:
names(data)[sapply(data, is.factor)]
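For instance, with a small data frame shaped like the one above (assuming cc is stored as a factor), both variants return the same thing:

data <- data.frame(aa = c(1, 4, 8, 9, 7, 9),
                   bb = c(12, 23, 34, 86, 67, 76),
                   cc = factor(c("P", "Q", "Q", "P", "P", "Q")),
                   dd = c(43, 78, 89, 78, 98, 74))

names(Filter(is.factor, data))
# [1] "cc"
names(data)[sapply(data, is.factor)]
# [1] "cc"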

Cumulative count of values in R

I hope you are doing very well. I would like to know how to calculate the cumulative sum of a data set with certain conditions. A simplified version of my data set would look like:
t id
A 22
A 22
R 22
A 41
A 98
A 98
A 98
R 98
A 46
A 46
R 46
A 46
A 46
A 46
R 46
A 46
A 12
R 54
A 66
R 13
A 13
A 13
A 13
A 13
R 13
A 13
I would like to make a new data set where, for each value of "id", I would have the cumulative number of times that each id appears, but when t = R I need to restart the counting, e.g.:
t id count
A 22 1
A 22 2
R 22 0
A 41 1
A 98 1
A 98 2
A 98 3
R 98 0
A 46 1
A 46 2
R 46 0
A 46 1
A 46 2
A 46 3
R 46 0
A 46 1
A 12 1
R 54 0
A 66 1
R 13 0
A 13 1
A 13 2
A 13 3
A 13 4
R 13 0
A 13 1
Any ideas as to how to do this? Thanks in advance.
Using rle:
out <- transform(df, count = sequence(rle(do.call(paste, df))$lengths))
out$count[out$t == "R"] <- 0
If your data.frame has more than these two columns and you want to check only these two, just replace df with df[, 1:2] (or df[, c("t", "id")]).
If you find do.call(paste, df) dangerous (as @flodel comments), then you can replace it with:
as.character(interaction(df))
I personally don't find anything dangerous or clumsy with this setup (as long as you have the right separator, meaning you know your data well). However, if you do find it as such, the second solution may help you.
Update:
For those who don't like using do.call(paste, df) or as.character(interaction(df)) (please see the comment exchange between me, @flodel and @HongOoi), here's another base solution:
idx <- which(df$t == "R")
ww <- NULL
if (length(idx) > 0) {
ww <- c(min(idx), diff(idx), nrow(df)-max(idx))
df <- transform(df, count = ave(id, rep(seq_along(ww), ww),
FUN=function(y) sequence(rle(y)$lengths)))
df$count[idx] <- 0
} else {
df$count <- seq_len(nrow(df))
}
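For completeness, here is one more base sketch of the same idea that avoids the paste/interaction step; it assumes, as in the example above, that rows belonging to the same id appear consecutively between resets.

grp <- cumsum(df$t == "R")            # a new segment starts at every R row
df$count <- 0                         # R rows stay at 0
a <- df$t == "A"
df$count[a] <- ave(seq_len(sum(a)),   # running count within id and segment
                   df$id[a], grp[a], FUN = seq_along)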

R efficiently add up tables in different order

At some point in my code, I get a list of tables that looks much like this:
[[1]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
13 68 13 117 34 3.275941e-37
23 78 23 117 2 4.503111e-32
....
[[2]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
....
While I don't show the full tables here, I know they are all the same size. What I want to do is make one table where I add up the p-values. The problem is that the $cluster_size, $start, $end and $number columns don't necessarily correspond to the same row when I look at the table in different list elements, so I can't just do a simple sum.
The brute-force way to do this is to: 1) make a blank table, and 2) copy in the appropriate $cluster_size, $start, $end and $number columns from the first table and pull the correct p-values from all the tables using a which() statement. Is there a more clever way of doing this? Or is this pretty much it?
Edit: I was asked for a dput file of the data. It's located here:
http://alrig.com/code/
In the sample case, the order of the rows happens to match. That will not always be the case.
Seems like you can do this in two steps:
1. Convert your list to a data.frame.
2. Use any of the split-apply-combine approaches to summarize.
Assuming your data was named X, here's what you could do:
library(plyr)
#need to convert to data.frame since all of your list objects are of class matrix
XDF <- as.data.frame(do.call("rbind", X))
ddply(XDF, .(cluster_size, start, end, number), summarize, sump = sum(p_value))
#-----
cluster_size start end number sump
1 1 12 12 100 5.550142e-184
2 1 13 13 31 3.117856e-37
3 1 22 22 1 9.000000e+00
...
29 105 23 117 2 6.271469e-16
30 106 22 146 13 7.266746e-25
31 107 23 146 12 1.382328e-25
Lots of other aggregation techniques are covered here. I'd look at data.table package if your data is large.
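As a sketch of that data.table route, assuming (as in the plyr answer) that the list of matrices is named X:

library(data.table)

XDT <- rbindlist(lapply(X, as.data.table))  # stack the matrices into one table
XDT[, .(sump = sum(p_value)), by = .(cluster_size, start, end, number)]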
