R: many nested loops to remove rows in multiple data frames

I have 18 data frames called regular55, regular56, regular57, collar55, collar56, etc. In each data frame, I want to delete the first row of each nest.
Each data frame looks like this:
nest interval
1 17 -8005
2 17 183
3 17 186
4 17 221
5 17 141
6 17 30
7 17 158
8 17 23
9 17 199
10 17 51
11 17 169
12 17 176
13 31 905
14 31 478
15 31 40
16 31 488
17 31 16
18 31 203
19 31 54
20 31 341
21 31 54
22 50 -14164
23 50 98
24 50 1438
25 71 240
26 71 725
27 71 819
28 85 -13935
29 85 45
30 85 589
31 85 47
32 85 161
33 85 67
The solution I came up with to avoid writing out the function for each one of the 18 data frames includes many nested loops:
for (i in 5:7){
for (j in 5:7) {
for (k in c("regular","collar")){
for (l in c(unique(paste0(k,i,j,"$nest")))){
paste0(k,i,j)=paste0(k,i,j)[(-c(which((paste0(k,i,j,"$nest")) == l )
[1])),]
}}}}
I'm basically selecting the first value at "which" there is a "unique" value of nest. However, I get:
Error in paste0(k, i, j)[(-c(which((paste0(k, i, j, "$nest")) == l)[1])), :
incorrect number of dimensions
It might be because "paste0(k,i,j)" is only considered as a character and not recognized as the name for a data frame.
Any ideas on how to fix this? Or any other ways to delete the first rows for each nest in every data frame?

Thanks to help from the comments, my problem was solved.
Originally, I divided my data frame using a for loop and then grouped it into one list:
for (i in 5:7) {
for (j in 5:7) {
for (k in c("regular","collar")){
assign(paste0(k,i,j),
df[df$x == i & df$y == j & df$z == k,])
}}}
# anchor the pattern and use alternation; [regular,collar] would be a character class
df.list = mget(ls(pattern = "^(regular|collar)[5-7][5-7]$"))
I later found a way to split my data frame directly into a list based on multiple columns (R subsetting a data frame into multiple data frames based on multiple column values):
df.list = split(df, interaction(df$x, df$y, df$z), drop = TRUE)
Finally, I was able to apply the function to remove the first rows of each nest:
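library(dplyr)
# slice(-1) drops the first row of each nest; unlike slice(2:n()) it also works for single-row nests
df.list.updated = lapply(df.list, function(d) d %>% group_by(nest) %>% slice(-1))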
It is definitely easier to work from a list of data frames.
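For reference, the same per-nest deletion can also be sketched in base R without dplyr. This is only an illustration: the two miniature data frames reuse a few values from the example above, and the two-element list and the helper name drop_first are assumptions, not part of the original solution. duplicated() is FALSE for the first occurrence of each nest, so keeping only the duplicated rows drops the first row of every nest.

```r
# Two small stand-ins for the 18 data frames (illustrative subset of the data above)
regular55 <- data.frame(nest = c(17, 17, 17, 31, 31, 50),
                        interval = c(-8005, 183, 186, 905, 478, -14164))
collar55 <- data.frame(nest = c(71, 71, 85),
                       interval = c(240, 725, -13935))
df.list <- list(regular55 = regular55, collar55 = collar55)

# duplicated() flags every row except the first one of each nest
drop_first <- function(d) d[duplicated(d$nest), , drop = FALSE]  # hypothetical helper
df.list.updated <- lapply(df.list, drop_first)

df.list.updated$regular55$interval
# [1] 183 186 478
```

The original loop failed because paste0(k,i,j) returns a character string, not the data frame with that name; get() and assign() would be needed to go from a name to the object, which is one reason a list is simpler.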

Related

R: How to compare values in a column with later values in the same column

I am attempting to work with a large dataset in R where I need to create a column that compares the value in an existing column to all values that follow it (ex: row 1 needs to compare rows 1-10,000, row 2 needs to compare rows 2-10,000, row 3 needs to compare rows 3-10,000, etc.), but cannot figure out how to write the range.
I currently have a column of raw numeric values and a column of row values generated by:
samples$row = seq.int(nrow(samples))
I have attempted to generate the column with the following command:
samples$processed = min(samples$raw[samples$row:10000])
but get the error "numerical expression has 10000 elements: only the first used" and the generated column only has the value for row 1 repeated for each of the 10,000 rows.
How do I need to write this command so that the lower bound of the range is the row currently being calculated instead of 1?
Any help would be appreciated, as I have minimal programming experience.
If all you need is the min of the specific row and all following rows, then
rev(cummin(rev(samples$val)))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
If you have some other function that doesn't have a cumulative variant (and your use of min is just a placeholder), then one of:
mapply(function(a, b) min(samples$val[a:b]), seq.int(nrow(samples)), nrow(samples))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
sapply(seq.int(nrow(samples)), function(a) min(samples$val[a:nrow(samples)]))
The only reason to use mapply over sapply is if, for some reason, you want window-like operations instead of always going to the bottom of the frame. (Though if you wanted windows, I'd suggest either the zoo or slider packages.)
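To make the difference concrete, here is a small sketch on a made-up vector (not the sample data below). It checks that the cumulative and explicit "to the end" versions agree, and then uses mapply for a forward-looking window. The window width k and the truncation at the last row are assumptions chosen for illustration.

```r
val <- c(5, 3, 8, 1, 9)   # made-up values for illustration
n <- length(val)
idx <- seq_len(n)

# min from each row to the end: cumulative and explicit versions agree
to_end_cummin <- rev(cummin(rev(val)))
to_end_sapply <- sapply(idx, function(a) min(val[a:n]))
to_end_cummin
# [1] 1 1 1 1 9

# forward-looking window of width k (hypothetical choice), truncated at the last row
k <- 2
win_min <- mapply(function(a, b) min(val[a:b]), idx, pmin(idx + k - 1, n))
win_min
# [1] 3 3 1 1 9
```

Here mapply earns its place because both the lower and the upper bound change from row to row.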
Data
set.seed(42)
samples <- data.frame(val = sample(1000, size=20))
samples
# val
# 1 561
# 2 997
# 3 321
# 4 153
# 5 74
# 6 228
# 7 146
# 8 634
# 9 49
# 10 128
# 11 303
# 12 24
# 13 839
# 14 356
# 15 601
# 16 165
# 17 622
# 18 532
# 19 410
# 20 882

Reducing the number of nodes of a tree, to obtain nodes with more than one child node

The following tree:
has been obtained from the following matrix
> mat
7 23 47 41 31
7 23 53 41 31
7 23 53 41 37
7 29 47 41 31
7 29 47 41 37
7 29 53 41 31
7 29 53 41 37
11 29 53 41 31
11 29 53 41 37
taking each column of 'mat' as a level of the tree. If 'data' is the data frame where the matrix 'mat' is stored
V1 V2 V3 V4 V5
7 23 47 41 31
7 23 53 41 31
7 23 53 41 37
7 29 47 41 31
7 29 47 41 37
7 29 53 41 31
7 29 53 41 37
11 29 53 41 31
11 29 53 41 37
the code that produces the above tree is the following
> library(data.tree)
> library(DiagrammeR)
> data$pathString <- paste("0", data$V1, data$V2, data$V3, data$V4, data$V5, sep = "/")
> p_tree <- as.Node(data)
> export_graph(ToDiagrammeRGraph(p_tree), "tree.png")
I would like to modify the tree as follows: (1) if a node at level 'n', labelled by number x, has only one child node at level 'n+1', labelled by number y, then the program merges these two nodes into one node labelled by the product x*y; (2) if the node at level 'n+1' does not have child nodes, the program does nothing and starts again from another branch; (3) if the node at level 'n+1' has more than one child node, the program applies point (1) and starts again from each of the child nodes.
For example, for the tree of our example, the code should:
replace the nodes circled in red with a node labelled by 31*41*47=59737
replace the nodes circled in orange with a node labelled by 53*41=2173
replace the nodes circled in green with a node labelled by 47*41=1927
replace the nodes circled in blue with a node labelled by 11*29*53*41=693187
Try this:
freq <- sapply(1:ncol(data), function(x) {
df <- data[, 1:x, drop = FALSE]
cc <- aggregate(df[, 1], as.list(df), FUN = length)
merge(df, cc, by = colnames(df), sort = FALSE)[, "x"]
})
data$pathString <- sapply(1:nrow(data), function(x) {
g <- 1
for(i in 2:ncol(freq)) g <- c(g,
if(freq[x, i] == freq[x, i - 1]) g[i - 1] else g[i - 1] + 1)
paste0(c("0", tapply(unlist(data[x, , drop = TRUE]), g, prod)), collapse = "/")
})
p_tree <- as.Node(data)
plot(p_tree)
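The key step in the code above is collapsing runs of nodes into products with tapply. A minimal sketch of just that step, for one root-to-leaf path of 'mat'; the grouping vector g is hand-picked here for illustration (in the answer it is derived from freq):

```r
vals <- c(11, 29, 53, 41, 31)   # the path 11/29/53/41/31 from 'mat'
g <- c(1, 1, 1, 1, 2)           # hand-picked: first four nodes merge, the leaf stays alone

merged <- tapply(vals, g, prod) # product within each group
path <- paste0(c("0", merged), collapse = "/")
path
# [1] "0/693187/31"
```

This matches the blue example in the question, where 11*29*53*41 = 693187 becomes a single node above the leaves.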

R sum multiple columns with multiple row

So I have this data:
10 21 22 23 23 43
20 12 26 43 23 65
21 54 64 73 25 75
My expected outcome is:
142
189
312
I tried to use:
df = data.matrix(df)
df = colSums(df)
df = as.data.frame(df)
However, the sums are wrong. How can I improve or correct this solution?
We can use rowSums
rowSums(df)
#[1] 142 189 312
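Your data is likely stored as factors, in which case data.matrix() returns the internal level codes instead of the values. Convert each column to numeric with as.numeric(as.character()) first.
In your situation I suggest doing the conversion column by column:
df[] <- lapply(df, function(col) as.numeric(as.character(col)))
rowSums(df)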
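A small sketch of why the factor case goes wrong, assuming (the question does not confirm this) that the columns were read in as factors. The column names V1 to V6 are placeholders.

```r
# Rebuild the question's data as factor columns (assumed storage; names are placeholders)
df <- data.frame(V1 = c("10", "20", "21"), V2 = c("21", "12", "54"),
                 V3 = c("22", "26", "64"), V4 = c("23", "43", "73"),
                 V5 = c("23", "23", "25"), V6 = c("43", "65", "75"),
                 stringsAsFactors = TRUE)

# data.matrix() replaces factors by their internal codes, so these sums are wrong
codes_sum <- rowSums(data.matrix(df))

# Convert via character to get the real values back, then sum across rows
df[] <- lapply(df, function(col) as.numeric(as.character(col)))
value_sum <- rowSums(df)
unname(value_sum)
# [1] 142 189 312
```

Note also that the expected outcome is one total per row, so rowSums() is needed rather than colSums().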

Looping through rows, creating and reusing multiple variables

I am building a streambed hydrology calculator in R using multiple tables from an Access database. I am having trouble automating and calculating the same set of indices for multiple sites. The following sample dataset describes my data structure:
> Thalweg
StationID AB0 AB1 AB2 AB3 AB4 AB5 BC1 BC2 BC3 BC4 Xdep_Vdep
1 1AAUA017.60 47 45 44 55 54 6 15 39 15 11 18.29
2 1AXKR000.77 30 27 24 19 20 18 9 12 21 13 6.46
3 2-BGU005.95 52 67 62 42 28 25 23 26 11 19 20.18
4 2-BLG011.41 66 85 77 83 63 35 10 70 95 90 67.64
5 2-CSR003.94 29 35 46 14 19 14 13 13 21 48 6.74
where each column represents certain field-measured parameters (i.e. depth of a reach section) and each row represents a different site.
I have successfully used the apply functions to simultaneously calculate simple functions on multiple rows:
> Xdepth <- apply(Thalweg[, 2:11], 1, mean) # Mean Depth
> Xdepth
1 2 3 4 5
33.1 19.3 35.5 67.4 25.2
and appending the results back to the proper station in a dataframe.
However, I am struggling when I want to calculate and save variables that are subsequently used for further calculations. I cannot seem to loop or apply the same function to multiple columns on a single row and complete the same calculations over the next row without mixing variables and data.
I want to do:
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + other_variables), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + other_variables), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + other_variables), Thalweg$AB3)
# etc.
Depth_AB0 <- (Thalweg$AB0 - Residual_AB0)
Depth_AB1 <- (Thalweg$AB1 - Residual_AB1)
Depth_AB2 <- (Thalweg$AB2 - Residual_AB2)
# etc.
I have tried and subsequently failed at for loops such as:
for (i in nrow(Thalweg)){
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + Stacks_Equation), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + Stacks_Equation), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + Stacks_Equation), Thalweg$AB3)
Residuals <- data.frame(Thalweg$StationID, Residual_AB0, Residual_AB1, Residual_AB2, Residual_AB3)
}
Is there a better way to approach looping through multiple lines of data when I need unique variables saved for each specific row that I am currently calculating? Thank you for any suggestions.
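Your exact problem is still a mystery to me, but it looks like you want a double for loop:
Residuals <- numeric(nrow(Thalweg))  # one result per site
for (i in 1:nrow(Thalweg)) {
  residual <- Thalweg[i, "Xdep_Vdep"]
  for (j in 2:11) {
    residual <- min(residual, Thalweg[i, j])
  }
  Residuals[i] <- residual  # save the value before moving on to the next site
}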
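If the goal is the chained calculation from the question (each residual feeding into the next one), a vectorised sketch could look like the following. It uses only the first two sites and three columns from the sample data, and step_increment is a hypothetical stand-in for other_variables / Stacks_Equation, which the question does not define. pmin() takes the minimum per site, so all rows are handled at once and nothing gets mixed between sites.

```r
# First two sites and three depth columns from the sample data
Thalweg <- data.frame(StationID = c("1AAUA017.60", "1AXKR000.77"),
                      AB0 = c(47, 30), AB1 = c(45, 27), AB2 = c(44, 24),
                      Xdep_Vdep = c(18.29, 6.46))
cols <- c("AB0", "AB1", "AB2")
step_increment <- 1   # hypothetical placeholder for other_variables

Residuals <- matrix(NA_real_, nrow(Thalweg), length(cols),
                    dimnames = list(Thalweg$StationID, cols))
prev <- Thalweg$Xdep_Vdep
for (k in seq_along(cols)) {
  # first column compares against Xdep_Vdep, later ones against previous residual + increment
  candidate <- if (k == 1) prev else prev + step_increment
  prev <- pmin(candidate, Thalweg[[cols[k]]])   # per-site minimum
  Residuals[, k] <- prev
}

# Depth_ABx = ABx - Residual_ABx, one row per site
Depths <- as.matrix(Thalweg[cols]) - Residuals
```

Each row of Residuals and Depths belongs to one station, so the results can be bound back to the StationID column directly.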

R efficiently add up tables in different order

At some point in my code, I get a list of tables that looks much like this:
[[1]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
13 68 13 117 34 3.275941e-37
23 78 23 117 2 4.503111e-32
....
[[2]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
....
While I don't show the full table here, I know they are all the same size. What I want to do is make one table where I add up the p-values. The problem is that the $cluster_size, $start, $end and $number columns don't necessarily correspond to the same row when I look at the table in different list elements, so I can't just do a simple sum.
The brute force way to do this is to: 1) make a blank table 2) copy in the appropriate $cluster_size, $start, $end, $number columns from the first table and pull the correct p-values using a which() statement from all the tables. Is there a more clever way of doing this? Or is this pretty much it?
Edit: I was asked for a dput file of the data. It's located here:
http://alrig.com/code/
In the sample case, the order of the rows happens to match. That will not always be the case.
Seems like you can do this in two steps:
1. Convert your list to a data.frame.
2. Use any of the split-apply-combine approaches to summarize.
Assuming your data was named X, here's what you could do:
library(plyr)
#need to convert to data.frame since all of your list objects are of class matrix
XDF <- as.data.frame(do.call("rbind", X))
ddply(XDF, .(cluster_size, start, end, number), summarize, sump = sum(p_value))
#-----
cluster_size start end number sump
1 1 12 12 100 5.550142e-184
2 1 13 13 31 3.117856e-37
3 1 22 22 1 9.000000e+00
...
29 105 23 117 2 6.271469e-16
30 106 22 146 13 7.266746e-25
31 107 23 146 12 1.382328e-25
Lots of other aggregation techniques are covered here. I'd look at the data.table package if your data is large.
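The same two steps can be sketched in base R with aggregate(), without plyr. The two small matrices below are made up (only the key columns resemble the question's data, and the p-values are invented) and deliberately list the same keys in a different row order.

```r
cn <- c("cluster_size", "start", "end", "number", "p_value")
# Made-up stand-ins for the list elements; same keys, different row order
m1 <- matrix(c(2, 12, 13, 131, 0.01,
               1, 12, 12, 100, 0.02), ncol = 5, byrow = TRUE,
             dimnames = list(NULL, cn))
m2 <- matrix(c(1, 12, 12, 100, 0.03,
               2, 12, 13, 131, 0.05), ncol = 5, byrow = TRUE,
             dimnames = list(NULL, cn))
X <- list(m1, m2)

# Step 1: stack the matrices into one data frame
XDF <- as.data.frame(do.call("rbind", X))
# Step 2: sum p_value within each combination of the key columns
sums <- aggregate(p_value ~ cluster_size + start + end + number, data = XDF, FUN = sum)
```

Because the grouping is done on the key columns, the row order inside each list element no longer matters.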
