R: Splitting dataset by pre-determined values

I have data that looks like this (but larger):
Pos Value
0 66.81967
1 66.36885
2 65.79508
3 65.27049
4 64.88525
5 64.97541
6 65.39344
7 65.99181
8 66.63115
9 66.95901
10 66.89344
11 66.44262
12 65.90984
13 65.49181
14 65.35246
I have already determined the maxima and saved the position values of each to a vector like so:
9 19 30 42 56 69 80 92 107 118 130 143 154 164 176 188 199 211 222 234 245
I now want to split the data at the maxima, so for the sample data I'd want to split the dataset into the values for Positions 0-9 and the values for Positions 10-14, and save each of these subsets into vectors of their own.
I'm new to R (and coding) and was wondering how to best go about this.

Supposing your data frame is dat and your maxima positions are in a vector maxima, you might use
split(dat, cut(dat$Pos, breaks = maxima, include.lowest = TRUE))
For your example data frame:
dat <-
structure(list(Pos = 0:14, Value = c(66.81967, 66.36885, 65.79508,
65.27049, 64.88525, 64.97541, 65.39344, 65.99181, 66.63115, 66.95901,
66.89344, 66.44262, 65.90984, 65.49181, 65.35246)), .Names = c("Pos",
"Value"), class = "data.frame", row.names = c(NA, -15L))
and the first few values of your maxima in the range:
maxima <- c(0, 10, 19)
my code gives you a list of data frames
#$`[0,10]`
# Pos Value
#1 0 66.81967
#2 1 66.36885
#3 2 65.79508
#4 3 65.27049
#5 4 64.88525
#6 5 64.97541
#7 6 65.39344
#8 7 65.99181
#9 8 66.63115
#10 9 66.95901
#11 10 66.89344
#
#$`(10,19]`
# Pos Value
#12 11 66.44262
#13 12 65.90984
#14 13 65.49181
#15 14 65.35246
If you don't want data frames, but just Value, use
split(dat$Value, cut(dat$Pos, breaks = maxima, include.lowest = TRUE))
#$`[0,10]`
# [1] 66.81967 66.36885 65.79508 65.27049 64.88525 64.97541 65.39344 65.99181
# [9] 66.63115 66.95901 66.89344
#
#$`(10,19]`
# [1] 66.44262 65.90984 65.49181 65.35246
Thanks! How would I go about saving these as separate data frames/sets (not sure on the correct terminology) so that I can then fit them individually?
How about
lst <- split(dat, cut(dat$Pos, breaks = maxima, include.lowest = TRUE))
dir <- getwd()
lapply(seq_along(lst),
       function(i) write.csv(lst[[i]],
                             file = paste0(dir, "/", names(lst)[i], ".csv"),
                             row.names = FALSE))
This will save each data frame into a .csv file under directory dir. I have used getwd() to test the code; you may change it to a specific folder.
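If you'd rather keep the subsets in the R session and fit each one directly, you can skip the files and work on the list itself. A minimal sketch, where lm(Value ~ Pos) is only a placeholder for whatever model you actually need:
lst <- split(dat, cut(dat$Pos, breaks = maxima, include.lowest = TRUE))

# Fit each subset separately; swap in your real model here.
fits <- lapply(lst, function(d) lm(Value ~ Pos, data = d))

# Or, to get standalone data frames in the workspace (the names
# subset_1, subset_2, ... are just made up here):
list2env(setNames(lst, paste0("subset_", seq_along(lst))), envir = globalenv())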

Not sure if that is the best approach, but I would work with a list and use a for loop like this (untested):
maxpos <- c(9, 19, 30)
ans <- list()
prev <- 1
for (i in seq_along(maxpos)) {
  ans[[i]] <- dataset[seq(prev, maxpos[i]), ]
  prev <- maxpos[i] + 1
}
ans[[length(maxpos) + 1]] <- dataset[seq(prev, nrow(dataset)), ]


In R, Excel COUNTA equivalent, or another way to calculate averages in "char" columns?

I have a spreadsheet where some of the values are entered as "N/A", and some of the cells are blank.
joe    pete     mark     Average (per row)
90     85       N/A      87.5
N/A    (blank)  92       92
88     90       (blank)  89
3      2        2        3      <-- this row counts all non-blank values in each column
I want to import these into R to do two things:
Get an average of these values for each row across multiple columns, and
get a count of the values per column (as in the last row of the table above).
The problem is: I want to be able to count all the non-blank cells (including those with "N/A" values, as they are an important part of the data and are different from blanks).
What I tried:
I replaced the "N/A" values in Excel before importing into R, changing them to 0s so I could import the columns as numbers. The problem is that my averages are then wrong: if the "N/A" in the first row becomes 0, for example, the average is (90+85+0)/3 = 58.33.
That is not what I want. I want an average of only those that are not "N/A".
The other issue is that if I leave those as "N/A", I can get a count, but my columns are no longer numeric and I can't perform an average calculation.
I know I can do this easily in Excel with =COUNTA and =AVERAGE, but I would prefer to do as much wrangling as possible in R.
Any suggestions?
Thanks!!
Try something like this; na.rm=TRUE should be what you want:
example_data = c(90, 85, NA)
MEAN = mean(example_data, na.rm=TRUE)
base R
dat$avg <- mapply(function(...) {
  dots <- unlist(list(...))
  mean(suppressWarnings(as.numeric(dots[nzchar(dots)])), na.rm = TRUE)
}, dat$joe, dat$pete, dat$mark)
dat
# joe pete mark avg
# 1 90 85 <NA> 87.5
# 2 NA 92 92.0
# 3 88 90 89.0
as.data.frame(lapply(dat, function(z) sum(nzchar(z))))
# joe pete mark avg
# 1 3 2 2 3
dat <- rbind(dat, as.data.frame(lapply(dat, function(z) sum(nzchar(z)))))
dat
# joe pete mark avg
# 1 90 85 <NA> 87.5
# 2 NA 92 92.0
# 3 88 90 89.0
# 4 3 2 2 3.0
Data
dat <- structure(list(joe = c(90, NA, 88), pete = c("85", "", "90"), mark = c(NA, "92", "")), class = "data.frame", row.names = c(NA, -3L))
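If the data is coming straight from a CSV export of the spreadsheet, one way to keep blanks and "N/A" distinguishable is to read every column as character first. A sketch, assuming a hypothetical scores.csv:
# Read all columns as character so "" (blank) and "N/A" both survive;
# "scores.csv" is a made-up file name.
raw <- read.csv("scores.csv", colClasses = "character")

# Row means over the score columns; "N/A" and "" both become NA under
# as.numeric() and are dropped by na.rm = TRUE.
score_cols <- c("joe", "pete", "mark")
raw$avg <- apply(raw[score_cols], 1, function(r)
  mean(suppressWarnings(as.numeric(r)), na.rm = TRUE))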

R: Find out which observations are located in each "bar" of the histogram

I am working with the R programming language. Suppose I have the following data:
a = rnorm(1000,10,1)
b = rnorm(200,3,1)
c = rnorm(200,13,1)
d = c(a,b,c)
index <- 1:1400
my_data = data.frame(index,d)
I can make the following histograms of the same data by adjusting the "bin" length (via the "breaks" option):
hist(my_data$d, breaks = 10, main = "Histogram #1, Breaks = 10")
hist(my_data$d, breaks = 100, main = "Histogram #2, Breaks = 100")
hist(my_data$d, breaks = 5, main = "Histogram #3, Breaks = 5")
My Question: In each one of these histograms there is a different number of "bars" (i.e. bins). For example, in the first histogram there are 8 bars and in the third histogram there are 4 bars. For each one of these histograms, is there a way to find out which observations (from the original vector d) are located in each bar?
Right now, I am trying to manually do this, e.g. (for histogram #3)
histogram3_bar1 <- my_data[which(my_data$d < 5 & my_data$d > 0), ]
histogram3_bar2 <- my_data[which(my_data$d < 10 & my_data$d > 5), ]
histogram3_bar3 <- my_data[which(my_data$d < 15 & my_data$d > 10), ]
histogram3_bar4 <- my_data[which(my_data$d < 20 & my_data$d > 15), ]
head(histogram3_bar1)
index d
1001 1001 4.156393
1002 1002 3.358958
1003 1003 1.605904
1004 1004 3.603535
1006 1006 2.943456
1007 1007 1.586542
But is there a more "efficient" way to do this?
Thanks!
hist itself can provide the solution to the question's problem of finding out which data points fall in which intervals: hist returns a list whose first member, breaks, holds the bin end points.
First, make the problem reproducible by setting the RNG seed.
set.seed(2021)
a = rnorm(1000,10,1)
b = rnorm(200,3,1)
c = rnorm(200,13,1)
d = c(a,b,c)
Now, save the return value of hist and have findInterval tell you which bin each data point falls in.
h1 <- hist(d, breaks = 10)
f1 <- findInterval(d, h1$breaks)
h1$breaks
# [1] -2 0 2 4 6 8 10 12 14 16
head(f1)
#[1] 6 7 7 7 7 6
The first six observations are in intervals 6 and 7, with end points 8, 10 and 12, as can be seen by indexing d by f1:
head(d[f1])
#[1] 8.07743 10.26174 10.26174 10.26174 10.26174 8.07743
As for whether the intervals given by end points 8, 10 and 12 are left- or right-closed, see help("findInterval").
As a final check, table the values returned by findInterval and see if they match the histogram's counts.
table(f1)
#f1
# 1 2 3 4 5 6 7 8 9
# 2 34 130 34 17 478 512 169 24
h1$counts
#[1] 2 34 130 34 17 478 512 169 24
To attach the interval end points to each data point, build a small lookup table:
bins <- data.frame(bin = f1, min = h1$breaks[f1], max = h1$breaks[f1 + 1L])
head(bins)
# bin min max
#1 6 8 10
#2 7 10 12
#3 7 10 12
#4 7 10 12
#5 7 10 12
#6 6 8 10
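To actually extract the observations that fall in each bar, which is what the question asks for, the bin vector from findInterval can drive split; a short sketch:
bars <- split(my_data, f1)   # one data frame per histogram bar
head(bars[["1"]])            # observations in the first (leftmost) bar
This replaces the manual which() filtering with a single line per histogram.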

Wide to long with many different columns

I have used pivot_longer before, but this time I have a much more complex wide dataframe and I cannot sort it out. The example code below gives you a reproducible dataframe. I haven't dealt with such a thing before, so I'm not sure it's even correct to try to format this type of df in long format?
df <- data.frame(
ID = as.numeric(c("7","8","10","11","13","15","16")),
AGE = as.character(c("45 – 54","25 – 34","25 – 34","25 – 34","25 – 34","18 – 24","35 – 44")),
GENDER = as.character(c("Female","Female","Male","Female","Other","Male","Female")),
SD = as.numeric(c("3","0","0","0","3","2","0")),
GAMING = as.numeric(c("0","0","0","0","2","2","0")),
HW = as.numeric(c("2","2","0","2","2","2","2")),
R1_1 = as.numeric(c("10","34","69","53","79","55","28")),
M1_1 = as.numeric(c("65","32","64","53","87","55","27")),
P1_1 = as.numeric(c("65","38","67","54","88","44","26")),
R1_2 = as.numeric(c("15","57","37","54","75","91","37")),
M1_2 = as.numeric(c("90","26","42","56","74","90","37")),
P1_2 = as.numeric(c("90","44","33","54","79","95","37")),
R1_3 = as.numeric(c("5","47","80","27","61","19","57")),
M1_3 = as.numeric(c("30","71","80","34","71","15","57")),
P1_3 = as.numeric(c("30","36","81","35","62","8","56")),
R2_1 = as.numeric(c("10","39","75","31","71","80","59")),
M2_1 = as.numeric(c("90","51","74","15","70","75","61")),
P2_1 = as.numeric(c("90","52","35","34","69","83","60")),
R2_2 = as.numeric(c("10","45","31","54","39","95","77")),
M2_2 = as.numeric(c("60","70","40","78","5","97","75")),
P2_2 = as.numeric(c("60","40","41","58","9","97","76")),
R2_3 = as.numeric(c("5","38","78","45","25","16","22")),
M2_3 = as.numeric(c("30","34","84","62","33","52","20")),
P2_3 = as.numeric(c("30","34","82","45","32","16","22")),
R3_1 = as.numeric(c("10","40","41","42","62","89","41")),
M3_1 = as.numeric(c("90","67","37","40","27","89","42")),
P3_1 = as.numeric(c("90","34","51","44","38","84","43")),
R3_2 = as.numeric(c("10","37","20","54","8","93","69")),
M3_2 = as.numeric(c("60","38","21","62","5","95","71")),
P3_2 = as.numeric(c("60","38","23","65","14","92","69")),
R3_3 = as.numeric(c("5","30","62","11","60","32","52")),
M3_3 = as.numeric(c("30","67","34","55","45","25","45")),
P3_3 = as.numeric(c("30","28","41","24","53","23","52")),
R1_4 = as.numeric(c("10","40","61","17","39","72","25")),
M1_4 = as.numeric(c("45","20","63","25","62","70","23")),
P1_4 = as.numeric(c("45","52","56","16","26","72","27")),
R2_4 = as.numeric(c("5","21","70","33","80","68","30")),
M2_4 = as.numeric(c("35","21","69","27","85","69","23")),
P2_4 = as.numeric(c("35","32","34","25","79","63","29")),
R3_4 = as.numeric(c("10","29","68","21","8","71","41")),
M3_4 = as.numeric(c("50","37","66","28","33","65","41")),
P3_4 = as.numeric(c("50","38","47","28","24","71","41"))
)
I would like to reshape it so that the new column names are extracted from the old ones. For example, for R1_1:
R is the name of the column that will contain the value previously stored in R1_1,
1 (the first character after 'R' in R1_1) is the value used in column Speed,
1 (the last character of 'R1_1') is the value used in column Sound.
Basically, each row corresponds to one question answered by one person, and each question was answered through 3 different ratings (R, M, P).
thank you!
If I understood you correctly, the following should work:
library(dplyr)
library(tidyr)

df %>%
  pivot_longer(
    cols = matches('[RMP]\\d_\\d'),
    names_to = c('RMP', 'Speed', 'Sound'),
    values_to = 'Data',
    names_pattern = '([RMP])(\\d)_(\\d)'
  ) %>%
  pivot_wider(names_from = RMP, values_from = Data)
This assumes that both “speed” and “sound” are single-digit values. If there’s the possibility of multiple digits, the occurrences of \\d in the patterns above need to be replaced by \\d+.
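As a quick sanity check on the pipeline above, the result should have one row per person per Speed/Sound combination:
long <- df %>%
  pivot_longer(
    cols = matches('[RMP]\\d_\\d'),
    names_to = c('RMP', 'Speed', 'Sound'),
    values_to = 'Data',
    names_pattern = '([RMP])(\\d)_(\\d)'
  ) %>%
  pivot_wider(names_from = RMP, values_from = Data)
nrow(long)
# [1] 84   (7 people x 3 speeds x 4 sounds)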
Solution using our good ol' workhorse reshape. First we grep the names matching the "letter + digit_digit" pattern, as well as their "digit_digit" suffixes, for later use in reshape.
nm <- names(df[grep("_\\d", names(df))])
times <- unique(substr(nm, 2, 4))
res <- reshape(df, idvar = "ID", varying = 7:42,
               v.names = unique(substr(nm, 1, 1)),
               times = times, direction = "long")
Getting us close to the result, we just need to strsplit the newly created "time" variable at the "_" and cbind the two resulting columns onto the data.
res <- cbind(res, setNames(type.convert(do.call(rbind.data.frame,
                                                strsplit(res$time, "_"))),
                           c("Speed", "Sound")))
res <- res[order(res$AGE), ] ## some ordering
Result
head(res)
# ID AGE GENDER SD GAMING HW time R M P Speed Sound
# 15.1_1 15 18 – 24 Male 2 2 2 1_1 55 44 55 1 1
# 15.1_2 15 18 – 24 Male 2 2 2 1_2 90 95 91 1 2
# 15.1_3 15 18 – 24 Male 2 2 2 1_3 15 8 19 1 3
# 15.2_1 15 18 – 24 Male 2 2 2 2_1 75 83 80 2 1
# 15.2_2 15 18 – 24 Male 2 2 2 2_2 97 97 95 2 2
# 15.2_3 15 18 – 24 Male 2 2 2 2_3 52 16 16 2 3

Network Trip Assignment with igraph

My problem:
I have a street network (df.net) and a list containing the Origins and Destinations of trips (df.trips).
I need to find the flow on all links.
library(dplyr)
df.net = tribble(~from, ~to, ~weight,1,2,1,2,1,1,1,9,3,9,1,2,2,10,1,10,2,2,9,10,8,10,9,15,9,8,1,8,9,2,7,8,2,12,7,3,9,12,10,12,9,9,12,6,2,6,12,5,11,12,3,12,11,3,5,6,1,11,5,4,5,11,3,11,4,3,4,3,5,3,10,4,10,11,10)
df.trips = tribble(~from, ~to, ~N,1,2,45,1,4,24,1,5,66,1,9,12,1,11,54,2,3,63,2,4,22,2,7,88,2,12,44,3,2,6,3,8,43,3,10,20,3,11,4,4,1,9,4,5,7,4,6,35,4,9,1,5,7,55,5,8,21,5,1,23,5,7,12,5,2,18,6,2,31,6,3,6,6,5,15,6,8,19,7,1,78,7,2,48,7,3,92,7,6,6,8,2,77,8,4,5,8,5,35,8,6,63,8,7,22)
This is my solution:
library(igraph)
# I construct a directed igraph network:
graph = igraph::graph_from_data_frame(d=df.net, directed=T)
plot(graph)
# I make a vector of edge_ids:
edges = paste0(df.net$from,":",df.net$to)
# and an empty vector of same length to fill with the flow afterwards:
N = integer(length(edges))
# I loop through all Origin-Destination-pairs:
for (i in 1:nrow(df.trips)) {
  # provides one shortest path between one Origin & one Destination:
  path = shortest_paths(graph = graph,
                        from = as.character(df.trips$from[i]),
                        to = as.character(df.trips$to[i]),
                        mode = "out",
                        weights = NULL)
  # Extract the names of the vertices on the path:
  a = names(path$vpath[[1]])
  # Make a vector of the edge_ids:
  a2 = a[2:length(a)]
  a = a[1:(length(a) - 1)]
  a = paste0(a, ":", a2)
  # and fill the vector with the trips
  v = integer(length(edges))
  v[edges %in% a] = pull(df.trips[i, 3])
  # adding the trips of this iteration to the sum
  N = N + v
}
# attach vector to network-dataframe:
df.net = data.frame(df.net, N)
Theoretically it works. It just takes approx. 8h for my real network to finish (about 500 000 Origin-Destination-pairs on a network with a bit less than 50 000 links).
I am pretty sure my for-loop is the culprit.
So my questions concerning optimization are:
1) Is there an igraph function which simply does what I want to do? I could not find it...
2) Maybe there is another package better suited to my needs which I haven't stumbled upon?
3) If not, should I go for loop-performance improvement by rewriting it with the Rcpp-package?
Anyways, I am grateful for any help you can provide me.
Thanks in advance!
I have what I hope is a faster solution, although I get slightly different results from you.
This approach multithreads with data.table, calls igraph::shortest_paths only once per from vertex, and avoids using the names attribute of the graph until the trivial last step.
library(igraph)
library(tibble)
library(data.table)
library(zoo)
library(purrr)
df.net = tribble(~from, ~to, ~weight,1,2,1,2,1,1,1,9,3,9,1,2,2,10,1,10,2,2,9,10,8,10,9,15,9,8,1,8,9,2,7,8,2,12,7,3,9,12,10,12,9,9,12,6,2,6,12,5,11,12,3,12,11,3,5,6,1,11,5,4,5,11,3,11,4,3,4,3,5,3,10,4,10,11,10)
graph = igraph::graph_from_data_frame(d=df.net, directed=T)
df.trips = tribble(~from, ~to, ~N,1,2,45,1,4,24,1,5,66,1,9,12,1,11,54,2,3,63,2,4,22,2,7,88,2,12,44,3,2,6,3,8,43,3,10,20,3,11,4,4,1,9,4,5,7,4,6,35,4,9,1,5,7,55,5,8,21,5,1,23,5,7,12,5,2,18,6,2,31,6,3,6,6,5,15,6,8,19,7,1,78,7,2,48,7,3,92,7,6,6,8,2,77,8,4,5,8,5,35,8,6,63,8,7,22)
l.trips <- split(df.trips,1:nrow(df.trips))
setDT(df.trips)
Result <- df.trips[, setnames(
  lapply(shortest_paths(graph = graph, from = from, to = to,
                        weights = NULL, mode = "out")$vpath,
         function(x) { zoo::rollapply(x, width = 2, c) }) %>%
    map2(., N, ~ { .x %x% rep(1, .y) } %>% as.data.frame) %>%
    rbindlist %>%
    .[, .N, by = c("V1", "V2")],
  c("new.from", "new.to", "N")), by = from][, sum(N), by = c("new.from", "new.to")]
Result[,`:=`(new.from = V(graph)$name[Result$new.from],
new.to = V(graph)$name[Result$new.to])]
# new.from new.to V1
# 1: 1 2 320
# 2: 2 10 161
# 3: 1 9 224
# 4: 9 8 73
# 5: 10 11 146
# 6: 11 4 102
# 7: 2 1 167
# 8: 9 12 262
# 9: 4 3 44
#10: 9 1 286
#11: 12 6 83
#12: 12 11 24
#13: 11 5 20
#14: 10 2 16
#15: 11 12 35
#16: 12 7 439
#17: 8 9 485
#18: 7 8 406
#19: 6 12 202
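If you want to stay closer to the original loop, another option (a sketch, untimed) is to ask shortest_paths for edge paths directly via output = "epath" and accumulate flows by integer edge id, which avoids building and matching the "from:to" strings on every iteration:
# Edge ids from graph_from_data_frame follow the row order of df.net,
# so N below lines up with the network data frame.
N <- numeric(ecount(graph))
for (i in seq_len(nrow(df.trips))) {
  p <- shortest_paths(graph,
                      from = as.character(df.trips$from[i]),
                      to   = as.character(df.trips$to[i]),
                      mode = "out", output = "epath")
  eids <- as.integer(p$epath[[1]])
  N[eids] <- N[eids] + df.trips$N[i]
}
df.net$N <- N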

Merge with replacement based on multiple non-unique columns

I have two data frames. The first one contains the original state of an image with all the data available to reconstruct the image from scratch (the entire coordinate set and their color values).
I then have a second data frame. This one is smaller and contains only data about the differences (the changes made) between the updated state and the original state. Sort of like video encoding with key frames.
Unfortunately I don't have a unique id column to help me match them. I have an x column and a y column which, combined, make up a unique id.
My question is this: what is an elegant way of merging these two data sets, replacing the values in the original dataframe with the values in the "differenced" data frame whose x and y coordinates match?
Here's some example data to illustrate:
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
x y value
1 1 23 120
2 2 24 121
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 127
9 9 31 128
10 10 32 129
And the dataframe with updated differences:
update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)
x y value
1 1 2 50
2 2 24 51
3 3 17 52
4 4 23 53
5 8 30 54
The desired final output should contain all the rows in the original data frame. However, the rows in original where the x and y coordinates both match the corresponding coordinates in update, should have their value replaced with the values in the update data frame. Here's the desired output:
original_updated <- data.frame(x = 1:10, y = 23:32,
value = c(120, 51, 122:126, 54, 128:129))
x y value
1 1 23 120
2 2 24 51
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 54
9 9 31 128
10 10 32 129
I've tried to come up with a vectorised solution with indexing for some time, but I can't figure it out. Usually I'd use %in% if it were just one column with unique ids. But the two columns are non-unique.
One solution would be to treat them as strings or tuples and combine them to one column as a coordinate pair, and then use %in%.
But I was curious whether there were any solution to this problem involving indexing with boolean vectors. Any suggestions?
First merge in a way which guarantees all values from the original will be present:
merged = merge(original, update, by = c("x","y"), all.x = TRUE)
Then use dplyr to choose update's values where possible, and original's value otherwise:
library(dplyr)
middle = mutate(merged, value = ifelse(is.na(value.y), value.x, value.y))
final = select(middle, x, y, value)
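A slightly more compact variant of the same idea (a sketch) uses dplyr::coalesce, which picks the first non-NA value:
final = merge(original, update, by = c("x", "y"), all.x = TRUE) %>%
  mutate(value = coalesce(value.y, value.x)) %>%
  select(x, y, value)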
The match function is used to generate indices. It needs a nomatch argument to prevent NA appearing on the left-hand side of the [<- assignment for data frames. I don't think it is as transparent as a merge followed by replace, but I'm guessing it will be faster:
original[ match(update$x, original$x)[
            match(update$x, original$x, nomatch = 0) ==
            match(update$y, original$y, nomatch = 0)],
          "value" ] <-
  update[ which(match(update$x, original$x) == match(update$y, original$y)),
          "value" ]
You can see the difference:
> match(update$x, original$x)[
match(update$x, original$x) ==
match(update$y, original$y) ]
[1] NA 2 NA 8
> match(update$x, original$x)[
match(update$x, original$x, nomatch=0) ==
match(update$y, original$y,nomatch=0)]
[1] 2 8
The "interior" match functions are returning:
> match(update$y, original$y)
[1] NA 2 NA 1 8
> match(update$x, original$x)
[1] 1 2 3 4 8
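For completeness, the paste-key idea mentioned in the question also works with plain indexing. A sketch, assuming each x/y pair is unique within each data frame:
key_orig <- paste(original$x, original$y)
key_upd  <- paste(update$x, update$y)
idx <- match(key_orig, key_upd)   # position of each original row in update, or NA
original$value[!is.na(idx)] <- update$value[idx[!is.na(idx)]]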
