AreaCode   Name   Rank
1001108    HA     2326
1001247    HA     2327
1003063    GC     2328
1000957    DG     2329
1001290    EA     2330
1003305    GC     2331
1003417    GC     2332
1006442    WL     2333
1005076    PK     2334
1004581    NL     2335
I am new to R and am having some issues. I have a data set where I want to subset the closest higher/lower ranked AreaCodes to GC in order to do a case-control study.
So I want AreaCodes 1001247, 1000957, 1001290 and 1006442 in a separate data frame. How do I do this? I'm assuming through a loop, but I have no experience with these. This data has ~6000 observations, so doing it by hand becomes exhausting. Is there any way to do this?
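For reference, the sample data can be rebuilt as a data frame like this (column names taken from the table above; Name stored as character, which the answers below rely on):
library(dplyr)
df <- data.frame(
  AreaCode = c(1001108, 1001247, 1003063, 1000957, 1001290,
               1003305, 1003417, 1006442, 1005076, 1004581),
  Name = c("HA", "HA", "GC", "DG", "EA", "GC", "GC", "WL", "PK", "NL"),
  Rank = 2326:2335,
  stringsAsFactors = FALSE)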
An alternative would be something like this (assuming Name is a character variable):
library(dplyr)
df2 <- df %>%
  mutate(newcol = ifelse(Name != "GC" & (lag(Name) == "GC" | lead(Name) == "GC"), 1, 0)) %>%
  filter(newcol == 1)
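With the sample data above, this should keep the four rows immediately before and after each GC run:
df2$AreaCode
# [1] 1001247 1000957 1001290 1006442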
cumsum() and rle() are useful here:
brks <- cumsum(rle(df$Name)$lengths)
# [1] 2 3 4 5 7 8 9 10
equalsGC <- which(rle(df$Name)$values=="GC")
# [1] 2 5
ans <- df$AreaCode[sort(brks[c(equalsGC+1, equalsGC-1)])]
# [1] 1001247 1000957 1001290 1006442
As a single block:
brks <- cumsum(rle(df$Name)$lengths)
equalsGC <- which(rle(df$Name)$values=="GC")
ans <- df$AreaCode[sort(brks[c(equalsGC+1, equalsGC-1)])]
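To get those rows as a separate data frame, as the question asks, one way is to subset on the matched AreaCodes (gc_neighbours is just an illustrative name):
gc_neighbours <- df[df$AreaCode %in% ans, ]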
I have a dataframe that keeps track of the activities associated with a bank account (example below).
The initial balance is $5,000 (type "initial"). If type is "in", that means a cash deposit; in this example each deposit is $1,000. If type is "out", that means a withdrawal from the account; in this example each withdrawal is 10% of the account balance.
library(tibble)
data <- tibble(
  activity = 1:6,
  type = c("initial", "in", "out", "out", "in", "in"),
  input = c(5000, 1000, 10, 10, 1000, 1000))
Is there a dplyr solution to keep track of the balance after each activity? I have tried several ways, but I can't seem to find a way to efficiently calculate running totals together with the withdrawal amount (which depends on the running total).
For this example the output should be:
result <- tibble(
  activity = 1:6,
  type = c("initial", "in", "out", "out", "in", "in"),
  input = c(5000, 1000, 10, 10, 1000, 1000),
  balance = c(5000, 6000, 5400, 4860, 5860, 6860))
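As a quick check of the withdrawal rule on the third row: the balance after the first deposit is 6000, so
6000 - 6000 * 10 / 100   # balance after activity 3: 5400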
Thanks in advance for any suggestions or recommendations!
You can use purrr::accumulate2() to condition the calculation on the value of type:
library(dplyr)
library(purrr)
library(tidyr)
data %>%
  mutate(balance = accumulate2(input, type[-1],
    # .f receives the running balance (x), the next input (y) and that row's type
    .f = function(x, y, type) if (type == "out") x - x * y / 100 else x + y)) %>%
  unnest(balance)
# A tibble: 6 x 4
activity type input balance
<int> <chr> <dbl> <dbl>
1 1 initial 5000 5000
2 2 in 1000 6000
3 3 out 10 5400
4 4 out 10 4860
5 5 in 1000 5860
6 6 in 1000 6860
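For comparison, a base R sketch of the same running calculation with Reduce() and accumulate = TRUE (not from the original answer; it assumes the same deposit/withdrawal rules):
data$balance <- Reduce(
  function(bal, i) {
    # "out" removes input percent of the balance, "in" adds the input amount
    if (data$type[i] == "out") bal - bal * data$input[i] / 100 else bal + data$input[i]
  },
  x = 2:nrow(data), init = data$input[1], accumulate = TRUE)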
I have an Excel file with data for 2 DBs and multiple measures (Msr) for each. For each there is the classic ratio data Num/Denom = Rate. Can anybody suggest what visualization I can use in R to graphically find big differences (say 10%+) for each measure between the Test and X1 databases?
So we compare Denom, Num and Rate between rows 1 and 2,
..and then rows 3 and 4,
..and then rows 5 and 6, etc.
I tried to do this in Excel but read that R could be much better for this purpose. For now, most of the paired visualizations I can find use a scatter display; I need something more traditional, e.g. in my sample we could mark the X1 SRB rate as low.
In my example I have 3 measures; in reality there could be 30. Thanks much for the info.
db <- c('test','x1','test','x1','test','x1')
msr <- c('BCS','BCS','CCS','CCS','SRB','SRB')
denom <- c(11848,11049,35836,38458,54160,56387)
num <- c(5255,6376,16908,18124,26253,15000)
rate <- c(44.35,57.71,47.18,47.13,48.47,26.6)
df <- data.frame(db,msr,denom,num,rate)
df
db msr denom num rate
1 test BCS 11848 5255 44.35
2 x1 BCS 11049 6376 57.71
3 test CCS 35836 16908 47.18
4 x1 CCS 38458 18124 47.13
5 test SRB 54160 26253 48.47
6 x1 SRB 56387 15000 26.60
If I understood correctly, this should do what you want. I reshaped the data so you have one row per msr with separate columns for each db. I used data.table for its performance.
library(data.table)
db <- c('test','x1','test','x1','test','x1')
msr <- c('BCS','BCS','CCS','CCS','SRB','SRB')
denom <- c(11848,11049,35836,38458,54160,56387)
num <- c(5255,6376,16908,18124,26253,15000)
rate <- c(44.35,57.71,47.18,47.13,48.47,26.6)
df <- data.frame(db,msr,denom,num,rate)
#set as a data.table
setDT(df)
#cast into one row per MSR - fill in with the "rate" variable
out <- dcast(msr ~ db, data = df, value.var = "rate")
#Compute difference
out[, test_x1_diff := test - x1]
#keep rows where the absolute difference is >= 10
out[abs(test_x1_diff) >= 10]
#> msr test x1 test_x1_diff
#> 1: BCS 44.35 57.71 -13.36
#> 2: SRB 48.47 26.60 21.87
Created on 2019-01-11 by the reprex package (v0.2.1)
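If a picture is still wanted on top of the table, one possible sketch with ggplot2 (my own suggestion, not part of the answer above) is a dodged bar chart of rate per measure, one bar per database, so large Test-vs-X1 gaps stand out:
library(ggplot2)
ggplot(df, aes(x = msr, y = rate, fill = db)) +
  geom_col(position = "dodge") +
  labs(x = "Measure", y = "Rate (%)", fill = "Database")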
I am new to R and have been stuck with a problem for quite a while now ...
I have a big dataset (originally gridded data) with more than 1,000,000 observations, and I have to make a group variable for my elements.
My dataset looks as follows:
ID       Var1
1        0.5
2        0.6
3        0.2
4        0.15
...      ...
1029600  0.43
What I want now is to make groups according to the following scheme:
1 2 3 4 5 6 ... 4320
4321 4322 4323 4324 4325 4326 ... 8640
8641 8642 8643 8644 8645 8646 ... 12960
12961 12962 12963 12964 12965 12966 ... 17280
17281 17282 17283 17284 17285 17286 ... 21600
21601 21602 21603 21604 21605 21606 ... 25920
... ... ... ... ... ... ... ...
1025281 1025282 1025283 1025284 1025285 1025286... 1029600
Where the 36 numbers {1, 2, 3, 4, 5, 6, 4321, 4322, 4323, 4324, 4325, 4326, 8641, 8642, ..., 21606} are the first group.
The second group would be {7, 8, 9, 10, 11, 12, 4327, 4328, ..., 21612}. The third group would start with {13, 14, 15, ...}. And so on for all observations. I hope I have made clear what my goal is here. I wanted to visualize it with a picture, but as a new member, this is not possible.
So far I managed to do it with a really ugly set of nested loops, which looks as follows:
group <- numeric(1029600)   # preallocate the group vector
for (k in 0:40) {
  nk <- 25920 * k
  mk <- 720 * k
  for (j in 0:719) {
    cj <- j * 6
    for (i in 0:5) {
      ai <- i * 4320 + 1 + cj + nk
      bi <- i * 4320 + 6 + cj + nk
      group[ai:bi] <- 1 + j + mk
    }
  }
}
I am aware that this is pretty inefficient and it takes a very long time to compute this with loops. I am pretty sure that there is an easier way to solve my problem, but as I am new to R, I cannot find it myself.
Any help would be really appreciated. Thank you in advance!
You can get the group from the ID with a simple formula:
group <- (((ID - 1) %% 4320) %/% 6) + 1
Note that %% is the modulo operation and %/% is the integer division. The formula should give you groups numbered from 1. No need to include it in a loop, it is a vectorized operation.
There are plenty of ways to do it (like reshaping 1:1029600 into a matrix with 4320 columns, taking columns 6*N to 6*(N+1) and doing a match, or something similar), but this is why you should always stop and think about what you really want to do, and realize it comes down to a little arithmetic :)
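If the group numbers should keep counting up across each successive block of six matrix rows, as the mk <- 720 * k term in the original loop suggests, a sketch of the extended (still vectorised) formula would be:
# Assumes 4320 columns per matrix row and 6 x 6 cells per group, hence
# 720 groups per block of six rows (matching nk = 25920 * k and mk = 720 * k above)
group <- (((ID - 1) %% 4320) %/% 6) + 1 + 720 * ((ID - 1) %/% 25920)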
Create sample data
dtf <- data.frame(ID = 1:1e4, Var1 = rnorm(1:1e4))
Grouping as explained by #antine-sac:
group <- (((dtf$ID-1) %% 4320) %/% 6) +1
Split the data
dtfsplit <- split(dtf, group)
First group
> dtfsplit[1]
$`1`
ID Var1
1 1 0.56655
2 2 0.87645
3 3 -1.41986
4 4 -1.84881
5 5 0.03233
6 6 3.06512
4321 4321 -1.57179
4322 4322 -1.09958
4323 4323 0.55980
4324 4324 0.32390
4325 4325 0.85438
4326 4326 -0.10311
8641 8641 2.08886
8642 8642 1.19836
8643 8643 0.52592
8644 8644 0.20571
8645 8645 1.08429
8646 8646 0.69648
Second group
dtfsplit[2]
I am attempting to repeatedly add a "fixed number" to a numeric vector depending on a specified bin size. However, the "fixed number" is dependent on the data range.
For instance, I have a data range from 10 to 1010 and I wish to separate the data into 100 bins.
Since 1010 - 10 = 1000, and 1000 / 100 (the number of bins specified) = 10,
the ideal data would look like this:
bin1 - 10 (initial data)
bin2 - 20 (initial data + 10)
bin3 - 30 (initial data + 20)
bin4 - 40 (initial data + 30)
bin100 - 1010 (initial data + 1000)
Now the real data is slightly more complex: there is not just one data range but multiple data ranges. Hopefully the example below clarifies this.
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
Ideally I wish to get something like
10 20
20 30
30 40
.. ..
5000 5015
5015 5030
5030 5045
.. ..
4857694 4858096 # Note: theoretically this would have decimal places,
#but I do not want any decimal places
4858096 4858498
.. ..
So far I was thinking along the lines of this kind of function, but it seems inefficient because:
1) I have to retype the function 100 times (because my number of bins is 100);
2) I can't find a way to repeat the function along my values - in other words, my function can only deal with the data 10-1010 and not the next range, 5000-6500.
# The range of the variable
width <- end - start
# The number of required bins
bin_size <- 100
# The resulting width of each bin
bin_count <- width/bin_size
# Create a function
f1 <- function(x,y){
c(x[1],
x[1] + y[1],
x[1] + y[1]*2,
x[1] + y[1]*3)
}
f1(x = start, y = bin_count)
[1] 10 20 30 40
Any hints or ideas would be greatly appreciated. Thanks in advance!
After a few hours of trying, I managed to answer my own question, so I thought I'd share it. I used the "bins" function from the "binr" package to get the required bins. Please find below my attempt to answer my question; it's slightly different from the intended output, but for my purpose it is still okay.
library(binr)
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
tmp_list_start <- list() # Create an empty list
# This just extracts the output from the "bins" function into a list
for (i in seq_along(start)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
# Now I need to convert one of the outputs from bins into numeric values
s <- gsub(",.*", "", names(tmp$binct))
s <- gsub("\\[","",s)
tmp_list_start[[i]] <- as.numeric(s)
}
# Repeating the same thing with slight modification to get the end value of the bin
tmp_list_end <- list()
for (i in seq_along(end)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
e <- gsub(".*,", "", names(tmp$binct))
e <- gsub("]","",e)
tmp_list_end[[i]] <- as.numeric(e)
}
v1 <- unlist(tmp_list_start)
v2 <- unlist(tmp_list_end)
df <- data.frame(start=v1, end=v2)
head(df)
start end
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
6 61 70
Pardon my crappy code. Please share if there is a better way of doing this; it would be nice if someone could comment on how to wrap it into a function.
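One possible way to wrap this into a function (a sketch that reuses the same binr::bins() call and name parsing, with a single loop building both columns; bin_ranges is just an illustrative name):
library(binr)
bin_ranges <- function(start, end, n = 100) {
  out <- vector("list", length(start))
  for (i in seq_along(start)) {
    tmp <- bins(start[i]:end[i], target.bins = n, max.breaks = n)
    nm  <- names(tmp$binct)                       # e.g. "[10, 20]"
    s   <- as.numeric(gsub("\\[", "", gsub(",.*", "", nm)))
    e   <- as.numeric(gsub("]", "", gsub(".*,", "", nm)))
    out[[i]] <- data.frame(start = s, end = e)
  }
  do.call(rbind, out)
}
df <- bin_ranges(start, end)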
Here's a way that may help with base R:
bin_it <- function(START, END, BINS = 100) {
  range <- END - START
  jump <- range / BINS
  v1 <- c(START, seq(START + jump + 1, END, jump))
  v2 <- seq(START + jump - 1, END, jump) + 1
  data.frame(v1, v2)
}
It uses the function seq to create the vectors of numbers leading to the ending number. It may not work for every case, but for the ranges you gave it should give the desired output.
bin_it(10, 1010)
v1 v2
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
bin_it(5000, 6500)
v1 v2
1 5000 5015
2 5016 5030
3 5031 5045
4 5046 5060
5 5061 5075
bin_it(4857694, 4897909)
v1 v2
1 4857694 4858096
2 4858097 4858498
3 4858499 4858900
4 4858901 4859303
5 4859304 4859705
6 4859706 4860107
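A possible tweak (my own, not part of the answer above) if you want to drop the decimals that appear when a range does not divide evenly, as in the third example:
bin_it2 <- function(START, END, BINS = 100) {
  jump <- (END - START) / BINS
  # round each boundary so no decimal places remain, as the question asks
  v1 <- round(c(START, seq(START + jump + 1, END, jump)))
  v2 <- round(seq(START + jump - 1, END, jump) + 1)
  data.frame(v1, v2)
}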
I have a large data set of vehicles. They were recorded every 0.1 seconds, so their IDs repeat in the Vehicle ID column. In total there are 2169 vehicles. I filtered the Vehicle velocity column for every vehicle (using a for loop), which resulted in a new column with the first and last 30 values removed (per vehicle). In order to bind it with the original data frame, I removed the first and last 30 rows of the table too and then combined them using cbind(). This works only for the last vehicle. I want this smoothing and column binding for all vehicles, and finally I want to combine all the vehicles' data frames into one single table, that is, row-binding in sequence of vehicle IDs. This is what I wrote so far:
traj1 <- read.csv('trajectories-0750am-0805am.txt', sep=' ', header=F)
head(traj1)
names (traj1)<-c('Vehicle ID', 'Frame ID','Total Frames', 'Global Time','Local X', 'Local Y', 'Global X','Global Y','Vehicle Length','Vehicle width','Vehicle class','Vehicle velocity','Vehicle acceleration','Lane','Preceding Vehicle ID','Following Vehicle ID','Spacing','Headway')
# TIME COLUMN
Time <- sapply(traj1$'Frame ID', function(x) x/10)
traj1$'Time' <- Time
# SMOOTHING VELOCITY
# Weighted moving average: convolve the series with a symmetric
# double-exponential kernel of half-width D and decay delta, normalised so the
# weights sum to 1. The result is 2*D values shorter than the input.
smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D / delta))
  r <- convolve(x, z, type = 'filter') / convolve(rep(1, length(x)), z, type = 'filter')
  r
}
for (i in unique(traj1$'Vehicle ID')){
veh <- subset (traj1, traj1$'Vehicle ID'==i)
svel <- smooth(veh$'Vehicle velocity',30,10)
svel <- data.frame(svel)
veh <- head(tail(veh, -30), -30)
fta <- cbind(veh,svel)
}
'fta' now only shows the data frame for the last vehicle. But I want all data frames (for all vehicles i) combined by row. Maybe a for loop is not the right way to do it, but I don't know how I can use tapply (or any other apply function) to do so many things at the same time.
EDIT
I can't reproduce my dataset here, but the 'Orange' dataset in R provides a good analogy. Using the same smoothing function, the for loop would look like this (if the 'age' column is smoothed and the 'Tree' column is equivalent to my 'Vehicle ID' column):
for (i in unique(Orange$Tree)){
  tre <- subset(Orange, Orange$'Tree' == i)
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  age2 <- data.frame(age2)
  tre <- head(tail(tre, -2), -2)
  comb <- cbind(tre, age2)
}
Umair, I am not sure I understood what you want.
If I understood right, you want to combine all the results by row. To do that, you could save all the results in a list and then do.call rbind on it:
comb <- list() ### create list to save the results
length(comb) <- length(unique(Orange$Tree))
##Your loop for smoothing:
for (i in 1:length(unique(Orange$Tree))){
tre <- subset (Orange, Tree==unique(Orange$Tree)[i])
age2 <- round(smooth(tre$age,2,0.67),digits=2)
age2 <- data.frame(age2)
tre <- head(tail(tre, -2), -2)
comb[[i]] <- cbind(tre,age2) ### save results in the list
}
final.data<-do.call("rbind", comb) ### combine all results by row
This will give you:
Tree age circumference age2
3 1 664 87 687.88
4 1 1004 115 982.66
5 1 1231 120 1211.49
10 2 664 111 687.88
11 2 1004 156 982.66
12 2 1231 172 1211.49
17 3 664 75 687.88
18 3 1004 108 982.66
19 3 1231 115 1211.49
24 4 664 112 687.88
25 4 1004 167 982.66
26 4 1231 179 1211.49
31 5 664 81 687.88
32 5 1004 125 982.66
33 5 1231 142 1211.49
Just for fun, a different way to do it using plyr::ddply and sapply with split:
library(plyr)
data <- ddply(Orange, .(Tree), tail, n = -2)
data <- ddply(data, .(Tree), head, n = -2)
data <- cbind(data,
              age2 = matrix(sapply(split(Orange$age, Orange$Tree), smooth, D = 2, delta = 0.67),
                            ncol = 1, byrow = FALSE))
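And, assuming dplyr >= 0.8 is available, a sketch of the same per-group smoothing with group_modify() (my own variant, reusing the smooth() function defined in the question):
library(dplyr)
Orange %>%
  group_by(Tree) %>%
  group_modify(~ {
    # smooth each tree's ages, then drop the first and last 2 rows to match
    age2 <- round(smooth(.x$age, 2, 0.67), digits = 2)
    cbind(head(tail(.x, -2), -2), age2)
  }) %>%
  ungroup()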