I have a data frame that stores hierarchical data, part of which is shown below:
print(data)
level Name
1 WRG ASM ENGINE
2 MOUNT CLAMP
3 Carbon Steel
4 Carbon
3 PA
4 F-Fibre
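For reproducibility, the example above can be rebuilt like this (my sketch of the structure shown; the column types are assumptions):
data <- data.frame(
  level = c(1, 2, 3, 4, 3, 4),
  Name = c("WRG ASM ENGINE", "MOUNT CLAMP", "Carbon Steel", "Carbon", "PA", "F-Fibre"),
  stringsAsFactors = FALSE
)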
Now say I want to find the immediate parent of the row with name "Carbon". I am currently using the code below:
1. Find the level value for Carbon; the immediate parent will have a level
value 1 less than Carbon's.
level_carbon <- data[which(data$Name=="Carbon"), "level"]
2. Find the position of Carbon in the data frame
row_num_carbon <- which(data$Name=="Carbon")
3. Get the indices of all possible immediate parents
Parents_Carbon_index <- which(data$level == level_carbon - 1)
4. The index of the immediate parent will be less than Carbon's, and it will be
the closest such index to Carbon's in the data frame
Act_Parent_Carbon <- Parents_Carbon_index[which.min(Parents_Carbon_index < row_num_carbon) - 1]
Carbon_Parent <- data[Act_Parent_Carbon, "Name"]
print(Carbon_Parent)
"Carbon Steel"
The above code serves the purpose, but I am looking for something shorter that reads more cleanly and executes faster.
# create an identifier for order
data <- tibble::rowid_to_column(data)
# update following up on @r2evans' comment:
# below is a base R option to get rowids, since rowid_to_column requires a tibble
# data$rowid <- seq_len(nrow(data))
# conditions: one level up + before the given row + closest to given row
tail(data$Name[data$level == data$level[data$Name == "Carbon"] - 1 & data$rowid < data$rowid[data$Name == "Carbon"]], 1)
You can create a function to find the parent of a given item:
data$rowid <- seq_len(nrow(data)) # using the base R option, as @r2evans suggested
find_parent <- function(item) {
tail(data$Name[data$level == data$level[data$Name == item] - 1 & data$rowid < data$rowid[data$Name == item]], 1)
}
find_parent("Carbon")
# [1] "Carbon Steel"
with(data,Name[(s<-which(level==level["Carbon"==Name]-1))[max(s<which("Carbon"==Name))]])
[1] "Carbon Steel"
This is a follow-up to a previous post (R: Running multiple tests by selecting (and increasing) number of fixed data points selected):
I have a dataframe (saved as data.csv) that looks something like this:
person  outcome  baseline_post  time
1       0        baseline       BL_1
1       1        baseline       BL_2
1       0        baseline       BL_3
1       2        baseline       BL_4
1       4        post           post_1
1       3        post           post_2
1       4        post           post_3
1       6        post           post_4
2       1        baseline       BL_1
2       2        baseline       BL_2
2       0        baseline       BL_3
2       1        baseline       BL_4
2       3        post           post_1
2       2        post           post_2
2       4        post           post_3
2       3        post           post_4
As in the previous post, the goal is to iterate the same test (it can be any test) over the desired fixed combinations arranged across time,
i.e., for each participant, compare outcome(s) at BL_1 against post_1, then BL_1 and BL_2 against post_1, ..., then BL_1, BL_2, BL_3 and BL_4 against post_1, etc.
Basically, all combinations increasing in the number of weeks tested before (BL_1 to 4) and after (post_1 to 4) treatment.
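Conceptually, the 16 desired comparisons are all pairs of "first k baseline weeks" versus "first m post weeks"; a quick way to enumerate them (my sketch, for illustration):
expand.grid(n_baseline = 1:4, n_post = 1:4)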
I tried modifying @Caspar V.'s code (thanks @Caspar V. for your previous response):
#creating pre/post data frames for later use
library(dplyr) # needed for filter(), rename() and %>%
df <- read.csv("C:/Users/data.csv")
df_baseline <- filter(df, baseline_post == "baseline") %>%
rename(baseline = baseline_post) %>%
rename(time_baseline = time)
df_post <- filter(df, baseline_post == "post") %>%
rename(post = baseline_post) %>%
rename(time_post = time)
#generate a list of desired comparisons
comparisons = list()
for(a_len in seq_along(df_baseline$baseline)) for(b_len in seq_along(df_post$post)){
comp = list(baseline = head(df_baseline$time_baseline, a_len), post = head(df_post$time_post, b_len))
comparisons = append(comparisons, list(comp))
}
#KIV create combined df for time if required
df_baseline_post <- cbind(df_baseline$time_baseline, df_post$time_post)
colnames(df_baseline_post) = c("time_baseline", "time_post")
#iterate through list of comparisons
for(df_baseline_post in comparisons) {
cat(df_baseline_post$time_baseline, 'versus', df_baseline_post$time_post, '\n')
#this is where your analysis goes, poisson_frequencies being a test function I created
poisson_frequencies(df)
}
Unfortunately, this is my output, which is 16 bare "versus-es" (there being 16 possible combinations based on the above data):
versus
versus
versus
versus
versus
versus
...
versus
I am not sure what went wrong; I'd appreciate any input. I am new to programming in R.
There are a number of problems; the following should get you back on track. Good luck!
1)
You're getting 64 comparisons in comparisons, not 16; if you look at the contents of comparisons you'll see that. It's because you have duplicates in df$time. You'll need to remove them first:
#generate a list of desired comparisons
groupA = unique(df_baseline$time_baseline)
groupB = unique(df_post$time_post)
comparisons = list()
for(a_len in seq_along(groupA)) for(b_len in seq_along(groupB)) {
comp = list(baseline = head(groupA, a_len), post = head(groupB, b_len))
comparisons = append(comparisons, list(comp))
}
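As a quick sanity check (my addition), the fixed loop should now yield the expected 4 x 4 = 16 comparisons:
length(comparisons)
# [1] 16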
2)
The following block is not used, and the variable df_baseline_post is overwritten in the for-loop after it, so you can just remove this:
#KIV create combined df for time if required
# df_baseline_post <- cbind(df_baseline$time_baseline, df_post$time_post)
# colnames(df_baseline_post) = c("time_baseline", "time_post")
3)
You're executing poisson_frequencies(df) every time but not doing anything with the output; that's why you're not seeing anything. You'll need to wrap it in print(): print(poisson_frequencies(df)). Of course, df is also not the data you want to work with, but I hope you already knew that.
4)
df_baseline_post$time_baseline and df_baseline_post$time_post don't exist. The loop should be:
for(df_baseline_post in comparisons) {
cat(df_baseline_post$baseline, 'versus', df_baseline_post$post, '\n')
print(poisson_frequencies(df))
}
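With those fixes in place, the cat() line prints the actual time labels; the first few iterations would look something like this (illustrative, following the loop order above):
BL_1 versus post_1
BL_1 versus post_1 post_2
BL_1 versus post_1 post_2 post_3
...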
I am trying to calculate, for each person (pid), how many coworkers within that person's set of firms (fid) have a yob (year of birth) smaller than that person's. The second question is about unique pid. I am updating the question based on @langtang's efforts and my own reflections:
#Libraries:
library(data.table)
library(tictoc)
#Make it replicable:
set.seed(1)
#Define parameters of the simulation:
pid<-1:1000
fid<-1:5
time_periods<-1:12
yob<-sample(seq(1900,2010),length(pid),replace = TRUE)
#Obtain in how many firms a given pid works in a given month:
nr_firms_pid_time<-sample(1:length(fid),length(pid),replace = TRUE)
#This means:
#First pid: works in first firm;
#Second pid: works in first four firms;
#Third pid: works in first firm;
#Fourth pid: works in two firms.
#Aux functions:
function_rep<-function(x){
rep(1:12,x)
}
function_seq<-function(x){
1:x
}
#Create panel
data_panel<-data.table(pid = rep(pid,nr_firms_pid_time*length(time_periods)))
data_panel[,yearmonth:=do.call(c,sapply(nr_firms_pid_time,function_rep))]
data_panel[,fid:=rep(do.call(c,sapply(nr_firms_pid_time,function_seq)),each = 12)]
#Merge in yob:
data_yob<-data.table(pid = pid,yob = yob)
data_panel<-merge(data_panel,data_yob,by = c("pid"),all.x = TRUE)
#Remove not needed stuff:
rm(pid)
rm(fid)
rm(time_periods)
rm(yob)
rm(data_yob)
#Solution 1 (terribly slow):
# make a small function that counts the number of coworkers with
# earlier dob than this individual
older_coworkers = function(id,yrmonth) {
#First obtain firms in which a worker works in a given month:
id_firms<-data_panel[pid==id&yearmonth==yrmonth,fid]
#Then extract data at a given month:
data_func<-data_panel[(fid %in% id_firms)&(yearmonth==yrmonth)]
#Then extract his dob:
dob_to_use<-unique(data_func[pid==id,yob])
sum(data_func[pid!=id]$yob<dob_to_use)
}
older_coworkers_unique = function(id,yrmonth) {
#First obtain firms in which a worker works in a given month:
id_firms<-data_panel[pid==id&yearmonth==yrmonth,fid]
#Then extract data at a given month:
data_func<-data_panel[(fid %in% id_firms)&(yearmonth==yrmonth)]
#Then extract his dob:
dob_to_use<-unique(data_func[pid==id,yob])
#Get UNIQUE number of coworkers:
sum(unique(data_func[pid!=id],by = c("pid"))$yob<dob_to_use)
}
#Works but is terribly slow:
tic()
sol_1<-data_panel[, .(older_coworkers(.BY$pid,.BY$yearmonth)),by = c("pid","yearmonth")]
toc()
#Solution 2 (better, but I do not like it; what if I want unique older coworkers?)
function_older<-function(x){
noc<-lapply(
1:length(x),
function(i){
sum(x[-i]<x[i])
}
)
unlist(noc)
}
#This is fast but I cannot get the unique number:
tic()
sol_2<-data_panel[,.(pid,function_older(yob)),by = c("fid","yearmonth")][,sum(V2),by = c("pid","yearmonth")][order(pid,yearmonth)]
toc()
#Everything works:
identical(sol_1,sol_2)
The question is how to implement older_coworkers_unique in a very fast manner. Any suggestions would be greatly appreciated.
Update, based on OP's new reproducible dataset
If you want a one-liner to reproduce sol_2 above, you can do this:
data_panel[data_panel, on=.(yearmonth, fid, yob<yob )][, .N, by=.(i.pid, yearmonth)]
Explanation:
The above uses a non-equi join, which can be a helpful approach with data.table. I am joining data_panel on itself, requiring that yearmonth and fid be equal, but that year of birth (left side of the join) be less than year of birth (right side of the join). This returns a data.table where firm and yearmonth match, but where every older coworker (pid) is matched to their younger coworkers (i.pid). We can thus count the rows (.N) for each younger coworker (i.pid) and yearmonth. This produces the same as sol_1 and sol_2 above. You commented that you would like to find the unique coworkers, and the second approach below does that, using length(unique(pid)).
The same non-equi join approach can be used to get unique older coworkers, like this:
library(magrittr) # for the %>% pipe used below
data_panel[data_panel, on=.(yearmonth, fid, yob<yob )] %>%
  .[, .(older_coworkers = length(unique(pid))), by=.(i.pid, yearmonth)]
Previous Response, based on OP's original very small example dataset
I'm not sure exactly what you want the output to look like. However, in your example data, I first drop the duplicate row (I couldn't understand why it was there; see my comment above), and then I apply a function that counts the number of older coworkers for each pid/fid/ym.
# make your example data unique
data=unique(data)
# make a small function that counts the number of coworkers with
# earlier dob than this individual
older_coworkers = function(birth,firm,yrmonth,id) {
data[dob<birth & fid==firm & ym==yrmonth & pid!=id,.N]
}
# apply the function to the data
data[, .(num_older_coworkers = older_coworkers(dob,.BY$fid, .BY$ym, .BY$pid)), by=.(pid,fid,ym)]
Output:
pid fid ym num_older_coworkers
1: 1 1 200801 1
2: 1 2 200802 0
3: 2 1 200801 0
4: 3 2 200801 0
Person 1 at Firm 1 has one older coworker in the month of 2008-01 -- that is, Person 2 at Firm 1 in 2008-01.
Person 1 at Firm 2 (born in 1950) would also have an older coworker, namely Person 3 at Firm 2 (born in 1930), but the result shows 0 because Person 1's ym at Firm 2 (i.e. 2008-02) does not match that potential older coworker's ym (i.e. 2008-01).
I am working on market transaction data where each observation contains the buyer's id and the seller's id. For each observation (i.e. each transaction), I would like to create a variable equal to the number of other transactions the associated seller has done with a different buyer than the one involved in this transaction. As a consequence, given the following data:
data <- data.frame(Buyer_id = c("001","001","002","001"), Seller_id = c("021","022","022","021"))
I would like to obtain:
Result <- list(0,1,1,0)
I searched existing answers to problems similar to mine, usually involving the function mapply(), and tried to adapt them, but without success.
Thank you very much for helping me.
Are you looking for something like this? If yes, you might want to change your reproducible example to use c() instead of list() when you construct your expected result.
data <- data.frame(Buyer_id = c("001","001","002","001"),
Seller_id = c("021","022","022","021"))
data$n <- NA
for (i in seq_len(nrow(data))) {
seller <- as.character(data[i, "Seller_id"])
buyer <- as.character(data[i, "Buyer_id"])
with.buyers <- as.character(data[data$Seller_id == seller, "Buyer_id"])
with.buyers <- unique(with.buyers)
diff.buyers <- with.buyers[!(with.buyers %in% buyer)]
data[i, "n"] <- length(diff.buyers)
}
Buyer_id Seller_id n
1 001 021 0
2 001 022 1
3 002 022 1
4 001 021 0
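For what it's worth, here is a more compact variant of the same count (my sketch, not part of the original answer). It relies on the fact that the current buyer always belongs to the seller's group, so the number of distinct other buyers is the group's distinct-buyer count minus one:
library(dplyr)
data %>%
  group_by(Seller_id) %>%
  mutate(n = n_distinct(Buyer_id) - 1) %>%
  ungroup()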
Apart from Roman Lustrik's solution, there is also an approach that uses graphs.
library(igraph)
data <- data.frame(Seller_id = c("021","022","022","021"),
Buyer_id = c("001","001","002","001"),
stringsAsFactors = FALSE)
my.graph <- graph_from_data_frame(data)
plot(my.graph)
degree(my.graph, mode = c("out"))
# Transform the graph into a simple graph. A simple graph does not allow
# duplicate edges.
my.graph <- simplify(my.graph)
plot(my.graph)
degree(my.graph, mode = c("out"))
V(my.graph)$out.degree <- degree(my.graph, mode = c("out"))
data$n <- apply(data,
MARGIN = 1,
FUN = function(transaction)
{
node.out.degree <- V(my.graph)$out.degree[ V(my.graph)$name == transaction["Seller_id"] ]
if (node.out.degree <= 1) {
# Since the vertex has at most 1 out degree we know that the current transaction
# is the only appearance of the current seller.
return(0)
} else {
# In this case, we know that the seller participates in at least one more
# transaction. We therefore simply subtract one (the current transaction)
# from the out degree.
return(node.out.degree - 1)
}
})
data
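Assuming the same four transactions, printing data should again show n = 0, 1, 1, 0 (my reading of the code above, for illustration):
  Seller_id Buyer_id n
1       021      001 0
2       022      001 1
3       022      002 1
4       021      001 0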
I have a data set which, when plotted, produces a graph that looks like this:
[plot: score against position, roughly flat around 10 with one dip in the middle]
The head of this data is:
> head(data_frame)
score position
73860 10 43000
73859 10 43001
73858 10 43002
73857 10 43003
73856 10 43004
73855 10 43005
I've uploaded the whole file as a tab delimited text file here.
As you can see, the plot has regions which have a score of around 10, but there's one region in the middle that "dips". I would like to identify these dips.
Defining a dip as:
Starting when the score is below 7
Ending when the score rises to 7 or above and stays at 7 or above for at least 500 positions
I would like to identify all the regions which meet the above definition, and output their start and end positions. In this case that would only be the one region.
However, I'm at a bit of a loss as to how to do this. Looks like the rle() function could be useful, but I'm not too sure how to implement it.
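For reference, rle() encodes a vector as runs of equal values; a tiny illustration (my addition):
rle(c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE))
# Run Length Encoding
#   lengths: int [1:3] 2 3 1
#   values : logi [1:3] TRUE FALSE TRUE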
Expected output for the data frame would be something like:
[1] 44561 46568
(I haven't actually checked that everything in between these would qualify under the definition, but from the plot this looks about right)
I would be very grateful for any suggestions!
Andrei
So I've come up with one solution that uses a series of loops. I realise this is inefficient, though, so if you have a better answer, please let me know.
results <- data.frame(matrix(ncol = 2, nrow = 1))
colnames(results) <- c("start", "end")
state <- "out"
count <- 1
for (i in 1:nrow(data_frame)) {
  print(i / nrow(data_frame)) # progress indicator
  # a dip starts when the score drops below 7
  if (data_frame$score[i] < 7 & state == "out") {
    results[count, "start"] <- data_frame$position[i]
    state <- "in"
    next
  }
  # a dip ends when the score stays at/above 7 for at least 500 positions
  if (data_frame$score[i] >= 7 & state == "in") {
    if ((i + 500) > nrow(data_frame)) {
      results[count, "end"] <- data_frame$position[nrow(data_frame)]
      state <- "out"
      break
    }
    if (any(data_frame$score[(i + 1):(i + 500)] < 7)) {
      next
    } else {
      results[count, "end"] <- data_frame$position[i - 1]
      count <- count + 1
      state <- "out"
      next
    }
  }
  if ((i + 500) > nrow(data_frame) & state == "out") {
    break
  }
}
Something like this is a tidyverse solution that uses rle(), as the OP suggested...
library(dplyr)
library(tibble)
below7 <- data_frame$score < 7
x <- rle(below7)
runs <- tibble(
  RunLength = x$lengths,
  Below7 = x$values,
  RunStart = data_frame$position[1]
) %>%
  mutate(
    RunStart = ifelse(
      row_number() == 1,
      data_frame$position[1],
      data_frame$position[1] + cumsum(RunLength) - RunLength + 1
    ),
    RunEnd = RunStart + RunLength - 1,
    # a run belongs to a dip if it is below 7, or is a short (<500) run at/above 7
    Dip = Below7 | (!Below7 & RunLength < 500)
  )
as.data.frame(runs)
as.data.frame(runs)
Giving
RunLength Below7 RunStart RunEnd Dip
1 1393 FALSE 43000 44392 FALSE
2 84 TRUE 44394 44477 TRUE
3 84 FALSE 44478 44561 TRUE
...
19 60 FALSE 46338 46397 TRUE
20 171 TRUE 46398 46568 TRUE
21 2433 FALSE 46569 49001 FALSE
So, to get the OP's final answer:
runs %>%
filter(Dip) %>%
summarise(
DipStart=min(RunStart),
DipEnd=max(RunEnd)
)
# A tibble: 1 x 2
DipStart DipEnd
<dbl> <dbl>
1 44394 46568
If the original data.frame might contain more than one dip, you'd have to do a little more work when creating the runs tibble: having identified each individual run, you'd need to create an additional column, DipIndex say, which indexes each individual dip.
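A sketch of that extra step (my assumption of one reasonable implementation): start a new index each time a run enters a dip, then summarise per dip.
runs %>%
  mutate(DipIndex = ifelse(Dip, cumsum(Dip & !lag(Dip, default = FALSE)), NA)) %>%
  filter(Dip) %>%
  group_by(DipIndex) %>%
  summarise(DipStart = min(RunStart), DipEnd = max(RunEnd))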
I'd like to classify the values of a data frame according to two columns. Let's say I've got the following data frame:
my.df <- data.frame(a=c(1:20), b=c(61:80))
And now I want to subdivide it into 8 areas by dividing the 2D scatterplot into 4 equal parts and then overlaying a rectangle in the middle that consists of a quarter of each of the 4 parts. So far I've been using the following tedious approach:
ar <- range(my.df$a)
br <- range(my.df$b)
aint <- seq(ar[1], ar[2], by=(ar[2]-ar[1])/4)
bint <- seq(br[1], br[2], by=(br[2]-br[1])/4)
my.df$z <- NA
my.df[which(my.df$a < aint[3] & my.df$b < bint[3]),"z"] <- 1
my.df[which(my.df$a < aint[3] & my.df$b >= bint[3]),"z"] <- 2
...
my.df[which(my.df$z == 1 & my.df$a >= aint[2] & my.df$b >= bint[2]),"z"] <- 5
...
I am sure there must be a way to do it in a neater and more general way, i.e. by writing a general function, but I am struggling to write one myself.
Also, I was surprised to see that after all of this the class of the column z is automatically set to shingle. Why is that? How does R "know" that this is a shingle?
I'd approach it by cutting it into 16 groups first (x and y into 4 groups independently) and then combining them back together into fewer groups.
my.df$a.q <- cut(my.df$a, breaks=4, labels=1:4)
my.df$b.q <- cut(my.df$b, breaks=4, labels=1:4)
my.df$a.b.q <- paste(my.df$a.q, my.df$b.q, sep=".")
my.df$z <- c("1.1"=1, "1.2"=1, "1.3"=2, "1.4"=2,
"2.1"=1, "2.2"=3, "2.3"=4, "2.4"=2,
"3.1"=5, "3.2"=6, "3.3"=7, "3.4"=8,
"4.1"=5, "4.2"=5, "4.3"=8, "4.4"=8)[my.df$a.b.q]
This seems reasonable:
plot(my.df$a, my.df$b, col=my.df$z)
With some data with more coverage:
set.seed(1234)
my.df <- data.frame(a=runif(1000, 1, 20), b=runif(1000, 61, 80))
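And since the OP asked for a general function, the same lookup can be wrapped up (my sketch; the zone map simply reuses the mapping above):
classify_zones <- function(a, b) {
  zone.map <- c("1.1"=1, "1.2"=1, "1.3"=2, "1.4"=2,
                "2.1"=1, "2.2"=3, "2.3"=4, "2.4"=2,
                "3.1"=5, "3.2"=6, "3.3"=7, "3.4"=8,
                "4.1"=5, "4.2"=5, "4.3"=8, "4.4"=8)
  # quartile bin along each axis, pasted into a "row.col" key for the lookup
  key <- paste(cut(a, breaks=4, labels=1:4), cut(b, breaks=4, labels=1:4), sep=".")
  unname(zone.map[key])
}
my.df$z <- classify_zones(my.df$a, my.df$b)
plot(my.df$a, my.df$b, col=my.df$z)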