I am looking to extract the longest ordered portion of a vector. So for example with this vector:
x <- c(1,2,1,0.5,1,4,2,1:10)
x
[1] 1.0 2.0 1.0 0.5 1.0 4.0 2.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
I'd apply some function, get the following returned:
x_ord <- some_func(x)
x_ord
[1] 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
I've been trying to leverage is.unsorted() to determine at what point the vector is no longer sorted. Here is my messy attempt and what I have so far:
for (i in 1:length(x)) {
  if (is.unsorted(x[i:length(x)]) == TRUE) {
    cat(i, "\n")
  } else {
    x_ord = print(x[i])
  }
}
However, this clearly isn't right, as x_ord ends up as 10. I am also hoping to make this more general, so that it also covers non-increasing numbers after the ordered sequence, with a vector something like this:
x2 <- c(1,2,1,0.5,1,4,2,1:10,2,3)
Right now though I am stuck on identifying the increasing sequence in the first vector mentioned.
Any ideas?
This seems to work:
s = 1L + c(0L, which( x[-1L] < x[-length(x)] ), length(x))
w = which.max(diff(s))
x[s[w]:(s[w+1]-1L)]
# 1 2 3 4 5 6 7 8 9 10
s are where the runs start, plus length(x)+1, for convenience:
the first run starts at 1
subsequent runs start where there is a drop
we tack on length(x)+1, where the next run would start if the vector continued
diff(s) are the lengths of the runs and which.max takes the first maximizer, to break ties.
s[w] is the start of the chosen run; s[w+1L] is the start of the next run; so to get the numbers belonging to the chosen run: s[w]:(s[w+1]-1L).
Alternately, split and then select the desired subvector:
sp = split(x, cumsum(x < c(-Inf, x[-length(x)])))
sp[[which.max(lengths(sp))]]
# 1 2 3 4 5 6 7 8 9 10
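As a quick check (not in the original answer), the same split approach also appears to handle the follow-up vector x2, where non-increasing values trail the ordered run:

```r
x2 <- c(1, 2, 1, 0.5, 1, 4, 2, 1:10, 2, 3)

# a new run starts wherever a value drops below its predecessor
sp2 <- split(x2, cumsum(x2 < c(-Inf, x2[-length(x2)])))
sp2[[which.max(lengths(sp2))]]
# [1] 1 2 3 4 5 6 7 8 9 10
```

The trailing 2, 3 forms its own short run, so the 1:10 run is still the longest.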
So, I have a data frame containing 100 different variables, and I want to create 100 new variables, one corresponding to each variable in the original data frame. Currently I am trying loops and lapply to figure out a way to do it, but haven't had much luck so far.
Here is just a snapshot of what the data frame looks like (suppose my data frame is named er):
a b c d
1 2 3 4
5 6 7 8
9 0 1 2
and using each of these 4 variables I have to create a new variable, for a total of 4 new variables. The new variables should be something like a1 = 0.5 + a, b1 = 0.5 + b, and so on.
I am trying the following two approaches:
for (i in 1:ncol(er)) {
[[i]] <- 0.5 + [[i]]
}
and alternatively, I am trying lapply as follows:
dep <- lapply(er, function(x) {
x<-0.5+er
}
But neither of them is working. Can anyone tell me what's wrong with these attempts, or suggest an efficient way to do this? I have shown just 4 variables here for demonstration; I have around 100 of them.
You could directly add 0.5 (or any number) to the dataframe.
er[paste0(names(er), '1')] <- er + 0.5
er
# a b c d a1 b1 c1 d1
#1 1 2 3 4 1.5 2.5 3.5 4.5
#2 5 6 7 8 5.5 6.5 7.5 8.5
#3 9 0 1 2 9.5 0.5 1.5 2.5
Ronak's answer provides the most efficient way of solving your problem. I'll focus on why your attempts didn't work.
er <- data.frame(a = c(1, 5, 9), b = c(2, 6, 0), c = c(3, 7, 1), d = c(4, 8, 2))
A. for loop:
for (i in 1:ncol(er)) {
[[i]] <- 0.5 + [[i]]
}
Think about how R interprets each element of your loop: i will go from 1 to however many columns er has, so on the first iteration it will do:
[[1]] <- 0.5 + [[1]]
Which doesn't make sense, because you're not indicating which object you are indexing. Instead, what you want is:
for (i in 1:ncol(er)) {
er[[i]] <- 0.5 + er[[i]]
}
Here, each iteration means "assign to the ith column of er the ith column of er plus 0.5". If you instead want to create new variables, you would do the following (which is similar to Ronak's answer, just less efficient):
for (i in 1:ncol(er)) {
er[[paste0(names(er)[i], "1")]] <- 0.5 + er[[i]]
}
As a side note, it is preferred to use seq_along(er) instead of 1:ncol(er).
B. lapply:
dep <- lapply(er, function(x) {
x<-0.5+er
}
When creating a function, you need to specify what it should return. Here, function(x) { x + 0.5 } is sufficient to indicate that you want to return the variable plus 0.5 (note that your attempt is also missing the closing parenthesis of the lapply() call). Since lapply() returns a list (the function's name is short for "list apply"), you'll want to wrap the result in as.data.frame():
as.data.frame(lapply(er, function(x) { x + 0.5 }))
However, this doesn't change the variable names, so they have to be adjusted afterwards:
dep <- as.data.frame(lapply(er, function(x) { x + 0.5 }))
names(dep) <- paste0(names(dep), "1")
cbind(er, dep)
a b c d a1 b1 c1 d1
1 1 2 3 4 1.5 2.5 3.5 4.5
2 5 6 7 8 5.5 6.5 7.5 8.5
3 9 0 1 2 9.5 0.5 1.5 2.5
C. Another way would be using dplyr syntax, which is more elegant and readable:
library(dplyr)
mutate(er, across(everything(), ~ . + 0.5, .names = "{.col}1"))
I am a beginner with the R language. I want to write vectors of different sizes into a csv. Here is my code:
library(igraph)
library(DirectedClustering)
my_list = readLines("F://RR//listtest.csv")
eigen <- c()
for (i in 1:length(my_list)) {
  my_data   <- read.csv(my_list[i], header = TRUE, row.names = 1)
  my_matrix <- as.matrix(my_data)
  g1 <- graph_from_adjacency_matrix(my_matrix, weighted = TRUE, diag = FALSE)
  e1 <- eigen_centrality(g1, directed = TRUE)
  eigen[[i]] <- e1[["vector"]]
}
df = data.frame(eigenvalue,eigen)
df
write.csv(df, "F://RR//outtest.csv")
The first problem is that, because the vectors have different sizes (the maximum is 14), data.frame() cannot be used.
The second problem is that when I write same-sized vectors to a csv file, it displays like:
  Vec1 Vec2 Vec3
1  2.5  3.5  4.5
2  1.8  1.6  1.4
3  1.3  5.8  9.9
but I want it displayed row by row, something like:
1 2.5
2 3.5
3 4.5
4 1.8
5 1.6
6 1.4
7 1.3
8 5.8
9 9.9
I really need your help, thanks a lot.
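No answer appears here, but one possible way around both problems (a sketch, using made-up numbers in place of the real eigen list) is to flatten the list into a single long column before writing; unlist() concatenates vectors regardless of their lengths, so the size mismatch disappears:

```r
# hypothetical stand-in for the eigen list built by the loop above
eigen <- list(c(2.5, 3.5, 4.5), c(1.8, 1.6, 1.4), c(1.3, 5.8, 9.9))

# one row per value: which graph it came from, plus the value itself
df <- data.frame(graph = rep(seq_along(eigen), lengths(eigen)),
                 value = unlist(eigen, use.names = FALSE))
write.csv(df, "outtest.csv", row.names = FALSE)
```

The value column then reads 2.5, 3.5, 4.5, 1.8, ... from top to bottom, matching the desired row-by-row layout.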
I infrequently use Access to update one table with another using an inner join and some selection conditions, and am trying to find a method for this sort of operation in R.
# Example data to be updated
ID <- c('A','A','A','B','B','B','C','C','C')
Fr <- c(0,1.5,3,0,1.5,4.5,0,3,6)
To <- c(1.5,3,6,1.5,4.5,9,3,6,9)
dfA <- data.frame(ID,Fr,To)
dfA$Vl <- NA
I wish to update dfA$Vl using the Vl field in a second data frame, as below:
# Example data to do the updating
ID <- c('A','A','B','B','B','C','C','C')
Fr <- c(0,3,0,1,3,0,4,7)
To <- c(3,6,1,3,9,4,7,9)
Vl <- c(1,2,3,4,5,6,7,8)
dfB <- data.frame(ID,Fr,To,Vl)
The following is the Access SQL syntax I would use for this type of update
UPDATE DfA INNER JOIN DfB ON DfA.ID = DfB.ID SET DfA.Vl = [DfB].[Vl]
WHERE (((DfA.Fr)<=[DfB].[To]) AND ((DfA.To)>[DfB].[Fr]));
This reports that 14 rows are being updated (even though there are only 9 in dfA), as some rows meet the selection conditions more than once and the updates are applied sequentially. I'm not concerned about this inconsistency, as the result is sufficient for the intended purpose. However, it would be more precise to match the longest overlapping (To-Fr) range from DfB to the (To-Fr) range of DfA (bonus points for that solution).
The result I end up with from Access is as follows
# Result
ID <- c('A','A','A','B','B','B','C','C','C')
Fr <- c(0,1.5,3,0,1.5,4.5,0,3,6)
To <- c(1.5,3,6,1.5,4.5,9,3,6,9)
Vl <- c(1,1,2,4,5,5,6,7,8)
dfC <- data.frame(ID,Fr,To,Vl)
So the question is: what is the best R way to address this operation, or alternatively (or additionally) how can the Access SQL be reproduced with the R sql packages? Also (for extra credit), how can I make sure the largest To-Fr overlap is the one used for the update, not simply the last update operation?
A possible approach using data.table:
library(data.table)
setDT(dfA); setDT(dfB); setDT(dfC)
dfA[, rn:=.I]
# non-equi join like your Access SQL
dfB[dfA, on=.(ID, To>=Fr, Fr<To), .(rn, i.ID, i.Fr, i.To, x.Vl, x.Fr, x.To)][,
  # calculate the overlapping range
  rng := pmin(x.To, i.To) - pmax(x.Fr, i.Fr)][,
  # find the rows with the max overlapping range; in case of dupes, choose the first row
  first(.SD[rng==max(rng), .(ID=i.ID, Fr=i.Fr, To=i.To, Vl=x.Vl)]), by=.(rn)]
output:
rn ID Fr To Vl
1: 1 A 0.0 1.5 1
2: 2 A 1.5 3.0 1
3: 3 A 3.0 6.0 2
4: 4 B 0.0 1.5 3 #diff from dfC as Vl=3 has a bigger overlap
5: 5 B 1.5 4.5 4 #diff from dfC. both overlaps by 1.5 so either 4/5 works
6: 6 B 4.5 9.0 5
7: 7 C 0.0 3.0 6
8: 8 C 3.0 6.0 7
9: 9 C 6.0 9.0 8
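For comparison, the same idea can be sketched in base R (an illustration, not part of the original answer): merge on ID, keep the pairs satisfying the Access WHERE clause, then retain the dfB row with the largest overlap for each dfA row:

```r
# data from the question
dfA <- data.frame(ID = c('A','A','A','B','B','B','C','C','C'),
                  Fr = c(0, 1.5, 3, 0, 1.5, 4.5, 0, 3, 6),
                  To = c(1.5, 3, 6, 1.5, 4.5, 9, 3, 6, 9))
dfB <- data.frame(ID = c('A','A','B','B','B','C','C','C'),
                  Fr = c(0, 3, 0, 1, 3, 0, 4, 7),
                  To = c(3, 6, 1, 3, 9, 4, 7, 9),
                  Vl = c(1, 2, 3, 4, 5, 6, 7, 8))

# all ID pairs, then keep those satisfying the Access WHERE clause
m <- merge(dfA, dfB, by = "ID", suffixes = c("", ".B"))
m <- m[m$Fr <= m$To.B & m$To > m$Fr.B, ]

# overlap length of each surviving pair; keep the largest per dfA row
m$ovl <- pmin(m$To, m$To.B) - pmax(m$Fr, m$Fr.B)
m <- m[order(m$ID, m$Fr, -m$ovl), ]
res <- m[!duplicated(m[c("ID", "Fr", "To")]), c("ID", "Fr", "To", "Vl")]
res
```

As with the data.table version, row B (1.5, 4.5) is a tie between Vl = 4 and Vl = 5, so either answer is defensible there.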
The objective is to fill an upper triangular matrix with counts calculated from the ratings dataset. Each value is calculated and stored by finding the correct index; it is not stored sequentially. The R code below works correctly, but takes too much time for large datasets.
ratings <- read.csv("ratings.csv", header=TRUE, sep=",")
> head(ratings)
userId movieId rating timestamp
1 1 16 4.0 1217897793
2 1 24 1.5 1217895807
3 1 32 4.0 1217896246
4 1 47 4.0 1217896556
5 1 50 4.0 1217896523
6 1 110 4.0 1217896150
no_nodes <- nrow(movies)*2
temp <- movies$movieId
nodes_name <- c(paste(temp,"-L",sep=""),paste(temp,"-D",sep=""))
ac_graph <- matrix(NA,nrow=length(nodes_name),ncol=length(nodes_name),dimnames = list(nodes_name,nodes_name))
for (i in 1:nrow(movies)) {
  for (j in (i+1):nrow(movies)) {
    ac_graph[which(nodes_name == paste(i, "-L", sep="")),
             which(nodes_name == paste(j, "-L", sep=""))] <-
      length(intersect(ratings[ratings$movieId==i & ratings$rating > 2.5, 1],
                       ratings[ratings$movieId==j & ratings$rating > 2.5, 1]))
    ac_graph[which(nodes_name == paste(i, "-D", sep="")),
             which(nodes_name == paste(j, "-D", sep=""))] <-
      length(intersect(ratings[ratings$movieId==i & ratings$rating <= 2.5, 1],
                       ratings[ratings$movieId==j & ratings$rating <= 2.5, 1]))
  }
}
Is it possible to do the same using apply, sapply, outer, or some similar function?
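One vectorised possibility (a sketch, assuming each user rates a given movie at most once, and illustrated with toy data in the question's shape): build user-by-movie incidence matrices for the liked (> 2.5) and disliked (<= 2.5) ratings, and let crossprod() count the users each pair of movie columns shares; entry (i, j) then equals the length(intersect(...)) that the double loop computes:

```r
# toy ratings in the same shape as the question's data
ratings <- data.frame(userId  = c(1, 1, 2, 2, 3),
                      movieId = c(1, 2, 1, 2, 1),
                      rating  = c(4, 4, 3, 2, 1))

liked <- ratings$rating > 2.5

# user x movie 0/1 incidence matrices for liked and disliked ratings
tab_L <- table(ratings$userId[liked],  ratings$movieId[liked])
tab_D <- table(ratings$userId[!liked], ratings$movieId[!liked])

# entry (i, j) = number of users who liked (resp. disliked) both movies
counts_L <- crossprod(tab_L)   # same as t(tab_L) %*% tab_L
counts_D <- crossprod(tab_D)
```

Note the rows and columns are indexed by the actual movieId values (as character dimnames) rather than 1:nrow(movies); take the upper triangle with upper.tri() if only that part is needed.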
I would like to analyze a dataset by group. The data is set up like this:
Group Result cens
A 1.3 1
A 2.4 0
A 2.1 0
B 1.2 1
B 1.7 0
B 1.9 0
I have a function that calculates the following
sumStats = function(obs, cens) {
  detects    = obs[cens == 0]
  nondetects = obs[cens == 1]
  mean.detects = mean(detects)
  return(mean.detects)
}
This is, of course, a simple function for illustration purposes. Is there a function in R that will let me apply this home-made function, which takes 2 input variables, to the data by group?
I looked into the by function, but it seems to take only 1 column of data at a time.
Import your data:
test <- read.table(header=TRUE,textConnection("Group Result cens
A 1.3 1
A 2.4 0
A 2.1 0
B 1.2 1
B 1.7 0
B 1.9 0"))
Though there are many ways to do this, using by specifically you could do something like this (assuming your dataframe is called test):
by(test,test$Group,function(x) mean(x$Result[x$cens==1]))
which will give you the mean of all the Results values within each group which have cens==1
Output looks like:
test$Group: A
[1] 1.3
----------------------------------------------------------------------
test$Group: B
[1] 1.2
To help you understand how this might work with your function, consider this:
If you just ask the by statement to return the contents of each group, you will get:
> by(test,test$Group,function(x) return(x))
test$Group: A
Group Result cens
1 A 1.3 1
2 A 2.4 0
3 A 2.1 0
-----------------------------------------------------------------------
test$Group: B
Group Result cens
4 B 1.2 1
5 B 1.7 0
6 B 1.9 0
...which is actually 2 data frames with only the rows for each group, stored as a list.
This means you can access parts of the data frames for each group just as you would before they were split up. The x in the above functions refers to the whole sub-dataframe for each group, i.e. you can use individual variables within x to pass to functions. A basic example:
> by(test,test$Group,function(x) x$Result)
test$Group: A
[1] 1.3 2.4 2.1
-------------------------------------------------------------------
test$Group: B
[1] 1.2 1.7 1.9
Now, to finally get around to answering your specific query!
If you take an example function which gets the mean of two inputs separately:
sumStats = function(var1, var2) {
res1 <- mean(var1)
res2 <- mean(var2)
output <- c(res1,res2)
return(output)
}
You could call this using by to get the mean of both Result and cens like so:
> by(test,test$Group,function(x) sumStats(x$Result,x$cens))
test$Group: A
[1] 1.9333333 0.3333333
----------------------------------------------------------------------
test$Group: B
[1] 1.6000000 0.3333333
Hope that is helpful.
The aggregate function is designed for this.
aggregate(dfrm$cens, dfrm["Group"], FUN = mean)
You can get the mean value of several columns at once, each within 'Group':
aggregate(dfrm[ , c("Result", "cens") ], dfrm["Group"], FUN = mean)
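Applied to the test data shown in the earlier answer (note the grouping column there is named Group, capitalised), this gives per-group means of both columns:

```r
# data from the earlier answer
test <- read.table(header = TRUE, text = "Group Result cens
A 1.3 1
A 2.4 0
A 2.1 0
B 1.2 1
B 1.7 0
B 1.9 0")

res <- aggregate(test[, c("Result", "cens")], test["Group"], FUN = mean)
res
#   Group   Result      cens
# 1     A 1.933333 0.3333333
# 2     B 1.600000 0.3333333
```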