Lookup of entries with multiplicities

Suppose I have a vector data <- c(1,2,2,1) and a reference table, say: ref <- cbind(c(1,1,2,2,2,2,4,4), c(1,2,3,4,5,6,7,8))
I would like my code to return the following vector: result <- c(1,2,3,4,5,6,3,4,5,6,1,2). That is, for each element of data, in order, I want every value in the second column of ref whose first-column entry matches it. This is like the R function match(), but match() returns only the first matching position in the reference vector, and %in% only tells me whether a match exists.
I have tried functions like merge() and join(), but I would like something that uses only a combination of the rep() and seq() R functions.

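You can try
ref[ref[,1] %in% data,2]
to return the second-column value wherever the first-column value occurs in data. This returns each matching row only once, in the order of ref, so repeated entries in data are not repeated in the result. To get one block of matches per element of data, wrap the lookup in lapply():
unlist(lapply(data, function(x) ref[ref[,1] ==x, 2]))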

You can get the indices you are looking for like this:
indices <- sapply(data,function(xx)which(ref[,1]==xx))
Of course, that is a list, since the number of hits will be different for each entry of data. So you just unlist() this:
ref[unlist(indices),2]
[1] 1 2 3 4 5 6 3 4 5 6 1 2
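
For the rep()/seq() flavour the question asks for, here is a minimal sketch. It assumes that rows of ref sharing a key are contiguous (as in the example) and that every element of data occurs in ref. The names keys, first, n and idx are illustrative only, and sequence() is used as the vectorized relative of seq_len().

```r
data <- c(1, 2, 2, 1)
ref  <- cbind(c(1, 1, 2, 2, 2, 2, 4, 4), c(1, 2, 3, 4, 5, 6, 7, 8))

keys  <- ref[, 1]
first <- match(data, keys)                          # first row of each key's block (assumes contiguous keys)
n     <- as.vector(table(keys)[as.character(data)]) # block length for each element of data
idx   <- rep(first, n) + sequence(n) - 1            # expand each block into consecutive row indices
result <- ref[idx, 2]
```

If ref is not grouped by key, sorting it by the first column beforehand should restore the assumption.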

Related

R: retrieve dataframe name from another dataframe

I have a dataframe datasetselect that tells me which dataframe to use for each case of an analysis (let's call this the relevant dataframe).
The case is assigned dynamically, so which dataframe is relevant depends on the current case.
Based on the case, I would like to assign the relevant dataframe to a pointer "relevantdf". I tried:
datasetselect <- data.frame(case=c("case1","case2"),dataset=c("df1","df2"))
df1 <- data.frame(var1=letters[1:3],var2=1:3)
df2 <- data.frame(var1=letters[4:10],var2=4:10)
currentcase <- "case1"
relevantdf <- get(datasetselect[datasetselect$case == currentcase,"dataset"]) # relevantdf should point to df1
I can't tell whether the problem is with the get() function or with the subsetting.

You are almost there. The problem is that the dataset column of datasetselect is a factor (the default for strings in data.frame() before R 4.0.0), so get() is handed a factor instead of a character string. You just need to convert the column to character.
You can add this line after the definition of datasetselect:
datasetselect$dataset <- as.character(datasetselect$dataset)
And you get your expected output:
> relevantdf
  var1 var2
1    a    1
2    b    2
3    c    3
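
As a hedged sketch of two variations on this: the first keeps the names as character from the start with stringsAsFactors = FALSE, and the second avoids get() altogether by storing the data frames in a named list. The list name datasets and the variable relevantdf2 are hypothetical and not part of the original question.

```r
df1 <- data.frame(var1 = letters[1:3],  var2 = 1:3)
df2 <- data.frame(var1 = letters[4:10], var2 = 4:10)

# Variation 1: store the names as character, so get() receives a string
datasetselect <- data.frame(case = c("case1", "case2"),
                            dataset = c("df1", "df2"),
                            stringsAsFactors = FALSE)
currentcase <- "case1"
relevantdf <- get(datasetselect$dataset[datasetselect$case == currentcase])

# Variation 2 (assumed alternative): look the data frame up in a named list
datasets <- list(case1 = df1, case2 = df2)
relevantdf2 <- datasets[[currentcase]]
```

The list version does not depend on objects with particular names existing in the calling environment, which can make the lookup easier to follow.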

in R: remove rows containing no integer (such as characters i.e.) from a data frame

My data frame df looks as follows:
Variable A   Variable B   Variable C
9            2            1
2            0            don't know
maybe        1            1
?            0            3
I need to remove all rows that contain non-numerical values. It should look like this afterwards:
Variable A   Variable B   Variable C
9            2            1
I thought about something like
df[! grepl(*!= numerical*, df),]
or
df[! df %in% *!= numerical*, ]
but I can't find anything I could use as input for "take all that doesn't match numerical values". Could you please help me?
Thanks a lot!

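One option is to loop through the columns and convert each to numeric so that all non-numeric elements become NA, check for NA with is.na(), negate it with !, combine the resulting logical vectors elementwise with Reduce() and &, and use the result to subset the rows. Converting through as.character() first matters when the columns are factors, because as.numeric() on a factor returns its integer codes rather than NA. The coercion warnings are expected here.
df[Reduce(`&`, lapply(df, function(x) !is.na(as.numeric(as.character(x))))),]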
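
To make the steps concrete, here is a small self-contained sketch on data shaped like the question's table. The column names A, B, C and the helper names is_num, keep and clean are assumptions for illustration; suppressWarnings() only hides the expected "NAs introduced by coercion" warnings.

```r
df <- data.frame(A = c("9", "2", "maybe", "?"),
                 B = c(2, 0, 1, 0),
                 C = c("1", "don't know", "1", "3"),
                 stringsAsFactors = FALSE)

# One logical vector per column: TRUE where the cell parses as a number
is_num <- lapply(df, function(x) {
  !is.na(suppressWarnings(as.numeric(as.character(x))))
})

# A row is kept only if every column is TRUE for that row
keep  <- Reduce(`&`, is_num)
clean <- df[keep, ]
```

Note that cells that were already NA are treated like non-numeric cells by this check.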

This might not be the best way to do it, but it works.
s is the data frame that contains your data.
library(magrittr) # provides %>%
contains <- lapply(seq_len(nrow(s)), function(i){
  row <- vapply(s[i,], as.character, character(1)) # one string per cell, also for factor columns
  any(grepl("[^0-9.]", row)) # TRUE if any cell has a character other than a digit or "."
}) %>% unlist
s <- s[!contains,]
Thanks!

Countif with multiple criterias in R

I would like to count all rows where two criteria are matched. My approach was:
a <- c(2,1,2,2,3,4,2,1,9)
b <- c(2,1,2,1,4,4,5,6,7)
c <- data.frame(a,b)
sum((c['a']==2) && (c['b']==2))
but for some reason this gives 1 instead of 2. How can I count the rows that match multiple criteria?

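I think you are using the wrong ampersand operator. && is meant for single TRUE/FALSE values: it looked only at the first element of each side here (and raises an error for longer inputs from R 4.3.0 on), whereas & compares element by element. Try this:
sum(c['a']==2 & c['b']==2)
[1] 2
If you might have NAs in column a or b, the sum above can return NA, so you might instead try:
length(intersect(which(c['a']==2), which(c['b']==2)))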

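You can also subset the data.frame and count the rows that remain. Refer to the columns through the data frame so that the code does not depend on the global vectors a and b:
nrow(c[c$a==2 & c$b==2, ])
# [1] 2
P.S.: It is advisable not to use c as a variable name, as it is also a base R function.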
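
Putting the suggestions together, a minimal sketch with the data frame renamed to dat (an assumed name, chosen to avoid masking c()). It also shows sum(..., na.rm = TRUE) as one possible way to handle NAs, which is an alternative to the intersect()/which() form above rather than something from the original answers.

```r
dat <- data.frame(a = c(2, 1, 2, 2, 3, 4, 2, 1, 9),
                  b = c(2, 1, 2, 1, 4, 4, 5, 6, 7))

# Elementwise AND, then count the TRUEs
n_both <- sum(dat$a == 2 & dat$b == 2)

# With a missing value, the comparison yields NA for that row;
# na.rm = TRUE drops it from the count
dat_na <- dat
dat_na$b[1] <- NA
n_both_na <- sum(dat_na$a == 2 & dat_na$b == 2, na.rm = TRUE)
```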

Find duplicated values from list of vectors in R

I have this example data
a<-c(1,5,7,8,10,15)
b<-c(2,6,7,9,10,20,31)
I need to find the duplicated values (the values which are in both vectors) and create a new vector that includes these numbers. It should look like
c<-c(7,10)
Because the vectors have different lengths, I tried putting them into a list of vectors
l<-list(a=a,b=b)
and tried
duplicated(l)
or
duplicated(a,b)
but it gives nonsense output. I'm looking for the correct solution but still cannot find it. Any advice?

Looks like a job for intersect()
a<-c(1,5,7,8,10,15)
b<-c(2,6,7,9,10,20,31)
c<-intersect(a,b)
c
[1] 7 10

c(a, b)[duplicated(c(a, b))]
produces:
[1] 7 10
duplicated applied to a vector returns a logical vector of the same length, with TRUE for every value that has already appeared earlier in the vector. You can use that to subset the original vector.
Note that if you don't care if values are duplicated within a single vector, then you should do:
a.b <- c(unique(a), unique(b))
a.b[duplicated(a.b)]

Keeping within the scope of the original question,
you could use match():
> a[!is.na(match(a, b))]
# [1] 7 10
Or more simply, %in%:
> a[a %in% b]
# [1] 7 10
Note that the logical index has one entry per element of a, so it has to subset a, not b.
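
A short sketch of how the %in% form and intersect() differ when a value is repeated within one vector. The vector a2 is a made-up example, not from the question.

```r
a <- c(1, 5, 7, 8, 10, 15)
b <- c(2, 6, 7, 9, 10, 20, 31)

common_in  <- a[a %in% b]    # elements of a that also occur in b
common_set <- intersect(a, b) # set intersection, duplicates removed

# Hypothetical vector with a repeated value
a2 <- c(7, 7, 10)
keeps_repeats <- a2[a2 %in% b]    # keeps both 7s
drops_repeats <- intersect(a2, b) # each common value once
```

Which of the two is preferable depends on whether repeats within a single vector should survive.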

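Solved by creating a function like
library(parallel) # provides mclapply
duplicated_values<-function(x){
  if(x%in%b){
    return(x)
  }
}
# apply to the values of a (not their positions); unlist() drops the NULLs for non-matches
values<-unlist(mclapply(a,duplicated_values))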

Colwise eats column names within ddply

I'm trying to chunk through a data frame, find instances where the sub-data frames are unbalanced, and add 0 values for levels of a factor that are missing. To do this, within ddply, I compare each piece against a set vector of the levels that should be there, create some new rows by replicating the first row of the sub-data set and modifying the values, and then rbind them to the old data set.
I use colwise to do the replication.
This works great outside of ddply. Inside of ddply, the identifying columns get eaten, and rbind borks on me. It's curious behavior. See the following code, with some debugging print statements thrown in to show the difference in results:
library(plyr) # provides ddply and colwise
#a test data frame
g <- data.frame(a=letters[1:5], b=1:5)
#repeat rows using colwise
rep.row <- function(r, n){
colwise(function(x) rep(x, n))(r)
}
#if I want to do this with just one row, I get all of the columns
rep.row(g[1,],5)
This is fine. It prints
  a b
1 a 1
2 a 1
3 a 1
4 a 1
5 a 1
#but, as soon as I use ddply to create some new data
#and try and smoosh it to the old data, I get errors
ddply(g, .(a), function(x) {
newrows <- rep.row(x[1,],5)
newrows$b<-0
rbind(x, newrows)
})
This gives
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
You can see the problem with this debugged version
#So, what is going on here?
ddply(g, .(a), function(x) {
newrows <- rep.row(x[1,],5)
newrows$b<-0
print(x)
print("\n\n")
print(newrows)
rbind(x, newrows)
})
You can see that x and newrows have different columns - they differ in a.
  a b
1 a 1
[1] "\n\n"
  b
1 0
2 0
3 0
4 0
5 0
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
What is going on here? Why do the identifying columns get eaten when I use colwise on a sub-data frame?

It's a funny interaction between ddply and colwise, it seems. More specifically, the problem occurs when colwise calls strip_splits and finds a vars attribute that was given by ddply.
As a workaround, try putting this first line in your function,
attr(x, "vars") <- NULL
# your code follows
newrows <- rep.row(x[1,],5)
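
If the plyr interaction is not essential, the same padding can be sketched in base R, which sidesteps colwise entirely. This is a swapped-in technique, not the original approach: rep_row, pieces and result are hypothetical names, and split() plus lapply() stand in for ddply.

```r
g <- data.frame(a = letters[1:5], b = 1:5)

# Repeat a one-row data frame n times by indexing the row repeatedly
rep_row <- function(r, n) {
  out <- r[rep(1, n), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Process each group of g separately, then stack the pieces
pieces <- lapply(split(g, g$a), function(x) {
  newrows <- rep_row(x[1, ], 5)
  newrows$b <- 0
  rbind(x, newrows)
})
result <- do.call(rbind, pieces)
rownames(result) <- NULL
```

Because row indexing keeps all columns, the grouping column a stays in newrows and rbind() sees matching columns.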
