Which() not matching given row to row of dataframe - r

I want to match a one-row data.frame against another data.frame. The values in the one-row data.frame are definitely present in the other data frame. I wanted to use which() to get the index of the row at which it matches, but it is not working (see the code below).
medoids:
x y
4 53
a:
x   y
13  69
97  122
4   53
33  154
idx= which(medoids==a, arr.ind=TRUE)
Error in Ops.data.frame(medoids, a) :
‘==’ only defined for equally-sized data frames
But I expect: idx = 3

You could use interaction() inside which() to join the two columns and allow a comparison.
medoids <- read.table(header = TRUE, text = "x y
4 53")
a <- read.table(header = TRUE, text = "x y
13 69
97 122
4 53
33 154")
idx <- which(interaction(medoids)==interaction(a))

Since your data frames are not the same size, you need to use mapply in order to map the columns one-by-one and compare, i.e.
mapply(function(x, y)which(x == y), medoids, a)
#x y
#3 3
NOTE: You do not need arr.ind, since you are comparing one-dimensional vectors (the individual columns).
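If you prefer to compare whole rows at once, here is a minimal base-R sketch (an alternative idiom, not taken from the answers above): collapse each row to a single string with do.call(paste, ...) and compare those.

```r
# Rebuild the example data from the question.
medoids <- data.frame(x = 4, y = 53)
a <- data.frame(x = c(13, 97, 4, 33), y = c(69, 122, 53, 154))

# Collapse each row to one string ("13 69", "97 122", ...); the
# one-row frame recycles against the four-row frame.
idx <- which(do.call(paste, a) == do.call(paste, medoids))
idx
# [1] 3
```

Note this matches on the printed representation of the values, so it is only a sketch; it can misfire if numeric formatting differs between the two frames.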

Related

Apply function to specific columns of data frames in a list

I am working with a large list in R. Each element has 37 columns and more than 100 rows. I am trying to create a new list with the same dimensions but for specific columns I want to calculate a new value using the equation x*0.6/0.02587.
Here is a simplified example of what I'm trying to do
my_list
$data1
X Y Z
1 2 3
2 4 2
3 5 7
new_list
$data1
X Y Z
1 60 90
2 120 60
3 150 210
I tried:
out<-lapply(data, FUN = function(x)(x*0.6)/0.02)
But this code applied the function to all columns of all elements in the list.
fun <- function(x) {x = x*0.6/0.02587}
out <- lapply(data, function(x) {x <- fun(x[c(2:37)]); x})
The second attempt did work too, but each element of my new list has only 36 columns; it dropped the first column (the column I don't want to change).
Any help how to do that is very much appreciated! Thank you!
You need to assign the transformed data back to the same columns. Try:
fun=function(x) {x*0.6/0.02587}
out <- lapply(data , function(x) {x[2:37] <- fun(x[2:37]);x})
out
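A toy version of that fix (three columns instead of 37; the 0.02 divisor reproduces the question's example numbers) shows why assigning back keeps the first column:

```r
fun <- function(x) x * 0.6 / 0.02
data <- list(data1 = data.frame(X = 1:3, Y = c(2, 4, 5), Z = c(3, 2, 7)))

# Transform columns 2:3 and write them back; column 1 (X) is untouched.
out <- lapply(data, function(x) {x[2:3] <- fun(x[2:3]); x})
out$data1
#   X   Y   Z
# 1 1  60  90
# 2 2 120  60
# 3 3 150 210
```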

Subset a data frame for specific row names in R [duplicate]

I have two data sets that are supposed to be the same size but aren't. I need to trim the values from A that are not in B and vice versa in order to eliminate noise from a graph that's going into a report. (Don't worry, this data isn't being permanently deleted!)
I have read the following:
Selecting columns in R data frame based on those *not* in a vector
http://www.ats.ucla.edu/stat/r/faq/subset_R.htm
How to combine multiple conditions to subset a data-frame using "OR"?
But I'm still not able to get this to work right. Here's my code:
bg2011missingFromBeg <- setdiff(x=eg2011$ID, y=bg2011$ID)
#attempt 1
eg2011cleaned <- subset(eg2011, ID != bg2011missingFromBeg)
#attempt 2
eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg]
The first try just eliminates the first value in the resulting setdiff vector. The second try yields an unwieldy error:
Error in `[.data.frame`(eg2012, !eg2012$ID %in% bg2012missingFromBeg)
: undefined columns selected
This will give you what you want:
eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg, ]
The error in your second attempt is because you forgot the comma.
In general, for convenience, the specification object[index] subsets columns of a 2-d object. If you want to subset rows and keep all columns, you have to use the specification
object[index_rows, index_columns], where index_columns can be left blank to use all columns by default.
However, you still need to include the , to indicate that you want a subset of rows instead of a subset of columns.
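A small demonstration of the difference, on a toy data frame (names made up for illustration):

```r
df <- data.frame(a = 1:3, b = 4:6)

df[1]     # no comma: the first COLUMN (still a data frame)
#   a
# 1 1
# 2 2
# 3 3

df[1, ]   # with comma: the first ROW, all columns
#   a b
# 1 1 4

df[c(TRUE, FALSE, TRUE), ]  # logical row index, as in the answer
#   a b
# 1 1 4
# 3 3 6
```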
If you really just want to subset each data frame by an index that exists in both data frames, you can do this with the 'match' function, like so:
data_A[match(data_B$index, data_A$index, nomatch=0),]
data_B[match(data_A$index, data_B$index, nomatch=0),]
This is, though, the same as:
data_A[data_A$index %in% data_B$index,]
data_B[data_B$index %in% data_A$index,]
Here is a demo:
# Set seed for reproducibility.
set.seed(1)
# Create two sample data sets.
data_A <- data.frame(index=sample(1:200, 90, rep=FALSE), value=runif(90))
data_B <- data.frame(index=sample(1:200, 120, rep=FALSE), value=runif(120))
# Subset data of each data frame by the index in the other.
t_A <- data_A[match(data_B$index, data_A$index, nomatch=0),]
t_B <- data_B[match(data_A$index, data_B$index, nomatch=0),]
# Make sure they match.
data.frame(t_A[order(t_A$index),], t_B[order(t_B$index),])[1:20,]
# index value index.1 value.1
# 27 3 0.7155661 3 0.65887761
# 10 12 0.6049333 12 0.14362694
# 88 14 0.7410786 14 0.42021589
# 56 15 0.4525708 15 0.78101754
# 38 18 0.2075451 18 0.70277874
# 24 23 0.4314737 23 0.78218212
# 34 32 0.1734423 32 0.85508236
# 22 38 0.7317925 38 0.56426384
# 84 39 0.3913593 39 0.09485786
# 5 40 0.7789147 40 0.31248966
# 74 43 0.7799849 43 0.10910096
# 71 45 0.2847905 45 0.26787813
# 57 46 0.1751268 46 0.17719454
# 25 48 0.1482116 48 0.99607737
# 81 53 0.6304141 53 0.26721208
# 60 58 0.8645449 58 0.96920881
# 30 59 0.6401010 59 0.67371223
# 75 61 0.8806190 61 0.69882454
# 63 64 0.3287773 64 0.36918946
# 19 70 0.9240745 70 0.11350771
A really human-comprehensible example (as this is the first time I am using %in%) of how to compare two data frames and keep only the rows with equal values in a specific column:
# Set seed for reproducibility.
set.seed(1)
# Create two sample data frames.
data_A <- data.frame(id=c(1,2,3), value=c(1,2,3))
data_B <- data.frame(id=c(1,2,3,4), value=c(5,6,7,8))
# compare data frames by specific columns and keep only
# the rows with equal values
data_A[data_A$id %in% data_B$id,] # will keep data in data_A
data_B[data_B$id %in% data_A$id,] # will keep data in data_B
Results:
> data_A[data_A$id %in% data_B$id,]
id value
1 1 1
2 2 2
3 3 3
> data_B[data_B$id %in% data_A$id,]
id value
1 1 5
2 2 6
3 3 7
Per the comments to the original post, merges / joins are well-suited for this problem. In particular, an inner join will return only values that are present in both data frames, making the setdiff statement unnecessary.
Using the data from Dinre's example:
In base R:
cleanedA <- merge(data_A, data_B[, "index"], by = 1, sort = FALSE)
cleanedB <- merge(data_B, data_A[, "index"], by = 1, sort = FALSE)
Using the dplyr package:
library(dplyr)
cleanedA <- inner_join(data_A, data_B %>% select(index))
cleanedB <- inner_join(data_B, data_A %>% select(index))
To keep the data as two separate tables, each containing only its own variables, this subsets the unwanted table to only its index variable before joining. Then no new variables are added to the resulting table.

In R, reorganize list based on element names (rbind and indicator variable)

I am trying to reorganize my data, basically a list of data.frames.
Its elements represent subjects of interest (A and B), with observations on x and y, collected on two occasions (1 and 2).
I am trying to make this a list that contains data.frames referring to the subjects, with the information on which occasion x and y were collected stored in the respective data.frames as a new variable, as opposed to in the element name:
library('rlist')
A1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
A2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
list <- list(A1=A1,A2=A2,B1=B1,B2=B2)
A <- do.call(rbind,list.match(list,"A"))
B <- do.call(rbind,list.match(list,"B"))
list <- list(A=A,B=B)
list <- lapply(list,function(x) {
y <- data.frame(x)
y$class <- c(rep.int(1,2),rep.int(2,2))
return(y)
})
> list
$A
x y class
A1.1 66 96 1
A1.2 76 58 1
A2.1 50 93 2
A2.2 57 12 2
$B
x y class
B1.1 58 56 1
B1.2 69 15 1
B2.1 77 77 2
B2.2 9 9 2
In my real world problem there are about 500 subjects, not always two occasions, differing numbers of observations.
So my example above just illustrates where I want to get. I am stuck on how to tell the do.call-rbind construct that it should, based on element names, bind subject-specific elements together as new list elements while assigning a new variable.
To me, this is a somewhat fuzzy task, and the closest I got was the rlist package. This question is related but uses unique to identify elements, whereas in my case it seems to be more a regex problem.
I'd be happy even for instructions on how to use google, any keywords for further research etc.
From the data you provided:
subj <- sub("[A-Z]*", "", names(lst))
newlst <- Map(function(x, y) {x[,"class"] <- y;x}, lst, subj)
First we do the regular expression call to isolate the number that will go in the class column. In this case, I matched on capital letters and erased them leaving the number. Therefore, "A1" becomes "1". Please note that the real names will mean a different regex pattern.
Then we use Map to create a new column for each data frame and save to a new list called newlst. Map takes the first element of each argument and carries out the function then continues on with each object element. So the first data frame in lst and the first number in subj are used first. The anonymous function I used is function(x,y) {x[, "class"] <- y; x}. It takes two arguments. The first is the data frame, the second is the column value.
Now it's much easier to move forward. We can create a vector called uniq.nmes to get the names of the data frames that we will combine. Where "A1" will become "A". Then we can rbind on that match:
uniq.nmes <- unique(sub("\\d", "", names(lst)))
lapply(uniq.nmes, function(x) {
do.call(rbind, newlst[grep(x, names(newlst))])
})
# [[1]]
# x y class
# A1.1 1 79 1
# A1.2 30 13 1
# A2.1 90 39 2
# A2.2 43 22 2
#
# [[2]]
# x y class
# B1.1 54 59 1
# B1.2 83 90 1
# B2.1 85 36 2
# B2.2 91 28 2
Data
A1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
A2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
lst <- list(A1=A1,A2=A2,B1=B1,B2=B2)
It sounds like you're doing a lot of gymnastics because you have a specific form in mind. What I would suggest is first trying to make the data tidy: the quick summary is to put your data into a single data frame, where it can be easily processed.
The quick version of the answer (here I've used lst instead of list for the name to avoid confusion with the built-in list) is to do this:
do.call(rbind,
lapply(seq(lst), function(i) {
lst[[i]]$type <- names(lst)[i]; lst[[i]]
})
)
What this will do is create a single data frame, with a column, "type", that contains the name of the list item in which that row appeared.
Using a slightly simplified version of your initial data:
lst <- list(A1=data.frame(x=rnorm(5)), A2=data.frame(x=rnorm(3)), B=data.frame(x=rnorm(5)))
lst
$A1
x
1 1.3386071
2 1.9875317
3 0.4942179
4 -0.1803087
5 0.3094100
$A2
x
1 -0.3388195
2 1.1993115
3 1.9524970
$B
x
1 -0.1317882
2 -0.3383545
3 0.8864144
4 0.9241305
5 -0.8481927
And then applying the magic function
df <- do.call(rbind,
lapply(seq(lst), function(i) {
lst[[i]]$type <- names(lst)[i]; lst[[i]]
})
)
df
x type
1 1.3386071 A1
2 1.9875317 A1
3 0.4942179 A1
4 -0.1803087 A1
5 0.3094100 A1
6 -0.3388195 A2
7 1.1993115 A2
8 1.9524970 A2
9 -0.1317882 B
10 -0.3383545 B
11 0.8864144 B
12 0.9241305 B
13 -0.8481927 B
From here we can process to our heart's content, with operations like df$subject <- gsub("[0-9]*", "", df$type) to extract the non-numeric portion of type, and tools like split to generate the sub-lists you mention in your question.
In addition, once it is in this form, you can use functions like by and aggregate or libraries like dplyr or data.table to do more advanced split-apply-combine operations for data analysis.
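For instance, a self-contained sketch of those two steps (toy values standing in for the random draws above):

```r
df <- data.frame(
  x    = c(1.3, 2.0, -0.3, 1.2, -0.1, 0.9),
  type = c("A1", "A1", "A2", "A2", "B", "B")
)

# Extract the subject letter, then split back into per-subject frames.
df$subject <- gsub("[0-9]*", "", df$type)   # "A1" -> "A", "B" -> "B"
by_subject <- split(df, df$subject)
names(by_subject)
# [1] "A" "B"
nrow(by_subject$A)
# [1] 4
```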

function (x,y), both x and y to vary

I have a data frame consisting of about 22 fields, some system ids and some measurements, such as
bsystemid dcesystemid lengthdecimal heightquantity
2218 58 22 263
2219 58 22 197
2220 58 22 241
What I want:
1. Loop through a list of field ids.
2. Define a function to test for a condition.
3. Allow both x and y to vary.
Where does the y variable definition belong, so that both x and y can vary? Or is a different structure needed?
This code block works for a single field and value of y:
varlist4<-names(brg) [c(6)]
f1<-(function(x,y) count(brg[,x]<y) )
lapply(varlist4, f1, y=c(7.5))
This code block executes, but the counts are off:
varlist4<-names(brg) [c(6,8,10,12)]
f1<-(function(x,y) count(brg[,x]<y) )
lapply(varlist4, f1, y=c(7.5,130,150,0))
For example,
varlist4<-names(brg) [c(6)]
f1<-(function(x,y) count(brg[,x]<y) )
lapply(varlist4, f1, y=c(7.5))
returns (correctly),
x freq
1 FALSE 9490
2 TRUE 309
3 NA 41
whereas the multiple x,y block of code above returns this for the first case,
x freq
1 FALSE 4828
2 TRUE 4971
3 NA 41
Thanks for any comments.
Update:
What I would like is to automate counting the occurrences of values in specified fields of a df that meet some condition. The conditions are numeric constants or text strings, one for each field. For example, I might want to count occurrences meeting the condition >360 in field1, >0 in field2, etc. What I thus mean by allowing x and y to vary is reading x and y vectors with the field names and corresponding conditions into a looping structure.
I'd like to automate this task because it involves around 30 tables, each with up to 50 or so fields. And I'll need to do it twice, scanning once for values exceeding a maximum and once for values less than a minimum. Better still might be loading the conditions into a table and referencing that in the loop. That may be the next step but I'd like to understand this piece first.
This working example
t1<-18:29
t2<-c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
t3<-c(1.2,-0.2,-0.3,1.2, 2.2,0.4,0.6,0.4,-0.8,-0.1,5.0,3.1)
t<-data.frame(v1=t1,v2=t2,v3=t3)
varlist<-names(t) [c(1)]
f1<-(function(x,y) count(t[,x]>y) )
lapply(varlist, f1, y=c(27))
illustrates the correct answer for the first field, returning
x freq
1 FALSE 10
2 TRUE 2
But if I add in other fields and the corresponding conditions (the y's) I get something different for the first case:
varlist<-names(t) [c(1,2,3)]
f1<-(function(x,y) count(t[,x]>y) )
lapply(varlist, f1, y=c(27,83,3))
x freq
1 FALSE 8
2 TRUE 4
[[2]]
x freq
1 FALSE 1
2 TRUE 11
[[3]]
x freq
1 FALSE 11
2 TRUE 1
My sense is I'm not going about structuring the y part correctly.
Thanks for any comments.
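The condition-table idea from the update can be sketched like this (field names and thresholds are made up for illustration; the answer below uses the same mapply mechanism):

```r
# Hypothetical conditions table: one row per field to scan.
conds <- data.frame(field = c("v1", "v2"),
                    max   = c(27, 83),
                    stringsAsFactors = FALSE)

# Toy data with those fields (from the working example above).
t <- data.frame(v1 = 18:29,
                v2 = c(76.1, 77, 78.1, 78.2, 78.8, 79.7,
                       79.9, 81.1, 81.2, 81.8, 82.8, 83.5))

# Count values exceeding each field's maximum, driven by the table.
counts <- mapply(function(f, m) sum(t[[f]] > m, na.rm = TRUE),
                 conds$field, conds$max)
counts
# v1 v2
#  2  1
```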
You can use mapply. Let's create some data:
set.seed(123) # to get exactly the same results
brg = data.frame(x = rnorm(100), y=rnorm(100), z=rnorm(100))
brg$x[c(10, 15)] = NA # some NAs
brg$y[c(12, 21)] = NA # more NAs
Then you need to define the function to do the job. The function .f1 counts the data and ensures there are always three levels (TRUE, FALSE, NA). Then, f1 uses .f1 in an mapply context to be able to vary x and y. Finally, there are some improvements to the output (changing the names of the columns).
f1 = function(x, y, data) {
.f1 = function(x, y, data) {
out = factor(data[, x] < y,
levels=c("TRUE", "FALSE", NA), exclude=NULL)
return(table(out))
}
out = mapply(.f1, x, y, MoreArgs = list(data = data)) # check ?mapply
colnames(out) = paste0(x, "<", y) # more clear names for the output
return(out)
}
Finally, the test:
varlist = names(brg)
threshold = c(0, 1, 1000)
f1(x=varlist, y=threshold, data=brg)
And you should get
x<0 y<1 z<1000
TRUE 46 87 100
FALSE 52 11 0
<NA> 2 2 0

Remove the rows of data frame whose cells match a given vector

I have a big data frame with various numbers of columns and rows. I would like to search the data frame for the values of a given vector and remove the rows of the cells that match the values of this given vector. I'd like to have this as a function because I have to run it on multiple data frames of variable rows and columns, and I would like to avoid for loops.
for example
ff<-structure(list(j.1 = 1:13, j.2 = 2:14, j.3 = 3:15), .Names = c("j.1","j.2", "j.3"), row.names = c(NA, -13L), class = "data.frame")
remove all rows that have cells that contain the values 8,9,10
I guess I could use ff[ !ff[,1] %in% c(8, 9, 10), ] or subset(ff, !ff[,1] %in% c(8,9,10) ),
but in order to remove all the values from the dataset I have to parse each column (probably with a for loop, something I wish to avoid).
Is there any other (cleaner) way?
Thanks a lot
Apply your test to each row:
keeps <- apply(ff, 1, function(x) !any(x %in% 8:10))
which gives a boolean vector. Then subset with it:
ff[keeps,]
j.1 j.2 j.3
1 1 2 3
2 2 3 4
3 3 4 5
4 4 5 6
5 5 6 7
11 11 12 13
12 12 13 14
13 13 14 15
I suppose the apply strategy may turn out to be the most economical but one could also do either of these:
ff[ !rowSums( sapply( ff, function(x) x %in% 8:10) ) , ]
ff[ !Reduce("+", lapply( ff, function(x) x %in% 8:10) ) , ]
Vector addition of logical vectors (equivalent to any), followed by negation. I suspect the first one would be faster.
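A quick sanity check (using the ff from the question) that both variants agree with the apply approach:

```r
ff <- data.frame(j.1 = 1:13, j.2 = 2:14, j.3 = 3:15)

keep_apply  <- apply(ff, 1, function(x) !any(x %in% 8:10))
keep_rowsum <- !rowSums(sapply(ff, function(x) x %in% 8:10))
keep_reduce <- !Reduce("+", lapply(ff, function(x) x %in% 8:10))

identical(ff[keep_apply, ], ff[keep_rowsum, ])
# [1] TRUE
identical(ff[keep_apply, ], ff[keep_reduce, ])
# [1] TRUE
```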