I have two data sets that are supposed to be the same size but aren't. I need to trim the values from A that are not in B and vice versa in order to eliminate noise from a graph that's going into a report. (Don't worry, this data isn't being permanently deleted!)
I have read the following:
Selecting columns in R data frame based on those *not* in a vector
http://www.ats.ucla.edu/stat/r/faq/subset_R.htm
How to combine multiple conditions to subset a data-frame using "OR"?
But I'm still not able to get this to work right. Here's my code:
bg2011missingFromBeg <- setdiff(x=eg2011$ID, y=bg2011$ID)
#attempt 1
eg2011cleaned <- subset(eg2011, ID != bg2011missingFromBeg)
#attempt 2
eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg]
The first try just eliminates the first value in the resulting setdiff vector. The second try yields an unwieldy error:
Error in `[.data.frame`(eg2011, !eg2011$ID %in% bg2011missingFromBeg)
: undefined columns selected
This will give you what you want:
eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg, ]
The error in your second attempt is because you left out the ,
In general, for a 2-dimensional object, the shorthand object[index] subsets columns. If you want to subset rows and keep all columns, you have to use the two-index form
object[index_rows, index_columns], where index_columns can be left blank to use all columns by default.
Even then, you still need to include the , to indicate that you want a subset of rows rather than a subset of columns.
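For example (a minimal sketch with a throwaway data frame, just to illustrate the two forms):
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
df[1]     # one index, no comma: selects the first COLUMN
df[1, ]   # index before the comma: selects the first ROW, all columns
df[, 1]   # index after the comma: all rows, first column (simplified to a vector)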
If you really just want to subset each data frame by an index that exists in both data frames, you can do this with the 'match' function, like so:
data_A[match(data_B$index, data_A$index, nomatch=0),]
data_B[match(data_A$index, data_B$index, nomatch=0),]
This is, though, the same as the following, up to row order (match returns rows in the order of the other data frame's index, while %in% keeps each data frame's original row order):
data_A[data_A$index %in% data_B$index,]
data_B[data_B$index %in% data_A$index,]
Here is a demo:
# Set seed for reproducibility.
set.seed(1)
# Create two sample data sets.
data_A <- data.frame(index=sample(1:200, 90, rep=FALSE), value=runif(90))
data_B <- data.frame(index=sample(1:200, 120, rep=FALSE), value=runif(120))
# Subset data of each data frame by the index in the other.
t_A <- data_A[match(data_B$index, data_A$index, nomatch=0),]
t_B <- data_B[match(data_A$index, data_B$index, nomatch=0),]
# Make sure they match.
data.frame(t_A[order(t_A$index),], t_B[order(t_B$index),])[1:20,]
# index value index.1 value.1
# 27 3 0.7155661 3 0.65887761
# 10 12 0.6049333 12 0.14362694
# 88 14 0.7410786 14 0.42021589
# 56 15 0.4525708 15 0.78101754
# 38 18 0.2075451 18 0.70277874
# 24 23 0.4314737 23 0.78218212
# 34 32 0.1734423 32 0.85508236
# 22 38 0.7317925 38 0.56426384
# 84 39 0.3913593 39 0.09485786
# 5 40 0.7789147 40 0.31248966
# 74 43 0.7799849 43 0.10910096
# 71 45 0.2847905 45 0.26787813
# 57 46 0.1751268 46 0.17719454
# 25 48 0.1482116 48 0.99607737
# 81 53 0.6304141 53 0.26721208
# 60 58 0.8645449 58 0.96920881
# 30 59 0.6401010 59 0.67371223
# 75 61 0.8806190 61 0.69882454
# 63 64 0.3287773 64 0.36918946
# 19 70 0.9240745 70 0.11350771
A really human-comprehensible example (this was the first time I used %in%) of how to compare two data frames and keep only the rows whose values match in a specific column:
# Set seed for reproducibility.
set.seed(1)
# Create two sample data frames.
data_A <- data.frame(id=c(1,2,3), value=c(1,2,3))
data_B <- data.frame(id=c(1,2,3,4), value=c(5,6,7,8))
# compare data frames by specific columns and keep only
# the rows with equal values
data_A[data_A$id %in% data_B$id,] # keeps the matching rows of data_A
data_B[data_B$id %in% data_A$id,] # keeps the matching rows of data_B
Results:
> data_A[data_A$id %in% data_B$id,]
id value
1 1 1
2 2 2
3 3 3
> data_B[data_B$id %in% data_A$id,]
id value
1 1 5
2 2 6
3 3 7
Per the comments to the original post, merges / joins are well-suited for this problem. In particular, an inner join returns only the rows present in both data frames, making the setdiff step unnecessary.
Using the data from Dinre's example:
In base R:
cleanedA <- merge(data_A, data_B[, "index"], by = 1, sort = FALSE)
cleanedB <- merge(data_B, data_A[, "index"], by = 1, sort = FALSE)
Using the dplyr package:
library(dplyr)
cleanedA <- inner_join(data_A, data_B %>% select(index))
cleanedB <- inner_join(data_B, data_A %>% select(index))
Because each call subsets the other table down to just its index variable before joining, no new variables are added to the result: the data stay as two separate tables, each containing only its own variables.
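As a quick sanity check (a sketch reusing data_A and data_B from the demo above), both cleaned tables should end up with exactly the intersection of the two index sets:
# both joins keep one row per shared index, so the counts and values agree
nrow(cleanedA) == nrow(cleanedB)            # TRUE
setequal(cleanedA$index, cleanedB$index)    # TRUE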
I want to match a one-row data.frame against another data.frame. The values in the one-row data.frame are definitely present in the other data frame. I wanted to use which() to get the index of the row at which it matches, but it is not working (see the code below).
medoids:
x y
4 53
a:
x y
13 69
97 122
4 53
33 154
idx= which(medoids==a, arr.ind=TRUE)
Error in Ops.data.frame(medoids, a) :
‘==’ only defined for equally-sized data frames
But I expect: idx = 3
You could use interaction inside which to collapse the two columns into one value per row and allow a comparison. Note the conversion to character below: comparing two factors with different level sets raises an error, and the interactions of medoids and a have different levels.
medoids <- read.table(header = TRUE, text = "x y
4 53")
a <- read.table(header = TRUE, text = "x y
13 69
97 122
4 53
33 154")
idx <- which(as.character(interaction(medoids)) == as.character(interaction(a)))
Alternatively, since your data frames are not the same size, you can use mapply to map over the columns one by one and compare them, i.e.
mapply(function(x, y)which(x == y), medoids, a)
#x y
#3 3
NOTE: You do not need arr.ind, since you are comparing 1-dimensional vectors (individual columns).
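If all you need is the row position, another option (a sketch along the same lines) is to collapse each row into a single string with paste, which sidesteps the factor-level issue of interaction entirely:
# do.call(paste, df) pastes the columns together row-wise, e.g. "4 53"
idx <- which(do.call(paste, a) == do.call(paste, medoids))
idx
# [1] 3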
I'm working in R: I scale the original data, remove all outliers with a Z-score of 3 or more, and then filter the unscaled data so that it contains only the non-outliers. I want to be left with a data frame that contains unscaled numbers after removing the outliers. These were my steps:
Steps
1. Create two data frames (x, y) of the same data
2. Scale x and leave y unscaled.
3. Filter out all rows that have greater than 3 Z-Score in x
4. Currently, for example, x may have 95,000 rows while y still has 100,000
5. Truncate y based on a unique column called Row ID, which I made sure was unscaled in x. This unique column will help me match up the remaining rows in x and the rows in y.
6. y should now have the same number of rows as x, but with the data unscaled. x has the scaled data.
At the moment I can't get the data to be unscaled. I tried using the unscale method and data frame comparison tools, but R complains that I cannot operate on data frames of two different sizes. Is there a workaround?
Attempts
I've tried dataFrame <- dataFrame[dataFrame$Row %in% remainingRows] but that left nothing in my data frame.
I would also provide data, but it has sensitive information, so any data frame will do so long as it has a unique row ID that won't change during scaling.
If I understood correctly what you want to do, I suggest a different approach. You could use two data.frames for that, but with the dplyr package you can do everything within a single line of code ... and presumably faster as well.
First I generate a data.frame with 100k rows, which has an ID column (just the sequence 1:100000) and a value column (random numbers).
Here's the code:
library(dplyr)
#generate data
x <- data.frame(ID=1:100000,value=runif(100000,max=100)*runif(10000,max=100))
#take a look
> head(x)
ID value
1 1 853.67941
2 2 632.17472
3 3 3089.60716
4 4 8448.89408
5 5 5307.75684
6 6 19.07485
To filter out the outliers, I'm using a dplyr pipe, chaining multiple operations together with the pipe (%>%) operator: first calculate the zscore, then keep only the observations with a zscore below three, and finally drop the zscore column again to return to your original format (of course you can also keep it):
xclean <- x %>% mutate(zscore=(value-mean(value)) / sd(value)) %>%
filter(zscore < 3) %>% select(-matches('zscore'))
If you look at the row counts, you'll see that the filtering worked:
> cat('Rows of X:',nrow(x),'- Rows of xclean:',nrow(xclean))
Rows of X: 100000 - Rows of xclean: 99575
while the data looks like the original data.frame:
> head(xclean)
ID value
1 1 853.67941
2 2 632.17472
3 3 3089.60716
4 4 8448.89408
5 5 5307.75684
6 6 19.07485
Finally, you can see that observations have been filtered out by comparing the IDs of the two data.frames:
> head(x$ID[!is.element(x$ID,xclean$ID)],50)
[1] 68 90 327 467 750 957 1090 1584 1978 2106 2306 3415 3511 3801 3855 4051
[17] 4148 4244 4266 4511 4875 5262 5633 5944 5975 6116 6263 6631 6734 6773 7320 7577
[33] 7619 7731 7735 7889 8073 8141 8207 8966 9200 9369 9994 10123 10538 11046 11090 11183
[49] 11348 11371
EDIT:
Of course, the two-data-frame version is also possible:
y <- x
# calculate zscore
x$value <- (x$value - mean(x$value))/sd(x$value)
#subset y
y <- y[x$value<3,]
# initially 100k rows
> nrow(y)
[1] 99623
EDIT 2:
Accounting for multiple value columns:
#generate data
set.seed(21)
x <- data.frame(ID=1:100000,value1=runif(100000,max=100)*runif(10000,max=100),
value2=runif(100000,max=100)*runif(10000,max=100),
value3=runif(100000,max=100)*runif(10000,max=100))
> head(x)
ID value1 value2 value3
1 1 2103.9228 5861.33650 713.885222
2 2 341.8342 3940.68674 578.072141
3 3 5346.2175 458.07089 1.577347
4 4 400.1950 5881.05129 3090.618355
5 5 7346.3321 4890.56501 8989.248186
6 6 5305.5105 38.93093 517.509465
The dplyr solution:
# make sure you got a recent version of dplyr
> packageVersion('dplyr')
[1] ‘0.7.2’
# define zscore function:
zscore <- function(x){(x-mean(x))/sd(x)}
# select variables (could also be done manually with c())
vars_to_process <- grep('value',colnames(x),value=T)
# calculate zscores and filter
xclean <- x %>% mutate_at(.vars=vars_to_process, .funs=funs(ZS = zscore(.))) %>%
filter_at(vars(matches('ZS')),all_vars(.<3)) %>%
select(-matches('ZS'))
> nrow(xclean)
[1] 98832
Now the solution without dplyr (instead of using 2 data frames, I'll generate a boolean index based on x):
# select variables
vars_to_process <- grep('value',colnames(x),value=T)
# create index ZS < 3
ix <- apply(x[vars_to_process],2,function(x) (x-mean(x))/sd(x) < 3)
#filter rows
xclean <- x[rowSums(ix) == length(vars_to_process),]
> nrow(xclean)
[1] 98832
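As an aside, the attempt quoted in the question, dataFrame[dataFrame$Row %in% remainingRows], fails for the same missing-comma reason covered in the first question on this page: without the comma, the logical vector is applied to columns rather than rows. Assuming remainingRows holds the surviving Row IDs, this sketch should do what was intended:
dataFrame <- dataFrame[dataFrame$Row %in% remainingRows, ]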
I have a function, remove_fun, that removes rows from a data frame based on some conditions (the function is too verbose to include, so here's a simplified example).
Let's say I have a data frame called block_2, with two columns:
Treatment seq
1 29
1 23
3 60
1 6
2 41
1 5
2 44
For the sake of this example, let's say my function removes 1 row from block_2 at a time based on the highest value of seq in block_2$seq. This function works well when I run it once, i.e. remove_fun(block_2) would return the following output:
Treatment seq
1 29
1 23
1 6
2 41
1 5
2 44
However, what I can't figure out is how to repeatedly apply remove_fun until block_2 is reduced to a certain dimension.
My idea is to do something like this:
while (dim(block_2_df)[1] > 1) {  # dim(...)[1] is the number of rows of block_2_df
  remove_fun(block_2_df)
}
This would theoretically reduce block_2_df until only the observation corresponding to the lowest seq number remains.
However, this doesn't work. I think the problem is that I don't know how to use the 'updated' block_2_df iteratively. What I'd like to accomplish is code that does something like this:
new_df_1<-remove_fun(block_2)
new_df_2<-remove_fun(new_df_1)
new_df_3<-remove_fun(new_df_2)
etc...
I'm not necessarily looking for an exact solution to this problem (as I didn't provide remove_fun), but I'd appreciate some insight re: a general approach to the problem.
Edit: here's my actual code with some example data:
#Start from a block of 10*6 balls, with lambda*(wj) balls of each class
#Allocation ratios
class_1<-"a"
class_2<-"b"
class_3<-"c"
ratio_a<-3
ratio_b<-2
ratio_c<-1
#Min_set
min_set<-c(rep(class_1,ratio_a),rep(class_2,ratio_b),rep(class_3,ratio_c))
min_set_num<-ifelse(min_set=='a',1,ifelse(min_set=='b',2,3))
table_key <- table(min_set_num)
#Number of min_sets
lamb<-10
#Active urn
block_1<-matrix(0,lamb,length(min_set))
for (i in 1:lamb){
block_1[i,]<-min_set
}
#Turn classes into a vector
block_1<-as.vector(block_1)
block_1<-ifelse(block_1=='a',1,ifelse(block_1=='b',2,3))
#Turn into a df w/ identifying numbers:
block_1_df<-data.frame(block_1,seq(1:length(block_1)))
#Enumerate all sampling outcome permutations
library('dplyr')
#Create inactive urn
#Sample from block_1 until min_set is achieved, store in block_2#####
#Random sample :
block_2<-sample(block_1,length(block_1),replace=F)
block_2_df<-block_1_df[sample(nrow(block_1_df), length(block_1)), ]
colnames(block_2_df)<-c('Treatment','seq')
#Generally:####
remove_fun<-function(dat){
#For df
min_set_obs_mat<-matrix(0,length(block_1),2)
min_set_obs_df<-as.data.frame(min_set_obs_mat)
colnames(min_set_obs_df)<-c('Treatment','seq')
for (i in 1:length(block_1)){
if ((sum(min_set_obs_df[,1]==1)<3) || (sum(min_set_obs_df[,1]==2)<2) || (sum(min_set_obs_df[,1]==3)<1)){
min_set_obs_df[i,]<-dat[i,]
}
}
#Get rid of empty rows in df:
min_set_obs_df<-min_set_obs_df%>%filter(Treatment>0)
#Return the sampled 'balls' which satisfy the minimum set into block_2_df (randomized block_1!), ####
#keeping the 'extra' balls in a new df: extra_df:####
#Question: does the order of returning matter?####
#Identify min_set
outcome_df<-min_set_obs_df %>% group_by(Treatment) %>% do({
head(., coalesce(table_key[as.character(.$Treatment[1])], 0L))
})
#This removes extra observations 'chronologically'
#Identify extra balls
#Extra_df is the 'inactive' urn####
extra_df<-min_set_obs_df%>%filter(!(min_set_obs_df$seq%in%outcome_df$seq))
#Question: is the number of pts equal to the block size? (lambda*W)?######
#Return min_df back to block_2_df, remove extra_df from block_2_df:
dat<-dat%>%filter(!(seq%in%extra_df$seq))
return(dat)
}
Your while-loop doesn't redefine block_2_df. This should work:
while (dim(block_2_df)[1]>1) {
block_2_df <- remove_fun(block_2_df)
}
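Here's a self-contained sketch of that pattern, with a stand-in remove_fun that just drops the row with the largest seq (as in the simplified example), to show how the reassignment shrinks the data frame on every pass:
block_2_df <- data.frame(Treatment = c(1, 1, 3, 1, 2, 1, 2),
                         seq = c(29, 23, 60, 6, 41, 5, 44))
# stand-in for the real remove_fun: drop the row with the highest seq
remove_fun <- function(dat) dat[-which.max(dat$seq), ]
while (nrow(block_2_df) > 1) {
  block_2_df <- remove_fun(block_2_df)
}
block_2_df
#   Treatment seq
# 6         1   5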
If all you need is a way to subset the data frame...
df <- data.frame(Treatment = c(1, 1, 3, 1, 2, 1, 2),
seq = c(29, 23, 60, 6, 41, 5, 44))
df
Treatment seq
1 1 29
2 1 23
3 3 60
4 1 6
5 2 41
6 1 5
7 2 44
# Decide how many rows you want in output
n <- 6
# Find the top "n" values in the seq variable
head(sort(df$seq), n)
[1] 5 6 23 29 41 44
# Use them in the subset criteria
df[df$seq %in% head(sort(df$seq), n), ]
Treatment seq
1 1 29
2 1 23
4 1 6
5 2 41
6 1 5
7 2 44
I am trying to reorganize my data, basically a list of data.frames.
Its elements represent subjects of interest (A and B), with observations on x and y, collected on two occasions (1 and 2).
I am trying to turn this into a list of data.frames, one per subject, where the occasion on which x and y were collected is stored as a new variable in each data.frame rather than in the element name:
library('rlist')
A1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
A2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
list <- list(A1=A1,A2=A2,B1=B1,B2=B2)
A <- do.call(rbind,list.match(list,"A"))
B <- do.call(rbind,list.match(list,"B"))
list <- list(A=A,B=B)
list <- lapply(list,function(x) {
y <- data.frame(x)
y$class <- c(rep.int(1,2),rep.int(2,2))
return(y)
})
> list
$A
x y class
A1.1 66 96 1
A1.2 76 58 1
A2.1 50 93 2
A2.2 57 12 2
$B
x y class
B1.1 58 56 1
B1.2 69 15 1
B2.1 77 77 2
B2.2 9 9 2
In my real-world problem there are about 500 subjects, not always two occasions, and differing numbers of observations.
So my example above just illustrates where I want to get. I am stuck on how to tell the do.call-rbind step that, based on element names, it should bind the subject-specific elements together as new list elements while assigning a new variable.
To me this is a somewhat fuzzy task, and the closest I got was the rlist package. This question is related, but it uses unique to identify elements, whereas my case seems to be more of a regex problem.
I'd be happy even with instructions on how to google this, keywords for further research, etc.
From the data you provided:
subj <- sub("[A-Z]*", "", names(lst))
newlst <- Map(function(x, y) {x[,"class"] <- y;x}, lst, subj)
First we use a regular expression to isolate the number that will go in the class column. Here I matched on capital letters and removed them, leaving the number, so "A1" becomes "1". Note that different real-world names will call for a different regex pattern.
Then we use Map to create a new column for each data frame, saving the result in a new list called newlst. Map takes the first element of each argument and applies the function, then continues element by element, so the first data frame in lst is paired with the first number in subj, and so on. The anonymous function, function(x, y) {x[, "class"] <- y; x}, takes two arguments: the data frame and the column value.
Now it's much easier to move forward. We create a vector called uniq.nmes holding the names of the data frames to combine ("A1" becomes "A"), then rbind on each match:
uniq.nmes <- unique(sub("\\d", "", names(lst)))
lapply(uniq.nmes, function(x) {
do.call(rbind, newlst[grep(x, names(newlst))])
})
# [[1]]
# x y class
# A1.1 1 79 1
# A1.2 30 13 1
# A2.1 90 39 2
# A2.2 43 22 2
#
# [[2]]
# x y class
# B1.1 54 59 1
# B1.2 83 90 1
# B2.1 85 36 2
# B2.2 91 28 2
Data
A1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
A2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B1 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
B2 <- data.frame(x=sample(1:100,2),y=sample(1:100,2))
lst <- list(A1=A1,A2=A2,B1=B1,B2=B2)
It sounds like you're doing a lot of gymnastics because you have a specific form in mind. What I would suggest is first trying to make the data tidy: put your data into a single data frame, where it can be easily processed.
The quick version of the answer (here I've used lst instead of list for the name to avoid confusion with the built-in list) is to do this:
do.call(rbind,
lapply(seq(lst), function(i) {
lst[[i]]$type <- names(lst)[i]; lst[[i]]
})
)
What this will do is create a single data frame, with a column, "type", that contains the name of the list item in which that row appeared.
Using a slightly simplified version of your initial data:
lst <- list(A1=data.frame(x=rnorm(5)), A2=data.frame(x=rnorm(3)), B=data.frame(x=rnorm(5)))
lst
$A1
x
1 1.3386071
2 1.9875317
3 0.4942179
4 -0.1803087
5 0.3094100
$A2
x
1 -0.3388195
2 1.1993115
3 1.9524970
$B
x
1 -0.1317882
2 -0.3383545
3 0.8864144
4 0.9241305
5 -0.8481927
And then applying the magic function
df <- do.call(rbind,
lapply(seq(lst), function(i) {
lst[[i]]$type <- names(lst)[i]; lst[[i]]
})
)
df
x type
1 1.3386071 A1
2 1.9875317 A1
3 0.4942179 A1
4 -0.1803087 A1
5 0.3094100 A1
6 -0.3388195 A2
7 1.1993115 A2
8 1.9524970 A2
9 -0.1317882 B
10 -0.3383545 B
11 0.8864144 B
12 0.9241305 B
13 -0.8481927 B
From here we can process to our heart's content, with operations like df$subject <- gsub("[0-9]*", "", df$type) to extract the non-numeric portion of type; tools like split can then generate the sub-lists you mention in your question.
In addition, once it is in this form, you can use functions like by and aggregate or libraries like dplyr or data.table to do more advanced split-apply-combine operations for data analysis.
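For instance (a short sketch continuing from the df built above), extracting the subject and splitting back into per-subject data frames takes two lines:
df$subject <- gsub("[0-9]*", "", df$type)  # "A1" -> "A", "B" -> "B"
split(df, df$subject)                      # named list with one data frame per subject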
I have data as below. I want to get a list of the distinct values and their counts for the entire matrix. What is an efficient way to do that?
I thought about putting one column below another (concatenating the columns) to create a single column of 9 elements and then running the table command, but I feel there must be a better way. What are my options?
sm <- matrix(c(51,43,22,"a",51,21,".",22,9),ncol=3,byrow=TRUE)
expected output
distinct value: count
51:2
43:1
22:2
a:1
21:1
.:1
9:1
The table() command works just fine across a matrix:
tab <- table(sm)  # named "tab" rather than "t", to avoid masking the transpose function t()
tab
# sm
# . 21 22 43 51 9 a
# 1 1 2 1 2 1 1
If you want to reshape the results, you can do:
cat(paste0(names(tab), ":", tab, collapse="\n"), "\n")
# .:1
# 21:1
# 22:2
# 43:1
# 51:2
# 9:1
# a:1
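If you'd rather have the counts as a data frame than as formatted text, as.data.frame() on the table object gives the same information in two columns:
as.data.frame(tab)
#   sm Freq
# 1  .    1
# 2 21    1
# 3 22    2
# 4 43    1
# 5 51    2
# 6  9    1
# 7  a    1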