match common rows between different dataframes in a new organized df - r

Can someone help me match three or more ranked data frames so that the final one contains only the rows common to all of them? I have been trying the match and merge functions, but I cannot get any further.
Here is what the data may look like:
A <- data.frame(letter=LETTERS[sample(10)], x=runif(10))
B <- data.frame(letter=LETTERS[sample(10)], x=runif(10))
C <- data.frame(letter=LETTERS[sample(10)], x=runif(10))
In my real data, however, "letter" is stored as the row.names, and each df has only one column, the numeric "x" holding the ranked values.

There are not many details, but I will try to suggest a basic approach. The function below tests whether two values, one taken from dataFrame1 and one from dataFrame2, are identical. If they are, it returns the common value, which is then stored in dataFrame3. The index in square brackets selects the rows that you would like to test.
matching_row <- function(x, y) {
  # Return the common value when x and y are identical, otherwise NULL
  if (identical(x, y)) x else NULL
}
dataFrame3 <- matching_row(dataFrame1$x[row], dataFrame2$x[row])
You can adapt the function to your data, e.g. by adding a loop if the data frames are large, or by using stricter or more flexible logical conditions to test for identity between data frames.
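As a rough sketch of how this could be taken further for three or more data frames: this is a swapped-in approach using Reduce and intersect on the row names, not the function above. It assumes, as the question describes, that the letters are the row.names and that each data frame has a single numeric column x. The toy values are made up for illustration.

```r
# Toy data (hypothetical values): letters are the row names, x is the rank score
A <- data.frame(x = c(0.9, 0.5, 0.1, 0.7), row.names = c("A", "B", "C", "D"))
B <- data.frame(x = c(0.2, 0.8, 0.4), row.names = c("C", "A", "E"))
C <- data.frame(x = c(0.6, 0.3, 0.95), row.names = c("A", "F", "C"))

dfs <- list(A = A, B = B, C = C)

# Row names present in every data frame
common <- Reduce(intersect, lapply(dfs, rownames))

# One column per input data frame, rows restricted to the common names
result <- data.frame(lapply(dfs, function(d) d[common, "x"]),
                     row.names = common)
print(result)
```

Because the list is named, the columns of result carry the names of the original data frames, so it stays clear where each x came from.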

Related

Looping through a data set based on a factor r

I am trying to loop through a large address data set (300,000+ lines) based on a common factor for each observation, ID2. This data set contains addresses from two different sources, and I am trying to find matches between them. To determine a match, I want to loop through each ID2 as a factor and look for a line from each of the two sources (the building and property data sets). Here is a picture of my desired output: [picture of desired output]
Here is a sample of the code I have tried:
PROPERTYNAME=c("Vista 1","Vista 1","Vista 1","Chesnut Street","Apple Street","Apple Street")
CITY=c("Pittsburgh","Pittsburgh","Pittsburgh","Boston","New York","New York")
STATE= c("PA","PA","PA","MA","NY","NY")
ID2=c(1,1,1,2,3,3)
IsBuild=c(1,0,0,0,1,1)
IsProp=c(0,1,1,1,0,0)
df=data.frame(PROPERTYNAME,CITY,STATE,ID2,IsBuild,IsProp)
for(i in levels(as.factor(df$ID2))){
for(row in 1:nrow(df)){
df$Any_Build[row][i]<-ifelse(as.numeric(df$IsBuild[row][i])==1)
df$Any_Prop[row][i]<-ifelse(as.numeric(df$IsProp[row][i])==1)
}
}
I've tried nested for loops but have had no luck and am struggling with the apply functions of r. I would appreciate any help. Thank you!
If your main dataset is called D and the building data set is called B and the property dataset is called P, you can do the following:
D$inB <- D$ID2 %in% B$ID2
D$inP <- D$ID2 %in% P$ID2
If you want some data in B, like let's say an address, you can use merge:
D <- merge(D, B[c("ID2", "address")], by = "ID2", all.x = TRUE, all.y = FALSE)
If every row in B has an address, then the NAs in the new address column in D should coincide with the FALSEs in D$inB.
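To make that last claim concrete, here is a small worked sketch. The contents of D and B (including the name and address columns) are hypothetical stand-ins, since the real data sets are not shown.

```r
# Hypothetical main data set and building data set
D <- data.frame(ID2 = c(1, 2, 3),
                name = c("Vista 1", "Chesnut Street", "Apple Street"),
                stringsAsFactors = FALSE)
B <- data.frame(ID2 = c(1, 3),
                address = c("1 Main St", "9 Oak Ave"),
                stringsAsFactors = FALSE)

# Flag the rows of D whose ID2 also occurs in B
D$inB <- D$ID2 %in% B$ID2

# Left join: keep every row of D, pull in the address where one exists
merged <- merge(D, B[c("ID2", "address")], by = "ID2", all.x = TRUE, all.y = FALSE)
print(merged)

# The missing addresses line up with the FALSE flags
print(identical(is.na(merged$address), !merged$inB))
```

Note that merge sorts the result by the key column, so the row order of merged may differ from that of D.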
How does ID2 affect the output? If it has no effect, you can use the same logic as in your example code without the loop. ifelse is vectorized, so you don't have to run it per row.
Edited formatting:
LIHTCComp1$AnyBuild <- ifelse(LIHTCComp1$IsBuild ==1,TRUE,FALSE)
LIHTCComp1$AnyProp <- ifelse(LIHTCComp1$IsProp ==1,TRUE,FALSE)
Hope this helps.
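If ID2 does matter, i.e. the flag should say whether any row with the same ID2 is a building (or a property), one possible sketch uses base R's ave() to compute a per-group maximum. This is an assumption about the desired output, which is only shown as a picture in the question. The Matched column is a hypothetical name added for illustration.

```r
# Sample data from the question (only the columns needed here)
df <- data.frame(ID2     = c(1, 1, 1, 2, 3, 3),
                 IsBuild = c(1, 0, 0, 0, 1, 1),
                 IsProp  = c(0, 1, 1, 1, 0, 0))

# ave() applies FUN within each ID2 group and returns a vector as long as the input,
# so a group maximum of 1 means at least one row in the group has the flag set
df$Any_Build <- ave(df$IsBuild, df$ID2, FUN = max) == 1
df$Any_Prop  <- ave(df$IsProp,  df$ID2, FUN = max) == 1

# Hypothetical extra column: the ID2 appears in both sources
df$Matched <- df$Any_Build & df$Any_Prop
print(df)
```

This avoids the nested loops entirely, which should matter on 300,000+ rows.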

How to do a complex edit of columns of all data frames in a list?

I have a list of 185 data frames called WaFramesNumeric. Each dataframe has several hundred columns and thousands of rows. I want to edit every data frame, so that it leaves all numeric columns as well as any non-numeric columns that I specify.
Using:
for(i in seq_along(WaFramesNumeric)) {
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][,sapply(WaFramesNumeric[[i]],is.numeric)]
}
successfully makes each dataframe contain only its numeric columns.
I've tried to amend this with lines to add specific columns. I have tried:
for (i in seq_along(WaFramesNumeric)) {
a <- WaFramesNumeric[[i]]$Device_Name
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][,sapply(WaFramesNumeric[[i]],is.numeric)]
cbind(WaFramesNumeric[[i]],a)
}
and in an attempt to call the column numbers of all integer columns as well as the specific ones and then combine based on that:
for (i in seq_along(WaFramesNumeric)) {
f <- which(sapply(WaFramesNumeric[[i]],is.numeric))
m <- match("Cost_Center",colnames(WaFramesNumeric[[i]]))
n <- match("Device_Name",colnames(WaFramesNumeric[[i]]))
combine <- c(f,m,n)
WaFramesNumeric[[i]][,i,combine]
}
These all return errors and I am stumped as to how I could do this. WaFramesNumeric is a copy of another list of dataframes (WaFramesNumeric <- WaFramesAll) and so I also tried adding the specific columns from the WaFramesAll but this was not successful.
I appreciate any advice you can give and I apologize if any of this is unclear.
You are mistakenly assuming that the value of the last command in a for loop is kept. It is not. Since you never assign the result of the cbind (second code block) or of the indexing of WaFramesNumeric[[i]] (third code block) to anything, it is silently discarded.
Additionally, you are over-indexing your data.frame in the third code block. First, it's using i within the data.frame, even though i is an index within the list of data.frames, not the frame itself. Second (perhaps caused by this), you are trying to index three dimensions of a 2D frame. Just change the last indexing from [,i,combine] to either [,combine] or [combine].
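Third problem (though perhaps not seen yet) is that match will return NA if nothing is found. Indexing a frame with an NA returns an error (try mtcars[,NA] to see). I suggest that you replace match with grep: it returns integer(0) when nothing is found, which is what you want in this case. Note that grep treats its pattern as a regular expression and matches it anywhere in the name, so anchor it (e.g. "^Cost_Center$") if other column names contain the same text.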
for (i in seq_along(WaFramesNumeric)) {
f <- which(sapply(WaFramesNumeric[[i]], is.numeric))
m <- grep("Cost_Center", colnames(WaFramesNumeric[[i]]))
n <- grep("Device_Name", colnames(WaFramesNumeric[[i]]))
combine <- c(f,m,n)
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][combine]
}
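An alternative sketch that sidesteps both the NA problem and partial matching is to build a logical mask: a column is kept if it is numeric or if its name is exactly one of the requested names. The helper name keep_cols and the toy frames are hypothetical, made up for illustration.

```r
# Toy list of data frames (hypothetical columns)
frames <- list(
  data.frame(a = 1:3, Device_Name = c("x", "y", "z"),
             Cost_Center_Old = c("p", "q", "r"), stringsAsFactors = FALSE),
  data.frame(b = c(1.5, 2.5), note = c("m", "n"),
             Cost_Center = c("c1", "c2"), stringsAsFactors = FALSE)
)

# Keep numeric columns plus any column whose name is exactly in `extra`
keep_cols <- function(df, extra) {
  is_num <- vapply(df, is.numeric, logical(1))
  # drop = FALSE keeps a data frame even if only one column survives
  df[, is_num | names(df) %in% extra, drop = FALSE]
}

result <- lapply(frames, keep_cols, extra = c("Cost_Center", "Device_Name"))
print(lapply(result, names))
```

Names in extra that a given frame does not have are simply ignored, and the original column order is preserved.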
I'm not sure what you mean by "an attempt to call the column numbers of all integer columns...", but if you want to go through a list of data frames, select the columns for which some function returns TRUE, and also keep columns given by name, you can do it like this:
df <- data.frame(a=rnorm(20), b=rnorm(20), c=letters[1:20], d=letters[1:20], stringsAsFactors = FALSE)
WaFramesNumeric <- rep(list(df), 2)
Selector <- function(data, select_func, select_names) {
select_func <- match.fun(select_func)
idx_names <- match(select_names, colnames(data))
idx_names <- idx_names[!is.na(idx_names)]
idx_func <- which(sapply(data, select_func))
idx <- unique(c(idx_func, idx_names))
return(data[, idx, drop = FALSE])
}
res <- lapply(X = WaFramesNumeric, FUN = Selector, select_names=c("c"), select_func = is.numeric)

Finding common rows in R

While trying to get my data fit for analysis, I can't seem to do this correctly. Suppose I have datasets in this form:
df1
V1 V2df1
a H
b Y
c Y
df2
V1 V2df2
a Y
j H
b Y
and three more (5 datasets of different lengths altogether). What I am trying to do is the following. First, I must find all elements of the first column (V1) that are common to all datasets; in this case those are a and b. Then, based on those common elements, I am trying to build a joined dataset in which the values of V1 are common to all five datasets and the values from the other columns are appended in the same row. To explain with an example,
my result should look something like:
V1 V2df1 V2df2
a H Y
b Y Y
I managed to get some code working, but apparently the results are not correct. What I did:
read all the lines from all files into variables(example: a<-df1[,1] and so on) and find common rows like:
red<-Reduce(intersect, list(a,b,c,d,e))
then I filtered specific datasets like:
df1 <- unique(filter(df1, V1 %in% red))
I ordered every dataset according to row:
df1<-data.frame(df1[with(df1, order(V1)),])
and deleted duplicates(of elements in first column):
df1<- df1[unique(df1$V1),]
I then created a new dataset with:
newdata<-data.frame(V1common=df1[,1], V2df1=df1[,2],V2df2=df2[,2]...)
Here ... stands for the remaining datasets, five in all. I actually got the same number of rows (a good sign, since that is the number of rows in the intersection), and then appended the other sorted columns, but something doesn't add up. Thanks for any advice. (I omitted the use of libraries and such; the code is for illustrative purposes.)
You can use join_all from the plyr package:
library(plyr)
df <- join_all(list(df1,df2,df3,df4, df5), by = 'V1', type = 'inner')
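If you would rather not depend on plyr, a base R sketch of the same inner join is to fold merge over the list with Reduce. df1 and df2 are taken from the question; df3 is a hypothetical third dataset added for illustration.

```r
df1 <- data.frame(V1 = c("a", "b", "c"), V2df1 = c("H", "Y", "Y"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = c("a", "j", "b"), V2df2 = c("Y", "H", "Y"),
                  stringsAsFactors = FALSE)
# Hypothetical third dataset
df3 <- data.frame(V1 = c("b", "a", "k"), V2df3 = c("Y", "H", "Y"),
                  stringsAsFactors = FALSE)

# merge() does an inner join by default, so only V1 values present
# in every data frame survive the fold
joined <- Reduce(function(x, y) merge(x, y, by = "V1"), list(df1, df2, df3))
print(joined)
```

One caveat: if a V1 value is duplicated within a dataset, merge returns every combination of the matching rows, so it may be worth removing duplicates first.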

Order a data set

I have a list of data frames that I want to order by 3 columns.
I've tried to apply an anonymous function
mylist<-lapply(mylist, function (x) x[order((data[,col1]),(data$namecol2),na.last=NA),])
I've tried in a loop :
for (i in 1:length(mylist)) {
list_sorted <-mylist[[i]][order((data[,col1]),(data$namecol2),na.last=NA),]
}
Either way I get a list of data frames full of NA, although they contained no NA in the first place. It is this step that creates the NA-filled data frames: I checked the output of the previous step and it returns data frames full of values.
I don't know what I'm doing wrong, any tips?
Thank you.
I guess you have a list of data frames and want to sort each data frame by one of its own columns.
The example below uses a list of two data frames, each with two columns ("x" and "y"), and sorts each one by column "x" in descending order. Hope this gives you an idea of how to accomplish what you want.
x <- rep(1:5)
y <- rnorm(5)
dfrm <- data.frame(x,y)
str(dfrm)
names(dfrm)
listd <- list(dfrm,dfrm)
str(listd)
listsorted <- lapply(listd, function(z) z[with(z,order(x,decreasing=TRUE,na.last=NA)),])
listsorted
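The same pattern extends to several sort keys. A plausible cause of the NA rows in the question (a guess, since the object data is not shown) is that order() was computed on an outside object rather than on the data frame being sorted, so the returned row numbers need not exist in that data frame. The sketch below computes the order from each data frame itself. The column names g, k and v are hypothetical.

```r
# Toy data frame with three columns to sort by (hypothetical names)
d <- data.frame(g = c("b", "a", "b", "a"),
                k = c(2, 1, 1, 2),
                v = c(10, 20, 30, 40),
                stringsAsFactors = FALSE)
mylist <- list(d, d[4:1, ])

# Sort each data frame by g, then k, then v, using its own columns
sorted <- lapply(mylist, function(z) z[order(z$g, z$k, z$v), ])
print(sorted[[1]])
```

Inside the anonymous function, only z is referenced, so every index produced by order() refers to a row that actually exists in z.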

Slice dataframe by all rows corresponding to a country, then sample that vector

I have an R script that reads out some parameters via the commandArgs() function to see what kind of slices it should make in a dataset before saving these slices to a text file to be interpreted by a C++ program.
The dataset is a survey done in the EU and I would like to be able to slice per respondent's country, by having relevant arguments in the commandArgs vector be compared to a string vector countries that contains all possible options. Using that and a corresponding integer matrix countryIndices, which contains the bounds of each country (i.e.: all Belgian correspondents are in rows 1-1043, so countryIndices[1,1]=1 and countryIndices[2,1]=1043), I wish to construct a matrix personIndices, that has all relevant bounds, using the which() function.
From this I want to make a vector that contains a sample of indices from the requested countries. The size of this vector is either sampleSize*sampleCountries (sampling sampleSize people per country) or simply sampleSize, depending on another parameter passed through. I was hoping that, at least for the latter type of sampling I could make this vector in one go, through the c() function, as follows
personIndices<-rbind(c(1,1043),c(2044,3061),c(8423,8922))
sampleVector<-c(personIndices[,1]:personIndices[,2])
And then sampling from that vector.
I'd hoped that this would make a vector containing the numbers 1:1043, 2044:3061 and 8423:8922, but this sadly does not seem to work. Any tips? Out of desperation I've constructed a monstrosity containing ifs in ifs in ifs and I'd rather not have it see the light of day if there's a smarter approach, but I haven't been able to find out. For reference as to what I'm doing (or if I wasn't being clear enough), said monstrosity can be found at http://pastebin.ca/2650188
Thanks in advance!
All the acrobatics with vectors of indices are unnecessary.
Logical indexing, subsetting are really all you need, using a new 'country' field (factor) you add to your data. (Maybe also plyr::ddply if you get real fancy)
All you want to do is allow the user to:
Choose a country from a list (by selecting its number, 2-letter abbrev, whatever)...
... then sample in your dataset from within that country. That's all!
dat$country <- NA # insert a new column, initialized to NA to catch omissions
dat$country[1:1043] <- 'Belgium'
dat$country[2044:3061] <- 'Bulgaria'
dat$country[8423:8922] <- 'Czech Rep'
...
# Now make country a factor instead of character
dat$country <- as.factor(dat$country)
# Now you can sample rows using either a logical test on country...
dat[sample(which(dat$country=='Bulgaria'), ...), ]
# ...or subsetting
bulgaria <- subset(dat, country=='Bulgaria')
bulgaria[sample(nrow(bulgaria), ...), ]
I would summarize your code as:
If sampleType is TRUE, then draw a sample of size sampleSize from the indices corresponding to each country in sampleCountries, and return all these sampled indices together.
If sampleType is FALSE, then group the indices corresponding to all the countries in sampleCountries together and draw a single sample of size sampleSize.
Let's set up some sample parameters:
sampleCountries <- c("BE", "WG")
sampleSize <- 20
sampleType <- FALSE
The first step is to build a vector of the country for each index:
countries = c(rep("BE", 1043), rep("DM", 1000), rep("WG", 1018), rep("GR", 1003),
rep("IT", 1021), rep("SP", 1021), rep("FR", 1008), rep("IR", 1000),
rep("NI", 308), rep("LX", 500), rep("NL", 1022), rep("PT", 1000),
rep("GB", 1066), rep("EG", 1014))
Next, when "ALL" is in sampleCountries you want to behave like all the countries are selected:
if ("ALL" %in% sampleCountries) {
sampleCountries <- unique(countries)
}
Finally, draw your samples:
if (sampleType) {
personIndices <- unlist(lapply(sampleCountries, function(x) {
return(sample(which(countries == x), sampleSize, replace=F))
}))
} else {
personIndices <- sample(which(countries %in% sampleCountries), sampleSize,
replace=F)
}
In the first part of the if statement, which(countries == x) gets the indices of country x, and lapply does this for all the countries in your vector sampleCountries. Finally, unlist converts the output of lapply to a vector.
In the second part of the if statement, which(countries %in% sampleCountries) gets the indices of every country in sampleCountries.
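For the literal question of turning a matrix of bounds into one index vector: the : operator is not vectorized over its arguments and uses only the first element of each, which is why personIndices[,1]:personIndices[,2] does not give the expected result. A minimal sketch that expands every pair of bounds, using the personIndices from the question and Map with seq:

```r
personIndices <- rbind(c(1, 1043), c(2044, 3061), c(8423, 8922))

# Build one sequence per row of bounds, then flatten into a single vector
sampleVector <- unlist(Map(seq, personIndices[, 1], personIndices[, 2]))

# Draw a single sample across all requested countries
set.seed(42)  # only to make the illustration reproducible
sampled <- sample(sampleVector, 20)
print(length(sampleVector))
```

The seed is there only so the illustration can be repeated; drop it in real use.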

Resources