I am trying to loop through a large address data set (300,000+ lines) based on a common factor for each observation, ID2. This data set contains addresses from two different sources, and I am trying to find matches between them. To determine a match, I want to loop through each ID2 as a factor and check whether that ID2 has a line from each of the two sources (the building and property data sets). Here is a picture of my desired output.
Here is a sample of the code I have tried:
PROPERTYNAME <- c("Vista 1", "Vista 1", "Vista 1", "Chesnut Street", "Apple Street", "Apple Street")
CITY <- c("Pittsburgh", "Pittsburgh", "Pittsburgh", "Boston", "New York", "New York")
STATE <- c("PA", "PA", "PA", "MA", "NY", "NY")
ID2 <- c(1, 1, 1, 2, 3, 3)
IsBuild <- c(1, 0, 0, 0, 1, 1)
IsProp <- c(0, 1, 1, 1, 0, 0)
df <- data.frame(PROPERTYNAME, CITY, STATE, ID2, IsBuild, IsProp)
for(i in levels(as.factor(df$ID2))){
  for(row in 1:nrow(df)){
    df$Any_Build[row][i] <- ifelse(as.numeric(df$IsBuild[row][i]) == 1)
    df$Any_Prop[row][i]  <- ifelse(as.numeric(df$IsProp[row][i]) == 1)
  }
}
I've tried nested for loops but have had no luck, and I am struggling with the apply functions in R. I would appreciate any help. Thank you!
If your main dataset is called D and the building data set is called B and the property dataset is called P, you can do the following:
D$inB <- D$ID2 %in% B$ID2
D$inP <- D$ID2 %in% P$ID2
If you want to pull some data from B, let's say an address column, you can use merge():
D <- merge(D, B[c("ID2", "address")], by = "ID2", all.x = TRUE, all.y = FALSE)
If every row in B has an address, then the NAs in the new address column in D should coincide with the FALSEs in D$inB.
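As a rough, self-contained sketch tying this back to the question's sample df (assuming the two sources can be recovered from the IsBuild/IsProp flags; B, P and D follow the naming above):
# sketch only: split the question's sample df into the two assumed sources
B <- df[df$IsBuild == 1, ]   # building rows
P <- df[df$IsProp == 1, ]    # property rows

D <- df
D$inB <- D$ID2 %in% B$ID2    # TRUE if some building row shares this ID2
D$inP <- D$ID2 %in% P$ID2    # TRUE if some property row shares this ID2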
How does ID2 affect the output? If it doesn't have any effect, you can use the same logic as your example code without the loop: ifelse() is vectorized, so you don't have to run it per row.
Edited formatting:
LIHTCComp1$AnyBuild <- ifelse(LIHTCComp1$IsBuild == 1, TRUE, FALSE)
LIHTCComp1$AnyProp  <- ifelse(LIHTCComp1$IsProp == 1, TRUE, FALSE)
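If, on the other hand, the picture of the desired output means one flag per ID2 group (does any row of the group come from the building source, and any from the property source), ave() would do that without an explicit loop; a sketch on the question's sample df:
# assumption: Any_Build/Any_Prop should be 1 for every row of an ID2 group
# that contains at least one building / property row
df$Any_Build <- ave(df$IsBuild, df$ID2, FUN = function(x) as.integer(any(x == 1)))
df$Any_Prop  <- ave(df$IsProp,  df$ID2, FUN = function(x) as.integer(any(x == 1)))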
Hope this helps.
I have this data set that requires some cleaning up. Is there a way to code in R such that it picks up columns with more than 3 different levels from the data set? E.g. column C has the different education levels and I would like it to be selected along with columns D and F, while columns E and G won't be picked up because they don't meet the more-than-3-level requirement.
At the same time, I need one of the columns to be arranged in a specific way. E.g. for Education, I would like PHD to be at the top; the other levels of education do not need to be in any order.
Sorry, I am really new to R. I attached a snapshot of sample data I replicated from the original.
All help is greatly appreciated
It is a bit complicated to replicate the data since it is an image, but you could use this function to select the columns of your data frame that have more than 3 levels.
First, convert the candidate columns to factors, in this case everything from column C (the third column) onward. The for loop then identifies the columns with more than 3 levels and saves their names in a vector, and the original data set is filtered down to those columns.
library(dplyr)

select_columns <- function(data) {
  # convert every column after the first two (i.e. from column C onward) to a factor
  factors <- data.frame(lapply(data[, -c(1, 2)], as.factor), check.names = FALSE)
  selectColumns <- c()
  for (i in 1:length(factors)) {
    if (length(unique(factors[, i])) > 3) {
      selectColumns[i] <- colnames(factors)[i]
    }
  }
  selectColumns <- na.omit(selectColumns)
  # keep the first two columns plus the columns with more than 3 levels
  return(data %>% select(c(1:2), all_of(selectColumns)))
}

select_columns(your_data_frame)
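The second part of the question, putting PHD at the top of the education column, is not handled above; one possible approach, assuming that column is (or can be made) a factor named Education, is relevel():
# assumption: the education column is called Education in your_data_frame
your_data_frame$Education <- relevel(as.factor(your_data_frame$Education), ref = "PHD")
levels(your_data_frame$Education)   # "PHD" is now the first level; the rest keep their order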
Warning: Multi-part question!
I realize parts of this have been answered elsewhere but am struggling to bring them together in a nice parsimonious bit of code....
I have a data frame with a number (24) of numeric columns of interest. For each column, I want to create a new variable in the same data frame (named sensibly) in which the values correspond to the mean of the sex-specific decile for that variable (sex is in a different column, coded 0/1).
New column names from an original column called 'WBC' would be, for example, 'WBC_meandec_women' and 'WBC_meandec_men'.
I've tried various bits of code to first create new variables, then assign values related to the decile, but none work well and I can't figure out how to put it together. I just know there is a clever way to put all the parts into the same code chunk; I'm just not fluent enough in R to get there...
dummydata <- data.frame(id = c(1:100), sex = rep(c(1, 0), 50), WBC = rnorm(100), RBC = rnorm(100))
Trying to achieve:
goaldata <- data.frame(id = c(1:100), sex = rep(c(1, 0), 50), WBC = rnorm(100), RBC = rnorm(100),
                       WBC_decmean_women = rep(NA, nrow(dummydata)), WBC_decmean_men = rep(NA, nrow(dummydata)),
                       RBC_decmean_women = rep(NA, nrow(dummydata)), RBC_decmean_men = rep(NA, nrow(dummydata)))
...but obviously with the correct values instead of NAs, and for a list of about 24 original variables.
Any help greatly appreciated!
If I've understood you right, I'll propose this giant ball of duct tape...
# fake data
dummydata <- data.frame(id=c(1:100),sex=rep(c(1,0),50),WBC=rnorm(100),RBC=rnorm(100))
# a function to calculate decile means
decilemean <- function(x) {
  xrank <- rank(x)
  xdec <- floor((xrank - 1) / length(x) * 10) + 1   # decile (1-10) of each value
  decmeans <- as.numeric(tapply(x, xdec, mean))     # mean of x within each decile
  xdecmeans <- decmeans[xdec]                       # map each value to its decile mean
  return(xdecmeans)
}
# looping thru your data columns and making new columns
newcol <- 5 # the first new column to create
for(j in c(3,4)) { # all of your columns to decilemean-ify
  dummydata[,newcol] <- NA
  dummydata[dummydata$sex==0,newcol] <- decilemean(dummydata[dummydata$sex==0,j])
  names(dummydata)[newcol] <- paste0(names(dummydata)[j],"_decmean_women")
  dummydata[,newcol+1] <- NA
  dummydata[dummydata$sex==1,newcol+1] <- decilemean(dummydata[dummydata$sex==1,j])
  names(dummydata)[newcol+1] <- paste0(names(dummydata)[j],"_decmean_men")
  newcol <- newcol+2
}
I'd recommend testing it though ;)
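For comparison, a more compact sketch is possible with dplyr, though note it produces a single *_decmean column per variable (computed within each sex) rather than separate _women/_men columns; treat it as an untested alternative:
library(dplyr)

decile_cols <- c("WBC", "RBC")   # extend to all 24 columns of interest

dummydata <- dummydata %>%
  group_by(sex) %>%
  mutate(across(all_of(decile_cols),
                ~ ave(.x, ntile(.x, 10), FUN = mean),   # mean of the within-sex decile
                .names = "{.col}_decmean")) %>%
  ungroup()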
It might be a trivial question (I am new to R), but I could not find an answer to it, either here on SO or anywhere else. My scenario is the following.
I have a data frame df and I want to update a subset of the df$tag values. df is similar to the following:
id = rep( c(1:4), 3)
tag = rep( c("aaa", "bbb", "rrr", "fff"), 3)
df = data.frame(id, tag)
Then, I am trying to use match() to update the column tag for subsets of the data frame, using a second data frame (e.g., aux) that contains two columns, key and value. The subsets are defined by id = n, for each n in unique(df$id). aux looks like the following:
> aux
key value
"aaa" "valueAA"
"bbb" "valueBB"
"rrr" "valueRR"
"fff" "valueFF"
I have tried to loop over the data frame, as follows:
for(i in unique(df$id)){
  indexer = df$id == i
  # here is how I tried to update the data frame:
  df[indexer,]$tag <- aux[match(df[indexer,]$tag, aux$key),]$value
}
The expected result was the df[indexer,]$tag updated with the respective values from aux$value.
The actual result was df$tag fulfilled with NA's. I've got no errors, but the following warning message:
In `[<-.factor`(`*tmp*`, df$id == i, value = c(NA, :
  invalid factor level, NA generated
Before, I was using df$tag <- aux[match(df$tag, aux$key),]$value, which worked properly, but some duplicated values in df$tag made match() misplace the updates in a number of rows. I also simulated the subsetting on its own and it works fine. Can someone suggest a solution for this update?
UPDATE (how the final dataset should look):
> df
id tag
1 "valueAA"
2 "valueBB"
3 "valueRR"
4 "valueFF"
(...) (...)
Thank you in advance.
Does this produce the output you expect?
df$tag <- aux$value[match(df$tag, aux$key)]
merge() would work too unless you have duplicates in aux.
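One hedged side note, since the warning in the question comes from `[<-.factor`: if df$tag was created as a factor (the pre-R 4.0 default), converting it to character before doing any subset assignment avoids the invalid-factor-level NAs. A sketch, assuming aux is built roughly like this:
aux <- data.frame(key   = c("aaa", "bbb", "rrr", "fff"),
                  value = c("valueAA", "valueBB", "valueRR", "valueFF"),
                  stringsAsFactors = FALSE)      # keep plain character columns

df$tag <- as.character(df$tag)                   # drop the factor class first
df$tag <- aux$value[match(df$tag, aux$key)]      # whole-column lookup, no loop needed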
It turned out that my data was breaking all the available built-in functions, leaving me with a wrong dataset in the end. So my solution (at least a preliminary one) was the following:
process each subset individually;
add each resulting data frame to a list;
use rbindlist(a.list, use.names = TRUE) to get a complete data frame with the results.
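A rough sketch of that per-subset approach (the per-subset update shown here is an assumption about what "process each subset" involves; rbindlist() comes from data.table):
library(data.table)

# build one updated data frame per id and collect them in a list
a.list <- lapply(unique(df$id), function(i) {
  sub <- df[df$id == i, ]
  sub$tag <- aux$value[match(as.character(sub$tag), aux$key)]
  sub
})

# stitch the pieces back together into a single data frame
df <- rbindlist(a.list, use.names = TRUE)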
I have a column header stored in a variable as follows:
a <- get("colA")  # this variable changes and was obtained using regexp
The value of a is actually a column header called Nimu.
I also have a data frame (BigData) with Nimu as a column header, along with other columns. How can I use cbind/data.frame to select only a few columns, including Nimu, into a new data frame?
I have tried:
data <- cbind(BigData$Miu,BigData$sil,BigData$a)
But this did not work. R did not like BigData$a. Any suggestions? Thanks.
Something like this should work:
a <- get("colA")
b <- get("colB")
c <- get("colC")
cols <- c(a, b, c)
df_subset <- df[cols]
I do think your solution using get is probably sub-optimal and not needed, but without more context it is hard to say.
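In this specific case, since a already holds the column name "Nimu" as a string, plain bracket subsetting may be all that is needed; a minimal sketch reusing the column names from the question's attempt:
a <- "Nimu"                              # whatever get()/the regexp produced
data <- BigData[, c("Miu", "sil", a)]    # `[` accepts a character vector; `$` cannot use a variable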
Suppose I have following data frame:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
I want to get the following data frame at the end:
mydataframe[-which(is.na(mydataframe$ID)),]
I need to do this kind of cleaning (and other similar manipulations) with many other data frames. So, I decided to assign a name to mydataframe, and variable of interest.
dbname <- "mydataframe"
varname <- "ID"
attach(get(dbname))
I get an error in the following line, understandably.
get(dbname) <- get(dbname)[-which(is.na(get(varname))),]
detach(get(dbname))
How can I solve this? (I don't want to assign to a new data frame, even though it seems like the only solution right now. I will use "dbname" many times afterwards.)
Thanks in advance.
There is no get<- function, and there is no get(colname) function (since colnames are not first class objects), but there is an assign() function:
assign(dbname, get(dbname)[!is.na( get(dbname)[varname] ), ] )
You also do not want to use -which(.). It happens to work here because some rows match the condition. It will bite you, however, whenever no rows match: which() then returns integer(0), and vec[-integer(0)] is the same as vec[integer(0)], i.e. an empty result, so you lose everything instead of keeping everything. Only use which() for "positive" selection.
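A tiny made-up illustration of that edge case:
x <- 1:5
x[-which(x > 10)]   # integer(0): with no matches, -which() drops everything
x[!(x > 10)]        # 1 2 3 4 5: the logical test keeps the full vector, as intended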
As #Dason suggests, lists are made for this sort of work.
E.g.:
# make a list with all your data.frames in it
# (just repeating the one data.frame 3x for this example)
alldfs <- list(mydataframe,mydataframe,mydataframe)
# apply your function to all the data.frames in the list
# have replaced original function in line with #DWin and #flodel's comments
# pointing out issues with using -which(...)
lapply(alldfs, function(x) x[!is.na(x$ID),])
The suggestion to use a list of data frames is good, but I think people are assuming that you're in a situation where all the data frames are loaded simultaneously. This might not necessarily be the case, e.g. if you're working on a number of projects and just want some boilerplate code to use in all of them.
Something like this should fit the bill.
stripNAs <- function(df, var) df[!is.na(df[[var]]), ]
mydataframe <- stripNAs(mydataframe, "ID")
cars <- stripNAs(cars, "speed")
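If the data frame's name itself is stored in a string, as in the question, the same helper can presumably be combined with get()/assign():
dbname <- "mydataframe"                       # name of the data frame to clean in place
assign(dbname, stripNAs(get(dbname), "ID"))   # overwrite it under the same name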
I can totally understand your need for this, since I also frequently need to cycle through a set of data frames. I believe the following code should help you out:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
#define target dataframe and varname
dbname <- "mydataframe"
varname <- "ID"
tmp.df <- get(dbname) #get df and give it a temporary name
col.focus <- which(colnames(tmp.df) == varname) #define the column of focus
tmp.df <- tmp.df[which(!is.na(tmp.df[,col.focus])),] #cut out the subset of the df where the column of focus is not NA.
#Result
ID score
1 1 11
2 2 12
4 4 14
5 5 15
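To keep using the original name afterwards (the constraint mentioned in the question), the cleaned result could presumably be written back with assign():
assign(dbname, tmp.df)   # hand the cleaned data frame back to the name stored in dbname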