I have a column header stored in a variable as follows:
a <- get("colA")# this variable changes and was obtained using regexp
The value of a is actually a column header called Nimu.
I also have a data frame (BigData) having Nimu as a column header along with the other columns. How can I use cbind/data.frame to select a only a few columns, including Nimu, into a new data frame.
I have tried:
data <- cbind(BigData$Miu,BigData$sil,BigData$a)
But this did not work. R did not like BigData$a. Any suggestions? Thanks.
Something like this should work:
a <- get("colA")
b <- get("colB")
c <- get("colC")
cols = c(a, b, c)
df_subset = df[cols]
I do think your solution using get is probably sub-optimal and not needed, but without more context it is hard to say.
Related
I am trying to loop through a large address data set(300,000+ lines) based on a common factor for each observation, ID2. This data set contains addresses from two different sources, and I am trying to find matches between them. To determine this match, I want to loop through each ID2 as a factor and search for a line from each of the two data sets (building and property data sets) Here is a picture of my desire output Picture of desired output
Here is a sample code of what I have tried
PROPERTYNAME=c("Vista 1","Vista 1","Vista 1","Chesnut Street","Apple
Street","Apple Street")
CITY=c("Pittsburgh","Pittsburgh","Pittsburgh","Boston","New York","New
York")
STATE= c("PA","PA","PA","MA","NY","NY")
ID2=c(1,1,1,2,3,3)
IsBuild=c(1,0,0,0,1,1)
IsProp=c(0,1,1,1,0,0)
df=data.frame(PROPERTYNAME,CITY,STATE,ID2,IsBuild,IsProp)
for(i in levels(as.factor(df$ID2))){
for(row in 1:nrow(df)){
df$Any_Build[row][i]<-ifelse(as.numeric(df$IsBuild[row][i])==1)
df$Any_Prop[row][i]<-ifelse(as.numeric(df$IsProp[row][i])==1)
}
}
I've tried nested for loops but have had no luck and am struggling with the apply functions of r. I would appreciate any help. Thank you!
If your main dataset is called D and the building data set is called B and the property dataset is called P, you can do the following:
D$inB <- D$ID2 %in% B$ID2
D$inP <- D$ID2 %in% P$ID2
If you want some data in B, like let's say an address, you can use merge:
D <- merge(D, B[c("ID2", "address")], by = "ID2", all.x = TRUE, all.y = FALSE)
If every row in B has an address, then the NAs in the new address column in D should coincide with the FALSEs in D$inB.
How does ID2 affect the output? If it doesn't have any effect, you can use the same logic you used in your example code without the loop. Ifelse is vectorized so you dont have to run it per row
Edited formatting:
LIHTCComp1$AnyBuild <- ifelse(LIHTCComp1$IsBuild ==1,TRUE,FALSE)
LIHTCComp1$AnyProp <- ifelse(LIHTCComp1$IsProp ==1,TRUE,FALSE)
Hope this helps.
I have a list of 185 data frames called WaFramesNumeric. Each dataframe has several hundred columns and thousands of rows. I want to edit every data frame, so that it leaves all numeric columns as well as any non-numeric columns that I specify.
Using:
for(i in seq_along(WaFramesNumeric)) {
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][,sapply(WaFramesNumeric[[i]],is.numeric)]
}
successfully makes each dataframe contain only its numeric columns.
I've tried to amend this with lines to add specific columns. I have tried:
for (i in seq_along(WaFramesNumeric)) {
a <- WaFramesNumeric[[i]]$Device_Name
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][,sapply(WaFramesNumeric[[i]],is.numeric)]
cbind(WaFramesNumeric[[i]],a)
}
and in an attempt to call the column numbers of all integer columns as well as the specific ones and then combine based on that:
for (i in seq_along(WaFramesNumeric)) {
f <- which(sapply(WaFramesNumeric[[i]],is.numeric))
m <- match("Cost_Center",colnames(WaFramesNumeric[[i]]))
n <- match("Device_Name",colnames(WaFramesNumeric[[i]]))
combine <- c(f,m,n)
WaFramesNumeric[[i]][,i,combine]
}
These all return errors and I am stumped as to how I could do this. WaFramesNumeric is a copy of another list of dataframes (WaFramesNumeric <- WaFramesAll) and so I also tried adding the specific columns from the WaFramesAll but this was not successful.
I appreciate any advice you can give and I apologize if any of this is unclear.
You are mistakenly assuming that the last commmand in a for loop is meaningful. It is not. In fact, it is being discarded, so since you never assigned it anywhere (the cbind and the indexing of WaFramesNumeric...), it is silently discarded.
Additionally, you are over-indexing your data.frame in the third code block. First, it's using i within the data.frame, even though i is an index within the list of data.frames, not the frame itself. Second (perhaps caused by this), you are trying to index three dimensions of a 2D frame. Just change the last indexing from [,i,combine] to either [,combine] or [combine].
Third problem (though perhaps not seen yet) is that match will return NA if nothing is found. Indexing a frame with an NA returns an error (try mtcars[,NA] to see). I suggest that you can replace match with grep: it returns integer(0) when nothing is found, which is what you want in this case.
for (i in seq_along(WaFramesNumeric)) {
f <- which(sapply(WaFramesNumeric[[i]], is.numeric))
m <- grep("Cost_Center", colnames(WaFramesNumeric[[i]]))
n <- grep("Device_Name", colnames(WaFramesNumeric[[i]]))
combine <- c(f,m,n)
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][combine]
}
I'm not sure what you mean by "an attempt to call the column numbers of all integer columns...", but in case you want to go through a list of data frames and select some columns based on some function and keep given a column name you can do like this:
df <- data.frame(a=rnorm(20), b=rnorm(20), c=letters[1:20], d=letters[1:20], stringsAsFactors = FALSE)
WaFramesNumeric <- rep(list(df), 2)
Selector <- function(data, select_func, select_names) {
select_func <- match.fun(select_func)
idx_names <- match(select_names, colnames(data))
idx_names <- idx_names[!is.na(idx_names)]
idx_func <- which(sapply(data, select_func))
idx <- unique(c(idx_func, idx_names))
return(data[, idx])
}
res <- lapply(X = WaFramesNumeric, FUN = Selector, select_names=c("c"), select_func = is.numeric)
It might be a trivial question (I am new to R), but I could not find a answer for my question, either here in SO or anywhere else. My scenario is the following.
I have an data frame df and i want to update a subset df$tag values. df is similar to the following:
id = rep( c(1:4), 3)
tag = rep( c("aaa", "bbb", "rrr", "fff"), 3)
df = data.frame(id, tag)
Then, I am trying to use match() to update the column tag from the subsets of the data frame, using a second data frame (e.g., aux) that contains two columns, namely, key and value. The subsets are defined by id = n, according to n in unique(df$id). aux looks like the following:
> aux
key value
"aaa" "valueAA"
"bbb" "valueBB"
"rrr" "valueRR"
"fff" "valueFF"
I have tried to loop over the data frame, as follows:
for(i in unique(df$id)){
indexer = df$id == i
# here is how I tried to update the dame frame:
df[indexer,]$tag <- aux[match(df[indexer,]$tag, aux$key),]$value
}
The expected result was the df[indexer,]$tag updated with the respective values from aux$value.
The actual result was df$tag fulfilled with NA's. I've got no errors, but the following warning message:
In '[<-.factor'('tmp', df$id == i, value = c(NA, :
invalid factor level, NA generated
Before, I was using df$tag <- aux[match(df$tag, aux$key),]$value, which worked properly, but some duplicated df$tags made the match() produce the misplaced updates in a number of rows. I also simulate the subsetting and it works fine. Can someone suggest a solution for this update?
UPDATE (how the final dataset should look like?):
> df
id tag
1 "valueAA"
2 "valueBB"
3 "valueRR"
4 "valueFF"
(...) (...)
Thank you in advance.
Does this produce the output you expect?
df$tag <- aux$value[match(df$tag, aux$key)]
merge() would work too unless you have duplicates in aux.
It turned out that my data was breaking all the available built-in functions providing me a wrong dataset in the end. Then, my solution (at least, a preliminary one) was the following:
to process each subset individually;
add each data frame to a list;
use rbindlist(a.list, use.names = T) to get a complete data frame with the results.
I've finally lost my habit of loops in R. Basically usually calculating new columns, and then doing calculations and aggregations on these new columns.
But I have a question regarding cbind which I use for adding columns.
Is there a better way than using bind for things like this?
Naming this new column always is done by me in this tedious way... Anything cleverer/simpler out there?
library(quantmod)
getSymbols("^GSPC")
GSPC <- cbind(GSPC, lag(Cl(GSPC), k=1)) #Doing some new column calculation
names(GSPC)[length(GSPC[1,])] <- "Laged_1_Cl" #Naming this new column
GSPC <- cbind(GSPC, lag(Cl(GSPC), k=2))
names(GSPC)[length(GSPC[1,])] <- "Laged_2_Cl"
tail(GSPC)
** EDITED **
Roman Luštrik added a great solution in comments below.
GSPC$Laged_3_Cl <- lag(Cl(GSPC), k=3)
tail(GSPC)
One way of adding new variables to a data.frame is through the $ operator. Help page (?"$") shows common usage in the form of
x$i <- value
Where i is the new variable name and value are its associated values.
You can name the new column on the left side of the assignment like so:
exdat <- data.frame(lets = LETTERS[1:10],
nums = 1:10)
exdat$combo <- paste0(exdat$lets, exdat$nums)
Suppose I have following data frame:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
I want to get following data frame at the end:
mydataframe[-which(is.na(mydataframe$ID)),]
I need to do this kind of cleaning (and other similar manipulations) with many other data frames. So, I decided to assign a name to mydataframe, and variable of interest.
dbname <- "mydataframe"
varname <- "ID"
attach(get(dbname))
I get an error in the following line, understandably.
get(dbname) <- get(dbname)[-which(is.na(get(varname))),]
detach(get(dbname))
How can I solve this? (I don't want to assign to a new data frame, even though it seems only solution right now. I will use "dbname" many times afterwards.)
Thanks in advance.
There is no get<- function, and there is no get(colname) function (since colnames are not first class objects), but there is an assign() function:
assign(dbname, get(dbname)[!is.na( get(dbname)[varname] ), ] )
You also do not want to use -which(.). It would have worked here since there were some matches to the condition. It will bite you, however, whenever there are not any rows that match and instead of returning nothing as it should, it will return everything, since vec[numeric(0)] == vec. Only use which for "positive" choices.
As #Dason suggests, lists are made for this sort of work.
E.g.:
# make a list with all your data.frames in it
# (just repeating the one data.frame 3x for this example)
alldfs <- list(mydataframe,mydataframe,mydataframe)
# apply your function to all the data.frames in the list
# have replaced original function in line with #DWin and #flodel's comments
# pointing out issues with using -which(...)
lapply(alldfs, function(x) x[!is.na(x$ID),])
The suggestion to use a list of data frames is good, but I think people are assuming that you're in a situation where all the data frames are loaded simultaneously. This might not necessarily be the case, eg if you're working on a number of projects and just want some boilerplate code to use in all of them.
Something like this should fit the bill.
stripNAs <- function(df, var) df[!is.na(df[[var]]), ]
mydataframe <- stripNAs(mydataframe, "ID")
cars <- stripNAs(cars, "speed")
I can totally understand your need for this, since I also frequently need to cycle through a set of data frames. I believe the following code should help you out:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
#define target dataframe and varname
dbname <- "mydataframe"
varname <- "ID"
tmp.df <- get(dbname) #get df and give it a temporary name
col.focus <- which(colnames(tmp.df) == varname) #define the column of focus
tmp.df <- tmp.df[which(!is.na(tmp.df[,col.focus])),] #cut out the subset of the df where the column of focus is not NA.
#Result
ID score
1 1 11
2 2 12
4 4 14
5 5 15