I have a data set that requires some cleaning up. Is there a way to write R code that picks out the columns with more than 3 different levels? E.g. column C holds the different education levels, and I would like it to be selected along with columns D and F, while columns E and G won't be picked up because they don't meet the more-than-3-levels requirement.
At the same time, I need one of the columns to be arranged in a specific way. E.g. for Education, I would like PHD to be at the top. The other levels of education do not need to be in any particular order.
Sorry, I am really new to R. I have attached a snapshot of sample data that I replicated from the original.
All help is greatly appreciated.
It is a bit complicated to replicate the data as it is an image, but you could use this function to select those columns of your data frame that have more than 3 levels.
First I convert the columns you are considering (from column C, i.e. column 3, onwards) to factors. Then I identify the columns with more than 3 levels, save their names in a vector, and select those columns from the original data set.
library(dplyr)
select_columns <- function(df) {
  # Convert every column except the first two to a factor
  candidates <- lapply(df[, -c(1, 2), drop = FALSE], as.factor)
  # Keep the names of the columns with more than 3 distinct levels
  keep <- names(candidates)[vapply(candidates, nlevels, integer(1)) > 3]
  # Return the first two columns plus the selected ones from the original data
  df %>% select(1:2, all_of(keep))
}
select_columns(your_data_frame)
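The second part of the question (putting PHD at the top) can be handled by setting the factor levels explicitly. This is a minimal sketch on made-up data: the column name Education and its values are assumptions based on the description, since the real data is only available as an image.

```r
# Hypothetical sample data; the column name and values are assumed
df <- data.frame(
  Education = c("Bachelor", "PHD", "Diploma", "Master", "PHD", "Diploma"),
  stringsAsFactors = FALSE
)

# Put "PHD" first and keep the remaining levels in alphabetical order
edu_levels <- c("PHD", setdiff(sort(unique(df$Education)), "PHD"))
df$Education <- factor(df$Education, levels = edu_levels)

levels(df$Education)

# Sorting by the factor now lists the PHD rows first
df_sorted <- df[order(df$Education), , drop = FALSE]
```

Anything that respects factor level order (sorting, tables, plots) should then show PHD first.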
Good morning everyone.
I have a problem that I have not been able to solve for quite some time now (please see the image link for a screenshot of my data set): https://i.stack.imgur.com/g2eTM.jpg
I have a column of data (status) containing two values (1 and 2). These are dummies representing two categories (or statuses) of dependent variables (say Pp and Pt) that I need for a regression. Their actual values are contained in the last column, Pp.Pt (Pp.Pt is just a name, nothing more).
I need to run two separate regressions, one using Pp and one using Pt, meaning their respective values in the Pp.Pt column (each value in the last column has either status 1 or status 2). My question is: how do I separate or group them into these two categories (1 = Pp and 2 = Pt) so that I can clearly identify them?
Thank you very much for your kind help.
Best
Ludovic
Split-Apply-Combine method:
# Using the mtcars dataset as an example:
df <- mtcars
# Split the data.frame by the cyl vector; split() returns a named list
# with one data.frame per unique value of cyl, so no preallocation is needed:
df_list <- split(df, df$cyl)
# Apply the regression model to each piece and return the summaries:
lapply(df_list, function(x) {
  summary(lm(mpg ~ hp, data = x))
})
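Applied to the layout described in the question, the same idea might look like the sketch below. The column names status and Pp.Pt come from the question; the predictor x and all the values are invented for illustration, since the real data is only shown as a screenshot.

```r
# Hypothetical data: status and Pp.Pt follow the question, x is an invented predictor
set.seed(42)
dat <- data.frame(
  status = rep(c(1, 2), each = 20),
  x      = rnorm(40)
)
dat$Pp.Pt <- 2 + 0.5 * dat$x + rnorm(40)

# Label the two groups so the results are easy to tell apart
dat$group <- factor(dat$status, levels = c(1, 2), labels = c("Pp", "Pt"))

# One data frame per group, then one regression per data frame
by_group <- split(dat, dat$group)
models <- lapply(by_group, function(d) lm(Pp.Pt ~ x, data = d))

# Coefficients of the two separate regressions
lapply(models, coef)
```

The list elements are named Pp and Pt, so models$Pp and models$Pt give the two fits separately.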
This approach can fix your issue (it requires the dplyr package):
library(dplyr)
yourdata <- yourdata %>%
  mutate(classofyourcolumn = ifelse(columntosplit < quantile(columntosplit, 0.5), 1, 0))
I am trying to loop through a large address data set (300,000+ lines) based on a common factor for each observation, ID2. This data set contains addresses from two different sources, and I am trying to find matches between them. To determine a match, I want to loop through each ID2 as a factor and search for a line from each of the two data sets (the building and property data sets). My desired output is shown in the attached picture.
Here is a sample code of what I have tried
PROPERTYNAME=c("Vista 1","Vista 1","Vista 1","Chesnut Street","Apple Street","Apple Street")
CITY=c("Pittsburgh","Pittsburgh","Pittsburgh","Boston","New York","New York")
STATE= c("PA","PA","PA","MA","NY","NY")
ID2=c(1,1,1,2,3,3)
IsBuild=c(1,0,0,0,1,1)
IsProp=c(0,1,1,1,0,0)
df=data.frame(PROPERTYNAME,CITY,STATE,ID2,IsBuild,IsProp)
for(i in levels(as.factor(df$ID2))){
for(row in 1:nrow(df)){
df$Any_Build[row][i]<-ifelse(as.numeric(df$IsBuild[row][i])==1)
df$Any_Prop[row][i]<-ifelse(as.numeric(df$IsProp[row][i])==1)
}
}
I've tried nested for loops but have had no luck, and I am struggling with the apply functions in R. I would appreciate any help. Thank you!
If your main dataset is called D and the building data set is called B and the property dataset is called P, you can do the following:
D$inB <- D$ID2 %in% B$ID2
D$inP <- D$ID2 %in% P$ID2
If you want some data in B, like let's say an address, you can use merge:
D <- merge(D, B[c("ID2", "address")], by = "ID2", all.x = TRUE, all.y = FALSE)
If every row in B has an address, then the NAs in the new address column in D should coincide with the FALSEs in D$inB.
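As a small illustration of the steps above, here is a sketch on tiny made-up data sets. D, B and P follow the naming in this answer; the values, the address strings and the inBoth column are hypothetical.

```r
# Hypothetical stand-ins for the main (D), building (B) and property (P) data sets
D <- data.frame(ID2 = c(1, 2, 3, 4))
B <- data.frame(ID2 = c(1, 3), address = c("12 Vista Rd", "9 Apple St"))
P <- data.frame(ID2 = c(1, 2))

# Flag which IDs appear in each source
D$inB <- D$ID2 %in% B$ID2
D$inP <- D$ID2 %in% P$ID2
D$inBoth <- D$inB & D$inP   # hypothetical extra column: present in both sources

# Left join: keep every row of D and pull in the address where one exists
D <- merge(D, B[c("ID2", "address")], by = "ID2", all.x = TRUE)

# The NA addresses line up with inB == FALSE
all(is.na(D$address) == !D$inB)
```

Note that merge() returns one row per matching pair, so duplicated ID2 values in B would duplicate rows in D.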
How does ID2 affect the output? If it doesn't have any effect, you can use the same logic as in your example code without the loop. ifelse is vectorized, so you don't have to run it row by row.
Edited formatting:
LIHTCComp1$AnyBuild <- ifelse(LIHTCComp1$IsBuild ==1,TRUE,FALSE)
LIHTCComp1$AnyProp <- ifelse(LIHTCComp1$IsProp ==1,TRUE,FALSE)
Hope this helps.
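If ID2 does matter, one possible reading of the question is that each row should be flagged when any row with the same ID2 is a building (or property) record. The desired output was only posted as a picture, so that reading is an assumption. A minimal sketch using base R's ave(), with the column names Any_Build and Any_Prop taken from the question and Matched added as a hypothetical convenience:

```r
# Sample data from the question (only the columns needed here)
df <- data.frame(
  ID2     = c(1, 1, 1, 2, 3, 3),
  IsBuild = c(1, 0, 0, 0, 1, 1),
  IsProp  = c(0, 1, 1, 1, 0, 0)
)

# ave() applies FUN within each ID2 group and returns a vector aligned with the rows
df$Any_Build <- ave(df$IsBuild, df$ID2, FUN = max) == 1
df$Any_Prop  <- ave(df$IsProp,  df$ID2, FUN = max) == 1

# Hypothetical extra column: the ID has rows from both sources
df$Matched <- df$Any_Build & df$Any_Prop

df
```

No explicit loop is needed, which should also scale better to 300,000+ rows.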
Warning: Multi-part question!
I realize parts of this have been answered elsewhere but am struggling to bring them together in a nice parsimonious bit of code....
I have a data frame with a number (24) of numeric columns of interest. For each column, I want to create a new variable in the same data frame (named sensibly) in which the values correspond to the mean of the sex-specific decile for that variable (sex is in a different column, coded 0/1).
New column names from an original column called 'WBC' would be, for example: 'WBC_meandec_women' and 'WBC_meandec_men'.
I've tried various bits of code to first create new variables, then assign values related to the decile but none work well and can't figure out how to put it together. I just know there is a clever way to put all parts into the same code chunk, I'm just not fluent enough in R to get there...
dummydata <- data.frame(id=c(1:100),sex=rep(c(1,0),50),WBC=rnorm(100),RBC=rnorm(100))
Trying to achieve:
goaldata <- data.frame(id=c(1:100),sex=rep(c(1,0),50),WBC=rnorm(100),RBC=rnorm(100),WBC_decmean_women=rep(NA,nrow(dummydata)),WBC_decmean_men=rep(NA,nrow(dummydata)),RBC_decmean_women=rep(NA,nrow(dummydata)),RBC_decmean_men=rep(NA,nrow(dummydata)))
...but obviously with the correct values instead of NAs, and for a list of about 24 original variables.
Any help greatly appreciated!
Depending on if I understood you right, I'll propose this giant ball of duct tape...
# fake data
dummydata <- data.frame(id=c(1:100),sex=rep(c(1,0),50),WBC=rnorm(100),RBC=rnorm(100))
# a function to calculate decile means
decilemean <- function(x) {
  xrank <- rank(x)                                  # rank of each value (ties get averaged ranks)
  xdec <- floor((xrank - 1) / length(x) * 10) + 1   # decile (1 to 10) of each value
  ave(x, xdec)                                      # mean of the decile each value falls in
}
# looping through your data columns and making new columns
# (sex == 0 is assumed to mean women here)
newcol <- 5 # the first new column to create
for(j in c(3,4)) { # all of your columns to decilemean-ify
  dummydata[,newcol] <- NA
  dummydata[dummydata$sex==0,newcol] <- decilemean(dummydata[dummydata$sex==0,j])
  names(dummydata)[newcol] <- paste0(names(dummydata)[j],"_decmean_women")
  dummydata[,newcol+1] <- NA
  dummydata[dummydata$sex==1,newcol+1] <- decilemean(dummydata[dummydata$sex==1,j])
  names(dummydata)[newcol+1] <- paste0(names(dummydata)[j],"_decmean_men")
  newcol <- newcol+2
}
I'd recommend testing it though ;)
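For 24 variables it may be easier to loop over column names instead of column positions. The following is a sketch of that variant, not a tested drop-in: the vector vars and the sex labels are assumptions (0 = women, 1 = men, as in the loop above), and it uses ave() to compute the per-decile means.

```r
set.seed(1)
dummydata <- data.frame(id = 1:100, sex = rep(c(1, 0), 50),
                        WBC = rnorm(100), RBC = rnorm(100))

# Mean of the decile that each value falls in
decilemean <- function(x) {
  xdec <- floor((rank(x) - 1) / length(x) * 10) + 1
  ave(x, xdec)
}

vars <- c("WBC", "RBC")                  # extend to all 24 variable names
labels <- c("0" = "women", "1" = "men")  # assumed coding of sex

for (v in vars) {
  for (s in names(labels)) {
    newname <- paste0(v, "_decmean_", labels[[s]])
    rows <- dummydata$sex == as.numeric(s)
    dummydata[[newname]] <- NA_real_     # stays NA for the other sex
    dummydata[rows, newname] <- decilemean(dummydata[rows, v])
  }
}

names(dummydata)
```

A simple sanity check is that, within each sex, the mean of the new column equals the mean of the original variable.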
While trying to get my data fit for analysis, I can't seem to do this correctly. Suppose I have datasets in this form:
df1
V1 V2df1
a H
b Y
c Y
df2
V1 V2df2
a Y
j H
b Y
and three more (5 datasets of different lengths altogether). What I am trying to do is the following. First I must find all common elements in the first column (V1); in this case those are a and b. Then, based on those common elements, I'm trying to build a joined dataset where the values of V1 are common to all five datasets and the values from the other columns are appended in the same row. To explain with an example,
my result should look something like:
V1 V2df1 V2df2
a H Y
b Y Y
I managed to get some code working, but apparently the results are not correct. What I did:
I read the first column of each dataset into a variable (for example a<-df1[,1] and so on) and found the common elements like:
red<-Reduce(intersect, list(a,b,c,d,e))
then I filtered specific datasets like:
df1 <- unique(filter(df1, V1 %in% red))
I ordered every dataset according to row:
df1<-data.frame(df1[with(df1, order(V1)),])
and deleted duplicates(of elements in first column):
df1<- df1[unique(df1$V1),]
I then created a new dataset with:
newdata<-data.frame(V1common=df1[,1], V2df1=df1[,2],V2df2=df2[,2]...)
Here ... stands for all five datasets. I actually got the same number of rows (a good sign, since it matches the number of elements in the intersection), and then appended the other sorted columns, but something doesn't add up. Thanks for any advice. (I omitted the loading of libraries and such; the code is for illustrative purposes.)
You can use join_all from the plyr package:
library(plyr)
df <- join_all(list(df1,df2,df3,df4, df5), by = 'V1', type = 'inner')
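If you would rather avoid the extra package, a base R alternative is to chain merge() calls with Reduce(). This is a sketch using only the two example data sets from the question; the remaining three would simply be added to the list.

```r
# The two example data sets from the question
df1 <- data.frame(V1 = c("a", "b", "c"), V2df1 = c("H", "Y", "Y"))
df2 <- data.frame(V1 = c("a", "j", "b"), V2df2 = c("Y", "H", "Y"))

# merge() performs an inner join by default, so only V1 values
# present in every data set survive the chain
joined <- Reduce(function(x, y) merge(x, y, by = "V1"), list(df1, df2))

joined
```

Because the rows are matched on V1 rather than on position, there is no need to sort each data set first. Duplicated V1 values would produce one row per combination, so it may be worth removing duplicates beforehand.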
I have a data frame with 251 observations and 45 variables. There are 6 observations in the middle of the data frame that I'd like to exclude from my analyses. All 6 belong to one level of a factor. It is easy to generate a new data frame that, when printed, appears to exclude the 6 observations. When I use the new data frame to plot variables by the factor in question, however, the supposedly excluded level is still included in the plot (sans observations). Using str() confirms that the level is still present in some form. Also, the index for the new data frame skips 6 values where the observations formerly resided.
How can I create a new data frame that excludes the 6 observations and does not continue to recognize the excluded factor level when plotting? Can the new data frame be made to "re-index", so that the new index does not skip values formerly assigned to the excluded factor level?
I've provided an example with made up data:
# ---------------------------------------------
# data
char <- c( rep("anc", 4), rep("nam", 3), rep("oom", 5), rep("apt", 3) )
a <- 1:15 / pi
b <- seq(1, 8, .5)
d <- rep(c(3, 8, 5), 5)
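# stringsAsFactors = TRUE is needed in R >= 4.0.0 for char to become a factor
dat <- data.frame(char, a, b, d, stringsAsFactors = TRUE)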
dat
# two ways to remove rows that contain a string
datNew1 <- dat[-which(dat$char == "nam"), ]
datNew1
datNew2 <- dat[grep("nam", dat[ ,"char"], invert=TRUE), ]
datNew2
# plots still contain the factor level that was excluded
boxplot(datNew1$a ~ datNew1$char)
boxplot(datNew2$a ~ datNew2$char)
# str confirms that it's still there
str(datNew1)
str(datNew2)
# ---------------------------------------------
You can use the drop.levels() function from the gdata package to reduce the factor levels down to the actually used ones -- apply it on your column after you created the new data.frame.
Also try a search for r and drop.levels here (but you need to make the search term [r] drop.levels which I can't here as it interferes with the formatting logic).
Starting with R version 2.12.0, there is a function droplevels, which can be applied either to factor columns or to the entire dataframe. When applied to the dataframe, it will remove zero-count levels from all factor columns. So your example will become simply:
# two ways to remove rows that contain a string
datNew1 <- droplevels( dat[-which(dat$char == "nam"), ] )
datNew2 <- droplevels( dat[grep("nam", dat[ ,"char"], invert=TRUE), ] )
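The question also asks about "re-indexing". The skipped values are the row names carried over from the original data frame, and setting them to NULL renumbers the rows. A minimal sketch on a reduced version of the example data (char is created as a factor explicitly here):

```r
# Reduced version of the example data
char <- c(rep("anc", 4), rep("nam", 3), rep("oom", 5), rep("apt", 3))
dat <- data.frame(char = factor(char), a = 1:15 / pi)

# Remove the "nam" rows and drop the now-empty factor level
datNew <- droplevels(dat[dat$char != "nam", ])

# Reset the row names so they run from 1 to nrow(datNew) without gaps
rownames(datNew) <- NULL

levels(datNew$char)
rownames(datNew)
```

Plots such as boxplot(a ~ char, data = datNew) should then show only the remaining levels.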
I have pasted something from my own code. I have an enclosure experiment in a lake, with measurements from both the enclosures and the lake, but mostly I don't want to deal with the lake data.
My variable is called "t.level" and its levels are control, low, medium, high and lake.
This code creates nolk, so that nolk$ or data=nolk gives the data without the "lake" level:
nolk <- subset(mylakedata, t.level == "control" |
                           t.level == "low" |
                           t.level == "medium" |
                           t.level == "high")
# Drop the unused "lake" level from every factor column
nolk[] <- lapply(nolk, function(col) if (is.factor(col)) col[drop = TRUE] else col)