Order a data set - r

I have a list of data frames that I want to order by 3 columns.
I've tried applying an anonymous function:
mylist<-lapply(mylist, function (x) x[order((data[,col1]),(data$namecol2),na.last=NA),])
I've also tried a loop:
for (i in 1:length(mylist)) {
list_sorted <-mylist[[i]][order((data[,col1]),(data$namecol2),na.last=NA),]
}
Either way I get a list of data frames that are full of NAs, although the originals were not. It is this step that creates the NAs: I checked the step before and it returns my data frames full of values.
I don't know what I'm doing wrong. Any tips?
Thank you.

I guess you have a list of data frames and want to sort each data frame by one of its own columns.
The example below uses a list of two data frames, each with two columns ("x" and "y"), and sorts each one by column "x" in descending order. I hope this gives you an idea of how to accomplish what you want.
x <- rep(1:5)
y <- rnorm(5)
dfrm <- data.frame(x,y)
str(dfrm)
names(dfrm)
listd <- list(dfrm,dfrm)
str(listd)
listsorted <- lapply(listd, function(z) z[with(z,order(x,decreasing=TRUE,na.last=NA)),])
listsorted
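A likely cause of the NA rows in the question is that order() is computed on an outside object (data) rather than on the data frame being sorted, so the row indices need not match; the for loop also overwrites list_sorted on every iteration. The following is a minimal sketch of a multi-column sort that refers only to the function argument. The data and the column names col1, col2 and col3 are made up for illustration.

```r
# Hypothetical data: two data frames sharing the columns col1, col2, col3
df1 <- data.frame(col1 = c(2, 1, 2, NA),
                  col2 = c("b", "a", "a", "c"),
                  col3 = 4:1)
mylist <- list(df1, df1[4:1, ])

# Build the ordering from the argument `d` itself, never from an outside object.
# na.last = NA drops rows that have an NA in any of the sort keys.
mylist_sorted <- lapply(mylist, function(d) {
  d[order(d$col1, d$col2, d$col3, na.last = NA), ]
})
mylist_sorted
```

Because the indices returned by order() always come from the same data frame they are applied to, no out-of-range rows (and hence no all-NA rows) can appear.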

Related

Find difference of same column names across different data frames in a list in R

I have a list of data frames with the same column names, where each data frame corresponds to a month:
June_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(100,200,250,450), Metric2=c(1000,2000,5000,6000))
July_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(140,250,125,400), Metric2=c(2000,3000,2000,3000))
Aug_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(200,150,250,600), Metric2=c(1500,2000,4000,2000))
Sep_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(500,500,1000,100), Metric2=c(500,4000,6000,8000))
lst1 <- list(Aug_2018,June_2018,July_2018,Sep_2018)
names(lst1) <- c("Aug_2018","June_2018","July_2018","Sep_2018")
I intend to create two new columns in each data frame in the list, named Percent_Change_Metric1 and Percent_Change_Metric2, using the calculation below:
for (i in names(lst1)){
lst1[[i]]$Percent_Change_Metric1 <- ((lst1[[i+1]]$Metric1-lst1[[i]]$Metric1)*100/lst1[[i]]$Metric1)
lst1[[i]]$Percent_Change_Metric2 <- ((lst1[[i+1]]$Metric2-lst1[[i]]$Metric2)*100/lst1[[i]]$Metric2)
}
However, the i in the for loop iterates over names(lst1), so i+1 obviously doesn't work.
Also, the data frames in my list are in random order, not ordered by month-year, so subtracting the columns of successive data frames wouldn't give the right result.
Please advise:
1. How do I go about adding Percent_Change_Metric1 and Percent_Change_Metric2?
2. How do I choose the data frame corresponding to the next month to arrive at the correct Percent_Change?
Thanks for the guidance.
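Here is one option with base R. It pairs each list element with the one that follows it, so it assumes the list is already in chronological order: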
lst1[-length(lst1)] <- Map(function(x, y)
transform(y, Percent_Change_Metric1 = (x$Metric1 - Metric1) * 100/Metric1,
Percent_Change_Metric2 = (x$Metric2 - Metric2) * 100/Metric2),
lst1[-1], lst1[-length(lst1)])
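The second part of the question (putting the list into month order first) could be handled along these lines. This is only a sketch: it assumes every list name has the form Month_Year with an English month name or abbreviation, and that the rows of all data frames are in the same Features order. The small list below is a made-up stand-in for the real data.

```r
# Hypothetical small list, deliberately out of calendar order
lst1 <- list(
  Aug_2018  = data.frame(Features = c("abc", "def"), Metric1 = c(200, 150)),
  June_2018 = data.frame(Features = c("abc", "def"), Metric1 = c(100, 200)),
  July_2018 = data.frame(Features = c("abc", "def"), Metric1 = c(140, 250))
)

# Derive year and month number from names such as "June_2018"
yr  <- as.integer(sub(".*_", "", names(lst1)))
mon <- match(substr(names(lst1), 1, 3), month.abb)
lst1 <- lst1[order(yr, mon)]

# Pair each month with the following one; the last month has no successor
n <- length(lst1)
lst1[-n] <- Map(function(cur, nxt) {
  cur$Percent_Change_Metric1 <- (nxt$Metric1 - cur$Metric1) * 100 / cur$Metric1
  cur
}, lst1[-n], lst1[-1])
lst1
```

Matching on the first three letters against month.abb avoids depending on the locale-specific behaviour of date parsing.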

How to do a complex edit of columns of all data frames in a list?

I have a list of 185 data frames called WaFramesNumeric. Each dataframe has several hundred columns and thousands of rows. I want to edit every data frame, so that it leaves all numeric columns as well as any non-numeric columns that I specify.
Using:
for(i in seq_along(WaFramesNumeric)) {
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][,sapply(WaFramesNumeric[[i]],is.numeric)]
}
successfully makes each dataframe contain only its numeric columns.
I've tried to amend this with lines to add specific columns. I have tried:
for (i in seq_along(WaFramesNumeric)) {
a <- WaFramesNumeric[[i]]$Device_Name
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][,sapply(WaFramesNumeric[[i]],is.numeric)]
cbind(WaFramesNumeric[[i]],a)
}
and in an attempt to call the column numbers of all integer columns as well as the specific ones and then combine based on that:
for (i in seq_along(WaFramesNumeric)) {
f <- which(sapply(WaFramesNumeric[[i]],is.numeric))
m <- match("Cost_Center",colnames(WaFramesNumeric[[i]]))
n <- match("Device_Name",colnames(WaFramesNumeric[[i]]))
combine <- c(f,m,n)
WaFramesNumeric[[i]][,i,combine]
}
These all return errors and I am stumped as to how I could do this. WaFramesNumeric is a copy of another list of dataframes (WaFramesNumeric <- WaFramesAll) and so I also tried adding the specific columns from the WaFramesAll but this was not successful.
I appreciate any advice you can give and I apologize if any of this is unclear.
You are mistakenly assuming that the last command in a for loop is meaningful. It is not. Since you never assign its result anywhere (the cbind and the indexing of WaFramesNumeric...), it is silently discarded.
Additionally, you are over-indexing your data.frame in the third code block. First, you use i inside the data.frame index, even though i is an index into the list of data.frames, not into the frame itself. Second (perhaps caused by this), you are trying to index three dimensions of a 2D frame. Just change the last indexing from [,i,combine] to either [,combine] or [combine].
A third problem (though perhaps not seen yet) is that match will return NA if nothing is found. Indexing a frame with an NA returns an error (try mtcars[,NA] to see). I suggest replacing match with grep: it returns integer(0) when nothing is found, which is what you want in this case. Note that grep matches patterns, so it also picks up any column whose name merely contains the string.
for (i in seq_along(WaFramesNumeric)) {
f <- which(sapply(WaFramesNumeric[[i]], is.numeric))
m <- grep("Cost_Center", colnames(WaFramesNumeric[[i]]))
n <- grep("Device_Name", colnames(WaFramesNumeric[[i]]))
combine <- c(f,m,n)
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][combine]
}
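If partial matches are a concern, the same selection can be written with exact name matching and lapply. This is a sketch rather than a drop-in replacement: the two small data frames and the extra column Cost_Center_Old are invented to show the difference from grep.

```r
# Hypothetical stand-in for the real list of data frames
df <- data.frame(a = 1:3,
                 Device_Name = c("x", "y", "z"),
                 Cost_Center_Old = c("p", "q", "r"),
                 stringsAsFactors = FALSE)
WaFramesNumeric <- list(df, df[c("a", "Device_Name")])

keep <- c("Cost_Center", "Device_Name")  # names to keep, matched exactly

WaFramesNumeric <- lapply(WaFramesNumeric, function(d) {
  # TRUE for numeric columns and for columns whose name is in `keep`
  wanted <- vapply(d, is.numeric, logical(1)) | names(d) %in% keep
  d[wanted]
})
lapply(WaFramesNumeric, names)
```

A name that is absent from a data frame simply contributes no TRUE values, so there is no NA index to guard against, and Cost_Center_Old is not selected by accident.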
I'm not sure what you mean by "an attempt to call the column numbers of all integer columns...", but if you want to go through a list of data frames, select the columns that satisfy some function, and also keep columns with given names, you can do it like this:
df <- data.frame(a=rnorm(20), b=rnorm(20), c=letters[1:20], d=letters[1:20], stringsAsFactors = FALSE)
WaFramesNumeric <- rep(list(df), 2)
Selector <- function(data, select_func, select_names) {
select_func <- match.fun(select_func)
idx_names <- match(select_names, colnames(data))
idx_names <- idx_names[!is.na(idx_names)]
idx_func <- which(sapply(data, select_func))
idx <- unique(c(idx_func, idx_names))
return(data[, idx, drop = FALSE])  # drop = FALSE keeps a data frame even if only one column is selected
}
res <- lapply(X = WaFramesNumeric, FUN = Selector, select_names=c("c"), select_func = is.numeric)

Create new sequentially named variables and fill with mean of level

Warning: Multi-part question!
I realize parts of this have been answered elsewhere but am struggling to bring them together in a nice parsimonious bit of code....
I have a data frame with a number (24) of numeric columns of interest. For each column, I want to create a new variable in the same data frame (named sensibly) in which the values correspond to the mean of the sex-specific decile for that variable (sex is in a different column, coded 0/1).
New column names from an original column called 'WBC' would be, for example: 'WBC_meandec_women' and 'WBC_meandec_men'.
I've tried various bits of code to first create new variables and then assign values related to the decile, but none work well and I can't figure out how to put it together. I just know there is a clever way to put all parts into the same code chunk; I'm just not fluent enough in R to get there...
dummydata <- data.frame(id=c(1:100),sex=rep(c(1,0),50),WBC=rnorm(100),RBC=rnorm(100))
Trying to achieve:
goaldata <- data.frame(id=c(1:100),sex=rep(c(1,0),50),WBC=rnorm(100),RBC=rnorm(100),WBC_decmean_women=rep(NA,nrow(dummydata)),WBC_decmean_men=rep(NA,nrow(dummydata)),RBC_decmean_women=rep(NA,nrow(dummydata)),RBC_decmean_men=rep(NA,nrow(dummydata)))
...but obviously with the correct values instead of NAs, and for a list of about 24 original variables.
Any help greatly appreciated!
Depending on if I understood you right, I'll propose this giant ball of duct tape...
# fake data
dummydata <- data.frame(id=c(1:100),sex=rep(c(1,0),50),WBC=rnorm(100),RBC=rnorm(100))
# a function to calculate decile means
decilemean <- function(x) {
xrank <- rank(x)
xdec <- floor((xrank-1)/length(x)*10)+1
decmeans <- as.numeric(tapply(x,xdec,mean))
xdecmeans <- decmeans[xdec]
return(xdecmeans)
}
# looping thru your data columns and making new columns
newcol <- 5 # the first new column to create
for(j in c(3,4)) { # all of your columns to decilemean-ify
dummydata[,newcol] <- NA
dummydata[dummydata$sex==0,newcol] <- decilemean(dummydata[dummydata$sex==0,j])
names(dummydata)[newcol] <- paste0(names(dummydata)[j],"_decmean_women")
dummydata[,newcol+1] <- NA
dummydata[dummydata$sex==1,newcol+1] <- decilemean(dummydata[dummydata$sex==1,j])
names(dummydata)[newcol+1] <- paste0(names(dummydata)[j],"_decmean_men")
newcol <- newcol+2
}
I'd recommend testing it though ;)
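A more compact variant of the same idea uses ave(), which returns group-wise results in the original row order. Treat it as a sketch under assumptions: the decile formula below (ceiling of rank times 10 over n) is one common convention and differs slightly in form from the one above, the seed is arbitrary, and the coding sex == 0 for women simply follows the loop above.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
dummydata <- data.frame(id = 1:100, sex = rep(c(1, 0), 50),
                        WBC = rnorm(100), RBC = rnorm(100))

# Decile index (1..10) of each value within the vector it is given
decile <- function(x) ceiling(rank(x) * 10 / length(x))

for (v in c("WBC", "RBC")) {           # extend to all 24 variables of interest
  # decile computed separately within each sex
  dec <- ave(dummydata[[v]], dummydata$sex, FUN = decile)
  # mean of each sex-by-decile cell, aligned with the original rows
  m <- ave(dummydata[[v]], dummydata$sex, dec, FUN = mean)
  dummydata[[paste0(v, "_decmean_women")]] <- ifelse(dummydata$sex == 0, m, NA)
  dummydata[[paste0(v, "_decmean_men")]]   <- ifelse(dummydata$sex == 1, m, NA)
}
str(dummydata)
```

Addressing columns by name rather than by position means the loop does not need to track where the next new column goes.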

match common rows between different dataframes in a new organized df

Can someone help me match three or more different ranked data frames to get a final one containing only the rows common to all of them? I have been trying the match and merge functions, but I can't get any further.
Here is what the data may look like:
A <- data.frame(letter=LETTERS[sample(10)], x=runif(10))
B <- data.frame(letter=LETTERS[sample(10)], x=runif(10))
C <- data.frame(letter=LETTERS[sample(10)], x=runif(10))
However, in my real data "letter" is the row names, and each df has only one column, the numeric "x" holding the ranked values.
There are not many details, but I will try to suggest a basic approach. The function below tests whether the two arguments, taken from dataFrame1 and dataFrame2, are identical. If the test returns TRUE, the common value is stored in a new dataFrame3. The index in the square brackets represents the rows that you would like to test.
matching_row <- function(x, y) {
if (identical(x, y)) {
dataFrame3 <- x
}
}
dataFrame3 <- matching_row(dataFrame1$x[row], dataFrame2$x[row])
You can adapt the function to the characteristics of your data, e.g. by adding a loop if the data frames are quite big, or by using stricter or more flexible logical conditions to test the identity between data frames.
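Since the question mentions merge, another way to keep only the rows common to all data frames is to fold merge() over the list with Reduce(). The sketch below uses small deterministic data frames invented for illustration and assumes the key is in a column called letter; if the letters are row names, they would first have to be copied into such a column (for example with rownames()).

```r
# Hypothetical ranked data frames sharing only some letters
A <- data.frame(letter = rev(LETTERS[1:8]), x = 1:8)
B <- data.frame(letter = LETTERS[3:10],     x = 1:8)
C <- data.frame(letter = c("E", "A", "C", "I", "D", "G"), x = 1:6)
dfs <- list(A = A, B = B, C = C)

# Give each value column a distinct name so merge() does not invent suffixes
renamed <- Map(function(d, nm) setNames(d, c("letter", paste0("x_", nm))),
               dfs, names(dfs))

# An inner join on "letter", applied pairwise across the whole list
common <- Reduce(function(a, b) merge(a, b, by = "letter"), renamed)
common
```

merge() performs an inner join by default, so only letters present in every data frame survive, and the same code works unchanged for more than three data frames.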

Combine several columns under same name

I am trying to get the mvr function in the R-package pls to work. When having a look at the example dataset yarn I realized that all 268 NIR columns are in fact treated as one column:
library(pls)
data(yarn)
head(yarn)
colnames(yarn)
I need the same structure to use the function with my data (so that a multivariate dataset is treated as one entity), but I have no idea how to achieve that. I tried
TT<-matrix(NA, 2, 3)
colnames(TT)<-rep("NIR", ncol(TT))
TT
colnames(TT)
You will notice that while all columns have the same heading, colnames(TT) shows a vector of length three, because each column is treated separately. What I would need is what can be found in yarn, that the colname "NIR" occurs only once and applies columns 1-268 alike.
Does anybody know how to do that?
You can just assign the matrix to a column of a data.frame:
TT <- matrix(1:6, 2, 3 )
# assign to an existing dataframe
out <- data.frame(density = 1:nrow(TT))
out$NIR <- TT
str(out)
# assign to empty dataframe
out <- data.frame(matrix(integer(0), nrow=nrow(TT)))
out$NIR <- TT
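The same structure can also be built in a single call by wrapping the matrix in I(), which tells data.frame() to keep it as one column instead of splitting it. This is a small sketch; the density values are placeholders, not real data.

```r
TT <- matrix(1:6, 2, 3)

# I() protects the matrix so it is stored as a single column named NIR
out <- data.frame(density = c(0.5, 0.7), NIR = I(TT))

colnames(out)   # one name for the whole matrix
dim(out$NIR)    # the matrix keeps its own dimensions
```

With this layout a formula such as density ~ NIR refers to the whole matrix at once, which is how the yarn example data is organised.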
