Most efficient way of ordering columns and creating rank variables - r

I have a data frame with several columns. I want to create a function or loop (or whatever is more efficient) that takes the data frame, orders it by a column, creates a rank variable based on that order (with a name like rank_columnname), and adds it to the data frame.
dat <- data.frame(indi1=rnorm(10),indi2=rnorm(10))
dat1 <- dat[order(dat$indi1), ]
dat1$rank_indi1 <- 1:nrow(dat1)
dat2 <- dat1[order(dat1$indi2), ]
dat2$rank_indi2 <- 1:nrow(dat2)
This example does what I want, but in a cumbersome way. I've tried using lapply but I can't seem to update the data frame with a new column with a similar name.
Any help is appreciated.

Here's a simple loop to insert in "rank_indi" variables:
for(i in names(dat)){
dat[order(dat[,i]),paste0("rank_", i)] <- 1:nrow(dat)
}
dat
indi1 indi2 rank_indi1 rank_indi2
1 1.45829065 -0.3322692 10 2
2 0.55972129 2.5031318 7 10
3 0.45870293 -0.6216859 6 1
4 1.03814922 1.4284271 9 8
5 -0.75211259 0.5600499 3 4
6 -1.89298552 0.8047825 2 6
7 0.03843679 0.6593377 5 5
8 -0.09808913 0.2513729 4 3
9 0.97862797 2.2650003 8 9
10 -2.07767889 1.0684134 1 7
edit: made a mistake in the earlier code
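As a loop-free variant of the same idea, the sketch below applies rank() to every column at once. It assumes that ties should be broken by order of appearance (ties.method = "first"), which is what the order()-based loop does implicitly; the set.seed call and the cols variable are only there for illustration.

```r
# Illustrative data; the seed is an assumption so the example is repeatable
set.seed(42)
dat <- data.frame(indi1 = rnorm(10), indi2 = rnorm(10))

# Rank each column and add the results as rank_<columnname> columns.
# ties.method = "first" reproduces the ranks implied by order().
cols <- names(dat)
dat[paste0("rank_", cols)] <- lapply(dat[cols], rank, ties.method = "first")

dat
```

Because the original column names are captured in cols before the assignment, the new rank columns are not themselves ranked.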

Related

sum up certain variables (columns) by variable names

I want to sum up certain variables (columns in a data frame).
I would like to select those variables by parts of their names.
The complication is that I have several conditions, so a single contains from dplyr does not work.
Here is an example:
ab_yy <- c(1:5)
bc_yy <- c(5:9)
cd_yy <- c(2:6)
de_xx <- c(3:7)
dat <- data.frame(ab_yy,bc_yy,cd_yy,de_xx)
dat
ab_yy bc_yy cd_yy de_xx
1 1 5 2 3
2 2 6 3 4
3 3 7 4 5
4 4 8 5 6
5 5 9 6 7
#sum up all variables that contain yy and certain extra conditions
#may look something like this: rowSums(select(dat, contains(("yy&ab")|("yy&bc")) ) )
desired result:
6 8 10 12 14
EDIT: Fixed, sorry, low on caffeine
If you want to use dplyr, try using matches:
library(dplyr)
dat %>%
select(matches("yy$")) %>%
select(matches("^ab|^bc")) %>%
rowSums(.)
[1] 6 8 10 12 14
I don't think it's the best way, but you can do it with grepl:
rowSums(dat[, grepl(pattern = "ab.*yy|bc.*yy", colnames(dat))])
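When the conditions get more involved than one regular expression can comfortably express, each condition can be tested separately and the logical vectors combined with & and |. The following is a minimal base R sketch under that assumption; the variable names nm, keep and total are illustrative.

```r
dat <- data.frame(ab_yy = 1:5, bc_yy = 5:9, cd_yy = 2:6, de_xx = 3:7)

nm <- names(dat)
# Condition 1: name ends in "yy"; condition 2: name starts with "ab" or "bc"
keep <- grepl("yy$", nm) & grepl("^(ab|bc)", nm)

# drop = FALSE keeps a data frame even if only one column matches
total <- rowSums(dat[, keep, drop = FALSE])
total
```

Keeping the conditions as separate grepl calls makes it easy to add or remove one without rewriting a combined pattern.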

Import multiple data frames CSV - column separation

I have a csv file with multiple data frames that are all separated by a column (So 4 columns of data, empty column, 4 columns of data, etc.). Is there a nice way to read in the file and have R create a separate df for each of those contiguous sets of columns? Then I would be able to use lapply across all of these dfs.
Thanks for your help.
Read in the whole csv file, then use lapply to separately capture each four-column data frame into a list. Then use rbind to stack all the data frames into a single data frame.
dat = read.csv("YourFile.csv")
# Set this based on how many separate data frames are in your csv file
num.df = ncol(dat)/5 # Per @zx8754's comment
# This will tell the function the column numbers where
# each data frame starts
start.cols = seq(1, 1 + 5*(num.df-1), 5)
df.list = lapply(start.cols, function(x) {
# Capture the next 4 columns
df = dat[, x:(x+3)]
# Use whatever names are appropriate here. This is just
# to make sure all of the data frames have the same column names
# so that rbind won't throw an error
names(df) = c(paste0("col", 1:4))
return(df)
})
# rbind all the data frames into a single data frame
df = do.call(rbind, df.list)
You can take advantage of colClasses:
Example data:
h1 h2 h3 h1.1 h2.1 h3.1 h1.2 h2.2 h3.2
1 1 6 3 1 8 8 1 5 2
2 2 1 1 6 5 8 1 3 1
3 3 2 6 1 2 3 1 2 5
Then you can loop through the number of data frames you want and read the file:
ngroups <- 3 #number of dataframes to read
datacols <- 3 #number of columns to read
fulldata <- list()
for (i in 1:ngroups) {
nskip <- (datacols+1)*(i-1)
cols.to.read <- c(rep("NULL", nskip), rep(NA, datacols), rep("NULL", (datacols+1)*(ngroups-i+1)-1)) #creates a list of NULLs and NAs. NULLs = don't read, NA = read
fulldata[[i]] <- read.csv("test.csv", colClasses=cols.to.read)
}
Result:
fulldata
[[1]]
h1 h2 h3
1 1 6 3
2 2 1 1
3 3 2 6
[[2]]
h1.1 h2.1 h3.1
1 1 8 8
2 6 5 8
3 1 2 3
[[3]]
h1.2 h2.2 h3.2
1 1 5 2
2 1 3 1
3 1 2 5
This works, but I believe the answers reading the file only once would be faster, since reading the same file over and over again doesn't sound like the optimal procedure.
First read all your data into one large data frame:
maindf <- read.csv(yourfile)
Let's say n is the number of data frames inside your csv file. Each block is 4 data columns followed by 1 empty separator column, so the blocks start every 5 columns:
for (i in 0:(n-1)){
assign(paste0("df",i+1),maindf[,(1+5*i):(4+5*i)])
}
The result should be n data frames that can be accessed like this: df1, df2, ..., dfn.
I didn't test it, because no sample data was provided.
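If the block width is not known in advance, a single read can still be split by detecting the empty separator columns. The sketch below assumes a separator column is one that is entirely NA after reading; csv_text is made-up inline data standing in for the real file, and is_sep, group and df.list are illustrative names.

```r
# Made-up CSV: two blocks of two columns separated by an empty column
csv_text <- "a,b,,c,d\n1,2,,3,4\n5,6,,7,8"
dat <- read.csv(text = csv_text, header = TRUE)

# A separator column is assumed to contain only NA values
is_sep <- vapply(dat, function(col) all(is.na(col)), logical(1))

# Number the blocks: the counter increases each time a separator is passed
group <- cumsum(is_sep)[!is_sep]

# split.default() splits a data frame by columns, giving a list of data frames
df.list <- split.default(dat[!is_sep], group)
df.list
```

The resulting list can be passed straight to lapply, and the file is read only once.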

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets?
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
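If your data is truly a matrix and not a data.frame, you can work around this too. A matrix has no $ accessor and (usually) no row names, so index the ID column by name and convert the sampled values back to numbers:
dat[as.numeric(tapply(as.character(seq(nrow(dat))),dat[,"ID"],FUN=sample,1)),]
Don't be tempted to remove the as.character: when sample is passed a single number n, it samples from 1:n instead of returning n, which gives unintended results for groups with only one row. E.g.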
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
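One way to sidestep the single-value behaviour of sample is to sample a position within each group rather than the value itself. This is a small sketch under that assumption, using the example data from the question; pick and res are illustrative names and the seed is arbitrary.

```r
dat <- data.frame(ID = c(1, 2, 2, 3, 4, 4), Value = c(10, 5, 8, 15, 7, 9))

set.seed(1)  # arbitrary seed, only for repeatability
# For each ID, choose one position among that group's row indices.
# sample.int(length(idx), 1) is safe even when the group has a single row.
pick <- tapply(seq_len(nrow(dat)), dat$ID,
               function(idx) idx[sample.int(length(idx), 1)])

res <- dat[pick, ]
res
```

Since the row indices stay numeric, the same indexing works for a matrix as long as the grouping column is passed as dat[, "ID"].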
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

Data frame "expand" procedure in R?

This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.
Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp and x is the value):
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:
test.expanded <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
t=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9),
x=c(1,0,2,1,2,0,0,2,0,0,0,0,0,1,1,0,2,1,1,3))
Zeroes have been inserted for all missing values of t. This makes it easier to use.
I have a quick and dirty implementation which sorts the dataframe and loops through each of its lines, adding missing lines one at a time. But I'm not entirely satisfied by the solution. Is there a better way to do it?
For those who are familiar with SAS, it is similar to the proc expand.
Thanks!
As you noted in a comment to the other answer, doing it by group is easy with plyr which just leaves how to "fill in" the data sets. My approach is to use merge.
library("plyr")
test.expanded <- ddply(test, c("a","b"), function(DF) {
DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
DF[is.na(DF$x),"x"] <- 0
DF
})
merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.
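The same merge idea can be sketched in base R without plyr by building the full grid of groups and timestamps first. This assumes every combination of the observed a and b values is a real group, which holds for the example data but may not in general; grid and out are illustrative names.

```r
test <- data.frame(
  a = c(1,1,1,1,1,1,1,1,1,1,1),
  b = c(1,1,1,1,1,2,2,2,2,2,2),
  t = c(0,2,3,4,7,3,4,6,7,8,9),
  x = c(1,2,1,2,2,1,1,2,1,1,3))

# Every (a, b, t) combination that should exist in the expanded data
grid <- expand.grid(t = 0:9, b = unique(test$b), a = unique(test$a))

# Left join onto the grid: combinations missing from test get x = NA
out <- merge(grid, test, all.x = TRUE)
out$x[is.na(out$x)] <- 0

# Restore the group-by-group ordering and the original column order
out <- out[order(out$a, out$b, out$t), c("a", "b", "t", "x")]
rownames(out) <- NULL
out
```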
This is convoluted but works fine:
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
test
#------------
a b t x
1 1 1 0 1
2 1 1 2 2
3 1 1 3 1
4 1 1 4 2
5 1 1 7 2
6 1 2 3 1
7 1 2 4 1
8 1 2 6 2
9 1 2 7 1
10 1 2 8 1
11 1 2 9 3
12 NA NA 1 NA
13 NA NA 5 NA
Not sure if you want it sorted by t afterwards or not. If so, easy enough to do:
https://stackoverflow.com/a/6871968/636656

Using loop variables

I would like to rename a large number of columns (column headers) to have numerical names rather than combined letter+number names. Because of the way the raw data is stored, I cannot reach the correct column by position with data[[152]] when I want a specific question (some questions are filtered out of the data entirely because they are long-answer comments), but I'd like to be able to access them with data$152. Additionally, approximately half the column names in my data have loaded with class(data$152) = NULL but class(data[[152]]) = integer (and if I rename the data[[152]] column, class(data$152) is then correctly reported as integer).
Thus, is there a way to use the loop iteration number as a column name (something like below)
for (n in 1:415) {
names(data)[n] <-"n" # name nth column after number 'n'
}
That will reassign all my column headers and ensure that I do not run into question classes resulting in null?
As additional background info, my data is imported from a comma delimited .csv file with the value 99 assigned to answers of NA with the first row being the column names/headers
data <- read.table("rawdata.csv", header=TRUE, sep=",", na.strings = "99")
There are 415 columns with headers in format Q001, Q002, etc
There are approximately 200 rows with no row labels/no label column
You can do this without a loop, as follows:
names(data) <- 1:415
Let me illustrate with an example:
dat <- data.frame(a=1:4, b=2:5, c=3:6, d=4:7)
dat
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Now rename the columns:
names(dat) <- 1:4
dat
1 2 3 4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
EDIT : How to access your new data
@Ramnath points out very accurately that you won't be able to access your data using dat$1:
dat$1
Error: unexpected numeric constant in "dat$1"
Instead, you will have to wrap the column names in backticks:
dat$`1`
[1] 1 2 3 4
Alternatively, you can use a combination of character and numeric data to rename your columns. This could be a much more convenient way of dealing with your problem:
names(dat) <- paste("x", 1:4, sep="")
dat
x1 x2 x3 x4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
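Since the question mentions that some questions are filtered out, naming columns by position may not line up with the question numbers. A possible alternative, sketched here on a made-up three-column data frame, is to derive the names from the existing Q001-style headers by stripping the prefix and leading zeros; the data below is hypothetical.

```r
# Hypothetical data with gaps in the question numbering
data <- data.frame(Q001 = 1:3, Q003 = 4:6, Q152 = 7:9)

# Drop the leading "Q", then convert through integer to remove leading zeros
names(data) <- as.character(as.integer(sub("^Q", "", names(data))))

names(data)
data[["152"]]   # access by question number, as a character name
data$`152`      # equivalent, using backticks
```

Note that data[["152"]] with a character name selects the column called 152, whereas data[[152]] with a number would look for the 152nd column.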
