select columns after named columns - r

I have a data frame of the following form in R
First  a  b  c  Second  a  b  c
       3  8  1          7  6  8
       5  9  4          2  8  5
I'm trying to write something that selects the three columns following "First" & "Second", and puts them into new data frames titled "First" & "Second" respectively. I'm thinking of using the strategy below (where df is the dataframe I outline above), but am unsure how to make it such that R takes the columns that follow the ones I specify
names <- c("First", "Second")
for (i in names) {
i <- (something to specify the 3 columns following df$i)
}

One option is to use split.default to split the data.frame by columns into a list of data.frames:
split.default(df, cumsum(names(df) %in% names))
#$`1`
# First a b c
#1 NA 3 8 1
#2 NA 5 9 4
#
#$`2`
# Second a b c
#1 NA 7 6 8
#2 NA 2 8 5
The expression cumsum(...) creates the indices according to which to group and split columns.
Sample data
df <- read.table(text = "First a b c Second a b c
'' 3 8 1 '' 7 6 8
'' 5 9 4 '' 2 8 5", header = T, check.names = F)
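To make the grouping step visible, here is a small sketch that prints the index produced by cumsum and then drops the empty marker column from each piece. The variable names markers, is_marker, grp and parts are illustrative choices, not part of the answer above (markers is used instead of names to avoid shadowing the base function).

```r
# Sample data from the answer above ('' is read as NA in the marker columns)
df <- read.table(text = "First a b c Second a b c
'' 3 8 1 '' 7 6 8
'' 5 9 4 '' 2 8 5", header = TRUE, check.names = FALSE)

markers <- c("First", "Second")

# TRUE at each marker column, FALSE elsewhere
is_marker <- names(df) %in% markers

# The running count of markers seen so far gives one group id per column
grp <- cumsum(is_marker)
print(grp)

# Split by column group, then drop the leading marker column of each piece
parts <- lapply(split.default(df, grp), function(p) p[-1])
names(parts) <- names(df)[is_marker]
print(parts$First)
```

Because the group id only changes at a marker, this also works when the markers are followed by different numbers of columns.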

You can find the positions of the names vector within the column names of the data and, for each position, subset the next 3 columns.
names <- c("First", "Second")
inds <- which(names(df) %in% names)
result <- Map(function(x, y) df[x:y], inds + 1, inds + 3)
result
#[[1]]
# a b c
#1 3 8 1
#2 5 9 4
#[[2]]
# a b c
#1 7 6 8
#2 2 8 5
To create separate dataframes you can name the list and use list2env
names(result) <- names
list2env(result, .GlobalEnv)
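If the number of columns after each marker is not always 3, one possible variation is to end each block just before the next marker instead of at a fixed offset. This is a sketch under the assumption that every marker is followed by at least one column; markers, starts and ends are names introduced here for illustration.

```r
df <- read.table(text = "First a b c Second a b c
'' 3 8 1 '' 7 6 8
'' 5 9 4 '' 2 8 5", header = TRUE, check.names = FALSE)

markers <- c("First", "Second")
inds <- which(names(df) %in% markers)

# Each block starts right after its marker and ends right before the next
# marker; the last block runs to the final column
starts <- inds + 1
ends <- c(inds[-1] - 1, ncol(df))

result <- Map(function(s, e) df[s:e], starts, ends)
names(result) <- names(df)[inds]
print(result$Second)
```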

Related

Remove the last column of dataframe in R in a function

I need to remove the last column of 10 dataframes, so I decided to put it in lapply(). I wrote a function to remove the col, like below,
remove_col <- function(mydata){
mydata = subset(mydata, select=-c(24))
}
and created mylist <- list(data1, data2.... data10), then I called lapply as
lapply(mylist, FUN = remove_col)
It did give me a list of the removed dataframe, however, when I checked the original dataframe, the last column is still there.
How should I change the code to change the original dataset?
You need to assign the result of the function call to the input list on the LHS:
mylist <- lapply(mylist, FUN = remove_col)
Had you defined your function with an explicit return value, this might have been more obvious:
remove_col <- function(mydata) {
mydata <- subset(mydata, select=-c(24))
return(mydata) # return the modified list/data frame
}
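The underlying reason is that an R function works on its own copy of the argument, so the caller's object only changes when the result is assigned back. A minimal sketch with a made-up data frame (data1 and remove_last are illustrative names):

```r
# Hypothetical helper: drop the last column, whatever its position
remove_last <- function(d) d[-ncol(d)]

data1 <- data.frame(a = 1:3, b = 4:6, c = 7:9)

trimmed <- remove_last(data1)
print(ncol(data1))    # the original is untouched by the call
print(ncol(trimmed))  # only the returned copy lost a column

# Assigning the result back is what changes data1
data1 <- remove_last(data1)
print(ncol(data1))
```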
Instead of hardcoding the column number to remove you can use ncol to remove the last column from each dataframe.
remove_col <- function(mydata){
mydata[, -ncol(mydata)]
}
mylist <- lapply(mylist, remove_col)
To see the changes in the original dataframes, you can assign names to the list of dataframes and use list2env.
names(mylist) <- paste0('data', seq_along(mylist))
list2env(mylist, .GlobalEnv)
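list2env copies each named list element into the target environment as a variable of that name. The sketch below uses a fresh environment rather than .GlobalEnv, purely so the effect can be inspected without overwriting anything; the two toy data frames are invented for the example.

```r
mylist <- list(
  data1 = data.frame(x = 1:2, y = 3:4, z = 5:6),
  data2 = data.frame(x = 7:8, y = 9:10)
)

# Drop the last column of every data frame, keeping data frames as data frames
mylist <- lapply(mylist, function(d) d[-ncol(d)])

# Write the list elements out as variables named data1 and data2
env <- list2env(mylist, envir = new.env())
print(ls(env))
print(names(env$data1))
```

Passing .GlobalEnv instead of new.env() would overwrite any existing data1 and data2 in the workspace, which is the behaviour the answer above relies on.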
Using base R and lapply. Note that you can remove ", drop = F" from your script if every dataframe in the list has more than 2 columns.
> d1
c1 c2
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
> d2
c1 c2
1 5 10
2 4 9
3 3 8
4 2 7
5 1 6
> mylist <- list(d1, d2)
> mylist
[[1]]
c1 c2
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
[[2]]
c1 c2
1 5 10
2 4 9
3 3 8
4 2 7
5 1 6
> lapply(mylist, function(x) x[,1:(ncol(x)-1), drop = F] )
[[1]]
c1
1 1
2 2
3 3
4 4
5 5
[[2]]
c1
1 5
2 4
3 3
4 2
5 1
>
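The drop argument matters because matrix-style indexing that leaves a single column returns a plain vector by default. A short sketch with a two-column data frame like d1 above (v and k are illustrative names):

```r
d <- data.frame(c1 = 1:5, c2 = 6:10)

# Only one column remains, so the default drop = TRUE yields a vector
v <- d[, 1:(ncol(d) - 1)]
print(class(v))

# drop = FALSE keeps the one-column result as a data frame
k <- d[, 1:(ncol(d) - 1), drop = FALSE]
print(class(k))
```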

Create a new CSV from an existing CSV that has repeated columns and unsorted data

I have a CSV with these data:
List Rank.A List Rank.B List Rank.C
a 4 a 8 b 3
b 5 e 5 e 9
c 7 f 5 r 1
I want to create a new CSV with a single List column containing the unique values, plus the three columns "Rank.A", "Rank.B" and "Rank.C" alongside it. If a List value has no entry for a given rank column, that cell should be left blank. I want the data in this format:
List  Rank.A  Rank.B  Rank.C
a     4       8
b     5               3
c     7
e             5       9
f             5
r                     1
Can you please help me in that?
A base R option using split.default (to split your data.frame by columns) and Reduce + merge to combine data into a single data.frame.
Reduce(
function(x, y) merge(x, y, all = TRUE),
split.default(df, rep(1:(ncol(df) / 2), each = 2)))
# List Rank.A Rank.B Rank.C
# 1 a 4 8 NA
# 2 b 5 NA 3
# 3 c 7 NA NA
# 4 e NA 5 9
# 5 f NA 5 NA
# 6 r NA NA 1
Note that this assumes that you always have pairs of columns (List, Rank.x) in your original data.
Sample data
df <- read.table(text =
"List Rank.A List Rank.B List Rank.C
a 4 a 8 b 3
b 5 e 5 e 9
c 7 f 5 r 1", header = T, check.names = F)
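The following sketch spells out the intermediate steps and then writes the CSV with blank cells, which the answer above does not cover. The names grp, pieces, merged and out are illustrative, and the output goes to a temporary file rather than a real path.

```r
df <- read.table(text =
"List Rank.A List Rank.B List Rank.C
a 4 a 8 b 3
b 5 e 5 e 9
c 7 f 5 r 1", header = TRUE, check.names = FALSE)

# One group id per (List, Rank.x) pair of columns
grp <- rep(seq_len(ncol(df) / 2), each = 2)
pieces <- split.default(df, grp)

# Full outer join of the pieces on List, one pair at a time
merged <- Reduce(function(x, y) merge(x, y, by = "List", all = TRUE), pieces)
print(merged)

# na = "" writes missing ranks as empty cells instead of the text NA
out <- tempfile(fileext = ".csv")
write.csv(merged, out, row.names = FALSE, na = "")
```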

Reorder a subset of an R data.frame modifying the row names as well

Given a data.frame:
foo <- data.frame(ID=1:10, x=1:10)
rownames(foo) <- LETTERS[1:10]
I would like to reorder a subset of rows, defined by their row names. However, I would like to swap the row names of foo as well. I can do
sel <- c("D", "H") # rows to reorder
foo[sel,] <- foo[rev(sel),]
sel.wh <- match(sel, rownames(foo))
rownames(foo)[sel.wh] <- rownames(foo)[rev(sel.wh)]
but that is long and complicated. Is there a simpler way?
We can replace the sel values in rownames with the reverse of sel.
x <- rownames(foo)
foo[replace(x, x %in% sel, rev(sel)), ]
# ID x
#A 1 1
#B 2 2
#C 3 3
#H 8 8
#E 5 5
#F 6 6
#G 7 7
#D 4 4
#I 9 9
#J 10 10
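One thing to be aware of: replace with a logical mask fills the matching positions in row order, so the swap above depends on sel listing the names in the order they appear in the data. A possible variation that looks up each name's position with match is sketched below; swap_rows is a hypothetical helper name, not something from the answers here.

```r
foo <- data.frame(ID = 1:10, x = 1:10)
rownames(foo) <- LETTERS[1:10]

# Hypothetical helper: reverse the positions of the rows named in sel
swap_rows <- function(d, sel) {
  x <- rownames(d)
  # match gives the position of each selected name, in the order of sel
  x[match(sel, x)] <- rev(sel)
  d[x, , drop = FALSE]
}

a <- swap_rows(foo, c("D", "H"))
b <- swap_rows(foo, c("H", "D"))
print(a)
```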
Not as concise as ronak-shah's answer, but you could also use order.
# extract row names
temp <- row.names(foo)
# swap the selected names within the vector
temp[which(temp %in% sel)] <- temp[rev(which(temp %in% sel))]
# reset order of data.frame
foo[order(temp),]
ID x
A 1 1
B 2 2
C 3 3
H 8 8
E 5 5
F 6 6
G 7 7
D 4 4
I 9 9
J 10 10
As noted in the comments, this relies on the row names following a lexicographical order. In instances where this is not true, we can use match.
# set up
set.seed(1234)
foo <- data.frame(ID=1:10, x=1:10)
row.names(foo) <- sample(LETTERS[1:10])
sel <- c("D", "H")
Now, the rownames are
# initial data.frame
foo
ID x
B 1 1
F 2 2
E 3 3
H 4 4
I 5 5
D 6 6
A 7 7
G 8 8
J 9 9
C 10 10
# grab row names
temp <- row.names(foo)
# reorder vector containing row names
temp[which(temp %in% sel)] <- temp[rev(which(temp %in% sel))]
Using match along with order:
foo[order(match(row.names(foo), temp)),]
ID x
B 1 1
F 2 2
E 3 3
D 6 6
I 5 5
H 4 4
A 7 7
G 8 8
J 9 9
C 10 10
Your data frame is small, so you can duplicate it and then change the values of each row:
footmp <- data.frame(foo)
foo[4, ] <- footmp[8, ]
foo[8, ] <- footmp[4, ]
Bob

R - split list every x items

I have data to analyse that is presented in the form of a list (just one row and MANY columns).
A B C D E F G H I
1 2 3 4 5 6 7 8 9
Is there a way to tell R to split this list every x items and get something as seen below (the columns C D E F G H I are virtually the same as A B)?
A B
1 2
3 4
5 6
7 8
9
If the number of columns is a multiple of 'x', then we unlist the dataset, and use matrix to create the expected output.
as.data.frame(matrix(unlist(df1), ncol=2, dimnames=list(NULL, c("A", "B")) , byrow=TRUE))
If the number of columns is not a multiple of 'x', then
x <- 2
gr <- as.numeric(gl(ncol(df1), x, ncol(df1)))
lst <- split(unlist(df1), gr)
do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
# A B
# 1 1 2
# 2 3 4
# 3 5 6
# 4 7 8
# 5 9 NA
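An alternative sketch that covers both cases with one code path is to pad the values with NA up to the next multiple of x and then reshape. The one-row df1 is rebuilt here from the question, and v and out are illustrative names.

```r
# One-row data frame with columns A..I holding 1..9, as in the question
df1 <- as.data.frame(matrix(1:9, nrow = 1, dimnames = list(NULL, LETTERS[1:9])))

x <- 2
v <- unlist(df1, use.names = FALSE)

# Extending the length pads the vector with NA
length(v) <- x * ceiling(length(v) / x)

# Fill row by row, reusing the first x column names
out <- as.data.frame(matrix(v, ncol = x, byrow = TRUE,
                            dimnames = list(NULL, names(df1)[seq_len(x)])))
print(out)
```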

How to remove outliers from multiple columns of a data frame

I would like to get a data frame that contains only data that is within 2 SD per each numeric column.
I know how to do it for a single column but how can I do it for a bunch of columns at once?
Here is the toy data frame:
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c",header = TRUE)
Here is the line of code for keeping only the data that is within 2 SD for a single column (birds). How can I do it for all numeric columns at once?
df[!(abs(df$birds - mean(df$birds))/sd(df$birds)) > 2,]
target birds wolfs Country
2 3 8 4 b
3 1 2 8 c
4 1 2 3 a
5 1 8 3 a
6 6 1 2 a
7 6 7 1 b
8 6 1 5 c
We can use lapply to loop over the dataset columns and subset the numeric vectors (by using an if/else condition) based on the mean and sd.
lapply(df, function(x) if(is.numeric(x)) x[!(abs((x-mean(x))/sd(x))>2)] else x)
EDIT:
I was under the impression that we need to remove the outliers for each column separately. But if we need to keep only the rows that have no outliers in any numeric column, we can loop through the columns with lapply as before; instead of returning the values of 'x', we return the positions (seq_along) of the values that are within 2 SD, and then take the intersection of the list elements with Reduce. The resulting numeric index can be used for subsetting the rows.
lst <- lapply(df, function(x) if(is.numeric(x))
seq_along(x)[!(abs((x-mean(x))/sd(x))>2)] else seq_along(x))
df[Reduce(intersect,lst),]
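The same row filter can also be written with logical masks combined by `&` rather than index sets combined by intersect. This is only a sketch of that variation; z_ok, num and keep are names made up for the example.

```r
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c", header = TRUE)

# TRUE where a value lies within 2 SD of its column mean
z_ok <- function(x) abs((x - mean(x)) / sd(x)) <= 2

# Which columns are numeric
num <- vapply(df, is.numeric, logical(1))

# A row is kept only if every numeric column passes the check
keep <- Reduce(`&`, lapply(df[num], z_ok))
res <- df[keep, ]
print(res)
```

With this toy data only the first row is dropped, because birds = 21 is the single value more than 2 SD from its column mean.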
I'm guessing that you are trying to filter your data set by checking that all of the numeric columns are within 2 SD (?)
In that case I would suggest creating two filters: the first indicates which columns are numeric, and the second checks that all of them are within 2 SD. For the second condition, we can use the built-in scale function.
indx <- sapply(df, is.numeric)
indx2 <- rowSums(abs(scale(df[indx])) <= 2) == sum(indx)
df[indx2,]
# target birds wolfs Country
# 2 3 8 4 b
# 3 1 2 8 c
# 4 1 2 3 a
# 5 1 8 3 a
# 6 6 1 2 a
# 7 6 7 1 b
# 8 6 1 5 c
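As a quick check of why scale fits here: with its default arguments it centers each column on its mean and divides by its standard deviation, which is the same z-score computed by hand in the question. A small sketch comparing the two for the birds column (m and manual are illustrative names):

```r
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c", header = TRUE)

indx <- sapply(df, is.numeric)

# Column-wise z-scores for all numeric columns at once
m <- scale(df[indx])

# The single-column formula from the question
manual <- (df$birds - mean(df$birds)) / sd(df$birds)

print(all.equal(unname(m[, "birds"]), manual))
print(manual[1] > 2)  # the first row is the one flagged as an outlier
```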

Resources