Ordering columns in data frame - r

I have a data frame with the following column names:
well, DIV10SD7, DIV11SD7, DIV7SD7, DIV9SD7
However, I want the order to be the following:
well, DIV7SD7, DIV9SD7, DIV10SD7, DIV11SD7
So basically, I want to sort by the number after "DIV" and before "SD7". Additionally, I want to leave out the "well" column when I sort.
When I use the following command:
df[,order(names(df))]
The order of the data frame is unchanged, with the exception of the well column, which moves to the end. I believe this is because R reads each string one character at a time. So, in this case, all the numbers that begin with 1 (e.g. DIV10 and DIV11) are placed before DIV7 and DIV9.
Is there a way to change this behavior?

You can try the mixedorder function from the "gtools" package:
mydf[c(1, mixedorder(names(mydf)[-1]) + 1)]
## well DIV7SD7 DIV9SD7 DIV10SD7 DIV11SD7
## 1 1 7 9 3 5
## 2 2 8 10 4 6
Sample data:
mydf <- structure(list(well = 1:2, DIV10SD7 = 3:4, DIV11SD7 = 5:6, DIV7SD7 = 7:8,
DIV9SD7 = 9:10), .Names = c("well", "DIV10SD7", "DIV11SD7",
"DIV7SD7", "DIV9SD7"), row.names = 1:2, class = "data.frame")
I'd also suggest converting your dataset to a data.table so that you can make use of the set functions in "data.table" (like setcolorder). This will let you update the column order by reference.
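If you would rather not add a package, a base R sketch of the same idea is to pull the number out of each name and order on that. This assumes every column after "well" follows the DIV<number>SD<number> pattern from the question; the names other, div_num and sorted are just for illustration.

```r
# Sample data with the column layout from the question
mydf <- data.frame(well = 1:2, DIV10SD7 = 3:4, DIV11SD7 = 5:6,
                   DIV7SD7 = 7:8, DIV9SD7 = 9:10)

# Column names to sort, leaving out "well"
other <- names(mydf)[-1]

# Extract the digits between "DIV" and "SD" and convert them to numbers
div_num <- as.numeric(sub("^DIV([0-9]+)SD.*$", "\\1", other))

# Keep "well" first, then the remaining columns in numeric order
sorted <- mydf[c("well", other[order(div_num)])]
names(sorted)
```

Because the comparison is done on numbers rather than on strings, 7 and 9 come before 10 and 11.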

Related

How to set a sorted data to plot?

I'm trying to create a scatter plot of two data, but I don't know how to specify my sorted result to the plot.
The procedure is like this:
Read "data_to_be_chosen.csv"
Read "data_to_be_plotted.csv"
Sort "data_to_be_chosen.csv" to find the two topmost values (and their names)
Find the corresponding names/columns in "data_to_be_plotted.csv"
Show the scatter plot of the two
I have a problem at the 5th step.
Let's assume that Column C and Column A have the two topmost values.
If I manually set the data to the plot, it would be:
plot(plotted$C, plotted$A)
However, I'd like to have it done automatically, depending on the sorted order.
I thought the following code would work:
plot(plotted[names(sort(chosen_list$chosen[1,], decreasing=TRUE)[1])],
plotted[names(sort(chosen_list$chosen[1,], decreasing=TRUE)[2])])
But, this gives me an error:
Error in stripchart.default(x1, ...) : invalid plotting method
I also tried these, but they don't work either:
plot(names(sort(chosen_list$chosen[1,], decreasing=TRUE)[1]),
names(sort(chosen_list$chosen[1,], decreasing=TRUE)[2]))
plot(colnames(sort(chosen_list$chosen[1,], decreasing=TRUE)[1]),
colnames(sort(chosen_list$chosen[1,], decreasing=TRUE)[2]))
Is there any way to set the sorted result to this plot?
I have no more ideas.
My R version is 4.1.2 (The latest version).
Here's my data:
data_to_be_chosen.csv
A,B,C
2.044281,0.757232,2.188617
data_to_be_plotted.csv
A,B,C
0.34503,-0.38781,-0.3506
0.351566,-0.3901,-0.35244
0.351817,-0.39144,-0.35435
0.351222,-0.39138,-0.35394
0.351222,-0.39113,-0.35366
0.350753,-0.39088,-0.35291
0.350628,-0.39041,-0.3531
0.349127,-0.3881,-0.3511
0.346125,-0.38675,-0.34969
0.346594,-0.38719,-0.34963
Here's my code:
plotted <- read.csv("data_to_be_plotted.csv")
chosen <- read.csv("data_to_be_chosen.csv")
chosen_list <- list(chosen=chosen)
sort(chosen_list$chosen[1,], decreasing=TRUE)[1:2]
names(sort(chosen_list$chosen[1,], decreasing=TRUE)[1:2])
plotted[names(sort(chosen_list$chosen[1,], decreasing=TRUE)[1:2])]
# Correlation can be calculated with the above data frame
cor(plotted[names(sort(chosen_list$chosen[1,], decreasing=TRUE)[1])],
plotted[names(sort(chosen_list$chosen[1,], decreasing=TRUE)[2])])
# What I want is this plot ... except manually specifying C or A
plot(plotted$C, plotted$A)
# The above data frame can NOT be used to plot / Issues "invalid plotting method"
plot(plotted[names(sort(chosen_list$chosen[1,], decreasing=TRUE)[1])],
plotted[names(sort(chosen_list$chosen[1,], decreasing=TRUE)[2])])
# I also tried, but no luck:
plot(names(sort(chosen_list$chosen[1,], decreasing=TRUE)[1]),
names(sort(chosen_list$chosen[1,], decreasing=TRUE)[2]))
plot(colnames(sort(chosen_list$chosen[1,], decreasing=TRUE)[1]),
colnames(sort(chosen_list$chosen[1,], decreasing=TRUE)[2]))
The problem is just that subsetting one row gives a "data.frame" object instead of the "numeric" vector you probably expect. You can check that with class().
class(chosen_list$chosen[1,])
# [1] "data.frame"
The solution is to unlist it.
class(unlist(chosen_list$chosen[1,]))
# [1] "numeric"
In the following, store intermediate results in objects instead of repeating code.
(x <- sort(unlist(chosen_list$chosen[1,]), decreasing=TRUE)[1:2])
# C A
# 2.188617 2.044281
(nx <- names(x))
# [1] "C" "A"
(p_df <- plotted[nx])
# C A
# 1 -0.35060 0.345030
# 2 -0.35244 0.351566
# 3 -0.35435 0.351817
# 4 -0.35394 0.351222
# 5 -0.35366 0.351222
# 6 -0.35291 0.350753
# 7 -0.35310 0.350628
# 8 -0.35110 0.349127
# 9 -0.34969 0.346125
# 10 -0.34963 0.346594
To get the correlation of two vectors of a data frame as a single value, we probably want [, j], since a data frame has two dimensions.
cor(p_df[, 1], p_df[, 2])
# [1] -0.9029339
Check (similar to above):
class(cor(p_df[1], p_df[2]))
# [1] "matrix" "array"
class(cor(p_df[, 1], p_df[, 2]))
# [1] "numeric"
I recommend brushing up on indexing with brackets.
Or just get the correlation matrix
cor(p_df)
# C A
# C 1.0000000 -0.9029339
# A -0.9029339 1.0000000
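As a small illustration of the bracket-indexing point, here is a sketch of what the different forms return. It uses only the first three rows of the C and A columns from the data above; the object names as_df, as_vec and as_vec2 are made up for this example.

```r
# First three rows of the two selected columns
p_df <- data.frame(C = c(-0.3506, -0.35244, -0.35435),
                   A = c(0.34503, 0.351566, 0.351817))

as_df   <- p_df[1]     # single bracket: still a data frame with one column
as_vec  <- p_df[[1]]   # double bracket: the column itself, a numeric vector
as_vec2 <- p_df[, 1]   # row/column form: also a numeric vector

class(as_df)
class(as_vec)
class(as_vec2)
```

Functions that expect plain vectors, such as plot(x, y), want the second or third form.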
Finally, use one of these:
plot(plotted[, nx[1]], plotted[, nx[2]])
plot(p_df[1:2])
plot(p_df) ## works in this special case, because we only have two columns
Data:
plotted <- structure(list(A = c(0.34503, 0.351566, 0.351817, 0.351222, 0.351222,
0.350753, 0.350628, 0.349127, 0.346125, 0.346594), B = c(-0.38781,
-0.3901, -0.39144, -0.39138, -0.39113, -0.39088, -0.39041, -0.3881,
-0.38675, -0.38719), C = c(-0.3506, -0.35244, -0.35435, -0.35394,
-0.35366, -0.35291, -0.3531, -0.3511, -0.34969, -0.34963)), class = "data.frame", row.names = c(NA,
-10L))
chosen <- structure(list(A = 2.044281, B = 0.757232, C = 2.188617), class = "data.frame", row.names = c(NA,
-1L))
chosen_list <- list(chosen=chosen)
You can coerce the single-row “data_to_be_chosen” df to a named vector using unlist; then sort, get the first two names, and use these to index into “data_to_be_plotted”:
chosen_vec <- unlist(chosen)
plot(plotted[names(sort(chosen_vec, decreasing = TRUE)[1:2])])

Dataframe in R, different numbers of rows and columns

I am working with a document in excel, which I import in R as a list. This list consists of multiple dataframe types. For instance, when I type
data_list <- import_list("my_doc.xlsx")
I obtain a list with three kinds of data frames: 1*30, 30*31, or 0*1. As one can imagine, the 0*1 ones are scalar values.
After this, I make a consolidated dataframe as follows:
my_data<- ldply (data_list, data.frame)
my_data<-t(my_data)
colnames(my_data) <- my_data[1,]
my_data<- my_data[-1,]
my_data1<-matrix(as.numeric(unlist(my_data)),nrow=nrow(my_data))
my_data1<-data.frame(my_data1)
I now obtain a single dataframe, entitled my_data1, with variables appropriately named. However, I lose all scalar variables. Intuitively, one way to go about it, would be to identify all the scalars, and make a vector of them which repeats in value, and is of the same length (i.e. 30), as the other variables. At the moment, they simply disappear.
Any help is much appreciated!
An example of the datastructure is as follows. a is the scalar, and b represents an example of the 1*30 variable. The ... represent the continuation from period 2 to 30.
a = structure(list(`24` = logical(0)), row.names = character(0), class = "data.frame")
b = structure(list(period1 = 1, period2 = 2,
period3 = 3, period4 = 4,
period5 = 5), row.names = 1L, class = "data.frame")
One issue here is that a is stored as logical(0). How can I change this?
Try using dplyr::bind_rows, which keeps the column from the 0*1 data frame and adds it to the final data frame filled with NAs.
result <- dplyr::bind_rows(a, b)
result
# 24 period1 period2 ...period30
#1 NA 5 4 4
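Before combining anything, it may help to find out which list elements are the empty "scalar" frames. The following is only a sketch in base R: data_list here is a hypothetical two-element stand-in for the imported list, built from the a and b structures in the question.

```r
# Stand-ins for one 0*1 element and one single-row element
a <- structure(list(`24` = logical(0)), row.names = character(0),
               class = "data.frame")
b <- data.frame(period1 = 1, period2 = 2, period3 = 3,
                period4 = 4, period5 = 5)
data_list <- list(a = a, b = b)

# Number of rows of every element; the "scalar" frames have zero rows
n_rows <- vapply(data_list, nrow, integer(1))
is_empty <- n_rows == 0L

# Names of the elements that would otherwise silently disappear
names(data_list)[is_empty]
```

With the empty elements identified, they can be handled separately instead of being dropped during the reshaping steps.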

Comparing two columns in a dataframe using R or Excel

I have a csv file containing two columns, "Taxon" in column A and "Tip" in column C. I would like to compare column A against column C, and if the string matches another string in column C I'd like it to print "y" or something similar in column B next to the string in column A, if not I would like to print "n" or equivalent. Here is the beginning of my data:
Taxon B Tip
Nitrosotalea devanaterra Methanothermobacter thermautotrophicus
Nitrososphaera gargensis Methanobacterium beijingense
Nitrososphaera sca5445 Methanobacterium bryantii
Nitrososphaera sca2170 Methanosarcina mazei
Methanobacterium beijingense Persephonella marina
Methanobacterium bryantii Sulfurihydrogenibium azorense
Methanothermobacter thermautotrophicus Balnearium lithotrophicum
Methanosarcina mazei Isosphaera pallida
Koribacter versatilis Methanobacterium beijingense
Acidicapsa borealis Parachlamydia acanthamoebae
Acidobacterium capsulatum Leptospira biflexa
This is only a small part of the data, but the idea is that "n" would be printed in column B for all of the bacteria apart from "Methanobacterium beijingense" and "Methanobacterium bryantii", which are also found in the "Tip" column, and so "y" would be posted there. These could also just be "1" and "0".
I know dplyr has some good functions for filtering and joining data, however I can't find anything that exactly matches my needs. If there is an alternative method of using Excel to do this that's fine too.
Thanks.
For Excel, use the following formula in B2:
=if(isnumber(match(a2, c:c, 0)), "y", "n")
Fill down or double-click the 'drag button'.
A method using R and dplyr:
# create example data
x = read.table(header = TRUE, stringsAsFactors = FALSE, text =
"Taxon B Tip
Nitrosotalea_devanaterra 1 Methanothermobacter_thermautotrophicus
Nitrososphaera_gargensis 1 Methanobacterium_beijingense
Nitrososphaera_sca5445 1 Methanobacterium_bryantii
Nitrososphaera_sca2170 1 Methanosarcina_mazei
Methanobacterium_beijingense 1 Persephonella_marina
Methanobacterium_bryantii 1 Sulfurihydrogenibium_azorense
Methanothermobacter_thermautotrophicus 1 Balnearium_lithotrophicum
Methanosarcina_mazei 1 Isosphaera_pallida
Koribacter_versatilis 1 Methanobacterium_beijingense
Acidicapsa_borealis 1 Parachlamydia_acanthamoebae
Acidobacterium_capsulatum 1 Leptospira_biflexa")
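# Data management part
library(dplyr)
x1 = data.frame(A = x$Taxon, B = x$B)
x2 = data.frame(A = x$Tip, B = x$B)
# anti_join() keeps the Taxon rows that have no match in Tip; set B to 0 for those
x$B[x$Taxon %in% anti_join(x1, x2, by = c("A", "B"))$A] = 0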
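For comparison, the same lookup can be sketched in base R with %in%, which tests each Taxon value for membership in the Tip column. The data frame below is a made-up four-row excerpt of the question's data, not the full file.

```r
# Small excerpt of the question's data (names with spaces kept as-is)
x <- data.frame(
  Taxon = c("Nitrosotalea devanaterra", "Methanobacterium beijingense",
            "Methanobacterium bryantii", "Koribacter versatilis"),
  Tip   = c("Methanobacterium beijingense", "Methanobacterium bryantii",
            "Persephonella marina", "Leptospira biflexa"),
  stringsAsFactors = FALSE
)

# "y" where the Taxon string appears anywhere in Tip, "n" otherwise
x$B <- ifelse(x$Taxon %in% x$Tip, "y", "n")

# Put B between Taxon and Tip, as in the original layout
x <- x[c("Taxon", "B", "Tip")]
x
```

Swapping "y" and "n" for 1 and 0 gives the numeric flags mentioned in the question.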

Use a vector/index as a row name in a dataframe using rbind

I think I'm missing something super simple, but I seem to be unable to find a solution directly relating to what I need: I've got a data frame that has a letter as the row name and two columns of numerical values. As part of a loop I'm running, I create a new vector (from an index) that has both a letter and a number (e.g. "f2"), which I then need to be the name of a new row, and then I add two numbers next to it (based on some other section of code, but I'm fine with that). What I get instead is the name of the vector/index as the row name, and I'm not sure if I'm missing a feature of rbind or something else to make it easy.
Example code:
#Data frame and vector creation
row.names <- letters[1:5]
vector.1 <- c(1:5)
vector.2 <- c(2:6)
vector.3 <- letters[6:10]
data.frame <- data.frame(vector.1,vector.2)
rownames(data.frame) <- row.names
data.frame
index.vector <- "f2"
#what I want the data frame to look like with the new row
data.frame <- rbind(data.frame, "f2" = c(6,11))
data.frame
#what the data frame looks like when I attempt to use a vector as a row name
data.frame <- rbind(data.frame, index.vector = c(6,11))
data.frame
#"why" I can't just type "f" every time
index.vector2 = paste(index.vector, "2", sep="")
data.frame <- rbind(data.frame, index.vector2 = c(6,11))
data.frame
In my loop the "index.vector" is a random sample, hence where I can't just write the letter/number in as a row name, so need to be able to create the row name from a vector or from the index of the sample.
The loop runs and a random number of new rows will be created, so I can't specify what number the row is that needs a new name - unless there's a way to just do it for the newest or bottom row every time.
Any help would be appreciated!
Not elegant, but works:
new_row <- data.frame(setNames(list(6, 11), colnames(data.frame)), row.names = paste(index.vector, "2", sep=""))
data.frame <- rbind(data.frame, new_row)
data.frame
# vector.1 vector.2
# a 1 2
# b 2 3
# c 3 4
# d 4 5
# e 5 6
# f22 6 11
I understood the problem but was not able to resolve it directly, so I am suggesting an alternative way to achieve the same result.
Alternate solution: collect your row labels while binding the data in your loop, and then assign the row names to your data frame at the end.
#Data frame and vector creation
row.names <- letters[1:5]
vector.1 <- c(1:5)
vector.2 <- c(2:6)
vector.3 <- letters[6:10]
data.frame <- data.frame(vector.1,vector.2)
#loop starts
index.vector <- "f2"
data.frame <- rbind(data.frame,c(6,11))
row.names<-append(row.names,index.vector)
#loop ends
rownames(data.frame) <- row.names
data.frame
output:
vector.1 vector.2
a 1 2
b 2 3
c 3 4
d 4 5
e 5 6
f2 6 11
Hope this helps.
If you manipulate the data frame with rbind, then the newest elements will always be at the "bottom" of your data frame. Hence you could also set a single row name by
rownames(data.frame)[nrow(data.frame)] = "new_name"
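Put into a loop, renaming the bottom row could look roughly like the sketch below. The vector new_names is a hypothetical stand-in for the row names produced by the random sample, and the data frame is called df here to avoid masking the data.frame() function.

```r
# Starting data frame with letters as row names
df <- data.frame(vector.1 = 1:5, vector.2 = 2:6, row.names = letters[1:5])

# Hypothetical row names generated inside the loop
new_names <- c("f2", "g7", "h1")

for (nm in new_names) {
  # Append the new values; rbind puts them in the last row
  df <- rbind(df, c(6, 11))
  # Rename only that last row with the value held in the variable
  rownames(df)[nrow(df)] <- nm
}
df
```

Since the name is assigned from the variable's value rather than written as an argument name, it does not matter what the variable itself is called.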

Specifying column names from a list in the data.frame command

I have a list called cols with column names in it:
cols <- c('Column1','Column2','Column3')
I'd like to reproduce this command, but with a call to the list:
data.frame(Column1=rnorm(10))
Here's what happens when I try it:
> data.frame(cols[1]=rnorm(10))
Error: unexpected '=' in "data.frame(cols[1]="
The same thing happens if I wrap cols[1] in I() or eval().
How can I feed that item from the vector into the data.frame() command?
Update:
For some background, I have defined a function calc.means() that takes a data frame and a list of variables and performs a large and complicated ddply operation, summarizing at the level specified by the variables.
What I'm trying to do with the data.frame() command is walk back up the aggregation levels to the very top, re-running calc.means() at each step and using rbind() to glue the results onto one another. I need to add dummy columns with 'All' values in order to get the rbind to work properly.
I'm rolling cast-like margin functionality into ddply, basically, and I'd like to not retype the column names for each run. Here's the full code:
cols <- c('Col1','Col2','Col3')
rbind ( calc.means(dat,cols),
data.frame(cols[1]='All', calc.means(dat, cols[2:3])),
data.frame(cols[1]='All', cols[2]='All', calc.means(dat, cols[3]))
)
You can use structure:
cols <- c("a","b")
foo <- structure(list(c(1, 2 ), c(3, 3)), .Names = cols, row.names = c(NA, -2L), class = "data.frame")
I don't get why you are doing this though!
I'm not sure how to do it directly, but you could simply skip the step of assigning names in the data.frame() command. Assuming you store the result of data.frame() in a variable named foo, you can simply do:
names(foo) <- cols
after the data frame is created
There is one trick. You could mess with lists:
cols_dummy <- setNames(rep(list("All"), 3), cols)
Then, if you index the list with single brackets, you should get what you want:
data.frame(cols_dummy[1], calc.means(dat, cols[2:3]))
You could use it on-the-fly as setNames(list("All"), cols[1]) but I think it's less elegant.
Example:
some_names <- list(name_A="Dummy 1", name_B="Dummy 2") # equivalent of cols_dummy from above
data.frame(var1=rnorm(3), some_names[1])
# var1 name_A
# 1 -1.940169 Dummy 1
# 2 -0.787107 Dummy 1
# 3 -0.235160 Dummy 1
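Applied to the cols vector from the question, the named-list idea can be sketched as follows. The column called value is a hypothetical stand-in for whatever calc.means() would return, since that function is not shown.

```r
cols <- c("Column1", "Column2", "Column3")

# One column whose name comes from cols[1]
one_col <- data.frame(setNames(list(rnorm(10)), cols[1]))

# Several columns named in one go after the data frame is built
all_cols <- setNames(data.frame(rnorm(10), rnorm(10), rnorm(10)), cols)

# A dummy "All" column named from cols[1], recycled next to other data
with_dummy <- data.frame(setNames(list("All"), cols[1]), value = 1:3)

names(one_col)
names(all_cols)
with_dummy
```

The key point is that data.frame() takes column names from the names of a list passed to it, so the names can be computed at run time.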
I believe the assign() function is your answer:
cols <- c('Col1','Col2','Col3')
data.frame(assign(cols[1], rnorm(10)))
Returns:
assign.cols.1...rnorm.10..
1 -0.02056822
2 -0.03675639
3 1.06249599
4 0.41763399
5 0.38873118
6 1.01779018
7 1.01379963
8 1.86119518
9 0.35760039
10 1.14742560
With the lapply() or sapply() function, you should be able to loop the cbind() process. Something like:
operation <- sapply(cols, function(x) data.frame(assign(x, rnorm(10))))
final <- data.frame(lapply(operation, cbind))
Returns:
Col1.assign.x..rnorm.10.. Col2.assign.x..rnorm.10.. Col3.assign.x..rnorm.10..
1 0.001962187 -0.3561499 -0.22783816
2 -0.706804781 -0.4452781 -1.09950505
3 -0.604417525 -0.8425018 -0.73287079
4 -1.287038060 0.2545236 -1.18795684
5 0.232084366 -1.0831463 0.40799046
6 -0.148594144 0.4963714 -1.34938144
7 0.442054119 0.2856748 0.05933736
8 0.984615916 -0.0795147 -1.91165189
9 1.222310749 -0.1743313 0.18256877
10 -0.231885977 -0.2273724 -0.43247570
Then, to clean up the column names:
colnames(final) <- cols
Returns:
Col1 Col2 Col3
1 0.19473248 0.2864232 0.93115072
2 -1.08473526 -1.5653469 0.09967827
3 -1.90968422 -0.9678024 -1.02167873
4 -1.11962371 0.4549290 0.76692067
5 -2.13776949 3.0360777 -1.48515698
6 0.64240694 1.3441656 0.47676056
7 -0.53590163 1.2696336 -1.19845723
8 0.09158526 -1.0966833 0.91856639
9 -0.05018762 1.0472368 0.15475583
10 0.27152070 -0.2148181 -1.00551111
Cheers,
Adam
