This is about ordering column names that contain both numbers and text. I have a data frame which resulted from dcast and has 200 rows. I have a problem with the column ordering.
The column names are in the following format:
names(DF) <- c('Testname1.1', 'Testname1.100', 'Testname1.11', 'Testname1.2', ..., 'Testname2.99')
Edit: I would like to have the columns ordered as:
names(DF) <- c('Testname1.1', 'Testname1.2', 'Testname1.3', ..., 'Testname1.100', 'Testname2.1', ..., 'Testname2.100')
The original input has a column which specifies the day, but it is not being used when I 'cast' the data. Is there a way to tell the 'dcast' function to order the combined column names numerically?
What would be the easiest way to get the columns ordered as I need to in R?
Thanks a lot!
I think you need to split the column before you can use it to order the data frame:
library("reshape2") ## for colsplit()
library("gtools")
Construct test data:
dat <- data.frame(matrix(1:25,5))
names(dat) <- c('Testname1.1', 'Testname1.100',
'Testname1.11','Testname1.2','Testname2.99')
Split and order:
cdat <- colsplit(names(dat),"\\.",c("name","num"))
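## mixedorder() returns a permutation, not a sort key, so rank each prefix
## against the mixed-sorted unique prefixes and break ties on num
name_rank <- match(cdat$name, mixedsort(unique(cdat$name)))
dat[,order(name_rank,cdat$num)]
##   Testname1.1 Testname1.2 Testname1.11 Testname1.100 Testname2.99
## 1           1          16           11             6           21
## 2           2          17           12             7           22
## 3           3          18           13             8           23
## 4           4          19           14             9           24
## 5           5          20           15            10           25
The mixedsort() step above (using gtools is borrowed from @BondedDust's answer) is not really necessary for this example, but would be needed if the first (Testnamexx) component had more than 9 elements, so that Testname1, Testname2, and Testname10 would come in the proper order. Passing mixedorder(cdat$name) straight to order() as the first key does not work: it is a permutation with no ties, so the second key (num) would never be consulted.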
The mixedorder and mixedsort functions of pkg:gtools sometimes do what is desired, but in this case I think the period separator is messing things up because it is read as part of a numeric value, although it was clearly intended to be a separator rather than a decimal point. Try
nvec <- c('Testname1.1', 'Testname1.100', 'Testname1.11', 'Testname1.2', 'Testname2.99')
#------------
> require(gtools)
Loading required package: gtools
Attaching package: ‘gtools’
The following objects are masked from ‘package:boot’:
inv.logit, logit
#------------
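parts <- strsplit(nvec, "\\.")
prefix <- sapply(parts, "[[", 1)             # e.g. "Testname1"
num <- as.numeric(sapply(parts, "[[", 2))    # e.g. 100
# rank the prefixes in mixed order (a raw mixedorder() permutation has no ties to break)
myvec <- nvec[order(match(prefix, mixedsort(unique(prefix))), num)]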
One way would be:
library(gtools) #use gtools library
library(NCmisc) #use NCmisc library for pad.left()
myvec <- c('Testname1.1', 'Testname1.100','Testname1.11','Testname1.2','Testname2.99') #construct your vector
myvec[mixedorder( paste(substring(myvec,1,9), pad.left(substring(myvec,11,100),'0') , sep='') ) ]
[1] "Testname1.1" "Testname1.2" "Testname1.11" "Testname1.100" "Testname2.99"
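For comparison, the same ordering can be sketched in base R alone by pulling out the two numbers with regular expressions. This is only a sketch: the helper name sort_names() is made up, and it assumes every name looks like text, a number, a period, and another number (anything else would turn into NA).

```r
# Hypothetical helper: order names like "Testname1.11" by the number in the
# prefix first and by the number after the period second.
sort_names <- function(nm) {
  # Number embedded in the prefix, e.g. "Testname10.3" -> 10
  prefix_num <- as.integer(sub("^[^0-9]*([0-9]+)\\.[0-9]+$", "\\1", nm))
  # Number after the period, e.g. "Testname10.3" -> 3
  suffix_num <- as.integer(sub("^.*\\.([0-9]+)$", "\\1", nm))
  nm[order(prefix_num, suffix_num)]
}

nm <- c("Testname1.1", "Testname1.100", "Testname1.11", "Testname1.2",
        "Testname2.99", "Testname10.3")
sorted <- sort_names(nm)
print(sorted)
```

To reorder a data frame with it, something like DF <- DF[, sort_names(names(DF))] should do.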
This is likely a quick fix! I am trying to put the ith element of my vector into a data frame column name, using paste0 to insert the ith number.
sma <- 2:20
> sma
[1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# Place i number from sma vector to data frame column name
spx.sma <- df$close.sma.paste0("n", sma[i])
Column name should read:
"close.sma.n2"
If I print
paste0("n", sma[i])
I obtain:
> paste0("n", sma[i])
[1] "n2"
So really, if I paste this into my data frame column name then it should read:
close.sma.n2
What is the correct method to achieve this?
I get the error:
> spx.sma <- df$close.sma.paste0(".n", sma[i])
Error: attempt to apply non-function
You should treat the data frame as a list. The "$" operator takes the name after it literally and does not evaluate it, so avoid it here and use [[ ]] instead, which accepts a computed string:
spx.sma <- df[[paste0("close.sma.n", sma[i])]]
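A small self-contained sketch of the same idea, in case it helps. The data frame and its values are made up for illustration; only the close.sma.n naming pattern comes from the question.

```r
sma <- 2:20

# Hypothetical data frame with one column per moving-average length
df <- data.frame(close.sma.n2 = c(10, 11, 12),
                 close.sma.n3 = c(20, 21, 22))

i <- 1
col <- paste0("close.sma.n", sma[i])   # builds the string "close.sma.n2"
spx.sma <- df[[col]]                   # reads the column by its computed name

# Assignment works the same way and creates the column if it is missing
df[[paste0("close.sma.n", sma[3])]] <- c(30, 31, 32)
names(df)
```

The same [[ ]] form works for both reading and writing, which is why it is usually preferred over "$" inside loops.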
I am having a problem with get() in R.
I have a set of data.frames with a common structure in my environment. I want to loop through these data frames and change the name of the 2nd column so that the name of the 2nd column contains a prefix from the 1st column.
For example, if column 1 = A_cat and column 2 is dog, I want column 2 to be changed to A_dog.
Below is an example of the R code I am using:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
for( element in grep('^df$', names(environment()), value=TRUE) ) {
  colnames(get(element))[2] <- paste(strsplit(colnames(get(element))[1], '_')[[1]][1],
                                     colnames(get(element))[2], sep='_')
}
The arguments within the for loop, on either side of the assignment operator, both give the expected result if I run them separately but when run together produce the following error.
Error in colnames(get(element))[2] <- paste(strsplit(colnames(get(element))[1], :
could not find function "get<-"
Any help with this problem would be greatly appreciated.
This does the same thing as the code in the question without using get:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
e <- environment() ##
df.names <- grep("^df$", names(e), value = TRUE)
# nm is the current data frame name and nms are its column names
for(nm in df.names) {
nms <- names(e[[nm]])
names(e[[nm]])[2] <- paste0(sub("_.*", "_", nms[1]), nms[2])
}
giving:
> df
A_cat A_dog
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
Keeping the data.frames in a named list as suggested in a comment to the question might be even better. For example, if instead of keeping the data.frames in an environment they were in a list called e
e <- list(df = df)
then omit the line marked ## and the rest works as is.
Here would be one way to accomplish this goal if the data.frames have systematic names (here, df1 df2 df3, etc) and the prefix ends with "_" as in the example:
# suggested by @roland: roll them up in a list
myDfList <- mget(ls(pattern="^df"))
# change names
for(dfName in names(myDfList)) {
  # keep only the prefix of column 1, up to and including the last "_"
  names(myDfList[[dfName]])[2] <- paste0(sub("^(.*_).*$", "\\1",
                                             names(myDfList[[dfName]])[1]),
                                         names(myDfList[[dfName]])[2])
}
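As for why the loop in the question fails: an assignment of the form colnames(get(element))[2] <- value makes R look for a replacement function named `get<-`, and no such function exists. If you would rather keep the data frames as separate objects, one workaround is a get()/assign() round trip. A minimal sketch, assuming two example data frames df1 and df2 that are not from the question:

```r
df1 <- data.frame(A_cat = 1:3, dog = 4:6)
df2 <- data.frame(B_cat = 1:3, dog = 7:9)

for (nm in c("df1", "df2")) {
  tmp <- get(nm)                              # work on a copy of the data frame
  prefix <- sub("_.*", "_", names(tmp)[1])    # "A_cat" -> "A_"
  names(tmp)[2] <- paste0(prefix, names(tmp)[2])
  assign(nm, tmp)                             # store the copy under the same name
}

names(df1)
names(df2)
```

The list-based versions above are still tidier; this only shows that get() is fine for reading, while the writing has to go through assign().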
I have a data frame which I populate from a csv file as follows (data for sample only) :
> csv_data <- read.csv('test.csv')
> csv_data
gender country income
1 1 20 10000
2 2 20 12000
3 2 23 3000
I want to convert country to factor. However when I do the following, it fails :
> csv_data[,2] <- lapply(csv_data[,2], factor)
Warning message:
In `[<-.data.frame`(`*tmp*`, , 2, value = list(1L, 1L, 1L)) :
provided 3 variables to replace 1 variables
However, if I convert both gender and country to factor, it succeeds :
> csv_data[,1:2] <- lapply(csv_data[,1:2], factor)
> is.factor(csv_data[,1])
[1] TRUE
> is.factor(csv_data[,2])
[1] TRUE
Is there something I am doing wrong? I want to use lapply since I want to programmatically convert the columns into factors, and it is possible that the number of columns to be converted is only 1 (it could be more as well; this number is driven by arguments to a function). Is there any way I can do it using lapply only?
When subsetting for one single column, you'll need to change it slightly.
There's a big difference between
lapply(df[,2], factor)
and
lapply(df[2], factor)
## and/or
lapply(df[, 2, drop=FALSE], factor)
Have a look at the output of each. If you remove the comma, everything should work fine. Using the comma in [,] drops a single column to a plain vector, so lapply applies factor to each element individually and returns one list entry per row. Leaving the comma out keeps the column as a one-column data frame (a list), which is what you want to give to lapply in this situation. However, if you use drop=FALSE, you can leave the comma in, and the column will remain a list/data.frame.
No good:
df[,2] <- lapply(df[,2], factor)
# Warning message:
# In `[<-.data.frame`(`*tmp*`, , 2, value = list(1L, 1L, 1L)) :
# provided 3 variables to replace 1 variables
Succeeds on a single column:
df[,2] <- lapply(df[,2,drop=FALSE], factor)
df[,2]
# [1] 20 20 23
# Levels: 20 23
In my opinion, the best way to subset data frame columns is without the comma. This also succeeds:
df[2] <- lapply(df[2], factor)
df[[2]]
# [1] 20 20 23
# Levels: 20 23
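Since the question mentions that the columns come in as function arguments, here is a rough sketch of how that could be wrapped up. The helper name to_factor() is made up; the data mirrors the sample in the question.

```r
# Hypothetical helper: convert the given columns (names or positions) to factors.
to_factor <- function(data, cols) {
  # data[cols] is always a data frame (a list of columns), even for one column,
  # so lapply() sees whole columns rather than single values
  data[cols] <- lapply(data[cols], factor)
  data
}

csv_data <- data.frame(gender  = c(1, 2, 2),
                       country = c(20, 20, 23),
                       income  = c(10000, 12000, 3000))

one  <- to_factor(csv_data, 2)                        # a single column
many <- to_factor(csv_data, c("gender", "country"))   # several columns
str(one)
```

Because the subsetting never uses the comma form, the same code path handles one column or many.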
I have, for example, these three datasets (in my case, there are many more, with a lot of variables):
data_frame1 <- data.frame(a=c(1,5,3,3,2), b=c(3,6,1,5,5), c=c(4,4,1,9,2))
data_frame2 <- data.frame(a=c(6,0,9,1,2), b=c(2,7,2,2,1), c=c(8,4,1,9,2))
data_frame3 <- data.frame(a=c(0,0,1,5,1), b=c(4,1,9,2,3), c=c(2,9,7,1,1))
To each data frame I want to add a variable resulting from a transformation of an existing variable in that data frame. I would like to do this with a loop. For example:
datasets <- c("data_frame1","data_frame2","data_frame3")
vars <- c("a","b","c")
for (i in datasets){
for (j in vars){
# here I need a code that create a new variable with transformed values
# I thought this would work, but it didn't...
get(i)$new_var <- log(get(i)[,j])
}
}
Do you have any suggestions for this?
Moreover, it would be great if it were also possible to assign the new column names (in this case new_var) from a character string, so I could create the new variables with another for loop nested in the other two.
I hope my explanation of the problem is not too tangled.
Thanks in advance.
You can put your dataframes in a list and use lapply to process them one by one. So no need to use a loop in this case.
For example you can do this :
data_frame1 <- data.frame(a=c(1,5,3,3,2), b=c(3,6,1,5,5), c=c(4,4,1,9,2))
data_frame2 <- data.frame(a=c(6,0,9,1,2), b=c(2,7,2,2,1), c=c(8,4,1,9,2))
data_frame3 <- data.frame(a=c(0,0,1,5,1), b=c(4,1,9,2,3), c=c(2,9,7,1,1))
ll <- list(data_frame1,data_frame2,data_frame3)
lapply(ll,function(df){
df$log_a <- log(df$a) ## new column with the log a
df$tans_col <- df$a+df$b+df$c ## new column with sums of some columns or any other
## transformation
### .....
df
})
data_frame1 becomes:
[[1]]
  a b c     log_a tans_col
1 1 3 4 0.0000000        8
2 5 6 4 1.6094379       15
3 3 1 1 1.0986123        5
4 3 5 9 1.0986123       17
5 2 5 2 0.6931472        9
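To also cover the second part of the question (naming the new columns from a character string), a sketch along these lines might work. It assumes a named list and a "log_" prefix for the new columns, neither of which is in the original post; note that log(0) gives -Inf for the zero in data_frame2.

```r
dfs <- list(
  data_frame1 = data.frame(a = c(1, 5, 3, 3, 2), b = c(3, 6, 1, 5, 5), c = c(4, 4, 1, 9, 2)),
  data_frame2 = data.frame(a = c(6, 0, 9, 1, 2), b = c(2, 7, 2, 2, 1), c = c(8, 4, 1, 9, 2))
)
vars <- c("a", "b", "c")

dfs <- lapply(dfs, function(df) {
  for (v in vars) {
    # the new column name is built from a string, e.g. "log_a"
    df[[paste0("log_", v)]] <- log(df[[v]])
  }
  df
})

names(dfs$data_frame1)
```

Assigning the result of lapply() back to dfs is what keeps the new columns; the call on its own only prints them.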
I had the same need and wanted to change also the columns in my actual list of dataframes.
I found a great method here (the purrr::map2 method in the question works for dataframes with different columns), followed by
list2env(list_of_dataframes ,.GlobalEnv)
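For anyone unsure what that last step does, here is a small sketch. It assumes the list is named after the original objects, and it writes into a fresh environment (the name target is made up) so that nothing in the workspace is overwritten; passing .GlobalEnv instead would replace the original data frames.

```r
list_of_dataframes <- list(
  data_frame1 = data.frame(a = c(1, 5, 3)),
  data_frame2 = data.frame(a = c(6, 0, 9))
)

# Modify every data frame in the list
list_of_dataframes <- lapply(list_of_dataframes, function(df) {
  df$a_doubled <- df$a * 2
  df
})

# Each list element becomes an object named after its list name
target <- new.env()
list2env(list_of_dataframes, envir = target)
ls(target)
```

The list must be named, since list2env() uses the element names as the object names.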
I have a vector called cols with column names in it:
cols <- c('Column1','Column2','Column3')
I'd like to reproduce this command, but taking the column name from the vector:
data.frame(Column1=rnorm(10))
Here's what happens when I try it:
> data.frame(cols[1]=rnorm(10))
Error: unexpected '=' in "data.frame(I(cols[1])="
The same thing happens if I wrap cols[1] in I() or eval().
How can I feed that item from the vector into the data.frame() command?
Update:
For some background, I have defined a function calc.means() that takes a data frame and a list of variables and performs a large and complicated ddply operation, summarizing at the level specified by the variables.
What I'm trying to do with the data.frame() command is walk back up the aggregation levels to the very top, re-running calc.means() at each step and using rbind() to glue the results onto one another. I need to add dummy columns with 'All' values in order to get the rbind to work properly.
I'm rolling cast-like margin functionality into ddply, basically, and I'd like to not retype the column names for each run. Here's the full code:
cols <- c('Col1','Col2','Col3')
rbind ( calc.means(dat,cols),
data.frame(cols[1]='All', calc.means(dat, cols[2:3])),
data.frame(cols[1]='All', cols[2]='All', calc.means(dat, cols[3]))
)
You can use structure:
cols <- c("a","b")
foo <- structure(list(c(1, 2 ), c(3, 3)), .Names = cols, row.names = c(NA, -2L), class = "data.frame")
I don't get why you are doing this though!
I'm not sure how to do it directly, but you could simply skip the step of assigning names in the data.frame() command. Assuming you store the result of data.frame() in a variable named foo, you can simply do:
names(foo) <- cols
after the data frame is created.
There is one trick. You could mess with lists:
cols_dummy <- setNames(rep(list("All"), 3), cols)
Then, if you index the list with single brackets (which returns a named list rather than the bare element), you should get what you want:
data.frame(cols_dummy[1], calc.means(dat, cols[2:3]))
You could use it on-the-fly as setNames(list("All"), cols[1]) but I think it's less elegant.
Example:
some_names <- list(name_A="Dummy 1", name_B="Dummy 2") # equivalent of cols_dummy from above
data.frame(var1=rnorm(3), some_names[1])
# var1 name_A
# 1 -1.940169 Dummy 1
# 2 -0.787107 Dummy 1
# 3 -0.235160 Dummy 1
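Putting the setNames() idea together with the margin use case from the question, a sketch could look like this. The partial data frame is a made-up stand-in for a calc.means() result, since that function is not shown in the question.

```r
cols <- c("Col1", "Col2", "Col3")

# One column whose name comes from the vector rather than being typed out
one_col <- setNames(data.frame(rnorm(10)), cols[1])

# Stand-in for a calc.means() result summarised by Col3 only (made-up values)
partial <- data.frame(Col3 = c("x", "y"), mean = c(1.5, 2.5))

# "All" columns for the levels that were aggregated away; the single
# values are recycled to the number of rows in partial
dummies <- setNames(rep(list("All"), 2), cols[1:2])
with_margins <- data.frame(dummies, partial)
names(with_margins)
```

With the column names lined up like this, the pieces should be ready for rbind().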
I believe the assign() function is your answer:
cols <- c('Col1','Col2','Col3')
data.frame(assign(cols[1], rnorm(10)))
Returns:
assign.cols.1...rnorm.10..
1 -0.02056822
2 -0.03675639
3 1.06249599
4 0.41763399
5 0.38873118
6 1.01779018
7 1.01379963
8 1.86119518
9 0.35760039
10 1.14742560
With the lapply() or sapply() function, you should be able to loop the cbind() process. Something like:
operation <- sapply(cols, function(x) data.frame(assign(x, rnorm(10))))
final <- data.frame(lapply(operation, cbind))
Returns:
Col1.assign.x..rnorm.10.. Col2.assign.x..rnorm.10.. Col3.assign.x..rnorm.10..
1 0.001962187 -0.3561499 -0.22783816
2 -0.706804781 -0.4452781 -1.09950505
3 -0.604417525 -0.8425018 -0.73287079
4 -1.287038060 0.2545236 -1.18795684
5 0.232084366 -1.0831463 0.40799046
6 -0.148594144 0.4963714 -1.34938144
7 0.442054119 0.2856748 0.05933736
8 0.984615916 -0.0795147 -1.91165189
9 1.222310749 -0.1743313 0.18256877
10 -0.231885977 -0.2273724 -0.43247570
Then, to clean up the column names:
colnames(final) <- cols
Returns:
Col1 Col2 Col3
1 0.19473248 0.2864232 0.93115072
2 -1.08473526 -1.5653469 0.09967827
3 -1.90968422 -0.9678024 -1.02167873
4 -1.11962371 0.4549290 0.76692067
5 -2.13776949 3.0360777 -1.48515698
6 0.64240694 1.3441656 0.47676056
7 -0.53590163 1.2696336 -1.19845723
8 0.09158526 -1.0966833 0.91856639
9 -0.05018762 1.0472368 0.15475583
10 0.27152070 -0.2148181 -1.00551111
Cheers,
Adam