I want to add something on the end of all column names in a dataframe, unless the column name exists in another given vector.
For example say I have
df <- data.frame('my' = c(1,2,3),
'data' = c(4,5,6),
'is' = c(7,8,9),
'here' = c(10,11,12))
dont_update <- c('my', 'is')
to_add <- '_new'
And I want to end up with
my data_new is here_new
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
A bit verbose, but this works
to_update <- names(df)[!names(df) %in% dont_update]
names(df)[match(to_update, names(df))] <- paste0(to_update, to_add)
or maybe this is clearer
names(df) <- ifelse(names(df) %in% dont_update, names(df), paste0(names(df), to_add))
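If this is needed in more than one place, the same idea can be wrapped in a small helper. This is only a minimal sketch; the function name add_suffix_except is hypothetical and not from any package.

```r
# Hypothetical helper: append `suffix` to every column name not listed in `keep`.
add_suffix_except <- function(df, keep, suffix) {
  nms <- names(df)
  change <- !(nms %in% keep)                 # TRUE for the columns to rename
  names(df)[change] <- paste0(nms[change], suffix)
  df
}

df <- data.frame(my = c(1, 2, 3), data = c(4, 5, 6),
                 is = c(7, 8, 9), here = c(10, 11, 12))
res <- add_suffix_except(df, keep = c("my", "is"), suffix = "_new")
names(res)
```

The original df is left untouched; the renamed copy is returned.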
I'm trying to use an element from a list to specify a variable name in a dataframe.
I know I can create a dataframe like this, creating variable A and C
list_A <- c(1,3,6,9,10)
List_C <- c(2,3,5,6,10)
df <- data.frame( A = list_A , C = List_C )
> df
A C
1 1 2
2 3 3
3 6 5
4 9 6
5 10 10
However, I'd like to specify the variable names from elements from a list, in this manner
nameslist <- c("A","B","C")
df <- data.frame( eval(parse(text=nameslist[1])) = list_A , eval(parse(text=nameslist[3])) = List_C )
I tried this, but I can't get the code to run. Is there a way to adjust the "eval/parse" bit to make this work? Many thanks in advance.
nameslist <- c("A","B","C")
setNames(data.frame(sapply(nameslist[c(1,3)], \(x) 1:5)), nameslist[c(1,3)])
Or, if you simply had a set of values_for_A and a set of values_for_C, you could do something like this:
setNames(data.frame(list(values_for_A, values_for_C)), nameslist[c(1,3)])
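The attempt in the question fails because R's parser expects a plain name (or a string) before the = in a call, so an expression such as eval(parse(...)) in that position is a syntax error. A minimal sketch of two ways around it, using the values from the question; values_for_A and values_for_C are the placeholder names from the answer above.

```r
nameslist <- c("A", "B", "C")
values_for_A <- c(1, 3, 6, 9, 10)
values_for_C <- c(2, 3, 5, 6, 10)

# Build the data frame first, then assign the names taken from nameslist.
df <- setNames(data.frame(values_for_A, values_for_C), nameslist[c(1, 3)])

# Alternative: add the columns one at a time with [[<-, which accepts a
# character string as the column name.
df2 <- data.frame(row.names = seq_along(values_for_A))
df2[[nameslist[1]]] <- values_for_A
df2[[nameslist[3]]] <- values_for_C
```

Both df and df2 end up with the columns A and C.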
I have a set of data frames named df_1968, df_1969, df_1970, ..., df_2016 collected in a list called my_list.
I want to add a new column to each of these data frames which simply holds the corresponding year (1968 in df_1968 and so on). I've managed to do it by looping through the data frames, but I am looking for a neater solution. I've tried the following:
# Function to extract year from name of data frames
substrRight <- function(y, n) {
substr(y, nchar(y) - n + 1, nchar(y))
}
# Add variable "year" equal to 1968 in df_1968 and so on
my_list <- lapply(my_list, function(x) cbind(x, year <- as.numeric(substrRight(names(x), 4 ))))
However this throws the error:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing numbers of rows: 18878, 7
I can see that the way I assign the value to the variable probably does not make sense but can't wrap my head around how to do it instead. Help appreciated.
Note that the substrRight function seems to be working perfectly fine and that
as.numeric(substrRight(names(x), 4 ))
yields the vector of years 1968-2016
This works in base R:
years <- sub(".*([0-9]{4}$)","\\1",names(my_list))
new_list <- lapply(seq_along(years), function(x) cbind(my_list[[x]],year=years[x]))
names(new_list) <- names(my_list)
with this self-made example data
df_1968 = data.frame(a=c(1,2,3),b=c(4,5,6))
df_1969 = data.frame(a=c(1,2,3),b=c(4,5,6))
df_1970 = data.frame(a=c(1,2,3),b=c(4,5,6))
my_list <- list(df_1968,df_1969,df_1970)
names(my_list) <- c("df_1968","df_1969","df_1970")
I get this output
> new_list
$df_1968
a b year
1 1 4 1968
2 2 5 1968
3 3 6 1968
$df_1969
a b year
1 1 4 1969
2 2 5 1969
3 3 6 1969
$df_1970
a b year
1 1 4 1970
2 2 5 1970
3 3 6 1970
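One way to see why the lapply in the question fails: inside the anonymous function, x is a single data frame, so names(x) returns its column names rather than the list name, and cbind then tries to attach a vector with one entry per column. Map can pass each data frame together with its name instead. A sketch using the same example data, rebuilt here so the block runs on its own:

```r
my_list <- list(
  df_1968 = data.frame(a = c(1, 2, 3), b = c(4, 5, 6)),
  df_1969 = data.frame(a = c(1, 2, 3), b = c(4, 5, 6)),
  df_1970 = data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
)

# Inside lapply(my_list, function(x) ...), names(x) is c("a", "b").
names(my_list[[1]])

# Map iterates over the data frames and the list names in parallel;
# the result keeps the names of its first list argument.
new_list <- Map(function(d, nm) {
  d$year <- as.numeric(sub(".*_", "", nm))  # text after the last underscore
  d
}, my_list, names(my_list))
```

Here the year column is numeric, whereas the sub() version above stores it as character.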
The following function will loop through a named list of data frames and create a column year with the 4 last characters of the list's names.
I have simplified the function substrRight a bit. Since only the last characters are needed, it uses substring, which does not need an end position.
substrRight <- function(y, n) {
substring(y, nchar(y) - n + 1)
}
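# sapply(..., simplify = FALSE) returns a list and, unlike lapply over names(my_list), keeps the list names
my_list <- sapply(names(my_list), function(x){
my_list[[x]][["year"]] <- as.numeric(substrRight(x, 4))
my_list[[x]]
}, simplify = FALSE)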
Data creation code.
my_list <- lapply(1968:1970, function(i) data.frame(a = 1:5, b = letters[1:5]))
names(my_list) <- paste("df", 1968:1970, sep = "_")
Here is my df (my full data set has up to 20 such column items; for simplicity I show only the first 3, i.e. INC_D.1, INC_D.2, INC_D.3):
Item <- c("A","B","C")
INC_D.1 <- c("10A345","255789","402B56")
CODE_D.1 <- c("2","4","5")
INC_D.2 <- c("675C98","404D34","203559")
CODE_D.2 <- c("5","3","2")
INC_D.3 <- c("LG99w0e03","1025gg205","w2krt2")
CODE_D.3 <- c("3","2","2")
df <- as.data.frame(cbind(Item,INC_D.1,CODE_D.1,INC_D.2,CODE_D.2,INC_D.3,CODE_D.3))
Originally I used the following code to check whether each column exists and to create the new variables one by one:
if("CODE_D.1" %in% colnames(df))
{df$INC_D.1 <- as.character(df$INC_D.1)
df$INC_D.1.2only <- as.character(ifelse(df$CODE_D.1=="2",df$INC_D.1,""))}
if("CODE_D.2" %in% colnames(df))
{df$INC_D.2 <- as.character(df$INC_D.2)
df$INC_D.2.2only <- as.character(ifelse(df$CODE_D.2=="2",df$INC_D.2,""))}
if("CODE_D.3" %in% colnames(df))
{df$INC_D.3 <- as.character(df$INC_D.3)
df$INC_D.3.2only <- as.character(ifelse(df$CODE_D.3=="2",df$INC_D.3,""))}
I am trying to rewrite the code using a for loop:
for (i in 1:3){
if(paste0("CODE_D.",i) %in% colnames(df)){
for (j in 1:nrow(df)){
if(df[paste0("CODE_D.",i)][j,]=="2"){
print(paste0("True:[INC=",i,",ROW=",j,"]")) #Check
df[paste0("INC_D.",i,".2only")] <- c(rep("",nrow(df)))
df[paste0("INC_D.",i,".2only")][j,] <- as.character(df[paste0("INC_D.",i)][j,])
}
}
}
}
The for loop runs, but one of the elements of INC_D.3.2only is missing. Here is the output:
[1] "True:[INC=1,ROW=1]"
[1] "True:[INC=2,ROW=3]"
[1] "True:[INC=3,ROW=2]"
[1] "True:[INC=3,ROW=3]"
> df
  Item INC_D.1 CODE_D.1 INC_D.2 CODE_D.2   INC_D.3 CODE_D.3 INC_D.1.2only INC_D.2.2only INC_D.3.2only
1    A  10A345        2  675C98        5 LG99w0e03        3        10A345
2    B  255789        4  404D34        3 1025gg205        2
3    C  402B56        5  203559        2    w2krt2        2                      203559        w2krt2
How can I modify the loop to get the desired output?
An idea via base R would be to split based on column names, replace the values as per your condition and bind, i.e.
cbind.data.frame(df, do.call(cbind,
lapply(split.default(df[-1], gsub('.*_', '', names(df[-1]))), function(i)
{i <- replace(i[1], i[2] != 2, '');
names(i) <- paste0(names(i), 'only');
i})))
which gives,
Item INC_D.1 CODE_D.1 INC_D.2 CODE_D.2 INC_D.3 CODE_D.3 INC_D.1only INC_D.2only INC_D.3only
1 A 10A345 2 675C98 5 LG99w0e03 3 10A345 <NA> <NA>
2 B 255789 4 404D34 3 1025gg205 2 <NA> <NA> 1025gg205
3 C 402B56 5 203559 2 w2krt2 2 <NA> 203559 w2krt2
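For comparison, the loop in the question loses a value because the line that creates the new column (filling it with "") sits inside the row loop, so every matching row resets the whole column before writing its own value. A vectorised sketch that keeps the question's column names and uses ifelse once per column pair; stringsAsFactors = FALSE is set explicitly here so the columns are character in any R version.

```r
df <- data.frame(
  Item     = c("A", "B", "C"),
  INC_D.1  = c("10A345", "255789", "402B56"),
  CODE_D.1 = c("2", "4", "5"),
  INC_D.2  = c("675C98", "404D34", "203559"),
  CODE_D.2 = c("5", "3", "2"),
  INC_D.3  = c("LG99w0e03", "1025gg205", "w2krt2"),
  CODE_D.3 = c("3", "2", "2"),
  stringsAsFactors = FALSE
)

for (i in 1:3) {
  code_col <- paste0("CODE_D.", i)
  inc_col  <- paste0("INC_D.", i)
  if (code_col %in% colnames(df)) {
    # Keep the INC value where the code is "2", otherwise an empty string.
    df[[paste0(inc_col, ".2only")]] <-
      ifelse(df[[code_col]] == "2", df[[inc_col]], "")
  }
}
```

Because each new column is built in one assignment, there is no inner loop over rows and nothing gets overwritten.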
I want to delete the header from a dataframe that I have. I read in the data from a csv file then I transposed it, but it created a new header that is the name of the file and the row that the data is from in the file.
Here's an example for a dataframe df:
a.csv.1 a.csv.2 a.csv.3 ...
x 5 6 1 ...
y 2 3 2 ...
I want to delete the a.csv.n row, but when I try df <- df[-1,] it deletes row x and not the top.
If you really, really, really don't like column names, you may convert your data frame to a matrix (keeping possible coercion of variables of different class in mind), and then remove the dimnames.
dd <- data.frame(x1 = 1:5, x2 = 11:15)
mm1 <- as.matrix(dd)
mm2 <- matrix(mm1, ncol = ncol(dd), dimnames = NULL)
I add my previous comment here as well:
?data.frame: "The column names should be non-empty, and attempts to use empty names will have unsupported results.".
Set names to NULL
names(df) <- NULL
You can also use the header option in read.csv
You can use names(df) to change the header (the column names). If newname is a list of names, e.g. newname <- list("col1","col2","col3"), then names(df) <- newname will give you a data frame with the column names col1, col2, col3.
As @Henrik said, the column names should be non-empty. Setting names(df) <- NULL will give NA in the column names.
If your data is in a csv file and you read it with header=TRUE, the data frame will have the same column names as the csv file. If you set header=FALSE, R will assign the column names V1, V2, ... and the column names from the original csv file will appear as the first row.
anydata.csv
a b c d
1 1 2 3 13
2 2 3 1 21
read.csv("anydata.csv",header=TRUE)
a b c d
1 1 2 3 13
2 2 3 1 21
read.csv("anydata.csv",header=FALSE)
V1 V2 V3 V4
1 a b c d
2 1 2 3 13
3 2 3 1 21
You could use
setNames(dat, rep(" ", length(dat)))
where dat is the name of the data frame. Then all columns will have the name " " and hence will be 'invisible'.
This comes some years late, but you can simply assign a vector to rename the columns:
## if you want to delete all column names:
colnames(df)[] <- ""
## if you want to delete let's say column 1:
colnames(df)[1] <- ""
## if you want to delete 1 to 3 and 7:
colnames(df)[c(1:3,7)] <- ""
As already mentioned, not having column names just isn't something that is going to happen with a data frame. But I'm guessing that you don't care so much whether they are there; you just don't want to see them when you print your data frame? If so, you can write a new print function to get around that, like so:
> dat <- data.frame(var1=c("A","B","C"),var2=rnorm(3),var3=rnorm(3))
> print(dat)
var1 var2 var3
1 A 1.2771777 -0.5726623
2 B -1.5000047 1.3249348
3 C 0.1989117 -1.4016253
> ncol.print <- function(dat) print(matrix(as.matrix(dat),ncol=ncol(dat),dimnames=NULL),quote=F)
> ncol.print(dat)
[,1] [,2] [,3]
[1,] A 1.2771777 -0.5726623
[2,] B -1.5000047 1.3249348
[3,] C 0.1989117 -1.4016253
Your other option is to set your variable names to unique amounts of whitespace, for example:
> names(dat) <- c(" ", "  ", "   ")
> dat
1 A 1.2771777 -0.5726623
2 B -1.5000047 1.3249348
3 C 0.1989117 -1.4016253
You can also write a function to do this:
> blank.names <- function(dat){
+ for(i in 1:ncol(dat)){
+ names(dat)[i] <- paste(rep(" ",i),collapse="")
+ }
+ return(dat)
+ }
> dat <- data.frame(var1=c("A","B","C"),var2=rnorm(3),var3=rnorm(3))
> dat
var1 var2 var3
1 A -1.01230289 1.2740237
2 B -0.13855777 0.4689117
3 C -0.09703034 -0.4321877
> blank.names(dat)
1 A -1.01230289 1.2740237
2 B -0.13855777 0.4689117
3 C -0.09703034 -0.4321877
But generally I don't think any of this should be done.
A function that I use in one of my R scripts:
read_matrix <- function (csvfile) {
a <- read.csv(csvfile, header=FALSE)
matrix(as.matrix(a), ncol=ncol(a), dimnames=NULL)
}
How to call this:
iops_even <- read_matrix('even_iops_Jan15.csv')
iops_odd <- read_matrix('odd_iops_Jan15.csv')
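If the goal is only to print or export the data without the header line, write.table has a col.names argument for that, and the data frame itself keeps its names. A small sketch, assuming an example data frame dd like the one used in an earlier answer:

```r
dd <- data.frame(x1 = 1:5, x2 = 11:15)

# Print to the console without the header line; row names are kept.
write.table(dd, col.names = FALSE, quote = FALSE)

# Capture the same kind of output as text, this time also without row names,
# to check that no header line is present.
out <- capture.output(write.table(dd, col.names = FALSE, row.names = FALSE))
```

Passing a file name as the file argument writes the same header-free output to disk instead of the console.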
If you are working in Python with pandas rather than in R, you can simply do:
print(df.to_string(header=False))
If you want to remove the row index as well, you can do:
print(df.to_string(index=False,header=False))
I have a list of files. I also have a list of "names" which I substr() from the actual filenames of these files. I would like to add a new column to each of the files in the list. This column will contain the corresponding element in "names", repeated once for each row in the file.
For example:
df1 <- data.frame(x = 1:3, y=letters[1:3])
df2 <- data.frame(x = 4:6, y=letters[4:6])
filelist <- list(df1,df2)
ID <- c("1A","IB")
Pseudocode
for( i in length(filelist)){
filelist[i]$SampleID <- rep(ID[i],nrow(filelist[i])
}
// basically create a new column in each of the dataframes in filelist, and fill the column with repeated corresponding values of ID
My output should be like:
filelist[1] should be:
x y SampleID
1 1 a 1A
2 2 b 1A
3 3 c 1A
filelist[2] should be:
x y SampleID
1 4 d IB
2 5 e IB
3 6 f IB
and so on.
Any idea how this could be done?
An alternate solution is to use cbind, taking advantage of the fact that R will recycle the values of a shorter vector.
For example:
x <- df2 # from above
cbind(x, NewColumn="Singleton")
# x y NewColumn
# 1 4 d Singleton
# 2 5 e Singleton
# 3 6 f Singleton
There is no need to use rep. R does that for you.
Therefore, you could put cbind(filelist[[i]], ID[[i]]) in your for loop or, as @Sven pointed out, you can use the cleaner mapply:
filelist <- mapply(cbind, filelist, "SampleID"=ID, SIMPLIFY=FALSE)
This is a corrected version of your loop:
for( i in seq_along(filelist)){
filelist[[i]]$SampleID <- rep(ID[i],nrow(filelist[[i]]))
}
There were 3 problems:
A final ) was missing after the command in the body.
Elements of lists are accessed by [[, not by [. [ returns a list of length one. [[ returns the element only.
length(filelist) is just one value, so the loop runs for the last element of the list only. I replaced it with seq_along(filelist).
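The last point is easy to check at the console. A small sketch contrasting the two loop ranges on a hypothetical three-element list:

```r
lst <- list("a", "b", "c")

# for (i in length(lst)) iterates over the single value 3.
visited_length <- integer(0)
for (i in length(lst)) visited_length <- c(visited_length, i)

# for (i in seq_along(lst)) iterates over 1, 2, 3.
visited_seq <- integer(0)
for (i in seq_along(lst)) visited_seq <- c(visited_seq, i)

visited_length  # only the last index
visited_seq     # every index
```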
A more efficient approach is to use mapply for the task:
mapply(function(x, y) "[<-"(x, "SampleID", value = y) ,
filelist, ID, SIMPLIFY = FALSE)
This one worked for me:
Create a new column for every dataframe in a list; fill the values of the new column based on existing column. (In your case IDs).
Example:
# Create dummy data
df1<-data.frame(a = c(1,2,3))
df2<-data.frame(a = c(5,6,7))
# Create a list
l<-list(df1, df2)
> l
[[1]]
a
1 1
2 2
3 3
[[2]]
a
1 5
2 6
3 7
# add new column 'b'
# create 'b' values based on column 'a'
l2<-lapply(l, function(x)
cbind(x, b = x$a*4))
Results in:
> l2
[[1]]
a b
1 1 4
2 2 8
3 3 12
[[2]]
a b
1 5 20
2 6 24
3 7 28
In your case, where the values come from ID rather than from an existing column, something like:
filelist<-Map(function(x, id)
cbind(x, SampleID = id), filelist, ID)
The purrr way, using map2
library(dplyr)
library(purrr)
map2(filelist, ID, ~cbind(.x, SampleID = .y))
#[[1]]
# x y SampleID
#1 1 a 1A
#2 2 b 1A
#3 3 c 1A
#[[2]]
# x y SampleID
#1 4 d IB
#2 5 e IB
#3 6 f IB
Or you can also use
map2(filelist, ID, ~.x %>% mutate(SampleID = .y))
If you name the list, we can use imap and add the new column based on its name.
names(filelist) <- c("1A","IB")
imap(filelist, ~cbind(.x, SampleID = .y))
#OR
#imap(filelist, ~.x %>% mutate(SampleID = .y))
which is similar to using Map
Map(cbind, filelist, SampleID = names(filelist))
A tricky way:
library(plyr)
names(filelist) <- ID
result <- ldply(filelist, data.frame)
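Note that this returns one combined data frame rather than a list, with the list names stored in an id column. For readers without plyr, a base R sketch of the same idea follows; it is not the ldply implementation, and the SampleID column name is taken from the question.

```r
df1 <- data.frame(x = 1:3, y = letters[1:3])
df2 <- data.frame(x = 4:6, y = letters[4:6])
filelist <- list(df1, df2)
ID <- c("1A", "IB")

# Add the ID column to each data frame, then stack them row-wise.
with_id <- Map(function(d, id) cbind(SampleID = id, d), filelist, ID)
result <- do.call(rbind, with_id)
rownames(result) <- NULL
```

The intermediate with_id is the list form asked for in the question; result is the stacked version.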
data_lst <- list(
data_1 = data.frame(c1 = 1:3, c2 = 3:1),
data_2 = data.frame(c1 = 1:3, c2 = 3:1)
)
f <- function (data, name){
data$name <- name
data
}
Map(f, data_lst , names(data_lst))