How to indicate row.names=1 using fread() in data.table?

I want to consider the first column in my .csv file as a sequence of rownames. Usually I used to do the following:
read.csv("example_file.csv", row.names=1)
But I want to do this with the fread() function in the data.table R package, as it runs very quickly.

X <- as.matrix(fread("bigmatrix.csv"),rownames=1)
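A minimal sketch of that one-liner on a small file, assuming the data.table package is installed and that the CSV was written with row names in its first column. The matrix `m` and the temporary file are made up for illustration.

```r
library(data.table)

# Hypothetical example data: a small matrix with row names
m <- matrix(1:6, nrow = 3,
            dimnames = list(c("a", "b", "c"), c("x", "y")))

# write.csv() stores the row names as an unnamed first column
path <- tempfile(fileext = ".csv")
write.csv(m, path)

# fread() reads that column as ordinary data;
# as.matrix(..., rownames = 1) turns column 1 into the row names
dt <- fread(path)
X <- as.matrix(dt, rownames = 1)
print(X)
```

The result is a matrix rather than a data.table, which suits purely numeric data such as a large matrix file.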

Why not save the row names in a column:
df <- data.frame(x=rnorm(1000))
df$row_name = row.names(df)
fwrite(df,file="example_file.csv")
Then you can load the saved CSV.
df <- fread(file="example_file.csv")

From a quick search, data.table never uses row names. Since a data.table inherits from data.frame, it still has the row names attribute, but it never uses it.
However, you can probably use this answer (similar post) and later turn the row-name column into your actual row names, though it might not be efficient.

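Just one extra step: convert the result to a data.frame, then assign the row names.
library(data.table)
a <- as.data.frame(fread(file="example_file.csv"))
# V1 is the name fread() gives an unnamed first column
row.names(a) <- a$V1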

Related

Why can I not rename columns of a tbl?

I came across some weird behavior in dplyr's tbl:
df <- as_tibble(iris)
i <- colnames(df)[5]
df$new <- df[,i]
For some reason the newly created column new is named new.Species (at least when I View(df)), although it should be named just new.
I do not understand why this happens. An obvious fix is to simply save df as a data.frame, but I would still like to understand what happens here.
Because the df[,i] is still a tibble with one column. We need df[[i]]:
df$new <- df[[i]]
With a data.frame, [ uses drop = TRUE by default (see ?Extract), so selecting a single column returns a vector. A tibble's [ never drops dimensions, so it returns a one-column tibble; use [[ to extract the column itself.
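The difference can be reproduced in base R alone, since passing drop = FALSE to a data.frame gives the same shape of result that a tibble's [ returns. This is only a sketch; the variable names are made up for illustration.

```r
df <- iris
i <- colnames(df)[5]                 # "Species"

as_vector <- df[, i]                 # data.frame default: dropped to a factor vector
as_frame  <- df[, i, drop = FALSE]   # one-column data.frame, like a tibble's `[`
by_name   <- df[[i]]                 # `[[` always returns the column itself

class(as_vector)
class(as_frame)
identical(by_name, as_vector)
```

Assigning `as_frame` to a new column nests a whole data frame inside that column, which is why the viewer shows a compound name.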

Loop to create one dataframe from multiple URLs

I have a character vector with multiple URLs that each host a csv of crime data for a certain year. Is there an easy way to create a loop that will read.csv and rbind all the dataframes without having to run read.csv 8 times over? The vector of URLs is below
urls <- c('https://opendata.arcgis.com/datasets/73cd2f2858714cd1a7e2859f8e6e4de4_33.csv',
'https://opendata.arcgis.com/datasets/fdacfbdda7654e06a161352247d3a2f0_34.csv',
'https://opendata.arcgis.com/datasets/9d5485ffae914c5f97047a7dd86e115b_35.csv',
'https://opendata.arcgis.com/datasets/010ac88c55b1409bb67c9270c8fc18b5_11.csv',
'https://opendata.arcgis.com/datasets/5fa2e43557f7484d89aac9e1e76158c9_10.csv',
'https://opendata.arcgis.com/datasets/6eaf3e9713de44d3aa103622d51053b5_9.csv',
'https://opendata.arcgis.com/datasets/35034fcb3b36499c84c94c069ab1a966_27.csv',
'https://opendata.arcgis.com/datasets/bda20763840448b58f8383bae800a843_26.csv'
)
The function map_dfr from the purrr package does exactly what you want. It applies a function to every element of an input (in this case urls) and binds together the result by row.
library(tidyverse)
map_dfr(urls, read_csv)
I used read_csv() instead of read.csv() out of personal preference but both will work.
In base R:
result <- lapply(urls, read.csv, stringsAsFactors = FALSE)
result <- do.call(rbind, result)
I usually take this approach because I want to keep each CSV as a separate object in case I later need to analyze them individually. Otherwise, you don't need a for loop.
for (i in seq_along(urls)) assign(paste0("mycsv-", i), read.csv(url(urls[i]), header = TRUE))
df.list <- mget(ls(pattern = "^mycsv-"))
#use plyr if different column names and need to know which row comes from which csv file
library(plyr)
df <- ldply(df.list) #you can remove first column if you wish
#Alternative solution in base R instead of using plyr
#if they have same column names and you only want rbind then you can do this:
df <- do.call("rbind", df.list)
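A self-contained sketch of the lapply() plus do.call(rbind, ...) pattern that also records which file each row came from. It uses small temporary files in place of the URLs, and the column names (`id`, `offense`, `source`) are invented for illustration.

```r
# Create three small CSV files standing in for the yearly downloads
tmp_dir <- tempfile("crime_")
dir.create(tmp_dir)
paths <- file.path(tmp_dir, paste0("year", 1:3, ".csv"))
for (k in seq_along(paths)) {
  write.csv(data.frame(id = 1:2 + 10 * k, offense = c("theft", "assault")),
            paths[k], row.names = FALSE)
}

# Read every file into a list of data frames, named by file
frames <- lapply(paths, read.csv, stringsAsFactors = FALSE)
names(frames) <- basename(paths)

# Tag each data frame with its origin, then stack them by row
frames <- Map(function(d, src) { d$source <- src; d }, frames, names(frames))
combined <- do.call(rbind, frames)
rownames(combined) <- NULL
print(combined)
```

This only works when all files share the same column names; keeping the list around also leaves each file available for separate analysis.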

renaming subset of columns in r with paste0

I have a data frame (my_df) with columns named after individual county numbers. I melted/cast the data from a much larger set to get to this point. The first column name is year and it is a list of years from 1970-2011. The next 3010 columns are counties. However, I'd like to rename the county columns to be "county_" + county number.
This code executes in R but for whatever reason doesn't update the column names. They remain solely the numbers. Any help?
new_col_names = paste0("county_",colnames(my_df[,2:ncol(my_df)]))
colnames(my_df[,2:ncol(my_df)]) = new_col_names
The problem is the subsetting within the colnames call.
Try names(my_df) <- c(names(my_df)[1], new_col_names) instead.
Note: names and colnames are interchangeable for data.frame objects.
EDIT: alternate approach suggested by flodel, subsetting outside the function call:
names(my_df)[-1] <- new_col_names
colnames() is mainly meant for a matrix (or matrix-like object); for a data.frame, simply try names().
Example:
my_df <- data.frame(a=c(1,2,3,4,5), b=rnorm(5), c=rnorm(5), d=rnorm(5))
new_col_names=paste0("county_",colnames(my_df[,2:ncol(my_df)]))
names(my_df) <- c(names(my_df)[1], new_col_names)
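A small sketch of the fix with numeric column names like those in the question. The county numbers are invented, and check.names = FALSE is used only so that the example columns keep their purely numeric names.

```r
# Hypothetical miniature of the data: a year column plus two county columns
my_df <- data.frame(year = 1970:1972,
                    `1001` = 1:3,
                    `1003` = 4:6,
                    check.names = FALSE)

# Subset the names vector itself, not the data frame inside colnames()
names(my_df)[-1] <- paste0("county_", names(my_df)[-1])
names(my_df)
```

Because the assignment targets names(my_df) directly, the new names are stored on my_df itself rather than on a temporary subset.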

Rename multiple dataframe columns, referenced by current names

I want to rename some random columns of a large data frame and I want to use the current column names, not the indexes. Column indexes might change if I add or remove columns to the data, so I figure using the existing column names is a more stable solution.
This is what I have now:
mydf = merge(df.1, df.2)
colnames(mydf)[which(colnames(mydf) == "MyName.1")] = "MyNewName"
Can I simplify this code, either the original merge() call or just the second line? "MyName.1" is actually the result of an xts merge of two different xts objects.
The trouble with changing column names of a data.frame is that, almost unbelievably, the entire data.frame is copied. Even when it's in .GlobalEnv and no other variable points to it.
The data.table package has a setnames() function which changes column names by reference without copying the whole dataset. data.table is different in that it doesn't copy-on-write, which can be very important for large datasets. (You did say your data set was large.) Simply provide the old and the new names:
require(data.table)
setnames(DT,"MyName.1", "MyNewName")
# or more explicit:
setnames(DT, old = "MyName.1", new = "MyNewName")
?setnames
names(mydf)[names(mydf) == "MyName.1"] = "MyNewName" # 13 characters shorter.
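You may eventually want to replace several names at once, though. In that case, use %in% instead of == and supply a vector of old names and a vector of new names of equal length. Note that the new names are then assigned in column order, not in the order of the old-name vector.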
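For several names at once, one option in base R is match(), which pairs each old name with its new name regardless of column order. This is a sketch with made-up column names; `old`, `new`, and `idx` are illustrative variables.

```r
mydf <- data.frame(id = 1:2,
                   MyName.2 = c("x", "y"),
                   MyName.1 = c(1.5, 2.5))

old <- c("MyName.1", "MyName.2")
new <- c("MyNewName", "MyOtherName")

# Position of each old name among the current column names
idx <- match(old, names(mydf))
stopifnot(!anyNA(idx))   # assumes every old name is present

names(mydf)[idx] <- new
names(mydf)
```

Unlike a logical %in% index, match() keeps each new name attached to its intended old name even when the columns appear in a different order.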
plyr has a rename function for just this purpose:
library(plyr)
mydf <- rename(mydf, c("MyName.1" = "MyNewName"))
names(mydf) <- sub("MyName\\.1", "MyNewName", names(mydf))
This would generalize better to a multiple-name-change strategy if you put a stem as a pattern to be replaced using gsub instead of sub.
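You can use the str_replace function of the stringr package. Wrap the pattern in fixed() so the dot is matched literally rather than as a regex wildcard:
library(stringr)
names(mydf) <- str_replace(names(mydf), fixed("MyName.1"), "MyNewName")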

In R, how to create a loop to divide columns in a data frame

In R, I would like to create a loop which takes the first 3000 columns of my data frame and writes them into one file, the next 3000 columns into another file, and so on and so forth until all columns have been divided as such. What would be the best way to do this? I understand there are the isplit and iterators functions available now via CRAN, but I am really unsure how to go about this. Any suggestions please?
You could try something like:
library(plyr)
max.col <- ncol(x)
l_ply(seq(1, max.col, by=3000), function(i)
write.table(x[,i:min(i+2999, max.col)], file=paste("i", i, sep="-"))
)
Not sure why you'd bother loading plyr... assuming your data frame is df... (stole the wise use of min() from Shane's answer)
maxCol <- ncol(df)
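for (i in seq(1, maxCol, by = 3000)) {
# drop = FALSE keeps a single leftover column as a data frame; each chunk gets its own file
write.table(df[, i:min(i+2999, maxCol), drop = FALSE], file = paste0("i-", i))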
}
You may want to edit the write.table command above to add in your preferred formatting.
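The chunking logic can be tried on a small scale first. The sketch below uses a 7-column data frame and chunks of 3 columns instead of 3000; the output directory and file names are made up for illustration.

```r
# Hypothetical small data frame: 10 rows, 7 columns
df <- as.data.frame(matrix(1:70, nrow = 10))
chunk_size <- 3

out_dir <- tempfile("chunks_")
dir.create(out_dir)

for (i in seq(1, ncol(df), by = chunk_size)) {
  # Columns of this chunk; min() stops the last chunk at the final column
  cols <- i:min(i + chunk_size - 1, ncol(df))
  write.table(df[, cols, drop = FALSE],
              file = file.path(out_dir, paste0("cols-", i, ".txt")),
              row.names = FALSE)
}

list.files(out_dir)
```

The last chunk here holds a single column, which is the case where drop = FALSE matters: without it the subset would collapse to a plain vector.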
