R, Dataset without column names - r

Complete noob here, especially with R.
For a school project I have to work with a specific dataset that doesn't include column names in the data file itself, but there is a .txt file with extra information about the dataset, including the column names. The problem is that when I load the dataset, RStudio assumes that the first line of data is actually the column names. Initially I just replaced the names with colnames(), but by doing so I ended up losing the first line of data, and I'm sure that's not the right way of dealing with it.
How can I add the correct column names without deleting the first line of data? (Preferably inside R, due to school work requirements.)
Thanks in advance!

When reading the data with read.table, use header = FALSE so that the first line is treated as data and R assigns default column names (V1, V2, ...) automatically:
df1 <- read.table('file.txt', header = FALSE)
Then we can assign the preferred column names from the other .txt file:
colnames(df1) <- scan('names.txt', what = '', quiet = TRUE)
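The two steps above can be tried end to end with throwaway files. This is a minimal sketch: the file contents and the names id, value and label are made up for illustration, and it assumes the names file holds one whitespace-separated name per column.

```r
# Hypothetical stand-ins for the real data file and the names file
data_path  <- tempfile(fileext = ".txt")
names_path <- tempfile(fileext = ".txt")
writeLines(c("1 2.5 a", "2 3.5 b"), data_path)
writeLines(c("id", "value", "label"), names_path)

# header = FALSE keeps the first line as data; columns start as V1, V2, V3
df1 <- read.table(data_path, header = FALSE)

# Replace the default names with the ones read from the names file
colnames(df1) <- scan(names_path, what = "", quiet = TRUE)

print(df1)
```

The names can also be supplied while reading, through the col.names argument of read.table, which avoids the separate colnames() step.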

Related

Join multiple columns from two data frames with different column names in each

I have the following problem: I have two Excel files imported into R, and I want to fill the empty data frame that has the correct column names with data from the phone-populated data frame, which has the wrong column names as well as other columns I don't need. I have a mapping key where variable name = wrong variable name. I figure the best way to do this is to use a join such as
df <- right join(keepdf, wrongdf, by= c('variable name' = 'wrong variable name', keep=FALSE)
My question is how to pass many pairs of names into the by= parameter. I have tried nested for loops over the respective variable names in each list, but it hasn't worked. Also, if anyone has a suggestion on how to accomplish this better, I would really appreciate it.
If the code at the top is the exact code you are trying to run, you need to fix the function call: the function is right_join, not right join, and the closing parenthesis of c() is missing. You also need to load dplyr.
# If you want to keep all rows of keepdf, it should be a left_join
library(dplyr)
df <- left_join(keepdf, wrongdf, by= c('variable name' = 'wrong variable name'))
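For the part of the question about passing many name pairs, one possible approach is to build the pairs from the mapping key instead of typing them out. The sketch below uses base R's merge() so it runs without extra packages; the data frames, the key table and all column names in it are hypothetical.

```r
# Hypothetical mapping key: one row per (correct name, wrong name) pair
key <- data.frame(
  right_name = c("id", "site"),
  wrong_name = c("ID_num", "Site.Code"),
  stringsAsFactors = FALSE
)

# Hypothetical data frames standing in for keepdf and wrongdf
keepdf  <- data.frame(id = c(1, 2, 3), site = c("A", "B", "C"),
                      target = c(10, 20, 30))
wrongdf <- data.frame(ID_num = c(1, 2), Site.Code = c("A", "B"),
                      phone = c("555-0100", "555-0101"))

# Named character vector: names are keepdf columns, values are wrongdf columns
by_pairs <- setNames(key$wrong_name, key$right_name)

# all.x = TRUE keeps every row of keepdf, like a left join
joined <- merge(keepdf, wrongdf,
                by.x = names(by_pairs), by.y = unname(by_pairs),
                all.x = TRUE)
print(joined)
```

The same named vector has the shape that dplyr expects, so left_join(keepdf, wrongdf, by = by_pairs) should behave equivalently when dplyr is loaded.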

How do I select columns by name while ignoring certain characters?

I'm trying to pull data from a file, but only pull certain columns based on the column name.
I have this bit of code:
filepath <- ([my filepath])
files <- list.files(filepath, full.names=T)
newData <- fread(file,select=c(selectCols))
selectCols contains a list of column names (as strings). But in the data I'm pulling, there may be underscores placed differently in each file for the same data.
Here's an example:
PERIOD_ID
PERIOD_ID_
_PERIOD_ID_
And so on. I know I can use gsub to change the column names once the data is already pulled:
colnames(newData) <- gsub("_","",colnames(newData))
Then I can select by column name, but given that it's a lot of data, I'm not sure this is the most efficient approach.
Is there a way to ignore underscores or other characters within the fread function?
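One possible workaround, sketched here under assumptions, is to read only the header line first, strip the underscores from both the header and the wanted names, and work out which actual column names to request. The file contents and column names below are invented, the header is assumed to be comma-separated with no quoted commas, and base R's read.csv stands in for fread so the sketch runs without data.table.

```r
# Hypothetical file whose header has stray underscores
csv_path <- tempfile(fileext = ".csv")
writeLines(c("_PERIOD_ID_,VALUE_,other", "1,10,x", "2,20,y"), csv_path)

selectCols <- c("PERIOD_ID", "VALUE")

# Helper (hypothetical name): drop every underscore before comparing
normalize <- function(x) gsub("_", "", x, fixed = TRUE)

# Read only the first line and split it into the actual column names
header <- strsplit(readLines(csv_path, n = 1), ",", fixed = TRUE)[[1]]

# Map each wanted name to the spelling that actually occurs in this file
actual <- header[match(normalize(selectCols), normalize(header))]

# Base R stand-in for the real read; keeps only the matched columns
newData <- read.csv(csv_path, check.names = FALSE)[actual]
print(names(newData))
```

With data.table loaded, the matched names could instead be passed as fread(csv_path, select = actual), so that only those columns are read from disk.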

How to separate one column into many columns in a .txt file?

I've been given a data set for a project that I need to reformat in order to work with it.
The problem is that all of the column names and corresponding values are mashed into one column in the file. As shown in the picture.
I'm new to R so I hardly know how to work with complex commands.
My Questions:
Is there a simple way to separate this from 1 column into 12 columns?
Desired output:
I'll also need to remove the periods between the column names and the semicolons between the values.
I just need to be able to do basic statistical analysis on the table.
Thanks
Although your data is in one column, it is semicolon-separated. The read.csv function accepts a column separator:
df <- read.csv(file="path/to/your/file.txt", skip=1, header=FALSE, sep=";")
The above call will generate columns based on a ; separator. I skip the first line and ignore the header, because it is a single string. You may manually assign the column names via:
names(df) <- c("name1", "name2", ..., "name12")
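If the header line really is a single string of names joined by periods, as the question describes, the names could also be split out of it rather than typed by hand. This is a sketch with made-up file contents and three columns instead of twelve; it assumes no column name itself contains a period.

```r
# Hypothetical file: period-separated header, semicolon-separated values
path <- tempfile(fileext = ".txt")
writeLines(c("id.age.score", "1;34;7.5", "2;29;8.1"), path)

# Read the values, skipping the header line
df <- read.csv(path, skip = 1, header = FALSE, sep = ";")

# Split the header on literal periods and use the pieces as column names
header <- readLines(path, n = 1)
names(df) <- strsplit(header, ".", fixed = TRUE)[[1]]

print(df)
```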

Removing rows causes "row.names" column to appear when displayed with View()

To remove rows from a data frame, I use the following command:
data <- data[-1, ]
for example to remove the first row. I need to remove the first 6 rows, so I used the following:
data <- data[-c(1,2,3,4,5,6), ]
OR
data <- data[-(1:6), ]
This works as far as removing the rows, but it introduced a new column called row.names that I cannot get rid of unless I use the command:
row.names(data) <- NULL
What is the reason for this? Is there a better way of removing a number of rows/columns with one command?
Example:
after the following code:
tquery <- tquery[-(1:6), ]
This is the data:
Although it seems as such, you are not actually adding a column to the data. What you are seeing is just a result of using View(). The function is showing the "row.names" attribute of the data frame as the first column, but you didn't really add the column.
This is expected and documented behavior. From the Details section of help(View)
If there are row names on the data frame that are not 1:nrow, they are displayed in a separate first column called row.names.
So since you subsetted the data, the row names are technically not 1:nrow any more and hence the new column is introduced in the viewer.
Print your data in the console and you'll see the difference.
View(mtcars) ## because the mtcars row names are not 1:nrow
versus
mtcars
Basically, don't trust View() to display an exact representation of the actual data. Instead use attributes(), *names(), dim(), length(), etc. or just peek at the data with head().
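The difference can be checked without View() at all. A small sketch with a made-up data frame: after dropping rows, the column count is unchanged and only the row names attribute is no longer 1:nrow.

```r
# Made-up data frame with the default row names 1..10
df <- data.frame(x = 11:20, y = letters[1:10])

# Drop the first six rows
trimmed <- df[-(1:6), ]

ncol(trimmed)               # still 2: no column was added
before <- rownames(trimmed) # "7" "8" "9" "10", no longer 1:nrow

# Resetting the attribute restores sequential row names
rownames(trimmed) <- NULL
after <- rownames(trimmed)  # "1" "2" "3" "4"
```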
See R help via "?row.names" for more info. From the documentation: "All data frames have a row names attribute."
?row.names ## get more information about row.names from r help
row.names is not a new column, but rather an attribute of every data frame. It is simply metadata and is ignored by most functions, so it will not interfere when you use the data in a function. When you write the data out (e.g. to CSV), note that write.csv() includes row names as a first column by default; pass row.names = FALSE to omit them. This is similar to how Excel shows row numbers in the left margin, which is referential data for the application.
str(your_dataframe) ## see that those columns don't exist
colnames(your_dataframe) ## see column names

R - column names in read.table and write.table starting with number and containing space

I am importing a CSV of stock data into R. The column names are stock tickers, which start with a number and contain a space, e.g. "5560 JP". After reading the file into R, an "X" is prepended to each column name and the space is replaced by ".", e.g. "X5560.JP". After all the work is done in R, I want to write the processed data back to a new CSV with the original column names, e.g. "5560 JP" instead of "X5560.JP". How can I do that?
Thank you!
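When you use write.table to save your data, you can set the column names to whatever you like through the col.names argument. write.csv does not allow this (it ignores col.names with a warning), so there you restore the names on the data frame before writing.
But that assumes you have the original column names available.
Once you've read in the data and R has converted the names, you've lost that information. To get around this, you can suppress the conversion when reading, save the original names, and then convert them yourself:
df <- read.csv("mydata.csv", check.names=FALSE)
orig.cols <- colnames(df)
colnames(df) <- make.names(colnames(df))
[your original code]
colnames(df) <- orig.cols  # restore the original names before writing
write.csv(df, "newdata.csv", row.names=FALSE)  # "newdata.csv" is a placeholder file name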
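Because write.csv ignores a col.names argument (with a warning), one way to get the original headers back into the file is to restore them on the data frame just before writing. The round trip below is a sketch using temporary files and invented ticker data.

```r
# Hypothetical input with ticker-style headers
in_path  <- tempfile(fileext = ".csv")
out_path <- tempfile(fileext = ".csv")
writeLines(c("Date,5560 JP,7203 JP",
             "2020-01-06,101.5,7650",
             "2020-01-07,102.0,7700"), in_path)

# Keep the headers exactly as written in the file
df <- read.csv(in_path, check.names = FALSE)
orig.cols <- colnames(df)

# Work with syntactic names while processing
colnames(df) <- make.names(orig.cols)
df$X5560.JP <- df$X5560.JP * 2   # stand-in for the real processing

# Restore the original headers, then write without row names
colnames(df) <- orig.cols
write.csv(df, out_path, row.names = FALSE)

# Re-read to confirm the headers survived
check <- read.csv(out_path, check.names = FALSE)
print(colnames(check))
```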
