I am confused about what the following code is doing:
X2_X26 <- paste("X", 2:26, sep = "")
portf_exret <- paste("excess_return_portfolio", 1:25, sep ="")
X27_X51 <- paste("X", 27:51, sep = "")
logsize_p <- paste("logsize_portfolio", 1:25, sep = "")
setnames(datafile, old = 'X1', new = 'market_exret')
setnames(datafile, old = X2_X26, new = portf_exret)
setnames(datafile, old = X27_X51, new = logsize_p)
For the first line, is it saying: create X2, X3, ..., X26 (each of them a separate column), and then store them in a data frame called "X2_X26"?
Then, does the setnames function change the name of the X2_X26 data frame to portf_exret, with nothing else changing?
As we have not previously defined 'X1', does setnames(datafile, old = 'X1', ...) refer to the first column in the data frame by default?
What is this code doing? Why do we need to change the column numbering from 2:26 to 1:25?
Thank you very much for your help.
datafile is a data frame of class data.table.
It has columns named "X1", "X2", "X3", ... "X51" (and possibly additional columns).
The lines X2_X26 <- paste("X", 2:26, sep = "") and X27_X51 <- paste("X", 27:51, sep = "") create character vectors, each of length 25, containing the strings "X2", "X3", ... "X26" and "X27", "X28", ... "X51", respectively. They do not create columns or data frames.
The line assigning to portf_exret creates a character vector of length 25 that will eventually replace the names of the first set of generic columns, X2 through X26. This vector looks like this: "excess_return_portfolio1", "excess_return_portfolio2", ... "excess_return_portfolio25".
The line assigning to logsize_p creates a character vector of length 25 that will eventually replace the names of the second set of generic columns, X27 through X51. This vector looks like this: "logsize_portfolio1", "logsize_portfolio2", ... "logsize_portfolio25".
Finally, data.table::setnames() is called three times, each time taking datafile (the object whose columns should be renamed), the old names, and the new names. 'X1' is matched by name, not by position. These three lines could also have been combined like this: setnames(datafile, old=c("X1", X2_X26, X27_X51), new=c("market_exret", portf_exret, logsize_p))
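To see the renaming on something small, here is a minimal sketch. It assumes a toy table with only five columns (X1 to X5) instead of 51; the toy values and the shortened ranges are illustrative assumptions, not the asker's data.

```r
library(data.table)

# Toy stand-in for datafile: 5 generic columns instead of 51 (assumption).
datafile <- data.table(X1 = 1:3, X2 = 4:6, X3 = 7:9, X4 = 10:12, X5 = 13:15)

# paste() only builds character vectors of names; it creates no columns.
old_ret  <- paste("X", 2:3, sep = "")                        # "X2" "X3"
new_ret  <- paste("excess_return_portfolio", 1:2, sep = "")
old_size <- paste("X", 4:5, sep = "")                        # "X4" "X5"
new_size <- paste("logsize_portfolio", 1:2, sep = "")

# setnames() renames by reference: only the column names change, not the data.
setnames(datafile,
         old = c("X1", old_ret, old_size),
         new = c("market_exret", new_ret, new_size))

names(datafile)
```

The shift from 2:26 to 1:25 is only a relabelling: column X1 holds the market return, so the first portfolio sits in X2 and is named portfolio 1.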
I would like to paste the column name to each cell, and I've found this answer in another post.
df[] <- paste(col(df, TRUE), as.matrix(df), sep = ":")
But how should I modify it if I want to apply it only to the first column of a data frame?
If you want to do this for only one column, you can do:
df[[1]] <- paste(df[[1]], names(df)[1], sep = ":")
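Note that the one-liner above yields "value:name", whereas the expression quoted in the question puts the column name first. A small sketch of the name-first variant, using a made-up two-column data frame (the data is hypothetical):

```r
# Hypothetical example data (not from the original post).
df <- data.frame(a = c("x", "y"), b = c("u", "v"), stringsAsFactors = FALSE)

# Prefix only the first column with its own name, matching the "name:value"
# order produced by paste(col(df, TRUE), as.matrix(df), sep = ":").
df[[1]] <- paste(names(df)[1], df[[1]], sep = ":")

df
```

The second column is left untouched because only df[[1]] is reassigned.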
I have a DataFrame from which I've created another DataFrame. Somewhere along the line, things got messed up, but I'm not sure where, and how to fix it.
The code worked on the first dataframe, so I assume it's some sort of type mismatch? Do I need to convert the fields back to string somehow?
##creating the second data frame
adat2 <- data.frame(id=character(), Title=character(), Domain=character(), lemtext1=character(), Language=character(), day=character())
##copying from the first one, whilst splitting rows into multiple rows based on lemtext
for (row in 1:nrow(adat1)) {
splitlines <- strsplit(adat1$lemtext[row], ", |\\. |: |; ")[[1]]
for (row2 in 1:NROW(splitlines)){
adat2 <- add_row(adat2, id=adat1$id[row], Title=adat1$Title[row], Domain=adat1$Domain[row], lemtext1=splitlines[row2], Language=adat1$Language[row], day=adat1$day[row])
}
}
##trying to work with the new dataframe
tokens <- space_tokenizer(adat2$`lemtext2`[which(((adat2$Domain=="index.hu") |
(adat2$Domain=="hvg.hu") | (adat1$Domain=="24.hu") | (adat1$Domain=="444.hu")) &
(adat2$day>=as.Date("2018-10-13")) & (adat1$day<=as.Date("2019-10-13")))])
This gives error messages.
adat1 dput output:
https://www.pastiebin.com/5df253f6b79aa
In adat2 everything is a factor. This is a consequence of how you created adat2. You need to add stringsAsFactors = FALSE to the data.frame() call.
adat2 <- data.frame(id = character(),
                    Title = character(),
                    Domain = character(),
                    lemtext1 = character(),
                    Language = character(),
                    day = character(),
                    stringsAsFactors = FALSE)
If you want to know what kind of columns you have, use str(adat2), or check a single column with e.g. class(adat2$id).
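As a quick check, the following sketch builds the empty data frame and inspects the column classes. It also shows as.character() as one way to convert an existing factor back to strings; the example factor values are taken from the domains in the question, and everything else is illustrative.

```r
# Empty data frame with the columns from the question. stringsAsFactors = FALSE
# keeps them as character (this is already the default from R 4.0.0 onward).
adat2 <- data.frame(id = character(),
                    Title = character(),
                    Domain = character(),
                    lemtext1 = character(),
                    Language = character(),
                    day = character(),
                    stringsAsFactors = FALSE)

# One class per column; all should be "character".
col_classes <- sapply(adat2, class)
col_classes

# If a column is already a factor, as.character() converts it back.
f <- factor(c("index.hu", "hvg.hu"))
as.character(f)
```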
I am trying to name a column and rename all items within the column of a dataset:
dataSet <- read.csv(url) %>%
rename("newColumn1" = V1) %>%
mutate(newColumn1 = recode(newColumn1, "oldEntryX" = "newEntryX") %>%
select(dataSet, newColumn1)
And I get this error:
Error in recode(newColumn1, oldEntryX = "newEntryX" :
object 'newColumn1' not found
What am I missing?
The code runs correctly up through the rename function and displays the renamed column correctly, but as soon as I include mutate it throws an error.
I have no problem sharing the real code but wanted to generalize it for the crowd.
source info was from https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data
In the mutate step, you don't need quotes for column names on the lhs of =. Also, there are a couple of case mismatches.
Assuming the dataset is read correctly, we can do
df1 %>%
  rename(newColumn1 = V1, newColumn2 = V2) %>%
  mutate(newColumn1 = recode(newColumn1, oldEntryX = "newEntryX"),
         newColumn2 = recode(newColumn2, oldEntryY = "newEntryY"))
Based on the OP's code there is no closing quote as well "newColumn1
data
set.seed(24)
df1 <- data.frame(V1 = sample(c("oldEntryX", "x", "y"), 10, replace = TRUE),
V2 = sample(c("oldEntryY", "x", "y"), 10, replace = TRUE), stringsAsFactors= FALSE)
You can do this with some simple R code:
How to read a csv file
Syntax: read.csv("filename.csv")
With this command, the 1st row is used as the header. If the file has no header row, write
data <- read.csv("datafile.csv", header=FALSE)
How to rename the header/Column name:
names(data) <- c("Column1", "Column2", "Column3")
Now your headers are replaced by Column1, Column2 and Column3
To change the data in Column1, assign a vector of replacement values (one per row) to it:
data$Column1 <- new_values  # new_values: your own vector of replacement values
To see the output, type
data
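The steps above can be tried end to end without a file on disk. This is only a sketch: the inline CSV text stands in for "datafile.csv" and its contents are made up, and replacing a single value by logical indexing is one option among several.

```r
# Inline CSV text stands in for "datafile.csv" (hypothetical contents).
csv_text <- "a,1,x\nb,2,y\nc,3,z"

# header = FALSE: the first row is data; columns get default names V1, V2, V3.
data <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

# Rename all columns at once.
names(data) <- c("Column1", "Column2", "Column3")

# Replace selected values in Column1 rather than retyping the whole column.
data$Column1[data$Column1 == "a"] <- "alpha"

data
```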
I have a large dataset, 3000x400. I need to create new columns that are the means of existing columns, grouped by the variable Constituency. I have a list of new column names that I want to use for the new columns, called newNames below. But I can only figure out how to name columns when I type the desired new name directly.
What I currently do:
set.seed(1)
dataTest = data.table(turnout_avg = rnorm(20), urban_avg = rnorm(20,5,2), Constituency = c("A","B","C","D"), key = "Constituency")
oldColumnNames = c( "turnout_avg" , "urban_avg")
newNames = c( "turnout" , "urban")
# Here's my problem, naming these new columns
comm_means_by_district = cbind(
dataTest[,list(Const_turnout = mean(na.omit(get(oldColumnNames[[1]])))), by= Constituency],
dataTest[,list(Const_urban = mean(na.omit(get(oldColumnNames[[2]])))),by= Constituency])
In reality, I want to create much more than two new columns. So I cannot feasibly type Const_turnout, Const_urban, etc. for all new columns.
I have tried two ideas, but neither works:
1.
dataTest[,list(paste("district", newNames[1], sep="_") = mean(na.omit(get(refColNames[[1]])))), by= Constituency]
Or 2.
dataTest[,list(paste(oldColumnNames[1], "constMean", sep="_") = mean(na.omit(get(refColNames[[1]])))), by= Constituency]
First get the mean of all the columns in one go:
DT <- dataTest[,lapply(.SD,function(x) mean(na.omit(x))), by= Constituency]
Then change the column names afterwards (vector_of_newnames must have one entry per column of DT, including Constituency):
setnames(DT,colnames(DT),vector_of_newnames)
Why is it important to change the names in the same line where you apply the function? I would first calculate the constituency-wise means and set the column names afterwards. Here's what this would look like:
dt <- dataTest[, lapply(oldColumnNames, function(x) mean(na.omit(get(x)))),
by=Constituency]
setnames(dt, c("Constituency", paste("Const", newNames, sep="_")))
dt
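A variant of the same idea, sketched here as one possible approach rather than the answerers' exact method: restrict .SD with .SDcols and rename by old/new pairs, so the Constituency column never has to be listed. The data is rebuilt from the question, with rep() used for Constituency as an assumption to avoid relying on recycling.

```r
library(data.table)

set.seed(1)
dataTest <- data.table(turnout_avg = rnorm(20),
                       urban_avg = rnorm(20, 5, 2),
                       Constituency = rep(c("A", "B", "C", "D"), 5),
                       key = "Constituency")
oldColumnNames <- c("turnout_avg", "urban_avg")
newNames <- c("turnout", "urban")

# Restrict .SD to the columns of interest, then take group-wise means.
dt <- dataTest[, lapply(.SD, mean, na.rm = TRUE),
               by = Constituency, .SDcols = oldColumnNames]

# Rename by old -> new pairs so the mapping does not depend on column order.
setnames(dt, old = oldColumnNames, new = paste("Const", newNames, sep = "_"))

names(dt)
```

Because the names are built with paste() from newNames, the same two lines scale to any number of columns.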
I am using R to do some data pre-processing, and here is the problem I am facing: I read the data with read.csv(filename,header=TRUE), and the spaces in variable names became ".". For example, a variable named Full Code became Full.Code in the resulting data frame. After processing, I use write.xlsx(filename) to export the results, and the changed variable names are written out. How can I address this problem?
Besides, in the output .xlsx file, the first column becomes an index (i.e., 1 to N), which is not what I expect.
If you set check.names=FALSE in read.csv when you read the data in, the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you need to quote the column names (with back quotes in some cases) or refer to the columns by position rather than by name while editing.
To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces
# This might have unintended consequences, so be sure to check the results
names(yourdata) <- gsub(x = names(yourdata),
pattern = "\\.",
replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx(). That's a common argument for functions that write out data in tabular format (write.csv() has it, too).
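Both suggestions can be checked with base R alone. The sketch below uses inline CSV text with made-up values, and write.csv() stands in for write.xlsx() purely so that no extra package is needed; that substitution is an assumption for illustration.

```r
# Inline CSV with a space in a header (hypothetical values).
csv_text <- "Full Code,Value\nA1,10\nB2,20"

# Default behaviour: names are made syntactically valid ("Full.Code").
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the header exactly as written.
kept <- read.csv(text = csv_text, check.names = FALSE)

default_names
names(kept)

# Back quotes are needed to refer to a column whose name contains a space.
kept$`Full Code`

# row.names = FALSE drops the 1..N index column on export.
out <- tempfile(fileext = ".csv")
write.csv(kept, out, row.names = FALSE)
readLines(out)[1]
```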
Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
  # FIXME: Repetitive.
  # Convert any number of consecutive dots to a single space.
  names(ds) <- gsub(x = names(ds),
                    pattern = "(\\.)+",
                    replacement = " ")
  # Drop the trailing spaces.
  names(ds) <- gsub(x = names(ds),
                    pattern = "( )+$",
                    replacement = "")
  ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)
Just to add to the answers already provided, here is another way of replacing the "." or any other kind of punctuation in column names, using a regex with the stringr package:
require("stringr")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"