Please assist in understanding an R import from csv - r

I have R code like this, that I cannot execute because I do not have rights to install the packages, so need some help understanding what it is doing.
raw_data<- read.csv("raw_data.csv")
attach(raw_data)
raw_data$new_col<- raw_data$Employee.Name
raw_data <- select(raw_data, - Employee.Name)
Am I correct that line 3 is creating a new field called new_col and assigning the value from the csv field Employee Name . The . is supposed to mask the space between Employee and Name
In the 4th line, we are just dropping the original column from the dataset?

Yes, the fourth line (raw_data <- select(raw_data, - Employee.Name)) is using the select() function from the dplyr package to drop a column/variable from the data set. The base R equivalents would be
subset(raw_data, select = -Employee.Name)
or
raw_data[,!(names(raw_data)=="Employee.Name")]
Almost every modern R lesson recommends avoiding attach() (even its own help page!)
The operation here creates a new column by copying the employee name column, then drops the employee name column. It might be more efficient and easier to understand to rename the column instead.
names(raw_data)[names(raw_data)=="Employee.Name"] <- "new_col"
or in tidyverse
rename(raw_data, new_col = Employee.Name)
(see here)

Related

Rename a column with R

I'm trying to rename a specific column in my R script using the colnames function but with no sucess so far.
I'm kinda new around programming so it may be something simple to solve.
Basically, I'm trying to rename a column called Reviewer Overall Notes and name it Nota Final in a data frame called notas with the codes:
colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
and it returns to me:
> colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
Error: object 'Nota Final' not found
I also found in [this post][1] a code that goes:
colnames(notas) [13] <- `Nota Final`
But it also return the same message.
What I'm doing wrong?
Ps:. Sorry for any misspeling, English is not my primary language.
You probably want
colnames(notas)[colnames(notas) == "Reviewer Overall Notes"] <- "Nota Final"
(#Whatif's answer shows how you can do this with the numeric index, but probably better practice to do it this way; working with strings rather than column indices makes your code both easier to read [you can see what you're renaming] and more robust [in case the order of columns changes in the future])
Alternatively,
notas <- notas %>% dplyr::rename(`Nota Final` = `Reviewer Overall Notes`)
Here you do use back-ticks, because tidyverse (of which dplyr is a part) prefers its arguments to be passed as symbols rather than strings.
Why using backtick? Use the normal quotation mark.
colnames(notas)[13] <- 'Nota Final'
This seems to matter:
df <- data.frame(a = 1:4)
colnames(df)[1] <- `b`
Error: object 'b' not found
You should not use single or double quotes in naming:
I have learned that we should not use space in names. If there are spaces in names (it works and is called a non-syntactic name: And according to Wickham Hadley's description in Advanced R book this is due to historical reasons:
"You can also create non-syntactic bindings using single or double quotes (e.g. "_abc" <- 1) instead of backticks, but you shouldn’t, because you’ll have to use a different syntax to retrieve the values. The ability to use strings on the left hand side of the assignment arrow is an historical artefact, used before R supported backticks."
To get an overview what syntactic names are use ?make.names:
make.names("Nota Final")
[1] "Nota.Final"

Why am I having errors with order of functions using %>% in R?

This is the code I am trying to run:
data_table<-data_table%>%
merge(new_table, by = 'Sample ID')%>%
mutate(Normalized_value = ((1.8^(data_table$Ubb - data_table$Ct_adj))*10000))
I want to first add the new column ("Ubb") from "new_table" and then add a calculated column using that new column. However, I get an error saying that Ubb column does not exist. So it's not performing merge before running mutate? When I separate the functions everything works fine:
data_table<-data_table%>%
merge(new_table, by = 'Sample ID')
data_table<-data_table%>%
mutate(Normalized_value = ((1.8^(data_table$Ubb - data_table$Ct_adj))*10000))
I would like to keep everything together just for style, but I'm also just curious, shouldn't R perform merge first and then mutate? How does order of operation during piping work?
Thank you!
you dont need to refer to column name with $ sign. i.e. use Normalized_value = ((1.8^(Ubb - Ct_adj))*10000)
because it is merged now. with $ sign I believe R, even though does the merge, has original data_table still in memory. because the assignment operator did not work yet. the assignment will take place after all operations.
Try running the code like this:
data_table<-data_table%>%
merge(new_table, by = 'Sample ID')%>%
mutate(Normalized_value = ((1.8^(Ubb - Ct_adj))*10000))
Notice that I'm not using the table name with the $ within the pipe. Your way is telling the mutate column to look at a vector. Maybe it's having some trouble understanding the length of that vector when used with the merge. Just call the variable name within the pipe. It's easiest to think of the %>% as 'and then': data_table() and then merge() and then mutate(). You might also want to think about a left_join instead of a merge.

How do I write a dplyr pipe-friendly function where a new column name is provided from a function argument?

I'm hoping to produce a pipe-friendly function where a user specifies the "name of choice" for a new column produced by the function as one of the function arguments.
In the function below, I'd like name_for_elective to be something that the user can set at will, and afterwards, the user could expect that there will be a new column in their data with the name that they provided here.
I've looked at https://dplyr.tidyverse.org/reference/dplyr_data_masking.html, the mutate() function documentation, searched here, and tried working with https://dplyr.tidyverse.org/reference/rename.html, but to no avail.
elective_open<-function(.data,name_for_elective,course,tiebreaker){
name_for_elective<-rlang::ensym(name_for_elective)
course<-rlang::ensym(course)
tiebreaker<-rlang::ensym(tiebreaker)
.data%>%
mutate(!!name_for_elective =ifelse(!!tiebreaker==max(!!tiebreaker),1,0))%>%mutate(!!name_for_elective=ifelse(!!name_for_elective==0,!!course[!!name_for_elective==1],""))%>%
filter(!(!!course %in% !!name_for_elective))
}
I've included this example function because there are several references to the desired new column name, and I'm unsure if the context in which the reference is made changes syntax.
As you can see, I was hoping !!name_for_elective would let me name our new column, but no. I've played with {{}}, not using rlang::ensym, and still haven't got this figured out.
Any solution would be greatly appreciated.
This: Use dynamic variable names in `dplyr` may be helpful, but I can't seem to figure out how to extend this to work in the case where multiple references are made to the name argument.
Example data, per a good suggestion by #MrFlick, takes the form below:
dat<-tibble(ID=c("AA","BB","AA","BB","AA","BB"),Class=c("A_Class","B_Class","C_Class","D_Class","E_Class","F_Class"),
randomNo=c(.75,.43,.97,.41,.27,.38))
The user could then run something like:
dat2<-dat%>%
elective_open(MyChosenName,Class,randomNo)
A desired result, if using the function a single time, would be:
desired_result_1<-tibble(ID=c("AA","BB","AA","BB"),
Class=c("A_Class","D_Class","E_Class","F_Class"),
randomNo=c(.75,.41,.27,.38),
MyChosenName=c("C_Class","B_Class"))
The goal would be to allow this function to be used again if so desired, with a different name specified.
In the case where a user runs:
dat3<-dat%>%
elective_open(MyChosenName,Class,randomNo)%>%
mutate(Just_Another_One=1)%>%
elective_open(SecondName,Class,randomNo)
The output would be:
desired_result_2<-tibble(ID=c("AA","BB"),
Class=c("E_Class","F_Class"),
randomNo=c(.27,.38),
MyChosenName=c("C_Class","B_Class"),
Just_Another_One=c(1,1),
SecondName=c("A_Class","D_Class"))
In reality, there may be any number of records with the same ID, and any number of Class-es.
In this case you can just stick to using the embrace {{}} option for your variables. If you want to dynamically create column names, you're going to still need to use :=. The difference here is that you can use the glue-style syntax with the embrace operator to get the name of the symbol. This works with the data provided.
elective_open <- function(.data, name_for_elective, course, tiebreaker){
.data%>%
mutate("{{name_for_elective}}" := ifelse({{tiebreaker}}==max({{tiebreaker}}),1,0)) %>%
mutate("{{name_for_elective}}" := ifelse({{name_for_elective}}==0,{{course}}[{{name_for_elective}}==1],"")) %>%
filter(!({{course}} %in% {{name_for_elective}}))
}

R read csv with comma in column

Update 2020-5-14
Working with a different but similar dataset from here, I found read_csv seems to work fine. I haven't tried it with the original data yet though.
Although the replies didn't help solve the problem because my question was not correct, Shan's reply fits the original question I posted the most, so I accepted his answer.
Update 2020-5-12
I think my original question is not correct. Like mentioned in the comment, the data was quoted. Although changing the separator made the 11582 row in R look the same as the 11583 row in excel, it doesn't mean it's "right". Maybe there is some incorrect line switch due to inappropriate encoding or something, and thus causing some of the columns to be displaced. If I open the data with notepad++, the instance at row 11583 in excel is at the 11596 row.
Original question
I am trying to read the listings.csv from this dataset in kaggle into R. I downloaded the file and wrote the coderead.csv('listing.csv'). The first column, the column id, is supposed to be numeric. However, it shows:
listing$id[1:10]
[1] 2015 2695 3176 3309 7071 9991 14325 16401 16644 17409
13129 Levels: Ole Berl穩n!,16736423,Nerea,Mitte,Parkviertel,52.55554132116211,13.340658248460871,Entire home/apt,36,6,3,2018-01-26,0.16,1,279\n17312576,Great 2 floor apartment near Friederich Str MITTE,116829651,Selin,Mitte,Alexanderplatz,52.52349354926847,13.391003496971203,Entire home/apt,170,3,31,2018-10-13,1.63,1,92\n17316675,80簡 m of charm in 3 rooms with office space,116862833,Jon,Neuk繹lln,Schillerpromenade,52.47499080234379,13.427509313575928...
I think it is because there are values with commas in the second column. For example, opening the file with MiCrosoft excel, I can see one of the value in the second column is Ole,Ole...:
How can I read a csv file into R correctly when some values contain commas?
Since you have access to the data in Excel, you can 'Save As' in Excel with a seperator other than comma (,). First go in to Control Panel –> Region and Language -> Additional settings, you can change the "List Seperator". Most common one other than comma is pipe symbol (|). In R, when you read_csv, specify the seperator as '|'.
You could try this?
lsitings <- read.csv("listings.csv", stringsAsFactors = FALSE)
listings$name <- gsub(",","", listings$name) - This will remove the comma in Col name
If you don't need the information in the second column, then you can always delete it (in Excel) before importing into R. The read.csv function, which calls scan, can also omit unwanted columns using the colClasses argument. However, the fread function from the data.table package does this much more simply with the drop argument:
library(data.table)
listings <- fread("listings.csv", drop=2)
If you do need the information in that column, then other methods are needed (see other solutions).

R how to combine data from a for loop in one txt file

I have one loop
for (chr in paste('chr',c(seq(1,22),'X','Y'),sep='')){
.
.
write.table(exp,file="list.txt",col.names =FALSE,row.names=TRUE,sep="\t")
}
right now the loop will give me only one txt with the data from the last loop (chromosome Y)
What i want is one txt file with the data from all the chromosomes/ from all the loops. Not 24 different txt (one per chromosome)
Thank you
Best regards
Anna
What you are not yet recognizing is that 'append' is set to FALSE by default and that you need to explicitly change it to TRUE if you want to record each 'exp'-object (which by the way is an unfortunate name for a data object because it's shared by a common math function). If you set append=TRUE and if the "exp"-object is a full record of one chromosome at each time the write.csv is called (hopefully with an 'chr'-identifier column so you can keep track of where the data came from), then you should succeed.

Resources