Column names have periods inserted where there should be spaces - R

In a plot generated by ggplot, each label along the x-axis is a string such as “the product in 1990”. In the generated plot, however, a period appears between the words, so the string is shown as “the.product.in.1990”.
How can I ensure the “.” is not added?
The following code is what I used to add a string for each point along the x-axis:
last_plot()+scale_x_discrete(limits=ddata$labels$text)
Sample code:
library(ggplot2)  # provides ggplot(), geom_segment(), last_plot()
library(ggdendro)
x <- read.csv("test.csv",header=TRUE)
d <- as.dist(x,diag=FALSE,upper=FALSE)
hc <- hclust(d,"ave")
dhc <- as.dendrogram(hc)
ddata <- dendro_data(dhc,type="rectangle")
ggplot(segment(ddata)) + geom_segment(aes(x=x0,y=y0,xend=x1,yend=y1))
last_plot() + scale_x_discrete(limits=ddata$labels$text)
Each element of ddata$labels$text is a string, like "the product in 1990".
I would like to keep that format in the generated plot rather than getting "the.product.in.1990".

The issue arises because you are trying to read data with column names that contain spaces.
When you read this data with read.csv, these column names are converted to syntactically valid R names (via make.names), which replaces each space with a period. Here is an example to illustrate the issue:
some.file <- '
"Col heading A", "Col heading B"
A, 1
B, 2
C, 3
'
Read it with the default read.csv settings:
> x1 <- read.csv(text=some.file)
> x1
Col.heading.A Col.heading.B
1 A 1
2 B 2
3 C 3
4 NA
> names(x1)
[1] "Col.heading.A" "Col.heading.B"
To avoid this, use the argument check.names=FALSE:
> x2 <- read.csv(text=some.file, check.names=FALSE)
> x2
Col heading A Col heading B
1 A 1
2 B 2
3 C 3
4 NA
> names(x2)
[1] "Col heading A" "Col heading B"
Now, the remaining issue is that a column name containing spaces is not a syntactically valid R name. So to refer to these columns with $, you need to wrap the column name in backticks:
> x2$`Col heading A`
[1] A B C
Levels: A B C
For more information, see ?read.csv and specifically the information for check.names.
There is also some information about backticks in ?Quotes
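As a minimal sketch of how this applies to the dendrogram in the question: the inline CSV text below is made up for illustration and stands in for test.csv. It compares the default read with check.names = FALSE, and also shows repairing already-mangled names by swapping the periods back, which is only safe if no original name contained a literal period.

```r
# Hypothetical stand-in for test.csv: a small symmetric distance matrix
csv_text <- '"the product in 1990","the product in 1991","the product in 1992"
0,2,5
2,0,3
5,3,0'

# Default behaviour: header names are passed through make.names()
mangled <- read.csv(text = csv_text, header = TRUE)
names(mangled)

# check.names = FALSE keeps the header exactly as written
x <- read.csv(text = csv_text, header = TRUE, check.names = FALSE)
names(x)

# Repair after the fact (assumes no original name contained a ".")
repaired <- gsub(".", " ", names(mangled), fixed = TRUE)

# Cluster as in the question; the labels should now come from the
# unmangled column names
hc <- hclust(as.dist(x), "ave")
hc$labels
```

Reading with check.names = FALSE is the more robust of the two, since the gsub repair cannot tell an inserted period from one that was in the original header.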

Related

How to rename row names in data from Eurostat in R

I downloaded data from Eurostat to R. In one column I have abbreviated country names, I found how to display full names. But I need to change these row names to show in a different language. I don't care about a specific function, I can do it manually, but how?
The countrycode package could help:
#install.packages("countrycode") # if necessary
require("countrycode")
# assuming your data.frame is named df
df$geo <- countrycode(df$geo, origin = 'iso2c', destination = 'un.name.en')
Check out ?countrycode and ?codelist for more encoding options.
You can set row names with rownames
d <- data.frame(x=1:3, y = letters[1:3])
rownames(d)
#[1] "1" "2" "3"
rownames(d) = c("the", "fox", "jumps")
rownames(d)
#[1] "the" "fox" "jumps"
Or you can add a new column to a data.frame like this
d$french <- c("le", "renard", "saute")
d
# x y french
#the 1 a le
#fox 2 b renard
#jumps 3 c saute
This works even with another script:
d$tajik <- c("рӯбоҳ", "ҷаҳиш", "мекунад")
The printed data frame may look odd because of how the console displays the UTF-8 characters:
d
# x y french tajik
#the 1 a le <U+0440><U+04EF><U+0431><U+043E><U+04B3>
#fox 2 b renard <U+04B7><U+0430><U+04B3><U+0438><U+0448>
#jumps 3 c saute <U+043C><U+0435><U+043A><U+0443><U+043D><U+0430><U+0434>
But the values themselves are stored correctly:
d$tajik
#[1] "рӯбоҳ" "ҷаҳиш" "мекунад"
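If you would rather translate the codes by hand, a named character vector works as a lookup table. This is only a sketch: the column name geo comes from the countrycode answer above, while the data values, the geo_fr column, and the French labels are made up for illustration.

```r
# Hypothetical data: 'geo' holds two-letter country codes
df <- data.frame(geo = c("DE", "FR", "PL", "XX"),
                 value = c(1.2, 3.4, 5.6, 7.8))

# Hand-written translation table: names are the codes, values the labels
lookup <- c(DE = "Allemagne", FR = "France", PL = "Pologne")

# Index the lookup by code; codes missing from the table become NA
df$geo_fr <- unname(lookup[as.character(df$geo)])

# Fall back to the original code where no translation exists
missing <- is.na(df$geo_fr)
df$geo_fr[missing] <- as.character(df$geo)[missing]
df
```

Keeping the translation in a new column rather than overwriting geo leaves the original codes available for joins.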

How to plot based on a wildcard

I have data that looks like this:
A 2 3 LOGIC:A
B 3 3 LOGIC:B
C 2 2 COMBO:A
plot(Data$V2[Data$V4 == "LOGIC:A"], Data$V3[Data$V4 == "LOGIC:A"])
However, I want to plot every row whose column 4 is a LOGIC entry: when I provide "LOGIC" in the plot command, it should plot both "LOGIC:A" and "LOGIC:B". Right now it only accepts the exact column 4 value. Can I use wildcards?
You can use grepl to find occurrences of your string.
x <- c("LOGIC: A", "COMBO: B")
x[grepl("LOGIC", x)]
[1] "LOGIC: A"
Using the Data defined reproducibly in the Note at the end, this plots the rows for which V4 contains the substring LOGIC, using the character after the colon as the plotting symbol for each point. If you want all points to be drawn with the same symbol, omit the pch argument from plot.
plot(V3 ~ V2, Data, subset = grep("LOGIC", V4), pch = sub("LOGIC:", "", V4))
Note
Lines <- "A 2 3 LOGIC:A
B 3 3 LOGIC:B
C 2 2 COMBO:A"
Data <- read.table(text = Lines, as.is = TRUE, strip.white = TRUE)
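A plain grepl("LOGIC", ...) matches the substring anywhere in the value. If only values that start with the prefix should match, one option is to anchor the pattern. The sketch below reuses the Data from the Note; the names keep and sel are just illustrative.

```r
Lines <- "A 2 3 LOGIC:A
B 3 3 LOGIC:B
C 2 2 COMBO:A"
Data <- read.table(text = Lines, as.is = TRUE, strip.white = TRUE)

# "^" anchors the match to the start of the string
keep <- grepl("^LOGIC:", Data$V4)

# startsWith() does the same prefix test without a regular expression
keep2 <- startsWith(Data$V4, "LOGIC:")

# Subset once, then plot the matching rows
sel <- Data[keep, ]
plot(sel$V2, sel$V3)
```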

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train a model on data that was converted from a document-term matrix to a data frame. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag", to differentiate the same word coming from the different fields. For example, the word hello can appear in both the positive and negative comment fields (and is thus represented as a column in my data frame), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package; it renames the columns by reference, without copying your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
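Applied to the positive/negative case from the question, the same idea might look like the sketch below. The two small data frames pos and neg are hypothetical stand-ins for the converted document-term matrices.

```r
# Hypothetical term counts from the positive and negative comment fields
pos <- data.frame(hello = c(1, 0), great = c(0, 2))
neg <- data.frame(hello = c(0, 1), awful = c(3, 0))

# Tag every column with the field it came from
names(pos) <- paste0("positive_", names(pos))
names(neg) <- paste0("negative_", names(neg))

# The shared word "hello" now yields two distinct columns
combined <- cbind(pos, neg)
names(combined)
# [1] "positive_hello" "positive_great" "negative_hello" "negative_awful"
```

Tagging before combining avoids the duplicate column names that cbind would otherwise produce for words appearing in both fields.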

Splitting a dataframe if rows are numeric or not in R

I have a data frame (let's call it 'df') that consists of two columns:
Name Contact
A 34552325
B 423424
C 4324234242
D hello1#company.com
I want to split the data frame into two data frames based on whether the value in column "Contact" is numeric or not.
Expected Output:
Name Contact
A 34552325
B 423424
C 4324234242
and
Name Contact
D hello1#company.com
I tried using:
df$IsNum <- !(is.na(as.numeric(df$Contact)))
But this also classified "hello1#company.com" as numeric.
Basically, if a "Contact" value contains even a single non-numeric character, the code must classify it as non-numeric.
You may use grepl:
x <- " Name Contact
A 34552325
B 423424
C 4324234242
D hello1#company.com"
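# comment.char = "" stops read.table from treating "#" as the start of a comment
df <- read.table(text=x, header = TRUE, comment.char = "")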
x <- df[grepl("^\\d+$",df$Contact),]
y <- df[!grepl("^\\d+$",df$Contact),]
x
# Name Contact
# 1 A 34552325
# 2 B 423424
# 3 C 4324234242
y
# Name Contact
# 4 D hello1#company.com
We can create a grouping variable with grepl (the same way @Avinash Raj did) and split the data frame with it to create a list of data.frames.
split(df, grepl('^\\d+$', df$Contact))
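A likely explanation for the as.numeric attempt in the question, assuming Contact was read in as a factor (the default for strings before R 4.0.0): as.numeric on a factor returns the integer level codes, so nothing ever becomes NA. The sketch below rebuilds the data by hand to show that, then names the two pieces returned by split; numeric_rows and other_rows are illustrative names.

```r
# Hypothetical copy of the question's data, built directly as character columns
df <- data.frame(
  Name = c("A", "B", "C", "D"),
  Contact = c("34552325", "423424", "4324234242", "hello1#company.com"),
  stringsAsFactors = FALSE
)

# Pitfall: on a factor, as.numeric() returns the level codes, never NA
codes <- as.numeric(factor(df$Contact))
anyNA(codes)

# Digits-only test on the character values, then split on the result
is_num <- grepl("^[0-9]+$", as.character(df$Contact))
parts <- split(df, is_num)

# split() names the list elements after the grouping values
numeric_rows <- parts[["TRUE"]]
other_rows   <- parts[["FALSE"]]
```

If one of the two groups is empty, the corresponding list element will be missing, so check names(parts) before indexing in real code.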

Does column exist and how to rearrange columns in R data frame

How do I add a column in the middle of an R data frame? I want to see if I have a column named "LastName" and then add it as the third column if it does not already exist.
One approach is to just add the column to the end of the data frame, and then use subsetting to move it into the desired position:
d$LastName <- c("Flim", "Flom", "Flam")
bar <- d[c("x", "y", "LastName", "fac")]
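The add-then-reorder approach can be wrapped in a small helper that also does the existence check. This is a sketch only: add_column_at is a hypothetical name, not a base R function, and the sample data frame imitates the d produced by example(data.frame).

```r
# Hypothetical helper: add column `name` at `position` unless it already exists
add_column_at <- function(df, name, value = NA, position = 3) {
  if (name %in% names(df)) return(df)   # column present: leave df unchanged
  df[[name]] <- value                   # append at the end first
  n <- ncol(df)
  # Move the last column index into the requested position
  new_order <- append(seq_len(n - 1), n, after = position - 1)
  df[new_order]
}

# Sample data imitating the `d` from example(data.frame)
d <- data.frame(x = 1, y = 1:3, fac = c("C", "A", "A"))

d2 <- add_column_at(d, "LastName", c("Flim", "Flom", "Flam"))
names(d2)
# [1] "x" "y" "LastName" "fac"

# Calling it again is a no-op because the column now exists
d3 <- add_column_at(d2, "LastName")
```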
1) Testing for existence: Use %in% on the colnames, e.g.
> example(data.frame) # to get 'd'
> "fac" %in% colnames(d)
[1] TRUE
> "bar" %in% colnames(d)
[1] FALSE
2) You essentially have to create a new data.frame from the first half of the old, your new column, and the second half:
> bar <- data.frame(d[1:3,1:2], LastName=c("Flim", "Flom", "Flam"), fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim C
2 1 2 Flom A
3 1 3 Flam A
>
Of the many silly little helper functions I've written, this gets used every time I load R. It just makes a list of the column names and indices but I use it constantly.
## creates a data.frame listing the column names and positions of a data.frame
namesind=function(df){
temp1=names(df)          # column names
temp2=seq_along(temp1)   # column positions; also safe for zero columns
temp3=data.frame(temp1,temp2)
names(temp3)=c("VAR","COL")
return(temp3)            # locals are discarded on return, so no rm() is needed
}
ni <- namesind
Use ni to see your column numbers (ni is just an alias for namesind; I never use namesind but thought it was a better name originally). Then, if you want to insert your column in, say, position 12, and your data.frame is named bob with 20 columns, it would be
bob2 <- data.frame(bob[,1:11],newcolumn, bob[,12:20])
though I liked the add at the end and rearrange answer from Hadley as well.
Dirk Eddelbuettel's answer works, but you don't need to indicate row numbers or specify entries in the lastname column. This code should do it for a data frame named df:
if(!("LastName" %in% names(df))){
df <- cbind(df[1:2],LastName=NA,df[3:length(df)])
}
(this defaults LastName to NA, but you could just as easily use "LastName='Smith'")
or using cbind:
> example(data.frame) # to get 'd'
> bar <- cbind(d[1:3,1:2],LastName=c("Flim", "Flom", "Flam"),fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim A
2 1 2 Flom B
3 1 3 Flam B
I always thought something like append() [though unfortunate the name is] should be a generic function
## redefine append() as generic function
append.default <- append
append <- `body<-`(args(append),value=quote(UseMethod("append")))
append.data.frame <- function(x,values,after=length(x))
`row.names<-`(data.frame(append.default(x,values,after)),
row.names(x))
## apply the function
d <- (if( !"LastName" %in% names(d) )
append(d,values=list(LastName=c("Flim","Flom","Flam")),after=2) else d)
