I downloaded data from Eurostat into R. One column holds abbreviated country names, and I found out how to display the full names. But I need to change these row names so they appear in a different language. I don't need a specific function and can do it manually, but how?
The countrycode package could help:
#install.packages("countrycode") # if necessary
library(countrycode)
# assuming your data.frame is named df
df$geo <- countrycode(df$geo, origin = 'iso2c', destination = 'un.name.en')
Check out ?countrycode and ?codelist for more encoding options.
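If the target language is not covered by a package, or you prefer to do it manually as the question suggests, a named character vector can serve as a lookup table. The sketch below is only an illustration: the data frame, the geo_fr column, and the three French names are made-up example values, not actual Eurostat output.

```r
# Hypothetical example data: a column of two-letter country codes
df <- data.frame(geo = c("AT", "BE", "DE", "AT"),
                 value = c(1.2, 3.4, 5.6, 7.8))

# Lookup table: names are the codes, values are the translated names
fr_names <- c(AT = "Autriche", BE = "Belgique", DE = "Allemagne")

# Index the lookup by code; as.character() guards against factor columns,
# and codes missing from the table come back as NA
df$geo_fr <- unname(fr_names[as.character(df$geo)])
df
```

Keeping the translation in a new column rather than overwriting geo makes it easy to spot codes that were not matched.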
You can set row names with rownames:
d <- data.frame(x=1:3, y = letters[1:3])
rownames(d)
#[1] "1" "2" "3"
rownames(d) = c("the", "fox", "jumps")
rownames(d)
#[1] "the" "fox" "jumps"
Or you can add a new column to a data.frame like this:
d$french <- c("le", "renard", "saute")
d
# x y french
#the 1 a le
#fox 2 b renard
#jumps 3 c saute
This works even with another script:
d$tajik <- c("рӯбоҳ", "ҷаҳиш", "мекунад")
The printed data frame may look weird because of the UTF-8 encoding:
d
# x y french tajik
#the 1 a le <U+0440><U+04EF><U+0431><U+043E><U+04B3>
#fox 2 b renard <U+04B7><U+0430><U+04B3><U+0438><U+0448>
#jumps 3 c saute <U+043C><U+0435><U+043A><U+0443><U+043D><U+0430><U+0434>
But the data itself is fine:
d$tajik
#[1] "рӯбоҳ" "ҷаҳиш" "мекунад"
I have a data frame which records the changes in the name of companies. A simple representation would be :
df <- data.frame(key = c("A", "B","C", "E","F","G"), Change = c("B", "C","D" ,"F","G","H"))
print(df)
key Change
1 A B
2 B C
3 C D
4 E F
5 F G
6 G H
I want to track all the changes a value is going through. Here is an output that can help me do so:
Key 1st 2nd 3rd 4th
1 A B C D
2 E F G H
How can I do it in R? I am new to R and Programming. It would be great to get help.
The question was marked duplicate of How to reshape data from long to wide format?
However, it is not an exact duplicate, for these reasons:
1. The example used here contains data that changes across columns. That is not the case in the reshaping question. Here, the two columns depend on each other.
2. Before reshaping, I reckon there is another step: maybe assigning an id to each chain of changes. I am not sure how to do it.
Could you help me?
Can we assume that a name is never reused (that is, you never have both A->B->C and D->E->A)? If so, you can do the following.
df <- data.frame(key = c("A","B","C", "E","F","G"),
Change = c("B","C","D" ,"F","G","H"))
print(df)
# mapping from old to new name
next_name <- as.character(df$Change)
names(next_name) <- df$key
all_names <- unique(c(as.character(df$key), as.character(df$Change)))
get_id <- function(x) {
# for each name, repeatedly traverse until the final name
ss <- x %in% names(next_name)
if (any(ss)) {
x[ss] <- get_id(next_name[x[ss]])
}
x
}
ids <- get_id(all_names)
lapply(unique(ids), function(i) c(all_names[ids==i]))
# the outcome is a list of company names;
# each entry represents the name history of one firm
##[[1]]
##[1] "A" "B" "C" "D"
##[[2]]
##[1] "E" "F" "G" "H"
The outcome is a list rather than a data frame because the name sequences may differ in length (firms may have had different numbers of names).
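If a data frame like the one in the question is still wanted, the list can be padded with NA to a common length and bound row-wise. This is only a sketch, assuming the list looks like the output above; the column labels are illustrative, and the second chain is shortened on purpose to show the padding.

```r
# Example chains; the second one is deliberately shorter
chains <- list(c("A", "B", "C", "D"), c("E", "F", "G"))

# Longest chain determines the number of columns
n <- max(lengths(chains))

# Extending a vector with `length<-` fills the new slots with NA
padded <- lapply(chains, function(v) { length(v) <- n; v })

# One row per chain
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(wide) <- c("Key", paste0("change_", seq_len(n - 1)))
wide
```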
I am trying to train a data that's converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag", to differentiate the same word coming from the different fields - for example, the word hello can appear both in the positive and negative comment fields (and thus, represented as a column in my dataframe), so in my model, I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
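Applied to the use case in the question, the same idea tags each source before the two sets of columns are combined. The two small data frames below are hypothetical stand-ins for the positive and negative document-term matrices.

```r
# Hypothetical term counts from the positive and negative comment fields
pos <- data.frame(hello = c(1, 0), great = c(0, 2))
neg <- data.frame(hello = c(0, 1), awful = c(3, 0))

# Prefix every column name with its source
names(pos) <- paste0("positive_", names(pos))
names(neg) <- paste0("negative_", names(neg))

# The shared word "hello" no longer collides after tagging
features <- cbind(pos, neg)
names(features)
```

For a suffix instead of a prefix, swap the argument order, for example paste0(names(mtcars), "_sample").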
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to prepend the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package; it renames the columns by reference, without copying your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
I loaded some data in R and mistakenly named it as 86. Now when I want to call the data frame I end up with the number 86 instead of my data set. Is there a way to call the data set rather than the number 86? Also, is there a way to change the name of the data so it is no longer a number? Thank you.
You need to use backticks:
"86" <- data.frame(a = "meow", b = "wouf")
> `86`
# a b
# 1 meow wouf
To change the name of your data frame, simply assign (<-) the data from 86 to df and remove (rm) the original 86:
df <- `86`; rm(`86`)
> df
# a b
# 1 meow wouf
Because of R's copy-on-modify semantics, this does not copy the data: df points to the same memory as the original object, as tracemem shows.
> "86" <- data.frame(a = "meow", b = "wouf"); tracemem(`86`)
# [1] "<0x3936b28>"
> df <- `86`; tracemem(df)
# [1] "<0x3936b28>"
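As an alternative to backticks, non-syntactic names can also be handled as strings with get(), assign() and rm(list = ...). A minimal sketch, where pets is just an illustrative new name:

```r
# Recreate an object with the non-syntactic name "86"
assign("86", data.frame(a = "meow", b = "wouf"))

# Fetch it by name and bind it to a regular variable
pets <- get("86")

# Remove the badly named original, again referring to it by string
rm(list = "86")

exists("86")   # the old name is gone
pets
```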
I want to change the name of the output of my R function to reflect different strings that are inputted. Here is what I have tried:
kd = c("a","b","d","e","b")
test = function(kd){
return(list(assign(paste(kd,"burst",sep="_"),1:6)))
}
This is just a simple test function. I get the warning (which is just as bad as an error for me):
Warning message:
In assign(paste(kd, "burst", sep = "_"), 1:6) :
only the first element is used as variable name
Ideally I would get output like a_burst = 1, b_burst = 2 and so on, but I am not getting close.
I would like to split up a dataframe by the contents of a vector and be able to name everything according to the name from that vector, similar to
How to split a data frame by rows, and then process the blocks?
but not quite. The naming is imperative.
Something like this, maybe?
kd = c("a","b","d","e","b")
test <- function(x){
l <- as.list(seq_along(x))
names(l) <- paste(x,"burst",sep = "_")
l
}
test(kd)
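For the data frame case mentioned in the question, split() already returns a named list, so the names only need the suffix added. This is a sketch with a made-up data frame (dat, with columns id and value); note that the duplicated key "b" ends up as a single piece with two rows.

```r
kd <- c("a", "b", "d", "e", "b")

# Hypothetical data frame keyed by the vector
dat <- data.frame(id = kd, value = 1:5)

# One data frame per distinct key, named after the key
pieces <- split(dat, dat$id)

# Append the suffix to every name
names(pieces) <- paste(names(pieces), "burst", sep = "_")

names(pieces)
pieces$b_burst
```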
You could use a vector instead of a list by way of setNames:
t1_6 <- setNames( 1:6, kd)
t1_6
a b d e b <NA>
1 2 3 4 5 6
> t1_6["a"]
a
1
Looking at the question again, I wondered if you wanted to assign sequential names to a character vector:
> a1_5 <- setNames(kd, paste0("alpha", 1:5))
> a1_5
alpha1 alpha2 alpha3 alpha4 alpha5
"a" "b" "d" "e" "b"
I have a table in CSV format with the following data:
1 3 1 2
1415_at 1 8.512147859 8.196725061 8.174426394 8.62388149
1411_at 2 9.119200527 9.190318548 9.149239039 9.211401637
1412_at 3 10.03383593 9.575728316 10.06998673 9.735217522
1413_at 4 5.925999419 5.692092375 5.689299161 7.807354922
When I read it with:
m <- read.csv("table.csv")
and print the values of m, I notice that they change to:
X X.1 X1 X3 X1.1 X4
1 1415_at 1 8.512148 8.196725 8.174426 8.623881
I did some manipulation to keep only the columns labelled 1 or 2:
smallerdat <- m[ grep("^X$|^X.1$|^X1$|^X2$|1\\.|2\\." , names(m) ) ]
write.csv(smallerdat,"table2.csv")
It writes the file with those annoying headers and an added first column, which I do not need:
X X.1 X1 X1.1 X2
1 1415_at 1 8.512148 8.174426 8.623881
So when I open that data in Excel the headers are still X, X.1 and so on. What I need is for the headers to remain as they were:
1 1 2
1415_at 1 8.196725061 8.174426394 8.62388149
Any help?
Please also notice the first column that is added automatically. I do not need it, so how can I get rid of that column?
There are two issues here.
For reading your CSV file, use:
m <- read.csv("table.csv", check.names = FALSE)
Notice that by doing this, though, you can't use the column names as easily. You have to quote them with backticks instead, and will most likely still run into problems because of duplicated column names:
m$1
# Error: unexpected numeric constant in "m$1"
m$`1`
# [1] 8.512148 9.119201 10.033836 5.925999
For writing your "m" object to a CSV file, use:
write.csv(m, "table2.csv", row.names = FALSE)
After reading your file in using the method in step 1, you can subset as follows. If you wanted the first column and any columns named "3" or "4", you can use:
m[names(m) %in% c("", "3", "4")]
# 3 4
# 1 1415_at 1 8.196725 8.623881
# 2 1411_at 2 9.190319 9.211402
# 3 1412_at 3 9.575728 9.735218
# 4 1413_at 4 5.692092 7.807355
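The whole round trip can be tried without any files on disk by reading from a string. The sketch below uses a shortened, made-up version of the table and a temporary output file; it selects the columns named 1 or 2 by position and reassigns their names afterwards, in case subsetting alters the duplicated names.

```r
# Shortened example table: two unnamed columns, then headers 1, 3, 1, 2
csv_text <- paste(
  ",,1,3,1,2",
  "1415_at,1,8.51,8.20,8.17,8.62",
  "1411_at,2,9.12,9.19,9.15,9.21",
  sep = "\n"
)

# Keep the headers exactly as they appear in the file
m <- read.csv(text = csv_text, check.names = FALSE)

# Positions of the columns whose header is "1" or "2"
keep <- which(names(m) %in% c("1", "2"))
smaller <- m[keep]
names(smaller) <- names(m)[keep]   # put the original headers back

# Write without the automatic row-name column
out <- tempfile(fileext = ".csv")
write.csv(smaller, out, row.names = FALSE)
readLines(out)
```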
Update: Fixing the names before using write.csv
If you don't want to start from step 1 for whatever reason, you can still fix your problem. While you've succeeded in taking a subset with your grep statement, that doesn't change the column names (not sure why you would expect that it should). You have to do this by using gsub or one of the other regex solutions.
Here are the names of the columns with the way you have read in your CSV:
names(m)
# [1] "X" "X.1" "X1" "X3" "X1.1" "X2"
You want to:
Remove all "X"s
Remove all ".some-number"
So, here's a workaround:
# Change the names in your original dataset
names(m) <- gsub("^X|\\.[0-9]$", "", names(m))
# Create a temporary object to match desired names
getme <- names(m) %in% c("", "1", "2")
# Subset your data
smallerdat <- m[getme]
# Reassign names to your subset
names(smallerdat) <- names(m)[getme]
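To see what the regular expression does, it can be run on the column names alone. In this sketch the extra name "X1.10" is a hypothetical case (it would only arise with many duplicated headers) that shows the pattern above strips one-digit suffixes only; adding + after the character class handles longer ones.

```r
old <- c("X", "X.1", "X1", "X3", "X1.1", "X2", "X1.10")

# Pattern from the workaround: leading X, or a dot plus ONE digit at the end
cleaned <- gsub("^X|\\.[0-9]$", "", old)
cleaned

# Variant allowing one or more digits after the dot
cleaned_multi <- gsub("^X|\\.[0-9]+$", "", old)
cleaned_multi
```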
I am not sure I understand what you are attempting to do, but here is some code that reads a csv file with missing headers for the first two columns, selects only columns with a header of 1 or 2 and then writes that new data file retaining the column names of 1 or 2.
# first read in only the headers and deal with the missing
# headers for columns 1 and 2
b <- readLines('c:/users/Mark W Miller/simple R programs/missing_headers.csv',
n = 1)
b <- unlist(strsplit(b, ","))
b[1] <- 'name1'
b[2] <- 'name2'
b <- gsub(" ","", b, fixed=TRUE)
b
# read in the rest of the data file
my.data <- (
read.table(file = "c:/users/mark w miller/simple R programs/missing_headers.csv",
na.strings = "NA", header = FALSE, skip = 1, sep = ','))
colnames(my.data) <- b
# select the columns with names of 1 or 2
my.data <- my.data[names(my.data) %in% c("1", "2")]
# restore the original column names of 1 or 2 (subsetting may have
# made duplicated names unique, e.g. "1.1"; floor() maps them back)
names(my.data) <- floor(as.numeric(names(my.data)))
# write the new data file with original column names
write.csv(
my.data, "c:/users/mark w miller/simple R programs/missing_headers_out.csv",
row.names=FALSE, quote=FALSE)
Here is the input data file. Note the commas with missing names for columns 1 and 2:
, , 1, 3, 1, 2
1415_at, 1, 8.512147859, 8.196725061, 8.174426394, 8.62388149
1411_at, 2, 9.119200527, 9.190318548, 9.149239039, 9.211401637
1412_at, 3, 10.03383593, 9.575728316, 10.06998673, 9.735217522
1413_at, 4, 5.925999419, 5.692092375, 5.689299161, 7.807354922
Here is the output data file:
1,1,2
8.512147859,8.174426394,8.62388149
9.119200527,9.149239039,9.211401637
10.03383593,10.06998673,9.735217522
5.925999419,5.689299161,7.807354922