Change integer to character in R? - r

I know this is a really stupid problem, but it's driving me nuts.
I'm trying to combine two columns in a dataframe.
One column is year, with numbers such as 2006, 2007, etc.
The other column is month, with numbers from 1-12.
I want to create a column called date that looks like this:
2012 and 12 becomes 201212
2012 and 4 becomes 201204
This should be really simple, but I can't seem to get the 0 between the 2012 and 4!!!!!!
The dataframe is called x. I have tried a number of variations of this:
attach(x)
x$mymonth <- as.character(mymonth)
x[!(mymonth=="10"|mymonth=="11"|mymonth=="12"),]$mymonth <- paste0("0",x[!(mymonth=="10"|mymonth=="11"|mymonth=="12"),]$mymonth)
x$mymonth <- as.character(mymonth)
x$date <- paste0(as.character(year),as.character(mymonth),"")
detach(x)
This doesn't work.

We can use sprintf and specify the appropriate fmt.
df1$date <- sprintf("%04d%02d", df1$year, df1$month)
df1$date
#[1] "201501" "201502" "201503" "201504" "201505" "201506" "201507" "201508"
#[9] "201509" "201510" "201511" "201512"
Or another option would be str_pad from library(stringr) and then paste the columns
library(stringr)
paste0(df1$year, str_pad(df1$month, width = 2, pad = "0"))
NOTE: It is not recommended to use attach. Instead we can use with, within etc.
data
df1 <- data.frame(year=2015, month=1:12)
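The note about avoiding attach can be illustrated with a minimal sketch. It assumes a small hypothetical data frame x with the question's column names (year and mymonth) and uses with() so the columns are found without attaching anything:

```r
# Hypothetical data using the question's column names
x <- data.frame(year = c(2012, 2012), mymonth = c(12, 4))

# with() evaluates the expression inside x, so no attach()/detach() is needed
x$date <- with(x, sprintf("%04d%02d", year, mymonth))
x$date
#[1] "201212" "201204"
```

Because nothing is attached, there is no risk of a stale copy of mymonth on the search path shadowing the column that was just modified.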

Related

Insert specific numbers at the end of the values of a column in R

Let's say I have a data df like this:
df<-data.frame(date=c(202203,202204,202205,202206))
202203 means March, 2022.
So all the values of the date column represent a year and month.
However, since I don't know the exact date, I want to append 01 to every value in the date column. That is, 202203 should become 20220301: March 1st, 2022.
My expected output is
df<-data.frame(date=c(20220301,20220401,20220501,20220601))
I tried to use gsub, but the output was not what I expected.
df$date <- as.numeric(paste0(df$date, "01"))
oh hang on, got a better one:
df$date <- df$date * 100 + 1
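If the 8-digit number is later needed as a real Date rather than a number, one possible follow-up (an assumption about the goal, not part of the answers here) is to convert it with as.Date. The column name day is made up for illustration:

```r
df <- data.frame(date = c(202203, 202204, 202205, 202206))

# Append "01" arithmetically, as in the answer above
df$date <- df$date * 100 + 1

# Convert via character so as.Date can parse the digits as YYYYMMDD
df$day <- as.Date(as.character(df$date), format = "%Y%m%d")
df$day
#[1] "2022-03-01" "2022-04-01" "2022-05-01" "2022-06-01"
```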
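You can also use paste0 in a loop, but index each element; assigning the result to the whole column inside the loop overwrites every row with a single value. Note that paste0 is vectorized, so the loop is not strictly needed.
For example:
for (i in seq_along(df$date)) {
  df$date[i] <- paste0(df$date[i], "01")  # the column becomes character
}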

subset R data frame using only exact matches of character vector

I would like to subset a data frame (Data) by column names. I have a character vector with column name IDs I want to exclude (IDnames).
What I do normally is something like this:
Data[ ,!colnames(Data) %in% IDnames]
However, I am facing the problem that there is a name "X-360" and another one "X-360.1" in the columns. I only want to exclude the "X-360" (which is also in the character vector), but not "X-360.1" (which is not in the character vector, but extracted anyway). - So I want only exact matches, and it seems like this does not work with %in%.
It seems such a simple problem but I just cannot find a solution...
Update:
Indeed, the problem was that I had duplicated names in my data.frame! It took me a while to figure this out, because when I looked at the subsetted columns with
Data[ ,colnames(Data) %in% IDnames]
it showed "X-360" and "X-360.1" among the names, as stated above.
But it seems this was just happening when subsetting the data, before there were just columns with the same name ("X-360") - and that happened because the data frame was set up from matrices with cbind.
Here is a demonstration of what happened:
D1 <-matrix(rnorm(36),nrow=6)
colnames(D1) <- c("X-360", "X-400", "X-401", "X-300", "X-302", "X-500")
D2 <-matrix(rnorm(36),nrow=6)
colnames(D2) <- c("X-360", "X-406", "X-403", "X-300", "X-305", "X-501")
D <- cbind(D1, D2)
Data <- as.data.frame(D)
IDnames <- c("X-360", "X-302", "X-501")
Data[ ,colnames(Data) %in% IDnames]
       X-360      X-302    X-360.1      X-501
1 -0.3658194 -1.7046575  2.1009329  0.8167357
2 -2.1987411 -1.3783129  1.5473554 -1.7639961
3  0.5548391  0.4022660 -1.2204003 -1.9454138
4  0.4010191 -2.1751914  0.8479660  0.2800923
5 -0.2790987  0.1859162  0.8349893  0.5285602
6  0.3189967  1.5910424  0.8438429  0.1142751
Learned another thing to be careful about when working with such data in the future...
One regex based solution here would be to form an alternation of exact keyword matches:
regex <- paste0("^(?:", paste(IDnames, collapse="|"), ")$")
Data[ , !grepl(regex, colnames(Data))]
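As the update suggests, %in% already compares whole strings, so the surprise came from duplicated column names rather than partial matching. A small sketch of how one might check for this, using a hypothetical vector of names (cols) in place of colnames(Data):

```r
IDnames <- c("X-360", "X-302", "X-501")
cols <- c("X-360", "X-360.1", "X-302", "X-360")  # hypothetical column names

# %in% matches whole strings only: "X-360.1" is not matched
cols %in% IDnames
#[1]  TRUE FALSE  TRUE  TRUE

# anyDuplicated() returns the index of the first repeated name (0 if none)
anyDuplicated(cols)
#[1] 4

# make.unique() shows the kind of suffix R adds to disambiguate repeated names
make.unique(c("X-360", "X-360"))
#[1] "X-360"   "X-360.1"
```

Running anyDuplicated(colnames(Data)) right after cbind would be one way to spot the problem before subsetting.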

Convert character dates in r (weird format)

I have columns that are named "X1.1.21", "X12.31.20" etc.
I can get rid of all the "X"s by using the substring function:
names(df) <- substring(names(df), 2, 8)
I've been trying many different methods to change "1.1.21" into a date format in R, but I'm having no luck so far. How can I go about this?
R doesn't like column names that start with numbers (hence you get X in front of them). However, you can still force R to allow column names that start with number by using check.names = FALSE while reading the data.
If you want to include date format as column names, you can use :
df <- data.frame(X1.1.21 = rnorm(5), X12.31.20 = rnorm(5))
names(df) <- as.Date(names(df), 'X%m.%d.%y')
names(df)
#[1] "2021-01-01" "2020-12-31"
However, note that they look like dates but are still of type 'character'
class(names(df))
#[1] "character"
So if you are going to use the column names for some date calculation you need to change it to date type first.
as.Date(names(df))
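Both points can be tried on a small made-up example: reading with check.names = FALSE keeps the original headers, and the resulting names can be parsed into Date values for calculations. The inline CSV text and the variable name d are assumptions for illustration.

```r
# Inline CSV standing in for the real file; check.names = FALSE keeps "1.1.21" as is
df <- read.csv(text = "1.1.21,12.31.20\n5,7", check.names = FALSE)
names(df)
#[1] "1.1.21"   "12.31.20"

# Parse the names as month.day.year and use them as real dates
d <- as.Date(names(df), format = "%m.%d.%y")
d[1] - d[2]      # difference between the two dates
df[, order(d)]   # columns reordered chronologically
```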

How to add dot to a number - YYYYMM to YYYY.MM

I have the following problem. I have a lot of numbers:
x <- c(200103, 200106,200109)
Actually, those are dates, and I want them in the format 2001.03, 2001.06, 2001.09, etc., i.e. I want to add a dot after the first four digits. Is there a simple way to do that in R?
You can capture the data in two groups, the first 4 characters and the next 2, and insert "." between them.
x <- c(200103, 200106,200109)
sub('(.{4})(.{2})', '\\1.\\2', x)
#[1] "2001.03" "2001.06" "2001.09"
The standard way would be to convert to date and use format to get data in required format.
format(as.Date(paste0(x, 1), '%Y%m%d'), '%Y.%m')
We could convert to yearmon class and then use format
library(zoo)
format(as.yearmon(as.character(x), "%Y%m"), "%Y.%m")
#[1] "2001.03" "2001.06" "2001.09"
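Since x is numeric, another possibility (not from the answers above, just a sketch) is to split it arithmetically with integer division and modulo and format the parts with sprintf:

```r
x <- c(200103, 200106, 200109)

# x %/% 100 gives the year, x %% 100 gives the month; %02d zero-pads the month
sprintf("%d.%02d", x %/% 100, x %% 100)
#[1] "2001.03" "2001.06" "2001.09"
```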

Delete some characters from a dataframe in R

I'm quite new to manipulating dataframes in R. I need to create a dataframe by joining several other ones, each containing some data.
I've succeeded in joining them, but I got that:
https://i.stack.imgur.com/SkFDg.png
And what I want is a clean dataframe, so I would like to remove the , " " and $ characters in order to obtain a "real" dataframe. Can you help me with that? Many thanks!
PS: I'm using the dplyr and statsr libraries; I don't know if this information is useful, though...
Your data looks like it is in comma-separated (CSV) format. The simplest way would probably be to save it as plain text and read the file with csv.get (from the Hmisc package) or base R's read.csv.
As noted by @Jan, the best way is to read in the data more fittingly. If, for some reason, this is not a viable option, then this might work:
First off, some illustrative data:
v1 <- c(',"Name","Area","Population"')
v2 <- c(',"Afghanistan",652230,32564342')
v3 <- c(',"Akrotiri",123,NA"')
v4 <- c(',"Albania",28748,3029278')
df1 <- as.data.frame(rbind(v1,v2,v3,v4))
df1
V1
v1 ,"Name","Area","Population"
v2 ,"Afghanistan",652230,32564342
v3 ,"Akrotiri",123,NA"
v4 ,"Albania",28748,3029278
The first step is to (i) get rid of the leading comma and the quote marks using gsub, (ii) split the rows at the commas using strsplit, (iii) save the result as a dataframe using as.data.frame, and (iv) transpose it using t:
df2 <- t(as.data.frame(apply(df1, 2, function(x) strsplit(trimws(gsub('^,|"', '', x)),","))))
The rest is rather cosmetic: first remove the row names, then add the correct column names, and finally remove the first row (which contains the names too):
rownames(df2) <- NULL
colnames(df2) <- df2[1,]
df3 <- as.data.frame(df2[-1,])
The result is a neat and clean structure:
df3
         Name   Area Population
1 Afghanistan 652230   32564342
2    Akrotiri    123         NA
3     Albania  28748    3029278
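For comparison, reading the text as CSV directly, as both answers recommend when possible, might look like the sketch below. The lines are a simplified copy of the illustrative data (the stray quote after NA is left out), and raw_lines and clean are hypothetical names.

```r
raw_lines <- c(',"Name","Area","Population"',
               ',"Afghanistan",652230,32564342',
               ',"Akrotiri",123,NA',
               ',"Albania",28748,3029278')

# Drop the leading comma, then let read.csv handle quotes, types and the header
clean <- read.csv(text = sub("^,", "", raw_lines))
str(clean)
```

Unlike the manual approach, Area and Population come back as numbers rather than strings, so no further conversion should be needed.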
