Conversion to Unique and ordered string of numbers - r

If I had a dataframe that has a string of numbers separated by a comma in a column, how can I convert that string to an ordered and unique converted set in another column?
Month String_of_Nums Converted
May 3,3,2 2,3
June 3,3,3,1 1,3
Sept 3,3,3, 3 3
Oct 3,3,3, 4 3,4
Jan 3,3,4 3,4
Nov 3,3,5,5 3,5
I tried splitting up the string of numbers to get unique to work
strsplit(df$String_of_Nums,",")
but I end up with spaces in the character list. Any ideas how to efficiently generate a Converted column? I also need to figure out how to operate on all elements of the column.

Try:
df1 <- read.table(text="Month String_of_Nums
May '3,3,2'
June '3,3,3,1'
Sept '3,3,3,3'
Oct '3,3,3,4'
Jan '3,3,4'
Nov '3,3,5,5'", header = TRUE)
df1$converted <- apply(read.csv(text=as.character(df1$String_of_Nums), header = FALSE), 1,
function(x) paste(sort(unique(x)), collapse = ","))
df1
Month String_of_Nums converted
1 May 3,3,2 2,3
2 June 3,3,3,1 1,3
3 Sept 3,3,3,3 3
4 Oct 3,3,3,4 3,4
5 Jan 3,3,4 3,4
6 Nov 3,3,5,5 3,5
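The question mentions stray spaces after splitting. A minimal base R sketch of one way to handle that is shown here: split on the comma, trim whitespace, convert to numbers, then sort the unique values. The helper name convert_nums and the sample data frame are illustrative assumptions, not part of the original answer.

```r
# Hypothetical sample data, including the stray spaces from the question
df <- data.frame(
  Month = c("May", "June", "Sept", "Oct", "Jan", "Nov"),
  String_of_Nums = c("3,3,2", "3,3,3,1", "3,3,3, 3", "3,3,3, 4", "3,3,4", "3,3,5,5"),
  stringsAsFactors = FALSE
)

# convert_nums is a hypothetical helper: one input string in, one string out
convert_nums <- function(s) {
  parts <- strsplit(s, ",", fixed = TRUE)[[1]]  # split on literal commas
  nums <- as.numeric(trimws(parts))             # drop spaces, sort as numbers
  paste(sort(unique(nums)), collapse = ",")
}

# vapply applies the helper to every element and guarantees a character vector
df$Converted <- vapply(df$String_of_Nums, convert_nums, character(1),
                       USE.NAMES = FALSE)
df
```

Converting to numeric before sorting matters once values have more than one digit, because character sorting would place "10" before "9".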

Here is another way. As far as I can see, Jay's example has String_of_Nums as a factor. Since you said that strsplit() worked, I am assuming that your String_of_Nums is character, so I use a character column here as well. First, split each string (strsplit), keep the unique elements (unique), sort them (sort), and collapse them back into one string (toString). At this point you have a list, which you can convert to a character vector using as_vector from the purrr package. Out of interest, I used microbenchmark to compare how long each approach takes to create the vector (i.e., Converted).
library(magrittr)
library(purrr)
lapply(strsplit(mydf$String_of_Nums, split = ","),
function(x) toString(sort(unique(x)))) %>%
as_vector(.type = "character") -> mydf$out
# Month String_of_Nums out
#1 May 3,3,2 2, 3
#2 June 3,3,3,1 1, 3
#3 Sept 3,3,3,3 3
#4 Oct 3,3,3,4 3, 4
#5 Jan 3,3,4 3, 4
#6 Nov 3,3,5,5 3, 5
library(microbenchmark)
microbenchmark(
jazz = lapply(strsplit(mydf$String_of_Nums, split = ","),
function(x) toString(sort(unique(x)))) %>%
as_vector(.type = "character"),
jay = apply(read.csv(text=as.character(df1$String_of_Nums), header = FALSE), 1,
function(x) paste(sort(unique(x)), collapse = ",")),
times = 10000)
# expr min lq mean median uq max neval
# jazz 358.913 393.018 431.7382 405.9395 420.1735 54779.29 10000
# jay 1099.587 1151.244 1233.5631 1167.0920 1191.5610 56871.45 10000
DATA
Month String_of_Nums
1 May 3,3,2
2 June 3,3,3,1
3 Sept 3,3,3,3
4 Oct 3,3,3,4
5 Jan 3,3,4
6 Nov 3,3,5,5
mydf <- structure(list(Month = c("May", "June", "Sept", "Oct", "Jan",
"Nov"), String_of_Nums = c("3,3,2", "3,3,3,1", "3,3,3,3", "3,3,3,4",
"3,3,4", "3,3,5,5")), .Names = c("Month", "String_of_Nums"), row.names = c(NA,
-6L), class = "data.frame")
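If you would rather avoid the purrr dependency, the same split, unique, sort, collapse steps can probably be written with vapply, which returns a character vector directly. This is only a sketch: the data frame demo is a made-up example, and converting to numeric before sorting is an added assumption so that multi-digit values sort by value rather than as text.

```r
# Hypothetical two-row example with a multi-digit value
demo <- data.frame(
  Month = c("May", "June"),
  String_of_Nums = c("3,3,2", "10,9,9"),
  stringsAsFactors = FALSE
)

# vapply checks that each result is a single string, so no list conversion is needed
demo$out <- vapply(
  strsplit(demo$String_of_Nums, ",", fixed = TRUE),
  function(x) toString(sort(unique(as.numeric(x)))),
  character(1)
)
demo
```

Note that toString() joins with ", " (comma plus space); use paste(..., collapse = ",") if you need no space.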

Related

How to organise a date in 3 different columns in r?

I have a lot of climatic data organised by date, like this:
df = data.frame(date = c("2011-03-24", "2011-02-03", "2011-01-02"), Precipitation = c(20, 22, 23))
And I want to organise it like this one
df = data.frame(year = c("2011", "2011","2011"), month = c("03","02","01"), day = c("24", "03", "02"), pp = c(20, 22, 23))
I have a lot of information and I can not do it manually.
Can anybody help me? Thanks a lot.
Using strsplit, you can do it this way:
Logic: strsplit splits each date at the dashes, producing a list of 3 elements, each holding the year, month and day parts. do.call applies rbind across the list, binding those elements into the rows of a 3x3 matrix. Because the result is a matrix, we convert it to a data frame and use setNames to name the columns. The final cbind attaches this 3x3 data frame to the original precipitation column.
cbind(setNames(data.frame(do.call('rbind', strsplit(df$date, '-'))), c('Year', 'month', 'day')), 'Precipitation' = df$Precipitation)
Output:
Year month day Precipitation
1 2011 03 24 20
2 2011 02 03 22
3 2011 01 02 23
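The one-liner above can be unpacked step by step. The following sketch uses the data from the question; the intermediate names parts, mat and ymd are illustrative only.

```r
df <- data.frame(
  date = c("2011-03-24", "2011-02-03", "2011-01-02"),
  Precipitation = c(20, 22, 23),
  stringsAsFactors = FALSE
)

# Step 1: one character vector (year, month, day) per date
parts <- strsplit(df$date, "-", fixed = TRUE)

# Step 2: stack the vectors row-wise into a 3 x 3 character matrix
mat <- do.call(rbind, parts)

# Step 3: convert to a data frame and name the columns
ymd <- setNames(as.data.frame(mat, stringsAsFactors = FALSE),
                c("year", "month", "day"))

# Step 4: attach the original measurement column
result <- cbind(ymd, Precipitation = df$Precipitation)
result
```

Because the pieces come from strsplit, the year, month and day columns stay character and keep their leading zeros.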
The clock approach below returns integer values for year, month, and day. If you really need them as zero-padded character strings, you can use formatC(x, width = 2, flag = "0") on the result.
library(clock)
library(dplyr)
df <- data.frame(
date = c("2011-03-24", "2011-02-03", "2011-01-02"),
pp = c(20, 22, 23)
)
df %>%
mutate(
date = as.Date(date),
year = get_year(date),
month = get_month(date),
day = get_day(date)
)
#> date pp year month day
#> 1 2011-03-24 20 2011 3 24
#> 2 2011-02-03 22 2011 2 3
#> 3 2011-01-02 23 2011 1 2
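For comparison, here is a base R sketch of the same extraction that also shows the formatC padding mentioned above. It swaps the clock getters for format() on a Date, which is a different technique from the answer's; the variable names are illustrative.

```r
d <- as.Date(c("2011-03-24", "2011-02-03", "2011-01-02"))

# format() extracts each component as text; as.integer() drops the padding
year  <- as.integer(format(d, "%Y"))
month <- as.integer(format(d, "%m"))
day   <- as.integer(format(d, "%d"))

# Pad back to two characters where a fixed-width string is needed
month_padded <- formatC(month, width = 2, flag = "0")
day_padded   <- formatC(day, width = 2, flag = "0")

data.frame(year, month = month_padded, day = day_padded, pp = c(20, 22, 23))
```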

Extracting "Year" , "Month" and "Day" from Date column which is in continuous string format

Hi I have a dataframe in the form shown below:
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7), Date = c("20200230",
"20200422", "20100823", "20190801", "20130230", "20160230", "20150627"
)), class = "data.frame", row.names = c(NA, -7L))
ID Date
1 1 20200230
2 2 20200422
3 3 20100823
4 4 20190801
5 5 20130230
6 6 20160230
7 7 20150627
the date in the Date column is not in the standard format and it's shown in yyyymmdd form. How can I separate year, month and day from Date column and save them as separate new column in data frame, so the result look like this?
ID Date Year Month Day
1 1 20200230 2020 02 30
2 2 20200422 2020 04 22
3 3 20100823 ....................
4 4 20190801 ....................
5 5 20130230 ....................
6 6 20160230 ....................
7 7 20150627 ....................
I tried using format(as.Date(x, format="%YYYY%mm/%dd"),"%YYYY") but it didn't work for me. I also tried the following code:
Data$Year <- year(ymd(Data$Date))
The result is in this form:
ID Date Year
1 1 20200230 NA
2 2 20200422 2020
3 3 20100823 2010
4 4 20190801 2019
5 5 20130230 NA
6 6 20160230 NA
7 7 20150627 2015
As mentioned by @neilfws, the reason I get NA is that the date is not valid; however, I really don't care about the validity and I want to extract the year in any case.
If you only want the year and are not concerned with date validation, the easiest solution is probably to extract the first 4 characters from Date and convert to numeric.
Data$Year <- as.numeric(substring(Data$Date, 1, 4))
Might be good to have some kind of check for Date, e.g. that they all contain 8 digits.
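One possible form of that check, sketched under the assumption that a valid entry is exactly 8 digits: test each string with grepl and only extract from the entries that pass. The sample data (including the malformed 7-digit value) and the variable name ok are made up for illustration.

```r
# Hypothetical data: the third entry is malformed (only 7 digits)
Data <- data.frame(
  ID = 1:3,
  Date = c("20200230", "20200422", "2010823"),
  stringsAsFactors = FALSE
)

# TRUE where Date is exactly 8 digits; no calendar validation is done
ok <- grepl("^[0-9]{8}$", Data$Date)

# Extract by position, leaving NA for malformed entries
Data$Year  <- ifelse(ok, as.integer(substr(Data$Date, 1, 4)), NA_integer_)
Data$Month <- ifelse(ok, substr(Data$Date, 5, 6), NA_character_)
Data$Day   <- ifelse(ok, substr(Data$Date, 7, 8), NA_character_)
Data
```

Impossible dates such as 20200230 still pass, because only the shape of the string is checked, not whether the date exists.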
Base R in one expression:
# If you want to keep the Date vector:
cbind(df,
strcapture(pattern = "^(\\d{4})(\\d{2})(\\d{2})$",
x = df$Date,
proto = list(year = integer(), month = integer(), day = integer())))
# If you want to drop the Date vector:
cbind(within(df, rm(Date)),
strcapture(pattern = "^(\\d{4})(\\d{2})(\\d{2})$",
x = df$Date,
proto = list(year = integer(), month = integer(), day = integer())))

Splitting Columns by Number of Characters [duplicate]

I have a column of dates in a data table entered in 6-digit numbers as such: 201401, 201402, 201403, 201412, etc. where the first 4 digits are the year and second two digits are month.
I'm trying to split that column into two columns, one called "year" and one called "month". Been messing around with strsplit() but can't figure out how to get it to do number of characters instead of a string pattern, i.e. split in the middle of the 4th and 5th digit.
Without using any external package, we can do this with substr
transform(df1, Year = substr(dates, 1, 4), Month = substr(dates, 5, 6))
# dates Year Month
#1 201401 2014 01
#2 201402 2014 02
#3 201403 2014 03
#4 201412 2014 12
We have the option to remove or keep the column.
Or with sub
cbind(df1, read.csv(text=sub('(.{4})(.{2})', "\\1,\\2", df1$dates), header=FALSE))
Or using some package solutions
library(tidyr)
extract(df1, dates, into = c("Year", "Month"), "(.{4})(.{2})", remove=FALSE)
Or with data.table
library(data.table)
setDT(df1)[, tstrsplit(dates, "(?<=.{4})", perl = TRUE)]
tidyr::separate can take an integer for its sep parameter, which will split at a particular location:
library(tidyr)
df <- data.frame(date = c(201401, 201402, 201403, 201412))
df %>% separate(date, into = c('year', 'month'), sep = 4)
#> year month
#> 1 2014 01
#> 2 2014 02
#> 3 2014 03
#> 4 2014 12
Note the new columns are character; add convert = TRUE to coerce back to numbers.
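If the column is numeric rather than character, plain arithmetic may be enough. This is a different technique from the string-based answers above, sketched on the assumption that every value is a 6-digit yyyymm number.

```r
dates <- c(201401, 201402, 201403, 201412)

# Integer division strips the last two digits; modulo keeps them
year  <- dates %/% 100
month <- dates %% 100

# Optional: zero-padded month as text
month_chr <- sprintf("%02d", as.integer(month))

data.frame(dates, year, month, month_chr)
```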

R Cleaning and reordering names/serial numbers in data frame

Let's say I have a data frame as follows in R:
Data <- data.frame("SerialNum" = character(), "Year" = integer(), "Name" = character(), stringsAsFactors = F)
Data[1,] <- c("983\n837\n424\n ", 2015, "Michael\nLewis\nPaul\n ")
Data[2,] <- c("123\n456\n789\n136", 2014, "Elaine\nJerry\nGeorge\nKramer")
Data[3,] <- c("987\n654\n321\n975\n ", 2010, "John\nPaul\nGeorge\nRingo\nNA")
Data[4,] <- c("424\n983\n837", 2015, "Paul\nMichael\nLewis")
Data[5,] <- c("456\n789\n123\n136", 2014, "Jerry\nGeorge\nElaine\nKramer")
What I want to do is the following:
Split up each string of names and each string of serial numbers so that they are their own vectors (or a list of string vectors).
Eliminate any character "NA" in either set of vectors or any blank spaces denoted by "...\n ".
Reorder each list of names alphabetically and reorder the corresponding serial numbers according to the same permutation.
Concatenate each vector in the same fashion it was originally (I usually do this with paste(., collapse = "\n")).
My issue is how to do this without using a for loop. What is an object-oriented way to do this? As a first attempt in this direction I originally made a list by the command LIST <- strsplit(Data$Name, split = "\n") and from here I need a for loop in order to find the permutations of the names, which seems like a process that won't scale according to my actual data. Additionally, once I make the list LIST I'm not sure how I go about removing NA symbols or blank spaces. Any help is appreciated!
Using lapply I take each row of the data frame and turn it into a new data frame with one name per row. This creates a list of 5 data frames, one for each row of the original data frame.
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Year=Data[i,"Year"],
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
})
UPDATE: Based on your comment, let me know if this is the result you're trying to achieve:
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
# Collapse back into a single row with the new sort order
dat = data.frame(SerialNum=paste(dat[, "SerialNum"], collapse="\n"),
Year=Data[i, "Year"],
Name=paste(dat[, "Name"], collapse="\n"))
})
do.call(rbind, seinfeld)
SerialNum Year Name
1 837\n983\n424 2015 Lewis\nMichael\nPaul
2 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
3 321\n987\n654\n975 2010 George\nJohn\nPaul\nRingo
4 837\n983\n424 2015 Lewis\nMichael\nPaul
5 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
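The same split, clean, reorder, collapse idea can also be sketched without building intermediate data frames. This is an alternative illustration, not eipi10's code: the helper reorder_row is hypothetical, and it assumes each row's SerialNum and Name strings split into the same number of pieces.

```r
Data <- data.frame(
  SerialNum = c("983\n837\n424\n ", "123\n456\n789\n136",
                "987\n654\n321\n975\n ", "424\n983\n837", "456\n789\n123\n136"),
  Year = c(2015, 2014, 2010, 2015, 2014),
  Name = c("Michael\nLewis\nPaul\n ", "Elaine\nJerry\nGeorge\nKramer",
           "John\nPaul\nGeorge\nRingo\nNA", "Paul\nMichael\nLewis",
           "Jerry\nGeorge\nElaine\nKramer"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: handles one row's pair of strings
reorder_row <- function(serial, name) {
  s <- trimws(strsplit(serial, "\n", fixed = TRUE)[[1]])
  n <- trimws(strsplit(name, "\n", fixed = TRUE)[[1]])
  keep <- !(n %in% c("", "NA"))   # drop blanks and the literal string "NA"
  s <- s[keep]
  n <- n[keep]
  o <- order(n)                   # permutation that sorts the names
  c(SerialNum = paste(s[o], collapse = "\n"),
    Name = paste(n[o], collapse = "\n"))
}

# Map walks both columns in parallel, one row at a time
pairs <- Map(reorder_row, Data$SerialNum, Data$Name)
Data$SerialNum <- vapply(pairs, `[[`, character(1), "SerialNum", USE.NAMES = FALSE)
Data$Name <- vapply(pairs, `[[`, character(1), "Name", USE.NAMES = FALSE)
Data
```

The key step is order(n): it yields one permutation per row that is applied to both the names and the serial numbers, so the pairs stay aligned.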
eipi10 offered a great answer. In addition, here is what I tried, mainly with data.table. First, I split the two columns (SerialNum and Name) with cSplit(), added an index with add_rownames(), and split the data by that index. In the first lapply(), I used Stacked() from the splitstackshape package to stack SerialNum and Name; the separated SerialNum and Name values become two columns, as you can see in the part of temp2 shown below. In the second lapply(), I used merge() from the data.table package. Then I removed rows with NAs (lapply(na.omit)), combined all the data tables (rbindlist), and ordered the rows by rowname (the row number in the original data) and Name (setorder(rowname, Name)).
library(data.table)
library(splitstackshape)
library(dplyr)
cSplit(mydf, c("SerialNum", "Name"), direction = "wide",
type.convert = FALSE, sep = "\n") %>%
add_rownames %>%
split(f = .$rowname) -> temp
#a part of temp
#$`1`
#Source: local data frame [1 x 12]
#
#rowname Year SerialNum_1 SerialNum_2 SerialNum_3 SerialNum_4 SerialNum_5 Name_1 Name_2
#(chr) (dbl) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
#1 1 2015 983 837 424 NA NA Michael Lewis
#Variables not shown: Name_3 (chr), Name_4 (chr), Name_5 (chr)
lapply(temp, function(x){
Stacked(x, var.stubs = c("SerialNum", "Name"), sep = "_")
}) -> temp2
# A part of temp2
#$`1`
#$`1`$SerialNum
# rowname Year .time_1 SerialNum
#1: 1 2015 1 983
#2: 1 2015 2 837
#3: 1 2015 3 424
#4: 1 2015 4 NA
#5: 1 2015 5 NA
#
#$`1`$Name
# rowname Year .time_1 Name
#1: 1 2015 1 Michael
#2: 1 2015 2 Lewis
#3: 1 2015 3 Paul
#4: 1 2015 4 NA
#5: 1 2015 5 NA
lapply(1:nrow(mydf), function(x){
merge(temp2[[x]]$SerialNum, temp2[[x]]$Name, by = c("rowname", "Year", ".time_1"))
}) %>%
lapply(na.omit) %>%
rbindlist %>%
setorder(rowname, Name) -> out
print(out)
# rowname Year .time_1 SerialNum Name
# 1: 1 2015 2 837 Lewis
# 2: 1 2015 1 983 Michael
# 3: 1 2015 3 424 Paul
# 4: 2 2014 1 123 Elaine
# 5: 2 2014 3 789 George
# 6: 2 2014 2 456 Jerry
# 7: 2 2014 4 136 Kramer
# 8: 3 2010 3 321 George
# 9: 3 2010 1 987 John
#10: 3 2010 2 654 Paul
#11: 3 2010 4 975 Ringo
#12: 4 2015 3 837 Lewis
#13: 4 2015 2 983 Michael
#14: 4 2015 1 424 Paul
#15: 5 2014 3 123 Elaine
#16: 5 2014 2 789 George
#17: 5 2014 1 456 Jerry
#18: 5 2014 4 136 Kramer
DATA
mydf <- structure(list(SerialNum = c("983\n837\n424\n ", "123\n456\n789\n136",
"987\n654\n321\n975\n ", "424\n983\n837", "456\n789\n123\n136"
), Year = c(2015, 2014, 2010, 2015, 2014), Name = c("Michael\nLewis\nPaul\n ",
"Elaine\nJerry\nGeorge\nKramer", "John\nPaul\nGeorge\nRingo\nNA",
"Paul\nMichael\nLewis", "Jerry\nGeorge\nElaine\nKramer")), .Names = c("SerialNum",
"Year", "Name"), row.names = c(NA, -5L), class = "data.frame")

Sum duplicates then remove all but first occurrence

I have a data frame (~5000 rows, 6 columns) that contains some duplicate values for an id variable. I have another continuous variable x, whose values I would like to sum for each duplicate id. The observations are time dependent, there are year and month variables, and I'd like to keep the chronologically first observation of each duplicate id and add the subsequent dupes to this first observation.
I've included dummy data that resembles what I have: dat1. I've also included a data set that shows the structure of my desired outcome: outcome.
I've tried two strategies, neither of which quite gives me what I want (see below). The first strategy gives me the correct values for x, but I lose my year and month columns - I need to retain these for all the first duplicate id values. The second strategy doesn't sum the values of x correctly.
Any suggestions for how to get my desired outcome would be much appreciated.
# dummy data set
set.seed(179)
dat1 <- data.frame(id = c(1234, 1321, 4321, 7423, 4321, 8503, 2961, 1234, 8564, 1234),
year = rep(c("2006", "2007"), each = 5),
month = rep(c("December", "January"), each = 5),
x = round(rnorm(10, 10, 3), 2))
# desired outcome
outcome <- data.frame(id = c(1234, 1321, 4321, 7423, 8503, 2961, 8564),
year = c(rep("2006", 4), rep("2007", 3)),
month = c(rep("December", 4), rep("January", 3)),
x = c(36.42, 11.55, 17.31, 5.97, 12.48, 10.22, 11.41))
# strategy 1:
library(plyr)
dat2 <- ddply(dat1, .(id), summarise, x = sum(x))
# strategy 2:
# partition into two data frames - one with unique cases, one with dupes
dat1_unique <- dat1[!duplicated(dat1$id), ]
dat1_dupes <- dat1[duplicated(dat1$id), ]
# merge these data frames while summing the x variable for duplicated ids
# with plyr
dat3 <- ddply(merge(dat1_unique, dat1_dupes, all.x = TRUE),
.(id), summarise, x = sum(x))
# in base R
dat4 <- aggregate(x ~ id, data = merge(dat1_unique, dat1_dupes,
all.x = TRUE), FUN = sum)
I got different sums, but that was because I forgot the seed:
> dat1$x <- ave(dat1$x, dat1$id, FUN=sum)
> dat1[!duplicated(dat1$id), ]
id year month x
1 1234 2006 December 25.18
2 1321 2006 December 15.06
3 4321 2006 December 15.50
4 7423 2006 December 7.16
6 8503 2007 January 13.23
7 2961 2007 January 7.38
9 8564 2007 January 7.21
(To be safer, it would be better to work on a copy. You might also need to add an ordering step.)
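A sketch of what working on a copy with an ordering step might look like follows. The small data frame dat is made up so the sums are easy to verify, and ordering months via match(month, month.name) assumes full English month names as in the question.

```r
# Hypothetical data, deliberately not in chronological order
dat <- data.frame(
  id = c(1, 2, 1, 3, 2),
  year = c("2007", "2006", "2006", "2007", "2007"),
  month = c("January", "December", "December", "January", "January"),
  x = c(10, 20, 5, 7, 1),
  stringsAsFactors = FALSE
)

# Sort a copy chronologically: year first, then calendar position of the month
ord <- order(as.integer(dat$year), match(dat$month, month.name))
out <- dat[ord, ]

# Replace x with the per-id total, then keep the earliest row for each id
out$x <- ave(out$x, out$id, FUN = sum)
out <- out[!duplicated(out$id), ]
out
```

The original dat is left untouched, and each surviving row carries the year and month of the chronologically first observation for its id.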
You could do this with data.table (quicker and more memory efficient than plyr),
with a bit of self-joining fun using mult = 'first'. Keying by id, year and month will sort by id, then year, then month.
library(data.table)
DT <- data.table(dat1, key = c('id','year','month'))
# setnames is required as there are two x columns that get renamed x, x.1
DT1 <- setnames(DT[DT[,list(x=sum(x)),by=id],mult='first'][,x:=NULL],'x.1','x')
Or a simpler approach:
DT = as.data.table(dat1)
DT[,x:=sum(x),by=id][!duplicated(id)]
id year month x
1: 1234 2006 December 36.42
2: 1321 2006 December 11.55
3: 4321 2006 December 17.31
4: 7423 2006 December 5.97
5: 8503 2007 January 12.48
6: 2961 2007 January 10.22
7: 8564 2007 January 11.41

Resources