R separate lines into columns specified by start and end


I'd like to split a dataset made of character strings into columns specified by start and end.
My dataset looks something like this:
>head(templines,3)
[1] "201801 1  78"
[2] "201801 2  67"
[3] "201801 1  13"
and I'd like to split it by specifying my columns using the data dictionary:
>dictionary
col_name col_start col_end
year 1 4
week 5 6
gender 8 8
age 11 12
so it becomes:
year week gender age
2018 01 1 78
2018 01 2 67
2018 01 1 13
In reality the data comes from a long-running survey, and the white space between some columns represents variables that are no longer collected. It has many variables, so I need a solution that scales.
In tidyr::separate it looks like you can only split by specifying the position to split at, rather than the start and end positions. Is there a way to use start / end?
I thought of doing this with read_fwf but I can't seem to be able to use it on my already loaded dataset. I only managed to get it to work by first exporting as a txt and then reading from this .txt:
write_lines(templines,"t1.txt")
read_fwf("t1.txt",
         fwf_positions(start = dictionary$col_start,
                       end = dictionary$col_end,
                       col_names = dictionary$col_name))
Is it possible to use read_fwf on an already loaded dataset?

Answering your question directly: yes, it is possible to use read_fwf with already loaded data. The relevant part of the docs is the part about the argument file:
Either a path to a file, a connection, or literal data (either a single string or a raw vector).
...
Literal data is most useful for examples and tests.
It must contain at least one new line to be recognised as data (instead of a path).
Thus, you can simply collapse your data and then use read_fwf:
templines %>%
paste(collapse = "\n") %>%
read_fwf(., fwf_positions(start = dictionary$col_start,
end = dictionary$col_end,
col_names = dictionary$col_name))
This should scale to multiple columns, and is fast for many rows (on my machine for 1 million rows and four columns about half a second).
There are a few warnings regarding parsing failures, but they stem from your dictionary. If you change the last line to age, 11, 12 it works as expected.
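For comparison, base R's utils::read.fwf() can also parse lines that are already in memory when it is given a text connection. The following is a minimal sketch rather than part of the answer above: it assumes the sample lines contain two spaces before the age field (so that age occupies positions 11 to 12), and it converts the start/end dictionary into the widths that read.fwf() expects, using negative widths for the gaps to skip.

```r
# Sample data, assuming two spaces before the age field
templines <- c("201801 1  78", "201801 2  67", "201801 1  13")
dictionary <- data.frame(
  col_name  = c("year", "week", "gender", "age"),
  col_start = c(1L, 5L, 8L, 11L),
  col_end   = c(4L, 6L, 8L, 12L),
  stringsAsFactors = FALSE
)

# Width of each field, and of the unused gap in front of it
field_width <- dictionary$col_end - dictionary$col_start + 1L
gap <- dictionary$col_start - c(0L, head(dictionary$col_end, -1)) - 1L

# Interleave gaps (negative = skip) and fields, then drop zero-width gaps
widths <- as.vector(rbind(-gap, field_width))
widths <- widths[widths != 0L]

con <- textConnection(templines)
parsed <- read.fwf(con, widths = widths,
                   col.names = dictionary$col_name,
                   colClasses = "character")  # keeps "01" as text
close(con)
parsed
```

Reading everything as character keeps leading zeros such as the week "01"; drop colClasses if numeric columns are preferred.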

A solution with substring:
library(data.table)
x <- transpose(lapply(templines, substring, dictionary$col_start, dictionary$col_end))
setDT(x)
setnames(x, dictionary$col_name)
# > x
# year week gender age
# 1: 2018 01 1 78
# 2: 2018 01 2 67
# 3: 2018 01 1 13

How about this?
data.frame(year=substr(templines,1,4),
           week=substr(templines,5,6),
           gender=substr(templines,8,8),
           age=substr(templines,11,12))

Using base R:
m = list(`attr<-`(dat$col_start,"match.length",dat$col_end-dat$col_start+1))
d = do.call(rbind,regmatches(x,rep(m,length(x))))
setNames(data.frame(d),dat$col_name)
year week gender age
1 2018 01 1 78
2 2018 01 2 67
3 2018 01 1 13
DATA USED:
x = c("201801 1  78", "201801 2  67", "201801 1  13")
dat=read.table(text="col_name col_start col_end
year 1 4
week 5 6
gender 8 8
age 11 13 ",h=T)

We could use separate from tidyverse
library(tidyverse)
data.frame(Col = templines) %>%
separate(Col, into = dictionary$col_name, sep= head(dictionary$col_end, -1))
# year week gender age
#1 2018 01 1 78
#2 2018 01 2 67
#3 2018 01 1 13
The convert = TRUE argument can also be used with separate to have numeric columns as output
tibble(Col = templines) %>%
separate(Col, into = dictionary$col_name,
sep= head(dictionary$col_end, -1), convert = TRUE)
# A tibble: 3 x 4
# year week gender age
# <int> <int> <int> <int>
#1 2018 1 1 78
#2 2018 1 2 67
#3 2018 1 1 13
data
dictionary <- structure(list(col_name = c("year", "week", "gender", "age"),
col_start = c(1L, 5L, 8L, 11L), col_end = c(4L, 6L, 8L, 13L
)), .Names = c("col_name", "col_start", "col_end"),
class = "data.frame", row.names = c(NA, -4L))
templines <- c("201801 1  78", "201801 2  67", "201801 1  13")

This is an explicit function which seems to be working the way you wanted.
split_func<-function(char,ref,name,start,end){
res<-data.table("ID" = 1:length(char))
for(i in 1:nrow(ref)){
res[,ref[[name]][i] := substr(x = char,start = ref[[start]][i],stop = ref[[end]][i])]
}
return(res)
}
I have created the same input files as you:
templines<-c("201801 1  78","201801 2  67","201801 1  13")
dictionary<-data.table("col_name" = c("year","week","gender","age"),"col_start" = c(1,5,8,11),
"col_end" = c(4,6,8,13))
# col_name col_start col_end
#1: year 1 4
#2: week 5 6
#3: gender 8 8
#4: age 11 13
As for the arguments,
char - The character vector with the values you want to split
ref - The reference table or dictionary
name - The column number in the reference table containing the column names you want
start - The column number in the reference table containing the start points
end - The column number in the reference table containing the stop points
If I use this function with these inputs, I get the following result:
out<-split_func(char = templines,ref = dictionary,name = 1,start = 2,end = 3)
#>out
# ID year week gender age
#1: 1 2018 01 1 78
#2: 2 2018 01 2 67
#3: 3 2018 01 1 13
I had to include an "ID" column to initiate the data table and make this easier. In case you want to drop it later you can just use:
out[,ID := NULL]
Hope this is closer to the solution you were looking for.

Related

Extracting "Year" , "Month" and "Day" from Date column which is in continuous string format

Hi I have a dataframe in the form shown below:
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7), Date = c("20200230",
"20200422", "20100823", "20190801", "20130230", "20160230", "20150627"
)), class = "data.frame", row.names = c(NA, -7L))
ID Date
1 1 20200230
2 2 20200422
3 3 20100823
4 4 20190801
5 5 20130230
6 6 20160230
7 7 20150627
the date in the Date column is not in the standard format and it's shown in yyyymmdd form. How can I separate year, month and day from Date column and save them as separate new column in data frame, so the result look like this?
ID Date Year Month Day
1 1 20200230 2020 02 30
2 2 20200422 2020 04 22
3 3 20100823 ....................
4 4 20190801 ....................
5 5 20130230 ....................
6 6 20160230 ....................
7 7 20150627 ....................
I tried using format(as.Date(x, format="%YYYY%mm/%dd"),"%YYYY") but it didn't work for me. I also tried the following code:
Data$Year <- year(ymd(Data$Date))
The result is in this form:
ID Date Year
1 1 20200230 NA
2 2 20200422 2020
3 3 20100823 2010
4 4 20190801 2019
5 5 20130230 NA
6 6 20160230 NA
7 7 20150627 2015
As mentioned by @neilfws, the reason I get NA is that the date is not valid; however, I really don't care about the validity and I want to extract the year in any case.

If you only want the year and are not concerned with date validation, the easiest solution is probably to extract the first 4 characters from Date and convert to numeric.
Data$Year <- as.numeric(substring(Data$Date, 1, 4))
Might be good to have some kind of check for Date, e.g. that they all contain 8 digits.
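One way to add such a check is sketched below. It assumes Date is a character column; the third value is a deliberately malformed, hypothetical entry added only to show the effect. Rows that fail the 8-digit test get NA instead of a silently wrong year.

```r
# Small example; "2010082" is a hypothetical malformed entry (7 digits)
Data <- data.frame(ID = 1:3,
                   Date = c("20200230", "20200422", "2010082"),
                   stringsAsFactors = FALSE)

# TRUE only for strings made of exactly 8 digits
ok <- grepl("^[0-9]{8}$", Data$Date)

# Extract the year only where the format check passes
Data$Year <- NA_integer_
Data$Year[ok] <- as.integer(substr(Data$Date[ok], 1, 4))
Data
```

The same pattern extends to Month (characters 5 to 6) and Day (characters 7 to 8).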

Base R in one expression:
# If you want to keep the Date vector:
cbind(df,
strcapture(pattern = "^(\\d{4})(\\d{2})(\\d{2})$",
x = df$Date,
proto = list(year = integer(), month = integer(), day = integer())))
# If you want to drop the Date vector:
cbind(within(df, rm(Date)),
strcapture(pattern = "^(\\d{4})(\\d{2})(\\d{2})$",
x = df$Date,
proto = list(year = integer(), month = integer(), day = integer())))

Select last observation of a date variable - SPSS or R

I'm relatively new to R, so I realise this type of question is asked often but I've read a lot of stack overflow posts and still can't quite get something to work on my data.
I have data on spss, in two datasets imported into R. Both of my datasets include an id (IDC), which I have been using to merge them. Before merging, I need to filter one of the datasets to select specifically the last observation of a date variable.
My dataset, d1, has a longitudinal measure in long format. There are multiple rows per IDC, representing different places of residence (neighborhood). Each row has its own "start_date", which is a variable that is NOT necessarily unique.
As it looks on spss :
IDC neighborhood start_date
1 22 08.07.2001
1 44 04.02.2005
1 13 21.06.2010
2 44 24.12.2014
2 3 06.03.2002
3 22 04.01.2006
4 13 08.07.2001
4 2 15.06.2011
In R, the start dates do not look the same, instead they are one long number like "13529462400". I do not understand this format but I assume it still would retain the date order...
Here are all my attempts so far to select the last date. All attempts ran without error, but the output wasn't what I wanted. As far as I can tell, none of them changed the number of repetitions of IDC, so none of them actually selected only the last date.
##### attempt 1 --- not working
d1$start_date_filt <- d1$start_date
d1[order(d1$IDC,d1$start_date_filt),] # Sort by ID and week
d1[!duplicated(d1$IDC, fromLast=T),] # Keep last observation per ID)
###### attempt 2--- not working
myid.uni <- unique(d1$IDC)
a<-length(myid.uni)
last <- c()
for (i in 1:a) {
temp<-subset(d1, IDC==myid.uni[i])
if (dim(temp)[1] > 1) {
last.temp<-temp[dim(temp)[1],]
}
else {
last.temp<-temp
}
last<-rbind(last, last.temp)
}
last
##### atempt 3 -- doesn't work
do.call("rbind",
by(d1,INDICES = d1$IDC,
FUN=function(DF)
DF[which.max(DF$start_date),]))
#### attempt 4 -- doesnt work
library(plyr)
ddply(d1,.(IDC), function(X)
X[which.max(X$start_date),])
### merger code -- in case something has to change with that after only the last start_date is selected
merge(d1,d2, IDC)
My goal dataset d1 would look like this:
IDC neighborhood start_date
1 13 21.06.2010
2 44 24.12.2014
3 22 04.01.2006
4 2 15.06.2011
I'm grateful for any help, many thanks <3

Most approaches run into the same problem with this data: the dates are strings in dd.mm.yyyy format, which does not sort chronologically. Comparing them as strings just so happens to work here, because the maximum day-of-month also falls in the maximum year.
It is generally better to work with that column as a Date object in R, so that comparisons are chronological.
dat$start_date <- as.Date(dat$start_date, format = "%d.%m.%Y")
dat
# IDC neighborhood start_date
# 1 1 22 2001-07-08
# 2 1 44 2005-02-04
# 3 1 13 2010-06-21
# 4 2 44 2014-12-24
# 5 2 3 2002-03-06
# 6 3 22 2006-01-04
# 7 4 13 2001-07-08
# 8 4 2 2011-06-15
From here, things are a bit simpler:
Base R
do.call(rbind, by(dat, dat[,c("IDC"),drop=FALSE], function(z) z[which.max(z$start_date),]))
# IDC neighborhood start_date
# 1 1 13 2010-06-21
# 2 2 44 2014-12-24
# 3 3 22 2006-01-04
# 4 4 2 2011-06-15
dplyr
dat %>%
group_by(IDC) %>%
slice(which.max(start_date)) %>%
ungroup()
# # A tibble: 4 x 3
# IDC neighborhood start_date
# <int> <int> <date>
# 1 1 13 2010-06-21
# 2 2 44 2014-12-24
# 3 3 22 2006-01-04
# 4 4 2 2011-06-15
Data
dat <- structure(list(IDC = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L), neighborhood = c(22L, 44L, 13L, 44L, 3L, 22L, 13L, 2L), start_date = c("08.07.2001", "04.02.2005", "21.06.2010", "24.12.2014", "06.03.2002", "04.01.2006", "08.07.2001", "15.06.2011")), class = "data.frame", row.names = c(NA, -8L))
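If the column arrives in R as a large number rather than a string, as the question describes, it may be SPSS's native date representation. A minimal sketch under that assumption: SPSS stores dates as seconds since 1582-10-14, so dividing by 86400 gives days since that origin. The values below are illustrative only, and the origin should be checked against a few known dates before relying on it.

```r
# Illustrative values, assumed to be seconds since 1582-10-14 (SPSS convention)
spss_seconds <- c(13529462400, 13529548800)

# 86400 seconds per day; as.Date() counts days from the given origin
start_date <- as.Date(spss_seconds / 86400, origin = "1582-10-14")
start_date
```

Once converted, which.max() on the Date column behaves as in the examples above.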

arrange one below the other every 2 columns from data frame in R

Hi, I have a df as below, which shows dates and their respective values:
date 1_val date 2_val . . . . date n_val
2014 23 2014 33 . . . . 2014 34
2015 22 2016 12 . . . . 2016 99
I have tried hard-coding this to arrange the columns one below the other
for 1&2 columns
a=1
b=2
names_2<-df[,c(a,b)]
colnames(names_2)[1]<-"Date"
names_2 <- names_2[!apply(is.na(names_2) | names_2 == "", 1, all),]
names_2<-melt(names_2,id=colnames(names_2)[1])
samp_out<-names_2
for 3&4 columns
a=3
b=4
names_2<-df[,c(a,b)]
colnames(names_2)[1]<-"Date"
names_2 <- names_2[!apply(is.na(names_2) | names_2 == "", 1, all),]
names_2<-melt(names_2,id=colnames(names_2)[1])
samp_out1<-names_2
till n-numbers
df1= rbind(samp_out,samp_out1,......samp_out_n)
output
date variable value
2014 1_val 23
2015 1_val 22
2014 2_val 33
2016 2_val 12
.
.
2014 n_val 34
2016 n_val 99
Thanks in advance

The function melt in the package data.table does that:
melt(df, id = "Date", measure = patterns("_val"))
You can specify the name of the variable to pivot on (Date in this case) and a pattern matching the variables whose values you want to keep. You can also supply a vector with all the variable names instead.
> DT <- data.table(Date = c(2014,2013), `1_val` = c(33, 32), Date = c(2014, 2013), `2_val` = c(65, 34))
> DT
Date 1_val Date 2_val
1: 2014 33 2014 65
2: 2013 32 2013 34
> melt(DT, id = "Date", measure = patterns("_val"))
Date variable value
1: 2014 1_val 33
2: 2013 1_val 32
3: 2014 2_val 65
4: 2013 2_val 34

You can use stack from base R,
setNames(data.frame(stack(df[c(TRUE, FALSE)])[1],
stack(df[c(FALSE, TRUE)])),
c('date', 'value', 'variable'))
# date value variable
#1 2014 33 1_val
#2 2013 32 1_val
#3 2014 65 2_val
#4 2013 34 2_val
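The hard-coded blocks in the question can also be written as a loop over column pairs. This is a minimal base R sketch, assuming the columns always come in (date, value) pairs in that order; the data frame here is a small hypothetical example built from the question's first two pairs.

```r
# Two (date, value) pairs; duplicate column names are set after creation
df <- data.frame(c(2014, 2015), c(23, 22), c(2014, 2016), c(33, 12))
names(df) <- c("date", "1_val", "date", "2_val")

# Positions of the date columns: 1, 3, 5, ...
pair_starts <- seq(1, ncol(df), by = 2)

# Build one long block per pair, then stack the blocks
long <- do.call(rbind, lapply(pair_starts, function(j) {
  data.frame(date     = df[[j]],
             variable = names(df)[j + 1],
             value    = df[[j + 1]],
             stringsAsFactors = FALSE)
}))
long
```

Indexing by position with [[ avoids any ambiguity from the repeated "date" column name.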

Define the untidy rectangle
library(magrittr)
csv <- "date,1_val,date,2_val,date,3_val
2014,23,2014,33,2014,34
2015,22,2016,12,2016,99"
Read into a data frame, then transform into a long/eav rectangle.
ds_eav <- csv %>%
readr::read_csv() %>%
tibble::rownames_to_column(var="height") %>%
tidyr::gather(key=key, value=value, -height)
output:
# A tibble: 12 x 4
key index value height
<chr> <int> <int> <int>
1 date 1 2014 1
2 date 1 2015 2
3 value 1 23 1
4 value 1 22 2
5 date 2 2014 1
6 date 2 2016 2
7 value 2 33 1
8 value 2 12 2
9 date 3 2014 1
10 date 3 2016 2
11 value 3 34 1
12 value 3 99 2
Identify which rows are dates/values. Then shift up dates' index by 1.
ds_eav <- ds_eav %>%
dplyr::mutate(
index_val = sub("^(\\d+)_val$" , "\\1", key),
index_date = sub("^date_(\\d+)$", "\\1", key),
index_date = dplyr::if_else(key=="date", "0", index_date),
key = dplyr::if_else(grepl("^date(_\\d+)*", key), "date", "value"),
index = dplyr::if_else(key=="date", index_date, index_val),
index = as.integer(index),
index = index + dplyr::if_else(key=="date", 1L, 0L)
) %>%
dplyr::select(key, index, value, height)
Follow the advice of @jarko-dubbeldam and use spread/gather on the last step too
ds_eav %>%
tidyr::spread(key=key, value=value)
output:
# A tibble: 6 x 4
index height date value
* <int> <int> <int> <int>
1 1 1 2014 23
2 1 2 2015 22
3 2 1 2014 33
4 2 2 2016 12
5 3 1 2014 34
6 3 2 2016 99
You can use paste0(index, "_val") to get your exact output. But I'd prefer to keep them as integers, so you can do math on them if necessary (e.g., max()).
edit 1: incorporate the advice & corrections of @jarko-dubbeldam and @hnskd.
edit 2: use rownames_to_column() in case the input isn't a balanced rectangle (e.g., one column doesn't have all the rows).

How to assign a value depending on two conditions including column names. (add environmental variable to tracking data)

I have a data frame (track) with the position (longitude and latitude) and date (day of the year) of tracking points for different animals, and another data frame (Var) that gives the mean temperature for every day of the year at different locations.
I would like to add a new column Temp to track, where each value is taken from Var and corresponds to the date and GPS location of that tracking point.
Here is a really simple subset of my data and what I would like to obtain.
track = data.frame(
animals=c(1,1,1,2,2),
Longitude=c(117,116,117,117,116),
Latitude=c(18,20,20,18,20),
Day=c(1,3,4,1,5))
Var = data.frame(
Longitude=c(117,117,116,116),
Latitude=c(18,20,18,20),
Day1=c(22,23,24,21),
Day2=c(21,28,27,29),
Day3=c(12,13,14,11),
Day4=c(17,19,20,23),
Day5=c(32,33,34,31)
)
TrackPlusVar = data.frame(
animals=c(1,1,1,2,2),
Longitude=c(117,116,117,117,116),
Latitude=c(18,20,20,18,20),
Day=c(1,3,4,1,5),
Temp= c(22,11,19,22,31)
)
I've no idea how to assign the value from the same date and GPS location as it is a column name. Any idea would be very useful !

This is a dplyr and tidyr approach.
library(dplyr)
library(tidyr)
# reshape table Var
Var %>%
gather(Day,Temp,-Longitude, -Latitude) %>%
mutate(Day = as.numeric(gsub("Day","",Day))) -> Var2
# join tables
track %>% left_join(Var2, by=c("Longitude", "Latitude", "Day"))
# animals Longitude Latitude Day Temp
# 1 1 117 18 1 22
# 2 1 116 20 3 11
# 3 1 117 20 4 19
# 4 2 117 18 1 22
# 5 2 116 20 5 31
If the process that creates your tables makes sure that all your cases belong to both tables, then you can use inner_join instead of left_join to make the process faster.
If you're still not happy with the speed you can use a data.table join process to check if it is faster, like:
library(data.table)
Var2 = setDT(Var2, key = c("Longitude", "Latitude", "Day"))
track = setDT(track, key = c("Longitude", "Latitude", "Day"))
Var2[track][order(animals,Day)]
# Longitude Latitude Day Temp animals
# 1: 117 18 1 22 1
# 2: 116 20 3 11 1
# 3: 117 20 4 19 1
# 4: 117 18 1 22 2
# 5: 116 20 5 31 2
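For small tables, the same lookup can be done in base R without reshaping, by matching each tracking point to a row and a column of Var and indexing with a two-column matrix. This is a sketch that assumes every location in track appears exactly once in Var and that the day columns are named Day1, Day2, and so on, as in the question.

```r
track <- data.frame(animals = c(1, 1, 1, 2, 2),
                    Longitude = c(117, 116, 117, 117, 116),
                    Latitude = c(18, 20, 20, 18, 20),
                    Day = c(1, 3, 4, 1, 5))
Var <- data.frame(Longitude = c(117, 117, 116, 116),
                  Latitude = c(18, 20, 18, 20),
                  Day1 = c(22, 23, 24, 21), Day2 = c(21, 28, 27, 29),
                  Day3 = c(12, 13, 14, 11), Day4 = c(17, 19, 20, 23),
                  Day5 = c(32, 33, 34, 31))

# Row of Var holding each tracking point's location
row_idx <- match(paste(track$Longitude, track$Latitude),
                 paste(Var$Longitude, Var$Latitude))
# Column of Var holding each tracking point's day
col_idx <- match(paste0("Day", track$Day), names(Var))

# A two-column (row, column) matrix picks one cell per tracking point
track$Temp <- as.matrix(Var)[cbind(row_idx, col_idx)]
track
```

Unmatched locations or days produce NA in row_idx or col_idx, which carries through to Temp.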

R Cleaning and reordering names/serial numbers in data frame

Let's say I have a data frame as follows in R:
Data <- data.frame("SerialNum" = character(), "Year" = integer(), "Name" = character(), stringsAsFactors = F)
Data[1,] <- c("983\n837\n424\n ", 2015, "Michael\nLewis\nPaul\n ")
Data[2,] <- c("123\n456\n789\n136", 2014, "Elaine\nJerry\nGeorge\nKramer")
Data[3,] <- c("987\n654\n321\n975\n ", 2010, "John\nPaul\nGeorge\nRingo\nNA")
Data[4,] <- c("424\n983\n837", 2015, "Paul\nMichael\nLewis")
Data[5,] <- c("456\n789\n123\n136", 2014, "Jerry\nGeorge\nElaine\nKramer")
What I want to do is the following:
Split up each string of names and each string of serial numbers so that they are their own vectors (or a list of string vectors).
Eliminate any character "NA" in either set of vectors or any blank spaces denoted by "...\n ".
Reorder each list of names alphabetically and reorder the corresponding serial numbers according to the same permutation.
Concatenate each vector in the same fashion it was originally (I usually do this with paste(., collapse = "\n")).
My issue is how to do this without using a for loop. What is an object-oriented way to do this? As a first attempt in this direction I originally made a list by the command LIST <- strsplit(Data$Name, split = "\n") and from here I need a for loop in order to find the permutations of the names, which seems like a process that won't scale according to my actual data. Additionally, once I make the list LIST I'm not sure how I go about removing NA symbols or blank spaces. Any help is appreciated!

Using lapply I take each row of the data frame and turn it into a new data frame with one name per row. This creates a list of 5 data frames, one for each row of the original data frame.
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Year=Data[i,"Year"],
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
})
UPDATE: Based on your comment, let me know if this is the result you're trying to achieve:
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
# Collapse back into a single row with the new sort order
dat = data.frame(SerialNum=paste(dat[, "SerialNum"], collapse="\n"),
Year=Data[i, "Year"],
Name=paste(dat[, "Name"], collapse="\n"))
})
do.call(rbind, seinfeld)
SerialNum Year Name
1 837\n983\n424 2015 Lewis\nMichael\nPaul
2 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
3 321\n987\n654\n975 2010 George\nJohn\nPaul\nRingo
4 837\n983\n424 2015 Lewis\nMichael\nPaul
5 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
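The same steps can also be packed into a small helper applied to the two columns in parallel. This is a sketch only: sort_by_name is a hypothetical helper name, and it assumes each SerialNum string splits into the same number of pieces as the matching Name string, as in the sample data. Only the first two rows of the question's data are used here.

```r
Data <- data.frame(
  SerialNum = c("983\n837\n424\n ", "123\n456\n789\n136"),
  Year = c(2015, 2014),
  Name = c("Michael\nLewis\nPaul\n ", "Elaine\nJerry\nGeorge\nKramer"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split, drop blanks and "NA", sort by name, re-collapse
sort_by_name <- function(serial, name) {
  s <- trimws(strsplit(serial, "\n", fixed = TRUE)[[1]])
  n <- trimws(strsplit(name, "\n", fixed = TRUE)[[1]])
  keep <- !(n %in% c("", "NA"))
  s <- s[keep]
  n <- n[keep]
  o <- order(n, method = "radix")  # locale-independent ordering
  c(SerialNum = paste(s[o], collapse = "\n"),
    Name = paste(n[o], collapse = "\n"))
}

# mapply returns a 2 x nrow matrix; transpose to one row per record
sorted <- t(mapply(sort_by_name, Data$SerialNum, Data$Name, USE.NAMES = FALSE))
Data$SerialNum <- sorted[, "SerialNum"]
Data$Name <- sorted[, "Name"]
Data
```

mapply still loops internally, so this is mainly a tidier way to express the per-row work rather than a speed-up.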

eipi10 offered a great answer. In addition, here is what I tried, mainly with data.table. First, I split the two columns (SerialNum and Name) with cSplit(), added an index with add_rownames(), and split the data by that index. In the first lapply(), I used Stacked() from the splitstackshape package to stack SerialNum and Name; the separated SerialNum and Name values become two columns, as you can see in the part of temp2 shown below. In the second lapply(), I used merge() from the data.table package. Then I removed rows with NAs (lapply(na.omit)), combined all the data tables (rbindlist), and ordered the rows by rowname (the row number in the original data) and Name (setorder(rowname, Name)).
library(data.table)
library(splitstackshape)
library(dplyr)
cSplit(mydf, c("SerialNum", "Name"), direction = "wide",
type.convert = FALSE, sep = "\n") %>%
add_rownames %>%
split(f = .$rowname) -> temp
#a part of temp
#$`1`
#Source: local data frame [1 x 12]
#
#rowname Year SerialNum_1 SerialNum_2 SerialNum_3 SerialNum_4 SerialNum_5 Name_1 Name_2
#(chr) (dbl) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
#1 1 2015 983 837 424 NA NA Michael Lewis
#Variables not shown: Name_3 (chr), Name_4 (chr), Name_5 (chr)
lapply(temp, function(x){
Stacked(x, var.stubs = c("SerialNum", "Name"), sep = "_")
}) -> temp2
# A part of temp2
#$`1`
#$`1`$SerialNum
# rowname Year .time_1 SerialNum
#1: 1 2015 1 983
#2: 1 2015 2 837
#3: 1 2015 3 424
#4: 1 2015 4 NA
#5: 1 2015 5 NA
#
#$`1`$Name
# rowname Year .time_1 Name
#1: 1 2015 1 Michael
#2: 1 2015 2 Lewis
#3: 1 2015 3 Paul
#4: 1 2015 4 NA
#5: 1 2015 5 NA
lapply(1:nrow(mydf), function(x){
merge(temp2[[x]]$SerialNum, temp2[[x]]$Name, by = c("rowname", "Year", ".time_1"))
}) %>%
lapply(na.omit) %>%
rbindlist %>%
setorder(rowname, Name) -> out
print(out)
# rowname Year .time_1 SerialNum Name
# 1: 1 2015 2 837 Lewis
# 2: 1 2015 1 983 Michael
# 3: 1 2015 3 424 Paul
# 4: 2 2014 1 123 Elaine
# 5: 2 2014 3 789 George
# 6: 2 2014 2 456 Jerry
# 7: 2 2014 4 136 Kramer
# 8: 3 2010 3 321 George
# 9: 3 2010 1 987 John
#10: 3 2010 2 654 Paul
#11: 3 2010 4 975 Ringo
#12: 4 2015 3 837 Lewis
#13: 4 2015 2 983 Michael
#14: 4 2015 1 424 Paul
#15: 5 2014 3 123 Elaine
#16: 5 2014 2 789 George
#17: 5 2014 1 456 Jerry
#18: 5 2014 4 136 Kramer
DATA
mydf <- structure(list(SerialNum = c("983\n837\n424\n ", "123\n456\n789\n136",
"987\n654\n321\n975\n ", "424\n983\n837", "456\n789\n123\n136"
), Year = c(2015, 2014, 2010, 2015, 2014), Name = c("Michael\nLewis\nPaul\n ",
"Elaine\nJerry\nGeorge\nKramer", "John\nPaul\nGeorge\nRingo\nNA",
"Paul\nMichael\nLewis", "Jerry\nGeorge\nElaine\nKramer")), .Names = c("SerialNum",
"Year", "Name"), row.names = c(NA, -5L), class = "data.frame")
