R Cleaning and reordering names/serial numbers in data frame - r

Let's say I have a data frame as follows in R:
Data <- data.frame("SerialNum" = character(), "Year" = integer(), "Name" = character(), stringsAsFactors = F)
Data[1,] <- c("983\n837\n424\n ", 2015, "Michael\nLewis\nPaul\n ")
Data[2,] <- c("123\n456\n789\n136", 2014, "Elaine\nJerry\nGeorge\nKramer")
Data[3,] <- c("987\n654\n321\n975\n ", 2010, "John\nPaul\nGeorge\nRingo\nNA")
Data[4,] <- c("424\n983\n837", 2015, "Paul\nMichael\nLewis")
Data[5,] <- c("456\n789\n123\n136", 2014, "Jerry\nGeorge\nElaine\nKramer")
What I want to do is the following:
Split up each string of names and each string of serial numbers so that they are their own vectors (or a list of string vectors).
Eliminate any character "NA" in either set of vectors or any blank spaces denoted by "...\n ".
Reorder each list of names alphabetically and reorder the corresponding serial numbers according to the same permutation.
Concatenate each vector in the same fashion it was originally (I usually do this with paste(., collapse = "\n")).
My issue is how to do this without using a for loop. What is an object-oriented way to do this? As a first attempt in this direction I originally made a list by the command LIST <- strsplit(Data$Name, split = "\n") and from here I need a for loop in order to find the permutations of the names, which seems like a process that won't scale according to my actual data. Additionally, once I make the list LIST I'm not sure how I go about removing NA symbols or blank spaces. Any help is appreciated!

Using lapply I take each row of the data frame and turn it into a new data frame with one name per row. This creates a list of 5 data frames, one for each row of the original data frame.
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Year=Data[i,"Year"],
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
})
UPDATE: Based on your comment, let me know if this is the result you're trying to achieve:
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
# Collapse back into a single row with the new sort order
dat = data.frame(SerialNum=paste(dat[, "SerialNum"], collapse="\n"),
Year=Data[i, "Year"],
Name=paste(dat[, "Name"], collapse="\n"))
})
do.call(rbind, seinfeld)
SerialNum Year Name
1 837\n983\n424 2015 Lewis\nMichael\nPaul
2 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
3 321\n987\n654\n975 2010 George\nJohn\nPaul\nRingo
4 837\n983\n424 2015 Lewis\nMichael\nPaul
5 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
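For comparison, the four steps from the question (split, clean, reorder, re-collapse) can also be sketched without building an intermediate data frame per row. This is only an illustration: the helper name reorder_pair is made up here, and it assumes that the SerialNum and Name strings of a row always split into the same number of pieces, as they do in the sample data.

```r
# Two rows taken from the question's data (rows 1 and 4)
Data <- data.frame(
  SerialNum = c("983\n837\n424\n ", "424\n983\n837"),
  Year      = c(2015, 2015),
  Name      = c("Michael\nLewis\nPaul\n ", "Paul\nMichael\nLewis"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from the answers): clean, sort and re-collapse one row.
# Assumes both strings split into the same number of pieces.
reorder_pair <- function(serials, names) {
  s <- strsplit(serials, "\n", fixed = TRUE)[[1]]
  n <- strsplit(names, "\n", fixed = TRUE)[[1]]
  keep <- !(trimws(n) %in% c("", "NA"))   # drop blanks and the literal string "NA"
  s <- s[keep]
  n <- n[keep]
  o <- order(n)                           # permutation that sorts the names
  c(SerialNum = paste(s[o], collapse = "\n"),
    Name      = paste(n[o], collapse = "\n"))
}

# mapply walks both columns in parallel and returns a 2 x nrow character matrix
res <- mapply(reorder_pair, Data$SerialNum, Data$Name, USE.NAMES = FALSE)
Data$SerialNum <- res["SerialNum", ]
Data$Name      <- res["Name", ]
```

Like the lapply version, this still calls a function once per row; it just avoids the per-row data.frame construction and the final rbind.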

eipi10 offered a great answer. In addition, I'd like to leave what I tried, mainly with data.table. First, I split the two columns (SerialNum and Name) with cSplit(), added an index with add_rownames(), and split the data by that index. In the first lapply(), I used Stacked() from the splitstackshape package to stack SerialNum and Name; the separated SerialNum and Name values become two columns, as you can see in the part of temp2 shown below. In the second lapply(), I used merge() from the data.table package. Then I removed rows with NAs (lapply(na.omit)), combined all the data tables (rbindlist), and ordered the rows by rowname (the row number in the original data) and Name (setorder(rowname, Name)).
library(data.table)
library(splitstackshape)
library(dplyr)
cSplit(mydf, c("SerialNum", "Name"), direction = "wide",
type.convert = FALSE, sep = "\n") %>%
add_rownames %>%
split(f = .$rowname) -> temp
#a part of temp
#$`1`
#Source: local data frame [1 x 12]
#
#rowname Year SerialNum_1 SerialNum_2 SerialNum_3 SerialNum_4 SerialNum_5 Name_1 Name_2
#(chr) (dbl) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
#1 1 2015 983 837 424 NA NA Michael Lewis
#Variables not shown: Name_3 (chr), Name_4 (chr), Name_5 (chr)
lapply(temp, function(x){
Stacked(x, var.stubs = c("SerialNum", "Name"), sep = "_")
}) -> temp2
# A part of temp2
#$`1`
#$`1`$SerialNum
# rowname Year .time_1 SerialNum
#1: 1 2015 1 983
#2: 1 2015 2 837
#3: 1 2015 3 424
#4: 1 2015 4 NA
#5: 1 2015 5 NA
#
#$`1`$Name
# rowname Year .time_1 Name
#1: 1 2015 1 Michael
#2: 1 2015 2 Lewis
#3: 1 2015 3 Paul
#4: 1 2015 4 NA
#5: 1 2015 5 NA
lapply(1:nrow(mydf), function(x){
merge(temp2[[x]]$SerialNum, temp2[[x]]$Name, by = c("rowname", "Year", ".time_1"))
}) %>%
lapply(na.omit) %>%
rbindlist %>%
setorder(rowname, Name) -> out
print(out)
# rowname Year .time_1 SerialNum Name
# 1: 1 2015 2 837 Lewis
# 2: 1 2015 1 983 Michael
# 3: 1 2015 3 424 Paul
# 4: 2 2014 1 123 Elaine
# 5: 2 2014 3 789 George
# 6: 2 2014 2 456 Jerry
# 7: 2 2014 4 136 Kramer
# 8: 3 2010 3 321 George
# 9: 3 2010 1 987 John
#10: 3 2010 2 654 Paul
#11: 3 2010 4 975 Ringo
#12: 4 2015 3 837 Lewis
#13: 4 2015 2 983 Michael
#14: 4 2015 1 424 Paul
#15: 5 2014 3 123 Elaine
#16: 5 2014 2 789 George
#17: 5 2014 1 456 Jerry
#18: 5 2014 4 136 Kramer
DATA
mydf <- structure(list(SerialNum = c("983\n837\n424\n ", "123\n456\n789\n136",
"987\n654\n321\n975\n ", "424\n983\n837", "456\n789\n123\n136"
), Year = c(2015, 2014, 2010, 2015, 2014), Name = c("Michael\nLewis\nPaul\n ",
"Elaine\nJerry\nGeorge\nKramer", "John\nPaul\nGeorge\nRingo\nNA",
"Paul\nMichael\nLewis", "Jerry\nGeorge\nElaine\nKramer")), .Names = c("SerialNum",
"Year", "Name"), row.names = c(NA, -5L), class = "data.frame")

Related

R separate lines into columns specified by start and end

I'd like to split a dataset made of character strings into columns specified by start and end.
My dataset looks something like this:
>head(templines,3)
[1] "201801 1  78"
[2] "201801 2  67"
[3] "201801 1  13"
and I'd like to split it by specifying my columns using the data dictionary:
>dictionary
col_name col_start col_end
year 1 4
week 5 6
gender 8 8
age 11 12
so it becomes:
year week gender age
2018 01 1 78
2018 01 2 67
2018 01 1 13
In reality the data comes from a long-running survey, and the white space between some columns represents variables that are no longer collected. It has many variables, so I need a solution that scales.
In tidyr::separate it looks like you can only split by specifying the position to split at, rather than the start and end positions. Is there a way to use start / end?
I thought of doing this with read_fwf but I can't seem to be able to use it on my already loaded dataset. I only managed to get it to work by first exporting as a txt and then reading from this .txt:
write_lines(templines,"t1.txt")
read_fwf("t1.txt",
fwf_positions(start = dictionary$col_start,
end = dictionary$col_end,
col_names = dictionary$col_name))
Is it possible to use read_fwf on an already loaded dataset?
Answering your question directly: yes, it is possible to use read_fwf with already loaded data. The relevant part of the docs is the part about the argument file:
Either a path to a file, a connection, or literal data (either a single string or a raw vector).
...
Literal data is most useful for examples and tests.
It must contain at least one new line to be recognised as data (instead of a path).
Thus, you can simply collapse your data and then use read_fwf:
templines %>%
paste(collapse = "\n") %>%
read_fwf(., fwf_positions(start = dictionary$col_start,
end = dictionary$col_end,
col_names = dictionary$col_name))
This should scale to multiple columns, and is fast for many rows (on my machine for 1 million rows and four columns about half a second).
There are a few warnings regarding parsing failures, but they stem from your dictionary. If you change the last line to age, 11, 12 it works as expected.
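A similar idea can be sketched in base R with read.fwf and a textConnection, which also avoids the temporary file. read.fwf takes field widths rather than start/end positions, so the sketch below derives the widths from the dictionary and encodes the unused gaps as negative widths (which read.fwf skips). It assumes the lines have two blanks before the age field, so that age really occupies positions 11 to 12; the variable names gap, width and widths are only illustrative.

```r
templines <- c("201801 1  78", "201801 2  67", "201801 1  13")
dictionary <- data.frame(
  col_name  = c("year", "week", "gender", "age"),
  col_start = c(1L, 5L, 8L, 11L),
  col_end   = c(4L, 6L, 8L, 12L),
  stringsAsFactors = FALSE
)

# Number of unused characters in front of each field, and the field widths
gap   <- dictionary$col_start - c(0L, head(dictionary$col_end, -1)) - 1L
width <- dictionary$col_end - dictionary$col_start + 1L

# Interleave "skip" (negative) and "read" (positive) widths, dropping empty gaps
widths <- as.vector(rbind(-gap, width))
widths <- widths[widths != 0]

con <- textConnection(templines)
out <- read.fwf(con, widths = widths,
                col.names = dictionary$col_name,
                colClasses = "character")   # keep "01" as text
close(con)
```

Reading everything as character keeps leading zeros such as the week "01"; columns can be converted afterwards if numbers are wanted.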
A solution with substring:
library(data.table)
x <- transpose(lapply(templines, substring, dictionary$col_start, dictionary$col_end))
setDT(x)
setnames(x, dictionary$col_name)
# > x
# year week gender age
# 1: 2018 01 1 78
# 2: 2018 01 2 67
# 3: 2018 01 1 13
How about this?
data.frame(year=substr(templines,1,4),
week=substr(templines,5,6),
gender=substr(templines,8,8),
age=substr(templines,11,13))
Using base R:
m = list(`attr<-`(dat$col_start,"match.length",dat$col_end-dat$col_start+1))
d = do.call(rbind,regmatches(x,rep(m,length(x))))
setNames(data.frame(d),dat$col_name)
year week gender age
1 2018 01 1 78
2 2018 01 2 67
3 2018 01 1 13
DATA USED:
x = c("201801 1  78", "201801 2  67", "201801 1  13")
dat=read.table(text="col_name col_start col_end
year 1 4
week 5 6
gender 8 8
age 11 13 ",h=T)
We could use separate from tidyverse
library(tidyverse)
data.frame(Col = templines) %>%
separate(Col, into = dictionary$col_name, sep= head(dictionary$col_end, -1))
# year week gender age
#1 2018 01 1 78
#2 2018 01 2 67
#3 2018 01 1 13
The convert = TRUE argument can also be used with separate to have numeric columns as output
tibble(Col = templines) %>%
separate(Col, into = dictionary$col_name,
sep= head(dictionary$col_end, -1), convert = TRUE)
# A tibble: 3 x 4
# year week gender age
# <int> <int> <int> <int>
#1 2018 1 1 78
#2 2018 1 2 67
#3 2018 1 1 13
data
dictionary <- structure(list(col_name = c("year", "week", "gender", "age"),
col_start = c(1L, 5L, 8L, 11L), col_end = c(4L, 6L, 8L, 13L
)), .Names = c("col_name", "col_start", "col_end"),
class = "data.frame", row.names = c(NA, -4L))
templines <- c("201801 1  78", "201801 2  67", "201801 1  13")
This is an explicit function which seems to be working the way you wanted.
split_func<-function(char,ref,name,start,end){
res<-data.table("ID" = 1:length(char))
for(i in 1:nrow(ref)){
res[,ref[[name]][i] := substr(x = char,start = ref[[start]][i],stop = ref[[end]][i])]
}
return(res)
}
I have created the same input files as you:
templines<-c("201801 1  78","201801 2  67","201801 1  13")
dictionary<-data.table("col_name" = c("year","week","gender","age"),"col_start" = c(1,5,8,11),
"col_end" = c(4,6,8,13))
# col_name col_start col_end
#1: year 1 4
#2: week 5 6
#3: gender 8 8
#4: age 11 13
As for the arguments,
char - The character vector with the values you want to split
ref - The reference table or dictionary
name - The column number in the reference table containing the column names you want
start - The column number in the reference table containing the start points
end - The column number in the reference table containing the stop points
If I use this function with these inputs, I get the following result:
out<-split_func(char = templines,ref = dictionary,name = 1,start = 2,end = 3)
#>out
# ID year week gender age
#1: 1 2018 01 1 78
#2: 2 2018 01 2 67
#3: 3 2018 01 1 13
I had to include an "ID" column to initiate the data table and make this easier. In case you want to drop it later you can just use:
out[,ID := NULL]
Hope this is closer to the solution you were looking for.

Programmatically Finding, Correcting IDs in Dataframes with Different Column and Row Lengths

I have two data frames of differing lengths and widths. Both contain panel data on sites across several years, with each site having a unique ID code. However, these unique ID codes were altered for some sites between data frames. For example:
Year <- c(2006,2006,2006,2006)
Name <- as.character(c("A","B","C","D.B"))
Qtr.2 <- as.numeric(c(14,32,62,40))
Code <- as.character(c(123,456,789,101))
DF1 <- data.frame(Year,Name,Qtr.2,Code,stringsAsFactors = FALSE)
Year2 <- c(2007,2007,2007,2007,2007,2007)
Name2 <- as.character(c("A","B","C","E","D.B","D.A"))
Qtr.3 <- as.numeric(c(14,32,62,11,40,20))
Code2 <- as.character(c("W33","456","789","121","W133","W111"))
Type <- as.character(c("Blue","Red","Red","Green","Blue","Red"))
DF2 <- data.frame(Year2,Name2,Qtr.3,Code2,Type,stringsAsFactors = FALSE)
> DF1
Year Name Qtr.2 Code
1 2006 A 14 123
2 2006 B 32 456
3 2006 C 62 789
4 2006 D.B 40 101
> DF2
Year2 Name2 Qtr.3 Code2 Type
1 2007 A 14 W33 Blue
2 2007 B 32 456 Red
3 2007 C 62 789 Red
4 2007 E 11 121 Green
5 2007 D.B 40 W133 Blue
6 2007 D.A 20 W111 Red
Here, site “A's” code has changed from “123” in DF1 to “W33” in DF2.
I am having trouble programmatically finding and converting the altered ID codes to match their prior ID code. In other words, I want to match names from DF1 to DF2, and replace "Code2" in DF2 with "Code" from DF1 when a matching name is discovered. My approach thus far has involved a rather convoluted padding and for loop process. However, I feel this must be a semiregular wrangling problem and there must be a simpler approach.
Ideally, my second DF would look as follows:
Year2_fixed <- c(2007,2007,2007,2007,2007,2007)
Name2_fixed <- as.character(c("A","B","C","E","D.B","D.A"))
Qtr.3_fixed <- as.numeric(c(14,32,62,11,40,20))
Code2_fixed <- as.character(c("123","456","789","121","101","W111"))
Type <- as.character(c("Blue","Red","Red","Green","Blue","Red"))
DF2_fixed <-data.frame(Year2_fixed,Name2_fixed,Qtr.3_fixed,Code2_fixed,Type,stringsAsFactors = FALSE)
> DF2_fixed
Year2_fixed Name2_fixed Qtr.3_fixed Code2_fixed Type
1 2007 A 14 123 Blue
2 2007 B 32 456 Red
3 2007 C 62 789 Red
4 2007 E 11 121 Green
5 2007 D.B 40 101 Blue
6 2007 D.A 20 W111 Red
I have done some looking but I haven't found a clear answer on SO that gets at this problem. It is possible I am not asking the question clearly enough in searches. Please point it out if it is out there, or let me know if I can clarify my question.
A few last points: I want to be able to perform an inner_join BY the code, preserving those observations that appear in both sets. I am providing a toy example, but, as is often the case, the true problem is too large to manually check these names.
Edit
As pointed out by others, stringsAsFactors = FALSE has been added to prevent errors.
Try using the match command:
DF2 <- within(DF2, {
ind <- match(Name2, DF1$Name)
new_code <- DF1$Code[ind]
Code_fixed <- ifelse(is.na(ind), as.character(Code2), as.character(new_code))
rm(ind, new_code)
})
DF2
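If the goal is to overwrite Code2 in place rather than add a new column, the same match idea can be sketched as below. Only the Name and Code columns from the question are reproduced; the variable names idx, hit and both are illustrative. The last line shows one base-R way to get the inner join by code that the question mentions.

```r
DF1 <- data.frame(Name = c("A", "B", "C", "D.B"),
                  Code = c("123", "456", "789", "101"),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(Name2 = c("A", "B", "C", "E", "D.B", "D.A"),
                  Code2 = c("W33", "456", "789", "121", "W133", "W111"),
                  stringsAsFactors = FALSE)

idx <- match(DF2$Name2, DF1$Name)   # row of DF1 for each DF2 name, NA if absent
hit <- !is.na(idx)                  # names that exist in both data frames

# Replace the code only where a matching name was found
DF2$Code2[hit] <- DF1$Code[idx[hit]]

# Inner join on the (now consistent) codes
both <- merge(DF1, DF2, by.x = "Code", by.y = "Code2")
```

Note that match uses the first matching name, so this assumes names are unique within DF1.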
A solution is to use dplyr::coalesce along with left_join to get the desired result.
library(dplyr)
DF2 %>% left_join(select(DF1, Name, Code), by=c("Name2" = "Name")) %>%
mutate(Code2 = coalesce(Code, Code2)) %>%
select(-Code)
# Year2 Name2 Qtr.3 Code2 Type
# 1 2007 A 14 123 Blue
# 2 2007 B 32 456 Red
# 3 2007 C 62 789 Red
# 4 2007 E 11 121 Green
# 5 2007 D.B 40 101 Blue
# 6 2007 D.A 20 W111 Red
Note: stringsAsFactors = FALSE has been added in OP's code to create data.frames, otherwise it would generate unnecessary warnings.
Data:
Year <- c(2006,2006,2006,2006)
Name <- as.character(c("A","B","C","D.B"))
Qtr.2 <- as.numeric(c(14,32,62,40))
Code <- as.character(c(123,456,789,101))
DF1 <- data.frame(Year,Name,Qtr.2,Code, stringsAsFactors = FALSE)
Year2 <- c(2007,2007,2007,2007,2007,2007)
Name2 <- as.character(c("A","B","C","E","D.B","D.A"))
Qtr.3 <- as.numeric(c(14,32,62,11,40,20))
Code2 <- as.character(c("W33","456","789","121","W133","W111"))
Type <- as.character(c("Blue","Red","Red","Green","Blue","Red"))
DF2 <- data.frame(Year2,Name2,Qtr.3,Code2,Type, stringsAsFactors = FALSE)

arrange one below the other every 2 columns from data frame in R

Hi, I have a df like the one below, which shows dates and their respective values:
date 1_val date 2_val . . . . date n_val
2014 23 2014 33 . . . . 2014 34
2015 22 2016 12 . . . . 2016 99
I have tried hard-coding this to arrange the columns one below the other
for 1&2 columns
a=1
b=2
names_2<-df[,c(a,b)]
colnames(names_2)[1]<-"Date"
names_2 <- names_2[!apply(is.na(names_2) | names_2 == "", 1, all),]
names_2<-melt(names_2,id=colnames(names_2)[1])
samp_out<-names_2
for 3&4 columns
a=3
b=4
names_2<-df[,c(a,b)]
colnames(names_2)[1]<-"Date"
names_2 <- names_2[!apply(is.na(names_2) | names_2 == "", 1, all),]
names_2<-melt(names_2,id=colnames(names_2)[1])
samp_out1<-names_2
till n-numbers
df1= rbind(samp_out,samp_out1,......samp_out_n)
output
date variable value
2014 1_val 23
2015 1_val 22
2014 2_val 33
2016 2_val 12
.
.
2014 n_val 34
2016 n_val 99
Thanks in advance
The function melt in the package data.table does that:
melt(df, id = "Date", measure = patterns("_val"))
You can specify the name of the variable to pivot on (Date in this case) and a pattern for the variables whose values you want to keep. You can also supply a vector of all the variable names instead.
> DT <- data.table(Date = c(2014,2013), `1_val` = c(33, 32), Date = c(2014, 2013), `2_val` = c(65, 34))
> DT
Date 1_val Date 2_val
1: 2014 33 2014 65
2: 2013 32 2013 34
> melt(DT, id = "Date", measure = patterns("_val"))
Date variable value
1: 2014 1_val 33
2: 2013 1_val 32
3: 2014 2_val 65
4: 2013 2_val 34
You can use stack from base R,
setNames(data.frame(stack(df[c(TRUE, FALSE)])[1],
stack(df[c(FALSE, TRUE)])),
c('date', 'value', 'variable'))
# date value variable
#1 2014 33 1_val
#2 2013 32 1_val
#3 2014 65 2_val
#4 2013 34 2_val
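The hard-coded blocks in the question can also be generalised directly: loop over the odd column positions and bind the resulting pieces. The sketch below assumes the columns always come in (date, value) pairs, as in the question; the names pairs and long are illustrative.

```r
# check.names = FALSE keeps the repeated "date" name and the "1_val" style names
df <- data.frame(date = c(2014, 2015), `1_val` = c(23, 22),
                 date = c(2014, 2016), `2_val` = c(33, 12),
                 check.names = FALSE)

pairs <- seq(1, ncol(df), by = 2)   # positions of the date columns

long <- do.call(rbind, lapply(pairs, function(j) {
  data.frame(date     = df[[j]],           # j-th column: the dates
             variable = names(df)[j + 1],  # name of the value column next to it
             value    = df[[j + 1]],
             stringsAsFactors = FALSE)
}))
```

Columns are addressed by position here because the date columns share the same name.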
Define the untidy rectangle
library(magrittr)
csv <- "date,1_val,date,2_val,date,3_val
2014,23,2014,33,2014,34
2015,22,2016,12,2016,99"
Read into a data frame, then transform into a long/eav rectangle.
ds_eav <- csv %>%
readr::read_csv() %>%
tibble::rownames_to_column(var="height") %>%
tidyr::gather(key=key, value=value, -height)
output:
# A tibble: 12 x 4
key index value height
<chr> <int> <int> <int>
1 date 1 2014 1
2 date 1 2015 2
3 value 1 23 1
4 value 1 22 2
5 date 2 2014 1
6 date 2 2016 2
7 value 2 33 1
8 value 2 12 2
9 date 3 2014 1
10 date 3 2016 2
11 value 3 34 1
12 value 3 99 2
Identify which rows are dates/values. Then shift up dates' index by 1.
ds_eav <- ds_eav %>%
dplyr::mutate(
index_val = sub("^(\\d+)_val$" , "\\1", key),
index_date = sub("^date_(\\d+)$", "\\1", key),
index_date = dplyr::if_else(key=="date", "0", index_date),
key = dplyr::if_else(grepl("^date(_\\d+)*", key), "date", "value"),
index = dplyr::if_else(key=="date", index_date, index_val),
index = as.integer(index),
index = index + dplyr::if_else(key=="date", 1L, 0L)
) %>%
dplyr::select(key, index, value, height)
Follow the advice of @jarko-dubbeldam and use spread/gather on the last step too
ds_eav %>%
tidyr::spread(key=key, value=value)
output:
# A tibble: 6 x 4
index height date value
* <int> <int> <int> <int>
1 1 1 2014 23
2 1 2 2015 22
3 2 1 2014 33
4 2 2 2016 12
5 3 1 2014 34
6 3 2 2016 99
You can use paste0(index, "_val") to get your exact output. But I'd prefer to keep them as integers, so you can do math on them if necessary (e.g., max()).
edit 1: incorporate the advice & corrections of @jarko-dubbeldam and @hnskd.
edit 2: use rownames_to_column() in case the input isn't a balanced rectangle (e.g., one column doesn't have all the rows).

Organizing Multidimensional Data in R

I am trying to organize multidimensional data in R. The data is read into R from a CSV file. My data frame looks as follows:
Rank Arrangers YearAmt
1994
1 JPM 6,605.00
2 UBS 7,806.00
3 RBS 1,167.34
1995
1 Citi 1,150.00
2 Scotiabank 483.33
3 ING 800.56
4 UniCredit 700.70
This is just toy data; the original dataset is large. I would like to subset the data by year (1994, 1995, etc.) so that I can conduct some analysis. I have tried to subset the dataset by factor/level using sapply and subset, but I realized R is just treating 1994 and 1995 as data in a row. I am thinking of reformatting the original CSV file by creating Year as a separate column and then putting the corresponding year in that field for all the rows.
I would appreciate any help in suggesting a way to organize data in R. I am expecting an output like this:
Rank Arrangers YearAmt Year
1 JPM 6,605.00 1994
2 UBS 7,806.00 1994
3 RBS 1,167.34 1994
1 Citi 1,150.00 1995
2 Scotiabank 483.33 1995
3 ING 800.56 1995
4 UniCredit 700.70 1995
1) ave. cumsum(Rank == "") creates a grouping variable that increments at each year row. ave then builds the Year column group by group: within each group it returns NA for the year row itself, followed by the year repeated for the remaining rows. Finally, na.omit removes the rows containing NA, i.e. the year rows. No packages are used:
na.year <- function(x) c(NA, rep(x[1], length(x) - 1)) # c(NA, x[1], x[1], ..., x[1])
na.omit( transform(df1, Year = ave(YearAmt, cumsum(Rank == ""), FUN = na.year)) )
Using the input df1 reproducibly defined in the answer from @akrun we get:
Rank Arrangers YearAmt Year
2 1 JPM 6,605.00 1994
3 2 UBS 7,806.00 1994
4 3 RBS 1,167.34 1994
6 1 Citi 1,150.00 1995
7 2 Scotiabank 483.33 1995
8 3 ING 800.56 1995
9 4 UniCredit 700.70 1995
2) by. Use by to split df1 into one piece per year and apply addYear to each piece, which drops the year row and attaches its value as a Year column. Finally, rbind puts the pieces back together. No packages are used.
addYear <- function(x) cbind(x[-1, ], Year = x[1, "YearAmt"])
do.call("rbind", by(df1, cumsum(df1$Rank == ""), addYear))
3) sqldf. Using the sqldf package, we can join each row of df1 with all prior rows of itself that have an empty Rank, taking the maximum YearAmt of those rows as the Year. Then we keep only the rows with a non-empty Rank.
library(sqldf)
sqldf("select b.*, max(a.YearAmt) Year
from df1 a join df1 b on a.rowid < b.rowid and a.Rank = ''
group by b.rowid
having b.Rank != ''")
We create a logical vector based on the blank elements in 'Rank' ('i1'), then subset the rows of 'df1' by removing all the blank rows using 'i1' (df1[!i1,]) and transform the dataset to create the 'Year' column by replicating the 'YearAmt' (that corresponds to the blank in 'Rank') using the cumulative sum of 'i1'.
i1 <- df1$Rank == ''
res <- transform(df1[!i1,], Year = df1$YearAmt[i1][cumsum(i1)[!i1]])
res
# Rank Arrangers YearAmt Year
#2 1 JPM 6,605.00 1994
#3 2 UBS 7,806.00 1994
#4 3 RBS 1,167.34 1994
#6 1 Citi 1,150.00 1995
#7 2 Scotiabank 483.33 1995
#8 3 ING 800.56 1995
#9 4 UniCredit 700.70 1995
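To see why the indexing in the transform step works, it may help to look at the intermediate vectors on a shortened version of the data. The names grp and years below are only illustrative.

```r
Rank    <- c("", "1", "2", "", "1")
YearAmt <- c("1994", "6,605.00", "7,806.00", "1995", "1,150.00")

i1    <- Rank == ""        # TRUE on the rows that only hold a year
grp   <- cumsum(i1)        # 1 1 1 2 2: which year block each row belongs to
years <- YearAmt[i1]       # "1994" "1995": one entry per block

# For every non-header row, look up the year of its block
Year <- years[grp[!i1]]
```

Because grp counts the header rows seen so far, it doubles as an index into years.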
Or as @G.Grothendieck mentioned in the comments, the transform step can be made compact by
res <- transform(df1, Year = YearAmt[i1][cumsum(i1)])[!i1, ]
row.names(res) <- NULL
NOTE: No external packages are needed. Only baseverse..
Or using dtverse/zooverse
library(data.table)
library(zoo)
setDT(df1)[Rank=='', Year:= YearAmt][, Year := na.locf(Year)][Rank!='']
# Rank Arrangers YearAmt Year
#1: 1 JPM 6,605.00 1994
#2: 2 UBS 7,806.00 1994
#3: 3 RBS 1,167.34 1994
#4: 1 Citi 1,150.00 1995
#5: 2 Scotiabank 483.33 1995
#6: 3 ING 800.56 1995
#7: 4 UniCredit 700.70 1995
data
df1 <- structure(list(Rank = c("", "1", "2", "3", "", "1", "2", "3",
"4"), Arrangers = c("", "JPM", "UBS", "RBS", "", "Citi", "Scotiabank",
"ING", "UniCredit"), YearAmt = c("1994", "6,605.00", "7,806.00",
"1,167.34", "1995", "1,150.00", "483.33", "800.56", "700.70")),
.Names = c("Rank",
"Arrangers", "YearAmt"), row.names = c(NA, -9L), class = "data.frame")
A tidyverse option:
library(dplyr)
library(tidyr)
# add Year column, with NAs where no year in row
df %>% mutate(Year = ifelse(Rank == '' & Arrangers == '', YearAmt, NA)) %>%
# fill year downwards
fill(Year) %>%
# chop out year rows
filter(Rank != '', Arrangers != '')
## Rank Arrangers YearAmt Year
## 1 1 JPM 6,605.00 1994
## 2 2 UBS 7,806.00 1994
## 3 3 RBS 1,167.34 1994
## 4 1 Citi 1,150.00 1995
## 5 2 Scotiabank 483.33 1995
## 6 3 ING 800.56 1995
## 7 4 UniCredit 700.70 1995

Creating a long table from a wide table using merged.stack (or reshape)

I have a data frame that looks like this:
ID rd_test_2011 rd_score_2011 mt_test_2011 mt_score_2011 rd_test_2012 rd_score_2012 mt_test_2012 mt_score_2012
1 A 80 XX 100 NA NA BB 45
2 XX 90 NA NA AA 80 XX 80
I want to write a script that would, for IDs that don't have NA's in the yy_test_20xx columns, create a new data frame with the subject taken from the column title, the test name, the test score and year taken from the column title. So, in this example ID 1 would have three entries. Expected output would look like this:
ID Subject Test Score Year
1 rd A 80 2011
1 mt XX 100 2011
1 mt BB 45 2012
2 rd XX 90 2011
2 rd AA 80 2012
2 mt XX 80 2012
I've tried both reshape and various forms of merged.stack which works in the sense that I get an output that is on the road to being right but I can't understand the inputs well enough to get there all the way:
library(splitstackshape)
merged.stack(x, id.vars='id', var.stubs=c("rd_test","mt_test"), sep="_")
I've had more success (gotten closer) with reshape:
y<- reshape(x, idvar="id", ids=1:nrow(x), times=grep("test", names(x), value=TRUE),
timevar="year", varying=list(grep("test", names(x), value=TRUE), grep("score",
names(x), value=TRUE)), direction="long", v.names=c("test", "score"),
new.row.names=NULL)
This will get your data into the right format:
df.long = reshape(df, idvar="ID", ids=1:nrow(df), times=grep("Test", names(df), value=TRUE),
timevar="Year", varying=list(grep("Test", names(df), value=TRUE),
grep("Score", names(df), value=TRUE)), direction="long", v.names=c("Test", "Score"),
new.row.names=NULL)
Then omitting NA:
df.long = df.long[!is.na(df.long$Test),]
Then splitting Year to remove Test_:
df.long$Year = sapply(strsplit(df.long$Year, "_"), `[`, 2)
And ordering by ID:
df.long[order(df.long$ID),]
ID Year Test Score
1 1 2011 A 80
5 1 2012 XX 100
2 2 2011 XX 90
9 2 2013 AA 80
6 3 2012 A 10
3 4 2011 A 50
7 4 2012 XX 60
10 4 2013 AA 99
4 5 2011 C 50
8 5 2012 A 75
Using reshape:
dat.long <- reshape(dat, direction="long", varying=list(c(2, 4,6), c(3, 5,7)),
times=2011:2013,timevar='Year',
sep="_", v.names=c("Test", "Score"))
dat.long[complete.cases(dat.long),]
ID Year Test Score id
1.2011 1 2011 A 80 1
2.2011 2 2011 XX 90 2
4.2011 4 2011 A 50 4
5.2011 5 2011 C 50 5
1.2012 1 2012 XX 100 1
3.2012 3 2012 A 10 3
4.2012 4 2012 XX 60 4
5.2012 5 2012 A 75 5
2.2013 2 2013 AA 80 2
4.2013 4 2013 AA 99 4
Considering your update, I've entirely rewritten this answer. View the history if you want to see the old version.
The main problem is that your data is "double wide" in a way. Thus, you can actually solve your problem by reshaping in the "long" direction twice. Alternatively, use melt and *cast to melt your data in a very long format and convert it to a semi-wide format.
However, I would still suggest "splitstackshape" (and not just because I wrote it). It can handle this problem fine, but it needs you to rearrange the names of your data. The part of the name that will result in the names of the new columns should come first. In your example, that means "test" and "score" should be the first part of the variable name.
For this, we can use some gsub to rearrange the existing names.
library(splitstackshape)
setnames(mydf, gsub("(rd|mt)_(score|test)_(.*)", "\\2_\\1_\\3", names(mydf)))
names(mydf)
# [1] "ID" "test_rd_2011" "score_rd_2011" "test_mt_2011"
# [5] "score_mt_2011" "test_rd_2012" "score_rd_2012" "test_mt_2012"
# [9] "score_mt_2012"
out <- merged.stack(mydf, "ID", var.stubs=c("test", "score"), sep="_")
setnames(out, c(".time_1", ".time_2"), c("Subject", "Year"))
out[complete.cases(out), ]
# ID Subject Year test score
# 1: 1 mt 2011 XX 100
# 2: 1 mt 2012 BB 45
# 3: 1 rd 2011 A 80
# 4: 2 mt 2012 XX 80
# 5: 2 rd 2011 XX 90
# 6: 2 rd 2012 AA 80
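For readers without splitstackshape, a rough base-R sketch of the same result is shown below. It is not the reshape-twice approach mentioned above; instead it pairs each subject_test_year column with its subject_score_year partner by name and stacks the pieces. It assumes every column follows that three-part naming scheme; the names test_cols, pieces and long are illustrative.

```r
mydf <- data.frame(
  ID = 1:2,
  rd_test_2011 = c("A", "XX"),  rd_score_2011 = c(80L, 90L),
  mt_test_2011 = c("XX", NA),   mt_score_2011 = c(100L, NA),
  rd_test_2012 = c(NA, "AA"),   rd_score_2012 = c(NA, 80L),
  mt_test_2012 = c("BB", "XX"), mt_score_2012 = c(45L, 80L),
  stringsAsFactors = FALSE
)

test_cols <- grep("_test_", names(mydf), value = TRUE)

pieces <- lapply(test_cols, function(tc) {
  parts <- strsplit(tc, "_", fixed = TRUE)[[1]]      # c(subject, "test", year)
  sc    <- sub("_test_", "_score_", tc, fixed = TRUE) # matching score column
  data.frame(ID      = mydf$ID,
             Subject = parts[1],
             Test    = mydf[[tc]],
             Score   = mydf[[sc]],
             Year    = as.integer(parts[3]),
             stringsAsFactors = FALSE)
})

long <- do.call(rbind, pieces)
long <- long[!is.na(long$Test), ]          # keep only tests that were taken
long <- long[order(long$ID, long$Year), ]  # group rows by ID, then year
```

The resulting columns are ID, Subject, Test, Score and Year, in the layout requested in the question.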
For the benefit of others, "mydf" in this answer is defined as:
mydf <- structure(list(ID = 1:2, rd_test_2011 = c("A", "XX"),
rd_score_2011 = c(80L, 90L), mt_test_2011 = c("XX", NA),
mt_score_2011 = c(100L, NA), rd_test_2012 = c(NA, "AA"),
rd_score_2012 = c(NA, 80L), mt_test_2012 = c("BB", "XX"),
mt_score_2012 = c(45L, 80L)),
.Names = c("ID", "rd_test_2011", "rd_score_2011", "mt_test_2011",
"mt_score_2011", "rd_test_2012", "rd_score_2012", "mt_test_2012",
"mt_score_2012"), class = "data.frame", row.names = c(NA, -2L))
