Organizing Multidimensional Data in R


I am trying to organize multidimensional data in R. The data is read into R from a CSV file. My data frame looks like this:
Rank Arrangers YearAmt
1994
1 JPM 6,605.00
2 UBS 7,806.00
3 RBS 1,167.34
1995
1 Citi 1,150.00
2 Scotiabank 483.33
3 ING 800.56
4 UniCredit 700.70
This is just toy data; the original dataset is large. I would like to subset the data by year (1994, 1995, etc.) so that I can conduct some analysis. I have tried to subset the data set by factor/level using sapply and subset, but I realized R is just treating 1994 and 1995 as data in a row. I am thinking of reformatting the original CSV file by creating Year as a separate column and then filling in the corresponding year for every row.
I would appreciate any suggestions on how to organize the data in R. I am expecting an output like this:
Rank Arrangers YearAmt Year
1 JPM 6,605.00 1994
2 UBS 7,806.00 1994
3 RBS 1,167.34 1994
1 Citi 1,150.00 1995
2 Scotiabank 483.33 1995
3 ING 800.56 1995
4 UniCredit 700.70 1995

1) ave. Use cumsum(Rank == "") to create a grouping variable for the years. Within each group of rows, ave builds a Year column consisting of NA (for the year row itself) followed by the year repeated for the remaining rows. Finally, na.omit removes the rows containing NA. No packages are used:
na.year <- function(x) c(NA, rep(x[1], length(x) - 1)) # c(NA, x[1], x[1], ..., x[1])
na.omit( transform(df1, Year = ave(YearAmt, cumsum(Rank == ""), FUN = na.year)) )
Using the input df1 reproducibly defined in the answer from @akrun we get:
Rank Arrangers YearAmt Year
2 1 JPM 6,605.00 1994
3 2 UBS 7,806.00 1994
4 3 RBS 1,167.34 1994
6 1 Citi 1,150.00 1995
7 2 Scotiabank 483.33 1995
8 3 ING 800.56 1995
9 4 UniCredit 700.70 1995
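To see why the grouping in option 1 works, it can help to look at the intermediate vectors on their own. The following is a small sketch using the same toy values as bare vectors; the names grp and year are only illustrative and do not appear in the answer:

```r
Rank <- c("", "1", "2", "3", "", "1", "2", "3", "4")
YearAmt <- c("1994", "6,605.00", "7,806.00", "1,167.34",
             "1995", "1,150.00", "483.33", "800.56", "700.70")

# Each blank Rank marks the start of a new year block, so the running
# count of blanks labels every row with its block number.
grp <- cumsum(Rank == "")

# Within each block the first YearAmt entry is the year itself.
year <- ave(YearAmt, grp, FUN = function(x) x[1])

# Side-by-side view of the intermediate columns.
data.frame(Rank, YearAmt, grp, year)
```

Here the year rows still carry their own year; na.year in the answer replaces that first entry with NA so that na.omit can drop those rows.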
2) by. Use by to split df1 into one piece per year, apply addYear to each piece, and then bind the pieces back together. No packages are used.
addYear <- function(x) cbind(x[-1, ], Year = x[1, "YearAmt"])
do.call("rbind", by(df1, cumsum(df1$Rank == ""), addYear))
3) sqldf. Using the sqldf package, we can join each row of df1 with all prior rows of itself that have an empty Rank, taking the maximum YearAmt of those rows as the Year. Then we keep only the rows that have a non-empty Rank.
library(sqldf)
sqldf("select b.*, max(a.YearAmt) Year
from df1 a join df1 b on a.rowid < b.rowid and a.Rank = ''
group by b.rowid
having b.Rank != ''")

We create a logical vector based on the blank elements in 'Rank' ('i1'), then subset the rows of 'df1' by removing all the blank rows using 'i1' (df1[!i1,]) and transform the dataset to create the 'Year' column by replicating the 'YearAmt' (that corresponds to the blank in 'Rank') using the cumulative sum of 'i1'.
i1 <- df1$Rank == ''
res <- transform(df1[!i1,], Year = df1$YearAmt[i1][cumsum(i1)[!i1]])
res
# Rank Arrangers YearAmt Year
#2 1 JPM 6,605.00 1994
#3 2 UBS 7,806.00 1994
#4 3 RBS 1,167.34 1994
#6 1 Citi 1,150.00 1995
#7 2 Scotiabank 483.33 1995
#8 3 ING 800.56 1995
#9 4 UniCredit 700.70 1995
Or, as @G.Grothendieck mentioned in the comments, the transform step can be made more compact:
res <- transform(df1, Year = YearAmt[i1][cumsum(i1)])[!i1, ]
row.names(res) <- NULL
NOTE: No external packages are needed; this uses base R only.
Or using data.table and zoo:
library(data.table)
library(zoo)
setDT(df1)[Rank=='', Year:= YearAmt][, Year := na.locf(Year)][Rank!='']
# Rank Arrangers YearAmt Year
#1: 1 JPM 6,605.00 1994
#2: 2 UBS 7,806.00 1994
#3: 3 RBS 1,167.34 1994
#4: 1 Citi 1,150.00 1995
#5: 2 Scotiabank 483.33 1995
#6: 3 ING 800.56 1995
#7: 4 UniCredit 700.70 1995
data
df1 <- structure(list(Rank = c("", "1", "2", "3", "", "1", "2", "3",
"4"), Arrangers = c("", "JPM", "UBS", "RBS", "", "Citi", "Scotiabank",
"ING", "UniCredit"), YearAmt = c("1994", "6,605.00", "7,806.00",
"1,167.34", "1995", "1,150.00", "483.33", "800.56", "700.70")),
.Names = c("Rank",
"Arrangers", "YearAmt"), row.names = c(NA, -9L), class = "data.frame")

A tidyverse option:
library(dplyr)
library(tidyr)
# add Year column, with NAs where no year in row
df1 %>% mutate(Year = ifelse(Rank == '' & Arrangers == '', YearAmt, NA)) %>%
# fill year downwards
fill(Year) %>%
# chop out year rows
filter(Rank != '', Arrangers != '')
## Rank Arrangers YearAmt Year
## 1 1 JPM 6,605.00 1994
## 2 2 UBS 7,806.00 1994
## 3 3 RBS 1,167.34 1994
## 4 1 Citi 1,150.00 1995
## 5 2 Scotiabank 483.33 1995
## 6 3 ING 800.56 1995
## 7 4 UniCredit 700.70 1995
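Once the Year column exists, the subsetting the question asks about becomes straightforward. This is a minimal base R sketch, assuming the cleaned result is stored in a data frame called res (rebuilt here by hand so the example is self-contained) and that the amounts should be turned into numbers; the Amt column is a hypothetical addition, not part of any answer above:

```r
res <- data.frame(
  Rank = c("1", "2", "3", "1", "2", "3", "4"),
  Arrangers = c("JPM", "UBS", "RBS", "Citi", "Scotiabank", "ING", "UniCredit"),
  YearAmt = c("6,605.00", "7,806.00", "1,167.34",
              "1,150.00", "483.33", "800.56", "700.70"),
  Year = c("1994", "1994", "1994", "1995", "1995", "1995", "1995"),
  stringsAsFactors = FALSE
)

# Strip the thousands separators so the amounts can be used in arithmetic.
res$Amt <- as.numeric(gsub(",", "", res$YearAmt, fixed = TRUE))

# One year at a time ...
res_1994 <- subset(res, Year == "1994")

# ... or all years at once, as a named list of data frames.
by_year <- split(res, res$Year)

# A per-year summary, here the total amount.
totals <- tapply(res$Amt, res$Year, sum)
```

Year is kept as character here because that is how it comes out of the answers above; as.integer(res$Year) would convert it if numeric comparisons are needed.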

Related

How to use a loop to create panel data by subsetting and merging a lot of different data frames in R?

I've looked around but I can't find an answer to this!
I've imported a large number of datasets to R.
Each dataset contains information for a single year (ex. df_2012, df_2013, df_2014 etc).
All the datasets have the same variables/columns (ex. varA_2012 in df_2012 corresponds to varA_2013 in df_2013).
I want to create a df with my id variable and varA_2012, varB_2012, varA_2013, varB_2013, varA_2014, varB_2014 etc
I'm trying to create a loop that helps me extract the few columns that I'm interested in (varA_XXXX, varB_XXXX) in each data frame and then do a full join based on my id var.
I haven't used R in a very long time...
So far, I've tried this:
id <- c("France", "Belgium", "Spain")
varA_2012 <- c(1,2,3)
varB_2012 <- c(7,2,9)
varC_2012 <- c(1,56,0)
varD_2012 <- c(13,55,8)
varA_2013 <- c(34,3,56)
varB_2013 <- c(2,53,5)
varC_2013 <- c(24,3,45)
varD_2013 <- c(27,13,8)
varA_2014 <- c(9,10,5)
varB_2014 <- c(95,30,75)
varC_2014 <- c(99,0,51)
varD_2014 <- c(9,40,1)
df_2012 <-data.frame(id, varA_2012, varB_2012, varC_2012, varD_2012)
df_2013 <-data.frame(id, varA_2013, varB_2013, varC_2013, varD_2013)
df_2014 <-data.frame(id, varA_2014, varB_2014, varC_2014, varD_2014)
year = c(2012:2014)
for(i in 1:length(year)) {
df_[i] <- df_[I][df_[i]$id, df_[i]$varA_[i], df_[i]$varB_[i], ]
list2env(df_[i], .GlobalEnv)
}
panel_df <- Reduce(function(x, y) merge(x, y, by="if"), list(df_2012, df_2013, df_2014))
I know that there are probably loads of errors in here.

Here are a couple of options; however, it's unclear what you want the expected output to look like.
If you want a wide format, then we can use tidyverse to do:
library(tidyverse)
results <-
map(list(df_2012, df_2013, df_2014), function(x)
x %>% dplyr::select(id, starts_with("varA"), starts_with("varB"))) %>%
reduce(., function(x, y)
full_join(x, y, by = "id"))
Output
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75
However, if you need it in a long format, then we could pivot the data:
results %>%
pivot_longer(-id, names_to = c("variable", "year"), names_sep = "_")
Output
id variable year value
<chr> <chr> <chr> <dbl>
1 France varA 2012 1
2 France varB 2012 7
3 France varA 2013 34
4 France varB 2013 2
5 France varA 2014 9
6 France varB 2014 95
7 Belgium varA 2012 2
8 Belgium varB 2012 2
9 Belgium varA 2013 3
10 Belgium varB 2013 53
11 Belgium varA 2014 10
12 Belgium varB 2014 30
13 Spain varA 2012 3
14 Spain varB 2012 9
15 Spain varA 2013 56
16 Spain varB 2013 5
17 Spain varA 2014 5
18 Spain varB 2014 75
Or if using base R for the wide format, then we can do:
results <-
lapply(list(df_2012, df_2013, df_2014), function(x)
subset(x, select = c("id", names(x)[startsWith(names(x), "varA")], names(x)[startsWith(names(x), "varB")])))
results <-
Reduce(function(x, y)
merge(x, y, all = TRUE, by = "id"), results)
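With a large number of yearly data frames, typing every name into list(...) gets tedious. One possible way around that, sketched below in base R, is to build the names from the years and fetch the objects with mget. The helper keep_cols and the reduced toy data are assumptions made for illustration:

```r
id <- c("France", "Belgium", "Spain")
df_2012 <- data.frame(id, varA_2012 = c(1, 2, 3), varB_2012 = c(7, 2, 9),
                      varC_2012 = c(1, 56, 0))
df_2013 <- data.frame(id, varA_2013 = c(34, 3, 56), varB_2013 = c(2, 53, 5),
                      varC_2013 = c(24, 3, 45))
df_2014 <- data.frame(id, varA_2014 = c(9, 10, 5), varB_2014 = c(95, 30, 75),
                      varC_2014 = c(99, 0, 51))

years <- 2012:2014

# Look up df_2012, df_2013, ... by name and collect them in a list.
frames <- mget(paste0("df_", years), envir = environment())

# Keep only the id column and the varA_/varB_ columns of one data frame.
keep_cols <- function(d) {
  d[, grepl("^(id$|varA_|varB_)", names(d)), drop = FALSE]
}

# Full join of all the trimmed data frames on id.
panel_df <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE),
                   lapply(frames, keep_cols))
```

Extending years is then the only change needed when more data frames are imported, as long as they follow the same df_YYYY naming pattern.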

From your initial for loop attempt, it seems the code below may help
> (df <- Reduce(merge, list(df_2012, df_2013, df_2014)))[grepl("^(id|var(A|B))",names(df))]
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75

Euclidean distance for distinct classes of factors iterated by groups

*Update: The answer suggested by Rui is great and works as it should. However, when I run it on about 7 million observations (my actual dataset), R gets stuck in a computational block (I'm using a machine with 64gb of RAM). Any other solutions are greatly appreciated!
I have a dataframe of patents consisting of the firms, application years, patent numbers, and patent classes. I want to calculate the Euclidean distance between consecutive years for each firm based on patent classes according to the following formula:
$$\text{El\_Dist}_t = \sqrt{\frac{\sum_i (X_i - Y_i)^2}{\left(\sum_i X_i^2\right)\left(\sum_i Y_i^2\right)}}$$
where X_i represents the number of patents belonging to class i in year t, and Y_i represents the number of patents belonging to class i in the previous year (t-1).
To further illustrate this, consider the following dataset:
df <- data.table(Firm = rep(c(LETTERS[1:2]),each=6), Year = rep(c(1990,1990,1991,1992,1992,1993),2),
Patent_Number = sample(184785:194785,12,replace = FALSE),
Patent_Class = c(12,5,31,12,31,6,15,15,15,3,3,1))
> df
Firm Year Patent_Number Patent_Class
1: A 1990 192473 12
2: A 1990 193702 5
3: A 1991 191889 31
4: A 1992 193341 12
5: A 1992 189512 31
6: A 1993 185582 6
7: B 1990 190838 15
8: B 1990 189322 15
9: B 1991 190620 15
10: B 1992 193443 3
11: B 1992 189937 3
12: B 1993 194146 1
Since year 1990 is the beginning year for Firm A, there is no Euclidean distance for that year (NAs should be produced). Moving forward to year 1991, the distinct classes for this year (1991) and the previous year (1990) are 31, 5, and 12. Therefore, the above formula is summed over these three distinct classes (there are three distinct i's). So the formula's output will be:
$$\sqrt{\frac{(1-0)^2 + (0-1)^2 + (0-1)^2}{(1^2 + 0^2 + 0^2)(0^2 + 1^2 + 1^2)}} = \sqrt{\frac{3}{2}} \approx 1.2247$$
Following the same calculation and iterating over firms, the final output should be:
> df
Firm Year Patent_Number Patent_Class El_Dist
1: A 1990 192473 12 NA
2: A 1990 193702 5 NA
3: A 1991 191889 31 1.2247450
4: A 1992 193341 12 0.7071068
5: A 1992 189512 31 0.7071068
6: A 1993 185582 6 1.2247450
7: B 1990 190838 15 NA
8: B 1990 189322 15 NA
9: B 1991 190620 15 0.5000000
10: B 1992 193443 3 1.1180340
11: B 1992 189937 3 1.1180340
12: B 1993 194146 1 1.1180340
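The values in the El_Dist column can be checked by hand. The sketch below assumes the distance is the square root of the summed squared differences divided by the product of the two sums of squares, which is what the expected output implies; el_dist is a hypothetical helper name:

```r
# x: class counts in year t, y: counts for the same classes in year t-1.
el_dist <- function(x, y) {
  sqrt(sum((x - y)^2) / (sum(x^2) * sum(y^2)))
}

# Firm A, 1991 vs 1990, classes in the order 31, 5, 12.
x_1991 <- c(1, 0, 0)
y_1990 <- c(0, 1, 1)
a_1991 <- el_dist(x_1991, y_1990)   # 1.224745

# Firm B, 1991 vs 1990, single class 15 (one patent vs two).
b_1991 <- el_dist(1, 2)             # 0.5
```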
I'm preferably looking for a data.table solution for speed purposes.
Thank you very much in advance for any help.

I believe that the function below does what the question asks for, but the results for Firm == "B" are not equal to the question's.
fEl_Dist <- function(X){
Year <- X[["Year"]]
PatentClass <- X[["Patent_Class"]]
sapply(seq_along(Year), function(i){
j <- which(Year %in% (Year[i] - 1:0))
tbl <- table(Year[j], PatentClass[j])
if(NROW(tbl) == 1){
NA_real_
} else {
numer <- sum((tbl[2, ] - tbl[1, ])^2)
denom <- sum(tbl[2, ]^2)*sum(tbl[1, ]^2)
sqrt(numer/denom)
}
})
}
setDT(df)[, El_Dist := fEl_Dist(.SD),
by = .(Firm),
.SDcols = c("Year", "Patent_Class")]
head(df)
# Firm Year Patent_Number Patent_Class El_Dist
#1: A 1990 190948 12 NA
#2: A 1990 186156 5 NA
#3: A 1991 190801 31 1.2247449
#4: A 1992 185226 12 0.7071068
#5: A 1992 185900 31 0.7071068
#6: A 1993 186928 6 1.2247449
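Because the function above rebuilds a table for every row, it may become slow on millions of observations. One possible alternative, sketched here with data.table, aggregates the counts once and uses the identity sum((X - Y)^2) = sum(X^2) + sum(Y^2) - 2 * sum(X * Y), so that only joins and grouped sums are needed. The object names (dt, cnt, sq, cross, res, out) are illustrative, Patent_Number is left out because it does not enter the formula, and the approach has not been benchmarked on the 7 million rows mentioned in the question:

```r
library(data.table)

dt <- data.table(
  Firm = rep(c("A", "B"), each = 6),
  Year = rep(c(1990, 1990, 1991, 1992, 1992, 1993), 2),
  Patent_Class = c(12, 5, 31, 12, 31, 6, 15, 15, 15, 3, 3, 1)
)

# Patents per firm, year and class (the X_i of the formula).
cnt <- dt[, .N, by = .(Firm, Year, Patent_Class)]

# Sum of squared counts per firm and year.
sq <- cnt[, .(ss = sum(N^2)), by = .(Firm, Year)]

# Shift the year forward by one so that last year's values line up
# with the current year in a join.
cnt_prev <- copy(cnt)[, Year := Year + 1]
sq_prev <- copy(sq)[, Year := Year + 1]
setnames(sq_prev, "ss", "ss_prev")

# Cross term sum(X_i * Y_i), over classes present in both years.
cross <- merge(cnt, cnt_prev, by = c("Firm", "Year", "Patent_Class"))
cross <- cross[, .(xy = sum(as.numeric(N.x) * N.y)), by = .(Firm, Year)]

# Keep only firm-years whose previous year exists, then apply the identity.
res <- merge(sq, sq_prev, by = c("Firm", "Year"))
res <- merge(res, cross, by = c("Firm", "Year"), all.x = TRUE)
res[is.na(xy), xy := 0]
res[, El_Dist := sqrt((ss + ss_prev - 2 * xy) / (ss * ss_prev))]

# Attach the distance to every patent row; first years stay NA.
out <- merge(dt, res[, .(Firm, Year, El_Dist)],
             by = c("Firm", "Year"), all.x = TRUE)
```

As in the function above, "previous year" means Year - 1; a firm with a gap in its years gets NA for the year after the gap.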

How to create a loop for sum calculations which then are inserted into a new row?

I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I have used. If I have missed something, please accept my apologies.
Here is an excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is add another row with UN_$sector = Residual. The value of the residual will be (UN_$sector = Total) - (the sum of column UN for the sectors c("1", "2", "3", "4", "5")) for a given year AND country.
This is how it should look:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
As I don't want to write many, many lines of code I'm looking for a way to automate this. I was told about loops, but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA

Consider calculating residual first and then stack it with other pieces of data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector!='Total'), sum),
aggregate(UN ~ country + year, data = subset(df, sector=='Total'), sum),
by=c("country", "year")),
{UN <- UN.y - UN.x
sector = 'Residual'})
# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector!='Total'),
agg[c("country", "year", "sector", "UN")],
subset(df, sector=='Total'))
# ORDER ROWS AND RESET ROWNAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)),])
row.names(final_df) <- NULL
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446

I think there are multiple ways you can do this. What I would recommend is to take advantage of the tidyverse suite of packages, which includes dplyr.
Without getting too far into what dplyr and the tidyverse can achieve, we can use dplyr's group_by(...), summarise(...), arrange(...) and bind_rows(...) functions. There are also plenty of great tutorials, cheat sheets, and documentation on all tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame that contains all of the Residual values and then bind it back onto your original data frame.
Step 1: Calculating all residual values
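We want the residual for each country and year: the Total value minus the sum of UN over the numbered sectors. We can compute it like this:
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == 'Total'], na.rm = TRUE) - sum(UN[sector != 'Total'], na.rm = TRUE)) %>% ungroup()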
Step 2: Add sector column to res_UN with value 'residual'
This should yield a data frame that contains country, year, and UN. We now need to add a sector column with the value 'Residual' to satisfy your specifications.
res_UN$sector = 'Residual'
Step 3 : Add res_UN back to UN_ and order accordingly
res_UN and UN_ now have the same columns and they can now be added back together.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together should answer your question, and it can be achieved in a couple of lines!
TLDR:
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == 'Total'], na.rm = TRUE) - sum(UN[sector != 'Total'], na.rm = TRUE)) %>% ungroup()
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
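Since the question also asks how a loop would approach this, here is a minimal base R sketch of the explicit version, shown only to illustrate the concept. It assumes sector is stored as character and uses the two AT years from the question as toy data; groups, residual_rows and with_residual are hypothetical names:

```r
UN_ <- data.frame(
  country = rep("AT", 12),
  year = rep(c(1990, 1991), each = 6),
  sector = rep(c("1", "2", "3", "4", "5", "Total"), 2),
  UN = c(1.407555, 1.037137, 4.769618, 2.455139, 2.238618, 7.869005,
         1.484667, 1.001578, 4.625927, 2.515453, 2.702081, 8.249567),
  stringsAsFactors = FALSE
)

# One row per country/year combination to loop over.
groups <- unique(UN_[c("country", "year")])
residual_rows <- vector("list", nrow(groups))

for (k in seq_len(nrow(groups))) {
  in_group <- UN_$country == groups$country[k] & UN_$year == groups$year[k]
  is_total <- UN_$sector == "Total"
  # Residual = Total minus the sum of the numbered sectors.
  residual_rows[[k]] <- data.frame(
    country = groups$country[k],
    year = groups$year[k],
    sector = "Residual",
    UN = sum(UN_$UN[in_group & is_total]) - sum(UN_$UN[in_group & !is_total]),
    stringsAsFactors = FALSE
  )
}

# Append the new rows and sort so Residual sits just before Total.
with_residual <- rbind(UN_, do.call(rbind, residual_rows))
with_residual <- with_residual[order(with_residual$country,
                                     with_residual$year,
                                     with_residual$sector), ]
row.names(with_residual) <- NULL
```

The loop body runs once per country/year pair, which is the same grouping that aggregate or group_by performs without the bookkeeping.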

Add stringA or stringB to values in column based on condition

I have a data table "dates" such as:
dates <- data.frame(date1=c("2015","1998","2000","1991"),
date2=c("98","00","18","92"))
dates <- mutate_if(dates,is.factor,as.character)
where the values in "dates" are of class character.
I want to make "date2" a 4-digit number, using the following rule:
If "date2" starts with 9, prepend 19 to the value.
If "date2" starts with anything else, prepend 20.
I have done a lot of research, but I cannot find how to prepend a string to an existing string based on a condition.
Afterthought: how can we deal with NA values so that "19" or "20" is not prepended to them?

A regex-free alternative:
d2int <- as.integer(dates$date2)
dates[["date2n"]] <- as.character(d2int + ifelse(d2int > 18, 1900, 2000))
dates
date1 date2 date2n
1 2015 98 1998
2 1998 00 2000
3 2000 18 2018
4 1991 92 1992
5 2015 89 1989
6 1998 18 2018
7 2000 19 1919
8 1991 NA <NA>
Where:
dates <- data.frame(
date1=c("2015","1998","2000","1991"),
date2=c("98","00","18","92", "89", "18", "19", "NA"),
stringsAsFactors = FALSE
)

You can use lubridate and try something like:
Input:
dates <- data.frame(date1=c("2015","1998","2000","1991", "1991", "1991"),
date2=c("98","00","18","92", "88", NA),
stringsAsFactors = FALSE)
use:
library(dplyr)
library(lubridate)
dates %>%
mutate(date2 = as.integer(date2)) %>%
mutate(date3 = if_else(date2+2000 > year(today()), date2+1900, date2+2000))
which gives:
date1 date2 date3
1 2015 98 1998
2 1998 0 2000
3 2000 18 2018
4 1991 92 1992
5 1991 88 1988
6 1991 NA NA
p.s. added two rows to the input data to show how this handles NA values
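Both answers above go through integers. If the literal rule from the question is wanted (prefix "19" when the value starts with 9, otherwise "20", and leave NA alone), a string-based version can be sketched in base R as follows; prefix and date2_full are illustrative names:

```r
date2 <- c("98", "00", "18", "92", NA)

# Choose the century from the first character; NA inputs give NA here.
prefix <- ifelse(startsWith(date2, "9"), "19", "20")

# Paste prefix and value together, but keep missing values missing
# instead of producing the string "NANA".
date2_full <- ifelse(is.na(date2), NA_character_, paste0(prefix, date2))

date2_full
# "1998" "2000" "2018" "1992" NA
```

The explicit is.na check matters because paste0 turns NA into the text "NA" rather than propagating the missing value.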

R Cleaning and reordering names/serial numbers in data frame

Let's say I have a data frame as follows in R:
Data <- data.frame("SerialNum" = character(), "Year" = integer(), "Name" = character(), stringsAsFactors = F)
Data[1,] <- c("983\n837\n424\n ", 2015, "Michael\nLewis\nPaul\n ")
Data[2,] <- c("123\n456\n789\n136", 2014, "Elaine\nJerry\nGeorge\nKramer")
Data[3,] <- c("987\n654\n321\n975\n ", 2010, "John\nPaul\nGeorge\nRingo\nNA")
Data[4,] <- c("424\n983\n837", 2015, "Paul\nMichael\nLewis")
Data[5,] <- c("456\n789\n123\n136", 2014, "Jerry\nGeorge\nElaine\nKramer")
What I want to do is the following:
Split up each string of names and each string of serial numbers so that they are their own vectors (or a list of string vectors).
Eliminate any character "NA" in either set of vectors or any blank spaces denoted by "...\n ".
Reorder each list of names alphabetically and reorder the corresponding serial numbers according to the same permutation.
Concatenate each vector in the same fashion it was originally (I usually do this with paste(., collapse = "\n")).
My issue is how to do this without using a for loop. What is an object-oriented way to do this? As a first attempt in this direction I originally made a list by the command LIST <- strsplit(Data$Name, split = "\n") and from here I need a for loop in order to find the permutations of the names, which seems like a process that won't scale according to my actual data. Additionally, once I make the list LIST I'm not sure how I go about removing NA symbols or blank spaces. Any help is appreciated!

Using lapply I take each row of the data frame and turn it into a new data frame with one name per row. This creates a list of 5 data frames, one for each row of the original data frame.
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Year=Data[i,"Year"],
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
})
UPDATE: Based on your comment, let me know if this is the result you're trying to achieve:
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
# Collapse back into a single row with the new sort order
dat = data.frame(SerialNum=paste(dat[, "SerialNum"], collapse="\n"),
Year=Data[i, "Year"],
Name=paste(dat[, "Name"], collapse="\n"))
})
do.call(rbind, seinfeld)
SerialNum Year Name
1 837\n983\n424 2015 Lewis\nMichael\nPaul
2 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
3 321\n987\n654\n975 2010 George\nJohn\nPaul\nRingo
4 837\n983\n424 2015 Lewis\nMichael\nPaul
5 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
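The same split, filter, order and paste steps can also be wrapped in a small helper and applied to both columns at once with mapply, which avoids building a data frame per row. This is a sketch under the assumption that each row's SerialNum and Name strings split into the same number of pieces; reorder_pair is a hypothetical name, and only two rows of the example data are used:

```r
Data <- data.frame(
  SerialNum = c("983\n837\n424\n ", "987\n654\n321\n975\n "),
  Year = c(2015, 2010),
  Name = c("Michael\nLewis\nPaul\n ", "John\nPaul\nGeorge\nRingo\nNA"),
  stringsAsFactors = FALSE
)

reorder_pair <- function(serial, name) {
  s <- strsplit(serial, "\n", fixed = TRUE)[[1]]
  n <- strsplit(name, "\n", fixed = TRUE)[[1]]
  # Drop blank entries and the literal string "NA".
  keep <- !(trimws(n) %in% c("", "NA"))
  s <- s[keep]
  n <- n[keep]
  # Sort names alphabetically and apply the same permutation to serials.
  idx <- order(n)
  c(SerialNum = paste(s[idx], collapse = "\n"),
    Name = paste(n[idx], collapse = "\n"))
}

# mapply returns a 2-row character matrix: one column per input row.
fixed <- mapply(reorder_pair, Data$SerialNum, Data$Name, USE.NAMES = FALSE)
Data$SerialNum <- fixed["SerialNum", ]
Data$Name <- fixed["Name", ]
```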

eipi10 offered a great answer. In addition to that, here is what I tried, mainly with data.table. First, I split the two columns (SerialNum and Name) with cSplit(), added an index with add_rownames(), and split the data by that index. In the first lapply(), I used Stacked() from the splitstackshape package to stack SerialNum and Name; the separated SerialNum and Name values become two columns, as you can see in the excerpt of temp2. In the second lapply(), I used merge from the data.table package. Then I removed rows with NAs (lapply(na.omit)), combined all data tables (rbindlist), and ordered the rows by rowname (the row number in the original data) and Name (setorder(rowname, Name)).
library(data.table)
library(splitstackshape)
library(dplyr)
cSplit(mydf, c("SerialNum", "Name"), direction = "wide",
type.convert = FALSE, sep = "\n") %>%
add_rownames %>%
split(f = .$rowname) -> temp
#a part of temp
#$`1`
#Source: local data frame [1 x 12]
#
#rowname Year SerialNum_1 SerialNum_2 SerialNum_3 SerialNum_4 SerialNum_5 Name_1 Name_2
#(chr) (dbl) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
#1 1 2015 983 837 424 NA NA Michael Lewis
#Variables not shown: Name_3 (chr), Name_4 (chr), Name_5 (chr)
lapply(temp, function(x){
Stacked(x, var.stubs = c("SerialNum", "Name"), sep = "_")
}) -> temp2
# A part of temp2
#$`1`
#$`1`$SerialNum
# rowname Year .time_1 SerialNum
#1: 1 2015 1 983
#2: 1 2015 2 837
#3: 1 2015 3 424
#4: 1 2015 4 NA
#5: 1 2015 5 NA
#
#$`1`$Name
# rowname Year .time_1 Name
#1: 1 2015 1 Michael
#2: 1 2015 2 Lewis
#3: 1 2015 3 Paul
#4: 1 2015 4 NA
#5: 1 2015 5 NA
lapply(1:nrow(mydf), function(x){
merge(temp2[[x]]$SerialNum, temp2[[x]]$Name, by = c("rowname", "Year", ".time_1"))
}) %>%
lapply(na.omit) %>%
rbindlist %>%
setorder(rowname, Name) -> out
print(out)
# rowname Year .time_1 SerialNum Name
# 1: 1 2015 2 837 Lewis
# 2: 1 2015 1 983 Michael
# 3: 1 2015 3 424 Paul
# 4: 2 2014 1 123 Elaine
# 5: 2 2014 3 789 George
# 6: 2 2014 2 456 Jerry
# 7: 2 2014 4 136 Kramer
# 8: 3 2010 3 321 George
# 9: 3 2010 1 987 John
#10: 3 2010 2 654 Paul
#11: 3 2010 4 975 Ringo
#12: 4 2015 3 837 Lewis
#13: 4 2015 2 983 Michael
#14: 4 2015 1 424 Paul
#15: 5 2014 3 123 Elaine
#16: 5 2014 2 789 George
#17: 5 2014 1 456 Jerry
#18: 5 2014 4 136 Kramer
DATA
mydf <- structure(list(SerialNum = c("983\n837\n424\n ", "123\n456\n789\n136",
"987\n654\n321\n975\n ", "424\n983\n837", "456\n789\n123\n136"
), Year = c(2015, 2014, 2010, 2015, 2014), Name = c("Michael\nLewis\nPaul\n ",
"Elaine\nJerry\nGeorge\nKramer", "John\nPaul\nGeorge\nRingo\nNA",
"Paul\nMichael\nLewis", "Jerry\nGeorge\nElaine\nKramer")), .Names = c("SerialNum",
"Year", "Name"), row.names = c(NA, -5L), class = "data.frame")
