Updating a data frame's columns based on other columns - r

I have a data frame containing a person's stage, as follows (this is only a sample of a very large one):
df = structure(list(DeceasedDate = c(0.283219178082192, 1.12678843226788,
2.02865296803653, 0.892465753424658, NA, 0.88013698630137, NA
), LastClinicalEventMonthEnd = c(0.244862981988838, 1.03637744165398,
10.9464611555048, 0.763698598427194, 3.35011412354135, 0.677397228564181,
3.83687211440893), FirstYStage = c("N/A", "2", "2", "2", "2",
"2", "3.1"), SecondYStage = c("N/A", "N/A", "2", "N/A", "2",
"N/A", "3.1"), ThirdYStage = c("N/A", "N/A", "2", "N/A", "2",
"N/A", "3.1"), FourthYStage = c("N/A", "N/A", "N/A", "N/A", "2",
"N/A", "3.1"), FifthYStage = c("N/A", "N/A", "N/A", "N/A", "N/A",
"N/A", "N/A")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-7L))
The 5 right hand columns are a stage of a person, but do not contain all the information yet. I need to include the information in the first two columns, in which the numbers are in years, as follows:
if the value in column 1 is smaller than a year, FirstYStage should be "Deceased", and also all the next columns (the person is still dead...); if the value is between 1 and 2, it SecondYStage should be "Deceased", and so on.
if the value in column 2 is smaller than a year, SecondYStage should be "EndOfEvents"; if the value is between 1 and 2, it SecondYStage should be "EndOfEvents", and so on.
So the expected output in this case should be:
df_updated = structure(list(DeceasedDate = c(0.283219178082192,
1.12678843226788,
2.02865296803653, 0.892465753424658, NA, 0.88013698630137, NA
), LastClinicalEventMonthEnd = c(0.244862981988838, 1.03637744165398,
10.9464611555048, 0.763698598427194, 3.35011412354135, 0.677397228564181,
3.83687211440893), FirstYStage = c("Deceased", "2", "2", "Deceased",
"2", "Deceased", "3.1"), SecondYStage = c("Deceased", "Deceased",
"2", "Deceased", "2", "Deceased", "3.1"), ThirdYStage = c("Deceased",
"Deceased", "Deceased", "Deceased", "2", "Deceased", "3.1"),
FourthYStage = c("Deceased", "Deceased", "Deceased", "Deceased",
"2", "Deceased", "3.1"), FifthYStage = c("Deceased", "Deceased",
"Deceased", "Deceased", "LastEvent", "Deceased", "LastEvent"
)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"
))
One important point is that "Death" should be given priority, in other words, if there is a clash and on the one hand there is a number and "death" is contradicting it, we should prefer death.
How do I do this in the most efficient way? At the moment I am doing if's but I think it is not the best course of action

This is what I would do:
Reshape from wide to long format
Compute years from column names
Selectively update the value column
Reshape back to wide format
data.table
As I am more fluent in data.table than in dplyr here is the approach implemented in data.table syntax. (Apologies but I will add a dplyr solution if time permits.)
library(data.table)
long <- melt(setDT(df)[, rn := .I], measure.vars = patterns("Stage$"))
long[, year := as.integer(variable)] # column index
long[floor(DeceasedDate) < year, value := "Deceased"]
long[is.na(DeceasedDate) & floor(LastClinicalEventMonthEnd) + 1 < year, value := "EndOfEvents"]
dcast(long, rn + DeceasedDate + LastClinicalEventMonthEnd ~ variable)
rn DeceasedDate LastClinicalEventMonthEnd FirstYStage SecondYStage ThirdYStage FourthYStage FifthYStage
1: 1 0.2832192 0.2448630 Deceased Deceased Deceased Deceased Deceased
2: 2 1.1267884 1.0363774 2 Deceased Deceased Deceased Deceased
3: 3 2.0286530 10.9464612 2 2 Deceased Deceased Deceased
4: 4 0.8924658 0.7636986 Deceased Deceased Deceased Deceased Deceased
5: 5 NA 3.3501141 2 2 2 2 EndOfEvents
6: 6 0.8801370 0.6773972 Deceased Deceased Deceased Deceased Deceased
7: 7 NA 3.8368721 3.1 3.1 3.1 3.1 EndOfEvents
dplyr / tidyr
As promised, here is also a dplyr/tidyr implemention of the same approach:
library(tidyr)
library(dplyr)
df %>%
mutate(rn = row_number()) %>%
gather(key, val, ends_with("Stage"), factor_key = TRUE) %>%
mutate(year = as.integer(key)) %>%
mutate(val = if_else(!is.na(DeceasedDate) & floor(DeceasedDate) < year, "Deceased", val)) %>%
mutate(val = if_else(is.na(DeceasedDate) & floor(LastClinicalEventMonthEnd) + 1 < year, "EndOfEvents", val)) %>%
select(-year) %>%
spread(key, val) %>%
arrange(rn)
DeceasedDate LastClinicalEventMonthEnd rn FirstYStage SecondYStage ThirdYStage FourthYStage FifthYStage
1 0.2832192 0.2448630 1 Deceased Deceased Deceased Deceased Deceased
2 1.1267884 1.0363774 2 2 Deceased Deceased Deceased Deceased
3 2.0286530 10.9464612 3 2 2 Deceased Deceased Deceased
4 0.8924658 0.7636986 4 Deceased Deceased Deceased Deceased Deceased
5 NA 3.3501141 5 2 2 2 2 EndOfEvents
6 0.8801370 0.6773972 6 Deceased Deceased Deceased Deceased Deceased
7 NA 3.8368721 7 3.1 3.1 3.1 3.1 EndOfEvents
or without creating a year column:
df %>%
mutate(rn = row_number()) %>%
gather(key, val, ends_with("Stage"), factor_key = TRUE) %>%
mutate(val = if_else(!is.na(DeceasedDate) & floor(DeceasedDate) < as.integer(key),
"Deceased", val)) %>%
mutate(val = if_else(is.na(DeceasedDate) & floor(LastClinicalEventMonthEnd) + 1 < as.integer(key),
"EndOfEvents", val)) %>%
spread(key, val) %>%
arrange(rn)

Related

How to find the earliest date across multiple columns in R (Issue with NAs)

I have 3 date columns (class-date) and I want to create a new column that will have the earliest of the 3 dates. This is the code I used below:
df1 <- df %>% mutate(timeout= pmin(date1, date2, end_date))
In the case that date1 and date2 are NAs, then I would like the date in end_date to be returned in the timeout column and therefore timeout should not have any NAs. The code above is bringing back NAs. Any assistance will be greatly appreciated.
You can add na.rm = TRUE, then it will ignore the NAs in each row when calculating pmin.
library(dplyr)
df %>%
mutate(timeout = pmin(date1, date2, end_date, na.rm = TRUE))
Output
id date1 date2 end_date timeout
1 1 <NA> <NA> 2008-01-23 2008-01-23
2 1 2007-10-16 2007-11-01 2008-01-23 2007-10-16
3 2 2007-11-30 2007-11-30 2007-11-30 2007-11-30
4 3 2007-08-17 2007-12-17 2008-12-12 2007-08-17
5 3 2008-11-12 2008-12-12 2008-12-12 2008-11-12
Data
df <- structure(list(id = c(1L, 1L, 2L, 3L, 3L), date1 = structure(c(NA,
13802, 13847, 13742, 14195), class = "Date"), date2 = structure(c(NA,
13818, 13847, 13864, 14225), class = "Date"), end_date = c("2008-01-23",
"2008-01-23", "2007-11-30", "2008-12-12", "2008-12-12")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))

How can I reshape multiple columns of long data to wide?

structure(tibble(c("top", "jng", "mid", "bot", "sup"), c("369", "Karsa", "knight", "JackeyLove", "yuyanjia"),
c("Malphite", "Rek'Sai", "Zoe", "Aphelios", "Braum"), c("1", "1", "1", "1", "1"), c("7", "5", "7", "5", "0"),
c("6079-7578", "6079-7578", "6079-7578", "6079-7578", "6079-7578")), .Names = c("position", "player", "champion", "result", "kills", "gameid"))
Output:
# A tibble: 5 x 6
position player champion result kills gameid
* <chr> <chr> <chr> <chr> <chr> <chr>
1 top 369 Malphite 1 7 6079-7578
2 jng Karsa Rek'Sai 1 5 6079-7578
3 mid knight Zoe 1 7 6079-7578
4 bot JackeyLove Aphelios 1 5 6079-7578
5 sup yuyanjia Braum 1 0 6079-7578
My desired output would be:
structure(list(gameid = "6079-7578", result = "1", player_top = "369",
player_jng = "Karsa", player_mid = "knight", player_bot = "JackeyLove",
player_sup = "yuyanjia", champion_top = "Malphite", champion_jng = "Rek'Sai",
champion_mid = "Zoe", champion_bot = "Aphelios", champion_sup = "Braum",
kills_top = "7", kills_jng = "5", kills_mid = "7", kills_bot = "5",
kills_sup = "0"), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"))
which looks like this:
gameid result player_top player_jng player_mid player_bot player_sup champion_top champion_jng champion_mid champion_bot champion_sup
1 6079-7578 1 369 Karsa knight JackeyLove yuyanjia Malphite RekSai Zoe Aphelios Braum
kills_top kills_jng kills_mid kills_bot kills_sup
1 7 5 7 5 0
I know I should use pivot_wider() and something like drop_na, but I don't know how to do pivot_wider() with mutiple columns and collapse the rows at the same time. Any help would be appreciated.
You can use pivot_wider() for this, defining the "position" variable as the variable that the new column names come from in names_from and the three variables with values you want to use to fill those columns with as values_from.
By default the multiple values_from variables are pasted on to the front of new columns names. This can be changed, but in this case that matches the naming structure you want.
All other variables in the original dataset will be used as the id_cols in the order that they appear.
library(tidyr)
pivot_wider(dat,
names_from = "position",
values_from = c("player", "champion", "kills"))
#> result gameid player_top player_jng player_mid player_bot player_sup
#> 1 1 6079-7578 369 Karsa knight JackeyLove yuyanjia
#> champion_top champion_jng champion_mid champion_bot champion_sup kills_top
#> 1 Malphite Rek'Sai Zoe Aphelios Braum 7
#> kills_jng kills_mid kills_bot kills_sup
#> 1 5 7 5 0
You can control the order of your id columns in the output by explicitly writing them out via id_cols. Here's an example, matching your desired output.
pivot_wider(dat, id_cols = c("gameid", "result"),
names_from = "position",
values_from = c("player", "champion", "kills"))
#> gameid result player_top player_jng player_mid player_bot player_sup
#> 1 6079-7578 1 369 Karsa knight JackeyLove yuyanjia
#> champion_top champion_jng champion_mid champion_bot champion_sup kills_top
#> 1 Malphite Rek'Sai Zoe Aphelios Braum 7
#> kills_jng kills_mid kills_bot kills_sup
#> 1 5 7 5 0
Created on 2021-06-24 by the reprex package (v2.0.0)
Using data.table might help here. In dcast() each row will be identified by a unique combo of gameid and result, the columns will be spread by position, and filled with values from the variables listed in value.var.
library(data.table)
library(dplyr)
df <- structure(tibble(c("top", "jng", "mid", "bot", "sup"), c("369", "Karsa", "knight", "JackeyLove", "yuyanjia"),
c("Malphite", "Rek'Sai", "Zoe", "Aphelios", "Braum"), c("1", "1", "1", "1", "1"), c("7", "5", "7", "5", "0"),
c("6079-7578", "6079-7578", "6079-7578", "6079-7578", "6079-7578")), .Names = c("position", "player", "champion", "result", "kills", "gameid"))
df2 <- dcast(setDT(df), gameid + result~position, value.var = list('player','champion','kills'))

Sum all cells to the right of a column in each row using Dplyr

So I've seen many pages on the generalized version of this issue but here specifically I would like to sum all values in a row after a specific column.
Let's say we have this df:
id city identity q1 q2 q3
0110 detroit ella 2 4 3
0111 boston fitz 0 0 0
0112 philly gerald 3 1 0
0113 new_york doowop 8 11 2
0114 ontario wazaaa NA 11 NA
Now the df's I work with aren't usually with 3 "q" variables, they vary. Hence, I would like to rowSum every row but only sum the rows that are after the column identity.
Rows with NA are to be ignored.
Eventually I would like to take the rows which sum to 0 to be removed and end with a df that looks like this:
id city identity q1 q2 q3
0110 detroit ella 2 4 3
0112 philly gerald 3 1 0
0113 new_york doowop 8 11 2
Doing this in dplyr is the preference but not required.
EDIT:
I have added below the data of which this solution is not working for, apologies for the confusion.
df <- structure(list(Program = c("3002", "111", "2455", "2929", "NA",
"NA", NA), Project_ID = c("299", "11", "271", "780", "207", "222",
NA), Advance_Identifier = c(14, 24, 12, 15, NA, 11, NA), Sequence = c(6,
4, 4, 5, 2, 3, 79), Item = c("payment", "hero", "prepayment_2",
"UPS", "period", "prepayment", "yeet"), q1 = c("500", "12", "-1",
"0", NA, "0", "0"), q2 = c("500", "12", "-1", "0", NA, "0", "1"
), q3 = c("500", "12", "2", "0", NA, "0", "2"), q4 = c("500",
"13", "0", "0", NA, "0", "3")), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
Base R version with zero extra dependencies:
[Edit: I always forget rowSums exists]
> df1$new = rowSums(
df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
)
> df1
id city identity q1 q2 q3 new
1 110 detroit ella 2 4 3 9
2 111 boston fitz 0 0 0 0
3 112 philly gerald 3 1 0 4
4 113 new_york doowop 8 11 2 21
If you need to convert chars to numbers, use apply with as.numeric:
df$new = apply(df[,(1+which(names(df)=="Item")):ncol(df),drop=FALSE], 1, function(col){sum(as.numeric(col))})
BUT look out if they are really factors because this will fail, which is why converting things that look like numbers to numbers before you do anything else is a Good Thing.
Benchmark
In case you are worried about speed here's a benchmark test of my function against the currently accepted solution:
akrun = function(df1){df1 %>%
mutate(new = rowSums(select(., ((match('identity', names(.)) +
1):ncol(.))), na.rm = TRUE))}
baz = function(df1){rowSums(
df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
)}
sample data
df = data.frame(id=sample(100,100), city=sample(LETTERS,100,TRUE), identity=sample(letters,100,TRUE), q1=runif(100), q2=runif(100),q3=runif(100))
Test - note I remove the new column from the source data frame each time otherwise the code keeps adding one of those into it (although akrun doesn't modify df in place it can get run after baz has modified it by assigning it the new column in the benchmark code).
> microbenchmark({df$new=NULL;df2 = akrun(df)},{df$new=NULL;df$new=baz(df)})
Unit: microseconds
expr min lq mean
{ df$new = NULL df2 = akrun(df) } 1300.682 1328.941 1396.63477
{ df$new = NULL df$new = baz(df) } 63.102 72.721 87.78668
median uq max neval
1376.9425 1398.5880 2075.894 100
84.3655 86.7005 685.594 100
The tidyverse version takes 16 times as long as the base R version.
We can use
out <- df1 %>%
mutate(new = rowSums(select(., ((match('identity', names(.)) +
1):ncol(.))), na.rm = TRUE))
out
# id city identity q1 q2 q3 new
#1 110 detroit ella 2 4 3 9
#2 111 boston fitz 0 0 0 0
#3 112 philly gerald 3 1 0 4
#4 113 new_york doowop 8 11 2 21
and then filter out the rows that have 0 in 'new'
out %>%
filter(new >0)
In the OP's updated dataset, the type of columns are character. We can automatically convert the types to respective types with
df %>%
#type.convert %>% # base R
# or with `readr::type_convert
type_convert %>%
...
NOTE: The OP mentioned in the title and in the description about a tidyverse option. It is not a question about efficiency.
Also, rowSums is a base R option. Here, we showed how to use that in tidyverse chain. I could have written an answer in base R way too earlier with the same option.
If we remove the select, it becomes just a base R i.e
df1$new < rowSums(df1[(match('identity', names(df1)) + 1):ncol(df1)], na.rm = TRUE)
Benchmarks
df = data.frame(id=sample(100,100), city=sample(LETTERS,100,TRUE),
identity=sample(letters,100,TRUE), q1=runif(100), q2=runif(100),q3=runif(100))
akrun = function(df1){
rowSums(df1[(match('identity', names(df1)) + 1):ncol(df1)], na.rm = TRUE)
}
baz = function(df1){rowSums(
df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
)}
microbenchmark({df$new=NULL;df2 = akrun(df)},{df$new=NULL;df$new=baz(df)})
#Unit: microseconds
# expr min lq mean median uq max neval
# { df$new = NULL df2 = akrun(df) } 69.926 73.244 112.2078 75.4335 78.7625 3539.921 100
# { df$new = NULL df$new = baz(df) } 73.670 77.945 118.3875 80.5045 83.5100 3767.812 100
data
df1 <- structure(list(id = 110:113, city = c("detroit", "boston", "philly",
"new_york"), identity = c("ella", "fitz", "gerald", "doowop"),
q1 = c(2L, 0L, 3L, 8L), q2 = c(4L, 0L, 1L, 11L), q3 = c(3L,
0L, 0L, 2L)), class = "data.frame", row.names = c(NA, -4L
))
Similar to akrun you can try
df %>%
mutate_at(vars(starts_with("q")),funs(as.numeric)) %>%
mutate(sum_new = rowSums(select(., starts_with("q")), na.rm = TRUE)) %>%
filter(sum_new>0)
Here i use reduce in purrr to sum rows, it's the fastest way.
library(tidyverse)
data %>% filter_at(vars(starts_with('q')),~!is.na(.)) %>%
mutate( Sum = reduce(select(., starts_with("q")), `+`)) %>%
filter(Sum > 0)

conditionally aggregate columns using tidyverse for large time series dataset

After looking at a few other asked questions, and reading a few guides, I'm not able to find a suitable solution to my specific problem. Here's an example of the data to begin:
data <- data.frame(
Date = sample(c("1993-07-05", "1993-07-05", "1993-07-05", "1993-08-30", "1993-08-30", "1993-08-30", "1993-08-30", "1993-09-04", "1993-09-04")),
Site = sample(c("1", "1", "1", "1", "1", "1", "1", "1", "1")),
Station = sample(c("1", "2", "3", "1", "2", "3", "4", "1", "2")),
Oxygen = sample(c("0.9", "0.4", "4.2", "5.6", "7.3", "4.3", "9.5", "5.3", "0.3")))
I want to average all the oxygen values for the stations that are nested within a site that corresponds to a date. My dataset has a couple of thousand rows, and like in the example, there are an uneven number of stations, and the dates are uneven in length.
The output I'm looking for are columns like, "Date -> Site -> Average Oxygen", foregoing the need for a station column altogether in the new version of the time series.
Any help would be greatly appreciated!
After grouping by 'Site', 'Date', get the mean of 'Oxygen' (after converting it to numeric - it is factor column)
library(tidyverse)
data %>%
group_by(Site, Date) %>%
summarise(AverageOxygen = mean(as.numeric(as.character(Oxygen))))
# A tibble: 3 x 3
# Groups: Site [1]
# Site Date AverageOxygen
# <fct> <fct> <dbl>
#1 1 1993-07-05 3.97
#2 1 1993-08-30 5.2
#3 1 1993-09-04 2.55
Try:
library(hablar)
library(tidyverse)
data %>%
retype() %>%
group_by(Site, Date) %>%
summarize(AverageOxygen = mean(Oxygen))
which gives you:
# A tibble: 3 x 3
# Groups: Site [?]
Site Date AverageOxygen
<int> <date> <dbl>
1 1 1993-07-05 4.7
2 1 1993-08-30 3.55
3 1 1993-09-04 4.75

Transposing data frames [duplicate]

This question already has answers here:
Transposing a dataframe maintaining the first column as heading
(5 answers)
Closed 1 year ago.
Happy Weekends.
I've been trying to replicate the results from this blog post in R. I am looking for a method of transposing the data without using t, preferably using tidyr or reshape. In example below, metadata is obtained by transposing data.
metadata <- data.frame(colnames(data), t(data[1:4, ]) )
colnames(metadata) <- t(metadata[1,])
metadata <- metadata[-1,]
metadata$Multiplier <- as.numeric(metadata$Multiplier)
Though it achieves what I want, I find it little unskillful. Is there any efficient workflow to transpose the data frame?
dput of data
data <- structure(list(Series.Description = c("Unit:", "Multiplier:",
"Currency:", "Unique Identifier: "), Nominal.Broad.Dollar.Index. = c("Index:_1997_Jan_100",
"1", NA, "H10/H10/JRXWTFB_N.M"), Nominal.Major.Currencies.Dollar.Index. = c("Index:_1973_Mar_100",
"1", NA, "H10/H10/JRXWTFN_N.M"), Nominal.Other.Important.Trading.Partners.Dollar.Index. = c("Index:_1997_Jan_100",
"1", NA, "H10/H10/JRXWTFO_N.M"), AUSTRALIA....SPOT.EXCHANGE.RATE..US..AUSTRALIAN...RECIPROCAL.OF.RXI_N.M.AL. = c("Currency:_Per_AUD",
"1", "USD", "H10/H10/RXI$US_N.M.AL"), SPOT.EXCHANGE.RATE...EURO.AREA. = c("Currency:_Per_EUR",
"1", "USD", "H10/H10/RXI$US_N.M.EU"), NEW.ZEALAND....SPOT.EXCHANGE.RATE..US..NZ...RECIPROCAL.OF.RXI_N.M.NZ.. = c("Currency:_Per_NZD",
"1", "USD", "H10/H10/RXI$US_N.M.NZ"), United.Kingdom....Spot.Exchange.Rate..US..Pound.Sterling.Reciprocal.of.rxi_n.m.uk = c("Currency:_Per_GBP",
"0.01", "USD", "H10/H10/RXI$US_N.M.UK"), BRAZIL....SPOT.EXCHANGE.RATE..REAIS.US.. = c("Currency:_Per_USD",
"1", "BRL", "H10/H10/RXI_N.M.BZ"), CANADA....SPOT.EXCHANGE.RATE..CANADIAN...US.. = c("Currency:_Per_USD",
"1", "CAD", "H10/H10/RXI_N.M.CA"), CHINA....SPOT.EXCHANGE.RATE..YUAN.US.. = c("Currency:_Per_USD",
"1", "CNY", "H10/H10/RXI_N.M.CH"), DENMARK....SPOT.EXCHANGE.RATE..KRONER.US.. = c("Currency:_Per_USD",
"1", "DKK", "H10/H10/RXI_N.M.DN"), HONG.KONG....SPOT.EXCHANGE.RATE..HK..US.. = c("Currency:_Per_USD",
"1", "HKD", "H10/H10/RXI_N.M.HK"), INDIA....SPOT.EXCHANGE.RATE..RUPEES.US. = c("Currency:_Per_USD",
"1", "INR", "H10/H10/RXI_N.M.IN"), JAPAN....SPOT.EXCHANGE.RATE..YEA.US.. = c("Currency:_Per_USD",
"1", "JPY", "H10/H10/RXI_N.M.JA"), KOREA....SPOT.EXCHANGE.RATE..WON.US.. = c("Currency:_Per_USD",
"1", "KRW", "H10/H10/RXI_N.M.KO"), Malaysia...Spot.Exchange.Rate..Ringgit.US.. = c("Currency:_Per_USD",
"1", "MYR", "H10/H10/RXI_N.M.MA"), MEXICO....SPOT.EXCHANGE.RATE..PESOS.US.. = c("Currency:_Per_USD",
"1", "MXN", "H10/H10/RXI_N.M.MX"), NORWAY....SPOT.EXCHANGE.RATE..KRONER.US.. = c("Currency:_Per_USD",
"1", "NOK", "H10/H10/RXI_N.M.NO"), SWEDEN....SPOT.EXCHANGE.RATE..KRONOR.US.. = c("Currency:_Per_USD",
"1", "SEK", "H10/H10/RXI_N.M.SD"), SOUTH.AFRICA....SPOT.EXCHANGE.RATE..RAND.US.. = c("Currency:_Per_USD",
"1", "ZAR", "H10/H10/RXI_N.M.SF"), Singapore...SPOT.EXCHANGE.RATE..SINGAPORE...US.. = c("Currency:_Per_USD",
"1", "SGD", "H10/H10/RXI_N.M.SI"), SRI.LANKA....SPOT.EXCHANGE.RATE..RUPEES.US.. = c("Currency:_Per_USD",
"1", "LKR", "H10/H10/RXI_N.M.SL"), SWITZERLAND....SPOT.EXCHANGE.RATE..FRANCS.US.. = c("Currency:_Per_USD",
"1", "CHF", "H10/H10/RXI_N.M.SZ"), TAIWAN....SPOT.EXCHANGE.RATE..NT..US.. = c("Currency:_Per_USD",
"1", "TWD", "H10/H10/RXI_N.M.TA"), THAILAND....SPOT.EXCHANGE.RATE....THAILAND. = c("Currency:_Per_USD",
"1", "THB", "H10/H10/RXI_N.M.TH"), VENEZUELA....SPOT.EXCHANGE.RATE..BOLIVARES.US.. = c("Currency:_Per_USD",
"1", "VEB", "H10/H10/RXI_N.M.VE")), .Names = c("Series.Description",
"Nominal.Broad.Dollar.Index.", "Nominal.Major.Currencies.Dollar.Index.",
"Nominal.Other.Important.Trading.Partners.Dollar.Index.", "AUSTRALIA....SPOT.EXCHANGE.RATE..US..AUSTRALIAN...RECIPROCAL.OF.RXI_N.M.AL.",
"SPOT.EXCHANGE.RATE...EURO.AREA.", "NEW.ZEALAND....SPOT.EXCHANGE.RATE..US..NZ...RECIPROCAL.OF.RXI_N.M.NZ..",
"United.Kingdom....Spot.Exchange.Rate..US..Pound.Sterling.Reciprocal.of.rxi_n.m.uk",
"BRAZIL....SPOT.EXCHANGE.RATE..REAIS.US..", "CANADA....SPOT.EXCHANGE.RATE..CANADIAN...US..",
"CHINA....SPOT.EXCHANGE.RATE..YUAN.US..", "DENMARK....SPOT.EXCHANGE.RATE..KRONER.US..",
"HONG.KONG....SPOT.EXCHANGE.RATE..HK..US..", "INDIA....SPOT.EXCHANGE.RATE..RUPEES.US.",
"JAPAN....SPOT.EXCHANGE.RATE..YEA.US..", "KOREA....SPOT.EXCHANGE.RATE..WON.US..",
"Malaysia...Spot.Exchange.Rate..Ringgit.US..", "MEXICO....SPOT.EXCHANGE.RATE..PESOS.US..",
"NORWAY....SPOT.EXCHANGE.RATE..KRONER.US..", "SWEDEN....SPOT.EXCHANGE.RATE..KRONOR.US..",
"SOUTH.AFRICA....SPOT.EXCHANGE.RATE..RAND.US..", "Singapore...SPOT.EXCHANGE.RATE..SINGAPORE...US..",
"SRI.LANKA....SPOT.EXCHANGE.RATE..RUPEES.US..", "SWITZERLAND....SPOT.EXCHANGE.RATE..FRANCS.US..",
"TAIWAN....SPOT.EXCHANGE.RATE..NT..US..", "THAILAND....SPOT.EXCHANGE.RATE....THAILAND.",
"VENEZUELA....SPOT.EXCHANGE.RATE..BOLIVARES.US.."), row.names = c(NA,
4L), class = "data.frame")
Using tidyr, you gather all the columns except the first, and then you spread the gathered columns.
Try:
library(dplyr)
library(tidyr)
data %>%
gather(var, val, 2:ncol(data)) %>%
spread(Series.Description, val)
library(dplyr)
# Omitted data <- structure part ...
Here is something that replicates what's in the main answer, but more generically (e.g., works where Series.Description is not the first column of the result) and using the newer pivot_wider/pivot_longer verbs.
df_transpose <- function(df) {
df %>%
tidyr::pivot_longer(-1) %>%
tidyr::pivot_wider(names_from = 1, values_from = value)
}
df_transpose(data)
#> # A tibble: 26 x 5
#> name `Unit:` `Multiplier:` `Currency:` `Unique Identifi…
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Nominal.Broad.Dollar.… Index:_19… 1 <NA> H10/H10/JRXWTFB_…
#> 2 Nominal.Major.Currenc… Index:_19… 1 <NA> H10/H10/JRXWTFN_…
#> 3 Nominal.Other.Importa… Index:_19… 1 <NA> H10/H10/JRXWTFO_…
#> 4 AUSTRALIA....SPOT.EXC… Currency:… 1 USD H10/H10/RXI$US_N…
#> 5 SPOT.EXCHANGE.RATE...… Currency:… 1 USD H10/H10/RXI$US_N…
#> 6 NEW.ZEALAND....SPOT.E… Currency:… 1 USD H10/H10/RXI$US_N…
#> 7 United.Kingdom....Spo… Currency:… 0.01 USD H10/H10/RXI$US_N…
#> 8 BRAZIL....SPOT.EXCHAN… Currency:… 1 BRL H10/H10/RXI_N.M.…
#> 9 CANADA....SPOT.EXCHAN… Currency:… 1 CAD H10/H10/RXI_N.M.…
#> 10 CHINA....SPOT.EXCHANG… Currency:… 1 CNY H10/H10/RXI_N.M.…
#> # … with 16 more rows
But note that (like the answer above) the name of the first column is lost. The following retains this (as, I guess does the spread_(names(data)[1], "val") approach proposed by #jbkunst above).
df_transpose <- function(df) {
first_name <- colnames(df)[1]
temp <-
df %>%
tidyr::pivot_longer(-1) %>%
tidyr::pivot_wider(names_from = 1, values_from = value)
colnames(temp)[1] <- first_name
temp
}
df_transpose(data)
#> # A tibble: 26 x 5
#> Series.Description `Unit:` `Multiplier:` `Currency:` `Unique Identif…
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Nominal.Broad.Dollar.In… Index:_1… 1 <NA> H10/H10/JRXWTFB…
#> 2 Nominal.Major.Currencie… Index:_1… 1 <NA> H10/H10/JRXWTFN…
#> 3 Nominal.Other.Important… Index:_1… 1 <NA> H10/H10/JRXWTFO…
#> 4 AUSTRALIA....SPOT.EXCHA… Currency… 1 USD H10/H10/RXI$US_…
#> 5 SPOT.EXCHANGE.RATE...EU… Currency… 1 USD H10/H10/RXI$US_…
#> 6 NEW.ZEALAND....SPOT.EXC… Currency… 1 USD H10/H10/RXI$US_…
#> 7 United.Kingdom....Spot.… Currency… 0.01 USD H10/H10/RXI$US_…
#> 8 BRAZIL....SPOT.EXCHANGE… Currency… 1 BRL H10/H10/RXI_N.M…
#> 9 CANADA....SPOT.EXCHANGE… Currency… 1 CAD H10/H10/RXI_N.M…
#> 10 CHINA....SPOT.EXCHANGE.… Currency… 1 CNY H10/H10/RXI_N.M…
#> # … with 16 more rows
Created on 2021-05-30 by the reprex package (v2.0.0)

Resources