I have two data frames, df1 and df2:
df1<-structure(list(protocol_no = c("study5", "study5",
"study5", "study5", "study5", "study5","study6"),
sequence_number = c("1", "15", "73", "42", "2", "9","5021")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
df2<-structure(list(record_id = c(11, 12, 13, 14, 15, 16), protocol_no = c("study5",
"study5", "study5", "study5", "study5", "study5"
), sequence_number = c("1", "15", "73", "42", "2", "9"), form_1_complete = c(0,
0, 0, 0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")
You can mostly ignore what's in these; I just made up some names and numbers. The key points are that df2 has more columns than df1, and the real data sets will have 27,000+ rows.
df1 will always have slightly more rows than df2 because it has newer data.
What I'm trying to do is find the rows that exist in df1 but not in df2 and isolate them. I know I could do this with anti_join(); the problem is that I also want to include the "record_id" column from df2, and I want the numbering to continue from wherever df2 left off.
So in this case, the df1 row "study6, 5021" would be the 'new' row, and it would be numbered record_id = 17 (because that's where df2 left off), and my output would look like this:
We could bind the data, get the distinct rows and update the record_id
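library(dplyr)
library(tidyr)
library(data.table)  # provides rowid()
bind_rows(df2, df1) %>%
  # keep one row per key; df2 rows come first, so existing records are kept
  distinct(protocol_no, sequence_number, .keep_all = TRUE) %>%
  # rows that came only from df1 have NA here; carry the last known values down
  fill(record_id, form_1_complete) %>%
  # rows sharing the filled-in record_id get offsets 0, 1, 2, ...
  mutate(record_id = record_id + (rowid(record_id) - 1))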
-output
record_id protocol_no sequence_number form_1_complete
1 11 study5 1 0
2 12 study5 15 0
3 13 study5 73 0
4 14 study5 42 0
5 15 study5 2 0
6 16 study5 9 0
7 17 study6 5021 0
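If only the new rows are wanted, as the question describes, the anti_join() idea can be combined with an offset from the last record_id in df2. This is a minimal sketch, not the approach used in the answer above; the toy data frames mirror the question's df1 and df2, and the name new_rows is just illustrative.

```r
library(dplyr)

# Toy data mirroring the question's df1 and df2
df1 <- data.frame(
  protocol_no = c(rep("study5", 6), "study6"),
  sequence_number = c("1", "15", "73", "42", "2", "9", "5021")
)
df2 <- data.frame(
  record_id = c(11, 12, 13, 14, 15, 16),
  protocol_no = "study5",
  sequence_number = c("1", "15", "73", "42", "2", "9"),
  form_1_complete = 0
)

new_rows <- df1 %>%
  # keep only the df1 rows whose key pair has no match in df2
  anti_join(df2, by = c("protocol_no", "sequence_number")) %>%
  # continue numbering after the largest record_id already in df2
  mutate(record_id = max(df2$record_id) + row_number()) %>%
  select(record_id, everything())

new_rows
```

With this toy data, new_rows holds the single "study6" row with record_id 17. Columns that exist only in df2, such as form_1_complete, are not carried over and would have to be added separately.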
Is there a way to do something like rename_with where, instead of the predicate function being applied to the column name, it is based on a value in another variable?
Say I have a dataset as follows:
data <- tibble(home_team = c("SF", "KC", "JAX", "WAS", "BUF"),
away_team = c("GB", "CAR", "HOU", "NYG", "SEA"),
home_total = c(21, 25, 30, 22, 23.5),
home_plays = c(65, 64, 63, 57, 60),
away_total = c(30, 22, 25, 22, 25),
away_plays = c(56, 62, 66, 59, 62))
And I am trying to get it to look something like:
finalized_data <- tibble(team = c("SF", "KC", "JAX", "WAS", "BUF", "GB", "CAR", "HOU", "NYG", "SEA"),
total = c(21, 25, 30, 22, 23.5, 30, 22, 25, 22, 25),
plays = c(65, 64, 63, 57, 60, 56, 62, 66, 59, 62))
Currently the best way I know is a mutate call that gets long when there are a lot of variables, and there's got to be a cleaner way to do it, since it's essentially a rename based on a variable in the data.
current_way <- data %>%
pivot_longer(c(home_team, away_team), names_to = "team_type", values_to = "team") %>%
mutate(total = ifelse(str_detect(team_type, "home_team"), home_total, away_total),
plays = ifelse(str_detect(team_type, "home_team"), home_plays, away_plays)) %>%
select(team, total, plays)
Any thoughts, or is there even a way to do it in the pivot function that I am missing?
Here is an option with pivot_longer by making use of the column names pattern to split into columns
library(dplyr)
library(tidyr)
data %>%
pivot_longer(cols = everything(), names_to = c("grp", ".value"),
names_sep = "_") %>%
arrange(desc(grp)) %>%
select(-grp)
-output
# A tibble: 10 x 3
# team total plays
# <chr> <dbl> <dbl>
# 1 SF 21 65
# 2 KC 25 64
# 3 JAX 30 63
# 4 WAS 22 57
# 5 BUF 23.5 60
# 6 GB 30 56
# 7 CAR 22 62
# 8 HOU 25 66
# 9 NYG 22 59
#10 SEA 25 62
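To see what the ".value" sentinel does, it can help to keep the prefix column instead of dropping it. The sketch below rebuilds the question's data as a plain data frame and names the prefix column side (an illustrative name, not from the original answer): each column name is split at "_", the first piece goes into side, and the second piece becomes the name of an output column.

```r
library(tidyr)

# Same values as the question's `data`, rebuilt as a plain data frame
data <- data.frame(
  home_team  = c("SF", "KC", "JAX", "WAS", "BUF"),
  away_team  = c("GB", "CAR", "HOU", "NYG", "SEA"),
  home_total = c(21, 25, 30, 22, 23.5),
  home_plays = c(65, 64, 63, 57, 60),
  away_total = c(30, 22, 25, 22, 25),
  away_plays = c(56, 62, 66, 59, 62)
)

long <- pivot_longer(
  data,
  cols = everything(),
  # "home_total" -> side = "home", and the value lands in column "total"
  names_to = c("side", ".value"),
  names_sep = "_"
)

long
```

The result has one row per team per game, with columns side, team, total and plays, so the home/away origin stays available for later filtering or grouping.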
I tried to replace the Unicode escape "U+00F3" in a data frame using the sapply function, but nothing happened. The column that contains the escape I want to replace is of type chr.
Here is the function:
tableExcel$Team <- sapply(tableExcel$Team, gsub, pattern = "<U+00F3>", replacement= "o")
EDIT:
Thanks to Cath's answer below, I added \\ before the +:
tableExcel$Team <- sapply(tableExcel$Team, gsub, pattern = "<U\\+00F3>", replacement= "o")
But it didn't work.
I also tried to provide an example of my dataset, but the problem is that the replacement works on the example and not on my real data:
tableExcel <- data.frame("Team" = c("A", "B", "C", "Reducci<U+00F3>n"), "Point" = c(2, 30, 40, 30))
tableExcel$Team <- as.character(tableExcel$Team)
To provide more information, here is how I import my Excel file:
tableExcel <- as.data.frame(read_excel("Dataset LOS.xls", sheet = "Liga Squads"))
The structure of my data:
structure(list(Team = c("CHURN", "CHURN", "RESIDENCIAL NPTB", "RESIDENCIAL NPTB", "AUDIENCIAS TV", "AUDIENCIAS TV"), Points = c("P. Asig", "P. entr", "P. Asig", "P. entr", "P. Asig", "P. entr"), `2019-S01` = c(0, 0, 50, 0, NA, NA), `2019-S02` = c(0, 0, 10, 10, NA, NA), `2019-S03` = c(93, 88, 46, 19, NA, NA), `2019-S04` = c(56, 48, 0, 0, 13, 13), `2019-S05` = c(NA, NA, 80.5, 49.5, 42, 28.5), `2019-S06` = c(NA, NA, 66, 48, 55, 39.5), `2019-S07` = c(131, 112, 103, 63, 40.5, 38)), row.names = c(1L, 2L, 4L, 5L, 7L, 8L), class = "data.frame")
I'm unable to replicate the issue with gsub. The following works as expected:
tableExcel$Team <- gsub("<U\\+00F3>", "o", tableExcel$Team)
#### OUTPUT ####
Team Points 2019-S01 2019-S02 2019-S03 2019-S04 2019-S05 2019-S06 2019-S07
1 Reducci<U+00F1>n P. Asig 0 0 93 56 NA NA 131.0
2 CHURN P. entr 0 0 88 48 NA NA 112.0
4 Reducci<U+00F2>n P. Asig 50 10 46 0 80.5 66.0 103.0
5 RESIDENCIAL NPTB P. entr 0 10 19 0 49.5 48.0 63.0
7 AUDIENCIAS TV P. Asig NA NA NA 13 42.0 55.0 40.5
8 <NA> P. entr NA NA NA 13 28.5 39.5 38.0
9 Reduccion P. entr NA NA NA NA NA NA NA
However, replacement using regular expressions might not be the most efficient way to convert the Unicode characters, as it would require a separate call to gsub for each character. Instead, you might want to give stringi's stri_unescape_unicode() a try:
# install.packages("stringi") # Use if not yet installed.
library(stringi)
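# Match exactly four hex digits so two escapes in one string are not merged by a greedy match
tableExcel$Team <- stri_unescape_unicode(gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", tableExcel$Team))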
#### OUTPUT ####
Team Points 2019-S01 2019-S02 2019-S03 2019-S04 2019-S05 2019-S06 2019-S07
1 Reducciñn P. Asig 0 0 93 56 NA NA 131.0
2 CHURN P. entr 0 0 88 48 NA NA 112.0
4 Reducciòn P. Asig 50 10 46 0 80.5 66.0 103.0
5 RESIDENCIAL NPTB P. entr 0 10 19 0 49.5 48.0 63.0
7 AUDIENCIAS TV P. Asig NA NA NA 13 42.0 55.0 40.5
8 <NA> P. entr NA NA NA 13 28.5 39.5 38.0
9 Reducción P. entr NA NA NA NA NA NA NA
The format <U+0000> is first converted to \\u0000 using gsub and then unescaped. As you can see, it takes care of multiple unicode characters in one go, which makes things much simpler.
Data:
tableExcel <- structure(list(Team = c("Reducci<U+00F1>n", "CHURN", "Reducci<U+00F2>n",
"RESIDENCIAL NPTB", "AUDIENCIAS TV", NA, "Reducci<U+00F3>n"),
Points = c("P. Asig", "P. entr", "P. Asig", "P. entr", "P. Asig",
"P. entr", "P. entr"), `2019-S01` = c(0, 0, 50, 0, NA, NA,
NA), `2019-S02` = c(0, 0, 10, 10, NA, NA, NA), `2019-S03` = c(93,
88, 46, 19, NA, NA, NA), `2019-S04` = c(56, 48, 0, 0, 13,
13, NA), `2019-S05` = c(NA, NA, 80.5, 49.5, 42, 28.5, NA),
`2019-S06` = c(NA, NA, 66, 48, 55, 39.5, NA), `2019-S07` = c(131,
112, 103, 63, 40.5, 38, NA)), row.names = c(1L, 2L, 4L, 5L,
7L, 8L, 9L), class = "data.frame")
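The conversion step can be wrapped in a small helper to check how it behaves when one string contains more than one escape. This is a sketch under the assumption that every escape has the form <U+XXXX> with exactly four hex digits; the helper name unescape_angle and the sample strings are made up for illustration.

```r
library(stringi)

# Hypothetical helper: turn "<U+XXXX>" into "\uXXXX", then unescape it.
# Restricting the capture group to four hex digits keeps one match from
# spanning two escapes that occur in the same string.
unescape_angle <- function(x) {
  stri_unescape_unicode(gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", x))
}

teams <- c("Reducci<U+00F3>n", "Espa<U+00F1>a y M<U+00E1>laga", "CHURN", NA)
fixed <- unescape_angle(teams)
fixed
```

Strings without an escape pass through unchanged, and NA stays NA.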
I have this dataset
Book2 <- structure(list(meanX3 = c(21.66666667, 21.66666667, 11, 25, 240.3333333
), meanX1 = c(23, 34.5, 10, 25, 233.5), meanX2 = c(24.5, 26.5,
20, 25, 246.5), to_select = structure(c(3L, 1L, 2L, 1L, 1L), .Label = c("meanX1",
"meanX2", "meanX3"), class = "factor"), selected = c(NA, NA,
NA, NA, NA)), .Names = c("meanX3", "meanX1", "meanX2", "to_select",
"selected"), class = "data.frame", row.names = c(NA, -5L))
For each row, I want to get the value from the column whose name is stored in the variable to_select.
I have tried
Book2 %>% dplyr::mutate(selected=.[paste0(to_select)])
But it returns all the column values. How can I get a data set like
structure(list(meanX3 = c(21.66666667, 21.66666667, 11, 25, 240.3333333
), meanX1 = c(23, 34.5, 10, 25, 233.5), meanX2 = c(24.5, 26.5,
20, 25, 246.5), to_select = structure(c(3L, 1L, 2L, 1L, 1L), .Label = c("meanX1",
"meanX2", "meanX3"), class = "factor"), selected = c(21.66, 34.5,
20, 25, 240.33)), .Names = c("meanX3", "meanX1", "meanX2", "to_select",
"selected"), class = "data.frame", row.names = c(NA, -5L))
With base R, a safe strategy would be something like
cols <- as.character(unique(Book2$to_select))
row_col <- match(Book2$to_select, cols)
idx <- cbind(seq_along(Book2$to_select), row_col)
selected <- Book2[, cols][idx]
Book2$selected <- selected
Or using tidyverse packages, something like
library(tidyverse)
Book2 %>% mutate(row=1:n()) %>%
  gather(prop, val, meanX3:meanX2) %>%
  group_by(row) %>%
  mutate(selected=val[to_select==prop]) %>%
  spread(prop, val) %>%
  ungroup() %>%   # drop the grouping so that `row` can actually be removed
  select(-row)
Would be a decent strategy.
One way is to group by row using rowwise() and then get the value of the string in 'to_select' column
Book2 %>%
rowwise() %>%
mutate(selected = get(as.character(to_select)))
# A tibble: 5 × 5
# meanX3 meanX1 meanX2 to_select selected
# <dbl> <dbl> <dbl> <fctr> <dbl>
#1 21.66667 23.0 24.5 meanX3 21.66667
#2 21.66667 34.5 26.5 meanX1 34.50000
#3 11.00000 10.0 20.0 meanX2 20.00000
#4 25.00000 25.0 25.0 meanX1 25.00000
#5 240.33333 233.5 246.5 meanX1 233.50000
In base R you can use match to select the desired column and then matrix subsetting to select the particular element for each row like this
Book2$selected <- as.numeric(Book2[cbind(seq_len(nrow(Book2)),
match(Book2$to_select, names(Book2)))])
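The matrix-subsetting idea can also be applied to the numeric columns only, which avoids the detour through character values and as.numeric(). A minimal sketch, assuming the candidate columns are known up front (value_cols is an illustrative name) and using a character to_select instead of a factor for brevity:

```r
# Same values as the question's Book2, without the placeholder `selected` column
Book2 <- data.frame(
  meanX3 = c(21.66666667, 21.66666667, 11, 25, 240.3333333),
  meanX1 = c(23, 34.5, 10, 25, 233.5),
  meanX2 = c(24.5, 26.5, 20, 25, 246.5),
  to_select = c("meanX3", "meanX1", "meanX2", "meanX1", "meanX1")
)

value_cols <- c("meanX3", "meanX1", "meanX2")

# Numeric matrix of the candidate columns only
m <- as.matrix(Book2[value_cols])

# Two-column index matrix: (row number, column position) for each row
idx <- cbind(seq_len(nrow(Book2)),
             match(as.character(Book2$to_select), value_cols))

# Indexing a matrix with a two-column matrix picks one element per row
Book2$selected <- m[idx]
Book2
```

Because m is numeric, the selected values keep their full precision and type.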
I have the following data. It is a dump from a normalized database, but I cannot access the database, and the database maintainer insists that this is not necessary.
The obs variable is the unique observation id, a.k.a. the one to "pivot" around.
Specifically, I want to go from this olddata to the newdata data frame below:
> olddata
species obs variable value
3 ADFA 1 mean 4
4 ADFA 1 lat 118
5 ADFA 1 lon 49
6 ADFA 1 masl 74
96 HODO 8 mean 18
97 HODO 8 lat 120
98 HODO 8 lon 45
99 HODO 8 masl 36
189 HODO 9 mean 34
190 HODO 9 lat 126
191 HODO 9 lon 12
192 HODO 9 masl 35
I would like to reshape this data frame to look like:
> newdata
species obs mean lat lon masl
1 ADFA 1 4 118 49 74
2 HODO 8 18 120 45 36
3 HODO 9 34 126 12 35
Disclaimer: this has likely been asked before but I am unable to find the question among the many questions related to transforming data frames / matrices
Here are the dataframes for use when reproducing this issue:
olddata <- structure(list(species = c("ADFA", "ADFA", "ADFA", "ADFA", "HODO",
"HODO", "HODO", "HODO", "HODO", "HODO", "HODO", "HODO"), obs = c(1,
1, 1, 1, 8, 8, 8, 8, 9, 9, 9, 9), variable = c("mean", "lat",
"lon", "masl", "mean", "lat", "lon", "masl", "mean", "lat", "lon",
"masl"), value = c(4, 118, 49, 74, 18, 120, 45, 36, 34, 126,
12, 35)), .Names = c("species", "obs", "variable", "value"),
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
"11", "12"), class = "data.frame")
newdata <- structure(list(species = c("ADFA", "HODO", "HODO"), obs = c(1,
8, 9), mean = c(4, 18, 34), lat = c(118, 120, 126), lon = c(49,
45, 12), masl = c(74, 36, 35)), .Names = c("species", "obs",
"mean", "lat", "lon", "masl"), row.names = c(NA, -3L),
class = "data.frame")
Here is an example:
> library(reshape2)
> dcast(olddata, species+obs~variable)
species obs lat lon masl mean
1 ADFA 1 118 49 74 4
2 HODO 8 120 45 36 18
3 HODO 9 126 12 35 34
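For comparison, the same reshape can be sketched with tidyr's pivot_wider(). This is an alternative to the dcast() call above, not part of the original answer; the name wide is illustrative.

```r
library(tidyr)

# Same values as the question's olddata
olddata <- data.frame(
  species  = c(rep("ADFA", 4), rep("HODO", 8)),
  obs      = c(1, 1, 1, 1, 8, 8, 8, 8, 9, 9, 9, 9),
  variable = rep(c("mean", "lat", "lon", "masl"), times = 3),
  value    = c(4, 118, 49, 74, 18, 120, 45, 36, 34, 126, 12, 35)
)

# species and obs identify a row; each distinct `variable` becomes a column
wide <- pivot_wider(olddata, names_from = variable, values_from = value)
wide
```

Unlike dcast(), which sorts the new columns alphabetically, pivot_wider() keeps them in order of first appearance (mean, lat, lon, masl), which matches the newdata layout in the question.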