I have a numeric vector:
> dput(vec_exp)
structure(c(12.344902729712, 6.54357482855349, 17.1939193108764,
23.1029632631654, 8.91495023159554, 14.3259091357051, 18.0494234749187,
2.92524638658168, 5.10306474037357, 2.66645609602021), .Names = c("Arthur_1",
"Mark_1", "Mark_2", "Mark_3", "Stephen_1", "Stephen_2",
"Stephen_3", "Rafael_1", "Marcus_1", "Georg_1"))
and then I have a data frame like the one below:
Name Nr Numb
1 Rafael 20.8337 20833.7
2 Joseph 25.1682 25168.2
3 Stephen 40.5880 40588.0
4 Leon 198.7730 198773.0
5 Thierry 16.5430 16543.0
6 Marcus 31.6600 31660.0
7 Lucas 39.6700 39670.0
8 Georg 194.9410 194941.0
9 Mark 60.1020 60102.0
10 Chris 56.0578 56057.8
I would like to multiply the numbers in the numeric vector by the numbers in the column Nr of this data frame. The values have to be matched by name: Mark_1 from the numeric vector should be multiplied by Nr = 60.1020, the same for Mark_2 and Mark_3, Stephen_3 by 40.5880, etc.
Can someone recommend an easy solution?
You could use match to match the names after extracting only the first part of the names of vec_exp, i.e. extract Mark from Mark_1 etc.
vec_exp * df$Nr[match(sub("^([^_]+).*", "\\1", names(vec_exp)), df$Name)]
# Arthur_1 Mark_1 Mark_2 Mark_3 Stephen_1 Stephen_2 Stephen_3 Rafael_1 Marcus_1 Georg_1
# NA 393.28193 1033.38894 1388.53430 361.84000 581.46000 732.59000 60.94371 161.56303 519.80162
Arthur is NA because there's no match in the data.frame.
If you want to keep those entries without a match in the data as they were before, you could do it like this:
i <- match(sub("^([^_]+).*", "\\1", names(vec_exp)), df$Name)
vec_exp[!is.na(i)] <- vec_exp[!is.na(i)] * df$Nr[na.omit(i)]
This first computes the matches and then only multiplies those if they are not NA.
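The same lookup can also be written with a named vector instead of match. The following is only a minimal sketch on a shortened copy of the data from the question; nr_by_name, key, scaled and kept are illustrative names, not part of the original answer.

```r
# Shortened copy of the named vector from the question
vec_exp <- c(Arthur_1 = 12.344902729712,
             Mark_1 = 6.54357482855349,
             Stephen_3 = 18.0494234749187)

# Shortened copy of the lookup data frame
df <- data.frame(Name = c("Stephen", "Mark"),
                 Nr = c(40.5880, 60.1020),
                 stringsAsFactors = FALSE)

# Named vector: one multiplier per name
nr_by_name <- setNames(df$Nr, df$Name)

# Drop the "_<number>" suffix to obtain the lookup key
key <- sub("_.*$", "", names(vec_exp))

# Indexing with a name that is absent returns NA, so Arthur_1 becomes NA
scaled <- vec_exp * unname(nr_by_name[key])

# Variant that keeps the original value where no multiplier exists
kept <- ifelse(is.na(scaled), vec_exp, scaled)
```

Which form to use is mostly a matter of taste; the named vector is convenient when the same lookup is reused several times.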
We can use base R methods. Convert the vector to a data.frame with stack, create a 'Name' column by removing the substring from 'ind' and merge with the data.frame ('df1'). Then, we can multiply the 'Nr' and the 'values' column.
d1 <- merge(df1, transform(stack(vec_exp), Name = sub("_.*", "", ind)), all.y=TRUE)
d1$Nr*d1$values
Or with dplyr and tidyr, which may be easier to read.
library(dplyr)
library(tidyr)
stack(vec_exp) %>%
separate(ind, into = c("Name", "ind")) %>%
left_join(., df1, by = "Name") %>%
mutate(res = values*Nr) %>%
.$res
#[1] NA 393.28193 1033.38894 1388.53430 361.84000
#[6] 581.46000 732.59000 60.94371 161.56303 519.80162
I'm dealing with a quite complicated data frame. I'm trying to rename its columns by adding some terminal digits to each column several times. How can I do that?
Let me make an example:
df=data.frame(disease=c("HB1","HB2","HB3","HB4"),
region="AZ",
hospitalAZ=runif(4),
hospitalAZ=runif(4),
hospitalAZ=runif(4),
hospitalAZ=runif(4))
This is just a toy example. The outcome should be: the columns after "region" should be named HospitalAZ1, HospitalAZ2, HospitalAZ1, HospitalAZ2. I am looking for a parsimonious way of adding, in this case, 1 and 2 to the 4 columns with repetition (2 times in this case). Then, how can I export the result to an xls file?
We could use rename_with
library(dplyr)
library(stringr)
df <- df %>%
rename_with(~ make.unique(str_c("HospitalAZ", rep(1:2,
length.out = length(.x)))), starts_with("hospitalAZ"))
-output
disease region HospitalAZ1 HospitalAZ2 HospitalAZ1.1 HospitalAZ2.1
1 HB1 AZ 0.1796734 0.28729264 0.8549300 0.8486733
2 HB2 AZ 0.8518319 0.03438504 0.5909983 0.8378173
3 HB3 AZ 0.3961885 0.67294967 0.4627137 0.5484321
4 HB4 AZ 0.9955195 0.38767387 0.1961428 0.6010028
NOTE: a matrix can have duplicate column names, but duplicate column names in a data.frame are not recommended, and in the tidyverse duplicates can result in an error. That is why make.unique is used above.
In base R, we may do
i1 <- startsWith(names(df), "hospitalAZ")
names(df)[i1] <- paste0("HospitalAZ", rep(1:2, length.out = sum(i1)))
> df
disease region HospitalAZ1 HospitalAZ2 HospitalAZ1 HospitalAZ2
1 HB1 AZ 0.1796734 0.28729264 0.8549300 0.8486733
2 HB2 AZ 0.8518319 0.03438504 0.5909983 0.8378173
3 HB3 AZ 0.3961885 0.67294967 0.4627137 0.5484321
4 HB4 AZ 0.9955195 0.38767387 0.1961428 0.6010028
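Neither answer covers the export part of the question. A minimal sketch, assuming a CSV file (which Excel opens directly) is acceptable: base R's write.csv needs no extra package. The toy data frame and the file name hospital_data.csv are hypothetical.

```r
# Toy data frame with unique column names (assumption: unique names are
# used here because they are easier to handle when reading the file back)
df <- data.frame(disease = c("HB1", "HB2"),
                 region = "AZ",
                 HospitalAZ1 = c(0.1, 0.2),
                 HospitalAZ2 = c(0.3, 0.4))

# Hypothetical output path inside the session's temporary directory
out_file <- file.path(tempdir(), "hospital_data.csv")

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(df, out_file, row.names = FALSE)

# Read the file back to confirm the round trip
check <- read.csv(out_file)
```

If a native Excel file is required and the writexl package is installed, writexl::write_xlsx(df, "hospital_data.xlsx") should produce an .xlsx file instead.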
I have a dataset where I list states with their respective cities, some of these places have been aggregated (not by me) and are classified as "Other ([count of places])" (e.g. Other (99)). Appended to this list of places are numeric 'count' values. I'd like to 1.) find the average count per place and 2.) duplicate these 'Other...' places along with the average according to the number within the parenthesis. Example below:
set.seed(5)
df <- data.frame(state = c('A','B'), city = c('Other (3)','Other (2)'), count = c('250','50'))
Output:
state city count
A Other (3) 83.333
A Other (3) 83.333
A Other (3) 83.333
B Other (2) 25.000
B Other (2) 25.000
So far I've only been able to figure out how to pull the numbers from the parenthesis and create an average:
average = as.numeric(df$count)/as.numeric(gsub(".*\\((.*)\\).*", "\\1", df$city))
An option with uncount. Extract the numeric part in 'city' with parse_number, divide the 'count' by 'n' and replicate the rows with uncount
library(dplyr)
library(tidyr)
df %>%
mutate(n = readr::parse_number(city), count = as.numeric(count)/n) %>%
uncount(n)
-output
state city count
1 A Other (3) 83.33333
2 A Other (3) 83.33333
3 A Other (3) 83.33333
4 B Other (2) 25.00000
5 B Other (2) 25.00000
You could extend your example with the following code:
set.seed(5)
df <- data.frame(state = c('A','B'), city = c('Other (3)','Other (2)'), count = c('250','50'))
times <- as.numeric(gsub(".*\\((.*)\\).*", "\\1", df$city))
df$count <- as.numeric(df$count)/times
output <- df[rep(seq_along(times),times),]
The key addition is the line creating output, which uses row indexing on the input dataframe to repeat each row as required.
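To make the row-indexing step more explicit, here is a small sketch on the same two-row example that shows the index vector produced by rep and checks that the expanded counts add back up to the original totals. The names row_idx, expanded and totals are illustrative only.

```r
df <- data.frame(state = c("A", "B"),
                 city = c("Other (3)", "Other (2)"),
                 count = c("250", "50"),
                 stringsAsFactors = FALSE)

# Number inside the parentheses, e.g. 3 for "Other (3)"
times <- as.numeric(gsub(".*\\((.*)\\).*", "\\1", df$city))

# rep() with a vector of counts repeats each row index that many times
row_idx <- rep(seq_along(times), times)   # 1 1 1 2 2

# Repeat the rows, then replace the count by the per-place average
expanded <- df[row_idx, ]
expanded$count <- as.numeric(expanded$count) / times[row_idx]
rownames(expanded) <- NULL   # drop row names such as "1.1", "1.2"

# Sanity check: per-state totals should equal the original counts
totals <- tapply(expanded$count, expanded$state, sum)
```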
A relatively trivial question that has been bothering me for a while, but to which I have not yet found an answer - perhaps because I have trouble verbalizing the problem for search engines.
Here is a column of a data frame that contains identifiers.
data <- data.frame("id" = c("D78", "L30", "F02", "A23", "B45", "T01", "Q38", "S30", "K84", "O04", "P12", "Z33"))
Based on a lookup table, outdated identifiers are to be recoded into new ones. Here is an example look up table.
recode_table <- data.frame("old" = c("A23", "B45", "K84", "Z33"),
"new" = c("A24", "B46", "K88", "Z33"))
What I need now can be done with a merge or a loop. Here a loop example:
for(ID in recode_table$old) {
data[data$id == ID, "id"] <- recode_table[recode_table$old == ID, "new"]
}
But I am looking for a dplyr solution without having to use the " join" family. I would like something like this.
data <- mutate(data, id = ifelse(id %in% recode_table$old, filter(recode_table, old == id) %>% pull(new), id))
Obviously though, I can't use the column name ("id") of the table in order to identify the new ID.
References to corresponding passages in documentations or manuals are also appreciated. Thanks in advance!
You can use recode with unquote splicing (!!!) on a named vector
library(dplyr)
# vector of new IDs
recode_vec <- recode_table$new
# named with old IDs
names(recode_vec) <- recode_table$old
data %>%
mutate(id = recode(id, !!!recode_vec))
# id
# 1 D78
# 2 L30
# 3 F02
# 4 A24
# 5 B46
# 6 T01
# 7 Q38
# 8 S30
# 9 K88
# 10 O04
# 11 P12
# 12 Z33
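For comparison, the same recoding can be sketched in base R with match, without any join. This is an alternative under the assumption that every old identifier appears only once in the lookup table; idx is an illustrative name and the data are a shortened copy of the question's example.

```r
data <- data.frame(id = c("D78", "A23", "B45", "K84", "Z33"),
                   stringsAsFactors = FALSE)

recode_table <- data.frame(old = c("A23", "B45", "K84", "Z33"),
                           new = c("A24", "B46", "K88", "Z33"),
                           stringsAsFactors = FALSE)

# Position of each id in the lookup table, NA if it is not listed
idx <- match(data$id, recode_table$old)

# Take the new id where a match exists, otherwise keep the old one
data$id <- ifelse(is.na(idx), data$id, recode_table$new[idx])
```

Because the result is an ordinary vector expression, the last line could also be used inside mutate.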
I have two big datasets, I would like to subset some columns in order to use the data.
My problem is that the reference column for subsetting is not completely matching. So I would like to be able to match for the part of the strings that are the same.
Here a simpler example:
ref_df <- data.frame("reference" = c("swietenia macrophylla",
"azadirachta indica",
"cedrela odorata",
"ochroma pyramidale",
"tectona grandis",
"tamarindus indica",
"cariniana pyriformis",
"paquita quinata",
"albizia saman",
"enterolobium cyclocarpum",
"tapirira guianensis",
"dipteryx oleifera"),
"values" = c(rnorm(12)))
tofind_df <- c("swietenia macrophylla and try try",
"azadirachta indica",
"tamarindus indica (bla bla)",
"tara",
"bla bla (paquita quinata)",
"prosopis pallida",
"dipteryx oleifera")
So I try to keep all the rows of ref_df whose name matches, even partially, an entry in tofind_df, but my attempt only matches when the strings are identical.
finale <- ref_df[ref_df$reference %in% tofind_df,]
I tried with grepl as well, but I couldn't find the solution.
My ideal finale should look like this:
reference values
1 swietenia macrophylla -0.459001383
2 azadirachta indica -0.430014486
3 tamarindus indica -0.541887328
4 paquita quinata -0.003572792
5 dipteryx oleifera -0.855659901
Please, think about two big df and not this easier situation.
We need to use sapply to get the results from grepl for every element
ref_df[sapply(ref_df$reference, function(x) any(grepl(x, tofind_df))),]
reference values
1 swietenia macrophylla 1.4482830
2 azadirachta indica 0.9037943
6 tamarindus indica -0.2994678
8 paquita quinata 0.4895183
12 dipteryx oleifera -1.1652528
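One caveat with grepl is that it treats each reference name as a regular expression by default. A hedged variant, assuming the names should be matched as literal text, passes fixed = TRUE and uses vapply so the result is guaranteed to be a logical vector. The data below are a shortened copy of the question's example; tofind and keep are illustrative names.

```r
ref_df <- data.frame(reference = c("swietenia macrophylla",
                                   "cedrela odorata",
                                   "paquita quinata"),
                     values = c(1.5, 2.5, 3.5),
                     stringsAsFactors = FALSE)

tofind <- c("swietenia macrophylla and try try",
            "bla bla (paquita quinata)",
            "tara")

# For each reference name, is it a literal substring of any search string?
keep <- vapply(ref_df$reference,
               function(x) any(grepl(x, tofind, fixed = TRUE)),
               logical(1))

finale <- ref_df[keep, ]
```

Note that this still compares every reference with every search string, so the cost grows with the product of the two lengths.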
You can use group_by and filter from dplyr and str_detect from stringr:
library(dplyr)
library(stringr)
ref_df %>%
mutate(reference = as.character(reference))%>% #reference is factor. Making it a character
group_by(reference) %>%
filter(any(str_detect(tofind_df,reference)))%>% #Finding if there are any matches between each reference name and any of the strings in the tofind_df
ungroup()
# A tibble: 5 x 2
reference values
<chr> <dbl>
1 swietenia macrophylla -0.456
2 azadirachta indica -1.08
3 tamarindus indica -0.428
4 paquita quinata -0.937
5 dipteryx oleifera 0.816
Consider the following dataframe slice:
df = data.frame(locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
row.names = c("a091", "b231", "a234", "d154"))
df
locations score
a091 argentina 1
b231 brazil 2
a234 argentina 3
d154 denmark 4
sorted = c("a234","d154","a091") #in my real task these strings are provided from an exogenous function
df2 = df[sorted,] #quick and simple subset using rownames
EDIT: Here I'm trying to subset AND order the data according to sorted - sorry that was not clear before. So the output, importantly, is:
locations score
a234 argentina 3
d154 denmark 4
a091 argentina 1
And not as you would get from a simple subset operation:
locations score
a091 argentina 1
a234 argentina 3
d154 denmark 4
I'd like to do the exactly same thing in dplyr. Here is an inelegant hack:
require(dplyr)
dt = as_tibble(df)
rownames(dt) = rownames(df)
Warning message:
Setting row names on a tibble is deprecated.
dt2 = dt[sorted,]
I'd like to do it properly, where the rownames are an index in the data table:
dt_proper = as_tibble(x = df,rownames = "index")
dt_proper2 = dt_proper %>% ?some_function(index, sorted)? #what would this be?
dt_proper2
# A tibble: 3 x 3
index locations score
<chr> <fct> <int>
1 a234 argentina 3
2 d154 denmark 4
3 a091 argentina 1
But I can't for the life of me figure out how to do this using filter or some other dplyr function, and without some convoluted conversion to factor, re-order factor levels, etc.
Hi,
you can use mutate to copy the row names of your data frame into an index column, filter that column to the ids in the vector "sorted", and then order the rows by their position in "sorted":
library(dplyr)
df2 <- df %>% mutate(index=row.names(.)) %>% filter(index %in% sorted)
# note the trailing comma: this reorders rows, not columns
df2 <- df2[order(match(df2[,"index"], sorted)),]
I think I've figured it out:
dt_proper2 = dt_proper[match(sorted,dt_proper$index),]
This seems to be the shortest equivalent of what df[sorted,] does.
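A small sketch of how the match-based subset behaves, using a plain data frame with an index column in place of the tibble. The id "zzz9" is a hypothetical value that is not in the data, added only to show that match returns NA for it; pos and result are illustrative names.

```r
dt <- data.frame(index = c("a091", "b231", "a234", "d154"),
                 locations = c("argentina", "brazil", "argentina", "denmark"),
                 score = 1:4,
                 stringsAsFactors = FALSE)

# "zzz9" is a hypothetical id with no row in dt
sorted <- c("a234", "d154", "a091", "zzz9")

# Row position of each requested id, in the requested order
pos <- match(sorted, dt$index)   # 3 4 1 NA

# Drop ids that were not found, then subset and order in one step
result <- dt[pos[!is.na(pos)], ]
```

With dplyr loaded, the same positions could be passed to slice, for example dt %>% slice(match(sorted, index)), assuming every id in sorted is present.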
As far as I know, functions in the tidyverse (dplyr, tibble, etc.) are built around the concept that rows only contain attributes (columns) and have no row names, labels, or indexes. So in order to sort rows in a custom order, you have to introduce a new column containing the rank of each row.
The way I would do it is to create another tibble containing your "sorting information" (sorting attribute, rank) and inner join it to your original tibble. Then I could order the rows by rank.
library(tidyverse)
# note that I've changed the third column's name to avoid confusion
df = tibble(
locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
custom_id = c("a091", "b231", "a234", "d154")
)
sorted_ids = c("a234","d154","a091")
sorting_info = tibble(
custom_id = sorted_ids,
rank = 1:length(sorted_ids)
)
ordered_ids = df %>%
inner_join(sorting_info) %>%
arrange(rank) %>%
select(-rank)