Manipulating variable values using values from another data frame - r

I have a dataset, df, where columns consist of various chemicals and rows consist of samples identified by their id and the concentration of each chemical.
I need to correct the chemical concentrations using a unique value for each chemical, which are found in another dataset, df2.
Here's a minimal df1 dataset:
df1 <- read.table(text="id,chem1,chem2,chem3,chemA,chemB
1,0.5,1,5,4,3
2,1.5,0.5,2,3,4
3,1,1,2.5,7,1
4,2,5,3,1,7
5,3,4,2.3,0.7,2.3",
header = TRUE,
sep=",")
and here is a df2 example:
df2 <- read.table(text="chem,value
chem1,1.7
chem2,2.3
chem3,4.1
chemA,5.2
chemB,2.7",
header = TRUE,
sep = ",")
What I need to do is to divide all observations of chem1 in df1 by the value provided for chem1 in df2, repeated for each chemical. In reality, chemical names are not sequential, and there's roughly 30 chemicals.
Previously I would have done this using Excel and index/match but I'm looking to make my methods more reproducible, hence fighting my way through with R. I mostly do data manipulation with dplyr, so if there's a tidyverse solution out there, that would be great!
Thankful for any help

We can use the 'chem' column from 'df2' to subset the 'df1', divide by the 'value' column of 'df2' replicated to make the lengths same and update the columns of 'df1' by assigning the results back
df1[as.character(df2$chem)] <- df1[as.character(df2$chem)]/df2$value[col(df1[-1])]

Using reshape2 package, the data frame can be changed to long format to merge with the df2 as follows. (Note that the example df introduce some whitespace that are filtered in this solution)
library(reshape2)
df1 <- read.table(text="id,chem1,chem2,chem3,chemA,chemB
1,0.5,1,5,4,3
2,1.5,0.5,2,3,4
3,1,1,2.5,7,1
4,2,5,3,1,7
5,3,4,2.3,0.7,2.3",
header = TRUE,
sep=",",stringsAsFactors = F)
df2 <- read.table(text="chem,value
chem1,1.7
chem2,2.3
chem3,4.1
chemA,5.2
chemB,2.7",
header = TRUE,
sep = ",",stringsAsFactors = F)
df2$chem <- gsub("\\s+","",df2$chem) #example introduces whitespaces in the names
df1A <- melt(df1,id.vars=c("id"),variable.name="chem")
combined <- merge(x=df1A,y=df2,by="chem",all.x=T)
combined$div <- combined$value.x/combined$value.y
head(combined)
chem id value.x value.y div
1 chem1 1 0.5 1.7 0.2941176
2 chem1 2 1.5 1.7 0.8823529
3 chem1 3 1.0 1.7 0.5882353
4 chem1 4 2.0 1.7 1.1764706
5 chem1 5 3.0 1.7 1.7647059
6 chem2 1 1.0 2.3 0.4347826
or in wide format:
> dcast(combined[,c("id","chem","div")],id ~ chem,value.var="div")
id chem1 chem2 chem3 chemA chemB
1 1 0.2941176 0.4347826 1.2195122 0.7692308 1.1111111
2 2 0.8823529 0.2173913 0.4878049 0.5769231 1.4814815
3 3 0.5882353 0.4347826 0.6097561 1.3461538 0.3703704
4 4 1.1764706 2.1739130 0.7317073 0.1923077 2.5925926
5 5 1.7647059 1.7391304 0.5609756 0.1346154 0.8518519

Here's a tidyverse solution.
df3 <- df1 %>%
# convert the data from wide to long to make the next step easier
gather(key = chem, value = value, -id) %>%
# do your math, using 'match' to map values from df2 to rows in df3
mutate(value = value/df2$value[match(df3$chem, df2$chem)]) %>%
# return the data to wide format if that's how you prefer to store it
spread(chem, value)

Related

Melt Data and fill new column with desired data

Hello coding community
I have a two part question that is 1/2 answered
transpose, aka melt data frame, to my liking - done
add rows of data based on results found in "removed" column, a column created in the transposing step - stuck here
df<- read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t")
df_transformed<-tidyr::gather(df, day, removed, -(1:2), na.rm = TRUE) # melted data
In my example here (df), I have an experiment ran over 8 days. On certain days, I remove data points, and I am only interested in these days (hence why I added na.rm = TRUE in the transposing process). I sometimes remove 1 data point, or 4 (but this could be any number really)
I would like the removed data points to be called "individuals", and for them to be counted in chronological order. Therefore, I first need to add a column called "individuals"
df_transformed$individual <- ""
I would like to fill in the "individual" column based on the results in the "removed" column.
example: cage 2 had only 1 data point removed, and it was on day_8. I would therefore like to add, in the "individual" column, a 1. Cage 4, on the other hand, had data points removed on day_5 (1 data point) and day_7 (3 data points), for a total of 4 data points , aka , 4 "individuals". Therefore, Cage 4, when starting with day_5, I would like to add a 1 in the "individuals" column, and for day 7, create 3 total rows of data, and continue my "individual count" with 2,3,4. IF day_8 had 3 more data points removed, the individual count would continue with 5,6,7.
My desired result for my example data set today would be this:
desired_results <- read.table("https://pastebin.com/raw/r7QrC0y3", header=T, sep="\t") # 68 total rows of data
Interesting piece of information: The total number of rows in my final data set should equal the sum of all removed data points:
sum(df_transformed$removed) # 68
Thank you StackOverflow community. Looking forward to seeing the results.
We can use complete to create a sequence from 1 to each individual grouped by cage and day. We then fill the NA values in columns experiment and removed.
library(dplyr)
library(tidyr)
df_transformed %>%
mutate(individual = removed) %>%
group_by(cage, day) %>%
complete(individual = seq_len(individual)) %>%
fill(experiment, removed, .direction = "up")
# cage day individual experiment removed
#1 2 day_8 1 sugar 1
#2 3 day_5 1 sugar 1
#3 4 day_5 1 sugar 3
#4 4 day_5 2 sugar 3
#5 4 day_5 3 sugar 3
#6 4 day_7 1 sugar 1
#7 7 day_7 1 sugar 1
#8 7 day_8 1 sugar 1
#9 8 day_5 1 sugar 2
#10 8 day_5 2 sugar 2
# … with 58 more rows
To update individual only based on cage we can do
df_transformed %>%
mutate(individual = removed) %>%
group_by(cage, day) %>%
complete(individual = seq_len(individual)) %>%
group_by(cage) %>%
mutate(individual = row_number()) %>%
fill(experiment, removed, .direction = "up")
I think the following bit of code does what you need:
library(tidyverse)
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE))
I have used the pipe operator (%>%), which enables cleaner syntax. I have also used the newer pivot_longer function instead of gather. Then, grouping by cage and later summing over the individual column with summarize you get how many individuals were removed per cage.
I checked the sum of all the individuals and it seems to work:
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE)) %>%
pull(individual) %>%
sum()
#> [1] 68
The result is slightly different to your desired result. I am not 100% your desired result is actually correct... From your question, I understand that cage 4 should have 4 individuals, but in your desired_result it appears 4 times with values 1, 2, 3 and 4. The code I sent you generates a data frame where each appears in a single row.

How can i add more columns in dataframe by for loop

I am beginner of R. I need to transfer some Eviews code to R. There are some loop code to add 10 or more columns\variables with some function in data in Eviews.
Here are eviews example code to estimate deflator:
for %x exp con gov inv cap ex im
frml def_{%x} = gdp_{%x}/gdp_{%x}_r*100
next
I used dplyr package and use mutate function. But it is very hard to add many variables.
library(dplyr)
nominal_gdp<-rnorm(4)
nominal_inv<-rnorm(4)
nominal_gov<-rnorm(4)
nominal_exp<-rnorm(4)
real_gdp<-rnorm(4)
real_inv<-rnorm(4)
real_gov<-rnorm(4)
real_exp<-rnorm(4)
df<-data.frame(nominal_gdp,nominal_inv,
nominal_gov,nominal_exp,real_gdp,real_inv,real_gov,real_exp)
df<-df %>% mutate(deflator_gdp=nominal_gdp/real_gdp*100,
deflator_inv=nominal_inv/real_inv,
deflator_gov=nominal_gov/real_gov,
deflator_exp=nominal_exp/real_exp)
print(df)
Please help me to this in R by loop.
The answer is that your data is not as "tidy" as it could be.
This is what you have (with an added observation ID for clarity):
library(dplyr)
df <- data.frame(nominal_gdp = rnorm(4),
nominal_inv = rnorm(4),
nominal_gov = rnorm(4),
real_gdp = rnorm(4),
real_inv = rnorm(4),
real_gov = rnorm(4))
df <- df %>%
mutate(obs_id = 1:n()) %>%
select(obs_id, everything())
which gives:
obs_id nominal_gdp nominal_inv nominal_gov real_gdp real_inv real_gov
1 1 -0.9692060 -1.5223055 -0.26966202 0.49057546 2.3253066 0.8761837
2 2 1.2696927 1.2591910 0.04238958 -1.51398652 -0.7209661 0.3021453
3 3 0.8415725 -0.1728212 0.98846942 -0.58743294 -0.7256786 0.5649908
4 4 -0.8235101 1.0500614 -0.49308092 0.04820723 -2.0697008 1.2478635
Consider if you had instead, in df2:
obs_id variable real nominal
1 1 gdp 0.49057546 -0.96920602
2 2 gdp -1.51398652 1.26969267
3 3 gdp -0.58743294 0.84157254
4 4 gdp 0.04820723 -0.82351006
5 1 inv 2.32530662 -1.52230550
6 2 inv -0.72096614 1.25919100
7 3 inv -0.72567857 -0.17282123
8 4 inv -2.06970078 1.05006136
9 1 gov 0.87618366 -0.26966202
10 2 gov 0.30214534 0.04238958
11 3 gov 0.56499079 0.98846942
12 4 gov 1.24786355 -0.49308092
Then what you want to do is trivial:
df2 %>% mutate(deflator = real / nominal)
obs_id variable real nominal deflator
1 1 gdp 0.49057546 -0.96920602 -0.50616221
2 2 gdp -1.51398652 1.26969267 -1.19240392
3 3 gdp -0.58743294 0.84157254 -0.69801819
4 4 gdp 0.04820723 -0.82351006 -0.05853872
5 1 inv 2.32530662 -1.52230550 -1.52749012
6 2 inv -0.72096614 1.25919100 -0.57256297
7 3 inv -0.72567857 -0.17282123 4.19901294
8 4 inv -2.06970078 1.05006136 -1.97102841
9 1 gov 0.87618366 -0.26966202 -3.24919196
10 2 gov 0.30214534 0.04238958 7.12782060
11 3 gov 0.56499079 0.98846942 0.57158146
12 4 gov 1.24786355 -0.49308092 -2.53074800
So the question becomes: how do we get to the nice dplyr-compatible data.frame.
You need to gather your data using tidyr::gather. However, because you have 2 sets of variables to gather (the real and nominal values), it is not straightforward. I have done it in two steps, there may be a better way though.
real_vals <- df %>%
select(obs_id, starts_with("real")) %>%
# the line below is where the magic happens
tidyr::gather(variable, real, starts_with("real")) %>%
# extracting the variable name (by erasing up to the underscore)
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Same thing for nominal values
nominal_vals <- df %>%
select(obs_id, starts_with("nominal")) %>%
tidyr::gather(variable, nominal, starts_with("nominal")) %>%
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Merging them... Now we have something we can work with!
df2 <-
full_join(real_vals, nominal_vals, by = c("obs_id", "variable"))
Note the importance of the observation id when merging.
We can grep the matching names, and sort:
x <- colnames(df)
df[ sort(x[ (grepl("^nominal", x)) ]) ] /
df[ sort(x[ (grepl("^real", x)) ]) ] * 100
Similarly, if the columns were sorted, then we could just:
df[ 1:4 ] / df[ 5:8 ] * 100
We can loop over column names using purrr::map_dfc then apply a custom function over the selected columns (i.e. the columns that matched the current name from nms)
library(dplyr)
library(purrr)
#Replace anything before _ with empty string
nms <- unique(sub('.*_','',names(df)))
#Use map if you need the ouptut as a list not a dataframe
map_dfc(nms, ~deflator_fun(df, .x))
Custom function
deflator_fun <- function(df, x){
#browser()
nx <- paste0('nominal_',x)
rx <- paste0('real_',x)
select(df, matches(x)) %>%
mutate(!!paste0('deflator_',quo_name(x)) := !!ensym(nx) / !!ensym(rx)*100)
}
#Test
deflator_fun(df, 'gdp')
nominal_gdp real_gdp deflator_gdp
1 -0.3332074 0.181303480 -183.78433
2 -1.0185754 -0.138891362 733.36121
3 -1.0717912 0.005764186 -18593.97398
4 0.3035286 0.385280401 78.78123
Note: Learn more about quo_name, !!, and ensym which they are tools for programming with dplyr here

Is it possible to create a row containing totals for some columns and averages for others?

I have a dataframe dealing with time series data in which some some columns represent amounts, and some columns represent percentages. I want a row which summarizes each column, but obviously it is not particularly useful for me to sum the columns containing percentages.
Here is an example dataframe:
date<-c("2019-04-27", "2019-04-28", "2019-05-01")
name<-c("sam", "sam", "sam")
amt1<-c(3,6,2)
amt2<-c(4,2,7)
percent1<-c(0.25, 0.7, 0.42)
amt3<-c(13,7,4)
percent2<-c(0.54, 0.48, 0.77)
df<-data.frame(date,name, amt1, amt2, percent1, amt3, percent2)
df$date<-as.Date(df$date)
What I would like is a row that contains:
-the sums for columns amt1, amt2, amt3
-the means for columns percent1, percent2.
Anyone have any ideas of how to accomplish this?
One option would be to select the numeric columns (select_if), then using mutate_if, get the mean of those columns where the values are all less than 1, in the next step, do the sum of columns where there are any value greater than 1. (Disclaimer - The OP said there are no column name patterns or index and this is one of the possible logic given by the OP)
library(tidyverse)
df %>%
select_if(is.numeric) %>%
mutate_if(~ all(.x < 1), mean) %>%
mutate_if(~ any(.x > 1), sum) %>%
slice(1) %>%
bind_rows(df, .) %>%
mutate(name = replace(as.character(name), n(), "Other"))
# date name amt1 amt2 percent1 amt3 percent2
#1 2019-04-27 sam 3 4 0.2500000 13 0.5400000
#2 2019-04-28 sam 6 2 0.7000000 7 0.4800000
#3 2019-05-01 sam 2 7 0.4200000 4 0.7700000
#4 <NA> Other 11 13 0.4566667 24 0.5966667

Carrying out a simple dataframe subset with dplyr

Consider the following dataframe slice:
df = data.frame(locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
row.names = c("a091", "b231", "a234", "d154"))
df
locations score
a091 argentina 1
b231 brazil 2
a234 argentina 3
d154 denmark 4
sorted = c("a234","d154","a091") #in my real task these strings are provided from an exogenous function
df2 = df[sorted,] #quick and simple subset using rownames
EDIT: Here I'm trying to subset AND order the data according to sorted - sorry that was not clear before. So the output, importantly, is:
locations score
a234 argentina 1
d154 denmark 4
a091 argentina 3
And not as you would get from a simple subset operation:
locations score
a091 argentina 1
a234 argentina 3
d154 denmark 4
I'd like to do the exactly same thing in dplyr. Here is an inelegant hack:
require(dplyr)
dt = as_tibble(df)
rownames(dt) = rownames(df)
Warning message:
Setting row names on a tibble is deprecated.
dt2 = dt[sorted,]
I'd like to do it properly, where the rownames are an index in the data table:
dt_proper = as_tibble(x = df,rownames = "index")
dt_proper2 = dt_proper %>% ?some_function(index, sorted)? #what would this be?
dt_proper2
# A tibble: 3 x 3
index locations score
<chr> <fct> <int>
1 a091 argentina 1
2 d154 denmark 4
3 a234 argentina 3
But I can't for the life of me figure out how to do this using filter or some other dplyr function, and without some convoluted conversion to factor, re-order factor levels, etc.
Hy,
you can simply use mutate and filter to get the row.names of your data frame into a index column and filter to the vector "sorted" and sort the data frame due to the vector "sorted":
df2 <- df %>% mutate(index=row.names(.)) %>% filter(index %in% sorted)
df2 <- df2[order(match(df2[,"index"], sorted))]
I think I've figured it out:
dt_proper2 = dt_proper[match(sorted,dt_proper$index),]
Seems to be shortest implementation of what df[sorted,] will do.
Functions in the tidyverse (dplyr, tibble, etc.) are built around the concept (as far as I know), that rows only contain attributes (columns) and no row names / labels / indexes. So in order to sort columns, you have to introduce a new column containing the ranks of each row.
The way I would do it is to create another tibble containing your "sorting information" (sorting attribute, rank) and inner join it to your original tibble. Then I could order the rows by rank.
library(tidyverse)
# note that I've changed the third column's name to avoid confusion
df = tibble(
locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
custom_id = c("a091", "b231", "a234", "d154")
)
sorted_ids = c("a234","d154","a091")
sorting_info = tibble(
custom_id = sorted_ids,
rank = 1:length(sorted_ids)
)
ordered_ids = df %>%
inner_join(sorting_info) %>%
arrange(rank) %>%
select(-rank)

Calling a data.frame from global.env and adding a column with the data.frame name

I have a dataset consisting of pairs of data.frames (which are almost exact pairs, but not enough to merge directly) which I need to munge together. Luckily, each df has an identifier for the date it was created which can be used to reference the pair. E.g.
df_0101 <- data.frame(a = rnorm(1:10),
b = runif(1:10))
df_0102 <- data.frame(a = rnorm(5:20),
b = runif(5:20))
df2_0101 <- data.frame(a2 = rnorm(1:10),
b2 = runif(1:10))
df2_0102 <- data.frame(a2 = rnorm(5:20),
b2 = runif(5:20))
Therefore, the first thing I need to do is mutate a new column on each data.frame consisting of this date (01_01/ 01_02 / etc.) i.e.
df_0101 <- df_0101 %>%
mutate(df_name = "df_0101")
but obviously in a programmatic manner.
I can call every data.frame in the global environment using
l_df <- Filter(function(x) is(x, "data.frame"), mget(ls()))
head(l_df)
$df_0101
a b
1 0.7588803 0.17837296
2 -0.2592187 0.45445752
3 1.2221744 0.01553190
4 1.1534353 0.72097071
5 0.7279514 0.96770448
$df_0102
a b
1 -0.33415584 0.53597308
2 0.31730849 0.32995013
3 -0.18936533 0.41024220
4 0.49441962 0.22123885
5 -0.28985964 0.62388478
$df2_0101
a2 b2
1 -0.5600229 0.6283224
2 0.5944657 0.7384586
3 1.1284180 0.4656239
4 -0.4737340 0.1555984
5 -0.3838161 0.3373913
$df2_0102
a2 b2
1 -0.67987149 0.65352466
2 1.46878953 0.47135011
3 0.10902751 0.04460594
4 -1.82677732 0.38636357
5 1.06021443 0.92935144
but no idea how to then pull the names of each df down into a new column on each. Any ideas?
Thanks for reading,
We can use Map in base R
Map(cbind, names = names(l_df), l_df)
If we are going by the tidyverse way, then
library(tidyverse)
map2(names(l_df), l_df, ~(cbind(names = .x, .y)))
Also, this can be created a single dataset with bind_rows
bind_rows(l_df, .id = "names")

Resources