Divide 2 columns from 2 different dataframes - r

Does anybody know how to divide two columns from two different dataframes when there are multiple columns to id from?
Example:
library(dplyr)
name <- c('A','A',
'B','B')
month = c("oct 2018", "nov 2018",
"oct 2018", "nov 2018")
var1 = c("99", "99",
"99", "99")
value <- seq(1:length(month))
df1 = data.frame(name, month, var1, value)
df2 = df1
df2["var1"] = c("992", "992", "992", "992")
df2["value"] = c(2, 4, 6, 8)
df1
df2
Output
> df1
name month var1 value
1 A oct 2018 99 1
2 A nov 2018 99 2
3 B oct 2018 99 3
4 B nov 2018 99 4
> df2
name month var1 value
1 A oct 2018 992 2
2 A nov 2018 992 4
3 B oct 2018 992 6
4 B nov 2018 992 8
Does anybody know how to create a new dataframe that divides the "value"-column in df2 by the value column of df1? The method should be possible to use also when there are more columns than in the current example.

In base R, we can do merge
df3 <- merge(df1, df2, by = c("name", "month"))
df3$value <- df3$value.x/df3$value.y
df3
# name month var1.x value.x var1.y value.y value
#1 A nov 2018 99 2 992 4 0.5
#2 A oct 2018 99 1 992 2 0.5
#3 B nov 2018 99 4 992 8 0.5
#4 B oct 2018 99 3 992 6 0.5
You can drop value.x and value.y column if they are not needed.

Join the two data frames together and then perform the division and drop unwanted columns that were generated by the join (assuming you want computed value column to replace the value columns from the original data frames). Depending on what you want you may need a different *_join.
library(dplyr)
df1 %>%
inner_join(df2, by = c("name", "month")) %>%
mutate(value = value.x / value.y) %>%
select(-value.x, -value.y)
giving:
name month var1.x var1.y value
1 A oct 2018 99 992 0.5
2 A nov 2018 99 992 0.5
3 B oct 2018 99 992 0.5
4 B nov 2018 99 992 0.5

We can use data.table as well to do a join and create the column 'value' by dividing the 'value' column by the corresponding column in the other dataset while joining on 'name' and 'month'
library(data.table)
df3 <- copy(df1)
setDT(df3)[df2, value := value/i.value, on = .(name, month)]
df3
# name month var1 value
#1: A oct 2018 99 0.5
#2: A nov 2018 99 0.5
#3: B oct 2018 99 0.5
#4: B nov 2018 99 0.5

Related

How to rearrange daily stream discharge data into monthly format and rank the discharge values for each month using R

I have a data set of daily stream discharge values from a gauging station for approximately 50 years. The data is arranged into three columns, namely, "date", "month", "discharge".(Sample data shown here)
`
Date<- as.Date(c('1938-10-01','1954-10-27', '1967-06-16','1943-01-01','1945-01-14','1945-03-14','1954-05-04','1960-04-23','1960-05-09','1962-01-18','1968-12-19','1972-01-15','1977-08-15','1981-04-11','1986-06-20','1989-01-20','1992-03-29'))
> Months<- c('Oct','Oct','Jun','Jan','Jan','Mar','May','Apr','May','Jan','Dec','Jan','Aug','Apr','Jun','Jan','Mar')
> Dis<-c('1000','1200','400','255','450','215','360','120','145','1204','752','635','1456','154','154','1204','450')
> Sampledata<-data.frame("Date"=Date,"Months"=Months,"Disch"=Dis)
> print(Sampledata)
Date Months Disch
1 1938-10-01 Oct 1000
2 1954-10-27 Oct 1200
3 1967-06-16 Jun 400
4 1943-01-01 Jan 255
5 1945-01-14 Jan 450
6 1945-03-14 Mar 215
7 1954-05-04 May 360
8 1960-04-23 Apr 120
9 1960-05-09 May 145
10 1962-01-18 Jan 1204
11 1968-12-19 Dec 752
12 1972-01-15 Jan 635
13 1977-08-15 Aug 1456
14 1981-04-11 Apr 154
15 1986-06-20 Jun 154
16 1989-01-20 Jan 1204
17 1992-03-29 Mar 450
I want to calculate ranks for each month separately for all the years. For example: Calculate rank in ascending order for the month of January for 50 years. With the same rank value assigned to a duplicate discharge value. Desired output shown here:
> Date Month Disch Rank
1 1943-01-01 Jan 255 1
2 1945-01-14 Jan 450 2
3 1962-01-18 Jan 1204 4
4 1972-01-15 Jan 635 3
5 1989-01-20 Jan 1204 4
> Date Month Disch Rank
1 1945-03-14 Mar 215 1
2 1992-03-29 Mar 450 2
3 2001-03-19 Mar 450 2
Without using any packages first convert columns 2 and 3 to numeric and then use ave and rank with the indicated ties method. Finally order the result.
Note that the output shown in the question does not correspond to the input, e.g. there are three Mar rows in the output but only two such rows in the input so this will correspond to the input but will not be identical to the output shown.
Sampledata2 <- transform(Sampledata,
Disch = as.numeric(as.character(Disch)),
Months = as.numeric(format(Date, "%m")))
Rank <- function(x) rank(x, ties = "min")
Sampledata3 <- transform(Sampledata2,
Rank = ave(Disch, Months, FUN = Rank))
o <- with(Sampledata3, order(Months, Date))
Sampledata3[o, ]
An option would be to group by 'Month' and use one of the ranking functions (dense_rank, row_number(), min_rank - based on the needs) to rank the 'Discharge' column
library(dplyr)
df1 %>%
group_by(Month) %>%
mutate(Rank = dense_rank(Discharge))

R, dplyr: How to divide date frame elements by specific elements

edit: Solution at the end.
I have a dataframe that contains different variables and the sum of these different variables as a variable called "total".
I want to add a new column that calculates each variables' share of the "total"-variable.
Example:
library(dplyr)
name <- c('A','A',
'B','B')
month = c("oct 2018", "nov 2018",
"oct 2018", "nov 2018")
value <- seq(1:length(month))
df = data.frame(name, month, value)
# Create total variable
dfTotal =
df%>%
group_by_("month")%>%
summarize(value = sum(value, na.rm = TRUE))
dfTotal[["name"]] <- "Total"
dfTotal = as.data.frame(dfTotal)
# Add total column to dataframe
df2 = rbind(df, dfTotal)
df2
which gives the dataframe
name month value
1 A oct 2018 1
2 A nov 2018 2
3 B oct 2018 3
4 B nov 2018 4
5 Total nov 2018 6
6 Total oct 2018 4
What I want is to produce a new column with the shares of the total for each month in the above dataframe, so that I get something like
name month value share
1 A oct 2018 1 0.25 (=1/4)
2 A nov 2018 2 0.33 (=2/6)
3 B oct 2018 3 0.75 (=3/4)
4 B nov 2018 4 0.67 (=4/6)
5 Total nov 2018 6 1.00 (=6/6)
6 Total oct 2018 4 1.00 (=4/4)
Does anybody know how I from the first dataframe can produce the last column in the second dataframe?
Solution:
Based on tmfmnk's comment, the following solves the problem:
df2 =
df2 %>%
group_by(month) %>%
mutate(share = value/max(value))
df2
which gives
name month value share
<fct> <fct> <int> <dbl>
1 A oct 2018 1 0.25
2 A nov 2018 2 0.333
3 B oct 2018 3 0.75
4 B nov 2018 4 0.667
5 Total nov 2018 6 1
6 Total oct 2018 4 1

Use dplyr/tidyr to turn rows into columns in R data frame

I have a data frame like this:
year <-c(floor(runif(100,min=2015, max=2017)))
month <- c(floor(runif(100, min=1, max=13)))
inch <- c(floor(runif(100, min=0, max=10)))
mm <- c(floor(runif(100, min=0, max=100)))
df = data.frame(year, month, inch, mm);
year month inch mm
2016 11 0 10
2015 9 3 34
2016 6 3 33
2015 8 0 77
I only care about the columns year, month, and mm.
I need to re-arrange the data frame so that the first column is the name of the month and the rest of the columns is the value of mm.
Months 2015 2016
Jan # #
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
So two things needs to happen.
(1) The month needs to become a string of the first three letters of the month.
(2) I need to group by year, and then put the mm values in a column under that year.
So far I have this code, but I can't figure it out:
df %>%
select(-inch) %>%
group_by(month) %>%
summarize(mm = mm) %>%
ungroup()
To convert month to names, you can refer to month.abb; And then you can summarize by year and month, spread to wide format:
library(dplyr)
library(tidyr)
df %>%
group_by(year, month = month.abb[month]) %>%
summarise(mm = mean(mm)) %>% # use mean as an example, could also be sum or other
# intended aggregation methods
spread(year, mm) %>%
arrange(match(month, month.abb)) # rearrange month in chronological order
# A tibble: 12 x 3
# month `2015` `2016`
# <chr> <dbl> <dbl>
# 1 Jan 65.50000 28.14286
# 2 Feb 54.40000 30.00000
# 3 Mar 23.50000 95.00000
# 4 Apr 7.00000 43.60000
# 5 May 45.33333 44.50000
# 6 Jun 70.33333 63.16667
# 7 Jul 72.83333 52.00000
# 8 Aug 53.66667 66.50000
# 9 Sep 51.00000 64.40000
#10 Oct 74.00000 39.66667
#11 Nov 66.20000 58.71429
#12 Dec 38.25000 51.50000

arrange one below the other every 2 columns from data frame in R

Hi I have a df as below which show date and their respected
date 1_val date 2_val . . . . date n_val
2014 23 2014 33 . . . . 2014 34
2015 22 2016 12 . . . . 2016 99
i have tried with hard coding to arrange the columns one below the other
for 1&2 columns
a=1
b=2
names_2<-df[,c(a,b)]
colnames(names_2)[1]<-"Date"
names_2 <- names_2[!apply(is.na(names_2) | names_2 == "", 1, all),]
names_2<-melt(names_2,id=colnames(names_2)[1])
samp_out<-names_2
for 3&4 columns
a=3
b=4
names_2<-df[,c(a,b)]
colnames(names_2)[1]<-"Date"
names_2 <- names_2[!apply(is.na(names_2) | names_2 == "", 1, all),]
names_2<-melt(names_2,id=colnames(names_2)[1])
samp_out1<-names_2
till n-numbers
df1= rbind(samp_out,samp_out1,......samp_out_n)
output
date variable value
2014 1_val 23
2015 1_val 22
2014 2_val 33
2016 2_val 12
.
.
2014 n_val 34
2016 n_val 99
Thanks in advance
The function melt in the package data.table does that:
melt(df, id = "Date", measure = patterns("_val"))
You can specify the name of the variable to pivot on (Date in this case) and a pattern in the variables you want to keep the values of. You can also supply a vector with all the variablenames instead.
> DT <- data.table(Date = c(2014,2013), `1_val` = c(33, 32), Date = c(2014, 2013), `2_val` = c(65, 34))
> DT
Date 1_val Date 2_val
1: 2014 33 2014 65
2: 2013 32 2013 34
> melt(DT, id = "Date", measure = patterns("_val"))
Date variable value
1: 2014 1_val 33
2: 2013 1_val 32
3: 2014 2_val 65
4: 2013 2_val 34
You can use stack from base R,
setNames(data.frame(stack(df[c(TRUE, FALSE)])[1],
stack(df[c(FALSE, TRUE)])),
c('date', 'value', 'variable'))
# date value variable
#1 2014 33 1_val
#2 2013 32 1_val
#3 2014 65 2_val
#4 2013 34 2_val
Define the untidy rectangle
library(magrittr)
csv <- "date,1_val,date,2_val,date,3_val
2014,23,2014,33,2014,34
2015,22,2016,12,2016,99"
Read into a data frame, then transform into a long/eav rectangle.
ds_eav <- csv %>%
readr::read_csv() %>%
tibble::rownames_to_column(var="height") %>%
tidyr::gather(key=key, value=value, -height)
output:
# A tibble: 12 x 4
key index value height
<chr> <int> <int> <int>
1 date 1 2014 1
2 date 1 2015 2
3 value 1 23 1
4 value 1 22 2
5 date 2 2014 1
6 date 2 2016 2
7 value 2 33 1
8 value 2 12 2
9 date 3 2014 1
10 date 3 2016 2
11 value 3 34 1
12 value 3 99 2
Identify which rows are dates/values. Then shift up dates' index by 1.
ds_eav <- ds_eav %>%
dplyr::mutate(
index_val = sub("^(\\d+)_val$" , "\\1", key),
index_date = sub("^date_(\\d+)$", "\\1", key),
index_date = dplyr::if_else(key=="date", "0", index_date),
key = dplyr::if_else(grepl("^date(_\\d+)*", key), "date", "value"),
index = dplyr::if_else(key=="date", index_date, index_val),
index = as.integer(index),
index = index + dplyr::if_else(key=="date", 1L, 0L)
) %>%
dplyr::select(key, index, value, height)
Follow the advice of #jarko-dubbeldam and use spread/gather on the last step too
ds_eav %>%
tidyr::spread(key=key, value=value)
output:
# A tibble: 6 x 4
index height date value
* <int> <int> <int> <int>
1 1 1 2014 23
2 1 2 2015 22
3 2 1 2014 33
4 2 2 2016 12
5 3 1 2014 34
6 3 2 2016 99
You can use paste0(index, "_val") to get you exact output. But I'd prefer to keep them as integers, so you can do math on them in necessary (eg, max()).
edit 1: incorporate the advice & corrections of #jarko-dubbeldam and #hnskd.
edit 2: use rownames_to_column() in case the input isn't a balanced rectangle (eg, one column doesn't all all the rows).

How do I update data from an incomplete lookup table?

I have a table that uses unique IDs but inconsistent readable names for those IDs. It is more complex than month names, but for the sake of a more simple example, let's say it looks something like this:
demo_frame <- read.table(text=" Month_id Month_name Number
1 Jan 37
2 Feb 63
3 March 9
3 Mar 150
2 February 49", header=TRUE)
Except that they might have spelled "Feb" or "March" eight different ways. I also have a clean data frame that contains consistent names for the names that have variations:
month_lookup <- read.table(text=" Month_id Month_name
2 Feb
3 Mar", header=TRUE)
I want to get to this:
1 Jan 37
2 Feb 63
3 Mar 9
3 Mar 150
2 Feb 49"
I tried merge(month_lookup, demo_frame, by = "Month_id") but that dropped all the January values because "Jan" doesn't exist in the lookup table:
Month_id Month_name.x Month_name.y Number
1 2 Feb Feb 63
2 2 Feb February 49
3 3 Mar March 9
4 3 Mar Mar 150
My read of How to replace data.frame column names with string in corresponding lookup table in R is that I ought to be able to use plyr::mapvalues but I'm unclear from examples and documentation on how I'd map the id to the name. I don't just want to say "Replace 'March' with 'Mar'" -- I need to say SET month_name = 'Mar' WHERE month_id = 3 for each value in lookup.
I think you want this.
library(dplyr)
demo_frame <- read.table(text=" Month_id Month_name Number
1 Jan 37
2 Feb 63
3 March 9
3 Mar 150
2 February 49", header=TRUE, stringsAsFactors = FALSE)
month_lookup <- read.table(text=" Month_id Month_name
2 Feb
3 Mar", header=TRUE, stringsAsFactors = FALSE)
result =
demo_frame %>%
rename(bad_month = Month_name) %>%
left_join(month_lookup) %>%
mutate(month_fix =
Month_name %>%
is.na %>%
ifelse(bad_month, Month_name) )

Resources