Combine multiple variables into new variable / Split Variable into 3 variables

I need to create a 'key' variable, since I want to combine two datasets.
Dataset1 has the variable ymd.
Dataset2 has the three variables y, m and d.
ymd (20050516,20060512)
y(2005,2006)
m(05,05)
d(16,12)
Two options:
1. Combine y, m and d into the variable ymd.
2. Split the variable ymd into the 3 variables y, m and d.

Assuming you have two data frames:
df1 <- data.frame(
  ymd = c(20050516, 20060512),
  x = c(1, 2)
)
df2 <- data.frame(
  y = c(2005, 2006),
  m = c('05', '05'),
  d = c(16, 12),
  z = c(5, 10)
)
You can merge by pasting together the y, m, and d elements using paste0 and changing to numeric:
library(dplyr)  # provides %>%, mutate() and left_join()
df2 %>%
  mutate(
    ymd = as.numeric(paste0(y, m, d))
  ) %>%
  left_join(df1)
Output:
Joining, by = "ymd"
     y  m  d  z      ymd x
1 2005 05 16  5 20050516 1
2 2006 05 12 10 20060512 2
You can adjust the join type (e.g. right_join) depending on your needs.
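For the second option from the question (splitting ymd rather than pasting y, m and d together), a minimal base R sketch could look like the following. It assumes ymd is always an 8-digit number in YYYYMMDD form, and it uses merge() instead of a dplyr join; both choices are mine, not part of the answer above.

```r
# Toy data mirroring the answer above
df1 <- data.frame(ymd = c(20050516, 20060512), x = c(1, 2))
df2 <- data.frame(y = c(2005, 2006), m = c("05", "05"), d = c(16, 12), z = c(5, 10))

# Split ymd arithmetically: integer division and modulo peel off each part
df1$y <- df1$ymd %/% 10000
df1$m <- (df1$ymd %/% 100) %% 100
df1$d <- df1$ymd %% 100

# m is stored as text in df2, so convert it before joining on numeric keys
df2$m <- as.numeric(df2$m)

# Join on the three key columns
merged <- merge(df1, df2, by = c("y", "m", "d"))
print(merged)
```

Joining on three numeric keys avoids any worry about leading zeros, at the cost of carrying three key columns instead of one.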

Here is an example.
I treat the variables as strings instead of numbers, which makes this easier. You can use as.character() to convert them, as in my example.
For option 1, I use paste0() to paste the text together.
For option 2, I use substr() to cut the text at the correct positions.
If you need the output as numeric rather than string, use as.numeric(), as I did in the print call.
Here is the code; let me know if you have further questions:
ymd <- as.character(c(20050516, 20060512))
y <- as.character(c(2005, 2006))
# as.character(05) gives "5", so pad month and day to two digits explicitly
m <- sprintf("%02d", c(5, 5))
d <- sprintf("%02d", c(16, 12))
## Concatenate y, m, and d together
ymd_concatenated <- paste0(y, m, d)
print(as.numeric(ymd_concatenated))
## Split ymd into single variables (substr() is vectorized, so no loop is needed)
y_split <- substr(ymd, 1, 4)
m_split <- substr(ymd, 5, 6)
d_split <- substr(ymd, 7, 8)
print(y_split)
print(m_split)
print(d_split)
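Since ymd encodes a calendar date, another way to do both conversions is to go through R's Date class. This is only a sketch and assumes every value is a valid date in YYYYMMDD form; an invalid value would become NA.

```r
ymd <- c(20050516, 20060512)

# Convert to text first so as.Date() parses the digits instead of
# treating the number as a day count
dates <- as.Date(as.character(ymd), format = "%Y%m%d")

# format() extracts each component as zero-padded text
y <- format(dates, "%Y")
m <- format(dates, "%m")
d <- format(dates, "%d")

# Going the other way: rebuild the numeric key from the Date vector
ymd_rebuilt <- as.numeric(format(dates, "%Y%m%d"))
print(ymd_rebuilt)
```

Keeping a real Date column around can also help later, for example when sorting or computing differences between dates.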

Related

Sum numeric sub-dataframe within a list

I have an R list of data frames. All the data frames have the same format and dimensions: the first 2 columns are strings, like IDs and names (identical across all data frames), and the rest are numeric values. I want to sum the numeric parts of all the data frames element-wise, i.e. the output at index (1,3) is the sum of the values at index (1,3) of all the data frames.
E.g. given a list L consisting of data frames x and y, I want to get output like z:
x <-data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(1,2),v2=c(3,4))
y <-data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(5,6),v2=c(7,8))
L <- list(x,y)
z <- data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(1+5,2+6),v2=c(3+7,4+8))
I know how to do this with a for loop, but I want to learn a more R-like way, by which I mean using vectorized functions such as the apply family.
My current idea is to create a new data frame with only the ID and name columns, use a global variable to accumulate the sum of the numeric parts, and finally cbind the two parts:
output <- x[,1:2]
num_sum <- matrix(0,nrow=nrow(L[[1]]),ncol=ncol(L[[1]][,-c(1,2)]))
lapply(L,function(a){num_sum <<- a[3:length(a)]+num_sum})
cbind(output,num_sum)
but this approach has some problems I would prefer to avoid:
I need to manually build the two parts of the output and then manually join them.
lapply() returns a list in which each element is an intermediate num_sum from one iteration, which requires much more memory.
I use the global variable num_sum to keep track of progress, but num_sum is not needed afterwards and I have to manually remove it.

If the order of the two first columns is always the same, you can do:
#Get all numeric columns
num <- sapply(L[[1]], is.numeric)
#Sum them across elements of the list
df_num <- Reduce(`+`, lapply(L, `[`, num))
#Get the non-numeric columns and bind them with sum of numeric columns
cbind(L[[1]][!num], df_num)
Output:
  ID name v1 v2
1 aa   cc  6 10
2 bb   dd  8 12
If the order differs, you can use powerjoin to do an inner join on the key columns and sum the remaining columns through the conflict argument:
library(powerjoin)
sum_inner_join <- function(x, y) {
  power_inner_join(x, y, by = c("ID", "name"), conflict = ~ .x + .y)
}
Reduce(sum_inner_join, L)
Output:
  ID name v1 v2
1 aa   cc  6 10
2 bb   dd  8 12

Using dplyr and purrr (which has somewhat nicer map functions), you could do something like this:
library(purrr)
library(dplyr)
result <- reduce(L, function(x, y) {
  xVals <- x |> select(-ID, -name)
  yVals <- y |> select(-ID, -name)
  totalVals <- xVals |> map2(yVals, function(x, y) {
    rowSums(cbind(x, y))
  })
  return(x |> select(ID, name) |> cbind(totalVals))
})

Similar logic to Maël's answer, but squishing it all into a Map call:
data.frame(do.call(
  Map,
  c(\(...) if (is.numeric(..1)) Reduce(`+`, list(...)) else ..1, L)
))
# ID name v1 v2
#1 aa cc 6 10
#2 bb dd 8 12
For each column, if the first data frame's copy of it (..1) is numeric, sum (+) that column across all the data frames in the list; otherwise return the first copy (..1).
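To make the Map call above easier to follow, here is the same idea unrolled with a named helper. The name sum_or_first is made up for this sketch; the logic is meant to match the one-liner, not to replace it.

```r
x <- data.frame(ID = c("aa", "bb"), name = c("cc", "dd"), v1 = c(1, 2), v2 = c(3, 4))
y <- data.frame(ID = c("aa", "bb"), name = c("cc", "dd"), v1 = c(5, 6), v2 = c(7, 8))
L <- list(x, y)

# Hypothetical helper: receives one column from every data frame in the list
sum_or_first <- function(...) {
  cols <- list(...)
  if (is.numeric(cols[[1]])) Reduce(`+`, cols) else cols[[1]]
}

# do.call(Map, c(f, L)) is equivalent to Map(f, x, y): because a data frame
# is a list of columns, the helper is called once per column position
per_column <- do.call(Map, c(sum_or_first, L))
str(per_column)

result <- data.frame(per_column)
print(result)
```

Map takes the element names from its first data argument, which is why the resulting list (and the data frame built from it) keeps the column names of x.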
You could also do it via an aggregation if all the rows are unique:
tmp <- do.call(rbind, L)
nums <- sapply(tmp, is.numeric)
aggregate(tmp[nums], tmp[!nums], FUN=sum)
# ID name v1 v2
#1 aa cc 6 10
#2 bb dd 8 12

R - dynamically create columns using existing column names in sequence

I have a dataframe, df, with several columns in it. I would like to write a function that creates new columns dynamically from existing column names, using the last four characters of an existing column name. For example, I would like to create a variable named df$rev_2002 like so:
df$rev_2002 <- df$avg_2002 * df$quantity
The problem is I would like to be able to run the function every time a new column (say, df$avg_2003) is appended to the dataframe.
To this end, I used the following function to extract the last 4 characters of the df$avg_2002 variable:
substRight <- function(x, n) {
  substr(x, nchar(x) - n + 1, nchar(x))
}
I tried putting together another function to create the columns:
revved <- function(x, y, z) {
  z = x * y
  names(z) <- paste('revenue', substRight(x, 4), sep = "_")
  return(x)
}
But when I try it on actual data I don't get new columns in my df. The desired result is a series of variables in my df such as:
df$rev_2002, df$rev_2003...df$rev_2020 or whatever is the largest value of the last four characters of the x variable (df$avg_2002 in example above).
Any help or advice would be truly appreciated. I'm really in the woods here.

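dat <- data.frame(id = 1:2, quantity = 3:4, avg_2002 = 5:6, avg_2003 = 7:8, avg_2020 = 9:10)
func <- function(dat, overwrite = FALSE) {
  # Find the avg_<year> columns and derive the matching rev_<year> names
  nms <- grep("^avg_[0-9]+$", names(dat), value = TRUE)
  revnms <- sub("^avg_", "rev_", nms)
  if (!overwrite) {
    # Skip years whose rev_ column already exists, keeping nms and revnms aligned
    keep <- !revnms %in% names(dat)
    nms <- nms[keep]
    revnms <- revnms[keep]
  }
  # dat[nms] stays a data frame even when only one column matches
  if (length(nms) > 0) dat[revnms] <- lapply(dat[nms], `*`, dat$quantity)
  dat
}
func(dat)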
# id quantity avg_2002 avg_2003 avg_2020 rev_2002 rev_2003 rev_2020
# 1 1 3 5 7 9 15 21 27
# 2 2 4 6 8 10 24 32 40
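The question asks for something that can be re-run whenever a new avg_ column is appended. Here is a small self-contained sketch of that workflow. The helper name add_rev_columns is hypothetical, and it assumes the year suffix is always four digits.

```r
# Hypothetical helper written for this sketch: adds a rev_<year> column
# for every avg_<year> column that does not have one yet
add_rev_columns <- function(dat) {
  avg_cols <- grep("^avg_[0-9]{4}$", names(dat), value = TRUE)
  rev_cols <- sub("^avg_", "rev_", avg_cols)
  is_new <- !rev_cols %in% names(dat)
  for (k in which(is_new)) {
    dat[[rev_cols[k]]] <- dat[[avg_cols[k]]] * dat$quantity
  }
  dat
}

dat <- data.frame(id = 1:2, quantity = 3:4, avg_2002 = 5:6)
dat <- add_rev_columns(dat)   # adds rev_2002
dat$avg_2003 <- 7:8           # a new year arrives later
dat <- add_rev_columns(dat)   # adds only rev_2003
print(names(dat))
```

Because existing rev_ columns are skipped, calling the helper repeatedly is harmless, and the data frame has to be reassigned each time since R functions do not modify their arguments in place.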

Adding a column to every dataframe in a list with the name of the list element

I have a list containing multiple data frames, and each list element has a unique name. The structure is similar to this dummy data
a <- data.frame(z = rnorm(20), y = rnorm(20))
b <- data.frame(z = rnorm(30), y = rnorm(30))
c <- data.frame(z = rnorm(40), y = rnorm(40))
d <- data.frame(z = rnorm(50), y = rnorm(50))
my.list <- list(a,b,c,d)
names(my.list) <- c("a","b","c","d")
I want to create a column in each of the data frames that holds the name of its respective list element. My goal is to merge all the list elements into a single data frame and still know which data frame each row came from originally. The end result I'm looking for is something like this:
z y group
1 0.6169132 0.09803228 a
2 1.1610584 0.50356131 a
3 0.6399438 0.84810547 a
4 1.0878453 1.00472105 b
5 -0.3137200 -1.20707112 b
6 1.1428834 0.87852556 b
7 -1.0651735 -0.18614224 c
8 1.1629891 -0.30184443 c
9 -0.7980089 -0.35578381 c
10 1.4651651 -0.30586852 d
11 1.1936547 1.98858128 d
12 1.6284174 -0.17042835 d
My first thought was to use mutate to assign the list element name to a column in each respective data frame, but it appears that when used within lapply, names() refers to the column names, not the list element names
test <- lapply(my.list, function(x) mutate(x, group = names(x)))
Error: Column `group` must be length 20 (the number of rows) or one, not 2
Any suggestions as to how I could approach this problem?

There is no need to mutate; just bind the data frames with dplyr's bind_rows:
library(tidyverse)
my.list %>%
  bind_rows(.id = "groups")
This requires that the list is named.

We can use Map from base R
Map(cbind, my.list, group = names(my.list))
Or with imap from purrr
library(dplyr)
library(purrr)
imap(my.list, ~ .x %>% mutate(group = .y))
Or if the intention is to create a single data.frame
library(data.table)
rbindlist(my.list, idcol = 'groups')
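If you want the single combined data frame without any extra packages, the Map call above can be followed by rbind. This is a sketch on smaller toy data than the question's; set.seed is only there to make it reproducible.

```r
set.seed(1)  # reproducible toy data, smaller than in the question
my.list <- list(
  a = data.frame(z = rnorm(3), y = rnorm(3)),
  b = data.frame(z = rnorm(2), y = rnorm(2))
)

# Tag each data frame with its list name, then stack the pieces
tagged <- Map(cbind, my.list, group = names(my.list))
combined <- do.call(rbind, tagged)

# rbind builds row names like "a.1", "a.2"; reset them to 1..n
rownames(combined) <- NULL
print(combined)
```

Map pairs each data frame with the matching name, so cbind recycles that single name down all rows of its data frame.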

Converting dates in column of a data frame R

I am having problems converting a column of imported dates in a data frame, represented as characters in a different date format, into date objects in that same data frame. Here is a toy example:
xx <- data.frame(A = c(10, 15, 20), B = c("10/15/2010", "9/8/2015", "8/5/2013"))
If I print xx,
A B
1 10 10/15/2010
2 15 9/8/2015
3 20 8/5/2013
I apply:
xx[, "B"] <- sapply(xx[, "B"], function(x) {
  as.Date(x, format = "%m/%d/%Y", origin = "1970-01-01")
})
and I get:
A B
1 10 14897
2 15 16686
3 20 15922
If I look at the mode of column B, it is numeric, not date. No matter what I try I cannot seem to get a result that converts column B to a date type. I can always add:
xx[, "B"] <- as.Date(xx[, "B"])
but there must be a way to do this in one statement.

If you have only one column to convert, you can do
xx$B <- as.Date(xx$B, "%m/%d/%Y")
If you have multiple columns use lapply instead of sapply
cols <- 2
xx[cols] <- lapply(xx[cols], as.Date, "%m/%d/%Y")
Or using lubridate where you don't need to specify the format argument.
xx$B <- lubridate::mdy(xx$B)
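As for why the original attempt produced numbers: sapply() simplifies its result into a plain vector, and that step drops the Date class, leaving the underlying day counts. A short sketch of the difference, using the toy data from the question (stringsAsFactors = FALSE is added so it behaves the same on older R versions):

```r
xx <- data.frame(
  A = c(10, 15, 20),
  B = c("10/15/2010", "9/8/2015", "8/5/2013"),
  stringsAsFactors = FALSE
)

# sapply() simplifies to a bare vector, so the Date class is lost
via_sapply <- sapply(xx$B, as.Date, format = "%m/%d/%Y")
print(class(via_sapply))

# lapply() returns a list of columns, so each column keeps its class
xx["B"] <- lapply(xx["B"], as.Date, format = "%m/%d/%Y")
print(class(xx$B))
print(xx)
```

The same lapply() pattern extends to several columns by putting more than one name or position inside the brackets.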

R - combine a 3D matrix and single vector then select certain month data

I have a single date vector A (362 elements) and a 3D array B (dimensions 360*180*362):
> str(ssta_date)
POSIXlt[1:362], format: "1981-11-15" "1981-12-15" "1982-01-15" "1982-02-15" "1982-03-15" ...
> dim(ssta_sst)
[1] 360 180 362
Is it possible to combine A and B into a data frame or something similar? Each date must be matched with the corresponding temperature slice B[,,k], so that each date corresponds to one 360*180 matrix.
I am planning to do this in order to select the data for certain months. Is there a shortcut for selecting months without combining the two data sets?

If I understand what you want,
you can use melt to convert your array to a data.frame,
and then use merge to join it with the other piece of data.
If you want an array at the end, you can use acast
to convert the data.frame back to an array.
# Sample data
n <- 10
d <- array(
  rnorm(n^3),
  dim = c(n, n, n),
  dimnames = list(
    One = LETTERS[1:n],
    Two = LETTERS[1:n],
    Three = LETTERS[1:n]
  )
)
x <- data.frame( One=LETTERS[1:n], x=runif(n) )
library(reshape2)
d <- melt(d)
d <- merge( d, x, by="One", all=TRUE )
d$result <- d$value + d$x # Do something with the data
acast( d, One ~ Two ~ Three, value.var="result" )
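On the second part of the question (selecting months without combining anything), one possible shortcut is to index the third dimension of the array directly with a logical vector built from the dates. The sketch below uses stand-ins for ssta_date and ssta_sst with a much smaller grid, and assumes the k-th date belongs to the k-th slice of the array.

```r
# Stand-ins for ssta_date and ssta_sst (smaller grid, random values)
dates <- seq(as.Date("1981-11-15"), by = "month", length.out = 362)
sst <- array(rnorm(4 * 3 * 362), dim = c(4, 3, 362))

# Logical index over the time dimension: TRUE where the month is January
is_jan <- format(dates, "%m") == "01"

# drop = FALSE keeps the result three-dimensional even if only one slice matches
sst_jan <- sst[, , is_jan, drop = FALSE]
print(dim(sst_jan))

# The matching dates, in the same order as the selected slices
dates_jan <- dates[is_jan]
print(head(dates_jan, 3))
```

format() also works on POSIXlt vectors like the one in the question, so the same index should carry over; several months can be selected with `%in%`, e.g. `format(dates, "%m") %in% c("12", "01", "02")`.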
