Create then populate columns in a dataframe - r

Hello, I'm trying to find a way to create new columns in a dataframe, then populate them.
For example:
id = c(2, 3, 5)
v1 = c(2, 1, 7)
v2 = c(1, 9, 5)
duration=c(v1+v2)
df = data.frame(id,v1,v2,duration,stringsAsFactors=FALSE)
id v1 v2 duration
1 2 2 1 3
2 3 1 9 10
3 5 7 5 12
Now I want to create new columns by dividing each value of a row by the 'duration' of said row. I know how to do it manually, but it is prone to errors and not really elegant...
df$I_v1=v1/duration
df$I_v2=v2/duration
Or is df <- df %>% mutate(I_v1 = v1/duration) quicker/better?
id v1 v2 duration I_v1 I_v2
1 2 2 1 3 0.6666667 0.3333333
2 3 1 9 10 0.1000000 0.9000000
It works but I would like to know if it's possible to create -and name- the row and populate them automatically.

Say that you have a cols vector containing the names of the columns you want to manipulate. In your example:
cols<-c("v1","v2")
Then you can try:
df[paste0("I_",cols)]<-df[cols]/df$duration
# id v1 v2 duration I_v1 I_v2
#1 2 2 1 3 0.6666667 0.3333333
#2 3 1 9 10 0.1000000 0.9000000
#3 5 7 5 12 0.5833333 0.4166667
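To avoid typing the column names at all, the same assignment can be wrapped in a small function and fed by a name pattern. This is only a sketch: `add_ratio_cols` is a hypothetical helper name, and the `"^v"` pattern assumes the columns of interest all start with "v".

```r
# Hypothetical helper: divide the columns named in `cols` by the column `by`
# and store the results under new names built from `prefix`.
add_ratio_cols <- function(df, cols, by, prefix = "I_") {
  df[paste0(prefix, cols)] <- df[cols] / df[[by]]
  df
}

df <- data.frame(id = c(2, 3, 5), v1 = c(2, 1, 7), v2 = c(1, 9, 5))
df$duration <- df$v1 + df$v2

# Assumption: every column to transform has a name starting with "v"
cols <- grep("^v", names(df), value = TRUE)
df_new <- add_ratio_cols(df, cols, by = "duration")
```

The new columns are named from the old ones, so adding a v3 column to the data needs no change to the code.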

You can use transform():
df <- data.frame(id=c(2, 3, 5), v1=c(2, 1, 7), v2=c(1, 9, 5))
df$duration <- df$v1 + df$v2 # or ... <- with(df, v1 + v2)
df_new <- transform(df, I_v1=v1/duration, I_v2=v2/duration )
... or (if you have many columns v1, v2, ...):
as.matrix(df[, 2:3])/df$duration # or with cbind():
cbind(df, as.matrix(df[, 2:3])/df$duration)
(similar as in the answer from nicola)

All data frames have a row names attribute, a character vector of length the number of rows with no duplicates nor missing values. You can name the rows as:
row.names(x) <- value
Arguments:
x
object of class "data.frame", or any other class for which a method has been defined.
value
an object to be coerced to character unless an integer vector.
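A minimal sketch of that assignment on a small data frame; the labels "a", "b", "c" are made up for illustration:

```r
df <- data.frame(v1 = c(2, 1, 7), v2 = c(1, 9, 5))

# Assumed labels, one per row; they must be unique and non-missing
row.names(df) <- c("a", "b", "c")

# Rows can then be looked up by name as well as by position
cell <- df["b", "v2"]
```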

Related

Perform calculations on row depending on individual cells [duplicate]

I have a data frame in R that looks like
1 3 NULL,
2 NULL 5,
NULL NULL 9
I want to iterate through each row and add the two numbers that are present. If there aren't two numbers present I want to throw an error. How do I refer to specific rows and cells in R? To iterate through the rows I have a for loop. Sorry, I'm not sure how to format a matrix above.
for(i in 1:nrow(df))
Data:
df <- data.frame(
v1 = c(1, 2, NA),
v2 = c(3, NA, NA),
v3 = c(NA, 5, 9)
)
Use rowSums:
df$sum <- rowSums(df, na.rm = T)
Result:
df
v1 v2 v3 sum
1 1 3 NA 4
2 2 NA 5 7
3 NA NA 9 9
If you do need a for loop:
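value_cols <- c("v1", "v2", "v3") # sum only the original columns, not the new sum column
df$sum <- NA_real_ # create the column before filling it row by row
for(i in seq_len(nrow(df))){
df$sum[i] <- rowSums(df[i, value_cols], na.rm = TRUE)
}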
If you have something with NULL you can make it a data.frame, but that will make the columns with NULL a character vector. You have to convert those to numeric, which will then introduce NA for NULL.
rowSums will then create the sum you want.
df <- read.table(text=
"
a b c
1 3 NULL
2 NULL 5
NULL NULL 9
", header =T)
# make columns numeric, this will change the NULL to NA
df <- data.frame(lapply(df, as.numeric))
cbind(df, sum=rowSums(df, na.rm = T))
# a b c sum
# 1 1 3 NA 4
# 2 2 NA 5 7
# 3 NA NA 9 9
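The question also asked for an error when a row does not contain two numbers. One way to sketch that check, assuming missing cells are already NA as above; `sum_pairs` is a hypothetical helper name:

```r
# Hypothetical helper: sum each row, but stop if any row does not hold
# exactly `n` non-missing values.
sum_pairs <- function(df, n = 2) {
  present <- rowSums(!is.na(df))   # number of non-NA cells in each row
  bad <- which(present != n)
  if (length(bad) > 0) {
    stop("rows without exactly ", n, " values: ", paste(bad, collapse = ", "))
  }
  rowSums(df, na.rm = TRUE)
}

# Every row has two numbers, so the sums are returned
ok <- data.frame(a = c(1, 2), b = c(3, NA), c = c(NA, 5))
ok_sums <- sum_pairs(ok)

# The third row has only one number, so an error is raised
df <- data.frame(a = c(1, 2, NA), b = c(3, NA, NA), c = c(NA, 5, 9))
res <- tryCatch(sum_pairs(df), error = function(e) conditionMessage(e))
```

The check is vectorised, so no explicit loop over the rows is needed to find the offending ones.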

Complex reshaping from wide to long in R (pulling multiple things from original variable name)

I have a question about reshaping a complex data from wide to long format.
"Prim_key" is the unique id. The variables have the following format: "sn016_1_2". I need to pull the first number into a column and name it "S" (For example, here it would be 1) and the second number to a column named "T" (For example, here it would be 2) and then pull the values into other variable names grouped by the unique id. The prefix sn016 is also not the only prefix. Here are the variables:
[1] "prim_key" "sn016_1_2" "sn016_1_3" "sn016_1_4" "sn016_1_5" "sn016_1_6" "sn016_1_7" "sn016_2_3"
[9] "sn016_2_4" "sn016_2_5" "sn016_2_6" "sn016_2_7" "sn016_3_4" "sn016_3_5" "sn016_3_6" "sn016_3_7"
[17] "sn016_4_5" "sn016_4_6" "sn016_4_7" "sn016_5_6" "sn016_5_7" "sn016_6_7" "sn017_1_2" "sn017_1_3"
[25] "sn017_1_4" "sn017_1_5" "sn017_1_6" "sn017_1_7" "sn017_2_3" "sn017_2_4" "sn017_2_5" "sn017_2_6"
[33] "sn017_2_7" "sn017_3_4" "sn017_3_5" "sn017_3_6" "sn017_3_7" "sn017_4_5" "sn017_4_6" "sn017_4_7"
[41] "sn017_5_6" "sn017_5_7" "sn017_6_7"
"Prim_key" is the unique id. Any ideas on how to do this? I feel like it shouldn't be terribly hard but it's evading me.
Here's an example of what I'm looking for:
THESE VARS: "prim_key" "sn016_1_2" "sn016_1_3" "sn016_2_6" "sn016_2_7" "sn016_3_4" "sn016_3_5"
prim_key S T sn016
1 1 2 value
1 1 3 value
1 2 6 value
1 2 7 value
1 3 4 value
1 3 5 value
P.s. The goal long format example is not showing up correctly. So I've attached as an image.
Thanks in advance for any help!!
Perhaps you might try using pivot_longer from tidyr.
You can specify:
Columns to make longer (could select columns that start with "sn", such as starts_with("sn"), or all columns except for prim_key)
Names of the new columns generated, which include the initial letter/number combination (e.g., sn016), S, and T
And a regex pattern to split up into these columns
The code as follows:
library(tidyverse)
df %>%
pivot_longer(cols = -prim_key,
names_to = c(".value", "S", "T"),
names_pattern = "(\\w+)_(\\d+)_(\\d+)")
Output
# A tibble: 10 x 5
prim_key S T sn016 sn017
<dbl> <chr> <chr> <int> <int>
1 1 1 2 5 NA
2 1 1 3 2 NA
3 1 2 6 5 3
4 1 2 7 1 2
5 1 3 5 NA 3
6 1 1 2 2 NA
7 1 1 3 3 NA
8 1 2 6 3 4
9 1 2 7 2 3
10 1 3 5 NA 5
Data
Example data made up:
df <- structure(list(prim_key = c(1, 1), sn016_1_2 = c(5L, 2L), sn016_1_3 = 2:3,
sn016_2_6 = c(5L, 3L), sn016_2_7 = 1:2, sn017_2_6 = 3:4,
sn017_2_7 = 2:3, sn017_3_5 = c(3L, 5L)), class = "data.frame", row.names = c(NA,
-2L))
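To see what the names_pattern regex captures, the same pattern can be applied to a few variable names with base R. This is only an illustration of the regex, not what tidyr does internally; the three names are taken from the question.

```r
vars <- c("sn016_1_2", "sn016_2_6", "sn017_3_5")
pattern <- "(\\w+)_(\\d+)_(\\d+)"

# regexec() returns the full match followed by one entry per capture group
parts <- regmatches(vars, regexec(pattern, vars, perl = TRUE))

# Drop the full match and keep the three captured pieces
pieces <- do.call(rbind, parts)[, -1, drop = FALSE]
colnames(pieces) <- c("prefix", "S", "T")
pieces
```

The captured S and T pieces are character strings, which is why they show up as <chr> in the tibble above. They can be converted afterwards with as.integer(), or inside pivot_longer via its names_transform argument.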
We could use melt from data.table
library(data.table)
dcast(melt(setDT(df), id.var = 'prim_key')[, c("nm1", "S", "T")
:= tstrsplit(variable, '_')], rowid(nm1, S, T) + prim_key + S + T
~ nm1, value.var = 'value')[, nm1 := NULL][]
# prim_key S T sn016 sn017
# 1: 1 1 2 5 NA
# 2: 1 1 3 2 NA
# 3: 1 2 6 5 3
# 4: 1 2 7 1 2
# 5: 1 3 5 NA 3
# 6: 1 1 2 2 NA
# 7: 1 1 3 3 NA
# 8: 1 2 6 3 4
# 9: 1 2 7 2 3
#10: 1 3 5 NA 5
data
df <- structure(list(prim_key = c(1, 1), sn016_1_2 = c(5L, 2L), sn016_1_3 = 2:3,
sn016_2_6 = c(5L, 3L), sn016_2_7 = 1:2, sn017_2_6 = 3:4,
sn017_2_7 = 2:3, sn017_3_5 = c(3L, 5L)), class = "data.frame", row.names = c(NA,
-2L))
The answers using external packages are probably the way to go in terms of parsimony. It's useful, however, to be able to brute force your desired solution using base R sometimes. Below is an example. One benefit of the following is that the call to lapply can be replaced with the parallel version parLapply or mclapply both from the parallel package that ships with R.
#### First make some example data
# The column names you gave
cnames <- c("prim_key", "sn016_1_2", "sn016_1_3", "sn016_1_4", "sn016_1_5",
"sn016_1_6", "sn016_1_7", "sn016_2_3", "sn016_2_4", "sn016_2_5",
"sn016_2_6", "sn016_2_7", "sn016_3_4", "sn016_3_5", "sn016_3_6",
"sn016_3_7", "sn016_4_5", "sn016_4_6", "sn016_4_7", "sn016_5_6",
"sn016_5_7", "sn016_6_7", "sn017_1_2", "sn017_1_3", "sn017_1_4",
"sn017_1_5", "sn017_1_6", "sn017_1_7", "sn017_2_3", "sn017_2_4",
"sn017_2_5", "sn017_2_6", "sn017_2_7", "sn017_3_4", "sn017_3_5",
"sn017_3_6", "sn017_3_7", "sn017_4_5", "sn017_4_6", "sn017_4_7",
"sn017_5_6", "sn017_5_7", "sn017_6_7")
# An example matrix with random data
mat <- matrix(runif(length(cnames) * 4), nrow = 4)
# Make the column names correct
colnames(mat) <- cnames
### Now pretend we already had the data
# Get the column names of the input matrix
cnames <- colnames(mat)
# The column names that are not your primary key
n_primkey <- cnames[which(cnames != "prim_key")]
# Get the unique set of prefixes for the non-primkey variables
prefix <- strsplit(n_primkey, "_")
prefix <- unique(unlist(lapply(prefix, "[", 1)))
# Go row by row through the original matrix
dat <- lapply(seq_len(nrow(mat)), function(i) {
# The row we're dealing with now
row <- mat[i, ]
# The column names of your output matrix
dcnames <- c("prim_key", "S", "T", prefix)
# A pre-allocated data.frame to hold the reshaped data for this row
dat <- matrix(rep(NA, length(dcnames) * length(n_primkey)), ncol = length(dcnames))
dat <- as.data.frame(dat)
colnames(dat) <- dcnames
# All values for this row have the same prim_key value
dat$prim_key <- row["prim_key"]
# Go through each of the non-prim_key variables, split them, and put the
# values in the correct place
for (j in seq_len(length(n_primkey))) {
# k has the non-prim_key name we're dealing with
k <- n_primkey[j]
# l splits this name by underscores "_"
l <- strsplit(k, "_")
# The first element gives the prefix
pref <- l[[1]][1]
# The second gives the "S" value
S_val <- l[[1]][2]
# The third gives the "T" value
T_val <- l[[1]][3]
# Allocate these values into the output data.frame we created earlier
dat[j, "S"] <- S_val
dat[j, "T"] <- T_val
dat[j, pref] <- row[k]
}
# Return the data for row i of the input data
dat
})
# dat is a list, so combine each element into a single data.frame
dat <- do.call(rbind, dat)
# Check a few
dat[1:2, ]
mat[1, ]

Take sum of rows for every 3 columns in a dataframe

I have searched high and low and also tried multiple options to solve this but did not get the desired output as mentioned below:
I have a dataframe df3 with dates as headers and values between 0 and 1, as shown below:
df = data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) = c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 = data.frame(c("A","B","C","D","E","F"))
colnames(df2) = c("CUST_ID")
df3 = cbind(df2,df)
Now I need df4 in which sum of first 3 columns in series will form one column. This will be repeated in series for rest of the columns dynamically.
df4
Options I tried:
a) rbind.data.frame(apply(matrix(df3, nrow = n - 1), 1,sum))
b) col_list <- list(c("1/1/2018","1/2/2018","1/3/2018"), c("1/4/2018","1/5/2018","1/6/2018"))
lapply(col_list, function(x)sum(df3[,x])) %>% data.frame
One way would be to split df3 every 3 columns using split.default. To split the data we generate a sequence using rep, then for each dataframe we take rowSums and finally cbind the result together.
cbind(df3[1], sapply(split.default(df3[-1],
rep(1:ncol(df3), each = 3, length.out = (ncol(df3) -1))), rowSums))
# CUST_ID 1 2
#1 A 1 1
#2 B 2 0
#3 C 2 1
#4 D 1 1
#5 E 2 2
#6 F 2 2
FYI, the sequence generated from rep is
rep(1:ncol(df3), each = 3, length.out = (ncol(df3) -1))
#[1] 1 1 1 2 2 2
This makes it possible to split every 3 columns.
The results are different because OP used sample without set.seed.
If rep seems too long then we can generate the same sequence of columns using gl
gl(ncol(df3[-1])/3, 3)
#[1] 1 1 1 2 2 2
#Levels: 1 2
So the final code, would be
cbind(df3[1], sapply(split.default(df3[-1], gl(ncol(df3[-1])/3, 3)), rowSums))
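Both versions above are shown on a column count that is a multiple of 3. A sketch of a grouping index that also copes with a leftover group, using made-up data with 7 value columns (the names d1 to d7 are assumptions for illustration):

```r
df3 <- data.frame(
  CUST_ID = c("A", "B", "C"),
  d1 = c(1, 0, 1), d2 = c(0, 0, 1), d3 = c(1, 1, 1),
  d4 = c(0, 1, 0), d5 = c(1, 1, 0), d6 = c(0, 0, 0), d7 = c(1, 0, 1)
)
vals <- df3[-1]

# Group index 1,1,1,2,2,2,3: the last group simply holds fewer columns
grp <- ceiling(seq_along(vals) / 3)

# Split the columns by group, sum each block by row, and bind to the id column
sums <- sapply(split.default(vals, grp), rowSums)
df4 <- cbind(df3[1], sums)
```

Changing the 3 in the index to another block size gives sums over blocks of that width.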
We can use seq to create an index, get the subset of columns in a list, Reduce by taking the sum, and create new columns
df4 <- df3[1]
df4[paste0('col', c('123', '456'))] <- lapply(seq(2, ncol(df3), by = 3),
function(i) Reduce(`+`, df3[i:min((i+2), ncol(df3))]))
df4
# CUST_ID col123 col456
#1 A 2 2
#2 B 3 3
#3 C 1 3
#4 D 2 3
#5 E 2 1
#6 F 0 1
data
set.seed(123)
df <- data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) <- c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 <- data.frame(c("A","B","C","D","E","F"))
colnames(df2) = c("CUST_ID")
df3 <- cbind(df2, df)

Sum Values of Every Column in Data Frame with Conditional For Loop

So I want to go through a data set and sum the values from each column based on the condition of my first column. The data and my code so far looks like this:
x v1 v2 v3
1 0 1 5
2 4 2 10
3 5 3 15
4 1 4 20
for(i in colnames(data)){
if(data$x>2){
x1 <-sum(data[[i]])
}
else{
x2 <-sum(data[[i]])
}
}
My assumption was that the for loop would call each column by name from the data and then sum the values in each column based on whether they matched the condition of column x.
I want to sum half the values from each column and assign them to a value x1 and do the same for the remainder, assigning it to x2. I keep getting an error saying the following:
the condition has length > 1 and only the first element will be used
What am I doing wrong and is there a better way to go about this? Ideally I want a table that looks like this:
v1 v2 v3
x1 6 7 35
x2 4 3 15
Here's a dplyr solution. First, I define the data frame.
df <- read.table(text = "x v1 v2 v3
1 0 1 5
2 4 2 10
3 5 3 15
4 1 4 20", header = TRUE)
# x v1 v2 v3
# 1 1 0 1 5
# 2 2 4 2 10
# 3 3 5 3 15
# 4 4 1 4 20
Then, I create a label (x_check) to indicate which group each row belongs to based on your criterion (x > 2), group by this label, and summarise each column with a v in its name using sum.
# Load library
library(dplyr)
df %>%
mutate(x_check = ifelse(x>2, "x1", "x2")) %>%
group_by(x_check) %>%
summarise_at(vars(contains("v")), sum) # pass sum directly; funs() is deprecated in recent dplyr versions
# # A tibble: 2 x 4
# x_check v1 v2 v3
# <chr> <int> <int> <int>
# 1 x1 6 7 35
# 2 x2 4 3 15
Not sure if I understood your intention correctly, but here is how you would reproduce your results with base R:
df <- data.frame(
x = c(1:4),
v1 = c(0, 4, 5, 1),
v2 = 1:4,
v3 = (1:4)*5
)
x1 <- colSums(df[df$x > 2, 2:4, drop = FALSE])
x2 <- colSums(df[df$x <= 2, 2:4, drop = FALSE])
Where
df[df$x > 2, 2:4, drop = FALSE] will create a subset of df where the rows satisfy df$x > 2 and the columns are 2:4 (meaning the second, third and fourth column), drop = FALSE is there mainly to prevent R from simplifying the results in some special cases
colSums does a by-column sum on the subsetted data.frame
If your x column was really a condition (e.g. a logical vector) you could just do
x1 <- colSums(df[df$x, 2:4, drop = FALSE])
x2 <- colSums(df[!df$x, 2:4, drop = FALSE])
Note that no loop is needed to get the results; in R you should use vectorized functions as much as possible.
More generally, you could do such aggregation with aggregate:
aggregate(df[, 2:4], by = list(condition = df$x <= 2), FUN = sum)
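The error in the question comes from `if(data$x>2)`: `if` expects a single TRUE or FALSE, but the comparison yields one value per row. A sketch that builds the requested x1/x2 table in base R with `rowsum()`; the labels "x1" and "x2" follow the question's desired output:

```r
df <- data.frame(x = 1:4, v1 = c(0, 4, 5, 1), v2 = 1:4, v3 = (1:4) * 5)

# One label per row; ifelse() is vectorised, unlike if ()
group <- ifelse(df$x > 2, "x1", "x2")

# rowsum() adds up the rows of each group, column by column
totals <- rowsum(df[c("v1", "v2", "v3")], group)
totals
#    v1 v2 v3
# x1  6  7 35
# x2  4  3 15
```

The group labels become the row names of the result, which gives the layout asked for in the question.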

R: compare multiple columns pairs and place value on new corresponding variable

I am a basic R user.
I have 50 column pairs (example pair is: "pair_q1" and "pair_01_v_rde") per "id" in the same dataframe that I would like to collect data from and place it in a new corresponding variable e.g. "newvar_q1".
All the pair variable names have a pattern in their names that can be distilled to this ("pair_qX" and "pair_X_v_rde", where X = 1:50, and the final variables I would like to have are "newvar_qX", where X = 1:50)
Ideally only one member of the pair should contain data, but this is not the case.
Each of the variables can contain values from 1:5 or NA(missing).
Rules for collecting data from each pair based on "id" and what to place in their newly created corresponding variable are:
If one of the pairs has a value and the other is missing then place the value in their corresponding new variable. e.g. ("pair_q1" = 1 and "pair_01_v_rde" = NA then "newvar_q1" = 1)
If both pairs have the same value or both are missing then place that value/missing in their corresponding new variable e.g. ("pair_q50" = 1/NA and "pair_50_v_rde" = 1/NA then "newvar_q50" = 1/NA)
If both pairs have different values then ignore both values and assign their corresponding new variable 999 e.g. ("pair_q02" = 3 and "pair_02_v_rde" = 2 then "newvar_q02" = 999)
Can anyone show me how I can execute this in R please?
Thanks!
Nelly
# Create Toy dataset
id <- c(100, 101, 102)
pair_q1 <- c(1, NA, 1)
pair_01_v_rde <- c(NA, 2, 1)
pair_q2 <- c(1, 1, NA)
pair_02_v_rde <- c(2, NA, NA)
pair_q50 <- c(NA, 2, 4)
pair_50_v_rde <- c(4, 3, 1)
mydata <- data.frame(id, pair_q1, pair_01_v_rde, pair_q2, pair_02_v_rde, pair_q50, pair_50_v_rde)
# The dataset
> mydata
id pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde
1 100 1 NA 1 2 NA 4
2 101 NA 2 1 NA 2 3
3 102 1 1 NA NA 4 1
# Here I manually build what I would like to have in the dataset
newvar_q1 <- c(1, 2, 1)
newvar_q2 <- c(999, 1, NA)
newvar_q50 <- c(4, 999, 999)
mydata2 <- data.frame(id, pair_q1, pair_01_v_rde, pair_q2, pair_02_v_rde, pair_q50, pair_50_v_rde, newvar_q1, newvar_q2, newvar_q50)
> mydata2
id pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde newvar_q1 newvar_q2 newvar_q50
1 100 1 NA 1 2 NA 4 1 999 4
2 101 NA 2 1 NA 2 3 2 1 999
3 102 1 1 NA NA 4 1 1 NA 999
A possible solution using the 'tidyverse' (use 'inner_join(mydata,.,by="id")' to get the new columns in the order you give in your question):
library(tidyverse) # provides %>%, select, gather, spread, mutate, case_when, str_extract
mydata %>%
select(id,matches("^pair_q")) %>% # keeps only left part of pairs
gather(k,v1,-id) %>% # transforms into tuples (id,variable name,variable value)
mutate(n=as.integer(str_extract(k,"\\d+"))) -> df1 # converts variable name into variable number
mydata %>%
select(id,matches("^pair_\\d")) %>% # same on right part of pairs
gather(k,v2,-id) %>%
mutate(n=as.integer(str_extract(k,"\\d+"))) -> df2
inner_join(df1,df2,by=c("id","n")) %>%
mutate(w=case_when(is.na(v1) ~ v2, # builds new variable value
is.na(v2) ~ v1, # from your rules
v1==v2 ~ v1,
TRUE ~999),
k=paste0("newvar_q",n)) %>% # builds new variable name from variable number
select(id,k,w) %>% # keeps only useful columns
spread(k,w) %>% # switches back from tuple view to wide view
inner_join(mydata,by="id") # and merges the new variables to the original data
# id newvar_q1 newvar_q2 newvar_q50 pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde
#1 100 1 999 4 1 NA 1 2 NA 4
#2 101 2 1 999 NA 2 1 NA 2 3
#3 102 1 NA 999 1 1 NA NA 4 1
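The three rules can also be sketched in base R with a loop over the pair numbers. `combine_pair` is a hypothetical helper name, and the column names assume the toy data's pattern: "pair_qN" on the left and a zero-padded "pair_NN_v_rde" on the right.

```r
mydata <- data.frame(
  id = c(100, 101, 102),
  pair_q1 = c(1, NA, 1),  pair_01_v_rde = c(NA, 2, 1),
  pair_q2 = c(1, 1, NA),  pair_02_v_rde = c(2, NA, NA),
  pair_q50 = c(NA, 2, 4), pair_50_v_rde = c(4, 3, 1)
)

# Hypothetical helper implementing the three rules for one pair of vectors
combine_pair <- function(a, b) {
  out <- ifelse(is.na(a), b, a)               # take whichever value is present
  out[!is.na(a) & !is.na(b) & a != b] <- 999  # both present but different
  out
}

for (n in c(1, 2, 50)) {                      # would be 1:50 on the full data
  left  <- paste0("pair_q", n)
  right <- sprintf("pair_%02d_v_rde", n)      # assumed zero-padding: 01, 02, ..., 50
  mydata[[paste0("newvar_q", n)]] <- combine_pair(mydata[[left]], mydata[[right]])
}
```

When both members of a pair are missing, `ifelse()` returns NA, which covers the "both missing" rule without a separate branch.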

Resources