add_column error with data frame - r

I'm trying to use the add_column function from the tibble package to add a column or two to my data frame df, but I keep getting different errors depending on how I arrange the function's arguments. df is a 60 x 17 data frame. Here's the code I have tried so far:
Try 1:
library(tibble)
add_column(Depth +.5 =df[1], .after = 1)
Try 2:
library(tibble)
add_column(df, depth + .5 = rep(df[1], nrow(df)), .after = 1)
I want the new column to be inserted after column 1 in df, to be named "Depth + .5", and to be filled with the data from my df[1] column (I'm going to alter the values in it later). The number of rows needs to adapt when I import data sets of different lengths, which is why I'm trying to use df[1]: its length changes with the data that I import. Also, I'm not sure whether I need to put "Depth + .5" in quotes to make it work, but that's the name I'd like at the top of the column.

Two key points. First, you need to pass df as the first argument to add_column. Second, I see where you were going with your rep line, because you were getting an error about the number of rows needing to match. However, all you need to do is reference the existing column (which is already the correct length) and perform your operation on it. To do that, just use df$Depth, or you could use df[, 1].
add_column(df, 'Depth + .5'= df$Depth + .5, .after = 1)
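As a minimal sketch of the call above, assuming a small made-up data frame with a Depth column (the data and the Temp column are hypothetical). Note that add_column() returns a new data frame rather than modifying df in place, so the result has to be assigned back:

```r
library(tibble)

# Hypothetical stand-in for the real 60 x 17 data frame
df <- data.frame(Depth = c(0, 1, 2), Temp = c(10.5, 9.8, 9.1))

# Backticks (or quotes) are needed because the name contains spaces and "+"
df <- add_column(df, `Depth + .5` = df$Depth + 0.5, .after = 1)

# The new column now sits directly after the first one
names(df)
```

Because the new column is computed from df$Depth, its length follows whatever data set is imported.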

Related

How to perform calculations on each column of a data table in R

I'm very new to R and am getting started with some simple calculations.
I have imported some data called BM_Returns. I am able to select a specific column and perform calculations on that column fine, but how would I do it for all, or a subset, of the columns?
Example:
S1_Ann_ret <- (prod(1+(BM_Returns$Stock_1/100))^(1/yrs))-1
In my data, column 1 holds dates; all other columns (2-15) are the ones I would like to perform the above and other calculations on.
Thanks
If the calculation needs to be repeated, use across in mutate:
library(dplyr)
# yrs is the number of years, as defined in the question
BM_Returns2 <- BM_Returns %>%
  mutate(across(2:15, ~ prod(1 + (.x / 100))^(1 / yrs) - 1))
Or use base R:
BM_Returns[2:15] <- lapply(BM_Returns[2:15], function(x)
  prod(1 + (x / 100))^(1 / yrs) - 1)
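Since the calculation yields one number per column, it may be convenient to collect the results in a named vector instead of overwriting the columns. A minimal sketch, assuming made-up percentage returns and a hypothetical helper called annualize (neither is from the question):

```r
# Hypothetical data: a date column followed by percentage returns
BM_Returns <- data.frame(
  Date    = as.Date(c("2020-12-31", "2021-12-31")),
  Stock_1 = c(10, 10),
  Stock_2 = c(0, 21)
)
yrs <- 2  # assumed number of years covered by the data

# Annualized return for one column of percentage returns
annualize <- function(x, years) prod(1 + x / 100)^(1 / years) - 1

# Apply to every column except the first (the dates)
ann_ret <- sapply(BM_Returns[-1], annualize, years = yrs)
ann_ret
```

The same helper can be reused for a subset of columns by changing the index, for example BM_Returns[2:5].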

dplyr mutate grouped data without using exact column name

I'm trying to write a function to process multiple similar datasets. For each subject, I want to subtract the score obtained in the previous interview from the score obtained in the following interview. In every dataset I want to process, the score of interest is stored in the second column. Writing this for one specific dataset is simple: just use the exact column name and everything works.
d <- a %>%
arrange(by_group=interview_date) %>%
dplyr::group_by(subjectkey) %>%
dplyr::mutate(score_change = colname_2nd-lag(colname_2nd))
But since I need a generic function that can process multiple datasets, I cannot use the exact column name. So I tried 3 approaches, all of which only alter the last line.
Approach#1:
dplyr::mutate(score_change = dplyr::vars(2)-lag(dplyr::vars(2)))
Approach#2:
The second column name in each dataset of interest contains the same string, so I tried
dplyr::mutate(score_change = dplyr::vars(matches('string'))-lag(dplyr::vars(matches('string'))))
The error message for both of the above approaches is
Error in dplyr::vars(2) - lag(dplyr::vars(2)) :
non-numeric argument to binary operator
Approach#3:
dplyr::mutate(score_change = .[[2]]-lag(.[[2]]))
Error message:
Error: Column `score_change` must be length 2 (the group size) or one, not 10880
10880 is the number of rows in my sample dataset, so it looks like group_by does not work with this approach.
Does anyone know how to make the function perform in the desired way?
If you want to refer to a column by its position, use cur_data()[[2]] to get the 2nd column of the current group's data.
library(dplyr)
d <- a %>%
arrange(interview_date) %>%
dplyr::group_by(subjectkey) %>%
dplyr::mutate(score_change = cur_data()[[2]]-lag(cur_data()[[2]]))
Also note that cur_data() excludes the grouping column, so if subjectkey is the first column in your data and colname_2nd is the second one, you may need to use cur_data()[[1]] instead after group_by.
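One way to sidestep the position shift caused by grouping is to look up the column name before grouping and then refer to it through the .data pronoun. The following is only a sketch under assumptions: add_score_change and the sample data are hypothetical, and the score is assumed to be in the second column of the ungrouped data. It also avoids cur_data(), which recent dplyr releases deprecate in favor of pick().

```r
library(dplyr)

# Hypothetical helper: the score is assumed to sit in the second column
add_score_change <- function(data, score_pos = 2) {
  score_col <- names(data)[score_pos]  # resolve the name before grouping
  data %>%
    arrange(interview_date) %>%
    group_by(subjectkey) %>%
    mutate(score_change = .data[[score_col]] - dplyr::lag(.data[[score_col]])) %>%
    ungroup()
}

# Made-up sample data: two subjects, two interviews each
a <- data.frame(
  subjectkey     = c("s1", "s2", "s1", "s2"),
  score          = c(10, 20, 13, 18),
  interview_date = as.Date(c("2020-01-01", "2020-01-01",
                             "2021-01-01", "2021-01-01"))
)

d <- add_score_change(a)
d
```

Each subject's first interview gets NA, because there is no earlier score to subtract.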

How do I sum an R dataframe into one cell?

I am trying to sum 2 dataframes named X and Y into one cell in a new dataframe named PL. The problem I am having is that when I use this script:
df$PL <- sum(df$X + df$Y)
it propagates the entire PL column instead of just one cell.
How do I code it so that it fills just one cell?
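The sum() function collapses its input into a single total, and that one value is then recycled down the whole PL column. For a row-by-row sum, just add the columns together. Also, you mean variables, i.e. columns, not data frames; df is, I assume, a data frame. I would guess you also mean rows, not cells.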
df$PL <- df$X + df$Y
If this is not what you want, then please share some example data along with what output you are looking for.
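To make the difference concrete, here is a small sketch with made-up values for X and Y (the numbers are assumptions, not the asker's data):

```r
# Hypothetical data frame with the two columns from the question
df <- data.frame(X = c(1, 2, 3), Y = c(10, 20, 30))

# Element-wise addition: one result per row
df$PL <- df$X + df$Y

# sum() collapses everything into a single number
total <- sum(df$X + df$Y)

df
total
```

If a single grand total is what is wanted, it is usually clearer to keep it in its own variable, as total does here, than to store it in a column.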

R: Scale a subset of multiple columns (with similar names) with dplyr

I recently moved from common dataframe manipulation in R to the tidyverse, but I ran into a problem with scaling columns using the scale() function.
My data consists of columns, some of which are numerical features and some categorical. The last column is the y value of the data. So I want to scale all numerical columns except the last one.
With the select() function I am able to write a very short line of code that selects all the numerical columns that need to be scaled if I add the ends_with("...") argument. But I can't really make use of that when scaling. There I have to use transmute(feature1=scale(feature1),feature2=scale(feature2)...) and name each feature individually. This works fine but bloats the code.
So my question is:
Is there a smart solution to manipulate column by column without the need to address every single column name with transmute?
I imagine something like:
transmute(ends_with("...")=scale(ends_with("..."),featureX,featureZ)
(well aware that this does not work)
Many thanks in advance
library(tidyverse)
data("economics")
# set the seed first so that both the sampled letters and y are reproducible
set.seed(1)
# add variables that are not numeric
economics[7:9] <- sample(LETTERS[1:10], size = dim(economics)[1], replace = TRUE)
# add a 'y' column (for illustration)
economics$y <- rnorm(n = dim(economics)[1])
economics_modified <- economics %>%
select(-y) %>%
transmute_if(is.numeric, scale) %>%
add_column(y = economics$y)
If you want to keep the columns that are not numeric, replace transmute_if with modify_if (from purrr). (There might be a smarter way to exclude column y from being scaled.)
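One possible way to exclude y without dropping and re-adding it is to combine selection helpers inside across(). This is a sketch on made-up data (the tibble dat and its column names are assumptions), and it keeps the non-numeric columns as they are:

```r
library(dplyr)

# Hypothetical data: numeric features, a categorical column, and a target y
dat <- tibble(
  feature1 = c(1, 2, 3, 4),
  feature2 = c(10, 20, 30, 40),
  group    = c("a", "b", "a", "b"),
  y        = c(0.5, 1.5, 2.5, 3.5)
)

# Scale every numeric column except y; as.numeric() drops the
# one-column matrix that scale() returns
scaled <- dat %>%
  mutate(across(where(is.numeric) & !y, ~ as.numeric(scale(.x))))

scaled
```

The selection where(is.numeric) & !y could be swapped for something like ends_with("...") if the columns to scale share a common suffix.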

How to create a "top ten" vector that keeps labels?

I have a data set that has 655 rows and 21 columns. I'm currently looping through each column and need to find the top ten of each, but when I use the head() function, it doesn't keep the labels (they are names of bacteria; each column is a sample). Is there a way to create a sorted subset of the data that sorts the row names along with it?
Right now I am doing
topten <- head(sort(genuscounts[,c(1,i)], decreasing = TRUE) n = 10)
but I am getting an error message since column 1 is the list of names.
Thanks!
Because sort() applies to vectors, it's not going to work with your subset genuscounts[,c(1,i)], because the subset has multiple columns. In base R, you'll want to use order():
thisColumn <- genuscounts[, c(1, i)]
topten <- head(thisColumn[order(thisColumn[, 2], decreasing = TRUE), ], 10)
You could also use arrange() from the dplyr package, which provides a more user-friendly interface:
library(dplyr)
head(arrange(genuscounts[, c(1, i)], desc(.data[[names(genuscounts)[i]]])), 10)
You need the .data[[...]] pronoun because your column name is held in a string rather than written out as a bare name. (The older arrange_() function, which accepted strings, is deprecated.)
Hope this helps!!
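For use inside a loop over the sample columns, the order() approach above can be wrapped in a small function. This is a sketch with made-up counts; top_n_rows is a hypothetical helper name, and the first column is assumed to hold the genus names:

```r
# Hypothetical counts: first column holds the genus names
genuscounts <- data.frame(
  genus   = c("Bacillus", "Vibrio", "Listeria", "Yersinia"),
  sample1 = c(5, 40, 12, 0),
  sample2 = c(7, 3, 25, 9)
)

# Keep the name column together with column i, sorted by column i
top_n_rows <- function(data, i, n = 10) {
  sub <- data[, c(1, i)]
  head(sub[order(sub[[2]], decreasing = TRUE), ], n)
}

# Top two genera for the first sample (column 2)
top2 <- top_n_rows(genuscounts, i = 2, n = 2)
top2
```

Calling top_n_rows(genuscounts, i) for i in 2:ncol(genuscounts) would give one labelled top list per sample.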

Resources