How can I add a column with all 1s to my dataframe? - r

I have a data frame and I want to add a new column with all entries equal to 1. How can I do that?
For example:
col1. col2
    1    2
    4    5
   33    4
    5    3
New column:
col1. col2 col3
    1    2    1
    4    5    1
   33    4    1
    5    3    1

df1$col3 <- 1
This should work. Likewise,
df1 <- data.frame(df1, col3 = 1)
also works.

The simplest option is to use [ (see ?Extract):
df1['col3'] <- 1
One of the good things about using [ instead of $ is that we can pass a variable holding the column name as well:
v1 <- 'col3'
df1[v1] <- 1
But, if we do
df1$v1 <- 1
it creates a column named 'v1' instead of 'col3'.
Other variations without changing the initial object would be
transform(df1, col3 = 1)
cbind(df1, col3 = 1)
NOTE: All of these append the new column as the last column.
Also, the tibble package provides a convenient function add_column which can add a column at a specified position. By default, it adds the column as the last one:
library(tibble)
add_column(df1, col3 = 1)
# col1. col2 col3
#1 1 2 1
#2 4 5 1
#3 33 4 1
#4 5 3 1
But if we need to place it at a specific location, there are arguments for that:
add_column(df1, col3 = 1, .after = 1)
# col1. col3 col2
#1 1 1 2
#2 4 1 5
#3 33 1 4
#4 5 1 3
data
df1 <- structure(list(col1. = c(1, 4, 33, 5), col2 = c(2, 5, 4, 3)),
                 class = "data.frame", row.names = c(NA, -4L))
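For completeness, the same column can also be added with dplyr's mutate (a sketch, assuming the df1 from the data section above; dplyr is not used in the answers above):
library(dplyr)
# the scalar 1 is recycled to the number of rows
mutate(df1, col3 = 1)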

Related

Drop Multiple Columns in R

I have a data set of 80k rows and 874 columns. Some of these columns are empty. I use sum(is.na) in a for loop to determine the indices of the empty columns. Since the first column is not empty, if sum(is.na) is equal to the number of rows of the first column, it means that column is empty.
for (i in 1:ncol(loans)) {
  if (sum(is.na(loans[i])) == nrow(loans[1])) {
    print(i)
  }
}
Now that I know the indices of empty columns, I want to drop them from the data. I thought about storing those indices in an array and dropping them in a loop but I don't think it will work since columns with data will replace the empty columns. How can I drop them?
You should try to provide a toy dataset for your question.
loans <- data.frame(
  a = c(NA, NA, NA),
  b = c(1, 2, 3),
  c = c(1, 2, 3),
  d = c(1, 2, 3),
  e = c(NA, NA, NA)
)
loans[!sapply(loans, function(col) all(is.na(col)))]
sapply loops over the columns of loans and applies the anonymous function, which checks whether all elements are NA. It then simplifies the output to a vector, in this case a logical vector.
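To make that concrete, the intermediate logical vector for the toy data above looks like this (TRUE marks an all-NA column); negating it with ! and indexing loans with it keeps only b, c and d:
sapply(loans, function(col) all(is.na(col)))
#     a     b     c     d     e
#  TRUE FALSE FALSE FALSE  TRUE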
The tidyverse option:
loans[!purrr::map_lgl(loans, ~all(is.na(.x)))]
Does this work:
df <- data.frame(col1 = rep(NA, 5),
                 col2 = 1:5,
                 col3 = rep(NA, 5),
                 col4 = 6:10)
df
col1 col2 col3 col4
1 NA 1 NA 6
2 NA 2 NA 7
3 NA 3 NA 8
4 NA 4 NA 9
5 NA 5 NA 10
df[,which(colSums(df, na.rm = TRUE) == 0)] <- NULL
df
col2 col4
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
Another approach:
df[!apply(df, 2, function(x) all(is.na(x)))]
col2 col4
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
A dplyr solution:
df %>%
select_if(!colSums(., na.rm = TRUE) == 0)
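As a side note, select_if() has since been superseded in dplyr; assuming a recent dplyr (1.0.0 or later), the same filter can be written with select() and where():
library(dplyr)
# keep only the columns that are not entirely NA
df %>% select(where(~ !all(is.na(.x))))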
You can use fundamental skills like if/else and for loops for almost all problems, although a drawback is that this will be slower.
# evaluate each column; if a column is entirely NA, remove it, then move on
# loop backwards so that removing a column does not shift the indices still to be checked
for (i in rev(seq_along(loans))) {
  if (sum(is.na(loans[, i])) == nrow(loans)) {
    loans[, i] <- NULL
  }
}
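If you prefer the idea from the question of collecting the indices first, you can also drop them all in one step rather than inside a loop (a sketch using the toy loans data above):
# indices of the columns that are entirely NA
empty <- which(sapply(loans, function(col) all(is.na(col))))
# drop them all at once; only index if there is something to drop
if (length(empty) > 0) loans <- loans[-empty]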

How to select the value at a specific row number relative to the current row [duplicate]

This question already has answers here:
divide one column by another column's previous value
(4 answers)
Closed 2 years ago.
For example, I have a df as below. I would like to calculate the value at any row of col2 as the value at row+1 of col1 divided by the value at that row of col1. An illustration is below.
How do I code the above idea and keep doing so till the end? Please help.
Col1 Col2
1 Value at row 2 of col1/ value at row 1 of col1?
2 Value at row 3 of col1/ value at row 2 of col1?
3
4
5
There are lots of ways to do this.
Consider your dataframe as:
df <- data.frame(col1 = 1:5)
You can use lead in dplyr:
library(dplyr)
df %>% mutate(col2 = lead(col1)/col1)
# col1 col2
#1 1 2.00
#2 2 1.50
#3 3 1.33
#4 4 1.25
#5 5 NA
Or shift in data.table:
library(data.table)
setDT(df)[, col2 := shift(col1, type = "lead")/col1]
In base R:
a. With head and tail:
transform(df, col2 = c(tail(col1, -1)/head(col1, -1), NA))
b. By indexing:
transform(df, col2 = c(col1[-1]/col1[-nrow(df)], NA))

Count unique instances in rows between two columns given by index

Hi, I have an example data frame as follows. What I would like to do is count the number of instances of a particular value (for example, 1) that occur between the columns given by the indices ind1 and ind2. The output would be a vector with a number for each row giving the number of instances for that row.
COL1 <- c(1,1,1,NA,1,1)
COL2 <- c(1,NA,NA,1,1,1)
COL3 <- c(1,1,1,1,1,1)
ind1 <- c(1,2,1,2,1,2)
ind2 <- c(3,3,2,3,3,3)
Data <- data.frame(COL1, COL2, COL3, ind1, ind2)
Data
COL1 COL2 COL3 ind1 ind2
1 1 1 1 3
1 NA 1 2 3
1 NA 1 1 2
NA 1 1 2 3
1 1 1 1 3
1 1 1 2 3
So the example output should look like:
3, 1, 1, 2, 3, 2
My actual data set has many rows, so I want to avoid loops as much as possible to save time. I was thinking an apply function with sum(which(x == 1)) might work; I'm just not sure how to get the column values from the given indices.
An option would be to loop over the rows, extract the values based on the sequence of indices from 'ind1' to 'ind2', and get the count with table:
apply(Data, 1, function(x) table(x[x['ind1']:x['ind2']]))
#[1] 3 1 1 2 3 2
Or using sum
apply(Data, 1, function(x) sum(x[x['ind1']:x['ind2']] == 1, na.rm = TRUE))
Or create a logical matrix and then use rowSums
rowSums(Data[1:3] * NA^!((col(Data[1:3]) >= Data$ind1) &
                         (col(Data[1:3]) <= Data$ind2)), na.rm = TRUE)
#[1] 3 1 1 2 3 2
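To unpack the NA^ trick: for in-range cells the negated condition is FALSE and NA^FALSE is 1, while for out-of-range cells it is TRUE and NA^TRUE is NA, so the expression builds a mask of 1s and NAs. Multiplying by Data[1:3] keeps only the in-range values (here all 1s), and rowSums(..., na.rm = TRUE) adds them up per row:
# the mask: 1 where the column index lies between ind1 and ind2, NA elsewhere
NA^!((col(Data[1:3]) >= Data$ind1) & (col(Data[1:3]) <= Data$ind2))
#      [,1] [,2] [,3]
# [1,]    1    1    1
# [2,]   NA    1    1
# [3,]    1    1   NA
# [4,]   NA    1    1
# [5,]    1    1    1
# [6,]   NA    1    1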

Create then populate columns in a dataframe

Hello, I'm trying to find a way to create new columns in a dataframe and then populate them.
For example:
id = c(2, 3, 5)
v1 = c(2, 1, 7)
v2 = c(1, 9, 5)
duration=c(v1+v2)
df = data.frame(id,v1,v2,duration,stringsAsFactors=FALSE)
id v1 v2 duration
1 2 2 1 3
2 3 1 9 10
3 5 7 5 12
Now I want to create new columns by dividing each value of a row by the 'duration' of said row. I know how to do it manually, but it is prone to errors and not really elegant...
df$I_v1=v1/duration
df$I_v2=v2/duration
Or is df <- df %>% mutate(I_v1 = v1/duration) quicker/better?
id v1 v2 duration I_v1 I_v2
1 2 2 1 3 0.6666667 0.3333333
2 3 1 9 10 0.1000000 0.9000000
It works, but I would like to know if it's possible to create (and name) the columns and populate them automatically.
Say that you have a cols vector containing the names of the columns you want to manipulate. In your example:
cols<-c("v1","v2")
Then you can try:
df[paste0("I_",cols)]<-df[cols]/df$duration
# id v1 v2 duration I_v1 I_v2
#1 2 2 1 3 0.6666667 0.3333333
#2 3 1 9 10 0.1000000 0.9000000
#3 5 7 5 12 0.5833333 0.4166667
You can use transform():
df <- data.frame(id=c(2, 3, 5), v1=c(2, 1, 7), v2=c(1, 9, 5))
df$duration <- df$v1 + df$v2 # or ... <- with(df, v1 + v2)
df_new <- transform(df, I_v1=v1/duration, I_v2=v2/duration )
... or (if you have many columns v1, v2, ...):
as.matrix(df[, 2:3])/df$duration # or with cbind():
cbind(df, as.matrix(df[, 2:3])/df$duration)
(similar to the answer from nicola)
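Since the question mentions mutate, here is also a tidyverse sketch using dplyr's across() (this assumes dplyr 1.0.0 or later; it is not part of the answers above):
library(dplyr)
# build I_v1 and I_v2 by dividing each of v1, v2 by duration
df %>% mutate(across(c(v1, v2), ~ .x / duration, .names = "I_{.col}"))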
All data frames have a row names attribute, a character vector whose length is the number of rows, with no duplicates or missing values. You can name the rows with:
row.names(x) <- value
Arguments:
x
object of class "data.frame", or any other class for which a method has been defined.
value
an object to be coerced to character unless it is an integer vector.

Selecting between rows in a dataframe

I am working with a rather noisy data set and I was wondering if there was a good way to selectively choose between two rows of data within a group or leave them alone. Logic-wise I want to filter by group and then build an if-else type control structure to compare rows based on the value of a second column.
Example:
Row ID V1 V2
1 1 blah 1.2
2 1 blah NA
3 2 foo 2.3
4 3 bar NA
5 3 bar NA
I want to group by ID (1, 2, 3), then go to column V2 and choose, for example, row 1 over row 2 because row 2 has NA. But for rows 4 and 5, where both are NA, I want to just leave them alone.
Thanks,
What you need really depends on what exactly you have. In the case of NAs, this might help:
df <- data.frame(
  Row = c(1, 2, 3, 4, 5),
  ID = c(1, 1, 2, 3, 3),
  V1 = c("bla", "bla", "foo", "bla", "bla"),
  V2 = c(1.2, NA, 2.3, NA, NA),
  stringsAsFactors = FALSE)
df <- df[complete.cases(df), ]
A solution using purrr. The idea is to split the data frame by ID. After that, apply a user-defined function, which checks whether all the elements in V2 are NA. If TRUE, it returns the original data frame; otherwise it returns the subset of the data frame with the NA rows filtered out using na.omit. map_dfr is similar to lapply, but it can combine all the data frames in a list automatically. dt2 is the final output.
library(purrr)
dt2 <- dt %>%
  split(.$ID) %>%
  map_dfr(function(x) {
    if (all(is.na(x$V2))) {
      return(x)
    } else {
      return(na.omit(x))
    }
  })
dt2
# Row ID V1 V2
# 1 1 1 blah 1.2
# 2 3 2 foo 2.3
# 3 4 3 bar NA
# 4 5 3 bar NA
DATA
dt <- read.table(text = "Row ID V1 V2
1 1 blah 1.2
2 1 blah NA
3 2 foo 2.3
4 3 bar NA
5 3 bar NA",
header = TRUE, stringsAsFactors = FALSE)
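A grouped dplyr filter is another way to express the same rule (a sketch, assuming the dt defined in the DATA section): keep a row if its V2 is not NA, or if every V2 in its ID group is NA.
library(dplyr)
dt %>%
  group_by(ID) %>%
  filter(all(is.na(V2)) | !is.na(V2)) %>%
  ungroup()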
