I'm new to R. In a data frame, I wanted to create a new column #21 that is equal to the sum of column #1 to #20,row by row.
I knew I could do
df$Col21<-df$Col1+df$Col2+.....+df$Col20
But is there a more concise expression?
Also, can I achieve this if using column names not numbers? Thanks!
There is rowSums:
df$Col21 = rowSums(df[,1:20])
should do the trick, and with names:
df$Col21 = rowSums(df[,paste("Col", 1:20, sep="")])
With leading zeros and 3 digits, try:
df$Col21 = rowSums(df[,sprintf("Col%03d", 1:20, sep="")])
I find the dplyr functions for column selection very intuitive, like starts_with(), ends_with(), contains(), matches() and num_range():
df <- as.data.frame(replicate(20, runif(10)))
names(df) <- paste0("Col", 1:20)
library(dplyr)
# e.g.
summarise_each(df, funs(sum), starts_with("Col"))
# or
rowSums(select(df, contains("8")))
Related
My data is imported into R as a list of 60 tibbles each with 13 columns and 8 rows. I want to detect outliers defined as 2*sd by comparing each value in column "2" to the mean of all values of column "2" in the same row.
I know that I am on a wrong path with these lines, as I am not comparing the single values
lapply(list, function(x){
if(x$"2">(mean(x$"2")) + (2*sd(x$"2"))||x$"2"<(mean(x$"2")) - (2*sd(x$"2"))) {}
})
Also I was hoping to replace all values that are thus identified as outliers by the corresponding mean calculated from the 60 values in the same position as the outlier while keeping everything else, but I am also quite unsure how to do that.
Thank you!
you haven't added an example of your code so I've made a quick and simple example to demonstrate my answer. I think this would be much more straightforward logic if you first combine the list of tibbles into a single tibble. This allows you to do everything you want in a simple dplyr pipe, ultimately identifying outliers by 1's in the 'outlier' column:
library(tidyverse)
tibble1 <- tibble(colA = c(seq(1,20,1), 150),
colB = seq(0.1,2.1,0.1),
id = 1:21)
tibble2 <- tibble(colA = c(seq(101,120,1), -150),
colB = seq(21,41,1),
id = 1:21)
# N.B. if you don't have an 'id' column or equivalent
# then it makes it a lot easier if you add one
# The 'id' column is essentially shorthand for an index
tibbleList <- list(tibble1, tibble2)
joinedTibbles <- bind_rows(tibbleList, .id = 'tbl')
res <- joinedTibbles %>%
group_by(id) %>%
mutate(meanA = mean(colA),
sdA = sd(colA),
lowThresh = meanA - 2*sdA,
uppThresh = meanA + 2*sdA,
outlier = ifelse(colA > uppThresh | colA < lowThresh, 1, 0))
I have two dataframes. I want to replace the ids in dataframe1 with generic ids. In dataframe2 I have mapped the ids from dataframe1 with the generic ids.
Do I have to merge the two dataframes and after it is merged do I delete the column I don't want?
Thanks.
With dplyr
library(dplyr)
left_join(df1, df2, by = 'ids')
We can use merge and then delete the ids.
dataframe1 <- data.frame(ids = 1001:1010, variable = runif(min=100,max = 500,n=10))
dataframe2 <- data.frame(ids = 1001:1010, generics = 1:10)
result <- merge(dataframe1,dataframe2,by="ids")[,-1]
Alternatively we can use match and replace by assignment.
dataframe1$ids <- dataframe2$generics[match(dataframe1$ids,dataframe2$ids)]
Subsetting data frames isn't very difficult in R: hope this helps, you didn't provide much code so I hope this will be of help to you:
#create 4 random columns (vectors) of data, and merge them into data frames:
a <- rnorm(n=100,mean = 0,sd=1)
b <- rnorm(n=100,mean = 0,sd=1)
c <- rnorm(n=100,mean = 0,sd=1)
d<- rnorm(n=100,mean = 0,sd=1)
df_ab <- as.data.frame(cbind(a,b))
df_cd <- as.data.frame(cbind(c,d))
#if you want column d in df_cd to equal column a in df_ab simply use the assignment operator
df_cd$d <- df_ab$a
#you can also use the subsetting with square brackets:
df_cd[,"d"] <- df_ab[,"a"]
I would like to convert a dataframe filled with frequencies into a dataframe filled with percentage by row using dplyr.
My data set has the particularity to get filled with others variables and I just want to calculate the percentage for a set of columns defined by a vector of names. Plus, I want to use the dplyr library.
sim_dat <- function() abs(floor(rnorm(26)*10))
df <- data.frame(a = letters, b = sim_dat(), c = sim_dat(), d = sim_dat()
, z = LETTERS)
names_to_transform <- names(df)[2:4]
df2 <- df %>%
mutate(sum_freq_codpos = rowSums(.[names_to_transform])) %>%
mutate_each(function(x) x / sum_freq_codpos, names_to_transform)
# does not work
Any idea on how to do it? I have tried with mutate_at and mutate_each but I can't get it to work.
you're almost there!:
df2 <- df %>%
mutate(sum_freq_codpos = rowSums(.[names_to_transform])) %>%
mutate_at(names_to_transform, funs(./sum_freq_codpos))
the dot . roughly translates to "the object i am manipulating here", which in this call is "the focal variable in names_to_transform".
I am trying to modify the values of a column for rows in a specific range. This is my data:
df = data.frame(names = c("george","michael","lena","tony"))
and I want to do the following using dplyr:
df[2:3,] = "elsa"
My attempt at it is the following, but it doesn't seem to work:
df = cbind(df, rows = as.integer(rownames(df)))
dplyr::mutate(df, ifelse(rows %in% c(2,3), names = "elsa" , names = names))
which gives the result:
Error: unused arguments (names = "elsa", names = c(1, 3, 2, 4))
Thanks for any advice.
This question is a little vague, but I think OP is trying to just replace certain values in a data frame using indexing. As the comment above noted the example dataframe's column is comprised of a factor variable, which makes replacing the value behave differently than you might expect. There are two ways to get around this.
The first (more verbose) way is to force df$names to be a character variable instead of a factor. Then using indexing to select the value you'd like to change and replace it:
df$names = as.character(df$names)
df$names[c(2,3)] = "elsa"
Alternatively, you can set stringsAsFactors = TRUE and proceed as above.
df = data.frame(names = c("george","michael","lena","tony"), stringsAsFactors = FALSE)
df$names[c(2:3)] = "elsa"
names
1 george
2 elsa
3 elsa
4 tony
Definitely check out ?data.frame to get a fuller explanation.
The factor answers are faster, but you can do it with dplyr like this (notice that the column must be of type character and not factor):
df <- data.frame(names = c("george","michael","lena","tony"), stringsAsFactors=F)
oldnames <- c("michael", "lena")
df <- mutate(df, names=ifelse(names %in% oldnames, "elsa", names))
Another way is to do something like
oldnames <- c("michael", "lena")
df$names[df$names %in% oldnames] <- "elsa"
Convert names to a character vector explicitly and use replace:
df %>% mutate(names = replace(as.character(names), 2:3, "elsa"))
Note: If names were already a character vector we could have done just:
df %>% mutate(names = replace(names, 2:3, "elsa"))
We can do this using data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), specify the row index as i and assign (:=) 'elisa' to the 'names'. As the OP mentioned about large dataset, using the := from data.table will be extremely fast.
library(data.table)
setDT(df)[2:3, names := 'elisa']
df
# names
#1: george
#2: elisa
#3: elisa
#4: tony
I need to transpose a large data frame and so I used:
df.aree <- t(df.aree)
df.aree <- as.data.frame(df.aree)
This is what I obtain:
df.aree[c(1:5),c(1:5)]
10428 10760 12148 11865
name M231T3 M961T5 M960T6 M231T19
GS04.A 5.847557e+03 0.000000e+00 3.165891e+04 2.119232e+04
GS16.A 5.248690e+04 4.047780e+03 3.763850e+04 1.187454e+04
GS20.A 5.370910e+03 9.518396e+03 3.552036e+04 1.497956e+04
GS40.A 3.640794e+03 1.084391e+04 4.651735e+04 4.120606e+04
My problem is the new column names(10428, 10760, 12148, 11865) that I need to eliminate because I need to use the first row as column names.
I tried with col.names() function but I haven't obtain what I need.
Do you have any suggestion?
EDIT
Thanks for your suggestion!!! Using it I obtain:
df.aree[c(1:5),c(1:5)]
M231T3 M961T5 M960T6 M231T19
GS04.A 5.847557e+03 0.000000e+00 3.165891e+04 2.119232e+04
GS16.A 5.248690e+04 4.047780e+03 3.763850e+04 1.187454e+04
GS20.A 5.370910e+03 9.518396e+03 3.552036e+04 1.497956e+04
GS40.A 3.640794e+03 1.084391e+04 4.651735e+04 4.120606e+04
GS44.A 1.225938e+04 2.681887e+03 1.154924e+04 4.202394e+04
Now I need to transform the row names(GS..) in a factor column....
You'd better not transpose the data.frame while the name column is in it - all numeric values will then be turned into strings!
Here's a solution that keeps numbers as numbers:
# first remember the names
n <- df.aree$name
# transpose all but the first column (name)
df.aree <- as.data.frame(t(df.aree[,-1]))
colnames(df.aree) <- n
df.aree$myfactor <- factor(row.names(df.aree))
str(df.aree) # Check the column types
You can use the transpose function from the data.table library. Simple and fast solution that keeps numeric values as numeric.
library(data.table)
# get data
data("mtcars")
# transpose
t_mtcars <- transpose(mtcars)
# get row and colnames in order
colnames(t_mtcars) <- rownames(mtcars)
rownames(t_mtcars) <- colnames(mtcars)
df.aree <- as.data.frame(t(df.aree))
colnames(df.aree) <- df.aree[1, ]
df.aree <- df.aree[-1, ]
df.aree$myfactor <- factor(row.names(df.aree))
Take advantage of as.matrix:
# keep the first column
names <- df.aree[,1]
# Transpose everything other than the first column
df.aree.T <- as.data.frame(as.matrix(t(df.aree[,-1])))
# Assign first column as the column names of the transposed dataframe
colnames(df.aree.T) <- names
With tidyr, one can transpose a dataframe with "pivot_longer" and then "pivot_wider".
To transpose the widely used mtcars dataset, you should first transform rownames to a column (the function rownames_to_column creates a new column, named "rowname").
library(tidyverse)
mtcars %>%
rownames_to_column() %>%
pivot_longer(!rowname, names_to = "col1", values_to = "col2") %>%
pivot_wider(names_from = "rowname", values_from = "col2")
You can give another name for transpose matrix
df.aree1 <- t(df.aree)
df.aree1 <- as.data.frame(df.aree1)