How to add column to dataframe? - r

I'm a beginner to "R", so this might embarrass me, but nonetheless:
How do I add a column to a dataframe? Here is my attempt at adding a normally distributed dataset as a column to an empty dataframe:
e = rnorm(1000)
mydf <- data.frame()
mydf[["e"]] <- e
and it gives the error:
Error in [[<-.data.frame(*tmp*, "e", value = c(-1.09398526454771,
: replacement has 1000 rows, data has 0
This is the resource I used for this: here. I even tried converting to a vector using mydf[["e"]] <- as.vector(e), but this still fails. Help? Thank you.
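The error message says the replacement has 1000 rows while the data has 0: data.frame() called with no arguments creates a frame with zero rows, and [[<- does not grow it. Below is a minimal sketch of two ways around this, assuming the goal is simply a data frame whose first column is e. The set.seed() call and the name mydf2 are additions for illustration only.

```r
set.seed(42)          # assumption: fixed seed, only to make the example repeatable
e <- rnorm(1000)

# Option 1: create the data frame together with its first column
mydf <- data.frame(e = e)

# Option 2: start from a frame that already has 1000 rows and no columns,
# then add columns one at a time
mydf2 <- data.frame(row.names = seq_along(e))
mydf2[["e"]] <- e

dim(mydf)
dim(mydf2)
```

Option 1 is usually the simpler choice; option 2 only helps when the columns really have to be added step by step.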

Related

Split Data Frame and call subframe rows by their index

This is a very basic R programming question but I haven't found the answer anywhere, would really appreciate your help:
I split my large dataframe into 23 subframes of 4 rows in length as follows:
DataframeSplits <- split(Dataframe,rep(1:23,each=4))
Say I want to call the second subframe I can:
DataframeSplits[2]
But what if I want to call a specific row of that subframe (using index position)?
I was hoping for something like this (say I call 2nd subframe's 2nd row):
DataframeSplits[2][2,]
But that doesn't work with the error message
Error in DataframeSplits[2][2, ] : incorrect number of dimensions
If you want to subset the list that split returns and use the result for further subsetting, you must use double square brackets ([[ ]]) to get to the sub-data.frame itself. You can then subset that data frame with single brackets, as you already tried:
Dataframe <- data.frame(x = rep(c("a", "b", "c", "d"), 23), y = 1)
DataframeSplits <- split(Dataframe,rep(1:23,each=4))
DataframeSplits[[2]][2,]
# x y
# 6 b 1
More info on subsetting can be found in the excellent book by Hadley Wickham.
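To see why the original attempt failed, it may help to compare what the two operators return. The sketch below reuses the example data from this answer; the variable second_row is introduced here only for illustration.

```r
Dataframe <- data.frame(x = rep(c("a", "b", "c", "d"), 23), y = 1)
DataframeSplits <- split(Dataframe, rep(1:23, each = 4))

class(DataframeSplits[2])     # "list": still a list, now of length 1
class(DataframeSplits[[2]])   # "data.frame": the element itself

# split() names the elements after the grouping values, so the same
# sub-frame can also be reached by name
second_row <- DataframeSplits[["2"]][2, ]
second_row
```

A list of length 1 has no rows or columns, which is why [2, ] on DataframeSplits[2] reports an incorrect number of dimensions.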

How can I select certain columns in a dataframe based on their number of valid values (except NA) in R?

I'm using R, and I have a data frame with multiple columns. I want to automatically count the valid (non-NA) values in each column, select the columns in which at least 50% of the rows hold valid values, and save them in a new data frame.
Can anybody help me do this? Thank you very much.
Can the code be applied to an arbitrary number of columns?
Using the purrr package, you can compute the proportion of missing values in each column:
pct_missing <- purrr::map_dbl(df,~mean(is.na(.x)))
After that, you can select those columns that have less than 50% missing values by their names.
selected_column <- colnames(df)[pct_missing < 0.5]
To create a new dataset, you may use:
library(dplyr)
df_new <- df %>% select(all_of(selected_column))
You can also write a function in base R that automatically retrieves the columns matching the criterion:
Function:
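ColSel <- function(df){
  # TRUE for columns in which fewer than 50% of the values are NA
  vals <- apply(df, 2, function(fo) mean(is.na(fo))) < .5
  # drop = FALSE keeps a data frame even if only one column qualifies
  return(df[, vals, drop = FALSE])
}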
Some toy data
## example
df1 <- data.frame(
  a = c(runif(19), NA),
  b = c(rep(NA, 11), runif(9)),
  d = rep(NA, 20),
  e = runif(20)
)
Test
df2 <- ColSel(df1)
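The same idea can be written more compactly with colMeans(). The sketch below is one possible variant, not the answerer's code: the function name select_valid_cols and its min_valid argument are hypothetical, and it keeps columns with at least (rather than strictly more than) the given share of valid values, which matches the wording of the question.

```r
# Hypothetical helper: keep columns whose share of non-NA values is at
# least min_valid (0.5 corresponds to the 50% rule from the question)
select_valid_cols <- function(df, min_valid = 0.5) {
  valid_share <- colMeans(!is.na(df))
  df[, valid_share >= min_valid, drop = FALSE]
}

toy <- data.frame(
  a = c(1:19, NA),          # 95% valid
  b = c(rep(NA, 11), 1:9),  # 45% valid
  d = rep(NA, 20),          # 0% valid
  e = 1:20                  # 100% valid
)

names(select_valid_cols(toy))        # "a" "e"
names(select_valid_cols(toy, 0.4))   # "a" "b" "e"
```

Because is.na() works on the whole data frame at once, the function handles any number of columns.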

How to assign the output of a sapply loop to the original columns in a data frame without losing other columns

I have a data frame with several columns that hold string answers from different assessors, who used upper and lower case at random in their answers. I want to convert everything to lower case. I have code that works as follows:
# Creating a reproducible data frame similar to what I am working with
dfrm <- data.frame(a = sample(names(islands))[1:20],
                   b = sample(unname(islands))[1:20],
                   c = sample(names(islands))[1:20],
                   d = sample(unname(islands))[1:20],
                   e = sample(names(islands))[1:20],
                   f = sample(unname(islands))[1:20],
                   g = sample(names(islands))[1:20],
                   h = sample(unname(islands))[1:20])
# This is how I did it originally by writing everything explicitly:
dfrm1 <- dfrm
dfrm1$a <- tolower(dfrm1$a)
dfrm1$c <- tolower(dfrm1$c)
dfrm1$e <- tolower(dfrm1$e)
dfrm1$g <- tolower(dfrm1$g)
head(dfrm1) #Works as intended
The problem is that as the number of assessors increases, I keep making copy-paste errors. I tried to simplify my code by writing a function for tolower and using sapply to loop over the columns, but the final data frame does not look like what I wanted:
# function and sapply:
dfrm2 <- dfrm
my_list <- c("a", "c", "e", "g")
my_low <- function(x){dfrm2[,x] <- tolower(dfrm2[,x])}
sapply(my_list, my_low) #Didn't work
# Alternative approach:
dfrm2 <- as.data.frame(sapply(my_list, my_low))
head(dfrm2) #Lost the numbers
What am I missing?
I know this must be a very basic concept that I'm not getting. There was this question and answer that I simply couldn't follow, and this one where my non-working solution simply seems to work. Any help appreciated, thanks!
Maybe you want to create a logical vector that selects the columns to change and run an apply function only over those columns.
# only choose non-numeric columns
changeCols <- !sapply(dfrm, is.numeric)
# change values of selected columns to lower case
dfrm[changeCols] <- lapply(dfrm[changeCols], tolower)
If you have other types of columns, say logical, you could also be more explicit about the types of columns that you want to change. For example, to select only factor and character columns, use:
changeCols <- sapply(dfrm, function(x) is.factor(x) | is.character(x))
For your first attempt, if you want the assignments to your data frame dfrm2 to stick, use the <<- assignment operator:
my_low <- function(x){ dfrm2[,x] <<- tolower(dfrm2[,x]) }
sapply(my_list, my_low)
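The first attempt in the question fails because the assignment inside my_low changes only a local copy of dfrm2, which is discarded when the function returns. A sketch that avoids <<- altogether is to assign the lapply result back to the named columns. The small data frame below is a hypothetical stand-in for the assessors' data.

```r
# Hypothetical stand-in for the assessors' data
dfrm2 <- data.frame(
  a = c("Java", "BORNEO", "Cuba"),
  b = c(49, 280, 43),
  c = c("TIMOR", "Luzon", "hainan"),
  stringsAsFactors = FALSE
)
my_list <- c("a", "c")

# Replace only the listed columns; all other columns are left untouched
dfrm2[my_list] <- lapply(dfrm2[my_list], tolower)

str(dfrm2)
```

This keeps the numeric columns in place and needs no helper function, so adding more assessors only means extending my_list.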

Adding Factor Scores to the Data Set in R using cbind

I am having difficulties adding factor scores to the original data set. It is not a difficult procedure at all, as is described here. In my case, however, the code below produces the following error:
fa <- factanal(data, factors=2, rotation="promax", scores="regression")
data <- cbind(data, fa$scores)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 889, 851
It would be no surprise to receive this error, if the row numbers really differed, but when I type "fa$scores" and hit enter, R displays all of the 889 rows. The dim function still returns 851 though:
dim(fa$scores)
[1] 851 2
Can you please clarify for me why I am receiving this error, and if possible, what I can do to add the factor scores to the data successfully?
Thanks!
fa$scores returns a matrix with rownames that you can use to join/merge the data together.
First, make sure data has rownames. If not, give it dummy names like:
rownames(data) <- 1:nrow(data)
Then run fa <- factanal(...), and convert fa$scores to a data frame of factor scores. E.g.,
fs <- data.frame(fa$scores)
Then, add a rowname column to both your original data and fs:
data$rowname <- rownames(data)
fs$rowname <- rownames(fs)
Then left join to data (using dplyr package):
library(dplyr)
left_join(data, fs, by = "rowname")
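The same row-name matching can be sketched in base R with match(). The example below does not call factanal; it uses a small made-up data frame and a made-up score matrix that covers only some of the rows, on the assumption that the real fa$scores is likewise missing the rows that were dropped from the analysis. All names and values here are hypothetical.

```r
# Hypothetical stand-ins: 'data' has 6 rows, 'scores' covers only 4 of them
set.seed(1)
data <- data.frame(x = rnorm(6), y = rnorm(6))
rownames(data) <- seq_len(nrow(data))
kept <- c("1", "2", "4", "6")
scores <- matrix(rnorm(8), ncol = 2,
                 dimnames = list(kept, c("Factor1", "Factor2")))

# Position of each row of 'data' in 'scores' (NA where no score exists)
idx <- match(rownames(data), rownames(scores))

# Add one column per factor; rows without a score get NA
combined <- data
for (nm in colnames(scores)) {
  combined[[nm]] <- scores[idx, nm]
}
combined
```

The result has as many rows as the original data, with NA in the score columns for rows that have no factor score.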

R: not meaningful as factors

What is the best practice for handling this particular problem when it comes up? For example, I have created a data frame:
dat<- sqlQuery(con,"select * from mytable")
in which my table looks like:
ID RESULT GROUP
-- ------ -----
1 Y A
2 N A
3 N B
4 Y B
5 N A
Here ID is an int, and RESULT and GROUP are both factors.
The problem is that when I want to do something like:
tapply(dat$RESULT,dat$GROUP,sum)
I get complaints about columns being a factor:
Error in Summary.factor(c(2L,2L,2L,2L,1L,2L,1L,2L,2L,1L,1L, :
sum not meaningful for factors
Given that factors are essential for use in things like ggplot, how does everyone else handle this?
Setting stringsAsFactors=FALSE and rerunning gives
tapply(dat$RESULT,dat$GROUP,sum)
Error in FUN(X[[1L]], ...) : invalid "type" (character) of argument
so I'm not sure merely setting stringsAsFactors=FALSE is the right approach
I assume you want to sum up the "Y"s in the RESULT column.
As suggested by @akrun, one possibility is to use table()
with(dat,table(GROUP,RESULT))
If you want to stick with the tapply(), you can change the type of the RESULT column to a boolean:
dat$RESULT <- dat$RESULT=="Y"
tapply(dat$RESULT,dat$GROUP,sum)
If your goal is to have some columns as factors and others as strings, you can convert only selected columns of the result to factors, e.g. with
dat<- sqlQuery(con,"select ID,RESULT,GROUP from mytable",as.is=2)
As the read.table man page (referenced by the sqlQuery man page) puts it: as.is is either a vector of logicals (values are recycled if necessary), or a vector of numeric or character indices which specify which columns should not be converted to factors.
But then again, you need either to use table() or to turn the result into a boolean.
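Both suggestions can be tried without a database connection. The sketch below rebuilds the table from the question by hand (so dat here is a stand-in for the sqlQuery result) and counts the "Y" results per group without overwriting the factor column. The names y_per_group and counts are introduced only for this example.

```r
# Stand-in for the sqlQuery() result shown in the question
dat <- data.frame(
  ID     = 1:5,
  RESULT = factor(c("Y", "N", "N", "Y", "N")),
  GROUP  = factor(c("A", "A", "B", "B", "A"))
)

# Count the "Y" results per group; the comparison yields a logical vector,
# which sum() can handle, and RESULT itself stays a factor
y_per_group <- tapply(dat$RESULT == "Y", dat$GROUP, sum)
y_per_group

# Cross-tabulation of both columns, as in the table() suggestion
counts <- with(dat, table(GROUP, RESULT))
counts
```

Keeping RESULT as a factor and comparing on the fly means the column can still be used directly in plotting code that expects factors.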
I'm not clear what your question is, either. If you're just trying to sum the Y's, how about:
library(dplyr)
df <- data.frame(ID = 1:5,
                 RESULT = as.factor(c("Y", "N", "N", "Y", "N")),
                 GROUP = as.factor(c("A", "A", "B", "B", "A")))
df %>% mutate(logRes = (RESULT == "Y")) %>%
  summarise(sum = sum(logRes))
