Subset using loop over data frame in R

I have a data frame with 50 variables that should take values 1-5, but some of them contain values greater than 5, such as 18656. I need to remove all of these values from the data frame. Is there a function that can do this?
I am using this code:
func <- function(df_likert, col){
df_likert <- subset(df_likert, col <= 5)
}
for (i in names(df_likert)) {
func(df_likert, i)
}

library(dplyr)
# example dataset
dt = data.frame(x1 = c(1,2,3,4,5),
x2 = c(3,3,4,5,10),
x3 = c(10,1,1,2,3))
# original dataset
dt
# x1 x2 x3
# 1 1 3 10
# 2 2 3 1
# 3 3 4 1
# 4 4 5 2
# 5 5 10 3
# update dataset
dt %>%
mutate_all(function(x) ifelse(x > 5, NA, x)) %>%
na.omit()
# x1 x2 x3
# 2 2 3 1
# 3 3 4 1
# 4 4 5 2
This solution removes every row that contains a value greater than 5, as you asked. If you drop the na.omit() step, those values are replaced with NA and the rows are kept.
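For comparison, here is a minimal base R sketch of the same row filter, assuming every column is numeric. The loop in the question has no effect because `col` holds the column name (a string), so `col <= 5` never looks at the column's values, and the subset made inside the function is never assigned back to the original object. The names `keep` and `dt_clean` are illustrative only.

```r
# Example data from the answer above
dt <- data.frame(x1 = c(1, 2, 3, 4, 5),
                 x2 = c(3, 3, 4, 5, 10),
                 x3 = c(10, 1, 1, 2, 3))

# dt > 5 is a logical matrix; a row is kept when it has no value above 5.
# na.rm = TRUE means existing NAs do not cause a row to be dropped.
keep <- rowSums(dt > 5, na.rm = TRUE) == 0

dt_clean <- dt[keep, ]
dt_clean
```

Working on the whole data frame at once avoids looping over column names.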

Related

how to subset every 6 rows in R?

I need to subset my data six rows at a time. How can I do that in R?
data:
col1 : 1,2,3,4,5,6,7,8,9,10
col2 : a1,a2,a3,a4,a5,a6,a7,a8,a9,a10
I want subsets of 6 rows each: the first subset should contain rows 1:6, and the next should contain rows 7:nrow(data). I have tried using the seq function:
seqData <- seq(1,nrow(data),6)
Output: this gives only 1 and 7, but I want rows 1 to 6 first and then rows 7 to nrow(data).
How can I get output like that?
Will this work:
set.seed(1)
dat <- data.frame(c1 = sample(1:5,12,T),
c2 = sample(1:5,12,T))
dat
c1 c2
1 1 2
2 4 2
3 1 1
4 2 5
5 5 5
6 3 1
7 2 1
8 3 5
9 3 5
10 1 2
11 5 2
12 5 1
split(dat, rep(1:ceiling(nrow(dat)/6), each = 6, length.out = nrow(dat)))
$`1`
c1 c2
1 1 2
2 4 2
3 1 1
4 2 5
5 5 5
6 3 1
$`2`
c1 c2
7 2 1
8 3 5
9 3 5
10 1 2
11 5 2
12 5 1
The function below creates a numeric vector of integers that increases by 1 every n rows, and uses this vector to split the data as needed.
data <- data.frame(col1 = 1:10, col2 = paste0("a", 1:10))
split_nrows <- function(x, n){
f <- c(1, rep(0, n - 1))
f <- rep(f, length.out = NROW(x))
f <- cumsum(f)
split(x, f)
}
split_nrows(data, 6)
Here's a simple example with mtcars that yields a list of 6 subset dfs.
nrows <- nrow(mtcars)
breaks <- seq(1, nrows, 6)
listdfs <- lapply(breaks, function(x) mtcars[x:(x+5), ]) # increment by 5 not 6
listdfs[[6]] <- listdfs[[6]][1:2, ] #last df: remove 4 NA rows (36 - 32)
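The manual clean-up of the last element can be avoided by computing a group number for each row. The following is a sketch only; `split_chunks` is a hypothetical helper name, not part of any answer above.

```r
# Hypothetical helper: split a data frame into consecutive chunks of n rows
split_chunks <- function(df, n) {
  # ceiling(row index / n) is 1 for rows 1..n, 2 for rows n+1..2n, and so on
  split(df, ceiling(seq_len(nrow(df)) / n))
}

chunks <- split_chunks(mtcars, 6)
length(chunks)        # number of chunks
sapply(chunks, nrow)  # rows per chunk; the last chunk holds the leftover rows
```

Because the grouping vector has exactly one entry per row, no NA rows are created when the row count is not a multiple of n.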

Row-wise count of zeros and NAs in R for columns

I want to count the rows by the number of zeros and NAs they contain in the data frame, for example the number of rows that have a zero in exactly one column, and so on.
The code for the data frame is below; the counts are needed for columns M1 to M5.
Output is needed for both zeros and NAs; the desired output is shown at the link below:
https://imgur.com/y9qeyhV
id <- 1:9
M1 <- c(0,NA,1,0,0,NA,NA,1,7)
M2 <- c(NA,NA,0,NA,0,NA,NA,1,7)
M3 <- c(1,NA,0,0,0,1,NA,1,7)
M4 <- c(0,NA,0,3,0,NA,NA,1,7)
M5 <- c(5,0,0,NA,0,0,NA,0,NA)
data <- cbind(id,M1,M2,M3,M4,M5)
data <- as.data.frame(data)
Desired Output:
Try this
table(rowSums(is.na(data)))
# 0 1 2 3 4 5
# 3 2 1 1 1 1
table(factor(rowSums(data == 0, na.rm = T), levels = 0:5))
# 0 1 2 3 4 5
# 2 3 2 0 1 1
You can also pass the code above to data.frame() or as.data.frame() to get a data.frame object like the one your expected output shows.
For NA:
data.frame(table(rowSums(is.na(data[startsWith(names(data),"M")]))))
Var1 Freq
1 0 3
2 1 2
3 2 1
4 3 1
5 4 1
6 5 1
For zeros
data.frame(table(factor(rowSums(0==data[startsWith(names(data),"M")],TRUE),0:5)))
Var1 Freq
1 0 2
2 1 3
3 2 2
4 3 0
5 4 1
6 5 1
apply(data, 1, function(x) length(x[is.na(x)]))
This will give you a vector. Each element corresponds to a row and its value is the number of NA elements in that row.
My solution is kind of complicated, but it gives the desired output using apply functions:
myFun <- function(data, count, fun) {
applyFun <- function(x) {
length(which(
apply(data, 1, function(y) length(which(fun(y))) == x)
))
}
sapply(count, applyFun)
}
myFun(data, 0:5, is.na)
myFun(data, 0:5, function(x) x == 0)
(You made a mistake in your example: two rows have no zeroes in any column: rows 7 and 9.)
Here is a for loop option to count NAs and Zeros in each row and then use dplyr::count to summarize the frequency of each value.
library(dplyr) # provides count()
data$CountNA<-NA
for (i in 1:nrow(data)){
data[i,"CountNA"]<-length(which(is.na(data[i,1:(ncol(data)-1)])))}
count(data, CountNA)
data$CountZero<-NA
for (i in 1:nrow(data)){
data[i,"CountZero"]<-length(which((data[i,1:(ncol(data)-2)]==0)))}
count(data, CountZero)
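The two loops can also be written without explicit iteration. This is a sketch under the assumption that only columns M1 to M5 should be inspected; `m_cols` is an illustrative name.

```r
# Data from the question
id <- 1:9
M1 <- c(0, NA, 1, 0, 0, NA, NA, 1, 7)
M2 <- c(NA, NA, 0, NA, 0, NA, NA, 1, 7)
M3 <- c(1, NA, 0, 0, 0, 1, NA, 1, 7)
M4 <- c(0, NA, 0, 3, 0, NA, NA, 1, 7)
M5 <- c(5, 0, 0, NA, 0, 0, NA, 0, NA)
data <- data.frame(id, M1, M2, M3, M4, M5)

# Select the measurement columns by name before adding any new columns
m_cols <- paste0("M", 1:5)

# Per-row counts, stored as new columns
data$CountNA   <- rowSums(is.na(data[m_cols]))
data$CountZero <- rowSums(data[m_cols] == 0, na.rm = TRUE)

# Frequency of each count; factor() keeps levels that never occur
table(factor(data$CountNA, levels = 0:5))
table(factor(data$CountZero, levels = 0:5))
```

Selecting the columns by name means the result does not depend on how many helper columns have already been appended.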

Add names of dataframes as columns

I would like to combine multiple dataframes but before that I'd like to add the name of the dataframe as character string in each entry of a new column. I'm almost there but don't see the problem. Code:
df1 <- data.frame("X1"=c(1,1),"X2"=c(1,1))
df2 <- data.frame("X1"=c(2,2),"X2"=c(2,2))
df3 <- data.frame("X1"=c(3,3),"X2"=c(3,3))
addCol <- function(df){df$newCol <- deparse(substitute(df)); df}
# Extracts name of dataframe and writes it into entries of newCol
alldfsList <- lapply(list(df1,df2,df3), function(df) x <- addCol(df))
# Should apply addCol function to all dataframes, generates a list of lists
alldfs <- do.call(rbind, alldfsList) # Converts list of lists into dataframe
The problem is that the second command doesn't write the name of the dataframe into the column entries, but the placeholder, "df". But when I apply the addCol function manually to a single dataframe, it works. Can you help? Thanks!
Output:
> alldfs
X1 X2 newCol
1 1 1 df
2 1 1 df
3 2 2 df
4 2 2 df
5 3 3 df
6 3 3 df
>
Function applied to a single df works:
> addCol(df1)
X1 X2 newCol
1 1 1 df1
2 1 1 df1
>
The easiest would be to use dplyr::bind_rows
library(dplyr)
bind_rows(lst(df1,df2,df3),.id="newCol")
# newCol X1 X2
# 1 df1 1 1
# 2 df1 1 1
# 3 df2 2 2
# 4 df2 2 2
# 5 df3 3 3
# 6 df3 3 3
Moody_Mudskipper's answer is a better solution; this is just so you understand what's happening with your code.
From the substitute help page:
substitute returns the parse tree for the (unevaluated) expression expr, substituting any variables bound in env
When you run addCol inside a function in lapply, substitute gets the name from that environment. Look what happens when you change the syntax in lapply:
> lapply(list(df1,df2,df3), function(x) x <- addCol(x))
[[1]]
X1 X2 newCol
1 1 1 x
2 1 1 x
[[2]]
X1 X2 newCol
1 2 2 x
2 2 2 x
[[3]]
X1 X2 newCol
1 3 3 x
2 3 3 x
What you need is a different method to get the object name, or to change the code so that the function takes the name as input. Here's an example:
addCol <- function(df.name) {
dataf <- get(df.name)
dataf$newCol <- df.name
return(dataf)
}
> # anchor the pattern so objects such as alldfs or alldfsList are not matched
> do.call(rbind, lapply(ls(pattern='^df[0-9]+$'), addCol))
X1 X2 newCol
1 1 1 df1
2 1 1 df1
3 2 2 df2
4 2 2 df2
5 3 3 df3
6 3 3 df3
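Another option that avoids get() and ls() is to carry the names in a named list. This is a base R sketch; `dfs` and `tagged` are illustrative names.

```r
df1 <- data.frame(X1 = c(1, 1), X2 = c(1, 1))
df2 <- data.frame(X1 = c(2, 2), X2 = c(2, 2))
df3 <- data.frame(X1 = c(3, 3), X2 = c(3, 3))

# A named list keeps each data frame together with its name
dfs <- list(df1 = df1, df2 = df2, df3 = df3)

# Map() walks over the data frames and their names in parallel
tagged <- Map(function(d, nm) {
  d$newCol <- nm
  d
}, dfs, names(dfs))

alldfs <- do.call(rbind, tagged)
rownames(alldfs) <- NULL  # drop the "df1.1", "df1.2", ... row names
alldfs
```

Since the name is passed in as an ordinary string, nothing depends on how substitute() sees the argument inside lapply.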

R grouping with Select the rows with Limited

The sample data frame
grp = c(1,1,1, 1,1,2,2,2,2,2, 2,2)
val = c(2,1,5,NA,3,NA,1)
dta = data.frame(grp=grp, val=val)
The results should look like this:
# The max number of count is 3
grp count
1 3
1 2
2 3
2 3
2 1
Here's a way with base R. We first get the length of each run of grp with rle, then use a custom function that repeats 3 as many times as it divides the run length and appends the remainder. Finally, we combine the results into a new data frame:
grp = c(1,1,1, 1,1,2,2,2,2,2,2,2)
fun3 <- function(x) c(rep(3, floor(x/3)), x %% 3)
len <- rle(grp)$lengths
ans <- lapply(len, fun3)
cbind.data.frame(grp=rep(unique(grp), lengths(ans)), count=unlist(ans))
# grp count
# 1 1 3
# 2 1 2
# 3 2 3
# 4 2 3
# 5 2 1
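One edge case: when a run length is an exact multiple of 3, fun3 appends a remainder of 0 and so produces an extra row with count 0. The sketch below drops that zero and takes the group labels from the rle result. `chunk_counts` and its `cap` argument are hypothetical names introduced for illustration.

```r
grp <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)

# Hypothetical variant of fun3 with a configurable maximum count
chunk_counts <- function(x, cap = 3) {
  out <- c(rep(cap, x %/% cap), x %% cap)
  out[out > 0]  # drop the zero remainder when x divides evenly
}

runs <- rle(grp)
ans  <- lapply(runs$lengths, chunk_counts)

# runs$values gives one label per run, in order of appearance
res <- data.frame(grp = rep(runs$values, lengths(ans)), count = unlist(ans))
res

chunk_counts(6)  # two full chunks, no trailing zero
```

Note that rle only groups consecutive equal values, so this assumes the data are sorted by grp.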

How to reshape a data frame from wide to long format in R?

I am new to R. I am trying to read data from Excel in the following format:
x1 x2 x3 y1 y2 y3 Result
1 2 3 7 8 9
4 5 6 10 11 12
and the data.frame in R should hold the data in the following format for the 1st row:
x y
1 7
2 8
3 9
Then I want to use lm() and export the result to the Result column.
I want to automate this for n rows, i.e. once the results for the 1st row are exported to Excel, I want to import the data for the second row.
Please help.
library(gdata)
# this spreadsheet is exactly as in your question
df.original <- read.xls("test.xlsx", sheet="Sheet1", perl="C:/strawberry/perl/bin/perl.exe")
#
#
> df.original
x1 x2 x3 y1 y2 y3
1 1 2 3 7 8 9
2 4 5 6 10 11 12
#
# for the above code you'll just need to change the argument 'perl' to the
# path of your Perl executable
#
# now the example for the first row
#
library(reshape2)
df <- melt(df.original[1,])
df$variable <- substr(df$variable, 1, 1)
df <- as.data.frame(lapply(split(df, df$variable), `[[`, 2))
> df
x y
1 1 7
2 2 8
3 3 9
At this stage we have automated the import/transformation process (for one row).
First question: how do you want the data to look once every row has been processed?
Second question: what exactly do you want to put in Result? Residuals, fitted values? What do you need from lm()?
EDIT:
OK @kapil, tell me if the final shape of df is what you had in mind:
library(reshape2)
library(plyr)
df <- adply(df.original, 1, melt, .expand=F)
names(df)[1] <- "rowID"
df$variable <- substr(df$variable, 1, 1)
rows <- df$rowID[ df$variable=="x"] # with y would be the same (they are expected to have the same length)
df <- as.data.frame(lapply(split(df, df$variable), `[[`, c("value")))
df$rowID <- rows
df <- df[c("rowID", "x", "y")]
> df
rowID x y
1 1 1 7
2 1 2 8
3 1 3 9
4 2 4 10
5 2 5 11
6 2 6 12
Regarding the coefficients, you can calculate them for each rowID (which refers to the actual row in the xls file) in this way:
# fit on z, the subset for the current rowID, rather than on the full df
model <- dlply(df, .(rowID), function(z) lm(y ~ x, z))
> sapply(model, `[`, "coefficients")
$`1.coefficients`
(Intercept) x
6 1
$`2.coefficients`
(Intercept) x
6 1
So, for each group (or row in the original spreadsheet) you have, as expected, two coefficients: intercept and slope. I therefore can't see how you want the coefficients to fit inside the data.frame, especially in the 'long' layout shown just above. But if you want the data.frame to stay in 'wide' layout, you can try this:
# obtained the object model, you can put the coeff in the df.original data.frame
#
> ldply(model, `[[`, "coefficients")
rowID (Intercept) x
1 1 6 1
2 2 6 1
df.modified <- cbind(df.original, ldply(model, `[[`, "coefficients"))
> df.modified
x1 x2 x3 y1 y2 y3 rowID (Intercept) x
1 1 2 3 7 8 9 1 6 1
2 4 5 6 10 11 12 2 6 1
# of course, if you don't like it, you can remove rowID with df.modified$rowID <- NULL
Hope this helps, and let me know if you wanted the 'long' version of df.
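For readers without plyr or reshape2, the per-row fit can be sketched in base R. This assumes the spreadsheet has already been read into a data frame shaped like df.original above (recreated here by hand), and that x columns start with "x" and y columns with "y". `fit_row`, `coefs` and `df.result` are illustrative names.

```r
# Stand-in for the imported spreadsheet
df.original <- data.frame(x1 = c(1, 4), x2 = c(2, 5), x3 = c(3, 6),
                          y1 = c(7, 10), y2 = c(8, 11), y3 = c(9, 12))

x_cols <- grep("^x", names(df.original))
y_cols <- grep("^y", names(df.original))

# Fit y ~ x on the values of a single spreadsheet row
fit_row <- function(i) {
  d <- data.frame(x = unlist(df.original[i, x_cols], use.names = FALSE),
                  y = unlist(df.original[i, y_cols], use.names = FALSE))
  coef(lm(y ~ x, data = d))
}

# One row of coefficients per spreadsheet row
coefs <- t(sapply(seq_len(nrow(df.original)), fit_row))

df.result <- cbind(df.original,
                   intercept = coefs[, 1],
                   slope     = coefs[, 2])
df.result
```

The resulting data frame stays in wide layout, so it can be written back to a file with one row per original spreadsheet row.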

Resources