Get row by certain value in R - r

So I have this data table, and I'd like to sort it out by profession (column 'Profissao').
The idea is to average the answers in each column by area of work.
For example:
I need to select every 'Aspeto-A' cell in the rows where the job is 'Media' and compute the average over all Media people who answered the form.
(screenshot of the data table)

A picture of your data is not as useful as using dput(). Since I can't use your data, I'll use the iris data set that is included with R:
data(iris)
str(iris)
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
aggregate(.~Species, iris, mean)
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1 setosa 5.006 3.428 1.462 0.246
# 2 versicolor 5.936 2.770 4.260 1.326
# 3 virginica 6.588 2.974 5.552 2.026

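It wasn't that hard after all. There is probably a better way to do it, but this one works.
# Profissao and aspetoA are assumed to be vectors holding the two survey columns
AspetoA_Selection <- data.frame(Profissao = Profissao, aspetoA = aspetoA)
# Sum and count of the answers per profession
AspetoA_sum <- aggregate(AspetoA_Selection$aspetoA, by = list(AspetoA_Selection$Profissao), FUN = sum)
AspetoA_length <- aggregate(AspetoA_Selection$aspetoA, by = list(AspetoA_Selection$Profissao), FUN = length)
# Average per profession = sum / count (both results share the same group order)
AspetoA_AVG <- AspetoA_sum$x / AspetoA_length$x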
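The sum/length computation above can probably be collapsed into a single aggregate() call with FUN = mean, as in the iris answer. This is only a minimal sketch: the survey data frame and its values are made up for illustration, and only the column names Profissao and aspetoA come from the question.

```r
# Hypothetical survey data; the values are invented for illustration
survey <- data.frame(
  Profissao = c("Media", "Media", "Saude", "Saude", "Saude"),
  aspetoA   = c(4, 2, 5, 3, 4)
)

# One row per profession, holding the mean of aspetoA
aspetoA_avg <- aggregate(aspetoA ~ Profissao, data = survey, FUN = mean)
print(aspetoA_avg)
```

With the formula interface, rows containing NA are dropped by default (na.action = na.omit), so unanswered cells should not turn a group mean into NA.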

Related

Looping Over Categories to Create Individual Regressions

In addendum to my previous question, I want to then create a for loop, each iteration of which would create a regression for each unique code I have created. More specifically, I want to create a regression with only the data that correspond to each unique code. How do I do this?
I have tried Googling "for loops in R" and have failed to find an answer that suits my need to iterate over categories rather than variables.
Here is an example predicting Sepal.Length from Sepal.Width, Petal.Length, and Petal.Width for each of three species using the iris data:
data(iris)
str(iris)
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Now split and lapply:
iris.split <- split(iris[, -5], iris$Species)
iris.lm <- lapply(iris.split, \(x) lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, x))
lapply(iris.lm, summary)
The last line prints summary reports for the regressions for each species.
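If printed summaries are more than you need, the fitted models can also be reduced to a compact table. The sketch below repeats the split/lapply step so that it runs on its own; coef.table and r.squared are names chosen here for illustration.

```r
# Same split/lapply approach as above, repeated so the block is self-contained
iris.split <- split(iris[, -5], iris$Species)
iris.lm <- lapply(iris.split, function(x) {
  lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = x)
})

# One row of coefficients per species
coef.table <- t(sapply(iris.lm, coef))
print(coef.table)

# R-squared of each per-species regression
r.squared <- sapply(iris.lm, function(m) summary(m)$r.squared)
print(r.squared)
```

The same pattern should carry over to any grouping column: split on the unique codes, fit inside lapply, then pull out whatever pieces of each model you need.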

Group by all columns in a data.table

I'm working with iris data.table in R.
As a reminder of how it looks, here are the first six rows:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.1 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 4.7 3.2 1.3 0.2 setosa
4: 4.6 3.1 1.5 0.2 setosa
5: 5.0 3.6 1.4 0.2 setosa
6: 5.4 3.9 1.7 0.4 setosa
I would like to calculate the number of rows, grouped by all columns. Of course we may write all variables in by, like this:
iris[, .(Freq = .N), by = .(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Freq
1: 5.1 3.5 1.4 0.2 setosa 1
2: 4.9 3.0 1.4 0.2 setosa 1
3: 4.7 3.2 1.3 0.2 setosa 1
4: 4.6 3.1 1.5 0.2 setosa 1
5: 5.0 3.6 1.4 0.2 setosa 1
6: 5.4 3.9 1.7 0.4 setosa 1
However, I wonder if there is a way to group by all variables without typing out all the column names?
In case you are looking for duplicates, uniqueN will default to using all columns:
uniqueN(as.data.table(iris))
# [1] 149
This doesn't answer your question directly, but it might be a more direct way of accomplishing what you were trying to do in the first place.
Similarly, if you're looking for which rows are duplicated, you can use duplicated's data.table method which similarly defaults to using all columns:
iris[duplicated(iris)]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 5.8 2.7 5.1 1.9 virginica
We can use
library(data.table)
out1 <- as.data.table(iris)[, .N, by = names(iris)]
# checking against the OP's approach
out2 <- as.data.table(iris)[, .N, by = .(Sepal.Length,
Sepal.Width, Petal.Length, Petal.Width, Species)]
identical(out1, out2)
#[1] TRUE
Here is an approach in base R:
Freq <- table(apply(iris,1,paste0, collapse=" "))
iris$Freq <- apply(iris,1, function(x) Freq[names(Freq) %in% paste0(x,collapse=" ")])
output:
> iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Freq
... ... ... ... ... ... ...
140 6.9 3.1 5.4 2.1 virginica 1
141 6.7 3.1 5.6 2.4 virginica 1
142 6.9 3.1 5.1 2.3 virginica 1
143 5.8 2.7 5.1 1.9 virginica 2
144 6.8 3.2 5.9 2.3 virginica 1
145 6.7 3.3 5.7 2.5 virginica 1
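A shorter base R variant, offered only as a sketch: aggregate() accepts a formula whose right-hand side is a dot, meaning "all other columns". Adding a constant helper column and summing it per group therefore counts rows grouped by every original column. The names Freq and iris.counts are chosen here for illustration, and the block assumes an unmodified iris.

```r
data(iris)  # start from the unmodified data set

# Add a constant column of 1s, then sum it within each unique combination
# of all the other columns
iris.counts <- aggregate(Freq ~ ., data = cbind(iris, Freq = 1), FUN = sum)

# Combinations that occur more than once
print(iris.counts[iris.counts$Freq > 1, ])
```

Unlike the apply-based version, this returns one row per unique combination rather than annotating every original row.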

Display column names that are only factors

Is there a way to extract only the names of the columns that are factors? For example, in the iris dataset the last column is a factor, so only Species (the column name, not the entire column) should be extracted.
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> str(head(iris))
'data.frame': 6 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1
We can use:
names(iris)[sapply(iris, is.factor)]
#[1] "Species"
Or using Filter:
names(Filter(is.factor, iris))
Another solution, which uses the dplyr package (if you happen to be using it already in your project), is
library(dplyr)
names(iris %>% select_if(is.factor))
or equivalently (choose whichever you prefer)
iris %>% select_if(is.factor) %>% names()
Output
# [1] "Species"
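A slightly more defensive base R variant, as a sketch: vapply() states the expected result type up front, so the selection fails loudly instead of silently changing shape (for example on a data frame with zero columns, where sapply() returns an empty list). The names factor.cols and categorical.cols are illustrative.

```r
# vapply guarantees a logical vector with one element per column
factor.cols <- names(iris)[vapply(iris, is.factor, logical(1))]
print(factor.cols)

# The same pattern with a custom predicate, here factor OR character columns
categorical.cols <- names(iris)[
  vapply(iris, function(col) is.factor(col) || is.character(col), logical(1))
]
print(categorical.cols)
```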

Replace missing values of categorical variables and multiple variables in R

I am facing the following issues:
I want to replace all NA's of a certain categorical variable with "Unknown", however it does not work.
Here's the code:
x <- "Unknown"
kd$form_of_address[which(is.na(kd$form_of_address))] <- x
The problem arises when I perform
levels(kd$form_of_address)
Sadly, my output does not include "Unknown".
My data includes ebooks, whose weight is always 0. Which code would replace the NAs in the variable weight with 0 for the rows where ebook_count > 0?
Thank you in advance :)
I assume your variable is a factor, which does not let you assign a value to a cell unless that value is already one of its levels.
Using iris, let's see what may have happened and how it can be solved.
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We can see Species variable is a factor.
We can put some NAs into it by:
iris[c(11:13),5] = NA
iris[c(11:15), ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
11 5.4 3.7 1.5 0.2 <NA>
12 4.8 3.4 1.6 0.2 <NA>
13 4.8 3.0 1.4 0.1 <NA>
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
Now, if I try to fill those NAs with "Unknown" using your code:
x = "Unknown"
iris$Species[which(is.na(iris$Species))] = x
which will generate:
Warning message: In [<-.factor(*tmp*, which(is.na(iris$Species)),
value = c(1L, : invalid factor level, NA generated
What you can do is first add the new level to your variable; after that, the assignment works:
levels(iris$Species) = c(levels(iris$Species), "Unknown")
levels(iris$Species)
[1] "setosa" "versicolor" "virginica" "Unknown"
#You can see now Unknown is one of the levels
iris$Species[which(is.na(iris$Species))] = "Unknown"
table(iris$Species)
setosa versicolor virginica Unknown
47 50 50 3
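The second part of the question (setting missing weight to 0 for ebooks) is not covered above. A minimal sketch, assuming kd has numeric columns weight and ebook_count as described in the question; the example values are invented:

```r
# Hypothetical data; only the column names come from the question
kd <- data.frame(
  ebook_count = c(1, 0, 2, 0),
  weight      = c(NA, 350, NA, NA)
)

# Rows where weight is missing and the order contains at least one ebook.
# The !is.na() guard keeps rows with a missing ebook_count out of the selection.
fill <- is.na(kd$weight) & !is.na(kd$ebook_count) & kd$ebook_count > 0
kd$weight[fill] <- 0

print(kd)
```

Rows with a missing weight but no ebooks stay NA, since nothing in the question says what their weight should be.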

Is there an easy way to separate categorical vs continuous variables into two datasets in R

Say I have about 500 variables available, and I'm trying to do variable selection for my model (the response is binary).
I am planning on doing some kind of correlation analysis for all the continuous variables first, then the categorical ones after.
Since there are a lot of variables involved, I can't do it manually.
Is there a function that I can use? Or maybe a package?
I'm using the iris data set available in R. Then
sapply(iris, is.factor)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
FALSE FALSE FALSE FALSE TRUE
will tell you whether your columns are factors or not. So using
iris[ ,sapply(iris, is.factor)]
you can pick the factor columns only. And
iris[ ,!sapply(iris, is.factor)]
will give you the columns that are not factors. You can also use is.numeric, is.character and other similar predicates.
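One caveat with the subsetting above: when only a single column matches, `[` drops the result to a plain vector. A sketch that always keeps two data frames, using drop = FALSE (the object names are illustrative):

```r
# Logical flag per column: TRUE for factors
is.fac <- vapply(iris, is.factor, logical(1))

# drop = FALSE keeps the result a data frame even if only one column matches
iris.categorical <- iris[, is.fac, drop = FALSE]
iris.continuous  <- iris[, !is.fac, drop = FALSE]

str(iris.categorical)
str(iris.continuous)
```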
You can use str(df) to see which columns are factors and which are not (df is your dataframe). For example, for data iris in R:
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Or, you can use lapply(iris,class)
$Sepal.Length
[1] "numeric"
$Sepal.Width
[1] "numeric"
$Petal.Length
[1] "numeric"
$Petal.Width
[1] "numeric"
$Species
[1] "factor"
Create a function that returns TRUE when the number of unique values is less than some fraction of the total number of values; I'm picking 5%:
discreteL <- function(x) length(unique(x)) < 0.05*length(x)
Now sapply it (with negation for continuous variables) to the data.frame:
> str( iris[ , !sapply(iris, discreteL)] )
'data.frame': 150 obs. of 4 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
You could have picked a particular number, say 15, as your criterion I suppose.
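The fixed-count criterion could look something like the sketch below. discreteN and its max_unique argument are hypothetical names introduced here, with 15 as the default only because that is the number mentioned above.

```r
# Hypothetical helper: a column counts as discrete when it has fewer than
# max_unique distinct values
discreteN <- function(x, max_unique = 15) length(unique(x)) < max_unique

is.discrete <- sapply(iris, discreteN)

# drop = FALSE keeps single-column results as data frames
iris.discrete   <- iris[, is.discrete, drop = FALSE]
iris.continuous <- iris[, !is.discrete, drop = FALSE]

names(iris.discrete)
names(iris.continuous)
```

An absolute threshold does not scale with the number of rows, so on a small data set a genuinely continuous column could end up classified as discrete.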
I should make clear that the statistical theory suggests this procedure to be dangerous for the purpose outlined. Just picking the variables that are most correlated with a binary response is not well-supported. There have been many studies that show better approaches to variable selection. So my answer is really only how to do the separation, but not an endorsement of the overall plan that you have vaguely described.

Resources