How can I loop through multiple columns in multiple dataframes in R?

I couldn't find what I was looking for anywhere else, so I hope I'm not asking something that is already solved. Sorry if I am.
I want to loop through each column individually for multiple dataframes and apply a function to check the data quality.
I want to find:
number of missing values
percentage of missing values
number of empty rows
percentage of empty rows
number of distinct values
percent of distinct values
number of duplicates
percentage of duplicates
one example of a value in a row that is not empty "" and not missing
(and any other information you suggest could tell me something about the data quality)
I then want to save the information in a dataframe that I can easily download, looking something like this:
table_name | column_name | # missing values | % missing values | # empty rows | etc...
Can this be done?
I have named my different dataframes "a", "b" and "c" (there are 80, but I am simplifying here), and I store these in a list called "table_list". The dataframes vary in their number of variables/columns.
I have made this function:
analyze <- function(i) {
data <- table_list[i]
# Find number of missing values
number_missing_values <- sum(is.na(data))
# Find percentage of missing values
percentage_missing_values <- sum(is.na(data)) / nrow(data)
# Find number of empty rows
number_missing_values <- sum(data == "", na.rm = TRUE)
# Find percentage of empty rows
percentage_empty_rows <- sum(data == "", na.rm = TRUE) / nrow(data)
# Find number of distinct values
number_distinct_values <- count(data %>% distinct())
# Find percent of distinct values
percentage_distinct_values <- count(data %>% distinct())/nrow(data)
This function lacks (not sure how to do it):
number of duplicates
percentage of duplicates
one example of a value in a row that is not empty "" and not missing
I was planning to apply this function in this for-loop:
for (i in table_list) {
analyze(i)
}
I'm also not sure how to turn the result into a dataframe like the one I illustrated above with the different column names.
What am I getting wrong here, and what should I do differently?
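Two things in the code above look like likely culprits: `table_list[i]` returns a one-element list rather than a data frame (`table_list[[i]]` extracts the data frame itself), and `for (i in table_list)` iterates over the data frames, not over indices. The function also never returns anything. Below is a minimal base-R sketch of one way to build the report, not a definitive implementation. The names `profile_column`, `profile_table` and `quality_report` are hypothetical, the two small data frames are stand-ins for the real 80, and it assumes "duplicates" means values that repeat an earlier value in the same column and that "distinct" counts non-missing values.

```r
# Hypothetical stand-ins for the real data frames in table_list.
table_list <- list(
  a = data.frame(x = c("u", "", NA, "u"), y = c(1, 2, 2, NA), stringsAsFactors = FALSE),
  b = data.frame(z = c("p", "q", "q"), stringsAsFactors = FALSE)
)

# Summarise a single column (a vector) as a one-row data frame.
profile_column <- function(col, table_name, column_name) {
  n <- length(col)
  is_missing <- is.na(col)
  is_empty <- !is_missing & as.character(col) == ""
  filled <- col[!is_missing & !is_empty]          # neither NA nor ""
  n_distinct <- length(unique(col[!is_missing]))  # distinct non-missing values
  n_duplicates <- sum(duplicated(col))            # repeats of an earlier value
  data.frame(
    table_name = table_name,
    column_name = column_name,
    n_missing = sum(is_missing),
    pct_missing = 100 * sum(is_missing) / n,
    n_empty = sum(is_empty),
    pct_empty = 100 * sum(is_empty) / n,
    n_distinct = n_distinct,
    pct_distinct = 100 * n_distinct / n,
    n_duplicates = n_duplicates,
    pct_duplicates = 100 * n_duplicates / n,
    example_value = if (length(filled) > 0) as.character(filled[1]) else NA_character_,
    stringsAsFactors = FALSE
  )
}

# Profile every column of one table, looked up by name with [[ ]].
profile_table <- function(table_name) {
  df <- table_list[[table_name]]
  rows <- lapply(names(df), function(column_name) {
    profile_column(df[[column_name]], table_name, column_name)
  })
  do.call(rbind, rows)
}

# One row per (table, column) pair across the whole list.
quality_report <- do.call(rbind, lapply(names(table_list), profile_table))
rownames(quality_report) <- NULL

# To save the report: write.csv(quality_report, "quality_report.csv", row.names = FALSE)
```

Iterating over `names(table_list)` rather than over the list itself keeps the table name available for the `table_name` column.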

Related

Divide specific values in a column by 1000

I need to divide certain values in a column by 1000 but do not know how to go about it
I attempted to use this function initially:
test <- Updins(weight,)
test$weight <- as.numeric(test$weight) / 1000
head(test)
with Updins being the dataframe and weight the column just to see if it would at least divide the entire column by 1000 but no such luck. It did not recognise 'test' as a variable.
Can anyone provide any guidance? I'm very new to R :)
If Updins is the name of the dataset object, we can select columns with [ rather than (, because ( is used to invoke a function.
test <- Updins['weight']
test$weight <- as.numeric(test$weight) / 1000
Here is a fake data set in which all rows are divided by 1000. I also included a for-loop as one potential way to do this for only certain rows. Since you didn't specify how you were choosing them, I did it for any rows with a value greater than 1,005, and I did a second version that divides by 1,000 only if the ID is an odd number. If you have NAs, you may need an additional if statement to deal with them. I provide an example of that in the third and last for-loop.
ID<-1:10
grams<-1000:1009
df<-data.frame(ID,grams)
df$kg<-as.numeric(df$grams)/1000
df[,"kg"]<-as.numeric(df[,"grams"])/1000 #will do the same thing as the line above
for(i in 1:nrow(df)){
if(df[i,"grams"]>1005){df[i,"kg3"]<-as.numeric(df[i,"grams"])/1000}
}#if the weight is greater than 1,005 grams.
for(i in 1:nrow(df)){
if(df[i,"ID"] %in% seq(1,101, by = 2)){df[i,"kg4"]<-as.numeric(df[i,"grams"])/1000}
}#if the id is an odd number
df[3,"grams"]<-NA#add an NA to the weight data to test the next loop
for(i in 1:nrow(df)){
if(is.na(df[i,"grams"]) & (df[i,"ID"] %in% seq(1,101, by = 2))){df[i,"kg4"]<-NA}
else if(df[i,"ID"] %in% seq(1,101, by = 2)){df[i,"kg4"]<-as.numeric(df[i,"grams"])/1000}
}#Same as above, but works with NAs
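The same conditional division can also be written without explicit loops. The following is a sketch under the same fake data as above, not a replacement the original answer proposed: `which()` drops rows where the condition is NA, and `ifelse()` propagates NA from `grams` on its own. The helper names `heavy` and `odd` are hypothetical.

```r
ID <- 1:10
grams <- 1000:1009
df <- data.frame(ID, grams)
df[3, "grams"] <- NA  # same NA test case as in the loop version

# Rows heavier than 1,005 grams; which() skips rows where grams is NA
heavy <- which(df$grams > 1005)
df$kg3 <- NA_real_
df$kg3[heavy] <- df$grams[heavy] / 1000

# Rows with an odd ID; an NA in grams simply stays NA after division
odd <- df$ID %% 2 == 1
df$kg4 <- ifelse(odd, df$grams / 1000, NA_real_)
```

Rows that do not meet a condition are left as NA, which matches what the loops produce when they create the new column.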
Hard without data to work with or expected output, but here's a skeleton that you could probably use:
library(dplyr) # Needed for mutate() and the pipe (%>%, which passes an object on to the next call)
test <- Updins %>% # Using the dataset Updins
  mutate(weight = ifelse(as.numeric(weight) > 199, # Changing the weight variable: where weight exceeds the threshold (199 here)...
                         as.character(as.numeric(weight) / 1000), # ...divide a numeric version of weight by 1000, but keep it as a character...
                         weight)) # ...otherwise, keep the weight variable as is
head(test)
I kept the new value as a character, because it seems that your weight variable is a character variable based on some of the warnings ('NAs introduced by coercion') that you're getting.

How do you drop all rows from a dataframe where the sum of a range of columns is 0?

I have a dataframe with the columns
experimentResultDataColumns - faceGenderClk - 35 more columns ending with Clk - rougeClk - someMoreExperimentDataColumns
I am trying to drop all rows from the dataframe where the sum of the 50 columns from faceGenderClk to (and including) rougeClk is 0
There is data of an online study in the dataframe and the "Clk" columns count how many times the participant clicked a specific slider. If no sliders were clicked, the data is invalid. (It's basically like someone handing you your survey without setting their pen on the paper)
I was able to perform similar logic with a statement like this:
df<-df[!(df$screenWidth < 1280),]
to cut out all insufficiently sized screens, but I am unsure of how to perform this sum operation within that statement. I tried
df <- df[!(sum(df$faceGenderClk:df$rougeClk) > 0)]
but that doesn't work. (I'm not very good at R, I assume it definitely shouldn't work with that syntax)
The expected result is a dataframe which has all rows stripped from it, where the sum of all 50 values in that row from faceGenderClk to rougeClk is 0
EDIT:
data: https://pastebin.com/SLAmkHk5
the expected result of the code would drop the second row of data
code so far:
df <- read.csv("./trials.csv")
SECONDS_IN_AN_HOUR <- 60*60
MILLISECONDS_IN_AN_HOUR <- SECONDS_IN_AN_HOUR * 1000
library(dplyr)
#levels(df$latinSquare) <- c("AlexaF", "SiriF", "CortanaF", "SiriM", "GoogleF", "RobotM") ignore this since I faked the dataset to protect participants' personal data
df<-df[!(df$timeMainSessionTime > 6 * MILLISECONDS_IN_AN_HOUR),]
df<-df[!(df$screenWidth < 1280),]
The accepted answer (as of this edit) solves the problem with:
cols = grep(pattern = "Clk$", names(df), value=TRUE)
sums = rowSums(df[cols])
df <- df[sums != 0, ]
First, get the names of the column you want to check. Then add up the columns and do your subset.
# columns that end in Clk
cols = grep(pattern = "Clk$", names(df), value = TRUE)
# add them up
sums = rowSums(df[cols])
# subset
df[sums != 0, ]
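The pattern `Clk$` picks up every column ending in Clk, wherever it sits. If only the contiguous block from faceGenderClk through rougeClk should count, one option is to look up the two column positions and sum across that range. This is a sketch on a made-up miniature data frame; `voiceClk` and the values are hypothetical, and `first`, `last`, `click_sums` and `df_valid` are names introduced here for illustration.

```r
# Hypothetical miniature of the study data; the real data has many more Clk columns.
df <- data.frame(
  screenWidth = c(1280, 1920, 1440),
  faceGenderClk = c(1, 0, 0),
  voiceClk = c(0, 0, 2),
  rougeClk = c(3, 0, 0),
  timeMainSessionTime = c(10, 20, 30)
)

# Positions of the first and last column of the block
first <- match("faceGenderClk", names(df))
last <- match("rougeClk", names(df))

# Row-wise sum over the block, then keep rows with at least one click
click_sums <- rowSums(df[first:last], na.rm = TRUE)
df_valid <- df[click_sums != 0, ]
```

Here the second row has no clicks at all and is dropped, while the other two rows are kept.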

Compare multiple columns in 2 different dataframes in R

I am trying to compare multiple columns in two different dataframes in R. This has been addressed previously on the forum (Compare group of two columns and return index matches R) but this is a different scenario: I am trying to check whether a column in dataframe 1 falls within the range of 2 columns in dataframe 2. Functions like match, merge, join, intersect won't work here. I have been trying to use purrr::pluck but didn't get far. The dataframes are of different sizes.
Below is an example:
temp1.df <- mtcars
temp2.df <- data.frame(
Cyl = sample (4:8, 100, replace = TRUE),
Start = sample (1:22, 100, replace = TRUE),
End = sample (1:22, 100, replace = TRUE)
)
temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df$Cyl <- as.character(temp2.df$Cyl)
My attempt:
temp1.df <- temp1.df %>% mutate (new_mpg = case_when (
temp1.df$cyl %in% temp2.df$Cyl & temp2.df$Start <= temp1.df$mpg & temp2.df$End >= temp1.df$mpg ~ 1
))
Error:
Error in mutate_impl(.data, dots) :
Column `new_mpg` must be length 32 (the number of rows) or one, not 100
Expected Result:
Compare temp1.df$cyl and temp2.df$Cyl. If they match, then -->
Check if temp1.df$mpg is between temp2.df$Start and temp2.df$End -->
if it is, then create a new variable new_mpg with value of 1.
It's hard to show the exact expected output here.
I realize I could loop over each row of temp1.df, but the original temp2.df has over 250,000 rows. An efficient solution would be much appreciated.
Thanks
temp1.df$new_mpg<-apply(temp1.df, 1, function(x) {
temp<-temp2.df[temp2.df$Cyl==x[2],]
ifelse(any(apply(temp, 1, function(y) {
dplyr::between(as.numeric(x[1]),as.numeric(y[2]),as.numeric(y[3]))
})),1,0)
})
Note that this makes some assumptions about the organization of your actual data. In particular, I can't call on the column names within apply, so I'm using indexes, which may very well change. You might want to rearrange your data between receiving it and calling apply, or select the columns in the apply call itself, e.g., apply(temp1.df[,c("mpg","cyl")], ...).
At any rate, this breaks your data set into rows, and each row is compared to the subset of the second dataset with the same Cyl value. Within this subset, it checks whether the mpg for this row falls between (using between from dplyr) Start and End for any entry, and returns 1 if yes (or 0 if no). All these ones and zeros are then returned as a (named) vector, which can be placed into temp1.df$new_mpg.
I'm guessing there's a way to do this with rowwise, but I could never get it to work properly...
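An alternative that avoids the nested apply calls is to join the two data frames on the cylinder column first and then filter on the range. The sketch below uses base R `merge` on the example data from the question; `row_id`, `pairs` and `hits` are hypothetical helper names, and `set.seed` is only there to make the example reproducible. The join produces every matching pair of rows, so with a 250,000-row temp2.df the intermediate table may get large; a non-equi join, such as the one data.table offers, may be a better fit at that scale.

```r
set.seed(1)  # assumption: seed added only for reproducibility
temp1.df <- mtcars
temp2.df <- data.frame(
  Cyl = sample(4:8, 100, replace = TRUE),
  Start = sample(1:22, 100, replace = TRUE),
  End = sample(1:22, 100, replace = TRUE)
)
temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df$Cyl <- as.character(temp2.df$Cyl)

# Tag each row of temp1.df so matches can be traced back after the join
temp1.df$row_id <- seq_len(nrow(temp1.df))

# Every (temp1 row, temp2 row) pair that shares a cylinder value
pairs <- merge(temp1.df[c("row_id", "cyl", "mpg")], temp2.df,
               by.x = "cyl", by.y = "Cyl")

# Rows of temp1.df whose mpg falls inside at least one [Start, End] range
hits <- pairs$row_id[pairs$mpg >= pairs$Start & pairs$mpg <= pairs$End]

temp1.df$new_mpg <- as.integer(temp1.df$row_id %in% hits)
temp1.df$row_id <- NULL
```

Rows with no matching cylinder value, or with no range containing their mpg, end up with new_mpg equal to 0.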

Creating a multidimensional list to replace subsetting - is it worth it?

Basic idea:
As the title says: is it a good idea to replace subsetting a data frame with a multidimensional list?
I have a function that needs to generate a subset from a fairly big data frame close to 30 thousand times. Creating a 4-dimensional list would give me instant access to the subset, without losing time generating it.
However, I don't know how R treats these objects, so I would like your opinion on it.
More concrete example if needed:
What I am trying to do is use the KNN imputation method. Basically, the algorithm says that a value flagged as an outlier has to be replaced using its K closest neighbors (K is a number: 1, 2, 3, ...). The neighbors in this example are the rows with the same attributes in the first 4 columns, and the closest neighbors are the ones with the smallest difference in the fifth column. If what I said is not clear, please still consider reading the code, because I found it hard to describe in words.
These are the objects
# create vectors of random values
values <- floor(runif(5e7, 0, 50))
possible.outliers <- floor(runif(5e7, 0, 10000))
# shuffle these values to build a data frame
df <- data.frame(sample(values), sample(values), sample(values),
sample(values), sample(values), sample(possible.outliers))
# all values greater than 800 will be marked as outliers
df$isOutlier = df[,6] > 800
This is the function which will be used to replace the outliers
# with the generated data frame, run this function
# Parameters:
# *df: the entire data frame from above
# *vector.row: the row marked as containing an outlier. The outlier will be replaced with the return value of this function
# *numberK: the number of neighbors to take into account.
# !Very important: for the last column, the larger the
# difference between the values, the less attractive
# they are for imputation.
foo <- function(df, vector.row, numberK){
#find the neighbors
subset = df[ vector.row[1] == df[,1] & vector.row[2] == df[,2] &
vector.row[3] == df[,3] & vector.row[4] == df[,4] , ]
#take the "distance" between the rows, so it can find which are the
# closest neighbors
subset$distance = subset[,5] - vector.row[5]
#placeholder: no need to implement this part for the question
"function that find the closest neighbors from the distance on subset"
return (mean(ClosestNeighbors))
}
So the function's runtime is quite long. For this reason, I am looking for alternatives, and I thought that maybe I could replace the subsetting with something like this:
list[[" Levels COl1 "]][[" Levels COl2 "]]
[[" Levels COl3 "]][[" Levels COl4 "]]
This should give instant access to the subset, instead of generating it on every call inside the function.
Is it a reasonable idea? I'm a noob in R.
If you did not understand what is written, or would like something explained in more detail or in other words, please tell me, because I know it is not the most direct question.
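One way to try the idea without building a nested 4-level list is to split the data frame once by a combined key of the first four columns and then look groups up by that key. The sketch below is only an illustration under assumed, much smaller data: the column names `c1` to `c6`, the key format, and the helper `lookup_neighbors` are all hypothetical. It avoids rescanning the whole data frame on each call, but whether it is faster overall depends on the data and would need to be measured.

```r
set.seed(42)  # assumption: seed added only for reproducibility
n <- 1000     # far smaller than the 5e7 rows in the question
df <- data.frame(
  c1 = sample(0:2, n, replace = TRUE),
  c2 = sample(0:2, n, replace = TRUE),
  c3 = sample(0:2, n, replace = TRUE),
  c4 = sample(0:2, n, replace = TRUE),
  c5 = sample(0:49, n, replace = TRUE),
  c6 = sample(0:9999, n, replace = TRUE)
)

# Build one key per row from the first four columns, then split once
group_key <- paste(df$c1, df$c2, df$c3, df$c4, sep = "_")
groups <- split(df, group_key)

# Fetch the precomputed subset that shares the first four values of a row
lookup_neighbors <- function(vector.row) {
  key <- paste(vector.row[1:4], collapse = "_")
  groups[[key]]
}

neighbors <- lookup_neighbors(unlist(df[1, ]))
```

A single flat key keeps the structure simple; a nested list indexed level by level would hold the same subsets but is more awkward to build and to query.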

Getting a subset in R

I have a dataframe with 14 columns, and I want to subset it into a dataframe with the same columns, keeping only the rows whose ID repeats (for example, I have an ID variable, and if ID = 2 is repeated, I want to keep those rows).
To begin, I applied a table to my dataframe to see the frequencies of ID
head(sort(table(call.dat$IMSI), decreasing = TRUE), 100)
In my case, 20801170106338 repeats two times, so I want to see the two observations for this ID.
Afterward, I did x <- subset(call.dat, IMSI == "20801170106338") and hsb6 <- call.dat[call.dat$IMSI == "20801170106338", ], but the result is wrong (for x, it returns 0 observations of 14 variables, and for hsb6 I have only NA in my dataframe).
Can you help me? Thanks.
PS: IMSI is a numeric value.
And x <- subset(call.dat, Handset.Manufacturer == "LG") is another example which works perfectly...
You can use duplicated, a function that returns a logical vector that is TRUE where a record duplicates an earlier one.
isDuplicated <- duplicated(call.dat$IMSI)
Then, you can extract all the rows containing a duplicated value.
call.dat.duplicated <- call.dat[isDuplicated, ]
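Because duplicated only flags the second and later occurrences, the first row of each repeated IMSI is left out. To see every observation of a repeated ID, a common approach is to combine a forward and a backward pass with `fromLast = TRUE`. This is a sketch on made-up data: every value below except the IMSI from the question is hypothetical, as are the names `is_repeated` and `call.dat.repeated`.

```r
# Hypothetical miniature of call.dat; only 20801170106338 comes from the question.
call.dat <- data.frame(
  IMSI = c(20801170106338, 20801170106339, 20801170106338, 20801170106340),
  Handset.Manufacturer = c("LG", "Nokia", "LG", "Sony")
)

# TRUE for every row whose IMSI occurs more than once, first occurrence included
is_repeated <- duplicated(call.dat$IMSI) | duplicated(call.dat$IMSI, fromLast = TRUE)
call.dat.repeated <- call.dat[is_repeated, ]
```

With this data, `call.dat.repeated` holds both rows for the repeated IMSI.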
