How to find which columns contain a value in a dataframe? - r

I have a large CSV data frame ("mydata") and need to find whether a value ("10295") is in the data frame, and in which columns. Here is my code:
any(mydata==10295)
which(apply(mydata, 2, function(x) any(grepl("10295", x))))
This gives TRUE for the first call, and "1, 2, 5, 39" as the columns containing the searched value. However, if I run
any(mydata$col1==10295) #col1 is the index name of column1
I get FALSE.
I am sorry I cannot upload the data, but it is a very large dataset. Does anyone see where the mistake could be?

To find the columns that contain the value 10295, you can use colSums:
cols <- which(colSums(mydata == 10295, na.rm = TRUE) > 0)
cols will contain the numbers of all columns that have at least one value of 10295 in them.
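As for why the grepl() approach flags columns where == returns FALSE: grepl("10295", x) matches 10295 as a substring, so values such as 110295 or 102950 also count as hits, while == requires exact equality. A small illustration with a made-up vector:
x <- c("10295", "110295", "102950")
grepl("10295", x)  # TRUE TRUE TRUE   -- substring matches
x == "10295"       # TRUE FALSE FALSE -- exact comparison only
That would also explain why any(mydata$col1 == 10295) is FALSE even though grepl() flagged column 1.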


Trouble with NA's in large dataframe

I'm having trouble trying to standardize my data.
So, first things first, I create the data frame object with my data and my desired row names (and I remove the first column, as it is not needed):
EXPGli <-read.delim("C:/Users/i5/Dropbox/Guilherme Vergara/Doutorado/Data/Datasets/MergedEXP3.txt", row.names=2)
EXPGli <- EXPGli[,-1]
EXPGli <- as.data.frame(EXPGli)
Then I need to convert all the columns to Z-scores (each column = gene expression values; each row = sample). The idea is to convert every gene expression value to its Z-score for each cell:
Z_score <- function(x) {(x-mean(x))/ sd(x)}
apply(EXPGli, 2, Z_score)
This returns [ reached 'max' / getOption("max.print") -- omitted 1143 rows ]
and now my whole data frame consists of NA cells.
Indeed, there are several NAs in the dataset, some full rows and even some columns.
I tried several approaches to remove the NAs:
EXPGli <- na.omit(EXPGli)
EXPGli %>% drop_na()
print(EXPGli[rowSums(is.na(EXPGli)) == 0, ])
na.exclude(EXPGli)
Yet apparently none of these work. Additionally, is.na(EXPGli) returns FALSE for all fields.
I would like to understand what I am doing wrong here. It seems the issue might be that the NAs are not being recognized by R as NA, but I couldn't find a solution for this. Any input is very appreciated; thanks in advance!
You may want to set the argument na.rm = TRUE in your calls to mean(x) and sd(x) inside the Z_score function; otherwise these calls return NA for any vector that contains NAs.
Z_score <- function(x) {(x-mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)}
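For example, on a small made-up data frame with NAs (without na.rm = TRUE both columns would come back entirely NA):
toy <- data.frame(geneA = c(1, 2, NA, 4), geneB = c(10, NA, 30, 40))
Z_score <- function(x) {(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)}
apply(toy, 2, Z_score)  # NA cells stay NA; every other cell becomes a Z-score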

How can I loop through multiple columns in multiple dataframes in R?

I couldn't find what I was looking for anywhere else, so I hope I'm not asking something that is already solved. Sorry if I am.
I want to loop through each column individually for multiple dataframes and apply a function to check the data quality.
I want to find:
number of missing values
percentage of missing values
number of empty rows
percentage of empty rows
number of distinct values
percent of distinct values
number of duplicates
percentage of duplicates
one example of a value in a row that is not empty "" and not missing
(and any other information you suggest could tell me something about the data quality)
I then want to save the information in a dataframe that I can easily download, looking something like this:
table_name | column_name | # missing values | % missing values | # empty rows | etc...
Can this be done?
I have named my different dataframes "a", "b" and "c" (there are actually 80; I'm simplifying here), and I store them in a list called "table_list". These dataframes vary in their number of variables/columns.
I have made this function:
analyze <- function(i) {
data <- table_list[i]
# Find number of missing values
number_missing_values <- sum(is.na(data))
# Find percentage of missing values
percentage_missing_values <- sum(is.na(data)) / nrow(data)
# Find number of empty rows
number_missing_values <- sum(data == "", na.rm = TRUE)
# Find percentage of empty rows
percentage_empty_rows <- sum(data == "", na.rm = TRUE) / nrow(data)
# Find number of distinct values
number_distinct_values <- count(data %>% distinct())
# Find percent of distinct values
percentage_distinct_values <- count(data %>% distinct())/nrow(data)
This function lacks (not sure how to do it):
number of duplicates
percentage of duplicates
one example of a value in a row that is not empty "" and not missing
I was planning to apply this function in this for-loop:
for (i in table_list) {
analyze(i)
}
I'm also not sure how to turn the result into a dataframe like I illustrated with the different column names above.
What am I getting wrong here, and what should I do differently?
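A minimal sketch of one way to do this, assuming table_list is a named list (e.g. table_list <- list(a = a, b = b, c = c)). Note that table_list[[i]], with double brackets, is needed to pull a data frame out of the list; table_list[i] returns a one-element list. The helper computes the metrics for a single column and returns a one-row data frame, and lapply() plus rbind() stacks the rows for every column of every table:
analyze_column <- function(x, table_name, column_name) {
  n <- length(x)
  n_missing <- sum(is.na(x))
  n_empty <- sum(x == "", na.rm = TRUE)
  n_distinct <- length(unique(x))
  n_dupes <- n - n_distinct
  # first value that is neither NA nor the empty string ""
  example <- x[!is.na(x) & x != ""][1]
  data.frame(table_name = table_name,
             column_name = column_name,
             n_missing = n_missing,
             pct_missing = 100 * n_missing / n,
             n_empty = n_empty,
             pct_empty = 100 * n_empty / n,
             n_distinct = n_distinct,
             pct_distinct = 100 * n_distinct / n,
             n_duplicates = n_dupes,
             pct_duplicates = 100 * n_dupes / n,
             example_value = as.character(example))
}
results <- do.call(rbind, lapply(names(table_list), function(tbl) {
  df <- table_list[[tbl]]
  do.call(rbind, lapply(names(df), function(col)
    analyze_column(df[[col]], tbl, col)))
}))
results can then be written out with write.csv() for easy download.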

How do you drop all rows from a dataframe where the sum of a range of columns is 0?

I have a dataframe with the columns
experimentResultDataColumns - faceGenderClk - 35 more columns ending with Clk - rougeClk - someMoreExperimentDataColumns
I am trying to drop all rows from the dataframe where the sum of the 50 columns from faceGenderClk to (and including) rougeClk is 0.
There is data of an online study in the dataframe and the "Clk" columns count how many times the participant clicked a specific slider. If no sliders were clicked, the data is invalid. (It's basically like someone handing you your survey without setting their pen on the paper)
I was able to perform similar logic with a statement like this:
df<-df[!(df$screenWidth < 1280),]
to cut out all insufficiently sized screens, but I am unsure of how to perform this sum operation within that statement. I tried
df <- df[!(sum(df$faceGenderClk:df$rougeClk) > 0)]
but that doesn't work. (I'm not very good at R, I assume it definitely shouldn't work with that syntax)
The expected result is a dataframe which has all rows stripped from it, where the sum of all 50 values in that row from faceGenderClk to rougeClk is 0
EDIT:
data: https://pastebin.com/SLAmkHk5
the expected result of the code would drop the second row of data
code so far:
df <- read.csv("./trials.csv")
SECONDS_IN_AN_HOUR <- 60*60
MILLISECONDS_IN_AN_HOUR <- SECONDS_IN_AN_HOUR * 1000
library(dplyr)
#levels(df$latinSquare) <- c("AlexaF", "SiriF", "CortanaF", "SiriM", "GoogleF", "RobotM") ignore this since I faked the dataset to protect participants' personal data
df<-df[!(df$timeMainSessionTime > 6 * MILLISECONDS_IN_AN_HOUR),]
df<-df[!(df$screenWidth < 1280),]
The accepted answer (as of this edit) solves the problem with:
cols = grep(pattern = "Clk$", names(df), value=TRUE)
sums = rowSums(df[cols])
df <- df[sums != 0, ]
First, get the names of the columns you want to check. Then add up the columns and do your subset.
# columns that end in Clk
cols = grep(pattern = "Clk$", names(df), value = TRUE)
# add them up
sums = rowSums(df[cols])
# subset
df[sums != 0, ]
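If you prefer to stay in dplyr (which the script above already loads), the same filter can be written in one step; this assumes dplyr 1.0 or later for across():
library(dplyr)
df <- df %>% filter(rowSums(across(ends_with("Clk"))) != 0)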

"for" loop not working

I am trying to isolate some values from a data frame
example:
test_df0<- data.frame('col1'= c('string1', 'string2', 'string1'),
'col2' = c('value1', 'value2', 'value3'),
'col3' = c('string3', 'string4', 'string3'))
I want to obtain a new dataframe with only the unique strings from col1, and the corresponding strings from col3 (which will be identical for rows with identical col1).
This is the loop I wrote, but I must be making some obvious mistake:
test_df1<- as.data.frame(matrix(ncol= 2, nrow=0))
colnames(test_df1)<- c('col1', 'col3')
for (i in unique(test_df0$col1)){
first_matching_row<- match(x = i, table = test_df0$col1)
temp_df<-
data.frame('col1'= i,
'col3'= test_df0[first_matching_row, 'col3'])
rbind(test_df1, temp_df)}
The resulting test_df1, though, is empty. I cannot spot the mistake in the loop and would be grateful for any suggestion.
Edit: the for loop itself is working; if its last line is print(temp_df) instead of the rbind command, I get the correct results. I am not sure why the rbind is not working.
An easier and faster way to do this is with the duplicated() function. duplicated() looks through an input vector and returns TRUE for each value that has already been seen at an earlier index in the vector. For example:
> duplicated(c(0,0,0,1,2,3,0,3))
[1] FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
For the first 0 it hadn't seen that value before, but for the next two it had. For 1, 2, and the first 3 it hadn't seen those numbers before, but it had seen the final 0 and 3 previously. This means that !duplicated() returns TRUE for the unique values of the data.
We can use this to index into the data frame to get the rows of test_df0 with unique values of col1 as follows:
test_df0[!duplicated(test_df0[["col1"]]), ]
But this returns all columns of the data frame. If we just want col1 and col3 we can index into the columns as well using:
test_df0[!duplicated(test_df0[["col1"]]), c("col1", "col3")]
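If you use dplyr, an equivalent one-liner (keep the first row for each unique col1, then select the two columns) would be:
library(dplyr)
test_df0 %>% distinct(col1, .keep_all = TRUE) %>% select(col1, col3)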
As for why the loop isn't working: as #Jacob mentions, you aren't assigning the value you create with rbind to anything, so it disappears after the call.
You aren't actually assigning the rbind to anything! Presumably you need something like:
test_df1 <- rbind(test_df1, temp_df)
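With that one fix applied, the full loop becomes:
test_df1 <- as.data.frame(matrix(ncol = 2, nrow = 0))
colnames(test_df1) <- c('col1', 'col3')
for (i in unique(test_df0$col1)) {
  first_matching_row <- match(x = i, table = test_df0$col1)
  temp_df <- data.frame('col1' = i,
                        'col3' = test_df0[first_matching_row, 'col3'])
  test_df1 <- rbind(test_df1, temp_df)  # assign the result back
}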

Remove Duplicates, but Keep the Most Complete Iteration

I'm trying to figure out how to remove duplicates based on three variables (id, key, and num). I would like to remove the duplicate with the fewest columns filled. If an equal number are filled, either can be removed.
For example,
Original <- data.frame(id= c(1,2,2,3,3,4,5,5),
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7),
v5=c(1,NA,5,5,NA,5,NA,7))
The output would be the following:
Finished <- data.frame(id= c(1,2,3,4,5),
key=c(1,2,3,4,5),
num=c(1,1,1,1,1),
v4= c(1,5,5,5,7),
v5=c(1,5,5,5,7))
My real dataset is bigger, a mix of mostly numeric variables with some character variables, and I couldn't determine the best way to go about this. I've previously used a program whose duplicates command had an option called check.all that did something similar.
So far, my thoughts have been to use grepl and determine where "anything" is present
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
Then, using the resulting data frame, I compute rowSums and cbind it to the original:
CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
This is the point where I'm unsure of my next steps. I have a variable that tells me how many columns are filled in each row (CompleteNess); however, I'm unsure how to handle the duplicates.
Simply put: when id, key, and num are duplicated, keep the row with the highest value of CompleteNess.
If anybody can think of a better way to do this or get me through the last little bit I would greatly appreciate it. Thanks All!
Here is a solution. It is not very pretty but it should work for your application:
#Order by the degree of completeness
Original<-Original[order(CompleteNess),]
#Starting from the bottom select the not duplicated rows
#based on the first 3 columns
Original[!duplicated(Original[,1:3], fromLast = TRUE),]
This does rearrange your original data frame so beware if there is additional processing later on.
You can aggregate your data and select the row with max score:
Original <- data.frame(id= c(1,2,2,3,3,4,5,5),
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7),
v5=c(1,NA,5,5,NA,5,NA,7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
#get the score
Original$present <- rowSums(Present)
#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")
library("plyr")
#aggregate here
Final <- ddply(Original,.(id.key.num),summarize,
Max = max(present))
And if you want to keep the other columns, just do this:
Final <- ddply(Original,.(id.key.num),summarize,
Max = max(present),
v4 = v4[which.max(present)],
v5 = v5[which.max(present)]
)
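For reference, a dplyr equivalent (a sketch assuming dplyr 1.0 or later for slice_max(); present is the completeness score computed above):
library(dplyr)
Final <- Original %>%
  group_by(id, key, num) %>%
  slice_max(present, n = 1, with_ties = FALSE) %>%
  ungroup()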
